In this blog, we walk through a comprehensive example of indexing research papers: extracting structured metadata (beyond plain full-text chunking and embedding) and building semantic embeddings for indexing and querying.

If you find this tutorial helpful, we would really appreciate it if you could ⭐ star CocoIndex on GitHub.

Use Cases
- Academic search and retrieval
- AI agents for research
- Paper recommendation systems
- Research knowledge graphs
- Semantic analysis of scientific literature

What we will achieve
This is what we want to accomplish:
- Extract the paper metadata, including file name, title, author information, abstract, and number of pages.
- Build vector embeddings for the metadata, such as the title and abstract, for semantic search. This makes metadata-driven semantic search much more targeted: for example, you can match query text specifically against titles or against abstracts.
- Build an index of authors and all the file names associated with each author, to answer questions like "Give me all the papers by Jeff Dean."

If you want to perform full-PDF embedding for each paper, you can also refer to this article.

You can find the full code here. If this article is helpful to you, please leave us a star ⭐ on GitHub to help us grow.

Core Components

PDF Preprocessing
Reads PDFs using pypdf and extracts:
- Total number of pages
- First-page content (used as a proxy for metadata-rich information)

Markdown Conversion
Converts the first page to Markdown using Marker.

LLM-Powered Metadata Extraction
Sends the first-page Markdown to GPT-4o using CocoIndex's ExtractByLlm function. Extracted metadata includes title (string), authors (with name, email, and affiliation), abstract (string), and more.

Semantic Embedding
The title is embedded directly using the all-MiniLM-L6-v2 SentenceTransformer model. Abstracts are chunked based on semantic punctuation and token count, and each chunk is embedded individually.

Relational Data Collection
Authors are unrolled and collected into an author_papers relation, enabling queries like:
- Show all papers by X
- Which co-authors worked with Y?

Prerequisites
- Install PostgreSQL. CocoIndex uses PostgreSQL internally for incremental processing.
- Configure your OpenAI API key. Alternatively, CocoIndex has native support for Gemini, Ollama, and LiteLLM; check out the guide. You can choose your preferred LLM provider and run fully on-premises.

Define the Indexing Flow
The flow processes the papers and their metadata end to end. It is straightforward to implement this design with CocoIndex in about 100 lines of indexing logic (code).

To help you follow along, here is the flow at a glance:
1. Import a list of papers in PDF.
2. For each file:
   - Extract the first page of the paper.
   - Convert the first page to Markdown.
   - Extract metadata (title, authors, abstract) from the first page.
   - Split the abstract into chunks, and compute embeddings for each chunk.
3. Export to the following tables in Postgres with PGVector:
   - Metadata (title, authors, abstract) for each paper.
   - Author-to-paper mapping, for author-based queries.
   - Embeddings for titles and abstract chunks, for semantic search.

Let's zoom in on the steps.
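Before diving into each step, here is a rough sketch of the imports that the snippets below assume. The Marker module paths are assumptions based on recent marker-pdf releases and may need adjusting for your installed version:

```python
import dataclasses
import datetime
import io
import tempfile
from functools import cache

import cocoindex
from pypdf import PdfReader, PdfWriter

# Marker imports: module paths assumed from recent marker-pdf versions;
# adjust them if your installed version lays the package out differently.
from marker.config.parser import ConfigParser
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered
```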
Ingest the papers

```python
@cocoindex.flow_def(name="PaperMetadata")
def paper_metadata_flow(
    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
) -> None:
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="papers", binary=True),
        refresh_interval=datetime.timedelta(seconds=10),
    )
```

flow_builder.add_source creates a table with sub-fields (filename, content); refer to the documentation for more details.

Extract and collect metadata

Extract the first page for basic info
Define a custom function to extract the first page and the number of pages of each PDF.

```python
@dataclasses.dataclass
class PaperBasicInfo:
    num_pages: int
    first_page: bytes


@cocoindex.op.function()
def extract_basic_info(content: bytes) -> PaperBasicInfo:
    """Extract the first page and the page count of a PDF."""
    reader = PdfReader(io.BytesIO(content))
    output = io.BytesIO()
    writer = PdfWriter()
    writer.add_page(reader.pages[0])
    writer.write(output)
    return PaperBasicInfo(num_pages=len(reader.pages), first_page=output.getvalue())
```

Now plug this into your flow. We extract metadata from the first page only, to minimize processing cost, since the full PDF can be very large.

```python
with data_scope["documents"].row() as doc:
    doc["basic_info"] = doc["content"].transform(extract_basic_info)
```

After this step, you should have the basic info of each paper.

Convert the first page to Markdown
We convert the first page to Markdown using Marker. Alternatively, you can easily plug in your favorite PDF parser, such as Docling.

Define a Marker converter function and cache it, since its initialization is resource-intensive. This ensures the same converter instance is reused across different files.

```python
@cache
def get_marker_converter() -> PdfConverter:
    config_parser = ConfigParser({})
    return PdfConverter(
        create_model_dict(), config=config_parser.generate_config_dict()
    )
```

Wrap it in a custom function.

```python
@cocoindex.op.function(gpu=True, cache=True, behavior_version=1)
def pdf_to_markdown(content: bytes) -> str:
    """Convert a PDF (here, the first page) to Markdown."""
    with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as temp_file:
        temp_file.write(content)
        temp_file.flush()
        text, _, _ = text_from_rendered(get_marker_converter()(temp_file.name))
        return text
```

Plug it into your flow.

```python
with data_scope["documents"].row() as doc:
    doc["first_page_md"] = doc["basic_info"]["first_page"].transform(
        pdf_to_markdown
    )
```

After this step, you should have the first page of each paper in Markdown format.

Extract basic info with an LLM
CocoIndex provides built-in support for LLM-based structured extraction, including complex and nested schemas. If you want to learn more about nested schemas, check out this article.

```python
@dataclasses.dataclass
class PaperMetadata:
    """
    Metadata for a paper.
    """

    title: str
    authors: list[Author]
    abstract: str
```

Plug it into ExtractByLlm. With a dataclass defined, CocoIndex automatically parses the LLM output into the dataclass.

```python
doc["metadata"] = doc["first_page_md"].transform(
    cocoindex.functions.ExtractByLlm(
        llm_spec=cocoindex.LlmSpec(
            api_type=cocoindex.LlmApiType.OPENAI, model="gpt-4o"
        ),
        output_type=PaperMetadata,
        instruction="Please extract the metadata from the first page of the paper.",
    )
)
```

After this step, you should have the metadata of each paper.
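The PaperMetadata dataclass above references an Author type that is not shown in the snippets. Based on the earlier description (name, email, and affiliation), a minimal sketch could look like the following; the exact definition in the example repo may differ, and in real code it would need to be declared before PaperMetadata:

```python
@dataclasses.dataclass
class Author:
    """Author info extracted from the first page.

    Fields are assumed from the description above (name, email, affiliation);
    adjust them to match the actual example code.
    """

    name: str
    email: str | None
    affiliation: str | None
```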
Collect the paper metadata

```python
paper_metadata = data_scope.add_collector()

with data_scope["documents"].row() as doc:
    # ... process
    # Collect metadata
    paper_metadata.collect(
        filename=doc["filename"],
        title=doc["metadata"]["title"],
        authors=doc["metadata"]["authors"],
        abstract=doc["metadata"]["abstract"],
        num_pages=doc["basic_info"]["num_pages"],
    )
```

Just collect everything you need :)

Collect the author-to-paper mapping
Here we want to collect the Author → Papers mapping into a separate table to support author-based lookups. Simply collect by author.

```python
author_papers = data_scope.add_collector()

with data_scope["documents"].row() as doc:
    with doc["metadata"]["authors"].row() as author:
        author_papers.collect(
            author_name=author["name"],
            filename=doc["filename"],
        )
```

Compute and collect embeddings

Title

```python
doc["title_embedding"] = doc["metadata"]["title"].transform(
    cocoindex.functions.SentenceTransformerEmbed(
        model="sentence-transformers/all-MiniLM-L6-v2"
    )
)
```

Abstract
Split the abstract into chunks, embed each chunk, and collect the embeddings. Sometimes the abstract can be very long.

```python
doc["abstract_chunks"] = doc["metadata"]["abstract"].transform(
    cocoindex.functions.SplitRecursively(
        custom_languages=[
            cocoindex.functions.CustomLanguageSpec(
                language_name="abstract",
                separators_regex=[r"[.?!]+\s+", r"[:;]\s+", r",\s+", r"\s+"],
            )
        ]
    ),
    language="abstract",
    chunk_size=500,
    min_chunk_size=200,
    chunk_overlap=150,
)
```

After this step, you should have the abstract chunks of each paper.

Iterate over the chunks and compute an embedding for each one.

```python
with doc["abstract_chunks"].row() as chunk:
    chunk["embedding"] = chunk["text"].transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"
        )
    )
```

After this step, you should have the embeddings of the abstract chunks of each paper.

Collect the embeddings

```python
metadata_embeddings = data_scope.add_collector()

with data_scope["documents"].row() as doc:
    # ... process
    # Collect title embedding
    metadata_embeddings.collect(
        id=cocoindex.GeneratedField.UUID,
        filename=doc["filename"],
        location="title",
        text=doc["metadata"]["title"],
        embedding=doc["title_embedding"],
    )
    with doc["abstract_chunks"].row() as chunk:
        # ... process
        # Collect abstract chunk embeddings
        metadata_embeddings.collect(
            id=cocoindex.GeneratedField.UUID,
            filename=doc["filename"],
            location="abstract",
            text=chunk["text"],
            embedding=chunk["embedding"],
        )
```
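Putting the pieces together, the per-document loop looks roughly like the condensed sketch below, with the three collectors created beforehand as shown above. This is assembled from the earlier snippets for readability and is not a verbatim copy of the example repo, so minor details may differ:

```python
with data_scope["documents"].row() as doc:
    # Per-document processing, in the same order as the sections above.
    doc["basic_info"] = doc["content"].transform(extract_basic_info)
    doc["first_page_md"] = doc["basic_info"]["first_page"].transform(pdf_to_markdown)
    doc["metadata"] = doc["first_page_md"].transform(
        cocoindex.functions.ExtractByLlm(
            llm_spec=cocoindex.LlmSpec(
                api_type=cocoindex.LlmApiType.OPENAI, model="gpt-4o"
            ),
            output_type=PaperMetadata,
            instruction="Please extract the metadata from the first page of the paper.",
        )
    )

    # Embeddings for the title and the abstract chunks.
    doc["title_embedding"] = doc["metadata"]["title"].transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"
        )
    )
    doc["abstract_chunks"] = doc["metadata"]["abstract"].transform(
        cocoindex.functions.SplitRecursively(
            custom_languages=[
                cocoindex.functions.CustomLanguageSpec(
                    language_name="abstract",
                    separators_regex=[r"[.?!]+\s+", r"[:;]\s+", r",\s+", r"\s+"],
                )
            ]
        ),
        language="abstract",
        chunk_size=500,
        min_chunk_size=200,
        chunk_overlap=150,
    )
    with doc["abstract_chunks"].row() as chunk:
        chunk["embedding"] = chunk["text"].transform(
            cocoindex.functions.SentenceTransformerEmbed(
                model="sentence-transformers/all-MiniLM-L6-v2"
            )
        )
        metadata_embeddings.collect(
            id=cocoindex.GeneratedField.UUID,
            filename=doc["filename"],
            location="abstract",
            text=chunk["text"],
            embedding=chunk["embedding"],
        )

    # Rows for the three exported tables.
    paper_metadata.collect(
        filename=doc["filename"],
        title=doc["metadata"]["title"],
        authors=doc["metadata"]["authors"],
        abstract=doc["metadata"]["abstract"],
        num_pages=doc["basic_info"]["num_pages"],
    )
    with doc["metadata"]["authors"].row() as author:
        author_papers.collect(author_name=author["name"], filename=doc["filename"])
    metadata_embeddings.collect(
        id=cocoindex.GeneratedField.UUID,
        filename=doc["filename"],
        location="title",
        text=doc["metadata"]["title"],
        embedding=doc["title_embedding"],
    )
```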
Export
Finally, we export the data to Postgres.

```python
paper_metadata.export(
    "paper_metadata",
    cocoindex.targets.Postgres(),
    primary_key_fields=["filename"],
)
author_papers.export(
    "author_papers",
    cocoindex.targets.Postgres(),
    primary_key_fields=["author_name", "filename"],
)
metadata_embeddings.export(
    "metadata_embeddings",
    cocoindex.targets.Postgres(),
    primary_key_fields=["id"],
    vector_indexes=[
        cocoindex.VectorIndexDef(
            field_name="embedding",
            metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
        )
    ],
)
```

In this example we use PGVector as the embedding store. With CocoIndex, switching to another supported vector database, such as Qdrant, is a one-line change; see the documentation for more details. We aim to standardize interfaces so that building data flows feels like assembling LEGO blocks.

Step-by-step walkthrough in CocoInsight
You can walk through the project step by step in CocoInsight to see exactly how each field is constructed and what happens behind the scenes.

Query the index
You can refer to the Text Embedding example for how to build queries against embeddings. At the moment, CocoIndex does not provide an additional query interface; we can write SQL or rely on the query engine of the target storage.

Many storage engines already provide well-optimized query implementations with their own best practices, and the query space has excellent solutions for querying, reranking, and other search-related functionality. If you need help writing queries, please feel free to reach out to us. A minimal query sketch is included at the end of this post.

Support us
We are constantly improving, and more features and examples are coming soon. If this article is helpful, please leave us a star ⭐ on GitHub to help us grow.

Please let us know what you think!
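As referenced in the query section above, here is a minimal, hypothetical sketch of querying the exported tables. It embeds the query text with the same all-MiniLM-L6-v2 model used at indexing time and runs a cosine-distance search with plain SQL over PGVector. The table names, connection string, and helper names are assumptions (CocoIndex derives the actual table names from the flow and target names), so adjust them to your setup:

```python
# Hypothetical query-side sketch; not part of the flow definition above.
import psycopg
from sentence_transformers import SentenceTransformer

# Reuse the same embedding model that was used at indexing time.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Assumed connection string and table names; check what CocoIndex actually
# created in your database and adjust accordingly.
DB_URL = "postgresql://localhost/cocoindex"


def search_papers(query: str, top_k: int = 5) -> list[tuple[str, str, float]]:
    """Return (filename, matched text, cosine distance) for the closest title/abstract chunks."""
    # Format the query vector as a pgvector literal, e.g. "[0.1,0.2,...]".
    vec = "[" + ",".join(str(x) for x in model.encode(query)) + "]"
    sql = """
        SELECT filename, text, embedding <=> %s::vector AS distance
        FROM papermetadata__metadata_embeddings
        ORDER BY distance
        LIMIT %s
    """
    with psycopg.connect(DB_URL) as conn:
        return conn.execute(sql, (vec, top_k)).fetchall()


def papers_by_author(author_name: str) -> list[str]:
    """Answer questions like 'Give me all the papers by Jeff Dean' from the relational table."""
    with psycopg.connect(DB_URL) as conn:
        rows = conn.execute(
            "SELECT filename FROM papermetadata__author_papers WHERE author_name = %s",
            (author_name,),
        ).fetchall()
    return [r[0] for r in rows]
```

Since each embedding row carries a location field, you can also filter on location = 'title' or location = 'abstract' to search titles and abstracts separately, as discussed earlier in the post.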