In this blog, we will walk through a comprehensive example of indexing research papers by extracting different kinds of metadata, beyond full-text chunking and embedding, and building semantic embeddings for indexing and querying.

If you find this tutorial helpful, we would greatly appreciate it if you could ⭐ star CocoIndex on GitHub.

Use Cases

- Academic search and retrieval, as well as research-based AI agents
- Paper recommendation systems
- Research knowledge graphs
- Semantic analysis of scientific literature

What we will achieve

Let's take a look at this paper as an example. Here is what we want to accomplish:

1. Extract the paper metadata, including file name, title, author information, abstract, and number of pages.
2. Build vector embeddings for the metadata, such as the title and abstract, for semantic search. This enables metadata-driven semantic search, for example matching text queries against titles and abstracts.
3. Build an index of authors and all the file names associated with each author to answer questions like "Give me all the papers by Jeff Dean."
4. If you want to perform full PDF embedding for the paper, you can also refer to this article.

You can find the full code here.

Core Components

1. PDF Preprocessing
   Reads PDFs using pypdf and extracts:
   - Total number of pages
   - First page content (used as a proxy for metadata-rich information)
2. Markdown Conversion
   Converts the first page to Markdown using Marker.
3. LLM-Powered Metadata Extraction
   Sends the first-page Markdown to GPT-4o using CocoIndex's ExtractByLlm function. Extracted metadata includes, among other fields:
   - title (string)
   - authors (with name, email, and affiliation)
   - abstract (string)
4. Semantic Embedding
   - The title is embedded directly using the all-MiniLM-L6-v2 model from SentenceTransformer.
   - Abstracts are chunked based on semantic punctuation and token count, then each chunk is embedded individually.
5. Relational Data Collection
   Authors are unrolled and collected into an author_papers relation, enabling queries like:
   - Show all papers by X
   - Which co-authors worked with Y?

Prerequisites

- Install PostgreSQL. CocoIndex uses PostgreSQL internally for incremental processing.
- Configure your OpenAI API key. Alternatively, CocoIndex has native support for Gemini, Ollama, and LiteLLM; check out the guide. You can choose the LLM provider that suits you best.

Define Indexing Flow

This project demonstrates a more comprehensive example of metadata understanding that is closer to real-world use cases. You will see how straightforward it is to build this design with CocoIndex.
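Before running the flow, it can help to confirm that the prerequisites are configured. The following is a minimal sketch using only the standard library. The variable names are assumptions: `OPENAI_API_KEY` is the variable conventionally read by OpenAI clients, and `COCOINDEX_DATABASE_URL` is assumed here to hold the Postgres connection string. The helper `missing_env_vars` is hypothetical and not part of CocoIndex.

```python
import os

# Assumed variable names (see the note above); adjust to your own setup.
REQUIRED_ENV_VARS = ("OPENAI_API_KEY", "COCOINDEX_DATABASE_URL")


def missing_env_vars(env=None, required=REQUIRED_ENV_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("Environment looks ready.")
```

If you use a different LLM provider, the first variable would change accordingly.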
To help you navigate what we will walk through, here is the flow at a high level:

1. Import a list of papers in PDF.
2. For each file:
   - Extract the first page of the paper.
   - Convert the first page to Markdown.
   - Extract metadata (title, authors, abstract) from the first page.
   - Split the abstract into chunks, and compute embeddings for each chunk.
3. Export to the following tables in Postgres with PGVector:
   - Metadata (title, authors, abstract) for each paper.
   - Author-to-paper mapping, for author-based queries.
   - Embeddings for titles and abstract chunks, for semantic search.

Let's zoom in on the steps.

Import the Papers

    import datetime

    import cocoindex


    @cocoindex.flow_def(name="PaperMetadata")
    def paper_metadata_flow(
        flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
    ) -> None:
        # Read PDFs as raw bytes from the local "papers" directory and
        # check the source for changes every 10 seconds.
        data_scope["documents"] = flow_builder.add_source(
            cocoindex.sources.LocalFile(path="papers", binary=True),
            refresh_interval=datetime.timedelta(seconds=10),
        )

flow_builder.add_source creates a table with the sub-fields filename and content. Refer to the documentation for more details.

Extract and collect metadata

Extract first page for basic info

Define a custom function to extract the first page and the number of pages of the PDF.

    import dataclasses
    import io

    from pypdf import PdfReader, PdfWriter


    @dataclasses.dataclass
    class PaperBasicInfo:
        num_pages: int
        first_page: bytes


    @cocoindex.op.function()
    def extract_basic_info(content: bytes) -> PaperBasicInfo:
        """Extract the first page of a PDF, plus its total page count."""
        reader = PdfReader(io.BytesIO(content))

        # Write only the first page into a new in-memory PDF.
        output = io.BytesIO()
        writer = PdfWriter()
        writer.add_page(reader.pages[0])
        writer.write(output)

        return PaperBasicInfo(num_pages=len(reader.pages), first_page=output.getvalue())

Now plug this into your flow.
We extract metadata from the first page only to minimize processing cost, since the entire PDF can be very large.

    with data_scope["documents"].row() as doc:
        doc["basic_info"] = doc["content"].transform(extract_basic_info)

After this step, you should have the basic info of each paper.

Parse basic info

We will convert the first page to Markdown using Marker. Alternatively, you can plug in your preferred PDF parser.

Define a Marker converter function and cache it, since its initialization is resource-intensive. Caching ensures that the same converter instance is reused for different input files.

    from functools import cache

    from marker.config.parser import ConfigParser
    from marker.converters.pdf import PdfConverter
    from marker.models import create_model_dict


    @cache
    def get_marker_converter() -> PdfConverter:
        # Built once on first call; later calls return the cached instance.
        config_parser = ConfigParser({})
        return PdfConverter(
            create_model_dict(), config=config_parser.generate_config_dict()
        )

Next, wrap the converter in a custom function:
    import tempfile

    from marker.output import text_from_rendered


    # cache=True reuses earlier results; bump behavior_version when the
    # function's logic changes so that cached results are recomputed.
    @cocoindex.op.function(gpu=True, cache=True, behavior_version=1)
    def pdf_to_markdown(content: bytes) -> str:
        """Convert a PDF (given as bytes) to Markdown."""
        # Marker takes a file path, so write the bytes to a temporary file first.
        with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as temp_file:
            temp_file.write(content)
            temp_file.flush()
            text, _, _ = text_from_rendered(get_marker_converter()(temp_file.name))
            return text

Pass it to your transform:

    with data_scope["documents"].row() as doc:
        doc["first_page_md"] = doc["basic_info"]["first_page"].transform(
            pdf_to_markdown
        )

After this step, you should have the first page of each paper in Markdown format.

Extract basic info with LLM

CocoIndex natively supports LLM-structured extraction with complex and nested schemas. If you are interested in learning more about nested schemas, you can refer to this article.

    @dataclasses.dataclass
    class PaperMetadata:
        """
        Metadata for a paper.
        """

        title: str
        # Author is a separate dataclass (name, email, affiliation), not shown here.
        authors: list[Author]
        abstract: str

Plug it into the ExtractByLlm function. With a dataclass defined, CocoIndex automatically parses the LLM response into the dataclass.

    doc["metadata"] = doc["first_page_md"].transform(
        cocoindex.functions.ExtractByLlm(
            llm_spec=cocoindex.LlmSpec(
                api_type=cocoindex.LlmApiType.OPENAI, model="gpt-4o"
            ),
            output_type=PaperMetadata,
            instruction="Please extract the metadata from the first page of the paper.",
        )
    )

After this step, you should have the metadata of each paper.

Collect paper metadata

    paper_metadata = data_scope.add_collector()

    with data_scope["documents"].row() as doc:
        # ... process
        # Collect metadata
        paper_metadata.collect(
            filename=doc["filename"],
            title=doc["metadata"]["title"],
            authors=doc["metadata"]["authors"],
            abstract=doc["metadata"]["abstract"],
            num_pages=doc["basic_info"]["num_pages"],
        )

Just collect anything you need :)

Collect author to filename information

We have already extracted the author list. Here we want to collect the Author → Papers mapping in a separate table to build lookup functionality.
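To make the target shape concrete, here is a plain-Python sketch, independent of CocoIndex, of what unrolling authors into (author_name, filename) rows means and how such a relation answers the two example questions from Core Components. The `Author` field layout is an assumption based on the "name, email, and affiliation" description; the helper names and sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Author:
    # Assumed layout, based on the fields described in the post.
    name: str
    email: Optional[str] = None
    affiliation: Optional[str] = None


def unroll_authors(papers):
    """Flatten {filename: [Author, ...]} into sorted (author_name, filename) rows."""
    rows = {
        (author.name, filename)
        for filename, authors in papers.items()
        for author in authors
    }
    return sorted(rows)


def papers_by(rows, author_name):
    """Show all papers by a given author."""
    return [filename for name, filename in rows if name == author_name]


def coauthors_of(rows, author_name):
    """List everyone who shares at least one paper with the given author."""
    files = set(papers_by(rows, author_name))
    return sorted(
        {name for name, filename in rows if filename in files and name != author_name}
    )


# Made-up sample data for illustration only.
papers = {
    "paper_a.pdf": [Author("Ada Example"), Author("Bo Sample")],
    "paper_b.pdf": [Author("Ada Example")],
}
rows = unroll_authors(papers)
print(papers_by(rows, "Ada Example"))     # ['paper_a.pdf', 'paper_b.pdf']
print(coauthors_of(rows, "Ada Example"))  # ['Bo Sample']
```

Using (author_name, filename) as the row identity mirrors the composite primary key used when the relation is exported.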
In the flow, simply collect by author:

    author_papers = data_scope.add_collector()

    with data_scope["documents"].row() as doc:
        # One collected row per (author, file) pair.
        with doc["metadata"]["authors"].row() as author:
            author_papers.collect(
                author_name=author["name"],
                filename=doc["filename"],
            )

Compute and collect embeddings

Title

    doc["title_embedding"] = doc["metadata"]["title"].transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"
        )
    )

Abstract

Split the abstract into chunks, embed each chunk, and collect the embeddings. Sometimes the abstract can be very long.

    doc["abstract_chunks"] = doc["metadata"]["abstract"].transform(
        cocoindex.functions.SplitRecursively(
            custom_languages=[
                cocoindex.functions.CustomLanguageSpec(
                    language_name="abstract",
                    # Separators from coarsest to finest: sentence ends,
                    # colons/semicolons, commas, then any whitespace.
                    separators_regex=[r"[.?!]+\s+", r"[:;]\s+", r",\s+", r"\s+"],
                )
            ]
        ),
        language="abstract",
        chunk_size=500,
        min_chunk_size=200,
        chunk_overlap=150,
    )

After this step, you should have the abstract chunks of each paper.

Embed each chunk:

    with doc["abstract_chunks"].row() as chunk:
        chunk["embedding"] = chunk["text"].transform(
            cocoindex.functions.SentenceTransformerEmbed(
                model="sentence-transformers/all-MiniLM-L6-v2"
            )
        )

After this step, you should have the embeddings of the abstract chunks of each paper.

Collect embeddings

    metadata_embeddings = data_scope.add_collector()

    with data_scope["documents"].row() as doc:
        # ... process
        # collect title embedding
        metadata_embeddings.collect(
            id=cocoindex.GeneratedField.UUID,
            filename=doc["filename"],
            location="title",
            text=doc["metadata"]["title"],
            embedding=doc["title_embedding"],
        )
        with doc["abstract_chunks"].row() as chunk:
            # ... process
            # collect abstract chunks embeddings
            metadata_embeddings.collect(
                id=cocoindex.GeneratedField.UUID,
                filename=doc["filename"],
                location="abstract",
                text=chunk["text"],
                embedding=chunk["embedding"],
            )

Export

Finally, we export the data to Postgres:
    paper_metadata.export(
        "paper_metadata",
        cocoindex.targets.Postgres(),
        primary_key_fields=["filename"],
    )
    author_papers.export(
        "author_papers",
        cocoindex.targets.Postgres(),
        primary_key_fields=["author_name", "filename"],
    )
    metadata_embeddings.export(
        "metadata_embeddings",
        cocoindex.targets.Postgres(),
        primary_key_fields=["id"],
        # Build a vector index on the embedding column for similarity search.
        vector_indexes=[
            cocoindex.VectorIndexDef(
                field_name="embedding",
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
            )
        ],
    )

In this example we use PGVector as the embedding store. With CocoIndex, you can switch to another supported vector database by changing the export target; see the documentation for more details.

We aim to standardize interfaces and make building pipelines feel like building with LEGO.

View in CocoInsight step by step

You can walk through the project step by step in CocoInsight to see how each field is constructed and what happens behind the scenes.

Query the index

You can refer to this section of the Text Embeddings example for how to build queries against embeddings. For now, CocoIndex does not provide an additional query interface. We can write SQL or rely on the query engine of the target storage. Many databases already have optimized query implementations with their own best practices.
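As a rough illustration of writing SQL directly, the sketch below builds parameterized queries for the two lookups this index is meant to serve: semantic search over title and abstract embeddings, and papers by author. It only constructs the SQL and parameters; running them needs a Postgres driver and a query embedding produced with the same model used for indexing, neither of which is shown. The table names passed in are hypothetical, since the actual names depend on how the exported tables are named in your database, and the helper functions are illustrative rather than part of CocoIndex. The `<=>` operator is pgvector's cosine-distance operator.

```python
import re


def _check_identifier(name):
    """Allow only simple SQL identifiers, since table names are interpolated."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        raise ValueError(f"unsafe table name: {name!r}")
    return name


def to_pgvector_literal(embedding):
    """Format a sequence of numbers as a pgvector text literal such as '[0.1,2.0]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def semantic_search_query(table, query_embedding, top_k=5):
    """Build (sql, params) ranking rows by cosine similarity to the query vector."""
    table = _check_identifier(table)
    # Cosine similarity = 1 - cosine distance.
    sql = (
        "SELECT filename, location, text, "
        "1 - (embedding <=> %s::vector) AS score "
        f"FROM {table} ORDER BY embedding <=> %s::vector LIMIT %s"
    )
    vec = to_pgvector_literal(query_embedding)
    return sql, (vec, vec, top_k)


def papers_by_author_query(table, author_name):
    """Build (sql, params) listing the files associated with one author."""
    table = _check_identifier(table)
    sql = f"SELECT filename FROM {table} WHERE author_name = %s ORDER BY filename"
    return sql, (author_name,)


# Hypothetical table names; substitute the names used in your database.
sql, params = semantic_search_query("metadata_embeddings", [0.1, 2], top_k=3)
author_sql, author_params = papers_by_author_query("author_papers", "Jeff Dean")
```

Each returned pair can be passed to a driver's `execute(sql, params)` call, which keeps user-supplied values out of the SQL string itself.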
The query space already has excellent solutions for querying, reranking, and other search-related functionality. If you need help writing the query, please feel free to reach out to us.

Support us

We are constantly improving, and more features and examples are coming soon. If this article is helpful for you, please give us a star ⭐ on GitHub to help us grow.

Thank you!