In this blog we will walk through a comprehensive example of indexing research papers: extracting different metadata, beyond full-text chunking and embedding, and building semantic embeddings for indexing and querying.

We would greatly appreciate it if you could ⭐ star CocoIndex on GitHub if you find this tutorial helpful.

Use cases

- Academic search and retrieval, as well as research-based AI agents
- Paper recommendation systems
- Research knowledge graphs
- Semantic analysis of scientific literature

What we will achieve

Let's take a look at this PDF as an example.

Here is what we want to accomplish:

- Extract the paper metadata, including file name, title, author information, abstract, and number of pages.
- Build vector embeddings for the metadata, such as the title and abstract, for semantic search. This enables better metadata-driven semantic search results. For example, you can match text queries against titles and abstracts.
- Build an index of authors and all the file names associated with each author to answer questions like "Give me all the papers by Jeff Dean."
- If you want to perform full PDF embedding for the paper, you can also refer to this article.

You can find the full code here.

Core components

PDF Preprocessing
- Reads PDFs using pypdf and extracts:
  - Total number of pages
  - First page content (used as a proxy for metadata-rich information)

Markdown Conversion
- Converts the first page to Markdown using Marker.

LLM-Powered Metadata Extraction
- Sends the first-page Markdown to GPT-4o using CocoIndex's ExtractByLlm function.
- Extracted metadata includes:
  - title (string)
  - authors (with name, email, and affiliation)
  - abstract (string)
  - and more.

Semantic Embedding
- The title is embedded directly with a SentenceTransformer model.
  The model used is all-MiniLM-L6-v2.
- Abstracts are chunked based on semantic punctuation and token count, then each chunk is embedded individually.

Relational Data Collection
- Authors are unrolled and collected into an author_papers relation, enabling queries like:
  - Show all papers by X
  - Which co-authors worked with Y?

Prerequisites

- Install PostgreSQL. CocoIndex uses PostgreSQL internally for incremental processing.
- Configure your OpenAI API key.

Alternatively, CocoIndex has native support for Gemini, Ollama, and LiteLLM; check out the guide. You can choose your favorite LLM provider and work completely on-premises.

Define the indexing flow

This project demonstrates an example of metadata understanding that is closer to real-world use cases.

You will see how easy it is to achieve this design with CocoIndex within 100 lines of indexing logic (see the code).

To help you navigate what we will walk through, here is the flow:

1. Import a list of papers in PDF.
2. For each file:
   - Extract the first page of the paper.
   - Convert the first page to Markdown.
   - Extract metadata (title, authors, abstract) from the first page.
   - Split the abstract into chunks, and compute embeddings for each chunk.
3. Export to the following tables in Postgres with PGVector:
   - Metadata (title, authors, abstract) for each paper.
   - Author-to-paper mapping, for author-based queries.
   - Embeddings for titles and abstract chunks, for semantic search.

Let's zoom in on the steps.

Import the papers

```python
import datetime

import cocoindex


@cocoindex.flow_def(name="PaperMetadata")
def paper_metadata_flow(
    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
) -> None:
    # Read every file under ./papers as binary and re-scan the folder every 10 seconds.
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="papers", binary=True),
        refresh_interval=datetime.timedelta(seconds=10),
    )
```

flow_builder.add_source creates a table with sub-fields (filename, content). Refer to the documentation for more details.
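To make the shape of that table concrete, here is a minimal plain-Python sketch of what the source conceptually yields: one row per file, with a filename and its binary content. It is only an illustration of the data shape, not how CocoIndex implements LocalFile. The helper name `list_document_rows` is hypothetical, and treating `filename` as the path relative to the source folder is an assumption.

```python
from pathlib import Path
from typing import Iterator, TypedDict


class DocumentRow(TypedDict):
    filename: str   # assumption: path relative to the source folder
    content: bytes  # raw bytes, mirroring binary=True on the source


def list_document_rows(root: str) -> Iterator[DocumentRow]:
    """Yield one {filename, content} row per file under `root` (hypothetical helper)."""
    base = Path(root)
    for path in sorted(base.rglob("*")):
        if not path.is_file():
            continue  # skip directories
        yield {
            "filename": path.relative_to(base).as_posix(),
            "content": path.read_bytes(),
        }
```

Each later step in the flow then works on one such row at a time, which is what the `with data_scope["documents"].row() as doc:` blocks express.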

Extract and collect metadata

Extract the first page for basic info

Define a custom function to extract the first page and the number of pages of the PDF.

```python
import dataclasses
import io

import cocoindex
from pypdf import PdfReader, PdfWriter


@dataclasses.dataclass
class PaperBasicInfo:
    num_pages: int
    first_page: bytes


@cocoindex.op.function()
def extract_basic_info(content: bytes) -> PaperBasicInfo:
    """Extract the first page and the page count of a PDF."""
    reader = PdfReader(io.BytesIO(content))

    # Write only the first page into a new in-memory PDF.
    output = io.BytesIO()
    writer = PdfWriter()
    writer.add_page(reader.pages[0])
    writer.write(output)

    return PaperBasicInfo(num_pages=len(reader.pages), first_page=output.getvalue())
```

Now plug this into your flow. We extract metadata from the first page to minimize processing cost, since the entire PDF can be very large.

```python
with data_scope["documents"].row() as doc:
    doc["basic_info"] = doc["content"].transform(extract_basic_info)
```

After this step, you should have the basic info of each paper.

Parse basic info

We will convert the first page to Markdown using Marker. Alternatively, you can easily plug in your favorite PDF parser, such as Docling.

Define a Marker converter function and cache it, since its initialization is resource-intensive. This ensures that the same converter instance is reused for different input files.

```python
from functools import cache

from marker.config.parser import ConfigParser
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict


@cache
def get_marker_converter() -> PdfConverter:
    # Built once and reused: loading the Marker models is expensive.
    config_parser = ConfigParser({})
    return PdfConverter(
        create_model_dict(), config=config_parser.generate_config_dict()
    )
```

Plug it into a custom function.

```python
import tempfile

from marker.output import text_from_rendered


@cocoindex.op.function(gpu=True, cache=True, behavior_version=1)
def pdf_to_markdown(content: bytes) -> str:
    """Convert a PDF (given as bytes) to Markdown."""
    # Marker reads from a file path, so write the bytes to a temporary file first.
    with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as temp_file:
        temp_file.write(content)
        temp_file.flush()
        text, _, _ = text_from_rendered(get_marker_converter()(temp_file.name))
        return text
```

Pass it to your transform:

```python
with data_scope["documents"].row() as doc:
    doc["first_page_md"] = doc["basic_info"]["first_page"].transform(
        pdf_to_markdown
    )
```

After this step, you should have the first page of each paper in Markdown format.

Extract basic info with LLM

Define a schema for LLM extraction. CocoIndex natively supports LLM-structured extraction with complex and nested schemas. If you are interested in learning more about nested schemas, refer to this article.

```python
@dataclasses.dataclass
class Author:
    """Author of a paper.

    Not shown in the original snippet: the fields follow the description
    above (name, email, affiliation); the optional types are an assumption.
    """

    name: str
    email: str | None
    affiliation: str | None


@dataclasses.dataclass
class PaperMetadata:
    """Metadata for a paper."""

    title: str
    authors: list[Author]
    abstract: str
```

Plug it into the ExtractByLlm function. With a dataclass defined, CocoIndex will automatically parse the LLM response into the dataclass.

```python
# Inside the `with data_scope["documents"].row() as doc:` block.
doc["metadata"] = doc["first_page_md"].transform(
    cocoindex.functions.ExtractByLlm(
        llm_spec=cocoindex.LlmSpec(
            api_type=cocoindex.LlmApiType.OPENAI, model="gpt-4o"
        ),
        output_type=PaperMetadata,
        instruction="Please extract the metadata from the first page of the paper.",
    )
)
```

After this step, you should have the metadata of each paper.

Collect paper metadata

```python
paper_metadata = data_scope.add_collector()
with data_scope["documents"].row() as doc:
    # ... process
    # Collect metadata
    paper_metadata.collect(
        filename=doc["filename"],
        title=doc["metadata"]["title"],
        authors=doc["metadata"]["authors"],
        abstract=doc["metadata"]["abstract"],
        num_pages=doc["basic_info"]["num_pages"],
    )
```

Just collect anything you need :)

Collect author to filename information

Here we want to collect Author → Papers in a separate table to build a lookup functionality.
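To see the shape of the relation this produces, here is a small plain-Python sketch that unrolls each paper's author list into (author_name, filename) rows and groups them for lookup. It only illustrates the data shape and is not CocoIndex code; the sample file names, the second author name, and both helper names are made up for the example.

```python
from collections import defaultdict

# Hypothetical extracted metadata: filename -> list of author names.
papers = {
    "paper_a.pdf": ["Jeff Dean", "Ada Example"],
    "paper_b.pdf": ["Jeff Dean"],
}


def unroll_author_papers(papers: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Produce one (author_name, filename) row per author of each paper."""
    return [
        (author, filename)
        for filename, authors in papers.items()
        for author in authors
    ]


def papers_by_author(rows: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group the unrolled rows so each author maps to their file names."""
    index: dict[str, list[str]] = defaultdict(list)
    for author, filename in rows:
        index[author].append(filename)
    return dict(index)


rows = unroll_author_papers(papers)
index = papers_by_author(rows)
print(index["Jeff Dean"])  # ['paper_a.pdf', 'paper_b.pdf']
```

One row per (author, paper) pair is what makes a question like "all papers by Jeff Dean" a simple filter on a single column.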

Simply collect by author.

```python
author_papers = data_scope.add_collector()
with data_scope["documents"].row() as doc:
    # Unroll the author list: one collected row per (author, paper) pair.
    with doc["metadata"]["authors"].row() as author:
        author_papers.collect(
            author_name=author["name"],
            filename=doc["filename"],
        )
```

Compute and collect embeddings

Title

```python
doc["title_embedding"] = doc["metadata"]["title"].transform(
    cocoindex.functions.SentenceTransformerEmbed(
        model="sentence-transformers/all-MiniLM-L6-v2"
    )
)
```

Abstract

Split the abstract into chunks, embed each chunk, and collect the embeddings. Sometimes the abstract can be very long.

```python
doc["abstract_chunks"] = doc["metadata"]["abstract"].transform(
    cocoindex.functions.SplitRecursively(
        custom_languages=[
            cocoindex.functions.CustomLanguageSpec(
                language_name="abstract",
                # Separators from coarsest to finest: sentence ends,
                # colons/semicolons, commas, then any whitespace.
                separators_regex=[r"[.?!]+\s+", r"[:;]\s+", r",\s+", r"\s+"],
            )
        ]
    ),
    language="abstract",
    chunk_size=500,
    min_chunk_size=200,
    chunk_overlap=150,
)
```

After this step, you should have the abstract chunks of each paper.

Embed each chunk and collect the embeddings.

```python
with doc["abstract_chunks"].row() as chunk:
    chunk["embedding"] = chunk["text"].transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"
        )
    )
```

After this step, you should have the embeddings of the abstract chunks of each paper.

Collect embeddings

```python
metadata_embeddings = data_scope.add_collector()

with data_scope["documents"].row() as doc:
    # ... process
    # collect title embedding
    metadata_embeddings.collect(
        id=cocoindex.GeneratedField.UUID,
        filename=doc["filename"],
        location="title",
        text=doc["metadata"]["title"],
        embedding=doc["title_embedding"],
    )
    with doc["abstract_chunks"].row() as chunk:
        # ... process
        # collect abstract chunks embeddings
        metadata_embeddings.collect(
            id=cocoindex.GeneratedField.UUID,
            filename=doc["filename"],
            location="abstract",
            text=chunk["text"],
            embedding=chunk["embedding"],
        )
```

Export

Finally, we export the data to Postgres.

```python
# One row per paper, keyed by file name.
paper_metadata.export(
    "paper_metadata",
    cocoindex.targets.Postgres(),
    primary_key_fields=["filename"],
)
# One row per (author, paper) pair.
author_papers.export(
    "author_papers",
    cocoindex.targets.Postgres(),
    primary_key_fields=["author_name", "filename"],
)
# Title and abstract-chunk embeddings, with a cosine-similarity vector index.
metadata_embeddings.export(
    "metadata_embeddings",
    cocoindex.targets.Postgres(),
    primary_key_fields=["id"],
    vector_indexes=[
        cocoindex.VectorIndexDef(
            field_name="embedding",
            metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
        )
    ],
)
```

In this example we use PGVector as the embedding store. With CocoIndex, you can switch to other supported vector databases such as Qdrant with a one-line change; see this guide for more details.

We aim to standardize interfaces and make it like building with LEGO.

View in CocoInsight step by step

You can walk through the project step by step in CocoInsight to see exactly how each field is constructed and what happens behind the scenes.

Query the index

You can refer to this section of Text Embeddings on how to build queries against embeddings.

For now CocoIndex doesn't provide an additional query interface. We can write SQL or rely on the query engine of the target storage.

- Many databases already have optimized query implementations with their own best practices.
- The query space has excellent solutions for querying, reranking, and other search-related functionality.

If you need help writing a query, please feel free to reach out to us on Discord.

Support us

We are constantly improving, and more features and examples are coming soon.

If this article is helpful for you, please give us a star ⭐ on GitHub to help us grow.

Thanks for reading!
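
As a follow-up to the query section, here is a minimal sketch of the kind of SQL one might write against the exported tables. It only builds query strings and parameters; running them needs a Postgres connection with the pgvector extension. Several things here are assumptions rather than part of the article: the helper names are hypothetical, the `%s` placeholders assume a psycopg-style driver, and the table names must be checked in your database, since the actual exported names may differ from the export names used above. The column names come from the fields collected earlier, and `<=>` is pgvector's cosine distance operator, so the score is computed as 1 minus the distance.

```python
import re
from typing import Sequence

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check_identifier(name: str) -> str:
    """Reject anything that is not a plain SQL identifier before interpolating it."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsafe table name: {name!r}")
    return name


def to_pgvector_literal(vec: Sequence[float]) -> str:
    """Format an embedding as a pgvector text literal such as '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


def semantic_search_sql(
    table: str, query_vec: Sequence[float], top_k: int = 5
) -> tuple[str, tuple]:
    """Build a top-k cosine similarity query over titles and abstract chunks."""
    table = _check_identifier(table)
    sql = (
        "SELECT filename, location, text, "
        "1 - (embedding <=> %s::vector) AS score "
        f"FROM {table} "
        "ORDER BY embedding <=> %s::vector "
        "LIMIT %s"
    )
    literal = to_pgvector_literal(query_vec)
    return sql, (literal, literal, top_k)


def papers_by_author_sql(table: str, author_name: str) -> tuple[str, tuple]:
    """Build the 'all papers by X' lookup over the author-to-paper table."""
    table = _check_identifier(table)
    sql = f"SELECT filename FROM {table} WHERE author_name = %s ORDER BY filename"
    return sql, (author_name,)
```

The query text has to be embedded with the same model used at indexing time (all-MiniLM-L6-v2 in this article) before its vector is passed to `semantic_search_sql`; otherwise the distances are not meaningful.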