In this blog we will walk through a comprehensive example of indexing research papers: extracting metadata from each paper (beyond full-text chunking and embedding) and building semantic embeddings for indexing and querying.

We would greatly appreciate it if you could ⭐ star CocoIndex on GitHub if you find this tutorial helpful.

Use cases

- Academic search and retrieval, as well as research-based AI agents
- Paper recommendation systems
- Research knowledge graphs
- Semantic analysis of scientific literature

What we will achieve

Let's take a look at this PDF as an example. Here is what we want to achieve:
 
 
 
 
 
- Extract the paper metadata, including file name, title, author information, abstract, and number of pages.
- Build vector embeddings for the metadata, such as the title and abstract, for semantic search. This enables better metadata-driven semantic search results. For example, you can match text queries against titles and abstracts.
- Build an index of authors and all the file names associated with each author to answer questions like "Give me all the papers by Jeff Dean."
 
If you want to perform full PDF embedding for the paper, you can also refer to this article.

You can find the full code here.

If this article is helpful for you, please give us a star ⭐ at GitHub to help us grow.

Core components
 
 
 
 
 
 
 
PDF Preprocessing

Reads PDFs using pypdf and extracts:

- Total number of pages
- First page content (used as a proxy for metadata-rich information)

Markdown Conversion

Converts the first page to Markdown using Marker.

LLM-Powered Metadata Extraction

Sends the first-page Markdown to GPT-4o using CocoIndex's ExtractByLlm function. Extracted metadata includes the following, and more:

- title (string)
- authors (with name, email, and affiliation)
- abstract (string)

Semantic Embedding

- The title is embedded directly using the all-MiniLM-L6-v2 model via SentenceTransformer.
- Abstracts are chunked based on semantic punctuation and token count, then each chunk is embedded individually.

Relational Data Collection

Authors are unrolled and collected into an author_papers relation, enabling queries like:

- Show all papers by X
- Which co-authors worked with Y?

Prerequisites
 
 
 
 
 
- Install PostgreSQL. CocoIndex uses PostgreSQL internally for incremental processing.
- Configure your OpenAI API key. Alternatively, we have native support for Gemini, Ollama, and LiteLLM; check out the guide. You can choose your favorite LLM provider and can work completely on-premises.

Define the indexing flow

This project demonstrates a slightly more comprehensive example of metadata understanding, closer to real-world use cases.

You will see how easy it is to achieve this design with CocoIndex within 100 lines of indexing logic (see the code).

To help you better navigate what we will walk through, here is a flow diagram.
 
 
 
1. Import a list of papers in PDF.
2. For each file:
   - Extract the first page of the paper.
   - Convert the first page to Markdown.
   - Extract metadata (title, authors, abstract) from the first page.
   - Split the abstract into chunks, and compute embeddings for each chunk.
3. Export to the following tables in Postgres with PGVector:
   - Metadata (title, authors, abstract) for each paper.
   - Author-to-paper mapping, for author-based query.
   - Embeddings for titles and abstract chunks, for semantic search.

Let's zoom in on the steps.

Import the papers
import datetime

import cocoindex


@cocoindex.flow_def(name="PaperMetadata")
def paper_metadata_flow(
    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
) -> None:
    # Read PDFs from the local "papers" directory as raw bytes, rescanning every 10 seconds.
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="papers", binary=True),
        refresh_interval=datetime.timedelta(seconds=10),
    )
flow_builder.add_source will create a table with sub fields (filename, content). We can refer to the documentation for more details.

Extract and collect metadata

Extract first page for basic info

Define a custom function to extract the first page and number of pages of the PDF.
import dataclasses
import io

from pypdf import PdfReader, PdfWriter


@dataclasses.dataclass
class PaperBasicInfo:
    num_pages: int
    first_page: bytes


@cocoindex.op.function()
def extract_basic_info(content: bytes) -> PaperBasicInfo:
    """Extract the first page and the page count of a PDF."""
    reader = PdfReader(io.BytesIO(content))

    # Write only the first page into a new in-memory PDF.
    output = io.BytesIO()
    writer = PdfWriter()
    writer.add_page(reader.pages[0])
    writer.write(output)

    return PaperBasicInfo(num_pages=len(reader.pages), first_page=output.getvalue())

Now plug this into your flow. We extract metadata from the first page to minimize processing cost, since the entire PDF can be very large.
with data_scope["documents"].row() as doc:
    doc["basic_info"] = doc["content"].transform(extract_basic_info)
After this step, you should have the basic info of each paper.

Parse basic info

We will convert the first page to Markdown using Marker. Alternatively, you can easily plug in your favorite PDF parser, such as Docling.

Define a Marker converter function and cache it, since its initialization is resource-intensive. This ensures that the same converter instance is reused for different input files.
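from functools import cache

from marker.config.parser import ConfigParser
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered


@cache
def get_marker_converter() -> PdfConverter:
    # Building the converter loads models, so create it once and reuse it.
    config_parser = ConfigParser({})
    return PdfConverter(
        create_model_dict(), config=config_parser.generate_config_dict()
    )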
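To see what the @cache decorator buys us here, the following is a minimal sketch using only the standard library. ExpensiveConverter is a hypothetical stand-in for a model-backed converter (it is not part of Marker or CocoIndex); the point is only that the decorated function builds the object once and returns the same instance on every later call.

```python
from functools import cache


class ExpensiveConverter:
    """Hypothetical stand-in for a converter that is costly to construct."""

    instances_created = 0

    def __init__(self) -> None:
        # Pretend this loads large model weights.
        ExpensiveConverter.instances_created += 1


@cache
def get_converter() -> ExpensiveConverter:
    # The body runs only on the first call; later calls return the cached object.
    return ExpensiveConverter()


first = get_converter()
second = get_converter()
print(first is second)
print(ExpensiveConverter.instances_created)
```

Because the function takes no arguments, the cache holds a single entry, which is what makes it behave like a lazily created singleton.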
Plug the converter into a custom function.
import tempfile


@cocoindex.op.function(gpu=True, cache=True, behavior_version=1)
def pdf_to_markdown(content: bytes) -> str:
    """Convert to Markdown."""
    # The converter is called with a file path here, so write the bytes to a temporary PDF first.
    with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as temp_file:
        temp_file.write(content)
        temp_file.flush()
        text, _, _ = text_from_rendered(get_marker_converter()(temp_file.name))
        return text
Pass it to your transform.
with data_scope["documents"].row() as doc:
    doc["first_page_md"] = doc["basic_info"]["first_page"].transform(
        pdf_to_markdown
    )
After this step, you should have the first page of each paper in Markdown format.

Extract basic info with LLM

Define a schema for LLM extraction. CocoIndex natively supports LLM-structured extraction with complex and nested schemas. If you are interested in learning more about nested schemas, refer to this article.
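@dataclasses.dataclass
class Author:
    """An author of a paper."""

    # Fields follow the component description (name, email, affiliation); optional types are assumed.
    name: str
    email: str | None
    affiliation: str | None


@dataclasses.dataclass
class PaperMetadata:
    """
    Metadata for a paper.
    """

    title: str
    authors: list[Author]
    abstract: str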

Plug it into the ExtractByLlm function. With a dataclass defined, CocoIndex will automatically parse the LLM response into the dataclass.
doc["metadata"] = doc["first_page_md"].transform(
    cocoindex.functions.ExtractByLlm(
        llm_spec=cocoindex.LlmSpec(
            api_type=cocoindex.LlmApiType.OPENAI, model="gpt-4o"
        ),
        output_type=PaperMetadata,
        instruction="Please extract the metadata from the first page of the paper.",
    )
)
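To make "parse the LLM response into the dataclass" more concrete, here is a rough standalone sketch of the idea in plain Python. It is not how CocoIndex implements it internally; parse_metadata is a hypothetical helper, the JSON text is made up for illustration, and the optional Author fields are an assumption.

```python
import dataclasses
import json
from typing import Optional


@dataclasses.dataclass
class Author:
    name: str
    email: Optional[str] = None        # assumed optional
    affiliation: Optional[str] = None  # assumed optional


@dataclasses.dataclass
class PaperMetadata:
    title: str
    authors: list[Author]
    abstract: str


def parse_metadata(raw_json: str) -> PaperMetadata:
    """Turn a JSON object (as an LLM might return it) into typed dataclasses."""
    data = json.loads(raw_json)
    authors = [Author(**author) for author in data["authors"]]
    return PaperMetadata(
        title=data["title"], authors=authors, abstract=data["abstract"]
    )


# Made-up response text, for illustration only.
raw = json.dumps(
    {
        "title": "An Example Paper",
        "authors": [
            {"name": "A. Researcher", "email": None, "affiliation": "Example Lab"}
        ],
        "abstract": "A short abstract.",
    }
)
meta = parse_metadata(raw)
print(meta.authors[0].name)
```

The nested list of Author objects is what later lets the flow iterate over authors row by row.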
After this step, you should have the metadata of each paper.

Collect paper metadata
paper_metadata = data_scope.add_collector()

with data_scope["documents"].row() as doc:
    # ... process
    # Collect metadata
    paper_metadata.collect(
        filename=doc["filename"],
        title=doc["metadata"]["title"],
        authors=doc["metadata"]["authors"],
        abstract=doc["metadata"]["abstract"],
        num_pages=doc["basic_info"]["num_pages"],
    )
Just collect anything you need :)

Collect author-to-filename information

We've already extracted the author list. Here we want to collect Author → Papers in a separate table to build a lookup functionality. Simply collect by author.
author_papers = data_scope.add_collector()

with data_scope["documents"].row() as doc:
    with doc["metadata"]["authors"].row() as author:
        author_papers.collect(
            author_name=author["name"],
            filename=doc["filename"],
        )
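To illustrate why a flat (author_name, filename) relation is enough for author lookups, here is a small in-memory sketch. The rows and names are made up, and papers_by and coauthors_of are hypothetical helpers; in practice the same questions would be answered with queries against the exported table.

```python
from collections import defaultdict

# Hypothetical rows, shaped like what author_papers collects: (author_name, filename).
rows = [
    ("Ada Lovelace", "paper_a.pdf"),
    ("Alan Turing", "paper_a.pdf"),
    ("Ada Lovelace", "paper_b.pdf"),
    ("Grace Hopper", "paper_b.pdf"),
]

papers_by_author: dict[str, set[str]] = defaultdict(set)
authors_by_paper: dict[str, set[str]] = defaultdict(set)
for author_name, filename in rows:
    papers_by_author[author_name].add(filename)
    authors_by_paper[filename].add(author_name)


def papers_by(author: str) -> list[str]:
    """Show all papers by the given author."""
    return sorted(papers_by_author.get(author, set()))


def coauthors_of(author: str) -> list[str]:
    """List everyone who shares at least one paper with the given author."""
    found: set[str] = set()
    for filename in papers_by_author.get(author, set()):
        found |= authors_by_paper[filename]
    found.discard(author)
    return sorted(found)


print(papers_by("Ada Lovelace"))
print(coauthors_of("Ada Lovelace"))
```

The co-author lookup is a self-join on filename, which is why using (author_name, filename) as the key of the relation works well.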
Compute and collect embeddings

Title
doc["title_embedding"] = doc["metadata"]["title"].transform(
    cocoindex.functions.SentenceTransformerEmbed(
        model="sentence-transformers/all-MiniLM-L6-v2"
    )
)
Abstract

Split the abstract into chunks, embed each chunk, and collect their embeddings. Sometimes the abstract can be very long.
doc["abstract_chunks"] = doc["metadata"]["abstract"].transform(
    cocoindex.functions.SplitRecursively(
        custom_languages=[
            cocoindex.functions.CustomLanguageSpec(
                language_name="abstract",
                separators_regex=[r"[.?!]+\s+", r"[:;]\s+", r",\s+", r"\s+"],
            )
        ]
    ),
    language="abstract",
    chunk_size=500,
    min_chunk_size=200,
    chunk_overlap=150,
)
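The separator list above goes from coarse (sentence ends) to fine (whitespace). As a rough intuition for how such a list can drive chunking, here is a simplified standalone sketch. It is not CocoIndex's SplitRecursively implementation: it measures chunk_size in characters and ignores min_chunk_size and chunk_overlap, and split_recursively is a hypothetical helper.

```python
import re

# Same separators as in the flow, ordered from coarse to fine.
SEPARATORS = [r"[.?!]+\s+", r"[:;]\s+", r",\s+", r"\s+"]


def split_recursively(
    text: str, chunk_size: int, separators: list[str] = SEPARATORS
) -> list[str]:
    """Split text into chunks of at most chunk_size characters where possible."""
    text = text.strip()
    if len(text) <= chunk_size or not separators:
        return [text] if text else []

    first, rest = separators[0], separators[1:]
    # The capturing group keeps each separator so it stays attached to its piece.
    parts = re.split(f"({first})", text)
    pieces = ["".join(parts[i : i + 2]) for i in range(0, len(parts), 2)]

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        if len(current) + len(piece) <= chunk_size:
            current += piece
            continue
        if current.strip():
            chunks.append(current.strip())
        current = ""
        if len(piece) > chunk_size:
            # Still too long: retry with the next, finer-grained separator.
            chunks.extend(split_recursively(piece, chunk_size, rest))
        else:
            current = piece
    if current.strip():
        chunks.append(current.strip())
    return chunks


abstract = (
    "We study indexing. The method extracts metadata from the first page: "
    "title, authors, and abstract. Results are stored in a database."
)
for chunk in split_recursively(abstract, chunk_size=60):
    print(repr(chunk))
```

In this sketch the middle sentence is too long for one chunk, so it falls back to the colon separator, while the shorter sentences stay intact.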
After this step, you should have the abstract chunks of each paper.

Embed each chunk and collect their embeddings.
with doc["abstract_chunks"].row() as chunk:
    chunk["embedding"] = chunk["text"].transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"
        )
    )
After this step, you should have the embeddings of the abstract chunks of each paper.

Collect embeddings
metadata_embeddings = data_scope.add_collector()

with data_scope["documents"].row() as doc:
    # ... process
    # collect title embedding
    metadata_embeddings.collect(
        id=cocoindex.GeneratedField.UUID,
        filename=doc["filename"],
        location="title",
        text=doc["metadata"]["title"],
        embedding=doc["title_embedding"],
    )
    with doc["abstract_chunks"].row() as chunk:
        # ... process
        # collect abstract chunks embeddings
        metadata_embeddings.collect(
            id=cocoindex.GeneratedField.UUID,
            filename=doc["filename"],
            location="abstract",
            text=chunk["text"],
            embedding=chunk["embedding"],
        )
Export

Finally, we export the data to Postgres.
paper_metadata.export(
    "paper_metadata",
    cocoindex.targets.Postgres(),
    primary_key_fields=["filename"],
)

author_papers.export(
    "author_papers",
    cocoindex.targets.Postgres(),
    primary_key_fields=["author_name", "filename"],
)

metadata_embeddings.export(
    "metadata_embeddings",
    cocoindex.targets.Postgres(),
    primary_key_fields=["id"],
    vector_indexes=[
        cocoindex.VectorIndexDef(
            field_name="embedding",
            metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
        )
    ],
)
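The vector index above is configured for cosine similarity. As a quick illustration of what that metric ranks by, here is a toy sketch in plain Python. The 3-dimensional vectors and file names are made up (all-MiniLM-L6-v2 produces 384-dimensional embeddings), and the rows only mimic the shape of metadata_embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy rows standing in for title and abstract-chunk embeddings.
rows = [
    {"filename": "a.pdf", "location": "title", "embedding": [1.0, 0.0, 0.0]},
    {"filename": "a.pdf", "location": "abstract", "embedding": [0.6, 0.8, 0.0]},
    {"filename": "b.pdf", "location": "title", "embedding": [0.0, 0.0, 1.0]},
]
query = [0.8, 0.6, 0.0]  # stand-in for an embedded query string

# Highest similarity first.
ranked = sorted(
    rows, key=lambda row: cosine_similarity(query, row["embedding"]), reverse=True
)
for row in ranked:
    score = cosine_similarity(query, row["embedding"])
    print(row["filename"], row["location"], round(score, 2))
```

Because titles and abstract chunks share one table and one embedding model, a single query vector can be compared against both kinds of rows.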
In this example we use PGVector as the embedding store. With CocoIndex, you can do a one-line switch to other supported vector databases such as Qdrant; see this guide for more details.

We aim to standardize interfaces and make it like building LEGO.

View in CocoInsight step by step

You can walk through the project step by step in CocoInsight to see exactly how each field is constructed and what happens behind the scenes.

Query the index

You can refer to this section of Text Embeddings about how to build queries against embeddings.

For now, CocoIndex doesn't provide an additional query interface. We can write SQL or rely on the query engine provided by the target storage.
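As one possible shape for such a SQL query against PGVector, here is a hedged sketch that only builds the query text and parameters. The column names (filename, location, text, embedding) come from the collector in this post; the table name is an assumption and should be checked in your database, and build_search_query and to_pgvector_literal are hypothetical helpers. The <=> operator is pgvector's cosine distance.

```python
def to_pgvector_literal(embedding: list[float]) -> str:
    """Format a Python list as a pgvector text literal such as [0.1,0.2]."""
    return "[" + ",".join(str(x) for x in embedding) + "]"


def build_search_query(table_name: str, top_k: int = 5) -> str:
    # <=> is cosine distance in pgvector, so 1 - distance gives cosine similarity.
    # table_name is interpolated directly: pass only trusted values, never user input.
    return (
        "SELECT filename, location, text, "
        "1 - (embedding <=> %s::vector) AS score "
        f"FROM {table_name} "
        "ORDER BY embedding <=> %s::vector "
        f"LIMIT {int(top_k)}"
    )


# Hypothetical table name; look up the real one in your Postgres database.
sql = build_search_query("metadata_embeddings_table")
# Stand-in vector; a real query would embed the search text with the same model.
vector = to_pgvector_literal([0.1, 0.2, 0.3])
params = (vector, vector)
print(sql)
print(params)
```

With a Postgres driver, the query and parameters would be passed together to the cursor's execute call. The query text has to be embedded with the same model used for indexing (all-MiniLM-L6-v2 here), otherwise the distances are not meaningful.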
 
 
Many databases already have optimized query implementations with their own best practices, and the query space has excellent solutions for querying, reranking, and other search-related functionality.

If you need assistance with writing the query, please feel free to reach out to us on Discord.

Support us

We are constantly improving, and more features and examples are coming soon. If this article is helpful for you, please give us a star ⭐ at GitHub to help us grow.

Thanks for reading!


Turn Your PDF Library into a Searchable Research Database with 100 Lines of Code
