In this blog, we will walk through a comprehensive example of indexing research papers by extracting different kinds of metadata (beyond full-text chunking and embedding) and building semantic embeddings for indexing and querying.

We would greatly appreciate it if you could ⭐ star CocoIndex on GitHub if you find this tutorial helpful.

Use Cases

- Academic search and retrieval, as well as research-based AI agents
- Paper recommendation systems
- Research knowledge graphs
- Semantic analysis of scientific literature

What We Will Achieve

Let's take a look at this PDF as an example. Here's what we want to achieve:

 Extract the paper metadata, including file name, title, author information, abstract, and number of pages. 
 
 
Build vector embeddings for the metadata, such as the title and abstract, for semantic search. This enables better metadata-driven semantic search results. For example, you can match text queries against titles and abstracts.
 
 
 
 
 
 Build an index of authors and all the file names associated with each author to answer questions like "Give me all the papers by Jeff Dean." 
 
If you want to perform full PDF embedding for the paper, you can also refer to this article. You can find the full code here.

If this article is helpful for you, please give us a star ⭐ at GitHub to help us grow.

Core Components

PDF Preprocessing

Reads PDFs using pypdf and extracts:
- Total number of pages
- First page content (used as a proxy for metadata-rich information)

Markdown Conversion

Converts the first page to Markdown using Marker.

LLM-Powered Metadata Extraction

Sends the first-page Markdown to GPT-4o using CocoIndex's ExtractByLlm function. Extracted metadata includes the following, among other fields:
- title (string)
- authors (with name, email, and affiliation)
- abstract (string)

Semantic Embedding

- The title is embedded directly using the all-MiniLM-L6-v2 SentenceTransformer model.
- Abstracts are chunked based on semantic punctuation and token count, then each chunk is embedded individually.

Relational Data Collection

Authors are unrolled and collected into an author_papers relation, enabling queries like:
- Show all papers by X
- Which co-authors worked with Y?

Prerequisites

- Install PostgreSQL. CocoIndex uses PostgreSQL internally for incremental processing.
- Configure your OpenAI API key. Alternatively, CocoIndex has native support for Gemini, Ollama, and LiteLLM; check out the guide. You can choose your favorite LLM provider and can work completely on-premises.

Define Indexing Flow

This project demonstrates a slightly more comprehensive example of metadata understanding, closer to real-world use cases. You will see how easy it is to achieve this design with CocoIndex within 100 lines of indexing logic (see the code).

To help you navigate what we will walk through, here is a flow diagram.

1. Import a list of papers in PDF.
2. For each file:
   - Extract the first page of the paper.
   - Convert the first page to Markdown.
   - Extract metadata (title, authors, abstract) from the first page.
   - Split the abstract into chunks, and compute embeddings for each chunk.
3. Export to the following tables in Postgres with PGVector:
   - Metadata (title, authors, abstract) for each paper.
   - Author-to-paper mapping, for author-based queries.
   - Embeddings for titles and abstract chunks, for semantic search.

Let's zoom in on the steps.

Import the Papers

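import datetime

import cocoindex


@cocoindex.flow_def(name="PaperMetadata")
def paper_metadata_flow(
    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
) -> None:
    # Read PDFs as bytes from the local "papers" directory;
    # refresh_interval asks for the source to be re-checked every 10 seconds.
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="papers", binary=True),
        refresh_interval=datetime.timedelta(seconds=10),
    )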
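
As a rough mental model of what the source yields (a plain-Python analogy, not how CocoIndex implements it), each file under the directory becomes one row holding a filename and the raw bytes. The SourceRow class and the helper names below are hypothetical and exist only for this illustration.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SourceRow:
    """Hypothetical stand-in for one row produced by a file source."""

    filename: str
    content: bytes


def list_source_rows(root: Path) -> list[SourceRow]:
    """Read every file under `root` as bytes, keyed by its relative path."""
    rows = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rows.append(SourceRow(str(path.relative_to(root)), path.read_bytes()))
    return rows


def demo_source_rows() -> list[SourceRow]:
    """Build a throwaway directory with two fake PDFs and list its rows."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "a.pdf").write_bytes(b"%PDF-1.4 first")
        (root / "b.pdf").write_bytes(b"%PDF-1.4 second")
        return list_source_rows(root)
```

Everything downstream in the flow works on rows shaped like this: one per file, addressed by filename.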

flow_builder.add_source creates a table with subfields (filename, content). Refer to the documentation for more details.

Extract and Collect Metadata

Extract the first page for basic info

Define a custom function to extract the first page and the number of pages of the PDF.

import dataclasses
import io

from pypdf import PdfReader, PdfWriter


@dataclasses.dataclass
class PaperBasicInfo:
    num_pages: int
    first_page: bytes


@cocoindex.op.function()
def extract_basic_info(content: bytes) -> PaperBasicInfo:
    """Extract the first page and the page count of a PDF."""
    reader = PdfReader(io.BytesIO(content))

    # Write only the first page into a new in-memory PDF.
    output = io.BytesIO()
    writer = PdfWriter()
    writer.add_page(reader.pages[0])
    writer.write(output)

    return PaperBasicInfo(num_pages=len(reader.pages), first_page=output.getvalue())

Now plug this into your flow. We extract metadata from the first page to minimize processing cost, since the entire PDF can be very large.

with data_scope["documents"].row() as doc:
    doc["basic_info"] = doc["content"].transform(extract_basic_info)

After this step, you should have the basic info of each paper.

Parse basic info

We will convert the first page to Markdown using Marker. Alternatively, you can easily plug in your favorite PDF parser, such as Docling.

Define a Marker converter function and cache it, since its initialization is resource-intensive. This ensures that the same converter instance is reused for different input files.

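from functools import cache

from marker.config.parser import ConfigParser
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict


@cache
def get_marker_converter() -> PdfConverter:
    # Built once on the first call, then reused from the cache.
    config_parser = ConfigParser({})
    return PdfConverter(
        create_model_dict(), config=config_parser.generate_config_dict()
    )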
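
The effect of @cache can be seen in a small stand-alone sketch. ExpensiveConverter is a hypothetical stand-in for any object that is costly to construct; it is not part of Marker or CocoIndex.

```python
from functools import cache


class ExpensiveConverter:
    """Hypothetical stand-in for a converter that is costly to build."""

    instances_built = 0

    def __init__(self) -> None:
        # Count constructions so the reuse is observable.
        ExpensiveConverter.instances_built += 1


@cache
def get_converter() -> ExpensiveConverter:
    # The body runs only on the first call; later calls return the cached object.
    return ExpensiveConverter()


first = get_converter()
second = get_converter()
```

Both calls return the very same object, so the expensive setup is paid once per process no matter how many files are converted.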

Plug it into a custom function.

import tempfile

from marker.output import text_from_rendered


@cocoindex.op.function(gpu=True, cache=True, behavior_version=1)
def pdf_to_markdown(content: bytes) -> str:
    """Convert a PDF, given as bytes, to Markdown."""
    # The converter is called with a file path, so stage the bytes in a temporary file.
    with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as temp_file:
        temp_file.write(content)
        temp_file.flush()
        text, _, _ = text_from_rendered(get_marker_converter()(temp_file.name))
        return text

Pass it to your transform.

with data_scope["documents"].row() as doc:
    doc["first_page_md"] = doc["basic_info"]["first_page"].transform(
        pdf_to_markdown
    )

After this step, you should have the first page of each paper in Markdown format.

Extract basic info with LLM

CocoIndex natively supports LLM-structured extraction with complex and nested schemas. If you are interested in learning more about nested schemas, refer to this article.

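# Author fields follow the Core Components list; the optional types are an assumption.
@dataclasses.dataclass
class Author:
    name: str
    email: str | None
    affiliation: str | None


@dataclasses.dataclass
class PaperMetadata:
    """Metadata for a paper."""

    title: str
    authors: list[Author]
    abstract: str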


Plug it into the ExtractByLlm function. With a dataclass defined, CocoIndex will automatically parse the LLM response into the dataclass.

doc["metadata"] = doc["first_page_md"].transform(
    cocoindex.functions.ExtractByLlm(
        llm_spec=cocoindex.LlmSpec(
            api_type=cocoindex.LlmApiType.OPENAI, model="gpt-4o"
        ),
        output_type=PaperMetadata,
        instruction="Please extract the metadata from the first page of the paper.",
    )
)
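
To make "parsing the LLM response into the dataclass" concrete, here is a minimal sketch of the idea using only the standard library. It is not CocoIndex's implementation; parse_paper_metadata and the sample JSON are hypothetical, and the sketch assumes the model returns JSON whose keys match the dataclass fields.

```python
import dataclasses
import json
from typing import Optional


@dataclasses.dataclass
class Author:
    name: str
    email: Optional[str] = None
    affiliation: Optional[str] = None


@dataclasses.dataclass
class PaperMetadata:
    title: str
    authors: list[Author]
    abstract: str


def parse_paper_metadata(raw_json: str) -> PaperMetadata:
    """Turn a JSON string shaped like PaperMetadata into nested dataclasses."""
    data = json.loads(raw_json)
    # Nested objects are converted first, then attached to the parent.
    authors = [Author(**author) for author in data["authors"]]
    return PaperMetadata(
        title=data["title"], authors=authors, abstract=data["abstract"]
    )


# Made-up response used only to exercise the parser.
sample_response = json.dumps(
    {
        "title": "An Example Paper",
        "authors": [
            {"name": "Ada Example", "email": None, "affiliation": "Example Lab"}
        ],
        "abstract": "A short abstract.",
    }
)
paper = parse_paper_metadata(sample_response)
```

The nested Author list is what later lets the flow iterate over authors one by one.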

After this step, you should have the metadata of each paper.

Collect paper metadata

paper_metadata = data_scope.add_collector()

with data_scope["documents"].row() as doc:
    # ... process
    # Collect metadata
    paper_metadata.collect(
        filename=doc["filename"],
        title=doc["metadata"]["title"],
        authors=doc["metadata"]["authors"],
        abstract=doc["metadata"]["abstract"],
        num_pages=doc["basic_info"]["num_pages"],
    )

Just collect anything you need :)

Collect author-to-filename information

Here we want to collect Author → Papers in a separate table to build a lookup functionality. Simply collect by author.

author_papers = data_scope.add_collector()

with data_scope["documents"].row() as doc:
    with doc["metadata"]["authors"].row() as author:
        author_papers.collect(
            author_name=author["name"],
            filename=doc["filename"],
        )
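
The nested loop above produces one (author_name, filename) row per author per paper. A plain-Python sketch of the same unrolling, with made-up names and file names, shows how that relation answers both example questions from the Core Components section. The helper functions are hypothetical.

```python
# Made-up input: each paper with its list of author names.
papers = {
    "paper_a.pdf": ["Alice Example", "Bob Example"],
    "paper_b.pdf": ["Alice Example", "Carol Example"],
}

# Unroll authors into one (author_name, filename) row per pair.
author_paper_rows = [
    (author, filename)
    for filename, authors in papers.items()
    for author in authors
]


def papers_by(author_name: str) -> list[str]:
    """Show all papers by the given author."""
    return sorted(f for a, f in author_paper_rows if a == author_name)


def coauthors_of(author_name: str) -> list[str]:
    """List everyone who shares at least one paper with the given author."""
    files = set(papers_by(author_name))
    return sorted(
        {a for a, f in author_paper_rows if f in files and a != author_name}
    )
```

In Postgres the same two questions become a filter on author_name and a self-join on filename.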

Compute and collect embeddings

Title

doc["title_embedding"] = doc["metadata"]["title"].transform(
    cocoindex.functions.SentenceTransformerEmbed(
        model="sentence-transformers/all-MiniLM-L6-v2"
    )
)

Abstract

Split the abstract into chunks, embed each chunk, and collect their embeddings. Sometimes the abstract can be very long.

doc["abstract_chunks"] = doc["metadata"]["abstract"].transform(
    cocoindex.functions.SplitRecursively(
        custom_languages=[
            cocoindex.functions.CustomLanguageSpec(
                language_name="abstract",
                # Coarse boundaries first (sentence ends), then finer ones.
                separators_regex=[r"[.?!]+\s+", r"[:;]\s+", r",\s+", r"\s+"],
            )
        ]
    ),
    language="abstract",
    chunk_size=500,
    min_chunk_size=200,
    chunk_overlap=150,
)
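
To build intuition for the ordered separators_regex list above, here is a simplified recursive splitter in plain Python. It is only a sketch of the general idea (try the coarsest separator first, fall back to finer ones for pieces that are still too long) and is not CocoIndex's SplitRecursively: it ignores min_chunk_size and chunk_overlap and measures size in characters. The function name is hypothetical.

```python
import re

# Same separator hierarchy as in the flow: sentences, clauses, commas, whitespace.
SEPARATORS = [r"[.?!]+\s+", r"[:;]\s+", r",\s+", r"\s+"]


def split_recursively(text: str, chunk_size: int, level: int = 0) -> list[str]:
    """Split text into chunks of at most chunk_size characters where possible."""
    text = text.strip()
    if len(text) <= chunk_size or level >= len(SEPARATORS):
        return [text] if text else []

    # Cut after each separator match so punctuation stays with its piece.
    pieces = []
    start = 0
    for match in re.finditer(SEPARATORS[level], text):
        pieces.append(text[start:match.end()])
        start = match.end()
    pieces.append(text[start:])

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        # Greedily merge neighbouring pieces while they still fit.
        if len((current + piece).strip()) <= chunk_size:
            current += piece
            continue
        if current.strip():
            chunks.append(current.strip())
        current = ""
        if len(piece.strip()) <= chunk_size:
            current = piece
        else:
            # Still too long: retry this piece with the next, finer separator.
            chunks.extend(split_recursively(piece, chunk_size, level + 1))
    if current.strip():
        chunks.append(current.strip())
    return chunks


sample = "We study indexing. It is fast, simple, and incremental! Does it scale? Yes."
sample_chunks = split_recursively(sample, chunk_size=30)
```

Short sentences are merged up to the size limit, while the one long sentence is broken at its commas rather than in the middle of a word.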

After this step, you should have the abstract chunks of each paper.

Embed each chunk and collect their embeddings.

with doc["abstract_chunks"].row() as chunk:
    chunk["embedding"] = chunk["text"].transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"
        )
    )

After this step, you should have the embeddings of the abstract chunks of each paper.

Collect embeddings

metadata_embeddings = data_scope.add_collector()

with data_scope["documents"].row() as doc:
    # ... process
    # collect title embedding
    metadata_embeddings.collect(
        id=cocoindex.GeneratedField.UUID,
        filename=doc["filename"],
        location="title",
        text=doc["metadata"]["title"],
        embedding=doc["title_embedding"],
    )
    with doc["abstract_chunks"].row() as chunk:
        # ... process
        # collect abstract chunks embeddings
        metadata_embeddings.collect(
            id=cocoindex.GeneratedField.UUID,
            filename=doc["filename"],
            location="abstract",
            text=chunk["text"],
            embedding=chunk["embedding"],
        )
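
To see what this collector makes possible, here is a toy in-memory version of the rows and a cosine-similarity lookup over them. The three-dimensional vectors, texts, and file names are made up for illustration; real all-MiniLM-L6-v2 embeddings have 384 dimensions, and the helper names are hypothetical.

```python
import math
import uuid


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# One row per title and per abstract chunk, mirroring the collected fields.
rows = [
    {"id": uuid.uuid4(), "filename": "paper_a.pdf", "location": "title",
     "text": "Title of paper A", "embedding": [1.0, 0.0, 0.0]},
    {"id": uuid.uuid4(), "filename": "paper_a.pdf", "location": "abstract",
     "text": "First chunk of abstract A", "embedding": [0.8, 0.6, 0.0]},
    {"id": uuid.uuid4(), "filename": "paper_b.pdf", "location": "title",
     "text": "Title of paper B", "embedding": [0.0, 1.0, 0.0]},
]


def search(query_embedding: list[float], top_k: int = 2) -> list[tuple[float, dict]]:
    """Return the top_k rows ranked by similarity to the query embedding."""
    scored = [(cosine_similarity(query_embedding, r["embedding"]), r) for r in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


hits = search([0.9, 0.1, 0.0])
```

Because titles and abstract chunks share one table, a single query can match either, and the location field tells you which part of the paper matched.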

Export

Finally, we export the data to Postgres.

paper_metadata.export(
    "paper_metadata",
    cocoindex.targets.Postgres(),
    primary_key_fields=["filename"],
)
author_papers.export(
    "author_papers",
    cocoindex.targets.Postgres(),
    primary_key_fields=["author_name", "filename"],
)
metadata_embeddings.export(
    "metadata_embeddings",
    cocoindex.targets.Postgres(),
    primary_key_fields=["id"],
    vector_indexes=[
        cocoindex.VectorIndexDef(
            field_name="embedding",
            metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
        )
    ],
)

In this example, we use PGVector as the embedding store. With CocoIndex, you can switch to other supported vector databases such as Qdrant with a one-line change; see this guide for more details. We aim to standardize interfaces and make it like building with LEGO.

View in CocoInsight step by step

You can walk through the project step by step in CocoInsight to see exactly how each field is constructed and what happens behind the scenes.

Query the Index

You can refer to this section of Text Embeddings for how to build queries against embeddings. For now, CocoIndex doesn't provide an additional query interface. We can write SQL or rely on the query engine of the target storage.
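
If you go the SQL route against PGVector, a query might be assembled as in the sketch below. It assumes the pgvector `<=>` cosine-distance operator, psycopg-style `%s` placeholders, and a placeholder table name (your_embeddings_table); check the actual table name created in your database. The function name is hypothetical, and the query vector must come from the same embedding model used for indexing.

```python
def build_search_query(
    table_name: str, query_embedding: list[float], top_k: int = 5
) -> tuple[str, tuple]:
    """Build a parameterized pgvector similarity query and its parameters."""
    # Table names cannot be bound as parameters, so validate before interpolating.
    if not table_name.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table_name!r}")

    # pgvector accepts vectors as text literals such as '[0.1,0.2,0.3]'.
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"

    sql = (
        "SELECT filename, location, text, "
        # <=> is cosine distance, so 1 - distance gives cosine similarity.
        "1 - (embedding <=> %s::vector) AS score "
        f"FROM {table_name} "
        "ORDER BY embedding <=> %s::vector "
        "LIMIT %s"
    )
    return sql, (vector_literal, vector_literal, top_k)


# Placeholder table name and a toy two-dimensional vector, for illustration only.
sql, params = build_search_query("your_embeddings_table", [0.1, 0.2], top_k=3)
```

The returned pair can be handed to a database cursor's execute call; ordering by distance ascending puts the most similar titles and abstract chunks first.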
 
 
Querying is left to the target store for two reasons:
- Many databases already have optimized query implementations with their own best practices.
- The query space has excellent solutions for querying, reranking, and other search-related functionality.

If you need help with writing queries, please feel free to reach out to us on Discord.

Support Us

We are constantly improving, and more features and examples are coming soon. If this article is helpful for you, please give us a star ⭐ at GitHub to help us grow.

Thanks for reading!
