In this post we walk through a comprehensive example of indexing research papers: extracting different kinds of metadata (beyond full-text chunking and embedding) and building semantic embeddings for indexing and querying. If you find this tutorial helpful, we would greatly appreciate it if you could ⭐ star CocoIndex on GitHub.

Use cases

Academic search and retrieval, as well as research-based AI agents
Paper recommendation systems
Research knowledge graphs
Semantic analysis of scientific literature

What we will achieve

Let's take a look at this PDF as an example. Here is what we want to achieve:
 
 
 
 
 
 Extract the paper metadata, including file name, title, author information, abstract, and number of pages. 
 
 
Build vector embeddings for the metadata, such as the title and abstract, for semantic search. This enables better metadata-driven semantic search results. For example, you can match text queries against titles and abstracts.
 
 
 
 
 
 Build an index of authors and all the file names associated with each author to answer questions like "Give me all the papers by Jeff Dean." 
 
If you want to perform full PDF embedding for the paper, you can also refer to this article. You can find the full code here. If this article is helpful to you, please give us a star ⭐ at GitHub to help us grow.

Core components
 
 
 
 
 
 
 
 PDF Preprocessing 
 
Reads PDFs using pypdf and extracts:

Total number of pages
First page content (used as a proxy for metadata-rich information)
 
 Markdown Conversion 
 
Converts the first page to Markdown using Marker.
 
 LLM-Powered Metadata Extraction 
 
 
Sends the first-page Markdown to GPT-4o using CocoIndex's ExtractByLlm function. Extracted metadata includes the following, and more:

title (string)
authors (with name, email, and affiliation)
abstract (string)
 
 Semantic Embedding 
 
 
The title is embedded directly using the all-MiniLM-L6-v2 model via SentenceTransformer.
Abstracts are chunked based on semantic punctuation and token count, then each chunk is embedded individually.
 
 Relational Data Collection 
 
Authors are unrolled and collected into an author_papers relation, enabling queries like:

Show all papers by X
Which co-authors worked with Y?

Prerequisites
 
 
 
 
 
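Install PostgreSQL. CocoIndex uses PostgreSQL internally for incremental processing.

Configure your OpenAI API key. Alternatively, we have native support for Gemini, Ollama, and LiteLLM; check out the guide. You can choose your favorite LLM provider and work completely on-premises.

Define the indexing flow

This project demonstrates a slightly more comprehensive example of metadata understanding, closer to real-world use cases. You will see how easy it is to achieve this design with CocoIndex in about 100 lines of indexing logic (see the code).

To help guide you through what we are about to walk through, here is a flow diagram.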
 
 
 
Import a list of papers in PDF.
For each file:

Extract the first page of the paper.
Convert the first page to Markdown.
Extract metadata (title, authors, abstract) from the first page.
Split the abstract into chunks, and compute embeddings for each chunk.

Export to the following tables in Postgres with PGVector:

Metadata (title, authors, abstract) for each paper.
Author-to-paper mapping, for author-based query.
Embeddings for titles and abstract chunks, for semantic search.

Let's zoom in on the steps.

Import the papers
import datetime

import cocoindex


@cocoindex.flow_def(name="PaperMetadata")
def paper_metadata_flow(
    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
) -> None:
    # Read PDFs from the local "papers" directory as binary content.
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="papers", binary=True),
        refresh_interval=datetime.timedelta(seconds=10),
    )
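flow_builder.add_source will create a table with sub fields (filename, content). We can refer to the documentation for more details.

Extract and collect metadata

Extract the first page for basic info

Define a custom function to extract the first page and the number of pages of the PDF.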
import dataclasses
import io

from pypdf import PdfReader, PdfWriter


@dataclasses.dataclass
class PaperBasicInfo:
    num_pages: int
    first_page: bytes


@cocoindex.op.function()
def extract_basic_info(content: bytes) -> PaperBasicInfo:
    """Extract the page count and the first page of a PDF."""
    reader = PdfReader(io.BytesIO(content))

    # Write only the first page into a new in-memory PDF.
    output = io.BytesIO()
    writer = PdfWriter()
    writer.add_page(reader.pages[0])
    writer.write(output)

    return PaperBasicInfo(num_pages=len(reader.pages), first_page=output.getvalue())

Now plug this into your flow. We extract metadata from the first page only, to minimize processing cost.
with data_scope["documents"].row() as doc:
    doc["basic_info"] = doc["content"].transform(extract_basic_info)
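After this step, you should have the basic info of each paper.

Parse basic info

We will convert the first page to Markdown using Marker. Alternatively, you can easily plug in your favorite PDF parser, such as Docling.

Define a Marker converter function and cache it, since its initialization is resource-intensive. This ensures that the same converter instance is reused for different input files.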
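from functools import cache

from marker.config.parser import ConfigParser
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict


@cache
def get_marker_converter() -> PdfConverter:
    config_parser = ConfigParser({})
    return PdfConverter(
        create_model_dict(), config=config_parser.generate_config_dict()
    )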
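The effect of caching a resource-heavy constructor can be seen in isolation with a small stand-in. This is a minimal sketch using only the standard library; `ExpensiveConverter` and `get_converter` are hypothetical names introduced for illustration and are not part of Marker or CocoIndex.

```python
from functools import cache


class ExpensiveConverter:
    """Hypothetical stand-in for a converter that is costly to construct."""

    instances_created = 0

    def __init__(self) -> None:
        # Count constructions so we can observe how often init really runs.
        ExpensiveConverter.instances_created += 1

    def __call__(self, path: str) -> str:
        return f"converted:{path}"


@cache
def get_converter() -> ExpensiveConverter:
    # With no arguments, functools.cache stores a single result,
    # so the constructor runs only on the first call.
    return ExpensiveConverter()


first = get_converter()
second = get_converter()
results = [get_converter()(name) for name in ("a.pdf", "b.pdf")]
```

Every call after the first returns the same object, which is why the converter can be requested once per input file without paying the initialization cost again.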
Plug it into a custom function.
import tempfile

from marker.output import text_from_rendered


@cocoindex.op.function(gpu=True, cache=True, behavior_version=1)
def pdf_to_markdown(content: bytes) -> str:
    """Convert PDF bytes to Markdown."""

    # The converter is called with a file path, so write the bytes to a temporary file first.
    with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as temp_file:
        temp_file.write(content)
        temp_file.flush()
        text, _, _ = text_from_rendered(get_marker_converter()(temp_file.name))
        return text
Pass it to your transform.
with data_scope["documents"].row() as doc:
    doc["first_page_md"] = doc["basic_info"]["first_page"].transform(
        pdf_to_markdown
    )
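After this step, you should have the first page of each paper in Markdown format.

Extract basic info with LLM

CocoIndex natively supports LLM structured extraction with complex and nested schemas. If you want to learn more about nested schemas, refer to this article.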
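@dataclasses.dataclass
class Author:
    """Details of a paper author."""

    name: str
    # Optional types are an assumption; a first page may omit these.
    email: str | None
    affiliation: str | None


@dataclasses.dataclass
class PaperMetadata:
    """Metadata for a paper."""

    title: str
    authors: list[Author]
    abstract: str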

Plug it into the ExtractByLlm function. With a dataclass defined, CocoIndex will automatically parse the LLM response into the dataclass.
doc["metadata"] = doc["first_page_md"].transform(
    cocoindex.functions.ExtractByLlm(
        llm_spec=cocoindex.LlmSpec(
            api_type=cocoindex.LlmApiType.OPENAI, model="gpt-4o"
        ),
        output_type=PaperMetadata,
        instruction="Please extract the metadata from the first page of the paper.",
    )
)
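To see what mapping a structured response onto nested dataclasses amounts to, here is a minimal standalone sketch. It does not call CocoIndex or an LLM: the JSON string is a made-up response, `parse_metadata` is a hypothetical helper, and the optional `email` and `affiliation` fields are an assumption. It is not a description of how ExtractByLlm is implemented.

```python
import dataclasses
import json
from typing import Optional


@dataclasses.dataclass
class Author:
    name: str
    email: Optional[str] = None        # assumed optional
    affiliation: Optional[str] = None  # assumed optional


@dataclasses.dataclass
class PaperMetadata:
    title: str
    authors: list[Author]
    abstract: str


def parse_metadata(raw_json: str) -> PaperMetadata:
    """Hypothetical helper: map a JSON object onto the nested dataclasses."""
    data = json.loads(raw_json)
    authors = [Author(**author) for author in data["authors"]]
    return PaperMetadata(
        title=data["title"], authors=authors, abstract=data["abstract"]
    )


# Made-up response text, for illustration only.
sample_response = """
{
  "title": "An Example Paper",
  "authors": [
    {"name": "Ada Example", "email": "ada@example.org", "affiliation": "Example Lab"},
    {"name": "Bob Sample"}
  ],
  "abstract": "We study an example problem."
}
"""

metadata = parse_metadata(sample_response)
```

The nested `authors` list is what later makes it possible to iterate over authors one by one and collect an author-to-paper relation.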
After this step, you should have the metadata of each paper.

Collect paper metadata
paper_metadata = data_scope.add_collector()

with data_scope["documents"].row() as doc:
    # ... process
    # Collect metadata
    paper_metadata.collect(
        filename=doc["filename"],
        title=doc["metadata"]["title"],
        authors=doc["metadata"]["authors"],
        abstract=doc["metadata"]["abstract"],
        num_pages=doc["basic_info"]["num_pages"],
    )
Just collect anything you need :)

Collect author-to-filename information

Here we collect Author → Papers into a separate table to build a lookup functionality. Simply collect by author.
author_papers = data_scope.add_collector()

with data_scope["documents"].row() as doc:
    with doc["metadata"]["authors"].row() as author:
        author_papers.collect(
            author_name=author["name"],
            filename=doc["filename"],
        )
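The two example questions from the component overview ("Show all papers by X" and "Which co-authors worked with Y?") map onto simple SQL over this relation. The sketch below uses an in-memory SQLite table with the same two columns (author_name, filename) and made-up rows, so it runs without Postgres. The table that CocoIndex creates in Postgres may be named differently, so check your database before adapting the queries; `papers_by` and `coauthors_of` are hypothetical helper names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE author_papers ("
    " author_name TEXT, filename TEXT,"
    " PRIMARY KEY (author_name, filename))"
)
# Made-up rows for illustration.
conn.executemany(
    "INSERT INTO author_papers VALUES (?, ?)",
    [
        ("Ada Example", "paper_a.pdf"),
        ("Bob Sample", "paper_a.pdf"),
        ("Ada Example", "paper_b.pdf"),
        ("Cy Demo", "paper_b.pdf"),
    ],
)


def papers_by(author: str) -> list[str]:
    """Return all filenames associated with one author."""
    rows = conn.execute(
        "SELECT filename FROM author_papers"
        " WHERE author_name = ? ORDER BY filename",
        (author,),
    )
    return [filename for (filename,) in rows]


def coauthors_of(author: str) -> list[str]:
    """Return other authors who share at least one paper with the given author."""
    # Self-join on filename pairs each author with everyone on the same paper.
    rows = conn.execute(
        "SELECT DISTINCT other.author_name"
        " FROM author_papers AS me"
        " JOIN author_papers AS other ON me.filename = other.filename"
        " WHERE me.author_name = ? AND other.author_name <> ?"
        " ORDER BY other.author_name",
        (author, author),
    )
    return [name for (name,) in rows]
```

Keeping one row per (author_name, filename) pair is what makes both lookups a plain filter or self-join, with no need to unpack the nested author list at query time.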
Compute and collect embeddings

Title
doc["title_embedding"] = doc["metadata"]["title"].transform(
    cocoindex.functions.SentenceTransformerEmbed(
        model="sentence-transformers/all-MiniLM-L6-v2"
    )
)
Abstract

Split the abstract into chunks, embed each chunk, and collect their embeddings. Sometimes the abstract can be very long.
doc["abstract_chunks"] = doc["metadata"]["abstract"].transform(
    cocoindex.functions.SplitRecursively(
        custom_languages=[
            cocoindex.functions.CustomLanguageSpec(
                language_name="abstract",
                separators_regex=[r"[.?!]+\s+", r"[:;]\s+", r",\s+", r"\s+"],
            )
        ]
    ),
    language="abstract",
    chunk_size=500,
    min_chunk_size=200,
    chunk_overlap=150,
)
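To build intuition for the ordered separator list, the following is a simplified sketch of separator-priority splitting: try sentence boundaries first, and fall back to finer separators only for pieces that are still too long. It is not CocoIndex's SplitRecursively implementation. It ignores min_chunk_size and chunk_overlap, measures size in characters, and `split_by_priority` is a hypothetical name.

```python
import re


def split_by_priority(text: str, max_chars: int, separators: list[str]) -> list[str]:
    """Split text at the coarsest separators that keep chunks within max_chars."""
    if len(text) <= max_chars:
        return [text]
    if not separators:
        # No separators left: fall back to fixed-width slices.
        return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]

    # Cut right after each separator match so punctuation stays with its sentence.
    pieces, start = [], 0
    for match in re.finditer(separators[0], text):
        pieces.append(text[start : match.end()])
        start = match.end()
    if start < len(text):
        pieces.append(text[start:])

    chunks, current = [], ""
    for piece in pieces:
        if len(piece) > max_chars:
            # Piece is still too long: flush, then retry with finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(split_by_priority(piece, max_chars, separators[1:]))
        elif len(current) + len(piece) <= max_chars:
            current += piece  # Greedily pack pieces into the current chunk.
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


SEPARATORS = [r"[.?!]+\s+", r"[:;]\s+", r",\s+", r"\s+"]
sample = "We propose a method. It is fast, simple, and accurate. Results follow."
chunks = split_by_priority(sample, max_chars=30, separators=SEPARATORS)
```

In this sample the middle sentence exceeds the limit, so only that sentence is split further at commas while the other sentences stay whole.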
After this step, you should have the abstract chunks of each paper. Embed each chunk and collect their embeddings.
with doc["abstract_chunks"].row() as chunk:
    chunk["embedding"] = chunk["text"].transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"
        )
    )
After this step, you should have the embeddings of the abstract chunks of each paper.

Collect embeddings
metadata_embeddings = data_scope.add_collector()

with data_scope["documents"].row() as doc:
    # ... process
    # collect title embedding
    metadata_embeddings.collect(
        id=cocoindex.GeneratedField.UUID,
        filename=doc["filename"],
        location="title",
        text=doc["metadata"]["title"],
        embedding=doc["title_embedding"],
    )
    with doc["abstract_chunks"].row() as chunk:
        # ... process
        # collect abstract chunks embeddings
        metadata_embeddings.collect(
            id=cocoindex.GeneratedField.UUID,
            filename=doc["filename"],
            location="abstract",
            text=chunk["text"],
            embedding=chunk["embedding"],
        )
Export

Finally, we export the data to Postgres.
paper_metadata.export(
    "paper_metadata",
    cocoindex.targets.Postgres(),
    primary_key_fields=["filename"],
)
author_papers.export(
    "author_papers",
    cocoindex.targets.Postgres(),
    primary_key_fields=["author_name", "filename"],
)
metadata_embeddings.export(
    "metadata_embeddings",
    cocoindex.targets.Postgres(),
    primary_key_fields=["id"],
    vector_indexes=[
        cocoindex.VectorIndexDef(
            field_name="embedding",
            metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
        )
    ],
)
In this example we use PGVector as the embedding store. With CocoIndex, you can switch to other supported vector databases, such as Qdrant, with a one-line change; see the guide for more details. We aim to standardize interfaces and make it like building with LEGO.

View in CocoInsight step by step

You can walk through the project step by step in CocoInsight to see exactly how each field is constructed and what happens behind the scenes.

Query the index

You can refer to this section of Text Embeddings about how to build queries against embeddings. For now, CocoIndex does not provide an additional query interface. We can write SQL or rely on the query engine of the target storage.
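Since querying is left to the target store, it helps to see what a cosine-similarity lookup over the exported rows computes. The sketch below ranks a few made-up rows in plain Python; the rows mimic the filename, location, and embedding fields collected earlier, the vectors are toy values, and `cosine_similarity` and `top_k` are hypothetical helpers rather than CocoIndex APIs.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], rows: list[dict], k: int = 2) -> list[tuple]:
    """Rank rows (dicts with an 'embedding' key) by similarity to the query."""
    scored = [(cosine_similarity(query_vec, row["embedding"]), row) for row in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        (round(score, 3), row["filename"], row["location"])
        for score, row in scored[:k]
    ]


# Toy 3-dimensional vectors; real all-MiniLM-L6-v2 embeddings have 384 dimensions.
rows = [
    {"filename": "paper_a.pdf", "location": "title", "embedding": [1.0, 0.0, 0.0]},
    {"filename": "paper_a.pdf", "location": "abstract", "embedding": [0.6, 0.8, 0.0]},
    {"filename": "paper_b.pdf", "location": "title", "embedding": [0.0, 0.0, 1.0]},
]
hits = top_k([1.0, 0.0, 0.0], rows)
```

In Postgres with pgvector, the same ranking is usually expressed in SQL by ordering on the `<=>` cosine-distance operator, with the query text embedded by the same model that was used at indexing time.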
 
 
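Many databases already have optimized query implementations with their own best practices.
The query space has excellent solutions for querying, reranking, and other search-related functionality.

If you need help writing a query, feel free to reach out to us on Discord.

Support us

We are constantly improving, and more features and examples are coming soon. If this article is helpful to you, please give us a star ⭐ at GitHub to help us grow.

Thanks for reading!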

Turn Your PDF Library into a Searchable Research Database with 100 Lines of Code
