In this blog, we walk through a comprehensive example of indexing research papers by extracting different kinds of metadata, going beyond full-text chunking and embedding, and building semantic embeddings for indexing and querying.

We would greatly appreciate it if you could ⭐ star CocoIndex on GitHub if you find this tutorial helpful.

Use Cases

- Academic search and retrieval, as well as research-based AI agents
- Paper recommendation systems
- Research knowledge graphs
- Semantic analysis of scientific literature

What we will achieve

Let's take a look at this PDF as an example. Here's what we want to accomplish:
 
 
 
 
 
 Extract the paper metadata, including file name, title, author information, abstract, and number of pages. 
 
 
 Build vector embeddings for the metadata, such as the title and abstract, for semantic search. This enables better metadata-driven semantic search results. For example, you can match text queries against titles and abstracts.
 
 
 
 
 
 Build an index of authors and all the file names associated with each author to answer questions like "Give me all the papers by Jeff Dean." 
 
 If you want to perform full PDF embedding for the paper, you can also refer to this article.

You can find the full code here. If this article is helpful for you, please give us a star ⭐ at GitHub to help us grow.

Core Components
 
 
 
 
 
 
 
1. PDF Preprocessing

   Reads PDFs using pypdf and extracts:
   - Total number of pages
   - First page content (used as a proxy for metadata-rich information)

2. Markdown Conversion

   Converts the first page to Markdown using Marker.

3. LLM-Powered Metadata Extraction

   Sends the first-page Markdown to GPT-4o using CocoIndex's ExtractByLlm function. Extracted metadata includes, among other fields:
   - title (string)
   - authors (with name, email, and affiliation)
   - abstract (string)

4. Semantic Embedding

   - The title is embedded directly using the all-MiniLM-L6-v2 SentenceTransformer model.
   - Abstracts are chunked based on semantic punctuation and token count, then each chunk is embedded individually.

5. Relational Data Collection

   Authors are unrolled and collected into an author_papers relation, enabling queries like:
   - Show all papers by X
   - Which co-authors worked with Y?

Prerequisites
 
 
 
 
 
- Install PostgreSQL. CocoIndex uses PostgreSQL internally for incremental processing.
- Configure your OpenAI API key. Alternatively, we have native support for Gemini, Ollama, and LiteLLM; check out the guide. You can choose your favorite LLM provider and run completely on-premises.

Define Indexing Flow

This project demonstrates a slightly more comprehensive example of metadata understanding, closer to real-world use cases. You will see how easy it is to achieve this design with CocoIndex within 100 lines of indexing logic (see the code). To help you navigate what we will walk through, here is the flow.

1. Import a list of papers in PDF.
2. For each file:
   - Extract the first page of the paper.
   - Convert the first page to Markdown.
   - Extract metadata (title, authors, abstract) from the first page.
   - Split the abstract into chunks, and compute embeddings for each chunk.
3. Export to the following tables in Postgres with PGVector:
   - Metadata (title, authors, abstract) for each paper.
   - Author-to-paper mapping, for author-based queries.
   - Embeddings for titles and abstract chunks, for semantic search.

Let's zoom in on the steps.

Import the Papers

import datetime

import cocoindex


@cocoindex.flow_def(name="PaperMetadata")
def paper_metadata_flow(
    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
) -> None:
    # Watch the local "papers" directory and re-scan it every 10 seconds.
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="papers", binary=True),
        refresh_interval=datetime.timedelta(seconds=10),
    )

flow_builder.add_source creates a table with the sub-fields filename and content. Refer to the documentation for more details.

Extract and collect metadata

Extract first page for basic info

Define a custom function to extract the first page and the number of pages of the PDF.

import dataclasses
import io

from pypdf import PdfReader, PdfWriter


@dataclasses.dataclass
class PaperBasicInfo:
    num_pages: int
    first_page: bytes


@cocoindex.op.function()
def extract_basic_info(content: bytes) -> PaperBasicInfo:
    """Extract the first page and the page count of a PDF."""
    reader = PdfReader(io.BytesIO(content))

    # Write only the first page into a new in-memory PDF.
    output = io.BytesIO()
    writer = PdfWriter()
    writer.add_page(reader.pages[0])
    writer.write(output)

    return PaperBasicInfo(num_pages=len(reader.pages), first_page=output.getvalue())


Now plug this into your flow. We extract metadata from the first page only, to keep processing cost low, since the entire PDF can be very large.

with data_scope["documents"].row() as doc:
    doc["basic_info"] = doc["content"].transform(extract_basic_info)

After this step, you should have the basic info of each paper.

Parse basic info

We convert the first page to Markdown using Marker. Alternatively, you can easily plug in your favorite PDF parser, such as Docling.

Define a Marker converter function and cache it, since its initialization is resource-intensive. This ensures that the same converter instance is reused for different input files.

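from functools import cache

from marker.config.parser import ConfigParser
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict


@cache
def get_marker_converter() -> PdfConverter:
    # Built once per process; later calls return the cached instance.
    config_parser = ConfigParser({})
    return PdfConverter(
        create_model_dict(), config=config_parser.generate_config_dict()
    )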
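
To illustrate why caching helps here, the following is a minimal sketch using a stand-in class rather than Marker itself. `ExpensiveConverter` and `get_converter` are hypothetical names invented for this illustration; only `functools.cache` is the same mechanism used above.

```python
from functools import cache


class ExpensiveConverter:
    """Hypothetical stand-in for a heavyweight object such as a PDF converter."""

    instances_created = 0

    def __init__(self) -> None:
        # Count constructions so the reuse is observable.
        ExpensiveConverter.instances_created += 1

    def __call__(self, path: str) -> str:
        return f"converted:{path}"


@cache
def get_converter() -> ExpensiveConverter:
    # The body runs only on the first call; later calls return the cached object.
    return ExpensiveConverter()


first = get_converter()
second = get_converter()
print(first is second, ExpensiveConverter.instances_created)
```

Both calls return the same object, so the costly setup happens once per process no matter how many files are processed.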

Plug the converter into a custom function.

import tempfile

from marker.output import text_from_rendered


@cocoindex.op.function(gpu=True, cache=True, behavior_version=1)
def pdf_to_markdown(content: bytes) -> str:
    """Convert to Markdown."""
    # Marker reads from a file path, so write the bytes to a temporary PDF first.
    with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as temp_file:
        temp_file.write(content)
        temp_file.flush()
        text, _, _ = text_from_rendered(get_marker_converter()(temp_file.name))
        return text

Pass it to your transform.

with data_scope["documents"].row() as doc:
    doc["first_page_md"] = doc["basic_info"]["first_page"].transform(
        pdf_to_markdown
    )

After this step, you should have the first page of each paper in Markdown format.

Extract basic info with LLM

CocoIndex provides built-in support for LLM-based structured extraction with complex and nested schemas. If you are interested in learning more about nested schemas, you can refer to this article.

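# Assumed definition: the article lists name, email, and affiliation for each
# author but does not show this class; the optional fields are a guess.
@dataclasses.dataclass
class Author:
    name: str
    email: str | None
    affiliation: str | None


@dataclasses.dataclass
class PaperMetadata:
    """Metadata for a paper."""

    title: str
    authors: list[Author]
    abstract: str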


Plug it into the ExtractByLlm function. With a dataclass defined, CocoIndex automatically parses the LLM response into the dataclass.

doc["metadata"] = doc["first_page_md"].transform(
    cocoindex.functions.ExtractByLlm(
        llm_spec=cocoindex.LlmSpec(
            api_type=cocoindex.LlmApiType.OPENAI, model="gpt-4o"
        ),
        output_type=PaperMetadata,
        instruction="Please extract the metadata from the first page of the paper.",
    )
)
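
As a rough illustration of what parsing a structured LLM response into a dataclass involves, here is a small standalone sketch. It is not CocoIndex's implementation. The `parse_paper_metadata` helper, the `Author` field defaults, and the sample JSON (a made-up paper and made-up authors) are all assumptions for illustration.

```python
import dataclasses
import json
from typing import Optional


@dataclasses.dataclass
class Author:
    name: str
    email: Optional[str] = None
    affiliation: Optional[str] = None


@dataclasses.dataclass
class PaperMetadata:
    title: str
    authors: list[Author]
    abstract: str


def parse_paper_metadata(raw_json: str) -> PaperMetadata:
    """Map a JSON object onto the nested dataclasses (hypothetical helper)."""
    data = json.loads(raw_json)
    authors = [
        Author(
            name=item["name"],
            email=item.get("email"),
            affiliation=item.get("affiliation"),
        )
        for item in data["authors"]
    ]
    return PaperMetadata(
        title=data["title"], authors=authors, abstract=data["abstract"]
    )


# Made-up response text standing in for what an LLM might return.
sample_response = json.dumps(
    {
        "title": "An Example Paper",
        "authors": [
            {"name": "Ada Example", "email": "ada@example.org"},
            {"name": "Bob Sample", "affiliation": "Example University"},
        ],
        "abstract": "A short abstract.",
    }
)

parsed = parse_paper_metadata(sample_response)
print(parsed.title, [author.name for author in parsed.authors])
```

The nested `authors` list is the part that benefits most from a typed schema, because each entry becomes an `Author` object that later steps can iterate over.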

After this step, you should have the metadata of each paper.

Collect paper metadata

paper_metadata = data_scope.add_collector()

with data_scope["documents"].row() as doc:
    # ... process
    # Collect metadata
    paper_metadata.collect(
        filename=doc["filename"],
        title=doc["metadata"]["title"],
        authors=doc["metadata"]["authors"],
        abstract=doc["metadata"]["abstract"],
        num_pages=doc["basic_info"]["num_pages"],
    )

Just collect anything you need :)

Collect author to filename information

We have already extracted the author list. Here we collect author → papers in a separate table to build lookup functionality. Simply collect by author.

author_papers = data_scope.add_collector()

with data_scope["documents"].row() as doc:
    with doc["metadata"]["authors"].row() as author:
        author_papers.collect(
            author_name=author["name"],
            filename=doc["filename"],
        )
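
To make the shape of the author_papers relation concrete, here is a plain-Python sketch of the same unrolling, along with the two lookups it enables. The sample papers, author names, and helper functions are invented for illustration. In the real flow the rows live in Postgres and would be queried with SQL.

```python
# Hypothetical sample data: one entry per paper, with its author names.
papers = [
    {"filename": "a.pdf", "authors": ["Alice", "Bob"]},
    {"filename": "b.pdf", "authors": ["Alice", "Carol"]},
    {"filename": "c.pdf", "authors": ["Dave"]},
]


def unroll_author_papers(items: list[dict]) -> list[tuple[str, str]]:
    """Emit one (author_name, filename) row per author of each paper."""
    return [(name, item["filename"]) for item in items for name in item["authors"]]


def papers_by(rows: list[tuple[str, str]], author: str) -> list[str]:
    """Answer 'show all papers by X'."""
    return sorted(filename for name, filename in rows if name == author)


def coauthors_of(rows: list[tuple[str, str]], author: str) -> list[str]:
    """Answer 'which co-authors worked with Y?'."""
    own_files = {filename for name, filename in rows if name == author}
    return sorted(
        {name for name, filename in rows if filename in own_files and name != author}
    )


rows = unroll_author_papers(papers)
print(papers_by(rows, "Alice"))
print(coauthors_of(rows, "Alice"))
```

Because each row pairs a single author with a single file, (author_name, filename) is a natural composite key for the table.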

Compute and collect embeddings

Title

doc["title_embedding"] = doc["metadata"]["title"].transform(
    cocoindex.functions.SentenceTransformerEmbed(
        model="sentence-transformers/all-MiniLM-L6-v2"
    )
)

Abstract

Split the abstract into chunks, embed each chunk, and collect their embeddings. The abstract can sometimes be very long.

doc["abstract_chunks"] = doc["metadata"]["abstract"].transform(
    cocoindex.functions.SplitRecursively(
        custom_languages=[
            cocoindex.functions.CustomLanguageSpec(
                language_name="abstract",
                separators_regex=[r"[.?!]+\s+", r"[:;]\s+", r",\s+", r"\s+"],
            )
        ]
    ),
    language="abstract",
    chunk_size=500,
    min_chunk_size=200,
    chunk_overlap=150,
)
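
The separator list above is ordered from coarse (sentence ends) to fine (whitespace). The sketch below shows the general idea of recursive splitting with such a list. It is a simplified illustration and not CocoIndex's SplitRecursively implementation: it measures size in characters, and it ignores min_chunk_size and chunk_overlap. The function names and sample text are hypothetical.

```python
import re

# Same coarse-to-fine separator order as in the flow above.
SEPARATORS = [r"[.?!]+\s+", r"[:;]\s+", r",\s+", r"\s+"]


def split_keep_separator(pattern: str, text: str) -> list[str]:
    """Cut after each separator match so no characters are lost."""
    pieces, start = [], 0
    for match in re.finditer(pattern, text):
        pieces.append(text[start:match.end()])
        start = match.end()
    if start < len(text):
        pieces.append(text[start:])
    return pieces


def split_recursively(
    text: str, chunk_size: int, separators: list[str] = SEPARATORS
) -> list[str]:
    """Split text into chunks of at most chunk_size characters where possible."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= chunk_size or not separators:
        return [text]  # Small enough, or nothing finer left to split on.
    head, rest = separators[0], separators[1:]
    pieces = split_keep_separator(head, text)
    if len(pieces) == 1:
        return split_recursively(text, chunk_size, rest)
    chunks, current = [], ""
    for piece in pieces:
        # Greedily pack neighbouring pieces while they still fit.
        if len((current + piece).strip()) <= chunk_size:
            current += piece
            continue
        if current.strip():
            chunks.append(current.strip())
        if len(piece.strip()) <= chunk_size:
            current = piece
        else:
            # A single piece is too long: retry it with finer separators.
            chunks.extend(split_recursively(piece, chunk_size, rest))
            current = ""
    if current.strip():
        chunks.append(current.strip())
    return chunks


sample = "Alpha beta gamma. Delta epsilon; zeta eta, theta. Iota kappa lambda mu."
chunks = split_recursively(sample, chunk_size=30)
print(chunks)
```

Splitting on sentence boundaries first tends to keep each chunk semantically coherent, which generally helps the embedding of each chunk stay meaningful.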

After this step, you should have the abstract chunks of each paper.

Embed each chunk and collect their embeddings.

with doc["abstract_chunks"].row() as chunk:
    chunk["embedding"] = chunk["text"].transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"
        )
    )

After this step, you should have the embeddings of the abstract chunks of each paper.

Collect embeddings

metadata_embeddings = data_scope.add_collector()

with data_scope["documents"].row() as doc:
    # ... process
    # collect title embedding
    metadata_embeddings.collect(
        id=cocoindex.GeneratedField.UUID,
        filename=doc["filename"],
        location="title",
        text=doc["metadata"]["title"],
        embedding=doc["title_embedding"],
    )
    with doc["abstract_chunks"].row() as chunk:
        # ... process
        # collect abstract chunks embeddings
        metadata_embeddings.collect(
            id=cocoindex.GeneratedField.UUID,
            filename=doc["filename"],
            location="abstract",
            text=chunk["text"],
            embedding=chunk["embedding"],
        )

Export

Finally, we export the data to Postgres.

paper_metadata.export(
    "paper_metadata",
    cocoindex.targets.Postgres(),
    primary_key_fields=["filename"],
)

author_papers.export(
    "author_papers",
    cocoindex.targets.Postgres(),
    primary_key_fields=["author_name", "filename"],
)

metadata_embeddings.export(
    "metadata_embeddings",
    cocoindex.targets.Postgres(),
    primary_key_fields=["id"],
    vector_indexes=[
        cocoindex.VectorIndexDef(
            field_name="embedding",
            metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
        )
    ],
)
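
In this example we use PGVector as the embedding store. With CocoIndex, you can switch to other supported vector databases such as Qdrant with a one-line change; see this guide for more details. We aim to standardize interfaces and make building pipelines feel like assembling LEGO.

View in CocoInsight step by step

You can walk through the project step by step in CocoInsight to see exactly how each field is constructed and what happens behind the scenes.

Query the index

You can refer to this section of Text Embeddings to learn how to build queries against embeddings. For now, CocoIndex doesn't provide an additional query interface. We can write SQL or rely on the query engine of the target storage.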
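
As a rough sketch of what a SQL query against the exported metadata_embeddings table might look like, the helper below builds a pgvector similarity query. The helper names are hypothetical, and the table name used in the example is an assumption about how the exported table is named in your database, so check the actual name before using it. The `<=>` operator is pgvector's cosine distance, so `1 - distance` gives a cosine similarity score.

```python
import re


def to_pgvector_literal(vector: list[float]) -> str:
    """Format a Python list as a pgvector text literal such as '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(value)) for value in vector) + "]"


def build_search_query(table_name: str, top_k: int = 5) -> str:
    """Build a top-k cosine-similarity query with two %s placeholders."""
    # Table names cannot be bound as parameters, so validate before interpolating.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table_name):
        raise ValueError(f"unsafe table name: {table_name!r}")
    return (
        "SELECT filename, location, text, "
        "1 - (embedding <=> %s::vector) AS score "
        f"FROM {table_name} "
        "ORDER BY embedding <=> %s::vector "
        f"LIMIT {int(top_k)}"
    )


# Assumed table name; the real one depends on your flow and target settings.
sql = build_search_query("papermetadata__metadata_embeddings", top_k=3)
query_vector = to_pgvector_literal([0.1, 0.2, 0.3])
print(sql)
print(query_vector)
```

With a driver such as psycopg, the same vector literal would be passed for both placeholders, and the query text would first be embedded with the same all-MiniLM-L6-v2 model used during indexing so that the vectors are comparable.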
 
 
Relying on the target store works well for two reasons:

- Many databases already have optimized query implementations with their own best practices.
- The query space has excellent solutions for querying, reranking, and other search-related functionality.

If you need assistance writing the query, please feel free to reach out to us.

Support us

We are constantly improving, and more features and examples are coming soon. If this article is helpful for you, please give us a star ⭐ at GitHub to help us grow.

Thanks!


Turn Your PDF Library into a Searchable Research Database with 100 Lines of Code
