PDFs are rich with both text and visual content — from descriptive paragraphs to illustrations and tables. This example builds an end-to-end flow that parses, embeds, and indexes both, with full traceability back to the original page.

In this example, we split out both text and images, link them back to page metadata, and enable unified semantic search. We'll use CocoIndex to define the flow, SentenceTransformers for text embeddings, and CLIP for image embeddings — all stored in Qdrant for retrieval.

## 🔍 What It Does

This flow automatically:

- Extracts page texts and images from PDF files
- Skips tiny or redundant images
- Creates consistent thumbnails (up to 512×512)
- Chunks and embeds text with SentenceTransformers (`all-MiniLM-L6-v2`)
- Embeds images with CLIP (`openai/clip-vit-large-patch14`)
- Stores both embeddings and metadata (e.g., where the text or image came from) in Qdrant, enabling unified semantic search across text and image modalities

## 📦 Prerequisite: Run Qdrant

If you don't have Qdrant running locally, start it via Docker:

```sh
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
```

## 📁 Input Data

We'll use a few sample PDFs (board game manuals). Download them into the `source_files` directory:

```sh
./fetch_manual_urls.sh
```

Or feel free to drop in any of your own PDFs.

## ⚙️ Run the Flow

Install dependencies:

```sh
pip install -e .
```

Then build your index (tables are set up automatically on the first run):

```sh
cocoindex update --setup main
```

Or run it in CocoInsight:

```sh
cocoindex server -ci main
```

## Define the flow

Let's break down what happens inside the `PdfElementsEmbedding` flow.

```python
@cocoindex.flow_def(name="PdfElementsEmbedding")
def multi_format_indexing_flow(
    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
) -> None:
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(
            path="source_files", included_patterns=["*.pdf"], binary=True
        )
    )
    text_output = data_scope.add_collector()
    image_output = data_scope.add_collector()
```

Here we define the flow, add a source, and add two data collectors. The decorator `@cocoindex.flow_def(name="PdfElementsEmbedding")` marks the function as a CocoIndex flow definition, registering it as part of the data indexing system. When executed by the CocoIndex runtime, the flow orchestrates data ingestion, transformation, and collection, running in either one-time update or live update mode and incrementally re-embedding only what has changed.
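The snippets below also reference a few module-level names: `IMG_THUMBNAIL_SIZE`, the Qdrant collection names, and a `qdrant_connection` object. They live in the example's source file rather than in this post; the following is a minimal sketch of what they could look like, where the concrete values and collection names are assumptions rather than copies from the repository:

```python
# Hypothetical module-level setup; the names follow the snippets in this post,
# but the concrete values are assumptions -- check the example repository.
IMG_THUMBNAIL_SIZE = (512, 512)  # thumbnails are capped at 512x512

QDRANT_COLLECTION_TEXT = "PdfTextEmbeddings"    # assumed collection name
QDRANT_COLLECTION_IMAGE = "PdfImageEmbeddings"  # assumed collection name

# `qdrant_connection` points the Qdrant export targets (shown later) at the
# local instance started via Docker above; how it is constructed depends on
# your CocoIndex version, so refer to the example repository for the helper.
```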
## Process Each Document

### Extract PDF documents

We iterate over each document row and run a custom transformation that extracts the PDF elements.

```python
with data_scope["documents"].row() as doc:
    doc["pages"] = doc["content"].transform(extract_pdf_elements)
```

### Extract PDF Elements

Define dataclasses for structured extraction — we want to extract a list of `PdfPage`, each with a page number, text, and a list of images.

```python
from dataclasses import dataclass

@dataclass
class PdfImage:
    name: str
    data: bytes

@dataclass
class PdfPage:
    page_number: int
    text: str
    images: list[PdfImage]
```

Next, define a CocoIndex function called `extract_pdf_elements` that extracts both text and images from a PDF file, returning them as structured, page-wise data objects.

```python
import io

from pypdf import PdfReader

@cocoindex.op.function()
def extract_pdf_elements(content: bytes) -> list[PdfPage]:
    """
    Extract texts and images from a PDF file.
    """
    reader = PdfReader(io.BytesIO(content))
    result = []
    for i, page in enumerate(reader.pages):
        text = page.extract_text()
        images = []
        for image in page.images:
            img = image.image
            if img is None:
                continue
            # Skip very small images.
            if img.width < 16 or img.height < 16:
                continue
            thumbnail = io.BytesIO()
            img.thumbnail(IMG_THUMBNAIL_SIZE)
            img.save(thumbnail, img.format or "PNG")
            images.append(PdfImage(name=image.name, data=thumbnail.getvalue()))
        result.append(PdfPage(page_number=i + 1, text=text, images=images))
    return result
```

The `extract_pdf_elements` function reads a PDF file from bytes and extracts both text and images from each page in a structured way. Using pypdf, it parses every page to retrieve text content and any embedded images, skipping empty or very small ones to avoid noise.

Each image is resized to a consistent thumbnail size (up to 512×512) and converted into bytes for downstream use. The result is a clean, per-page data structure (`PdfPage`) that contains the page number, extracted text, and processed images — making it easy to embed and index PDFs for multimodal search.
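The per-page steps in the next section call two helpers, `embed_text` and `clip_embed_image`, whose definitions are not shown in the flow snippets. Here is a hedged sketch of what they might look like, assuming CocoIndex's built-in `SentenceTransformerEmbed` function for text and Hugging Face transformers' CLIP implementation for images; the exact signatures, return types, and any caching or GPU options are in the example repository:

```python
import functools
import io

import cocoindex
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

CLIP_MODEL_NAME = "openai/clip-vit-large-patch14"  # model named in this post


@cocoindex.transform_flow()
def embed_text(
    text: cocoindex.DataSlice[str],
) -> cocoindex.DataSlice[list[float]]:
    # Shared embedding logic: defining it as a transform flow lets the same
    # code embed queries later, so queries and chunks share one vector space.
    return text.transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"
        )
    )


@functools.cache
def _clip():
    # Load CLIP once and reuse it across calls.
    model = CLIPModel.from_pretrained(CLIP_MODEL_NAME)
    processor = CLIPProcessor.from_pretrained(CLIP_MODEL_NAME)
    return model, processor


@cocoindex.op.function()
def clip_embed_image(img_bytes: bytes) -> list[float]:
    # Decode the stored thumbnail bytes and embed them with CLIP's image encoder.
    # The real example may declare a fixed-dimension vector return type instead.
    model, processor = _clip()
    image = Image.open(io.BytesIO(img_bytes)).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    return features[0].tolist()
```

This split matches how the flow uses them below: `embed_text` is invoked with `.call(...)` as a reusable transform flow, while `clip_embed_image` is invoked with `.transform(...)` as a regular CocoIndex op function.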
## Process Each Page

Once we have the pages, we process each one.

### Chunk the text

Take each PDF page's text and split it into smaller, overlapping chunks:

```python
with doc["pages"].row() as page:
    page["chunks"] = page["text"].transform(
        cocoindex.functions.SplitRecursively(
            custom_languages=[
                cocoindex.functions.CustomLanguageSpec(
                    language_name="text",
                    separators_regex=[
                        r"\n(\s*\n)+",
                        r"[\.!\?]\s+",
                        r"\n",
                        r"\s+",
                    ],
                )
            ]
        ),
        language="text",
        chunk_size=600,
        chunk_overlap=100,
    )
```

### Process each chunk

Embed each chunk and collect the metadata we need: every chunk carries its embedding, the original text, and a reference to the `filename` and `page` it came from.

```python
with page["chunks"].row() as chunk:
    chunk["embedding"] = chunk["text"].call(embed_text)
    text_output.collect(
        id=cocoindex.GeneratedField.UUID,
        filename=doc["filename"],
        page=page["page_number"],
        text=chunk["text"],
        embedding=chunk["embedding"],
    )
```

### Process each image

We use CLIP to embed each image and collect the data: the embedding plus the `filename` and `page_number` metadata.

```python
with page["images"].row() as image:
    image["embedding"] = image["data"].transform(clip_embed_image)
    image_output.collect(
        id=cocoindex.GeneratedField.UUID,
        filename=doc["filename"],
        page=page["page_number"],
        image_data=image["data"],
        embedding=image["embedding"],
    )
```

When we collect image outputs, we also want to preserve relevant metadata alongside the embedding. For each image, we collect not only the image embedding and binary data, but also metadata such as the original file name and page number. This association makes it possible to trace embedded images back to their source document and page during retrieval or exploration.

## Export to Qdrant

Finally, we export the collected data to the target stores:

```python
text_output.export(
    "text_embeddings",
    cocoindex.targets.Qdrant(
        connection=qdrant_connection,
        collection_name=QDRANT_COLLECTION_TEXT,
    ),
    primary_key_fields=["id"],
)
image_output.export(
    "image_embeddings",
    cocoindex.targets.Qdrant(
        connection=qdrant_connection,
        collection_name=QDRANT_COLLECTION_IMAGE,
    ),
    primary_key_fields=["id"],
)
```
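Once the flow has run, you can query the collections directly with the Qdrant client. Below is a minimal sketch of a text search, assuming the local Qdrant instance and the collection name assumed earlier; it embeds the query with the same SentenceTransformers model used for indexing. If the collection was created with a named vector (e.g. `"embedding"`), pass that name along with the vector:

```python
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

# Assumed local Qdrant instance and collection name from the sketch above.
client = QdrantClient(url="http://localhost:6333")
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

query = "how do I set up the board at the start of the game?"
query_vector = model.encode(query).tolist()

# If the collection stores a named vector, use ("embedding", query_vector).
results = client.search(
    collection_name="PdfTextEmbeddings",  # assumed name; match your setup
    query_vector=query_vector,
    limit=5,
)
for point in results:
    payload = point.payload or {}
    print(point.score, payload.get("filename"), payload.get("page"))
```

Because each point carries `filename` and `page` in its payload, every hit can be traced straight back to the page it came from.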
## 🧭 Explore with CocoInsight (Free Beta)

Use CocoInsight to visually trace your data lineage and debug the flow. It connects locally, with zero data retention.

Start your local server:

```sh
cocoindex server -ci main
```

Then open the UI 👉 https://cocoindex.io/cocoinsight

## 💡 Why This Matters

Traditional document search only scratches the surface — it's text-only and often brittle across document layouts. This flow gives you multimodal recall, meaning you can:

- Search PDFs by text or image similarity (see the sketch below)
- Retrieve related figures, diagrams, or captions
- Build next-generation retrieval systems across rich content formats
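As an illustration of image similarity search, CLIP's text encoder can embed a textual query into the same space as the image embeddings, so the image collection can be searched with plain text. This is a hedged sketch under the same assumptions as the text-search example (local Qdrant, assumed collection name, possibly a named vector):

```python
import torch
from qdrant_client import QdrantClient
from transformers import CLIPModel, CLIPProcessor

CLIP_MODEL_NAME = "openai/clip-vit-large-patch14"

client = QdrantClient(url="http://localhost:6333")
model = CLIPModel.from_pretrained(CLIP_MODEL_NAME)
processor = CLIPProcessor.from_pretrained(CLIP_MODEL_NAME)

# Embed the query with CLIP's *text* encoder so it lands in the same vector
# space as the image embeddings produced by clip_embed_image.
query = "a diagram of the game board setup"
inputs = processor(text=[query], return_tensors="pt", padding=True)
with torch.no_grad():
    query_vector = model.get_text_features(**inputs)[0].tolist()

results = client.search(
    collection_name="PdfImageEmbeddings",  # assumed name; match your setup
    query_vector=query_vector,
    limit=5,
)
for point in results:
    payload = point.payload or {}
    print(point.score, payload.get("filename"), payload.get("page"))
```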
## Compare with ColPali Vision Model (OCR Free)

We also have an example for ColPali. To compare the two approaches:

| Aspect | ColPali Multi-Vector Image Grids | Separate Text and Image Embeddings |
| --- | --- | --- |
| Input | Whole page as an image grid (multi-vector patches) | Text extracted via OCR + images processed separately |
| Embedding | Multi-vector patch-level embeddings preserving spatial context | Independent text and image vectors |
| Query Matching | Late interaction between text token embeddings and image patches | Separate embedding similarity / fusion |
| Document Structure Handling | Maintains layout and visual cues implicitly | Layout structure inferred by heuristics |
| OCR Dependence | Minimal to none; the model reads text visually | Heavy dependence on OCR (for scanned PDFs) and text extraction |
| Use Case Strength | Document-heavy, visually rich formats | General image and text data, simpler layouts |
| Complexity | Higher computational cost, more complex storage | Simpler, fewer compute resources needed |

In essence, ColPali excels at deep, integrated vision-text understanding, which makes it ideal for visually complex documents, while splitting text and images into separate embeddings is more modular but prone to losing spatial and semantic cohesion. The choice depends on your document complexity, precision needs, and resource constraints.

## Support us

⭐ Star CocoIndex on GitHub and share it with your community if you find it useful!