A user asked our RAG system: "What was Q3 revenue for the Asia-Pacific segment?"
The system returned a confident answer. The answer was wrong.
Not because the model failed — because OCR had turned a structured quarterly earnings table into a flat stream of numbers. The positional relationships between row headers, column headers, and cell values were gone. What had been a clean 6×4 table in the PDF became 42 31 28 17 APAC EMEA Americas... with no spatial structure. The retrieval system found the right page. The LLM had no way to know which number corresponded to which segment and which quarter.
This is the dirty secret of text-based RAG on rich documents: the pipeline throws away exactly the information that makes the document valuable, before retrieval even starts.
ColPali (arXiv:2407.01449) fixes this by not extracting text at all. It encodes document pages as images, retrieves by visual similarity, and answers using what the model actually sees.
When to Use ColPali vs Text RAG
I want to put this table first, because the most common mistake is reaching for ColPali when plain text RAG would do the job faster and cheaper.
| Document Type | ColPali | Text RAG |
|---|---|---|
| Financial reports with tables | ✓ Best | ✗ Loses structure |
| Technical diagrams + explanation | ✓ Best | ✗ Diagram is lost entirely |
| Research papers with figures | ✓ Strong | Partial |
| Scanned PDFs (image-only) | ✓ Only option | ✗ OCR fails |
| Product catalogs with images | ✓ Strong | ✗ Images lost |
| Plain text documents | ✗ Overkill | ✓ Faster, cheaper |
| Legal contracts (text-heavy) | ✗ Overkill | ✓ Sufficient |
For most production systems: run both and merge. ColPali for visual retrieval, text RAG for text-heavy content, combined with Reciprocal Rank Fusion (RRF), which scores each document as 1 / (k + rank) per retriever, with k = 60 as the conventional default. A document ranked #1 by both retrievers scores higher than any document ranked #1 by only one.
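The merge itself is a few lines. A minimal sketch (the page IDs below are made up for illustration):

```python
def rrf_merge(rankings, k=60):
    """Reciprocal Rank Fusion: each retriever contributes 1/(k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

merged = rrf_merge([
    ["p7", "p2", "p9"],  # ColPali visual ranking
    ["p7", "p4", "p2"],  # text RAG ranking
])
print(merged)  # ['p7', 'p2', 'p4', 'p9']: p7, ranked #1 by both, wins
```

Note how p2, ranked by both retrievers, outscores p4 and p9, which each appear in only one list.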
The Problem with Text-Based RAG on Rich Documents
When OCR extracts a financial table, it produces a flat list of numbers with no positional relationships. ColPali encodes the page as 1024 image patches — each patch is a small region of the page. The spatial structure, visual hierarchy, and relationships between elements are all preserved.
OCR is lossy compression. For documents where structure matters, it's the wrong foundation.
How ColPali Works
ColPali uses late interaction — borrowed from ColBERT for text retrieval. Instead of compressing a whole page into one vector, you keep per-patch vectors and compute similarity at query time.
The MaxSim Formula
The intuition first: each query token finds its best matching image patch. The total score is the sum of those best matches.
This is why it works on tables. The query tokens for "Q3" and "revenue" each find the exact patches they correspond to — the column header and the row header — rather than competing with every other word on the page. A single compressed vector would average everything together, drowning the signal.
Score(query, document) = Σᵢ maxⱼ (qᵢ · dⱼ)
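In code, the formula is a matrix product followed by a row-wise max. A toy example in 2-D instead of ColPali's 128-D:

```python
import numpy as np

def maxsim(query_vecs, patch_vecs):
    """Late-interaction score: each query token takes its best-matching
    patch, then the per-token maxima are summed."""
    sims = query_vecs @ patch_vecs.T          # (n_tokens, n_patches)
    return float(sims.max(axis=1).sum())

q = np.array([[1.0, 0.0], [0.0, 1.0]])              # two query tokens
d = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # three page patches
print(maxsim(q, d))  # 2.0: each token finds its exact patch
```

Each token scores 1.0 against its matching patch; the mixed patch [0.5, 0.5] never wins a max, which is the "no averaging" property the prose above describes.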
The current SOTA model: ColQwen2-v1.0 scores 89.3% on the ViDoRe benchmark — significantly ahead of OCR-based systems on visually complex documents.
The Indexing Pipeline
At DPI=150 on an A100 GPU, indexing runs at roughly 800ms per page. A 100-page document takes ~80 seconds. This is a one-time cost — run it as a background job, not synchronously.
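As a shape check on what indexing produces, here is a sketch with the model call stubbed out. `embed_page` is a hypothetical placeholder for the colpali-engine forward pass, which returns one 128-dim vector per patch:

```python
import numpy as np

N_PATCHES, DIM = 1024, 128  # 32x32 patch grid, 128-dim vectors

def embed_page(page_image):
    """Hypothetical stand-in for the ColPali forward pass; the real
    model returns a (1024, 128) matrix of patch embeddings per page."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((N_PATCHES, DIM)).astype(np.float32)

def index_document(page_images):
    """Produce (page_num, patch_index, vector) rows for storage."""
    rows = []
    for page_num, image in enumerate(page_images):
        embeddings = embed_page(image)
        rows.extend((page_num, i, embeddings[i]) for i in range(N_PATCHES))
    return rows

rows = index_document([None, None])  # a two-"page" document
print(len(rows))  # 2048 rows: 1024 patches x 2 pages
```

The output rows map one-to-one onto the page_patches table below, which is why patch counts dominate storage planning.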
The pgvector schema stores each patch as a separate row:
```sql
CREATE TABLE document_pages (
    id       SERIAL PRIMARY KEY,
    doc_id   TEXT NOT NULL,
    page_num INTEGER NOT NULL,
    UNIQUE (doc_id, page_num)
);

-- Each page → 1024 rows here (32×32 patches, 128-dim vectors)
CREATE TABLE page_patches (
    id          SERIAL PRIMARY KEY,
    page_id     INTEGER REFERENCES document_pages(id) ON DELETE CASCADE,
    patch_index INTEGER NOT NULL,
    embedding   vector(128) NOT NULL
);

CREATE INDEX ON page_patches USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
```
At 1024 patches per page, a 100-page document creates 102,400 rows in page_patches. For large collections, switch from HNSW to IVFFlat indexing (pgvector's IVF variant): it trades some recall for a much faster index build and a smaller memory footprint at scale.
Retrieval: Two-Phase MaxSim
Full MaxSim against every page is expensive. The production approach runs two phases:
Phase 1 uses the mean query patch as a proxy for fast ANN candidate selection. Phase 2 fetches all patches for each candidate page and computes the exact MaxSim score. The two-phase approach keeps total query latency under 3.5 seconds on typical document collections.
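A minimal in-memory sketch of the two phases, with a mean-pooled page vector standing in for the pgvector ANN lookup (in production, phase 1 queries the HNSW patch index instead):

```python
import numpy as np

def maxsim(query_vecs, patch_vecs):
    return float((query_vecs @ patch_vecs.T).max(axis=1).sum())

def two_phase_retrieve(query_vecs, pages, n_candidates=50, top_k=5):
    """pages: dict of page_id -> (n_patches, dim) patch matrix."""
    q_mean = query_vecs.mean(axis=0)
    # Phase 1: cheap proxy score, one dot product per page
    proxy = {pid: float(patches.mean(axis=0) @ q_mean)
             for pid, patches in pages.items()}
    candidates = sorted(proxy, key=proxy.get, reverse=True)[:n_candidates]
    # Phase 2: exact MaxSim on the shortlist only
    scores = {pid: maxsim(query_vecs, pages[pid]) for pid in candidates}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

q = np.array([[1.0, 0.0], [0.0, 1.0]])
pages = {
    "p1": np.array([[1.0, 0.0], [0.0, 1.0]]),    # matches both tokens
    "p2": np.array([[-1.0, 0.0], [0.0, -1.0]]),  # opposite direction
}
print(two_phase_retrieve(q, pages, top_k=1))  # ['p1']
```

The point of the split: phase 1 touches every page but does almost no work per page, while the expensive exact MaxSim in phase 2 only ever runs on the candidate shortlist.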
| Step | Typical Time |
|---|---|
| Query encoding (GPU) | ~50ms |
| ANN candidate retrieval (pgvector HNSW) | ~10ms |
| MaxSim reranking (top 50 candidates) | ~100ms |
| VLM answer generation (Gemini Flash) | ~2–3s |
| Total | ~2.5–3.5s |
Resources
- ColPali: Efficient Document Retrieval with Vision Language Models (arXiv:2407.01449)
- ColBERT: Late Interaction Mechanism (arXiv:2004.12832)
- vidore/colqwen2-v1.0 — Current SOTA, 89.3% on ViDoRe
- colpali-engine (Python) — Official ColQwen2 implementation
- ViDoRe Leaderboard
- LlamaIndex ColPali integration
OCR is the wrong primitive for a large share of enterprise documents. It works on plain text. It loses structure on tables. It produces nothing on charts and diagrams. For a decade, we've been building document understanding systems on a foundation that throws away the hardest-won information in the document — the visual structure. ColPali doesn't fix OCR. It bypasses it entirely. That's the right solution.