4.12 | Advanced | 7 min

Multimodal RAG: Retrieving Images, PDFs and Tables

Blck Alpaca
Definition

Multimodal RAG extends classic Retrieval-Augmented Generation to non-textual content: images, scanned PDFs, tables, charts and diagrams are indexed and made retrievable. Instead of searching plain text only, the system retrieves visual and structured information via multimodal embeddings, vision-LLM descriptions or layout-aware parsing and passes it to the answer prompt together with its source.

Key Takeaways

  • Classic RAG loses everything that is not clean plain text - tables, charts, scanned pages and diagrams fall through the cracks. Multimodal RAG closes this gap.
  • There are three main approaches: multimodal embeddings (CLIP, ColPali), vision-LLM description (image/table is translated into text) and layout-aware parsing (Docling, Unstructured, RAGFlow) - often as a combination.
  • Hybrid search remains mandatory: pure embeddings miss exact codes, article numbers and IDs in tables; BM25 plus rank fusion (RRF) catches them.
  • Typical use cases are technical documentation, invoices and reports - exactly where information sits primarily in layout, tables and graphics rather than in running text.
  • Complex layouts (multi-column, nested tables, poor scans) remain the biggest pitfall - layout-aware parsing and evaluation with RAGAS are not optional.
  • As of 2026 the toolchain is production-ready, but more cost- and latency-intensive than text-only RAG - the choice of components should be aligned to the document type.

Multimodal RAG extends the Retrieval-Augmented Generation pattern to everything that is not clean running text: images, scanned PDFs, tables, charts and diagrams. Classic RAG indexes text only. As soon as information sits primarily in the layout, in a table or in a graphic, a text-only pipeline loses exactly the substance that is decisive for the answer. Multimodal RAG makes this content indexable and retrievable, so that answers can cite it with its source.

  • What is retrieved? Not just plain text, but also page images, table structures, chart contents and scanned documents.
  • How? Via multimodal embeddings (CLIP, ColPali), vision-LLM descriptions or layout-aware parsing - usually as a combination.
  • What for? Technical documentation, invoices, reports - where knowledge lies in layout and graphics rather than in running text.

Why classic RAG fails on images and tables

A classic RAG pipeline runs along two paths. In the indexing path, sources are loaded, parsed, split into chunks, embedded and written to a vector database. In the query path, the request is embedded, candidate chunks are fetched via hybrid retrieval, re-ranked and inserted into the prompt of the generator LLM. The weak point for non-textual content sits right at the start - in parsing and chunking.

Two anti-patterns dominate here. First: naive fixed-size chunking ignores sentence, table and list boundaries. A 512-token window can cut a table in the middle of a row, so the table content is lost or its meaning distorted; the typical consequences are hallucinations and missing values. Second: pure semantic search without BM25. Exact codes such as an article number "TS-999" or an invoice number are frequently missed by embeddings, because semantic similarity says little about exact string matches. Hybrid search with Reciprocal Rank Fusion (RRF) catches such exact hits and therefore remains mandatory in the multimodal variant too.
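The rank fusion step mentioned above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the chunk IDs and both result lists are invented for the example, and `k=60` is a commonly used default that is assumed here rather than taken from the article.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical result lists for a query mentioning article "TS-999".
dense_hits = ["chunk_pricing_overview", "chunk_ts_998", "chunk_ts_999"]
bm25_hits = ["chunk_ts_999", "chunk_ts_990"]

fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)
```

In this toy run the chunk containing the exact code is only third in the dense list, but because BM25 ranks it first it ends up at the top of the fused list.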

With scanned PDFs there is the added complication that there is no text layer at all. Without OCR or visual retrieval the page is simply invisible to a text RAG.

The three approaches to multimodal retrieval

1. Multimodal embeddings (CLIP, ColPali)

Multimodal embedding models map image and text into the same vector space. This makes it possible to match a text query against image vectors and vice versa. CLIP is the well-known representative for generic image-text linking. ColPali goes one step further and is built specifically for document retrieval: it indexes the rendered page as an image and thereby bypasses the error-prone intermediate parsing into text - layout, tables and graphics are preserved as visual context. This is particularly strong for documents whose meaning lies in the layout.
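The idea of a shared vector space can be illustrated without any model at all. The sketch below uses invented three-dimensional toy vectors in place of real CLIP or ColPali output; the file names, the `search` helper and the query vector are assumptions made purely for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy vectors standing in for real embeddings. Image and text entries
# live in the same index because they share one vector space.
index = [
    {"id": "manual_p12.png", "modality": "image", "vector": [0.9, 0.1, 0.0]},
    {"id": "manual_p12_text", "modality": "text", "vector": [0.7, 0.3, 0.1]},
    {"id": "invoice_p1.png", "modality": "image", "vector": [0.0, 0.2, 0.9]},
]

def search(query_vector, index, top_k=2):
    """Return the IDs of the top_k most similar entries, any modality."""
    scored = [(cosine(query_vector, item["vector"]), item["id"]) for item in index]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]

# Stands in for the embedding of a text query such as "wiring diagram".
query_vector = [1.0, 0.2, 0.0]
results = search(query_vector, index)
print(results)
```

A text query can therefore return a page image as its best hit, which is exactly what a text-only index cannot do.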

Mainstream embedding models are moving towards multimodality as well. The Gemini Embedding model (Google) is classified as multimodal and sits at the top of MTEB in 2026 - Gemini Embedding 2 achieves a retrieval score of 67.71 (as of 2026). Since MTEB scores change weekly and MTEB v2 is not directly comparable to v1, the snapshot date should be documented for every model choice.

2. Vision-LLM for description

Here a vision-capable LLM takes over the preprocessing: it receives the image, the table or the chart and produces a precise textual description. This description is embedded in the usual way and written to the vector database. Advantage: the existing text RAG stack remains usable, retrieval runs over proven text embeddings. Disadvantage: the description is only as good as the vision model and costs an additional LLM call per asset during indexing. For charts and diagrams this approach is often precise, because a good vision-LLM can verbalise axes, trends and data points.
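As a rough sketch of this indexing step, the vision model and the text embedder can be treated as two injected callables. Everything named here (`IndexedAsset`, `index_asset`, `fake_describe`, `fake_embed`) is hypothetical; the stubs only stand in for a real vision-LLM call and a real embedding model so that the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class IndexedAsset:
    asset_id: str
    asset_type: str        # e.g. "chart", "table", "image"
    description: str       # text produced by the vision model
    vector: List[float]    # embedding of the description, not of the pixels

def index_asset(asset_id: str, asset_type: str, image_bytes: bytes,
                describe: Callable[[bytes, str], str],
                embed: Callable[[str], List[float]]) -> IndexedAsset:
    """One extra model call per asset: describe first, then embed the text."""
    description = describe(image_bytes, asset_type)
    return IndexedAsset(asset_id, asset_type, description, embed(description))

# Stubs so the sketch is self-contained; a real system would call a
# vision-capable LLM and a text embedding model here.
def fake_describe(image_bytes: bytes, asset_type: str) -> str:
    return f"{asset_type}: bar chart, revenue rises from Q1 to Q4"

def fake_embed(text: str) -> List[float]:
    return [float(len(text)), float(text.count(" "))]

record = index_asset("report_p3_chart1", "chart", b"...", fake_describe, fake_embed)
print(record.description)
```

Because only the description is embedded, the record can be written to the same text index as ordinary chunks.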

3. Layout-aware parsing

The third route separates structure out cleanly before embedding. Layout-aware parsers recognise tables, lists, headers and columns and preserve this structure. Several production-ready tools are available for this (as of 2026): Docling (IBM, open source), Unstructured.io, LlamaParse, Marker (datalab.to), PyMuPDF as well as OCR engines such as Tesseract, Azure Document Intelligence and AWS Textract. The RAGFlow framework ships "Deep Document Understanding" with Document Layout Analysis (DLA) and OCR. At the chunking level, the matching strategy is layout-aware chunking (Docling, Unstructured), which preserves table, list and header structure - ideal for PDFs, contracts and technical manuals.
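What structure-preserving chunking means for a table can be shown with plain Python. This sketch does not use Docling or Unstructured; it assumes a parser has already delivered a header and a list of rows, and the helper name, the sample rows and the `max_rows` limit are invented for illustration.

```python
def table_to_markdown_chunks(header, rows, max_rows=2):
    """Serialise a table as Markdown chunks that never split a row.

    Every chunk repeats the header, so each chunk stays interpretable
    on its own after retrieval.
    """
    head = "| " + " | ".join(header) + " |"
    rule = "| " + " | ".join("---" for _ in header) + " |"
    chunks = []
    for start in range(0, len(rows), max_rows):
        body = ["| " + " | ".join(str(cell) for cell in row) + " |"
                for row in rows[start:start + max_rows]]
        chunks.append("\n".join([head, rule, *body]))
    return chunks

# Hypothetical line-item table as a layout-aware parser might return it.
header = ["Pos", "Article", "Net amount", "Tax rate"]
rows = [
    [1, "TS-100", "120.00", "19%"],
    [2, "TS-250", "40.00", "7%"],
    [3, "TS-999", "15.50", "19%"],
]

chunks = table_to_markdown_chunks(header, rows)
print(chunks[1])
```

Unlike a fixed token window, the split happens only at row boundaries, and line item 3 keeps its column headers.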

Approach comparison

| Approach | Strength | Weakness | Suits |
| --- | --- | --- | --- |
| Multimodal embeddings (CLIP, ColPali) | Layout preserved as image, no lossy parsing | Higher storage/compute demand, newer toolchain | Scanned PDFs, charts, layout-driven documents |
| Vision-LLM description | Uses existing text RAG stack, good for charts | Additional LLM cost per asset, quality depends on model | Diagrams, individual images, reports |
| Layout-aware parsing (Docling, Unstructured, RAGFlow) | Preserves table structure, mature, well auditable | Weak on pure scans without OCR, error-prone with wild layouts | Technical documentation, contracts, invoices with a text layer |
| Hybrid search (complementary) | Catches exact codes, IDs, amounts | Does not solve visual retrieval on its own | Always enable alongside |

In practice these approaches are not an either-or decision. A robust system combines layout-aware parsing for structured tables, vision-LLM descriptions for charts and hybrid search for exact hits.
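One way to picture this combination is a small routing function that picks processing steps per asset. The rules below are a deliberately simplified reading of the comparison above, and the function name, the step labels and the two input flags are assumptions for illustration only.

```python
def choose_pipeline(has_text_layer: bool, asset_type: str) -> list:
    """Pick indexing steps for one asset (simplified, illustrative rules)."""
    steps = []
    if asset_type == "chart":
        # Charts: let a vision-LLM verbalise axes, trends and data points.
        steps.append("vision_llm_description")
    elif not has_text_layer:
        # Pure scans: retrieve over page images instead of parsed text.
        steps.append("visual_retrieval")
    else:
        # Structured PDFs with a text layer: keep tables and headers intact.
        steps.append("layout_aware_parsing")
    # Hybrid search is complementary and stays enabled in every case.
    steps.append("hybrid_search")
    return steps

print(choose_pipeline(has_text_layer=True, asset_type="table"))
print(choose_pipeline(has_text_layer=False, asset_type="page"))
```

A real system would route on richer signals, such as OCR confidence or detected layout complexity, and may run several steps for the same asset.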

Typical use cases

  • Technical documentation: Manuals with exploded views, wiring diagrams and specification tables. The Agri-Query benchmark (arXiv:2508.18093, August 2025) shows for agricultural-technology manuals that hybrid RAG achieves more than 85 percent accuracy across multiple languages and clearly beats naive long-context prompts.
  • Invoices: Line-item tables, amounts, tax rates and invoice numbers. Here the combination of table-structure extraction and exact BM25 matching on codes is decisive.
  • Reports and presentations: Quarterly figures in charts, KPI dashboards, diagrams. Vision-LLM description of the graphics plus table parsing covers both.

Concrete example: invoice RAG

Suppose a service provider wants to make 50,000 scanned PDF invoices searchable, so that a support agent can answer questions such as "Which tax rate applied to line item 3 on invoice TS-2024-0815?"

```
INDEXING (offline):
[50,000 PDF scans]
-> OCR (Tesseract / Azure Document Intelligence)
-> Layout-aware parsing (Docling): detects line-item table
-> serialise table preserving structure (Markdown)
-> optional: Vision-LLM describes layout/stamps
-> Embedding (multilingual, e.g. Cohere Embed v4)
-> Vector-DB upsert + parallel BM25 index
(metadata: tenant_id, invoice_number, date, ACL)

QUERY (online):
[Question] -> Embedding + BM25 query ("TS-2024-0815")
-> Hybrid retrieval (top_k = 50)
-> Re-ranker (Cohere Rerank v3.5), top_k = 5
-> Prompt + source citation (invoice number + snippet)
-> LLM -> Answer with faithfulness check (RAGAS)
```

The BM25 path finds the exact invoice number that pure embedding search can miss. The layout parser ensures that "line item 3" and the associated tax rate stay in the same structured row, instead of being torn apart by fixed-size chunking. Re-ranking reduces the retrieval error rate considerably: Anthropic's Contextual Retrieval write-up (2024) reports up to 67 percent fewer failed retrievals for text when contextual retrieval is combined with reranking; for multimodal pipelines the re-ranking principle applies analogously.
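Why the lexical path finds the invoice number can be seen in a compact BM25 scorer. This is a sketch under assumptions: it uses the Okapi BM25 formula with a Lucene-style IDF, the defaults `k1=1.5` and `b=0.75` are assumed, the tokenizer that keeps hyphenated codes as one token is a design choice of this example, and the three invoice rows are invented.

```python
import math
import re
from collections import Counter

# Keep hyphenated codes such as "TS-2024-0815" together as one token.
TOKEN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

def tokenize(text):
    return TOKEN.findall(text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms (like a unique invoice number) get a high IDF.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical serialised invoice rows.
docs = [
    "Invoice TS-2024-0815 | Pos 3 | Consulting | tax rate 19%",
    "Invoice TS-2024-0816 | Pos 3 | Hosting | tax rate 19%",
    "Invoice TS-2023-0815 | Pos 1 | Licence | tax rate 7%",
]
query = "Which tax rate applied to line item 3 on invoice TS-2024-0815?"
scores = bm25_scores(query, docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

The three rows are near-duplicates in wording, which is where embeddings tend to struggle; the unique invoice number carries the highest IDF weight and decides the ranking.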

Pitfalls with complex layouts

  • Nested and multi-column tables: Parsers assign cells incorrectly. Test on real documents before roll-out, not on ideal examples.
  • Poor scan quality: OCR errors propagate into every subsequent step. Here visual approaches such as ColPali are often more robust, because they bypass the OCR step.
  • Missing re-ranking and missing evaluation: Without faithfulness measurement, quality regressions remain invisible. RAGAS measures along faithfulness, answer relevancy, context precision and context recall - that belongs in CI.
  • Lost in the chunks: A chunk "It increased by 12 percent" without a context header is worthless. Contextual chunking (Anthropic 2024) prepends an LLM-generated context and measurably improves retrieval quality.
  • Embedding drift on model change: A change of the multimodal embedding model requires a full re-indexing. Keep a version string in the index.
  • GDPR for invoices and scans: Personal content in images and tables is subject to the same obligations - tenant separation, a role and permission model, a deletion pipeline for chunks and embeddings, EU hosting (cf. DSK guidance on RAG). Informational, not legal advice.
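Two of the pitfalls above, context-free chunks and embedding drift, can be addressed directly in the chunk record. The sketch below is illustrative only: the `Chunk` type, the helper names and the version strings are hypothetical, and the context sentence is passed in as a plain string where a real pipeline would have it generated by an LLM.

```python
from dataclasses import dataclass

EMBEDDING_VERSION = "example-embed-v1"  # hypothetical model identifier

@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str                 # the text that actually gets embedded
    embedding_version: str    # model version the stored vector came from

def with_context(chunk_id, context, body, version=EMBEDDING_VERSION):
    """Prepend a short context header so the chunk is meaningful alone."""
    return Chunk(chunk_id, f"{context}\n\n{body}", version)

def stale_chunks(chunks, current_version):
    """IDs of chunks embedded with another model, i.e. needing re-indexing."""
    return [c.chunk_id for c in chunks if c.embedding_version != current_version]

chunks = [
    with_context("report-7", "Quarterly report, section Revenue, region EMEA.",
                 "It increased by 12 percent."),
    Chunk("legacy-1", "Old chunk from a previous index.", "example-embed-v0"),
]

to_reindex = stale_chunks(chunks, EMBEDDING_VERSION)
print(chunks[0].text)
print(to_reindex)
```

Storing the version next to every vector makes a model change a query over metadata instead of a guess about which parts of the index are still valid.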

For agencies and B2B decision-makers

In 2026 multimodal RAG is no longer a research topic; it decides whether an AI assistant can unlock a company's most valuable documents at all - technical manuals, invoices, reports - or fails on them. For agencies, the leverage lies in clean component selection per document type: layout-aware parsing for structured PDFs, visual retrieval for scans and charts, hybrid search and re-ranking as standard. Anyone wanting to set up a knowledge system with images, scanned PDFs and tables in a production-ready, GDPR-compliant way with measurable faithfulness should plan architecture and evaluation together from the start. Blck Alpaca supports DACH companies with exactly this design - from the ingestion pipeline to the EU-sovereign toolchain.

FAQ

What is the difference between multimodal RAG and classic RAG?
Classic RAG indexes and retrieves text only. Content in images, scanned PDFs, tables or charts is lost or mangled during parsing. Multimodal RAG makes exactly these visual and structured contents retrievable - either via multimodal embeddings that place image and text in the same vector space, or by having a vision-LLM translate the non-textual elements into searchable descriptions beforehand.

Do I need CLIP or ColPali, or is it enough to parse PDFs into text?
That depends on the document type. For cleanly structured PDFs with a clear text layer, layout-aware parsing (e.g. Docling or Unstructured) plus classic text embedding is often sufficient. For scanned documents, dense charts or complex multi-column layouts, visual approaches such as ColPali (retrieval directly over page images) deliver better results, because they do not first have to force the layout into error-prone text.

How does multimodal RAG handle tables?
There are two proven routes. First: layout-aware parsing extracts the table structure (rows, columns, headers) and serialises it while preserving structure, for instance as a Markdown table. Second: a vision-LLM describes the table in natural language and the descriptive text is embedded. For exact values such as article numbers or amounts, hybrid search with BM25 should additionally be active, since pure embeddings miss exact codes.

What are the most common sources of error in multimodal RAG?
Naive fixed-size chunking that cuts tables in half, missing layout detection in multi-column or nested documents, poor OCR quality on scans, as well as missing re-ranking and missing faithfulness measurement. Skipping hybrid search is also a classic mistake, because otherwise exact codes and IDs are not found.

Can multimodal RAG be deployed in a GDPR-compliant way?
In principle yes, with the same requirements as text-based RAG: tenant separation in the metadata, a rights and roles concept, a deletion pipeline for chunks and embeddings as well as EU-region hosting. For invoices, contracts and scanned documents with personal data, the principles under GDPR Art. 5, 6 and 17 apply unchanged. This is an informational note and not legal advice.
