Multimodal RAG: Retrieving Images, PDFs and Tables
Multimodal RAG extends classic Retrieval-Augmented Generation to non-textual content: images, scanned PDFs, tables, charts and diagrams are indexed and made retrievable. Instead of searching plain text only, the system retrieves visual and structured information via multimodal embeddings, vision-LLM descriptions or layout-aware parsing and passes it to the answer prompt together with its source reference.
Key Takeaways
- Classic RAG loses everything that is not clean plain text - tables, charts, scanned pages and diagrams fall through the cracks. Multimodal RAG closes this gap.
- There are three main approaches: multimodal embeddings (CLIP, ColPali), vision-LLM description (image/table is translated into text) and layout-aware parsing (Docling, Unstructured, RAGFlow) - often as a combination.
- Hybrid search remains mandatory: pure embeddings miss exact codes, article numbers and IDs in tables; BM25 plus rank fusion (RRF) catches them.
- Typical use cases are technical documentation, invoices and reports - exactly where information sits primarily in layout, tables and graphics rather than in running text.
- Complex layouts (multi-column, nested tables, poor scans) remain the biggest pitfall - layout-aware parsing and evaluation with RAGAS are not optional.
- As of 2026 the toolchain is production-ready, but more cost- and latency-intensive than text-only RAG - the choice of components should be aligned to the document type.
Multimodal RAG extends the Retrieval-Augmented Generation pattern to everything that is not clean running text: images, scanned PDFs, tables, charts and diagrams. Classic RAG indexes text only. As soon as information sits primarily in the layout, in a table or in a graphic, a text-only pipeline loses exactly the substance that is decisive for the answer. Multimodal RAG makes this content indexable and retrievable, so that answers can cite it with a source reference.
- What is retrieved? Not just plain text, but also page images, table structures, chart contents and scanned documents.
- How? Via multimodal embeddings (CLIP, ColPali), vision-LLM descriptions or layout-aware parsing - usually as a combination.
- What for? Technical documentation, invoices, reports - where knowledge lies in layout and graphics rather than in running text.
Why classic RAG fails on images and tables
A classic RAG pipeline runs along two paths. In the indexing path, sources are loaded, parsed, split into chunks, embedded and written to a vector database. In the query path, the request is embedded, candidate chunks are fetched via hybrid retrieval, re-ranked and inserted into the prompt of the generator LLM. The weak point for non-textual content sits right at the start: in parsing and chunking.
Two anti-patterns dominate here. First: naive fixed-size chunking ignores sentence, table and list boundaries. A 512-token window strategy cuts a table in the middle of a row, so the table content is lost or its meaning is distorted; the typical consequences are hallucinations and missing values. Second: pure semantics without BM25. Exact codes such as an article number "TS-999" or an invoice number are frequently missed by embeddings, because semantic similarity says little about an arbitrary identifier. Hybrid search with rank fusion (RRF) catches such exact hits and therefore remains mandatory in the multimodal variant too.
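The rank-fusion step can be sketched in a few lines. The following is a minimal illustration of Reciprocal Rank Fusion, not a reference implementation: the document IDs are invented, and `k=60` is the constant commonly used for RRF, assumed here rather than taken from this article.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document receives 1 / (k + rank) from every list it appears in;
    k=60 is a conventional default (an assumption, tune per corpus).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists: the dense retriever ranks the exact-code chunk
# last, while BM25 ranks it first because the token "TS-999" matches literally.
dense_hits = ["pump_overview", "valve_specs", "parts_table_TS-999"]
bm25_hits = ["parts_table_TS-999", "pump_overview"]

fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)
```

In this toy case the chunk containing the exact code moves ahead of a chunk that only the dense retriever found, which is the effect the hybrid setup is meant to achieve.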
With scanned PDFs there is the added complication that there is no text layer at all. Without OCR or visual retrieval the page is simply invisible to a text RAG.
The three approaches to multimodal retrieval
1. Multimodal embeddings (CLIP, ColPali)
Multimodal embedding models map image and text into the same vector space. This makes it possible to match a text query against image vectors and vice versa. CLIP is the well-known representative for generic image-text linking. ColPali goes one step further and is built specifically for document retrieval: it indexes the rendered page as an image and thereby bypasses the error-prone intermediate parsing into text - layout, tables and graphics are preserved as visual context. This is particularly strong for documents whose meaning lies in the layout.
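What "same vector space" means for retrieval can be illustrated with toy numbers. The sketch below assumes the vectors already exist (in a real system they would come from a CLIP-style or ColPali-style model); all file names and values are made up. It shows single-vector cosine matching and, as a second variant, the multi-vector "late interaction" scoring that ColPali-style retrievers use, where each query token is matched against its best page patch.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def maxsim(query_vecs, page_vecs):
    """Late-interaction score: for every query-token vector take the best
    matching page-patch vector, then sum those maxima."""
    return sum(max(cosine(q, p) for p in page_vecs) for q in query_vecs)


# Single-vector case: one text-query vector against one vector per image.
query_vec = [0.9, 0.1, 0.0]
image_vecs = {
    "wiring_diagram.png": [0.8, 0.2, 0.1],
    "team_photo.jpg": [0.0, 0.1, 0.9],
}
best_image = max(image_vecs, key=lambda name: cosine(query_vec, image_vecs[name]))

# Multi-vector case: two query tokens against the patch vectors of two pages.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0]]  # covers both query tokens
page_b = [[1.0, 0.0], [1.0, 0.0]]  # covers only the first token

print(best_image, maxsim(query_tokens, page_a), maxsim(query_tokens, page_b))
```

The multi-vector variant needs more storage per page, which matches the higher storage and compute demand noted for this approach.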
In the mainstream too, the embedding layer is shifting towards multimodality. The Gemini Embedding model (Google) is classified as multimodal and sits at the top of MTEB in 2026 - Gemini Embedding 2 achieves a retrieval score of 67.71 (as of 2026). Since MTEB scores change weekly and MTEB v2 is not directly comparable to v1, the snapshot date should be documented for every model choice.
2. Vision-LLM for description
Here a vision-capable LLM takes over the preprocessing: it receives the image, the table or the chart and produces a precise textual description. This description is embedded in the usual way and written to the vector database. Advantage: the existing text RAG stack remains usable, retrieval runs over proven text embeddings. Disadvantage: the description is only as good as the vision model and costs an additional LLM call per asset during indexing. For charts and diagrams this approach is often precise, because a good vision-LLM can verbalise axes, trends and data points.
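The indexing side of this approach could look roughly like the sketch below. The vision model is deliberately abstracted as a `describe` callable, because the concrete API depends on the provider; `fake_describe`, `IndexedAsset` and the asset IDs are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class IndexedAsset:
    asset_id: str
    source: str        # where the asset came from, kept for citation
    description: str   # text that is embedded instead of the image


def index_visual_assets(
    assets: Dict[str, Tuple[str, bytes]],
    describe: Callable[[bytes], str],
) -> List[IndexedAsset]:
    """Turn images into text records; one describe() call per asset."""
    records = []
    for asset_id, (source, image_bytes) in assets.items():
        text = describe(image_bytes).strip()
        if not text:
            continue  # nothing usable came back, so nothing is indexed
        records.append(IndexedAsset(asset_id, source, text))
    return records


def fake_describe(image_bytes: bytes) -> str:
    """Stand-in for a real vision-LLM call (assumption, not a real API)."""
    if not image_bytes:
        return ""
    return "Bar chart: quarterly revenue rising from Q1 to Q4."


assets = {
    "fig-1": ("report.pdf#page=4", b"\x89PNG..."),
    "fig-2": ("report.pdf#page=9", b""),
}
records = index_visual_assets(assets, fake_describe)
print(records)
```

The resulting `description` strings can then go through the existing text embedding and hybrid index unchanged, while `source` keeps the link back to the original image.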
3. Layout-aware parsing
The third route separates structure out cleanly before embedding. Layout-aware parsers recognise tables, lists, headers and columns and preserve this structure. Several production-ready tools are available for this (as of 2026): Docling (IBM, open source), Unstructured.io, LlamaParse, Marker (datalab.to), PyMuPDF as well as OCR engines such as Tesseract, Azure Document Intelligence and AWS Textract. The RAGFlow framework explicitly brings "Deep Document Understanding" with Document Layout Analysis (DLA) and OCR. At the chunking level, the appropriate strategy is layout-aware chunking (Docling, Unstructured), which preserves table, list and header structure - ideal for PDFs, contracts and technical manuals.
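The idea behind layout-aware chunking can be shown on already parsed Markdown. The following sketch is a simplified stand-in, not the chunker shipped with Docling or Unstructured: it assumes blocks are separated by blank lines and that a block starting with `|` is a table, and it never splits such a block.

```python
def layout_aware_chunks(markdown: str, max_chars: int = 400) -> list[str]:
    """Pack blank-line separated blocks into chunks without splitting tables.

    Assumptions: tables are Markdown pipe tables, and max_chars is a rough
    size budget; a table larger than the budget is still kept in one piece.
    """
    blocks = [b.strip() for b in markdown.split("\n\n") if b.strip()]
    chunks, current = [], ""
    for block in blocks:
        is_table = block.startswith("|")
        candidate = f"{current}\n\n{block}" if current else block
        if current and (is_table or len(candidate) > max_chars):
            chunks.append(current)  # close the running chunk first
            current = block
        else:
            current = candidate
        if is_table:
            chunks.append(current)  # a table always forms its own chunk
            current = ""
    if current:
        chunks.append(current)
    return chunks


doc = (
    "Torque specifications\n\n"
    "All values apply to the 2024 series.\n\n"
    "| Part | Torque (Nm) |\n|---|---|\n| TS-999 | 45 |\n\n"
    "Re-check after 50 operating hours."
)
chunks = layout_aware_chunks(doc)
for chunk in chunks:
    print(repr(chunk))
```

A production chunker would usually also attach the nearest heading to the table chunk; that is left out here to keep the sketch short.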
Approach comparison
| Approach | Strength | Weakness | Suits |
|---|---|---|---|
| Multimodal embeddings (CLIP, ColPali) | Layout preserved as image, no lossy parsing | Higher storage/compute demand, newer toolchain | Scanned PDFs, charts, layout-driven documents |
| Vision-LLM description | Uses existing text RAG stack, good for charts | Additional LLM cost per asset, quality depends on model | Diagrams, individual images, reports |
| Layout-aware parsing (Docling, Unstructured, RAGFlow) | Preserves table structure, mature, well auditable | Weak on pure scans without OCR, error-prone with wild layouts | Technical documentation, contracts, invoices with a text layer |
| Hybrid search (complementary) | Catches exact codes, IDs, amounts | Does not solve visual retrieval on its own | Always enable alongside |
In practice these approaches are not an either-or decision. A robust system combines layout-aware parsing for structured tables, vision-LLM descriptions for charts and hybrid search for exact hits.
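One way to express such a combination is a small routing table that picks the indexing steps per asset type. The mapping below is a hypothetical sketch derived from the comparison above; the enum values and step names are illustrative labels, not identifiers from any library.

```python
from enum import Enum


class AssetKind(Enum):
    TEXT_PDF = "text_pdf"        # PDF with a usable text layer
    TABLE = "table"
    CHART = "chart"
    SCANNED_PDF = "scanned_pdf"  # image-only pages


# Illustrative routing, to be adjusted per corpus after evaluation.
ROUTES = {
    AssetKind.TEXT_PDF: ["layout_parsing"],
    AssetKind.TABLE: ["layout_parsing"],
    AssetKind.CHART: ["vision_llm_description"],
    AssetKind.SCANNED_PDF: ["visual_embedding"],
}


def plan_indexing(kind: AssetKind) -> list[str]:
    """Return the indexing steps for one asset; BM25 is always added."""
    return ROUTES[kind] + ["bm25_index"]


for kind in AssetKind:
    print(kind.value, "->", plan_indexing(kind))
```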
Typical use cases
- Technical documentation: Manuals with exploded views, wiring diagrams and specification tables. The Agri-Query benchmark (arXiv:2508.18093, August 2025) shows for agricultural-technology manuals that hybrid RAG achieves more than 85 percent accuracy across multiple languages and clearly beats naive long-context prompts.
- Invoices: Line-item tables, amounts, tax rates and invoice numbers. Here the combination of table-structure extraction and exact BM25 matching on codes is decisive.
- Reports and presentations: Quarterly figures in charts, KPI dashboards, diagrams. Vision-LLM description of the graphics plus table parsing covers both.
Concrete example: invoice RAG
Suppose a service provider wants to make 50,000 scanned PDF invoices searchable, so that a support agent can answer questions such as "Which tax rate applied to line item 3 on invoice TS-2024-0815?"
```
INDEXING (offline):
  [50,000 PDF scans]
    -> OCR (Tesseract / Azure Document Intelligence)
    -> Layout-aware parsing (Docling): detects line-item table
    -> serialise table preserving structure (Markdown)
    -> optional: Vision-LLM describes layout/stamps
    -> Embedding (multilingual, e.g. Cohere Embed v4)
    -> Vector-DB upsert + parallel BM25 index
       (metadata: tenant_id, invoice_number, date, ACL)

QUERY (online):
  [Question] -> Embedding + BM25 query ("TS-2024-0815")
    -> Hybrid retrieval (top_k = 50)
    -> Re-ranker (Cohere Rerank v3.5), top_k = 5
    -> Prompt + source citation (invoice number + snippet)
    -> LLM -> Answer with faithfulness check (RAGAS)
```
The BM25 path finds the exact invoice number that a pure embedding would miss. The layout parser ensures that "line item 3" and the associated tax rate stay in the same structured row, instead of being torn apart by fixed-size chunking. Re-ranking reduces the retrieval error rate considerably: Anthropic's Contextual Retrieval write-up reports up to 67 percent fewer failed retrievals for text when contextual embeddings and contextual BM25 are combined with re-ranking (as of 2024); for multimodal pipelines the re-ranking principle applies analogously.
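Two steps of this pipeline lend themselves to a short sketch: serialising the line-item table to Markdown so that each row stays intact, and narrowing candidates by the exact invoice number. The lookup below is a simple metadata filter standing in for the BM25 path, not BM25 itself; the regular expression assumes identifiers shaped like "TS-2024-0815", and all field names and values are invented.

```python
import re


def table_to_markdown(header, rows):
    """Serialise a table so that every line item stays on one line."""
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(str(cell) for cell in row) + " |" for row in rows]
    return "\n".join(lines)


def build_chunk(tenant_id, invoice_number, header, rows):
    """One chunk per invoice table, with the metadata used for filtering."""
    return {
        "text": f"Invoice {invoice_number}\n" + table_to_markdown(header, rows),
        "metadata": {"tenant_id": tenant_id, "invoice_number": invoice_number},
    }


# Assumed identifier shape: letters, four digits, four digits.
INVOICE_ID = re.compile(r"\b[A-Z]{2,}-\d{4}-\d{4}\b")


def exact_invoice_hits(chunks, query, tenant_id):
    """Keep chunks of the caller's tenant whose invoice number is in the query."""
    wanted = set(INVOICE_ID.findall(query))
    return [
        c for c in chunks
        if c["metadata"]["tenant_id"] == tenant_id
        and c["metadata"]["invoice_number"] in wanted
    ]


header = ["Item", "Description", "Tax rate"]
chunks = [
    build_chunk("acme", "TS-2024-0815", header, [[1, "Setup", "19%"], [3, "Maintenance", "7%"]]),
    build_chunk("acme", "TS-2024-0816", header, [[1, "Hosting", "19%"]]),
]
question = "Which tax rate applied to line item 3 on invoice TS-2024-0815?"
hits = exact_invoice_hits(chunks, question, tenant_id="acme")
print(hits[0]["text"])
```

Because the row for item 3 is serialised as a single line, the generator sees the item number and its tax rate side by side.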
Pitfalls with complex layouts
- Nested and multi-column tables: Parsers assign cells incorrectly. Test on real documents before roll-out, not on ideal examples.
- Poor scan quality: OCR errors propagate into every subsequent step. Here visual approaches such as ColPali are often more robust, because they bypass the OCR step.
- Missing re-ranking and missing evaluation: Without faithfulness measurement, quality regressions remain invisible. RAGAS measures along faithfulness, answer relevancy, context precision and context recall - that belongs in CI.
- Lost in the chunks: A chunk "It increased by 12 percent" without a context header is worthless. Contextual chunking (Anthropic 2024) prepends an LLM-generated context and measurably improves retrieval quality.
- Embedding drift on model change: A change of the multimodal embedding model requires a full re-indexing. Keep a version string in the index.
- GDPR for invoices and scans: Personal data in images and tables is subject to the same obligations as text: tenant separation, a role and permission model, a deletion pipeline covering chunks and embeddings, and EU hosting (cf. DSK guidance on RAG). Informational, not legal advice.
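Two of the pitfalls above, chunks without context and embedding drift, can be addressed at the level of the stored record. The sketch below assumes a hypothetical `ChunkRecord` layout: the context sentence would be LLM-generated in practice (hard-coded here), and the model names are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRecord:
    text: str
    context: str          # short, document-level context for the chunk
    embedding_model: str  # version string of the model that embedded it

    def text_for_embedding(self) -> str:
        """The context is prepended before embedding and BM25 indexing."""
        return f"{self.context}\n{self.text}"


def assert_index_compatible(records, query_model: str) -> None:
    """Refuse to query an index built with a different embedding model."""
    stale = {r.embedding_model for r in records} - {query_model}
    if stale:
        raise ValueError(
            f"index contains vectors from {sorted(stale)}; "
            f"re-index before querying with {query_model}"
        )


record = ChunkRecord(
    text="It increased by 12 percent.",
    context="Q3 report, section 'Service revenue', compared with Q2.",
    embedding_model="embed-v1",
)
print(record.text_for_embedding())
assert_index_compatible([record], "embed-v1")  # same model: passes silently
```

Running such a check at query start-up makes a forgotten re-indexing fail loudly instead of silently degrading retrieval quality.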
For agencies and B2B decision-makers
In 2026 multimodal RAG is no longer a research topic; it decides whether an AI assistant can unlock a company's most valuable documents at all - technical manuals, invoices, reports - or fails on them. For agencies, the leverage lies in clean component selection per document type: layout-aware parsing for structured PDFs, visual retrieval for scans and charts, hybrid search and re-ranking as standard. Anyone wanting to set up a knowledge system with images, scanned PDFs and tables in a production-ready, GDPR-compliant way with measurable faithfulness should plan architecture and evaluation together from the start. Blck Alpaca supports DACH companies with exactly this design - from the ingestion pipeline to the EU-sovereign toolchain.