Document AI

route

Processing Approaches

ApproachWhen to UseToolsTradeoffs
Vision-basedSimple documents, forms, handwritten, visual layoutsGPT-4V, Claude Vision, GeminiBest quality for simple docs, higher cost per page
OCR + TextDense text, structured extraction, high volumeTesseract, AWS Textract, Azure DILower cost, loses visual context
HybridComplex layouts, tables with text, mixed contentUnstructured, DoclingBest coverage, more complex pipeline
lightbulb

Modern VLMs often handle document understanding better from images than from OCR text. For simple documents like forms, invoices, or receipts, converting to images and using Claude Vision or GPT-4V yields better results than traditional OCR pipelines. OCR loses layout context, reading order, and visual cues that VLMs naturally understand.

description

Document Parsing Tools

Docling

IBM's layout-aware document parsing. Preserves structure, tables, and reading order. Outputs to markdown or JSON.

Layout-awareTablesOpen Source

Unstructured

Multi-format document extraction. Handles PDFs, Word, HTML, images. Auto-detects document type.

Multi-formatAuto-detectRAG-ready

LlamaParse

LlamaIndex's document parser. Optimized for LLM consumption with semantic chunking.

LLM-optimizedSemantic Chunks

Marker

PDF to markdown converter. Preserves formatting, handles multi-column layouts, extracts equations.

PDFMarkdownEquations
cloud

OCR & Extraction Services

ServiceProviderStrengthsBest For
AWS TextractAWSForms, tables, queries, expense analysisStructured extraction, AWS ecosystem
Azure Document IntelligenceMicrosoftPre-built models for invoices, receipts, IDsEnterprise documents, Azure ecosystem
Google Document AIGoogleCustom document processors, entity extractionCustom document types, GCP ecosystem
TesseractOpen Source100+ languages, LSTM engine, freeSelf-hosted, cost-sensitive, privacy
folder_open

Document Types & Strategies

PDFs

  • Text-based: Direct text extraction, preserve structure
  • Scanned: OCR required, consider image quality
  • Mixed: Hybrid approach, detect type per page

Forms & Invoices

  • Structured: Key-value extraction, field mapping
  • Semi-structured: Template matching, anchor detection
  • Handwritten: Vision models preferred over OCR

Tables & Charts

  • Tables: Cell detection, header recognition
  • Charts: Extract as images for VLM analysis
  • Data: Normalize to structured formats (CSV, JSON)

Multi-page Documents

  • Segmentation: Split by logical sections
  • Cross-references: Maintain page context
  • TOC: Use table of contents for structure
auto_awesome

RAG Integration Patterns

Document Chunking

  • Semantic boundaries (paragraphs, sections)
  • Preserve table integrity
  • Include surrounding context
  • Respect page breaks when relevant

Metadata Extraction

  • Document title, date, author
  • Page numbers for citations
  • Section headers for filtering
  • Document type classification

Multi-modal Embeddings

  • Text + image embeddings
  • Table as separate vectors
  • Chart descriptions
  • Cross-modal retrieval

See RAG Architecture for detailed retrieval patterns.

account_tree

Processing Pipeline

1Ingest
2Classify
3Parse
4Extract
5Chunk
6Embed