Document AI
Processing Approaches
| Approach | When to Use | Tools | Tradeoffs |
|---|---|---|---|
| Vision-based | Simple documents, forms, handwriting, visual layouts | GPT-4V, Claude Vision, Gemini | Best quality for simple docs, higher cost per page |
| OCR + Text | Dense text, structured extraction, high volume | Tesseract, AWS Textract, Azure DI | Lower cost, loses visual context |
| Hybrid | Complex layouts, tables with text, mixed content | Unstructured, Docling | Best coverage, more complex pipeline |
Modern vision-language models (VLMs) often understand documents better from page images than from OCR text. For simple documents such as forms, invoices, or receipts, rendering pages as images and passing them to Claude Vision or GPT-4V often yields better results than a traditional OCR pipeline, because plain OCR output tends to lose the layout context, reading order, and visual cues that VLMs can use directly.
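The approach table above can be read as a routing rule. The following is a minimal sketch under stated assumptions: the `DocumentTraits` fields, the `choose_approach` name, and the order of the checks are all hypothetical and would need tuning against real documents and budgets.

```python
from dataclasses import dataclass


@dataclass
class DocumentTraits:
    """Hypothetical per-document signals; the field names are assumptions."""
    has_handwriting: bool = False
    has_complex_layout: bool = False  # e.g. tables mixed with text
    is_dense_text: bool = False
    high_volume: bool = False


def choose_approach(traits: DocumentTraits) -> str:
    """Map document traits to one of the three approaches in the table."""
    if traits.has_complex_layout:
        return "hybrid"
    if traits.has_handwriting:
        return "vision"
    if traits.is_dense_text or traits.high_volume:
        return "ocr_text"
    # Simple forms, invoices, and receipts fall through to the vision route.
    return "vision"
```

Checking complex layouts first reflects the table's "best coverage" note for the hybrid route; a cost-sensitive pipeline might reasonably order the checks differently.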
Document Parsing Tools
Docling
IBM's layout-aware document parsing. Preserves structure, tables, and reading order. Outputs to markdown or JSON.
Unstructured
Multi-format document extraction. Handles PDFs, Word, HTML, images. Auto-detects document type.
LlamaParse
LlamaIndex's document parser. Optimized for LLM consumption with semantic chunking.
Marker
PDF to markdown converter. Preserves formatting, handles multi-column layouts, extracts equations.
OCR & Extraction Services
| Service | Provider | Strengths | Best For |
|---|---|---|---|
| AWS Textract | AWS | Forms, tables, queries, expense analysis | Structured extraction, AWS ecosystem |
| Azure Document Intelligence | Microsoft | Pre-built models for invoices, receipts, IDs | Enterprise documents, Azure ecosystem |
| Google Document AI | Google | Custom document processors, entity extraction | Custom document types, GCP ecosystem |
| Tesseract | Open Source | 100+ languages, LSTM engine, free | Self-hosted, cost-sensitive, privacy |
Document Types & Strategies
PDFs
- Text-based: Direct text extraction, preserve structure
- Scanned: OCR required, consider image quality
- Mixed: Hybrid approach, detect type per page
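Per-page type detection for mixed PDFs can be sketched as follows. This assumes you already have the text layer of each page as a string from whichever PDF extractor you use. The function names and the `min_chars` threshold are illustrative assumptions, not established values.

```python
def classify_pages(page_texts: list[str], min_chars: int = 50) -> list[str]:
    """Label each page 'text' or 'scanned' from its extracted text layer.

    page_texts holds whatever a PDF text extractor returned per page; an
    empty or near-empty string usually means the page is just an image.
    min_chars is an assumed threshold and should be tuned on real data.
    """
    labels = []
    for text in page_texts:
        visible = "".join(text.split())  # ignore whitespace-only text layers
        labels.append("text" if len(visible) >= min_chars else "scanned")
    return labels


def classify_document(page_texts: list[str], min_chars: int = 50) -> str:
    """Return 'text', 'scanned', or 'mixed' for the whole document."""
    labels = set(classify_pages(page_texts, min_chars))
    if len(labels) > 1:
        return "mixed"
    # A document with no pages has nothing to extract directly.
    return labels.pop() if labels else "scanned"
```

Pages labeled `scanned` would then go to OCR or a vision model, while `text` pages can use direct extraction.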
Forms & Invoices
- Structured: Key-value extraction, field mapping
- Semi-structured: Template matching, anchor detection
- Handwritten: Vision models preferred over OCR
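Anchor detection for semi-structured forms can be illustrated with plain regular expressions over OCR text. The anchor labels in `ANCHORS` and the assumption that each value sits on the same line as its label are hypothetical. Real templates need their own anchors, and often positional information as well.

```python
import re

# Hypothetical anchors for an invoice; real templates need their own labels.
ANCHORS = {
    "invoice_number": r"Invoice\s*(?:No\.?|Number|#)",
    "date": r"(?:Invoice\s+)?Date",
    "total": r"Total(?:\s+Due)?",
}


def extract_fields(text: str, anchors: dict[str, str] = ANCHORS) -> dict[str, str]:
    """Find each anchor label and capture the value after it on the same line."""
    fields = {}
    for name, label in anchors.items():
        # Label at line start, optional ':' or '-', then the value to end of line.
        pattern = rf"^\s*{label}\s*[:\-]?\s*(.+?)\s*$"
        match = re.search(pattern, text, re.IGNORECASE | re.MULTILINE)
        if match:
            fields[name] = match.group(1)
    return fields
```

For example, calling `extract_fields` on text containing the line `Invoice No: INV-0042` maps `invoice_number` to `INV-0042`. Fields whose anchor is not found are simply omitted.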
Tables & Charts
- Tables: Cell detection, header recognition
- Charts: Extract as images for VLM analysis
- Data: Normalize to structured formats (CSV, JSON)
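Once cells have been detected, normalizing a table to CSV or JSON needs only the standard library. This sketch assumes the table arrives as a list of rows with the header in the first row; the function names are illustrative.

```python
import csv
import io
import json


def table_to_records(rows: list[list[str]]) -> list[dict[str, str]]:
    """Treat the first row as the header and map each later row to a dict."""
    header = [cell.strip() for cell in rows[0]]
    records = []
    for row in rows[1:]:
        cells = [cell.strip() for cell in row]
        # Pad short rows so every record has every column.
        cells += [""] * (len(header) - len(cells))
        # zip() drops any cells beyond the header width.
        records.append(dict(zip(header, cells)))
    return records


def records_to_csv(records: list[dict[str, str]]) -> str:
    """Serialize records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def records_to_json(records: list[dict[str, str]]) -> str:
    """Serialize records to a JSON array of objects."""
    return json.dumps(records, indent=2)
```

All values stay strings here. Type conversion for numbers or dates is left out deliberately, since it depends on the document.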
Multi-page Documents
- Segmentation: Split by logical sections
- Cross-references: Maintain page context
- TOC: Use table of contents for structure
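When a table of contents is available, segmentation can use its page numbers directly. The sketch below assumes the TOC has already been parsed into `(title, start_page)` pairs with 1-based page numbers, and that `pages` holds the text of each page in order. Both inputs and the function name are hypothetical.

```python
def segment_by_toc(pages: list[str], toc: list[tuple[str, int]]) -> list[dict]:
    """Split pages into sections using (title, start_page) TOC entries.

    Each section runs from its start page to the page before the next
    entry, and keeps its page range so citations can refer back to it.
    """
    entries = sorted(toc, key=lambda entry: entry[1])
    sections = []
    for index, (title, start) in enumerate(entries):
        if index + 1 < len(entries):
            end = entries[index + 1][1] - 1
        else:
            end = len(pages)
        end = max(end, start)  # two entries may share a start page
        sections.append({
            "title": title,
            "pages": (start, end),
            "text": "\n".join(pages[start - 1:end]),
        })
    return sections
```

Keeping the page range on each section is one simple way to maintain page context for cross-references.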
RAG Integration Patterns
Document Chunking
- Semantic boundaries (paragraphs, sections)
- Preserve table integrity
- Include surrounding context
- Respect page breaks when relevant
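A minimal chunker that respects paragraph boundaries and never splits a table might look like this. It assumes the parsed document is markdown-like text whose paragraphs and tables are separated by blank lines. The `max_chars` budget and the function name are illustrative.

```python
def chunk_blocks(text: str, max_chars: int = 800) -> list[str]:
    """Greedily pack blank-line-separated blocks into chunks.

    Blocks (paragraphs or whole markdown tables) are never split, so a
    table stays intact even when it alone exceeds max_chars.
    """
    blocks = [block.strip() for block in text.split("\n\n") if block.strip()]
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for block in blocks:
        # Start a new chunk when adding this block would overflow the budget.
        if current and size + len(block) > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(block)
        size += len(block)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

This sketch does not add overlap between chunks. Carrying the last block of one chunk into the next would be one way to include surrounding context.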
Metadata Extraction
- Document title, date, author
- Page numbers for citations
- Section headers for filtering
- Document type classification
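One way to carry these metadata fields through a RAG pipeline is to attach them to each chunk. The `Chunk` class and `filter_chunks` helper below are hypothetical names for illustration; a real vector store would usually hold the same fields as filterable metadata.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    """A retrievable chunk plus the metadata fields listed above."""
    text: str
    title: str
    doc_type: str
    page: int
    section: str = ""

    def citation(self) -> str:
        """Format a human-readable citation from title and page number."""
        return f"{self.title}, p. {self.page}"


def filter_chunks(chunks: list[Chunk], *, doc_type: Optional[str] = None,
                  section: Optional[str] = None) -> list[Chunk]:
    """Keep chunks matching the given document type and/or section header."""
    return [
        chunk for chunk in chunks
        if (doc_type is None or chunk.doc_type == doc_type)
        and (section is None or chunk.section == section)
    ]
```

Filtering before similarity search narrows the candidate set, and `citation()` shows why page numbers are worth keeping on every chunk.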
Multi-modal Embeddings
- Text + image embeddings
- Table as separate vectors
- Chart descriptions
- Cross-modal retrieval
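Cross-modal retrieval reduces to nearest-neighbor search once text, tables, and images share one embedding space. The toy sketch below assumes the vectors already come from some multimodal embedding model, which is not shown. The index layout and function names are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def search(query_vec: list[float],
           index: list[tuple[str, str, list[float]]],
           top_k: int = 2) -> list[tuple[str, str]]:
    """Rank (item_id, modality, vector) entries by similarity to the query."""
    scored = [
        (cosine(query_vec, vec), item_id, modality)
        for item_id, modality, vec in index
    ]
    scored.sort(reverse=True)
    return [(item_id, modality) for _, item_id, modality in scored[:top_k]]
```

Storing the modality next to each vector lets a text query return a chart or table, and lets the caller decide how to present each hit.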
See RAG Architecture for detailed retrieval patterns.
