Will Percey — Portfolio

Document AI

Updated Dec 2025

Processing Approaches

Approach	When to Use	Tools	Tradeoffs
Vision-based	Simple documents, forms, handwritten, visual layouts	GPT-4V, Claude Vision, Gemini	Best quality for simple docs, higher cost per page
OCR + Text	Dense text, structured extraction, high volume	Tesseract, AWS Textract, Azure DI	Lower cost, loses visual context
Hybrid	Complex layouts, tables with text, mixed content	Unstructured, Docling	Best coverage, more complex pipeline

Modern vision-language models (VLMs) often understand documents better from page images than from OCR text. For simple documents such as forms, invoices, or receipts, converting pages to images and sending them to Claude Vision or GPT-4V often yields better results than a traditional OCR pipeline, because OCR discards layout, reading order, and visual cues that a VLM reads directly from the image.
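The routing logic implied by the table above can be sketched as a small function. This is a minimal illustration, not a definitive policy: `DocumentProfile`, its fields, and the `high_volume_pages` threshold are hypothetical names and values chosen for the example, and the profile is assumed to come from an upstream classifier.

```python
from dataclasses import dataclass


@dataclass
class DocumentProfile:
    """Hypothetical summary of a document, produced by an upstream classifier."""
    has_handwriting: bool = False
    has_tables: bool = False
    has_complex_layout: bool = False
    is_dense_text: bool = False
    page_count: int = 1


def choose_approach(profile: DocumentProfile, high_volume_pages: int = 50) -> str:
    """Return 'vision', 'ocr_text', or 'hybrid', following the approach table."""
    # Complex layouts, or tables mixed with dense text, go to the hybrid pipeline.
    if profile.has_complex_layout or (profile.has_tables and profile.is_dense_text):
        return "hybrid"
    # Handwriting is listed under vision-based processing.
    if profile.has_handwriting:
        return "vision"
    # Dense text or a high page count favors the cheaper OCR + text route.
    # The threshold is an assumed value, not a recommendation.
    if profile.is_dense_text or profile.page_count >= high_volume_pages:
        return "ocr_text"
    # Simple forms, invoices, and receipts default to a vision model.
    return "vision"
```

The order of the checks encodes a priority (coverage first, then quality, then cost); a different workload might reasonably reorder them.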

Document Parsing Tools

Docling

IBM's layout-aware document parsing. Preserves structure, tables, and reading order. Outputs to markdown or JSON.

Layout-aware · Tables · Open Source

Unstructured

Multi-format document extraction. Handles PDFs, Word, HTML, images. Auto-detects document type.

Multi-format · Auto-detect · RAG-ready

LlamaParse

LlamaIndex's document parser. Optimized for LLM consumption with semantic chunking.

LLM-optimized · Semantic Chunks

Marker

PDF to markdown converter. Preserves formatting, handles multi-column layouts, extracts equations.

PDF · Markdown · Equations

OCR & Extraction Services

Service	Provider	Strengths	Best For
AWS Textract	AWS	Forms, tables, queries, expense analysis	Structured extraction, AWS ecosystem
Azure Document Intelligence	Microsoft	Pre-built models for invoices, receipts, IDs	Enterprise documents, Azure ecosystem
Google Document AI	Google	Custom document processors, entity extraction	Custom document types, GCP ecosystem
Tesseract	Open Source	100+ languages, LSTM engine, free	Self-hosted, cost-sensitive, privacy

Document Types & Strategies

PDFs

Text-based: Direct text extraction, preserve structure
Scanned: OCR required, consider image quality
Mixed: Hybrid approach, detect type per page
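Per-page type detection for PDFs can be sketched with a simple heuristic: if direct text extraction returns almost nothing for a page, treat it as scanned. The sketch below assumes the per-page text has already been extracted by some PDF library; the function names and the `min_chars` threshold are illustrative assumptions.

```python
def classify_page(extracted_text: str, min_chars: int = 50) -> str:
    """Label a page 'text' if direct extraction found enough characters, else 'scanned'."""
    # Ignore whitespace so pages with only layout spacing count as empty.
    visible = "".join(extracted_text.split())
    return "text" if len(visible) >= min_chars else "scanned"


def classify_pdf(page_texts: list[str], min_chars: int = 50) -> str:
    """Return 'text-based', 'scanned', or 'mixed' for a whole document."""
    if not page_texts:
        raise ValueError("document has no pages")
    labels = {classify_page(text, min_chars) for text in page_texts}
    if labels == {"text"}:
        return "text-based"
    if labels == {"scanned"}:
        return "scanned"
    # Both labels present: route each page separately (hybrid approach).
    return "mixed"
```

For a mixed document, `classify_page` can be applied page by page so that only the scanned pages are sent through OCR.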

Forms & Invoices

Structured: Key-value extraction, field mapping
Semi-structured: Template matching, anchor detection
Handwritten: Vision models preferred over OCR
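Anchor detection and key-value extraction can be illustrated with regular expressions over OCR text. This is a minimal sketch under strong assumptions: each field sits on one line as "label: value", and the labels in `FIELD_ANCHORS` are hypothetical examples rather than a complete invoice schema.

```python
import re

# Hypothetical anchors: the label text that precedes each field on the form.
FIELD_ANCHORS = {
    "invoice_number": r"Invoice\s*(?:No\.?|Number|#)",
    "date": r"(?:Invoice\s+)?Date",
    "total": r"Total(?:\s+Due)?",
}


def extract_fields(text: str, anchors: dict[str, str] = FIELD_ANCHORS) -> dict[str, str]:
    """Map field names to the value that follows each anchor on the same line."""
    fields: dict[str, str] = {}
    for name, anchor in anchors.items():
        # Anchor at the start of a line, an optional colon, then the value
        # up to the end of that line.
        pattern = re.compile(
            rf"^[ \t]*{anchor}[ \t]*:?[ \t]*(.+?)[ \t]*$",
            re.IGNORECASE | re.MULTILINE,
        )
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1)
    return fields
```

Fields whose anchor is not found are simply omitted, which lets a caller detect missing values and fall back to a vision model for that document.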

Tables & Charts

Tables: Cell detection, header recognition
Charts: Extract as images for VLM analysis
Data: Normalize to structured formats (CSV, JSON)
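Normalizing a detected table into CSV or JSON might look like the following sketch. It assumes a parser has already produced the table as a list of rows of cell strings with the header in the first row; the function names are illustrative.

```python
import csv
import io
import json


def table_to_records(rows: list[list[str]]) -> list[dict[str, str]]:
    """Treat the first row as the header and return one dict per data row."""
    if not rows:
        return []
    header = [cell.strip() for cell in rows[0]]
    records = []
    for row in rows[1:]:
        # Pad short rows so every record has every column.
        cells = [cell.strip() for cell in row] + [""] * (len(header) - len(row))
        records.append(dict(zip(header, cells)))
    return records


def records_to_csv(records: list[dict[str, str]]) -> str:
    """Serialize records to CSV text, using the first record's keys as columns."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def records_to_json(records: list[dict[str, str]]) -> str:
    """Serialize records to a JSON array of objects."""
    return json.dumps(records, indent=2)
```

Values are kept as strings here; type coercion (numbers, dates, currencies) is left out because it depends on the document type.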

Multi-page Documents

Segmentation: Split by logical sections
Cross-references: Maintain page context
TOC: Use table of contents for structure
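Splitting a multi-page document by logical sections while keeping page context could be sketched as below. The heading pattern is an assumption (numbered headings such as "1 Introduction" or "2.3 Methods"); real documents usually need layout cues or a table of contents instead, and the `Section` type is hypothetical.

```python
import re
from dataclasses import dataclass, field

# Assumption: section headings are numbered, e.g. "1 Introduction" or "2.3 Methods".
HEADING = re.compile(r"^\d+(?:\.\d+)*\s+\S.*$")
FRONT_MATTER = "(front matter)"


@dataclass
class Section:
    title: str
    start_page: int
    lines: list[str] = field(default_factory=list)
    pages: list[int] = field(default_factory=list)  # every page the section touches


def segment(pages: list[str]) -> list[Section]:
    """Split per-page text into sections, recording which pages each one spans."""
    sections: list[Section] = []
    current = Section(title=FRONT_MATTER, start_page=1)
    for page_number, text in enumerate(pages, start=1):
        for raw_line in text.splitlines():
            line = raw_line.strip()
            if not line:
                continue
            if HEADING.match(line):
                # Close the previous section unless it is empty front matter.
                if current.lines or current.title != FRONT_MATTER:
                    sections.append(current)
                current = Section(title=line, start_page=page_number, pages=[page_number])
            else:
                current.lines.append(line)
                if page_number not in current.pages:
                    current.pages.append(page_number)
    if current.lines or current.title != FRONT_MATTER:
        sections.append(current)
    return sections
```

Keeping `pages` on each section means a section that runs across a page boundary can still be cited with the correct page range.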

RAG Integration Patterns

Document Chunking

Semantic boundaries (paragraphs, sections)
Preserve table integrity
Include surrounding context
Respect page breaks when relevant
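The chunking rules above can be sketched as a single pass over parsed blocks. This assumes a parser has already labeled each block as a paragraph, table, or page break; the `(kind, text)` tuple format, the function name, and the `max_chars` budget are illustrative assumptions.

```python
def chunk_blocks(
    blocks: list[tuple[str, str]],
    max_chars: int = 800,
    respect_page_breaks: bool = False,
) -> list[str]:
    """Group (kind, text) blocks into chunks without ever splitting a table.

    kind is one of "paragraph", "table", or "page_break".
    """
    chunks: list[str] = []
    buffer: list[str] = []

    def flush() -> None:
        if buffer:
            chunks.append("\n\n".join(buffer))
            buffer.clear()

    for kind, text in blocks:
        if kind == "page_break":
            if respect_page_breaks:
                flush()
            continue
        if kind == "table":
            # Repeat the preceding paragraph as context; the table stays whole.
            context = buffer[-1] if buffer else ""
            flush()
            chunks.append("\n\n".join(part for part in (context, text) if part))
            continue
        # Paragraph: start a new chunk if adding it would exceed the budget.
        if buffer and sum(len(b) for b in buffer) + len(text) > max_chars:
            flush()
        buffer.append(text)
    flush()
    return chunks
```

The paragraph before a table is deliberately duplicated into the table's chunk, trading a little index size for a table that is still interpretable when retrieved on its own.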

Metadata Extraction

Document title, date, author
Page numbers for citations
Section headers for filtering
Document type classification
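One way to carry this metadata through a RAG index is to attach a small record to every chunk. The sketch below is illustrative only: `ChunkMetadata`, `filter_chunks`, and `cite` are hypothetical names, and the sample values in the usage note are made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChunkMetadata:
    title: str
    doc_type: str          # e.g. "invoice", "contract"
    page: int              # page number, used for citations
    section: str           # section header, used for filtering
    author: Optional[str] = None
    date: Optional[str] = None


Chunk = tuple[str, ChunkMetadata]


def filter_chunks(
    chunks: list[Chunk],
    *,
    doc_type: Optional[str] = None,
    section: Optional[str] = None,
) -> list[Chunk]:
    """Keep chunks that match every filter that was provided."""
    return [
        (text, meta)
        for text, meta in chunks
        if (doc_type is None or meta.doc_type == doc_type)
        and (section is None or meta.section == section)
    ]


def cite(meta: ChunkMetadata) -> str:
    """Format a short citation from the chunk's metadata."""
    return f"{meta.title}, p. {meta.page}"
```

For example, filtering a mixed index with `doc_type="contract"` before vector search narrows retrieval to contracts, and `cite` turns the winning chunk's metadata into a page-level reference.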

Multi-modal Embeddings

Text + image embeddings
Table as separate vectors
Chart descriptions
Cross-modal retrieval
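Cross-modal retrieval over separately embedded text, tables, and charts can be sketched with a plain cosine-similarity search. This assumes all modalities are embedded into one shared vector space by some multi-modal model that is not shown; `IndexedItem` and `search` are hypothetical names, and any vectors passed in are supplied by the caller.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexedItem:
    item_id: str
    modality: str        # "text", "table", "chart", or "image"
    content: str         # raw text, serialized table, or chart description
    vector: list[float]  # embedding in a shared space (assumed)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(
    index: list[IndexedItem],
    query_vector: list[float],
    top_k: int = 3,
    modality: Optional[str] = None,
) -> list[IndexedItem]:
    """Rank items by similarity to the query, optionally within one modality."""
    candidates = [item for item in index if modality is None or item.modality == modality]
    ranked = sorted(candidates, key=lambda item: cosine(query_vector, item.vector), reverse=True)
    return ranked[:top_k]
```

Storing each table and chart description as its own item, rather than folding it into the surrounding text chunk, lets a text query surface a table directly and lets the caller restrict results to one modality.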

See RAG Architecture for detailed retrieval patterns.

Processing Pipeline

1. Ingest → 2. Classify → 3. Parse → 4. Extract → 5. Chunk → 6. Embed
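The six stages can be composed as plain functions that each take and return a shared state dictionary. Every stage body below is a deliberately trivial placeholder (keyword classification, blank-line parsing, a length-based stand-in for an embedding) chosen only to make the sketch runnable; the names and the state keys are assumptions.

```python
from typing import Callable

Stage = Callable[[dict], dict]


def ingest(state: dict) -> dict:
    # Placeholder: a real stage would read bytes from a file or object store.
    return {**state, "raw": state["source"].strip()}


def classify(state: dict) -> dict:
    # Placeholder keyword rule standing in for a document-type classifier.
    doc_type = "invoice" if "invoice" in state["raw"].lower() else "generic"
    return {**state, "doc_type": doc_type}


def parse(state: dict) -> dict:
    # Placeholder: split on blank lines instead of layout-aware parsing.
    blocks = [b.strip() for b in state["raw"].split("\n\n") if b.strip()]
    return {**state, "blocks": blocks}


def extract(state: dict) -> dict:
    # Placeholder field extraction.
    fields = {"has_total": any("total" in b.lower() for b in state["blocks"])}
    return {**state, "fields": fields}


def chunk(state: dict) -> dict:
    # Placeholder: one chunk per parsed block.
    return {**state, "chunks": list(state["blocks"])}


def embed(state: dict) -> dict:
    # Placeholder vectors (chunk length), NOT a real embedding.
    return {**state, "vectors": [[float(len(c))] for c in state["chunks"]]}


PIPELINE: list[Stage] = [ingest, classify, parse, extract, chunk, embed]


def run_pipeline(source: str, stages: list[Stage] = PIPELINE) -> dict:
    """Run each stage in order, threading the state through."""
    state: dict = {"source": source}
    for stage in stages:
        state = stage(state)
    return state
```

Because every stage shares one signature, a stage can be swapped (for example, replacing `parse` with a layout-aware parser) or tested in isolation without touching the others.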