RAG Architecture
RAG Pipeline Components
Document Ingestion
The first stage of RAG loads and preprocesses documents from various sources. Document loaders handle multiple formats, including PDF, HTML, Markdown, and code files. Preprocessing covers cleaning text, normalizing formats, and enriching content with metadata. Effective ingestion lays the foundation for high-quality retrieval by ensuring documents are properly structured and tagged with metadata such as title, author, date, and source; a loading sketch follows the list below.
- Document Loaders: PDF, HTML, Markdown, code files
- Chunking Strategies: Fixed-size, semantic, recursive
- Metadata Extraction: Title, author, date, source
- Preprocessing: Cleaning, normalization, enrichment
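A minimal ingestion sketch, assuming a directory of Markdown files; the path layout and metadata fields are illustrative, mirroring the list above:

```python
# Load Markdown files and attach basic metadata for downstream retrieval.
from datetime import datetime, timezone
from pathlib import Path

def load_markdown_docs(root: str) -> list[dict]:
    """Load every .md file under `root` as a document with metadata."""
    docs = []
    for path in Path(root).rglob("*.md"):
        text = path.read_text(encoding="utf-8")
        # Normalize whitespace so chunking sees clean, consistent text.
        text = "\n".join(line.rstrip() for line in text.splitlines()).strip()
        docs.append({
            "text": text,
            "metadata": {
                "source": str(path),
                "title": path.stem,  # crude title fallback
                "ingested_at": datetime.now(timezone.utc).isoformat(),
            },
        })
    return docs
```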
Embedding Generation
Transform document chunks into dense vector representations that capture semantic meaning. Embedding models range from lightweight Sentence Transformers models (384 dimensions) through OpenAI's embeddings (1536 or 3072 dimensions for text-embedding-3-small and -large) to Cohere's embed models (1024 dimensions). Model selection depends on accuracy requirements, latency constraints, and whether multilingual or domain-specific capabilities are needed. Batch processing enables efficient bulk embedding of large document collections; a batching sketch follows the list below.
- Embedding Models: OpenAI, Cohere, Sentence Transformers
- Dimensionality: 384, 768, 1536, 3072 dimensions
- Batch Processing: Efficient bulk embedding
- Model Selection: Multilingual, domain-specific
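A batch-embedding sketch using Sentence Transformers; the checkpoint named is a real 384-dimension model, but any of the options above can be substituted:

```python
# Bulk-embed chunks in batches; normalization enables cosine similarity search.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    # batch_size trades throughput against memory.
    embeddings = model.encode(
        chunks, batch_size=64, normalize_embeddings=True, show_progress_bar=True
    )
    return embeddings.tolist()
```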
Vector Storage
Store and index embeddings in specialized vector databases optimized for similarity search. Vector databases such as Pinecone, Weaviate, and Qdrant provide efficient nearest-neighbor search using indexing algorithms like HNSW or IVF. Metadata filtering narrows results by structured attributes, either before or after the vector search. Namespace organization supports multi-tenancy and versioning in production deployments; a storage sketch follows the list below.
- Vector Databases: Pinecone, Weaviate, Qdrant, Chroma
- Indexing Algorithms: HNSW, IVF, Flat
- Metadata Filtering: Pre-filtering, post-filtering
- Namespace Organization: Multi-tenancy, versioning
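A storage-and-query sketch with Chroma's in-memory client; the collection name, IDs, and toy 384-dimension vectors are illustrative, and parameter names may differ slightly between Chroma versions:

```python
import chromadb

client = chromadb.Client()  # in-memory; use PersistentClient(path=...) to persist
collection = client.create_collection("docs")

# Index chunks produced by the embedding step, with their metadata.
collection.add(
    ids=["doc1-chunk0", "doc1-chunk1"],
    embeddings=[[0.1] * 384, [0.2] * 384],
    documents=["first chunk text", "second chunk text"],
    metadatas=[{"source": "a.md"}, {"source": "a.md"}],
)

# Nearest-neighbor search with metadata pre-filtering.
results = collection.query(
    query_embeddings=[[0.1] * 384],
    n_results=2,
    where={"source": "a.md"},
)
```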
Retrieval
Query the vector database to find document chunks relevant to a given user question. Search strategies include pure similarity search, Maximal Marginal Relevance (MMR) for diversity, and hybrid search combining semantic and keyword matching. Re-ranking with cross-encoders refines the initial results. Query transformation techniques such as decomposition and rewriting improve retrieval quality. Result-fusion algorithms such as Reciprocal Rank Fusion combine multiple retrieval strategies; an RRF sketch follows the list below.
- Search Strategies: Similarity, MMR, hybrid search
- Re-ranking: Cross-encoder models, Cohere rerank
- Query Transformation: Decomposition, rewriting, expansion
- Result Fusion: Reciprocal Rank Fusion (RRF)
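Reciprocal Rank Fusion is compact enough to show in full; this sketch fuses two ranked ID lists using the k=60 constant from the original RRF paper (Cormack et al., 2009):

```python
# RRF: a document's fused score is the sum of 1/(k + rank) over all rankings.
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion([
    ["d3", "d1", "d2"],  # vector search ranking
    ["d1", "d4", "d3"],  # keyword (BM25) ranking
])
# d1 and d3 rise to the top because both retrievers rank them highly.
```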
Context Assembly
Format retrieved documents into context suitable for LLM consumption while staying within token limits. Result formatting optimizes context-window usage by prioritizing the most relevant information. Source citation tracks document provenance for transparency and verification. Context compression removes redundant or irrelevant sentences to make room for more useful information. Prompt engineering constructs system and user messages that steer the LLM toward accurate, grounded responses; an assembly sketch follows the list below.
- Result Formatting: Context window optimization
- Source Citation: Provenance tracking and attribution
- Context Compression: Removing irrelevant information
- Prompt Engineering: System/user message construction
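A context-assembly sketch; the character budget, citation markers, and prompt wording are assumptions rather than a standard:

```python
# Pack relevance-sorted chunks into a budget, tagging each with a citation.
def build_prompt(question: str, chunks: list[dict],
                 max_context_chars: int = 8000) -> str:
    parts, used = [], 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] (source: {chunk['metadata']['source']})\n{chunk['text']}\n"
        if used + len(entry) > max_context_chars:
            break  # budget spent; drop lower-ranked chunks
        parts.append(entry)
        used += len(entry)
    return (
        "Answer using only the context below. Cite sources as [n].\n\n"
        "Context:\n" + "\n".join(parts) + f"\nQuestion: {question}"
    )
```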
Generation
Use an LLM to synthesize the retrieved context into a coherent answer to the user's question. Model selection balances cost, latency, and quality across options such as GPT-4, Claude, or open-weight Llama models. Prompt strategies such as few-shot examples and chain-of-thought reasoning improve answer quality. Streaming delivers token-by-token responses for a better user experience. Answer synthesis uses multi-document reasoning to combine insights from several sources into one comprehensive response; a streaming sketch follows the list below.
- LLM Selection: GPT-4, Claude, Llama, Mixtral
- Prompt Strategy: Few-shot, chain-of-thought reasoning
- Streaming: Token-by-token response generation
- Answer Synthesis: Multi-document reasoning and integration
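A streaming-generation sketch with the OpenAI Python SDK; the model name is illustrative, and the prompt is assumed to come from the context-assembly step above:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def stream_answer(prompt: str) -> str:
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a grounded RAG assistant."},
            {"role": "user", "content": prompt},
        ],
        stream=True,
    )
    answer = []
    for chunk in stream:  # tokens arrive incrementally
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)  # show progress to the user
            answer.append(delta)
    return "".join(answer)
```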
Chunking Strategies
| Strategy | Description | Pros | Cons | Use Case |
|---|---|---|---|---|
| Fixed-Size | Split by character count or tokens | Simple, predictable | May break context | General purpose |
| Semantic | Split at natural boundaries (paragraphs, sentences) | Preserves meaning | Variable sizes | Articles, books |
| Recursive | Hierarchical splits (sections → paragraphs → sentences) | Context-aware | More complex | Structured documents |
| Document-Specific | Custom logic per document type | Optimal per type | Maintenance overhead | Code, tables, PDFs |
| Sliding Window | Overlapping chunks for context preservation | Preserves boundary context | More storage, duplicates | Legal docs, research |
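A sketch combining the fixed-size and sliding-window rows above; sizes are in characters for simplicity, but the same logic applies to tokens with a tokenizer:

```python
# Fixed-size chunks with overlap so context spanning a boundary survives.
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        start += chunk_size - overlap  # step back by `overlap` each time
    return chunks
```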
Advanced RAG Patterns
Multi-Query RAG
Generate multiple search queries from a single user question to improve retrieval coverage and reduce missed context; a code sketch follows below.
- LLM generates 3-5 query variations
- Parallel retrieval for each query
- Result deduplication and merging
- Improved recall on ambiguous queries
- Handles different phrasings
Use cases: complex user questions, ambiguous queries, multi-faceted topics, improving retrieval recall.
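A minimal multi-query sketch; `llm` and `retrieve` are hypothetical stand-ins for a generation call and a vector search, not a specific library's API:

```python
def multi_query_retrieve(question: str, llm, retrieve, n_queries: int = 4) -> list[dict]:
    prompt = (
        f"Rewrite the question below as {n_queries} differently phrased "
        f"search queries, one per line.\n\nQuestion: {question}"
    )
    queries = [q.strip() for q in llm(prompt).splitlines() if q.strip()]
    seen, merged = set(), []
    for query in [question] + queries:  # also retrieve for the original
        for doc in retrieve(query):
            if doc["id"] not in seen:   # deduplicate across result sets
                seen.add(doc["id"])
                merged.append(doc)
    return merged
```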
Self-RAG
The model critiques and refines its own retrieval and generation through reflection, improving answer quality iteratively; a code sketch follows below.
- Self-reflection on relevance
- Retrieval necessity determination
- Answer quality self-assessment
- Iterative refinement
- Reduced hallucinations
Use cases: high-accuracy requirements, fact-critical applications, medical and legal domains, academic research.
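A prompt-based approximation of the reflection loop; `llm` and `retrieve` (returning chunk strings) are hypothetical callables, and the actual Self-RAG paper trains dedicated reflection tokens rather than prompting for critiques:

```python
def self_rag_answer(question: str, llm, retrieve, max_rounds: int = 2) -> str:
    context = "\n".join(retrieve(question))
    answer = llm(f"Context:\n{context}\nQuestion: {question}")
    for _ in range(max_rounds):
        critique = llm(
            f"Context:\n{context}\nAnswer: {answer}\n"
            "Is the answer fully supported by the context and complete? "
            "Reply OK, or propose a better search query."
        )
        if critique.strip().upper() == "OK":
            break                                # grounded and complete
        context = "\n".join(retrieve(critique))  # refine retrieval
        answer = llm(f"Context:\n{context}\nQuestion: {question}")
    return answer
```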
Agentic RAG
Autonomous agents decide when and how to retrieve information, using tools and reasoning to answer complex questions; a code sketch follows below.
- Tool use for retrieval
- Multi-step reasoning
- Dynamic retrieval decisions
- Query planning and execution
- Verification and validation
Use cases: complex multi-hop questions, research assistants, data analysis tasks, multi-source synthesis.
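A stripped-down agent loop; the SEARCH/ANSWER action format and the `llm`/`retrieve` callables are illustrative assumptions, not a specific agent framework:

```python
def agentic_answer(question: str, llm, retrieve, max_steps: int = 4) -> str:
    notes: list[str] = []
    for _ in range(max_steps):
        decision = llm(
            "Decide the next action. Reply 'SEARCH: <query>' to gather more "
            "evidence, or 'ANSWER: <answer>' once you have enough.\n"
            f"Question: {question}\nEvidence so far:\n" + "\n".join(notes)
        )
        if decision.startswith("SEARCH:"):
            # The model chose to retrieve; add results to its working notes.
            notes.extend(retrieve(decision.removeprefix("SEARCH:").strip()))
        else:
            return decision.removeprefix("ANSWER:").strip()
    # Step budget exhausted: answer from whatever evidence was gathered.
    return llm("Evidence:\n" + "\n".join(notes) + f"\nQuestion: {question}")
```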
Corrective RAG (CRAG)
Evaluates retrieval quality and corrects weak results by re-retrieving, falling back to web search, or refining retrieved knowledge before generation; a code sketch follows below.
- Retrieval quality evaluation
- Web search fallback
- Knowledge refinement
- Ambiguity detection
- Adaptive retrieval strategy
Use cases: knowledge-intensive QA, fact verification, current events, handling outdated documents.
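A sketch of the corrective control flow; `retrieve`, `grade`, and `web_search` are hypothetical callables, and the CRAG paper uses a trained retrieval evaluator rather than an LLM prompt for grading:

```python
def corrective_retrieve(question: str, retrieve, grade, web_search) -> list[str]:
    docs = retrieve(question)
    verdict = grade(question, docs)  # "correct" | "incorrect" | "ambiguous"
    if verdict == "correct":
        return docs                  # retrieval is good; use it as-is
    if verdict == "incorrect":
        return web_search(question)  # discard retrieval, fall back to the web
    return docs + web_search(question)  # ambiguous: combine both sources
```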
GraphRAG
Combines knowledge graphs with vector search for structured relationship understanding and multi-hop reasoning; a code sketch follows below.
- Entity and relationship extraction
- Graph traversal for context
- Community detection
- Multi-hop reasoning
- Structured + unstructured fusion
Use cases: complex entity relationships, multi-hop questions, domain-specific knowledge, structured data integration.
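A graph-traversal sketch with networkx; the triples are hard-coded toy data standing in for LLM-extracted entities and relations:

```python
import networkx as nx

graph = nx.Graph()
for head, relation, tail in [
    ("AcmeDB", "developed_by", "Acme Corp"),  # toy triples
    ("Acme Corp", "acquired", "VectorWorks"),
    ("VectorWorks", "created", "FastIndex"),
]:
    graph.add_edge(head, tail, relation=relation)

def neighborhood_context(entity: str, hops: int = 2) -> list[str]:
    """Collect relationship statements within `hops` of an entity."""
    nodes = nx.single_source_shortest_path_length(graph, entity, cutoff=hops)
    return [f"{u} --{d['relation']}-- {v}"
            for u, v, d in graph.edges(nodes, data=True)]

# A question like "What did the acquirer of AcmeDB's developer create?"
# needs the multi-hop path AcmeDB -> Acme Corp -> VectorWorks -> FastIndex.
print(neighborhood_context("AcmeDB"))
```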
Hybrid / Fusion RAG
Combines multiple retrieval methods (vector, keyword, metadata) and fuses the results with algorithms such as RRF; a code sketch follows below.
- Multiple retrieval strategies
- Reciprocal Rank Fusion (RRF)
- Weighted combination
- Hybrid scoring
- Ensemble retrieval
Use cases: maximum-recall scenarios, diverse document types, multi-modal search, production RAG systems.
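A weighted-fusion sketch as an alternative to the RRF example earlier; the min-max normalization and the 0.7/0.3 weights are assumptions to tune per corpus:

```python
# Normalize each retriever's scores to [0, 1], then mix with fixed weights.
def weighted_fusion(score_maps: list[dict[str, float]],
                    weights: list[float]) -> list[str]:
    combined: dict[str, float] = {}
    for scores, weight in zip(score_maps, weights):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero
        for doc_id, s in scores.items():
            combined[doc_id] = combined.get(doc_id, 0.0) + weight * (s - lo) / span
    return sorted(combined, key=combined.get, reverse=True)

ranked = weighted_fusion(
    [{"d1": 0.91, "d2": 0.80}, {"d1": 12.1, "d3": 9.7}],  # vector vs. BM25
    weights=[0.7, 0.3],
)
```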
RAG Optimization Techniques
Retrieval Optimization
Enhance retrieval quality through advanced search techniques and query manipulation. Hybrid search combines semantic similarity with keyword matching for broader coverage. Query expansion generates multiple query variants to capture different phrasings. HyDE (Hypothetical Document Embeddings) generates a hypothetical answer and searches for documents similar to it. Parent-child retrieval matches on small chunks but returns the larger surrounding context. Multi-vector retrieval stores multiple embeddings per document for a richer representation; a HyDE sketch follows the list below.
- Hybrid Search: Combine semantic + keyword search
- Query Expansion: Generate multiple query variants
- HyDE: Generate hypothetical documents for search
- Parent-Child Retrieval: Retrieve small, return large context
- Multi-Vector Retrieval: Multiple embeddings per document
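A HyDE sketch; `llm` and `search_by_embedding` are hypothetical stand-ins, and the embedding model matches the earlier Sentence Transformers example:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def hyde_retrieve(question: str, llm, search_by_embedding, top_k: int = 5):
    # Generate a plausible (possibly wrong) answer, then search near *it*:
    # document-to-document similarity is often easier than query-to-document.
    hypothetical = llm(f"Write a short passage that answers: {question}")
    embedding = model.encode(hypothetical, normalize_embeddings=True)
    return search_by_embedding(embedding.tolist(), top_k=top_k)
```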
Context Optimization
Maximize the value of limited context windows through intelligent compression and organization. Context compression removes irrelevant sentences while preserving key information. Lost-in-the-Middle research (Liu et al., 2023) shows that LLMs attend most reliably to the start and end of the context, so critical information belongs at the edges. Context windowing adjusts context length dynamically to query complexity. Prompt caching cuts costs by reusing system prompts across requests. Token budgeting allocates the context budget across multiple retrieved documents; a reordering sketch follows the list below.
- Context Compression: Remove irrelevant sentences
- Lost in the Middle: Place key info at start/end
- Context Windowing: Dynamic context sizing
- Prompt Caching: Cache system prompts for cost savings
- Token Budgeting: Optimize context length allocation
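A Lost-in-the-Middle mitigation sketch that reorders relevance-sorted chunks so the strongest ones sit at the edges of the context and the weakest in the middle:

```python
def reorder_for_attention(chunks_by_relevance: list[str]) -> list[str]:
    """Input is sorted best-first; output puts the best items at both edges."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# ["a","b","c","d","e"] -> ["a","c","e","d","b"]:
# best ("a") first, runner-up ("b") last, weakest ("e") buried in the middle.
print(reorder_for_attention(["a", "b", "c", "d", "e"]))
```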
Generation Quality
Improve answer accuracy and reliability through validation and verification. Re-ranking with cross-encoder models scores retrieved chunks more accurately than the initial embedding similarity. Answer validation checks responses against source documents for grounding. Hallucination detection identifies when the LLM generates information not present in the context. Citation tracking maintains source attribution for transparency and fact-checking. Confidence scoring quantifies answer certainty so uncertain responses can be flagged for human review; a re-ranking sketch follows the list below.
- Re-ranking: Cross-encoder scoring for better ranking
- Answer Validation: Verify responses against sources
- Hallucination Detection: Identify unsupported claims
- Citation Tracking: Maintain source attribution
- Confidence Scoring: Quantify answer certainty
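A re-ranking sketch with a Sentence Transformers cross-encoder; the checkpoint named is a commonly used MS MARCO model, but any cross-encoder works:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, docs: list[str], top_k: int = 5) -> list[str]:
    # Score each (query, document) pair jointly: slower than embedding
    # similarity but more accurate, so apply it to a short candidate list.
    scores = reranker.predict([(query, doc) for doc in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```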
Managed RAG Platforms
Amazon Bedrock Knowledge Bases
Fully managed RAG service connecting foundation models to private data sources. It handles ingestion, chunking, embedding, vector storage, retrieval, and prompt augmentation end to end, with built-in citations; a code sketch follows below.
- S3, Confluence, SharePoint, Salesforce connectors
- Multiple vector stores (OpenSearch, Aurora, Pinecone, S3 Vectors)
- Semantic, hierarchical, fixed-size chunking
- Natural language to SQL for structured data
- Multimodal document parsing (tables, charts, images)
- Reranking and source attribution
Use cases: enterprise knowledge bases, document Q&A systems, customer support automation, internal search and discovery.
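A knowledge-base retrieval sketch with boto3; the knowledge base ID is a placeholder, and the call shape follows the bedrock-agent-runtime Retrieve API (verify field names against your SDK version):

```python
import boto3

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = client.retrieve(
    knowledgeBaseId="KB_ID_HERE",  # placeholder for your knowledge base ID
    retrievalQuery={"text": "What is our refund policy?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {"numberOfResults": 5}
    },
)
for result in response["retrievalResults"]:
    # Each result carries the chunk text and its source location.
    print(result["content"]["text"], result["location"])
```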
Azure AI Search
Microsoft's managed search service with vector search, semantic ranking, and an integrated skills pipeline for document cracking, OCR, and entity extraction, plus deep Azure OpenAI integration.
- Hybrid search (vector + keyword + semantic)
- Built-in document cracking and OCR
- AI enrichment pipeline (skills)
- Azure OpenAI integration
- Security trimming with AAD
- Geo-replicated indexes
Use cases: enterprise search portals, Azure-native RAG applications, document intelligence pipelines, Copilot-style assistants.
Google Vertex AI Search
Google Cloud's enterprise search and RAG platform with grounding in Google Search, website indexing, and unstructured document understanding.
- Grounding with Google Search
- Unstructured document understanding
- Website and sitemap indexing
- Vertex AI integration
- Multi-turn conversations
- Enterprise data connectors
Use cases: Google Cloud RAG applications, website search with AI, enterprise knowledge management, conversational search.
Cohere
Cohere's retrieval-augmented generation stack, built on Command R models optimized for RAG, with a built-in web search connector and enterprise connectors for common data sources.
- Command R/R+ models for RAG
- Built-in web search connector
- Enterprise data connectors
- Rerank API for relevance
- Multilingual support
- Citation generation
Use cases: multilingual RAG, web-augmented generation, enterprise chatbots, research assistants.
RAG Evaluation Metrics
| Metric | Category | What It Measures | Tools |
|---|---|---|---|
| Context Relevance | Retrieval Quality | Are retrieved documents relevant to query? | RAGAS, TruLens, LangSmith |
| Context Recall | Retrieval Quality | Did we retrieve all relevant documents? | RAGAS, Manual eval |
| Answer Relevance | Generation Quality | Does answer address the question? | RAGAS, LangSmith, Phoenix |
| Faithfulness | Generation Quality | Is answer grounded in retrieved context? | RAGAS, TruLens, Guardrails |
| Answer Correctness | End-to-End | Is the answer factually correct? | Human eval, RAGAS |
| Latency (P50, P95, P99) | Performance | Response time percentiles | OpenTelemetry, DataDog |
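A minimal faithfulness check in the spirit of the table above; `judge_llm` is a hypothetical callable, and frameworks like RAGAS and TruLens implement more rigorous versions of the same idea:

```python
# Split the answer into claims and ask a judge model whether each one is
# supported by the retrieved context; the score is the supported fraction.
def faithfulness_score(answer: str, context: str, judge_llm) -> float:
    claims = [s.strip() for s in answer.split(".") if s.strip()]
    supported = 0
    for claim in claims:
        verdict = judge_llm(
            f"Context:\n{context}\n\nClaim: {claim}\n"
            "Is the claim fully supported by the context? Answer yes or no."
        )
        supported += verdict.strip().lower().startswith("yes")
    return supported / len(claims) if claims else 0.0
```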
