RAG Architecture

RAG Pipeline Components

Document Ingestion

The first stage of RAG involves loading and preprocessing documents from various sources. Document loaders handle multiple formats including PDF, HTML, Markdown, and code files. Preprocessing includes cleaning text, normalizing formats, and enriching content with metadata. Effective ingestion sets the foundation for high-quality retrieval by ensuring documents are properly structured and tagged with relevant metadata like titles, authors, dates, and source information.

Key Features
  • Document Loaders: PDF, HTML, Markdown, code files
  • Chunking Strategies: Fixed-size, semantic, recursive
  • Metadata Extraction: Title, author, date, source
  • Preprocessing: Cleaning, normalization, enrichment
Similar Technologies
LangChain Loaders, LlamaIndex Readers, Unstructured, PyPDF, BeautifulSoup
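As a concrete sketch of the chunking and metadata-enrichment steps, a fixed-size character splitter with overlap might look like this (function and field names are illustrative, not from any particular library):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap,
    so content spanning a boundary appears in both neighboring chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def ingest(text: str, metadata: dict) -> list[dict]:
    """Attach source metadata (title, author, date, ...) to every chunk."""
    return [{"text": chunk, "chunk_id": i, **metadata}
            for i, chunk in enumerate(chunk_text(text))]
```

Tagging every chunk with its source metadata up front is what later enables metadata filtering and citation tracking downstream.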
Embedding Generation

Transform document chunks into dense vector representations that capture semantic meaning. Embedding models range from lightweight Sentence Transformers (384D) to OpenAI's text-embedding-3 models (1536D and 3072D) and Cohere's Embed v3 (1024D). Model selection depends on accuracy requirements, latency constraints, and whether multilingual or domain-specific capabilities are needed. Batch processing enables efficient bulk embedding of large document collections.

Key Features
  • Embedding Models: OpenAI, Cohere, Sentence Transformers
  • Dimensionality: 384, 768, 1536, 3072 dimensions
  • Batch Processing: Efficient bulk embedding
  • Model Selection: Multilingual, domain-specific
Similar Technologies
OpenAI Embeddings, Cohere Embed, BGE, E5, Instructor, SFR-Embedding
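The batching and similarity mechanics can be sketched without any external model; a real pipeline would replace the toy vectors below with output from an embedding API or a Sentence Transformers model:

```python
import math


def batched(items: list, batch_size: int):
    """Yield successive batches, the usual shape of a bulk embedding call."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Semantic similarity between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

Batching matters because embedding endpoints typically charge and rate-limit per request, so fewer, larger calls are both cheaper and faster for bulk ingestion.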
Vector Storage

Store and index embeddings in specialized vector databases optimized for similarity search. Vector databases like Pinecone, Weaviate, and Qdrant provide efficient nearest-neighbor search using indexing algorithms like HNSW or IVF. Metadata filtering enables pre-filtering or post-filtering to narrow results based on structured attributes. Namespace organization supports multi-tenancy and versioning for production deployments.

Key Features
  • Vector Databases: Pinecone, Weaviate, Qdrant, Chroma
  • Indexing Algorithms: HNSW, IVF, Flat
  • Metadata Filtering: Pre-filtering, post-filtering
  • Namespace Organization: Multi-tenancy, versioning
Similar Technologies
Pinecone, Weaviate, Qdrant, Chroma, Milvus, FAISS
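A minimal in-memory sketch shows the core contract of a vector store, exact ("Flat") nearest-neighbor search plus metadata pre-filtering; production databases replace the brute-force scan with HNSW or IVF indexes, and the class and parameter names here are illustrative:

```python
import math


class FlatIndex:
    """Brute-force ('Flat') vector index with metadata pre-filtering."""

    def __init__(self):
        self._items = []  # (vector, metadata) pairs

    def add(self, vector, metadata):
        self._items.append((vector, metadata))

    def search(self, query, k=3, where=None):
        """Return the k nearest items; `where` pre-filters on exact
        metadata matches before distances are computed."""
        def dist(v):
            return math.sqrt(sum((a - b) ** 2 for a, b in zip(query, v)))

        hits = [(dist(v), meta) for v, meta in self._items
                if where is None
                or all(meta.get(key) == val for key, val in where.items())]
        return sorted(hits, key=lambda h: h[0])[:k]
```

The `where` clause plays the role of namespace or tenant filtering: restricting the candidate set before the distance computation, rather than discarding results afterwards.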
Retrieval

Query the vector database to find relevant document chunks for a given user question. Search strategies include pure similarity search, Maximum Marginal Relevance (MMR) for diversity, and hybrid search combining semantic and keyword matching. Re-ranking with cross-encoders refines initial results. Query transformation techniques like decomposition and rewriting improve retrieval quality. Result fusion algorithms like Reciprocal Rank Fusion combine multiple retrieval strategies.

Key Features
  • Search Strategies: Similarity, MMR, hybrid search
  • Re-ranking: Cross-encoder models, Cohere rerank
  • Query Transformation: Decomposition, rewriting, expansion
  • Result Fusion: Reciprocal Rank Fusion (RRF)
Similar Technologies
Similarity Search, Hybrid Search, Cohere Rerank, ColBERT, BM25
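Of these strategies, MMR is compact enough to sketch in full. This is a simplified greedy implementation over raw vectors (real retrievers operate over a candidate pool fetched from the vector store first):

```python
import math


def _cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def mmr(query, docs, k=2, lam=0.5):
    """Maximal Marginal Relevance: greedily pick documents relevant to the
    query but not redundant with those already selected.
    Returns indices into `docs`; lam=1.0 reduces to pure similarity."""
    selected, remaining = [], list(range(len(docs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = _cos(query, docs[i])
            redundancy = max((_cos(docs[i], docs[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Lowering `lam` trades raw relevance for diversity, which is exactly what keeps near-duplicate chunks from crowding out complementary context.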
Context Assembly

Format retrieved documents into context suitable for LLM consumption while staying within token limits. Result formatting optimizes context window usage by prioritizing the most relevant information. Source citation tracks document provenance for transparency and verification. Context compression removes redundant or irrelevant sentences to fit more useful information. Prompt engineering constructs effective system and user messages that guide the LLM to generate accurate, grounded responses.

Key Features
  • Result Formatting: Context window optimization
  • Source Citation: Provenance tracking and attribution
  • Context Compression: Removing irrelevant information
  • Prompt Engineering: System/user message construction
Similar Technologies
LongContextReorder, LLMLingua, Context Compression, Prompt Templates
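Token budgeting and citation tagging can be sketched together. This toy version approximates token counts by whitespace word count; a real system would count with the model's tokenizer, and the field names are illustrative:

```python
def assemble_context(chunks, max_tokens=60):
    """Pack chunks (assumed sorted best-first) into a token budget,
    tagging each with a numbered citation marker."""
    parts, used = [], 0
    for n, chunk in enumerate(chunks, start=1):
        cost = len(chunk["text"].split())  # crude token estimate
        if used + cost > max_tokens:
            continue  # skip chunks that would blow the budget
        parts.append(f"[{n}] ({chunk['source']}) {chunk['text']}")
        used += cost
    return "\n\n".join(parts)
```

Numbering chunks at assembly time is what lets the model cite "[2]" in its answer and lets the application map that back to a source document.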
Generation

Use an LLM to synthesize retrieved context into a coherent answer addressing the user's question. Model selection balances cost, latency, and quality requirements across options like GPT-4, Claude, or open-source Llama models. Prompt strategies like few-shot learning and chain-of-thought reasoning improve answer quality. Streaming provides token-by-token responses for better user experience. Answer synthesis involves multi-document reasoning to combine insights from multiple sources into comprehensive responses.

Key Features
  • LLM Selection: GPT-4, Claude, Llama, Mixtral
  • Prompt Strategy: Few-shot, chain-of-thought reasoning
  • Streaming: Token-by-token response generation
  • Answer Synthesis: Multi-document reasoning and integration
Similar Technologies
GPT-4, Claude, Llama, Mixtral, Gemini
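The prompt-strategy piece can be sketched provider-independently. The role/content dict shape below is the common chat-message format; adapt it to your provider's SDK, and note the grounding instructions are just one reasonable phrasing:

```python
def build_messages(question, context, few_shot=()):
    """Assemble a chat-style message list that grounds the model in the
    retrieved context, with optional few-shot (question, answer) pairs."""
    system = ("Answer using ONLY the context below. Cite sources as [n]. "
              "If the context is insufficient, say you don't know.\n\n"
              f"Context:\n{context}")
    messages = [{"role": "system", "content": system}]
    for user_q, assistant_a in few_shot:  # optional few-shot examples
        messages.append({"role": "user", "content": user_q})
        messages.append({"role": "assistant", "content": assistant_a})
    messages.append({"role": "user", "content": question})
    return messages
```

Keeping the retrieved context in the system message and the question in the final user turn is a common convention that also plays well with prompt caching, since the system prefix changes less often than the question.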
Chunking Strategies

  • Fixed-Size: split by character count or tokens. Pros: simple, predictable. Cons: may break context. Use case: general purpose.
  • Semantic: split at natural boundaries (paragraphs, sentences). Pros: preserves meaning. Cons: variable sizes. Use case: articles, books.
  • Recursive: hierarchical splits (sections → paragraphs → sentences). Pros: context-aware. Cons: more complex. Use case: structured documents.
  • Document-Specific: custom logic per document type. Pros: optimal per type. Cons: maintenance overhead. Use case: code, tables, PDFs.
  • Sliding Window: overlapping chunks for context preservation. Pros: no context loss. Cons: more storage, duplicates. Use case: legal docs, research.
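The recursive strategy above can be sketched as follows. This is simplified: unlike production splitters, it does not merge small pieces back up toward the size limit, and the separator hierarchy is just one reasonable default:

```python
def recursive_split(text, max_len=100, separators=("\n\n", "\n", ". ", " ")):
    """Hierarchical splitting: try the coarsest separator first and recurse
    with the same separator list until every piece fits within max_len."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        if sep in text:
            pieces = []
            for part in text.split(sep):
                pieces.extend(recursive_split(part, max_len, separators))
            return [p for p in pieces if p.strip()]
    # no separator left: fall back to a hard character split
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

The key property is that section and paragraph boundaries are preferred over mid-sentence cuts, so chunks only get split arbitrarily when no natural boundary exists.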
Advanced RAG Patterns

Multi-Query RAG

Generate multiple search queries from a single user question to improve retrieval coverage and reduce missed context.

Key Features
  • LLM generates 3-5 query variations
  • Parallel retrieval for each query
  • Result deduplication and merging
  • Improved recall on ambiguous queries
  • Handles different phrasings
Use Cases
  • Complex user questions
  • Ambiguous queries
  • Multi-faceted topics
  • Improving retrieval recall
Related Patterns
Query Decomposition, HyDE, Query Expansion
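The merge step is the mechanical part of this pattern. A sketch, assuming the query variants have already been generated by an LLM, where `retrieve` stands in for any retriever returning (doc_id, score) pairs:

```python
def multi_query_retrieve(query_variants, retrieve):
    """Run retrieval for every query variant and merge the results,
    deduplicating by document id and keeping each document's best score."""
    best = {}
    for query in query_variants:
        for doc_id, score in retrieve(query):
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)
```

Keeping the maximum score per document is one simple merge policy; rank-based fusion such as RRF (covered under Fusion RAG below) is a common alternative when scores from different retrievers are not comparable.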
Self-RAG

The model critiques and refines its own retrieval and generation through reflection, iteratively improving answer quality.

Key Features
  • Self-reflection on relevance
  • Retrieval necessity determination
  • Answer quality self-assessment
  • Iterative refinement
  • Reduced hallucinations
Use Cases
  • High-accuracy requirements
  • Fact-critical applications
  • Medical/legal domains
  • Academic research
Related Patterns
FLARE, Active RAG, Corrective RAG
Agentic RAG

Autonomous agents decide when and how to retrieve information, using tools and reasoning to answer complex questions.

Key Features
  • Tool use for retrieval
  • Multi-step reasoning
  • Dynamic retrieval decisions
  • Query planning and execution
  • Verification and validation
Use Cases
  • Complex multi-hop questions
  • Research assistants
  • Data analysis tasks
  • Multi-source synthesis
Related Patterns
ReAct, Plan-and-Execute, Multi-Query RAG
Corrective RAG (CRAG)

Evaluates retrieval quality and corrects poor retrievals by re-retrieving, falling back to web search, or adjusting generation to produce better results.

Key Features
  • Retrieval quality evaluation
  • Web search fallback
  • Knowledge refinement
  • Ambiguity detection
  • Adaptive retrieval strategy
Use Cases
  • Knowledge-intensive QA
  • Fact verification
  • Current events
  • Handling outdated docs
Related Patterns
Self-RAG, Active RAG, FLARE
Graph RAG

Combines knowledge graphs with vector search for structured relationship understanding and multi-hop reasoning.

Key Features
  • Entity and relationship extraction
  • Graph traversal for context
  • Community detection
  • Multi-hop reasoning
  • Structured + unstructured fusion
Use Cases
  • Complex entity relationships
  • Multi-hop questions
  • Domain-specific knowledge
  • Structured data integration
Related Patterns
Knowledge Graph QA, Hybrid Search, Multi-Vector RAG
Fusion RAG

Combines multiple retrieval methods (vector, keyword, metadata) and fuses results using algorithms like RRF.

Key Features
  • Multiple retrieval strategies
  • Reciprocal Rank Fusion (RRF)
  • Weighted combination
  • Hybrid scoring
  • Ensemble retrieval
Use Cases
  • Maximum recall scenarios
  • Diverse document types
  • Multi-modal search
  • Production RAG systems
Related Patterns
Hybrid Search, Multi-Query RAG, Ensemble Retrieval
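RRF itself is only a few lines. Each document's fused score is the sum of 1 / (k + rank) over every ranked list it appears in, with rank starting at 1; k = 60 is the constant from the original RRF paper:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked result lists into one ordering.
    `rankings` is a list of lists of doc ids, each sorted best-first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF uses only ranks, it sidesteps the problem that raw scores from vector search and BM25 live on incomparable scales, which is why it is the default fusion choice in hybrid setups.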
RAG Optimization Techniques

Retrieval Optimization

Enhance retrieval quality through advanced search techniques and query manipulation. Hybrid search combines semantic similarity with keyword matching for comprehensive coverage. Query expansion generates multiple query variants to capture different phrasings. HyDE (Hypothetical Document Embeddings) generates hypothetical answers and searches for similar documents. Parent-child retrieval finds small chunks but returns larger surrounding context. Multi-vector retrieval stores multiple embeddings per document for richer representation.

Key Features
  • Hybrid Search: Combine semantic + keyword search
  • Query Expansion: Generate multiple query variants
  • HyDE: Generate hypothetical documents for search
  • Parent-Child Retrieval: Retrieve small, return large context
  • Multi-Vector Retrieval: Multiple embeddings per document
Similar Technologies
Hybrid Search, HyDE, Query Expansion, Dense-X Retrieval, ColBERT
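Parent-child retrieval is easy to sketch end to end: rank small child chunks for precision, then return the larger parent sections they belong to. Dot-product similarity and the field names below are illustrative:

```python
def parent_child_search(query, children, parents, top_k=1):
    """Rank child chunks by similarity, then return the deduplicated
    full parent sections they belong to.
    `children`: dicts with 'vector' and 'parent_id'.
    `parents`: mapping of parent_id to full section text."""
    ranked = sorted(
        children,
        key=lambda c: sum(a * b for a, b in zip(query, c["vector"])),
        reverse=True)
    results, seen = [], set()
    for child in ranked:
        parent_id = child["parent_id"]
        if parent_id not in seen:
            seen.add(parent_id)
            results.append(parents[parent_id])
        if len(results) == top_k:
            break
    return results
```

Small chunks embed more precisely, but the surrounding section is usually what the LLM needs to answer well; this split gets both.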
Context Optimization

Maximize the value of limited context windows through intelligent compression and organization. Context compression removes irrelevant sentences while preserving key information. Research on the "lost in the middle" effect shows that placing critical information at the start or end of the context improves LLM attention. Context windowing dynamically adjusts context length based on query complexity. Prompt caching reduces costs by reusing system prompts across requests. Token budgeting optimizes context allocation across multiple retrieved documents.

Key Features
  • Context Compression: Remove irrelevant sentences
  • Lost in the Middle: Place key info at start/end
  • Context Windowing: Dynamic context sizing
  • Prompt Caching: Cache system prompts for cost savings
  • Token Budgeting: Optimize context length allocation
Similar Technologies
LLMLingua, Selective Context, LongContextReorder, Prompt Caching
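The "lost in the middle" mitigation reduces to a reordering: interleave a best-first list so the most relevant documents sit at the start and end of the context and the least relevant land in the middle. A sketch of the same idea LongContextReorder implements:

```python
def reorder_for_attention(docs_best_first):
    """Reorder a best-first document list so top results occupy the
    edges of the context window, where LLM attention is strongest."""
    front, back = [], []
    for i, doc in enumerate(docs_best_first):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]
```

Note the output is a permutation of the input: nothing is dropped, only repositioned, so this composes cleanly with compression and token budgeting.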
Quality Optimization

Improve answer accuracy and reliability through validation and verification techniques. Re-ranking with cross-encoder models scores retrieved chunks more accurately than initial embedding similarity. Answer validation checks responses against source documents for grounding. Hallucination detection identifies when LLMs generate information not present in context. Citation tracking maintains source attribution for transparency and fact-checking. Confidence scoring quantifies answer certainty to flag uncertain responses for human review.

Key Features
  • Re-ranking: Cross-encoder scoring for better ranking
  • Answer Validation: Verify responses against sources
  • Hallucination Detection: Identify unsupported claims
  • Citation Tracking: Maintain source attribution
  • Confidence Scoring: Quantify answer certainty
Similar Technologies
Cohere Rerank, Cross-Encoder, RAGAS, TruLens, Guardrails AI
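A deliberately crude grounding heuristic illustrates the shape of answer validation: score the fraction of answer sentences whose words all appear in the retrieved context. Real systems use an LLM judge or NLI model (as RAGAS and TruLens do); this only shows the idea:

```python
def grounding_score(answer: str, context: str) -> float:
    """Fraction of answer sentences fully supported by context words.
    Word-overlap is a stand-in for real entailment checking."""
    context_words = set(context.lower().split())
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    if not sentences:
        return 0.0
    supported = sum(all(w in context_words for w in s.lower().split())
                    for s in sentences)
    return supported / len(sentences)
```

Even this toy version shows where confidence scoring plugs in: a low score flags the answer for re-retrieval or human review rather than being returned as-is.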
Managed RAG Platforms

Amazon Bedrock Knowledge Bases

Fully managed RAG service connecting foundation models to private data sources. Handles ingestion, chunking, embedding, vector storage, retrieval, and prompt augmentation end-to-end with built-in citations.

Key Features
  • S3, Confluence, SharePoint, Salesforce connectors
  • Multiple vector stores (OpenSearch, Aurora, Pinecone, S3 Vectors)
  • Semantic, hierarchical, fixed-size chunking
  • Natural language to SQL for structured data
  • Multimodal document parsing (tables, charts, images)
  • Reranking and source attribution
Use Cases
  • Enterprise knowledge bases
  • Document Q&A systems
  • Customer support automation
  • Internal search and discovery
Alternatives
Azure AI Search, Vertex AI Search, LangChain + Pinecone
Azure AI Search (formerly Cognitive Search)

Microsoft's managed search service with vector search, semantic ranking, and integrated skills for document cracking, OCR, and entity extraction. Deep Azure OpenAI integration.

Key Features
  • Hybrid search (vector + keyword + semantic)
  • Built-in document cracking and OCR
  • AI enrichment pipeline (skills)
  • Azure OpenAI integration
  • Security trimming with AAD
  • Geo-replicated indexes
Use Cases
  • Enterprise search portals
  • Azure-native RAG applications
  • Document intelligence pipelines
  • Copilot-style assistants
Alternatives
Bedrock Knowledge Bases, Vertex AI Search, Elasticsearch
Google Vertex AI Search

Google Cloud's enterprise search and RAG platform with grounding in Google Search, website indexing, and unstructured document understanding.

Key Features
  • Grounding with Google Search
  • Unstructured document understanding
  • Website and sitemap indexing
  • Vertex AI integration
  • Multi-turn conversations
  • Enterprise data connectors
Use Cases
  • Google Cloud RAG applications
  • Website search with AI
  • Enterprise knowledge management
  • Conversational search
Alternatives
Bedrock Knowledge Bases, Azure AI Search, Algolia
Cohere RAG

Cohere's retrieval-augmented generation with Command R models optimized for RAG, built-in web search connector, and enterprise connectors for common data sources.

Key Features
  • Command R/R+ models for RAG
  • Built-in web search connector
  • Enterprise data connectors
  • Rerank API for relevance
  • Multilingual support
  • Citation generation
Use Cases
  • Multilingual RAG
  • Web-augmented generation
  • Enterprise chatbots
  • Research assistants
Alternatives
Bedrock Knowledge Bases, OpenAI + Bing, Perplexity API
RAG Evaluation Metrics

  • Context Relevance (retrieval quality): are retrieved documents relevant to the query? Tools: RAGAS, TruLens, LangSmith.
  • Context Recall (retrieval quality): did we retrieve all relevant documents? Tools: RAGAS, manual eval.
  • Answer Relevance (generation quality): does the answer address the question? Tools: RAGAS, LangSmith, Phoenix.
  • Faithfulness (generation quality): is the answer grounded in the retrieved context? Tools: RAGAS, TruLens, Guardrails.
  • Answer Correctness (end-to-end): is the answer factually correct? Tools: human eval, RAGAS.
  • Latency P50/P95/P99 (performance): response time percentiles. Tools: OpenTelemetry, DataDog.
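The latency percentiles in the list above can be computed directly from raw request timings; a nearest-rank sketch using only the standard library (observability backends like DataDog compute these for you in practice):

```python
import math


def percentile(latencies_ms, p):
    """Nearest-rank percentile (e.g. p=95 for P95) over latency samples."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```

Tracking P95 and P99 alongside P50 matters for RAG specifically because retrieval, re-ranking, and generation latencies compound, and tail latency is what users actually feel.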