AI Architectures
Neural Network Architectures
Dense Transformer
Traditional transformer architecture where every token attends to every other token through self-attention. All parameters are activated for every input.
Key Characteristics
- Full self-attention mechanism
- All model parameters active for each token
- Uniform computation across all inputs
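The core operation described above can be sketched in a few lines of NumPy. The n×n score matrix is where "every token attends to every other token" shows up; shapes and weights here are illustrative, not from any particular model.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Full self-attention: every token attends to every other token."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv            # project tokens
    scores = q @ k.T / np.sqrt(k.shape[-1])     # (n, n) pairwise scores -> O(n^2)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over all tokens
    return weights @ v                          # weighted mix of value vectors

rng = np.random.default_rng(0)
n, d = 6, 8
x = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)
```

Note that every weight matrix participates for every token, which is the "all parameters active" property that mixture-of-experts designs relax.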
State Space Models (SSM)
Models like Mamba that use state space equations instead of attention, offering linear scaling with sequence length while maintaining long-range dependencies.
Key Characteristics
- Linear time complexity (vs quadratic for attention)
- Continuous-time state representations
- Selective state updates
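A minimal discretized state-space recurrence shows where the linear scaling comes from: one fixed-size state update per token, with no attention matrix. This sketch uses a single input channel and a fixed transition matrix A, whereas Mamba's selectivity makes the dynamics input-dependent.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Linear-time recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                    # O(n) in sequence length
        h = A @ h + B * x_t          # state update (constant cost per token)
        ys.append(C @ h)             # readout
    return np.array(ys)

rng = np.random.default_rng(1)
A = 0.9 * np.eye(4)                  # stable transition matrix (illustrative)
B = rng.normal(size=4)
C = rng.normal(size=4)
y = ssm_scan(rng.normal(size=16), A, B, C)
```

Because h has fixed size, the model carries long-range information through the state rather than by revisiting earlier tokens.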
Generative Models
Multimodal Architecture
Models that process and generate multiple modalities (text, image, audio, video) through shared or interconnected representations.
Key Characteristics
- Multiple encoder/decoder pathways for different modalities
- Cross-modal attention mechanisms
- Shared representation space
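The "shared representation space" idea can be sketched as two modality-specific projections into one embedding space, CLIP-style. The dimensions and random projections below are placeholders for trained encoders.

```python
import numpy as np

rng = np.random.default_rng(2)
d_text, d_image, d_shared = 12, 16, 8        # per-modality and shared dims (illustrative)
W_text = rng.normal(size=(d_text, d_shared))
W_image = rng.normal(size=(d_image, d_shared))

def embed(features, W):
    """Project modality-specific features into the shared space, unit-normalized."""
    z = features @ W
    return z / np.linalg.norm(z)

z_text = embed(rng.normal(size=d_text), W_text)
z_image = embed(rng.normal(size=d_image), W_image)
similarity = float(z_text @ z_image)          # cosine similarity across modalities
```

Once both modalities live in the same space, cross-modal comparison reduces to a dot product.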
Diffusion Models
Generative models that learn to denoise data by reversing a gradual noising process, producing high-quality samples.
Key Characteristics
- Forward diffusion adds noise gradually
- Reverse process learns to denoise
- Iterative generation process
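The forward (noising) process has a convenient closed form: any step t can be sampled directly as a mix of the clean data and Gaussian noise. The linear beta schedule below is a common choice, used here for illustration.

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Closed-form forward process: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    alpha_bar = np.prod(1.0 - betas[: t + 1])   # cumulative signal retention
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps, eps

rng = np.random.default_rng(3)
betas = np.linspace(1e-4, 0.02, 1000)           # linear noise schedule (common choice)
x0 = rng.normal(size=(8,))
x_noisy, eps = forward_diffuse(x0, 999, betas, rng)
```

By the final step almost no signal remains, so the sample is essentially pure noise; the trained model generates by iteratively reversing this process.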
Attention Mechanisms & Position Encodings
Multi-Head Attention (MHA)
Standard transformer attention mechanism that uses multiple parallel attention heads to capture different representation subspaces. Each head learns different patterns of relationships between tokens.
Key Characteristics
- Multiple parallel attention heads (typically 8-96)
- Each head has separate Query, Key, Value matrices
- O(n²) complexity with sequence length
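The head-splitting mechanics look like this in NumPy: the model dimension is carved into per-head subspaces, each head computes its own attention pattern, and the results are concatenated. Sizes are illustrative.

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, n_heads):
    """Split projections into heads; each head attends in its own subspace."""
    n, d = x.shape
    d_head = d // n_heads
    def split(W):   # (n, d) -> (n_heads, n, d_head)
        return (x @ W).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(Wq), split(Wk), split(Wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, n, n)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return (w @ v).transpose(1, 0, 2).reshape(n, d)       # concatenate heads

rng = np.random.default_rng(4)
n, d, heads = 5, 16, 4
x = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
y = multi_head_attention(x, Wq, Wk, Wv, heads)
```

Each head's (n, n) score matrix is what makes the overall cost O(n²) in sequence length.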
Grouped Query Attention (GQA)
Reduces KV cache memory by grouping queries to share Key-Value pairs. Balances between Multi-Head Attention (unique KV per head) and Multi-Query Attention (single shared KV).
Key Characteristics
- Groups of query heads share K/V projections
- Fewer KV heads than query heads
- Reduces KV cache size proportionally
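The sharing can be implemented by broadcasting each KV head across its group of query heads. In this sketch, 8 query heads share 2 KV heads, so the cached K/V tensors are 4x smaller than with MHA.

```python
import numpy as np

def grouped_query_attention(q, k, v, group_size):
    """q: (q_heads, n, d); k, v: (kv_heads, n, d), q_heads = kv_heads * group_size."""
    k = np.repeat(k, group_size, axis=0)    # each KV head serves a group of query heads
    v = np.repeat(v, group_size, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(5)
q = rng.normal(size=(8, 6, 4))    # 8 query heads
k = rng.normal(size=(2, 6, 4))    # only 2 KV heads are cached
v = rng.normal(size=(2, 6, 4))
out = grouped_query_attention(q, k, v, group_size=4)
```

Only the small (kv_heads, n, d) tensors need to live in the KV cache; the repeat happens at compute time.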
Multi-Head Latent Attention (MLA)
Introduced in DeepSeek-V2 and carried forward into DeepSeek-V3, this technique compresses the KV cache by projecting high-dimensional keys and values into a low-dimensional latent space, drastically reducing memory while maintaining quality.
Key Characteristics
- Low-rank compression of K/V representations
- Latent dimension much smaller than head dimension
- Decoupled compression and absorption matrices
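The low-rank idea can be sketched as caching only a narrow latent vector per token and reconstructing K/V from it on the fly. This is a simplification of MLA (it omits the RoPE-decoupled path and matrix absorption); dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, r = 10, 64, 8                     # latent dim r << model dim d (illustrative)
W_down = rng.normal(size=(d, r))        # shared down-projection for K/V
W_uk = rng.normal(size=(r, d))          # up-projection back to keys
W_uv = rng.normal(size=(r, d))          # up-projection back to values

x = rng.normal(size=(n, d))
c = x @ W_down                          # only this (n, r) latent is cached
k, v = c @ W_uk, c @ W_uv               # K/V reconstructed when needed

cache_full = n * d * 2                  # floats cached by vanilla attention (K and V)
cache_mla = n * r                       # floats cached by the latent scheme
```

Here the cache shrinks by a factor of 2d/r (16x in this toy setup), which is where the memory savings come from.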
Sliding Window Attention
Restricts attention to a fixed-size local window around each token, reducing complexity from quadratic to linear while maintaining local context awareness.
Key Characteristics
- Fixed window size (e.g., 4096 tokens)
- O(n·w) complexity where w is window size
- Local attention pattern
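The pattern is easiest to see as a boolean attention mask: each query may attend only to keys that are causal and within the window. A toy window of 3 is used here in place of a production size like 4096.

```python
import numpy as np

def sliding_window_mask(n, window):
    """True where query i may attend to key j: causal and within the window."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < window)

mask = sliding_window_mask(8, window=3)
# Each row has at most `window` True entries, so the score computation
# costs O(n*w) rather than O(n^2).
```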
Grouped Window Attention
Hierarchical attention pattern that partitions sequence into groups/windows, with attention computed within groups and between group representatives.
Key Characteristics
- Hierarchical grouping of tokens
- Local attention within windows
- Global attention between window representatives
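One way to realize this hierarchy, sketched here as an assumption about the scheme rather than a canonical implementation, is a mask that allows attention within each window plus full attention among one representative token per window.

```python
import numpy as np

def grouped_window_mask(n, window):
    """Local attention inside each window, plus global attention among the
    first token of every window (acting as that window's representative)."""
    idx = np.arange(n)
    same_window = (idx[:, None] // window) == (idx[None, :] // window)
    is_rep = (idx % window) == 0
    rep_links = is_rep[:, None] & is_rep[None, :]   # representatives see each other
    return same_window | rep_links

mask = grouped_window_mask(12, window=4)
```

Information travels between distant windows in two hops: token to its representative, representative to another representative.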
RoPE (Rotary Position Embeddings)
Encodes absolute position information by applying rotation matrices in complex space; the algebra of rotations naturally injects relative position information into the attention mechanism.
Key Characteristics
- Rotation-based position encoding
- Encodes both absolute and relative positions
- Applied to query and key vectors
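A compact sketch of the rotation: consecutive dimension pairs of a query or key vector are rotated by position-dependent angles. The defining property, checked at the end, is that the dot product of a rotated query and key depends only on their relative offset.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs of x by position-dependent angles."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

# Relative-position property: <rope(q, m), rope(k, n)> depends only on m - n.
q = np.array([1.0, 0.0, 0.5, -0.5])
k = np.array([0.3, 0.7, -0.2, 0.9])
s1 = rope(q, 5) @ rope(k, 3)      # offset 2
s2 = rope(q, 12) @ rope(k, 10)    # same offset 2 -> same attention score
```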
Interactive Visualizations
(Interactive demo of sliding-window attention over 16 tokens, e.g. "Token 1 can attend to tokens 1-3"; not reproducible in text.)
KV cache comparison (illustrative configuration with 32 query heads):
Mechanism                     | KV Heads | KV Cache Memory | Reduction
Multi-Head Attention (MHA)    | 32       | 3200 MB         | Baseline (1.0x)
Grouped Query Attention (GQA) | 8        | 800 MB          | 4.0x
Multi-Query Attention (MQA)   | 1        | 100 MB          | 32.0x
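The comparison numbers follow from KV cache size scaling linearly with the number of KV heads; in this illustrative setup each KV head accounts for 100 MB.

```python
# KV-cache memory scales linearly with the number of KV heads.
per_kv_head_mb = 3200 / 32          # baseline: 32 KV heads -> 3200 MB

def kv_cache_mb(kv_heads):
    return per_kv_head_mb * kv_heads

mha = kv_cache_mb(32)   # 3200 MB, baseline (1.0x)
gqa = kv_cache_mb(8)    # 800 MB, 4.0x reduction
mqa = kv_cache_mb(1)    # 100 MB, 32.0x reduction
```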
Tooling & Application Patterns
Model Context Protocol (MCP)
Open standard for connecting AI models to external tools, data sources, and services. Enables models to interact with databases, APIs, filesystems, and other resources through a unified interface.
Key Characteristics
- Standardized tool/resource definition
- Client-server architecture
- Context sharing across tools
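The "standardized tool definition" looks roughly like the dictionary below: a name, a human-readable description, and a JSON Schema for the inputs, exchanged as JSON between client and server. The `query_database` tool itself is a hypothetical example, not part of the protocol.

```python
import json

# Approximate shape of an MCP tool definition (name, description, and a
# JSON Schema under "inputSchema"); the specific tool is hypothetical.
tool = {
    "name": "query_database",
    "description": "Run a read-only SQL query against the analytics database.",
    "inputSchema": {
        "type": "object",
        "properties": {"sql": {"type": "string"}},
        "required": ["sql"],
    },
}

wire_format = json.dumps(tool)   # definitions travel as JSON over the client-server channel
```

Because every tool is described this same way, a model can discover and call tools from any compliant server without bespoke integration code.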
Retrieval-Augmented Generation (RAG)
Combines a retriever component that fetches relevant documents with a generator LLM that produces responses grounded in retrieved information.
Key Characteristics
- Separate retrieval and generation components
- External knowledge base access
- Dynamic context augmentation
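The retrieve-then-generate loop can be sketched end to end. Here `embed` is a deterministic stand-in for a real embedding model and `generate` is a placeholder for an LLM call; only the control flow (embed, rank by similarity, prepend context) reflects the RAG pattern itself.

```python
import numpy as np

def embed(text, dim=32):
    """Stand-in for an embedding model: a deterministic unit vector per text."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

docs = ["GQA shares K/V across query groups.",
        "RoPE rotates query/key pairs.",
        "Mamba is a state space model."]
doc_vecs = np.stack([embed(d) for d in docs])   # the external knowledge base

def retrieve(query, k=1):
    scores = doc_vecs @ embed(query)            # cosine similarity (unit vectors)
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

def generate(query, context):
    # Placeholder for the generator LLM, grounded in the retrieved context.
    return f"Context: {context}\nAnswer to: {query}"

question = "What does grouped query attention share?"
answer = generate(question, retrieve(question)[0])
```

In a real system the knowledge base is indexed offline (e.g. in a vector store), and retrieval runs per query so the generator always sees fresh, relevant context.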
