AI Architectures


Neural Network Architectures

Architecture

Dense Transformer

Traditional transformer architecture where every token attends to every other token through self-attention. All parameters are activated for every input.

Key Characteristics
  • Full self-attention mechanism
  • All model parameters active for each token
  • Uniform computation across all inputs
  • +1 more
Use Cases
General language understanding, Text generation, Machine translation, +1 more
Examples
GPT-3, BERT, T5, LLaMA, +1 more
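To make the full-attention cost concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention; random matrices stand in for learned weights and all dimensions are illustrative:

```python
import numpy as np

def self_attention(x, W_q, W_k, W_v):
    """Scaled dot-product self-attention: every token attends to every
    other token, so the score matrix is (n, n) and cost grows as O(n^2)."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (n, n) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
n, d = 4, 8                                          # 4 tokens, model dim 8
x = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(x, W_q, W_k, W_v)
print(out.shape)  # (4, 8)
```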
Architecture

State Space Models (SSM)

Models like Mamba that use state space equations instead of attention, offering linear scaling with sequence length while maintaining long-range dependencies.

Key Characteristics
  • Linear time complexity (vs quadratic for attention)
  • Continuous-time state representations
  • Selective state updates
  • +1 more
Use Cases
Long document processing, Time series analysis, Genomic sequence modeling, +1 more
Examples
Mamba, S4, Hyena, RWKV
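The linear-scaling claim is visible in a toy (non-selective) state-space recurrence: each token triggers one fixed-cost state update. Mamba's selectivity makes A, B, C input-dependent and uses a parallel scan, both omitted in this sketch:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Discretized state-space recurrence:
    h_t = A @ h_{t-1} + B @ x_t,  y_t = C @ h_t.
    One constant-cost update per token gives O(n) in sequence length,
    versus O(n^2) for full attention."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                 # sequential scan over the sequence
        h = A @ h + B @ x_t       # state update
        ys.append(C @ h)          # readout
    return np.stack(ys)

rng = np.random.default_rng(0)
n, d_in, d_state = 6, 4, 8
x = rng.normal(size=(n, d_in))
A = 0.9 * np.eye(d_state)         # stable (decaying) state transition
B = rng.normal(size=(d_state, d_in))
C = rng.normal(size=(d_in, d_state))
y = ssm_scan(x, A, B, C)
print(y.shape)  # (6, 4)
```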

Generative Models

Architecture

Multimodal Architecture

Models that process and generate multiple modalities (text, image, audio, video) through shared or interconnected representations.

Key Characteristics
  • Multiple encoder/decoder pathways for different modalities
  • Cross-modal attention mechanisms
  • Shared representation space
  • +1 more
Use Cases
Image captioning, Visual question answering, Text-to-image generation, +1 more
Examples
GPT-4V, CLIP, Flamingo, DALL-E 3, +1 more
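A minimal sketch of a CLIP-style shared representation space. Random projections stand in for the real image and text encoders, and all dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "encoders": in a real CLIP-style model these are a vision
# transformer and a text transformer; here, fixed random projections.
W_img = rng.normal(size=(512, 128))   # image features -> shared space
W_txt = rng.normal(size=(300, 128))   # text features  -> shared space

def embed(features, W):
    z = features @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)  # unit-normalize

imgs = embed(rng.normal(size=(3, 512)), W_img)   # 3 images
txts = embed(rng.normal(size=(3, 300)), W_txt)   # 3 captions

# Cross-modal similarity in the shared space: each entry is the cosine
# similarity between one image embedding and one caption embedding.
sim = imgs @ txts.T
print(sim.shape)  # (3, 3)
```

Because both modalities land in the same space, retrieval in either direction (caption-to-image or image-to-caption) is just a row- or column-wise argmax over this matrix.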
Generative Architecture

Diffusion Models

Generative models that learn to denoise data by reversing a gradual noising process, producing high-quality samples.

Key Characteristics
  • Forward diffusion adds noise gradually
  • Reverse process learns to denoise
  • Iterative generation process
  • +1 more
Use Cases
Text-to-image generation, Image editing and inpainting, Super-resolution, +1 more
Examples
Stable Diffusion, DALL-E 2, Midjourney, Imagen
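The forward (noising) process has a closed form, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1−ᾱ_t)·ε, which a few lines of NumPy can demonstrate. The linear β schedule below is one common choice, not the only one:

```python
import numpy as np

def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample x_t from the forward process in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)        # linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)       # cumulative signal retention

x0 = rng.normal(size=(8, 8))              # toy "image"
x_early = forward_diffuse(x0, 10, alpha_bar, rng)
x_late = forward_diffuse(x0, T - 1, alpha_bar, rng)

# Early steps stay close to the data; by t = T-1 almost pure noise remains.
print(alpha_bar[10] > 0.99, alpha_bar[-1] < 0.01)  # True True
```

The reverse process trains a network to predict ε from x_t and t, then iteratively subtracts the predicted noise to generate samples.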

Attention Mechanisms & Position Encodings

Attention Mechanism

Multi-Head Attention (MHA)

Standard transformer attention mechanism that uses multiple parallel attention heads to capture different representation subspaces. Each head learns different patterns of relationships between tokens.

Key Characteristics
  • Multiple parallel attention heads (typically 8-96)
  • Each head has separate Query, Key, Value matrices
  • O(n²) complexity with sequence length
  • +1 more
Use Cases
General language modeling, Machine translation, Text classification, +1 more
Examples
GPT-3, BERT, T5, Early GPT-4 variants
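A toy multi-head attention in NumPy, showing the split of the model dimension into per-head subspaces and the O(n²) score matrix each head computes (random weights, illustrative sizes):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, n_heads):
    """Split dimension d across heads, run scaled dot-product attention per
    head, concatenate. Each head gets its own Q/K/V projections and can
    learn a distinct pattern of token relationships."""
    n, d = x.shape
    d_head = d // n_heads
    rng = np.random.default_rng(0)       # random stand-ins for learned weights
    outputs = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.normal(size=(d, d_head)) for _ in range(3))
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        attn = softmax(Q @ K.T / np.sqrt(d_head))   # (n, n) per head: O(n^2)
        outputs.append(attn @ V)                    # (n, d_head)
    return np.concatenate(outputs, axis=-1)         # (n, d)

x = np.random.default_rng(1).normal(size=(5, 16))
out = multi_head_attention(x, n_heads=4)
print(out.shape)  # (5, 16)
```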
Attention Mechanism

Grouped Query Attention (GQA)

Reduces KV cache memory by grouping queries to share Key-Value pairs. Balances between Multi-Head Attention (unique KV per head) and Multi-Query Attention (single shared KV).

Key Characteristics
  • Groups of query heads share K/V projections
  • Fewer KV heads than query heads
  • Reduces KV cache size proportionally
  • +1 more
Use Cases
Long-context language models, Memory-constrained inference, Large-scale deployment, +1 more
Examples
Llama 3, Llama 3.1, Llama 3.2, PaLM, +1 more
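A sketch of the grouping mechanics: with 8 query heads and 2 KV heads, each group of 4 query heads reuses one K/V pair, so the KV cache here holds 4x fewer tensors than MHA would (sizes are illustrative, not any real model's configuration):

```python
import numpy as np

n_q_heads, n_kv_heads, d_head, n = 8, 2, 16, 10
group = n_q_heads // n_kv_heads              # 4 query heads per KV head
rng = np.random.default_rng(0)

Q = rng.normal(size=(n_q_heads, n, d_head))  # one Q tensor per query head
K = rng.normal(size=(n_kv_heads, n, d_head)) # only n_kv_heads K/V tensors:
V = rng.normal(size=(n_kv_heads, n, d_head)) # the KV cache is 4x smaller

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

outs = []
for h in range(n_q_heads):
    kv = h // group                          # which shared K/V this head uses
    attn = softmax(Q[h] @ K[kv].T / np.sqrt(d_head))
    outs.append(attn @ V[kv])
out = np.concatenate(outs, axis=-1)          # (n, n_q_heads * d_head)
print(out.shape)  # (10, 128)
```

MHA is the special case n_kv_heads = n_q_heads; MQA is n_kv_heads = 1.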
Attention Mechanism

Multi-Head Latent Attention (MLA)

DeepSeek's attention variant (introduced in DeepSeek-V2 and refined in DeepSeek-V3) that compresses the KV cache by projecting high-dimensional keys and values into a low-dimensional latent space, drastically reducing memory use while maintaining quality.

Key Characteristics
  • Low-rank compression of K/V representations
  • Latent dimension much smaller than head dimension
  • Decoupled compression and absorption matrices
  • +1 more
Use Cases
Ultra-long context applications, Memory-constrained deployment, Large-scale language models, +1 more
Examples
DeepSeek-V3, DeepSeek-V2 (early variant)
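A shape-level sketch of the idea: cache only a small latent vector per position and reconstruct per-head K/V from it with up-projections. All dimensions are illustrative, not DeepSeek's actual configuration, and the real design's decoupled RoPE path is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, d_head, d_latent = 1024, 16, 64, 128  # toy sizes

# The down-projection produces the small latent that is actually cached;
# up-projections recover per-head K and V from it at attention time.
W_down = rng.normal(size=(d_model, d_latent)) * 0.02
W_up_k = rng.normal(size=(d_latent, n_heads * d_head)) * 0.02
W_up_v = rng.normal(size=(d_latent, n_heads * d_head)) * 0.02

x = rng.normal(size=(2048, d_model))      # 2048 cached token positions
c = x @ W_down                            # (2048, 128): this IS the KV cache

K = c @ W_up_k                            # reconstructed on the fly
V = c @ W_up_v

full_cache = 2 * 2048 * n_heads * d_head  # floats cached by standard MHA
mla_cache = c.size                        # floats cached by MLA
print(full_cache / mla_cache)  # 16.0
```

The compression is lossy in general, but because K and V live near a low-rank subspace in trained models, quality holds up while the cache shrinks by the ratio above.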
Attention Mechanism

Sliding Window Attention

Restricts attention to a fixed-size local window around each token, reducing complexity from quadratic to linear while maintaining local context awareness.

Key Characteristics
  • Fixed window size (e.g., 4096 tokens)
  • O(n·w) complexity where w is window size
  • Local attention pattern
  • +1 more
Use Cases
Long document processing, Code understanding, Efficient transformers, +1 more
Examples
Mistral 7B, Longformer, BigBird
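The pattern is easy to express as a mask. This sketch uses the causal variant (each token sees itself and the w−1 tokens before it, as in Mistral); Longformer-style symmetric windows differ only in the mask condition:

```python
import numpy as np

def sliding_window_mask(n, w):
    """Causal sliding-window mask: token i may attend to positions
    max(0, i-w+1) .. i, so each row has at most w allowed entries and
    total attention work is O(n*w) instead of O(n^2)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - w)

mask = sliding_window_mask(n=16, w=4)
print(mask[0].sum(), mask[15].sum())  # 1 4: row i allows min(i+1, w) positions
```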
Attention Mechanism

Grouped Window Attention

Hierarchical attention pattern that partitions sequence into groups/windows, with attention computed within groups and between group representatives.

Key Characteristics
  • Hierarchical grouping of tokens
  • Local attention within windows
  • Global attention between window representatives
  • +1 more
Use Cases
Document understanding, Hierarchical text processing, Video and audio modeling, +1 more
Examples
Swin Transformer (vision), Hierarchical transformers, Document AI models
Position Encoding

RoPE (Rotary Position Embeddings)

Encodes absolute position by rotating query and key vectors in 2D subspaces; because the dot product of two rotated vectors depends only on the difference of their rotation angles, relative position information is naturally incorporated into the attention scores.

Key Characteristics
  • Rotation-based position encoding
  • Encodes both absolute and relative positions
  • Applied to query and key vectors
  • +1 more
Use Cases
Long-context language models, Models requiring length extrapolation, Relative position-aware tasks, +1 more
Examples
Llama (all versions), Mistral, Qwen, DeepSeek, +1 more
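A NumPy sketch of the rotation, using the half-split channel pairing common in open-source implementations. The final check illustrates the key property: attention scores depend only on the relative offset between positions:

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotary position embedding: channel pairs (k, k + d/2) are rotated
    by angle pos * base^(-2k/d). A dot product of two rotated vectors
    depends only on the angle difference, i.e. on relative position."""
    n, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)       # per-pair frequencies
    angles = positions[:, None] * freqs[None, :]    # (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]               # split into pair halves
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 64))
k = rng.normal(size=(1, 64))

# Same relative offset (m - n = 3) at two different absolute positions
# yields the same attention score.
s1 = rope(q, np.array([5])) @ rope(k, np.array([2])).T
s2 = rope(q, np.array([105])) @ rope(k, np.array([102])).T
print(np.allclose(s1, s2))  # True
```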

Interactive Visualizations

Attention Patterns

[Interactive demo: attention heatmap (no attention → strong attention) for sliding window attention with a 4-token window over a 16-token sequence; e.g., token 1 can attend to tokens 1-3.]

Memory Comparison: MHA vs GQA vs MQA

Mechanism                        KV Heads   Memory    Reduction
Multi-Head Attention (MHA)       32         3200 MB   Baseline (1.0x)
Grouped Query Attention (GQA)    8          800 MB    4.0x
Multi-Query Attention (MQA)      1          100 MB    32.0x
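The comparison above follows directly from KV-cache memory scaling linearly with the number of KV heads; a few lines reproduce the numbers:

```python
# KV-cache memory is proportional to the number of KV heads, so the
# reduction factor relative to the 32-head MHA baseline is 32 / n_kv_heads.
mb_per_kv_head = 3200 / 32   # 100 MB per KV head, from the MHA baseline
rows = []
for name, kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    rows.append((name, kv_heads, kv_heads * mb_per_kv_head, 32 / kv_heads))
for name, kv, mem, red in rows:
    print(f"{name}: {kv} KV heads, {mem:.0f} MB, {red:.1f}x reduction")
```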


Tooling & Application Patterns

Integration Protocol

Model Context Protocol (MCP)

Open standard for connecting AI models to external tools, data sources, and services. Enables models to interact with databases, APIs, filesystems, and other resources through a unified interface.

Key Characteristics
  • Standardized tool/resource definition
  • Client-server architecture
  • Context sharing across tools
  • +1 more
Use Cases
Connecting models to databases, API and service integrations, File system operations, +1 more
Examples
Claude Desktop, MCP servers, Tool providers implementing MCP
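As a concrete example, MCP clients are typically pointed at servers through a configuration file; the shape below matches Claude Desktop's `claude_desktop_config.json`, though the server name and directory path here are illustrative:

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/allowed/dir"]
    }
  }
}
```

The client launches each configured server as a subprocess and speaks the protocol over its stdio, discovering the tools and resources the server exposes.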
Architecture Pattern

Retrieval-Augmented Generation (RAG)

Combines a retriever component that fetches relevant documents with a generator LLM that produces responses grounded in retrieved information.

Key Characteristics
  • Separate retrieval and generation components
  • External knowledge base access
  • Dynamic context augmentation
  • +1 more
Use Cases
Question answering over documents, Enterprise knowledge bases, Customer support chatbots, +1 more
Examples
ChatGPT with browsing, Perplexity AI, You.com, Microsoft Copilot
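A minimal sketch of the retrieve-then-generate loop. Word-overlap scoring stands in for a real embedding model and vector index, and the generator call itself is not shown:

```python
import re

DOCS = [
    "The Eiffel Tower is 330 metres tall.",
    "Photosynthesis converts light into chemical energy.",
    "The Eiffel Tower was completed in 1889.",
]

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, k=2):
    """Rank documents by word overlap with the query. A real system would
    use dense embeddings and an approximate-nearest-neighbor index."""
    scored = sorted(DOCS, key=lambda d: len(tokens(query) & tokens(d)),
                    reverse=True)
    return scored[:k]

def build_prompt(query):
    context = "\n".join(retrieve(query))   # dynamic context augmentation
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("How tall is the Eiffel Tower?")
# The generator LLM (not shown) would now be called with `prompt`,
# producing an answer grounded in the retrieved documents.
print(prompt)
```

Because the knowledge lives outside the model, updating the answer set means updating the document store, not retraining the generator.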