AI Architectures
Neural Network Architectures
Dense Transformer
Traditional transformer architecture where every token attends to every other token through self-attention. All parameters are activated for every input.
Key Characteristics
- Full self-attention mechanism
- All model parameters active for each token
- Uniform computation across all inputs
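The core operation described above can be sketched in a few lines of NumPy. The n×n score matrix is where "every token attends to every other token" shows up; shapes and weights here are illustrative, not from any particular model.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Full self-attention: every token attends to every other token."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv            # project tokens
    scores = q @ k.T / np.sqrt(k.shape[-1])     # (n, n) pairwise scores -> O(n^2)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over all tokens
    return weights @ v                          # weighted mix of value vectors

rng = np.random.default_rng(0)
n, d = 6, 8
x = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)
```

Note that every weight matrix participates for every token, which is the "all parameters active" property that mixture-of-experts designs relax.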
State Space Models (SSM)
Models like Mamba that use state space equations instead of attention, offering linear scaling with sequence length while maintaining long-range dependencies.
Key Characteristics
- Linear time complexity (vs quadratic for attention)
- Continuous-time state representations
- Selective state updates
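A minimal discretized state-space recurrence shows where the linear scaling comes from: one fixed-size state update per token, with no attention matrix. This sketch uses a single input channel and a fixed transition matrix A, whereas Mamba's selectivity makes the dynamics input-dependent.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Linear-time recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                    # O(n) in sequence length
        h = A @ h + B * x_t          # state update (constant cost per token)
        ys.append(C @ h)             # readout
    return np.array(ys)

rng = np.random.default_rng(1)
A = 0.9 * np.eye(4)                  # stable transition matrix (illustrative)
B = rng.normal(size=4)
C = rng.normal(size=4)
y = ssm_scan(rng.normal(size=16), A, B, C)
```

Because h has fixed size, the model carries long-range information through the state rather than by revisiting earlier tokens.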
Generative Models
Multimodal Architecture
Models that process and generate multiple modalities (text, image, audio, video) through shared or interconnected representations.
Key Characteristics
- Multiple encoder/decoder pathways for different modalities
- Cross-modal attention mechanisms
- Shared representation space
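The "shared representation space" idea can be sketched as two modality-specific projections into one embedding space, CLIP-style. The dimensions and random projections below are placeholders for trained encoders.

```python
import numpy as np

rng = np.random.default_rng(2)
d_text, d_image, d_shared = 12, 16, 8        # per-modality and shared dims (illustrative)
W_text = rng.normal(size=(d_text, d_shared))
W_image = rng.normal(size=(d_image, d_shared))

def embed(features, W):
    """Project modality-specific features into the shared space, unit-normalized."""
    z = features @ W
    return z / np.linalg.norm(z)

z_text = embed(rng.normal(size=d_text), W_text)
z_image = embed(rng.normal(size=d_image), W_image)
similarity = float(z_text @ z_image)          # cosine similarity across modalities
```

Once both modalities live in the same space, cross-modal comparison reduces to a dot product.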
Diffusion Models
Generative models that learn to denoise data by reversing a gradual noising process, producing high-quality samples.
Key Characteristics
- Forward diffusion adds noise gradually
- Reverse process learns to denoise
- Iterative generation process
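The forward (noising) process has a convenient closed form: any step t can be sampled directly as a mix of the clean data and Gaussian noise. The linear beta schedule below is a common choice, used here for illustration.

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Closed-form forward process: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    alpha_bar = np.prod(1.0 - betas[: t + 1])   # cumulative signal retention
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps, eps

rng = np.random.default_rng(3)
betas = np.linspace(1e-4, 0.02, 1000)           # linear noise schedule (common choice)
x0 = rng.normal(size=(8,))
x_noisy, eps = forward_diffuse(x0, 999, betas, rng)
```

By the final step almost no signal remains, so the sample is essentially pure noise; the trained model generates by iteratively reversing this process.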
Attention Mechanisms & Position Encodings
Multi-Head Attention (MHA)
Standard transformer attention mechanism that uses multiple parallel attention heads to capture different representation subspaces. Each head learns different patterns of relationships between tokens.
Key Characteristics
- Multiple parallel attention heads (typically 8-96)
- Each head has separate Query, Key, Value matrices
- O(n²) complexity with sequence length
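The head-splitting mechanics look like this in NumPy: the model dimension is carved into per-head subspaces, each head computes its own attention pattern, and the results are concatenated. Sizes are illustrative.

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, n_heads):
    """Split projections into heads; each head attends in its own subspace."""
    n, d = x.shape
    d_head = d // n_heads
    def split(W):   # (n, d) -> (n_heads, n, d_head)
        return (x @ W).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(Wq), split(Wk), split(Wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, n, n)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return (w @ v).transpose(1, 0, 2).reshape(n, d)       # concatenate heads

rng = np.random.default_rng(4)
n, d, heads = 5, 16, 4
x = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
y = multi_head_attention(x, Wq, Wk, Wv, heads)
```

Each head's (n, n) score matrix is what makes the overall cost O(n²) in sequence length.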
Grouped Query Attention (GQA)
Reduces KV cache memory by grouping queries to share Key-Value pairs. Balances between Multi-Head Attention (unique KV per head) and Multi-Query Attention (single shared KV).
Key Characteristics
- Groups of query heads share K/V projections
- Fewer KV heads than query heads
- Reduces KV cache size proportionally
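The sharing can be implemented by broadcasting each KV head across its group of query heads. In this sketch, 8 query heads share 2 KV heads, so the cached K/V tensors are 4x smaller than with MHA.

```python
import numpy as np

def grouped_query_attention(q, k, v, group_size):
    """q: (q_heads, n, d); k, v: (kv_heads, n, d), q_heads = kv_heads * group_size."""
    k = np.repeat(k, group_size, axis=0)    # each KV head serves a group of query heads
    v = np.repeat(v, group_size, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(5)
q = rng.normal(size=(8, 6, 4))    # 8 query heads
k = rng.normal(size=(2, 6, 4))    # only 2 KV heads are cached
v = rng.normal(size=(2, 6, 4))
out = grouped_query_attention(q, k, v, group_size=4)
```

Only the small (kv_heads, n, d) tensors need to live in the KV cache; the repeat happens at compute time.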
Multi-Head Latent Attention (MLA)
Introduced in DeepSeek-V2 and carried forward into DeepSeek-V3, this technique compresses the KV cache by projecting high-dimensional keys and values into a low-dimensional latent space, drastically reducing memory while maintaining quality.
Key Characteristics
- Low-rank compression of K/V representations
- Latent dimension much smaller than head dimension
- Decoupled compression and absorption matrices
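The low-rank idea can be sketched as caching only a narrow latent vector per token and reconstructing K/V from it on the fly. This is a simplification of MLA (it omits the RoPE-decoupled path and matrix absorption); dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, r = 10, 64, 8                     # latent dim r << model dim d (illustrative)
W_down = rng.normal(size=(d, r))        # shared down-projection for K/V
W_uk = rng.normal(size=(r, d))          # up-projection back to keys
W_uv = rng.normal(size=(r, d))          # up-projection back to values

x = rng.normal(size=(n, d))
c = x @ W_down                          # only this (n, r) latent is cached
k, v = c @ W_uk, c @ W_uv               # K/V reconstructed when needed

cache_full = n * d * 2                  # floats cached by vanilla attention (K and V)
cache_mla = n * r                       # floats cached by the latent scheme
```

Here the cache shrinks by a factor of 2d/r (16x in this toy setup), which is where the memory savings come from.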
Sliding Window Attention
Restricts attention to a fixed-size local window around each token, reducing complexity from quadratic to linear while maintaining local context awareness.
Key Characteristics
- Fixed window size (e.g., 4096 tokens)
- O(n·w) complexity where w is window size
- Local attention pattern
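The pattern is easiest to see as a boolean attention mask: each query may attend only to keys that are causal and within the window. A toy window of 3 is used here in place of a production size like 4096.

```python
import numpy as np

def sliding_window_mask(n, window):
    """True where query i may attend to key j: causal and within the window."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < window)

mask = sliding_window_mask(8, window=3)
# Each row has at most `window` True entries, so the score computation
# costs O(n*w) rather than O(n^2).
```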
Grouped Window Attention
Hierarchical attention pattern that partitions sequence into groups/windows, with attention computed within groups and between group representatives.
Key Characteristics
- Hierarchical grouping of tokens
- Local attention within windows
- Global attention between window representatives
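One way to realize this hierarchy, sketched here as an assumption about the scheme rather than a canonical implementation, is a mask that allows attention within each window plus full attention among one representative token per window.

```python
import numpy as np

def grouped_window_mask(n, window):
    """Local attention inside each window, plus global attention among the
    first token of every window (acting as that window's representative)."""
    idx = np.arange(n)
    same_window = (idx[:, None] // window) == (idx[None, :] // window)
    is_rep = (idx % window) == 0
    rep_links = is_rep[:, None] & is_rep[None, :]   # representatives see each other
    return same_window | rep_links

mask = grouped_window_mask(12, window=4)
```

Information travels between distant windows in two hops: token to its representative, representative to another representative.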
RoPE (Rotary Position Embeddings)
Encodes absolute position information by applying rotation matrices in complex space; the algebra of rotations naturally injects relative position information into the attention mechanism.
Key Characteristics
- Rotation-based position encoding
- Encodes both absolute and relative positions
- Applied to query and key vectors
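A compact sketch of the rotation: consecutive dimension pairs of a query or key vector are rotated by position-dependent angles. The defining property, checked at the end, is that the dot product of a rotated query and key depends only on their relative offset.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs of x by position-dependent angles."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

# Relative-position property: <rope(q, m), rope(k, n)> depends only on m - n.
q = np.array([1.0, 0.0, 0.5, -0.5])
k = np.array([0.3, 0.7, -0.2, 0.9])
s1 = rope(q, 5) @ rope(k, 3)      # offset 2
s2 = rope(q, 12) @ rope(k, 10)    # same offset 2 -> same attention score
```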
Interactive Visualizations
(Interactive demo of sliding-window attention over 16 tokens, e.g. "Token 1 can attend to tokens 1-3"; not reproducible in text.)
KV cache comparison (illustrative configuration with 32 query heads):
Mechanism                     | KV Heads | KV Cache Memory | Reduction
Multi-Head Attention (MHA)    | 32       | 3200 MB         | Baseline (1.0x)
Grouped Query Attention (GQA) | 8        | 800 MB          | 4.0x
Multi-Query Attention (MQA)   | 1        | 100 MB          | 32.0x
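The comparison numbers follow from KV cache size scaling linearly with the number of KV heads; in this illustrative setup each KV head accounts for 100 MB.

```python
# KV-cache memory scales linearly with the number of KV heads.
per_kv_head_mb = 3200 / 32          # baseline: 32 KV heads -> 3200 MB

def kv_cache_mb(kv_heads):
    return per_kv_head_mb * kv_heads

mha = kv_cache_mb(32)   # 3200 MB, baseline (1.0x)
gqa = kv_cache_mb(8)    # 800 MB, 4.0x reduction
mqa = kv_cache_mb(1)    # 100 MB, 32.0x reduction
```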
Tooling & Application Patterns
Model Context Protocol (MCP)
Open standard for connecting AI models to external tools, data sources, and services. Enables models to interact with databases, APIs, filesystems, and other resources through a unified interface.
Key Characteristics
- Standardized tool/resource definition
- Client-server architecture
- Context sharing across tools
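The "standardized tool definition" looks roughly like the dictionary below: a name, a human-readable description, and a JSON Schema for the inputs, exchanged as JSON between client and server. The `query_database` tool itself is a hypothetical example, not part of the protocol.

```python
import json

# Approximate shape of an MCP tool definition (name, description, and a
# JSON Schema under "inputSchema"); the specific tool is hypothetical.
tool = {
    "name": "query_database",
    "description": "Run a read-only SQL query against the analytics database.",
    "inputSchema": {
        "type": "object",
        "properties": {"sql": {"type": "string"}},
        "required": ["sql"],
    },
}

wire_format = json.dumps(tool)   # definitions travel as JSON over the client-server channel
```

Because every tool is described this same way, a model can discover and call tools from any compliant server without bespoke integration code.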
Retrieval-Augmented Generation (RAG)
Combines a retriever component that fetches relevant documents with a generator LLM that produces responses grounded in retrieved information.
Key Characteristics
- Separate retrieval and generation components
- External knowledge base access
- Dynamic context augmentation
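The retrieve-then-generate loop can be sketched end to end. Here `embed` is a deterministic stand-in for a real embedding model and `generate` is a placeholder for an LLM call; only the control flow (embed, rank by similarity, prepend context) reflects the RAG pattern itself.

```python
import numpy as np

def embed(text, dim=32):
    """Stand-in for an embedding model: a deterministic unit vector per text."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

docs = ["GQA shares K/V across query groups.",
        "RoPE rotates query/key pairs.",
        "Mamba is a state space model."]
doc_vecs = np.stack([embed(d) for d in docs])   # the external knowledge base

def retrieve(query, k=1):
    scores = doc_vecs @ embed(query)            # cosine similarity (unit vectors)
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

def generate(query, context):
    # Placeholder for the generator LLM, grounded in the retrieved context.
    return f"Context: {context}\nAnswer to: {query}"

question = "What does grouped query attention share?"
answer = generate(question, retrieve(question)[0])
```

In a real system the knowledge base is indexed offline (e.g. in a vector store), and retrieval runs per query so the generator always sees fresh, relevant context.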
