LLM Performance
Sampling Parameters
Temperature
Controls randomness in token selection by scaling logits before softmax. Lower values (0.1-0.3) produce deterministic, focused outputs ideal for factual tasks; higher values (0.7-1.0) increase creativity and diversity for brainstorming. Temperature 0 always picks the highest-probability token, while values well above 1 tend to produce incoherent text.
- Scales logits before softmax
- 0 = greedy/deterministic
- Higher = more random sampling
- Trade-off: creativity vs coherence
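The scaling above can be sketched in a few lines of stdlib Python (a minimal sketch; `sample_with_temperature` is a name chosen here, not a library function):

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Divide logits by the temperature, softmax, then sample.

    temperature -> 0 approaches greedy decoding (argmax); larger
    values flatten the distribution and increase randomness.
    """
    if temperature == 0:  # greedy: always the highest-probability token
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    r, cum = rng.random() * sum(exps), 0.0
    for i, e in enumerate(exps):
        cum += e
        if r < cum:
            return i
    return len(exps) - 1

logits = [2.0, 1.0, 0.1]
print(sample_with_temperature(logits, 0.0))  # always 0 (greedy)
```

At temperature 2.0 the same three logits become nearly equally likely; at 0.1 the top token dominates almost completely.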
Top-P (Nucleus Sampling)
Samples from the smallest set of tokens whose cumulative probability exceeds the threshold p. At p=0.9, only the tokens comprising the top 90% of probability mass are considered. More adaptive than Top-K: when the model is confident, fewer tokens are sampled from; when it is uncertain, more variety is allowed. Commonly combined with temperature for fine-grained control.
- Cumulative probability threshold
- Dynamic vocabulary filtering
- Adapts to confidence levels
- Common values: 0.9-0.95
Quantization
Post-Training Quantization (PTQ)
Quantizes a pre-trained model without additional training, using calibration data to determine optimal scaling factors. Reduces precision from FP32/FP16 to INT8/INT4, cutting memory 2-4x with minimal quality loss (typically 1-3%). Fast to apply (minutes to hours) and doesn't require training infrastructure. The go-to approach for deploying existing models efficiently.
- No training required
- Uses calibration dataset
- Minutes to hours to complete
- 1-3% typical quality loss
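The core calibration-then-round idea reduces to a few lines. A minimal sketch of symmetric per-tensor INT8 quantization (function names are illustrative; real toolchains also handle per-channel scales, zero points, and outlier clipping):

```python
def calibrate_scale(calibration_values, n_bits=8):
    """Map the largest observed magnitude onto the top of the signed
    integer range (symmetric, per-tensor quantization)."""
    qmax = 2 ** (n_bits - 1) - 1  # 127 for INT8
    return max(abs(v) for v in calibration_values) / qmax

def quantize(x, scale, n_bits=8):
    qmax = 2 ** (n_bits - 1) - 1
    return max(-qmax - 1, min(qmax, round(x / scale)))  # round and clamp

def dequantize(q, scale):
    return q * scale

calib = [-0.8, 0.3, 1.27, -1.1, 0.5]  # stand-in calibration data
scale = calibrate_scale(calib)
print(quantize(0.5, scale), dequantize(quantize(0.5, scale), scale))
```

The round trip through the integer grid loses only a small amount of precision, which is where the typical 1-3% quality gap comes from.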
GPTQ
Post-training quantization using approximate second-order information (Hessian). Quantizes one layer at a time, compensating for errors in subsequent layers. Fast to apply (~hours for large models) and produces high-quality INT4 models. Best for GPU inference with CUDA support. Widely supported in vLLM, TGI, and transformers.
- One-shot layer-by-layer quantization
- Uses Hessian for error compensation
- Optimized for GPU inference
- Wide framework support
AWQ (Activation-aware Weight Quantization)
Identifies salient weights (the ~1% of weights that cause large activation changes) and preserves them at higher precision. Typically 0.5-1% better perplexity than GPTQ at the same bit-width. More robust on instruction-following and reasoning tasks where certain weights are critical. Faster inference than GPTQ due to hardware-friendly design.
- Activation-aware weight selection
- Preserves critical 1% of weights
- Better quality than GPTQ
- Faster inference speed
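The activation-aware selection step can be illustrated with a toy sketch (real AWQ rescales salient channels rather than storing mixed precision, so this is a simplification; `awq_style_quantize` is a name chosen here):

```python
import numpy as np

def awq_style_quantize(W, X, keep_frac=0.01, n_bits=4):
    """Toy activation-aware quantization: rank input channels by mean
    activation magnitude, keep the most salient columns of W at full
    precision, and round the rest to a symmetric n-bit grid."""
    qmax = 2 ** (n_bits - 1) - 1
    saliency = np.abs(X).mean(axis=0)                # per input channel
    n_keep = max(1, int(round(keep_frac * W.shape[1])))
    salient = set(np.argsort(saliency)[-n_keep:])
    Wq = W.copy()
    for col in range(W.shape[1]):
        if col in salient:
            continue                                 # critical weights untouched
        scale = np.abs(W[:, col]).max() / qmax
        if scale > 0:
            Wq[:, col] = np.round(W[:, col] / scale) * scale
    return Wq, salient

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 100))
X = rng.standard_normal((32, 100))
X[:, 7] *= 50                                        # channel 7 sees huge activations
Wq, salient = awq_style_quantize(W, X)
print([int(i) for i in sorted(salient)])             # [7]
```

The key point survives the simplification: saliency is measured from activations, not from the weights themselves.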
GGUF
Quantization file format (used by llama.cpp) optimized for CPU and mixed CPU/GPU inference. Q4_K_M offers the best quality/size balance; Q5_K_M is near-lossless; Q8_0 is effectively lossless but larger. Enables running 7B-13B models on laptops without a GPU, or offloading layers between CPU and GPU memory. The standard for local/edge deployment.
- Optimized for CPU inference
- Multiple quantization levels (Q4-Q8)
- Layer offloading support
- Cross-platform compatibility
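File sizes are easy to estimate from the effective bits per weight of each quant type. The figures below are approximate (quantized formats carry overhead for scales and block metadata), so treat the table as a rough guide:

```python
# Approximate effective bits per weight, including scale/metadata overhead
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.85}

def gguf_size_gb(params_billion, quant_type):
    """Rough on-disk size of a model in a given GGUF quant type."""
    bits = BITS_PER_WEIGHT[quant_type]
    return params_billion * 1e9 * bits / 8 / 1e9

for q in ("F16", "Q8_0", "Q5_K_M", "Q4_K_M"):
    print(f"7B @ {q:7s} ~ {gguf_size_gb(7, q):4.1f} GB")
```

A 7B model drops from ~14 GB at FP16 to roughly 4.2 GB at Q4_K_M, which is why it fits comfortably in laptop RAM.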
Pre-Quantized Models
Models distributed already quantized, ready for immediate use without running quantization yourself. Available on Hugging Face (TheBloke, etc.) in GPTQ, AWQ, and GGUF formats. Saves hours of compute time and ensures consistent quantization quality. Check compatibility with your inference framework before downloading.
- Ready to use immediately
- No quantization compute needed
- Multiple format options
- Community-validated quality
QLoRA
Fine-tuning technique that keeps base model weights in 4-bit quantized form while training LoRA adapters in FP16/BF16. Reduces fine-tuning memory by 4-8x: a 65B model fits on a single 48GB GPU instead of requiring 8x 80GB GPUs. Uses NF4 (normal float 4-bit) data type and double quantization for minimal quality loss compared to full fine-tuning.
- 4-bit base model + FP16 adapters
- 4-8x memory reduction for training
- NF4 data type for quality
- Near full fine-tuning performance
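In the Hugging Face ecosystem this is typically wired up with `transformers`, `bitsandbytes`, and `peft`. A hedged configuration sketch (hyperparameters and target modules are illustrative, not prescriptive; the model-loading calls are commented out because they download weights):

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit base model: NF4 data type, double quantization, bf16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters trained in higher precision on top of the frozen 4-bit base
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # example attention projections
    task_type="CAUSAL_LM",
)

# model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
# model = get_peft_model(model, lora_config)
```

Only the adapter weights receive gradients, which is where the 4-8x training-memory reduction comes from.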
Parallelism Strategies
Tensor Parallelism
Splits individual layers (weight matrices) across multiple GPUs horizontally. Each GPU computes a portion of each layer, requiring all-reduce communication between layers. Reduces per-GPU memory linearly but adds latency from synchronization. Best within a single node with fast NVLink; cross-node adds significant overhead.
- Intra-layer parallelism
- Splits weight matrices across GPUs
- High communication overhead
- Best for large layers that don't fit on one GPU
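The two standard splits, column-parallel and row-parallel, can be checked with plain matrix algebra (numpy stands in for the per-GPU shards and collectives):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # activations: batch x hidden
W = rng.standard_normal((8, 8))   # one layer's weight matrix

# Column-parallel: each "GPU" owns half of the output columns and
# computes its slice independently; slices are concatenated (all-gather).
Y = np.concatenate([X @ W[:, :4], X @ W[:, 4:]], axis=1)

# Row-parallel: each GPU owns half of the input dimension; the partial
# products must be summed across devices (all-reduce).
Z = X[:, :4] @ W[:4, :] + X[:, 4:] @ W[4:, :]

print(np.allclose(Y, X @ W), np.allclose(Z, X @ W))  # True True
```

Transformer implementations commonly pair the two (column-parallel up-projection, row-parallel down-projection) so only one all-reduce is needed per MLP block.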
Pipeline Parallelism
Distributes model layers across GPUs vertically. Each GPU handles a subset of layers, passing activations forward. Lower communication than tensor parallelism (only at stage boundaries), but creates pipeline bubbles where GPUs idle waiting for dependencies. Micro-batching helps fill bubbles. Works well across nodes.
- Inter-layer parallelism
- Each GPU owns subset of layers
- Lower communication than tensor parallel
- Can have pipeline bubbles (idle time)
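The bubble cost of a simple GPipe-style schedule has a closed form, which makes the benefit of micro-batching easy to quantify:

```python
def bubble_fraction(n_stages, n_microbatches):
    """Idle fraction of a simple GPipe-style schedule:
    (p - 1) / (m + p - 1) for p pipeline stages and m micro-batches."""
    return (n_stages - 1) / (n_microbatches + n_stages - 1)

print(f"{bubble_fraction(4, 1):.0%}")   # one big batch: GPUs idle 75% of the time
print(f"{bubble_fraction(4, 16):.0%}")  # 16 micro-batches shrink the bubble to ~16%
```

Pushing `m` well above `p` is why micro-batching is essential rather than optional for pipeline-parallel training.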
Data Parallelism
Replicates the full model on each GPU and distributes different batches across devices. Gradients are synchronized via all-reduce after each step. Scales batch size linearly with GPU count. Simple to implement but requires each GPU to hold the entire model. For inference, enables higher throughput with independent replicas.
- Same model on all GPUs
- Different data batches per GPU
- Gradient synchronization required
- Scales batch size linearly
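The synchronization step is just an elementwise average across replicas. A minimal sketch (plain lists stand in for gradient tensors):

```python
def all_reduce_mean(grads_per_gpu):
    """Average gradients elementwise across replicas: the all-reduce
    that keeps every copy of the model identical after each step."""
    n = len(grads_per_gpu)
    return [sum(g[i] for g in grads_per_gpu) / n
            for i in range(len(grads_per_gpu[0]))]

# Each replica computed gradients on a different data batch
g_gpu0 = [0.2, -0.4, 0.1]
g_gpu1 = [0.4, 0.0, 0.3]
print(all_reduce_mean([g_gpu0, g_gpu1]))  # ~[0.3, -0.2, 0.2]
```

After the averaged gradient is applied, all replicas take the identical optimizer step, so they never drift apart.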
Sequence Parallelism
Distributes long input sequences across devices, each GPU processing a portion. Enables context lengths beyond single-GPU memory (e.g., 1M+ tokens). Ring attention passes KV slices between GPUs in a ring topology. Particularly important for long-document understanding and retrieval-augmented generation with large contexts.
- Splits sequence dimension
- Enables ultra-long contexts
- Works with ring attention
- Reduces memory per device
Memory Optimization
KV-Cache
Caches key and value tensors from previous tokens to avoid recomputation during autoregressive generation. Without caching, each new token would require recomputing attention over all previous tokens. Memory grows as 2 × batch × layers × kv_heads × seq_len × head_dim (the factor of 2 covers keys and values). For a 70B model at 8K context, the KV-cache alone can consume 10-20GB per request.
- Stores K/V from previous tokens
- Eliminates redundant computation
- Memory grows with sequence length
- Major bottleneck for long contexts
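The memory formula is worth plugging numbers into. A quick calculator (the 70B-class shape below, 80 layers and 64 heads of dimension 128, is an assumed configuration for illustration):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, dtype_bytes=2):
    """Factor of 2 covers keys and values; dtype_bytes=2 means FP16."""
    return 2 * batch * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# A 70B-class model with full multi-head attention at 8K context
gb = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=8192) / 1e9
print(f"{gb:.1f} GB per request")  # ~21.5 GB
```

Grouped-query attention shrinks this proportionally: the same model with 8 KV heads instead of 64 needs only about 2.7 GB per request.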
PagedAttention
Manages KV-cache like virtual memory with non-contiguous blocks. Traditional serving pre-allocates max sequence length, wasting 60-80% of memory. PagedAttention allocates on-demand in small blocks, achieving near-zero waste. Also enables prefix caching where common system prompts share KV-cache across requests, reducing memory and computation.
- Non-contiguous KV-cache blocks
- Eliminates memory fragmentation
- Enables prefix caching
- Near-zero memory waste
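The on-demand block allocation can be sketched as a toy allocator (a simplification of vLLM's design: no prefix sharing or copy-on-write, and class/method names are chosen here):

```python
class PagedKVCache:
    """Toy paged KV-cache: requests get fixed-size blocks on demand
    instead of a contiguous max-length reservation."""

    def __init__(self, n_blocks, block_size=16):
        self.free = list(range(n_blocks))
        self.block_size = block_size
        self.tables = {}   # request id -> list of physical block ids
        self.lengths = {}  # request id -> tokens stored

    def append(self, req):
        n = self.lengths.get(req, 0)
        if n % self.block_size == 0:  # last block is full: grab another
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1

    def release(self, req):  # request finished: blocks return to the pool
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)

cache = PagedKVCache(n_blocks=8, block_size=16)
for _ in range(20):
    cache.append("req0")  # 20 tokens span just 2 blocks
print(len(cache.tables["req0"]), len(cache.free))  # 2 6
cache.release("req0")
print(len(cache.free))  # 8
```

A request never holds more than one partially filled block, which is where the near-zero waste comes from.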
FlashAttention
Memory-efficient attention that tiles computation to avoid materializing the full N×N attention matrix. Fuses operations to minimize GPU memory reads/writes (the real bottleneck). Reduces memory from O(n²) to O(n), enabling 4-16x longer sequences. Also 2-4x faster due to better hardware utilization. Now standard in most inference frameworks.
- Tiled computation approach
- O(n) memory instead of O(n²)
- IO-aware algorithm design
- 2-4x faster than standard attention
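The core trick is the online softmax: process K/V in tiles while carrying a running max, normalizer, and unnormalized output, so the full score matrix never exists. A numpy sketch of the math (not the fused-kernel engineering that delivers the speedup):

```python
import numpy as np

def naive_attention(Q, K, V):
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    """Online-softmax attention over K/V tiles: running max m,
    normalizer l, and unnormalized accumulator acc replace the
    full seq_len x seq_len score matrix."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full(n, -np.inf)
    l = np.zeros(n)
    acc = np.zeros((n, d))
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale               # n x block tile of scores
        m_new = np.maximum(m, S.max(axis=-1))
        correction = np.exp(m - m_new)       # rescale previous state
        P = np.exp(S - m_new[:, None])
        l = l * correction + P.sum(axis=-1)
        acc = acc * correction[:, None] + P @ Vb
        m = m_new
    return acc / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
print(np.allclose(tiled_attention(Q, K, V), naive_attention(Q, K, V)))  # True
```

The tiled version produces exactly the same output; the O(n²) matrix only ever exists one block at a time.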
Throughput Optimization
Speculative Decoding
Uses a small draft model (e.g., 1B params) to generate K candidate tokens quickly, then verifies all K in a single forward pass of the large model. Accepted tokens are kept; rejected ones trigger resampling. Achieves 2-3x speedup without changing output distribution. Most effective when draft model closely matches target model's distribution.
- Draft with small model
- Verify with large model in parallel
- Maintains exact output distribution
- 2-3x inference speedup
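A greedy sketch of one draft-then-verify round (the full method uses rejection sampling over token probabilities to preserve the sampling distribution exactly; the lambdas below are toy stand-ins for real model calls):

```python
def speculative_step(draft_next, target_next, context, k=4):
    """One round of greedy speculative decoding: the draft model
    proposes k tokens; the target model checks them all at once."""
    proposal, seq = [], list(context)
    for _ in range(k):                  # k cheap draft-model steps
        t = draft_next(seq)
        proposal.append(t)
        seq.append(t)
    # One "forward pass" of the big model scores every position.
    accepted, seq = [], list(context)
    for t in proposal:
        target = target_next(seq)
        if target != t:                 # mismatch: keep target's token, stop
            accepted.append(target)
            break
        accepted.append(t)
        seq.append(t)
    return accepted

# Toy models: the draft guesses "abab...", the target wants "abba..."
draft = lambda seq: "ab"[len(seq) % 2]
target = lambda seq: "abba"[len(seq) % 4]
print(speculative_step(draft, target, [], k=4))  # ['a', 'b', 'b']
```

Here three tokens are committed for roughly the cost of one large-model pass, which is the source of the speedup.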
Continuous Batching
Dynamically adds new requests to running batches as slots become available, rather than waiting for entire batches to complete. A request finishing after 50 tokens immediately frees its slot for a new request, even while others generate 500+ tokens. Increases throughput 2-10x over static batching and reduces average latency significantly.
- Dynamic batch management
- No waiting for batch completion
- Maximizes GPU utilization
- Reduces average latency
Chunked Prefill
Splits long prompt processing into chunks (e.g., 512 tokens), interleaving prefill with ongoing decode operations. Without chunking, a 10K token prompt blocks all other requests for seconds. Chunking allows decode iterations to proceed between prefill chunks, reducing P99 latency dramatically for mixed workloads with varying prompt lengths.
- Breaks prefill into chunks
- Interleaves with decode phase
- Reduces head-of-line blocking
- Better latency for mixed workloads
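The interleaving can be shown as a toy schedule (real schedulers budget tokens per iteration across many requests; `schedule_prefill` is a name chosen here):

```python
def schedule_prefill(prompt_tokens, chunk=512):
    """Toy scheduler: after every prefill chunk, run one decode
    iteration so already-running requests keep producing tokens."""
    steps = []
    for start in range(0, prompt_tokens, chunk):
        steps.append(("prefill", start, min(start + chunk, prompt_tokens)))
        steps.append(("decode",))  # other requests each emit one token
    return steps

for step in schedule_prefill(1200):
    print(step)
```

A 1200-token prompt becomes three prefill chunks with decode iterations between them, instead of one long pause that stalls every other request.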
CTranslate2
C++/Python library for efficient Transformer inference with layer fusion, batch reordering, and KV caching. Supports INT8/INT4 quantization, runs on CPU (x86, ARM) and NVIDIA GPUs. Often 2-4x faster than HuggingFace Transformers with lower memory. Supports Llama, Mistral, Whisper, T5, BERT and more.
- Layer fusion and padding removal
- INT8/INT4/FP16 quantization built-in
- CPU and CUDA GPU support
- 2-4x faster than baseline Transformers
DeepSpeed
Microsoft's optimization library for training and inference at scale. ZeRO optimizer eliminates memory redundancy, enabling trillion-parameter models. DeepSpeed-Inference provides optimized kernels, tensor parallelism, and dynamic batching. Used to train MT-NLG 530B, BLOOM 176B. Integrates with HuggingFace, PyTorch Lightning, and Accelerate.
- ZeRO memory optimization (stages 1-3)
- ZeRO-Offload to CPU/NVMe
- DeepSpeed-Inference for serving
- Ulysses sequence parallelism
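A hedged configuration sketch showing ZeRO stage 3 with CPU offload (the dict mirrors DeepSpeed's JSON config schema; batch size and dtype choices are illustrative, and the initialize call is commented out because it needs a model and GPUs):

```python
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                               # partition params, grads, optimizer states
        "offload_optimizer": {"device": "cpu"},   # ZeRO-Offload
        "offload_param": {"device": "cpu"},
    },
}
# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```

Stage 1 partitions only optimizer states, stage 2 adds gradients, and stage 3 (shown) also partitions the parameters themselves, which is what makes trillion-parameter training feasible.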
