LLM Performance
Sampling Parameters
Temperature
Controls randomness in token selection by scaling logits before softmax. Lower values (0.1-0.3) produce deterministic, focused outputs ideal for factual tasks; higher values (0.7-1.0) increase creativity and diversity for brainstorming. Temperature 0 always picks the highest-probability token, while values well above 1 tend to produce incoherent text.
- Scales logits before softmax
- 0 = greedy/deterministic
- Higher = more random sampling
- Trade-off: creativity vs coherence
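The scaling above can be sketched in a few lines of stdlib Python (a minimal sketch; `sample_with_temperature` is a name chosen here, not a library function):

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Divide logits by the temperature, softmax, then sample.

    temperature -> 0 approaches greedy decoding (argmax); larger
    values flatten the distribution and increase randomness.
    """
    if temperature == 0:  # greedy: always the highest-probability token
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    r, cum = rng.random() * sum(exps), 0.0
    for i, e in enumerate(exps):
        cum += e
        if r < cum:
            return i
    return len(exps) - 1

logits = [2.0, 1.0, 0.1]
print(sample_with_temperature(logits, 0.0))  # always 0 (greedy)
```

At temperature 2.0 the same three logits become nearly equally likely; at 0.1 the top token dominates almost completely.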
Top-P (Nucleus Sampling)
Samples from the smallest set of tokens whose cumulative probability exceeds the threshold p. At p=0.9, only the tokens comprising the top 90% of probability mass are considered. More adaptive than Top-K: when the model is confident, fewer tokens are sampled from; when it is uncertain, more variety is allowed. Commonly combined with temperature for fine-grained control.
- Cumulative probability threshold
- Dynamic vocabulary filtering
- Adapts to confidence levels
- Common values: 0.9-0.95
Quantization
Post-Training Quantization (PTQ)
Quantizes a pre-trained model without additional training, using calibration data to determine optimal scaling factors. Reduces precision from FP32/FP16 to INT8/INT4, cutting memory 2-4x with minimal quality loss (typically 1-3%). Fast to apply (minutes to hours) and doesn't require training infrastructure. The go-to approach for deploying existing models efficiently.
- No training required
- Uses calibration dataset
- Minutes to hours to complete
- 1-3% typical quality loss
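The core calibration-then-round idea reduces to a few lines. A minimal sketch of symmetric per-tensor INT8 quantization (function names are illustrative; real toolchains also handle per-channel scales, zero points, and outlier clipping):

```python
def calibrate_scale(calibration_values, n_bits=8):
    """Map the largest observed magnitude onto the top of the signed
    integer range (symmetric, per-tensor quantization)."""
    qmax = 2 ** (n_bits - 1) - 1  # 127 for INT8
    return max(abs(v) for v in calibration_values) / qmax

def quantize(x, scale, n_bits=8):
    qmax = 2 ** (n_bits - 1) - 1
    return max(-qmax - 1, min(qmax, round(x / scale)))  # round and clamp

def dequantize(q, scale):
    return q * scale

calib = [-0.8, 0.3, 1.27, -1.1, 0.5]  # stand-in calibration data
scale = calibrate_scale(calib)
print(quantize(0.5, scale), dequantize(quantize(0.5, scale), scale))
```

The round trip through the integer grid loses only a small amount of precision, which is where the typical 1-3% quality gap comes from.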
GPTQ
Post-training quantization using approximate second-order information (Hessian). Quantizes one layer at a time, compensating for errors in subsequent layers. Fast to apply (~hours for large models) and produces high-quality INT4 models. Best for GPU inference with CUDA support. Widely supported in vLLM, TGI, and transformers.
- One-shot layer-by-layer quantization
- Uses Hessian for error compensation
- Optimized for GPU inference
- Wide framework support
AWQ (Activation-aware Weight Quantization)
Identifies salient weights (the ~1% of weights that cause large activation changes) and preserves them at higher precision. Typically 0.5-1% better perplexity than GPTQ at the same bit-width. More robust on instruction-following and reasoning tasks where certain weights are critical. Faster inference than GPTQ due to hardware-friendly design.
- Activation-aware weight selection
- Preserves critical 1% of weights
- Better quality than GPTQ
- Faster inference speed
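The activation-aware selection step can be illustrated with a toy sketch (real AWQ rescales salient channels rather than storing mixed precision, so this is a simplification; `awq_style_quantize` is a name chosen here):

```python
import numpy as np

def awq_style_quantize(W, X, keep_frac=0.01, n_bits=4):
    """Toy activation-aware quantization: rank input channels by mean
    activation magnitude, keep the most salient columns of W at full
    precision, and round the rest to a symmetric n-bit grid."""
    qmax = 2 ** (n_bits - 1) - 1
    saliency = np.abs(X).mean(axis=0)                # per input channel
    n_keep = max(1, int(round(keep_frac * W.shape[1])))
    salient = set(np.argsort(saliency)[-n_keep:])
    Wq = W.copy()
    for col in range(W.shape[1]):
        if col in salient:
            continue                                 # critical weights untouched
        scale = np.abs(W[:, col]).max() / qmax
        if scale > 0:
            Wq[:, col] = np.round(W[:, col] / scale) * scale
    return Wq, salient

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 100))
X = rng.standard_normal((32, 100))
X[:, 7] *= 50                                        # channel 7 sees huge activations
Wq, salient = awq_style_quantize(W, X)
print([int(i) for i in sorted(salient)])             # [7]
```

The key point survives the simplification: saliency is measured from activations, not from the weights themselves.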
GGUF
Quantization file format (used by llama.cpp) optimized for CPU and mixed CPU/GPU inference. Q4_K_M offers the best quality/size balance; Q5_K_M is near-lossless; Q8_0 is effectively lossless but larger. Enables running 7B-13B models on laptops without a GPU, or offloading layers between CPU and GPU memory. The standard for local/edge deployment.
- Optimized for CPU inference
- Multiple quantization levels (Q4-Q8)
- Layer offloading support
- Cross-platform compatibility
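File sizes are easy to estimate from the effective bits per weight of each quant type. The figures below are approximate (quantized formats carry overhead for scales and block metadata), so treat the table as a rough guide:

```python
# Approximate effective bits per weight, including scale/metadata overhead
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.85}

def gguf_size_gb(params_billion, quant_type):
    """Rough on-disk size of a model in a given GGUF quant type."""
    bits = BITS_PER_WEIGHT[quant_type]
    return params_billion * 1e9 * bits / 8 / 1e9

for q in ("F16", "Q8_0", "Q5_K_M", "Q4_K_M"):
    print(f"7B @ {q:7s} ~ {gguf_size_gb(7, q):4.1f} GB")
```

A 7B model drops from ~14 GB at FP16 to roughly 4.2 GB at Q4_K_M, which is why it fits comfortably in laptop RAM.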
Pre-Quantized Models
Models distributed already quantized, ready for immediate use without running quantization yourself. Available on Hugging Face (TheBloke, etc.) in GPTQ, AWQ, and GGUF formats. Saves hours of compute time and ensures consistent quantization quality. Check compatibility with your inference framework before downloading.
- Ready to use immediately
- No quantization compute needed
- Multiple format options
- Community-validated quality
QLoRA
Fine-tuning technique that keeps base model weights in 4-bit quantized form while training LoRA adapters in FP16/BF16. Reduces fine-tuning memory by 4-8x: a 65B model fits on a single 48GB GPU instead of requiring 8x 80GB GPUs. Uses NF4 (normal float 4-bit) data type and double quantization for minimal quality loss compared to full fine-tuning.
- 4-bit base model + FP16 adapters
- 4-8x memory reduction for training
- NF4 data type for quality
- Near full fine-tuning performance
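In the Hugging Face ecosystem this is typically wired up with `transformers`, `bitsandbytes`, and `peft`. A hedged configuration sketch (hyperparameters and target modules are illustrative, not prescriptive; the model-loading calls are commented out because they download weights):

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit base model: NF4 data type, double quantization, bf16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters trained in higher precision on top of the frozen 4-bit base
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # example attention projections
    task_type="CAUSAL_LM",
)

# model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
# model = get_peft_model(model, lora_config)
```

Only the adapter weights receive gradients, which is where the 4-8x training-memory reduction comes from.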
Parallelism Strategies
Tensor Parallelism
Splits individual layers (weight matrices) across multiple GPUs horizontally. Each GPU computes a portion of each layer, requiring all-reduce communication between layers. Reduces per-GPU memory linearly but adds latency from synchronization. Best within a single node with fast NVLink; cross-node adds significant overhead.
- Intra-layer parallelism
- Splits weight matrices across GPUs
- High communication overhead
- Best for large layers that don't fit on one GPU
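The two standard splits, column-parallel and row-parallel, can be checked with plain matrix algebra (numpy stands in for the per-GPU shards and collectives):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # activations: batch x hidden
W = rng.standard_normal((8, 8))   # one layer's weight matrix

# Column-parallel: each "GPU" owns half of the output columns and
# computes its slice independently; slices are concatenated (all-gather).
Y = np.concatenate([X @ W[:, :4], X @ W[:, 4:]], axis=1)

# Row-parallel: each GPU owns half of the input dimension; the partial
# products must be summed across devices (all-reduce).
Z = X[:, :4] @ W[:4, :] + X[:, 4:] @ W[4:, :]

print(np.allclose(Y, X @ W), np.allclose(Z, X @ W))  # True True
```

Transformer implementations commonly pair the two (column-parallel up-projection, row-parallel down-projection) so only one all-reduce is needed per MLP block.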
Pipeline Parallelism
Distributes model layers across GPUs vertically. Each GPU handles a subset of layers, passing activations forward. Lower communication than tensor parallelism (only at stage boundaries), but creates pipeline bubbles where GPUs idle waiting for dependencies. Micro-batching helps fill bubbles. Works well across nodes.
- Inter-layer parallelism
- Each GPU owns subset of layers
- Lower communication than tensor parallel
- Can have pipeline bubbles (idle time)
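The bubble cost of a simple GPipe-style schedule has a closed form, which makes the benefit of micro-batching easy to quantify:

```python
def bubble_fraction(n_stages, n_microbatches):
    """Idle fraction of a simple GPipe-style schedule:
    (p - 1) / (m + p - 1) for p pipeline stages and m micro-batches."""
    return (n_stages - 1) / (n_microbatches + n_stages - 1)

print(f"{bubble_fraction(4, 1):.0%}")   # one big batch: GPUs idle 75% of the time
print(f"{bubble_fraction(4, 16):.0%}")  # 16 micro-batches shrink the bubble to ~16%
```

Pushing `m` well above `p` is why micro-batching is essential rather than optional for pipeline-parallel training.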
Data Parallelism
Replicates the full model on each GPU and distributes different batches across devices. Gradients are synchronized via all-reduce after each step. Scales batch size linearly with GPU count. Simple to implement but requires each GPU to hold the entire model. For inference, enables higher throughput with independent replicas.
- Same model on all GPUs
- Different data batches per GPU
- Gradient synchronization required
- Scales batch size linearly
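The synchronization step is just an elementwise average across replicas. A minimal sketch (plain lists stand in for gradient tensors):

```python
def all_reduce_mean(grads_per_gpu):
    """Average gradients elementwise across replicas: the all-reduce
    that keeps every copy of the model identical after each step."""
    n = len(grads_per_gpu)
    return [sum(g[i] for g in grads_per_gpu) / n
            for i in range(len(grads_per_gpu[0]))]

# Each replica computed gradients on a different data batch
g_gpu0 = [0.2, -0.4, 0.1]
g_gpu1 = [0.4, 0.0, 0.3]
print(all_reduce_mean([g_gpu0, g_gpu1]))  # ~[0.3, -0.2, 0.2]
```

After the averaged gradient is applied, all replicas take the identical optimizer step, so they never drift apart.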
Sequence Parallelism
Distributes long input sequences across devices, each GPU processing a portion. Enables context lengths beyond single-GPU memory (e.g., 1M+ tokens). Ring attention passes KV slices between GPUs in a ring topology. Particularly important for long-document understanding and retrieval-augmented generation with large contexts.
- Splits sequence dimension
- Enables ultra-long contexts
- Works with ring attention
- Reduces memory per device
Memory Optimization
KV-Cache
Caches key and value tensors from previous tokens to avoid recomputation during autoregressive generation. Without caching, each new token would require recomputing attention over all previous tokens. Memory grows as 2 × batch × layers × kv_heads × seq_len × head_dim (the factor of 2 covers keys and values). For a 70B model at 8K context, the KV-cache alone can consume 10-20GB per request.
- Stores K/V from previous tokens
- Eliminates redundant computation
- Memory grows with sequence length
- Major bottleneck for long contexts
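The memory formula is worth plugging numbers into. A quick calculator (the 70B-class shape below, 80 layers and 64 heads of dimension 128, is an assumed configuration for illustration):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, dtype_bytes=2):
    """Factor of 2 covers keys and values; dtype_bytes=2 means FP16."""
    return 2 * batch * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# A 70B-class model with full multi-head attention at 8K context
gb = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=8192) / 1e9
print(f"{gb:.1f} GB per request")  # ~21.5 GB
```

Grouped-query attention shrinks this proportionally: the same model with 8 KV heads instead of 64 needs only about 2.7 GB per request.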
PagedAttention
Manages KV-cache like virtual memory with non-contiguous blocks. Traditional serving pre-allocates max sequence length, wasting 60-80% of memory. PagedAttention allocates on-demand in small blocks, achieving near-zero waste. Also enables prefix caching where common system prompts share KV-cache across requests, reducing memory and computation.
- Non-contiguous KV-cache blocks
- Eliminates memory fragmentation
- Enables prefix caching
- Near-zero memory waste
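The on-demand block allocation can be sketched as a toy allocator (a simplification of vLLM's design: no prefix sharing or copy-on-write, and class/method names are chosen here):

```python
class PagedKVCache:
    """Toy paged KV-cache: requests get fixed-size blocks on demand
    instead of a contiguous max-length reservation."""

    def __init__(self, n_blocks, block_size=16):
        self.free = list(range(n_blocks))
        self.block_size = block_size
        self.tables = {}   # request id -> list of physical block ids
        self.lengths = {}  # request id -> tokens stored

    def append(self, req):
        n = self.lengths.get(req, 0)
        if n % self.block_size == 0:  # last block is full: grab another
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1

    def release(self, req):  # request finished: blocks return to the pool
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)

cache = PagedKVCache(n_blocks=8, block_size=16)
for _ in range(20):
    cache.append("req0")  # 20 tokens span just 2 blocks
print(len(cache.tables["req0"]), len(cache.free))  # 2 6
cache.release("req0")
print(len(cache.free))  # 8
```

A request never holds more than one partially filled block, which is where the near-zero waste comes from.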
FlashAttention
Memory-efficient attention that tiles computation to avoid materializing the full N×N attention matrix. Fuses operations to minimize GPU memory reads/writes (the real bottleneck). Reduces memory from O(n²) to O(n), enabling 4-16x longer sequences. Also 2-4x faster due to better hardware utilization. Now standard in most inference frameworks.
- Tiled computation approach
- O(n) memory instead of O(n²)
- IO-aware algorithm design
- 2-4x faster than standard attention
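The core trick is the online softmax: process K/V in tiles while carrying a running max, normalizer, and unnormalized output, so the full score matrix never exists. A numpy sketch of the math (not the fused-kernel engineering that delivers the speedup):

```python
import numpy as np

def naive_attention(Q, K, V):
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    """Online-softmax attention over K/V tiles: running max m,
    normalizer l, and unnormalized accumulator acc replace the
    full seq_len x seq_len score matrix."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full(n, -np.inf)
    l = np.zeros(n)
    acc = np.zeros((n, d))
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale               # n x block tile of scores
        m_new = np.maximum(m, S.max(axis=-1))
        correction = np.exp(m - m_new)       # rescale previous state
        P = np.exp(S - m_new[:, None])
        l = l * correction + P.sum(axis=-1)
        acc = acc * correction[:, None] + P @ Vb
        m = m_new
    return acc / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
print(np.allclose(tiled_attention(Q, K, V), naive_attention(Q, K, V)))  # True
```

The tiled version produces exactly the same output; the O(n²) matrix only ever exists one block at a time.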
Throughput Optimization
Speculative Decoding
Uses a small draft model (e.g., 1B params) to generate K candidate tokens quickly, then verifies all K in a single forward pass of the large model. Accepted tokens are kept; rejected ones trigger resampling. Achieves 2-3x speedup without changing output distribution. Most effective when draft model closely matches target model's distribution.
- Draft with small model
- Verify with large model in parallel
- Maintains exact output distribution
- 2-3x inference speedup
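A greedy sketch of one draft-then-verify round (the full method uses rejection sampling over token probabilities to preserve the sampling distribution exactly; the lambdas below are toy stand-ins for real model calls):

```python
def speculative_step(draft_next, target_next, context, k=4):
    """One round of greedy speculative decoding: the draft model
    proposes k tokens; the target model checks them all at once."""
    proposal, seq = [], list(context)
    for _ in range(k):                  # k cheap draft-model steps
        t = draft_next(seq)
        proposal.append(t)
        seq.append(t)
    # One "forward pass" of the big model scores every position.
    accepted, seq = [], list(context)
    for t in proposal:
        target = target_next(seq)
        if target != t:                 # mismatch: keep target's token, stop
            accepted.append(target)
            break
        accepted.append(t)
        seq.append(t)
    return accepted

# Toy models: the draft guesses "abab...", the target wants "abba..."
draft = lambda seq: "ab"[len(seq) % 2]
target = lambda seq: "abba"[len(seq) % 4]
print(speculative_step(draft, target, [], k=4))  # ['a', 'b', 'b']
```

Here three tokens are committed for roughly the cost of one large-model pass, which is the source of the speedup.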
Continuous Batching
Dynamically adds new requests to running batches as slots become available, rather than waiting for entire batches to complete. A request finishing after 50 tokens immediately frees its slot for a new request, even while others generate 500+ tokens. Increases throughput 2-10x over static batching and reduces average latency significantly.
- Dynamic batch management
- No waiting for batch completion
- Maximizes GPU utilization
- Reduces average latency
Chunked Prefill
Splits long prompt processing into chunks (e.g., 512 tokens), interleaving prefill with ongoing decode operations. Without chunking, a 10K token prompt blocks all other requests for seconds. Chunking allows decode iterations to proceed between prefill chunks, reducing P99 latency dramatically for mixed workloads with varying prompt lengths.
- Breaks prefill into chunks
- Interleaves with decode phase
- Reduces head-of-line blocking
- Better latency for mixed workloads
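The interleaving can be shown as a toy schedule (real schedulers budget tokens per iteration across many requests; `schedule_prefill` is a name chosen here):

```python
def schedule_prefill(prompt_tokens, chunk=512):
    """Toy scheduler: after every prefill chunk, run one decode
    iteration so already-running requests keep producing tokens."""
    steps = []
    for start in range(0, prompt_tokens, chunk):
        steps.append(("prefill", start, min(start + chunk, prompt_tokens)))
        steps.append(("decode",))  # other requests each emit one token
    return steps

for step in schedule_prefill(1200):
    print(step)
```

A 1200-token prompt becomes three prefill chunks with decode iterations between them, instead of one long pause that stalls every other request.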
CTranslate2
C++/Python library for efficient Transformer inference with layer fusion, batch reordering, and KV caching. Supports INT8/INT4 quantization, runs on CPU (x86, ARM) and NVIDIA GPUs. Often 2-4x faster than HuggingFace Transformers with lower memory. Supports Llama, Mistral, Whisper, T5, BERT and more.
- Layer fusion and padding removal
- INT8/INT4/FP16 quantization built-in
- CPU and CUDA GPU support
- 2-4x faster than baseline Transformers
DeepSpeed
Microsoft's optimization library for training and inference at scale. ZeRO optimizer eliminates memory redundancy, enabling trillion-parameter models. DeepSpeed-Inference provides optimized kernels, tensor parallelism, and dynamic batching. Used to train MT-NLG 530B, BLOOM 176B. Integrates with HuggingFace, PyTorch Lightning, and Accelerate.
- ZeRO memory optimization (stages 1-3)
- ZeRO-Offload to CPU/NVMe
- DeepSpeed-Inference for serving
- Ulysses sequence parallelism
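A hedged configuration sketch showing ZeRO stage 3 with CPU offload (the dict mirrors DeepSpeed's JSON config schema; batch size and dtype choices are illustrative, and the initialize call is commented out because it needs a model and GPUs):

```python
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                               # partition params, grads, optimizer states
        "offload_optimizer": {"device": "cpu"},   # ZeRO-Offload
        "offload_param": {"device": "cpu"},
    },
}
# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```

Stage 1 partitions only optimizer states, stage 2 adds gradients, and stage 3 (shown) also partitions the parameters themselves, which is what makes trillion-parameter training feasible.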
