LLM Performance


Sampling Parameters

Temperature

Controls randomness in token selection by scaling logits before softmax. Lower values (0.1-0.3) produce focused, near-deterministic outputs ideal for factual tasks; higher values (0.7-1.0) increase creativity and diversity for brainstorming. Temperature 0 always picks the highest-probability token (greedy decoding), while values well above 1 flatten the distribution and can produce incoherent text.

Key Features
  • Scales logits before softmax
  • 0 = greedy/deterministic
  • Higher = more random sampling
  • Trade-off: creativity vs coherence
Related
Top-P, Top-K, Min-P, Typical Sampling
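The scaling described above can be sketched in a few lines (a minimal illustration; real samplers operate on full vocabulary-sized logit tensors):

```python
import math

def sample_distribution(logits, temperature):
    """Scale logits by 1/temperature, then apply softmax.

    Temperature -> 0 approaches greedy (argmax); temperature > 1
    flattens the distribution toward uniform.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
cold = sample_distribution(logits, 0.2)   # sharply peaked on the top logit
hot = sample_distribution(logits, 2.0)    # much closer to uniform
```

At temperature 0.2 the top token takes nearly all of the probability mass; at 2.0 the same logits yield a much flatter distribution.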
Top-P (Nucleus Sampling)

Samples from the smallest set of tokens whose cumulative probability exceeds threshold p. At p=0.9, only tokens comprising 90% of probability mass are considered. More adaptive than Top-K: when the model is confident, fewer tokens are sampled; when uncertain, more variety is allowed. Commonly combined with temperature for fine-grained control.

Key Features
  • Cumulative probability threshold
  • Dynamic vocabulary filtering
  • Adapts to confidence levels
  • Common values: 0.9-0.95
Related
Temperature, Top-K, Beam Search, Contrastive Decoding
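The adaptive behavior is easy to see in a toy filter (a sketch over plain probability lists; production samplers work on logits and renormalize before sampling):

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p; renormalize the survivors."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Confident distribution: one token dominates, so the nucleus is tiny.
confident = top_p_filter([0.92, 0.05, 0.02, 0.01], p=0.9)
# Uncertain distribution: probability is spread, so more tokens survive.
uncertain = top_p_filter([0.35, 0.3, 0.2, 0.1, 0.05], p=0.9)
```

The confident case keeps a single token; the uncertain case keeps four, exactly the adaptivity that fixed-size Top-K lacks.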

Quantization

Post-Training Quantization (PTQ)

Quantizes a pre-trained model without additional training, using calibration data to determine optimal scaling factors. Reduces precision from FP32/FP16 to INT8/INT4, cutting memory 2-4x with minimal quality loss (typically 1-3%). Fast to apply (minutes to hours) and doesn't require training infrastructure. The go-to approach for deploying existing models efficiently.

Key Features
  • No training required
  • Uses calibration dataset
  • Minutes to hours to complete
  • 1-3% typical quality loss
Alternatives
QAT, Pruning, Distillation, Mixed Precision
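A minimal sketch of the core PTQ step, symmetric INT8 with a max-calibrated scale (real toolkits use per-channel scales and smarter calibration such as percentile clipping):

```python
def calibrate_scale(calibration_values, num_bits=8):
    """Pick a scale so the largest observed magnitude maps to the
    edge of the signed integer range (127 for INT8)."""
    qmax = 2 ** (num_bits - 1) - 1
    return max(abs(v) for v in calibration_values) / qmax

def quantize(values, scale, qmax=127):
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.5, -1.2, 0.03, 0.98, -0.4]
scale = calibrate_scale(weights)     # calibrating on the values themselves
restored = dequantize(quantize(weights, scale), scale)
# Round-trip error is bounded by about half the scale step.
error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each FP32 value becomes a single INT8 plus a shared scale, which is where the 4x memory reduction comes from.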
GPTQ

Post-training quantization using approximate second-order information (Hessian). Quantizes one layer at a time, compensating for errors in subsequent layers. Fast to apply (~hours for large models) and produces high-quality INT4 models. Best for GPU inference with CUDA support. Widely supported in vLLM, TGI, and transformers.

Key Features
  • One-shot layer-by-layer quantization
  • Uses Hessian for error compensation
  • Optimized for GPU inference
  • Wide framework support
Alternatives
AWQ, GGUF, bitsandbytes, SmoothQuant
AWQ (Activation-aware Weight Quantization)

Identifies salient weight channels (roughly 1% of weights, selected by the magnitude of the activations they multiply) and protects them with per-channel scaling before quantization. Typically 0.5-1% better perplexity than GPTQ at the same bit-width, and more robust on instruction-following and reasoning tasks where those weights are critical. Faster inference than GPTQ due to its hardware-friendly design.

Key Features
  • Activation-aware weight selection
  • Preserves critical 1% of weights
  • Better quality than GPTQ
  • Faster inference speed
Alternatives
GPTQ, GGUF, SmoothQuant, FP8
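A toy illustration of the activation-aware idea, with made-up shapes: rank weight channels by average activation magnitude on calibration inputs and flag the top ~1% as salient (the real AWQ method then protects those channels via per-channel scaling before 4-bit quantization):

```python
import numpy as np

rng = np.random.default_rng(0)
# Calibration batch: [samples, channels] (shapes are illustrative).
activations = rng.normal(size=(256, 1024))
activations[:, 7] *= 50              # make one channel clearly dominant

# Average |activation| per input channel -- the saliency signal.
saliency = np.abs(activations).mean(axis=0)
k = max(1, int(0.01 * saliency.size))      # keep ~1% of channels
salient = np.argsort(saliency)[-k:]        # channel indices to protect
```

The dominant channel lands in the salient set, even though its weights alone would not look special.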
GGUF (llama.cpp)

Quantization format optimized for CPU and mixed CPU/GPU inference. Q4_K_M offers best quality/size balance; Q5_K_M is near-lossless; Q8_0 is effectively lossless but larger. Enables running 7B-13B models on laptops without GPU, or offloading layers between CPU/GPU memory. The standard for local/edge deployment.

Key Features
  • Optimized for CPU inference
  • Multiple quantization levels (Q4-Q8)
  • Layer offloading support
  • Cross-platform compatibility
Alternatives
GPTQ, AWQ, ONNX quantization, TensorRT
Pre-Quantized Models

Models distributed already quantized, ready for immediate use without running quantization yourself. Available on Hugging Face (TheBloke, etc.) in GPTQ, AWQ, and GGUF formats. Saves hours of compute time and ensures consistent quantization quality. Check compatibility with your inference framework before downloading.

Key Features
  • Ready to use immediately
  • No quantization compute needed
  • Multiple format options
  • Community-validated quality
Alternatives
Self-quantization, On-the-fly quantization, bitsandbytes dynamic
QLoRA (Quantized LoRA)

Fine-tuning technique that keeps base model weights in 4-bit quantized form while training LoRA adapters in FP16/BF16. Reduces fine-tuning memory by 4-8x: a 65B model fits on a single 48GB GPU instead of requiring 8x 80GB GPUs. Uses NF4 (normal float 4-bit) data type and double quantization for minimal quality loss compared to full fine-tuning.

Key Features
  • 4-bit base model + FP16 adapters
  • 4-8x memory reduction for training
  • NF4 data type for quality
  • Near full fine-tuning performance
Alternatives
LoRA, Full fine-tuning, PEFT, Adapter tuning
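A back-of-envelope weight-memory estimate makes the 65B figure concrete (adapter size of ~200M parameters is an assumption for illustration; activations, optimizer state, and CUDA overhead are ignored):

```python
def finetune_weight_memory_gb(params_b, base_bits, adapter_params_b=0.2,
                              adapter_bytes=2):
    """Quantized base model plus FP16/BF16 LoRA adapters, weights only."""
    base = params_b * 1e9 * base_bits / 8 / 1e9
    adapters = adapter_params_b * 1e9 * adapter_bytes / 1e9
    return base + adapters

fp16_full = finetune_weight_memory_gb(65, base_bits=16)  # ~130 GB of weights
qlora = finetune_weight_memory_gb(65, base_bits=4)       # ~33 GB: fits 48 GB
```

Just holding a 65B model in FP16 needs ~130 GB; the 4-bit NF4 base plus small FP16 adapters fits comfortably on a single 48 GB card.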

Parallelism Strategies

Tensor Parallelism

Splits individual layers (weight matrices) across multiple GPUs horizontally. Each GPU computes a portion of each layer, requiring all-reduce communication between layers. Reduces per-GPU memory linearly but adds latency from synchronization. Best within a single node with fast NVLink; cross-node adds significant overhead.

Key Features
  • Intra-layer parallelism
  • Splits weight matrices across GPUs
  • High communication overhead
  • Best for large layers that don't fit on one GPU
Alternatives
Pipeline Parallelism, Data Parallelism, Expert Parallelism
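A column-parallel linear layer, simulated with NumPy arrays standing in for two GPUs (the concat plays the role of the all-gather communication step):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))        # activations: [batch, hidden]
W = rng.normal(size=(8, 16))       # weight matrix to split across 2 "GPUs"

# Column-parallel split: each device owns half the output columns.
W0, W1 = np.split(W, 2, axis=1)
partial0 = x @ W0                  # computed on device 0
partial1 = x @ W1                  # computed on device 1

# Gathering the partial outputs reproduces the full matmul exactly.
combined = np.concatenate([partial0, partial1], axis=1)
assert np.allclose(combined, x @ W)
```

Each device stores only half of W, which is the per-GPU memory saving; the gather is the communication cost the entry describes.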
Pipeline Parallelism

Distributes model layers across GPUs vertically. Each GPU handles a subset of layers, passing activations forward. Lower communication than tensor parallelism (only at stage boundaries), but creates pipeline bubbles where GPUs idle waiting for dependencies. Micro-batching helps fill bubbles. Works well across nodes.

Key Features
  • Inter-layer parallelism
  • Each GPU owns subset of layers
  • Lower communication than tensor parallel
  • Can have pipeline bubbles (idle time)
Alternatives
Tensor Parallelism, Data Parallelism, Interleaved Pipeline
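The bubble cost has a simple closed form for a GPipe-style schedule, which shows why micro-batching matters:

```python
def bubble_fraction(stages, micro_batches):
    """Fraction of time stages sit idle in a simple GPipe-style
    schedule: (p - 1) / (m + p - 1) for p stages, m micro-batches."""
    return (stages - 1) / (micro_batches + stages - 1)

no_microbatching = bubble_fraction(stages=4, micro_batches=1)     # 0.75
with_microbatching = bubble_fraction(stages=4, micro_batches=16)  # ~0.16
```

With one micro-batch, four stages idle 75% of the time; sixteen micro-batches shrink the bubble below 16%.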
Data Parallelism

Replicates the full model on each GPU and distributes different batches across devices. Gradients are synchronized via all-reduce after each step. Scales batch size linearly with GPU count. Simple to implement but requires each GPU to hold the entire model. For inference, enables higher throughput with independent replicas.

Key Features
  • Same model on all GPUs
  • Different data batches per GPU
  • Gradient synchronization required
  • Scales batch size linearly
Alternatives
FSDP, ZeRO, Tensor Parallelism, DDP
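The synchronization step is just an averaging all-reduce, simulated here with NumPy arrays standing in for four replicas:

```python
import numpy as np

rng = np.random.default_rng(0)
# Each replica computes gradients on its own data batch (simulated).
local_grads = [rng.normal(size=(3, 3)) for _ in range(4)]   # 4 "GPUs"

# All-reduce (mean): every replica ends up with the same averaged
# gradient, so all model copies stay identical after the optimizer step.
averaged = sum(local_grads) / len(local_grads)
```

Because every replica applies the same averaged gradient, the model copies never drift apart; the cost is one all-reduce of gradient-sized data per step.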
Sequence Parallelism

Distributes long input sequences across devices, each GPU processing a portion. Enables context lengths beyond single-GPU memory (e.g., 1M+ tokens). Ring attention passes KV slices between GPUs in a ring topology. Particularly important for long-document understanding and retrieval-augmented generation with large contexts.

Key Features
  • Splits sequence dimension
  • Enables ultra-long contexts
  • Works with ring attention
  • Reduces memory per device
Alternatives
Context Parallelism, Ring Attention, Ulysses Attention

Memory Optimization

KV-Cache

Caches key and value tensors from previous tokens to avoid recomputation during autoregressive generation. Without caching, each new token would require recomputing attention over all previous tokens. Memory grows as O(batch × layers × heads × seq_len × head_dim). For a 70B model at 8K context, KV-cache alone can consume 10-20GB per request.

Key Features
  • Stores K/V from previous tokens
  • Eliminates redundant computation
  • Memory grows with sequence length
  • Major bottleneck for long contexts
Alternatives
Multi-Query Attention, Grouped-Query Attention, Sliding Window
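The growth formula can be turned into a calculator. The shapes below are assumed, illustrative 70B-class numbers (80 layers, head_dim 128, FP16); the 10-20 GB figure corresponds to full multi-head attention, and grouped-query attention cuts it by the head-count ratio:

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch=1,
                bytes_per_elem=2):
    """KV-cache size; 2x for keys and values, bytes_per_elem=2 for FP16."""
    elems = 2 * batch * layers * kv_heads * head_dim * seq_len
    return elems * bytes_per_elem / 1e9

mha = kv_cache_gb(layers=80, kv_heads=64, head_dim=128, seq_len=8192)  # ~21 GB
gqa = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, seq_len=8192)   # ~2.7 GB
```

Doubling the context doubles the cache, which is why long-context serving is memory-bound long before it is compute-bound.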
PagedAttention (vLLM)

Manages KV-cache like virtual memory with non-contiguous blocks. Traditional serving pre-allocates max sequence length, wasting 60-80% of memory. PagedAttention allocates on-demand in small blocks, achieving near-zero waste. Also enables prefix caching where common system prompts share KV-cache across requests, reducing memory and computation.

Key Features
  • Non-contiguous KV-cache blocks
  • Eliminates memory fragmentation
  • Enables prefix caching
  • Near-zero memory waste
Alternatives
Continuous batching, Dynamic memory allocation, Chunked prefill
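The waste reduction is easy to quantify with a toy comparison (block size of 16 tokens matches vLLM's default; the request lengths are made up):

```python
def blocks_needed(tokens, block_size=16):
    return -(-tokens // block_size)          # ceiling division

def wasted_tokens(actual_len, reserved_len, block_size=16):
    """Pre-allocation reserves max_seq_len up front; paging only
    over-allocates within the last partially filled block."""
    preallocated_waste = reserved_len - actual_len
    paged_waste = blocks_needed(actual_len, block_size) * block_size - actual_len
    return preallocated_waste, paged_waste

# A request that stops after 300 tokens, under a 2048-token reservation:
prealloc_waste, paged_waste = wasted_tokens(actual_len=300, reserved_len=2048)
```

The pre-allocating server wastes 1748 token slots; the paged server wastes at most one partial block (4 slots here), which is the "near-zero waste" claim in practice.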
Flash Attention

Memory-efficient attention that tiles computation to avoid materializing the full N×N attention matrix. Fuses operations to minimize GPU memory reads/writes (the real bottleneck). Reduces memory from O(n²) to O(n), enabling 4-16x longer sequences. Also 2-4x faster due to better hardware utilization. Now standard in most inference frameworks.

Key Features
  • Tiled computation approach
  • O(n) memory instead of O(n²)
  • IO-aware algorithm design
  • 2-4x faster than standard attention
Alternatives
Memory Efficient Attention, Ring Attention, xFormers
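The tiling trick rests on the online softmax: a running max, normalizer, and weighted sum are updated per tile, so the full score row is never materialized. A single-query NumPy sketch (real kernels fuse this per tile of queries on-chip):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, tile = 16, 64, 16
q = rng.normal(size=d)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

# Naive attention materializes all n scores at once.
scores = K @ q / np.sqrt(d)
w = np.exp(scores - scores.max())
naive = (w / w.sum()) @ V

# Tiled/online version: process K/V in blocks, keeping a running max (m),
# running normalizer (l), and running weighted sum (acc).
m, l, acc = -np.inf, 0.0, np.zeros(d)
for start in range(0, n, tile):
    Kb, Vb = K[start:start + tile], V[start:start + tile]
    s = Kb @ q / np.sqrt(d)
    m_new = max(m, s.max())
    correction = np.exp(m - m_new)   # rescale earlier partial results
    p = np.exp(s - m_new)
    l = l * correction + p.sum()
    acc = acc * correction + p @ Vb
    m = m_new
tiled = acc / l

assert np.allclose(tiled, naive)
```

Only one tile of scores lives in memory at a time, which is how O(n²) becomes O(n) without changing the result.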

Throughput Optimization

Speculative Decoding

Uses a small draft model (e.g., 1B params) to generate K candidate tokens quickly, then verifies all K in a single forward pass of the large model. Accepted tokens are kept; rejected ones trigger resampling. Achieves 2-3x speedup without changing output distribution. Most effective when draft model closely matches target model's distribution.

Key Features
  • Draft with small model
  • Verify with large model in parallel
  • Maintains exact output distribution
  • 2-3x inference speedup
Alternatives
Medusa, EAGLE, Lookahead Decoding, Self-Speculative
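A greedy simplification of the draft-then-verify loop, with toy callables standing in for the two models (the full algorithm uses a probabilistic accept/reject rule that exactly preserves the target's sampling distribution):

```python
def greedy_speculative_step(draft_next, target_next, context, k=4):
    """One round: draft k tokens cheaply, then check them against the
    target model's greedy choices (one batched target pass in practice)."""
    drafted, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)
    accepted, ctx = [], list(context)
    for tok in drafted:
        if target_next(ctx) != tok:          # first mismatch: stop, and
            accepted.append(target_next(ctx))  # emit the target's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    return accepted

# Toy "models": the target counts up; the draft agrees except on every
# third token, where it jumps ahead by two.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if len(ctx) % 3 else ctx[-1] + 2

out = greedy_speculative_step(draft, target, context=[0], k=4)
```

Here two drafted tokens are accepted and the mismatch is replaced by the target's own choice, so one large-model pass yields three tokens instead of one.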
Continuous Batching

Dynamically adds new requests to running batches as slots become available, rather than waiting for entire batches to complete. A request finishing after 50 tokens immediately frees its slot for a new request, even while others generate 500+ tokens. Increases throughput 2-10x over static batching and reduces average latency significantly.

Key Features
  • Dynamic batch management
  • No waiting for batch completion
  • Maximizes GPU utilization
  • Reduces average latency
Alternatives
Static batching, In-flight batching, Iteration-level batching
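A small scheduling simulation shows where the gain comes from (request lengths in decode steps are made up; one long request shares the batch with many short ones):

```python
import heapq

def static_batch_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest member finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batch_steps(lengths, batch_size):
    """Continuous batching: a freed slot is refilled immediately (FIFO)."""
    pending = list(lengths)
    finish_times = []          # min-heap of active requests' finish steps
    t = 0
    while pending or finish_times:
        while pending and len(finish_times) < batch_size:
            heapq.heappush(finish_times, t + pending.pop(0))
        t = heapq.heappop(finish_times)
    return t

lengths = [500] + [50] * 10    # one long request, ten short ones
static = static_batch_steps(lengths, batch_size=2)          # 750 steps
continuous = continuous_batch_steps(lengths, batch_size=2)  # 500 steps
```

The static scheduler keeps re-pairing short requests with idle time; the continuous one streams all ten short requests through the second slot while the long one runs, finishing in 500 steps instead of 750.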
Chunked Prefill

Splits long prompt processing into chunks (e.g., 512 tokens), interleaving prefill with ongoing decode operations. Without chunking, a 10K token prompt blocks all other requests for seconds. Chunking allows decode iterations to proceed between prefill chunks, reducing P99 latency dramatically for mixed workloads with varying prompt lengths.

Key Features
  • Breaks prefill into chunks
  • Interleaves with decode phase
  • Reduces head-of-line blocking
  • Better latency for mixed workloads
Alternatives
Disaggregated prefill, Prefix caching, Prompt caching
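The scheduling pattern reduces to a simple interleaving loop (a sketch with an assumed one-decode-iteration-per-chunk policy; real schedulers budget tokens per iteration):

```python
def chunked_prefill_schedule(prompt_tokens, chunk=512):
    """Interleave prefill chunks with decode iterations so running
    requests are not blocked behind one long prompt."""
    ops = []
    for start in range(0, prompt_tokens, chunk):
        ops.append(("prefill", min(chunk, prompt_tokens - start)))
        ops.append(("decode_others", 1))   # other requests advance here
    return ops

ops = chunked_prefill_schedule(10_000, chunk=512)
```

A 10K-token prompt becomes 20 prefill chunks with a decode iteration after each, so other requests emit 20 tokens during a prefill that would otherwise stall them entirely.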
CTranslate2

C++/Python library for efficient Transformer inference with layer fusion, batch reordering, and KV caching. Supports INT8/INT4 quantization, runs on CPU (x86, ARM) and NVIDIA GPUs. Often 2-4x faster than HuggingFace Transformers with lower memory. Supports Llama, Mistral, Whisper, T5, BERT and more.

Key Features
  • Layer fusion and padding removal
  • INT8/INT4/FP16 quantization built-in
  • CPU and CUDA GPU support
  • 2-4x faster than baseline Transformers
Alternatives
llama.cpp, vLLM, TensorRT-LLM, ONNX Runtime
DeepSpeed

Microsoft's optimization library for training and inference at scale. ZeRO optimizer eliminates memory redundancy, enabling trillion-parameter models. DeepSpeed-Inference provides optimized kernels, tensor parallelism, and dynamic batching. Used to train MT-NLG 530B, BLOOM 176B. Integrates with HuggingFace, PyTorch Lightning, and Accelerate.

Key Features
  • ZeRO memory optimization (stages 1-3)
  • ZeRO-Offload to CPU/NVMe
  • DeepSpeed-Inference for serving
  • Ulysses sequence parallelism
Alternatives
FSDP, Megatron-LM, Colossal-AI, FairScale
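An illustrative ds_config.json fragment enabling ZeRO stage 3 with CPU offload; the field names follow DeepSpeed's config schema, but the values are placeholders to adapt per job:

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" }
  }
}
```

Stage 3 shards parameters, gradients, and optimizer state across GPUs; the two offload blocks push optimizer state and parameters to CPU memory, trading step time for the ability to fit much larger models.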