Performance Engineering
Key Performance Metrics
Latency metrics capture response time characteristics across the distribution of requests, from typical user experience to worst-case scenarios. P50 (median) represents the typical user experience, while P95 and P99 capture tail latencies affecting the slowest 5% and 1% of requests respectively. For LLM applications, TTFT (Time to First Token) measures how quickly streaming responses begin, while ITL (Inter-Token Latency) tracks the speed of token generation during streaming.
- P50 (Median): Typical user experience baseline
- P95: 95% of requests complete faster than this threshold
- P99: Tail latency capturing worst 1% of requests
- P99.9: Extreme outliers for SLA monitoring
- TTFT (Time to First Token): LLM streaming start latency
- ITL (Inter-Token Latency): Token generation speed
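These metrics can all be computed from raw timing samples. A minimal sketch in plain Python, using nearest-rank percentiles; the latency values and token timestamps are illustrative, not real measurements:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample >= p% of all samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Per-request latencies in milliseconds (illustrative values).
latencies_ms = [12, 15, 18, 22, 30, 45, 60, 85, 120, 400]
p50 = percentile(latencies_ms, 50)   # typical experience
p95 = percentile(latencies_ms, 95)   # tail
p99 = percentile(latencies_ms, 99)   # worst 1%

# Streaming metrics from token arrival timestamps (seconds since request).
token_times = [0.35, 0.40, 0.46, 0.51, 0.57]
ttft = token_times[0]  # time to first token
itl = (token_times[-1] - token_times[0]) / (len(token_times) - 1)  # mean inter-token latency
```

Note how one slow outlier (400 ms) leaves P50 untouched but dominates P95/P99, which is why averages alone hide tail behavior.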
Throughput metrics measure system capacity and resource utilization, quantifying how many requests or operations can be processed per unit time. QPS (Queries Per Second) tracks request rate, while Tokens/Second measures LLM generation throughput. GPU utilization and FLOPS efficiency reveal how effectively hardware resources are being leveraged. Memory bandwidth utilization indicates whether data transfer is a bottleneck. High throughput with efficient resource usage signals well-optimized systems.
- QPS (Queries Per Second): Request processing rate
- Tokens/Second: LLM token generation throughput
- Batch Size: Concurrent request processing capacity
- GPU Utilization: Percentage of GPU compute capacity used
- Memory Bandwidth: GB/s data transfer rate
- FLOPS Utilization: Percentage of theoretical maximum performance
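The core throughput numbers are simple ratios over a measurement window. A sketch with made-up example figures:

```python
def throughput_stats(num_requests, total_tokens, wall_seconds):
    """Aggregate throughput over a measurement window."""
    return {
        "qps": num_requests / wall_seconds,
        "tokens_per_sec": total_tokens / wall_seconds,
        "tokens_per_request": total_tokens / num_requests,
    }

# Illustrative window: 120 requests generating 24,000 tokens in 60 s.
stats = throughput_stats(num_requests=120, total_tokens=24_000, wall_seconds=60)
```

Measuring over a whole window (rather than instantaneously) smooths out batching bursts and gives a fair basis for comparing configurations.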
Efficiency metrics tie performance to cost and resource consumption, enabling economic optimization decisions. Cost per request and cost per token quantify the monetary expense of serving workloads, critical for production economics. Model FLOPs Utilization (MFU) measures training efficiency as a percentage of theoretical hardware maximum. Power efficiency (inferences per Watt) becomes crucial for sustainability and operational cost reduction. These metrics guide right-sizing and optimization priorities.
- Cost per Request: Dollar cost per inference operation
- Cost per Token: Dollar cost per 1K tokens for LLMs
- Model FLOPs Utilization (MFU): Training efficiency percentage
- Power Efficiency: Inferences per Watt consumed
- Resource Utilization: CPU, memory, GPU usage percentages
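MFU and cost per token reduce to short formulas. The sketch below uses the common ~6 FLOPs per parameter per token estimate (forward plus backward) for dense Transformer training; the model size, throughput, peak FLOPS, and GPU price are illustrative assumptions:

```python
def mfu(params, tokens_per_sec, peak_flops):
    """Model FLOPs Utilization: achieved training FLOPs / hardware peak,
    using the rough ~6 FLOPs per parameter per token estimate."""
    return (6 * params * tokens_per_sec) / peak_flops

def cost_per_1k_tokens(gpu_dollars_per_hour, tokens_per_sec):
    """Serving cost per 1K generated tokens on a single GPU."""
    return gpu_dollars_per_hour / (tokens_per_sec * 3600) * 1000

# Illustrative: 7B-parameter model, 2,000 tokens/s,
# 312 TFLOPS peak (BF16), $2 per GPU-hour.
training_mfu = mfu(params=7e9, tokens_per_sec=2_000, peak_flops=312e12)
serving_cost = cost_per_1k_tokens(gpu_dollars_per_hour=2.0, tokens_per_sec=2_000)
```

With these numbers MFU lands around 27%, which is in the range often reported for large-scale training; well below 100% because of memory traffic, communication, and non-matmul work.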
Scaling metrics quantify how effectively systems leverage additional resources. Scaling efficiency measures performance gain per added GPU, revealing communication overhead and parallelism effectiveness. Weak scaling (fixed work per GPU) and strong scaling (fixed total work) provide complementary views of scalability. Critical batch size identifies the point where adding more parallelism yields diminishing returns. Communication overhead percentage reveals time spent on synchronization versus computation.
- Scaling Efficiency: Performance gain per added GPU
- Communication Overhead: Percentage of time in GPU synchronization
- Weak Scaling: Fixed work per GPU; runtime should stay flat as GPUs are added
- Strong Scaling: Fixed total work; measure speedup as GPUs are added
- Critical Batch Size: Batch size beyond which additional parallelism yields diminishing returns
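Strong and weak scaling efficiency follow directly from the definitions above. A minimal sketch with illustrative timings:

```python
def strong_scaling_efficiency(t_1gpu, t_ngpu, n):
    """Fixed total work: achieved speedup relative to ideal n-fold speedup."""
    return (t_1gpu / t_ngpu) / n

def weak_scaling_efficiency(t_1gpu, t_ngpu):
    """Fixed work per GPU: ideally runtime stays flat as GPUs are added."""
    return t_1gpu / t_ngpu

# Illustrative: a job takes 100 s on 1 GPU and 15 s on 8 GPUs.
strong = strong_scaling_efficiency(100, 15, 8)   # ~0.83: ~17% lost to overhead
# Same per-GPU workload on 8 GPUs takes 110 s instead of 100 s.
weak = weak_scaling_efficiency(100, 110)         # ~0.91
```

The gap between measured efficiency and 1.0 is a direct estimate of communication and synchronization overhead.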
Latency Targets by Use Case
| Use Case | Target P99 Latency | Rationale | Optimization Focus |
|---|---|---|---|
| Real-time Chatbots | < 200ms | User expects instant responses | Model size, caching, quantization |
| Search/Recommendations | < 100ms | Part of larger page load | Batch inference, pre-computation |
| Content Generation | < 2s | User can wait briefly | Throughput over latency |
| Fraud Detection | < 50ms | Real-time transaction approval | Model simplification, edge deployment |
| Autonomous Vehicles | < 20ms | Safety-critical decisions | Edge inference, specialized hardware |
| Batch Processing | Minutes to hours | Offline workload | Maximize throughput, cost efficiency |
Common Performance Bottlenecks
Compute-bound workloads saturate GPU processing capacity, with utilization near 100% while operations wait in queue. The model is too complex for the available hardware, demanding more floating-point operations per second than the GPU can deliver. Solutions focus on reducing computational requirements through quantization (INT8/INT4), pruning unnecessary parameters, knowledge distillation to smaller models, or upgrading to more powerful GPUs. Operator fusion can also reduce overhead by combining multiple operations into a single kernel launch.
- Symptom: GPU utilization consistently near 100%
- Cause: Model complexity exceeds hardware capability
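To make the quantization idea concrete, here is symmetric INT8 quantization in plain Python; production systems use library kernels (e.g. from PyTorch or TensorRT), and the weight values below are made up:

```python
def quantize_int8(weights):
    """Symmetric INT8 quantization: map floats to [-127, 127] with one scale."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.8, -1.27, 0.03, 0.5]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored weight is within half a quantization step of the original.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight shrinks from 4 bytes (FP32) to 1 byte, cutting both compute and memory traffic per operation, at the cost of the small rounding error measured above.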
Memory-bound bottlenecks occur when GPU compute units sit idle waiting for data from memory, characterized by low GPU utilization despite high memory bandwidth usage. Inefficient memory access patterns fail to leverage GPU caching and bandwidth capabilities. Solutions include increasing batch size to amortize memory overhead, implementing Flash Attention for Transformers to reduce memory accesses, applying model parallelism to distribute memory load, and optimizing KV cache usage in autoregressive generation.
- Symptom: Low GPU utilization with high memory bandwidth
- Cause: Inefficient memory access patterns
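Whether a kernel is compute-bound or memory-bound can be estimated with a roofline-style check: compare its arithmetic intensity (FLOPs per byte moved) to the hardware's ridge point. The peak figures below are illustrative (roughly A100-class):

```python
def roofline_bound(kernel_flops, bytes_moved, peak_flops, peak_bandwidth):
    """Classify a kernel by comparing its arithmetic intensity (FLOPs/byte)
    to the hardware ridge point (peak FLOPs / peak memory bandwidth)."""
    intensity = kernel_flops / bytes_moved
    ridge = peak_flops / peak_bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bound"

# Illustrative accelerator: 312 TFLOPS peak, 2 TB/s HBM -> ridge = 156 FLOPs/byte.
# Batch-1 decoding reads every FP16 weight (2 bytes) for ~2 FLOPs: intensity ~1.
decode = roofline_bound(kernel_flops=2e9, bytes_moved=2e9,
                        peak_flops=312e12, peak_bandwidth=2e12)
# A large matrix multiply reuses each loaded value many times: intensity ~300.
gemm = roofline_bound(kernel_flops=6e11, bytes_moved=2e9,
                      peak_flops=312e12, peak_bandwidth=2e12)
```

This is why batching helps autoregressive decoding: more FLOPs are performed per byte of weights loaded, pushing intensity toward the ridge point.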
I/O-bound systems waste GPU cycles waiting for data loading and preprocessing, leaving expensive hardware idle. Slow storage, inefficient data pipelines, or poor prefetching cause the GPU to stall. Solutions involve optimizing data pipelines with parallel loading, implementing prefetching to load next batches during GPU processing, upgrading to faster NVMe storage, and deploying multiple parallel data loaders. Caching frequently accessed data can also dramatically reduce load times.
- Symptom: GPU sits idle waiting for input data
- Cause: Slow data loading or preprocessing pipeline
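Prefetching can be sketched with the standard library: a background thread keeps a small queue of upcoming batches full while the consumer works. This mirrors what framework data loaders do under the hood; the loader name is mine:

```python
import queue
import threading

def prefetching_loader(batches, depth=2):
    """Load up to `depth` batches ahead on a background thread so the
    consumer (e.g. the GPU step) never waits on I/O for the next batch."""
    q = queue.Queue(maxsize=depth)
    done = object()

    def producer():
        for batch in batches:
            q.put(batch)  # blocks once `depth` batches are already queued
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while (batch := q.get()) is not done:
        yield batch

# Usage: iterate exactly as with the raw loader.
loaded = list(prefetching_loader(iter(range(5))))
```

The bounded queue is the key design choice: it overlaps loading with compute without letting the producer run unboundedly ahead and exhaust memory.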
Communication-bound workloads suffer from excessive GPU synchronization overhead in multi-GPU setups, leading to poor scaling efficiency. Time spent exchanging gradients, activations, or model parameters dominates useful computation. Solutions include gradient accumulation to reduce synchronization frequency, upgrading interconnects to NVLink for faster GPU-to-GPU communication, implementing pipeline parallelism to overlap communication with computation, and reducing communication volume through compression or selective parameter updates.
- Symptom: Poor multi-GPU scaling efficiency
- Cause: Excessive GPU synchronization overhead
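Gradient accumulation reduces synchronization frequency by averaging several micro-batch gradients before each all-reduce. A numeric sketch with scalar stand-ins for gradient tensors:

```python
def accumulate_gradients(microbatch_grads, accum_steps):
    """Average gradients over `accum_steps` micro-batches, syncing
    (all-reduce + optimizer step) once per group instead of every step."""
    synced = []
    running = 0.0
    for i, grad in enumerate(microbatch_grads, start=1):
        running += grad
        if i % accum_steps == 0:
            synced.append(running / accum_steps)  # one sync point here
            running = 0.0
    return synced

# Four micro-batches, syncing every 2: half as many all-reduces.
updates = accumulate_gradients([1.0, 2.0, 3.0, 4.0], accum_steps=2)
```

The update values match what larger batches would produce, but communication cost drops by the accumulation factor.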
CPU-bound bottlenecks emerge when preprocessing or postprocessing on CPU cannot keep pace with GPU inference, leaving GPUs underutilized despite high CPU usage. Tokenization, data augmentation, or result processing becomes the limiting factor. Solutions involve moving preprocessing operations to GPU where possible, optimizing tokenization with compiled libraries, using vectorized transforms, or increasing CPU core count. Offloading compatible operations to GPU eliminates CPU-GPU data transfer overhead.
- Symptom: High CPU usage with GPU underutilized
- Cause: Preprocessing or postprocessing bottleneck
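One remedy is to overlap CPU preprocessing of the next batch with inference on the current one, so neither side idles. A minimal sketch; `preprocess` and `infer` are toy stand-ins for real tokenization and a GPU forward pass:

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess(batch):
    """Stand-in for CPU-side work (tokenization, resizing, ...)."""
    return [x * 2 for x in batch]

def infer(batch):
    """Stand-in for the GPU forward pass."""
    return sum(batch)

def pipelined_inference(batches):
    """Preprocess batch i+1 on a worker thread while batch i runs
    inference, overlapping CPU and GPU work."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(preprocess, batches[0])
        for nxt in batches[1:] + [None]:
            ready = future.result()
            if nxt is not None:
                future = pool.submit(preprocess, nxt)  # start next batch early
            results.append(infer(ready))
    return results

outputs = pipelined_inference([[1, 2], [3, 4]])
```

With real workloads the thread pool would hold multiple workers sized to match preprocessing cost against inference time.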
Network-bound systems experience latency from remote API calls, data transfers, or geographic distance between users and servers. Network round-trip times dominate request latency, particularly noticeable in distributed architectures. Solutions include deploying inference closer to users via CDN or edge computing, coalescing multiple small requests into batches, applying compression to reduce payload size, and using connection pooling to eliminate TCP handshake overhead for repeated requests.
- Symptom: High network latency in request path
- Cause: Remote API calls or data transfer delays
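Request coalescing can be illustrated in a few lines: individual requests are grouped so one network round trip serves many callers. A sketch of the batching logic only (real implementations also flush on a timeout so a partial batch never waits indefinitely):

```python
def coalesce(requests, max_batch=4):
    """Group individual requests so one round trip serves many callers,
    amortizing per-request network overhead."""
    batch = []
    for request in requests:
        batch.append(request)
        if len(batch) == max_batch:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# 10 requests collapse into 3 round trips instead of 10.
batches = list(coalesce(range(10), max_batch=4))
```

The trade-off is the classic latency-versus-throughput one: larger batches mean fewer round trips but a longer wait for the first request in each batch.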
Performance Profiling Tools
| Tool | Purpose | Metrics | Best For |
|---|---|---|---|
| nvidia-smi | GPU monitoring | Utilization, memory, power, temperature | Quick checks, real-time monitoring |
| NVIDIA Nsight Systems | System-wide profiling | CPU/GPU timeline, API calls, memory transfers | Identifying bottlenecks, end-to-end analysis |
| NVIDIA Nsight Compute | Kernel profiling | SM efficiency, memory throughput, occupancy | Kernel optimization, low-level tuning |
| PyTorch Profiler | PyTorch-specific profiling | Op-level timing, memory, CPU/GPU breakdown | PyTorch model optimization |
| TensorBoard Profiler | TensorFlow profiling | Op timing, memory, trace viewer | TensorFlow model optimization |
| cProfile / line_profiler | Python profiling | Function call times, line-by-line | CPU-bound Python code |
| perf / VTune | CPU profiling | Cache misses, branch mispredicts, CPU cycles | Low-level CPU optimization |
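As a quick example of the last row, cProfile and pstats from the standard library can rank functions by cumulative time; the workload here is a deliberately naive loop:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Deliberately naive hot loop to profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100_000)
profiler.disable()

# Report the top 5 functions by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
```

For line-level granularity within a hot function, line_profiler (`@profile` plus `kernprof`) picks up where cProfile's per-function view leaves off.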
Performance Optimization Workflow
Establish current performance characteristics by profiling representative workloads under production-like conditions. Identify which metrics matter most for your use case - latency for real-time applications, throughput for batch processing, or cost efficiency for large-scale deployments. Document baseline measurements to quantify improvement from optimizations.
Use profiling tools to systematically identify the slowest component limiting overall performance. Analyze GPU utilization, memory bandwidth, CPU usage, and I/O patterns to pinpoint whether you're compute-bound, memory-bound, or limited by data loading. Focus optimization efforts on the critical path that dominates execution time.
Implement targeted optimization addressing the identified bottleneck. Choose techniques appropriate to the limiting factor - quantization for compute-bound workloads, Flash Attention for memory-bound Transformers, or data pipeline optimization for I/O-bound systems. Avoid premature optimization of non-critical components that won't improve end-to-end performance.
Measure performance after optimization to quantify actual improvement against baseline metrics. Verify the fix didn't introduce regressions in other areas like accuracy or resource usage. If performance goals aren't met, profile again to identify the next bottleneck and repeat the cycle. Performance optimization is an iterative process of measurement, analysis, and targeted improvements.
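The measure-optimize-remeasure loop needs a consistent harness so before/after numbers are comparable. A minimal sketch: warmup iterations discard cold-start effects (JIT, caches), and the median resists outliers better than the mean:

```python
import statistics
import time

def benchmark_ms(fn, warmup=3, iters=20):
    """Median wall-clock latency of fn() in milliseconds, after warmup.
    Run before and after each optimization to quantify the change."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)

# Record a baseline, apply one optimization, then measure again.
baseline = benchmark_ms(lambda: sum(range(10_000)))
```

Keeping the harness fixed (same inputs, same iteration counts, same machine state) is what makes the resulting deltas trustworthy.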
Performance Engineering Best Practices
Always measure performance before optimizing to avoid wasting effort on non-critical paths. Focus on metrics that directly impact user experience or business objectives. Profile under production-like conditions including realistic data, concurrency, and infrastructure. Track performance over time to catch regressions early. Establish performance budgets that align with SLAs and user expectations to guide optimization priorities.
- Always measure before optimizing
- Focus on user-impacting metrics
- Profile in production-like conditions
- Track performance trends over time
- Set performance budgets aligned with SLAs
Optimize the critical path that dominates execution time first for maximum impact. Apply the 80/20 rule - fixing the biggest bottleneck typically yields more benefit than dozens of micro-optimizations. Prioritize algorithmic improvements over low-level tuning when possible. Balance competing objectives of latency, throughput, and cost based on use case requirements. Consider maintenance cost of complex optimizations against performance benefits.
- Optimize critical path with highest impact first
- 80/20 rule: Fix biggest bottleneck
- Algorithmic improvements over micro-optimizations
- Balance latency vs throughput vs cost
- Consider maintainability of optimizations
Performance optimization involves navigating fundamental trade-offs between competing objectives. Larger batch sizes improve throughput but increase latency. Quantization boosts speed but may reduce accuracy. Activation recomputation (gradient checkpointing) saves memory at the cost of additional compute. Hardware choices balance cost against performance needs. Complex optimizations may deliver gains but hurt code maintainability. Understanding and deliberately managing these trade-offs is essential for effective performance engineering.
- Latency vs Throughput: Batch size trade-off
- Accuracy vs Speed: Quantization impact
- Memory vs Compute: Recomputation strategies
- Cost vs Performance: Hardware selection
- Complexity vs Maintainability: Code sustainability
