Performance Engineering


Key Performance Metrics

Latency Metrics

Latency metrics capture response-time characteristics across the full distribution of requests, from the typical user experience to worst-case tails. P50 (the median) represents the typical user experience, while P95 and P99 capture the tail latencies seen by the slowest 5% and 1% of requests respectively. For LLM applications, TTFT (Time to First Token) measures how quickly a streaming response begins, while ITL (Inter-Token Latency) measures the time between successive tokens during generation.

Key Features
  • P50 (Median): Typical user experience baseline
  • P95: 95% of requests complete faster than this threshold
  • P99: Tail latency capturing worst 1% of requests
  • P99.9: Extreme outliers for SLA monitoring
  • TTFT (Time to First Token): LLM streaming start latency
  • ITL (Inter-Token Latency): Token generation speed
Similar Technologies
Response Time, Percentiles, SLA Metrics, User-Perceived Latency
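As a concrete illustration, percentiles can be computed from raw latency samples with the nearest-rank method; the sample values below are hypothetical:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample >= pct% of all samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical per-request latencies in milliseconds
latencies_ms = [12, 15, 14, 200, 16, 13, 18, 17, 950, 14]
p50 = percentile(latencies_ms, 50)   # 15 (typical request)
p95 = percentile(latencies_ms, 95)   # 950 (tail dominated by the slowest request)
```

Note that with only ten samples the P95 already lands on the single slowest request, which is why stable tail percentiles require large sample counts.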
Throughput Metrics

Throughput metrics measure system capacity and resource utilization, quantifying how many requests or operations can be processed per unit time. QPS (Queries Per Second) tracks request rate, while Tokens/Second measures LLM generation throughput. GPU utilization and FLOPS efficiency reveal how effectively hardware resources are being leveraged. Memory bandwidth utilization indicates whether data transfer is a bottleneck. High throughput with efficient resource usage signals well-optimized systems.

Key Features
  • QPS (Queries Per Second): Request processing rate
  • Tokens/Second: LLM token generation throughput
  • Batch Size: Concurrent request processing capacity
  • GPU Utilization: Percentage of GPU compute capacity used
  • Memory Bandwidth: GB/s data transfer rate
  • FLOPS Utilization: Percentage of theoretical maximum performance
Similar Technologies
TPS, Request Rate, Processing Capacity, Resource Utilization
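A minimal sketch of how these rates fall out of a measurement window; the request and token counts are illustrative:

```python
def throughput(completed_requests, total_tokens, wall_seconds):
    """Return (QPS, tokens/second) over a fixed measurement window."""
    return completed_requests / wall_seconds, total_tokens / wall_seconds

# Hypothetical one-minute window: 1,200 requests, 480k generated tokens
qps, tokens_per_s = throughput(1200, 480_000, 60.0)   # 20.0 QPS, 8000.0 tokens/s
```

Measuring over a window (rather than per request) smooths out bursts and matches how capacity planning is usually done.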
Efficiency Metrics

Efficiency metrics tie performance to cost and resource consumption, enabling economic optimization decisions. Cost per request and cost per token quantify the monetary expense of serving workloads, critical for production economics. Model FLOPs Utilization (MFU) measures training efficiency as a percentage of theoretical hardware maximum. Power efficiency (inferences per Watt) becomes crucial for sustainability and operational cost reduction. These metrics guide right-sizing and optimization priorities.

Key Features
  • Cost per Request: Dollar cost per inference operation
  • Cost per Token: Dollar cost per 1K tokens for LLMs
  • Model FLOPs Utilization (MFU): Training efficiency percentage
  • Power Efficiency: Inferences per Watt consumed
  • Resource Utilization: CPU, memory, GPU usage percentages
Similar Technologies
TCO, Cost Optimization, Energy Efficiency, ROI Metrics
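These definitions reduce to simple arithmetic. The sketch below uses the common approximation of roughly 6·N training FLOPs per token for an N-parameter Transformer; the parameter count, token rate, peak-FLOPs figure, and GPU price are illustrative assumptions, not measurements:

```python
def mfu(params, tokens_per_second, peak_flops_per_second):
    """Model FLOPs Utilization, using the ~6*N FLOPs-per-token training estimate."""
    achieved_flops = 6 * params * tokens_per_second
    return achieved_flops / peak_flops_per_second

def cost_per_1k_tokens(gpu_cost_per_hour, tokens_per_second):
    """Dollar cost to generate 1,000 tokens on a single GPU."""
    return gpu_cost_per_hour / 3600 * (1000 / tokens_per_second)

# Hypothetical: 7B-parameter model, 1,000 tokens/s, 312 TFLOP/s peak, $2/GPU-hour
training_mfu = mfu(7_000_000_000, 1000, 312e12)   # ~0.135, i.e. ~13.5% of peak
serving_cost = cost_per_1k_tokens(2.0, 1000)      # ~$0.00056 per 1K tokens
```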
Scaling Metrics

Scaling metrics quantify how effectively systems leverage additional resources. Scaling efficiency measures performance gain per added GPU, revealing communication overhead and parallelism effectiveness. Weak scaling (fixed work per GPU) and strong scaling (fixed total work) provide complementary views of scalability. Critical batch size identifies the point where adding more parallelism yields diminishing returns. Communication overhead percentage reveals time spent on synchronization versus computation.

Key Features
  • Scaling Efficiency: Performance gain per added GPU
  • Communication Overhead: Percentage of time in GPU synchronization
  • Weak Scaling: Fixed work per GPU; time per step should stay flat as GPUs are added
  • Strong Scaling: Fixed total work distributed across GPUs
  • Critical Batch Size: Maximum batch size before diminishing returns
Similar Technologies
Parallel Efficiency, Amdahl's Law, Distributed Performance, Multi-GPU Scaling
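A sketch of strong-scaling efficiency alongside the Amdahl's law bound it runs up against; the step times and parallel fraction are hypothetical:

```python
def scaling_efficiency(t1, tn, n):
    """Strong-scaling efficiency: measured speedup on n GPUs divided by the ideal n."""
    return (t1 / tn) / n

def amdahl_speedup(parallel_fraction, n):
    """Amdahl's law: the serial fraction caps achievable speedup on n workers."""
    return 1 / ((1 - parallel_fraction) + parallel_fraction / n)

# Hypothetical: 100 s/step on 1 GPU drops to 16 s/step on 8 GPUs
eff = scaling_efficiency(100, 16, 8)    # 6.25x speedup / 8 GPUs = 0.78125
bound = amdahl_speedup(0.95, 8)         # ~5.93x ceiling even with 95% parallel work
```

The gap between measured efficiency and the Amdahl bound is where communication overhead and load imbalance hide.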

Latency Targets by Use Case

Use Case | Target P99 Latency | Rationale | Optimization Focus
Real-time Chatbots | < 200 ms | User expects instant responses | Model size, caching, quantization
Search/Recommendations | < 100 ms | Part of a larger page load | Batch inference, pre-computation
Content Generation | < 2 s | User can wait briefly | Throughput over latency
Fraud Detection | < 50 ms | Real-time transaction approval | Model simplification, edge deployment
Autonomous Vehicles | < 20 ms | Safety-critical decisions | Edge inference, specialized hardware
Batch Processing | Minutes to hours | Offline workload | Maximize throughput, cost efficiency

Common Performance Bottlenecks

Compute-Bound

Compute-bound workloads saturate GPU processing capacity, with utilization near 100% while operations wait in queue. The model is too complex for available hardware, requiring more floating-point operations than the GPU can provide. Solutions focus on reducing computational requirements through quantization (INT8/INT4), pruning unnecessary parameters, knowledge distillation to smaller models, or upgrading to more powerful GPUs. Operator fusion can also reduce overhead by combining multiple operations.

Key Features
  • Symptom: GPU utilization consistently near 100%
  • Cause: Model complexity exceeds hardware capability
Similar Technologies
Quantization, Model Pruning, Distillation, Operator Fusion, Hardware Upgrade
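To make the quantization route concrete, here is a toy sketch of symmetric post-training INT8 quantization with a single per-tensor scale; the weight values are illustrative, and production quantizers additionally handle calibration data, outliers, and per-channel scales:

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats onto integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate floats; the gap versus the originals is quantization error."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.02, 1.0]      # toy weight tensor
q, scale = quantize_int8(weights)      # q = [50, -127, 2, 100]
restored = dequantize(q, scale)
```

INT8 storage is 4x smaller than FP32 and integer matrix math runs faster on modern accelerators, which is why quantization directly attacks a compute-bound profile.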
Memory-Bound

Memory-bound bottlenecks occur when GPU compute units sit idle waiting for data from memory, characterized by low GPU utilization despite high memory bandwidth usage. Inefficient memory access patterns fail to leverage GPU caching and bandwidth capabilities. Solutions include increasing batch size to amortize memory overhead, implementing Flash Attention for Transformers to reduce memory accesses, applying model parallelism to distribute memory load, and optimizing KV cache usage in autoregressive generation.

Key Features
  • Symptom: Low GPU utilization with high memory bandwidth
  • Cause: Inefficient memory access patterns
Similar Technologies
Flash Attention, Batch Size Increase, Model Parallelism, KV Cache Optimization
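The KV cache is often what makes autoregressive generation memory-bound, and its footprint is simple arithmetic. The layer and head counts below describe a hypothetical 7B-class configuration, not any specific model release:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Per-batch KV cache size: two tensors (K and V) per layer, FP16 by default."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# Hypothetical 7B-class config: 32 layers, 32 KV heads, head_dim 128, 2048-token context
cache = kv_cache_bytes(32, 32, 128, 2048, batch=1)   # 2**30 bytes = 1 GiB per request
```

At batch size 32 that single gigabyte becomes 32 GiB, which is why KV cache optimization appears above as a first-line memory-bound fix.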
I/O-Bound

I/O-bound systems waste GPU cycles waiting for data loading and preprocessing, leaving expensive hardware idle. Slow storage, inefficient data pipelines, or poor prefetching cause the GPU to stall. Solutions involve optimizing data pipelines with parallel loading, implementing prefetching to load next batches during GPU processing, upgrading to faster NVMe storage, and deploying multiple parallel data loaders. Caching frequently accessed data can also dramatically reduce load times.

Key Features
  • Symptom: GPU sits idle waiting for input data
  • Cause: Slow data loading or preprocessing pipeline
Similar Technologies
Data Pipeline Optimization, Prefetching, NVMe Storage, Parallel Loaders, Caching
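Prefetching can be sketched with a background thread and a bounded queue: the loader fills the buffer while the consumer (the GPU step) drains it. This is a minimal stdlib sketch, not a replacement for framework data loaders:

```python
import queue
import threading

def prefetch(batches, depth=2):
    """Yield batches while a background thread loads the next ones ahead of time."""
    buf = queue.Queue(maxsize=depth)   # bounded: loader stays only `depth` ahead
    done = object()                    # sentinel marking the end of the stream

    def loader():
        for batch in batches:
            buf.put(batch)             # blocks when the buffer is full
        buf.put(done)

    threading.Thread(target=loader, daemon=True).start()
    while (item := buf.get()) is not done:
        yield item

# The consumer's "compute" overlaps with background loading
results = [b * 2 for b in prefetch(range(5))]   # [0, 2, 4, 6, 8]
```

The bounded queue is the key design choice: it caps memory use while keeping the next batch ready the moment the consumer asks for it.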
Communication-Bound

Communication-bound workloads suffer from excessive GPU synchronization overhead in multi-GPU setups, leading to poor scaling efficiency. Time spent exchanging gradients, activations, or model parameters dominates useful computation. Solutions include gradient accumulation to reduce synchronization frequency, upgrading interconnects to NVLink for faster GPU-to-GPU communication, implementing pipeline parallelism to overlap communication with computation, and reducing communication volume through compression or selective parameter updates.

Key Features
  • Symptom: Poor multi-GPU scaling efficiency
  • Cause: Excessive GPU synchronization overhead
Similar Technologies
Gradient Accumulation, NVLink, Pipeline Parallelism, Communication Reduction
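Gradient accumulation reduces how often ranks must synchronize. In the toy sketch below the scalar "gradients" stand in for tensors, and apply_update stands in for the all-reduce plus optimizer step:

```python
def accumulate_gradients(microbatch_grads, accum_steps, apply_update):
    """Accumulate locally for accum_steps microbatches, then sync and update once.
    Communication frequency drops by accum_steps at the same effective batch size."""
    total, syncs = 0.0, 0
    for step, grad in enumerate(microbatch_grads, start=1):
        total += grad                            # local add, no communication
        if step % accum_steps == 0:
            apply_update(total / accum_steps)    # one all-reduce + optimizer step
            total, syncs = 0.0, syncs + 1
    return syncs

updates = []
syncs = accumulate_gradients([1, 2, 3, 4, 5, 6, 7, 8], 4, updates.append)
# syncs == 2, updates == [2.5, 6.5]: one synchronization per 4 microbatches
```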
CPU-Bound

CPU-bound bottlenecks emerge when preprocessing or postprocessing on CPU cannot keep pace with GPU inference, leaving GPUs underutilized despite high CPU usage. Tokenization, data augmentation, or result processing becomes the limiting factor. Solutions involve moving preprocessing operations to GPU where possible, optimizing tokenization with compiled libraries, using vectorized transforms, or increasing CPU core count. Offloading compatible operations to GPU eliminates CPU-GPU data transfer overhead.

Key Features
  • Symptom: High CPU usage with GPU underutilized
  • Cause: Preprocessing or postprocessing bottleneck
Similar Technologies
GPU Preprocessing, Tokenization Optimization, Compiled Transforms, CPU Scaling
Network-Bound

Network-bound systems experience latency from remote API calls, data transfers, or geographic distance between users and servers. Network round-trip times dominate request latency, particularly noticeable in distributed architectures. Solutions include deploying inference closer to users via CDN or edge computing, coalescing multiple small requests into batches, applying compression to reduce payload size, and using connection pooling to eliminate TCP handshake overhead for repeated requests.

Key Features
  • Symptom: High network latency in request path
  • Cause: Remote API calls or data transfer delays
Similar Technologies
Edge Deployment, Request Coalescing, Compression, Connection Pooling, CDN

Performance Profiling Tools

Tool | Purpose | Metrics | Best For
nvidia-smi | GPU monitoring | Utilization, memory, power, temperature | Quick checks, real-time monitoring
NVIDIA Nsight Systems | System-wide profiling | CPU/GPU timeline, API calls, memory transfers | Identifying bottlenecks, end-to-end analysis
NVIDIA Nsight Compute | Kernel profiling | SM efficiency, memory throughput, occupancy | Kernel optimization, low-level tuning
PyTorch Profiler | PyTorch-specific profiling | Op-level timing, memory, CPU/GPU breakdown | PyTorch model optimization
TensorBoard Profiler | TensorFlow profiling | Op timing, memory, trace viewer | TensorFlow model optimization
cProfile / line_profiler | Python profiling | Function call times, line-by-line timing | CPU-bound Python code
perf / VTune | CPU profiling | Cache misses, branch mispredictions, CPU cycles | Low-level CPU optimization
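Of the tools above, cProfile ships with Python and needs no setup; a minimal session against a deliberately slow function looks like this:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Deliberately unvectorized work to give the profiler something to find."""
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
slow_sum(200_000)
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
report = out.getvalue()   # top functions by cumulative time; slow_sum dominates
```

For GPU workloads the same workflow applies with Nsight Systems: capture a trace, sort by time, and read off the critical path.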

Performance Optimization Workflow

Measure Baseline

Establish current performance characteristics by profiling representative workloads under production-like conditions. Identify which metrics matter most for your use case: latency for real-time applications, throughput for batch processing, or cost efficiency for large-scale deployments. Document baseline measurements so you can quantify the improvement from each optimization.

Similar Technologies
Profiling, Benchmarking, Performance Monitoring, Metrics Collection
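A baseline harness needs little more than warmup runs plus repeated timing; this stdlib sketch uses wall-clock time and reports the median to damp outliers:

```python
import statistics
import time

def benchmark(fn, warmup=3, repeats=20):
    """Time fn() after warmup runs; return the median and worst observed latency."""
    for _ in range(warmup):
        fn()                                   # populate caches, trigger JIT, etc.
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {"median_s": statistics.median(samples), "max_s": max(samples)}

# Illustrative workload standing in for a model inference call
baseline = benchmark(lambda: sorted(range(10_000), reverse=True))
```

Recording both the median and the maximum mirrors the P50/tail distinction above: an optimization that improves the median but worsens the tail may still miss its SLA.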
Find Bottleneck

Use profiling tools to systematically identify the slowest component limiting overall performance. Analyze GPU utilization, memory bandwidth, CPU usage, and I/O patterns to pinpoint whether you're compute-bound, memory-bound, or limited by data loading. Focus optimization efforts on the critical path that dominates execution time.

Similar Technologies
Profiling Tools, Performance Analysis, Bottleneck Detection, Critical Path
Apply Fix

Implement a targeted optimization addressing the identified bottleneck. Choose techniques appropriate to the limiting factor: quantization for compute-bound workloads, Flash Attention for memory-bound Transformers, or data pipeline optimization for I/O-bound systems. Avoid premature optimization of non-critical components that won't improve end-to-end performance.

Similar Technologies
Optimization Techniques, Performance Tuning, Targeted Fixes, Iterative Improvement
Verify & Iterate

Measure performance after optimization to quantify actual improvement against baseline metrics. Verify the fix didn't introduce regressions in other areas like accuracy or resource usage. If performance goals aren't met, profile again to identify the next bottleneck and repeat the cycle. Performance optimization is an iterative process of measurement, analysis, and targeted improvements.

Similar Technologies
Validation, Regression Testing, Iterative Optimization, Performance Tracking

Performance Engineering Best Practices

Measurement

Always measure performance before optimizing to avoid wasting effort on non-critical paths. Focus on metrics that directly impact user experience or business objectives. Profile under production-like conditions including realistic data, concurrency, and infrastructure. Track performance over time to catch regressions early. Establish performance budgets that align with SLAs and user expectations to guide optimization priorities.

Key Features
  • Always measure before optimizing
  • Focus on user-impacting metrics
  • Profile in production-like conditions
  • Track performance trends over time
  • Set performance budgets aligned with SLAs
Similar Technologies
Profiling, Monitoring, Performance Testing, SLA Management
Optimization Priority

Optimize the critical path that dominates execution time first for maximum impact. Apply the 80/20 rule: fixing the biggest bottleneck typically yields more benefit than dozens of micro-optimizations. Prioritize algorithmic improvements over low-level tuning when possible. Balance the competing objectives of latency, throughput, and cost based on use-case requirements. Weigh the maintenance cost of complex optimizations against their performance benefits.

Key Features
  • Optimize critical path with highest impact first
  • 80/20 rule: Fix biggest bottleneck
  • Algorithmic improvements over micro-optimizations
  • Balance latency vs throughput vs cost
  • Consider maintainability of optimizations
Similar Technologies
Critical Path, Pareto Principle, Performance Trade-offs, Optimization ROI
Trade-offs

Performance optimization involves navigating fundamental trade-offs between competing objectives. Larger batch sizes improve throughput but increase per-request latency. Quantization boosts speed but may reduce accuracy. Activation recomputation (gradient checkpointing) saves memory at the cost of additional compute. Hardware choices balance cost against performance needs. Complex optimizations may deliver gains but hurt code maintainability. Understanding and deliberately managing these trade-offs is essential for effective performance engineering.

Key Features
  • Latency vs Throughput: Batch size trade-off
  • Accuracy vs Speed: Quantization impact
  • Memory vs Compute: Recomputation strategies
  • Cost vs Performance: Hardware selection
  • Complexity vs Maintainability: Code sustainability
Similar Technologies
Performance Trade-offs, Engineering Trade-offs, System Design, Optimization Balance