Performance Engineering
Key Performance Metrics
Latency metrics capture response time characteristics across the distribution of requests, from typical user experience to worst-case scenarios. P50 (median) represents the typical user experience, while P95 and P99 capture tail latencies affecting the slowest 5% and 1% of requests respectively. For LLM applications, TTFT (Time to First Token) measures how quickly streaming responses begin, while ITL (Inter-Token Latency) tracks the speed of token generation during streaming.
- P50 (Median): Typical user experience baseline
- P95: 95% of requests complete faster than this threshold
- P99: Tail latency capturing worst 1% of requests
- P99.9: Extreme outliers for SLA monitoring
- TTFT (Time to First Token): LLM streaming start latency
- ITL (Inter-Token Latency): Token generation speed
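These metrics can all be computed from raw timing samples. A minimal sketch in plain Python, using nearest-rank percentiles; the latency values and token timestamps are illustrative, not real measurements:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample >= p% of all samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Per-request latencies in milliseconds (illustrative values).
latencies_ms = [12, 15, 18, 22, 30, 45, 60, 85, 120, 400]
p50 = percentile(latencies_ms, 50)   # typical experience
p95 = percentile(latencies_ms, 95)   # tail
p99 = percentile(latencies_ms, 99)   # worst 1%

# Streaming metrics from token arrival timestamps (seconds since request).
token_times = [0.35, 0.40, 0.46, 0.51, 0.57]
ttft = token_times[0]  # time to first token
itl = (token_times[-1] - token_times[0]) / (len(token_times) - 1)  # mean inter-token latency
```

Note how one slow outlier (400 ms) leaves P50 untouched but dominates P95/P99, which is why averages alone hide tail behavior.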
Throughput metrics measure system capacity and resource utilization, quantifying how many requests or operations can be processed per unit time. QPS (Queries Per Second) tracks request rate, while Tokens/Second measures LLM generation throughput. GPU utilization and FLOPS efficiency reveal how effectively hardware resources are being leveraged. Memory bandwidth utilization indicates whether data transfer is a bottleneck. High throughput with efficient resource usage signals well-optimized systems.
- QPS (Queries Per Second): Request processing rate
- Tokens/Second: LLM token generation throughput
- Batch Size: Concurrent request processing capacity
- GPU Utilization: Percentage of GPU compute capacity used
- Memory Bandwidth: GB/s data transfer rate
- FLOPS Utilization: Percentage of theoretical maximum performance
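The core throughput numbers are simple ratios over a measurement window. A sketch with made-up example figures:

```python
def throughput_stats(num_requests, total_tokens, wall_seconds):
    """Aggregate throughput over a measurement window."""
    return {
        "qps": num_requests / wall_seconds,
        "tokens_per_sec": total_tokens / wall_seconds,
        "tokens_per_request": total_tokens / num_requests,
    }

# Illustrative window: 120 requests generating 24,000 tokens in 60 s.
stats = throughput_stats(num_requests=120, total_tokens=24_000, wall_seconds=60)
```

Measuring over a whole window (rather than instantaneously) smooths out batching bursts and gives a fair basis for comparing configurations.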
Efficiency metrics tie performance to cost and resource consumption, enabling economic optimization decisions. Cost per request and cost per token quantify the monetary expense of serving workloads, critical for production economics. Model FLOPs Utilization (MFU) measures training efficiency as a percentage of theoretical hardware maximum. Power efficiency (inferences per Watt) becomes crucial for sustainability and operational cost reduction. These metrics guide right-sizing and optimization priorities.
- Cost per Request: Dollar cost per inference operation
- Cost per Token: Dollar cost per 1K tokens for LLMs
- Model FLOPs Utilization (MFU): Training efficiency percentage
- Power Efficiency: Inferences per Watt consumed
- Resource Utilization: CPU, memory, GPU usage percentages
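MFU and cost per token reduce to short formulas. The sketch below uses the common ~6 FLOPs per parameter per token estimate (forward plus backward) for dense Transformer training; the model size, throughput, peak FLOPS, and GPU price are illustrative assumptions:

```python
def mfu(params, tokens_per_sec, peak_flops):
    """Model FLOPs Utilization: achieved training FLOPs / hardware peak,
    using the rough ~6 FLOPs per parameter per token estimate."""
    return (6 * params * tokens_per_sec) / peak_flops

def cost_per_1k_tokens(gpu_dollars_per_hour, tokens_per_sec):
    """Serving cost per 1K generated tokens on a single GPU."""
    return gpu_dollars_per_hour / (tokens_per_sec * 3600) * 1000

# Illustrative: 7B-parameter model, 2,000 tokens/s,
# 312 TFLOPS peak (BF16), $2 per GPU-hour.
training_mfu = mfu(params=7e9, tokens_per_sec=2_000, peak_flops=312e12)
serving_cost = cost_per_1k_tokens(gpu_dollars_per_hour=2.0, tokens_per_sec=2_000)
```

With these numbers MFU lands around 27%, which is in the range often reported for large-scale training; well below 100% because of memory traffic, communication, and non-matmul work.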
Scaling metrics quantify how effectively systems leverage additional resources. Scaling efficiency measures performance gain per added GPU, revealing communication overhead and parallelism effectiveness. Weak scaling (fixed work per GPU) and strong scaling (fixed total work) provide complementary views of scalability. Critical batch size identifies the point where adding more parallelism yields diminishing returns. Communication overhead percentage reveals time spent on synchronization versus computation.
- Scaling Efficiency: Performance gain per added GPU
- Communication Overhead: Percentage of time in GPU synchronization
- Weak Scaling: Fixed work per GPU; runtime should stay flat as GPUs are added
- Strong Scaling: Fixed total work; measure speedup as GPUs are added
- Critical Batch Size: Batch size beyond which additional parallelism yields diminishing returns
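Strong and weak scaling efficiency follow directly from the definitions above. A minimal sketch with illustrative timings:

```python
def strong_scaling_efficiency(t_1gpu, t_ngpu, n):
    """Fixed total work: achieved speedup relative to ideal n-fold speedup."""
    return (t_1gpu / t_ngpu) / n

def weak_scaling_efficiency(t_1gpu, t_ngpu):
    """Fixed work per GPU: ideally runtime stays flat as GPUs are added."""
    return t_1gpu / t_ngpu

# Illustrative: a job takes 100 s on 1 GPU and 15 s on 8 GPUs.
strong = strong_scaling_efficiency(100, 15, 8)   # ~0.83: ~17% lost to overhead
# Same per-GPU workload on 8 GPUs takes 110 s instead of 100 s.
weak = weak_scaling_efficiency(100, 110)         # ~0.91
```

The gap between measured efficiency and 1.0 is a direct estimate of communication and synchronization overhead.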
Latency Targets by Use Case
| Use Case | Target P99 Latency | Rationale | Optimization Focus |
|---|---|---|---|
| Real-time Chatbots | < 200ms | User expects instant responses | Model size, caching, quantization |
| Search/Recommendations | < 100ms | Part of larger page load | Batch inference, pre-computation |
| Content Generation | < 2s | User can wait briefly | Throughput over latency |
| Fraud Detection | < 50ms | Real-time transaction approval | Model simplification, edge deployment |
| Autonomous Vehicles | < 20ms | Safety-critical decisions | Edge inference, specialized hardware |
| Batch Processing | Minutes to hours | Offline workload | Maximize throughput, cost efficiency |
Common Performance Bottlenecks
Compute-bound workloads saturate GPU processing capacity, with utilization near 100% while operations wait in queue. The model is too complex for the available hardware, demanding more floating-point operations per second than the GPU can deliver. Solutions focus on reducing computational requirements through quantization (INT8/INT4), pruning unnecessary parameters, knowledge distillation to smaller models, or upgrading to more powerful GPUs. Operator fusion can also reduce overhead by combining multiple operations into a single kernel launch.
- Symptom: GPU utilization consistently near 100%
- Cause: Model complexity exceeds hardware capability
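To make the quantization idea concrete, here is symmetric INT8 quantization in plain Python; production systems use library kernels (e.g. from PyTorch or TensorRT), and the weight values below are made up:

```python
def quantize_int8(weights):
    """Symmetric INT8 quantization: map floats to [-127, 127] with one scale."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.8, -1.27, 0.03, 0.5]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored weight is within half a quantization step of the original.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight shrinks from 4 bytes (FP32) to 1 byte, cutting both compute and memory traffic per operation, at the cost of the small rounding error measured above.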
Memory-bound bottlenecks occur when GPU compute units sit idle waiting for data from memory, characterized by low GPU utilization despite high memory bandwidth usage. Inefficient memory access patterns fail to leverage GPU caching and bandwidth capabilities. Solutions include increasing batch size to amortize memory overhead, implementing Flash Attention for Transformers to reduce memory accesses, applying model parallelism to distribute memory load, and optimizing KV cache usage in autoregressive generation.
- Symptom: Low GPU utilization with high memory bandwidth
- Cause: Inefficient memory access patterns
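Whether a kernel is compute-bound or memory-bound can be estimated with a roofline-style check: compare its arithmetic intensity (FLOPs per byte moved) to the hardware's ridge point. The peak figures below are illustrative (roughly A100-class):

```python
def roofline_bound(kernel_flops, bytes_moved, peak_flops, peak_bandwidth):
    """Classify a kernel by comparing its arithmetic intensity (FLOPs/byte)
    to the hardware ridge point (peak FLOPs / peak memory bandwidth)."""
    intensity = kernel_flops / bytes_moved
    ridge = peak_flops / peak_bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bound"

# Illustrative accelerator: 312 TFLOPS peak, 2 TB/s HBM -> ridge = 156 FLOPs/byte.
# Batch-1 decoding reads every FP16 weight (2 bytes) for ~2 FLOPs: intensity ~1.
decode = roofline_bound(kernel_flops=2e9, bytes_moved=2e9,
                        peak_flops=312e12, peak_bandwidth=2e12)
# A large matrix multiply reuses each loaded value many times: intensity ~300.
gemm = roofline_bound(kernel_flops=6e11, bytes_moved=2e9,
                      peak_flops=312e12, peak_bandwidth=2e12)
```

This is why batching helps autoregressive decoding: more FLOPs are performed per byte of weights loaded, pushing intensity toward the ridge point.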
I/O-bound systems waste GPU cycles waiting for data loading and preprocessing, leaving expensive hardware idle. Slow storage, inefficient data pipelines, or poor prefetching cause the GPU to stall. Solutions involve optimizing data pipelines with parallel loading, implementing prefetching to load next batches during GPU processing, upgrading to faster NVMe storage, and deploying multiple parallel data loaders. Caching frequently accessed data can also dramatically reduce load times.
- Symptom: GPU sits idle waiting for input data
- Cause: Slow data loading or preprocessing pipeline
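Prefetching can be sketched with the standard library: a background thread keeps a small queue of upcoming batches full while the consumer works. This mirrors what framework data loaders do under the hood; the loader name is mine:

```python
import queue
import threading

def prefetching_loader(batches, depth=2):
    """Load up to `depth` batches ahead on a background thread so the
    consumer (e.g. the GPU step) never waits on I/O for the next batch."""
    q = queue.Queue(maxsize=depth)
    done = object()

    def producer():
        for batch in batches:
            q.put(batch)  # blocks once `depth` batches are already queued
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while (batch := q.get()) is not done:
        yield batch

# Usage: iterate exactly as with the raw loader.
loaded = list(prefetching_loader(iter(range(5))))
```

The bounded queue is the key design choice: it overlaps loading with compute without letting the producer run unboundedly ahead and exhaust memory.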
Communication-bound workloads suffer from excessive GPU synchronization overhead in multi-GPU setups, leading to poor scaling efficiency. Time spent exchanging gradients, activations, or model parameters dominates useful computation. Solutions include gradient accumulation to reduce synchronization frequency, upgrading interconnects to NVLink for faster GPU-to-GPU communication, implementing pipeline parallelism to overlap communication with computation, and reducing communication volume through compression or selective parameter updates.
- Symptom: Poor multi-GPU scaling efficiency
- Cause: Excessive GPU synchronization overhead
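Gradient accumulation reduces synchronization frequency by averaging several micro-batch gradients before each all-reduce. A numeric sketch with scalar stand-ins for gradient tensors:

```python
def accumulate_gradients(microbatch_grads, accum_steps):
    """Average gradients over `accum_steps` micro-batches, syncing
    (all-reduce + optimizer step) once per group instead of every step."""
    synced = []
    running = 0.0
    for i, grad in enumerate(microbatch_grads, start=1):
        running += grad
        if i % accum_steps == 0:
            synced.append(running / accum_steps)  # one sync point here
            running = 0.0
    return synced

# Four micro-batches, syncing every 2: half as many all-reduces.
updates = accumulate_gradients([1.0, 2.0, 3.0, 4.0], accum_steps=2)
```

The update values match what larger batches would produce, but communication cost drops by the accumulation factor.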
CPU-bound bottlenecks emerge when preprocessing or postprocessing on CPU cannot keep pace with GPU inference, leaving GPUs underutilized despite high CPU usage. Tokenization, data augmentation, or result processing becomes the limiting factor. Solutions involve moving preprocessing operations to GPU where possible, optimizing tokenization with compiled libraries, using vectorized transforms, or increasing CPU core count. Offloading compatible operations to GPU eliminates CPU-GPU data transfer overhead.
- Symptom: High CPU usage with GPU underutilized
- Cause: Preprocessing or postprocessing bottleneck
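One remedy is to overlap CPU preprocessing of the next batch with inference on the current one, so neither side idles. A minimal sketch; `preprocess` and `infer` are toy stand-ins for real tokenization and a GPU forward pass:

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess(batch):
    """Stand-in for CPU-side work (tokenization, resizing, ...)."""
    return [x * 2 for x in batch]

def infer(batch):
    """Stand-in for the GPU forward pass."""
    return sum(batch)

def pipelined_inference(batches):
    """Preprocess batch i+1 on a worker thread while batch i runs
    inference, overlapping CPU and GPU work."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(preprocess, batches[0])
        for nxt in batches[1:] + [None]:
            ready = future.result()
            if nxt is not None:
                future = pool.submit(preprocess, nxt)  # start next batch early
            results.append(infer(ready))
    return results

outputs = pipelined_inference([[1, 2], [3, 4]])
```

With real workloads the thread pool would hold multiple workers sized to match preprocessing cost against inference time.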
Network-bound systems experience latency from remote API calls, data transfers, or geographic distance between users and servers. Network round-trip times dominate request latency, particularly noticeable in distributed architectures. Solutions include deploying inference closer to users via CDN or edge computing, coalescing multiple small requests into batches, applying compression to reduce payload size, and using connection pooling to eliminate TCP handshake overhead for repeated requests.
- Symptom: High network latency in request path
- Cause: Remote API calls or data transfer delays
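Request coalescing can be illustrated in a few lines: individual requests are grouped so one network round trip serves many callers. A sketch of the batching logic only (real implementations also flush on a timeout so a partial batch never waits indefinitely):

```python
def coalesce(requests, max_batch=4):
    """Group individual requests so one round trip serves many callers,
    amortizing per-request network overhead."""
    batch = []
    for request in requests:
        batch.append(request)
        if len(batch) == max_batch:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# 10 requests collapse into 3 round trips instead of 10.
batches = list(coalesce(range(10), max_batch=4))
```

The trade-off is the classic latency-versus-throughput one: larger batches mean fewer round trips but a longer wait for the first request in each batch.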
Performance Profiling Tools
| Tool | Purpose | Metrics | Best For |
|---|---|---|---|
| nvidia-smi | GPU monitoring | Utilization, memory, power, temperature | Quick checks, real-time monitoring |
| NVIDIA Nsight Systems | System-wide profiling | CPU/GPU timeline, API calls, memory transfers | Identifying bottlenecks, end-to-end analysis |
| NVIDIA Nsight Compute | Kernel profiling | SM efficiency, memory throughput, occupancy | Kernel optimization, low-level tuning |
| PyTorch Profiler | PyTorch-specific profiling | Op-level timing, memory, CPU/GPU breakdown | PyTorch model optimization |
| TensorBoard Profiler | TensorFlow profiling | Op timing, memory, trace viewer | TensorFlow model optimization |
| cProfile / line_profiler | Python profiling | Function call times, line-by-line | CPU-bound Python code |
| perf / VTune | CPU profiling | Cache misses, branch mispredicts, CPU cycles | Low-level CPU optimization |
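As a quick example of the last row, cProfile and pstats from the standard library can rank functions by cumulative time; the workload here is a deliberately naive loop:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Deliberately naive hot loop to profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100_000)
profiler.disable()

# Report the top 5 functions by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
```

For line-level granularity within a hot function, line_profiler (`@profile` plus `kernprof`) picks up where cProfile's per-function view leaves off.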
Performance Optimization Workflow
Establish current performance characteristics by profiling representative workloads under production-like conditions. Identify which metrics matter most for your use case - latency for real-time applications, throughput for batch processing, or cost efficiency for large-scale deployments. Document baseline measurements to quantify improvement from optimizations.
Use profiling tools to systematically identify the slowest component limiting overall performance. Analyze GPU utilization, memory bandwidth, CPU usage, and I/O patterns to pinpoint whether you're compute-bound, memory-bound, or limited by data loading. Focus optimization efforts on the critical path that dominates execution time.
Implement targeted optimization addressing the identified bottleneck. Choose techniques appropriate to the limiting factor - quantization for compute-bound workloads, Flash Attention for memory-bound Transformers, or data pipeline optimization for I/O-bound systems. Avoid premature optimization of non-critical components that won't improve end-to-end performance.
Measure performance after optimization to quantify actual improvement against baseline metrics. Verify the fix didn't introduce regressions in other areas like accuracy or resource usage. If performance goals aren't met, profile again to identify the next bottleneck and repeat the cycle. Performance optimization is an iterative process of measurement, analysis, and targeted improvements.
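The measure-optimize-remeasure loop needs a consistent harness so before/after numbers are comparable. A minimal sketch: warmup iterations discard cold-start effects (JIT, caches), and the median resists outliers better than the mean:

```python
import statistics
import time

def benchmark_ms(fn, warmup=3, iters=20):
    """Median wall-clock latency of fn() in milliseconds, after warmup.
    Run before and after each optimization to quantify the change."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)

# Record a baseline, apply one optimization, then measure again.
baseline = benchmark_ms(lambda: sum(range(10_000)))
```

Keeping the harness fixed (same inputs, same iteration counts, same machine state) is what makes the resulting deltas trustworthy.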
Performance Engineering Best Practices
Always measure performance before optimizing to avoid wasting effort on non-critical paths. Focus on metrics that directly impact user experience or business objectives. Profile under production-like conditions including realistic data, concurrency, and infrastructure. Track performance over time to catch regressions early. Establish performance budgets that align with SLAs and user expectations to guide optimization priorities.
- Always measure before optimizing
- Focus on user-impacting metrics
- Profile in production-like conditions
- Track performance trends over time
- Set performance budgets aligned with SLAs
Optimize the critical path that dominates execution time first for maximum impact. Apply the 80/20 rule - fixing the biggest bottleneck typically yields more benefit than dozens of micro-optimizations. Prioritize algorithmic improvements over low-level tuning when possible. Balance competing objectives of latency, throughput, and cost based on use case requirements. Consider maintenance cost of complex optimizations against performance benefits.
- Optimize critical path with highest impact first
- 80/20 rule: Fix biggest bottleneck
- Algorithmic improvements over micro-optimizations
- Balance latency vs throughput vs cost
- Consider maintainability of optimizations
Performance optimization involves navigating fundamental trade-offs between competing objectives. Larger batch sizes improve throughput but increase latency. Quantization boosts speed but may reduce accuracy. Activation recomputation (gradient checkpointing) saves memory at the cost of additional compute. Hardware choices balance cost against performance needs. Complex optimizations may deliver gains but hurt code maintainability. Understanding and deliberately managing these trade-offs is essential for effective performance engineering.
- Latency vs Throughput: Batch size trade-off
- Accuracy vs Speed: Quantization impact
- Memory vs Compute: Recomputation strategies
- Cost vs Performance: Hardware selection
- Complexity vs Maintainability: Code sustainability
