GPU Inference
TensorRT Optimization Pipeline
Model import: Import trained models from ONNX, PyTorch (via torch.onnx.export or torch-tensorrt), or TensorFlow. The parser converts the computation graph into TensorRT's internal representation, validating layer support and identifying optimization opportunities.
Layer fusion: TensorRT automatically fuses compatible adjacent layers to reduce memory bandwidth and kernel-launch overhead. Common fusions include Conv+BatchNorm+ReLU and Conv+Add+ReLU, applied as both vertical patterns (sequential layers) and horizontal patterns (parallel layers sharing an input).
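The Conv+BatchNorm part of that fusion can be sketched in NumPy: the batch-norm scale and shift are folded into the convolution's weights and bias, so the fused layer produces identical output in a single pass. A minimal sketch (tensor names and layouts are illustrative):

```python
import numpy as np

def fold_bn_into_conv(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm (scale gamma, shift beta, running mean/var) into a
    conv with weights W of shape (out_c, in_c, kh, kw) and bias b (out_c,)."""
    s = gamma / np.sqrt(var + eps)          # per-output-channel BN scale
    W_folded = W * s[:, None, None, None]   # scale each output filter
    b_folded = (b - mean) * s + beta        # fold mean/shift into the bias
    return W_folded, b_folded
```

Equivalence is easy to check on a 1x1 convolution, which reduces to a matrix multiply: conv-then-BN and the folded conv give the same result to floating-point precision.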
Kernel auto-tuning: TensorRT benchmarks multiple kernel implementations for each layer during the build, selecting the fastest for the target GPU. It considers tile sizes, memory access patterns, and tensor-core utilization.
Engine serialization: Serialize the optimized engine to a file for fast loading without a rebuild. The engine is GPU-architecture specific, so a different GPU requires a rebuild, and engines built with one TensorRT version are generally not loadable by another.
Precision Calibration
| Precision | Memory | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| FP32 | 1x (baseline) | 1x (baseline) | Full precision | Accuracy-critical, reference baseline |
| TF32 | 1x | ~2x on Ampere+ | Near-FP32 | Default on Ampere GPUs, transparent |
| FP16 | 0.5x | 2-4x | Minimal loss | Production default, good accuracy/speed |
| INT8 | 0.25x | 4-8x | Requires calibration | Maximum throughput, CNNs, detection |
| FP8 | 0.25x | 4-8x on Hopper+ | Better than INT8 | Hopper/Ada GPUs, transformers |
- Collect representative calibration dataset (100-1000 samples)
- Run inference to collect activation distributions
- Compute optimal scale factors per tensor
- Entropy calibration (default) or MinMax calibration
- Cache calibration data for reproducibility
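MinMax calibration can be sketched in NumPy: the scale maps the largest observed |activation| onto the int8 range, and quantize/dequantize round-trips through that scale. This is a symmetric per-tensor sketch; entropy calibration instead searches for the clipping threshold that minimizes KL divergence between the original and quantized distributions.

```python
import numpy as np

def minmax_scale(activations, num_bits=8):
    """Symmetric per-tensor scale: largest observed |x| maps to qmax."""
    qmax = 2 ** (num_bits - 1) - 1        # 127 for int8
    return float(np.abs(activations).max()) / qmax

def quantize(x, scale):
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

With MinMax calibration the round-trip error is bounded by scale/2 per element, which is why a few extreme outliers (which inflate the scale) can hurt accuracy across the whole tensor.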
- Keep sensitive layers in FP16 (first/last layers)
- Use INT8 for compute-heavy convolutions
- Profile accuracy loss per layer
- Layer-wise precision selection via API
- Fallback to higher precision if needed
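Per-layer sensitivity can be estimated offline by measuring the round-trip error INT8 would introduce in each layer's weights; layers with the worst error (often those containing outlier values) are candidates to keep in FP16. A sketch with hypothetical layer names and an illustrative threshold:

```python
import numpy as np

def int8_round_trip_error(w):
    """Mean absolute error from symmetric per-tensor INT8 quantization."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127)
    return float(np.abs(q * scale - w).mean())

def pick_fp16_layers(layers, threshold):
    """Flag layers whose weight quantization error exceeds the threshold."""
    return [name for name, w in layers.items()
            if int8_round_trip_error(w) > threshold]
```

A tensor with one large outlier gets a much coarser scale, so its error dwarfs that of a well-behaved tensor of the same shape; those are the layers to pin to FP16 via the layer-precision API.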
CUDA Memory Management
Global (device) memory: Standard GPU memory allocated with cudaMalloc. High bandwidth but high latency. Used for model weights, activations, and I/O buffers. Must be explicitly managed: allocate once, reuse.
Pinned (page-locked) host memory: Host memory locked in physical RAM, enabling DMA transfers. Gives 2-3x faster host-device copies and is required for async transfers with streams. Higher allocation cost, so allocate at startup.
Unified (managed) memory: A single address space accessible from CPU and GPU, with automatic page migration on access. Simplifies programming but can cause performance issues with frequent migrations. Good for prototyping.
Shared memory: Fast on-chip memory shared within a thread block; a user-managed cache for frequently accessed data. Critical for custom kernel performance. Limited size (48-164KB per SM, architecture dependent).
Stream Concurrency & Async Operations
CUDA streams enable concurrent kernel execution and overlapped data transfers. Proper stream management is critical for maximizing GPU utilization in inference pipelines.
Pipelined inference: Overlap H2D copy, compute, and D2H copy across multiple streams.
- Stream 1: Copy batch N to GPU
- Stream 2: Inference on batch N-1
- Stream 3: Copy batch N-2 results to host
- Use cudaStreamSynchronize selectively
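The three-stage overlap above can be visualized with a host-side schedule simulation (plain Python, no CUDA calls): at time step t, stage s works on batch t - s, so after a two-step ramp-up all three streams stay busy simultaneously.

```python
def pipeline_schedule(num_batches, stages=("H2D", "compute", "D2H")):
    """Per time step, which batch each stage processes (triple buffering).
    Real code would issue each stage's work on its own CUDA stream."""
    steps = []
    for t in range(num_batches + len(stages) - 1):
        busy = {name: t - s for s, name in enumerate(stages)
                if 0 <= t - s < num_batches}
        steps.append(busy)
    return steps
```

With 3 batches the steady-state step has every stage occupied: {'H2D': 2, 'compute': 1, 'D2H': 0}.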
Concurrent model execution: Run multiple models concurrently on the same GPU for better utilization.
- Separate stream per model/context
- Useful for small models with low occupancy
- MPS (Multi-Process Service) for multi-process
- Monitor SM utilization for benefits
CUDA graphs: Capture and replay entire workflows to minimize CPU launch overhead.
- Record stream operations into graph
- Single launch for entire pipeline
- 10-20% latency reduction typical
- Static shapes required (or graph per shape)
Custom CUDA Kernels
Custom CUDA kernels are justified when standard libraries don't support your operation, when fusing multiple operations provides significant gains, or when domain-specific optimizations are possible.
Maximize throughput through coalesced memory access, shared memory usage, occupancy tuning, and minimizing warp divergence. Profile before optimizing - focus on bottlenecks.
Memory bandwidth is often the bottleneck. Ensure coalesced access (adjacent threads access adjacent memory), use shared memory for reused data, and minimize global memory transactions.
Choose block dimensions for maximum occupancy while considering shared memory and register usage. Typical: 128-256 threads per block. Profile different configurations.
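The occupancy trade-off can be estimated with simple arithmetic: resident blocks per SM are capped separately by threads, registers, and shared memory, and the tightest limit wins. A sketch using Ampere-like per-SM limits (the default values below are assumptions; real code should query the device or use the CUDA occupancy API):

```python
def blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block,
                  max_threads=2048, max_blocks=32,
                  regs_per_sm=65536, smem_per_sm=102400):
    """Estimate resident blocks per SM: the tightest resource limit wins."""
    by_threads = max_threads // threads_per_block
    by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    by_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks
    return min(max_blocks, by_threads, by_regs, by_smem)
```

Occupancy is then blocks * threads_per_block / max_threads; for example, bumping shared memory from 0 to 48KB per 256-thread block can drop a kernel from 8 resident blocks to 2, which is exactly the kind of trade-off to profile.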
TensorRT Plugin Development
| Plugin Type | Use Case | Implementation |
|---|---|---|
| IPluginV2DynamicExt | Dynamic input shapes, variable batch size | Full shape inference, workspace calculation |
| IPluginV2IOExt | Fixed shapes with format flexibility | Simpler than Dynamic, broadcast support |
| IPluginCreator | Plugin registration and factory | Deserialize plugin from engine file |
- IPluginV2DynamicExt for dynamic shapes
- getOutputDimensions() - compute output shapes
- enqueue() - execute kernel on stream
- configurePlugin() - set input/output formats
- serialize()/deserialize() for engine caching
- BatchedNMS - Non-maximum suppression
- GridAnchor - Anchor generation
- InstanceNormalization - Instance norm
- ResizeNearest - Upsampling
- ScatterND - Scatter operations
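As a reference for what the BatchedNMS plugin computes, greedy non-maximum suppression can be sketched in NumPy (single class, corner-format boxes; the plugin runs this per batch item and per class on the GPU):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS over boxes given as (x1, y1, x2, y2) rows."""
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        if rest.size == 0:
            break
        # intersection of the kept box with all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]     # drop heavily overlapping boxes
    return keep
```

The data-dependent loop and per-iteration compaction are why NMS does not map onto standard TensorRT layers and ships as a plugin instead.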
Profiling & Performance Analysis
Nsight Systems: System-wide profiler showing the CPU/GPU timeline, CUDA API calls, kernel launches, memory transfers, and stream activity. Essential for identifying pipeline bottlenecks and concurrency issues.
Nsight Compute: Detailed kernel profiler with hardware-counter analysis. Shows occupancy, memory throughput, compute utilization, and roofline analysis. Use for kernel-level optimization.
TensorRT profiler: Built-in profiler reporting per-layer timing during inference; identifies slow layers to focus optimization on. Enable via IExecutionContext::setProfiler().
Dynamic Shapes & Batching
- Define min/opt/max dimensions per input
- Multiple optimization profiles for different ranges
- Runtime shape selection without rebuild
- Trade-off: broader ranges = less optimization
- Batch dimension typically most variable
- Pad to fixed batch sizes (1, 2, 4, 8, 16...)
- Request batching with timeout
- Continuous batching for LLMs
- Profile optimal batch sizes per GPU
- Memory vs latency trade-offs
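Padding to fixed batch sizes keeps the number of distinct input shapes, and hence the number of engine optimization profiles, small. The bucketing step can be sketched as follows (the bucket set is illustrative):

```python
def pad_to_bucket(n, buckets=(1, 2, 4, 8, 16)):
    """Smallest fixed batch size that fits n queued requests; a count
    beyond the largest bucket would be split across multiple batches."""
    for b in buckets:
        if n <= b:
            return b
    return buckets[-1]
```

Padding slots are filled with dummy inputs and their outputs discarded, trading a little wasted compute for one engine profile per bucket instead of one per observed batch size.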
Optimization Best Practices
- Use pinned memory for host-device transfers
- Pre-allocate buffers, avoid runtime allocation
- Enable workspace memory sharing
- Use memory pools (cudaMallocAsync)
- Profile memory with nvidia-smi / Nsight
- Overlap compute and data transfer
- Use multiple streams for concurrency
- Batch requests for GPU efficiency
- Profile kernel occupancy
- Minimize host-device synchronization
- Use CUDA graphs for fixed workloads
- Minimize kernel launch overhead
- Keep data on GPU between operations
- Use persistent kernels where applicable
- Profile end-to-end latency breakdown
