GPU Inference

TensorRT Optimization Pipeline

1. Model Import: ONNX / PyTorch / TF
2. Layer Fusion: Conv+BN+ReLU
3. Precision Calibration: FP32 → FP16/INT8
4. Kernel Tuning: auto-select the best kernels
5. Engine Serialization: .engine / .plan
Model Import & Parsing

Import trained models from ONNX, PyTorch (via torch.onnx.export or torch-tensorrt), or TensorFlow. The parser converts the computation graph into TensorRT's internal representation, validating layer support and identifying optimization opportunities.

Similar Technologies
  • ONNX Parser (recommended)
  • PyTorch via TorchScript/ONNX
  • TF-TRT integration
  • Custom parser for proprietary formats
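
The import path can be sketched with TensorRT's Python API. This is a minimal outline, assuming the `tensorrt` package (8.x-style API) and a CUDA GPU are available; the function name and workspace size are illustrative.

```python
def build_engine_from_onnx(onnx_path, workspace_gb=2):
    """Parse an ONNX model and build a serialized TensorRT engine.

    Sketch only: requires the `tensorrt` package and a CUDA GPU.
    """
    import tensorrt as trt  # deferred so the sketch imports without a GPU

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)

    # The parser validates layer support while converting the graph
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))

    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE,
                                 workspace_gb << 30)
    return builder.build_serialized_network(network, config)
```

The returned bytes can be written straight to a .plan file for later deserialization.
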
Layer Fusion

Automatically fuses compatible adjacent layers to reduce memory bandwidth and kernel launch overhead. Common fusions include Conv+BatchNorm+ReLU and Conv+Add+ReLU. Both vertical fusion (chained layers merged into one kernel) and horizontal fusion (parallel layers sharing the same input merged side by side) are applied.

Similar Technologies
  • Conv+BN+Activation fusion
  • Pointwise operation fusion
  • Shuffle layer elimination
  • Constant folding
Kernel Auto-Tuning

TensorRT benchmarks multiple kernel implementations for each layer during build, selecting the fastest for the target GPU. Considers tile sizes, memory access patterns, and tensor core utilization.

Similar Technologies
  • cuDNN algorithm selection
  • Tensor Core kernels (FP16/INT8)
  • Custom TensorRT kernels
  • Hardware-specific optimization
Engine Serialization

Serialize the optimized engine to a file for fast loading without a rebuild. The engine is GPU-architecture specific: a different GPU requires a rebuild, and engines are generally only guaranteed to load under the TensorRT version that built them.

Similar Technologies
  • .engine/.plan file format
  • GPU architecture specific
  • Version compatibility checks
  • Timing cache for faster rebuilds
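
Loading a serialized engine is the symmetric operation; a minimal sketch, again assuming `tensorrt` is installed and the engine was built for this GPU and TensorRT version:

```python
def load_engine(plan_path):
    """Deserialize a prebuilt .engine/.plan file.

    Sketch only: needs the `tensorrt` package, and the engine must match
    the GPU architecture and TensorRT version it was built with.
    """
    import tensorrt as trt  # deferred so the sketch imports without a GPU

    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)
    with open(plan_path, "rb") as f:
        return runtime.deserialize_cuda_engine(f.read())
```
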

Precision Calibration

Precision  Memory          Speed              Accuracy              Use Case
FP32       1x (baseline)   1x (baseline)      Full precision        Accuracy-critical, reference baseline
TF32       1x              ~2x on Ampere+     Near-FP32             Default on Ampere GPUs, transparent
FP16       0.5x            2-4x               Minimal loss          Production default, good accuracy/speed
INT8       0.25x           4-8x               Requires calibration  Maximum throughput, CNNs, detection
FP8        0.25x           4-8x on Hopper+    Better than INT8      Hopper/Ada GPUs, transformers
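
The memory column translates directly into weight storage. A quick back-of-envelope for a hypothetical 100M-parameter model (the parameter count and helper function are illustrative):

```python
# Bytes per parameter at each precision from the table above
BYTES_PER_PARAM = {"FP32": 4, "TF32": 4, "FP16": 2, "INT8": 1, "FP8": 1}

def weight_memory_mb(num_params, precision):
    """Approximate weight storage in MiB for a model at a given precision."""
    return num_params * BYTES_PER_PARAM[precision] / (1024 ** 2)

# A hypothetical 100M-parameter model: ~381 MiB in FP32,
# half that in FP16, a quarter in INT8.
for p in ("FP32", "FP16", "INT8"):
    print(p, round(weight_memory_mb(100_000_000, p)), "MiB")
```
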
INT8 Calibration Process

Key Features
  • Collect representative calibration dataset (100-1000 samples)
  • Run inference to collect activation distributions
  • Compute optimal scale factors per tensor
  • Entropy calibration (default) or MinMax calibration
  • Cache calibration data for reproducibility
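
The MinMax variant of the scale computation is simple enough to sketch in a few lines (entropy calibration, the default, minimizes KL divergence and is more involved). The helper names are illustrative; a symmetric INT8 scale maps the largest observed magnitude to 127:

```python
def minmax_scale(activations):
    """Symmetric INT8 scale from the observed activation range (MinMax)."""
    return max(abs(x) for x in activations) / 127.0

def quantize(x, scale):
    """Quantize a value to INT8, clamping to the representable range."""
    q = round(x / scale)
    return max(-128, min(127, q))

# A tensor whose largest magnitude is 127.0 gets scale 1.0; values
# outside the calibrated range saturate at the INT8 limits.
scale = minmax_scale([127.0, -10.0, 3.1])
```
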
Mixed Precision Strategy

Key Features
  • Keep sensitive layers in FP16 (first/last layers)
  • Use INT8 for compute-heavy convolutions
  • Profile accuracy loss per layer
  • Layer-wise precision selection via API
  • Fallback to higher precision if needed
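
A layer-wise precision plan like the one above can be expressed as a simple policy. This is a pure-Python sketch: the first/last-layer heuristic mirrors the bullet list, the layer names are invented, and the real per-layer assignment in TensorRT is done by setting `layer.precision` after enabling `trt.BuilderFlag.INT8`:

```python
def choose_precision(layer_names):
    """Keep the sensitive boundary layers in FP16, quantize the rest.

    Illustrative policy only; with TensorRT you would set
    layer.precision = trt.float16 / trt.int8 per layer.
    """
    plan = {}
    for i, name in enumerate(layer_names):
        if i == 0 or i == len(layer_names) - 1:
            plan[name] = "FP16"   # first/last layers: accuracy-sensitive
        else:
            plan[name] = "INT8"   # compute-heavy interior layers
    return plan

plan = choose_precision(["stem", "conv1", "conv2", "head"])
```
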

CUDA Memory Management

Device Memory (Global)

Standard GPU memory allocated with cudaMalloc. High bandwidth but high latency. Used for model weights, activations, and I/O buffers. Must be explicitly managed - allocate once, reuse.

Similar Technologies
  • cudaMalloc/cudaFree
  • cudaMallocAsync (pools)
  • cudaMemcpy variants
  • cudaMemset
Pinned (Page-Locked) Memory

Host memory locked in physical RAM, enabling DMA transfers. 2-3x faster host-device copies. Required for async transfers with streams. Higher allocation cost - allocate at startup.

Similar Technologies
  • cudaMallocHost
  • cudaHostAlloc
  • cudaFreeHost
  • cudaHostRegister
Unified Memory

Single address space accessible from CPU and GPU. Automatic page migration on access. Simplifies programming but can cause performance issues with frequent migrations. Good for prototyping.

Similar Technologies
  • cudaMallocManaged
  • cudaMemPrefetchAsync
  • cudaMemAdvise
  • Oversubscription support
Shared Memory

Fast on-chip memory shared within thread blocks. User-managed cache for frequently accessed data. Critical for custom kernel performance. Limited size (48-164KB per SM).

Similar Technologies
  • __shared__ keyword
  • Dynamic allocation
  • Bank conflict avoidance
  • Async copy (Ampere+)

Stream Concurrency & Async Operations

CUDA streams enable concurrent kernel execution and overlapped data transfers. Proper stream management is critical for maximizing GPU utilization in inference pipelines.

Pipeline Overlap

Overlap H2D copy, compute, and D2H copy across multiple streams.

Key Features
  • Stream 1: Copy batch N to GPU
  • Stream 2: Inference on batch N-1
  • Stream 3: Copy batch N-2 results to host
  • Use cudaStreamSynchronize selectively
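
The benefit of the three-stream pattern above follows from pipeline arithmetic: once the pipeline fills, throughput is gated by the slowest stage rather than the sum of all three. A back-of-envelope model (stage times are illustrative, in arbitrary units):

```python
def serial_time(n, h2d, compute, d2h):
    """All three stages on one stream: no overlap, costs sum every batch."""
    return n * (h2d + compute + d2h)

def pipelined_time(n, h2d, compute, d2h):
    """Three streams: fill the pipeline once, then pay only the bottleneck."""
    bottleneck = max(h2d, compute, d2h)
    return (h2d + compute + d2h) + (n - 1) * bottleneck

# 10 batches with 1 unit copy-in, 2 units compute, 1 unit copy-out:
# 40 units serial vs 22 units pipelined.
print(serial_time(10, 1, 2, 1), pipelined_time(10, 1, 2, 1))
```
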
Multi-Model Concurrency

Run multiple models concurrently on same GPU for better utilization.

Key Features
  • Separate stream per model/context
  • Useful for small models with low occupancy
  • MPS (Multi-Process Service) for multi-process
  • Monitor SM utilization for benefits
CUDA Graphs

Capture and replay entire workflows to minimize CPU overhead.

Key Features
  • Record stream operations into graph
  • Single launch for entire pipeline
  • 10-20% latency reduction typical
  • Static shapes required (or graph per shape)

Custom CUDA Kernels

When to Write Custom Kernels

Custom CUDA kernels are justified when standard libraries don't support your operation, when fusing multiple operations provides significant gains, or when domain-specific optimizations are possible.

Similar Technologies
  • Operation not in cuDNN/cuBLAS
  • Fusion opportunity (3+ ops)
  • Domain-specific patterns
  • Memory-bound operations
  • Non-standard data layouts
Kernel Optimization Techniques

Maximize throughput through coalesced memory access, shared memory usage, occupancy tuning, and minimizing warp divergence. Profile before optimizing - focus on bottlenecks.

Similar Technologies
  • Coalesced global memory access
  • Shared memory tiling
  • Register pressure management
  • Warp divergence elimination
  • Instruction-level parallelism
Memory Access Patterns

Memory bandwidth is often the bottleneck. Ensure coalesced access (adjacent threads access adjacent memory), use shared memory for reused data, and minimize global memory transactions.

Similar Technologies
  • Coalesced reads/writes
  • Shared memory bank conflicts
  • L1/L2 cache utilization
  • Memory alignment
  • Texture memory for spatial locality
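
The coalescing rule can be quantified: a 32-thread warp reading 4-byte elements at stride 1 touches a single 128-byte segment, while larger strides multiply the number of memory transactions. A small model of this (segment size and warp width per the CUDA hardware model; the helper is illustrative):

```python
def segments_touched(stride_elems, elem_bytes=4, warp=32, seg_bytes=128):
    """Count the 128-byte segments a warp touches reading base + i*stride."""
    segments = {(i * stride_elems * elem_bytes) // seg_bytes
                for i in range(warp)}
    return len(segments)

# Stride 1 is fully coalesced (1 segment); stride 2 doubles traffic;
# stride 32 makes every thread fetch its own segment.
print(segments_touched(1), segments_touched(2), segments_touched(32))
```
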
Thread Block Configuration

Choose block dimensions for maximum occupancy while considering shared memory and register usage. Typical: 128-256 threads per block. Profile different configurations.

Similar Technologies
  • Occupancy calculator
  • Shared memory per block
  • Registers per thread
  • Block size vs grid size
  • Wave quantization
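
The occupancy calculation itself is a min over per-resource limits. A simplified sketch, using illustrative, roughly Ampere-class SM limits (the real numbers vary by architecture; use the CUDA occupancy calculator for exact figures):

```python
# Illustrative SM resource limits (roughly Ampere-class, not exact)
MAX_THREADS_PER_SM = 2048
REGS_PER_SM = 65536
SMEM_PER_SM = 102400        # ~100 KiB usable shared memory per SM
MAX_BLOCKS_PER_SM = 32

def blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block=0):
    """Resident blocks per SM: min over thread, register, smem, block limits."""
    limits = [
        MAX_THREADS_PER_SM // threads_per_block,
        REGS_PER_SM // (regs_per_thread * threads_per_block),
        MAX_BLOCKS_PER_SM,
    ]
    if smem_per_block:
        limits.append(SMEM_PER_SM // smem_per_block)
    return min(limits)

def occupancy(threads_per_block, regs_per_thread, smem_per_block=0):
    """Fraction of the SM's maximum resident threads that are occupied."""
    blocks = blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block)
    return blocks * threads_per_block / MAX_THREADS_PER_SM

# 256 threads at 32 regs/thread fills the SM; doubling register
# pressure to 64 regs/thread halves occupancy.
print(occupancy(256, 32), occupancy(256, 64))
```
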

TensorRT Plugin Development

Plugin Type           Use Case                                   Implementation
IPluginV2DynamicExt   Dynamic input shapes, variable batch size  Full shape inference, workspace calculation
IPluginV2IOExt        Fixed shapes with format flexibility       Simpler than Dynamic, broadcast support
IPluginCreator        Plugin registration and factory            Deserialize plugin from engine file
Plugin Interface

Key Features
  • IPluginV2DynamicExt for dynamic shapes
  • getOutputDimensions() - compute output shapes
  • enqueue() - execute kernel on stream
  • configurePlugin() - set input/output formats
  • serialize()/deserialize() for engine caching
Built-in Plugins

Key Features
  • BatchedNMS - Non-maximum suppression
  • GridAnchor - Anchor generation
  • InstanceNormalization - Instance norm
  • ResizeNearest - Upsampling
  • ScatterND - Scatter operations

Profiling & Performance Analysis

Nsight Systems

System-wide profiler showing CPU/GPU timeline, CUDA API calls, kernel launches, memory transfers, and stream activity. Essential for identifying pipeline bottlenecks and concurrency issues.

Similar Technologies
  • Timeline visualization
  • CUDA API tracing
  • CPU/GPU correlation
  • Stream analysis
  • Memory transfer tracking
Nsight Compute

Detailed kernel profiler with hardware counter analysis. Shows occupancy, memory throughput, compute utilization, and roofline analysis. Use for kernel-level optimization.

Similar Technologies
  • Roofline analysis
  • Memory throughput metrics
  • Occupancy analysis
  • Source correlation
  • Comparison baselines
TensorRT Profiler

Built-in profiler reporting per-layer timing during inference. Identifies slow layers for optimization focus. Enable via IExecutionContext::setProfiler().

Similar Technologies
  • Per-layer timing
  • Layer-by-layer breakdown
  • Engine build profiling
  • Timing cache analysis
  • API integration

Dynamic Shapes & Batching

Dynamic Shape Profiles

Key Features
  • Define min/opt/max dimensions per input
  • Multiple optimization profiles for different ranges
  • Runtime shape selection without rebuild
  • Trade-off: broader ranges = less optimization
  • Batch dimension typically most variable
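
Profile selection at runtime reduces to a range lookup. A pure-Python sketch: the (min, opt, max) triples mirror what you would register via TensorRT's IOptimizationProfile, but the profile values and function name here are invented for illustration:

```python
# Hypothetical (min, opt, max) batch ranges, one tuple per profile
PROFILES = [(1, 1, 4), (5, 8, 16), (17, 32, 64)]

def select_profile(batch):
    """Pick the first optimization profile whose range covers the batch."""
    for idx, (lo, _opt, hi) in enumerate(PROFILES):
        if lo <= batch <= hi:
            return idx
    raise ValueError(f"no profile covers batch={batch}; rebuild needed")

# Small batches hit the tightly optimized profile; large ones fall
# through to the broader (and typically slower) ranges.
print(select_profile(3), select_profile(8), select_profile(64))
```
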
Dynamic Batching Strategies

Key Features
  • Pad to fixed batch sizes (1, 2, 4, 8, 16...)
  • Request batching with timeout
  • Continuous batching for LLMs
  • Profile optimal batch sizes per GPU
  • Memory vs latency trade-offs
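
The pad-to-fixed-sizes strategy from the first bullet can be sketched as a bucket lookup (the bucket set and helper name are illustrative):

```python
# Fixed batch sizes the engine was profiled for
BUCKETS = (1, 2, 4, 8, 16)

def pad_to_bucket(n):
    """Smallest fixed batch size that fits n requests; pad the remainder."""
    for b in BUCKETS:
        if n <= b:
            return b
    return BUCKETS[-1]  # larger loads get split into max-size batches

# 3 pending requests run as a padded batch of 4.
print(pad_to_bucket(3))
```

Padding wastes some compute on dummy rows but keeps every batch on a shape the engine was optimized for.
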

Optimization Best Practices

Memory Optimization

Key Features
  • Use pinned memory for host-device transfers
  • Pre-allocate buffers, avoid runtime allocation
  • Enable workspace memory sharing
  • Use memory pools (cudaMallocAsync)
  • Profile memory with nvidia-smi / Nsight
Throughput Optimization

Key Features
  • Overlap compute and data transfer
  • Use multiple streams for concurrency
  • Batch requests for GPU efficiency
  • Profile kernel occupancy
  • Minimize host-device synchronization
Latency Optimization

Key Features
  • Use CUDA graphs for fixed workloads
  • Minimize kernel launch overhead
  • Keep data on GPU between operations
  • Use persistent kernels where applicable
  • Profile end-to-end latency breakdown