GPU Inference
TensorRT Optimization Pipeline
Model import: Import trained models from ONNX, PyTorch (via torch.onnx.export or torch-tensorrt), or TensorFlow. The parser converts the computation graph into TensorRT's internal representation, validating layer support and identifying optimization opportunities.
Layer fusion: TensorRT automatically fuses compatible adjacent layers to reduce memory bandwidth and kernel-launch overhead. Common fusions include Conv+BatchNorm+ReLU and Conv+Add+ReLU, applied as both vertical patterns (sequential layers) and horizontal patterns (parallel layers sharing an input).
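The Conv+BatchNorm part of that fusion can be sketched in NumPy: the batch-norm scale and shift are folded into the convolution's weights and bias, so the fused layer produces identical output in a single pass. A minimal sketch (tensor names and layouts are illustrative):

```python
import numpy as np

def fold_bn_into_conv(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm (scale gamma, shift beta, running mean/var) into a
    conv with weights W of shape (out_c, in_c, kh, kw) and bias b (out_c,)."""
    s = gamma / np.sqrt(var + eps)          # per-output-channel BN scale
    W_folded = W * s[:, None, None, None]   # scale each output filter
    b_folded = (b - mean) * s + beta        # fold mean/shift into the bias
    return W_folded, b_folded
```

Equivalence is easy to check on a 1x1 convolution, which reduces to a matrix multiply: conv-then-BN and the folded conv give the same result to floating-point precision.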
Kernel auto-tuning: TensorRT benchmarks multiple kernel implementations for each layer during the build, selecting the fastest for the target GPU. It considers tile sizes, memory access patterns, and tensor-core utilization.
Engine serialization: Serialize the optimized engine to a file for fast loading without a rebuild. The engine is GPU-architecture specific, so a different GPU requires a rebuild, and engines built with one TensorRT version are generally not loadable by another.
Precision Calibration
| Precision | Memory | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| FP32 | 1x (baseline) | 1x (baseline) | Full precision | Accuracy-critical, reference baseline |
| TF32 | 1x | ~2x on Ampere+ | Near-FP32 | Default on Ampere GPUs, transparent |
| FP16 | 0.5x | 2-4x | Minimal loss | Production default, good accuracy/speed |
| INT8 | 0.25x | 4-8x | Requires calibration | Maximum throughput, CNNs, detection |
| FP8 | 0.25x | 4-8x on Hopper+ | Better than INT8 | Hopper/Ada GPUs, transformers |
- Collect representative calibration dataset (100-1000 samples)
- Run inference to collect activation distributions
- Compute optimal scale factors per tensor
- Entropy calibration (default) or MinMax calibration
- Cache calibration data for reproducibility
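MinMax calibration can be sketched in NumPy: the scale maps the largest observed |activation| onto the int8 range, and quantize/dequantize round-trips through that scale. This is a symmetric per-tensor sketch; entropy calibration instead searches for the clipping threshold that minimizes KL divergence between the original and quantized distributions.

```python
import numpy as np

def minmax_scale(activations, num_bits=8):
    """Symmetric per-tensor scale: largest observed |x| maps to qmax."""
    qmax = 2 ** (num_bits - 1) - 1        # 127 for int8
    return float(np.abs(activations).max()) / qmax

def quantize(x, scale):
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

With MinMax calibration the round-trip error is bounded by scale/2 per element, which is why a few extreme outliers (which inflate the scale) can hurt accuracy across the whole tensor.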
- Keep sensitive layers in FP16 (first/last layers)
- Use INT8 for compute-heavy convolutions
- Profile accuracy loss per layer
- Layer-wise precision selection via API
- Fallback to higher precision if needed
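Per-layer sensitivity can be estimated offline by measuring the round-trip error INT8 would introduce in each layer's weights; layers with the worst error (often those containing outlier values) are candidates to keep in FP16. A sketch with hypothetical layer names and an illustrative threshold:

```python
import numpy as np

def int8_round_trip_error(w):
    """Mean absolute error from symmetric per-tensor INT8 quantization."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127)
    return float(np.abs(q * scale - w).mean())

def pick_fp16_layers(layers, threshold):
    """Flag layers whose weight quantization error exceeds the threshold."""
    return [name for name, w in layers.items()
            if int8_round_trip_error(w) > threshold]
```

A tensor with one large outlier gets a much coarser scale, so its error dwarfs that of a well-behaved tensor of the same shape; those are the layers to pin to FP16 via the layer-precision API.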
CUDA Memory Management
Global (device) memory: Standard GPU memory allocated with cudaMalloc. High bandwidth but high latency. Used for model weights, activations, and I/O buffers. Must be explicitly managed: allocate once, reuse.
Pinned (page-locked) host memory: Host memory locked in physical RAM, enabling DMA transfers. Gives 2-3x faster host-device copies and is required for async transfers with streams. Higher allocation cost, so allocate at startup.
Unified (managed) memory: A single address space accessible from CPU and GPU, with automatic page migration on access. Simplifies programming but can cause performance issues with frequent migrations. Good for prototyping.
Shared memory: Fast on-chip memory shared within a thread block; a user-managed cache for frequently accessed data. Critical for custom kernel performance. Limited size (48-164KB per SM, architecture dependent).
Stream Concurrency & Async Operations
CUDA streams enable concurrent kernel execution and overlapped data transfers. Proper stream management is critical for maximizing GPU utilization in inference pipelines.
Pipelined inference: Overlap H2D copy, compute, and D2H copy across multiple streams.
- Stream 1: Copy batch N to GPU
- Stream 2: Inference on batch N-1
- Stream 3: Copy batch N-2 results to host
- Use cudaStreamSynchronize selectively
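The three-stage overlap above can be visualized with a host-side schedule simulation (plain Python, no CUDA calls): at time step t, stage s works on batch t - s, so after a two-step ramp-up all three streams stay busy simultaneously.

```python
def pipeline_schedule(num_batches, stages=("H2D", "compute", "D2H")):
    """Per time step, which batch each stage processes (triple buffering).
    Real code would issue each stage's work on its own CUDA stream."""
    steps = []
    for t in range(num_batches + len(stages) - 1):
        busy = {name: t - s for s, name in enumerate(stages)
                if 0 <= t - s < num_batches}
        steps.append(busy)
    return steps
```

With 3 batches the steady-state step has every stage occupied: {'H2D': 2, 'compute': 1, 'D2H': 0}.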
Concurrent model execution: Run multiple models concurrently on the same GPU for better utilization.
- Separate stream per model/context
- Useful for small models with low occupancy
- MPS (Multi-Process Service) for multi-process
- Monitor SM utilization for benefits
CUDA graphs: Capture and replay entire workflows to minimize CPU launch overhead.
- Record stream operations into graph
- Single launch for entire pipeline
- 10-20% latency reduction typical
- Static shapes required (or graph per shape)
Custom CUDA Kernels
Custom CUDA kernels are justified when standard libraries don't support your operation, when fusing multiple operations provides significant gains, or when domain-specific optimizations are possible.
Maximize throughput through coalesced memory access, shared memory usage, occupancy tuning, and minimizing warp divergence. Profile before optimizing - focus on bottlenecks.
Memory bandwidth is often the bottleneck. Ensure coalesced access (adjacent threads access adjacent memory), use shared memory for reused data, and minimize global memory transactions.
Choose block dimensions for maximum occupancy while considering shared memory and register usage. Typical: 128-256 threads per block. Profile different configurations.
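The occupancy trade-off can be estimated with simple arithmetic: resident blocks per SM are capped separately by threads, registers, and shared memory, and the tightest limit wins. A sketch using Ampere-like per-SM limits (the default values below are assumptions; real code should query the device or use the CUDA occupancy API):

```python
def blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block,
                  max_threads=2048, max_blocks=32,
                  regs_per_sm=65536, smem_per_sm=102400):
    """Estimate resident blocks per SM: the tightest resource limit wins."""
    by_threads = max_threads // threads_per_block
    by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    by_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks
    return min(max_blocks, by_threads, by_regs, by_smem)
```

Occupancy is then blocks * threads_per_block / max_threads; for example, bumping shared memory from 0 to 48KB per 256-thread block can drop a kernel from 8 resident blocks to 2, which is exactly the kind of trade-off to profile.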
TensorRT Plugin Development
| Plugin Type | Use Case | Implementation |
|---|---|---|
| IPluginV2DynamicExt | Dynamic input shapes, variable batch size | Full shape inference, workspace calculation |
| IPluginV2IOExt | Fixed shapes with format flexibility | Simpler than Dynamic, broadcast support |
| IPluginCreator | Plugin registration and factory | Deserialize plugin from engine file |
- IPluginV2DynamicExt for dynamic shapes
- getOutputDimensions() - compute output shapes
- enqueue() - execute kernel on stream
- configurePlugin() - set input/output formats
- serialize()/deserialize() for engine caching
- BatchedNMS - Non-maximum suppression
- GridAnchor - Anchor generation
- InstanceNormalization - Instance norm
- ResizeNearest - Upsampling
- ScatterND - Scatter operations
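As a reference for what the BatchedNMS plugin computes, greedy non-maximum suppression can be sketched in NumPy (single class, corner-format boxes; the plugin runs this per batch item and per class on the GPU):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS over boxes given as (x1, y1, x2, y2) rows."""
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        if rest.size == 0:
            break
        # intersection of the kept box with all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]     # drop heavily overlapping boxes
    return keep
```

The data-dependent loop and per-iteration compaction are why NMS does not map onto standard TensorRT layers and ships as a plugin instead.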
Profiling & Performance Analysis
Nsight Systems: System-wide profiler showing the CPU/GPU timeline, CUDA API calls, kernel launches, memory transfers, and stream activity. Essential for identifying pipeline bottlenecks and concurrency issues.
Nsight Compute: Detailed kernel profiler with hardware-counter analysis. Shows occupancy, memory throughput, compute utilization, and roofline analysis. Use for kernel-level optimization.
TensorRT profiler: Built-in profiler reporting per-layer timing during inference; identifies slow layers to focus optimization on. Enable via IExecutionContext::setProfiler().
Dynamic Shapes & Batching
- Define min/opt/max dimensions per input
- Multiple optimization profiles for different ranges
- Runtime shape selection without rebuild
- Trade-off: broader ranges = less optimization
- Batch dimension typically most variable
- Pad to fixed batch sizes (1, 2, 4, 8, 16...)
- Request batching with timeout
- Continuous batching for LLMs
- Profile optimal batch sizes per GPU
- Memory vs latency trade-offs
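Padding to fixed batch sizes keeps the number of distinct input shapes, and hence the number of engine optimization profiles, small. The bucketing step can be sketched as follows (the bucket set is illustrative):

```python
def pad_to_bucket(n, buckets=(1, 2, 4, 8, 16)):
    """Smallest fixed batch size that fits n queued requests; a count
    beyond the largest bucket would be split across multiple batches."""
    for b in buckets:
        if n <= b:
            return b
    return buckets[-1]
```

Padding slots are filled with dummy inputs and their outputs discarded, trading a little wasted compute for one engine profile per bucket instead of one per observed batch size.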
Optimization Best Practices
- Use pinned memory for host-device transfers
- Pre-allocate buffers, avoid runtime allocation
- Enable workspace memory sharing
- Use memory pools (cudaMallocAsync)
- Profile memory with nvidia-smi / Nsight
- Overlap compute and data transfer
- Use multiple streams for concurrency
- Batch requests for GPU efficiency
- Profile kernel occupancy
- Minimize host-device synchronization
- Use CUDA graphs for fixed workloads
- Minimize kernel launch overhead
- Keep data on GPU between operations
- Use persistent kernels where applicable
- Profile end-to-end latency breakdown
