# Model Serving
## LLM Inference Servers
### vLLM

High-throughput LLM serving engine with PagedAttention for efficient memory management and continuous batching for high GPU utilization.

**Key Features:**
- PagedAttention for KV cache management
- Continuous batching for throughput
- Tensor and pipeline parallelism
- OpenAI-compatible API
- Support for quantization (AWQ, GPTQ)

**Use Cases:**
- High-throughput LLM inference
- Production chatbot backends
- Large language model APIs
- Multi-user serving
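Because vLLM exposes an OpenAI-compatible API, a request is just a standard chat-completion body sent to `POST /v1/chat/completions`. A minimal sketch — the model ID, prompt, and sampling values below are placeholders:

```python
import json

# Chat-completion request for a vLLM server (typically started with
# `vllm serve <model>` and listening on port 8000 by default).
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model ID
    "messages": [
        {"role": "user", "content": "Summarize PagedAttention in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}

# This string is the JSON body of POST /v1/chat/completions.
body = json.dumps(payload)
print(body[:60])
```

Any OpenAI client library can be pointed at the server by overriding its base URL, which is what makes vLLM a drop-in backend for existing applications.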
### Text Generation Inference (TGI)

Hugging Face's optimized inference server for LLMs with token streaming, quantization, and production-ready features.

**Key Features:**
- Token streaming support
- Quantization (bitsandbytes, GPTQ)
- Tensor parallelism
- Flash Attention integration
- Safetensors weight loading

**Use Cases:**
- Hugging Face model deployment
- Streaming text generation
- Production LLM serving
- Multi-GPU inference
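TGI's `/generate` endpoint takes a prompt plus a `parameters` object; `/generate_stream` accepts the same body and returns server-sent events, one token chunk per event. A minimal sketch of the request body (prompt and parameter values are illustrative):

```python
import json

# Request body shared by TGI's /generate and /generate_stream endpoints.
request = {
    "inputs": "Write a haiku about GPUs.",
    "parameters": {
        "max_new_tokens": 64,   # cap on generated tokens
        "temperature": 0.8,
        "do_sample": True,      # sample instead of greedy decoding
    },
}
print(json.dumps(request))
```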
### TensorRT-LLM

NVIDIA's high-performance LLM inference optimization SDK with custom kernels for maximum GPU efficiency.

**Key Features:**
- Custom CUDA kernels
- INT4/INT8/FP8 quantization
- In-flight batching
- Multi-GPU/multi-node support
- KV cache optimization

**Use Cases:**
- Maximum performance inference
- NVIDIA GPU deployments
- Low-latency applications
- Cost-optimized serving
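In-flight batching (TensorRT-LLM's name for continuous batching) is the scheduling idea behind much of the throughput gain: when a sequence finishes, its slot is refilled from the queue at the next decode step instead of waiting for the whole batch to drain. A toy, framework-free simulation — request lengths and batch size are made up:

```python
from collections import deque

def simulate_inflight_batching(request_lengths, max_batch_size):
    """Count decode steps when finished sequences free their slot
    immediately and waiting requests join the batch mid-flight."""
    queue = deque(request_lengths)  # tokens still to generate per request
    running = []
    steps = 0
    while queue or running:
        # Refill free slots from the queue at every decode step.
        while queue and len(running) < max_batch_size:
            running.append(queue.popleft())
        # One decode step: every running sequence emits one token.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps

# Mixed-length requests: short ones finish early and free slots for others.
print(simulate_inflight_batching([3, 10, 2, 8], max_batch_size=2))
```

Static batching of the same requests in arrival order ([3, 10] then [2, 8]) would take 10 + 8 = 18 steps, because each short request is held hostage by the longest request in its batch; the in-flight scheduler finishes sooner by reusing freed slots immediately.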
### Ollama

Local LLM serving made easy, with bundled models, automatic quantization, and a simple API for running models on consumer hardware.

**Key Features:**
- One-command model deployment
- Automatic quantization
- CPU and GPU support
- OpenAI-compatible API
- Model library with popular LLMs

**Use Cases:**
- Local development
- Privacy-sensitive applications
- Offline inference
- Consumer hardware deployment
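Ollama's local REST API (on `localhost:11434` by default) takes a small JSON body at `POST /api/generate`. A sketch — the model tag is a placeholder for anything pulled with `ollama pull`:

```python
import json

# Request body for Ollama's POST /api/generate endpoint.
request = {
    "model": "llama3.1",        # placeholder local model tag
    "prompt": "Why is the sky blue?",
    "stream": False,            # set True to receive incremental JSON chunks
}
print(json.dumps(request))
```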
## General ML Serving Frameworks
### Triton Inference Server

NVIDIA's production-ready inference server supporting multiple frameworks, with dynamic batching and concurrent model execution.

**Key Features:**
- Multi-framework support (PyTorch, TensorFlow, ONNX)
- Dynamic batching
- Model ensembles
- Multi-model concurrent execution
- HTTP/gRPC/C++ APIs

**Use Cases:**
- Multi-model serving
- High-throughput inference
- Computer vision pipelines
- Production ML systems
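Triton's HTTP API follows the KServe v2 inference protocol: tensors are sent as named inputs with an explicit shape and datatype to `POST /v2/models/<model_name>/infer`. A sketch — the tensor name and data are placeholders and must match the model's `config.pbtxt` in a real deployment:

```python
import json

# v2-protocol inference request body for Triton (or any KServe v2 server).
request = {
    "inputs": [
        {
            "name": "INPUT0",           # placeholder tensor name
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [0.1, 0.2, 0.3, 0.4],
        }
    ]
}
print(json.dumps(request))
```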
### TorchServe

PyTorch's official model serving framework with built-in support for multi-model deployment and A/B testing.

**Key Features:**
- Native PyTorch integration
- Multi-model management
- Model versioning
- Custom preprocessing/postprocessing
- Metrics and logging

**Use Cases:**
- PyTorch model deployment
- Computer vision serving
- NLP model APIs
- Research model deployment
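Custom preprocessing/postprocessing in TorchServe is written as a handler with three stages. A plain-Python conceptual sketch of those stages (real handlers subclass `ts.torch_handler.base_handler.BaseHandler`; the request shape and the length-based "model" here are invented for illustration):

```python
# The three stages a custom TorchServe handler implements, as plain functions.
def preprocess(batch):
    # TorchServe hands the handler a list of request dicts;
    # the "body"/"text" shape below is a hypothetical payload format.
    return [req["body"]["text"].lower() for req in batch]

def inference(inputs):
    # Stand-in for a model forward pass: score each text by its length.
    return [len(text) for text in inputs]

def postprocess(outputs):
    # One JSON-serializable response per request in the batch.
    return [{"score": score} for score in outputs]

batch = [{"body": {"text": "Hello TorchServe"}}]
print(postprocess(inference(preprocess(batch))))
```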
### KServe

Kubernetes-native serverless ML inference platform with auto-scaling, canary rollouts, and explainability.

**Key Features:**
- Serverless autoscaling
- Canary and blue/green deployments
- Explainability integration
- Multi-framework support
- Istio-based traffic management

**Use Cases:**
- Kubernetes ML deployments
- Serverless inference
- Enterprise ML platforms
- Cloud-native applications
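A KServe deployment is declared as an `InferenceService` custom resource. A minimal sketch (fields follow KServe's `v1beta1` API; the service name and storage URI are placeholders):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris          # placeholder service name
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://your-bucket/models/sklearn/model  # placeholder path
```

Applying this manifest lets KServe provision the serving pod, autoscale it (including scale-to-zero), and expose a v2-protocol inference endpoint.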
### Ray Serve

Scalable model serving framework built on Ray, with a Python-first API and support for model composition and business logic.

**Key Features:**
- Python-first API
- Model composition and chaining
- Dynamic request batching
- Distributed serving
- Integration with Ray ecosystem

**Use Cases:**
- Multi-model pipelines
- Complex inference workflows
- Python-native deployments
- Distributed ML serving
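Model composition means chaining independently deployed stages with ordinary Python control flow. A framework-free sketch of the pattern Ray Serve expresses with `@serve.deployment` classes (each stage here is a plain callable; in Serve each would scale independently; the keyword "model" is invented for illustration):

```python
# Plain-Python sketch of a composed inference pipeline.
class Preprocessor:
    def __call__(self, text):
        return text.strip().lower()

class Classifier:
    def __call__(self, text):
        # Stand-in model: label by a trivial rule.
        return "question" if text.endswith("?") else "statement"

class Pipeline:
    """Business logic wiring the stages together."""
    def __init__(self, stages):
        self.stages = stages

    def __call__(self, request):
        for stage in self.stages:
            request = stage(request)
        return request

pipeline = Pipeline([Preprocessor(), Classifier()])
print(pipeline("  Is this a question?  "))
```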
## Cloud Managed Inference
### Amazon SageMaker

AWS's fully managed model deployment service with auto-scaling, A/B testing, and multi-model endpoints for production ML inference.

**Key Features:**
- Managed infrastructure
- Auto-scaling
- Multi-model endpoints
- Shadow testing
- Built-in monitoring

**Use Cases:**
- Production model deployment on AWS
- Scalable inference endpoints
- Multi-variant testing
- Enterprise ML on AWS
### Vertex AI

Google Cloud's managed prediction service with custom containers, auto-scaling, and integrated monitoring.

**Key Features:**
- Custom container support
- Online and batch predictions
- Automatic scaling
- Model monitoring
- Explainable AI

**Use Cases:**
- GCP ML deployments
- Scalable predictions
- Batch inference jobs
- AutoML deployments
### Azure Machine Learning

Azure's managed endpoints for real-time and batch inference, with MLflow integration and blue/green rollouts.

**Key Features:**
- Managed online endpoints
- Batch endpoints
- MLflow integration
- Blue/green deployments
- Kubernetes integration

**Use Cases:**
- Azure ML deployments
- Enterprise inference
- Batch scoring
- MLOps on Azure
### Amazon Bedrock

Fully managed service for foundation models with serverless API access to models from Anthropic, AI21, Stability AI, and more.

**Key Features:**
- Serverless foundation model access
- No infrastructure management
- Custom model fine-tuning
- Guardrails and safety filters
- Pay-per-use pricing

**Use Cases:**
- LLM applications
- Serverless AI
- Foundation model access
- Quick AI prototyping
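Invoking an Anthropic model through Bedrock means sending a JSON body in Bedrock's Anthropic messages format via the `InvokeModel` API (e.g. with boto3's `bedrock-runtime` client). A sketch of the body — the model ID in the comment is a placeholder:

```python
import json

# Body for Bedrock's InvokeModel API when targeting an Anthropic model.
body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 256,
    "messages": [
        {"role": "user", "content": "Name three model-serving patterns."}
    ],
})
# client.invoke_model(modelId="anthropic.claude-...", body=body)  # placeholder ID
print(body[:40])
```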
## Inference Serving Patterns
| Pattern | Description | Latency | Cost | Use Case |
|---|---|---|---|---|
| Real-time (Synchronous) | Immediate response to API request | Low (ms) | High | Chatbots, user-facing apps |
| Batch Inference | Process multiple requests together | High (minutes) | Low | Bulk processing, analytics |
| Streaming | Process data as it arrives | Medium | Medium | Real-time analytics, IoT |
| Asynchronous | Queue-based processing | Medium (seconds) | Medium | Background jobs, non-urgent |
| Edge Inference | On-device model execution | Very Low | Very Low | Mobile apps, IoT devices |
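The asynchronous pattern in the table decouples request submission from execution with a queue and background workers. A minimal single-process sketch using the standard library (the job IDs and the uppercasing "model" are stand-ins):

```python
import queue
import threading

# Queue-based asynchronous inference: submit now, collect results later.
jobs = queue.Queue()
results = {}

def fake_model(text):
    return text.upper()  # stand-in for a real (slow) inference call

def worker():
    while True:
        job_id, payload = jobs.get()
        if job_id is None:           # sentinel: shut the worker down
            break
        results[job_id] = fake_model(payload)
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()
jobs.put(("job-1", "hello"))
jobs.put(("job-2", "world"))
jobs.put((None, None))               # signal shutdown after the real jobs
t.join()
print(results)
```

In production the in-memory queue is replaced by a durable broker (e.g. SQS or Redis) and results are written to a store the client can poll.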
## Inference Optimization Techniques
### Model Optimization
- Quantization: Reduce precision (INT8, INT4)
- Pruning: Remove unnecessary weights
- Distillation: Train a smaller student model to match a larger teacher
- Model Compression: GPTQ, AWQ, GGUF formats
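The core of quantization is mapping floats to a small integer range with a scale factor. A minimal symmetric per-tensor INT8 sketch (the weight values are made up; real schemes add per-channel scales, zero points, and calibration):

```python
# Symmetric INT8 quantization: map floats to [-127, 127] with one scale.
def quantize(values):
    scale = max(abs(v) for v in values) / 127.0
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.02, -1.27, 0.64, 0.005]
q, scale = quantize(weights)
restored = dequantize(q, scale)
print(q)        # integers in [-127, 127]
print(max(abs(a - b) for a, b in zip(weights, restored)))  # reconstruction error
```

The worst-case error per value is half the scale, which is why outliers (which inflate the scale) are the main enemy of low-bit quantization.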
### Serving Optimization
- Batching: Process multiple requests together
- Caching: Cache frequent responses/embeddings
- Request Coalescing: Merge similar requests
- KV Cache: Reuse attention key/value tensors across decoding steps
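Response caching is the simplest of these wins: memoize inference on exact-match inputs so repeats never touch the model. A sketch with the standard library (the call counter and response format are illustrative):

```python
from functools import lru_cache

# Track how often the "model" actually runs.
calls = {"count": 0}

@lru_cache(maxsize=1024)
def cached_generate(prompt):
    calls["count"] += 1          # stands in for an expensive model call
    return f"response to: {prompt}"

cached_generate("hi")
cached_generate("hi")            # served from cache; model not re-run
print(calls["count"])
```

Exact-match caching only helps when inputs repeat verbatim; semantic caching (keying on embedding similarity) extends the idea to near-duplicate prompts.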
### Hardware Optimization
- GPU Acceleration: CUDA, TensorRT optimization
- Multi-GPU: Tensor/pipeline parallelism
- Specialized Hardware: TPUs, Inferentia, Trainium
- Flash Attention: Memory-efficient attention
### Architecture Patterns
- Auto-scaling: Dynamic replica adjustment
- Load Balancing: Distribute requests efficiently
- Model Routing: Route to appropriate model size
- Canary Deployment: Gradual rollout
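Model routing can be as simple as a rule that sends cheap requests to a small model and hard ones to a large model. A sketch — the length threshold and both "models" are invented stand-ins (production routers use classifiers or cost/quality policies):

```python
# Route by prompt length: short prompts to the cheap model, long to the big one.
def small_model(prompt):
    return "small:" + prompt[:10]

def large_model(prompt):
    return "large:" + prompt[:10]

def route(prompt, threshold=100):
    model = small_model if len(prompt) < threshold else large_model
    return model(prompt)

print(route("short question"))
print(route("x" * 500))
```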
