Edge AI
Edge AI Fundamentals
Edge AI runs models directly on devices (phones, cameras, vehicles) rather than on cloud servers. This enables real-time responses without network latency, works offline, and keeps data private. The trade-off is limited compute and memory compared to the cloud.
- Latency: <10ms edge vs 100ms+ cloud
- Privacy: Data stays on device
- Offline: Works without connectivity
- Cost: No per-inference cloud charges
- Compute: Limited by device hardware
Edge devices have limited memory (MB to low GB) compared to cloud GPUs, so models must be compressed to fit. Target sizes range from <1MB for microcontrollers to tens or hundreds of MB on phones and a few GB on edge servers.
- Microcontrollers: <1MB models
- Mobile phones: 10-200MB models
- Edge servers: Up to few GB
- RAM often more limiting than storage
- Batch size typically 1
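As a back-of-envelope check, weight footprint scales roughly linearly with parameter count and bit width. The sketch below ignores activation memory and runtime overhead, and the 5M-parameter model is a hypothetical MobileNet-scale example:

```python
def model_size_mb(n_params: int, bits_per_weight: int) -> float:
    """Approximate weight footprint in megabytes (weights only)."""
    return n_params * bits_per_weight / 8 / 1e6

# Hypothetical 5M-parameter vision model (MobileNet-scale):
n = 5_000_000
print(model_size_mb(n, 32))  # FP32: 20.0 MB
print(model_size_mb(n, 8))   # INT8:  5.0 MB -- within a mobile budget
print(model_size_mb(n, 4))   # INT4:  2.5 MB
```

The same arithmetic explains the "RAM often more limiting than storage" point: the weights above must be resident during inference, alongside activations and the runtime itself.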
Battery-powered and thermally constrained devices require power-efficient inference, measured in inferences per joule (equivalently, throughput per watt). Specialized AI accelerators (NPUs, TPUs) achieve 10-100x better efficiency than general-purpose CPUs.
- TOPS/Watt as key metric
- NPUs >> GPUs >> CPUs for efficiency
- Dynamic voltage/frequency scaling
- Model architecture impacts power
- Thermal throttling considerations
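The efficiency gap can be made concrete with a small sketch. The NPU and CPU figures below are hypothetical, chosen only to illustrate the inferences-per-joule comparison:

```python
def inferences_per_joule(throughput_ips: float, power_w: float) -> float:
    # energy per inference = power / throughput; invert for inferences/J
    return throughput_ips / power_w

# Hypothetical figures: an NPU at 200 inf/s on 2 W vs a CPU at 30 inf/s on 6 W
npu = inferences_per_joule(200, 2.0)  # 100 inferences per joule
cpu = inferences_per_joule(30, 6.0)   # 5 inferences per joule
print(npu / cpu)                      # 20x efficiency advantage for the NPU
```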
Many edge applications require deterministic, low-latency inference. Video processing at 30fps allows 33ms per frame; autonomous systems may need <10ms responses. Hard vs. soft real-time constraints affect design choices.
- Latency budgets: 10-100ms typical
- Throughput: frames/second
- Jitter: variance in latency
- Hard real-time: safety-critical
- Soft real-time: best-effort
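A minimal sketch of checking measured latencies against a frame budget; the sample latencies and the 33ms (30fps) budget are illustrative. For hard real-time, the tail (p99) and jitter matter more than the mean:

```python
import statistics

def latency_report(samples_ms, budget_ms):
    """Summarize measured inference latencies against a frame budget."""
    samples = sorted(samples_ms)
    p99 = samples[min(len(samples) - 1, int(0.99 * len(samples)))]
    return {
        "mean_ms": statistics.mean(samples),
        "p99_ms": p99,                            # tail latency
        "jitter_ms": statistics.pstdev(samples),  # variance in latency
        "meets_budget": p99 <= budget_ms,         # judge by the tail, not the mean
    }

# Hypothetical measurements against a 30 fps budget (33 ms/frame)
report = latency_report([21, 24, 22, 29, 31, 23, 26, 22], budget_ms=33)
print(report["meets_budget"])  # True: even the slowest frame fits the budget
```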
Model Optimization Techniques
| Technique | Description | Size Reduction | Accuracy Impact | Use Case |
|---|---|---|---|---|
| INT8 Quantization | Convert FP32 weights to 8-bit integers | 4x smaller | Minimal (1-2%) | General deployment |
| INT4 Quantization | 4-bit integer weights, aggressive compression | 8x smaller | Moderate (2-5%) | Highly constrained devices |
| Pruning | Remove unimportant weights/neurons | 2-10x smaller | Varies | Sparse hardware support |
| Knowledge Distillation | Train small model to mimic large model | 10-100x smaller | Moderate | Custom edge models |
| Neural Architecture Search | Automatically find efficient architectures | Varies | Optimized | Hardware-specific models |
| Low-Rank Factorization | Decompose weight matrices into smaller factors | 2-5x smaller | Minimal | Dense layers |
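A minimal numpy sketch of the first row of the table, symmetric per-tensor INT8 post-training quantization. The scale and clipping scheme shown is one common choice, not the only one:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization: w is approximated by scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.05, size=(256, 256)).astype(np.float32)  # toy weight matrix
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()

print(q.nbytes / w.nbytes)      # 0.25 -> the 4x size reduction from the table
print(err <= scale / 2 + 1e-8)  # rounding error bounded by half a quantization step
```

Per-channel scales and calibration on representative data (as the runtimes below provide) reduce the accuracy impact further than this per-tensor toy.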
Inference Runtimes
TensorFlow Lite: Google's lightweight runtime for mobile and embedded devices. Supports Android, iOS, Linux, and microcontrollers. Extensive model zoo, quantization tools, and GPU/NPU delegate support.
- Cross-platform (Android, iOS, Linux, MCU)
- GPU, NNAPI, CoreML delegates
- Built-in quantization
- Model optimization toolkit
- Large ecosystem
ONNX Runtime: Microsoft's cross-platform inference engine. Runs ONNX models with hardware acceleration across CPUs, GPUs, and NPUs. Excellent for deploying models exported from any framework (PyTorch, TensorFlow, etc.).
- Framework-agnostic (ONNX format)
- CPU, GPU, NPU, FPGA support
- Web, mobile, and desktop
- Quantization and optimization
- DirectML, CUDA, TensorRT backends
Core ML: Apple's ML framework for iOS, macOS, watchOS, and tvOS. Leverages the Neural Engine, GPU, and CPU. Best performance on Apple hardware with automatic hardware selection.
- Apple Neural Engine support
- Automatic hardware selection
- On-device training
- Privacy-preserving
- Swift/Objective-C integration
TensorRT: NVIDIA's high-performance inference optimizer and runtime. Maximizes throughput on NVIDIA GPUs through layer fusion, precision calibration, and kernel auto-tuning. Essential for Jetson deployment.
- NVIDIA GPU optimization
- Layer and tensor fusion
- INT8/FP16 calibration
- Dynamic batching
- Jetson integration
OpenVINO: Intel's toolkit for optimizing and deploying models on Intel hardware (CPUs, integrated GPUs, VPUs, FPGAs). Its Model Optimizer converts models from various frameworks.
- Intel CPU/GPU/VPU optimization
- Model Optimizer converter
- Post-training quantization
- Async inference API
- Neural Compute Stick support
llama.cpp: C/C++ implementation of LLaMA-family inference with minimal dependencies. Enables running LLMs on edge devices with quantization support. The foundation for many local LLM applications.
- CPU inference (AVX, ARM NEON)
- 4-bit quantization (GGUF)
- Metal, CUDA, OpenCL support
- Minimal dependencies
- Active community
Edge Hardware
| Hardware | Vendor | Performance | Power | Best For |
|---|---|---|---|---|
| NVIDIA Jetson Orin | NVIDIA | Up to 275 TOPS | 15-60W | Robotics, autonomous vehicles |
| Google Coral TPU | Google | 4 TOPS | 2W | Low-power vision, IoT |
| Intel Neural Compute Stick | Intel | 1 TOPS | 1W | USB accelerator, prototyping |
| Apple Neural Engine | Apple | 15-35 TOPS | Integrated | iOS/macOS apps |
| Qualcomm Hexagon NPU | Qualcomm | 15-75 TOPS | Integrated | Android devices |
| Raspberry Pi 5 | RPi Foundation | CPU only (+ AI HAT) | 3-5W | Hobbyist, education |
| Hailo-8 | Hailo | 26 TOPS | 2.5W | Industrial, smart cameras |
Edge ML Frameworks
MediaPipe: Google's framework for building multimodal ML pipelines. Pre-built solutions for face detection, hand tracking, and pose estimation. Cross-platform with optimized mobile/web performance.
- Pre-built vision solutions
- Graph-based pipeline definition
- iOS, Android, Web, Python
- Real-time performance
- GPU acceleration
ML Kit: Google's mobile ML SDK for Android and iOS. High-level APIs for common tasks (text recognition, face detection, barcode scanning). On-device and cloud variants.
- Pre-trained APIs
- On-device processing
- Firebase integration
- Custom model support
- Easy integration
PyTorch Mobile: PyTorch's mobile deployment path. Export models via TorchScript and optimize with the Mobile Optimizer. Supports iOS and Android with CPU and GPU backends.
- TorchScript export
- Mobile optimizer
- Quantization support
- Custom operators
- ExecuTorch (new runtime)
TensorFlow Lite Micro: ML on microcontrollers (Cortex-M, ESP32). Models in kilobytes, inference in milliseconds with microwatts of power. Enables AI on battery-powered sensors.
- Microcontroller support
- No OS required
- KB-sized models
- Ultra-low power
- Keyword spotting, anomaly detection
Triton Inference Server: NVIDIA's inference server for edge and cloud. Model repository management, dynamic batching, concurrent model execution. Supports TensorRT, ONNX, TensorFlow, and PyTorch backends.
- Multi-framework support
- Dynamic batching
- Model ensembles
- Metrics and monitoring
- Kubernetes integration
Edge Impulse: End-to-end platform for developing edge ML. Data collection, model training, optimization, and deployment to microcontrollers. Great for TinyML prototyping.
- Web-based ML development
- Automatic optimization
- Hardware profiling
- OTA model updates
- MCU deployment
Deployment Patterns
On-Device Only: All inference happens locally with no cloud dependency. Maximum privacy and minimum latency. Requires the model to fit on the device and handle all scenarios offline.
- Zero network dependency
- Maximum privacy
- Lowest latency
- Works offline
- Fixed model until update
Hybrid Edge-Cloud: Process on the edge when possible and fall back to the cloud for complex cases. The edge handles common scenarios quickly; the cloud provides capability for edge cases.
- Best of both worlds
- Graceful degradation
- Cost optimization
- Confidence-based routing
- Requires connectivity strategy
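Confidence-based routing can be sketched as thresholding the on-device model's softmax confidence; the 0.8 threshold and the logits below are illustrative assumptions:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def route(logits, threshold: float = 0.8) -> str:
    """Return 'edge' when the on-device model is confident, else escalate."""
    probs = softmax(np.asarray(logits, dtype=np.float64))
    return "edge" if probs.max() >= threshold else "cloud"

print(route([8.0, 1.0, 0.5]))  # confident -> handled on device
print(route([1.2, 1.0, 0.9]))  # ambiguous -> escalated to cloud
```

In practice the threshold is tuned against the cost of cloud calls and the accuracy gap between the small edge model and the large cloud model.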
Federated Learning: Train models across decentralized edge devices without centralizing data. Devices compute local updates; a server aggregates them. Enables learning from sensitive data.
- Data stays on device
- Distributed training
- Privacy-preserving
- Personalized models
- Communication efficiency
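The server-side aggregation step can be sketched as FedAvg, weighting each client's model by its local data size. The client vectors and sizes below are toy examples:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Aggregate client models, weighted by how much local data each saw."""
    total = sum(client_sizes)
    agg = np.zeros_like(client_weights[0], dtype=np.float64)
    for w, n in zip(client_weights, client_sizes):
        agg += (n / total) * w
    return agg

# Hypothetical: three devices with different amounts of local data
clients = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
sizes = [100, 100, 200]
print(fedavg(clients, sizes))  # [0.75 0.75]
```

Only these (possibly compressed) weight updates cross the network, which is why the data itself never leaves the device.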
Model Management: Manage the model lifecycle on edge devices: OTA updates, A/B testing, and rollback. Critical for improving models post-deployment without device recalls.
- Over-the-air updates
- Staged rollouts
- Version management
- Rollback capability
- Delta updates for efficiency
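Staged rollouts are often implemented by deterministic hash bucketing, sketched here with hypothetical device IDs. Hashing makes cohort assignment stable, so widening the percentage only ever adds devices:

```python
import hashlib

def rollout_cohort(device_id: str, percent: int) -> bool:
    """Deterministically assign a device to a staged-rollout cohort.

    Hashing the ID gives a stable bucket in [0, 100), so growing
    `percent` from 5 -> 25 -> 100 never flips an already-updated device.
    """
    bucket = int(hashlib.sha256(device_id.encode()).hexdigest(), 16) % 100
    return bucket < percent

devices = [f"device-{i}" for i in range(1000)]  # hypothetical fleet
share = sum(rollout_cohort(d, 10) for d in devices) / len(devices)
print(round(share, 2))  # close to 0.10 for a 10% staged rollout
```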
Edge AI Use Cases
| Domain | Applications | Requirements | Common Models |
|---|---|---|---|
| Smart Cameras | Object detection, face recognition, anomaly detection | Real-time (30fps), low latency | YOLO, MobileNet-SSD, RetinaFace |
| Autonomous Vehicles | Perception, planning, sensor fusion | Ultra-low latency, high reliability | PointPillars, BEVFormer, custom CNNs |
| Mobile Apps | Photo enhancement, voice assistants, AR filters | Battery efficiency, privacy | MobileNet, EfficientNet, Whisper.cpp |
| Industrial IoT | Predictive maintenance, quality inspection | 24/7 operation, rugged conditions | Anomaly detection, time series models |
| Wearables | Health monitoring, gesture recognition | Ultra-low power, small size | TinyML models, keyword spotting |
| Robotics | Navigation, manipulation, human interaction | Real-time, sensor integration | SLAM, grasping models, RL policies |
