Edge AI

Edge AI Fundamentals

Edge vs Cloud Inference

Edge AI runs models directly on devices (phones, cameras, vehicles) rather than on cloud servers. It enables real-time responses without network latency, works offline, and keeps data on the device. The trade-off is limited compute and memory compared to the cloud.

Key Features
  • Latency: <10ms edge vs 100ms+ cloud
  • Privacy: Data stays on device
  • Offline: Works without connectivity
  • Cost: No per-inference cloud charges
  • Compute: Limited by device hardware
Similar Technologies
Cloud inference · Hybrid edge-cloud · Fog computing
Model Size Constraints

Edge devices have limited memory (megabytes to a few gigabytes) compared to cloud GPUs, so models must be compressed to fit. Target sizes range from under 1 MB for microcontrollers to hundreds of MB for phones and edge servers.

Key Features
  • Microcontrollers: <1MB models
  • Mobile phones: 10-200MB models
  • Edge servers: Up to few GB
  • RAM often more limiting than storage
  • Batch size typically 1
Similar Technologies
Cloud deployment · Model streaming · Split inference
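To make the size constraints concrete, a back-of-envelope weight-footprint estimate (an illustrative helper, not from any library):

```python
def model_size_bytes(num_params, bits_per_weight):
    """Rough weight footprint: parameter count times bytes per weight.
    Ignores activations, runtime overhead, and file metadata."""
    return num_params * bits_per_weight // 8

# A 25M-parameter model: FP32 vs INT8 (illustrative numbers)
fp32_bytes = model_size_bytes(25_000_000, 32)  # 100 MB -- beyond many mobile budgets
int8_bytes = model_size_bytes(25_000_000, 8)   # 25 MB -- fits the phone range above
```

The same arithmetic explains why RAM is often the binding constraint: activations and the runtime share the budget with the weights.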
Power Efficiency

Battery-powered and thermally constrained devices require power-efficient inference, commonly measured in TOPS per watt or inferences per joule. Specialized AI accelerators (NPUs, TPUs) achieve 10-100x better efficiency than general-purpose CPUs.

Key Features
  • TOPS/Watt as key metric
  • NPUs >> GPUs >> CPUs for efficiency
  • Dynamic voltage/frequency scaling
  • Model architecture impacts power
  • Thermal throttling considerations
Similar Technologies
Always-on low-power modes · Wake-on-detection · Duty cycling
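The TOPS/Watt metric converts directly into energy per inference; this helper and its numbers are illustrative (roughly in the range of a 4 TOPS / 2 W accelerator running a hypothetical 2-GOP model):

```python
def inferences_per_joule(tops, watts, ops_per_inference):
    """How many inferences one joule of energy buys: sustained
    ops/second divided by per-inference op count, per watt."""
    ops_per_second = tops * 1e12
    return ops_per_second / ops_per_inference / watts

# 4 TOPS at 2 W, 2 GOPs per inference (hypothetical numbers):
rate = inferences_per_joule(4, 2, 2e9)  # 1000 inferences per joule
```

Dividing a battery's watt-hour rating (1 Wh = 3600 J) by this figure gives a first-order estimate of inference count per charge.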
Real-Time Requirements

Many edge applications require deterministic, low-latency inference. Video processing needs 30fps (33ms/frame). Autonomous systems need <10ms response. Hard vs soft real-time constraints affect design choices.

Key Features
  • Latency budgets: 10-100ms typical
  • Throughput: frames/second
  • Jitter: variance in latency
  • Hard real-time: safety-critical
  • Soft real-time: best-effort
Similar Technologies
Batch processing · Async inference · Predictive pre-computation
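Jitter matters as much as the mean: a pipeline can average well under budget and still drop frames. A minimal percentile check against the per-frame budget (sample latencies are hypothetical):

```python
def meets_frame_budget(latencies_ms, fps, percentile=0.99):
    """True when the given latency percentile fits the per-frame
    budget of 1000/fps milliseconds."""
    budget_ms = 1000.0 / fps
    ordered = sorted(latencies_ms)
    idx = min(len(ordered) - 1, int(percentile * len(ordered)))
    return ordered[idx] <= budget_ms

samples = [20, 22, 25, 31, 28, 21, 24, 26, 23, 40]  # mean is 26 ms
meets_frame_budget(samples, 30)  # False: the 40 ms outlier blows the 33 ms budget
```

Hard real-time systems budget for the worst case (percentile 1.0); soft real-time systems typically pick p95-p99.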

Model Optimization Techniques

| Technique | Description | Size Reduction | Accuracy Impact | Use Case |
|---|---|---|---|---|
| INT8 Quantization | Convert FP32 weights to 8-bit integers | 4x smaller | Minimal (1-2%) | General deployment |
| INT4 Quantization | 4-bit integer weights, aggressive compression | 8x smaller | Moderate (2-5%) | Highly constrained devices |
| Pruning | Remove unimportant weights/neurons | 2-10x smaller | Varies | Sparse hardware support |
| Knowledge Distillation | Train small model to mimic large model | 10-100x smaller | Moderate | Custom edge models |
| Neural Architecture Search | Automatically find efficient architectures | Varies | Optimized | Hardware-specific models |
| Low-Rank Factorization | Decompose weight matrices into smaller factors | 2-5x smaller | Minimal | Dense layers |
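As a sketch of the first row, symmetric per-tensor INT8 quantization in plain Python (production toolkits add per-channel scales and calibration data):

```python
def quantize_int8(weights):
    """Symmetric quantization: pick a scale so the largest |weight|
    maps to 127, then round every weight into [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats; error is at most one step (= scale)."""
    return [v * scale for v in q]

q, scale = quantize_int8([0.1, -0.5, 0.25])
recovered = dequantize(q, scale)  # close to the originals
```

Each weight now costs 1 byte instead of 4, which is where the table's "4x smaller" comes from; the rounding error is the source of the small accuracy drop.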

Inference Runtimes

TensorFlow Lite

Google's lightweight runtime for mobile and embedded devices. Supports Android, iOS, Linux, and microcontrollers. Extensive model zoo, quantization tools, and GPU/NPU delegate support.

Key Features
  • Cross-platform (Android, iOS, Linux, MCU)
  • GPU, NNAPI, CoreML delegates
  • Built-in quantization
  • Model optimization toolkit
  • Large ecosystem
Similar Technologies
ONNX Runtime · PyTorch Mobile · CoreML
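A minimal sketch of the TFLite Interpreter workflow, assuming a model at a placeholder path; the int8 mapping helper is pure Python and shows how a float input enters a quantized graph:

```python
def quantize_input(x, scale, zero_point):
    """Map one float into int8 space using the input tensor's
    quantization params: q = round(x / scale) + zero_point."""
    return max(-128, min(127, round(x / scale) + zero_point))

def run_tflite(model_path, input_array):
    """Standard TFLite Python flow: load, allocate tensors, set the
    input, invoke, read the output. The import is local so the
    helper above works without TensorFlow installed."""
    import tensorflow as tf
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    interpreter.set_tensor(inp["index"], input_array)
    interpreter.invoke()
    return interpreter.get_tensor(out["index"])

# e.g. run_tflite("model.tflite", batch)  # "model.tflite" is a placeholder
```

The scale and zero-point for `quantize_input` come from the input tensor's `quantization` field in `get_input_details()`.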
ONNX Runtime

Microsoft's cross-platform inference engine. Runs ONNX models with hardware acceleration across CPUs, GPUs, and NPUs. Excellent for deploying models from any framework (PyTorch, TensorFlow, etc.).

Key Features
  • Framework-agnostic (ONNX format)
  • CPU, GPU, NPU, FPGA support
  • Web, mobile, and desktop
  • Quantization and optimization
  • DirectML, CUDA, TensorRT backends
Similar Technologies
TensorFlow Lite · TensorRT · OpenVINO
CoreML

Apple's ML framework for iOS, macOS, watchOS, and tvOS. Leverages Neural Engine, GPU, and CPU. Best performance on Apple hardware with automatic hardware selection.

Key Features
  • Apple Neural Engine support
  • Automatic hardware selection
  • On-device training
  • Privacy-preserving
  • Swift/Objective-C integration
Similar Technologies
TensorFlow Lite · PyTorch Mobile · Create ML
TensorRT

NVIDIA's high-performance inference optimizer and runtime. Maximizes throughput on NVIDIA GPUs through layer fusion, precision calibration, and kernel auto-tuning. Essential for Jetson deployment.

Key Features
  • NVIDIA GPU optimization
  • Layer and tensor fusion
  • INT8/FP16 calibration
  • Dynamic batching
  • Jetson integration
Similar Technologies
ONNX Runtime · Triton Inference Server · TensorFlow Lite
OpenVINO

Intel's toolkit for optimizing and deploying models on Intel hardware (CPUs, integrated GPUs, VPUs, FPGAs). Its Model Optimizer converts models from various frameworks into OpenVINO's intermediate representation.

Key Features
  • Intel CPU/GPU/VPU optimization
  • Model Optimizer converter
  • Post-training quantization
  • Async inference API
  • Neural Compute Stick support
Similar Technologies
ONNX Runtime · TensorFlow Lite · TensorRT
llama.cpp

C/C++ implementation of LLaMA inference with minimal dependencies. Enables running LLMs on edge devices with quantization support. Foundation for many local LLM applications.

Key Features
  • CPU inference (AVX, ARM NEON)
  • 4-bit quantization (GGUF)
  • Metal, CUDA, OpenCL support
  • Minimal dependencies
  • Active community
Similar Technologies
Ollama · MLC LLM · ExLlamaV2

Edge Hardware

| Hardware | Vendor | Performance | Power | Best For |
|---|---|---|---|---|
| NVIDIA Jetson Orin | NVIDIA | Up to 275 TOPS | 15-60W | Robotics, autonomous vehicles |
| Google Coral TPU | Google | 4 TOPS | 2W | Low-power vision, IoT |
| Intel Neural Compute Stick | Intel | 1 TOPS | 1W | USB accelerator, prototyping |
| Apple Neural Engine | Apple | 15-35 TOPS | Integrated | iOS/macOS apps |
| Qualcomm Hexagon NPU | Qualcomm | 15-75 TOPS | Integrated | Android devices |
| Raspberry Pi 5 | RPi Foundation | CPU only (+ AI HAT) | 3-5W | Hobbyist, education |
| Hailo-8 | Hailo | 26 TOPS | 2.5W | Industrial, smart cameras |

Edge ML Frameworks

MediaPipe

Google's framework for building multimodal ML pipelines. Pre-built solutions for face detection, hand tracking, pose estimation. Cross-platform with optimized mobile/web performance.

Key Features
  • Pre-built vision solutions
  • Graph-based pipeline definition
  • iOS, Android, Web, Python
  • Real-time performance
  • GPU acceleration
Similar Technologies
TensorFlow Lite · OpenCV · Vision Framework
ML Kit

Google's mobile ML SDK for Android and iOS. High-level APIs for common tasks (text recognition, face detection, barcode scanning). On-device and cloud variants.

Key Features
  • Pre-trained APIs
  • On-device processing
  • Firebase integration
  • Custom model support
  • Easy integration
Similar Technologies
MediaPipe · CoreML · TensorFlow Lite
PyTorch Mobile

PyTorch's mobile deployment path. Export models via TorchScript, optimize with Mobile Optimizer. Supports iOS, Android with CPU and GPU backends.

Key Features
  • TorchScript export
  • Mobile optimizer
  • Quantization support
  • Custom operators
  • ExecuTorch (new runtime)
Similar Technologies
TensorFlow Lite · ONNX Runtime · CoreML
TinyML / TensorFlow Lite Micro

ML on microcontrollers (Cortex-M, ESP32). Models fit in kilobytes and infer in milliseconds on a milliwatt-scale power budget, enabling AI on battery-powered sensors.

Key Features
  • Microcontroller support
  • No OS required
  • KB-sized models
  • Ultra-low power
  • Keyword spotting, anomaly detection
Similar Technologies
Edge Impulse · microTVM · CMSIS-NN
NVIDIA Triton

Inference server for edge and cloud. Model repository management, dynamic batching, concurrent model execution. Supports TensorRT, ONNX, TensorFlow, PyTorch.

Key Features
  • Multi-framework support
  • Dynamic batching
  • Model ensembles
  • Metrics and monitoring
  • Kubernetes integration
Similar Technologies
TorchServe · TensorFlow Serving · BentoML
Edge Impulse

End-to-end platform for developing edge ML. Data collection, model training, optimization, and deployment to microcontrollers. Great for TinyML prototyping.

Key Features
  • Web-based ML development
  • Automatic optimization
  • Hardware profiling
  • OTA model updates
  • MCU deployment
Similar Technologies
TensorFlow Lite Micro · SensiML · Qeexo

Deployment Patterns

Fully On-Device

All inference happens locally with no cloud dependency. Maximum privacy and minimum latency. Requires model to fit on device and handle all scenarios offline.

Key Features
  • Zero network dependency
  • Maximum privacy
  • Lowest latency
  • Works offline
  • Fixed model until update
Similar Technologies
Hybrid · Cloud inference · Federated
Hybrid Edge-Cloud

Process on edge when possible, fall back to cloud for complex cases. Edge handles common scenarios quickly; cloud provides capability for edge cases.

Key Features
  • Best of both worlds
  • Graceful degradation
  • Cost optimization
  • Confidence-based routing
  • Requires connectivity strategy
Similar Technologies
Fully on-device · Cloud-first · Tiered inference
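Confidence-based routing can be sketched in a few lines; the models below are toy stand-ins and 0.8 is an arbitrary threshold:

```python
def hybrid_predict(x, edge_model, cloud_model, threshold=0.8):
    """Trust the on-device answer when it is confident enough;
    otherwise escalate to the cloud. Both models are assumed to
    return (label, confidence)."""
    label, confidence = edge_model(x)
    if confidence >= threshold:
        return label, "edge"
    return cloud_model(x)[0], "cloud"

def edge(x):   # toy stand-in for a small on-device model
    return ("cat", 0.95) if x == "easy" else ("cat?", 0.40)

def cloud(x):  # toy stand-in for a large cloud model
    return ("lynx", 0.99)

hybrid_predict("easy", edge, cloud)  # ("cat", "edge")
hybrid_predict("hard", edge, cloud)  # ("lynx", "cloud")
```

A real deployment also needs a connectivity fallback: when the network is down, the edge answer is returned regardless of confidence.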
Federated Learning

Train models across decentralized edge devices without centralizing data. Devices compute local updates; server aggregates. Enables learning from sensitive data.

Key Features
  • Data stays on device
  • Distributed training
  • Privacy-preserving
  • Personalized models
  • Communication efficiency
Similar Technologies
Centralized training · Transfer learning · On-device fine-tuning
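The server-side aggregation step (FedAvg-style) reduces to a dataset-size-weighted mean of client parameters; a toy sketch with two clients:

```python
def federated_average(client_params, client_sizes):
    """Weight each client's parameter vector by its local dataset
    size and average. Only parameters travel; raw data stays put."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(p[i] * n for p, n in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]

# Client B holds twice as much data, so it pulls the average harder:
new_global = federated_average([[1.0, 0.0], [4.0, 3.0]], [100, 200])  # [3.0, 2.0]
```

Communication efficiency comes from sending only these (often compressed) parameter vectors instead of the training data itself.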
Model Updates & Versioning

Manage model lifecycle on edge devices. OTA updates, A/B testing, rollback capabilities. Critical for improving models post-deployment without device recalls.

Key Features
  • Over-the-air updates
  • Staged rollouts
  • Version management
  • Rollback capability
  • Delta updates for efficiency
Similar Technologies
Manual updates · Fixed models · Cloud-only models
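The staged-rollout decision on a device can be reduced to a small pure function; the version strings and health signal here are illustrative:

```python
def choose_model(installed, candidate, health_ok):
    """Adopt the candidate only when it is strictly newer and its
    canary health checks pass; otherwise keep (or roll back to)
    the installed version."""
    def parse(v):  # "1.4.2" -> (1, 4, 2) for component-wise comparison
        return tuple(int(part) for part in v.split("."))
    if parse(candidate) > parse(installed) and health_ok:
        return candidate
    return installed

choose_model("1.4.2", "1.5.0", health_ok=True)   # "1.5.0"
choose_model("1.4.2", "1.5.0", health_ok=False)  # "1.4.2" -- rollout held back
```

Keeping the previously installed model on disk is what makes the rollback branch cheap: failing health checks simply never switch the active version.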

Edge AI Use Cases

| Domain | Applications | Requirements | Common Models |
|---|---|---|---|
| Smart Cameras | Object detection, face recognition, anomaly detection | Real-time (30fps), low latency | YOLO, MobileNet-SSD, RetinaFace |
| Autonomous Vehicles | Perception, planning, sensor fusion | Ultra-low latency, high reliability | PointPillars, BEVFormer, custom CNNs |
| Mobile Apps | Photo enhancement, voice assistants, AR filters | Battery efficiency, privacy | MobileNet, EfficientNet, Whisper.cpp |
| Industrial IoT | Predictive maintenance, quality inspection | 24/7 operation, rugged conditions | Anomaly detection, time series models |
| Wearables | Health monitoring, gesture recognition | Ultra-low power, small size | TinyML models, keyword spotting |
| Robotics | Navigation, manipulation, human interaction | Real-time, sensor integration | SLAM, grasping models, RL policies |