Edge AI
Edge AI Fundamentals
Edge AI runs models directly on devices (phones, cameras, vehicles) rather than on cloud servers. This enables real-time responses without network latency, works offline, and keeps data private. The trade-off is limited compute and memory compared to the cloud.
- Latency: <10ms edge vs 100ms+ cloud
- Privacy: Data stays on device
- Offline: Works without connectivity
- Cost: No per-inference cloud charges
- Compute: Limited by device hardware
Edge devices have limited memory (MB to low GB) compared to cloud GPUs, so models must be compressed to fit. Target sizes range from <1MB for microcontrollers to tens or hundreds of MB on phones and a few GB on edge servers.
- Microcontrollers: <1MB models
- Mobile phones: 10-200MB models
- Edge servers: Up to few GB
- RAM often more limiting than storage
- Batch size typically 1
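As a back-of-envelope check, weight footprint scales roughly linearly with parameter count and bit width. The sketch below ignores activation memory and runtime overhead, and the 5M-parameter model is a hypothetical MobileNet-scale example:

```python
def model_size_mb(n_params: int, bits_per_weight: int) -> float:
    """Approximate weight footprint in megabytes (weights only)."""
    return n_params * bits_per_weight / 8 / 1e6

# Hypothetical 5M-parameter vision model (MobileNet-scale):
n = 5_000_000
print(model_size_mb(n, 32))  # FP32: 20.0 MB
print(model_size_mb(n, 8))   # INT8:  5.0 MB -- within a mobile budget
print(model_size_mb(n, 4))   # INT4:  2.5 MB
```

The same arithmetic explains the "RAM often more limiting than storage" point: the weights above must be resident during inference, alongside activations and the runtime itself.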
Battery-powered and thermally constrained devices require power-efficient inference, measured in inferences per joule (equivalently, throughput per watt). Specialized AI accelerators (NPUs, TPUs) achieve 10-100x better efficiency than general-purpose CPUs.
- TOPS/Watt as key metric
- NPUs >> GPUs >> CPUs for efficiency
- Dynamic voltage/frequency scaling
- Model architecture impacts power
- Thermal throttling considerations
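The efficiency gap can be made concrete with a small sketch. The NPU and CPU figures below are hypothetical, chosen only to illustrate the inferences-per-joule comparison:

```python
def inferences_per_joule(throughput_ips: float, power_w: float) -> float:
    # energy per inference = power / throughput; invert for inferences/J
    return throughput_ips / power_w

# Hypothetical figures: an NPU at 200 inf/s on 2 W vs a CPU at 30 inf/s on 6 W
npu = inferences_per_joule(200, 2.0)  # 100 inferences per joule
cpu = inferences_per_joule(30, 6.0)   # 5 inferences per joule
print(npu / cpu)                      # 20x efficiency advantage for the NPU
```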
Many edge applications require deterministic, low-latency inference. Video processing at 30fps allows 33ms per frame; autonomous systems may need <10ms responses. Hard vs. soft real-time constraints affect design choices.
- Latency budgets: 10-100ms typical
- Throughput: frames/second
- Jitter: variance in latency
- Hard real-time: safety-critical
- Soft real-time: best-effort
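A minimal sketch of checking measured latencies against a frame budget; the sample latencies and the 33ms (30fps) budget are illustrative. For hard real-time, the tail (p99) and jitter matter more than the mean:

```python
import statistics

def latency_report(samples_ms, budget_ms):
    """Summarize measured inference latencies against a frame budget."""
    samples = sorted(samples_ms)
    p99 = samples[min(len(samples) - 1, int(0.99 * len(samples)))]
    return {
        "mean_ms": statistics.mean(samples),
        "p99_ms": p99,                            # tail latency
        "jitter_ms": statistics.pstdev(samples),  # variance in latency
        "meets_budget": p99 <= budget_ms,         # judge by the tail, not the mean
    }

# Hypothetical measurements against a 30 fps budget (33 ms/frame)
report = latency_report([21, 24, 22, 29, 31, 23, 26, 22], budget_ms=33)
print(report["meets_budget"])  # True: even the slowest frame fits the budget
```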
Model Optimization Techniques
| Technique | Description | Size Reduction | Accuracy Impact | Use Case |
|---|---|---|---|---|
| INT8 Quantization | Convert FP32 weights to 8-bit integers | 4x smaller | Minimal (1-2%) | General deployment |
| INT4 Quantization | 4-bit integer weights, aggressive compression | 8x smaller | Moderate (2-5%) | Highly constrained devices |
| Pruning | Remove unimportant weights/neurons | 2-10x smaller | Varies | Sparse hardware support |
| Knowledge Distillation | Train small model to mimic large model | 10-100x smaller | Moderate | Custom edge models |
| Neural Architecture Search | Automatically find efficient architectures | Varies | Optimized | Hardware-specific models |
| Low-Rank Factorization | Decompose weight matrices into smaller factors | 2-5x smaller | Minimal | Dense layers |
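A minimal numpy sketch of the first row of the table, symmetric per-tensor INT8 post-training quantization. The scale and clipping scheme shown is one common choice, not the only one:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization: w is approximated by scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.05, size=(256, 256)).astype(np.float32)  # toy weight matrix
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()

print(q.nbytes / w.nbytes)      # 0.25 -> the 4x size reduction from the table
print(err <= scale / 2 + 1e-8)  # rounding error bounded by half a quantization step
```

Per-channel scales and calibration on representative data (as the runtimes below provide) reduce the accuracy impact further than this per-tensor toy.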
Inference Runtimes
TensorFlow Lite: Google's lightweight runtime for mobile and embedded devices. Supports Android, iOS, Linux, and microcontrollers. Extensive model zoo, quantization tools, and GPU/NPU delegate support.
- Cross-platform (Android, iOS, Linux, MCU)
- GPU, NNAPI, CoreML delegates
- Built-in quantization
- Model optimization toolkit
- Large ecosystem
ONNX Runtime: Microsoft's cross-platform inference engine. Runs ONNX models with hardware acceleration across CPUs, GPUs, and NPUs. Excellent for deploying models exported from any framework (PyTorch, TensorFlow, etc.).
- Framework-agnostic (ONNX format)
- CPU, GPU, NPU, FPGA support
- Web, mobile, and desktop
- Quantization and optimization
- DirectML, CUDA, TensorRT backends
Core ML: Apple's ML framework for iOS, macOS, watchOS, and tvOS. Leverages the Neural Engine, GPU, and CPU. Best performance on Apple hardware with automatic hardware selection.
- Apple Neural Engine support
- Automatic hardware selection
- On-device training
- Privacy-preserving
- Swift/Objective-C integration
TensorRT: NVIDIA's high-performance inference optimizer and runtime. Maximizes throughput on NVIDIA GPUs through layer fusion, precision calibration, and kernel auto-tuning. Essential for Jetson deployment.
- NVIDIA GPU optimization
- Layer and tensor fusion
- INT8/FP16 calibration
- Dynamic batching
- Jetson integration
OpenVINO: Intel's toolkit for optimizing and deploying models on Intel hardware (CPUs, integrated GPUs, VPUs, FPGAs). Its Model Optimizer converts models from various frameworks.
- Intel CPU/GPU/VPU optimization
- Model Optimizer converter
- Post-training quantization
- Async inference API
- Neural Compute Stick support
llama.cpp: C/C++ implementation of LLaMA-family inference with minimal dependencies. Enables running LLMs on edge devices with quantization support. The foundation for many local LLM applications.
- CPU inference (AVX, ARM NEON)
- 4-bit quantization (GGUF)
- Metal, CUDA, OpenCL support
- Minimal dependencies
- Active community
Edge Hardware
| Hardware | Vendor | Performance | Power | Best For |
|---|---|---|---|---|
| NVIDIA Jetson Orin | NVIDIA | Up to 275 TOPS | 15-60W | Robotics, autonomous vehicles |
| Google Coral TPU | Google | 4 TOPS | 2W | Low-power vision, IoT |
| Intel Neural Compute Stick | Intel | 1 TOPS | 1W | USB accelerator, prototyping |
| Apple Neural Engine | Apple | 15-35 TOPS | Integrated | iOS/macOS apps |
| Qualcomm Hexagon NPU | Qualcomm | 15-75 TOPS | Integrated | Android devices |
| Raspberry Pi 5 | RPi Foundation | CPU only (+ AI HAT) | 3-5W | Hobbyist, education |
| Hailo-8 | Hailo | 26 TOPS | 2.5W | Industrial, smart cameras |
Edge ML Frameworks
MediaPipe: Google's framework for building multimodal ML pipelines. Pre-built solutions for face detection, hand tracking, and pose estimation. Cross-platform with optimized mobile/web performance.
- Pre-built vision solutions
- Graph-based pipeline definition
- iOS, Android, Web, Python
- Real-time performance
- GPU acceleration
ML Kit: Google's mobile ML SDK for Android and iOS. High-level APIs for common tasks (text recognition, face detection, barcode scanning). On-device and cloud variants.
- Pre-trained APIs
- On-device processing
- Firebase integration
- Custom model support
- Easy integration
PyTorch Mobile: PyTorch's mobile deployment path. Export models via TorchScript and optimize with the Mobile Optimizer. Supports iOS and Android with CPU and GPU backends.
- TorchScript export
- Mobile optimizer
- Quantization support
- Custom operators
- ExecuTorch (new runtime)
TensorFlow Lite Micro: ML on microcontrollers (Cortex-M, ESP32). Models in kilobytes, inference in milliseconds with microwatts of power. Enables AI on battery-powered sensors.
- Microcontroller support
- No OS required
- KB-sized models
- Ultra-low power
- Keyword spotting, anomaly detection
Triton Inference Server: NVIDIA's inference server for edge and cloud. Model repository management, dynamic batching, concurrent model execution. Supports TensorRT, ONNX, TensorFlow, and PyTorch backends.
- Multi-framework support
- Dynamic batching
- Model ensembles
- Metrics and monitoring
- Kubernetes integration
Edge Impulse: End-to-end platform for developing edge ML. Data collection, model training, optimization, and deployment to microcontrollers. Great for TinyML prototyping.
- Web-based ML development
- Automatic optimization
- Hardware profiling
- OTA model updates
- MCU deployment
Deployment Patterns
On-Device Only: All inference happens locally with no cloud dependency. Maximum privacy and minimum latency. Requires the model to fit on the device and handle all scenarios offline.
- Zero network dependency
- Maximum privacy
- Lowest latency
- Works offline
- Fixed model until update
Hybrid Edge-Cloud: Process on the edge when possible and fall back to the cloud for complex cases. The edge handles common scenarios quickly; the cloud provides capability for edge cases.
- Best of both worlds
- Graceful degradation
- Cost optimization
- Confidence-based routing
- Requires connectivity strategy
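Confidence-based routing can be sketched as thresholding the on-device model's softmax confidence; the 0.8 threshold and the logits below are illustrative assumptions:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def route(logits, threshold: float = 0.8) -> str:
    """Return 'edge' when the on-device model is confident, else escalate."""
    probs = softmax(np.asarray(logits, dtype=np.float64))
    return "edge" if probs.max() >= threshold else "cloud"

print(route([8.0, 1.0, 0.5]))  # confident -> handled on device
print(route([1.2, 1.0, 0.9]))  # ambiguous -> escalated to cloud
```

In practice the threshold is tuned against the cost of cloud calls and the accuracy gap between the small edge model and the large cloud model.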
Federated Learning: Train models across decentralized edge devices without centralizing data. Devices compute local updates; a server aggregates them. Enables learning from sensitive data.
- Data stays on device
- Distributed training
- Privacy-preserving
- Personalized models
- Communication efficiency
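The server-side aggregation step can be sketched as FedAvg, weighting each client's model by its local data size. The client vectors and sizes below are toy examples:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Aggregate client models, weighted by how much local data each saw."""
    total = sum(client_sizes)
    agg = np.zeros_like(client_weights[0], dtype=np.float64)
    for w, n in zip(client_weights, client_sizes):
        agg += (n / total) * w
    return agg

# Hypothetical: three devices with different amounts of local data
clients = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
sizes = [100, 100, 200]
print(fedavg(clients, sizes))  # [0.75 0.75]
```

Only these (possibly compressed) weight updates cross the network, which is why the data itself never leaves the device.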
Model Management: Manage the model lifecycle on edge devices: OTA updates, A/B testing, and rollback. Critical for improving models post-deployment without device recalls.
- Over-the-air updates
- Staged rollouts
- Version management
- Rollback capability
- Delta updates for efficiency
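Staged rollouts are often implemented by deterministic hash bucketing, sketched here with hypothetical device IDs. Hashing makes cohort assignment stable, so widening the percentage only ever adds devices:

```python
import hashlib

def rollout_cohort(device_id: str, percent: int) -> bool:
    """Deterministically assign a device to a staged-rollout cohort.

    Hashing the ID gives a stable bucket in [0, 100), so growing
    `percent` from 5 -> 25 -> 100 never flips an already-updated device.
    """
    bucket = int(hashlib.sha256(device_id.encode()).hexdigest(), 16) % 100
    return bucket < percent

devices = [f"device-{i}" for i in range(1000)]  # hypothetical fleet
share = sum(rollout_cohort(d, 10) for d in devices) / len(devices)
print(round(share, 2))  # close to 0.10 for a 10% staged rollout
```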
Edge AI Use Cases
| Domain | Applications | Requirements | Common Models |
|---|---|---|---|
| Smart Cameras | Object detection, face recognition, anomaly detection | Real-time (30fps), low latency | YOLO, MobileNet-SSD, RetinaFace |
| Autonomous Vehicles | Perception, planning, sensor fusion | Ultra-low latency, high reliability | PointPillars, BEVFormer, custom CNNs |
| Mobile Apps | Photo enhancement, voice assistants, AR filters | Battery efficiency, privacy | MobileNet, EfficientNet, Whisper.cpp |
| Industrial IoT | Predictive maintenance, quality inspection | 24/7 operation, rugged conditions | Anomaly detection, time series models |
| Wearables | Health monitoring, gesture recognition | Ultra-low power, small size | TinyML models, keyword spotting |
| Robotics | Navigation, manipulation, human interaction | Real-time, sensor integration | SLAM, grasping models, RL policies |
