Computer Vision
Vision Model Architectures
Convolutional Neural Networks (CNNs): The foundation of computer vision, using convolutional layers for spatial feature extraction. Classic architectures: LeNet, AlexNet, VGG (deep stacks), ResNet (skip connections), Inception (multi-scale). Strong inductive bias for images via local receptive fields and parameter sharing. State of the art until Vision Transformers; still efficient for many tasks and for edge deployment.
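The inductive biases mentioned above (local receptive fields, weight sharing) can be illustrated with a minimal NumPy convolution; `conv2d` here is an illustrative sketch, not a library API:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation: each output pixel sees only a
    local receptive field of the input, and the same kernel weights
    are shared across all spatial positions."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
edge = np.array([[1.0, -1.0]])      # horizontal difference kernel
result = conv2d(image, edge)        # responds to left-right gradients
```

The same two kernel weights produce every output pixel, which is exactly the parameter sharing that makes CNNs far smaller than fully connected networks on images.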
Vision Transformers (ViT): Apply the Transformer architecture to image patches treated as tokens. Self-attention captures long-range dependencies. Requires large datasets or pretrained models to match CNNs. State-of-the-art on ImageNet and downstream tasks. Variants: DeiT (distillation), Swin (hierarchical), BEiT (masked image modeling). Scales to massive models such as ViT-G.
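Patch tokenization, the step that turns an image into the token sequence self-attention operates on, can be sketched in NumPy. A 224×224 RGB image with 16×16 patches yields the 196 tokens of the standard ViT-B/16 layout; `image_to_patches` is a hypothetical helper:

```python
import numpy as np

def image_to_patches(image, patch):
    """Split an (H, W, C) image into flattened non-overlapping
    patches, the 'tokens' a Vision Transformer attends over."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    rows, cols = h // patch, w // patch
    return (image
            .reshape(rows, patch, cols, patch, c)
            .transpose(0, 2, 1, 3, 4)            # group by patch position
            .reshape(rows * cols, patch * patch * c))

img = np.random.rand(224, 224, 3)
tokens = image_to_patches(img, 16)   # 14 x 14 = 196 tokens of dim 768
```

In a real ViT each flattened patch is then linearly projected to the model dimension and a position embedding is added before the Transformer blocks.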
EfficientNet: Compound scaling of depth, width, and resolution for an optimal accuracy-efficiency tradeoff, with the base architecture discovered via Neural Architecture Search (NAS). The family spans B0 (small) to B7 (large); EfficientNetV2 adds faster training. Excellent for resource-constrained environments needing high accuracy per parameter.
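The compound-scaling rule can be written down directly: one coefficient φ scales depth, width, and resolution together. The α, β, γ values below are the grid-searched constants reported in the EfficientNet paper; the published B1-B7 configurations were additionally hand-tuned, so this sketch gives approximate rather than exact sizes:

```python
# Compound scaling: depth *= alpha**phi, width *= beta**phi,
# resolution *= gamma**phi, chosen so alpha * beta^2 * gamma^2 ~ 2
# (i.e., each +1 in phi roughly doubles FLOPs).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi, base_depth=1.0, base_width=1.0, base_res=224):
    depth = base_depth * ALPHA ** phi       # layer-count multiplier
    width = base_width * BETA ** phi        # channel-count multiplier
    resolution = round(base_res * GAMMA ** phi)  # input image size
    return depth, width, resolution

d, w, r = compound_scale(phi=1)   # roughly a B1-sized network
```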
Object Detection Architectures: Specialized architectures for localization plus classification. Two-stage: Faster R-CNN (region proposals + classifier), Cascade R-CNN. One-stage: YOLO (real-time), RetinaNet (focal loss), EfficientDet. Transformer-based: DETR (set prediction), Deformable DETR. The choice trades off speed against accuracy for the deployment scenario.
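These detectors share a common post-processing vocabulary: intersection-over-union (IoU) to measure box overlap, and, for most non-DETR models, non-maximum suppression (NMS) to prune duplicate detections. A minimal sketch of both:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that
    overlap it too much, repeat on the remainder."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        i = order[0]
        keep.append(i)
        order = [j for j in order[1:] if iou(boxes[i], boxes[j]) < iou_thresh]
    return keep
```

DETR-style models avoid NMS by predicting a fixed set of boxes matched one-to-one to ground truth, which is what "set prediction" refers to above.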
Segmentation Architectures: Pixel-level classification for semantic or instance segmentation. U-Net (encoder-decoder with skip connections) for medical imaging. DeepLab (atrous convolution, ASPP). Mask R-CNN (instance segmentation extending Faster R-CNN). Panoptic segmentation combines semantic and instance labels. Recent Transformer-based models: SegFormer, Mask2Former.
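Segmentation quality is usually reported as mean IoU over classes; a small NumPy sketch of the metric (illustrative, not any particular library's implementation):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for dense
    (per-pixel) predictions; classes absent from both maps are
    skipped rather than counted as IoU 0."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

pred   = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
```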
Self-Supervised Pretraining: Learn representations without labels via pretext tasks. Contrastive: SimCLR, MoCo (momentum contrast), SwAV (clustering). Masked image modeling: MAE (Masked Autoencoder), BEiT, SimMIM. Pretrain on large unlabeled datasets, then fine-tune on downstream tasks. Reduces annotation cost while achieving strong performance.
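The contrastive objective behind SimCLR-style methods (InfoNCE / NT-Xent) can be sketched in NumPy. This simplified version contrasts each image in augmented view A only against the batch in view B; real implementations also use the symmetric direction and in-view negatives:

```python
import numpy as np

def contrastive_loss(za, zb, temperature=0.1):
    """Simplified InfoNCE: two augmented views of the same image are
    a positive pair; every other image in the batch is a negative."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature              # (N, N) scaled similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # -log p(correct match)
```

The loss is minimized when each embedding is most similar to its own other view, which is what pulls augmentations of the same image together and pushes different images apart.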
Core Computer Vision Tasks
| Task | Description | Key Models | Output Format | Applications |
|---|---|---|---|---|
| Image Classification | Assign single or multiple labels to entire image | ResNet, EfficientNet, ViT, ConvNeXt | Class probabilities | Content moderation, medical diagnosis, product categorization |
| Object Detection | Locate and classify multiple objects with bounding boxes | YOLO, Faster R-CNN, DETR, EfficientDet | Boxes + class labels + confidence | Autonomous driving, surveillance, retail analytics |
| Semantic Segmentation | Classify every pixel into predefined categories | DeepLabV3+, SegFormer, U-Net, PSPNet | Pixel-wise class map | Autonomous driving, satellite imagery, medical imaging |
| Instance Segmentation | Detect and segment individual object instances | Mask R-CNN, YOLACT, SOLOv2, Mask2Former | Pixel masks per instance | Robotics, image editing, inventory counting |
| Pose Estimation | Detect keypoints for human or object pose | OpenPose, HRNet, ViTPose, MediaPipe | Keypoint coordinates | AR/VR, sports analysis, fitness apps, animation |
| Optical Character Recognition | Extract text from images with localization and recognition | Tesseract, EasyOCR, PaddleOCR, TrOCR | Text strings + bounding boxes | Document digitization, license plate reading, accessibility |
| Image Generation | Generate novel images from noise, text, or other images | GANs, Diffusion Models (Stable Diffusion), DALL-E | Generated images | Art creation, data augmentation, content creation |
| Anomaly Detection | Identify abnormal patterns or defects in images | Autoencoders, PatchCore, SPADE, FastFlow | Anomaly score + localization | Manufacturing QA, medical screening, fraud detection |
Computer Vision Frameworks
OpenCV: Industry-standard library for classical computer vision and image processing. 2500+ algorithms including filtering, transforms, feature detection, and camera calibration. C++ core with Python and Java bindings. CPU-optimized with optional GPU acceleration. DNN module for deep learning inference. Essential for preprocessing and traditional CV tasks.
MMDetection: Comprehensive toolbox for object detection and instance segmentation. 300+ models including YOLO, Faster R-CNN, DETR, and Mask R-CNN. Modular design for easy experimentation. Part of the OpenMMLab ecosystem (MMSegmentation, MMPose, etc.). PyTorch-based with distributed training support. Research- and production-ready.
Detectron2: PyTorch-based detection and segmentation framework from Meta AI. Implements Mask R-CNN, RetinaNet, DensePose, and Panoptic FPN. Flexible config system for model customization. Fast training with mixed precision and efficient ops. Strong baseline results and good documentation. A solid choice for research prototyping.
Hugging Face Transformers: Unified API for vision Transformers and multimodal models. Pretrained models: ViT, DETR, SegFormer, CLIP, LayoutLM. Integrates with the Datasets library for loading vision datasets. Trainer API for simplified training loops. Easy fine-tuning and deployment. Growing ecosystem of vision models.
Albumentations: Fast image augmentation library optimized for performance. Rich set of transforms for classification, detection, and segmentation. Pixel-level and spatial-level augmentations with a consistent API. Benchmarks show it substantially outperforming imgaug in speed. Seamless integration with PyTorch and TensorFlow. An industry standard for augmentation pipelines.
NVIDIA DALI: Data loading library for GPU-accelerated preprocessing. Portable pipelines for image decoding, augmentation, and format conversion. Reduces the CPU bottleneck in data loading during GPU training. Throughput-optimized operators for common CV tasks. Integrates with PyTorch, TensorFlow, and MXNet. Critical for large-scale training efficiency.
Vision Foundation Models
CLIP: Contrastive Language-Image Pretraining, aligning vision and language. Zero-shot image classification by comparing image embeddings to text descriptions. Trained on 400M image-text pairs from the web. Foundation for many multimodal applications: semantic image search, retrieval, and generation (DALL-E). Open-source alternative: OpenCLIP.
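Zero-shot classification with CLIP reduces to cosine similarity plus a softmax. The embeddings below are random stand-ins for the outputs of CLIP's image and text encoders, and the logit scale of 100 reflects where CLIP's learned temperature typically ends up:

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """CLIP-style zero-shot classification: cosine similarity between
    one image embedding and one text embedding per candidate label
    (e.g. 'a photo of a dog'), turned into class probabilities.
    Embeddings here are placeholders for real encoder outputs."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = 100.0 * text_embs @ image_emb     # scaled cosine similarities
    probs = np.exp(logits - logits.max())      # stable softmax
    return probs / probs.sum()
```

Because the label set is just a list of text prompts, classes can be changed at inference time without any retraining, which is what makes the approach "zero-shot".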
DINOv2: Self-supervised Vision Transformer trained on a curated dataset. Strong performance without labels via self-distillation. Excellent features for dense prediction tasks (segmentation, depth). Models from small up to giant (ViT-g/14). Works out of the box for many tasks without fine-tuning. Open-source and widely adopted for feature extraction.
SAM (Segment Anything Model): Meta's promptable segmentation model, trained on 1B+ masks. Zero-shot segmentation with point, box, or mask prompts. ViT-based image encoder with a lightweight mask decoder. Generalizes to unseen objects and domains. Enables interactive annotation tools and applications. Available in multiple sizes (ViT-B, L, H).
Grounding DINO: Open-set object detection with language grounding. Detects arbitrary objects described in natural language. Builds on the DINO detector (a DETR variant) with grounded language pretraining. Zero-shot detection without training on specific classes. Useful for flexible detection pipelines and annotation assistance. Pairs well with SAM for segmentation.
Depth Anything: Foundation model for monocular depth estimation. Zero-shot depth prediction for diverse scenes and domains. Trained with self-supervised and semi-supervised techniques on massive unlabeled data. Robust to different image types (indoor, outdoor, in-the-wild). Applications: 3D reconstruction, AR, robotics. Simple inference with strong generalization.
Stable Diffusion: Open-source text-to-image diffusion model operating in a latent space. Generates high-quality images from text prompts. ControlNet for conditional generation (pose, depth, edges). Supports inpainting, outpainting, and image-to-image translation. Fine-tunable with LoRA or DreamBooth for custom concepts. Widely deployed in creative tools and applications.
Data Augmentation Strategies
Basic Augmentations
- Random crop, resize, flip
- Color jitter (brightness, contrast, saturation)
- Rotation, translation, scaling
- Gaussian blur, noise injection
- Normalize with ImageNet statistics
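The last bullet is concrete enough to pin down: most pretrained backbones expect inputs standardized with the ImageNet channel statistics below. This sketch assumes the image is already a float array scaled to [0, 1]:

```python
import numpy as np

# Channel mean/std of the ImageNet training set, the convention used
# by most pretrained vision models.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD  = np.array([0.229, 0.224, 0.225])

def normalize(image):
    """Standardize an (H, W, 3) float image channel-wise so each
    channel is roughly zero-mean, unit-variance under ImageNet stats."""
    return (image - IMAGENET_MEAN) / IMAGENET_STD
```

Using the wrong normalization is a common silent bug when fine-tuning: the model still runs, but accuracy drops because inputs no longer match the pretraining distribution.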
Advanced Methods
- Cutout, CutMix, MixUp for regularization
- AutoAugment (learned augmentation policies)
- RandAugment (random magnitude augmentations)
- AugMax (adversarial augmentation)
- Test-time augmentation (TTA) for inference
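Of the regularizers above, MixUp is the simplest to write down: blend two examples and their one-hot labels with a Beta-distributed coefficient. A per-pair sketch (real pipelines apply this to whole batches):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """MixUp: convexly combine two training examples and their
    one-hot labels; training on blended pairs encourages the model
    to behave linearly between examples, which regularizes it."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)   # usually close to 0 or 1 for small alpha
    x = lam * x1 + (1 - lam) * x2
    y = lam * y1 + (1 - lam) * y2
    return x, y
```

CutMix follows the same label-blending idea but pastes a rectangular region of one image into the other instead of interpolating pixels.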
Synthetic Data
- GANs for generating training images
- Simulation environments (CARLA, AirSim)
- 3D rendering with domain randomization
- Copy-paste augmentation for detection
- Domain adaptation from synthetic to real
Domain-Specific
- Medical: elastic deformations, intensity shifts
- Satellite: multi-spectral band manipulation
- OCR: perspective transform, distortions
- Face: alignment, landmark-based warping
- Self-supervised: contrastive augmentations
Computer Vision Deployment
Edge & Mobile Deployment
- TensorFlow Lite: Optimized for mobile (Android, iOS) and embedded devices
- ONNX Runtime Mobile: Cross-platform inference engine
- Core ML: Apple's framework for iOS model deployment
- Optimization: Quantization (INT8), pruning, knowledge distillation
- Hardware: NPU, GPU acceleration on mobile chips
- Models: MobileNet, EfficientNet-Lite, SqueezeNet
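The quantization bullet can be made concrete: symmetric post-training INT8 quantization maps float weights to [-127, 127] with a single per-tensor scale. This sketches the arithmetic only; real toolchains also calibrate activation ranges and often quantize per-channel:

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: one scale maps the
    float range onto [-127, 127], shrinking storage 4x vs FP32."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from INT8 values."""
    return q.astype(np.float32) * scale

w = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)   # close to w, within half a quantization step
```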
Cloud Inference
- Serving: TensorFlow Serving, TorchServe, Triton Inference Server
- Batching: Dynamic batching for throughput optimization
- Auto-scaling: Kubernetes HPA based on request load
- GPU Utilization: Multi-model serving, MIG for GPU sharing
- APIs: REST, gRPC endpoints with load balancing
- Monitoring: Latency, throughput, GPU memory metrics
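The core idea of dynamic batching fits in a few lines: drain the request queue into fixed-size groups so one forward pass serves many requests. A toy synchronous sketch; real servers such as Triton also enforce a maximum queue delay so lone requests are not stuck waiting for a full batch:

```python
from collections import deque

def drain_batches(queue, max_batch=8):
    """Group pending requests into batches of at most max_batch,
    preserving arrival order; each batch becomes one model call,
    trading a little latency for much higher GPU throughput."""
    batches = []
    while queue:
        size = min(max_batch, len(queue))
        batches.append([queue.popleft() for _ in range(size)])
    return batches

pending = deque(range(20))          # 20 queued requests
batches = drain_batches(pending)    # grouped into sizes 8, 8, 4
```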
Real-Time Video Processing
- Streaming: RTSP, WebRTC for video input streams
- Frame Processing: Skip frames, adaptive resolution based on load
- Pipeline: Decode → Inference → Post-process → Encode
- Hardware: NVIDIA DeepStream, Intel OpenVINO for optimized pipelines
- Latency: Sub-100ms for interactive applications
- Use Cases: Surveillance, autonomous vehicles, AR/VR
Computer Vision Best Practices
Training Optimization
- Transfer learning from ImageNet or domain-specific pretrained models
- Progressive resizing: start with small images, increase size
- Mixed precision training (FP16) for faster training
- Learning rate schedules (cosine annealing, warmup)
- Gradient accumulation for large batch sizes on limited GPU
- Early stopping and model checkpointing based on validation metrics
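The warmup and cosine-annealing bullets compose into a single schedule; a minimal sketch with assumed base and minimum learning rates:

```python
import math

def lr_schedule(step, total_steps, warmup_steps, base_lr=1e-3, min_lr=1e-5):
    """Linear warmup to base_lr, then cosine annealing down to min_lr.
    base_lr/min_lr are illustrative defaults, not recommendations."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Warmup avoids unstable early updates at a high learning rate, while the cosine tail lets the model settle into a minimum near the end of training.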
Data Quality
- Class balance: oversample minorities or use weighted loss
- Active learning to label most informative samples
- Data versioning and lineage tracking (DVC, Pachyderm)
- Annotation quality checks and inter-annotator agreement
- Remove duplicates and near-duplicates from training set
- Stratified splits to ensure representative train/val/test sets
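For the class-balance bullet, inverse-frequency weighting is the standard recipe for a weighted loss (the same formula scikit-learn uses for "balanced" class weights). This sketch assumes every class appears at least once in the labels:

```python
import numpy as np

def class_weights(labels, num_classes):
    """Inverse-frequency class weights: n_samples / (n_classes * count_c).
    Rare classes get large weights; the weighted average over the
    dataset comes out to 1, so the overall loss scale is preserved."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    return counts.sum() / (num_classes * counts)
```

The resulting vector is typically passed as the per-class `weight` argument of a cross-entropy loss.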
Production Monitoring
- Input data drift detection (distribution shift from training)
- Prediction confidence thresholding and uncertainty estimation
- A/B testing new models against production baseline
- Error analysis: confusion matrix, failure case clustering
- Feedback loop: capture mispredictions for retraining
- Model performance degradation alerts and rollback mechanisms
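Input drift detection (the first bullet above) is often done per-feature with the population stability index; a NumPy sketch, noting that the usual thresholds (< 0.1 stable, > 0.25 drifted) are conventions rather than guarantees:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a training-time feature distribution ('expected')
    and live-traffic values ('actual'), using bin edges fitted on
    the training data. Larger values mean more distribution shift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)   # avoid log(0) on empty bins
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))
```

For images, the "feature" is usually a summary statistic (mean brightness, embedding components) rather than raw pixels.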
