Computer Vision
Vision Model Architectures
Convolutional Neural Networks (CNNs): The foundation of computer vision, using convolutional layers for spatial feature extraction. Classic architectures: LeNet, AlexNet, VGG (deep stacks), ResNet (skip connections), Inception (multi-scale). Strong inductive bias for images via local receptive fields and parameter sharing. State of the art until Vision Transformers; still efficient for many tasks and for edge deployment.
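The inductive biases mentioned above (local receptive fields, weight sharing) can be illustrated with a minimal NumPy convolution; `conv2d` here is an illustrative sketch, not a library API:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation: each output pixel sees only a
    local receptive field of the input, and the same kernel weights
    are shared across all spatial positions."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
edge = np.array([[1.0, -1.0]])      # horizontal difference kernel
result = conv2d(image, edge)        # responds to left-right gradients
```

The same two kernel weights produce every output pixel, which is exactly the parameter sharing that makes CNNs far smaller than fully connected networks on images.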
Vision Transformers (ViT): Apply the Transformer architecture to image patches treated as tokens. Self-attention captures long-range dependencies. Requires large datasets or pretrained models to match CNNs. State-of-the-art on ImageNet and downstream tasks. Variants: DeiT (distillation), Swin (hierarchical), BEiT (masked image modeling). Scales to massive models such as ViT-G.
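Patch tokenization, the step that turns an image into the token sequence self-attention operates on, can be sketched in NumPy. A 224×224 RGB image with 16×16 patches yields the 196 tokens of the standard ViT-B/16 layout; `image_to_patches` is a hypothetical helper:

```python
import numpy as np

def image_to_patches(image, patch):
    """Split an (H, W, C) image into flattened non-overlapping
    patches, the 'tokens' a Vision Transformer attends over."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    rows, cols = h // patch, w // patch
    return (image
            .reshape(rows, patch, cols, patch, c)
            .transpose(0, 2, 1, 3, 4)            # group by patch position
            .reshape(rows * cols, patch * patch * c))

img = np.random.rand(224, 224, 3)
tokens = image_to_patches(img, 16)   # 14 x 14 = 196 tokens of dim 768
```

In a real ViT each flattened patch is then linearly projected to the model dimension and a position embedding is added before the Transformer blocks.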
EfficientNet: Compound scaling of depth, width, and resolution for an optimal accuracy-efficiency tradeoff, with the base architecture discovered via Neural Architecture Search (NAS). The family spans B0 (small) to B7 (large); EfficientNetV2 adds faster training. Excellent for resource-constrained environments needing high accuracy per parameter.
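The compound-scaling rule can be written down directly: one coefficient φ scales depth, width, and resolution together. The α, β, γ values below are the grid-searched constants reported in the EfficientNet paper; the published B1-B7 configurations were additionally hand-tuned, so this sketch gives approximate rather than exact sizes:

```python
# Compound scaling: depth *= alpha**phi, width *= beta**phi,
# resolution *= gamma**phi, chosen so alpha * beta^2 * gamma^2 ~ 2
# (i.e., each +1 in phi roughly doubles FLOPs).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi, base_depth=1.0, base_width=1.0, base_res=224):
    depth = base_depth * ALPHA ** phi       # layer-count multiplier
    width = base_width * BETA ** phi        # channel-count multiplier
    resolution = round(base_res * GAMMA ** phi)  # input image size
    return depth, width, resolution

d, w, r = compound_scale(phi=1)   # roughly a B1-sized network
```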
Object Detection Architectures: Specialized architectures for localization plus classification. Two-stage: Faster R-CNN (region proposals + classifier), Cascade R-CNN. One-stage: YOLO (real-time), RetinaNet (focal loss), EfficientDet. Transformer-based: DETR (set prediction), Deformable DETR. The choice trades off speed against accuracy for the deployment scenario.
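These detectors share a common post-processing vocabulary: intersection-over-union (IoU) to measure box overlap, and, for most non-DETR models, non-maximum suppression (NMS) to prune duplicate detections. A minimal sketch of both:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that
    overlap it too much, repeat on the remainder."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        i = order[0]
        keep.append(i)
        order = [j for j in order[1:] if iou(boxes[i], boxes[j]) < iou_thresh]
    return keep
```

DETR-style models avoid NMS by predicting a fixed set of boxes matched one-to-one to ground truth, which is what "set prediction" refers to above.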
Segmentation Architectures: Pixel-level classification for semantic or instance segmentation. U-Net (encoder-decoder with skip connections) for medical imaging. DeepLab (atrous convolution, ASPP). Mask R-CNN (instance segmentation extending Faster R-CNN). Panoptic segmentation combines semantic and instance labels. Recent Transformer-based models: SegFormer, Mask2Former.
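Segmentation quality is usually reported as mean IoU over classes; a small NumPy sketch of the metric (illustrative, not any particular library's implementation):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for dense
    (per-pixel) predictions; classes absent from both maps are
    skipped rather than counted as IoU 0."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

pred   = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
```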
Self-Supervised Pretraining: Learn representations without labels via pretext tasks. Contrastive: SimCLR, MoCo (momentum contrast), SwAV (clustering). Masked image modeling: MAE (Masked Autoencoder), BEiT, SimMIM. Pretrain on large unlabeled datasets, then fine-tune on downstream tasks. Reduces annotation cost while achieving strong performance.
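The contrastive objective behind SimCLR-style methods (InfoNCE / NT-Xent) can be sketched in NumPy. This simplified version contrasts each image in augmented view A only against the batch in view B; real implementations also use the symmetric direction and in-view negatives:

```python
import numpy as np

def contrastive_loss(za, zb, temperature=0.1):
    """Simplified InfoNCE: two augmented views of the same image are
    a positive pair; every other image in the batch is a negative."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature              # (N, N) scaled similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # -log p(correct match)
```

The loss is minimized when each embedding is most similar to its own other view, which is what pulls augmentations of the same image together and pushes different images apart.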
Core Computer Vision Tasks
| Task | Description | Key Models | Output Format | Applications |
|---|---|---|---|---|
| Image Classification | Assign single or multiple labels to entire image | ResNet, EfficientNet, ViT, ConvNeXt | Class probabilities | Content moderation, medical diagnosis, product categorization |
| Object Detection | Locate and classify multiple objects with bounding boxes | YOLO, Faster R-CNN, DETR, EfficientDet | Boxes + class labels + confidence | Autonomous driving, surveillance, retail analytics |
| Semantic Segmentation | Classify every pixel into predefined categories | DeepLabV3+, SegFormer, U-Net, PSPNet | Pixel-wise class map | Autonomous driving, satellite imagery, medical imaging |
| Instance Segmentation | Detect and segment individual object instances | Mask R-CNN, YOLACT, SOLOv2, Mask2Former | Pixel masks per instance | Robotics, image editing, inventory counting |
| Pose Estimation | Detect keypoints for human or object pose | OpenPose, HRNet, ViTPose, MediaPipe | Keypoint coordinates | AR/VR, sports analysis, fitness apps, animation |
| Optical Character Recognition | Extract text from images with localization and recognition | Tesseract, EasyOCR, PaddleOCR, TrOCR | Text strings + bounding boxes | Document digitization, license plate reading, accessibility |
| Image Generation | Generate novel images from noise, text, or other images | GANs, Diffusion Models (Stable Diffusion), DALL-E | Generated images | Art creation, data augmentation, content creation |
| Anomaly Detection | Identify abnormal patterns or defects in images | Autoencoders, PatchCore, SPADE, FastFlow | Anomaly score + localization | Manufacturing QA, medical screening, fraud detection |
Computer Vision Frameworks
OpenCV: Industry-standard library for classical computer vision and image processing. 2500+ algorithms including filtering, transforms, feature detection, and camera calibration. C++ core with Python and Java bindings. CPU-optimized with optional GPU acceleration. DNN module for deep learning inference. Essential for preprocessing and traditional CV tasks.
MMDetection: Comprehensive toolbox for object detection and instance segmentation. 300+ models including YOLO, Faster R-CNN, DETR, and Mask R-CNN. Modular design for easy experimentation. Part of the OpenMMLab ecosystem (MMSegmentation, MMPose, etc.). PyTorch-based with distributed training support. Research- and production-ready.
Detectron2: PyTorch-based detection and segmentation framework from Meta AI. Implements Mask R-CNN, RetinaNet, DensePose, and Panoptic FPN. Flexible config system for model customization. Fast training with mixed precision and efficient ops. Strong baseline results and good documentation. A solid choice for research prototyping.
Hugging Face Transformers: Unified API for vision Transformers and multimodal models. Pretrained models: ViT, DETR, SegFormer, CLIP, LayoutLM. Integrates with the Datasets library for loading vision datasets. Trainer API for simplified training loops. Easy fine-tuning and deployment. Growing ecosystem of vision models.
Albumentations: Fast image augmentation library optimized for performance. Rich set of transforms for classification, detection, and segmentation. Pixel-level and spatial-level augmentations with a consistent API. Benchmarks show it substantially outperforming imgaug in speed. Seamless integration with PyTorch and TensorFlow. An industry standard for augmentation pipelines.
NVIDIA DALI: Data loading library for GPU-accelerated preprocessing. Portable pipelines for image decoding, augmentation, and format conversion. Reduces the CPU bottleneck in data loading during GPU training. Throughput-optimized operators for common CV tasks. Integrates with PyTorch, TensorFlow, and MXNet. Critical for large-scale training efficiency.
Vision Foundation Models
CLIP: Contrastive Language-Image Pretraining, aligning vision and language. Zero-shot image classification by comparing image embeddings to text descriptions. Trained on 400M image-text pairs from the web. Foundation for many multimodal applications: semantic image search, retrieval, and generation (DALL-E). Open-source alternative: OpenCLIP.
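Zero-shot classification with CLIP reduces to cosine similarity plus a softmax. The embeddings below are random stand-ins for the outputs of CLIP's image and text encoders, and the logit scale of 100 reflects where CLIP's learned temperature typically ends up:

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """CLIP-style zero-shot classification: cosine similarity between
    one image embedding and one text embedding per candidate label
    (e.g. 'a photo of a dog'), turned into class probabilities.
    Embeddings here are placeholders for real encoder outputs."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = 100.0 * text_embs @ image_emb     # scaled cosine similarities
    probs = np.exp(logits - logits.max())      # stable softmax
    return probs / probs.sum()
```

Because the label set is just a list of text prompts, classes can be changed at inference time without any retraining, which is what makes the approach "zero-shot".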
DINOv2: Self-supervised Vision Transformer trained on a curated dataset. Strong performance without labels via self-distillation. Excellent features for dense prediction tasks (segmentation, depth). Models from small up to giant (ViT-g/14). Works out of the box for many tasks without fine-tuning. Open-source and widely adopted for feature extraction.
SAM (Segment Anything Model): Meta's promptable segmentation model, trained on 1B+ masks. Zero-shot segmentation with point, box, or mask prompts. ViT-based image encoder with a lightweight mask decoder. Generalizes to unseen objects and domains. Enables interactive annotation tools and applications. Available in multiple sizes (ViT-B, L, H).
Grounding DINO: Open-set object detection with language grounding. Detects arbitrary objects described in natural language. Builds on the DINO detector (a DETR variant) with grounded language pretraining. Zero-shot detection without training on specific classes. Useful for flexible detection pipelines and annotation assistance. Pairs well with SAM for segmentation.
Depth Anything: Foundation model for monocular depth estimation. Zero-shot depth prediction for diverse scenes and domains. Trained with self-supervised and semi-supervised techniques on massive unlabeled data. Robust to different image types (indoor, outdoor, in-the-wild). Applications: 3D reconstruction, AR, robotics. Simple inference with strong generalization.
Stable Diffusion: Open-source text-to-image diffusion model operating in a latent space. Generates high-quality images from text prompts. ControlNet for conditional generation (pose, depth, edges). Supports inpainting, outpainting, and image-to-image translation. Fine-tunable with LoRA or DreamBooth for custom concepts. Widely deployed in creative tools and applications.
Data Augmentation Strategies
Basic Augmentations
- Random crop, resize, flip
- Color jitter (brightness, contrast, saturation)
- Rotation, translation, scaling
- Gaussian blur, noise injection
- Normalize with ImageNet statistics
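The last bullet is concrete enough to pin down: most pretrained backbones expect inputs standardized with the ImageNet channel statistics below. This sketch assumes the image is already a float array scaled to [0, 1]:

```python
import numpy as np

# Channel mean/std of the ImageNet training set, the convention used
# by most pretrained vision models.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD  = np.array([0.229, 0.224, 0.225])

def normalize(image):
    """Standardize an (H, W, 3) float image channel-wise so each
    channel is roughly zero-mean, unit-variance under ImageNet stats."""
    return (image - IMAGENET_MEAN) / IMAGENET_STD
```

Using the wrong normalization is a common silent bug when fine-tuning: the model still runs, but accuracy drops because inputs no longer match the pretraining distribution.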
Advanced Methods
- Cutout, CutMix, MixUp for regularization
- AutoAugment (learned augmentation policies)
- RandAugment (random magnitude augmentations)
- AugMax (adversarial augmentation)
- Test-time augmentation (TTA) for inference
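Of the regularizers above, MixUp is the simplest to write down: blend two examples and their one-hot labels with a Beta-distributed coefficient. A per-pair sketch (real pipelines apply this to whole batches):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """MixUp: convexly combine two training examples and their
    one-hot labels; training on blended pairs encourages the model
    to behave linearly between examples, which regularizes it."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)   # usually close to 0 or 1 for small alpha
    x = lam * x1 + (1 - lam) * x2
    y = lam * y1 + (1 - lam) * y2
    return x, y
```

CutMix follows the same label-blending idea but pastes a rectangular region of one image into the other instead of interpolating pixels.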
Synthetic Data
- GANs for generating training images
- Simulation environments (CARLA, AirSim)
- 3D rendering with domain randomization
- Copy-paste augmentation for detection
- Domain adaptation from synthetic to real
Domain-Specific
- Medical: elastic deformations, intensity shifts
- Satellite: multi-spectral band manipulation
- OCR: perspective transform, distortions
- Face: alignment, landmark-based warping
- Self-supervised: contrastive augmentations
Computer Vision Deployment
Edge & Mobile Deployment
- TensorFlow Lite: Optimized for mobile (Android, iOS) and embedded devices
- ONNX Runtime Mobile: Cross-platform inference engine
- Core ML: Apple's framework for iOS model deployment
- Optimization: Quantization (INT8), pruning, knowledge distillation
- Hardware: NPU, GPU acceleration on mobile chips
- Models: MobileNet, EfficientNet-Lite, SqueezeNet
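The quantization bullet can be made concrete: symmetric post-training INT8 quantization maps float weights to [-127, 127] with a single per-tensor scale. This sketches the arithmetic only; real toolchains also calibrate activation ranges and often quantize per-channel:

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: one scale maps the
    float range onto [-127, 127], shrinking storage 4x vs FP32."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from INT8 values."""
    return q.astype(np.float32) * scale

w = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)   # close to w, within half a quantization step
```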
Cloud Inference
- Serving: TensorFlow Serving, TorchServe, Triton Inference Server
- Batching: Dynamic batching for throughput optimization
- Auto-scaling: Kubernetes HPA based on request load
- GPU Utilization: Multi-model serving, MIG for GPU sharing
- APIs: REST, gRPC endpoints with load balancing
- Monitoring: Latency, throughput, GPU memory metrics
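The core idea of dynamic batching fits in a few lines: drain the request queue into fixed-size groups so one forward pass serves many requests. A toy synchronous sketch; real servers such as Triton also enforce a maximum queue delay so lone requests are not stuck waiting for a full batch:

```python
from collections import deque

def drain_batches(queue, max_batch=8):
    """Group pending requests into batches of at most max_batch,
    preserving arrival order; each batch becomes one model call,
    trading a little latency for much higher GPU throughput."""
    batches = []
    while queue:
        size = min(max_batch, len(queue))
        batches.append([queue.popleft() for _ in range(size)])
    return batches

pending = deque(range(20))          # 20 queued requests
batches = drain_batches(pending)    # grouped into sizes 8, 8, 4
```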
Real-Time Video Processing
- Streaming: RTSP, WebRTC for video input streams
- Frame Processing: Skip frames, adaptive resolution based on load
- Pipeline: Decode → Inference → Post-process → Encode
- Hardware: NVIDIA DeepStream, Intel OpenVINO for optimized pipelines
- Latency: Sub-100ms for interactive applications
- Use Cases: Surveillance, autonomous vehicles, AR/VR
Computer Vision Best Practices
Training Optimization
- Transfer learning from ImageNet or domain-specific pretrained models
- Progressive resizing: start with small images, increase size
- Mixed precision training (FP16) for faster training
- Learning rate schedules (cosine annealing, warmup)
- Gradient accumulation for large batch sizes on limited GPU
- Early stopping and model checkpointing based on validation metrics
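The warmup and cosine-annealing bullets compose into a single schedule; a minimal sketch with assumed base and minimum learning rates:

```python
import math

def lr_schedule(step, total_steps, warmup_steps, base_lr=1e-3, min_lr=1e-5):
    """Linear warmup to base_lr, then cosine annealing down to min_lr.
    base_lr/min_lr are illustrative defaults, not recommendations."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Warmup avoids unstable early updates at a high learning rate, while the cosine tail lets the model settle into a minimum near the end of training.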
Data Quality
- Class balance: oversample minorities or use weighted loss
- Active learning to label most informative samples
- Data versioning and lineage tracking (DVC, Pachyderm)
- Annotation quality checks and inter-annotator agreement
- Remove duplicates and near-duplicates from training set
- Stratified splits to ensure representative train/val/test sets
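For the class-balance bullet, inverse-frequency weighting is the standard recipe for a weighted loss (the same formula scikit-learn uses for "balanced" class weights). This sketch assumes every class appears at least once in the labels:

```python
import numpy as np

def class_weights(labels, num_classes):
    """Inverse-frequency class weights: n_samples / (n_classes * count_c).
    Rare classes get large weights; the weighted average over the
    dataset comes out to 1, so the overall loss scale is preserved."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    return counts.sum() / (num_classes * counts)
```

The resulting vector is typically passed as the per-class `weight` argument of a cross-entropy loss.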
Production Monitoring
- Input data drift detection (distribution shift from training)
- Prediction confidence thresholding and uncertainty estimation
- A/B testing new models against production baseline
- Error analysis: confusion matrix, failure case clustering
- Feedback loop: capture mispredictions for retraining
- Model performance degradation alerts and rollback mechanisms
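Input drift detection (the first bullet above) is often done per-feature with the population stability index; a NumPy sketch, noting that the usual thresholds (< 0.1 stable, > 0.25 drifted) are conventions rather than guarantees:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a training-time feature distribution ('expected')
    and live-traffic values ('actual'), using bin edges fitted on
    the training data. Larger values mean more distribution shift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)   # avoid log(0) on empty bins
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))
```

For images, the "feature" is usually a summary statistic (mean brightness, embedding components) rather than raw pixels.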
