Neural Network Algorithms
Optimization Algorithms
Gradient Descent
An iterative optimization algorithm that follows the negative gradient of a loss function to find its minimum. The learning rate is crucial for convergence, and batch processing enables efficient computation on large datasets.
- Iterative optimization approach
- Learning rate crucial for convergence
- Batch processing for efficiency
- Can get stuck in local minima
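As a minimal sketch, the update rule x ← x − lr·∇f(x) on a one-dimensional quadratic (the function, learning rate, and step count here are illustrative, not part of the notes):

```python
def gradient_descent(grad_fn, x0, lr=0.1, steps=100):
    """Minimize a function by repeatedly stepping against its gradient."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad_fn(x)  # update rule: x <- x - lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

With a learning rate that is too large the iterates diverge; too small and convergence is slow, which is why the notes call the learning rate crucial.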
Stochastic Gradient Descent (SGD)
Updates model parameters using single samples rather than the entire dataset. Faster iterations and noisy updates help escape local minima, making it well suited to online learning.
- Single sample updates per iteration
- Faster iterations than batch GD
- Noisy updates help escape local minima
- Enables online learning
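A sketch of the single-sample update on a toy linear model (the synthetic data, learning rate, and epoch count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2*x + 1 plus a little noise.
X = rng.uniform(-1, 1, size=200)
y = 2 * X + 1 + 0.01 * rng.normal(size=200)

w, b, lr = 0.0, 0.0, 0.1
for epoch in range(20):
    for i in rng.permutation(len(X)):   # shuffle each epoch
        pred = w * X[i] + b
        err = pred - y[i]               # gradient of the loss 0.5 * err^2
        w -= lr * err * X[i]            # update from a single sample
        b -= lr * err
```

Each update uses one sample's gradient, so the trajectory is noisy, but the parameters still drift toward the true values (2 and 1 here).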
Adam
Combines the benefits of momentum and RMSProp, maintaining an adaptive learning rate for each parameter. The most popular optimizer in deep learning, it works well with default hyperparameters across diverse problems.
- Adaptive learning rates per parameter
- Combines momentum and RMSProp
- Most popular optimizer in practice
- Works well with default settings
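A sketch of a single Adam update showing both moment estimates and the bias correction (the toy quadratic objective and step count are illustrative):

```python
import numpy as np

def adam_step(grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; returns (parameter delta, new m, new v)."""
    m = b1 * m + (1 - b1) * grad          # first moment (momentum-style)
    v = b2 * v + (1 - b2) * grad ** 2     # second moment (RMSProp-style)
    m_hat = m / (1 - b1 ** t)             # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    return -lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize f(x) = x^2 starting from x = 5.
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):                  # t starts at 1 for bias correction
    delta, m, v = adam_step(2 * x, m, v, t, lr=0.05)
    x += delta
```

The per-parameter division by the second-moment estimate is what gives each parameter its own effective learning rate.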
RMSProp
An adaptive optimizer that divides the learning rate by a moving average of recent gradient magnitudes. Particularly effective for recurrent neural networks, and it handles sparse gradients well.
- Adaptive learning rate adjustment
- Especially good for RNNs
- Divides by moving average of gradients
- Handles sparse gradients effectively
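The moving-average division can be sketched as follows (the toy objective and hyperparameters are illustrative):

```python
import numpy as np

def rmsprop_step(grad, s, lr=0.01, rho=0.9, eps=1e-8):
    """One RMSProp update; s is the moving average of squared gradients."""
    s = rho * s + (1 - rho) * grad ** 2   # track recent gradient magnitude
    return -lr * grad / (np.sqrt(s) + eps), s

# Minimize f(x) = x^2 starting from x = 5.
x, s = 5.0, 0.0
for _ in range(3000):
    delta, s = rmsprop_step(2 * x, s, lr=0.01)
    x += delta
```

Because the step is normalized by recent gradient size, parameters with rare (sparse) but large gradients still receive usefully sized updates.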
Backpropagation & Training
Backpropagation
The foundation of deep learning: it uses the chain rule to compute gradients efficiently through all network layers. This reverse-mode automatic differentiation is what makes training deep networks tractable.
- Uses chain rule for gradient computation
- Reverse-mode automatic differentiation
- Efficient gradient computation
- Foundation of deep learning
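The chain rule applied in reverse can be traced by hand on a tiny two-layer network (the specific weights and inputs are illustrative):

```python
# Forward pass through a tiny net: y = w2 * relu(w1 * x), loss = 0.5 * y^2
x, w1, w2 = 1.5, 0.8, -0.5
h = w1 * x
a = max(h, 0.0)                        # ReLU activation
y = w2 * a
loss = 0.5 * y ** 2

# Backward pass: chain rule applied layer by layer, output to input.
dy = y                                 # dL/dy
dw2 = dy * a                           # dL/dw2
da = dy * w2                           # dL/da (propagate through w2)
dh = da * (1.0 if h > 0 else 0.0)      # through the ReLU's derivative
dw1 = dh * x                           # dL/dw1
```

Each local derivative is cheap; reverse mode composes them so one backward sweep yields gradients for every parameter at once.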
Mini-Batch Gradient Descent
Balances the speed of SGD with the stability of full-batch gradient descent, using typical batch sizes of 32-256. It makes good use of GPU parallelism while generalizing better than full-batch training.
- Balances speed and accuracy
- Typical sizes between 32-256
- Optimized for GPU computation
- Better generalization than full-batch
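A sketch of the batching loop on a small linear-regression problem (data, batch size of 64, and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(512, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w, lr, batch = np.zeros(3), 0.1, 64           # batch size in the 32-256 range
for epoch in range(50):
    idx = rng.permutation(len(X))             # reshuffle each epoch
    for start in range(0, len(X), batch):
        b = idx[start:start + batch]
        grad = X[b].T @ (X[b] @ w - y[b]) / len(b)  # mean gradient over batch
        w -= lr * grad
```

Averaging the gradient over a batch smooths the noise of single-sample SGD while each update still touches only a fraction of the data.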
Learning Rate Scheduling
Dynamically adjusts the learning rate during training using strategies such as step, exponential, or cosine decay. Smaller steps late in training improve convergence and avoid overshooting the optimum.
- Decays learning rate over time
- Step, exponential, or cosine strategies
- Improves convergence quality
- Avoids overshooting minimum
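The three strategies named above can each be written as a one-line schedule (the decay constants chosen here are illustrative defaults):

```python
import math

def step_decay(lr0, epoch, drop=0.5, every=10):
    """Halve the learning rate every 10 epochs."""
    return lr0 * drop ** (epoch // every)

def exponential_decay(lr0, epoch, k=0.05):
    """Smooth exponential decay with rate k."""
    return lr0 * math.exp(-k * epoch)

def cosine_decay(lr0, epoch, total=100):
    """Anneal from lr0 down to 0 along a half cosine."""
    return 0.5 * lr0 * (1 + math.cos(math.pi * epoch / total))
```

Step decay gives abrupt drops, exponential a steady shrink, and cosine a slow start and gentle finish; all three end training with the small steps the notes describe.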
Activation Functions
ReLU (Rectified Linear Unit)
The most common activation function, defined as max(0, x). Computationally efficient and produces sparse activations, though it can suffer from the dead-neuron problem, where a neuron permanently outputs zero.
- Defined as max(0,x)
- Most common in hidden layers
- Can suffer dead neuron problem
- Computationally efficient with sparse activation
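The definition and its derivative in a few lines:

```python
import numpy as np

def relu(x):
    """max(0, x) applied element-wise."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative: 1 where x > 0, else 0 (the source of dead neurons)."""
    return (x > 0).astype(float)

out = relu(np.array([-2.0, 0.0, 3.0]))
```

A neuron whose pre-activation stays negative gets zero gradient from `relu_grad` and can never recover, which is exactly the dead-neuron problem noted above.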
Sigmoid
Squashes its input to the range (0, 1), producing smooth, differentiable outputs. Commonly used in output layers for binary classification, but it suffers from vanishing gradients in deep networks.
- Output range from 0 to 1
- Vanishing gradient problem in deep nets
- Common in output layer for binary tasks
- Smooth and differentiable
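The function and the derivative that causes the vanishing-gradient issue:

```python
import numpy as np

def sigmoid(x):
    """Squash input into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # never exceeds 0.25, so stacked layers shrink gradients
```

Because the derivative peaks at 0.25, multiplying it across many layers drives gradients toward zero, which is why the notes flag sigmoid for deep networks.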
Tanh
Outputs values in the range (-1, 1), providing zero-centered activations that generally work better than sigmoid in hidden layers. It still suffers from vanishing gradients, but less severely than sigmoid in practice.
- Output range from -1 to 1
- Zero-centered outputs
- Better than sigmoid for hidden layers
- Still has vanishing gradient issue
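Tanh is built into NumPy; only its derivative needs writing out:

```python
import numpy as np

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - np.tanh(x) ** 2   # peaks at 1 (vs sigmoid's 0.25)
```

The derivative reaches 1 at x = 0, four times sigmoid's maximum, which is why tanh's vanishing-gradient problem is milder even though both saturate at the tails.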
Softmax
Converts a vector of scores into a probability distribution that sums to 1. Essential for multi-class classification, where it normalizes outputs to represent class probabilities.
- Creates probability distribution
- Used in multi-class output layer
- Common in classification tasks
- Normalizes outputs to sum to 1
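A standard implementation, with the usual max-subtraction trick for numerical stability (the example scores are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)      # shift for numerical stability; result is unchanged
    e = np.exp(z)
    return e / e.sum()     # normalize so the outputs sum to 1

p = softmax(np.array([2.0, 1.0, 0.1]))
```

The largest score gets the largest probability, and exponentiation keeps every class probability strictly positive.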
Regularization Techniques
Dropout
Randomly drops neurons during training with a specified probability, preventing overfitting through an implicit ensemble effect. It is applied only during training; all neurons are active at inference.
- Randomly drops neurons during training
- Prevents overfitting effectively
- Creates ensemble effect
- Only applied during training phase
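A sketch using the common "inverted dropout" formulation, which rescales at training time so inference needs no change (an implementation choice assumed here, not stated in the notes):

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each unit with probability p during training."""
    if not training:
        return x                        # all neurons active at inference
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(x.shape) >= p     # keep each neuron with probability 1-p
    return x * mask / (1.0 - p)         # rescale so the expected value matches
```

Each training step effectively samples a different thinned sub-network, which is the ensemble effect the notes mention.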
L1/L2 Regularization
Adds weight penalties to the loss function: L1 encourages sparsity by pushing weights to zero, while L2 encourages small weights. Both prevent overfitting by constraining model complexity.
- Adds weight penalty to loss
- L1 for sparsity, L2 for small weights
- Prevents overfitting
- Constrains model complexity
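Both penalties are just extra terms on the loss (the coefficients and weights below are illustrative):

```python
import numpy as np

def regularized_loss(base_loss, w, l1=0.0, l2=0.0):
    """Add an L1 (sparsity) and/or L2 (shrinkage) penalty to a base loss."""
    return base_loss + l1 * np.abs(w).sum() + l2 * (w ** 2).sum()

w = np.array([0.5, -1.0, 2.0])
loss = regularized_loss(1.0, w, l1=0.1, l2=0.01)
```

The L1 term's gradient has constant magnitude, so it can drive small weights exactly to zero; the L2 term's gradient shrinks with the weight, so it merely keeps weights small.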
Batch Normalization
Normalizes layer inputs across each mini-batch, reducing internal covariate shift. It enables faster training with higher learning rates and also acts as a regularizer, reducing the need for dropout.
- Normalizes inputs across mini-batch
- Enables faster training
- Reduces internal covariate shift
- Acts as implicit regularizer
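A sketch of the training-time normalization (the learnable scale `gamma` and shift `beta` are left at their defaults; running statistics for inference are omitted):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mu = x.mean(axis=0)                 # per-feature mean over the batch
    var = x.var(axis=0)                 # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.array([[1.0, 100.0], [3.0, 300.0], [5.0, 500.0]])
out = batch_norm(x)
```

After normalization both features have mean 0 and unit variance regardless of their original scale, which is what lets later layers tolerate higher learning rates.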
Data Augmentation
Creates synthetic training data through domain-specific transformations, effectively expanding the training set. It prevents overfitting and improves generalization by exposing the model to more variation.
- Generates synthetic training data
- Prevents overfitting
- Domain-specific transformations
- Improves model generalization
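A sketch of two common image-domain transformations, random horizontal flips and brightness jitter (the specific transforms and ranges are illustrative choices):

```python
import numpy as np

def augment_image(img, rng):
    """Randomly flip and brightness-jitter an image with values in [0, 1]."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                 # horizontal flip
    img = img * rng.uniform(0.8, 1.2)      # brightness scaling
    return np.clip(img, 0.0, 1.0)          # stay in the valid pixel range

rng = np.random.default_rng(0)
batch = [augment_image(np.full((4, 4), 0.5), rng) for _ in range(8)]
```

Each pass over the data sees slightly different versions of every image, so the effective training set grows without collecting new labels.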
Advanced Algorithms
Attention Mechanism
Lets models focus on the most relevant parts of the input using query-key-value operations. The foundation of transformer architectures, where self-attention enables state-of-the-art performance in NLP and vision.
- Focuses on relevant input parts
- Foundation of transformer models
- Query-key-value operation
- Self-attention for sequence modeling
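The query-key-value operation is scaled dot-product attention, softmax(QKᵀ/√d)·V; a sketch with a single query (the toy matrices are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V: each query attends over all keys."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # each row sums to 1
    return weights @ V, weights                       # weighted sum of values

Q = np.array([[1.0, 0.0]])                            # one query
K = np.array([[1.0, 0.0], [0.0, 1.0]])                # two keys
V = np.array([[10.0, 0.0], [0.0, 10.0]])              # two values
out, attn = scaled_dot_product_attention(Q, K, V)
```

The query matches the first key more strongly, so the output is pulled toward the first value; in self-attention, Q, K, and V are all projections of the same sequence.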
Convolution
Applies learnable filters with local connectivity and parameter sharing to detect spatial patterns. Stacked convolutions build hierarchical feature representations, forming the foundation of convolutional neural networks.
- Local connectivity patterns
- Parameter sharing across space
- Builds spatial hierarchies
- Foundation of CNNs
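A naive sketch of a 2-D convolution (implemented, as deep-learning libraries do, as cross-correlation; the edge-detecting kernel is an illustrative example):

```python
import numpy as np

def conv2d(img, kernel):
    """Valid cross-correlation: slide one shared kernel over the image."""
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):                       # local connectivity: each
            patch = img[i:i + kh, j:j + kw]       # output sees a small patch
            out[i, j] = (patch * kernel).sum()    # same kernel everywhere
    return out

edge = np.array([[1.0, -1.0]])                    # responds to intensity changes
img = np.array([[0.0, 0.0, 1.0, 1.0]])
result = conv2d(img, edge)
```

The same two weights are reused at every position (parameter sharing), and the output fires only where the image changes, illustrating pattern detection.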
Recurrent Neural Networks (RNNs)
Maintain a hidden-state memory across time steps to process sequential data. LSTM and GRU variants add gating mechanisms to capture long-term dependencies in temporal sequences.
- Hidden state maintains memory
- Processes sequential data
- LSTM/GRU use gating mechanisms
- Captures time dependencies
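A sketch of the vanilla recurrence h_t = tanh(Wx·x_t + Wh·h_{t-1} + b), without the LSTM/GRU gating (the random weights and sequence are illustrative):

```python
import numpy as np

def rnn_forward(xs, Wx, Wh, b):
    """Vanilla RNN: fold a sequence into a hidden state, step by step."""
    h = np.zeros(Wh.shape[0])
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h + b)   # hidden state carries memory forward
    return h

rng = np.random.default_rng(0)
Wx = rng.normal(size=(4, 3))               # input-to-hidden weights
Wh = rng.normal(size=(4, 4))               # hidden-to-hidden (recurrent) weights
b = np.zeros(4)
h = rnn_forward([rng.normal(size=3) for _ in range(5)], Wx, Wh, b)
```

The same weights are applied at every time step, and the final hidden state summarizes the whole sequence; LSTMs and GRUs replace the plain tanh update with gated updates to preserve information over longer spans.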
