Neural Network Algorithms
Optimization Algorithms
Gradient Descent
An iterative optimization algorithm that follows the negative gradient of a loss function to find its minimum. The learning rate is crucial for convergence, and batch processing enables efficient computation on large datasets.
- Iterative optimization approach
- Learning rate crucial for convergence
- Batch processing for efficiency
- Can get stuck in local minima
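As a minimal sketch, the update rule x ← x − lr·∇f(x) on a one-dimensional quadratic (the function, learning rate, and step count here are illustrative, not part of the notes):

```python
def gradient_descent(grad_fn, x0, lr=0.1, steps=100):
    """Minimize a function by repeatedly stepping against its gradient."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad_fn(x)  # update rule: x <- x - lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

With a learning rate that is too large the iterates diverge; too small and convergence is slow, which is why the notes call the learning rate crucial.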
Stochastic Gradient Descent (SGD)
Updates model parameters using single samples rather than the entire dataset. Faster iterations and noisy updates help escape local minima, making it well suited to online learning.
- Single sample updates per iteration
- Faster iterations than batch GD
- Noisy updates help escape local minima
- Enables online learning
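A sketch of the single-sample update on a toy linear model (the synthetic data, learning rate, and epoch count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2*x + 1 plus a little noise.
X = rng.uniform(-1, 1, size=200)
y = 2 * X + 1 + 0.01 * rng.normal(size=200)

w, b, lr = 0.0, 0.0, 0.1
for epoch in range(20):
    for i in rng.permutation(len(X)):   # shuffle each epoch
        pred = w * X[i] + b
        err = pred - y[i]               # gradient of the loss 0.5 * err^2
        w -= lr * err * X[i]            # update from a single sample
        b -= lr * err
```

Each update uses one sample's gradient, so the trajectory is noisy, but the parameters still drift toward the true values (2 and 1 here).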
Adam
Combines the benefits of momentum and RMSProp, maintaining an adaptive learning rate for each parameter. The most popular optimizer in deep learning, it works well with default hyperparameters across diverse problems.
- Adaptive learning rates per parameter
- Combines momentum and RMSProp
- Most popular optimizer in practice
- Works well with default settings
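A sketch of a single Adam update showing both moment estimates and the bias correction (the toy quadratic objective and step count are illustrative):

```python
import numpy as np

def adam_step(grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; returns (parameter delta, new m, new v)."""
    m = b1 * m + (1 - b1) * grad          # first moment (momentum-style)
    v = b2 * v + (1 - b2) * grad ** 2     # second moment (RMSProp-style)
    m_hat = m / (1 - b1 ** t)             # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    return -lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize f(x) = x^2 starting from x = 5.
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):                  # t starts at 1 for bias correction
    delta, m, v = adam_step(2 * x, m, v, t, lr=0.05)
    x += delta
```

The per-parameter division by the second-moment estimate is what gives each parameter its own effective learning rate.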
RMSProp
An adaptive optimizer that divides the learning rate by a moving average of recent gradient magnitudes. Particularly effective for recurrent neural networks, and it handles sparse gradients well.
- Adaptive learning rate adjustment
- Especially good for RNNs
- Divides by moving average of gradients
- Handles sparse gradients effectively
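The moving-average division can be sketched as follows (the toy objective and hyperparameters are illustrative):

```python
import numpy as np

def rmsprop_step(grad, s, lr=0.01, rho=0.9, eps=1e-8):
    """One RMSProp update; s is the moving average of squared gradients."""
    s = rho * s + (1 - rho) * grad ** 2   # track recent gradient magnitude
    return -lr * grad / (np.sqrt(s) + eps), s

# Minimize f(x) = x^2 starting from x = 5.
x, s = 5.0, 0.0
for _ in range(3000):
    delta, s = rmsprop_step(2 * x, s, lr=0.01)
    x += delta
```

Because the step is normalized by recent gradient size, parameters with rare (sparse) but large gradients still receive usefully sized updates.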
Backpropagation & Training
Backpropagation
The foundation of deep learning: it uses the chain rule to compute gradients efficiently through all network layers. This reverse-mode automatic differentiation is what makes training deep networks tractable.
- Uses chain rule for gradient computation
- Reverse-mode automatic differentiation
- Efficient gradient computation
- Foundation of deep learning
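The chain rule applied in reverse can be traced by hand on a tiny two-layer network (the specific weights and inputs are illustrative):

```python
# Forward pass through a tiny net: y = w2 * relu(w1 * x), loss = 0.5 * y^2
x, w1, w2 = 1.5, 0.8, -0.5
h = w1 * x
a = max(h, 0.0)                        # ReLU activation
y = w2 * a
loss = 0.5 * y ** 2

# Backward pass: chain rule applied layer by layer, output to input.
dy = y                                 # dL/dy
dw2 = dy * a                           # dL/dw2
da = dy * w2                           # dL/da (propagate through w2)
dh = da * (1.0 if h > 0 else 0.0)      # through the ReLU's derivative
dw1 = dh * x                           # dL/dw1
```

Each local derivative is cheap; reverse mode composes them so one backward sweep yields gradients for every parameter at once.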
Mini-Batch Gradient Descent
Balances the speed of SGD with the stability of full-batch gradient descent, using typical batch sizes of 32-256. It makes good use of GPU parallelism while generalizing better than full-batch training.
- Balances speed and accuracy
- Typical sizes between 32-256
- Optimized for GPU computation
- Better generalization than full-batch
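A sketch of the batching loop on a small linear-regression problem (data, batch size of 64, and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(512, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w, lr, batch = np.zeros(3), 0.1, 64           # batch size in the 32-256 range
for epoch in range(50):
    idx = rng.permutation(len(X))             # reshuffle each epoch
    for start in range(0, len(X), batch):
        b = idx[start:start + batch]
        grad = X[b].T @ (X[b] @ w - y[b]) / len(b)  # mean gradient over batch
        w -= lr * grad
```

Averaging the gradient over a batch smooths the noise of single-sample SGD while each update still touches only a fraction of the data.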
Learning Rate Scheduling
Dynamically adjusts the learning rate during training using strategies such as step, exponential, or cosine decay. Smaller steps late in training improve convergence and avoid overshooting the optimum.
- Decays learning rate over time
- Step, exponential, or cosine strategies
- Improves convergence quality
- Avoids overshooting minimum
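The three strategies named above can each be written as a one-line schedule (the decay constants chosen here are illustrative defaults):

```python
import math

def step_decay(lr0, epoch, drop=0.5, every=10):
    """Halve the learning rate every 10 epochs."""
    return lr0 * drop ** (epoch // every)

def exponential_decay(lr0, epoch, k=0.05):
    """Smooth exponential decay with rate k."""
    return lr0 * math.exp(-k * epoch)

def cosine_decay(lr0, epoch, total=100):
    """Anneal from lr0 down to 0 along a half cosine."""
    return 0.5 * lr0 * (1 + math.cos(math.pi * epoch / total))
```

Step decay gives abrupt drops, exponential a steady shrink, and cosine a slow start and gentle finish; all three end training with the small steps the notes describe.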
Activation Functions
ReLU (Rectified Linear Unit)
The most common activation function, defined as max(0, x). Computationally efficient and produces sparse activations, though it can suffer from the dead-neuron problem, where a neuron permanently outputs zero.
- Defined as max(0,x)
- Most common in hidden layers
- Can suffer dead neuron problem
- Computationally efficient with sparse activation
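The definition and its derivative in a few lines:

```python
import numpy as np

def relu(x):
    """max(0, x) applied element-wise."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative: 1 where x > 0, else 0 (the source of dead neurons)."""
    return (x > 0).astype(float)

out = relu(np.array([-2.0, 0.0, 3.0]))
```

A neuron whose pre-activation stays negative gets zero gradient from `relu_grad` and can never recover, which is exactly the dead-neuron problem noted above.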
Sigmoid
Squashes its input to the range (0, 1), producing smooth, differentiable outputs. Commonly used in output layers for binary classification, but it suffers from vanishing gradients in deep networks.
- Output range from 0 to 1
- Vanishing gradient problem in deep nets
- Common in output layer for binary tasks
- Smooth and differentiable
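The function and the derivative that causes the vanishing-gradient issue:

```python
import numpy as np

def sigmoid(x):
    """Squash input into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # never exceeds 0.25, so stacked layers shrink gradients
```

Because the derivative peaks at 0.25, multiplying it across many layers drives gradients toward zero, which is why the notes flag sigmoid for deep networks.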
Tanh
Outputs values in the range (-1, 1), providing zero-centered activations that generally work better than sigmoid in hidden layers. It still suffers from vanishing gradients, but less severely than sigmoid in practice.
- Output range from -1 to 1
- Zero-centered outputs
- Better than sigmoid for hidden layers
- Still has vanishing gradient issue
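Tanh is built into NumPy; only its derivative needs writing out:

```python
import numpy as np

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - np.tanh(x) ** 2   # peaks at 1 (vs sigmoid's 0.25)
```

The derivative reaches 1 at x = 0, four times sigmoid's maximum, which is why tanh's vanishing-gradient problem is milder even though both saturate at the tails.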
Softmax
Converts a vector of scores into a probability distribution that sums to 1. Essential for multi-class classification, where it normalizes outputs to represent class probabilities.
- Creates probability distribution
- Used in multi-class output layer
- Common in classification tasks
- Normalizes outputs to sum to 1
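A standard implementation, with the usual max-subtraction trick for numerical stability (the example scores are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)      # shift for numerical stability; result is unchanged
    e = np.exp(z)
    return e / e.sum()     # normalize so the outputs sum to 1

p = softmax(np.array([2.0, 1.0, 0.1]))
```

The largest score gets the largest probability, and exponentiation keeps every class probability strictly positive.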
Regularization Techniques
Dropout
Randomly drops neurons during training with a specified probability, preventing overfitting through an implicit ensemble effect. It is applied only during training; all neurons are active at inference.
- Randomly drops neurons during training
- Prevents overfitting effectively
- Creates ensemble effect
- Only applied during training phase
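A sketch using the common "inverted dropout" formulation, which rescales at training time so inference needs no change (an implementation choice assumed here, not stated in the notes):

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each unit with probability p during training."""
    if not training:
        return x                        # all neurons active at inference
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(x.shape) >= p     # keep each neuron with probability 1-p
    return x * mask / (1.0 - p)         # rescale so the expected value matches
```

Each training step effectively samples a different thinned sub-network, which is the ensemble effect the notes mention.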
L1/L2 Regularization
Adds weight penalties to the loss function: L1 encourages sparsity by pushing weights to zero, while L2 encourages small weights. Both prevent overfitting by constraining model complexity.
- Adds weight penalty to loss
- L1 for sparsity, L2 for small weights
- Prevents overfitting
- Constrains model complexity
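Both penalties are just extra terms on the loss (the coefficients and weights below are illustrative):

```python
import numpy as np

def regularized_loss(base_loss, w, l1=0.0, l2=0.0):
    """Add an L1 (sparsity) and/or L2 (shrinkage) penalty to a base loss."""
    return base_loss + l1 * np.abs(w).sum() + l2 * (w ** 2).sum()

w = np.array([0.5, -1.0, 2.0])
loss = regularized_loss(1.0, w, l1=0.1, l2=0.01)
```

The L1 term's gradient has constant magnitude, so it can drive small weights exactly to zero; the L2 term's gradient shrinks with the weight, so it merely keeps weights small.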
Batch Normalization
Normalizes layer inputs across each mini-batch, reducing internal covariate shift. It enables faster training with higher learning rates and also acts as a regularizer, reducing the need for dropout.
- Normalizes inputs across mini-batch
- Enables faster training
- Reduces internal covariate shift
- Acts as implicit regularizer
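A sketch of the training-time normalization (the learnable scale `gamma` and shift `beta` are left at their defaults; running statistics for inference are omitted):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mu = x.mean(axis=0)                 # per-feature mean over the batch
    var = x.var(axis=0)                 # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.array([[1.0, 100.0], [3.0, 300.0], [5.0, 500.0]])
out = batch_norm(x)
```

After normalization both features have mean 0 and unit variance regardless of their original scale, which is what lets later layers tolerate higher learning rates.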
Data Augmentation
Creates synthetic training data through domain-specific transformations, effectively expanding the training set. It prevents overfitting and improves generalization by exposing the model to more variation.
- Generates synthetic training data
- Prevents overfitting
- Domain-specific transformations
- Improves model generalization
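A sketch of two common image-domain transformations, random horizontal flips and brightness jitter (the specific transforms and ranges are illustrative choices):

```python
import numpy as np

def augment_image(img, rng):
    """Randomly flip and brightness-jitter an image with values in [0, 1]."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                 # horizontal flip
    img = img * rng.uniform(0.8, 1.2)      # brightness scaling
    return np.clip(img, 0.0, 1.0)          # stay in the valid pixel range

rng = np.random.default_rng(0)
batch = [augment_image(np.full((4, 4), 0.5), rng) for _ in range(8)]
```

Each pass over the data sees slightly different versions of every image, so the effective training set grows without collecting new labels.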
Advanced Algorithms
Attention Mechanism
Lets models focus on the most relevant parts of the input using query-key-value operations. The foundation of transformer architectures, where self-attention enables state-of-the-art performance in NLP and vision.
- Focuses on relevant input parts
- Foundation of transformer models
- Query-key-value operation
- Self-attention for sequence modeling
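The query-key-value operation is scaled dot-product attention, softmax(QKᵀ/√d)·V; a sketch with a single query (the toy matrices are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V: each query attends over all keys."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # each row sums to 1
    return weights @ V, weights                       # weighted sum of values

Q = np.array([[1.0, 0.0]])                            # one query
K = np.array([[1.0, 0.0], [0.0, 1.0]])                # two keys
V = np.array([[10.0, 0.0], [0.0, 10.0]])              # two values
out, attn = scaled_dot_product_attention(Q, K, V)
```

The query matches the first key more strongly, so the output is pulled toward the first value; in self-attention, Q, K, and V are all projections of the same sequence.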
Convolution
Applies learnable filters with local connectivity and parameter sharing to detect spatial patterns. Stacked convolutions build hierarchical feature representations, forming the foundation of convolutional neural networks.
- Local connectivity patterns
- Parameter sharing across space
- Builds spatial hierarchies
- Foundation of CNNs
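A naive sketch of a 2-D convolution (implemented, as deep-learning libraries do, as cross-correlation; the edge-detecting kernel is an illustrative example):

```python
import numpy as np

def conv2d(img, kernel):
    """Valid cross-correlation: slide one shared kernel over the image."""
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):                       # local connectivity: each
            patch = img[i:i + kh, j:j + kw]       # output sees a small patch
            out[i, j] = (patch * kernel).sum()    # same kernel everywhere
    return out

edge = np.array([[1.0, -1.0]])                    # responds to intensity changes
img = np.array([[0.0, 0.0, 1.0, 1.0]])
result = conv2d(img, edge)
```

The same two weights are reused at every position (parameter sharing), and the output fires only where the image changes, illustrating pattern detection.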
Recurrent Neural Networks (RNNs)
Maintain a hidden-state memory across time steps to process sequential data. LSTM and GRU variants add gating mechanisms to capture long-term dependencies in temporal sequences.
- Hidden state maintains memory
- Processes sequential data
- LSTM/GRU use gating mechanisms
- Captures time dependencies
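A sketch of the vanilla recurrence h_t = tanh(Wx·x_t + Wh·h_{t-1} + b), without the LSTM/GRU gating (the random weights and sequence are illustrative):

```python
import numpy as np

def rnn_forward(xs, Wx, Wh, b):
    """Vanilla RNN: fold a sequence into a hidden state, step by step."""
    h = np.zeros(Wh.shape[0])
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h + b)   # hidden state carries memory forward
    return h

rng = np.random.default_rng(0)
Wx = rng.normal(size=(4, 3))               # input-to-hidden weights
Wh = rng.normal(size=(4, 4))               # hidden-to-hidden (recurrent) weights
b = np.zeros(4)
h = rnn_forward([rng.normal(size=3) for _ in range(5)], Wx, Wh, b)
```

The same weights are applied at every time step, and the final hidden state summarizes the whole sequence; LSTMs and GRUs replace the plain tanh update with gated updates to preserve information over longer spans.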
