Think of it this way: we want to shrink the range of the numbers. If we divide every number by the same value, the relationships between them are preserved. Say we have huge numbers like 10,000 and 30,000; dividing everything by 10,000 gives 1 and 3. Even though we divided, no relationship between the values is broken. That is the idea behind an activation function.
Softmax
Softmax is a mathematical function that converts a vector of numbers into a probability distribution. It’s often used in machine learning, particularly in classification problems, where you want to predict a probability for each class.
Given a set of values (like scores or logits), softmax squashes these values to be between 0 and 1, and the sum of all values will be 1 (like probabilities). It highlights the largest value and diminishes the smaller ones.
Softmax(z_i) = e^(z_i) / Σ_j e^(z_j)

Where:
- z_i is the input value (logit) for class i,
- e is Euler's number (the base of the natural logarithm),
- the denominator is the sum of exponentials of all input values.
Let’s say we have the following scores from a model for 3 classes:
- Class A: 2.0
- Class B: 1.0
- Class C: 0.1
Step 1: Exponentiate each score: e^2.0 ≈ 7.389, e^1.0 ≈ 2.718, e^0.1 ≈ 1.105.
Step 2: Sum the exponentials: 7.389 + 2.718 + 1.105 ≈ 11.213.
Step 3: Divide each exponential by the sum:
- Softmax(Class A) = 7.389 / 11.213 ≈ 0.659
- Softmax(Class B) = 2.718 / 11.213 ≈ 0.242
- Softmax(Class C) = 1.105 / 11.213 ≈ 0.099
So the predicted probabilities are:
- Class A: 65.9%
- Class B: 24.2%
- Class C: 9.9%
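The steps above can be sketched in NumPy. Subtracting the maximum score before exponentiating is a standard numerical-stability trick not mentioned above; it does not change the result:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the output is unchanged
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs.round(3))  # → [0.659 0.242 0.099]
```

Note that the probabilities always sum to 1, matching the definition above.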
**Vanishing / Exploding Gradients**
The vanishing gradient problem occurs when gradients become very small in deep networks, making it hard for the model to learn. This is often caused by activation functions like sigmoid or tanh, which squash the input into a small range, leading to tiny gradients.
The exploding gradient problem happens when gradients become too large, making the model unstable and causing large updates to the weights, which can lead to poor convergence.
When we multiply numbers smaller than 1, the result gets even smaller:

0.25 × 0.5 = 0.125

A number less than 1 represents a reduction. Multiplying reductions repeatedly applies that reduction over and over, pushing the result closer to zero. This is why multiplying many small numbers, such as per-layer derivatives during backpropagation, creates a very small result.
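A quick illustration of this compounding effect. The value 0.25 here is chosen because it is the maximum derivative of the sigmoid function; 50 stands in for a deep stack of layers:

```python
# Multiplying many per-layer derivatives < 1 drives the product toward zero.
product = 1.0
for _ in range(50):
    product *= 0.25  # 0.25 is the maximum derivative of sigmoid

print(product)  # ~7.9e-31: the gradient has effectively vanished
```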
Techniques to avoid vanishing/exploding gradients
| Technique | Mechanistic effect on Jacobians and products |
|---|---|
| Weight initialization | Keeps variance of gradients ≈ constant → products don’t shrink/grow too much |
| ReLU / non-saturating activations | Derivative is 1 in the active region → products don’t shrink → reduces vanishing |
| Normalization layers | Stabilize activation distribution → derivatives stay in usable range |
| Residual connections | Add identity term → gradients flow directly without multiplying hundreds of times |
| Gradient clipping | Rescales large gradients to avoid exploding |
| LSTM/GRU | Gates keep multipliers near 1 through time |
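Of these techniques, gradient clipping is the simplest to sketch. Below is a minimal global-norm version in NumPy (the function name and the `max_norm` default are illustrative, not from the text):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    # Compute the combined L2 norm over all gradient arrays
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Rescale everything proportionally if the norm is too large
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

grads = [np.array([3.0, 4.0])]          # global norm = 5
clipped = clip_by_global_norm(grads)    # rescaled to norm 1
print(np.linalg.norm(clipped[0]))       # → 1.0
```

Rescaling by the global norm (rather than clipping each element) preserves the direction of the gradient while bounding its magnitude.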
Weight Initialization for Deep Networks
Proper weight initialization is crucial to avoid vanishing/exploding gradients:
- Xavier/Glorot Initialization: Initializes weights with a variance based on the number of input and output neurons, which helps prevent exploding or vanishing gradients.
- He Initialization: Similar to Xavier, but uses variance 2/fan_in (a higher variance), which is better suited for ReLU activation functions.
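The two schemes can be sketched as follows, using normal-distributed draws (the function names are illustrative; both schemes also have uniform-distribution variants not shown here):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    # Xavier/Glorot: variance 2 / (fan_in + fan_out)
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_init(fan_in, fan_out):
    # He: variance 2 / fan_in, suited to ReLU
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))
```

Both keep the scale of activations (and hence gradients) roughly constant from layer to layer, which is exactly the "products don't shrink/grow too much" effect in the table above.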
Gradient Checking
Gradient checking is a method to verify that the gradients computed by your backpropagation algorithm are correct. It compares the gradients from backpropagation with the numerically approximated gradients.
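A minimal numerical check using the central-difference approximation (the helper name and `eps` value are illustrative):

```python
import numpy as np

def numerical_grad(f, x, eps=1e-5):
    # Central difference: df/dx_i ≈ (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps)
    grad = np.zeros_like(x)
    for i in range(x.size):
        x_plus = x.copy();  x_plus.flat[i] += eps
        x_minus = x.copy(); x_minus.flat[i] -= eps
        grad.flat[i] = (f(x_plus) - f(x_minus)) / (2 * eps)
    return grad

# Example: f(x) = sum(x**2) has analytic gradient 2x
x = np.array([1.0, -2.0, 3.0])
analytic = 2 * x
numeric = numerical_grad(lambda v: np.sum(v ** 2), x)
print(np.allclose(analytic, numeric, atol=1e-6))  # → True
```

In practice you run this check once on a small network to validate the backpropagation code, then turn it off: the numerical approximation needs two forward passes per parameter, which is far too slow for training.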
Mini-batch Gradient Descent
Mini-batch gradient descent is a compromise between stochastic gradient descent (SGD) and batch gradient descent. It computes the gradient using a small batch of data instead of the entire dataset or just one sample. This approach offers:
- Faster convergence than batch gradient descent.
- More stable updates than SGD.
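One possible training loop on synthetic linear-regression data, as a sketch (the data, learning rate, and batch size here are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)
lr, batch_size = 0.1, 32
for epoch in range(20):
    # Shuffle once per epoch, then sweep over contiguous mini-batches
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # MSE gradient computed on the mini-batch only
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)
        w -= lr * grad

print(w.round(2))  # close to true_w = [1.0, -2.0, 0.5]
```

Each update touches only 32 samples instead of all 1000 (batch GD) or a single one (SGD), which is the compromise described above.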
Exponentially Weighted Averages
An exponentially weighted average is a type of moving average that assigns more weight to recent values. It is used in optimization algorithms like Adam to smooth the gradients over time and avoid oscillations.
In optimization, exponentially weighted averages are used for:
- Momentum: Helps the gradient descent converge faster by considering past gradients.
- Adam: Uses exponentially weighted averages of both the gradients and the squared gradients to improve convergence.
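A single Adam update can be sketched as follows (the helper name is illustrative; the default hyperparameters are the commonly used ones):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponentially weighted averages of gradients (m) and squared gradients (v)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction: early estimates of m and v are biased toward zero
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = adam_step(np.array([1.0]), np.array([2.0]), m=0.0, v=0.0, t=1)
```

Dividing by the root of `v_hat` gives each parameter its own effective step size, which is why Adam is less sensitive to the raw gradient scale than plain momentum.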
Bias Correction in Exponentially Weighted Averages
When using exponentially weighted averages, there’s often a bias towards zero at the beginning of training. Bias correction adjusts the values to account for this, improving the estimates during early training.
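A minimal sketch of the correction (the function name is illustrative; beta = 0.9 is a common choice). Dividing by 1 − beta^t exactly undoes the early bias toward zero:

```python
def ewa_with_bias_correction(values, beta=0.9):
    # v_t = beta * v_{t-1} + (1 - beta) * x_t, then corrected by 1 - beta**t
    v, corrected = 0.0, []
    for t, x in enumerate(values, start=1):
        v = beta * v + (1 - beta) * x
        corrected.append(v / (1 - beta ** t))
    return corrected

out = ewa_with_bias_correction([5.0] * 10)
print(out[0])  # ≈ 5.0: without correction the first estimate would be 0.5
```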