Aquileo | Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent is an optimization algorithm used in machine learning, especially for large datasets, that updates model parameters efficiently using small batches or single samples.

Variant of gradient descent designed for faster and scalable learning
Updates parameters using one data point or small batches at a time
Reduces computation compared to full batch gradient descent
Widely used in deep learning for efficient training

Working of Stochastic Gradient Descent

In traditional (full-batch) gradient descent, the gradient is computed over the entire dataset at every step, which is computationally expensive for large datasets.
In Stochastic Gradient Descent, the gradient is computed from a single training example (or a small subset of training examples) at each step, so every update is much cheaper.

Stochastic Gradient Descent update rule is:

\theta = \theta - \eta \nabla_\theta J(\theta; x_i, y_i)

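Where:

\theta is the vector of model parameters being learned.
\eta is the learning rate, which controls the size of each update.
x_i and y_i represent the features and target of the i-th training example.
The gradient \nabla_\theta J(\theta; x_i, y_i) is now calculated for a single data point or a small batch.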
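
As a minimal sketch of the update rule above, the snippet below applies one step to a single example under a squared-error loss. The function name sgd_step and the example values are illustrative assumptions, not part of the original tutorial.

```python
import numpy as np

def sgd_step(theta, x_i, y_i, eta):
    """Apply one SGD update for squared-error loss on a single example."""
    # Prediction error for this example: (x_i . theta) - y_i
    error = x_i @ theta - y_i
    # Gradient of (x_i . theta - y_i)^2 with respect to theta
    grad = 2 * error * x_i
    # theta <- theta - eta * gradient
    return theta - eta * grad

theta = np.array([0.0, 0.0])   # [intercept, slope], illustrative start
x_i = np.array([1.0, 2.0])     # leading 1.0 is the bias feature
y_i = 10.0
theta_new = sgd_step(theta, x_i, y_i, eta=0.05)
print(theta_new)  # [1. 2.]
```

Here the error is -10, the gradient is [-20, -40], and a learning rate of 0.05 moves the parameters from [0, 0] to [1, 2].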

Implementing Stochastic Gradient Descent from Scratch

1. Generating the Data

In this step, we generate synthetic data for the linear regression problem. The data consists of feature X and the target y where the relationship is linear, i.e., y = 4 + 3 * X + noise.

X is an array of 100 samples drawn uniformly at random between 0 and 2.
y is the target, calculated from the linear equation with standard Gaussian noise added to make it more realistic.

Python

import numpy as np

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

For a linear regression with one feature, the model is described by the equation:

y = \theta_0 + \theta_1 \cdot X

Where:

\theta_0 is the intercept (the bias term),
\theta_1 is the slope or coefficient associated with the input feature X.

2. Defining the SGD Function

This step defines the Stochastic Gradient Descent function that initializes parameters, updates them iteratively, and tracks the loss during training.

Takes input data X and target y
theta (\theta) is the parameter vector (intercept and slope), initialized randomly.
X_bias is the augmented X with a column of ones added for the bias term (intercept).
Shuffles data in each epoch and updates parameters using single samples or mini-batches
Computes loss using Mean Squared Error (MSE)
Stores loss values to monitor convergence
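
The gradient used for the parameter update can be derived from the MSE on one batch. The following is a short derivation sketch; the symbols B (the batch), b (its size), X_B (the batch rows of X_bias) and y_B (the matching targets) are notation introduced here for illustration.

```latex
J_B(\theta) = \frac{1}{b} \sum_{i \in B} (\theta_0 + \theta_1 x_i - y_i)^2

\frac{\partial J_B}{\partial \theta_0} = \frac{2}{b} \sum_{i \in B} (\theta_0 + \theta_1 x_i - y_i)

\frac{\partial J_B}{\partial \theta_1} = \frac{2}{b} \sum_{i \in B} (\theta_0 + \theta_1 x_i - y_i) \, x_i

\nabla_\theta J_B(\theta) = \frac{2}{b} X_B^\top (X_B \theta - y_B)
```

The last line is the vectorized form of the two partial derivatives, with the column of ones in X_B producing the \theta_0 component.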

Python

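def sgd(X, y, learning_rate=0.1, epochs=1000, batch_size=1):
    m = len(X)
    # Random initial parameters: theta[0] is the intercept, theta[1] the slope
    theta = np.random.randn(2, 1)

    # Prepend a column of ones so the intercept is learned like any other weight
    X_bias = np.c_[np.ones((m, 1)), X]

    cost_history = []

    for epoch in range(epochs):
        # Reshuffle every epoch so the batches differ between passes
        indices = np.random.permutation(m)
        X_shuffled = X_bias[indices]
        y_shuffled = y[indices]

        for i in range(0, m, batch_size):
            X_batch = X_shuffled[i:i + batch_size]
            y_batch = y_shuffled[i:i + batch_size]

            # MSE gradient on this batch; dividing by len(X_batch) keeps the
            # average correct when the final batch is smaller than batch_size
            errors = X_batch.dot(theta) - y_batch
            gradients = 2 / len(X_batch) * X_batch.T.dot(errors)
            theta -= learning_rate * gradients

        # Record the MSE over the full dataset once per epoch
        predictions = X_bias.dot(theta)
        cost = np.mean((predictions - y) ** 2)
        cost_history.append(cost)

        if epoch % 100 == 0:
            print(f"Epoch {epoch}, Cost: {cost}")

    return theta, cost_history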

3. Training the Model Using SGD

In this step, we call the sgd() function to train the model. We specify the learning rate, number of epochs and batch size for SGD.

Python

theta_final, cost_history = sgd(X, y, learning_rate=0.1, epochs=1000, batch_size=1)

4. Visualizing the Cost Function

After training, we visualize how the cost function evolves over epochs. This helps us understand if the algorithm is converging properly.

Python

import matplotlib.pyplot as plt

plt.plot(cost_history)
plt.xlabel('Epochs')
plt.ylabel('Cost (MSE)')
plt.title('Cost Function during Training')
plt.show()

5. Plotting the Data and Regression Line

We will visualize the data points and the fitted regression line after training. We plot the data points as blue dots and the predicted line (from the final \theta) as a red line.

Python

plt.scatter(X, y, color='blue', label='Data points')
plt.plot(X, np.c_[np.ones((X.shape[0], 1)), X].dot(
    theta_final), color='red', label='SGD fit line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression using Stochastic Gradient Descent')
plt.legend()
plt.show()

Output:

[Figure: data points (blue) with the SGD fit line (red)]

6. Printing the Final Model Parameters

After training, we print the final parameters of the model which include the slope and intercept. These values are the result of optimizing the model using SGD.

Python

print(f"Final parameters: {theta_final}")

Output:

Final parameters: [[4.35097872]
 [3.45754277]]

The final parameters returned by the model, rounded to two decimal places, are:

\theta_0 = 4.35, \quad \theta_1 = 3.46

Then the fitted linear regression model will be:

y = 4.35 + 3.46 \cdot X

This means:

When X = 0, y = 4.35 (the intercept or bias term).
For each unit increase in X, y will increase by 3.46 units (the slope or coefficient).
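
As a sanity check, the SGD result can be compared with the exact least-squares fit on the same data. This comparison is an addition to the tutorial, sketched here with np.linalg.lstsq; the name theta_ls is illustrative. Because single-sample SGD with a constant learning rate keeps fluctuating around the optimum, the parameters printed above should not be expected to match the least-squares values exactly.

```python
import numpy as np

# Regenerate the same synthetic data as in step 1
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Add the bias column and solve the least-squares problem directly
X_bias = np.c_[np.ones((len(X), 1)), X]
theta_ls, *_ = np.linalg.lstsq(X_bias, y, rcond=None)

print("Least-squares intercept and slope:", theta_ls.ravel())
```

If the SGD parameters are far from these values, lowering the learning rate or using a larger batch size usually brings them closer.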

Applications

Used in deep learning to train large neural networks efficiently
Applied in NLP for models like Word2Vec and transformers
Used in computer vision tasks such as image classification, object detection and segmentation
Applied in reinforcement learning for models like deep Q-networks and policy gradient methods

Advantages

Faster and more efficient since it updates parameters using one or a few data points
Requires less memory, making it suitable for large datasets
The noise in stochastic updates can help the optimizer escape shallow local minima and saddle points
Supports online learning by updating the model with incoming data
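
The online-learning point can be illustrated with a rough sketch in which samples arrive one at a time and the model is updated immediately, without storing the data. The stream generator, the learning rate of 0.01 and the sample count are hypothetical choices made for this example.

```python
import numpy as np

def stream(n, rng):
    """Hypothetical data source yielding one (x, y) pair at a time."""
    for _ in range(n):
        x = 2 * rng.random()
        yield x, 4 + 3 * x + rng.standard_normal()

rng = np.random.default_rng(0)
theta = np.zeros(2)   # [intercept, slope]
eta = 0.01            # assumed learning rate for this sketch

for x, y_obs in stream(20000, rng):
    features = np.array([1.0, x])           # bias feature plus input
    error = features @ theta - y_obs        # prediction error on this sample
    theta -= eta * 2 * error * features     # single-sample SGD update

print("Parameters after streaming:", theta)
```

Each sample is used once and then discarded, so memory use stays constant no matter how much data arrives.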

Challenges

Updates based on single samples are noisy, so the loss fluctuates instead of converging smoothly
Highly sensitive to the learning rate: a value that is too high may cause divergence, while one that is too low slows down learning
May take longer to converge overall despite faster individual updates
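
The noise mentioned in the first point can be measured directly. The sketch below, under the assumption of a synthetic dataset like the one used earlier, estimates how much the gradient varies across randomly drawn batches of size 1 and size 32. The helper names batch_gradient and gradient_spread are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = 2 * rng.random((1000, 1))
y = 4 + 3 * X + rng.standard_normal((1000, 1))
X_bias = np.c_[np.ones((len(X), 1)), X]
theta = np.zeros((2, 1))   # fixed point at which gradients are compared

def batch_gradient(idx):
    """MSE gradient at theta computed on the rows selected by idx."""
    Xb, yb = X_bias[idx], y[idx]
    return 2 / len(idx) * Xb.T @ (Xb @ theta - yb)

def gradient_spread(batch_size, trials=500):
    """Standard deviation of the gradient over random batches."""
    grads = np.array([
        batch_gradient(rng.choice(len(X), size=batch_size, replace=False)).ravel()
        for _ in range(trials)
    ])
    return grads.std(axis=0)

spread_1 = gradient_spread(1)
spread_32 = gradient_spread(32)
print("std of gradient, batch_size=1: ", spread_1)
print("std of gradient, batch_size=32:", spread_32)
```

Larger batches average out per-sample variation, which is why mini-batches trade a little extra computation per step for smoother updates.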
