I wrote this blog post for anyone who wants an introduction to machine learning. I hope it helps someone as much as I enjoyed writing it.

A huge thanks to 3Blue1Brown for the amazing visual explanations that made me understand this and made this post possible. I highly recommend his channel.

Gradient descent applied to univariate linear regression is basically the “hello world” of machine learning. Despite its simplicity, the way we train linear regression models is pretty similar to how we train more advanced algorithms.

I’ll assume you’re familiar with basic calculus (derivatives, chain rule) and some linear algebra. I’ll build up the intuition first before diving into the math.

Historical Context

In 1809, Carl Friedrich Gauss, the “Prince of Mathematicians” himself, published Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium, a landmark work on celestial mechanics. In it, he claimed to have invented least squares regression: a way to plot a line of best fit through scattered data points, no matter how messy the data is.

But here’s where it gets interesting. Gauss promptly received a frustrated letter from Adrien-Marie Legendre, who had published the exact same method four years earlier, in 1805. Legendre’s paper Nouvelles méthodes pour la détermination des orbites des comètes contained the first published description of least squares regression.

Gauss argued he’d thought of it independently back in 1795, but didn’t publish because he found it “trivial.” This sparked one of math’s most famous priority disputes!

Today, we attribute linear regression to both mathematicians, along with Francis Galton, who later expanded the idea.

Use Case

Say we have a dataset with two columns: an independent variable and a dependent variable that show some correlation.

Example: Imagine we gathered data from 10 schools, recording the number of students and the number of ethnicities represented in each school. Plotting this data, we see a correlation — generally, the more students in a school, the more ethnicities we find.

Here’s the question: Can we create a prediction algorithm that predicts the number of ethnicities in a school if we know the number of students?

We’d do this by plotting a line of best fit on the scatter plot, giving us an equation in \(y = mx + b\) form that predicts \(y\) (ethnicities) when given \(x\) (students). All our predictions would lie on this line.

The Hypothesis

The equation of our line of best fit is given by:

\[ h(x) = \theta_0 + \theta_1 x \]

If this seems foreign to you, that’s okay: just think of it as \(y = mx + b\), with \(\theta_0\) as the y-intercept \(b\) and \(\theta_1\) as the slope \(m\).

So in univariate (one variable) linear regression, we need to find a way to tweak these two thetas to get the best line of fit:

  • \(\theta_0\) is the bias — it shifts the entire line up or down (the y-intercept)
  • \(\theta_1\) is the weight — it determines how much the input \(x\) impacts the output \(y\). In our scenario: what’s the number that relates the number of students to the number of ethnicities?

The line of best fit changes as \(\theta_0\) and \(\theta_1\) change. We’re trying to find the ideal values of theta that best match the training data.
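
To make this concrete, here’s a minimal sketch of the hypothesis in Python. The parameter values below are made up purely for illustration; they aren’t fitted to any real dataset:

```python
def h(x, theta0, theta1):
    """Hypothesis: predict y (ethnicities) from x (students) with the current parameters."""
    return theta0 + theta1 * x

# With an assumed intercept of 2.0 and slope of 0.05, a school with
# 400 students would be predicted to have 2.0 + 0.05 * 400 = 22 ethnicities.
print(h(400, theta0=2.0, theta1=0.05))  # 22.0
```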

Mean Squared Error (MSE) Cost Function

Before we figure out how to tweak the theta values, we need a way to concretely evaluate how good our prediction algorithm is. This is where the mean squared error (MSE) cost function comes in:

\[ J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2 \]

What does this actually do? For each training example, MSE:

  1. Computes the difference between our prediction and the actual value: \(h(x^{(i)}) - y^{(i)}\)
  2. Squares that difference to get rid of negatives: \((h(x^{(i)}) - y^{(i)})^2\)
  3. Adds them all up
  4. Averages the result (the \(\frac{1}{2m}\) part — the 2 is there to make derivatives cleaner later)

This final number is how we evaluate our prediction algorithm. Our goal is to minimize this cost.
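
Here’s a small sketch of that cost function in Python, assuming the training data lives in two equal-length lists xs and ys (those names and the toy numbers are mine, just for illustration):

```python
def mse_cost(xs, ys, theta0, theta1):
    """Mean squared error cost J(theta0, theta1) over m training examples."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        error = (theta0 + theta1 * x) - y  # prediction minus actual value
        total += error ** 2                # square to get rid of the sign
    return total / (2 * m)                 # average, with the extra factor of 2

# Toy check: data lying exactly on y = 1 + 2x gives a cost of 0.
print(mse_cost([1, 2, 3], [3, 5, 7], theta0=1.0, theta1=2.0))  # 0.0
```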

To summarize:

  • Hypothesis: \(h_\theta(x) = \theta_0 + \theta_1 x\)
  • Parameters: \(\theta_0, \theta_1\)
  • Cost function: \(J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2\)
  • Goal: \(\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)\)

We’re trying to choose the right \(\theta_0\) and \(\theta_1\) to minimize the cost function, which means making those prediction errors as small as possible.

Visualizing the Cost Surface

Here’s another way to think about it: since \(J\) depends on two parameters, we can plot it as a 3D surface. The x and y axes are \(\theta_0\) and \(\theta_1\), and the z axis is the cost \(J(\theta_0, \theta_1)\).

For linear regression with MSE, this surface is convex, meaning it looks like a bowl with a single lowest point. We’re trying to find that lowest point, which corresponds to the \(\theta_0\) and \(\theta_1\) that give us the minimum cost.
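
If you want to see the bowl for yourself, here’s a rough plotting sketch using NumPy and Matplotlib. The toy data and the grid ranges are arbitrary choices on my part:

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy training data, roughly following y = 1 + 2x.
xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = np.array([3.1, 4.9, 7.2, 8.8])

# Evaluate J(theta0, theta1) on a grid of parameter values.
theta0, theta1 = np.meshgrid(np.linspace(-5, 7, 100), np.linspace(-2, 6, 100))
predictions = theta0[..., None] + theta1[..., None] * xs  # broadcast over the data
J = ((predictions - ys) ** 2).sum(axis=-1) / (2 * len(xs))

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(theta0, theta1, J, cmap="viridis")
ax.set_xlabel("theta_0")
ax.set_ylabel("theta_1")
ax.set_zlabel("J(theta_0, theta_1)")
plt.show()
```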

Gradient Descent: The Intuition

Gradient descent is an iterative algorithm that randomly starts somewhere on this surface, then adjusts \(\theta_0\) and \(\theta_1\) until it reaches a point where the cost is minimized.

Intuitively, here’s what gradient descent does at each iteration:

  1. “Looks around” a full 360 degrees
  2. Chooses the direction of steepest descent (the most efficient way down)
  3. Takes a baby step in that direction
  4. Repeats until it converges — when it reaches a point where the cost can’t go any lower

The Update Rules

Here are the formulas that describe how these steps are taken:

\[ \theta_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) \]

\[ \theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) \]

For each iteration of gradient descent, both values of theta are updated simultaneously according to these formulas: compute both derivative terms using the current thetas, then apply both updates. The \(:=\) symbol means “update to equal.”

I’ll derive those partial derivatives near the end. For now, let’s focus on the logic.
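
Here’s a minimal sketch of what those update rules look like as a training loop in Python. It plugs in the final forms of the two partial derivatives derived at the end of this post, and the learning rate, iteration count, and toy data are arbitrary choices for illustration:

```python
def gradient_descent(xs, ys, alpha=0.05, iterations=5000):
    """Fit h(x) = theta0 + theta1 * x by gradient descent on the MSE cost."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        # Partial derivatives of J (derived at the end of this post).
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed with the old thetas.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Toy data lying near y = 1 + 2x; the fitted thetas come out close to (1, 2).
print(gradient_descent([1, 2, 3, 4], [3.1, 4.9, 7.2, 8.8]))
```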

Understanding with a 1D Example

Let’s simplify things by getting rid of the bias term, so our hypothesis only has one theta:

\[ h(x) = \theta_1 x \quad \text{instead of} \quad h(x) = \theta_0 + \theta_1 x \]

This means our line of best fit must pass through the origin.

Now the cost is only a function of \(\theta_1\), so we can represent it as a simple 2D graph, which is a parabola. The derivative term \(\frac{d}{d\theta_1} J(\theta_1)\) is just the slope at a single point.

For those rusty on derivatives: think of it simply as the slope between two points that are very, very close to each other.

Ignore \(\alpha\) (the “learning rate”) for now — just think of it as some positive number.

Case 1: Positive slope (right of minimum)

When the derivative/slope is positive, it means we’re to the right of the minimum. The update rule multiplies this positive derivative by \(-\alpha\), which subtracts from \(\theta_1\) and brings it closer to the minimum from the right.

Case 2: Negative slope (left of minimum)

When the derivative/slope is negative, we’re to the left of the minimum. Multiplying a negative derivative by \(-\alpha\) adds to \(\theta_1\), bringing it closer to the minimum from the left.

Bonus: The magnitude of the derivative decreases as we approach the minimum (where the slope is 0). This means the gradient update automatically slows down as the surface flattens. The step size naturally gets smaller as we get closer!
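
Here’s a tiny numeric sketch of that 1D picture, with made-up data whose true slope is 2 and a starting point to the right of the minimum. Watch the derivative shrink and the steps get smaller as \(\theta_1\) approaches 2:

```python
xs, ys = [1, 2, 3], [2, 4, 6]   # data lying exactly on y = 2x
theta1, alpha = 5.0, 0.1        # start well to the right of the minimum

for step in range(5):
    # d/dtheta1 of J(theta1) for the no-bias hypothesis h(x) = theta1 * x.
    grad = sum((theta1 * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    theta1 -= alpha * grad
    print(f"step {step}: derivative = {grad:+.3f}, theta1 = {theta1:.3f}")
# The positive derivative shrinks toward 0 and theta1 approaches 2 from the right.
```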

The Learning Rate \(\alpha\)

The learning rate dictates the size of each step. It’s a hyperparameter, meaning we have to choose it manually.

Since the learning rate scales the derivative, it determines how much each parameter changes after each iteration. Finding the right learning rate depends on your specific problem.

Too large: Overshoots the minimum, might oscillate wildly or diverge to infinity.

Too small: Takes way too many iterations to reach the minimum.

Just right: Converges efficiently in a reasonable number of iterations.

Common strategies:

  • Start with \(\alpha = 0.01\) and adjust based on results
  • Use learning rate schedules that decrease \(\alpha\) over time
  • Use adaptive methods like Adam or RMSprop
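
A quick way to get a feel for the too-small / too-large / just-right behavior described above is to rerun the tiny 1D problem from earlier with different learning rates (the specific alphas here are arbitrary):

```python
def run(alpha, steps=20):
    xs, ys, theta1 = [1, 2, 3], [2, 4, 6], 5.0  # same toy setup as before
    for _ in range(steps):
        grad = sum((theta1 * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        theta1 -= alpha * grad
    return theta1

print(run(0.001))  # too small: still far from the true slope of 2 after 20 steps
print(run(0.1))    # just right: essentially 2 after 20 steps
print(run(0.5))    # too large: each step overshoots and theta1 blows up
```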

Mathematical Explanation: Partial Derivatives

If you understand calculus, you know a derivative is the rate at which a function changes with respect to its input.

For a multivariable function \(f(x, y, z)\), we have partial derivatives:

\[ \frac{\partial f}{\partial x}, \quad \frac{\partial f}{\partial y}, \quad \frac{\partial f}{\partial z} \]

Where \(\frac{\partial f}{\partial x}\) is how \(f\) changes with respect to \(x\), and so on.

Partial derivatives are calculated by taking the derivative with respect to one variable while treating the other variables as constants. Each multivariable function has as many partial derivatives as it has input variables.
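
As a quick worked example (my own, not tied to the regression problem): for \(f(x, y) = x^2 y + 3y\), treating the other variable as a constant in each case gives

\[ \frac{\partial f}{\partial x} = 2xy, \qquad \frac{\partial f}{\partial y} = x^2 + 3 \]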

The Gradient

The gradient of a multivariable function is simply a vector of all its partial derivatives, written as \(\nabla\). For example:

\[ \nabla f(x, y, z) = \begin{bmatrix} \frac{\partial f}{\partial x} \\[6pt] \frac{\partial f}{\partial y} \\[6pt] \frac{\partial f}{\partial z} \end{bmatrix} \]

Here’s the key insight: the gradient vector always points in the direction of steepest ascent, no matter how many dimensions we’re in.

So taking the negative gradient gives us the direction of steepest descent. This is exactly what we want for minimizing our cost function!

Gradient-Based Intuition for the Update Rules

The mathematical intuition for our update rules is actually pretty simple once you understand multivariable calculus.

Looking at just the derivative terms of our two updates:

\[ \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) \]

\[ \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) \]

Notice that when we compute both derivative terms, first the partial derivative with respect to \(\theta_0\) and then the one with respect to \(\theta_1\), we are effectively computing the gradient: all the partial derivatives of \(J(\theta_0, \theta_1)\), evaluated at the same point.

Although we aren’t technically dealing with the gradient as a vector, we’re informally doing the same thing:

\[ \theta_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) \]

\[ \theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) \]

When we zoom out, we see we’re dealing with the negative partial derivatives (multiplying by \(-\alpha\) is the same as negating each partial derivative and then scaling it by \(\alpha\)).

If we think of this as a vector operation, we’re adjusting a theta vector \([\theta_0, \theta_1]\) by the negative gradient scaled by some learning rate \(\alpha\):

\[ \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix} - \alpha \overbrace{\begin{bmatrix} \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) \\[6pt] \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) \end{bmatrix}}^{\nabla J(\theta_0, \theta_1)} \]

Intuitively, you’re stepping against all of the partial derivatives at once, and together they make up something significant: the gradient.
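
Written as that vector operation, one step of gradient descent collapses to a single line with NumPy. This is just a sketch; the names theta, grad_J, and alpha are mine:

```python
import numpy as np

def step(theta, xs, ys, alpha):
    """One vectorized update: theta <- theta - alpha * grad J(theta)."""
    errors = (theta[0] + theta[1] * xs) - ys              # h(x) - y for every example
    grad_J = np.array([errors.mean(), (errors * xs).mean()])
    return theta - alpha * grad_J                         # move against the gradient

theta = np.array([0.0, 0.0])
xs, ys = np.array([1.0, 2.0, 3.0]), np.array([3.0, 5.0, 7.0])
print(step(theta, xs, ys, alpha=0.1))  # [0.5, 1.133...]
```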

Derivations of the Partial Derivatives

Now let’s prove why the update rules work. Here’s the step by step derivation.

Derivative with Respect to \(\theta_0\)

Why is this true?

\[ \theta_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) \quad \Longleftrightarrow \quad \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \]

The derivation is pretty simple first-year calculus, except for the brief use of multivariable calculus when evaluating the second factor produced by the chain rule.

Step by step:

Start with the MSE cost function:

\[ J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^2 \]

Take the derivative w.r.t. \(\theta_0\) on both sides. Use the constant multiple rule to pull \(\frac{1}{2m}\) outside the derivative:

\[ \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{2m} \frac{\partial}{\partial \theta_0} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^2 \]

Take derivative inside sum (sum rule: the derivative of the sum is the sum of the derivatives):

\[ \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \frac{\partial}{\partial \theta_0} (h(x^{(i)}) - y^{(i)})^2 \]

Use chain rule, and then power rule:

\[ \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} 2(h(x^{(i)}) - y^{(i)}) \cdot \frac{\partial}{\partial \theta_0} (h(x^{(i)}) - y^{(i)}) \]

Replace \(h(x)\) with its actual function:

\[ \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} 2(h(x^{(i)}) - y^{(i)}) \cdot \frac{\partial}{\partial \theta_0} (\theta_0 + \theta_1 x^{(i)} - y^{(i)}) \]

Take derivative of the expanded error function. Since we’re doing it w.r.t only \(\theta_0\), treat other variables as constants:

\[ \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} 2(h(x^{(i)}) - y^{(i)}) \cdot \frac{\partial}{\partial \theta_0} (\theta_0) \]

Evaluates to 1:

\[ \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} 2(h(x^{(i)}) - y^{(i)}) \cdot 1 \]

Pull the constant 2 out of the summation (it cancels with the 2 in \(\frac{1}{2m}\)):

\[ \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)}) \]

Derivative with Respect to \(\theta_1\)

Calculating the derivative w.r.t. \(\theta_1\) is nearly identical, and it introduces no new rules or formulas. The only small difference is in differentiating the second factor produced by the chain rule.

Why is this true?

\[ \theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) \quad \Longleftrightarrow \quad \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x^{(i)} \]

Following the same steps as before, we reach the point where we replace \(h(x)\) with its actual function:

\[ \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} 2(h(x^{(i)}) - y^{(i)}) \cdot \frac{\partial}{\partial \theta_1} (\theta_0 + \theta_1 x^{(i)} - y^{(i)}) \]

Take derivative of the expanded error function. Since we’re doing it w.r.t only \(\theta_1\), treat other variables as constants:

\[ \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} 2(h(x^{(i)}) - y^{(i)}) \cdot \frac{\partial}{\partial \theta_1} (\theta_1 x^{(i)}) \]

Evaluates to \(x^{(i)}\):

\[ \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} 2(h(x^{(i)}) - y^{(i)}) \cdot x^{(i)} \]

Pull the constant 2 out of the summation (again cancelling with the 2 in \(\frac{1}{2m}\)):

\[ \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)}) \cdot x^{(i)} \]
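
A nice sanity check on both derivations is to compare the analytic partial derivatives against finite-difference approximations of \(J\). Here’s a small sketch; the toy data, the point \((\theta_0, \theta_1) = (0.5, 1.5)\), and the helper names are mine:

```python
def cost(t0, t1, xs, ys):
    """MSE cost J(t0, t1)."""
    m = len(xs)
    return sum(((t0 + t1 * x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def analytic_grad(t0, t1, xs, ys):
    """The two partial derivatives derived above."""
    m = len(xs)
    errors = [(t0 + t1 * x) - y for x, y in zip(xs, ys)]
    return sum(errors) / m, sum(e * x for e, x in zip(errors, xs)) / m

xs, ys = [1.0, 2.0, 3.0], [3.1, 4.9, 7.2]
t0, t1, eps = 0.5, 1.5, 1e-6

numeric0 = (cost(t0 + eps, t1, xs, ys) - cost(t0 - eps, t1, xs, ys)) / (2 * eps)
numeric1 = (cost(t0, t1 + eps, xs, ys) - cost(t0, t1 - eps, xs, ys)) / (2 * eps)

print(analytic_grad(t0, t1, xs, ys))  # analytic values...
print((numeric0, numeric1))           # ...match the finite-difference estimates
```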

Conclusion

The elegance of gradient descent lies in its generality. The same algorithm that fits a line through ten data points can train a language model with billions of parameters.