The Descent.
Every neural network and language model is trained by rolling downhill on a loss landscape. You'll do it by hand and waste steps — then meet the algorithm that just reads the slope, and watch a bad learning rate blow the whole thing up.
Gradient descent trains a model by minimizing a loss, a number that says how wrong the model is. It treats the loss as a landscape over the model's parameters and repeatedly steps in the direction of steepest descent (the negative gradient, i.e. the downhill slope). The step size is set by the learning rate. Roll downhill enough times and you reach a minimum: parameter settings that no small nudge can improve, though not necessarily the best ones overall.
- Loss: how wrong the model is; lower is better
- Gradient: the slope; it points uphill, so we step in the negative direction
- Learning rate (η): step size each iteration
- Minimum: the bottom of a valley; the goal
① Descend by hand → ② Follow the slope → ③ Tune the learning rate till it explodes → ④ Escape a local minimum
The height of the curve is the loss; the ball's left–right position is the model's parameter. Training = rolling the ball to the lowest point by following the downhill slope.
What you just played, written down
You rolled it by hand, followed the slope, tuned the step size, and escaped a trap. Here's the same thing as the rule that trains every neural network.
How the optimizer thinks — four steps
- Measure the loss. Run the model, compare to the target, get a single number — how wrong it is.
- Take the gradient. Compute the slope of the loss with respect to each parameter — the direction that increases loss fastest.
- Step downhill. Move each parameter a little in the opposite direction: w ← w − η·∇L. η is the learning rate.
- Repeat until the gradient is ~0 (a flat spot, ideally a minimum) or you run out of steps.
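The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production optimizer: the one-parameter loss `(w - 3)**2`, the starting point, and the names `loss`, `grad`, and `gradient_descent` are all assumptions, chosen so that the true minimum (w = 3) is known in advance.

```python
def loss(w):
    # Assumed toy loss: a single valley whose bottom is at w = 3.
    return (w - 3.0) ** 2


def grad(w):
    # Slope of the toy loss with respect to w.
    return 2.0 * (w - 3.0)


def gradient_descent(w, lr=0.1, max_steps=1000, tol=1e-6):
    losses = []
    for _ in range(max_steps):
        losses.append(loss(w))   # 1. measure the loss
        g = grad(w)              # 2. take the gradient
        if abs(g) < tol:         # 4. stop once the slope is ~0 ...
            break
        w = w - lr * g           # 3. step downhill
    return w, losses             # ... or when the step budget runs out


w_final, losses = gradient_descent(w=0.0)
print(f"w = {w_final:.6f} after {len(losses)} loss evaluations")
```

With these assumed settings each update shrinks the distance to the minimum by a constant factor of 0.8, so the loop stops well before the 1000-step budget.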
Each step moves the parameter by η × slope. Too small → it crawls and may run out of steps before it arrives. Too big → it leaps past the bottom, and the loss grows instead of shrinking, spiralling toward infinity. Much of the art of training lives in this one number.
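The three regimes are easy to see on an assumed toy loss L(w) = w², whose slope is 2w. Each update then multiplies w by (1 − 2η), so for this particular loss any η above 1 makes every step land farther from the bottom than the last. The function name `final_loss` and the three rates are illustrative choices, not values from the demo above.

```python
def final_loss(lr, w=1.0, steps=20):
    # Gradient descent on the assumed loss L(w) = w**2 (slope 2w).
    for _ in range(steps):
        w = w - lr * 2.0 * w
    return w * w


for lr in (0.001, 0.3, 1.1):
    print(f"lr={lr:<6} loss after 20 steps = {final_loss(lr):.3g}")
```

After 20 steps the smallest rate has barely moved from its starting loss of 1, the middle rate is essentially at zero, and the largest rate has pushed the loss into the thousands.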
The update rule
for each step:
    L = loss(params)              # how wrong we are
    grad = ∇L(params)             # slope of L
    params = params - lr * grad   # step downhill

# with momentum (a rolling ball), replace the last line with:
    v = beta * v - lr * grad      # v starts at 0; beta sets how much velocity carries over
    params = params + v
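Here is one way the update rule, with and without momentum, might look as runnable Python. The loss is an assumed two-parameter bowl, 0.5·(x² + 100·y²), that is steep in one direction and shallow in the other; the function names, the starting point, and the settings `lr=0.01` and `beta=0.9` are illustrative assumptions.

```python
def loss(p):
    x, y = p
    return 0.5 * (x * x + 100.0 * y * y)


def grad(p):
    x, y = p
    return (x, 100.0 * y)  # slope along each parameter


def plain(p, lr, steps):
    for _ in range(steps):
        g = grad(p)
        p = (p[0] - lr * g[0], p[1] - lr * g[1])  # step downhill
    return p


def with_momentum(p, lr, beta, steps):
    v = (0.0, 0.0)  # velocity starts at rest
    for _ in range(steps):
        g = grad(p)
        v = (beta * v[0] - lr * g[0], beta * v[1] - lr * g[1])
        p = (p[0] + v[0], p[1] + v[1])
    return p


start = (1.0, 1.0)
plain_loss = loss(plain(start, lr=0.01, steps=100))
momentum_loss = loss(with_momentum(start, lr=0.01, beta=0.9, steps=100))
print(f"plain: {plain_loss:.4g}   momentum: {momentum_loss:.4g}")
```

On this bowl the learning rate has to stay small to keep the steep direction stable, which leaves plain descent crawling along the shallow one. The accumulated velocity lets the momentum run finish the same 100 steps at a lower loss.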
Solved by hand — gradient descent on the valley above
Each step, the parameter, its loss, and the slope the optimizer read — with a sensible learning rate. Watch the slope shrink toward zero as it nears the bottom.
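A trace of this kind can be reproduced with a short script. The numbers it prints come from an assumed loss L(w) = w² with a start of w = 4 and a learning rate of 0.25, not from the valley in the interactive demo; the function name `trace` is likewise hypothetical.

```python
def trace(w=4.0, lr=0.25, steps=6):
    rows = []
    for step in range(steps):
        loss = w * w            # height of the curve at w
        slope = 2.0 * w         # slope the optimizer reads
        rows.append((step, w, loss, slope))
        w = w - lr * slope      # step downhill
    return rows


rows = trace()
print(f"{'step':>4} {'w':>8} {'loss':>10} {'slope':>8}")
for step, w, loss, slope in rows:
    print(f"{step:>4} {w:>8.4f} {loss:>10.4f} {slope:>8.4f}")
```

With these settings each step halves the parameter, so the slope halves as well and the loss falls to a quarter of its previous value.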
⚠ When it breaks
A learning rate that's too high makes the loss explode to NaN; a local minimum can trap plain descent because the slope there is zero. Momentum and good initialization help.
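The local-minimum trap can be shown on an assumed double-valley loss, L(w) = w⁴ − 3w² + w, which has a shallow valley on the right and a deeper one on the left. The function, the starting points, and the name `descend` are illustrative assumptions.

```python
def loss(w):
    # Assumed loss with two valleys of different depth.
    return w ** 4 - 3.0 * w ** 2 + w


def slope(w):
    return 4.0 * w ** 3 - 6.0 * w + 1.0


def descend(w, lr=0.01, steps=500):
    # Plain gradient descent: no momentum, no noise.
    for _ in range(steps):
        w = w - lr * slope(w)
    return w


right = descend(2.0)    # starts on the right-hand slope
left = descend(-2.0)    # starts on the left-hand slope
print(f"from  2.0: w = {right:.3f}, loss = {loss(right):.3f}")
print(f"from -2.0: w = {left:.3f}, loss = {loss(left):.3f}")
```

Started at w = 2, the run settles in the shallower right-hand valley (near w ≈ 1.13) and stays there, because the slope at the bottom is zero. Started at w = −2, it reaches the deeper valley near w ≈ −1.30. Same algorithm, same learning rate; only the starting point differs.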
↔ Its cousins
SGD estimates the slope from a mini-batch (cheaper, noisier, escapes shallow traps). Momentum adds velocity; Adam adds a per-parameter adaptive learning rate.
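As a rough sketch of what "per-parameter adaptive learning rate" means, here is the commonly published form of the Adam update in plain Python. The two-parameter loss (a − 3)² + 100·(b + 1)², whose slopes differ by a factor of 100, is an assumption made for illustration, as are the function names; the defaults `beta1=0.9`, `beta2=0.999`, and `eps=1e-8` are the values usually quoted for Adam.

```python
import math


def grads(params):
    # Slopes of the assumed loss (a - 3)**2 + 100 * (b + 1)**2.
    a, b = params
    return [2.0 * (a - 3.0), 200.0 * (b + 1.0)]


def adam(params, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    params = list(params)
    m = [0.0] * len(params)  # running mean of gradients (the momentum part)
    v = [0.0] * len(params)  # running mean of squared gradients (the scale)
    for t in range(1, steps + 1):
        g = grads(params)
        for i in range(len(params)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
            m_hat = m[i] / (1.0 - beta1 ** t)  # bias correction for early steps
            v_hat = v[i] / (1.0 - beta2 ** t)
            # Each parameter's step is divided by its own gradient scale.
            params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params


a, b = adam([0.0, 0.0])
print(f"a = {a:.3f} (target 3), b = {b:.3f} (target -1)")
```

Because each parameter's step is normalized by its own gradient history, the steep direction (b) and the shallow one (a) can share a single learning rate.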
★ Where you've used it
Most trained models you have met: neural networks and LLMs, logistic and linear regression, recommendation systems, and fine-tuning all typically minimize a loss with some flavour of gradient descent.
Gradient descent, answered
What is gradient descent?
Gradient descent trains most ML models by minimizing a loss: it repeatedly steps the parameters in the direction of steepest descent (the negative gradient), rolling downhill on the loss landscape toward a minimum.
What does the learning rate do?
The learning rate (η) is the step size. Too small → training crawls; too large → steps overshoot and the loss diverges. It is usually the first hyperparameter worth tuning.
What is a local minimum?
A valley that's lower than its surroundings but not the lowest overall (the global minimum). Plain descent can get stuck there because the slope is zero; momentum and noise help escape shallow ones.
Gradient descent vs stochastic gradient descent (SGD)?
Batch GD uses the whole dataset per step (accurate, slow). SGD uses one example or a mini-batch per step (noisy, cheap, escapes shallow traps). Mini-batch SGD and its variants, such as Adam, train most modern neural networks.
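The difference comes down to how many examples feed each gradient estimate. In the sketch below a single hypothetical `train` function covers both cases through its `batch_size` argument. The synthetic data (y = 2x + 1 plus a little Gaussian noise), the seeds, and the hyperparameters are assumptions made purely for illustration.

```python
import random


def make_data(n=64, seed=0):
    # Synthetic points scattered around the line y = 2x + 1.
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)) for x in xs]


def train(data, batch_size, lr=0.1, epochs=200, seed=1):
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit examples in a new order each epoch
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # Gradient of the mean squared error, estimated from this batch only.
            gw = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
            gb = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * gw
            b -= lr * gb
    return w, b


data = make_data()
w_batch, b_batch = train(data, batch_size=len(data))  # batch GD: 1 step per epoch
w_mini, b_mini = train(data, batch_size=8)            # mini-batch SGD: 8 steps per epoch
print(f"batch GD:       w = {w_batch:.3f}, b = {b_batch:.3f}")
print(f"mini-batch SGD: w = {w_mini:.3f}, b = {b_mini:.3f}")
```

Both runs should land close to w = 2 and b = 1. The mini-batch run takes eight noisier updates for every one the full-batch run makes, at the same cost per pass over the data.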
What is momentum?
Momentum adds a fraction of the last update to the next, accumulating velocity like a rolling ball — speeding convergence, damping oscillation, and carrying the optimizer through shallow local minima. Adam combines it with adaptive learning rates.
Why does training diverge or the loss become NaN?
Usually the learning rate is too high: each step overshoots, the loss grows, and it compounds to infinity/NaN. Lower the rate, add warmup/decay, clip gradients, and normalize inputs.
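The following sketch reproduces the failure and shows one of those remedies, gradient clipping. It assumes a toy loss L(w) = w⁴ (slope 4w³) and a deliberately oversized learning rate of 0.5; the function name `run` and its `clip_at` parameter are hypothetical.

```python
import math


def run(w, lr, steps, clip_at=None):
    for _ in range(steps):
        g = 4.0 * w * w * w                      # slope of the assumed loss w**4
        if clip_at is not None:
            g = max(-clip_at, min(clip_at, g))   # cap the gradient's magnitude
        w = w - lr * g
    return w


diverged = run(2.0, lr=0.5, steps=20)
clipped = run(2.0, lr=0.5, steps=50, clip_at=1.0)
print("no clipping:  ", diverged)
print("with clipping:", clipped)
```

Without clipping, each overshoot lands on a steeper part of the curve, so the steps grow until the value overflows to infinity and then becomes NaN. With the gradient capped at 1.0, the same learning rate walks down in bounded steps and ends close to the minimum at 0. Clipping treats the symptom; lowering the learning rate is usually still the first fix to try.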