Question 1

What is gradient descent?

Accepted Answer

Gradient descent is the optimization algorithm used to train most machine-learning models, including neural networks. It minimizes a loss function by repeatedly moving the parameters a small step in the direction of steepest descent, which is the negative of the gradient (the slope of the loss with respect to each parameter). Over many steps it rolls 'downhill' on the loss landscape toward a minimum, where the loss is lowest among nearby settings. That minimum is the best the model can do locally, though not necessarily the best overall.
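As a minimal sketch of the loop described above, here is gradient descent on a one-parameter toy loss. The loss f(x) = (x - 3)^2, the function names, and the default step size are illustrative assumptions, not part of the original answer.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)  # move a small step downhill
    return x


# Illustrative loss: f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
def grad_f(x):
    return 2.0 * (x - 3.0)


x_min = gradient_descent(grad_f, x0=0.0)
print(x_min)  # very close to 3.0, the minimum of f
```

In a real model `x` would be a vector of millions of parameters and the gradient would come from backpropagation, but the update itself has the same shape.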

Question 2

What does the learning rate do?

Accepted Answer

The learning rate (often written η or α) is the step size of gradient descent: how far the parameters move along the negative gradient each iteration. Too small and training crawls and may not finish in any practical amount of time; too large and the steps overshoot the minimum, oscillate, or diverge entirely, with the loss blowing up toward infinity. The learning rate is widely regarded as the single most important hyperparameter in training, so choosing or scheduling it well matters more than almost any other setting.
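The three regimes can be seen on a toy problem. The sketch below assumes the loss f(x) = x^2 (gradient 2x) and three hand-picked learning rates; the helper name `run` is hypothetical.

```python
def run(learning_rate, steps=50, x0=1.0):
    """Gradient descent on f(x) = x^2 (gradient 2x); returns the final |x|."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * 2.0 * x
    return abs(x)


too_small = run(0.001)   # barely moves away from the start
reasonable = run(0.1)    # shrinks steadily toward the minimum at 0
too_large = run(1.1)     # every step overshoots and the distance grows

print(too_small, reasonable, too_large)
```

On this loss each step multiplies x by (1 - 2 * learning_rate), so any rate above 1.0 makes the distance from the minimum grow instead of shrink.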

Question 3

What is a local minimum in gradient descent?

Accepted Answer

A local minimum is a valley in the loss landscape that is lower than its immediate surroundings but not the lowest point overall (the global minimum). Plain gradient descent can get stuck in a local minimum because the slope there is zero, so it stops moving. Momentum, random restarts, and stochastic noise help the optimizer roll through shallow local minima toward better solutions.
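To make the idea concrete, the following sketch runs plain gradient descent from two starting points on a loss with two valleys. The function f(x) = x^4 - 3x^2 + x is an illustrative choice, and the names are hypothetical.

```python
def f(x):
    """Toy loss with two valleys: a shallow one near 1.13, a deeper one near -1.30."""
    return x**4 - 3.0 * x**2 + x


def grad_f(x):
    return 4.0 * x**3 - 6.0 * x + 1.0


def descend(x0, learning_rate=0.01, steps=1000):
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad_f(x)
    return x


from_right = descend(2.0)    # settles in the shallower (local) valley
from_left = descend(-2.0)    # settles in the deeper (global) valley

print(from_right, f(from_right))
print(from_left, f(from_left))
```

Both runs stop where the slope is zero, but only the one started on the left reaches the lower loss, which is why restarts from different initial points can help.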

Question 4

What is the difference between gradient descent and stochastic gradient descent (SGD)?

Accepted Answer

Batch gradient descent computes the gradient over the entire dataset before each step, which is accurate but slow and memory-heavy. Stochastic gradient descent (SGD) estimates the gradient from one example (or a small mini-batch) per step, so it takes many more steps, each far cheaper and noisier. That noise often helps the optimizer escape shallow local minima. Mini-batch SGD, usually combined with additions such as momentum or Adam, is what trains most modern neural networks.
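The difference comes down to which examples feed each gradient estimate. Here is a small sketch that fits y = w * x both ways; the four-point dataset, the function names, and the fixed random seed are assumptions made for illustration.

```python
import random

# Toy, noise-free dataset for y = 2 * x (illustrative only).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def gradient(w, indices):
    """Gradient of mean squared error for the model y = w * x over the given examples."""
    return sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in indices) / len(indices)


def batch_gd(steps=200, learning_rate=0.01):
    w = 0.0
    all_indices = range(len(xs))
    for _ in range(steps):
        w -= learning_rate * gradient(w, all_indices)  # every example, every step
    return w


def sgd(steps=200, learning_rate=0.01, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        i = rng.randrange(len(xs))                     # one random example per step
        w -= learning_rate * gradient(w, [i])
    return w


w_batch = batch_gd()
w_sgd = sgd()
print(w_batch, w_sgd)  # both should land near the true slope 2.0
```

Passing a list of several random indices to `gradient` instead of one would turn the second loop into mini-batch SGD.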

Question 5

What is momentum in gradient descent?

Accepted Answer

Momentum adds a fraction of the previous update to the current one, so the optimizer accumulates velocity in consistent directions — like a ball rolling downhill. It speeds up convergence along gentle slopes, damps oscillation across steep valleys, and carries the optimizer through shallow local minima. Optimizers like Adam combine momentum with a per-parameter adaptive learning rate.
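One common way to write this is the "heavy ball" update, sketched below on the gentle slope f(x) = 0.5 * x^2. The coefficient name `beta`, its value 0.9, and the toy loss are assumptions for illustration; libraries differ in the exact formulation they use.

```python
def descend(beta, learning_rate=0.01, steps=100, x0=1.0):
    """Gradient descent on f(x) = 0.5 * x^2 (gradient x) with momentum coefficient beta."""
    x = x0
    velocity = 0.0
    for _ in range(steps):
        grad = x
        velocity = beta * velocity - learning_rate * grad  # keep part of the last update
        x = x + velocity
    return x


plain = descend(beta=0.0)          # beta = 0 reduces to ordinary gradient descent
with_momentum = descend(beta=0.9)

print(abs(plain), abs(with_momentum))  # momentum ends much closer to the minimum at 0
```

With the same learning rate and step count, the accumulated velocity covers far more ground on this shallow slope than the plain update does.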

Question 6

Why does training diverge or the loss become NaN?

Accepted Answer

The most common cause is a learning rate that is too high: each step overshoots so far that the loss increases instead of decreasing, compounding until it overflows to infinity or NaN. Fixes include lowering the learning rate, using a warmup or decay schedule, clipping gradients, and normalizing inputs. Exploding gradients in deep networks have the same effect and are addressed with gradient clipping and normalization layers.
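The sketch below reproduces the failure on a steep toy loss, f(x) = x^4, and then applies one of the fixes mentioned above: clipping the gradient to a fixed range. The `max_grad` parameter and its value are hypothetical, and clipping by value is only one of several clipping schemes.

```python
import math


def descend(max_grad=None, learning_rate=0.01, steps=500, x0=10.0):
    """Gradient descent on f(x) = x^4; optionally clip the gradient to [-max_grad, max_grad]."""
    x = x0
    for _ in range(steps):
        grad = 4.0 * x * x * x
        if max_grad is not None:
            grad = max(-max_grad, min(max_grad, grad))  # clip by value
        x = x - learning_rate * grad
    return x


unclipped = descend()             # overshoots, overflows to infinity, then becomes NaN
clipped = descend(max_grad=10.0)  # same learning rate, but each step is bounded

print(math.isnan(unclipped), clipped)
```

Clipping caps how far any single step can move the parameters, so the early, very steep gradients no longer throw the optimizer out of the valley.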

