How models learn. Every weight update, every backprop pass, every optimizer is calculus applied to a loss function — from first principles.
Rate of change — the slope at a single point
Training a model means minimizing a loss function. To minimize it, you need to know which direction makes it decrease — and that direction is given by the derivative. Without derivatives there is no gradient descent, no backprop, no training. Every optimizer (SGD, Adam, RMSProp) is entirely built on derivatives. This is the most foundational concept in ML training.
A derivative answers one question: if I nudge the input by a tiny amount, how much does the output change? Geometrically, it's the slope of the curve at a single point — the steepness of the tangent line there.
Imagine you're standing on a hilly landscape (your loss surface). The derivative at your current position tells you: how steep is the ground right here, and in which direction does it slope? That's all you need to know to take one step downhill.
The derivative of f at point x is the limit of the slope of the secant line as the two points get infinitely close:
f'(x) = lim(h→0) [f(x + h) − f(x)] / h
You don't need to compute limits in interviews, but the definition is worth understanding because it says exactly what a derivative is: the instantaneous rate of change, which is the slope of the tangent at x.
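The limit idea can be checked numerically. The sketch below is illustrative only: the function f(x) = x² and the point x = 3 are assumptions chosen because the exact derivative (2x = 6) is easy to verify by hand.

```python
def secant_slope(f, x, h):
    """Slope of the secant line through (x, f(x)) and (x + h, f(x + h))."""
    return (f(x + h) - f(x)) / h


def f(x):
    return x ** 2  # example function; its exact derivative is 2x


x0 = 3.0
step_sizes = (1.0, 0.1, 0.01, 0.001)
slopes = [secant_slope(f, x0, h) for h in step_sizes]

# As h shrinks, the secant slope approaches the tangent slope 2 * x0 = 6.
for h, slope in zip(step_sizes, slopes):
    print(f"h={h:<6} secant slope={slope:.4f}")
```

For this particular f the secant slope works out to 2x + h, so the error shrinks in direct proportion to h.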
If f'(x) > 0 → function is increasing at x. If f'(x) < 0 → decreasing. If f'(x) = 0 → flat — could be a minimum, maximum, or saddle point.
Activation function derivatives you'll need:
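The list itself does not appear here. As an assumption about which activations are meant, the sketch below covers three standard ones (sigmoid, tanh, ReLU) and checks the closed-form derivatives against a central finite difference. The helper name numeric_derivative is hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # sigma'(x) = sigma(x) * (1 - sigma(x))


def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2  # tanh'(x) = 1 - tanh(x)^2


def relu(x):
    return max(0.0, x)


def d_relu(x):
    # Not differentiable at exactly 0; returning 0 there is a common convention.
    return 1.0 if x > 0 else 0.0


def numeric_derivative(f, x, h=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (f(x + h) - f(x - h)) / (2 * h)


for name, fn, dfn in (("sigmoid", sigmoid, d_sigmoid),
                      ("tanh", math.tanh, d_tanh),
                      ("relu", relu, d_relu)):
    print(name, dfn(0.5), numeric_derivative(fn, 0.5))
```

Note that the sigmoid derivative peaks at 0.25 (at x = 0), which matters later when many such factors are multiplied together.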
The second derivative f''(x) measures how the slope is changing — curvature of the function.
Second-order optimizers such as Newton's method use the Hessian matrix, the matrix of all second partial derivatives, to take more informed steps. For n parameters the Hessian has n² entries, so it is too expensive to compute and store for large models, even though second-order methods can converge in far fewer iterations than first-order methods.
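To make the difference concrete, here is a one-dimensional sketch in which the Hessian is a single number, f''(x). The function f(x) = (x − 2)² + 1, the starting point, and the learning rate 0.1 are all assumptions chosen for illustration.

```python
def f_prime(x):
    return 2.0 * (x - 2.0)   # first derivative of f(x) = (x - 2)^2 + 1


def f_second(x):
    return 2.0               # second derivative: constant positive curvature


x_start = 10.0

# Newton step: divide the slope by the curvature.
x_newton = x_start - f_prime(x_start) / f_second(x_start)

# Plain gradient step with a hypothetical learning rate of 0.1.
x_gd = x_start - 0.1 * f_prime(x_start)

print("Newton step lands at:", x_newton)
print("Gradient step lands at:", x_gd)
```

On a quadratic, the Newton step lands on the minimum (x = 2) in one move because the curvature tells it how far to go. The gradient step only knows the direction and relies on the learning rate for distance.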
Derivatives when you have multiple inputs — like millions of weights
A neural network has millions of parameters. The loss function depends on all of them simultaneously. A regular derivative only handles one variable. Partial derivatives extend this: they let you ask "how does the loss change with respect to this one specific weight, while holding all others fixed?" This is exactly what backprop computes — a partial derivative for every single weight in the network.
Imagine a landscape where your position is described by two coordinates (x, y) and your altitude is f(x, y). A partial derivative ∂f/∂x asks: if I take one step in the x-direction only (keeping y frozen), how much does my altitude change? ∂f/∂y asks the same for y.
In ML: x and y are two weights. f is the loss. Partial derivatives tell you the slope in each weight's direction independently.
Treat all other variables as constants, then differentiate normally with respect to the target variable.
The notation ∂ (curly d) distinguishes partial from total derivatives. Read ∂f/∂wᵢ as "the partial derivative of f with respect to wᵢ."
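A small worked case of the "treat the others as constants" rule. The function f(x, y) = x²·y + 3y is a made-up example, and partial_numeric is a hypothetical helper that nudges one input while freezing the other.

```python
def f(x, y):
    return x ** 2 * y + 3.0 * y   # made-up example function


def df_dx(x, y):
    return 2.0 * x * y            # y treated as a constant


def df_dy(x, y):
    return x ** 2 + 3.0           # x treated as a constant


def partial_numeric(func, x, y, wrt, h=1e-6):
    """Nudge only one input and keep the other frozen."""
    if wrt == "x":
        return (func(x + h, y) - func(x - h, y)) / (2 * h)
    return (func(x, y + h) - func(x, y - h)) / (2 * h)


x0, y0 = 2.0, 5.0
print("df/dx:", df_dx(x0, y0), partial_numeric(f, x0, y0, "x"))
print("df/dy:", df_dy(x0, y0), partial_numeric(f, x0, y0, "y"))
```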
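In a simple linear model ŷ = w·x + b with squared-error loss L = (y − ŷ)² on a single example (MSE averages this over the dataset):
∂L/∂w = −2(y − ŷ)·x
∂L/∂b = −2(y − ŷ)
These two partial derivatives tell you exactly how to update w and b to reduce the loss: subtract a small multiple of each from its parameter. This is the basis of the gradient descent update rule for linear regression.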
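A minimal sketch of one update on one training example, assuming the single-example squared-error loss (y − ŷ)². The data point and the learning rate are invented for illustration.

```python
def loss(w, b, x, y):
    return (y - (w * x + b)) ** 2


def grads(w, b, x, y):
    err = y - (w * x + b)
    # dL/dw = -2 * (y - y_hat) * x,  dL/db = -2 * (y - y_hat)
    return -2.0 * err * x, -2.0 * err


w, b = 0.0, 0.0
x, y = 2.0, 4.0      # one made-up training example
lr = 0.05            # hypothetical learning rate

before = loss(w, b, x, y)
dw, db = grads(w, b, x, y)
w, b = w - lr * dw, b - lr * db   # step opposite to each partial derivative
after = loss(w, b, x, y)

print(f"loss before={before:.3f}, after={after:.3f}")
```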
A deep network has parameters w₁, w₂, ..., wₙ (millions of them). Backprop computes ∂L/∂wᵢ for every single weight. Each tells you: nudge this weight in this direction by this much to reduce the loss. The collection of all these partial derivatives is the gradient — covered next.
The mathematical engine behind backpropagation
A neural network is a composition of functions — layer 1 feeds into layer 2, feeds into layer 3, and so on. To compute how the loss at the end depends on weights at layer 1, you need the chain rule. Backpropagation is literally "apply the chain rule through a computational graph." This is the single most important rule to understand if you want to explain how neural networks learn.
The chain rule handles composed functions: if y depends on u, and u depends on x, then how does y change when x changes? Answer: multiply the rates. If u increases by 2 for every unit increase in x, and y increases by 3 for every unit increase in u, then y increases by 2 × 3 = 6 for every unit increase in x. You chain the rates together by multiplying.
In a neural network: loss depends on the output layer, which depends on the hidden layer, which depends on the weights. The chain rule lets you trace this dependency all the way back.
If y = f(u) and u = g(x), then:
dy/dx = dy/du · du/dx
The derivative of the outer function times the derivative of the inner function. For deeper compositions, such as y = f(u), u = g(v), v = h(x):
dy/dx = dy/du · du/dv · dv/dx
A chain of multiplications — one factor per function in the composition.
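The product of local rates can be verified numerically. In this sketch the composition y = sin(x²) is an assumed example, with inner function u = x² and outer function sin(u).

```python
import math


def g(x):
    return x ** 2          # inner function u = g(x)


def f(u):
    return math.sin(u)     # outer function y = f(u)


def y(x):
    return f(g(x))


x0 = 1.5
dy_du = math.cos(g(x0))    # derivative of the outer function, evaluated at u = g(x0)
du_dx = 2.0 * x0           # derivative of the inner function
chain = dy_du * du_dx      # chain rule: multiply the local rates

h = 1e-6
numeric = (y(x0 + h) - y(x0 - h)) / (2 * h)   # finite-difference check

print("chain rule:", chain)
print("numeric:   ", numeric)
```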
To find ∂L/∂w₁ (how loss depends on weight in layer 1):
Each factor is a local derivative — easy to compute at each node. Backprop walks this chain from right to left, accumulating products. This is the full mechanism of backpropagation.
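As an illustration, here is the chain written out for a deliberately tiny network. The architecture is an assumption, not taken from the text: one scalar hidden unit a1 = tanh(w1·x), one scalar output ŷ = w2·a1, and squared-error loss. Each local derivative is one line, and their product is checked against a finite difference.

```python
import math


def forward(w1, w2, x, y):
    z1 = w1 * x                 # layer 1 pre-activation
    a1 = math.tanh(z1)          # layer 1 activation
    y_hat = w2 * a1             # output layer
    loss = (y_hat - y) ** 2     # squared error
    return z1, a1, y_hat, loss


w1, w2, x, y = 0.5, -1.5, 2.0, 1.0   # made-up values
z1, a1, y_hat, loss = forward(w1, w2, x, y)

# Local derivatives, starting at the loss and walking back toward w1.
dL_dyhat = 2.0 * (y_hat - y)
dyhat_da1 = w2
da1_dz1 = 1.0 - a1 ** 2
dz1_dw1 = x

dL_dw1 = dL_dyhat * dyhat_da1 * da1_dz1 * dz1_dw1

# Finite-difference check on w1.
h = 1e-6
numeric = (forward(w1 + h, w2, x, y)[3] - forward(w1 - h, w2, x, y)[3]) / (2 * h)

print("chain rule:", dL_dw1)
print("numeric:   ", numeric)
```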
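Backprop multiplies many numbers together as it propagates backward. Two problems arise: vanishing gradients, where many factors smaller than 1 shrink the product toward zero so early layers barely learn, and exploding gradients, where many factors larger than 1 make the product blow up and training becomes unstable.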
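A back-of-the-envelope sketch of what repeated multiplication does. The depth of 20 and the per-layer factor of 1.5 are assumptions for illustration. The factor 0.25 is the largest value the sigmoid derivative can take.

```python
depth = 20

# Each sigmoid layer contributes a factor of at most 0.25 to the product.
vanishing = 0.25 ** depth

# A hypothetical per-layer factor slightly above 1 compounds the other way.
exploding = 1.5 ** depth

print(f"product of {depth} factors of 0.25: {vanishing:.3e}")
print(f"product of {depth} factors of 1.5:  {exploding:.3e}")
```

Products of factors below 1 shrink toward zero (often called vanishing gradients), while products of factors above 1 grow without bound (exploding gradients).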
The vector of all partial derivatives — direction of steepest ascent
The gradient is what every optimizer actually uses. When you call loss.backward() in PyTorch, it computes the gradient of the loss with respect to every parameter. The gradient vector is the complete answer to "which direction makes the loss increase the fastest?" — and you walk in the opposite direction to minimize it.
Think of the gradient as a compass for a hilly landscape with millions of dimensions. It points uphill — in the direction of steepest increase. Flip it and you get the direction of steepest descent. Each component tells you the slope in one parameter's direction.
The gradient always points perpendicular to contour lines (lines of equal loss). To descend most efficiently, follow the negative gradient.
For a function f(w₁, w₂, ..., wₙ) with n parameters, the gradient is a vector of all partial derivatives:
∇f = [∂f/∂w₁, ∂f/∂w₂, ..., ∂f/∂wₙ]
The gradient has the same shape as the parameter vector — one number per parameter. For a network with 10 million weights, the gradient is a 10-million-dimensional vector.
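A sketch that builds the gradient one component at a time with finite differences, to show that it has one entry per parameter. The function f and the helper numerical_gradient are hypothetical. Real frameworks compute gradients with backprop, not this way.

```python
def numerical_gradient(f, w, h=1e-6):
    """Approximate each partial derivative by nudging one parameter at a time."""
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[i] += h
        w_minus[i] -= h
        grad.append((f(w_plus) - f(w_minus)) / (2 * h))
    return grad


def f(w):
    # Made-up function of three parameters.
    return w[0] ** 2 + 3.0 * w[1] ** 2 + w[0] * w[2]


w = [1.0, 2.0, -1.0]
grad = numerical_gradient(f, w)

# Exact gradient for comparison: [2*w0 + w2, 6*w1, w0]
exact = [2.0 * w[0] + w[2], 6.0 * w[1], w[0]]
print("numeric:", grad)
print("exact:  ", exact)
```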
For logistic regression with sigmoid output σ(z) and binary cross-entropy loss, the gradient with respect to the pre-activation z is:
∂L/∂z = σ(z) − y
Beautifully simple: the gradient is just how wrong your prediction is. If σ(z) = 0.9 and y = 0, gradient = 0.9 — large, because you were very wrong. This is why cross-entropy + sigmoid is the standard for binary classification.
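The "prediction minus label" result can be checked against a finite difference of the binary cross-entropy itself. In this sketch z is chosen as ln 9 so that σ(z) = 0.9, matching the example in the text. The function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(z, y):
    """Binary cross-entropy for a single example with logit z and label y."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


z = math.log(9.0)   # sigmoid(ln 9) = 0.9
y = 0.0

analytic = sigmoid(z) - y                           # the closed-form gradient
h = 1e-6
numeric = (bce(z + h, y) - bce(z - h, y)) / (2 * h)  # finite-difference check

print("analytic:", analytic)
print("numeric: ", numeric)
```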
The algorithm that trains every neural network
This is the training algorithm. Everything before this was building up to it. Gradient descent is how weights get updated, how models improve over epochs, how a random initialization becomes a useful model. Understanding it, and why Adam often converges faster than vanilla SGD, is commonly expected in ML interviews, from junior to senior.
Imagine you're blindfolded on a hilly landscape and want to reach the lowest point. You can only feel the slope under your feet. Strategy: take a small step in the downhill direction, feel the slope again, take another step. Repeat. That's gradient descent. The learning rate is your step size — too large and you overshoot valleys; too small and you take forever.
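At each step, every parameter moves a small amount opposite to its gradient:
w ← w − η · ∂L/∂w
where η is the learning rate. If ∂L/∂w is positive (increasing w increases loss), subtracting decreases w and reduces the loss. If ∂L/∂w is negative, subtracting a negative number increases w, which also reduces the loss. The sign always works out correctly.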
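A minimal sketch of the update loop on a one-parameter loss L(w) = (w − 3)², which is an assumed toy problem with its minimum at w = 3. The two learning rates are chosen to show a step size that converges and one that overshoots.

```python
def grad(w):
    return 2.0 * (w - 3.0)   # derivative of L(w) = (w - 3)^2


def descend(w, lr, steps):
    for _ in range(steps):
        w = w - lr * grad(w)  # move opposite to the gradient
    return w


w_good = descend(0.0, lr=0.1, steps=100)   # small steps: settles near the minimum
w_bad = descend(0.0, lr=1.1, steps=20)     # steps too large: overshoots further each time

print("lr=0.1 ->", w_good)
print("lr=1.1 ->", w_bad)
```

With lr = 0.1 the distance to the minimum shrinks by a factor of 0.8 per step. With lr = 1.1 it grows by a factor of 1.2 per step, so the iterates bounce across the valley and diverge.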
Smaller batch = noisier gradients (can help generalization). Larger batch = more stable, better GPU utilization, may converge to sharp minima that generalize worse.
Common practice: use a learning rate scheduler — start high, decay over time (step decay, cosine annealing, warmup + decay). Warmup is critical for Transformers.
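Vanilla SGD uses the same learning rate for every parameter. Adam adapts the step size per parameter using exponential moving averages of the gradient g (the first moment, m) and of the squared gradient (the second moment, v, an uncentered variance estimate):
m ← β₁·m + (1 − β₁)·g
v ← β₂·v + (1 − β₂)·g²
m̂ = m / (1 − β₁ᵗ), v̂ = v / (1 − β₂ᵗ)
w ← w − η · m̂ / (√v̂ + ε)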
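A single-parameter sketch of the Adam update, written from the standard published form of the algorithm, not from any particular library. The toy loss (w − 3)², the learning rate 0.1, and the step count are assumptions. The β and ε values are the commonly used defaults.

```python
import math


def adam_minimize(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m = 0.0   # running mean of gradients (first moment)
    v = 0.0   # running mean of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction: m and v start at zero and are too small early on.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Per-parameter step: scaled by the recent gradient magnitude.
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w=0.0)
print("final w:", w_final)
```

Dividing by √v̂ means a parameter with consistently large gradients takes smaller effective steps, and one with small gradients takes larger ones.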
The algorithm that makes gradient descent practical
Gradient descent needs the gradient of the loss with respect to every weight. Estimating each partial derivative independently, for example by nudging one weight at a time and re-running the network, would take at least one forward pass per weight, which means millions of forward passes for a network with millions of weights. Backpropagation computes all gradients in a single backward pass by reusing intermediate computations. This is why deep learning is computationally feasible at all.
Think of backprop as a blame assignment algorithm. The network makes a prediction, computes a loss, then works backward through each layer asking: "how much did this weight contribute to the error?" Weights that contributed more get a larger gradient — they're more responsible and need a larger correction.
The key insight: each layer only needs to know two things — the error signal coming from the layer above, and its own local derivative. It can compute its contribution without knowing anything about the rest of the network.
The forward pass caches intermediate activations because the backward pass needs them to compute local derivatives. This is why memory usage scales with depth — you need to store all activations until the backward pass is complete.
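For a layer l with weights Wˡ, bias bˡ, input x = aˡ⁻¹, pre-activation zˡ = Wˡx + bˡ, activation aˡ = σ(zˡ), and error signal δˡ = ∂L/∂zˡ:
δˡ = ((Wˡ⁺¹)ᵀ δˡ⁺¹) ⊙ σ'(zˡ)
∂L/∂Wˡ = δˡ (aˡ⁻¹)ᵀ
∂L/∂bˡ = δˡ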
⊙ is element-wise multiplication. Each layer passes δˡ backward to the layer below — this is the "error signal." Each layer also computes its own weight gradients from δˡ and its input activations.
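A pure-Python sketch of these quantities for a single sigmoid layer with two inputs and two outputs. It assumes this is the last layer and that the loss is ½·Σ(a − y)², so the incoming error is simply a − y. All numbers are made up. One weight gradient is checked against a finite difference.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(W, b, x):
    z = [sum(W[i][j] * x[j] for j in range(len(x))) + b[i] for i in range(len(b))]
    a = [sigmoid(zi) for zi in z]
    return z, a


def loss(W, b, x, y):
    _, a = forward(W, b, x)
    return 0.5 * sum((ai - yi) ** 2 for ai, yi in zip(a, y))


W = [[0.1, -0.2], [0.4, 0.3]]
b = [0.0, 0.1]
x = [1.0, 2.0]
y = [0.0, 1.0]

z, a = forward(W, b, x)   # the forward pass caches z and a for the backward pass

# Error signal: incoming error (a - y) times the local derivative sigma'(z), elementwise.
delta = [(ai - yi) * ai * (1.0 - ai) for ai, yi in zip(a, y)]
dW = [[delta[i] * x[j] for j in range(len(x))] for i in range(len(delta))]  # outer product
db = list(delta)

# Finite-difference check on a single weight, W[0][1].
h = 1e-6
W_plus = [row[:] for row in W]
W_minus = [row[:] for row in W]
W_plus[0][1] += h
W_minus[0][1] -= h
numeric = (loss(W_plus, b, x, y) - loss(W_minus, b, x, y)) / (2 * h)

print("backprop dW[0][1]:", dW[0][1])
print("numeric  dW[0][1]:", numeric)
```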
Modern frameworks (PyTorch, JAX) implement backprop via automatic differentiation on a computational graph. Every operation in the forward pass registers a backward function. When you call loss.backward(), PyTorch traverses the graph in reverse and accumulates gradients automatically. You never write backprop by hand.
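To show the idea of operations registering their own backward functions, here is a toy scalar autodiff. It is a hypothetical illustration in pure Python, not how PyTorch or JAX are implemented, and the class name Value and its attributes are invented for this sketch.

```python
class Value:
    """A scalar that remembers how it was computed (toy reverse-mode autodiff)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None   # replaced by the operation that creates this node

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            self.grad += out.grad               # d(out)/d(self) = 1
            other.grad += out.grad              # d(out)/d(other) = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            self.grad += other.data * out.grad  # d(out)/d(self) = other
            other.grad += self.data * out.grad  # d(out)/d(other) = self
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then run each backward function in reverse order.
        order, seen = [], set()

        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()


w, x, b = Value(3.0), Value(2.0), Value(1.0)
out = w * x + b          # the forward pass builds the graph
out.backward()           # the reverse pass fills in .grad
first_w_grad = w.grad    # d(out)/dw = x = 2.0

# A second backward pass without resetting adds to the existing gradients.
out2 = w * x + b
out2.backward()
accumulated_w_grad = w.grad
print(first_w_grad, accumulated_w_grad)
```

Because each backward function uses `+=`, gradients accumulate across passes in this sketch, which mirrors why a reset step between iterations is needed when gradients accumulate by default.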
requires_grad=True: tells PyTorch to track operations on this tensor
loss.backward(): triggers the backward pass, populates .grad on all tracked tensors
optimizer.step(): applies the update rule using the accumulated gradients
optimizer.zero_grad(): clears gradients before the next forward pass (critical, since PyTorch accumulates by default)
Why optimization is easy in theory and hard in practice
Convexity determines whether gradient descent is guaranteed to find the global minimum. Linear regression and logistic regression have convex loss functions — gradient descent always finds the best solution. Neural networks do not — their loss surfaces are riddled with local minima, saddle points, and flat regions. Understanding convexity explains why training a neural net is fundamentally harder than training a linear model, and why we rely on tricks like momentum, normalization, and good initialization.
A convex function is shaped like a bowl: for any two points on the curve, the line segment between them lies above (or on) the curve. Every local minimum is a global minimum, so wherever you start gradient descent (with a suitably small learning rate), you end up at the same minimum loss value.
A non-convex function is like a mountain range — full of valleys, peaks, and flat plateaus. Gradient descent can get trapped in a shallow local valley far from the deepest point. This is the landscape neural networks live in.
A function f is convex if for any two points x, y and any λ ∈ [0, 1]:
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)
The function value at any interpolated point is at most the interpolated function value — the chord never dips below the curve.
Equivalently, for a twice-differentiable function of one variable: f is convex if and only if f''(x) ≥ 0 everywhere (non-negative second derivative, concave up). For multivariate functions: the Hessian matrix is positive semi-definite everywhere.
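The chord condition can be probed by random sampling. This sketch can only find counterexamples and cannot prove convexity. The helper violates_convexity, the sampling interval, and the two test functions are assumptions for illustration.

```python
import math
import random


def violates_convexity(f, lo, hi, trials=10000, seed=0):
    """Return True if some sampled chord dips below the curve."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, y = rng.uniform(lo, hi), rng.uniform(lo, hi)
        lam = rng.random()
        curve = f(lam * x + (1 - lam) * y)         # function at the interpolated point
        chord = lam * f(x) + (1 - lam) * f(y)      # interpolated function value
        if curve > chord + 1e-12:                  # small tolerance for rounding error
            return True
    return False


square_violates = violates_convexity(lambda x: x * x, -5.0, 5.0)
sine_violates = violates_convexity(math.sin, -5.0, 5.0)

print("x^2 violates the chord condition:", square_violates)
print("sin violates the chord condition:", sine_violates)
```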
In high dimensions (millions of parameters), saddle points are overwhelmingly more common than local minima. Gradient noise usually lets training escape saddle points, and the local minima that training does reach in a deep network tend to have loss close to the global minimum, which helps explain why neural networks can be trained at all despite non-convexity.
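A two-parameter sketch of a saddle point, using the textbook example f(x, y) = x² − y² (an assumption, not a neural network loss). The gradient is zero at the origin, yet the surface curves up along x and down along y, so a tiny perturbation in y is enough for gradient descent to leave.

```python
def f(x, y):
    return x * x - y * y


def grad(x, y):
    return 2.0 * x, -2.0 * y


def descend(x, y, lr=0.1, steps=100):
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y


gx0, gy0 = grad(0.0, 0.0)     # zero gradient at the origin
second_partials = (2.0, -2.0)  # opposite signs of curvature: a saddle, not a minimum

stuck = descend(1.0, 0.0)      # starts exactly on the x-axis: slides into the saddle
escaped = descend(1.0, 1e-6)   # tiny offset in y: the y-coordinate grows every step

print("stuck at:  ", stuck, "loss:", f(*stuck))
print("escaped to:", escaped, "loss:", f(*escaped))
```

The perturbed run escapes because the y-coordinate is multiplied by 1.2 each step, which is a simple picture of how noise in stochastic gradients can push training off a saddle.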