Calculus | NexForay

Derivatives

Rate of change — the slope at a single point

Why — motivation

Training a model means minimizing a loss function. To minimize it, you need to know which direction makes it decrease — and that direction is given by the derivative. Without derivatives there is no gradient descent, no backprop, no training. Every optimizer (SGD, Adam, RMSProp) is entirely built on derivatives. This is the most foundational concept in ML training.

Intuition — the mental model

A derivative answers one question: if I nudge the input by a tiny amount, how much does the output change? Geometrically, it's the slope of the curve at a single point — the steepness of the tangent line there.

Imagine you're standing on a hilly landscape (your loss surface). The derivative at your current position tells you: how steep is the ground right here, and in which direction does it slope? That's all you need to know to take one step downhill.

Explanation

Formal definition

The derivative of f at point x is the limit of the slope of the secant line as the two points get infinitely close:

f'(x) = df/dx = lim[h→0] (f(x+h) - f(x)) / h

You don't need to compute limits in interviews, but the definition carries the key idea: the derivative is the instantaneous rate of change, which is the slope of the tangent at x.

If f'(x) > 0 → function is increasing at x. If f'(x) < 0 → decreasing. If f'(x) = 0 → flat: could be a minimum, a maximum, or a flat inflection point (the one-dimensional counterpart of a saddle point).
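The limit definition can be checked numerically by using a small but finite h. This is only a sketch: the helper name `numerical_derivative`, the step size `h=1e-5`, and the use of a central difference (nudging in both directions instead of one) are assumptions made for this illustration.

```python
def numerical_derivative(f, x, h=1e-5):
    """Approximate f'(x) with a central difference (hypothetical helper)."""
    return (f(x + h) - f(x - h)) / (2 * h)


def square(x):
    return x ** 2


# Power rule says d/dx [x^2] = 2x, so the slope at x = 3 should be about 6
slope = numerical_derivative(square, 3.0)
print(f"approximate slope of x^2 at x=3: {slope:.6f}")
```

Shrinking h too far eventually makes floating-point round-off dominate, which is one reason frameworks differentiate analytically instead.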

Derivatives you must know cold

d/dx [xⁿ] = n·xⁿ⁻¹ (power rule)
d/dx [eˣ] = eˣ (exponential: its own derivative)
d/dx [ln x] = 1/x (log)
d/dx [sin x] = cos x
d/dx [c] = 0 (constant: zero slope)
d/dx [c·f(x)] = c·f'(x) (constant factor pulls out)

Activation function derivatives you'll need:

sigmoid: σ(x) = 1/(1+e⁻ˣ)
d/dx [σ(x)] = σ(x)·(1 - σ(x)) ← elegant, self-referential
ReLU: f(x) = max(0, x)
d/dx [ReLU] = 1 if x > 0, else 0 ← not differentiable at x=0; any value in [0, 1] is a valid subgradient there, and 0 is the usual convention
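A quick way to build trust in these two formulas is to compare the sigmoid rule against a finite difference. The function names, the test point `x = 0.7`, and the step `h` below are illustrative assumptions, and `relu_grad` simply adopts the convention of returning 0 at x = 0.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # sigma(x) * (1 - sigma(x))


def relu_grad(x):
    # Convention assumed here: use 0 at x == 0
    return 1.0 if x > 0 else 0.0


x, h = 0.7, 1e-6
finite_diff = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(sigmoid_grad(x), finite_diff)   # the two values should agree closely
print(sigmoid_grad(0.0))              # sigmoid's slope is largest at x = 0
```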

Second derivative — curvature

The second derivative f''(x) measures how the slope is changing — curvature of the function.

f''(x) > 0 at a critical point → local minimum (concave up, like a bowl)
f''(x) < 0 at a critical point → local maximum (concave down, like a hill)
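f''(x) = 0 at a critical point → the test is inconclusive: it could be a minimum (x⁴ at 0), a maximum, or an inflection point (x³ at 0, where the curvature changes sign)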

Second-order optimizers (Newton's method) use the Hessian matrix, the matrix of all second partial derivatives, to take more informed steps. Storing and inverting the Hessian is too expensive for large models, but near a minimum these methods typically need far fewer iterations than first-order methods.
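To see what "more informed steps" means, here is a one-dimensional sketch comparing a plain gradient step with a Newton step. The function f(x) = 3(x - 2)², the starting point, and the learning rate 0.1 are made up for illustration; in one dimension the Hessian is just f''(x).

```python
def grad(x):
    return 6.0 * (x - 2.0)       # f'(x) for f(x) = 3 * (x - 2)^2


def curvature(x):
    return 6.0                   # f''(x), constant for a quadratic


x0 = 10.0
gd_step = x0 - 0.1 * grad(x0)                  # first-order: fixed learning rate
newton_step = x0 - grad(x0) / curvature(x0)    # second-order: divide by curvature
print(gd_step, newton_step)
```

On a quadratic, the Newton step lands exactly on the minimum at x = 2, while the gradient step only moves part of the way. Real loss surfaces are not quadratic, so this is the best case, not the typical one.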

Interview Q & A

Q: What is a derivative and why is it important in ML?

A: A derivative f'(x) measures how much f changes when x changes by an infinitesimal amount — it's the instantaneous slope at point x. In ML, derivatives are essential because training is an optimization problem: we minimize a loss function by repeatedly moving in the direction of steepest descent. The derivative tells us the slope at our current parameter values, so we know which way to nudge each weight. If f'(x) is positive, the function increases as x increases, so we decrease x to go downhill.

Partial derivatives

Derivatives when you have multiple inputs — like millions of weights

Why — motivation

A neural network has millions of parameters. The loss function depends on all of them simultaneously. A regular derivative only handles one variable. Partial derivatives extend this: they let you ask "how does the loss change with respect to this one specific weight, while holding all others fixed?" This is exactly what backprop computes — a partial derivative for every single weight in the network.

Intuition — the mental model

Imagine a landscape where your position is described by two coordinates (x, y) and your altitude is f(x, y). A partial derivative ∂f/∂x asks: if I take one step in the x-direction only (keeping y frozen), how much does my altitude change? ∂f/∂y asks the same for y.

In ML: x and y are two weights. f is the loss. Partial derivatives tell you the slope in each weight's direction independently.

Explanation

How to compute a partial derivative

Treat all other variables as constants, then differentiate normally with respect to the target variable.

f(x, y) = 3x² + 2xy + y³
∂f/∂x: treat y as constant → 6x + 2y
∂f/∂y: treat x as constant → 2x + 3y²

The notation ∂ (curly d) distinguishes partial from total derivatives. Read ∂f/∂wᵢ as "the partial derivative of f with respect to wᵢ."
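The "freeze everything else" recipe can be checked numerically for the example f(x, y) = 3x² + 2xy + y³. The helper names, the evaluation point (1, 2), and the step size are assumptions chosen for this sketch.

```python
def f(x, y):
    return 3 * x**2 + 2 * x * y + y**3


def partial_x(x, y, h=1e-6):
    # Nudge x only; y stays frozen
    return (f(x + h, y) - f(x - h, y)) / (2 * h)


def partial_y(x, y, h=1e-6):
    # Nudge y only; x stays frozen
    return (f(x, y + h) - f(x, y - h)) / (2 * h)


x, y = 1.0, 2.0
print(partial_x(x, y), 6 * x + 2 * y)       # numeric vs. analytic 6x + 2y
print(partial_y(x, y), 2 * x + 3 * y**2)    # numeric vs. analytic 2x + 3y^2
```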

Partial derivative of a loss function

In a simple linear model ŷ = w·x + b, with squared-error loss on a single example (MSE is this averaged over the dataset):

L = (y - (wx + b))²
∂L/∂w = 2·(y - ŷ)·(-x) = -2x·(y - ŷ)
∂L/∂b = 2·(y - ŷ)·(-1) = -2·(y - ŷ)

These two partial derivatives tell you exactly how to update w and b to reduce the loss. This is the update rule for linear regression gradient descent.
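Plugging numbers into these two partial derivatives shows one update step in action. The starting weights, the single training example, and the learning rate `eta` are made-up values for this sketch.

```python
def loss(w, b, x, y):
    return (y - (w * x + b)) ** 2


def gradients(w, b, x, y):
    error = y - (w * x + b)
    return -2 * x * error, -2 * error    # (dL/dw, dL/db)


w, b = 0.0, 0.0          # illustrative starting point
x, y = 2.0, 5.0          # one made-up training example
eta = 0.05               # illustrative learning rate

before = loss(w, b, x, y)
dw, db = gradients(w, b, x, y)
w, b = w - eta * dw, b - eta * db        # step against the gradient
after = loss(w, b, x, y)
print(before, after)     # the loss should drop after one step
```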

In deep networks: ∂L/∂wᵢⱼ

A deep network has parameters w₁, w₂, ..., wₙ (millions of them). Backprop computes ∂L/∂wᵢ for every single weight. Each tells you: nudge this weight in this direction by this much to reduce the loss. The collection of all these partial derivatives is the gradient — covered next.

Interview Q & A

Q: What is a partial derivative and how is it used in neural network training?

A: A partial derivative ∂f/∂xᵢ measures how f changes when xᵢ changes, while all other variables are held constant. In neural network training, the loss depends on millions of weights simultaneously. We compute ∂L/∂wᵢ for every weight wᵢ. Backpropagation efficiently computes all these partial derivatives in one backward pass using the chain rule. Each ∂L/∂wᵢ then tells the optimizer which direction to nudge that weight.

Chain rule

The mathematical engine behind backpropagation

Why — motivation

A neural network is a composition of functions — layer 1 feeds into layer 2, feeds into layer 3, and so on. To compute how the loss at the end depends on weights at layer 1, you need the chain rule. Backpropagation is literally "apply the chain rule through a computational graph." This is the single most important rule to understand if you want to explain how neural networks learn.

Intuition — the mental model

The chain rule handles composed functions: if y depends on u, and u depends on x, then how does y change when x changes? Answer: multiply the rates. If u increases by 2 when x increases by 1, and y increases by 3 for every unit increase in u, then y increases by 6 when x increases by 1. You chain the rates together by multiplying.

In a neural network: loss depends on the output layer, which depends on the hidden layer, which depends on the weights. The chain rule lets you trace this dependency all the way back.

Explanation

The rule

If y = f(u) and u = g(x), then:

dy/dx = (dy/du) · (du/dx)

The derivative of the outer function times the derivative of the inner function. For deeper compositions:

y = f(g(h(x)))
dy/dx = f'(g(h(x))) · g'(h(x)) · h'(x)

A chain of multiplications — one factor per function in the composition.

Concrete example

f(x) = (3x + 1)⁴
Let u = 3x + 1, so f = u⁴
df/dx = (df/du) · (du/dx) = 4u³ · 3 = 12(3x+1)³
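The same example can be verified in a few lines. This is a sketch with assumed names (`df_chain_rule`) and an arbitrary test point x = 0.5; it compares the chain-rule answer to a finite-difference estimate.

```python
def f(x):
    return (3 * x + 1) ** 4


def df_chain_rule(x):
    u = 3 * x + 1
    return 4 * u**3 * 3          # (df/du) * (du/dx)


x, h = 0.5, 1e-6
finite_diff = (f(x + h) - f(x - h)) / (2 * h)
print(df_chain_rule(x), finite_diff)   # should agree to several digits
```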

Chain rule through a two-layer network

z₁ = w₁·x (linear layer 1)
a₁ = σ(z₁) (activation: sigmoid)
z₂ = w₂·a₁ (linear layer 2)
L = (y - z₂)² (MSE loss)

To find ∂L/∂w₁ (how loss depends on weight in layer 1):

∂L/∂w₁ = (∂L/∂z₂) · (∂z₂/∂a₁) · (∂a₁/∂z₁) · (∂z₁/∂w₁)
       = -2(y-z₂) · w₂ · σ(z₁)(1-σ(z₁)) · x

Each factor is a local derivative that is easy to compute at its own node. Backprop starts at the loss (the first factor in the product) and works back toward the input, accumulating the product as it goes. This is the core mechanism of backpropagation.
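Here is the two-layer chain written out factor by factor, then checked against a finite difference. The weights, input, and target are made-up numbers, and the function names are assumptions for this sketch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w1, w2, x, y):
    z1 = w1 * x
    a1 = sigmoid(z1)
    z2 = w2 * a1
    return z1, a1, z2, (y - z2) ** 2


def grad_w1(w1, w2, x, y):
    z1, a1, z2, _ = forward(w1, w2, x, y)
    dL_dz2 = -2 * (y - z2)       # loss with respect to the output
    dz2_da1 = w2                 # linear layer 2
    da1_dz1 = a1 * (1 - a1)      # sigmoid derivative
    dz1_dw1 = x                  # linear layer 1
    return dL_dz2 * dz2_da1 * da1_dz1 * dz1_dw1


w1, w2, x, y = 0.5, -1.5, 2.0, 1.0    # made-up values
h = 1e-6
numeric = (forward(w1 + h, w2, x, y)[3] - forward(w1 - h, w2, x, y)[3]) / (2 * h)
analytic = grad_w1(w1, w2, x, y)
print(analytic, numeric)
```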

Why multiplying many numbers is dangerous

Backprop multiplies many numbers together as it propagates backward. Two problems arise:

Vanishing gradient

If each factor is < 1 (e.g. sigmoid derivative ≤ 0.25), multiplying many → exponentially small gradient → early layers learn extremely slowly. Mitigated by: ReLU activations, batch norm, residual connections.

Exploding gradient

If each factor is > 1, multiplying many → exponentially large gradient → unstable training, NaN loss. Mitigated by: gradient clipping (cap the gradient norm to a max value).
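Both failure modes, and the clipping fix, are easy to see with plain arithmetic. The depth of 20, the factor 1.5, and the helper names are illustrative assumptions; `clip_by_norm` is a hand-written sketch of norm clipping, not a framework function.

```python
def product_of_factors(factor, depth):
    """Multiply the same local derivative `depth` times."""
    result = 1.0
    for _ in range(depth):
        result *= factor
    return result


def clip_by_norm(grads, max_norm):
    """Rescale a list of gradients so its L2 norm is at most max_norm."""
    norm = sum(g * g for g in grads) ** 0.5
    if norm <= max_norm:
        return list(grads)
    return [g * max_norm / norm for g in grads]


vanishing = product_of_factors(0.25, 20)   # sigmoid's best case, 20 layers
exploding = product_of_factors(1.5, 20)
clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)   # norm 5 rescaled to norm 1
print(vanishing, exploding, clipped)
```

Clipping keeps the direction of the gradient and only shrinks its length, which is why it tames exploding gradients without changing where the step points.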

Interview Q & A

Q: What is the chain rule and how does it enable backpropagation?

A: The chain rule states that for composed functions, the derivative is the product of derivatives at each step: dy/dx = (dy/du)·(du/dx). A neural network is a composition of functions — each layer transforms the previous layer's output. To compute how the loss depends on weights deep in the network, we apply the chain rule repeatedly backward through the layers, multiplying local derivatives at each step. This is backpropagation. The risk is vanishing gradients: if local derivatives are small (like sigmoid's max of 0.25), the product across many layers shrinks toward zero, making early layers impossible to train. ReLU and residual connections are the standard fixes.

Gradient

The vector of all partial derivatives — direction of steepest ascent

Why — motivation

The gradient is what every optimizer actually uses. When you call loss.backward() in PyTorch, it computes the gradient of the loss with respect to every parameter. The gradient vector is the complete answer to "which direction makes the loss increase the fastest?" — and you walk in the opposite direction to minimize it.

Intuition — the mental model

Think of the gradient as a compass for a hilly landscape with millions of dimensions. It points uphill — in the direction of steepest increase. Flip it and you get the direction of steepest descent. Each component tells you the slope in one parameter's direction.

The gradient always points perpendicular to contour lines (lines of equal loss). To descend most efficiently, follow the negative gradient.

Explanation

Definition

For a function f(w₁, w₂, ..., wₙ) with n parameters, the gradient is a vector of all partial derivatives:

∇f = [∂f/∂w₁, ∂f/∂w₂, ..., ∂f/∂wₙ]

The gradient has the same shape as the parameter vector — one number per parameter. For a network with 10 million weights, the gradient is a 10-million-dimensional vector.

Key properties

∇f at a point points in the direction of steepest increase of f
−∇f points in the direction of steepest decrease — this is the descent direction
The magnitude ||∇f|| tells you how steep the slope is at that point
At a minimum (or maximum or saddle): ∇f = 0 — all partial derivatives are zero
The gradient is perpendicular to the level curves (contour lines) of f
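The first two properties can be sanity-checked on a toy function. The function f(w₁, w₂) = w₁² + 3w₂², the starting point, and the step size are assumptions picked for this sketch.

```python
def f(w):
    return w[0] ** 2 + 3 * w[1] ** 2


def gradient(w):
    return [2 * w[0], 6 * w[1]]     # [df/dw1, df/dw2]


w = [1.0, 1.0]
g = gradient(w)
step = 0.01
downhill = [wi - step * gi for wi, gi in zip(w, g)]   # along -grad
uphill = [wi + step * gi for wi, gi in zip(w, g)]     # along +grad
print(f(downhill), f(w), f(uphill))   # values should increase left to right
```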

Gradient of cross-entropy loss

For logistic regression with sigmoid output σ(z) and binary cross-entropy loss:

L = -[y·log(σ(z)) + (1-y)·log(1-σ(z))]
∂L/∂z = σ(z) - y ← prediction minus true label

Beautifully simple: the gradient is just how wrong your prediction is. If σ(z) = 0.9 and y = 0, gradient = 0.9, which is large because you were very wrong. This is one reason cross-entropy + sigmoid is the standard pairing for binary classification.
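The "prediction minus label" result can be confirmed without doing the algebra. This sketch uses an arbitrary logit `z = 2.0` with label `y = 0` and compares σ(z) - y to a finite-difference estimate; all names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(z, y):
    """Binary cross-entropy as a function of the logit z."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


z, y, h = 2.0, 0.0, 1e-6
analytic = sigmoid(z) - y
numeric = (bce(z + h, y) - bce(z - h, y)) / (2 * h)
print(analytic, numeric)   # should agree to several digits
```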

Jacobian & Hessian

Jacobian matrix

Generalization of the gradient when the output is also a vector. Matrix of all partial derivatives of all outputs w.r.t. all inputs. Shape: (m×n) for m outputs, n inputs. Used in autograd frameworks.

Hessian matrix

Matrix of all second partial derivatives. Captures curvature. Used by second-order optimizers. Too expensive for large neural nets — n² entries for n parameters.

Interview Q & A

Q: What is the gradient and what does it represent geometrically?

A: The gradient ∇L is a vector containing the partial derivative of the loss with respect to every parameter — ∂L/∂w₁, ∂L/∂w₂, etc. Geometrically it points in the direction of steepest increase of the loss in parameter space. Its magnitude tells you how steep the slope is. For optimization, we move in the negative gradient direction: each parameter update is w ← w − η·∂L/∂w, where η is the learning rate. When the gradient is zero, we're at a critical point — a minimum, maximum, or saddle point.

Gradient descent & optimizers

The algorithm that trains every neural network

Why — motivation

This is the training algorithm. Everything before this was building up to it. Gradient descent is how weights get updated, how models improve over epochs, how a random initialization becomes a useful model. Understanding it, and why Adam often converges faster than vanilla SGD, is a common expectation in ML interviews, from junior to senior.

Intuition — the mental model

Imagine you're blindfolded on a hilly landscape and want to reach the lowest point. You can only feel the slope under your feet. Strategy: take a small step in the downhill direction, feel the slope again, take another step. Repeat. That's gradient descent. The learning rate is your step size — too large and you overshoot valleys; too small and you take forever.

Explanation

The update rule

At each step, every parameter moves a small amount opposite to its gradient:

w ← w − η · ∂L/∂w

where:
w = current parameter value
η = learning rate (step size, e.g. 0.001)
∂L/∂w = gradient of loss w.r.t. this parameter

If ∂L/∂w is positive (increasing w increases loss) → subtract → decrease w → reduce loss. The sign always works out correctly.
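Repeating the update rule in a loop is all gradient descent is. The sketch below minimizes a made-up loss L(w) = (w - 3)², with an assumed learning rate of 0.1 and 100 steps.

```python
def grad(w):
    return 2 * (w - 3.0)     # dL/dw for L(w) = (w - 3)^2


w = 0.0                      # arbitrary starting value
eta = 0.1                    # illustrative learning rate
for step in range(100):
    w = w - eta * grad(w)    # w <- w - eta * dL/dw
print(w)                     # should end up very close to 3
```

With this loss, each step shrinks the distance to the minimum by a factor of 0.8. Setting `eta` above 1.0 makes that factor exceed 1 in magnitude, and the iterates diverge.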

Three variants — batch, mini-batch, stochastic

Batch GD: compute gradient on ALL training data → one update
  accurate but slow, can't fit large datasets in memory
Stochastic GD (SGD): compute gradient on ONE random sample → update
  fast, noisy, can escape local minima, poor GPU util
Mini-batch GD: compute gradient on a BATCH (e.g. 32 or 256 samples)
  balance of accuracy and speed, standard in practice
  "SGD" in frameworks usually means mini-batch SGD

Smaller batch = noisier gradients (can help generalization). Larger batch = more stable, better GPU utilization, may converge to sharp minima that generalize worse.
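A mini-batch loop looks like this in plain Python. Everything here is an assumption for illustration: the synthetic data follows y = 2x + 1 with no noise, the batch size is 8, and the learning rate and epoch count were picked so the toy problem converges.

```python
import random

random.seed(0)
xs = [i / 32 - 1.0 for i in range(64)]       # 64 inputs in [-1, 1)
data = [(x, 2.0 * x + 1.0) for x in xs]      # made-up targets: y = 2x + 1

w, b = 0.0, 0.0
eta, batch_size = 0.1, 8

for epoch in range(200):
    random.shuffle(data)                     # new batch composition each epoch
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Average the per-example gradients over the mini-batch
        dw = sum(-2 * x * (y - (w * x + b)) for x, y in batch) / len(batch)
        db = sum(-2 * (y - (w * x + b)) for x, y in batch) / len(batch)
        w, b = w - eta * dw, b - eta * db

print(w, b)   # should approach w = 2, b = 1
```

Setting `batch_size = len(data)` turns this into batch GD, and `batch_size = 1` turns it into stochastic GD.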

Learning rate — the most important hyperparameter

Too high (η too large)

Steps overshoot the minimum. Loss oscillates or diverges. NaN loss is often caused by exploding gradients amplified by a high learning rate.

Too low (η too small)

Training is very slow. Risk of getting stuck in a poor local minimum or plateau. Takes too many epochs to converge.

Common practice: use a learning rate scheduler — start high, decay over time (step decay, cosine annealing, warmup + decay). Warmup is critical for Transformers.

Adam optimizer — why it's the default

Vanilla SGD uses the same learning rate for every parameter. Adam adapts the learning rate per parameter using running estimates of the gradient's mean and its uncentered variance (the mean of the squared gradient):

m_t = β₁·m_{t-1} + (1-β₁)·g_t ← 1st moment: running mean
v_t = β₂·v_{t-1} + (1-β₂)·g_t² ← 2nd moment: running mean of grad²
m̂_t = m_t / (1-β₁ᵗ), v̂_t = v_t / (1-β₂ᵗ) ← bias-corrected estimates
w ← w − η · m̂_t / (√v̂_t + ε)

β₁ = 0.9, β₂ = 0.999, ε = 1e-8 (typical defaults)

Parameters with consistently large gradients get a smaller effective LR (denominator grows)
Parameters with small or noisy gradients get a relatively larger effective LR
This adaptive behavior makes Adam robust to different parameter scales and sparse gradients
AdamW applies weight decay directly to the weights instead of adding an L2 term to the gradient, so the decay is not rescaled by Adam's adaptive denominator. It is preferred for Transformers.
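The Adam update fits in a dozen lines for a single scalar parameter. This is a minimal sketch, not any framework's implementation: the function name `adam_minimize`, the toy loss (w - 3)², the learning rate 0.1, and the step count are assumptions, and the bias correction uses the standard form of dividing by 1 - βᵗ.

```python
import math


def adam_minimize(grad_fn, w, steps, eta=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g          # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g      # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)             # bias correction (m starts at 0)
        v_hat = v / (1 - beta2 ** t)             # bias correction (v starts at 0)
        w = w - eta * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Toy loss L(w) = (w - 3)^2, so dL/dw = 2 * (w - 3)
w_final = adam_minimize(lambda w: 2 * (w - 3.0), w=0.0, steps=500)
print(w_final)   # should settle near 3
```

Notice that the very first step has size close to `eta` regardless of how large the gradient is, because m̂ and √v̂ are both computed from the same single gradient.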

Interview Q & A

Q: What is the difference between SGD and Adam, and when would you use each?

A: SGD applies the same learning rate to every parameter update. Adam maintains per-parameter running averages of the gradient and its square, adapting the effective learning rate for each weight — parameters with large consistent gradients get smaller updates, and vice versa. Adam converges faster and is more robust to learning rate choice, making it the default for most deep learning tasks. SGD with momentum can sometimes find flatter, better-generalizing minima and is preferred in vision models like ResNets. For Transformers, AdamW (Adam + decoupled weight decay) is standard.

Backpropagation

The algorithm that makes gradient descent practical

Why — motivation

Gradient descent needs the gradient of the loss with respect to every weight. Estimating each partial derivative independently, for example by nudging one weight at a time and re-running the network, would take millions of forward passes for a network with millions of weights. Backpropagation computes all gradients in a single backward pass by reusing intermediate computations. This is why deep learning is computationally feasible at all.

Intuition — the mental model

Think of backprop as a blame assignment algorithm. The network makes a prediction, computes a loss, then works backward through each layer asking: "how much did this weight contribute to the error?" Weights that contributed more get a larger gradient — they're more responsible and need a larger correction.

The key insight: each layer only needs to know two things — the error signal coming from the layer above, and its own local derivative. It can compute its contribution without knowing anything about the rest of the network.

Explanation

Forward pass vs backward pass

Forward pass: input → layer 1 → layer 2 → ... → output → loss
  (compute activations and cache them)
Backward pass: loss → layer N → layer N-1 → ... → layer 1
  (compute gradients using chain rule + cached activations)

The forward pass caches intermediate activations because the backward pass needs them to compute local derivatives. This is why memory usage scales with depth — you need to store all activations until the backward pass is complete.

The four equations of backprop

For layer l with weights Wˡ, bias bˡ, input aˡ⁻¹ (the previous layer's activation, with a⁰ = x), pre-activation zˡ = Wˡaˡ⁻¹ + bˡ, and activation aˡ = σ(zˡ), where the superscript L marks the output layer:

δᴸ = ∇ₐL ⊙ σ'(zᴸ) ← error at output layer
δˡ = ((Wˡ⁺¹)ᵀ · δˡ⁺¹) ⊙ σ'(zˡ) ← error propagated backward
∂L/∂Wˡ = δˡ · (aˡ⁻¹)ᵀ ← gradient for weights
∂L/∂bˡ = δˡ ← gradient for biases

⊙ is element-wise multiplication. Each layer passes δˡ backward to the layer below — this is the "error signal." Each layer also computes its own weight gradients from δˡ and its input activations.
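The four equations translate almost line for line into code. The sketch below uses plain Python lists and makes several assumptions: sigmoid activations on every layer, a loss of ½·Σ(a - y)² so that ∇ₐL = a - y, and made-up weights. It ends with a finite-difference check on one first-layer weight.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Return the cached activations of every layer, starting with the input."""
    acts = [x]
    for W, b in params:
        z = [sum(w * a for w, a in zip(row, acts[-1])) + bi
             for row, bi in zip(W, b)]
        acts.append([sigmoid(zi) for zi in z])
    return acts


def loss(params, x, y):
    out = forward(params, x)[-1]
    return 0.5 * sum((o - t) ** 2 for o, t in zip(out, y))


def backward(params, x, y):
    acts = forward(params, x)
    # sigma'(z) = a * (1 - a), so the cached activations are enough.
    # Equation 1: delta at the output layer (grad of loss wrt a is a - y).
    delta = [(a - t) * a * (1 - a) for a, t in zip(acts[-1], y)]
    grads = []
    for layer in reversed(range(len(params))):
        a_prev = acts[layer]
        # Equations 3 and 4: weight and bias gradients for this layer.
        grads.append(([[d * a for a in a_prev] for d in delta], list(delta)))
        if layer > 0:
            W = params[layer][0]
            # Equation 2: push delta back through W^T, then through sigma'.
            back = [sum(W[j][i] * delta[j] for j in range(len(delta)))
                    for i in range(len(a_prev))]
            delta = [bk * a * (1 - a) for bk, a in zip(back, a_prev)]
    return list(reversed(grads))


params = [
    ([[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1]),   # layer 1: 2 inputs -> 2 units
    ([[0.5, -0.6]], [0.2]),                    # layer 2: 2 units -> 1 output
]
x, y = [1.0, 2.0], [1.0]
grads = backward(params, x, y)

# Finite-difference check on one weight of the first layer
h = 1e-6
params[0][0][0][1] += h
up = loss(params, x, y)
params[0][0][0][1] -= 2 * h
down = loss(params, x, y)
params[0][0][0][1] += h                        # restore the original weight
numeric = (up - down) / (2 * h)
print(grads[0][0][0][1], numeric)
```

The backward loop touches each weight once, which is the source of the "one backward pass for all gradients" efficiency, whereas the finite-difference check needs two extra forward passes per weight.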

Computational graph & automatic differentiation

Modern frameworks (PyTorch, JAX) implement backprop via automatic differentiation on a computational graph. Every operation in the forward pass registers a backward function. When you call loss.backward(), PyTorch traverses the graph in reverse and accumulates gradients automatically. In practice you rarely write backprop by hand.

requires_grad=True: tells PyTorch to track operations on this tensor
loss.backward(): triggers the backward pass, populates .grad on all tracked tensors
optimizer.step(): applies the update rule using the accumulated gradients
optimizer.zero_grad(): clears gradients before the next forward pass (critical — PyTorch accumulates by default)
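To make "every operation registers a backward function" concrete, here is a toy reverse-mode autodiff for scalars in plain Python. It is a teaching sketch only: the `Value` class and its attributes are hypothetical and are not PyTorch's API, though the accumulate-into-`.grad` behavior mirrors why clearing gradients between steps matters.

```python
class Value:
    """A scalar that remembers how it was computed (illustrative only)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None   # replaced by the op that creates this node

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Order the graph so parents come first, then run it in reverse
        order, seen = [], set()

        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()


w, x, b = Value(2.0), Value(3.0), Value(1.0)
loss = w * x + b          # forward pass builds the graph
loss.backward()           # reverse traversal fills in .grad
first = (w.grad, x.grad, b.grad)
print(first)

loss = w * x + b
loss.backward()           # gradients were not zeroed, so they accumulate
print(w.grad)
```

The second call doubles `w.grad` because each backward function adds to `.grad` instead of overwriting it, which is the same reason a training loop clears gradients before the next forward pass.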

Interview Q & A

Q: Explain backpropagation. Why is it efficient?

A: Backpropagation is the algorithm for computing the gradient of the loss with respect to every weight in the network. It works by applying the chain rule backward through the computational graph. The key efficiency: it computes all N gradients in a single backward pass — cost proportional to one forward pass — rather than N separate forward passes (one per weight). It reuses intermediate computations by passing an error signal δˡ backward layer by layer, so each layer only needs its local derivative and the incoming error from above. This makes training networks with millions of parameters computationally feasible.

Convexity

Why optimization is easy in theory and hard in practice

Why — motivation

Convexity determines whether gradient descent is guaranteed to find the global minimum. Linear regression and logistic regression have convex loss functions, so with a suitable learning rate gradient descent converges to the best solution. Neural networks do not: their loss surfaces are riddled with local minima, saddle points, and flat regions. Understanding convexity explains why training a neural net is fundamentally harder than training a linear model, and why we rely on tricks like momentum, normalization, and good initialization.

Intuition — the mental model

A convex function is shaped like a bowl: for any two points on the curve, the line segment between them lies above (or on) the curve. There's only one valley, so wherever you start gradient descent, you'll end up at the same global minimum.

A non-convex function is like a mountain range — full of valleys, peaks, and flat plateaus. Gradient descent can get trapped in a shallow local valley far from the deepest point. This is the landscape neural networks live in.

Explanation

Formal definition

A function f is convex if for any two points x, y and any λ ∈ [0, 1]:

f(λx + (1-λ)y) ≤ λf(x) + (1-λ)f(y)

The function value at any interpolated point is at most the interpolated function value — the chord never dips below the curve.

Equivalently, for a twice-differentiable f: f is convex if and only if f''(x) ≥ 0 everywhere (non-negative second derivative, concave up). For multivariate functions: the Hessian matrix is positive semi-definite everywhere.
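The chord inequality can be tested directly on a pair of points. This sketch (the helper name, the 50 sample points, and the small tolerance are assumptions) can only find counterexamples; passing for one pair does not prove a function is convex.

```python
import math


def violates_convexity(f, x, y, steps=50):
    """Return True if the curve rises above the chord somewhere between x and y."""
    for i in range(steps + 1):
        lam = i / steps
        point = lam * x + (1 - lam) * y
        chord = lam * f(x) + (1 - lam) * f(y)
        if f(point) > chord + 1e-12:     # small tolerance for round-off
            return True
    return False


square_fails = violates_convexity(lambda t: t * t, -2.0, 3.0)
sine_fails = violates_convexity(math.sin, 0.0, math.pi)
print(square_fails)   # x^2 is convex, so no violation is found
print(sine_fails)     # sin bulges above its chord on [0, pi]
```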

Convex vs non-convex in ML

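Convex: linear regression (MSE loss), logistic regression (cross-entropy), SVMs, Lasso/Ridge. With a suitable learning rate, gradient descent converges to the global optimum (Lasso's L1 penalty is not differentiable at zero, so it needs a subgradient or proximal variant).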
Non-convex: neural networks (any depth ≥ 2 with nonlinear activations) — loss surface has exponentially many critical points
Why non-convex networks still train well: most local minima in high-dimensional spaces are actually close in loss value to the global minimum. Saddle points (not local minima) are the real challenge — gradient descent can stall near them.

Critical points on the loss surface

Local minimum: ∇f = 0, Hessian is positive definite (all eigenvalues > 0)
Local maximum: ∇f = 0, Hessian is negative definite (all eigenvalues < 0)
Saddle point: ∇f = 0, Hessian has mixed eigenvalues (some + and some -)
Plateau: ∇f ≈ 0 over a wide region, so gradient descent stalls
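For two parameters, the eigenvalue test is small enough to write by hand. The sketch below assumes a symmetric 2×2 Hessian with entries `hxx`, `hxy`, `hyy` (hypothetical names) and uses the closed-form eigenvalues of such a matrix.

```python
import math


def classify_critical_point(hxx, hxy, hyy):
    """Classify a critical point from a symmetric 2x2 Hessian."""
    mean = (hxx + hyy) / 2
    radius = math.sqrt(((hxx - hyy) / 2) ** 2 + hxy ** 2)
    low, high = mean - radius, mean + radius      # the two eigenvalues
    if low > 0:
        return "local minimum"
    if high < 0:
        return "local maximum"
    if low < 0 < high:
        return "saddle point"
    return "inconclusive"                         # at least one zero eigenvalue


print(classify_critical_point(2.0, 0.0, 2.0))     # f = x^2 + y^2 at the origin
print(classify_critical_point(2.0, 0.0, -2.0))    # f = x^2 - y^2 at the origin
print(classify_critical_point(-2.0, 0.0, -2.0))   # f = -x^2 - y^2 at the origin
```

With n parameters there are n eigenvalues, and a minimum needs every one of them to be positive, which hints at why saddle points dominate in high dimensions.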

In high dimensions (millions of parameters), saddle points are overwhelmingly more common than local minima. The local minima that do exist in a deep network tend to have loss close to the global minimum, while most high-loss critical points are saddle points that can be escaped. This is a large part of why neural networks can be trained at all despite non-convexity.

Practical implications

Momentum (in SGD with momentum, Adam): accumulates velocity across steps — helps roll through saddle points and flat regions instead of stalling
Batch noise: the randomness of mini-batch gradients helps escape sharp local minima and saddle points
Learning rate warmup: starts with small steps to avoid overshooting early in training when gradients are large
Good initialization: Xavier/He init sets weights so gradients are well-scaled at the start — bad initialization can land near flat or explosive regions of the loss surface

Interview Q & A

Q: Is the neural network loss function convex? Why does it matter?

A: No — neural network loss functions are non-convex for any network with depth ≥ 2 and nonlinear activations. This matters because gradient descent is only guaranteed to find a global minimum on convex functions. On non-convex surfaces, it can get trapped in local minima or stall at saddle points. In practice, this is less catastrophic than it sounds: in high-dimensional parameter spaces, most critical points are saddle points rather than local minima, and local minima tend to have similar loss values to the global minimum. Techniques like momentum, batch noise, and good initialization help navigate the non-convex landscape effectively.