Probability

Uncertainty & inference — how models express and reason about what they don't know.

Topic overview

Use probability to reason about uncertainty: events, conditional probability, Bayes' theorem, distributions, likelihood, and posterior beliefs.

Core concepts

Understand joint, marginal, and conditional probability; independence; Bayes' theorem; common distributions; expectation; variance; MLE; and MAP.

Why it matters

Probability is the foundation for classification confidence, uncertainty estimates, generative models, Bayesian reasoning, and evaluating risk.

Interview relevance

Probability questions reveal whether you can reason from first principles under uncertainty instead of only applying formulas.

Basic probability rules

Joint, conditional, marginal — the vocabulary of uncertainty

Why — motivation

Probability is the language of uncertainty, and ML is fundamentally about making decisions under uncertainty. Every model output that is a "probability" — sigmoid, softmax, language model next-token scores — requires you to understand what a probability actually means and how probabilities relate to each other.

Without this foundation you cannot understand Naive Bayes, Bayesian networks, generative models, or even why cross-entropy loss is the right choice for classification. These rules are also tested directly in DS/ML interviews as short warm-up problems.

Intuition — the mental model

Think of probability as a budget of belief — you have 1.0 to distribute among all possible outcomes. Joint probability asks: how much of that budget goes to two things happening together? Conditional asks: if I already know B happened, how do I redistribute the remaining budget over A? Marginal asks: ignoring everything else, what's the total budget on A?

The three are connected like different views of the same underlying table of counts.

Explanation

The three core probabilities

P(A)      marginal: probability that A happens, ignoring everything else
P(A ∩ B)  joint: probability that A AND B both happen
P(A|B)    conditional: probability of A, GIVEN that B already happened

Key relation: P(A|B) = P(A ∩ B) / P(B)
Intuition: of all the times B happened, how often did A also happen?
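To make the "same table, different views" idea concrete, here is a minimal sketch that reads all three quantities off one table of counts. The counts and the helper names are made up for illustration.

```python
# Hypothetical counts of 1000 emails, keyed by (word presence, label).
counts = {
    ("win", "spam"): 8,
    ("win", "ham"): 42,
    ("no_win", "spam"): 2,
    ("no_win", "ham"): 948,
}
total = sum(counts.values())

def joint(word, label):
    """P(word ∩ label): the share of all emails in one cell of the table."""
    return counts[(word, label)] / total

def marginal_label(label):
    """P(label): sum the cells over every value of the other variable."""
    return sum(c for (w, l), c in counts.items() if l == label) / total

def marginal_word(word):
    """P(word): sum the cells over every label."""
    return sum(c for (w, l), c in counts.items() if w == word) / total

def conditional_label_given_word(label, word):
    """P(label | word) = P(word ∩ label) / P(word)."""
    return joint(word, label) / marginal_word(word)

print(joint("win", "spam"))                         # joint
print(marginal_label("spam"))                       # marginal
print(conditional_label_given_word("spam", "win"))  # conditional, about 0.16
```

Conditioning only rescales one row of the table so that it sums to 1; no new information is needed beyond the counts.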

Law of total probability

If B can take several values (B₁, B₂, ..., Bₙ) that partition the sample space:

P(A) = P(A|B₁)·P(B₁) + P(A|B₂)·P(B₂) + ... + P(A|Bₙ)·P(Bₙ)
     = Σᵢ P(A|Bᵢ)·P(Bᵢ)

This lets you compute a marginal by summing over all cases of a conditioning variable. Example: P(spam) = P(spam|"win")·P("win") + P(spam|no "win")·P(no "win").
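The spam example can be sketched in a few lines. The function name and the four input numbers are assumptions chosen for illustration, not measured rates.

```python
def total_probability(cond_probs, partition_probs):
    """P(A) = sum over i of P(A | B_i) * P(B_i), for a partition B_1..B_n."""
    if abs(sum(partition_probs) - 1.0) > 1e-9:
        raise ValueError("partition probabilities must sum to 1")
    return sum(p_a_given_b * p_b
               for p_a_given_b, p_b in zip(cond_probs, partition_probs))

# Assumed inputs: P(spam | "win") = 0.16, P(spam | no "win") = 0.002,
# P("win") = 0.05, P(no "win") = 0.95.
p_spam = total_probability([0.16, 0.002], [0.05, 0.95])
print(round(p_spam, 4))  # 0.0099
```

The check that the partition sums to 1 is the code version of "the Bᵢ must cover the whole sample space without overlap".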

Independence vs mutual exclusivity — the classic confusion

Independent: P(A ∩ B) = P(A) · P(B)
Knowing B tells you NOTHING about A. Example: two separate coin flips.

Mutually exclusive: P(A ∩ B) = 0
A and B CANNOT both happen. Example: heads and tails on a single flip.

CRITICAL: mutually exclusive events (with P(A), P(B) > 0) are NOT independent.
If P(A ∩ B) = 0 then P(A|B) = 0 ≠ P(A) → knowing B happened tells you A definitely did NOT happen.
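A quick way to internalise the difference is to test both definitions on one roll of a fair die. This is an illustrative sketch; the events and helper names are chosen here and use exact fractions so the equality checks are not affected by rounding.

```python
from fractions import Fraction

OUTCOMES = frozenset(range(1, 7))  # one roll of a fair six-sided die

def prob(event):
    """Probability of an event given as a set of die faces."""
    return Fraction(len(event & OUTCOMES), len(OUTCOMES))

def independent(a, b):
    return prob(a & b) == prob(a) * prob(b)

def mutually_exclusive(a, b):
    return prob(a & b) == 0

even = {2, 4, 6}
at_most_four = {1, 2, 3, 4}
odd = {1, 3, 5}

# "even" and "at most four": P(both) = 1/3 = 1/2 * 2/3, so independent.
print(independent(even, at_most_four), mutually_exclusive(even, at_most_four))
# "even" and "odd": they can never co-occur, so they are strongly dependent.
print(independent(even, odd), mutually_exclusive(even, odd))
```

The first pair prints `True False` and the second prints `False True`, which is the distinction the rule above describes.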

Interview Q & A

Q: What is the difference between independent events and mutually exclusive events?

A: Independent events do not influence each other — P(A|B) = P(A), equivalently P(A∩B) = P(A)·P(B). Mutually exclusive events cannot both occur — P(A∩B) = 0. These are very different. If two events each have positive probability and are mutually exclusive, they cannot be independent: knowing one happened tells you the other definitely did not — so they are dependent.

Bayes' theorem

Updating belief with evidence — the foundation of probabilistic ML

Why — motivation

Bayes' theorem is arguably the most important single equation in ML. It tells you how to rationally update your beliefs when you get new evidence. Naive Bayes, Bayesian optimisation (used in AutoML and hyperparameter tuning), variational inference, and MAP estimation are all direct applications.

Interviewers ask this in two ways: directly ("state Bayes' theorem") and indirectly ("how does Naive Bayes work?", "what is a prior?", "what is MAP?"). All roads lead back here.

Intuition — the mental model

Before seeing any evidence, you have a prior belief — your best guess before the data arrives. Then you observe something (evidence). Bayes' theorem tells you how to update your prior into a posterior — your new, refined belief after incorporating the evidence.

The update is multiplicative: you scale your prior by how likely the evidence is under each hypothesis, then renormalise. Hypotheses that predict the evidence well get their probability boosted; those that predict it poorly get shrunk.

Explanation

The formula

P(H|E) = P(E|H) · P(H) / P(E)

where:
H      = hypothesis (what we want to know)
E      = evidence (what we observed)
P(H)   = prior: belief in H BEFORE seeing E
P(E|H) = likelihood: how probable is E if H were true
P(E)   = evidence: total probability of E (normalisation constant)
P(H|E) = posterior: updated belief in H AFTER seeing E

Concrete example — spam detection

H = "email is spam"
E = "email contains the word 'win'"

P(spam) = 0.01        (prior: 1% of all emails are spam)
P("win"|spam) = 0.80  (likelihood: 80% of spam emails contain "win")
P("win") = 0.05       (evidence: 5% of all emails contain "win")

P(spam|"win") = (0.80 × 0.01) / 0.05 = 0.008 / 0.05 = 0.16

Seeing "win" updates the spam probability from 1% → 16%. The word is suspicious but not conclusive; more features are needed.
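The same arithmetic as a small function, as a sketch. The function name and the input checks are additions for illustration; the three numbers are the ones from the example.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)."""
    if not 0.0 < evidence <= 1.0:
        raise ValueError("P(E) must be in (0, 1]")
    if likelihood * prior > evidence:
        # P(E ∩ H) can never exceed P(E); the inputs are inconsistent.
        raise ValueError("P(E|H) * P(H) cannot exceed P(E)")
    return likelihood * prior / evidence

p_spam_given_win = posterior(prior=0.01, likelihood=0.80, evidence=0.05)
print(round(p_spam_given_win, 4))  # 0.16
```

The second check is worth remembering in interviews: a quoted P(E) smaller than P(E|H)·P(H) signals that the numbers in the question do not describe a valid probability model.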

Naive Bayes classifier — Bayes in action

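Naive Bayes classifies by computing the posterior of each class given all features. The "naive" assumption is that all features are independent given the class, which reduces the joint likelihood to a product of per-feature terms:

P(class | feature₁, ..., featureₙ) ∝ P(class) · Πᵢ P(featureᵢ | class)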

Despite the independence assumption being almost always wrong, Naive Bayes works surprisingly well for text — because even with wrong probabilities, the ranking of classes is often correct.
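Here is a minimal from-scratch sketch of a word-count Naive Bayes classifier. The four training documents are made up, and two details are assumptions not discussed above: add-one (Laplace) smoothing so unseen words do not zero out a class, and summing log-probabilities instead of multiplying probabilities.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (words, label). Returns the counts the model needs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict(words, class_counts, word_counts, vocab):
    """Return (best_label, scores), where scores are unnormalised log-posteriors."""
    n_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / n_docs)        # log prior
        total_words = sum(word_counts[label].values())
        for w in words:
            # Add-one smoothing (an assumption of this sketch).
            p = (word_counts[label][w] + 1) / (total_words + len(vocab))
            score += math.log(p)                              # log likelihood term
        scores[label] = score
    return max(scores, key=scores.get), scores

docs = [
    (["win", "cash", "now"], "spam"),
    (["win", "prize"], "spam"),
    (["meeting", "tomorrow"], "ham"),
    (["lunch", "tomorrow", "now"], "ham"),
]
model = train(docs)
label, scores = predict(["win", "cash"], *model)
print(label)  # spam
```

The scores are not normalised by P(features), which is fine for classification: the normaliser is the same for every class, so it does not change the ranking.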

Interview Q & A

Q: Explain Bayes' theorem and how it connects to the Naive Bayes classifier.

A: Bayes' theorem states P(H|E) = P(E|H)·P(H)/P(E): the posterior is proportional to likelihood × prior. Naive Bayes applies this to classification: it computes the posterior probability of each class given all features, using the assumption that features are conditionally independent given the class. This lets us factor P(features|class) into a product of individual P(featureᵢ|class) terms, each easy to estimate from data. We pick the class with the highest posterior. The independence assumption is rarely true, but the classifier still performs well because we only need to rank classes correctly, not estimate exact probabilities.

Key probability distributions

The shape of your data determines your model — and your loss function

Why — motivation

Every assumption you make when building a model is implicitly a distributional assumption. When you use MSE loss, you're assuming Gaussian noise. When you use cross-entropy, you're assuming Bernoulli or Categorical outputs. When you initialise weights from a normal distribution, you're using the Gaussian.

Interviewers ask: "why cross-entropy for classification?" and "what distribution does softmax output?" Both require knowing these distributions cold. Generative models (VAEs, diffusion) are entirely defined by distributional choices.

Intuition — the mental model

A probability distribution is a function that assigns probabilities to every possible outcome. Different phenomena follow different shapes: coin flips are binary (Bernoulli), class labels are categorical, measurement errors cluster around zero (Gaussian), event counts in time are Poisson.

Choosing the right distribution for your model is choosing the right language to describe your data. Using the wrong one is like measuring distance in seconds — the math will work but the answers will be wrong.

Explanation

Gaussian (Normal) distribution

f(x) = (1 / √(2πσ²)) · exp(-(x-μ)² / (2σ²))
Parameters: μ = mean (centre), σ² = variance (spread)
Support: all real numbers (-∞ to +∞)
Key property: 68% within 1σ, 95% within 2σ, 99.7% within 3σ

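The bell curve. Symmetric around the mean. Ubiquitous because of the Central Limit Theorem: the suitably standardised mean of many independent, identically distributed random variables with finite variance converges in distribution to a Gaussian, regardless of the shape of the original distribution.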

In ML: noise in linear regression is assumed Gaussian → MSE is the right loss. Weight initialisation often uses N(0, σ²). Latent variables in VAEs are Gaussian.
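The 68 / 95 / 99.7 rule can be checked directly. This sketch uses the standard identity P(|X − μ| ≤ kσ) = erf(k/√2) and only the Python standard library; the function names are chosen here for illustration.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

def prob_within(k):
    """P(|X - mu| <= k * sigma), the same for every Gaussian."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

Because the result does not depend on μ or σ, the rule works as a unit-free sanity check on any roughly Gaussian quantity.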

Bernoulli distribution

P(X=1) = p, P(X=0) = 1-p
Parameter: p ∈ [0,1] = probability of "success"
Support: {0, 1}
Mean: p, Variance: p(1-p)

Single binary trial. Coin flip is the canonical example. In ML: the output of sigmoid in binary classification is p — the model's estimate of P(y=1|x). Binary cross-entropy loss is derived from the Bernoulli likelihood.

Categorical distribution

P(X=k) = pₖ for k ∈ {1, 2, ..., K}
Parameters: p₁, p₂, ..., p_K with Σₖ pₖ = 1
Support: {1, 2, ..., K}, i.e. K mutually exclusive outcomes

Generalises Bernoulli to K classes. In ML: softmax outputs a Categorical distribution — the K numbers sum to 1 and represent the model's probability distribution over classes. Categorical cross-entropy loss is derived from the Categorical likelihood.
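A minimal sketch of softmax producing a Categorical distribution, and of the cross-entropy for one example. The logits are arbitrary illustrative values, and subtracting the maximum logit is a common numerical-stability trick added here, not something the text above requires.

```python
import math

def softmax(logits):
    """Map K real-valued scores to K probabilities that sum to 1."""
    m = max(logits)                       # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(probs, true_class):
    """Negative log-likelihood of the true class under the Categorical."""
    return -math.log(probs[true_class])

logits = [2.0, 1.0, 0.1]                  # hypothetical model outputs
probs = softmax(logits)

print([round(p, 3) for p in probs])
print(categorical_cross_entropy(probs, true_class=0))  # small loss: class 0 is likely
print(categorical_cross_entropy(probs, true_class=2))  # larger loss: class 2 is unlikely
```

With a one-hot label, the general form −Σₖ yₖ·log(pₖ) collapses to the single term −log(p of the true class), which is what the function computes.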

Uniform & Poisson distributions

Uniform

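f(x) = 1/(b-a) for x ∈ [a, b]. The maximum-entropy distribution on a bounded interval: it encodes "I know nothing beyond the range." Used in weight initialisation (the uniform variant of Xavier initialisation draws from Uniform(-a, a)) and as a flat prior in Bayesian methods.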

Poisson

P(X=k) = λᵏ·e^(−λ) / k! for k = 0, 1, 2, ... Models event counts in a fixed interval; λ is both the mean and the variance. Used for API request rates, word counts in NLP, recommendation systems.
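The "mean = variance = λ" property is easy to verify numerically. This sketch assumes λ = 3 and truncates the infinite sum at k = 59, where the remaining probability mass is negligible.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with rate lam."""
    return lam**k * math.exp(-lam) / math.factorial(k)

lam = 3.0                      # assumed rate, e.g. 3 requests per second
ks = range(60)                 # truncation point chosen for illustration

total = sum(poisson_pmf(k, lam) for k in ks)
mean = sum(k * poisson_pmf(k, lam) for k in ks)
variance = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)

print(round(total, 6), round(mean, 6), round(variance, 6))
```

If observed count data has a variance much larger than its mean (overdispersion), that is a hint the Poisson assumption does not fit.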

Distribution → loss function cheat sheet

Output distribution           Correct loss function
----------------------------------------------------
Gaussian noise                MSE (mean squared error)
Bernoulli (binary classes)    Binary cross-entropy
Categorical (K classes)       Categorical cross-entropy
Poisson (count data)          Poisson deviance / NLL

Interview Q & A

Q: Why do we use cross-entropy loss for classification instead of MSE?

A: The choice of loss function corresponds to a distributional assumption about the output. MSE is the negative log-likelihood (up to constants) under a Gaussian noise assumption, which is appropriate for continuous real-valued outputs. Classification labels are discrete: they follow a Bernoulli (K=2) or Categorical (K>2) distribution whose parameters the model predicts. The negative log-likelihood of the Categorical distribution gives exactly categorical cross-entropy: −Σₖ yₖ·log(pₖ). Using MSE for classification ignores the probabilistic nature of the output, penalises confident wrong predictions only mildly (the squared error is bounded, whereas cross-entropy grows without bound as the predicted probability of the true class approaches 0), and, combined with sigmoid or softmax outputs, yields vanishing gradients exactly when the model is confidently wrong, which makes optimisation harder.

MLE & MAP estimation

Why loss functions exist — deriving them from probability

Why — motivation

MLE and MAP are the theoretical frameworks that explain why we minimise the losses we do. Many engineers use MSE and cross-entropy every day without knowing why; understanding that minimising them is maximum likelihood estimation under a specific noise model is what separates someone who reasons from first principles from someone who follows recipes.

They also explain regularisation from a Bayesian perspective — L2 regularisation is exactly MAP estimation with a Gaussian prior. This connection comes up in senior ML interviews and system design discussions.

Intuition — the mental model

MLE asks: "What parameters make the data I observed most probable?" You find the parameters that best explain the data, with no assumptions about what the parameters should look like beforehand.

MAP asks the same question with a twist: "What parameters are most probable given both the data I observed and my prior beliefs about plausible parameters?" It is MLE plus a prior: a regularised version of MLE that discourages extreme parameter values.

Explanation

Maximum Likelihood Estimation (MLE)

Given observed data D = {x₁, x₂, ..., xₙ} and a model with parameters θ, MLE finds the θ that maximises the probability of observing D:

θ_MLE = argmax_θ P(D | θ)
      = argmax_θ Πᵢ P(xᵢ | θ)    (assuming i.i.d. samples)

In practice: take the log (monotonic, doesn't change argmax):

θ_MLE = argmax_θ log P(D | θ)
      = argmax_θ Σᵢ log P(xᵢ | θ)

Why log? Products of many small probabilities underflow to zero numerically. Log converts the product to a sum, which is stable and easier to differentiate.
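Both points, the underflow and the argmax, can be seen in a few lines. This is an illustrative sketch: the 200 probabilities, the ten coin flips, and the grid search are all choices made here to keep the example small.

```python
import math

# 200 i.i.d. observations, each with probability 0.01 under the model.
probs = [0.01] * 200
raw_product = math.prod(probs)                    # true value 1e-400 is below float range
log_likelihood = sum(math.log(p) for p in probs)  # the sum stays well-behaved

print(raw_product)               # 0.0 (underflow)
print(round(log_likelihood, 2))  # -921.03

# MLE for a coin: which p makes these flips most probable?
flips = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]            # 7 heads, 3 tails

def bernoulli_log_likelihood(p, data):
    return sum(math.log(p if x == 1 else 1 - p) for x in data)

grid = [i / 100 for i in range(1, 100)]           # candidate values 0.01 .. 0.99
p_mle = max(grid, key=lambda p: bernoulli_log_likelihood(p, flips))
print(p_mle)                     # 0.7, the sample mean
```

The grid search is only for illustration; setting the derivative of the log-likelihood to zero gives the same answer in closed form, p = (number of heads) / (number of flips).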

MLE → MSE loss (linear regression derivation)

Assume: outputs = true value + Gaussian noise: y = θᵀx + ε, where ε ~ N(0, σ²)

P(yᵢ | xᵢ, θ) = N(θᵀxᵢ, σ²)
log P(D|θ) = Σᵢ [ -½·log(2πσ²) - (yᵢ - θᵀxᵢ)² / (2σ²) ]

Maximising log P(D|θ) w.r.t. θ
≡ minimising Σᵢ (yᵢ - θᵀxᵢ)²
≡ minimising MSE loss ✓

The Gaussian noise assumption directly gives you MSE. It's not arbitrary — it's the statistically correct loss for Gaussian-distributed outputs.
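A small numerical check of this equivalence, as a sketch under simplifying assumptions: one feature, no intercept, made-up data, and a known noise standard deviation. The least-squares slope should also be the slope with the highest Gaussian log-likelihood, whatever σ is.

```python
import math

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 1.9, 4.2, 5.8, 8.1]          # made-up data, roughly y = 2x

def sum_squared_error(theta):
    return sum((y - theta * x) ** 2 for x, y in zip(xs, ys))

def gaussian_log_likelihood(theta, sigma=1.0):
    """log P(D | theta) for y = theta * x + N(0, sigma^2) noise."""
    n = len(xs)
    return -0.5 * n * math.log(2 * math.pi * sigma**2) - sum_squared_error(theta) / (2 * sigma**2)

# Closed-form least-squares slope for the no-intercept model.
theta_ls = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

for sigma in (0.5, 1.0, 2.0):
    best = gaussian_log_likelihood(theta_ls, sigma)
    # Any other slope has a lower likelihood, regardless of sigma.
    assert best > gaussian_log_likelihood(theta_ls + 0.1, sigma)
    assert best > gaussian_log_likelihood(theta_ls - 0.1, sigma)

print(round(theta_ls, 4))
```

Changing σ rescales the log-likelihood but never moves its maximiser, which is why σ can be dropped when deriving the MSE loss for θ.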

MLE → Cross-entropy loss (logistic regression derivation)

Assume: outputs follow a Bernoulli distribution with p = σ(θᵀx)

P(yᵢ | xᵢ, θ) = σ(θᵀxᵢ)^yᵢ · (1 - σ(θᵀxᵢ))^(1-yᵢ)
log P(D|θ) = Σᵢ [ yᵢ·log(σ(θᵀxᵢ)) + (1-yᵢ)·log(1-σ(θᵀxᵢ)) ]

Maximising log P(D|θ) ≡ minimising Binary Cross-Entropy:
BCE = -Σᵢ [ yᵢ·log(ŷᵢ) + (1-yᵢ)·log(1-ŷᵢ) ], where ŷᵢ = σ(θᵀxᵢ) ✓

The Bernoulli assumption directly gives binary cross-entropy. Extend to K classes with Categorical → get categorical cross-entropy.
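The identity "BCE is the negative Bernoulli log-likelihood" can be confirmed numerically. In this sketch the logits stand in for θᵀx and, like the labels, are made-up values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bernoulli_log_likelihood(labels, preds):
    """Sum of log P(y | p): log p when y = 1, log(1 - p) when y = 0."""
    return sum(math.log(p if y == 1 else 1.0 - p) for y, p in zip(labels, preds))

def binary_cross_entropy(labels, preds):
    """BCE summed over examples, written exactly as in the formula above."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1.0 - p)
                for y, p in zip(labels, preds))

logits = [2.0, -1.0, 0.5, -3.0]        # hypothetical values of theta^T x
labels = [1, 0, 1, 0]
preds = [sigmoid(z) for z in logits]

bce = binary_cross_entropy(labels, preds)
nll = -bernoulli_log_likelihood(labels, preds)
print(abs(bce - nll) < 1e-12)          # True: the two quantities coincide
```

Many libraries report the mean over examples instead of the sum; dividing by the number of examples does not change which θ minimises the loss.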

Maximum A Posteriori (MAP) — MLE + prior

MAP adds a prior P(θ) — your belief about plausible parameter values before seeing data:

θ_MAP = argmax_θ P(θ | D)
      = argmax_θ P(D | θ) · P(θ)    [by Bayes, ignoring the constant P(D)]

In log form:
θ_MAP = argmax_θ [ log P(D|θ) + log P(θ) ]
where log P(D|θ) is the likelihood term and log P(θ) is the regulariser.

The extra term log P(θ) acts as a regulariser — it penalises parameter values that are improbable under the prior.

L2 regularisation = MAP with Gaussian prior

If the prior is Gaussian: P(θ) = N(0, σ²), where σ² is the prior variance (not the noise variance)
log P(θ) = -||θ||² / (2σ²) + constant

MAP objective: argmax_θ [ log P(D|θ) - ||θ||²/(2σ²) ]
≡ argmin_θ [ NLL(D, θ) + λ·||θ||² ], where λ = 1/(2σ²)

This is exactly L2 regularisation (Ridge / weight decay). ✓
Analogously: L1 regularisation = MAP with a Laplace prior.

Regularisation is not an ad hoc trick: L2 (prefer small weights) corresponds to believing, before seeing any data, that weights are normally distributed around zero. The strength λ is inversely proportional to the prior's variance, so a tight prior (small σ²) gives strong regularisation.
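To see the prior acting as a regulariser, here is a one-parameter sketch. It assumes a no-intercept model y = θx with known noise variance and made-up data; `prior_var` plays the role of the prior's σ², and all names are chosen here for illustration.

```python
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 1.9, 4.2, 5.8, 8.1]          # made-up data, roughly y = 2x
NOISE_VAR = 1.0                         # assumed known noise variance

SXY = sum(x * y for x, y in zip(xs, ys))
SXX = sum(x * x for x in xs)

def neg_log_posterior(theta, prior_var):
    """NLL + lambda * theta^2 with lambda = 1 / (2 * prior_var), constants dropped."""
    nll = sum((y - theta * x) ** 2 for x, y in zip(xs, ys)) / (2 * NOISE_VAR)
    penalty = theta**2 / (2 * prior_var)
    return nll + penalty

def theta_map(prior_var):
    """Closed-form minimiser of neg_log_posterior (ridge with one weight)."""
    return SXY / (SXX + NOISE_VAR / prior_var)

theta_mle = SXY / SXX                   # no prior: plain least squares

for prior_var in (10.0, 1.0, 0.01):
    t = theta_map(prior_var)
    # The closed form really is a minimum of the MAP objective.
    assert neg_log_posterior(t, prior_var) < neg_log_posterior(t + 0.01, prior_var)
    assert neg_log_posterior(t, prior_var) < neg_log_posterior(t - 0.01, prior_var)
    print(prior_var, round(t, 4))

print("MLE", round(theta_mle, 4))
```

A wide prior (`prior_var = 10`) leaves the estimate almost at the MLE, while a tight prior (`prior_var = 0.01`) pulls it strongly toward zero. In the sum-of-squares form most ridge implementations use, the penalty weight is the ratio of noise variance to prior variance.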

MLE

No prior. Pure data — finds parameters that make the observed data most probable. Can overfit with little data. MSE and cross-entropy are both MLE objectives.

MAP

Prior + data. Regularised MLE. Gaussian prior → L2 regularisation. Laplace prior → L1. Pulls estimates toward the prior, which matters most when data is scarce; with plenty of data the likelihood dominates and MAP approaches MLE.

Interview Q & A

Q: What is MLE, what is MAP, and how does MAP relate to L2 regularisation?

A: MLE finds parameters that maximise the likelihood of observed data: argmax_θ P(D|θ). In practice we maximise the log-likelihood: for Gaussian noise this gives MSE, for Bernoulli outputs this gives cross-entropy. MAP extends MLE by adding a prior: argmax_θ [P(D|θ)·P(θ)], equivalent to maximising log-likelihood plus log-prior. The log-prior acts as a regularisation term. If the prior is Gaussian N(0, σ²), the log-prior is −||θ||²/(2σ²) plus a constant, and the MAP objective becomes: minimise NLL + λ||θ||² with λ = 1/(2σ²), which is exactly L2 regularisation. So Ridge regression is the MAP estimate for linear regression with a Gaussian prior on the weights, and L1 (Lasso) corresponds to a Laplace prior.