Uncertainty & inference — how models express and reason about what they don't know.
Joint, conditional, marginal — the vocabulary of uncertainty
Probability is the language of uncertainty, and ML is fundamentally about making decisions under uncertainty. Every model output that is a "probability" (a sigmoid output, a softmax vector, a language model's next-token distribution) requires you to understand what a probability actually means and how probabilities relate to each other.
Without this foundation you cannot understand Naive Bayes, Bayesian networks, generative models, or even why cross-entropy loss is the right choice for classification. These rules are also tested directly in DS/ML interviews as short warm-up problems.
Think of probability as a budget of belief: you have 1.0 to distribute among all possible outcomes. Joint probability asks: how much of that budget goes to two things happening together? Conditional asks: if I already know B happened, I keep only the budget that was on B and rescale it to 1.0; what share of it also lands on A? Marginal asks: ignoring everything else, what's the total budget on A?
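The three are connected like different views of the same underlying table of counts: P(A, B) = P(A|B)·P(B), and summing the joint over every value of B gives the marginal, P(A) = Σ_b P(A, B=b).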
If B can take several values (B₁, B₂, ..., Bₙ) that partition the sample space, the law of total probability gives P(A) = Σᵢ P(A|Bᵢ)·P(Bᵢ).
This lets you compute a marginal by summing over all cases of a conditioning variable. Example: P(spam) = P(spam|"win")·P("win") + P(spam|no "win")·P(no "win").
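The relationships above can be checked on a small table of counts. The sketch below uses made-up email counts (the numbers and the helper names are illustrative assumptions, not data from any real corpus) to compute a joint, a marginal, and a conditional, and then rebuilds P(spam) with the law of total probability.

```python
# Hypothetical joint counts over (label, contains "win") for 100 emails.
counts = {
    ("spam", True): 20,
    ("spam", False): 10,
    ("ham", True): 5,
    ("ham", False): 65,
}
total = sum(counts.values())

# Joint: P(label, win) is each cell divided by the grand total.
joint = {cell: n / total for cell, n in counts.items()}

def marginal_label(label):
    """P(label): sum the joint over every value of the other variable."""
    return sum(p for (lab, _), p in joint.items() if lab == label)

def marginal_win(win):
    """P(win): sum the joint over every label."""
    return sum(p for (_, w), p in joint.items() if w == win)

def conditional_label_given_win(label, win):
    """P(label | win) = P(label, win) / P(win)."""
    return joint[(label, win)] / marginal_win(win)

# Law of total probability: rebuild P(spam) from the conditionals.
p_spam_total = sum(
    conditional_label_given_win("spam", w) * marginal_win(w) for w in (True, False)
)

print(f"P(spam, win)   = {joint[('spam', True)]:.2f}")
print(f"P(spam | win)  = {conditional_label_given_win('spam', True):.2f}")
print(f"P(spam)        = {marginal_label('spam'):.2f}")
print(f"total-prob sum = {p_spam_total:.2f}")
```

The last two printed values agree, which is the law of total probability at work: summing P(spam|case)·P(case) over both cases recovers the marginal.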
Updating belief with evidence — the foundation of probabilistic ML
Bayes' theorem is arguably the most important single equation in ML. It tells you how to rationally update your beliefs when you get new evidence. Naive Bayes, Bayesian optimisation (used in AutoML and hyperparameter tuning), variational inference, and MAP estimation are all direct applications.
Interviewers ask this in two ways: directly ("state Bayes' theorem") and indirectly ("how does Naive Bayes work?", "what is a prior?", "what is MAP?"). All roads lead back here.
Before seeing any evidence, you have a prior belief — your best guess before the data arrives. Then you observe something (evidence). Bayes' theorem tells you how to update your prior into a posterior — your new, refined belief after incorporating the evidence.
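The update is multiplicative: you scale your prior by how likely the evidence is under each hypothesis, then renormalise. Hypotheses that predict the evidence well get their probability boosted; those that predict it poorly get shrunk. In symbols, for a hypothesis H and evidence E: P(H|E) = P(E|H)·P(H) / P(E), where P(E) is the sum of P(E|H′)·P(H′) over all hypotheses H′ and acts as the renormalising constant.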
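The scale-then-renormalise update can be sketched in a few lines. The function name `bayes_update` and the diagnostic-test numbers are assumptions chosen for illustration.

```python
def bayes_update(prior, likelihood):
    """Return the posterior over hypotheses.

    prior:      dict hypothesis -> P(H)
    likelihood: dict hypothesis -> P(E | H) for the observed evidence E
    """
    # Scale each prior by how well the hypothesis predicts the evidence.
    unnormalised = {h: prior[h] * likelihood[h] for h in prior}
    # P(E), the normalising constant, is the sum of the scaled priors.
    evidence = sum(unnormalised.values())
    return {h: v / evidence for h, v in unnormalised.items()}

# Hypothetical numbers: 1% prevalence, 95% sensitivity, 10% false-positive rate.
prior = {"disease": 0.01, "healthy": 0.99}
likelihood_positive = {"disease": 0.95, "healthy": 0.10}

posterior = bayes_update(prior, likelihood_positive)
print(posterior)
```

With these assumed numbers, a positive result lifts P(disease) from 1% to just under 9%: the evidence favours "disease", but the small prior still dominates. Feeding the posterior back in as the next prior would handle a second test, provided the tests are conditionally independent given the true state.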
Naive Bayes classifies by computing the posterior of each class given all features. The "naive" assumption is that all features are independent given the class, which turns the joint likelihood into a simple product: P(C|x₁, ..., xₙ) ∝ P(C)·∏ᵢ P(xᵢ|C).
Despite the independence assumption being almost always wrong, Naive Bayes works surprisingly well for text — because even with wrong probabilities, the ranking of classes is often correct.
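A minimal sketch of a Naive Bayes text classifier follows, assuming a bag-of-words representation and a toy training set invented for illustration. Two details go beyond the text above and are assumptions of this sketch: scores are summed in log space to avoid underflow, and Laplace (add-one) smoothing keeps unseen words from zeroing out a class.

```python
import math
from collections import Counter, defaultdict

# Tiny hypothetical training set: (tokens, label).
train = [
    (["win", "money", "now"], "spam"),
    (["win", "prize"], "spam"),
    (["meeting", "tomorrow"], "ham"),
    (["lunch", "tomorrow", "now"], "ham"),
]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for tokens, label in train:
    word_counts[label].update(tokens)
vocab = {w for tokens, _ in train for w in tokens}

def log_posterior(tokens, label, alpha=1.0):
    """Unnormalised log P(label | tokens) under the naive assumption."""
    # Log prior: log P(label).
    score = math.log(class_counts[label] / len(train))
    total_words = sum(word_counts[label].values())
    for w in tokens:
        # Laplace-smoothed log P(w | label); alpha is an assumed smoothing choice.
        num = word_counts[label][w] + alpha
        den = total_words + alpha * len(vocab)
        score += math.log(num / den)
    return score

def predict(tokens):
    """Pick the class with the highest unnormalised log posterior."""
    return max(class_counts, key=lambda label: log_posterior(tokens, label))

print(predict(["win", "money"]))    # spam
print(predict(["meeting", "now"]))  # ham
```

Only the ranking of the class scores matters for `predict`, which is why the normalising constant P(x₁, ..., xₙ) is never computed.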
The shape of your data determines your model — and your loss function
Many of the choices you make when building a model are implicitly distributional assumptions. When you use MSE loss, you're assuming Gaussian noise. When you use cross-entropy, you're assuming Bernoulli or Categorical outputs. When you initialise weights from N(0, σ²), you're sampling from a Gaussian.
Interviewers ask: "why cross-entropy for classification?" and "what distribution does softmax output?" Both require knowing these distributions cold. Generative models (VAEs, diffusion) are entirely defined by distributional choices.
A probability distribution is a function that assigns a probability (or, for continuous variables, a probability density) to every possible outcome. Different phenomena follow different shapes: coin flips are binary (Bernoulli), class labels are categorical, measurement errors cluster around zero (Gaussian), event counts in time are Poisson.
Choosing the right distribution for your model is choosing the right language to describe your data. Using the wrong one is like measuring distance in seconds — the math will work but the answers will be wrong.
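Gaussian (Normal): the bell curve, symmetric around the mean. Ubiquitous because of the Central Limit Theorem: the suitably standardised mean of many independent, identically distributed random variables with finite variance converges in distribution to a Gaussian, regardless of the shape of the original distribution.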
In ML: noise in linear regression is assumed Gaussian → MSE is the right loss. Weight initialisation often uses N(0, σ²). Latent variables in VAEs are Gaussian.
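The Central Limit Theorem claim can be checked with a quick simulation. This is a sketch under assumed settings (Uniform(0, 1) draws, 30 draws per mean, 2000 repetitions, fixed seed); it compares the spread of the sample means with the value the theorem predicts.

```python
import random
import statistics

random.seed(0)

def sample_mean(n):
    """Mean of n independent Uniform(0, 1) draws (a non-Gaussian source)."""
    return sum(random.random() for _ in range(n)) / n

n = 30          # draws per mean
trials = 2000   # number of means collected
means = [sample_mean(n) for _ in range(trials)]

observed_mean = statistics.fmean(means)
observed_std = statistics.stdev(means)
# Uniform(0, 1) has variance 1/12, so the CLT predicts std = sqrt(1/12) / sqrt(n).
predicted_std = (1 / 12) ** 0.5 / n ** 0.5

# Fraction of means within one predicted std of 0.5; a Gaussian gives about 0.68.
within_one = sum(abs(m - 0.5) <= predicted_std for m in means) / trials

print(f"mean of means: {observed_mean:.4f}")
print(f"std of means:  {observed_std:.4f} (predicted {predicted_std:.4f})")
print(f"within 1 std:  {within_one:.3f}")
```

The individual draws are flat, not bell-shaped, yet their means concentrate around 0.5 with roughly the Gaussian "68% within one standard deviation" behaviour.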
Bernoulli: a single binary trial with success probability p, so P(y) = p^y·(1−p)^(1−y) for y ∈ {0, 1}. A coin flip is the canonical example. In ML: the output of sigmoid in binary classification is p, the model's estimate of P(y=1|x). Binary cross-entropy loss is derived from the Bernoulli likelihood.
Categorical: generalises Bernoulli to K classes. In ML: softmax outputs the parameters of a Categorical distribution, K non-negative numbers that sum to 1 and represent the model's probability distribution over classes. Categorical cross-entropy loss is derived from the Categorical likelihood.
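To make the link between activations and distributions concrete, here is a small sketch of sigmoid and softmax in plain Python. The logits are made-up values; subtracting the maximum inside `softmax` is a common numerical-stability trick rather than part of the definition.

```python
import math

def sigmoid(z):
    """Map a real-valued logit to a Bernoulli parameter p in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Map K real-valued logits to the parameters of a Categorical distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

p = sigmoid(0.0)                  # a logit of 0 means "no preference"
probs = softmax([2.0, 1.0, 0.1])  # hypothetical logits for 3 classes

print(p)
print(probs, sum(probs))

# With K = 2 and logits [z, 0], softmax reduces to sigmoid(z).
two_class = softmax([1.5, 0.0])
print(two_class[0], sigmoid(1.5))
```

The two-class check shows why Bernoulli is the K = 2 special case of Categorical: the first softmax probability matches the sigmoid of the logit difference.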
Why loss functions exist — deriving them from probability
MLE and MAP are the theoretical frameworks that explain why we minimise the losses we do. Most engineers use MSE and cross-entropy every day without knowing why, but understanding that minimising them is maximum likelihood estimation (each is a negative log-likelihood under a specific distributional assumption) is what separates someone who reasons from first principles from someone who follows recipes.
They also explain regularisation from a Bayesian perspective: L2 regularisation is exactly MAP estimation with a zero-mean Gaussian prior on the weights. This connection comes up in senior ML interviews and system design discussions.
MLE asks: "What parameters make the data I observed most probable?" You find the parameters that best explain the data, with no assumptions about what the parameters should look like beforehand.
MAP asks the same question but with a twist: "What parameters make the data most probable, while also being themselves plausible given my prior beliefs?" Formally, it maximises the posterior P(θ|D) ∝ P(D|θ)·P(θ). It's MLE plus a prior: a regularised version of MLE that discourages extreme parameter values.
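Given observed data D = {x₁, x₂, ..., xₙ} and a model with parameters θ, MLE finds the θ that maximises the probability of observing D. Assuming the observations are independent and identically distributed: θ_MLE = argmax_θ P(D|θ) = argmax_θ ∏ᵢ P(xᵢ|θ).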
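In practice, take the log (it is monotonic, so it doesn't change the argmax) to turn the product into a sum: θ_MLE = argmax_θ Σᵢ log P(xᵢ|θ). Equivalently, minimise the negative log-likelihood −Σᵢ log P(xᵢ|θ).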
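As a concrete instance of maximising a log-likelihood, the sketch below estimates the bias of a coin. The flips are invented data and the grid search is only for illustration; for a Bernoulli model the maximiser has a closed form (the sample mean), which the code uses as a cross-check.

```python
import math

# Hypothetical data: 10 coin flips, 1 = heads.
flips = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

def log_likelihood(theta, data):
    """Sum of log P(x_i | theta) for i.i.d. Bernoulli(theta) observations."""
    return sum(math.log(theta) if x == 1 else math.log(1 - theta) for x in data)

# Search a grid of candidate parameters, excluding 0 and 1 where log is undefined.
grid = [i / 1000 for i in range(1, 1000)]
theta_mle = max(grid, key=lambda t: log_likelihood(t, flips))

# For a Bernoulli model the closed-form MLE is the sample mean.
sample_mean = sum(flips) / len(flips)

print(theta_mle, sample_mean)
```

Working with the sum of logs rather than the raw product is also what keeps this numerically stable on larger datasets, where a product of many probabilities would underflow.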
Assume each output is a linear prediction plus Gaussian noise: y = θᵀx + ε, where ε ~ N(0, σ²).
The Gaussian noise assumption directly gives you MSE. It's not arbitrary: minimising MSE is exactly maximum likelihood estimation when the noise is Gaussian with constant variance.
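For reference, the steps from the Gaussian assumption to MSE can be written out as follows. This sketch assumes i.i.d. pairs (xᵢ, yᵢ) and a noise variance σ² that is fixed and does not depend on θ.

```latex
\begin{aligned}
p(y_i \mid x_i, \theta) &= \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left(-\frac{(y_i - \theta^\top x_i)^2}{2\sigma^2}\right) \\
\log p(D \mid \theta) &= \sum_{i=1}^{n} \log p(y_i \mid x_i, \theta)
  = -\frac{n}{2}\log(2\pi\sigma^2)
    - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - \theta^\top x_i)^2 \\
\hat{\theta}_{\text{MLE}} &= \arg\max_\theta \log p(D \mid \theta)
  = \arg\min_\theta \frac{1}{n}\sum_{i=1}^{n} (y_i - \theta^\top x_i)^2
\end{aligned}
```

The first term of the log-likelihood and the factor 1/(2σ²) do not depend on θ, so they drop out of the argmax; what remains is the mean squared error.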
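Assume each output follows a Bernoulli distribution with p = σ(θᵀx), where σ here is the sigmoid function, so P(y|x, θ) = p^y·(1−p)^(1−y). The negative log-likelihood of one example is then −[y·log p + (1−y)·log(1−p)].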
The Bernoulli assumption directly gives binary cross-entropy. Extend to K classes with Categorical → get categorical cross-entropy.
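The equivalence between the Bernoulli likelihood and binary cross-entropy can be confirmed numerically. The labels and logits below are hypothetical; the sketch computes the mean negative log-likelihood directly from the Bernoulli probability mass function and compares it with the usual cross-entropy formula.

```python
import math

def sigmoid(z):
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bernoulli_likelihood(y, p):
    """P(y | p) = p^y * (1 - p)^(1 - y) for y in {0, 1}."""
    return p ** y * (1 - p) ** (1 - y)

def binary_cross_entropy(ys, ps):
    """Mean of -[y log p + (1 - y) log(1 - p)]."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(ys, ps)) / len(ys)

# Hypothetical labels and logits (theta^T x) for four examples.
ys = [1, 0, 1, 0]
logits = [2.0, -1.0, 0.5, 0.3]
ps = [sigmoid(z) for z in logits]

# Negative mean log-likelihood of the data under the Bernoulli model.
nll = -sum(math.log(bernoulli_likelihood(y, p)) for y, p in zip(ys, ps)) / len(ys)
bce = binary_cross_entropy(ys, ps)

print(nll, bce)
```

The two numbers agree up to floating-point error, so minimising binary cross-entropy and maximising the Bernoulli likelihood pick the same parameters.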
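MAP adds a prior P(θ), your belief about plausible parameter values before seeing data, and maximises the posterior instead of the likelihood: θ_MAP = argmax_θ P(θ|D) = argmax_θ [log P(D|θ) + log P(θ)]. The evidence P(D) does not depend on θ, so it drops out of the argmax.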
The extra term log P(θ) acts as a regulariser — it penalises parameter values that are improbable under the prior.
Regularisation is not an ad hoc trick: L2 (prefer small weights) corresponds to believing weights are normally distributed around zero before seeing any data. The strength λ is inversely proportional to the prior's variance. With Gaussian noise of variance σ² and a prior variance τ², the loss "sum of squared errors + λ‖θ‖²" has λ = σ²/τ², so a tight prior (small τ²) gives strong regularisation.
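The link between a Gaussian prior and L2 regularisation can be illustrated on a one-parameter regression. Everything here is an assumption made for the sketch: the data, the no-intercept model y = θ·x + noise, and the two variances. Under those assumptions the MAP estimate has the closed form of ridge regression, which the code checks against a direct search over the negative log posterior.

```python
# Hypothetical 1-D data for the model y = theta * x + noise (no intercept).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

noise_var = 1.0   # assumed variance of the Gaussian noise
prior_var = 0.5   # assumed variance of the zero-mean Gaussian prior on theta
lam = noise_var / prior_var  # L2 strength implied by the two variances

sxy = sum(x * y for x, y in zip(xs, ys))
sxx = sum(x * x for x in xs)

theta_mle = sxy / sxx          # minimises sum (y - theta x)^2
theta_map = sxy / (sxx + lam)  # minimises sum (y - theta x)^2 + lam * theta^2

def neg_log_posterior(theta):
    """Negative log posterior up to an additive constant."""
    sse = sum((y - theta * x) ** 2 for x, y in zip(xs, ys))
    return sse / (2 * noise_var) + theta ** 2 / (2 * prior_var)

# Numerical check: no nearby theta has a lower negative log posterior.
grid = [theta_map + d / 1000 for d in range(-500, 501)]
theta_grid = min(grid, key=neg_log_posterior)

print(theta_mle, theta_map, theta_grid)
```

The MAP estimate is pulled toward zero relative to the MLE, and shrinking `prior_var` (a tighter prior) increases `lam` and pulls it further.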