Statistics

How to describe data, assess model quality, and make decisions from evidence.

Topics: descriptive stats, bias-variance, hypothesis testing, p-values, correlation vs causation

Topic overview

Use statistics to summarize data, compare models, reason about variation, and decide whether evidence is strong enough to trust.

Core concepts

Focus on mean vs median, variance, standard deviation, covariance, correlation, bias-variance, confidence intervals, hypothesis tests, p-values, and causality limits.

Why it matters

Statistics keeps analysis honest: it helps you spot noisy results, misleading averages, spurious correlations, overfitting, and weak experiment conclusions.

Interview relevance

Data and ML interviews often test whether you can explain results, choose metrics, interpret experiments, and avoid false confidence.

Descriptive statistics

The first thing you reach for when you see a new dataset

Why — motivation

Before building any model, you need to understand your data. Interviewers frequently open ML system design rounds with "you have this dataset — how would you approach it?" The expected first answer is always exploratory data analysis, and descriptive statistics are the tools of EDA. Get these wrong and the interviewer loses confidence before the real questions even start.

Beyond EDA, these concepts appear in technical depth too: PCA requires the covariance matrix, normalisation decisions require understanding mean and variance, and model evaluation often involves comparing distributions of predictions vs actuals.

Intuition — the mental model

Descriptive statistics are summaries. Instead of looking at 1 million data points, you reduce them to a handful of numbers that capture the most important properties: where is the data centred (mean/median), how spread out is it (variance/std), how is it shaped (skewness/kurtosis), and how do features relate to each other (covariance)?

Think of them as a health check for your data — you run these before any modelling to catch problems: skewed distributions, outliers, correlated features, missing patterns.

Explanation

Mean vs median

Mean: μ = (1/n) · Σ xᵢ (sum divided by count)
Median: middle value when sorted (the 50th percentile)

The mean minimises the sum of squared deviations. The median minimises the sum of absolute deviations. Key difference: outlier sensitivity.

Mean — use when

Data is roughly symmetric, no extreme outliers. One billionaire raises the "average" income of a room dramatically — mean becomes misleading.

Median — use when

Data is skewed or has outliers. House prices, income distributions. Far more representative than mean when there's extreme inequality.
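The billionaire effect is easy to check numerically. The sketch below uses made-up incomes (in thousands) and Python's standard statistics module; the numbers are illustrative assumptions, not real data.

```python
import statistics

# Hypothetical incomes (in thousands) for nine people in a room.
incomes = [32, 38, 41, 45, 50, 52, 58, 61, 70]

# One extreme outlier walks in: a billionaire (1,000,000 thousand).
with_outlier = incomes + [1_000_000]

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"without outlier: mean={mean_before:.1f}, median={median_before}")
print(f"with outlier:    mean={mean_after:.1f}, median={median_after}")
```

The mean jumps by several orders of magnitude while the median barely moves, which is why the median is the safer summary for skewed data.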

Variance & standard deviation

Variance: σ² = (1/n) · Σ (xᵢ - μ)² = E[(X - μ)²]
Useful alternative form: σ² = E[X²] - (E[X])²
Std dev: σ = √(σ²) (same units as the data)
This is the population form; the unbiased sample estimate divides by n - 1 instead of n.

Variance measures how spread out values are around the mean. Standard deviation brings it back to the original units. In ML: feature normalisation (subtract mean, divide by std) puts all features on the same scale — essential for gradient-based and distance-based models (KNN, SVM).

Adding a constant to all values doesn't change variance
Multiplying by c scales variance by c²
Variance of a constant = 0
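The three properties above, and the alternative form E[X²] - (E[X])², can be verified on a small example. This is a minimal sketch with arbitrary values; pvariance is the population variance (divides by n).

```python
import statistics

data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]   # arbitrary example values
c = 3.0

var = statistics.pvariance(data)                            # population variance
var_shifted = statistics.pvariance([x + c for x in data])   # add a constant
var_scaled = statistics.pvariance([c * x for x in data])    # multiply by c
var_const = statistics.pvariance([c] * len(data))           # a constant "feature"

# Alternative form: E[X^2] - (E[X])^2
mean_sq = sum(x * x for x in data) / len(data)
var_alt = mean_sq - statistics.fmean(data) ** 2

print(f"Var(X)        = {var:.4f}")
print(f"Var(X + c)    = {var_shifted:.4f}  (unchanged)")
print(f"Var(cX)       = {var_scaled:.4f}  (c^2 times larger)")
print(f"Var(constant) = {var_const:.4f}")
print(f"E[X^2]-E[X]^2 = {var_alt:.4f}")
```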

Covariance & the covariance matrix

Cov(X, Y) = E[(X - μₓ)(Y - μᵧ)] = E[XY] - E[X]·E[Y]
Cov > 0 → X and Y increase together
Cov < 0 → X increases when Y decreases
Cov = 0 → no linear relationship (may still be non-linear)

Covariance matrix Σ (d×d, for d features):
Σ = (1/n) · XᵀX (X is mean-centred)
Diagonal: Var(Xᵢ), the variance of each feature
Off-diagonal: Cov(Xᵢ, Xⱼ), how features co-vary

The covariance matrix is central to PCA — eigenvectors of Σ are the principal components, eigenvalues are the variance explained along each. Large off-diagonal values mean features are correlated and potentially redundant.
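To make Σ = (1/n) · XᵀX concrete, here is a minimal pure-Python sketch on a toy dataset (the values are invented for illustration). Rows are samples and columns are features; in practice you would typically reach for a numerical library instead.

```python
# Rows are samples, columns are features (hypothetical toy data).
X = [
    [1.0, 2.0, 5.0],
    [2.0, 4.1, 3.0],
    [3.0, 6.2, 4.0],
    [4.0, 7.9, 1.0],
]
n, d = len(X), len(X[0])

# Mean-centre each column.
means = [sum(row[j] for row in X) / n for j in range(d)]
Xc = [[row[j] - means[j] for j in range(d)] for row in X]

# Sigma = (1/n) * Xc^T Xc, a d x d matrix.
sigma = [
    [sum(Xc[i][a] * Xc[i][b] for i in range(n)) / n for b in range(d)]
    for a in range(d)
]

for row in sigma:
    print([round(v, 3) for v in row])
```

Features 0 and 1 rise together, so their off-diagonal entry is positive; feature 2 falls as feature 0 rises, so that entry is negative.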

Skewness & kurtosis

Skewness — asymmetry

0 = symmetric. >0 = right-skewed (long tail right, mean > median — income, house prices). <0 = left-skewed. Highly skewed features often benefit from log-transform before linear models.

Kurtosis — tail heaviness

Gaussian kurtosis = 3. High kurtosis = heavy tails, more extreme outliers (common in financial data). Low kurtosis = fewer extremes. Heavy-tailed data means outliers are more frequent than Gaussian predicts.
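As a rough illustration of both measures, the sketch below computes moment-based skewness and (non-excess) kurtosis on simulated log-normal data, then again after a log-transform. The distribution, seed, and sample size are assumptions chosen for the demo.

```python
import math
import random

def skewness(xs):
    """Moment-based skewness: E[(X - mu)^3] / sigma^3."""
    n = len(xs)
    mu = sum(xs) / n
    m2 = sum((x - mu) ** 2 for x in xs) / n
    m3 = sum((x - mu) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def kurtosis(xs):
    """Moment-based (non-excess) kurtosis: E[(X - mu)^4] / sigma^4. Gaussian = 3."""
    n = len(xs)
    mu = sum(xs) / n
    m2 = sum((x - mu) ** 2 for x in xs) / n
    m4 = sum((x - mu) ** 4 for x in xs) / n
    return m4 / m2 ** 2

rng = random.Random(0)
# Log-normal draws: right-skewed, loosely resembling incomes or house prices.
skewed = [rng.lognormvariate(0.0, 1.0) for _ in range(20_000)]
logged = [math.log(x) for x in skewed]   # the log of a log-normal is Gaussian

print(f"raw:    skewness={skewness(skewed):.2f}, kurtosis={kurtosis(skewed):.2f}")
print(f"logged: skewness={skewness(logged):.2f}, kurtosis={kurtosis(logged):.2f}")
```

After the log-transform the skewness should sit near 0 and the kurtosis near 3, which is the behaviour the log-transform advice relies on.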

Interview Q & A

Q: You're given a new dataset. Walk me through initial exploratory data analysis.

A: First check shape and types — how many samples, features, dtypes. For each numerical feature: compute mean, median, std, min, max and check outliers (beyond 3σ). Compare mean vs median to detect skew — large difference means log-transform may help. Compute the correlation matrix to identify highly correlated features (potential multicollinearity). Check for missing values and their pattern. For categorical features, check cardinality and class balance. Finally examine the target distribution — is it imbalanced? This analysis drives every downstream decision: normalisation, encoding, feature selection, and model choice.

Bias-variance tradeoff

The core tension in every modelling decision

Why — motivation

This is the single most important concept in applied ML. Every modelling decision — choosing model complexity, adding regularisation, collecting more data, applying dropout — is a move along the bias-variance spectrum. If you can't explain this clearly, interviewers will question your understanding of why models fail and how to fix them.

It's asked directly ("explain bias-variance tradeoff") and indirectly ("why is your model overfitting?", "when would you use L2 regularisation?", "how would you fix underfitting?"). All are the same question in disguise.

Intuition — the mental model

Bias is the error from wrong assumptions. A linear model fitting a sine wave has high bias: it is systematically wrong no matter how much data you give it. The model is not paying close enough attention to the data.

Variance is the error from being too sensitive to the training data. A deep unpruned decision tree memorises every training point but fails on new data — paying too much attention to noise.

The tradeoff: making a model more complex (lower bias) almost always makes it more sensitive to training data (higher variance). The sweet spot captures real patterns without capturing noise.

Explanation

The decomposition

E[(y - ŷ)²] = Bias²(ŷ) + Variance(ŷ) + Irreducible noise
Bias(ŷ) = E[ŷ] - f(x) (how wrong the average prediction is)
Variance(ŷ) = E[(ŷ - E[ŷ])²] (how much predictions vary across datasets)
Irreducible = Var(ε) (noise in the data, which cannot be reduced)

The irreducible noise is a floor — even a perfect model cannot go below it. Your job is to minimise Bias² + Variance.
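The decomposition can be estimated by simulation: draw many training sets, refit, and look at how predictions at one test point behave. The sketch below is one possible setup, not a standard benchmark; the true function, noise level, and the two toy models (a constant predictor and 1-nearest-neighbour) are all assumptions picked to make the contrast visible.

```python
import math
import random

rng = random.Random(42)

def f(x):
    return math.sin(2 * math.pi * x)   # the "true" function (assumed)

NOISE_SD = 0.3      # standard deviation of the irreducible noise
N_TRAIN = 30        # points per training set
X0 = 0.25           # single test point where predictions are evaluated
N_DATASETS = 2000   # number of simulated training sets

def sample_dataset():
    xs = [rng.random() for _ in range(N_TRAIN)]
    ys = [f(x) + rng.gauss(0.0, NOISE_SD) for x in xs]
    return xs, ys

def predict_mean(xs, ys, x0):
    """Rigid model: ignore x and always predict the average y."""
    return sum(ys) / len(ys)

def predict_1nn(xs, ys, x0):
    """Flexible model: copy the label of the nearest training point."""
    i = min(range(len(xs)), key=lambda k: abs(xs[k] - x0))
    return ys[i]

def bias2_and_variance(predict):
    preds = []
    for _ in range(N_DATASETS):
        xs, ys = sample_dataset()
        preds.append(predict(xs, ys, X0))
    avg = sum(preds) / len(preds)                       # estimate of E[y_hat]
    bias2 = (avg - f(X0)) ** 2
    variance = sum((p - avg) ** 2 for p in preds) / len(preds)
    return bias2, variance

bias2_mean, var_mean = bias2_and_variance(predict_mean)
bias2_1nn, var_1nn = bias2_and_variance(predict_1nn)

for name, b2, v in [("mean", bias2_mean, var_mean), ("1-NN", bias2_1nn, var_1nn)]:
    total = b2 + v + NOISE_SD ** 2
    print(f"{name:>5}: bias^2={b2:.3f}  variance={v:.3f}  expected sq. error={total:.3f}")
```

The constant model should show large bias² and small variance, the 1-NN model the reverse, and both carry the same noise term of 0.3² = 0.09.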

High bias vs high variance

High bias — underfitting

Training error is high. Val ≈ train (both bad). Fix: more complex model, add features, reduce regularisation, train longer.

High variance — overfitting

Training error is low. Val >> train (large gap). Fix: L1/L2 regularisation, dropout, early stopping, more data, simpler model.

The "more data" rule — critical nuance

More data reduces variance — a high-variance model stabilises as the training set grows, narrowing the train/val gap. But more data does not fix bias. If your model is fundamentally wrong (linear model on quadratic data), 10× more data won't help — the architecture needs to change.

Diagnostic: plot learning curves (train/val error vs dataset size). Both plateau high = bias problem. Large gap between them = variance problem.
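The same reading of a learning curve can be written as a rule of thumb. This is only a sketch: acceptable_error and gap_tolerance are hypothetical thresholds that depend on the problem, and real curves deserve a look at the whole shape, not just the final points.

```python
def diagnose(train_error, val_error, acceptable_error, gap_tolerance=0.05):
    """Rule-of-thumb diagnosis from the last points of a learning curve.

    acceptable_error and gap_tolerance are hypothetical, problem-specific
    thresholds, not values prescribed by any library.
    """
    high_bias = train_error > acceptable_error
    high_variance = (val_error - train_error) > gap_tolerance
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias (underfitting)"
    if high_variance:
        return "high variance (overfitting)"
    return "looks fine"

# Both errors plateau high and close together.
print(diagnose(train_error=0.30, val_error=0.32, acceptable_error=0.10))
# Low training error, much higher validation error.
print(diagnose(train_error=0.02, val_error=0.25, acceptable_error=0.10))
```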

Where models sit on the spectrum

High bias ←──────────────────────→ High variance
Linear regression          Unpruned decision tree
Naive Bayes                k-NN (k=1)
Logistic regression        Deep neural net (unregularised)

Regularisation moves a model LEFT (more bias, less variance):
Ridge (L2), Lasso (L1), dropout, weight decay, early stopping

Interview Q & A

Q: Explain the bias-variance tradeoff and how you'd diagnose which problem a model has.

A: Prediction error = Bias² + Variance + Irreducible noise. Bias is error from wrong assumptions — the model is systematically wrong regardless of data. Variance is sensitivity to the training data — the model memorises noise. To diagnose: compare training and validation error. If both are high — high bias, try a more complex model, add features, reduce regularisation. If training is low but validation is much higher — high variance, try L1/L2 regularisation, dropout, early stopping, or more data. Learning curves are the clearest diagnostic: both plateau high means bias; large gap means variance.

Hypothesis testing & p-values

Deciding whether an observed pattern is real or just noise

Why — motivation

Every time you deploy a new model and ask "is this better than the old one?", you are running a hypothesis test. A/B testing — the standard way to evaluate model changes in production — is applied hypothesis testing. Data science roles test this heavily because it separates engineers who make decisions from evidence from those who go on intuition.

The classic trap: "our p-value is 0.03, so our new model is significantly better" — which ignores practical significance, multiple comparisons, and power.

Intuition — the mental model

Hypothesis testing starts from scepticism. You assume the boring explanation is true — "there's no effect, any difference I see is just random chance" (null hypothesis). Then you measure how surprising your data would be if that boring explanation were true.

The p-value answers: "if the null were actually true, how often would I see a result at least this extreme just by luck?" Small p = this would be very unlikely by luck = evidence against the null. It does NOT tell you the probability that the null is true.

Explanation

The framework

H₀ (null): "no effect", i.e. new model accuracy = old model accuracy
H₁ (alternative): "there is an effect", i.e. new model accuracy > old
Choose α (usually 0.05): the false positive rate you'll accept.
Compute p-value: p = P(seeing a result at least this extreme | H₀ is true)
Decision:
p < α → reject H₀ (statistically significant)
p ≥ α → fail to reject H₀ (insufficient evidence)
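One way to run this framework for two accuracies is a pooled two-proportion z-test. The sketch below assumes the two models were evaluated on independent samples large enough for the normal approximation; the counts are hypothetical, and a paired test would be more appropriate if both models are scored on the same examples.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_two_proportion_p(correct_old, n_old, correct_new, n_new):
    """p-value for H1: new accuracy > old accuracy (pooled two-proportion z-test)."""
    p_old = correct_old / n_old
    p_new = correct_new / n_new
    pooled = (correct_old + correct_new) / (n_old + n_new)   # shared accuracy under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n_old + 1 / n_new))
    z = (p_new - p_old) / se
    return 1 - NormalDist().cdf(z)   # P(Z >= z | H0)

ALPHA = 0.05
# Hypothetical counts: 870/1000 correct for the old model, 900/1000 for the new.
p_value = one_sided_two_proportion_p(870, 1000, 900, 1000)
decision = "reject H0" if p_value < ALPHA else "fail to reject H0"
print(f"p = {p_value:.4f} -> {decision}")
```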

Type I and Type II errors

Type I (α) — false positive

You say "new model is better" when it isn't. α=0.05 means you accept a 5% false positive rate when there is no real effect. Lowering α reduces this but makes real effects harder to detect.

Type II (β) — false negative

You say "no difference" when the new model actually is better. Power = 1−β = probability of correctly detecting a real effect. Power increases with sample size, effect size, and α.
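To see how power responds to sample size, here is a small sketch for the simplest case: a one-sided z-test for a mean with known σ. The effect size and σ are assumed values for illustration; real experiments usually need a power calculation matched to the actual test.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power_one_sided_z(effect, sigma, n, alpha=0.05):
    """Power (1 - beta) of a one-sided z-test for a mean with known sigma.

    effect is the true shift in the mean under the alternative.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)   # rejection threshold under H0
    shift = effect * sqrt(n) / sigma         # how far H1 moves the z statistic
    return 1 - STD_NORMAL.cdf(z_crit - shift)

for n in (25, 100, 400):
    print(n, round(power_one_sided_z(effect=0.2, sigma=1.0, n=n), 3))
```

With no true effect the "power" collapses to α, which is just the Type I error rate; lowering α lowers power for the same n, matching the tradeoff described above.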

Confidence intervals

95% CI for a mean: x̄ ± 1.96 · (σ / √n)
Correct interpretation: If we repeated the experiment 100 times, ~95 of the intervals would contain the true parameter value.
WRONG interpretation: "There is a 95% probability the true value is in this interval." (The true value is fixed: it either is or isn't in the interval.)

CIs are more informative than p-values alone — they show both statistical significance AND the magnitude and precision of the effect.
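The repeated-experiment interpretation can be checked directly by simulation. This sketch assumes a Gaussian population with known σ (the parameters and seed are arbitrary), builds the interval x̄ ± z · σ/√n many times, and counts how often it covers the true mean.

```python
import random
from math import sqrt
from statistics import NormalDist

rng = random.Random(7)
TRUE_MEAN, SIGMA, N = 10.0, 2.0, 50   # assumed population and sample size
Z = NormalDist().inv_cdf(0.975)       # about 1.96 for a 95% interval

def confidence_interval(sample):
    """95% CI for the mean, using the known sigma."""
    x_bar = sum(sample) / len(sample)
    half_width = Z * SIGMA / sqrt(len(sample))
    return x_bar - half_width, x_bar + half_width

trials = 2000
hits = 0
for _ in range(trials):
    sample = [rng.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    low, high = confidence_interval(sample)
    hits += low <= TRUE_MEAN <= high   # True counts as 1

coverage = hits / trials
print(f"fraction of intervals containing the true mean: {coverage:.3f}")
```

Each individual interval either contains the true mean or it does not; the 95% describes the procedure across repetitions.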

Statistical vs practical significance

Example (illustrative numbers):
Old model accuracy = 87.00%
New model accuracy = 87.01%
n = 10,000,000 → p-value = 0.001
Statistically significant? YES (p < 0.05)
Practically significant? NO (a 0.01 percentage-point gain is negligible in most real systems)

Always ask alongside p-value:
What is the effect size? (relative improvement %)
Does this difference actually matter for the business?
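The two questions can be kept separate in code. The sketch below takes the example's p-value and accuracies as given; min_effect is a hypothetical business threshold (0.5 percentage points here) that is not part of the original example.

```python
def assess(p_value, effect, alpha=0.05, min_effect=0.005):
    """Answer two separate questions: is it real, and is it big enough to matter?

    min_effect is a hypothetical business threshold (0.5 percentage points
    of accuracy); choose it per product, before looking at the results.
    """
    statistically_significant = p_value < alpha
    practically_significant = abs(effect) >= min_effect
    return statistically_significant, practically_significant

old_acc, new_acc = 0.8700, 0.8701
stat_sig, practical = assess(p_value=0.001, effect=new_acc - old_acc)

print(f"statistically significant: {stat_sig}")
print(f"practically significant:   {practical}")
print(f"relative improvement:      {(new_acc - old_acc) / old_acc:.4%}")
```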

Multiple comparisons problem

If you run 20 tests at α=0.05 and nothing is actually different, you'd expect 1 false positive by chance (0.05 × 20 = 1). If the tests are independent, the probability of at least one false positive is 1 − 0.95²⁰ ≈ 64%, far above the nominal 5%.

Bonferroni: divide α by the number of tests. 20 tests → use α = 0.0025 per test. Conservative but simple.
FDR (Benjamini-Hochberg): controls the expected fraction of false positives among all rejections. Less conservative than Bonferroni.
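Both corrections are short enough to sketch in plain Python. The p-values below are invented for illustration; the Benjamini-Hochberg function follows the usual step-up rule (reject the k smallest p-values, where k is the largest rank with p₍ₖ₎ ≤ (k/m) · q).

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 for p-values below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure controlling FDR at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by ascending p
    # Find the largest rank k with p_(k) <= (k / m) * q.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

# Hypothetical p-values from ten metrics in one experiment.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

print("uncorrected rejections:", sum(p < 0.05 for p in p_values))
print("Bonferroni rejections: ", sum(bonferroni(p_values)))
print("BH rejections:         ", sum(benjamini_hochberg(p_values)))
```

On this example Bonferroni keeps fewer discoveries than Benjamini-Hochberg, which is the "less conservative" behaviour described above.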

Interview Q & A

Q: You ran an A/B test. The p-value is 0.03. Should you ship the new model?

A: Not necessarily. p=0.03 means that if there were no real difference, you'd see a result at least this extreme only 3% of the time by chance, which is statistically significant at α=0.05. But before shipping I'd ask: what's the effect size? Is the improvement 0.01% or 5%? Statistical significance doesn't guarantee practical significance with large datasets. Did we correct for multiple comparisons if testing many metrics? Was the experiment properly randomised? Did we stop early after seeing a positive result (peeking inflates the false positive rate)? Only after these checks would I recommend shipping.

Correlation vs causation

Measuring relationships and interpreting what they represent

Why — motivation

Confusing correlation with causation is one of the most common and costly errors in applied data science. Models trained on correlations will fail the moment the correlation breaks — and it always eventually breaks. This separates a data scientist who builds robust systems from one who keeps being surprised when models degrade in production.

It appears in feature selection (correlated vs causal features), model interpretation (high coefficient ≠ causal), and business recommendations ("our model says X correlates with churn — should we change X?").

Intuition — the mental model

Two variables can move together for three reasons: X causes Y, Y causes X, or a third variable Z causes both. Correlation only tells you they move together — it says nothing about why.

The ice cream and drowning example: both rise in summer, not because ice cream causes drowning, but because hot weather (a confounder) causes both. Your model learns this correlation happily — then gives the wrong recommendation: "ban ice cream to prevent drowning."

Explanation

Pearson vs Spearman correlation

Pearson r

r = Cov(X,Y) / (σₓ·σᵧ). Range −1 to +1. Measures linear relationships only. r=0 doesn't mean no relationship — could be non-linear (Y = X²). Scale-invariant.

Spearman ρ

Pearson applied to ranks of X and Y. Use when relationship is monotonic but not linear, data has outliers, or data is ordinal. More robust for ML feature analysis.
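The difference between the two coefficients shows up clearly on two toy relationships. This is a minimal pure-Python sketch (the data are made up): a parabola, where Pearson misses a strong non-linear relationship, and an exponential, where the relationship is monotonic but not linear.

```python
import math

def pearson(xs, ys):
    """Pearson r = Cov(X, Y) / (sigma_x * sigma_y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for i in order[start:end + 1]:
            result[i] = average_rank
        start = end + 1
    return result

def spearman(xs, ys):
    """Spearman rho: Pearson correlation applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))

xs_sym = [-3, -2, -1, 0, 1, 2, 3]
parabola = [x ** 2 for x in xs_sym]           # Y = X^2: related, but not linearly
xs_pos = [1, 2, 3, 4, 5, 6, 7]
exponential = [math.exp(x) for x in xs_pos]   # monotonic but not linear

print("parabola:    pearson =", round(pearson(xs_sym, parabola), 3))
print("exponential: pearson =", round(pearson(xs_pos, exponential), 3),
      " spearman =", round(spearman(xs_pos, exponential), 3))
```

Note that Spearman also comes out near zero on the parabola, since that relationship is not monotonic either; rank correlation helps with curvature, not with direction changes.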

Confounders — the hidden cause

Confounder Z causes both X and Y → spurious correlation:

          Z (hot weather)
          ↙             ↘
X (ice cream)      Y (drowning)

X and Y correlate, but X does NOT cause Y.
Controlling for Z removes the correlation.

ML example: "users who open app on Tuesdays churn less"
Confounder: power users use app more AND churn less.
Tuesday usage → power user signal, not a causal churn driver.
Sending Tuesday notifications won't reduce churn.
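A simulation makes the confounder diagram tangible. In the sketch below X and Y are each generated from Z plus independent noise, so X never influences Y; the linear-Gaussian setup, seed, and sample size are assumptions for the demo. Controlling for Z is done with the first-order partial correlation formula.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rng = random.Random(1)
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]    # confounder (e.g. temperature)
x = [zi + rng.gauss(0, 1) for zi in z]     # driven by Z only
y = [zi + rng.gauss(0, 1) for zi in z]     # driven by Z only, never by X

r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)

# Partial correlation of X and Y after removing the linear effect of Z.
partial = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

print(f"corr(X, Y)                 = {r_xy:.3f}")
print(f"corr(X, Y), Z controlled   = {partial:.3f}")
```

The raw correlation is clearly positive while the partial correlation sits near zero, which is what "controlling for Z removes the correlation" means in practice.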

Establishing causation

Randomised controlled trial

Randomly assign subjects to treatment/control. Random assignment breaks the confounder link, so any difference larger than chance can be attributed to the treatment. Gold standard. Often not feasible (ethical, cost).

Causal inference

When RCT isn't possible: propensity score matching, instrumental variables, difference-in-differences, regression discontinuity. More assumptions, less clean — but often the only option in production.

Interview Q & A

Q: Your model finds feature X is highly correlated with target Y. Can you conclude X causes Y? How would you investigate?

A: No — correlation only tells us X and Y move together, not why. Three possible explanations: X causes Y, Y causes X (reverse causation), or a confounder Z causes both. To investigate: first, does the causal direction make sense from domain knowledge? Second, check for confounders — variables driving both X and Y — and try controlling for them using partial correlation or regression. Third, if feasible, run a randomised experiment: randomly vary X and see if Y changes. If RCT isn't possible, causal inference techniques like instrumental variables or diff-in-diff can help. In production ML, spurious correlations break when the underlying distribution shifts, causing unexpected model degradation.