How to describe data, assess model quality, and make decisions from evidence.
The first thing you reach for when you see a new dataset
Before building any model, you need to understand your data. Interviewers frequently open ML system design rounds with "you have this dataset, how would you approach it?" The expected first answer is almost always exploratory data analysis (EDA), and descriptive statistics are its core tools. Get these wrong and the interviewer loses confidence before the real questions even start.
Beyond EDA, these concepts appear in technical depth too: PCA requires the covariance matrix, normalisation decisions require understanding mean and variance, and model evaluation often involves comparing distributions of predictions vs actuals.
Descriptive statistics are summaries. Instead of looking at 1 million data points, you reduce them to a handful of numbers that capture the most important properties: where is the data centred (mean/median), how spread out is it (variance/std), how is it shaped (skewness/kurtosis), and how do features relate to each other (covariance)?
Think of them as a health check for your data: you run them before any modelling to catch problems such as skewed distributions, outliers, correlated features, and patterns in missing values.
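The mean minimises the sum of squared deviations. The median minimises the sum of absolute deviations. The key practical difference is outlier sensitivity: a single extreme value can drag the mean arbitrarily far, while the median barely moves.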
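A minimal sketch of the outlier effect, using only Python's standard library. The salary figures are made up for illustration, and the two helper functions are hypothetical names used to check the minimisation claims numerically.

```python
import statistics

# Hypothetical salaries (in thousands); the appended value is an outlier.
salaries = [40, 42, 45, 47, 50]
with_outlier = salaries + [1000]

mean_clean = statistics.mean(salaries)
median_clean = statistics.median(salaries)
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)

def sum_sq_dev(data, c):
    """Sum of squared deviations of data around a candidate centre c."""
    return sum((x - c) ** 2 for x in data)

def sum_abs_dev(data, c):
    """Sum of absolute deviations of data around a candidate centre c."""
    return sum(abs(x - c) for x in data)

print("clean:       mean", mean_clean, "median", median_clean)
print("with outlier: mean", mean_outlier, "median", median_outlier)

# The mean beats the median on squared deviations, and vice versa on absolute ones.
print(sum_sq_dev(with_outlier, mean_outlier) <= sum_sq_dev(with_outlier, median_outlier))
print(sum_abs_dev(with_outlier, median_outlier) <= sum_abs_dev(with_outlier, mean_outlier))
```

One extreme value moves the mean from the mid-40s to over 200, while the median shifts by one unit.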
Variance is the average squared deviation from the mean, so it measures how spread out values are. Standard deviation is its square root, which brings the measure back to the original units. In ML: feature normalisation (subtract the mean, divide by the std) puts all features on the same scale, which matters for gradient-based models and for distance-based or margin-based models such as KNN and SVM.
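As a rough illustration of the subtract-mean, divide-by-std step, here is a standard-library sketch. The function name `standardise`, the choice of the population standard deviation, and the two toy features are assumptions made for the example.

```python
import statistics

def standardise(values):
    """Return z-scores: subtract the mean, then divide by the standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population std; an assumption for this sketch
    if sigma == 0:
        raise ValueError("cannot standardise a constant feature")
    return [(v - mu) / sigma for v in values]

# Two hypothetical features on very different scales.
ages = [20, 30, 40, 50, 60]
incomes = [20_000, 35_000, 50_000, 65_000, 80_000]

z_ages = standardise(ages)
z_incomes = standardise(incomes)

# After standardising, both features have mean 0 and std 1.
print([round(z, 3) for z in z_ages])
print([round(z, 3) for z in z_incomes])
```

In a real pipeline the mean and std would normally be computed on the training split only and then reused for validation and test data, so that no information leaks from held-out data.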
The covariance matrix Σ is central to PCA: the eigenvectors of Σ are the principal component directions, and the eigenvalues are the variance explained along each. Off-diagonal values that are large relative to the variances on the diagonal mean the features are correlated and potentially redundant.
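To make the eigenvector/eigenvalue link concrete, here is a small sketch for the two-feature case, where the eigendecomposition of a symmetric 2×2 matrix has a closed form. The data and function names are hypothetical; for more than two features one would typically use a library routine such as `numpy.linalg.eigh` instead.

```python
import math

def covariance_matrix_2d(xs, ys):
    """Sample covariance matrix entries (var_x, cov_xy, var_y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mx) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - my) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return var_x, cov_xy, var_y

def eigen_2x2_symmetric(a, b, c):
    """Eigenvalues (descending) and leading unit eigenvector of [[a, b], [b, c]]."""
    centre = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    lam1, lam2 = centre + radius, centre - radius
    if b != 0:
        v = (b, lam1 - a)  # solves (A - lam1 * I) v = 0
    else:
        v = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    return (lam1, lam2), (v[0] / norm, v[1] / norm)

# Hypothetical, strongly correlated features: y is roughly 2 * x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

var_x, cov_xy, var_y = covariance_matrix_2d(xs, ys)
(lam1, lam2), pc1 = eigen_2x2_symmetric(var_x, cov_xy, var_y)
explained = lam1 / (lam1 + lam2)

print("first principal direction:", pc1)
print("share of variance explained:", round(explained, 4))
```

Because the two features are nearly collinear, almost all the variance lies along the first principal direction, which is the sense in which one of them is redundant.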
The core tension in every modelling decision
This is the single most important concept in applied ML. Every modelling decision — choosing model complexity, adding regularisation, collecting more data, applying dropout — is a move along the bias-variance spectrum. If you can't explain this clearly, interviewers will question your understanding of why models fail and how to fix them.
It's asked directly ("explain bias-variance tradeoff") and indirectly ("why is your model overfitting?", "when would you use L2 regularisation?", "how would you fix underfitting?"). All are the same question in disguise.
Bias is the error from wrong assumptions. A linear model fitting a sine wave has high bias: it is systematically wrong no matter how much data you give it. Intuitively, the model is not paying close enough attention to the data.
Variance is the error from being too sensitive to the training data. A deep unpruned decision tree memorises every training point but fails on new data. Intuitively, the model is paying too much attention to noise.
The tradeoff: making a model more complex (lower bias) almost always makes it more sensitive to training data (higher variance). The sweet spot captures real patterns without capturing noise.
For squared-error loss, expected test error decomposes into Bias² + Variance + irreducible noise. The irreducible noise is a floor: even a perfect model cannot go below it. Your job is to minimise Bias² + Variance.
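Written out for squared-error loss, the standard decomposition looks like the following. The notation is an assumption of this sketch rather than the original text's: the data follow y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², f̂ is the fitted model, and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Only the first two terms depend on the model, which is why they are the ones you can trade against each other.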
More data reduces variance: a high-variance model stabilises as the training set grows, narrowing the train/val gap. But more data does not fix bias. If your model is fundamentally wrong (linear model on quadratic data), 10× more data won't help. The model class needs to change, for example by adding quadratic features.
Diagnostic: plot learning curves (train and validation error against training-set size). If both curves plateau at a high error, you have a bias problem. If a large gap remains between them, you have a variance problem.
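The sketch below computes the numbers behind such learning curves for two deliberately extreme models, using only the standard library. Everything here is an illustrative assumption: the synthetic data (y = x² plus noise), a straight-line fit as the high-bias model, and a 1-nearest-neighbour regressor as the high-variance model.

```python
import random

def make_data(n, rng, noise=0.5):
    """Hypothetical data: y = x^2 + Gaussian noise, with x uniform in [-3, 3]."""
    xs = [rng.uniform(-3, 3) for _ in range(n)]
    ys = [x * x + rng.gauss(0, noise) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (too simple for this data: high bias)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    return lambda x: a * x + b

def fit_1nn(xs, ys):
    """1-nearest-neighbour regressor: memorises the training set (high variance)."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def learning_curve(fit, sizes, seed=0):
    """Return (n, train_error, val_error) for each training-set size."""
    rng = random.Random(seed)
    val_x, val_y = make_data(500, rng)
    rows = []
    for n in sizes:
        tr_x, tr_y = make_data(n, rng)
        model = fit(tr_x, tr_y)
        rows.append((n, mse(model, tr_x, tr_y), mse(model, val_x, val_y)))
    return rows

sizes = [20, 80, 320]
line_curve = learning_curve(fit_line, sizes)
nn_curve = learning_curve(fit_1nn, sizes)

for name, curve in [("line", line_curve), ("1-NN", nn_curve)]:
    for n, train_err, val_err in curve:
        print(f"{name:5s} n={n:4d} train={train_err:.2f} val={val_err:.2f}")
```

In this setup the line's train and validation errors should sit close together at a high value regardless of size (bias), while 1-NN has zero training error and a clearly higher validation error (variance). Plotting the rows with any charting library gives the usual learning-curve picture.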
Deciding whether an observed pattern is real or just noise
Every time you deploy a new model and ask "is this better than the old one?", you are running a hypothesis test. A/B testing, the standard way to evaluate model changes in production, is applied hypothesis testing. Data science roles test this heavily because it separates engineers who base decisions on evidence from those who rely on intuition.
The classic trap: "our p-value is 0.03, so our new model is significantly better" — which ignores practical significance, multiple comparisons, and power.
Hypothesis testing starts from scepticism. You assume the boring explanation is true — "there's no effect, any difference I see is just random chance" (null hypothesis). Then you measure how surprising your data would be if that boring explanation were true.
The p-value answers: "if the null were actually true, how often would I see a result at least this extreme just by luck?" Small p = this would be very unlikely by luck = evidence against the null. It does NOT tell you the probability that the null is true.
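One way to see this definition in action is a permutation test, which estimates the p-value by literally simulating "the null is true": if the two groups do not differ, the labels are interchangeable, so shuffling them shows how large a difference arises by luck alone. This is a sketch of one possible test rather than the only one, and the metric values are invented.

```python
import random

def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # relabel the observations at random
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical per-day accuracy for an old and a new model.
old_model = [0.71, 0.69, 0.73, 0.70, 0.72, 0.68, 0.71, 0.70]
new_model = [0.74, 0.76, 0.73, 0.77, 0.75, 0.74, 0.78, 0.75]
p_diff = permutation_p_value(old_model, new_model)

# Two groups with identical means: the null looks entirely plausible.
group_a = [0.70, 0.72, 0.71, 0.69]
group_b = [0.71, 0.70, 0.72, 0.69]
p_same = permutation_p_value(group_a, group_b)

print("p-value, old vs new:", p_diff)
print("p-value, identical means:", p_same)
```

The returned number is the fraction of shuffles at least as extreme as the observed difference. It is a statement about the data under the null, not the probability that the null is true.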
Confidence intervals (CIs) are more informative than p-values alone: they show whether an effect is statistically significant and also its magnitude and the precision of the estimate.
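A hedged sketch of what that extra information looks like for an A/B test on conversion rates. It uses the normal-approximation (Wald) interval for a difference in proportions, which is one common choice among several, and the conversion counts are invented.

```python
import math
from statistics import NormalDist

def diff_in_proportions_ci(success_a, n_a, success_b, n_b, confidence=0.95):
    """Wald (normal-approximation) confidence interval for p_b - p_a."""
    p_a, p_b = success_a / n_a, success_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Small test: a 2.5-point lift, but the interval still includes zero.
small_lo, small_hi = diff_in_proportions_ci(100, 1_000, 125, 1_000)

# Large test: the interval excludes zero, yet the lift is only about half a point.
large_lo, large_hi = diff_in_proportions_ci(10_000, 100_000, 10_500, 100_000)

print(f"small test: [{small_lo:.4f}, {small_hi:.4f}]")
print(f"large test: [{large_lo:.4f}, {large_hi:.4f}]")
```

The second interval is "significant" but narrow around a small effect, which is exactly the practical-significance question a bare p-value hides.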
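If you run 20 tests at α=0.05 and nothing is actually different, you'd expect 1 false positive by chance (0.05 × 20 = 1). If the tests are independent, the chance of at least one false positive is 1 − 0.95^20 ≈ 64%. Running many tests inflates the family-wise false positive rate, so the significance threshold needs correcting when you test many hypotheses.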
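The arithmetic can be checked in a few lines. The sketch assumes the tests are independent, and it brings in the Bonferroni correction (divide α by the number of tests), a common remedy that the text above does not name.

```python
def family_wise_error_rate(alpha, n_tests):
    """P(at least one false positive) over independent tests when every null is true."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(alpha, n_tests):
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests

alpha, n_tests = 0.05, 20

expected_false_positives = alpha * n_tests                    # 1 on average
fwer = family_wise_error_rate(alpha, n_tests)                 # roughly 0.64
corrected_alpha = bonferroni_alpha(alpha, n_tests)            # 0.0025 per test
fwer_corrected = family_wise_error_rate(corrected_alpha, n_tests)

print("expected false positives:", expected_false_positives)
print("chance of at least one:", round(fwer, 3))
print("after Bonferroni:", round(fwer_corrected, 3))
```

Bonferroni is conservative: it controls false positives at the cost of power, so less strict corrections are often preferred when the number of tests is large.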
Measuring relationships and interpreting what they represent
Confusing correlation with causation is one of the most common and costly errors in applied data science. Models trained on correlations will fail the moment the correlation breaks, and in production it usually does sooner or later. This separates a data scientist who builds robust systems from one who keeps being surprised when models degrade in production.
It appears in feature selection (correlated vs causal features), model interpretation (high coefficient ≠ causal), and business recommendations ("our model says X correlates with churn — should we change X?").
Two variables can move together for three main reasons: X causes Y, Y causes X, or a third variable Z causes both. (In small samples they can also move together by pure chance.) Correlation only tells you they move together; it says nothing about why.
The ice cream and drowning example: both rise in summer, not because ice cream causes drowning, but because hot weather (a confounder) causes both. Your model learns this correlation happily — then gives the wrong recommendation: "ban ice cream to prevent drowning."
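A small simulation makes the confounder visible. In the sketch below, temperature drives both quantities and neither affects the other. All coefficients, noise levels, and units are invented for illustration. Controlling for temperature is done by correlating the residuals left after regressing each variable on it (a partial correlation).

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def residuals(ys, xs):
    """What is left of ys after removing a least-squares line fitted on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(0)

# Hot weather (the confounder) drives both variables; they do not affect each other.
temperature = [rng.uniform(5, 35) for _ in range(2000)]
ice_cream = [10 * t + rng.gauss(0, 40) for t in temperature]    # sales, arbitrary units
drownings = [0.2 * t + rng.gauss(0, 1.5) for t in temperature]  # risk index, arbitrary units

raw_corr = pearson(ice_cream, drownings)
partial_corr = pearson(residuals(ice_cream, temperature),
                       residuals(drownings, temperature))

print("raw correlation:", round(raw_corr, 2))
print("after controlling for temperature:", round(partial_corr, 2))
```

The raw correlation is strong, but it should drop to roughly zero once temperature is accounted for, which is the numerical version of "banning ice cream would not prevent drowning". This only works when the confounder is measured. Unobserved confounders need experiments or other causal methods.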