Machine Learning Interview Prep

What is overfitting?

Difficulty: medium
Type: conceptual
Topic: overfitting
Frequency: common
Tags: overfitting

Answer

Overfitting happens when a model learns training noise and fails on new data.

Explanation

It often appears as high training performance and weaker validation performance. Regularization, simpler models, better validation, and more data can help.
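The train/validation gap can be seen on a toy problem. The sketch below is illustrative only: the synthetic data, the k-nearest-neighbor regressor, and the helper names (`knn_predict`, `error_gap`) are assumptions introduced here. With k=1 the model memorizes the training noise, so its training error is zero while its validation error is not.

```python
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error of a k-NN regressor fit on `train`, scored on `data`."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in data) / len(data)

def make_data(rng, n):
    """Noisy samples of y = 2x; the noise is what an overfit model memorizes."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + rng.gauss(0, 1)) for x in xs]

def error_gap(train, val, k):
    """Validation error minus training error; a large gap suggests overfitting."""
    return mse(train, val, k) - mse(train, train, k)

rng = random.Random(0)
train, val = make_data(rng, 300), make_data(rng, 300)

for k in (1, 15):
    print(f"k={k:2d}  train MSE={mse(train, train, k):.2f}  "
          f"val MSE={mse(train, val, k):.2f}  gap={error_gap(train, val, k):.2f}")
```

Averaging over more neighbors (k=15) acts like a simpler model: training error rises, but the gap to validation error shrinks.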

Follow-up: How do you detect overfitting from metrics?

What is the bias-variance tradeoff?

Difficulty: medium
Type: conceptual
Topic: bias-variance-tradeoff
Frequency: common
Tags: bias, variance, tradeoff

Answer

It balances underfitting from high bias and overfitting from high variance.

Explanation

Simple models may miss patterns, while complex models may be too sensitive to training data. Good generalization manages both.
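For squared-error loss the tradeoff can be written down exactly. Assuming data generated as $y = f(x) + \varepsilon$ with noise variance $\sigma^2$ (notation introduced here for illustration), the expected error of a learned predictor $\hat{f}$ at a point $x$ decomposes as:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The expectation is taken over training sets and noise. Simple models tend to have a large first term, flexible models a large second term, and no model can remove the third.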

Follow-up: How does model complexity affect bias and variance?

What is cross-validation?

Difficulty: hard
Type: conceptual
Topic: cross-validation
Frequency: common
Tags: cross, validation

Answer

It evaluates a model across multiple train-validation splits.

Explanation

K-fold cross-validation reduces dependence on one split and gives a more stable estimate of generalization.
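A minimal sketch of the splitting logic, written in plain Python so the mechanics are visible. The function names and the "predict the training mean" baseline are assumptions for illustration; in practice a library splitter would normally be used, and i.i.d. data would be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

def cross_validate_mean_baseline(ys, k):
    """Score a 'predict the training mean' baseline on each fold (MAE)."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(ys), k):
        # Anything learned from data (here, the mean) is fit on the training fold only.
        train_mean = sum(ys[i] for i in train_idx) / len(train_idx)
        mae = sum(abs(ys[i] - train_mean) for i in val_idx) / len(val_idx)
        scores.append(mae)
    return scores

folds = list(k_fold_indices(10, 3))
scores = cross_validate_mean_baseline(
    [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 5.0, 4.0, 6.0, 5.0], k=3
)
print("validation fold sizes:", [len(val) for _, val in folds])
print("mean MAE across folds:", sum(scores) / len(scores))
```

Every row lands in exactly one validation fold, so the averaged score does not hinge on a single lucky or unlucky split.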

Follow-up: Why should preprocessing be fit inside each training fold?

What is feature leakage?

Difficulty: medium
Type: conceptual
Topic: feature-leakage
Frequency: common
Tags: feature, leakage

Answer

Leakage happens when training data includes information unavailable at prediction time.

Explanation

Leakage creates unrealistically strong offline metrics and weak production performance because the model learned future or target-derived signals.

Follow-up: How do you prevent leakage in time-based data?

How do you choose an evaluation metric?

Difficulty: medium
Type: conceptual
Topic: choose-evaluation-metric
Frequency: common
Tags: choose, evaluation, metric

Answer

Pick a metric that matches the task and business cost of errors.

Explanation

Classification may use precision, recall, F1, ROC-AUC, or PR-AUC. Regression may use MAE, RMSE, or R², depending on how heavily large errors should be penalized.

Follow-up: When is accuracy a bad metric?

What is the difference between precision and recall?

Difficulty: medium
Type: conceptual
Topic: precision-recall
Frequency: common
Tags: classification, precision, recall

Answer

Precision measures how many predicted positives are correct; recall measures how many actual positives are found.

Explanation

Precision matters when false positives are costly. Recall matters when missing positives is costly, such as fraud, disease detection, or safety alerts.
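Both metrics fall out of the confusion-matrix counts. The following is a small sketch with made-up labels; the function name `precision_recall_f1` and the example data are assumptions for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: 4 actual positives, 3 predicted positives, 2 of them correct.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of three predicted positives are correct (precision 2/3), but only two of four actual positives are found (recall 1/2).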

Follow-up: When would you optimize for recall over precision?

What does regularization do in machine learning?

Difficulty: medium
Type: conceptual
Topic: regularization
Frequency: common
Tags: regularization, l1, l2

Answer

Regularization penalizes model complexity to improve generalization.

Explanation

L1 can encourage sparse weights, while L2 discourages large weights. Both reduce overfitting by making the learned function less sensitive to noise.

Follow-up: How is L1 regularization different from L2 regularization?

How do you handle an imbalanced classification dataset?

Difficulty: hard
Type: scenario
Topic: imbalanced-dataset
Frequency: common
Tags: class-imbalance, metrics, validation

Answer

Use appropriate metrics, class weighting, resampling, threshold tuning, and careful validation.

Explanation

Accuracy can hide poor minority-class performance. PR-AUC, recall, precision, confusion matrix analysis, and cost-aware thresholds often give a better picture.
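To see how accuracy hides minority-class failure, consider a classifier that never predicts the positive class. The 1% positive rate and the helper names below are assumptions chosen for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def recall(y_true, y_pred):
    """Fraction of actual positives that were predicted positive."""
    positives = sum(y_true)
    found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return found / positives if positives else 0.0

# 10 positives among 1000 examples, and a model that always answers "negative".
y_true = [1] * 10 + [0] * 990
always_negative = [0] * 1000

acc = accuracy(y_true, always_negative)
rec = recall(y_true, always_negative)
print(f"accuracy={acc:.2%} recall={rec:.2%}")
```

The model scores 99% accuracy while finding none of the positives, which is why recall, precision, and PR-AUC are the metrics to watch here.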

Follow-up: Why can oversampling before train-test split cause leakage?

How does gradient boosting differ from bagging?

Difficulty: medium
Type: conceptual
Topic: how-does-gradient-boosting-differ-from-bagging
Frequency: common
Tags: machine-learning, how, does, gradient, boosting, differ

Answer

Bagging trains models independently on bootstrap samples and averages them, which mainly reduces variance. Boosting trains models sequentially, each one correcting the errors of the ensemble so far, which mainly reduces bias.

Explanation

Random Forest is the standard bagging example: its trees can be trained in parallel on random subsets, and their predictions are averaged. In gradient boosting, each new tree is fit to the errors of the current ensemble. XGBoost adds L1/L2 regularization to the objective, handles missing values natively, and uses second-order gradients for faster convergence. Boosting was chosen for the price markdown project because demand patterns have structured nonlinearities that boosting captures well.
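The sequential "fit the remaining error" idea behind boosting can be sketched in a few lines. This is a toy version for squared-error loss with one-split trees (stumps) on a single feature, not how XGBoost is implemented; the function names, the sine-curve data, and the learning rate are assumptions.

```python
import math

def fit_stump(xs, residuals):
    """Fit a one-split regression tree (a stump) to the residuals."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, rounds, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    stumps, preds = [], [base] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [i / 10 for i in range(40)]
ys = [math.sin(x) for x in xs]
mean_y = sum(ys) / len(ys)

baseline_mse = mse(lambda x: mean_y, xs, ys)
mse_1 = mse(boost(xs, ys, 1), xs, ys)
mse_20 = mse(boost(xs, ys, 20), xs, ys)
print(f"mean only: {baseline_mse:.4f}  1 round: {mse_1:.4f}  20 rounds: {mse_20:.4f}")
```

Each round lowers the training error of a deliberately weak learner, which is the bias-reduction effect; bagging would instead average many independently trained strong learners.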

Follow-up: When would you choose one approach over the other?

Precision, recall, F1: which matters most in resume screening?

Difficulty: medium
Type: conceptual
Topic: precision-recall-f1-which-matters-most-in-resume-screening
Frequency: common
Tags: machine-learning, precision, recall, which, matters, most

Answer

Precision = TP/(TP+FP), recall = TP/(TP+FN), and F1 is the harmonic mean of the two. In resume screening, recall usually matters most.

Explanation

Missing a good candidate (a false negative) is costlier than surfacing a borderline one for human review (a false positive). With a downstream human-in-the-loop (HITL) review step, track F1 as well so that precision does not collapse, and use configurable scoring weights to tune the balance per job template.

Follow-up: Can you give a production example?

Why is k-fold cross-validation preferred over a single train-test split?

Difficulty: medium
Type: conceptual
Topic: is-k-fold-cross-validation-preferred-over-a-single-train-t
Frequency: common
Tags: machine-learning, why, fold, cross, validation, preferred

Answer

A single split gives a high-variance estimate: the score depends on which rows happen to land in the test set, so you might get lucky or unlucky.

Explanation

k-fold rotates the validation fold so that every row is used for validation exactly once and for training k-1 times; averaging the k scores gives a more stable estimate. Stratified k-fold preserves the class distribution in each fold. For time series such as demand forecasting, use TimeSeriesSplit so that each validation fold comes after its training data, which avoids data leakage.

Follow-up: Can you give a production example?

How does Random Forest reduce variance compared to a single decision tree?

Difficulty: medium
Type: conceptual
Topic: does-random-forest-reduce-variance-compared-to-a-single-de
Frequency: common
Tags: machine-learning, how, does, random, forest, reduce

Answer

By training many deep trees on random bootstrap samples (bagging), considering only a random subset of features at each split, and averaging their predictions.

Explanation

A single deep tree has low bias but high variance. Averaging many such trees cancels out much of their individual noise, but only to the extent that their errors are not correlated. The key is decorrelation: if every tree could choose from all features at every split, the trees would make correlated errors and averaging would not help much.
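The role of decorrelation can be made precise under a simplifying assumption: $B$ trees whose predictions each have variance $\sigma^2$ and pairwise correlation $\rho$ (symbols introduced here for illustration). The variance of their average is then:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

Adding trees shrinks only the second term. The first term is a floor set by the correlation between trees, which is what random feature subsets are meant to lower.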

Follow-up: Can you give a production example?

L1 vs L2 regularization: when do you use each?

Difficulty: medium
Type: conceptual
Topic: l1-vs-l2-regularization-when-do-you-use-each
Frequency: common
Tags: machine-learning, regularization, when, you, use, each

Answer

L1 (Lasso) adds the sum of |weights| to the loss and drives some weights to exactly zero, giving sparse models. L2 (Ridge) adds the sum of weights² and shrinks all weights toward zero without eliminating them.

Explanation

Use L1 when you expect only a few features to matter and want built-in feature selection. Use L2 when all features contribute, since it keeps every weight small but nonzero. ElasticNet combines both penalties. In LLM training, L2-like weight decay is standard.
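The difference between "exactly zero" and "small but nonzero" shows up in the simplest possible case: a single weight whose unregularized optimum is `w_ols` (for example, one column of an orthonormal design). The closed forms below hold under that assumption; the function names and example weights are illustrative.

```python
def lasso_1d(w_ols, lam):
    """Minimizer of 0.5*(w - w_ols)**2 + lam*abs(w): soft-thresholding."""
    magnitude = max(abs(w_ols) - lam, 0.0)
    if magnitude == 0.0:
        return 0.0  # weights smaller than lam are set exactly to zero
    return magnitude if w_ols > 0 else -magnitude

def ridge_1d(w_ols, lam):
    """Minimizer of 0.5*(w - w_ols)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return w_ols / (1.0 + lam)

ols_weights = [3.0, -1.5, 0.4, -0.2]
lam = 0.5

lasso = [lasso_1d(w, lam) for w in ols_weights]
ridge = [ridge_1d(w, lam) for w in ols_weights]
print("lasso:", lasso)
print("ridge:", [round(w, 4) for w in ridge])
```

L1 subtracts a fixed amount from every weight and clips at zero, so the two small weights vanish; L2 scales every weight by the same factor, so none of them do.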

Follow-up: When would you choose one approach over the other?

Describe the LP optimizer in Price Markdown Optimization.

Difficulty: medium
Type: scenario
Topic: the-lp-optimizer-in-price-markdown-optimization
Frequency: common
Tags: machine-learning, describe, the, optimizer, price, markdown

Answer

The LP takes demand forecasts as input and optimizes the markdown percentage per SKU-brand pair to minimize revenue loss while clearing inventory before end-of-life.

Explanation

Objective: maximize recovered revenue (equivalently, minimize revenue lost to markdowns) subject to sell-through constraints. Constraints: minimum margin floors, maximum markdown caps, and inventory depletion deadlines. Implemented with scipy.optimize or PuLP.
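One way to write such a problem as a linear program, purely as an illustrative formulation and not necessarily the one used in the project, is to discretize markdowns into levels $k$ and let $x_{ik}$ be the share of the remaining selling period that SKU-brand pair $i$ spends at level $k$. The symbols $p_{ik}$ (marked-down price), $d_{ik}$ (forecast demand at that price over the full period), $I_i$ (inventory), $s_i$ (required sell-through fraction), $c_i$ (unit cost), $m_i$ (margin floor), and $\bar{k}_i$ (deepest allowed level) are all assumptions:

```latex
\begin{aligned}
\max_{x}\quad & \sum_i \sum_k p_{ik}\, d_{ik}\, x_{ik} \\
\text{s.t.}\quad & \sum_k x_{ik} = 1 && \forall i \\
& s_i I_i \;\le\; \sum_k d_{ik}\, x_{ik} \;\le\; I_i && \forall i \\
& x_{ik} = 0 \quad \text{if } p_{ik} - c_i < m_i \text{ or } k > \bar{k}_i && \forall i, k \\
& 0 \le x_{ik} \le 1 && \forall i, k
\end{aligned}
```

Discretizing is what keeps the objective linear: revenue is price times demand, and demand itself depends on price, so a continuous markdown variable would make the objective nonlinear.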

Follow-up: What tradeoffs did you consider in that implementation?

How did you handle class imbalance in classification tasks?

Difficulty: medium
Type: scenario
Topic: did-you-handle-class-imbalance-in-classification-tasks
Frequency: common
Tags: machine-learning, how, did, you, handle, class

Answer

Options: oversampling the minority class (for example with SMOTE), undersampling the majority class, class_weight='balanced' in scikit-learn, adjusting the decision threshold after training, and evaluating with F1 or AUC-ROC instead of accuracy.

Explanation

In resume screening, threshold tuning plus weighted scoring worked better than resampling, since the minority examples were genuine top candidates, not noise. If resampling is used, apply it to the training split only, so that validation scores reflect the real class balance.
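Threshold tuning can be as simple as scanning candidate cutoffs on a validation set. The sketch below picks the highest threshold that still meets a recall target; the scores, labels, recall target, and function names are made up for illustration.

```python
def recall_at(y_true, scores, threshold):
    """Recall when every score >= threshold is predicted positive."""
    positives = sum(y_true)
    found = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    return found / positives if positives else 0.0

def pick_threshold(y_true, scores, min_recall):
    """Return the highest observed score that still achieves min_recall, or None."""
    for threshold in sorted(set(scores), reverse=True):
        if recall_at(y_true, scores, threshold) >= min_recall:
            return threshold
    return None

# Hypothetical validation scores from a screening model (1 = strong candidate).
y_true = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.45, 0.4, 0.3, 0.2, 0.1, 0.05]

default_recall = recall_at(y_true, scores, 0.5)
tuned_threshold = pick_threshold(y_true, scores, min_recall=0.9)
print(f"recall at 0.5: {default_recall:.2f}, tuned threshold: {tuned_threshold}")
```

At the default cutoff of 0.5 one of the three positives is missed; lowering the cutoff to 0.4 recovers it at the cost of more false positives, which a human review step can absorb.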

Follow-up: What tradeoffs did you consider in that implementation?

How do you select features for a forecasting model?

Difficulty: medium
Type: conceptual
Topic: do-you-select-features-for-a-forecasting-model
Frequency: common
Tags: machine-learning, how, you, select, features, for

Answer

Start with domain knowledge (price, seasonality, sell-through rate, days-to-expiry), then refine the candidate set with data-driven checks.

Explanation

Use correlation analysis, feature importance from a baseline tree model, and variance inflation factor (VIF) to flag multicollinearity. For time series, add lag features and rolling statistics. Drop features with high missingness or near-zero variance. Validate with time-aware cross-validation.
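The "drop high missingness or near-zero variance" step is easy to automate. A minimal sketch follows, assuming columns are stored as lists with `None` for missing values; the thresholds, column names, and function name are hypothetical.

```python
from statistics import pvariance

def screen_features(columns, max_missing=0.4, min_variance=1e-8):
    """Split feature names into kept and dropped (with a reason for each drop)."""
    kept, dropped = [], {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        missing_rate = 1 - len(present) / len(values)
        if missing_rate > max_missing:
            dropped[name] = "high missingness"
        elif len(present) < 2 or pvariance(present) <= min_variance:
            dropped[name] = "near-zero variance"
        else:
            kept.append(name)
    return kept, dropped

# Hypothetical candidate features for a demand model.
columns = {
    "price": [9.99, 7.99, 8.49, 6.99, 9.49],
    "days_to_expiry": [30, 21, 14, 7, 3],
    "store_format": [1, 1, 1, 1, 1],             # constant, carries no signal
    "promo_depth": [None, None, 0.2, None, None],  # mostly missing
}

kept, dropped = screen_features(columns)
print("kept:", kept)
print("dropped:", dropped)
```

The thresholds should be chosen on the training window only, for the same leakage reasons that apply to any other fitted preprocessing.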

Follow-up: Can you give a production example?

What ensemble methods did you use for demand forecasting?

Difficulty: medium
Type: conceptual
Topic: ensemble-methods-did-you-use-for-demand-forecasting
Frequency: common
Tags: machine-learning, what, ensemble, methods, did, you

Answer

Used an ensemble of XGBoost, a statistical baseline (exponential smoothing or ARIMA), and a linear model that acts as a simple, stabilizing component.

Explanation

The forecasts are combined by weighted averaging, or by stacking, where a meta-learner sets the weights based on recent forecast error. Combining models with different failure modes reduces the risk of any single model failing on edge-case SKUs.
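A simple form of error-based weighting is to make each model's weight inversely proportional to its recent error. The sketch below is one possible scheme, not necessarily the one used in the project; the model names, error values, and forecasts are made up.

```python
def inverse_error_weights(recent_errors):
    """Weights proportional to 1 / recent MAE, normalized to sum to 1."""
    inverse = {name: 1.0 / max(err, 1e-9) for name, err in recent_errors.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}

def blend(forecasts, weights):
    """Weighted average of the individual model forecasts."""
    return sum(weights[name] * forecasts[name] for name in forecasts)

# Hypothetical recent MAE per model and their forecasts for one SKU.
recent_mae = {"xgboost": 4.0, "exp_smoothing": 8.0, "linear": 8.0}
forecasts = {"xgboost": 120.0, "exp_smoothing": 100.0, "linear": 110.0}

weights = inverse_error_weights(recent_mae)
combined = blend(forecasts, weights)
print(weights, combined)
```

The model with half the error gets twice the weight. A stacking meta-learner generalizes this by learning the weights from out-of-fold predictions instead of using a fixed rule.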

Follow-up: Can you give a production example?

What is data leakage and how do you prevent it in time-series?

Difficulty: medium
Type: scenario
Topic: is-data-leakage-and-how-do-you-prevent-it-in-time-series
Frequency: common
Tags: machine-learning, what, data, leakage, and, how

Answer

Leakage is when information that would not be available at prediction time, such as future values, bleeds into training and inflates metrics.

Explanation

Prevention: use strict temporal splits (TimeSeriesSplit, no shuffling); build lag features only from values at time t-n or earlier, where n is at least the forecast horizon; compute rolling statistics from past values only, and fit any scalers or encoders on the training window only; and be careful with target encoding, which must not see validation targets. Always verify that your validation set strictly follows your training set in time.
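The two mechanical pieces, past-only features and forward-only splits, can be sketched in plain Python. The function names, the toy series, and the choice of an expanding training window are assumptions for illustration; scikit-learn's TimeSeriesSplit plays the role of the second function in practice.

```python
def lag_and_rolling_features(series, lag, window):
    """Build features for each time t using only values at or before t - lag."""
    rows = []
    for t in range(lag + window - 1, len(series)):
        past = series[t - lag - window + 1 : t - lag + 1]  # window ends at t - lag
        rows.append({
            "t": t,
            "lag_value": series[t - lag],
            "rolling_mean": sum(past) / window,
            "target": series[t],
        })
    return rows

def expanding_window_splits(n_samples, n_splits, val_size):
    """Yield (train_idx, val_idx) where validation always follows training in time."""
    if n_samples <= n_splits * val_size:
        raise ValueError("not enough samples for the requested splits")
    for i in range(n_splits):
        val_start = n_samples - (n_splits - i) * val_size
        yield list(range(val_start)), list(range(val_start, val_start + val_size))

series = [10, 12, 11, 13, 15, 14, 16, 18, 17, 19]
rows = lag_and_rolling_features(series, lag=1, window=3)
splits = list(expanding_window_splits(len(series), n_splits=3, val_size=2))

print(rows[0])
for train_idx, val_idx in splits:
    print("train up to", train_idx[-1], "-> validate on", val_idx)
```

The first feature row predicts the value at t=3 from the values at t=0..2 only, and every validation block starts after its training block ends.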

Follow-up: Can you give a production example?

How do you evaluate regression vs classification models?

Difficulty: medium
Type: conceptual
Topic: do-you-evaluate-regression-vs-classification-models
Frequency: common
Tags: machine-learning, how, you, evaluate, regression, classification

Answer

Regression: MAE, MSE, RMSE, MAPE, R². Classification: accuracy, precision, recall, F1, AUC-ROC, log loss.

Explanation

For imbalanced classification, F1 and AUC-ROC are more informative than accuracy, and PR-AUC is often more informative still when positives are rare. For business-critical regression such as demand forecasting, MAPE is intuitive but blows up when actual values are near zero; use sMAPE, or RMSE with domain thresholds, instead.
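The near-zero problem with MAPE is easy to reproduce. The sketch below uses made-up values and one common definition of sMAPE (mean of 2|y - ŷ| / (|y| + |ŷ|)); the function names are assumptions.

```python
import math

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction; unstable when y_true is near 0."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

def smape(y_true, y_pred):
    """Symmetric MAPE as a fraction; each term is bounded by 2."""
    return sum(2 * abs(t - p) / (abs(t) + abs(p))
               for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy demand values; the last item sells almost nothing.
y_true = [100.0, 50.0, 0.1]
y_pred = [110.0, 45.0, 1.1]

print(f"MAE={mae(y_true, y_pred):.2f}  RMSE={rmse(y_true, y_pred):.2f}")
print(f"MAPE={mape(y_true, y_pred):.0%}  sMAPE={smape(y_true, y_pred):.0%}")
```

An error of one unit on the slow-moving item contributes 1000% to MAPE and dominates the average, even though MAE and RMSE barely register it.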

Follow-up: When would you choose one approach over the other?

Explain SMOTE: when is it appropriate and when not?

Difficulty: medium
Type: conceptual
Topic: smote-when-is-it-appropriate-and-when-not
Frequency: common
Tags: machine-learning, explain, smote, when, appropriate, and

Answer

SMOTE generates synthetic minority samples by interpolating between an existing minority sample and one of its nearest minority-class neighbors in feature space.

Explanation

It is appropriate for tabular numeric features with enough minority examples to interpolate between. It is not appropriate for text, very sparse features, or cases where minority samples represent true anomalies (don't synthesize fraud patterns). For tree models, class_weight often works just as well with less risk.
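The interpolation step can be sketched directly. This is a simplified, SMOTE-style illustration rather than a reference implementation; the function names, the sample points, and the parameters `n_new` and `k` are assumptions.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic points on segments between minority neighbors."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = minority[:i] + minority[i + 1:]
        neighbors = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Hypothetical minority-class samples with two numeric features.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2), (1.8, 2.1)]
synthetic = smote_like(minority, n_new=10, k=2, rng=random.Random(0))

for point in synthetic[:3]:
    print(tuple(round(v, 3) for v in point))
```

Because every synthetic point lies between two real minority points, the method only makes sense when "between" is meaningful, which is why it suits dense numeric features and not text or isolated anomalies.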

Follow-up: Can you give a production example?

How do Scikit-learn pipelines work?

Difficulty: medium
Type: conceptual
Topic: do-scikit-learn-pipelines-work
Frequency: common
Tags: machine-learning, how, scikit, learn, pipelines, work

Answer

A Pipeline chains preprocessing steps and a final estimator into a single object with one fit call and one predict call.

Explanation

Each intermediate step implements fit and transform; the final step only needs fit (plus predict for a model). Benefits: it prevents leakage, because inside cross-validation every step is fit on the training fold and only applied to the validation fold; the code is cleaner; and the whole pipeline serializes with joblib as one artifact for deployment. Use a ColumnTransformer as a pipeline step for mixed-type data.
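A minimal example of the pattern, assuming scikit-learn and NumPy are installed. The synthetic data, the column layout (two numeric columns plus one integer-coded categorical column), and the step names `prep` and `model` are illustrative choices.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data: columns 0-1 are numeric, column 2 is a category code (0, 1, 2).
rng = np.random.default_rng(0)
n = 200
numeric = rng.normal(size=(n, 2))
category = rng.integers(0, 3, size=n)
y = (numeric[:, 0] + (category == 2) + rng.normal(scale=0.5, size=n) > 0.5).astype(int)
X = np.column_stack([numeric, category])

# Different preprocessing per column group, selected by column index.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), [0, 1]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), [2]),
])

pipeline = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression()),
])

# The scaler and encoder are re-fit on each training fold, never on validation rows.
scores = cross_val_score(pipeline, X, y, cv=5)
print("fold accuracies:", scores.round(3))

pipeline.fit(X, y)
preds = pipeline.predict(X[:5])
print("predictions:", preds)
```

Passing the whole pipeline to `cross_val_score`, rather than pre-scaling `X` first, is what keeps validation rows out of the fitted preprocessing.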

Follow-up: Can you give a production example?

How do you tune XGBoost at scale without overfitting?

Difficulty: hard
Type: conceptual
Topic: do-you-tune-xgboost-at-scale-without-overfitting
Frequency: common
Tags: machine-learning, how, you, tune, xgboost, scale

Answer

Key parameters: max_depth (keep it around 3-6), min_child_weight (higher means less overfitting), subsample and colsample_bytree (0.7-0.9), a small learning_rate with more boosting rounds, and lambda/alpha (reg_lambda/reg_alpha in the scikit-learn API) for L2/L1 regularization.

Explanation

Use early stopping on a validation set to choose the number of rounds. At scale, use Bayesian optimization (for example Optuna) instead of grid search. Always monitor the gap between training and validation loss.
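As a rough starting point, the ranges above translate into a parameter dictionary like the one below, and the early-stopping rule can be written out in a few lines to show what it does. The specific values, the loss curves, and the helper `early_stopping_round` are made up for illustration; XGBoost provides its own early stopping, which this sketch only imitates.

```python
# Hypothetical starting values within the ranges discussed; tune per dataset.
xgb_params = {
    "max_depth": 4,
    "min_child_weight": 5,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "learning_rate": 0.05,
    "reg_lambda": 1.0,  # L2 penalty
    "reg_alpha": 0.1,   # L1 penalty
}

def early_stopping_round(val_losses, patience):
    """Return the best round, stopping once `patience` rounds pass without improvement."""
    best_round, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_round, best_loss = i, loss
        elif i - best_round >= patience:
            break
    return best_round

# Made-up loss curves: training keeps improving while validation turns around.
train_losses = [0.70, 0.55, 0.45, 0.38, 0.33, 0.29, 0.26, 0.24]
val_losses = [0.72, 0.60, 0.52, 0.49, 0.48, 0.49, 0.51, 0.54]

best = early_stopping_round(val_losses, patience=2)
gap = val_losses[best] - train_losses[best]
print(f"best round: {best}, train/val gap at that round: {gap:.2f}")
```

Training stops at the round where validation loss bottoms out, even though training loss would keep falling; the widening gap after that point is the overfitting signal to monitor.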

Follow-up: Can you give a production example?
