
Machine Learning Interview Questions

Model training, validation, evaluation, and feature tradeoffs for ML interviews.

22 questions

What is overfitting? (medium)

Type
conceptual
Topic
overfitting
Frequency
common
Tags
overfitting
Answer

Overfitting happens when a model learns training noise and fails on new data.

Explanation

It often appears as high training performance and weaker validation performance. Regularization, simpler models, better validation, and more data can help.
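The gap between training and validation error can be seen on a toy problem. The sketch below is illustrative only: the sine target, noise level, sample sizes, and polynomial degrees are all assumptions chosen for the demo, not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a smooth target (a made-up toy problem)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data(20)
x_val, y_val = make_data(200)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# Fit polynomials of increasing complexity on the same training set.
results = {}
for degree in (1, 3, 12):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_val, y_val))

for degree, (train_err, val_err) in results.items():
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  val MSE={val_err:.3f}")
```

Training error can only go down as the degree rises; a validation error that stops improving or rises while training error keeps falling is the usual signature of overfitting.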

Follow-up: How do you detect overfitting from metrics?

What is the bias-variance tradeoff? (medium)

Type
conceptual
Topic
bias-variance-tradeoff
Frequency
common
Tags
bias, variance, tradeoff
Answer

It balances underfitting from high bias and overfitting from high variance.

Explanation

Simple models may miss patterns, while complex models may be too sensitive to training data. Good generalization manages both.

Follow-up: How does model complexity affect bias and variance?

What is cross-validation? (hard)

Type
conceptual
Topic
cross-validation
Frequency
common
Tags
cross, validation
Answer

It evaluates a model across multiple train-validation splits.

Explanation

K-fold cross-validation reduces dependence on one split and gives a more stable estimate of generalization.
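A minimal sketch of the mechanics, written with NumPy only. The helper name `k_fold_indices` and the toy "model" (the training-fold mean) are hypothetical and exist only to show how the folds rotate.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Toy regression data; the "model" is just the mean of the training fold.
rng = np.random.default_rng(1)
y = rng.normal(loc=10.0, scale=2.0, size=50)

fold_mse = []
for train_idx, val_idx in k_fold_indices(len(y), k=5):
    prediction = y[train_idx].mean()  # fitted on the training fold only
    fold_mse.append(float(np.mean((y[val_idx] - prediction) ** 2)))

print("per-fold MSE:", np.round(fold_mse, 3))
print("mean MSE:", round(float(np.mean(fold_mse)), 3))
```

Every sample lands in exactly one validation fold, and the reported score is the average over folds rather than the result of one arbitrary split.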

Follow-up: Why should preprocessing be fit inside each training fold?

What is feature leakage? (medium)

Type
conceptual
Topic
feature-leakage
Frequency
common
Tags
feature, leakage
Answer

Leakage happens when training data includes information unavailable at prediction time.

Explanation

Leakage creates unrealistically strong offline metrics and weak production performance because the model learned future or target-derived signals.

Follow-up: How do you prevent leakage in time-based data?

How do you choose an evaluation metric? (medium)

Type
conceptual
Topic
choose-evaluation-metric
Frequency
common
Tags
choose, evaluation, metric
Answer

Pick a metric that matches the task and business cost of errors.

Explanation

Classification may use precision, recall, F1, ROC-AUC, or PR-AUC. Regression may use MAE, RMSE, or R²: RMSE penalizes large errors more heavily than MAE, so the choice depends on how costly large misses are.

Follow-up: When is accuracy a bad metric?

What is the difference between precision and recall? (medium)

Type
conceptual
Topic
precision-recall
Frequency
common
Tags
classification, precision, recall
Answer

Precision measures how many predicted positives are correct; recall measures how many actual positives are found.

Explanation

Precision matters when false positives are costly, such as a spam filter that blocks legitimate mail. Recall matters when missing a positive is costly, such as in fraud detection, disease screening, or safety alerts.

Follow-up: When would you optimize for recall over precision?

What does regularization do in machine learning? (medium)

Type
conceptual
Topic
regularization
Frequency
common
Tags
regularization, l1, l2
Answer

Regularization penalizes model complexity to improve generalization.

Explanation

L1 can encourage sparse weights, while L2 discourages large weights. Both reduce overfitting by making the learned function less sensitive to noise.

Follow-up: How is L1 regularization different from L2 regularization?

How do you handle an imbalanced classification dataset? (hard)

Type
scenario
Topic
imbalanced-dataset
Frequency
common
Tags
class-imbalance, metrics, validation
Answer

Use appropriate metrics, class weighting, resampling, threshold tuning, and careful validation.

Explanation

Accuracy can hide poor minority-class performance. PR-AUC, recall, precision, confusion matrix analysis, and cost-aware thresholds often give a better picture.
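A quick illustration of how accuracy hides minority-class failure. The 1% positive rate and the always-negative "model" are made up for the example.

```python
# 10 positives among 1000 samples, and a model that always predicts negative.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = correct / len(y_true)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f}  recall={recall:.2f}")  # accuracy=0.99  recall=0.00
```

The model looks excellent by accuracy while finding none of the positives, which is why recall, precision, and PR-AUC are the metrics to watch here.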

Follow-up: Why can oversampling before train-test split cause leakage?

How does gradient boosting differ from bagging? (medium)

Type
conceptual
Topic
how-does-gradient-boosting-differ-from-bagging
Frequency
common
Tags
machine-learning, how, does, gradient, boosting, differ
Answer

Bagging (for example, Random Forest) trains trees independently on bootstrap samples and averages them, which mainly reduces variance. Boosting trains trees sequentially, each one correcting the errors of the ensemble so far, which mainly reduces bias.

Explanation

Because bagged trees are independent, they can be trained in parallel; boosted trees cannot, since each one depends on its predecessors. XGBoost adds L1/L2 regularization to the objective, handles missing values natively, and uses second-order gradients for faster convergence. For the price markdown use case, boosting was chosen because demand patterns have structured nonlinearities that it captures well.
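The sequential "fit the residuals" idea behind boosting can be sketched with decision stumps and squared loss. This is a bare-bones illustration in NumPy, not XGBoost's algorithm: there is no regularization, no second-order information, and the data, learning rate, and round count are assumptions.

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-split regression stump on a 1-D feature (least squares)."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = x <= threshold
        left_mean, right_mean = residual[left].mean(), residual[~left].mean()
        sse = (((residual[left] - left_mean) ** 2).sum()
               + ((residual[~left] - right_mean) ** 2).sum())
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda q: np.where(q <= threshold, left_mean, right_mean)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 6.0, size=200)
y = np.sin(x) + rng.normal(scale=0.2, size=200)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())  # round 0: a constant model
train_mse = [float(np.mean((y - prediction) ** 2))]

for _ in range(50):
    stump = fit_stump(x, y - prediction)    # each stump targets the current residuals
    prediction += learning_rate * stump(x)  # shrink its contribution, then add it
    train_mse.append(float(np.mean((y - prediction) ** 2)))

print(f"train MSE: start={train_mse[0]:.3f}  end={train_mse[-1]:.3f}")
```

Each weak learner on its own is heavily biased; adding them one after another removes that bias step by step. Bagging would instead fit every tree to its own bootstrap sample of `y` and average the results.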

Follow-up: When would you choose one approach over the other?

Precision, recall, F1: which matters most in resume screening? (medium)

Type
conceptual
Topic
precision-recall-f1-which-matters-most-in-resume-screening
Frequency
common
Tags
machine-learning, precision, recall, which, matters, most
Answer

Precision = TP/(TP+FP), recall = TP/(TP+FN), and F1 is the harmonic mean of the two. In resume screening, recall usually matters most.

Explanation

Missing a good candidate (a false negative) is costlier than surfacing a borderline one for human review, so recall takes priority. Because a downstream human-in-the-loop (HITL) review absorbs some false positives, balance with F1 and use configurable scoring weights to tune the tradeoff per job template.
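The three formulas translate directly into code. The confusion counts below are hypothetical screening numbers invented for the example.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1

# Hypothetical screen: 40 good candidates surfaced, 60 borderline ones
# surfaced, and 10 good candidates missed.
p, r, f1 = precision_recall_f1(tp=40, fp=60, fn=10)
print(f"precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}")  # precision=0.40  recall=0.80  f1=0.53
```

A recall-first screen accepts low precision like this as long as human reviewers can handle the extra volume.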

Follow-up: Can you give a production example?

Why is k-fold cross-validation preferred over a single train-test split? (medium)

Type
conceptual
Topic
is-k-fold-cross-validation-preferred-over-a-single-train-t
Frequency
common
Tags
machine-learning, why, fold, cross, validation, preferred
Answer

A single split gives a high-variance estimate: you might get a lucky or an unlucky split.

Explanation

k-fold uses every sample for both training and validation across k rounds, so the averaged score is more stable. Stratified k-fold preserves the class distribution in each fold. For time series such as demand forecasting, use TimeSeriesSplit so that each validation fold comes after its training data, which avoids leaking future information.
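Both splitters are available in scikit-learn. The sketch below, on made-up data with 20% positives, only inspects the indices they produce.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

X = np.arange(40).reshape(-1, 1)   # rows are assumed to be in time order
y = np.array([1] * 8 + [0] * 32)   # 20% positives

# Stratified k-fold: every validation fold keeps the 20% positive rate.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
positive_rates = [float(y[val_idx].mean()) for _, val_idx in skf.split(X, y)]
print("positive rate per fold:", positive_rates)

# TimeSeriesSplit: validation indices always come after the training indices.
tscv = TimeSeriesSplit(n_splits=4)
time_splits = list(tscv.split(X))
for train_idx, val_idx in time_splits:
    print(f"train ends at {train_idx.max()}, validation covers {val_idx.min()}..{val_idx.max()}")
```

Note that the stratified splitter shuffles, so it is not suitable for the time-ordered case; the two tools answer different questions.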

Follow-up: Can you give a production example?

How does Random Forest reduce variance compared to a single decision tree? (medium)

Type
conceptual
Topic
does-random-forest-reduce-variance-compared-to-a-single-de
Frequency
common
Tags
machine-learning, how, does, random, forest, reduce
Answer

By training many deep trees on random bootstrap samples (bagging), using a random feature subset at each split, and averaging their predictions.

Explanation

Averaging many weakly correlated trees cancels much of their individual noise. The key is decorrelation: if every tree could consider all features at every split, the trees would tend to pick the same strong splits, make correlated errors, and averaging would help much less.
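The effect of correlation can be quantified with a standard result: the average of n predictors, each with variance σ² and pairwise correlation ρ, has variance ρσ² + (1 − ρ)σ²/n. The helper below simply evaluates that formula; treating every pair of trees as equally correlated is a simplifying assumption.

```python
def ensemble_variance(sigma2, rho, n_trees):
    """Variance of the mean of n_trees predictors with equal pairwise correlation."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees

for rho in (0.0, 0.5, 1.0):
    print(f"rho={rho:.1f}  variance of 100-tree average={ensemble_variance(1.0, rho, 100):.3f}")
# rho=0.0  variance of 100-tree average=0.010
# rho=0.5  variance of 100-tree average=0.505
# rho=1.0  variance of 100-tree average=1.000
```

Adding trees shrinks only the second term, so the correlation ρ sets a floor on the variance. That is why the random feature subset matters as much as the number of trees.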

Follow-up: Can you give a production example?

L1 vs L2 regularization: when do you use each? (medium)

Type
conceptual
Topic
l1-vs-l2-regularization-when-do-you-use-each
Frequency
common
Tags
machine-learning, regularization, when, you, use, each
Answer

L1 (Lasso) adds the sum of |weights| to the loss, which drives some weights to exactly zero and gives sparse models. L2 (Ridge) adds the sum of weights², which shrinks all weights toward zero but leaves them nonzero.

Explanation

Use L1 when you want built-in feature selection. Use L2 when all features contribute. ElasticNet combines both penalties. In LLM training, L2-like weight decay is standard.
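The difference is easiest to see in the special case of orthonormal features, where both penalties have closed-form solutions in terms of the unregularized (OLS) weight. This sketch assumes the objectives ½‖y − Xw‖² + λ‖w‖₁ for L1 and ½‖y − Xw‖² + (λ/2)‖w‖² for L2; the weights and λ are made up.

```python
import math

def l1_shrink(w_ols, lam):
    """Soft-thresholding: the Lasso solution for an orthonormal design."""
    magnitude = max(abs(w_ols) - lam, 0.0)
    return 0.0 if magnitude == 0.0 else math.copysign(magnitude, w_ols)

def l2_shrink(w_ols, lam):
    """Proportional shrinkage: the Ridge solution for an orthonormal design."""
    return w_ols / (1.0 + lam)

ols_weights = [3.0, -0.4, 0.2, -1.5]
lam = 0.5

l1_weights = [l1_shrink(w, lam) for w in ols_weights]
l2_weights = [l2_shrink(w, lam) for w in ols_weights]

print("L1:", l1_weights)                          # L1: [2.5, 0.0, 0.0, -1.0]
print("L2:", [round(w, 3) for w in l2_weights])   # L2: [2.0, -0.267, 0.133, -1.0]
```

L1 subtracts a fixed amount and clips at zero, so small weights vanish. L2 rescales every weight by the same factor, so nothing reaches zero.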

Follow-up: When would you choose one approach over the other?

Describe the LP optimizer in Price Markdown Optimization. (medium)

Type
scenario
Topic
the-lp-optimizer-in-price-markdown-optimization
Frequency
common
Tags
machine-learning, describe, the, optimizer, price, markdown
Answer

The LP takes demand forecasts as input and chooses a markdown percentage per SKU-brand pair to minimize revenue loss while clearing inventory before end-of-life.

Explanation

Objective: maximize recovered revenue (the same goal as minimizing revenue lost to discounting) subject to sell-through constraints. Constraints: minimum margin floors, maximum markdown caps, and inventory depletion deadlines. Implemented with scipy.optimize or PuLP.
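A heavily simplified stand-in for such an LP, using scipy.optimize.linprog. This is not the formulation described above: to keep the problem linear, it decides how many units of a single SKU to sell at each of three fixed markdown tiers, and all prices, demand caps, and inventory figures are hypothetical. Margin floors and markdown caps are represented only by which tiers are offered.

```python
from scipy.optimize import linprog

prices = [100.0, 80.0, 60.0]      # price at 0%, 20%, 40% markdown (hypothetical)
demand_caps = [30.0, 50.0, 80.0]  # forecast units sellable at each tier (hypothetical)
inventory = 120.0                 # units on hand
min_sell_through = 100.0          # units that must clear before end-of-life

# linprog minimizes, so negate revenue in order to maximize it.
c = [-price for price in prices]

A_ub = [
    [1.0, 1.0, 1.0],     # total units sold <= inventory
    [-1.0, -1.0, -1.0],  # total units sold >= min_sell_through
]
b_ub = [inventory, -min_sell_through]

# Each tier can sell between 0 units and its forecast demand.
bounds = [(0.0, cap) for cap in demand_caps]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
units_per_tier = result.x
revenue = -result.fun

print("units per tier:", [round(float(u), 1) for u in units_per_tier])
print("recovered revenue:", round(float(revenue), 2))
```

The solver fills the higher-priced tiers first and uses the deepest markdown only for the remaining stock, which is the intuition behind the real optimizer as well.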

Follow-up: What tradeoffs did you consider in that implementation?

How did you handle class imbalance in classification tasks? (medium)

Type
scenario
Topic
did-you-handle-class-imbalance-in-classification-tasks
Frequency
common
Tags
machine-learning, how, did, you, handle, class
Answer

Options: oversampling the minority class (SMOTE), undersampling the majority class, class_weight='balanced' in sklearn, adjusting the decision threshold after training, or evaluating with F1/AUC-ROC instead of accuracy.

Explanation

In resume screening, threshold tuning plus weighted scoring worked better than resampling, since the minority examples were genuine top candidates, not noise. Whatever the technique, apply resampling only to the training split so that validation data stays untouched.
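Threshold tuning can be as simple as sweeping candidate cutoffs on held-out scores and keeping the one with the best F1. The scores and labels below are invented; in practice they would come from a validation set, never from the test set.

```python
# Hypothetical validation scores (model probabilities) and true labels.
scores = [0.95, 0.80, 0.65, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

def f1_at(threshold):
    """F1 when every score >= threshold is predicted positive."""
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Try each observed score as a cutoff and keep the best one.
best_threshold = max(sorted(set(scores)), key=f1_at)
best_f1 = f1_at(best_threshold)

print(f"best threshold={best_threshold}  F1={best_f1:.2f}")  # best threshold=0.55  F1=0.75
```

Swapping `f1_at` for a cost-weighted function turns the same sweep into a cost-aware threshold search.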

Follow-up: What tradeoffs did you consider in that implementation?

How do you select features for a forecasting model? (medium)

Type
conceptual
Topic
do-you-select-features-for-a-forecasting-model
Frequency
common
Tags
machine-learning, how, you, select, features, for
Answer

Start with domain knowledge (price, seasonality, sell-through rate, days-to-expiry).

Explanation

Use correlation analysis, feature importance from a baseline tree model, and variance inflation factor (VIF) checks for multicollinearity. For time series, add lag features and rolling statistics. Drop features with high missingness or near-zero variance. Validate with time-aware CV.
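The VIF check can be sketched in a few lines of NumPy using the identity that VIF for feature j equals the j-th diagonal entry of the inverse correlation matrix. The feature names and synthetic data are assumptions for illustration.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF per column, via the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(0)
price = rng.normal(size=500)
seasonality = rng.normal(size=500)
# A near-duplicate of price, e.g. the same signal recorded by another system.
price_copy = price + 0.05 * rng.normal(size=500)

X = np.column_stack([price, seasonality, price_copy])
vif = variance_inflation_factors(X)

for name, value in zip(["price", "seasonality", "price_copy"], vif):
    print(f"{name:12s} VIF={value:.1f}")
```

A common rule of thumb flags VIF above roughly 5 to 10; here the two price columns would be flagged and one of them dropped.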

Follow-up: Can you give a production example?

What ensemble methods did you use for demand forecasting? (medium)

Type
conceptual
Topic
ensemble-methods-did-you-use-for-demand-forecasting
Frequency
common
Tags
machine-learning, what, ensemble, methods, did, you
Answer

Used an ensemble of XGBoost, a statistical baseline (exponential smoothing or ARIMA), and a linear model as a simple, regularizing component.

Explanation

The models are combined by weighted averaging, or by stacking, where a meta-learner sets the weights based on recent forecast error. This reduces the risk of any single model failing on edge-case SKUs.
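One simple way to set the weights is to make them inversely proportional to each model's recent error. The sketch below shows that weighted-averaging variant only (not stacking); the model names, error values, and forecasts are hypothetical.

```python
def inverse_error_weights(recent_errors):
    """Weights proportional to 1/error, normalized to sum to 1. Errors must be > 0."""
    inverse = {name: 1.0 / error for name, error in recent_errors.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}

# Hypothetical recent MAE per model and next-period forecasts for one SKU.
recent_mae = {"xgboost": 2.0, "exp_smoothing": 4.0, "linear": 4.0}
forecasts = {"xgboost": 100.0, "exp_smoothing": 120.0, "linear": 110.0}

weights = inverse_error_weights(recent_mae)
blended = sum(weights[name] * forecasts[name] for name in forecasts)

print("weights:", weights)            # xgboost 0.5, the other two 0.25 each
print("blended forecast:", blended)   # blended forecast: 107.5
```

Recomputing the weights on a rolling window lets the blend shift toward whichever model has been most accurate lately.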

Follow-up: Can you give a production example?

What is data leakage and how do you prevent it in time-series? (medium)

Type
scenario
Topic
is-data-leakage-and-how-do-you-prevent-it-in-time-series
Frequency
common
Tags
machine-learning, what, data, leakage, and, how
Answer

Leakage is when future information bleeds into training, inflating metrics.

Explanation

Prevention: use strict temporal splits (TimeSeriesSplit, no shuffling); build lag features only from values at t-n or earlier; compute rolling statistics on the training window only; and fit target encodings on training data only. Always verify that your validation set strictly follows your training set in time.
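A sketch of leakage-safe lag and rolling features in NumPy. The helper names and the demand series are made up; the point is that the feature at time t uses only values strictly before t.

```python
import numpy as np

def lag_feature(series, lag):
    """Value from `lag` steps earlier (lag >= 1); NaN where no history exists."""
    out = np.full(len(series), np.nan)
    out[lag:] = series[:-lag]
    return out

def trailing_mean(series, window):
    """Mean of the `window` values strictly before each time step."""
    out = np.full(len(series), np.nan)
    for t in range(window, len(series)):
        out[t] = series[t - window:t].mean()  # slice excludes series[t] itself
    return out

demand = np.array([10.0, 12.0, 11.0, 15.0, 14.0, 18.0, 17.0, 20.0])
lag_1 = lag_feature(demand, 1)
roll_3 = trailing_mean(demand, 3)

# Temporal split: the validation rows come strictly after the training rows.
split = 6
train_rows, val_rows = np.arange(split), np.arange(split, len(demand))

print("lag_1 :", lag_1)
print("roll_3:", np.round(roll_3, 2))
print("train rows:", train_rows, "validation rows:", val_rows)
```

A rolling mean that includes the current value, or a shuffled split, would let the target leak into its own features.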

Follow-up: Can you give a production example?

How do you evaluate regression vs classification models? (medium)

Type
conceptual
Topic
do-you-evaluate-regression-vs-classification-models
Frequency
common
Tags
machine-learning, how, you, evaluate, regression, classification
Answer

Regression: MAE, MSE, RMSE, MAPE, R². Classification: accuracy, precision, recall, F1, AUC-ROC, log loss.

Explanation

For imbalanced classification, F1, PR-AUC, and AUC-ROC are more informative than accuracy, although AUC-ROC can still look optimistic when positives are very rare. For business-critical regression such as demand forecasting, MAPE is intuitive but blows up when actual values are near zero, so use sMAPE or RMSE with domain thresholds.
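The near-zero problem with MAPE shows up immediately on a small example. sMAPE has several variants; this sketch assumes the common form 2|a − f| / (|a| + |f|), which is bounded by 200%. The demand values are invented.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent. Undefined if any actual is 0."""
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)

def smape(actual, forecast):
    """Symmetric MAPE, in percent (bounded by 200). Undefined if a and f are both 0."""
    terms = (2.0 * abs(a - f) / (abs(a) + abs(f)) for a, f in zip(actual, forecast))
    return 100.0 * sum(terms) / len(actual)

# The third SKU sells almost nothing, so a small absolute miss is a huge relative one.
actual = [100.0, 50.0, 0.1]
forecast = [110.0, 45.0, 1.0]

mape_value = mape(actual, forecast)
smape_value = smape(actual, forecast)
print(f"MAPE={mape_value:.1f}%  sMAPE={smape_value:.1f}%")  # MAPE=306.7%  sMAPE=61.2%
```

Two forecasts are within 10% and the third misses by under one unit, yet MAPE reports over 300% because of the single low-volume item.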

Follow-up: When would you choose one approach over the other?

Explain SMOTE: when is it appropriate and when not? (medium)

Type
conceptual
Topic
smote-when-is-it-appropriate-and-when-not
Frequency
common
Tags
machine-learning, explain, smote, when, appropriate, and
Answer

SMOTE generates synthetic minority samples by interpolating between existing ones in feature space.

Explanation

It is appropriate for tabular numeric features with enough minority examples to interpolate between. It is not appropriate for text, very sparse features, or cases where minority samples represent true anomalies (do not synthesize fraud patterns). For tree models, class_weight often works just as well with less risk.
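The interpolation step can be sketched in NumPy. This is a simplified illustration of the idea, not the imbalanced-learn implementation; the function name `smote_like`, the neighbor count, and the sample points are assumptions, and the points are assumed to be distinct.

```python
import numpy as np

def smote_like(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points on segments between minority neighbors."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))              # pick a minority sample
        distances = np.linalg.norm(X_minority - X_minority[i], axis=1)
        neighbors = np.argsort(distances)[1:k + 1]     # k nearest, skipping itself
        j = rng.choice(neighbors)                      # pick one neighbor
        gap = rng.random()                             # position along the segment
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

X_min = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 0.8],
                  [1.2, 2.0], [1.8, 1.6], [2.2, 1.4]])
synthetic = smote_like(X_min, n_new=10)
print(np.round(synthetic, 2))
```

Every synthetic point lies between two real minority points, which is why the method needs numeric features and enough real examples to interpolate between.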

Follow-up: Can you give a production example?

How do Scikit-learn pipelines work? (medium)

Type
conceptual
Topic
do-scikit-learn-pipelines-work
Frequency
common
Tags
machine-learning, how, scikit, learn, pipelines, work
Answer

A Pipeline chains preprocessing steps and the final estimator.

Explanation

Every intermediate step implements fit/transform; the last step is an estimator that implements fit and, usually, predict. Benefits: it prevents leakage (inside cross-validation, each step is fit on the training fold and only applied to the validation fold), keeps code cleaner, serializes easily with joblib, and deploys as a single object. Nest a ColumnTransformer inside the pipeline for mixed-type data.
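A minimal pipeline evaluated with cross-validation. The synthetic data, step names, and choice of LogisticRegression are assumptions for the example; a ColumnTransformer is left out to keep it short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three features on very different scales; the label depends on the first two.
X = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 0.01])
y = (X[:, 0] + X[:, 1] / 50.0 > 0).astype(int)

pipeline = Pipeline([
    ("scale", StandardScaler()),      # intermediate step: fit/transform
    ("model", LogisticRegression()),  # final step: fit/predict
])

# cross_val_score clones and refits the whole pipeline on each training fold,
# so the scaler never sees the validation fold's statistics.
scores = cross_val_score(pipeline, X, y, cv=5)
print("fold accuracy:", np.round(scores, 3))
print("mean accuracy:", round(float(scores.mean()), 3))
```

Scaling `X` once before calling `cross_val_score` would look similar but would leak validation statistics into training, which is exactly what the pipeline avoids.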

Follow-up: Can you give a production example?

How do you tune XGBoost at scale without overfitting? (hard)

Type
conceptual
Topic
do-you-tune-xgboost-at-scale-without-overfitting
Frequency
common
Tags
machine-learning, how, you, tune, xgboost, scale
Answer

Key params: max_depth (keep 3-6), min_child_weight (higher = less overfitting), subsample and colsample_bytree (0.7-0.9), a small learning_rate with more rounds, and lambda/alpha for L2/L1 regularization.

Explanation

Use early stopping on a validation set. For scale, use Bayesian optimization (for example, Optuna) instead of grid search. Always monitor the gap between training and validation loss.
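As a starting point, the parameters above might be written as a config like the one below. The specific values are illustrative picks within the stated ranges, not tuned recommendations, and the keys follow the names used by XGBoost's scikit-learn wrapper (`reg_lambda` and `reg_alpha` for lambda and alpha). The early-stopping helper is a library-free sketch of the logic, with made-up validation losses.

```python
# Illustrative starting configuration (values are assumptions, not tuned results).
xgb_params = {
    "max_depth": 4,            # shallow trees: keep within 3-6
    "min_child_weight": 5,     # higher values make splits more conservative
    "subsample": 0.8,          # row sampling per tree
    "colsample_bytree": 0.8,   # feature sampling per tree
    "learning_rate": 0.05,     # small step size, paired with many rounds
    "n_estimators": 2000,      # upper bound; early stopping picks the real count
    "reg_lambda": 1.0,         # L2 penalty
    "reg_alpha": 0.1,          # L1 penalty
}

def early_stopping_round(val_losses, patience):
    """Index of the best round, stopping after `patience` rounds without improvement."""
    best_round, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_round, best_loss = i, loss
        elif i - best_round >= patience:
            break
    return best_round

# Hypothetical validation losses per boosting round.
val_losses = [0.60, 0.50, 0.45, 0.44, 0.46, 0.47, 0.48, 0.43]
print("stop at round:", early_stopping_round(val_losses, patience=3))  # stop at round: 3
```

Training halts before the late dip to 0.43 is ever seen. That is the tradeoff `patience` controls: a larger value costs more compute but is less likely to stop on a temporary plateau.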

Follow-up: Can you give a production example?