InterviewSkill

MLOps Interview Questions

Production ML lifecycle concepts for deployment, monitoring, and reliability interviews.

17 questions
MLOps

What is MLOps? (medium)

Type
conceptual
Topic
mlops
Frequency
common
Tags
mlops
Answer

MLOps is the practice of reliably building, deploying, and monitoring ML systems.

Explanation

It combines software engineering, data engineering, model training, deployment, monitoring, governance, and retraining workflows.

Follow-up: How is MLOps different from DevOps?

What is data drift? (medium)

Type
conceptual
Topic
data-drift
Frequency
common
Tags
data, drift
Answer

Data drift occurs when the distribution of production input data shifts away from the distribution of the training data.

Explanation

Drift can reduce model performance. Monitoring feature distributions and business outcomes helps detect when retraining may be needed.
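
One way to monitor a feature distribution is the Population Stability Index (PSI). The sketch below is a minimal illustration, assuming equal-width bins derived from the training sample; the function name, bin count, and smoothing floor are illustrative choices, not a standard API.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a training and a production sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def proportions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range production values into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but these cutoffs are conventions and should be tuned per feature.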

Follow-up: What is concept drift?

What is a model registry? (hard)

Type
conceptual
Topic
model-registry
Frequency
common
Tags
model, registry
Answer

A model registry tracks model versions, metadata, stages, and artifacts.

Explanation

It helps teams manage promotion from experimentation to staging and production with reproducibility and auditability.
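
To make the idea concrete, here is a toy in-memory registry. It is a sketch only: the class names, stage names, and the `s3://` URI are hypothetical and do not mirror any real registry product.

```python
from dataclasses import dataclass, field

STAGES = ("experimental", "staging", "production")  # assumed stage names


@dataclass
class ModelVersion:
    name: str
    version: int
    artifact_uri: str
    metrics: dict
    stage: str = "experimental"


@dataclass
class Registry:
    versions: dict = field(default_factory=dict)

    def register(self, name: str, artifact_uri: str, metrics: dict) -> ModelVersion:
        # Version numbers increase per model name.
        version = 1 + sum(1 for (n, _) in self.versions if n == name)
        mv = ModelVersion(name, version, artifact_uri, metrics)
        self.versions[(name, version)] = mv
        return mv

    def promote(self, name: str, version: int) -> ModelVersion:
        # Move one stage forward: experimental -> staging -> production.
        mv = self.versions[(name, version)]
        idx = STAGES.index(mv.stage)
        if idx == len(STAGES) - 1:
            raise ValueError("already in production")
        mv.stage = STAGES[idx + 1]
        return mv
```

A real registry would also persist lineage (training data version, code commit) and record who approved each promotion.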

Follow-up: What metadata should a registry store?

How do you monitor a model in production? (medium)

Type
conceptual
Topic
monitor-model-production
Frequency
common
Tags
monitor, model, production
Answer

Track performance, drift, latency, errors, data quality, and business metrics.

Explanation

Monitoring must cover both software health and model behavior because a service can be available while predictions become poor.

Follow-up: What if labels arrive late?

What is CI/CD for ML? (medium)

Type
conceptual
Topic
ci-cd-ml
Frequency
common
Tags
ci, cd, ml
Answer

It automates testing and deployment of code, data pipelines, and model artifacts.

Explanation

ML CI/CD should validate data schemas, feature logic, model quality, reproducibility, and serving behavior before release.
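
Two of these checks, schema validation and a model quality gate, can be sketched as plain functions that a CI job would call. The column names, types, and the allowed metric drop are assumptions made for illustration.

```python
EXPECTED_SCHEMA = {"age": int, "income": float, "country": str}  # hypothetical columns


def schema_errors(rows, schema=EXPECTED_SCHEMA):
    """Return human-readable schema violations for a batch of dict rows."""
    errors = []
    for i, row in enumerate(rows):
        missing = schema.keys() - row.keys()
        if missing:
            errors.append(f"row {i}: missing {sorted(missing)}")
        for col, typ in schema.items():
            if col in row and not isinstance(row[col], typ):
                errors.append(f"row {i}: {col} is not {typ.__name__}")
    return errors


def passes_quality_gate(candidate_metric, baseline_metric, max_drop=0.01):
    """Block the release if the candidate is worse than baseline by more than max_drop."""
    return candidate_metric >= baseline_metric - max_drop
```

In a pipeline, a non-empty error list or a failed gate would fail the build before the model artifact is published.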

Follow-up: What tests are unique to ML pipelines?

How does a feature store work in production ML? (hard)

Type
scenario
Topic
does-a-feature-store-work-how-did-you-use-it-in-sagemaker
Frequency
common
Tags
mlops, how, does, feature, store, work
Answer

A feature store centralizes computed features for reuse across models, ensuring training-serving consistency.

Explanation

SageMaker Feature Store, for example, has an online store for low-latency serving and an offline store backed by S3 for training. In a price markdown optimization project, features such as rolling demand, sell-through rate, and days-to-expiry were precomputed offline for batch model training.
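
The online/offline split can be illustrated with a toy in-memory store. This is not the SageMaker Feature Store API; the class, its methods, and the entity IDs are hypothetical, and the point is only that one feature definition feeds both stores.

```python
def sell_through_rate(units_sold, units_received):
    """Single feature definition shared by training and serving."""
    return units_sold / units_received if units_received else 0.0


class ToyFeatureStore:
    def __init__(self):
        self.offline = []  # append-only history, used to build training sets
        self.online = {}   # latest value per entity, used at serving time

    def ingest(self, entity_id, event_time, features):
        self.offline.append({"entity_id": entity_id, "event_time": event_time, **features})
        current = self.online.get(entity_id)
        if current is None or event_time >= current["event_time"]:
            self.online[entity_id] = {"event_time": event_time, **features}

    def get_online(self, entity_id):
        return self.online[entity_id]

    def training_rows(self, as_of):
        # Only rows known at `as_of`, which avoids leaking future values into training.
        return [r for r in self.offline if r["event_time"] <= as_of]
```

Filtering by `as_of` is the part that protects training-serving consistency: the training set sees only what serving would have seen at that time.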

Follow-up: What tradeoffs did you consider in that implementation?

What is concept drift and how do you handle it? (hard)

Type
scenario
Topic
is-concept-drift-and-how-do-you-handle-it
Frequency
common
Tags
mlops, what, concept, drift, and, how
Answer

Concept drift is when the statistical relationship between features and target changes over time.

Explanation

Detect it primarily by monitoring model performance on recent labeled windows. PSI or a KS test on feature distributions detects input (data) drift, which often accompanies concept drift but does not prove it. Handle it by retraining periodically, training on a sliding window of recent data, or building an ensemble that weights recent data more heavily.
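
Performance monitoring on a recent window can be sketched as follows. The class name, window size, and allowed drop are illustrative assumptions, and the sketch presumes labels eventually arrive for each prediction.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Flags possible concept drift when recent accuracy falls below a baseline."""

    def __init__(self, baseline_accuracy, window=500, max_drop=0.05):
        self.baseline = baseline_accuracy
        self.max_drop = max_drop
        self.outcomes = deque(maxlen=window)  # keeps only the most recent results

    def record(self, prediction, label):
        self.outcomes.append(prediction == label)

    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def drifted(self):
        acc = self.accuracy()
        return acc is not None and acc < self.baseline - self.max_drop
```

A drop flagged this way is a trigger for investigation or retraining, not proof of drift on its own, since label noise or a small window can produce the same signal.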

Follow-up: Can you give a production example?

What metrics evaluate document extraction accuracy? (medium)

Type
conceptual
Topic
metrics-do-you-use-to-evaluate-extraction-accuracy-in-inte
Frequency
common
Tags
mlops, what, metrics, you, use, evaluate
Answer

Field-level exact match, partial match or token F1 for variable-format fields, overall document accuracy, null rate, and false positive rate.

Explanation

Field-level exact match checks whether each extracted value equals the ground truth. Partial match or token F1 suits variable-format fields such as dates. Overall document accuracy counts a document as correct only when all of its fields are correct. Null rate is the share of fields the model failed to extract, and false positive rate is the share of absent fields for which the model still extracted a value. Ground truth: 200 representative documents manually annotated by a domain expert.
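
A minimal sketch of these metrics, assuming each document is a dict from field name to value with `None` meaning "absent", and assuming non-empty inputs. The function name and return keys are illustrative.

```python
def extraction_metrics(predictions, ground_truth):
    """Compute field- and document-level metrics over aligned lists of dicts."""
    total = correct = docs_all_correct = 0
    missed = present = false_pos = absent = 0
    for pred, truth in zip(predictions, ground_truth):
        doc_ok = True
        for field_name, expected in truth.items():
            got = pred.get(field_name)
            total += 1
            if got == expected:
                correct += 1
            else:
                doc_ok = False
            if expected is None:
                absent += 1
                false_pos += got is not None  # extracted a value for an absent field
            else:
                present += 1
                missed += got is None         # failed to extract a present field
        docs_all_correct += doc_ok
    return {
        "field_exact_match": correct / total,
        "document_accuracy": docs_all_correct / len(ground_truth),
        "null_rate": missed / present if present else 0.0,
        "false_positive_rate": false_pos / absent if absent else 0.0,
    }
```

Document accuracy is much stricter than field accuracy: a single wrong field fails the whole document, so the two numbers can diverge sharply on long forms.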

Follow-up: How would you monitor this in production?

How should 98% extraction accuracy be measured? (medium)

Type
conceptual
Topic
claim-98-extraction-accuracy-how-exactly-was-that-measured
Frequency
common
Tags
mlops, you, claim, extraction, accuracy, how
Answer

Measured on a held-out test set of ~100 documents not seen during prompt development.

Explanation

The ground truth was annotated by a domain expert. The metric was field-level exact match after normalizing dates and numeric formats, so 98% means that 98% of all extracted fields, pooled across documents, matched the ground truth. Most errors were edge cases such as unusual swap structures or corrupted PDFs.
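
Normalization before exact match might look like the sketch below. The list of accepted date formats and the treatment of commas as thousands separators are assumptions; a real pipeline would need rules matched to its documents and locale.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %B %Y", "%B %d, %Y")  # assumed formats


def normalize(value):
    """Map equivalent date and number spellings to one canonical string."""
    if value is None:
        return None
    text = str(value).strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            pass
    try:
        # Treat commas as thousands separators and drop trailing zeros.
        return format(Decimal(text.replace(",", "")).normalize(), "f")
    except InvalidOperation:
        return text.lower()


def fields_match(predicted, expected):
    return normalize(predicted) == normalize(expected)
```

Both sides of the comparison go through the same function, so the canonical form only needs to be consistent, not pretty.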

Follow-up: How would you monitor this in production?

BLEU, ROUGE, BERTScore: which is best for LLM output evaluation? (medium)

Type
conceptual
Topic
bleu-rouge-bertscore-which-is-best-for-llm-output-evaluati
Frequency
common
Tags
mlops, bleu, rouge, bertscore, which, best
Answer

None is sufficient alone: BLEU and ROUGE measure surface n-gram overlap, while BERTScore captures semantic similarity.

Explanation

BLEU measures n-gram precision (designed for translation, poor for open-ended text). ROUGE measures n-gram recall (designed for summarization). Both are surface-level and miss semantic equivalence. BERTScore compares contextual embeddings, which makes it better for paraphrase-heavy LLM output. For production, none is sufficient alone; combine them with LLM-as-judge and task-specific metrics.
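
The surface-level limitation is easy to see with unigram overlap. The sketch below is a simplified stand-in, not full BLEU or ROUGE: it computes only clipped unigram precision (BLEU-1 style, without the brevity penalty) and unigram recall (ROUGE-1 style), using whitespace tokenization.

```python
from collections import Counter


def unigram_overlap(candidate: str, reference: str):
    """Return (precision, recall) of clipped unigram overlap."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    precision = overlap / len(cand) if cand else 0.0  # BLEU-1 style
    recall = overlap / len(ref) if ref else 0.0       # ROUGE-1 style
    return precision, recall
```

For the reference "the payment was declined" and the paraphrase "the card transaction got rejected", this returns (0.2, 0.25) even though the meaning is nearly identical, which is the gap embedding-based metrics and LLM judges are meant to close.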

Follow-up: Can you give a production example?

How do you evaluate hallucination in an LLM extraction pipeline? (medium)

Type
conceptual
Topic
do-you-evaluate-hallucination-in-an-llm-extraction-pipelin
Frequency
common
Tags
mlops, how, you, evaluate, hallucination, llm
Answer

Combine grounding checks, contradiction detection, an LLM judge, and tracking of null versus hallucinated values.

Explanation

(1) Grounding check: verify that extracted values appear in, or are entailed by, the source document. (2) Contradiction detection: run an NLI model to check whether the extraction contradicts the source. (3) LLM-as-judge: a second LLM verifies whether the extraction is supported by the document. (4) Track the null-vs-hallucinated rate: the model should return null for absent fields, not fabricate values.
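
The simplest form of the grounding check in (1) is a literal lookup after light canonicalization. The sketch below covers only the "appears in" case; entailment would need an NLI model or an LLM judge. Function names and the canonicalization rule are illustrative assumptions.

```python
import re


def _canon(text: str) -> str:
    # Lowercase and collapse punctuation/whitespace so "5,000,000" matches "5 000 000".
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def ungrounded_fields(extraction: dict, source_text: str) -> list:
    """Return names of non-null fields whose value is not found in the source."""
    haystack = f" {_canon(source_text)} "
    return [
        name
        for name, value in extraction.items()
        if value is not None and f" {_canon(str(value))} " not in haystack
    ]
```

Fields returned by this check are candidates for review rather than confirmed hallucinations, because a correct value may be paraphrased or reformatted in the source.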

Follow-up: How would you monitor this in production?

What is LLM-as-a-judge? What are its risks? (medium)

Type
conceptual
Topic
is-llm-as-a-judge-what-are-its-risks
Frequency
common
Tags
mlops, what, llm, judge, are
Answer

LLM-as-judge uses a capable LLM to score other LLM outputs — cheaper than human annotation at scale.

Explanation

Risks: self-preference bias (a judge favors outputs in its own style), position bias (the judge favors a response because of where it appears, often the first), and verbosity bias (longer answers are rewarded). Mitigations: use a different model as judge, swap option order and average the scores, define explicit rubrics, and calibrate against human labels.
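
The swap-and-average mitigation can be sketched as a wrapper around any pairwise judge. Here `judge` is a hypothetical callable that returns the probability that the first response shown is better; the real call to an LLM is left out.

```python
from typing import Callable


def debiased_preference(
    judge: Callable[[str, str, str], float], question: str, a: str, b: str
) -> float:
    """Average the judge's preference for `a` over both presentation orders."""
    a_first = judge(question, a, b)  # P(a better) with a shown first
    b_first = judge(question, b, a)  # P(b better) with b shown first
    return (a_first + (1.0 - b_first)) / 2
```

A judge that always favors whichever response comes first, regardless of content, ends up at 0.5 after averaging, so pure position bias cancels out.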

Follow-up: Can you give a production example?

How do you build an LLM evaluation component? (medium)

Type
scenario
Topic
how-do-you-build-an-llm-evaluation-component
Frequency
common
Tags
mlops, how, did, you, build, the
Answer

Combine automated metrics, an LLM judge with a fixed rubric, A/B prompt testing, audit logging, and an error-analysis dashboard.

Explanation

(1) Automated metrics: exact match, JSON schema compliance, field coverage. (2) LLM-as-judge with a fixed rubric. (3) A/B prompt testing: route X% of traffic to the new prompt version and compare metrics. (4) Audit logging: every LLM input and output is stored in S3 with metadata. (5) A dashboard for error analysis by document type, model, and prompt version.
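
The automated metrics in (1) can be sketched with the standard library alone. The required fields and their types below are hypothetical; a production system would more likely use a full JSON Schema validator.

```python
import json

REQUIRED_FIELDS = {"fund_name": str, "nav": (int, float)}  # hypothetical schema


def check_output(raw: str, required=REQUIRED_FIELDS) -> dict:
    """Score one raw LLM response for JSON validity, schema compliance, and coverage."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"valid_json": False, "schema_ok": False, "coverage": 0.0}
    if not isinstance(data, dict):
        return {"valid_json": True, "schema_ok": False, "coverage": 0.0}
    filled = [f for f in required if data.get(f) is not None]
    typed = all(isinstance(data[f], required[f]) for f in filled)
    return {
        "valid_json": True,
        "schema_ok": typed and len(filled) == len(required),
        "coverage": len(filled) / len(required),
    }
```

Aggregating these per-response results by prompt version gives the kind of breakdown the dashboard in (5) would display.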

Follow-up: What tradeoffs did you consider in that implementation?

Offline vs online evaluation for LLM systems? (medium)

Type
conceptual
Topic
offline-vs-online-evaluation-for-llm-systems
Frequency
common
Tags
mlops, offline, online, evaluation, for, llm
Answer

Offline evaluation runs on a static held-out dataset before deployment; online evaluation measures the system in production. Both are needed.

Explanation

Offline: evaluate on static held-out dataset before deployment — controlled, reproducible, cheap. Online: evaluate in production — shadow mode (run new model without serving output), A/B testing (split traffic), monitoring (track user feedback, error rates). Both needed: offline gates deployment, online catches distribution shift and real-world edge cases.

Follow-up: When would you choose one approach over the other?

What is a regression test suite for LLM apps? How do you prevent regressions? (medium)

Type
scenario
Topic
is-a-regression-test-suite-for-llm-apps-how-do-you-prevent
Frequency
common
Tags
mlops, what, regression, test, suite, for
Answer

A curated set of (input, expected output) pairs that must pass before any prompt or model change is deployed.

Explanation

Store the cases in a version-controlled eval dataset and run them on every PR via CI/CD, flagging any drop in pass rate beyond a set threshold. Include happy-path examples, previously failed edge cases, and adversarial inputs. Set temperature=0 for evaluation to reduce run-to-run variation, or use an LLM judge with a tolerance threshold for non-deterministic outputs.
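
A minimal runner for such a suite might look like this. The case format, the `generate` callable, and the default threshold are assumptions; exact-match comparison is used here, and an LLM judge could replace the `!=` check for free-form outputs.

```python
def run_regression_suite(cases, generate, min_pass_rate=0.95):
    """cases: list of {"input": ..., "expected": ...}; generate: the system under test."""
    failures = [c for c in cases if generate(c["input"]) != c["expected"]]
    pass_rate = 1 - len(failures) / len(cases)
    return {
        "pass_rate": pass_rate,
        "ok": pass_rate >= min_pass_rate,  # CI fails the PR when this is False
        "failures": failures,              # kept for error analysis
    }
```

Returning the failing cases, not just the rate, makes it easy to add them to the suite as permanent edge-case examples once fixed.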

Follow-up: Can you give a production example?

How do you validate automation and accuracy in document processing? (medium)

Type
scenario
Topic
how-do-you-validate-automation-and-accuracy-in-document-pr
Frequency
common
Tags
mlops, how, did, you, validate, accuracy
Answer

Automation rate: % of fund filings processed end-to-end without human intervention — (automated / total) over a quarter.

Explanation

Accuracy was spot-checked by comparing extracted financial metadata against the source PDFs on a stratified sample. Errors were escalated to a review queue. The improvement cycle was to analyze error patterns, refine the agent prompts, and re-run on the failure cases.
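
The automation rate and the stratified spot-check sample can be sketched as below. The record fields (`doc_type`, `automated`) and the per-stratum sample size are hypothetical.

```python
import random
from collections import defaultdict


def automation_rate(records):
    """Share of records processed end-to-end without human intervention."""
    return sum(r["automated"] for r in records) / len(records)


def stratified_sample(records, key, per_stratum, seed=0):
    """Draw up to `per_stratum` records from each group defined by `key`."""
    rng = random.Random(seed)  # fixed seed keeps the audit sample reproducible
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    sample = []
    for _, items in sorted(groups.items()):
        sample.extend(rng.sample(items, min(per_stratum, len(items))))
    return sample
```

Stratifying by document type keeps rare filing types in the audit sample, which a simple random draw could easily miss.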

Follow-up: What tradeoffs did you consider in that implementation?

How do you A/B test prompts in a production system? (hard)

Type
conceptual
Topic
do-you-a-b-test-prompts-in-a-production-system
Frequency
common
Tags
mlops, how, you, test, prompts, production
Answer

Route traffic: X% to prompt A, (100-X)% to prompt B via feature flags or a routing layer.

Explanation

Log all inputs and outputs with a variant tag. After N samples, compare metrics (exact match, LLM-judge score, latency, cost) and apply a statistical significance test before declaring a winner. Run shadow mode first: both prompts execute, only A is served, and the outputs are compared offline.
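
Routing and the significance check can be sketched with the standard library. The hash-based bucketing and the two-proportion z-test are one reasonable choice among several, picked here for illustration; the function names and request IDs are hypothetical.

```python
import hashlib
import math


def assign_variant(request_id: str, percent_a: int = 50) -> str:
    """Deterministically route a request: the same ID always gets the same prompt."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "A" if bucket < percent_a else "B"


def two_proportion_p_value(success_a, n_a, success_b, n_b):
    """Two-sided p-value for a difference in success rates (pooled z-test)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # both variants succeeded (or failed) on every sample
    z = (p_a - p_b) / se
    return math.erfc(abs(z) / math.sqrt(2))
```

Hashing a stable ID rather than drawing a random number keeps a user or document on one variant across retries, which avoids contaminating the comparison.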

Follow-up: Can you give a production example?