MLOps Interview Prep

What is MLOps?

Difficulty: medium
Type: conceptual
Topic: mlops
Frequency: common
Tags: mlops

Answer

MLOps is the practice of reliably building, deploying, and monitoring ML systems.

Explanation

It combines software engineering, data engineering, model training, deployment, monitoring, governance, and retraining workflows.

Follow-up: How is MLOps different from DevOps?

What is data drift?

Difficulty: medium
Type: conceptual
Topic: data-drift
Frequency: common
Tags: data, drift

Answer

Data drift occurs when the distribution of production input data shifts away from the distribution seen in training.

Explanation

Drift can reduce model performance. Monitoring feature distributions and business outcomes helps detect when retraining may be needed.
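
One way to monitor a feature distribution is the Population Stability Index (PSI). The sketch below is a minimal standard-library version, assuming a single numeric feature, equal-width buckets derived from the training sample, and a small floor to avoid taking the log of zero. The function name and the bucket count are illustrative choices, and any alert threshold should be tuned per feature.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a training and a production sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant feature

    def bucket_fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range production values into the edge buckets.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Small floor (an assumption) avoids log(0) for empty buckets.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bucket_fractions(expected), bucket_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Identical samples give a PSI of zero, and the value grows as the production histogram moves away from the training histogram.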

Follow-up: What is concept drift?

What is a model registry?

Difficulty: hard
Type: conceptual
Topic: model-registry
Frequency: common
Tags: model, registry

Answer

A model registry tracks model versions, metadata, stages, and artifacts.

Explanation

It helps teams manage promotion from experimentation to staging and production with reproducibility and auditability.
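
To make the idea concrete, here is a toy in-memory registry. It is only a sketch under stated assumptions: the class names, the stage names, and the rule that at most one version is in production are all hypothetical, and it does not mirror the API of any real registry product.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical stage names for this sketch.
STAGES = ("none", "staging", "production", "archived")


@dataclass
class ModelVersion:
    name: str
    version: int
    artifact_uri: str
    metadata: Dict[str, object] = field(default_factory=dict)
    stage: str = "none"


class InMemoryRegistry:
    """Toy registry: tracks versions, metadata, stages, and artifact locations."""

    def __init__(self) -> None:
        self._models: Dict[str, List[ModelVersion]] = {}

    def register(self, name: str, artifact_uri: str,
                 metadata: Dict[str, object]) -> ModelVersion:
        versions = self._models.setdefault(name, [])
        entry = ModelVersion(name, len(versions) + 1, artifact_uri, dict(metadata))
        versions.append(entry)
        return entry

    def promote(self, name: str, version: int, stage: str) -> ModelVersion:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        versions = self._models[name]
        if stage == "production":
            # Assumption: keep at most one production version, archive the old one.
            for entry in versions:
                if entry.stage == "production":
                    entry.stage = "archived"
        target = versions[version - 1]
        target.stage = stage
        return target

    def current(self, name: str, stage: str = "production") -> Optional[ModelVersion]:
        return next((e for e in self._models.get(name, []) if e.stage == stage), None)
```

Because every promotion goes through `promote`, the stage history of a model can be reconstructed and audited, which is the property the explanation above describes.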

Follow-up: What metadata should a registry store?

How do you monitor a model in production?

Difficulty: medium
Type: conceptual
Topic: monitor-model-production
Frequency: common
Tags: monitor, model, production

Answer

Track performance, drift, latency, errors, data quality, and business metrics.

Explanation

Monitoring must cover both software health and model behavior because a service can be available while predictions become poor.

Follow-up: What if labels arrive late?

What is CI/CD for ML?

Difficulty: medium
Type: conceptual
Topic: ci-cd-ml
Frequency: common
Tags: ci, cd, ml

Answer

It automates testing and deployment of code, data pipelines, and model artifacts.

Explanation

ML CI/CD should validate data schemas, feature logic, model quality, reproducibility, and serving behavior before release.
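
Two of these checks, schema validation and a model quality gate, can be sketched as plain functions that a CI job would call. This is a minimal illustration, assuming rows arrive as dictionaries and quality is a single higher-is-better metric; the function names and the tolerance value are hypothetical.

```python
from typing import Dict, List


def schema_errors(rows: List[dict], schema: Dict[str, type]) -> List[str]:
    """Return one message per missing column or wrongly typed value."""
    errors = []
    for i, row in enumerate(rows):
        for column, expected in schema.items():
            if column not in row:
                errors.append(f"row {i}: missing column '{column}'")
            elif not isinstance(row[column], expected):
                errors.append(f"row {i}: '{column}' is not {expected.__name__}")
    return errors


def passes_quality_gate(candidate: float, baseline: float,
                        tolerance: float = 0.01) -> bool:
    """Allow release only if the candidate is not worse than baseline by more than tolerance."""
    return candidate >= baseline - tolerance
```

A pipeline step could fail the build whenever `schema_errors` returns a non-empty list or `passes_quality_gate` returns `False`.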

Follow-up: What tests are unique to ML pipelines?

How does a feature store work in production ML?

Difficulty: hard
Type: scenario
Topic: does-a-feature-store-work-how-did-you-use-it-in-sagemaker
Frequency: common
Tags: mlops, how, does, feature, store, work

Answer

A feature store centralizes computed features for reuse across models, ensuring training-serving consistency.

Explanation

Because training and serving read features from the same definitions, the two paths stay consistent. SageMaker Feature Store, for example, provides an online store for low-latency serving and an offline store backed by S3 for training. In a price markdown optimization project, features such as rolling demand, sell-through rate, and days-to-expiry were precomputed offline for batch model training.
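
The online/offline split can be illustrated with a toy store. The sketch below is not the SageMaker Feature Store API: the class, its methods, and the assumption that writes arrive in timestamp order per entity are all hypothetical. It shows the two read paths: the latest value for serving, and an as-of lookup for building training sets without leaking future data.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, Optional


class TinyFeatureStore:
    """Toy store with an online view (latest values) and an offline history."""

    def __init__(self) -> None:
        self._timestamps: Dict[str, List[int]] = defaultdict(list)
        self._rows: Dict[str, List[dict]] = defaultdict(list)
        self._online: Dict[str, dict] = {}

    def write(self, entity_id: str, ts: int, features: dict) -> None:
        # Assumption: writes for an entity arrive in non-decreasing ts order.
        self._timestamps[entity_id].append(ts)
        self._rows[entity_id].append(dict(features))
        self._online[entity_id] = dict(features)

    def get_online(self, entity_id: str) -> dict:
        """Serving path: latest feature values for the entity."""
        return self._online[entity_id]

    def get_as_of(self, entity_id: str, ts: int) -> Optional[dict]:
        """Training path: the values that were known at time ts."""
        idx = bisect.bisect_right(self._timestamps[entity_id], ts)
        return self._rows[entity_id][idx - 1] if idx else None
```

Both paths read rows produced by the same `write` call, which is the consistency property a feature store is meant to provide.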

Follow-up: What tradeoffs did you consider in that implementation?

What is concept drift and how do you handle it?

Difficulty: hard
Type: scenario
Topic: is-concept-drift-and-how-do-you-handle-it
Frequency: common
Tags: mlops, what, concept, drift, and, how

Answer

Concept drift is when the statistical relationship between features and target changes over time.

Explanation

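Concept drift differs from data drift: the inputs may look unchanged while the mapping from features to target shifts. The most direct signal is model performance monitored over recent windows, once labels are available. PSI or a KS test on feature distributions detects input (data) drift, which often accompanies concept drift but does not prove it. Handle concept drift by retraining periodically, training on a sliding window of recent data, or building an ensemble that weights recent data more heavily.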
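
Performance monitoring on a recent window can be sketched with a fixed-size buffer of labeled outcomes. This is a minimal illustration that assumes labels eventually arrive for each prediction; the class name, the baseline, and the allowed drop are hypothetical parameters.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Flags degradation when accuracy over the last `window` labels drops too far."""

    def __init__(self, window: int, baseline: float, max_drop: float) -> None:
        self._hits = deque(maxlen=window)
        self._baseline = baseline
        self._max_drop = max_drop

    def update(self, prediction, label) -> bool:
        """Record one labeled outcome; return True when the window looks degraded."""
        self._hits.append(prediction == label)
        if len(self._hits) < self._hits.maxlen:
            return False  # not enough recent labels yet
        accuracy = sum(self._hits) / len(self._hits)
        return accuracy < self._baseline - self._max_drop
```

An alert from a monitor like this is a prompt to investigate and possibly retrain, not proof of concept drift on its own, since a data quality bug produces the same symptom.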

Follow-up: Can you give a production example?

What metrics evaluate document extraction accuracy?

Difficulty: medium
Type: conceptual
Topic: metrics-do-you-use-to-evaluate-extraction-accuracy-in-inte
Frequency: common
Tags: mlops, what, metrics, you, use, evaluate

Answer

Use field-level exact match, partial match or token F1, overall document accuracy, null rate, and false positive rate, all measured against expert-annotated ground truth.

Explanation

The core metrics are: field-level exact match (does the extracted value match ground truth?); partial match or token F1 for variable-format fields such as dates; overall document accuracy (all fields correct); null rate (fields the model failed to extract); and false positive rate (the model extracted a value for an absent field). Ground truth came from 200 representative documents manually annotated by a domain expert.
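
These metrics can be computed in a few lines. The sketch below makes several assumptions that are not in the original: each document is a dictionary from field name to string, `None` marks an absent field, token F1 splits on whitespace, and exact match and null rate are taken over fields that exist in ground truth.

```python
from collections import Counter
from typing import Dict, List, Optional

Doc = Dict[str, Optional[str]]


def token_f1(predicted: str, truth: str) -> float:
    """Whitespace-token F1, useful for partially matching variable-format fields."""
    p, t = predicted.split(), truth.split()
    overlap = sum((Counter(p) & Counter(t)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(t)
    return 2 * precision * recall / (precision + recall)


def extraction_metrics(predictions: List[Doc], truths: List[Doc]) -> Dict[str, float]:
    present = exact = nulls = absent = false_pos = docs_ok = 0
    for pred, truth in zip(predictions, truths):
        # A document counts as correct only if every field matches.
        docs_ok += all(pred.get(f) == v for f, v in truth.items())
        for name, expected in truth.items():
            got = pred.get(name)
            if expected is None:
                absent += 1
                false_pos += got is not None  # value invented for an absent field
            else:
                present += 1
                exact += got == expected
                nulls += got is None  # model failed to extract
    return {
        "field_exact_match": exact / max(present, 1),
        "null_rate": nulls / max(present, 1),
        "false_positive_rate": false_pos / max(absent, 1),
        "document_accuracy": docs_ok / max(len(truths), 1),
    }
```

Reporting the null rate and the false positive rate separately matters because they call for different fixes: the first points at missed extractions, the second at fabricated ones.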

Follow-up: How would you monitor this in production?

How should 98% extraction accuracy be measured?

Difficulty: medium
Type: conceptual
Topic: claim-98-extraction-accuracy-how-exactly-was-that-measured
Frequency: common
Tags: mlops, you, claim, extraction, accuracy, how

Answer

Measured on a held-out test set of ~100 documents not seen during prompt development.

Explanation

The test set was held out: roughly 100 documents not seen during prompt development, with ground truth annotated by a domain expert. The metric was field-level exact match after normalizing dates and numeric formats, so 98% means that 98% of extracted fields matched ground truth across all documents and fields. Errors were mostly edge cases, such as unusual swap structures or corrupted PDFs.
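
The normalization step decides what counts as a match, so it helps to pin it down. The sketch below is one possible normalizer, not the one used in the project: the accepted date formats, the day-first reading of slash dates, the stripped currency symbols, and the lowercasing of free text are all assumptions.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation
from typing import List, Tuple

# Assumed input formats; slash dates are read as day/month/year.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %b %Y", "%B %d, %Y")


def normalize(value: str) -> str:
    """Map equivalent date and number spellings to one canonical string."""
    text = value.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    try:
        number = Decimal(text.replace(",", "").lstrip("$€£"))
        return format(number.normalize(), "f")
    except InvalidOperation:
        return text.lower()  # assumption: free text compares case-insensitively


def field_accuracy(pairs: List[Tuple[str, str]]) -> float:
    """Share of (extracted, ground truth) pairs that match after normalization."""
    matches = sum(normalize(got) == normalize(expected) for got, expected in pairs)
    return matches / len(pairs)
```

With roughly 100 documents, the number of evaluated fields determines how much a figure like 98% can move from a handful of errors, so it is worth reporting the field count next to the percentage.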

Follow-up: How would you monitor this in production?

BLEU, ROUGE, BERTScore: which is best for LLM output evaluation?

Difficulty: medium
Type: conceptual
Topic: bleu-rouge-bertscore-which-is-best-for-llm-output-evaluati
Frequency: common
Tags: mlops, bleu, rouge, bertscore, which, best

Answer

None is sufficient alone. BLEU and ROUGE are surface-level n-gram metrics, BERTScore captures semantic similarity better, and production evaluation should add LLM-as-judge and task-specific metrics.

Explanation

BLEU measures n-gram precision (designed for translation, poor for open-ended text). ROUGE is a recall-oriented n-gram overlap metric (designed for summarization). Both are surface-level and miss semantic equivalence. BERTScore compares contextual embeddings, which makes it better suited to paraphrase-heavy LLM output. For production, none is sufficient alone: combine them with LLM-as-judge and task-specific metrics.
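
The weakness of surface metrics is easy to see with a stripped-down overlap score. The sketch below computes clipped n-gram precision (the core of BLEU) and n-gram recall (the core of ROUGE-N). It is a simplification: full BLEU also combines several n-gram orders and applies a brevity penalty, and the function names here are illustrative.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(candidate: str, reference: str, n: int = 1) -> Tuple[float, float]:
    """Return (precision, recall) of candidate n-grams against the reference."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # clipped counts
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall


# A faithful paraphrase shares only one word with the reference.
precision, recall = ngram_overlap(
    "you must pay by friday",
    "the payment is due on friday",
)
```

Here the candidate means the same thing as the reference but scores 1 of 5 on precision and 1 of 6 on recall, which is the gap that embedding-based or judge-based metrics are meant to close.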

Follow-up: Can you give a production example?

How do you evaluate hallucination in an LLM extraction pipeline?

Difficulty: medium
Type: conceptual
Topic: do-you-evaluate-hallucination-in-an-llm-extraction-pipelin
Frequency: common
Tags: mlops, how, you, evaluate, hallucination, llm

Answer

Combine grounding checks, contradiction detection, an LLM-as-judge pass, and tracking of null versus hallucinated fields.

Explanation

(1) Grounding check: verify that extracted values appear in, or are entailed by, the source document.
(2) Contradiction detection: run an NLI model to check whether the extraction contradicts the source.
(3) LLM-as-judge: a second LLM verifies whether the extraction is supported by the document.
(4) Null-vs-hallucinated rate: the model should return null for absent fields rather than fabricate values.
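
The simplest form of the grounding check is a normalized substring test. The sketch below covers only literal appearance in the source, not entailment, which would need an NLI model or a judge. The labels and the normalization (lowercasing and whitespace collapsing) are assumptions.

```python
import re
from typing import Dict, Optional


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def grounding_report(extraction: Dict[str, Optional[str]], source: str) -> Dict[str, str]:
    """Label each field as 'grounded', 'unsupported', or 'null'."""
    src = _norm(source)
    report = {}
    for name, value in extraction.items():
        if value is None:
            report[name] = "null"  # model declined to extract
        elif _norm(value) in src:
            report[name] = "grounded"  # value literally appears in the source
        else:
            report[name] = "unsupported"  # candidate hallucination, needs review
    return report
```

Substring matching can pass short values by accident (a single digit appears almost anywhere), so fields flagged "grounded" are not guaranteed correct, while fields flagged "unsupported" are good candidates for the NLI or judge step.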

Follow-up: How would you monitor this in production?

What is LLM-as-a-judge? What are its risks?

Difficulty: medium
Type: conceptual
Topic: is-llm-as-a-judge-what-are-its-risks
Frequency: common
Tags: mlops, what, llm, judge, are

Answer

LLM-as-judge uses a capable LLM to score other LLM outputs, which is cheaper than human annotation at scale.

Explanation

Risks: self-preference bias (models favor their own outputs and style), position bias (the judge favors a response because of where it appears, often the first), and verbosity bias (longer answers are rewarded). Mitigations: use a different model as judge, swap the option order and average the scores, define explicit rubrics, and calibrate against human labels.
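
The order-swapping mitigation can be sketched independently of any particular model. In the example below, `Judge` is a hypothetical callable that takes a prompt and two responses and returns a score for each in the order shown; the real call to an LLM is left out on purpose.

```python
from typing import Callable, Tuple

# Hypothetical judge signature: (prompt, first, second) -> (score_first, score_second)
Judge = Callable[[str, str, str], Tuple[float, float]]


def debiased_scores(judge: Judge, prompt: str, a: str, b: str) -> Tuple[float, float]:
    """Score both orderings and average, so position effects cancel out."""
    a_first, b_second = judge(prompt, a, b)
    b_first, a_second = judge(prompt, b, a)
    return (a_first + a_second) / 2, (b_first + b_second) / 2


def position_biased_judge(prompt: str, first: str, second: str) -> Tuple[float, float]:
    """Stand-in judge that always favors whichever response is shown first."""
    return 6.0, 5.0


score_a, score_b = debiased_scores(position_biased_judge, "q", "answer A", "answer B")
```

With a judge that only rewards position, the averaged scores come out equal, which shows that the swap removes the positional advantage. It doubles the number of judge calls, so it is a cost tradeoff.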

Follow-up: Can you give a production example?

How do you build an LLM evaluation component?

Difficulty: medium
Type: scenario
Topic: how-do-you-build-an-llm-evaluation-component
Frequency: common
Tags: mlops, how, did, you, build, the

Answer

Combine automated metrics, LLM-as-judge scoring with a fixed rubric, A/B prompt testing, audit logging, and an error-analysis dashboard.

Explanation

(1) Automated metrics: exact match, JSON schema compliance, field coverage.
(2) LLM-as-judge with a fixed rubric.
(3) A/B prompt testing: route X% of traffic to the new prompt version and compare metrics.
(4) Audit logging: every LLM input and output is stored in S3 with metadata.
(5) Dashboard for error analysis by document type, model, and prompt version.
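
The automated metrics in step (1) need no model calls. Below is a minimal sketch of JSON schema compliance and field coverage using only the standard library. The schema format (field name mapped to an accepted Python type) and the field names in the usage note are hypothetical, and a real pipeline might use a dedicated schema validator instead.

```python
import json
from typing import Dict


def check_output(raw: str, schema: Dict[str, object]) -> Dict[str, object]:
    """Check that raw LLM output is a JSON object with the expected fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict):
        return {"is_json_object": False, "schema_ok": False, "field_coverage": 0.0}
    # A field counts as filled when it is present and not null.
    filled = [k for k in schema if data.get(k) is not None]
    schema_ok = all(k in data for k in schema) and all(
        isinstance(data[k], schema[k]) for k in filled
    )
    return {
        "is_json_object": True,
        "schema_ok": schema_ok,
        "field_coverage": len(filled) / max(len(schema), 1),
    }
```

For example, with a schema such as `{"fund_name": str, "nav": (int, float), "isin": str}`, an output that sets `isin` to null still passes the schema check but reports coverage of two thirds, so the two signals can be tracked separately on the dashboard.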

Follow-up: What tradeoffs did you consider in that implementation?

Offline vs online evaluation for LLM systems?

Difficulty: medium
Type: conceptual
Topic: offline-vs-online-evaluation-for-llm-systems
Frequency: common
Tags: mlops, offline, online, evaluation, for, llm

Answer

Offline evaluation runs on a static held-out dataset before deployment; online evaluation measures the system in production. Both are needed.

Explanation

Offline: evaluate on a static held-out dataset before deployment. It is controlled, reproducible, and cheap.
Online: evaluate in production using shadow mode (run the new model without serving its output), A/B testing (split traffic), and monitoring (track user feedback and error rates).
Both are needed: offline evaluation gates deployment, while online evaluation catches distribution shift and real-world edge cases.

Follow-up: When would you choose one approach over the other?

What is a regression test suite for LLM apps? How do you prevent regressions?

Difficulty: medium
Type: scenario
Topic: is-a-regression-test-suite-for-llm-apps-how-do-you-prevent
Frequency: common
Tags: mlops, what, regression, test, suite, for

Answer

A curated set of (input, expected output) pairs that must pass before any prompt or model change is deployed.

Explanation

Store the cases in a version-controlled eval dataset and run them on every PR via CI/CD. Flag any drop in pass rate that exceeds a set threshold. Include happy-path examples, previously failed edge cases, and adversarial inputs. Set temperature=0 for eval runs to reduce variance (this does not guarantee identical outputs on every provider), or use an LLM judge with a tolerance threshold for non-deterministic outputs.
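
The pass-rate gate can be sketched as two small functions that a CI job would call. This is an illustration under stated assumptions: `run` stands in for the prompt-plus-model call, comparison is exact string match after trimming, and the allowed drop of 0.02 is a placeholder threshold.

```python
from typing import Callable, List, Tuple


def pass_rate(cases: List[Tuple[str, str]], run: Callable[[str], str]) -> float:
    """Share of (input, expected output) cases whose output matches exactly."""
    passed = sum(
        run(case_input).strip() == expected.strip() for case_input, expected in cases
    )
    return passed / len(cases)


def should_block_release(current: float, baseline: float, max_drop: float = 0.02) -> bool:
    """Block when the pass rate fell by more than the allowed drop."""
    return baseline - current > max_drop
```

In CI, the baseline would be the pass rate recorded for the currently deployed prompt and model, and the job would exit with a failure status when `should_block_release` returns `True`.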

Follow-up: Can you give a production example?

How do you validate automation and accuracy in document processing?

Difficulty: medium
Type: scenario
Topic: how-do-you-validate-automation-and-accuracy-in-document-pr
Frequency: common
Tags: mlops, how, did, you, validate, accuracy

Answer

Measure automation rate as the share of fund filings processed end-to-end without human intervention, and measure accuracy by spot-checking a stratified sample against the source PDFs.

Explanation

Automation rate: the percentage of fund filings processed end-to-end without human intervention, computed as automated / total over a quarter. Accuracy: extracted financial metadata was spot-checked against source PDFs on a stratified sample, and errors were escalated to a review queue. Improvement cycle: analyze error patterns, refine agent prompts, and re-run on the failure cases.
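
Both measurements are straightforward to compute once each filing is logged as a record. The sketch below assumes a hypothetical record shape with an `automated` flag and a stratification key such as document type; the field names, the per-stratum sample size, and the fixed seed are illustrative choices.

```python
import random
from collections import defaultdict
from typing import Dict, List


def automation_rate(filings: List[dict]) -> float:
    """Share of filings processed end-to-end without human intervention."""
    return sum(bool(f["automated"]) for f in filings) / len(filings)


def stratified_sample(filings: List[dict], key: str, per_stratum: int,
                      seed: int = 0) -> List[dict]:
    """Draw up to per_stratum filings from each group for manual spot checks."""
    rng = random.Random(seed)  # fixed seed keeps the audit sample reproducible
    groups: Dict[object, List[dict]] = defaultdict(list)
    for filing in filings:
        groups[filing[key]].append(filing)
    sample = []
    for _, items in sorted(groups.items()):
        sample.extend(rng.sample(items, min(per_stratum, len(items))))
    return sample
```

Sampling per stratum rather than uniformly keeps rare document types in the spot check, which is usually where extraction errors concentrate.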

Follow-up: What tradeoffs did you consider in that implementation?

How do you A/B test prompts in a production system?

Difficulty: hard
Type: conceptual
Topic: do-you-a-b-test-prompts-in-a-production-system
Frequency: common
Tags: mlops, how, you, test, prompts, production

Answer

Split traffic between prompt variants through feature flags or a routing layer, log every call with its variant tag, and compare metrics with a statistical significance test before declaring a winner.

Explanation

(1) Run shadow mode first: both prompts execute, only A is served, and the outputs are compared offline.
(2) Route traffic: X% to prompt A and (100-X)% to prompt B via feature flags or a routing layer.
(3) Log all inputs and outputs with a variant tag.
(4) After N samples, compare metrics (exact match, LLM-judge score, latency, cost).
(5) Use statistical significance tests before declaring a winner.
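
Routing and the significance check can be sketched with the standard library. The example assumes a hash of the request ID decides the variant (so the same request always gets the same prompt) and that the compared metric is a pass/fail rate such as exact match, tested with a two-sided two-proportion z-test. The function names and the experiment label are hypothetical.

```python
import hashlib
import math


def assign_variant(request_id: str, percent_a: int, experiment: str = "prompt-test") -> str:
    """Deterministically route percent_a% of requests to prompt A, the rest to B."""
    digest = hashlib.sha256(f"{experiment}:{request_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return "A" if bucket < percent_a else "B"


def two_proportion_p_value(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in success rates (pooled z-test)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # both variants all-pass or all-fail: no evidence of a difference
    z = (p_a - p_b) / se
    return math.erfc(abs(z) / math.sqrt(2))
```

The z-test is a reasonable default for binary metrics with large samples. Continuous metrics such as latency or judge scores would need a different test, and checking results repeatedly before N is reached inflates the false positive rate.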

Follow-up: Can you give a production example?
