What is MLOps?medium
MLOps is the practice of reliably building, deploying, and monitoring ML systems.
It combines software engineering, data engineering, model training, deployment, monitoring, governance, and retraining workflows.
InterviewSkill
Production ML lifecycle concepts for deployment, monitoring, and reliability interviews.
MLOps is the practice of reliably building, deploying, and monitoring ML systems.
It combines software engineering, data engineering, model training, deployment, monitoring, governance, and retraining workflows.
Data drift happens when production input data changes from training data.
Drift can reduce model performance. Monitoring feature distributions and business outcomes helps detect when retraining may be needed.
A model registry tracks model versions, metadata, stages, and artifacts.
It helps teams manage promotion from experimentation to staging and production with reproducibility and auditability.
Track performance, drift, latency, errors, data quality, and business metrics.
Monitoring must cover both software health and model behavior because a service can be available while predictions become poor.
It automates testing and deployment of code, data pipelines, and model artifacts.
ML CI/CD should validate data schemas, feature logic, model quality, reproducibility, and serving behavior before release.
A feature store centralizes computed features for reuse across models, ensuring training-serving consistency.
A feature store centralizes computed features for reuse across models, ensuring training-serving consistency. SageMaker Feature Store has online (low-latency serving) and offline (S3-backed for training) stores. In price markdown optimization, features like rolling demand, sell-through rate, and days-to-expiry were precomputed offline for batch model training.
Concept drift is when the statistical relationship between features and target changes over time.
Concept drift is when the statistical relationship between features and target changes over time. Detect with PSI on feature distributions, model performance monitoring on recent windows, or KS test. Handle by retraining periodically, using a sliding window of recent data, or building an ensemble that weights recent data more.
Field-level exact match (does extracted value match ground truth?), partial match / token F1 for variable-format fields like dates, overall document accuracy (all fields correct), null rate (fields model failed to extrac
Field-level exact match (does extracted value match ground truth?), partial match / token F1 for variable-format fields like dates, overall document accuracy (all fields correct), null rate (fields model failed to extract), and false positive rate (model extracted something for an absent field). Ground truth: manually annotated 200 representative documents by a domain expert.
Measured on a held-out test set of ~100 documents not seen during prompt development.
Measured on a held-out test set of ~100 documents not seen during prompt development. Ground truth annotated by a domain expert. Metric: field-level exact match after normalizing dates and numeric formats. 98% = 98% of extracted fields matched ground truth across all documents and fields. Errors were mostly edge cases — unusual swap structures or corrupted PDFs.
BLEU measures n-gram precision (designed for translation, poor for open-ended text).
BLEU measures n-gram precision (designed for translation, poor for open-ended text). ROUGE measures n-gram recall (designed for summarization). Both are surface-level and miss semantic equivalence. BERTScore uses contextual embeddings for similarity — better for paraphrase-heavy LLM output. For production: none is sufficient alone. Combine with LLM-as-judge and task-specific metrics.
(1) Grounding check: verify extracted values appear in or are entailed by source document.
(1) Grounding check: verify extracted values appear in or are entailed by source document. (2) Contradiction detection: run NLI model to check if extraction contradicts the source. (3) LLM-as-judge: second LLM verifies if extraction is supported by document. (4) Track null-vs-hallucinated rate: model should return null for absent fields, not fabricate values.
LLM-as-judge uses a capable LLM to score other LLM outputs — cheaper than human annotation at scale.
LLM-as-judge uses a capable LLM to score other LLM outputs — cheaper than human annotation at scale. Risks: self-serving bias (models prefer their own style), position bias (prefers earlier responses), verbosity bias (rewards longer answers). Mitigations: use a different model as judge, swap option order and average scores, define explicit rubrics, calibrate against human labels.
(1) Automated metrics: exact match, JSON schema compliance, field coverage.
(1) Automated metrics: exact match, JSON schema compliance, field coverage. (2) LLM-as-judge with a fixed rubric. (3) A/B prompt testing — route X% of traffic to new prompt version, compare metrics. (4) Audit logging — every LLM input/output stored in S3 with metadata. (5) Dashboard for error analysis by document type, model, and prompt version.
Offline: evaluate on static held-out dataset before deployment — controlled, reproducible, cheap.
Offline: evaluate on static held-out dataset before deployment — controlled, reproducible, cheap. Online: evaluate in production — shadow mode (run new model without serving output), A/B testing (split traffic), monitoring (track user feedback, error rates). Both needed: offline gates deployment, online catches distribution shift and real-world edge cases.
A curated set of (input, expected output) pairs that must pass before any prompt or model change is deployed.
A curated set of (input, expected output) pairs that must pass before any prompt or model change is deployed. Store in version-controlled eval dataset. Run on every PR via CI/CD. Flag any drop in pass rate > threshold. Include: happy path examples, previously-failed edge cases, adversarial inputs. Set temperature=0 for eval, or use LLM-judge with a tolerance threshold for non-deterministic outputs.
Automation rate: % of fund filings processed end-to-end without human intervention — (automated / total) over a quarter.
Automation rate: % of fund filings processed end-to-end without human intervention — (automated / total) over a quarter. Accuracy: spot-checked extracted financial metadata against source PDFs on a stratified sample. Errors escalated to a review queue. Improvement cycle: analyze error patterns, refine agent prompts, re-run on failure cases.
Route traffic: X% to prompt A, (100-X)% to prompt B via feature flags or a routing layer.
Route traffic: X% to prompt A, (100-X)% to prompt B via feature flags or a routing layer. Log all inputs/outputs with variant tag. After N samples, compare metrics (exact match, LLM-judge score, latency, cost). Use statistical significance tests before declaring a winner. Run shadow mode first: both prompts execute, only A is served, compare offline.