What is an LLM?medium
A model trained to predict and generate text-like tokens.
LLMs learn patterns from large corpora and can be adapted through prompting, retrieval, fine-tuning, and tool use.
InterviewRole
A path for LLMs, prompts, RAG, agents, evaluations, safety, and production AI systems.
A model trained to predict and generate text-like tokens.
LLMs learn patterns from large corpora and can be adapted through prompting, retrieval, fine-tuning, and tool use.
Improve instructions, context, examples, retrieval, evaluation, and constraints.
Quality work is iterative: define failure cases, create test sets, measure outputs, and reduce ambiguity.
Choosing what information fits into the model input.
Good systems prioritize relevant context, compress history, remove noise, and preserve source-grounded facts.
Use it when answers need private, current, or source-grounded information.
RAG separates knowledge retrieval from generation, but quality depends on chunking, retrieval, ranking, and citations.
Check retrieval first, then ranking, prompt context, and generation behavior.
If the right document is not retrieved, fix indexing or search. If it is retrieved but ignored, fix prompting or context layout.
It is self-contained, focused, and sized for retrieval and context limits.
Chunking should preserve meaning, headings, metadata, and enough surrounding context to answer accurately.
Use an agent when the task needs planning, tools, or multi-step decisions.
Agents add power but also cost, latency, and reliability risk. Simple workflows should stay deterministic.
Constrain tools, validate outputs, add state checks, and evaluate task completion.
Reliability improves with small action spaces, clear tool contracts, retries, human escalation, and logs.
Only useful, consented, and durable context.
Memory needs privacy boundaries, update rules, deletion paths, and safeguards against stale or sensitive information.
Create task rubrics, golden sets, user metrics, and safety checks.
Good evaluation combines automated scoring, human judgment, regression tests, and production telemetry.
A representative set of inputs and expected behaviors.
It should include common cases, edge cases, adversarial cases, and historical failures so quality can be tracked.
Run both on the same eval set and review quality, cost, and latency.
Prompt changes can improve one slice and hurt another, so compare across categories and failure modes.
Data, code, features, model artifacts, parameters, and evaluation results.
Versioning lets teams reproduce a model, compare experiments, rollback safely, and audit production behavior.
Track service health, data quality, drift, prediction quality, and business impact.
Production ML needs both software metrics and model-specific signals, especially when labels arrive late.
A mismatch between training features and production features.
Skew often comes from duplicated feature logic, different timestamps, missing values, or online/offline transformation drift.