LLMs Interview Prep

What is a large language model?medium

Type: conceptual
Topic: large-language-model
Frequency: common
Tags: large, language, model

Answer

An LLM is a neural model trained to predict and generate language.

Explanation

LLMs learn patterns from large corpora and can perform tasks like summarization, coding, reasoning, and extraction when prompted with context.

Follow-upWhy can LLMs hallucinate?

What is a context window?medium

Type: conceptual
Topic: context-window
Frequency: common
Tags: context, window

Answer

It is the maximum amount of text the model can consider at once.

Explanation

Inputs, retrieved context, instructions, conversation history, and outputs all consume context, so long tasks need careful context management.

Follow-upHow do you handle documents longer than the context window?

What is temperature?hard

Type: conceptual
Topic: temperature
Frequency: common
Tags: temperature

Answer

Temperature controls randomness in model output sampling.

Explanation

Lower temperature makes outputs more deterministic, while higher temperature increases variety and risk of unexpected answers.

Follow-upWhen would you use a low temperature?

What is prompt engineering?medium

Type: conceptual
Topic: prompt-engineering
Frequency: common
Tags: prompt, engineering

Answer

It is designing instructions and context to guide model behavior.

Explanation

Strong prompts define the task, constraints, audience, examples, and output format. Production systems often combine prompts with retrieval and tools.

Follow-upWhat is few-shot prompting?

What is fine-tuning?medium

Type: conceptual
Topic: fine-tuning
Frequency: common
Tags: fine, tuning

Answer

Fine-tuning further trains a model on task-specific examples.

Explanation

It can improve style, format, or task behavior, but it is not always the best way to add factual knowledge. RAG may be better for changing knowledge.

Follow-upWhen would you choose RAG over fine-tuning?

What is instruction tuning?medium

Type: conceptual
Topic: instruction-tuning
Frequency: common
Tags: instruction-tuning, alignment, training

Answer

Instruction tuning trains a model to follow task instructions more reliably.

Explanation

It uses examples of instructions and desired responses so the model becomes better at conversational and task-oriented behavior than raw next-token prediction alone.

Follow-upHow is instruction tuning different from pretraining?

How do you reduce hallucinations in an LLM application?hard

Type: scenario
Topic: hallucination-reduction
Frequency: common
Tags: hallucination, grounding, evaluation

Answer

Ground responses with reliable context, constrain outputs, evaluate claims, and make uncertainty explicit.

Explanation

RAG, citations, tool checks, structured output validation, refusal policies, and targeted evals reduce unsupported claims but do not eliminate risk completely.

Follow-upWhy can RAG still hallucinate?

Why are structured outputs useful with LLMs?medium

Type: conceptual
Topic: structured-outputs
Frequency: common
Tags: structured-output, schemas, validation

Answer

They make model responses easier to parse, validate, and connect to downstream systems.

Explanation

Schemas reduce ambiguity and make failures detectable. They are especially useful for extraction, routing, tool arguments, and workflow automation.

Follow-upWhat should you do when a structured response fails validation?

Foundation model vs fine-tuned — when to fine-tune vs prompt-engineer?hard

Type: conceptual
Topic: foundation-model-vs-fine-tuned-when-to-fine-tune-vs-prompt
Frequency: common
Tags: llms, foundation, model, fine, tuned, when

Answer

Foundation model: pretrained, general-purpose. Fine-tuning adapts weights with labeled data.

Explanation

Foundation model: pretrained, general-purpose. Fine-tuning adapts weights with labeled data. Prompting steers without changing weights. Start with prompting — cheaper, faster, no data needed. Fine-tune when prompting hits a quality ceiling, you have 100s+ examples, and the task needs consistent format/style. a document extraction pipeline achieved 98% accuracy with prompt engineering alone — fine-tuning was unnecessary.

Follow-upWhen would you choose one approach over the other?

How does autoregressive decoding work? Temperature, top-k, top-p?medium

Type: conceptual
Topic: does-autoregressive-decoding-work-temperature-top-k-top-p
Frequency: common
Tags: llms, how, does, autoregressive, decoding, work

Answer

At each step, model outputs a probability distribution over vocabulary, samples a token, appends it to context, repeats.

Explanation

At each step, model outputs a probability distribution over vocabulary, samples a token, appends it to context, repeats. Temperature scales the distribution (lower = more deterministic). Top-k samples only from the k highest-probability tokens. Top-p (nucleus sampling) samples from the smallest set whose cumulative probability ≥ p — adapts dynamically, preferred over fixed top-k.

Follow-upCan you give a production example?

What is instruction tuning? How does RLHF improve on it?hard

Type: conceptual
Topic: is-instruction-tuning-how-does-rlhf-improve-on-it
Frequency: common
Tags: llms, what, instruction, tuning, how, does

Answer

Instruction tuning fine-tunes on (instruction, response) pairs — teaches the model to follow directions.

Explanation

Instruction tuning fine-tunes on (instruction, response) pairs — teaches the model to follow directions. RLHF: human raters rank outputs, a reward model learns preferences, the LLM is fine-tuned via PPO to maximize reward. RLHF reduces harmful outputs and improves helpfulness beyond what supervised instruction tuning alone achieves.

Follow-upCan you give a production example?

Explain system prompt vs user message vs assistant turn.medium

Type: conceptual
Topic: system-prompt-vs-user-message-vs-assistant-turn
Frequency: common
Tags: llms, explain, system, prompt, user, message

Answer

System prompt: sets persistent context, persona, constraints — not part of conversational turn, processed first.

Explanation

System prompt: sets persistent context, persona, constraints — not part of conversational turn, processed first. User message: human input for this turn. Assistant turn: model response. Full history (system + alternating user/assistant) is passed each call — the API is stateless. In a document extraction pipeline: system prompt defines extraction schema and rules; user message contains the document chunk.

Follow-upWhen would you choose one approach over the other?

What is context-aware prompt engineering for document extraction?medium

Type: conceptual
Topic: what-is-context-aware-prompt-engineering-for-document-extr
Frequency: common
Tags: llms, what, context, aware, prompt, engineering

Answer

Context-aware prompting provides structured context (document chunk, schema, examples) for precise extraction.

Explanation

Context-aware prompting provides structured context (document chunk, schema, examples) for precise extraction. In a document extraction pipeline: system prompt defines the JSON schema for swap/trade documents (counterparty, notional, reset schedule, maturity date). User prompt contains the raw document text. Few-shot examples of edge cases (unusual date formats) are included. This achieved 98% extraction accuracy without fine-tuning.

Follow-upCan you give a production example?

How do you handle LLM hallucinations in a document extraction pipeline?medium

Type: scenario
Topic: do-you-handle-llm-hallucinations-in-a-document-extraction
Frequency: common
Tags: llms, how, you, handle, llm, hallucinations

Answer

(1) Constrained output — JSON-only with defined schema, return null for absent fields.

Explanation

(1) Constrained output — JSON-only with defined schema, return null for absent fields. (2) Pydantic validation — parse and validate every response, reject malformed outputs. (3) Confidence scoring — ask model to output confidence per field. (4) Grounding check — verify extracted values exist in source text. (5) Human review queue for low-confidence outputs (HITL).

Follow-upCan you give a production example?

How do you handle context length limitations in long-document tasks?medium

Type: scenario
Topic: do-you-handle-context-length-limitations-in-long-document
Frequency: common
Tags: llms, how, you, handle, context, length

Answer

(1) Chunking + RAG — retrieve only relevant chunks. (2) Sliding window — overlapping windows, merge results.

Explanation

(1) Chunking + RAG — retrieve only relevant chunks. (2) Sliding window — overlapping windows, merge results. (3) Hierarchical summarization — summarize sections, then summaries. (4) Map-reduce — process chunks in parallel, aggregate. (5) Long-context models (Claude 3.5: 200K tokens). a document extraction pipeline: chunked by clause type with sliding overlap to avoid splitting key fields.

Follow-upCan you give a production example?

What is LoRA fine-tuning and how does it reduce memory cost?hard

Type: conceptual
Topic: is-lora-fine-tuning-and-how-does-it-reduce-memory-cost
Frequency: common
Tags: llms, what, lora, fine, tuning, and

Answer

LoRA freezes original model weights and adds small trainable rank-decomposition matrices (A, B where r << d) to attention layers: W' = W + BA.

Explanation

LoRA freezes original model weights and adds small trainable rank-decomposition matrices (A, B where r << d) to attention layers: W' = W + BA. Only A and B are trained — ~0.1% of parameters. Memory savings: no optimizer states for full weights. LoRA weights can be merged at inference (no latency cost). QLoRA adds 4-bit quantization for further memory reduction.

Follow-upCan you give a production example?

How do you choose between Claude, GPT-4, Llama for an enterprise task?medium

Type: conceptual
Topic: do-you-choose-between-claude-gpt-4-llama-for-an-enterprise
Frequency: common
Tags: llms, how, you, choose, between, claude

Answer

Evaluate on: task performance (run evals on representative samples), context length, latency, cost per token, data privacy (on-prem vs cloud), and licensing.

Explanation

Evaluate on: task performance (run evals on representative samples), context length, latency, cost per token, data privacy (on-prem vs cloud), and licensing. Regulated: Bedrock + Claude for data residency. Cost-sensitive high-volume: Haiku + prompt caching. Reasoning-heavy agentic: Sonnet/Opus. Self-hosted: Llama 3.1/Qwen. Always benchmark on your actual data.

Follow-upCan you give a production example?

How does Ollama enable local model inference? What did you use it for?medium

Type: conceptual
Topic: does-ollama-enable-local-model-inference-what-did-you-use
Frequency: common
Tags: llms, how, does, ollama, enable, local

Answer

Ollama packages open-source models (Llama, Qwen, Mistral) with a local inference server exposing an OpenAI-compatible API.

Explanation

Ollama packages open-source models (Llama, Qwen, Mistral) with a local inference server exposing an OpenAI-compatible API. No GPU cloud needed. Used in NxLab and an agent platform development for rapid local testing without Bedrock API costs. Also used for the local agent backend with Strands Agents. Qwen2.5-Coder was preferred for code-related agent tasks.

Follow-upCan you give a production example?

LLMs Interview Questions

What is a large language model?medium

What is a context window?medium

What is temperature?hard

What is prompt engineering?medium

What is fine-tuning?medium

What is instruction tuning?medium

How do you reduce hallucinations in an LLM application?hard

Why are structured outputs useful with LLMs?medium

Foundation model vs fine-tuned — when to fine-tune vs prompt-engineer?hard

How does autoregressive decoding work? Temperature, top-k, top-p?medium

What is instruction tuning? How does RLHF improve on it?hard

Explain system prompt vs user message vs assistant turn.medium

What is context-aware prompt engineering for document extraction?medium

How do you handle LLM hallucinations in a document extraction pipeline?medium

How do you handle context length limitations in long-document tasks?medium

What is LoRA fine-tuning and how does it reduce memory cost?hard

How do you choose between Claude, GPT-4, Llama for an enterprise task?medium

How does Ollama enable local model inference? What did you use it for?medium