InterviewSkill

LLMs Interview Questions

Large language model fundamentals for applied AI and Generative AI engineering roles.

18 questions
LLMs

What is a large language model?medium

Type
conceptual
Topic
large-language-model
Frequency
common
Tags
large, language, model
Answer

An LLM is a neural model trained to predict and generate language.

Explanation

LLMs learn patterns from large corpora and can perform tasks like summarization, coding, reasoning, and extraction when prompted with context.

Follow-upWhy can LLMs hallucinate?

What is a context window?medium

Type
conceptual
Topic
context-window
Frequency
common
Tags
context, window
Answer

It is the maximum amount of text the model can consider at once.

Explanation

Inputs, retrieved context, instructions, conversation history, and outputs all consume context, so long tasks need careful context management.

Follow-upHow do you handle documents longer than the context window?

What is temperature?hard

Type
conceptual
Topic
temperature
Frequency
common
Tags
temperature
Answer

Temperature controls randomness in model output sampling.

Explanation

Lower temperature makes outputs more deterministic, while higher temperature increases variety and risk of unexpected answers.

Follow-upWhen would you use a low temperature?

What is prompt engineering?medium

Type
conceptual
Topic
prompt-engineering
Frequency
common
Tags
prompt, engineering
Answer

It is designing instructions and context to guide model behavior.

Explanation

Strong prompts define the task, constraints, audience, examples, and output format. Production systems often combine prompts with retrieval and tools.

Follow-upWhat is few-shot prompting?

What is fine-tuning?medium

Type
conceptual
Topic
fine-tuning
Frequency
common
Tags
fine, tuning
Answer

Fine-tuning further trains a model on task-specific examples.

Explanation

It can improve style, format, or task behavior, but it is not always the best way to add factual knowledge. RAG may be better for changing knowledge.

Follow-upWhen would you choose RAG over fine-tuning?

What is instruction tuning?medium

Type
conceptual
Topic
instruction-tuning
Frequency
common
Tags
instruction-tuning, alignment, training
Answer

Instruction tuning trains a model to follow task instructions more reliably.

Explanation

It uses examples of instructions and desired responses so the model becomes better at conversational and task-oriented behavior than raw next-token prediction alone.

Follow-upHow is instruction tuning different from pretraining?

How do you reduce hallucinations in an LLM application?hard

Type
scenario
Topic
hallucination-reduction
Frequency
common
Tags
hallucination, grounding, evaluation
Answer

Ground responses with reliable context, constrain outputs, evaluate claims, and make uncertainty explicit.

Explanation

RAG, citations, tool checks, structured output validation, refusal policies, and targeted evals reduce unsupported claims but do not eliminate risk completely.

Follow-upWhy can RAG still hallucinate?

Why are structured outputs useful with LLMs?medium

Type
conceptual
Topic
structured-outputs
Frequency
common
Tags
structured-output, schemas, validation
Answer

They make model responses easier to parse, validate, and connect to downstream systems.

Explanation

Schemas reduce ambiguity and make failures detectable. They are especially useful for extraction, routing, tool arguments, and workflow automation.

Follow-upWhat should you do when a structured response fails validation?

Foundation model vs fine-tuned — when to fine-tune vs prompt-engineer?hard

Type
conceptual
Topic
foundation-model-vs-fine-tuned-when-to-fine-tune-vs-prompt
Frequency
common
Tags
llms, foundation, model, fine, tuned, when
Answer

Foundation model: pretrained, general-purpose. Fine-tuning adapts weights with labeled data.

Explanation

Foundation model: pretrained, general-purpose. Fine-tuning adapts weights with labeled data. Prompting steers without changing weights. Start with prompting — cheaper, faster, no data needed. Fine-tune when prompting hits a quality ceiling, you have 100s+ examples, and the task needs consistent format/style. a document extraction pipeline achieved 98% accuracy with prompt engineering alone — fine-tuning was unnecessary.

Follow-upWhen would you choose one approach over the other?

How does autoregressive decoding work? Temperature, top-k, top-p?medium

Type
conceptual
Topic
does-autoregressive-decoding-work-temperature-top-k-top-p
Frequency
common
Tags
llms, how, does, autoregressive, decoding, work
Answer

At each step, model outputs a probability distribution over vocabulary, samples a token, appends it to context, repeats.

Explanation

At each step, model outputs a probability distribution over vocabulary, samples a token, appends it to context, repeats. Temperature scales the distribution (lower = more deterministic). Top-k samples only from the k highest-probability tokens. Top-p (nucleus sampling) samples from the smallest set whose cumulative probability ≥ p — adapts dynamically, preferred over fixed top-k.

Follow-upCan you give a production example?

What is instruction tuning? How does RLHF improve on it?hard

Type
conceptual
Topic
is-instruction-tuning-how-does-rlhf-improve-on-it
Frequency
common
Tags
llms, what, instruction, tuning, how, does
Answer

Instruction tuning fine-tunes on (instruction, response) pairs — teaches the model to follow directions.

Explanation

Instruction tuning fine-tunes on (instruction, response) pairs — teaches the model to follow directions. RLHF: human raters rank outputs, a reward model learns preferences, the LLM is fine-tuned via PPO to maximize reward. RLHF reduces harmful outputs and improves helpfulness beyond what supervised instruction tuning alone achieves.

Follow-upCan you give a production example?

Explain system prompt vs user message vs assistant turn.medium

Type
conceptual
Topic
system-prompt-vs-user-message-vs-assistant-turn
Frequency
common
Tags
llms, explain, system, prompt, user, message
Answer

System prompt: sets persistent context, persona, constraints — not part of conversational turn, processed first.

Explanation

System prompt: sets persistent context, persona, constraints — not part of conversational turn, processed first. User message: human input for this turn. Assistant turn: model response. Full history (system + alternating user/assistant) is passed each call — the API is stateless. In a document extraction pipeline: system prompt defines extraction schema and rules; user message contains the document chunk.

Follow-upWhen would you choose one approach over the other?

What is context-aware prompt engineering for document extraction?medium

Type
conceptual
Topic
what-is-context-aware-prompt-engineering-for-document-extr
Frequency
common
Tags
llms, what, context, aware, prompt, engineering
Answer

Context-aware prompting provides structured context (document chunk, schema, examples) for precise extraction.

Explanation

Context-aware prompting provides structured context (document chunk, schema, examples) for precise extraction. In a document extraction pipeline: system prompt defines the JSON schema for swap/trade documents (counterparty, notional, reset schedule, maturity date). User prompt contains the raw document text. Few-shot examples of edge cases (unusual date formats) are included. This achieved 98% extraction accuracy without fine-tuning.

Follow-upCan you give a production example?

How do you handle LLM hallucinations in a document extraction pipeline?medium

Type
scenario
Topic
do-you-handle-llm-hallucinations-in-a-document-extraction
Frequency
common
Tags
llms, how, you, handle, llm, hallucinations
Answer

(1) Constrained output — JSON-only with defined schema, return null for absent fields.

Explanation

(1) Constrained output — JSON-only with defined schema, return null for absent fields. (2) Pydantic validation — parse and validate every response, reject malformed outputs. (3) Confidence scoring — ask model to output confidence per field. (4) Grounding check — verify extracted values exist in source text. (5) Human review queue for low-confidence outputs (HITL).

Follow-upCan you give a production example?

How do you handle context length limitations in long-document tasks?medium

Type
scenario
Topic
do-you-handle-context-length-limitations-in-long-document
Frequency
common
Tags
llms, how, you, handle, context, length
Answer

(1) Chunking + RAG — retrieve only relevant chunks. (2) Sliding window — overlapping windows, merge results.

Explanation

(1) Chunking + RAG — retrieve only relevant chunks. (2) Sliding window — overlapping windows, merge results. (3) Hierarchical summarization — summarize sections, then summaries. (4) Map-reduce — process chunks in parallel, aggregate. (5) Long-context models (Claude 3.5: 200K tokens). a document extraction pipeline: chunked by clause type with sliding overlap to avoid splitting key fields.

Follow-upCan you give a production example?

What is LoRA fine-tuning and how does it reduce memory cost?hard

Type
conceptual
Topic
is-lora-fine-tuning-and-how-does-it-reduce-memory-cost
Frequency
common
Tags
llms, what, lora, fine, tuning, and
Answer

LoRA freezes original model weights and adds small trainable rank-decomposition matrices (A, B where r << d) to attention layers: W' = W + BA.

Explanation

LoRA freezes original model weights and adds small trainable rank-decomposition matrices (A, B where r << d) to attention layers: W' = W + BA. Only A and B are trained — ~0.1% of parameters. Memory savings: no optimizer states for full weights. LoRA weights can be merged at inference (no latency cost). QLoRA adds 4-bit quantization for further memory reduction.

Follow-upCan you give a production example?

How do you choose between Claude, GPT-4, Llama for an enterprise task?medium

Type
conceptual
Topic
do-you-choose-between-claude-gpt-4-llama-for-an-enterprise
Frequency
common
Tags
llms, how, you, choose, between, claude
Answer

Evaluate on: task performance (run evals on representative samples), context length, latency, cost per token, data privacy (on-prem vs cloud), and licensing.

Explanation

Evaluate on: task performance (run evals on representative samples), context length, latency, cost per token, data privacy (on-prem vs cloud), and licensing. Regulated: Bedrock + Claude for data residency. Cost-sensitive high-volume: Haiku + prompt caching. Reasoning-heavy agentic: Sonnet/Opus. Self-hosted: Llama 3.1/Qwen. Always benchmark on your actual data.

Follow-upCan you give a production example?

How does Ollama enable local model inference? What did you use it for?medium

Type
conceptual
Topic
does-ollama-enable-local-model-inference-what-did-you-use
Frequency
common
Tags
llms, how, does, ollama, enable, local
Answer

Ollama packages open-source models (Llama, Qwen, Mistral) with a local inference server exposing an OpenAI-compatible API.

Explanation

Ollama packages open-source models (Llama, Qwen, Mistral) with a local inference server exposing an OpenAI-compatible API. No GPU cloud needed. Used in NxLab and an agent platform development for rapid local testing without Bedrock API costs. Also used for the local agent backend with Strands Agents. Qwen2.5-Coder was preferred for code-related agent tasks.

Follow-upCan you give a production example?