What is a large language model?medium
An LLM is a neural model trained to predict and generate language.
LLMs learn patterns from large corpora and can perform tasks like summarization, coding, reasoning, and extraction when prompted with context.
InterviewSkill
Large language model fundamentals for applied AI and Generative AI engineering roles.
An LLM is a neural model trained to predict and generate language.
LLMs learn patterns from large corpora and can perform tasks like summarization, coding, reasoning, and extraction when prompted with context.
It is the maximum amount of text the model can consider at once.
Inputs, retrieved context, instructions, conversation history, and outputs all consume context, so long tasks need careful context management.
Temperature controls randomness in model output sampling.
Lower temperature makes outputs more deterministic, while higher temperature increases variety and risk of unexpected answers.
It is designing instructions and context to guide model behavior.
Strong prompts define the task, constraints, audience, examples, and output format. Production systems often combine prompts with retrieval and tools.
Fine-tuning further trains a model on task-specific examples.
It can improve style, format, or task behavior, but it is not always the best way to add factual knowledge. RAG may be better for changing knowledge.
Instruction tuning trains a model to follow task instructions more reliably.
It uses examples of instructions and desired responses so the model becomes better at conversational and task-oriented behavior than raw next-token prediction alone.
Ground responses with reliable context, constrain outputs, evaluate claims, and make uncertainty explicit.
RAG, citations, tool checks, structured output validation, refusal policies, and targeted evals reduce unsupported claims but do not eliminate risk completely.
They make model responses easier to parse, validate, and connect to downstream systems.
Schemas reduce ambiguity and make failures detectable. They are especially useful for extraction, routing, tool arguments, and workflow automation.
Foundation model: pretrained, general-purpose. Fine-tuning adapts weights with labeled data.
Foundation model: pretrained, general-purpose. Fine-tuning adapts weights with labeled data. Prompting steers without changing weights. Start with prompting — cheaper, faster, no data needed. Fine-tune when prompting hits a quality ceiling, you have 100s+ examples, and the task needs consistent format/style. a document extraction pipeline achieved 98% accuracy with prompt engineering alone — fine-tuning was unnecessary.
At each step, model outputs a probability distribution over vocabulary, samples a token, appends it to context, repeats.
At each step, model outputs a probability distribution over vocabulary, samples a token, appends it to context, repeats. Temperature scales the distribution (lower = more deterministic). Top-k samples only from the k highest-probability tokens. Top-p (nucleus sampling) samples from the smallest set whose cumulative probability ≥ p — adapts dynamically, preferred over fixed top-k.
Instruction tuning fine-tunes on (instruction, response) pairs — teaches the model to follow directions.
Instruction tuning fine-tunes on (instruction, response) pairs — teaches the model to follow directions. RLHF: human raters rank outputs, a reward model learns preferences, the LLM is fine-tuned via PPO to maximize reward. RLHF reduces harmful outputs and improves helpfulness beyond what supervised instruction tuning alone achieves.
System prompt: sets persistent context, persona, constraints — not part of conversational turn, processed first.
System prompt: sets persistent context, persona, constraints — not part of conversational turn, processed first. User message: human input for this turn. Assistant turn: model response. Full history (system + alternating user/assistant) is passed each call — the API is stateless. In a document extraction pipeline: system prompt defines extraction schema and rules; user message contains the document chunk.
Context-aware prompting provides structured context (document chunk, schema, examples) for precise extraction.
Context-aware prompting provides structured context (document chunk, schema, examples) for precise extraction. In a document extraction pipeline: system prompt defines the JSON schema for swap/trade documents (counterparty, notional, reset schedule, maturity date). User prompt contains the raw document text. Few-shot examples of edge cases (unusual date formats) are included. This achieved 98% extraction accuracy without fine-tuning.
(1) Constrained output — JSON-only with defined schema, return null for absent fields.
(1) Constrained output — JSON-only with defined schema, return null for absent fields. (2) Pydantic validation — parse and validate every response, reject malformed outputs. (3) Confidence scoring — ask model to output confidence per field. (4) Grounding check — verify extracted values exist in source text. (5) Human review queue for low-confidence outputs (HITL).
(1) Chunking + RAG — retrieve only relevant chunks. (2) Sliding window — overlapping windows, merge results.
(1) Chunking + RAG — retrieve only relevant chunks. (2) Sliding window — overlapping windows, merge results. (3) Hierarchical summarization — summarize sections, then summaries. (4) Map-reduce — process chunks in parallel, aggregate. (5) Long-context models (Claude 3.5: 200K tokens). a document extraction pipeline: chunked by clause type with sliding overlap to avoid splitting key fields.
LoRA freezes original model weights and adds small trainable rank-decomposition matrices (A, B where r << d) to attention layers: W' = W + BA.
LoRA freezes original model weights and adds small trainable rank-decomposition matrices (A, B where r << d) to attention layers: W' = W + BA. Only A and B are trained — ~0.1% of parameters. Memory savings: no optimizer states for full weights. LoRA weights can be merged at inference (no latency cost). QLoRA adds 4-bit quantization for further memory reduction.
Evaluate on: task performance (run evals on representative samples), context length, latency, cost per token, data privacy (on-prem vs cloud), and licensing.
Evaluate on: task performance (run evals on representative samples), context length, latency, cost per token, data privacy (on-prem vs cloud), and licensing. Regulated: Bedrock + Claude for data residency. Cost-sensitive high-volume: Haiku + prompt caching. Reasoning-heavy agentic: Sonnet/Opus. Self-hosted: Llama 3.1/Qwen. Always benchmark on your actual data.
Ollama packages open-source models (Llama, Qwen, Mistral) with a local inference server exposing an OpenAI-compatible API.
Ollama packages open-source models (Llama, Qwen, Mistral) with a local inference server exposing an OpenAI-compatible API. No GPU cloud needed. Used in NxLab and an agent platform development for rapid local testing without Bedrock API costs. Also used for the local agent backend with Strands Agents. Qwen2.5-Coder was preferred for code-related agent tasks.