OpenAI Interview Questions

LLM systems, evaluation, safety, RAG, and product-quality AI engineering questions.

4 questions

How would you evaluate a customer-support LLM assistant?hard

Type: scenario
Topic: how-would-you-evaluate-a-customer-support-llm-assistant
Frequency: common

Answer

Use task success, factuality, refusal quality, latency, cost, and human review.

Explanation

LLM evaluation should combine golden datasets, rubric-based judging, production feedback, safety checks, and regression tests for known failure modes.

Follow-upHow do you detect hallucinations?

How would you design a RAG system for internal docs?hard

Type: scenario
Topic: how-would-you-design-a-rag-system-for-internal-docs
Frequency: common

Answer

Ingest docs, chunk, embed, retrieve, rerank, generate with citations, and evaluate.

Explanation

Mention permissions, freshness, chunk strategy, hybrid search, reranking, context packing, answer grounding, and feedback loops.

Follow-upHow do you handle stale documents?

What is the difference between fine-tuning and prompting?medium

Type: scenario
Topic: what-is-the-difference-between-fine-tuning-and-prompting
Frequency: common

Answer

Prompting changes instructions at runtime; fine-tuning changes model behavior through training examples.

Explanation

Prompting is faster and flexible. Fine-tuning helps style, format consistency, and repeated task behavior, but needs data quality and evaluation.

Follow-upWhen would you avoid fine-tuning?

How do you reduce latency in an LLM product?medium

Type: scenario
Topic: how-do-you-reduce-latency-in-an-llm-product
Frequency: common

Answer

Optimize model choice, prompt length, retrieval, streaming, caching, and parallel work.

Explanation

Latency work includes measuring time to first token, token generation rate, retrieval overhead, network cost, and fallback paths.

Follow-upWhat tradeoff exists between quality and latency?

Back to Interview