InterviewSkill

RAG Interview Questions

Retrieval-augmented generation concepts for grounded AI systems.

28 questions
RAG

What is RAG? (medium)

Type
conceptual
Topic
rag
Frequency
common
Tags
rag
Answer

RAG retrieves relevant external context before generating an answer.

Explanation

A RAG pipeline indexes documents, retrieves relevant chunks for a query, and sends them to a model so answers can be grounded in source material.

Follow-up: When is RAG better than fine-tuning?

Why does chunking matter? (medium)

Type
conceptual
Topic
chunking-matter
Frequency
common
Tags
chunking, matter
Answer

Chunking controls what content can be retrieved and fit into context.

Explanation

Chunks that are too small lose meaning; chunks that are too large add noise. Good chunking preserves semantic units.

Follow-up: How do you choose chunk size?

What are embeddings used for in RAG? (hard)

Type
conceptual
Topic
embeddings-used-rag
Frequency
common
Tags
embeddings, used, rag
Answer

Embeddings represent query and document meaning for semantic search.

Explanation

Similar vectors are retrieved as relevant context. Embedding quality strongly affects recall and answer grounding.

Follow-up: How do you evaluate retrieval quality?

What is reranking? (medium)

Type
conceptual
Topic
reranking
Frequency
common
Tags
reranking
Answer

Reranking reorders retrieved candidates using a stronger relevance model.

Explanation

Retrieval often gets a broad candidate set quickly, while reranking improves precision before context is passed to the generator.

Follow-up: What tradeoff does reranking introduce?

What can go wrong in a RAG pipeline? (medium)

Type
conceptual
Topic
go-wrong-rag-pipeline
Frequency
common
Tags
go, wrong, rag, pipeline
Answer

Bad ingestion, poor retrieval, stale data, noisy context, and weak prompts can each cause the system to fail.

Explanation

RAG failures can come from every stage, so evaluation should separately measure retrieval, grounding, answer quality, and citations.

Follow-up: How would you debug a wrong RAG answer?

How do you evaluate retrieval quality in RAG? (hard)

Type
scenario
Topic
retrieval-evaluation
Frequency
common
Tags
retrieval, evaluation, recall-at-k
Answer

Measure whether the retriever returns relevant evidence for each query before judging the generated answer.

Explanation

Use metrics like recall@k, precision@k, MRR, nDCG, and human relevance labels. Separate retrieval evaluation from answer evaluation to locate failures.
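The metrics named above can be sketched in a few lines. This is a minimal illustration assuming binary relevance labels (a chunk is either relevant or not); the function names and the sample chunk ids are hypothetical, not taken from any particular library.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / k

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """MRR: average reciprocal rank over (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance nDCG: DCG of this ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

# Hypothetical example: one query, four retrieved chunk ids, three labeled relevant.
retrieved = ["c7", "c2", "c9", "c4"]
relevant = {"c2", "c4", "c5"}
print(recall_at_k(retrieved, relevant, 3), precision_at_k(retrieved, relevant, 3))
print(reciprocal_rank(retrieved, relevant), ndcg_at_k(retrieved, relevant, 4))
```

Computing these per query and averaging over a labeled query set gives retrieval scores that can be tracked independently of the generated answers.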

Follow-up: Why can answer quality be poor even when retrieval looks good?

What is hybrid search in RAG? (medium)

Type
conceptual
Topic
hybrid-search
Frequency
common
Tags
hybrid-search, bm25, vectors
Answer

Hybrid search combines lexical search with vector similarity search.

Explanation

Lexical search is strong for exact terms, IDs, and rare keywords. Vector search is strong for semantic matches. Combining them often improves recall.

Follow-up: When would BM25 outperform vector search?

Why does metadata filtering matter in RAG? (medium)

Type
scenario
Topic
metadata-filtering
Frequency
common
Tags
metadata, filtering, permissions
Answer

It limits retrieval to context that is relevant, authorized, and fresh enough for the query.

Explanation

Filters like tenant, document type, date, language, region, and permission level can prevent irrelevant or unsafe context from entering the prompt.

Follow-up: What can go wrong if metadata is missing or stale?

How do you build semantic similarity ranking? What distance metric? (medium)

Type
scenario
Topic
how-do-you-build-semantic-similarity-ranking-what-distance
Frequency
common
Tags
rag, how, did, you, build, semantic
Answer

Built a domain-specific corpus of 32K CS keywords via Wikipedia API (3 levels deep).

Explanation

Built a domain-specific corpus of 32K CS keywords via Wikipedia API (3 levels deep). Trained Word2Vec (Skip-gram) using Gensim. Represented job descriptions and resumes as averaged word vectors. Ranked candidates by cosine similarity. Cosine was chosen because it measures directional similarity regardless of vector magnitude. Word2Vec vectors are dense and their norms vary with word frequency and document content, so Euclidean distance would let those magnitude differences affect the ranking.

Follow-up: What tradeoffs did you consider in that implementation?

Cosine similarity vs Euclidean distance in embedding space? (medium)

Type
conceptual
Topic
cosine-similarity-vs-euclidean-distance-in-embedding-space
Frequency
common
Tags
rag, cosine, similarity, euclidean, distance, embedding
Answer

Cosine measures the angle between vectors — magnitude-invariant, preferred for text embeddings where document length varies.

Explanation

Cosine measures the angle between vectors: it is magnitude-invariant and preferred for text embeddings where document length varies. Euclidean (L2) measures absolute distance and is sensitive to magnitude. In practice: cosine for semantic similarity (RAG, ranking), Euclidean/L2 for spatial tasks. On L2-normalized vectors the two produce the same ranking, because squared L2 distance equals 2 - 2 * cosine. FAISS supports both via IndexFlatIP (inner product, which equals cosine on L2-normalized vectors) and IndexFlatL2.
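A small sketch can make the magnitude point concrete. The vectors below are made-up 2-d examples (real embeddings have hundreds of dimensions), and the helper names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (ignores vector length)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    """Straight-line (L2) distance between a and b (sensitive to length)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def l2_normalize(a):
    """Scale a to unit length."""
    length = math.sqrt(dot(a, a))
    return [x / length for x in a]

query = [1.0, 1.0]
long_doc = [20.0, 20.0]   # same direction as the query, much larger magnitude
off_topic = [1.0, 0.0]    # different direction, similar magnitude

# Cosine ranks long_doc first; Euclidean ranks off_topic first.
print(cosine_similarity(query, long_doc), cosine_similarity(query, off_topic))
print(euclidean_distance(query, long_doc), euclidean_distance(query, off_topic))

# After L2 normalization the two metrics agree: ||a - b||^2 = 2 - 2 * cos(a, b).
q, d = l2_normalize(query), l2_normalize(off_topic)
print(euclidean_distance(q, d) ** 2, 2 - 2 * cosine_similarity(query, off_topic))
```

This is why normalizing embeddings before indexing makes an inner-product index behave like a cosine index.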

Follow-up: When would you choose one approach over the other?

How do you evaluate Word2Vec embedding quality on a domain corpus? (medium)

Type
scenario
Topic
trained-word2vec-on-a-32k-cs-corpus-how-did-you-evaluate-e
Frequency
common
Tags
rag, you, trained, word2vec, 32k, corpus
Answer

Word analogy tasks (king - man + woman = queen), word similarity benchmarks, and domain-specific nearest-neighbor inspection (is 'neural network' close to 'deep learning'?).

Explanation

Word analogy tasks (king - man + woman = queen), word similarity benchmarks, and domain-specific nearest-neighbor inspection (is 'neural network' close to 'deep learning'?). Downstream eval: did cosine similarity ranking correlate with human recruiter judgments? Also visualized with t-SNE to verify clustering of related CS concepts.

Follow-up: What tradeoffs did you consider in that implementation?

Faithfulness vs relevance in RAG evaluation? (medium)

Type
conceptual
Topic
faithfulness-vs-relevance-in-rag-evaluation
Frequency
common
Tags
rag, faithfulness, relevance, evaluation
Answer

Faithfulness: does the generated answer contain only information supported by the retrieved context?

Explanation

Faithfulness: does the generated answer contain only information supported by the retrieved context? It measures hallucination relative to that context. Relevance: is the answer responsive to the user's question? A response can be faithful (all claims grounded) but irrelevant (answers a different question). Measure faithfulness with an NLI model or an LLM judge that checks the answer claim by claim. Measure relevance with semantic similarity between query and answer, or with an LLM judge.

Follow-up: When would you choose one approach over the other?

What is RAGAS? How would you integrate it into CI/CD? (hard)

Type
scenario
Topic
is-ragas-how-would-you-integrate-it-into-ci-cd
Frequency
common
Tags
rag, what, ragas, how, would, you
Answer

RAGAS is an open-source RAG evaluation framework computing: faithfulness, answer relevancy, context precision, and context recall.

Explanation

RAGAS is an open-source RAG evaluation framework computing: faithfulness, answer relevancy, context precision, and context recall. To integrate it, run RAGAS on a golden Q&A dataset in your CI pipeline (for example, GitHub Actions) and block deployment if any metric drops below its threshold. Track scores over time for drift detection. Several RAGAS metrics rely on an LLM judge, so pin a consistent judge model and version; otherwise score changes may reflect the judge rather than your system.
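The gating step can be sketched independently of how the scores are produced. The example below assumes an earlier CI step has already run the evaluation and produced a dict of metric scores; the threshold values and function names are hypothetical, and no RAGAS API is called here.

```python
# Hypothetical minimum scores; tune these against your own golden dataset.
THRESHOLDS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_precision": 0.70,
    "context_recall": 0.70,
}

def failing_metrics(scores, thresholds):
    """Return {metric: (score, minimum)} for metrics that are missing or too low."""
    failures = {}
    for name, minimum in thresholds.items():
        score = scores.get(name)
        if score is None or score < minimum:
            failures[name] = (score, minimum)
    return failures

def gate(scores, thresholds=THRESHOLDS):
    """Return a process exit code: 0 if every metric passes, 1 otherwise."""
    failures = failing_metrics(scores, thresholds)
    for name, (score, minimum) in sorted(failures.items()):
        print(f"FAIL {name}: {score} is below the minimum {minimum}")
    return 1 if failures else 0

# Hypothetical scores from an evaluation run on the golden Q&A set.
example_scores = {
    "faithfulness": 0.91,
    "answer_relevancy": 0.84,
    "context_precision": 0.66,
    "context_recall": 0.75,
}
exit_code = gate(example_scores)
```

In a CI job, the script would end with `sys.exit(gate(scores))` so that a non-zero exit code fails the pipeline step and blocks the deployment.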

Follow-up: Can you give a production example?

Walk me through the full RAG pipeline. (medium)

Type
scenario
Topic
me-through-the-full-rag-pipeline
Frequency
common
Tags
rag, walk, through, the, full
Answer

(1) Ingestion: load docs, chunk, embed chunks, store vectors with metadata.

Explanation

(1) Ingestion: load docs, chunk, embed chunks, store vectors with metadata. (2) Retrieval: embed user query, ANN search for top-k chunks. (3) Augmentation: inject retrieved chunks into LLM prompt as context. (4) Generation: LLM generates answer grounded in context. Key decisions: chunk size, overlap, embedding model, retrieval top-k, and whether to rerank before generation.
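The four stages above can be sketched end to end with toy components. This is an illustration under loud assumptions: the bag-of-words `embed` function stands in for a real embedding model, the linear scan stands in for ANN search, and the final LLM call is left out (the sketch stops at the assembled prompt). All names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    shared = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return shared / norm if norm else 0.0

def ingest(docs, chunk_words=12):
    """(1) Ingestion: chunk each document, embed chunks, store with metadata."""
    index = []
    for doc_id, text in docs.items():
        words = text.split()
        for start in range(0, len(words), chunk_words):
            chunk = " ".join(words[start:start + chunk_words])
            index.append({"doc_id": doc_id, "text": chunk, "vector": embed(chunk)})
    return index

def retrieve(index, query, top_k=2):
    """(2) Retrieval: embed the query and take the top-k most similar chunks."""
    query_vector = embed(query)
    ranked = sorted(index, key=lambda r: cosine(query_vector, r["vector"]), reverse=True)
    return ranked[:top_k]

def build_prompt(query, chunks):
    """(3) Augmentation: inject retrieved chunks into the prompt as context."""
    context = "\n".join(f"[{c['doc_id']}] {c['text']}" for c in chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

docs = {
    "faq": "Refunds are issued within 14 days of purchase.",
    "ship": "Orders ship from the warehouse every Monday morning.",
}
index = ingest(docs)
top = retrieve(index, "When are refunds issued?", top_k=1)
prompt = build_prompt("When are refunds issued?", top)
print(prompt)  # (4) Generation would send this prompt to the LLM.
```

Each function maps to one of the key decisions listed above: `chunk_words` to chunk size, `embed` to the embedding model, and `top_k` to retrieval depth.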

Follow-up: What tradeoffs did you consider in that implementation?

What is semantic chunking? How does it differ from fixed-size? (medium)

Type
conceptual
Topic
is-semantic-chunking-how-does-it-differ-from-fixed-size
Frequency
common
Tags
rag, what, semantic, chunking, how, does
Answer

Fixed-size splits at a set token count — simple but can cut mid-concept.

Explanation

Fixed-size chunking splits at a set token count: simple, but it can cut mid-concept. Semantic chunking splits at natural boundaries: sentences, paragraphs, or detected topic shifts (points where embedding similarity between adjacent segments drops). Each chunk is then a coherent unit. Example: a document extraction pipeline used semantic chunking because contract clauses are variable-length and splitting mid-clause destroys extraction context.
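A minimal sketch of topic-shift splitting follows. It assumes a pluggable similarity function between adjacent sentences; the Jaccard word overlap used here is only a stand-in for embedding similarity, and the threshold and sample text are hypothetical.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def jaccard(a, b):
    """Word-overlap similarity; a stand-in for embedding cosine similarity."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

def semantic_chunks(sentences, similarity=jaccard, threshold=0.1):
    """Start a new chunk whenever adjacent sentences fall below the threshold."""
    chunks, current = [], []
    for sentence in sentences:
        if current and similarity(current[-1], sentence) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

text = (
    "The borrower shall repay the loan monthly. "
    "The loan interest is fixed for the borrower. "
    "Notices must be sent by registered mail."
)
chunks = semantic_chunks(split_sentences(text))
for chunk in chunks:
    print(chunk)
```

Swapping `jaccard` for a function that compares sentence embeddings gives the embedding-based variant described above; the splitting loop stays the same.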

Follow-up: When would you choose one approach over the other?

How do you design a retrieval layer for document extraction? (medium)

Type
scenario
Topic
how-do-you-design-a-retrieval-layer-for-document-extractio
Frequency
common
Tags
rag, how, did, you, design, the
Answer

Documents chunked by clause type (header, definitions, payment terms, maturity).

Explanation

Documents chunked by clause type (header, definitions, payment terms, maturity). Each chunk embedded with Bedrock Titan or Cohere. Stored in FAISS/ChromaDB with metadata (doc ID, clause type, date). At query time: embed the target field name, retrieve top-k relevant clauses, inject into extraction prompt. Clause-type metadata filtering used to narrow search scope before ANN.

Follow-up: What tradeoffs did you consider in that implementation?

Sparse retrieval (BM25) vs dense retrieval (ANN): when to hybrid? (medium)

Type
conceptual
Topic
sparse-retrieval-bm25-vs-dense-retrieval-ann-when-to-hybri
Frequency
common
Tags
rag, sparse, retrieval, bm25, dense
Answer

BM25 is keyword-based — great for exact term matches and domain-specific jargon.

Explanation

BM25 is keyword-based: great for exact term matches and domain-specific jargon. Dense retrieval uses embedding similarity and is better for semantic/paraphrase matches. Hybrid: run both and merge the ranked lists, commonly with Reciprocal Rank Fusion (RRF), which sums 1/(k + rank) for each document across the lists. Use hybrid when queries mix exact terms and semantic meaning, or when domain vocabulary is specialized (financial terms). It is often a sensible default for production RAG, at the cost of running two retrievers.
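Reciprocal Rank Fusion itself is short enough to sketch directly. The example assumes each retriever returns a list of document ids ordered best first; `k=60` is the commonly used constant, and the function name and sample ids are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked id lists: each list adds 1 / (k + rank) to a document's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a deterministic order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

bm25_ranking = ["d3", "d1", "d7"]    # hypothetical lexical results
dense_ranking = ["d1", "d5", "d3"]   # hypothetical vector results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['d1', 'd3', 'd5', 'd7']
```

Because RRF uses only ranks, it needs no score normalization between BM25 and cosine similarity, which is the main reason it is a popular fusion choice.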

Follow-up: When would you choose one approach over the other?

How do you handle multi-document retrieval where context spans multiple files? (medium)

Type
scenario
Topic
do-you-handle-multi-document-retrieval-where-context-spans
Frequency
common
Tags
rag, how, you, handle, multi, document
Answer

(1) Cross-document retrieval — retrieve top-k chunks across all docs.

Explanation

(1) Cross-document retrieval: retrieve top-k chunks across all docs. (2) Document-level metadata: tag chunks with doc ID, retrieve, then group by document for coherent context. (3) Hierarchical: first retrieve relevant documents, then relevant chunks within them. (4) Knowledge graph: link entities across documents. Example from a fund document processing system: each filing is one document; cross-filing queries use metadata filtering by fund family.

Follow-up: Can you give a production example?

What is reranking in RAG? When does it help? (medium)

Type
conceptual
Topic
is-reranking-in-rag-when-does-it-help
Frequency
common
Tags
rag, what, reranking, when, does
Answer

After ANN retrieval (approximate, optimizes for speed), a cross-encoder reranker scores each retrieved chunk precisely against the query.

Explanation

After ANN retrieval (approximate, optimizes for speed), a cross-encoder reranker scores each retrieved chunk precisely against the query. Helps when: top-k from ANN includes irrelevant chunks, query is complex/long, or precision matters over recall. Overkill for: simple single-document lookup, latency-critical paths, or when ANN already gives high precision.

Follow-up: Can you give a production example?

How do FAISS and ChromaDB differ? (medium)

Type
conceptual
Topic
do-faiss-and-chromadb-differ
Frequency
common
Tags
rag, how, faiss, and, chromadb, differ
Answer

FAISS: a library for fast similarity search (exact and ANN) with no database layer: no managed persistence, no metadata filtering, bare-bones, full control.

Explanation

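FAISS: a library for fast similarity search (exact and ANN) with no database layer: no managed persistence, no metadata filtering, bare-bones, full control. Indexes can be saved and loaded manually (write_index / read_index), but storage, metadata, and updates are up to you. Ideal for high-performance custom pipelines. ChromaDB: a vector database with persistence, metadata filtering, collections, and a Python-native API. Easier to prototype with. For production scale, consider options such as Pinecone, Weaviate, or OpenSearch with k-NN. Choose FAISS when you want to embed search directly into a custom pipeline.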

Follow-up: When would you choose one approach over the other?

What is the lost-in-the-middle problem? (medium)

Type
conceptual
Topic
is-the-lost-in-the-middle-problem
Frequency
common
Tags
rag, what, the, lost, middle
Answer

LLMs use information at the beginning and end of their context better than the middle.

Explanation

LLMs use information at the beginning and end of their context better than the middle. Mitigations: place the most important retrieved chunks at the start and end of the context; use fewer, higher-quality chunks (reranking helps); ask the model to cite specific sections, which encourages it to use the full context. Long-context models reduce but don't eliminate the problem.
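The first mitigation, placing the strongest chunks at the edges of the context, can be sketched as a simple reordering. The example assumes the input list is already sorted best first (for instance by a reranker); the function name is hypothetical.

```python
def reorder_for_long_context(chunks_by_relevance):
    """Alternate chunks between the front and the back of the context.

    Input is ordered best first. The output keeps the best chunk first,
    the second-best last, and pushes the weakest chunks toward the middle.
    """
    front, back = [], []
    for position, chunk in enumerate(chunks_by_relevance):
        if position % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]

ranked = ["c1", "c2", "c3", "c4", "c5"]  # c1 is the most relevant
print(reorder_for_long_context(ranked))  # ['c1', 'c3', 'c5', 'c4', 'c2']
```

The reordering changes only position, not content, so it combines naturally with using fewer, higher-quality chunks.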

Follow-up: Can you give a production example?

How do you update a vector index when source documents change? (hard)

Type
conceptual
Topic
do-you-update-a-vector-index-when-source-documents-change
Frequency
common
Tags
rag, how, you, update, vector, index
Answer

(1) Delete + re-insert: delete old vectors by doc ID, re-embed and insert updated chunks.

Explanation

(1) Delete + re-insert: delete old vectors by doc ID, re-embed and insert updated chunks. (2) Versioning: add version field to metadata, filter to latest. (3) Incremental indexing: only process changed docs (track last-modified timestamps in S3). (4) Scheduled full reindex for major structural changes. For real-time: S3 → EventBridge → Lambda → vector DB pipeline.
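Strategies (1) and (3) can be combined in a small sketch. It uses an in-memory dict as a stand-in for a vector DB and, as a swapped-in technique, detects changes with a content hash rather than the last-modified timestamps mentioned above. The class, its methods, and `toy_embed` are hypothetical.

```python
import hashlib

class InMemoryVectorIndex:
    """Stand-in for a vector DB that supports delete-by-doc-id and upsert."""

    def __init__(self, embed):
        self.embed = embed     # callable: text -> vector
        self.vectors = {}      # chunk_id -> (vector, metadata)
        self.doc_hashes = {}   # doc_id -> hash of the last indexed content

    def delete_document(self, doc_id):
        """Remove every chunk that belongs to doc_id."""
        self.vectors = {
            chunk_id: entry
            for chunk_id, entry in self.vectors.items()
            if entry[1]["doc_id"] != doc_id
        }
        self.doc_hashes.pop(doc_id, None)

    def upsert_document(self, doc_id, chunks):
        """Re-index doc_id only if its content changed. Returns True if re-indexed."""
        digest = hashlib.sha256("\n".join(chunks).encode("utf-8")).hexdigest()
        if self.doc_hashes.get(doc_id) == digest:
            return False  # unchanged: skip the re-embedding cost
        self.delete_document(doc_id)  # drop stale chunks before inserting new ones
        for i, chunk in enumerate(chunks):
            self.vectors[f"{doc_id}:{i}"] = (self.embed(chunk), {"doc_id": doc_id, "text": chunk})
        self.doc_hashes[doc_id] = digest
        return True

def toy_embed(text):
    """Placeholder embedding; a real system would call an embedding model."""
    return [float(len(text))]

index = InMemoryVectorIndex(toy_embed)
print(index.upsert_document("doc1", ["clause a", "clause b", "clause c"]))  # True
print(index.upsert_document("doc1", ["clause a", "clause b", "clause c"]))  # False
print(index.upsert_document("doc1", ["clause a", "clause x"]))              # True
```

Deleting before inserting matters when the new version has fewer chunks than the old one; otherwise the leftover chunks would remain retrievable.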

Follow-up: Can you give a production example?

What is parent-child chunking? (medium)

Type
conceptual
Topic
is-parent-child-chunking
Frequency
common
Tags
rag, what, parent, child, chunking
Answer

Split documents into small child chunks for precise retrieval and large parent chunks for rich context.

Explanation

Split documents into small child chunks for precise retrieval and large parent chunks for rich context. Retrieve by child chunk similarity, but return the parent chunk to the LLM. Gives retrieval precision (small chunks match queries better) + generation quality (LLM gets full context). Useful for a document extraction pipeline where a clause reference only makes sense in its full paragraph context.
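A minimal sketch of the retrieve-by-child, return-the-parent pattern follows. It assumes word-based child chunks and a keyword-overlap score as a stand-in for embedding similarity; the function names and the sample clauses are hypothetical.

```python
def build_children(parents, child_words=5):
    """Split each parent text into small child chunks that remember their parent id."""
    children = []
    for parent_id, text in parents.items():
        words = text.split()
        for start in range(0, len(words), child_words):
            children.append((parent_id, " ".join(words[start:start + child_words])))
    return children

def overlap_score(query, text):
    """Shared-word count; a stand-in for embedding similarity."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve_parents(query, children, parents, top_k=1, score=overlap_score):
    """Rank child chunks, then return the de-duplicated parent texts of the best ones."""
    ranked = sorted(children, key=lambda child: score(query, child[1]), reverse=True)
    parent_ids = []
    for parent_id, _ in ranked[:top_k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parents[parent_id] for parent_id in parent_ids]

parents = {
    "p1": "Payment terms: the borrower pays interest quarterly in arrears on the outstanding principal amount",
    "p2": "Governing law: this agreement is governed by the laws of the state of New York",
}
children = build_children(parents)
context = retrieve_parents("interest quarterly", children, parents)
print(context[0])
```

The query matches only the small child chunk "interest quarterly in arrears on", but the model receives the whole payment-terms paragraph, which is the point of the technique.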

Follow-up: Can you give a production example?

How do you choose chunk size? (medium)

Type
conceptual
Topic
do-you-choose-chunk-size
Frequency
common
Tags
rag, how, you, choose, chunk, size
Answer

Too small: lacks context, generation suffers, retrieval misses multi-sentence concepts.

Explanation

Too small: lacks context, generation suffers, retrieval misses multi-sentence concepts. Too large: retrieval is imprecise (chunk covers many topics), more noise in LLM context. A common starting range is 256-512 tokens with 10-15% overlap; the right value depends on the embedding model and the document type. Tune empirically using context precision/recall metrics. For structured documents: align chunk boundaries to logical units (clauses, sections) rather than token count.
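Fixed-size chunking with overlap can be sketched as a sliding window. The example assumes the text is already tokenized into a list (a whitespace split is a rough stand-in for a real tokenizer); the default of 256 tokens with a 32-token overlap (12.5%) is one hypothetical point inside the range above.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=32):
    """Slide a window of chunk_size tokens, stepping by chunk_size - overlap."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

tokens = list(range(10))  # stand-in for token ids
print(chunk_tokens(tokens, chunk_size=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Running a retrieval evaluation over a few `chunk_size` and `overlap` settings is a practical way to do the empirical tuning described above.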

Follow-up: Can you give a production example?

How do you use metadata filtering to narrow retrieval? (medium)

Type
conceptual
Topic
do-you-use-metadata-filtering-to-narrow-retrieval
Frequency
common
Tags
rag, how, you, use, metadata, filtering
Answer

Attach structured metadata to each chunk at index time (document type, date, section, entity name).

Explanation

Attach structured metadata to each chunk at index time (document type, date, section, entity name). At query time, pre-filter before ANN search: only search chunks where doc_type='swap' and year=2024. This reduces the search space, improves precision, and prevents cross-contamination. Example from a fund document processing system: filter by fund_name before searching portfolio data. Metadata filtering is supported natively in ChromaDB, Pinecone, and Weaviate.
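Pre-filtering can be sketched with a plain list of records. The example assumes exact-match filters and a brute-force inner-product ranking in place of ANN search; the record layout and function names are hypothetical, and the filter values mirror the doc_type='swap', year=2024 example above.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def filtered_search(query_vector, records, filters, top_k=3):
    """Keep only records whose metadata matches every filter, then rank by inner product."""
    candidates = [
        record for record in records
        if all(record["metadata"].get(key) == value for key, value in filters.items())
    ]
    candidates.sort(key=lambda record: dot(query_vector, record["vector"]), reverse=True)
    return candidates[:top_k]

records = [
    {"id": "a", "vector": [0.9, 0.1], "metadata": {"doc_type": "swap", "year": 2024}},
    {"id": "b", "vector": [1.0, 0.0], "metadata": {"doc_type": "loan", "year": 2024}},
    {"id": "c", "vector": [0.2, 0.8], "metadata": {"doc_type": "swap", "year": 2024}},
    {"id": "d", "vector": [1.0, 0.0], "metadata": {"doc_type": "swap", "year": 2023}},
]
hits = filtered_search([1.0, 0.0], records, {"doc_type": "swap", "year": 2024})
print([hit["id"] for hit in hits])  # ['a', 'c']
```

Records "b" and "d" are the closest vectors to the query, yet they never reach the prompt because they fail the filter, which is the cross-contamination protection described above.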

Follow-up: Can you give a production example?

What is HyDE (Hypothetical Document Embedding)? (medium)

Type
conceptual
Topic
is-hyde-hypothetical-document-embedding
Frequency
common
Tags
rag, what, hyde, hypothetical, document, embedding
Answer

Instead of embedding the user query directly, ask the LLM to generate a hypothetical ideal answer, then embed that.

Explanation

Instead of embedding the user query directly, ask the LLM to generate a hypothetical ideal answer, then embed that. The hypothesis is closer to the document distribution than a short query. Improves retrieval when queries are short/ambiguous and documents are long/verbose. Tradeoff: adds one LLM call per query (latency + cost). Useful for financial filings where query vocabulary differs from document vocabulary.
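The control flow of HyDE is small enough to sketch with injected dependencies. The example assumes three caller-supplied callables: `generate` (an LLM call), `embed` (an embedding model), and `search` (a vector index lookup). All three, along with the stub implementations, are hypothetical placeholders rather than any real library's API.

```python
def hyde_retrieve(query, generate, embed, search, top_k=3):
    """Embed an LLM-written hypothetical answer instead of the raw query."""
    prompt = f"Write a short passage that plausibly answers the question: {query}"
    hypothetical_answer = generate(prompt)         # the extra LLM call per query
    return search(embed(hypothetical_answer), top_k)

# Stubs that stand in for real services so the flow can be run locally.
def fake_generate(prompt):
    return "The fund's net asset value is reported in the statement of assets and liabilities."

def fake_embed(text):
    return [float(len(text.split()))]

def fake_search(vector, top_k):
    return [("chunk-1", vector)][:top_k]

results = hyde_retrieve("What is NAV?", fake_generate, fake_embed, fake_search)
print(results)
```

The hypothetical passage may contain factual errors; that is acceptable here because it is used only to find real documents, never shown to the user.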

Follow-up: Can you give a production example?

How would you build a RAG system for thousands of portfolio documents? (medium)

Type
scenario
Topic
how-would-you-build-a-rag-system-for-thousands-of-portfoli
Frequency
common
Tags
rag, how, would, you, build
Answer

Ingestion: parse PDF filings → extract structured sections (holdings, NAV, metadata) → semantic chunking → embed with Bedrock Titan → store in ChromaDB with metadata (fund_name, filing_date, section_type).

Explanation

Ingestion: parse PDF filings → extract structured sections (holdings, NAV, metadata) → semantic chunking → embed with Bedrock Titan → store in ChromaDB with metadata (fund_name, filing_date, section_type). Retrieval: filter by fund_name + date range, then ANN search. Generation: inject top-3 chunks into Strands agent prompt. Batch process new filings via Step Functions on S3 upload.

Follow-up: Can you give a production example?

What is the role of embedding dimensionality and model choice in retrieval quality? (medium)

Type
conceptual
Topic
is-the-role-of-embedding-dimensionality-and-model-choice-i
Frequency
common
Tags
rag, what, the, role, embedding, dimensionality
Answer

Higher dimensionality: more expressive but slower ANN search and more memory.

Explanation

Higher dimensionality: more expressive but slower ANN search and more memory. Common sizes: 768d (BERT-base), 1536d (OpenAI ada-002), 1024d (Cohere). Model choice matters more than dimensionality: a domain-fine-tuned 768d model often beats a general 1536d model. Benchmark with MTEB or, better, run retrieval evals on your own data. For financial documents, Cohere or fine-tuned models may outperform general-purpose embeddings, but verify this on your own corpus rather than assuming it.

Follow-up: Can you give a production example?