RAG Interview Prep

What is RAG? (medium)

Type: conceptual
Topic: rag
Frequency: common
Tags: rag

Answer

RAG retrieves relevant external context before generating an answer.

Explanation

A RAG pipeline indexes documents, retrieves relevant chunks for a query, and sends them to a model so answers can be grounded in source material.

Follow-up: When is RAG better than fine-tuning?

Why does chunking matter? (medium)

Type: conceptual
Topic: chunking-matter
Frequency: common
Tags: chunking, matter

Answer

Chunking controls what content can be retrieved and fit into context.

Explanation

Chunks that are too small lose meaning; chunks that are too large add noise. Good chunking preserves semantic units.

Follow-up: How do you choose chunk size?

What are embeddings used for in RAG? (hard)

Type: conceptual
Topic: embeddings-used-rag
Frequency: common
Tags: embeddings, used, rag

Answer

Embeddings represent query and document meaning for semantic search.

Explanation

Similar vectors are retrieved as relevant context. Embedding quality strongly affects recall and answer grounding.

Follow-up: How do you evaluate retrieval quality?

What is reranking? (medium)

Type: conceptual
Topic: reranking
Frequency: common
Tags: reranking

Answer

Reranking reorders retrieved candidates using a stronger relevance model.

Explanation

Retrieval often gets a broad candidate set quickly, while reranking improves precision before context is passed to the generator.

Follow-up: What tradeoff does reranking introduce?

What can go wrong in a RAG pipeline? (medium)

Type: conceptual
Topic: go-wrong-rag-pipeline
Frequency: common
Tags: go, wrong, rag, pipeline

Answer

Bad ingestion, poor retrieval, stale data, noisy context, and weak prompts can each cause the system to fail.

Explanation

RAG failures can come from every stage, so evaluation should separately measure retrieval, grounding, answer quality, and citations.

Follow-up: How would you debug a wrong RAG answer?

How do you evaluate retrieval quality in RAG? (hard)

Type: scenario
Topic: retrieval-evaluation
Frequency: common
Tags: retrieval, evaluation, recall-at-k

Answer

Measure whether the retriever returns relevant evidence for each query before judging the generated answer.

Explanation

Use metrics like recall@k, precision@k, MRR, nDCG, and human relevance labels. Separate retrieval evaluation from answer evaluation to locate failures.
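
As a rough illustration of three of these metrics, here is a minimal sketch in plain Python. The chunk IDs and the relevance labels are made up for the example; in practice they would come from a labeled evaluation set.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant IDs that appear in the top-k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    relevant = set(relevant)
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved.

    Averaging this value over all queries gives MRR.
    """
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical retriever output (best first) and human relevance labels.
retrieved = ["c7", "c2", "c9", "c4"]
relevant = {"c2", "c4", "c5"}

print(recall_at_k(retrieved, relevant, 3))     # 1 of 3 relevant chunks found
print(precision_at_k(retrieved, relevant, 3))  # 1 of 3 results is relevant
print(reciprocal_rank(retrieved, relevant))    # first hit is at rank 2
```

Computing these per query, before looking at any generated answer, is what makes it possible to tell a retrieval failure from a generation failure.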

Follow-up: Why can answer quality be poor even when retrieval looks good?

What is hybrid search in RAG? (medium)

Type: conceptual
Topic: hybrid-search
Frequency: common
Tags: hybrid-search, bm25, vectors

Answer

Hybrid search combines lexical search with vector similarity search.

Explanation

Lexical search is strong for exact terms, IDs, and rare keywords. Vector search is strong for semantic matches. Combining them often improves recall.

Follow-up: When would BM25 outperform vector search?

Why does metadata filtering matter in RAG? (medium)

Type: scenario
Topic: metadata-filtering
Frequency: common
Tags: metadata, filtering, permissions

Answer

It limits retrieval to context that is relevant, authorized, and fresh enough for the query.

Explanation

Filters like tenant, document type, date, language, region, and permission level can prevent irrelevant or unsafe context from entering the prompt.

Follow-up: What can go wrong if metadata is missing or stale?

How do you build semantic similarity ranking? What distance metric? (medium)

Type: scenario
Topic: how-do-you-build-semantic-similarity-ranking-what-distance
Frequency: common
Tags: rag, how, did, you, build, semantic

Answer

Represent each text as an averaged Word2Vec vector and rank candidates by cosine similarity.

Explanation

Build a domain-specific corpus (for example, 32K CS keywords collected via the Wikipedia API, 3 levels deep). Train Word2Vec (Skip-gram) using Gensim. Represent job descriptions and resumes as averaged word vectors, then rank candidates by cosine similarity. Cosine fits here because it measures directional similarity regardless of vector magnitude, and the magnitude of an averaged vector varies with document length and word frequency rather than with meaning. Note that Word2Vec embeddings are dense, not sparse.

Follow-up: What tradeoffs did you consider in that implementation?

Cosine similarity vs Euclidean distance in embedding space? (medium)

Type: conceptual
Topic: cosine-similarity-vs-euclidean-distance-in-embedding-space
Frequency: common
Tags: rag, cosine, similarity, euclidean, distance, embedding

Answer

Cosine measures the angle between vectors: it is magnitude-invariant and preferred for text embeddings, where vector norm can vary with document length.

Explanation

Euclidean (L2) distance measures absolute distance, so it is sensitive to magnitude. In practice, use cosine for semantic similarity (RAG, ranking) and Euclidean/L2 when magnitude carries meaning, such as spatial tasks. On L2-normalized vectors the two agree: inner product equals cosine similarity, and squared L2 distance equals 2 - 2·cosine, so both produce the same ranking. FAISS supports both via IndexFlatIP (inner product, which equals cosine once vectors are normalized) and IndexFlatL2.
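
To make the magnitude point concrete, here is a small sketch using only the standard library. The vectors are invented for illustration: `long_doc` points in the same direction as `short_doc` but is ten times larger.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    """Scale a vector to unit L2 norm."""
    n = norm(a)
    return [x / n for x in a]

short_doc = [1.0, 2.0, 0.5]
long_doc = [10.0, 20.0, 5.0]   # same direction, 10x the magnitude
other = [2.0, 0.5, 1.0]

# Cosine ignores the scale difference; Euclidean distance does not.
print(cosine_similarity(short_doc, long_doc))
print(euclidean_distance(short_doc, long_doc))

# After L2 normalization, inner product is the cosine similarity,
# and squared L2 distance is 2 - 2 * cosine.
u, v = normalize(short_doc), normalize(other)
print(dot(u, v), cosine_similarity(short_doc, other))
print(euclidean_distance(u, v) ** 2, 2 - 2 * dot(u, v))
```

This is why normalizing vectors before indexing makes the choice between an inner-product index and an L2 index a matter of convenience rather than ranking quality.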

Follow-up: When would you choose one approach over the other?

How do you evaluate Word2Vec embedding quality on a domain corpus? (medium)

Type: scenario
Topic: trained-word2vec-on-a-32k-cs-corpus-how-did-you-evaluate-e
Frequency: common
Tags: rag, you, trained, word2vec, 32k, corpus

Answer

Word analogy tasks (king - man + woman ≈ queen), word similarity benchmarks, and domain-specific nearest-neighbor inspection (is 'neural network' close to 'deep learning'?).

Explanation

These are intrinsic checks. Generic analogy sets may fall outside the vocabulary of a domain corpus, so domain-specific nearest-neighbor inspection is often the more informative of the three. Downstream evaluation: does the cosine similarity ranking correlate with human recruiter judgments? A t-SNE plot can serve as a qualitative check that related CS concepts cluster together, though t-SNE distorts distances and should not be read quantitatively.

Follow-up: What tradeoffs did you consider in that implementation?

Faithfulness vs relevance in RAG evaluation? (medium)

Type: conceptual
Topic: faithfulness-vs-relevance-in-rag-evaluation
Frequency: common
Tags: rag, faithfulness, relevance, evaluation

Answer

Faithfulness: does the generated answer contain only information supported by the retrieved context?

Explanation

Faithfulness is the check for hallucination. Relevance: is the answer responsive to the user's question? A response can be faithful (all claims grounded) but irrelevant (it answers a different question), or relevant but unfaithful (on topic but unsupported by the context). Measure faithfulness claim by claim with an NLI model or an LLM judge. Measure relevance with semantic similarity between query and answer, or with an LLM judge.

Follow-up: When would you choose one approach over the other?

What is RAGAS? How would you integrate it into CI/CD? (hard)

Type: scenario
Topic: is-ragas-how-would-you-integrate-it-into-ci-cd
Frequency: common
Tags: rag, what, ragas, how, would, you

Answer

RAGAS is an open-source RAG evaluation framework computing: faithfulness, answer relevancy, context precision, and context recall.

Explanation

Integration: run RAGAS on a golden Q&A dataset in your CI pipeline (for example, GitHub Actions) and block deployment if any metric drops below its threshold. Track scores over time to detect drift. Several RAGAS metrics rely on an LLM judge internally, so pin the judge model and version; otherwise a score change may reflect the judge rather than the pipeline.
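
The gating step itself can be sketched without touching the evaluation library. The snippet below assumes the four metric scores have already been computed and collected into a dict; the threshold values, the function names, and the example scores are all illustrative, not part of RAGAS.

```python
# Illustrative minimums; tune them against your own baseline runs.
THRESHOLDS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_precision": 0.75,
    "context_recall": 0.75,
}

def failing_metrics(scores, thresholds=THRESHOLDS):
    """Return {metric: observed score} for every metric below its minimum.

    A metric missing from `scores` counts as a failure (value None).
    """
    failures = {}
    for name, minimum in thresholds.items():
        value = scores.get(name)
        if value is None or value < minimum:
            failures[name] = value
    return failures

def gate_exit_code(scores, thresholds=THRESHOLDS):
    """Exit code for the CI step: 0 passes, 1 blocks the deployment."""
    return 1 if failing_metrics(scores, thresholds) else 0

# Hypothetical scores, standing in for the output of an evaluation run.
example_scores = {
    "faithfulness": 0.91,
    "answer_relevancy": 0.78,
    "context_precision": 0.80,
    "context_recall": 0.77,
}

print(failing_metrics(example_scores))  # answer_relevancy is below 0.80
print(gate_exit_code(example_scores))   # non-zero, so the job would fail
```

In a CI job, the exit code would be passed to `sys.exit` so that the pipeline stops on a regression.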

Follow-up: Can you give a production example?

Walk me through the full RAG pipeline. (medium)

Type: scenario
Topic: me-through-the-full-rag-pipeline
Frequency: common
Tags: rag, walk, through, the, full

Answer

(1) Ingestion: load docs, chunk, embed chunks, store vectors with metadata.

Explanation

(1) Ingestion: load docs, chunk, embed chunks, store vectors with metadata. (2) Retrieval: embed user query, ANN search for top-k chunks. (3) Augmentation: inject retrieved chunks into LLM prompt as context. (4) Generation: LLM generates answer grounded in context. Key decisions: chunk size, overlap, embedding model, retrieval top-k, and whether to rerank before generation.

Follow-up: What tradeoffs did you consider in that implementation?

What is semantic chunking? How does it differ from fixed-size? (medium)

Type: conceptual
Topic: is-semantic-chunking-how-does-it-differ-from-fixed-size
Frequency: common
Tags: rag, what, semantic, chunking, how, does

Answer

Fixed-size splits at a set token count — simple but can cut mid-concept.

Explanation

Semantic chunking splits at natural boundaries: sentences, paragraphs, or detected topic shifts (points where the embedding similarity between adjacent segments drops), so each chunk is a coherent unit. Example: in a document extraction pipeline, contract clauses are variable-length, and splitting mid-clause destroys the context needed for extraction.
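
A toy sketch of splitting on similarity drops follows. To stay self-contained it uses word-set Jaccard overlap between adjacent sentences as a stand-in for embedding similarity; a real pipeline would compare sentence embeddings instead. The sentences and the 0.2 threshold are assumptions chosen for the example.

```python
def jaccard(a, b):
    """Word-set overlap between two sentences (stand-in for embedding similarity)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def semantic_chunks(sentences, threshold=0.2):
    """Start a new chunk wherever adjacent-sentence similarity falls below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if jaccard(prev, curr) < threshold:
            chunks.append([curr])      # similarity dropped: topic shift
        else:
            chunks[-1].append(curr)    # same topic: extend current chunk
    return [" ".join(chunk) for chunk in chunks]

sentences = [
    "The borrower shall pay interest quarterly",
    "Interest shall accrue daily on the unpaid balance",
    "This agreement terminates on the maturity date",
    "The maturity date is five years after closing",
]

for chunk in semantic_chunks(sentences):
    print(chunk)
```

With these inputs the split lands between the interest sentences and the maturity sentences, so each clause stays intact in its own chunk.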

Follow-up: When would you choose one approach over the other?

How do you design a retrieval layer for document extraction? (medium)

Type: scenario
Topic: how-do-you-design-a-retrieval-layer-for-document-extractio
Frequency: common
Tags: rag, how, did, you, design, the

Answer

Chunk documents by clause type (header, definitions, payment terms, maturity).

Explanation

Embed each chunk with a model such as Bedrock Titan or Cohere, and store it in FAISS or ChromaDB with metadata (doc ID, clause type, date). With FAISS, the metadata has to live in a side store keyed by vector ID, since FAISS holds only vectors. At query time, embed the target field name, retrieve the top-k relevant clauses, and inject them into the extraction prompt. Filter on clause-type metadata to narrow the search scope before ANN search.

Follow-up: What tradeoffs did you consider in that implementation?

Sparse retrieval (BM25) vs dense retrieval (ANN): when to go hybrid? (medium)

Type: conceptual
Topic: sparse-retrieval-bm25-vs-dense-retrieval-ann-when-to-hybri
Frequency: common
Tags: rag, sparse, retrieval, bm25, dense

Answer

BM25 is keyword-based: great for exact term matches and domain-specific jargon.

Explanation

Dense retrieval uses embedding similarity, which is better for semantic and paraphrase matches. Hybrid retrieval runs both and merges the ranked lists, commonly with Reciprocal Rank Fusion (RRF). Use hybrid when queries mix exact terms and semantic meaning, or when domain vocabulary is specialized (for example, financial terms). It is a common default for production RAG, though the gain should be confirmed with retrieval evals on your own data.
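
Reciprocal Rank Fusion is simple enough to sketch directly: each document scores the sum of 1 / (k + rank) over the lists it appears in. The two input rankings below are invented, and k=60 is the commonly used constant rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several best-first ranked lists of doc IDs into one fused list.

    Each document receives sum(1 / (k + rank)) over the lists it appears in.
    Ties are broken by doc ID so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical outputs of a BM25 retriever and a dense retriever.
bm25_ranking = ["d3", "d1", "d7"]
dense_ranking = ["d1", "d5", "d3"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # documents found by both retrievers rise to the top
```

Because RRF uses only ranks, it avoids having to calibrate BM25 scores against cosine similarities, which live on different scales.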

Follow-up: When would you choose one approach over the other?

How do you handle multi-document retrieval where context spans multiple files? (medium)

Type: scenario
Topic: do-you-handle-multi-document-retrieval-where-context-spans
Frequency: common
Tags: rag, how, you, handle, multi, document

Answer

(1) Cross-document retrieval — retrieve top-k chunks across all docs.

Explanation

(1) Cross-document retrieval: retrieve top-k chunks across all docs.
(2) Document-level metadata: tag chunks with doc ID, retrieve, then group by document for coherent context.
(3) Hierarchical: first retrieve relevant documents, then relevant chunks within them.
(4) Knowledge graph: link entities across documents.
Example: in a fund document processing system, each filing is one document, and cross-filing queries use metadata filtering by fund family.

Follow-up: Can you give a production example?

What is reranking in RAG? When does it help? (medium)

Type: conceptual
Topic: is-reranking-in-rag-when-does-it-help
Frequency: common
Tags: rag, what, reranking, when, does

Answer

After ANN retrieval (approximate, optimizes for speed), a cross-encoder reranker scores each retrieved chunk precisely against the query.

Explanation

Reranking helps when the top-k from ANN includes irrelevant chunks, the query is complex or long, or precision matters more than recall. It is overkill for simple single-document lookup, latency-critical paths, or cases where ANN already gives high precision.

Follow-up: Can you give a production example?

How do FAISS and ChromaDB differ? (medium)

Type: conceptual
Topic: do-faiss-and-chromadb-differ
Frequency: common
Tags: rag, how, faiss, and, chromadb, differ

Answer

FAISS is a library for fast ANN search: no database-style persistence layer, no built-in metadata store, bare-bones, full control.

Explanation

A FAISS index can be saved and loaded with faiss.write_index and faiss.read_index, but document text and metadata must be stored separately. FAISS is a good fit when you embed the index directly into a high-performance custom pipeline. ChromaDB is a vector database with persistence, metadata filtering, collections, and a Python-native API, which makes it easier to prototype with. For larger production deployments, options include Pinecone, Weaviate, or OpenSearch with k-NN.

Follow-up: When would you choose one approach over the other?

What is the lost-in-the-middle problem? (medium)

Type: conceptual
Topic: is-the-lost-in-the-middle-problem
Frequency: common
Tags: rag, what, the, lost, middle

Answer

LLMs use information at the beginning and end of their context better than the middle.

Explanation

Mitigations: place the most important retrieved chunks at the start or end of the context; use fewer, higher-quality chunks (reranking helps); ask the model to cite specific sections, which encourages it to attend to the full context. Long-context models reduce but don't eliminate the problem.
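
One way to place the strongest chunks at the edges is to alternate them between the front and the back of the context. The sketch below assumes the chunks arrive sorted best-first (for example, from a reranker); the function name and the letter labels are illustrative.

```python
def reorder_for_edges(chunks_best_first):
    """Arrange chunks so the strongest sit at the start and end of the context.

    Even-ranked chunks (1st, 3rd, ...) fill the front in order; odd-ranked
    chunks (2nd, 4th, ...) fill the back in reverse, so the weakest chunks
    end up in the middle.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]

ranked = ["A", "B", "C", "D", "E"]   # "A" is the most relevant chunk
print(reorder_for_edges(ranked))     # best chunk first, second-best last
```

Whether this reordering helps depends on the model and prompt, so it is worth checking against answer-quality metrics rather than assuming a gain.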

Follow-up: Can you give a production example?

How do you update a vector index when source documents change? (hard)

Type: conceptual
Topic: do-you-update-a-vector-index-when-source-documents-change
Frequency: common
Tags: rag, how, you, update, vector, index

Answer

(1) Delete + re-insert: delete old vectors by doc ID, re-embed and insert updated chunks.

Explanation

(1) Delete + re-insert: delete old vectors by doc ID, then re-embed and insert the updated chunks.
(2) Versioning: add a version field to metadata and filter to the latest version.
(3) Incremental indexing: process only changed docs (for example, by tracking last-modified timestamps in S3).
(4) Scheduled full reindex: for major structural changes.
For real-time updates: S3 → EventBridge → Lambda → vector DB pipeline.
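
Strategies (1) and (3) can be combined in a few lines. The sketch below is an in-memory toy, not a real vector database: `TinyIndex`, `fake_embed`, and the chunk ID scheme are all hypothetical, and change detection uses a content hash instead of the last-modified timestamps mentioned above.

```python
import hashlib

class TinyIndex:
    """In-memory stand-in for a vector store that supports delete-by-doc-ID."""

    def __init__(self):
        self.entries = {}      # chunk_id -> {"doc_id", "text", "vector"}
        self.doc_hashes = {}   # doc_id -> hash of the last indexed content

    def upsert_document(self, doc_id, chunks, embed):
        """Re-index a document only if its content changed.

        Returns True if the index was modified, False if the doc was skipped.
        """
        content = "\n".join(chunks).encode("utf-8")
        content_hash = hashlib.sha256(content).hexdigest()
        if self.doc_hashes.get(doc_id) == content_hash:
            return False  # unchanged: skip re-embedding

        # Delete every old chunk of this document, then insert the new ones.
        self.entries = {
            cid: entry for cid, entry in self.entries.items()
            if entry["doc_id"] != doc_id
        }
        for i, text in enumerate(chunks):
            self.entries[f"{doc_id}#{i}"] = {
                "doc_id": doc_id,
                "text": text,
                "vector": embed(text),
            }
        self.doc_hashes[doc_id] = content_hash
        return True

def fake_embed(text):
    """Placeholder embedding; a real system would call an embedding model."""
    return [float(len(text))]

index = TinyIndex()
print(index.upsert_document("filing-1", ["old clause A", "old clause B"], fake_embed))
print(index.upsert_document("filing-1", ["old clause A", "old clause B"], fake_embed))
print(index.upsert_document("filing-1", ["new clause A"], fake_embed))
print(sorted(index.entries))
```

Deleting by doc ID before inserting matters because an updated document may produce fewer chunks than before, and stale chunks would otherwise remain retrievable.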

Follow-up: Can you give a production example?

What is parent-child chunking? (medium)

Type: conceptual
Topic: is-parent-child-chunking
Frequency: common
Tags: rag, what, parent, child, chunking

Answer

Split documents into small child chunks for precise retrieval and large parent chunks for rich context.

Explanation

Retrieve by child chunk similarity, but return the parent chunk to the LLM. This gives retrieval precision (small chunks match queries better) plus generation quality (the LLM gets full context). Useful in a document extraction pipeline, where a clause reference only makes sense in its full paragraph context.
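
A minimal sketch of the lookup: children are scored against the query, and their parents are what gets returned. Word overlap stands in for embedding similarity here, and the contract text, IDs, and function names are made up for the example.

```python
def word_overlap(query, text):
    """Count shared lowercase words (stand-in for embedding similarity)."""
    return len(set(query.lower().split()) & set(text.lower().split()))

# Large parent chunks hold the full context handed to the LLM.
parents = {
    "p1": "Payment terms. The borrower pays interest quarterly. "
          "Late payments accrue a penalty.",
    "p2": "Maturity. The loan matures five years after the closing date.",
}

# Small child chunks are what the retriever actually matches against.
children = [
    {"parent_id": "p1", "text": "The borrower pays interest quarterly."},
    {"parent_id": "p1", "text": "Late payments accrue a penalty."},
    {"parent_id": "p2", "text": "The loan matures five years after the closing date."},
]

def retrieve_parents(query, children, parents, top_k=2):
    """Rank children by similarity, then return their de-duplicated parents."""
    ranked = sorted(
        children,
        key=lambda child: word_overlap(query, child["text"]),
        reverse=True,
    )
    seen, results = set(), []
    for child in ranked[:top_k]:
        parent_id = child["parent_id"]
        if parent_id not in seen:
            seen.add(parent_id)
            results.append(parents[parent_id])
    return results

context = retrieve_parents("when does the loan mature", children, parents, top_k=1)
print(context)
```

De-duplicating parents matters because several children of the same parent often match the same query, and sending the parent twice wastes context.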

Follow-up: Can you give a production example?

How do you choose chunk size? (medium)

Type: conceptual
Topic: do-you-choose-chunk-size
Frequency: common
Tags: rag, how, you, choose, chunk, size

Answer

Too small: lacks context, generation suffers, retrieval misses multi-sentence concepts.

Explanation

Too large: retrieval is imprecise (the chunk covers many topics) and adds noise to the LLM context. A common starting point is 256-512 tokens with 10-15% overlap; tune empirically using context precision/recall metrics. For structured documents, align chunk boundaries to logical units (clauses, sections) rather than a token count.
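
For reference, fixed-size chunking with overlap looks roughly like this. The sketch works on a list of tokens; the `t0`..`t9` tokens are placeholders, and a real pipeline would use the tokenizer of its embedding model.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks

tokens = [f"t{i}" for i in range(10)]
for chunk in chunk_tokens(tokens, chunk_size=4, overlap=1):
    print(chunk)
```

Each window starts `chunk_size - overlap` tokens after the previous one, so the last token of one chunk reappears as the first token of the next.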

Follow-up: Can you give a production example?

How do you use metadata filtering to narrow retrieval? (medium)

Type: conceptual
Topic: do-you-use-metadata-filtering-to-narrow-retrieval
Frequency: common
Tags: rag, how, you, use, metadata, filtering

Answer

Attach structured metadata to each chunk at index time (document type, date, section, entity name).

Explanation

At query time, pre-filter before ANN search: for example, only search chunks where doc_type='swap' and year=2024. This reduces the search space, improves precision, and prevents cross-contamination between unrelated documents. Example: in a fund document processing system, filter by fund_name before searching portfolio data. Metadata filtering is supported natively in ChromaDB, Pinecone, and Weaviate.
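
The pre-filter pattern can be sketched in plain Python. Exact dot-product search stands in for ANN here, and the chunk records, field names, and `filtered_search` function are assumptions for illustration rather than any vector database's API.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def filtered_search(query_vec, chunks, filters, top_k=3):
    """Keep only chunks whose metadata matches every filter, then rank by similarity."""
    candidates = [
        chunk for chunk in chunks
        if all(chunk["meta"].get(key) == value for key, value in filters.items())
    ]
    candidates.sort(key=lambda chunk: dot(query_vec, chunk["vector"]), reverse=True)
    return candidates[:top_k]

chunks = [
    {"id": "a", "vector": [0.9, 0.1], "meta": {"doc_type": "swap", "year": 2024}},
    {"id": "b", "vector": [1.0, 0.0], "meta": {"doc_type": "bond", "year": 2024}},
    {"id": "c", "vector": [0.2, 0.8], "meta": {"doc_type": "swap", "year": 2024}},
    {"id": "d", "vector": [0.95, 0.05], "meta": {"doc_type": "swap", "year": 2023}},
]

results = filtered_search([1.0, 0.0], chunks, {"doc_type": "swap", "year": 2024})
ids = [chunk["id"] for chunk in results]
print(ids)
```

Chunk "b" has the highest raw similarity to the query but never enters the candidate set, which is the cross-contamination guard the filter provides.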

Follow-up: Can you give a production example?

What is HyDE (Hypothetical Document Embeddings)? (medium)

Type: conceptual
Topic: is-hyde-hypothetical-document-embedding
Frequency: common
Tags: rag, what, hyde, hypothetical, document, embedding

Answer

Instead of embedding the user query directly, ask the LLM to generate a hypothetical ideal answer, then embed that.

Explanation

The hypothetical answer is closer to the document distribution than a short query, even if its facts are wrong; it is used only for retrieval. This improves retrieval when queries are short or ambiguous and documents are long or verbose. Tradeoff: one extra LLM call per query (latency + cost). Useful for financial filings, where query vocabulary differs from document vocabulary.
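
The control flow is small enough to show with stand-ins. In the sketch below, `fake_generate` is a hard-coded placeholder for the LLM call, and a word-set "embedding" with overlap scoring replaces a real embedding model; the documents and query are invented to show a vocabulary gap.

```python
def embed(text):
    """Toy embedding: the set of lowercase words (stand-in for a real model)."""
    return set(text.lower().split())

def search(query_embedding, docs, top_k=1):
    """Rank doc IDs by word overlap with the query embedding."""
    ranked = sorted(
        docs,
        key=lambda doc_id: len(query_embedding & embed(docs[doc_id])),
        reverse=True,
    )
    return ranked[:top_k]

def fake_generate(query):
    """Placeholder for an LLM call that drafts a hypothetical answer."""
    return "The fund charges an expense ratio and a management fee"

def hyde_search(query, docs, generate, top_k=1):
    """Embed a generated hypothetical answer instead of the raw query."""
    hypothetical = generate(query)
    return search(embed(hypothetical), docs, top_k)

docs = {
    "d1": "The expense ratio includes a management fee of 0.5 percent",
    "d2": "The fund invests primarily in large cap equities",
}
query = "how much does the fund charge"

direct = search(embed(query), docs)
via_hyde = hyde_search(query, docs, fake_generate)
print(direct, via_hyde)
```

The raw query shares more words with the wrong document, while the hypothetical answer uses the vocabulary of the fee disclosure and retrieves it.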

Follow-up: Can you give a production example?

How would you build a RAG system for thousands of portfolio documents? (medium)

Type: scenario
Topic: how-would-you-build-a-rag-system-for-thousands-of-portfoli
Frequency: common
Tags: rag, how, would, you, build

Answer

Ingestion: parse PDF filings → extract structured sections (holdings, NAV, metadata) → semantic chunking → embed with Bedrock Titan → store in ChromaDB with metadata (fund_name, filing_date, section_type).

Explanation

Ingestion: parse PDF filings → extract structured sections (holdings, NAV, metadata) → semantic chunking → embed with Bedrock Titan → store in ChromaDB with metadata (fund_name, filing_date, section_type). Retrieval: filter by fund_name + date range, then ANN search. Generation: inject top-3 chunks into Strands agent prompt. Batch process new filings via Step Functions on S3 upload.

Follow-up: Can you give a production example?

What is the role of embedding dimensionality and model choice in retrieval quality? (medium)

Type: conceptual
Topic: is-the-role-of-embedding-dimensionality-and-model-choice-i
Frequency: common
Tags: rag, what, the, role, embedding, dimensionality

Answer

Higher dimensionality: more expressive but slower ANN search and more memory.

Explanation

Common sizes: 768d (BERT-base), 1536d (OpenAI ada-002), 1024d (Cohere embed v3). Model choice usually matters more than dimensionality: a domain-fine-tuned 768d model often beats a general 1536d model. Benchmark with MTEB or run retrieval evals on your own data. For financial documents, a domain-fine-tuned model may outperform general-purpose embeddings, but verify this on your own corpus rather than assuming it.

Follow-up: Can you give a production example?
