What is RAG? (medium)
RAG retrieves relevant external context before generating an answer.
A RAG pipeline indexes documents, retrieves relevant chunks for a query, and sends them to a model so answers can be grounded in source material.
InterviewSkill
Retrieval-augmented generation concepts for grounded AI systems.
Chunking controls what content can be retrieved and fit into context.
Chunks that are too small lose meaning; chunks that are too large add noise. Good chunking preserves semantic units.
Embeddings represent query and document meaning for semantic search.
Similar vectors are retrieved as relevant context. Embedding quality strongly affects recall and answer grounding.
Reranking reorders retrieved candidates using a stronger relevance model.
Retrieval often gets a broad candidate set quickly, while reranking improves precision before context is passed to the generator.
Bad ingestion, poor retrieval, stale data, noisy context, and weak prompts can each cause the system to fail.
RAG failures can come from every stage, so evaluation should separately measure retrieval, grounding, answer quality, and citations.
Measure whether the retriever returns relevant evidence for each query before judging the generated answer.
Use metrics like recall@k, precision@k, MRR, nDCG, and human relevance labels. Separate retrieval evaluation from answer evaluation to locate failures.
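The retrieval metrics named above can be sketched in a few lines. This is a minimal illustration that assumes binary relevance labels (a document is either relevant or not); the document IDs are made up. MRR is the mean of `reciprocal_rank` over all queries in the evaluation set.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance nDCG: DCG of this ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc in enumerate(retrieved[:k])
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical retriever output and human relevance labels for one query.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

print(recall_at_k(retrieved, relevant, 3))
print(precision_at_k(retrieved, relevant, 3))
print(reciprocal_rank(retrieved, relevant))
print(ndcg_at_k(retrieved, relevant, 4))
```

Computing these per query before looking at generated answers makes it easier to tell a retrieval failure from a generation failure.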
Hybrid search combines lexical search with vector similarity search.
Lexical search is strong for exact terms, IDs, and rare keywords. Vector search is strong for semantic matches. Combining them often improves recall.
Metadata filtering limits retrieval to context that is relevant, authorized, and fresh enough for the query.
Filters like tenant, document type, date, language, region, and permission level can prevent irrelevant or unsafe context from entering the prompt.
Built a domain-specific corpus of 32K CS keywords via the Wikipedia API (3 levels deep). Trained Word2Vec (Skip-gram) using Gensim. Represented job descriptions and resumes as averaged word vectors. Ranked candidates by cosine similarity. Cosine was chosen because it measures directional similarity regardless of vector magnitude, which makes it a better fit than Euclidean distance for dense, high-dimensional embeddings whose norms vary from text to text.
Cosine measures the angle between vectors. It is magnitude-invariant, so it is preferred for text embeddings where document length varies. Euclidean measures absolute distance and is sensitive to magnitude. In practice: cosine for semantic similarity (RAG, ranking), Euclidean/L2 for spatial tasks. FAISS supports both via IndexFlatIP (inner product, which equals cosine similarity once vectors are L2-normalized) and IndexFlatL2.
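A small worked example of the difference, using only the standard library. The two vectors are made up: they point in the same direction but differ tenfold in magnitude, as a short and a long document about the same topic might.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between a and b; grows with magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def l2_normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

short_doc = [1.0, 2.0, 3.0]
long_doc = [10.0, 20.0, 30.0]  # same direction, 10x the magnitude

cos = cosine_similarity(short_doc, long_doc)
dist = euclidean_distance(short_doc, long_doc)

# Inner product of the normalized vectors matches the cosine similarity.
inner = sum(x * y for x, y in zip(l2_normalize(short_doc), l2_normalize(long_doc)))

print(cos, dist, inner)
```

Cosine treats the two vectors as identical in direction while Euclidean distance reports them as far apart, which is why inner-product indexes are typically fed normalized vectors when cosine ranking is wanted.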
Word analogy tasks (king - man + woman = queen), word similarity benchmarks, and domain-specific nearest-neighbor inspection (is 'neural network' close to 'deep learning'?). Downstream eval: did cosine similarity ranking correlate with human recruiter judgments? Also visualized with t-SNE to verify clustering of related CS concepts.
Faithfulness: does the generated answer contain only information supported by the retrieved context? Measures hallucination. Relevance: is the answer responsive to the user's question? A response can be faithful (all claims grounded) but irrelevant (answers a different question). Measure faithfulness with NLI or LLM-judge checking claim-by-claim. Measure relevance with semantic similarity between query and answer.
RAGAS is an open-source RAG evaluation framework that computes faithfulness, answer relevancy, context precision, and context recall. To integrate it, run RAGAS on a golden Q&A dataset in your CI pipeline (for example GitHub Actions). Gate deployment if metrics drop below thresholds. Track scores over time for drift detection. Several metrics rely on an LLM as judge internally, so keep the judge model consistent across runs.
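The deployment gate itself can be a very small script. The sketch below does not call RAGAS; it assumes the evaluation step has already produced a dictionary of metric scores. The function name, threshold values, and scores are all hypothetical.

```python
def failing_metrics(scores, thresholds):
    """Return {metric: (score, minimum)} for every metric below its threshold."""
    failures = {}
    for name, minimum in thresholds.items():
        score = scores.get(name, 0.0)  # a missing metric counts as a failure
        if score < minimum:
            failures[name] = (score, minimum)
    return failures

# Hypothetical thresholds and scores; real scores would come from the evaluation run.
thresholds = {"faithfulness": 0.85, "answer_relevancy": 0.80, "context_recall": 0.75}
scores = {"faithfulness": 0.91, "answer_relevancy": 0.78, "context_recall": 0.80}

failures = failing_metrics(scores, thresholds)
exit_code = 1 if failures else 0  # a CI step would pass this to sys.exit()

for name, (score, minimum) in failures.items():
    print(f"{name}: {score:.2f} is below the minimum {minimum:.2f}")
```

A non-zero exit code is what makes the CI job fail and block the deployment.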
(1) Ingestion: load docs, chunk, embed chunks, store vectors with metadata. (2) Retrieval: embed user query, ANN search for top-k chunks. (3) Augmentation: inject retrieved chunks into LLM prompt as context. (4) Generation: LLM generates answer grounded in context. Key decisions: chunk size, overlap, embedding model, retrieval top-k, and whether to rerank before generation.
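The first three stages can be sketched end to end with toy components. Everything here is an assumption for illustration: `embed` is a bag-of-words stand-in for a real embedding model, retrieval is an exact scan instead of ANN, and the documents are invented. Stage (4) would send the returned prompt to an LLM, which is left out.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def ingest(docs, chunk_words=12):
    """(1) Ingestion: chunk each document, embed the chunks, store with metadata."""
    index = []
    for doc_id, text in docs.items():
        words = text.split()
        for start in range(0, len(words), chunk_words):
            chunk = " ".join(words[start:start + chunk_words])
            index.append({"doc_id": doc_id, "text": chunk, "vector": embed(chunk)})
    return index

def retrieve(index, query, top_k=2):
    """(2) Retrieval: embed the query and take the top-k chunks (exact scan, not ANN)."""
    q = embed(query)
    ranked = sorted(index, key=lambda e: cosine(q, e["vector"]), reverse=True)
    return ranked[:top_k]

def build_prompt(query, chunks):
    """(3) Augmentation: inject the retrieved chunks into the prompt as context."""
    context = "\n".join(f"[{c['doc_id']}] {c['text']}" for c in chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

docs = {
    "faq": "Refunds are issued within 14 days of purchase. Contact support to start a refund.",
    "policy": "Shipping takes five business days. Express shipping is available at checkout.",
}
index = ingest(docs)
question = "How long do refunds take?"
top = retrieve(index, question, top_k=1)
prompt = build_prompt(question, top)
print(prompt)
```

The key decisions listed above map onto the parameters: `chunk_words` for chunk size, `embed` for the embedding model, and `top_k` for retrieval depth.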
Fixed-size splits at a set token count — simple but can cut mid-concept. Semantic chunking splits at natural boundaries: sentences, paragraphs, or detected topic shifts (embedding similarity drops). Each chunk is a coherent unit. Used in a document extraction pipeline because contract clauses are variable-length and splitting mid-clause destroys extraction context.
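A minimal sketch of boundary-aware chunking, assuming paragraphs are separated by blank lines. It covers only the simplest case (splitting at paragraph boundaries and packing whole paragraphs up to a word budget); detecting topic shifts from embedding similarity would need an embedding model and is not shown. The contract text is invented.

```python
import re

def chunk_by_paragraph(text, max_words=80):
    """Pack whole paragraphs into chunks; never split inside a paragraph."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and current_len + n > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

contract = (
    "1. Definitions\nTerms used here have the meanings below.\n\n"
    "2. Payment Terms\nThe borrower pays interest quarterly.\n\n"
    "3. Maturity\nThe loan matures on the stated date."
)

for chunk in chunk_by_paragraph(contract, max_words=10):
    print(repr(chunk))
```

A paragraph longer than `max_words` is kept whole, on the assumption that an oversized but coherent clause is preferable to one cut mid-sentence.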
Documents chunked by clause type (header, definitions, payment terms, maturity). Each chunk embedded with Bedrock Titan or Cohere. Stored in FAISS/ChromaDB with metadata (doc ID, clause type, date). At query time: embed the target field name, retrieve top-k relevant clauses, inject into extraction prompt. Clause-type metadata filtering used to narrow search scope before ANN.
BM25 is keyword-based — great for exact term matches and domain-specific jargon. Dense retrieval uses embedding similarity — better for semantic/paraphrase matches. Hybrid (RRF: Reciprocal Rank Fusion): run both, merge ranked lists. Use hybrid when queries mix exact terms and semantic meaning, or domain vocabulary is specialized (financial terms). Recommended for production RAG.
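Reciprocal Rank Fusion needs only the ranked lists, not the raw scores, which is why it works across retrievers with incomparable score scales. A minimal sketch, assuming each retriever returns document IDs best-first; the IDs are made up, and k=60 is the commonly used smoothing constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each document scores sum(1 / (k + rank)) over the lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical outputs of a BM25 retriever and a dense retriever for one query.
bm25_ranking = ["d2", "d5", "d1"]
dense_ranking = ["d1", "d2", "d9"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['d2', 'd1', 'd5', 'd9']
```

Documents that both retrievers rank highly (d2 and d1 here) rise to the top, while documents found by only one retriever are kept but ranked lower.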
(1) Cross-document retrieval: retrieve top-k chunks across all docs. (2) Document-level metadata: tag chunks with doc ID, retrieve, then group by document for coherent context. (3) Hierarchical: first retrieve relevant documents, then relevant chunks within them. (4) Knowledge graph: link entities across documents. Example: in a fund document processing system, each filing is one document, and cross-filing queries use metadata filtering by fund family.
After ANN retrieval (approximate, optimizes for speed), a cross-encoder reranker scores each retrieved chunk precisely against the query. Helps when: top-k from ANN includes irrelevant chunks, query is complex/long, or precision matters over recall. Overkill for: simple single-document lookup, latency-critical paths, or when ANN already gives high precision.
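The two-stage shape can be sketched with a pluggable scoring function. The `overlap_score` function below is only a toy stand-in for a cross-encoder (it counts shared terms), and the candidate chunks are invented; in practice `score_fn` would wrap a real reranking model that scores each (query, chunk) pair.

```python
import re

def tokens(text):
    """Lowercased set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(query, chunk):
    """Toy stand-in for a cross-encoder: fraction of query terms found in the chunk."""
    q = tokens(query)
    return len(q & tokens(chunk)) / len(q) if q else 0.0

def rerank(query, candidates, score_fn=overlap_score, top_n=2):
    """Score every first-stage candidate against the query and keep the best top_n."""
    scored = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return scored[:top_n]

# Hypothetical first-stage (ANN) candidates for one query.
candidates = [
    "Interest is payable quarterly in arrears.",
    "The maturity date of the loan is 30 June 2030.",
    "Definitions apply throughout this agreement.",
]

best = rerank("What is the maturity date of the loan?", candidates, top_n=1)
print(best)
```

Because the reranker scores every candidate individually, its cost grows with the size of the candidate set, which is the latency tradeoff noted above.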
FAISS: a library for fast ANN search. It has no database-style persistence layer (indexes are saved and loaded explicitly, for example with write_index and read_index) and no metadata store, so it is bare-bones but gives full control. Ideal for high-performance custom pipelines. ChromaDB: a full vector DB with persistence, metadata filtering, collections, and a Python-native API. Easier to prototype with. For production scale: Pinecone, Weaviate, or OpenSearch with k-NN. FAISS fits best when you embed it into a custom pipeline.
LLMs use information at the beginning and end of their context better than the middle. Mitigation: place the most important retrieved chunks at start/end of context. Use fewer, higher-quality chunks (reranking helps). Ask the model to cite specific sections, forcing it to attend to the full context. Long-context models reduce but don't eliminate the problem.
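One way to implement the start/end placement is to alternate chunks between the front and the back of the context. This is a sketch of that idea, assuming the input list is already sorted best-first (for example by a reranker); the chunk names are placeholders.

```python
def order_for_context(chunks_best_first):
    """Alternate chunks between front and back so the weakest land in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so the second-best chunk ends up last.
    return front + back[::-1]

ordered = order_for_context(["c1", "c2", "c3", "c4", "c5"])
print(ordered)  # ['c1', 'c3', 'c5', 'c4', 'c2']
```

The best chunk (c1) opens the context, the second-best (c2) closes it, and the least relevant (c5) sits in the middle where it is most likely to be overlooked.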
(1) Delete + re-insert: delete old vectors by doc ID, re-embed and insert updated chunks. (2) Versioning: add version field to metadata, filter to latest. (3) Incremental indexing: only process changed docs (track last-modified timestamps in S3). (4) Scheduled full reindex for major structural changes. For real-time: S3 → EventBridge → Lambda → vector DB pipeline.
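Strategies (1) and (3) combine naturally: skip unchanged documents, and for changed ones delete the old chunks before inserting new ones. The sketch below uses a hypothetical in-memory class in place of a real vector store and detects changes with a content hash rather than last-modified timestamps; a real pipeline would also embed each chunk before storing it.

```python
import hashlib

class InMemoryIndex:
    """Hypothetical stand-in for a vector store keyed by document ID."""

    def __init__(self):
        self.chunks = {}        # doc_id -> list of chunk texts
        self.fingerprints = {}  # doc_id -> SHA-256 of the last indexed content

    def upsert_document(self, doc_id, text, chunk_words=50):
        """Re-index a document only if its content changed. Returns True if re-indexed."""
        fingerprint = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if self.fingerprints.get(doc_id) == fingerprint:
            return False  # unchanged: skip re-chunking and re-embedding
        self.chunks.pop(doc_id, None)  # delete all old chunks for this doc ID
        words = text.split()
        self.chunks[doc_id] = [
            " ".join(words[i:i + chunk_words])
            for i in range(0, len(words), chunk_words)
        ]
        self.fingerprints[doc_id] = fingerprint
        return True

store = InMemoryIndex()
print(store.upsert_document("filing-1", "Net asset value was 10.2 at year end."))
print(store.upsert_document("filing-1", "Net asset value was 10.2 at year end."))
print(store.upsert_document("filing-1", "Net asset value was 10.4 at year end."))
```

Deleting by document ID before inserting avoids leaving stale chunks behind when the updated document produces fewer chunks than the old one.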
Split documents into small child chunks for precise retrieval and large parent chunks for rich context. Retrieve by child chunk similarity, but return the parent chunk to the LLM. Gives retrieval precision (small chunks match queries better) + generation quality (LLM gets full context). Useful for a document extraction pipeline where a clause reference only makes sense in its full paragraph context.
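A minimal sketch of the parent-child lookup. Term overlap stands in for embedding similarity here, and the two parent paragraphs are invented; the point is only the indirection from the matched child back to its parent.

```python
import re

def tokens(text):
    """Lowercased set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def build_children(parents, child_words=8):
    """Split each parent into small child chunks that remember their parent ID."""
    children = []
    for parent_id, text in parents.items():
        words = text.split()
        for start in range(0, len(words), child_words):
            children.append((parent_id, " ".join(words[start:start + child_words])))
    return children

def retrieve_parent(query, children, parents):
    """Match the query against child chunks, but return the full parent text."""
    q = tokens(query)
    # Term overlap is a toy stand-in for embedding similarity.
    parent_id, _ = max(children, key=lambda c: len(q & tokens(c[1])))
    return parents[parent_id]

parents = {
    "p1": "Section 4 defines the Maturity Date. The Maturity Date is the date "
          "on which all outstanding principal becomes due.",
    "p2": "Section 7 covers fees. The arrangement fee is payable on the closing "
          "date of the facility.",
}
children = build_children(parents)
context = retrieve_parent("arrangement fee", children, parents)
print(context)
```

The query matches a short child chunk, yet the model receives the whole paragraph, including the sentence about when the fee is payable.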
Too small: lacks context, generation suffers, retrieval misses multi-sentence concepts. Too large: retrieval is imprecise (chunk covers many topics), more noise in LLM context. Typical range: 256-512 tokens with 10-15% overlap. Tune empirically using context precision/recall metrics. For structured documents: align chunk boundaries to logical units (clauses, sections) rather than token count.
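Fixed-size chunking with overlap is a sliding window whose step is the chunk size minus the overlap. A minimal sketch, assuming the text has already been tokenized into a list; the sizes shown (512 tokens, 64 overlap, which is 12.5%) are just one point inside the range mentioned above.

```python
def chunk_with_overlap(token_ids, chunk_size=512, overlap=64):
    """Slide a window of chunk_size over the tokens, stepping by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # the window already reached the end of the document
    return chunks

token_ids = list(range(1000))  # placeholder for real tokenizer output
chunks = chunk_with_overlap(token_ids, chunk_size=512, overlap=64)
print([len(c) for c in chunks])  # [512, 512, 104]
```

Running the same document through several `chunk_size` and `overlap` settings and comparing context precision and recall is the empirical tuning loop described above.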
Attach structured metadata to each chunk at index time (document type, date, section, entity name). At query time, pre-filter before ANN search: only search chunks where doc_type='swap' and year=2024. This reduces the search space, improves precision, and prevents cross-contamination between unrelated documents. Example: in a fund document processing system, filter by fund_name before searching portfolio data. Supported natively in ChromaDB, Pinecone, and Weaviate.
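The pre-filter step can be pictured as a plain equality filter over chunk metadata that runs before any similarity search. This sketch uses an in-memory list and a hypothetical `prefilter` helper rather than any particular vector database's filter syntax; the chunk records are invented.

```python
def prefilter(chunks, **required):
    """Keep only chunks whose metadata matches every required key/value pair."""
    return [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in required.items())
    ]

# Hypothetical indexed chunks with metadata attached at index time.
chunks = [
    {"id": "c1", "metadata": {"doc_type": "swap", "year": 2024}},
    {"id": "c2", "metadata": {"doc_type": "swap", "year": 2023}},
    {"id": "c3", "metadata": {"doc_type": "loan", "year": 2024}},
]

candidates = prefilter(chunks, doc_type="swap", year=2024)
print([c["id"] for c in candidates])  # ['c1']
```

Similarity search then runs only over `candidates`, so a chunk from the wrong document type or year cannot reach the prompt regardless of how similar its text is.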
Instead of embedding the user query directly, ask the LLM to generate a hypothetical ideal answer, then embed that. The hypothesis is closer to the document distribution than a short query. Improves retrieval when queries are short/ambiguous and documents are long/verbose. Tradeoff: adds one LLM call per query (latency + cost). Useful for financial filings where query vocabulary differs from document vocabulary.
Ingestion: parse PDF filings → extract structured sections (holdings, NAV, metadata) → semantic chunking → embed with Bedrock Titan → store in ChromaDB with metadata (fund_name, filing_date, section_type). Retrieval: filter by fund_name + date range, then ANN search. Generation: inject top-3 chunks into Strands agent prompt. Batch process new filings via Step Functions on S3 upload.
Higher dimensionality: more expressive but slower ANN search and more memory. Common sizes: 768d (BERT-base), 1536d (OpenAI ada-002), 1024d (Cohere Embed v3). Model choice matters more than dimensionality: a domain-fine-tuned 768d model often beats a general 1536d model. Benchmark with MTEB or run retrieval evals on your own data. For financial documents, Cohere or domain-fine-tuned models may outperform general-purpose embeddings, but confirm this with retrieval evals on your own corpus.
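The memory side of the tradeoff is simple arithmetic. This sketch assumes a flat index that stores every vector as uncompressed float32 (4 bytes per value) and ignores any extra structures an ANN index adds; the one-million-vector corpus size is an arbitrary example.

```python
def flat_index_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for uncompressed vectors: count x dimensions x bytes per value."""
    return num_vectors * dim * bytes_per_value

for dim in (768, 1024, 1536):
    gigabytes = flat_index_bytes(1_000_000, dim) / 1e9
    print(f"{dim}d: {gigabytes:.2f} GB per million vectors")
```

Doubling the dimensionality doubles the raw storage, so a 1536d model needs twice the memory of a 768d model for the same corpus before any quantization or compression.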