Context Engineering & Advanced RAG Handbook
A systems-focused guide to embeddings, chunking, retrieval, re-ranking, graph-aware context, and evaluation patterns that maximize signal-to-noise ratio inside a finite LLM context window.
Module 1: The Foundations of Context
Prompt engineering tells the model how to behave. Context engineering decides what evidence the model sees at runtime. That is the shift. In practical systems, the quality of retrieval, chunking, ranking, and filtering often matters more than the exact phrasing of the instruction prompt.
Prompt vs. Context
A useful analogy is a courtroom. The prompt is the judge's instruction to the jury: follow the rules, use this format, answer this question. The context is the evidence presented during the case. If the evidence is noisy, incomplete, or irrelevant, even a well-worded instruction cannot rescue the answer.
- Prompting defines tone, constraints, output shape, and safety behavior.
- Context provides the live documents, records, snippets, or graph facts needed for the task.
- Context engineering is the discipline of deciding what to retrieve, what to omit, and how to order it before it touches the model.
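To make the discipline concrete, here is a minimal sketch of budgeted context assembly: given snippets already ranked by relevance, keep the best ones that fit. The snippet texts and the rough 4-characters-per-token estimate are illustrative assumptions, not any particular library's API.

```python
def assemble_context(snippets: list[tuple[float, str]], budget_tokens: int) -> str:
    """Greedily pack ranked snippets into a token budget.

    snippets: (score, text) pairs; higher score means more relevant.
    Uses a rough 4-characters-per-token estimate; a real system should
    count tokens with the target model's own tokenizer.
    """
    chosen: list[str] = []
    used = 0
    for score, text in sorted(snippets, key=lambda pair: pair[0], reverse=True):
        cost = max(1, len(text) // 4)
        if used + cost <= budget_tokens:
            chosen.append(text)
            used += cost
    return "\n\n".join(chosen)

snippets = [
    (0.91, "Severity 1 incidents require an executive update every 30 minutes."),
    (0.62, "The incident commander owns communication."),
    (0.15, "Unrelated vacation policy paragraph."),
]
print(assemble_context(snippets, budget_tokens=30))
```

Even this toy version makes the core trade-off visible: every low-value snippet admitted into the budget crowds out a higher-value one.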
Embeddings & Vector Space
An embedding model maps text into a high-dimensional numeric vector such that semantically similar text lands closer together in vector space. The model is not storing dictionary definitions. It is learning a geometry of meaning. That is why “reset my password” and “I can't sign in” can end up near each other even if they share few exact keywords.
from __future__ import annotations
import numpy as np
from sentence_transformers import SentenceTransformer
def cosine_similarity(vec_a: np.ndarray, vec_b: np.ndarray) -> float:
    numerator = float(np.dot(vec_a, vec_b))
    denominator = float(np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
    return numerator / denominator
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentences = [
"The user cannot log into the dashboard.",
"A customer is unable to sign in to the portal.",
]
embeddings = model.encode(sentences, normalize_embeddings=True)
similarity = cosine_similarity(embeddings[0], embeddings[1])
print(f"Embedding dimension: {embeddings[0].shape[0]}")
print(f"Cosine similarity: {similarity:.4f}")
The operational consequence is that embeddings let you retrieve based on meaning, not just literal wording. That is powerful, but also lossy. Every embedding compresses information, which is why retrieval quality depends on more than just “store vectors and query top-k.”
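The retrieval side of that consequence is a nearest-neighbor search. A minimal sketch with NumPy only, where tiny 2-D unit vectors stand in for real embedding model output:

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int) -> np.ndarray:
    # With unit-normalized rows, the dot product equals cosine similarity.
    scores = doc_matrix @ query_vec
    return np.argsort(-scores)[:k]

# Toy 2-D "embeddings" standing in for real model output.
docs = np.array([
    [1.0, 0.0],   # doc 0: purely about logins
    [0.6, 0.8],   # doc 1: mixed topic
    [0.0, 1.0],   # doc 2: purely about billing
])
query = np.array([0.8, 0.6])
print(top_k(query, docs, k=2))  # indices of the two nearest documents
```

Production vector databases replace the brute-force matrix product with approximate nearest-neighbor indexes, but the scoring logic is the same.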
The Context Window
Tokens are the units the model actually processes, not words or characters. A context window is the maximum number of tokens the model can attend to at once. The danger is assuming that if the model can technically accept a long prompt, it will use every part of it equally well. In practice, long prompts suffer from dilution and the “lost in the middle” effect, where evidence buried in the center is easier for the model to ignore.
A larger context window increases capacity, but it does not eliminate ranking and relevance problems. Good systems still compress, filter, and order context aggressively.
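One common mitigation for the lost-in-the-middle effect is to reorder ranked evidence so the strongest chunks sit at the edges of the prompt, where models attend most reliably; this is the idea behind LangChain's LongContextReorder. A hand-rolled sketch:

```python
def edge_order(ranked_chunks: list[str]) -> list[str]:
    """Place the highest-ranked chunks at the start and end of the context,
    pushing weaker evidence toward the middle, where long-context models
    are most likely to overlook it. Input is ordered best-first."""
    front: list[str] = []
    back: list[str] = []
    for i, chunk in enumerate(ranked_chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

print(edge_order(["best", "second", "third", "fourth", "fifth"]))
# → ['best', 'third', 'fifth', 'fourth', 'second']
```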
Module 2: Advanced Chunking Strategies
Chunking is the translation layer between raw documents and retrievable units. Poor chunking destroys coherence before retrieval even starts. Good chunking preserves semantic boundaries so each candidate chunk is small enough to retrieve precisely and rich enough to make sense on its own.
The Naive Chunking Problem
Fixed-size character chunking is easy, but it often splits tables, code blocks, lists, or sentences in arbitrary places. That is equivalent to tearing pages out of a manual with scissors and hoping retrieval can still reconstruct the meaning. It sometimes works, but it is a weak default for enterprise documents.
- Sentence break damage: a legal exception or numerical condition can be cut in half.
- Table break damage: headers and row values can land in separate chunks.
- Semantic drift: the chunk may no longer represent a coherent unit of meaning after splitting.
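The failure mode is easy to demonstrate with a hand-rolled fixed-size splitter (the window size and the policy sentence below are arbitrary illustrations):

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    # Naive character windows with no regard for sentence or table boundaries.
    return [text[i:i + size] for i in range(0, len(text), size)]

policy = "Severity 1 incidents require an executive update every 30 minutes."
for chunk in fixed_size_chunks(policy, size=25):
    print(repr(chunk))
```

The splitter severs "require" mid-word and separates "every" from "30 minutes", so neither chunk carries the complete numerical condition on its own.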
Semantic Chunking
Semantic chunking uses natural structure such as markdown headers, paragraphs, or sentence groups. In enterprise knowledge bases, this is usually the better starting point because users tend to ask questions about concepts that align with headings and section boundaries.
from __future__ import annotations
from langchain_text_splitters import MarkdownHeaderTextSplitter
markdown_text = """
# Incident Response
The incident commander owns communication.
## Severity Levels
Severity 1 incidents require an executive update every 30 minutes.
## Escalation Paths
Security incidents must notify the on-call security engineer immediately.
"""
headers_to_split_on = [
("#", "h1"),
("##", "h2"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
documents = splitter.split_text(markdown_text)
for doc in documents:
    # Each chunk retains section-level metadata so retrieval stays interpretable.
    print(doc.metadata, doc.page_content)
from __future__ import annotations
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
documents = [
Document(
text=(
"Context engineering starts with coherent segmentation. "
"This paragraph explains why semantic boundaries matter. "
"The next paragraph explains how chunk overlap preserves continuity. "
"The final paragraph explains how retrieval precision improves when chunks stay meaningful."
)
)
]
splitter = SentenceSplitter(chunk_size=120, chunk_overlap=20)
nodes = splitter.get_nodes_from_documents(documents)
for node in nodes:
    print(node.text)
Parent-Child Document Retrieval
Parent-child retrieval solves a common tension: small chunks are better for precise retrieval, but larger chunks are better for answering. The pattern is to index children for retrieval accuracy, then send the larger parent section to the LLM once a child match wins. This improves precision without starving the model of surrounding context.
from __future__ import annotations
from dataclasses import dataclass
@dataclass
class ParentDocument:
    parent_id: str
    content: str

@dataclass
class ChildChunk:
    chunk_id: str
    parent_id: str
    content: str
parents = {
"policy-12": ParentDocument(
parent_id="policy-12",
content="Full parent section text for the incident response policy...",
)
}
children = [
ChildChunk(chunk_id="policy-12-chunk-1", parent_id="policy-12", content="Severity 1 incidents require an executive update every 30 minutes."),
]
# Retrieve the child chunk in the vector DB, then attach the parent document to the LLM context.
matched_child = children[0]
llm_context = parents[matched_child.parent_id].content
print(llm_context)
Module 3: Vector Databases & Hybrid Search
Vector search is powerful for semantic intent, but it is not the whole retrieval stack. Enterprise search often needs both dense semantic signals and sparse lexical signals. Hybrid search exists because users care about meaning and exact terms at the same time.
Dense vs. Sparse Vectors
Dense vectors come from embedding models and capture semantic similarity. Sparse methods like BM25 or TF-IDF preserve exact keyword evidence. A useful analogy is hiring both a domain expert and a keyword indexer: the expert knows what two phrases mean, while the indexer knows whether the exact compliance code appeared in the document.
Hybrid Search Implementation
Hybrid search combines both signals with a weighting factor, usually called alpha. Lower alpha leans more on keyword match. Higher alpha leans more on semantic similarity. In practice, alpha is a business tuning knob that should be evaluated, not guessed.
from __future__ import annotations
from sentence_transformers import SentenceTransformer
from weaviate import Client
client = Client("http://localhost:8080")  # weaviate-client v3 API; v4 replaces Client with weaviate.connect_to_local()
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
query_text = "What is the retention policy for payroll records?"
query_vector = embedder.encode(query_text).tolist()
response = (
client.query
.get("PolicyChunk", ["content", "title", "department"])
.with_hybrid(query=query_text, vector=query_vector, alpha=0.65)
.with_limit(5)
.do()
)
print(response)
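Under the hood, alpha blending amounts to normalizing each signal and taking a weighted sum. The hand-rolled version below is an illustrative simplification; Weaviate's actual fusion algorithms (rankedFusion, relativeScoreFusion) differ in detail:

```python
def min_max(scores: list[float]) -> list[float]:
    # Rescale scores to [0, 1] so dense and sparse signals are comparable.
    lo, hi = min(scores), max(scores)
    return [0.0 if hi == lo else (s - lo) / (hi - lo) for s in scores]

def hybrid_scores(dense: list[float], sparse: list[float], alpha: float) -> list[float]:
    """alpha=1.0 leans purely semantic; alpha=0.0 leans purely keyword."""
    d, s = min_max(dense), min_max(sparse)
    return [alpha * dv + (1 - alpha) * sv for dv, sv in zip(d, s)]

dense = [0.82, 0.74, 0.31]   # cosine similarities from the embedding model
sparse = [1.2, 7.9, 4.4]     # BM25-style scores from keyword search
print(hybrid_scores(dense, sparse, alpha=0.65))
```

With these toy numbers, the document that is merely decent on both signals outranks the document that wins on only one, which is exactly the behavior hybrid search is designed to produce.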
Module 4: Re-Ranking & Context Compression
The first retrieval stage is often a broad filter, not the final decision. That is why advanced RAG stacks frequently use a second-stage re-ranker and, when needed, a compression layer before building the final prompt.
The Retrieval Bottleneck
Top-k vector retrieval is often inaccurate because embeddings compress many document details into a fixed-length representation. Important distinctions can get blurred. The result is a candidate set that is relevant in a vague sense but still not the best evidence for the user's specific question.
Cross-Encoders (Re-ranking)
A cross-encoder or API re-ranker reads the query and each candidate document together and scores relevance more precisely than embedding-only retrieval. It is slower, but much sharper. The typical pattern is retrieve 20, re-rank them, and pass the top 3 to the LLM.
from __future__ import annotations
import cohere
co = cohere.ClientV2(api_key="YOUR_COHERE_API_KEY")
query = "Who approves security exceptions for production access?"
retrieved_docs = [
"Security exceptions must be approved by the CISO and platform owner.",
"Vacation policy for full-time employees.",
"Production access requires manager approval and audit logging.",
# Add the rest of your top-20 retrieved chunks here.
]
rerank_response = co.rerank(
model="rerank-v3.5",
query=query,
documents=retrieved_docs,
top_n=3,
)
top_documents = [retrieved_docs[result.index] for result in rerank_response.results]
for rank, doc in enumerate(top_documents, start=1):
    print(rank, doc)
from __future__ import annotations
from sentence_transformers import CrossEncoder
cross_encoder = CrossEncoder("BAAI/bge-reranker-base")
query = "Who approves security exceptions for production access?"
retrieved_docs = [
"Security exceptions must be approved by the CISO and platform owner.",
"Vacation policy for full-time employees.",
"Production access requires manager approval and audit logging.",
]
pairs = [[query, doc] for doc in retrieved_docs]
scores = cross_encoder.predict(pairs)
ranked = sorted(zip(retrieved_docs, scores), key=lambda item: item[1], reverse=True)
top_3 = ranked[:3]
print(top_3)
Context Compression
Compression tools such as LLMLingua or summary-based pruning remove low-value tokens before the prompt reaches the LLM. This is helpful when the retrieved context is mostly correct but still too verbose. The goal is not blind truncation. The goal is preserving meaning while deleting filler.
- Use compression after retrieval quality is decent. Compression cannot rescue irrelevant chunks.
- Compress for salience, not aesthetics. Preserve entities, numbers, tables, and qualifiers.
- Benchmark groundedness after compression. Smaller prompts are cheaper, but only if the answer quality holds.
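As a toy illustration of salience-first pruning (not the LLMLingua algorithm itself), an extractive pass can keep only sentences that carry numbers or obligation keywords. The `keep_pattern` below is an assumed salience heuristic for this sketch:

```python
import re

def compress_salient(chunks: list[str], keep_pattern: str) -> list[str]:
    """Keep sentences carrying high-salience tokens (numbers, obligations);
    drop pure filler. A crude stand-in for tools like LLMLingua."""
    kept: list[str] = []
    for chunk in chunks:
        for sentence in re.split(r"(?<=[.!?])\s+", chunk.strip()):
            if re.search(keep_pattern, sentence):
                kept.append(sentence)
    return kept

chunks = [
    "As everyone knows, incidents can be stressful. Severity 1 incidents "
    "require an executive update every 30 minutes.",
    "It is generally agreed that security matters a great deal.",
]
# Assumed salience signal: digits or the words "must"/"require".
print(compress_salient(chunks, keep_pattern=r"\d|must|require"))
```

The filler sentences disappear while the sentence carrying the numeric obligation survives verbatim, which is the property to benchmark for after any compression step.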
Module 5: Graph RAG & Structured Context
Standard vector RAG is excellent for local semantic similarity, but it can struggle with exact multi-hop relationship questions. That is where structured context and knowledge graphs become useful.
The Limits of Vector RAG
Vector RAG often fails at multi-hop reasoning because the relevant facts may live in separate documents that are individually retrievable but not obviously connectable by similarity alone. A question like “Who is the CEO of the company that acquired the startup founded by John Doe?” requires relationship traversal, not just nearest-neighbor retrieval.
Knowledge Graphs
Graph RAG represents entities as nodes and relationships as edges. Instead of hoping semantic search retrieves all connected facts, the system can traverse exact relationships. That is powerful for org charts, compliance controls, supply chains, and acquisition histories.
from __future__ import annotations
from neo4j import GraphDatabase
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
query = """
MATCH (founder:Person {name: $founder_name})-[:FOUNDED]->(startup:Company)
MATCH (acquirer:Company)-[:ACQUIRED]->(startup)
MATCH (ceo:Person)-[:CEO_OF]->(acquirer)
RETURN founder.name AS founder, startup.name AS startup, acquirer.name AS acquirer, ceo.name AS ceo
"""
with driver.session() as session:
    records = session.run(query, founder_name="John Doe")
    for record in records:
        print(record.data())
The retrieval pattern changes here: the graph provides exact facts and relationships, while the LLM explains or synthesizes them. That separation is often more reliable than asking vector search to simulate structured reasoning.
Module 6: Evaluation & Optimization
RAG systems must be evaluated as systems, not just by final answer quality. You need to know whether the retrieved context was relevant, whether the answer was grounded in that context, and whether the answer actually addressed the user's question.
RAG Triad
The three useful evaluation pillars are:
- Context relevance: did retrieval surface passages that actually bear on the question?
- Groundedness (faithfulness): is every claim in the answer supported by the retrieved context?
- Answer relevance: does the answer address what the user actually asked?
Evaluation Frameworks
Libraries such as Ragas and TruLens formalize RAG evaluation with measurable scores. That matters because retrieval systems are too complex to tune by vibes alone. If you change chunking, ranking, or prompt assembly, you need metrics that show what improved and what regressed.
from __future__ import annotations
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness
evaluation_dataset = Dataset.from_dict(
{
"question": ["Who approves security exceptions for production access?"],
"answer": ["Security exceptions are approved by the CISO and platform owner."],
"contexts": [["Security exceptions must be approved by the CISO and platform owner."]],
"ground_truth": ["The CISO and platform owner approve security exceptions."],
}
)
results = evaluate(
dataset=evaluation_dataset,
metrics=[context_precision, faithfulness, answer_relevancy],
)
print(results)
Module 7: Common Pitfalls & Anti-Patterns
Most context engineering failures happen before the prompt reaches the LLM. The model often gets blamed, but the actual issue is usually broken ingestion, bad ranking assumptions, or missing metadata filters.
1. Garbage In, Garbage Out (GIGO)
If you embed messy PDFs with broken OCR, missing table structure, or merged columns, retrieval quality collapses. The vector database will faithfully store garbage representations of the content. Good retrieval starts with clean ingestion, normalized text, and preserved structure.
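A minimal normalization pass before embedding might rejoin hyphenated line breaks and collapse extraction whitespace. The regexes below are illustrative, not an exhaustive ingestion pipeline:

```python
import re

def normalize_extracted_text(raw: str) -> str:
    """Minimal cleanup pass for common OCR/PDF extraction artifacts."""
    text = raw.replace("\u00ad", "")              # remove soft hyphens
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # rejoin words hyphenated across line breaks
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)        # collapse excess blank lines
    return text.strip()

raw = "Reten-\ntion  policy:\n\n\n\nrecords  are kept for 7 years."
print(normalize_extracted_text(raw))
```

Each fix targets a specific embedding failure: a hyphen-split "Reten- tion" embeds as two nonsense fragments, and runaway whitespace wastes chunk budget on nothing.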
2. Blindly Trusting Top-K = 5
Hardcoding a fixed number of retrieved chunks is a classic anti-pattern. Too low, and you miss required evidence. Too high, and you inject noise that hurts groundedness. Retrieval depth should be tuned against the data distribution and often paired with re-ranking.
from __future__ import annotations
def dynamic_top_k(query_type: str) -> int:
    # Simple example: broader questions often need more candidates before re-ranking.
    if query_type == "factoid":
        return 8
    if query_type == "policy":
        return 15
    return 12
print(dynamic_top_k("policy"))
3. Ignoring Metadata
Metadata is how you narrow the search space before expensive semantic similarity math. If the user asks for a 2025 HR policy, you should not retrieve engineering documents from 2022 and hope the vector model sorts it out. Metadata filters are high-signal constraints that improve precision cheaply.
from __future__ import annotations
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue
client = QdrantClient(url="http://localhost:6333")
metadata_filter = Filter(
must=[
FieldCondition(key="department", match=MatchValue(value="HR")),
FieldCondition(key="year", match=MatchValue(value=2025)),
]
)
results = client.search(
collection_name="policy_chunks",
query_vector=[0.1, 0.2, 0.3], # Replace with a real embedding vector.
query_filter=metadata_filter,
limit=10,
)
print(results)
Reference Links
- LlamaIndex documentation
- LangChain documentation
- SentenceTransformers documentation
- Qdrant documentation
- Weaviate documentation
- Ragas documentation
- Neo4j documentation