RAG & Semantic Search ANN Internals Production Patterns
Similarity Search & Retrieval Infrastructure

Vector Databases
Enterprise Handbook

A practitioner's deep-dive into Pinecone, Milvus, Qdrant, and pgvector — covering ANN index internals, distance metrics, RAG pipelines, filtered search, and production operations for teams building on embeddings.

HNSW & IVF-Flat Pinecone Milvus Qdrant pgvector Advanced RAG Hybrid Search
📐

What is a Vector Database?

Semantic similarity at scale — the shift from exact to approximate matching

A vector database stores, indexes, and queries high-dimensional float arrays — embeddings — produced by neural models. Unlike SQL where you filter on equality or range, vector DBs answer nearest-neighbor queries: "find the 10 vectors most similar to this query vector." This is the engine behind semantic search, RAG, recommendation, anomaly detection, and multimodal retrieval.

Exact vs Approximate

Exact k-NN is O(n·d) — scanning 10M 1536-dim vectors takes ~1 second. ANN indexes trade a small recall loss (1–5%) for 100–1000× speedups.
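To make the O(n·d) cost concrete, here is a minimal sketch of exact k-NN in plain Python. The function names and the toy corpus are illustrative assumptions, not part of any library; every stored vector is scored against the query, which is exactly the work an ANN index avoids.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exact_knn(query, vectors, k):
    """Score every stored vector (O(n*d) work) and keep the k best."""
    scored = ((cosine_similarity(query, v), i) for i, v in enumerate(vectors))
    return heapq.nlargest(k, scored)

# Toy corpus (hypothetical data): four 2-dim vectors
corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
top2 = exact_knn([1.0, 0.0], corpus, k=2)  # list of (score, index) pairs
```

This brute-force scan is also the ground truth that Recall@K for an ANN index is usually measured against.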

Metadata Filtering

Real workloads combine vector similarity with structured predicates: doc_type='legal' AND year>2022. How a DB handles this conjunction is the key differentiator.

Hybrid Search

Sparse (BM25/TF-IDF keyword) + dense (embedding) retrieval fused via RRF or weighted sum. Outperforms either alone, especially for rare terms and exact matches.
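Reciprocal Rank Fusion (RRF) can be sketched in a few lines. The version below assumes each retriever returns a ranked list of document IDs and uses the commonly cited constant k=60; the function name and the sample rankings are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

dense = ["a", "b", "c"]   # hypothetical ranking from embedding search
sparse = ["c", "a", "d"]  # hypothetical ranking from keyword search
fused = reciprocal_rank_fusion([dense, sparse])
```

RRF only looks at ranks, so it needs no score normalization between the dense and sparse retrievers. A weighted sum does, because the two score scales differ.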

ℹ️
Dimensionality note: OpenAI text-embedding-3-small → 1536 dims. Cohere Embed v3 → 1024 dims. E5-large → 1024 dims. Matryoshka-trained models such as text-embedding-3-* and nomic-embed-text-v1.5 can be truncated to 512 or 256 dims. Smaller dims mean faster queries and less memory, at some cost in recall. Most production workloads use 768–1536.
🔬

ANN Index Internals

HNSW, IVF-Flat, IVF-PQ, ScaNN, DiskANN — how the sausage is made

HNSW — Hierarchical Navigable Small World

The workhorse of most modern vector DBs. A multi-layer graph where each layer is a randomly thinned version of the one below. Search starts at the top (sparse, long-range connections) and greedily descends, increasing resolution each layer. Insert-time cost is O(log n); query cost is O(log n) with constant depending on ef.

Key Parameters
  • M — max edges per node (16–64). Higher = better recall, more RAM and build time.
  • ef_construction — beam width during build (100–500). Larger = higher quality graph, slower build.
  • ef_search — beam width at query time. Tune this at runtime for the recall/latency tradeoff.
Memory Profile
  • Overhead: ~(M × 8 bytes) per vector for graph edges, plus raw vectors.
  • 1M vectors × 1536 dims × float32 = ~6 GB raw. HNSW graph adds ~20–40%.
  • In its default form HNSW lives entirely in RAM. If the index doesn't fit, you need a disk-based index (DiskANN in Milvus, on-disk HNSW in Qdrant) or IVF-PQ compression.

IVF-Flat — Inverted File Index

Clusters vectors into nlist Voronoi cells at build time (k-means). At query time, probe the nprobe nearest centroids and do exact search within those cells only. Naturally shardable; faster builds than HNSW on very large datasets. IVF-Flat is a common choice in Milvus for large collections.

python — Milvus IVF-Flat config
# Build: nlist = sqrt(n) is the classic heuristic
index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "COSINE",
    "params": {"nlist": 1024},  # cluster count
}

# Query: nprobe = 10–20% of nlist for 95%+ recall
search_params = {"metric_type": "COSINE", "params": {"nprobe": 128}}
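The nlist/nprobe mechanics can be illustrated with a toy IVF in plain Python. This is a sketch under simplifying assumptions: the centroids are given rather than learned by k-means, distance is L2, and all names are hypothetical.

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign each vector ID to the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        lists[nearest].append(vid)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe, k):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [vid for c in probe for vid in lists[c]]
    return sorted(candidates, key=lambda vid: l2(query, vectors[vid]))[:k]

# Assumption: centroids are precomputed (a real index learns them with k-means)
centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = [[0.1, 0.2], [0.3, 0.1], [9.8, 10.1], [10.2, 9.9]]
lists = build_ivf(vectors, centroids)
hits = ivf_search([0.0, 0.0], vectors, centroids, lists, nprobe=1, k=2)
```

With nprobe=1 only one cell is scanned. A true neighbor that sits just across a cell boundary would be missed, which is why raising nprobe trades latency for recall.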

IVF-PQ — Product Quantization

Compresses vectors 4–32× by splitting each d-dim vector into m sub-vectors and replacing each with a codebook index. Drastically reduces memory but hurts recall. Typical use: 100M+ vectors where HNSW OOMs. Milvus and Faiss support IVF-PQ; pgvector does not implement PQ and instead offers halfvec and binary quantization.

DiskANN — SSD-Resident Graphs

Microsoft Research's algorithm (2019) for billion-scale datasets that don't fit in RAM. Stores the graph and full-precision vectors on SSD and keeps only compressed vectors in DRAM. Milvus exposes it as the DISKANN index type. Qdrant does not implement DiskANN; it offers a comparable trade-off through memory-mapped on-disk HNSW. Latency is higher (~2–5ms vs sub-ms for in-memory) but cost/vector is 10–20× lower.

ScaNN — Google's Asymmetric Hashing

Used internally by Google; open-sourced in 2020. It ranks among the best for recall/latency on standard ANN benchmarks (ann-benchmarks.com). The core is C++ with a Python wrapper. Pinecone's proprietary index likely draws inspiration here. Milvus added a SCANN index type in 2.3; Qdrant does not offer it.

Index | Build Speed | Query Speed | RAM Usage | Best For
HNSW | Slow (O(n log n)) | ⚡ Fastest | High (in-RAM) | <50M vecs, low-latency
IVF-Flat | Fast | Fast | Medium | Batch workloads, sharded clusters
IVF-PQ | Fast | Fast | ⚡ Very low | 100M+ vecs, memory-constrained
DiskANN | Slow | Medium (~3ms) | ⚡ Very low | Billion-scale, cost-sensitive
Flat (brute) | None | Slow (exact) | Low | <100K vecs, baseline, testing
📏

Distance Metrics

Cosine, dot product, Euclidean — when each applies
Cosine Similarity

Measures angle between vectors, ignoring magnitude. Best for text embeddings where magnitude encodes word frequency artifacts. Most embedding models are trained with cosine loss. Range: [-1, 1].

sim(A,B) = (A·B) / (||A|| × ||B||)

Dot Product

Equivalent to cosine on unit-normalized vectors (most modern models). Often preferred, for example in Pinecone, because it skips the norm division and is cheaper to compute than full cosine. If your model normalizes outputs, use dot product.

sim(A,B) = Σ Aᵢ × Bᵢ

Euclidean (L2)

Geometric distance in embedding space. Appropriate when the model was trained with a Euclidean objective or when absolute position in latent space matters. On unit-normalized vectors it ranks results identically to cosine. Range: [0, ∞).

d(A,B) = √Σ(Aᵢ - Bᵢ)²

⚠️
Metric mismatch is a silent killer. Building the index with one metric while the model was trained for another returns valid-looking but semantically wrong results. Always match the metric to the model's training objective and check the model card. OpenAI and Cohere models: cosine/dot product. CLIP: cosine on normalized embeddings. Binary quantized models: Hamming distance.
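For unit-normalized vectors the three metrics agree on ranking, because the dot product equals the cosine and the squared L2 distance equals 2 − 2·cosine. The short check below uses made-up 2-dim vectors to illustrate this; a mismatch only changes results when vectors are not normalized.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical query and documents, all scaled to unit length
query = normalize([3.0, 4.0])
docs = [normalize(v) for v in ([1.0, 0.0], [3.0, 5.0], [0.0, 2.0])]

# Rank document indices under each metric (best match first)
by_cosine = sorted(range(len(docs)), key=lambda i: -cosine(query, docs[i]))
by_dot = sorted(range(len(docs)), key=lambda i: -dot(query, docs[i]))
by_l2 = sorted(range(len(docs)), key=lambda i: l2(query, docs[i]))
```

All three lists come out in the same order here. Skip the normalize step and the dot-product and L2 rankings can diverge from cosine.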
🧬

Embeddings Primer

Choosing the right model for your retrieval task
Model | Dims | Context | MTEB Score | Best For
text-embedding-3-large | 3072 | 8191 tok | 64.6 | Highest quality, OpenAI workloads
text-embedding-3-small | 1536 | 8191 tok | 62.3 | Balance of cost + quality
Cohere embed-v3-english | 1024 | 512 tok | 64.5 | High recall, Cohere platform
BAAI/bge-large-en-v1.5 | 1024 | 512 tok | 63.9 | Open-source, self-hosted
BAAI/bge-m3 | 1024 | 8192 tok | 62.7 | Multilingual, hybrid (dense+sparse)
nomic-embed-text-v1.5 | 768 | 8192 tok | 62.4 | Long-context, open weights
gte-Qwen2-7B | 3584 | 32768 tok | 70.2 | SOTA, heavy, self-hosted GPU
Matryoshka embeddings: Models like text-embedding-3-* and nomic-embed-text-v1.5 support truncating to smaller dims (e.g., 256) with acceptable quality loss. Use a tiered strategy: retrieve 100 candidates at 256 dims cheaply, then re-rank them with full 1536-dim cosine and keep the top 20. This can yield a 3–4× cost reduction at <2% recall loss.
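The tiered strategy can be sketched as follows. This is a toy illustration with hypothetical helper names and 4-dim vectors: it truncates on the fly and scans linearly, whereas a real deployment would index the truncated vectors in an ANN index and store the full vectors for the re-rank step.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def truncate(v, dims):
    """Keep the leading dims and re-normalize, as Matryoshka-style models allow."""
    return normalize(v[:dims])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def tiered_search(query, corpus, short_dims, n_candidates, k):
    # Stage 1: cheap candidate retrieval on truncated vectors
    q_short = truncate(query, short_dims)
    coarse = sorted(
        range(len(corpus)),
        key=lambda i: -dot(q_short, truncate(corpus[i], short_dims)),
    )[:n_candidates]
    # Stage 2: re-rank only the candidates with full-dimension cosine
    q_full = normalize(query)
    return sorted(coarse, key=lambda i: -dot(q_full, normalize(corpus[i])))[:k]

corpus = [
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
]
top = tiered_search([1.0, 0.0, 0.0, 0.0], corpus, short_dims=2, n_candidates=3, k=2)
```

In the example, documents 0, 1, and 3 are indistinguishable at 2 dims. The full-dimension re-rank is what separates them.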
🌲

Pinecone

Managed serverless vector database — zero ops, pay per use
MANAGED SaaS
Pinecone
Proprietary index (likely ScaNN-inspired), serverless + pod-based tiers, sparse-dense hybrid natively supported since 2024
Serverless Hybrid Search Namespaces Metadata Filters

Core Concepts

Indexes are the top-level unit — a named collection with a fixed dimension and metric. Serverless indexes auto-scale; pod indexes provision explicit hardware (s1, p1, p2). Namespaces provide logical partitioning within an index — ideal for multi-tenancy (one namespace per user/org). Queries are always namespace-scoped.

python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="...")

# Create serverless index
pc.create_index(
    name="knowledge-base",
    dimension=1536,
    metric="cosine",  # cosine | dotproduct | euclidean
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
idx = pc.Index("knowledge-base")

# Upsert (create or update by ID)
idx.upsert(
    vectors=[
        {"id": "doc-001", "values": embedding,
         "metadata": {"source": "legal", "year": 2024}},
    ],
    namespace="tenant-acme",
)

# Filtered semantic search
results = idx.query(
    vector=query_embedding,
    top_k=10,
    namespace="tenant-acme",
    filter={"source": {"$eq": "legal"}, "year": {"$gte": 2023}},
    include_metadata=True,
)

Hybrid Search (Sparse + Dense)

python
from pinecone_text.sparse import BM25Encoder
from pinecone_text.hybrid import hybrid_convex_scale

bm25 = BM25Encoder.default()  # pre-trained on MS MARCO

# Note: sparse-dense indexes must be created with metric="dotproduct"
# Upsert with both sparse + dense vectors
idx.upsert(vectors=[{
    "id": "doc-42",
    "values": dense_embedding,                          # float list
    "sparse_values": bm25.encode_documents([text])[0],  # {indices, values}
    "metadata": {"text": text},
}])

# Hybrid query: query() has no alpha parameter, so weight the vectors client-side
# alpha=1.0 → pure dense, alpha=0.0 → pure BM25
sparse_q = bm25.encode_queries([query])[0]
dense_scaled, sparse_scaled = hybrid_convex_scale(dense_q, sparse_q, alpha=0.75)  # 75% dense / 25% keyword
results = idx.query(
    vector=dense_scaled,
    sparse_vector=sparse_scaled,
    top_k=10,
)
⚠️
Serverless cold start: Serverless indexes can experience 200–500ms cold-start latency after inactivity. For latency-critical workloads (<50ms p99), use pod-based indexes or keep the index warm with a scheduled ping. Serverless billing: per read/write unit, not provisioned capacity.

Metadata Filter Caveats

Pinecone performs post-ANN filtering — it retrieves top_k × K candidates internally, then applies the filter. If your filter is highly selective (only 1% of vectors match), you may get fewer than top_k results. Workaround: use a larger internal fetch multiplier via top_k or consider namespace-level partitioning for high-selectivity filters.
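The fetch-multiplier workaround can be sketched client-side. Everything here is hypothetical: search_fn stands in for any call that returns ranked hits for a given top_k, and the predicate stands in for the metadata filter. The loop widens the fetch until enough hits survive filtering or a cap is reached.

```python
def filtered_top_k(search_fn, predicate, k, multiplier=4, max_fetch=1000):
    """Over-fetch, filter client-side, and widen the fetch until k hits survive."""
    fetch = k * multiplier
    while True:
        hits = [h for h in search_fn(fetch) if predicate(h)]
        if len(hits) >= k or fetch >= max_fetch:
            return hits[:k]
        fetch = min(fetch * 2, max_fetch)

# Fake ranked results standing in for a vector DB: every 10th doc is "legal"
ranked = [{"id": i, "source": "legal" if i % 10 == 0 else "blog"} for i in range(200)]

def fake_search(top_k):
    return ranked[:top_k]

hits = filtered_top_k(fake_search, lambda h: h["source"] == "legal", k=5)
```

Each widening step costs another round trip, so for filters that routinely match only a few percent of vectors, namespace-level partitioning is usually the cheaper fix.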

🐳

Milvus

Open-source, cloud-native, billion-scale — the Kubernetes of vector DBs
OPEN SOURCE
Milvus 2.x
Decoupled storage/compute architecture, Pulsar message queue, FAISS-backed indexes, Zilliz Cloud managed option
Distributed HNSW / IVF-PQ Partitions Scalar Index

Architecture

Milvus separates storage (MinIO/S3), query nodes (ANN execution), data nodes (WAL → object store flush), and coordination (etcd). This allows independent scaling of each tier. For teams <10M vectors, Milvus Lite (single binary, no dependencies) or docker compose standalone suffices.

python — pymilvus
from pymilvus import MilvusClient, DataType, CollectionSchema, FieldSchema

client = MilvusClient(uri="http://localhost:19530")

# Define schema with primary key + vector + scalar fields
schema = CollectionSchema(fields=[
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=1536),
    FieldSchema(name="source", dtype=DataType.VARCHAR, max_length=64),
    FieldSchema(name="year", dtype=DataType.INT32),
], description="Knowledge base", enable_dynamic_field=True)
client.create_collection(collection_name="kb", schema=schema)

# MilvusClient.create_index expects an IndexParams object, not a raw dict
index_params = client.prepare_index_params()
# HNSW index on the vector field
index_params.add_index(field_name="vector", index_type="HNSW", metric_type="COSINE",
                       params={"M": 16, "efConstruction": 200})
# Scalar indexes for fast pre-filtering
index_params.add_index(field_name="source", index_type="Trie")
index_params.add_index(field_name="year", index_type="INVERTED")
client.create_index(collection_name="kb", index_params=index_params)
client.load_collection("kb")

# Insert ("text" lands in the dynamic field)
client.insert("kb", [
    {"vector": embedding, "source": "legal", "year": 2024, "text": chunk_text}
])

# Filtered ANN search
results = client.search(
    collection_name="kb",
    data=[query_vec],
    anns_field="vector",
    limit=10,
    filter='source == "legal" and year >= 2023',
    output_fields=["source", "year", "text"],
    search_params={"metric_type": "COSINE", "params": {"ef": 200}},
)

Partitions & Partition Keys

python
# Partition key: Milvus hashes each key value into one of a fixed set of partitions
# Use for high-cardinality tenant isolation (tenants may share a partition)
FieldSchema(name="tenant_id", dtype=DataType.VARCHAR, max_length=64, is_partition_key=True)

# A filter on the partition key scopes the search to the matching partition
results = client.search(..., filter='tenant_id == "acme"')
Milvus filtering modes: Milvus 2.4+ uses a hybrid filtering strategy — for high-selectivity filters it switches to filtered HNSW (iterates the graph, skips non-matching nodes); for low-selectivity it post-filters after ANN. Scalar indexes (Trie for strings, INVERTED for numerics) are critical — without them, filters degrade to O(n) scans.
🔷

Qdrant

Rust-native, payload-first filtering, sparse + dense, on-disk HNSW
OPEN SOURCE · RUST
Qdrant
HNSW with filtered graph traversal baked in, WAL, quantization (scalar/product/binary), Qdrant Cloud managed
Filtered HNSW Quantization Sparse Vectors On-disk HNSW

Key Differentiator: Payload Indexing

Qdrant's filtering is a first-class index citizen, not an afterthought. Payload fields can be indexed with keyword, integer, float, text, datetime, or geo index types. The HNSW graph traversal itself checks payload conditions during graph walks — this means filtered search recall is near-identical to unfiltered, even for highly selective filters.

python — qdrant-client
from qdrant_client import QdrantClient, models

client = QdrantClient(host="localhost", port=6333)

# Create collection with named vectors + on-disk HNSW
client.create_collection(
    collection_name="knowledge",
    vectors_config={
        "dense": models.VectorParams(
            size=1536,
            distance=models.Distance.COSINE,
            on_disk=True,  # original vectors memory-mapped from disk
        ),
    },
    sparse_vectors_config={
        "sparse": models.SparseVectorParams(),  # BM42 / SPLADE
    },
    hnsw_config=models.HnswConfigDiff(m=16, ef_construct=200, on_disk=True),  # HNSW graph on SSD
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,  # 4× memory reduction
            quantile=0.99,
            always_ram=True,  # keep quantized vecs in RAM
        )
    ),
)

# Index payload fields for fast pre-filtering
client.create_payload_index("knowledge", "source", models.PayloadSchemaType.KEYWORD)
client.create_payload_index("knowledge", "year", models.PayloadSchemaType.INTEGER)

# Upsert points
client.upsert(collection_name="knowledge", points=[
    models.PointStruct(
        id=1,
        vector={"dense": dense_vec,
                "sparse": models.SparseVector(indices=[...], values=[...])},
        payload={"source": "legal", "year": 2024, "text": chunk},
    )
])

# Hybrid search: dense + sparse fused via RRF
results = client.query_points(
    collection_name="knowledge",
    prefetch=[
        models.Prefetch(query=dense_vec, using="dense", limit=50),
        models.Prefetch(query=models.SparseVector(indices=qi, values=qv),
                        using="sparse", limit=50),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    query_filter=models.Filter(must=[
        models.FieldCondition(key="source", match=models.MatchValue(value="legal")),
        models.FieldCondition(key="year", range=models.Range(gte=2023)),
    ]),
    limit=10,
)

Quantization Options

Scalar (INT8)

4× memory reduction, <2% recall loss. Best default for most workloads. Keep quantized vectors in RAM, raw on disk.

Product (PQ)

8–32× reduction, 3–8% recall loss. Use for billion-scale collections. Requires rescoring enabled.

Binary

32× reduction (1 bit per float32 dimension). Sub-1ms queries. Use only with binary-aware models (e.g., Cohere binary embeddings). Hamming distance.

🐘

pgvector

Vector search inside PostgreSQL — SQL joins, ACID, zero new infra
POSTGRES EXTENSION
pgvector 0.7+
HNSW + IVFFlat indexes, half-precision (halfvec), binary vectors (bit), sparsevec for sparse embeddings, full SQL interface
HNSW ACID Transactions SQL Joins halfvec

Setup & Index Creation

sql
-- Install extension (once per database)
CREATE EXTENSION IF NOT EXISTS vector;

-- Table with vector column (1536-dim)
CREATE TABLE documents (
    id        BIGSERIAL PRIMARY KEY,
    content   TEXT NOT NULL,
    source    VARCHAR(64) NOT NULL,
    year      INTEGER,
    embedding VECTOR(1536),  -- float32
    emb_half  HALFVEC(1536) GENERATED ALWAYS AS (embedding::halfvec(1536)) STORED
);

-- HNSW index (pgvector 0.5+): best for low-latency queries
-- Use maintenance_work_mem to speed up builds (e.g., SET maintenance_work_mem='2GB')
-- Named explicitly so maintenance commands can reference it
CREATE INDEX idx_documents_hnsw ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- IVFFlat alternative: faster builds, lower RAM, but lower recall
CREATE INDEX ON documents
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 1000);  -- lists ≈ sqrt(row count)

-- Regular B-tree indexes for filtering
CREATE INDEX ON documents (source);
CREATE INDEX ON documents (year);
sql — query patterns
-- Top-10 nearest neighbors (cosine)
SELECT id, content, 1 - (embedding <=> $1) AS score
FROM documents
ORDER BY embedding <=> $1  -- <=> cosine, <-> L2, <#> negative inner product
LIMIT 10;

-- Filtered search (pre-filter on source+year, then vector sort)
SET enable_seqscan = off;  -- force index use during testing
SELECT id, content, embedding <=> $1 AS dist
FROM documents
WHERE source = 'legal' AND year >= 2023
ORDER BY embedding <=> $1
LIMIT 10;

-- Tune ef_search at session level (higher = more recall, slower)
SET hnsw.ef_search = 200;

-- Half-precision index for 2× memory saving (pgvector 0.7+)
CREATE INDEX ON documents
    USING hnsw (emb_half halfvec_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Hybrid: vector similarity + full-text + structured filter in ONE query
-- ts_rank is PostgreSQL's full-text rank, used here as a stand-in for BM25
-- ts_rank(...) in ORDER BY abbreviates the ts_rank expression from the SELECT list
SELECT d.id, d.content,
       ts_rank(to_tsvector(d.content), plainto_tsquery($2)) AS bm25,
       d.embedding <=> $1 AS vec_dist
FROM documents d
WHERE source = 'legal'
  AND to_tsvector(d.content) @@ plainto_tsquery($2)
ORDER BY (0.7 * (d.embedding <=> $1)) + (0.3 / (ts_rank(...) + 1))
LIMIT 10;
🔴
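pgvector filtering trap: PostgreSQL's query planner may choose a sequential scan + filter instead of the HNSW index when the WHERE clause is highly selective. This bypasses the ANN index entirely and returns exact results, which sounds good but can be 100–1000× slower on large tables when no B-tree index covers the filter. The opposite case is also a trap: when the planner does use the HNSW index, the filter is applied after the index scan, so a selective filter can return fewer than LIMIT rows unless hnsw.ef_search is raised. Use EXPLAIN ANALYZE to verify index usage, and consider partial indexes: CREATE INDEX ... WHERE source = 'legal'.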

Python with psycopg3 / SQLAlchemy

python
from pgvector.psycopg import register_vector
import psycopg
import numpy as np

with psycopg.connect(dsn) as conn:
    register_vector(conn)

    # Insert
    conn.execute(
        "INSERT INTO documents (content, source, year, embedding) VALUES (%s, %s, %s, %s)",
        (text, "legal", 2024, np.array(embedding)),
    )

    # Query
    rows = conn.execute(
        "SELECT id, content FROM documents ORDER BY embedding <=> %s LIMIT %s",
        (np.array(query_vec), 10),
    ).fetchall()
⚖️

Head-to-Head Comparison

Choosing the right database for your use case
Dimension | 🌲 Pinecone | 🐳 Milvus | 🔷 Qdrant
Index type | Proprietary | HNSW, IVF-PQ, DiskANN | HNSW (filtered)
Max scale | 100B+ (pods) | Billion+ | 100M–1B+
Filtering | Post-ANN | Scalar index | In-graph
Hybrid search | Native | BGE-M3 native | RRF / linear
Multi-tenancy | Namespaces | Partition keys | Collections
Quantization | None exposed | IVF-PQ, binary | SQ8, PQ, binary
Ops burden | Zero | Medium (K8s) | Low (single bin)
SQL joins | No | No | No
Transactions | No | Eventual | WAL only
Pricing model | Per R/W unit | Open source / Zilliz | Open source / Cloud
Best for | SaaS, zero-ops teams | Billion-scale self-hosted | High-filter workloads

Dimension | 🐘 pgvector
Index type | HNSW + IVFFlat (Postgres native)
Max scale | ~50M vecs comfortably; larger needs partitioning
Filtering | Planner may bypass ANN; verify with EXPLAIN
Hybrid search | Combine with tsvector for BM25 in one query
Multi-tenancy | Row-level security + schemas (native Postgres)
Ops burden | None if you run Postgres already
SQL joins | Full SQL: JOIN with users, orders, anything
Transactions | Full ACID
Best for | Teams already on Postgres, strong relational requirements, <50M vectors
📡

Advanced RAG Patterns

Beyond naive retrieval — re-ranking, multi-stage, HyDE, parent-child

Two-Stage Retrieval (Embed → Re-rank)

The workhorse of production RAG. Stage 1: retrieve 50–100 candidates cheaply via ANN. Stage 2: re-rank with a cross-encoder (Cohere Rerank, BGE-Reranker, Jina Reranker) that reads both query and document jointly, producing far more accurate relevance scores. Final context: top 5–10 from re-ranker.

python — two-stage RAG
import cohere

co = cohere.Client("...")

# Stage 1: ANN retrieval, casting a broad net
candidates = vector_db.query(query_embedding, top_k=80)
docs = [(c.id, c.metadata["text"]) for c in candidates.matches]

# Stage 2: Cross-encoder re-ranking
reranked = co.rerank(
    query=user_query,
    documents=[d for _, d in docs],
    model="rerank-english-v3.0",
    top_n=5,
)
context_chunks = [docs[r.index][1] for r in reranked.results]

HyDE — Hypothetical Document Embeddings

Use the LLM to generate a hypothetical answer to the query, embed that answer (not the query), then retrieve using the answer's embedding. Dramatically improves recall for knowledge-gap queries where the user doesn't know the vocabulary of the answer domain.

python
# Generate hypothetical answer
hyde_prompt = f"Write a short, authoritative passage that answers: {user_query}"
hypothetical_doc = llm.generate(hyde_prompt, max_tokens=256)

# Embed the hypothetical answer, not the query
hyde_embedding = embed.encode(hypothetical_doc)

# Retrieve using the hypothetical embedding
results = vector_db.query(hyde_embedding, top_k=20)

Parent-Child Chunking

Store small child chunks (128–256 tokens) for precise embedding, but return the full parent chunk (512–1024 tokens) to the LLM for context. Child chunks have sharper semantic signal; parent chunks give the LLM enough context to generate coherent answers.

python
# Index child chunks with parent_id in metadata
for parent in parent_chunks:
    for child in split_children(parent, size=256):
        vector_db.upsert([{
            "id": child.id,
            "values": embed(child.text),
            "metadata": {"text": child.text, "parent_id": parent.id},
        }])

# At query time: retrieve children, fetch parents
child_hits = vector_db.query(query_vec, top_k=10, include_metadata=True)
# dict.fromkeys dedupes while keeping the best-ranked parent first
parent_ids = list(dict.fromkeys(h.metadata["parent_id"] for h in child_hits.matches))
context = [parent_store[pid].text for pid in parent_ids]  # deduped parent text

Query Routing & Decomposition

Query Routing

Use a lightweight classifier (or LLM with structured output) to route queries to the appropriate collection or retrieval strategy. Example: route "show me the org chart" to a structured DB, "explain our refund policy" to the vector store.

Query Decomposition

Break complex multi-part questions into sub-queries, retrieve separately, then merge before generation. Example: "Compare our Q3 revenue with industry average" → retrieve internal Q3 docs + retrieve industry benchmark docs separately.

🔽

Filtered Search Deep-Dive

Pre-filter, post-filter, in-graph — each has different recall characteristics
Pre-filtering

Apply the metadata filter first (get matching doc IDs), then search only within that subset. Recall is exact when the subset is scanned by brute force, which engines typically do when the subset is small; for large subsets a brute-force scan becomes expensive. Used by Milvus with a scalar index and Qdrant with a payload index.

Post-filtering

Do full ANN search (top K × multiplier), then filter results. Can return fewer than K results if filter is very selective. Pinecone uses this approach. Mitigation: increase top_k with a multiplier (e.g., 10× the desired results).

In-graph (HNSW)

Filter is evaluated during graph traversal — non-matching nodes are skipped. Qdrant's approach. Near-identical recall to unfiltered HNSW. Requires payload indexes; degrades without them.

Multi-tenancy pattern: For strict tenant isolation (tenants must never see each other's data), use namespace (Pinecone), partition key (Milvus), or a separate collection per tenant (Qdrant). A metadata filter alone is insufficient because a misconfigured filter leaks data. For soft multi-tenancy, where logical separation is enough and strict isolation is not required, a metadata filter is fine.
✂️

Chunking & Ingestion Pipelines

The most underrated factor in RAG quality
Fixed-size with Overlap

Simplest approach. Chunk every N tokens with a K-token overlap, so that context cut at a chunk boundary still appears intact in the neighboring chunk. Works well for homogeneous corpora (legal docs, support tickets). N=512, overlap=50 is a common baseline.
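A minimal sketch of fixed-size chunking with overlap. It assumes the text is already tokenized; the example uses a whitespace split as a stand-in for a real tokenizer, and the function name is illustrative.

```python
def chunk_tokens(tokens, size=512, overlap=50):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

# Whitespace split stands in for a model tokenizer (assumption)
tokens = "the quick brown fox jumps over the lazy dog".split()
chunks = chunk_tokens(tokens, size=4, overlap=1)
```

With size=4 and overlap=1, each chunk starts on the last token of the previous one, so a phrase that straddles a boundary is still seen whole by one of the two chunks.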

Semantic Chunking

Split on semantic boundaries: embed consecutive sentences, detect cosine distance spikes (breakpoints), and split there. LangChain and LlamaIndex both implement this. Better recall on heterogeneous documents.
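The breakpoint idea can be sketched without any framework. The version below assumes sentence embeddings are already computed (toy 2-dim vectors here) and uses a fixed distance threshold, whereas library implementations often derive the threshold from a percentile of the observed distances. All names are illustrative.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def semantic_chunks(sentences, embeddings, threshold=0.5):
    """Start a new chunk wherever consecutive sentences are farther apart than threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold:
            chunks.append([])  # distance spike: topic boundary
        chunks[-1].append(sentences[i])
    return [" ".join(c) for c in chunks]

sentences = [
    "Refunds take 5 days.",
    "Refunds go to the original card.",
    "Our office is in Oslo.",
]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy vectors, not real model output
chunks = semantic_chunks(sentences, embeddings)
```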

Recursive Character Split

Try splitting by \n\n, then \n, then . , then space — backing off to smaller separators until chunks fit the target size. The LangChain default. Good general-purpose approach for unstructured text.

Document-aware Splitting

Use document structure: split Markdown on headers, PDFs on page boundaries, code on function/class definitions. Preserves semantic coherence. Use unstructured.io or custom parsers for PDFs.

Ingestion Pipeline Pattern

python — async batch ingestion
import asyncio
from itertools import batched  # Python 3.12+

async def ingest_documents(docs: list[dict], batch_size: int = 100):
    for batch in batched(docs, batch_size):
        texts = [d["text"] for d in batch]
        # Embed in parallel (rate-limited)
        embeddings = await embed_client.aembed(texts)
        vectors = [
            {"id": d["id"], "values": emb,
             "metadata": {"text": d["text"], "source": d["source"], "year": d["year"]}}
            for d, emb in zip(batch, embeddings)
        ]
        idx.upsert(vectors=vectors, namespace="prod")

# Handle document updates: always upsert (never insert), which is idempotent
# Delete + re-ingest stale chunks when source doc changes
# Store doc hash → chunk IDs mapping to detect staleness
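The doc hash → chunk IDs mapping mentioned in the comments above could look like the sketch below. The registry is a plain dict here and all names are hypothetical; in production it would live in a durable store. The function reports which old chunk IDs to delete and whether the document needs re-embedding.

```python
import hashlib

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reingest(doc_id, text, chunk_ids, registry):
    """Return (ids_to_delete, needs_upsert) and update the registry in place.

    registry maps doc_id -> {"hash": str, "chunk_ids": list[str]}.
    """
    new_hash = content_hash(text)
    previous = registry.get(doc_id)
    if previous and previous["hash"] == new_hash:
        return [], False  # unchanged document: skip embedding entirely
    current = set(chunk_ids)
    stale = [] if previous is None else [c for c in previous["chunk_ids"] if c not in current]
    registry[doc_id] = {"hash": new_hash, "chunk_ids": list(chunk_ids)}
    return stale, True

registry = {}
first = plan_reingest("d1", "version one", ["d1-0", "d1-1", "d1-2"], registry)
repeat = plan_reingest("d1", "version one", ["d1-0", "d1-1", "d1-2"], registry)
edited = plan_reingest("d1", "version two, shorter", ["d1-0", "d1-1"], registry)
```

Skipping unchanged documents saves embedding API calls. Deleting the stale IDs keeps a shortened document from leaving orphaned chunks behind in the index.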
ℹ️
Embed the metadata too. Prepend key metadata to the chunk text before embedding: f"[Source: Legal | Year: 2024]\n{chunk_text}". This encodes metadata into the vector itself, improving semantic alignment when queries mention source-type terms. Don't rely solely on metadata filters for this.
🚀

Production Operations

Observability, capacity planning, index maintenance, backup
Key Metrics to Track
  • Query latency p50/p95/p99 — alert if p99 > 200ms
  • Recall@K — measure offline vs exact-search ground truth, alert if drops >2%
  • Index memory utilization — HNSW OOM is a hard crash, not degraded service
  • Write throughput / upsert lag — Milvus/Qdrant buffer writes before indexing
  • Embedding API error rate + latency — upstream bottleneck for ingestion
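Recall@K from the list above can be computed offline with a few lines. The sketch assumes you have, for each test query, the ID list returned by the ANN index and the ID list from an exact (brute-force) search; the function names and sample IDs are illustrative.

```python
def recall_at_k(ann_ids, exact_ids, k):
    """Fraction of the true top-k neighbors that the ANN index also returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 1.0
    return len(truth & set(ann_ids[:k])) / len(truth)

def mean_recall(ann_results, exact_results, k):
    """Average Recall@K over a set of test queries."""
    pairs = list(zip(ann_results, exact_results))
    return sum(recall_at_k(a, e, k) for a, e in pairs) / len(pairs)

# Hypothetical results for two test queries
ann = [[1, 2, 3, 9], [4, 5, 6, 7]]
exact = [[1, 2, 3, 4], [4, 5, 6, 7]]
score = mean_recall(ann, exact, k=4)
```

Running this on a fixed query set after every index rebuild or parameter change gives the baseline needed for the "drops >2%" alert.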
Capacity Planning
  • RAM for HNSW: n_vectors × dims × 4 bytes × 1.3 (30% graph overhead)
  • 1M × 1536-dim float32 = ~6.1 GB raw, ~8 GB RAM with HNSW graph
  • With INT8 quantization: ~2.2 GB (3.5× reduction)
  • With halfvec (pgvector): ~4 GB (2× reduction)
  • Budget 2× headroom for index rebuilds and peak traffic
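The sizing formula above is easy to wrap in a helper for what-if estimates. This is a back-of-the-envelope sketch using the handbook's 1.3 graph-overhead factor and decimal gigabytes; the function name is an assumption, and real usage also depends on M, payload size, and allocator overhead.

```python
def hnsw_ram_gb(n_vectors, dims, bytes_per_dim=4, graph_overhead=1.3):
    """Estimate RAM in GB: raw vector bytes times a graph-overhead factor."""
    return n_vectors * dims * bytes_per_dim * graph_overhead / 1e9

float32_gb = hnsw_ram_gb(1_000_000, 1536)                  # full precision
halfvec_gb = hnsw_ram_gb(1_000_000, 1536, bytes_per_dim=2)  # half precision
budget_gb = 2 * float32_gb                                  # 2x headroom for rebuilds and peaks
```

For 1M × 1536-dim vectors this gives roughly 8 GB at float32 and 4 GB at half precision, so a 16 GB budget covers the float32 case with headroom.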

Index Maintenance

sql / python — maintenance ops
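-- pgvector (SQL): re-index after heavy inserts (fixes fragmented HNSW graph)
REINDEX INDEX CONCURRENTLY idx_documents_hnsw;

-- pgvector (SQL): vacuum to reclaim space from deleted vectors
VACUUM ANALYZE documents;

# Qdrant (Python): segment optimization runs automatically in the background;
# updating the optimizer config prompts the optimizers to re-evaluate segments
from qdrant_client import QdrantClient, models
qdrant = QdrantClient(host="localhost", port=6333)
qdrant.update_collection(
    collection_name="knowledge",
    optimizers_config=models.OptimizersConfigDiff(indexing_threshold=20000),
)

# Milvus (Python): flush, then compact small segments before query benchmarks
# (compact() requires a recent pymilvus release)
client.flush("kb")
client.compact("kb")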

Blue-Green Index Deployment

For zero-downtime re-indexing (when switching models or re-chunking), build the new index in parallel under a new name (kb-v2), run query shadowing (send traffic to both, compare results), then atomically switch the application pointer. Keep kb-v1 for 24 hours as a rollback target.

Cost optimization: Use tiered storage. Keep hot (recent, frequently accessed) vectors in RAM-backed HNSW and cold (archived, rarely accessed) vectors in a disk-based index (Milvus DiskANN or Qdrant on-disk HNSW) or a compressed IVF-PQ index. Route queries by document timestamp. This typically reduces infrastructure costs by 40–60% at 100M+ vector scale.
🗒️

Cheat Sheet

Quick reference — decision tree, operators, parameter table

Decision Tree: Which DB?

Use Pinecone if…
  • You want zero infrastructure management
  • You need hybrid (sparse+dense) natively
  • You're building a prototype or SaaS product fast
  • You need per-tenant namespace isolation without config overhead
Use Milvus if…
  • You need billion-scale self-hosted with full control
  • You need IVF-PQ compression for memory-constrained infra
  • You're already running Kubernetes
  • You need multiple index types per collection (e.g., dense + sparse + scalar)
Use Qdrant if…
  • Your workload has highly selective metadata filters (<5% match)
  • You need quantization (INT8/binary) for memory savings
  • You want a single binary with no external dependencies
  • You need on-disk HNSW for low-cost large-scale deployment
Use pgvector if…
  • You're already running PostgreSQL and <50M vectors
  • You need SQL JOINs between vectors and your relational data
  • You need ACID transactions over vector updates
  • You want to minimize infrastructure surface area

Operator Quick Reference

DB | Cosine | L2 | Dot Product | Filter syntax sample
Pinecone | metric="cosine" | metric="euclidean" | metric="dotproduct" | {"year": {"$gte": 2023}}
Milvus | metric_type="COSINE" | metric_type="L2" | metric_type="IP" | 'year >= 2023 and src == "legal"'
Qdrant | Distance.COSINE | Distance.EUCLID | Distance.DOT | FieldCondition(key="year", range=Range(gte=2023))
pgvector | <=> (vector_cosine_ops) | <-> (vector_l2_ops) | <#> (vector_ip_ops) | WHERE year >= 2023 AND src = 'legal'

HNSW Parameter Reference

Parameter | Pinecone | Milvus | Qdrant | pgvector | Recommended
Max connections (M) | Managed | M | m | m | 16–32 (higher = better recall, more RAM)
Build beam width | Managed | efConstruction | ef_construct | ef_construction | 100–400 (don't go below 64)
Query beam width | Managed | ef in search_params | hnsw_ef in SearchParams | hnsw.ef_search | 40–200 (tune for recall/latency)
📚
Further reading: ann-benchmarks.com for recall/latency comparisons. huggingface.co/spaces/mteb/leaderboard for embedding model rankings. qdrant.tech/articles and milvus.io/blog for engineering deep-dives. For RAG evaluation, use RAGAS or TruLens to measure faithfulness, answer relevancy, and context precision.