Similarity Search & Retrieval Infrastructure

Vector Databases
Enterprise Handbook

A practitioner's deep-dive into Pinecone, Milvus, Qdrant, and pgvector — covering ANN index internals, distance metrics, RAG pipelines, filtered search, and production operations for teams building on embeddings.

HNSW & IVF-Flat · Pinecone · Milvus · Qdrant · pgvector · Advanced RAG · Hybrid Search

📐

What is a Vector Database?

Semantic similarity at scale — the shift from exact to approximate matching

A vector database stores, indexes, and queries high-dimensional float arrays — embeddings — produced by neural models. Unlike SQL where you filter on equality or range, vector DBs answer nearest-neighbor queries: "find the 10 vectors most similar to this query vector." This is the engine behind semantic search, RAG, recommendation, anomaly detection, and multimodal retrieval.

Exact vs Approximate

Exact k-NN is O(n·d) — scanning 10M 1536-dim vectors takes ~1 second. ANN indexes trade a small recall loss (1–5%) for 100–1000× speedups.
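The O(n·d) cost of exact search is easiest to see in a brute-force implementation. The following is a minimal pure-Python sketch; the function name `exact_knn` and the toy corpus are illustrative assumptions, not part of any library.

```python
import heapq
import math

def exact_knn(query, vectors, k):
    """Return the k (index, cosine_similarity) pairs most similar to query.

    Every stored vector is scored, so the cost is O(n * d) per query.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    scored = ((i, cosine(query, v)) for i, v in enumerate(vectors))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Toy 2-dim corpus (hypothetical data)
corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
top2 = exact_knn([1.0, 0.05], corpus, k=2)
```

An ANN index replaces the full scan with a search over a small candidate set, which is where both the speedup and the recall loss come from.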

Metadata Filtering

Real workloads combine vector similarity with structured predicates: doc_type='legal' AND year>2022. How a DB handles this conjunction is the key differentiator.

Hybrid Search

Sparse (BM25/TF-IDF keyword) + dense (embedding) retrieval fused via RRF or weighted sum. Outperforms either alone, especially for rare terms and exact matches.
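Reciprocal rank fusion (RRF) needs only the ranked ID lists from each retriever, not their raw scores. A minimal sketch follows; `rrf_fuse` and the document IDs are hypothetical, and k=60 is a commonly used smoothing constant rather than a value taken from this handbook.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60, top_n=10):
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for deterministic output
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]

dense_hits = ["d3", "d1", "d7"]   # from the embedding index
sparse_hits = ["d1", "d9", "d3"]  # from BM25
fused = rrf_fuse([dense_hits, sparse_hits])
```

Documents that appear in both lists rise to the top even when neither retriever ranked them first, which is why RRF helps with rare terms and exact matches.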

ℹ️

Dimensionality note: OpenAI text-embedding-3-small → 1536 dims. Cohere Embed v3 → 1024 dims. E5-large → 1024 dims. Models trained with Matryoshka representation learning, such as OpenAI text-embedding-3-* and nomic-embed-text-v1.5, can be truncated to 512 or 256 dims. Smaller dims mean faster queries and less memory but lower recall. Most production workloads use 768–1536.

🔬

ANN Index Internals

HNSW, IVF-Flat, IVF-PQ, ScaNN, DiskANN — how the sausage is made

HNSW — Hierarchical Navigable Small World

The workhorse of most modern vector DBs. A multi-layer graph where each layer is a randomly thinned version of the one below. Search starts at the top (sparse, long-range connections) and greedily descends, increasing resolution each layer. Insert-time cost is O(log n); query cost is O(log n) with constant depending on ef.

Key Parameters

M — max edges per node (16–64). Higher = better recall, more RAM and build time.
ef_construction — beam width during build (100–500). Larger = higher quality graph, slower build.
ef_search — beam width at query time. Tune this at runtime for the recall/latency tradeoff.

Memory Profile

Overhead: ~(M × 8 bytes) per vector for graph edges, plus raw vectors.
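1M vectors × 1536 dims × float32 ≈ 6.1 GB raw. Budget ~20–40% on top for the HNSW graph and index bookkeeping; the edge lists themselves are smaller than that at this dimensionality (M=16 → ~128 bytes per vector).
HNSW is designed to live entirely in RAM. If the index doesn't fit, you need a disk-based index (DiskANN in Milvus, memory-mapped on-disk HNSW in Qdrant) or IVF-PQ compression.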

IVF-Flat — Inverted File Index

Clusters vectors into nlist Voronoi cells at build time (k-means). At query time, probe the nprobe nearest centroids and do exact search within those cells only. Naturally shardable; faster builds than HNSW on very large datasets. Milvus defaults to IVF-Flat for large collections.

python — Milvus IVF-Flat config

# Build: nlist = sqrt(n) is the classic heuristic
index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "COSINE",
    "params": {"nlist": 1024}   # cluster count
}

# Query: nprobe = 10–20% of nlist for 95%+ recall
search_params = {"metric_type": "COSINE", "params": {"nprobe": 128}}
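The nlist/nprobe trade-off can be sketched without any library. This is a toy illustration under stated assumptions: the centroids are hard-coded instead of learned with k-means, and `build_ivf` and `ivf_search` are hypothetical names, not Milvus APIs.

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign each vector index to the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))
        lists[nearest].append(i)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe, k):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    return sorted(candidates, key=lambda i: l2(query, vectors[i]))[:k]

# Centroids are hard-coded here; a real build step would learn them with k-means.
centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = [[0.1, 0.2], [0.3, 0.1], [9.8, 10.1], [10.2, 9.9], [4.9, 4.9]]
lists = build_ivf(vectors, centroids)
hits_one = ivf_search([5.2, 5.2], vectors, centroids, lists, nprobe=1, k=1)
hits_all = ivf_search([5.2, 5.2], vectors, centroids, lists, nprobe=2, k=1)
```

The query sits near a cell boundary: with nprobe=1 the true nearest neighbor lives in the unprobed cell and is missed, while nprobe=2 finds it. That is the recall cost that raising nprobe buys back.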

IVF-PQ — Product Quantization

Compresses vectors 4–32× by splitting each d-dim vector into m sub-vectors and replacing each with a codebook index. Drastically reduces memory but hurts recall. Typical use: 100M+ vectors where HNSW OOMs. Milvus and Faiss support PQ; pgvector does not, and offers halfvec and binary quantization instead.
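The compression ratio follows directly from the number of sub-vectors. A small arithmetic sketch, assuming 8-bit codes (256 centroids per sub-space) and ignoring the shared codebook storage; `pq_compression` is a hypothetical helper.

```python
def pq_compression(dims, m, bits_per_code=8, bytes_per_float=4):
    """Return (raw_bytes, pq_bytes, ratio) for one vector under product quantization.

    Each of the m sub-vectors is replaced by one codebook index of bits_per_code bits.
    Codebook storage is shared across all vectors and is ignored here.
    """
    if dims % m != 0:
        raise ValueError("dims must be divisible by m")
    raw_bytes = dims * bytes_per_float
    pq_bytes = m * bits_per_code // 8
    return raw_bytes, pq_bytes, raw_bytes / pq_bytes

# 1536-dim float32 vector split into 192 sub-vectors of 8 dims each
raw, code, ratio = pq_compression(dims=1536, m=192)
```

Fewer, wider sub-vectors give more compression and lower recall; more, narrower sub-vectors do the opposite.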

DiskANN — SSD-Resident Graphs

Microsoft Research's algorithm (2019) for billion-scale datasets that don't fit in RAM. It stores full-precision vectors and graph edges on SSD and keeps only compressed vectors and a cache in DRAM. Milvus ships it as the DISKANN index type; Qdrant does not implement DiskANN but reaches a similar cost profile with memory-mapped on-disk HNSW and vector storage. Latency is higher (~2–5ms vs sub-ms for in-memory) but cost per vector is 10–20× lower.

ScaNN — Google's Asymmetric Hashing

Used internally by Google; open-sourced in 2020. Ranks among the strongest recall/latency tradeoffs on standard ANN benchmarks (ann-benchmarks.com) but is C++-only with a Python wrapper. Pinecone's proprietary index likely draws inspiration here. Milvus exposes a SCANN index type; Qdrant does not offer it.

Index | Build Speed | Query Speed | RAM Usage | Best For
HNSW | Slow (O(n log n)) | ⚡ Fastest | High (in-RAM) | <50M vecs, low-latency
IVF-Flat | Fast | Fast | Medium | Batch workloads, sharded clusters
IVF-PQ | Fast | Fast | ⚡ Very low | 100M+ vecs, memory-constrained
DiskANN | Slow | Medium (~3ms) | ⚡ Very low | Billion-scale, cost-sensitive
Flat (brute) | None | Slow (exact) | Low | <100K vecs, baseline, testing

📏

Distance Metrics

Cosine, dot product, Euclidean — when each applies

Cosine Similarity

Measures angle between vectors, ignoring magnitude. Best for text embeddings where magnitude encodes word frequency artifacts. Most embedding models are trained with cosine loss. Range: [-1, 1].

sim(A,B) = (A·B) / (||A|| × ||B||)

Dot Product

Equivalent to cosine on unit-normalized vectors, which most modern models produce. It is cheaper to compute than full cosine because it skips the norm division. If your model normalizes outputs, use dot product.

sim(A,B) = Σ Aᵢ × Bᵢ

Euclidean (L2)

Geometric distance in embedding space. Useful when absolute position and magnitude in latent space matter, as with some image and multimodal embeddings. CLIP itself is trained with cosine similarity on normalized embeddings, so check the model card before choosing L2. Range: [0, ∞).

d(A,B) = √Σ(Aᵢ - Bᵢ)²
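The three formulas above are closely related on unit-length vectors: dot product equals cosine, and squared L2 distance equals 2 − 2·cosine, so all three produce the same ranking. A short pure-Python check, with illustrative function names:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

a, b = [3.0, 4.0], [4.0, 3.0]
ua, ub = normalize(a), normalize(b)

cos_ab = cosine(a, b)        # angle-based similarity of the raw vectors
dot_unit = dot(ua, ub)       # same value once both vectors are unit length
l2_unit_sq = l2(ua, ub) ** 2 # equals 2 - 2 * cos_ab on unit vectors
```

On unnormalized vectors the three metrics can disagree, which is why the index metric has to match how the model was trained.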

⚠️

Metric mismatch is a silent killer. Indexing with cosine but querying with L2 (or vice versa) returns valid-looking but semantically wrong results. Always match the metric to the model's training objective and check the model card. OpenAI, Cohere, and CLIP models: cosine/dot product. Binary quantized models: Hamming distance.

🧬

Embeddings Primer

Choosing the right model for your retrieval task

Model | Dims | Context | MTEB Score | Best For
text-embedding-3-large | 3072 | 8191 tok | 64.6 | Highest quality, OpenAI workloads
text-embedding-3-small | 1536 | 8191 tok | 62.3 | Balance of cost + quality
Cohere embed-v3-english | 1024 | 512 tok | 64.5 | High recall, Cohere platform
BAAI/bge-large-en-v1.5 | 1024 | 512 tok | 63.9 | Open-source, self-hosted
BAAI/bge-m3 | 1024 | 8192 tok | 62.7 | Multilingual, hybrid (dense+sparse)
nomic-embed-text-v1.5 | 768 | 8192 tok | 62.4 | Long-context, open weights
gte-Qwen2-7B | 3584 | 32768 tok | 70.2 | SOTA, heavy, self-hosted GPU

⚡

Matryoshka embeddings: Models such as text-embedding-3-* and nomic-embed-text-v1.5 support truncating to smaller dims (e.g., 256) with acceptable quality loss. Use a tiered strategy: retrieve 100 candidates cheaply at 256 dims, then re-rank the top 20 with full 1536-dim cosine. Expect roughly a 3–4× cost reduction at <2% recall loss, but verify both numbers on your own data.
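The tiered strategy can be sketched in a few lines. This is an illustration under stated assumptions: vectors are truncated and re-normalized on the fly, whereas a real deployment would store the short vectors in their own index, and `tiered_search` is a hypothetical helper.

```python
import math

def truncate_and_normalize(vec, dims):
    """Keep the first dims components and rescale to unit length."""
    head = vec[:dims]
    n = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / n for x in head]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def tiered_search(query, corpus, short_dims, n_candidates, top_k):
    """Stage 1: rank by truncated vectors. Stage 2: re-rank survivors at full size."""
    q_short = truncate_and_normalize(query, short_dims)
    shortlist = sorted(
        range(len(corpus)),
        key=lambda i: -dot(q_short, truncate_and_normalize(corpus[i], short_dims)),
    )[:n_candidates]
    q_full = truncate_and_normalize(query, len(query))
    return sorted(
        shortlist,
        key=lambda i: -dot(q_full, truncate_and_normalize(corpus[i], len(query))),
    )[:top_k]

# Toy 4-dim corpus: items 0 and 1 are identical in the first two dims
corpus = [
    [0.9, 0.1, 0.0, 0.4],
    [0.9, 0.1, 0.4, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]
best = tiered_search([1.0, 0.0, 0.5, 0.0], corpus, short_dims=2, n_candidates=2, top_k=1)
```

The cheap stage cannot tell items 0 and 1 apart; the full-dimension re-rank resolves the tie using the trailing components.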

🌲

Pinecone

Managed serverless vector database — zero ops, pay per use

MANAGED SaaS

Pinecone

Proprietary index (likely ScaNN-inspired), serverless + pod-based tiers, sparse-dense hybrid natively supported since 2024

Serverless · Hybrid Search · Namespaces · Metadata Filters

Core Concepts

Indexes are the top-level unit — a named collection with a fixed dimension and metric. Serverless indexes auto-scale; pod indexes provision explicit hardware (s1, p1, p2). Namespaces provide logical partitioning within an index — ideal for multi-tenancy (one namespace per user/org). Queries are always namespace-scoped.

python

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="...")

# Create serverless index
pc.create_index(
    name="knowledge-base",
    dimension=1536,
    metric="cosine",          # cosine | dotproduct | euclidean
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)
idx = pc.Index("knowledge-base")

# Upsert (create or update by ID)
idx.upsert(
    vectors=[
        {"id": "doc-001", "values": embedding, "metadata": {"source": "legal", "year": 2024}},
    ],
    namespace="tenant-acme"
)

# Filtered semantic search
results = idx.query(
    vector=query_embedding,
    top_k=10,
    namespace="tenant-acme",
    filter={"source": {"$eq": "legal"}, "year": {"$gte": 2023}},
    include_metadata=True
)

Hybrid Search (Sparse + Dense)

python

from pinecone_text.sparse import BM25Encoder

bm25 = BM25Encoder.default()  # parameters pre-fitted on MS MARCO

# Sparse-dense vectors require an index created with metric="dotproduct"
idx.upsert(vectors=[{
    "id": "doc-42",
    "values": dense_embedding,           # float list
    "sparse_values": bm25.encode_documents([text])[0],  # {indices, values}
    "metadata": {"text": text}
}])

# Hybrid query: query() has no alpha parameter, so weight both vectors client-side
# alpha=1.0 → pure dense, alpha=0.0 → pure BM25
alpha = 0.75   # 75% dense / 25% keyword
sparse_q = bm25.encode_queries([query])[0]
results = idx.query(
    vector=[v * alpha for v in dense_q],
    sparse_vector={"indices": sparse_q["indices"],
                   "values": [v * (1 - alpha) for v in sparse_q["values"]]},
    top_k=10
)

⚠️

Serverless cold start: Serverless indexes can experience 200–500ms cold-start latency after inactivity. For latency-critical workloads (<50ms p99), use pod-based indexes or keep the index warm with a scheduled ping. Serverless billing: per read/write unit, not provisioned capacity.

Metadata Filter Caveats

Pinecone performs post-ANN filtering — it retrieves top_k × K candidates internally, then applies the filter. If your filter is highly selective (only 1% of vectors match), you may get fewer than top_k results. Workaround: use a larger internal fetch multiplier via top_k or consider namespace-level partitioning for high-selectivity filters.

🐳

Milvus

Open-source, cloud-native, billion-scale — the Kubernetes of vector DBs

OPEN SOURCE

Milvus 2.x

Decoupled storage/compute architecture, Pulsar message queue, FAISS-backed indexes, Zilliz Cloud managed option

Distributed · HNSW / IVF-PQ · Partitions · Scalar Index

Architecture

Milvus separates storage (MinIO/S3), query nodes (ANN execution), data nodes (WAL → object store flush), and coordination (etcd). This allows independent scaling of each tier. For teams <10M vectors, Milvus Lite (single binary, no dependencies) or docker compose standalone suffices.

python — pymilvus

from pymilvus import (MilvusClient, DataType, CollectionSchema,
                       FieldSchema, utility)

client = MilvusClient(uri="http://localhost:19530")

# Define schema with primary key + vector + scalar fields
schema = CollectionSchema(fields=[
    FieldSchema(name="id",     dtype=DataType.INT64,          is_primary=True, auto_id=True),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=1536),
    FieldSchema(name="source", dtype=DataType.VARCHAR,       max_length=64),
    FieldSchema(name="year",   dtype=DataType.INT32),
], description="Knowledge base", enable_dynamic_field=True)

client.create_collection(collection_name="kb", schema=schema)

# Build an HNSW index on the vector field plus scalar indexes for fast pre-filtering.
# MilvusClient.create_index takes an IndexParams object, not a raw dict.
index_params = client.prepare_index_params()
index_params.add_index(field_name="vector", index_type="HNSW", metric_type="COSINE",
                       params={"M": 16, "efConstruction": 200})
index_params.add_index(field_name="source", index_type="Trie")
index_params.add_index(field_name="year",   index_type="INVERTED")
client.create_index(collection_name="kb", index_params=index_params)
client.load_collection("kb")

# Insert
client.insert("kb", [
    {"vector": embedding, "source": "legal", "year": 2024, "text": chunk_text}
])

# Filtered ANN search
results = client.search(
    collection_name="kb",
    data=[query_vec],
    anns_field="vector",
    limit=10,
    filter='source == "legal" and year >= 2023',
    output_fields=["source", "year", "text"],
    search_params={"metric_type": "COSINE", "params": {"ef": 200}}
)

Partitions & Partition Keys

python

# Partition key: Milvus hashes key values into a fixed number of partitions
# and routes inserts and searches automatically (suited to high-cardinality tenant isolation)
FieldSchema(name="tenant_id", dtype=DataType.VARCHAR,
            max_length=64, is_partition_key=True)

# Search is automatically scoped to the matching partition
results = client.search(..., filter='tenant_id == "acme"')

⚡

Milvus filtering modes: Milvus 2.4+ uses a hybrid filtering strategy — for high-selectivity filters it switches to filtered HNSW (iterates the graph, skips non-matching nodes); for low-selectivity it post-filters after ANN. Scalar indexes (Trie for strings, INVERTED for numerics) are critical — without them, filters degrade to O(n) scans.

🔷

Qdrant

Rust-native, payload-first filtering, sparse + dense, on-disk HNSW

OPEN SOURCE · RUST

Qdrant

HNSW with filtered graph traversal baked in, WAL, quantization (scalar/product/binary), Qdrant Cloud managed

Filtered HNSW · Quantization · Sparse Vectors · On-disk HNSW

Key Differentiator: Payload Indexing

Qdrant's filtering is a first-class index citizen, not an afterthought. Payload fields can be indexed with keyword, integer, float, text, datetime, or geo index types. The HNSW graph traversal itself checks payload conditions during graph walks — this means filtered search recall is near-identical to unfiltered, even for highly selective filters.

python — qdrant-client

from qdrant_client import QdrantClient, models

client = QdrantClient(host="localhost", port=6333)

# Create collection with named vectors + on-disk HNSW
client.create_collection(
    collection_name="knowledge",
    vectors_config={
        "dense": models.VectorParams(size=1536, distance=models.Distance.COSINE,
                                        on_disk=True),   # original vectors on SSD
    },
    sparse_vectors_config={
        "sparse": models.SparseVectorParams()          # BM42 / SPLADE
    },
    hnsw_config=models.HnswConfigDiff(m=16, ef_construct=200, on_disk=True),  # HNSW graph on SSD
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,   # 4× memory reduction
            quantile=0.99,
            always_ram=True               # keep quantized vecs in RAM
        )
    )
)

# Index payload fields for fast pre-filtering
client.create_payload_index("knowledge", "source",   models.PayloadSchemaType.KEYWORD)
client.create_payload_index("knowledge", "year",     models.PayloadSchemaType.INTEGER)

# Upsert points
client.upsert(collection_name="knowledge", points=[
    models.PointStruct(id=1,
        vector={"dense": dense_vec, "sparse": models.SparseVector(indices=[...], values=[...])},
        payload={"source": "legal", "year": 2024, "text": chunk}
    )
])

# Hybrid search — dense + sparse fused via RRF
results = client.query_points(
    collection_name="knowledge",
    prefetch=[
        models.Prefetch(query=dense_vec,   using="dense",  limit=50),
        models.Prefetch(query=models.SparseVector(indices=qi, values=qv),
                         using="sparse", limit=50),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    query_filter=models.Filter(must=[
        models.FieldCondition(key="source", match=models.MatchValue(value="legal")),
        models.FieldCondition(key="year",   range=models.Range(gte=2023)),
    ]),
    limit=10
)

Quantization Options

Scalar (INT8)

4× memory reduction, <2% recall loss. Best default for most workloads. Keep quantized vectors in RAM, raw on disk.

Product (PQ)

8–32× reduction, 3–8% recall loss. Use for billion-scale collections. Requires rescoring enabled.

Binary

32× reduction (one bit per float32 dimension). Sub-1ms queries. Use only with binary-aware models (e.g., Cohere binary embeddings). Hamming distance.
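What scalar and binary quantization do to a single vector can be shown in plain Python. This is a sketch of the general technique, not Qdrant's implementation: the clipping bounds `lo` and `hi` stand in for the quantile-derived range, and all function names are illustrative.

```python
def scalar_quantize(vec, lo, hi):
    """Map floats in [lo, hi] to unsigned 8-bit codes; values outside are clipped."""
    scale = 255.0 / (hi - lo)
    return bytes(round((min(max(x, lo), hi) - lo) * scale) for x in vec)

def scalar_dequantize(codes, lo, hi):
    """Approximate reconstruction used when rescoring."""
    step = (hi - lo) / 255.0
    return [lo + c * step for c in codes]

def binary_quantize(vec):
    """One bit per dimension: 1 if the component is positive, else 0."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")

vec = [-1.0, -0.5, 0.0, 0.5, 1.0]
codes = scalar_quantize(vec, lo=-1.0, hi=1.0)        # 1 byte per dim instead of 4
restored = scalar_dequantize(codes, lo=-1.0, hi=1.0)
distance = hamming(binary_quantize([0.3, -0.2, 0.8]), binary_quantize([0.1, 0.4, 0.9]))
```

The reconstruction error of the scalar codes is bounded by half a quantization step, which is why rescoring against the original vectors recovers most of the lost recall.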

🐘

pgvector

Vector search inside PostgreSQL — SQL joins, ACID, zero new infra

POSTGRES EXTENSION

pgvector 0.7+

HNSW + IVFFlat indexes, half-precision (halfvec), binary vectors (bit), sparsevec for BM25, full SQL interface

HNSW · ACID Transactions · SQL Joins · halfvec

Setup & Index Creation

sql

-- Install extension (once per database)
CREATE EXTENSION IF NOT EXISTS vector;

-- Table with vector column (1536-dim)
CREATE TABLE documents (
    id         BIGSERIAL PRIMARY KEY,
    content    TEXT          NOT NULL,
    source     VARCHAR(64)  NOT NULL,
    year       INTEGER,
    embedding  VECTOR(1536), -- float32
    emb_half   HALFVEC(1536) GENERATED ALWAYS AS (embedding::halfvec(1536)) STORED
);

-- HNSW index (pgvector 0.5+) — best for low-latency queries
-- Use maintenance_work_mem to speed up builds (e.g., SET maintenance_work_mem='2GB')
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- IVFFlat — faster builds, lower RAM, but lower recall
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 1000);  -- lists ≈ sqrt(row count)

-- Regular B-tree indexes for filtering
CREATE INDEX ON documents (source);
CREATE INDEX ON documents (year);

sql — query patterns

-- Top-10 nearest neighbors (cosine)
SELECT id, content, 1 - (embedding <=> $1) AS score
FROM documents
ORDER BY embedding <=> $1   -- <=> cosine, <-> L2, <#> dot product
LIMIT 10;

-- Filtered search (pre-filter on source+year, then vector sort)
SET enable_seqscan = off;     -- force index use during testing
SELECT id, content, embedding <=> $1 AS dist
FROM documents
WHERE source = 'legal'
  AND year >= 2023
ORDER BY embedding <=> $1
LIMIT 10;

-- Tune ef_search at session level (higher = more recall, slower)
SET hnsw.ef_search = 200;

-- Half-precision index for 2× memory saving (pgvector 0.7+)
CREATE INDEX ON documents USING hnsw (emb_half halfvec_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Hybrid: vector similarity + full-text + structured filter in ONE query
-- Note: ts_rank is a term-frequency rank, not true BM25
SELECT d.id, d.content,
       ts_rank(to_tsvector(d.content), plainto_tsquery($2)) AS text_rank,
       d.embedding <=> $1 AS vec_dist
FROM documents d
WHERE d.source = 'legal'
  AND to_tsvector(d.content) @@ plainto_tsquery($2)
ORDER BY (0.7 * (d.embedding <=> $1))
       + (0.3 / (ts_rank(to_tsvector(d.content), plainto_tsquery($2)) + 1))
LIMIT 10;

🔴

pgvector filtering trap: with a WHERE clause, the planner chooses between two plans and both have a cost. If it filters first (B-tree or sequential scan) and then sorts the surviving rows by distance, it bypasses the HNSW index entirely: results are exact, but the query can be 100–1000× slower on large tables when many rows match. If it scans the HNSW index instead, the filter is applied after the index scan, so a selective filter can return fewer than LIMIT rows. Use EXPLAIN ANALYZE to verify which plan runs, and consider partial indexes: CREATE INDEX ... WHERE source = 'legal'.

Python with psycopg3 / SQLAlchemy

python

from pgvector.psycopg import register_vector
import psycopg, numpy as np

with psycopg.connect(dsn) as conn:
    register_vector(conn)
    # Insert
    conn.execute("INSERT INTO documents (content, source, year, embedding) VALUES (%s,%s,%s,%s)",
                 (text, "legal", 2024, np.array(embedding)))
    # Query
    rows = conn.execute("SELECT id, content FROM documents ORDER BY embedding <=> %s LIMIT %s",
                         (np.array(query_vec), 10)).fetchall()

⚖️

Head-to-Head Comparison

Choosing the right database for your use case

Dimension | 🌲 Pinecone | 🐳 Milvus | 🔷 Qdrant
Index type | Proprietary | HNSW, IVF-PQ, DiskANN | HNSW (filtered)
Max scale | 100B+ (pods) | Billion+ | 100M–1B+
Filtering | Post-ANN | Scalar index | In-graph
Hybrid search | Native | BGE-M3 native | RRF / linear
Multi-tenancy | Namespaces | Partition keys | Collections
Quantization | None exposed | IVF-PQ, binary | SQ8, PQ, binary
Ops burden | Zero | Medium (K8s) | Low (single bin)
SQL joins | - | - | -
Transactions | - | Eventual | WAL only
Pricing model | Per R/W unit | Open source / Zilliz | Open source / Cloud
Best for | SaaS, zero-ops teams | Billion-scale self-hosted | High-filter workloads

Dimension | 🐘 pgvector
Index type | HNSW + IVFFlat (Postgres native)
Max scale | ~50M vecs comfortably; larger needs partitioning
Filtering | Planner may bypass ANN; verify with EXPLAIN
Hybrid search | Combine with tsvector for BM25 in one query
Multi-tenancy | Row-level security + schemas (native Postgres)
Ops burden | None if you run Postgres already
SQL joins | Full SQL: JOIN with users, orders, anything
Transactions | Full ACID
Best for | Teams already on Postgres, strong relational requirements, <50M vectors

📡

Advanced RAG Patterns

Beyond naive retrieval — re-ranking, multi-stage, HyDE, parent-child

Two-Stage Retrieval (Embed → Re-rank)

The workhorse of production RAG. Stage 1: retrieve 50–100 candidates cheaply via ANN. Stage 2: re-rank with a cross-encoder (Cohere Rerank, BGE-Reranker, Jina Reranker) that reads both query and document jointly, producing far more accurate relevance scores. Final context: top 5–10 from re-ranker.

python — two-stage RAG

import cohere

co = cohere.Client("...")

# Stage 1: ANN retrieval — broad net
candidates = vector_db.query(query_embedding, top_k=80)
docs = [(c.id, c.metadata["text"]) for c in candidates.matches]

# Stage 2: Cross-encoder re-ranking
reranked = co.rerank(
    query=user_query,
    documents=[d for _, d in docs],
    model="rerank-english-v3.0",
    top_n=5
)

context_chunks = [docs[r.index][1] for r in reranked.results]

HyDE — Hypothetical Document Embeddings

Use the LLM to generate a hypothetical answer to the query, embed that answer (not the query), then retrieve using the answer's embedding. Dramatically improves recall for knowledge-gap queries where the user doesn't know the vocabulary of the answer domain.

python

# Generate hypothetical answer
hyde_prompt = f"Write a short, authoritative passage that answers: {user_query}"
hypothetical_doc = llm.generate(hyde_prompt, max_tokens=256)

# Embed the hypothetical answer, not the query
hyde_embedding = embed.encode(hypothetical_doc)

# Retrieve using the hypothetical embedding
results = vector_db.query(hyde_embedding, top_k=20)

Parent-Child Chunking

Store small child chunks (128–256 tokens) for precise embedding, but return the full parent chunk (512–1024 tokens) to the LLM for context. Child chunks have sharper semantic signal; parent chunks give the LLM enough context to generate coherent answers.

python

# Index child chunks with parent_id in metadata
for parent in parent_chunks:
    for child in split_children(parent, size=256):
        vector_db.upsert([{
            "id": child.id,
            "values": embed(child.text),
            "metadata": {"text": child.text, "parent_id": parent.id}
        }])

# At query time: retrieve children, fetch parents
child_hits = vector_db.query(query_vec, top_k=10, include_metadata=True)
parent_ids = list({h.metadata["parent_id"] for h in child_hits.matches})
context = [parent_store[pid].text for pid in parent_ids]  # deduped parent text

Query Routing & Decomposition

Query Routing

Use a lightweight classifier (or LLM with structured output) to route queries to the appropriate collection or retrieval strategy. Example: route "show me the org chart" to a structured DB, "explain our refund policy" to the vector store.

Query Decomposition

Break complex multi-part questions into sub-queries, retrieve separately, then merge before generation. Example: "Compare our Q3 revenue with industry average" → retrieve internal Q3 docs + retrieve industry benchmark docs separately.

🔽

Filtered Search Deep-Dive

Pre-filter, post-filter, in-graph — each has different recall characteristics

Pre-filtering

Apply the metadata filter first (get matching doc IDs), then search only within that subset. Recall within the subset is exact when the engine falls back to a brute-force scan, which is cheap for small subsets but slow when many vectors match. Examples: Milvus with a scalar index, Qdrant with a payload index.

Post-filtering

Do full ANN search (top K × multiplier), then filter results. Can return fewer than K results if filter is very selective. Pinecone uses this approach. Mitigation: increase top_k with a multiplier (e.g., 10× the desired results).

In-graph (HNSW)

Filter is evaluated during graph traversal — non-matching nodes are skipped. Qdrant's approach. Near-identical recall to unfiltered HNSW. Requires payload indexes; degrades without them.
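The difference between pre- and post-filtering shows up even on four points. In this toy sketch an exhaustive sort stands in for the ANN index, and the function names, `fetch_multiplier` parameter, and data are all hypothetical.

```python
def l2_sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def post_filter_search(query, points, predicate, top_k, fetch_multiplier=1):
    """Rank everything, keep top_k * fetch_multiplier, then drop non-matching hits."""
    ranked = sorted(points, key=lambda p: l2_sq(query, p["vector"]))
    fetched = ranked[: top_k * fetch_multiplier]
    return [p["id"] for p in fetched if predicate(p)][:top_k]

def pre_filter_search(query, points, predicate, top_k):
    """Restrict to matching points first, then rank only that subset."""
    subset = [p for p in points if predicate(p)]
    return [p["id"] for p in sorted(subset, key=lambda p: l2_sq(query, p["vector"]))[:top_k]]

points = [
    {"id": "a", "vector": [0.0, 0.1], "year": 2021},
    {"id": "b", "vector": [0.1, 0.0], "year": 2022},
    {"id": "c", "vector": [0.2, 0.2], "year": 2024},
    {"id": "d", "vector": [0.9, 0.9], "year": 2023},
]
recent = lambda p: p["year"] >= 2023
query = [0.0, 0.0]

post = post_filter_search(query, points, recent, top_k=2)
post_wide = post_filter_search(query, points, recent, top_k=2, fetch_multiplier=2)
pre = pre_filter_search(query, points, recent, top_k=2)
```

The two nearest points fail the filter, so plain post-filtering returns nothing; over-fetching or pre-filtering recovers both matching points.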

✅

Multi-tenancy pattern: For strict tenant isolation (tenants must never see each other's data), use namespace (Pinecone), partition key (Milvus), or a separate collection per tenant (Qdrant). Metadata filter alone is insufficient — a misconfigured filter leaks data. For soft multi-tenancy (performance isolation is acceptable, strict separation is not required), metadata filter is fine.

✂️

Chunking & Ingestion Pipelines

The most underrated factor in RAG quality

Fixed-size with Overlap

Simplest approach. Chunk every N tokens with a K-token overlap so that context straddling a boundary is shared by both neighboring chunks. Works well for homogeneous corpora (legal docs, support tickets). N=512, overlap=50 is a common baseline.
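A sliding window over a token list is all this strategy needs. A minimal sketch, assuming the text is already tokenized; `chunk_with_overlap` is an illustrative name, and integers stand in for tokens in the example.

```python
def chunk_with_overlap(tokens, size=512, overlap=50):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

tokens = list(range(10))  # stand-in for real tokenizer output
chunks = chunk_with_overlap(tokens, size=4, overlap=1)
```

Each chunk starts with the last `overlap` tokens of the previous one, so a sentence cut at a boundary keeps some of its context on both sides.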

Semantic Chunking

Split on semantic boundaries: embed consecutive sentences, detect spikes in cosine distance between neighbors (breakpoints), and split there. LangChain and LlamaIndex both implement this. Better recall on heterogeneous documents.

Recursive Character Split

Try splitting by "\n\n", then "\n", then ". ", then " ", backing off to smaller separators until chunks fit the target size. The LangChain default. Good general-purpose approach for unstructured text.
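The back-off logic can be sketched as a short recursive function. This is a simplified illustration of the idea, measured in characters, and not LangChain's implementation; `recursive_split` is a hypothetical name. Separators at chunk boundaries are dropped.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first, recursing with finer ones as needed."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_len:
            current = piece
        else:
            # The piece alone is too long: retry it with finer separators.
            chunks.extend(recursive_split(piece, max_len, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks

sample = "Intro line.\n\nFirst point is short. Second point is a bit longer.\n\nEnd."
parts = recursive_split(sample, max_len=30)
```

Paragraph breaks are honored first; only the middle paragraph, which exceeds the limit, is split further at sentence boundaries.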

Document-aware Splitting

Use document structure: split Markdown on headers, PDFs on page boundaries, code on function/class definitions. Preserves semantic coherence. Use unstructured.io or custom parsers for PDFs.

Ingestion Pipeline Pattern

python — async batch ingestion

import asyncio
from itertools import batched  # requires Python 3.12+

async def ingest_documents(docs: list[dict], batch_size: int = 100):
    for batch in batched(docs, batch_size):
        texts = [d["text"] for d in batch]

        # Embed the whole batch in one request (embed_client handles rate limits)
        embeddings = await embed_client.aembed(texts)

        vectors = [
            {"id": d["id"], "values": emb,
             "metadata": {"text": d["text"], "source": d["source"], "year": d["year"]}}
            for d, emb in zip(batch, embeddings)
        ]
        # idx.upsert blocks, so run it in a worker thread to keep the event loop free
        await asyncio.to_thread(idx.upsert, vectors=vectors, namespace="prod")

# Handle document updates: always upsert (never insert) — idempotent
# Delete + re-ingest stale chunks when source doc changes
# Store doc hash → chunk IDs mapping to detect staleness
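The doc hash → chunk IDs mapping from the notes above can be sketched as follows. The in-memory `registry` dict, the `plan_reingest` helper, and the chunk ID format are assumptions for illustration; a production system would persist the registry.

```python
import hashlib

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reingest(doc_id, new_text, new_chunk_ids, registry):
    """Compare the stored hash with the new content and return what to do.

    registry maps doc_id -> {"hash": str, "chunk_ids": list[str]}.
    Returns (chunk_ids_to_delete, chunk_ids_to_upsert) and updates the registry.
    """
    digest = content_hash(new_text)
    previous = registry.get(doc_id)
    if previous and previous["hash"] == digest:
        return [], []  # unchanged: skip embedding and upsert entirely
    stale = previous["chunk_ids"] if previous else []
    registry[doc_id] = {"hash": digest, "chunk_ids": list(new_chunk_ids)}
    # Old chunks whose IDs are not reused must be deleted explicitly
    return [cid for cid in stale if cid not in new_chunk_ids], list(new_chunk_ids)

registry = {}
first = plan_reingest("doc-1", "v1 text", ["doc-1#0", "doc-1#1"], registry)
same = plan_reingest("doc-1", "v1 text", ["doc-1#0", "doc-1#1"], registry)
changed = plan_reingest("doc-1", "v2 text, shorter", ["doc-1#0"], registry)
```

Upserts alone would leave the orphaned second chunk searchable after the document shrinks; the delete list is what removes it.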

ℹ️

Embed the metadata too. Prepend key metadata to the chunk text before embedding: f"[Source: Legal | Year: 2024]\n{chunk_text}". This encodes metadata into the vector itself, improving semantic alignment when queries mention source-type terms. Don't rely solely on metadata filters for this.

🚀

Production Operations

Observability, capacity planning, index maintenance, backup

Key Metrics to Track

Query latency p50/p95/p99 — alert if p99 > 200ms
Recall@K — measure offline vs exact-search ground truth, alert if drops >2%
Index memory utilization — HNSW OOM is a hard crash, not degraded service
Write throughput / upsert lag — Milvus/Qdrant buffer writes before indexing
Embedding API error rate + latency — upstream bottleneck for ingestion
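Recall@K from the list above is straightforward to compute offline once exact-search ground truth exists for a sample of queries. A minimal sketch with illustrative names and hand-written result lists:

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k that the approximate top-k also returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    return len(truth & set(approx_ids[:k])) / len(truth)

def mean_recall(pairs, k):
    """Average recall@k over (approx_ids, exact_ids) pairs from a query sample."""
    return sum(recall_at_k(a, e, k) for a, e in pairs) / len(pairs)

sample = [
    (["a", "b", "c", "x"], ["a", "b", "c", "d"]),  # 3 of 4 found
    (["q", "r", "s", "t"], ["t", "s", "r", "q"]),  # order differs, all found
]
score = mean_recall(sample, k=4)
```

Recall@K ignores ordering within the top K, so it measures what the index failed to retrieve rather than how well results are ranked.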

Capacity Planning

RAM for HNSW: n_vectors × dims × 4 bytes × 1.3 (30% graph overhead)
1M × 1536-dim float32 ≈ 8.0 GB RAM with HNSW graph
With INT8 quantization: ~2.3 GB (3.5× reduction)
With halfvec (pgvector): ~4 GB (2× reduction)
Budget 2× headroom for index rebuilds and peak traffic
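The sizing formula above is easy to wrap in a helper for quick what-if estimates. This sketch applies the same proportional graph allowance to every precision, which is an assumption; `hnsw_ram_gb` is a hypothetical name and GB means 10^9 bytes.

```python
def hnsw_ram_gb(n_vectors, dims, bytes_per_dim=4.0, graph_overhead=0.30, headroom=1.0):
    """Estimated RAM in GB (10^9 bytes): vectors plus a proportional graph allowance."""
    raw = n_vectors * dims * bytes_per_dim
    return raw * (1.0 + graph_overhead) * headroom / 1e9

float32_gb = hnsw_ram_gb(1_000_000, 1536)                       # float32 vectors
halfvec_gb = hnsw_ram_gb(1_000_000, 1536, bytes_per_dim=2.0)    # half precision
with_headroom_gb = hnsw_ram_gb(1_000_000, 1536, headroom=2.0)   # 2x rebuild/peak margin
```

Treat the result as a planning floor: real engines add per-segment and allocator overhead that this estimate does not model.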

Index Maintenance

sql / python — maintenance ops

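-- pgvector: re-index after heavy inserts (fixes fragmented HNSW graph)
REINDEX INDEX CONCURRENTLY idx_documents_hnsw;

-- pgvector: vacuum to reclaim space from deleted vectors
VACUUM ANALYZE documents;

# Qdrant: segment optimization runs automatically in the background and has no
# dedicated "optimize" endpoint; updating the optimizer config re-triggers it.
# The indexing_threshold value below is an example, not a recommendation.
from qdrant_client import QdrantClient, models
qdrant = QdrantClient(host="localhost", port=6333)
qdrant.update_collection(
    collection_name="knowledge",
    optimizers_config=models.OptimizersConfigDiff(indexing_threshold=20000),
)

# Milvus: flush, then compact small segments before query benchmarks
client.flush("kb")
client.compact("kb")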

Blue-Green Index Deployment

For zero-downtime re-indexing (when switching models or re-chunking), build the new index in parallel under a new name (kb-v2), run query shadowing (send traffic to both, compare results), then atomically switch the application pointer. Keep kb-v1 for 24 hours as a rollback target.

⚡

Cost optimization: Use tiered storage. Keep hot (recent, frequently accessed) vectors in RAM-backed HNSW and cold (archived, rarely accessed) vectors in a disk-resident index (Milvus DiskANN, Qdrant on-disk HNSW) or a compressed IVF-PQ index. Route queries by document timestamp. This typically reduces infrastructure costs by 40–60% at 100M+ vector scale.

🗒️

Cheat Sheet

Quick reference — decision tree, operators, parameter table

Decision Tree: Which DB?

Use Pinecone if…

You want zero infrastructure management
You need hybrid (sparse+dense) natively
You're building a prototype or SaaS product fast
You need per-tenant namespace isolation without config overhead

Use Milvus if…

You need billion-scale self-hosted with full control
You need IVF-PQ compression for memory-constrained infra
You're already running Kubernetes
You need multiple index types per collection (e.g., dense + sparse + scalar)

Use Qdrant if…

Your workload has highly selective metadata filters (<5% match)
You need quantization (INT8/binary) for memory savings
You want a single binary with no external dependencies
You need on-disk HNSW for low-cost large-scale deployment

Use pgvector if…

You're already running PostgreSQL and <50M vectors
You need SQL JOINs between vectors and your relational data
You need ACID transactions over vector updates
You want to minimize infrastructure surface area

Operator Quick Reference

DB | Cosine | L2 | Dot Product | Filter syntax sample
Pinecone | metric="cosine" | metric="euclidean" | metric="dotproduct" | {"year": {"$gte": 2023}}
Milvus | metric_type="COSINE" | metric_type="L2" | metric_type="IP" | 'year >= 2023 and src == "legal"'
Qdrant | Distance.COSINE | Distance.EUCLID | Distance.DOT | FieldCondition(key="year", range=Range(gte=2023))
pgvector | <=> (vector_cosine_ops) | <-> (vector_l2_ops) | <#> (vector_ip_ops) | WHERE year >= 2023 AND src = 'legal'

HNSW Parameter Reference

Parameter | Pinecone | Milvus | Qdrant | pgvector | Recommended
Max connections (M) | Managed | M | m | m | 16–32 (higher = better recall, more RAM)
Build beam width | Managed | efConstruction | ef_construct | ef_construction | 100–400 (don't go below 64)
Query beam width | Managed | ef in search_params | hnsw_ef in SearchParams | hnsw.ef_search | 40–200 (tune for recall/latency)

📚

Further reading: ann-benchmarks.com for recall/latency comparisons. huggingface.co/spaces/mteb/leaderboard for embedding model rankings. qdrant.tech/articles and milvus.io/blog for engineering deep-dives. For RAG evaluation, use RAGAS or TruLens to measure faithfulness, answer relevancy, and context precision.
