A practitioner's deep-dive into Pinecone, Milvus, Qdrant, and pgvector — covering ANN index internals, distance metrics, RAG pipelines, filtered search, and production operations for teams building on embeddings.
Semantic similarity at scale — the shift from exact to approximate matching
A vector database stores, indexes, and queries high-dimensional float arrays — embeddings — produced by neural models. Unlike SQL where you filter on equality or range, vector DBs answer nearest-neighbor queries: "find the 10 vectors most similar to this query vector." This is the engine behind semantic search, RAG, recommendation, anomaly detection, and multimodal retrieval.
Exact vs Approximate
Exact k-NN is O(n·d) — scanning 10M 1536-dim vectors takes ~1 second. ANN indexes trade a small recall loss (1–5%) for 100–1000× speedups.
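To make the O(n·d) cost concrete, here is a minimal pure-Python sketch of exact k-NN. The helper names (`cosine`, `exact_knn`) and the toy data are illustrative assumptions, not part of any library; real systems use vectorized kernels, but the amount of work per query is the same.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length float sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def exact_knn(query, vectors, k):
    """Score every stored vector against the query: O(n*d) work per query."""
    scored = ((cosine(query, v), i) for i, v in enumerate(vectors))
    return heapq.nlargest(k, scored)  # [(score, index), ...], best first

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
top2 = exact_knn([1.0, 0.0], vectors, k=2)

# Multiply-adds for one query over 10M vectors of 1536 dims
ops_per_query = 10_000_000 * 1536
```

Every query touches every stored vector, which is why the per-query cost grows linearly with collection size and why ANN indexes exist.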
Metadata Filtering
Real workloads combine vector similarity with structured predicates: doc_type='legal' AND year>2022. How a DB handles this conjunction is the key differentiator.
Hybrid Search
Sparse (BM25/TF-IDF keyword) + dense (embedding) retrieval fused via RRF or weighted sum. Outperforms either alone, especially for rare terms and exact matches.
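Reciprocal rank fusion can be sketched in a few lines. This is a minimal illustration, assuming each retriever returns an ordered list of document IDs; the function name `rrf_fuse`, the sample IDs, and the constant k=60 (the value commonly used for RRF) are assumptions rather than anything a specific database prescribes.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism
    return sorted(scores, key=lambda d: (-scores[d], d))

dense_hits = ["d3", "d1", "d7"]    # from the embedding index
sparse_hits = ["d1", "d9", "d3"]   # from BM25
fused = rrf_fuse([dense_hits, sparse_hits])
```

Because RRF uses only ranks, it needs no score normalization between the sparse and dense retrievers, which is its main practical advantage over a weighted sum.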
ℹ️
Dimensionality note: OpenAI text-embedding-3-small → 1536 dims. Cohere Embed v3 → 1024 dims. E5-large → 1024 dims. Models trained with Matryoshka representation learning, such as text-embedding-3-* and nomic-embed-text-v1.5, can be truncated to smaller sizes (512/256). Smaller dims = faster queries and less memory but lower recall. Most production workloads use 768–1536.
🔬
ANN Index Internals
HNSW, IVF-Flat, IVF-PQ, ScaNN, DiskANN — how the sausage is made
HNSW — Hierarchical Navigable Small World
The workhorse of most modern vector DBs. A multi-layer graph where each layer is a randomly thinned version of the one below. Search starts at the top (sparse, long-range connections) and greedily descends, increasing resolution each layer. Insert-time cost is O(log n); query cost is O(log n) with constant depending on ef.
Key Parameters
M — max edges per node (16–64). Higher = better recall, more RAM and build time.
By default HNSW is held entirely in RAM. If the index doesn't fit, you need a disk-backed index (DiskANN in Milvus, memory-mapped on-disk HNSW in Qdrant) or IVF-PQ compression.
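A back-of-envelope RAM estimate helps decide whether an in-memory HNSW index is feasible. The sketch below is a rough lower bound under stated assumptions: float32 vectors, 4-byte neighbor IDs, 2·M links per node on the base layer, and no accounting for upper layers or allocator overhead. `hnsw_ram_gib` is a hypothetical helper, not a library function.

```python
def hnsw_ram_gib(n_vectors, dim, m=16, bytes_per_float=4, bytes_per_link=4):
    """Rough lower bound: raw vectors plus base-layer links (2*M per node)."""
    vector_bytes = dim * bytes_per_float
    link_bytes = 2 * m * bytes_per_link
    return n_vectors * (vector_bytes + link_bytes) / 2**30

# 10M vectors at 1536 dims with M=16
estimate = hnsw_ram_gib(10_000_000, 1536, m=16)
```

At this dimensionality the raw vectors dominate the footprint, so quantization usually saves far more memory than lowering M.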
IVF-Flat — Inverted File Index
Clusters vectors into nlist Voronoi cells at build time (k-means). At query time, probe the nprobe nearest centroids and do exact search within those cells only. Naturally shardable; faster builds than HNSW on very large datasets. Milvus defaults to IVF-Flat for large collections.
python — Milvus IVF-Flat config
# Build: nlist = sqrt(n) is the classic heuristic
index_params = {
"index_type": "IVF_FLAT",
"metric_type": "COSINE",
"params": {"nlist": 1024} # cluster count
}
# Query: nprobe = 10–20% of nlist for 95%+ recall
search_params = {"metric_type": "COSINE", "params": {"nprobe": 128}}
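The build and probe steps can be illustrated with a toy IVF index in pure Python. This is a sketch under simplifying assumptions (a handful of Lloyd iterations, squared L2 distance, 2-dim random data); the function names are illustrative and do not reflect the Milvus or Faiss implementations.

```python
import random

def l2sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(centroids, v):
    return min(range(len(centroids)), key=lambda c: l2sq(centroids[c], v))

def assign(vectors, centroids):
    """Bucket every vector id into the cell of its nearest centroid."""
    cells = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        cells[nearest(centroids, v)].append(i)
    return cells

def build_ivf(vectors, nlist, iters=5, seed=0):
    """Cluster with a few Lloyd (k-means) iterations, then bucket ids by cell."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    centroids = [list(v) for v in rng.sample(vectors, nlist)]
    for _ in range(iters):
        for c, members in enumerate(assign(vectors, centroids)):
            if members:  # keep the old centroid if a cell went empty
                centroids[c] = [sum(vectors[i][d] for i in members) / len(members)
                                for d in range(dim)]
    return centroids, assign(vectors, centroids)

def ivf_search(query, vectors, centroids, cells, nprobe, k):
    """Probe the nprobe closest cells and run exact search inside them only."""
    order = sorted(range(len(centroids)), key=lambda c: l2sq(centroids[c], query))
    candidates = [i for c in order[:nprobe] for i in cells[c]]
    return sorted(candidates, key=lambda i: l2sq(vectors[i], query))[:k]

rng = random.Random(42)
data = [[rng.random(), rng.random()] for _ in range(500)]
centroids, cells = build_ivf(data, nlist=16)
query = [0.5, 0.5]
approx = ivf_search(query, data, centroids, cells, nprobe=4, k=5)
exact = ivf_search(query, data, centroids, cells, nprobe=16, k=5)  # all cells
```

Setting nprobe equal to nlist scans every cell and is therefore exact; lowering nprobe trades recall for speed, which is the knob the config above exposes.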
IVF-PQ — Product Quantization
Compresses vectors 4–32× by splitting each d-dim vector into m sub-vectors and replacing each with a codebook index. Drastically reduces memory but hurts recall. Typical use: 100M+ vectors where HNSW OOMs. Milvus and Faiss support IVF-PQ; pgvector does not offer product quantization and relies on half-precision (halfvec) and binary quantization instead.
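The compression ratio follows directly from the code size. The sketch below shows the arithmetic and a toy encoder; it assumes float32 inputs and 8-bit codes (256 codewords per sub-space in a real index), ignores codebook storage, and uses tiny hand-written codebooks purely for illustration.

```python
def pq_compression(dim, m, bits_per_code=8, bytes_per_float=4):
    """Return (raw_bytes, code_bytes, ratio) for one vector under PQ."""
    if dim % m:
        raise ValueError("dim must be divisible by m")
    raw_bytes = dim * bytes_per_float
    code_bytes = m * bits_per_code / 8
    return raw_bytes, code_bytes, raw_bytes / code_bytes

def pq_encode(vector, codebooks):
    """Replace each sub-vector with the index of its nearest codeword."""
    sub = len(vector) // len(codebooks)
    codes = []
    for j, book in enumerate(codebooks):
        chunk = vector[j * sub:(j + 1) * sub]
        codes.append(min(
            range(len(book)),
            key=lambda c: sum((x - y) ** 2 for x, y in zip(chunk, book[c]))))
    return codes

raw, code, ratio = pq_compression(dim=1536, m=192)

# Two sub-spaces, two codewords each (toy values)
codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
codes = pq_encode([0.9, 0.8, 0.1, 0.9], codebooks)
```

With d=1536 and m=192, each vector shrinks from 6144 bytes to 192 bytes; choosing a smaller m compresses harder at the cost of coarser sub-vector approximations.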
DiskANN — SSD-Resident Graphs
Microsoft Research's algorithm (2019) for billion-scale datasets that don't fit in RAM. Stores the graph and full-precision vectors on SSD and keeps only compressed vectors and a cache in DRAM. Milvus ships a DISKANN index type; Qdrant does not implement DiskANN itself but reaches a similar trade-off by memory-mapping its HNSW graph and vectors from disk. Latency is higher (~2–5ms vs sub-ms for in-memory) but cost/vector is 10–20× lower.
ScaNN — Google's Asymmetric Hashing
Used internally by Google; open-sourced in 2020. Produces one of the best recall/latency tradeoffs on standard ANN benchmarks (ann-benchmarks.com). The core is C++ with a Python wrapper. Pinecone's proprietary index likely draws inspiration here. Not available in Qdrant; recent Milvus releases expose a SCANN index type.
| Index | Build Speed | Query Speed | RAM Usage | Best For |
|---|---|---|---|---|
| HNSW | Slow (O(n log n)) | ⚡ Fastest | High (in-RAM) | <50M vecs, low-latency |
| IVF-Flat | Fast | Fast | Medium | Batch workloads, sharded clusters |
| IVF-PQ | Fast | Fast | ⚡ Very low | 100M+ vecs, memory-constrained |
| DiskANN | Slow | Medium (~3ms) | ⚡ Very low | Billion-scale, cost-sensitive |
| Flat (brute) | None | Slow (exact) | Low | <100K vecs, baseline, testing |
📏
Distance Metrics
Cosine, dot product, Euclidean — when each applies
Cosine Similarity
Measures angle between vectors, ignoring magnitude. Best for text embeddings where magnitude encodes word frequency artifacts. Most embedding models are trained with cosine loss. Range: [-1, 1].
sim(A,B) = (A·B) / (||A|| × ||B||)
Dot Product
Equivalent to cosine on unit-normalized vectors (most modern models). Preferred in Pinecone because it's 30% faster than full cosine (skips the norm division). If your model normalizes outputs, use dot product.
sim(A,B) = Σ Aᵢ × Bᵢ
Euclidean (L2)
Geometric distance in embedding space. Appropriate when absolute position in latent space matters, as with embeddings trained under an L2 objective (some image and metric-learning models). On unit-normalized vectors, L2 and cosine produce the same ranking. Range: [0, ∞).
d(A,B) = √Σ(Aᵢ - Bᵢ)²
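The three formulas above, and the way they coincide on unit-normalized vectors, can be checked with a short sketch. The helper names are illustrative; the identity used is ‖a − b‖² = 2 − 2·cos(a, b) for unit vectors.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

a, b = normalize([3.0, 4.0]), normalize([1.0, 2.0])

# On unit vectors, dot product equals cosine similarity,
# and squared L2 distance is a monotone function of cosine.
same_as_cosine = math.isclose(dot(a, b), cosine(a, b), abs_tol=1e-12)
l2_matches = math.isclose(l2(a, b) ** 2, 2 - 2 * cosine(a, b), abs_tol=1e-12)
```

This is why the choice between the three metrics only changes rankings when the embeddings are not normalized.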
⚠️
Metric mismatch is a silent killer. Indexing with cosine but querying with L2 (or vice versa) returns valid-looking but semantically wrong results. Always match the metric to the model's training objective; check the model card. OpenAI and Cohere models: cosine/dot product. CLIP: cosine (its contrastive objective compares normalized embeddings). Binary quantized models: Hamming distance.
🧬
Embeddings Primer
Choosing the right model for your retrieval task
| Model | Dims | Context | MTEB Score | Best For |
|---|---|---|---|---|
| text-embedding-3-large | 3072 | 8191 tok | 64.6 | Highest quality, OpenAI workloads |
| text-embedding-3-small | 1536 | 8191 tok | 62.3 | Balance of cost + quality |
| Cohere embed-v3-english | 1024 | 512 tok | 64.5 | High recall, Cohere platform |
| BAAI/bge-large-en-v1.5 | 1024 | 512 tok | 63.9 | Open-source, self-hosted |
| BAAI/bge-m3 | 1024 | 8192 tok | 62.7 | Multilingual, hybrid (dense+sparse) |
| nomic-embed-text-v1.5 | 768 | 8192 tok | 62.4 | Long-context, open weights |
| gte-Qwen2-7B | 3584 | 32768 tok | 70.2 | SOTA, heavy, self-hosted GPU |
⚡
Matryoshka embeddings: Models like text-embedding-3-* and nomic-embed-text-v1.5 support truncating to smaller dims (e.g., 256) with acceptable quality loss. Use a tiered strategy: retrieve 100 candidates at 256 dims cheaply, then re-score those candidates with full 1536-dim cosine and keep the top 20. This yields 3–4× cost reduction at <2% recall loss.
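The tiered strategy can be sketched as a coarse pass on truncated vectors followed by a full-dimension re-score. This is a minimal illustration on toy 4-dim vectors; in practice stage 1 would be served by an ANN index built on the truncated (and re-normalized) embeddings. The function name and parameters are assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def two_stage_search(query, vectors, short_dim, n_candidates, k):
    """Stage 1: rank by the first short_dim dims. Stage 2: re-score with all dims."""
    q_short = query[:short_dim]
    coarse = sorted(range(len(vectors)),
                    key=lambda i: cosine(q_short, vectors[i][:short_dim]),
                    reverse=True)[:n_candidates]
    return sorted(coarse, key=lambda i: cosine(query, vectors[i]), reverse=True)[:k]

vectors = [
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.9, 0.0],   # looks perfect in 2 dims, worse in 4
    [0.0, 1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.1],
]
hits = two_stage_search([1.0, 0.0, 0.1, 0.0], vectors,
                        short_dim=2, n_candidates=3, k=2)
```

The second vector ties for first place in the truncated space but drops out after the full-dimension re-score, which is the correction the second stage exists to make.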
🌲
Pinecone
Managed serverless vector database — zero ops, pay per use
MANAGED SaaS
Pinecone
Proprietary index (likely ScaNN-inspired), serverless + pod-based tiers, sparse-dense hybrid natively supported since 2024
Serverless · Hybrid Search · Namespaces · Metadata Filters
Core Concepts
Indexes are the top-level unit — a named collection with a fixed dimension and metric. Serverless indexes auto-scale; pod indexes provision explicit hardware (s1, p1, p2). Namespaces provide logical partitioning within an index — ideal for multi-tenancy (one namespace per user/org). Queries are always namespace-scoped.
python
from pinecone_text.sparse import BM25Encoder

bm25 = BM25Encoder.default()  # pre-trained on MS MARCO

# Upsert with both sparse + dense vectors (idx is an existing index handle)
idx.upsert(vectors=[{
    "id": "doc-42",
    "values": dense_embedding,                          # float list
    "sparse_values": bm25.encode_documents([text])[0],  # {indices, values}
    "metadata": {"text": text},
}])

# Hybrid query: query() has no alpha argument, so weight the two vectors client-side.
# alpha=1.0 → pure dense, alpha=0.0 → pure BM25
alpha = 0.75  # 75% dense / 25% keyword
sparse_q = bm25.encode_queries([query])[0]
results = idx.query(
    vector=[v * alpha for v in dense_q],
    sparse_vector={"indices": sparse_q["indices"],
                   "values": [v * (1 - alpha) for v in sparse_q["values"]]},
    top_k=10,
)
⚠️
Serverless cold start: Serverless indexes can experience 200–500ms cold-start latency after inactivity. For latency-critical workloads (<50ms p99), use pod-based indexes or keep the index warm with a scheduled ping. Serverless billing: per read/write unit, not provisioned capacity.
Metadata Filter Caveats
Pinecone performs post-ANN filtering: it retrieves a multiple of top_k candidates internally, then applies the filter. If your filter is highly selective (only 1% of vectors match), you may get fewer than top_k results. Workaround: request a larger top_k and trim the results client-side, or use namespace-level partitioning for high-selectivity filters.
🐳
Milvus
Open-source, cloud-native, billion-scale — the Kubernetes of vector DBs
Milvus separates storage (MinIO/S3), query nodes (ANN execution), data nodes (WAL → object store flush), and coordination (etcd). This allows independent scaling of each tier. For teams <10M vectors, Milvus Lite (single binary, no dependencies) or docker compose standalone suffices.
python — pymilvus
from pymilvus import (MilvusClient, DataType, CollectionSchema,
                      FieldSchema, utility)
client = MilvusClient(uri="http://localhost:19530")
# Define schema with primary key + vector + scalar fields
schema = CollectionSchema(fields=[
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=1536),
FieldSchema(name="source", dtype=DataType.VARCHAR, max_length=64),
FieldSchema(name="year", dtype=DataType.INT32),
], description="Knowledge base", enable_dynamic_field=True)
client.create_collection(collection_name="kb", schema=schema)
# Build HNSW index on the vector field, plus scalar indexes for fast pre-filtering.
# MilvusClient.create_index takes an IndexParams object rather than a raw dict.
index_params = client.prepare_index_params()
index_params.add_index(field_name="vector", index_type="HNSW", metric_type="COSINE",
                       params={"M": 16, "efConstruction": 200})
index_params.add_index(field_name="source", index_type="Trie")
index_params.add_index(field_name="year", index_type="INVERTED")
client.create_index(collection_name="kb", index_params=index_params)
client.load_collection("kb")
# Insert
client.insert("kb", [
{"vector": embedding, "source": "legal", "year": 2024, "text": chunk_text}
])
# Filtered ANN search
results = client.search(
collection_name="kb",
data=[query_vec],
anns_field="vector",
limit=10,
filter='source == "legal" and year >= 2023',
output_fields=["source", "year", "text"],
search_params={"metric_type": "COSINE", "params": {"ef": 200}}
)
Partitions & Partition Keys
python
# Partition key: Milvus routes vectors to sub-indexes automatically
# Use for high-cardinality tenant isolation (each tenant = its own HNSW sub-graph)
FieldSchema(name="tenant_id", dtype=DataType.VARCHAR,
            max_length=64, is_partition_key=True)
# Search is automatically scoped to the matching partition
results = client.search(..., filter='tenant_id == "acme"')
⚡
Milvus filtering modes: Milvus 2.4+ uses a hybrid filtering strategy — for high-selectivity filters it switches to filtered HNSW (iterates the graph, skips non-matching nodes); for low-selectivity it post-filters after ANN. Scalar indexes (Trie for strings, INVERTED for numerics) are critical — without them, filters degrade to O(n) scans.
Qdrant's filtering is a first-class index citizen, not an afterthought. Payload fields can be indexed with keyword, integer, float, text, datetime, or geo index types. The HNSW graph traversal itself checks payload conditions during graph walks — this means filtered search recall is near-identical to unfiltered, even for highly selective filters.
Scalar (INT8)
4× memory reduction, <2% recall loss. Best default for most workloads. Keep quantized vectors in RAM, raw on disk.
Product (PQ)
8–32× reduction, 3–8% recall loss. Use for billion-scale collections. Requires rescoring enabled.
Binary
32× reduction (1 bit per dimension instead of a 32-bit float). Sub-1ms queries. Use only with binary-aware models (e.g., Cohere binary embeddings). Hamming distance.
🐘
pgvector
Vector search inside PostgreSQL — SQL joins, ACID, zero new infra
POSTGRES EXTENSION
pgvector 0.7+
HNSW + IVFFlat indexes, half-precision (halfvec), binary vectors (bit), sparsevec for BM25, full SQL interface
HNSW · ACID Transactions · SQL Joins · halfvec
Setup & Index Creation
sql
-- Install extension (once per database)
CREATE EXTENSION IF NOT EXISTS vector;

-- Table with vector column (1536-dim)
CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    source VARCHAR(64) NOT NULL,
    year INTEGER,
    embedding VECTOR(1536), -- float32
    emb_half HALFVEC(1536) GENERATED ALWAYS AS (embedding::halfvec(1536)) STORED
);

-- HNSW index (pgvector 0.5+): best for low-latency queries
-- Use maintenance_work_mem to speed up builds (e.g., SET maintenance_work_mem='2GB')
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- IVFFlat: faster builds, lower RAM, but lower recall
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 1000); -- lists ≈ sqrt(row count)

-- Regular B-tree indexes for filtering
CREATE INDEX ON documents (source);
CREATE INDEX ON documents (year);
sql — query patterns
-- Top-10 nearest neighbors (cosine)
SELECT id, content, 1 - (embedding <=> $1) AS score
FROM documents
ORDER BY embedding <=> $1  -- <=> cosine distance, <-> L2, <#> negative inner product
LIMIT 10;

-- Filtered search (pre-filter on source+year, then vector sort)
SET enable_seqscan = off; -- force index use during testing
SELECT id, content, embedding <=> $1 AS dist
FROM documents
WHERE source = 'legal' AND year >= 2023
ORDER BY embedding <=> $1
LIMIT 10;

-- Tune ef_search at session level (higher = more recall, slower)
SET hnsw.ef_search = 200;

-- Half-precision index for 2× memory saving (pgvector 0.7+)
CREATE INDEX ON documents USING hnsw (emb_half halfvec_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Hybrid: vector similarity + full-text + structured filter in ONE query
-- Note: ts_rank is PostgreSQL's built-in text rank, not true BM25
SELECT d.id, d.content,
       ts_rank(to_tsvector(d.content), plainto_tsquery($2)) AS text_rank,
       d.embedding <=> $1 AS vec_dist
FROM documents d
WHERE d.source = 'legal'
  AND to_tsvector(d.content) @@ plainto_tsquery($2)
ORDER BY 0.7 * (d.embedding <=> $1)
       + 0.3 / (ts_rank(to_tsvector(d.content), plainto_tsquery($2)) + 1)
LIMIT 10;
🔴
pgvector filtering trap: PostgreSQL's query planner may choose to do a sequential scan + filter before touching the HNSW index when the WHERE clause is highly selective. This bypasses the ANN index entirely and returns exact results — which sounds good but is 100–1000× slower on large tables. Use EXPLAIN ANALYZE to verify index usage, and consider partial indexes: CREATE INDEX ... WHERE source = 'legal'.
Python with psycopg3 / SQLAlchemy
python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

with psycopg.connect(dsn) as conn:
    register_vector(conn)
    # Insert
    conn.execute(
        "INSERT INTO documents (content, source, year, embedding) VALUES (%s,%s,%s,%s)",
        (text, "legal", 2024, np.array(embedding)))
    # Query
    rows = conn.execute(
        "SELECT id, content FROM documents ORDER BY embedding <=> %s LIMIT %s",
        (np.array(query_vec), 10)).fetchall()
⚖️
Head-to-Head Comparison
Choosing the right database for your use case
| Dimension | 🌲 Pinecone | 🐳 Milvus | 🔷 Qdrant |
|---|---|---|---|
| Index type | Proprietary | HNSW, IVF-PQ, DiskANN | HNSW (filtered) |
| Max scale | 100B+ (pods) | Billion+ | 100M–1B+ |
| Filtering | Post-ANN | Scalar index | In-graph |
| Hybrid search | Native | BGE-M3 native | RRF / linear |
| Multi-tenancy | Namespaces | Partition keys | Collections |
| Quantization | None exposed | IVF-PQ, binary | SQ8, PQ, binary |
| Ops burden | Zero | Medium (K8s) | Low (single bin) |
| SQL joins | No | No | No |
| Transactions | No | Eventual | WAL only |
| Pricing model | Per R/W unit | Open source / Zilliz | Open source / Cloud |
| Best for | SaaS, zero-ops teams | Billion-scale self-hosted | High-filter workloads |
| Dimension | 🐘 pgvector |
|---|---|
| Index type | HNSW + IVFFlat (Postgres native) |
| Max scale | ~50M vecs comfortably; larger needs partitioning |
| Filtering | Planner may bypass ANN; verify with EXPLAIN |
| Hybrid search | Combine with tsvector for BM25 in one query |
| Multi-tenancy | Row-level security + schemas (native Postgres) |
| Ops burden | None if you run Postgres already |
| SQL joins | Full SQL: JOIN with users, orders, anything |
| Transactions | Full ACID |
| Best for | Teams already on Postgres, strong relational requirements, <50M vectors |
The workhorse of production RAG. Stage 1: retrieve 50–100 candidates cheaply via ANN. Stage 2: re-rank with a cross-encoder (Cohere Rerank, BGE-Reranker, Jina Reranker) that reads both query and document jointly, producing far more accurate relevance scores. Final context: top 5–10 from re-ranker.
python — two-stage RAG
import cohere
co = cohere.Client("...")
# Stage 1: ANN retrieval — broad net
candidates = vector_db.query(query_embedding, top_k=80)
docs = [(c.id, c.metadata["text"]) for c in candidates.matches]
# Stage 2: Cross-encoder re-ranking
reranked = co.rerank(
query=user_query,
documents=[d for _, d in docs],
model="rerank-english-v3.0",
top_n=5
)
context_chunks = [docs[r.index][1] for r in reranked.results]
HyDE — Hypothetical Document Embeddings
Use the LLM to generate a hypothetical answer to the query, embed that answer (not the query), then retrieve using the answer's embedding. Dramatically improves recall for knowledge-gap queries where the user doesn't know the vocabulary of the answer domain.
python
# Generate hypothetical answer
hyde_prompt = f"Write a short, authoritative passage that answers: {user_query}"
hypothetical_doc = llm.generate(hyde_prompt, max_tokens=256)
# Embed the hypothetical answer, not the query
hyde_embedding = embed.encode(hypothetical_doc)
# Retrieve using the hypothetical embedding
results = vector_db.query(hyde_embedding, top_k=20)
Parent-Child Chunking
Store small child chunks (128–256 tokens) for precise embedding, but return the full parent chunk (512–1024 tokens) to the LLM for context. Child chunks have sharper semantic signal; parent chunks give the LLM enough context to generate coherent answers.
python
# Index child chunks with parent_id in metadata
for parent in parent_chunks:
    for child in split_children(parent, size=256):
        vector_db.upsert([{
            "id": child.id,
            "values": embed(child.text),
            "metadata": {"text": child.text, "parent_id": parent.id}
        }])
# At query time: retrieve children, fetch parents
child_hits = vector_db.query(query_vec, top_k=10, include_metadata=True)
parent_ids = list({h.metadata["parent_id"] for h in child_hits.matches})
context = [parent_store[pid].text for pid in parent_ids] # deduped parent text
Query Routing & Decomposition
Query Routing
Use a lightweight classifier (or LLM with structured output) to route queries to the appropriate collection or retrieval strategy. Example: route "show me the org chart" to a structured DB, "explain our refund policy" to the vector store.
Query Decomposition
Break complex multi-part questions into sub-queries, retrieve separately, then merge before generation. Example: "Compare our Q3 revenue with industry average" → retrieve internal Q3 docs + retrieve industry benchmark docs separately.
🔽
Filtered Search Deep-Dive
Pre-filter, post-filter, in-graph — each has different recall characteristics
Pre-filtering
Apply metadata filter first (get matching doc IDs), then do ANN search only within that subset. Exact recall but may degrade to brute-force scan if the subset is small. Milvus with scalar index, Qdrant with payload index.
Post-filtering
Do full ANN search (top K × multiplier), then filter results. Can return fewer than K results if filter is very selective. Pinecone uses this approach. Mitigation: increase top_k with a multiplier (e.g., 10× the desired results).
In-graph (HNSW)
Filter is evaluated during graph traversal — non-matching nodes are skipped. Qdrant's approach. Near-identical recall to unfiltered HNSW. Requires payload indexes; degrades without them.
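The recall difference between pre- and post-filtering is easy to reproduce on toy data. In this sketch the "ANN" step is simulated with an exact ranking, so only the filtering order differs; the item layout, the `multiplier` parameter, and the function names are illustrative assumptions.

```python
def l2sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def post_filter(query, items, predicate, k, multiplier=1):
    """Rank everything, keep the top k*multiplier, then drop non-matching items."""
    ranked = sorted(items, key=lambda it: l2sq(it["vec"], query))
    return [it["id"] for it in ranked[:k * multiplier] if predicate(it)][:k]

def pre_filter(query, items, predicate, k):
    """Apply the predicate first, then do exact search inside the subset."""
    subset = [it for it in items if predicate(it)]
    return [it["id"] for it in sorted(subset, key=lambda it: l2sq(it["vec"], query))[:k]]

def is_legal(item):
    return item["source"] == "legal"

# 100 items on a line; every 10th one is tagged "legal" (10% selectivity)
items = [{"id": i, "vec": [float(i)], "source": "legal" if i % 10 == 0 else "other"}
         for i in range(100)]
query = [12.0]

short = post_filter(query, items, is_legal, k=3)                 # filter sees only 3 hits
wider = post_filter(query, items, is_legal, k=3, multiplier=10)  # filter sees 30 hits
exact = pre_filter(query, items, is_legal, k=3)
```

With no multiplier the three nearest items are all filtered out and the caller receives nothing; widening the candidate pool recovers the same answer as pre-filtering, at the cost of fetching ten times as many candidates.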
✅
Multi-tenancy pattern: For strict tenant isolation (tenants must never see each other's data), use namespace (Pinecone), partition key (Milvus), or a separate collection per tenant (Qdrant). Metadata filter alone is insufficient — a misconfigured filter leaks data. For soft multi-tenancy (performance isolation is acceptable, strict separation is not required), metadata filter is fine.
✂️
Chunking & Ingestion Pipelines
The most underrated factor in RAG quality
Fixed-size with Overlap
Simplest approach. Chunk every N tokens with a K-token overlap, so that text cut at a chunk boundary still appears intact in the neighboring chunk. Works well for homogeneous corpora (legal docs, support tickets). N=512, overlap=50 is a common baseline.
Semantic Chunking
Split on semantic boundaries: embed consecutive sentences, detect cosine distance spikes (breakpoints), split there. LangChain and LlamaIndex both implement this. Better recall on heterogeneous documents.
Recursive Character Split
Try splitting by \n\n, then \n, then ". " (sentence end), then space, backing off to smaller separators until chunks fit the target size. The LangChain default. Good general-purpose approach for unstructured text.
Document-aware Splitting
Use document structure: split Markdown on headers, PDFs on page boundaries, code on function/class definitions. Preserves semantic coherence. Use unstructured.io or custom parsers for PDFs.
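Two of the strategies above, fixed-size with overlap and recursive splitting, can be sketched in plain Python. This is a simplified illustration: tokens are any list, lengths are measured in characters for the recursive version, and, unlike LangChain's splitter, small pieces are not merged back together and separators are dropped. Function names are assumptions.

```python
def chunk_with_overlap(tokens, size=512, overlap=50):
    """Fixed-size windows; each window repeats the last `overlap` tokens of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks

def recursive_split(text, max_len, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first, recursing with finer ones for oversized pieces."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # Nothing left to split on: fall back to hard cuts
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    head, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(head):
        pieces.extend(recursive_split(part, max_len, rest))
    return pieces

windows = chunk_with_overlap(list(range(10)), size=4, overlap=1)
parts = recursive_split(
    "Intro line.\n\nSecond paragraph is longer. It has two sentences.", max_len=30)
```

The fixed-size version guarantees uniform chunk sizes; the recursive version keeps paragraphs and sentences whole wherever they fit, at the price of uneven chunk lengths.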
Ingestion Pipeline Pattern
python — async batch ingestion
import asyncio
from itertools import batched  # Python 3.12+

async def ingest_documents(docs: list[dict], batch_size: int = 100):
    for batch in batched(docs, batch_size):
        texts = [d["text"] for d in batch]
        # Embed in parallel (rate-limited)
        embeddings = await embed_client.aembed(texts)
        vectors = [
            {"id": d["id"], "values": emb,
             "metadata": {"text": d["text"], "source": d["source"], "year": d["year"]}}
            for d, emb in zip(batch, embeddings)
        ]
        # upsert() is a blocking call; run it in a worker thread to keep the event loop free
        await asyncio.to_thread(idx.upsert, vectors=vectors, namespace="prod")

# Handle document updates: always upsert (never insert), which is idempotent
# Delete + re-ingest stale chunks when source doc changes
# Store doc hash → chunk IDs mapping to detect staleness
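The doc hash → chunk IDs mapping mentioned in the comments above could look like the following sketch. It assumes an in-memory dict as the manifest (a real pipeline would persist it) and hypothetical chunk IDs of the form `doc-1#0`; the function only plans the work and leaves the actual delete and upsert calls to the caller.

```python
import hashlib

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reingest(doc_id, new_text, new_chunk_ids, manifest):
    """Compare hashes; return (chunk ids to delete, chunk ids to upsert)."""
    new_hash = content_hash(new_text)
    old = manifest.get(doc_id)
    if old and old["hash"] == new_hash:
        return [], []  # unchanged: nothing to embed or write
    stale = old["chunk_ids"] if old else []
    keep = set(new_chunk_ids)
    manifest[doc_id] = {"hash": new_hash, "chunk_ids": list(new_chunk_ids)}
    # Chunks that existed before but are not produced by the new version
    return [cid for cid in stale if cid not in keep], list(new_chunk_ids)

manifest = {}
first = plan_reingest("doc-1", "v1 text", ["doc-1#0", "doc-1#1"], manifest)
again = plan_reingest("doc-1", "v1 text", ["doc-1#0", "doc-1#1"], manifest)
edited = plan_reingest("doc-1", "v2 shorter", ["doc-1#0"], manifest)
```

Skipping unchanged documents avoids paying for embeddings twice, and deleting orphaned chunk IDs keeps stale text from surfacing in retrieval after a document shrinks.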
ℹ️
Embed the metadata too. Prepend key metadata to the chunk text before embedding: f"[Source: Legal | Year: 2024]\n{chunk_text}". This encodes metadata into the vector itself, improving semantic alignment when queries mention source-type terms. Don't rely solely on metadata filters for this.
🚀
Production Operations
Observability, capacity planning, index maintenance, backup
Key Metrics to Track
Query latency p50/p95/p99 — alert if p99 > 200ms
Recall@K — measure offline vs exact-search ground truth, alert if drops >2%
Index memory utilization — HNSW OOM is a hard crash, not degraded service
Write throughput / upsert lag — Milvus/Qdrant buffer writes before indexing
Embedding API error rate + latency — upstream bottleneck for ingestion
Budget 2× headroom for index rebuilds and peak traffic
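Recall@K from the list above can be measured offline with a few lines. The sketch assumes you have, for each sampled query, the IDs returned by the ANN index and the IDs from an exact (brute-force) search over the same data; the function names and sample IDs are illustrative.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k that the ANN index also returned in its top-k."""
    truth = set(exact_ids[:k])
    return len(truth & set(approx_ids[:k])) / len(truth) if truth else 1.0

def mean_recall(pairs, k):
    """Average recall@k over (approx_ids, exact_ids) pairs, one pair per query."""
    return sum(recall_at_k(a, e, k) for a, e in pairs) / len(pairs)

pairs = [
    ([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]),  # perfect
    ([1, 2, 3, 9, 8], [1, 2, 3, 4, 5]),  # missed two of five
]
score = mean_recall(pairs, k=5)
```

Running this on a fixed query sample after every index rebuild or parameter change gives the baseline needed for the "alert if drops >2%" rule.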
Index Maintenance
sql / python — maintenance ops
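-- pgvector: re-index after heavy inserts (fixes fragmented HNSW graph)
REINDEX INDEX CONCURRENTLY idx_documents_hnsw;
-- pgvector: vacuum to reclaim space from deleted vectors
VACUUM ANALYZE documents;

# Qdrant: segment optimizers (merging small segments, rebuilding quantized indexes) run in
# the background; there is no dedicated "optimize" endpoint. Updating optimizers_config via
# PATCH /collections/{name} is one way to re-trigger them (the threshold is illustrative).
import httpx
httpx.patch("http://localhost:6333/collections/knowledge",
            json={"optimizers_config": {"indexing_threshold": 20000}})

# Milvus: flush + compact before query benchmarks
client.flush("kb")
client.compact("kb")  # merge small segments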
Blue-Green Index Deployment
For zero-downtime re-indexing (when switching models or re-chunking), build the new index in parallel under a new name (kb-v2), run query shadowing (send traffic to both, compare results), then atomically switch the application pointer. Keep kb-v1 for 24 hours as a rollback target.
⚡
Cost optimization: Use tiered storage. Keep hot (recent, frequently accessed) vectors in RAM-backed HNSW and cold (archived, rarely accessed) vectors in a disk-backed index (DiskANN in Milvus, on-disk HNSW in Qdrant) or a compressed IVF-PQ index. Route queries by document timestamp. This typically reduces infrastructure costs by 40–60% at 100M+ vector scale.
Use Pinecone if…
You need per-tenant namespace isolation without config overhead
Use Milvus if…
You need billion-scale self-hosted with full control
You need IVF-PQ compression for memory-constrained infra
You're already running Kubernetes
You need multiple index types per collection (e.g., dense + sparse + scalar)
Use Qdrant if…
Your workload has highly selective metadata filters (<5% match)
You need quantization (INT8/binary) for memory savings
You want a single binary with no external dependencies
You need on-disk HNSW for low-cost large-scale deployment
Use pgvector if…
You're already running PostgreSQL and <50M vectors
You need SQL JOINs between vectors and your relational data
You need ACID transactions over vector updates
You want to minimize infrastructure surface area
Operator Quick Reference
| DB | Cosine | L2 | Dot Product | Filter syntax sample |
|---|---|---|---|---|
| Pinecone | metric="cosine" | metric="euclidean" | metric="dotproduct" | {"year": {"$gte": 2023}} |
| Milvus | metric_type="COSINE" | metric_type="L2" | metric_type="IP" | 'year >= 2023 and src == "legal"' |
| Qdrant | Distance.COSINE | Distance.EUCLID | Distance.DOT | FieldCondition(key="year", range=Range(gte=2023)) |
| pgvector | <=> (vector_cosine_ops) | <-> (vector_l2_ops) | <#> (vector_ip_ops) | WHERE year >= 2023 AND src = 'legal' |
HNSW Parameter Reference
| Parameter | Pinecone | Milvus | Qdrant | pgvector | Recommended |
|---|---|---|---|---|---|
| Max connections (M) | Managed | M | m | m | 16–32 (higher = better recall, more RAM) |
| Build beam width | Managed | efConstruction | ef_construct | ef_construction | 100–400 (don't go below 64) |
| Query beam width | Managed | ef in search_params | hnsw_ef in SearchParams | hnsw.ef_search | 40–200 (tune for recall/latency) |
📚
Further reading: ann-benchmarks.com for recall/latency comparisons. huggingface.co/spaces/mteb/leaderboard for embedding model rankings. qdrant.tech/articles and milvus.io/blog for engineering deep-dives. For RAG evaluation, use RAGAS or TruLens to measure faithfulness, answer relevancy, and context precision.