Introduction
Vector databases are the retrieval layer of RAG systems. Get them wrong and your queries are slow, your memory usage explodes, or your recall tanks. This guide covers the index structures, tuning parameters, and scaling patterns that separate toy demos from production-grade vector search.
Index Algorithms: HNSW vs IVF vs Flat
Vector databases use approximate nearest neighbor (ANN) algorithms. The three dominant approaches:
1. HNSW (Hierarchical Navigable Small World)
How it works: Builds a multi-layer graph where each layer has progressively fewer nodes. Search starts at the top layer (sparse) and drills down to the bottom (dense).
Performance:
- Query latency: ~1-10ms for 1M vectors
- Recall: >95% with default params
- Memory: High (~2-3× vector data size due to graph overhead)
- Index build time: Slow (~10min for 1M vectors)
Tuning knobs:
- M (connections per node): Higher = better recall, more memory. Default: 16
- ef_construction (candidates during build): Higher = better recall, slower build. Default: 200
- ef_search (candidates during query): Higher = better recall, slower queries. Default: 50
Use when: Query latency is critical, memory is abundant, index is mostly read-only.
2. IVF (Inverted File Index)
How it works: Clusters vectors into N partitions (Voronoi cells). At query time, search only the K nearest partitions.
Performance:
- Query latency: ~10-50ms for 1M vectors
- Recall: 90-95% (depends on nprobe)
- Memory: Low (~1.1× vector data size)
- Index build time: Fast (~2min for 1M vectors)
Tuning knobs:
- nlist (number of partitions): More = faster queries, worse recall. Default: sqrt(N)
- nprobe (partitions searched): Higher = better recall, slower queries. Default: 10
Use when: Memory constrained, willing to trade latency for lower cost.
3. Flat (Brute Force)
How it works: Computes distance to every vector. No index structure.
Performance:
- Query latency: ~100ms for 1M vectors (linear scan)
- Recall: 100% (exact)
- Memory: Minimal (just vector data)
- Index build time: None (no indexing)
Use when: <10K vectors, or when you need exact results for benchmarking.
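Brute force is just a distance computation over the whole corpus, which is easy to sketch in plain NumPy:

```python
import numpy as np

def exact_search(query, corpus, k=10):
    """Brute-force nearest neighbors by L2 distance: O(N*d) per query."""
    dists = np.linalg.norm(corpus - query, axis=1)
    idx = np.argsort(dists)[:k]
    return idx, dists[idx]

rng = np.random.default_rng(0)
corpus = rng.random((1000, 64), dtype='float32')
idx, dists = exact_search(corpus[42], corpus, k=5)
```

This is also how you generate ground truth for the recall benchmarks later in this guide.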
Performance Tuning Guide
Latency Optimization
If queries are too slow (>50ms p95):
- Lower ef_search (HNSW) or nprobe (IVF) — trades recall for speed
- Reduce top_k — returning 100 results is slower than 10
- Use smaller embeddings — 768d → 384d halves compute and memory
- Add filtering at query time — pre-filter metadata before ANN search
- Shard by user/tenant — search per-user indexes instead of global index
Recall Optimization
If recall is below target (<90%):
- Increase ef_search or nprobe — explore more candidates
- Rebuild index with higher ef_construction — improves graph quality
- Switch from IVF to HNSW — better recall at cost of memory
- Use hybrid search — combine ANN with metadata filtering or BM25
Memory Optimization
If memory usage is too high (>10GB for 1M vectors):
- Use IVF instead of HNSW — 50% memory reduction
- Enable product quantization (PQ) — compresses vectors 8-16×
- Use scalar quantization (SQ) — float32 → int8 reduces memory 4×
- Store vectors on disk (mmap) — only load into RAM on query
Quantization Trade-Offs
Quantization compresses vectors to save memory/disk, but degrades recall.
Scalar Quantization (SQ8)
Converts float32 → int8 by bucketing values:
```
quantized = round((value - min) / (max - min) * 255)
```
Impact:
- Memory: 4× reduction
- Recall: -2% to -5%
- Latency: Slightly faster (less data to transfer)
Use when: Memory constrained, can tolerate minor recall loss.
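The bucketing formula above is a one-liner in NumPy. A round-trip sketch (note the formula produces unsigned 8-bit codes in 0–255, hence uint8 here):

```python
import numpy as np

def sq8_quantize(v):
    """Scalar quantization: map the vector's float32 range onto 256 buckets."""
    vmin, vmax = v.min(), v.max()
    q = np.round((v - vmin) / (vmax - vmin) * 255).astype(np.uint8)
    return q, vmin, vmax

def sq8_dequantize(q, vmin, vmax):
    """Reconstruct approximate floats from bucket IDs."""
    return q.astype(np.float32) / 255 * (vmax - vmin) + vmin

v = np.random.rand(768).astype('float32')
q, vmin, vmax = sq8_quantize(v)
v_approx = sq8_dequantize(q, vmin, vmax)
```

The reconstruction error is bounded by half a bucket width, which is where the small recall loss comes from.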
Product Quantization (PQ)
Splits vectors into M subvectors, clusters each independently, stores cluster IDs:
```
# 768-dim vector → 96 8-dim subvectors → 96 uint8 IDs = 96 bytes (vs 3072 bytes)
```
Impact:
- Memory: 8-32× reduction (configurable)
- Recall: -10% to -20%
- Latency: Faster (smaller data)
Use when: Index doesn't fit in memory, willing to sacrifice significant recall.
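A toy encoder makes the memory arithmetic concrete. This sketch uses random codebooks; real PQ trains them with k-means per subspace:

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, ncodes = 768, 96, 256   # 96 subvectors of 8 dims, 256 centroids each
sub_d = d // M

# Toy codebooks: (M subspaces) x (256 centroids) x (8 dims)
codebooks = rng.random((M, ncodes, sub_d), dtype='float32')

def pq_encode(v):
    """Encode one vector as M uint8 centroid IDs: 96 bytes instead of 3072."""
    subs = v.reshape(M, sub_d)
    codes = np.array([
        np.argmin(np.linalg.norm(codebooks[m] - subs[m], axis=1))  # nearest centroid
        for m in range(M)
    ], dtype=np.uint8)
    return codes

v = rng.random(d, dtype='float32')
codes = pq_encode(v)
```

Each subvector keeps only which of 256 centroids it was closest to, so the larger recall hit comes from replacing 8 floats with one byte.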
Scaling Patterns
Vertical Scaling (Bigger Machine)
Works up to ~100M vectors on a single node:
- 128GB RAM → 30M vectors (768d, HNSW)
- 256GB RAM → 60M vectors
- 512GB RAM → 120M vectors
Pros: Simplest architecture
Cons: Expensive, single point of failure
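The RAM figures above follow from simple arithmetic. A sketch (the ~1.4× overhead factor is an assumption fitted to the table above; HNSW graphs with large M can cost considerably more):

```python
def capacity_estimate(ram_gb, dim=768, overhead=1.4):
    """Rough vectors-per-node estimate: float32 storage plus index overhead.

    overhead=1.4 is an assumed multiplier for graph/index structures;
    tune it to your measured index size.
    """
    bytes_per_vector = dim * 4 * overhead   # float32 = 4 bytes per dimension
    return int(ram_gb * 1e9 / bytes_per_vector)
```

For example, capacity_estimate(128) lands around 30M vectors at 768 dimensions, matching the table.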
Horizontal Scaling (Sharding)
Split vectors across N shards, query all shards, merge results:
```python
# Query coordinator
def search_sharded(query_vector, top_k=10):
    results = []
    for shard in shards:
        shard_results = shard.search(query_vector, top_k=top_k)
        results.extend(shard_results)
    # Merge and re-rank
    results.sort(key=lambda x: x['score'], reverse=True)
    return results[:top_k]
```
Pros: Linear scaling, better fault tolerance
Cons: Higher latency (fan-out queries), needs result merging
Tenant-Based Sharding
Shard by user/customer instead of randomly:
- User A's vectors on Shard 1
- User B's vectors on Shard 2
Pros: Queries only hit 1 shard (no fan-out), data isolation
Cons: Uneven shard sizes if users vary
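Routing is typically a deterministic hash of the tenant ID, so every query and write for a tenant lands on the same shard. A minimal sketch (shard count and hash choice are assumptions):

```python
import hashlib

NUM_SHARDS = 4

def shard_for_tenant(tenant_id: str) -> int:
    """Deterministically map a tenant to one shard, so queries hit a single index."""
    h = hashlib.sha256(tenant_id.encode()).hexdigest()
    return int(h, 16) % NUM_SHARDS
```

Hash-based routing spreads tenants evenly by count, but not by data volume, which is exactly the uneven-shard-size caveat above.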
Database Comparison
| Database | Index | Strength | Weakness | Use Case |
|---|---|---|---|---|
| Pinecone | HNSW | Fully managed, auto-scaling | Expensive, vendor lock-in | Production RAG, low ops burden |
| Weaviate | HNSW | Hybrid search, GraphQL API | Complex config, resource hungry | Multi-modal search, metadata filtering |
| Qdrant | HNSW | Fast, Rust-based, open source | Less mature ecosystem | Self-hosted RAG, high performance |
| ChromaDB | HNSW | Embedded mode, Python-native | No production clustering | Prototyping, single-machine apps |
| Milvus | IVF, HNSW | Billions-scale, GPU support | Complex deployment | Massive-scale search (>100M vectors) |
| pgvector | IVF, HNSW | PostgreSQL extension, SQL queries | Slower than specialized DBs | Existing Postgres infra, <10M vectors |
Benchmarking Methodology
To benchmark your vector DB setup:
1. Prepare Test Data
```python
# 10K test queries, 100K vector corpus
queries = load_queries("test_queries.npy")  # shape: (10000, 768)
corpus = load_corpus("corpus.npy")          # shape: (100000, 768)

# Ground truth: top-10 nearest neighbors via brute force
ground_truth = compute_exact_neighbors(queries, corpus, k=10)
```
2. Measure Recall@K
```python
def recall_at_k(predicted, ground_truth, k=10):
    hits = 0
    for pred, truth in zip(predicted, ground_truth):
        hits += len(set(pred[:k]) & set(truth[:k]))
    return hits / (len(predicted) * k)

# Benchmark
results = vector_db.search_batch(queries, top_k=10)
recall = recall_at_k(results, ground_truth, k=10)
print(f"Recall@10: {recall:.3f}")
```
3. Measure Latency
```python
import time
import numpy as np

latencies = []
for query in queries:
    start = time.time()
    vector_db.search(query, top_k=10)
    latencies.append((time.time() - start) * 1000)  # ms

print(f"p50: {np.percentile(latencies, 50):.1f}ms")
print(f"p95: {np.percentile(latencies, 95):.1f}ms")
print(f"p99: {np.percentile(latencies, 99):.1f}ms")
```
4. Measure Throughput
```python
import time
import concurrent.futures

def benchmark_throughput(vector_db, queries, workers=10):
    def search_batch(batch):
        return [vector_db.search(q, top_k=10) for q in batch]

    batches = [queries[i:i+100] for i in range(0, len(queries), 100)]
    start = time.time()
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        list(executor.map(search_batch, batches))
    duration = time.time() - start
    qps = len(queries) / duration
    print(f"Throughput: {qps:.0f} queries/sec")
```
Hybrid Search Patterns
Pure vector search fails when metadata matters. Combine with keyword/SQL filters:
Post-Filtering (Naive)
Retrieve top-K via ANN, then filter by metadata:
```python
results = vector_db.search(query_vector, top_k=100)
filtered = [r for r in results if r['metadata']['category'] == 'legal']
top_results = filtered[:10]
```
Problem: If only 5% of results match filter, you need top_k=200 to get 10 results.
Pre-Filtering (Better)
Filter metadata first, then ANN search within subset:
```python
candidate_ids = db.filter_metadata(category='legal')  # SQL query
results = vector_db.search(query_vector, top_k=10, filter_ids=candidate_ids)
```
Problem: If filtered set is tiny (<1000 vectors), ANN index is overkill.
Hybrid Scoring (Best)
Combine vector similarity + BM25 keyword score + metadata filters:
```python
def hybrid_search(query_text, query_vector, filters):
    # Vector similarity
    vector_results = vector_db.search(query_vector, top_k=100)
    # BM25 keyword score
    bm25_results = bm25_index.search(query_text, top_k=100)

    # Merge scores
    combined = {}
    for r in vector_results:
        combined[r['id']] = {'vector_score': r['score'], 'bm25_score': 0}
    for r in bm25_results:
        if r['id'] in combined:
            combined[r['id']]['bm25_score'] = r['score']
        else:
            combined[r['id']] = {'vector_score': 0, 'bm25_score': r['score']}

    # Apply filters and rank
    final = []
    for doc_id, scores in combined.items():
        doc = db.get_metadata(doc_id)
        if matches_filters(doc, filters):
            hybrid_score = 0.7 * scores['vector_score'] + 0.3 * scores['bm25_score']
            final.append({'id': doc_id, 'score': hybrid_score})
    final.sort(key=lambda x: x['score'], reverse=True)
    return final[:10]
```
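One caveat with a fixed weighted sum: vector and BM25 scores live on different scales, so weights like 0.7/0.3 need per-corpus tuning. Reciprocal Rank Fusion is a common rank-based alternative that sidesteps score normalization entirely; a minimal sketch:

```python
def rrf_fuse(vector_ids, bm25_ids, k=60, top_n=10):
    """Reciprocal Rank Fusion: score = sum over lists of 1 / (k + rank).

    Uses only ranks, never raw scores, so incompatible score scales
    don't need normalizing. k=60 is the conventional damping constant.
    """
    scores = {}
    for ranked in (vector_ids, bm25_ids):
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

fused = rrf_fuse(["a", "b", "c"], ["b", "d", "a"], top_n=3)
# → ["b", "a", "d"]: "b" ranks high in both lists, so it wins
```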
Monitoring and Alerts
Track these metrics in production:
Query Metrics
- Query latency (p50, p95, p99)
- Queries per second (QPS)
- Top-K distribution (are users requesting top-100 when top-10 would suffice?)
Index Metrics
- Index size (memory/disk)
- Number of vectors
- Average vector dimensionality
Alerts
- Alert if p95 latency exceeds 100ms
- Alert if index memory usage exceeds 80% of available RAM
- Alert if error rate > 1% (connection failures, timeouts)
Conclusion
Vector database performance hinges on choosing the right index (HNSW for speed, IVF for memory), tuning recall/latency trade-offs, and scaling appropriately (vertical first, horizontal when needed). Benchmark on your own data—generic benchmarks don't predict your workload. And always monitor latency, memory, and recall in production.