Vector Database Benchmarks I Actually Ran: Qdrant, Weaviate, and pgvector at 1M and 10M Vectors

The vector database benchmark posts I keep finding online share one characteristic: they were run by the vendors, on hardware the vendors control, against query distributions that favor their product. I don't find them useful. So last quarter, during a client project that required an actual production decision, I ran my own.

This post is what I found. I'll also be upfront about what this data cannot tell you, because n=1 benchmarks have real limits and pretending otherwise would waste your time.

The context

A client in the document intelligence space -- I'll leave the industry vague -- needed a vector store for a retrieval pipeline processing technical documentation. The corpus was dense, domain-specific text: manuals, specifications, compliance documents. Query patterns were hybrid: semantic similarity plus keyword filtering on document metadata. The client had an existing Postgres instance and a preference for not adding operational complexity unless the data justified it.

That preference is what made this benchmark worth running. The default answer for "which vector database" in most engineering conversations is Qdrant if you're self-hosting, Pinecone if you're paying for managed, and pgvector if you're already in Postgres. For this client the default was pgvector, and I wanted to know how much that default would cost in real latency and recall.

I tested three systems:

pgvector 0.7.4 on Postgres 16.4, running on an OCI VM.Standard.E4.Flex (8 OCPU, 64 GB RAM)
Qdrant 1.10.0 in standalone mode, same VM shape
Weaviate 1.26.3 with the flat and hnsw index types, same VM shape (only hnsw results are reported below)

All three ran on identical hardware. The corpus was the client's actual document collection, not a synthetic dataset. I used text-embedding-3-small from OpenAI (1536 dimensions) throughout. The query set was 500 queries derived from real user questions, not randomly generated ones.

I ran two corpus sizes: 1 million vectors and 10 million vectors. At 1M I was testing the "is this a real decision" threshold; at 10M I was testing where each system starts to show strain.

Methodology

Index configuration matters more than most benchmark posts acknowledge, so I'll be specific.

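pgvector: ivfflat index with lists=100 at 1M vectors, lists=1000 at 10M. probes=10 at query time. For reference, the pgvector README suggests starting from lists = rows / 1000 up to 1M rows and sqrt(rows) beyond that, with probes around sqrt(lists), so the lists values I used sit below that guidance. I treated this as a plain starting configuration and did not tune aggressively to favor pgvector.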
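As a rough illustration of what that configuration looks like in practice, here is a minimal sketch that builds the SQL statements for an ivfflat index and the query-time probes setting. The table name `docs`, the column name `embedding`, and the `id` column are hypothetical; the `%s` placeholder assumes a psycopg-style driver.

```python
def ivfflat_statements(table: str, column: str, lists: int, probes: int) -> list[str]:
    """Return the index DDL and the session setting for an ivfflat setup.

    vector_cosine_ops matches the cosine-distance ground truth used in this post.
    """
    return [
        f"CREATE INDEX ON {table} USING ivfflat ({column} vector_cosine_ops) "
        f"WITH (lists = {lists});",
        f"SET ivfflat.probes = {probes};",
    ]


def top_k_query(table: str, column: str, k: int = 10) -> str:
    """Nearest-neighbour query; <=> is pgvector's cosine distance operator."""
    return f"SELECT id FROM {table} ORDER BY {column} <=> %s::vector LIMIT {k};"


# Hypothetical table and column names, with the 1M-vector settings from above.
for stmt in ivfflat_statements("docs", "embedding", lists=100, probes=10):
    print(stmt)
print(top_k_query("docs", "embedding"))
```

The 10M run would use the same statements with lists=1000.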

Qdrant: HNSW index, m=16, ef_construct=200, ef=128 at query time. Vectors stored with full float32 precision. m=16 matches Qdrant's default; I raised ef_construct from the default of 100 to 200 for higher recall. Payload indexing enabled on the three metadata fields used for filtering.
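For readers who want to reproduce something similar, the sketch below expresses those settings as Qdrant REST request bodies. The metadata field names and the `keyword` schema type are assumptions for illustration; the post does not name the real fields or their types.

```python
import json

# Body for: PUT /collections/{collection_name}
create_collection_body = {
    "vectors": {"size": 1536, "distance": "Cosine"},
    "hnsw_config": {"m": 16, "ef_construct": 200},
}

# Hypothetical metadata fields; the actual three fields are not named in this post.
METADATA_FIELDS = ["doc_type", "product", "revision"]

# One body per field for: PUT /collections/{collection_name}/index
payload_index_bodies = [
    {"field_name": field, "field_schema": "keyword"} for field in METADATA_FIELDS
]


def search_body(query_vector: list[float], limit: int = 10) -> dict:
    """Body for: POST /collections/{collection_name}/points/search"""
    return {
        "vector": query_vector,
        "limit": limit,
        "params": {"hnsw_ef": 128},  # the query-time ef used in this benchmark
    }


print(json.dumps(create_collection_body, indent=2))
print(json.dumps(payload_index_bodies, indent=2))
```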

Weaviate: HNSW index with default parameters for the primary index. Hybrid search tested using Weaviate's built-in BM25+vector fusion (alpha=0.5 by default). I did not tune the BM25 weighting -- I ran the defaults to reflect what most teams would actually deploy.

Recall is measured as recall@10: of the 10 vectors returned, what fraction are in the true top-10 by exact cosine distance. I computed ground truth by running exact search (brute-force) on a separate in-memory index. p99 latency is the 99th percentile across 500 query executions per test run, each run repeated 3 times to account for cache warming effects. Reported latency is from the warmed runs.
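The two metrics are simple enough to sketch in a few lines. This is a minimal version under my own naming, not the benchmark scripts themselves; the percentile uses the nearest-rank method, which is one of several common definitions.

```python
import math


def recall_at_k(returned_ids: list, true_ids: list, k: int = 10) -> float:
    """Fraction of the true top-k that appears in the returned top-k."""
    return len(set(returned_ids[:k]) & set(true_ids[:k])) / k


def mean_recall_at_k(results: list[tuple[list, list]], k: int = 10) -> float:
    """Average recall@k over (returned_ids, true_ids) pairs, one per query."""
    return sum(recall_at_k(r, t, k) for r, t in results) / len(results)


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Toy example: one query that misses a single true neighbour.
returned = [1, 2, 3, 4, 5, 6, 7, 8, 9, 99]
truth = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(recall_at_k(returned, truth))

# With 500 latency samples, p99 is the 495th smallest value.
latencies_ms = [float(i) for i in range(1, 501)]
print(percentile(latencies_ms, 99))
```

With 500 queries per run, p99 is determined by the five slowest queries, which is one reason the runs were repeated.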

I tested pure vector search at both corpus sizes. For hybrid search, I only had meaningful data for Weaviate and pgvector -- Qdrant's hybrid search in 1.10 involves a separate sparse index that I did not have time to tune properly for this engagement. That is a gap in the data I'm flagging explicitly.

Results at 1M vectors

p99 latency (pure vector search):

pgvector: 18ms
Qdrant: 9ms
Weaviate (hnsw): 14ms

Recall@10:

pgvector: 0.91
Qdrant: 0.97
Weaviate (hnsw): 0.95

pgvector surprised me here. I expected a larger gap against the purpose-built systems. At 1M vectors with ivfflat, it delivered 18ms p99 -- twice Qdrant's 9ms, but not the order-of-magnitude difference I anticipated from reading community comparisons. The recall gap is real though: recall@10 of 0.91 means pgvector misses about 9% of the true top-10 neighbors, or roughly one result per query on average, against about 3% for Qdrant at 0.97. Depending on the application, that gap matters or it doesn't.

Weaviate sits in the middle on both metrics, which is about what I expected.

Hybrid search at 1M vectors (Weaviate vs pgvector):

pgvector hybrid (pgvector + pg_trgm for keyword, merged in application code): p99 62ms, recall@10 0.93
Weaviate hybrid (native BM25 + vector fusion): p99 48ms, recall@10 0.94
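Merging two ranked lists in application code can be done in several ways. One common option is reciprocal rank fusion (RRF), sketched below. This is an illustrative assumption, not necessarily the merge logic used for the pgvector numbers above; the constant k=60 is a conventional choice and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60, top_n: int = 10) -> list[str]:
    """Merge ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


vector_hits = ["a", "b", "c"]   # e.g. from the pgvector similarity query
keyword_hits = ["b", "d", "a"]  # e.g. from the pg_trgm keyword query
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

Because RRF only looks at ranks, it avoids having to reconcile cosine distances with trigram similarity scores, which live on different scales.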

The hybrid search recall is nearly identical. The latency gap favors Weaviate, but by less than I expected. What surprised me was the absolute latency number for Weaviate: 48ms p99 for hybrid search, up from 14ms for pure vector search. That's a 3.4x increase. I went back and ran it multiple times -- the number held.

I have a hypothesis about why: Weaviate 1.26's hybrid search runs the BM25 and vector searches in parallel, but the fusion step involves ranking across both result sets before returning, and that ranking is not cheap at 1M vectors. It may improve with alpha tuning -- I used the default 0.5. At 0.75 (weighting vector results more heavily), hybrid latency dropped to 38ms p99, but that changes the nature of the search and I wasn't going to tune a parameter to flatter the results.
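To make the role of alpha concrete, here is a simplified sketch of score-based fusion: each result set is min-max normalized, then combined with weight alpha on the vector side and 1 - alpha on the keyword side. It is loosely modeled on the idea of relative score fusion and is not Weaviate's actual implementation; the scores and IDs are invented, and higher scores are assumed to be better in both inputs.

```python
def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Min-max scale scores into [0, 1]; a constant set maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def alpha_fusion(vector_scores: dict[str, float], keyword_scores: dict[str, float],
                 alpha: float = 0.5, top_n: int = 10) -> list[str]:
    """alpha=1.0 is pure vector ranking, alpha=0.0 is pure keyword ranking."""
    vec, kw = normalize(vector_scores), normalize(keyword_scores)
    combined = {
        doc_id: alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(vec) | set(kw)
    }
    return sorted(combined, key=lambda d: (-combined[d], d))[:top_n]


vector_scores = {"a": 0.9, "b": 0.8, "c": 0.1}     # similarity, higher is better
keyword_scores = {"b": 2.0, "c": 12.0, "d": 7.0}   # BM25-style, higher is better
print(alpha_fusion(vector_scores, keyword_scores, alpha=0.75))
print(alpha_fusion(vector_scores, keyword_scores, alpha=0.25))
```

Moving alpha changes which documents win, not just how fast they come back, which is why tuning it to chase latency changes the nature of the search.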

For this client's workload -- hybrid queries were about 60% of total query volume -- the Weaviate hybrid latency meant the 99th percentile user experience was in the 45-55ms range from the database alone. That was higher than the 30ms database budget the client's response-time SLA implied.

Results at 10M vectors

p99 latency (pure vector search):

pgvector: 210ms
Qdrant: 31ms
Weaviate (hnsw): 55ms

Recall@10:

pgvector: 0.87
Qdrant: 0.96
Weaviate (hnsw): 0.93

This is where pgvector's position changes. The jump from 18ms to 210ms at 10M is not a surprise -- ivfflat recall and latency both degrade with corpus size unless you tune probes upward, and tuning probes costs latency. There is an hnsw index type in pgvector 0.7 that would perform better here, but it requires significantly more memory during index construction than ivfflat. On a 64 GB instance with a 10M-vector corpus at 1536 dimensions, pgvector HNSW index construction ran out of memory and crashed. I would need a larger instance to do that comparison fairly, and I didn't have one available during this engagement.

Qdrant at 31ms p99 for 10M vectors is impressive. The gap over Weaviate (55ms) at this scale was consistent across runs. Memory consumption for Qdrant at 10M vectors was approximately 28 GB for the HNSW index plus vector storage. Weaviate consumed approximately 34 GB. pgvector with ivfflat was the most memory-efficient at around 22 GB, which is its main remaining advantage at this scale.
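The scaling behavior is easier to see as ratios. The snippet below only restates the pure vector search numbers reported above and computes how each system's p99 and recall moved when the corpus grew 10x.

```python
# (1M value, 10M value) for pure vector search, taken from the results above.
p99_ms = {
    "pgvector": (18, 210),
    "Qdrant": (9, 31),
    "Weaviate (hnsw)": (14, 55),
}
recall_at_10 = {
    "pgvector": (0.91, 0.87),
    "Qdrant": (0.97, 0.96),
    "Weaviate (hnsw)": (0.95, 0.93),
}

slowdown = {name: at_10m / at_1m for name, (at_1m, at_10m) in p99_ms.items()}
recall_drop = {name: at_1m - at_10m for name, (at_1m, at_10m) in recall_at_10.items()}

for name in p99_ms:
    print(f"{name}: p99 x{slowdown[name]:.1f}, recall@10 -{recall_drop[name]:.2f}")
```

For a 10x larger corpus, pgvector's p99 grew about 11.7x, against roughly 3.4x for Qdrant and 3.9x for Weaviate.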

Hybrid search at 10M vectors:

Weaviate hybrid p99 increased to 112ms. I did not run the pgvector hybrid at 10M -- the application-code merge approach becomes increasingly untenable as query latency grows, and I was not going to produce numbers that flatly didn't represent a real deployment pattern.

The production decision

The client's corpus was projected to grow from ~3M vectors (current) to ~8-10M over the next 18 months. Hybrid search was non-negotiable. The response-time SLA required p95 under 50ms end-to-end, which implied roughly 25-35ms database budget after accounting for network, application logic, and LLM calls.

pgvector was off the table at 10M scale on realistic hardware. I might revisit that if the client's DBA team were comfortable sizing up to a 128 GB instance and managing HNSW index builds, but that introduces operational complexity that defeats the original motivation for staying in Postgres.

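Weaviate's hybrid search latency did not fit the budget. At 1M vectors it was already above the database budget at 48ms p99, and at 10M vectors 112ms p99 from the database alone exceeds the entire 50ms end-to-end target. The SLA is stated at p95 and I measured p99, so this comparison is conservative. That could improve with tuning -- and I want to be honest that I was running Weaviate defaults -- but "it might be faster if someone spends a week tuning it" is not a comfortable recommendation for a production decision.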

Qdrant was the choice: 31ms p99 at 10M for pure vector search, with room to budget for the Qdrant sparse index for hybrid. I did not benchmark that hybrid mode myself; the published numbers from the Qdrant team, and community reports I've seen on HN, suggest it performs better than what I measured with Weaviate's native fusion, but that is secondhand evidence. The operational profile -- a single binary with a reasonably predictable memory footprint -- was also a factor.

I want to be clear: I recommended Qdrant for this specific client, with this specific corpus size, SLA, and query distribution. That is not a general statement that Qdrant is the right choice.

What this benchmark cannot tell you

I want to be direct about the limits of this data.

n=1 hardware. All three systems ran on the same OCI shape. That shape may not be representative of your environment. Weaviate is written in Go, and its garbage-collected runtime may behave differently on high-memory instances; pgvector on NVMe-backed instances with larger shared buffers changes the latency profile significantly.

Single embedding model. I used text-embedding-3-small at 1536 dimensions throughout. Different dimensionalities change the relative performance characteristics. At 768 dimensions pgvector's memory problem at 10M would have been less acute.
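A quick back-of-the-envelope calculation shows why dimensionality matters so much here. The sketch below counts only the raw, uncompressed float32 payload (4 bytes per dimension) and ignores index structures and per-row overhead. What a running system keeps resident can be lower if part of the data is served from disk or page cache.

```python
def raw_vector_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw float32 payload in decimal gigabytes, excluding any index overhead."""
    return num_vectors * dims * bytes_per_value / 1e9


for dims in (1536, 768):
    print(f"10M vectors at {dims} dims: {raw_vector_gb(10_000_000, dims):.2f} GB")
```

At 1536 dimensions, 10M vectors come to about 61 GB of raw payload, which is close to the 64 GB of RAM on the test instance; at 768 dimensions the same corpus is about 31 GB.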

Qdrant hybrid search is a gap. I flagged this above but I want to flag it again: I did not benchmark Qdrant's sparse+dense hybrid mode against Weaviate's native hybrid. The Qdrant sparse vector approach using SPLADE or BM42 is architecturally different from Weaviate's fusion. For a project where hybrid search recall is the primary optimization target, this comparison would need to be run properly.

Weaviate tuning. I ran Weaviate with default HNSW parameters and default hybrid alpha. A Weaviate engineer looking at my numbers would reasonably argue I left performance on the table. They might be right. I used defaults because most production deployments start with defaults, but I'm not confident I've shown Weaviate's ceiling.

pgvector HNSW. The ivfflat numbers at 10M scale are damning but also not a fair test of pgvector's current capabilities. pgvector 0.7's HNSW implementation would likely close a significant part of the latency gap against Qdrant if run on hardware with enough memory to build the index.

What I'd tell someone starting this evaluation today

At 1M vectors with pure semantic search: pgvector is a legitimate choice if you're already in Postgres. The recall deficit is real but may be acceptable, and the operational simplicity argument holds. I would not pick pgvector primarily for cost -- the instance you need for low-latency pgvector at any meaningful scale costs more than a well-sized Qdrant deployment on a smaller instance.

At 5M+ vectors with hybrid search as a requirement: run your own benchmark on your actual corpus and query distribution. The synthetic datasets in most published benchmarks produce results that don't transfer. The 30 hours I spent running this benchmark saved weeks of production debugging.

For most of the RAG pipelines I work on -- both in my own consulting work and in the infrastructure patterns I'm building into Pipeshift (I'm the founder; disclosing that) -- Qdrant is currently the default starting point for self-hosted vector search. That default is based on accumulated project experience, not religious preference. If pgvector HNSW at scale or Weaviate with aggressive tuning changed the calculus on a specific engagement, I would use those instead.

The honest answer is that all three are production-viable systems. The differences that matter are in the tail latency at scale, the hybrid search story, and the operational overhead -- not in headline marketing benchmarks.

The benchmark scripts and raw result CSVs for this project are available on request. Contact me if you're evaluating vector databases for a production workload -- the specifics of your corpus and query patterns determine which numbers actually transfer.