The system had been running fine for three weeks. Queries returning in around 200ms, recall looking reasonable in our offline evals, nothing alarming in the logs. Then, on a Tuesday afternoon at 2:47 PM -- ten minutes into a live demo with the client's VP of Engineering and two of their senior architects -- every query request started returning 503s.
That is not the kind of failure that happens quietly. It happens in front of the people whose trust you are still earning.
This is the postmortem. What happened, what I found over the next four hours, what fixed it, and what I should have instrumented from the beginning.
What the system looked like
The deployment was a semantic search layer over a proprietary document corpus -- roughly 1.4 million vectors at 1536 dimensions (OpenAI text-embedding-3-large). Qdrant 1.9.2 running as a single-node Docker container on a client-managed VM: 8 vCPU, 32 GB RAM. The container had no explicit memory limit set -- it could claim the full 32 GB of host memory.
Query volume during normal operation averaged around 15-20 requests per second. The demo was expected to push that up to maybe 30 RPS while two separate users ran concurrent searches. Nowhere near what I would have called a load threshold worth worrying about.
I was wrong about that, but the query load was not the root cause. It was the trigger.
Timeline
2:37 PM -- Demo starts. I'm in a screenshare, walking through a live query interface. Response times are normal, 180-210ms.
2:43 PM -- Response times climb. 400ms, 700ms. I notice the graph ticking up in my monitoring tab and assume it is the demo query patterns -- more complex queries than the test cases.
2:47 PM -- First 503s. The application layer is returning errors because the Qdrant HTTP API has stopped responding. The container is still running -- it has not crashed. It is just not answering.
2:52 PM -- I kill and restart the Qdrant container while explaining, with as much composure as I can manage, that we have "hit a resource constraint we will address tonight." The container restarts in under 30 seconds. Demo continues for another 20 minutes without incident.
3:15 PM -- Demo ends. I start investigating.
What I found in the first hour
My first instinct was query load -- the demo had concurrent users and it was the only thing that changed. I pulled query logs and confirmed the demo was peaking at around 28 RPS. That is not impressive. Qdrant 1.9.2 on that hardware should handle it without blinking.
Then I checked container stats. I do not have a time-series of memory during the incident -- that is the monitoring gap I will get to -- but looking at Qdrant's own metrics endpoint after the restart, something was off about the in-memory index state.
Qdrant exposes collection info via GET /collections/{collection_name}. The fields I cared about were optimizer_status, indexed_vectors_count, and points_count. After the restart, I saw this:
"optimizer_status": {
  "status": "ok",
  "error": null
},
"indexed_vectors_count": 847234,
"points_count": 1400000
About 60% of the vectors were in the HNSW index. The other 40% were in an unindexed state -- stored but not yet indexed, meaning they would fall back to a brute-force scan on queries that touched them. That was unexpected. A week earlier I had confirmed that the collection was fully indexed.
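The 60/40 split comes straight from those two fields. A minimal sketch of the check, assuming one vector per point; index_coverage is a hypothetical helper of my own, not part of qdrant_client:

```python
def index_coverage(indexed_vectors_count: int, points_count: int) -> float:
    """Fraction of points that are present in the HNSW index (0.0 to 1.0)."""
    if points_count <= 0:
        return 1.0  # an empty collection has nothing left to index
    # Counts reported by the server can be approximate, so cap at 1.0.
    return min(indexed_vectors_count / points_count, 1.0)


# Values from the collection info shown above.
coverage = index_coverage(847_234, 1_400_000)
print(f"{coverage:.1%} indexed, {1 - coverage:.1%} unindexed")
# prints: 60.5% indexed, 39.5% unindexed
```

Anything well below 100% on a collection that is supposed to be fully indexed is worth a look before a demo, not after.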
I pulled the Qdrant logs from the container (piped to a file before the restart, fortunately). The log line I was looking for:
[2026-06-17 14:31:09] INFO qdrant::wal ... segment optimization started
[2026-06-17 14:31:09] INFO qdrant::collection ... 2 segments queued for optimization
The re-indexing job had kicked off at 14:31 -- six minutes before the demo started and sixteen minutes before the first 503s. Qdrant's optimizer runs as a background process that merges and rebuilds HNSW segments. Under default configuration it has no CPU or memory limits of its own, separate from the container's. It runs concurrently with queries.
What actually caused the OOM
The full picture took another two hours to reconstruct, partly from Qdrant docs and partly from reading through Qdrant's GitHub issues.
The HNSW optimizer in Qdrant holds the existing index and the newly built index in memory simultaneously during the merge phase. For a collection of 1.4M vectors at 1536 dimensions with m=16 and ef_construction=100 (the defaults I used), a single HNSW index takes roughly 18-22 GB of RAM at full build. During merge, peak memory usage can approach 2x the steady-state index size before the old index is released.
At steady state, Qdrant was using around 20-22 GB for the loaded index plus payload storage. When the optimizer started a full segment rebuild -- which it does automatically when the amount of unindexed vector data exceeds the indexing_threshold -- it allocated another large memory region for the new HNSW graph. The two together exceeded 32 GB. The OS started swapping. The container became unresponsive under concurrent query load. It did not OOM-kill cleanly -- it just stopped answering requests because every thread was waiting on swapped-out memory pages.
The timing was coincidental and terrible. A background job that runs infrequently happened to start just before we kicked off the demo.
I might be wrong about some of the internal Qdrant memory accounting here -- I am inferring from external observation and docs rather than from a heap dump. But the sequence (optimizer starts, memory climbs, queries slow, 503s) is consistent with what I saw in the logs.
What I did not have that I needed
No memory time-series. I had Qdrant's REST metrics available but I was not scraping them on a short interval and storing them anywhere. I had no alert configured on host memory utilization. When the incident happened, I could tell something was wrong from application-layer errors, but I had no visibility into what Qdrant was doing internally in the minutes before.
Qdrant exposes a Prometheus-compatible metrics endpoint at /metrics. It includes qdrant_collections_total, qdrant_rest_responses_total, and process_resident_memory_bytes among others. I was not using it. That is a gap I had accepted too casually.
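Pulling a number out of that endpoint does not need much. A minimal sketch using only the standard library; fetch_metrics and metric_values are hypothetical helper names, the base URL assumes Qdrant's default REST port on localhost with no API key, and the parser assumes samples carry no trailing timestamp:

```python
import urllib.request


def fetch_metrics(base_url: str = "http://localhost:6333") -> str:
    """Download the raw Prometheus text exposition from the /metrics endpoint."""
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=5) as resp:
        return resp.read().decode("utf-8")


def metric_values(text: str, name: str) -> list[float]:
    """Return every sample value for metric `name`, across all label sets."""
    values = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        # A sample looks like `name{labels} value` or `name value`;
        # the value is whatever follows the last space.
        head, _, value = line.rpartition(" ")
        if head.split("{", 1)[0].strip() == name:
            values.append(float(value))
    return values
```

Calling metric_values(fetch_metrics(), "process_resident_memory_bytes") on a short interval and writing the result somewhere durable would have given me the memory time-series I was missing. A real Prometheus scrape job is the better long-term answer; this is only the smallest thing that closes the gap.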
The optimizer schedule is also not surfaced by default. Qdrant runs it automatically based on internal thresholds. You can see the current status via the collections API but there is no "optimizer will run at approximately X" visibility. You have to know to ask.
What fixed it
Three changes, in order of how quickly they could be applied:
1. Set an explicit memory limit on the container.
services:
  qdrant:
    image: qdrant/qdrant:v1.9.2
    mem_limit: 24g
    memswap_limit: 24g
Setting memswap_limit equal to mem_limit disables swap for the container. This means the container OOM-kills hard rather than degrading into the slow death I saw -- swap thrash under concurrent queries is worse than a clean restart. A fast restart with an alert is a better failure mode than a slow spiral that looks like the service is up but is not responding.
2. Throttle the optimizer via Qdrant's collection update API.
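from qdrant_client import QdrantClient
from qdrant_client.models import OptimizersConfigDiff

client = QdrantClient(host="localhost", port=6333)

client.update_collection(
    collection_name="documents",
    # current qdrant-client releases name this argument optimizers_config
    optimizers_config=OptimizersConfigDiff(
        max_optimization_threads=1,  # default is None (uses all cores)
        indexing_threshold=50000,    # raise from default 20000 (kilobytes of vector data)
        flush_interval_sec=30,
    ),
)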
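Setting max_optimization_threads to 1 limits the optimizer to a single thread. It runs longer but it contends less with queries for CPU and memory during peak periods. Note that indexing_threshold is measured in kilobytes of stored vector data, not in vectors: Qdrant's docs equate 1 KB with one 256-dimension vector, so at 1536 dimensions 50000 KB is roughly 8,300 vectors. Raising it from 20000 to 50000 means more unindexed data has to accumulate before the optimizer triggers a rebuild -- reducing how often it runs.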
3. Schedule forced optimizer runs during off-peak hours.
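Qdrant does not have a native cron-style schedule for optimization. The workaround is to keep indexing_threshold high enough during production hours that automatic optimization almost never triggers, and then trigger it manually during a defined maintenance window by lowering the threshold. One trap: per Qdrant's docs, setting indexing_threshold to 0 disables indexing altogether rather than triggering it immediately, so the maintenance value has to be a positive number:
import time

from qdrant_client import QdrantClient
from qdrant_client.models import CollectionStatus, OptimizersConfigDiff


def run_off_peak_optimization(
    client: QdrantClient,
    collection: str,
    maintenance_threshold: int = 20000,  # KB; Qdrant's default, must be > 0
    production_threshold: int = 50000,   # KB; the value set in step 2
    timeout_sec: int = 4 * 3600,
) -> None:
    # call during off-peak window, e.g. 2 AM
    # first, lower the threshold so the accumulated segments qualify for indexing
    # (0 would disable indexing rather than trigger it)
    client.update_collection(
        collection_name=collection,
        optimizers_config=OptimizersConfigDiff(indexing_threshold=maintenance_threshold),
    )
    try:
        # poll until the collection reports green, i.e. no optimization in progress;
        # segments below the threshold stay unindexed, so indexed_vectors_count
        # can legitimately stop short of points_count
        deadline = time.monotonic() + timeout_sec
        time.sleep(30)  # give the optimizer a moment to pick up the new config
        while time.monotonic() < deadline:
            info = client.get_collection(collection)
            if info.status == CollectionStatus.GREEN:
                break
            time.sleep(30)
    finally:
        # restore production threshold, even if polling failed or timed out
        client.update_collection(
            collection_name=collection,
            optimizers_config=OptimizersConfigDiff(indexing_threshold=production_threshold),
        )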
This is a manual pattern, not an elegant one. If you are running Qdrant at scale with frequent writes, the right answer is a proper write pipeline that batches inserts and controls when indexing happens rather than relying on Qdrant's background optimizer at all. For this client's workload -- a document corpus that updates weekly, not continuously -- the off-peak schedule is sufficient.
What I would do differently in capacity planning
The number I should have started with: expected peak memory = (steady-state index size) x 2, with 20% headroom on top of that.
For 1.4M vectors at 1536 dimensions with HNSW m=16:
- Steady-state index size: approximately 20 GB (rough estimate: 1536 dims x 4 bytes x 1.4M vectors x HNSW overhead factor ~2.3)
- Peak during optimization: ~20 GB x 2 = 40 GB
- Required host RAM to avoid swap: 40 GB + 20% = 48 GB
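The arithmetic in the list above can be written out as a small function so it gets rerun whenever the corpus grows. This is a sketch: estimate_ram and RamEstimate are hypothetical names, and the 2.3 overhead factor, the 2x peak multiplier, and the 20% headroom are the rough figures from this post, not values published by Qdrant.

```python
from dataclasses import dataclass


@dataclass
class RamEstimate:
    steady_gb: float    # index loaded, no optimizer running
    peak_gb: float      # old and new index held in memory during a merge
    required_gb: float  # peak plus headroom


def estimate_ram(
    num_vectors: int,
    dims: int,
    bytes_per_dim: int = 4,        # float32 components
    hnsw_overhead: float = 2.3,    # rough multiplier over raw vector bytes
    peak_multiplier: float = 2.0,  # optimizer holds two copies at worst
    headroom: float = 0.20,
) -> RamEstimate:
    raw_bytes = num_vectors * dims * bytes_per_dim
    steady_gb = raw_bytes * hnsw_overhead / 1e9  # decimal GB
    peak_gb = steady_gb * peak_multiplier
    required_gb = peak_gb * (1 + headroom)
    return RamEstimate(steady_gb, peak_gb, required_gb)


est = estimate_ram(1_400_000, 1536)
print(f"steady ~{est.steady_gb:.1f} GB, peak ~{est.peak_gb:.1f} GB, "
      f"required ~{est.required_gb:.1f} GB")
# prints: steady ~19.8 GB, peak ~39.6 GB, required ~47.5 GB
```

Unrounded, the numbers land slightly under the 20 / 40 / 48 GB in the list, and the conclusion does not change: the requirement sits well above 32 GB.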
On 32 GB of RAM, this collection size was always going to be a problem if the optimizer ran at full intensity while the index was loaded. I just never stress-tested it with the optimizer running concurrently.
The right VM for this workload is 64 GB RAM minimum. Not 32 GB. On the client's cloud provider that is a roughly $200-300/month difference for the VM tier. I did not have that conversation during scoping because I had not done the memory math. That is a scoping failure, not a production failure -- but it showed up as a production failure.
I also should have been more skeptical of "running fine for three weeks" as a signal. Three weeks without a full optimizer cycle is not validation. It is just luck. The collection had grown through weekly document ingestion, and the optimizer had been deferring because the unindexed count stayed below the threshold. It all caught up at once.
What I am monitoring now
After the fix:
- process_resident_memory_bytes scraped from Qdrant's /metrics endpoint every 30 seconds into a Prometheus instance. Alert at 80% of mem_limit.
- qdrant_rest_responses_total{status="5xx"} with alert on any sustained 5xx rate over 60 seconds.
- A daily log check on optimizer status -- indexed_vectors_count vs points_count. If the gap is growing, it means writes are outpacing optimization and I need to schedule a catch-up run.
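The memory alert and the daily gap check each reduce to a single comparison. A sketch of that logic in Python, with hypothetical function names; in practice the memory rule would live in a Prometheus alert rule rather than in application code, and the 24 GiB limit assumes Docker reads the 24g in the compose file as GiB:

```python
MEM_LIMIT_BYTES = 24 * 1024**3  # matches mem_limit: 24g in the compose file


def memory_alert(resident_bytes: int,
                 mem_limit_bytes: int = MEM_LIMIT_BYTES,
                 threshold: float = 0.80) -> bool:
    """True when resident memory is at or above the alert fraction of the limit."""
    return resident_bytes >= threshold * mem_limit_bytes


def gap_is_growing(daily_gaps: list[int]) -> bool:
    """True when points_count - indexed_vectors_count rose on every consecutive check.

    `daily_gaps` is ordered oldest to newest, one reading per daily check.
    """
    if len(daily_gaps) < 2:
        return False  # not enough history to call it a trend
    return all(later > earlier for earlier, later in zip(daily_gaps, daily_gaps[1:]))
```

For example, memory_alert(20 * 1024**3) is True (20 GiB is about 83% of the limit), and gap_is_growing([1200, 4800, 9100]) is True, which is the signal to schedule a catch-up run.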
None of this is sophisticated. It is the baseline instrumentation I should have had before go-live.
The part I still do not have a clean answer to
Resource isolation between query threads and optimizer threads inside a single Qdrant node is not fine-grained. You can throttle optimizer CPU with max_optimization_threads. You can limit total container memory. But there is no mechanism in Qdrant 1.9.2 to say "the optimizer may use at most X GB and the remaining memory is reserved for query serving." That would have been the right control for this situation.
The Qdrant roadmap and community Discord suggest this is a known operational concern for large collections, and there has been discussion about more granular optimizer memory controls. I have not seen it shipped as of 1.9.2. The workaround is the off-peak schedule pattern above, combined with provisioning enough RAM that both the steady-state index and an optimizer run can coexist without swapping.
If your collection is large enough that steady-state index size plus an optimizer pass would exceed available RAM, you are looking at multi-node Qdrant with distributed sharding, which is a meaningfully different operational footprint. I would not reach for distributed Qdrant until a single node with proper provisioning genuinely cannot fit the workload.
If you are running Qdrant or another vector store in production and want a review of the provisioning, monitoring, and query isolation setup before a high-stakes milestone -- demo, launch, client handover -- that is something I help with. My contact page has the details.