Multi-Tenant LLM Architecture: Decisions I Got Right and One I'd Fully Reverse

The B2B SaaS context changes almost every architectural decision in an LLM system. You are not building a product with one corpus and one user population. You are building infrastructure that serves dozens -- sometimes hundreds -- of distinct organizations, each with their own data, their own usage patterns, their own compliance requirements, and their own expectation that their data does not leak into another customer's response.

I have been building multi-tenant LLM systems for B2B clients for the past couple of years, across retrieval architectures ranging from a handful of tenants to one engagement where we crossed 500. The patterns below come from that work. I will also describe one decision I made early that I would fully reverse: using a shared vector index with metadata filtering as the isolation boundary for a client with 500+ tenants. It seemed like the operationally sensible choice. It was not.

The isolation question comes first

Before prompt design, before model selection, before any of the interesting engineering work: you have to decide what your isolation boundary is at the retrieval layer.

There are two options and a lot of territory between them.

Separate indexes per tenant. Each tenant gets their own vector collection, namespace, or index (the exact noun depends on your database). Their documents are embedded into it. Retrieval queries only ever touch that tenant's index. Isolation is structural -- there is no metadata filter to misconfigure, no query that accidentally crosses a boundary.

Shared index with metadata filtering. All tenants' documents live in one index. Each document is tagged with a tenant ID. Every retrieval query includes a metadata filter that restricts results to the querying tenant's documents.

I use separate indexes. My preference is strong, and the 500-tenant failure I will describe later is the reason.

The argument for metadata filtering is operational convenience: one index to manage, one schema, simpler monitoring, no per-onboarding provisioning step. If you have ten tenants and they are all similar in size and query volume, the tradeoff is reasonable. The moment you have tenants whose data volumes differ by an order of magnitude, or tenants with dramatically different query rates, or more than about 100-200 tenants, the tradeoffs shift.

With Qdrant (which I use in most of my current deployments), separate tenant collections are first-class. Creating a collection per tenant on onboarding is a single API call and adds approximately zero operational complexity if you have an onboarding pipeline at all. Weaviate has a similar multi-tenancy model. Pinecone's namespace isolation is good but the economics at scale change the calculation -- more on that.

What I actually use for tenant isolation

My current stack for B2B multi-tenant RAG is Qdrant running on a dedicated VM (r6i.xlarge or equivalent, depending on the cloud), with one collection per tenant. Tenants are provisioned by a thin orchestration layer that handles collection creation, sets per-tenant HNSW parameters if their collection is large enough to warrant it, and stores the collection name alongside tenant metadata in Postgres.

The orchestration layer matters. The anti-pattern is letting collection names proliferate without a central registry. If your retrieval code has to know the collection naming convention to query the right index, you have embedded business logic in the wrong layer. The registry makes the retrieval code generic:

def get_tenant_collection(tenant_id: str, db: Session) -> str:
    tenant = db.query(Tenant).filter(Tenant.id == tenant_id).first()
    if not tenant or not tenant.vector_collection:
        raise TenantNotFoundError(tenant_id)
    return tenant.vector_collection

Retrieval then calls get_tenant_collection and never constructs a collection name directly. This means the naming convention can change without touching retrieval code, and adding a new tenant is a data operation, not a code deployment.

The HNSW parameter tuning is something most teams skip and then wonder why their retrieval quality degrades as collections grow. Qdrant's default m=16, ef_construct=100 is reasonable for small collections. For tenants with 500k+ vectors, I increase m to 32 and ef_construct to 200. The build cost is higher but search quality at recall@10 improves measurably for those collections. Small tenants do not need this; large tenants do. The per-collection configuration is what makes this practical.

The decision I would reverse: shared index for 500+ tenants

Here is the specific failure.

A client was building an LLM-powered document Q&A product. At the start of the engagement, they had roughly 80 tenants. The architecture I put in place used Pinecone with a single index and per-document tenant ID metadata filtering. My reasoning at the time: the tenant count was modest, Pinecone's managed infrastructure removed operational burden, and metadata filtering was well-documented as the multi-tenancy pattern in their docs.

Eighteen months later, tenant count was 512. Query latency at p95 had increased from around 180ms to over 900ms under moderate load. Index size was approaching 50 million vectors. The p95 degradation was not uniform -- it was worst for small tenants whose filtered result sets were sparse relative to the full index. Pinecone's ANN implementation has to search the full index and then apply the metadata filter post-retrieval; for a small tenant whose documents represent 0.01% of the index, the effective recall before filtering is poor. You surface the right number of results, but you are working much harder to find them than if they lived in their own isolated space.

The cost dimension compounded the problem. A Pinecone p2 pod at that vector count was running around $700/month for a single pod, and we were running two for redundancy. That is $1,400/month for the vector layer alone, and the performance was worse than I wanted.

I might be wrong that separate indexes would have been uniformly cheaper -- pod consolidation has real value, and managing 512 separate Qdrant collections is not free overhead. But I am confident that the p95 latency degradation for small tenants under a shared index is structural, not a tuning problem. The ANN search has to traverse vectors that are never relevant to the query. Metadata filtering is applied after traversal, not during it. This is fundamental to how approximate nearest-neighbor indexes work.

The remediation path -- migrating 512 tenants from one shared Pinecone index to per-tenant collections on a self-hosted Qdrant cluster -- took about six weeks of engineering time and a careful dual-write period to avoid data loss. It was entirely avoidable.

If I were starting that engagement today: separate collections from day one, self-hosted Qdrant on a few large VMs, cost amortized across tenants. The operational overhead of collection-per-tenant with a proper registry is lower than the operational overhead of debugging p95 regressions in a shared index at scale.

Per-tenant prompt caching

The second architectural layer where tenants need explicit treatment is prompt caching.

Most LLM providers now offer some form of prompt caching. Anthropic's prompt caching for Claude (available since mid-2024) gives roughly a 90% cost reduction on cached input tokens and around a 5x latency improvement on cache hits. OpenAI has a similar automatic prefix caching mechanism. The economics are significant at B2B SaaS scale -- system prompts alone can be 1,000-4,000 tokens, and if you are calling the LLM thousands of times per day per tenant, that spend adds up.

The structure that enables cache hits: put the stable content at the top of the prompt, the dynamic content at the bottom.

For a multi-tenant RAG system:

[system prompt -- tenant-specific instructions, persona, output format constraints]
[tenant knowledge context -- static rules, glossary, behavioral constraints]
[retrieved chunks -- changes per query]
[user question -- changes per query]

The first two blocks are good cache candidates. The last two are not. If you invert this -- putting retrieved chunks or user history early in the prompt -- you break the cache prefix on every query and pay full input token cost every time.

The per-tenant angle: system prompts and knowledge context are different per tenant, which means you need a different cache entry per tenant. That is fine. The point is not to share a cache across tenants (which would be an isolation violation) but to ensure each tenant's queries hit that tenant's cached prefix rather than recomputing it on every call.

I track cache hit rate per tenant as a billing-layer metric. If a tenant's cache hit rate is low, it usually means their prompt structure has too much dynamic content near the top, or their system prompt is being regenerated unnecessarily on each request. Investigating low cache hit rates has saved meaningful LLM spend on several engagements -- on one client, fixing the prompt ordering for a single high-volume tenant cut their monthly Claude API spend by about $2,200.

Cost attribution per tenant

Cost attribution is the first thing B2B SaaS companies ask about and the last thing they actually build properly. It is operationally critical: you need to know which tenants are consuming disproportionate LLM budget, which features are cost drivers, and where to tune before you have a surprise invoice.

My attribution model tracks four dimensions per LLM call:

Tenant ID
Feature / call type (e.g., "retrieval_qa", "document_summary", "structured_extraction")
Model
Token counts (prompt tokens, completion tokens, cached prompt tokens separately)

I emit these as structured log events on every LLM call completion:

import structlog

logger = structlog.get_logger()

def log_llm_call(
    tenant_id: str,
    call_type: str,
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
    cached_tokens: int,
    duration_ms: float,
):
    logger.info(
        "llm_call_complete",
        tenant_id=tenant_id,
        call_type=call_type,
        model=model,
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        cached_tokens=cached_tokens,
        prompt_cost_usd=round((prompt_tokens - cached_tokens) * 0.000003, 6),
        cached_cost_usd=round(cached_tokens * 0.0000003, 6),
        completion_cost_usd=round(completion_tokens * 0.000015, 6),
        duration_ms=duration_ms,
    )

The prices above are illustrative -- substitute the actual per-token rates for the model you are using. The point is computing cost at log time rather than trying to reconstruct it from raw token counts later. Aggregating these logs in a data warehouse gives you a per-tenant cost breakdown that feeds both internal alerting and, in some deployments, customer-facing usage dashboards.

One thing I have learned the hard way: do not wait until a billing cycle ends to look at cost by tenant. Alert when any tenant's spend crosses a daily threshold. LLM cost spikes are usually caused by one of three things: a prompt that grew much larger than intended after a recent deploy, a loop that is calling the LLM many more times than expected, or a tenant whose data volume has grown enough to significantly change retrieval behavior. All three are fixable quickly if caught in hours rather than weeks.

Quota enforcement

The quota enforcement pattern I use is a token-bucket rate limiter per tenant, implemented at the application layer rather than relying on the LLM provider's own rate limits.

The provider rate limits exist to protect the provider. They are not tenant-aware in a B2B context -- they apply to your entire API key, not to individual tenants. A single tenant whose integration goes haywire can consume your rate limit budget and degrade the experience for everyone else. You need your own enforcement layer.

The implementation is straightforward with Redis:

import redis
import time

class TenantQuotaEnforcer:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    def check_and_consume(
        self,
        tenant_id: str,
        tokens_requested: int,
        window_seconds: int = 60,
        token_limit: int = 100_000,
    ) -> bool:
        key = f"quota:{tenant_id}:{int(time.time()) // window_seconds}"
        pipe = self.redis.pipeline()
        pipe.incrby(key, tokens_requested)
        pipe.expire(key, window_seconds * 2)
        results = pipe.execute()
        current_usage = results[0]
        if current_usage > token_limit:
            return False
        return True

This is a sliding-window approximation, not exact token bucket, but it is good enough for rate limiting purposes and cheap to operate. The window size and limit are configurable per tenant -- some enterprise tenants have negotiated higher quotas; some trial tenants are restricted below the default.

The quota check happens before the LLM call, not after. Checking after is common but wrong: you have already made the API call and spent the tokens by the time you check. Pre-call enforcement means the occasional edge case where you over-count (if tokens_requested is an estimate based on prompt length, the actual count may differ), but that is a much cheaper error than discovering quota violations in the billing data.

When a tenant hits their quota, I return a 429 with a structured error body that includes the current window's usage, the limit, and the reset time. The client can surface this to their users meaningfully rather than displaying a generic error.

What I would add if I were rebuilding

Tenant-level retrieval quality tracking. Right now I have cost attribution and latency tracking per tenant, but I do not have retrieval quality metrics per tenant. If one tenant's collection has low embedding coverage, or their documents are formatted in a way that degrades chunking quality, I do not find out until they file a support ticket.

The right design: a per-tenant eval harness that runs a fixed set of queries against the tenant's collection weekly, measures recall@5 against known-good answers, and alerts if it degrades more than 10% week over week. This is more expensive to build than the cost and quota infrastructure -- you need per-tenant ground-truth datasets -- but it is the layer that closes the loop on retrieval quality. The absence of it means I am flying blind on whether the RAG layer is actually working well for tenants who are not actively complaining.

The retrieval architecture here reflects patterns I have applied across consulting engagements. The multi-tenancy design work also informed the isolation model I use in Pipeshift -- though I am the founder, so weight that accordingly. If you are building a B2B RAG product and wrestling with the isolation/performance tradeoff at scale, my contact page is the right starting point.