Retrieval at multi-tenant scale

Every customer’s knowledge is different — their SOPs, their documents, their permissions — but writing code per customer doesn’t scale past the first handful. The recurring answer is to split the system in two: a shared reasoning engine built once, and a per-tenant data layer it reads at query time. The uniqueness lives in data — learned memories, an ACL-bearing knowledge graph, a proprietary corpus — so onboarding a tenant is loading data, not forking the codebase.

Why it’s hard

The naïve approach — branch the code per customer — collapses under its own weight: every new tenant adds a fork to maintain, and a bug fix has to land N times. So the variety has to move into data the shared engine can read generically, which is harder than it sounds. Retrieval has to be permission-faithful — in enterprise search the bottleneck is access fidelity, not recall, and a query that returns a document the user can’t open is a security incident, not a relevance miss. It has to stay isolated, so tenant A’s data can’t leak into tenant B’s results. And the per-tenant knowledge is rarely handed to you clean — it’s tribal, buried in inboxes and recordings — so something has to learn it and prove it’s right before it grounds a live answer.

Patterns

Knowledge-as-data, not per-customer code — Put per-tenant uniqueness in a data layer the shared engine reads, so the reasoning core and connectors are built once and amortized across customers. Pallet’s explicit bet is that “per-customer heterogeneity lives in data, not code” — its largest tenant runs on “more than 20,000 customer-specific memories” while the engine stays single. — Pallet, Glean
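The split can be made concrete with a small sketch. Everything named here (`TenantData`, `SharedEngine`, `onboard`, `relevant_memories`) is hypothetical and is not taken from Pallet or Glean; the word-overlap matching is a toy stand-in for real retrieval. The point is only that the engine contains no tenant-specific branches, and adding a tenant is a data load.

```python
from dataclasses import dataclass, field


@dataclass
class TenantData:
    """Everything that makes one tenant unique, held as data (hypothetical shape)."""
    org_id: str
    memories: list[str] = field(default_factory=list)


class SharedEngine:
    """Built once; contains no per-tenant code paths."""

    def __init__(self) -> None:
        self._tenants: dict[str, TenantData] = {}

    def onboard(self, data: TenantData) -> None:
        # Onboarding is a data load, not a code change.
        self._tenants[data.org_id] = data

    def relevant_memories(self, org_id: str, query: str) -> list[str]:
        # Toy matching: keep memories that share at least one word with the query.
        words = set(query.lower().split())
        tenant = self._tenants[org_id]
        return [m for m in tenant.memories if words & set(m.lower().split())]


engine = SharedEngine()
engine.onboard(TenantData("acme", ["Acme invoices are due in 30 days"]))
engine.onboard(TenantData("globex", ["Globex invoices need a PO number"]))
print(engine.relevant_memories("acme", "when are invoices due"))
```

The same `relevant_memories` code serves both tenants; only the loaded data differs.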

Permission-faithful hybrid retrieval — Combine lexical precision with vector recall, rerank with whatever signals you have, and enforce access at retrieval rather than as a post-filter. Glean fuses “the precision of lexical search and the nuanced understanding of vector search … powered by the additional context and nuance provided by the signals and anchors within our knowledge graph,” whose edges carry access control so a multi-hop traversal still respects ACLs — a query “never returns a document the user can’t open.” — Glean
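A minimal sketch of the pattern, under stated assumptions: the source does not say how Glean fuses its rankings, so reciprocal rank fusion (RRF) is swapped in here as one common way to combine a lexical and a vector ranking. The `Doc` shape, the `allowed_users` field, and both scoring functions are hypothetical, and the lexical scorer is a term-overlap stand-in for BM25. What the sketch does show faithfully is the ordering: access is checked before ranking, so an unreadable document never becomes a candidate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    embedding: tuple[float, ...]
    allowed_users: frozenset[str]  # hypothetical flattened ACL


def lexical_score(query: str, doc: Doc) -> float:
    # Stand-in for BM25: number of query terms present in the document.
    terms = set(query.lower().split())
    return float(len(terms & set(doc.text.lower().split())))


def vector_score(query_vec: tuple[float, ...], doc: Doc) -> float:
    # Dot product; assumes embeddings are already normalized.
    return sum(q * d for q, d in zip(query_vec, doc.embedding))


def hybrid_search(user: str, query: str, query_vec: tuple[float, ...],
                  docs: list[Doc], k: int = 60, top_n: int = 5) -> list[str]:
    # Enforce access first: unreadable documents never enter the candidate set.
    candidates = [d for d in docs if user in d.allowed_users]
    by_lexical = sorted(candidates, key=lambda d: lexical_score(query, d), reverse=True)
    by_vector = sorted(candidates, key=lambda d: vector_score(query_vec, d), reverse=True)
    # Reciprocal rank fusion: each ranking contributes 1 / (k + rank).
    fused: dict[str, float] = {}
    for ranking in (by_lexical, by_vector):
        for rank, doc in enumerate(ranking, start=1):
            fused[doc.doc_id] = fused.get(doc.doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)[:top_n]


docs = [
    Doc("d1", "refund policy for enterprise plans", (1.0, 0.0), frozenset({"alice", "bob"})),
    Doc("d2", "salary bands 2024", (0.9, 0.1), frozenset({"bob"})),
    Doc("d3", "office lunch menu", (0.0, 1.0), frozenset({"alice", "bob"})),
]
print(hybrid_search("alice", "refund policy", (1.0, 0.0), docs))
```

A post-filter version would rank all three documents and then drop `d2` for alice; filtering first means the restricted document cannot influence scores, counts, or snippets at all.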

Learn the tenant’s knowledge, then backtest it before it goes live — Don’t hand-author per-customer rules; infer them from the tenant’s own data, then validate against history before activation. Pallet infers customer tribal knowledge “from the inbox into thousands of discrete facts,” and “memories are backtested against historical scenarios before going active” — inference is only safe once it’s proven on real cases. — Pallet
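The activation gate can be sketched as follows. This is an illustration, not Pallet's implementation: the `Memory` shape, the idea of expressing a memory as a `predict` callable, the `outcome` key on historical cases, and the 0.95 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Memory:
    text: str
    # Hypothetical: the memory expressed as a predictor over a historical case.
    predict: Callable[[dict], str]
    active: bool = False


def backtest(memory: Memory, history: list[dict], min_accuracy: float = 0.95) -> float:
    """Replay historical cases; activate the memory only if it clears the bar."""
    if not history:
        memory.active = False  # no evidence, so the memory stays inactive
        return 0.0
    hits = sum(1 for case in history if memory.predict(case) == case["outcome"])
    accuracy = hits / len(history)
    memory.active = accuracy >= min_accuracy
    return accuracy


history = [
    {"customer": "acme", "outcome": "expedite"},
    {"customer": "acme", "outcome": "expedite"},
    {"customer": "globex", "outcome": "standard"},
    {"customer": "initech", "outcome": "standard"},
]
learned = Memory(
    "Acme shipments are always expedited",
    lambda case: "expedite" if case["customer"] == "acme" else "standard",
)
overgeneral = Memory("All shipments are expedited", lambda case: "expedite")
print(backtest(learned, history), learned.active)
print(backtest(overgeneral, history), overgeneral.active)
```

The over-general memory matches only half of the historical cases, so it never becomes active and never grounds a live answer.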

A proprietary corpus is the retrieval asset — The data you uniquely capture becomes a queryable index a competitor can’t replicate, and often trains the model too. Rilla turns “millions of in-person conversations no competitor captures” into “a search engine over voice data,” the same corpus that fine-tunes its field-noise ASR. — Rilla

Scope isolation explicitly — Make tenant boundaries a property of the deployment or the row, not something the app layer remembers. Glean runs single-tenant — “ingestion, index, and knowledge graph run in the customer’s GCP/AWS/Azure project … data never leaves your tenant’s environment”; Pallet scopes with Postgres row-level security keyed on org id. — Glean, Pallet
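For the row-level option, a minimal sketch of Postgres row-level security keyed on org id. The table name `memories`, its columns, the policy name, and the setting name `app.current_org_id` are hypothetical, and `org_id` is assumed to be a text column; the source does not describe Pallet's actual schema. The `%s` placeholder assumes a psycopg-style DB-API cursor.

```python
# Hypothetical schema: a `memories` table with a text `org_id` column.
RLS_SETUP = """
ALTER TABLE memories ENABLE ROW LEVEL SECURITY;
ALTER TABLE memories FORCE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON memories
    USING (org_id = current_setting('app.current_org_id'));
"""


def fetch_memories(cursor, org_id: str) -> list[tuple]:
    """Bind the tenant to the transaction, then query without a tenant filter."""
    # set_config(name, value, true) scopes the setting to the current transaction.
    cursor.execute("SELECT set_config('app.current_org_id', %s, true)", (org_id,))
    # No tenant predicate here: the policy restricts the visible rows.
    cursor.execute("SELECT id, body FROM memories")
    return cursor.fetchall()
```

Because the policy lives in the database, a query that forgets its tenant filter still sees only one tenant's rows. This holds only if the application connects as a role that is subject to the policy: superusers and roles with `BYPASSRLS` skip row-level security.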

Tools & popular choices

Decision	Common choice	Notes
Vector store	pgvector in Postgres, or a managed vector DB	Pallet runs pgvector in Cloud SQL; Rilla keeps embeddings alongside transcripts in Postgres. Co-locating with relational data makes tenant scoping natural.
Retrieval style	Hybrid: lexical (BM25) + vector, reranked by signals	Glean combines lexical precision, vector recall, and knowledge-graph signals and anchors; the agentic layer on top is credited with a further +24% relevance.
Access control	ACLs mapped per source at retrieval time, carried on graph edges	Glean connectors “map and maintain each source’s ACLs”; enforce at retrieval, never post-filter.
Tenant isolation	Single-tenant deployment in the customer’s cloud, or row-level security on org id	Glean runs in the customer’s own cloud project; Pallet uses Postgres RLS keyed on org id.
Per-tenant knowledge	Learned “memories” / SOPs, backtested before activation	Pallet infers thousands of facts per tenant from the inbox (more than 20,000 for its largest tenant) and backtests them against history before activation.
Models / embeddings	Multi-provider, routed	Glean, Pallet, and Rilla all route across OpenAI / Anthropic / Google (Rilla via LiteLLM).

Reference architecture

The system splits into a shared plane and a per-tenant plane. A query arrives carrying the user’s identity; the shared engine — hybrid retrieval (lexical + vector + rerank) feeding an agent — is built once and serves every tenant. It reads from the per-tenant data layer: learned memories, a proprietary corpus, or an ACL-bearing knowledge graph, scoped by org id or by single-tenant deployment. An ACL filter applies the user’s permissions at retrieval, so results can never include something the user can’t open. A separate ingestion pipeline keeps that data layer fresh — inferring memories, backtesting them, mapping each source’s ACLs — without ever touching the shared engine’s code.

Mermaid source

flowchart LR
  classDef io fill:#eef2f8,stroke:#94a3b8,stroke-width:1.5px,color:#0f172a;
  classDef ai fill:#eef0fe,stroke:#6366f1,stroke-width:1.5px,color:#0f172a;
  classDef gate fill:#fef6e7,stroke:#d9a441,stroke-width:1.5px,color:#0f172a;

  subgraph shared["Shared engine · built once"]
    Q("Tenant query<br/>+ user identity"):::io
    Retr("Hybrid retrieval<br/>lexical + vector + rerank"):::ai
    Agent("Agent reasoning"):::ai
    Resp("Grounded response"):::io
  end
  subgraph tenant["Per-tenant data layer"]
    ACL{"ACL filter<br/>at retrieval"}:::gate
    KB[("Learned memories /<br/>corpus / knowledge graph<br/>scoped by org id")]:::io
  end
  Ingest("Ingestion: infer memories,<br/>backtest, map ACLs"):::ai

  Q --> Retr
  Retr -->|user identity| ACL
  ACL --> KB
  KB -->|permitted results| Agent
  Agent --> Resp
  Ingest --> KB

Best practices

Put the uniqueness in data, not code. If onboarding a customer means writing code, it won’t scale — make it loading memories, connectors, or a corpus into a shared engine.
Enforce permissions at retrieval, never as a post-filter. A query should be incapable of returning a document the user can’t open; carry the user’s identity into the retrieval call itself.
Go hybrid. Lexical catches exact terms, vectors catch meaning, and a reranker with domain signals beats either alone — don’t ship vector-only and call it search.
Learn the tenant’s knowledge, then backtest it. Inferred SOPs are only safe once validated against historical cases — gate activation on that, the same way you’d gate a model change (see Testing output that isn’t reproducible).
Scope isolation explicitly. Single-tenant deployment or row-level security keyed on org id — don’t rely on the app layer to remember which tenant it’s serving.

Seen in

Glean: permission-faithful hybrid retrieval over a knowledge graph whose edges carry ACLs, deployed single-tenant in the customer’s own cloud, with agentic reasoning credited with +24% relevance.
Pallet: per-tenant uniqueness as data, not code. Plain-English memories are inferred from each customer’s inbox (more than 20,000 for the largest tenant), backtested before activation, and retrieved over pgvector.
Rilla: a proprietary corpus of millions of field conversations becomes the searchable retrieval asset, the same data that fine-tunes its noise-robust ASR.