Chapter 15Model routing and edge infrastructure
In a prototype, the network topology between the agent and its model is trivial: the application imports a provider’s SDK, holds an API key, and makes a direct call over the public internet to a foundation model. The arrangement is invisible until it reaches an enterprise, where it becomes a non-starter on four counts at once. It moves regulated data across a boundary the company does not control. It obscures which team incurred which cost. It makes a single external API a single point of failure for the whole fleet. And it ties the system’s reliability to the uptime and pricing of one vendor. The model provider is a volatile, third-party dependency, and treating it as a library call buries that fact in every component.
This chapter addresses the boundary between the deterministic agent infrastructure and the probabilistic models it consumes. The architectural answer is the internal model gateway, an inference proxy that all model traffic passes through. Just as the bounding gateway (Chapter 5) abstracts the agent’s actions and the memory gateway (Chapter 7) abstracts its state, the model gateway abstracts its reasoning. Forcing every inference request through one deterministic proxy is what lets the architecture regain control over data egress, cost, vendor dependence, and latency, none of which can be governed from inside the agent’s reasoning loop.
The model gateway pattern
The commitment is absolute: no agentic component calls a public model API directly. Every generation, embedding, and classification request routes through a gateway owned by the platform’s infrastructure team.

The topology solves five problems that have no solution inside the agent’s own logic.
Data residency and egress filtering
An agentic system routinely handles regulated data, patient records, financial ledgers, personal information. Even with a strict memory gateway (Chapter 7), an agent may summarize a document containing protected health information (PHI) and try to send that summary to a public model. The model gateway is the final egress boundary, the point at which the deterministic shell filters what leaves the corporate network. Before a prompt departs, the gateway scans the payload; on a regulated match it does not merely error but reroutes, sending the prompt to an internally hosted model that never touches the public internet while letting benign prompts continue to the faster external one. Residency becomes a routing decision enforced by infrastructure rather than a policy the agent is trusted to remember.
Semantic routing by capability, not model name
In an orchestrator-worker arrangement (Chapter 9), different components have sharply different cognitive needs. The orchestrator’s planning demands a frontier reasoning model; a worker checking that output conforms to a schema needs only basic instruction-following; a classifier deciding which of three branches to take needs to emit a single token. Hardcoding a specific model name inside the agent couples the application to one vendor’s naming and pricing. The gateway instead accepts a request for a capability tier, frontier reasoning, fast structured output, cheap classification, and maps that tier to the most cost-effective model currently available. The infrastructure team can then swap the model behind a tier across the entire fleet without touching a line of agentic code.
Affinity routing to preserve the cache
Provider-side prompt caching makes a large static prefix, a system prompt, retrieved context, cheap on every call after the first, but only if the same prefix reaches the same upstream. A gateway that load-balances naively, round-robining requests across keys or regions, destroys the cache hit rate and inflates cost precisely on the largest prompts. The commitment is affinity routing: the gateway hashes the static portion of the prompt and consistently routes matching payloads to the same upstream session or region, so the fleet earns the cache discount the provider offers. Cost control here is a routing property, not a prompt-engineering one.
Fallback and circuit breaking
External model APIs degrade, return rate-limit errors in bursts, and fail outright. An agentic workflow of fifteen sequential steps that fails on the fourteenth because a connection dropped has wasted the entire session. The gateway applies standard distributed-systems resilience that the agent never sees: it retries transient errors with backoff, trips a circuit breaker when a provider’s latency crosses a threshold, and fails a request over to an equivalent model on a second provider. The reasoning loop continues uninterrupted; the trace records that a fallback occurred, so the event is observable without being disruptive.
Cost attribution and chargebacks
Cost bounds (Chapter 5) stop a runaway agent, but a central platform team serving the whole company cannot absorb the aggregate bill without knowing who spent it. The gateway enforces attribution: every request carries metadata identifying the tenant, the business unit, and the session, and the gateway logs token usage against the routed model’s price and emits a billing event. When one department deploys an inefficient looping agent, that department’s budget is charged and its quota throttled, rather than the cost landing on a shared account where no one can trace the spike back to its source.
The embedding exception
The five concerns above describe the gateway’s treatment of generation calls, where transparent failover is the headline benefit. Embeddings are the exception, and the gateway must treat them differently. A generation request can be rerouted to an equivalent model mid-flight because each call stands alone, its response is consumed and discarded. An embedding request cannot, because the vector it returns must live in an index already populated by one specific model, and vectors from two models are not comparable (Chapter 8). A gateway that silently fails an embedding call over to a different model to keep the request alive does not degrade gracefully; it quietly corrupts the index with vectors that no similarity search can rank against the rest.
The routing policy must therefore pin the embedding model per index, not per capability tier. When the pinned model is unavailable, the correct behavior is to queue the work or fail it loudly, never to substitute. Re-pointing an index at a new embedding model is the governed backfill of Chapter 8, a deliberate batch operation, not something the gateway may improvise under load. Generation is a stateless dependency the gateway can swap freely; embeddings are a stateful one it must hold fixed.
Edge and local models
The assumption that all reasoning happens in a large public cloud is no longer safe. Small language models (SLMs), from roughly one to several billion parameters, now run on employee laptops, inside isolated build containers, and on bare-metal corporate servers. In an agentic architecture these are not merely a cost saving; they behave more like deterministic tools than cloud models do, because they have fixed weights, no network latency, and no data-egress risk. The gateway itself often hosts such a model to perform its own routing and egress checks, classifying whether a prompt carries regulated data in milliseconds rather than running a slower scan.
A mature system uses edge and cloud models together. A local model on the user’s device or a nearby node performs the initial intent classification and redaction. For simple tasks, formatting text, extracting an entity, summarizing a short document, that local model acts as the agent and returns in milliseconds. When a task needs multi-step planning or a large context, the local model hands off (Chapter 9) to the cloud orchestrator, passing the already-scrubbed context up to the heavier model. The hybrid reduces cloud token cost, removes first-token latency for routine work, and leaves the system with a degraded but functional offline mode.
When equivalence fails
The discipline above treats a fallback model as a qualified, behaviorally-equivalent substitute. The case that tests the architecture is the one where equivalence fails: the provider deprecates the pinned model, no behaviorally-equivalent fallback exists, and the only available model does not reproduce the envelope the prior one met. There are three responses, and the choice between them is a product decision, not an engineering one, but the engineering discipline makes the choice tractable.
The first response is to accept degraded behavior behind a tighter envelope. Where the new model is adequate on the bulk of the traffic but fails the held-out cases the prior model passed, route the new model behind a stricter bounding layer: a lower cost ceiling, a tighter action surface, a lower approval threshold on the risk scorer. The envelope is tightened to the level at which the new model’s behavior is acceptable, and the system runs with reduced capability rather than no capability. The cost is a degradation the user sees, fewer autonomous actions, more approvals, narrower scope, and the discipline is to measure that degradation against the golden trace set (Chapter 12) before promoting, so the team knows precisely what it is shipping rather than discovering it in production.
The second response is to fall back to a human-in-the-loop tier. Where the new model cannot be trusted even behind a tighter envelope for the consequential actions, route those actions to the approval queue by default and let the autonomous path handle only the actions the new model still handles reliably. This is the reversibility envelope (Chapter 5) widened deliberately: more actions become irreversible-by-default, more work passes through the human gate, and the system keeps functioning at lower autonomy rather than blocking the capability entirely. The cost is reviewer throughput, which is the constraint Chapter 18 already names, and sizing the reviewer pool against the elevated approval rate is the operational move that makes this response sustainable rather than a queue that grows without bound.
The third response is to block the capability until qualification. Where the new model’s failure mode is severe, a structural hallucination on the semantic layer (Chapter 14), an inability to honor the tool schemas the action surface depends on, the capability is not shippable behind any envelope, and the right answer is to refuse promotion. The model gateway routes the capability to an error path, the user-facing surface reports the capability as unavailable, and the team treats the gap as a procurement and qualification problem rather than a deployment one. This is the response the book’s thesis implies but rarely states: a system whose reliability rests on a deterministic shell does not owe the user a capability the shell cannot make safe, and refusing to ship a capability the new model breaks is more honest than shipping it broken.
The three responses are not exclusive; a real migration often combines them, the bulk of traffic behind a tighter envelope, the consequential actions on the human tier, the one capability that cannot be made safe blocked outright. The discipline that unifies them is the same one the chapter applies to a clean fallback: the new model is qualified against held-out tasks before promotion, the envelope is tightened to what the qualification shows, and the degradation is measured and communicated rather than absorbed silently. The difference is that the qualification now shows the model cannot meet the prior envelope, and the architecture’s job is to find the envelope it can meet, or to admit that none exists and refuse to ship.
Standardizing the model protocol
For the gateway to route, fail over, and translate, the fleet must speak to it in one schema. If one worker uses a provider’s SDK, another a different provider’s, and a third a raw client for a local model, the gateway cannot parse or reroute their requests uniformly. The commitment is a single internal request schema, often an existing de facto industry schema, or a thin internal abstraction over one. Every agent emits prompts and tool calls in that schema; the gateway translates to each destination model’s native format and translates the response back. That translation layer is what makes vendor lock-in a configuration detail rather than a rewrite.
The abstraction is real but leaky, and pretending otherwise produces its own incidents. Providers differ in ways the translation layer cannot fully hide: the dialects in which tool schemas are expressed, how a tool result is represented, the finish reasons a response can carry, and how tokens are counted and priced. Some of this normalizes cleanly, such as mapping one stop reason onto another. Some does not. A model that supports parallel tool calls, a strict structured-output mode, or a far larger context window is not behaviorally interchangeable with one that lacks them, even behind an identical schema. A capability tier is only as fungible as its least capable member.
The discipline is to define a tier by the capabilities the agents in it actually rely on, and to qualify a model for that tier by testing it for behavioral equivalence on held-out tasks (Chapter 12), not by assuming the shared schema guarantees it. Failover then degrades along a known axis rather than surprising the fleet with a target that parses every request and quietly mishandles the ones that matter. The unified protocol removes the mechanical lock-in of vendor SDKs; it does not make models equivalent, and the routing table has to encode the difference.
The gateway among the gateways
The model gateway is the fourth deterministic chokepoint the book has placed around the agent, and naming the set clarifies what each owns. The bounding gateway (Chapter 5) governs what the agent may do and how far; the governance layer (Chapter 6) governs whether a specific proposed action is acceptable; the memory gateway (Chapter 7) governs what the agent may read and write. The model gateway governs what leaves the network to be reasoned about, and on which model. The four are deliberately separate, each is testable in isolation and each fails independently, but they share a stance: the agent requests, and a deterministic layer it does not control decides.
One distinction is worth drawing explicitly, because the gateway’s egress filtering and the glass layer’s output buffering (Chapter 13) both inspect text at a boundary and are easy to conflate. They guard different boundaries for different parties. The model gateway’s egress scan reads a prompt on its way out of the corporate network to the provider; its concern is that regulated data not leave the building. The glass layer’s chunked buffering reads generated output on its way to the user; its concern is that policy-violating content not reach a person, and it enforces the governance gates of Chapter 6 at the moment of display. The mechanism family is the same, deterministic inspection of text at a boundary, applied at two different boundaries: agent-to-provider egress, and system-to-user surface. A response can clear the model gateway, having contained nothing the provider should not see, and still be held at the glass for containing something the user should not see. The architecture needs both, and neither subsumes the other.
The separation has a cost worth stating plainly. Each chokepoint is a network hop, and the model gateway sits on the latency-critical path of every inference. The mitigation is ordinary distributed-systems practice, co-locate the gateway with the agent fleet, keep it stateless so it scales horizontally, and reserve its expensive work (egress classification, affinity hashing) for the requests that need it. The hop is not free, but it buys residency, cost control, and vendor independence that cannot be recovered once a direct call has left the building.
The gateway earns its complexity at scale, not on the first day. A single team running one model for one tenant gains little from a proxy and is better served by a thin adapter that calls the provider directly while leaving room to introduce the gateway later. The gateway becomes load-bearing once any of three thresholds is crossed: a fleet of agents whose aggregate cost must be attributed to its sources, more than one provider or region to route among, or regulated data whose egress must be controlled. Building it before any of these is present is premature; discovering all three at once, mid-incident, is worse. The discipline that resolves the tension is to put the agent’s model calls behind a capability-tier interface from the start, so the gateway can be slotted in when the first threshold arrives without rewriting agentic code.
The threat model of the gateway
Centralizing all inference through one proxy concentrates risk, and the gateway carries a threat model of its own.
The gateway as a bottleneck. Because all reasoning passes through it, a misconfigured limiter or a crashed instance halts the entire fleet. The gateway must run as a highly available, stateless, horizontally scaled service, never as a single instance on the critical path.
Header spoofing. If an agent can forge its business-unit or tenant header, it can charge its own compute to another department and evade its quota. The gateway must accept attribution headers only when they are cryptographically signed by the internal authentication layer, never self-reported by the agent.
Cache poisoning. A gateway that caches identical requests to save money will, if it keys the cache on the prompt alone, serve one user’s cached answer to another who lacks authorization to see it. Cache keys must be composite, combining the prompt hash with the requester’s identity and access scope, so a cache hit can never cross an access-control boundary (Chapter 6).
Anti-patterns
Hardcoded model selection. Naming a specific vendor model inside the agent. It couples runtime behavior to one vendor’s naming, breaks failover, and prevents the platform from routing to a cheaper model. Request a capability tier from the gateway instead.
Direct internet egress. Agent containers deployed with open outbound access. Any injection that convinces the agent to call an attacker’s URL succeeds. Agent containers should have no internet egress, reaching the outside world only through internal tools and the model gateway.
The unattributed bill. A single corporate account absorbing the inference cost of every workflow. When the bill jumps by a large sum over a weekend, no one can trace it to the agent or session that caused it. Attribution must be enforced at the gateway, not reconstructed after the fact.
Summary
The foundation models that power agentic systems are commodities; the deterministic architecture around them is the asset, and the internal model gateway is where that architecture meets the network. Routing every inference through one proxy lets the platform enforce data residency through egress filtering, control cost through capability-tier routing, affinity caching, and chargebacks, and guarantee continuity through transparent failover, while a standardized protocol keeps any one vendor replaceable. Folding local edge models into the same routing layer adds latency and privacy that the cloud cannot match. Chapter 16 turns from the individual layers to whole systems, composing the patterns of Parts II and III into a set of worked architectural vignettes.