The case for model-agnostic AI architecture is straightforward, and we make it ourselves most of the time. The AI vendor landscape is moving fast. Anthropic, OpenAI, Google, and a half-dozen credible open-source labs are releasing new models on overlapping timelines. Building infrastructure that locks you to one provider's API surface is a bet on that provider's continued superiority. Even if you're right today, you're probably wrong by next year — and a 30% cost reduction from a competing provider becomes "we can't easily switch" instead of "we just switched."
So we usually recommend model-agnostic. Portable prompts, model-independent evals, orchestration code that doesn't import provider-specific helpers if it can avoid it. This is the default in most of the engagements we run.
But there's a hidden cost most teams don't price in until they're a year deep. This post is about that cost and how to think about it.
What "model-agnostic" usually means in practice
In most implementations we audit, model-agnostic means:
- A thin abstraction layer over the AI provider's SDK (LiteLLM, LangChain's chat models, or a hand-rolled equivalent).
- Prompts stored as strings or markdown files, not embedded in provider-specific objects.
- Evals written against the abstraction, not the underlying API.
- The ability to swap from Claude to GPT to Gemini by changing a config value.
This works. We've seen it work. It's not what costs you.
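As a minimal sketch of the setup described above, assuming two stand-in backends rather than real SDK calls (the names `fake_claude`, `fake_gpt`, `BACKENDS`, and `complete` are hypothetical, not from any library), the config-driven swap might look like this:

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A backend takes a fully rendered prompt and returns the model's text.
Backend = Callable[[str], str]


def fake_claude(prompt: str) -> str:
    # Hypothetical stand-in; a real version would call the provider SDK here.
    return f"[claude] {prompt}"


def fake_gpt(prompt: str) -> str:
    # Hypothetical stand-in for a second provider.
    return f"[gpt] {prompt}"


BACKENDS: Dict[str, Backend] = {"claude": fake_claude, "gpt": fake_gpt}

# The prompt lives as a plain string, independent of any provider object.
SUMMARIZE_PROMPT = "Summarize in one sentence:\n{text}"


@dataclass(frozen=True)
class Config:
    provider: str  # the single value you change to swap models


def complete(config: Config, template: str, **fields: str) -> str:
    """Render the template and send it to whichever backend the config names."""
    if config.provider not in BACKENDS:
        raise ValueError(f"unknown provider: {config.provider!r}")
    return BACKENDS[config.provider](template.format(**fields))
```

Calling `complete(Config("claude"), SUMMARIZE_PROMPT, text="...")` and then changing only the config value to `"gpt"` exercises the same prompt against the other backend. That is the syntactic swap; it says nothing yet about whether the two behave the same.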
What it actually costs
The cost shows up when you want to use a provider's distinctive feature. A few examples we've hit:
Tool use semantics differ. Anthropic's tool use, OpenAI's function calling, and Gemini's function calling are not identical. They differ in how they handle parallel tool calls, in how they treat tool errors, in how aggressively the model retries, in how they emit reasoning traces. The differences are subtle but they show up in production — especially in agentic workflows where the model is chaining tool calls.
If your abstraction normalizes these differences to a lowest common denominator, you lose the provider-specific behavior that makes one of them better than the others for your use case. We've seen teams that had standardized on a portable tool-use shape discover, six months in, that their Claude implementation would have been 15% more accurate if they'd used Claude's native semantics. They'd architected the win away.
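To make the lowest-common-denominator problem concrete, here is a small sketch. The shapes are hypothetical: `NativeToolCall`, `PortableToolCall`, and field names such as `parallel_group` are invented for illustration and do not mirror any provider's actual response format. The idea is to audit what a normalizing layer discards.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set


@dataclass
class NativeToolCall:
    """A tool call as one provider might report it (hypothetical shape)."""
    name: str
    arguments: Dict[str, Any]
    # Anything the provider exposes beyond name + arguments.
    extras: Dict[str, Any] = field(default_factory=dict)


@dataclass
class PortableToolCall:
    """The lowest-common-denominator shape every provider can fill in."""
    name: str
    arguments: Dict[str, Any]


def normalize(call: NativeToolCall) -> PortableToolCall:
    # Normalizing keeps only the shared fields; extras are silently dropped.
    return PortableToolCall(name=call.name, arguments=call.arguments)


def dropped_fields(calls: List[NativeToolCall]) -> Set[str]:
    """List the provider-specific fields that normalization throws away."""
    lost: Set[str] = set()
    for call in calls:
        lost.update(call.extras)
    return lost
```

Running `dropped_fields` over a sample of real traffic is one cheap way to see which provider-specific signals your abstraction is hiding before deciding whether they matter for your workflow.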
Structured outputs. OpenAI's structured outputs (constrained generation against a JSON schema) and Anthropic's tool-use-based structured outputs are not the same. The OpenAI version guarantees schema conformance; the Anthropic version is empirically excellent but not formally guaranteed. If your abstraction layer treats them as interchangeable, you forgo the guarantee, which can matter for downstream systems that don't gracefully handle parse failures.
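Where conformance is not formally guaranteed, the usual mitigation is to validate and retry on your side. The sketch below assumes a hypothetical `generate` callable that returns the model's raw text, and a deliberately simple schema of top-level keys and Python types (`TICKET_SCHEMA` is illustrative); a real system would more likely use a full JSON Schema validator.

```python
import json
from typing import Any, Callable, Dict

# Expected top-level keys and their Python types (illustrative schema).
TICKET_SCHEMA: Dict[str, type] = {"category": str, "urgent": bool}


def conforms(obj: Any, schema: Dict[str, type]) -> bool:
    """True if obj is a dict with exactly the schema's keys and value types."""
    return (
        isinstance(obj, dict)
        and set(obj) == set(schema)
        and all(isinstance(obj[key], typ) for key, typ in schema.items())
    )


def generate_validated(
    generate: Callable[[], str],
    schema: Dict[str, type],
    max_attempts: int = 3,
) -> Dict[str, Any]:
    """Call generate() until its output parses and conforms, or give up."""
    for _ in range(max_attempts):
        try:
            obj = json.loads(generate())
        except json.JSONDecodeError:
            continue  # not JSON at all; try again
        if conforms(obj, schema):
            return obj
    raise ValueError(f"no schema-conformant output in {max_attempts} attempts")
```

With a provider-side guarantee, this loop is a safety net that should never fire. Without one, it is the thing standing between a malformed response and a downstream parse failure, so the two cases deserve different code paths rather than one shared abstraction.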
Long context behavior. Claude's long context performance is qualitatively different from GPT-4o's. Both offer context windows above 100K tokens (128K for GPT-4o, 200K or more for recent Claude models), but how they handle a long document at the end of a prompt differs noticeably. A workflow optimized for one is not automatically optimal on the other. The "swap the model with a config change" idea is true syntactically and not always true semantically.
Prompt caching. Anthropic's prompt caching, Gemini's context caching, and OpenAI's prompt caching all behave differently. Anthropic's is the most mature as of 2026, with multi-tier caching and explicit cache breakpoints. Using it well requires prompt-structure choices that don't necessarily port to other providers. Teams that insist on full portability forgo significant cost savings: 50-90% on cacheable inputs, which is a lot of money at scale.
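The arithmetic behind that savings range is simple enough to sketch. The function names are hypothetical, and the inputs are assumptions you should replace with your provider's current pricing: `cache_read_multiplier` is the fraction of the normal input price charged for a cached token. Cache-write surcharges are ignored to keep the example short.

```python
def input_cost(
    prompt_tokens: int,
    cached_fraction: float,
    price_per_million: float,
    cache_read_multiplier: float,
) -> float:
    """Input cost in dollars for one request, ignoring cache-write surcharges."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached = prompt_tokens * cached_fraction
    fresh = prompt_tokens - cached
    # Cached tokens are billed at a discount; fresh tokens at full price.
    billable = fresh + cached * cache_read_multiplier
    return billable * price_per_million / 1_000_000


def savings_fraction(cached_fraction: float, cache_read_multiplier: float) -> float:
    """Share of input spend removed by caching, relative to no caching."""
    return cached_fraction * (1.0 - cache_read_multiplier)
```

For example, if 90% of a prompt is served from cache and cached tokens cost a tenth of the normal price, `savings_fraction(0.9, 0.1)` works out to 0.81, or 81% off input spend. The saving depends on how much of the prompt is stable enough to cache, which is exactly the prompt-structure choice that does not port cleanly between providers.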
The right balance
We've come around to a hybrid recommendation. Portable where it's cheap to be portable; specialized where it actually matters.
Practically:
- Prompts as strings, evals as data, orchestration in code you own. Yes. Always. This is the 80% portable part, and it costs almost nothing to maintain.
- Provider-specific code at the surgical points where it matters. Tool use, structured outputs, prompt caching strategy, long-context structuring. Don't normalize these to a shared abstraction; embrace the differences.
- A clear mental model of "what would it cost to switch." For every workflow, you should be able to answer the question "if we had to swap providers tomorrow, what changes?" The answer is rarely "nothing" and rarely "everything." The teams that have a clear answer make better architecture decisions than the teams who assume full portability.
This is the architecture we ship most of the time. It's a fuzzy line, not a clean one, and reasonable teams disagree about where the line is. But it produces measurably better workflows than full-portability designs, in our experience.
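One way to keep the "what would it cost to switch" answer honest is to write it down as data next to the workflow. The sketch below assumes a hypothetical `Step` record and feature labels such as `"prompt_caching"`; nothing here comes from a real library.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Step:
    name: str
    provider: str
    # Provider-specific features this step depends on; empty means portable.
    native_features: Tuple[str, ...] = ()


def switch_report(steps: List[Step], leaving: str) -> Dict[str, List[str]]:
    """Map each step on the departing provider to the features it must rework."""
    return {
        step.name: list(step.native_features)
        for step in steps
        if step.provider == leaving
    }
```

A step that maps to an empty list is a config change. A step that maps to one or more features is real migration work. Steps on other providers do not appear at all, which is usually the bulk of the "rarely everything" part of the answer.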
What this looks like in code
A concrete example. Imagine a customer support workflow with three steps: (1) categorize the incoming ticket, (2) draft a reply using retrieved context, (3) decide whether to escalate or auto-send.
A purely portable architecture has all three steps going through the same abstraction layer with the same prompt-as-string approach. Reasonable, but you're not using anyone's special features.
A purely vendor-locked architecture writes all three steps as OpenAI-specific (or Anthropic-specific) code with provider-specific objects and helpers throughout. You're fast to a working version but locked in.
The hybrid: orchestration code (looping, branching, persistence) is provider-agnostic. The categorize step uses OpenAI structured outputs because the schema guarantee matters downstream. The draft step uses Anthropic with prompt caching because the retrieved context is large and stable. The escalation decision uses whichever model wins your eval on your data, and because your eval is provider-independent, you can run every candidate provider through it.
That hybrid pencils out. Pure portability and pure lock-in both leave value on the table.
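A rough sketch of the hybrid shape, with the three steps injected as plain callables. All names are hypothetical, and the stub implementations stand in for the provider-specific code (structured outputs, prompt caching) that would live behind each callable in a real system.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Ticket:
    text: str
    context: str  # retrieved context for the draft step


@dataclass
class Steps:
    # Each field is where provider-specific code plugs in.
    categorize: Callable[[str], Dict[str, str]]
    draft: Callable[[str, str], str]
    should_escalate: Callable[[str, str], bool]


def handle_ticket(ticket: Ticket, steps: Steps) -> Dict[str, str]:
    """Provider-agnostic orchestration: sequencing and branching only."""
    category = steps.categorize(ticket.text)["category"]
    reply = steps.draft(ticket.text, ticket.context)
    action = "escalate" if steps.should_escalate(category, reply) else "auto_send"
    return {"category": category, "reply": reply, "action": action}


# Stub steps for illustration; real ones would call the chosen providers.
def categorize_stub(text: str) -> Dict[str, str]:
    return {"category": "billing" if "refund" in text.lower() else "general"}


def draft_stub(text: str, context: str) -> str:
    return f"Thanks for writing in. {context}"


def escalate_stub(category: str, reply: str) -> bool:
    return category == "billing"
```

The design choice is that `handle_ticket` never imports a provider SDK. Swapping the model behind one step means replacing one callable, while the specialized code inside that callable is free to use whatever native features earn their keep.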
A note on open-source models
The model-agnostic conversation usually focuses on the proprietary providers. Open-source models (Llama, Mistral, Qwen, DeepSeek) are a separate consideration. The case for keeping the door open is stronger here — open-source progress is genuinely surprising in 2026, and a workflow that runs on a small Llama or Qwen variant has fundamentally different cost economics than one that runs on Claude Opus.
But the same hybrid logic applies. Use the open-source model where it earns its keep (high volume, low complexity, cost-sensitive); use the bigger proprietary models where the workflow genuinely needs the capability gap. Don't try to standardize on one for all steps — you'll either waste money on simple steps or sacrifice quality on hard ones.
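The per-step choice can be written down as an explicit rule rather than left to habit. This is a sketch only: `StepProfile`, `pick_tier`, and the default volume threshold are hypothetical, and the `needs_frontier_quality` flag is assumed to come from your own eval results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepProfile:
    name: str
    daily_calls: int
    # Assumed to be decided by your evals, not by intuition.
    needs_frontier_quality: bool


def pick_tier(step: StepProfile, high_volume_threshold: int = 10_000) -> str:
    """Suggest a model tier for one step; the threshold is illustrative."""
    if step.needs_frontier_quality:
        return "proprietary"  # capability gap outweighs cost
    if step.daily_calls >= high_volume_threshold:
        return "open"  # high volume, low complexity: cost dominates
    return "either"  # low stakes either way; pick on convenience
```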
What we'd tell teams making this call
Three opinions we'd put in writing:
Start portable. Stay portable for orchestration. Specialize per step where the feature matters. This is the default architecture we ship.
Maintain your evals as the portable layer. Eval data should run against any provider with a small adapter. This is the highest-value portability investment. If you ever want to switch providers, the evals tell you whether the new one performs as well — and you can't have that conversation without portable evals.
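A portable eval can be as small as a list of cases plus one adapter per provider. In this sketch, `EvalCase`, `Adapter`, `accuracy`, and `compare` are hypothetical names, each adapter is assumed to wrap one provider behind a text-in, label-out call, and exact-match scoring is a simplification of whatever metric your workflow needs.

```python
from typing import Callable, Dict, List, Tuple

# Eval cases are plain data: (input text, expected label).
EvalCase = Tuple[str, str]
# An adapter hides one provider behind a common text-in, label-out call.
Adapter = Callable[[str], str]


def accuracy(cases: List[EvalCase], adapter: Adapter) -> float:
    """Fraction of cases where the adapter's label matches the expected one."""
    if not cases:
        raise ValueError("need at least one eval case")
    correct = sum(
        1 for text, expected in cases
        if adapter(text).strip().lower() == expected
    )
    return correct / len(cases)


def compare(cases: List[EvalCase], adapters: Dict[str, Adapter]) -> Dict[str, float]:
    """Run the same cases through every adapter and report accuracy by name."""
    return {name: accuracy(cases, adapter) for name, adapter in adapters.items()}
```

Because the cases are data and the adapters are small, adding a new provider to the comparison means writing one more adapter rather than rewriting the eval.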
Don't over-invest in abstractions you don't need yet. Premature abstraction is more expensive than vendor coupling. Build the workflow with one provider, get it working, refactor toward portability only at the points where you've actually hit a wall.
If you're architecting AI infrastructure right now and want a second opinion on the portable-vs-specialized line, we'll do a paid one-week audit that ends with a written recommendation.