Should we build model-agnostic AI infrastructure?

Default to yes — keep your prompts, evals, and orchestration portable across providers. The cost of doing so is low, the option value of being able to swap models is high. But don't over-invest in abstraction. Most teams over-engineer this and end up with a lowest-common-denominator architecture that can't use any model's distinctive features well.

What's the hidden cost of model-agnostic architecture?

You forgo the gains from using each provider's distinctive features. Anthropic's tool use, OpenAI's structured outputs, Google's long-context performance — using these well usually requires provider-specific code that doesn't port cleanly. Teams who insist on full portability end up with a lowest-common-denominator architecture that's measurably worse than what either provider could produce alone.

What's the right balance?

Portable interfaces (prompts as strings, evals as data, orchestration in code you own) plus provider-specific implementations where it actually matters. Most workflows benefit from this hybrid: 80% portable, 20% specialized. The 20% pays back when a specific provider's feature gives you a measurable accuracy or cost win on a specific step.

The hidden cost of model-agnostic AI architecture

The case for model-agnostic AI architecture is straightforward, and we make it ourselves most of the time. The AI vendor landscape is moving fast. Anthropic, OpenAI, Google, and a half-dozen credible open-source labs are releasing new models on overlapping timelines. Building infrastructure that locks you to one provider's API surface is a bet on that provider's continued superiority. Even if you're right today, you're probably wrong by next year — and a 30% cost reduction from a competing provider becomes "we can't easily switch" instead of "we just switched."

So we usually recommend model-agnostic. Portable prompts, model-independent evals, orchestration code that doesn't import provider-specific helpers if it can avoid it. This is the default in most of the engagements we run.

But there's a hidden cost most teams don't price in until they're a year deep. This post is about that cost and how to think about it.

What "model-agnostic" usually means in practice

In most implementations we audit, model-agnostic means:

A thin abstraction layer over the AI provider's SDK (LiteLLM, LangChain's chat models, or a hand-rolled equivalent).
Prompts stored as strings or markdown files, not embedded in provider-specific objects.
Evals written against the abstraction, not the underlying API.
The ability to swap from Claude to GPT to Gemini by changing a config value.

This works. We've seen it work. It's not what costs you.

What it actually costs

The cost shows up when you want to use a provider's distinctive feature. A few examples we've hit:

Tool use semantics differ. Anthropic's tool use, OpenAI's function calling, and Gemini's function calling are not identical. They differ in how they handle parallel tool calls, in how they treat tool errors, in how aggressively the model retries, in how they emit reasoning traces. The differences are subtle but they show up in production — especially in agentic workflows where the model is chaining tool calls.

If your abstraction normalizes these differences to a lowest common denominator, you lose the provider-specific behavior that makes one of them better than the others for your use case. We've seen teams who'd standardized on a portable tool use shape discover, six months in, that their Claude implementation would have been 15% more accurate if they'd used Claude's native semantics. They'd architected the win away.

Structured outputs. OpenAI's structured outputs (constrained generation against a JSON schema) and Anthropic's tool-use-based structured outputs are not the same. The OpenAI version guarantees schema conformance; the Anthropic version is empirically excellent but not formally guaranteed. If your abstraction layer treats them as interchangeable, you forgo the guarantee, which can matter for downstream systems that don't gracefully handle parse failures.

Long context behavior. Claude's long context performance is qualitatively different from GPT-4o's. They're both nominally 200K+ tokens, but how they handle a long document at the end of a prompt differs noticeably. A workflow optimized for one is not automatically optimal on the other. The "swap the model with a config change" idea is true syntactically and not always true semantically.

Prompt caching. Anthropic's prompt caching, Gemini's context caching, OpenAI's prompt caching all behave differently. Anthropic's is the most mature as of 2026, with multi-tier caching and explicit cache breakpoints. Using it well requires prompt-structure choices that don't necessarily port to other providers. Teams who want full portability forgo significant cost savings — 50-90% on cacheable inputs, which is a lot of money at scale.

The right balance

We've come around to a hybrid recommendation. Portable where it's cheap to be portable; specialized where it actually matters.

Practically:

Prompts as strings, evals as data, orchestration in code you own. Yes. Always. This is the 80% portable part, and it costs nothing to maintain.
Provider-specific code at the surgical points where it matters. Tool use, structured outputs, prompt caching strategy, long-context structuring. Don't normalize these to a shared abstraction; embrace the differences.
A clear mental model of "what would it cost to switch." For every workflow, you should be able to answer the question "if we had to swap providers tomorrow, what changes." The answer is rarely "nothing" and rarely "everything." The teams that have a clear answer make better architecture decisions than the teams who assume full portability.

This is the architecture we ship most of the time. It's a fuzzy line, not a clean one, and reasonable teams disagree about where the line is. But it produces measurably better workflows than full-portability designs, in our experience.

What this looks like in code

A concrete example. Imagine a customer support workflow with three steps: (1) categorize the incoming ticket, (2) draft a reply using retrieved context, (3) decide whether to escalate or auto-send.

A purely portable architecture has all three steps going through the same abstraction layer with the same prompt-as-string approach. Reasonable, but you're not using anyone's special features.

A purely vendor-locked architecture writes all three steps as OpenAI-specific (or Anthropic-specific) code with provider-specific objects and helpers throughout. You're fast to a working version but locked in.

The hybrid: orchestration code (looping, branching, persistence) is provider-agnostic. The categorize step uses OpenAI structured outputs because the schema guarantee matters downstream. The draft step uses Anthropic with prompt caching because the retrieved context is large and stable. The escalation decision uses whichever model wins your eval on your data — and your eval is provider-independent, so you can test all three.

That hybrid pencils. Pure portability and pure lock-in both leave value on the table.

A note on open-source models

The model-agnostic conversation usually focuses on the proprietary providers. Open-source models (Llama, Mistral, Qwen, DeepSeek) are a separate consideration. The case for keeping the door open is stronger here — open-source progress is genuinely surprising in 2026, and a workflow that runs on a small Llama or Qwen variant has fundamentally different cost economics than one that runs on Claude Opus.

But the same hybrid logic applies. Use the open-source model where it earns its keep (high volume, low complexity, cost-sensitive); use the bigger proprietary models where the workflow genuinely needs the capability gap. Don't try to standardize on one for all steps — you'll either waste money on simple steps or sacrifice quality on hard ones.

What we'd tell teams making this call

Three opinions we'd put in writing:

Start portable. Stay portable for orchestration. Specialize per step where the feature matters. This is the default architecture we ship.

Maintain your evals as the portable layer. Eval data should run against any provider with a small adapter. This is the highest-value portability investment. If you ever want to switch providers, the evals tell you whether the new one performs as well — and you can't have that conversation without portable evals.

Don't over-invest in abstractions you don't need yet. Premature abstraction is more expensive than vendor coupling. Build the workflow with one provider, get it working, refactor toward portability only at the points where you've actually hit a wall.

If you're architecting AI infrastructure right now and want a second opinion on the portable-vs-specialized line, we'll do a paid one-week audit that ends with a written recommendation.

The hidden cost of model-agnostic AI architecture

What "model-agnostic" usually means in practice

What it actually costs

The right balance

What this looks like in code

A note on open-source models

What we'd tell teams making this call

The complete guide to AI automation for businesses in 2026

How to read AI model benchmarks in 2026 (and what to ignore)

Anthropic Claude vs OpenAI GPT for production AI in 2026

Frequently asked.

Send us your most expensive operation.
We'll have an audit on your desk in five days.

The hidden cost of model-agnostic AI architecture

What "model-agnostic" usually means in practice

What it actually costs

The right balance

What this looks like in code

A note on open-source models

What we'd tell teams making this call

The complete guide to AI automation for businesses in 2026

How to read AI model benchmarks in 2026 (and what to ignore)

Anthropic Claude vs OpenAI GPT for production AI in 2026

Frequently asked.

Send us your most expensive operation.We'll have an audit on your desk in five days.

Send us your most expensive operation.
We'll have an audit on your desk in five days.