About 1 in 4 of the AI proofs-of-concept we advise on don't make it to production. That's not a number we love, but it's the number, and it's been roughly stable across the engagements we've shipped over the last two years. What's interesting isn't the rate. It's that the failures almost always trace back to the same handful of issues, and they're predictable enough that you can design around them if you know what to look for.
This is what we tell prospects when they ask us about the failure rate, and what we'd want any team building AI to know before they start.
The technology is not what fails
The most common assumption is that AI POCs die on a model problem. The model couldn't handle the edge cases, the accuracy wasn't there, the customer-facing output was bad. We've seen this happen, but it's rare — maybe 1 in 20 failures.
The other 19 in 20 fail operationally. Specifically: the team built something that worked in demo but couldn't get traction inside the actual operation. The model was fine. The hand-off to humans was not.
This matters because most AI procurement decisions optimize for the wrong axis. Teams agonize over model choice (Anthropic vs OpenAI vs open-source), benchmark scores, latency, cost per token. Then they ship a pilot, and the pilot dies because nobody updated the standard operating procedure to tell the front-desk team that the agent was on, what it does, and how to override it.
The four failure modes we see
In order of how often we see them:
Scope creep dressed as feedback. A pilot is mid-build. Someone on the team has an idea: "while we're at it, can the agent also do X?" Three small adds turn into a three-month delay, the original pilot never goes live, and the executive sponsor stops returning calls. The fix is to lock the scope before the build and queue every "while we're at it" as a follow-up engagement. Hard discipline. Worth it.
No internal champion. Every successful agent in production has a single human who cares whether it works. Not a steering committee. One operator or manager who'd notice if the agent stopped working tomorrow morning. Without that person, the project drifts. The pilot ships, the steering committee approves it, nobody is actually responsible for whether the team uses it, and within two months the volume reverts to manual.
Handoff theater. The team ships the pilot, runs a one-hour training, sends a Loom, and walks away. The operator now technically "has the agent" but doesn't trust the outputs, can't find the dashboard, doesn't know what to do when the agent is wrong. The trust deficit compounds; by month four the agent is on but nobody is leaning on it.
The metrics nobody wired. The pilot is producing outputs but the team has no way to know if it's working. There's no comparison against baseline. No eval data. No "operator hours per 100 cases" metric. So when the executive sponsor asks for ROI in month three, the answer is qualitative and unconvincing. The project gets defunded.
What the successful POCs do differently
The 75% that make it to production share four patterns:
They audited before they built. A one-week audit of the operation — not the AI, the operation — before anything got coded. This sets the kill criteria, names the operator, names the metrics, and surfaces the edge cases. Skipping this step is the strongest predictor of failure we've found.
They named an internal champion in week one. The single human who'd notice if the agent stopped. By the way: this person should not be the executive sponsor. They should be one or two levels closer to the work. The exec sponsor opens budget; the champion uses the thing.
They shipped with a handoff plan in scope. Documentation, dashboards, override paths, escalation rules, a 30-day shadow review window. Not as an afterthought. Part of the build. The cost of a good handoff is small; the cost of a bad one is the entire project.
They measured what AI replaces, not what it does. "Operator hours per 100 cases" is a metric. "Tickets touched by AI" is not. The successful pilots had a baseline number from the audit, a target post-launch, and a way to compare them. The failed ones celebrated vanity counters.
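As a rough illustration of what "a baseline, a target, and a way to compare them" can look like, here is a minimal Python sketch of the "operator hours per 100 cases" calculation. The `Period` class, the function names, and every number in it are hypothetical placeholders, not figures from any engagement; the real inputs would come from the audit and from whatever system logs operator time.

```python
from dataclasses import dataclass


@dataclass
class Period:
    """Operator effort and case volume for one measurement window (hypothetical shape)."""
    operator_hours: float
    cases_handled: int


def hours_per_100_cases(period: Period) -> float:
    """Operator hours spent per 100 cases handled in the window."""
    if period.cases_handled <= 0:
        raise ValueError("need at least one handled case to compute the metric")
    return 100 * period.operator_hours / period.cases_handled


def relative_change(baseline: float, current: float) -> float:
    """Fractional change versus baseline; a negative value means fewer hours."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline


# Illustrative numbers only (assumptions, not real client data).
baseline = hours_per_100_cases(Period(operator_hours=120.0, cases_handled=800))
post_launch = hours_per_100_cases(Period(operator_hours=90.0, cases_handled=1000))
target = 10.0  # assumed post-launch target, in hours per 100 cases

met_target = post_launch <= target
change = relative_change(baseline, post_launch)

print(f"baseline:    {baseline:.1f} h / 100 cases")     # 15.0
print(f"post-launch: {post_launch:.1f} h / 100 cases")  # 9.0
print(f"change:      {change:+.0%}")                    # -40%
print(f"target met:  {met_target}")
```

Normalizing by case volume is the point: the post-launch window above handles more cases than the baseline window, so comparing raw hours alone would understate the change.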
The shape of the conversation that prevents this
We've started doing something specific in discovery that catches most of these failure patterns early. We ask three questions and listen to how the team answers them:
- "If we build this and it works, who specifically uses it on Monday morning?" A good answer names a person by title and gives you a sense of their week. A bad answer is "the team."
- "What metric on what dashboard tells us in week six whether this worked?" A good answer names a metric, a baseline, and a place we'd see it. A bad answer is "ROI."
- "If the pilot fails, when do we kill it and what happens to the budget?" A good answer has a date and a fallback plan. A bad answer is "we don't expect that."
The teams that struggle to answer those three are the teams that produce POCs that die. We've started writing this into our discovery process explicitly, because the discovery is the work that determines the outcome — more than the code, more than the model.
Killing cleanly
One more thing worth saying. About 1 in 5 audits we run conclude with a recommendation to NOT build, usually because the underlying process needs a redesign first, or because an off-the-shelf tool meets the requirements at acceptable cost, or because the team isn't ready for the change management that the agent would require.
This is the right outcome when it's the right outcome. A clean "no" in week one is dramatically cheaper than a pilot that drags out for six months before dying quietly. The teams we work with the longest are the ones we've told "no" at least once.
It's also the reason we offer the Sprint engagement at the small end. One week, one operation, one written go/no-go. If the recommendation is don't build, you get the audit and we walk away. That's the engagement that earned us most of our long-term clients — not because they didn't build, but because we told them honestly when they shouldn't.
If your POC is in the danger zone
The signs that a pilot is heading toward the failure column, in order of how visible they are:
- The operator who should use it daily doesn't have it open.
- The dashboards aren't where the operator actually looks.
- New "while we're at it" requests are coming in faster than the original scope is shipping.
- No one can answer "what's the metric, where do I see it, what's the baseline."
- The executive sponsor is checking in less often than at week two.
If three or more of those are true at week six, the pilot is in trouble. The work to save it is usually operational — redesign the handoff, name a champion, wire the metric, lock the scope — not technical. We've recovered projects from this state more often than we've killed them, but only when the team is honest about where they are.
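The week-six check above is simple enough to write down as code. This is only a sketch of the rule of thumb: the `PilotSignals` field names are our own hypothetical labels for the five signs, and the threshold of three comes straight from the paragraph above. Answering each field honestly is the hard part; the counting is trivial.

```python
from dataclasses import dataclass, fields


@dataclass
class PilotSignals:
    """One boolean per warning sign; True means the sign is present."""
    operator_not_using_daily: bool
    dashboards_not_where_operator_looks: bool
    scope_requests_outpacing_delivery: bool
    metric_question_unanswered: bool
    sponsor_checking_in_less: bool


def warning_count(signals: PilotSignals) -> int:
    """Number of warning signs currently present."""
    return sum(bool(getattr(signals, f.name)) for f in fields(signals))


def in_trouble(signals: PilotSignals, threshold: int = 3) -> bool:
    """True when at least `threshold` signs are present (three, per the rule of thumb)."""
    return warning_count(signals) >= threshold


# Hypothetical week-six snapshot.
week_six = PilotSignals(
    operator_not_using_daily=True,
    dashboards_not_where_operator_looks=True,
    scope_requests_outpacing_delivery=False,
    metric_question_unanswered=True,
    sponsor_checking_in_less=False,
)

print(warning_count(week_six), in_trouble(week_six))  # 3 True
```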
If you've got an AI POC that's not quite landing and you want a second set of eyes on it, we'll do an unpaid 30-minute call before scoping anything. Send us where it's stuck.