Why Do Our AI Agent Pilots Never Make It to Production?

AI agent pilots stall on the way to production because a demo only has to work once, under conditions you control, while a production system has to work every time, on inputs you have never seen, wired into the tools your business actually runs on. The pilot proved the idea was possible. It said almost nothing about whether the idea was operable — and that gap, not the model, is where most projects quietly die.

You watched it work. The agent read the email, pulled the right record, drafted the reply, booked the slot — and the room went quiet in a good way. Then six weeks later it is still sitting in a tab nobody opens, and someone asks why the AI agent pilot never made it to production. You are not failing at something everyone else has solved. You are running into the most common outcome in the field.

MIT's NANDA initiative, in its 2025 report The GenAI Divide: State of AI in Business, found that roughly 95% of generative AI pilots deliver little to no measurable impact on the bottom line — only about one in twenty reaches rapid, real returns. Gartner's June 2025 forecast points the same way for agents specifically: over 40% of agentic AI projects will be cancelled by the end of 2027, undone by escalating costs, unclear business value and inadequate risk controls. The headline number changes depending on who counts; the pattern does not. Pilots are cheap to start and expensive to finish, and most never finish.

A demo and a production system are not the same thing

The pilot-to-production gap catches good teams off guard because a demo and a production system look almost identical. Same model, same prompt, same task. The difference is in what each one is allowed to assume.

A demo runs once, on an input you chose, with you watching. If it stumbles you re-run it. If the tool times out you shrug and try again. The success rate that matters is "did it work in the meeting" — and a single good run clears that bar.

A production system runs unattended, on inputs nobody curated, while you are asleep. It meets the malformed PDF, the customer who answers a different question than the one asked, the CRM field that is suddenly empty, the API that changed its response shape overnight. Fiddler's analysis of why AI agents fail in production puts a number on the fall-off: an estimated 88% of enterprise agents that work in controlled demos fail when deployed to real workflows, with single-run success around 60% collapsing toward 25% once you measure the same agent over eight consecutive runs. The agent did not get worse. The conditions got honest.
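The gap between "worked once" and "works every time" can be measured from your own run logs. The sketch below is illustrative only: RUN_LOG, its task names and its pass/fail values are invented, and the two functions are hypothetical helpers rather than part of any cited study's method. It compares the share of individual runs that succeeded with the share of tasks that succeeded on all eight of their runs.

```python
from typing import Dict, List


def single_run_rate(runs: Dict[str, List[bool]]) -> float:
    """Fraction of all individual runs that succeeded."""
    outcomes = [ok for task_runs in runs.values() for ok in task_runs]
    return sum(outcomes) / len(outcomes)


def all_runs_rate(runs: Dict[str, List[bool]]) -> float:
    """Fraction of tasks that succeeded on every one of their runs."""
    return sum(all(task_runs) for task_runs in runs.values()) / len(runs)


# Hypothetical log: four tasks, each attempted eight times (invented values).
RUN_LOG = {
    "refund-request": [True] * 8,
    "reschedule": [True, True, False, True, True, True, True, True],
    "invoice-query": [True, False, True, True, False, True, True, False],
    "address-change": [False, True, True, False, True, False, True, True],
}

if __name__ == "__main__":
    print(f"single-run success: {single_run_rate(RUN_LOG):.2f}")  # 0.78
    print(f"all-8-runs success: {all_runs_rate(RUN_LOG):.2f}")    # 0.25
```

In this made-up log, 25 of 32 individual runs pass, yet only one task in four passes every time. The second number is the one an unattended system has to live with.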

So the question is rarely "can the model do this task?" The pilot already answered that. The real question — the one the demo cannot answer — is "can this thing do the task ten thousand times, on inputs we did not pick, without a human in the loop, and fail safely when it can't?" That is a different system, and it usually has to be built rather than discovered.

The four gaps that kill the handover

When we trace a stuck AI pilot back to where it actually broke, the cause clusters into four predictable places. Naming them is the first step to closing them.

Integration was treated as the last 10%, and it is the first 50%. The clever part — the reasoning — is the part the demo showed off. The unglamorous part is the wiring: authentication that expires, rate limits, your specific data shapes, the half-dozen systems the agent has to read from and write back into. The DigitalOcean 2026 research behind the wider "10% of pilots scale" finding puts integration at 40–60% of total deployment effort. If the pilot mocked the integrations to move fast, it skipped the majority of the actual work and called the easy half "done".

Nobody owned it. In the same body of research, 43% of organisations cite organisational ownership as their primary blocker to scaling — not the technology. A pilot is somebody's exciting side project. Production needs an owner who answers for it at 2am, a place it lives, a person whose job changes because of it. Without that, the pilot has nowhere to graduate to.

Reliability and error handling were never built, because demos do not fail on purpose. A production agent spends most of its engineering not on the happy path but on what happens when a step goes wrong: retries, fallbacks, a clear handoff to a human, a refusal that is safer than a confident guess. Fiddler lists hallucination, context-window overruns that silently drop instructions, and runaway loops that burn cost as recurring production failure modes. None of those show up in a five-minute demo. All of them show up by week two of real traffic.
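As a rough illustration of what "retries, fallbacks, a clear handoff to a human" can look like in code, here is a minimal sketch. Everything in it is an assumption for illustration: run_step, StepResult and the parameter names are hypothetical, and a real system would catch narrower exception types and log each failure.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StepResult:
    status: str                   # "ok" or "handoff"
    value: Optional[str] = None   # the step's output when status is "ok"
    reason: Optional[str] = None  # why a human is needed when status is "handoff"


def run_step(
    action: Callable[[], str],
    fallback: Optional[Callable[[], str]] = None,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> StepResult:
    """Try the action with retries, then a fallback, then hand off to a human."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return StepResult("ok", value=action())
        except Exception as exc:  # broad on purpose in this sketch only
            last_error = exc
            if attempt < max_attempts - 1:
                # Exponential backoff between attempts: 0.5s, 1s, 2s, ...
                sleep(base_delay * 2 ** attempt)
    if fallback is not None:
        try:
            return StepResult("ok", value=fallback())
        except Exception as exc:
            last_error = exc
    # Refuse rather than guess: report why, and let a person take over.
    return StepResult("handoff", reason=f"{type(last_error).__name__}: {last_error}")
```

The design choice worth noticing is the last line: when retries and the fallback are exhausted, the step returns an explicit "handoff" result with a reason instead of raising or inventing an answer, so the caller can route the case to a person.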

Security and compliance were deferred — and then could not be retrofitted. The pilot ran on a copy of the data with the guardrails switched off, because that was faster. Production touches real customer records, real money, real regulatory exposure. Gartner is blunt that inadequate risk controls are a leading cancellation cause. Bolting governance onto a system that was architected without it is often more expensive than rebuilding, which is exactly when projects get quietly shelved instead.

Why "we'll harden it later" rarely works

The instinct after a good demo is sensible on its face: prove the concept cheaply, then industrialise it. The trouble is that proof-of-concept code and production code are not the same code with more polish — they are built on different assumptions, and the assumptions are load-bearing.

A pilot optimised purely for "make it work in the demo" hard-codes the happy path, skips the error states, ignores the integrations, and assumes a human is watching. Hardening it means unpicking every one of those shortcuts. The DigitalOcean research found that teams who build with production constraints from day one achieve roughly 3x higher scaling success, at the cost of only 20–30% more effort up front — while eliminating 50–70% of the refactoring later. The "harden it later" route is not cheaper. It front-loads a quick win and back-loads a rewrite, and the rewrite is where momentum, budget and patience run out.

There is also a cost cliff hiding in the handover. The same analysis puts a typical pilot at tens of thousands and the same capability in production at several hundred thousand a year, with infrastructure alone running a 5–10x multiplier once you account for real volume, monitoring and redundancy. When that number lands after the pilot has already been declared a success, it reads as a betrayal rather than a forecast — and "let's pause this" becomes the easiest sentence in the room.

The honest test before you spend another penny

Before you fund the next phase of any effort to take an AI proof of concept to production, it is worth running it past a few questions the demo will never volunteer. They are uncomfortable on purpose.

  • What does it do when it is wrong? If the answer is "a human catches it", you have a demo, not a system. Production needs the agent to know it is uncertain and hand off cleanly.
  • What is the cost per run at full volume, not in the demo? Multiply the pilot's token and infrastructure cost by real traffic, then add monitoring and redundancy. If nobody has done that sum, the business case is unproven.
  • Who owns it on the org chart, and whose week changes? If no name appears, the ownership gap that stalls most scaling efforts has already opened beneath you.
  • Are the integrations real or mocked? Every mocked connection is unbuilt work dressed as finished work.
  • What happens to the data and the audit trail? If compliance was switched off for the pilot, you have not de-risked the project — you have hidden the risk.
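The cost question in the list above is simple arithmetic, and it is worth writing down. The sketch below assumes a hypothetical RunCost structure and made-up figures (token counts, per-million-token prices, per-run infrastructure cost, and fixed monthly monitoring and redundancy costs, all in one unspecified currency); substitute your own measurements and your provider's actual prices.

```python
from dataclasses import dataclass


@dataclass
class RunCost:
    input_tokens: int             # average tokens sent to the model per run
    output_tokens: int            # average tokens generated per run
    price_in_per_million: float   # price per million input tokens
    price_out_per_million: float  # price per million output tokens
    infra_per_run: float          # compute, storage and API calls per run

    def per_run(self) -> float:
        """Token cost plus infrastructure cost for one run."""
        token_cost = (
            self.input_tokens * self.price_in_per_million
            + self.output_tokens * self.price_out_per_million
        ) / 1_000_000
        return token_cost + self.infra_per_run


def monthly_cost(
    cost: RunCost,
    runs_per_month: int,
    monitoring: float = 0.0,
    redundancy: float = 0.0,
) -> float:
    """Per-run cost at real traffic, plus fixed monitoring and redundancy."""
    return cost.per_run() * runs_per_month + monitoring + redundancy


# Invented example figures, not benchmarks.
pilot = RunCost(
    input_tokens=6_000,
    output_tokens=800,
    price_in_per_million=3.0,
    price_out_per_million=15.0,
    infra_per_run=0.002,
)

if __name__ == "__main__":
    print(f"per run: {pilot.per_run():.3f}")  # 0.032
    total = monthly_cost(pilot, 50_000, monitoring=400, redundancy=600)
    print(f"per month at 50,000 runs: {total:.0f}")  # 2600
```

With these invented inputs a run costs about three hundredths of a unit, which looks negligible in a demo and becomes 2,600 a month at 50,000 runs once monitoring and redundancy are added.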

If most of those answers are solid, you do not have a stuck pilot. You have a system that is genuinely close, and the next spend is justified. If most of them are hand-waves, the kindest thing is to stop calling it nearly-done and decide — deliberately — whether the use case is worth building properly or worth dropping. Both are honest outcomes. A pilot drifting in a tab is not.

What it takes to actually cross the gap

Crossing from pilot to production is less about a better model and more about treating the agent as a piece of operational software with a job to do. In practice that means designing for the inputs you did not choose, not the one you demoed. It means building the error paths before the happy path is celebrated, instrumenting the thing so you can see what it is doing in the wild, and wiring the real integrations early — because that is where the surprises live and the budget goes.

It also means being precise about scope. The reason vendor-built systems outperformed internal builds roughly three-to-one in the MIT data is partly that a narrow, well-specified agent that does one workflow reliably beats a broad, impressive agent that does ten things in a demo and none of them dependably. The route to production usually runs through doing less, properly, and earning the right to expand.

None of this is a reason to be cynical about AI agents — the same forecasts that predict mass cancellation also predict that the survivors will run a meaningful share of routine decisions within a few years. The dividing line is not luck or model choice. It is whether the thing was built to be operated or built to be demonstrated. If your last pilot was built to be demonstrated, that is not a failure of nerve or talent. It is simply the wrong target — and now you know what the right one looks like.

Straight answers

Questions we hear about stuck pilots

Is it true that 95% of AI pilots fail?

MIT's 2025 NANDA report, The GenAI Divide, found that around 95% of generative AI pilots deliver little to no measurable impact on the bottom line, with only about 5% reaching rapid returns. Gartner separately forecasts that over 40% of agentic AI projects specifically will be cancelled by the end of 2027. The exact figure depends on what you count, but the direction is consistent: most pilots do not reach durable production.

Why does our agent work perfectly in the demo but fail in production?

A demo runs once on an input you chose while you watch; production runs unattended on inputs nobody curated. Fiddler estimates that roughly 88% of agents working in controlled demos fail in real workflows, as single-run success rates fall sharply over repeated runs. The model did not get worse — the conditions got honest, exposing the error handling, integrations and edge cases the demo never tested.

What is the AI agent pilot-to-production gap, exactly?

It is the difference between proving a task is possible once and proving it is operable forever. Closing it means building real integrations, error handling, monitoring, ownership and compliance — work that demos skip. Research suggests integration alone consumes 40–60% of deployment effort, which is why pilots that mocked their connections were only ever half-built.

Can we just harden the pilot we already have?

Sometimes, but it is often more expensive than it looks. Pilot code is built on assumptions — happy path only, human watching, guardrails off — that are load-bearing. Teams who design for production constraints from the start see around 3x higher scaling success for 20–30% more upfront effort, while avoiding most of the later rewrite. We will tell you honestly whether hardening or rebuilding is the cheaper route for your case.

How do we know if our use case is even worth taking to production?

Ask what the agent does when it is wrong, what it costs per run at full volume, who owns it on the org chart, whether the integrations are real or mocked, and what happens to the data and audit trail. If those answers are solid, the next spend is justified. If they are hand-waves, the honest move is to either build it properly or drop it deliberately rather than leave it drifting.

Does picking a better model fix the problem?

Rarely. The pilot already proved the model can do the task; the failures happen in the operational layer around it — integrations, reliability, ownership, governance and cost. In MIT's data, narrowly scoped vendor-built systems outperformed broad internal builds roughly three to one, which points to scope and engineering discipline mattering far more than swapping in a newer model.

Stop paying for a pilot that never ships

If you have an agent that dazzled in a demo and then stalled, the question is no longer "does it work?" but "what would it take to run it every day?" Bring us the pilot. We will tell you, plainly, whether it is close enough to harden or whether the honest answer is to rebuild, and if the use case does not justify either, we will say that too.