What's the difference between an AI chatbot and the chatbots we tried five years ago?

The pre-LLM chatbots (Drift, Intercom Bot circa 2020) were decision trees with buttons — pick one of four options, get a scripted reply, escalate if your question wasn't on the menu. Modern AI chatbots are powered by language models with retrieval over your actual content. They handle open-ended questions, follow context across turns, and produce different replies for different customers. The failure modes are different too: pre-LLM chatbots failed by being too rigid; AI chatbots fail by being confidently wrong. The design discipline is to manage that new failure mode.

How well do AI chatbots actually work for support?

For tier-1 support — password resets, billing questions, plan comparisons, common how-to questions — a well-built RAG chatbot deflects 50-65% of tickets with CSAT comparable to or better than human agents (we've seen 4.6/5 on a deployment that replaced 4.4/5 human handling). For tier-2 and tier-3 — anything requiring access to specific account state, multi-step troubleshooting, or judgment calls about exceptions — deflection drops to 20-30% and the chatbot's job becomes triage and context-capture, not resolution.

How long does it take to build a support chatbot?

A focused support chatbot trained on existing docs and ticket history ships in 3-6 weeks for one product surface. The bottleneck is not engineering — it's curating the knowledge base (most teams' docs are messier than they think), labeling escalation rules, and shadow-running long enough to catch the edge cases before live deployment. Multi-product or multi-language deployments take longer.

How much does it cost to build a custom chatbot vs using Intercom Fin or similar?

Off-the-shelf chatbots (Intercom Fin, Ada, Decagon) run $0.99-$2.50 per resolved conversation plus platform fees, with setup measured in hours. Custom builds run $10,000-$50,000 setup plus inference at vendor cost (typically $0.02-$0.10 per conversation). The off-the-shelf option wins below ~5,000 resolved conversations per month. Custom wins when you need control over the RAG pipeline, when the off-the-shelf accuracy misses on your specific docs, or when you have proprietary integrations the off-the-shelf product can't see.

Should our chatbot be visible on the marketing site or just inside the product?

Almost always inside the product, behind auth, where the agent has context about the user's account. Public-facing chatbots on marketing sites tend to get gamed (prompt injection attempts, off-topic chatter, low signal-to-noise). The exception: high-quality lead-qualification chatbots on pricing or contact pages, where the goal is to capture intent and route to sales, not to resolve support tickets.

What about hallucinations — can we trust an AI chatbot to be factually correct?

Modern RAG chatbots, configured correctly, hallucinate rarely — under 1% of responses contain fabricated facts when the answer is grounded in retrieved content and the model is told to abstain on out-of-corpus questions. The risk profile is: when the chatbot is wrong, it's wrong confidently. Mitigate by (1) retrieving aggressively before generating, (2) instructing the model to say 'I don't know' when retrieval returns nothing relevant, (3) logging every response with the retrieved sources so you can audit, and (4) shadow-testing for two weeks before going live.

What's the biggest mistake teams make with AI chatbots?

Treating the chatbot as a product they launch and forget. A chatbot at launch is mediocre; a chatbot four weeks after launch, with eval data and edge-case fixes folded in, is excellent. The teams that succeed treat the chatbot like a junior support agent in onboarding — they review its work daily for the first month, fix the patterns, and let it earn its independence. The teams that fail ship it and walk away.

Will customers prefer talking to an AI or a human?

Customers prefer fast and right over slow and human. Multiple deployments we've shipped show CSAT staying flat or rising when an AI chatbot replaces a slow human-only queue. Customers want their problem solved. If the AI solves it faster, they're happy. If the AI gets it wrong, they want a quick escalation path. The wrong question is 'AI or human'; the right question is 'how do we make the path to resolution short' — and the answer is usually both.

00—Pillar guide

AI chatbots for customer support and sales in 2026

What separates a chatbot that deflects 60% of tickets at 4.6/5 CSAT from a chatbot that gets turned off after six weeks. The architecture, the failure modes, and how to size the build so it earns its place.

May 24, 2026 · ~12 min read · 3,000 words · by Amine Hn

A B2B SaaS company we worked with had four support reps and rising volume. They'd tried a chatbot two years prior (a SaaS vendor's decision-tree product that scored 2.1/5 on CSAT) and turned it off after six weeks. By the time we got the inquiry, the team was burned out on chatbots and considering hiring two more reps instead.

Five weeks later, the new chatbot was deflecting 60% of tier-1 tickets at 4.6/5 CSAT. The four reps put the time they'd reclaimed into customer success work that compounds. The company didn't hire the two extra reps, and the project paid for itself in the first quarter.

The difference between that outcome and the prior chatbot wasn't the model; it was the design discipline. This guide is what that discipline looks like.

What modern AI chatbots actually are

The 2020-era chatbots — Drift, Intercom Bot, the early Ada — were decision trees with a chat UI on top. Pick "I forgot my password" from a menu, get a templated response. Anything off-menu was an escalation. They worked for narrow flows and felt rigid for everything else.

The 2026 AI chatbot is structurally different. The core is:

A language model (Claude, GPT-4o, or similar) handling the conversation turn-by-turn.
A retrieval layer (RAG — retrieval-augmented generation) that fetches the most relevant content from your knowledge base before the model writes a reply.
A memory layer that holds conversation state across turns and can read account-specific state from your systems (CRM, billing, product DB).
An escalation layer that routes to a human when confidence is low, when the customer asks for one, or when the conversation pattern matches escalation rules you've defined.
An observability layer that logs every conversation, the retrieved sources, the model's response, and any escalation reason — so you can audit, eval, and improve.

The model is the smallest part of the puzzle. The retrieval and the observability are where the engineering work lives.
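To make the observability layer concrete, here is a minimal sketch of what one logged turn could contain. The class name, field names, and JSON-lines output are assumptions for illustration, not a prescribed schema; the point is only that the retrieved sources, the response, and any escalation reason travel together in one record.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass
class TurnLog:
    """One logged chatbot turn. All field names are illustrative."""
    conversation_id: str
    customer_message: str
    retrieved_source_ids: list[str]
    model_response: str
    escalation_reason: Optional[str] = None  # None means the bot answered itself
    logged_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json_line(self) -> str:
        # One JSON object per line is easy to append to a file and search later.
        return json.dumps(asdict(self))


entry = TurnLog(
    conversation_id="conv-001",
    customer_message="How do I reset my password?",
    retrieved_source_ids=["kb/password-reset"],
    model_response="Go to Settings, then Security, then Reset password.",
)
print(entry.to_json_line())
```

Keeping the source IDs next to the response is what makes a later audit possible: a reviewer can check the answer against exactly the content the model was shown.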

Where chatbots actually pay back

Three deployment shapes have crossed into "production-ready" by 2026:

Tier-1 support inside the product. Behind authentication, with access to the user's account state. The chatbot handles password resets, billing questions, plan changes, common how-to questions — the work that currently consumes 50-70% of a support team's tickets. This is the highest-ROI deployment. Deflection rates of 50-65% are achievable; the team gets back time for the deeper customer success work that grows accounts.

Internal knowledge Q&A. A chatbot trained on your internal docs (Notion, Confluence, Google Drive) that answers employee questions about policies, procedures, who-owns-what. Lower visibility than customer-facing but high ROI because it deflects the steady stream of one-off questions that consume managers' time. Often best served by an off-the-shelf product (Glean, Mendable, Notion AI Q&A) before custom.

Sales lead qualification. A chatbot on the pricing or contact page that gathers signal — what the prospect's trying to solve, their company size, urgency — and routes high-intent leads to sales while filtering out the low-intent ones. Best when the volume of inbound leads exceeds what sales can manually triage.

What doesn't work yet:

Open-ended consultative sales. The chatbot can qualify and capture intent. It can't close. The deals where the chatbot tries to replace the salesperson have measurably lower conversion.
Multi-system troubleshooting that requires judgment. "My report isn't rendering correctly" can have 40 root causes. The chatbot can collect the basics and route, but it can't yet diagnose effectively.
Anything requiring emotional escalation handling. Refund disputes, complaint resolution, anything where the customer is upset and a human's empathy is part of the resolution. The bot can recognize the signal and escalate fast; that's the right play.

The architecture, briefly

For tier-1 support, a typical deployment looks like:

Custom chat widget (React) embedded in your product, talking to a backend over websockets.
Backend (Node, Python, or whatever fits your stack) that handles conversation state, calls retrieval, calls the LLM, manages escalation.
Vector store (Pinecone, Turbopuffer, pgvector, Qdrant) holding embeddings of your docs and ticket history. We're agnostic about which: they all work, so pick by ecosystem fit.
Embedding model (Voyage, OpenAI text-embedding-3-large, Cohere) to turn documents and queries into vectors. Better embeddings = better retrieval = better answers.
Generation model (Claude Sonnet for nuanced reasoning, Claude Haiku or GPT-4o-mini for cost-sensitive turns). Claude's long context window is useful here — you can stuff a lot of retrieved content into the prompt.
Live agent UI integration (Intercom, Zendesk, Front, custom) — when the chatbot escalates, the conversation lands cleanly in the human's queue with full context.

We default to Claude for the generation layer in 2026 because it adheres to system prompts well (matters a lot for support, where you don't want the model freelancing) and has excellent long-context performance for RAG. For embeddings, we use whichever embedding model wins the eval on the client's actual docs — there isn't a single best.
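The backend's per-turn flow (check escalation rules, retrieve, generate, hand off when nothing relevant comes back) can be sketched as below. This is a simplified illustration under stated assumptions: `handle_turn`, `Passage`, `TurnResult`, the 0.5 relevance threshold, and the three `fake_*` stubs are all hypothetical stand-ins for a real vector store, a real LLM call, and your own escalation rules.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Passage:
    source_id: str
    text: str
    score: float  # retrieval similarity, higher is more relevant


@dataclass
class TurnResult:
    reply: str
    sources: list[str]
    escalated: bool
    escalation_reason: Optional[str] = None


def handle_turn(
    message: str,
    history: list[str],
    retrieve: Callable[[str], list[Passage]],
    generate: Callable[[str, list[str], list[Passage]], str],
    wants_human: Callable[[str], bool],
    min_score: float = 0.5,  # assumed threshold; tune it on your own evals
) -> TurnResult:
    # 1. Escalation rules run first so the customer never waits on the model.
    if wants_human(message):
        return TurnResult("Connecting you with a teammate now.", [], True, "customer_request")

    # 2. Retrieval: keep only passages that clear the relevance threshold.
    passages = [p for p in retrieve(message) if p.score >= min_score]
    if not passages:
        return TurnResult("I'm not sure, let me get someone who can help.", [], True, "no_relevant_docs")

    # 3. Generation, grounded in the retrieved passages.
    reply = generate(message, history, passages)
    history.extend([message, reply])  # conversation state carried across turns
    return TurnResult(reply, [p.source_id for p in passages], False)


# Stubs standing in for a real vector store, LLM call, and escalation rules.
def fake_retrieve(query: str) -> list[Passage]:
    if "password" in query.lower():
        return [Passage("kb/password-reset", "Settings > Security > Reset password", 0.9)]
    return []


def fake_generate(message: str, history: list[str], passages: list[Passage]) -> str:
    return f"According to our docs: {passages[0].text}"


def fake_wants_human(message: str) -> bool:
    return "human" in message.lower()


result = handle_turn("How do I reset my password?", [], fake_retrieve, fake_generate, fake_wants_human)
print(result.reply, result.sources)
```

Passing the retrieval, generation, and escalation functions in as arguments keeps the flow testable without network calls, which matters once you start running evals against it.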

The work isn't the model; it's the knowledge base

This is the part most teams underestimate. Your docs are messier than you think. Stale articles, contradictory guidance, the same answer written three different ways by three different people, FAQs that haven't been updated since 2023. The chatbot is going to reflect that mess back at customers unless you clean it up first.

What "cleaning up" looks like:

Audit the existing docs against the top 100 tickets. For each common question, is the answer findable in the docs? Is it consistent? Is it correct as of today? Most teams discover their docs cover maybe 60% of the top-100 questions adequately.
Write what's missing. The gap analysis from step one is your content roadmap. Resolve the contradictions, write the missing articles, mark the obsolete content for removal.
Add ticket history selectively. Resolved support tickets are gold for RAG. Add them, but filter — you don't want the chatbot retrieving from a customer's vented frustration as if it were canonical.
Tag for retrieval. Modern vector stores support metadata filtering. Tag content by product area, by user role, by plan tier. The retrieval layer can then narrow the candidate set per user context — a free-plan user's question searches free-plan-relevant docs.
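The tagging step can be illustrated with a toy in-memory search that filters on metadata before ranking by similarity. Real vector stores expose their own filter syntax, so treat this as a sketch of the idea only; the `Doc` fields, the `search` function, and the two-dimensional embeddings are invented for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    embedding: list[float]
    plan_tiers: set[str]  # hypothetical tag: plans this article applies to
    product_area: str     # hypothetical tag


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec: list[float], docs: list[Doc], user_plan: str, top_k: int = 3) -> list[str]:
    # Filter on metadata first, then rank the survivors by similarity.
    candidates = [d for d in docs if user_plan in d.plan_tiers]
    ranked = sorted(candidates, key=lambda d: cosine(query_vec, d.embedding), reverse=True)
    return [d.doc_id for d in ranked[:top_k]]


docs = [
    Doc("free-export-limits", [0.9, 0.1], {"free"}, "exports"),
    Doc("pro-bulk-export", [0.95, 0.05], {"pro"}, "exports"),
    Doc("billing-faq", [0.1, 0.9], {"free", "pro"}, "billing"),
]
query = [1.0, 0.0]  # stands in for the embedding of an export question
print(search(query, docs, user_plan="free"))
```

In this toy data the pro-only article is the closest match to the query overall, but a free-plan user never sees it because the filter removes it before ranking.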

We've seen chatbot deflection rates jump 15-20 percentage points just from cleaning the knowledge base before deployment. The model didn't change; the content it could retrieve did.

Handling the failure modes

The two failures to design around:

Confident wrong answers. The model says something plausible but incorrect. Mitigations:

Force the model to cite its sources in every response (visible to the customer or just logged for audit).
Instruct the model explicitly to say "I'm not sure, let me get someone who can help" when retrieval returns nothing relevant.
Shadow-test for two weeks pre-launch — every model response reviewed against ground truth. The patterns that produce confident-wrong answers in your domain are findable in shadow.
Log everything and review weekly post-launch. Track the responses that customers thumbs-down or follow up on.
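The first two mitigations (cite sources, abstain when retrieval finds nothing relevant) can be enforced in code after generation rather than left entirely to the prompt. A minimal sketch, assuming retrieval returns a similarity score between 0 and 1; the `Retrieved` type, the 0.6 threshold, and the example URL are hypothetical.

```python
from dataclasses import dataclass

ABSTAIN_MESSAGE = "I'm not sure, let me get someone who can help."


@dataclass
class Retrieved:
    title: str
    url: str      # the URL used below is a made-up placeholder
    score: float  # retrieval similarity in [0, 1]


def finalize_reply(draft: str, retrieved: list[Retrieved], min_score: float = 0.6) -> str:
    """Abstain when nothing relevant was retrieved; otherwise append the sources."""
    relevant = [r for r in retrieved if r.score >= min_score]
    if not relevant:
        return ABSTAIN_MESSAGE
    footnote = "\n".join(f"- {r.title} ({r.url})" for r in relevant)
    return f"{draft}\n\nBased on:\n{footnote}"


hits = [Retrieved("Resetting your password", "https://docs.example.com/reset", 0.82)]
print(finalize_reply("Open Settings, then Security, then choose Reset password.", hits))
```

The threshold is the tunable part: set it too low and the bot answers from weak matches, too high and it hands off questions it could have resolved. Shadow-test data is a reasonable place to calibrate it.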

Customer in distress not getting escalated fast enough. The chatbot tries to solve when it should be handing off. Mitigations:

Sentiment detection on every customer message. If sentiment crosses a threshold, escalate.
Pattern matching on phrases like "speak to a human," "this is the third time," "I want a refund" — escalate immediately.
A visible, always-available "talk to a human" button in the chat UI. Don't make customers fight the bot.
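The first two escalation checks fit in a few lines. The sketch below uses the phrases listed above and assumes a sentiment score from a separate classifier, ranging from -1.0 (very negative) to 1.0 (very positive); the function name, the reason strings, and the -0.5 threshold are illustrative.

```python
import re
from typing import Optional

# Phrases from the list above; extend with patterns from your own tickets.
ESCALATION_PHRASES = ["speak to a human", "this is the third time", "i want a refund"]
_PATTERNS = {
    phrase: re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    for phrase in ESCALATION_PHRASES
}


def escalation_reason(message: str, sentiment: float, threshold: float = -0.5) -> Optional[str]:
    """Return a reason string if this turn should go to a human, else None."""
    # Explicit phrases win over sentiment: escalate immediately on a match.
    for phrase, pattern in _PATTERNS.items():
        if pattern.search(message):
            return f"phrase:{phrase}"
    if sentiment <= threshold:
        return "negative_sentiment"
    return None


print(escalation_reason("I want a refund NOW", sentiment=0.0))
print(escalation_reason("Thanks, that worked", sentiment=0.8))
```

Returning the reason instead of a bare boolean means the human agent and the audit log both see why the handoff happened.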

Off-the-shelf vs custom

A practical heuristic. Run off-the-shelf first (Intercom Fin, Decagon, Ada) if:

You're under ~5,000 resolved conversations per month.
Your docs are reasonably clean (or you're willing to use the platform's authoring tools to clean them up).
You don't have unusual integration requirements.
You want to be live in two weeks, not six.

Build custom if:

You have proprietary integrations the off-the-shelf product can't see.
Your accuracy threshold is higher than the off-the-shelf benchmark can hit on your data.
You want to own the conversation data and the retrieval logic.
Your volume is high enough that the per-conversation pricing of off-the-shelf becomes the more expensive option.

Most teams over-build. The discipline is to measure the off-the-shelf option on your actual tickets before committing to a custom build.
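The volume threshold can be estimated with simple arithmetic on the price ranges quoted in this guide ($0.99-$2.50 per resolved conversation off the shelf; $10,000-$50,000 setup plus $0.02-$0.10 inference per conversation for custom). The sketch below is a rough model: the amortization period is an assumption you choose, and ongoing maintenance of a custom build is not included.

```python
def monthly_cost_off_the_shelf(conversations: int, price_per_resolution: float,
                               platform_fee: float = 0.0) -> float:
    return conversations * price_per_resolution + platform_fee


def monthly_cost_custom(conversations: int, setup_cost: float, amortize_months: int,
                        inference_per_conversation: float) -> float:
    # Setup is spread evenly over the amortization period (an assumption).
    return setup_cost / amortize_months + conversations * inference_per_conversation


def break_even_conversations(setup_cost: float, amortize_months: int,
                             price_per_resolution: float,
                             inference_per_conversation: float,
                             platform_fee: float = 0.0) -> float:
    """Monthly volume above which the custom build is the cheaper option."""
    margin = price_per_resolution - inference_per_conversation
    if margin <= 0:
        raise ValueError("custom inference must cost less per conversation than the off-the-shelf price")
    return max(0.0, (setup_cost / amortize_months - platform_fee) / margin)


volume = break_even_conversations(
    setup_cost=50_000, amortize_months=12,
    price_per_resolution=0.99, inference_per_conversation=0.05,
)
print(round(volume))
```

With a $50,000 setup amortized over 12 months, $0.99 per resolution, and $0.05 inference, the break-even lands near 4,400 conversations a month, in the same range as the ~5,000 heuristic. A cheaper build or a higher per-resolution price pulls the break-even down considerably, which is why it is worth running the numbers with your own quotes.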

What we'd do differently than most vendors

Three opinions we'd put in writing:

We default to "behind auth" deployment. Public marketing-site chatbots are mostly noise. They get prompt-injected, they get gamed, they get low-quality engagement. The leverage is inside the product, where the agent has account context. We push back on requests to put the chatbot on the marketing site unless there's a specific lead-qualification reason.

We default to citing sources. Every chatbot response we ship has a "based on" footnote linking to the docs the answer came from. Customers trust it more (verifiable), the team can audit it (sourced), and the model is forced to ground its answer in retrieved content.

We default to a 30-day shadow review. The chatbot is in production but the team is reviewing a sample of conversations daily. Anything wrong feeds back into evals. The chatbot at day 30 is dramatically better than the chatbot at day 1.
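One way to pull the daily review sample is deterministic hashing, so the same conversation always lands in or out of the queue and the selection is reproducible. This is a sketch of one possible approach, not a description of a specific tool; the function name, the 10% rate, and the ID format are assumptions.

```python
import hashlib


def in_review_sample(conversation_id: str, day: str, sample_rate: float = 0.1) -> bool:
    """Deterministically pick about `sample_rate` of conversations for review.

    Hashing the ID together with the date gives the same answer every time
    for a given conversation and day, with no sampling state to store.
    """
    digest = hashlib.sha256(f"{day}:{conversation_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


ids = [f"conv-{i}" for i in range(1000)]
todays_queue = [c for c in ids if in_review_sample(c, "2026-05-24")]
print(len(todays_queue))  # roughly 100; the exact count depends on the hashes
```

A fixed rate is the simplest choice. Teams often also add every escalated or thumbs-down conversation to the queue on top of the random sample.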

What's coming next

A few shifts worth tracking:

Agentic chatbots. Right now most chatbots are reactive — customer asks, bot answers. Agentic chatbots can take action: actually reset the password, actually update the plan, actually file the ticket. The model layer can do it; the safety/audit layer is catching up. By late 2026 expect this to be standard for low-risk actions.

Voice + chat unified. The same agent that handles the chat conversation should be able to hand off to voice if the customer prefers. The architecture is mostly ready; the orchestration platforms are converging.

Cross-language deployments without rebuild. Modern models handle 30+ languages competently with the same RAG pipeline. We've shipped support chatbots in English and Spanish off the same backend with about a week of additional tuning. Five years ago this was a major project; today it's a tuning task.

If you've got a chatbot project in mind

We can usually size a chatbot build from a 30-minute conversation about your ticket volume, your knowledge base state, and the integration shape. Send us your current support stack and a sample of your top-20 tickets and we'll come back with a quote and a written scope inside a week.

06—The discovery offer

Send us your most expensive operation.
We'll have an audit on your desk in five days.

One PDF. No deck. No obligation. We'll tell you whether AI is the right answer for it — and if it is, we'll quote the build the same week.
