Klarna walked it back. Salesforce doubled down. Both are running the same flawed playbook. Here is what the fourth generation gets right.

Klarna replaced 700 customer service agents with AI in 2024. By early 2026, the CEO admitted the company had gone too far and started rehiring.

Salesforce cut its support team from 9,000 to 5,000 in 2025. Marc Benioff said Agentforce now handles 50% of customer conversations, and the quality is fine. The receipts are not in yet.

More recently, even Duolingo’s CEO walked back his knee-jerk enthusiasm for AI, conceding that it cannot match the quality employees deliver. Or perhaps it was the wave of cancelled subscriptions, after users complained loudly online about falling quality and AI slop.

Three of the most visible AI rollouts in CX all made the same underlying mistake.

Every generation of CX AI has failed the same way: by treating humans as the thing to eliminate.

Generation one: rule-based bots (NLU)

The first wave of CX automation was deterministic. Decision trees. If-then logic. Strict scripts. Reliable in the narrow sense that they did exactly what you told them to, and useless the moment a customer phrased a question in a way the script did not anticipate.

Humans were the escape hatch. The bot contained what it could and dumped the rest on an agent. Compliance was clean because there was no real intelligence to govern. Customer satisfaction was the cost.
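A minimal sketch of that pattern, for illustration only. The phrases, canned answers, and function name are hypothetical and not taken from any real product. It shows why such a bot is predictable on scripted wording and hands everything else to a person.

```python
# Hypothetical first-generation bot: exact phrase rules, no language understanding.
RULES = {
    "reset password": "Go to Settings > Security and choose 'Reset password'.",
    "track order": "Enter your order number on the tracking page.",
}

ESCALATE = "Transferring you to an agent."


def rule_based_reply(message: str) -> str:
    """Return the scripted answer if a known phrase appears, else escalate."""
    text = message.lower()
    for phrase, answer in RULES.items():
        if phrase in text:
            return answer
    # Nothing matched: the human is the escape hatch.
    return ESCALATE


# Scripted wording gets the scripted answer.
print(rule_based_reply("How do I reset password?"))
# Same intent, different wording: the bot gives up and escalates.
print(rule_based_reply("I can't get into my account"))
```

The second message expresses the same need as the first, but because it does not contain a scripted phrase, the customer is transferred.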

Generation two: LLMs bolted on top of NLU flow builders

Platforms like Cognigy and Kore.ai, built on natural language understanding, added LLMs on top of the same rigid flow architecture. LLMs are much better at understanding intent and parsing complex conversations, but they still cannot be trusted to always follow business rules. That left users with two options: “Trust me, bro” agentic AI held in check by prompt-based guardrails, or strict NLU-based process adherence. The two can be combined and still work well, but it remains an either/or solution. Ultimately it is a bridge architecture, one that implicitly admits what needs to come next.

The pattern here is the one Klarna lived. You ship the AI. It handles routine queries beautifully. Then complexity creeps in, edge cases pile up, and the guardrail stack grows without making a meaningful dent in reliability. Klarna CEO Sebastian Siemiatkowski put it plainly: AI chatbots were cheaper than human staff but resulted in lower-quality service.

Humans were still the fallback. By the time a conversation got escalated, the customer had often already had a bad experience. The human agent was a recovery function, a last-mile fix, not a system input.

Generation three: LLM-only agents

The current wave is made up of those who jumped on a bandwagon, only to find out too late that it is heading in the wrong direction. That includes Sierra, Decagon, ElevenLabs, and a long list of others. No legacy flow builder, just an LLM with tools and prompts. Fluent. Fast. Capable of handling conversations.

Salesforce sits inside this generation with Agentforce. Benioff says AI manages roughly 50% of customer conversations and claims service quality is unchanged. The internal numbers may or may not bear that out, but the structural problem does not go away.

Next-token prediction cannot enforce a business rule 100% of the time. Trusting an LLM to enforce your business rules is like trusting a politician. They deliver sometimes, just enough to stay in office, but the only thing you can trust is that you cannot trust them.

LLMs can be coached, prompted, and constrained, but at enterprise scale, rare hallucinations become daily occurrences. A 99% reliability rate sounds impressive until you do 10 million conversations a year. That’s 100,000 broken interactions. A percentage of those are regulated, and it takes only one to end up in the headlines. Humans in this generation are still the fallback. The escalation path. The thing you trim when the AI looks like it is working.
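The arithmetic behind the 99% figure can be checked in a few lines. The function name is hypothetical, and the per-day number is a derived figure that assumes conversations are spread evenly across 365 days, which the text does not state.

```python
def expected_failures(conversations_per_year: int, reliability: float) -> dict:
    """Expected broken interactions for a given volume and success rate."""
    failures_per_year = conversations_per_year * (1 - reliability)
    return {
        "per_year": round(failures_per_year),
        # Assumption: volume is spread evenly over 365 days.
        "per_day": round(failures_per_year / 365),
    }


print(expected_failures(10_000_000, 0.99))
# {'per_year': 100000, 'per_day': 274}
```

At that volume, a failure mode that occurs once in a hundred conversations shows up a few hundred times every day.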

Klarna and Salesforce are not opposite stories. They are the same story at different points on the curve.

Why all three failed the same way

Look at the architecture, not the marketing. In every generation, humans sit outside the system. They get called when AI breaks. Their judgment leaves the building when their shift ends. Nothing they do feeds back into the platform.

That is the design flaw. Not the model choice, not the prompt strategy, not the guardrail layer. The flaw is treating humans as a cost line instead of a structural input.

The fourth generation: the Human-AI Flywheel & ContextGraphOS

There is a different way to build this. Not as a category upgrade, but as an architectural one.

Three things have to be true:

Business logic is grounded, not prompted. Your policies, procedures, and compliance rules live as deterministic structure in ContextGraphOS, not as instructions an LLM tries to follow. The AI handles natural language and dynamic conversation flows. The structure enforces the rules. Neither can override the other.
Humans operate inside the platform, not next to it. Agents see what the AI sees. They handle the sensitive interactions, validate edge cases, and approve decisions in flight. Same system, same data, same audit trail.
Every human action becomes training data. When a person resolves a conversation the AI could not, that resolution feeds back as structured learning. Quality compounds. Automation rate climbs because the system gets smarter, not because someone trimmed the team.
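The three conditions above can be sketched in code. This is an illustration under stated assumptions, not the ContextGraphOS API, which this article does not describe. Every name, limit, and data shape below is hypothetical. The point is only the division of labor: the language layer proposes an action, a deterministic check decides, and a human resolution is stored as a structured record.

```python
from dataclasses import dataclass, field

# Assumed policy values, for illustration only.
REFUND_AUTO_LIMIT = 50.0
REFUND_HARD_LIMIT = 500.0


@dataclass
class ProposedAction:
    """What the language layer wants to do, e.g. parsed from a model reply."""
    kind: str
    amount: float


@dataclass
class Decision:
    allowed: bool
    needs_human: bool
    reason: str


def check_policy(action: ProposedAction) -> Decision:
    """Deterministic rule check; the model's wording cannot change the result."""
    if action.kind != "refund":
        return Decision(False, True, "unknown action type")
    if action.amount > REFUND_HARD_LIMIT:
        return Decision(False, False, "exceeds hard refund limit")
    if action.amount > REFUND_AUTO_LIMIT:
        return Decision(False, True, "above auto-approve limit")
    return Decision(True, False, "within auto-approve limit")


@dataclass
class ResolutionLog:
    """Keeps human resolutions as structured records for later reuse."""
    records: list = field(default_factory=list)

    def record(self, action: ProposedAction, decision: Decision, outcome: str) -> None:
        self.records.append(
            {"kind": action.kind, "amount": action.amount,
             "reason": decision.reason, "human_outcome": outcome}
        )


log = ResolutionLog()
action = ProposedAction(kind="refund", amount=120.0)
decision = check_policy(action)
if decision.needs_human:
    # A person reviews in the same system; the outcome is kept, not discarded.
    log.record(action, decision, outcome="approved")
print(decision.reason, len(log.records))
```

How stored resolutions are later turned into learning is left out here, since the text does not specify that mechanism.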

This is the Human-AI Flywheel. It is the difference between AI that decays and AI that improves. Even when an AI model is deprecated or updated, your most valuable data sits in your system and doesn’t get thrown out with the old model.

What this looks like in practice

Glovo runs reactivation and restaurant management through this model and drove a 7x increase in weekly orders. Altis Hotels increased direct bookings by 22% while cutting guest response times. Nicomatic deployed AI to industrial knowledge management with 0% hallucination or data leakage risk in a setting where one mistake means a recall.

Movistar runs regulated telco operations on the same architecture. Compliance is not a guardrail layer they added. It is grounded in the platform from day one. EU AI Act, GDPR, ISO. Every decision logged, every path auditable, every escalation explainable.

None of these customers had to choose between automation and oversight. None of them are going to write Klarna's blog post in 18 months.

The question you need to ask

Forget the automation rate. It is the vanity metric of generation three.

Ask this instead: when one of my agents resolves something your AI could not, what happens to that resolution six months from now?

If the answer is "nothing structural, we will use it for retraining maybe," you are buying a system that decays. If the answer is "it becomes part of the operating model and every AI conversation from this point on benefits from it," you are buying a system that compounds.

That is the architectural difference. That is why we built the fourth generation. AI you can trust at scale, every time.

Want to find out more? Download our latest eBook “Trust: The Missing Layer in Enterprise AI.”
