How to Detect User Frustration in Your LLM Agent

JULY 3, 2026

You ship an agent. It demos well. Then a user types “no, that’s not what I asked” for the third time, gives up, and leaves, and you never find out. Multiply that by thousands of sessions and you have a real problem you can’t see.

Here’s the uncomfortable part: most frustration is invisible to the tools people reach for first. A user who writes “I can only pick up after 6pm” and gets ignored is furious, but the words are perfectly polite. Sentiment analysis scores them neutral. A keyword filter for “stupid” or “useless” catches nothing.

This guide is a working method for catching frustration you’d otherwise miss, ordered from what you can ship today to what actually scales.

Start by defining what you’re detecting

“Frustration” is too vague to build on. Split it into three things you can point at:

Explicit anger — profanity, “this is useless,” “let me talk to a human.” Rare, and the easiest to catch (so don’t over-index on it).
Silent failure — the user’s goal never gets met and they abandon. No hostile words. This is the majority, and the part everyone misses.
Effortful struggle — the user is still trying but working too hard: rephrasing, repeating, correcting the agent. The early-warning stage, before they give up.

If you only build for the first bucket, you’ll congratulate yourself on a low frustration rate while most of it walks out the door.

Why it’s harder than it looks

Frustration lives in the shape of the whole conversation, not in any single message. That breaks the two cheapest approaches:

Keyword matching assumes frustration has vocabulary. It mostly doesn’t. In a 2025 benchmark of real deployed-assistant conversations, keyword rules caught almost nothing: recall was around 0.05.
Turn-level sentiment judges each message alone, so it can’t see the pattern: the same request asked three ways, the polite-to-clipped drift, the abandonment after a wrong answer.

The takeaway that should drive your design: you have to read across turns, with context about what the user was trying to do. Everything below follows from that.

The signals that actually predict it

Before picking a method, know what you’re looking for. Three layers.

In the words (weak alone): repetition markers (“again,” “like I said,” “for the third time”), correction (“no,” “that’s not what I meant”), escalation requests (“a human,” “your manager”), profanity, ALL CAPS, “!!!”.

In the behavior (stronger): the same intent rephrased 2+ times, an identical request repeated, a sudden turn-count blowup for a normally simple task, and abandonment mid-task, which is the single strongest and most-ignored signal. Someone who leaves without resolution rarely announces it.

In the flow (strongest): answer loops (agent gives the same reply twice), ignored constraints (user states a limit, agent proceeds as if unsaid), and a rising trajectory across the session (polite, then terse, then hostile, then gone).

Notice the pattern: the more predictive the signal, the more it depends on seeing the whole session. Which is exactly why the naive methods fail and the next ones don’t.
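The word-level and behavior-level signals above are cheap enough to compute with plain rules. Here is a minimal sketch, assuming you have the user's messages for one session as a list of strings. The marker lists, the `cheap_signals` name, and both thresholds are illustrative assumptions, and token overlap is only a crude stand-in for "same intent rephrased".

```python
import re

# Hypothetical marker lists; extend them from your own transcripts.
REPETITION = re.compile(r"\b(again|like i said|as i said|for the \w+ time)\b", re.I)
CORRECTION = re.compile(r"^\s*no\b|that'?s not what i (meant|asked)", re.I)
ESCALATION = re.compile(r"\b(a human|real person|your manager)\b", re.I)


def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def _jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cheap_signals(user_turns: list, max_turns: int = 8,
                  rephrase_threshold: float = 0.6) -> dict:
    """Count lexical and behavioral frustration signals in one session."""
    # Normalize curly apostrophes so "that’s" matches the patterns.
    turns = [t.replace("\u2019", "'") for t in user_turns]
    token_sets = [_tokens(t) for t in turns]

    identical = rephrased = 0
    for i in range(1, len(turns)):
        # Compare each message with every earlier one from the same user.
        best = max(_jaccard(token_sets[i], token_sets[j]) for j in range(i))
        if best == 1.0:
            identical += 1
        elif best >= rephrase_threshold:
            rephrased += 1

    return {
        "repetition_markers": sum(1 for t in turns if REPETITION.search(t)),
        "corrections": sum(1 for t in turns if CORRECTION.search(t)),
        "escalations": sum(1 for t in turns if ESCALATION.search(t)),
        "shouting": sum(1 for t in turns
                        if (t.isupper() and len(t) >= 4) or "!!!" in t),
        "identical_repeats": identical,
        "rephrases": rephrased,
        "turn_blowup": len(turns) > max_turns,
    }
```

Abandonment and the flow-level signals are left out on purpose: they need the agent's side of the transcript and a notion of whether the goal was met, which rules like these can't see.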

Pick your method by what you can afford

There’s no single right answer. There’s a right answer for your budget and latency. Roughly cheapest to richest:

1. Mine your explicit feedback first. Thumbs-down and “talk to a human” clicks are unambiguous frustration you’re probably already collecting. Near-zero cost. Low recall (most frustrated users never click), so treat these not as your detector but as a free labeled set to calibrate everything else against.

2. Add a few behavioral triggers for real-time. If you need to intervene mid-conversation (offer a human, change tack), you can’t wait for an expensive model. Cheap rules (same request repeated, N corrections in a row, turn count over a threshold) fire instantly. Accept that they’re noisy; use them to act now, not to report accurately.

3. Use an LLM-as-judge for accurate labeling. This is the workhorse. Feed a model the full conversation plus the user’s goal and ask it to judge frustration. In that same 2025 benchmark, an LLM judge with full context scored ~0.85 F1, versus ~0.05 for keywords, because it can finally see the pattern. Cost and latency make it a retrospective/batch tool, not a per-turn gate, and it needs a clear rubric or it drifts. A starting prompt:

You are labeling a support conversation for user frustration.

User's goal: {inferred or provided goal}
Conversation:
{full transcript, both sides}

Rate frustration 0–3:
  0 none · 1 mild effort · 2 clear struggle · 3 anger/abandonment
Base it on the WHOLE conversation, not one message.
Weigh: unmet goal, repetition, corrections, ignored constraints,
abandonment. Do NOT require negative words — polite users get
frustrated too.

Return JSON: {"score": int, "signals": [..], "reason": "one line"}

Calibrate it against your thumbs-down set from step 1, and watch for the classic false positives: sarcasm, venting about a third party (“my bank is useless”), and frustration aimed at the situation, not your agent. Tell the judge to distinguish those.
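To make the judge step concrete, here is a sketch of the plumbing around it: filling the prompt, validating the reply, and checking the judge against thumbs-down sessions. It deliberately does not call any model, because that depends on your provider. The `<<GOAL>>` and `<<TRANSCRIPT>>` tokens, the function names, and the condensed prompt text are all assumptions for illustration.

```python
import json

# Condensed copy of the prompt above. The <<...>> tokens are hypothetical
# placeholders, chosen so the JSON braces in the last line need no escaping.
PROMPT_TEMPLATE = """You are labeling a support conversation for user frustration.

User's goal: <<GOAL>>
Conversation:
<<TRANSCRIPT>>

Rate frustration 0-3:
  0 none, 1 mild effort, 2 clear struggle, 3 anger/abandonment
Base it on the WHOLE conversation, not one message.
Return JSON: {"score": int, "signals": [..], "reason": "one line"}"""


def build_prompt(goal: str, turns: list) -> str:
    """turns is a list of (role, text) pairs covering both sides."""
    transcript = "\n".join(f"{role}: {text}" for role, text in turns)
    return (PROMPT_TEMPLATE
            .replace("<<GOAL>>", goal)
            .replace("<<TRANSCRIPT>>", transcript))


def parse_judgment(raw: str) -> dict:
    """Extract and validate the JSON object from a judge reply."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object in judge reply")
    data = json.loads(raw[start:end + 1])
    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 3:
        raise ValueError(f"score outside the 0-3 rubric: {score!r}")
    signals = data.get("signals", [])
    if not isinstance(signals, list):
        raise ValueError("signals must be a list")
    return {"score": score,
            "signals": [str(s) for s in signals],
            "reason": str(data.get("reason", ""))}


def recall_on_thumbs_down(judge_scores: dict, thumbs_down: set,
                          threshold: int = 2) -> float:
    """Share of thumbs-down sessions that the judge also flags."""
    labeled = [sid for sid in thumbs_down if sid in judge_scores]
    if not labeled:
        raise ValueError("no overlap between judged and thumbs-down sessions")
    hits = sum(1 for sid in labeled if judge_scores[sid] >= threshold)
    return hits / len(labeled)
```

One design note: thumbs-down data only tells you about known positives, since users who never clicked are unlabeled rather than happy. That is why the check above is a recall figure. Estimating precision, including the sarcasm and third-party cases, takes a hand-labeled sample of sessions the judge flagged.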

Going from “one conversation” to “all of them”

Judging a single session is solved. The real job is the thousands of sessions you’ll never read. Two moves scale it, and since building semantic search and clustering over agent traces is what we do at Latitude, here’s what we’ve learned making each one actually work, not just the textbook version.

Search your sessions by meaning, instead of keywords

The textbook mechanism: embed each session into a vector, embed a natural-language query like “user gave up because the agent kept misunderstanding,” return the nearest sessions by cosine similarity. Retrieval instead of writing a classifier for every pattern you can imagine.

Three things we learned that the textbook skips:

Embed at the turn level, not the whole session. When we embedded entire transcripts, meta-meaning queries like “user frustration” or “assistant is being lazy” got washed out: one signal drowned in a long conversation’s worth of unrelated text. Splitting each trace into user/assistant turn chunks and embedding those made exactly these emotional and behavioral queries land. If you build this yourself, chunk by turn before you embed.
Go hybrid; semantic alone is blunt. Pure vector search can’t pin an exact string, and pure keyword search can’t read meaning. We run both, a lexical text index and semantic embeddings, and let you combine them, so you can semantically search “user frustrated” but only within sessions that contain the literal string “billing”. Meaning finds the pattern; the keyword scopes it to what you care about.
Treat search as exploration, not a verdict. This is the trap. Semantic search returns the top-ranked sample of matches across your whole corpus; it is not “every frustrated session.” Add a filter like “last 7 days” and you don’t get all of this week’s frustrated users; you get whichever of the global top matches happen to be recent, which can be almost none. Getting true set semantics means scoring every trace, and that full scan times out on large projects. So use search to find and investigate patterns, never as the source of truth for a count, a chart, or a real-time gate. For counts and alerts you need a detector that scores each trace once, on arrival, and remembers the result, which is a different job (next section).
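If you want to try the first two lessons on a small sample, the shape is roughly the following. This is a toy sketch, not how Latitude implements it: `toy_embed` is a bag-of-words placeholder that does not capture meaning and exists only so the example runs without a model, and `Chunk`, `chunk_by_turn`, and `hybrid_search` are hypothetical names. Swap in a real embedding model through the `embed` parameter.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    session_id: str
    turn_index: int
    text: str


def chunk_by_turn(session_id: str, turns: list) -> list:
    """One chunk per (role, text) turn instead of one per session."""
    return [Chunk(session_id, i, f"{role}: {text}")
            for i, (role, text) in enumerate(turns)]


def toy_embed(text: str) -> Counter:
    """Placeholder embedding: word counts. Replace with a real model."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def hybrid_search(chunks: list, query: str, must_contain: str = None,
                  top_k: int = 5, embed=toy_embed) -> list:
    """Rank turn chunks by similarity, optionally scoped by a literal string."""
    if must_contain:
        needle = must_contain.lower()
        # Lexical scope: keep sessions where ANY turn contains the string.
        allowed = {c.session_id for c in chunks if needle in c.text.lower()}
        chunks = [c for c in chunks if c.session_id in allowed]
    q = embed(query)
    scored = [(cosine(q, embed(c.text)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Note that this scores every chunk on every query, which is fine in a notebook and is exactly the full scan that stops working on a large corpus. The result is still a top-k sample, so the third lesson applies to it unchanged.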

One honest ergonomic caveat, from watching people use it: semantic search rewards phrasing your query like the thing you’re hunting for. A Boolean “OR” has no intuitive equivalent in a semantic query, so the practical trick is to run two queries (“user frustration” and “user anger”), then union the results and dedupe.
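The two-query trick is a few lines of glue. A minimal sketch, assuming a `search` callable (hypothetical, standing in for whatever search you use) that returns `(score, session_id)` pairs for one query:

```python
from typing import Callable, Iterable


def union_search(queries: Iterable[str],
                 search: Callable[[str], list]) -> list:
    """Run each query, keep each session once with its best score."""
    best = {}
    for query in queries:
        for score, session_id in search(query):
            if score > best.get(session_id, float("-inf")):
                best[session_id] = score
    # Highest-scoring sessions first.
    return sorted(((score, sid) for sid, score in best.items()),
                  key=lambda pair: pair[0], reverse=True)
```

Scores from different queries are only loosely comparable, so treat the merged ordering as a reading list rather than a ranking.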

Cluster sessions to find the pattern, not the instance

Search finds sessions you already suspect exist. Clustering surfaces the ones you didn’t. The move: group sessions by what they mean, then have a person read a few from each group and name it. That turns “here are 200 frustrated users” into “here are the four reasons they’re frustrated,” which is the thing you actually fix.

What we learned building it (we call this layer Behaviours):

A plain clustering library isn’t enough. Naive k-means or HDBSCAN over session embeddings gives you blobs nobody can act on. What made clusters trustworthy was hybrid clustering: we use semantic similarity and keyword-reranked search to form the groups, and we cluster conversation moments rather than whole sessions. That way a cluster maps to one coherent failure pattern (“ignored the user’s stated constraint”) instead of a vague topic.
Clusters have to move. User behavior drifts, so static clusters rot. We keep a centroid per cluster and let related clusters merge over time as the centroids move, so the taxonomy stays alive instead of freezing on day one’s data.
A human still names them. Every clustering approach, ours included, produces groups a person has to read and label. Budget for that step. It isn’t automatable away, and it’s where the insight actually happens.
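To show what “keep a centroid per cluster and let clusters merge” can look like, here is a generic online-clustering sketch. It is a simple stand-in under stated assumptions, not the Behaviours implementation: the `Cluster` class, `assign`, `merge_close`, and both thresholds are hypothetical, vectors are plain lists of floats, and naming the clusters stays a human job.

```python
import math


def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class Cluster:
    def __init__(self, vector: list):
        self.centroid = list(vector)
        self.size = 1

    def add(self, vector: list) -> None:
        """Move the centroid toward the new point (running mean)."""
        self.size += 1
        self.centroid = [c + (v - c) / self.size
                         for c, v in zip(self.centroid, vector)]

    def absorb(self, other: "Cluster") -> None:
        """Merge another cluster in, weighting centroids by size."""
        total = self.size + other.size
        self.centroid = [(a * self.size + b * other.size) / total
                         for a, b in zip(self.centroid, other.centroid)]
        self.size = total


def assign(clusters: list, vector: list, join_threshold: float = 0.8) -> Cluster:
    """Add the vector to its nearest cluster, or start a new one."""
    best = max(clusters, key=lambda c: cosine(c.centroid, vector), default=None)
    if best is not None and cosine(best.centroid, vector) >= join_threshold:
        best.add(vector)
        return best
    new = Cluster(vector)
    clusters.append(new)
    return new


def merge_close(clusters: list, merge_threshold: float = 0.95) -> list:
    """Fold together clusters whose centroids have drifted close."""
    merged = []
    for cluster in clusters:
        target = next((m for m in merged
                       if cosine(m.centroid, cluster.centroid) >= merge_threshold),
                      None)
        if target is not None:
            target.absorb(cluster)
        else:
            merged.append(cluster)
    return merged
```

You would call `assign` as new moments arrive and `merge_close` on a schedule. The thresholds decide how coarse the taxonomy is, so expect to tune them by reading what lands in each group.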

None of this is exotic. An embedding model, a vector store, and a clustering library get you a first version, and building it yourself is a completely legitimate choice. The reason a dedicated layer exists (ours or anyone’s) is that the un-obvious parts above (turn-level chunking, hybrid search, exploration-vs-set-semantics, moving centroids, plus running detectors automatically so frustration and tool errors get flagged as traces arrive rather than only when you go looking) are the difference between a demo and something you trust in production.
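The “flag traces as they arrive” part is what makes counts trustworthy, and a first version can be small. A sketch using SQLite from the standard library, assuming each trace gets a 0–3 score from whatever detector you run and that timestamps are ISO 8601 strings in one timezone (so string comparison orders them correctly). The table and function names are hypothetical.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS scores ("
        " trace_id TEXT PRIMARY KEY,"
        " arrived_at TEXT NOT NULL,"
        " score INTEGER NOT NULL)"
    )
    return db


def record(db: sqlite3.Connection, trace_id: str,
           arrived_at: str, score: int) -> None:
    """Store the score once; a repeated trace_id is ignored."""
    db.execute("INSERT OR IGNORE INTO scores VALUES (?, ?, ?)",
               (trace_id, arrived_at, score))
    db.commit()


def frustrated_count(db: sqlite3.Connection, since: str,
                     min_score: int = 2) -> int:
    """Exact count over stored scores, safe to filter by time."""
    row = db.execute(
        "SELECT COUNT(*) FROM scores WHERE arrived_at >= ? AND score >= ?",
        (since, min_score),
    ).fetchone()
    return row[0]
```

Because every trace has a stored score, a “last 7 days” filter here returns all matching traces in that window, which is the set semantics that top-k search can’t give you.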

Turn detection into something that improves the agent

Detecting frustration is worthless if it just fills a dashboard. Close the loop:

Track a frustration rate (share of sessions scoring 2–3) so you know if changes help or hurt.
Route the worst sessions into your eval set. Real frustrated conversations are the highest-value regression tests you’ll ever get.
Trace back to the cause. The cluster tells you why: a missing capability, a confusing flow, a tool that fails silently. Fix the cause, not the symptom.
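The first two steps reduce to a couple of small functions once every session has a judge score. A minimal sketch, assuming a mapping from session id to a 0–3 score; the function names and the `limit` default are illustrative.

```python
def frustration_rate(scores: dict, threshold: int = 2) -> float:
    """Share of sessions scoring at or above the threshold (2-3 by default)."""
    if not scores:
        return 0.0
    return sum(1 for s in scores.values() if s >= threshold) / len(scores)


def worst_sessions(scores: dict, limit: int = 20, threshold: int = 2) -> list:
    """Session ids to route into the eval set, most frustrated first."""
    flagged = [(score, sid) for sid, score in scores.items() if score >= threshold]
    # Sort by score descending, then by id so the order is deterministic.
    flagged.sort(key=lambda pair: (-pair[0], pair[1]))
    return [sid for _, sid in flagged[:limit]]
```

Compute the rate over a fixed window before and after each change to the agent, so a shift in traffic volume doesn't masquerade as an improvement.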

FAQ

Can’t I just use sentiment analysis? As a cheap first pass, yes, but expect it to miss most real frustration, which is polite and implicit. Use it to triage, not to trust.

What’s the most accurate single method? An LLM-as-judge given the full conversation and the user’s goal. Accuracy comes from context, not from a fancier model.

How do I detect frustration in real time without adding latency? Don’t run a big model per turn. Use cheap behavioral triggers (repeats, corrections, turn-count spikes) to intervene live, and do accurate LLM-judge labeling in batch afterward.

What’s the strongest signal? Abandonment: leaving without resolution. It’s silent, so behavioral and flow analysis catches it where any word-based method can’t.

How do I avoid false positives? Watch for sarcasm, venting about third parties, and frustration at the situation rather than the agent. Give your judge examples of each and calibrate against real thumbs-down data.

How do I do this across thousands of sessions? Semantic search to find them and clustering to group them by cause. Manual transcript-reading doesn’t scale past a few hundred.

Sources: “Stupid robot, I want to speak to a human!” — frustration detection benchmark on deployed assistant conversations (COLING 2025 Industry Track); GoEmotions (ACL 2020); Survey on LLM-as-a-Judge (arXiv:2411.15594).