
    Why Your Agent Passes Every Check But Still Fails in Production

    Step-level metrics give false confidence when evaluating AI agents. This post explains why full-run evaluation, LLM-as-Judge, and scenario diversity analysis are the right approach.

    AI / Engineering · AI agents · LLM evaluation · agent testing · LLM-as-Judge · AI engineering · test automation · QApilot

    Vidushee Geetam

    Software Engineer

    This article was originally authored by Vidushee Geetam, Software Engineer at QApilot, and published on LinkedIn on May 4, 2026. We're reposting it here to share her insights with a wider audience.

    At QApilot, we invest in our team's thinking, not just their work. Vidushee's piece below cuts to something we've seen across AI teams: evaluation systems that give false confidence. If you're building or shipping agents, this is worth reading carefully.


    Why Your Agent Passes Every Check But Still Fails in Production

    If you've shipped an AI-powered agent, you've probably felt the evaluation gap. The model seems to work fine in demos. Individual outputs look reasonable. But something goes wrong in longer runs, and you struggle to measure what or how much.

    The instinct most teams reach for first is to borrow from how we've always evaluated models: score each output, check each step, aggregate the numbers. For a single-turn model, where the question is simply "given a prompt, did it produce a good response?", this works. For an agent that takes twenty actions to accomplish a goal, it quietly falls apart.

    This post is about why that happens, and what to do instead. The short version: the unit of evaluation for an agent is the full run, not the step. Once you accept that, a lot of other things about your eval setup have to change too: how you write rubrics, how you choose what to test on, even how you read the results. We'll walk through each of those.

    Let's start with why step-level evaluation breaks down.

    The Trap of Evaluating Actions in Isolation

    Say you're building a shopping agent. A user asks it to find and buy a blue running shoe under $100, and the agent has to navigate the site, apply filters, pick a product, and get through checkout. You decide to evaluate it by scoring each action it takes: did it click the right element? did it fill in the right field?

    Here's what that misses. Consider this sequence:

    1. Agent lands on the running shoes listing page
    2. Agent applies a price filter for under $100, reasonable
    3. Agent clicks into a blue Nike Pegasus product page, reasonable
    4. Agent navigates back to the home page for no apparent reason
    5. Agent re-applies the same price filter it already applied
    6. Agent reaches checkout, success

    If you score each step, most of them pass. Step 4 passes too. The home button exists on the page, the agent clicked it, no hallucination. Your per-step check has no opinion on why the agent went home or whether that made any sense given what it was doing. But any human watching this would immediately spot that the agent lost its thread and wasted steps recovering.
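
    To see why, here's a minimal sketch of a per-step validator. The page_state and action shapes are invented for illustration; what matters is what the check never sees: the goal and the history.

        def check_step(page_state: dict, action: dict) -> bool:
            # The target element must actually exist on the current page (no hallucination).
            if action["target"] not in page_state["visible_elements"]:
                return False
            # A "fill" action only makes sense on an input field.
            if action["kind"] == "fill" and action["target"] not in page_state["input_fields"]:
                return False
            return True  # Locally valid; the goal is never consulted.

        # Step 4 above: the home button exists on the page, so the check passes,
        # even though clicking it threw away the agent's progress.
        assert check_step(
            {"visible_elements": {"home_button", "add_to_cart"}, "input_fields": set()},
            {"kind": "click", "target": "home_button"},
        )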

    Local correctness does not imply global coherence. This is the fundamental problem with step-level evaluation for agents.

    The Right Question: Does the Whole Sequence Make Sense?

    Instead of asking "was this action correct?", ask: does this entire sequence of actions make sense as a way to accomplish the stated goal?

    This requires handing the complete record of what happened, every state, every action, in order, to an evaluator that can reason about the whole thing at once. For agents operating in open-ended environments, that evaluator is almost always going to be another LLM.

    This is the LLM-as-Judge pattern: you use a capable model as your evaluator, give it a detailed rubric, and ask it to score the agent's full run and explain its reasoning.

    Going back to the shoe-shopping agent, rather than checking each click, you'd give the judge:

    • The goal ("find and purchase a blue running shoe under $100")
    • Every page the agent visited and action it took, in order
    • The final state (did it reach checkout? abandon midway?)

    And ask: does this sequence represent a coherent, effective path toward the goal?
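
    As a sketch of what that call might look like, assuming an OpenAI-style chat API (the trace format and the judge_run name are illustrative, not a fixed interface):

        from openai import OpenAI

        client = OpenAI()

        def judge_run(goal: str, steps: list[str], final_state: str) -> str:
            # Hand the judge the goal, the full ordered trace, and the outcome at once.
            trace = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(steps))
            prompt = (
                f"Goal: {goal}\n\n"
                f"Actions taken, in order:\n{trace}\n\n"
                f"Final state: {final_state}\n\n"
                "Does this sequence represent a coherent, effective path toward "
                "the goal? Explain your reasoning, then give your scores."
            )
            response = client.chat.completions.create(
                model="gpt-4o",  # any capable judge model
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content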

    Designing Your Rubric

    The rubric is where the real work happens. A few principles:

    Split into dimensions. A single "quality score" is hard to act on. If your agent scores 2.8/5, that tells you nothing about what to fix. Score multiple dimensions separately and you get a map: maybe goal achievement is high but efficiency is low, the agent gets there but wastes too many steps. That's actionable.

    Separate the generic from the domain-specific. Some dimensions apply to almost any agent: did it accomplish the goal? was the sequence logical? was it efficient? Others are specific to what your agent does. For the shoe agent, you might want a dimension like "did the agent stay within the product catalog instead of wandering into help pages or account settings?". Irrelevant for a coding agent, critical for a shopping one. Keep both, but know the difference.

    Require written reasons, not just scores. The number tells you how much. The reason tells you why. When debugging, you'll ignore the scores and read the reasons. Make them mandatory in your output format.

    Use structured output. Ask for JSON. Parse it. Aggregate scores across runs. Without this, you're doing manual review, which doesn't scale.
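
    Here's a minimal sketch of that, with illustrative dimension names rather than a prescribed schema. Appending RUBRIC to the judge prompt from the earlier sketch turns its free-form answer into something you can parse and aggregate:

        import json

        RUBRIC = """Score the run on each dimension from 1 to 5, with a written reason for each.
        Respond with JSON only, in exactly this shape:
        {"goal_achievement": {"score": 0, "reason": ""},
         "efficiency": {"score": 0, "reason": ""},
         "coherence": {"score": 0, "reason": ""},
         "scope_adherence": {"score": 0, "reason": ""}}"""

        def parse_judgment(raw: str) -> dict:
            verdict = json.loads(raw)
            for dim, entry in verdict.items():
                # Reasons are mandatory, not optional decoration.
                assert entry.get("reason"), f"{dim} is missing its written reason"
            return verdict

        def mean_scores(verdicts: list[dict]) -> dict:
            # Aggregation across runs is trivial once the output is structured.
            dims = verdicts[0].keys()
            return {d: sum(v[d]["score"] for v in verdicts) / len(verdicts) for d in dims}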

    One thing that's easy to get wrong: picking dimensions that are correlated. If "was the sequence logical?" and "did each step follow from the previous one?" are both in your rubric, they'll almost always move together. You're not measuring two things, you're just measuring one thing twice and inflating your confidence. The goal is dimensions that can diverge: an agent can be highly efficient but still miss the goal, or reach the goal through a completely incoherent path. When two dimensions never disagree in practice, merge them.
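
    You can catch this empirically after a batch: check how each pair of dimensions correlates across runs. A sketch, assuming verdicts parsed into the shape above:

        from itertools import combinations
        from statistics import correlation  # Python 3.10+

        def redundant_pairs(verdicts: list[dict], threshold: float = 0.9) -> list[tuple]:
            dims = list(verdicts[0].keys())
            flagged = []
            for a, b in combinations(dims, 2):
                xs = [v[a]["score"] for v in verdicts]
                ys = [v[b]["score"] for v in verdicts]
                if len(set(xs)) < 2 or len(set(ys)) < 2:
                    continue  # a constant dimension has undefined correlation
                if correlation(xs, ys) > threshold:
                    flagged.append((a, b))  # candidates for merging into one dimension
            return flagged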

    The Scenarios You Test On Matter As Much As the Judge

    So now you've got a judge that can look at a full run and score it across the dimensions you care about. You run it across your test cases, get your numbers, and ship.

    Except there's one more thing that can quietly invalidate everything you just built.

    Your scores only describe how the agent performs on the situations you tested. If those situations all look roughly the same, same starting point, same type of goal, same path through the system, you'll get clean numbers while being completely blind to whatever falls outside that narrow slice. The agent could be catastrophically bad at half its real-world job, and your evaluation would never tell you. It's the agent equivalent of a codebase with 100% test coverage on one function and 0% on everything else.

    For our shoe-shopping agent, this might mean every test case is some variant of "search for a product, filter, buy." None of them touch category browsing, none touch the account section, none test what happens when a page fails to load. The agent scores 4.5/5 across the board, and you have no idea it falls apart the moment a user does anything other than search.

    The fix is to treat the variety of your test cases as something you actively measure, not assume. After running a batch, make a second judge call, not on any individual run, but on summaries of all the runs together. Ask: do these scenarios represent meaningfully different situations? What kinds of tasks or paths are missing entirely?

    This gives you two outputs per batch:

    • Per-run quality scores: how well did the agent do on each task?
    • A coverage gap analysis: what scenarios are you failing to exercise?

    The second output feeds directly back into how you design your next batch of tests. It closes the loop.
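
    A sketch of that second call, reusing the OpenAI-style client from the per-run judge (the one-line summary format is again an assumption):

        def judge_diversity(run_summaries: list[str]) -> str:
            # One call over the whole batch: this scores coverage, not quality.
            listing = "\n".join(f"- {s}" for s in run_summaries)
            prompt = (
                "Here are one-line summaries of every scenario in this test batch:\n"
                f"{listing}\n\n"
                "Do these represent meaningfully different situations? Rate the "
                "batch's diversity from 1 to 5, then list the kinds of tasks or "
                "paths that are missing entirely."
            )
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content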

    Putting It Together

    Two passes over every batch:

    Per-run judge. Takes one full run at a time. Returns quality scores and reasons for that run.

    Diversity judge. Takes summaries of all runs in the batch. Returns a coverage score and the scenario types that are missing.

    The first tells you how your agent is doing. The second tells you whether you're asking the right questions.
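
    Wired together with the hypothetical helpers sketched above (the batch of run records is likewise invented for illustration):

        # Pass 1: one judge call per run. judge_run is prompted with RUBRIC so it returns JSON.
        verdicts = [parse_judgment(judge_run(r.goal, r.steps, r.final_state)) for r in batch]

        # Pass 2: one judge call over the whole batch.
        gaps = judge_diversity([r.summary for r in batch])

        print(mean_scores(verdicts))  # how the agent is doing
        print(gaps)                   # whether you're asking the right questions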

    Summary

    • Step-level metrics catch local correctness but miss global incoherence.
    • A full-sequence LLM judge catches end-to-end quality and goal achievement, but tells you nothing about whether your scenarios cover the right ground.
    • Adding a diversity judge catches the coverage gaps in your scenarios. With both judges in place, you have the full picture.

    An agent's quality is a property of its full run, not any individual step. Build your evaluation around that, and you'll catch the failure modes that matter.


    Written by

    Vidushee Geetam


    Software Engineer

    Vidushee is a Software Engineer at QApilot, a graduate of BITS Pilani, and holds a postgraduate degree from Imperial College London. She enjoys working at the intersection of AI, systems, and applied problem solving, with a focus on clarity, structure, and thoughtful design.
