
As AI agents move from experimental chatbots to production-grade workers, testing their reliability is critical. Ashr provides an 'Agent-in-the-Loop' evaluation platform to stress-test these autonomous systems. This guide will walk you through setting up and using Ashr to ensure your agents perform consistently in real-world scenarios.
Step 1: Connect Your AI Agent
The first step is integrating Ashr with your existing environment. Ashr is provider-agnostic and works via a lightweight API or SDK. You simply provide the endpoint where your agent lives. This allows Ashr to send synthetic user requests to your agent and monitor its reasoning and tool-call logic without requiring major changes to your codebase.
Step 2: Define User Archetypes and Stories
To test how an agent handles diverse interactions, you must define 'User Archetypes.' Within the Ashr dashboard, you can describe different personas—such as an angry customer, a technical power user, or even a malicious user attempting prompt injection. Ashr uses these archetypes to generate thousands of unique, branching user stories that simulate authentic production journeys.
Step 3: Set Custom Evaluation Metrics
One of Ashr’s most powerful features is the ability to define business-specific success criteria. You can set rules such as: 'Agent must never provide medical advice' or 'Agent must confirm account details before updating a subscription.' Ashr will automatically evaluate every simulated journey against these metrics, grading the agent’s logic and adherence to your safety guardrails.
Step 4: Run Simulations and Stress Tests
Once your archetypes and metrics are set, trigger a simulation run. Ashr will execute thousands of parallel conversations with your agent, mimicking rapid-fire user interactions. It specifically looks for edge cases—situations where the agent might hallucinate, fail to call a tool correctly, or deviate from the intended workflow. This high-volume stress testing uncovers bugs that manual QA would likely miss.
Step 5: Analyze the Health Score and Iterated
After the simulation completes, Ashr provides a comprehensive 'Health Score' and detailed failure reports. You can dive into specific journeys that failed, seeing exactly where the agent's reasoning broke down. Use these insights to refine your prompts, update your agent's tool-call documentation, or adjust your fine-tuning data. Repeat the process until your agent achieves a reliability score high enough for production deployment.



