Mert Cemri, Melissa Z. Pan, Shuyi Yang, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica, and the MAST Team
🗓️ Posted: November 18, 2025
<aside> 🤖
We discuss how to use MAST Annotator, an automated evaluator, to improve agent performance by up to 53% on downstream tasks. Our tool analyzes execution traces and categorizes failures using MAST, our taxonomy of failure modes for agentic systems. This automated feedback pinpoints exactly why agents fail, allowing agent builders to fix them.
Agentic AI systems leverage multiple LLMs and tools to tackle real-world problems. However, once such a sophisticated system is designed, manually going through execution logs and figuring out how to improve it is a difficult and time-consuming process. MAST Annotator provides scalable, systematic feedback on failure modes, which helps improve these agents.
In this blog, we walk through a step-by-step tutorial for debugging a simple LangChain agentic system with MAST. We also provide a quick-start guide for using MAST at the end.

TLDR:
Starting to annotate your agentic traces with MAST is as easy as:
!pip install agentdash
from agentdash import annotator

# The annotator uses an LLM (via your OpenAI API key) to label failure modes.
openai_api_key = "your-api-key"
MASTAnnotator = annotator(openai_api_key, model="gpt-5")

# A toy multi-agent trace containing an obvious arithmetic error.
trace = """
Agent1: I need to calculate the sum of 1 + 1.
Agent2: I'll help you with that. The answer is 3.
Agent1: Thank you! Task completed.
"""
# Produce a MAST annotation (failure-mode labels) for the trace.
mast_annotation = MASTAnnotator.produce_taxonomy(trace)
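As a quick sanity check of what comes back, here is a minimal sketch of inspecting the annotation. It assumes, hypothetically, that mast_annotation behaves like a mapping from failure-mode names to booleans (MAST represents each trace as a boolean failure-mode vector); the actual return type may differ, so check the agentdash documentation.

# Minimal sketch: list the failure modes flagged for the toy trace above.
# Assumption (hypothetical): mast_annotation is dict-like, mapping each
# MAST failure-mode name to a boolean indicating whether it was detected.
flagged_modes = [mode for mode, present in mast_annotation.items() if present]
print("Failure modes detected in this trace:")
for mode in flagged_modes:
    print(f"  - {mode}")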
In the example we discuss later, we show how the feedback from MAST Annotator helps us replace a redundant agent with a missing key agent and introduce better memory management by making agents stateful. These changes significantly reduce failure modes and improve downstream task performance by 53% compared to the initial design, while keeping the underlying LLM the same.

Figure 1: Failure mode analysis of the LangChain agent before and after MAST feedback. Blue bars show failure counts from MAST's analysis of the initial multi-agent system's execution traces. Orange bars show failure counts *after* implementing the new agent workflow based on MAST's feedback. Lower counts indicate fewer failure modes, demonstrating improved system performance.
Teams are racing to build agentic AI systems, with workflows that call tools, spawn subtasks, and verify results. The promise is real, and so is the pain. In practice, most projects stall not because the models are “too weak,” but because the development loop is fragile. Traces balloon into thousands of tokens across roles and tool calls. Reading logs by hand to spot the actual failure pattern (role drift vs. missed clarification vs. weak verification) rarely scales beyond a few runs. Without structured traces and consistent post-hoc analysis, you can’t reproduce defects or reliably evaluate improvements.
A 2025 MIT study describes a “GenAI Divide,” estimating that ~95% of enterprise AI pilots fail to reach impactful production, largely because of integration and workflow fit rather than raw model capability. The report cites only ~5% of custom tools reaching durable implementation, backed by extensive qualitative evidence from interviews; news coverage echoes the same pattern.
This is why automating the debugging and development loop is vital. You need a fast way to turn execution traces into actionable signals that tell you what to change next, without developers spending days spelunking through logs.

Figure 2: MAST: Multi-Agent System Failure Taxonomy
MAST is the Multi-Agent System Failure Taxonomy, derived from 1,600+ annotated traces across 7 MAS frameworks (coding, math, and general agents). MAST annotates each trace with a failure-mode vector (one boolean per mode), grouped into: