How to Evaluate an Agent
This guide walks you through the complete process of evaluating an agent in Youtu-Agent, from implementation to analysis.
Prerequisites
Before you begin, ensure that:
- Environment is set up: You've completed the Getting Started setup
- Database is configured: The UTU_DB_URL environment variable is set in your .env file (defaults to sqlite:///test.db)
- API keys are configured: UTU_LLM_API_KEY and other required keys are set in .env (a minimal example sketch follows this list)
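For reference, a minimal .env sketch with the variables named above. The values are placeholders; add whatever other provider keys your configuration requires:
UTU_DB_URL=sqlite:///test.db
UTU_LLM_API_KEY=your-llm-api-key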
Overview
The evaluation process consists of four main steps:
- Implement and debug your agent - Test your agent interactively
- Prepare evaluation dataset - Create and upload test cases
- Run evaluation - Execute automated evaluation
- Analyze results - Review performance using the analysis dashboard
Step 1: Implement and Debug Your Agent
Before running formal evaluations, you should implement and test your agent interactively to ensure it works as expected.
1.1 Create Agent Configuration
Create an agent configuration file in configs/agents/. For example, configs/agents/examples/svg_generator.yaml:
# @package _global_
defaults:
  - /model/base@orchestrator_model
  - /agents/router/force_plan@orchestrator_router
  - /agents/simple/search_agent@orchestrator_workers.SearchAgent
  - /agents/simple/svg_generator@orchestrator_workers.SVGGenerator
  - _self_

type: orchestrator
orchestrator_config:
  name: SVGGeneratorAgent
  add_chitchat_subagent: false
  additional_instructions: |-
    Your main objective is generating SVG code according to user's request.
  orchestrator_workers_info:
    - name: SearchAgent
      description: Performs focused information retrieval
    - name: SVGGenerator
      description: Creates SVG cards
1.2 Test Interactively
Run the CLI chat interface to verify your agent works correctly:
python scripts/cli_chat.py --config examples/svg_generator
Try various inputs and verify the agent's responses meet your expectations. Debug and refine your agent configuration as needed.
Step 2: Prepare Evaluation Dataset
Once your agent is working, create a dataset of test cases for evaluation.
2.1 Create Dataset File
Create a JSONL file (e.g., data/svg_gen.jsonl) where each line is a JSON object with your test case. Two data formats are supported:
Format 1: LLaMA Factory format (recommended for SFT data):
{"instruction": "Research Youtu-Agent and create an SVG summary"}
{"instruction": "Introduce the data formats in LLaMA Factory"}
{"instruction": "Summarize how to use Claude Code"}
{"input": "https://docs.claude.com/en/docs/claude-code/skills"}
{"input": "Please introduce Devin coding agent"}
{"instruction": "Summarize the following content", "input": "# Model Context Protocol servers\n\nThis repository is a collection of reference implementations..."}
{"input": "https://arxiv.org/abs/2502.05957"}
{"input": "https://github.com/microsoft/agent-lightning"}
{"input": "Summarize resources about MCP"}
{"input": "https://huggingface.co/papers/2510.19779"}
In the LLaMA Factory format:
- the instruction and/or input fields are combined to form the question (see the sketch after the format examples below)
- the output field (if present) becomes the ground truth answer
Format 2: Default format:
{"question": "What is Youtu-Agent?", "answer": "A flexible agent framework"}
{"question": "How to install?", "answer": "Run uv sync"}
2.2 Upload Dataset to Database
Upload your dataset using the upload_dataset.py script:
python scripts/data/upload_dataset.py \
  --file_path data/svg_gen.jsonl \
  --dataset_name example_svg_gen \
  --data_format llamafactory
Parameters:
- --file_path: Path to your JSONL file
- --dataset_name: Name to assign to this dataset in the database
- --data_format: Either llamafactory or default
The script will parse your JSONL file and store the test cases in the database. You should see output like:
Uploaded 10 datapoints from data/svg_gen.jsonl to dataset 'example_svg_gen'.
Upload complete.
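To double-check that the datapoints actually landed in the database, a quick standard-library sketch (assuming the default UTU_DB_URL=sqlite:///test.db; the table names belong to Youtu-Agent's internal schema, so the script only lists them rather than assuming what they are called):
# List the tables in the default SQLite database and their row counts.
import sqlite3

conn = sqlite3.connect("test.db")
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
for table in tables:
    count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    print(f"{table}: {count} rows")
conn.close()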
Step 3: Run Evaluation
Now run your agent on the evaluation dataset.
3.1 Create Evaluation Configuration
Create an evaluation config file in configs/eval/examples/. For example, configs/eval/examples/eval_svg_generator.yaml:
# @package _global_
defaults:
  - /agents/examples/svg_generator@agent
  - _self_

exp_id: "example_svg_gen"
data:
  dataset: example_svg_gen
concurrency: 16
Key fields:
- defaults: Reference your agent configuration
- exp_id: Unique identifier for this evaluation run
- data.dataset: Name of the dataset you uploaded
- concurrency: Number of parallel evaluation tasks
3.2 Run the Evaluation Script
Execute the evaluation:
python scripts/run_eval.py \
  --config_name examples/eval_svg_generator \
  --exp_id example_svg_gen \
  --dataset example_svg_gen \
  --step rollout
Parameters:
- --config_name: Name of your eval config (without .yaml extension)
- --exp_id: Unique identifier for this evaluation run
- --dataset: Name of the dataset to evaluate on
- --step: Evaluation step to run
  - rollout: Run the agent on all test cases (collect trajectories)
  - judge: Evaluate the agent's outputs (requires ground truth answers)
  - all: Run both rollout and judge steps
Note: Use --step rollout when your dataset doesn't have ground truth (GT) answers. If your dataset does include GT answers, you can run --step all (see the example after the judge command below) or run the judge step separately:
# Run judge separately (requires rollout to be completed first)
python scripts/run_eval_judge.py \
  --config_name examples/eval_svg_generator \
  --exp_id example_svg_gen
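For a dataset that does include ground truth answers, both steps can run in one pass. The exp_id and dataset names below are placeholders for a GT-annotated run:
python scripts/run_eval.py \
  --config_name examples/eval_svg_generator \
  --exp_id my_gt_run \
  --dataset my_dataset_with_gt \
  --step all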
3.3 Monitor Progress
During evaluation, you'll see progress indicators:
> Loading dataset 'example_svg_gen'...
> Found 10 test cases
> Starting rollout with concurrency 16...
[1/10] Processing: Research Youtu-Agent...
[2/10] Processing: Introduce LLaMA Factory...
...
> Evaluation complete. Results saved to database.
Step 4: Analyze Results
After evaluation completes, analyze the results using the web-based analysis dashboard.
4.1 Set Up Analysis Dashboard
The evaluation analysis dashboard is a Next.js application located in frontend/exp_analysis/.
First-time setup:
cd frontend/exp_analysis
# Install dependencies
npm install --legacy-peer-deps
# Build the application
npm run build
4.2 Start the Dashboard
# Make sure you're in frontend/exp_analysis/
npm run start
The dashboard will be available at http://localhost:3000 by default.
Change the port (optional):
Edit package.json and modify the start script:
"scripts": {
"start": "next start -p 8080"
}
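Alternatively, leave package.json untouched and pass the port on the command line; npm forwards arguments after -- to the underlying next start command:
npm run start -- -p 8080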
4.3 View Evaluation Results
Open your browser and navigate to http://localhost:3000. The dashboard provides:
- Overview: Summary statistics for your evaluation runs
- Experiment Comparison: Compare multiple evaluation runs side-by-side
- Detailed Trajectories: Inspect individual agent execution traces
- Success/Failure Analysis: Identify patterns in successful vs. failed cases
- Performance Metrics: Accuracy, latency, token usage, etc.
Key features:
- Filter by experiment ID, dataset, or date range
- View full agent trajectories, including tool calls and reasoning steps
- Export results for further analysis
- Compare different agent configurations
4.4 Verify Database Connection
If you encounter issues, test the database connection:
cd frontend/exp_analysis
npm run test:db
This will verify that the dashboard can connect to your database and retrieve evaluation data.
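You can also probe the same connection from Python, independently of the dashboard. This is a minimal sketch assuming SQLAlchemy and python-dotenv are installed in your environment:
# Connectivity check against the same UTU_DB_URL used by the framework.
import os

from dotenv import load_dotenv
from sqlalchemy import create_engine, text

load_dotenv()  # reads UTU_DB_URL from .env if present
url = os.environ.get("UTU_DB_URL", "sqlite:///test.db")
engine = create_engine(url)
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))
print(f"Connected to {url}")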
Advanced Topics
Evaluating on Standard Benchmarks
To evaluate on standard benchmarks like WebWalkerQA or GAIA:
# Prepare benchmark dataset (one-time setup)
python scripts/data/process_web_walker_qa.py
# Run evaluation on WebWalkerQA
python scripts/run_eval.py \
  --config_name ww \
  --exp_id my_ww_run \
  --dataset WebWalkerQA_15 \
  --concurrency 5
See the Evaluation Documentation for more details on benchmarks.
Configuring Judge Models
For evaluations with ground truth, configure judge models in your eval config:
judge_model:
  model_provider:
    type: ${oc.env:JUDGE_LLM_TYPE}
    model: ${oc.env:JUDGE_LLM_MODEL}
    base_url: ${oc.env:JUDGE_LLM_BASE_URL}
    api_key: ${oc.env:JUDGE_LLM_API_KEY}
  model_params:
    temperature: 0.5
judge_concurrency: 16
Set the corresponding environment variables in your .env file:
JUDGE_LLM_TYPE=chat.completions
JUDGE_LLM_MODEL=deepseek-chat
JUDGE_LLM_BASE_URL=https://api.deepseek.com/v1
JUDGE_LLM_API_KEY=your-judge-api-key