Adding a New Evaluation Benchmark

Overview

Evaluations benchmark agent performance on specific tasks. This guide outlines how to add a new one.

Steps

1. Create an evaluation module in src/aigise/evaluations/.
2. Implement the evaluation interface.
3. Add a configuration template.
4. Add sample data handling.

Evaluation Structure

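# src/aigise/evaluations/my_benchmark/my_evaluation.py
from aigise.evaluations import EvaluationTask


class MyEvaluation:
    """Entry point for the my_benchmark evaluation."""

    async def run_evaluation(self, tasks):
        # tasks: the benchmark tasks to evaluate (presumably EvaluationTask
        # instances, given the import above).
        # Evaluation logic goes here: run the agent on each task and
        # collect the results.
        pass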
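
As a rough illustration of what the evaluation logic might look like, the sketch below fills in run_evaluation with a bounded-concurrency loop. It is not the project's actual interface: StubTask is a hypothetical stand-in for EvaluationTask (whose real fields are not shown here), and the agent callable, max_concurrency parameter, and result dictionary layout are all assumptions made for the example.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class StubTask:
    """Hypothetical stand-in for aigise's EvaluationTask."""
    task_id: str
    prompt: str


class MyEvaluation:
    def __init__(self, agent: Callable[[str], Awaitable[str]],
                 max_concurrency: int = 4):
        self.agent = agent                      # assumed: async callable, prompt -> output
        self.max_concurrency = max_concurrency  # assumed tuning knob

    async def _run_one(self, sem: asyncio.Semaphore, task: StubTask) -> dict:
        # Limit how many tasks run at once, and record failures
        # instead of letting one bad task abort the whole run.
        async with sem:
            try:
                output = await self.agent(task.prompt)
                return {"task_id": task.task_id, "output": output, "error": None}
            except Exception as exc:
                return {"task_id": task.task_id, "output": None, "error": repr(exc)}

    async def run_evaluation(self, tasks):
        sem = asyncio.Semaphore(self.max_concurrency)
        # gather() returns results in the same order as the input tasks.
        return await asyncio.gather(*(self._run_one(sem, t) for t in tasks))


async def echo_agent(prompt: str) -> str:
    """Toy agent used only for this example."""
    return prompt.upper()


if __name__ == "__main__":
    demo_tasks = [StubTask("t1", "hi"), StubTask("t2", "yo")]
    print(asyncio.run(MyEvaluation(echo_agent).run_evaluation(demo_tasks)))
```

Catching exceptions per task keeps a single failing task from discarding the results of the others; whether that is the right policy depends on the benchmark.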

Configuration

Create a configuration template in src/aigise/evaluations/configs/:

# my_benchmark_config.toml
task_name = "my_benchmark"
# ... evaluation-specific config

Data Handling

1. Load benchmark data
2. Run agents on tasks
3. Collect results
4. Generate metrics
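
The four stages above can be sketched end to end as follows. This is an illustrative outline under stated assumptions, not the project's data-handling code: the JSONL record layout (id, input, expected), the run_agent stub, exact-match scoring, and the accuracy metric are all hypothetical choices made for the example.

```python
import asyncio
import json


def load_benchmark_data(jsonl_text: str) -> list[dict]:
    """Stage 1: parse one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


async def run_agent(task: dict) -> str:
    """Stage 2: hypothetical agent stub that reverses the input string."""
    return task["input"][::-1]


async def collect_results(tasks: list[dict]) -> list[dict]:
    """Stage 3: run the agent on every task and pair outputs with expectations."""
    outputs = await asyncio.gather(*(run_agent(t) for t in tasks))
    return [
        {"id": t["id"], "expected": t["expected"], "actual": out,
         "passed": out == t["expected"]}  # assumed scoring rule: exact match
        for t, out in zip(tasks, outputs)
    ]


def generate_metrics(results: list[dict]) -> dict:
    """Stage 4: aggregate per-task results into summary numbers."""
    total = len(results)
    passed = sum(r["passed"] for r in results)
    return {"total": total, "passed": passed,
            "accuracy": passed / total if total else 0.0}


SAMPLE = """
{"id": "a", "input": "abc", "expected": "cba"}
{"id": "b", "input": "xy", "expected": "xy"}
"""

if __name__ == "__main__":
    tasks = load_benchmark_data(SAMPLE)
    results = asyncio.run(collect_results(tasks))
    print(generate_metrics(results))
```

Keeping each stage as a separate function makes it easier to test data loading and metric computation without running an agent.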

See Also

Development Guides - Other development guides
Testing Debugging - Testing evaluations