Evaluations - Batch Processing Entry Point¶
Evaluation scripts run agents on benchmark datasets for performance measurement and testing.
Command¶
cd src/aigise/evaluations
python cybergym/cybergym_vul_detection.py run \
--agent-id my_agent \
--config-path /path/to/config.toml \
--max_llm_calls 75 \
--use_multiprocessing \
--max_workers 3
Step-by-Step Workflow¶
Step 1: Script Initialization¶
- Fire library parses command-line arguments
- Creates an `Evaluation` class instance with parameters:
  - `agent_id`: Identifier for the agent
  - `config_path`: Path to TOML configuration
  - `max_llm_calls`: Maximum LLM calls per task
  - `use_multiprocessing`: Use processes vs. threads
  - `max_workers`: Number of parallel workers
- Sets up logging and instrumentation (Langfuse, OpenTelemetry)
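As a rough sketch of how Fire wires the command line to the class (the method layout and defaults here are illustrative, not the actual implementation):

```python
# Illustrative skeleton of a Fire-based entry point; the real Evaluation
# class has many more responsibilities (dataset loading, parallelism, etc.).
import fire


class Evaluation:
    def run(
        self,
        agent_id: str,
        config_path: str,
        max_llm_calls: int = 100,          # defaults here are placeholders
        use_multiprocessing: bool = False,
        max_workers: int = 1,
    ):
        """Store parameters, set up logging/instrumentation, then run Steps 2-7."""
        self.agent_id = agent_id
        self.config_path = config_path
        self.max_llm_calls = max_llm_calls
        self.use_multiprocessing = use_multiprocessing
        self.max_workers = max_workers
        ...


if __name__ == "__main__":
    fire.Fire(Evaluation)  # `python script.py run --agent-id ...` maps flags onto run()
```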
Step 2: Load Dataset¶
- Loads benchmark dataset (e.g., HuggingFace datasets, JSON files)
- Dataset contains multiple samples/tasks to evaluate
- Example: CyberGym dataset has vulnerability detection tasks
- Each sample contains:
  - Task description
  - Expected outputs (ground truth)
  - Metadata (file paths, vulnerability info, etc.)
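What dataset loading can look like, as a hedged sketch; the concrete dataset name, split, and format are defined by each benchmark script:

```python
# Sketch only: each benchmark script decides whether samples come from a
# HuggingFace dataset or a local JSON file.
import json
from datasets import load_dataset


def load_benchmark(path_or_name: str):
    if path_or_name.endswith(".json"):
        with open(path_or_name) as f:
            return json.load(f)                   # list of sample dicts
    return load_dataset(path_or_name, split="test")  # iterable of samples
```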
Step 3: Prepare General Environment (_prepare_general_env)¶
This sets up shared resources used across all evaluation tasks.
3.1 Create Base Configuration¶
- Loads base configuration from TOML file
- Expands template variables
- Stores in class for later use
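A minimal sketch of config loading, assuming `${VAR}`-style placeholders; the real template syntax and schema live in the TOML files themselves:

```python
# Sketch: read the TOML file, expand placeholders, parse into a dict.
import tomllib                      # Python 3.11+; use the `tomli` package on older versions
from string import Template


def load_base_config(config_path: str, variables: dict) -> dict:
    with open(config_path, "r") as f:
        raw = f.read()
    expanded = Template(raw).safe_substitute(variables)  # expand ${VAR} placeholders
    return tomllib.loads(expanded)
```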
3.2 Setup Evaluation Directories¶
self.eval_output_dir = Path(f"evals/{self.agent_id}/...")
self.eval_output_dir.mkdir(parents=True, exist_ok=True)
- Creates output directories for results
- Structure: `evals/{agent_id}/{benchmark_name}/{timestamp}/`
- Stores agent outputs, logs, artifacts
Step 4: Generate Samples (Parallel Execution)¶
The evaluation runs tasks in parallel. Choose one mode:
Mode A: Multiprocessing (generate())¶
with ProcessPoolExecutor(max_workers=self.max_workers) as executor:
futures = {
executor.submit(_run_sample_in_process, self, sample): sample
for sample in self.dataset
}
- Each sample runs in separate process
- True parallelism (bypasses Python GIL)
- Processes are isolated (no shared memory)
- Requires serializable data
Mode B: Multithreading (generate_threaded())¶
with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
futures = {
executor.submit(run_sample_in_thread, sample): sample
for sample in self.dataset
}
- Each sample runs in separate thread
- Shared memory (can share resources)
- Limited by GIL for CPU-bound tasks
- Better for I/O-bound operations
Mode C: Single Thread (generate_single_thread())¶
- Sequential execution, one sample at a time
- Used for debugging
- Easier to debug issues
- Much slower
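Whichever mode is chosen, the pending futures are usually drained with `concurrent.futures.as_completed` so that one failing sample does not abort the whole run. A simplified sketch for Mode A:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

results, errors = [], []
with ProcessPoolExecutor(max_workers=self.max_workers) as executor:
    futures = {
        executor.submit(_run_sample_in_process, self, sample): sample
        for sample in self.dataset
    }
    for future in as_completed(futures):
        sample = futures[future]
        try:
            results.append(future.result())
        except Exception as exc:        # log and continue; do not abort the batch
            errors.append((sample, exc))
```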
Step 5: Process Each Sample (_generate_sample or _run_sample_in_process)¶
For each sample in the dataset:
5.1 Create Evaluation Task¶
- Extracts sample data
- Creates an `EvaluationTask` object with:
  - `session_id`: Unique ID for this task
  - `sample`: Original sample data
  - `aigise_session`: Will be created next
  - Metadata (task name, description, etc.)
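Conceptually, the task object just bundles everything one sample needs; a rough dataclass sketch (any field beyond those listed above is illustrative):

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class EvaluationTask:
    session_id: str                         # unique ID for this task
    sample: dict                            # original sample data
    prompt: str                             # task description sent to the agent
    aigise_session: Optional[Any] = None    # created in step 5.2
    metadata: dict = field(default_factory=dict)  # task name, description, etc.
```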
5.2 Create SAGE-X Session¶
aigise_session = get_aigise_session(
aigise_session_id=task.session_id,
config_path=self.config_path
)
- Creates isolated SAGE-X session for this task
- Loads configuration
- Each task gets its own session (isolation)
5.3 Prepare Task-Specific Environment (_prepare_environment)¶
This is benchmark-specific. Example for CyberGym:
- Extract code/data:
  - Extracts source code to sandbox
  - Copies test files, build scripts
  - Sets up project structure
- Initialize sandboxes:
  - Creates shared volumes
  - Launches required sandbox containers
  - Initializes sandboxes (tools, dependencies)
- Set source directory:
  - Tells tools where to find source code
- Git repository setup (if applicable):
  - Finds git repository in sandbox
  - Checks out main/master branch
  - Updates `src_dir_in_sandbox` to repo path
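In outline, a benchmark-specific `_prepare_environment` follows the steps above; the helper and method names in this sketch are placeholders, not the actual session API:

```python
# Sketch only: every name marked "placeholder" is assumed, not real.
async def _prepare_environment(self, task):
    session = task.aigise_session

    # 1. Extract code/data: source tree, test files, build scripts.
    src_dir = extract_sample_sources(task.sample, session)   # placeholder helper

    # 2. Initialize sandboxes: shared volumes, containers, tools/dependencies.
    await start_sandboxes(session)                            # placeholder helper

    # 3. Set source directory so tools know where the code lives.
    session.src_dir_in_sandbox = src_dir

    # 4. Git repository setup (if applicable): find the repo, check out
    #    main/master, and update src_dir_in_sandbox to the repo path.
    ...
```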
5.4 Load Agent¶
- Imports agent module
- Calls the `mk_agent()` function with the session ID
- Agent is configured for this specific session
- Agent has access to task-specific sandboxes and resources
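Dynamic agent loading might look like this; only the `mk_agent()` entry point is taken from the step above, and the module path shown is illustrative:

```python
import importlib


def load_agent(agent_module: str, session_id: str):
    """Import the agent module and build an agent bound to this session."""
    module = importlib.import_module(agent_module)   # e.g. "agents.my_agent" (illustrative path)
    return module.mk_agent(session_id)               # mk_agent() receives the session ID
```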
5.5 Create ADK Session and Runner¶
inner_session_service = InMemorySessionService()
await inner_session_service.create_session(
app_name=app_name,
user_id=self.user_id + "_" + meta_data,
session_id=task.session_id,
state={"aigise_session_id": task.session_id},
)
runner = Runner(
agent=agent,
app_name=app_name,
session_service=inner_session_service,
)
- Creates ADK session that maps to SAGE-X session
- Stores `aigise_session_id` in session state
- Creates ADK Runner for agent execution
5.6 Run Agent¶
run_config = RunConfig(max_llm_calls=self.max_llm_calls)
async for event in runner.run_async(
    user_id=user_id,
    session_id=task.session_id,
    run_config=run_config,
    new_message=types.Content(parts=[types.Part(text=task.prompt)]),
):
    # Process events
    if isinstance(event, types.FunctionResponse):
        ...  # Tool execution results
    elif isinstance(event, types.Candidate):
        ...  # Agent responses
- Runner starts agent execution:
  - Sends prompt to agent
  - Agent enters reason-act loop
- Agent reasoning:
  - Calls LLM for reasoning
  - Decides which tools to use
  - Generates function calls
- Tool execution:
  - Runner executes tools in sandbox
  - Tools access session resources
  - Results returned to agent
- Iteration:
  - Agent processes tool results
  - Decides next action
  - Continues until completion or max calls
- Completion:
  - Agent generates final response
  - Runner finishes execution
  - Events collected
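After the loop finishes, the final answer has to be pulled out of the collected events. A defensive sketch that avoids assuming a particular event class hierarchy:

```python
# Sketch: collect every event and keep the text parts; the last text
# emitted is usually the agent's final answer.
events, text_parts = [], []

async for event in runner.run_async(
    user_id=user_id,
    session_id=task.session_id,
    run_config=run_config,
    new_message=types.Content(parts=[types.Part(text=task.prompt)]),
):
    events.append(event)
    content = getattr(event, "content", None)
    if content and getattr(content, "parts", None):
        for part in content.parts:
            if getattr(part, "text", None):
                text_parts.append(part.text)

agent_response = text_parts[-1] if text_parts else ""
```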
5.7 Collect Results¶
result = {
"session_id": task.session_id,
"prompt": task.prompt,
"response": agent_response,
"events": events,
"metadata": {...},
}
- Extracts agent response
- Collects execution metadata:
  - Number of LLM calls
  - Tools used
  - Execution time
  - Errors (if any)
5.8 Save Results¶
- Saves result to file (JSON)
- Location: `evals/{agent_id}/{benchmark}/results/{task_id}.json`
- Includes full event history for analysis
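Persisting a result is plain JSON serialization; a sketch, assuming `result` is the dictionary from step 5.7 and that events have been reduced to JSON-serializable form (file naming here is illustrative):

```python
import json

# self.eval_output_dir is the directory created in step 3.2
results_dir = self.eval_output_dir / "results"
results_dir.mkdir(parents=True, exist_ok=True)

result_path = results_dir / f"{task.session_id}.json"
result_path.write_text(json.dumps(result, indent=2, default=str))  # default=str handles non-JSON types
```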
5.9 Cleanup Task Session¶
- Stops sandbox containers
- Removes shared volumes
- Cleans up session resources
- Frees Docker resources
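Cleanup is most reliable when it runs whether or not the agent succeeded, which is the usual try/finally pattern; the names below are placeholders for whatever the session object actually exposes:

```python
try:
    result = await run_agent_on_task(task)       # steps 5.4-5.8 (placeholder helper)
finally:
    # Always release sandboxes, shared volumes, and Docker resources,
    # even if the agent run raised an exception.
    await cleanup_session(task.aigise_session)   # placeholder for the real cleanup call
```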
Step 6: Collect All Results¶
After all samples complete:
- Aggregates results from all tasks
- Collects statistics:
  - Success rate
  - Average execution time
  - Tool usage patterns
  - Error rates
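Aggregation can be as simple as reading the result files back in and summarizing; a sketch (the `errors` field mirrors the metadata collected in step 5.7, other details are illustrative):

```python
import json

result_files = sorted((self.eval_output_dir / "results").glob("*.json"))
results = [json.loads(p.read_text()) for p in result_files]

total = len(results)
failed = sum(1 for r in results if r.get("metadata", {}).get("errors"))
print(f"{total - failed}/{total} tasks completed without errors")
```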
Step 7: Evaluate Results (evaluate())¶
- Load ground truth:
  - Loads expected outputs from dataset
  - Loads agent results from files
- Compare outputs:
  - Compares agent output vs ground truth
  - Calculates metrics:
    - Accuracy
    - Precision/Recall (if applicable)
    - Custom benchmark metrics
- Generate report:
  - Creates evaluation report
  - Includes metrics, statistics, examples
  - Saves to `evals/{agent_id}/{benchmark}/evaluation_report.json`
- Display summary:
  - Prints metrics to console
  - Shows top failures/successes
  - Provides analysis
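For a simple benchmark the comparison reduces to matching predictions against ground truth; a sketch with illustrative field names:

```python
def accuracy(results: list[dict], ground_truth: dict) -> float:
    """ground_truth maps session_id -> expected label; each result carries a prediction."""
    if not results:
        return 0.0
    correct = sum(
        1 for r in results
        if r.get("prediction") == ground_truth.get(r["session_id"])
    )
    return correct / len(results)
```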
Key Characteristics¶
Isolation¶
- Each task gets its own SAGE-X session
- Separate sandbox containers
- No interference between tasks
Parallelism¶
- Multiple tasks run simultaneously
- Configurable worker count
- Process or thread-based execution
Reproducibility¶
- Deterministic task execution
- Results saved with full event history
- Can replay specific tasks
Resource Management¶
- Sessions cleaned up after each task
- Containers stopped and removed
- Prevents resource leaks
Comparison with aigise web¶
| Aspect | aigise web | Evaluations |
|---|---|---|
| Purpose | Development, debugging | Performance measurement |
| Sessions | Single long-lived session | Multiple short-lived sessions |
| Interaction | Interactive chat | Batch processing |
| Parallelism | Single user | Multiple tasks in parallel |
| Cleanup | Manual (on exit) | Automatic (per task) |
| Output | Real-time events | Saved results files |
Example Evaluation Flow¶
Dataset (100 tasks)
↓
Process Pool (3 workers)
├─ Worker 1: Task 1 → Session 1 → Agent → Result 1
├─ Worker 2: Task 2 → Session 2 → Agent → Result 2
├─ Worker 3: Task 3 → Session 3 → Agent → Result 3
├─ Worker 1: Task 4 → Session 4 → Agent → Result 4
...
↓
All Results Collected
↓
Evaluation (compare vs ground truth)
↓
Report Generated
Related Topics¶
- Entry Points - Overview of entry points
- Core Concepts - Understanding sessions
- Testing Debugging - Debugging evaluations