Setup
This page has instructions for setting up all the software you need to develop tasks on your computer. First, make sure you have Git, Node.js (with npm), and Docker installed.
To get started, head over to our task template on GitHub. This repository contains everything you need to develop a basic task. Click the green “Use this template” button to create a private repository for your own task, then clone it to your computer.
All the files you’ll be modifying are in the my_task folder. Everything else is there to help you run and test the task.
my_task/my_task.py implements a simple example task where the agent has to add two numbers. (Technically, it implements a “task family”: a group of tasks that share code. Right now there’s only one task in the task family.)
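To give a feel for the structure, here is a minimal sketch of what such a task family might look like. The method names (get_tasks, get_instructions, score) follow the Task Standard, but the task data and wording below are illustrative, not a copy of the real file:

```python
# Hypothetical sketch of a task family for the addition example.
# Method names follow the Task Standard; the data is illustrative.
class TaskFamily:
    standard_version = "0.2.3"  # assumed version string

    @staticmethod
    def get_tasks() -> dict:
        # Each key is a task name; each value holds that task's data.
        return {"addition": {"a": 1234567, "b": 2345678}}

    @staticmethod
    def get_instructions(t: dict) -> str:
        return f"Add {t['a']} and {t['b']} and submit the result."

    @staticmethod
    def score(t: dict, submission: str) -> float:
        # Full credit for the exact sum, zero otherwise.
        return 1.0 if submission.strip() == str(t["a"] + t["b"]) else 0.0
```

The workbench calls get_tasks to find the task you named on the command line, shows the agent the string from get_instructions, and passes the contents of submission.txt to score.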
Running a task
Your task repository includes a “workbench” tool for running tasks inside a Docker container. Open up a terminal window, navigate to the repository, and run these commands:
cd drivers
npm install
cd ../workbench
npm install
npm run task -- ../my_task addition
You should see something like this:
=== Task container set up ===
To access the container as the root user:
docker exec -it -u root -w /root metr-task-standard-container-my_task-addition-1561014711 bash
To access it as the agent user:
docker exec -it -u agent -w /home/agent metr-task-standard-container-my_task-addition-1561014711 bash
Complete the task by writing a submission to /home/agent/submission.txt in the container. Then, the task will be automatically scored.
Waiting for /home/agent/submission.txt to exist...
If you get a path error, open drivers/DriverImpl.ts, change baseDir to /home/agent, and change the line that reads taskHelperCode = readFromAncestor('./task-standard/drivers/taskhelper.py') to taskHelperCode = readFromAncestor('./drivers/taskhelper.py').
If you get an error about Docker, make sure Docker is running and you have the latest version installed.
Now open a new terminal and run that second docker exec command. This will give you a shell into the Docker container where the task is running. When an AI agent attempts the task, it will have a similar text-based shell where it can edit files and run commands. Instructions and other files for the agent to work with are stored in /home/agent. The code implementing the task is stored in /root, which the agent can’t access (otherwise it could cheat!).
When the agent completes a task, it submits a final answer for scoring. You can simulate this on the workbench by writing to the file /home/agent/submission.txt:
echo "3580245" > /home/agent/submission.txt
Go back to the first terminal to see the final score for the task. If you gave the correct answer, it should be 1.0!
Don’t mention instructions.txt or submission.txt in your task. Instead, write instructions in TaskFamily.get_instructions and advise the agent to “submit” its solution. The agent will know how to do this.
Testing a task
The example task family also comes with automated tests in my_task/my_task_test.py. You can run these using the workbench:
cd workbench
npm run test ../my_task addition my_task_test.py
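The real test file imports the task family from my_task.py; as a rough sketch of the kind of check such a test performs (with a stand-in TaskFamily defined inline so the example stands alone, and hypothetical task data):

```python
# Stand-in for the TaskFamily normally imported from my_task.py,
# defined inline here so this sketch runs on its own.
class TaskFamily:
    @staticmethod
    def get_tasks() -> dict:
        return {"addition": {"a": 1234567, "b": 2345678}}

    @staticmethod
    def score(t: dict, submission: str) -> float:
        return 1.0 if submission.strip() == str(t["a"] + t["b"]) else 0.0


def test_correct_submission_scores_full_marks():
    t = TaskFamily.get_tasks()["addition"]
    assert TaskFamily.score(t, "3580245") == 1.0


def test_wrong_submission_scores_zero():
    t = TaskFamily.get_tasks()["addition"]
    assert TaskFamily.score(t, "999") == 0.0
```

Tests like these let you verify your scoring logic without waiting for an agent (or yourself) to attempt the task in the container.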
Next steps
If you haven’t already, now is a good time to look over all the methods in my_task.py. Try making simple changes, such as adding a multiplication task, and running them with the workbench.
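For example, a multiplication task could be added by extending get_tasks and score. This is a hypothetical sketch with illustrative operands, not the template’s actual code:

```python
# Hypothetical task family with both an addition and a
# multiplication task; data keys and operands are illustrative.
class TaskFamily:
    standard_version = "0.2.3"  # assumed version string

    @staticmethod
    def get_tasks() -> dict:
        return {
            "addition": {"a": 1234567, "b": 2345678, "op": "+"},
            "multiplication": {"a": 123, "b": 456, "op": "*"},
        }

    @staticmethod
    def get_instructions(t: dict) -> str:
        return f"Compute {t['a']} {t['op']} {t['b']} and submit the result."

    @staticmethod
    def score(t: dict, submission: str) -> float:
        expected = t["a"] + t["b"] if t["op"] == "+" else t["a"] * t["b"]
        return 1.0 if submission.strip() == str(expected) else 0.0
```

You could then run the new task with npm run task -- ../my_task multiplication, the same way you ran the addition task.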
On the next page, we’ll discuss how you can design a task to be useful for evaluating AI agent capabilities.