Setup

ℹ️
These instructions are slightly out of date. We now suggest you use Vivaria to develop tasks. You can find setup instructions for Vivaria here.

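This page has instructions for setting up all the software you need to develop tasks on your computer. The steps below use Git (to clone the repository), npm (to install and run the workbench), and Docker (to run the task container), so make sure all three are installed first.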

To get started, head over to our task template on GitHub. This repository contains everything you need to develop a basic task. Click the green “Use this template” button to create a private repository for your own task, then clone it to your computer.

All the files you’ll be modifying are in the my_task folder. Everything else is there to help you run and test the task.

my_task/my_task.py implements a simple example task where the agent has to add two numbers. (Technically, it implements a “task family”, which is a group of tasks that share code. Right now there’s only one task in the task family.)
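To make the idea of a task family concrete, here is a minimal sketch of what such a class could look like. Only TaskFamily.get_instructions is named on this page; the Task type, get_tasks, score, and the two operands are assumptions made for illustration (the operands were picked so that their sum matches the answer used later on this page). The real my_task.py may be organised differently.

```python
from typing import TypedDict


class Task(TypedDict):
    # Per-task data shared by every method in the family (hypothetical fields).
    a: int
    b: int


class TaskFamily:
    @staticmethod
    def get_tasks() -> dict[str, Task]:
        # One entry per task; the key is the task name given to the workbench.
        return {"addition": {"a": 1234567, "b": 2345678}}

    @staticmethod
    def get_instructions(t: Task) -> str:
        # Instructions are generated from the task data, not stored in a file.
        return f"Add {t['a']} and {t['b']}, then submit the result as your answer."

    @staticmethod
    def score(t: Task, submission: str) -> float:
        # 1.0 for a correct answer, 0.0 for anything else.
        return 1.0 if submission.strip() == str(t["a"] + t["b"]) else 0.0
```

Because every method receives the task's data as an argument, adding a second task to the family only requires another entry in get_tasks.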

Running a task

Your task repository includes a “workbench” tool for running tasks inside a Docker container. Open up a terminal window, navigate to the repository, and run these commands:

cd drivers
npm install
cd ../workbench
npm install
npm run task -- ../my_task addition

You should see something like this:

 === Task container set up ===
To access the container as the root user:

  docker exec -it -u root -w /root metr-task-standard-container-my_task-addition-1561014711 bash

To access it as the agent user:

  docker exec -it -u agent -w /home/agent metr-task-standard-container-my_task-addition-1561014711 bash

Complete the task by writing a submission to /home/agent/submission.txt in the container. Then, the task will be automatically scored.

Waiting for /home/agent/submission.txt to exist...

If you get a path error, make two changes in drivers/DriverImpl.ts: set baseDir to /home/agent, and change the line that reads taskHelperCode = readFromAncestor('./task-standard/drivers/taskhelper.py') to taskHelperCode = readFromAncestor('./drivers/taskhelper.py').

If you get an error about Docker, make sure that Docker is running and that you have the latest version installed.

Now open a new terminal and run the second docker exec command from the output above (the one that accesses the container as the agent user). This gives you a shell inside the Docker container where the task is running. When an AI agent attempts the task, it will have a similar text-based shell where it can edit files and run commands. Instructions and other files for the agent to work with are stored in /home/agent. The code implementing the task is stored in /root, which the agent can’t access (otherwise it could cheat!).

When the agent completes a task, it submits a final answer for scoring. You can simulate this on the workbench by writing to the file /home/agent/submission.txt.

echo "3580245" > /home/agent/submission.txt

Go back to the first terminal to see the final score for the task. If you gave the correct answer, it should be 1.0!
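The wait-then-score flow described above can be sketched roughly as follows. This is an illustration of the behaviour, not the workbench's actual implementation: the function names, the polling interval, and the timeout are all hypothetical, and a temporary directory stands in for /home/agent.

```python
import tempfile
import time
from pathlib import Path


def wait_for_submission(path: Path, poll_seconds: float = 0.1, timeout: float = 5.0) -> str:
    """Poll until the submission file exists, then return its contents."""
    deadline = time.monotonic() + timeout
    while not path.exists():
        if time.monotonic() > deadline:
            raise TimeoutError(f"no submission at {path}")
        time.sleep(poll_seconds)
    return path.read_text()


def score(expected: str, submission: str) -> float:
    # echo appends a trailing newline, so strip whitespace before comparing.
    return 1.0 if submission.strip() == expected else 0.0


# A temporary directory stands in for /home/agent in this sketch.
with tempfile.TemporaryDirectory() as home:
    submission_file = Path(home) / "submission.txt"
    submission_file.write_text("3580245\n")  # same effect as the echo command
    result = score("3580245", wait_for_submission(submission_file))
    print(result)
```

Stripping whitespace before comparing matters here, because a submission written with echo ends in a newline.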

ℹ️
The method for receiving instructions and submitting an answer will vary depending on how a task is run. To ensure compatibility, do not refer to instructions.txt or submission.txt in your task. Instead, write instructions in TaskFamily.get_instructions and advise the agent to “submit” its solution. The agent will know how to do this.
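One way to guard against this is a small check that flags instructions mentioning workbench-specific files. The helper below is hypothetical (it is not part of the task template) and both example strings are invented for illustration.

```python
# File names that only exist when the task is run on the workbench.
WORKBENCH_ONLY_NAMES = ("instructions.txt", "submission.txt")


def find_workbench_references(instructions: str) -> list[str]:
    """Return the workbench-only file names mentioned in the instructions."""
    lowered = instructions.lower()
    return [name for name in WORKBENCH_ONLY_NAMES if name in lowered]


# Portable: tells the agent to "submit" without saying how.
portable = "Add the two numbers, then submit the result as your answer."
# Fragile: only works when the task is run on the workbench.
fragile = "Write the sum to /home/agent/submission.txt."
```

A check like this could be called from a test on the string returned by TaskFamily.get_instructions.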

Testing a task

The example task family also comes with automated tests in my_task/my_task_test.py. You can run these using the workbench.

cd workbench
npm run test -- ../my_task addition my_task_test.py
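As a rough idea of what a test for the addition task might check, here is a sketch written as plain assert-based test functions. The TaskFamily stub, its method names, and the operands are assumptions standing in for the real my_task.py; the actual my_task_test.py may rely on fixtures supplied by the workbench and look quite different.

```python
class TaskFamily:
    """Stand-in for the class in my_task.py (assumed shape, not the real file)."""

    @staticmethod
    def get_tasks() -> dict[str, dict[str, int]]:
        return {"addition": {"a": 1234567, "b": 2345678}}

    @staticmethod
    def score(t: dict[str, int], submission: str) -> float:
        return 1.0 if submission.strip() == str(t["a"] + t["b"]) else 0.0


def test_correct_answer_scores_one() -> None:
    task = TaskFamily.get_tasks()["addition"]
    assert TaskFamily.score(task, "3580245") == 1.0


def test_wrong_answer_scores_zero() -> None:
    # A non-numeric submission should score 0.0 rather than raise an error.
    task = TaskFamily.get_tasks()["addition"]
    assert TaskFamily.score(task, "not a number") == 0.0
```

Testing both a correct and an incorrect submission helps catch a scoring function that accidentally accepts everything.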

Next steps

If you haven’t already, now is a good time to look over all the methods in my_task.py. Try making simple changes, such as adding a multiplication task, and running them with the workbench.
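As one possible shape for the multiplication exercise, the sketch below extends a family so that both tasks share the same instruction and scoring code. The OPERATIONS table, the "op" field, the operands, and all method names other than get_instructions are hypothetical; adapt them to whatever my_task.py actually contains.

```python
import operator
from typing import Callable

# Map each task name to the operation it asks for (hypothetical layout).
OPERATIONS: dict[str, Callable[[int, int], int]] = {
    "addition": operator.add,
    "multiplication": operator.mul,
}


class TaskFamily:
    @staticmethod
    def get_tasks() -> dict[str, dict]:
        # Both tasks reuse the same operands; only the operation differs.
        return {name: {"op": name, "a": 1234567, "b": 2345678} for name in OPERATIONS}

    @staticmethod
    def get_instructions(t: dict) -> str:
        return f"Apply {t['op']} to {t['a']} and {t['b']}, then submit the result."

    @staticmethod
    def score(t: dict, submission: str) -> float:
        expected = OPERATIONS[t["op"]](t["a"], t["b"])
        return 1.0 if submission.strip() == str(expected) else 0.0
```

With a change along these lines in place, the new task could presumably be run the same way as the first one, for example npm run task -- ../my_task multiplication.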

On the next page, we’ll discuss how you can design a task to be useful for evaluating AI agent capabilities.