Implementation

Refer to the Setup page for instructions on how to get a basic task up and running. When you're developing your real task, rename my_task, my_task.py, and my_task_test.py to match the actual name of your task. You can also remove the explanatory comments in those files.

The rest of this page has additional information that may be useful for developing more complex tasks.

Example implementations

  • The Task Standard contains some simple example tasks that demonstrate how to use certain features, such as aux VMs.
  • We’ve also published an initial version of our task suite, with source code for many fully implemented tasks.

Testing

All tasks should have automated tests to help us verify that they work correctly. Tests go in a separate file with a name like my_task/my_task_test.py, and they generally look like this:

def test_example(task_family: TaskFamily, task_name: str, task: Task):
    # Assert that the task environment is set up in the way you expect it to be.
    # For example, check that necessary files are present and software is installed correctly.

    # Modify the task environment in some way. For example, create or update files, or SSH into aux VMs and modify them.
    # The test could go as far as solving the entire task, or it could implement a partial or incorrect solution.

    # Define a submission.
    submission = "..."

    score = task_family.score(task, submission)

    # Assert that the returned score is what you expect it to be.
    assert score == 1

QA runs provide a good starting point for tests. Often, you can write a test that simply recreates the end state of your QA run (for example, by copying files from /root/meta/qa into /home/agent), then submits, and asserts that the score is what you expected. If you can think of other tests that would be helpful, we encourage you to add them too!

You can use the workbench to run your tests during development. See the workbench README for instructions.

Common code

The task template includes a directory my_task/common with code that might be useful in many different tasks. For example, if your task involves monitoring the agent’s CPU usage, you could do this in my_task/my_task.py:

from common import monitor_cpu

If you write new code as part of your task that might be useful for other tasks, you can add a file to the common directory. However, please don’t modify existing files in this directory, as other tasks may depend on them. If you need to adapt something to fit your task, feel free to copy/paste the relevant code into a new file outside of the common directory.

Auxiliary virtual machines, GPUs

Some tasks are not easily containerizable, for example:

  • The task of breaking out of a Docker container
  • The task of installing and configuring a Kubernetes cluster
  • Tasks that require setting up networks of machines using Docker Compose
  • Machine learning tasks that give the agent access to GPUs in a safely sandboxed manner

For this reason, TaskFamily classes may define a get_aux_vm_spec static method. This method returns a simple declarative specification for an auxiliary VM or “aux VM” for short: a virtual machine that’s connected over the network to the main task machine. The declarative specification is agnostic to the underlying technologies or clouds used to establish the virtual machine.

Here’s an example of a task family that uses an aux VM with a GPU:

class TaskFamily:
    standard_version = "0.2.2"

    # ...

    required_environment_variables = [
        "VM_SSH_USERNAME",
        "VM_SSH_PRIVATE_KEY",
        "VM_IP_ADDRESS",
    ]

    @staticmethod
    def get_permissions(t: Task) -> list[str]:
        # Aux VMs are connected over the internet, so this permission is necessary
        return ["full_internet"]

    @staticmethod
    def get_aux_vm_spec(t: Task) -> VMSpec:
        return VMSpec(
            cpu_count_range=(4, 4),
            cpu_architecture="x64",
            ram_gib_range=(16, 16),
            gpu_spec=GPUSpec(count_range=(1, 1), model="a10"),
            base_image_type="ubuntu-20.04-cuda",
        )

The file my_task/common/aux_vm_access.py contains helper functions for giving the agent SSH access, running commands, uploading files, etc.

If at all possible, write your task without relying on an aux VM, because (a) tasks without aux VMs are simpler and faster to build and run, and (b) aux VM support is harder for evals platforms to implement, so some platforms might only support tasks that don't require one.

Hiding information from the agent

Because the agent runs as the agent user in the task environment, it cannot access files and processes created/owned by root. If you want, your task's start method can run a root process that binds to localhost; the agent will be able to interact with this process but not inspect it. For example, start could run an HTTP server on localhost, and the task instructions could tell the agent to make HTTP requests to that URL. This could be useful for e.g. hacking/cybersecurity/other black-box challenges.
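A minimal sketch of this pattern, assuming start calls a helper like the hypothetical one below (the directory path and port are illustrative):

```python
import subprocess
import sys


def start_hidden_server(directory: str, port: int = 8080) -> subprocess.Popen:
    """Serve `directory` over HTTP, bound to localhost only.

    When called from TaskFamily.start (which runs as root), the server
    process is owned by root: the agent can send requests to
    http://localhost:<port>, but cannot read the served files or
    inspect the process.
    """
    return subprocess.Popen(
        [
            sys.executable,  # plain "python3" on the task image
            "-m",
            "http.server",
            str(port),
            "--bind",
            "127.0.0.1",  # localhost only; not reachable from outside
            "--directory",
            directory,
        ]
    )
```

The served directory (e.g. somewhere under /root) stays unreadable to the agent, even though the agent can freely query the endpoint.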

You can also write tasks that define an aux VM and hide information from the agent there. By default, agents don’t have SSH access to any user on the aux VM — it’s up to the task to grant the agent access if desired.

Variants

For some tasks, there’s a natural way to create “variants” that share the same general logic but have easier, harder, or slightly different objectives. When this is the case, we encourage you to include variants in your task family.

Variants that are a "strict subset" of the original task (e.g. a variant that simply removes some of the requirements for completing the task) don't need separate QA testing.

Variants that involve significant differences in the environment, starting state or task goals (e.g. variants starting from a partially-implemented codebase, or that involve implementing different functionality) will need additional QA testing.
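As an illustration, variants are just entries in the dict returned by get_tasks. In this hypothetical sketch, "easy" is a strict subset of "main" (it drops some requirements), so only "main" would need a separate QA run:

```python
class TaskFamily:
    standard_version = "0.2.2"

    # Hypothetical variants sharing the same general logic but with
    # easier/harder objectives. The task names and payloads are illustrative.
    @staticmethod
    def get_tasks() -> dict[str, dict]:
        return {
            "main": {"required_features": ["parse", "validate", "export"]},
            "easy": {"required_features": ["parse"]},
        }

    @staticmethod
    def get_instructions(t: dict) -> str:
        features = ", ".join(t["required_features"])
        return f"Implement the following features: {features}."
```

Because both variants share get_instructions, scoring, and setup logic, adding a variant costs little beyond deciding its parameters.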