Implementation

Refer to the Setup page for instructions on how to get a basic task up and running. When you're developing your real task, rename my_task, my_task.py, and my_task_test.py to match the actual name of your task. You can also remove the explanatory comments in those files.

The rest of this page has additional information that may be useful for developing more complex tasks.

Example implementations

  • The Task Standard contains some simple example tasks that demonstrate how to use certain features, such as aux VMs.
  • We’ve also published an initial version of our task suite, with source code for many fully implemented tasks.

Testing

All tasks should have automated tests to help us verify that they work correctly. Tests go in a separate file with a name like my_task/my_task_test.py, and they generally look like this:

def test_example(task_family: TaskFamily, task_name: str, task: Task):
    # Assert that the task environment is set up in the way you expect it to be.
    # For example, check that necessary files are present and software is installed correctly.

    # Modify the task environment in some way. For example, create or update files, or SSH into aux VMs and modify them.
    # The test could go as far as solving the entire task, or it could implement a partial or incorrect solution.

    # Define a submission.
    submission = "..."

    score = task_family.score(task, submission)

    # Assert that the returned score is what you expect it to be.
    assert score == 1

QA runs provide a good starting point for tests. Often, you can write a test that simply recreates the end state of your QA run (for example, by copying files from /root/meta/qa into /home/agent), then submits, and asserts that the score is what you expected. If you can think of other tests that would be helpful, we encourage you to add them too!

You can use the workbench to run your tests during development. See the workbench README for instructions.

Common code

The task template includes a directory my_task/common with code that might be useful in many different tasks. For example, if your task involves monitoring the agent’s CPU usage, you could do this in my_task/my_task.py:

from common import monitor_cpu

If you write new code as part of your task that might be useful for other tasks, you can add a file to the common directory. However, please don’t modify existing files in this directory, as other tasks may depend on them. If you need to adapt something to fit your task, feel free to copy/paste the relevant code into a new file outside of the common directory.

Auxiliary virtual machines, GPUs

Some tasks are not easily containerizable, for example:

  • The task of breaking out of a Docker container
  • The task of installing and configuring a Kubernetes cluster
  • Tasks that require setting up networks of machines using Docker Compose
  • Machine learning tasks that give the agent access to GPUs in a safely sandboxed manner

For this reason, TaskFamily classes may define a get_aux_vm_spec static method. This method returns a simple declarative specification for an auxiliary VM or “aux VM” for short: a virtual machine that’s connected over the network to the main task machine. The declarative specification is agnostic to the underlying technologies or clouds used to establish the virtual machine.

Here’s an example of a task family that uses an aux VM with a GPU:

class TaskFamily:
    standard_version = "0.2.2"

    # ...

    required_environment_variables = [
        "VM_SSH_USERNAME",
        "VM_SSH_PRIVATE_KEY",
        "VM_IP_ADDRESS",
    ]

    @staticmethod
    def get_permissions(t: Task) -> list[str]:
        # Aux VMs are connected over the internet, so this permission is necessary
        return ["full_internet"]

    @staticmethod
    def get_aux_vm_spec(t: Task) -> VMSpec:
        return VMSpec(
            cpu_count_range=(4, 4),
            cpu_architecture="x64",
            ram_gib_range=(16, 16),
            gpu_spec=GPUSpec(count_range=(1, 1), model="a10"),
            base_image_type="ubuntu-20.04-cuda",
        )

The file my_task/common/aux_vm_access.py contains helper functions for giving the agent SSH access, running commands, uploading files, etc.

If at all possible, write your task without relying on an aux VM, because (a) tasks without aux VMs are simpler and faster to build and run, and (b) aux VM support is harder for evals platforms to implement, so some platforms might only support tasks that don't require one.

Hiding information from the agent

Because the agent runs as the agent user in the task environment, it cannot access files and processes created/owned by root. If you want, your task's start method can run a root process that binds to localhost; the agent will be able to interact with this process but not inspect it. For example, start could run an HTTP server on localhost, and the task instructions could tell the agent to make HTTP requests to that URL. This could be useful for e.g. hacking/cybersecurity/other black-box challenges.
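A minimal sketch of this pattern, assuming start calls a helper like the hypothetical one below (the directory path and port are illustrative):

```python
import subprocess
import sys


def start_hidden_server(directory: str, port: int = 8080) -> subprocess.Popen:
    """Serve `directory` over HTTP, bound to localhost only.

    When called from TaskFamily.start (which runs as root), the server
    process is owned by root: the agent can send requests to
    http://localhost:<port>, but cannot read the served files or
    inspect the process.
    """
    return subprocess.Popen(
        [
            sys.executable,  # plain "python3" on the task image
            "-m",
            "http.server",
            str(port),
            "--bind",
            "127.0.0.1",  # localhost only; not reachable from outside
            "--directory",
            directory,
        ]
    )
```

The served directory (e.g. somewhere under /root) stays unreadable to the agent, even though the agent can freely query the endpoint.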

You can also write tasks that define an aux VM and hide information from the agent there. By default, agents don’t have SSH access to any user on the aux VM — it’s up to the task to grant the agent access if desired.

Variants

For some tasks, there’s a natural way to create “variants” that share the same general logic but have easier, harder, or slightly different objectives. When this is the case, we encourage you to include variants in your task family.

Variants that are a "strict subset" of the original task (e.g. a variant that simply removes some of the requirements for completing the task) don't need separate QA testing.

Variants that involve significant differences in the environment, starting state or task goals (e.g. variants starting from a partially-implemented codebase, or that involve implementing different functionality) will need additional QA testing.
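As an illustration, variants are just entries in the dict returned by get_tasks. In this hypothetical sketch, "easy" is a strict subset of "main" (it drops some requirements), so only "main" would need a separate QA run:

```python
class TaskFamily:
    standard_version = "0.2.2"

    # Hypothetical variants sharing the same general logic but with
    # easier/harder objectives. The task names and payloads are illustrative.
    @staticmethod
    def get_tasks() -> dict[str, dict]:
        return {
            "main": {"required_features": ["parse", "validate", "export"]},
            "easy": {"required_features": ["parse"]},
        }

    @staticmethod
    def get_instructions(t: dict) -> str:
        features = ", ".join(t["required_features"])
        return f"Implement the following features: {features}."
```

Because both variants share get_instructions, scoring, and setup logic, adding a variant costs little beyond deciding its parameters.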