Auxiliary Virtual Machines

Some tasks are not easily containerizable, for example:

  • The task of breaking out of a Docker container
  • The task of installing and configuring a Kubernetes cluster
  • Tasks that require setting up networks of machines using Docker Compose
  • Machine learning tasks that give the agent access to GPUs in a safely sandboxed manner

To support these tasks, the Task Standard includes a feature called “auxiliary virtual machines” or “aux VMs”. These VMs start alongside the main task container and run on a cloud computing service. Your task code, which runs in the main Docker container, can run commands on the aux VM via SSH. It can also grant SSH access to the agent, either as an administrator or with restrictions.

ℹ️ If possible, write your task without relying on an aux VM. This will make your task simpler and easier to test.

To use an aux VM, your task family needs 3 things: required environment variables, internet permissions, and a VM spec. Here’s an example of a task family that uses an aux VM with a GPU:

# These types come from the Task Standard's Python package
from metr_task_standard.types import GPUSpec, VMSpec

class TaskFamily:
    standard_version = "0.2.3"

    # ...

    # Tell the workbench to set admin credentials for the VM as environment variables
    # Note: these environment variables will not be available to the agent
    required_environment_variables = [
        "VM_SSH_USERNAME",
        "VM_SSH_PRIVATE_KEY",
        "VM_IP_ADDRESS",
    ]

    # Aux VMs are connected over the internet, so the task needs to have "full_internet" permission
    @staticmethod
    def get_permissions(t: Task) -> list[str]:
        return ["full_internet"]

    # Specify what kind of VM the task needs and how to set it up
    # Different tasks in a family can have different VM specs
    @staticmethod
    def get_aux_vm_spec(t: Task) -> VMSpec:
        return VMSpec(
            cpu_count_range=(4, 4),
            cpu_architecture="x64",
            ram_gib_range=(16, 16),
            gpu_spec=GPUSpec(count_range=(1, 1), model="a10"),
            base_image_type="ubuntu-20.04-cuda",
            build_steps=[
                {
                    "type": "file",
                    "source": "assets/starter_code.py",
                    "destination": "/home/ubuntu/starter_code.py"
                },
                {
                    "type": "shell",
                    "commands": [
                        "python3 -m pip install torch"
                    ]
                }
            ]
        )

AWS setup

The task workbench uses AWS to run aux VMs. If your task uses aux VMs, you’ll need to follow these steps before you can run it on the workbench.

  1. Sign up for an AWS account if you don’t already have one. Then make sure you’ve installed the AWS CLI and Packer with the Amazon plugin.
  2. Create an IAM user in the AWS console. Choose “attach policies directly” and create a new policy. Use the JSON configuration from here.
  3. Create an access key for the new IAM user.
  4. Run aws configure in your terminal and paste in the access key ID and secret access key you just created.
  5. Make sure that the VM spec you request matches an available EC2 instance type and base image.
  6. Now, if you run a task that requires an aux VM, the workbench will automatically start one for you. Make sure to monitor your AWS console and terminate any instances you no longer need.

Build steps

By default, every aux VM starts from a plain Debian or Ubuntu Linux image. You can customize the setup process by adding build steps to the aux VM spec. These steps are run using Packer. There are two types of build steps, file and shell, which correspond to Packer's File provisioner and Shell provisioner, respectively.

After the build steps finish, the resulting image is cached so it can be reused for future runs. Images are automatically rebuilt if the build steps (or any files referenced by a build step) have changed since the last run.

ℹ️ GPUs are not available during the build process. If any part of task setup requires a GPU, run that part instead from your task's start() method over SSH.

Interacting with aux VMs

The preferred way for task code to interact with an aux VM is over SSH, using the Paramiko library. You can use the helper function common.aux_vm_access.ssh_client() to get a paramiko.SSHClient instance connected to the aux VM as the admin user.
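For example (a minimal sketch, assuming the Task Standard's common directory is importable from your task code):

from common import aux_vm_access

# Returns a paramiko.SSHClient already connected to the aux VM as the admin user
ssh_client = aux_vm_access.ssh_client()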

Running commands

The SSHClient.exec_command() method allows you to run commands on the aux VM. The method returns a 3-tuple of (stdin, stdout, stderr), which are binary file-like objects representing the input and output streams of the command.

Note that exec_command() returns immediately after the command has started. To wait for the command to finish, call stdout.channel.recv_exit_status(), which will block until the command exits.
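For example, to run the starter code uploaded during the build steps above and wait for it to finish (a sketch; the error handling is up to you):

# exec_command() returns immediately; the command keeps running on the aux VM
stdin, stdout, stderr = ssh_client.exec_command("python3 /home/ubuntu/starter_code.py")

# Block until the remote command exits, then collect its output
exit_status = stdout.channel.recv_exit_status()
output = stdout.read().decode()
if exit_status != 0:
    raise RuntimeError(f"Command exited with {exit_status}: {stderr.read().decode()}")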

Uploading and downloading files

If you need to upload files to or download files from the aux VM after the task has started, Paramiko supports SFTP. You can get an SFTP client from an existing SSH client using SSHClient.open_sftp().

Executing multiple SFTP operations is slow, so if you want to download or upload many files at once, it will be more efficient to compress the files into a tar or ZIP archive before transferring them.
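For example, to download a whole directory in one transfer (a sketch; /home/ubuntu/results is a hypothetical path):

import tarfile

# Pack the remote directory into a single archive to avoid many round trips
_, stdout, _ = ssh_client.exec_command("tar -czf /tmp/results.tar.gz -C /home/ubuntu results")
stdout.channel.recv_exit_status()

sftp = ssh_client.open_sftp()
sftp.get("/tmp/results.tar.gz", "/tmp/results.tar.gz")
sftp.close()

# Unpack locally, producing /tmp/results
with tarfile.open("/tmp/results.tar.gz") as tar:
    tar.extractall("/tmp")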

Scoring output on the aux VM

If the agent creates files on the aux VM that you need to use in your scoring function, you can do one of the following in TaskFamily.score():

  1. Copy the assets from the aux VM to the task environment using SFTP, or
  2. Run a scoring script on the aux VM using exec_command() and capture its output using the stdout/stderr streams.

We recommend copying assets used for scoring from the aux VM to the task environment, since scoring code is usually easier to write and debug when it runs locally in the task environment. However, if scoring must run on a GPU or needs to inspect the state of the aux VM, it may be better to run a scoring script on the aux VM instead.
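For example, the first option might look like this (a sketch; results.json, its location, and the pass threshold are all hypothetical):

    @staticmethod
    def score(t: Task, submission: str) -> float | None:
        import json

        from common import aux_vm_access

        ssh_client = aux_vm_access.ssh_client()
        sftp = ssh_client.open_sftp()
        # Copy the agent's output from the aux VM into the task environment
        sftp.get("/home/agent/results.json", "/tmp/results.json")
        sftp.close()

        with open("/tmp/results.json") as f:
            results = json.load(f)
        return 1.0 if results["accuracy"] >= 0.9 else 0.0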

Letting agents access the aux VM

By default, the agent has no way to access the aux VM. You can use the helper function common.aux_vm_access.setup_agent_ssh() to generate an SSH command that the agent can use for this purpose. The command will be saved to /home/agent/ssh_command in the main Docker container. Mention this file in the instructions so that the agent knows where to find it.
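For example (a sketch; the exact instructions wording is up to you):

    @staticmethod
    def start(t: Task) -> None:
        from common import aux_vm_access

        # Saves an SSH command for the agent to /home/agent/ssh_command
        aux_vm_access.setup_agent_ssh()

    @staticmethod
    def get_instructions(t: Task) -> str:
        return (
            "You have access to a remote VM. The command to connect to it "
            "over SSH is saved in /home/agent/ssh_command."
        )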

Tips, pitfalls, and known issues

Installing packages

  • All package management commands must be run using sudo.
  • apt-get may pause for user confirmation when installing packages, which will cause the build process to hang. To avoid this:
    1. Ensure that the first command in any shell build step that uses a package management tool is export DEBIAN_FRONTEND=noninteractive
    2. Add the option -y to every package management command you run (e.g. apt-get install -y pkg_name) to automatically answer “yes” to any question the command asks
  • Run sudo apt-get update -y before installing packages to ensure that you have up-to-date package lists.
  • On startup, the AWS cloud-init process refreshes the package sources. If your build steps start before cloud-init finishes, package installation may fail because the package sources are not yet fully configured. You can prevent this by running cloud-init status --wait, which blocks until cloud-init has finished.
  • During start-up, other processes may try to install or update packages, which can cause package management commands to fail if another process has locked the package management database. To avoid this:
    1. Add the option -o DPkg::Lock::Timeout=600 to every package management command you run (e.g. apt-get install -o DPkg::Lock::Timeout=600 pkg_name) so that the command waits up to 10 minutes for other processes to release the package management lock, rather than failing immediately.
    2. Ensure that unattended-upgrades is not running, as it may install packages at unpredictable times. To do so, run sudo systemctl stop unattended-upgrades and then sudo apt-get purge -y -o DPkg::Lock::Timeout=600 unattended-upgrades to remove it. (The sketch after this list combines these tips.)
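Putting these tips together, a shell build step that installs packages might look like this (a sketch; build-essential stands in for whatever packages your task needs):

{
    "type": "shell",
    "commands": [
        # Wait for cloud-init to finish configuring package sources
        "cloud-init status --wait",
        "export DEBIAN_FRONTEND=noninteractive",
        # Remove unattended-upgrades so it cannot hold the package lock
        "sudo systemctl stop unattended-upgrades",
        "sudo apt-get purge -y -o DPkg::Lock::Timeout=600 unattended-upgrades",
        "sudo apt-get update -y -o DPkg::Lock::Timeout=600",
        "sudo apt-get install -y -o DPkg::Lock::Timeout=600 build-essential",
    ],
}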

Python

Aux VMs may have more than one version of Python installed, and python may point to an older version.

Changing the default python may break system tools that rely on an older version. Instead:

  • Identify the command that will run the latest version of Python, if that is different from the default python command. On ubuntu-20.04-cuda this command is python3.9.
  • When installing packages using pip, make sure to invoke it as a module (e.g. python3.9 -m pip; see the sketch after this list).
  • If applicable, tell the agent which Python command to use in the instructions.
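For example, continuing with the ssh_client from above (a sketch; numpy is a placeholder):

# Invoke pip as a module of the specific interpreter, not the default python
_, stdout, _ = ssh_client.exec_command("python3.9 -m pip install numpy")
stdout.channel.recv_exit_status()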

Permissions

The build steps, and any actions you initiate on the aux VM via SSH, will run as an admin user. The name of the admin user will depend on the base image you use, and will be stored in the VM_SSH_USERNAME environment variable.

  • If you run a command that needs root privileges, use sudo.
  • Unlike the main Docker container, the aux VM will not have an agent user by default. If you’d like to create one, add common.aux_vm_access.create_agent_user_step() to your aux VM’s build steps.
  • When uploading files to the aux VM, you can only upload to folders where the admin user has write permissions, and those files will be owned by the admin user. If you’d like to put files in the agent’s home directory, upload them to the admin user’s home directory, then use sudo mv and sudo chown (see the sketch after this list).
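For example (a sketch; notes.txt is hypothetical, and the agent user must already exist on the aux VM):

import os

admin_user = os.environ["VM_SSH_USERNAME"]

# Upload to the admin user's home directory first
sftp = ssh_client.open_sftp()
sftp.put("assets/notes.txt", f"/home/{admin_user}/notes.txt")
sftp.close()

# Then move the file into the agent's home directory and fix ownership
_, stdout, _ = ssh_client.exec_command(
    f"sudo mv /home/{admin_user}/notes.txt /home/agent/notes.txt"
    " && sudo chown agent:agent /home/agent/notes.txt"
)
stdout.channel.recv_exit_status()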