Desiderata

Background

We want tasks that, taken together, allow us to get a fairly continuous measure of “ability to do things in the world and achieve goals”. We’re focusing on general capabilities because it seems hard to rule out danger from an agent that can carry out a wide variety of multi-day, expert-level projects in software development, ML, computer security, research, business, etc. just by ruling out specific threat stories.

We ideally want the tasks to cover a fairly wide range of difficulty, such that an agent that succeeds at most of the tasks seems quite likely to pose a catastrophic risk, and an agent that makes very little progress on any of the tasks seems very unlikely to pose a catastrophic risk.

We also want the distribution of task difficulty to be fairly uniform, such that we can meaningfully extrapolate gradual increases in task success (as opposed to e.g. a bimodal distribution of difficulty, where we might see very rapid progress, then no progress for a long time, then rapid progress again).

We’re most interested in collecting harder tasks first, and we think we can often make easier variants of hard tasks quite straightforwardly—by providing more guidance or partial implementation of the task, or by reducing the requirements. We’re also interested in collecting ideas and implementations of difficulty modifiers for tasks.
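
For illustration, here is a hypothetical sketch of what difficulty variants of a single hard task could look like. The dictionary structure, instructions, and file names below are invented for this example, not a required format:

```python
# Hypothetical sketch of one hard task exposed at several difficulty levels.
# The structure, instructions, and file names are illustrative assumptions only.

BASE_INSTRUCTIONS = (
    "Implement an interpreter for the attached esoteric language spec "
    "and make all tests in tests/ pass."
)

VARIANTS = {
    # Hardest: just the goal, no extra help.
    "full": {
        "instructions": BASE_INSTRUCTIONS,
        "extra_files": [],
    },
    # Easier: a partial implementation is provided for the agent to complete.
    "partial_implementation": {
        "instructions": BASE_INSTRUCTIONS
        + " A partially working interpreter is provided in src/interpreter.py.",
        "extra_files": ["src/interpreter.py"],
    },
    # Easier: same goal, reduced requirements (core language subset only).
    "reduced_scope": {
        "instructions": BASE_INSTRUCTIONS.replace(
            "all tests in tests/", "the tests in tests/core/"
        ),
        "extra_files": [],
    },
}

def get_variant(name: str) -> dict:
    """Return the instructions and extra files for one difficulty variant."""
    return VARIANTS[name]
```

Other modifiers, such as step-by-step guidance or relaxed time limits, could be added in the same way as extra variant keys.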

Domains

We’re interested in testing the capabilities of AI agents in the following domains:

  • Software Engineering. Examples: a local webapp for playing an obscure boardgame variant; SWE-bench-like tasks
  • Cybersecurity, Hacking. Examples: picoCTF; SQL injection; finding exploits in code
  • Generalist Tasks, Cybercrime. Examples: research + spearphishing; copycat LLM API to harvest credentials
  • Post-training Enhancement, Elicitation. Examples: improving agent scaffolding; creating task guidance that improves performance; generating synthetic training data
  • Machine Learning / AI R&D. Examples: replicating an ML paper; implementing FlashAttention

In addition, we would like to construct tasks such that they also test for certain broad categories of behaviors or skills when possible. For example:

  • Research skills
  • Asking good questions and assessing the value of information
  • Figuring out how novel systems work / causal modeling
  • Being able to persistently make progress given success feedback from the environment
  • Correcting errors
  • Learning from experience

Difficulty

  • The ideal difficulty is at least 6 hours for a human domain professional to solve, and we want some tasks to take more than 20 hours. In general, the harder and longer the task, the better.
    • For coding/debugging tasks, time estimates would be for a decent engineer who’s familiar with the language and technology stack, but not the particular codebase.
    • Shorter tasks are also somewhat useful, but we really want to prioritize getting hard tasks.
  • Tasks shouldn’t be difficult for incidental reasons.
    • Ideally, tasks play to the strengths of AI agents, meaning they can be completed by writing code, using the command line, or other text-based interaction.
    • Tasks shouldn’t be weird or confusing in a way that might unfairly throw the agent off.
    • Tasks shouldn’t be highly sensitive to specific properties of the exact setup, the agent scaffolding, contingent capabilities or dispositions of current LLMs, etc.
    • Ideally, tasks don’t involve artificial restrictions that aren’t applicable to the real-life threat models. An actual AI agent would be able to use the internet, run code, etc.
  • Tasks should have a diverse range of bottlenecks.
    • We don’t want it to be the case that all of the tasks are blocked on browsing, or authentication, or ML knowledge or something, because then we might see a sudden big jump in task performance after a small improvement to the agent.
  • Harder tasks can be created by stringing together multiple smaller tasks—but it’s a bit sad if the links between the tasks are horribly contrived (e.g. one task gives you a password that you use to decrypt the instructions for the next task) unless there’s some setup where this might actually happen naturally.

Novelty

  • We want to test agents’ ability to solve problems and accomplish goals, not how much they memorized from their training data.
    • Ideally, the task solution has not been posted publicly in the past, is unlikely to be posted in the future, and is not especially close to anything in the training corpus.
    • It’s good if you can confirm that a current model (like GPT-4) hasn’t memorized your task, but that doesn’t totally remove the memorization concern. We ideally want the tasks to be future-proof against models that might have memorized public solutions in more detail.
    • If you’re able to remove the solution from the public internet, or specify a way to filter all examples from the training data, that’s good too.
  • We also don’t want a task to get easier unexpectedly, e.g. if someone posts a solution to StackOverflow, or makes a library that solves the task with a single call.
    • It’s great if the task is reasonable to perform without interacting with the live internet. The agent could have a local copy of the documentation for all the relevant libraries, an archived copy of various sites from a particular date, or some other information, but no general internet access (see the sketch after the examples below).
Examples:
  • Original task: Develop an interpreter for the Python programming language
    Problem: Python is open-source and language models have likely been trained on its code
    Better task: Develop an interpreter for a novel, esoteric programming language
  • Original task: Solve cybersecurity challenges from picoCTF
    Problem: Problems and solutions are widely available online
    Better task: Solve unreleased challenges from a different CTF event (requires help from event organizers)
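
One way to satisfy the offline-internet point above is to snapshot the relevant documentation ahead of time and serve it inside the task VM. Here is a minimal sketch using the Python standard library; the snapshot directory and port are assumptions about the environment, not a standard:

```python
# Minimal sketch: serve a pre-downloaded documentation snapshot on localhost
# so the agent can browse it without general internet access.
# The directory path is an assumption; populate it ahead of time (e.g. with a
# mirroring tool or an official offline docs bundle pinned to a specific date).
import functools
import http.server

DOCS_DIR = "/opt/offline_docs"   # hypothetical snapshot location
PORT = 8080

handler = functools.partial(http.server.SimpleHTTPRequestHandler, directory=DOCS_DIR)

if __name__ == "__main__":
    with http.server.ThreadingHTTPServer(("127.0.0.1", PORT), handler) as server:
        print(f"Serving docs snapshot from {DOCS_DIR} at http://127.0.0.1:{PORT}")
        server.serve_forever()
```

Because the snapshot is fixed at setup time, the task doesn’t get easier (or break) when the live documentation changes.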

Consistency

  • We want tasks to behave the same way every time we run them.
    • This poses a problem for tasks involving self-modification, because they are effectively a different task for different agents. Usually, it’s better to give the agent code for a second agent (which is the same every time) and ask it to improve that.
    • Ideally, the task doesn’t depend too much on specific online services. If a website removes features or shuts down, we want the task to still work. When possible, we try to run “online services” on localhost in the VM, so that we can ensure that they are consistent between runs (see the sketch after the examples below).
    • While human interaction and persuasion are relevant to many threat models, they’re very difficult to evaluate objectively. For that reason, we are unlikely to accept submissions for persuasion tasks.
Examples:
  • Original task: Jailbreak the ChatGPT API
    Problem: The models serving the API could be updated at any time
    Better task: Jailbreak an open-source language model running locally
  • Original task: Improve the agent’s own scaffolding
    Problem: This will effectively be a different task for different agents
    Better task: Improve the scaffolding of another agent
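
For example, rather than pointing a task at a real third-party API, the environment can run a small deterministic stand-in on localhost. A minimal sketch using only the standard library; the endpoint and response payload are invented for illustration:

```python
# Minimal sketch: a deterministic localhost stand-in for an "online service",
# so the task does not depend on a real website staying up or unchanged.
# The endpoint and payload are illustrative, not from any real API.
import json
import http.server

class FakeServiceHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/api/exchange_rate":
            # Fixed response so every run of the task sees the same data.
            body = json.dumps({"base": "USD", "rates": {"EUR": 0.9}}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # suppress per-request logging to keep run output deterministic

if __name__ == "__main__":
    http.server.ThreadingHTTPServer(("127.0.0.1", 8081), FakeServiceHandler).serve_forever()
```

Because the responses are hard-coded, every run of the task sees exactly the same “online service”, regardless of what happens to any real website.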

Threat relevance

  • We’re looking for tasks that provide information about a model’s potentially dangerous capabilities.
    • Ideally, this means that either (a) it’s scary if an agent succeeds, or (b) it’s comforting if an agent fails.
    • We think general capabilities in our domains of interest (software engineering, cybersecurity, ML, post-training enhancement) indicate the level of risk a model poses. Tasks that test a wide range of these capabilities are useful, even if there’s not a 1-1 correspondence to threat models.
  • It’s good if tasks are as “natural” as possible rather than elaborately constructed to fit a threat model.
    • Ideally, in many of the tasks a lot of the difficulty comes not from some core tricky puzzle to solve, but from mundane difficulties with actually doing things—problems with your python environment, data cleaning, unhelpful Google search results, keeping track of information, avoiding overengineering, etc.
Example:
  • Original task: Take an International Mathematics Olympiad test
    Problem: May not indicate real-world abilities
    Better task: Fix this challenging bug I found in a personal programming project

Safety

  • A hard requirement is that the task doesn’t involve the agent doing anything dangerous, illegal, or unethical in the real world.
    • Ideally, the task takes place in an isolated environment so that the agent cannot do anything with real-world consequences.
  • Ideally, the task also appears safe, so that nice, ethical assistants will be willing to complete it.
    • For example, presented with a highly realistic offensive cybersecurity task, an agent might refuse for ethical reasons, and we wouldn’t learn about its true capabilities.
    • There’s a tradeoff between threat relevance and appearing safe. However, given a task that appears to be dangerous, it’s often possible to identify underlying capabilities that allow a model to complete the task and then create a benign task to test for those capabilities.
    • Appearing safe is not a hard requirement, and we’ll accept tasks that trigger refusals if there’s no good alternative way to test for that capability.
Examples:
  • Original task: Exploit a vulnerability in a live website
    Problem: Unethical and illegal
    Better task: Identify and fix a vulnerability in a website running on localhost
  • Original task: Given a large language model that refuses to perform illegal tasks, undo the fine-tuning so it can be used to commit crimes
    Problem: Safety-trained models will likely refuse
    Better task: Given a large language model that refuses to talk about math, undo the fine-tuning so that it can complete a basic math test

Scoring

  • The final score on a task is a number between 0 and 1.
    • Ideally the scoring isn’t binary but gives different scores depending on how well the agent did, where the numerical scores correspond well to the difficulty level (a sketch of a partial-credit scoring function follows this list).
    • Alternatively, or in addition, you can create a family of similar tasks with different levels of difficulty.
  • Ideally, tasks can be automatically scored.
    • Automatic scoring should be reliable, and not incorrectly score tasks because e.g. the agent made a minor formatting error.
  • It’s also fine if some tasks need to be scored by humans.
    • Manual scoring should be as quick as possible. We recommend using the scoring function to automatically retrieve and display any relevant information to human scorers.
    • Manual scoring should be objective and well-specified. Often, this means a rubric where the agent gets a specific number of points for completing each part of the task.
  • To the extent possible, “cheating” should be prevented by technical measures.
    • We want to know that the agent has followed the rules without manually reviewing its actions.
    • Make sure to set the right file permissions and consider restricting internet access (see the setup sketch after the examples below).
  • Noise and randomness in the scoring procedure should be minimized.
    • In our experience, agent runs are pretty variable, so a bit of noise is fine.
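
As a rough sketch of what non-binary automatic scoring might look like, the function below awards partial credit in proportion to a held-out test suite rather than a single pass/fail check. The paths, weights, and test layout are invented for illustration:

```python
# Hypothetical scoring sketch: rubric-style partial credit returning a float in
# [0, 1]. The paths, weights, and test layout are invented for illustration.
import subprocess
from pathlib import Path

SUBMISSION = Path("/home/agent/submission")          # assumed submission location
HELD_OUT_TESTS = sorted(Path("/protected/tests").glob("test_*.py"))

def run_test(test_file: Path) -> bool:
    """Run one held-out test file against the submission; True if it passes."""
    try:
        proc = subprocess.run(
            ["python", str(test_file), str(SUBMISSION)],
            capture_output=True,
            timeout=120,
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def score() -> float:
    total = 0.0

    # 0.2 of the credit for producing the main deliverable at all.
    if (SUBMISSION / "app.py").exists():
        total += 0.2

    # Up to 0.8 in proportion to held-out tests passed, so partial progress
    # shows up as a gradual score increase rather than all-or-nothing.
    if HELD_OUT_TESTS:
        passed = sum(run_test(t) for t in HELD_OUT_TESTS)
        total += 0.8 * passed / len(HELD_OUT_TESTS)

    return round(min(total, 1.0), 3)

if __name__ == "__main__":
    print(score())
```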
Examples:
  • Original task: Create a video game that humans can play
    Problem: Requires manual and potentially time-consuming scoring
    Better task: Create a payment processing system that can pass automated test cases
  • Original task: Train an ML classifier and submit its predictions for the provided test set
    Problem: Agent could make the predictions itself
    Better task: Train an ML classifier, then have the scoring function run it on a test set that the agent can’t access
  • Original task: Use complex digital forensic techniques to find the answer to a yes-no question
    Problem: Agent could blindly guess the answer
    Better task: Use complex digital forensic techniques to find a randomly generated 10-character string
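
Here is a sketch of the kind of setup-time measures that can back up the last two rows above. The user names, paths, and flag-hiding step are assumptions for illustration, not a prescribed standard:

```python
# Hypothetical setup sketch (run as root before the agent starts): keep the
# held-out test set out of the agent's reach and use a random flag rather than
# a guessable answer. Paths and the flag-hiding step are illustrative only.
import os
import secrets
import shutil
from pathlib import Path

PROTECTED = Path("/protected")      # assumed to be readable only by the scorer
AGENT_HOME = Path("/home/agent")    # assumed agent working directory

def hide_flag(flag: str) -> None:
    """Placeholder for the task-specific step that embeds the flag in the
    environment (e.g. inside the disk image the agent must analyze)."""
    ...

def setup() -> None:
    PROTECTED.mkdir(exist_ok=True)

    # Move the held-out test set where the agent account cannot read it.
    shutil.move(str(AGENT_HOME / "heldout_test_set.csv"),
                str(PROTECTED / "heldout_test_set.csv"))

    # A randomly generated 10-character flag can't be produced by blind guessing.
    flag = secrets.token_hex(5)
    (PROTECTED / "flag.txt").write_text(flag)
    hide_flag(flag)

    # Root-only permissions on everything the scoring function depends on.
    for path in PROTECTED.rglob("*"):
        os.chmod(path, 0o600)
    os.chmod(PROTECTED, 0o700)
```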

Practicality

  • Ideally, tasks don’t need much computational power and can run on one virtual machine.
    • This is not a hard requirement and can be waived if necessary for especially good tasks. We expect that some good tasks in the ML R&D and cybersecurity categories will require more complicated setups.
  • Ideally, the task environment is simple to set up and easy to test.
  • Ideally, tasks do not require human involvement, either to simulate part of the environment or to monitor and stop unsafe actions.
    • Again, this is not a hard requirement and can be waived if necessary for especially good tasks.
Examples:
  • Original task: Train a 10B-parameter language model from scratch
    Problem: Requires complicated and expensive computing infrastructure
    Better task: Adapt a pretrained 10B-parameter language model to run on consumer hardware
  • Original task: Monitor a large enterprise computer network and detect a cyberattack
    Problem: Requires a complicated network setup with many virtual machines
    Better task: Given logs from a large enterprise computer network, identify when and how a cyberattack happened
  • Original task: Persuade a human pretending to be a cybercrime victim to send Bitcoin to a certain address
    Problem: Requires a human to play the victim every time the task is run
    Better task: Set up a phishing website that matches the appearance of a real cryptocurrency exchange