Desiderata

Background

We want tasks that, taken together, allow us to get a fairly continuous measure of “ability to do things in the world and achieve goals”. We’re focusing on general capabilities because it seems hard to rule out danger from an agent that can carry out a wide variety of multi-day, expert-level projects in software development, ML, computer security, research, business, etc. just by ruling out specific threat stories.

We ideally want the tasks to cover a fairly wide range of difficulty, such that an agent that succeeds at most of the tasks seems quite likely to pose a catastrophic risk, and an agent that makes very little progress on any of the tasks seems very unlikely to pose a catastrophic risk.

We also want the distribution of task difficulty to be fairly uniform, such that we can meaningfully extrapolate gradual increases in task success (as opposed to e.g. a bimodal distribution of difficulty, where we might see very rapid progress, then no progress for a long time, then rapid progress again).

We’re most interested in collecting harder tasks first, and we think we can often make easier variants of hard tasks quite straightforwardly—by providing more guidance or partial implementation of the task, or by reducing the requirements. We’re also interested in collecting ideas and implementations of difficulty modifiers for tasks.
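
For illustration, here is a hypothetical sketch of what difficulty variants of a single hard task could look like. The dictionary structure, instructions, and file names below are invented for this example, not a required format:

```python
# Hypothetical sketch of one hard task exposed at several difficulty levels.
# The structure, instructions, and file names are illustrative assumptions only.

BASE_INSTRUCTIONS = (
    "Implement an interpreter for the attached esoteric language spec "
    "and make all tests in tests/ pass."
)

VARIANTS = {
    # Hardest: just the goal, no extra help.
    "full": {
        "instructions": BASE_INSTRUCTIONS,
        "extra_files": [],
    },
    # Easier: a partial implementation is provided for the agent to complete.
    "partial_implementation": {
        "instructions": BASE_INSTRUCTIONS
        + " A partially working interpreter is provided in src/interpreter.py.",
        "extra_files": ["src/interpreter.py"],
    },
    # Easier: same goal, reduced requirements (core language subset only).
    "reduced_scope": {
        "instructions": BASE_INSTRUCTIONS.replace(
            "all tests in tests/", "the tests in tests/core/"
        ),
        "extra_files": [],
    },
}

def get_variant(name: str) -> dict:
    """Return the instructions and extra files for one difficulty variant."""
    return VARIANTS[name]
```

Other modifiers, such as step-by-step guidance or relaxed time limits, could be added in the same way as extra variant keys.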

Domains

We’re interested in testing the capabilities of AI agents in the following domains:

  • Software Engineering. Examples: a local webapp for playing an obscure boardgame variant; SWE-bench-like tasks
  • Cybersecurity, Hacking. Examples: picoCTF; SQL injection; finding exploits in code
  • Generalist Tasks, Cybercrime. Examples: research + spearphishing; copycat LLM API to harvest credentials
  • Post-training Enhancement, Elicitation. Examples: improving agent scaffolding; creating task guidance that improves performance; generating synthetic training data
  • Machine Learning / AI R&D. Examples: replicating an ML paper; implementing FlashAttention

In addition, we would like to construct tasks such that they also test for certain broad categories of behaviors or skills when possible. For example:

  • Research skills
  • Asking good questions and assessing the value of information
  • Figuring out how novel systems work / causal modeling
  • Being able to persistently make progress given success feedback from the environment
  • Correcting errors
  • Learning from experience

Difficulty

  • The ideal difficulty is at least 6 hours for a human domain professional to solve, and we want some tasks to take more than 20 hours. In general, the harder and longer the task, the better.
    • For coding/debugging tasks, time estimates would be for a decent engineer who’s familiar with the language and technology stack, but not the particular codebase.
    • Shorter tasks are also somewhat useful, but we really want to prioritize getting hard tasks.
  • Tasks shouldn’t be difficult for incidental reasons.
    • Ideally, tasks play to the strengths of AI agents, meaning they can be completed by writing code, using the command line, or other text-based interaction.
    • Tasks shouldn’t be weird or confusing in a way that might unfairly throw the agent off.
    • Tasks shouldn’t be highly sensitive to specific properties of the exact setup, the agent scaffolding, contingent capabilities or dispositions of current LLMs, etc.
    • Ideally, tasks don’t involve artificial restrictions that aren’t applicable to the real-life threat models. An actual AI agent would be able to use the internet, run code, etc.
  • Tasks should have a diverse range of bottlenecks.
    • We don’t want it to be the case that all of the tasks are blocked on browsing, or authentication, or ML knowledge or something, because then we might see a sudden big jump in task performance after a small improvement to the agent.
  • Harder tasks can be created by stringing together multiple smaller tasks—but it’s a bit sad if the links between the tasks are horribly contrived (e.g. one task gives you a password that you use to decrypt the instructions for the next task) unless there’s some setup where this might actually happen naturally.

Novelty

  • We want to test agents’ ability to solve problems and accomplish goals, not how much they memorized from their training data.
    • Ideally, the task solution has not been posted publicly in the past, is unlikely to be posted in the future, and is not especially close to anything in the training corpus.
    • It’s good if you can confirm that a current model (like GPT-4) hasn’t memorized your task, but that doesn’t totally remove the memorization concern. We ideally want the tasks to be future-proof against models that might have memorized public solutions in more detail.
    • If you’re able to remove the solution from the public internet, or specify a way to filter all examples from the training data, that’s good too.
  • We also don’t want a task to get easier unexpectedly, e.g. if someone posts a solution to StackOverflow, or makes a library that solves the task with a single call.
    • It’s great if the task is reasonable to perform without interacting with the live internet. The agent could have a local copy of the documentation for all the relevant libraries, an archived copy of various sites from a particular date, or some other information, but no general internet access (see the sketch after the examples below).
Examples:
  • Original task: Develop an interpreter for the Python programming language
    Problem: Python is open-source and language models have likely been trained on its code
    Better task: Develop an interpreter for a novel, esoteric programming language
  • Original task: Solve cybersecurity challenges from picoCTF
    Problem: Problems and solutions are widely available online
    Better task: Solve unreleased challenges from a different CTF event (requires help from event organizers)
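
One way to satisfy the offline-internet point above is to snapshot the relevant documentation ahead of time and serve it inside the task VM. Here is a minimal sketch using the Python standard library; the snapshot directory and port are assumptions about the environment, not a standard:

```python
# Minimal sketch: serve a pre-downloaded documentation snapshot on localhost
# so the agent can browse it without general internet access.
# The directory path is an assumption; populate it ahead of time (e.g. with a
# mirroring tool or an official offline docs bundle pinned to a specific date).
import functools
import http.server

DOCS_DIR = "/opt/offline_docs"   # hypothetical snapshot location
PORT = 8080

handler = functools.partial(http.server.SimpleHTTPRequestHandler, directory=DOCS_DIR)

if __name__ == "__main__":
    with http.server.ThreadingHTTPServer(("127.0.0.1", PORT), handler) as server:
        print(f"Serving docs snapshot from {DOCS_DIR} at http://127.0.0.1:{PORT}")
        server.serve_forever()
```

Because the snapshot is fixed at setup time, the task doesn’t get easier (or break) when the live documentation changes.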

Consistency

  • We want tasks to behave the same way every time we run them.
    • This poses a problem for tasks involving self-modification, because they are effectively a different task for different agents. Usually, it’s better to give the agent code for a second agent (which is the same every time) and ask it to improve that.
    • Ideally, the task doesn’t depend too much on specific online services. If a website removes features or shuts down, we want the task to still work. When possible, we try to run “online services” on localhost in the VM, so that we can ensure that they are consistent between runs (see the sketch after the examples below).
    • While human interaction and persuasion are relevant to many threat models, they’re very difficult to evaluate objectively. For that reason, we are unlikely to accept submissions for persuasion tasks.
Examples:
  • Original task: Jailbreak the ChatGPT API
    Problem: The models serving the API could be updated at any time
    Better task: Jailbreak an open-source language model running locally
  • Original task: Improve the agent’s own scaffolding
    Problem: This will effectively be a different task for different agents
    Better task: Improve the scaffolding of another agent
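
For example, rather than pointing a task at a real third-party API, the environment can run a small deterministic stand-in on localhost. A minimal sketch using only the standard library; the endpoint and response payload are invented for illustration:

```python
# Minimal sketch: a deterministic localhost stand-in for an "online service",
# so the task does not depend on a real website staying up or unchanged.
# The endpoint and payload are illustrative, not from any real API.
import json
import http.server

class FakeServiceHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/api/exchange_rate":
            # Fixed response so every run of the task sees the same data.
            body = json.dumps({"base": "USD", "rates": {"EUR": 0.9}}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # suppress per-request logging to keep run output deterministic

if __name__ == "__main__":
    http.server.ThreadingHTTPServer(("127.0.0.1", 8081), FakeServiceHandler).serve_forever()
```

Because the responses are hard-coded, every run of the task sees exactly the same “online service”, regardless of what happens to any real website.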

Threat relevance

  • We’re looking for tasks that provide information about a model’s potentially dangerous capabilities.
    • Ideally, this means that either (a) it’s scary if an agent succeeds, or (b) it’s comforting if an agent fails.
    • We think general capabilities in our domains of interest (software engineering, cybersecurity, ML, post-training enhancement) indicate the level of risk a model poses. Tasks that test a wide range of these capabilities are useful, even if there’s not a 1-1 correspondence to threat models.
  • It’s good if tasks are as “natural” as possible rather than elaborately constructed to fit a threat model.
    • Ideally, in many of the tasks a lot of the difficulty comes not from some core tricky puzzle to solve, but from mundane difficulties with actually doing things—problems with your python environment, data cleaning, unhelpful Google search results, keeping track of information, avoiding overengineering, etc.
Example:
  • Original task: Take an International Mathematics Olympiad test
    Problem: May not indicate real-world abilities
    Better task: Fix this challenging bug I found in a personal programming project

Safety

  • A hard requirement is that the task doesn’t involve the agent doing anything dangerous, illegal, or unethical in the real world.
    • Ideally, the task takes place in an isolated environment so that the agent cannot do anything with real-world consequences.
  • Ideally, the task also appears safe, so that nice, ethical assistants will be willing to complete it.
    • For example, presented with a highly realistic offensive cybersecurity task, an agent might refuse for ethical reasons, and we wouldn’t learn about its true capabilities.
    • There’s a tradeoff between threat relevance and appearing safe. However, given a task that appears to be dangerous, it’s often possible to identify underlying capabilities that allow a model to complete the task and then create a benign task to test for those capabilities.
    • Appearing safe is not a hard requirement, and we’ll accept tasks that trigger refusals if there’s no good alternative way to test for that capability.
Examples:
  • Original task: Exploit a vulnerability in a live website
    Problem: Unethical and illegal
    Better task: Identify and fix a vulnerability in a website running on localhost
  • Original task: Given a large language model that refuses to perform illegal tasks, undo the fine-tuning so it can be used to commit crimes
    Problem: Safety-trained models will likely refuse
    Better task: Given a large language model that refuses to talk about math, undo the fine-tuning so that it can complete a basic math test

Scoring

  • The final score on a task is a number between 0 and 1.
    • Ideally the scoring isn’t binary but gives different scores depending on how well the agent did, where the numerical scores correspond well to the difficulty level (a sketch of a partial-credit scoring function follows this list).
    • Alternatively, or in addition, you can create a family of similar tasks with different levels of difficulty.
  • Ideally, tasks can be automatically scored.
    • Automatic scoring should be reliable, and not incorrectly score tasks because e.g. the agent made a minor formatting error.
  • It’s also fine if some tasks need to be scored by humans.
    • Manual scoring should be as quick as possible. We recommend using the scoring function to automatically retrieve and display any relevant information to human scorers.
    • Manual scoring should be objective and well-specified. Often, this means a rubric where the agent gets a specific number of points for completing each part of the task.
  • To the extent possible, “cheating” should be prevented by technical measures.
    • We want to know that the agent has followed the rules without manually reviewing its actions.
    • Make sure to set the right file permissions and consider restricting internet access (see the setup sketch after the examples below).
  • Noise and randomness in the scoring procedure should be minimized.
    • In our experience, agent runs are pretty variable, so a bit of noise is fine.
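
As a rough sketch of what non-binary automatic scoring might look like, the function below awards partial credit in proportion to a held-out test suite rather than a single pass/fail check. The paths, weights, and test layout are invented for illustration:

```python
# Hypothetical scoring sketch: rubric-style partial credit returning a float in
# [0, 1]. The paths, weights, and test layout are invented for illustration.
import subprocess
from pathlib import Path

SUBMISSION = Path("/home/agent/submission")          # assumed submission location
HELD_OUT_TESTS = sorted(Path("/protected/tests").glob("test_*.py"))

def run_test(test_file: Path) -> bool:
    """Run one held-out test file against the submission; True if it passes."""
    try:
        proc = subprocess.run(
            ["python", str(test_file), str(SUBMISSION)],
            capture_output=True,
            timeout=120,
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def score() -> float:
    total = 0.0

    # 0.2 of the credit for producing the main deliverable at all.
    if (SUBMISSION / "app.py").exists():
        total += 0.2

    # Up to 0.8 in proportion to held-out tests passed, so partial progress
    # shows up as a gradual score increase rather than all-or-nothing.
    if HELD_OUT_TESTS:
        passed = sum(run_test(t) for t in HELD_OUT_TESTS)
        total += 0.8 * passed / len(HELD_OUT_TESTS)

    return round(min(total, 1.0), 3)

if __name__ == "__main__":
    print(score())
```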
Examples:
  • Original task: Create a video game that humans can play
    Problem: Requires manual and potentially time-consuming scoring
    Better task: Create a payment processing system that can pass automated test cases
  • Original task: Train an ML classifier and submit its predictions for the provided test set
    Problem: Agent could make the predictions itself
    Better task: Train an ML classifier, then have the scoring function run it on a test set that the agent can’t access
  • Original task: Use complex digital forensic techniques to find the answer to a yes-no question
    Problem: Agent could blindly guess the answer
    Better task: Use complex digital forensic techniques to find a randomly generated 10-character string
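
Here is a sketch of the kind of setup-time measures that can back up the last two rows above. The user names, paths, and flag-hiding step are assumptions for illustration, not a prescribed standard:

```python
# Hypothetical setup sketch (run as root before the agent starts): keep the
# held-out test set out of the agent's reach and use a random flag rather than
# a guessable answer. Paths and the flag-hiding step are illustrative only.
import os
import secrets
import shutil
from pathlib import Path

PROTECTED = Path("/protected")      # assumed to be readable only by the scorer
AGENT_HOME = Path("/home/agent")    # assumed agent working directory

def hide_flag(flag: str) -> None:
    """Placeholder for the task-specific step that embeds the flag in the
    environment (e.g. inside the disk image the agent must analyze)."""
    ...

def setup() -> None:
    PROTECTED.mkdir(exist_ok=True)

    # Move the held-out test set where the agent account cannot read it.
    shutil.move(str(AGENT_HOME / "heldout_test_set.csv"),
                str(PROTECTED / "heldout_test_set.csv"))

    # A randomly generated 10-character flag can't be produced by blind guessing.
    flag = secrets.token_hex(5)
    (PROTECTED / "flag.txt").write_text(flag)
    hide_flag(flag)

    # Root-only permissions on everything the scoring function depends on.
    for path in PROTECTED.rglob("*"):
        os.chmod(path, 0o600)
    os.chmod(PROTECTED, 0o700)
```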

Practicality

  • Ideally, tasks don’t need much computational power and can run on one virtual machine.
    • This is not a hard requirement and can be waived if necessary for especially good tasks. We expect that some good tasks in the ML R&D and cybersecurity categories will require more complicated setups.
  • Ideally, the task environment is simple to set up and easy to test.
  • Ideally, tasks do not require human involvement, either to simulate part of the environment or to monitor and stop unsafe actions.
    • Again, this is not a hard requirement and can be waived if necessary for especially good tasks.
Examples:
  • Original task: Train a 10B-parameter language model from scratch
    Problem: Requires complicated and expensive computing infrastructure
    Better task: Adapt a pretrained 10B-parameter language model to run on consumer hardware
  • Original task: Monitor a large enterprise computer network and detect a cyberattack
    Problem: Requires a complicated network setup with many virtual machines
    Better task: Given logs from a large enterprise computer network, identify when and how a cyberattack happened
  • Original task: Persuade a human pretending to be a cybercrime victim to send Bitcoin to a certain address
    Problem: Requires a human to play the victim every time the task is run
    Better task: Set up a phishing website that matches the appearance of a real cryptocurrency exchange