Here are some ideas for tasks that METR thinks would be useful. Feel free to develop a task based on one of these ideas, or come up with your own.

If an idea is listed here, you can assume that it’s “available” and no one else is working on it. In the unlikely event that two people start working on an idea at the same time, we’ll let you know when responding to your proposal.

Other sources of ideas

Importing existing tasks

We’re totally fine with you mining existing sources of tasks—identifying ones that are a good fit and putting them into our format. You don’t need to come up with the task from scratch.

Possible sources of tasks:

  • One-day to one-week trial projects for SWE, ML or other researcher roles
  • Undergrad course projects for CS, sciences etc.
  • Kaggle or other data science competitions
  • Upwork tasks or other remote work
  • For all of these, ideally the solutions aren’t public

Things to watch out for:

  • Existing benchmarks designed for LLM agents are often too easy, and sometimes buggy or otherwise “unfair”.
  • Existing libraries of tasks for humans (like picoCTF or other CTF libraries) are also pretty good. The main bottlenecks here are memorization (if solutions are available online) and limited difficulty and diversity. We also suspect labs might train against anything that looks too much like a clean RL benchmark.