METR is building a suite of tasks to measure potentially dangerous capabilities of AI agents. For example, we might want to know how many security vulnerabilities an agent can identify in a network, or whether it can copy itself onto a new server. Results from the task suite will help identify risks early so they can be properly mitigated.

To fully understand the capabilities of these agents, we need a large, diverse set of tasks. So we’re inviting you to develop and submit your own tasks in any of several risk-relevant domains. If you send us a good task, we’ll award a cash bounty.

This guide will walk you through the entire process of developing a task, from brainstorming to debugging to submission. We look forward to seeing what you come up with!

Keeping tasks out of training data

We want tasks that measure an agent’s general capabilities, not how much it has memorized. Please take care to ensure your tasks and solutions won’t end up in training datasets.

  • Don’t post details about your task anywhere on the public web.
  • When using cloud services such as Google Drive or GitHub, make sure your files are set to “private” (not “anyone with the link can view”).
  • If you use ChatGPT, turn off chat history in settings. Avoid using other AI assistants unless you can confirm they won’t train on your data.

Contact us

If you have feedback or questions, feel free to email us at [email protected].