Quality Assurance

Often, the first version of a task has bugs that make it an invalid test of agent capabilities. For example, the task might be impossible because important files are missing, or there may be a loophole in the rules that allows the agent to cheat. Before we can use your task to evaluate AI agents, we need to make sure all of these issues have been identified and fixed.

The best way to do this is a quality assurance (QA) run, in which a human attempts to complete your task using the same resources that an AI agent could use. In most cases, a task implementation must have a documented QA run to be eligible for the bounty. See “Alternatives to a full QA run” at the end of this page for the exceptions.

Instructions for task author

  • Finding a human participant
    • The human participant shouldn’t be the task author, and shouldn’t have special information about the solution or intention of the task except for the prompt they’re given.
    • The human participant should be fairly skilled in the relevant domain. For example, for a software engineering task, they should have at least a few years of programming experience.
  • Setup
    • The QA environment should be as similar as possible to the one the AI agent will have. Appropriate software, files, and instructions should all be available.
    • Make sure to communicate any additional constraints defined in the task. For example, if your task does not allow internet access, inform the participant that they cannot use the internet.
    • If there’s a significant discrepancy between the setup for the human and the AI, please note that in your documentation.

Instructions for QA tester

  • Keeping tasks private
    • Don’t post details about your task anywhere on the public web.
    • When using cloud services such as Google Drive or GitHub, make sure your files are “private” (not “anyone with the link can view”).
  • AI assistance
    • You are welcome, and encouraged, to use AI to help you complete the task faster.
    • Please document which AI tools and models you use, if any.
    • Before using any AI tool, please make absolutely sure that it will not be trained on your data.
      • ChatGPT: go to settings and turn off chat history
      • Cursor: enable privacy mode or use your own API key
  • Time tracking
    • Use a timer to track how long you spend completing the task.
    • Log how much progress you have made at regular intervals (e.g. every 30 minutes) and after completing notable milestones. In each entry, note how much time you’ve spent on the task so far. This helps us calibrate the difficulty of the task and check that our scoring is sensible.
    • You can (and should!) pause the timer to take breaks, write material for your QA document, and fill out the timing logs.
    • If progress on the task is blocked by a long-running program (e.g. training an ML model), pause the timer while you wait for it to finish.
    • Complete the task at a normal working pace, including whatever breaks you would usually take. You can spread the work over multiple days.
  • For programming tasks, track progress in a git repo, making commits relatively often.
    • Please try to have descriptive commit messages.
    • When recording an entry in the timing log, include the ID of the most recent commit.
    • These intermediate states let us start agents partway through the task, to test how they perform when given some foundation to work with.
  • Review document
    • The task template contains a file meta/qa/review.md with questions for you to answer after finishing the task. You can also find a copy here.
  • Payment
    • Task authors and QA testers can split the bounty however they see fit.
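
The time-tracking and commit-logging steps above could be sketched as a small script. This is only an illustration and is not part of the task template: the `TimingLog` class, its method names, and the entry fields are hypothetical, and `current_commit` assumes `git` is installed and that the task is tracked in a git repo.

```python
import subprocess
import time


class TimingLog:
    """Hypothetical pausable timer that records progress entries."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._elapsed = 0.0      # seconds accumulated while the timer ran
        self._started_at = None  # None means the timer is paused
        self.entries = []

    def start(self):
        if self._started_at is None:
            self._started_at = self._clock()

    def pause(self):
        # Pause for breaks, QA write-ups, or long-running jobs.
        if self._started_at is not None:
            self._elapsed += self._clock() - self._started_at
            self._started_at = None

    def elapsed_seconds(self):
        running = 0.0
        if self._started_at is not None:
            running = self._clock() - self._started_at
        return self._elapsed + running

    def log(self, note, commit=None):
        # Each entry records total time spent so far, as the guide asks.
        minutes = round(self.elapsed_seconds() / 60, 1)
        entry = {"minutes_spent": minutes, "note": note, "commit": commit}
        self.entries.append(entry)
        return entry


def current_commit(repo_dir="."):
    """Return the short ID of the latest commit, or None if unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()
```

A tester might call `log.start()` when beginning work, `log.pause()` before a break, and `log.log("parser passes unit tests", commit=current_commit())` at each interval or milestone. A spreadsheet or text file with the same fields works just as well.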

Alternatives to a full QA run

There are two main categories of tasks for which we won’t require a full QA run: tasks that have already been validated somehow, and tasks that take too long to fully test.

Pre-validated tasks

For some tasks, there is already evidence that they are fair and challenging. For example, if the task is based on a homework assignment you completed, then you know whether the instructions were clear and how long it took you. In this case, the task is “pre-validated” and you don’t need to do a full QA run. Instead:

  1. Document where the task came from, what skill level it requires, and how long it took to complete. This is usually enough to validate the task design.
  2. To validate your technical implementation, start the task in a Docker container. Make sure all necessary files are present and all necessary software works. Try submitting various solutions: ideally an invalid solution, a partially correct solution, and the best solution you have. Confirm for each that the score is what you expect. Document all of this in a markdown file.
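
The scoring check in step 2 could be organized as a short table of submissions and the scores you expect for them. The sketch below is an assumption about how you might structure that check, not a required format: `check_scores`, `example_score`, and `REFERENCE` are hypothetical names, and `example_score` is a stand-in for however your task actually scores a submission.

```python
def check_scores(score_fn, cases, tolerance=1e-9):
    """Score each (label, submission, expected) case; return the mismatches."""
    mismatches = []
    for label, submission, expected in cases:
        actual = score_fn(submission)
        if abs(actual - expected) > tolerance:
            mismatches.append((label, expected, actual))
    return mismatches


# Hypothetical stand-in scorer: fraction of reference answers matched.
REFERENCE = ["4", "9", "16"]


def example_score(submission):
    correct = sum(1 for got, want in zip(submission, REFERENCE) if got == want)
    return correct / len(REFERENCE)


# One invalid, one partially correct, and one best-known submission.
cases = [
    ("invalid", [], 0.0),
    ("partially correct", ["4", "9", "0"], 2 / 3),
    ("best known", ["4", "9", "16"], 1.0),
]

mismatches = check_scores(example_score, cases)
```

An empty `mismatches` list means every submission scored as expected. Any entries it does contain give the label, the expected score, and the actual score, which is worth recording in the markdown file.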

Expedited QA

For some tasks, it’s not practical to do a full QA run because the task takes a long time. In this case, you can do an “expedited” QA run:

  1. Ask the QA runner to read the instructions, explore the resources and tools in the Docker container, and note anything confusing or missing.
  2. Ask the QA runner to briefly describe how they’d approach the problem.
  3. Depending on the task, you can provide hints and/or partial solutions to help the QA runner complete the task faster. If they run into unexpected challenges, such as misinterpreting the instructions, consider whether the task should be modified.
  4. Document the findings of the QA run and what assistance was given. Include the QA runner’s estimate of how long it would take them to complete the full task.