Our Annotations Guide for BIG-Bench Mistake

Authors:

(1) Gladys Tyen, University of Cambridge, Dept. of Computer Science & Technology, ALTA Institute (work done during an internship at Google Research) (e-mail: gladys.tyen@cl.cam.ac.uk);
(2) Hassan Mansoor, Google Research (e-mail: hassan@google.com);
(3) Victor Carbune, Google Research (e-mail: vcarbune@google.com);
(4) Peter Chen, Google Research, equal leadership contribution (e-mail: chenfeif@google.com);
(5) Tony Mak, Google Research, equal leadership contribution (e-mail: tonymak@google.com).

This paper is available on arXiv under a CC 4.0 license.

Table of Links

Abstract and Introduction
BIG-Bench Mistake
Benchmark results
Backtracking
Related Works
Conclusion, Limitations, and References
A. Implementational details
B. Annotation
C. Benchmark scores

B Annotation

We release our annotation guidelines at https://github.com/WHGTyen/BIG-Bench-Mistake. During annotation of the multistep arithmetic task, we found that the first CoT step given in the original BIG-Bench Hard prompt examples (Suzgun et al., 2022) was incorrect. Since all generated traces contained the same first step, we removed that step before showing traces to the annotators.

Figure 3 contains an example screenshot of the user interface. For every trace, we provide the input question as well as the target answer, with a note to be aware of errors that may occur in correct_ans traces (traces whose final answer is correct).

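
The first-step removal described in this section can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `drop_shared_first_step` and the representation of a trace as a list of step strings are assumptions made for the example.

```python
from typing import List


def drop_shared_first_step(traces: List[List[str]]) -> List[List[str]]:
    """Remove the first step from every trace, provided all traces share it.

    Each trace is assumed (hypothetically) to be a list of step strings.
    """
    non_empty = [t for t in traces if t]
    if not non_empty:
        # Nothing to strip; return copies so the input is never mutated.
        return [list(t) for t in traces]

    first = non_empty[0][0]
    # Guard against silently deleting a step that is not actually shared.
    if any(t[0] != first for t in non_empty):
        raise ValueError("traces do not share an identical first step")

    # Slicing an empty trace yields an empty list, so empty traces pass through.
    return [list(t[1:]) for t in traces]
```

The explicit check reflects the condition stated in the text: the step is dropped only because every generated trace contained the same first step.
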
Annotators can click on a word to highlight every occurrence of it across the trace and the question text, which we found particularly helpful for tasks such as word sorting and tracking shuffled objects. Buttons on the right automatically become inactive once a previous step has been labelled as negative.
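
The two interface behaviours described here, word highlighting and button deactivation, can be sketched as below. This is a hedged illustration only: the actual interface code is not given in the text, and the names `button_states` and `matching_positions`, the `True`/`False`/`None` label encoding, and the case-insensitive matching are all assumptions.

```python
from typing import List, Optional


def button_states(labels: List[Optional[bool]]) -> List[bool]:
    """Return, for each step, whether its buttons should be active.

    labels[i] is True (labelled positive), False (labelled negative), or
    None (not yet labelled); this encoding is an assumption. A step's
    buttons are inactive once any earlier step has been labelled negative.
    """
    states = []
    seen_negative = False
    for label in labels:
        states.append(not seen_negative)
        if label is False:
            seen_negative = True
    return states


def matching_positions(word: str, tokens: List[str]) -> List[int]:
    """Indices of tokens equal to the clicked word.

    Case-insensitive matching is an assumption, not stated in the text.
    """
    target = word.lower()
    return [i for i, tok in enumerate(tokens) if tok.lower() == target]
```

For example, `button_states([True, False, None, None])` keeps the first two steps active and deactivates the rest, since the second step carries a negative label.
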