Our Annotations Guide for BIG-Bench Mistake


Authors:

(1) Gladys Tyen, University of Cambridge, Dept. of Computer Science & Technology, ALTA Institute (work done during an internship at Google Research);

(2) Hassan Mansoor, Google Research;

(3) Victor Carbune, Google Research;

(4) Peter Chen, Google Research (equal leadership contribution);

(5) Tony Mak, Google Research (equal leadership contribution).

Abstract and Introduction

BIG-Bench Mistake

Benchmark results

Backtracking

Related Works

Conclusion, Limitations, and References

A. Implementational details

B. Annotation

C. Benchmark scores

B Annotation

We release our annotation guidelines at https://github.com/WHGTyen/BIG-Bench-Mistake.


During annotation of the multistep arithmetic task, we found that the first CoT step given in the original BIG-Bench Hard prompt examples (Suzgun et al., 2022) was incorrect. Since all generated traces contained the same first step, we removed that step before showing traces to the annotators.


Figure 3 contains an example screenshot of the user interface. For every trace, we provide the input question as well as the target answer, with a note to be aware of errors that may occur in correct_ans traces (traces whose final answer is correct).


Annotators can click on words to highlight the same word across the trace and the question text, which we found was particularly helpful for some tasks such as word sorting and tracking shuffled objects. Buttons on the right automatically become inactive if a previous step has been labelled as negative.
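
The two interface behaviours described above can be illustrated with a minimal Python sketch. This is not the code behind the annotation tool, which is not shown here. The function names `matching_word_positions` and `button_states`, the whitespace tokenisation, and the case- and punctuation-insensitive matching are all assumptions made for illustration.

```python
import re
from typing import List, Optional


def matching_word_positions(clicked: str, texts: List[str]) -> List[List[int]]:
    """For each text, return the indices of the words that match `clicked`.

    Assumption: words are split on whitespace, and matching ignores case
    and any punctuation at the start or end of a word.
    """
    def normalise(word: str) -> str:
        return re.sub(r"^\W+|\W+$", "", word).lower()

    target = normalise(clicked)
    return [
        [i for i, word in enumerate(text.split()) if normalise(word) == target]
        for text in texts
    ]


def button_states(labels: List[Optional[bool]]) -> List[bool]:
    """Return, for each step, whether its label buttons are still active.

    labels[i] is True (step labelled correct), False (step labelled
    negative) or None (not yet labelled). A step's buttons are inactive
    as soon as any earlier step has been labelled negative.
    """
    states = []
    seen_negative = False
    for label in labels:
        states.append(not seen_negative)
        if label is False:
            seen_negative = True
    return states
```

Under these assumptions, clicking "apple" would highlight that word in the question and in every trace step where it appears, and labelling the second of four steps as negative would leave only the first two steps' buttons active.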


This paper is available on arXiv under a CC 4.0 license.