
A Detailed Analysis of Inter-Annotator Agreement


Too Long; Didn't Read

Explore the complexities of inter-annotator agreement in error identification, particularly in human-human dialogs. Non-native English-speaking annotators, who volunteered from the lab, reveal challenges in pinpointing errors, with variations across datasets. The study highlights the impact of the modified Integrated Error Taxonomy in improving agreement, addressing challenges posed by over-specialized error types and introducing new categories for comprehensive coverage.

Authors:

(1) Dominic Petrak, UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany;

(2) Nafise Sadat Moosavi, Department of Computer Science, The University of Sheffield, United Kingdom;

(3) Ye Tian, Wluper, London, United Kingdom;

(4) Nikolai Rozanov, Wluper, London, United Kingdom;

(5) Iryna Gurevych, UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany.

Table of Links

Abstract & Introduction

Related Work

Datasets Examined

Manual Error Type Analysis and Taxonomies

Automatic Filtering for Potentially Relevant Dialogs

Statistical Analysis

Evaluation and Experiments

Discussion

Conclusion, Limitation, Acknowledgments, and References

A Integrated Error Taxonomy – Details

B Error-Indicating Sentences And Phrases

C Automatic Filtering – Implementation

D Automatic Filtering – Sentence-Level Analysis

E Task-Oriented Dialogs – Examples

F Effectiveness Of Automatic Filtering – A Detailed Analysis

G Inter-Annotator Agreement – Detailed Analysis

H Annotation Guidelines

I Hyperparameters and Baseline Experiments

J Human-Human Dialogs – Examples

G Inter-Annotator Agreement – Detailed Analysis

This section provides more insight into the inter-annotator agreement. All additional annotators who participated in this study were non-native English speakers. They were experts from our lab with sound English skills and an NLP background. We did not select them based on specific criteria; they participated voluntarily. Accordingly, they were not paid extra, since they did the annotations during their working hours. We did not use any specific annotation tool. We provided the annotators with the dialogs in JSON format and asked them to do the annotations directly in the respective files. See Section H for the annotation guidelines. Table 13 shows the inter-annotator agreement for each dataset using our modified Integrated Error Taxonomy.
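As a rough illustration, agreement over such per-turn annotations could be computed as follows. This is a minimal sketch, assuming JSON files with one error type label per turn and Cohen's kappa as the agreement measure; the file layout, field names, and metric choice are assumptions for illustration, not the paper's prescribed tooling.

```python
# Minimal sketch: compute inter-annotator agreement from two annotators'
# JSON dialog files. File names, the "turns"/"error_type" fields, and the
# use of Cohen's kappa are illustrative assumptions.
import json
from sklearn.metrics import cohen_kappa_score

def load_labels(path, field="error_type"):
    """Collect one label per annotated turn from a JSON dialog file."""
    with open(path, encoding="utf-8") as f:
        dialogs = json.load(f)
    return [turn[field] for dialog in dialogs for turn in dialog["turns"]]

# Hypothetical annotation files for one dataset and two annotators.
ann_a = load_labels("persona_chat_annotator_a.json")
ann_b = load_labels("persona_chat_annotator_b.json")

print("Error type agreement (Cohen's kappa):",
      cohen_kappa_score(ann_a, ann_b))
```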


Table 13: Inter-annotator agreement for each dataset.


In the case of human-human dialogs, the overall agreement is rather low (except for Persona Chat (Zhang et al., 2018)). We find that errors are hard to identify in these dialogs, as humans tend to suggest disagreement rather than accuse their partner of making a mistake. This is also reflected in the agreement on user response types, since that annotation depends on the error type annotation. However, Persona Chat seems to be different (according to Table 8). We attribute this to its dialog type, open-domain, where we find that humans react harshly and accusingly to errors in system utterances, resulting in more direct feedback that is easier to identify.


Table 14: Inter-annotator agreement for the Higashinaka et al. (2021) taxonomy.


Table 14 shows the inter-annotator agreement for each dataset using the original Integrated Error Taxonomy as proposed by Higashinaka et al. (2021). Using this taxonomy deteriorates the inter-annotator agreement. This is most obvious in the case of MWoZ (Budzianowski et al., 2018) and PC (Zhang et al., 2018), which are both human-human datasets. A detailed analysis revealed that this is mostly due to over-specialized error types that were merged in our modified taxonomy, such as ignore expectation and ignore request (I9 and I6 in the original taxonomy, Table 2). Another reason is the newly added error types, such as factually incorrect (E5, Table 4), which are not covered in the original taxonomy but occur in the dialogs.
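The effect of merging over-specialized error types can be illustrated with a toy example. In the sketch below, two hypothetical annotators disagree only between I6 (ignore request) and I9 (ignore expectation); once both labels map to a single merged category, their agreement rises. The label sequences and the mapping are simplified assumptions for illustration, not the full modified taxonomy.

```python
# Toy illustration of how collapsing closely related error types can
# raise inter-annotator agreement. Labels and mapping are hypothetical.
from sklearn.metrics import cohen_kappa_score

MERGE = {"I6": "ignore", "I9": "ignore"}   # collapse over-specialized types

ann_a = ["I6", "I9", "E5", "I6", "E5"]     # hypothetical per-turn labels
ann_b = ["I9", "I6", "E5", "I9", "E5"]

merged_a = [MERGE.get(label, label) for label in ann_a]
merged_b = [MERGE.get(label, label) for label in ann_b]

print("Original taxonomy kappa:", cohen_kappa_score(ann_a, ann_b))   # ~0.12
print("Merged taxonomy kappa:  ", cohen_kappa_score(merged_a, merged_b))  # 1.0
```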


This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.