
A Detailed Analysis of Inter-Annotator Agreement


Too Long; Didn't Read

Explore the complexities of inter-annotator agreement in error identification, particularly in human-human dialogs. Non-native English-speaking annotators, who volunteered from the lab, reveal challenges in pinpointing errors, with variations across datasets. The study highlights the impact of the modified Integrated Error Taxonomy in improving agreement, addressing challenges posed by over-specialized error types and introducing new categories for comprehensive coverage.

Authors:

(1) Dominic Petrak, UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany;

(2) Nafise Sadat Moosavi, Department of Computer Science, The University of Sheffield, United Kingdom;

(3) Ye Tian, Wluper, London, United Kingdom;

(4) Nikolai Rozanov, Wluper, London, United Kingdom;

(5) Iryna Gurevych, UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany.

Table of Links

Abstract & Introduction

Related Work

Datasets Examined

Manual Error Type Analysis and Taxonomies

Automatic Filtering for Potentially Relevant Dialogs

Statistical Analysis

Evaluation and Experiments

Discussion

Conclusion, Limitation, Acknowledgments, and References

A Integrated Error Taxonomy – Details

B Error-Indicating Sentences And Phrases

C Automatic Filtering – Implementation

D Automatic Filtering – Sentence-Level Analysis

E Task-Oriented Dialogs – Examples

F Effectiveness Of Automatic Filtering – A Detailed Analysis

G Inter-Annotator Agreement – Detailed Analysis

H Annotation Guidelines

I Hyperparameters and Baseline Experiments

J Human-Human Dialogs – Examples

G Inter-Annotator Agreement – Detailed Analysis

This section provides more insight into the inter-annotator agreement. All additional annotators who participated in this study were non-native English speakers. They were experts from our lab with sound English skills and an NLP background. We did not select them based on specific criteria; they participated voluntarily. Accordingly, they were not paid extra, since they did the annotations during their working hours. We did not use any specific annotation tool. We provided the annotators with the dialogs in JSON format and asked them to do the annotations directly in the respective files. See Section H for the annotation guidelines. Table 13 shows the inter-annotator agreement for each dataset using our modified Integrated Error Taxonomy.
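As a rough illustration, agreement over such per-turn annotations could be computed as follows. This is a minimal sketch, assuming JSON files with one error type label per turn and Cohen's kappa as the agreement measure; the file layout, field names, and metric choice are assumptions for illustration, not the paper's prescribed tooling.

```python
# Minimal sketch: compute inter-annotator agreement from two annotators'
# JSON dialog files. File names, the "turns"/"error_type" fields, and the
# use of Cohen's kappa are illustrative assumptions.
import json
from sklearn.metrics import cohen_kappa_score

def load_labels(path, field="error_type"):
    """Collect one label per annotated turn from a JSON dialog file."""
    with open(path, encoding="utf-8") as f:
        dialogs = json.load(f)
    return [turn[field] for dialog in dialogs for turn in dialog["turns"]]

# Hypothetical annotation files for one dataset and two annotators.
ann_a = load_labels("persona_chat_annotator_a.json")
ann_b = load_labels("persona_chat_annotator_b.json")

print("Error type agreement (Cohen's kappa):",
      cohen_kappa_score(ann_a, ann_b))
```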


Table 13: Inter-annotator agreement for each dataset.


In the case of human-human dialogs, the overall agreement is rather low (except for Persona Chat (Zhang et al., 2018)). We find that errors are hard to identify in these dialogs, as humans tend to suggest disagreement rather than accuse their partner of making a mistake. This is also reflected in the agreement on user response types, since that annotation depends on the error type annotation. However, Persona Chat seems to be different (according to Table 8). We attribute this to its dialog type, open-domain, where we find that humans react harshly and accusingly to errors in system utterances, resulting in more direct feedback that is easier to identify.


Table 14: Inter-annotator agreement for the Higashinaka et al. (2021) taxonomy.


Table 14 shows the inter-annotator agreement for each dataset using the original Integrated Error Taxonomy as proposed by Higashinaka et al. (2021). Using this taxonomy deteriorates the inter-annotator agreement. This is most obvious in the case of MWoZ (Budzianowski et al., 2018) and PC (Zhang et al., 2018), which are both human-human datasets. A detailed analysis revealed that this is mostly due to over-specialized error types that were merged in our modified taxonomy, such as ignore expectation and ignore request (I9 and I6 in the original taxonomy, Table 2). Another reason is the newly added error types, such as factually incorrect (E5, Table 4), which are not covered in the original taxonomy but occur in the dialogs.
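The effect of merging over-specialized error types can be illustrated with a toy example. In the sketch below, two hypothetical annotators disagree only between I6 (ignore request) and I9 (ignore expectation); once both labels map to a single merged category, their agreement rises. The label sequences and the mapping are simplified assumptions for illustration, not the full modified taxonomy.

```python
# Toy illustration of how collapsing closely related error types can
# raise inter-annotator agreement. Labels and mapping are hypothetical.
from sklearn.metrics import cohen_kappa_score

MERGE = {"I6": "ignore", "I9": "ignore"}   # collapse over-specialized types

ann_a = ["I6", "I9", "E5", "I6", "E5"]     # hypothetical per-turn labels
ann_b = ["I9", "I6", "E5", "I9", "E5"]

merged_a = [MERGE.get(label, label) for label in ann_a]
merged_b = [MERGE.get(label, label) for label in ann_b]

print("Original taxonomy kappa:", cohen_kappa_score(ann_a, ann_b))   # ~0.12
print("Merged taxonomy kappa:  ", cohen_kappa_score(merged_a, merged_b))  # 1.0
```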


This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.