Authors:
(1) Dominic Petrak, UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany;
(2) Nafise Sadat Moosavi, Department of Computer Science, The University of Sheffield, United Kingdom;
(3) Ye Tian, Wluper, London, United Kingdom;
(4) Nikolai Rozanov, Wluper, London, United Kingdom;
(5) Iryna Gurevych, UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany.
Manual Error Type Analysis and Taxonomies
Automatic Filtering for Potentially Relevant Dialogs
Conclusion, Limitation, Acknowledgments, and References
A Integrated Error Taxonomy – Details
B Error-Indicating Sentences And Phrases
C Automatic Filtering – Implementation
D Automatic Filtering – Sentence-Level Analysis
E Task-Oriented Dialogs – Examples
F Effectiveness Of Automatic Filtering – A Detailed Analysis
G Inter-Annotator Agreement – Detailed Analysis
I Hyperparameters and Baseline Experiments
J Human-Human Dialogs – Examples
The goal of this work was to investigate the types and frequency of errors in system utterances, and of the subsequent user responses, in the examined datasets, in order to assess whether these datasets can be extended with annotations for learning from free-text human feedback. We found that this mostly depends on whether the dialogs are human-human or human-bot. In human-human dialogs, humans tend to express disagreement very politely rather than accusing their partner of a mistake (see Appendix J for examples). Accordingly, these dialogs contain little free-text human feedback that could be used for learning (Sections 6.2 and 6.3), so extending these datasets with such annotations would likely be laborious and ineffective. This is different in human-bot dialogs, where humans often react harshly and accusingly to errors in system utterances, resulting in more direct feedback. However, suitability also depends on the dialog type: in general, open-domain and knowledge-grounded dialogs contain more errors and more user responses that are likely to contain free-text human feedback, making them better suited for this purpose (Section 6.1).
Using the manually annotated dialogs from Section 6, our experiments in Section 7.2 suggest that including user responses to errors in system utterances has a positive impact on response generation, which supports the findings of recent work on including free-text human feedback (Xu et al., 2023; Ung et al., 2022). Additionally, our results suggest that including the error-annotated system utterance itself can have a positive impact. From our point of view, distinguishing between user response types could be an interesting alternative to binary signals, such as user satisfaction (Hancock et al., 2019) or thumbs-down (Shuster et al., 2022), as an indicator of an error in a system utterance. However, the dialogs annotated in Section 6 do not provide enough such data for a thorough analysis that also takes the different types of user responses into account. Therefore, we leave this as a research question for future work. Our human evaluation in Section 7.1 shows that our proposed taxonomies may serve as a promising starting point for obtaining the necessary annotations, although they may not cover all possible error and user response types.
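To make the setup from Section 7.2 concrete, the following is a minimal sketch (not the authors' code) of how a training input for response generation might be assembled so that it includes both the error-annotated system utterance and the subsequent free-text user feedback. The special tokens, field names, and helper function are illustrative assumptions, not part of the paper.

```python
# Hypothetical sketch: building a model input that exposes the erroneous
# system turn and the user's free-text feedback to the response generator.
# Special tokens are assumptions for illustration only.
ERROR_TOKEN = "<error>"        # marks the error-annotated system utterance
FEEDBACK_TOKEN = "<feedback>"  # marks the user's free-text feedback

def build_input(context, erroneous_utterance, user_feedback):
    """Concatenate the dialog context, the erroneous system turn, and the
    user's feedback into a single input string for fine-tuning."""
    parts = list(context)
    parts.append(f"{ERROR_TOKEN} {erroneous_utterance}")
    parts.append(f"{FEEDBACK_TOKEN} {user_feedback}")
    return " ".join(parts)

# Usage: the model would then be trained to generate a corrected response.
example = build_input(
    context=["User: Who wrote Dracula?"],
    erroneous_utterance="System: Dracula was written by Mary Shelley.",
    user_feedback="User: No, that's wrong. Mary Shelley wrote Frankenstein.",
)
print(example)
```

A design choice sketched here is that the error turn and the feedback turn are marked with distinct tokens; this would also allow encoding the user response type (as per the proposed taxonomy) instead of a single generic feedback marker, in line with the alternative to binary signals discussed above.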
This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.