Authors:
(1) Dominic Petrak, UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany;
(2) Nafise Sadat Moosavi, Department of Computer Science, The University of Sheffield, United Kingdom;
(3) Ye Tian, Wluper, London, United Kingdom;
(4) Nikolai Rozanov, Wluper, London, United Kingdom;
(5) Iryna Gurevych, UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany.
Manual Error Type Analysis and Taxonomies
Automatic Filtering for Potentially Relevant Dialogs
Conclusion, Limitation, Acknowledgments, and References
A Integrated Error Taxonomy – Details
B Error-Indicating Sentences And Phrases
C Automatic Filtering – Implementation
D Automatic Filtering – Sentence-Level Analysis
E Task-Oriented Dialogs – Examples
F Effectiveness Of Automatic Filtering – A Detailed Analysis
G Inter-Annotator Agreement – Detailed Analysis
I Hyperparameters and Baseline Experiments
J Human-Human Dialogs – Examples
Datasets Annotated with Free-Text Human Feedback

As of today, dialog datasets with annotations for free-text human feedback are few and mostly focused on specific error types. For example, FITS (Xu et al., 2023), a dataset of 14,000 human-bot dialogs, provides free-text human feedback for response quality and search query errors. SaFeRDialogues (Ung et al., 2022) consists of 7,000 human-bot dialogs, but only addresses toxicity in system utterances. The dataset published with the Self-Feeding Chatbot (Hancock et al., 2019) provides 60,000 human-bot dialogs, partly annotated with response alternatives for dissatisfying system utterances. Other works, such as Park et al. (2021) and Veron et al. (2021), collected free-text human feedback for evaluating their approaches, but never made the data publicly available. In this work, we investigate the types of errors and subsequent user responses in six widely used dialog datasets, including MultiWOZ (Budzianowski et al., 2018), PersonaChat (Zhang et al., 2018), and SGD (Rastogi et al., 2020), to assess whether they can be extended with annotations for learning from free-text human feedback as an alternative to collecting such data from scratch.
Error and User Response Type Taxonomies

The error taxonomies used to collect the datasets discussed above are very specific, which limits their applicability; for example, SaFeRDialogues (Ung et al., 2022) focuses only on toxicity. However, errors in conversational AI have long been a subject of research, and more comprehensive error taxonomies are already available. For example, Dybkjaer et al. (1996) and Möller et al. (2007) focus on errors in task-oriented dialog systems and distinguish multiple error types with a focus on content-related errors and practical aspects. More recently published error taxonomies, such as the Integrated Error Taxonomy proposed by Higashinaka et al. (2021), distinguish levels of errors, which makes them more broadly applicable. The Integrated Error Taxonomy covers 17 error types across four levels: utterance, response, context, and society. Regarding user responses subsequent to errors in system utterances, See and Manning (2021) proposed a taxonomy that distinguishes dissatisfaction and unclear user utterances.
In this work, we use the Integrated Error Taxonomy by Higashinaka et al. (2021) to study errors in the system utterances of 1,200 dialogs from six dialog datasets. Based on this analysis, we determine (1) the different types of user responses to errors in system utterances, and (2) the limitations of the Integrated Error Taxonomy, such as missing error types.
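For illustration, the sketch below shows one possible way to represent annotations under a level-based taxonomy of this kind. The four level names follow the Integrated Error Taxonomy as described above; the concrete error-type strings, user-response labels, and the `count_by_level` helper are illustrative assumptions, not the authors' annotation tooling.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorLevel(Enum):
    """The four levels of the Integrated Error Taxonomy (Higashinaka et al., 2021)."""
    UTTERANCE = "utterance"
    RESPONSE = "response"
    CONTEXT = "context"
    SOCIETY = "society"


@dataclass
class ErrorAnnotation:
    """One error annotation for a single system utterance in a dialog."""
    dialog_id: str
    turn_index: int          # index of the system utterance within the dialog
    level: ErrorLevel        # taxonomy level of the error
    error_type: str          # error type name (illustrative, e.g. "ignore question")
    user_response_type: str  # type of the user's subsequent response (illustrative)


def count_by_level(annotations: list[ErrorAnnotation]) -> dict[ErrorLevel, int]:
    """Aggregate annotations per taxonomy level, e.g. to compare datasets."""
    counts = {level: 0 for level in ErrorLevel}
    for ann in annotations:
        counts[ann.level] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical annotations; dialog ids and labels are placeholders.
    sample = [
        ErrorAnnotation("multiwoz-0001", 4, ErrorLevel.RESPONSE,
                        "ignore question", "correction"),
        ErrorAnnotation("personachat-0042", 2, ErrorLevel.CONTEXT,
                        "self-contradiction", "clarification request"),
    ]
    print(count_by_level(sample))
```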
This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.