Authors:
(1) Dominic Petrak, UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany;
(2) Nafise Sadat Moosavi, Department of Computer Science, The University of Sheffield, United Kingdom;
(3) Ye Tian, Wluper, London, United Kingdom;
(4) Nikolai Rozanov, Wluper, London, United Kingdom;
(5) Iryna Gurevych, UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany.
Manual Error Type Analysis and Taxonomies
Automatic Filtering for Potentially Relevant Dialogs
Conclusion, Limitation, Acknowledgments, and References
A Integrated Error Taxonomy – Details
B Error-Indicating Sentences And Phrases
C Automatic Filtering – Implementation
D Automatic Filtering – Sentence-Level Analysis
E Task-Oriented Dialogs – Examples
F Effectiveness Of Automatic Filtering – A Detailed Analysis
G Inter-Annotator Agreement – Detailed Analysis
I Hyperparameters and Baseline Experiments
J Human-Human Dialogs – Examples
Learning from free-text human feedback is essential for dialog systems, but annotated data is scarce and usually covers only a small fraction of error types known in conversational AI. Instead of collecting and annotating new datasets from scratch, recent advances in synthetic dialog generation could be used to augment existing dialog datasets with the necessary annotations. However, to assess the feasibility of such an effort, it is important to know the types and frequency of free-text human feedback included in these datasets. In this work, we investigate this question for a variety of commonly used dialog datasets, including MultiWoZ, SGD, BABI, PersonaChat, Wizards-of-Wikipedia, and the human-bot split of the Self-Feeding Chatbot. Using our observations, we derive new taxonomies for the annotation of free-text human feedback in dialogs and investigate the impact of including such data in response generation for three SOTA language generation models, including GPT-2, LLAMA, and Flan-T5. Our findings provide new insights into the composition of the datasets examined, including error types, user response types, and the relations between them.
If a dialog system generates a dissatisfying or erroneous response, e.g., a response with factually incorrect information, users tend to provide a textual description of what went wrong or what they would have expected (See and Manning, 2021; Xu et al., 2023; Ung et al., 2022). This textual description is usually referred to as free-text human feedback, and it is an important source for improving dialog systems and keeping them engaging and socially acceptable (Shuster et al., 2022; Christiano et al., 2017; Ouyang et al., 2022; Xu et al., 2023). Learning from such data requires annotations for errors in system utterances and subsequent free-text human feedback, but available datasets are scarce and typically cover only a small subset of the error types known in conversational AI, such as in the cases of SaFeRDialogues (Ung et al., 2022) and FITS (Xu et al., 2023). As a result, data collection is usually a first step for research on learning from free-text human feedback (Hancock et al., 2019; Veron et al., 2021; Park et al., 2021). To avoid this in future research, recent advances in synthetic dialog generation (Kim et al., 2022; Zheng et al., 2022) could be used to augment existing dialog datasets with annotations for errors and free-text human feedback. However, to assess the feasibility of such an effort, it is important to know the types and frequency of such data included in these datasets.
In this work, we investigate this question for a variety of dialog datasets, including MultiWoZ (Budzianowski et al., 2018), PersonaChat (Zhang et al., 2018), Wizards-of-Wikipedia (Dinan et al., 2019), SGD (Rastogi et al., 2020), BABI (Bordes et al., 2017), and the human-bot split from the Self-Feeding Chatbot (Hancock et al., 2019). For this, we follow a two-step approach in which we first use the Integrated Error Taxonomy proposed by Higashinaka et al. (2021) to study the types of errors in system utterances and subsequent user responses in subsets of these datasets. We use the insights gained during this process to derive a new user response type taxonomy and a modified Integrated Error Taxonomy for the annotation of free-text human feedback in dialog data. In the second step, we use SentenceTransformer (Reimers and Gurevych, 2019) to identify similar situations in the remaining dialogs of the datasets. For a subsequent statistical analysis, we manually annotate subsets of the identified dialogs with error and user response types. We also use this manually annotated data to investigate the impact of including this data in response generation, using three SOTA language generation models, i.e., GPT-2 (Radford et al., 2019), LLAMA (Touvron et al., 2023), and Flan-T5 (Chung et al., 2022).
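To give a rough sense of the second step, the sketch below shows how similarity-based filtering with SentenceTransformer embeddings might look. It is not the authors' exact implementation; the model name, the example error-indicating phrases, and the similarity threshold are illustrative assumptions.

```python
# Minimal sketch of similarity-based dialog filtering with SentenceTransformer.
# Model name, phrases, and threshold are assumptions, not the paper's exact setup.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical error-indicating user responses collected during the manual analysis
error_indicating_phrases = [
    "No, that is not what I asked for.",
    "That information is wrong.",
    "You already said that.",
]
phrase_embeddings = model.encode(error_indicating_phrases, convert_to_tensor=True)

def is_potentially_relevant(user_utterance: str, threshold: float = 0.6) -> bool:
    """Flag an utterance if it is semantically similar to any error-indicating phrase."""
    utterance_embedding = model.encode(user_utterance, convert_to_tensor=True)
    similarities = util.cos_sim(utterance_embedding, phrase_embeddings)
    return bool(similarities.max() >= threshold)

# Example: flag a user turn that likely contains free-text feedback on a system error
print(is_potentially_relevant("No, I wanted a restaurant in the city center."))
```

Dialogs flagged this way would then form the candidate pool for manual annotation with the error and user response type taxonomies.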
We find that the types and frequency of errors and of user responses that include free-text human feedback largely depend on (1) whether the dialogs are human-human or human-bot, and (2) whether the dialogs are task-oriented, open-domain, or knowledge-grounded. Our analysis and experiments show that using our taxonomies is beneficial for identifying free-text human feedback, and that including such data has a positive impact on response generation.
This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.