Authors:
(1) Dominic Petrak, UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany;
(2) Nafise Sadat Moosavi, Department of Computer Science, The University of Sheffield, United Kingdom;
(3) Ye Tian, Wluper, London, United Kingdom;
(4) Nikolai Rozanov, Wluper, London, United Kingdom;
(5) Iryna Gurevych, UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany.
Manual Error Type Analysis and Taxonomies
Automatic Filtering for Potentially Relevant Dialogs
Conclusion, Limitation, Acknowledgments, and References
A Integrated Error Taxonomy – Details
B Error-Indicating Sentences And Phrases
C Automatic Filtering – Implementation
D Automatic Filtering – Sentence-Level Analysis
E Task-Oriented Dialogs – Examples
F Effectiveness Of Automatic Filtering – A Detailed Analysis
G Inter-Annotator Agreement – Detailed Analysis
I Hyperparameters and Baseline Experiments
J Human-Human Dialogs – Examples
In this section, we use the manually annotated dialogs from Section 6 in a human evaluation to assess the impact of our modifications to the Integrated Error Taxonomy (Higashinaka et al., 2021). We also use these dialogs to investigate the performance impact of including errors in system utterances and the subsequent user responses as additional input signals for response generation with three SOTA language generation models: GPT-2 (Radford et al., 2019), LLAMA (Touvron et al., 2023), and Flan-T5 (Chung et al., 2022).
To evaluate the impact of our modifications to reduce ambiguity and address missing error types in the Integrated Error Taxonomy (Higashinaka et al., 2021), we perform a human evaluation. We asked nine experts with an NLP background and sound English skills to annotate 600 of the dialogs that we manually annotated in Section 6 (300 each from the filtered and random dialogs, 50 per dataset) with error and user response types, using our modified Integrated Error Taxonomy (Section 4.1) and our proposed user response type taxonomy (Section 4.2). Each dialog was then assigned to two of these experts and thus annotated three times in total (including our own initial annotation). For comparison, we mapped all annotations back to the original Integrated Error Taxonomy. For merged error types, we asked the annotators for a second assessment using the original taxonomy. Table 10 shows the inter-annotator agreement (IAA), calculated using Krippendorff's Alpha (Krippendorff, 2004) and grouped by human-human and human-bot dialogs.
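For reference, the following is a minimal sketch of how such an agreement score can be computed with the krippendorff Python package; the annotation format, label encoding, and variable names are illustrative assumptions rather than the exact setup used here.

```python
# Minimal sketch of the IAA computation; the annotation format and label
# encoding below are illustrative assumptions.
import numpy as np
import krippendorff  # pip install krippendorff

def iaa_alpha(annotations):
    """annotations: one list of labels per annotator, aligned by dialog, e.g.
    [["ignore", "factually_incorrect", ...],   # annotator 1
     ["ignore", "unclear_intention", ...],     # annotator 2
     ["ignore", "factually_incorrect", ...]]   # our own initial annotation
    Missing annotations can be passed as None."""
    labels = sorted({l for ann in annotations for l in ann if l is not None})
    code = {l: i for i, l in enumerate(labels)}
    data = np.array([[code[l] if l is not None else np.nan for l in ann]
                     for ann in annotations], dtype=float)
    # Error and user response types are categorical, hence nominal measurement.
    return krippendorff.alpha(reliability_data=data,
                              level_of_measurement="nominal")
```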
In the case of human-human dialogs, the overall agreement is rather low. This also applies to the user response types (which was to be expected, as they directly depend on the error type annotations). During our analysis in Section 6, we found that in human-human dialogs, participants tend to voice disagreement in a friendly manner, which complicates the identification of errors. This is different in human-bot dialogs, where participants tend to provide direct and clear feedback, such as You're wrong. We attribute the low agreement for the annotations in human-human dialogs to this observation. Nonetheless, using our modified Integrated Error Taxonomy improves IAA over the original one in all cases. This is most obvious for the human-human filtered dialogs, where it improves IAA by 0.14 points. A detailed analysis revealed that this is mainly due to (1) the condensed number of abstract error types, e.g., we merged ambiguous error types such as Ignore Proposal and Ignore Request, and (2) the newly added error types, such as Factually Incorrect, which were not covered in the original taxonomy.
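To make the comparison concrete, the back-mapping from the modified taxonomy to the original one can be viewed as a one-to-many lookup; the label strings in the following sketch are invented for illustration, and only the merge of Ignore Proposal and Ignore Request is taken from the text.

```python
# Illustrative back-mapping from the modified taxonomy to the original
# Integrated Error Taxonomy; label strings are invented for illustration.
BACK_MAPPING = {
    "ignore": ["ignore_proposal", "ignore_request"],  # merged in the modified taxonomy
    "factually_incorrect": [],                        # newly added, no original counterpart
}

def map_back(label, second_assessment=None):
    originals = BACK_MAPPING.get(label, [label])
    if len(originals) > 1:
        # Merged types are ambiguous under the original taxonomy, so the
        # annotator's second assessment decides.
        return second_assessment
    return originals[0] if originals else None
```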
In the following, we investigate the performance impact of including errors in system utterances and the subsequent user responses as additional input signals for response generation. For this experiment, we consider three SOTA language generation models: GPT-2 (Radford et al., 2019), LLAMA (Touvron et al., 2023), and Flan-T5 (Chung et al., 2022). For GPT-2 and Flan-T5, we use the large variants. For LLAMA, we use the 7B variant.
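A minimal sketch of the model setup with the Hugging Face transformers library is shown below; the GPT-2 and Flan-T5 identifiers are the standard large checkpoints, while the LLAMA-7B identifier is a placeholder, since its weights are distributed separately.

```python
# Minimal sketch of the model setup using Hugging Face transformers.
# "gpt2-large" and "google/flan-t5-large" are the standard large checkpoints;
# the LLAMA-7B identifier is a placeholder.
from transformers import (AutoModelForCausalLM, AutoModelForSeq2SeqLM,
                          AutoTokenizer)

gpt2_tok = AutoTokenizer.from_pretrained("gpt2-large")
gpt2 = AutoModelForCausalLM.from_pretrained("gpt2-large")

flan_tok = AutoTokenizer.from_pretrained("google/flan-t5-large")
flan_t5 = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

llama_tok = AutoTokenizer.from_pretrained("path/to/llama-7b")  # placeholder path
llama = AutoModelForCausalLM.from_pretrained("path/to/llama-7b")
```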
Experimental Setup We use the dialogs annotated in Section 6 for this experiment. As a first step, we use the 967 dialogs without error and user response type annotations to train baseline models for response generation. Next, we train the baseline models on the 188 dialogs annotated with error and user response types, including the annotated data as additional input signals. For error types, we include the respective system utterance (Error Text). For user responses, we include the respective user utterance (User Response). As evaluation metrics, we use word-overlap F1 score (following Xu et al. (2023) with FITS) and BLEU (Papineni et al., 2002). We provide more details, including the baseline results, in Appendix I.
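The sketch below illustrates how the additional input signals and the two metrics could be implemented; the prompt layout, the separator token, and the function names are illustrative assumptions rather than the exact setup.

```python
# Sketch of the input construction and evaluation metrics; the prompt layout
# and separator token are illustrative assumptions.
from collections import Counter
from sacrebleu import corpus_bleu  # pip install sacrebleu

def build_input(history, error_text=None, user_response=None):
    """Concatenate the dialog history with the optional additional signals."""
    parts = list(history)
    if error_text is not None:
        parts.append(f"Error Text: {error_text}")        # erroneous system utterance
    if user_response is not None:
        parts.append(f"User Response: {user_response}")  # subsequent user feedback
    return " </s> ".join(parts)

def word_f1(prediction, target):
    """Word-overlap F1 between a predicted and a target response."""
    pred, ref = prediction.split(), target.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def bleu(predictions, targets):
    """Corpus-level BLEU over predicted and target responses."""
    return corpus_bleu(predictions, [targets]).score
```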
Results Table 11 shows the results. We find a large performance gap between Flan-T5 (Chung et al., 2022) and the other models. A detailed analysis revealed that both GPT-2 (Radford et al., 2019) and LLAMA (Touvron et al., 2023) generate reasonable and fluent responses, but mostly deviate from the target sequence. That aside, including user responses as an additional input signal improves the results over the other configurations, including None, for both Flan-T5 and GPT-2. For LLAMA, additionally using the error text improves the results over the other configurations.
Overall, our results support the findings of recent work on learning from free-text human feedback, namely that including user responses to errors in system utterances is beneficial (Xu et al., 2023; Ung et al., 2022). However, we also show that including the error itself can have a positive impact.
This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.