Authors:
(1) Dominic Petrak, UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany;
(2) Nafise Sadat Moosavi, Department of Computer Science, The University of Sheffield, United Kingdom;
(3) Ye Tian, Wluper, London, United Kingdom;
(4) Nikolai Rozanov, Wluper, London, United Kingdom;
(5) Iryna Gurevych, UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany.
Manual Error Type Analysis and Taxonomies
Automatic Filtering for Potentially Relevant Dialogs
Conclusion, Limitation, Acknowledgments, and References
A Integrated Error Taxonomy – Details
B Error-Indicating Sentences And Phrases
C Automatic Filtering – Implementation
D Automatic Filtering – Sentence-Level Analysis
E Task-Oriented Dialogs – Examples
F Effectiveness Of Automatic Filtering – A Detailed Analysis
G Inter-Annotator Agreement – Detailed Analysis
I Hyperparameters and Baseline Experiments
J Human-Human Dialogs – Examples
We first study the errors in system utterances in a randomly sampled set of 1,200 dialogs (200 dialogs from each of the six datasets), using the Integrated Error Taxonomy proposed by Higashinaka et al. (2021). The taxonomy consists of 17 error types (I1-I17) across four levels: utterance, response, context, and society. Higashinaka et al. further categorize error types into content violations, i.e., errors that may cause a dialog breakdown, and form violations, i.e., cases where the system utterance is not interpretable due to severe grammatical problems. Table 2 presents a summary of the error types (see Appendix A for more details).
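The sampling step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the dataset names and contents are placeholders.

```python
import random

# Sketch of the sampling step: 200 dialogs per dataset, 1,200 in total.
# Dataset names and dialog IDs are illustrative placeholders only.
datasets = {name: [f"{name}-dialog-{i}" for i in range(1000)]
            for name in ["ds1", "ds2", "ds3", "ds4", "ds5", "ds6"]}

random.seed(0)  # fixed seed for reproducibility
sample = [dialog
          for dialogs in datasets.values()
          for dialog in random.sample(dialogs, 200)]
print(len(sample))  # 1200
```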
If we find an error, we analyze the following user response for an error-indicating phrase, a short text fragment of arbitrary length that directly addresses the error in the previous system utterance (e.g., how do you mean) or indicates user dissatisfaction (e.g., I don't like that), and add the surrounding sentence to our list of unique error-indicating sentences. We use these error-indicating sentences in Section 5 to explore the remaining dialogs from each dataset for user responses (and thus errors) similar to the ones observed in this step.
Overall, we found 79 errors in system utterances and collected a set of 67 error-indicating sentences with an average sentence length of approximately 6.52 words (see Appendix B for all collected phrases and sentences). Each sentence contains a unique error-indicating phrase with an average length of 3.52 words. Contractions (two words joined into one, e.g., don't or it's) are counted as one word. Table 3 shows the distribution of error-indicating sentences across datasets.
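Under the counting convention above, whitespace tokenization already treats a contraction as a single token, so the statistics reduce to a simple split. A minimal sketch with hypothetical phrases:

```python
# Sketch: average phrase length in words, where contractions such as
# "don't" or "it's" count as a single word. Whitespace splitting
# satisfies this convention directly. Phrases are illustrative only.
def word_count(text: str) -> int:
    return len(text.split())

phrases = ["how do you mean", "i don't like that"]
avg = sum(word_count(p) for p in phrases) / len(phrases)
print(avg)  # 4.0 -- "i don't like that" is four words, not five
```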
We find most error-indicating sentences in open-domain and knowledge-grounded datasets, especially in SFC (Hancock et al., 2019).
During this study, we found that the Integrated Error Taxonomy (Higashinaka et al., 2021) has weaknesses: some error types are never observed, while others are missing. Based on these insights, we modify the taxonomy for classifying errors in system utterances. Table 4 shows the resulting taxonomy.
We ignore Lack of Information (I12 in Table 2), since it is rarely observed by Higashinaka et al. (2021) and we never observed it in our study. We also ignore I1-I3 for the same reason; moreover, we found them rather ambiguous. For example, the Semantic Error type (I3) is intended for invalid predicate/argument combinations, such as situations where a missing letter results in a different meaning (raining instead of training). This is similar to the Lack of Common Sense error type (I17, now E10), since the model is supposed to be aware of the concept, just not in the given context. For Wrong Information (I4), we introduce a new error type, Factually Incorrect (E5), which extends the original definition to also cover factually incorrect knowledge. Furthermore, we ignore Contradiction (I14), since it is covered by the Lack of Common Sense and Factually Incorrect error types. We merge Ignore Request (I6) and Ignore Proposal (I7) into one error type (E2), since both are very similar in meaning. Next, we merge Ignore Greeting (I8) into Lack of Sociality (I16, now E9), as the latter implies the former. We merge Repetition (I15) and Self-Contradiction (I13) into a new error type, Conversationality (E7), since we observed both only rarely and only in situations in which the system had lost the thread of the conversation. Finally, we observed instances of incorrectly conveyed attributes (slots) that the original taxonomy does not account for; to address this, we introduce the Attribute Error type (E4).
During this study, we observed five distinct patterns in user responses that follow errors in system utterances; these patterns are also reflected in the collected error-indicating sentences. We propose them as a new taxonomy for annotating such data in dialogs (Table 5).
Among these, we find that UR2, UR3, and UR5 are likely to contain free-text human feedback, such as corrections, new knowledge, or response alternatives.
This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.