Authors:
(1) Dominic Petrak, UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany;
(2) Nafise Sadat Moosavi, Department of Computer Science, The University of Sheffield, United Kingdom;
(3) Ye Tian, Wluper, London, United Kingdom;
(4) Nikolai Rozanov, Wluper, London, United Kingdom;
(5) Iryna Gurevych, UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany.
Manual Error Type Analysis and Taxonomies
Automatic Filtering for Potentially Relevant Dialogs
Conclusion, Limitation, Acknowledgments, and References
A Integrated Error Taxonomy – Details
B Error-Indicating Sentences And Phrases
C Automatic Filtering – Implementation
D Automatic Filtering – Sentence-Level Analysis
E Task-Oriented Dialogs – Examples
F Effectiveness Of Automatic Filtering – A Detailed Analysis
G Inter-Annotator Agreement – Detailed Analysis
I Hyperparameters and Baseline Experiments
J Human-Human Dialogs – Examples
In this section, we conduct a statistical analysis of the distribution of error and user response types, and of their relations, in the dialogs of the datasets examined. For this, we manually annotate 555 of the filtered dialogs (100 from each dataset, if available) with error and user response types, using the taxonomies proposed in Sections 4.1 and 4.2. To avoid bias from our filtering procedure in Section 5, we additionally include 600 randomly selected dialogs (100 from each dataset) that were not identified by this procedure (similarity < 50%). Hereinafter, these dialogs are referred to as random dialogs. In Section 6.4, we assess the impact of our filtering procedure on this analysis.
Overall, we manually annotate 1,155 dialogs with error and user response types. For annotation, we always consider the entire dialog (the context).
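For concreteness, the following is a minimal sketch of how such an annotation pool could be assembled, assuming each dialog carries the maximum similarity score assigned by the filtering procedure in Section 5; the function name, the similarity field, and the per-dataset structure are illustrative assumptions, not the authors' released code.

import random

# Minimal sketch (not the authors' code): assemble the annotation pool
# described above. `datasets` maps a dataset name (e.g., "SFC", "WoW")
# to its dialogs; each dialog is assumed to carry the maximum similarity
# score assigned by the filtering procedure in Section 5.

PER_DATASET = 100           # up to 100 dialogs per dataset and per pool
SIMILARITY_THRESHOLD = 0.5  # dialogs below 50% similarity were not filtered

def build_annotation_pool(datasets, seed=42):
    rng = random.Random(seed)
    filtered_pool, random_pool = [], []
    for dialogs in datasets.values():
        filtered = [d for d in dialogs if d["similarity"] >= SIMILARITY_THRESHOLD]
        missed = [d for d in dialogs if d["similarity"] < SIMILARITY_THRESHOLD]
        # "if available": some datasets yield fewer than 100 filtered dialogs
        filtered_pool += rng.sample(filtered, min(PER_DATASET, len(filtered)))
        random_pool += rng.sample(missed, min(PER_DATASET, len(missed)))
    return filtered_pool, random_pool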
Overall, we identified 188 errors across all dialogs. Table 7 shows the distribution.
As expected, the filtered dialogs contain a larger number of errors (130 overall) than the random dialogs (58 overall), especially for open-domain and knowledge-grounded dialogs, such as SFC (Hancock et al., 2019) and WoW (Dinan et al., 2019).
Table 8 shows the most common error types and their frequency for both the filtered and random dialogs; together, these account for 172 of all identified errors.⁴ In the case of open-domain dialogs, the most frequent error types are Ignore Question (E1) and Topic Transition Error (E6). This is particularly the case in the SFC dataset (Hancock et al., 2019), where we find the system utterances to often be out of context. In the case of task-oriented dialogs, Ignore Request (E2) and Ignore Expectation (E3) are the most common error types. We observe these errors when requests are only partially processed, e.g., when the user asks to book a hotel room and a train, but the system only books the hotel room. Moreover, we find little linguistic variety in task-oriented dialogs, regardless of the number of tasks reflected in the dataset (see Appendix E for examples). In the case of WoW (Dinan et al., 2019), the knowledge-grounded dataset, Factually Incorrect (E5) is the most frequently observed error type.
Table 9 shows the distribution of user response types to errors in system utterances for both the random dialogs (R) and the filtered dialogs (F).
As described in Section 4.2, UR2 (Repeat or Rephrase), UR3 (Make Aware with Correction), and UR5 (Ask for Clarification) are likely to contain free-text human feedback. In the case of the filtered dialogs, we find that UR3 and UR5 are observed more often in open-domain and knowledge-grounded dialogs, such as SFC (Hancock et al., 2019) or WoW (Dinan et al., 2019). UR2 is only rarely observed, and only in task-oriented dialogs. However, UR1 (Ignore and Continue) is also frequently observed, especially in SFC. For the randomly selected dialogs, this is the most frequent user response type (it occurs 42 times).
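This distinction between feedback-bearing and feedback-free response types is used again in Section 6.4; the simple lookup below captures it (the dictionary itself is purely illustrative and only restates the descriptions above).

# Illustrative lookup (follows Sections 4.2 and 6.4): which user response
# types are likely to contain free-text human feedback.
CONTAINS_FREE_TEXT_FEEDBACK = {
    "UR1": False,  # Ignore and Continue
    "UR2": True,   # Repeat or Rephrase
    "UR3": True,   # Make Aware with Correction
    "UR4": False,  # response adds no additional information (cf. Section 6.4)
    "UR5": True,   # Ask for Clarification
}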
Figure 1 illustrates the relation between the most common error types (the 172 errors presented in Table 8) and user response types (Table 9) in both the filtered and random dialogs.
We find that UR1, UR4, and UR5 are the most frequently observed user response types, particularly in the case of Ignore Question (E1) and Topic Transition Error (E6), which are mostly observed in open-domain datasets (Table 8). Along with UR3, UR5 is also a frequent response type in the case of Factually Incorrect (E5), which is mostly observed in WoW (Dinan et al., 2019). UR2 is only rarely observed. It sometimes occurs in the context of Ignore Request (E2) and Ignore Expectation (E3), which are mostly found in task-oriented dialogs.
As Table 7 shows, a total of 188 errors were identified in this analysis: 130 were found in the filtered dialogs and 58 in the random dialogs (those missed by our automatic filtering procedure in Section 5). Considering this at the level of user response types (Table 9), 46 of these 58 errors were either ignored by users or received responses without any additional information (UR1 or UR4, marked in bold red in the table), meaning that they are irrelevant for our purposes because they do not come with free-text human feedback. For the remaining 12 missed errors (UR3 or UR5, marked in bold green), we find that they are not reflected in the set of 67 error-indicating sentences used for filtering. Although this limits its effectiveness, we find that our automatic filtering procedure itself has no negative impact on the results of our analysis, but rather improves annotation efficiency. An approximated recall of 0.72 supports this assumption. The recall was approximated with respect to the ratio between the size of the filtered subsets and the original datasets⁵ (Table 6), and it only considers the 12 missed relevant errors (when considering all missed errors, the recall is 0.35). We provide a more detailed analysis in Appendix F.
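For intuition, the sketch below shows one plausible way such an approximation could be computed. The paper's exact computation relies on the subset-to-dataset size ratios from Table 6, which are not reproduced here, so the scale factor below is a hypothetical placeholder, chosen only because it is consistent with both reported figures.

# Hypothetical sketch of the recall approximation; the actual computation
# uses the filtered-subset / original-dataset ratios from Table 6.
ERRORS_IN_FILTERED = 130  # errors found in the filtered dialogs
MISSED_RELEVANT = 12      # relevant errors (UR3/UR5) in the 600 random dialogs
MISSED_ALL = 58           # all errors in the 600 random dialogs

def approx_recall(found, missed_in_sample, scale_factor):
    # Scale the errors missed in the random sample up to the full set of
    # unfiltered dialogs, then compute recall = found / (found + missed).
    return found / (found + missed_in_sample * scale_factor)

# A placeholder scale factor of ~4.2 (unfiltered remainder relative to the
# 600 sampled random dialogs) reproduces both reported values:
print(round(approx_recall(ERRORS_IN_FILTERED, MISSED_RELEVANT, 4.2), 2))  # 0.72
print(round(approx_recall(ERRORS_IN_FILTERED, MISSED_ALL, 4.2), 2))       # 0.35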
This paper is available on arXiv under a CC BY-NC-SA 4.0 DEED license.