Authors:
(1) Dominic Petrak, UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany;
(2) Nafise Sadat Moosavi, Department of Computer Science, The University of Sheffield, United Kingdom;
(3) Ye Tian, Wluper, London, United Kingdom;
(4) Nikolai Rozanov, Wluper, London, United Kingdom;
(5) Iryna Gurevych, UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany.
Manual Error Type Analysis and Taxonomies
Automatic Filtering for Potentially Relevant Dialogs
Conclusion, Limitation, Acknowledgments, and References
A Integrated Error Taxonomy – Details
B Error-Indicating Sentences And Phrases
C Automatic Filtering – Implementation
D Automatic Filtering – Sentence-Level Analysis
E Task-Oriented Dialogs – Examples
F Effectiveness Of Automatic Filtering – A Detailed Analysis
G Inter-Annotator Agreement – Detailed Analysis
I Hyperparameters and Baseline Experiments
J Human-Human Dialogs – Examples
Since our study in Section 4 indicated that errors in system utterances are rare, we use Sentence Transformers (Reimers and Gurevych, 2019) to filter the remaining dialogs of each dataset for potentially relevant ones, i.e., dialogs with user responses similar to the collected error-indicating sentences.
For each dataset, we decompose every dialog into turns (alternating utterances), extract the user response, and segment it into sentences. Next, we pair these sentences with each of the error-indicating sentences and use a pretrained Sentence Transformer based on MPNet (Song et al., 2020) to calculate their cosine similarity (see Appendix C for implementation details). We consider a dialog to be potentially relevant if at least one of these pairs has a cosine similarity of at least 0.5. Table 6 presents the sizes of the filtered subsets in comparison to the original datasets.
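The filtering step described above can be sketched as follows. The function name, the `embed` callback, and the threshold default are illustrative; in practice, `embed` would be a pretrained MPNet-based Sentence Transformer encoder (the exact checkpoint is given in Appendix C, not reproduced here).

```python
import numpy as np

def is_potentially_relevant(user_sentences, error_sentences, embed, threshold=0.5):
    """Flag a dialog as potentially relevant if any of its user-response
    sentences has cosine similarity >= threshold with any error-indicating
    sentence. `embed` maps a list of sentences to an (n, d) embedding matrix."""
    if not user_sentences:
        return False
    u = np.asarray(embed(user_sentences), dtype=float)   # (n, d)
    e = np.asarray(embed(error_sentences), dtype=float)  # (m, d)
    # L2-normalize rows so the dot product equals cosine similarity
    u = u / np.linalg.norm(u, axis=1, keepdims=True)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    sims = u @ e.T  # (n, m) matrix of pairwise cosine similarities
    return bool((sims >= threshold).any())
```

With the `sentence-transformers` library, `embed` could be, e.g., `SentenceTransformer("all-mpnet-base-v2").encode` (an assumed checkpoint choice); any encoder returning one vector per sentence works.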
With 58.5%, MWoZ (Budzianowski et al., 2018) contains the largest share of potentially relevant dialogs. PC (Zhang et al., 2018) and WoW (Dinan et al., 2019) have the smallest shares (8.9% and 7.57%, respectively). Overall, only 25% of the data is potentially relevant, i.e., contains at least one user response similar to one of those observed in Section 4. Hereinafter, we refer to these dialogs as filtered dialogs. We provide a sentence-level analysis in Appendix D.
This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.