Discussion: User Reactions in Human-Bot Dialogs as Key to Learning from Errors by @feedbackloop



Too Long; Didn't Read

Uncover the challenges in extracting learning signals from human-human dialogs, where politeness masks disagreement nuances. Dive into the dynamics of user reactions in human-bot dialogs, highlighting the valuable learning signals within harsh responses. Recognize the potential goldmine in open-domain and knowledge-grounded dialogs for AI learning. The discussion also opens the door to exploring nuanced user response types for enhanced learning insights in future AI dialog systems.

Authors:

(1) Dominic Petrak, UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany;

(2) Nafise Sadat Moosavi, Department of Computer Science, The University of Sheffield, United Kingdom;

(3) Ye Tian, Wluper, London, United Kingdom;

(4) Nikolai Rozanov, Wluper, London, United Kingdom;

(5) Iryna Gurevych, UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany.


Abstract & Introduction

Related Work

Datasets Examined

Manual Error Type Analysis and Taxonomies

Automatic Filtering for Potentially Relevant Dialogs

Statistical Analysis

Evaluation and Experiments

Discussion

Conclusion, Limitation, Acknowledgments, and References

A Integrated Error Taxonomy – Details

B Error-Indicating Sentences And Phrases

C Automatic Filtering – Implementation

D Automatic Filtering – Sentence-Level Analysis

E Task-Oriented Dialogs – Examples

F Effectiveness Of Automatic Filtering – A Detailed Analysis

G Inter-Annotator Agreement – Detailed Analysis

H Annotation Guidelines

I Hyperparameters and Baseline Experiments

J Human-Human Dialogs – Examples

8 Discussion

The goal of this work was to investigate the type and frequency of errors in system utterances, and of the subsequent user responses, in the examined datasets, in order to assess whether these datasets can be extended with annotations for learning from free-text human feedback. We found that this mostly depends on whether the dialogs are human-human or human-bot. In human-human dialogs, humans tend to signal disagreement very politely rather than accusing their partner of a mistake (see Appendix J for examples). Accordingly, only little free-text human feedback is available that could be used for learning (Sections 6.2 and 6.3), so extending these datasets with such annotations would likely be difficult and ineffective. This is different in human-bot dialogs, where humans often react harshly and accusingly to errors in system utterances, resulting in more direct feedback. However, we also found that the dialog type matters: in general, open-domain and knowledge-grounded dialogs contain a larger number of errors and of user responses that are likely to contain free-text human feedback, making them more suitable for this purpose (Section 6.1).
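Identifying user responses that carry such direct feedback can be approached with simple phrase matching. The following is a minimal sketch, not the authors' implementation: the phrase list here is illustrative (the paper's actual error-indicating sentences and phrases are listed in its Appendix B), and the `user_response` field name is an assumption.

```python
# Hypothetical, abbreviated list of error-indicating phrases;
# the paper's real list (Appendix B) is more extensive.
ERROR_INDICATING_PHRASES = [
    "that's wrong",
    "that is not what i asked",
    "you already said that",
    "no, i meant",
    "that doesn't make sense",
]

def is_error_indicating(user_response: str) -> bool:
    """Return True if the user response matches a known error-indicating phrase."""
    text = user_response.lower()
    return any(phrase in text for phrase in ERROR_INDICATING_PHRASES)

def filter_dialogs(turns):
    """Keep only turns whose user response suggests a preceding system error."""
    return [t for t in turns if is_error_indicating(t["user_response"])]
```

Such a filter would, by design, find far more matches in human-bot dialogs (harsh, direct reactions) than in human-human dialogs, where disagreement is phrased too politely to be caught by surface cues.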


Using the manually annotated dialogs from Section 6, our experiments in Section 7.2 suggest that including user responses to errors in system utterances has a positive impact on response generation, which supports the findings of recent work on including free-text human feedback (Xu et al., 2023; Ung et al., 2022). Additionally, our results suggest that including the error-annotated system utterance itself can have a positive impact. From our point of view, distinguishing between user response types could be an interesting alternative to binary signals, such as user satisfaction (Hancock et al., 2019) or thumbs-down (Shuster et al., 2022), as indicators of an error in a system utterance. However, the dialogs annotated in Section 6 do not provide enough such data for a thorough analysis that also takes the different user response types into account, so we leave this as a research question for future work. Our human evaluation in Section 7.1 shows that our proposed taxonomies may serve as a promising starting point for obtaining the necessary annotations, although they may not cover all possible error and user response types.
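To make the setup concrete, a training input for response generation could be assembled from the dialog context, the error-annotated system utterance, and the user's free-text feedback. This is a hedged sketch under assumed conventions, not the authors' pipeline: the `<sep>` token, the field names, and the example dialog are all illustrative.

```python
# Minimal sketch (not the paper's implementation) of assembling a
# response-generation input from an error-annotated dialog turn.

def build_training_input(context, system_utterance, user_feedback,
                         include_erroneous_utterance=True):
    """Concatenate dialog context, the (erroneous) system utterance, and the
    user's free-text feedback into a single model input string.

    Setting include_erroneous_utterance=False ablates the finding that
    including the error-annotated system utterance itself can help."""
    parts = list(context)
    if include_erroneous_utterance:
        parts.append(f"System: {system_utterance}")
    parts.append(f"User: {user_feedback}")
    return " <sep> ".join(parts)

example = build_training_input(
    context=["User: Who wrote Dracula?"],
    system_utterance="Dracula was written by Mary Shelley.",  # factual error
    user_feedback="No, that's wrong. Bram Stoker wrote Dracula.",
)
```

The `include_erroneous_utterance` flag mirrors the comparison discussed above: the model can be trained either with or without the erroneous system utterance in its input, isolating the contribution of each signal.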


This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.