Human Assessment of Enhanced Integrated Error Taxonomy


Too Long; Didn't Read

Delve into a human evaluation of an enhanced Integrated Error Taxonomy, showcasing its effectiveness in human-bot dialog understanding. Explore experiments with top-notch language generation models, revealing the power of errors and user responses as signals for response generation. Witness a leap forward in AI dialog systems as they learn from mistakes and user interactions.

Authors:

(1) Dominic Petrak, UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany;

(2) Nafise Sadat Moosavi, Department of Computer Science, The University of Sheffield, United Kingdom;

(3) Ye Tian, Wluper, London, United Kingdom;

(4) Nikolai Rozanov, Wluper, London, United Kingdom;

(5) Iryna Gurevych, UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany.


Table of Links

Abstract & Introduction

Related Work

Datasets Examined

Manual Error Type Analysis and Taxonomies

Automatic Filtering for Potentially Relevant Dialogs

Statistical Analysis

Evaluation and Experiments

Discussion

Conclusion, Limitation, Acknowledgments, and References

A Integrated Error Taxonomy – Details

B Error-Indicating Sentences And Phrases

C Automatic Filtering – Implementation

D Automatic Filtering – Sentence-Level Analysis

E Task-Oriented Dialogs – Examples

F Effectiveness Of Automatic Filtering – A Detailed Analysis

G Inter-Annotator Agreement – Detailed Analysis

H Annotation Guidelines

I Hyperparameters and Baseline Experiments

J Human-Human Dialogs – Examples

7 Evaluation and Experiments

In this section, we use the manually annotated dialogs from Section 6 in a human evaluation to assess the impact of our modifications to the Integrated Error Taxonomy (Higashinaka et al., 2021). We also use these dialogs to investigate the performance impact of errors in system utterances and subsequent user responses as additional input signals for response generation in three SOTA language generation models: GPT-2 (Radford et al., 2019), LLAMA (Touvron et al., 2023), and Flan-T5 (Chung et al., 2022).


7.1 Integrated Error Taxonomy – Evaluation

To evaluate the impact of our modifications to reduce ambiguity and address missing error types in the Integrated Error Taxonomy (Higashinaka et al., 2021), we perform a human evaluation. We asked nine experts with an NLP background and sound English skills to annotate 600 of the dialogs that we manually annotated in Section 6 (300 each from the filtered and the random dialogs, 50 per dataset) with error and user response types, using our modified Integrated Error Taxonomy (Section 4.1) and the proposed user response type taxonomy (Section 4.2). Each dialog was assigned to two of these experts and was thus annotated three times in total (including our own initial annotation). For comparison, we mapped all annotations back to the original Integrated Error Taxonomy. For merged error types, we asked the annotators for a second assessment using the original taxonomy. Table 10 shows the inter-annotator agreement (IAA), calculated using Krippendorff's Alpha (Krippendorff, 2004) and summarized by human-human and human-bot dialogs.
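To make the agreement computation concrete, the following is a minimal sketch of how Krippendorff's Alpha could be computed over such annotations. The `krippendorff` package and the toy label matrix are assumptions for illustration, not the exact tooling or data used in the paper.

```python
# Minimal sketch (assumption: the `krippendorff` PyPI package; toy data only).
import numpy as np
import krippendorff

# One row per annotator, one column per annotated dialog.
# Values are hypothetical error-type IDs; np.nan marks items an annotator skipped.
reliability_data = np.array([
    [1,      2, 3, 3, np.nan, 4],  # our initial annotation
    [1,      2, 3, 4, 5,      4],  # expert annotator 1
    [np.nan, 2, 3, 3, 5,      4],  # expert annotator 2
])

# Error types are categorical labels, so we use the nominal level of measurement.
alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"Krippendorff's Alpha: {alpha:.2f}")
```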


Table 10: Inter-annotator agreement (IAA) using the Integrated Error Taxonomy (Higashinaka et al., 2021) (Theirs) and our modified version (Ours).


In the case of human-human dialogs, the overall agreement is rather low. This also applies to the user response types, which was to be expected, as they directly depend on the error type annotations. During our analysis in Section 6, we found that in human-human dialogs, participants tend to express disagreement in a friendly manner, which complicates the identification of errors. This is different for human-bot dialogs, where participants tend to provide direct and clear feedback, such as "You're wrong." We attribute the low agreement for the annotations in human-human dialogs to this observation. Nonetheless, using our modified Integrated Error Taxonomy improves IAA over the original one in all cases. This is most obvious in the case of the filtered human-human dialogs, where it improves IAA by 0.14 points. A detailed analysis revealed that this is mainly due to (1) the condensed number of abstract error types, e.g., we merged ambiguous error types such as Ignore Proposal and Ignore Request, and (2) the newly added error types, such as Factually Incorrect, which were not covered in the original taxonomy.
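As an illustration of how such merged types can be handled when comparing annotations across the two taxonomies, here is a minimal sketch; the merged label name is a placeholder and not necessarily the exact label used in our modified taxonomy.

```python
# Minimal sketch; the merged label "Ignore" is a placeholder, not the paper's exact label.
MERGE_MAP = {
    "Ignore Proposal": "Ignore",
    "Ignore Request": "Ignore",
}

def to_modified_taxonomy(original_label: str) -> str:
    """Collapse a fine-grained original error type into its merged counterpart,
    leaving all other error types unchanged."""
    return MERGE_MAP.get(original_label, original_label)

# Mapping a merged type back to the original taxonomy is ambiguous (one-to-many),
# which is why annotators were asked for a second assessment in those cases.
```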


7.2 Impact in Response Generation

In the following, we investigate the performance impact of including errors in system utterances and the subsequent user responses as additional input signals in response generation. For this experiment, we consider three different SOTA language generation models: GPT-2 (Radford et al., 2019), LLAMA (Touvron et al., 2023), and Flan-T5 (Chung et al., 2022). For GPT-2 and Flan-T5, we use the large variants. For LLAMA, we use the 7B variant.
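As a rough sketch, the three models could be loaded with Hugging Face Transformers as shown below; the GPT-2 and Flan-T5 checkpoint names are standard Hub identifiers, while the LLAMA path is a placeholder, since the article does not name a specific checkpoint here.

```python
# Minimal sketch using Hugging Face Transformers; the LLAMA path is a placeholder.
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer

# GPT-2 large (decoder-only)
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
gpt2_model = AutoModelForCausalLM.from_pretrained("gpt2-large")

# Flan-T5 large (encoder-decoder)
flan_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
flan_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

# LLAMA 7B (decoder-only); replace with a locally available checkpoint path.
llama_tokenizer = AutoTokenizer.from_pretrained("path/to/llama-7b")
llama_model = AutoModelForCausalLM.from_pretrained("path/to/llama-7b")
```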


Experimental Setup We use the dialogs annotated in Section 6 for this experiment. In a first step, we use the 967 dialogs without error and user response type annotations to train baselines models in the task of response generation. Next, we train the baseline models using the 188 error and user response type annotated dialogs and include the annotated data as additional input signals. For error types, we include the respective system utterance (Error Text). For user responses, we include the respective user utterance (User Response). As evaluation metrics, we use word-overlapping F1- Score (following Xu et al. (2023) with FITS) and BLEU (Papineni et al., 2002). We provide more details, including the baseline results, in Appendix I.
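To illustrate this setup, the sketch below shows one way the additional signals could be appended to the dialog context, and how a word-overlap F1 score could be computed. The separator labels and the exact F1 formulation are assumptions for illustration, as the article only states that the metric follows Xu et al. (2023).

```python
# Minimal sketch; field labels and tokenization are assumptions for illustration.
from collections import Counter
from typing import Optional

def build_input(context: str,
                error_text: Optional[str] = None,
                user_response: Optional[str] = None) -> str:
    """Append the erroneous system utterance and/or the subsequent user
    response to the dialog context as additional input signals."""
    parts = [context]
    if error_text is not None:
        parts.append(f"Error Text: {error_text}")
    if user_response is not None:
        parts.append(f"User Response: {user_response}")
    return "\n".join(parts)

def word_overlap_f1(prediction: str, reference: str) -> float:
    """Token-level (word-overlap) F1 between a generated and a target response."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: the "User Response" configuration from Table 11.
model_input = build_input("User: Book me a table for two.\nSystem: I booked a flight.",
                          user_response="No, I asked for a restaurant table.")
```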


Results Table 11 shows the results. We find a large performance gap between Flan-T5 (Chung et al., 2022) and the other models. A detailed analysis revealed that both GPT-2 (Radford et al., 2019) and LLAMA (Touvron et al., 2023) generate reasonable and fluent responses, but mostly deviate from the target sequence. That aside, including user responses as an additional input signal improves the results over the other configurations, including None, for both Flan-T5 and GPT-2. For LLAMA, additionally using the error text improves the results over the other configurations.


Table 11: Experiments with errors in system utterances and subsequent user reactions as additional input signals. For each model, the best-performing configuration is highlighted. Both includes both feedback signals as additional input signals. None denotes continual training on the 188 dialogs without including any feedback.


Overall, our results support the findings from recent works on learning from free-text human feedback in that including user responses to errors in system utterances is beneficial (Xu et al., 2023; Ung et al., 2022). However, we also show that including the error itself can have a positive impact.


This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.