Hyperparameters and Baseline Experiments in Dialog Systems

by The FeedbackLoop: #1 in PM Education
January 17th, 2024
Too Long; Didn't Read

Baseline experiments in dialog systems hinge on a few key hyperparameter settings. Models were trained for five epochs (ten for the experiment with erroneous dialogs) with a batch size of 32, a learning rate of 5e-5, and the AdamW optimizer. LLAMA was finetuned with LoRA-specific parameters. The results in Table 17 elucidate the interplay of data quality, system errors, and model performance through F1-Score and BLEU metrics.
Authors:

(1) Dominic Petrak, UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany;

(2) Nafise Sadat Moosavi, Department of Computer Science, The University of Sheffield, United Kingdom;

(3) Ye Tian, Wluper, London, United Kingdom;

(4) Nikolai Rozanov, Wluper, London, United Kingdom;

(5) Iryna Gurevych, UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany.

Table of Links

Abstract & Introduction

Related Work

Datasets Examined

Manual Error Type Analysis and Taxonomies

Automatic Filtering for Potentially Relevant Dialogs

Statistical Analysis

Evaluation and Experiments

Discussion

Conclusion, Limitation, Acknowledgments, and References

A Integrated Error Taxonomy – Details

B Error-Indicating Sentences And Phrases

C Automatic Filtering – Implementation

D Automatic Filtering – Sentence-Level Analysis

E Task-Oriented Dialogs – Examples

F Effectiveness Of Automatic Filtering – A Detailed Analysis

G Inter-Annotator Agreement – Detailed Analysis

H Annotation Guidelines

I Hyperparameters and Baseline Experiments

J Human-Human Dialogs – Examples

I Hyperparameters and Baseline Experiments

Hyperparameters All baseline models were trained for five epochs. For the experiment using erroneous dialogs, we trained the models for ten epochs. We used a batch size of 32 and a learning rate of 5e-5 with no warmup steps. As optimizer, we used the PyTorch implementation of AdamW (Loshchilov and Hutter, 2019). Except for LLAMA (Touvron et al., 2023), we fully finetuned all models. For LLAMA, we only finetuned the LoRA (Hu et al., 2022) weights, using a rank of 8, an alpha of 16, and a dropout rate of 0.05.
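The settings above can be collected into a single configuration sketch. The variable names below are illustrative, not taken from the authors' code; only the values come from the paper, and the LoRA block applies to LLAMA alone.

```python
# Baseline hyperparameters as reported in Appendix I.
baseline_hparams = {
    "epochs": 5,               # 10 for the erroneous-dialogs experiment
    "batch_size": 32,
    "learning_rate": 5e-5,
    "warmup_steps": 0,
    "optimizer": "AdamW",      # PyTorch implementation (Loshchilov and Hutter, 2019)
}

# LoRA settings (Hu et al., 2022), used only when finetuning LLAMA;
# all other baselines were fully finetuned.
llama_lora = {
    "rank": 8,
    "alpha": 16,
    "dropout": 0.05,
}
```

In a PyTorch training loop, these values would feed directly into `torch.optim.AdamW` and, for LLAMA, a LoRA adapter configuration.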


Results Table 17 shows the results of our baseline experiments using word-overlapping F1-Score and BLEU (Papineni et al., 2002).
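Word-overlap F1 between a generated response and its reference can be computed as in the following sketch. This is an illustrative implementation, not necessarily the authors' exact scorer; BLEU (Papineni et al., 2002) would typically come from an existing library rather than be reimplemented.

```python
from collections import Counter

def word_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference response."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Count tokens shared by prediction and reference (with multiplicity).
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `word_f1("book a table", "book a table")` yields 1.0, while a response sharing no tokens with the reference scores 0.0.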


Table 17: Results of our baseline experiments.

Compared to the results of our experiment using erroneous dialogs (Table 11), both F1-Score and BLEU are higher in the case of GPT-2 and LLAMA. We attribute this to the comparatively poor quality of the erroneous dialogs. The majority of these dialogs come from the SFC (Hancock et al., 2019) dataset, where most errors are due to out-of-context system utterances (see Section 6.1) and were ignored by users (see Section 6.2).


This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.


About Author

The FeedbackLoop (@feedbackloop) offers premium product management education, research papers, and certifications. Start building today!
