Authors:
(1) Jianzhu Yao, The CoAI group, Tsinghua University, Beijing, China; Department of Computer Science and Technology, Tsinghua University, Beijing, China; Beijing National Research Center for Information Science and Technology;
(2) Ziqi Liu, The CoAI group, Tsinghua University, Beijing, China; Department of Computer Science and Technology, Tsinghua University, Beijing, China; Beijing National Research Center for Information Science and Technology;
(3) Jian Guan, The CoAI group, Tsinghua University, Beijing, China; Department of Computer Science and Technology, Tsinghua University, Beijing, China; Beijing National Research Center for Information Science and Technology;
(4) Minlie Huang, The CoAI group, Tsinghua University, Beijing, China; Department of Computer Science and Technology, Tsinghua University, Beijing, China; Beijing National Research Center for Information Science and Technology.
Implementation Details To conduct experiments on the masked dialogue generation task, we choose the hyper-parameters based on performance on the validation set. We train Shao et al. (2021)'s BART model for 4.6 epochs with a 1e-4 learning rate for 1 day, and we train our model for 5.6 epochs with a 1e-4 learning rate for 1 day. All baselines and our model are trained using the Adam optimizer.
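For reference, below is a minimal sketch of this fine-tuning setup, assuming the fnlp/bart-base-chinese checkpoint (footnote [2]) and the standard Hugging Face Transformers API; the training-loop details are illustrative rather than our exact code:

```python
import torch
from transformers import BertTokenizer, BartForConditionalGeneration

# Hyper-parameters chosen on the validation set (see text):
# BART baseline: 4.6 epochs, lr = 1e-4; our model: 5.6 epochs, lr = 1e-4.
LEARNING_RATE = 1e-4

tokenizer = BertTokenizer.from_pretrained("fnlp/bart-base-chinese")
model = BartForConditionalGeneration.from_pretrained("fnlp/bart-base-chinese")
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

def train_step(batch):
    """One Adam update on a batch of (masked story, target dialogue turns) pairs."""
    loss = model(input_ids=batch["input_ids"],
                 attention_mask=batch["attention_mask"],
                 labels=batch["labels"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```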
During training of our method, we compute the selection coverage of characters within a single story. Within every 1000 training steps, the coverage of different characters ranges from 98.64% to 99.00%, which means that nearly all characters are selected during training and all characters contribute to the generated dialogue. This further shows that the argmax operation in Eq. 3 does not break gradient flow when training for this task.
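A minimal sketch of how such a coverage statistic can be tracked; the bookkeeping below is illustrative, not our exact logging code:

```python
def character_coverage(selected_ids_per_step, story_character_ids):
    """Fraction of a story's characters selected at least once (via the argmax
    in Eq. 3) within a window of training steps, e.g. the last 1000 steps."""
    selected = set()
    for step_ids in selected_ids_per_step:
        selected.update(step_ids)
    return len(selected & set(story_character_ids)) / len(story_character_ids)
```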
Automatic Evaluation Following previous works, we use several standard, widely used automatic evaluation metrics. We use BLEU-n (Papineni et al., 2002) to measure the average word overlap between each generated and ground-truth dialogue turn (n=1,2), and Distinct-n (Li et al., 2015) to evaluate the n-gram diversity of generated dialogue turns (n=2,3,4).
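As a reference, the two metrics can be computed as sketched below, using nltk for BLEU and simple n-gram counting for Distinct-n; tokenizing turns into Chinese characters is an assumption made here:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_n(reference, hypothesis, n=2):
    """BLEU-n between one generated turn and its ground-truth turn."""
    weights = tuple(1.0 / n for _ in range(n))
    return sentence_bleu([list(reference)], list(hypothesis),
                         weights=weights,
                         smoothing_function=SmoothingFunction().method1)

def distinct_n(turns, n=2):
    """Ratio of unique n-grams to all n-grams over a list of generated turns."""
    ngrams = []
    for turn in turns:
        tokens = list(turn)
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / max(len(ngrams), 1)
```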
To be more specific, for the coherence classifier, we construct the training and validation sets by randomly shuffling the order of dialogue turns while keeping the other content in its original order. We regard the perturbed story as a negative example and the original story as a positive example. We sample another 195k stories (excluding those in DIALSTORY) from the novels of Guan et al. (2022) to construct the training set (190k examples) and the validation set (5k examples). We train the model for 4 epochs with a 2e-5 learning rate and a batch size of 16, using the Adam optimizer. During evaluation, we consider an example coherent when the probability of being coherent predicted by the classifier is greater than 0.5. We use the ratio of outputs (along with the input) classified as coherent by the classifier to all generated outputs as the coherence score.
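The scoring step can be sketched as follows, assuming the classifier is a fine-tuned binary sequence-classification model (e.g. a Chinese BERT) whose positive class means "coherent"; the checkpoint path is a placeholder:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("path/to/coherence-classifier")  # placeholder
classifier = BertForSequenceClassification.from_pretrained("path/to/coherence-classifier")
classifier.eval()

@torch.no_grad()
def coherence_score(stories_with_generated_turns):
    """Ratio of generated outputs (inserted back into their input stories)
    that the classifier labels as coherent with probability > 0.5."""
    coherent = 0
    for text in stories_with_generated_turns:
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        probs = classifier(**inputs).logits.softmax(dim=-1)
        coherent += int(probs[0, 1].item() > 0.5)  # class 1 = coherent
    return coherent / len(stories_with_generated_turns)
```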
The results of the automatic evaluation are presented in Table 3. Compared to the BART baseline, our model consistently achieves higher word overlap with the ground truth and better diversity under the guidance of character representations, which means our model can generate responses that are more diverse and less generic.
Figure 3 plots the coherence score as the number of masked dialogue turns varies. The results show that our model achieves a higher coherence score than BART when required to generate more than seven dialogue turns in one story.
Manual Evaluation We conduct a pairwise comparison between our model and the BART baseline. We randomly select 100 examples from the test set. For each pair of outputs along with the input, we ask three annotators to give a preference (win, lose, or tie) in terms of fluency, coherence, and informativeness. All annotators are native Chinese speakers. We adopt majority voting to make the final decision among the annotators. The three aspects of manual evaluation are as follows:
Fluency: Grammatical correctness and intra-sentence linguistic quality.
Coherence: Inter-sentence relatedness, causal and temporal dependencies. We judge the coherence between the story and a dialogue turn following the criteria in Table 5. We sum the scores of all generated dialogue turns in a story to obtain its overall coherence score, which is then used to compare the two models.
Informativeness: Interesting, diverse and rich details.
As shown in Table 4, all results show moderate (κ > 0.4) inter-annotator agreement, and our model significantly outperforms the BART baseline in dialogue informativeness and coherence.
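For reference, majority voting and the agreement statistic can be computed as sketched below; using Fleiss' kappa (via statsmodels) for the three-annotator, three-label setting is an assumption made here:

```python
from collections import Counter
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

LABELS = {"win": 0, "lose": 1, "tie": 2}

def majority_vote(annotations):
    """Final decision among three annotators; fall back to 'tie' if all disagree."""
    label, count = Counter(annotations).most_common(1)[0]
    return label if count >= 2 else "tie"

def annotator_agreement(all_annotations):
    """Fleiss' kappa over an (examples x annotators) matrix of labels."""
    data = np.array([[LABELS[a] for a in row] for row in all_annotations])
    table, _ = aggregate_raters(data)
    return fleiss_kappa(table)
```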
Case Study Figure 5 shows two examples that illustrate how learning character representations helps our model generate more coherent dialogue. We find that our model can better capture the relationships between different characters and the direction of the storyline. For example, in the first case, BART's generation confuses different characters' fathers, while our model captures the relationships between the characters and generates proper responses for the corresponding characters, which also moves the plot forward. In the second case, BART's generation is generic and contradicts the plot development; in contrast, our model captures the intentions of the speaker and the development of the plot, generating an appropriate and coherent response. Since the two models use the same pretrained weights, we can infer that the character modeling module is what enables the more coherent and reasonable generation.
We also summarize four error types of the generated dialogue turns for the DialGen task: (1) Inter-sentence Contradiction; (2) Inter-sentence Repetition; (3) Intra-sentence Contradiction; (4) Intra-sentence Repetition. We show typical corresponding cases in Figure 4. We conducted a quantitative analysis of these four error types on our model's generation, analyzing 20 stories with 103 dialogue turns; the results are shown in Figure 6. We find that both our model and BART suffer from these errors, suggesting that there is still room for improvement, especially for inter-sentence repetition.
Implementation Details To conduct experiments on the speaker recognition task, we choose the hyper-parameters based on performance on the validation set. For the BART baseline and our approach, we insert a mask token before each dialogue turn whose speaker needs to be predicted, and a person-id token before and after each character name span. Then, we insert all the unique person-id tokens before the input story as different options, and make predictions based on the cosine similarity between option tokens and mask tokens. We train Shao et al. (2021)'s BART model for 30 epochs with a 5e-5 learning rate for 3 days. For the encoder-only baselines, we implement BERT, RoBERTa, and MacBERT and train them for 15 epochs with a 1e-5 learning rate for 2 days. For our model, we train it for 22 epochs with a 1e-6 learning rate for 2 days. All baselines and our model are trained using the Adam optimizer.
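The prediction step can be sketched as follows: take the encoder hidden state at each inserted mask token and at each person-id option token, and choose the option with the highest cosine similarity; the tensor shapes and index bookkeeping below are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def predict_speakers(hidden_states, mask_positions, option_positions):
    """hidden_states: (seq_len, dim) encoder output for one story.
    mask_positions: positions of the mask tokens inserted before dialogue turns.
    option_positions: positions of the unique person-id tokens prepended to the story.
    Returns, for every mask token, the index of the most similar person-id option."""
    mask_vecs = hidden_states[mask_positions]       # (num_turns, dim)
    option_vecs = hidden_states[option_positions]   # (num_characters, dim)
    sims = F.cosine_similarity(mask_vecs.unsqueeze(1),
                               option_vecs.unsqueeze(0), dim=-1)
    return sims.argmax(dim=-1)                      # (num_turns,)
```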
Metrics We evaluate the DialSpk task using two automatic metrics: dialogue-level accuracy (DAC) and story-level accuracy (SAC). DAC is the ratio of correct predictions to the total number of specified dialogue turns, while SAC is the ratio of the number of stories where all dialogue turns are correctly predicted to the number of all test examples. These two metrics evaluate dialogue understanding at different granularities.
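A small sketch of the two metrics, assuming predictions and gold speakers are grouped per story:

```python
def dac_sac(predictions, golds):
    """predictions / golds: lists of per-story speaker lists (one label per turn).
    DAC = turn-level accuracy; SAC = fraction of stories with every turn correct."""
    correct_turns = total_turns = correct_stories = 0
    for pred_story, gold_story in zip(predictions, golds):
        matches = sum(p == g for p, g in zip(pred_story, gold_story))
        correct_turns += matches
        total_turns += len(gold_story)
        correct_stories += int(matches == len(gold_story))
    return correct_turns / total_turns, correct_stories / len(golds)
```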
Results As shown in Table 7, our model significantly outperforms all baselines (p < 0.01, Wilcoxon signed-rank test) on both DAC and SAC scores, suggesting the benefit of learning character representations. We test the accuracy of the automatic training set annotations; the DAC/SAC scores are 86.78%/67.80%. Together with the model's performance on the test set, this indicates that the automatic annotation of the training set is of good quality. We also conduct a human prediction experiment; the DAC/SAC scores are 97.90%/90.70%, which are much higher than those of the best model, so there is still much room for improvement for machine-based approaches.
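The significance test can be run with scipy over paired per-example scores of the two models, as sketched below:

```python
from scipy.stats import wilcoxon

def is_significant(our_scores, baseline_scores, alpha=0.01):
    """Paired Wilcoxon signed-rank test over per-example scores
    (e.g. per-story accuracy) of our model vs. a baseline."""
    _, p_value = wilcoxon(our_scores, baseline_scores)
    return p_value < alpha
```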
This paper is available on arxiv under CC 4.0 DEED license.
[2] https://huggingface.co/fnlp/bart-base-chinese
[3] https://huggingface.co/bert-base-chinese
[4] https://huggingface.co/hfl/chinese-roberta-wwm-ext
[5] https://huggingface.co/hfl/chinese-macbert-base