Evaluating LLM-Generated Text: Methods, Limitations, and Human-Centric Approaches

by Teleplay Technology May 20th, 2024

Too Long; Didn't Read

Evaluating text generated by large language models involves automated metrics and human-centric approaches. Crowdsourced evaluations face quality and bias issues, leading to a shift towards expert-led assessments. The study engages theatre and film industry professionals to co-write scripts and assess LLM contributions.

Authors:

(1) PIOTR MIROWSKI and KORY W. MATHEWSON, DeepMind, United Kingdom (both authors contributed equally to this research);

(2) JAYLEN PITTMAN, Stanford University, USA (work done while at DeepMind);

(3) RICHARD EVANS, DeepMind, United Kingdom.

Abstract and Intro

Storytelling, The Shape of Stories, and Log Lines

The Use of Large Language Models for Creative Text Generation

Evaluating Text Generated by Large Language Models

Participant Interviews

Participant Surveys

Discussion and Future Work

Conclusions, Acknowledgements, and References

A. RELATED WORK ON AUTOMATED STORY GENERATION AND CONTROLLABLE STORY GENERATION

B. ADDITIONAL DISCUSSION FROM PLAYS BY BOTS CREATIVE TEAM

C. DETAILS OF QUANTITATIVE OBSERVATIONS

D. SUPPLEMENTARY FIGURES

E. FULL PROMPT PREFIXES FOR DRAMATRON

F. RAW OUTPUT GENERATED BY DRAMATRON

G. CO-WRITTEN SCRIPTS

4 EVALUATING TEXT GENERATED BY LARGE LANGUAGE MODELS

In this section, we first review existing methods for evaluating LLM-generated text before presenting our approach. Echoing Celikyilmaz et al. [16], we split evaluation methods into automated or machine-learned metrics, and human-centric evaluation. Automated and machine-learned metrics (reviewed in Appendix A.6) typically calculate the similarity between generated and “ground truth” stories, the consistency between generated stories and their writing prompts, or the diversity of language in the generated output. These metrics were not designed for generated text of the length of a screenplay or theatre script. This motivates our focus on human-centric evaluation, which can be conducted with naïve crowdworkers or with experts.
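For concreteness, the sketch below illustrates, in plain Python with simple whitespace tokenisation, the kind of quantities such automated metrics compute: an n-gram overlap score as a stand-in for similarity to a reference story, and a distinct-n score as a stand-in for lexical diversity. The function names and implementations are illustrative assumptions, not the metrics reviewed in Appendix A.6.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap(generated, reference, n=2):
    """Clipped n-gram precision of the generated text against a reference,
    a simplified stand-in for BLEU-style similarity metrics."""
    gen = Counter(ngrams(generated.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    matched = sum(min(count, ref[gram]) for gram, count in gen.items())
    total = sum(gen.values())
    return matched / total if total else 0.0

def distinct_n(generated, n=2):
    """Fraction of unique n-grams, a simplified stand-in for diversity metrics."""
    grams = ngrams(generated.lower().split(), n)
    return len(set(grams)) / len(grams) if grams else 0.0
```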


We now review the limitations of crowdsourced evaluation of the coherence of generated text, and explain why non-expert, crowdsourced evaluation faces crucial quality and bias issues. Inheriting from research standards for large-scale natural language processing tasks, the majority of studies assessing the quality of generations from LLMs evaluate model performance by collecting data from crowdworkers [56, 75, 77, 85, 88, 121]. For instance, Yao et al. [121] recruit crowdworkers to evaluate fidelity, coherence, interestingness, and popularity, and Rashkin et al. [85] recruit them to evaluate narrative flow and ordering of events. That said, it has been shown that crowdworkers’ personal opinions, demographic characteristics and cognitive biases [35] can affect the quality of crowdsourced annotations in fact-checking [30] or in tasks involving subjective assessments [51]. These issues have led some researchers to evaluate their models with experts instead. Karpinska et al. [54] highlight the perils of using crowdworkers to evaluate open-ended generated text, because crowdworkers do not read text as carefully as expert teachers do. Other studies have consulted expert linguists [31], university students in the sciences [41] or humanities [120], or amateur writers [2, 21, 60, 97, 122]. In one recent study, Calderwood et al. [13] interviewed four novelists about their use of GPT-2 via the Talk To Transformer and Write With Transformer tools (see https://app.inferkit.com/demo), uncovering uses of the “model as antagonist” (i.e., random), for “description creation”, as “constraint”, or for “unexpected” ideas.


Given the issues discussed above, we believe that crowdsourcing is not an effective approach for evaluating screenplays and theatre scripts co-written with language models. Thus, in a departure from crowdsourced evaluations, we engage 15 experts, all theatre and film industry professionals, who both have experience using AI writing tools and have worked in TV, film or theatre as a writer, actor, director or producer. These experts participate in a 2-hour session in which they co-write a screenplay or theatre script alongside Dramatron. Most were able to complete both a full co-written script and the open discussion interview within the allotted 2 hours; for the others, we slightly extended the interview session. The interviews are analysed in Section 5. Following the interactive sessions, participants were asked a series of survey questions adapted from [22, 56, 77, 104, 122] and detailed in Section 6; each question was answered on a 5-point Likert-type scale.

As an additional quantitative evaluation, we track writer modifications to generated sentences [88], which allows us to compare Dramatron generations before and after human edits. We track absolute and relative word edit distance to assess whether, and how much, the writer adds to or removes from the output suggestions, and a Jaccard similarity metric over words to quantify how similar the edited draft is to the original suggestion. Our objective is to assess whether Dramatron “contributes new ideas to writing” or “merely expand[s] on [the writer’s] existing ideas” [60]. We do not evaluate Dramatron outputs for grammatical correctness as in [77], since the few errors made by the LLM can be fixed by the human co-writer.
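A minimal sketch of how such edit-tracking measures might be computed over whitespace-tokenised text is given below; the helper names (word_edit_distance, relative_edit_distance, jaccard_words) and the tokenisation are our own illustrative assumptions, not the exact implementation used in the study.

```python
def word_edit_distance(suggestion, edited):
    """Levenshtein distance over word tokens (insertions, deletions, substitutions)."""
    a, b = suggestion.split(), edited.split()
    row = list(range(len(b) + 1))  # single-row dynamic programming table
    for i, word_a in enumerate(a, start=1):
        prev_diag, row[0] = row[0], i
        for j, word_b in enumerate(b, start=1):
            current = row[j]
            row[j] = min(row[j] + 1,                          # delete word_a
                         row[j - 1] + 1,                      # insert word_b
                         prev_diag + (word_a != word_b))      # substitute
            prev_diag = current
    return row[-1]

def relative_edit_distance(suggestion, edited):
    """Word edit distance normalised by the length of the longer text."""
    longest = max(len(suggestion.split()), len(edited.split()), 1)
    return word_edit_distance(suggestion, edited) / longest

def jaccard_words(suggestion, edited):
    """Jaccard similarity between the word sets of the suggestion and the edited draft."""
    a, b = set(suggestion.lower().split()), set(edited.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 1.0
```

Under these definitions, a Jaccard similarity near 1 would indicate that the writer kept the suggestion largely intact, whereas a large relative edit distance would indicate substantial rewriting.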


We compensate experts for the co-writing and interview sessions at 100 GBP per hour. Our study design and data collection process were validated and approved by HuBREC (Human Behavioral Research Ethics Committee), a research ethics committee run within DeepMind that includes, and is chaired by, academics from outside the company.



This paper is available on arXiv under a CC 4.0 license.