Exploring Automated Story Generation and Controllable Language Models

Authors:
(1) PIOTR MIROWSKI and KORY W. MATHEWSON, DeepMind, United Kingdom (both authors contributed equally to this research);
(2) JAYLEN PITTMAN, Stanford University, USA (work done while at DeepMind);
(3) RICHARD EVANS, DeepMind, United Kingdom.

Table of Links
Abstract and Intro
Storytelling, The Shape of Stories, and Log Lines
The Use of Large Language Models for Creative Text Generation
Evaluating Text Generated by Large Language Models
Participant Interviews
Participant Surveys
Discussion and Future Work
Conclusions, Acknowledgements, and References
A. RELATED WORK ON AUTOMATED STORY GENERATION AND CONTROLLABLE STORY GENERATION
B. ADDITIONAL DISCUSSION FROM PLAYS BY BOTS CREATIVE TEAM
C. DETAILS OF QUANTITATIVE OBSERVATIONS
D. SUPPLEMENTARY FIGURES
E. FULL PROMPT PREFIXES FOR DRAMATRON
F. RAW OUTPUT GENERATED BY DRAMATRON
G. CO-WRITTEN SCRIPTS

A. RELATED WORK ON AUTOMATED STORY GENERATION AND CONTROLLABLE STORY GENERATION

In this section we provide background and related work on the intersecting fields of automatic plot and story generation as well as controllable language generation.

A.1 Automatic Story Generation

Automatic story generation is the research problem concerned with generating sequences of story elements that collectively tell a coherent narrative. A narrative plot is a sequence of events where each affects the next. The plot is composed of narrative elements sometimes referred to as actions, beats, scenes, or events [70]. Generative plot systems have existed for nearly a century, dating back to Cook [23], and computerized versions have existed for decades [71]. These systems support human authors by providing creative material and a source of randomness. Recent work has adapted these systems for computational interaction in web-based and theatrical settings [33, 34]. In generating narratives, the combinations of these component elements form subplots.
Multiple subplots can be combined into a single plot, and multiple plots can intertwine to create complex narratives. Many contemporary stories are written to have multiple plot lines which intertwine. However, there is little work on how to computationally model and generate multiple intersecting plot lines. Complex plot line interaction is a promising avenue of future work for human-machine co-creativity research in story generation.

Early approaches to automatic story generation used symbolic planning and hand-engineered heuristics [61, 71, 87, 107, 114]. Recently, research has explored open-story generation using machine learning techniques which leverage large datasets, massive deep learning models, increased compute capacity, and large language model prompt engineering [11, 14, 37, 81, 86, 99]. These methods show how models can succeed, and fail, in the generation of unique and coherent stories. Additionally, while coherence has been studied in dialogue generation methods [32], it remains challenging to measure coherence in stories, specifically as it relates to causal narrative events, common sense knowledge [3], or consistency in characters [78].

A.2 Symbolic and Hierarchical Story Generation

Some work has tried to bridge the gap between symbolic event representations and textual representations. Several of these methods process and predict events from text [45, 64, 95, 112] by generating sequences of plot events and then expanding such plot events into sentences [4, 85]. Others model each story as a series of character and story challenge cards [2] (first simulating sequences of causal events, then transforming them into sentences) or simulate social practices between autonomous agents [36]. Other recent work separates storytelling into two phases: storyline (i.e. plot) planning and story writing based on that storyline [121]. Similarly, methods have been introduced which decompose story generation into processes of coarse-to-fine generation [16, 38].
Goldfarb-Tarrant et al. [44] further introduced rescoring methods for character and plot events. These methods have not focused on synthesising coherent stories by generating scenes and dialogue, as we do in our work. Many of these methods lean on human reading comprehension and preference-based evaluation as opposed to production of the final script.

Hierarchical generation of a theatre play was first mentioned and used in [90, 92] for the production of AI: Can a Robot Write a Play? by the company THEaiTRE in 2021 in Prague [6]. In that work, the authors start with a title (or a prompt for the story) and then generate a textual synopsis, which is then used to generate the dialogue. In contrast to our approach, they did not generate characters alongside the synopsis and would start the “flat” dialogue generation from manually input two-line exchanges between characters. Their work also only served the production of a specific theatrical play rather than being evaluated within a diverse community of writers.

A.3 Controllable Story Generation

Previous work used a trained autoregressive transformer to plan sentences of an opinion piece from a set of user-supplied keywords [50]. This built upon [119], which incorporated commonsense knowledge into keyword-based story generation. Using conditional language models controlled by topics or keywords [73], Cho et al. [19] trained genre-controlled short story generators. Story arc generation was introduced in the Tale Brush graphical tool [21], using Kurt Vonnegut’s theory about the fortune of the protagonist as the story progresses. This theory has also been used by Mathewson et al. [68] as an approach to produce creative, engaging dialogue. We similarly use the concept of the narrative arc, though we use it textually in the prompts to Dramatron. In “Controlled Cue Generation for Play Scripts”, Dirik et al. [29] use LLMs to generate both the next line and a stage cue.
In “DialogueScript: Using Dialogue Agents to Produce a Script”, Schmidtová et al. [96] use different LLMs for each character. Si et al. [102] model multi-user dialogue and character relationships for story continuation. Schmitt and Buschek [97] use question-based chatbot interactions to assist with character creation.

A.4 Prompt Engineering for Long-Term Generation

Prompt engineering has been used to write a short plot from two characters, a genre and a theme in [52]. Our work of decomposing a log line into a synopsis can be seen as a narrative equivalent of Chain of Thought prompting for reasoning tasks [115]. It feeds LLM output back in as the prompt for the next stage of generation, an approach also called prompt chaining [118] or language engineering [24].

A.5 Interactive Authorship

In “Mapping the Design Space of Human-AI Interaction in Text Summarization”, Cheng et al. [18] defined a taxonomy of different interaction modalities for LLM-generated text, along with a user study. Padmakumar and He [77] evaluated creative image captioning models that would rewrite spans of text (within the user’s original draft) to introduce descriptive and figurative elements. Du et al. [31] focused on the text revision process and Yang et al. [120] on revision and summarisation of fiction. Clark et al. [22] (and later Nichols et al. [75] using GPT-2) evaluated a turn-by-turn, fine-grained 10-sentence-long system for overcoming writer’s block in writing slogans or short stories. Gero et al. [41] designed a system for inspiring science writers with suggestions. Our model Dramatron allows for writers’ interventions within a hierarchical generation structure. Yuan et al. [122] used an LLM with an editor and an interface that asked to continue the story, asked for details, or suggested rewrites. The AI was used as a tool for idea generation, copy editing, and scene interpolation.
Automatic generation of stories with dialogue has been used to populate digital worlds in video games, interactive narratives, entertainment, virtual worlds [80], artistic performances [47], specifically improvised theatre [10, 66, 72], short film scripts like Sunspring in 2016, song lyrics for the musical Beyond the Fence in 2016 in London [7], or interactive playwriting for AI at the Young Vic in 2021 in London [8]. Other than [90, 92, 96] for the Prague-based company THEaiTRE, none of these techniques have been used to generate long-range coherent theatre scripts or screenplays. Moreover, none of these methods have used few-shot learning and prompt engineering to prime LLMs for generation.

A.6 Review of Automated and Machine-Learned Metrics for the Evaluation of Story Generation

A.6.1 Similarity Between Generated and “Ground-Truth” Stories. In a typical machine learning mindset, story generation can be envisioned as merely a prediction task, allowing for evaluation against “ground truth”. One such dataset is Writing Prompts [9], a set of 300k human-written stories, averaging 734 words each, paired with writing prompts [37]. Fan et al. [37] propose metrics such as test set perplexity, prompt ranking accuracy (a measure of likelihood of a story generated using a true prompt vs. decoys), average longest common subsequence, and a triple-pairing task for human annotator evaluation of prompt-story coherence. They later measure sentence completion top N-in-M accuracy [38]. Si et al. [102] measure top-N hits of character or story continuation. Rashkin et al. [85] compare generated stories to ground truth using the ROUGE score [62].

A.6.2 Consistency between Generated Stories and Their Writing Prompts. In the context of prompt-based story generation or continuation, Roemmele et al.
[89] measure the quality of generated text based on whether it presents a consistent writing style and maintains the category (part-of-speech tags) distribution of individual words between the prompt and the generated story. They also record story-dependent metrics like lexical cohesion, style matching and entity coreference, and story-independent metrics such as sentence length, grammaticality, lexical diversity, lexical frequency and syntactic complexity. See et al. [99] measure N-gram similarity and sentence embedding similarity between the generated story and the prompt. Further metrics include counting the number of unique words, percentage of verbs and diversity in entity names [38], rare word usage and sentence length [99].

A.6.3 Statistical Measures of Corpora of Generated Stories. Without comparing individual generated stories to a ground truth or to a writing prompt, one can measure the Vocab:token ratio (originality and diversity of content), the number of entities per plot, the number of unique verbs, verb diversity, as well as inter- and intra-story trigram or 4-gram repetition [44, 46]. Rashkin et al. [85] measure the diversity of generated sentences using self-BLEU scores [124], or even adversarially train a classifier for the plausibility of a short story [42].

This paper is available on arxiv under CC 4.0 license.

[6] Performance documentation available at https://theaitre.com/.
[7] https://www.theguardian.com/stage/2016/feb/28/beyond-the-fence-review-computer-created-musical-arts-theatre-london
[8] Review: https://www.theguardian.com/stage/2021/aug/24/rise-of-the-robo-drama-young-vic-creates-new-play-using-artificial-intelligence
[9] https://www.kaggle.com/datasets/ratthachat/writing-prompts
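As an illustration of the corpus-level statistics described in A.6.3, the following is a minimal Python sketch of a vocabulary-to-token ratio and of intra- and inter-story n-gram repetition. The function names, the whitespace tokenisation, and the exact normalisation of each ratio are assumptions made for this example; the cited works [44, 46] may define these quantities differently.

```python
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenisation (an assumption for illustration).
    return text.lower().split()


def ngrams(tokens, n):
    # All contiguous n-grams of a token list, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vocab_token_ratio(stories):
    # Number of distinct tokens divided by total tokens across the corpus.
    tokens = [tok for story in stories for tok in tokenize(story)]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def intra_story_repetition(story, n=3):
    # Fraction of n-gram occurrences in one story whose n-gram
    # appears more than once within that same story.
    grams = ngrams(tokenize(story), n)
    if not grams:
        return 0.0
    counts = Counter(grams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(grams)


def inter_story_repetition(stories, n=3):
    # Fraction of distinct n-grams in the corpus that occur
    # in more than one story.
    doc_freq = Counter()
    for story in stories:
        doc_freq.update(set(ngrams(tokenize(story), n)))
    if not doc_freq:
        return 0.0
    shared = sum(1 for c in doc_freq.values() if c > 1)
    return shared / len(doc_freq)


if __name__ == "__main__":
    corpus = ["the cat sat on the mat", "the cat sat on the rug"]
    print(vocab_token_ratio(corpus))
    print(inter_story_repetition(corpus, n=3))
    print(intra_story_repetition("a b c a b c", n=3))
```

Lower repetition and a higher vocabulary-to-token ratio are usually read as signs of more diverse output, though neither says anything about coherence.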