Authors:
(1) PIOTR MIROWSKI and KORY W. MATHEWSON, DeepMind, United Kingdom (both authors contributed equally to this research);
(2) JAYLEN PITTMAN, Stanford University, USA (work done while at DeepMind);
(3) RICHARD EVANS, DeepMind, United Kingdom.
Throughout our interviews with the 15 participants (anonymised as p1, p2, etc.), we collected qualitative feedback on co-writing with Dramatron. In this section, we summarize this feedback into seven themes. Each is presented alongside supporting quotes from participant interviews.
(1) Positive comments about Dramatron focused on: hierarchical generation that lets the writer work on the narrative arc, the possibility either to co-author interactively or to simply let the system generate, and the potential of the output script to serve as source material for the human writer (Section 5.1).
(2) Participants identified inspiration, world building, and content generation as potential writing applications for Dramatron, and saw it as a possible tool for literary analysis (Section 5.2).
(3) Participants noticed various biases embedded in the language model (Section 5.3).
(4) Some writers were interested in the involuntary glitch aesthetic and failure modes of Dramatron, such as repetition and dialogue loops (Section 5.4).
(5) Unsurprisingly, participants noticed logical gaps in the storytelling and a lack of common sense, nuance, and subtext, which were manifest in the characters’ lack of motivation (Section 5.5).
(6) Structural criticism focused on the need to come up with a log line, as well as on the inconsistencies between consecutive scenes due to parallel dialogue generation (Section 5.6).
(7) Participants were engaged with the tool and eager to provide suggestions for improvement (Section 5.7).
5.1.1 Praise for the interactive hierarchical generation in Dramatron. All participants but p4 and p5 (who preferred a nonlinear writing workflow) were enthusiastic about the interactive hierarchical generation. “Once I see this, I know the shape of the series. I know the way that the story unfolds. I can see the narrative more clearly [...] I like this approach of making it a log line and then packing the detail inside it. You are planting a seed of an idea and it is putting meat on the bones” (p13). “All of it is quite consistent, symbolically consistent and coherent and relates to the state of affairs of the state of the play [...] There is lots of emotion and content about relationships in some of the generations” (p8). “In terms of the interactive co-authorship process, I think it is great [...] ” (p9). “What I like about the hierarchy is that you can do as much human-ing as you want at any level” (p2). “In working with the machine I can see the content a little more clearly. As there is specificity, character arcs, then I can see how the story comes together [...] This [hierarchical generation] really felt so much cleaner than the process [GPT-2 or GPT-3 with flat prompting] I was using” (p15). “Let’s try more! God, you could just waste your time doing this” (p3). Participants p1, p6 and p3 further noted how such hierarchical generation helped with dialogue: “there is good content from any generation” (p1) and (referring to one of the generations) “You got some big profound discussions in it. I am impressed with that one” (p3).
5.1.2 Ease of use of Dramatron’s UI and seed-based generation. Participant p13 liked the user experience of interactive, step-by-step generation of the title, characters and plot, whereas p10 thought that “interaction seemed simpler when the whole script was generated ahead of time rather than editing it”. Participant p1 tried and discussed three different modes of script generation: 1) interactive co-authorship, 2) modifying the output from one fully automated generation, and 3) curating and modifying outputs from 3-4 generations. The benefits of running multiple generations included having “lots of material”, being able to “pull good ideas” through “cherry-picking”, and “more interpretations and artistic freedom”, though this “requires more massaging on my end” and “word crafting to make it flow” (p1). Participant p1 developed a workflow for co-generating a script that included editing lists of characters, editing the log line to add more “characters that we know about”, giving the characters status and names, and adding them to the plot’s beats. When crafting the log line, p1 wanted to imply high stakes and “stay with humanoid characters: non-human characters take us to the Theatre of the Absurd, to the Surreal, to Magical Realism”; they wanted log lines that situated the story in realism “to meet the audience’s expectations” and “set things at a specific location”.
5.1.3 About the potential for the script to be staged after editing. Several participants (p6, p9, p11, p13, p15) highlighted the potential for the script to be staged after editing: “a rough draft, would need to work a lot with it [but] it could be helpful and staged, definitely” (p6), “It gets me thinking about how you can make a full show with a single idea” (p11) and “You know, with a bit of editing, I could take that to Netflix: just need to finesse it a little bit” (p9). Participant p1 staged several scripts generated with Dramatron (see Section 5.9).
5.2.1 Inspiration for the Writer. All participants found Dramatron useful for getting inspiration: “this is perfect for writers’ block” (p13), “I can see it being very helpful, if you are stuck” (p4, p5), “more in depth than the writers’ unblocking prompts website” (p3). Dramatron was described as a tool that indirectly stimulates the playwright’s creativity: “I like what happens in my brain when I read some outputs of the model. I got an idea for the rest of the story” (p6), “It is about me discovering what will translate from what it gives me” (p10), or that directly gives actionable suggestions: “Here is a concept; it puts meat on the bones, and then you trim the fat by going back and forth” (p13). Glitches and language model limitations can be subverted for inspiration, in particular when the script is performed: “mistakes are gifts that we can leave for the improvisers” (p1).
5.2.2 Generation of Alternative Choices and World Building. More than merely providing a creative spark for the main story, the model can be employed to populate the universe of the story: “If I was going to use this to write a script, I’d use it to generate characters to see if it generated things I hadn’t thought about. Or relationships I hadn’t thought about” (p15). Participants also saw Dramatron as a tool for exploration: “I would go with the suggestion that is further away from what I would have suggested because I already know what is in my head and I want to know what the machine would do” (p12).
5.2.3 Using the System for Learning and Analysis. By prompting the system, writers could indirectly search the language model for literary styles and elements: “Even if I were not writing, it does a wonderful job of collecting what is in the literature” (p10) or even hypothetically search within their own output: “I would be very interested in feeding everything I ever wrote and then getting it to generate script in my voice and style” (p4, p5). Learning could also happen by analysing how to improve Dramatron’s outputs: “For me, as a playwright, the interesting thing about working with this technology is thinking about how I would edit it. For instance: What would this look like on stage?” (p8).
5.2.4 Content Generation. Beyond inspiration, several participants were interested in the co-writing potential of Dramatron, and thought it could provide them with material. “One of the big sticking points of playwriting is getting words on the page. This helps with that step” (p8). “I would use this tool to fix (screenwriting) projects that might be dead” (p14). “This is a rich tool for basically everything. I have done devised creation. There are methods that you can use to generate text, where you pull songs, scripts, or news articles, then chop and paste them down. This reminds me of Dadaist text generation” (p11). “Practically, it might impact the economics of writing if longer running series could be augmented by such writing systems. It might be useful on long-running series, where you have a writers room” (p4, p5).
5.2.5 Potential of AI as Tool for TV Screenwriting. Some participants suggested this tool could be employed in a TV writers’ room, to help with writing formulaic scripts. “If you were able to make an AI to synopsize scripts effectively, you would be valuable to the studio” (p14). “It is like having a very good dramaturge” (p10). “AI can come up with 5 scripts in 5 minutes” (p9). “Which part of the process is this tool relevant for? Formulaic TV series” (p4, p5). “I wouldn’t use it for writing a straight play” (p11).
5.3.1 The system outputs are too literal and predictable. Some participants found the character “relationships so tight and prescriptive” (p4, p5); if a character has “a noble endeavour, it will be stated in the dialogue” (p4, p5), and that characters were given “silly” and “on the nose, pun names” (p2). Similarly, the title generation “does what it says on the tin” (p15), and “can be overly descriptive sometimes: the director could make decisions” (p8). One commented, “this is a thing that my students would do” (p8). There were some positive aspects to such a predictable system: “interpersonal relationships created here are classic tropes that keep the audience interested” (p3) and “there is interest in generating outputs from the system for content that already exists: actual titles are fun to compare against” (p14).
5.3.2 The system outputs can be problematic, stereotypical, and biased. Participant p9 wondered “What cultures and languages the books come?” whereas many participants noticed gender biases and ageism in the system outputs. “I am less sexist than the computer” (p3). “The protagonists are both male characters, and all of the supporting characters are female” (p4, p5). “The female lead is defined by their relationship to the other characters: it is a typical thing in plays that the women characters don’t have a lot of information about them” (p11). “She is always upset and doesn’t have wants (like the male characters) [...] Actually lots of the content [...] is misogynistic and patriarchal” (p8). This problem raised the issue of coping strategies or cultural appropriation: “if we gave GPT-2 some character names, it could come up with bigoted characters: [we] went with more made up names, not gender specific, not ethnicity-specific” (p13) and “there is an ethical question about using AI for a group of theatre makers: the AI throws us a topic, or relation that is unrelated to our lived experience and we are compelled to Yes, and the offers” (p4, p5). We discuss ethical issues raised in discussion by participants in greater detail in Section 7.3.
5.4.1 Participants embrace unexpected outputs from the system. Participant p6 laughed at the “poetic and absurd” suggestions. “It is really interesting to see what it comes up with” (p8), “levels of absurdity that are tickling my fancy” (p10), “I wouldn’t have thought of that but it is quite funny” (p11). “This is something that a human author probably would not stand for, it is uniquely created [...] I want ideas that a human couldn’t possibly have” (p12).
5.4.2 The system often enters generation loops. All participants noticed how the system could enter generation loops: “I would probably cut a lot of it” (p6) or “a whole scene about a boiler being broken: yeah” (p8). They sometimes found positive aspects to such loops: “It is a silly conversation. It is a little repetitive. I like it.” (p6), “repetition leaves room for subtext” (p12) and enjoyed the glitches (p4, p5) or even made parallels with existing work (p3).
5.5.1 Lack of consistency and of long-term coherence. “Keeping dialogue character-based and consistent is most important [...] There is still some difficulty in getting it to stay on track with the context.” (p15). “I want the characters to be more consistent within themselves” (p12). “There is a bit of confusion in the logic, gaps in logic [...] It looks like postmodern theatre [...] But in terms of [a play with a given] genre, that has a plot to follow, it is getting confusing” (p11). Participant 7 “wants to add some stitching between the beats to make them narratively make sense”.
5.5.2 Lack of common sense and embodiment. Participant 8 observed that “There are things that it is hard to show on stage – such as a cat. The system doesn’t have an awareness of what is stageable and not stageable” and p9 noted that when “interfacing with a story telling AI, the input space is constrained”.
5.5.3 Lack of nuance and subtext. Participant 3 observed: “that’s a good example of how computers do not understand nuance, the way we see language and can understand it even if it is not super specific”. “A lot of information, a bit too verbalised, there should be more subtext” (p6). “With dialogue in plays, you have to ask yourself two questions: 1) Do people actually speak like that? 2) Are actors attracted to these lines and are these appealing lines to play?” (p7) “Playwriting is about realistic dialogue... all of the things around subtext. [...] Show, not tell: here we are just telling. Just like in improv: ‘do not mention the thing’. The element in the log line became the central bit in the generation, and that was repetitive” (p8). Participant 14 concluded that “AI will never write Casablanca, or A Wonderful Life. It might be able to write genre boxed storytelling”.
5.5.4 Lack of a motivation for the characters. “The stories do not finish. The character journeys are not complete. There is perhaps something missing in the character background [...] Where is the emotional motivation, stuff that might exist in the backstory and not exist in the script?” (p14). “On the first go-through, you are looking for the goal of the protagonist, and impediment for that drive. What is my character doing, and what do they want? If this was given to an actor they are going to struggle with the first thing to do, which is to find the needs and the wants of the character and then to personalise it” (p9). “My students do this: a character comes into play and says right what they want.” (p8). “The conflict should be something inside the character” (p6). “Why do people not say what they mean? It is because we have societal understanding, but sometimes get lost in translation” (p3).
5.6.1 Difficulty caused by the need to come up with the log line to condition all the generation. For participant 12, it was difficult to come up with a log line, and the process seemed precious. “Coming up with the first prompt takes a little bit of back and forth” (p11). “Packing the action into the log line: this is a panic moment for the writer, because they want to add everything meaningful into the script. [...] It is all about the witty premise. The system that you have right now is somewhat about wit. There is a need for the log line to hold some kind of wit” (p13). “Does [the log line] have to have a character name?” (p4, p5). “The log line is not a closed synopsis. It is less descriptive and more prescriptive. The art of log lines is about how short you can make it so that [the producers] read the rest of your material” (p14).
5.6.2 Structural criticism of log line-based conditioning of the whole generation. “Generally the way that I work, I am clear what I want to say about the world – what I think about the world. The vehicles, or the characters, or the arc is not clear. This looks like a collection of scenes that logically follow one to the next. But, the core idea of the thing to say [is missing]” (p4, p5). “If I could program something to write a script, I wouldn’t start with a log line. You can also consider starting with a character and an obstacle in the way of that character” (p9).
5.6.3 Negative consequence of Dramatron’s design choice: parallel dialogue generation. “From the scene beats, it has no idea of what the previous dialogue contained. Then to see the dialogue not be consistent is jarring” (p1). “I wonder if there is a problem in importing the previous beat into the scene [...] Paying attention to the consistency in the beats, helps with the consistency of the dialogue generated” (p12). Upon learning that scene dialogue was generated in parallel for each scene, Participant 2 commented: “If it didn’t read its last scene, how can you get the last scene into the next generation? Generation of these scripts could be significantly benefited from attending to the previous scene’s dialogue”.
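The consistency problem the participants describe follows from conditioning each scene only on the log line and its own beat, rather than on the dialogue already generated. A minimal sketch of the two strategies (with a hypothetical `generate_dialogue` stand-in for the language model call; these function names are illustrative, not Dramatron's actual code):

```python
# Sketch contrasting independent (parallel) scene generation with
# sequential generation that threads in the previous scene's dialogue.
# `generate_dialogue` is a hypothetical placeholder for an LLM call.

def generate_dialogue(log_line, beat, prior_dialogue=""):
    # A real system would prompt a large language model here.
    context = f" (following: {prior_dialogue})" if prior_dialogue else ""
    return f"Dialogue for beat '{beat}' given '{log_line}'{context}"

def parallel_generation(log_line, beats):
    # Dramatron's design choice: each scene sees only the log line
    # and its own beat, so scenes can be generated concurrently,
    # but consecutive scenes may contradict each other.
    return [generate_dialogue(log_line, beat) for beat in beats]

def sequential_generation(log_line, beats):
    # The participants' suggestion: condition each scene on the
    # previous scene's dialogue to improve consistency.
    scenes, prior = [], ""
    for beat in beats:
        scene = generate_dialogue(log_line, beat, prior_dialogue=prior)
        scenes.append(scene)
        prior = scene
    return scenes
```

The trade-off is latency: sequential generation cannot be parallelised, since each scene's prompt depends on the previous scene's output.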
Modeling characters and their relationships was a recurrent theme: “can we make the system relationship-driven?” (p12), “where does status belong in character building?” (p12), “could we generate the stem of a character and then complete it?” (p15). Participant 12 suggested: “as an author, I would build a social graph of the characters relations”. Answering the question “How do you get the system to know where the scene should start and end?” (p15), three participants (p8, p13, p15) suggested fitting a narrative arc within each scene.
Several participants wanted to be able to query and dialogue with the writing model: “Have you engaged [the AI system] by trying to give it notes?” (p2), to allow it to learn about the world: “How does world building happen? Maybe the model needs to know the Ws of Stella Adler [(Who? What? Where? Why? How? etc.)] Can you get the system to answer these questions?” (p9), or to allow rewriting and reformulation: “can we ask the system to re-write with a style or context?” (p8). As p10 reiterated, iterative rewriting was a desired workflow: “I am less interested in shaping [the narrative], rather than seeing what it is saying, and refining it to see what it says, and then refining it again. A playwright has to see the play spoken before making cuts.”
Finally, p4 and p5 astutely observed that “there has been a push away from systems of Western dramaturgy, so in terms of making this most useful for the future, it might be helpful to consider how it might be used within the context of other contemporary writing”—suggesting alternative narrative structures and elements—“as the AI is not bound by the same rules that we are. So, telling it to be bound by those human rules feels limiting of the capabilities”.
As detailed in Section 5.7, the participants were engaged and provided constructive feedback about Dramatron. As one participant in the study remarked: “the system is so adaptable, it can change with our feedback and tweaks”. This understanding of the system’s modifiability empowered those who interacted with it to suggest changes more freely, knowing that they could be incorporated. In this way, the system benefited and evolved over the course of the participant study.
Over the course of the interviews, we incorporated what feedback we could by making small, incremental changes to the prompt prefix sets of Dramatron. Table 1 summarizes changes made as a direct result of participants’ feedback. This sort of participatory design and development is critical for creative tool development, as feedback from users can be directly incorporated to improve the system for the next interaction. This was made possible by the modular design of the system, the lightweight prompt-based interactions, and the flexibility afforded by Dramatron. This participation also inspired participants to explore related, connected creative ideas. For example, Fig. 4 (LEFT) shows concept art for a narrative test of virtual actors interpreting a co-written script.
Creative writing for theatre is fundamentally interactive: not just between collaborating storytellers, but between storytellers and the audience. For this reason, we evaluated how scripts co-written with Dramatron could be produced on the theatre stage. In this section, we describe staging details and report evaluative reflections from both the creative team and two professional theatre reviewers.
Five scripts co-written with Dramatron were staged in public performances in August 2022 at North America’s largest theatre festival, the 2022 Edmonton International Fringe Theatre Festival. The show, titled Plays By Bots, ran for 7 performances over two weeks (see an image from the production in Fig. 4). In each show, a different cast acted out one of the plays from the co-writing experiments. The plays span different genres, styles, characters, and storylines. The scripts were brought to life by casts of 4-6 experienced improvisers and actors. The first half of each script was given to each cast member in a sealed envelope. Only when the show began were they allowed to open the script, and they commenced the performance by reading it live in front of the audience. Once the script ran out, the actors improvised the ending, based on the context and story set out by the script[5]. During each performance, the director and co-writer (participant p1 from above) introduced the project to the audience and explained that they had co-written and edited the script using Dramatron.
There were two reviews written about the production of Plays By Bots at the festival. One of the reviews noted that the performance “proves that artificial intelligence can in fact write a hit Fringe play”. The reviewer also noted that the success of the performance was due to both the Dramatron system and the human actors, especially one performer who “mastered Dramatron’s voice and seamlessly took it off-script for the remainder of the show, much to the delight of the howling audience”. The second review was also positive. With a hint of incredulity, the reviewer complimented the abilities of Dramatron. The reviewer noted the style of Dramatron, and how that served the performance saying “if there’s a certain flatness in the dialogue, which runs to declarations, that in itself is amusing since it turned out to be perfectly suited to the deadpan comic talents of [the] improvisers,” and “the human actors continue to capture the playwright bot’s tone”. The reviewer also expressed surprise at the ability of the system to create a play that hangs together and creates a world. They further noted that some lines from Dramatron are so funny they were reprised later in the show once the human actors were improvising.
Discussions amongst the creative team complement the reviews and provide insight into how professional actors and improvisers found working with scripts co-written by Dramatron. Post-show discussions were facilitated and relayed to us by the director (p1 above). Four key themes emerged through these discussions, echoing the themes presented earlier in Section 5. First, the system has a distinct glitch style, and generated text can be repetitive yet fun to work with. Second, the team attributed agency to the system and had expectations of the system’s capabilities. Third, as trained improvisational theatre performers, the actors were able to add a layer of interpretation to the co-written script, which helped add meaning to the text. Finally, the prevailing feedback from the creative team was that participating in the production was fun! The creative team’s enthusiasm and reflections underscore the usefulness of co-written scripts for theatre production and collaboration; more reflections and supporting quotes are included in Appendix B.
This paper is available on arxiv under CC 4.0 license.
[5] Video of performance shared upon acceptance.