Revolutionizing Creative Text Generation with Quality-Diversity through AI Feedbackby@feedbackloop
248 reads

Revolutionizing Creative Text Generation with Quality-Diversity through AI Feedback

tldt arrow

Too Long; Didn't Read

Discover Quality-Diversity through AI Feedback (QDAIF), a cutting-edge method for creative text generation. QDAIF outperforms baselines, excelling in opinions, short stories, and poetry domains. While showing promise, the paper discusses limitations and proposes future directions, marking a significant leap in AI-driven creative search systems.

People Mentioned

Mention Thumbnail
Mention Thumbnail
featured image - Revolutionizing Creative Text Generation with Quality-Diversity through AI Feedback
The FeedbackLoop: #1 in PM Education HackerNoon profile picture


(1) Herbie Bradley, CarperAI, CAML Lab, University of Cambridge & EleutherAI;

(2) Andrew Dai, Aleph Alpha;

(3) Hannah Teufel, Aleph Alpha;

(4) Jenny Zhang, 5Department of Computer Science, University of British Columbia & Vector Institute;

(5) Koen Oostermeijer, Aleph Alpha;

(6) Marco Bellagente, Stability AI;

(7) Jeff Clune, Department of Computer Science, University of British Columbia, Vector Institute & Canada CIFAR AI Chair;

(8) Kenneth Stanley, Maven;

(9) Grégory Schott, Aleph Alpha;

(10) Joel Lehman, Stochastic Labs.


This paper introduces QDAIF, a quality-diversity method that aims to discover diverse and highquality solutions in qualitative domains, by leveraging advances in foundation models to evaluate the quality and diversity of generated individuals. QDAIF outperforms baseline methods in returning more diverse, high-quality solutions in creative writing domains (Opinions, Stories, Poetry), that benefit greatly from accurate AI feedback measures. The paper’s results highlight that QDAIF can succeed at its aims, generating solutions that align with human perception of quality and diversity.

We note limitations with QDAIF that motivate future work. Firstly, we suspect reward hacking happening when using LMs to generate feedback. Our human evaluation investigation shows that while the LM’s evaluation of quality mostly aligns with human perception, the correlation drops when the evaluated quality is in the range 0.995 to 1 (cf. Figure 5). The text generation might have exploited certain attributes or phrasings that allow an LM to give a high-quality estimate, but not what humans would agree is good. This is a common issue highlighted by other works when using AI models as classifiers or evaluators (Nguyen et al., 2015a), highlighting risks of open-ended search to be tackled (Ecoffet et al., 2020). One method to address this limitation could be to use RLHF finetuning (Ouyang et al., 2022) to produce LMs that can detect and mitigate adversarially generated texts. Another possible approach could be to use an ensemble of different AI models to evaluate solutions, rather than relying only on one; the hope would be that robustness would result from models having uncorrelated blind spots.

Furthermore, although QDAIF makes it easy to specify qualitative aspects of diversity through natural language prompts, it still requires specified definitions of diversity axes. For example, if we applied QDAIF to generate short stories of different genres (e.g. comparing horror vs. romance), it would not autonomously explore other important attributes that a writer might care about (e.g. firstperson vs. third-person perspective) unless explicitly specified. When we tested different diversity measures in the Stories domain, such pathologies were observed (Appendix A.32). For example, when using "hero spy vs. hero politician" as the diversity measure, many of the solutions generated tend to neglect the interaction between the spy and the politician, focusing solely on the character that is meant to be the hero. However, someone writing a short story about a spy and a politician would naturally care about how the characters interact with one another. One possible way to automatically determine interesting diversity measures is to utilize the human notions of interestingness distilled into foundation models (Zhang et al., 2023). That is, we could ask LMs to suggest interesting diversity measures that a human would typically care about in the domain, thereby enabling a more autonomous creative search (see Appendix A.10 for findings on the potential of this method).

In conclusion, we show that QDAIF is a promising approach to open-ended search that can reveal unexplored creative writing spaces, surpassing alternative text generation methods in generating diverse high-quality natural language text. AI feedback, Evolution through Large Models (ELM), and quality-diversity search (QD) were found to be essential ingredients for enhanced AI systems that can innovate in subjective spaces, similar to past research on Innovation Engines (Nguyen et al., 2016; 2015b). In fact, we see AI feedback as a general ingredient for open-ended search for solutions in multimodal domains, capable of following instructions beyond text (Liu et al., 2023). QDAIF can be easily extended to multi-modal domains (e.g. vision-language) for synthetic data generation and evaluation, building on top of recent advances in the field (Eichenberg et al., 2021; Alayrac et al., 2022; Bellagente et al., 2023; Driess et al., 2023; Bhatt et al., 2023; Sudhakaran et al., 2023; Todd et al., 2023). We see many possibilities from QDAIF to build creative search systems with evaluation, diversification, and improvement capabilities, bringing us closer to AI that can support and extend human innovation.


Human evaluations were performed by the co-authors of this paper and select colleagues. All human evaluators provided informed consent, and their feedback and assessments were obtained without coercion or bias. We took action to prevent bias by presenting evaluators with texts to evaluate in a blind setting, with only the instructions for the study annotation task presented (to carefully read through the presented texts, then give a quality score and a label of the characteristic that best matches the texts). We show a detailed setup for the human study in Appendix A.1.

For transparency, we provide the full set of results with caption descriptions from our human evaluation. In the Opinions domain, Tables 13–16 contain the human evaluation results for sets from baseline methods, Tables 29–32 contain the human evaluation results for sets from QDAIF methods, and Tables 25–28 contain the human evaluation results for sets from embedding feedback QD methods. In the Stories - Genre domain, Tables 17–20 contain the human evaluation results for sets from baseline methods, and Tables 33–36 contain the human evaluation results for sets from QDAIF methods. For the Stories - Ending domain, Tables 21–24 contain the human evaluation results for sets from baseline methods, and Tables 37–40 contain the human evaluation results for sets from QDAIF methods.


Herbie developed the setup and framework for the Poetry domain experiments and base library for research. Andrew developed the setup and experiments for the Opinions and Stories domains, and contributed to extended studies, visualization, and analysis across experiments in the paper. Hannah contributed additional experimentation in the Stories domain, in addition to coordinating part of human evaluation studies. Jenny contributed qualitative analysis across studied domains. Koen developed visualization scripts used in Opinions and Stories domain experiments. Marco contributed to part of the technical implementation and ideation. Andrew conducted the blind human evaluation study, and Grégory advised on the conduct and analysis of the human study. Joel, Jeff, and Ken initiated early ideation for this work. Joel, Grégory, Jeff, and Ken advised and guided. Andrew, Jenny, Herbie, and Joel wrote the manuscript with edits and feedback from all authors.


We thank Robert Baldock, Samuel Weinbach, Souradeep Nanda, Jan Zierstek, and Andres Felipe Cruz Salinas for insightful discussions and feedback within the lab at Aleph Alpha. We also thank Katherine Hardgrave, David Nugent, Daniel Flood, and Formula Trinity Autonomous for the inspiration that seeded the momentum leading up to this work.


Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: A visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716– 23736, 2022.

I. Elaine Allen and Christopher A. Seaman. Likert scales and data analyses. Quality progress, 40 (7):64–65, 2007.

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.

Marco Bellagente, Manuel Brack, Hannah Teufel, Felix Friedrich, Björn Deiseroth, Constantin Eichenberg, Andrew Dai, Robert Baldock, Souradeep Nanda, Koen Oostermeijer, et al. Multifusion: Fusing pre-trained models for multi-lingual, multi-modal image generation. arXiv preprint arXiv:2305.15296, 2023.

Varun Bhatt, Bryon Tjanaka, Matthew Fontaine, and Stefanos Nikolaidis. Deep surrogate assisted generation of environments. Advances in Neural Information Processing Systems, 35:37762– 37777, 2022.

Varun Bhatt, Heramb Nemlekar, Matthew Fontaine, Bryon Tjanaka, Hejia Zhang, Ya-Chuan Hsu, and Stefanos Nikolaidis. Surrogate assisted generation of human-robot interaction scenarios. arXiv preprint arXiv:2304.13787, 2023.

Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.

Herbie Bradley, Honglu Fan, Francisco Carvalho, Matthew Fisher, Louis Castricato, reciprocated, dmayhem93, Shivanshu Purohit, and Joel Lehman. OpenELM, January 2023a. URL https: //

Herbie Bradley, Honglu Fan, Harry Saini, Reshinth Adithyan, Shivanshu Purohit, and Joel Lehman. Diff models - A new way to edit code. CarperAI Blog, Jan 2023b. URL https://carper. ai/diff-model/.

Jonathan C. Brant and Kenneth O. Stanley. Minimal criterion coevolution: A new approach to open-ended search. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 67–74, 2017.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

Angelica Chen, David M. Dohan, and David R. So. Evoprompting: Language models for code-level neural architecture search. arXiv preprint arXiv:2302.14838, 2023.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code. 2021.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.

Jeff Clune. AI-GAs: AI-generating algorithms, an alternate paradigm for producing general artificial intelligence. arXiv preprint arXiv:1905.10985, 2019.

Cédric Colas, Laetitia Teodorescu, Pierre-Yves Oudeyer, Xingdi Yuan, and Marc-Alexandre Côté. Augmenting autotelic agents with large language models. arXiv preprint arXiv:2305.12487, 2023.

Antoine Cully. Autonomous skill discovery with quality-diversity and unsupervised descriptors. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 81–89, 2019.

Antoine Cully and Yiannis Demiris. Hierarchical behavioral repertoires with unsupervised descriptors. Proceedings of the Genetic and Evolutionary Computation Conference, 2018.

Antoine Cully, Jeff Clune, Danesh Tarapore, and Jean-Baptiste Mouret. Robots that can adapt like animals. Nature, 521(7553):503–507, 2015.

Li Ding, Jenny Zhang, Jeff Clune, Lee Spector, and Joel Lehman. Quality diversity through human feedback. arXiv preprint arXiv:2310.12103, 2023.

Danny Driess, Fei Xia, Mehdi S.M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.

Adrien Ecoffet, Jeff Clune, and Joel Lehman. Open questions in creating safe open-ended ai: tensions between control and creativity. In Artificial Life Conference Proceedings 32, pp. 27–35. MIT Press One Rogers Street, Cambridge, MA 02142-1209, USA journals-info . . . , 2020.

Constantin Eichenberg, Sidney Black, Samuel Weinbach, Letitia Parcalabescu, and Anette Frank. MAGMA - Multimodal augmentation of generative models through adapter-based finetuning. arXiv preprint arXiv:2112.05253, 2021.

Manon Flageat and Antoine Cully. Fast and stable map-elites in noisy domains using deep grids. In Artificial Life Conference Proceedings 32, pp. 273–282. MIT Press One Rogers Street, Cambridge, MA 02142-1209, USA journals-info . . . , 2020.

Manon Flageat and Antoine Cully. Uncertain quality-diversity: Evaluation methodology and new methods for quality-diversity in uncertain domains. IEEE Transactions on Evolutionary Computation, 2023.

Matthew Fontaine and Stefanos Nikolaidis. Differentiable quality diversity. Advances in Neural Information Processing Systems, 34:10040–10052, 2021.

Felix Friedrich, Patrick Schramowski, Manuel Brack, Lukas Struppek, Dominik Hintersdorf, Sasha Luccioni, and Kristian Kersting. Fair diffusion: Instructing text-to-image generation models on fairness. arXiv preprint arXiv:2302.10893, 2023.

Adam Gaier. Accelerating Evolutionary Design Exploration with Predictive and Generative Models. PhD thesis, Université de Lorraine, 2020.

Adam Gaier, Alexander Asteroth, and Jean-Baptiste Mouret. Data-efficient exploration, optimization, and modeling of diverse designs through surrogate-assisted illumination. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 99–106, 2017.

Adam Gaier, Alexander Asteroth, and Jean-Baptiste Mouret. Are quality diversity algorithms better at generating stepping stones than objective-based search? In Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp. 115–116, 2019.

Luca Grillotti and Antoine Cully. Unsupervised behaviour discovery with quality-diversity optimisation. arXiv preprint arXiv:2106.05648, 2021.

Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. Towards a unified view of parameter-efficient transfer learning. arXiv preprint arXiv:2110.04366, 2021.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pp. 2790–2799. PMLR, 2019.

Zhengbao Jiang, Jun Araki, Haibo Ding, and Graham Neubig. How can we know when language models know? On the calibration of language models for question answering. Transactions of the Association for Computational Linguistics, 9:962–977, 2021.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4401–4410, 2019.

Leon Keller, Daniel Tanneberg, Svenja Stark, and Jan Peters. Model-based quality-diversity search for efficient robot learning. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 9675–9680. IEEE, 2020.

Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, and Roberta Raileanu. Understanding the effects of rlhf on llm generalisation and diversity. arXiv preprint arXiv:2310.06452, 2023.

Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. RLAIF: Scaling reinforcement learning from human feedback with AI feedback. arXiv preprint arXiv:2309.00267, 2023.

Joel Lehman and Kenneth O. Stanley. Revising the evolutionary computation abstraction: Minimal criteria novelty search. In Proceedings of the 12th annual conference on Genetic and evolutionary computation, pp. 103–110, 2010.

Joel Lehman and Kenneth O Stanley. Abandoning objectives: Evolution through the search for novelty alone. Evolutionary computation, 19(2):189–223, 2011a.

Joel Lehman and Kenneth O. Stanley. Evolving a diversity of virtual creatures through novelty search and local competition. In Proceedings of the 13th annual conference on Genetic and evolutionary computation, pp. 211–218, 2011b.

Joel Lehman, Kenneth O Stanley, et al. Exploiting open-endedness to solve problems through the search for novelty. In ALIFE, pp. 329–336, 2008.

Joel Lehman, Bryan Wilder, and Kenneth O. Stanley. On the critical role of divergent selection in evolvability. Frontiers in Robotics and AI, 3:45, 2016.

Joel Lehman, Jeff Clune, Dusan Misevic, Christoph Adami, Lee Altenberg, Julie Beaulieu, Peter J. Bentley, Samuel Bernard, Guillaume Beslon, David M. Bryson, Patryk Chrabaszcz, Nick Cheney, Antoine Cully, Stephane Doncieux, Fred C. Dyer, Kai Olav Ellefsen, Robert Feldt, Stephan Fischer, Stephanie Forrest, Antoine Frénoy, Christian Gagné, Leni Le Goff, Laura M. Grabowski, Babak Hodjat, Frank Hutter, Laurent Keller, Carole Knibbe, Peter Krcah, Richard E. Lenski, Hod Lipson, Robert MacCurdy, Carlos Maestre, Risto Miikkulainen, Sara Mitri, David E. Moriarty, Jean-Baptiste Mouret, Anh Nguyen, Charles Ofria, Marc Parizeau, David Parsons, Robert T. Pennock, William F. Punch, Thomas S. Ray, Marc Schoenauer, Eric Shulte, Karl Sims, Kenneth O. Stanley, François Taddei, Danesh Tarapore, Simon Thibault, Westley Weimer, Richard Watson, and Jason Yosinski. The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities, 2019.

Joel Lehman, Jonathan Gordon, Shawn Jain, Kamal Ndousse, Cathy Yeh, and Kenneth O. Stanley. Evolution through large models. arXiv preprint arXiv:2206.08896, 2022.

Bryan Lim, Luca Grillotti, Lorenzo Bernasconi, and Antoine Cully. Dynamics-aware qualitydiversity for efficient learning of skill repertoires. arXiv preprint arXiv:2109.08522, 2021.

Bryan Lim, Alexander Reichenbach, and Antoine Cully. Learning to walk autonomously via resetfree quality-diversity. arXiv preprint arXiv:2204.03655, 2022.

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81, 2004.

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.

Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. arXiv preprint arXiv:2104.08786, 2021.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023.

Christopher D Manning. An introduction to information retrieval. Cambridge university press, 2009.

Elliot Meyerson, Mark J. Nelson, Herbie Bradley, Arash Moradi, Amy K. Hoover, and Joel Lehman. Language model crossover: Variation through few-shot prompting. arXiv preprint arXiv:2302.12170, 2023.

Jean-Baptiste Mouret and Jeff Clune. Illuminating search spaces by mapping elites. arXiv preprint arXiv:1504.04909, 2015.

Niklas Muennighoff. SGPT: GPT sentence embeddings for semantic search. arXiv:2202.08904, 2022.

Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 427–436, 2015a.

Anh Nguyen, Jason Yosinski, and Jeff Clune. Innovation engines: Automated creativity and improved stochastic optimization via deep learning. In Proceedings of the Genetic and Evolutionary Computation Conference, 2015b.

Anh Nguyen, Jason Yosinski, and Jeff Clune. Understanding innovation engines: Automated creativity and improved stochastic optimization via deep learning. Evolutionary computation, 24(3): 545–572, 2016.

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial NLI: A new benchmark for natural language understanding. arXiv preprint arXiv:1910.14599, 2019.

Olle Nilsson and Antoine Cully. Policy gradient assisted MAP-Elites. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 866–875, 2021.

OpenAI. GPT-4 technical report, 2023.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35: 27730–27744, 2022.

Ethan Perez, Sam Ringer, Kamile Lukoši ˙ ut¯ e, Karina Nguyen, Edwin Chen, Scott Heiner, Craig ˙ Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, et al. Discovering language model behaviors with model-written evaluations. arXiv preprint arXiv:2212.09251, 2022.

Julien Pourcel, Cédric Colas, Pierre-Yves Oudeyer, and Laetitia Teodorescu. Aces: generating diverse programming puzzles with autotelic language models and semantic descriptors. arXiv preprint arXiv:2310.10692, 2023.

Justin K. Pugh, Lisa B. Soros, and Kenneth O. Stanley. Quality diversity: A new frontier for evolutionary computation. Frontiers in Robotics and AI, 3:40, 2016.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763. PMLR, 2021.

Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERTnetworks. arXiv preprint arXiv:1908.10084, 2019.

Alex Renda, Aspen K. Hopkins, and Michael Carbin. Can LLMs generate random numbers? evaluating LLM sampling in controlled domains. In ICML 2023 Workshop: Sampling and Optimization in Discrete Space, 2023. URL llm-sampling-paper.

Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207, 2021.

William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike. Self-critiquing models for assisting human evaluators. arXiv preprint arXiv:2206.05802, 2022.

Jimmy Secretan, Nicholas Beato, David B. D’Ambrosio, Adelein Rodriguez, Adam Campbell, Jeremiah T. Folsom-Kovarik, and Kenneth O. Stanley. Picbreeder: A case study in collaborative evolutionary exploration of design space. Evol. Comput., 19(3):373–403, 2011. doi: 10.1162/EVCO\_a\_00030. URL

Noah Shinn, Beck Labash, and Ashwin Gopinath. Reflexion: An autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366, 2023.

Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward gaming. Advances in Neural Information Processing Systems, 35:9460–9471, 2022.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642, 2013.

Kenneth O Stanley. Compositional pattern producing networks: A novel abstraction of development. Genetic programming and evolvable machines, 8:131–162, 2007.

Kenneth O. Stanley and Joel Lehman. Why greatness cannot be planned: The myth of the objective. Springer, 2015.

Kenneth O. Stanley, Joel Lehman, and Lisa Soros. Open-endedness: The last grand challenge you’ve never heard of. O’Reilly Radar, 2017.

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.

Shyam Sudhakaran, Miguel González-Duque, Claire Glanois, Matthias Freiberger, Elias Najarro, and Sebastian Risi. Mariogpt: Open-ended text2level generation through large language models, 2023.

Graham Todd, Sam Earle, Muhammad Umair Nasir, Michael Cerny Green, and Julian Togelius. Level generation through large language models. In Proceedings of the 18th International Conference on the Foundations of Digital Games, pp. 1–8, 2023.

Vassilis Vassiliades, Konstantinos Chatzilygeroudis, and Jean-Baptiste Mouret. Scaling up MAPElites using centroidal Voronoi tessellations. arXiv preprint arXiv:1610.05729, 2016.

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023a.

Ren-Jian Wang, Ke Xue, Yutong Wang, Peng Yang, Haobo Fu, Qiang Fu, and Chao Qian. Diversity from human feedback. arXiv preprint arXiv:2310.06648, 2023b.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022a.

Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. arXiv preprint arXiv:2204.07705, 2022b.

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.

Jenny Zhang, Joel Lehman, Kenneth Stanley, and Jeff Clune. OMNI: Open-endedness via models of human notions of interestingness. arXiv preprint arXiv:2306.01711, 2023.

Yulun Zhang, Matthew C Fontaine, Amy K Hoover, and Stefanos Nikolaidis. Deep surrogate assisted MAP-Elites for automated hearthstone deckbuilding. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 158–167, 2022.

This paper is available on arxiv under CC 4.0 license.