Authors:
(1) Herbie Bradley, CarperAI, CAML Lab, University of Cambridge & EleutherAI;
(2) Andrew Dai, Aleph Alpha;
(3) Hannah Teufel, Aleph Alpha;
(4) Jenny Zhang, Department of Computer Science, University of British Columbia & Vector Institute;
(5) Koen Oostermeijer, Aleph Alpha;
(6) Marco Bellagente, Stability AI;
(7) Jeff Clune, Department of Computer Science, University of British Columbia, Vector Institute & Canada CIFAR AI Chair;
(8) Kenneth Stanley, Maven;
(9) Grégory Schott, Aleph Alpha;
(10) Joel Lehman, Stochastic Labs.
Figure 2 provides an overview of the approach, which extends a common QD algorithm (MAP-Elites) with LM operators that generate variation and evaluate both the quality and diversity of candidate solutions. The result is a search algorithm capable of iterative discovery and refinement, applicable to subjective text-based domains.
MAP-Elites. Our QDAIF implementation builds upon MAP-Elites (Mouret & Clune, 2015), a widely used QD algorithm (Lehman et al., 2022; Cully et al., 2015; Nilsson & Cully, 2021; Vassiliades et al., 2016). MAP-Elites discretizes the diversity space (i.e. dimensions of relevant diversity) into a grid, called the archive. The overarching objective is to populate each grid bin (or cell) within the archive with as high-quality a solution as possible. An iteration in MAP-Elites follows these steps: (1) randomly select an existing solution from the archive, (2) mutate the chosen solution to generate a new solution, (3) evaluate the new solution’s quality and diversity characteristics, and (4) if the new solution is of higher quality than the current occupant of the cell corresponding to its diversity characteristics, replace that occupant with the new solution. For a new solution to be added to the archive, it has to improve either the quality or the diversity of the grid, meaning that it has to either fill an empty bin or perform better than the solution already in its bin. QDAIF distinguishes itself from standard MAP-Elites in four key areas: archive initialization, solution mutation, solution evaluation, and grid discretization (cf. Figure 2). We provide details on each of these differences below.
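The four steps of a MAP-Elites iteration can be sketched as follows. This is a minimal illustration under stated assumptions, not the QDAIF implementation: the names `map_elites_step`, `run_toy_example`, `mutate`, `evaluate`, and `to_cell` are hypothetical, and the toy domain (a single number in [0, 1] with ten uniform bins) stands in for the LM-based operators described in the rest of this section.

```python
import random
from typing import Callable, Dict, Tuple

Cell = Tuple[int, ...]
Archive = Dict[Cell, Tuple[object, float]]  # cell -> (solution, quality)


def map_elites_step(
    archive: Archive,
    mutate: Callable[[object], object],
    evaluate: Callable[[object], Tuple[float, tuple]],
    to_cell: Callable[[tuple], Cell],
    rng: random.Random,
) -> bool:
    """Run one MAP-Elites iteration; return True if the archive changed."""
    # (1) Randomly select an existing solution from the archive.
    parent, _ = rng.choice(list(archive.values()))
    # (2) Mutate the chosen solution to generate a new solution.
    child = mutate(parent)
    # (3) Evaluate the new solution's quality and diversity measures.
    quality, measures = evaluate(child)
    cell = to_cell(measures)
    # (4) Keep the child if its cell is empty or it beats the occupant.
    incumbent = archive.get(cell)
    if incumbent is None or quality > incumbent[1]:
        archive[cell] = (child, quality)
        return True
    return False


def run_toy_example(steps: int = 200, seed: int = 0) -> Archive:
    """Toy domain (an assumption for illustration): solutions are floats."""
    rng = random.Random(seed)

    def mutate(x: float) -> float:
        # Gaussian perturbation, clipped to [0, 1].
        return min(1.0, max(0.0, x + rng.gauss(0.0, 0.2)))

    def evaluate(x: float) -> Tuple[float, tuple]:
        # Quality peaks at 0.5; the single diversity measure is x itself.
        return -abs(x - 0.5), (x,)

    def to_cell(measures: tuple) -> Cell:
        # Ten uniform bins over [0, 1].
        return (min(int(measures[0] * 10), 9),)

    archive: Archive = {to_cell((0.5,)): (0.5, 0.0)}
    for _ in range(steps):
        map_elites_step(archive, mutate, evaluate, to_cell, rng)
    return archive
```

Calling `run_toy_example()` returns an archive keyed by bin index. In QDAIF, `mutate` and `evaluate` would be LM calls and `to_cell` would use the non-uniform bins discussed under Discretization.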
Initialization and Mutation. For archive initialization, QDAIF employs few-shot prompting, generating solutions based on a hand-chosen set of seed examples. We list in Appendix A.21 the three few-shot examples utilized in each domain, each chosen to span a breadth of diversity characteristics. For example, in a domain where diversity of sentiment is desired (like the Opinions domain described in Section 4.1), the few-shot examples demonstrate positive, neutral, and negative sentiments. For solution mutation, QDAIF employs LMX-Near (referred to as "LMX" for brevity in the rest of this manuscript), as detailed in Meyerson et al. (2023). LMX evolves varied text representations (e.g. mathematical expressions, sentences, Python code) by leveraging effective in-context learning (Brown et al., 2020). LMX prompts are kept simple, typically starting with “Here is a random example of”. Appendix A.22 shows the full LMX prompts. We also introduce a novel mutation method with instruction-following prompts for poetry in Section 4.4.
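To make the few-shot style of mutation concrete, the sketch below assembles a prompt from existing solutions and leaves an open slot for the LM to complete. Only the opening phrase "Here is a random example of" comes from the text above; the function name `build_lmx_prompt`, the `topic` wording, and the `###` separator are assumptions for illustration, and the actual prompts are those in Appendix A.22.

```python
from typing import Sequence


def build_lmx_prompt(
    topic: str,
    examples: Sequence[str],
    prefix: str = "Here is a random example of",
    separator: str = "\n###\n",  # hypothetical delimiter between examples
) -> str:
    """Concatenate few-shot examples and end with an open slot to complete."""
    if not examples:
        raise ValueError("at least one example is required")
    header = f"{prefix} {topic}:"
    # Each existing solution is shown after the same header line.
    blocks = [f"{header} {example.strip()}" for example in examples]
    # The final, empty header invites the LM to write a new variation.
    blocks.append(header)
    return separator.join(blocks)
```

The LM's continuation of the final header would then be taken as the new candidate solution.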
Archive Measures. While it is sometimes feasible to devise hand-crafted heuristics to evaluate the quality of a solution (e.g. efficiency in locomotion) or diversity characteristics (e.g. a robot’s size and mass), this approach falters as domains become more complex and nuanced, as in creative writing. For example, hand-crafting robust heuristics for qualitative aspects of a story, such as its genre (e.g. romance vs. horror), is very difficult. QDAIF circumvents the need for hand-coded measures through prompting LMs with easily-written natural language queries to generate feedback. In particular, capable LMs trained on expansive text corpora can begin to mirror human intuition across a range of potentially subtle diversity characteristics.
Quantifying Performance and Diversity. For quality assessment, we prompt the LM to discern whether the input text contains a high-quality solution or pertains to the requested topic, requesting a “yes” or “no” response. The solution’s quality estimate is derived from the logarithm of the probability of the LM predicting one response versus the other response. Similarly, for diversity evaluation, we guide the LM to identify a particular diversity trait. For instance, in an opinion-generating domain, the LM is prompted to gauge a solution’s sentiment, with a requested response of “positive” or “negative”. The log probability of these responses serves as our measure of solution diversity. Appendix A.22 shows the full prompts used in each domain to evaluate the solutions. We also introduce a novel categorical approach to evaluate solution attributes based on raw predictions of discrete labels in Section 4.4.
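One plausible way to turn the two label log probabilities into a single score is a two-way softmax, sketched below. The exact normalization is not specified in the text above, so this formula is an assumption, and `two_way_score` is a hypothetical helper name. The same function could serve for “yes” versus “no” (quality) and “positive” versus “negative” (diversity).

```python
import math


def two_way_score(logprob_a: float, logprob_b: float) -> float:
    """Relative probability of label A versus label B, in [0, 1].

    Assumed normalization: p_a / (p_a + p_b), computed from log
    probabilities in a numerically stable way.
    """
    # Subtract the maximum so that exp() cannot underflow to 0 for both.
    m = max(logprob_a, logprob_b)
    exp_a = math.exp(logprob_a - m)
    exp_b = math.exp(logprob_b - m)
    return exp_a / (exp_a + exp_b)
```

For example, if the LM assigns probability 0.6 to “yes” and 0.2 to “no”, then `two_way_score(math.log(0.6), math.log(0.2))` evaluates to approximately 0.75.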
Discretization. MAP-Elites typically partitions the grid into equally-sized bins, based on the intuition that all parts of the behavior space are equally interesting. However, when assigning a bin along the diversity axis (which in our approach is based on the logits of an LM providing AI feedback), we observe that qualitative changes in behavior do not correspond uniformly to changes in the logits (cf. Appendix A.31). This is likely due to the (non-linear) calibration behavior of instruction-tuned models in predicting the labels (as output tokens) of text passages (Jiang et al., 2021). Hence, we use custom non-uniform bins, which are denser towards the ends of the range. Qualitative analysis of the generated text showed that the non-uniform bins yielded better alignment with typical human perceptions of diversity changes, influenced by both the AI model’s calibration and the domain-specific goals.
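Non-uniform binning of this kind can be sketched with a sorted list of interior bin edges and a binary search. The edge values in `EDGES` below are illustrative, chosen only to be denser near 0 and 1; they are not taken from the text above, and `bin_index` is a hypothetical helper that assumes the diversity score has been mapped to [0, 1].

```python
from bisect import bisect_right
from typing import Sequence

# Illustrative interior edges, denser towards both ends of [0, 1].
EDGES = (0.005, 0.02, 0.05, 0.20, 0.50, 0.80, 0.95, 0.98, 0.995)


def bin_index(score: float, edges: Sequence[float] = EDGES) -> int:
    """Map a score in [0, 1] to one of len(edges) + 1 bins.

    Bin i covers [edges[i-1], edges[i]), with the first bin starting
    at 0 and the last bin including 1.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    # bisect_right counts how many edges are <= score.
    return bisect_right(edges, score)
```

With these edges, scores of 0.001 and 0.01 land in different bins (0 and 1), whereas 0.25 and 0.45 share bin 4, reflecting the finer resolution near the ends of the range.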
Models and Setup. Details on the LMX generation model and the finetuned AI feedback model, including how these LMs were trained, are given in Appendix A.24 and Appendix A.25, respectively. Additional default hyperparameters are described in Appendix A.27.
This paper is available on arXiv under a CC 4.0 license.