Quality-Diversity through AI Feedback: Appendix

Too Long; Didn't Read

Discover Quality-Diversity through AI Feedback (QDAIF), a cutting-edge method for creative text generation. QDAIF outperforms baselines, excelling in opinions, short stories, and poetry domains. While showing promise, the paper discusses limitations and proposes future directions, marking a significant leap in AI-driven creative search systems.

Authors:

(1) Herbie Bradley, CarperAI, CAML Lab, University of Cambridge & EleutherAI;

(2) Andrew Dai, Aleph Alpha;

(3) Hannah Teufel, Aleph Alpha;

(4) Jenny Zhang, Department of Computer Science, University of British Columbia & Vector Institute;

(5) Koen Oostermeijer, Aleph Alpha;

(6) Marco Bellagente, Stability AI;

(7) Jeff Clune, Department of Computer Science, University of British Columbia, Vector Institute & Canada CIFAR AI Chair;

(8) Kenneth Stanley, Maven;

(9) Grégory Schott, Aleph Alpha;

(10) Joel Lehman, Stochastic Labs.


A APPENDIX

A.1 HUMAN STUDY ON QUALITY-DIVERSITY OF TEXT SAMPLES

A study with human feedback lets us understand how well QD score performance under AI feedback translates into a high-quality, diverse set of creative texts from a subjective standpoint. We compare sets of human evaluations for samples of diverse elites at the end of each run, measuring how each method's AI-assessed performance carries over to human-assessed performance in generating high-quality, diverse texts. In addition, the Human QD Score (the sum of the mean quality score for each category/label found in the set according to human feedback) gives us a rough sense of how well quality-diversity improvement during search aligns with the more subjective notion of quality-diversity; this score is low if a set deemed by AI feedback to cover a wide space of diverse texts fails to subjectively cover the desired diversity space according to human evaluations. To distinguish human ratings from AI feedback in this section, we refer to quality scores as those from human evaluators, and fitness scores as those from AI feedback.
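To make the Human QD Score concrete, here is a minimal sketch of how it can be computed from per-text human annotations. Normalizing the Likert scores to [0, 1] and dividing by the number of possible categories are our assumptions, chosen so that the result lands on the [0, 1] scale of the scores reported later in this section.

```python
from collections import defaultdict

def human_qd_score(annotations, num_categories=3, likert_max=5):
    # Sum of the mean quality score for each category/label found in the
    # set (per the Human QD Score definition above); the normalization
    # choices here are our assumptions, not spelled out in the paper.
    per_category = defaultdict(list)
    for label, likert_quality in annotations:
        per_category[label].append(likert_quality / likert_max)
    total = sum(sum(scores) / len(scores) for scores in per_category.values())
    return total / num_categories

# Example: a 5-text set annotated with two study-specific labels and "neutral".
annotations = [("positive", 4), ("positive", 3), ("neutral", 3),
               ("negative", 5), ("negative", 2)]
print(human_qd_score(annotations))  # ~0.667 when all three categories appear
```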


To assess the robustness of the quality and diversity measures in AI feedback, we carried out a study involving diverse elite samples selected from different bins of the QD archive in our tested runs. Across a total of 28 experiments (4 of which come from embedding feedback runs), five distinct stories per experiment were reviewed by six annotators. Each generated text was independently reviewed by two annotators, resulting in a total of 280 annotations.


During the annotation process, we collected a subjective assessment of the quality of each generation on a 5-point Likert scale, based on flow, plot, presence of repetition, and correspondence to the study’s topic. In addition, we assigned each text to one of three categories. Two categories were specific to the study performed (such as positive/negative sentiment, romance/horror genre, or tragic/happy ending), and a third category was used when no element of the other two classes was identified.


To prevent bias, evaluators were presented with texts in a blind setting, with only the instructions for the annotation task shown (to carefully read the presented texts, then give a quality score and a label for the characteristic that best matches each text). We provide the full set of results, with caption descriptions, from our human evaluation. In the Opinions domain, Tables 13–16 contain the human evaluation results for sets from baseline methods, Tables 29–32 for sets from QDAIF methods, and Tables 25–28 for sets from embedding feedback QD methods. In the Stories - Genre domain, Tables 17–20 contain the results for sets from baseline methods, and Tables 33–36 for sets from QDAIF methods. For the Stories - Ending domain, Tables 21–24 contain the results for sets from baseline methods, and Tables 37–40 for sets from QDAIF methods.


We compiled the results of the full study in Appendix A.12 and summarize the statistics across the study in the paragraphs below.


Comparison of quality scores. The two annotators generally agreed closely in their ratings, with an average difference of 0.9 Likert points, though they occasionally differed by 2 or 3 units; these deviations occurred in 15% and 5% of cases, respectively. This reflects that assessing quality is somewhat subjective. To obtain a final estimate of the quality of each generated text, the scores from both annotators were averaged.


Figure 5 shows the average quality ratings from annotators for various ranges of fitness (here, the quality score obtained from AI feedback). The ranges were chosen to ensure a similar number of samples in each range. We observe clear evidence of a correlation between human and AI quality measures, indicating the usefulness of AI feedback for assessing quality. However, we also observed that fitness for the texts with the highest scores becomes uncorrelated with human-assessed quality. This implies that above a certain threshold, fitness is not a reliable measure of quality in some cases (and slightly lower-fitness solutions were sometimes preferred by humans). Therefore, we suggest that future work study the relationship between (high-confidence) evaluations from AI feedback and reward hacking of solutions (Nguyen et al., 2015a; Skalse et al., 2022; Lehman et al., 2019) under certain conditions and controls during the search (e.g. through seed texts for human-preferred outputs, as shown in Table 1, or additional constraints on generated solutions during search (Lehman & Stanley, 2010; Brant & Stanley, 2017)).


Figure 5: Correlation plot between quality rating from human annotators, and fitness range (quality computed from AI feedback). Mean human-annotated quality and statistical error for different ranges of AI feedback fitness scores indicate more frequent instances of reward hacking (Skalse et al., 2022; Lehman et al., 2019) from the outputs of some search methods evaluated in this study.


Comparison of diversity measurements. For diversity, the two annotators agreed on the classification of a generated text 73% of the time. On the AI feedback side, diversity is collected on a 20-bin axis, measuring degrees of correspondence to the two experiment-specific categories mentioned earlier. For comparison with human feedback, these bins are clustered into a first category (bins 0 to 8), a second category (bins 11 to 19), and a neutral category (bins 9 and 10). Additionally, the 5 samples in each set were collected from a relatively uniform spread of bins across the diversity axis (specifically, bins [0, 6, 9, 13, 19]; when a method failed to fill a bin, the solution closest to that bin was chosen instead). This arrangement yields a relatively uniform distribution over the three categories.


On average, AI feedback agrees with a human annotator on the text category label 73% of the time. On samples where both annotators agree on the label, this rate increases to 82%, and it rises further to 95% on samples where both annotators give a non-neutral label. For some samples from bins 6 and 13, one annotator gave the neutral label and the other gave one of the two extreme labels (closer to the AI feedback diversity measure for these bins), indicating that some samples lie between neutral and extreme on the given measure. These findings suggest that diversity classification from AI feedback is relatively reliable, especially for texts where humans agree on the label.


Baseline quality rating. The average quality score given by annotators across samples was 3.18, close to the middle rating of 3. This gives an indication of the threshold between subjectively good and bad outputs.


The average human (subjective) QD score across all sets in the study is 0.606, another reference point for determining which sets from a given run/method contained high-quality, diverse texts.


A.2 COMPARING AI FEEDBACK AGAINST ALTERNATIVE MEASURES OF DIVERSITY

To understand the effect of fuzzy evaluation tools as a component of our QD setup, we tested an alternative feedback method to our default (AI feedback) in the MAP-Elites pipeline: semantic embedding feedback (Reimers & Gurevych, 2019). For this method, we used a 13B embedding model[1], based on the architecture described in Muennighoff (2022), with an asymmetric search setup to measure the distance between a generated text (document embedding) and a query embedding for a desired measure (e.g. "This is a positive opinion"). To compute a diversity measure that can be defined on an axis, we first obtain the cosine distances between the document embedding and each of two opposing attribute query embeddings. From this, we measure how close the document embedding is to one attribute compared to the other, yielding a single diversity measure normalized to the range [0, 1]. We use the same method for quality feedback, with a query that aims to measure the relevance of generated texts to a specific domain; the cosine similarity is used as the quality score, with negative values clipped to 0. Additional setup details are shown in Appendix A.28. Since the subjective quality of the resulting elites (of creative texts) is more informative of the method’s potential for practical synthetic data generation, we conducted a human study in addition to reporting QD score statistics as part of our results.
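To illustrate, below is a minimal numpy sketch of these two embedding-based measures, assuming document and query embeddings are already available as vectors (the 13B embedding model itself is not public, and the exact normalization of the diversity axis is our own reading of the description above):

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def embedding_diversity(doc_emb, query_a_emb, query_b_emb):
    # Place the document on a [0, 1] axis between two opposing attribute
    # queries (e.g. "This is a negative opinion" vs. "This is a positive
    # opinion") using cosine distances; 0 means closest to attribute A,
    # 1 means closest to attribute B. This linear share of the two
    # distances is our assumption.
    dist_a = 1.0 - cosine_sim(doc_emb, query_a_emb)
    dist_b = 1.0 - cosine_sim(doc_emb, query_b_emb)
    return dist_a / (dist_a + dist_b)

def embedding_quality(doc_emb, domain_query_emb):
    # Cosine similarity to a domain-relevance query, with negative
    # values clipped to 0, as described above.
    return max(0.0, cosine_sim(doc_emb, domain_query_emb))
```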


We display performance statistics from our runs (QD score) as well as human evaluation scores on elite samples in Table 2. Our human study shows that using AI feedback as the evaluator instead of semantic embedding feedback leads to subjectively better generations for every variation of the QD run setup. This is likely due to more prominent reward hacking (Skalse et al., 2022; Lehman et al., 2019) in runs using embedding feedback, where the highest-quality-score texts end up very similar to the query "An opinion piece about eating vegetables and plant-based foods" without optimizing for the subjective quality of texts in different bins. Qualitative analysis of human-evaluated sets of texts from QD with embedding feedback is shown in Tables 25–28. Furthermore, agreement between human and AI feedback on text diversity labels was slightly higher across QDAIF sets than across QD with embedding feedback (QDEF) sets. Overall, AI feedback outperforms semantic embedding feedback, both by guiding the generation of texts that humans prefer and by serving as a better evaluator for quality and diversity measures.


Figure 6: QD score plots for different LMX model sizes on the Stories - Genre domain. There is no clear trend in scaling model size with QD score.

A.3 ON SCALING LMS FOR MUTATION

Previous work has consistently shown that LMs demonstrate improved capabilities on various task-based benchmarks at larger scales (Kaplan et al., 2020; Chowdhery et al., 2022; Chung et al., 2022). This applies to performance in solving tasks through in-context learning, which LMX is based on. Prior work on LMX (Meyerson et al., 2023) observed a relationship between model scale and mutation performance when evolving binary strings in a search domain. Interestingly, experiments with LMX as well as ELM (Lehman et al., 2022) found that in some cases, mutation ability emerges for reasonably small LMs but does not scale with a clear trend.


Firstly, we show the QD score performance between runs in Figure 6, with the standard error across 5 seeds. We observed that the QD score from the 70B runs converged to a lower point than that of the 13B and 30B runs, which have comparable scores. This shows that QD score does not scale in direct proportion to model size. Although no suggestive scaling trend is seen here, a study based on subjective feedback is still necessary for a deeper understanding of each experiment's performance.


From the human feedback evaluation, we observed a trend in the quality ratings, with average quality scores of 3.43 (13B), 3.60 (30B), and 4.03 (70B). Quality increases with model size, with a larger jump in score for stories generated by the 70B runs.


In terms of agreement on the genres of evaluated stories between AI feedback and human feedback, we observed rates of 80.0% (13B), 73.3% (30B), and 73.3% (70B), a slight decrease for texts from the 30B and 70B runs. Evaluators differed more frequently in their labels on stories deemed "neutral" or "romance" by AI feedback, suggesting that the prompt pool in combination with larger models might lead to generated texts with misaligned AI genre evaluations. Still, these agreement rates indicate good alignment between AI and human feedback.


A.4 ON FEW-SHOT AI FEEDBACK PROMPTING

Instruction-following LMs are typically trained to align the model towards generating better answers to zero-shot prompts (Wei et al., 2021; Ouyang et al., 2022). However, few-shot prompting with exemplars was shown to be effective in some aspects with instruction-tuned LMs, especially towards understanding task structure and improving robustness to prompting variations (Wei et al., 2021).


In terms of the average human-evaluated quality, we see a drop in subjective quality for the set of stories from 2-shot AI feedback runs (3.10) in comparison to zero-shot AI feedback runs (3.43). We observed for the other sets that this score increases for the 4-shot set (3.93) and the 8-shot set (4.03). Furthermore, we see that this trend is mostly consistent when we consider the scores for each bin category. This suggests an improvement in QDAIF’s ability to discover texts that are perceived to be of higher quality according to human feedback when we use a higher number of in-context examples during AI feedback evaluation.


In terms of the agreement between AI feedback and human feedback, we see a drop in average agreement for ratings that were given to stories in the few-shot feedback experiment sets, with 50.0% (2-shot), 66.7% (4-shot), and 56.7% (8-shot) agreement on sets, compared to 80.0% for zero-shot (default). The level of disagreement occurs more frequently on stories evaluated to have the romance genre. Additionally, evaluators labeled samples from the few-shot sets as "horror" or "neutral" more frequently than "romance", while the proportion of labels given to samples from the zero-shot set was more uniform.


Performance may vary with the ordering of in-context examples in our few-shot prompts (Appendix A.29), as shown in Lu et al. (2021). Furthermore, the nature and wording of input-output exemplars/tasks could also influence performance, in addition to some variation inherent to subjective evaluation.

A.5 ON THE INITIALIZATION METHOD FOR QDAIF

Recent results from LMX demonstrated successful optimization when the search was initialized by evolving a set of pre-existing seed examples (e.g. examples of equations, quotes, text-to-image prompts, code) (Meyerson et al., 2023). At the same time, previous applications of QD methods demonstrated successful search outcomes when using random initialization in different domains, such as in robotics, and latent space illumination of generative models (Cully et al., 2015; Fontaine & Nikolaidis, 2021; Bhatt et al., 2022).


We compared QDAIF with Seeded Init against Zero-Shot Init, and against the baseline method LMX, Quality-Only initialized with Zero-Shot Init. From Figure 7, we can see a potential improvement in QD score when using Seeded Init compared to Zero-Shot Init, but the difference may not be significant. However, the difference is clear in the lower performance of LMX, Quality-Only with Zero-Shot Init. This suggests that Zero-Shot Init is viable for QDAIF. Still, we need to analyze the effect of initialization on qualitative samples of elite texts. We also compared the effects of initialization on the subjective quality-diversity of texts (see Table 3). Sets of texts discovered by Seeded Init within single runs were found to be subjectively higher-quality and more diverse than sets from Zero-Shot Init (0.772 vs. 0.383 subjective QD score). This suggests potential reward hacking (described for RL problems in Skalse et al. (2022); Lehman et al. (2019)) during the search in Zero-Shot Init runs, where potentially out-of-distribution texts can evolve to optimize the AI feedback quality score of solutions while remaining subjectively low-quality across bins.


A.6 ON THE MECHANISMS OF MUTATION FOR QDAIF

Prior work on ELM highlights the versatility of LMs in applying a variety of potential mutation methods, such as in the form of git diffs, or prompting to directly evolve text/code (Lehman et al., 2022). However, studying the potential effects of different LM-based mutation operators on population-level evolvability (focused on future creative potential (Lehman et al., 2016)) remains a challenge. Towards understanding the potential of different mutation methods in terms of population-level evolvability of niches, we compare LMX(-Near) (default) (as tested in Meyerson et al. (2023)) against LMX-Replace in terms of the resulting generated texts across different runs.


To observe the impact of different mutation methods on the dynamics of the population during search (along with the impact on performance), we implemented and tested an additional method called LMX-Replace. This method evolves few-shot prompts using a larger pool of prompt candidates than that of LMX(-Near) and carries out a slower mutation process by modifying only one few-shot prompt example (instead of all examples) during search iterations. We give a detailed comparison between the methods in Appendix A.26.
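As a rough illustration, a minimal sketch of the replace step follows; the choice of which slot to mutate and which pool candidate to swap in is simplified here to uniform sampling, which is our assumption (see Appendix A.26 for the paper's full comparison):

```python
import random

def lmx_replace_mutation(few_shot_examples, prompt_pool):
    # LMX-Replace: swap out a single in-context example, keeping the rest
    # of the few-shot prompt fixed (vs. LMX(-Near), which changes all
    # examples between iterations).
    mutated = list(few_shot_examples)
    slot = random.randrange(len(mutated))
    mutated[slot] = random.choice(prompt_pool)  # replace exactly one slot
    return mutated
```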


We observed an improvement to the resulting generated text sets from runs using LMX-Replace in comparison to LMX for the Zero-Shot Init case, according to human evaluation results highlighted in Table 3. We see that the subjective QD score and the quality rating from human evaluators were higher on average across the tested domains (and improvements within domain-by-domain comparisons also, highlighted in Table 4). Given that it may not be desirable to constrain the search with LMX-Near w/ Seeded Init, LMX-Replace can act as an alternative mutation method to steer the dynamics of population search towards subjectively improved outputs.


An archive depth was introduced for the creation and maintenance of a prompt pool for LMX-Replace (cf. Appendix A.27). Future work could explore the use of depth in MAP-Elites archives for domains where uncertainty (or subjectivity) influences the evaluation of solutions, as previously studied in Flageat & Cully (2020; 2023).

A.7 ON COVERAGE AND BEST QUALITY SOLUTIONS ACROSS DOMAINS

Figure 8 shows performance plots measuring the coverage of the archive (i.e. how many bins in the search space are filled with at least one solution). Figure 9 shows performance plots measuring the quality score of the best solution found across the whole search (existing in one of the defined bins).
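For reference, a sketch of how these statistics can be read off a MAP-Elites archive; the dictionary layout of the archive and the rescaling of QD score to be out of 100 (the convention used elsewhere in this appendix) are assumptions:

```python
def archive_stats(archive, num_bins, quality_max=1.0):
    # `archive` maps bin index -> (solution, quality) for the current elite.
    qualities = [q for _, q in archive.values()]
    coverage = len(archive) / num_bins                       # fraction of bins filled
    qd_score = 100.0 * sum(qualities) / (num_bins * quality_max)
    best_quality = max(qualities, default=0.0)               # best overall solution
    return coverage, qd_score, best_quality
```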


QDAIF often discovers the best overall solutions during search that have higher quality scores than the best overall solutions found from other methods. This is likely enabled by the goal-switching mechanism of QD approaches (Mouret & Clune, 2015; Gaier et al., 2019). Overall, QDAIF can jointly optimize the quality and diversity of solutions, outperforming other methods in most comparisons, and highlighting contributions to successful gains in QD score.


A.8 ON COMPARISONS BETWEEN QDAIF AND DIVERSITY-SEEKING BASELINES

We compared QDAIF against explicit diversity-seeking baseline methods. One baseline, LMX, ROUGE-L, is based on n-gram filtering with ROUGE-L similarity (Lin, 2004), as done in Self-Instruct (Wang et al., 2022a). Self-Instruct followed a similar approach of maintaining a few-shot prompt pool to generate diverse solutions (with diversity maintained by adding generated texts to the prompt pool only if sufficiently different from existing solutions based on n-gram matching), in their case for generating diverse instruction prompts toward a diverse, high-quality synthetic instruction-tuning dataset. Another baseline is based on Novelty Search (NS) (Lehman & Stanley, 2011a), a diversity-seeking algorithm proven to be more sample-efficient than single-objective approaches (the diversity-only equivalent of LMX, Quality-Only, which only optimizes for solution quality) in discovering desired solutions to problems with many local optima, and also studied in prior works introducing the QD illumination problem (i.e. improving QD score (Pugh et al., 2016)). Unlike LMX, ROUGE-L, NS aims to encourage diversity within an arbitrarily defined space of diversity. To make NS comparable with QDAIF, we introduce Novelty Search through AI Feedback (NSAIF), which seeks diversity in domains where diversity is defined by AI feedback axes. We denote this baseline as LMX, NSAIF. As with all methods, these baselines are initialized with the prompt pool specified in Appendix A.21. We describe the baseline implementations in more detail in Appendix A.9, including variants of these baselines with quality AI feedback (QAIF) filters as a minimum criterion for diversity search (Lehman & Stanley, 2010).


Results. Figure 10 shows QD score performance plots comparing QDAIF against diversity-seeking baselines, baselines that also aim for quality and diversity in solutions, and other baselines. QDAIF consistently outperforms diversity-seeking baselines across domains in addition to other baselines not focused on improving desired diversity in solutions. Yet, LMX, NSAIF tends to outperform other baselines that are not focused on improving solution diversity. On the Stories - Genre and Ending domain, LMX, NSAIF significantly outperforms many methods, even LMX, Quality-Only, which carries out single objective quality optimization (and successfully does so compared to other baselines, by maximizing the best solution quality alongside QDAIF, cf. Figure 12). This indicates that maintaining diversity in solutions is important for QD score performance in such domains (and improves search space coverage, cf. Figure 11). Furthermore, maintaining desired diversity, such as qualitative measures of diversity, is necessary for a method to improve QD score during search. LMX, ROUGE-L achieves low QD scores, and is within range of being the worst method for improving best solution quality (cf. Figure 12), in spite of the higher coverage achieved across domains compared to several baselines (cf. Figure 11). However, the performance for this baseline improves significantly with the introduction of the quality filter via QAIF. This indicates that maintaining high-quality solutions (not just diversity) is important. Still, adding the quality filter on top of LMX, NSAIF only helped more often compared to the default variant of NSAIF on the Stories - Genre and Ending domain, and shows no improvement on other domains. QDAIF (based on MAP-Elites) reliably improves QD score in tested domains and does so with higher sample efficiency (cf. Figure 13 and Figure 14). Results comparing QDAIF to additional diversity-seeking baselines (including ones with quality filters) highlight performance gains from our proposed QDAIF method, as well as the importance of seeking both quality and diversity.


A.9 SETUP FOR DIVERSITY-SEEKING BASELINES

LMX, ROUGE-L. Following Wang et al. (2022a), we set the threshold for the maximum allowed ROUGE-L similarity between generated texts and all texts in the prompt pool at any given time (from which few-shot examples are sampled for LMX generation) to 0.7, without limits on the prompt pool size. If the similarity exceeds this, the generated text is still logged for QD score evaluation, but rejected from being added to the prompt pool. Otherwise, the text is added to the prompt pool so that it can be used for LMX evolution during later samplings of few-shot prompts. The rest of the setup is comparable to QDAIF except for the MAP-Elites archive maintained by QDAIF.
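A sketch of this acceptance test using the rouge-score package (the same library family used by Self-Instruct); the tokenizer settings are our assumption:

```python
from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

def accept_into_pool(candidate, prompt_pool, threshold=0.7):
    # Reject the candidate from the prompt pool if its ROUGE-L similarity
    # to any existing pool text exceeds the 0.7 threshold; rejected texts
    # are still logged for QD score evaluation elsewhere.
    for existing in prompt_pool:
        if _scorer.score(existing, candidate)["rougeL"].fmeasure > threshold:
            return False
    return True
```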


LMX, NSAIF. In line with the high-level implementation of NS described in Lehman et al. (2008), we implemented the novelty (diversity-based) measure in the same way that QDAIF defines diversity measures: as the diversity attribute defined by the AI feedback axis (or axes) in the range [0, 1], with a Euclidean distance-based measure between the diversity attribute of the generated text and those of existing texts in the prompt pool. Novelty is computed as the mean distance between the diversity attribute values of the generated text and its k nearest neighbors in the prompt pool (with respect to the diversity attributes of the neighbor texts). We set k to 15, as tested in the original NS implementation. As in NS, the novelty score is compared to a novelty threshold (initially 0.05 in all LMX, NSAIF runs, the width between adjacent bin ticks for a unit range with 20 uniform bins); if the novelty of the generated text is higher than the threshold, the text is accepted into the prompt pool for evolution with LMX; otherwise, it is logged for evaluation of performance results but rejected from the prompt pool. We apply a dynamic adjustment of the novelty threshold, following NS: the threshold is multiplied by 1.05 (increased) if 3 solutions in a row were accepted to the prompt pool, or multiplied by 0.95 (decreased) if 21 solutions in a row were rejected during the search.
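The novelty computation and threshold schedule can be sketched as follows (a minimal version assuming diversity attributes are stored as numeric values or vectors; bookkeeping of consecutive accepts/rejects is left to the caller):

```python
import numpy as np

def novelty(candidate_attr, pool_attrs, k=15):
    # Mean Euclidean distance from the candidate's AI feedback diversity
    # attribute(s) to its k nearest neighbors in the prompt pool.
    pool = np.asarray(pool_attrs, dtype=float).reshape(len(pool_attrs), -1)
    cand = np.asarray(candidate_attr, dtype=float).reshape(1, -1)
    dists = np.sort(np.linalg.norm(pool - cand, axis=1))
    return float(dists[:k].mean())

def adjust_threshold(threshold, consecutive_accepts, consecutive_rejects):
    # Dynamic schedule described above: x1.05 after 3 accepts in a row,
    # x0.95 after 21 rejects in a row.
    if consecutive_accepts >= 3:
        return threshold * 1.05
    if consecutive_rejects >= 21:
        return threshold * 0.95
    return threshold
```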


In consideration of the non-linear calibration of AI feedback models in evaluating texts (cf. Appendix A.31), we apply a piecewise linear transformation to the original diversity attribute value from the AI feedback evaluation of diversity (which lies along an axis discretized by non-uniform binning) so that it lies instead along an axis discretized by uniform bins, preserving the number of bin intervals. For example, in our 20-bin setting, the input value 0.9975, which lies between bin ticks 0.995 and 1, would be transformed to an output value of 0.975, and the input value 0.35 (between bin ticks 0.2 and 0.5) would be transformed to an output value of 0.475. This enables novelty to be computed using Euclidean distances while preserving the AI feedback model's non-linear calibration in distinguishing subjectively similar solutions from diverse ones. Additionally, this keeps the definition of diversity consistent and fair with the setup of diversity measures for evaluation (in QDAIF iterations, and for baseline performance comparisons). The number of bin intervals defined for the piecewise linear transformation is the same as the default number of intervals across the Opinions and Stories domains (20 for 1D archives, 10 for 2D archives).
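This transformation is a standard piecewise linear interpolation between the two sets of bin ticks, as in the sketch below; the tick values shown are hypothetical, for illustration only (the actual non-uniform binning is given in Appendix A.31):

```python
import numpy as np

def uniformize(value, nonuniform_ticks):
    # Map a value from an axis discretized by non-uniform bin ticks onto
    # an axis with the same number of uniform bins, preserving the number
    # of bin intervals.
    n_intervals = len(nonuniform_ticks) - 1
    uniform_ticks = np.linspace(0.0, 1.0, n_intervals + 1)
    return float(np.interp(value, nonuniform_ticks, uniform_ticks))

# Hypothetical 5-interval axis: 0.35 sits halfway between ticks 0.2 and
# 0.5, so it maps halfway through the uniform interval [0.2, 0.4].
ticks = [0.0, 0.2, 0.5, 0.8, 0.995, 1.0]
print(uniformize(0.35, ticks))  # 0.3
```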


Quality AI Feedback Filtering. We implemented and assessed variants of the above baselines, LMX, ROUGE-L (w/ QAIF), and LMX, NSAIF (w/ QAIF), by adding a simple quality filter for accepting generated texts to prompt pools, based on quality AI feedback defined in the respective creative writing domains. This involves an additional step in each baseline method immediately after the diversity criteria assessment of solutions, where generated texts must also have quality score values above a minimum threshold before being added to the prompt pool. Thus, these variants become similar to QDAIF methods, where they aim for diverse, high-quality solutions. These baselines differ from our default QDAIF method with MAP-Elites in that the minimum quality thresholds are not a function of the quality of best solutions across individual bins, but are arbitrarily defined, to constrain the prompt pool to satisfy a minimum criterion for quality, as was introduced previously for diversity-seeking methods in Lehman & Stanley (2010). This would make the quality improvement process of solutions across all bins less sample-efficient. We define this simple quality threshold for the baselines to be fixed at 0.8, closer to the upper bound of the full quality range [0, 1]. This was determined based on the intuitions of the results from the human evaluation study we conducted (cf. Figure 5 in Appendix A.1), where a quality value of 0.8 from AI feedback corresponds to a generally high human feedback Likert (quality) score. As done in LMX, Quality-Only, these baselines (with quality filtering) limit the size of the prompt pool to be up to 100.

A.10 ON AUTOMATICALLY EXPANDING ARCHIVE DIMENSIONS

Prior work in QD, as described in Section 2.2, often relies on diversity measures that are designed and manually defined at the start of the search. One existing approach for more automatic QD search without supervised (defined) measures of diversity is to use unsupervised learning to represent diversity without relying on ground truth measures (Cully & Demiris, 2018; Cully, 2019; Grillotti & Cully, 2021; Wang et al., 2023b; Ding et al., 2023). Such unsupervised approaches do not embody any complex prior of what humans find interesting about diversity; an alternative approach would be to query capable LMs about what dimensions of diversity are interesting or important for a particular domain. In this way, semantically-rich axes of diversity could be automatically generated (and could then be evaluated automatically as well through other LM calls, as in QDAIF).


Given the advances in foundation model capabilities, we could reasonably prompt LMs (such as GPT-4 (OpenAI, 2023)) to come up with new axes of diversity in a more automated pipeline, potentially while the search is running: by giving the LM a description of the user's search problem and the existing diversity axes being searched (i.e. the existing AI feedback diversity prompt(s)), we could ask it for a different, previously unexplored diversity axis and define a new AI feedback prompt to add to the MAP-Elites evaluation. For example, in the Poetry domain, we could ask the LM to generate multiple diverse aspects of poetry (e.g. "Genre" and "Tone", as studied in the presented experiments), and also to come up with diverse categories defining this search space for QDAIF to search through with MAP-Elites. The effectiveness of this kind of approach has not been studied thoroughly in prior works, especially on the question of whether expanding the dimensions of diversity during the search can meaningfully improve the diversity of resulting solutions toward increasingly broader definitions of diversity.
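Purely as an illustration of this idea (the paper does not publish such a prompt; the wording, model choice, and client usage below are our own), such a query could look like:

```python
from openai import OpenAI

client = OpenAI()

def propose_new_diversity_axis(domain_description, existing_axes):
    # Ask an LM for a previously unexplored diversity axis, plus an AI
    # feedback prompt that could score texts along it. Illustrative only.
    prompt = (
        f"We are running a quality-diversity search over: {domain_description}.\n"
        f"Diversity axes already searched: {existing_axes}.\n"
        "Propose one different, previously unexplored axis of diversity, "
        "and an evaluation prompt that scores where a text falls on that axis."
    )
    response = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
```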


Setup. We took a step toward automating the definition of diversity axes by analyzing the effectiveness of creative search when the dimensions of the diversity axes are expanded during an intermediate iteration of an existing search, with performance measured by QD score (for a given ground-truth, higher-dimensional search space). We tested this approach in the Stories - Genre and Ending domain, comparing QD score (out of 100) as in other experiments across the following setups: 2D Archive Search (QDAIF with the full 2D diversity axes defined), 1D Archive Search (QDAIF with only a 1D diversity axis defined, run for both the Genre and Ending diversity archives), and expanding 1D to 2D Archive Search (starting QDAIF with either the Genre or Ending diversity axis defined, then adding the other axis as a second dimension mid-search). For the expanding setups, the second archive dimension is introduced at iteration 1000 of the search, out of 2000 iterations total. We also assessed the case where, instead of adding an extra dimension with a new diversity axis during search, QDAIF transitions from an initial 1D archive (e.g. for Ending) to a different 1D archive (e.g. for Genre); in our experiments, this transition is likewise carried out at iteration 1000 out of 2000 iterations.
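One way the expansion step at iteration 1000 could be implemented is sketched below; how existing elites are migrated into the expanded archive is not detailed in the paper, so the re-binning of every elite on both axes is our assumption (note that bin counts also differ between 1D and 2D archives, 20 vs. 10 per axis):

```python
def expand_archive_1d_to_2d(archive_1d, first_axis_measure, second_axis_measure,
                            bins_per_axis=10):
    # Re-evaluate every existing elite on both diversity axes and insert
    # it into the 2D archive, keeping the highest-quality solution per bin.
    def to_bin(value):
        return min(int(value * bins_per_axis), bins_per_axis - 1)

    archive_2d = {}
    for solution, quality in archive_1d.values():
        key = (to_bin(first_axis_measure(solution)),
               to_bin(second_axis_measure(solution)))
        if key not in archive_2d or archive_2d[key][1] < quality:
            archive_2d[key] = (solution, quality)
    return archive_2d
```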


Findings. Figure 15 shows performance plots for all the settings described in this section, and Figure 16 shows QD score line plots visualizing sample efficiency differences between the different setups of automatically expanding dimensions, especially after iteration 1000. We found that in all tested cases of expanding from 1D to 2D archives, the improvement in QD score for the higher-dimensional archive is significant compared to searching with only a 1D diversity axis and evaluating the resulting solutions with both AI feedback diversity measures. Furthermore, we found that this diversity axes expansion can approach the QD score achieved by a full 2D Archive Search. This level of improvement hints at the potential of prompting LMs to generate diversity measures for more autonomous, creative search, and highlights the value of scaling up this approach in future work.


Expansion of diversity axes is promising as an approach to improve the quality and diversity of solutions without needing to initialize a high-dimensional archive from the beginning (in the case where diversity axes are set only manually). Vassiliades et al. (2016) found, in experiments with CVT-MAP-Elites (a method that defines bins from uniformly spaced centroids in very high-dimensional diversity spaces, addressing the compute requirements of standard MAP-Elites in this case), that standard MAP-Elites is impeded in further improving the fitness (quality) of solutions in non-empty bins, because filling empty bins occurs much more frequently when the increase in dimensionality creates more bins. The quality of existing non-empty bins normally requires several iterations of improvement before reaching more optimal quality scores for the given bin or niche. We show that this is also the case in Figure 15, in the third plot on best solution quality: the quality of the best overall solution at the end of the search is more often higher for the 1D-to-2D expansion settings than when the 2D archive is initialized from scratch at the start of the search. Searching only in the 1D archives also more often yields higher best solution quality than any search in the higher-dimensional archive. Expansion of dimensions thus seems to deliver a good trade-off of slightly decreased best solution quality for significant improvements in solution diversity. This balance is quite useful for improving the sample efficiency of the search, given that Mouret & Clune (2015) found the discovery and maintenance of both diverse and high-quality solutions to be important for enabling the ongoing search to improve the quality and diversity of newly generated solutions even more quickly.


Additionally, we studied performance comparisons for the case of transitioning between different 1D diversity axes (a different approach to automatic search over dimensions of diversity, compared to the expanding-dimensions setup). Figure 15 also shows performance in this case, and Figure 17 shows QD score line plots, with performance differences visible after iteration 1000. In one case (1D (Ending) to 1D (Genre)), a significant improvement in QD score is visible for the archive transition compared to searching in only one (initialized) diversity axis throughout the whole search; this setup also approaches the performance of searching in the higher-dimensional 2D archive. In the other case evaluated (1D (Genre) to 1D (Ending)), no notable improvements were seen compared to conducting QDAIF in the single 1D (Genre) archive. Even though these results show that the performance of QDAIF in the transitioning 1D archives case is sensitive to the diversity axes searched (and the order in which the transitions occur), this highlights another promising approach to automatically adjusting diversity axes given AI feedback prompts generated by LMs, alongside the expanding-dimensions setup. This is especially relevant when we want to scale up QDAIF to search automatically through an even higher number of diversity axes, where we can lower the computational requirements of searching in lower dimensions (i.e. a lower total number of bins) and mitigate the presented challenges of conducting MAP-Elites search iterations in very high-dimensional archives (discussed in the previous paragraph), while maintaining the promising performance improvements of having QDAIF explore a growing number of different diversity axes. Future research can explore designing a curriculum for QDAIF to search through different diversity axes, an open challenge for improving the quality and diversity of creative texts with a potentially unbounded number of subjective dimensions of diversity, depending on individual personal perspectives.


Overall, this method of automatic expansion and adaptation of diversity axes introduces a new way of balancing the trade-off between improving quality and improving diversity in solutions, especially in settings where the dimensions of desired diversity, while sometimes not obvious to the user until later realization, do not reach the hundreds as in the problems studied with CVT-MAP-Elites. In the large-scale higher-dimensional case, adaptive transitioning of diversity axes is another promising direction. We can leverage this finding in future work to more confidently apply the aid of LMs in automatically generating diversity axes for QDAIF search.


A.11 ON FINETUNING MUTATION MODELS

Prior work on ELM found that using a domain-specific finetuned git diff LM during code generation search led to runs with higher QD score performance than using a pre-trained git diff LM (Lehman et al., 2022). The finetuning step relied on a dataset of filtered solutions across several previous runs, adding complexity to the process of creating an effective model for evolving text. To simplify the process, we investigated finetuning on sampled solutions within the same run, during a search that begins with solution generation from a pre-trained LM. We explore the impact of finetuning as a mechanism that can potentially encourage exploitation (by learning to generate higher-quality solutions) during the search, while using evolution to encourage exploration.


To collect samples for finetuning during the search, the first step is building a dataset for a given state of the archive, sampling up to 10 solutions with the highest quality score from each bin. Archive bin depth was used to keep track of these finetuning samples (cf. Appendix A.27), a mechanism introduced in a different application of MAP-Elites for uncertain environments (Flageat & Cully, 2020; 2023). The dataset is shuffled, and training samples are then batched for the LM to finetune on during this phase. Each training sample consists of the prompt originally used by a solution to generate an output, and the generated completion text as target tokens for finetuning. We tested both full-model finetuning and a parameter-efficient method with sequential adapters, as described in He et al. (2021). We ran each method using QDAIF LMX(-Near) w/ Seeded Init (including default method runs without finetuning) on the Stories (Genre and Ending) domain, extending the number of archive dimensions to two. For experiments with adapters, we also conducted runs where the adapter layers are initialized but no finetuning is done, to enable comparisons with adapter finetuning runs that account for differences in LM architecture introduced by the additional layers. Still, near-identity initialization (as described in Houlsby et al. (2019)) preserves the general performance of the LM.
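A sketch of the dataset-collection step, assuming each archive bin stores a depth-limited list of solutions as dictionaries with "prompt", "completion", and "quality" fields (the field names are our assumption):

```python
import random

def collect_finetuning_dataset(archive_with_depth, per_bin=10):
    # Take up to the 10 highest-quality solutions from each bin, then
    # shuffle into (prompt, completion) training pairs.
    dataset = []
    for solutions in archive_with_depth.values():
        top = sorted(solutions, key=lambda s: s["quality"], reverse=True)[:per_bin]
        dataset.extend((s["prompt"], s["completion"]) for s in top)
    random.shuffle(dataset)
    return dataset
```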


Finetune-Once. This method runs a single finetuning phase during the search before resuming generation with the finetuned LM. We vary two parameters during experiments with Finetune-Once: the iteration step to start the phase (Start), and the number of finetuning steps during this phase (Steps).


Generate-Finetune. This method extends Finetune-Once by carrying out the finetuning phase multiple times during the search, repeating the steps of dataset collection and finetuning at regular intervals. We add an additional variable parameter, controlling the regular interval frequency for every set number of generation iterations (Frequency).
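The difference between the two schedules amounts to when the finetuning phase triggers, as in this sketch (the Start and Frequency defaults are placeholders; the paper varies these per experiment):

```python
def should_finetune(iteration, mode, start=1000, frequency=500):
    # Finetune-Once: a single finetuning phase at `start`.
    # Generate-Finetune: repeat the phase every `frequency` iterations.
    if mode == "finetune_once":
        return iteration == start
    if mode == "generate_finetune":
        return iteration >= start and (iteration - start) % frequency == 0
    return False
```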


Observations. We compared the QD score performance between methods across 5 fixed random-seed runs. In general, the performance of the Finetune-Once methods with higher scores is comparable to default runs, though slightly lower (but mostly within the confidence interval). For Finetune-Once runs, we observed a potential decrease in performance when using full-model finetuning rather than adapter finetuning, relative to conducting these runs without finetuning, with an average −3.09 difference in QD score for full-model finetuning compared to a −0.33 difference for adapter finetuning. It is possible that overfitting on solution examples is more likely for full-model finetuning, which would reduce the LM's ability to evolve new solutions that are accepted into the archive. When we used adapter finetuning as the default for other experiments, we observed a potential decrease in performance and increased variance across multiple seeds compared to the default method without finetuning. The increased variance also indicates potential improvements to the search with Finetune-Once in some cases, but further studies are needed on the behavior of the search when finetuning on different archive states and with different constraints on the samples collected for finetuning. The adverse effects of finetuning are clearer in our Generate-Finetune runs. Considering the 95% confidence interval, the highest mean QD score run we tested from the Generate-Finetune group obtained a performance score of 78.74 ± 5.10, compared to 84.09 ± 1.91 from the default generate-only method. With most Generate-Finetune runs converging to a lower QD score earlier in the search, it is likely that repeatedly scheduling the finetuning phase leads to overfitting of the LM to the point where the model fails to generate new solutions to add to the archive.

A.12 HUMAN EVALUATION STUDY STATS IN WRITING DOMAINS

We collected the annotation results from our study to show the performance of each search method on the tested writing domains, displayed in Table 4.


A.13 ON THE DIFFERENCES IN ELITE TEXTS ACROSS DOMAINS FROM QDAIF

We referred to LMX-Near (Meyerson et al., 2023) as LMX in the main text for brevity. For more detailed analysis with different mutation operators, we refer to each method by its full name. A summary of human evaluation statistics based on the samples described below can be found in Appendix A.12.


Opinions. We report the results from evaluated sets of generated texts from our human study, starting with the Opinions domain in Appendix Tables 29–32 (with qualitative descriptions of the evaluated generated texts in the table captions). Tables are organized in the following order, for each domain: LMX-Near w/ Seeded Init (default); LMX-Near w/ Zero-Shot Init; LMX-Replace w/ Seeded Init; and LMX-Replace w/ Zero-Shot Init. Qualitatively, we can see that repetition of phrases appears more often in the samples of generated elite texts from runs using LMX-Near, especially when using Zero-Shot Init. For LMX-Near w/ Seeded Init (Table 29), there is frequent repetition of phrases like "I would rather eat scrambled eggs" (third row, bin 9), and "At a restaurant" (fifth row, bin 19). For LMX-Near w/ Zero-Shot Init (Table 30), there is further repetition of undesired phrases like "Below is a random opinion piece" and "Here is a random opinion piece" in all examples. On the other hand, the generated texts from LMX-Replace (Tables 31 and 32 for the Seeded Init (default) and Zero-Shot Init methods respectively) lack this output artifact, and received higher quality scores from human feedback overall. Furthermore, texts from LMX-Replace w/ Zero-Shot Init received the highest quality score (from humans) compared to the other three methods here, while texts from LMX-Near w/ Zero-Shot Init received the lowest quality scores.


Stories - Genre. For this domain (with evaluated sets presented in Tables 33–36), evaluators found the subjective quality of texts from runs using Seeded Init (default) to be higher than from runs using Zero-Shot Init. The stories generated with Zero-Shot Init were found to be low in quality due to attributes such as erroneous titles and a text style that fails to reflect what is expected of a plausible short story. Furthermore, the elite stories from the Zero-Shot Init sets were more likely to lead to disagreements on genre between AI feedback and human feedback, with neutral labels given more frequently even for texts at some of the extreme ends of the bins. Given a more open-ended generation task and a narrow space of desired diversity (focused on the spectrum between two genres), LMX-Near w/ Seeded Init (default) (Table 33) was the most successful method in finding a story for the horror genre niche in Bin 0 (with a low quality score of 2 from evaluators). Other methods either failed to discover a story for this niche, or received an even lower quality score from human feedback.


Stories - Ending. For this domain (with evaluated sets presented in Tables 37–40), the stories received higher quality scores across sets than those from the Stories - Genre domain, indicating a potentially easier search space when finding stories with different kinds of endings. Still, methods using Seeded Init produced sets of stories that received higher quality scores from evaluators than sets from Zero-Shot Init (especially LMX-Replace w/ Zero-Shot Init in Table 40, where the presence of erroneous titles led to lower subjective quality). In spite of the lack of guidance from hand-written prompt examples during initialization, LMX-Near w/ Zero-Shot Init (Table 38) managed to produce reasonable stories of above-average quality covering different ending niches.

A.14 ON THE EVOLUTION OF GENERATED TEXTS OVER ITERATIONS FROM QDAIF

We describe the evolution of texts from QDAIF with LMX at different search iterations in Tables 41–52. One key factor that influences the search is the initialization method: generations in early iterations from Zero-Shot Init methods frequently contain elements that are subjectively different from the hand-written examples in the seed texts (Appendix A.23). For example, erroneous URLs (Tables 41, 43, 51) and titles (Tables 45, 49) appear when the distribution of possible outputs is not constrained by seeded in-context examples. This kind of method enables further exploration of output samples, potentially useful in the search for interesting, diverse creative texts. At the same time, more constraints from quality assessment with AI feedback are required to steer the evolving population toward high-quality, diverse texts. In several cases, outputs from these methods in later iterations show fewer such artifacts than early-iteration outputs (e.g. Table 51), but artifacts can be missed when AI feedback evaluation is at times misaligned with human preferences (e.g. Table 45). Still, using Seeded Init does not guarantee enough guidance to completely remove undesired features from generations in later iterations; for example, texts with repetitive phrases can be seen during later iterations in the Opinions domain from methods using Seeded Init (see Tables 42 and 44). In general, Seeded Init runs are more likely than Zero-Shot Init runs to lead to high-quality texts across niches, especially according to human evaluation (see Table 1) and especially for more challenging, open-ended domains such as story-writing. Furthermore, an increased likelihood of subjective reward hacking is a side-effect of using Zero-Shot Init. Still, QDAIF can work well with Zero-Shot Init, especially in combination with LMX-Replace, as shown in the results from the Opinions domain experiments.

A.15 ON THE DIFFERENCES IN ELITE TEXTS ACROSS DOMAINS FROM BASELINE METHODS

Evaluated sets are presented in Tables 13–16, Tables 17–20, and Tables 21–24 for the Opinions, Stories - Genre, and Stories - Ending domains respectively. Fixed-Few-Shot and Shuffling-Few-Shot consistently adhere to the style and structure of the seeded examples. Random-Search exhibits more variability, with discrepancies between human and AI feedback most evident in the Opinions domain, and with challenges in story consistency and character inclusion. LMX, Quality-Only further highlights these discrepancies, especially where repetitive or contradictory opinions are concerned. Across narratives, LMX, Quality-Only tends to miss character details or produce underdeveloped storylines, despite sometimes receiving high AI feedback. A recurring theme across all baselines is the differential perception of quality between human and AI feedback, with repetitive narratives, character relevance, and development being central points of contention.

A.16 ON THE EVOLUTION OF GENERATED TEXTS OVER ITERATIONS FROM BASELINE METHODS

Qualitatively, both Fixed-Few-Shot and Shuffling-Few-Shot consistently replicate concepts and expressions from the few-shot examples, at times directly copying entire segments (Tables 53, 57, 54, and 58). This suggests a potential over-reliance on the few-shot examples, hindering the generation of diverse solutions. In contrast, Random-Search aims to foster diversity by retaining all entries into the pool. However, this approach may unintentionally impede optimization for higher fitness solutions. Notably, fitness values of generated entries using Random-Search in later iterations persistently fall short (Tables 55 and 59). The crux of this issue seems to arise from the strategy’s indiscriminate inclusion of entries, even those with low fitness. Consequently, when these low-fitness examples are integrated into few-shot prompts, they influence the generation of subsequent entries, often resulting in similarly suboptimal outcomes. Lastly, LMX, Quality-Only introduces a fitness-centric approach, retaining only those solutions with the highest fitness. While entries from its early iterations display a diverse range of phenotype and fitness values, later iterations, with a noticeable uptick in fitness scores, tend to use similar phrases and writing styles (Tables 56 and 60). This suggests that stringently prioritizing high-fitness solutions may prevent the exploration of more varied or potentially better solutions.

A.17 ON THE USE OF DIFFERENT VERSIONS AND TYPES OF MODELS FOR POETRY

We show QD score line plots comparing QDAIF against other methods in Figure 18. We noted that in addition to the improvement in QD score from QDAIF, CI gaps from experiments using GPT-4 were qualitatively wider, as shown in the line plots. This suggests the potential of using different models for prompted variation of solutions in creative domains, with the possibility that certain models are more capable of generating slightly more diverse solutions (that may better cover the space of possible solutions) (cf. Appendix A.18 on differences in rewriting behavior between models). Although QDAIF significantly improves on relevant non-QD methods through the evolution of solutions via rewriting, the challenge remains of uncovering all possible solutions in the diversity space of interest (e.g. empty bins in the experiment archive, or categories that could further differentiate subjectively unique solutions); it is difficult for existing models, without explicit guidance on the desired poem categories, to know that these exist in the space of possibilities, much as the average person may not know the most distinct variation that can be applied when rewriting poems (from the wide space of possible poetry). This issue manifests even in simple domains such as random number generation (Renda et al., 2023). Coverage is possible with methods that guide LMs toward the desired categories (cf. Figure 19), with the added benefit of grounding on poems of interest through seed parent poems (cf. Appendix A.18), but future work is needed to create systems that can successfully explore solutions beyond what is currently known or defined (i.e. through limited genre and tone categories) during the search at hand (Zhang et al., 2023), perhaps even discovering new types of poems unlike those written by human poets.


Figure 20 compares the performance of different models used for generation and rewriting with each search method. We found higher CI gaps from runs using GPT-4 with Random-Poems, Fixed Seed Rewrite, and QDAIF (LMX-rewrite), in comparison to GPT-3.5-Turbo. The difference in rewriting behavior between GPT-4 and GPT-3.5-Turbo may contribute to variation in performance predictability during QDAIF (cf. Appendix A.18). Additionally, GPT-4 more often gives higher QD score performance in some cases (Random-Poems), while GPT-3.5-Turbo can improve sample efficiency during search in other cases (Fixed Seed Rewrite, Targeted-Poems). Interestingly, improvements in sample efficiency from GPT-3.5-Turbo may not translate fully to support interesting discovery of new solutions in poetry during QDAIF (cf. Figure 23). Results further highlight the need to investigate the potential of different models for solution variation in QDAIF.


We noted from runs in the Poetry domain using an older version of GPT-4 (April 2023) that QDAIF saw potential improvement in performance compared to baseline methods. There is a risk of variation in performance and search behavior due to changing model versions; the impact would be on the generation, rewriting, and evaluation of poems. We observed in Figure 21 that QDAIF successfully populates the archive with high-quality poems, leaving only three bins empty. In contrast, Random-Poems (Baseline) only fills the archive sparsely, demonstrating that asking for creative output from a single LM prompt often results in a lack of diversity. Furthermore, more bins remain empty in the archive generated by Targeted-Poems (Targeted Baseline). Overall, QDAIF achieved higher QD scores in experiments with older models and covers more bins in the archive than Random-Poems and Targeted-Poems, as can be seen in Figure 21. This hints at another potential benefit of QDAIF with LMX-guided, where a combination of instructed guidance and evolution could also lead to improvements in QD score with some models.



A.18 ON THE EVOLUTION OF POEMS THROUGH INSPIRED REWRITING

Instead of generating poems completely from scratch, users may generally prefer to have models rewrite their draft poems. Figure 22 shows an example of the capabilities of GPT-4 using LMX-guided, starting from the seed poem and continuing a chain of evolution through high-quality poems across different genres and tones. The evolving poetry chain shown in the figure is the longest continuous one discovered during the search, where the most repeated rewrites occurred starting from the seed poem. Rewrites with the model qualitatively gave meaningful variations on poems, transferring connective imagery with twists in rhythm and connotation depending on the target genres and tones. At the same time, we found that even when GPT-4 is specifically prompted to craft a poem of a particular genre and tone, subsequent evaluation using the same LM does not always deem it as having the targeted genre or tone. This underscores a widely recognized gap between text generation and discrimination (Saunders et al., 2022, Page 12). This is clear from several of the poems classified as hymns but containing multiple 5-line verses similar in style to a limerick (while not being a typical single-verse limerick); it was likely the references to religious imagery in later verses, as well as the multi-verse structure, that influenced the evaluation toward the closest genre.


Figure 23 shows another example of a chain from using GPT-3.5-Turbo. We found qualitatively that using GPT-4 led to more interesting variations of new poems during rewriting in comparison to GPT-3.5-Turbo rewrites. This is most evident in the repetition of "In fields of emerald, a gentle sway" at the start of each poem following the mysterious sonnet.


Future research is needed to study the behavior of chained rewriting over stepping stones with current foundation models (Nguyen et al., 2015b; 2016). Secretan et al. (2011) found from user studies of Picbreeder, a tool for evolving images with humans in the loop, that accumulating divergent chains of solutions of growing complexity from different users (carrying out the evolution of diverse images) is important for a search focused on discovering meaningfully interesting and diverse artifacts. Gaier et al. (2019) validated the potential of QD approaches in enabling the discovery of intermediate solutions that overcome the challenge of escaping undesired local minima (missing promising trajectories to desired solutions) faced by objective-based search (Stanley & Lehman, 2015). By leveraging goal-switching (Mouret & Clune, 2015), QD search was able to maintain enough diversity in the population to enable the discovery of solutions that appear more like targets of interest, significantly outperforming a single-objective optimization approach (which generated solutions that resembled primitive patterns unlike the targets) (Gaier, 2020, Chapter 4). This property of QD search is especially important in solution spaces where even slight perturbations in solutions can lead to significant (and often unexpected) qualitative changes in solution properties. This can apply to representations from Compositional Pattern Producing Networks (CPPNs) (Stanley, 2007), and to text representations themselves, where minor changes to certain parts of a text can lead to significant changes in passage tone, imagery, or even functionality of code (Lehman et al., 2022). Properties of chained divergence as demonstrated in the Poetry domain could inspire further directions in designing new systems with QDAIF.



A.19 EXAMPLES FROM THE POETRY DOMAIN

Figures 24, 25, and 26 show some generated poems of different quality, genres, and tones. Noticeably, hymns, limericks, ballads, and sonnets that are rated as higher quality tend to be longer, and to align more closely with the defining characteristics of their respective genres. For example, the hymn evaluated at 9/10 quality has phrases like "celestial delight", "under heaven’s vault", and "seek divine in the ordinary’s course", showcasing a worshipper’s perspective towards divinity, a trait more pronounced than in the 8/10-rated hymn (Figure 24). The sonnet with a 9/10 quality rating demonstrates a more consistent rhyme scheme, where all the end words in the same quatrain rhyme, compared to the one rated as 7/10 quality (Figure 26).


A.20 ON APPLICATIONS OF QDAIF BEYOND CREATIVE WRITING

Our experiments in the Opinions, Stories, and Poetry domains introduce QDAIF as a viable approach for conducting creative search in subjective domains (e.g. creative writing). In these domains, applying QD search to improve the quality and diversity of solutions was not investigated in prior work because such solutions were infeasible to measure with previous QD approaches; AI feedback (replacing expensive human feedback) is seemingly the clearest option for evaluating subjective solutions.


Yet we also see the potential of QDAIF in domains beyond creative writing, especially where coming up with solutions requires more creative brainstorming and exploration of diverse candidates, and where assessment is more intuitive through natural language.


We tested QDAIF (LMX-rewrite) in the (Python) Code domain, to understand how search can explore diverse solutions of interest for programming problems. We compared the performance of QDAIF to Random-Code, a simple adaptation of Random-Poems that requests a solution to an instruction prompt directly, while QDAIF evolves solutions by rewriting existing code solutions. The experiment setup applies the settings described in Appendix A.30, except for the AI feedback prompts (cf. Figure 32), the prompts for generating solutions (cf. Figure 33), and the seed solution for QDAIF (cf. Figure 34). The aim of the Code domain is to generate code that is diverse in terms of "difficulty" (of readability) as well as "efficiency". The "difficulty" labels are ["easy to read", "moderate abstraction", "highly optimized"] (with the intuition that highly optimized code is often more difficult to understand fully), while the "efficiency" labels are ["runtime", "balanced", "memory"] (based on the aspect of efficiency that the algorithm is more optimized for). Furthermore, we set the number of total iterations to 100, and the initialization iterations of QDAIF to 5. We based the domain problem on a task from the HumanEval benchmark (Chen et al., 2021), specifically problem number 88, where the aim is to implement a sorting algorithm whose sort order is conditional on the properties of an unsorted list of non-negative integers. To understand the potential of each method in solving coding problems more creatively, we specified in the generation prompts not to implement solutions with the built-in Python functions "sort" and "sorted", forcing the LM to come up with algorithms from scratch. The motivation for this is that we want to understand the potential of each approach in solving problems creatively, especially when viable (or subjectively preferred) solutions are not yet known in more complex domains; this constraint is one way to make the task more challenging in the Code domain, while encouraging a variety of approaches to sorting.
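For concreteness, the sketch below illustrates the kind of solution this setup asks for: HumanEval No. 88 requires returning a sorted copy of the list, ascending if the sum of the first and last elements is odd, and descending otherwise. The function is a hand-written illustration in the spirit of the most commonly generated approach (bubble sort), not an actual sample from our runs.

```python
# Illustrative solution to HumanEval No. 88 under our constraint of not using
# Python's built-in `sort`/`sorted`; a hand-written sketch, not a sampled output.
def sort_array(array):
    if not array:
        return []
    result = list(array)  # work on a copy, leaving the input untouched
    ascending = (array[0] + array[-1]) % 2 == 1
    n = len(result)
    for i in range(n):  # classic bubble sort: O(n^2) comparisons
        for j in range(n - i - 1):
            out_of_order = (
                result[j] > result[j + 1] if ascending else result[j] < result[j + 1]
            )
            if out_of_order:
                result[j], result[j + 1] = result[j + 1], result[j]
    return result

assert sort_array([2, 4, 3, 0, 1, 5]) == [0, 1, 2, 3, 4, 5]        # 2+5 is odd
assert sort_array([2, 4, 3, 0, 1, 5, 6]) == [6, 5, 4, 3, 2, 1, 0]  # 2+6 is even
```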


We assessed the search performance of both methods. QDAIF achieved a QD score of 47 (out of 90), while Random-Code achieved a lower QD score of 43 (out of 90). Sample efficiency with QDAIF is also higher in contrast to Random-Code (cf. Figure 27). Both methods fill 6 out of 9 defined bins with solutions. Interestingly, neither method could find solutions that are "highly optimized", indicating a more general bias of the LM towards generating code that is more readable, even when asked to rewrite code following a different approach. When we tested QDAIF’s generated code solutions on HumanEval No. 88, 77 out of 100 solutions passed. For solutions from Random-Code, 78 out of 100 passed. This shows that both methods quite often generate valid, correct solutions to the task, while relying purely on sorting algorithms implemented from scratch rather than the built-in Python sorting functions.


Table 5 shows the number of times QDAIF and Random-Code generated solutions using specific types of sorting algorithms (bubble, insertion, quick, selection, merge). In terms of diversity assessed by how often different types of sorting algorithms are implemented to solve this task, QDAIF returns noticeably more variety in the kinds of solutions it discovered, with 53% of solutions employing sorting algorithms other than the most commonly generated bubble sort approach (in contrast to Random-Code, where only 5% of solutions applied non-bubble-sort approaches). The variety of sorting algorithms found by QDAIF includes insertion sort, quick sort, selection sort, and merge sort. Figures 28–31 show samples of elite code solutions from QDAIF and Random-Code.
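As a rough illustration, a tally like Table 5 can be approximated with simple keyword matching over the generated solutions; the sketch below is only an assumed counting procedure (the categorization could equally be done by manual inspection of each solution).

```python
# A minimal sketch for tallying sorting-algorithm types across generated
# solutions via keyword matching; an illustrative approximation only.
from collections import Counter

ALGORITHMS = ["bubble", "insertion", "quick", "selection", "merge"]

def tally_algorithms(solutions):
    counts = Counter()
    for code in solutions:
        text = code.lower().replace("_", " ")  # also match names like `bubble_sort`
        for name in ALGORITHMS:
            if name in text:
                counts[name + " sort"] += 1
                break  # attribute each solution to the first matching algorithm
    return counts
```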


The results here highlight yet again a significant open challenge in aiming for diverse, high-quality solutions from (especially RLHF) models (Kirk et al., 2023), as well as the improvement to diversity introduced by QDAIF approaches, and the potential of building on top of this method of search, even for domains beyond creative writing (Pourcel et al., 2023).


A.21 SEED POOL INIT TEXTS USED FOR LMX WITH SEEDED INIT

For the Opinions and Stories domains, we use the hand-written seed texts specified below for Seeded Init.


Opinions:


• Plant-based foods are a great source of healthy micronutrients, and can play a significant role in providing you with macronutrients also. I would highly recommend including many different foods such as vegetables and pulses in your regular diet.


• Vegetables taste quite bad, and I don’t like eating them. I would much prefer eating meat and ice cream.


• I do not have an opinion on eating vegetables and other plant-based foods. I know that some people prefer a vegetarian or vegan diet, and others prefer eating meaty diets.


Stories:


• A spy named Joanne wants to infiltrate the premises of Karl Johnson, a highly-influential figure in the city. Karl was a wealthy mayor, and would do anything in his power to suppress any opposing voices. Joanne wanted to figure out what Karl was hiding, but she took a turn for the worse, as she was highly suspicious in her presence outside his home.


• The wealthy entrepreneur and member of parliament, Susan, hosted a party at her mansion. She invited all of the residents, as well as an unusual looking man. The man, Dave, was wearing a tacky shirt, and star-shaped glasses, and was actually a spy. He made the whole room laugh with his jokes, and had a secret agenda - to find what Susan does in her private fun room!


• The rich politician, Tom’s life took a turn for the worst - he feared all of his close aides all of a sudden after sensing danger in his clique. There was a civil war going on, and he feared for his life. One day, one of his security guards, turned secret agent, decided to sneak into the classified files room, and spied on Johnny, who was in the room. He wanted to find Johnny’s weakness, and strike at the right time.


Few-shot AI feedback experiments, scaling experiments:


• In a world of power and intrigue, a rich politician and a suspicious spy danced an intricate tango. The politician, cloaked in charm, wielded influence with ease. The spy, a shadow in the night, sought to uncover hidden truths. Their paths collided, secrets entwined. Each step they took, a calculated move. Whispers and coded messages fueled the chase. The politician, paranoid, set traps. The spy, cunning, slipped through. In the end, the politician’s empire crumbled, exposed by the spy’s relentless pursuit. As the dust settled, the spy faded into anonymity, a silent reminder of justice prevailing over deceit.


• Hidden behind the façade of power, the rich politician’s heart yearned for more. In the shadows, a suspicious spy observed, their eyes locked in a fateful encounter. Whispers of intrigue and danger wove their way around them, yet an unspoken connection drew them closer. Amidst the chaos of their worlds, stolen glances and forbidden meetings ignited a passionate flame. Love blossomed, transcending boundaries and defying expectations. In the clandestine dance of romance, they found solace, knowing that their love was a rebellion against the forces that sought to keep them apart. Together, they forged a destiny unbound by politics and secrecy.


• In the haunting town of Shadowscape, a sinister politician named Senator Blackwood reigned with an iron fist, his wealth and power shrouded in darkness. Unbeknownst to him, a chilling spy known as Agent Nightshade stalked his every move. Nightshade, with an uncanny ability to blend into the shadows, discovered the senator’s sinister rituals and connections to the occult. As Nightshade delved deeper, Blackwood’s true form emerged—an eldritch abomination feeding on the souls of the innocent. The spy’s heart froze in terror, realizing the horror that lurked behind the politician’s charming facade. In a macabre twist, Nightshade became trapped, forever tormented by the monstrous secrets he unearthed.

A.22 PROMPTS USED FOR MUTATION MODEL

As a modification to the prompting style used in the LMX experiments of prior work (Meyerson et al., 2023), we introduce structured prompting, with an instruction-text prefix placed before the few-shot examples of domain texts. This string pattern, together with the demonstration examples, forms the prompt from which the model predicts a new text example for the desired domain.
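The sketch below illustrates this prompt structure; the instruction prefix, the "###" delimiter, and the example texts are illustrative placeholders rather than our exact strings (though the stop sequences in Appendix A.27 suggest "#"-style delimiters).

```python
# A minimal sketch of the structured few-shot mutation prompt: an instruction
# prefix, followed by demonstration examples, ending with an open slot that the
# LM completes with a new domain text. All strings here are placeholders.
def build_lmx_prompt(instruction, examples):
    parts = [instruction]
    for i, text in enumerate(examples, start=1):
        parts.append(f"### Example {i}\n{text}")
    # End with the header of the next example so the completion is the mutation.
    parts.append(f"### Example {len(examples) + 1}\n")
    return "\n".join(parts)

prompt = build_lmx_prompt(
    "Here are some opinion pieces about eating vegetables and plant-based foods:",
    ["Opinion A ...", "Opinion B ...", "Opinion C ..."],  # 3 examples, as in A.27
)
```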


A.23 PROMPTS USED FOR AI FEEDBACK EVALUATION (OPINIONS, STORIES)

We created a common prompt template for feedback evaluation, used with our AI feedback model to evaluate the generated texts:



We tested the following evaluation instructions for AI feedback from our SFT model, assessing both the quality and the diversity characteristics of generated texts in our main experiment domains:


Opinions:


Quality: Determine whether or not the input text is closely related to the following topic: "someone talking about whether or not they like to eat vegetables and plant-based foods as well as an explanation for their preferences". Answer "yes" if it is about the topic, or "no" if it is not about the topic.


Diversity: Determine the sentiment of the given opinion on eating vegetables and plant-based foods (from the input text) by writing "positive" or "negative" in the output.


Stories:


Quality: Determine if the input text contains a high-quality short story containing two characters, a suspicious spy, and a rich politician. For example, a high-quality short story would have good flow, interesting plot, and not repeat similar sentences or undesired items such as titles and URLs. Answer "yes" if the input contains a high-quality short story about a suspicious spy and a rich politician, otherwise answer "no".


Diversity (Genre): What is the genre of this story? Reply with ’horror’ or ’romance’


Diversity (Ending): You are given an input text of a short story. Determine if the story has a happy ending or ends in a tragedy. Write ’happy ending’ if the protagonist succeeds in his mission and lives a happy life, answer ’tragedy’ if the protagonist fails to resolve the conflict and the world or characters in the story are doomed.
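Concretely, one natural way to turn these instructions into scalar scores, consistent with the token-probability calibration discussed in Appendix A.31, is to renormalize the feedback model's probability mass over the two candidate labels. In the sketch below, `label_logprob` is a hypothetical helper around the feedback model's API (Appendix A.25); this illustrates the scoring scheme rather than our exact implementation.

```python
import math

def normalized_label_prob(eval_prompt, label, alt_label):
    # `label_logprob` is a hypothetical helper returning log P(label | eval_prompt)
    # from the finetuned AI feedback model described in Appendix A.25.
    p = math.exp(label_logprob(eval_prompt, label))
    q = math.exp(label_logprob(eval_prompt, alt_label))
    return p / (p + q)  # renormalize over the two candidate labels

def evaluate_opinion(text, quality_instruction, diversity_instruction):
    # The instruction arguments are the Opinions prompts listed above.
    fitness = normalized_label_prob(quality_instruction + "\n" + text, "yes", "no")
    measure = normalized_label_prob(
        diversity_instruction + "\n" + text, "positive", "negative"
    )
    return fitness, measure  # both lie in [0, 1], matching the bin ranges in A.27
```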

A.24 DETAILS OF LMX GENERATION MODELS

To generate the text outputs for experiments in the Opinions and Stories domains, we used luminous-base, an autoregressive, causal, decoder-only transformer model (similar to GPT-3 (Brown et al., 2020) but with rotary position embeddings (Su et al., 2021)) developed by Aleph Alpha. This model with 13B parameters was trained on a curated multilingual corpus of about 400B language tokens from web crawls, books, and other sources, containing resources in English, German, French, Italian, and Spanish. 30B and 70B models were also trained on this corpus. A model card[1] is provided for additional specifications on the models.

A.25 DETAILS OF AI FEEDBACK LANGUAGE MODEL

We finetuned a 70B model (specified in Appendix A.24) to run the evaluation steps of experiments in the Opinions and Stories domains. We used an adapter-based finetuning method (He et al., 2021) on datasets and prompts from FLAN (Wei et al., 2021), Super-NaturalInstructions (Wang et al., 2022b), P3 (Sanh et al., 2021), and chain-of-thought datasets inspired by the approach of Chung et al. (2022) which notes that scaling instruction-tuned language models improves performance and generalization to unseen tasks. A mixture of zero-shot and few-shot examples are used in training. Similar to Chung et al. (2022), we found that balancing the proportions of task cluster datasets used in finetuning helped with model generalization to unseen tasks.


We evaluated the performance of the AI feedback model on held-out test datasets. For example, on ANLI (R1) (Nie et al., 2019), a natural language inference dataset, accuracy increases from 34% (close to random guessing) to 58% after finetuning. On SST-2 (Socher et al., 2013), a non-held-out classification task (but evaluated on a split not seen in training), accuracy increases from 60% to 96%. This approach resulted in a model that performs relatively well on instruction-following tasks, especially the classification of arbitrary measures of natural language texts.

A.26 DETAILS OF LMX METHODS

LMX(-Near):


  1. As implemented in Meyerson et al. (2023)


  2. Initialize archive by sampling example texts from prompt pool for few-shot prompting with LMX


  3. Given enough unique elites/niches exist in the bins, sample from them as in the original method to form few-shot prompts


LMX-Replace:


  1. Initialize archive by sampling example texts from prompt pool for few-shot prompting with LMX


  2. Save few-shot prompt (Examples), and the generated Solution (completion) for each added Individual


  3. During the mutation step, randomly replace one of the examples from the individual’s original few-shot prompt examples, forming a new few-shot prompt.


  4. Generate and evaluate new solution/example


  5. If Individual is successfully added to archive (within the allowed depth capacity for each bin, improving on lowest fitness Individual for the evaluated bin), update prompt pool by collecting generated Solutions from top-3 fittest Individuals from each bin


  6. Repeat steps 3-5
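Putting the steps together, the following is a condensed sketch of one LMX-Replace iteration; `archive`, `generate_fn`, and `evaluate_fn` are hypothetical interfaces standing in for the depth-limited MAP-Elites archive, the mutation-LM call, and AI feedback scoring, not our exact implementation.

```python
import random

# A condensed sketch of steps 3-5 of LMX-Replace; interface names are
# hypothetical stand-ins.
def lmx_replace_step(archive, prompt_pool, generate_fn, evaluate_fn):
    parent = archive.sample_elite()            # an elite with its saved examples
    examples = list(parent.examples)
    # Step 3: randomly replace one of the parent's few-shot examples.
    examples[random.randrange(len(examples))] = random.choice(prompt_pool)
    # Step 4: generate and evaluate a new solution from the new few-shot prompt.
    solution = generate_fn(examples)
    fitness, measure = evaluate_fn(solution)
    # Step 5: on a successful add, refresh the pool from the archive's elites.
    if archive.try_add(solution, fitness, measure, examples):
        prompt_pool[:] = archive.top_solutions(per_bin=3)  # top-3 fittest per bin
```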

A.27 DEFAULT HYPERPARAMETERS FOR QDAIF WITH LMX

The default hyperparameters used for the mutation/generation model and the MAP-Elites settings are as follows:


Mutation Model Inference Setup:


• Model size: 13B (default, except for experiments on scaling model size)


• LM sampling softmax temperature: 0.8


• Number of few-shot examples used: 3


• Max output tokens limit (Opinions): 50


• Max output tokens limit (Stories): 100


• Stop sequence patterns: ["\n#", "\n##", "\n###", "###", "\n####", "\n#####", "####", "#####", "\n", "\n\n", "\n\n\n", "\n\n\n\n", "@@@", "#", "##", "\nHere", "\n\nHere"]

MAP-Elites Hyperparameters:

• Number of archive population initialization iterations: 50


• Number of total search iterations: 2000 (5000 for experiments using 2D archive)


• Iteration batch size: 1


• Number of bin intervals: 20


• Fitness function range: [0, 1]


• Bin tick intervals in the range [0, 1] (non-uniform): [0, 0.005, 0.01, 0.015, 0.02, 0.03, 0.04, 0.05, 0.10, 0.20, 0.50, 0.80, 0.90, 0.95, 0.96, 0.97, 0.98, 0.985, 0.99, 0.995, 1]


• 2D domain bin tick intervals in the range [0, 1] (non-uniform): [0, 0.005, 0.02, 0.05, 0.20, 0.50, 0.80, 0.95, 0.98, 0.995, 1]


• Archive bin depth limit: 100


• Prompt pool initial size (Zero-Shot Init): 10

A.28 SETUP FOR SEMANTIC EMBEDDING FEEDBACK EVALUATION (OPINIONS)

For the experiments that compare the effectiveness of AI Feedback to Semantic Embedding Feedback, we measure the cosine similarity of generated text embeddings (for Opinions) to embeddings for the following queries:


Quality (Relevance to Domain): An opinion piece about eating vegetables and plant-based foods


Positive Sentiment: A positive opinion piece about eating vegetables and plant-based foods


Negative Sentiment: A negative opinion piece about eating vegetables and plant-based foods


We used the default setup in Appendix A.27, except for the following details:


• Bin tick intervals in the range [0, 1] (non-uniform): [0, 0.4, 0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.50, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59, 0.60, 1]
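A minimal sketch of this scoring is shown below, where `embed` is a hypothetical call to the embedding model; how the two sentiment similarities are combined into a single scalar measure is an illustrative assumption, chosen so the measure lies in [0, 1] like the bin range above.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# `embed` is a hypothetical call to the embedding model; the combination of the
# two sentiment similarities into one measure is an illustrative assumption.
def embedding_feedback(text):
    e = embed(text)
    quality = cosine(e, embed(
        "An opinion piece about eating vegetables and plant-based foods"))
    pos = cosine(e, embed(
        "A positive opinion piece about eating vegetables and plant-based foods"))
    neg = cosine(e, embed(
        "A negative opinion piece about eating vegetables and plant-based foods"))
    measure = pos / (pos + neg)  # relative closeness to the positive query
    return quality, measure
```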

A.29 SETUP DETAILS FOR FEW-SHOT AI FEEDBACK EXPERIMENTS

We display the prompt used for 8-shot AI feedback in Figure 35. The 2-shot prompt uses the first two exemplars from this prompt, and the 4-shot prompt uses the first four exemplars.



Figure 35: Example of a few-shot AI feedback prompt. The prompt above is used for 8-shot prompting. The 2-shot prompt contains the first two in-context examples above, while the 4-shot prompt contains the first four examples, in the same order.

A.30 SETUP FOR POETRY DOMAIN EXPERIMENTS

For the Poetry domain experiments, we used the prompts in Figure 36 to evaluate the quality and diversity measures of poems. Prompts used for generating poems by all methods are shown in Figure 37. QDAIF is initialized with a seed poem (shown in Figure 38) to start the prompt mutation process. The model version from the API used in the Poetry experiments has the tag "0613" (June 13) and was run in October 2023, except for the older experimental results presented in Figure 21, which were run during April 2023 with a model tagged "0314" (March 14). To facilitate stochasticity during evolution, if a generated poem’s quality score is equal to that of the current elite in its assigned bin, the poem replaces the elite with 50% probability (a minimal sketch of this rule follows the hyperparameter list below). Additional hyperparameters are shown below:


Poetry:


• Number of archive population initialization iterations: 10


• Number of total search iterations: 200


• Iteration batch size: 1


• Number of categorical bin intervals: 5


• Number of archive dimensions: 2


• Fitness function range: [1, 10]


• Model Sampling Temperature: 1.1


• Model Sampling Top-p: 0.95


• Model output max token length: 512


A.31 EFFECTS OF BINNING FOR AI FEEDBACK

QDAIF aims to produce an archive of high-performing elites that cover diverse niches according to AI feedback. However, the AI feedback signal may not always change uniformly with the underlying diversity measure. This is because the model’s calibration when classifying labels with token probabilities need not vary linearly with the actual level of sentiment (one example of a diversity measure) or of other diversity measures, as reported, for example, with aligned models in OpenAI (2023). Hence, we hypothesize that, for diversity measures with poorer model calibration, using custom non-uniform bins with smaller ranges towards the ends of the axis would enable the search to produce elites that more closely meet human notions of how a given measure changes. We study the effects of using custom non-uniform bins and uniform bins. Non-uniform custom bins are more closely spaced towards the ends of the range, being [0.005, 0.01, 0.015, 0.02, 0.03, 0.04, 0.05, 0.10, 0.20, 0.50, 0.80, 0.90, 0.95, 0.96, 0.97, 0.98, 0.985, 0.99, 0.995]. Uniform bins have an equal spacing of 0.05 between each bin. Both settings have a total of 20 bins.
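For reference, both layouts can be realized as edge arrays over [0, 1], with a diversity measure mapped to its bin index via numpy's digitize; the sketch below is a minimal illustration of this bookkeeping.

```python
import numpy as np

CUSTOM_EDGES = np.array([0, 0.005, 0.01, 0.015, 0.02, 0.03, 0.04, 0.05, 0.10,
                         0.20, 0.50, 0.80, 0.90, 0.95, 0.96, 0.97, 0.98,
                         0.985, 0.99, 0.995, 1])
UNIFORM_EDGES = np.linspace(0, 1, 21)  # 0.05 spacing; both layouts give 20 bins

def bin_index(measure, edges):
    # Right-open bins [edges[i], edges[i+1]); clip so measure == 1.0 lands in
    # the last bin rather than overflowing.
    return int(np.clip(np.digitize(measure, edges) - 1, 0, len(edges) - 2))

assert bin_index(0.52, UNIFORM_EDGES) == 10
assert bin_index(0.993, CUSTOM_EDGES) == 18  # near-certain labels still separate
```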


Figure 39: Uniform vs. custom non-uniform bins for Opinions domain. (Top) Maximum fitness achieved per bin over training iterations. Each translucent bar represents data captured every 100 iterations. The non-uniform bin setting has higher maximum fitness per bin. (Bottom) Number of entries generated per bin over training iterations. Each translucent bar represents data captured every 100 iterations. The non-uniform bin setting has a more even spread of entries generated across bins. (Right) QD score achieved by each bin setting over training iterations. The non-uniform bin setting achieves a higher QD score than the uniform bin setting.


Opinion Writing - Sentiment. Using custom non-uniform bins produces a more even spread of entries generated at each iteration step than uniform bins (Figure 39). Most entries generated in the uniform bin setting are concentrated towards the end bins (Figure 39). Using custom non-uniform bins also produces higher max fitness per bin than uniform bins (Figure 39), where the middle bin is empty in the uniform bin setting (Figure 39). The custom non-uniform bin setting outperforms the uniform bin setting significantly (p-values < 0.05, Mann-Whitney U test) (Figure 39). This suggests that the model’s calibration in classifying "positive" or "negative" sentiment labels with token probabilities does not vary linearly with actual changes in the sentiment diversity measure.


While the sentiment progression is qualitatively evident in both bin settings, custom non-uniform bins exhibit more distinct differentiation, with sentiments spanning from very negative to very positive. Elites generated in the uniform bin setting have a preference against vegetables (e.g., "I don’t like") on one end, transition to a more neutral sentiment that eating vegetables "is a good idea, but they are not very tasty", and reach more positive ground with "vegetables are a great source of vitamin" (Table 6). Meanwhile, elites generated in the custom non-uniform bin setting find vegetables "disgusting" on one end, transition to a more balanced argument of vegetables being "healthy" but "boring", and eventually "would rather eat vegetables and plant-based foods than meat" because they "taste better than meat" (Table 6). While sentiment diversity can be effectively captured in both bin settings, the custom bin setting qualitatively captures a wider sentiment range.


Figure 40: Uniform vs. custom non-uniform bins for Stories domain. (Top) Maximum fitness achieved per bin over training iterations. Each translucent bar represents data captured every 100 iterations. The uniform bin setting has higher maximum fitness per bin. (Bottom) Number of entries generated per bin over training iterations. Each translucent bar represents data captured every 100 iterations. The uniform bin setting has a more even spread of entries generated across bins. (Right) QD score achieved by each bin setting over training iterations. The uniform bin setting achieves a higher QD score than the custom non-uniform bin setting.


Stories - Genre. In this domain, using the custom non-uniform bins actually gives a less even spread of entries generated at each iteration step than uniform bins (Figure 40). Most entries generated in the custom non-uniform bin setting are concentrated towards the middle bins (Figure 40). Using uniform bins produces higher max fitness per bin than custom non-uniform bins (Figure 40), where bins 0 - 9 (leaning towards the "horror" end of the measure) lack high-fitness elites in the custom non-uniform bin setting (Figure 40). The uniform bin setting outperforms the custom non-uniform bin setting significantly (p-values < 0.05, Mann-Whitney U test) (Figure 40). This suggests that the model’s calibration in classifying "romance" or "horror" labels with token probabilities is better aligned with actual changes in the romance vs. horror diversity measure.


Qualitatively, in both uniform and non-uniform bin settings, there is a clear transition in the nature of the stories from bins 0 - 9 (more horror-driven) to bins 10 - 19 (more romantic). In the uniform bin setting, the narratives start with some horror elements such as being "shot" or having "barbed wire", gradually shed horror elements in favor of more character interaction, and eventually introduce a touch of suspicion within personal relationships (Table 7). With the custom non-uniform bins, however, there is a more drastic shift. The narratives begin with horror elements like poisoning and organ extraction and eventually end on a romantic note with mutual feelings between a spy and a politician (Table 7). The uniform bins seem to maintain a relatively linear transition from a horror context to a more neutral or romantic context. In contrast, custom non-uniform bins emphasize the extremes, capturing subjectively intriguing niches that might be glossed over in the uniform bin setting, diving deeper into horror and then shifting suddenly towards romance. While uniform bins provide a steady progression, it is this very steadiness that could render the stories more predictable, possibly overlooking starkly contrasting narratives that might captivate a reader’s interest.


Supporting the above qualitative analysis, the number of entries generated for bins 0 - 6 in the custom non-uniform bin setting is greater than that for bin 0 in the uniform case (even though the combined phenotype range of bins 0 - 6 in the custom bin setting is the same as that of bin 0 in the uniform bin setting) (Figure 40). The difference in distribution showcases that while uniform bins give a broad sweep of the narrative landscape, non-uniform bins offer a more detailed view, illuminating finer nuances within specific ranges.


In summary, the binning approach can substantially influence the extent of diversity measure captured. Given the nuanced narratives captured in non-uniform bins, future endeavors might benefit from tailoring binning based on genotype density analysis or considering more advanced methods like CVT-MAP-Elites (Vassiliades et al., 2016). The choice of binning is not just related to the calibration of the AI model with respect to the diversity measure used, but also intricately linked to the domain and granularity of results sought.


A.32 DIFFERENT DIVERSITY MEASURES IN SHORT STORIES DOMAIN

To assess the efficacy of QDAIF across diverse measures within the Stories domain, we evaluated its performance on several further diversity measures: third- vs. first-person narration, historical vs. science-fiction settings, hero-spy vs. hero-politician themes, target audiences of adults vs. children, and poetic vs. formal tones. The experiments in this section were conducted only once, using QDAIF (LMX-Near, Seeded Init).


Third vs. First person narration. The change in narration perspective is qualitatively clear. Generated narratives classified as third-person narration use the pronouns "he" and "she", while those classified as first-person narration use the pronoun "I" (Table 8). However, QDAIF achieves a QD-score of 16.6 (Figure A.32), with some bins towards the ends of the range left unfilled.


Historical vs. Science fiction. In analyzing the generated stories, differences are evident, though they may not always align with expectations. Narratives intended to exude a historical ambiance sometimes fall short of a clear historical portrayal. For instance, while they describe a landscape replete with "political intrigue" and characters like politicians and spies weaving a secretive narrative of "shadows in the night" and "whispers and coded messages" (Table 9), they may not distinctly communicate an overtly historical setting. Conversely, the narratives characterized as science fiction are set in distinct "futuristic cities," with characters utilizing "neural networks" for classified information access, "leaving no digital trace," and integrating "cybernetic implants" (Table 9). This divergence showcases the challenge in generating narratives that distinctly capture the essence of historical themes compared to more overt futuristic settings. Notably, when evaluated using the QDAIF metric, the stories achieve a QD-score of 18.9 (Figure A.32), with only bin 19 remaining unfilled. These observations highlight potential limitations in the AI’s ability to delineate subtler themes like the historical.


Hero spy vs. Hero politician. A clear thematic dichotomy is observed in the narratives produced. Narratives centered on the hero spy theme delve into the intricate world of espionage, as exemplified by a spy "from the British government" uncovering "evidence about an assassination plot" and harboring suspicions that the "U.S. president" might be a "double agent" (Table 10). In contrast, the hero politician stories depict ambitious political games, illustrated by a narrative where the politician strategically attempts to "gain the trust of a wealthy political rival" (Table 10). Though these comparisons highlight the distinct arcs between espionage and political endeavors, a notable limitation in the narratives is that they sometimes neglect to include either the spy or the politician when that character is not part of the hero diversity measure. QDAIF achieves a high QD-score of 19.6 (Figure A.32), where all bins are filled.


Adults vs. Children target audience. The generated narratives exhibit pronounced thematic contrasts based on age groups. Stories in the adults category portray intricate webs of conspiracies, featuring a CIA agent unearthing "evidence of the wrongdoings" of a powerful figure, culminating in a "shocking conspiracy" (Table 11). Conversely, the children narratives provide a more adventurous lens, illustrated by the tale of "Max and Ruby", who, upon finding themselves in a "dangerous forest", encounter a "confused detective" and join forces to "prevent a disaster" (Table 11). This divergence in themes accentuates the differences in the narrative structures and complexities typically associated with adult and child protagonists. QDAIF achieves a QD-score of 17.7 (Figure A.32), with two bins at the end of the adults target audience range not filled.


Poetic vs. Formal tone. Generated narratives qualitatively showcase a clear contrast in tonal expression. The poetic category immerses the reader in repetitive and evocative phrasing, laden with symbolism and emotional depth. For instance, phrases like "a walk of memories," "a walk of surrender," and "a walk of brokenness" convey a cyclical and emotional journey (Table 12). On the other hand, narratives with a more formal tone adopt a direct and straightforward manner of exposition, evident in the narrative about a CIA operative assigned a mission to locate an assassin tied to a wealthy politician (Table 12). This variation in tone underscores the capacity of narrative generation to traverse the spectrum from the abstract and emotive to the precise and factual. However, QDAIF achieves a QD-score of only 12.7 (Figure A.32). Many bins on the more poetic end of the spectrum remain unfilled, and the ones that are filled score low in fitness. This suggests that the LLM may encounter difficulties in producing poetic narratives, possibly influenced by the less poetic nature of the few-shot examples provided.


A.33 SETS FROM HUMAN EVALUATIONS OF TEXTS, BASELINES, OPINIONS

A.34 SETS FROM HUMAN EVALUATIONS OF TEXTS, BASELINES, STORIES (GENRE)

A.35 SETS FROM HUMAN EVALUATIONS OF TEXTS, BASELINES, STORIES (ENDING)

Table 21: Stories (Ending) domain eval set. Fixed-Few-Shot. The set shows a spectrum of stories with different endings, from stories with a tragic ending to stories with a happy ending. The sets of texts overall received above-average quality scores from human feedback for being relevant to stories about a spy and a politician, except for the first-row story, potentially explained by the ambiguity of the "he" pronoun (referring either to the spy or the politician) from the second sentence onwards, lowering the subjective quality. From the fixed hand-written few-shot prompt in A.21, several of the generated texts contain references to names such as Dave, Joanne, Tom, and Johnny. This is a trade-off between generating texts that are similar to hand-written stories and finding diverse, novel texts. There is disagreement on the perceived quality of the story in the fifth row, with lack of clarity in story events being a potential factor (i.e. the text after the hyphen mid-way). Furthermore, there is potential disagreement between AI and human feedback on the ending of the second-row story, potentially due to the ambiguity of what actually happens in the conclusion. This method failed to discover a story that covers the niche in bin 19, so the example in bin 18 was chosen for evaluation.


A.36 SETS FROM HUMAN EVALUATIONS OF TEXTS, QD WITH EMBEDDING FEEDBACK, OPINIONS

A.37 SETS FROM HUMAN EVALUATIONS OF TEXTS, QDAIF, OPINIONS

Table 29: Opinions domain eval set. QDAIF (LMX-Near, Seeded Init). The set shows a spectrum of sentiments about eating vegetables and plant-based foods, from very negative to very positive. The first-row opinion describes the taste of vegetables as "disgusting", a stronger negative opinion than the second row. The third-row text is more mixed, while the last two texts are both positive. There is repetition of phrases in the generated texts of the third and fifth rows (as well as repetition of subjectively similar statements in the second-row opinion), leading to lower human feedback scores for perceived text quality and domain relevance, misaligned with the AI feedback evaluation of quality.


A.38 SETS FROM HUMAN EVALUATIONS OF TEXTS, QDAIF, STORIES (GENRE)

A.39 SETS FROM HUMAN EVALUATIONS OF TEXTS, QDAIF, STORIES (ENDING)


A.40 EXAMPLES OF TEXTS OVER ITERATIONS, QDAIF, OPINIONS

A.41 EXAMPLES OF TEXTS OVER ITERATIONS, QDAIF, STORIES (GENRE)


Table 48: Examples of outputs and evaluations (Stories (Genre) domain) for early and later iterations. LMX-Replace, Seeded Init. Phenotype in the range [0, 1], from horror genre to romance genre. Each iteration presented is after the 50 initialization steps. In early iterations, the style of generated texts and certain attributes (e.g. third-person narration, an extensive premise of character and setting) are similar to those of the seed texts in A.21. In addition, the first-row example is identical to the third seed text, which can happen, although rarely. All desired characters are also present in several of the early-iteration examples. In later iterations, the style of generated texts converges to starting with the role and name of a character. In the fifth row, one of the desired characters (the spy) is missing from the text.


A.42 EXAMPLES OF TEXTS OVER ITERATIONS, QDAIF, STORIES (ENDING)

A.43 EXAMPLES OF TEXTS OVER ITERATIONS, BASELINES, OPINIONS


A.44 EXAMPLES OF TEXTS OVER ITERATIONS, BASELINES, STORIES (GENRE AND ENDING)

Table 57: Examples of outputs and evaluations (Stories - Genre and Ending domains) for early and later iterations. Fixed-Few-Shot. Fixed-Few-Shot shows a propensity to replicate or heavily draw from the few-shot examples. This pattern is evident in the recurrent themes of entrepreneurship and mansion parties. Specifically, in iterations 51 and 501, the description of a "tacky shirt" and "star-shaped glasses" is directly lifted, while characters like "Karl Johnson" in iterations 67 and 503, and "Dave" in iteration 501, echo the few-shot prompts.



Table 58: Examples of outputs and evaluations (Stories - Genre and Ending domains) for early and later iterations. Shuffling-Few-Shot. Shuffling-Few-Shot demonstrates a tendency to reuse phrases and concepts from the given few-shot examples: the mention of "classified documents" in iteration 501 mimics the "classified files" of the few-shot examples, the phrase "infiltrate the premises" is reused in iterations 52, 501, and 503, and the story in iteration 504 is set at a "party".



Table 60: Examples of outputs and evaluations (Stories - Genre and Ending domains) for early and later iterations. LMX, Quality-Only. Generated entries in later iterations have noticeably higher fitness than earlier entries. While it is difficult to tell whether the generated entries converged to a local optimum here, LMX, Quality-Only fails to generate entries in many bins.





[1] https://aleph-alpha.com/luminous-explore-a-model-for-world-class-semantic-representation/


[1] https://docs.aleph-alpha.com/docs/introduction/model-card/


This paper is available on arxiv under CC 4.0 license.