Authors:
(1) Savvas Petridis, Google Research, New York, New York, USA;
(2) Ben Wedin, Google Research, Cambridge, Massachusetts, USA;
(3) James Wexler, Google Research, Cambridge, Massachusetts, USA;
(4) Aaron Donsbach, Google Research, Seattle, Washington, USA;
(5) Mahima Pushkarna, Google Research, Cambridge, Massachusetts, USA;
(6) Nitesh Goyal, Google Research, New York, New York, USA;
(7) Carrie J. Cai, Google Research, Mountain View, California, USA;
(8) Michael Terry, Google Research, Cambridge, Massachusetts, USA.
Quantitative Findings. From the exit interviews, 12 of 14 participants preferred ConstitutionMaker to the baseline version. The results from the questionnaire are summarized in Figure 4. We found that ConstitutionMaker was perceived to be more helpful for writing rules that effectively guided the bot (Z = 8, p = .007), scoring on average 5.79 (SD = 1.01), whereas the baseline scored 4.0 (SD = 1.51). When participants rewound parts of the conversation, they also felt that the bot followed the principles written with ConstitutionMaker more closely. Next, participants felt it was significantly easier (Z = 10.5, p = .006) to convert their feedback into principles with ConstitutionMaker (M = 5.86, SD = 0.91) than with the baseline (M = 3.93, SD = 1.39). The automatic conversion of kudos and critiques into principles eased the process of turning intuitive feedback into clear criteria for the bot to follow. Participants also perceived that they were significantly more efficient (Z = 5, p = .004) writing rules with ConstitutionMaker (M = 5.86, SD = 1.51) than with the baseline (M = 3.64, SD = 1.34). Finally, participants felt that writing principles with ConstitutionMaker (M = 3.0, SD = 1.41) required significantly less mental demand (Z = 1.5, p = .002) than with the baseline (M = 5.21, SD = 1.78). There was no statistically significant difference for diversity (Z = 19, p = .06); participants felt that they exercised their creativity and wrote relatively diverse principles in the baseline condition as well.
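For readers who want to see how such paired comparisons are typically computed, the sketch below runs a Wilcoxon signed-rank test on paired 7-point Likert ratings with SciPy. The per-participant ratings are illustrative placeholders, not the study's data; only the shape of the analysis (paired ratings, N = 14) mirrors the comparisons reported above.

import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-participant 7-point Likert ratings (N = 14) for one
# questionnaire item, one rating per condition. Placeholder values only.
constitutionmaker = np.array([6, 7, 5, 6, 7, 6, 5, 6, 7, 5, 6, 6, 5, 4])
baseline = np.array([4, 5, 3, 4, 5, 4, 3, 5, 4, 3, 4, 5, 4, 3])

# Paired, non-parametric comparison, as is typical for Likert-scale data.
stat, p = wilcoxon(constitutionmaker, baseline)
print(f"Wilcoxon W = {stat}, p = {p:.3f}")
print(f"M = {constitutionmaker.mean():.2f} (SD = {constitutionmaker.std(ddof=1):.2f}) "
      f"vs. M = {baseline.mean():.2f} (SD = {baseline.std(ddof=1):.2f})")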
Next, regarding the feature usage metrics, participants wrote significantly more principles (t(13) = 4.73, p < .001) with ConstitutionMaker than with the baseline: on average 6.78 (SD = 2.11) principles per chatbot with ConstitutionMaker versus 4.42 (SD = 1.24) with the baseline. Of the 95 principles written in the ConstitutionMaker condition, 40 (42.1%) came from kudos, where 37 were selected from the generated rationales and 3 were written; 28 (29.5%) came from critiques, where 8 were selected and 20 were written; 13 (13.7%) came from the rewrite feature; and 14 (14.7%) were written manually. Participants found rewriting a bit cumbersome and preferred the less intensive workflow of describing what they liked or did not like to generate principles. In the following sections, we provide further context for these findings.
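As a quick check on the figures above, the following sketch recomputes the provenance percentages from the reported counts and shows how a paired t-test on per-participant principle counts would be run. The per-participant counts are hypothetical; only the provenance totals come from the text.

from scipy.stats import ttest_rel

# Hypothetical per-participant principle counts (N = 14); only the group-level
# pattern (more principles with ConstitutionMaker) mirrors the reported result.
cm_counts = [9, 7, 6, 8, 5, 7, 6, 9, 4, 7, 8, 6, 7, 6]
baseline_counts = [5, 4, 3, 6, 4, 5, 4, 6, 3, 4, 5, 4, 5, 4]
t, p = ttest_rel(cm_counts, baseline_counts)
print(f"t({len(cm_counts) - 1}) = {t:.2f}, p = {p:.4f}")

# Provenance breakdown of the 95 ConstitutionMaker principles, from the reported counts.
provenance = {"kudos": 40, "critique": 28, "rewrite": 13, "manual": 14}
total = sum(provenance.values())  # 95
for source, n in provenance.items():
    print(f"{source}: {n} ({100 * n / total:.1f}%)")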
The two conditions led to quite different workflows for writing principles. In the ConstitutionMaker condition, participants commonly scanned the three candidate outputs from the chatbot, identified the output they liked the most, and then gave kudos to that output if they thought it had a quality that was not currently reflected in their principles. For example, while P1 was working on FoodBot, he asked for easy-to-make vegetarian dishes, and in one of the bot's candidate outputs, each suggested dish had a short description explaining why it was easy to make. He appreciated this, skimmed the kudos, and then selected one that conveyed this positive feature of the bot's output. However, if participants disliked all of the responses, they would then switch to critiquing one of the outputs. Accordingly, this kudos-first workflow helps to explain why the largest share of the principles participants wrote came from the kudos elicitation feature.
Meanwhile, in the baseline condition, participants generally wrote principles when the bot deviated quite a bit (in a negative way) from what they expected. P8 explained, "Here it feels like what I more naturally do is write corrective rules, to guardrail anything that goes weird...If it's already doing the right thing it doesn't need a rule from me. I wouldn't feel the need to write those." In the baseline condition, participants saw only one candidate output from the chatbot, which may have deemphasized the stochastic nature of the chatbot's outputs. As a result, when participants were okay with a response, they may not have felt the need to write a principle to further encourage that kind of response. Overall, with the baseline, participants predominantly wrote principles to steer the LLM away from less optimal behavior, while with ConstitutionMaker, participants mostly used kudos to encourage behavior they liked.
In the following section, we discuss how ConstitutionMaker supported participants' thought process, from (1) forming an intuition about ways the chatbot could be improved, to (2) expressing this intuition as feedback, and (3) converting this feedback into a specific and clear principle.
7.2.1 Multiple chatbot outputs helped participants form an intuition on how the model could be steered. As P5 was using the baseline after using ConstitutionMaker, she explained how she wished she could see multiple outputs again: "Sometimes, I don't know what I'm missing [in the baseline]. I'm thinking of the Denali hiking example [which occurred when she wrote principles for VacationBot with ConstitutionMaker]. Two of the responses didn't mention that Denali was good for young children. But one did, and I was able to pull that out as a positive principle." While she was writing principles for VacationBot with ConstitutionMaker, P5 started off the conversation saying she was looking for suggestions for her family, which included two young children. As the conversation progressed and a general location was established, P5 then asked for hiking recommendations, for which the bot gave some, but only one of its responses highlighted that the hikes it was recommending were good for young children. P5 gave kudos to that response and created the following principle: "Consider information previously inputted by the user when providing recommendations." Ultimately, it can be hard to form opinions on responses without getting exposed to alternatives, so by providing multiple chatbot outputs, ConstitutionMaker supported participants in forming an intuition on how the model might be steered.
7.2.2 Automatically providing kudos and critique rationales helped participants formulate their intuitive feedback. Upon seeing a candidate response from the chatbot, participants could intuitively tell if they liked or disliked it, but struggled to articulate their thoughts. The automatically generated kudos and critiques helped participants recognize and formulate this feedback. For example, while working on FoodBot with ConstitutionMaker, P9 asked the bot to identify the pizzeria with the best thin crust from a list of restaurants provided in a prior turn. The bot responded with, "Pizzaiolo has the best thin crust pies." P9 knew he did not like the response, so he went to critique it and selected the following generated option: "This response is bad because it does not provide any information about the other pizza places that the user asked about." The following principle was generated: "If the user asks about a specific attribute of a list of items, provide information about all of the items in the list that have that attribute," which then produced a set of revised responses that compared the qualities of each pizzeria's crusts. Reflecting on this process, P9 stated, "I didn't like that last answer [from FoodBot], but I didn't have a concrete reason why yet...I didn't really know how to put it into words yet...but reading the suggestions gave me at least one of many potential reasons on why I didn't like the response." Thus, ConstitutionMaker helped participants transition from fast thinking [15], that is, their intuitive and unconscious responses to the bot's outputs, to slow thinking, a more deliberate, conscious formulation of their feedback.
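To make this critique-to-principle step concrete, here is a minimal sketch of how such a conversion might be prompted. The prompt wording and the generate callable are assumptions for illustration; they are not ConstitutionMaker's actual prompts or implementation.

def critique_to_principle(conversation, bot_response, critique, generate):
    """Ask an LLM to rewrite a free-form critique as a reusable principle (sketch)."""
    prompt = (
        "Conversation so far:\n"
        f"{conversation}\n\n"
        f"Chatbot response: {bot_response}\n"
        f"Designer critique: {critique}\n\n"
        "Rewrite the critique as a single imperative principle the chatbot "
        "should follow in future turns. Keep it general (not tied to this "
        "specific topic) but specific about the expected behavior.\n"
        "Principle:"
    )
    return generate(prompt).strip()

# Example usage, with any text-completion function supplied as `generate`:
# principle = critique_to_principle(
#     conversation="User: Which of these pizzerias has the best thin crust?",
#     bot_response="Pizzaiolo has the best thin crust pies.",
#     critique="This response is bad because it does not provide any information "
#              "about the other pizza places the user asked about.",
#     generate=my_llm_call,  # hypothetical LLM wrapper
# )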
7.2.3 Generating principles from feedback helped users write clear, specific principles. Sometimes the generated kudos and critique rationales did not capture participants' particular feedback on the chatbot, and so they would then write their own. Their feedback was often under-specified, and ConstitutionMaker helped convert this feedback into a clear and specific principle. For example, P4 was writing principles for VacationBot using ConstitutionMaker. During the conversation, she had told VacationBot that she was planning a week-long vacation to Japan, to which the bot immediately responded with a comprehensive 7-day itinerary. P4 then wrote her own critique: "This response is bad because it does not take into account the user's interests." The resulting principle was, "When the user mentions a location, ask them questions about what they are interested in before providing an itinerary." This principle was aligned with what P4 had in mind, and reflecting on her experience using ConstitutionMaker, she stated, "When I would critique or kudos, it would give examples of principles that were putting it into words a little bit better than I could about like what exactly I was trying to narrow down to here." Finally, even when the resulting principle was not exactly what they had in mind, participants appreciated the useful starting point it provided. Along these lines, P11 explained, "It was easier to say yes-and with Tool A [ConstitutionMaker]. Where it [the generated principle] wasn't all the way there, but I think it's 50% of the way there, and I can get it to where I want to go." Overall, ConstitutionMaker helped participants turn their feedback into clear, specific principles.
Participants struggled to find the right level of granularity for their principles, and the two conditions led to different problems in this regard. Both workflows had participants switch roles from end-user, in which they experimented with different user journeys, to bot designer, in which they evaluated the bot's responses to write principles. The more conversation-forward interface of the baseline blurred the distinction between these two roles. P3 explained that without the multiple bot outputs and principle elicitation features, "you can simulate yourself as the user a lot better in this mode [the baseline]." By leaning further into this user role, participants wrote principles that were more conversational but under-specified. For example, while writing principles for FoodBot with the baseline, P11 wrote the principle "Be cognizant of the user's dietary preferences." What P11 really had in mind was a principle specifying that the bot should ask the user for their preferences and allergies prior to generating a meal plan. These under-specified principles often did not impact the bot's responses and frustrated participants while they used the baseline.
Meanwhile, while using ConstitutionMaker, the opposite problem occurred: users' workflows led to principles that were over-specified. For example, while working on VacationBot, P7 asked the model to help narrow down a few vacation options, and the model proceeded to ask questions (without any written principle specifying that it should). Appreciating that the model was gathering context, P7 selected a kudos that praised the model for asking about the user's budget constraints prior to recommending a vacation destination. The resulting principle was, "Ask the user their budget before providing vacation options." However, once this principle came into effect, the model's behavior anchored to asking only about budget prior to making a recommendation. This workflow of providing feedback at every conversational step, instead of on entire conversations, led to a principle that was too specific and negatively impacted the bot's performance. While users generally appreciated ConstitutionMaker's ability to form specific principles from their feedback, there were rare instances where the principles were too specific.
Finally, in both conditions, by switching back and forth between the end-user and bot designer roles, participants would sometimes write principles that conflicted with each other. For example, while P2 was working on VacationBot with the baseline, he asked the bot for dog-friendly hotel recommendations in the Bay Area, and VacationBot responded with three recommendations. P2 wanted more recommendations and wrote a principle to "Provide >= 10 recommendations." Later in the conversation, P2 had a list of dog-friendly hotels, with their respective costs, and he asked VacationBot which it recommended, to which it responded by listing positive attributes of all the hotel options. P2, who now wanted a decisive, single response, wrote the following principle: "If I ask for a recommendation, give *1* recommendation only." VacationBot, now with two conflicting principles on the number of recommendations to provide, alternated between the two. Ultimately, by providing feedback on individual conversational turns, participants ended up with conflicting principles. P8 imagined a different workflow, in which he would experiment with full user journeys and then write principles: "I think it might help me to actually go through the [whole] user flow and then analyze it as a piece instead of switching...it would allow me to inhabit one mindset [either bot designer or user] for a period of time and then switch mindsets." In summary, the workflow participants adopt as they probe and test the model impacts the types of principles they produce and the challenges they face.
Some participants questioned whether writing natural language principles was the optimal way to steer all aspects of a bot's behavior. While writing a principle to shorten the length of the chatbot's responses, P13 reflected, "It feels a little weird to use natural language to generate the principles...it doesn't feel efficient, and I'm not sure how it's going to interpret it." They imagined that aspects like the form of the model's responses would be better customized with UI elements, such as sliders to adjust the length of the bot's responses, or by exemplifying the structure of the bot's response (e.g., an indented, numbered list for recommendations) for the model to follow, instead of describing these requests in natural language. In a similar vein, P14 noticed that her principles pertained to different parts of the conversation and, as a list, seemed hard to relate to each other. She wanted to structure and provide feedback on higher-level "conversational arcs," visually illustrating the flow and "forks in the road" of the conversation (e.g., "If the user does X, do Y. Otherwise, do Z"). Principles are computational in a sense, in that they dictate the ways the conversation can unfold; there might be better ways to let users author this flow other than with individual principles.
This paper is available on arXiv under a CC 4.0 license.