Authors:
(1) Savvas Petridis, Google Research, New York, New York, USA;
(2) Ben Wedin, Google Research, Cambridge, Massachusetts, USA;
(3) James Wexler, Google Research, Cambridge, Massachusetts, USA;
(4) Aaron Donsbach, Google Research, Seattle, Washington, USA;
(5) Mahima Pushkarna, Google Research, Cambridge, Massachusetts, USA;
(6) Nitesh Goyal, Google Research, New York, New York, USA;
(7) Carrie J. Cai, Google Research, Mountain View, California, USA;
(8) Michael Terry, Google Research, Cambridge, Massachusetts, USA.
To understand (1) whether the principle elicitation features help users write principles to steer LLM outputs and (2) what other kinds of feedback users wanted to give, we conducted a 14-participant within-subjects user study, comparing ConstitutionMaker to an ablated version without the principle elicitation features. This ablated version still allowed users to rewind parts of the conversation, but participants could only see one chatbot output at a time and had to write their own principles.
The overall outline of this study is as follows: (1) Participants spent 40 minutes writing principles for two separate chatbots, one with ConstitutionMaker (20 minutes) and the other with the baseline version (20 minutes), while thinking aloud. Condition order was counterbalanced, and the chatbot assigned to each condition was also balanced across participants. (2) After writing principles for both chatbots, participants completed a post-study questionnaire comparing the process of writing principles with each tool. (3) Finally, in a semi-structured interview, participants described the positives and negatives of each tool and their workflow. The total time commitment of the study was 1 hour.
The two chatbots participants wrote principles for were VacationBot, an assistant that helps users plan and explore different vacation options, and FoodBot, an assistant that helps users plan their meals and figure out what to eat. These two chatbots were chosen because they support tasks that most people are familiar with, so participants could readily form opinions on their outputs and write principles. For both chatbots, participants were given the name and capabilities (Figure 1A), so that they could focus predominantly on principle writing. We also balanced which chatbot went with each condition: half of the participants used ConstitutionMaker to write principles for VacationBot, and the other half used the baseline for VacationBot. Finally, prior to using each version, participants watched a short video demonstrating that tool's features.
To situate the task, participants were asked to imagine they were chatbot designers writing principles to dictate the chatbot's behavior so that it would perform better for users. Because we wanted to observe their principle-writing process and see whether the tools affected how many principles they could write, we encouraged participants to write at least 10 principles for each chatbot, giving them a concrete goal and motivating them to write more principles. However, we emphasized that this target was only an encouragement and that they should write a principle only if they thought it would be useful to future users.
6.2.1 Questionnaire. We wanted to understand if and how well ConstitutionMaker's principle elicitation features help users write principles. Our questionnaire (Table 1) probes several aspects of principle writing, including participants' perceptions of (1) how effectively the output principles guide the bot, (2) the diversity of the output principles, (3) how easy it was to convert their feedback into principles, (4) the efficiency of their principle-writing process, and (5) the requisite mental demand [9] for writing principles with each tool. To compare the two conditions, we conducted paired-sample Wilcoxon tests with full Bonferroni correction, since the study was within-subjects and the questionnaire data was ordinal.
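The sketch below illustrates this analysis in Python, assuming hypothetical 7-point Likert ratings for a single questionnaire item and an assumed item count for the Bonferroni correction; the values are placeholders, not the study's actual responses.

```python
# Minimal sketch of a paired-sample Wilcoxon signed-rank test with full
# Bonferroni correction, as described above. All data are hypothetical.
from scipy.stats import wilcoxon

# One rating per participant for one questionnaire item, paired by
# participant (within-subjects design). Placeholder values.
constitutionmaker = [6, 5, 7, 6, 5, 6, 7, 4, 6, 5, 6, 7, 5, 6]
baseline          = [4, 5, 5, 3, 4, 5, 6, 4, 5, 3, 4, 5, 4, 5]

# Wilcoxon signed-rank test is appropriate for paired ordinal data.
stat, p = wilcoxon(constitutionmaker, baseline)

# Full Bonferroni correction: multiply each p-value by the number of
# questionnaire items tested (assumed here to be 5; the actual items
# appear in Table 1) and cap at 1.0.
n_items = 5
p_corrected = min(p * n_items, 1.0)
print(f"W = {stat:.2f}, corrected p = {p_corrected:.3f}")
```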
6.2.2 Feature Usage Metrics. To shed further light on which tool helped participants write more principles, we recorded the number of principles written in each condition. Moreover, to understand which of the principle elicitation features was most helpful, we recorded how often each was used during the experimental (full ConstitutionMaker) condition. To compare the average number of principles collected across the two conditions, we conducted a paired t-test, illustrated in the sketch below.
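A small sketch of this comparison, assuming hypothetical per-participant principle counts; the numbers are placeholders, not the study's measurements.

```python
# Paired t-test comparing principle counts across the two conditions.
from scipy.stats import ttest_rel

# Number of principles each participant wrote in each condition,
# paired by participant. Placeholder values.
n_constitutionmaker = [9, 11, 8, 12, 10, 7, 13, 9, 10, 11, 8, 12, 9, 10]
n_baseline          = [6,  8, 7,  9,  7, 5, 10, 8,  7,  8, 6,  9, 7,  8]

t_stat, p_value = ttest_rel(n_constitutionmaker, n_baseline)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```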
We recruited 14 industry professionals at a large technology company (average age = 32; 6 female, 8 male) via an email call for participation and word of mouth. These professionals included UX designers, software engineers, data scientists, and UX researchers. Eligible participants were those who had written at least a few LLM prompts in the past. The interviews were conducted remotely, and participants received a $25 gift card for their time.
This paper is available on arxiv under CC 4.0 license.