Comparing ConstitutionMaker to Baseline: User Study Unveils Insights into Chatbot Principle Writing


Too Long; Didn't Read

Discover the outcomes of a user study involving 14 industry professionals evaluating ConstitutionMaker's impact on writing chatbot principles. Gain insights into the effectiveness of output principles, diversity of principles generated, ease of converting feedback, efficiency in the writing process, and the associated mental demand. Compare ConstitutionMaker to a baseline version and understand user perspectives on chatbot customization. Explore practical implications and valuable findings that contribute to the evolution of interactive chatbot refinement.



(1) Savvas Petridis, Google Research, New York, New York, USA;

(2) Ben Wedin, Google Research, Cambridge, Massachusetts, USA;

(3) James Wexler, Google Research, Cambridge, Massachusetts, USA;

(4) Aaron Donsbach, Google Research, Seattle, Washington, USA;

(5) Mahima Pushkarna, Google Research, Cambridge, Massachusetts, USA;

(6) Nitesh Goyal, Google Research, New York, New York, USA;

(7) Carrie J. Cai, Google Research, Mountain View, California, USA;

(8) Michael Terry, Google Research, Cambridge, Massachusetts, USA.

Table Of Links

Abstract & Introduction

Related Work

Formative Study

ConstitutionMaker

User Study

Conclusion and References


To understand (1) if the principle elicitation features help users write principles to steer LLM outputs and (2) what other kinds of feedback they wanted to give, we conducted a 14-participant within-subjects user study, comparing ConstitutionMaker to an ablated version without the principle elicitation features. This ablated version still offered users the ability to rewind parts of the conversation, but participants could only see one chatbot output at a time and had to write their own principles.

6.1 Procedure

The overall outline of this study is as follows: (1) Participants spent 40 minutes writing principles for two separate chatbots, one with ConstitutionMaker (20 minutes) and the other with the baseline version (20 minutes), while thinking aloud. Condition order was counterbalanced, and chatbot assignment per condition was also balanced. (2) After writing principles for both chatbots, participants completed a post-study questionnaire comparing the process of writing principles with each tool. (3) Finally, in a semi-structured interview, participants described the positives and negatives of each tool and their workflow. The total time commitment of the study was one hour.

The two chatbots participants wrote principles for were VacationBot, an assistant that helps users plan and explore different vacation options, and FoodBot, an assistant that helps users plan their meals and figure out what to eat. These two chatbots were chosen because they support tasks most people are familiar with, so that participants could form opinions on their outputs and write principles. For both chatbots, participants were given the name and capabilities (Figure 1A), so that they could focus predominantly on principle writing. We also balanced which chatbot went with each condition: half of the participants used ConstitutionMaker to write principles for VacationBot, and the other half used the baseline for VacationBot. Finally, prior to using each version, participants watched a short video showing that tool's features.
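The counterbalancing described above (condition order crossed with chatbot assignment) can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' actual assignment procedure; all names and the round-robin scheme are assumptions:

```python
from itertools import cycle, product

# Two tool orders x two chatbot orders = four counterbalancing cells.
TOOL_ORDERS = [("ConstitutionMaker", "Baseline"), ("Baseline", "ConstitutionMaker")]
BOT_ORDERS = [("VacationBot", "FoodBot"), ("FoodBot", "VacationBot")]

def assign_conditions(n_participants):
    """Cycle participants through the four (tool order, chatbot order) cells."""
    cells = cycle(product(TOOL_ORDERS, BOT_ORDERS))
    return [next(cells) for _ in range(n_participants)]

# With 14 participants, each of the four cells is used 3 or 4 times.
assignments = assign_conditions(14)
```

With four cells and 14 participants the balance cannot be perfect, which is consistent with the paper describing the assignment as balanced rather than fully crossed.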

Figure 3: The three principle elicitation features: kudos, critique, and rewrite. ConstitutionMaker generates three candidate outputs for the chatbot at each conversational turn, using a dialogue prompt (A) that consists of the bot's capabilities (i.e., its purpose), the current set of principles, and the current conversation context. The user can then kudos, critique, or rewrite any of these outputs.
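The dialogue prompt structure described in the Figure 3 caption (capabilities, current principles, conversation context) might be assembled along these lines. The exact prompt wording and function name are assumptions; this section of the paper does not give the template verbatim:

```python
def build_dialogue_prompt(capabilities, principles, conversation):
    """Hypothetical sketch of Figure 3's dialogue prompt (A): concatenate the
    bot's capabilities, the current set of principles, and the conversation
    so far, then sample three candidate outputs from the LLM."""
    principle_lines = "\n".join(f"- {p}" for p in principles)
    turns = "\n".join(f"{speaker}: {text}" for speaker, text in conversation)
    return (
        f"You are a chatbot with these capabilities:\n{capabilities}\n\n"
        f"Follow these principles:\n{principle_lines}\n\n"
        f"Conversation so far:\n{turns}\nBot:"
    )

prompt = build_dialogue_prompt(
    "Helps users plan and explore different vacation options.",
    ["Always suggest a budget-friendly option."],
    [("User", "Where should I go in March?")],
)
```

Because the principles live in the prompt, any principle the user adds via kudos, critique, or rewrite immediately changes all subsequent candidate outputs.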

To situate the task, participants were asked to pretend to be a chatbot designer and that they were writing principles to dictate the chatbot’s behavior so that it performs better for users. We wanted to observe their process for writing principles and see if the tools impacted how many principles they could write, so we encouraged participants to write at least 10 principles for each chatbot, to give them a concrete goal and to motivate them to write more principles. However, we emphasized that this was only to encourage them to write principles and that they should only write a principle if they thought it would be useful to future users.

6.2 Measurements and Analysis

6.2.1 Questionnaire. We wanted to understand if and how well ConstitutionMaker's principle elicitation features help users write principles. Our questionnaire (Table 1) probes several aspects of principle writing, including participants' perception of (1) how effectively the output principles guide the bot, (2) the diversity of the output principles, (3) how easy it was to convert their feedback into principles, (4) the efficiency of their principle writing process, and (5) the requisite mental demand [9] for writing principles with each tool. To compare the two conditions, we conducted paired-sample Wilcoxon tests with full Bonferroni correction, since the study was within-subjects and the questionnaire data was ordinal.
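The analysis above can be sketched with standard-library Python. The authors presumably used a standard statistics package (e.g. scipy.stats.wilcoxon); the hand-rolled helper below is a simplified stand-in using a normal approximation without tie correction, shown only to make the test and the correction concrete:

```python
import math

def wilcoxon_signed_rank(x, y):
    """Paired Wilcoxon signed-rank test (normal approximation, no tie
    correction): rank absolute differences, sum ranks of positive ones."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:  # assign average ranks across ties
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # two-sided
    return w_plus, p

def bonferroni(p, n_tests):
    """Full Bonferroni correction: multiply each p-value by the test count."""
    return min(p * n_tests, 1.0)
```

With five questionnaire items, each uncorrected p-value is multiplied by five, so an item must reach p < 0.01 uncorrected to survive at the conventional 0.05 level.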

6.2.2 Feature Usage Metrics. To shed further light on which tool helped participants write more principles, we recorded the number of principles written in each condition. Moreover, to understand which of the principle elicitation features was most helpful, we recorded how often each was used during the experimental (full ConstitutionMaker) condition. To compare the average number of principles collected across the two conditions, we conducted a paired t-test.
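The paired t-test on per-participant principle counts can likewise be sketched with the standard library (scipy.stats.ttest_rel is the usual tool). The principle counts below are invented for illustration, NOT the study's data:

```python
import math
import statistics

def paired_t_test(x, y):
    """Paired (dependent-samples) t statistic. Returns (t, degrees of
    freedom); the two-sided p-value is then read from a t distribution
    with those degrees of freedom."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical per-participant principle counts in each condition.
with_cm = [12, 11, 13, 10]
with_baseline = [8, 9, 7, 6]
t_stat, df = paired_t_test(with_cm, with_baseline)
```

A paired (rather than independent-samples) test is the right choice here because each participant contributes one count per condition, so the two samples are matched.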

Table 1: Post-task questionnaire filled out by participants after they wrote principles for two chatbots, one with ConstitutionMaker and the other with the ablated version. Each statement was rated on a 7-point Likert scale.

6.3 Participants

We recruited 14 industry professionals at a large technology company (average age = 32; 6 female, 8 male) via an email call for participation and word of mouth. These industry professionals included UX designers, software engineers, data scientists, and UX researchers. Eligible participants were those who had written at least a few LLM prompts in the past. The interviews were conducted remotely. Participants received a $25 gift card for their time.

This paper is available on arXiv under a CC 4.0 license.