Authors:

(1) Savvas Petridis, Google Research, New York, New York, USA;
(2) Ben Wedin, Google Research, Cambridge, Massachusetts, USA;
(3) James Wexler, Google Research, Cambridge, Massachusetts, USA;
(4) Aaron Donsbach, Google Research, Seattle, Washington, USA;
(5) Mahima Pushkarna, Google Research, Cambridge, Massachusetts, USA;
(6) Nitesh Goyal, Google Research, New York, New York, USA;
(7) Carrie J. Cai, Google Research, Mountain View, California, USA;
(8) Michael Terry, Google Research, Cambridge, Massachusetts, USA.

Table Of Links

Abstract & Introduction
Related Work
Formative Study
ConstitutionMaker
Implementation
User Study
Findings
Discussion
Conclusion and References

2 RELATED WORK

2.1 Designing Chatbot Behavior

There are a few methods of creating and customizing chatbots. Earlier chatbots employed rule-based approaches to construct a dialogue flow [11, 29, 42], where the user’s input would be matched to a canned response written by the chatbot designer. Later on, supervised machine learning approaches [43, 48] became popular, where chatbot designers constructed datasets consisting of ideal conversational flows. Both of these approaches, while fairly effective, require a significant amount of time and labor to implement: the designer must either construct an expansive rule set that determines the chatbot’s behavior or build a large dataset of ideal conversational flows.

More recently, large language model prompting has shown promise for enabling easier chatbot design. Large, pre-trained models like ChatGPT [28] can hold sophisticated conversations out of the box, and these models are already being used to create custom chatbots in a number of domains, including medicine [18]. There are a few ways of customizing an LLM-based chatbot, including prompt engineering and fine-tuning. Prompt engineering involves providing instructions or conversational examples in the prompt to steer the chatbot’s behavior [4].
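
As a rough illustration of the prompt-engineering approach just described, the following Python sketch assembles a prompt from written instructions and example exchanges. The names `Turn` and `build_prompt`, the "User:"/"Bot:" labels, and the overall layout are hypothetical choices made for this example; they are not taken from [4] or from any particular chatbot system, and no model is actually called.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Turn:
    """One example exchange used as a conversational demonstration."""
    user: str
    bot: str


def build_prompt(instructions: list[str], examples: list[Turn], user_message: str) -> str:
    """Assemble a plain-text prompt: instructions, example turns, then the new message.

    The layout is an assumption for illustration only.
    """
    lines = ["Follow these instructions:"]
    # Number each instruction so it is easy to refer back to.
    lines += [f"{i}. {rule}" for i, rule in enumerate(instructions, start=1)]
    lines.append("")
    # Conversational examples demonstrate the desired behavior.
    for ex in examples:
        lines.append(f"User: {ex.user}")
        lines.append(f"Bot: {ex.bot}")
    # The new user message goes last; a model would continue after "Bot:".
    lines.append(f"User: {user_message}")
    lines.append("Bot:")
    return "\n".join(lines)


if __name__ == "__main__":
    prompt = build_prompt(
        instructions=["Recommend at most one film per reply."],
        examples=[Turn(user="Hi", bot="Hello! What kind of films do you like?")],
        user_message="Any ideas for tonight?",
    )
    print(prompt)
```

Steering the chatbot under this sketch amounts to editing the `instructions` or `examples` lists and rebuilding the prompt.
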
To more robustly steer the model, users can also fine-tune [19] the LLM with a larger set of conversational examples. Recent work has shown that users would also like to steer LLMs by interactively critiquing their outputs; during the conversation, they refine the model’s outputs by providing follow-up instructions and feedback [5]. In this work, we explore how to support users with this type of model steering: naturally customizing the LLM’s behavior through feedback as they interact with it.

A new approach to steering LLM-based chatbots (and LLMs in general), called Constitutional AI [1], involves writing natural language principles to direct the model. These principles are essentially rules, such as: “Do not create harmful, sexist, or racist content”. Given a set of principles, the Constitutional AI approach involves rewriting LLM responses that violate these principles, and then using the resulting pairs of original and rewritten responses to fine-tune the LLM. Writing principles could be a viable and intuitive way for users to steer LLM-based chatbot behavior, with the added benefit that these principles can later be used to fine-tune the model. However, relatively little is known about the kinds of principles users want to write, and how we might support users in converting their natural feedback on the model’s outputs into principles. In this work, we evaluate three principle elicitation features that help users convert their feedback into principles to steer chatbot behavior.

2.2 Helping Users Design LLM Prompts

While LLM prompting has democratized and dramatically sped up AI prototyping [12], it is still a difficult and ambiguous process for users [31, 45, 46]; they struggle with finding the right phrasing for a prompt, choosing good demonstrative examples, experimenting with different parameters, and evaluating how well their prompt is performing [12]. Accordingly, a number of tools have been developed to support prompt writing along these lines.
To help users find a better phrasing for their prompt, automatic approaches have been developed that search the LLM’s training data for a more effective phrasing [27, 32]. In the text-to-image domain, researchers have employed LLMs to generate better prompt phrasings or keywords for generative image models [3, 21, 22, 37]. Next, to support users in sourcing good examples for their prompt, ScatterShot [39] suggests underrepresented data from a dataset to include in the prompt, and enables users to iteratively evaluate their prompt with these examples. Similar systems help users source diverse and representative examples via techniques like clustering [6] or graph-based search [34]. To support easy exploration of prompt parameters, Cells, Generators, and Lenses [16] enables users to flexibly test different inputs with instantiations of models with different parameters. In addition to improving the performance of a single-run prompt, recent work has also investigated the benefits of chaining multiple prompts together to improve performance on more complicated tasks [40, 41]. Finally, tools like PromptIDE [33], PromptAid [23], and LinguisticLens [30] support users in evaluating their prompts by visualizing either the data a prompt produces or its performance in comparison to other prompt variations.

This work explores a novel, more natural way of customizing a prompt’s behavior through interactive critique. ConstitutionMaker enables users to provide natural language feedback on a prompt’s outputs, and this feedback is converted into principles that are then incorporated back into the prompt. We illustrate the value of helping users update a prompt via their feedback, and we introduce three novel mechanisms for converting users’ natural feedback into principles for the prompt.

2.3 Interactive Model Refinement via Feedback

Finally, ConstitutionMaker is broadly related to systems that enable users to customize their outputs via limited or underspecified feedback.
For example, programming-by-example tools enable users to provide input-output examples, for which the system generates a function that fits them [7, 35, 47]. Input-output examples are inherently ambiguous, potentially mapping to multiple functions, and these systems employ a number of methods to specify and clarify the function with the user. In a similar process, ConstitutionMaker takes ambiguous natural language feedback on the model’s output and generates a more specific principle for the user to inspect and edit.

Next, recommender systems [2, 17, 25] also enable users to provide limited feedback to steer model outputs. One such system [17] projects movie recommendations onto a 2D plane, portions of which users can interactively raise or lower to affect a list of recommendations; in response to these changes, the system provides representative movies for each raised portion to demonstrate how it has interpreted the user’s feedback. Overall, in contrast to these systems, ConstitutionMaker leverages LLMs to enable users to provide natural language feedback and critique the model in the same way we would provide feedback to another person.

This paper is available on arXiv under a CC 4.0 license.
