
The Survey Instrument


Authors:

(1) Michael Xieyang Liu, Google Research, Pittsburgh, PA, USA (lxieyang@google.com);

(2) Frederick Liu, Google Research, Seattle, Washington, USA (frederickliu@google.com);

(3) Alexander J. Fiannaca, Google Research, Seattle, Washington, USA (afiannaca@google.com);

(4) Terry Koo, Google, Indiana, USA (terrykoo@google.com);

(5) Lucas Dixon, Google Research, Paris, France (ldixon@google.com);

(6) Michael Terry, Google Research, Cambridge, Massachusetts, USA (michaelterry@google.com);

(7) Carrie J. Cai, Google Research, Mountain View, California, USA (cjcai@google.com).

Table of Links

Abstract and 1 Introduction

2 Survey with Industry Professionals

3 RQ1: Real-World Use Cases that Necessitate Output Constraints

4 RQ2: Benefits of Applying Constraints to LLM Outputs and 4.1 Increasing Prompt-Based Development Efficiency

4.2 Integrating with Downstream Processes and Workflows

4.3 Satisfying UI and Product Requirements and 4.4 Improving User Experience, Trust, and Adoption

5 How to Articulate Output Constraints to LLMs and 5.1 The Case for GUI: A Quick, Reliable, and Flexible Way of Prototyping Constraints

5.2 The Case for NL: More Intuitive and Expressive for Complex Constraints

6 The ConstraintMaker Tool and 6.1 Iterative Design and User Feedback

7 Conclusion and References

A. The Survey Instrument

A THE SURVEY INSTRUMENT

In this section, we detail the design of our survey. The survey starts with questions about background and self-reported technical proficiency:


• What best describes your job role: software engineer; research scientist; UX designer; UX researcher; product manager; technical writer; other (open-ended)


• To what extent have you designed LLM prompts: a) I have “chatted with” chatbots like Bard / ChatGPT as a user; b) I’ve tried making a prompt once or twice just to check it out, but haven’t done much prompt design / engineering; c) I have some experience doing prompt design / engineering on at least three LLM prompts; d) I have done extensive prompt design / engineering to accomplish desired functionality. Only participants who selected option c) or d) were given the opportunity to continue with the remainder of the survey; this screening was designed to exclude “casual” LLM users.


• I primarily design prompts with the intent that they will be used by: a) consumers / end-users (e.g. a recipe idea generator); b) downstream development teams (e.g. captioning, classifiers); c) both, I split my time about evenly between the two; d) other audience or use cases (open response).


The survey then asked participants to report three real-world use cases where they would like to constrain LLM outputs. For each use case, participants were asked:


• How would you like to be able to constrain the model output (open response);


• Provide a concrete example where it would be useful to have this constraint (open response);


• How precisely do you need this constraint to be followed: a) exact match; b) approximate match and why (optional open response);


• How important is this constraint to your workflow (5-point Likert scale from “it’s a nice to have, but my current workarounds are fine” to “it’s essential to my workflow”) and why (optional open response).


The survey then asked participants to reflect through open response on scenarios where they would prefer expressing constraints via GUI (sliders, buttons, etc.) over natural language (in prompts, etc.) and vice versa, as well as any alternative ways they would prefer to express constraints. To facilitate this reflection, the survey additionally asked participants to rate their level of preference for each of the following types of constraints:


• Output should be exactly 3 words, no more than 3 paragraphs, etc.


• Output in a specific format or structure (e.g., JSON, XML, bulleted / ordered list)


• Only output “left-handed”, “right-handed”, or “ambidextrous”


• Output must include or avoid certain words / phrases


• Output must cover or avoid certain topics, only use certain libraries when generating code, etc.


• Output style should mimic Yoda / Shakespeare / certain personas, etc.


Each question presented a 7-point Likert scale from “strongly prefer natural language” to “strongly prefer GUI.”
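To make the surveyed constraint types concrete, the following is a minimal, illustrative sketch (not part of the survey instrument or the paper's tooling) of how a few of them could be checked programmatically; the function names and the handedness example values are taken from or modeled on the list above.

```python
import json

def meets_length_constraint(text: str, max_words: int) -> bool:
    """Check a length constraint, e.g. 'no more than 3 words'."""
    return len(text.split()) <= max_words

def meets_format_constraint(text: str) -> bool:
    """Check a structural constraint: output must be valid JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def meets_multiple_choice_constraint(text: str, choices: set[str]) -> bool:
    """Check a multiple-choice constraint: output must be exactly one allowed value."""
    return text.strip() in choices

# Example checks mirroring the survey's constraint types:
print(meets_length_constraint("large language model", 3))        # True
print(meets_format_constraint('{"handedness": "left-handed"}'))  # True
print(meets_multiple_choice_constraint(
    "ambidextrous",
    {"left-handed", "right-handed", "ambidextrous"}))            # True
```

Checks like these relate to the survey's "exact match" vs. "approximate match" question: a JSON-validity or multiple-choice check is naturally exact, while length or style constraints may tolerate approximate satisfaction.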


This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.