This paper is available on arxiv under CC 4.0 license.
Authors:
(1) Ryen W. White, Microsoft Research, Redmond, WA, USA.
Copilots[6] are applications of modern AI (foundation models, etc.) that help people with complex cognitive tasks. Copilots have conversational user interfaces that users engage with via natural language; they are powered by foundation models such as GPT-4; they are extensible with skills/tools/plugins; and they are scoped to specialized domains or applications (including search). Copilots are designed to keep humans at the center of the task completion process and to augment human capabilities, helping people complete a broader range of tasks in less time.
The Microsoft copilot stack (Figure 2) contains four main layers: (1) the copilot frontend, covering the user experience and extensibility with plugins, which let developers provide additional visible tools to the copilot; (2) the orchestration layer, which handles internal information flows, prompting, grounding, and executing plugins and processing their responses, among other things; (3) large foundation models, which can be provided to the developer as is, specialized to specific tasks, domains, or applications, or supplied by developers who bring their own models to power copilot functionality; and (4) massive-scale AI infrastructure hosted in the cloud on Azure, on which everything else runs. Underpinning all of this is Microsoft’s strong commitment to responsible AI, which ensures that copilots are safe, secure, and transparent via an iterative, layered approach with mitigations spanning the model, prompts, grounding, and the user experience.
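To make the layering concrete, the sketch below models the stack as a handful of hypothetical Python classes. The names (FoundationModel, Plugin, Orchestrator), the prompt format, and the demo at the end are illustrative assumptions, not the actual Microsoft copilot SDK.

```python
# Minimal sketch of the four-layer copilot stack described above.
# All class and method names are hypothetical, chosen for illustration only.
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class FoundationModel:
    """Layer 3: a foundation model, used as is, specialized, or brought by the
    developer. Layer 4 (massive-scale AI infrastructure on Azure) would host it."""
    name: str

    def complete(self, prompt: str) -> str:
        # Stand-in for a real model call.
        return f"[{self.name}] response to: {prompt[:40]}..."


@dataclass
class Plugin:
    """Layer 1 extensibility: a visible tool a developer adds to the copilot."""
    name: str
    run: Callable[[str], str]


@dataclass
class Orchestrator:
    """Layer 2: manages internal information flows, prompting, grounding,
    and plugin execution."""
    model: FoundationModel
    plugins: Dict[str, Plugin] = field(default_factory=dict)

    def respond(self, user_input: str, grounding: str = "") -> str:
        # Real orchestrators also apply responsible-AI mitigations to the
        # prompt, the grounding data, and the generated response.
        prompt = f"Context:\n{grounding}\n\nUser: {user_input}\nAssistant:"
        return self.model.complete(prompt)


# Toy usage: a plugin supplies grounding that the orchestrator folds into the prompt.
weather = Plugin(name="weather", run=lambda city: f"Sunny in {city}")
copilot = Orchestrator(model=FoundationModel("gpt-4"), plugins={"weather": weather})
print(copilot.respond("Plan my afternoon", grounding=weather.run("Redmond")))
```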
Examples of copilots include GitHub Copilot,[7] an AI pair programmer that has been shown to reduce developer effort, enable more task success, and significantly expedite task completion [5]. Copilots are also emerging in search. Popular Web search engines such as Bing and Google are adding copilot functionalities to their search engines in the form of conversational agents such as Bing Chat and Google BARD. In search, copilots can help searchers to tackle a broader range of tasks than information finding and go deeper than surface (SERP-level) interactions with content by synthesizing answers on the searcher’s behalf. They also enable searchers to communicate their intents and goals more directly. Returning to the task tree (Figure 1), the focus on engaging copilots via natural language interactions allows both searchers and systems to consider higher-level task representations (macrotasks, subtasks) in addition to the more granular actions (queries, result clicks, pagination, and so on) that searchers already perform when engaging with traditional search engines.
Copilots and chat experiences are a complement to, not a replacement for, traditional search engines. Search engines have been around for decades and serve a valuable purpose: providing near instantaneous access to answers and resources for a broad range of search requests. These existing and emerging modalities can and should work well together to help searchers tackle a wider range of tasks. The capabilities of copilots to better understand intentions and provide assistance beyond fact finding and basic learning/investigation will expand the task frontier, broadening the range of complex search tasks that can be completed, e.g., direct support for tasks requiring creative inspiration (Figure 3). This all moves us a step closer toward intelligent search systems that can help with all-task completion, covering the full universe of tasks for which people might need search support. One way to define the range of tasks that copilots can support is through Bloom’s taxonomy of learning objectives [27]. Creation is at the pinnacle of that taxonomy, and we have only scratched the surface of creativity support with next-word prediction through transformer models [16]. We are already seeing expansions into modalities beyond text (images, video, audio, and so on) and could consider support for other creative tasks including planning, analysis, and invention. There are also many other layers in Bloom’s taxonomy (e.g., evaluation: helping searchers make judgments and decisions; application: helping searchers complete new tasks; understanding: explaining ideas and concepts to accelerate learning) that could form future task frontiers.
Beyond greater capabilities, the introduction of copilots into search will also change how people engage with search systems. In copilots, the mode of interaction is primarily natural language input, with some recent expansion toward multimodal inputs and outputs via the introduction of diffusion models such as DALL·E. Copilots can generate/synthesize direct answers, with source attribution for provenance, to build trust with users and to drive traffic back to content creators, which is important to incentivize the further content creation that fuels future foundation models. The overall search interaction flow also differs between search engines and copilots: when using copilots, searchers do not need to decompose their goal into sub-goals or sub-queries, examine SERPs and landing pages, and aggregate/synthesize relevant knowledge from retrieved information. Figure 4 compares the information seeking processes in the two modalities. In copilots, the responsibility for generating answers is delegated by the searcher to the system, which is not without its challenges in terms of human agency and human learning, as we discuss later in this article.
It is neither practical nor necessary to deploy copilots for all search tasks. Foundation model inference is expensive at massive scale, and search engine algorithms have been honed over decades to provide relevant results for a broad range of tasks (e.g., navigation, fact-finding). Conversational interfaces are also less familiar to searchers, and it will take time for them to adapt to this way of searching. Traditional search engines are sufficient when searchers know exactly what they want; copilots are helpful for more complex search tasks or in situations where searchers may be struggling to find relevant information. Task complexity can be estimated using aggregate metrics such as the amount of engagement with the search engine (e.g., number of query reformulations) for similar tasks historically. As generative AI appears in more applications and searchers better understand search copilot capabilities, the tasks that searchers bring to search copilots will likely evolve.
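As a rough illustration of how such engagement signals could drive the choice of modality, the sketch below routes a task to a traditional search engine or a copilot based on the historical number of query reformulations for similar tasks. The threshold, function names, and routing rule are hypothetical assumptions for illustration, not a deployed policy.

```python
# Illustrative sketch (not a production routing policy): estimate task
# complexity from historical engagement and choose a modality accordingly.
from statistics import mean


def estimate_complexity(reformulation_counts: list) -> float:
    """Aggregate metric: average number of query reformulations observed
    for similar tasks historically (a hypothetical proxy for complexity)."""
    return mean(reformulation_counts) if reformulation_counts else 0.0


def choose_modality(reformulation_counts: list, threshold: float = 2.0) -> str:
    """Route simple, well-specified tasks to traditional search and more
    complex or struggling sessions to the copilot. The threshold is an
    assumed tuning parameter."""
    return "copilot" if estimate_complexity(reformulation_counts) > threshold else "search"


# A navigational task with few reformulations stays on traditional search;
# a task that historically required many reformulations goes to the copilot.
print(choose_modality([0, 1, 0]))   # -> "search"
print(choose_modality([3, 5, 4]))   # -> "copilot"
```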
We will also see a growth in search experiences that unify traditional search and copilots. In a step towards this, search engines such as Bing and Google are already integrating dynamic answers from foundation models into their SERPs for some informational queries. Also, in the Google copilot (BARD), search results are displayed immediately below the copilot responses, allowing searchers to easily engage with them as desired. In the Bing copilot (Bing Chat), the search and copilot experiences are more separated and searchers must select a specific modality based on their task and personal preferences. Bing also provides searchers with control over other aspects, such as conversation style and tone, although it is not clear that searchers are sufficiently familiar with AI copilots at this time to use these nuanced controls effectively.
Search copilots such as Bing Chat use retrieval augmented generation (RAG) [28] to ground copilot responses in timely and relevant results. This has many advantages, including: (1) there is no need to retrain the massive foundation model over time; (2) search results provide relevant and fresh information to foundation models; and (3) it provides a provenance signal linking generated content with online sources. In response to a searcher prompt, the foundation model iteratively generates internal queries that are used to retrieve the results that form context for the copilot answers created using generative AI. Displaying these queries to searchers inline in the dialog (as in Bing Chat) creates greater transparency and helps build trust with searchers that the system understands their tasks and goals. The Bing orchestrator can also pull in relevant instant answers from the search engine, such as weather, stocks, and sports, and display those in copilot responses instead of, or in addition to, the answers generated by the foundation model. Figure 5 shows the high-level search process from query (plus conversation context) to answer, and the role of various key system components.
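The sketch below traces this RAG flow end to end: iterative internal query generation, retrieval of grounding results, optional instant answers, and grounded answer generation with source attribution. Every function here is a stub invented for illustration (none are Bing APIs); they exist only to make the control flow concrete.

```python
# Hedged sketch of a retrieval augmented generation (RAG) loop in the spirit
# of Figure 5, with stubbed components standing in for real systems.

def generate_queries(prompt, conversation, context):
    """Foundation model proposes internal search queries; stubbed to run
    a single retrieval round in this toy version."""
    return [prompt] if not context else []


def retrieve(query):
    """Search engine retrieval; stubbed with a canned result."""
    return [{"url": "https://example.org", "snippet": f"Result for: {query}"}]


def get_instant_answers(prompt):
    """Instant answers (weather, stocks, sports, etc.); stubbed as empty."""
    return []


def generate_answer(prompt, conversation, context, instant, cite_sources=True):
    """Grounded answer generation with source attribution; stubbed."""
    sources = ", ".join(r["url"] for r in context) if cite_sources else ""
    return f"Answer to '{prompt}' grounded in {len(context)} result(s) [{sources}]"


def rag_respond(prompt, conversation, max_rounds=3):
    context, shown_queries = [], []
    for _ in range(max_rounds):
        # The model iteratively proposes internal queries conditioned on the
        # prompt, the conversation, and the evidence gathered so far.
        queries = generate_queries(prompt, conversation, context)
        if not queries:
            break
        shown_queries.extend(queries)          # surfaced inline for transparency
        for q in queries:
            context.extend(retrieve(q))        # fresh, relevant grounding results
    instant = get_instant_answers(prompt)      # optional instant answers
    answer = generate_answer(prompt, conversation, context, instant)
    return {"answer": answer, "queries_shown": shown_queries}


print(rag_respond("best hiking trails near Redmond", conversation=[]))
```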
Copilots also enable search engines to support more complex search tasks. Using search alone would require more searcher effort to examine search results and manually generate answers or insights (see recent work on the Delphic costs and benefits of search [9]). Of course, there are different perspectives on task complexity, e.g., the copilot perspective (denoting the amount of computation, requests, etc. required for the system to complete the task) and the searcher perspective (denoting the amount of manual effort required for the human searcher to generate an answer and complete a task). Table 1 considers task complexity from these two perspectives and (again drawing from Bloom’s taxonomy) provides some current, anecdotal examples of the types of tasks that both searchers and systems may find to be more or less complex. Assuming that foundation model costs will drop and sophistication will increase, we focus here on task complexity for searchers.
[6] Copilot is Microsoft’s terminology for an AI-powered digital assistant designed to work alongside humans and augment their cognitive capabilities.
[7] https://github.com/features/copilot