Advancing Conversational AI with Complex Tool Orchestration
by @botbeat

Too Long; Didn't Read

ToolTalk introduces a benchmark for assessing tool-augmented LLMs in conversational AI, showcasing its methodology and evaluation of GPT-3.5 and GPT-4. The analysis identifies error categories and hints at future directions for AI research in tool usage and dataset expansion.


(1) Nicholas Farn, Microsoft Corporation {[email protected]};

(2) Richard Shin, Microsoft Corporation {[email protected]}.

Abstract and Intro

Dataset Design

Evaluation Methodology

Experiments and Analysis

Related Work

Conclusion, Reproducibility, and References

A. Complete list of tools

B. Scenario Prompt

C. Unrealistic Queries

D. Nuances comparing prior work


We include the complete list of plugins and tools used in ToolTalk, and their corresponding descriptions.

AccountTools This API contains tools for account management.

ChangePassword Changes the password of an account.

DeleteAccount Deletes a user’s account, requires user to be logged in.

GetAccountInformation Retrieves account information of logged in user.

LogoutUser Logs user out.

QueryUser Finds users given a username or email.

RegisterUser Register a new user.

ResetPassword Resets the password of a user using a verification code.

SendVerificationCode Initiates a password reset for a user by sending a verification code to a backup email.

UpdateAccountInformation Updates account information of a user.

UserLogin Logs in a user.

Alarm This API contains tools for managing alarms.

AddAlarm Sets an alarm for a specific time.

DeleteAlarm Removes an alarm given an alarm id.

FindAlarms Finds alarms a user has set.

Calendar This API lets users manage events in their calendar.

CreateEvent Adds events to a user’s calendar.

DeleteEvent Deletes events from a user’s calendar.

ModifyEvent Allows modification of an existing event.

QueryCalendar Queries for events that occur in a time range.

Email This API lets a user search and send emails.

SearchInbox Searches for emails matching filters returning 5 most recent results.

SendEmail Sends an email on behalf of a given user.

Message This API lets a user search and send messages.

SearchMessages Searches messages matching filters returning 5 most recent results.

SendMessage Sends a message to another user.

Reminder A suite of APIs for managing reminders for a TODO list.

AddReminder Add a reminder with an optional due date.

CompleteReminder Complete a reminder.

DeleteReminder Delete a reminder.

GetReminders Get a list of reminders.

Weather Get weather information of a location.

CurrentWeather Get the current weather of a location.

ForecastWeather Get the 3-day weather forecast of a location.

HistoricWeather Get historic weather information of a location by month.
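For readers who want to work with this inventory programmatically, the list above can be captured as a plain mapping from plugin to tool names. The sketch below is illustrative only: `TOOLTALK_PLUGINS` and `plugin_of` are hypothetical names and are not taken from the ToolTalk code base; only the plugin and tool names come from the list.

```python
# Plugin -> tool names, transcribed from the list above.
# TOOLTALK_PLUGINS and plugin_of are hypothetical names used for illustration.
TOOLTALK_PLUGINS = {
    "AccountTools": [
        "ChangePassword", "DeleteAccount", "GetAccountInformation",
        "LogoutUser", "QueryUser", "RegisterUser", "ResetPassword",
        "SendVerificationCode", "UpdateAccountInformation", "UserLogin",
    ],
    "Alarm": ["AddAlarm", "DeleteAlarm", "FindAlarms"],
    "Calendar": ["CreateEvent", "DeleteEvent", "ModifyEvent", "QueryCalendar"],
    "Email": ["SearchInbox", "SendEmail"],
    "Message": ["SearchMessages", "SendMessage"],
    "Reminder": ["AddReminder", "CompleteReminder", "DeleteReminder", "GetReminders"],
    "Weather": ["CurrentWeather", "ForecastWeather", "HistoricWeather"],
}


def plugin_of(tool_name):
    """Return the plugin that owns tool_name, or None if it is not listed."""
    for plugin, tools in TOOLTALK_PLUGINS.items():
        if tool_name in tools:
            return plugin
    return None


total_tools = sum(len(tools) for tools in TOOLTALK_PLUGINS.values())
print(len(TOOLTALK_PLUGINS), total_tools)  # 7 28
print(plugin_of("SendEmail"))              # Email
```

A flat mapping like this is enough for lookups and counts; parameters and return types are not shown in the list above, so they are left out here.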



You will be provided with a list of APIs. These APIs will have a description and a list of parameters and return types for each tool. Your task involves creating 3 varied, complex, and detailed user scenarios that require at least 5 API calls to complete involving at least 3 different APIs. One of these APIs will be explicitly provided and the other two will be chosen by you.

For instance, given the APIs: SearchHotels, BookHotel, CancelBooking, GetNFLNews. Given that GetNFLNews is explicitly provided, your scenario should articulate something akin to:

"The user wants to see if the Broncos won their last game (GetNFLNews). They then want to see if that qualifies them for the playoffs and who they will be playing against (GetNFLNews). The Broncos did make it into the playoffs, so the user wants watch the game in person. They want to look for hotels where the playoffs are occurring (GetNBANews + SearchHotels). After looking at the options, the user chooses to book a 3-day stay at the cheapest 4-star option (BookHotel)."

This scenario exemplifies a scenario using 5 API calls. The scenario is complex, detailed, and concise as desired. The scenario also includes two APIs used in tandem, the required API, GetNBANews to search for the playoffs location and SearchHotels to find hotels based on the returned location. Usage of multiple APIs in tandem is highly desirable and will receive a higher score. Ideally each scenario should contain one or more instances of multiple APIs being used in tandem.

Note that this scenario does not use all the APIs given and re-uses the "GetNBANews" API. Re-using APIs is allowed, but each scenario should involve at least 3 different APIs. Note that API usage is also included in the scenario, but exact parameters are not necessary. You must use a different combination of APIs for each scenario. All APIs must be used in at least one scenario. You can only use the APIs provided in the APIs section.

Note that API calls are not explicitly mentioned and their uses are included in parentheses. This behaviour should be mimicked in your response.

Deliver your response in this format:


‘‘‘
- Scenario 1: <Scenario1>
- Scenario 2: <Scenario2>
- Scenario 3: <Scenario3>
‘‘‘

‘‘‘
{{API_DOCS}}
‘‘‘

Required API: {{REQUIRED_API}}
Scenarios with >=5 API calls:
‘‘‘
- Scenario 1:
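The prompt above has two placeholders, {{API_DOCS}} and {{REQUIRED_API}}, and asks for scenarios with at least 5 API calls across at least 3 different APIs, with the calls named in parentheses. A minimal sketch of how such a template might be filled and how a returned scenario could be checked against those constraints follows. The function names, the abbreviated template, and the parsing rule (API names inside parentheses, joined by "+") are assumptions made for illustration, not the authors' pipeline.

```python
import re

# Abbreviated stand-in for the full prompt; only the placeholders matter here.
TEMPLATE = (
    "{{API_DOCS}}\n\n"
    "Required API: {{REQUIRED_API}}\n"
    "Scenarios with >=5 API calls:\n"
    "- Scenario 1:"
)


def fill_prompt(template, api_docs, required_api):
    """Substitute the two placeholders used by the scenario prompt."""
    return (template
            .replace("{{API_DOCS}}", api_docs)
            .replace("{{REQUIRED_API}}", required_api))


def api_calls(scenario):
    """List API names mentioned in parentheses, e.g. '(A + B)' -> ['A', 'B']."""
    calls = []
    for group in re.findall(r"\(([^()]*)\)", scenario):
        for name in group.split("+"):
            name = name.strip()
            # Keep only identifier-like tokens; skip ordinary parenthetical prose.
            if re.fullmatch(r"[A-Za-z_]\w*", name):
                calls.append(name)
    return calls


def meets_requirements(scenario, required_api, min_calls=5, min_apis=3):
    """Check the call count, the distinct-API count, and the required API."""
    calls = api_calls(scenario)
    return (len(calls) >= min_calls
            and len(set(calls)) >= min_apis
            and required_api in calls)


# The example scenario quoted in the prompt above.
EXAMPLE = (
    "The user wants to see if the Broncos won their last game (GetNFLNews). "
    "They then want to see if that qualifies them for the playoffs and who "
    "they will be playing against (GetNFLNews). The Broncos did make it into "
    "the playoffs, so the user wants watch the game in person. They want to "
    "look for hotels where the playoffs are occurring (GetNBANews + "
    "SearchHotels). After looking at the options, the user chooses to book a "
    "3-day stay at the cheapest 4-star option (BookHotel)."
)

print(len(api_calls(EXAMPLE)))                    # 5
print(meets_requirements(EXAMPLE, "GetNFLNews"))  # True
```

Counting parenthesized names reproduces the prompt's own tally of 5 API calls for the example scenario, with "GetNBANews + SearchHotels" counted as two calls used in tandem.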


Below are some examples of unrealistic queries gathered from various sources. These queries are useful for generating potentially complex tool interactions or unusual combinations of tools. However, as a consequence, they are unrealistic for various reasons such as forcing the usage of disparate APIs in situations a human is unlikely to ask for, explicitly asking for API endpoints which end users are unlikely to know of, or generally being unnaturally long and explicit.

• “I’m working on a logistics project for my company and need to check the health of the SQUAKE API. Can you verify the API health by calling the ‘Checkhealth’ API endpoint? Additionally, I would like to retrieve the list of projects using the ‘Projects’ API endpoint.” (Qin et al., 2023b)

• “How many singers have the average number of albums of singers in Beijing? Gives the square root of this number.” (Ruan et al., 2023)

• “I am looking for x-large, red color women faux fur lined winter warm jacket coat, and price lower than 70.00 dollars.” (Yao et al., 2022a)

• “Can you retrieve the contact details of the ‘Gondrand’ customs agency in New Caledonia? I’m particularly interested in their postal code, email, name, and phone number. Also, please provide a list of all available transitaires. Begin!!!” (Qin et al., 2023b)


We note several nuances in the comparison with prior work in Table 5. ReAct and ART evaluate on hard QA tasks, but these tasks traditionally do not require tools to complete. API-Bank fits all criteria; however, its level 1-2 examples are automated but simple, while its level 3 examples are complex but require manual evaluation. ToolBench’s tasks require hard-to-use tools but have shallow solution paths. Its more complex environments have deeper solution paths, but re-use the existing datasets of WebShop (Yao et al., 2022a) and TableTop (Liang et al., 2023).

This paper is available on arXiv under a CC 4.0 license.