How ChatGPT Can Learn to Use Tools and Plugins

by Laszlo Fazekas, April 17th, 2023

Large Language Models (LLMs) like ChatGPT are super cool and have changed everything, but they have some serious limitations.

One of these limitations is that these models are prewired. This means that they are trained on a big set of documents, so they know a lot, but they cannot learn new things afterwards (I have a full article about how neural networks are trained). From this perspective, the training of neural networks is much more similar to instincts shaped by evolution than to how we learn new things in school. But if they cannot learn new things, then how can they learn to use tools like calculators, or search the web with Google?

To investigate how this voodoo magic works, I made a small Vue.js app using LangChain.js. LangChain.js is a wrapper library for LLMs and LLM-related things like prompts, agents, vector databases, etc. With LangChain you can use tools with OpenAI models, a concept very similar to plugins in ChatGPT. Fortunately, LangChain works fine in browsers, so if you have a browser app, you can simply inspect the API calls in the network tab.

Now, after the theory, let’s see how it works. My code is available here. It is very simple, but in this case, the code is less relevant.

In the code, I created a LangChain agent executor that can use the calculator tool:

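    // Import paths are an assumption; they depend on the LangChain.js version.
    import { OpenAI } from "langchain/llms/openai";
    import { initializeAgentExecutor } from "langchain/agents";
    import { Calculator } from "langchain/tools/calculator";

    // temperature: 0 keeps the completions as deterministic as possible
    const model = new OpenAI({
      temperature: 0,
      openAIApiKey: process.env.OPENAI_API_KEY,
    });
    // The calculator is the only tool the agent may call
    const tools = [new Calculator()];
    // Inside the Vue component: build an executor for a zero-shot ReAct agent
    this.executor = await initializeAgentExecutor(
      tools,
      model,
      "zero-shot-react-description"
    );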

I sent the following prompt to see how the tool is used:

How much is 10+12+33+(5*8)?

LangChain sent the following prompt to OpenAI's text-davinci-003 model:

Answer the following questions as best you can. You have access to the following tools:

calculator: Useful for getting the result of a math expression. 
The input to this tool should be a valid mathematical expression 
that could be executed by a simple calculator.

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [calculator]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: How much is 10+12+33+(5*8)?
Thought:
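
To make the structure of this prompt easier to see, here is a minimal sketch of how such a prompt could be assembled from a list of tool descriptions. It is not LangChain's actual source: `buildPrompt` and the `tools` array are hypothetical names of mine, and the wording is simply copied from the prompt captured above.

```javascript
// Hypothetical helper: assembles a ReAct-style prompt from tool descriptions.
// The text mirrors the captured prompt; this is not LangChain's implementation.
function buildPrompt(tools, question, scratchpad = "") {
  const toolLines = tools.map((t) => `${t.name}: ${t.description}`).join("\n");
  const toolNames = tools.map((t) => t.name).join(", ");
  return [
    "Answer the following questions as best you can. You have access to the following tools:",
    "",
    toolLines,
    "",
    "Use the following format:",
    "",
    "Question: the input question you must answer",
    "Thought: you should always think about what to do",
    `Action: the action to take, should be one of [${toolNames}]`,
    "Action Input: the input to the action",
    "Observation: the result of the action",
    "... (this Thought/Action/Action Input/Observation can repeat N times)",
    "Thought: I now know the final answer",
    "Final Answer: the final answer to the original input question",
    "",
    "Begin!",
    "",
    `Question: ${question}`,
    // The scratchpad holds earlier Thought/Action/Observation steps, if any.
    `Thought:${scratchpad}`,
  ].join("\n");
}

const tools = [
  {
    name: "calculator",
    description:
      "Useful for getting the result of a math expression. " +
      "The input to this tool should be a valid mathematical expression " +
      "that could be executed by a simple calculator.",
  },
];

const prompt = buildPrompt(tools, "How much is 10+12+33+(5*8)?");
console.log(prompt);
```

Adding a second tool would only mean adding another entry to `tools`: its description lands in the prompt and its name in the `Action` list.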

The response was the following:

I need to calculate the expression
Action: calculator
Action Input: 10+12+33+(5*8)
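
A framework has to turn this free-text response into a tool call. The sketch below shows one way that could be done with regular expressions. `parseAgentOutput` is a hypothetical function written for illustration, and I am only assuming the response follows the Action / Action Input format seen above.

```javascript
// Hypothetical parser for the "Action / Action Input / Final Answer" format.
function parseAgentOutput(text) {
  // If the model already produced a final answer, there is no tool to call.
  const final = text.match(/Final Answer:\s*(.*)/s);
  if (final) {
    return { type: "finish", output: final[1].trim() };
  }
  // Otherwise extract the tool name and the input the model wants to pass.
  const action = text.match(/Action:\s*(.*?)\s*\n\s*Action Input:\s*(.*)/s);
  if (!action) {
    throw new Error(`Could not parse model output: ${text}`);
  }
  return { type: "action", tool: action[1].trim(), input: action[2].trim() };
}

const step = parseAgentOutput(
  " I need to calculate the expression\nAction: calculator\nAction Input: 10+12+33+(5*8)"
);
console.log(step);
```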

Then LangChain ran the calculator tool itself, appended the model's response and the tool's result (the Observation line) to the original prompt, and sent this merged prompt to OpenAI:

Answer the following questions as best you can. You have access to the following tools:

calculator: Useful for getting the result of a math expression. 
The input to this tool should be a valid mathematical expression 
that could be executed by a simple calculator.

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [calculator]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: How much is 10+12+33+(5*8)?
Thought: I need to calculate the expression
Action: calculator
Action Input: 10+12+33+(5*8)
Observation: 95
Thought:

And the final response was:

I now know the final answer
Final Answer: 95

Yay, the magic is revealed!

LLMs have no memory and cannot learn new things. The only thing they see is the input sequence, which is a list of tokens (a token is roughly three-quarters of an English word). For GPT-3.5 the context window is 4,096 tokens, and for the base GPT-4 model it is 8,192; the prompt and the generated answer share this budget. That is limited, but not tiny. If you want the model to use tools, you have to write the instructions into the prompt.
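
Because the tool instructions take up part of this limited token budget, it can help to estimate how much room a prompt uses. The sketch below uses a rough rule of thumb of about 4 characters per token for English text, which is an assumption and not an exact tokenizer. `estimateTokens` and `fitsInBudget` are hypothetical helpers, and I am assuming the prompt and the reply share one budget.

```javascript
// Rough heuristic (assumption): about 4 characters per token for English text.
// A real tokenizer gives exact counts; this only illustrates the budgeting.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Hypothetical helper: does the prompt leave room for the model's reply?
function fitsInBudget(prompt, maxTokens, reservedForReply) {
  return estimateTokens(prompt) + reservedForReply <= maxTokens;
}

const toolInstructions =
  "calculator: Useful for getting the result of a math expression.";
const question = "Question: How much is 10+12+33+(5*8)?";
const prompt = `${toolInstructions}\n${question}`;

console.log(estimateTokens(prompt));
console.log(fitsInBudget(prompt, 4096, 256));
```

Every extra tool description makes the prompt longer, so the number of tools a model can be told about is bounded by the same budget.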

When I first saw the ChatGPT plugin API, it seemed super weird that the interface was defined in natural language. Now the reason is absolutely clear: the interface description is handed directly to the model. The model understands the description because it understands language, and it generates output in the required format. The framework parses that output and calls the chosen tool. When the tool returns its result, the framework writes it back into the prompt and sends it to the LLM again, which then gives the final answer. That’s all. A super genius and magical solution to expand the boundaries of LLMs.
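
The whole loop fits in a few lines. The sketch below is my own simplified reconstruction, not LangChain's code. `fakeModel` is a stand-in that replays the two completions captured above so the example runs without an API key, `HEADER` abbreviates the instruction text, and the toy calculator is an assumption for illustration only.

```javascript
// Abbreviated instruction header (the full text is in the captured prompt).
const HEADER =
  "Answer the following questions as best you can. " +
  "You have access to the following tools:\n\ncalculator: ...\n\nBegin!\n\n";

const tools = {
  // Toy calculator: only digits, whitespace, and + - * / ( ) . may be evaluated.
  calculator: (expr) => {
    if (!/^[\d\s+\-*\/().]+$/.test(expr)) {
      throw new Error(`Unsupported expression: ${expr}`);
    }
    return String(Function(`"use strict"; return (${expr});`)());
  },
};

// Stand-in for the LLM call (hypothetical): replays the captured completions.
async function fakeModel(prompt) {
  const afterBegin = prompt.split("Begin!")[1];
  const observation = afterBegin.match(/Observation: (.*)/);
  if (observation) {
    return ` I now know the final answer\nFinal Answer: ${observation[1]}`;
  }
  const expr = afterBegin.match(/Question: How much is (.*)\?/)[1];
  return ` I need to calculate the expression\nAction: calculator\nAction Input: ${expr}`;
}

async function runAgent(question, callModel, maxSteps = 5) {
  let scratchpad = "";
  for (let i = 0; i < maxSteps; i++) {
    // 1. Send the instructions, the question, and all earlier steps.
    const prompt = `${HEADER}Question: ${question}\nThought:${scratchpad}`;
    const completion = await callModel(prompt);

    // 2. Stop as soon as the model states a final answer.
    const final = completion.match(/Final Answer:\s*(.*)/s);
    if (final) return final[1].trim();

    // 3. Otherwise run the requested tool with the requested input.
    const action = completion.match(/Action:\s*(.*?)\s*\n\s*Action Input:\s*(.*)/s);
    if (!action) throw new Error(`Could not parse: ${completion}`);
    const tool = tools[action[1].trim()];
    if (!tool) throw new Error(`Unknown tool: ${action[1]}`);
    const observation = tool(action[2].trim());

    // 4. Write the result back into the prompt and ask again.
    scratchpad += `${completion}\nObservation: ${observation}\nThought:`;
  }
  throw new Error("No final answer within maxSteps");
}

runAgent("How much is 10+12+33+(5*8)?", fakeModel).then((answer) => {
  console.log(answer); // prints 95
});
```

Swapping `fakeModel` for a function that calls a real completion endpoint would leave the loop unchanged. The model never executes anything itself; it only writes text that the framework interprets.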

Today's models are full of lexical knowledge that is completely unnecessary, because the model can look it up on the web. If you drop this knowledge, the size of the model can be radically reduced. Such smaller models (like Alpaca) can easily run on your phone, your Raspberry Pi, or your coffee machine with roughly the same performance.

Another advantage of shrinking the model is that you can make its input wider. If the model accepts more input tokens, it will be able to use more tools, take in longer content, and give more accurate results.

I think the future is bright, and the use of tools is a quantum leap toward AGI. These models will be with us on our phones. We can talk with them through our Bluetooth headsets, or maybe through Neuralink. And spoken language is only the first step. Everything can be tokenized: sights, tactile stimuli, anything. In the future, these models could see through our smart glasses, or maybe feel what we feel through a BCI, help us as an integrated part of ourselves, and connect us to the Internet through tools as a new layer of our brain. The exocortex…