Function Calling LLMs: Combining Multiple Models for Better RAG Performance

In 2024, there is no shortage of LLM-based applications. Major players like Microsoft-backed OpenAI and Google are releasing ever-more-powerful versions of their flagship ChatGPT and Gemini models, and specialist players like Anthropic are shipping competing offerings with additional integrations. In the applied LLM world, companies and even governments are experimenting with chat applications for a variety of contexts.

Despite the enormous entrepreneurial energy poured into LLMs, most high-profile applications are still limited by their focus on chat-like interfaces that take in and return human-readable text. This is understandable - after all, one of the most exciting developments of LLMs is the ability to comprehend and generate human language, allowing for a conversational user interface. However, chat interfaces overlook another very important use case for LLMs - text processing as an intermediate step in a larger data pipeline.

Today, we will explore this use case and see how LLMs can be useful as part of a data pipeline, not just as a natural language generator.

Function Calling LLMs

In order to use LLMs as part of a data pipeline, we need to change the output format of the LLM - instead of generating paragraphs for humans to read, the model has to generate something that computers can read. Usually, this means structured output in a data format like JSON or instructions in a programming language like Python or SQL. These output formats are a lot less forgiving than natural language, since a missing quotation mark or bracket can crash the entire pipeline. Therefore, we will need to rely on specialized methods and features that have been developed to support this use case. These features are collectively referred to as function calling, since the output tends to call functions or be consumed in function calls.
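To make the "less forgiving" point concrete, here is a minimal sketch of the kind of check a pipeline might run on a model's JSON output before passing it downstream. The function name `parse_model_output` and the expected keys (`company`, `amount_usd`) are made up for illustration; they are not part of any particular model's API.

```python
import json

# Hypothetical schema for this illustration: the keys we expect the model to return.
REQUIRED_KEYS = {"company", "amount_usd"}

def parse_model_output(raw: str) -> dict:
    """Parse a model's JSON output and check that it has the expected shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        # A single missing quote or bracket ends up here instead of crashing later.
        raise ValueError(f"model output is not valid JSON: {err}") from err
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"model output is missing keys: {sorted(missing)}")
    return data

# Made-up example output, not real data.
print(parse_model_output('{"company": "ExampleCo", "amount_usd": 1000}'))
```

Failing loudly at this boundary keeps a malformed response from propagating into later pipeline stages, where the error would be harder to trace.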

Before we delve into these methods, though, let’s first explore in a little more depth why function calling features were developed in the first place.

Use Cases

Connecting to external services

The original and most straightforward use case for function calling is connecting to external services. In June 2023, a few months after releasing GPT-4, OpenAI updated its API with a feature that allowed the user to provide the model with a set of Python-style functions that the model could call to accomplish its goals. The model could, for example, be told that it can call an external function to calculate the price of a complex security. Using this information, the model would be able to write code that calculates the value of a portfolio made up of these securities, without having to possess specialized knowledge of securities pricing.
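As a rough sketch of the receiving end of this pattern, the snippet below takes a function call that a model has emitted as JSON and routes it to a local Python function. The JSON shape (`name` plus `arguments`), the `price_security` function, and its prices are all assumptions made up for this example; they do not reproduce any provider's exact wire format.

```python
import json

def price_security(ticker: str, quantity: int) -> float:
    """Hypothetical pricing function; a real one would call a pricing service."""
    prices = {"AAA": 10.0, "BBB": 25.5}  # made-up prices for illustration
    return prices[ticker] * quantity

# Registry of the functions the model is allowed to call.
TOOLS = {"price_security": price_security}

def dispatch(call_json: str):
    """Run the function named in a model-generated call and return its result."""
    call = json.loads(call_json)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"model requested an unknown function: {name}")
    return TOOLS[name](**call["arguments"])

# Stand-in for what a function-calling model might return.
model_output = '{"name": "price_security", "arguments": {"ticker": "BBB", "quantity": 4}}'
print(dispatch(model_output))  # 102.0
```

Keeping an explicit registry means the model can only trigger functions we have deliberately exposed.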

Calling Python-style functions is only the tip of the iceberg. Once the importance of function-calling was proven in the market, OpenAI and other LLM providers started to support other output formats, such as JSON or SQL. The important thing was that these models were producing machine-readable output, which could be reliably read by other processes.

Astute readers might recognize that we have done something similar in previous articles, using LLMs to generate training data, SQL, or JSON. In those articles, we accomplished this using prompt engineering and output-checking, but now that function-calling is a widely-supported feature, we can do this more easily by relying on these model-level features.

Chaining LLMs

Function calling created new possibilities for multi-LLM applications, and soon developers were experimenting with chaining LLMs to create sophisticated systems. Some of these systems became known as agents, which could independently search the web, gather new data, and call another LLM with the new information. These pipelines can operate with a surprising amount of autonomy and solve sophisticated problems with very little input, though limitations such as API costs and behavioral safeguards still prevent agents from gaining widespread adoption.

Using LLMs for input and intermediate processing

Another use case for function-calling LLMs is input and intermediate data processing. LLMs can be used to parse unstructured inputs into structured data that can be used for downstream processing. While many of these tasks tend to be relegated to “traditional” NLP technologies, the flexible nature of transformer-based models means that a specifically trained model can often perform much better on these tasks than other NLP technologies. As a result, many developers are trying to leverage these specialized models in their data pipelines.

Now that we’re familiar with the theory behind function-calling, let’s look at the application we will be building today.

Demo Application

CB Insights Entity Relationship Analyzer

In a previous article, I created a simple RAG application to answer natural language questions about CB Insights’ popular newsletter. In today’s article, we will build a similar question-answering application, but instead of relying on embedding search and GPT-3.5, we will be using entity recognition as the primary indexing method. Additionally, we will use DRAGON as the summarization engine so we can run the entire application on our laptop, no cloud services needed.

Why Use Entity Recognition?

Before we delve into the implementation, let’s first explore the advantages of using named entity recognition (NER) as the retrieval technology instead of embedding search. While in production we will want to use each technique according to the demands of the specific situation, NER offers several advantages over a system that relies on embedding search.

Debuggability: Because entity recognition is an easily verifiable task, the entire pipeline is much more debuggable. It is easy to check if the model correctly identifies all of the entities, and it is easy to build and improve a matching algorithm based on these entities. Comparatively, it is much more difficult to verify that the embedding algorithm is properly identifying similarities and differences in passages.
Flexibility: By separating recognition and filtering, we make the pipeline much more flexible than an embedding-based search algorithm. We can add additional metadata, change search algorithms based on entity type, and even use embedding algorithms on top of the NER-based search results. This allows for much more powerful features to be developed based on combinations of technologies. For example, an NER-based pipeline can use entity-based search to narrow down a set of documents, then use an embedding algorithm to narrow down further within the search space. This tends to make the search much faster and more efficient.
Identification Power: Due to the level of manual control that is possible, NER pipelines are much more suited for certain search and retrieval tasks. Embedding models are typically trained on the proximity of passages within their training data, so their similarity scores roughly reflect how likely two passages are to come from the same larger document. In some use cases, this will cause embedding search to miss important documents. For example, if we were trying to understand Microsoft’s AI strategy, we might not only want to retrieve information about Microsoft’s activities within the AI software space, but also information about investments in chipmaking, data centers, and robotics. A general-purpose embedding algorithm may miss these connections, whereas an NER-based algorithm can be set up to draw them.
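The two-stage idea mentioned under Flexibility can be sketched in a few lines: filter documents by shared entities first, then rank only the survivors by embedding similarity. Everything here is a toy assumption for illustration, including the document structure, the two-dimensional "embeddings", and the function names; a real pipeline would use vectors from an actual embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def two_stage_search(query_entities, query_vec, docs, top_k=2):
    """Filter by shared entities first, then rank the survivors by cosine similarity."""
    wanted = set(query_entities)
    # Stage 1: keep only documents that mention at least one query entity.
    candidates = [d for d in docs if wanted & set(d["entities"])]
    # Stage 2: rank the (much smaller) candidate set by embedding similarity.
    ranked = sorted(candidates, key=lambda d: cosine(query_vec, d["vector"]), reverse=True)
    return [d["id"] for d in ranked[:top_k]]

# Toy corpus with made-up vectors.
docs = [
    {"id": "a", "entities": ["Microsoft", "OpenAI"], "vector": [1.0, 0.0]},
    {"id": "b", "entities": ["Microsoft"], "vector": [0.0, 1.0]},
    {"id": "c", "entities": ["Databricks"], "vector": [1.0, 0.0]},
]
print(two_stage_search(["Microsoft"], [0.9, 0.1], docs))  # ['a', 'b']
```

Note that document "c" is never scored at all, even though its vector is close to the query: the entity filter removes it before the similarity step runs.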

Now that we understand the benefits of using NER within a RAG pipeline, let’s get a deeper look at the specific technologies we’re using in our application.

Technologies Used

LLM Entity Recognition

Entity Recognition is a “traditional” NLP technology for extracting structured data from unstructured language data. The extracted data can then be used in downstream processing or as metadata for the passage being analyzed.

Historically, this was accomplished using small, purpose-built machine learning algorithms that first tag the parts of speech, followed by a second pass that determines whether the proper nouns identified are named entities.

With function-calling LLMs, we can accomplish the same task, with several added advantages.

Because transformer architecture is able to understand the language structure better than smaller NLP models, the performance is likely to be more robust, for example in cases where the passage is improperly formatted.
Because LLMs are being updated with the most recent training data, the entity recognition system is likely to stay more up-to-date than other open-source entity recognition models.
Prompt engineering techniques can be used to embed additional data and give additional instructions, allowing for more flexible behavior even in off-the-shelf models.

DRAGON

We have covered DRAGON in a previous article where we highlighted its impressive performance in summarizing natural-language data. We will be using DRAGON to perform the final step of analysis, where we will summarize all of the articles relevant to the selected entities.

SLIMs

SLIMs are a new family of miniaturized, locally-runnable models from LLMWare - it’s no secret I am a fan of their miniaturization work - that specialize in function calling. These models have been fine-tuned specifically to generate outputs that can be interpreted by machines, allowing users to take advantage of the recent developments in function calling architecture without having to rely on external APIs.

Today we will be using the SLIMs-NER model, which will perform the Named Entity Recognition on the newsletter articles. In addition to NER, there are also SLIMs models for sentiment analysis, SQL generation, and multi-step agents.

Now that we understand the technology, let’s implement the application!

Implementation

Data Downloading and Processing

Let’s begin by downloading the CB Insights articles. Let’s import the dependencies:

import requests
from bs4 import BeautifulSoup
import os
import pandas as pd
import json
import re

And now the code to download the newsletter archive:

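# Fetch the newsletter archive index page
res = requests.get('https://www.cbinsights.com/newsletter/')

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(res.text, 'html.parser')
# Collect [title, url] pairs; href=True skips anchors that have no href attribute
article_links = [[i.text, i['href']] for i in soup.find_all('a', href=True) if 'campaign-archive' in i['href']]
# Download and parse each newsletter issue
article_soups = [BeautifulSoup(requests.get(link).text, 'html.parser') for title, link in article_links]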

Now that we have downloaded the newsletter archive, let’s process it into JSON format
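The loop below calls `get_deduped_article_tables`, which is not defined in this article. The version here is only a guess at what such a helper might do: it assumes the helper receives a list of table texts and returns them stripped, with blanks and exact repeats removed, in their original order. The original helper may well behave differently, so treat this as a placeholder that makes the code runnable rather than the author's implementation.

```python
def get_deduped_article_tables(table_texts):
    """Hypothetical helper: strip each table's text and drop blanks and exact
    repeats, keeping the first-seen order."""
    seen = set()
    deduped = []
    for text in table_texts:
        cleaned = text.strip()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            deduped.append(cleaned)
    return deduped

# Toy input standing in for the text of several HTML tables.
print(get_deduped_article_tables([" Header ", "Header", "", "Body text"]))
# ['Header', 'Body text']
```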

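# Map each newsletter title to a list of sections, where each section is a list of table texts.
# get_deduped_article_tables must be defined before this loop runs.
result_json = {}
for soup_meta, soup_art in zip(article_links, article_soups):
    article_tables = []
    cur_article = []
    for table in soup_art.find_all('table'):
        if table.attrs.get('mc:variant') == 'Section_Divider':
            # A divider closes the current section and starts a new one
            article_tables.append(get_deduped_article_tables(cur_article))
            cur_article = []
        else:
            cur_article.append(table.text)
    # Flush the final section, which has no trailing divider
    article_tables.append(get_deduped_article_tables(cur_article))
    result_json[soup_meta[0]] = article_tables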

articles_with_meta = []
for name, tables in result_json.items():
    # The second entry of the first section is parsed as the issue date
    issue_date = pd.to_datetime(tables[0][1].strip())
    print(name, issue_date)
    articles_with_meta.append({
        'name': name,
        'date': issue_date.strftime('%Y-%m-%d'),
        'tables': tables,
    })
df = pd.DataFrame(articles_with_meta)

Now we are at the same place we were at in the previous article. Instead of going right to the RAG construction by creating embeddings, however, we will run an entity-recognition step.

First, let’s import the dependencies for running SLIMs, a new set of models from LLMWare that allows us to perform entity recognition locally on our machine:

from llmware.agents import LLMfx
from collections import defaultdict

# define a function for identifying all the named entities
def run_ner(text):
    agent = LLMfx()
    agent.load_work(text)    # queue the text to analyze
    agent.load_tool("ner")   # load the NER tool
    named_entities = agent.ner()
    # "llm_response" holds the dictionary of entity types and detected entities
    ner_dict = named_entities["llm_response"]
    return ner_dict

Now we can run the named entity recognition on all of the articles:

date_entities = defaultdict(dict)
for _, row in df.iterrows():
    for idx, t in enumerate(row['tables'][1:]):
        if 'Start Your Free Trial' in t:
            t = t[:t.index('Start Your Free Trial')]
        date_entities[row['date']][idx] = run_ner('\n'.join(t))

It might take a few minutes for the NER pipeline to complete, but this is all you need to recognize entities using a state-of-the-art miniaturized LLM.

You can check the date_entities dictionary by printing some test items. For example, the code:

date_entities[list(date_entities.keys())[0]]

Should produce the output below:

{0: {'people': ['Yahoo!'],
  'place': [],
  'company': ['Databricks', 'MosaicML'],
  'misc': []},
 1: {'people': [],
  'place': ['New York'],
  'company': ['CB Insights'],
  'misc': []}}

Which shows all of the various entities that have been detected by the SLIMs model.
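The question-answering function later in this article reads from an `entity_to_place_map` that is not defined in the text. Judging from how it is used there, it appears to map each entity name to the (date, section index) pairs where that entity occurs. Below is a minimal sketch of how such a map could be built by inverting `date_entities`; the helper name `build_entity_to_place_map` and the placeholder date are assumptions for illustration.

```python
from collections import defaultdict

def build_entity_to_place_map(date_entities):
    """Invert {date: {section_idx: {entity_type: [names]}}} into
    {name: [(date, section_idx), ...]}."""
    entity_to_place = defaultdict(list)
    for date, sections in date_entities.items():
        for idx, entities_by_type in sections.items():
            # Merge all entity types and drop empty strings and repeats within a section
            names = {n for ent_list in entities_by_type.values() for n in ent_list if n}
            for name in sorted(names):
                entity_to_place[name].append((date, idx))
    return entity_to_place

# Toy input shaped like the output above; the date is a placeholder.
sample_entities = {
    "2024-01-01": {
        0: {"people": ["Yahoo!"], "place": [], "company": ["Databricks", "MosaicML"], "misc": []},
        1: {"people": [], "place": ["New York"], "company": ["CB Insights"], "misc": []},
    }
}
entity_to_place_map = build_entity_to_place_map(sample_entities)
print(entity_to_place_map["Databricks"])  # [('2024-01-01', 0)]

# In the article's pipeline, the call would instead be:
# entity_to_place_map = build_entity_to_place_map(date_entities)
```

Storing (date, index) tuples keeps the entries hashable, which is what allows the lookup step to combine them with set operations.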

Question-Answering with DRAGON

Now that the entities have been detected, let’s build a question-answering workflow that demonstrates the power of this technique. For our example, we will use the test question: What role does OpenAI play in Microsoft's AI strategy?

Let’s begin by importing the proper packages for running DRAGON:

from llmware.prompts import Prompt
query = "What role does OpenAI play in Microsoft's AI strategy?"

model_name = "llmware/dragon-llama-7b-gguf"
prompter = Prompt().load_model(model_name)

Now, we can build the function that answers the question using entity recognition. In order to leverage our NER data, we will want to implement the following workflow:

Identify the entities in the user query. We can use the same SLIMs function we used in the previous step.
Perform a lookup based on the results of the entity recognition, that is, we search for articles in the corpus that talk about the same entities.
Once we find these articles, we can supply them to the question-answering model to extract the relevant information.

Here is what the workflow looks like in code form:

def answer_question_with_ner(query):
    ner_results = run_ner(query) # run NER on the user query
    search_entities = []
    for ent_type, ent_list in ner_results.items():
        search_entities.extend(ent_list)
    # create a deduplicated list of entities to search for
    search_entities = list(set(search_entities))
    if not search_entities:
        return None # no entities were recognized, so there is nothing to look up

    # now perform a lookup for articles that mention all of the identified entities, using set intersection.
    # entity_to_place_map maps each entity to a list of (date, section index) pairs;
    # .get() avoids a KeyError for entities that never appear in the corpus.
    articles = set(entity_to_place_map.get(search_entities[0], []))
    for se in search_entities[1:]:
        articles &= set(entity_to_place_map.get(se, []))

    # now process the corpus into a prompt and feed it to the question-answering LLM.
    article_content = []
    for article in articles:
        # +1 because the NER step skipped the first section of each newsletter
        article_content.extend(df[df['date'] == article[0]].iloc[0]['tables'][article[1]+1])
    response = prompter.prompt_main(query,
        context='\n'.join(article_content),
        prompt_name="default_with_context", temperature=0.3)
    return response # return the response.

Run the function, with the query from the first code block, and you should see the following result:

Microsoft has poured billions of dollars into ChatGPT developer OpenAI. However, it's also placed a number of bets on other GenAI startups across computing, mapping, and gaming.

You can also see in the “evidence” section of the response object that the two articles from the CB Insights archive that mention Microsoft and OpenAI have indeed been retrieved, and that the LLM’s answer is based directly on the evidence.

Note that because we are doing explicit entity recognition, the retrieval process is extremely transparent and debuggable. Should a question ever arise as to why the model retrieved these specific pieces of information from the corpus - all it takes is a simple print statement to see that the model chose the article because the query mentions “Microsoft” and “OpenAI”, and the two retrieved newsletter sections are the only sections that mention both entities.
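The "simple print statement" can be sketched as a small helper that shows where each query entity appears and which sections mention all of them. The function name `explain_retrieval` and the toy map below are made up for illustration; the map is assumed to have the entity-to-(date, section index) shape that the lookup code relies on.

```python
def explain_retrieval(search_entities, entity_to_place_map):
    """Print where each entity appears and return the sections shared by all of them."""
    shared = None
    for entity in search_entities:
        places = set(entity_to_place_map.get(entity, []))
        print(f"{entity!r} appears in {sorted(places)}")
        shared = places if shared is None else shared & places
    shared = shared or set()
    print(f"sections mentioning every entity: {sorted(shared)}")
    return shared

# Toy map with placeholder dates, not real newsletter data.
toy_map = {
    "Microsoft": [("2024-01-01", 0), ("2024-01-08", 2)],
    "OpenAI": [("2024-01-08", 2), ("2024-01-15", 1)],
}
explain_retrieval(["Microsoft", "OpenAI"], toy_map)
```

Because every retrieval decision reduces to set membership, the printed trace is a complete explanation of why a section was or was not selected.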

Additionally, when compared with embedding-based vector search, the NER lookup method can provide more precise retrieval. In my tests, the selected articles were in the top 10 most relevant articles as ranked by OpenAI’s ada embedding model, but they were not the ones identified as closest to the question at hand. Therefore, it is possible that an application using embedding search would not have answered the question correctly at all, and the lack of debuggability would only contribute to the confusion.

And with that, we have built a question-answering application using NER lookups, improving the quality of the pipeline in the process!

Conclusion

Today, we built an application using function calling, an exciting new LLM feature. Miniaturized, locally-runnable function-calling models are a revolutionary development that unlocks a new class of AI applications, and we are only seeing the first iteration of these technologies. It will be exciting to see what applications developers will create with these technologies in the coming months.

If you have an idea in the AI space that you want to see implemented or simply want to chat about technology, please do not hesitate to reach out on GitHub or LinkedIn.

If you want to learn more about LLMWare, the company behind SLIMs and DRAGON, you can find them on Hugging Face or GitHub.
