With the rapid integration of generative AI into the application development process, we are seeing an increasing need to integrate our private data with the public data that is being used to train
In a recent webinar on
In this post, we’ll explain how LlamaIndex can be used as a framework for data integration, data organization, and data retrieval for all your private data generative AI needs.
As stated earlier, LlamaIndex is an orchestration framework or “data framework” that simplifies building LLM applications. It provides the ability to perform data augmentation of private data, enabling it to be incorporated into LLMs for knowledge generation and reasoning. At the heart of all generative AI functionality is data. Enterprise applications need to be able to access more than just the public data that LLMs are trained on and need to incorporate structured, unstructured, and semi-structured data from all their internal and external data sources for building applications.
It is this integration of data that LlamaIndex provides, bringing in data from multiple unique sources.
LlamaIndex, formerly known as GPT Index, is a framework that provides the tools needed to manage the end-to-end lifecycle for building LLM-based applications. The challenge with building LLM-based applications is that they need data, typically from multiple different sources. Unless there is strong adherence to a common data representation, that data arrives in many different formats: some highly structured, some unstructured, and some in between.
That is where LlamaIndex provides the toolbox to unlock this data with tools for data ingestion and data indexing. Once ingested and indexed,
LlamaIndex has hundreds of data loaders that provide the ability to connect custom data sources to LLMs. These range from pre-built connectors for solutions like Airtable, Jira, Salesforce, and more, to generic plugins for loading data from files, JSON documents, simple CSV, and unstructured data.
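To make the idea of a data loader concrete, here is a minimal sketch in plain Python. It does not use the LlamaIndex API; the `Document` class and the `load_json` and `load_csv` helpers are hypothetical names invented for illustration. The point is only that each loader converts a different source format into one common representation that can be indexed later.

```python
import csv
import io
import json
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical common representation for text from any source."""
    text: str
    metadata: dict = field(default_factory=dict)


def load_json(raw: str, source: str) -> list[Document]:
    """Turn a JSON array of objects into one Document per object."""
    records = json.loads(raw)
    return [
        Document(
            text=json.dumps(record, sort_keys=True),
            metadata={"source": source, "row": i},
        )
        for i, record in enumerate(records)
    ]


def load_csv(raw: str, source: str) -> list[Document]:
    """Turn each CSV row into a Document rendered as 'column: value' pairs."""
    reader = csv.DictReader(io.StringIO(raw))
    return [
        Document(
            text=", ".join(f"{key}: {value}" for key, value in row.items()),
            metadata={"source": source, "row": i},
        )
        for i, row in enumerate(reader)
    ]


# Made-up sample data standing in for two different internal sources.
json_raw = '[{"ticket": "T-1", "status": "open"}, {"ticket": "T-2", "status": "closed"}]'
csv_raw = "name,plan\nAcme,enterprise\nGlobex,starter\n"

documents = load_json(json_raw, "tickets.json") + load_csv(csv_raw, "accounts.csv")
for doc in documents:
    print(doc.metadata["source"], "->", doc.text)
```

Whatever the source looked like, everything downstream only has to deal with a list of `Document` objects.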
A complete list of data loaders can be found on the
Once data is ingested, it needs to be mathematically represented so that it can be easily queried by an LLM. With LlamaIndex, an index simply provides the ability to represent data mathematically in multiple different dimensions. Indexing data isn’t a new concept. However, with machine learning, we can expand the granularity of indexing from one or two dimensions (key/value representation for example) to hundreds or thousands of dimensions.
The most common approach to indexing data for machine learning and LLMs is called a vector index; once data has been indexed, the mathematical representation of the data is called a vector embedding. There are many types of indexing and embedding models, but once data has been embedded, its mathematical representation can be used for semantic search, because items such as pieces of text with similar meanings will have similar mathematical representations. For example, king and queen might be highly related if the query is about royalty but not highly related if the query is about gender.
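The king and queen example can be sketched with a few hand-made vectors. This is a toy illustration only: the three-dimensional "embeddings" below are invented numbers (real embedding models produce hundreds or thousands of dimensions), and cosine similarity is used here as one common way to compare vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy embeddings; the axes are (royalty, masculine, feminine).
EMBEDDINGS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}


def rank(query_vector):
    """Return the words ordered from most to least similar to the query."""
    return sorted(
        EMBEDDINGS,
        key=lambda word: cosine_similarity(EMBEDDINGS[word], query_vector),
        reverse=True,
    )


royalty_query = [1.0, 0.0, 0.0]   # a query that only cares about royalty
feminine_query = [0.0, 0.0, 1.0]  # a query that only cares about gender

print("royalty :", rank(royalty_query))
print("feminine:", rank(feminine_query))
```

Against the royalty query, king and queen score the same and sit together at the top; against the gender query, they land far apart in the ranking.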
This is where some of the real power of LlamaIndex and LLMs comes into play. Because querying data using LlamaIndex isn’t a complex series of commands to merge/join and find the data, it is represented as natural language via a concept called
LlamaIndex offers several different indexing models that are designed to provide optimizations around how you want to explore and categorize your data. This is ultimately where a lot of gains can be achieved: if you know the type of operation your application needs to perform on the data, leveraging a specific type of index can provide significant benefit to the application that uses the LLM and issues the query.
A list index is an approach that breaks down the data and represents it in the form of a sequential list. The advantage is that, while the data can be explored in a multidimensional manner, the primary optimization for querying the data is a sequential pattern. This type of index works well with structured objects that occur over time, such as change logs where you want to query how things have changed over time.
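As a rough sketch of the sequential idea (not the LlamaIndex implementation), the hypothetical `ToyListIndex` below stores chunks in insertion order and answers a query by scanning them front to back, so results always come back in their original order. The change log entries are made up.

```python
from dataclasses import dataclass


@dataclass
class Node:
    position: int  # where the chunk sits in the original sequence
    text: str


class ToyListIndex:
    """Hypothetical list index: nodes are kept and scanned in order."""

    def __init__(self, chunks):
        self.nodes = [Node(i, chunk) for i, chunk in enumerate(chunks)]

    def query(self, term):
        """Scan every node in sequence and keep the ones mentioning the term."""
        needle = term.lower()
        return [node for node in self.nodes if needle in node.text.lower()]


changelog = [
    "v1.0: initial release with login page",
    "v1.1: added password reset to login",
    "v1.2: new billing dashboard",
    "v1.3: login now supports SSO",
]

index = ToyListIndex(changelog)
for node in index.query("login"):
    print(node.position, node.text)
```

Because the scan preserves order, the matches read as a timeline of how the login feature changed from release to release.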
When using a tree index, LlamaIndex takes the input data and organizes it into a hierarchical tree structure where data is organized as parent and leaf nodes. A tree index provides the ability to traverse large amounts of data and construct responses where you need to extract specific segments of the text based on how the search traverses the tree. Tree indexing works best for cases where you have a pattern of information that you want to follow or validate, like building a natural language processing chatbot on top of a support/FAQ engine.
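The traversal idea can be sketched as follows. This is a simplified stand-in, not LlamaIndex code: `TreeNode`, `build_tree`, and `query` are hypothetical names, a parent's "summary" is just the union of its children's words rather than anything generated by an LLM, the fanout of two is an arbitrary choice, and the FAQ entries are made up.

```python
from dataclasses import dataclass, field


@dataclass
class TreeNode:
    keywords: set                      # stand-in for a summary of the subtree
    text: str = ""                     # only leaf nodes carry original text
    children: list = field(default_factory=list)


def build_tree(chunks, fanout=2):
    """Build the tree bottom-up: leaves first, then parents over groups."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    if not chunks:
        raise ValueError("need at least one chunk")
    level = [TreeNode(keywords=set(c.lower().split()), text=c) for c in chunks]
    while len(level) > 1:
        parents = []
        for start in range(0, len(level), fanout):
            group = level[start:start + fanout]
            merged = set().union(*(node.keywords for node in group))
            parents.append(TreeNode(keywords=merged, children=group))
        level = parents
    return level[0]


def query(root, question):
    """Walk from the root, following the child that best matches the question."""
    terms = set(question.lower().split())
    node = root
    while node.children:
        node = max(node.children, key=lambda child: len(terms & child.keywords))
    return node.text


faq = [
    "reset your password from the login page",
    "update billing details under account settings",
    "export reports as csv from the dashboard",
    "contact support by email or chat",
]

root = build_tree(faq)
print(query(root, "how do I reset my password"))
print(query(root, "export csv reports"))
```

Each query touches one node per level instead of every leaf, which is what makes this shape attractive for large FAQ-style collections.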
When using the vector store index type, LlamaIndex stores data nodes as vector embeddings. This is probably the most common indexing type, as it provides the ability to use the representation of the data in multiple different ways, including vector or similarity search. When data is indexed with a vector store index, it can be kept locally for smaller datasets used by a single application; for larger datasets, or for data used across multiple different LLMs/applications, it can be stored in a high-performance vector database like
Keyword indexing is more of the traditional approach of mapping a metadata tag, i.e., a keyword, to the specific nodes that contain that keyword. This mapping builds a web of relationships based on keywords, because a keyword may map to multiple different nodes and a node may be mapped to multiple different keywords. This indexing model works well if you are looking to tag large volumes of data and query it based on specific keywords that can be queried across multiple different datasets. Examples include legal briefings, medical records, or any other data that needs to be aligned based on specific types of metadata.
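The many-to-many relationship between keywords and nodes can be sketched with two dictionaries. Again, this is a toy illustration rather than the LlamaIndex implementation: `ToyKeywordIndex` is a hypothetical class, the keywords are supplied by hand instead of being extracted automatically, and the records are made up.

```python
from collections import defaultdict


class ToyKeywordIndex:
    """Hypothetical keyword index with a mapping in both directions."""

    def __init__(self):
        self.nodes = {}                            # node id -> text
        self.keyword_to_nodes = defaultdict(set)   # keyword -> node ids
        self.node_to_keywords = defaultdict(set)   # node id -> keywords

    def add(self, node_id, text, keywords):
        self.nodes[node_id] = text
        for keyword in keywords:
            keyword = keyword.lower()
            self.keyword_to_nodes[keyword].add(node_id)
            self.node_to_keywords[node_id].add(keyword)

    def query(self, *keywords):
        """Return the sorted ids of nodes tagged with every given keyword."""
        if not keywords:
            return []
        matches = [self.keyword_to_nodes.get(k.lower(), set()) for k in keywords]
        return sorted(set.intersection(*matches))


index = ToyKeywordIndex()
index.add("brief-1", "Supplier contract dispute", ["contract", "liability"])
index.add("brief-2", "Claim against insurer", ["liability", "insurance"])
index.add("record-7", "Cardiology visit summary", ["insurance", "cardiology"])

print(index.query("liability"))
print(index.query("liability", "insurance"))
```

One keyword fans out to several nodes and one node carries several keywords, so a query can cut across datasets (legal briefs and medical records here) as long as they share tags.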
One of the big questions that comes up is how LlamaIndex and LangChain compare: do they provide similar functionality, or do they complement each other? The reality is that LlamaIndex and LangChain are two sides of the same coin. While they are both designed to provide an interface to LLMs and machine learning in your application, LlamaIndex is designed and built specifically to provide indexing and querying capabilities for intelligent searching of data. On the other side of that coin, LangChain provides the ability to interact with data, either via natural language processing, i.e., building a chatbot to interact with your data, or by using that data to drive other functions like calling code.
LlamaIndex provides the ability to store the data you have in a variety of different formats and pull that data from a bunch of different sources, ultimately providing the how for your generative AI application.
LangChain provides the ability to do something with that data once it has been stored: generate code, provide generative question answering, and drive decisions, ultimately providing the what for your generative AI application.
With LlamaIndex you have an easy-to-use data/orchestration framework for ingesting, indexing, and querying your data for building generative AI applications. While we provide a simple example above to get started, the real power of LlamaIndex comes from the ability to build data-driven AI applications. You don’t need to retrain models; you can use LlamaIndex and a highly scalable vector database to create custom query engines, conversational chatbots, or powerful agents that tackle complex problem-solving by dynamically interpreting incoming data and making contextual decisions in real time.
So when it comes time to build a generative AI application that requires the ability to leverage your private data and incorporate that into an application’s ability to interact and respond to that data, LlamaIndex is a great place to start for ingestion, indexing, and querying. But don’t repeat the mistakes of the past and silo the data you are using, embedding, and accessing for AI applications. Build out a complete end-to-end solution that includes storing those embeddings and indexes in a highly scalable vector store like Astra DB.
To get started with LlamaIndex and to see how DataStax and LlamaIndex are better together, check out the recent DataStax blog post, “
You can find more information on how to set up and deploy Astra DB on one of the world’s highest-performing vector stores built on Apache Cassandra which was designed for handling massive volumes of data at scale. To get started for free,
- By Bill McLane, DataStax