Privacy is a top concern when discussing ChatGPT-like tools with professionals.
On May 2nd, 2023, Iván Martínez showed how open-source models and tools like @LangChainAI enable 100% local execution, ensuring your data never leaves your environment.
The resulting project, privateGPT, enables local execution of language models, keeping data within the user's environment.
It enhances data privacy, security, and control, reducing the compliance risks associated with traditional ChatGPT-like tools.
privateGPT allows you to interact with large language models (LLMs) without requiring an internet connection.
With privateGPT, you can work with your documents by asking questions and receiving answers using the capabilities of these language models.
Privacy is one of the project's central features.
privateGPT ensures that none of your data leaves the environment in which it is executed.
This means that all interactions, including document ingestion and question-answering, occur within your execution environment without transmitting data over the Internet.
This focus on privacy aims to protect your sensitive information and maintain confidentiality.
To summarize, privateGPT enables offline usage of language models, allowing you to interact with your documents by asking questions and obtaining responses. It emphasizes data privacy by ensuring that your data remains within your execution environment and is not transmitted over the Internet.
Note: During the pip install process described below, you may need to install a C++ compiler on your computer.
To set up your environment and run the code provided, follow the steps below:
Make sure you have Python and pip installed on your system. Then, open a terminal or command prompt and navigate to the project directory. Run the following command to install the required dependencies:
pip install -r requirements.txt
This command will automatically install all the necessary packages and libraries for running the code.
Next, you need to download the two models required for the code. These models are the LLM (GPT4All-J compatible model) and the Embedding model. Follow the instructions below to download and place them in a directory of your choice:
LLM Model: Download the LLM model compatible with GPT4All-J. The default model is named "ggml-gpt4all-j-v1.3-groovy.bin". If you prefer a different GPT4All-J compatible model, you can download it from a reliable source. Once downloaded, place the model file in a directory of your choice.
Embedding Model: Download the Embedding model compatible with the code. The default model is named "ggml-model-q4_0.bin". If you prefer a different compatible Embeddings model, download it and save it in the same directory as the LLM model.
In the project directory, locate the file named "example.env". Rename this file to ".env" (with a dot at the beginning). This file contains the configuration variables that need to be set appropriately.
Open the ".env" file in a text editor and modify its variables according to your setup.
Note: When specifying the LLAMA embeddings model path in the LLAMA_EMBEDDINGS_MODEL variable, make sure to provide the absolute path. Home directory shortcuts like "~/path/to/model" or "$HOME/path/to/model" will not work.
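Since shortcuts such as "~" and "$HOME" are not expanded, one option is to resolve the path yourself and paste the result into the ".env" file. The following is a minimal sketch using only the Python standard library; the helper name to_absolute_model_path and the example location are hypothetical and not part of the project.

```python
import os


def to_absolute_model_path(raw_path: str) -> str:
    """Expand ~ and environment variables, then return an absolute path."""
    expanded = os.path.expandvars(os.path.expanduser(raw_path))
    return os.path.abspath(expanded)


if __name__ == "__main__":
    # Hypothetical location; substitute wherever you saved the embeddings model.
    print(to_absolute_model_path("~/models/ggml-model-q4_0.bin"))
```

The printed value is what you would assign to LLAMA_EMBEDDINGS_MODEL.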
Save the ".env" file after making the necessary changes.
Once you have completed the steps above, you are ready to run the code with your customized environment settings. Execute the code using your preferred method, such as running a script or using an integrated development environment (IDE).
Ensure that your code reads the environment variables from the ".env" file at the beginning of the execution. This will ensure that the correct model paths and settings are used.
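How the variables get loaded depends on the script; Python projects commonly rely on the python-dotenv package for this. Purely as an illustration of the idea, here is a standard-library sketch that parses simple KEY=VALUE lines and copies them into os.environ before anything else reads the settings. The function name load_env_file is hypothetical, and the parser is deliberately limited: it skips comments and blank lines, strips surrounding quotes, and does no variable interpolation.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Read simple KEY=VALUE lines and export them to os.environ.

    Blank lines and lines starting with '#' are skipped. Variables that
    are already set in the environment are left untouched.
    """
    loaded = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        loaded[key] = value
        # setdefault keeps any value already exported in the shell.
        os.environ.setdefault(key, value)
    return loaded
```

Calling load_env_file() as the first statement of a script makes values such as LLAMA_EMBEDDINGS_MODEL available through os.environ for the rest of the run.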
To ingest your own dataset into the local vectorstore using the provided code, follow the instructions below:
Gather all the documents you want to ingest into the vectorstore. The supported file formats are .txt, .pdf, and .csv. Ensure that your documents are properly formatted and contain the relevant text data.
Create a directory named "source_documents" in the same directory as the code files. This directory will hold all your dataset files. Place all the .txt, .pdf, or .csv files into the "source_documents" directory.
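Before running the ingestion, it can help to confirm which files in the folder will actually be picked up. The sketch below lists the files with a supported extension; the function name find_ingestable_files is hypothetical, and it assumes files sit directly inside "source_documents" rather than in subfolders.

```python
from pathlib import Path

# File formats named in this article.
SUPPORTED_SUFFIXES = {".txt", ".pdf", ".csv"}


def find_ingestable_files(folder: str = "source_documents") -> list:
    """Return the supported files directly inside the folder, sorted by path."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )


if __name__ == "__main__":
    for path in find_ingestable_files():
        print(path.name)
```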
Open a terminal or command prompt and navigate to the directory where the code files are located. Use the following command to run the ingestion process:
python ingest.py
This command starts the ingestion process. Depending on the size of your documents and the processing power of your system, it may take some time to complete, so allow the command to finish.
The ingestion process creates a folder named "db" in the same directory as the code files. This folder contains the local vectorstore, where your ingested documents' embeddings are stored.
During the ingestion process, no data leaves your local environment. The ingestion is performed entirely on your machine, and you can even perform it without an internet connection.
Once the ingestion process is complete, you can start using the ingested data for various tasks such as similarity search or text analysis. The local vectorstore contains the accumulated embeddings of all the ingested documents, allowing you to perform operations on your dataset efficiently.
Note: If you want to start with an empty database and remove all previously ingested data, simply delete the "db" folder before running the ingestion command again.
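If you reset the database often, the deletion can be scripted. A minimal sketch, assuming the vectorstore lives in a folder named "db" next to the code files as described above; the function name reset_vectorstore is hypothetical.

```python
import shutil
from pathlib import Path


def reset_vectorstore(db_dir: str = "db") -> bool:
    """Delete the vectorstore folder if it exists; return True if it was removed."""
    path = Path(db_dir)
    if path.is_dir():
        shutil.rmtree(path)
        return True
    return False


if __name__ == "__main__":
    removed = reset_vectorstore()
    print("Removed existing database." if removed else "No database folder found.")
```

The deletion is permanent, so re-run the ingestion command afterwards to rebuild the embeddings.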
To ask questions of your documents locally, run the following command:
python privateGPT.py
Note: The script works offline, and no data leaves your local environment.
The process involves selecting appropriate local models and utilizing the power of LangChain. By doing so, you can perform the entire pipeline within your own environment, without any data leaving it, while maintaining reasonable performance.
Firstly, the "ingest.py" script uses LangChain tools to parse the documents and generate embeddings (numerical vector representations) of the text.
This is done locally using LlamaCppEmbeddings. The resulting embeddings are then stored in a local vector database using the Chroma vector store.
Next, the "privateGPT.py" script uses a local large language model (LLM) based on either GPT4All-J or LlamaCpp. It uses this model to comprehend questions and generate answers.
To provide context for the answers, the script extracts relevant information from the local vector database. It achieves this by performing a similarity search, which helps locate the appropriate piece of context from the stored documents.
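To make the idea of a similarity search concrete, here is a toy sketch in plain Python. It is not how Chroma is implemented: the three-dimensional vectors are made up for illustration (real embeddings have far more dimensions), the function names are hypothetical, and cosine similarity is used as one common choice of metric, which may differ from what the vector store actually uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2):
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up (text, embedding) pairs standing in for ingested document chunks.
STORED = [
    ("invoice terms", [0.9, 0.1, 0.0]),
    ("holiday policy", [0.0, 0.2, 0.9]),
    ("payment deadline", [0.8, 0.3, 0.1]),
]

if __name__ == "__main__":
    # Made-up embedding of a question about payments.
    print(top_k([1.0, 0.2, 0.0], STORED))
```

The texts returned by such a search are what gets passed to the language model as context alongside the question.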
Additionally, it's worth noting that the GPT4All-J wrapper was introduced in LangChain version 0.0.162, facilitating the integration of GPT4All-J within the LangChain framework.
Overall, this approach allows you to run the entire process locally, without sharing your data, and achieve satisfactory performance by leveraging local models and the capabilities of LangChain.
Here is the link to the repo: https://github.com/imartinez/privateGPT
Also published here.