How to Run Your Own Local LLM: Updated for 2024 - Version 2

Written by thomascherickal | Published 2024/04/26
Tech Story Tags: local-llms | run-your-own-local-llm | updated-with-latest-tools | llm-research | open-source-llm | llms-in-2024 | large-language-models | local-large-language-model

TL;DR: Think local LLMs are beyond you? Guess again! They are more accessible than ever: an absolutely fantastic collection of local LLMs, all run through open-source tools. Have a blast running these LLMs. Some of them, like oobabooga and GPT Academic, are incredibly capable in multiple domains and useful in more ways than you know!

All Images Generated With Bing Image Creator. It’s Awesome! (DALL-E-3)

This is the breakout year for Generative AI!

Yes, this article’s been written by a violinist and there’s new content! Spectacular new and awesome content!

Well, to say the very least, this year I've been spoiled for choice as to how to run an LLM locally.

Let’s start!

1. HuggingFace Transformers

Magic of Bing Image Creator — Very imaginative.

Remember, you must either download the model with internet access and save it locally or clone the model repository.

You can visit the website https://huggingface.co/models for more details.

At the time of writing, there are a stunning ~558,000 models available on the hub.

Hugging Face has become the de facto democratizer of LLMs, making nearly every open-source LLM accessible and executable without the usual mountain of expenses and bills. Basically: available, open source, and free. This is the mother lode!
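As a concrete example of the download-once, run-offline pattern, here is a minimal sketch using the small gpt2 model as a stand-in for any model you pick from the hub (the model choice and save path are just illustrations):

# pip install transformers torch
from transformers import pipeline

# First run: downloads gpt2 from the Hub and caches it
generator = pipeline("text-generation", model="gpt2")
generator.save_pretrained("./local-gpt2")

# Subsequent runs: load entirely from the local directory, no internet needed
offline = pipeline("text-generation", model="./local-gpt2")
print(offline("Local LLMs are", max_new_tokens=20)[0]["generated_text"])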

2. gpt4all

I’m sorry, I can’t describe this one!

gpt4all is an open-source project that allows anyone to access and use powerful AI models. Here are step-by-step instructions for installing and using gpt4all:

Python GPT4All:

https://pypi.org/project/gpt4all/

pip install gpt4all

This will download the latest version of the gpt4all package from PyPI.
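Once installed, text generation takes only a few lines; a minimal sketch (the model file name is one entry from the gpt4all model catalog at the time of writing, and the library downloads it on first use):

from gpt4all import GPT4All

# Downloads the model file on first use; afterwards it runs fully offline on CPU
model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")

# Open a chat session and generate a short completion
with model.chat_session():
    print(model.generate("What is the meaning of life?", max_tokens=64))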

Local Build

As an alternative to downloading via pip, you may build the Python bindings from source.

How to Build the Python Bindings:

Clone GPT4All and change directory:

git clone --recurse-submodules https://github.com/nomic-ai/gpt4all.git

cd gpt4all/gpt4all-backend

Install the Python package:

cd ../../gpt4all-bindings/python
pip install -e .

This is one way to use gpt4all locally.

The website is (unsurprisingly)

https://gpt4all.io

Like all the LLMs on this list (when configured correctly), gpt4all does not require Internet or a GPU.

3. ollama

Again, magic!

Ollama is an open-source tool that lets you download and run large language models such as Llama 2 and Mistral locally. Here are the details on its system requirements, installation, and usage:

System Requirements:

macOS, Linux, or Windows

No API key: models run entirely on your own machine

(Optional) Python 3.8 or higher, for the Python client library

Installation:

On Linux: curl -fsSL https://ollama.com/install.sh | sh (installers for macOS and Windows are on the website). The Python client is a separate package:

pip install ollama

Usage: Multi-modal

Ollama has support for multi-modal LLMs (which, as an exception, require GPUs), such as bakllava and llava:

ollama pull bakllava

Be sure to update Ollama so that you have the most recent version to support multi-modal models.

Again, refer to the Ollama documentation for additional details on all available methods.
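For scripting, the Python client is a thin wrapper over the local server; a minimal sketch, assuming the Ollama server is running and you have already pulled the llama2 model (ollama pull llama2):

# pip install ollama
import ollama

# Send a chat request to the locally running Ollama server
response = ollama.chat(
    model="llama2",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response["message"]["content"])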

The website is:

https://ollama.com

4. localllm

Defies explanation, doesn’t it?

I find that this is the most convenient way of all. The full explanation is given at the link below:

From the Google Cloud Blog: "New localllm lets you develop gen AI apps locally, without GPUs." Want to use open-source LLM models from Hugging Face in your local development environment? With localllm and Cloud Workstations, you can.

Summarized: localllm is a command-line tool from Google that runs quantized open-source models from Hugging Face entirely on CPU, so you can develop generative AI apps without a GPU, either in Cloud Workstations or on your own machine.

5. Llama 2 (Meta has already released Version 3)

Now that’s a spectacular Llama!

Steps to Use a Pre-trained, Fine-tuned Llama 2 Model Locally Using C++:

(This is on Linux, please!)

The practical C++ route is the llama.cpp project, which runs Llama-family models in the GGUF format on ordinary CPUs. The workflow, in outline:

Ensure you have the necessary build dependencies installed:

sudo apt-get install build-essential git

Clone and build llama.cpp:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

Download a Llama 2 model that has been converted to GGUF format from the Hugging Face hub (huggingface.co/models) and place it in the models directory.

Run the bundled main example, which loads the model and generates a response to your prompt:

./main -m ./models/<llama-2-gguf-file> -p "What is the meaning of life?"

If you want to write your own C++ entry point rather than use the bundled example, examples/main/main.cpp in the repository shows how to load a model, tokenize a prompt, and sample a response through the llama.h API.
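If you'd rather drive the same llama.cpp engine from Python than C++, the llama-cpp-python bindings wrap it directly; a minimal sketch, assuming you have downloaded a GGUF model file (the path below is a placeholder):

# pip install llama-cpp-python
from llama_cpp import Llama

# Load the local GGUF model file (placeholder path; point it at your download)
llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)

# Generate a completion for a prompt
out = llm("Q: What is the meaning of life? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])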

6. LLM

Explain this one. I sure can’t!

Perhaps the simplest option of the lot, a Python command-line utility called llm allows you to run large language models locally with ease.

To install:

pip install llm 
LLM can run many different models, albeit a limited set.

You can install plugins to run your llm of choice with the command:

llm install <name-of-plugin>
For example: llm install llm-gpt4all
To see all the models you can run, use the command:
llm models list
You can work with local LLMs using the following syntax:
llm -m <name-of-the-model> "<prompt>"
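The same tool doubles as a Python library; a minimal sketch, assuming the llm-gpt4all plugin is installed and the model name matches one shown by llm models list (both assumptions):

import llm

# Look up an installed local model by name (example name; pick one from `llm models list`)
model = llm.get_model("orca-mini-3b-gguf2-q4_0")

# Run a prompt and print the generated text
response = model.prompt("What is the meaning of life?")
print(response.text())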

7. llamafile

Llama with some heavy-duty options!

llamafile allows you to download LLM files in the GGUF format, import them, and run them in a local in-browser chat interface.

The easiest way to install the llamafile server on Linux or macOS is:

curl -L https://github.com/Mozilla-Ocho/llamafile/releases/download/0.1/llamafile-server-0.1 > llamafile
chmod +x llamafile
Download a model from HuggingFace and run it locally with the command:
./llamafile --model ./<gguf-file-name> 
Wait for it to load, and open it in your browser at

http://127.0.0.1:8080.

Enter the prompt, and you can use it like a normal LLM with a GUI.
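Besides the browser GUI, the llamafile server exposes an OpenAI-compatible endpoint, so you can also script against it; a minimal sketch (the /v1 path and placeholder API key follow the project's README, but treat the details as assumptions for your version):

# pip install openai
from openai import OpenAI

# Point the standard OpenAI client at the local llamafile server; no real key is needed
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-no-key-required")

completion = client.chat.completions.create(
    model="local-model",  # many local servers ignore this field; placeholder
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
)
print(completion.choices[0].message.content)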

8. ChatGPT-Next-Web

That’s Next level, all right!

ChatGPT-Next-Web, also known as NextChat, is an open-source application that provides a user interface for interacting with advanced language models like GPT-3.5, GPT-4, and Gemini-Pro. It’s built on Next.js, a popular JavaScript framework.

Here are some key features of ChatGPT-Next-Web:

Deployment: It can be deployed for free on Vercel with one click in under a minute.

Compact Client: It provides a compact client (~5 MB) that runs on Linux, Windows, and macOS.

Compatibility: It’s fully compatible with self-deployed Large Language Models (LLMs), and it’s recommended for use with RWKV-Runner or LocalAI.

Privacy: All data is stored locally in the browser, ensuring privacy.

Markdown Support: It supports Markdown, including LaTeX, mermaid diagrams, code highlighting, and more.

Prompt Templates: You can create, share, and debug your chat tools with prompt templates.

Multilingual Support: It supports multiple languages including English, 简体中文, 繁体中文, 日本語, Français, Español, Italiano, Türkçe, Deutsch, Tiếng Việt, Русский, Čeština, 한국어, Indonesia.

It’s a versatile tool that allows users to interact with AI models in a more personalized and controlled manner. It’s being used by developers and AI enthusiasts around the world to create and deploy their own ChatGPT applications.

https://github.com/ChatGPTNextWeb/ChatGPT-Next-Web

9. LocalAI

Not sure which planet, but sure looks beautiful!

LocalAI is a free, open-source project that acts as a drop-in replacement for the OpenAI API. It allows you to run various AI models locally or on-premises with consumer-grade hardware. Here are some key features of LocalAI:

Local Inference: LocalAI allows you to run Large Language Models (LLMs), generate images, audio, and more, all locally or on-premises.

Compatibility: It’s compatible with OpenAI API specifications.

No GPU Required: LocalAI does not require a GPU for operation.

Privacy: Since all data is processed locally, your data remains private.

Multiple Model Support: It supports multiple model families and architectures.

Fast Inference: Once a model is loaded for the first time, it is kept in memory for faster subsequent inference.

Community-Driven: LocalAI is a community-driven project, and contributions from the community are welcome.
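Because LocalAI mirrors the OpenAI API specification, existing OpenAI client code can be pointed at it with a one-line change; a minimal sketch, assuming LocalAI is running on its default port 8080 with a model configured under the name shown (both assumptions; adjust to your install):

# pip install openai
from openai import OpenAI

# LocalAI speaks the OpenAI wire protocol; only the base URL changes
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",  # must match a model name configured in your LocalAI instance
    messages=[{"role": "user", "content": "Say hello from LocalAI."}],
)
print(resp.choices[0].message.content)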

https://localai.io

10. gpt4free

Switzerland in the Multiverse!

GPT4Free, also known as G4F, is an open-source project that provides access to powerful language models like GPT-3.5 and GPT-4. Here are some key features of GPT4Free:

Proof of Concept: GPT4Free serves as a proof of concept, demonstrating the development of an API package with multi-provider requests, with features like timeouts, load balancing, and flow control.

Free Access: It provides free access to these high-powered AI models by reverse engineering the APIs of platforms that offer paid access to them.

Installation: You can install GPT4Free using Python, Docker, or Web UI.

Privacy: GPT4Free does save conversation data, but under strict privacy guidelines.

Custom Model Parameters: You can customize model parameters like the ‘presence_penalty’ to suit your specific needs.

Multiple Language Support: GPT4Free supports multiple languages, ensuring a smooth user experience across different geographies.
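In code, usage is a few lines of Python; a minimal sketch using the package's client interface (the interface has changed between g4f versions, so treat the details as assumptions to check against the README):

# pip install g4f
from g4f.client import Client

client = Client()

# g4f routes the request through one of its free providers
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
)
print(response.choices[0].message.content)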

https://gpt4free.io

11. PrivateGPT

If that isn’t private, I don’t know what is!

PrivateGPT is an open-source, production-ready AI project that allows you to interact with your documents using the power of Large Language Models (LLMs), 100% privately. Here are some key features of PrivateGPT:

Privacy: All data is processed locally, ensuring that no data leaves your execution environment at any point.

Local Inference: PrivateGPT allows you to ask questions about your documents using LLMs, even in scenarios without an Internet connection.

API: It provides an API offering all the primitives required to build private, context-aware AI applications. It follows and extends the OpenAI API standard, and supports both normal and streaming responses.

High-Level and Low-Level API: The API is divided into two logical blocks: a high-level API, which abstracts all the complexity of a Retrieval Augmented Generation (RAG) pipeline implementation, and a low-level API, which allows advanced users to implement their own complex pipelines.

Gradio UI Client: A working Gradio UI client is provided to test the API, together with a set of useful tools such as bulk model download script, ingestion script, documents folder watch, etc.

PrivateGPT is a versatile tool that allows users to interact with AI models in a more personalized and controlled manner. It’s being used by developers and AI enthusiasts around the world to create and deploy their own AI applications.
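Because the API extends the OpenAI standard, querying a running PrivateGPT instance looks like any other OpenAI-style call; a minimal sketch, assuming the default local port 8001 and the documented use_context flag for RAG over your ingested documents (both assumptions; check your settings):

# pip install requests
import requests

# PrivateGPT exposes OpenAI-style endpoints; 8001 is its usual default port
resp = requests.post(
    "http://localhost:8001/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Summarize my ingested documents."}],
        "use_context": True,  # ask the RAG pipeline to answer from ingested documents
    },
)
print(resp.json()["choices"][0]["message"]["content"])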

https://github.com/zylon-ai/private-gpt

12. Text-Generation-WebUI

Aliens? Spiders? Mushrooms? Oobabooga? You got me.

Text-Generation-WebUI, also known as TGW or “oobabooga”, is an open-source project that provides a Gradio-based web interface for interacting with Large Language Models.

Here are some key features of Text-Generation-WebUI:

Interface Modes: It offers three interface modes: default (two columns), notebook, and chat.

Model Backends: It supports multiple model backends including Transformers, llama.cpp (through llama-cpp-python), ExLlamaV2, AutoGPTQ, AutoAWQ, GPTQ-for-LLaMa, and QuIP.

Model Switching: It provides a dropdown menu for quickly switching between different models.

Extensions: It supports a large number of extensions (built-in and user-contributed), including Coqui TTS for realistic voice outputs, Whisper STT for voice inputs, translation, multimodal pipelines, vector databases, Stable Diffusion integration, and more.

Chat with Custom Characters: You can chat with custom characters.

Precise Chat Templates: It provides precise chat templates for instruction-following models, including Llama-2-chat, Alpaca, Vicuna, Mistral.

LoRA: You can train new LoRAs with your own data, load/unload LoRAs on the fly for generation.

Transformers Library Integration: It supports loading models in 4-bit or 8-bit precision through bitsandbytes, using llama.cpp with transformers samplers (llamacpp_HF loader), and CPU inference in 32-bit precision using PyTorch.

OpenAI-Compatible API Server: It provides an OpenAI-compatible API server with Chat and Completions endpoints.
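That API server makes the web UI usable as a local backend for scripts too; a minimal sketch, assuming you launched with the --api flag and its default API port 5000 (both assumptions; see the project's documentation):

# pip install requests
import requests

# text-generation-webui's OpenAI-compatible chat endpoint (default API port 5000)
resp = requests.post(
    "http://127.0.0.1:5000/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Write a haiku about local LLMs."}],
        "max_tokens": 100,
    },
)
print(resp.json()["choices"][0]["message"]["content"])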

https://github.com/oobabooga/text-generation-webui

13. H2O.ai

Now that’s all versions of H2O!

H2O.ai is a company that provides AI solutions and is known for democratizing AI. They offer a range of products and services, including:

H2O Open Source: This is a fully open source, distributed in-memory machine learning platform with linear scalability. It supports the most widely used statistical and machine learning algorithms, including gradient boosted machines, generalized linear models, deep learning, and more. It also has an industry-leading AutoML functionality that automatically runs through all the algorithms and their hyperparameters to produce a leaderboard of the best models (a minimal sketch of this workflow appears after this list).

H2O.ai Cloud Platform: This is a cloud-based platform for creating and deploying generative AI solutions with customizable large language models (LLMs). It offers features like information retrieval on internal data, data extraction, summarization, and other batch processing tasks, as well as code/SQL generation for data analysis. It also provides a customizable evaluation and validation framework that is model agnostic.

Enterprise Support: When AI becomes mission critical for enterprise success, H2O.ai provides the services you need to optimize your investments in people and technology to deliver on your AI vision.

h2oGPT: This is H2O's open-source project that allows you to interact with your documents using the power of Large Language Models (LLMs), 100% privately, and it provides the primitives required to build private, context-aware AI applications.

H2O.ai is used by over 18,000 organizations globally and is extremely popular in both the R & Python communities. It’s a powerful tool for those who want to run AI models locally without the need for internet connectivity or expensive hardware.
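As a taste of the AutoML workflow mentioned above, here is a minimal sketch that trains a small leaderboard of models on a public demo dataset (the dataset URL and settings are just illustrations):

# pip install h2o
import h2o
from h2o.automl import H2OAutoML

h2o.init()  # starts (or connects to) a local H2O cluster

# Load a small public demo dataset and run AutoML for a handful of models
train = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
aml = H2OAutoML(max_models=5, seed=1)
aml.train(y="class", training_frame=train)

print(aml.leaderboard)  # leaderboard of the best models found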

https://h2o.ai

14. LightLLM

I know it’s LightLLM but isn’t this a bit over?

LightLLM is a Python-based Large Language Model (LLM) inference and serving framework. It’s known for its lightweight design, easy scalability, and high-speed performance. Here are some key features of LightLLM:

Tri-process Asynchronous Collaboration: Tokenization, model inference, and detokenization are performed asynchronously, leading to a considerable improvement in GPU utilization.

Nopad (Unpad): Offers support for nopad attention operations across multiple models to efficiently handle requests with large length disparities.

Dynamic Batch: Enables dynamic batch scheduling of requests.

FlashAttention: Incorporates FlashAttention to improve speed and reduce GPU memory footprint during inference.

Tensor Parallelism: Utilizes tensor parallelism over multiple GPUs for faster inference.

Token Attention: Implements a token-wise KV cache memory management mechanism, allowing for zero memory waste during inference.

High-performance Router: Collaborates with Token Attention to meticulously manage the GPU memory of each token, thereby optimizing system throughput.

Int8KV Cache: This feature nearly doubles the number of tokens that fit in the KV cache.

LightLLM supports a wide range of models including BLOOM, LLaMA, LLaMA V2, StarCoder, Qwen-7b, ChatGLM2-6b, Baichuan-7b, Baichuan2-7b, Baichuan2-13b, Baichuan-13b, InternLM-7b, Yi-34b, Qwen-VL, Qwen-VL-Chat, Llava-7b, Llava-13b, Mixtral, Stablelm, MiniCPM.
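Serving and querying is a two-step affair: launch the API server against a downloaded model directory, then POST prompts to its /generate endpoint. A minimal sketch based on the project's README (flags, port, and payload fields are assumptions for your version):

# Launch the server first from a shell, for example:
#   python -m lightllm.server.api_server --model_dir /path/to/llama-2-7b --tp 1 --port 8080
import requests

# Query the running server's TGI-style /generate endpoint
resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is AI?",
        "parameters": {"max_new_tokens": 64, "do_sample": False},
    },
)
print(resp.json())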

https://github.com/ModelTC/lightllm

15. GPT Academic

I’ve given up trying to explain these images. How is this academic?

GPT Academic, also known as gpt_academic, is an open-source project that provides a practical interaction interface for Large Language Models (LLMs) like GPT and GLM. It’s particularly optimized for academic reading, writing, and code explanation. Here are some key features of GPT Academic:

Modular Design: It supports a modular design that allows for the creation of custom shortcut buttons and function plugins.

Project Analysis: It offers project analysis capabilities for Python, C++, and other languages.

PDF/LaTeX Paper Translation & Summarization: It provides functionalities for translating and summarizing PDF/LaTeX academic papers.

Parallel Inquiry: It supports parallel inquiries across multiple LLM models.

Local Model Support: It supports local models like chatglm3.

Integration with Various Models: It integrates with various models like deepseekcoder, Llama2, rwkv, claude2, moss, and others.

Customizable: It’s highly customizable and supports the addition of new models, plugins, and shortcut keys.

https://academicgpt.net

In Conclusion

Now, let’s take a moment to appreciate the sheer genius of these LLMs. They’re like the Swiss Army knives of the AI world - versatile, handy, and always ready to impress with their multitude of uses. And the best part? They’re local! That’s right, no need to venture into the vast and sometimes scary world of the internet. These models are right at home on your local machine, ready to serve at a moment’s notice.

But let’s not forget the real heroes of our story - the programmers. Those tireless warriors of the digital realm, armed with nothing but their keyboards and an unquenchable thirst for knowledge. They’re the ones who’ve tamed these wild beasts of AI, turning lines of incomprehensible code into something we can all use and appreciate. So here’s to you, dear programmers. May your coffee be strong, your bugs few, and your StackOverflow answers always upvoted.

So, here’s to Local LLMs - the truly unsung heroes of the AI world. May they continue to inspire, amaze, and occasionally confuse us for many years to come. And remember, in the world of AI, the only limit is your imagination. So dream big, code hard, and don’t forget to laugh along the way. After all, as they say in the world of programming, “Laughter is the best exception handler.” Cheers!

Yup! Laughter is the best exception handler! Especially at weddings!


Written by thomascherickal | #1 Top Writer in ML/AI on HackerNoon. Full Stack MLOps TDD Python Dev.
Published by HackerNoon on 2024/04/26