3,250 reads

How to Run Your Own Local LLM: Updated for 2024 - Version 1

by Thomas CherickalMarch 21st, 2024

Too Long; Didn't Read

The article provides detailed guides on using Generative AI models like Hugging Face Transformers, gpt4all, Ollama, and localllm locally. Learn how to harness the power of AI for creative applications and innovative solutions.

featured image - How to Run Your Own Local LLM: Updated for 2024 - Version 1

This is the breakout year for Generative AI!

Well; to say the very least, this year, I’ve been spoilt for choice as to how to run an LLM Model locally.

Let’s start!

1) HuggingFace Transformers:

All Images Created by Bing Image Creator

To run Hugging Face Transformers offline without internet access, follow these steps:

Running HuggingFace Transformers Offline in Python on Windows

Requirements:

Python 3.6+
PyTorch (version compatible with Transformers)
Transformers library Tokenizers library
Sentence Transformers library (optional, for sentence-level tasks)

Steps:

Download the model:
Choose a model from the HuggingFace Hub.
Download the model weights and tokenizer weights.
Place the downloaded files in a local directory. Set environment variables:
Create a .env file in your project directory.
In the .env file, define the following variables:
transformers_home: Path to the directory where you stored the downloaded model and tokenizer weights.
MODEL_NAME: Name of the model you want to use.
MODEL_CONFIG: Path to the model configuration file (optional).
TOKENIZER_NAME: Name of the tokenizer you want to use.

Import libraries:

import os import transformers from transformers import AutoModel, AutoTokenizer #Replace "your-model-name" with the actual name of your model model_name = os.getenv("MODEL_NAME") model_config_path = os.getenv("MODEL_CONFIG") #Load the model and tokenizer model = AutoModel.from_pretrained(model_name, config=model_config_path) tokenizer = AutoTokenizer.from_pretrained(model_name)

Use the Model:

#Example usage: input_text = "Hello, world!" tokens = tokenizer(input_text) outputs = model(tokens) #Print the outputs print(outputs)

Additional Notes:

You may need to modify the transformers_home variable if you want to store the downloaded models in a different location.

You should download the model by cloning the repository and tokenizer weights manually to run it offline.

You can find more information on how to run Transformers offline on the HuggingFace documentation:

https://transformers.huggingface.co/docs/usage/inference#offline-inference

Example:

#Assuming you have downloaded the model and tokenizer weights #for bert-base-uncased-finetuned-sst-2-#english os.environ["transformers_home"] = "C:\transformers" os.environ["MODEL_NAME"] = "bert-base-uncased-finetuned-sst-2-english" import os import transformers model_name = os.getenv("MODEL_NAME") model_config_path = os.getenv("MODEL_CONFIG") model = AutoModel.from_pretrained(model_name, config=model_config_path) tokenizer = AutoTokenizer.from_pretrained(model_name) input_text = "The quick brown fox jumps over the lazy dog." tokens = tokenizer(input_text) outputs = model(tokens) print(outputs)

This code will output the model's predictions for the input text.

Remember, you must either download the model with internet access and save it locally or clone the model repository.

You can visit the website https://huggingface.co/models for more details.

There are around a stunning 558,000~ odd transformer LLMs available.

Hugging Face has become the de facto democratizer for LLM models, making nearly all available open source LLM models accessible, and executable without the usual mountain of expenses and bills. Basically, available, open source, and free. This is the mother lode!

2) gpt4all

gpt4all is an open-source project that allows anyone to access and use powerful AI models. Here are step-by-step instructions for installing and using gpt4all:

Python GPT4All: https://pypi.org/project/gpt4all/

`pip install gpt4all

This will download the latest version of the gpt4all package from PyPI.

Local Build

As an alternative to downloading via pip, you may build the Python bindings from the source.

How to Build the Python Bindings:

Clone GPT4All and change directory:

git clone --recurse-submodules https://github.com/nomic-ai/gpt4all.git

cd gpt4all/gpt4all-backend

Install the Python package:

cd ../../gpt4all-bindings/python pip install -e .

Usage

In a Python script or console:

from gpt4all import GPT4All model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf") output = model.generate("The capital of France is ", max_tokens=3) print(output)

GPU Usage:

from gpt4all import GPT4All model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf", device='gpu')

device='amd', device='intel'

output = model.generate("The capital of France is ", max_tokens=3) print(output)

This is one way to use gpt4all locally.

The website is (unsurprisingly) https://gpt4all.io.

Like all the LLMs on this list (when configured correctly), gpt4all does not require Internet or a GPU.

3) ollama

Ollama is an open source library that provides easy access to large language models like GPT-3. Here are the details on its system requirements, installation, and usage:

System Requirements:

Python 3.7 or higher
Requests library
Valid OpenAI API key

Installation:

pip install ollama

Usage:

Multi-modal

Ollama has support for multi-modal LLMs, such as bakllava and llava.

ollama pull bakllava

Be sure to update Ollama so that you have the most recent version to support multi-modal.

from langchain_community.llms import Ollama

bakllava = Ollama(model="bakllava") import base64 from io import BytesIO

from IPython.display import HTML, display from PIL import Image

def convert_to_base64(pil_image): """ Convert PIL images to Base64 encoded strings

:param pil_image: PIL image
:return: Re-sized Base64 string
"""

buffered = BytesIO()
pil_image.save(buffered, format="JPEG")  # You can change the format if needed
img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
return img_str

def plt_img_base64(img_base64): """ Display base64 encoded string as image

:param img_base64:  Base64 string
"""
# Create an HTML img tag with the base64 string as the source
image_html = f'<img src="data:image/jpeg;base64,{img_base64}" />'
# Display the image by rendering the HTML
display(HTML(image_html))

file_path = "../../../static/img/ollama_example_img.jpg" pil_image = Image.open(file_path) image_b64 = convert_to_base64(pil_image) plt_img_base64(image_b64)

This example is from the LangChain documentation.

Again, refer to the Ollama documentation for additional details on all available methods.

The website is: https://ollama.com/

4) localllm

I find that this is the most convenient way of all. The full explanation is given on the link below:

Summarized:

localllm combined with Cloud Workstations revolutionizes AI-driven application development by letting you use LLMs locally on CPU and memory within the Google Cloud environment. By eliminating the need for GPUs, you can overcome the challenges posed by GPU scarcity and unlock the full potential of LLMs. With enhanced productivity, cost efficiency, and improved data security, localllm lets you build innovative applications with ease.

5) Llama 2(Version 3 coming soon from Meta)

Steps to Use a Pre-trained Finetuned Llama 2 Model Locally Using C++:

(This is on Linux, please!)

Ensure you have the necessary dependencies installed:

sudo apt-get install python-pybind11-dev libpython-dev libncurses5-dev libstdc++-dev python-dev

Download the pre-trained Llama 2 model from the Hugging Face Transformers hub (huggingface.co/models).

Extract the downloaded model file:

unzip llama-2-large-cnn-transformer.zip

Navigate to the extracted directory:

cd llama-2-large-cnn-transformer

Copy the llama.cpp file from the repository to your working directory. Edit the llama.cpp file and modify the main() function to load the model and generate a response:

#include "transformer.h"int main() {

std::string prompt = "What is the meaning of life?";std::string response = GenerateResponse(prompt);std::cout << response;return 0; }

Compile the edited llama.cpp file:

g++ -o llama llama.cpp -L./lib -lstdc++ -o llama

Run the compiled executable:

./llama

Please Note: The prompt variable can be any text you want the model to generate a response for. The response variable will contain the model's response.

6) LLM

Perhaps the simplest option of the lot, a Python script called llm allows you to run large language models locally with ease.

To install:

pip install llm

LLM can run many different models, although albeit a very limited set.

You can install plugins to run your llm of choice with the command:

llm install <name-of-the-model>

To see all the models you can run, use the command:

llm models list

You can work with local LLMs using the following syntax:

llm -m <name-of-the-model> <prompt>

7) llamafile

llamafile allows you to download LLM files in the GGUF format, import them, and run them in a local in-browser chat interface.

The best way to install llamafile (only on Linux) is

curl -L https://github.com/Mozilla-Ocho/llamafile/releases/download/0.1/llamafile-server-0.1 > llamafile

chmod +x llamafile

Download a model from HuggingFace and run it locally with the command:

./llamafile --model .<gguf-file-name>

Wait for it to load, and open it in your browser at

http://127.0.0.1:8080.

Enter the prompt, and you can use it like a normal LLM with a GUI.

The complete Python program is given below:

#Import necessary libraries import llamafile import transformers #Define the HuggingFace model name and the path to save the model model_name = "distilbert-base-uncased" model_path = "<path-to-model>/model.gguf"

#Use llamafile to download the model in gguf format from the command line and store the location in model_path #Load the model tokenizer = transformers.DistilBertTokenizer.from_pretrained(model_name) model = transformers.DistilBertModel.from_pretrained(model_path) #Define a function to query the model def query_model(text): inputs = tokenizer(text, return_tensors="pt") outputs = model(**inputs) return outputs.last_hidden_state[:, 0, :]

#Query the model with a sample input response = query_model("Hello, how are you?")

#Print the model's response print(response)

Conclusion (?)

And there’s more! So much more. The list of LLMs available to run is getting bigger every day!

Apart from the huge number available on the HuggingFace Transformers website, there is an ever-growing bevy of new LLMs popping up nearly every day this year,

Here’s to a glorious year of beautiful creativity.

Generative AI has been the buzzword this year, and everyone is waiting with bated breath for AGI.

(Has Q* already achieved it? Elon Musk is a pretty smart guy! He did not file that lawsuit without reason!)

Cheers!

I did not verify all the outputs for all of the models, so I suggest, if you want an ultra-reliable way to use transformers locally use Google’s locallm library.

These models are not limited to text anymore.

Multimodal models - as shown with Ollama - are the new technology that everyone is talking about.

I will show you how to use audio, pictures, videos, and media in general with transformers and get the computer to respond intelligently and in a very human-like manner.

But this article is now slightly long. I’ll save multimodal tech for another article.

Until then, see you around!

And make sure you enjoy what you’re doing - building LLMs and LMMs (Large Multi-Modal Models).

For the first time in history, a human-created device is humanly responding to human multimedia.

Don’t miss the magic - and continue to keep up the enthusiasm - for the wonders of technology that are now available to everyone!

Feel the excitement in the air! Again, AGI is not very far away!

L O A D I N G
. . . comments & more!

About Author

Thomas Cherickal@thomascherickal

#1 Top Writer in ML/AI on HackerNoon. Full Stack MLOps TDD Python Dev.

Read my stories My Top Writer Ranking