How to Run Your Own Local LLM: Updated for 2024 - Version 2by@thomascherickal
507 reads
507 reads

How to Run Your Own Local LLM: Updated for 2024 - Version 2

by Thomas CherickalApril 26th, 2024
Read on Terminal Reader
Read this story w/o Javascript
tldt arrow

Too Long; Didn't Read

Think Local LLMs are beyond you? Guess again! They are more accessible than ever! An absolutely fantastic collection of local llms all run through open source tools. Have a blast running these LLMs. Some of them, like oobabooga and Academic GPT are incredibly capable in multiple domains and useful in more ways than you know!
featured image - How to Run Your Own Local LLM: Updated for 2024 - Version 2
Thomas Cherickal HackerNoon profile picture

This is the breakout year for Generative AI!

Yes, this article’s been written by a violinist and there’s new content! Spectacular new and awesome content!

Well; to say the very least, this year, I’ve been spoiled for choice as to how to run an LLM Model locally.

Let’s start!

1) HuggingFace Transformers

Magic of Bing Image Creator — Very imaginative.

Hugging Face Transformers is a state-of-the-art machine learning library that provides easy access to a wide range of pre-trained models for Natural Language Processing (NLP), Computer Vision, Audio tasks, and more. It’s an open-source library developed by Hugging Face, a company that has built a strong community around machine learning and NLP.

Here are some key features of Hugging Face Transformers:

  1. Pre-trained Models: Transformers provides APIs and tools to easily download and train state-of-the-art pre-trained models. Using these models can reduce your compute costs, carbon footprint, and save you the time and resources required to train a model from scratch.
  2. Multimodal Support: These models support common tasks in different modalities, such as text classification, named entity recognition, question answering, language modeling, summarization, translation, multiple choice, text generation for NLP; image classification, object detection, and segmentation for Computer Vision; automatic speech recognition and audio classification for Audio; and table question answering, optical character recognition, information extraction from scanned documents, video classification, and visual question answering for Multimodal tasks.
  3. Framework Interoperability: Transformers support framework interoperability between PyTorch, TensorFlow, and JAX. This provides the flexibility to use a different framework at each stage of a model’s life; train a model in one framework, and load it for inference in another.
  4. Community and Collaboration: Hugging Face has a strong community focus, with a Model Hub that serves as a bustling hub where users exchange and discover thousands of models and datasets, fostering a culture of collective innovation in NLP.
  5. Ease of Use: The Transformers library simplifies the machine learning journey, offering developers an efficient pathway to download, train, and seamlessly integrate machine learning models into their workflows.
  6. Diverse Datasets: The Datasets library functions as a comprehensive toolbox, offering diverse datasets for developers to train and test language models effortlessly.

In summary, Hugging Face Transformers is a powerful tool that makes the complexities of language technology and machine learning accessible to everyone, from beginners to experts.

2) gpt4all

I’m sorry, I can’t describe this one!

GPT4All is an open-source ecosystem developed by Nomic AI that allows you to run powerful and customized large language models (LLMs) locally on consumer-grade CPUs and any GPU. It aims to be the best instruction-tuned assistant-style language model that any person or enterprise can freely use, distribute, and build on.

Here are some key features of GPT4All:

  1. Locally Running: GPT4All runs locally on your machine, which means it doesn’t require an internet connection or a GPU. This makes it a privacy-aware chatbot.
  2. Free-to-Use: It’s free to use, which means you don’t have to pay for a platform or hardware subscription.
  3. Customized Large Language Models: GPT4All allows you to train and deploy powerful and customized large language models.
  4. Various Capabilities: GPT4All can answer questions about the world, assist in writing emails, documents, creative stories, poems, songs, and plays, understand documents, and provide guidance on easy coding tasks.
  5. Open-Source Ecosystem: GPT4All is part of an open-source ecosystem, which means you can contribute to its development and improvement.

In summary, GPT4All is a tool that democratizes access to AI resources by allowing anyone to run large language models locally on their own hardware.

The website is (unsurprisingly)

Like all the LLMs on this list (when configured correctly), gpt4all does not require Internet or a GPU.

3) ollama

Again, magic!

Ollama is an open-source command line tool that allows you to run, create, and share large language models on your computer. It’s designed to run open-source large language models locally on your machine2. Ollama supports various models such as Llama 3, Mistral, Gemma, and others.

It simplifies the process by bundling model weights, configuration, and data into a single package defined by a Modelfile. This means you can run large language models, such as Llama 2 and Code Llama, without any registration or waiting list.

Ollama is available for macOS, Linux, and Windows. It started by supporting Llama, then expanded its model library to include models like Mistral and Phi-25. So, if you’re interested in running large language models on your local machine, Ollama could be a great tool to consider!

Ollama’s multimodal capabilities allow it to process both text and image inputs together. This means you can ask the model questions or give it prompts that involve both text and images, and it will generate responses based on both types of input.

To use this feature, you simply type your prompt and then drag and drop an image. There is a new images parameter for both Ollama’s Generate API & Chat API. The images parameter takes a list of base64 encoded PNG or JPEG format images. Ollama supports image sizes up to 100MB.

For example, you can run a multimodal model like LLaVA by typing ollama run llava in the terminal. In the background, Ollama will download the LLaVA 7B model and run it. If you want to use a different parameter size, you can try the 13B model using ollama run llava:13b.

More multimodal models are becoming available, such as BakLLaVA 7B. You can run it by typing ollama run bakllava in the terminal.

This multimodal capability makes Ollama a powerful tool for generating creative and contextually relevant responses based on a combination of text and image inputs. It’s a great way to leverage the power of large language models in a more interactive and engaging way.

4. localllm

Defies explanation, doesn’t it?

I find that this is the most convenient way of all. The full explanation is given on the link below:

Google Cloud Blog

localllm is a tool developed by Google Cloud Platform that allows you to run Large Language Models (LLMs) locally on Cloud Workstations. It’s particularly useful for developers who want to leverage the power of LLMs without the constraints of GPU availability.

Here are some key points about localllm:

  • It’s designed to run on local CPUs, eliminating the need for GPUs.
  • It uses quantized models from HuggingFace, which are AI models optimized to run on devices with limited computational resources.
  • The tool is part of a solution that combines quantized models, Cloud Workstations, and generally available resources to develop AI-based applications.
  • It provides easy access to quantized models through a command-line utility.

In essence, localllm is a game-changer for developers seeking to leverage LLMs in their applications, offering a flexible and efficient approach to application development.

5. Llama 3 (Version 3 released from Meta) Now that’s a spectacular Llama!

Meta’s Llama 3, often referred to as Llama 3, is the latest iteration of Meta’s Large Language Model (LLM). It’s a powerful AI tool that has made significant strides in the field of Natural Language Processing (NLP) and Machine Learning (ML).

Llama 3 is an open-source large language model designed for developers, researchers, and businesses to build, experiment, and responsibly scale their generative AI ideas. It’s part of a foundational system and serves as a bedrock for innovation in the global community.

This model is available in both 8B and 70B pretrained and instruction-tuned versions to support a wide range of applications. It excels at understanding language nuances and performing complex tasks like translation and dialogue generation.

Llama 3 has been trained on over 15 trillion tokens of data, a training dataset 7 times larger than that used for Llama 2, including 4 times more code. This results in the most capable Llama model yet, which supports an 8K context length that doubles the capacity of Llama 2.

With the release of Llama 3, Meta has updated the Responsible Use Guide (RUG) to provide the most comprehensive information on responsible development with LLMs. Their system-centric approach includes updates to their trust and safety tools with Llama Guard 2, optimized to support the newly announced taxonomy published by MLCommons expanding its coverage to a more comprehensive set of safety categories, Code Shield, and Cybersec Eval 2.

In summary, Llama 3 represents a significant advancement in AI technology, offering enhanced capabilities and improved performance for a broad range of applications.


6. LLM

Explain this one. I sure can’t!

The llm Python script from PyPI is a command-line utility and Python library for interacting with Large Language Models (LLMs), including OpenAI, PaLM, and local models installed on your own machine.

Here’s a brief overview of its functionalities:

  1. Interacting with LLMs: You can run prompts from the command-line, store the results in SQLite, generate embeddings, and more.
  2. Support for Self-Hosted Language Models: The LLM CLI tool now supports self-hosted language models via plugins.
  3. Working with Embeddings: LLM provides tools for working with embeddings.
  4. Installation: You can install this tool using pip with the command pip install llm or using Homebrew with the command brew install llm.
  5. Getting Started: If you have an OpenAI API key, you can get started using the OpenAI models right away. As an alternative to OpenAI, you can install plugins to access models by other providers, including models that can be installed and run on your own device.
  6. Installing a Model Locally: LLM plugins can add support for alternative models, including models that run on your own machine. For example, to download and run Mistral 7B Instruct locally, you can install the llm-gpt4all plugin.
  7. Running a Prompt: Once you’ve saved a key, you can run a prompt like this: llm "Five cute names for a pet penguin".
  8. Chatting with a Model: You can also start a chat session with the model using the llm chat command.
  9. Using a System Prompt: You can use the -s/--system option to set a system prompt, providing instructions for processing other input to the tool.

7. LlamaFile

Llama with some heavy-duty options!

A llamafile is an executable Large Language Model (LLM) that you can run on your own computer. It contains the weights for a given open LLM, as well as everything needed to actually run that model on your computer.

The goal of a llamafile is to make open LLMs much more accessible to both developers and end users. It combines the model with a framework into one single-file executable (called a “llamafile”) that runs locally on most computers, with no installation1. This means all the operations happen locally and no data ever leaves your computer.

For example, you can download an example llamafile for the LLaVA model, which is a new LLM that can do more than just chat; you can also upload images and ask it questions about them.

In addition to hosting a web UI chat server, when a llamafile is started, it also provides an OpenAI API compatible chat completions endpoint. This is designed to support the most common OpenAI API use cases, in a way that runs entirely locally.

This initiative is part of an effort to ensure that AI remains free and open, and to increase both trust and safety. It aims to lower barriers to entry, increase user choice, and put the technology directly into the hands of the people.

8. ChatGPT-Next-Web

That’s Next level, all right!

ChatGPT-Next-Web, also known as NextChat, is an open-source application that provides a user interface for interacting with advanced language models like GPT-3.5, GPT-4, and Gemini-Pro. It’s built on Next.js, a popular JavaScript framework.

Here are some key features of ChatGPT-Next-Web:

Deployment: It can be deployed for free with one-click on Vercel in under 1 minute.

  • Compact Client: It provides a compact client (~5MB) that can run on Linux, Windows, and MacOS.
  • Compatibility: It’s fully compatible with self-deployed Language Learning Models (LLMs), and it’s recommended for use with RWKV-Runner or LocalAI.
  • Privacy: All data is stored locally in the browser, ensuring privacy.
  • Markdown Support: It supports Markdown, including LaTex, mermaid, code highlighting, and more.
  • Prompt Templates: You can create, share, and debug your chat tools with prompt templates.
  • Multilingual Support: It supports multiple languages including English, 简体中文, 繁体中文, 日本語, Français, Español, Italiano, Türkçe, Deutsch, Tiếng Việt, Русский, Čeština, 한국어, Indonesia.

It’s a versatile tool that allows users to interact with AI models in a more personalized and controlled manner. It’s being used by developers and AI enthusiasts around the world to create and deploy their own ChatGPT applications.


9. LocalAI

 Not sure which planet, but sure looks beautiful!

  • LocalAI is a free, open-source project that acts as a drop-in replacement for the OpenAI API. It allows you to run various AI models locally or on-premises with consumer-grade hardware. Here are some key features of LocalAI:
  • Local Inference: LocalAI allows you to run Language Learning Models (LLMs), generate images, audio, and more, all locally or on-premises.
  • Compatibility: It’s compatible with OpenAI API specifications.
  • No GPU Required: LocalAI does not require a GPU for operation.
  • Privacy: Since all data is processed locally, your data remains private.
  • Multiple Model Support: It supports multiple model families and architectures.
  • Fast Inference: Once loaded the first time, it keeps models loaded in memory for faster inference.
  • Community-Driven: LocalAI is a community-driven project, and contributions from the community are welcome.

10. gpt4free

Switzerland in the Multiverse!

  • GPT4Free, also known as G4F, is an open-source project that provides access to powerful language models like GPT-3.5 and GPT-4. Here are some key features of GPT4Free:
  • Proof of Concept: GPT4Free serves as a proof of concept, demonstrating the development of an API package with multi-provider requests, with features like timeouts, load balance, and flow control.
  • Free Access: It provides free access to these high-powered AI models by reverse engineering the application programming interfaces (APIs) platforms used with paid access to these models.
  • Installation: You can install GPT4Free using Python, Docker, or Web UI.
  • Privacy: GPT4Free does save conversation data, but all with strict privacy guidelines.
  • Custom Model Parameters: You can customize model parameters like the ‘presence_penalty’ to suit your specific needs.
  • Multiple Language Support: GPT4Free supports multiple languages, ensuring a smooth user experience across different geographies.

11. PrivateGPT

If that isn’t private, I don’t know what is!

PrivateGPT is an open-source, production-ready AI project that allows you to interact with your documents using the power of Large Language Models (LLMs), 100% privately.

Here are some key features of PrivateGPT:

  • Privacy: All data is processed locally, ensuring that no data leaves your execution environment at any point.
  • Local Inference: PrivateGPT allows you to ask questions about your documents using LLMs, even in scenarios without an Internet connection.
  • API: It provides an API offering all the primitives required to build private, context-aware AI applications. It follows and extends the OpenAI API standard, and supports both normal and streaming response.
  • High-Level and Low-Level API: The API is divided into two logical blocks: a high-level API, which abstracts all the complexity of a Retrieval Augmented Generation (RAG) pipeline implementation, and a low-level API, which allows advanced users to implement their own complex pipelines.
  • Gradio UI Client: A working Gradio UI client is provided to test the API, together with a set of useful tools such as bulk model download script, ingestion script, documents folder watch, etc.

PrivateGPT is a versatile tool that allows users to interact with AI models in a more personalized and controlled manner. It’s being used by developers and AI enthusiasts around the world to create and deploy their own AI applications.

12. Text-Generation-WebUI

Aliens? Spiders? Mushrooms? Oobabooga? You got me.

Text-Generation-WebUI, also known as TGW or “oobabooga”, is an open-source project that provides a Gradio-based web interface for interacting with Large Language Models.

Here are some key features of Text-Generation-WebUI:

  • Interface Modes: It offers three interface modes: default (two columns), notebook, and chat.
  • Model Backends: It supports multiple model backends including Transformers, llama.cpp (through llama-cpp-python), ExLlamaV2, AutoGPTQ, AutoAWQ, GPTQ-for-LLaMa, and QuIP.
  • Model Switching: It provides a dropdown menu for quickly switching between different models.
  • Extensions: It supports a large number of extensions (built-in and user-contributed), including Coqui TTS for realistic voice outputs, Whisper STT for voice inputs, translation, multimodal pipelines, vector databases, Stable Diffusion integration, and more.
  • Chat with Custom Characters: You can chat with custom characters.
  • Precise Chat Templates: It provides precise chat templates for instruction-following models, including Llama-2-chat, Alpaca, Vicuna, Mistral.
  • LoRA: You can train new LoRAs with your own data, load/unload LoRAs on the fly for generation.
  • Transformers Library Integration: It supports loading models in 4-bit or 8-bit precision through bitsandbytes, using llama.cpp with transformers samplers (llamacpp_HF loader), and CPU inference in 32-bit precision using PyTorch.
  • OpenAI-Compatible API Server: It provides an OpenAI-compatible API server with Chat and Completions endpoints.


Now that’s all versions of H2O! is a company that provides AI solutions and is known for democratizing AI. They offer a range of products and services, including:

H2O Open Source: This is a fully open source, distributed in-memory machine learning platform with linear scalability. It supports the most widely used statistical & machine learning algorithms including gradient boosted machines, generalized linear models, deep learning and more. It also has an industry leading AutoML functionality that automatically runs through all the algorithms and their hyperparameters to produce a leaderboard of the best models. Cloud Platform: This is a cloud-based platform for creating and deploying generative AI solutions with customizable large language models (LLMs). It offers features like information retrieval on internal data, data extraction, summarization, and other batch processing tasks, as well as code/SQL generation for data analysis. It also provides a customizable evaluation and validation framework that is model agnostic.

Enterprise Support: When AI becomes mission critical for enterprise success, provides the services you need to optimize your investments in people and technology to deliver on your AI vision.

PrivateGPT: This is a product that allows you to interact with your documents using the power of Large Language Models (LLMs), 100% privately. It provides an API offering all the primitives required to build private, context-aware AI applications. is used by over 18,000 organizations globally and is extremely popular in both the R & Python communities. It’s a powerful tool for those who want to run AI models locally without the need for internet connectivity or expensive hardware.

14. LightLLM

I know it’s LightLLM but isn’t this a bit over the top?

LightLLM is a Python-based Large Language Model (LLM) inference and serving framework. It’s known for its lightweight design, easy scalability, and high-speed performance.

Here are some key features of LightLLM:

  • Tri-process Asynchronous Collaboration: Tokenization, model inference, and detokenization are performed asynchronously, leading to a considerable improvement in GPU utilization.
  • Nopad (Unpad): Offers support for nopad attention operations across multiple models to efficiently handle requests with large length disparities.
  • Dynamic Batch: Enables dynamic batch scheduling of requests.
  • FlashAttention: Incorporates FlashAttention to improve speed and reduce GPU memory footprint during inference.
  • Tensor Parallelism: Utilizes tensor parallelism over multiple GPUs for faster inference.
  • Token Attention: Implements token-wise’s KV cache memory management mechanism, allowing for zero memory waste during inference.
  • High-performance Router: Collaborates with Token Attention to meticulously manage the GPU memory of each token, thereby optimizing system throughput.
  • Int8KV Cache: This feature will increase the capacity of tokens to almost twice as much.

LightLLM supports a wide range of models including BLOOM, LLaMA, LLaMA V2, StarCoder, Qwen-7b, ChatGLM2-6b, Baichuan-7b, Baichuan2-7b, Baichuan2-13b, Baichuan-13b, InternLM-7b, Yi-34b, Qwen-VL, Qwen-VL-Chat, Llava-7b, Llava-13b, Mixtral, Stablelm, MiniCPM.

15. GPT Academic

Academic, buddy… It’s academic!

GPT Academic, also known as gpt_academic, is an open-source project that provides a practical interaction interface for Large Language Models (LLMs) like GPT and GLM. It’s particularly optimized for academic reading, writing, and code explanation.

Here are some key features of GPT Academic:

  • Modular Design: It supports a modular design that allows for the creation of custom shortcut buttons and function plugins.
  • Project Analysis: It offers project analysis capabilities for Python, C++, and other languages.
  • PDF/LaTex Paper Translation & Summarization: It provides functionalities for translating and summarizing PDF/LaTex academic papers.
  • Parallel Inquiry: It supports parallel inquiries across multiple LLM models.
  • Local Model Support: It supports local models like chatglm3.
  • Integration with Various Models: It integrates with various models like deepseekcoder, Llama2, rwkv, claude2, moss, and others.
  • Customizable: It’s highly customizable and supports the addition of new models, plugins, and shortcut keys.

In Conclusion

Now, let’s take a moment to appreciate the sheer genius of these LLMs. They’re like the Swiss Army knives of the AI world - versatile, handy, and always ready to impress with their multitude of uses. And the best part? They’re local! That’s right, no need to venture into the vast and sometimes scary world of the internet. These models are right at home on your local machine, ready to serve at a moment’s notice.

But let’s not forget the real heroes of our story - the programmers. Those tireless warriors of the digital realm, armed with nothing but their keyboards and an unquenchable thirst for knowledge. They’re the ones who’ve tamed these wild beasts of AI, turning lines of incomprehensible code into something we can all use and appreciate. So here’s to you, dear programmers. May your coffee be strong, your bugs few, and your StackOverflow answers always upvoted.

So, here’s to Local LLMs - the truly unsung heroes of the AI world. May they continue to inspire, amaze, and occasionally confuse us for many years to come. And remember, in the world of AI, the only limit is your imagination. So dream big, code hard, and don’t forget to laugh along the way. After all, as they say in the world of programming, “Laughter is the best exception handler.”


Yup! Laughter is the best exception handler! Especially at weddings!

All Images Generated With Bing Image Creator. It’s Awesome! (DALL-E-3)