Running open-source AI models locally on our own computers gives us privacy, endless possibilities for tinkering, and freedom from large corporations. It is almost a matter of free speech.
For us GPU-poor, however, having our own AI computer seems to be a pricey dream.
What if I told you that you can get a useful AI computer for $300? Interested? You do need to supply your own monitor, keyboard, and mouse. And you will need to do a bit of tinkering with the Linux operating system, drivers, middleware, and configurations.
To clarify, we are NOT talking about “training” or “fine-tuning” large generative AI models. We will focus on how to run open-source LLM (large language models such as
Now, let’s continue.
Let’s assume one of the main use cases for a home AI computer is running
However, you do need the following for a faster inference speed. Otherwise, it will be like watching hair grow on your palm while the LLM spits out one token at a time.
For image generation with Stable Diffusion, you do need GPU power. However, you don’t have to have a very fancy GPU for that. You can leverage the integrated GPU already in your home computers:
All Macs with M1/M2/M3 CPU, which integrates CPU, GPU, and high-speed memory (they are really good, but due to price are excluded from this particular article)
AMD APU (e.g., Ryzen 7 5700U), which integrates CPU and GPU for budget-friendly mini-PCs. This will be the focus of this article.
Intel CPU (e.g., Core i5-1135G7), which also integrates CPU and GPU. They are slightly above the $300 budget for the entire mini-PC, but readers are welcome to explore them further on their own.
An AMD-based Mini PC with the following specs usually sells for less than $300. I don’t want to endorse any particular brand, so you can search yourself:
I splurged a bit and opted for the $400 model with 32GB RAM and 1TB SSD (everything else equal). The main reason is that I do research on open-source LLMs and would like to run bigger models, in addition to running Stable Diffusion. But you should be able to do almost everything in this article with the $300 computer.
For AMD APUs like the
You need to change that depending on your main use case:
If you only need to run LLM inference, you can skip this entire prep step. LLM inference only needs the CPU, so you should save most of the RAM for the CPU in order to run larger LLM models.
If you need to run
In my case, I want to run both Stable Diffusion XL and LLM inference on the same mini PC. Therefore, I would like to allocate 16GB (out of 32GB total) for the GPU.
You can achieve this by changing the settings in BIOS. Typically, there is an upper limit, and the default setting might be much lower than the upper limit. On my computer, the upper limit was 16GB, or half of the total RAM available.
If your computer’s BIOS supports such settings, go ahead and change it to your desired number. My BIOS has no such setting.
If your BIOS does not have this setting, then please follow the nice instructions in “Unlocking GPU Memory Allocation on AMD Ryzen™ APU?” by Winston Ma. I tried it and it worked well, so now I have 16GB VRAM.
AMD’s
To install AMD’s ROCm and make it work, you have to make sure that the versions of the GPU hardware, Linux distro, kernel, Python, HIP driver, ROCm library, and PyTorch are compatible. If you want the least pain and the best chance of first-time success, stick with the recommended and verified combinations.
Please check the following link to get the compatible Linux OS and kernel versions, and install them. Initially, I made the mistake of just installing my favorite Linux OS and default Linux kernel, and it was a big pain to walk backward to resolve compatibility issues. You can avoid this pain by just using the officially supported combinations.
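Before comparing your machine against the supported combinations, it helps to collect the relevant versions in one place. The following is a minimal sketch using only the Python standard library; the helper name `collect_versions` is my own, and it only reports what is installed rather than checking compatibility for you.

```python
import platform


def collect_versions() -> dict:
    """Gather the distro, kernel, and Python versions to compare by hand
    against the ROCm compatibility list."""
    info = {
        "kernel": platform.release(),
        "python": platform.python_version(),
        "machine": platform.machine(),
    }
    # platform.freedesktop_os_release() exists on Python 3.10+ and reads
    # /etc/os-release; fall back to "unknown" when it is unavailable.
    try:
        os_release = platform.freedesktop_os_release()
        name = os_release.get("NAME", "unknown")
        version = os_release.get("VERSION_ID", "")
        info["distro"] = f"{name} {version}".strip()
    except (AttributeError, OSError):
        info["distro"] = "unknown"
    return info


for key, value in collect_versions().items():
    print(f"{key:8s} {value}")
```

Run it once before installing ROCm and once after any kernel upgrade, since a kernel change is an easy way to drift off a supported combination.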
If the entire installation finishes well, you can type in rocminfo, and something like this will show (I have kept only the most relevant parts):
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen 7 5800H with Radeon Graphics
Uuid: CPU-XX
Marketing Name: AMD Ryzen 7 5800H with Radeon Graphics
Vendor Name: CPU
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16777216(0x1000000) KB
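The pool size in this output is reported in KB, which is not easy to read at a glance. As a small sketch (the function name `pool_size_gib` is hypothetical, and the pattern assumes the `Size: <decimal>(<hex>) KB` layout shown above), the value can be converted to GiB like this:

```python
import re


def pool_size_gib(rocminfo_text: str) -> list[float]:
    """Return every pool 'Size:' value found in rocminfo output, in GiB."""
    sizes_kb = re.findall(r"Size:\s*(\d+)\(0x[0-9a-fA-F]+\)\s*KB", rocminfo_text)
    # 1 GiB = 1024 * 1024 KB
    return [int(kb) / (1024 * 1024) for kb in sizes_kb]


sample = "Segment: GLOBAL; FLAGS: COARSE GRAINED\nSize: 16777216(0x1000000) KB"
print(pool_size_gib(sample))  # [16.0]
```

To run it against your own machine, you could pass in the text captured from `rocminfo`, for example via `subprocess.run(["rocminfo"], capture_output=True, text=True).stdout`.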
Python dependency can be quite tricky, so it is good practice to set up a proper environment. You can use either
source venv/bin/activate
conda activate llm
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.7
The following is specific to APUs with integrated graphics. Even though they are not officially supported by ROCm, the following proved to work.
export HSA_OVERRIDE_GFX_VERSION=9.0.0
Now, after all the complicated steps, let’s test whether ROCm is working with Torch. You can see that ROCm is “pretending” to be CUDA as far as PyTorch is concerned.
python3 -c 'import torch' 2> /dev/null && echo 'Success' || echo 'Failure'
Success
python3 -c 'import torch; print(torch.cuda.is_available())'
True
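The two one-liners above can be folded into a slightly more informative check. This is a sketch rather than an official diagnostic: the function name `describe_torch_backend` is my own, and setting the override from Python assumes it is read when the ROCm runtime initializes, which is why it is set before importing torch.

```python
import os

# Assumption: the override must be in the environment before the ROCm
# runtime initializes, so set it before importing torch. setdefault keeps
# any value that was already exported in the shell.
os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "9.0.0")


def describe_torch_backend() -> dict:
    """Report whether PyTorch is installed and whether it sees a GPU."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False, "gpu_available": False}
    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # torch.version.hip is a version string on ROCm builds, None otherwise.
        "hip_version": getattr(torch.version, "hip", None),
        "gpu_available": torch.cuda.is_available(),
    }
    if report["gpu_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


print(describe_torch_backend())
```

If `gpu_available` is False while `hip_version` is set, the ROCm build of PyTorch is installed but cannot see the iGPU, which points at the driver or the override rather than at the Python environment.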
Let’s start with something easy for our newly configured $300 AI computer: running a large language model locally. We can choose one of the popular open-source models:
In addition, you can also try small LLMs from
We will be using
First, you need to install wget and git. Then follow the steps to compile and install llama.cpp.
sudo apt-get install build-essential
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make
In order to run the LLMs on our inexpensive machine instead of cloud servers with expensive GPUs, we need to use a “compressed” version of the models so they can fit into the RAM space. For a simple example, a LLaMA-2 7B model has 7B parameters, each represented by float16 (2 bytes).
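To make the memory argument concrete, here is a back-of-the-envelope calculation. It is a sketch that counts only the weights; the function name `weights_size_gb` is hypothetical, and real quantized files are somewhat larger than the pure 4-bit figure because they also store scaling data alongside the weights.

```python
def weights_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


print(weights_size_gb(7e9, 16))  # 14.0 -> float16, too large for a 16GB machine
print(weights_size_gb(7e9, 4))   # 3.5  -> 4-bit quantization fits comfortably
```

The runtime also needs memory for the context and the operating system, so treat these numbers as a lower bound on what the machine must have free.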
Also, the file format should be
First, we tested it on the AMD mini PC, and we achieved about 10 tokens per second. This is actually quite decent, and you can carry on a chat with the LLM without too much waiting.
System config:
Command line instruction:
./main -m models/llama-2-7b-chat.Q4_0.gguf --color -ins -n 512 --mlock
llama_print_timings: load time = 661.10 ms
llama_print_timings: sample time = 234.73 ms / 500 runs ( 0.47 ms per token, 2130.14 tokens per second)
llama_print_timings: prompt eval time = 1307.11 ms / 32 tokens ( 40.85 ms per token, 24.48 tokens per second)
llama_print_timings: eval time = 50090.22 ms / 501 runs ( 99.98 ms per token, 10.00 tokens per second)
llama_print_timings: total time = 64114.27 ms
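If you want to compare runs across machines, the throughput can be recomputed from the raw timings instead of read off by eye. The sketch below assumes the `llama_print_timings` line layout shown above (it may differ in other llama.cpp versions), and the names `TIMING_RE` and `tokens_per_second` are my own.

```python
import re

# Matches e.g. "llama_print_timings: eval time = 50090.22 ms / 501 runs ..."
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<stage>[a-z ]+?)\s+time\s*=\s*(?P<ms>[\d.]+)\s*ms"
    r"(?:\s*/\s*(?P<count>\d+)\s+(?:runs|tokens))?"
)


def tokens_per_second(log_text: str) -> dict:
    """Map each timed stage that reports a token count to its tokens/second."""
    rates = {}
    for line in log_text.splitlines():
        match = TIMING_RE.search(line)
        # Lines such as "load time" and "total time" carry no count; skip them.
        if match and match.group("count"):
            seconds = float(match.group("ms")) / 1000.0
            rates[match.group("stage")] = int(match.group("count")) / seconds
    return rates


log = """\
llama_print_timings: load time = 661.10 ms
llama_print_timings: prompt eval time = 1307.11 ms / 32 tokens ( 40.85 ms per token, 24.48 tokens per second)
llama_print_timings: eval time = 50090.22 ms / 501 runs ( 99.98 ms per token, 10.00 tokens per second)
"""
for stage, rate in tokens_per_second(log).items():
    print(f"{stage}: {rate:.2f} tokens/s")
```

The `eval` entry is the one that matters for chat responsiveness, since it measures how quickly new tokens are generated.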
Next, we tested on an Intel mini PC, and we achieved about 1.5 tokens per second. This is a bit too slow for a fruitful chat session. It is not a fair comparison, since the Intel N5105 is clearly weaker than the AMD 5800H, but that is the only Intel mini PC in my possession. If you use a more powerful Intel CPU (e.g., Core i5-1135G7), you should get comparable results. Please report your findings in the comments below.
System config:
./main -m models/llama-2-7b-chat.Q4_0.gguf -ins --color -n 512 --mlock
llama_print_timings: load time = 14490.05 ms
llama_print_timings: sample time = 171.53 ms / 97 runs ( 1.77 ms per token, 565.49 tokens per second)
llama_print_timings: prompt eval time = 21234.29 ms / 33 tokens ( 643.46 ms per token, 1.55 tokens per second)
llama_print_timings: eval time = 75754.03 ms / 98 runs ( 773.00 ms per token, 1.29 tokens per second)
And pay attention to this page as well, with regard to AMD ROCm
export HSA_OVERRIDE_GFX_VERSION=9.0.0
source venv/bin/activate
./webui.sh --upcast-sampling --skip-torch-cuda-test --precision full --no-half
Test 1
SDXL (max resolution 1024x1024) recommends at least 12GB VRAM, so you definitely need to complete the Prep 1 step to allocate 16GB VRAM to the iGPU. This task is therefore only possible with the $400 mini PC.
./webui.sh --upcast-sampling
Test 1:
Test 2:
Although this article focuses on Linux operating systems, you can get Stable Diffusion working in Windows too. Here are my experiments:
Test 1:
So, are you having fun running your own generative AI models on your new $300 mini PC? I hope you are.
Open-source AI models running on personal devices are one of the most exciting areas for tinkerers, since none of us will have the massive GPU pool needed to actually train a foundational model. This will enable a new generation of apps that are super smart while still preserving our data privacy.
What next?
And happy tinkering with AI, open source, and on-device!