Hi! If you are a mobile developer and follow AI trends, you have probably wondered how to integrate large language models (LLMs) into your workflow directly in Android Studio. In this article, I will show how to do it quickly and easily, without relying on external APIs or cloud solutions.
I will share a step-by-step guide on how to run a local LLM on your computer and integrate it with Android Studio. We will figure out how to choose a model, prepare the environment, and use it.
🛠️ Step-by-step guide
Selecting and loading a model
- Hardware requirements
- Where to find models and what to look for
- Installing and configuring the necessary tools and libraries
- Which models are best suited for local launch?
Setting up the environment
- Setting up and running the server to run the model locally
Integration with Android Studio
- Installing the Continue plugin (and its problems)
- Launch and test your work directly from Android Studio
🔍 What else did I try and conclusions
- Other options I tried (ChatGPT, Cursor)
- What’s next?
Ready to launch your own AI right on your desktop? Let’s go! 🚀
Hardware requirements
- Mac with an Apple Silicon M-series chip and 16+ GB of RAM (preferably 32 GB)
- Windows PC with an NVIDIA/AMD graphics card (the more powerful, the better)
Selecting and loading a model
- There is a great resource for models, https://huggingface.co/models , but it is quite difficult to figure out right away what we need, and you would also have to learn how to launch models from the console, read guides, write scripts, and so on. I would like to avoid that, especially while we are still just getting familiar with different models (although it is worth a look just to see what is out there :) ).
- Since we are lazy, we will use LM Studio. It supports macOS, Windows, and Linux, which is exactly what we need. Download and install it.
- After launching, select Power User mode and click on Discovery (the magnifying glass icon).
The search will show a large number of models to choose from.
Also pay attention to the GGUF and MLX checkboxes.
For Apple Silicon (M1/M2/M3/M4) it is preferable to use MLX, as those builds generally run better, but an MLX build is not always available for the model you want (this is not a problem if the model is small).
Also, when downloading a model, you can choose a quantized ("cut-down") variant. Start with the smallest one; if everything works and your hardware allows it, try a heavier one, and so on.
Quantized models are compressed versions of neural networks that take up less space and require fewer resources to run (e.g. on low-end hardware).
Usually they are slightly worse than the original in quality: the stronger the compression (the lower the number of bits, for example, 2, 4, 8), the worse the model works. At the same time, 8-bit models are practically indistinguishable from the original, and 4-bit ones give a good balance between quality and size.
It is important to remember that the number of parameters affects quality even more than the degree of quantization. That is, a model with a large number of parameters but a lower bit rate (e.g. quantized 13B) will often perform better than a model with a smaller number of parameters, even if it is not compressed at all (e.g. original 7B).
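As a rough sanity check when choosing a quantization level, you can estimate how much memory the weights alone will need: roughly the parameter count times bits per weight, divided by 8. Here is a back-of-the-envelope Kotlin sketch; real usage adds context/KV-cache and runtime overhead on top, so treat the numbers as lower bounds:

```kotlin
// Back-of-the-envelope estimate: weight memory ≈ parameters × bits per weight / 8.
fun estimateWeightMemoryGb(params: Double, bits: Int): Double =
    params * bits / 8 / 1_000_000_000

fun main() {
    // Llama 3.1 8B at 4-bit: about 4 GB of weights.
    println("8B @ 4-bit:  %.1f GB".format(estimateWeightMemoryGb(8e9, 4)))
    // A quantized 13B (~6.5 GB) still needs far less memory
    // than an uncompressed 16-bit 7B (~14 GB).
    println("13B @ 4-bit: %.1f GB".format(estimateWeightMemoryGb(13e9, 4)))
    println("7B @ 16-bit: %.1f GB".format(estimateWeightMemoryGb(7e9, 16)))
}
```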
In this example I will use the Llama 3.1 8B Instruct 4bit model.
(If you have weak hardware, you can try the deepseek-coder-6.7b-instruct model)
Select and download.
While we wait for the download…
Next, go to the Developer tab and click Select a model to load. Choose our model and load it into memory.
Setting up the environment
- When our model is loaded, we need to do the following in order:
- 1. Enable the CORS checkbox (this allows access from IDE plugins and web clients)
- 2. Set the Server Port to 11434 (this is important; more on that below in the plugin settings)
- 3. Start the server using the checkbox.
If everything is OK, the model status should be READY and we will see logs in the console. Your model is now available at http://127.0.0.1:11434.
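To sanity-check the server outside the IDE, you can hit LM Studio's OpenAI-compatible /v1/chat/completions endpoint directly. Below is a minimal Kotlin sketch (JDK 11+, no extra dependencies); the model id "llama-3.1-8b-instruct" is a placeholder, use whatever identifier LM Studio shows for your loaded model:

```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

fun main() {
    // Request body for the OpenAI-compatible chat endpoint served by LM Studio.
    // Replace the model id with the one LM Studio shows for your loaded model.
    val body = """
        {
          "model": "llama-3.1-8b-instruct",
          "messages": [
            {"role": "user", "content": "Write a Kotlin extension that reverses a String."}
          ],
          "temperature": 0.2
        }
    """.trimIndent()

    val request = HttpRequest.newBuilder()
        .uri(URI.create("http://127.0.0.1:11434/v1/chat/completions"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build()

    // The generated text is in choices[0].message.content of the JSON response.
    val response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString())
    println(response.body())
}
```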
Integration with Android Studio
- At the moment, Android Studio has many different plugins and most of them are under active development, so their stability and documentation leave much to be desired.
- Here are a couple of plugins that definitely work and support fully offline use:
- 1. Continue
- 2. Junie
- Unfortunately, even for the offline version of Junie you have to register and start a trial, and they also have a bug that makes it impossible to start the trial from the IDE at all. I tried a VPN and a proxy, both on the laptop and on the router, and changed countries and regions. Nothing worked. Besides, having to register and start a trial for an offline version is a bit confusing. So the choice fell on Continue.
- Installing the Continue plugin
- Follow the link and install the plugin in Android Studio (my version at the time of writing is Android Studio Meerkat | 2024.3.1 Patch 1). You should see a tab. Click on it.
- There is a high chance that you will see a Connection Error in this plugin’s window (because the bundled Java runtime has no JCEF support, issue ).
- To solve this problem we need to choose a runtime with JCEF support:
- You need to be careful here, because you can break Android Studio :)
- Find Action
- Choose Boot Runtime for IDE
- Here choose any runtime with JCEF (I chose the one in the screenshot)
- If Android Studio breaks, try several different runtimes with JCEF. If it is completely broken, delete the studio.jdk file in the Android Studio directory. Where to find the directories — here . Everything started on the first try for me.
- Next, select Local Assistant in the plugin menu. (should be selected by default if you haven’t logged in)
- Click Add Chat Model, then select LM Studio with Autodetect and click Connect.
- Android Studio should open config.yaml. Note the apiBase path — it contains /v1/.
In the config we can assign our model to different roles. Requests from Android Studio are now routed to the local server. Let’s check: everything is OK.
- More details about the roles:
- Click on the cube icon (Models). A set of dropdown menus opens where you can choose a model for each role. Here we can configure the behavior of each model.
- In the screenshot, chat (conversation), edit (in-code editing), apply (applying code from chat) and autocomplete (smart autocompletion) are enabled. If you do not need some of these features, disable them (you can comment them out in config.yaml). You can also use a large model for chat and a faster one for autocomplete, but then two models will sit in memory, so decide based on your hardware and needs. A sketch of what the resulting config.yaml can look like is shown below.
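For reference, here is a rough sketch of what such a config.yaml can look like with one LM Studio model assigned to several roles. Treat it as illustrative: the exact fields are generated by the Autodetect flow, and the model id here is a placeholder for whatever LM Studio reports:

```yaml
name: Local Assistant
version: 1.0.0
models:
  - name: Llama 3.1 8B Instruct (LM Studio)
    provider: lmstudio
    model: llama-3.1-8b-instruct
    apiBase: http://127.0.0.1:11434/v1/
    roles:
      - chat
      - edit
      - apply
      - autocomplete
```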
Also, in the same menu, under Tools, you can configure the behavior for each operation.
The Continue plugin has bugs too. For example, the Ollama port is hardcoded in the plugin, so from the IDE it talks to exactly 11434. I saw similar issues reported, so rather than wait for fixes, it was simpler to set LM Studio to that port (which is why we chose 11434 earlier).
🔍 What else did I try?
Paid ChatGPT:
You can use it via the native app, which can apply code to the IDE, but it does not work well with Android Studio: sometimes it fails to apply the code or crashes when switching tabs. Alternatively, you can use the OpenAI API with the current setup in the Continue plugin: just switch to an online model (a sketch is below).
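If you go that route, it comes down to adding an online model entry next to the local one in config.yaml. A hedged sketch under the same assumptions as above (the model id and the key placeholder are illustrative):

```yaml
models:
  - name: GPT-4o (online)
    provider: openai
    model: gpt-4o
    apiKey: YOUR_OPENAI_API_KEY
    roles:
      - chat
```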
Paid Cursor:
There are several models under the hood, but the best one is still claude-3.7. Switching to gpt-mini/o4-mini and the like does not make much sense. It is also quite awkward to work in their IDE. Yes, they support extensions for working with Gradle and Kotlin, but launching and working with the project from their environment does not really work out. I ran into having two Gradle and Java instances running, one from Cursor and one from AS, and together they ate up almost all the memory. Well, there is nothing more convenient than AS for mobile development :). Alternatively, you can give it limited access to a project module or a set of files and ask it to perform a specific task; the agent will do it on its own. I often found that it did something wrong, then something else, then broke things, and so on. It gets especially difficult when more than 10 files are involved.
What’s next?
- Test and find a suitable model for your hardware (I have M1 64GB).
- I am currently testing these. The smartest is claude-3.7-sonnet. There is no MLX for it now, but there are even cases where MLX works slower.
- Approximate load on my hardware while processing a request to claude-3.7-sonnet (shown with the asitop terminal tool)
- I also have a PC with a graphics card; you could potentially run the model on it and share it over the local network (LM Studio supports this in one click). Then almost all the RAM on the laptop stays free. You can go even further and expose it over your own network through a VPN.
- Try Copilot and Junie (when the bugs are fixed :-)
✅ Conclusions
Running an LLM for Android Studio locally is not only convenient, it also significantly expands your capabilities as a developer. Of course, the solutions are still a bit “raw” at the moment, but the potential is huge, and you can already feel the real advantages of AI right on your own hardware.
🔥 Let’s discuss!
Tell me how you managed to integrate a local LLM. Which models turned out to be the most convenient? (Ideally, mention your hardware, what worked for you, and the size of your project.)
Share your thoughts and ask questions in the comments :-)
Was this interesting, or do you have questions? (my LinkedIn)
