TL;DR: I released a new app called LocalChat that allows you to run chatbots locally, including models that have been in the news over the past few years, such as Facebook's Llama. Find the website and download options at https://nathanlesage.github.io/local-chat.
How far we have come. Just over six years ago, in 2017, the transformer architecture was introduced. Then OpenAI released ChatGPT in late 2022. And today, in early 2024, we see more and more models that deliver performance similar to ChatGPT’s but no longer even require an enterprise-grade GPU.
Are the days finally over that we have to pollute the environment just to enjoy the fruits of recent AI breakthroughs? Certainly not. The technology is still in its infancy. There are lots of new developments to be made until generative AI is relatively carbon-neutral. And there are still thousands of issues surrounding ethics, bias, and fairness. We are only at the start. And the political fight over the power of interpretation has only just begun: Is AI the next industrial revolution? Or is AI the harbinger of death? Will it only benefit the rich? Or can the technology be democratized?
What is certain about generative AI is that large for-profit companies seek to monetize data they scraped from the web (sometimes illegally, sometimes unethically), underpay click-workers to do the dirty work, and run energy-hungry server farms 24/7 just so that Justin can ask an AI to convert inches to centimeters instead of simply using a calculator.
However, any tool can be a weapon if you hold it right; and in this sense, I am happy to announce that today I launched my attempt at democratizing access to generative AI: LocalChat. It’s a small hobby project, so don’t expect too many exciting new updates (I still have a dissertation to finish), but the project is mature enough to be used carefully.
So what is LocalChat? It’s a small app that serves as a graphical user interface for interacting with large language models (LLMs) in a chat-like environment. It is certainly not the first of its kind. What is new is that it’s the first Open Source app (licensed under the GNU GPL v3.0) that works without any setup. The only things you need are the app itself and a language model that you want to chat with. It’s that simple: no need to set up Python, compilers, or anything else, really.
The reason I developed an app for this is that generative AI is still somewhat out of reach for less technically adept consumers. You either need a lot of knowledge to run models locally, or you need a lot of money to throw at OpenAI for letting you use theirs.
But most things shouldn’t be provided as a service but rather as a local app, and I am happy to have discovered that generative AI now belongs to that latter category as well.
This has to do with two developments since ChatGPT was launched at the end of 2022: Llama.cpp and quantization.
The first development started in early 2023 with the release of Llama.cpp. As the name suggests, it is a C++ library for running Facebook’s Llama models, initially on MacBooks. Fortunately, that limitation only applied to the first iteration of the library. Soon thereafter, Llama.cpp added support for Windows and Linux, and it can now run a wide variety of models. Effectively, Llama.cpp is a faster, less convoluted alternative to running LLMs with a Python wrapper.
These efforts have also had a nice side effect: the GGUF file format. This is just another way of storing model weights, similar to Huggingface’s “safetensors” format or a plain pickle file. The benefits of GGUF files are that they are a bit more compact, load much faster, and support quantized models out of the box.
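To make the file format a little more tangible, here is a minimal sketch in Python that peeks at the first bytes of a GGUF file. It assumes the header layout of GGUF version 2 and later as described in the public specification: a 4-byte magic `GGUF`, a 32-bit version number, and two 64-bit counters (number of tensors and number of metadata entries), all little-endian. The function name `read_gguf_header` is made up for this illustration and is not part of LocalChat or Llama.cpp.

```python
import struct
import sys

GGUF_MAGIC = b"GGUF"  # every GGUF file starts with these four bytes


def read_gguf_header(path: str) -> dict:
    """Read the fixed-size header of a GGUF file (assumes version >= 2, little-endian)."""
    with open(path, "rb") as f:
        header = f.read(24)  # 4 (magic) + 4 (version) + 8 + 8 (counters)

    if len(header) < 24 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")

    # "<" = little-endian, I = uint32, Q = uint64
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python gguf_header.py path/to/some-model.gguf
    print(read_gguf_header(sys.argv[1]))
```

The metadata entries that follow the header (model architecture, quantization type, tokenizer, and so on) are what allow an app to load a model from a single file without extra configuration; parsing them is left out here to keep the sketch short.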
LocalChat uses Llama.cpp under the hood to facilitate all the chatbot magic. It has enabled me to write an app that delivers comparatively fast generation times with comparatively large models. While still not as reliable as ChatGPT (if you want to call a probabilistic language model “reliable”, that is), the current models’ quality is decent enough to use them for basic tasks.
I mentioned a second trend that has enabled this application: quantization. I had heard of it before, but never cared to look deeper into it. It turns out that there is a whole field that is exclusively concerned with reducing the size of current generative AI models while maintaining as much quality as possible.
The basic idea is that a neural network has weights it uses to generate words, and these weights are normally stored as 16-bit or 32-bit floating point numbers. While that is only 2 or 4 bytes of storage per weight, respectively, multiplying it by 7 billion parameters gives you very large numbers. The idea behind quantization is to basically “cut off” some digits after the decimal point, which reduces the precision of the weights, but also makes them smaller. In layman’s terms, you take the number
0.1623426982 and turn it into
Quantization is especially useful for inference (i.e., using the model as opposed to training it). This means that a model is first trained with full precision, and only afterwards are the weights trimmed. This can dramatically reduce the file size and memory requirements, enabling even slower hardware to run large models. And the quality doesn’t suffer all that much. Sure, a “full”, non-quantized model is still more precise, but I discovered that a model quantized to 4 bits per weight (as opposed to 16, four times as many) is still really, really good.
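The arithmetic and the “cutting off” idea can be sketched in a few lines of Python. Note that this is a deliberately simplified illustration: it uses plain symmetric “absmax” rounding over one list of numbers, whereas the quantization schemes actually used in GGUF files work block-wise and are more elaborate. The function names and the example weights (apart from the first one) are made up for this sketch.

```python
# Illustrative sketch only: simple symmetric "absmax" quantization.
# Real-world schemes (e.g. those used in GGUF files) are more sophisticated.


def model_size_gb(n_params: int, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def quantize(weights: list[float], bits: int) -> tuple[list[int], float]:
    """Map floats to signed integers small enough to fit into `bits` bits."""
    q_max = 2 ** (bits - 1) - 1  # largest representable integer, e.g. 7 for 4 bits
    largest = max(abs(w) for w in weights)
    scale = largest / q_max if largest > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights: list[int], scale: float) -> list[float]:
    """Recover approximate floats; the precision lost by rounding stays lost."""
    return [q * scale for q in q_weights]


if __name__ == "__main__":
    for bits in (32, 16, 4):
        size = model_size_gb(7_000_000_000, bits)
        print(f"7B parameters at {bits:>2} bits: {size:.1f} GB")

    weights = [0.1623426982, -0.8471, 0.0312, 0.5]
    q, scale = quantize(weights, bits=4)
    for original, restored in zip(weights, dequantize(q, scale)):
        print(f"{original:+.10f} -> {restored:+.10f}")
```

For a 7-billion-parameter model, the first loop reports 28.0 GB at 32 bits, 14.0 GB at 16 bits, and 3.5 GB at 4 bits. Actual 4-bit model files come out somewhat larger, since the scaling factors have to be stored alongside the weights.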
And that’s good news.
No SaaS: You Are in Control
The clock is ticking. All the large companies, and many newly founded startups, are trying to cash in on the generative AI trend. And the easiest way to stay in business as a company is to simply offer a service: unless you grant access to the service, nobody can use it. That makes it very simple to generate a continuous stream of (subscription) income.
But it’s also anti-consumer and hostile to the economically disadvantaged. When software is provided in the form of “SaaS”, or “Software as a Service”, it means that you have to have a continuous internet connection; you have to send your data to a company with possibly less-than-ideal privacy-protecting measures; and you always risk losing access, should the company either cut you off or go out of business. Governments, government agencies, and companies have already instituted bans on using ChatGPT to prevent employees from accidentally leaking business secrets or classified information to the Californian company.
This means that effectively no privacy-aware citizen can really use ChatGPT to the fullest. It is simply exhausting having to constantly think about whether you may have written down confidential information in your latest prompt.
LocalChat is my – very honest, very limited, very low-key – attempt to fix this issue for me and like-minded people. I like to play around with new technology, but I don’t like the aforementioned risks. With generative AI getting smaller and better, it is about time we stop the trend of businesses providing access to these AI assistants only via an API. Running your own AI locally has many benefits: it’s privacy-respecting by default, it works when the internet is down, and you exercise full control.
I hope that you can find some use in LocalChat. As I said, it is more a proof of concept than a stable, fully-fledged app at this time (and I won’t make a mobile app), but I personally find it already very useful.
If you’d like to try it out, head over to https://nathanlesage.github.io/local-chat and give it a go!