I'm finally comfortable admitting that LLM no longer stands for Lunar Landing Module.

After an 8-month break, during which my technical output was limited to migrating this blog from MongoDB to SQLite and creating a Minecraft modpack and a few mods, the itch to tinker and explore is finally coming back.

LLM-oriented development seems poised to shake up the way both hobbyists and professionals program.

There are a lot of angles from which to contemplate this: serious questions about the impact these tools may have on human society, and ethical questions about the way AI companies and researchers have trained LLMs. I'm aware of the horrors that unrestrained curiosity can lead us to.

I am going to claim, without evidence, that the legitimate complaints writers and artists have about what they see as theft apply a bit less clearly to software, where there is an enormous amount of copyleft code available that would allow such training. I certainly see it as allowable for my projects, most of which are BSD or MIT licensed.

With that discussion insufficiently addressed/sidestepped, my first instinct is to explore what's out there in the OSS world. I'm more than happy to explore paid offerings, but I don't want to start there.

Before I try to run editor-integrated agents, my goal is just to be able to ask questions of a locally run model, e.g. through Simon Willison's llm tool. Since DeepSeek-R1 is one of the most powerful "open source" models out there, I decided to set myself the goal of running it locally; if that turns out to be too big an ask, I could always use what I've learned to run a smaller model.
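
For a sense of the workflow I'm after, here's roughly what it looks like with the llm CLI. The plugin and model alias below are illustrative choices rather than a recommendation; I haven't settled on a backend yet.

# install Simon Willison's llm CLI
pipx install llm
# add a plugin that can talk to locally running models (one of several options)
llm install llm-ollama
# ask a question of whatever local model you have pulled (alias is hypothetical)
llm -m deepseek-r1:8b "Write a hello world in Python."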

After some brief sleuthing, I found this blog post from Unsloth, which has a great breakdown of the steps they took to quantize the 670GiB model down to ~128GiB, and which explains in detail how to get the model running locally.

The first problem I ran up against was that you need the NVIDIA CUDA toolkit in order to compile llama-gguf-split the way they suggest. You can get this on Ubuntu 24.04 via:

apt install nvidia-cuda-toolkit nvidia-cuda-dev
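
With the toolkit in place, the llama.cpp build that produces llama-cli and llama-gguf-split looks roughly like the following; the GGML_CUDA flag is what current llama.cpp cmake builds use, so adjust if the project has moved on since.

# fetch llama.cpp and build it with CUDA support
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --target llama-cli llama-gguf-split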

My next divergence from the script is that I used hfdownloader to download their model. Once it's installed, you can download Unsloth's model like this:

hfdownloader -m unsloth/DeepSeek-R1-GGUF:UD-IQ1_S

This applies the same filter for the 1.58-bit UD-IQ1_S version as the Python script suggested in the blog post.
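
For comparison, the blog post's Python route does the same filtering with huggingface_hub; a rough sketch (the local_dir here just mirrors hfdownloader's layout):

# roughly what the blog post's download script does
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/DeepSeek-R1-GGUF",
    local_dir="unsloth_DeepSeek-R1-GGUF",  # illustrative target directory
    allow_patterns=["*UD-IQ1_S*"],         # only the 1.58-bit shards
)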

The blog post provides a pointer for estimating how many layers of the model you can offload to the GPU, based on the amount of VRAM you have. I have a GeForce RTX 3070 8GB, 64GB of system RAM, and a Core i5-13600K: good for most things, but clearly underpowered for a model this large.

I wanted to see if I could run it anyway.

If you don't know what your hardware is, you can use nvidia-smi to get a nice info dump for your GPU, including a process list with current VRAM usage, and /proc/cpuinfo for info on your processor. Tip: you probably cannot run LLM inference on your GPU while playing Civilization VII.
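
A couple of one-liners I find handy for this (the nvidia-smi query flags assume a reasonably recent driver):

# GPU model and current VRAM usage
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv
# CPU model and available system RAM
grep -m1 "model name" /proc/cpuinfo
free -h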

According to the formula in the blog post, 8GB would be too small to run even a single layer on the GPU:

>>> (8 / 131) * 61 - 4
-0.2748091603053435
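
Generalizing the blog post's rule of thumb into a reusable snippet (the 131GB file size and 61 layers are specific to this particular quant):

>>> def offload_layers(vram_gb, file_gb=131, n_layers=61):
...     # scale the layer count by the VRAM/file-size ratio, minus a little headroom
...     return round(vram_gb / file_gb * n_layers - 4)
...
>>> offload_layers(24)  # e.g. a 24GB RTX 3090
7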

I decided I'd try running it CPU-only first, and then increase the number of GPU layers until I reached failure:

./llama.cpp/build/bin/llama-cli \
  --model unsloth_DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
  --cache-type-k q4_0 \
  --threads 16 \
  --prio 2 \
  --temp 0.6 \
  --ctx-size 8192 \
  --seed 3407 \
  --n-gpu-layers 0 \
  -no-cnv \
  --prompt "<|User|>Write a hello world in Python.<|Assistant|>"

To temper your expectations: unless you have a purpose-built machine, this is likely to take some time.

# --n-gpu-layers 0 -> 0.23 tokens per second
llama_perf_context_print: total time = 1368717.92 ms /   317 tokens
# --n-gpu-layers 1 -> 0.25 tokens per second
llama_perf_context_print: total time = 1843128.84 ms /   461 tokens
# --n-gpu-layers 2 -> OOM in CUDA

The model prints out its extensive thought process, which is fascinating and seems to vary significantly across runs. Because each token takes ~4s, the total runtime of a query is massively dominated by how long the model thinks itself in circles before it decides to answer. Offloading only one of 61 layers has a minor performance impact, which ends up being noise compared to how extensively the model second-guesses itself.

Clearly, if I want to run a local model for programming tasks, I'm going to have to choose a much more modest one, make some significant investments in iron, or both.

Now that I can generate a Python "Hello, World" in 30 short minutes, and I know that I can offload to my GPU, I'm going to investigate some smaller distillations of DeepSeek and some of the other OSS models.

CUDA coda

I did several hours of research on "significant investments in iron" to see what that might look like.

There are two active Reddit communities, r/LocalLLM and r/LocalLLaMA, that are a good starting point for understanding the options and bottlenecks involved in running these models.

The biggest challenge currently is that the best-performing inference code uses CUDA, which is NVIDIA-exclusive, but as you might expect from a graphics card company with a $3.4 trillion market cap, their cards are expensive.

For a long time, the xx60/70/80/90 branding of NVIDIA's cards was confusing to me. Is an RTX 3090 better than an RTX 4060? What the hell is an RTX 8000? These cards are complex systems, more like embedded microcomputers. They have a number of different parameters that can have an impact when it comes to running an LLM. In order of how most people seem to prioritize these:

  • VRAM - analogous to system RAM, this determines how large of a model you can load into the card; these typically run 8, 12, 16, 24, or 32GB. People seem to prize this the highest.
  • VRAM Bandwidth - how fast you can load things in and out of memory. Most people report that this is typically the bottleneck for inference; most graphics cards are overkill computationally.
  • Architecture age - there is a cutoff beyond which older architectures lack important hardware support that accelerates inference. For consumer NVIDIA cards, the Ampere architecture of the 30xx series and newer is considered the cutoff for "good" architecture.
  • CUDA cores - impacts parallelism and thus throughput for CUDA code; typically in the thousands.
  • Tensor cores - typically in the hundreds. These are also employed by most inference engines, though I don't know their exact impact. Having some at all seems to be good enough, as they will always be present on cards with "good" architecture.

In addition to the gaming-focused cards that you may be familiar with (the 30xx/40xx series), NVIDIA also makes several lines of "datacenter" oriented GPUs which typically ship with much more VRAM. The RTX A6000, for instance, has 48GB of VRAM and the "good" Ampere architecture, but it runs about $5,000. The older RTX 8000 also comes in a 48GB flavor, but its outdated architecture apparently makes it slower at inference than trying to cobble together a bunch of newer-architecture cards.

The TLDR of all of this is that, for the past ~1 year at least, the best $/perf balance has been in running one or more RTX 3090 cards.

Coming at the problem from a different direction, if the issue is that VRAM is expensive and system CPUs lack parallelization, what if you had a unified memory architecture and an ARM main CPU like most modern Macs do?

The sweet spot seems to be a Mac Studio, which you can get with 192GB of unified V/RAM for about the same cost as an RTX A6000. These are actually available, have great power utilization and thermal properties, and you don't have to build them yourself. The downside seems to be that while they run inference well, they are not as good for fine-tuning and training.

Feb 17