When I first got into local LLMs nearly 3 years ago, in mid 2023, the frontier closed models were of course impressively capable.
I then tried my hand at running 7B-size local models, primarily one called Zephyr-7B (what happened to these models?? Dolphin, anyone??), on my gaming PC with an 8GB AMD RX 580 GPU. Fair to say it was just a curiosity exercise (in terms of model performance).
Fast forward to this month: I'm revisiting local LLMs. (Although I no longer have the gaming PC. Cost-of-living crisis, anyone? 😫)
And the ~31B-size models now look very sufficient. #Qwen has taken the helm in this tier. It's still quite expensive to set up locally, although within grasp.
I’m rooting for the edge-computing models now - the ~2B-size models. Due to their low footprint, they are practical for many people to run 24/7 on an SBC at home.
But these edge models are in the ‘curiosity category’ now.
This weekend I had an LLM walk me through setting up some home server stuff and networking. I tried using Proton’s Lumo and Qwen 3.6 locally. I have to say Qwen was the more impressive of the two models. When I first tried running local models like Llama 4, I remember thinking to myself that this was a dead end and big servers would always have the advantage, but it seems like we’re hitting a turning point where many things can be done locally.
Cool, what was your hardware, and which Qwen size did you use? Thanks.
I have a 24GB AMD 7900 XTX, and it’s a 35B-parameter model.
Ooo… I’m running a 7900 XTX as well. Having 24GB without the Nvidia tax has been super nice for AI stuff. I have a 16GB 6900 XT running in another computer, and a lot of my AI model selection is still sized for it. I may need to stop procrastinating and copy your setup sooner rather than later.
Before I forget, can I ask you what GPU driver version you’re running? I recently encountered some stability issues after a driver update (trying to support gaming and AI stuff at the same time), and the latest version I could find any stability claims for was 24.12.1.
As I recall, there are some new tricks that allow up-to-8B models to run on a Raspberry Pi 5 at around 10-15 tokens per second with --ctx 32768. I haven’t kept across it because I don’t visit Reddit, but that was my last recollection. If you fossick over there, you may be able to find it. Or use kagi.com to find it, heh.
One of the goals of the harness that I built was to reduce memory pressure, particularly KV cache, so that you could run larger models on more constrained hardware, but I’m not here to spruik myself. I’m just letting you know that there are ways and means to get it done on SBCs.
EDIT: I “kagi’ed” it for you. Here
| model | size | params | backend | threads | ngl | test | t/s |
|---|---|---|---|---|---|---|---|
| qwen3.5 9B Q8_0 | 8.86 GiB | 8.95 B | CPU | 4 | 0 | pp512 | 18.20 ± 0.23 tok/s |
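If you want to poke at that yourself, here’s a minimal sketch using the llama-cpp-python bindings. The model path is hypothetical, and this is just the baseline invocation - it won’t include whatever KV-cache tricks that thread used:

```python
# Minimal sketch: an ~8B GGUF with a 32k context on CPU, via the
# llama-cpp-python bindings (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="models/qwen3.5-9b-q8_0.gguf",  # hypothetical path
    n_ctx=32768,      # the --ctx 32768 mentioned above
    n_gpu_layers=0,   # pure CPU, as on a Raspberry Pi 5
    n_threads=4,      # match your core count
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```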
Is it just me, or are the smaller models that fit in my VRAM very dumb?
Do you have 24 GB?
Not of VRAM.
That’s your issue.
Thanks for the thorough investigation.
It’s not just you. But while they may be natively “dumb”, they can be augmented quite significantly. Even adding a simple web-search tool can help a lot.
So, there are levels of “dumb”. Some - like Qwen3-4B 2507 Instruct - may not have the world knowledge of a SOTA model, but their reasoning abilities can be quite impressive. See HERE for an example from a self-made test suite. You can run something similar yourself.
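As a rough illustration of the “simple web-search tool” idea, a minimal sketch. It assumes a self-hosted SearXNG instance with its JSON output format enabled; the URL and query are just placeholders:

```python
# Minimal sketch of a web-search tool a small model can lean on.
# The harness runs the search and pastes results into context,
# so the model doesn't have to "know" anything itself.
import requests

SEARX_URL = "http://localhost:8888/search"  # hypothetical instance

def web_search(query: str, n: int = 3) -> str:
    """Return the top-n results as a plain-text block for the prompt."""
    r = requests.get(SEARX_URL, params={"q": query, "format": "json"}, timeout=10)
    r.raise_for_status()
    results = r.json().get("results", [])[:n]
    return "\n\n".join(
        f"{hit['title']}\n{hit['url']}\n{hit.get('content', '')}"
        for hit in results
    )

# Usage: prepend grounding before the question goes to the small model.
context = web_search("Qwen3-4B 2507 instruct release notes")
prompt = f"Using only the sources below, answer the question.\n\n{context}\n\nQ: ..."
```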
I guess it depends what you mean by “dumb” and how that affects what you’re trying to do with them. Some are dumb at tool use, some have poor world knowledge etc. You can find small models that are good at what’s important to you if you dig around. Except for coding - that’s rough. Probably the smallest stand-alone that might make you sit up and pay attention is something like Qwen2.5-Coder-14B-Instruct or FrogMini-14B-2510…but I wouldn’t trust them to go spelunking a code base.
What are some other ways to make it better beyond just adding a search tool? Is 16GB VRAM sufficient for usable results?
Where do you think is the best place to start down this rabbit hole?
It’s really hard for me to answer this question without pointing to my project, because the project is sort of directly in response to this very problem. So, gauche as it may be, fuck it:
https://codeberg.org/BobbyLLM/llama-conductor
I mention this because 1) I am NOT trying to get you to install my shit but 2) my shit answers this directly. I note the conflict of interest, but OTOH you did ask me, and I sort of solved it in my way so…fuck. (It’s FOSS / I’m not trying to sell you anything etc etc).
With that out of the way, I will answer from where I am sitting and then generically (if I understand your question right).
Basically -
Small models have problems with how much they can hold internally. There’s a finite meta-cognitive “headspace” for them to work with… and the lower the quant, the fuzzier that gets. Sadly, with a weaker GPU, you’re almost forced to use lower quants.
If you can’t upgrade the LLM (due to hardware), what you need to do is augment it with stuff that takes on some of the heavy lifting.
What I did was this: I wrapped a small, powerful, well-benchmarking LLM in an infrastructure that takes the things it’s bad at outside of its immediate concern.
Bad inbuilt model priors / knowledge base? No problem; force answers to go through a tiered cascade.
The cascade runs, in order:
1. Inbuilt quick responses that you define yourself as grounding (cheatsheets)
2. A self-populating wiki-like structure (you drop a .md into one folder, hit >>summ, and it cross-updates everywhere)
3. Wikipedia short lookup (the ~800-character opening box: most wiki articles are structured with the TL;DR in that section)
4. Web search (using trusted domains) or web synth (trusted domains plus cross-verification)
5. Finally… the model’s pre-baked priors.
In my setup, the whole thing cascades from highest trust to lowest trust (as defined by the human), stops when it gathers the info it needs, and tells you where the answer came from.
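If it helps to see the shape of it, here’s a toy sketch of the cascade pattern - NOT the actual conductor code, and the tier functions are stubs:

```python
# Toy sketch of a trust-tiered cascade. Each tier either returns
# grounded text or None; the first hit wins, and we record where
# the answer came from.
from typing import Callable, Optional

# Stub tiers, highest trust first. Real versions would hit your
# cheatsheets, the .md vault, Wikipedia's lead section, then the web.
def cheatsheets(q: str) -> Optional[str]: return None
def vault_wiki(q: str) -> Optional[str]: return None
def wiki_intro(q: str) -> Optional[str]: return "GQA: grouped-query attention..."
def trusted_search(q: str) -> Optional[str]: return None

TIERS: list[tuple[str, Callable[[str], Optional[str]]]] = [
    ("cheatsheet", cheatsheets),
    ("vault-wiki", vault_wiki),
    ("wikipedia", wiki_intro),
    ("web-search", trusted_search),
]

def cascade(query: str) -> tuple[str, str]:
    for name, lookup in TIERS:
        hit = lookup(query)
        if hit is not None:
            return name, hit          # stop as soon as we're grounded
    return "model-priors", ""         # last resort: the bare LLM

source, grounding = cascade("what is GQA?")
print(f"[{source}] {grounding}")
```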
Outside of that, sidecars that do specific things (maths solvers, currency lookup tools, weather lookup, >>judge comparators… tricks on tricks on tricks).
Based on my tests, with my corpus (shit I care about) I can confidently say my little 4B can go toe to toe with any naked 100B on my stuff. That’s a big claim, and I don’t expect you to take it at face value. It’s a bespoke system with opinions…but I have poked it to death and it refuses to die. So…shrug. I’m sanguine.
Understand: I assume the human in the middle is the ultimate arbiter of what the LLM reasons over. This is a different school of thought to “just add more parameters, bro” or “just get a better rig, bro”, but it was my solution to constrained hardware and hallucinations.
There are other schools of thought. Hell, others use things like MCP tool calls. The model pings cloud or self-host services (like farfalle or Perplexica), calls them when it decides it needs to, and the results land in context. But that’s a different locus of control; the model’s still driving…and I’m not a fan of that on principle. Because LLMs are beautiful liars and I don’t trust them.
The other half of the problem isn’t knowledge - it’s behaviour.
Small models drift. They go off-piste, ignore your instructions halfway through a long response, or confidently make shit up when they hit the edge of what they know. So the other thing I built was a behavioural shaping layer that keeps the model constrained at inference time - no weight changes, just harness-level incentive structure. Hallucination = retry loop = cost. Refusal = path of least resistance. You’re not fixing the model; you’re making compliance (mathematically) cheaper than non-compliance.
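A toy sketch of that incentive structure - again, not the real implementation; the verify() check here is a stand-in for whatever validation you trust (citation presence, regexes, a judge model):

```python
# Toy sketch of harness-level shaping: hallucination = retry loop = cost.
def verify(answer: str, grounding: str) -> bool:
    # Stub check: the answer must either admit ignorance or visibly
    # reuse the grounding text it was handed.
    return "I don't know" in answer or (bool(grounding) and grounding[:40] in answer)

def shaped_generate(llm, prompt: str, grounding: str, max_retries: int = 3) -> str:
    """llm is any callable taking a prompt string and returning text."""
    suffix = ""
    for attempt in range(max_retries):
        answer = llm(prompt + suffix)
        if verify(answer, grounding):
            return answer                       # compliance = cheap exit
        # Non-compliance costs another pass with a tighter leash.
        suffix = ("\nYour previous answer failed verification. "
                  "Answer ONLY from the provided sources, or say 'I don't know'.")
    return "I don't know."                      # refusal beats confabulation
```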
That’s how I solved it for me. YMMV.
On 16GB VRAM: honestly, that’s decent - don’t let GPU envy get to you. You can comfortably run a Q4_K_M of a 14B model entirely in VRAM at usable speeds - something like Qwen3-14B or Mistral-Small. Those are genuinely capable; not frontier, but not a toy either. The painful zone is 4-8GB (hello!), where you’re either running small models natively or offloading layers to RAM and watching your tokens-per-second crater. You can do some good stuff with a 14B, augmented with the right tools.
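For instance, a minimal sketch via the llama-cpp-python bindings (the model path is hypothetical; llama.cpp’s -ngl flag is the CLI equivalent):

```python
# Minimal sketch: a 14B Q4_K_M fully offloaded to a 16GB card.
from llama_cpp import Llama

llm = Llama(
    model_path="models/Qwen3-14B-Q4_K_M.gguf",  # hypothetical quant
    n_gpu_layers=-1,   # -1 = offload every layer to the GPU
    n_ctx=8192,        # trim context if you start spilling past 16GB
)
```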
Where to start the rabbit hole: Do you mean generally? Either Jan.ai or LM Studio is the easiest on-ramp - drag and drop models, built-in chat UI, handles GGUF out of the box. IIRC, Jan has direct MCP tooling as well.
Once you want more control, drop into llama.cpp directly. It’s just…better. Faster. Fiddlier, yes…but worth it.
For finding good models, Unsloth’s HuggingFace page is consistently one of the better curators of well-quantised GGUFs. After that it’s just… digging through LocalLLaMA and benchmarking stuff yourself.
There’s no substitute for running your own evals on your own hardware for your own use case - published benchmarks will lie to you. If you’re insane enough to do that, see my above “rubric” post.
Not sure…have I answered your question?
PS: for anyone that hits the repo and reads the 1.9.5 commit message - enjoy :) ’Twas a mighty fine bork indeed, worthy of the full “Bart Simpson writes on chalkboard x 1000” hall of shame message. Fucking VSCodium, man… I don’t know how sandbox mode got triggered, but it did, and it ate half my frikken hard drive and repo before I could stop it. Rookie shit.
Commenting so I can come back to this later.
I didn’t try any 7B ones lately; they may be a better fit for 16GB, I think. I was able to try the 2B ones as I mentioned (on CPU); they are subpar. Like I mentioned, the usable ones were 31B. I think you need at least 24GB VRAM for most models, though. Maybe someone else can suggest better.
Bummer. Spilling over to an old computer’s system RAM is painful for the smarter models too.
You can give “offloading some layers to RAM” a try, though… that way you can get your hands on the “usable” 31B models. Browse around to find some good 31B ones… GL
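Something like this, via the llama-cpp-python bindings (the path and layer count are guesses - lower n_gpu_layers until it stops running out of VRAM):

```python
# Minimal sketch of partial offload: keep as many layers as fit on the
# GPU, let the rest run from system RAM. Slower, but it makes 31B-class
# models reachable on a smaller card.
from llama_cpp import Llama

llm = Llama(
    model_path="models/some-31b-q4_k_m.gguf",  # hypothetical
    n_gpu_layers=40,   # partial offload; the remainder stays in RAM
    n_ctx=4096,        # a smaller context also eases VRAM pressure
)
```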
For small models, the Bonsai series seems to be getting the spotlight. Natively trained at 1-bit and ternary 1.58-bit, the 8B runs in ~1GB of memory. I’m curious about local models but haven’t tried them for lack of a gaming rig, but it seems to work well enough for a regular PC.
Funny, I tried the 8B Bonsai (https://huggingface.co/prism-ml/Bonsai-8B-gguf); when loaded it takes ~7GB of RAM!! When prompting, it stalls my llama.cpp container (I’m running on a weak 4th-gen i5).
Interesting thanks!
I’m glad to see 1.58Bs finally starting to appear.
I got GPT to side-by-side the benchmarks (for what they are worth). Bonsai 8B seems to be a cook-off from Qwen3-8B. If they can squeeze an 8B into 1GB… then perhaps we can get a 20-30B in 4GB soon.
(All figures per the respective Hugging Face model cards.)

| Category | Bonsai-8B-gguf | Qwen3-4B-Instruct-2507 |
|---|---|---|
| Base / lineage | Compressed Qwen3-8B dense architecture in 1-bit GGUF Q1_0 form | Official Qwen3 4B instruct release from Alibaba/Qwen |
| Params | 8.19B total, ~6.95B non-embedding | 4.0B total, 3.6B non-embedding |
| Layers / heads | 36 layers, GQA 32 Q / 8 KV | 36 layers, GQA 32 Q / 8 KV |
| Context length | 65,536 tokens | 262,144 tokens native |
| Format | GGUF Q1_0, end-to-end 1-bit weights | Standard full model release; quantized variants exist elsewhere, but the official card is the base instruct model |
| Deployed size / memory | 1.15 GB deployed; Prism says 14.2x smaller than FP16 | Card does not list a deployed size on-page; it is a normal 4B model, so materially larger than Bonsai in practice |
| Stated goal | Extreme compression, speed, and efficiency while staying “competitive” with 8B-class models | Strong general-purpose instruct model with gains in reasoning, coding, writing, tool use, and long-context handling |
| Published benchmark bundle | EvalScope bundle across MMLU-R, MuSR, GSM8K, HE+, IFEval, BFCL with 70.5 avg | Broader Qwen benchmark suite including MMLU-Pro, GPQA, AIME25, ZebraLogic, LiveBench, LiveCodeBench, IFEval, Arena-Hard v2, BFCL-v3, plus agent/multilingual tasks |
| Knowledge benchmarks | MMLU-R 65.7 | MMLU-Pro 69.6, MMLU-Redux 84.2, GPQA 62.0, SuperGPQA 42.8 |
| Reasoning benchmarks | MuSR 50, GSM8K 88 | AIME25 47.4, HMMT25 31.0, ZebraLogic 80.2, LiveBench 63.0 |
| Coding benchmarks | HumanEval+ 73.8 | LiveCodeBench 35.1, MultiPL-E 76.8, Aider-Polyglot 12.9 |
| Instruction following / alignment | IFEval 79.8 | IFEval 83.4, Arena-Hard v2 43.4, Creative Writing v3 83.5, WritingBench 83.4 |
| Tool / agent metrics | BFCL 65.7 | BFCL-v3 61.9, TAU1-Retail 48.7, TAU1-Airline 32.0, TAU2-Retail 40.4 |
| Speed claims | Prism reports 368 tok/s on RTX 4090 vs 59 tok/s FP16 baseline, plus strong gains on other hardware | Card emphasizes capability and deployment support; no comparable on-page throughput table |
| Energy claims | Prism reports 4.1x better energy/token on RTX 4090 and 5.1x on M4 Pro vs FP16 baselines | No equivalent on-page energy table |
| Best practical use | Tiny footprint, fast local inference, “how is this running here?” deployments | Better bet for raw reasoning, writing, long context, and general instruction-following |
For what stuff do you want to use them? I don’t think they come remotely close to today’s commercial models. Maybe for a specific purpose?
Hey, thanks for your response… yeah, that’s what I meant: the 2B models aren’t usable in today’s state, but they’re more practical for everyday use if they work out…
I actually meant the 31B models are useful for my purpose. I don’t do full-on agentic coding, just interactive chat/prompting. For example, I make good use of them for writing Linux shell scripts (as I don’t know how to myself). Currently I use qwen3.5-flash via the cloud. It’s as good as the frontier models back then, if not better…
I wanted to use smaller models, but then do more work on the “thinking” process. I didn’t get far, because it gets so slow on normal hardware and too expensive on dedicated hardware. Time-consuming (I’m also not a programmer) but a fun project; in the end I just decided to satisfy the privacy angle with Proton’s Lumo.
Proton has AI? Damn, that’s gotta be bleeding their coffers
Probably not; the models they use all tend to be quite lightweight and inexpensive, tbh.
EDIT:
https://proton.me/support/lumo-privacy
Open-source language models
Lumo is powered by open-source large language models (LLMs) which have been optimized by Proton to give you the best answer based on the model most capable of dealing with your request. The models we’re using currently are Nemo, OpenHands 32B, OLMO 2 32B, GPT-OSS 120B, Qwen, Ernie 4.5 VL 28B, Apertus, and Kimi K2. These run exclusively on servers Proton controls so your data is never stored on a third-party platform.
Lumo’s code is open source, meaning anyone can see it’s secure and does what it claims to. We’re constantly improving Lumo with the latest models that give the best user experience.
Quite a lightweight swarm for a cloud service, barring Kimi K2.
They have been working on this. Only 3 months ago it was pretty terrible. Today it’s almost on par with ChatGPT. A bit worse at RAG, slower… but good enough for normal use.
I was playing around with it a tiny amount earlier today (I use ProtonMail, so I figured why not).
I can’t tell much about it. It seems very…safety theater / personality removed.
Any idea of what models they use now? I get a feeling that the main brain is 14B (based on how it responds to questions / drops nuance).
There are several 3B-and-under models that are surprisingly good. If you’re talking about a general chat model, you can get a lot of bang for your buck with Qwen3-1.7B. Granite-3B is also quite good (and obedient at tool calls, IIRC).
My everyday driver is an abliterated version of Qwen3-4B 2507 Instruct called Qwen HIVEMIND. I find it excellent… but again… black magic and clever tricks.
I’ve actually been scoping out the possibility of using ECA.dev and having something cheap / cloud based (say, GPT-5.4 mini) as the “brains” and SERA-8B as the “hands”.
GPT-5.4 mini is $0.75/M input tokens and $4.50/M output tokens… and if it marries up with SERA-8B… well… that could go a long way indeed.
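Back-of-envelope on that pricing (the monthly token volumes here are pure assumptions on my part):

```python
# Rough cost sketch for the cloud-brain / local-hands split.
IN_PRICE, OUT_PRICE = 0.75, 4.50          # $ per million tokens, from above
in_tok, out_tok = 20_000_000, 2_000_000   # hypothetical month of planning calls

monthly = (in_tok / 1e6) * IN_PRICE + (out_tok / 1e6) * OUT_PRICE
print(f"~${monthly:.2f}/month")           # ~$24.00 for 20M in + 2M out
```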
Small models can be made useful as part of a swarm architecture… but that’s not an apples-to-apples comparison.
For me, anything smaller than gpt-oss-20b (a2b) is just for messing around with, or for basic categorisation and basic text or data processing with highly structured prompts.