When I first got into local LLMs nearly 3 years ago, in mid 2023, the frontier closed models were of course impressively capable.
I then tried my hand at running 7B-size local models, primarily one called Zephyr-7B (what happened to these models?? Dolphin, anyone??), on my gaming PC with an 8GB AMD RX 580 GPU. Fair to say it was just a curiosity exercise (in terms of model performance).
Fast forward to this month: I'm revisiting local LLMs. (Although I no longer have the gaming PC. Cost-of-living crisis, anyone? 😫)
And the ~31B-size models now look very sufficient. #Qwen has taken the helm in this tier. It's still quite expensive to set up locally, although within grasp.
I’m rooting for the edge-computing models now - the ~2B-size models. Due to their low footprint, they are practical for many people to run 24/7 on an SBC at home.
But these edge models are in the ‘curiosity category’ now.
This weekend I had an LLM walk me through setting up some home server stuff and networking. I tried using Proton’s Lumo and Qwen 3.6 locally. I have to say Qwen was the more impressive of the two models. When I first tried running local models like Llama 4, I remember thinking to myself that this was a dead end and big servers would always have the advantage, but it seems like we’re hitting a turning point where many things can be done locally.
Cool, what was your hardware, and which Qwen size did you use? Thanks.
I have a 24GB AMD 7900 XTX, and it’s a 35B-parameter model.
Ooo… I’m running a 7900 XTX as well. Having 24GB without the Nvidia tax has been super nice for AI stuff. I have a 16GB 6900 XT running in another computer, and a lot of my AI model selection is still sized for it. I may need to stop procrastinating and copy your setup sooner rather than later.
Before I forget, can I ask you what GPU driver version you’re running? I recently encountered some stability issues after a driver update (trying to support gaming and AI stuff at the same time), and the latest version I could find any stability claims for was 24.12.1.
As I recall, there are some new tricks that allow up-to-8B models to run on a Raspberry Pi 5 at around 10-15 tokens per second with --ctx 32768. I haven’t kept across it because I don’t visit Reddit, but that was my last recollection. If you fossick over there, you may be able to find it. Or use kagi.com to find it, heh.
One of the goals of the harness that I built was to reduce memory pressure, particularly KV cache, so that you could run larger models on more constrained hardware, but I’m not here to spruik myself. I’m just letting you know that there are ways and means to get it done on SBCs.
EDIT: I “kagi’ed” it for you. Here
| model | size | params | backend | threads | ngl | test | t/s |
|---|---|---|---|---|---|---|---|
| qwen3.5 9B Q8_0 | 8.86 GiB | 8.95 B | CPU | 4 | 0 | pp512 | 18.20 ± 0.23 tok/s |
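If you want to poke at that yourself, here’s a minimal sketch using the llama-cpp-python bindings. The model path is hypothetical, and this is just the baseline invocation - it won’t include whatever KV-cache tricks that thread used:

```python
# Minimal sketch: an ~8B GGUF with a 32k context on CPU, via the
# llama-cpp-python bindings (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="models/qwen3.5-9b-q8_0.gguf",  # hypothetical path
    n_ctx=32768,      # the --ctx 32768 mentioned above
    n_gpu_layers=0,   # pure CPU, as on a Raspberry Pi 5
    n_threads=4,      # match your core count
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```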
Is it just me, or are the smaller models that fit in my VRAM very dumb?
Do you have 24 GB?
Not of VRAM.
That’s your issue.
Thanks for the thorough investigation.
It’s not just you. But while they may be natively “dumb”, they can be augmented quite significantly. Even adding a simple web-search tool can help a lot.
So, there are levels of “dumb”. Some - like Qwen3-4B 2507 Instruct - may not have the world knowledge of a SOTA model, but their reasoning abilities can be quite impressive. See HERE for an example from a self-made test suite. You can run something similar yourself.
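As a rough illustration of the “simple web-search tool” idea, a minimal sketch. It assumes a self-hosted SearXNG instance with its JSON output format enabled; the URL and query are just placeholders:

```python
# Minimal sketch of a web-search tool a small model can lean on.
# The harness runs the search and pastes results into context,
# so the model doesn't have to "know" anything itself.
import requests

SEARX_URL = "http://localhost:8888/search"  # hypothetical instance

def web_search(query: str, n: int = 3) -> str:
    """Return the top-n results as a plain-text block for the prompt."""
    r = requests.get(SEARX_URL, params={"q": query, "format": "json"}, timeout=10)
    r.raise_for_status()
    results = r.json().get("results", [])[:n]
    return "\n\n".join(
        f"{hit['title']}\n{hit['url']}\n{hit.get('content', '')}"
        for hit in results
    )

# Usage: prepend grounding before the question goes to the small model.
context = web_search("Qwen3-4B 2507 instruct release notes")
prompt = f"Using only the sources below, answer the question.\n\n{context}\n\nQ: ..."
```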
I guess it depends what you mean by “dumb” and how that affects what you’re trying to do with them. Some are dumb at tool use, some have poor world knowledge etc. You can find small models that are good at what’s important to you if you dig around. Except for coding - that’s rough. Probably the smallest stand-alone that might make you sit up and pay attention is something like Qwen2.5-Coder-14B-Instruct or FrogMini-14B-2510…but I wouldn’t trust them to go spelunking a code base.
What are some other ways to make it better beyond just adding a search tool? Is 16GB VRAM sufficient for usable results?
Where do you think is the best place to start down this rabbit hole?
It’s really hard for me to answer this question without pointing to my project, because the project is sort of directly in response to this very problem. So, gauche as it may be, fuck it:
https://codeberg.org/BobbyLLM/llama-conductor
I mention this because 1) I am NOT trying to get you to install my shit but 2) my shit answers this directly. I note the conflict of interest, but OTOH you did ask me, and I sort of solved it in my way so…fuck. (It’s FOSS / I’m not trying to sell you anything etc etc).
With that out of the way, I will answer from where I am sitting and then generically (if I understand your question right).
Basically -
Small models have problems with how much they can hold internally. There’s a finite meta-cognitive “headspace” for them to work with… and the lower the quant, the fuzzier that gets. Sadly, with a weaker GPU, you’re almost forced to use lower quants.
If you can’t upgrade the LLM (due to hardware), what you need to do is augment it with stuff that takes on some of the heavy lifting.
What I did was this: I wrapped a small, powerful, well-benchmarking LLM in an infrastructure that takes the things it’s bad at outside of its immediate concern.
Bad inbuilt model priors / knowledge base? No problem; force answers to go through a tiered cascade.
The cascade runs, in order:
1. Inbuilt quick responses that you define yourself as grounding (cheatsheets)
2. A self-populating wiki-like structure (you drop a .md into one folder, hit >>summ, and it cross-updates everywhere)
3. Wikipedia short lookup (the ~800-character opening box: most wiki articles are structured with the TL;DR in that section)
4. Web search (using trusted domains) or web synth (trusted domains plus cross-verification)
5. Finally… the model’s pre-baked priors.
In my setup, the whole thing cascades from highest trust to lowest trust (as defined by the human), stops when it gathers the info it needs, and tells you where the answer came from.
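If it helps to see the shape of it, here’s a toy sketch of the cascade pattern - NOT the actual conductor code, and the tier functions are stubs:

```python
# Toy sketch of a trust-tiered cascade. Each tier either returns
# grounded text or None; the first hit wins, and we record where
# the answer came from.
from typing import Callable, Optional

# Stub tiers, highest trust first. Real versions would hit your
# cheatsheets, the .md vault, Wikipedia's lead section, then the web.
def cheatsheets(q: str) -> Optional[str]: return None
def vault_wiki(q: str) -> Optional[str]: return None
def wiki_intro(q: str) -> Optional[str]: return "GQA: grouped-query attention..."
def trusted_search(q: str) -> Optional[str]: return None

TIERS: list[tuple[str, Callable[[str], Optional[str]]]] = [
    ("cheatsheet", cheatsheets),
    ("vault-wiki", vault_wiki),
    ("wikipedia", wiki_intro),
    ("web-search", trusted_search),
]

def cascade(query: str) -> tuple[str, str]:
    for name, lookup in TIERS:
        hit = lookup(query)
        if hit is not None:
            return name, hit          # stop as soon as we're grounded
    return "model-priors", ""         # last resort: the bare LLM

source, grounding = cascade("what is GQA?")
print(f"[{source}] {grounding}")
```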
Outside of that, sidecars that do specific things (maths solvers, currency lookup tools, weather lookup, >>judge comparators… tricks on tricks on tricks).
Based on my tests, with my corpus (shit I care about) I can confidently say my little 4B can go toe to toe with any naked 100B on my stuff. That’s a big claim, and I don’t expect you to take it at face value. It’s a bespoke system with opinions…but I have poked it to death and it refuses to die. So…shrug. I’m sanguine.
Understand: I assume the human in the middle is the ultimate arbiter of what the LLM reasons over. This is a different school of thought to “just add more parameters, bro” or “just get a better rig, bro”, but it was my solution to constrained hardware and hallucinations.
There are other schools of thought. Hell, others use things like MCP tool calls. The model pings cloud or self-host services (like farfalle or Perplexica), calls them when it decides it needs to, and the results land in context. But that’s a different locus of control; the model’s still driving…and I’m not a fan of that on principle. Because LLMs are beautiful liars and I don’t trust them.
The other half of the problem isn’t knowledge - it’s behaviour.
Small models drift. They go off-piste, ignore your instructions halfway through a long response, or confidently make shit up when they hit the edge of what they know. So the other thing I built was a behavioural shaping layer that keeps the model constrained at inference time - no weight changes, just harness-level incentive structure. Hallucination = retry loop = cost. Refusal = path of least resistance. You’re not fixing the model; you’re making compliance (mathematically) cheaper than non-compliance.
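A toy sketch of that incentive structure - again, not the real implementation; the verify() check here is a stand-in for whatever validation you trust (citation presence, regexes, a judge model):

```python
# Toy sketch of harness-level shaping: hallucination = retry loop = cost.
def verify(answer: str, grounding: str) -> bool:
    # Stub check: the answer must either admit ignorance or visibly
    # reuse the grounding text it was handed.
    return "I don't know" in answer or (bool(grounding) and grounding[:40] in answer)

def shaped_generate(llm, prompt: str, grounding: str, max_retries: int = 3) -> str:
    """llm is any callable taking a prompt string and returning text."""
    suffix = ""
    for attempt in range(max_retries):
        answer = llm(prompt + suffix)
        if verify(answer, grounding):
            return answer                       # compliance = cheap exit
        # Non-compliance costs another pass with a tighter leash.
        suffix = ("\nYour previous answer failed verification. "
                  "Answer ONLY from the provided sources, or say 'I don't know'.")
    return "I don't know."                      # refusal beats confabulation
```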
That’s how I solved it for me. YMMV.
On 16GB VRAM: honestly, that’s decent - don’t let GPU envy get to you. You can comfortably run a Q4_K_M of a 14B model entirely in VRAM at usable speeds - something like Qwen3-14B or Mistral-Small. Those are genuinely capable; not frontier, but not a toy either. The painful zone is 4-8GB (hello!), where you’re either running small models natively or offloading layers to RAM and watching your tokens-per-second crater. You can do some good stuff with a 14B, augmented with the right tools.
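For instance, a minimal sketch via the llama-cpp-python bindings (the model path is hypothetical; llama.cpp’s -ngl flag is the CLI equivalent):

```python
# Minimal sketch: a 14B Q4_K_M fully offloaded to a 16GB card.
from llama_cpp import Llama

llm = Llama(
    model_path="models/Qwen3-14B-Q4_K_M.gguf",  # hypothetical quant
    n_gpu_layers=-1,   # -1 = offload every layer to the GPU
    n_ctx=8192,        # trim context if you start spilling past 16GB
)
```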
Where to start the rabbit hole: Do you mean generally? Either Jan.ai or LM Studio is the easiest on-ramp - drag and drop models, built-in chat UI, handles GGUF out of the box. IIRC, Jan has direct MCP tooling as well.
Once you want more control, drop into llama.cpp directly. It’s just…better. Faster. Fiddlier, yes…but worth it.
For finding good models, Unsloth’s HuggingFace page is consistently one of the better curators of well-quantised GGUFs. After that it’s just… digging through LocalLLaMA and benchmarking stuff yourself.
There’s no substitute for running your own evals on your own hardware for your own use case - published benchmarks will lie to you. If you’re insane enough to do that, see my above “rubric” post.
Not sure…have I answered your question?
PS: for anyone that hits the repo and reads the 1.9.5 commit message - enjoy :) ’Twas a mighty fine bork indeed, worthy of the full “Bart Simpson writes on chalkboard x 1000” hall of shame message. Fucking VSCodium, man… I don’t know how sandbox mode got triggered, but it did, and it ate half my frikken hard drive and repo before I could stop it. Rookie shit.
Commenting so I can come back to this later.
I didn’t try any 7B ones lately; they may be a better fit for 16GB, I think. I was able to try the 2B ones as I mentioned (on CPU); they are subpar. Like I mentioned, the usable ones were 31B. I think you need at least 24GB VRAM for most models, though. Maybe someone else can suggest better.
Bummer. Spilling over to an old computer’s system RAM is painful for the smarter models too.
You can give “offloading some layers to RAM” a try, though… that way you can get your hands on the “usable” 31B models. Browse around to find some good 31B ones… GL
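Something like this, via the llama-cpp-python bindings (the path and layer count are guesses - lower n_gpu_layers until it stops running out of VRAM):

```python
# Minimal sketch of partial offload: keep as many layers as fit on the
# GPU, let the rest run from system RAM. Slower, but it makes 31B-class
# models reachable on a smaller card.
from llama_cpp import Llama

llm = Llama(
    model_path="models/some-31b-q4_k_m.gguf",  # hypothetical
    n_gpu_layers=40,   # partial offload; the remainder stays in RAM
    n_ctx=4096,        # a smaller context also eases VRAM pressure
)
```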
For small models, the Bonsai series seems to be getting the spotlight. Natively trained at 1-bit and ternary 1.58-bit, the 8B runs in ~1GB of memory. I’m curious about local models but haven’t tried them for lack of a gaming rig, but it seems to work well enough for a regular PC.
Funny, I tried the 8B Bonsai (https://huggingface.co/prism-ml/Bonsai-8B-gguf); when loaded it takes ~7GB of RAM!! When prompting, it stalls my llama.cpp container (I’m running on a weak 4th-gen i5).
Interesting thanks!
I’m glad to see 1.58Bs finally starting to appear.
I got GPT to side-by-side the benchmarks (for what they are worth). Bonsai 8B seems to be a cook-off from Qwen3-8B. If they can squeeze an 8B into 1GB… then perhaps we can get a 20-30B in 4GB soon.
(All figures per the respective Hugging Face model cards.)

| Category | Bonsai-8B-gguf | Qwen3-4B-Instruct-2507 |
|---|---|---|
| Base / lineage | Compressed Qwen3-8B dense architecture in 1-bit GGUF Q1_0 form | Official Qwen3 4B instruct release from Alibaba/Qwen |
| Params | 8.19B total, ~6.95B non-embedding | 4.0B total, 3.6B non-embedding |
| Layers / heads | 36 layers, GQA 32 Q / 8 KV | 36 layers, GQA 32 Q / 8 KV |
| Context length | 65,536 tokens | 262,144 tokens native |
| Format | GGUF Q1_0, end-to-end 1-bit weights | Standard full model release; quantized variants exist elsewhere, but the official card is the base instruct model |
| Deployed size / memory | 1.15 GB deployed; Prism says 14.2x smaller than FP16 | Card does not list a deployed size on-page; it is a normal 4B model, so materially larger than Bonsai in practice |
| Stated goal | Extreme compression, speed, and efficiency while staying “competitive” with 8B-class models | Strong general-purpose instruct model with gains in reasoning, coding, writing, tool use, and long-context handling |
| Published benchmark bundle | EvalScope bundle across MMLU-R, MuSR, GSM8K, HE+, IFEval, BFCL with 70.5 avg | Broader Qwen benchmark suite including MMLU-Pro, GPQA, AIME25, ZebraLogic, LiveBench, LiveCodeBench, IFEval, Arena-Hard v2, BFCL-v3, plus agent/multilingual tasks |
| Knowledge benchmarks | MMLU-R 65.7 | MMLU-Pro 69.6, MMLU-Redux 84.2, GPQA 62.0, SuperGPQA 42.8 |
| Reasoning benchmarks | MuSR 50, GSM8K 88 | AIME25 47.4, HMMT25 31.0, ZebraLogic 80.2, LiveBench 63.0 |
| Coding benchmarks | HumanEval+ 73.8 | LiveCodeBench 35.1, MultiPL-E 76.8, Aider-Polyglot 12.9 |
| Instruction following / alignment | IFEval 79.8 | IFEval 83.4, Arena-Hard v2 43.4, Creative Writing v3 83.5, WritingBench 83.4 |
| Tool / agent metrics | BFCL 65.7 | BFCL-v3 61.9, TAU1-Retail 48.7, TAU1-Airline 32.0, TAU2-Retail 40.4 |
| Speed claims | Prism reports 368 tok/s on RTX 4090 vs 59 tok/s FP16 baseline, plus strong gains on other hardware | Card emphasizes capability and deployment support; no comparable on-page throughput table |
| Energy claims | Prism reports 4.1x better energy/token on RTX 4090 and 5.1x on M4 Pro vs FP16 baselines | No equivalent on-page energy table |
| Best practical use | Tiny footprint, fast local inference, “how is this running here?” deployments | Better bet for raw reasoning, writing, long context, and general instruction-following |
For what stuff do you want to use them? I don’t think they come remotely close to today’s commercial models. Maybe for a specific purpose?
Hey, thanks for your response… yeah, that’s what I meant: the 2B models aren’t usable in today’s state, but they’re more practical for everyday use if they work out…
I actually meant the 31B models are useful for my purpose. I don’t do full-on agentic coding, just interactive chat/prompting. For example, I make good use of them for writing Linux shell scripts (as I don’t know how to myself). Currently I use qwen3.5-flash via the cloud. It’s as good as the frontier models back then, if not better…
I wanted to use smaller models, but then do more work on the “thinking” process. I didn’t get far, because it gets so slow on normal hardware and too expensive on dedicated hardware. Time-consuming (I’m also not a programmer) but a fun project; in the end I just decided to satisfy the privacy angle with Proton’s Lumo.
Proton has AI? Damn, that’s gotta be bleeding their coffers
Probably not; the models they use all tend to be quite lightweight and inexpensive, tbh.
EDIT:
https://proton.me/support/lumo-privacy
Open-source language models
Lumo is powered by open-source large language models (LLMs) which have been optimized by Proton to give you the best answer based on the model most capable of dealing with your request. The models we’re using currently are Nemo, OpenHands 32B, OLMO 2 32B, GPT-OSS 120B, Qwen, Ernie 4.5 VL 28B, Apertus, and Kimi K2. These run exclusively on servers Proton controls so your data is never stored on a third-party platform.
Lumo’s code is open source, meaning anyone can see it’s secure and does what it claims to. We’re constantly improving Lumo with the latest models that give the best user experience.
Quite a lightweight swarm for a cloud service, barring Kimi K2.
They have been working on this. Only 3 months ago it was pretty terrible. Today it’s almost on par with ChatGPT. A bit worse at RAG, slower… but good enough for normal use.
I was playing around with it a tiny amount earlier today (I use ProtonMail, so I figured why not).
I can’t tell much about it. It seems very…safety theater / personality removed.
Any idea of what models they use now? I get a feeling that the main brain is 14B (based on how it responds to questions / drops nuance).
There are several 3B-and-under models that are surprisingly good. If you’re talking about a general chat model, you can get a lot of bang for your buck with Qwen3-1.7B. Granite-3B is also quite good (and obedient at tool calls, IIRC).
My everyday driver is an abliterated version of Qwen3-4B 2507 Instruct called Qwen HIVEMIND. I find it excellent… but again… black magic and clever tricks.
I’ve actually been scoping out the possibility of using ECA.dev and having something cheap / cloud based (say, GPT-5.4 mini) as the “brains” and SERA-8B as the “hands”.
GPT-5.4 mini is $0.75/M input tokens and $4.50/M output tokens… and if it marries up with SERA-8B… well… that could go a long way indeed.
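Back-of-envelope on that pricing (the monthly token volumes here are pure assumptions on my part):

```python
# Rough cost sketch for the cloud-brain / local-hands split.
IN_PRICE, OUT_PRICE = 0.75, 4.50          # $ per million tokens, from above
in_tok, out_tok = 20_000_000, 2_000_000   # hypothetical month of planning calls

monthly = (in_tok / 1e6) * IN_PRICE + (out_tok / 1e6) * OUT_PRICE
print(f"~${monthly:.2f}/month")           # ~$24.00 for 20M in + 2M out
```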
Small models can be made useful as part of a swarm architecture… but that’s not an apples-to-apples comparison.
For me, anything smaller than gpt-oss-20b (a2b) is just for messing around with, or for basic categorisation and basic text or data processing with highly structured prompts.