Self hosted LLM

HumanPerson@sh.itjust.works · 9 months ago

passepartout@feddit.de · 9 months ago

I tried Huggingface TGI yesterday, but all of the reasonable models need at least 16 GB of VRAM. The only model I got working (on a desktop machine with an AMD 6700 XT GPU) was Microsoft phi-2.

BetaDoggo_@lemmy.world · 9 months ago

Koboldcpp should allow you to run much larger models with a little bit of RAM offloading. There’s a fork that supports ROCm for AMD cards: https://github.com/YellowRoseCx/koboldcpp-rocm

Make sure to use quantized models for the best performance, Q4_K_M being the standard.
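
To see why quantization matters here, a back-of-the-envelope sketch: weight size is roughly parameter count times bits per weight. The function name and the 4.8 bits-per-weight figure for a 4-bit K-quant are my own assumptions for illustration, and the estimate ignores context/KV-cache overhead.

```python
def estimate_weights_gib(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Bits-per-weight values are rough assumptions, not exact format specs.
for label, bits in [("fp16", 16.0), ("8-bit", 8.0), ("4-bit K-quant (assumed ~4.8)", 4.8)]:
    print(f"7B model at {label}: about {estimate_weights_gib(7, bits):.1f} GiB")
```

Whatever doesn't fit in VRAM is what would have to be offloaded to normal RAM.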

HumanPerson@sh.itjust.works · 9 months ago

I know the gpt4all models run fine on my desktop with 8 GB of VRAM. It does use a decent chunk of my normal RAM though. Could the gpt4all models work on Huggingface or do they use different formats? Sorry if I am completely misunderstanding Huggingface, I hadn’t heard of it until now.

passepartout@feddit.de · 9 months ago

Huggingface TGI is just a piece of software handling the models, like gpt4all. Here is a list of models officially supported by TGI, although they state that you can try different ones as well. You follow the link and look for the files section. The size of the model files (safetensors or pickle binaries) gives a good estimate of how much VRAM you will need. Sadly, for everything except santacoder and Microsoft phi, this is more than most consumer graphics cards have.
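
If you already have a model downloaded, the same file-size check can be sketched locally. This assumes the weights sit in a local directory; the function names, the suffix list and the 1 GiB headroom are hypothetical choices of mine, not anything TGI provides.

```python
from pathlib import Path

# File types counted as weights (an assumption; adjust to what the repo ships).
WEIGHT_SUFFIXES = {".safetensors", ".bin", ".gguf"}


def weights_size_gib(model_dir: str) -> float:
    """Sum the size of all weight files under model_dir, in GiB."""
    total = sum(
        p.stat().st_size
        for p in Path(model_dir).rglob("*")
        if p.is_file() and p.suffix in WEIGHT_SUFFIXES
    )
    return total / 2**30


def fits_in_vram(model_dir: str, vram_gib: float, headroom_gib: float = 1.0) -> bool:
    """Rough check: weights plus some headroom must fit in the given VRAM."""
    return weights_size_gib(model_dir) + headroom_gib <= vram_gib
```

For example, `fits_in_vram("path/to/model", 12)` would give a first guess for a 12 GB card. It is only an estimate, since the actual usage also depends on context length and the runtime.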

HumanPerson@sh.itjust.works · 9 months ago

I don’t really want to try to get that to work. I wonder how hard it would be to create my own webui using gpt4all’s Python package.
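
A very rough sketch of what that could look like, using only Python's built-in `http.server` plus the `gpt4all` package. The model filename, port and helper names are assumptions for illustration; as far as I know the package tries to download a named model it can't find locally.

```python
import html
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs

PAGE = """<!doctype html>
<title>Local LLM</title>
<form method="post">
  <textarea name="prompt" rows="6" cols="60">{prompt}</textarea><br>
  <button>Send</button>
</form>
<pre>{reply}</pre>"""


def render_page(prompt: str = "", reply: str = "") -> str:
    """Fill the HTML template, escaping both user input and model output."""
    return PAGE.format(prompt=html.escape(prompt), reply=html.escape(reply))


def make_handler(generate):
    """Build a request handler around any prompt -> text function."""

    class Handler(BaseHTTPRequestHandler):
        def _send(self, body: str) -> None:
            data = body.encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

        def do_GET(self):
            self._send(render_page())

        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            form = parse_qs(self.rfile.read(length).decode("utf-8"))
            prompt = form.get("prompt", [""])[0]
            reply = generate(prompt) if prompt.strip() else ""
            self._send(render_page(prompt, reply))

    return Handler


def run(port: int = 8000, model_name: str = "orca-mini-3b-gguf2-q4_0.gguf") -> None:
    """Load a model and serve the page on localhost (model name is an assumption)."""
    from gpt4all import GPT4All  # imported here so the rest works without it

    model = GPT4All(model_name)
    handler = make_handler(lambda p: model.generate(p, max_tokens=256))
    HTTPServer(("127.0.0.1", port), handler).serve_forever()
```

Calling `run()` and opening http://127.0.0.1:8000 should give a single text box. It handles one request at a time and keeps no chat history, so it is a starting point rather than a real web UI.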