Please suggest some good self-hostable RAG for my LLM.

Maroon@lemmy.world · edit-2 3 months ago

Please suggest some good self-hostable RAG for my LLM.

kwa@lemmy.zip · edit-2 3 months ago

I’m new to this and I was wondering why you don’t recommend ollama? This is the first one I managed to run and it seemed decent but if there are better alternatives I’m interested

Edit: it seems the two others don’t have an API. What would you recommend if you need an API?

brucethemoose@lemmy.world · edit-2 3 months ago

Pretty much everything has an API :P

ollama is OK because its easy and automated, but you can get higher performance, better vram efficiency, and better samplers from either kobold.cpp or tabbyAPI, with the catch being that more manual configuration is required. But this is good, as it “forces” you to pick and test an optimal config for your system.

I’d recommend kobold.cpp for very short context (like 6K or less) or if you need to partially offload the model to CPU because your GPU is relatively low VRAM. Use a good IQ quantization (like IQ4_M, for instance).

Otherwise use TabbyAPI with an exl2 quantization, as it’s generally faster (but GPU only) and much better at long context through its great k/v cache quantization.

They all have OpenAI APIs, though kobold.cpp also has its own web ui.