The Inventor Behind a Rush of AI Copyright Suits Is Trying to Show His Bot Is Sentient::Stephen Thaler’s series of high-profile copyright cases has made headlines worldwide. He’s done it to demonstrate his AI is capable of independent thought.
The Inventor Behind a Rush of AI Copyright Suits Is Trying to Show His Bot Is Sentient::Stephen Thaler’s series of high-profile copyright cases has made headlines worldwide. He’s done it to demonstrate his AI is capable of independent thought.
Are you using ooga booga? What specs does your system have?
I do use Oobabooga a lot. I am developing my own scripts and modifying some of Oobabooga too. I also use Koboldcpp. I am on a 12gen i7 with 20 logic cores and 64GB of system memory along with a 3080Ti with 16GBV. The 70B 4 bit quantized model running with 14 layers offloaded onto the GPU generates 3 tokens a second. So it is 1.5 times faster than just on the CPU.
If I was putting together another system, I would only get something with AVX-512 instructions support in the CPU. That instruction is troublesome for CVE issues. You’ll probably need to look into this depending on your personal privacy/security threat model. The ability to run larger models is really important. You really want all the RAM. The answer to the question of how much is always yes. You are not going to get enough memory using consumer GPUs you can only offload a few layers onto a consumer grade GPU. I can’t say how well even larger models than the 70B will perform as the memory bottlenecks. I can’t even say how a 30B or larger runs at full quantization. I can’t add any more memory to my system. Running the full models, as a rule of thumb, requires double the token size in RAM. So a 30B will require around 60GB of memory to initial load. Most of these models are float-16. So running them 8-bit cuts the size in half with penalties in areas like accuracy. Running 4 bit splits the size again. There is tuning, bias, and asymmetry in the way quantization is done to preserve certain aspects like emerging phenomena in the original data. This is why a larger model with a smaller quantization may outperform a smaller model running at full quantization. For GPUs, if you are at all serious about this, you need at least 16GBV at a bare minimum. Really, we need to see a descent priced 40-80GBV consumer option. The thing is that GPU memory is directly tied to compute hardware. There isn’t the overhead of a memory management system like system memory has. This is what makes GPUs ideal and fast, but it is the biggest chunk of bleeding edge silicon in consumer hardware already, and we need it to be 4× larger and cheap. That is not going to happen any time soon. This means the most accessible path to larger models is using the system memory. While you’ll never get the parallelism of a GPU, having cpu instructions that are 512 bits wide is a big performance boost. You also need max logic cores. That is just my take.