Joined 11 months ago
Cake day: October 24th, 2023

  • The most important thing about GPU cores is that they are parallel in nature. Many GPUs use 1024-bit arithmetic units that can process 32 numbers (32 bits each) at the same time. That is, if you do something like a + b, both a and b are “vectors” consisting of 32 numbers. Since a GPU is built to process large amounts of data simultaneously, for example shading all the pixels in a triangle, this is an efficient design that offers a good balance between cost, performance, and power consumption.
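To make the "a + b on 32 numbers" idea concrete, here is a minimal sketch in plain Python. It only models the behaviour on the CPU; the names `LANES`, `simd_add` and `instructions_needed` are made up for illustration, and the width of 32 is taken from the description above.

```python
LANES = 32  # assumed width: 32 numbers per instruction, as described above


def simd_add(a, b):
    """Model one 32-wide add: every lane computes a[i] + b[i] in the same step."""
    if len(a) != LANES or len(b) != LANES:
        raise ValueError("each operand must hold exactly 32 values")
    return [x + y for x, y in zip(a, b)]


def instructions_needed(n_items, lanes=LANES):
    """How many 32-wide instructions it takes to touch n_items values."""
    return -(-n_items // lanes)  # ceiling division


a = list(range(LANES))
b = [10] * LANES
c = simd_add(a, b)          # one "instruction", 32 results
print(c[:4])                # first four lanes of the result
print(instructions_needed(1000))
```

Note that 33 values need two instructions just like 64 values do, which is why a GPU prefers work that comes in multiples of its lane count.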

    But the parallel design of GPU units also means that they run into problems when you want finer execution granularity. A typical example is common control logic like “if condition is true do x, otherwise do y”, especially if both x and y are complex. Remember that GPUs really want to do the same thing for 32 items at a time; if you don’t have that many things to work with, their efficiency will suffer. So a lot of common solutions that are formulated with a “one value at a time” approach in mind won’t translate directly to a GPU. Take sorting. On a CPU it’s easy to compare numbers and put them in sorted order. On a GPU you want to compare and order hundreds or even thousands of numbers simultaneously to get good performance, and it’s much more difficult to design a program that does that.
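The cost of an "if x, otherwise y" on a 32-wide unit can be sketched with a toy model. The assumption here (not something stated above, but a common way such hardware is described) is that the group runs a branch side whenever at least one lane needs it, with the remaining lanes masked off. The function names and the cost numbers are hypothetical.

```python
LANES = 32  # lanes that must execute the same instruction together


def run_branch(conditions, cost_x, cost_y):
    """Return (steps, useful_lane_steps) for `if cond: x else: y` on one group.

    Toy assumption: each side of the branch is executed for the whole group
    if at least one lane takes it; lanes on the other side are masked off.
    """
    if len(conditions) != LANES:
        raise ValueError("need one condition per lane")
    taken = sum(1 for c in conditions if c)
    not_taken = LANES - taken
    steps = (cost_x if taken else 0) + (cost_y if not_taken else 0)
    useful = taken * cost_x + not_taken * cost_y
    return steps, useful


def efficiency(conditions, cost_x, cost_y):
    """Fraction of lane-steps that did useful work."""
    steps, useful = run_branch(conditions, cost_x, cost_y)
    return useful / (steps * LANES)


uniform = [True] * LANES                 # every lane takes the same side
one_stray = [True] * 31 + [False]        # a single lane goes the other way
print(efficiency(uniform, 10, 10))
print(efficiency(one_stray, 10, 10))
```

In this model a single stray lane is enough to halve the efficiency when both sides cost the same, which is the "execution granularity" problem in miniature.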

    If you are talking about math specifically, well, it depends on the GPU. Modern GPUs are very well optimised for many operations and have native instructions to compute trigonometric functions (sin, cos), exponential functions and logarithms, as well as to do complex bit manipulation. They also natively support a range of data types, such as 32- and 16-bit floating point values. But support for 64-bit floating point values (double) is usually lacking (either low performance or missing entirely).
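To get a feel for what the 16- and 32-bit floating point types mentioned above can and cannot represent, here is a small CPU-side sketch. It does not touch a GPU; it only rounds Python's 64-bit floats to the narrower IEEE 754 formats using the standard `struct` module. The helper names `to_f16` and `to_f32` are made up for this example.

```python
import struct


def to_f32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_f16(x):
    """Round a Python float (64-bit) to the nearest 16-bit float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# 16-bit floats have an 11-bit significand, so not every integer above 2048 fits.
print(to_f16(2049.0))   # 2048.0
print(to_f32(2049.0))   # 2049.0

# The rounding error for 0.1 grows as the format gets narrower.
print(abs(to_f16(0.1) - 0.1), abs(to_f32(0.1) - 0.1))
```

The practical upshot is that code relying on double precision may need to be reformulated, or accept larger rounding errors, when it is moved to hardware that is fast only at 32- and 16-bit floats.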



  • Simplified version: an Apple GPU core contains four execution units, each of which is 32-wide (it performs an operation on 32 data values in parallel). An instruction in a shader program is executed on one of these units. In other words, there are 4 × 32 = 128 scalar arithmetic units in an Apple GPU core, capable of executing up to four different 32-wide instructions per cycle.
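The arithmetic in the simplified version can be written down as a short calculation. The lane counts come from the text above; everything else is an assumption for illustration: the core count and clock are hypothetical round numbers rather than the spec of any real chip, and counting a fused multiply-add as two floating point operations per lane per cycle is a common convention, not something the comment states.

```python
UNITS_PER_CORE = 4    # from the text: four execution units per core
LANES_PER_UNIT = 32   # from the text: each unit is 32-wide


def lanes_per_core():
    """Scalar arithmetic units in one GPU core."""
    return UNITS_PER_CORE * LANES_PER_UNIT


def peak_fp32_flops(cores, clock_hz, flops_per_lane_per_cycle=2):
    """Rough peak FP32 rate, assuming every lane is busy every cycle.

    flops_per_lane_per_cycle=2 assumes one fused multiply-add per lane
    per cycle (an assumption, see above).
    """
    return cores * lanes_per_core() * flops_per_lane_per_cycle * clock_hz


print(lanes_per_core())  # 128
# Hypothetical GPU: 8 cores at 1 GHz (illustrative values only).
print(peak_fp32_flops(cores=8, clock_hz=1.0e9) / 1e12, "TFLOPS")
```

A number like this is only an upper bound; real shaders rarely keep every lane busy on every cycle.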

    More complicated, but correct version: an Apple GPU core contains multiple execution units of different types. There are also four instruction schedulers, each of which selects a shader instruction and sends it to an execution unit. Each scheduler controls one 32-wide FP32 unit, one 32-wide FP16 unit, and (presumably, I’m not quite sure) one 16-wide INT32 unit. So in total you have four of each of those units in a core. On M1 and M2 a scheduler can dispatch one instruction to a suitable execution unit per cycle, which means the other units sit idle (e.g. it can do either one FP32 operation, one FP16 operation, or half of a 32-wide INT32 operation per cycle). On Apple M3, schedulers are capable of dual issue and can dispatch two instructions per cycle (e.g. one FP32 and one FP16 or INT), assuming appropriate instructions can be found in the instruction stream. This is why M3 can be much faster on complex shaders even though the nominal spec of the GPU didn’t change much.
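The difference between single issue and dual issue can be sketched with a toy cycle counter for one scheduler. This is not how the real hardware picks instructions; it is a deliberately simple in-order model that assumes two adjacent instructions can be issued together only if they target different unit types, that ignores data dependencies, and that ignores the INT32 unit being only 16-wide. The function names and the example stream are made up.

```python
def cycles_single_issue(stream):
    """One instruction per cycle, whatever unit it needs."""
    return len(stream)


def cycles_dual_issue(stream):
    """Toy in-order dual issue: pair two adjacent instructions in one cycle
    only when they use different unit types (e.g. 'fp32' + 'fp16')."""
    cycles = 0
    i = 0
    while i < len(stream):
        if i + 1 < len(stream) and stream[i + 1] != stream[i]:
            i += 2  # both instructions issued this cycle
        else:
            i += 1  # no suitable partner, issue one instruction
        cycles += 1
    return cycles


mixed = ["fp32", "fp16", "fp32", "fp32", "int32", "fp32"]
only_fp32 = ["fp32"] * 4
print(cycles_single_issue(mixed), cycles_dual_issue(mixed))          # 6 4
print(cycles_single_issue(only_fp32), cycles_dual_issue(only_fp32))  # 4 4
```

Even in this crude model the gain shows up only when the instruction stream mixes unit types, which matches the point that dual issue helps complex shaders while leaving the nominal spec unchanged.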

    Each GPU core executes a large number of shader programs in parallel and switches between shaders every cycle in order to make as much progress as possible. If it can’t find an instruction to execute (for example, because all shaders are currently waiting for a texture load), the units have to go idle and your performance potential decreases. This is why it’s important to give the GPU as much work as possible: it helps to fill those gaps, because the hardware can run some shaders while others are waiting.
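The gap-filling effect described above can be illustrated with a small simulation. It is a toy model under stated assumptions, not a description of the real scheduler: each shader issues one instruction and then waits a fixed number of cycles (standing in for something like a texture load), and each cycle the scheduler issues from the first shader that is ready. The function name and parameters are hypothetical.

```python
def utilization(n_shaders, wait_cycles, total_cycles):
    """Fraction of cycles in which some shader had an instruction to issue.

    Toy assumptions: every instruction is followed by a stall of
    wait_cycles, and the scheduler picks the first ready shader each cycle.
    """
    ready_at = [0] * n_shaders  # cycle at which each shader can issue again
    issued = 0
    for cycle in range(total_cycles):
        for s in range(n_shaders):
            if ready_at[s] <= cycle:
                ready_at[s] = cycle + 1 + wait_cycles
                issued += 1
                break  # at most one instruction per cycle in this model
    return issued / total_cycles


for n in (1, 2, 4):
    print(n, utilization(n_shaders=n, wait_cycles=3, total_cycles=400))
```

With a 3-cycle wait, one shader keeps the unit busy a quarter of the time, two shaders half of the time, and four shaders are enough to hide the wait completely in this model.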