I was reading Owen Shepherd’s post “{n} times faster than C”, which explores how to hand-tune x86-64 assembly to make a certain problem faster (see below). Originally, this inspired me to write a short introduction to using Rust’s portable SIMD to manually speed up problems like this. I rewrote the problem in Rust (of course), used explicit SIMD, and observed a substantial speed-up.
The difference between random ASCII and random s/p characters is definitely just branch prediction. For random ASCII, the most common branch is “do nothing”, so the branch is easier to predict. Not sure why the original author was unsure?
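To make the branch-prediction point concrete, here is a minimal sketch in Rust. It assumes the problem from the linked post is "add 1 for every `s`, subtract 1 for every `p`, ignore everything else"; the function names `count_branchy` and `count_branchless` are mine, not from the original post. The first version takes a data-dependent branch per byte, while the second turns the comparisons into 0/1 integers so there is nothing to mispredict.

```rust
/// Branchy version (hypothetical name): one data-dependent branch per byte.
/// With random s/p input the taken arm is a coin flip, which is hard to
/// predict; with random ASCII the "do nothing" arm dominates.
pub fn count_branchy(input: &[u8]) -> i64 {
    let mut res = 0i64;
    for &b in input {
        match b {
            b's' => res += 1,
            b'p' => res -= 1,
            _ => {} // the common case for random ASCII
        }
    }
    res
}

/// Branchless version (hypothetical name): each comparison is cast to 0 or 1,
/// so the loop body does the same work no matter what the byte is.
pub fn count_branchless(input: &[u8]) -> i64 {
    input
        .iter()
        .map(|&b| (b == b's') as i64 - (b == b'p') as i64)
        .sum()
}
```

Both functions return the same result for any input. Note that an optimizing compiler may well turn the first version into branchless or vectorized code on its own, so whether you actually see a difference depends on the generated assembly, not on the source.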