For the in-order ARM Cortex-A8 (the target for this code), adjacent multiply-add instructions forward summands quickly. A simple in-order dot-product computation has no latency problems, while interleaving computations, as suggested in this thread, creates problems. Also, on this microarchitecture, occasional ARM instructions run in parallel with NEON, so trying to manually eliminate ARM instructions through global pointer tracking wouldn't gain speed; it would simply create unnecessary code-maintenance problems.

See https://cr.yp.to/papers.html#neoncrypto for analysis of the performance of---and remaining bottlenecks in---this code. Further speedups should be possible on this microarchitecture, but, for anyone interested in this, I recommend focusing on building a cycle-accurate simulator (e.g., fixing inaccuracies in the Sobole simulator) first.

Of course, there are other ARM microarchitectures, and there are many cases where different microarchitectures prefer different optimizations. The kernel already has boot-time benchmarks for different optimizations for raid6, and should do the same for crypto code, so that implementors can focus on each microarchitecture separately rather than living in the barbaric world of having to choose which CPUs to favor.

---Dan
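
To illustrate the boot-time benchmarking idea above, here is a minimal user-space sketch in C. It is not the kernel's raid6 code and not the NEON code discussed in this thread. Every name in it (dot_fn, dot_simple, dot_interleaved, dot_impls, dot_select) is hypothetical. The two candidates are plain C stand-ins for a single-accumulator ordering and an interleaved ordering of multiply-adds, and timing uses the standard clock() function, which is coarse; a kernel version would need a kernel time source instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical candidate signature: dot product of two n-element vectors. */
typedef uint32_t (*dot_fn)(const uint32_t *a, const uint32_t *b, size_t n);

/* Candidate 1: one accumulator, multiply-adds issued back to back. */
static uint32_t dot_simple(const uint32_t *a, const uint32_t *b, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Candidate 2: two interleaved accumulators, combined at the end. */
static uint32_t dot_interleaved(const uint32_t *a, const uint32_t *b, size_t n)
{
    uint32_t acc0 = 0, acc1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += a[i] * b[i];
        acc1 += a[i + 1] * b[i + 1];
    }
    if (i < n)                      /* odd-length tail */
        acc0 += a[i] * b[i];
    return acc0 + acc1;
}

struct dot_impl {
    const char *name;
    dot_fn fn;
};

static const struct dot_impl dot_impls[] = {
    { "simple",      dot_simple },
    { "interleaved", dot_interleaved },
};
#define DOT_NIMPL (sizeof dot_impls / sizeof dot_impls[0])

/*
 * Check every candidate against the first one, time each over `rounds`
 * calls, and return the one with the smallest elapsed CPU time.
 */
static const struct dot_impl *dot_select(size_t rounds)
{
    enum { N = 256 };
    static uint32_t a[N], b[N];
    volatile uint32_t sink = 0;     /* keeps the timed calls from being optimized away */
    const struct dot_impl *best = &dot_impls[0];
    clock_t best_time = 0;

    for (size_t i = 0; i < N; i++) {
        a[i] = (uint32_t)i + 1;
        b[i] = 2 * (uint32_t)i + 3;
    }

    for (size_t k = 0; k < DOT_NIMPL; k++) {
        /* Unsigned arithmetic wraps, so all orderings must agree exactly. */
        assert(dot_impls[k].fn(a, b, N) == dot_impls[0].fn(a, b, N));

        clock_t start = clock();
        for (size_t r = 0; r < rounds; r++)
            sink += dot_impls[k].fn(a, b, N);
        clock_t elapsed = clock() - start;

        if (k == 0 || elapsed < best_time) {
            best_time = elapsed;
            best = &dot_impls[k];
        }
    }
    (void)sink;
    return best;
}
```

A caller would run dot_select() once at startup, for example dot_select(100000), and then route all later dot products through the returned fn pointer. Which candidate wins depends on the CPU and compiler the code runs on, which is the point of measuring at boot rather than choosing at build time.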