From: Linus Torvalds
> Sent: 09 February 2018 19:49
...
> I think the instruction scheduling ends up basically breaking around
> microcoded instructions, which is why you'll get something like 12+n
> cycles for "rep movs" on some uarchs, but at that point it's probably
> mostly in the noise compared to all the other nasty PTI things.

Or 48+n on P4

> You won't see any of the _real_ advantages (which are about moving
> cachelines at a time), so with smallish copies you really only see the
> downsides of "rep movs", which is mainly that instruction scheduling
> hiccup with any microcode.

I thought that the hardware optimisation for 'rep movsb' on recent
Intel CPUs generated word-sized memory accesses even for misaligned
short transfers.
My thoughts were that they'd implemented a cache line sized barrel
shift register.
If that isn't true then using it for all memcpy() is probably stupid
(but not as stupid as doing all memcpy backwards!)
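
For reference, a minimal sketch of what "using it for all memcpy()" would
look like, assuming GCC-style inline asm on x86. The names copy_rep_movsb()
and copy_rep_movsb_selftest() are made up for illustration; this is not the
kernel's memcpy(), and it says nothing about what the hardware does
internally.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy n bytes with a single "rep movsb".
 * Assumes the direction flag is clear, as the x86 ABIs require.
 */
static inline void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
	void *ret = dst;

	/* Copies (R|E)CX bytes from [(R|E)SI] to [(R|E)DI], advancing both. */
	__asm__ volatile("rep movsb"
			 : "+D" (dst), "+S" (src), "+c" (n)
			 :
			 : "memory");
	return ret;
}

/* Misaligned short transfer, the case in question. */
static void copy_rep_movsb_selftest(void)
{
	char src[32] = "misaligned short transfer";
	char dst[32] = { 0 };

	copy_rep_movsb(dst + 3, src + 1, 13);
	assert(memcmp(dst + 3, src + 1, 13) == 0);
}
```

Timing something like this against an open-coded loop for small n would
show whether the fixed startup cost dominates on a given uarch.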

	David