From: Linus Torvalds
> Sent: 09 February 2018 19:49
...
> I think the instruction scheduling ends up basically breaking around
> microcoded instructions, which is why you'll get something like 12+n
> cycles for "rep movs" on some uarchs, but at that point it's probably
> mostly in the noise compared to all the other nasty PTI things.

Or 48+n on P4

> You won't see any of the _real_ advantages (which are about moving
> cachelines at a time), so with smallish copies you really only see the
> downsides of "rep movs", which is mainly that instruction scheduling
> hiccup with any microcode.

I thought that the hardware optimisation for 'rep movsb' on recent Intel
CPUs generated word-sized memory accesses even for misaligned short
transfers. My thoughts were that they'd implemented a cache-line-sized
barrel shift register.

If that isn't true then using it for all memcpy() is probably stupid
(but not as stupid as doing all memcpy backwards!)

	David
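
For anyone wanting to poke at the short misaligned case, here is a minimal
sketch of a "rep movsb" copy plus a correctness check against memcpy().
Both rep_movsb_copy() and short_copy_mismatches() are hypothetical names
made up for illustration, not anything from the kernel. It assumes GCC or
Clang inline asm and a clear direction flag (as the ABI guarantees), and
falls back to memcpy() on non-x86. It only checks that the bytes match; it
says nothing about the access width or the cycle counts discussed above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the thread): copy n bytes with a single
 * "rep movsb" on x86, falling back to memcpy() elsewhere. */
static void *rep_movsb_copy(void *dst, const void *src, size_t n)
{
#if defined(__x86_64__) || defined(__i386__)
	void *d = dst;

	/* (R|E)DI = destination, (R|E)SI = source, (R|E)CX = byte count. */
	__asm__ volatile("rep movsb"
			 : "+D"(d), "+S"(src), "+c"(n)
			 :
			 : "memory");
	return dst;
#else
	return memcpy(dst, src, n);
#endif
}

/* Count mismatches against memcpy() for short copies at every source
 * and destination misalignment within an 8-byte word. */
static int short_copy_mismatches(void)
{
	unsigned char src[64], want[64], got[64];
	int bad = 0;

	for (size_t i = 0; i < sizeof(src); i++)
		src[i] = (unsigned char)(i * 7 + 1);

	for (size_t soff = 0; soff < 8; soff++)
		for (size_t doff = 0; doff < 8; doff++)
			for (size_t len = 0; len <= 32; len++) {
				assert(soff + len <= sizeof(src));
				assert(doff + len <= sizeof(got));
				/* Poison both buffers so stray writes show up. */
				memset(want, 0xaa, sizeof(want));
				memset(got, 0xaa, sizeof(got));
				memcpy(want + doff, src + soff, len);
				rep_movsb_copy(got + doff, src + soff, len);
				bad += memcmp(want, got, sizeof(want)) != 0;
			}
	return bad;
}
```

Wrapping the two copy calls in a cycle counter and looping over the same
offsets and lengths would be the obvious next step for comparing the fixed
startup cost on a given uarch.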