From: Al Viro
> Sent: 22 July 2020 18:39
>
> I would love to see your patch, anyway, along with the testcases and performance
> comparison.

See attached program.
Compile and run (as root): csum_iov 1

Unpatched (as shipped) 16 vectors of 1 byte take ~430 clocks on my Haswell cpu.
With dsl_patch defined they take ~393.

The maximum throughput is ~1.16 clocks/word for 16 vectors of 1k.
For longer vectors the data gets lost from the cache between the iterations.

On an older Ivy Bridge cpu it never goes faster than 2 clocks/word.
(Due to the implementation of ADC.)

The absolute limit is 1 clock/word - limited by the memory write.
I suspect that is achievable on Haswell with much less loop unrolling.

I had to replace the ror32() with __builtin_bswap32().
The kernel objects do contain the 'ror' instruction - even though I didn't find
the asm for it.

	David
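
The message does not say why __builtin_bswap32() can stand in for ror32() in
a userspace build. One possible reading, assuming the rotate in question is
by 8 bits and the 32-bit partial sum is later folded to 16 bits: for bytes
b3 b2 b1 b0, the rotate gives halves (b0 b3) and (b2 b1), the byte swap gives
(b0 b1) and (b2 b3), and both pairs add up to ((b0 + b2) << 8) + (b1 + b3).
The folded checksum is therefore the same. The sketch below illustrates that;
ror32_sketch(), fold16(), same_after_fold() and check_equivalence() are
hypothetical names, not taken from the attached program or the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rotate right by n bits (0 < n < 32); a userspace stand-in for ror32(). */
static uint32_t ror32_sketch(uint32_t x, unsigned int n)
{
	return (x >> n) | (x << (32 - n));
}

/* Fold a 32-bit partial sum to 16 bits with end-around carry. */
static uint16_t fold16(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Returns 1 if rotate-by-8 and byte swap fold to the same 16-bit value. */
static int same_after_fold(uint32_t x)
{
	return fold16(ror32_sketch(x, 8)) == fold16(__builtin_bswap32(x));
}

/* Spot-check a few sums (assumed sample values, not measured data). */
static void check_equivalence(void)
{
	static const uint32_t samples[] = {
		0, 0x12345678u, 0xdeadbeefu, 0xffffffffu
	};
	size_t i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		assert(same_after_fold(samples[i]));
}
```

This only covers the case where the result is folded afterwards; the two
operations are not interchangeable on the unfolded 32-bit value.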