RE: [tip:x86/core 1/1] arch/x86/um/../lib/csum-partial_64.c:98:12: error: implicit declaration of function 'load_unaligned_zeropad'

From: David Laight <David.Laight@ACULAB.COM>
To: 'Eric Dumazet' <edumazet@google.com>
Cc: Noah Goldstein <goldstein.w.n@gmail.com>,
	Johannes Berg <johannes@sipsolutions.net>,
	"alexanderduyck@fb.com" <alexanderduyck@fb.com>,
	"kbuild-all@lists.01.org" <kbuild-all@lists.01.org>,
	open list <linux-kernel@vger.kernel.org>,
	"linux-um@lists.infradead.org" <linux-um@lists.infradead.org>,
	"lkp@intel.com" <lkp@intel.com>,
	"peterz@infradead.org" <peterz@infradead.org>,
	X86 ML <x86@kernel.org>
Subject: RE: [tip:x86/core 1/1] arch/x86/um/../lib/csum-partial_64.c:98:12: error: implicit declaration of function 'load_unaligned_zeropad'
Date: Fri, 26 Nov 2021 22:41:10 +0000
Message-ID: <8a6fe34e0f2f4739af39a5935a74b823@AcuMS.aculab.com>
In-Reply-To: <CANn89iJubuJxjVp4fx78-bjKBN3e9JsdAwZxj4XO6g2_7ZPqJQ@mail.gmail.com>

From: Eric Dumazet
> Sent: 26 November 2021 18:10
...
> > AFAICT (from a pdf) bswap32() and ror(x, 8) are likely to be
> > the same speed but may use different execution units.

The 64bit shifts/rotates are also only one clock.
It is the bswap64 that can be two.

> > Intel seem to have managed to slow down ror(x, %cl) to 3 clocks
> > in Sandy Bridge - and still not fixed it.
> > Although the compiler might be making a pigs-breakfast of the
> > register allocation when you tried setting 'odd = 8'.
> >
> > Weeks can be spent fiddling with this code :-(
> 
> Yes, and in the end, it won't be able to compete with a
> specialized/inlined ipv6_csum_partial()

I bet most of the gain comes from knowing there is a non-zero
whole number of 32bit words.
The pesky edge conditions cost.
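
To illustrate the point about edge conditions, here is a minimal sketch in
plain C of a sum that is allowed to assume a non-zero whole number of 32-bit
words. The names csum_fold64() and csum_words32() are made up for this
example and are not kernel functions. The sketch returns the folded sum
without complementing it, and it is not the ipv6_csum_partial() mentioned
above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 64-bit ones' complement accumulator down to 16 bits. */
static uint16_t csum_fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffu) + (sum >> 32);	/* at most 33 bits */
	sum = (sum & 0xffffffffu) + (sum >> 32);	/* at most 32 bits */
	sum = (sum & 0xffff) + (sum >> 16);		/* at most 17 bits */
	sum = (sum & 0xffff) + (sum >> 16);		/* at most 16 bits */
	return (uint16_t)sum;
}

/*
 * Hypothetical helper: sum 'nwords' 32-bit words.
 * The caller guarantees nwords > 0, so there is no alignment fix-up,
 * no trailing-byte handling and no zero-length check in the loop.
 */
static uint16_t csum_words32(const uint32_t *buf, size_t nwords)
{
	uint64_t sum = 0;

	assert(nwords > 0);
	do {
		/* A 64-bit accumulator cannot wrap for fewer than 2^32 words. */
		sum += *buf++;
	} while (--nwords);

	return csum_fold64(sum);
}
```

Everything the generic function has to do before and after its main loop
simply disappears here, which is where I'd expect most of the saving to be.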

And even then you need to get it right!
The one for summing the 5-word IPv4 header is actually horrid
on Intel CPUs prior to Haswell because 'adc' has a latency of 2.
On Sandy Bridge the carry output is valid on the next clock,
so adding to alternate registers doubles throughput.
(That could easily be done in the current function and will
make a big difference on those CPUs.)
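
A rough C model of the 'alternate registers' idea for the 5-word header,
as a sketch only. The function names are invented for this example. In the
real asm a single carry-flag chain links every 'adc' and only the destination
register alternates; C cannot express that, so this model uses two fully
independent accumulators with their own end-around carries instead. It shows
the shortened register dependency chain, not the exact instruction sequence,
and the compiler is free to schedule it differently.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit add with end-around carry (ones' complement addition). */
static inline uint32_t add32_with_carry(uint32_t a, uint32_t b)
{
	uint32_t s = a + b;

	return s + (s < b);	/* s < b means the add wrapped */
}

/* Fold a 32-bit ones' complement sum down to 16 bits. */
static uint16_t fold32(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* One accumulator: every add depends on the result of the previous one. */
static uint16_t hdr_sum_single(const uint32_t w[5])
{
	uint32_t sum = w[0];

	sum = add32_with_carry(sum, w[1]);
	sum = add32_with_carry(sum, w[2]);
	sum = add32_with_carry(sum, w[3]);
	sum = add32_with_carry(sum, w[4]);
	return fold32(sum);
}

/* Two accumulators: consecutive adds target different registers. */
static uint16_t hdr_sum_alternate(const uint32_t w[5])
{
	uint32_t even, odd;

	assert(w);
	even = w[0];
	odd = w[1];
	even = add32_with_carry(even, w[2]);
	odd = add32_with_carry(odd, w[3]);
	even = add32_with_carry(even, w[4]);
	return fold32(add32_with_carry(even, odd));
}
```

Both functions return the same folded sum because ones' complement addition
can be done in any order; only the length of the dependency chain differs.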

But basically the current generic code has the loop unrolled
further than is necessary for modern (non-Atom) CPUs.
That just adds more code outside the loop.

I did manage to get 12 bytes/clock using adcx/adox with only
32 bytes per iteration.
That will require aligned buffers.

Alignment won't matter for 'adc' loops because there are two
'memory read' units - but there is the elephant:

Sandy bridge Cache bank conflicts
Each consecutive 128 bytes, or two cache lines, in the data cache is divided
into 8 banks of 16 bytes each. It is not possible to do two memory reads in
the same clock cycle if the two memory addresses have the same bank number,
i.e. if bits 4 - 6 in the two addresses are the same.
	; Example 9.5. Sandy bridge cache bank conflict
	mov eax, [rsi] ; Use bank 0, assuming rsi is divisible by 40H
	mov ebx, [rsi+100H] ; Use bank 0. Cache bank conflict
	mov ecx, [rsi+110H] ; Use bank 1. No cache bank conflict
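
The rule quoted above can be written down as a couple of C helpers. This is
a sketch that applies the quoted text literally (bank number = address bits
4 to 6, conflict = same bank number); the snb_* names are made up for this
example and it does not try to model any finer hardware detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SNB_BANK_BYTES	16u	/* each bank is 16 bytes wide */
#define SNB_BANK_COUNT	8u	/* 8 banks cover 128 bytes (two cache lines) */

static_assert(SNB_BANK_BYTES * SNB_BANK_COUNT == 128,
	      "banks must tile 128 bytes");

/* Bank number of an address: bits 4..6. */
static unsigned int snb_bank(uintptr_t addr)
{
	return (unsigned int)((addr / SNB_BANK_BYTES) % SNB_BANK_COUNT);
}

/* Per the quoted rule, two reads in the same clock conflict if the banks match. */
static bool snb_bank_conflict(uintptr_t a, uintptr_t b)
{
	return snb_bank(a) == snb_bank(b);
}
```

With rsi divisible by 40H this gives bank 0 for [rsi] and [rsi+100H], and
bank 1 for [rsi+110H], matching the comments in the example.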

That isn't a problem on Haswell, but it is probably worth ordering
the 'adc' in the loop to reduce the number of conflicts.
I didn't try to look for that though.
I only remember testing aligned buffers on Sandy/Ivy Bridge.
Adding to alternate registers helped no end.

	David
