From: David Laight <David.Laight@ACULAB.COM>
To: 'Eric Dumazet' <edumazet@google.com>
Cc: Noah Goldstein <goldstein.w.n@gmail.com>,
	Johannes Berg <johannes@sipsolutions.net>,
	"alexanderduyck@fb.com" <alexanderduyck@fb.com>,
	"kbuild-all@lists.01.org" <kbuild-all@lists.01.org>,
	open list <linux-kernel@vger.kernel.org>,
	"linux-um@lists.infradead.org" <linux-um@lists.infradead.org>,
	"lkp@intel.com" <lkp@intel.com>,
	"peterz@infradead.org" <peterz@infradead.org>,
	X86 ML <x86@kernel.org>
Subject: RE: [tip:x86/core 1/1] arch/x86/um/../lib/csum-partial_64.c:98:12: error: implicit declaration of function 'load_unaligned_zeropad'
Date: Fri, 26 Nov 2021 22:41:10 +0000
Message-ID: <8a6fe34e0f2f4739af39a5935a74b823@AcuMS.aculab.com>
In-Reply-To: <CANn89iJubuJxjVp4fx78-bjKBN3e9JsdAwZxj4XO6g2_7ZPqJQ@mail.gmail.com>

From: Eric Dumazet
> Sent: 26 November 2021 18:10
...
> > AFAICT (from a pdf) bswap32() and ror(x, 8) are likely to be
> > the same speed but may use different execution units.

The 64bit shifts/rotates are also only one clock.
It is the bswap64 that can be two.

> > Intel seem to have managed to slow down ror(x, %cl) to 3 clocks
> > in sandy bridge - and still not fixed it.
> > Although the compiler might be making a pigs-breakfast of the
> > register allocation when you tried setting 'odd = 8'.
> >
> > Weeks can be spent fiddling with this code :-(
> 
> Yes, and in the end, it won't be able to compete with  a
> specialized/inlined ipv6_csum_partial()

I bet most of the gain comes from knowing there is a non-zero
whole number of 32bit words.
The pesky edge conditions cost.
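
To illustrate why that knowledge helps, here is a minimal sketch (not the kernel's code) of a sum over a non-zero whole number of 32bit words. The names csum_fold64() and csum_words32() are made up for this example. With no odd bytes, no tail and no zero-length case, the whole function is one loop and a fold. It returns the folded, non-inverted sum in native byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fold a 64-bit accumulator down to 16 bits,
 * adding the carries back in (end-around carry). */
static inline uint16_t csum_fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffu) + (sum >> 32);	/* at most 33 bits */
	sum = (sum & 0xffffffffu) + (sum >> 32);	/* now fits in 32 bits */
	sum = (sum & 0xffffu) + (sum >> 16);		/* at most 17 bits */
	sum = (sum & 0xffffu) + (sum >> 16);		/* now fits in 16 bits */
	return (uint16_t)sum;
}

/* Hypothetical helper: sum 'nwords' 32-bit words, nwords >= 1.
 * The caller guarantees a whole number of words, so there is no
 * odd-byte or tail handling at all. */
static uint16_t csum_words32(const void *buf, size_t nwords)
{
	const unsigned char *p = buf;
	uint64_t sum = 0;

	assert(nwords > 0);
	do {
		uint32_t w;

		memcpy(&w, p, sizeof(w));	/* alignment-safe load */
		sum += w;			/* cannot wrap for fewer than 2^32 words */
		p += sizeof(w);
	} while (--nwords);

	return csum_fold64(sum);
}
```

Run over a 20-byte IPv4 header whose checksum field is already correct, this returns 0xffff on either endianness.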

And even then you need to get it right!
The one for summing the 5-word IPv4 header is actually horrid
on Intel CPUs prior to Haswell because 'adc' has a latency of 2.
On Sandy bridge the carry output is valid on the next clock,
so adding to alternate registers doubles throughput.
(That could easily be done in the current function and will
make a big difference on those CPUs.)
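
A rough C sketch of the 'alternate registers' idea, for illustration only: the names add64_with_carry() and csum_qwords_alt() are invented here, and the real code would be an 'adc' chain in inline asm. Whether a compiler turns this into 'adc' at all is not guaranteed. The point is just that s0 and s1 form two independent dependency chains, so one add need not wait for the previous one to finish.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: 64-bit add with end-around carry
 * (the C equivalent of 'add' followed by 'adc $0'). */
static inline uint64_t add64_with_carry(uint64_t a, uint64_t b)
{
	a += b;
	return a + (a < b);	/* a < b means the add wrapped */
}

/* Hypothetical helper: sum 'nqwords' 64-bit words into two
 * accumulators that are only combined at the very end. */
static uint64_t csum_qwords_alt(const void *buf, size_t nqwords)
{
	const unsigned char *p = buf;
	uint64_t s0 = 0, s1 = 0;
	uint64_t w0, w1;
	size_t i;

	for (i = 0; i + 1 < nqwords; i += 2) {
		memcpy(&w0, p + 8 * i, 8);	/* alignment-safe loads */
		memcpy(&w1, p + 8 * i + 8, 8);
		s0 = add64_with_carry(s0, w0);	/* chain 0 */
		s1 = add64_with_carry(s1, w1);	/* chain 1, independent of chain 0 */
	}
	if (i < nqwords) {			/* odd word left over */
		memcpy(&w0, p + 8 * i, 8);
		s0 = add64_with_carry(s0, w0);
	}
	return add64_with_carry(s0, s1);
}
```

The result is the unfolded 64-bit one's-complement sum; it still needs folding to 16 bits in the usual way.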

But basically the current generic code has the loop unrolled
further than is necessary for modern (non-atom) CPUs.
That just adds more code outside the loop.

I did manage to get 12 bytes/clock using adcx/adox with only
32 bytes each iteration.
That will require aligned buffers.

Alignment won't matter for 'adc' loops because there are two
'memory read' units - but there is the elephant:

Sandy bridge Cache bank conflicts
Each consecutive 128 bytes, or two cache lines, in the data cache is divided
into 8 banks of 16 bytes each. It is not possible to do two memory reads in
the same clock cycle if the two memory addresses have the same bank number,
i.e. if bit 4 - 6 in the two addresses are the same.
	; Example 9.5. Sandy bridge cache bank conflict
	mov eax, [rsi] ; Use bank 0, assuming rsi is divisible by 40H
	mov ebx, [rsi+100H] ; Use bank 0. Cache bank conflict
	mov ecx, [rsi+110H] ; Use bank 1. No cache bank conflict

That isn't a problem on Haswell, but it is probably worth ordering
the 'adc' in the loop to reduce the number of conflicts.
I didn't try to look for that though.
I only remember testing aligned buffers on Sandy/Ivy bridge.
Adding to alternate registers helped no end.
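
For anyone wanting to check which loads in a loop would collide, the bank rule quoted above can be written down directly. This is only a sketch of that one sentence (bank number = address bits 4 to 6). The function names are invented for the example, and it deliberately ignores any other conditions the hardware may apply.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: Sandy bridge data cache bank for an address,
 * taken from the quoted rule: 8 banks of 16 bytes, i.e. bits 4-6. */
static inline unsigned int snb_cache_bank(uintptr_t addr)
{
	return (unsigned int)((addr >> 4) & 7);
}

/* Hypothetical helper: true if two reads issued in the same clock
 * would hit the same bank according to that rule. */
static inline bool snb_bank_conflict(uintptr_t a, uintptr_t b)
{
	return snb_cache_bank(a) == snb_cache_bank(b);
}
```

With a base address that has bits 4-6 clear (0x1000, say), offset 0x100 lands in bank 0 again and conflicts, while offset 0x110 lands in bank 1 and does not, matching Example 9.5.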

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


