From: Noah Goldstein <goldstein.w.n@gmail.com>
To: David Laight <David.Laight@aculab.com>
Cc: Eric Dumazet <edumazet@google.com>,
	Johannes Berg <johannes@sipsolutions.net>,
	"alexanderduyck@fb.com" <alexanderduyck@fb.com>,
	"kbuild-all@lists.01.org" <kbuild-all@lists.01.org>,
	open list <linux-kernel@vger.kernel.org>,
	"linux-um@lists.infradead.org" <linux-um@lists.infradead.org>,
	"lkp@intel.com" <lkp@intel.com>,
	"peterz@infradead.org" <peterz@infradead.org>,
	X86 ML <x86@kernel.org>
Subject: Re: [tip:x86/core 1/1] arch/x86/um/../lib/csum-partial_64.c:98:12: error: implicit declaration of function 'load_unaligned_zeropad'
Date: Fri, 26 Nov 2021 17:04:20 -0600
Message-ID: <CAFUsyfJmpFFzuMhHrH+oBVzcHggW0QZM9dvXtPQW88kAw_2_BQ@mail.gmail.com>
In-Reply-To: <8a6fe34e0f2f4739af39a5935a74b823@AcuMS.aculab.com>

On Fri, Nov 26, 2021 at 4:41 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Eric Dumazet
> > Sent: 26 November 2021 18:10
> ...
> > > AFAICT (from a pdf) bswap32() and ror(x, 8) are likely to be
> > > the same speed but may use different execution units.
>
> The 64bit shifts/rotates are also only one clock.
> It is the bswap64 that can be two.
>
> > > Intel seem to have managed to slow down ror(x, %cl) to 3 clocks
> > > in sandy bridge - and still not fixed it.
> > > Although the compiler might be making a pigs-breakfast of the
> > > register allocation when you tried setting 'odd = 8'.
> > >
> > > Weeks can be spent fiddling with this code :-(
> >
> > Yes, and in the end, it won't be able to compete with  a
> > specialized/inlined ipv6_csum_partial()
>
> I bet most of the gain comes from knowing there is a non-zero
> whole number of 32bit words.
> The pesky edge conditions cost.
>
> And even then you need to get it right!
> The one for summing the 5-word IPv4 header is actually horrid
> on Intel cpu prior to Haswell because 'adc' has a latency of 2.
> On Sandy bridge the carry output is valid on the next clock,
> so adding to alternate registers doubles throughput.
> (That could easily be done in the current function and will
> make a big difference on those cpu.)
>
> But basically the current generic code has the loop unrolled
> further than is necessary for modern (non-atom) cpu.
> That just adds more code outside the loop.
>
> I did manage to get 12 bytes/clock using adcx/adox with only
> 32 bytes each iteration.
> That will require aligned buffers.
>
> Alignment won't matter for 'adc' loops because there are two
> 'memory read' units - but there is the elephant:
>
> Sandy bridge Cache bank conflicts
> Each consecutive 128 bytes, or two cache lines, in the data cache is divided
> into 8 banks of 16 bytes each. It is not possible to do two memory reads in
> the same clock cycle if the two memory addresses have the same bank number,
> i.e. if bits 4-6 in the two addresses are the same.
>         ; Example 9.5. Sandy bridge cache bank conflict
>         mov eax, [rsi] ; Use bank 0, assuming rsi is divisible by 40H
>         mov ebx, [rsi+100H] ; Use bank 0. Cache bank conflict
>         mov ecx, [rsi+110H] ; Use bank 1. No cache bank conflict
>
> That isn't a problem on Haswell, but it is probably worth ordering
> the 'adc' in the loop to reduce the number of conflicts.
> I didn't try to look for that though.
> I only remember testing aligned buffers on Sandy/Ivy bridge.
> Adding to alternate registers helped no end.
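
The bank rule quoted above can be written down as a small helper. This is
only a sketch of the rule as stated in the quoted text (bank = address
bits 4-6); the names bank_of() and banks_conflict() are made up for
illustration, and real hardware may have further conditions that this
ignores.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Bank index per the quoted rule: 8 banks of 16 bytes each, selected by
 * address bits 4..6. Hypothetical helper, not a kernel function.
 */
static unsigned int bank_of(uintptr_t addr)
{
	return (unsigned int)((addr >> 4) & 7);
}

/*
 * Two reads issued in the same clock are assumed to conflict when they
 * hit the same bank. This models only the rule as quoted.
 */
static bool banks_conflict(uintptr_t a, uintptr_t b)
{
	return bank_of(a) == bank_of(b);
}
```

With a base address that is a multiple of 80H, base and base+100H land in
the same bank, while base and base+110H do not, matching the quoted
example. Two addresses exactly 16 bytes apart always land in adjacent banks
under this model.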

Can't that just be solved by having the two independent adcx/adox chains work
from regions that are 16+ bytes apart? For a 40-byte IPv6 header it will be simple.
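
A rough sketch of the idea in portable C, under some assumptions: the two
accumulators stand in for the adcx and adox chains, the carry is emulated
with a compare instead of the real instructions, and all function names
(add64_carry, load64, fold64, csum40_two_chains, csum40_reference) are
hypothetical. Each pair of loads is 16 bytes apart, so under the quoted
bank rule the two reads of a pair fall in different banks.

```c
#include <stdint.h>
#include <string.h>

/* 64-bit ones' complement add: wrap the carry-out back into bit 0. */
static uint64_t add64_carry(uint64_t a, uint64_t b)
{
	a += b;
	return a + (a < b);
}

/* Unaligned-safe 64-bit load in native byte order. */
static uint64_t load64(const unsigned char *p)
{
	uint64_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

/* Fold a 64-bit ones' complement sum down to 16 bits. */
static uint16_t fold64(uint64_t s)
{
	s = (s & 0xffffffffu) + (s >> 32);
	s = (s & 0xffffffffu) + (s >> 32);
	s = (s & 0xffff) + (s >> 16);
	s = (s & 0xffff) + (s >> 16);
	return (uint16_t)s;
}

/* Sum a 40-byte header with two independent chains, 16 bytes apart. */
static uint16_t csum40_two_chains(const void *hdr)
{
	const unsigned char *p = hdr;
	uint64_t a = load64(p);		/* chain A: offset 0  */
	uint64_t b = load64(p + 16);	/* chain B: offset 16 */

	a = add64_carry(a, load64(p + 8));	/* chain A: offset 8  */
	b = add64_carry(b, load64(p + 24));	/* chain B: offset 24 */
	a = add64_carry(a, load64(p + 32));	/* leftover word      */

	return fold64(add64_carry(a, b));
}

/* Plain 16-bit-word reference sum, used only to check the result. */
static uint16_t csum40_reference(const void *hdr)
{
	const unsigned char *p = hdr;
	uint32_t s = 0;
	uint16_t w;

	for (int i = 0; i < 40; i += 2) {
		memcpy(&w, p + i, sizeof(w));
		s += w;
	}
	s = (s & 0xffff) + (s >> 16);
	s = (s & 0xffff) + (s >> 16);
	return (uint16_t)s;
}
```

Both functions return the unfolded-complement sum in native byte order.
Whether a compiler turns the two accumulators into separate adcx/adox
chains is not guaranteed; an asm version would be needed for that.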
>
>         David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
