linux-toolchains.vger.kernel.org archive mirror
* Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression
       [not found]               ` <CAHk-=wjGQh3ucZFmFR0evbKu2OyEuue-bOjsrnCvxSQdj8x6aw@mail.gmail.com>
@ 2023-11-17 11:44                 ` Borislav Petkov
  2023-11-17 12:09                   ` Jakub Jelinek
  2023-11-17 13:09                   ` David Laight
  0 siblings, 2 replies; 6+ messages in thread
From: Borislav Petkov @ 2023-11-17 11:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, kernel test robot, oe-lkp, lkp, linux-kernel,
	Christian Brauner, Alexander Viro, Jens Axboe, Christoph Hellwig,
	Christian Brauner, Matthew Wilcox, David Laight, ying.huang,
	feng.tang, fengwei.yin, linux-toolchains ML

Might as well Cc toolchains...

On Thu, Nov 16, 2023 at 11:48:18AM -0500, Linus Torvalds wrote:
> Hmm. I know about the '-mstringop-strategy' flag because of the fairly
> recently discussed bug where gcc would create a byte-by-byte copy in
> some crazy circumstances with the address space attributes:
> 
>     https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111657

I hear those stringop strategy heuristics are interesting. :)

> But I incorrectly thought that "-mstringop-strategy=libcall" would
> then *always* do library calls.

That's how I understood it too. BUT, reportedly, small and known sizes
are still optimized, which is exactly what we want.
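
For illustration only (a sketch, not taken from the report), the two
cases look something like this when built with
"gcc -O2 -mstringop-strategy=libcall":

    #include <string.h>

    struct small { long a, b; };

    /* 16 bytes, size known at compile time: still folded into a couple
     * of plain moves long before the x86 stringop expansion runs. */
    void copy_fixed(struct small *d, const struct small *s)
    {
            memcpy(d, s, sizeof(*s));
    }

    /* size only known at run time: with =libcall this becomes an
     * actual call to memcpy() rather than inlined code. */
    void copy_var(void *d, const void *s, size_t n)
    {
            memcpy(d, s, n);
    }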

> So I decided to test, and that shows that gcc still ends up doing the
> "expand small constant size copies inline" even with that option, and
> doesn't force library calls for those cases.

And you've confirmed it.

> IOW, my assumption was just broken, and using
> "-mstringop-strategy=libcall" may well be the right thing to do.

And here's where I'm wondering whether we should enable it for x86 only
or globally. I think globally because those stringop heuristics happen,
AFAIU, in the general optimization stage and thus target agnostic.

> Of course, it's also possible that with all the function call overhead
> introduced by the CPU mitigations on older CPU's, we should just say
> "rep movsb" is always correct - if you have a new CPU with FSRM it's
> good, and if you have an old CPU it's no worse than the horrendous CPU
> mitigation overhead for function call/returns.

Yeah, I think we should measure the libcall thing and then try to get
the inlined "rep movsb" working and see which one is better. You do have
a point about that RET overhead after each CALL.
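
For reference, the "inlined rep movsb" variant is essentially the
following (a minimal sketch, not the kernel's actual copy routines):

    /* dst in %rdi, src in %rsi, count in %rcx -- exactly the registers
     * the string instruction wants. */
    static inline void *movsb_copy(void *dst, const void *src,
                                   unsigned long len)
    {
            void *d = dst;

            asm volatile("rep movsb"
                         : "+D" (d), "+S" (src), "+c" (len)
                         : : "memory");
            return dst;
    }

i.e. no CALL/RET pair at all, which is where the mitigation overhead
goes away.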

> I really hate the mitigations. Oh well.

Tell me about it.

> Anyway, maybe your patch is the RightThing(tm). Or maybe we should use
> 'rep_byte' instead of 'libcall'. Who knows..

Yeah, lemme keep playing with this.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression
  2023-11-17 11:44                 ` [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression Borislav Petkov
@ 2023-11-17 12:09                   ` Jakub Jelinek
  2023-11-17 12:18                     ` Borislav Petkov
  2023-11-17 13:09                   ` David Laight
  1 sibling, 1 reply; 6+ messages in thread
From: Jakub Jelinek @ 2023-11-17 12:09 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Linus Torvalds, David Howells, kernel test robot, oe-lkp, lkp,
	linux-kernel, Christian Brauner, Alexander Viro, Jens Axboe,
	Christoph Hellwig, Christian Brauner, Matthew Wilcox,
	David Laight, ying.huang, feng.tang, fengwei.yin,
	linux-toolchains ML

On Fri, Nov 17, 2023 at 12:44:21PM +0100, Borislav Petkov wrote:
> Might as well Cc toolchains...
> 
> On Thu, Nov 16, 2023 at 11:48:18AM -0500, Linus Torvalds wrote:
> > Hmm. I know about the '-mstringop-strategy' flag because of the fairly
> > recently discussed bug where gcc would create a byte-by-byte copy in
> > some crazy circumstances with the address space attributes:
> > 
> >     https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111657
> 
> I hear those stringop strategy heuristics are interesting. :)
> 
> > But I incorrectly thought that "-mstringop-strategy=libcall" would
> > then *always* do library calls.
> 
> That's how I understood it too. BUT, reportedly, small and known sizes
> are still optimized, which is exactly what we want.

Sure.  -mstringop-strategy affects only x86 expansion of the stringops
from GIMPLE to RTL, while for small constant sizes some folding can happen
far earlier in generic code.  Similarly, the copy/store by pieces generic
handling (straight-line code expansion of the builtins) is done in some
cases without invoking the backend optabs which is the only expansion
affected by the strategy.
Note, the default strategy depends on the sizes, -mtune= in effect,
whether it is -Os or -O2 etc.  And the argument for -mmemcpy-strategy=
or -mmemset-strategy= can include details on what sizes should be handled
by which algorithm, not everything needs to be done the same.
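
As an illustration of the shape of that argument (the triplets are
alg:max_size:dest_align per the gcc manual; treat the exact spelling
here as an assumption and check the manual for the gcc version in use):

    -mmemcpy-strategy=rep_byte:128:align,libcall:-1:align

meaning "use byte-wise rep moves up to 128 bytes, call the library for
anything bigger".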

> > IOW, my assumption was just broken, and using
> > "-mstringop-strategy=libcall" may well be the right thing to do.
> 
> And here's where I'm wondering whether we should enable it for x86 only
> or globally. I think globally because those stringop heuristics happen,
> AFAIU, in the general optimization stage and thus target agnostic.

-mstringop-strategy= option is x86 specific, so I don't see how you could
enable it on other architectures.

Anyway, if you are just trying to work around bugs in specific compilers,
please limit it to the affected compilers; overriding kernel makefiles
forever with the workaround would mean you force perhaps suboptimal
expansion in various cases.

	Jakub



* Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression
  2023-11-17 12:09                   ` Jakub Jelinek
@ 2023-11-17 12:18                     ` Borislav Petkov
  0 siblings, 0 replies; 6+ messages in thread
From: Borislav Petkov @ 2023-11-17 12:18 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Linus Torvalds, David Howells, kernel test robot, oe-lkp, lkp,
	linux-kernel, Christian Brauner, Alexander Viro, Jens Axboe,
	Christoph Hellwig, Christian Brauner, Matthew Wilcox,
	David Laight, ying.huang, feng.tang, fengwei.yin,
	linux-toolchains ML, Richard Biener, Michael Matz, Jan Hubicka

+ SUSE gcc folks.

On Fri, Nov 17, 2023 at 01:09:55PM +0100, Jakub Jelinek wrote:
> Sure.  -mstringop-strategy affects only x86 expansion of the stringops
> from GIMPLE to RTL, while for small constant sizes some folding can happen
> far earlier in generic code.  Similarly, the copy/store by pieces generic
> handling (straight-line code expansion of the builtins) is done in some
> cases without invoking the backend optabs which is the only expansion
> affected by the strategy.
> Note, the default strategy depends on the sizes, -mtune= in effect,
> whether it is -Os or -O2 etc.  And the argument for -mmemcpy-strategy=
> or -mmemset-strategy= can include details on what sizes should be handled
> by which algorithm, not everything needs to be done the same.

Good to know, I might experiment with those. Thx.

> > > IOW, my assumption was just broken, and using
> > > "-mstringop-strategy=libcall" may well be the right thing to do.
> > 
> > And here's where I'm wondering whether we should enable it for x86 only
> > or globally. I think globally because those stringop heuristics happen,
> > AFAIU, in the general optimization stage and thus target agnostic.
> 
> -mstringop-strategy= option is x86 specific, so I don't see how you could
> enable it on other architectures.

Yeah, Richi just explained to me the same on another thread. To which
I had the question:

"Ah, it even says so in the manpage:

x86 Options ... -mstringop-strategy=alg

Ok, so how would the same option be implemented for ARM or some other
backend?

Also -mstringop-strategy=alg but it would have effect when generating
ARM code, right?

Which means, if I have it in the generic Makefile, it'll get
automatically used on ARM too when gcc implements it.

Which then begs the question whether we want that or we should let ARM
folks decide when that time comes."

I.e., what happens if we have this option in the generic Makefile and
-mstringop-strategy starts affecting ARM expansion of the stringops from
GIMPLE to RTL? Does that even make sense?

> Anyway, if you are just trying to work around bugs in specific compilers,
> please limit it to the affected compilers; overriding kernel makefiles
> forever with the workaround would mean you force perhaps suboptimal
> expansion in various cases.

Yeah, perhaps a good idea. gcc 13 for now, I guess...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* RE: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression
  2023-11-17 11:44                 ` [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression Borislav Petkov
  2023-11-17 12:09                   ` Jakub Jelinek
@ 2023-11-17 13:09                   ` David Laight
  2023-11-17 13:36                     ` Linus Torvalds
  1 sibling, 1 reply; 6+ messages in thread
From: David Laight @ 2023-11-17 13:09 UTC (permalink / raw)
  To: 'Borislav Petkov', Linus Torvalds
  Cc: David Howells, kernel test robot, oe-lkp, lkp, linux-kernel,
	Christian Brauner, Alexander Viro, Jens Axboe, Christoph Hellwig,
	Christian Brauner, Matthew Wilcox, ying.huang, feng.tang,
	fengwei.yin, linux-toolchains ML

From: Borislav Petkov
> Sent: 17 November 2023 11:44
...
> Yeah, I think we should measure the libcall thing and then try to get
> the inlined "rep movsb" working and see which one is better. You do have
> a point about that RET overhead after each CALL.

You might be able to use the relocation list for memcpy()
to change the 5-byte call instruction into the inline
'mov %rdx,%rcx; rep movsb' sequence.
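
(Size-wise that does line up, assuming the usual encodings:

    e8 xx xx xx xx      call   memcpy           # 5 bytes
    48 89 d1            mov    %rdx,%rcx        # 3 bytes
    f3 a4               rep movsb               # 2 bytes

so the mov+movsb pair could overwrite the call in place.)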

I've spent all morning (on holiday) trying to understand the strange
timings I'm seeing for 'rep movsb' on an i7-7700.

The fixed overhead is very strange.

The first 'rep movsb' I do in a process takes an extra 5000 clocks or so.
But it doesn't seem to matter when I do it!
I can do it on entry to main() with several system calls before
the timing loop.
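
(The timing loop itself is roughly this shape -- a sketch, assuming a
fixed-frequency TSC; in the real test the copy in the middle is the
inlined 'rep movsb' sequence rather than a memcpy() call:

    #include <stdint.h>
    #include <string.h>
    #include <x86intrin.h>

    static uint64_t time_copy(void *dst, const void *src, unsigned long len)
    {
            unsigned int aux;
            uint64_t t0, t1;

            t0 = __rdtscp(&aux);    /* waits for earlier insns to finish */
            memcpy(dst, src, len);
            t1 = __rdtscp(&aux);

            return t1 - t0;
    }
)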

After that the fixed overhead for the 'rep movsb' is fairly small.
I've a few extra register moves between the 'rep movsb' instructions but
I'd guess at about 30 clocks.
All sizes up to (at least) 32 bytes execute in the same time.
After that it increases at much the rate you'd expect.

Zero length copies are different, they always take ~60 clocks.

My current guess for the 5000 clocks is that the logic to
decode 'rep movsb' is loaded into a buffer that is also used
to decode some other instructions.
So if it still contains the 'rep movsb' decoder it is fast; otherwise
it is slow.

No idea what other instructions might be using the same logic
(microcode?) buffer.

	David


* Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression
  2023-11-17 13:09                   ` David Laight
@ 2023-11-17 13:36                     ` Linus Torvalds
  2023-11-17 15:20                       ` David Laight
  0 siblings, 1 reply; 6+ messages in thread
From: Linus Torvalds @ 2023-11-17 13:36 UTC (permalink / raw)
  To: David Laight
  Cc: Borislav Petkov, David Howells, kernel test robot, oe-lkp, lkp,
	linux-kernel, Christian Brauner, Alexander Viro, Jens Axboe,
	Christoph Hellwig, Christian Brauner, Matthew Wilcox, ying.huang,
	feng.tang, fengwei.yin, linux-toolchains ML

On Fri, 17 Nov 2023 at 08:09, David Laight <David.Laight@aculab.com> wrote:
>
> Zero length copies are different, they always take ~60 clocks.

That zero-length thing is some odd microcode implementation issue, and
I think intel actually made a FZRM cpuid bit available for it ("Fast
Zero-size Rep Movs").

I don't think we care in the kernel, but somebody else did (or maybe
Intel added a flag for "we fixed it" just because they noticed)

I at some point did some profiling, and we do have zero-length memcpy
cases occasionally (at least for user copies, which was what I was
looking at), but they aren't common enough to worry about some small
extra strange overhead.

(In case you care, it was for things like an ioctl doing "copy the
base part of the ioctl data, then copy the rest separately".  Where
"the rest" was then often nothing at all).

> My current guess for the 5000 clocks is that the logic to
> decode 'rep movsb' is loaded into a buffer that is also used
> to decode some other instructions.

Unlikely.

I would guess it's the "power up the AVX2 side". The memory copy uses
those same resources internally.

You could try to see if "first AVX memory access" (or similar) has the
same extra initial cpu cycle issue.

Anyway, the CPU you are testing is new enough to have ERMS - that's
the "we do pretty well on string instructions" flag. It does indeed do
pretty well on string instructions, but has a few oddities in addition
to the zero-sized thing.

The other bad cases tend to be along the lines of "it falls flat on its
face when the source and destination addresses are not mutually aligned,
but they are the same virtual address modulo 4096".

Or something like that. I forget the exact details. The details do
exist, but I forget where (I suspect either Agner Fog or some footnote
in some Intel architecture manual).

So it's very much not as simple as "fixed initial cost and then a
fairly fixed cost per 32B", even if that is *one* pattern.

                Linus


* RE: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression
  2023-11-17 13:36                     ` Linus Torvalds
@ 2023-11-17 15:20                       ` David Laight
  0 siblings, 0 replies; 6+ messages in thread
From: David Laight @ 2023-11-17 15:20 UTC (permalink / raw)
  To: 'Linus Torvalds'
  Cc: Borislav Petkov, David Howells, kernel test robot, oe-lkp, lkp,
	linux-kernel, Christian Brauner, Alexander Viro, Jens Axboe,
	Christoph Hellwig, Christian Brauner, Matthew Wilcox, ying.huang,
	feng.tang, fengwei.yin, linux-toolchains ML

From: Linus Torvalds
> Sent: 17 November 2023 13:36
> 
> On Fri, 17 Nov 2023 at 08:09, David Laight <David.Laight@aculab.com> wrote:
> >
> > Zero length copies are different, they always take ~60 clocks.
> 
> That zero-length thing is some odd microcode implementation issue, and
> I think intel actually made a FZRM cpuid bit available for it ("Fast
> Zero-size Rep Movs").
> 
> I don't think we care in the kernel, but somebody else did (or maybe
> Intel added a flag for "we fixed it" just because they noticed)

I wasn't really worried about it - but it was an oddity.

> I at some point did some profiling, and we do have zero-length memcpy
> cases occasionally (at least for user copies, which was what I was
> looking at), but they aren't common enough to worry about some small
> extra strange overhead.

For user copies, avoiding the stac/clac might make it worthwhile.
But I doubt you'd want to add the 'jcxz .+n' in the copy code
itself because the mispredicted branch might make a bigger
difference.

I have tested writev() with lots of zero length fragments.
But that isn't a normal case.

> (In case you care, it was for things like an ioctl doing "copy the
> base part of the ioctl data, then copy the rest separately".  Where
> "the rest" was then often nothing at all).

That specific code where a zero length copy is quite likely
would probably benefit from a test in the source.
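
I.e. something like (hypothetical names again):

    /* skip the user copy -- and its stac/clac -- when there is
     * nothing to copy */
    if (len && copy_from_user(dst, src, len))
            return -EFAULT;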

> > My current guess for the 5000 clocks is that the logic to
> > decode 'rep movsb' is loaded into a buffer that is also used
> > to decode some other instructions.
> 
> Unlikely.
> 
> I would guess it's the "power up the AVX2 side". The memory copy uses
> those same resources internally.

That would make more sense - and have much the same effect.
If the kernel used 'rep movsb' internally and for user copies
it pretty much wouldn't ever get powered down.

> You could try to see if "first AVX memory access" (or similar) has the
> same extra initial cpu cycle issue.

Spot on.
	vpbroadcast %xmm1,%xmm2
does the trick as well.

> Anyway, the CPU you are testing is new enough to have ERMS - that's
> the "we do pretty well on string instructions" flag. It does indeed do
> pretty well on string instructions, but has a few oddities in addition
> to the zero-sized thing.

From what I looked at pretty much everything anyone cares about
probably has ERMS.
You need to be running on something older than sandy bridge.
So basically 'core 2' or 'core 2 duo' (or P4 netburst).
The amd cpus are similarly old.

> The other bad cases tend to be along the line of "it falls flat on its
> face when the source and destination address are not mutually aligned,
> but they are the same virtual address modulo 4096".

There is a similar condition that very often stops the cpu from
ever actually doing two memory reads in one clock.
Could easily be related.

> Or something like that. I forget the exact details. The details do
> exist, but I forget where (I suspect either Agner Fog or some footnote
> in some Intel architecture manual).

If Intel have published it, it will be in an unlit basement
behind a locked door and a broken staircase!

Unless 'page copy' hits it, I wonder if it really matters
for a normal workload.
Yes, you can conspire to hit it, but mostly you won't.

Wasn't it one of the atoms where the data cache prefetch
managed to completely destroy a forwards data copy, to the
point where it was worth taking the hit of a backwards copy?

> So it's very much not as simple as "fixed initial cost and then a
> fairly fixed cost per 32B", even if that is *one* pattern.

True, but it is the most common one.
And if it is bad the whole thing isn't worth using at all.

I'll try my test on an ivy bridge later.
(I don't have anything older that actually boots.)

	David

