* [RFC PATCH] x86, vmlinux.lds: Add debug option to force all data sections aligned
@ 2021-07-28  7:21 Feng Tang
  2021-09-22 18:51 ` Josh Poimboeuf
  0 siblings, 1 reply; 7+ messages in thread
From: Feng Tang @ 2021-07-28  7:21 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H Peter Anvin, Borislav Petkov,
	Peter Zijlstra, x86, linux-kernel
  Cc: Dave Hansen, Tony Luck, Feng Tang

0day has reported many strange performance changes (regressions or
improvements) in which there was no obvious relation between the culprit
commit and the benchmark at first glance, causing people to suspect that
the test itself is broken.

Upon further investigation, many of these cases turn out to be caused by
changes to the alignment of kernel text or data: since the kernel's text
and data are all linked together, a change in one domain can affect the
alignment of other domains.

To help quickly identify whether a strange performance change is caused
by _data_ alignment, add a debug option that forces the data sections from
all .o files to be aligned on THREAD_SIZE, so that a change in one domain
won't affect other modules' data alignment.

We have used this option to check some strange kernel performance changes
[1][2][3]; those changes disappeared after enabling it, which proved they
were data alignment related.

Similarly, there is another kernel debug option for checking text alignment
related performance changes: CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B,
which forces every function's start address to be 64-byte aligned.

This option depends on CONFIG_DYNAMIC_DEBUG==n, as the '__dyndbg' subsection
of .data has a hard requirement of ALIGN(8), as shown in 'vmlinux.lds':

"
. = ALIGN(8); __start___dyndbg = .; KEEP(*(__dyndbg)) __stop___dyndbg = .;
"

It holds the 'struct _ddebug' records, and dynamic_debug_init() walks
them with "pointer++", which breaks when this option is enabled.
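
For illustration, a minimal sketch of that walk - not the exact kernel code;
the start/stop symbols are the ones from vmlinux.lds above, and
handle_entry() is just a hypothetical stand-in:

  	/*
  	 * dynamic_debug_init() treats everything between the two linker-
  	 * provided symbols as one dense array and advances with iter++.
  	 * SUBALIGN() pads each object file's contribution, so the walk
  	 * would step into padding instead of the next valid entry.
  	 */
  	extern struct _ddebug __start___dyndbg[];
  	extern struct _ddebug __stop___dyndbg[];

  	static void walk_dyndbg(void)
  	{
  		struct _ddebug *iter;

  		for (iter = __start___dyndbg; iter < __stop___dyndbg; iter++)
  			handle_entry(iter);	/* hypothetical helper */
  	}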

[1]. https://lore.kernel.org/lkml/20200205123216.GO12867@shao2-debian/
[2]. https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/
[3]. https://lore.kernel.org/lkml/20201112140625.GA21612@xsang-OptiPlex-9020/

Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 arch/x86/Kconfig.debug        | 13 +++++++++++++
 arch/x86/kernel/vmlinux.lds.S |  7 ++++++-
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 80b57e7..d04c67e 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -228,6 +228,19 @@ config PUNIT_ATOM_DEBUG
 	  The current power state can be read from
 	  /sys/kernel/debug/punit_atom/dev_power_state
 
+config DEBUG_FORCE_DATA_SECTION_ALIGNED
+	bool "Force all data sections to be THREAD_SIZE aligned"
+	depends on EXPERT && !DYNAMIC_DEBUG
+	help
+	  There are cases where a commit in one kernel domain changes the
+	  data section alignment of other domains, as they are all linked
+	  together compactly, causing a mysterious performance bump
+	  (regression or improvement) that is hard to debug. Enabling
+	  this option helps to verify whether such a bump is caused by
+	  data alignment changes.
+
+	  It is mainly for debug and performance tuning use.
+
 choice
 	prompt "Choose kernel unwinder"
 	default UNWINDER_ORC if X86_64
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index efd9e9e..64256d0 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -156,7 +156,12 @@ SECTIONS
 	X86_ALIGN_RODATA_END
 
 	/* Data */
-	.data : AT(ADDR(.data) - LOAD_OFFSET) {
+	.data : AT(ADDR(.data) - LOAD_OFFSET)
+#ifdef CONFIG_DEBUG_FORCE_DATA_SECTION_ALIGNED
+	/* Use the biggest alignment required by the sections below */
+	SUBALIGN(THREAD_SIZE)
+#endif
+	{
 		/* Start of data section */
 		_sdata = .;
 
-- 
2.7.4



* Re: [RFC PATCH] x86, vmlinux.lds: Add debug option to force all data sections aligned
  2021-07-28  7:21 [RFC PATCH] x86, vmlinux.lds: Add debug option to force all data sections aligned Feng Tang
@ 2021-09-22 18:51 ` Josh Poimboeuf
  2021-09-23 14:57   ` Feng Tang
  0 siblings, 1 reply; 7+ messages in thread
From: Josh Poimboeuf @ 2021-09-22 18:51 UTC (permalink / raw)
  To: Feng Tang
  Cc: Thomas Gleixner, Ingo Molnar, H Peter Anvin, Borislav Petkov,
	Peter Zijlstra, x86, linux-kernel, Dave Hansen, Tony Luck,
	Denys Vlasenko, Linus Torvalds, Andy Lutomirski

On Wed, Jul 28, 2021 at 03:21:40PM +0800, Feng Tang wrote:
> 0day has reported many strange performance changes (regression or
> improvement), in which there was no obvious relation between the culprit
> commit and the benchmark at the first look, and it causes people to doubt
> the test itself is wrong.
> 
> Upon further check, many of these cases are caused by the change to the
> alignment of kernel text or data, as whole text/data of kernel are linked
> together, change in one domain can affect alignments of other domains.
> 
> To help to quickly identify if the strange performance change is caused
> by _data_ alignment. add a debug option to force the data sections from
> all .o files aligned on THREAD_SIZE, so that change in one domain won't
> affect other modules' data alignment.
> 
> We have used this option to check some strange kernel changes [1][2][3],
> and those performance changes were gone after enabling it, which proved
> they are data alignment related.
> 
> Similarly, there is another kernel debug option to check text alignment
> related performance changes: CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B,  
> which forces all function's start address to be 64 bytes alinged.
> 
> This option depends on CONFIG_DYNAMIC_DEBUG==n, as '__dyndbg' subsection
> of .data has a hard requirement of ALIGN(8), shown in the 'vmlinux.lds':
> 
> "
> . = ALIGN(8); __start___dyndbg = .; KEEP(*(__dyndbg)) __stop___dyndbg = .;
> "
> 
> It contains all pointers to 'struct _ddebug', and dynamic_debug_init()
> will "pointer++" to loop accessing these pointers, which will be broken
> with this option enabled.
> 
> [1]. https://lore.kernel.org/lkml/20200205123216.GO12867@shao2-debian/
> [2]. https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/
> [3]. https://lore.kernel.org/lkml/20201112140625.GA21612@xsang-OptiPlex-9020/
> 
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> ---
>  arch/x86/Kconfig.debug        | 13 +++++++++++++
>  arch/x86/kernel/vmlinux.lds.S |  7 ++++++-
>  2 files changed, 19 insertions(+), 1 deletion(-)

Hi Feng,

Thanks for the interesting LPC presentation about alignment-related
performance issues (which mentioned this patch).
 
  https://linuxplumbersconf.org/event/11/contributions/895/

I wonder if we can look at enabling some kind of data section alignment
unconditionally instead of just making it a debug option.  Have you done
any performance and binary size comparisons?

On a similar vein I think we should re-explore permanently enabling
cacheline-sized function alignment i.e. making something like
CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B the default.  Ingo did some
research on that a while back:

   https://lkml.kernel.org/r/20150519213820.GA31688@gmail.com

At the time, the main reported drawback of -falign-functions=64 was that
even small functions got aligned.  But now I think that can be mitigated
with some new options like -flimit-function-alignment and/or
-falign-functions=64,X (for some carefully-chosen value of X).
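
For a quick experiment on top of an existing tree, something like this
should be enough (a sketch; it assumes a gcc new enough to accept both
options):

  # Append cacheline-sized function alignment to the kernel CFLAGS;
  # KCFLAGS is the standard kbuild hook for extra compiler flags.
  make KCFLAGS='-falign-functions=64 -flimit-function-alignment' vmlinux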

-- 
Josh



* Re: [RFC PATCH] x86, vmlinux.lds: Add debug option to force all data sections aligned
  2021-09-22 18:51 ` Josh Poimboeuf
@ 2021-09-23 14:57   ` Feng Tang
  2021-09-24  1:57     ` Josh Poimboeuf
  2021-09-24  8:13     ` Denys Vlasenko
  0 siblings, 2 replies; 7+ messages in thread
From: Feng Tang @ 2021-09-23 14:57 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H Peter Anvin, Borislav Petkov,
	Peter Zijlstra, x86, linux-kernel, Dave Hansen, Tony Luck,
	Denys Vlasenko, Linus Torvalds, Andy Lutomirski

Hi Josh, 

On Wed, Sep 22, 2021 at 11:51:37AM -0700, Josh Poimboeuf wrote:
> On Wed, Jul 28, 2021 at 03:21:40PM +0800, Feng Tang wrote:
> > 0day has reported many strange performance changes (regression or
> > improvement), in which there was no obvious relation between the culprit
> > commit and the benchmark at the first look, and it causes people to doubt
> > the test itself is wrong.
> > 
> > Upon further check, many of these cases are caused by the change to the
> > alignment of kernel text or data, as whole text/data of kernel are linked
> > together, change in one domain can affect alignments of other domains.
> > 
> > To help to quickly identify if the strange performance change is caused
> > by _data_ alignment. add a debug option to force the data sections from
> > all .o files aligned on THREAD_SIZE, so that change in one domain won't
> > affect other modules' data alignment.
> > 
> > We have used this option to check some strange kernel changes [1][2][3],
> > and those performance changes were gone after enabling it, which proved
> > they are data alignment related.
> > 
> > Similarly, there is another kernel debug option to check text alignment
> > related performance changes: CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B,  
> > which forces all function's start address to be 64 bytes alinged.
> > 
> > This option depends on CONFIG_DYNAMIC_DEBUG==n, as '__dyndbg' subsection
> > of .data has a hard requirement of ALIGN(8), shown in the 'vmlinux.lds':
> > 
> > "
> > . = ALIGN(8); __start___dyndbg = .; KEEP(*(__dyndbg)) __stop___dyndbg = .;
> > "
> > 
> > It contains all pointers to 'struct _ddebug', and dynamic_debug_init()
> > will "pointer++" to loop accessing these pointers, which will be broken
> > with this option enabled.
> > 
> > [1]. https://lore.kernel.org/lkml/20200205123216.GO12867@shao2-debian/
> > [2]. https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/
> > [3]. https://lore.kernel.org/lkml/20201112140625.GA21612@xsang-OptiPlex-9020/
> > 
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > ---
> >  arch/x86/Kconfig.debug        | 13 +++++++++++++
> >  arch/x86/kernel/vmlinux.lds.S |  7 ++++++-
> >  2 files changed, 19 insertions(+), 1 deletion(-)
> 
> Hi Feng,
> 
> Thanks for the interesting LPC presentation about alignment-related
> performance issues (which mentioned this patch).
>  
>   https://linuxplumbersconf.org/event/11/contributions/895/
> 
> I wonder if we can look at enabling some kind of data section alignment
> unconditionally instead of just making it a debug option.  Have you done
> any performance and binary size comparisons?
 
Thanks for reviewing this!

For binary size, I just tested the 5.14 kernel with a default desktop
config from Ubuntu (I didn't use the normal rhel-8.3 config used
by 0Day, which is more server-oriented):

v5.14
------------------------
text		data		bss	    dec		hex	filename
16010221	14971391	6098944	37080556	235cdec	vmlinux

v5.14 + 64B-function-align
--------------------------
text		data		bss	    dec		hex	filename
18107373	14971391	6098944	39177708	255cdec	vmlinux

v5.14 + data-align(THREAD_SIZE 16KB)
--------------------------
text		data		bss	    dec		hex	filename
16010221	57001791	6008832	79020844	4b5c32c	vmlinux

So for text alignment, we see a 13.1% increase in text size, and for data
alignment, a 280.8% increase in data size.

Performance-wise, I did some tests with the force-32-bytes-text-align
option before (around v5.8), for the benchmarks will-it-scale, fsmark,
hackbench, netperf and kbuild:
* no obvious change for will-it-scale/fsmark/kbuild
* both regressions and improvements for different hackbench cases
* both regressions and improvements for netperf, ranging from -20% to +98%

As I didn't expect text alignment to be turned on by default, I
didn't dive deep into it at that time.


For data alignment, it has a huge impact on size and occupies more
cache/TLB entries, plus it hurts some normal functionality like
dynamic debug. So I'm afraid it can only be used as a debug option.

> On a similar vein I think we should re-explore permanently enabling
> cacheline-sized function alignment i.e. making something like
> CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B the default.  Ingo did some
> research on that a while back:
> 
>    https://lkml.kernel.org/r/20150519213820.GA31688@gmail.com

Thanks for sharing this, from which I learned a lot; I wish I had known
about this thread when we first checked strange regressions in 2019 :)

> At the time, the main reported drawback of -falign-functions=64 was that
> even small functions got aligned.  But now I think that can be mitigated
> with some new options like -flimit-function-alignment and/or
> -falign-functions=64,X (for some carefully-chosen value of X).

Will study more about these options. 

If they give a much smaller size increase and no performance regression,
then maybe this could be turned on by default.

Thanks,
Feng

> -- 
> Josh


* Re: [RFC PATCH] x86, vmlinux.lds: Add debug option to force all data sections aligned
  2021-09-23 14:57   ` Feng Tang
@ 2021-09-24  1:57     ` Josh Poimboeuf
  2021-09-24  8:13     ` Denys Vlasenko
  1 sibling, 0 replies; 7+ messages in thread
From: Josh Poimboeuf @ 2021-09-24  1:57 UTC (permalink / raw)
  To: Feng Tang
  Cc: Thomas Gleixner, Ingo Molnar, H Peter Anvin, Borislav Petkov,
	Peter Zijlstra, x86, linux-kernel, Dave Hansen, Tony Luck,
	Denys Vlasenko, Linus Torvalds, Andy Lutomirski

On Thu, Sep 23, 2021 at 10:57:20PM +0800, Feng Tang wrote:
> For binary size, I just tested 5.14 kernel with a default desktop
> config from Ubuntu (I didn't use the normal rhel-8.3 config used
> by 0Day, which is more for server):
> 
> v5.14
> ------------------------
> text		data		bss	    dec		hex	filename
> 16010221	14971391	6098944	37080556	235cdec	vmlinux
> 
> v5.14 + 64B-function-align
> --------------------------
> text		data		bss	    dec		hex	filename
> 18107373	14971391	6098944	39177708	255cdec	vmlinux
> 
> v5.14 + data-align(THREAD_SIZE 16KB)
> --------------------------
> text		data		bss	    dec		hex	filename
> 16010221	57001791	6008832	79020844	4b5c32c	vmlinux

That data size increase is indeed excessive.  However I wonder if some
other approach (other than SUBALIGN) could be taken.  For example, a 4k
alignment for each compilation unit's .data section.  That might require
some linker magic at the built-in.o linking level.

Anyway, I suspect the data alignment issues are less common than
function alignment.  It might be fine to leave the data alignment as a
debug feature for now, as this current patch does.

> > On a similar vein I think we should re-explore permanently enabling
> > cacheline-sized function alignment i.e. making something like
> > CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B the default.  Ingo did some
> > research on that a while back:
> > 
> >    https://lkml.kernel.org/r/20150519213820.GA31688@gmail.com
> 
> Thanks for sharing this, from which I learned a lot, and I hope I
> knew this thread when we first check strange regressions in 2019 :)
> 
> > At the time, the main reported drawback of -falign-functions=64 was that
> > even small functions got aligned.  But now I think that can be mitigated
> > with some new options like -flimit-function-alignment and/or
> > -falign-functions=64,X (for some carefully-chosen value of X).
> 
> Will study more about these options. 
> 
> If they have much less size increase and no regression in performance,
> then maybe it could be turned on by default.

Agreed!  I think/hope it would be a net positive change.

I've also been burned by such issues -- like a random one-line code
change causing a measurable performance regression due to changed
i-cache behavior in unrelated code.  It doesn't only affect 0-day tests,
it also affects real users.

-- 
Josh



* Re: [RFC PATCH] x86, vmlinux.lds: Add debug option to force all data sections aligned
  2021-09-23 14:57   ` Feng Tang
  2021-09-24  1:57     ` Josh Poimboeuf
@ 2021-09-24  8:13     ` Denys Vlasenko
  2021-09-27  7:04       ` Feng Tang
  1 sibling, 1 reply; 7+ messages in thread
From: Denys Vlasenko @ 2021-09-24  8:13 UTC (permalink / raw)
  To: Feng Tang, Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H Peter Anvin, Borislav Petkov,
	Peter Zijlstra, x86, linux-kernel, Dave Hansen, Tony Luck,
	Linus Torvalds, Andy Lutomirski

On 9/23/21 4:57 PM, Feng Tang wrote:
> On Wed, Sep 22, 2021 at 11:51:37AM -0700, Josh Poimboeuf wrote:
>> Hi Feng,
>>
>> Thanks for the interesting LPC presentation about alignment-related
>> performance issues (which mentioned this patch).
>>   
>>    https://linuxplumbersconf.org/event/11/contributions/895/
>>
>> I wonder if we can look at enabling some kind of data section alignment
>> unconditionally instead of just making it a debug option.  Have you done
>> any performance and binary size comparisons?
>   
> Thanks for reviewing this!
> 
> For binary size, I just tested 5.14 kernel with a default desktop
> config from Ubuntu (I didn't use the normal rhel-8.3 config used
> by 0Day, which is more for server):
> 
> v5.14
> ------------------------
> text		data		bss	    dec		hex	filename
> 16010221	14971391	6098944	37080556	235cdec	vmlinux
> 
> v5.14 + 64B-function-align
> --------------------------
> text		data		bss	    dec		hex	filename
> 18107373	14971391	6098944	39177708	255cdec	vmlinux
> 
> v5.14 + data-align(THREAD_SIZE 16KB)
> --------------------------
> text		data		bss	    dec		hex	filename
> 16010221	57001791	6008832	79020844	4b5c32c	vmlinux
> 
> So for the text-align, we see 13.1% increase for text. And for data-align,
> there is 280.8% increase for data.

Page-size alignment of all data is WAY too much. At most, alignment
to cache line size should work to make timings stable.
(In your case with "adjacent cache line prefetcher",
it may need to be 128 bytes. But definitely not 4096 bytes).
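
For an individual hot object that much is already expressible today - a
minimal sketch, with a made-up structure name:

  	/*
  	 * Pad a known-hot object to two cache lines (128 bytes) so the
  	 * adjacent-cache-line prefetcher cannot pull an unrelated,
  	 * contended neighbour into the same pair of lines.
  	 */
  	struct hot_counters {
  		unsigned long hits;
  		unsigned long misses;
  	} __attribute__((aligned(128)));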


> Performance wise, I have done some test with the force-32bytes-text-align
> option before (v5.8 time), for benchmark will-it-scale, fsmark, hackbench,
> netperf and kbuild:
> * no obvious change for will-it-scale/fsmark/kbuild
> * see both regression/improvement for different hackbench case
> * see both regression/improvement for netperf, from -20% to +98%

What usually happens here is that testcases are crafted to measure
how well some workloads scale, and to measure that efficiently,
the testcases are intentionally written to cause congestion -
this way, the benefits of better algorithms are easily seen.

However, this also means that in a congested scenario (e.g.
cache bouncing), small changes in the CPU architecture are also
easily visible - including cases where optimizations go awry.

In your presentation, you stumbled upon one such case:
the "adjacent cache line prefetcher" is counter-productive here;
it pulls an unrelated cache line into the CPU, not knowing that
this is in fact harmful - other CPUs will need that cache line,
not this one!

Since this particular case was a change in structure layout,
increasing alignment of .data sections won't help here.

My opinion is that we shouldn't worry about this too much.
Diagnose the observed slowdowns; if they are "real"
(there is a way to improve), fix that; if they are spurious,
just let them be.

Even when some CPU optimizations unintentionally hurt some
benchmarks, on average they are usually a win:
CPU makers have hundreds of people working on this as their
full-time job. With your example of the "adjacent cache line prefetcher",
CPU people might be looking at ways to detect when these
speculatively pulled-in cache lines are bouncing.


> For data-alignment, it has huge impact for the size, and occupies more
> cache/TLB, plus it hurts some normal function like dynamic-debug. So
> I'm afraid it can only be used as a debug option.
> 
>> On a similar vein I think we should re-explore permanently enabling
>> cacheline-sized function alignment i.e. making something like
>> CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B the default.  Ingo did some
>> research on that a while back:
>>
>>     https://lkml.kernel.org/r/20150519213820.GA31688@gmail.com
> 
> Thanks for sharing this, from which I learned a lot, and I hope I
> knew this thread when we first check strange regressions in 2019 :)
> 
>> At the time, the main reported drawback of -falign-functions=64 was that
>> even small functions got aligned.  But now I think that can be mitigated
>> with some new options like -flimit-function-alignment and/or
>> -falign-functions=64,X (for some carefully-chosen value of X).

-falign-functions=64,7 should be about right, I guess.

http://lkml.iu.edu/hypermail/linux/kernel/1505.2/03292.html

"""
defconfig vmlinux (w/o FRAME_POINTER) has 42141 functions.
6923 of them have 1st insn 5 or more bytes long,
5841 of them have 1st insn 6 or more bytes long,
5095 of them have 1st insn 7 or more bytes long,
786 of them have 1st insn 8 or more bytes long,
548 of them have 1st insn 9 or more bytes long,
375 of them have 1st insn 10 or more bytes long,
73 of them have 1st insn 11 or more bytes long,
one of them has 1st insn 12 bytes long:
this "heroic" instruction is in local_touch_nmi()
   65 48 c7 05 44 3c 00 7f 00 00 00 00
   movq $0x0,%gs:0x7f003c44(%rip)

Thus ensuring that at least seven first bytes do not cross
64-byte boundary would cover >98% of all functions.
"""

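If someone wants to redo that measurement on a current vmlinux, a rough
pipeline like the following should do (a sketch that assumes the usual
objdump disassembly layout):

  # Histogram of first-instruction lengths across all functions.
  objdump -d vmlinux | awk '
  	/^[0-9a-f]+ <.*>:$/ { want = 1; next }
  	want { split($0, f, "\t"); n = gsub(/[0-9a-f][0-9a-f]/, "", f[2]);
  	       if (n) len[n]++; want = 0 }
  	END { for (i in len) printf "%2d bytes: %d functions\n", i, len[i] }'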


* Re: [RFC PATCH] x86, vmlinux.lds: Add debug option to force all data sections aligned
  2021-09-24  8:13     ` Denys Vlasenko
@ 2021-09-27  7:04       ` Feng Tang
  2021-11-16  5:54         ` Feng Tang
  0 siblings, 1 reply; 7+ messages in thread
From: Feng Tang @ 2021-09-27  7:04 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Josh Poimboeuf, Thomas Gleixner, Ingo Molnar, H Peter Anvin,
	Borislav Petkov, Peter Zijlstra, x86, linux-kernel, Dave Hansen,
	Tony Luck, Linus Torvalds, Andy Lutomirski

Hi Denys,

On Fri, Sep 24, 2021 at 10:13:42AM +0200, Denys Vlasenko wrote:
[...]
> >
> >For binary size, I just tested 5.14 kernel with a default desktop
> >config from Ubuntu (I didn't use the normal rhel-8.3 config used
> >by 0Day, which is more for server):
> >
> >v5.14
> >------------------------
> >text		data		bss	    dec		hex	filename
> >16010221	14971391	6098944	37080556	235cdec	vmlinux
> >
> >v5.14 + 64B-function-align
> >--------------------------
> >text		data		bss	    dec		hex	filename
> >18107373	14971391	6098944	39177708	255cdec	vmlinux
> >
> >v5.14 + data-align(THREAD_SIZE 16KB)
> >--------------------------
> >text		data		bss	    dec		hex	filename
> >16010221	57001791	6008832	79020844	4b5c32c	vmlinux
> >
> >So for the text-align, we see 13.1% increase for text. And for data-align,
> >there is 280.8% increase for data.
> 
> Page-size alignment of all data is WAY too much. At most, alignment
> to cache line size should work to make timings stable.
> (In your case with "adjacent cache line prefetcher",
> it may need to be 128 bytes. But definitely not 4096 bytes).

This data-alignment patch is intended for debugging only. Also, with this
"SUBALIGN" trick, 4096 is the smallest working value; others like 64
or 2048 make the kernel fail to boot.
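
So the least wasteful variant of the same hunk that still boots would look
roughly like this (just a sketch):

  	.data : AT(ADDR(.data) - LOAD_OFFSET)
  #ifdef CONFIG_DEBUG_FORCE_DATA_SECTION_ALIGNED
  	/* 4K per input section; smaller SUBALIGN values did not boot here */
  	SUBALIGN(4096)
  #endif
  	{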

> 
> >Performance wise, I have done some test with the force-32bytes-text-align
> >option before (v5.8 time), for benchmark will-it-scale, fsmark, hackbench,
> >netperf and kbuild:
> >* no obvious change for will-it-scale/fsmark/kbuild
> >* see both regression/improvement for different hackbench case
> >* see both regression/improvement for netperf, from -20% to +98%
> 
> What usually happens here is that testcases are crafted to measure
> how well some workloads scale, and to measure that efficiently,
> testcases were intentionally written to cause congestion -
> this way, benefits of better algorithms are easily seen.
> 
> However, this also means that in the congested scenario (e.g.
> cache bouncing), small changes in CPU architecture are also
> easily visible - including cases where optimizations are going awry.
> 
> In your presentation, you stumbled upon one such case:
> the "adjacent cache line prefetcher" is counter-productive here,
> it pulls unrelated cache into the CPU, not knowing that
> this is in fact harmful - other CPUs will need this cache line,
> not this one!
> 
> Since this particular case was a change in structure layout,
> increasing alignment of .data sections won't help here.
> 
> My opinion is that we shouldn't worry about this too much.
> Diagnose the observed slow downs, if they are "real"
> (there is a way to improve), fix that, else if they are spurious,
> just let them be.

Agreed. The main topic of the talk was to explain and root-cause
those "strange" performance changes.

> Even when some CPU optimizations are unintentionally hurting some
> benchmarks, on the average they are usually a win:
> CPU makers have hundreds of people looking at that as their
> full-time jobs. With your example of "adjacent cache line prefetcher",
> CPU people might be looking at ways to detect when these
> speculatively pulled-in cache lines are bouncing.

I agree with you on this, and I've never implied the HW cache prefetcher
is a bad thing :) - see "as being helpful generally" in the slides. Also,
in the live LPC discussion, I said that I don't recommend disabling the
HW prefetcher.
 
> >For data-alignment, it has huge impact for the size, and occupies more
> >cache/TLB, plus it hurts some normal function like dynamic-debug. So
> >I'm afraid it can only be used as a debug option.
> >
> >>On a similar vein I think we should re-explore permanently enabling
> >>cacheline-sized function alignment i.e. making something like
> >>CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B the default.  Ingo did some
> >>research on that a while back:
> >>
> >>    https://lkml.kernel.org/r/20150519213820.GA31688@gmail.com
> >
> >Thanks for sharing this, from which I learned a lot, and I hope I
> >knew this thread when we first check strange regressions in 2019 :)
> >
> >>At the time, the main reported drawback of -falign-functions=64 was that
> >>even small functions got aligned.  But now I think that can be mitigated
> >>with some new options like -flimit-function-alignment and/or
> >>-falign-functions=64,X (for some carefully-chosen value of X).
> 
> -falign-functions=64,7 should be about right, I guess.

In my last email about kernel size, I used an old gcc version which didn't
support '-flimit-function-alignment'. Also, as the FRAME_POINTER option has
a big effect on kernel size, I updated gcc to 10.3.0 and re-tested
compiling the kernel with and without FRAME_POINTER enabled, in three cases:
1. vanilla v5.14 kernel
2. vanilla v5.14 kernel + '-falign-functions=64'
3. vanilla v5.14 kernel + '-flimit-function-alignment -falign-functions=64:7'

And the sizes are as below ('fp' means CONFIG_FRAME_POINTER=y, and 'nofp'
means it's disabled):

   text		data		bss	    dec		hex	filename
18118898	14976647	6094848	39190393	255ff79	vmlinux-fp
16005288	14976519	6111232	37093039	235feaf	vmlinux-nofp
18118898	14976647	6094848	39190393	255ff79	vmlinux-text-align-fp
18102440	14976519	6111232	39190191	255feaf	vmlinux-text-align-nofp
16021746	14976647	6094848	37093241	235ff79	vmlinux-align-64-7-fp
16005288	14976519	6111232	37093039	235feaf	vmlinux-align-64-7-nofp

Size-wise, '-falign-functions=64,7' gives a good result, but it does
break the vanilla kernel's 16-byte alignment, and there are random
offsets like:

ffffffff81145f20 T tick_get_tick_sched
ffffffff81145f40 T tick_nohz_tick_stopped
ffffffff81145f63 T tick_nohz_tick_stopped_cpu
ffffffff81145f8a T tick_nohz_idle_stop_tick
ffffffff811461f4 T tick_nohz_idle_retain_tick
ffffffff8114621e T tick_nohz_idle_enter
ffffffff8114626f T tick_nohz_irq_exit
ffffffff811462ac T tick_nohz_idle_got_tick
ffffffff811462e1 T tick_nohz_get_next_hrtimer

I cannot run it with 0Day's benchmark service right now, but I'm afraid
there may be some performance changes.

Btw, I'm still interested in the 'selective isolation' method: choose
a few .o files from different kernel modules and add alignment to
one function and one global data item in each .o file, setting up an
isolation buffer so that any alignment change caused by the modules linked
before this .o will _not_ affect the alignment of the .o files after it.

This has minimal size cost: for one .o file, the worst-case waste is
128 bytes, so even if we pick 128 .o files, the total cost is 8KB of text
and 8KB of data space.

And surely we need to test whether this method can really make kernel
performance more stable; one testing method is to pick some reported
"strange" performance change cases, and check whether they are gone with
this method.

Thanks,
Feng




* Re: [RFC PATCH] x86, vmlinux.lds: Add debug option to force all data sections aligned
  2021-09-27  7:04       ` Feng Tang
@ 2021-11-16  5:54         ` Feng Tang
  0 siblings, 0 replies; 7+ messages in thread
From: Feng Tang @ 2021-11-16  5:54 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Josh Poimboeuf, Thomas Gleixner, Ingo Molnar, H Peter Anvin,
	Borislav Petkov, Peter Zijlstra, x86, linux-kernel, Dave Hansen,
	Tony Luck, Linus Torvalds, Andy Lutomirski, longman, arnd, akpm,
	jannh

On Mon, Sep 27, 2021 at 03:04:48PM +0800, Feng Tang wrote:
[...]  
> > >For data-alignment, it has huge impact for the size, and occupies more
> > >cache/TLB, plus it hurts some normal function like dynamic-debug. So
> > >I'm afraid it can only be used as a debug option.
> > >
> > >>On a similar vein I think we should re-explore permanently enabling
> > >>cacheline-sized function alignment i.e. making something like
> > >>CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B the default.  Ingo did some
> > >>research on that a while back:
> > >>
> > >>    https://lkml.kernel.org/r/20150519213820.GA31688@gmail.com
> > >
> > >Thanks for sharing this, from which I learned a lot, and I hope I
> > >knew this thread when we first check strange regressions in 2019 :)
> > >
> > >>At the time, the main reported drawback of -falign-functions=64 was that
> > >>even small functions got aligned.  But now I think that can be mitigated
> > >>with some new options like -flimit-function-alignment and/or
> > >>-falign-functions=64,X (for some carefully-chosen value of X).
> > 
> > -falign-functions=64,7 should be about right, I guess.
[...] 
> I cannot run it with 0Day's benchmark service right now, but I'm afraid
> there may be some performance change.
> 
> Btw, I'm still interested in the 'selective isolation' method, that
> chose a few .o files from different kernel modules, add alignment to
> one function and one global data of the .o file, setting up an
> isolation buffer that any alignment change caused by the module before
> this .o will _not_ affect the alignment of all .o files after it.
> 
> This will have minimal size cost, for one .o file, the worst waste is
> 128 bytes, so even we pick 128 .o files, the total cost is 8KB text
> and 8KB data space.
> 
> And surely we need to test if this method can really make kernel
> performance more stable, one testing method is to pick some reported
> "strange" performance change case, and check if they are gone with
> this method. 

Some update on the experiments with "selective isolation": I tried
three code-alignment-related cases that have been discussed, and
found the method does help to reduce the performance bump: one case is
cut from 7.5% to 0.1%, and another from 3.1% to 1.3%.

The 3 cases are:
1. y2038 code cleanup causing +11.7% improvement to 'mmap' test of
   will-it-scale
   https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/#r
2. a hugetlb fix causing +15.9% improvement to 'page_fault3' test of
   will-it-scale
   https://lore.kernel.org/lkml/20200114085637.GA29297@shao2-debian/#r
3. a one-line mm fix causing a -30.7% regression in the scheduler test of
   stress-ng
   https://lore.kernel.org/lkml/20210427090013.GG32408@xsang-OptiPlex-9020/#r

These cases are old (one or two years), and case 3 can't be
reproduced now. Case 1's current performance delta is +3.1%,
while case 2's is +7.5%, so we experimented on cases 1 and 2.

The experiment we did is: find which files the patch touches, say
a.c, then choose the b.c which follows a.c in the Makefile, which means
b.o will be linked right after a.o (this is a simplification; there
are other factors like special section definitions), and
make one function of b.c aligned on 4096 bytes.

For case 2, the bisected commit c77c0a8ac4c only touches hugetlb.c,
so we made a debug patch for mempolicy.c following it:

  diff --git a/mm/mempolicy.c b/mm/mempolicy.c
  index 067cf7d3daf5..036c93abdf9b 100644
  --- a/mm/mempolicy.c
  +++ b/mm/mempolicy.c
  @@ -804,7 +804,7 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
   }
   
   /* Set the process memory policy */
  -static long do_set_mempolicy(unsigned short mode, unsigned short flags,
  +static long __attribute__((aligned(4096)))  do_set_mempolicy(unsigned short mode, unsigned short flags,
  			     nodemask_t *nodes)
   {
  	struct mempolicy *new, *old;

With it, the performance delta is reduced from 7.5% to 0.1%.

And for case 1, we tried a similar approach (adding 128B alignment to
several files), and saw its performance change reduced from 3.1% to 1.3%.

So generally, this seems to be helpful for making kernel performance
more stable and more controllable. And the cost is not high either: say
we pick 32 files and make one function in each 128B aligned, the space
waste is 8KB in the worst case (128KB for 4096-byte alignment).

Thoughts?

Thanks,
Feng


