* [RFC PATCH] x86, vmlinux.lds: Add debug option to force all data sections aligned

From: Feng Tang
Date: 2021-07-28 7:21 UTC
To: Thomas Gleixner, Ingo Molnar, H Peter Anvin, Borislav Petkov,
    Peter Zijlstra, x86, linux-kernel
Cc: Dave Hansen, Tony Luck, Feng Tang

0day has reported many strange performance changes (regressions or
improvements) in which there was no obvious relation between the culprit
commit and the benchmark at first look, leading people to suspect the
test itself was wrong.

On closer inspection, many of these cases turn out to be caused by changes
to the alignment of kernel text or data: since the whole kernel text/data
is linked together, a change in one domain can affect the alignment of
other domains.

To help quickly identify whether a strange performance change is caused
by _data_ alignment, add a debug option that forces the data sections
from all .o files to be aligned on THREAD_SIZE, so that a change in one
domain won't affect other modules' data alignment.

We have used this option to check some strange kernel changes [1][2][3],
and those performance changes were gone after enabling it, which proved
they were data-alignment related.

Similarly, there is another kernel debug option for checking
text-alignment related performance changes:
CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B, which forces every function's
start address to be 64-byte aligned.

This option depends on CONFIG_DYNAMIC_DEBUG==n, as the '__dyndbg'
subsection of .data has a hard requirement of ALIGN(8), as shown in
'vmlinux.lds':

"
	. = ALIGN(8);
	__start___dyndbg = .;
	KEEP(*(__dyndbg))
	__stop___dyndbg = .;
"

That subsection contains all the pointers to 'struct _ddebug', and
dynamic_debug_init() uses "pointer++" to loop over these pointers, which
would be broken with this option enabled.

[1]. https://lore.kernel.org/lkml/20200205123216.GO12867@shao2-debian/
[2]. https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/
[3]. https://lore.kernel.org/lkml/20201112140625.GA21612@xsang-OptiPlex-9020/

Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 arch/x86/Kconfig.debug        | 13 +++++++++++++
 arch/x86/kernel/vmlinux.lds.S |  7 ++++++-
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 80b57e7..d04c67e 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -228,6 +228,19 @@ config PUNIT_ATOM_DEBUG
 	  The current power state can be read from
 	  /sys/kernel/debug/punit_atom/dev_power_state
 
+config DEBUG_FORCE_DATA_SECTION_ALIGNED
+	bool "Force all data sections to be THREAD_SIZE aligned"
+	depends on EXPERT && !DYNAMIC_DEBUG
+	help
+	  There are cases when a commit from one kernel domain changes
+	  the data sections' alignment of other domains, as they are all
+	  linked together compactly, causing magic performance bumps
+	  (regressions or improvements) that are hard to debug. Enabling
+	  this option helps to verify whether such a bump is caused by
+	  data alignment changes.
+
+	  It is mainly for debug and performance tuning use.
+
 choice
 	prompt "Choose kernel unwinder"
 	default UNWINDER_ORC if X86_64
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index efd9e9e..64256d0 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -156,7 +156,12 @@ SECTIONS
 		X86_ALIGN_RODATA_END
 
 	/* Data */
-	.data : AT(ADDR(.data) - LOAD_OFFSET) {
+	.data : AT(ADDR(.data) - LOAD_OFFSET)
+#ifdef CONFIG_DEBUG_FORCE_DATA_SECTION_ALIGNED
+	/* Use the biggest alignment of the below sections */
+	SUBALIGN(THREAD_SIZE)
+#endif
+	{
 		/* Start of data section */
 		_sdata = .;
-- 
2.7.4
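[Editor's sketch] The conflict between SUBALIGN() and dynamic debug can be
illustrated in user space: dynamic_debug_init() walks the region between
__start___dyndbg and __stop___dyndbg as one contiguous array with
"pointer++", so any linker-inserted padding between per-object pieces of
the section would corrupt the walk. The struct and symbol names below are
simplified stand-ins, not the kernel's real definitions:

```c
#include <stddef.h>

/* Simplified stand-in for struct _ddebug; the real one lives in
 * include/linux/dynamic_debug.h and has more fields. */
struct ddebug_demo {
	const char *modname;
	unsigned int lineno;
};

/* In the kernel, __start___dyndbg/__stop___dyndbg are linker symbols
 * delimiting the densely packed __dyndbg section; here an ordinary
 * array plays that role. */
static struct ddebug_demo dyndbg_table[] = {
	{ "mod_a", 10 },
	{ "mod_b", 20 },
	{ "mod_c", 30 },
};
static struct ddebug_demo *start_dyndbg = &dyndbg_table[0];
static struct ddebug_demo *stop_dyndbg  = &dyndbg_table[3];

/* The walk dynamic_debug_init() relies on: it is only correct while the
 * entries are packed back to back under ALIGN(8).  SUBALIGN(THREAD_SIZE)
 * would insert padding between the per-object input sections, and the
 * pointer++ below would then step into garbage. */
static int count_dyndbg_entries(void)
{
	int n = 0;
	struct ddebug_demo *iter;

	for (iter = start_dyndbg; iter < stop_dyndbg; iter++)
		n++;
	return n;
}
```

This is why the Kconfig option above must depend on !DYNAMIC_DEBUG rather
than trying to special-case the __dyndbg subsection.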
* Re: [RFC PATCH] x86, vmlinux.lds: Add debug option to force all data sections aligned

From: Josh Poimboeuf
Date: 2021-09-22 18:51 UTC
To: Feng Tang
Cc: Thomas Gleixner, Ingo Molnar, H Peter Anvin, Borislav Petkov,
    Peter Zijlstra, x86, linux-kernel, Dave Hansen, Tony Luck,
    Denys Vlasenko, Linus Torvalds, Andy Lutomirski

On Wed, Jul 28, 2021 at 03:21:40PM +0800, Feng Tang wrote:
> 0day has reported many strange performance changes (regression or
> improvement), in which there was no obvious relation between the culprit
> commit and the benchmark at the first look, and it causes people to doubt
> the test itself is wrong.
[...]
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> ---
>  arch/x86/Kconfig.debug        | 13 +++++++++++++
>  arch/x86/kernel/vmlinux.lds.S |  7 ++++++-
>  2 files changed, 19 insertions(+), 1 deletion(-)

Hi Feng,

Thanks for the interesting LPC presentation about alignment-related
performance issues (which mentioned this patch).

  https://linuxplumbersconf.org/event/11/contributions/895/

I wonder if we can look at enabling some kind of data section alignment
unconditionally instead of just making it a debug option.  Have you done
any performance and binary size comparisons?

On a similar vein I think we should re-explore permanently enabling
cacheline-sized function alignment, i.e. making something like
CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B the default.  Ingo did some
research on that a while back:

  https://lkml.kernel.org/r/20150519213820.GA31688@gmail.com

At the time, the main reported drawback of -falign-functions=64 was that
even small functions got aligned.  But now I think that can be mitigated
with some new options like -flimit-function-alignment and/or
-falign-functions=64,X (for some carefully-chosen value of X).

-- 
Josh
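[Editor's sketch] What CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B requests for
every function can be requested for a single function with GCC's aligned
attribute; the function names below are made up purely for illustration:

```c
#include <stdint.h>

/* Request cacheline (64-byte) placement for one function -- the
 * per-function equivalent of compiling with -falign-functions=64.
 * hot_function/cold_function are illustrative names only. */
__attribute__((aligned(64))) void hot_function(void)
{
}

/* An unannotated function gets only the compiler's default alignment. */
void cold_function(void)
{
}

/* Check a code or data address against an alignment boundary. */
static int addr_is_aligned(uintptr_t addr, uintptr_t boundary)
{
	return (addr % boundary) == 0;
}
```

With per-function control like this, one can align only the hot paths and
avoid the blanket size cost Josh mentions for small functions.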
* Re: [RFC PATCH] x86, vmlinux.lds: Add debug option to force all data sections aligned

From: Feng Tang
Date: 2021-09-23 14:57 UTC
To: Josh Poimboeuf
Cc: Thomas Gleixner, Ingo Molnar, H Peter Anvin, Borislav Petkov,
    Peter Zijlstra, x86, linux-kernel, Dave Hansen, Tony Luck,
    Denys Vlasenko, Linus Torvalds, Andy Lutomirski

Hi Josh,

On Wed, Sep 22, 2021 at 11:51:37AM -0700, Josh Poimboeuf wrote:
[...]
> Thanks for the interesting LPC presentation about alignment-related
> performance issues (which mentioned this patch).
>
>   https://linuxplumbersconf.org/event/11/contributions/895/
>
> I wonder if we can look at enabling some kind of data section alignment
> unconditionally instead of just making it a debug option.  Have you done
> any performance and binary size comparisons?

Thanks for reviewing this!

For binary size, I just tested the 5.14 kernel with a default desktop
config from Ubuntu (I didn't use the normal rhel-8.3 config used by
0Day, which is more for servers):

v5.14
------------------------
   text	   data	    bss	    dec	    hex	filename
16010221 14971391 6098944 37080556 235cdec vmlinux

v5.14 + 64B-function-align
--------------------------
   text	   data	    bss	    dec	    hex	filename
18107373 14971391 6098944 39177708 255cdec vmlinux

v5.14 + data-align (THREAD_SIZE, 16KB)
--------------------------------------
   text	   data	    bss	    dec	    hex	filename
16010221 57001791 6008832 79020844 4b5c32c vmlinux

So for the text alignment we see a 13.1% increase of text, and for the
data alignment a 280.8% increase of data.

Performance-wise, I did some tests with the force-32-byte-text-align
option before (around v5.8), for the benchmarks will-it-scale, fsmark,
hackbench, netperf and kbuild:
* no obvious change for will-it-scale/fsmark/kbuild
* both regressions and improvements for different hackbench cases
* both regressions and improvements for netperf, from -20% to +98%

As I didn't expect the text alignment to be turned on by default, I
didn't dive deeper into it at that time.

For data alignment, it has a huge impact on size and occupies more
cache/TLB, plus it hurts normal functionality like dynamic debug.  So
I'm afraid it can only be used as a debug option.

> On a similar vein I think we should re-explore permanently enabling
> cacheline-sized function alignment i.e. making something like
> CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B the default.  Ingo did some
> research on that a while back:
>
>   https://lkml.kernel.org/r/20150519213820.GA31688@gmail.com

Thanks for sharing this, from which I learned a lot; I wish I had known
about this thread when we first checked strange regressions in 2019 :)

> At the time, the main reported drawback of -falign-functions=64 was that
> even small functions got aligned.  But now I think that can be mitigated
> with some new options like -flimit-function-alignment and/or
> -falign-functions=64,X (for some carefully-chosen value of X).

Will study more about these options.

If they have a much smaller size increase and no performance regression,
then maybe they could be turned on by default.

Thanks,
Feng
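[Editor's sketch] The percentage figures quoted above follow directly from
the size(1) columns; a tiny helper makes the arithmetic explicit (the
numbers are hard-coded from the tables in this mail):

```c
/* Percentage growth of a section size relative to a baseline, e.g.
 * pct_growth(16010221, 18107373) gives about 13.1 for the text column
 * above, and the data column grows by a bit over 280%. */
static double pct_growth(double base, double with_option)
{
	return (with_option - base) * 100.0 / base;
}
```

The text delta (2097152 bytes, exactly 2 MiB) also hints that the growth
is dominated by alignment padding rather than new code.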
* Re: [RFC PATCH] x86, vmlinux.lds: Add debug option to force all data sections aligned

From: Josh Poimboeuf
Date: 2021-09-24 1:57 UTC
To: Feng Tang
Cc: Thomas Gleixner, Ingo Molnar, H Peter Anvin, Borislav Petkov,
    Peter Zijlstra, x86, linux-kernel, Dave Hansen, Tony Luck,
    Denys Vlasenko, Linus Torvalds, Andy Lutomirski

On Thu, Sep 23, 2021 at 10:57:20PM +0800, Feng Tang wrote:
> For binary size, I just tested 5.14 kernel with a default desktop
> config from Ubuntu (I didn't use the normal rhel-8.3 config used
> by 0Day, which is more for server):
>
> v5.14
> ------------------------
>    text	   data	    bss	    dec	    hex	filename
> 16010221 14971391 6098944 37080556 235cdec vmlinux
>
> v5.14 + 64B-function-align
> --------------------------
>    text	   data	    bss	    dec	    hex	filename
> 18107373 14971391 6098944 39177708 255cdec vmlinux
>
> v5.14 + data-align(THREAD_SIZE 16KB)
> --------------------------
>    text	   data	    bss	    dec	    hex	filename
> 16010221 57001791 6008832 79020844 4b5c32c vmlinux

That data size increase is indeed excessive.  However I wonder if some
other approach (other than SUBALIGN) could be taken, for example a 4k
alignment for each compilation unit's .data section.  That might require
some linker magic at the built-in.o linking level.

Anyway, I suspect the data alignment issues are less common than function
alignment ones.  It might be fine to leave the data alignment as a debug
feature for now, as this current patch does.

> > On a similar vein I think we should re-explore permanently enabling
> > cacheline-sized function alignment i.e. making something like
> > CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B the default.  Ingo did some
> > research on that a while back:
> >
> >   https://lkml.kernel.org/r/20150519213820.GA31688@gmail.com
>
> Thanks for sharing this, from which I learned a lot, and I hope I
> knew this thread when we first check strange regressions in 2019 :)
>
> > At the time, the main reported drawback of -falign-functions=64 was that
> > even small functions got aligned.  But now I think that can be mitigated
> > with some new options like -flimit-function-alignment and/or
> > -falign-functions=64,X (for some carefully-chosen value of X).
>
> Will study more about these options.
>
> If they have much less size increase and no regression in performance,
> then maybe it could be turned on by default.

Agreed!  I think/hope it would be a net positive change.  I've also been
burned by such issues -- like a random one-line code change causing a
measurable performance regression due to changed i-cache behavior in
unrelated code.  It doesn't only affect 0-day tests, it also affects
real users.

-- 
Josh
* Re: [RFC PATCH] x86, vmlinux.lds: Add debug option to force all data sections aligned

From: Denys Vlasenko
Date: 2021-09-24 8:13 UTC
To: Feng Tang, Josh Poimboeuf
Cc: Thomas Gleixner, Ingo Molnar, H Peter Anvin, Borislav Petkov,
    Peter Zijlstra, x86, linux-kernel, Dave Hansen, Tony Luck,
    Linus Torvalds, Andy Lutomirski

On 9/23/21 4:57 PM, Feng Tang wrote:
> On Wed, Sep 22, 2021 at 11:51:37AM -0700, Josh Poimboeuf wrote:
>> I wonder if we can look at enabling some kind of data section alignment
>> unconditionally instead of just making it a debug option.  Have you done
>> any performance and binary size comparisons?
>
> For binary size, I just tested 5.14 kernel with a default desktop
> config from Ubuntu (I didn't use the normal rhel-8.3 config used
> by 0Day, which is more for server):
[...]
> So for the text-align, we see 13.1% increase for text.  And for
> data-align, there is 280.8% increase for data.

Page-size alignment of all data is WAY too much.  At most, alignment to
cache line size should work to make timings stable.  (In your case with
the "adjacent cache line prefetcher", it may need to be 128 bytes, but
definitely not 4096 bytes.)

> Performance wise, I have done some test with the force-32bytes-text-align
> option before (v5.8 time), for benchmark will-it-scale, fsmark, hackbench,
> netperf and kbuild:
> * no obvious change for will-it-scale/fsmark/kbuild
> * see both regression/improvement for different hackbench case
> * see both regression/improvement for netperf, from -20% to +98%

What usually happens here is that testcases are crafted to measure how
well some workloads scale, and to measure that efficiently, testcases
are intentionally written to cause congestion -- this way, the benefits
of better algorithms are easily seen.

However, this also means that in the congested scenario (e.g. cache
bouncing), small changes in CPU architecture are also easily visible --
including cases where optimizations go awry.

In your presentation, you stumbled upon one such case: the "adjacent
cache line prefetcher" is counter-productive there.  It pulls an
unrelated cache line into the CPU, not knowing that this is in fact
harmful -- other CPUs need that cache line, not this one!

Since that particular case was a change in structure layout, increasing
the alignment of .data sections won't help.

My opinion is that we shouldn't worry about this too much.  Diagnose the
observed slowdowns; if they are "real" (there is a way to improve), fix
that; if they are spurious, just let them be.

Even when some CPU optimizations unintentionally hurt some benchmarks,
on average they are usually a win: CPU makers have hundreds of people
looking at this as their full-time job.  With your example of the
"adjacent cache line prefetcher", CPU people might be looking at ways to
detect when these speculatively pulled-in cache lines are bouncing.

> For data-alignment, it has huge impact for the size, and occupies more
> cache/TLB, plus it hurts some normal function like dynamic-debug.  So
> I'm afraid it can only be used as a debug option.
>
>> At the time, the main reported drawback of -falign-functions=64 was that
>> even small functions got aligned.  But now I think that can be mitigated
>> with some new options like -flimit-function-alignment and/or
>> -falign-functions=64,X (for some carefully-chosen value of X).

-falign-functions=64,7 should be about right, I guess.

http://lkml.iu.edu/hypermail/linux/kernel/1505.2/03292.html
"""
defconfig vmlinux (w/o FRAME_POINTER) has 42141 functions.
6923 of them have 1st insn 5 or more bytes long,
5841 of them have 1st insn 6 or more bytes long,
5095 of them have 1st insn 7 or more bytes long,
 786 of them have 1st insn 8 or more bytes long,
 548 of them have 1st insn 9 or more bytes long,
 375 of them have 1st insn 10 or more bytes long,
  73 of them have 1st insn 11 or more bytes long,
one of them has 1st insn 12 bytes long:
this "heroic" instruction is in local_touch_nmi()
   65 48 c7 05 44 3c 00 7f 00 00 00 00
   movq $0x0,%gs:0x7f003c44(%rip)

Thus ensuring that at least the seven first bytes do not cross a 64-byte
boundary would cover >98% of all functions.
"""
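[Editor's sketch] The ">98%" figure can be checked against the histogram
quoted above: 786 of the 42141 functions have a first instruction of 8 or
more bytes, so the rest (first insn of at most 7 bytes) are exactly the
ones -falign-functions=64,7 would keep from straddling a cacheline.  The
numbers below are hard-coded from the quote:

```c
/* Fraction of functions fully protected when the first seven bytes of
 * every function are kept inside one 64-byte line.  With the quoted
 * defconfig figures (42141 functions total, 786 whose first insn is 8+
 * bytes), this comes out to roughly 98.1%. */
static double coverage_pct(int total_funcs, int first_insn_8_plus)
{
	return 100.0 * (total_funcs - first_insn_8_plus) / total_funcs;
}
```

So the ",7" threshold trades a tiny amount of residual risk (under 2% of
functions) for a much smaller size increase than unconditional 64-byte
alignment.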
* Re: [RFC PATCH] x86, vmlinux.lds: Add debug option to force all data sections aligned

From: Feng Tang
Date: 2021-09-27 7:04 UTC
To: Denys Vlasenko
Cc: Josh Poimboeuf, Thomas Gleixner, Ingo Molnar, H Peter Anvin,
    Borislav Petkov, Peter Zijlstra, x86, linux-kernel, Dave Hansen,
    Tony Luck, Linus Torvalds, Andy Lutomirski

Hi Denys,

On Fri, Sep 24, 2021 at 10:13:42AM +0200, Denys Vlasenko wrote:
[...]
> > So for the text-align, we see 13.1% increase for text.  And for
> > data-align, there is 280.8% increase for data.
>
> Page-size alignment of all data is WAY too much.  At most, alignment
> to cache line size should work to make timings stable.
> (In your case with "adjacent cache line prefetcher",
> it may need to be 128 bytes.  But definitely not 4096 bytes).

This data-alignment patch is intended for debug only.  Also, with this
"SUBALIGN" trick, 4096 is the smallest working value; other values like
64 or 2048 will make the kernel fail to boot.

> What usually happens here is that testcases are crafted to measure
> how well some workloads scale, and to measure that efficiently,
> testcases were intentionally written to cause congestion -
> this way, benefits of better algorithms are easily seen.
[...]
> My opinion is that we shouldn't worry about this too much.
> Diagnose the observed slow downs, if they are "real"
> (there is a way to improve), fix that, else if they are spurious,
> just let them be.

Agreed.  The main topic of the talk was to explain, or root-cause, those
"strange" performance changes.

> Even when some CPU optimizations are unintentionally hurting some
> benchmarks, on the average they are usually a win:
> CPU makers have hundreds of people looking at that as their
> full-time jobs.  With your example of "adjacent cache line prefetcher",
> CPU people might be looking at ways to detect when these
> speculatively pulled-in cache lines are bouncing.

I agree with you on this, and I never implied the HW cache prefetcher is
a bad thing :) -- see "as being helpful generally" in the foil.  Also,
in the live LPC discussion I said "I don't recommend to disable the HW
prefetcher".

> -falign-functions=64,7 should be about right, I guess.

In the last email about kernel size, I used an old gcc version which
didn't support '-flimit-function-alignment'.  Also, as the FRAME_POINTER
option has a big effect on kernel size, I updated gcc to 10.3.0 and
recompiled the kernel with and without FRAME_POINTER enabled, in three
cases:

1. vanilla v5.14 kernel
2. vanilla v5.14 kernel + '-falign-functions=64'
3. vanilla v5.14 kernel + '-flimit-function-alignment -falign-functions=64:7'

And the sizes are as below ('fp' means CONFIG_FRAME_POINTER=y, and
'nofp' means it's disabled):

   text	   data	    bss	    dec	    hex	filename
18118898 14976647 6094848 39190393 255ff79 vmlinux-fp
16005288 14976519 6111232 37093039 235feaf vmlinux-nofp
18118898 14976647 6094848 39190393 255ff79 vmlinux-text-align-fp
18102440 14976519 6111232 39190191 255feaf vmlinux-text-align-nofp
16021746 14976647 6094848 37093241 235ff79 vmlinux-align-64-7-fp
16005288 14976519 6111232 37093039 235feaf vmlinux-align-64-7-nofp

Size-wise, '-falign-functions=64,7' has a good result, but it does break
the vanilla kernel's 16-byte alignment, and there are random offsets
like:

ffffffff81145f20 T tick_get_tick_sched
ffffffff81145f40 T tick_nohz_tick_stopped
ffffffff81145f63 T tick_nohz_tick_stopped_cpu
ffffffff81145f8a T tick_nohz_idle_stop_tick
ffffffff811461f4 T tick_nohz_idle_retain_tick
ffffffff8114621e T tick_nohz_idle_enter
ffffffff8114626f T tick_nohz_irq_exit
ffffffff811462ac T tick_nohz_idle_got_tick
ffffffff811462e1 T tick_nohz_get_next_hrtimer

I cannot run it through 0Day's benchmark service right now, but I'm
afraid there may be some performance change.

Btw, I'm still interested in the 'selective isolation' method: choose a
few .o files from different kernel modules and add alignment to one
function and one global data object of each .o file, setting up an
isolation buffer so that any alignment change caused by the modules
before this .o will _not_ affect the alignment of all .o files after it.

This will have a minimal size cost: for one .o file, the worst-case
waste is 128 bytes, so even if we pick 128 .o files, the total cost is
8KB of text and 8KB of data space.

And surely we need to test whether this method can really make kernel
performance more stable; one testing method is to pick some reported
"strange" performance change cases, and check whether they are gone with
this method.

Thanks,
Feng
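[Editor's sketch] A minimal user-space illustration of the
isolation-buffer idea described above: one function and one global in a
chosen translation unit are pinned to a large alignment boundary, so size
drift in earlier objects stops propagating past that point.  The 128-byte
boundary and all names here are illustrative assumptions, not from an
actual kernel patch:

```c
#include <stdint.h>

/* One global pinned to the isolation boundary: everything linked after
 * this object's .data keeps its alignment even if earlier .data grows
 * or shrinks.  128 bytes matches two 64-byte cache lines, covering the
 * adjacent-cache-line prefetcher discussed in this thread. */
__attribute__((aligned(128))) long data_isolation_anchor;

/* Same idea for .text: a 128-byte aligned function start re-anchors the
 * layout of all functions linked after it. */
__attribute__((aligned(128))) void text_isolation_anchor(void)
{
}

/* Verify that an anchor really landed on the boundary. */
static int anchored(uintptr_t addr)
{
	return (addr % 128) == 0;
}
```

In a kernel build, the equivalent would be applied to a handful of .o
files spread across the link order, each anchor costing at most 127 bytes
of padding.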
* Re: [RFC PATCH] x86, vmlinux.lds: Add debug option to force all data sections aligned

From: Feng Tang
Date: 2021-11-16 5:54 UTC
To: Denys Vlasenko
Cc: Josh Poimboeuf, Thomas Gleixner, Ingo Molnar, H Peter Anvin,
    Borislav Petkov, Peter Zijlstra, x86, linux-kernel, Dave Hansen,
    Tony Luck, Linus Torvalds, Andy Lutomirski, longman, arnd, akpm, jannh

On Mon, Sep 27, 2021 at 03:04:48PM +0800, Feng Tang wrote:
[...]
> Btw, I'm still interested in the 'selective isolation' method, that
> chose a few .o files from different kernel modules, add alignment to
> one function and one global data of the .o file, setting up an
> isolation buffer that any alignment change caused by the module before
> this .o will _not_ affect the alignment of all .o files after it.
>
> This will have minimal size cost, for one .o file, the worst waste is
> 128 bytes, so even we pick 128 .o files, the total cost is 8KB text
> and 8KB data space.
>
> And surely we need to test if this method can really make kernel
> performance more stable, one testing method is to pick some reported
> "strange" performance change case, and check if they are gone with
> this method.

Some update on the experiments about "selective isolation":

I tried three code-alignment related cases that have been discussed, and
found the method does help to reduce the performance bump: one is cut
from 7.5% to 0.1%, and another from 3.1% to 1.3%.

The 3 cases are:

1. A y2038 code cleanup causing a +11.7% improvement in the 'mmap' test
   of will-it-scale
   https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/#r

2. A hugetlb fix causing a +15.9% improvement in the 'page_fault3' test
   of will-it-scale
   https://lore.kernel.org/lkml/20200114085637.GA29297@shao2-debian/#r

3. A one-line mm fix causing a -30.7% regression in the scheduler test
   of stress-ng
   https://lore.kernel.org/lkml/20210427090013.GG32408@xsang-OptiPlex-9020/#r

These cases are old (one or two years), and case 3 can't be reproduced
now.  Case 1's current performance delta is +3.1%, while case 2's is
+7.5%, so we experimented with cases 1 and 2.

The experiment we did was: find which files the patch touches, say a.c,
then choose the b.c that follows a.c in the Makefile, which means b.o
will be linked right after a.o (this is for simplicity; there are other
factors like special section definitions), and make one function of b.c
aligned on 4096 bytes.

For case 2, the bisected commit c77c0a8ac4c only touches hugetlb.c, so
we made a debug patch for mempolicy.c, which follows it:

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 067cf7d3daf5..036c93abdf9b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -804,7 +804,7 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
 }
 
 /* Set the process memory policy */
-static long do_set_mempolicy(unsigned short mode, unsigned short flags,
+static long __attribute__((aligned(4096))) do_set_mempolicy(unsigned short mode, unsigned short flags,
 			     nodemask_t *nodes)
 {
 	struct mempolicy *new, *old;

With it, the performance delta is reduced from 7.5% to 0.1%.

And for case 1, we tried a similar way (adding 128B alignment to several
files), and saw the performance change reduced from 3.1% to 1.3%.

So generally, this seems to be helpful for making kernel performance
stabler and more controllable.  And the cost is not high either: if we
pick 32 files and make one of their functions 128B-aligned, the space
waste is 8KB for the worst case (128KB for 4096-byte alignment).

Thoughts?

Thanks,
Feng