[v9,04/10] x86: refcount: prevent gcc distortions

Message ID 20181003213100.189959-5-namit@vmware.com
State New
Series
  • x86: macrofying inline asm

Commit Message

Nadav Amit Oct. 3, 2018, 9:30 p.m. UTC
GCC considers the number of statements in inlined assembly blocks,
according to new-lines and semicolons, as an indication of the cost of
the block in time and space. This data is distorted by the kernel code,
which puts information in alternative sections. As a result, the
compiler may perform incorrect inlining and branch optimizations.

The solution is to set an assembly macro and call it from the inlined
assembly block. As a result GCC considers the inline assembly block as
a single instruction.
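
For illustration, the shape of the conversion looks roughly like this
(a simplified sketch with made-up names; the real conversion of the
refcount helpers is in the diff at the bottom of this page):

  /*
   * The gas side lives in a shared .S file that is assembled along with
   * every translation unit (see macros.S in this patch):
   *
   *   .macro CHECK_NEGATIVE
   *           js 111f
   *           .pushsection .text.unlikely
   *   111:    ud2
   *           .popsection
   *   .endm
   *
   * The C side then emits a one-line macro invocation, so GCC's
   * newline/semicolon count sees a single "instruction":
   */
  static __always_inline void inc_checked(int *counter)
  {
          asm volatile("incl %0\n\t"
                       "CHECK_NEGATIVE"
                       : "+m" (*counter)
                       : : "cc");
  }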

This patch allows the inlining of functions such as __get_seccomp_filter().
Interestingly, this allows more aggressive inlining while reducing the
kernel size.

   text	   data	    bss	    dec	    hex	filename
18140970 10225412 2957312 31323694 1ddf62e ./vmlinux before
18140140 10225284 2957312 31322736 1ddf270 ./vmlinux after (-958)

Static text symbols:
Before:	40302
After:	40286	(-16)

Functions such as kref_get(), free_user(), fuse_file_get() now get
inlined.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/refcount.h | 74 ++++++++++++++++++++-------------
 arch/x86/kernel/macros.S        |  1 +
 2 files changed, 46 insertions(+), 29 deletions(-)

Comments

Ingo Molnar Oct. 4, 2018, 7:57 a.m. UTC | #1
* Nadav Amit <namit@vmware.com> wrote:

> GCC considers the number of statements in inlined assembly blocks,
> according to new-lines and semicolons, as an indication of the cost of
> the block in time and space. This data is distorted by the kernel code,
> which puts information in alternative sections. As a result, the
> compiler may perform incorrect inlining and branch optimizations.
> 
> The solution is to set an assembly macro and call it from the inlined
> assembly block. As a result GCC considers the inline assembly block as
> a single instruction.
> 
> This patch allows the inlining of functions such as __get_seccomp_filter().
> Interestingly, this allows more aggressive inlining while reducing the
> kernel size.
> 
>    text	   data	    bss	    dec	    hex	filename
> 18140970 10225412 2957312 31323694 1ddf62e ./vmlinux before
> 18140140 10225284 2957312 31322736 1ddf270 ./vmlinux after (-958)
> 
> Static text symbols:
> Before:	40302
> After:	40286	(-16)
> 
> Functions such as kref_get(), free_user(), fuse_file_get() now get
> inlined.

Yeah, so I kind of had your series on the back-burner (I'm sure you noticed!),
mostly because of what I complained about in a previous round of review a couple
of months ago: that the description of the series and the changelog of every
single patch in it is tiptoeing around the *real* problem and never truly
describes it:

   ** This is a GCC bug, plain and simple, and we are uglifying **
   ** and complicating kernel assembly code to work it around.  **

We'd never ever consider such uglification for Clang, not even _close_.

Surely this would have warranted a passing mention? Instead the changelogs are
lovingly calling it a "distortion" as if this was no-one's fault really, and
the patch a "solution".

How about calling it a "GCC inlining bug" and a "workaround with costs" 
which it is in reality, and stop whitewashing the problem?

At the same time I realize that we still need this series because GCC won't
get fixed, so as a consolation I wrote the changelog below that explains
how it really is, no holds barred.

Since the tone of the changelog is a bit ... frosty, I added this disclaimer:

  [ mingo: Wrote new changelog. ]

Let me know if you want me to make it more prominent that you had absolutely
no role in writing that changelog.

I'm also somewhat annoyed at the fact that this series carries a boatload
of reviewed-by's and acked-by's, yet none of those reviewers found it
important to point out the large chasm that is gaping between description
and reality.

Thanks,

    Ingo


=============>
Subject: x86/refcount: Prevent inlining related GCC distortions
From: Nadav Amit <namit@vmware.com>
Date: Wed, 3 Oct 2018 14:30:54 -0700

The inlining pass of GCC doesn't include an assembler, so it's not aware
of basic properties of the generated code, such as its size in bytes,
or that there are such things as discontinuous blocks of code and data
due to the newfangled linker feature called 'sections' ...

Instead GCC uses a lazy and fragile heuristic: it does a linear count of
certain syntactic and whitespace elements in inlined assembly block source
code, such as a count of new-lines and semicolons (!), as a poor substitute
for "code size and complexity".

Unsurprisingly this heuristic falls over and breaks its neck with certain
common types of kernel code that use inline assembly, such as the frequent
practice of putting useful information into alternative sections.

As a result of this fresh, 20+ year old GCC bug, GCC's inlining decisions
are effectively disabled for inlined functions that make use of such asm()
blocks, because GCC thinks those sections of code are "large" - when in
reality they often result in just a very low number of machine
instructions generated.
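
A typical offender looks something like this simplified sketch of the
alternatives pattern (not the real ALTERNATIVE macro): at runtime it is
a single, possibly patched instruction, yet the heuristic prices it as
over half a dozen:

  /* One runtime instruction; ~7 source lines for GCC to count. */
  asm volatile("661:\n\t"
               "nop\n\t"
               "662:\n\t"
               ".pushsection .altinstructions\n\t"
               ".long 661b - .\n\t"   /* patch-site metadata, not code */
               ".popsection"
               ::: "memory");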

This absolute lack of inlining prowess when GCC comes across such asm()
blocks both increases generated kernel code size and causes performance
overhead, which is particularly noticeable on paravirt kernels, which make
frequent use of these inlining facilities in attemt to stay out of the
way when running on baremetal hardware.

Instead of fixing the compiler we use a workaround: we set an assembly macro
and call it from the inlined assembly block. As a result GCC considers the
inline assembly block as a single instruction. (Which it often isn't but I digress.)

This uglifies and bloats the source code:

  2 files changed, 46 insertions(+), 29 deletions(-)

Yay readability and maintainability, it's not like assembly code is hard to read
and maintain ...

This patch allows GCC to inline simple functions such as __get_seccomp_filter().

To no-one's surprise the result is that GCC performs more aggressive (read: correct)
inlining decisions in these scenarios, which reduces the kernel size and presumably
also speeds it up:

      text     data     bss      dec     hex  filename
  18140970 10225412 2957312 31323694 1ddf62e  ./vmlinux before
  18140140 10225284 2957312 31322736 1ddf270  ./vmlinux after (-958)

Change in size of static text symbols:

   Before: 40302
    After: 40286 (-16)

Functions such as kref_get(), free_user(), fuse_file_get() now get inlined. Hurray!

We also hope that GCC will eventually get fixed, but we are not holding
our breath for that. Yet we are optimistic, it might still happen, any decade now.

[ mingo: Wrote new changelog. ]

Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181003213100.189959-5-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
Ingo Molnar Oct. 4, 2018, 8:33 a.m. UTC | #2
* Ingo Molnar <mingo@kernel.org> wrote:

> I'm also somewhat annoyed at the fact that this series carries a boatload
> of reviewed-by's and acked-by's, yet none of those reviewers found it
> important to point out the large chasm that is gaping between description
> and reality.

Another problem I just realized is that we now include arch/x86/kernel/macros.S in every 
translation pass when building the kernel, right?

But arch/x86/kernel/macros.S expands to a pretty large hierarchy of header files:

  $ make arch/x86/kernel/macros.s

  $ cat $(grep include arch/x86/kernel/macros.s | cut -d\" -f2 | sort | uniq) | wc -l
  4128

That's 4,100 extra lines of code to be preprocessed for every translation unit, of
which there are tens of thousands. More if other pieces of code get macrofied in
this fashion in the future.

If we assume that a typical distribution kernel build has ~20,000 translation units
then this change adds 82,560,000 more lines to be preprocessed, just to work around
a stupid GCC bug?

I'm totally unhappy about that. Can we do this without adding macros.S?

It's also a pretty stupidly central file anyway that moves source code away
from where it's used.

Thanks,

	Ingo
H. Peter Anvin Oct. 4, 2018, 8:40 a.m. UTC | #3
On October 4, 2018 1:33:33 AM PDT, Ingo Molnar <mingo@kernel.org> wrote:
>
>* Ingo Molnar <mingo@kernel.org> wrote:
>
>> I'm also somewhat annoyed at the fact that this series carries a
>boatload
>> of reviewed-by's and acked-by's, yet none of those reviewers found it
>> important to point out the large chasm that is gaping between
>description
>> and reality.
>
>Another problem I just realized is that we now include
>arch/x86/kernel/macros.S in every 
>translation pass when building the kernel, right?
>
>But arch/x86/kernel/macros.S expands to a pretty large hierarchy of
>header files:
>
>  $ make arch/x86/kernel/macros.s
>
>$ cat $(grep include arch/x86/kernel/macros.s | cut -d\" -f2 | sort |
>uniq) | wc -l
>  4128
>
>That's 4,100 extra lines of code to be preprocessed for every
>translation unit, of
>which there are tens of thousands. More if other pieces of code get
>macrofied in
>this fashion in the future.
>
>If we assume that a typical distribution kernel build has ~20,000
>translation units
>then this change adds 82,560,000 more lines to be preprocessed, just to
>work around
>a stupid GCC bug?
>
>I'm totally unhappy about that. Can we do this without adding macros.S?
>
>It's also a pretty stupidly central file anyway that moves source code
>away
>from where it's used.
>
>Thanks,
>
>	Ingo

It's not just for working around a stupid GCC bug, but it also has a huge potential for cleaning up the inline asm in general.

I would like to know if there is an actual number for the build overhead (an actual benchmark); I have asked for that once already.
Nadav Amit Oct. 4, 2018, 8:40 a.m. UTC | #4
at 12:57 AM, Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Nadav Amit <namit@vmware.com> wrote:
> 
>> GCC considers the number of statements in inlined assembly blocks,
>> according to new-lines and semicolons, as an indication of the cost of
>> the block in time and space. This data is distorted by the kernel code,
>> which puts information in alternative sections. As a result, the
>> compiler may perform incorrect inlining and branch optimizations.
>> 
>> The solution is to set an assembly macro and call it from the inlined
>> assembly block. As a result GCC considers the inline assembly block as
>> a single instruction.
>> 
>> This patch allows the inlining of functions such as __get_seccomp_filter().
>> Interestingly, this allows more aggressive inlining while reducing the
>> kernel size.
>> 
>>   text	   data	    bss	    dec	    hex	filename
>> 18140970 10225412 2957312 31323694 1ddf62e ./vmlinux before
>> 18140140 10225284 2957312 31322736 1ddf270 ./vmlinux after (-958)
>> 
>> Static text symbols:
>> Before:	40302
>> After:	40286	(-16)
>> 
>> Functions such as kref_get(), free_user(), fuse_file_get() now get
>> inlined.
> 
> Yeah, so I kind of had your series on the back-burner (I'm sure you noticed!),
> mostly because of what I complained about in a previous round of review a couple
> of months ago: that the description of the series and the changelog of every
> single patch in it is tiptoeing around the *real* problem and never truly
> describes it:
> 
>   ** This is a GCC bug, plain and simple, and we are uglifying **
>   ** and complicating kernel assembly code to work it around.  **
> 
> We'd never ever consider such uglification for Clang, not even _close_.
> Surely this would have warranted a passing mention? Instead the changelogs are
> lovingly calling it a "distortion" as if this was no-one's fault really, and
> the patch a "solution".
> 
> How about calling it a "GCC inlining bug" and a "workaround with costs" 
> which it is in reality, and stop whitewashing the problem?
> 
> At the same time I realize that we still need this series because GCC won't
> get fixed, so as a consolation I wrote the changelog below that explains
> how it really is, no holds barred.
> 
> Since the tone of the changelog is a bit ... frosty, I added this disclaimer:
> 
>  [ mingo: Wrote new changelog. ]
> 
> Let me know if you want me to make it more prominent that you had absolutely
> no role in writing that changelog.
> 
> I'm also somewhat annoyed at the fact that this series carries a boatload
> of reviewed-by's and acked-by's, yet none of those reviewers found it
> important to point out the large chasm that is gaping between description
> and reality.

So, I’m sorry for missing your comment about misrepresenting the problem.

Feel free to do whatever you want with the commit message (just fix the typo
in “attemt"). As long as you don’t NAK the patches or send me to redo them,
it’s fine. I just want to clarify a few things for you to consider.

First, you are right that clang does not have this issue (I checked), but
the behavior of gcc is clearly documented - once you know what to look for.

Second, I think the end result is not as ugly as you make it sound (and
maybe not ugly at all). Using this patch-set, you can write big blocks of
inlined assembly code without having those disgusting C macros. You can also
share the same code between inline asm and asm files.

You can have a look, for example, at ALTERNATIVE, which is defined both as an
assembly macro and as a C macro. Is the C macro readable? Is it easy to maintain
two different versions? I do have a patch that merges the two implementations
(which I still didn’t send, since I’m waiting for the infra to be applied
first), and I think it makes much more sense.

Finally, note that it’s not as if the binary always becomes smaller.
Overall, with the full patch-set it is slightly bigger. But still, that’s
how it was supposed to be if gcc wasn’t doing things badly.

Thanks again,
Nadav
Ingo Molnar Oct. 4, 2018, 8:56 a.m. UTC | #5
* hpa@zytor.com <hpa@zytor.com> wrote:

> It's not just for working around a stupid GCC bug, but it also has a huge potential for 
> cleaning up the inline asm in general.

Sorry but that's just plain false. For example this patch:

   x86: cpufeature: use macros instead of inline assembly

... adds an extra macro indirection layer called STATIC_CPU_HAS, just to macrofy a single asm() 
statement that was perfectly readable and on-topic already...

There are some other cases where macrofying is also a cleanup due to sharing and naming a 
common pattern of code, but this is by no means an absolute quality of this approach.

Thanks,

	Ingo
Nadav Amit Oct. 4, 2018, 8:56 a.m. UTC | #6
at 1:40 AM, hpa@zytor.com wrote:

> On October 4, 2018 1:33:33 AM PDT, Ingo Molnar <mingo@kernel.org> wrote:
>> * Ingo Molnar <mingo@kernel.org> wrote:
>> 
>>> I'm also somewhat annoyed at the fact that this series carries a
>> boatload
>>> of reviewed-by's and acked-by's, yet none of those reviewers found it
>>> important to point out the large chasm that is gaping between
>> description
>>> and reality.
>> 
>> Another problem I just realized is that we now include
>> arch/x86/kernel/macros.S in every 
>> translation pass when building the kernel, right?
>> 
>> But arch/x86/kernel/macros.S expands to a pretty large hierarchy of
>> header files:
>> 
>> $ make arch/x86/kernel/macros.s
>> 
>> $ cat $(grep include arch/x86/kernel/macros.s | cut -d\" -f2 | sort |
>> uniq) | wc -l
>> 4128
>> 
>> That's 4,100 extra lines of code to be preprocessed for every
>> translation unit, of
>> which there are tens of thousands. More if other pieces of code get
>> macrofied in
>> this fashion in the future.
>> 
>> If we assume that a typical distribution kernel build has ~20,000
>> translation units
>> then this change adds 82,560,000 more lines to be preprocessed, just to
>> work around
>> a stupid GCC bug?
>> 
>> I'm totally unhappy about that. Can we do this without adding macros.S?
>> 
>> It's also a pretty stupidly central file anyway that moves source code
>> away
>> from where it's used.
>> 
>> Thanks,
>> 
>> 	Ingo
> 
> It's not just for working around a stupid GCC bug, but it also has a huge
> potential for cleaning up the inline asm in general.
> 
> I would like to know if there is an actual number for the build overhead
> (an actual benchmark); I have asked for that once already.

I can run some tests. (@hpa: I thought you asked about the -pipe overhead;
perhaps I misunderstood).

I guess you refer to the preprocessing of the assembler. Note that the C 
preprocessing of macros.S obviously happens only once. That’s the reason
I assumed it’s not that expensive.

Anyhow, I remember that we discussed at some point doing something like
‘asm(“.include XXX.s”)’ and somebody said it is not good, but I don’t
remember why and don’t see any reason it is so. Unless I am missing
something, I think it is possible to take each individual header and
preprocess the assembly part of it into a separate .s file. Then we can put in
the C part of the header ‘asm(".include XXX.s”)’.

What do you think?
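
Something like this sketch is what I have in mind (the file name is
hypothetical):

  /* C half of the header: pull the preprocessed assembly half into
   * every translation unit via a file-scope asm statement. */
  asm(".include \"asm/refcount.macros.s\"");

(hpa explains below why file-scope asm() makes this fragile: there is no
guarantee where such a statement lands in the generated assembly.)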
Ingo Molnar Oct. 4, 2018, 9:01 a.m. UTC | #7
* Nadav Amit <namit@vmware.com> wrote:

> Finally, note that it’s not as if the binary always becomes smaller.
> Overall, with the full patch-set it is slightly bigger. But still, that’s
> how it was supposed to be if gcc wasn’t doing things badly.

So what I cited was the changelog for the refcount patch, which according to your
measurements reduced kernel image size.

For other patches where size grew I left that text intact.

And yes, in this particular case a slight increase in kernel size might actually be
beneficial, as we get a lot more straight-line execution in exchange. So this is
not a show-stopper property as long as the bloat isn't large.

Thanks,

	Ingo
H. Peter Anvin Oct. 4, 2018, 9:02 a.m. UTC | #8
On October 4, 2018 1:56:37 AM PDT, Nadav Amit <namit@vmware.com> wrote:
>at 1:40 AM, hpa@zytor.com wrote:
>
>> On October 4, 2018 1:33:33 AM PDT, Ingo Molnar <mingo@kernel.org>
>wrote:
>>> * Ingo Molnar <mingo@kernel.org> wrote:
>>> 
>>>> I'm also somewhat annoyed at the fact that this series carries a
>>> boatload
>>>> of reviewed-by's and acked-by's, yet none of those reviewers found
>it
>>>> important to point out the large chasm that is gaping between
>>> description
>>>> and reality.
>>> 
>>> Another problem I just realized is that we now include
>>> arch/x86/kernel/macros.S in every 
>>> translation pass when building the kernel, right?
>>> 
>>> But arch/x86/kernel/macros.S expands to a pretty large hierarchy of
>>> header files:
>>> 
>>> $ make arch/x86/kernel/macros.s
>>> 
>>> $ cat $(grep include arch/x86/kernel/macros.s | cut -d\" -f2 | sort
>|
>>> uniq) | wc -l
>>> 4128
>>> 
>>> That's 4,100 extra lines of code to be preprocessed for every
>>> translation unit, of
>>> which there are tens of thousands. More if other pieces of code get
>>> macrofied in
>>> this fashion in the future.
>>> 
>>> If we assume that a typical distribution kernel build has ~20,000
>>> translation units
>>> then this change adds 82,560,000 more lines to be preprocessed, just
>to
>>> work around
>>> a stupid GCC bug?
>>> 
>>> I'm totally unhappy about that. Can we do this without adding
>macros.S?
>>> 
>>> It's also a pretty stupidly central file anyway that moves source
>code
>>> away
>>> from where it's used.
>>> 
>>> Thanks,
>>> 
>>> 	Ingo
>> 
>> It's not just for working around a stupid GCC bug, but it also has a
>huge
>> potential for cleaning up the inline asm in general.
>> 
>> I would like to know if there is an actual number for the build
>overhead
>> (an actual benchmark); I have asked for that once already.
>
>I can run some tests. (@hpa: I thought you asked about the -pipe
>overhead;
>perhaps I misunderstood).
>
>I guess you refer to the preprocessing of the assembler. Note that the
>C 
>preprocessing of macros.S obviously happens only once. That’s the
>reason
>I assumed it’s not that expensive.
>
>Anyhow, I remember that we discussed at some point doing something like
>‘asm(“.include XXX.s”)’ and somebody said it is not good, but I don’t
>remember why and don’t see any reason it is so. Unless I am missing
>something, I think it is possible to take each individual header and
>preprocess the assembly part of it into a separate .s file. Then we can
>put in
>the C part of the header ‘asm(".include XXX.s”)’.
>
>What do you think?

Ingo: I wasn't talking necessarily about the specifics of each bit, but rather the general concept about being able to use macros in inlines... I can send you something I have been working on in the background, but have been holding off on because of this, in the morning my time.

But that's no excuse for not doing it right.

Global asm() statements are problematic because there is no guarantee where in the assembly source they will end up. You'd almost need to intercept the assembly output, move all the includes to the top, and then process...
Ingo Molnar Oct. 4, 2018, 9:12 a.m. UTC | #9
* Nadav Amit <namit@vmware.com> wrote:

> I can run some tests. (@hpa: I thought you asked about the -pipe overhead;
> perhaps I misunderstood).

Well, tests are unlikely to show the overhead of extra lines of this
magnitude, unless done very carefully, yet the added bloat exists and is not even
mentioned by the changelog, it just says:

  Subject: [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm

  Using macros for inline assembly improves both readability and
  compilation decisions that are distorted by big assembly blocks that use
  alternative sections. Compile macros.S and use it to assemble all C
  files. Currently, only x86 will use it.

> I guess you regard to the preprocessing of the assembler. Note that the C 
> preprocessing of macros.S obviously happens only once. That’s the reason
> I assumed it’s not that expensive.

True - so first we build macros.s, and that gets included in every C file build, right?

macros.s is smaller: 275 lines only in the distro test build I tried, which looks
a lot better than my first 4,200 lines guesstimate.

> Anyhow, I remember that we discussed at some point doing something like
> ‘asm(“.include XXX.s”)’ and somebody said it is not good, but I don’t
> remember why and don’t see any reason it is so. Unless I am missing
> something, I think it is possible to take each individual header and
> preprocess the assembly part of into a separate .s file. Then we can put in
> the C part of the header ‘asm(".include XXX.s”)’.
> 
> What do you think?

Hm, this looks quite complex - macros.s is better I think. Also, 275 straight assembly lines is 
a lot better than 4,200.

Another, separate question I wanted to ask: how do we ensure that the kernel stays fixed?
I.e. is there some tooling we can use to actually measure whether there's bad inlining decisions 
done, to detect all these bad patterns that cause bad GCC code generation?

Thanks,

	Ingo
Ingo Molnar Oct. 4, 2018, 9:16 a.m. UTC | #10
* hpa@zytor.com <hpa@zytor.com> wrote:

> Ingo: I wasn't talking necessarily about the specifics of each bit, but rather the general 
> concept about being able to use macros in inlines...

Ok, agreed about that part - and some of the patches did improve readability.

Also, the 275 lines macros.s is a lot nicer than the 4,200 lines macros.S.

Also, I'm not against using workarounds when the benefits are larger than the costs, but I am 
against *hiding* the fact that these are workarounds and that for some of them there are costs.

> I can send you something I have been working on in the background, but have been holding off 
> on because of this, in the morning my time.

BTW., I have applied most of the series to tip:x86/kbuild already, and will push them out later 
today after some testing. I didn't apply the final 3 patches as they have dependencies, but 
applied the basics and fixed up the changelogs.

So you can rely on this.

Thanks,

	Ingo
H. Peter Anvin Oct. 4, 2018, 9:17 a.m. UTC | #11
On October 4, 2018 2:12:22 AM PDT, Ingo Molnar <mingo@kernel.org> wrote:
>
>* Nadav Amit <namit@vmware.com> wrote:
>
>> I can run some tests. (@hpa: I thought you asked about the -pipe
>overhead;
>> perhaps I misunderstood).
>
>Well, tests are unlikely to show the overhead of extra lines of this
>magnitude, unless done very carefully, yet the added bloat exists and
>is not even
>mentioned by the changelog, it just says:
>
>Subject: [PATCH v9 02/10] Makefile: Prepare for using macros for inline
>asm
>
>  Using macros for inline assembly improves both readability and
>compilation decisions that are distorted by big assembly blocks that
>use
>  alternative sections. Compile macros.S and use it to assemble all C
>  files. Currently, only x86 will use it.
>
>> I guess you refer to the preprocessing of the assembler. Note that
>the C 
>> preprocessing of macros.S obviously happens only once. That’s the
>reason
>> I assumed it’s not that expensive.
>
>True - so first we build macros.s, and that gets included in every C
>file build, right?
>
>macros.s is smaller: 275 lines only in the distro test build I tried,
>which looks
>a lot better than my first 4,200 lines guesstimate.
>
>> Anyhow, I remember that we discussed at some point doing something
>like
>> ‘asm(“.include XXX.s”)’ and somebody said it is not good, but I don’t
>> remember why and don’t see any reason it is so. Unless I am missing
>> something, I think it is possible to take each individual header and
>> preprocess the assembly part of it into a separate .s file. Then we can
>put in
>> the C part of the header ‘asm(".include XXX.s”)’.
>> 
>> What do you think?
>
>Hm, this looks quite complex - macros.s is better I think. Also, 275
>straight assembly lines is 
>a lot better than 4,200.
>
>Another, separate question I wanted to ask: how do we ensure that the
>kernel stays fixed?
>I.e. is there some tooling we can use to actually measure whether
>there's bad inlining decisions 
>done, to detect all these bad patterns that cause bad GCC code
>generation?
>
>Thanks,
>
>	Ingo

The assembly output from GCC is quite voluminous; I doubt tacking a few hundred lines on will matter one iota.
Nadav Amit Oct. 4, 2018, 9:30 a.m. UTC | #12
at 2:12 AM, Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Nadav Amit <namit@vmware.com> wrote:
> 
>> I can run some tests. (@hpa: I thought you asked about the -pipe overhead;
>> perhaps I misunderstood).
> 
> Well, tests are unlikely to show the overhead of extra lines of this
> magnitude, unless done very carefully, yet the added bloat exists and is not even
> mentioned by the changelog, it just says:
> 
>  Subject: [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm
> 
>  Using macros for inline assembly improves both readability and
>  compilation decisions that are distorted by big assembly blocks that use
>  alternative sections. Compile macros.S and use it to assemble all C
>  files. Currently, only x86 will use it.
> 
>> I guess you refer to the preprocessing of the assembler. Note that the C 
>> preprocessing of macros.S obviously happens only once. That’s the reason
>> I assumed it’s not that expensive.
> 
> True - so first we build macros.s, and that gets included in every C file build, right?
Right.

> 
> macros.s is smaller: 275 lines only in the distro test build I tried, which looks
> a lot better than my first 4,200 lines guesstimate.
> 
>> Anyhow, I remember that we discussed at some point doing something like
>> ‘asm(“.include XXX.s”)’ and somebody said it is not good, but I don’t
>> remember why and don’t see any reason it is so. Unless I am missing
>> something, I think it is possible to take each individual header and
>> preprocess the assembly part of it into a separate .s file. Then we can put in
>> the C part of the header ‘asm(".include XXX.s”)’.
>> 
>> What do you think?
> 
> Hm, this looks quite complex - macros.s is better I think. Also, 275 straight assembly lines is 
> a lot better than 4,200.

I’m really not into it, and hpa reminded me why it wouldn’t work. For some
reason I thought the order of macros doesn’t matter in asm (I probably
should go to sleep).

> Another, separate question I wanted to ask: how do we ensure that the kernel stays fixed?
> I.e. is there some tooling we can use to actually measure whether there's bad inlining decisions 
> done, to detect all these bad patterns that cause bad GCC code generation?

Good question. First, I’ll indicate that this patch-set does not handle all
the issues. There is still the issue of conditional use of
__builtin_constant_p().

One indication of bad inlining decisions is that inlined functions have
multiple (non-inlined) instances in the binary and are short. I don’t
have an automatic solution, but you can try, for example to run:

nm --print-size ./vmlinux | grep ' t ' | cut -d' ' -f2- | sort | uniq -c | \
	grep -v '^      1' | sort -n -r | head -n 5

There are however many false positives. After these patches, for example, I
get:

     11 000000000000012f t jhash
      7 0000000000000017 t dst_output
      6 0000000000000011 t kzalloc
      5 000000000000002f t acpi_os_allocate_zeroed
      5 0000000000000029 t acpi_os_allocate


jhash() should not have been inlined in my mind, and should have a
non-inlined implementation. dst_output() is used as a function pointer.
kzalloc() and the next two suffer from the __builtin_constant_p() problem I
described in the past.

Regards,
Nadav
Ingo Molnar Oct. 4, 2018, 9:45 a.m. UTC | #13
* Nadav Amit <namit@vmware.com> wrote:

> > Another, separate question I wanted to ask: how do we ensure that the kernel stays fixed?
> > I.e. is there some tooling we can use to actually measure whether there's bad inlining decisions 
> > done, to detect all these bad patterns that cause bad GCC code generation?
> 
> Good question. First, I’ll indicate that this patch-set does not handle all
> the issues. There is still the issue of conditional use of
> __builtin_constant_p().
> 
> One indication of bad inlining decisions is that inlined functions have
> multiple (non-inlined) instances in the binary and are short. I don’t
> have an automatic solution, but you can try, for example to run:
> 
> nm --print-size ./vmlinux | grep ' t ' | cut -d' ' -f2- | sort | uniq -c | \
> 	grep -v '^      1' | sort -n -r | head -n 5
> 
> There are however many false positives. After these patches, for example, I
> get:
> 
>      11 000000000000012f t jhash
>       7 0000000000000017 t dst_output
>       6 0000000000000011 t kzalloc
>       5 000000000000002f t acpi_os_allocate_zeroed
>       5 0000000000000029 t acpi_os_allocate
> 
> 
> jhash() should not have been inlined in my mind, and should have a
> non-inlined implementation. dst_output() is used as a function pointer.
> kzalloc() and the next two suffer from the __builtin_constant_p() problem I
> described in the past.

Ok, that's useful info.

The histogram suggests that with all your patches applied the kernel is now in a pretty good 
state in terms of inlining decisions, right?

Are you using defconfig or a reasonable distro-config for your tests?

Thanks,

	Ingo
Nadav Amit Oct. 4, 2018, 10:23 a.m. UTC | #14
at 2:45 AM, Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Nadav Amit <namit@vmware.com> wrote:
> 
>>> Another, separate question I wanted to ask: how do we ensure that the kernel stays fixed?
>>> I.e. is there some tooling we can use to actually measure whether there's bad inlining decisions 
>>> done, to detect all these bad patterns that cause bad GCC code generation?
>> 
>> Good question. First, I’ll indicate that this patch-set does not handle all
>> the issues. There is still the issue of conditional use of
>> __builtin_constant_p().
>> 
>> One indication of bad inlining decisions is that inlined functions have
>> multiple (non-inlined) instances in the binary and are short. I don’t
>> have an automatic solution, but you can try, for example to run:
>> 
>> nm --print-size ./vmlinux | grep ' t ' | cut -d' ' -f2- | sort | uniq -c | \
>> 	grep -v '^      1' | sort -n -r | head -n 5
>> 
>> There are however many false positives. After these patches, for example, I
>> get:
>> 
>>     11 000000000000012f t jhash
>>      7 0000000000000017 t dst_output
>>      6 0000000000000011 t kzalloc
>>      5 000000000000002f t acpi_os_allocate_zeroed
>>      5 0000000000000029 t acpi_os_allocate
>> 
>> 
>> jhash() should not have been inlined in my mind, and should have a
>> non-inlined implementation. dst_output() is used as a function pointer.
>> kzalloc() and the next two suffer from the __builtin_constant_p() problem I
>> described in the past.
> 
> Ok, that's useful info.
> 
> The histogram suggests that with all your patches applied the kernel is now in a pretty good 
> state in terms of inlining decisions, right?

It was just an example that I ran on the kernel I built right now (with a
custom config). Please don’t regard these results as anything indicative.

> Are you using defconfig or a reasonable distro-config for your tests?

I think it is best to take the kernel and run localyesconfig for testing.

Taking Ubuntu 18.04 and doing the same gives the following results:

     12 000000000000012f t jhash
      7 0000000000000017 t dst_output
      7 0000000000000014 t init_once
      5 00000000000000d8 t jhash2
      5 000000000000004e t put_page
      5 000000000000002f t acpi_os_allocate_zeroed
      5 0000000000000029 t acpi_os_allocate
      5 0000000000000028 t umask_show
      5 0000000000000011 t kzalloc
      4 0000000000000053 t trace_xhci_dbg_quirks

Not awful, but not great.

It is a bit hard to fix the __builtin_constant_p() problem without having
some negative side-effects.

Reminder: __builtin_constant_p() is evaluated after the inlining decisions
are made. You can use __builtin_choose_expr() instead of “if”s and
ternary operators when evaluating __builtin_constant_p() to solve the
problem. However, this sometimes causes the compiler not to know that a
value is constant, because __builtin_choose_expr() evaluation happens too
early. This __builtin_choose_expr() problem is the reason put_page() and
kzalloc() are not being inlined. Clang, again, does not suffer from this
problem.
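
A minimal sketch of the trade-off (with hypothetical helpers, not actual
kernel code):

  void *alloc_const_sized(unsigned long n);   /* hypothetical helpers */
  void *alloc_var_sized(unsigned long n);

  /*
   * With a plain "if (__builtin_constant_p(n))" the builtin is only
   * evaluated after inlining decisions, so both branches inflate the
   * caller's inlining cost. With __builtin_choose_expr() the dead branch
   * is discarded early - but the builtin is also evaluated early, and it
   * conservatively yields 0 for values the compiler cannot prove
   * constant yet, so the constant-size path is silently lost.
   */
  #define alloc_bytes(n)                                  \
          __builtin_choose_expr(__builtin_constant_p(n),  \
                                alloc_const_sized(n),     \
                                alloc_var_sized(n))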

Anyhow, it may be a good practice to try to get rid of the rest. For
example, dst_discard() has four instances because it is always used as a
function pointer, so it should not have been marked inline.

Regards,
Nadav
H. Peter Anvin Oct. 4, 2018, 7:33 p.m. UTC | #15
On 10/04/18 02:16, Ingo Molnar wrote:
> 
> * hpa@zytor.com <hpa@zytor.com> wrote:
> 
>> Ingo: I wasn't talking necessarily about the specifics of each bit, but rather the general 
>> concept about being able to use macros in inlines...
> 
> Ok, agreed about that part - and some of the patches did improve readability.
> 
> Also, the 275 lines macros.s is a lot nicer than the 4,200 lines macros.S.
> 
> Also, I'm not against using workarounds when the benefits are larger than the costs, but I am 
> against *hiding* the fact that these are workarounds and that for some of them there are costs.
> 

Agreed, of course.

>> I can send you something I have been working on in the background, but have been holding off 
>> on because of this, in the morning my time.
> 
> BTW., I have applied most of the series to tip:x86/kbuild already, and will push them out later 
> today after some testing. I didn't apply the final 3 patches as they have dependencies, but 
> applied the basics and fixed up the changelogs.
> 
> So you can rely on this.
> 

Wonderful.

Here is the horrible code I mentioned yesterday.  This is about
implementing the immediate-patching framework that Linus and others have
discussed (it helps both performance and kernel hardening):

Warning: this stuff can cause serious damage to your eyes, and this is
just a small chunk of the whole mess; and relying on gas macros, as
brain damaged as they are, really is much, much cleaner than not:

	http://www.zytor.com/~hpa/foo.S

	-hpa
Nadav Amit Oct. 4, 2018, 8:05 p.m. UTC | #16
at 12:33 PM, H. Peter Anvin <hpa@zytor.com> wrote:

> On 10/04/18 02:16, Ingo Molnar wrote:
>> * hpa@zytor.com <hpa@zytor.com> wrote:
>> 
>>> Ingo: I wasn't talking necessarily about the specifics of each bit, but rather the general 
>>> concept about being able to use macros in inlines...
>> 
>> Ok, agreed about that part - and some of the patches did improve readability.
>> 
>> Also, the 275 lines macros.s is a lot nicer than the 4,200 lines macros.S.
>> 
>> Also, I'm not against using workarounds when the benefits are larger than the costs, but I am 
>> against *hiding* the fact that these are workarounds and that for some of them there are costs.
> 
> Agreed, of course.
> 
>>> I can send you something I have been working on in the background, but have been holding off 
>>> on because of this, in the morning my time.
>> 
>> BTW., I have applied most of the series to tip:x86/kbuild already, and will push them out later 
>> today after some testing. I didn't apply the final 3 patches as they have dependencies, but 
>> applied the basics and fixed up the changelogs.
>> 
>> So you can rely on this.
> 
> Wonderful.
> 
> Here is the horrible code I mentioned yesterday.  This is about
> implementing the immediate-patching framework that Linus and others have
> discussed (it helps both performance and kernel hardening):
> 
> Warning: this stuff can cause serious damage to your eyes, and this is
> just a small chunk of the whole mess; and relying on gas macros, as
> brain damaged as they are, really is much, much cleaner than not:
> 
> 	http://www.zytor.com/~hpa/foo.S

Funny. Immediate-patching is what I was playing with when I encountered the
gcc issue. Performance got worse instead of improving (or at least staying
the same), because inlining got crazy.

Anyhow, wait for my soon-to-be-sent RFC in which I define a macro called
“call” (to reduce the retpoline overhead) before you talk about damage to
the eyes.
H. Peter Anvin Oct. 4, 2018, 8:08 p.m. UTC | #17
On 10/04/18 13:05, Nadav Amit wrote:
> 
> Funny. Immediate-patching is what I was playing with when I encountered the
> gcc issue. Performance got worse instead of improving (or at least staying
> the same), because inlining got crazy.
> 
> Anyhow, wait for my soon-to-be-sent RFC in which I define a macro called
> “call” (to reduce the retpoline overhead) before you talk about damage to
> the eyes.
> 

Doing it well is *hard*, as I discovered.  The actual patch is much,
much larger.

	-hpa
Andy Lutomirski Oct. 4, 2018, 8:29 p.m. UTC | #18
On Thu, Oct 4, 2018 at 12:33 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On 10/04/18 02:16, Ingo Molnar wrote:
> >
> > * hpa@zytor.com <hpa@zytor.com> wrote:
> >
> >> Ingo: I wasn't talking necessarily about the specifics of each bit, but rather the general
> >> concept about being able to use macros in inlines...
> >
> > Ok, agreed about that part - and some of the patches did improve readability.
> >
> > Also, the 275 lines macros.s is a lot nicer than the 4,200 lines macros.S.
> >
> > Also, I'm not against using workarounds when the benefits are larger than the costs, but I am
> > against *hiding* the fact that these are workarounds and that for some of them there are costs.
> >
>
> Agreed, of course.
>
> >> I can send you something I have been working on in the background, but have been holding off
> >> on because of this, in the morning my time.
> >
> > BTW., I have applied most of the series to tip:x86/kbuild already, and will push them out later
> > today after some testing. I didn't apply the final 3 patches as they have dependencies, but
> > applied the basics and fixed up the changelogs.
> >
> > So you can rely on this.
> >
>
> Wonderful.
>
> Here is the horrible code I mentioned yesterday.  This is about
> implementing the immediate-patching framework that Linus and others have
> discussed (it helps both performance and kernel hardening):
>

I'm wondering if a production version should look more like:

patch_point:
mov $whatever, [target]
.pushsection ".immediate_patches"
.quad patch_point
.popsection

and let objtool parse the resulting assembled output and emit an entry
in some table mapping patch_point to the actual address and size of
the immediate to be patched.
H. Peter Anvin Oct. 4, 2018, 11:11 p.m. UTC | #19
On 10/04/18 13:29, Andy Lutomirski wrote:
>>
>> Wonderful.
>>
>> Here is the horrible code I mentioned yesterday.  This is about
>> implementing the immediate-patching framework that Linus and others have
>> discussed (it helps both performance and kernel hardening):
>>
> 
> I'm wondering if a production version should look more like:
> 
> patch_point:
> mov $whatever, [target]
> .pushsection ".immediate_patches"
> .quad patch_point
> .popsection
> 
> and let objtool parse the resulting assembled output and emit an entry
> in some table mapping patch_point to the actual address and size of
> the immediate to be patched.
> 

Putting the generation of the ersatz code in objtool is an interesting
idea, although I'm wondering if these macros, as awful as they are,
wouldn't have generic applicability (what they do is allow the cold
branch of an asm statement to push temporaries onto the stack rather
than having to have gcc clobber them).

I'll think about it.

	-hpa
Ingo Molnar Oct. 5, 2018, 9:31 a.m. UTC | #20
* Nadav Amit <namit@vmware.com> wrote:

> > Are you using defconfig or a reasonable distro-config for your tests?
> 
> I think it is best to take the kernel and run localyesconfig for testing.

Ok, agreed - and this makes the numbers you provided pretty representative.

Good - now that all of my concerns were addressed I'd like to merge the remaining 3 patches as 
well - but they are conflicting with ongoing x86 work in tip:x86/core. The extable conflict is 
trivial, the jump-label conflict a bit more involved.

Could you please pick up the updated changelogs below and resolve the conflicts against 
tip:master or tip:x86/build and submit the remaining patches as well?

Thanks,

	Ingo



=============>
commit b82b0b611740c7c88050ba743c398af7eb920029
Author: Nadav Amit <namit@vmware.com>
Date:   Wed Oct 3 14:31:00 2018 -0700

    x86/jump-labels: Macrofy inline assembly code to work around GCC inlining bugs
    
    As described in:
    
      77b0bf55bc67: ("kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs")
    
    GCC's inlining heuristics are broken with common asm() patterns used in
    kernel code, resulting in the effective disabling of inlining.
    
    The workaround is to set an assembly macro and call it from the inline
    assembly block - which is also a minor cleanup for the jump-label code.
    
    As a result the code size is slightly increased, but inlining decisions
    are better:
    
          text     data     bss      dec     hex  filename
      18163528 10226300 2957312 31347140 1de51c4  ./vmlinux before
      18163608 10227348 2957312 31348268 1de562c  ./vmlinux after (+1128)
    
    And functions such as intel_pstate_adjust_policy_max(),
    kvm_cpu_accept_dm_intr(), kvm_register_readl() are inlined.
    
    Tested-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: Nadav Amit <namit@vmware.com>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Andy Lutomirski <luto@amacapital.net>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Brian Gerst <brgerst@gmail.com>
    Cc: Denys Vlasenko <dvlasenk@redhat.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: H. Peter Anvin <hpa@zytor.com>
    Cc: Kate Stewart <kstewart@linuxfoundation.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Philippe Ombredanne <pombredanne@nexb.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Link: http://lkml.kernel.org/r/20181003213100.189959-11-namit@vmware.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

commit dfc243615d43bb477d1d16a0064fc3d69ade5b3a
Author: Nadav Amit <namit@vmware.com>
Date:   Wed Oct 3 14:30:59 2018 -0700

    x86/cpufeature: Macrofy inline assembly code to work around GCC inlining bugs
    
    As described in:
    
      77b0bf55bc67: ("kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs")
    
    GCC's inlining heuristics are broken with common asm() patterns used in
    kernel code, resulting in the effective disabling of inlining.
    
    The workaround is to set an assembly macro and call it from the inline
    assembly block - which is pretty pointless indirection in the static_cpu_has()
    case, but is worth it to improve overall inlining quality.
    
    The patch slightly increases the kernel size:
    
          text     data     bss      dec     hex  filename
      18162879 10226256 2957312 31346447 1de4f0f  ./vmlinux before
      18163528 10226300 2957312 31347140 1de51c4  ./vmlinux after (+693)
    
    And it enables the inlining of functions such as free_ldt_pgtables().
    
    Tested-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: Nadav Amit <namit@vmware.com>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Andy Lutomirski <luto@amacapital.net>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Brian Gerst <brgerst@gmail.com>
    Cc: Denys Vlasenko <dvlasenk@redhat.com>
    Cc: H. Peter Anvin <hpa@zytor.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Link: http://lkml.kernel.org/r/20181003213100.189959-10-namit@vmware.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

commit 4021bdcd351fd63d8d5e74264ee18d09388f0221
Author: Nadav Amit <namit@vmware.com>
Date:   Wed Oct 3 14:30:58 2018 -0700

    x86/extable: Macrofy inline assembly code to work around GCC inlining bugs
    
    As described in:
    
      77b0bf55bc67: ("kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs")
    
    GCC's inlining heuristics are broken with common asm() patterns used in
    kernel code, resulting in the effective disabling of inlining.
    
    The workaround is to set an assembly macro and call it from the inline
    assembly block - which is also a minor cleanup for the exception table
    code.
    
    Text size goes up a bit:
    
          text     data     bss      dec     hex  filename
      18162555 10226288 2957312 31346155 1de4deb  ./vmlinux before
      18162879 10226256 2957312 31346447 1de4f0f  ./vmlinux after (+292)
    
    But this allows the inlining of functions such as nested_vmx_exit_reflected(),
    set_segment_reg() and __copy_xstate_to_user(), which is a net benefit.
    
    Tested-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: Nadav Amit <namit@vmware.com>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Andy Lutomirski <luto@amacapital.net>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Brian Gerst <brgerst@gmail.com>
    Cc: Denys Vlasenko <dvlasenk@redhat.com>
    Cc: H. Peter Anvin <hpa@zytor.com>
    Cc: Josh Poimboeuf <jpoimboe@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Link: http://lkml.kernel.org/r/20181003213100.189959-9-namit@vmware.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
Borislav Petkov Oct. 5, 2018, 11:20 a.m. UTC | #21
On Fri, Oct 05, 2018 at 11:31:08AM +0200, Ingo Molnar wrote:
> 
> * Nadav Amit <namit@vmware.com> wrote:
> 
> > > Are you using defconfig or a reasonable distro-config for your tests?
> > 
> > I think it is best to take the kernel and run localyesconfig for testing.
> 
> Ok, agreed - and this makes the numbers you provided pretty representative.
> 
> Good - now that all of my concerns were addressed I'd like to merge the remaining 3 patches as 
> well - but they are conflicting with ongoing x86 work in tip:x86/core. The extable conflict is 
> trivial, the jump-label conflict a bit more involved.

FWIW, gcc guys are not too averse to the idea of enhancing gcc inline
asm syntax with a statement which specifies its size so that gcc can use
it to do better inlining cost estimation than counting lines.

Lemme work the process ...
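
For reference, this kind of annotation later shipped as GCC 9's "asm
inline" qualifier, which tells the inliner to treat the block as having
minimum size no matter how many source lines it spans; a minimal sketch:

  static inline void inc_cheap(int *p)
  {
          /* "asm inline": the block no longer inflates inlining cost. */
          asm inline volatile("incl %0" : "+m" (*p) : : "cc");
  }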
Ingo Molnar Oct. 5, 2018, 12:52 p.m. UTC | #22
* Borislav Petkov <bp@alien8.de> wrote:

> On Fri, Oct 05, 2018 at 11:31:08AM +0200, Ingo Molnar wrote:
> > 
> > * Nadav Amit <namit@vmware.com> wrote:
> > 
> > > > Are you using defconfig or a reasonable distro-config for your tests?
> > > 
> > > I think it is best to take the kernel and run localyesconfig for testing.
> > 
> > Ok, agreed - and this makes the numbers you provided pretty representative.
> > 
> > Good - now that all of my concerns were addressed I'd like to merge the remaining 3 patches as 
> > well - but they are conflicting with ongoing x86 work in tip:x86/core. The extable conflict is 
> > trivial, the jump-label conflict a bit more involved.
> 
> FWIW, gcc guys are not too averse to the idea of enhancing gcc inline
> asm syntax with a statement which specifies its size so that gcc can use
> it to do better inlining cost estimation than counting lines.
> 
> Lemme work the process ...

Thanks!

	Ingo
Rasmus Villemoes Oct. 6, 2018, 1:40 a.m. UTC | #23
On 2018-10-04 21:33, H. Peter Anvin wrote:

> Here is the horrible code I mentioned yesterday.  This is about
> implementing the immediate-patching framework that Linus and others have
> discussed (it helps both performance and kernel hardening):

Heh, I did a POC in userspace some years ago for loading an "eventually
constant" value into a register - the idea being to avoid a load in
cases like kmem_cache_alloc(foo_cachep) or kmem_cache_free(foo_cachep, p),
and I assume this is something along the same lines? I didn't do
anything with it since I had no idea if the performance gain would be
worth it, and at the time (before __ro_after_init) there was no good way
to know that the values would really be constant eventually. Also, I had
hoped to come up with a way to avoid having to annotate the loads.

I just tried expanding this to deal with some of the hash tables sized
at init time which I can see was also mentioned on LKML some time ago.
I'm probably missing something fundamental, but there's some sorta
working code at https://github.com/villemoes/rai which is not too
horrible (IMO). Attaching gcc at various times and doing disassembly
shows that the patching does take place. Can I get you to take a look at
raimacros.S and tell me why that won't work?

Thanks,
Rasmus
Nadav Amit Oct. 8, 2018, 2:17 a.m. UTC | #24
at 2:31 AM, Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Nadav Amit <namit@vmware.com> wrote:
> 
>>> Are you using defconfig or a reasonable distro-config for your tests?
>> 
>> I think it is best to take the kernel and run localyesconfig for testing.
> 
> Ok, agreed - and this makes the numbers you provided pretty representative.
> 
> Good - now that all of my concerns were addressed I'd like to merge the remaining 3 patches as 
> well - but they are conflicting with ongoing x86 work in tip:x86/core. The extable conflict is 
> trivial, the jump-label conflict a bit more involved.
> 
> Could you please pick up the updated changelogs below and resolve the conflicts against 
> tip:master or tip:x86/build and submit the remaining patches as well?

For the record, I summarized my analysis of the poor function inlining
decisions in Linux in the following blog-post:

https://nadav.amit.zone/blog/linux-inline

Regards,
Nadav

Patch

diff --git a/arch/x86/include/asm/refcount.h b/arch/x86/include/asm/refcount.h
index 19b90521954c..c92909da0686 100644
--- a/arch/x86/include/asm/refcount.h
+++ b/arch/x86/include/asm/refcount.h
@@ -4,6 +4,41 @@ 
  * x86-specific implementation of refcount_t. Based on PAX_REFCOUNT from
  * PaX/grsecurity.
  */
+
+#ifdef __ASSEMBLY__
+
+#include <asm/asm.h>
+#include <asm/bug.h>
+
+.macro REFCOUNT_EXCEPTION counter:req
+	.pushsection .text..refcount
+111:	lea \counter, %_ASM_CX
+112:	ud2
+	ASM_UNREACHABLE
+	.popsection
+113:	_ASM_EXTABLE_REFCOUNT(112b, 113b)
+.endm
+
+/* Trigger refcount exception if refcount result is negative. */
+.macro REFCOUNT_CHECK_LT_ZERO counter:req
+	js 111f
+	REFCOUNT_EXCEPTION counter="\counter"
+.endm
+
+/* Trigger refcount exception if refcount result is zero or negative. */
+.macro REFCOUNT_CHECK_LE_ZERO counter:req
+	jz 111f
+	REFCOUNT_CHECK_LT_ZERO counter="\counter"
+.endm
+
+/* Trigger refcount exception unconditionally. */
+.macro REFCOUNT_ERROR counter:req
+	jmp 111f
+	REFCOUNT_EXCEPTION counter="\counter"
+.endm
+
+#else /* __ASSEMBLY__ */
+
 #include <linux/refcount.h>
 #include <asm/bug.h>
 
@@ -15,34 +50,11 @@ 
  * central refcount exception. The fixup address for the exception points
  * back to the regular execution flow in .text.
  */
-#define _REFCOUNT_EXCEPTION				\
-	".pushsection .text..refcount\n"		\
-	"111:\tlea %[counter], %%" _ASM_CX "\n"		\
-	"112:\t" ASM_UD2 "\n"				\
-	ASM_UNREACHABLE					\
-	".popsection\n"					\
-	"113:\n"					\
-	_ASM_EXTABLE_REFCOUNT(112b, 113b)
-
-/* Trigger refcount exception if refcount result is negative. */
-#define REFCOUNT_CHECK_LT_ZERO				\
-	"js 111f\n\t"					\
-	_REFCOUNT_EXCEPTION
-
-/* Trigger refcount exception if refcount result is zero or negative. */
-#define REFCOUNT_CHECK_LE_ZERO				\
-	"jz 111f\n\t"					\
-	REFCOUNT_CHECK_LT_ZERO
-
-/* Trigger refcount exception unconditionally. */
-#define REFCOUNT_ERROR					\
-	"jmp 111f\n\t"					\
-	_REFCOUNT_EXCEPTION
 
 static __always_inline void refcount_add(unsigned int i, refcount_t *r)
 {
 	asm volatile(LOCK_PREFIX "addl %1,%0\n\t"
-		REFCOUNT_CHECK_LT_ZERO
+		"REFCOUNT_CHECK_LT_ZERO counter=\"%[counter]\""
 		: [counter] "+m" (r->refs.counter)
 		: "ir" (i)
 		: "cc", "cx");
@@ -51,7 +63,7 @@  static __always_inline void refcount_add(unsigned int i, refcount_t *r)
 static __always_inline void refcount_inc(refcount_t *r)
 {
 	asm volatile(LOCK_PREFIX "incl %0\n\t"
-		REFCOUNT_CHECK_LT_ZERO
+		"REFCOUNT_CHECK_LT_ZERO counter=\"%[counter]\""
 		: [counter] "+m" (r->refs.counter)
 		: : "cc", "cx");
 }
@@ -59,7 +71,7 @@  static __always_inline void refcount_inc(refcount_t *r)
 static __always_inline void refcount_dec(refcount_t *r)
 {
 	asm volatile(LOCK_PREFIX "decl %0\n\t"
-		REFCOUNT_CHECK_LE_ZERO
+		"REFCOUNT_CHECK_LE_ZERO counter=\"%[counter]\""
 		: [counter] "+m" (r->refs.counter)
 		: : "cc", "cx");
 }
@@ -67,13 +79,15 @@  static __always_inline void refcount_dec(refcount_t *r)
 static __always_inline __must_check
 bool refcount_sub_and_test(unsigned int i, refcount_t *r)
 {
-	GEN_BINARY_SUFFIXED_RMWcc(LOCK_PREFIX "subl", REFCOUNT_CHECK_LT_ZERO,
+	GEN_BINARY_SUFFIXED_RMWcc(LOCK_PREFIX "subl",
+				  "REFCOUNT_CHECK_LT_ZERO counter=\"%0\"",
 				  r->refs.counter, "er", i, "%0", e, "cx");
 }
 
 static __always_inline __must_check bool refcount_dec_and_test(refcount_t *r)
 {
-	GEN_UNARY_SUFFIXED_RMWcc(LOCK_PREFIX "decl", REFCOUNT_CHECK_LT_ZERO,
+	GEN_UNARY_SUFFIXED_RMWcc(LOCK_PREFIX "decl",
+				 "REFCOUNT_CHECK_LT_ZERO counter=\"%0\"",
 				 r->refs.counter, "%0", e, "cx");
 }
 
@@ -91,7 +105,7 @@  bool refcount_add_not_zero(unsigned int i, refcount_t *r)
 
 		/* Did we try to increment from/to an undesirable state? */
 		if (unlikely(c < 0 || c == INT_MAX || result < c)) {
-			asm volatile(REFCOUNT_ERROR
+			asm volatile("REFCOUNT_ERROR counter=\"%[counter]\""
 				     : : [counter] "m" (r->refs.counter)
 				     : "cc", "cx");
 			break;
@@ -107,4 +121,6 @@  static __always_inline __must_check bool refcount_inc_not_zero(refcount_t *r)
 	return refcount_add_not_zero(1, r);
 }
 
+#endif /* __ASSEMBLY__ */
+
 #endif
diff --git a/arch/x86/kernel/macros.S b/arch/x86/kernel/macros.S
index cee28c3246dc..f1fe1d570365 100644
--- a/arch/x86/kernel/macros.S
+++ b/arch/x86/kernel/macros.S
@@ -7,3 +7,4 @@ 
  */
 
 #include <linux/compiler.h>
+#include <asm/refcount.h>