* [RFC 0/8] Improving compiler inlining decisions
@ 2018-05-15 14:11 ` Nadav Amit
  0 siblings, 0 replies; 35+ messages in thread
From: Nadav Amit @ 2018-05-15 14:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: nadav.amit, Nadav Amit, Alok Kataria, Christopher Li,
	H. Peter Anvin, Ingo Molnar, Jan Beulich, Jonathan Corbet,
	Josh Poimboeuf, Juergen Gross, Kees Cook, linux-sparse,
	Peter Zijlstra, Randy Dunlap, Thomas Gleixner, virtualization,
	x86

This patch-set deals with an interesting yet stupid problem: code that
does not get inlined despite its simplicity.

I find 5 classes of causes:

1. Inline assembly blocks in which code and data are added to
alternative sections. The compiler is oblivious to the content of the
blocks and assumes their cost in space and time is proportional to the
number of perceived assembly "instructions", which it counts according
to the number of newlines and semicolons. Alternatives, paravirt and
other mechanisms are affected (see the sketch after this list).

2. Inline assembly with redundant new-lines and semicolons. Similarly
to (1), this code is considered "heavier" than it actually is.

3. Code with constant value optimizations. Quite a few parts of the
kernel check whether a variable is constant (using
__builtin_constant_p()) and perform heavy computations in that case.
These computations are eventually optimized out, so they do not land in
the binary. However, their cost is still attributed to the calling
function, which might prevent that function from being inlined.
ilog2() is an example of such a case.

4. Code that is marked with the "cold" attribute, including all the
__init functions. Some may consider this the desired behavior.

5. Code that is built with different optimization levels. This
affects, for example, vmx_vcpu_run(), inducing overheads of up to 10%
on exit.
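
As an illustration of (1) and (2), consider the following minimal
sketch (not taken from the patch-set; the label numbers and section
name are only illustrative). At runtime it costs one instruction, but
GCC's newline-based heuristic prices it at roughly seven:

static inline void one_nop(void)
{
	/* A single NOP executes at runtime... */
	asm volatile("661:\n\t"
		     "nop\n\t"
		     "662:\n\t"
		     /* ...but the out-of-line metadata below inflates
		      * GCC's newline/semicolon-based size estimate. */
		     ".pushsection .discard.example, \"a\"\n\t"
		     ".long 661b - .\n\t"
		     ".byte 662b - 661b\n\t"
		     ".popsection");
}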


This patch-set deals with some instances of the first three classes.

For (1), we define an assembly macro and call it from the inline
assembly block.  As a result, the compiler sees a single "instruction"
and assigns a more appropriate cost to the code.
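
A sketch of the pattern, with illustrative names (the real macros in
the patches carry the actual section payloads):

/* Emitted once, at the top level; the body is plain GAS. */
asm(".macro __example_annotate\n\t"
    ".pushsection .discard.example\n\t"
    ".long 0\n\t"
    ".popsection\n"
    ".endm");

/* Each use site now expands to a single line, so GCC prices the
 * statement as one instruction, whatever the macro body contains. */
#define example_annotate() asm volatile("__example_annotate")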

For (2) the solution is trivial: just remove the newlines.

(3) is somewhat tricky. The proposed solution is to use
__builtin_choose_expr() to check whether a variable is actually
constant, instead of using an if-condition or the C ternary operator.
__builtin_choose_expr() is evaluated earlier in the compilation, so it
allows the compiler to associate the right cost with the variable case
before the inlining decisions take place.  So far so good.

Still, there is a drawback. Since __builtin_choose_expr() is evaluated
earlier, it can fail to recognize constants that an if-condition would
recognize correctly.  As a result, this patch-set applies it only to
the simplest cases.
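
A simplified sketch of the pattern, modeled on the ilog2() change in
this series (the names are illustrative, not the kernel's):

static inline int __example_ilog2_var(unsigned long n)
{
	return 8 * sizeof(n) - 1 - __builtin_clzl(n);
}

#define example_ilog2(n)					\
	__builtin_choose_expr(__builtin_constant_p(n),		\
		/* constant case: folded at compile time */	\
		((n) < 2 ? 0 : 63 - __builtin_clzll(n)),	\
		/* variable case: the only cost GCC now sees */	\
		__example_ilog2_var(n))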

Overall this patch-set slightly increases the kernel size (for the
record, my build used localmodconfig + localyesconfig):

   text    data     bss     dec     hex filename
18126699 10066728 2936832 31130259 1db0293 ./vmlinux before
18149210 10064048 2936832 31150090 1db500a ./vmlinux after (+0.06%)

The patch-set eliminates many of the static text symbols:
Before: 40033
After:  39632   (-401, roughly -1%)

There is a measurable effect on performance in some cases. A loop of
MADV_DONTNEED/page-fault shows a 2% performance improvement with this
patch-set.
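
For reference, a sketch of the shape of such a microbenchmark (the
actual harness was not posted, so the details here are assumptions):

#include <sys/mman.h>

int main(void)
{
	/* Error handling omitted for brevity. */
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	long i;

	for (i = 0; i < 1000000; i++) {
		p[0] = 1;	/* write -> take a page fault */
		madvise(p, 4096, MADV_DONTNEED); /* zap the page again */
	}
	return 0;
}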

Some inline comments or self-explaining C macros might still be needed.

[1] https://lkml.org/lkml/2018/5/5/159

Cc: Alok Kataria <akataria@vmware.com>
Cc: Christopher Li <sparse@chrisli.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: linux-sparse@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: virtualization@lists.linux-foundation.org
Cc: x86@kernel.org

Nadav Amit (8):
  x86: objtool: use asm macro for better compiler decisions
  x86: bug: prevent gcc distortions
  x86: alternative: macrofy locks for better inlining
  x86: prevent inline distortion by paravirt ops
  x86: refcount: prevent gcc distortions
  x86: removing unneeded new-lines
  ilog2: preventing compiler distortion due to big condition
  bitops: prevent compiler inline decision distortion

 arch/x86/include/asm/alternative.h    | 28 ++++++++++----
 arch/x86/include/asm/asm.h            |  4 +-
 arch/x86/include/asm/bitops.h         |  8 ++--
 arch/x86/include/asm/bug.h            | 48 ++++++++++++++---------
 arch/x86/include/asm/cmpxchg.h        | 10 ++---
 arch/x86/include/asm/paravirt_types.h | 53 +++++++++++++++-----------
 arch/x86/include/asm/refcount.h       | 55 ++++++++++++++++-----------
 arch/x86/include/asm/special_insns.h  | 12 +++---
 include/linux/compiler.h              | 29 ++++++++++----
 include/linux/log2.h                  | 11 +++---
 10 files changed, 156 insertions(+), 102 deletions(-)

-- 
2.17.0

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC 1/8] x86: objtool: use asm macro for better compiler decisions
  2018-05-15 14:11 ` Nadav Amit
@ 2018-05-15 14:11   ` Nadav Amit
  -1 siblings, 0 replies; 35+ messages in thread
From: Nadav Amit @ 2018-05-15 14:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: nadav.amit, Nadav Amit, Christopher Li, linux-sparse

GCC considers the number of statements in inline assembly blocks,
counted according to new-lines and semicolons, as an indication of the
cost of the block in time and space. This estimate is distorted by
kernel code that puts information in alternative sections. As a result,
the compiler may perform incorrect inlining and branch optimizations.

In the case of objtool, this distortion is extreme, since the objtool
annotations are discarded during linkage anyway.

The solution is to define an assembly macro and call it from the inline
assembly block. As a result, GCC considers the inline assembly block to
be a single instruction.

This patch slightly increases the kernel size.

   text    data     bss     dec     hex filename
18126699 10066728 2936832 31130259 1db0293 ./vmlinux before
18126824 10067268 2936832 31130924 1db052c ./vmlinux after (+665)

But allows more aggressive inlining. Static text symbols:
Before: 40033
After: 40015 (-18)

Cc: Christopher Li <sparse@chrisli.org>
Cc: linux-sparse@vger.kernel.org

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 include/linux/compiler.h | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index ab4711c63601..369753000541 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -98,17 +98,30 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
  * The __COUNTER__ based labels are a hack to make each instance of the macros
  * unique, to convince GCC not to merge duplicate inline asm statements.
  */
+
+asm ("\n"
+	".macro __annotate_reachable counter:req\n"
+	"\\counter:\n\t"
+	".pushsection .discard.reachable\n\t"
+	".long \\counter\\()b -.\n\t"
+	".popsection\n"
+	".endm\n");
+
 #define annotate_reachable() ({						\
-	asm volatile("%c0:\n\t"						\
-		     ".pushsection .discard.reachable\n\t"		\
-		     ".long %c0b - .\n\t"				\
-		     ".popsection\n\t" : : "i" (__COUNTER__));		\
+	asm volatile("__annotate_reachable %c0" : : "i" (__COUNTER__));	\
 })
+
+asm ("\n"
+	".macro __annotate_unreachable counter:req\n"
+	"\\counter:\n\t"
+	".pushsection .discard.unreachable\n\t"
+	".long \\counter\\()b -.\n\t"
+	".popsection\n"
+	".endm");
+
 #define annotate_unreachable() ({					\
-	asm volatile("%c0:\n\t"						\
-		     ".pushsection .discard.unreachable\n\t"		\
-		     ".long %c0b - .\n\t"				\
-		     ".popsection\n\t" : : "i" (__COUNTER__));		\
+	asm volatile("__annotate_unreachable %c0" : :			\
+		     "i" (__COUNTER__));				\
 })
 #define ASM_UNREACHABLE							\
 	"999:\n\t"							\
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC 2/8] x86: bug: prevent gcc distortions
  2018-05-15 14:11 ` Nadav Amit
  (?)
  (?)
@ 2018-05-15 14:11 ` Nadav Amit
  -1 siblings, 0 replies; 35+ messages in thread
From: Nadav Amit @ 2018-05-15 14:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: nadav.amit, Nadav Amit, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, Peter Zijlstra, Josh Poimboeuf

GCC considers the number of statements in inline assembly blocks,
counted according to new-lines and semicolons, as an indication of the
cost of the block in time and space. This estimate is distorted by
kernel code that puts information in alternative sections. As a result,
the compiler may perform incorrect inlining and branch optimizations.

The solution is to define an assembly macro and call it from the inline
assembly block. As a result, GCC considers the inline assembly block to
be a single instruction.

This patch increases the kernel size:

   text	   data	    bss	    dec	    hex	filename
18126824 10067268 2936832 31130924 1db052c ./vmlinux before
18127205 10068388 2936832 31132425 1db0b09 ./vmlinux after (+1501)

But enables more aggressive inlining (and probably better branch
decisions). The number of static text symbols in vmlinux is lower.

Before:	40015
After:	39860	(-155)

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/bug.h | 48 +++++++++++++++++++++++---------------
 1 file changed, 29 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/bug.h b/arch/x86/include/asm/bug.h
index 6804d6642767..a2d981beb6cb 100644
--- a/arch/x86/include/asm/bug.h
+++ b/arch/x86/include/asm/bug.h
@@ -30,33 +30,43 @@
 
 #ifdef CONFIG_DEBUG_BUGVERBOSE
 
-#define _BUG_FLAGS(ins, flags)						\
+asm("\n"
+	".macro __BUG_FLAGS ins:req file:req line:req flags:req size:req\n"
+	"1:\t \\ins\n\t"
+	".pushsection __bug_table,\"aw\"\n"
+	"2:\t "__BUG_REL(1b)            "\t# bug_entry::bug_addr\n\t"
+	__BUG_REL(\\file)		"\t# bug_entry::file\n\t"
+	".word \\line"			"\t# bug_entry::line\n\t"
+	".word \\flags"			"\t# bug_entry::flags\n\t"
+	".org 2b+\\size\n\t"
+	".popsection\n"
+	".endm");
+
+#define _BUG_FLAGS(ins, flags)                                          \
 do {									\
-	asm volatile("1:\t" ins "\n"					\
-		     ".pushsection __bug_table,\"aw\"\n"		\
-		     "2:\t" __BUG_REL(1b) "\t# bug_entry::bug_addr\n"	\
-		     "\t"  __BUG_REL(%c0) "\t# bug_entry::file\n"	\
-		     "\t.word %c1"        "\t# bug_entry::line\n"	\
-		     "\t.word %c2"        "\t# bug_entry::flags\n"	\
-		     "\t.org 2b+%c3\n"					\
-		     ".popsection"					\
-		     : : "i" (__FILE__), "i" (__LINE__),		\
-			 "i" (flags),					\
+	asm volatile("__BUG_FLAGS \"" ins "\" %c0 %c1 %c2 %c3"		\
+		     : : "i" (__FILE__), "i" (__LINE__),                \
+			 "i" (flags),                                   \
 			 "i" (sizeof(struct bug_entry)));		\
 } while (0)
 
 #else /* !CONFIG_DEBUG_BUGVERBOSE */
 
+asm("\n"
+	".macro __BUG_FLAGS ins:req flags:req size:req\n"
+	"1:\t \\ins\n\t"
+	".pushsection __bug_table,\"aw\"\n"
+	"2:\t" __BUG_REL(1b)		"\t# bug_entry::bug_addr\n\t"
+	".word \\flags"			"\t# bug_entry::flags\n\t"
+	".org 2b+\\size\n\t"
+	".popsection\n"
+	".endm");
+
 #define _BUG_FLAGS(ins, flags)						\
 do {									\
-	asm volatile("1:\t" ins "\n"					\
-		     ".pushsection __bug_table,\"aw\"\n"		\
-		     "2:\t" __BUG_REL(1b) "\t# bug_entry::bug_addr\n"	\
-		     "\t.word %c0"        "\t# bug_entry::flags\n"	\
-		     "\t.org 2b+%c1\n"					\
-		     ".popsection"					\
-		     : : "i" (flags),					\
-			 "i" (sizeof(struct bug_entry)));		\
+	asm volatile("__BUG_FLAGS \"" ins "\" %c0 %c1"			\
+		    : : "i" (flags),					\
+			"i" (sizeof(struct bug_entry)));		\
 } while (0)
 
 #endif /* CONFIG_DEBUG_BUGVERBOSE */
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC 3/8] x86: alternative: macrofy locks for better inlining
  2018-05-15 14:11 ` Nadav Amit
                   ` (2 preceding siblings ...)
  (?)
@ 2018-05-15 14:11 ` Nadav Amit
  -1 siblings, 0 replies; 35+ messages in thread
From: Nadav Amit @ 2018-05-15 14:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: nadav.amit, Nadav Amit, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, Josh Poimboeuf

GCC considers the number of statements in inline assembly blocks,
counted according to new-lines and semicolons, as an indication of the
cost of the block in time and space. This estimate is distorted by
kernel code that puts information in alternative sections. As a result,
the compiler may perform incorrect inlining and branch optimizations.

The solution is to define an assembly macro and call it from the inline
assembly block. As a result, GCC considers the inline assembly block to
be a single instruction.

This patch handles the LOCK prefix, allowing more aggressive inlining.

   text	   data	    bss	    dec	    hex	filename
18127205 10068388 2936832 31132425 1db0b09 ./vmlinux before
18131468 10068488 2936832 31136788 1db1c14 ./vmlinux after (+4363)

Static text symbols:
Before:	39860
After:	39788	(-72)

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Josh Poimboeuf <jpoimboe@redhat.com>

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/alternative.h | 28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 4cd6a3b71824..daa68ad51665 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -31,14 +31,26 @@
  */
 
 #ifdef CONFIG_SMP
-#define LOCK_PREFIX_HERE \
-		".pushsection .smp_locks,\"a\"\n"	\
-		".balign 4\n"				\
-		".long 671f - .\n" /* offset */		\
-		".popsection\n"				\
-		"671:"
-
-#define LOCK_PREFIX LOCK_PREFIX_HERE "\n\tlock; "
+
+asm ("\n"
+	".macro __LOCK_PREFIX_HERE\n\t"
+	".pushsection .smp_locks,\"a\"\n\t"
+	".balign 4\n\t"
+	".long 671f - .\n\t" /* offset */
+	".popsection\n"
+	"671:\n"
+	".endm");
+
+#define LOCK_PREFIX_HERE "__LOCK_PREFIX_HERE"
+
+asm ("\n"
+	".macro __LOCK_PREFIX ins:vararg\n\t"
+	"__LOCK_PREFIX_HERE\n\t"
+	"lock; \\ins\n"
+	".endm");
+
+#define LOCK_PREFIX "__LOCK_PREFIX "
+
 
 #else /* ! CONFIG_SMP */
 #define LOCK_PREFIX_HERE ""
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC 4/8] x86: prevent inline distortion by paravirt ops
  2018-05-15 14:11 ` Nadav Amit
                   ` (3 preceding siblings ...)
  (?)
@ 2018-05-15 14:11 ` Nadav Amit
  -1 siblings, 0 replies; 35+ messages in thread
From: Nadav Amit @ 2018-05-15 14:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: nadav.amit, Nadav Amit, Juergen Gross, Alok Kataria,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	virtualization

GCC considers the number of statements in inline assembly blocks,
counted according to new-lines and semicolons, as an indication of the
cost of the block in time and space. This estimate is distorted by
kernel code that puts information in alternative sections. As a result,
the compiler may perform incorrect inlining and branch optimizations.

The solution is to define an assembly macro and call it from the inline
assembly block. As a result, GCC considers the inline assembly block to
be a single instruction.

The effect of the patch is more aggressive inlining, which also
increases the kernel size.

   text	   data	    bss	    dec	    hex	filename
18131468 10068488 2936832 31136788 1db1c14 ./vmlinux before
18146418 10064100 2936832 31147350 1db4556 ./vmlinux after (+10562)

Static text symbols:
Before:	39788
After:	39673	(-115)

Cc: Juergen Gross <jgross@suse.com>
Cc: Alok Kataria <akataria@vmware.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: virtualization@lists.linux-foundation.org

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/paravirt_types.h | 53 +++++++++++++++------------
 1 file changed, 30 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 180bc0bff0fb..3fe99dab8266 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -346,20 +346,37 @@ extern struct pv_lock_ops pv_lock_ops;
 /*
  * Generate some code, and mark it as patchable by the
  * apply_paravirt() alternate instruction patcher.
+ *
+ * This generates an indirect call based on the operation type number.
+ * The type number, computed in PARAVIRT_PATCH, is derived from the
+ * offset into the paravirt_patch_template structure, and can therefore be
+ * freely converted back into a structure offset.
  */
-#define _paravirt_alt(insn_string, type, clobber)	\
-	"771:\n\t" insn_string "\n" "772:\n"		\
-	".pushsection .parainstructions,\"a\"\n"	\
-	_ASM_ALIGN "\n"					\
-	_ASM_PTR " 771b\n"				\
-	"  .byte " type "\n"				\
-	"  .byte 772b-771b\n"				\
-	"  .short " clobber "\n"			\
+
+asm ("\n"
+	".macro __paravirt_alt type:req clobber:req pv_opptr:req\n"
+	"771:\n\t"
+	ANNOTATE_RETPOLINE_SAFE "\n\t"
+	"call *\\pv_opptr\n"
+	"772:\n\t"
+	".pushsection .parainstructions,\"a\"\n\t"
+	_ASM_ALIGN "\n\t"
+	_ASM_PTR " 771b\n\t"
+	".byte \\type\n\t"
+	".byte 772b-771b\n\t"
+	".short \\clobber\n\t"
 	".popsection\n"
+	".endm");
+
+#define _paravirt_alt(type, clobber, pv_opptr)				\
+	"__paravirt_alt type=" __stringify(type)			\
+	" clobber=" __stringify(clobber)				\
+	" pv_opptr=" __stringify(pv_opptr) "\n\t"
 
 /* Generate patchable code, with the default asm parameters. */
-#define paravirt_alt(insn_string)					\
-	_paravirt_alt(insn_string, "%c[paravirt_typenum]", "%c[paravirt_clobber]")
+#define paravirt_alt							\
+	_paravirt_alt("%c[paravirt_typenum]", "%c[paravirt_clobber]",	\
+		      "%c[paravirt_opptr]")
 
 /* Simple instruction patching code. */
 #define NATIVE_LABEL(a,x,b) "\n\t.globl " a #x "_" #b "\n" a #x "_" #b ":\n\t"
@@ -387,16 +404,6 @@ unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
 
 int paravirt_disable_iospace(void);
 
-/*
- * This generates an indirect call based on the operation type number.
- * The type number, computed in PARAVIRT_PATCH, is derived from the
- * offset into the paravirt_patch_template structure, and can therefore be
- * freely converted back into a structure offset.
- */
-#define PARAVIRT_CALL					\
-	ANNOTATE_RETPOLINE_SAFE				\
-	"call *%c[paravirt_opptr];"
-
 /*
  * These macros are intended to wrap calls through one of the paravirt
  * ops structs, so that they can be later identified and patched at
@@ -534,7 +541,7 @@ int paravirt_disable_iospace(void);
 		/* since this condition will never hold */		\
 		if (sizeof(rettype) > sizeof(unsigned long)) {		\
 			asm volatile(pre				\
-				     paravirt_alt(PARAVIRT_CALL)	\
+				     paravirt_alt			\
 				     post				\
 				     : call_clbr, ASM_CALL_CONSTRAINT	\
 				     : paravirt_type(op),		\
@@ -544,7 +551,7 @@ int paravirt_disable_iospace(void);
 			__ret = (rettype)((((u64)__edx) << 32) | __eax); \
 		} else {						\
 			asm volatile(pre				\
-				     paravirt_alt(PARAVIRT_CALL)	\
+				     paravirt_alt			\
 				     post				\
 				     : call_clbr, ASM_CALL_CONSTRAINT	\
 				     : paravirt_type(op),		\
@@ -571,7 +578,7 @@ int paravirt_disable_iospace(void);
 		PVOP_VCALL_ARGS;					\
 		PVOP_TEST_NULL(op);					\
 		asm volatile(pre					\
-			     paravirt_alt(PARAVIRT_CALL)		\
+			     paravirt_alt				\
 			     post					\
 			     : call_clbr, ASM_CALL_CONSTRAINT		\
 			     : paravirt_type(op),			\
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC 5/8] x86: refcount: prevent gcc distortions
  2018-05-15 14:11 ` Nadav Amit
                   ` (4 preceding siblings ...)
  (?)
@ 2018-05-15 14:11 ` Nadav Amit
  2018-05-16 13:59   ` Kees Cook
  -1 siblings, 1 reply; 35+ messages in thread
From: Nadav Amit @ 2018-05-15 14:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: nadav.amit, Nadav Amit, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, Kees Cook, Jan Beulich, Josh Poimboeuf

GCC considers the number of statements in inline assembly blocks,
counted according to new-lines and semicolons, as an indication of the
cost of the block in time and space. This estimate is distorted by
kernel code that puts information in alternative sections. As a result,
the compiler may perform incorrect inlining and branch optimizations.

The solution is to define an assembly macro and call it from the inline
assembly block. As a result, GCC considers the inline assembly block to
be a single instruction.

This patch allows functions such as __get_seccomp_filter() to be
inlined. The effect of the patch on the kernel size is as follows:

   text	   data	    bss	    dec	    hex	filename
18146418 10064100 2936832 31147350 1db4556 ./vmlinux before
18148228 10063968 2936832 31149028 1db4be4 ./vmlinux after (+1678)

Static text symbols:
Before:	39673
After:	39649	(-24)

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Kees Cook <keescook@chromium.org>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/refcount.h | 55 ++++++++++++++++++++-------------
 1 file changed, 33 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/refcount.h b/arch/x86/include/asm/refcount.h
index 4cf11d88d3b3..a668c534206d 100644
--- a/arch/x86/include/asm/refcount.h
+++ b/arch/x86/include/asm/refcount.h
@@ -14,34 +14,43 @@
  * central refcount exception. The fixup address for the exception points
  * back to the regular execution flow in .text.
  */
-#define _REFCOUNT_EXCEPTION				\
-	".pushsection .text..refcount\n"		\
-	"111:\tlea %[counter], %%" _ASM_CX "\n"		\
-	"112:\t" ASM_UD2 "\n"				\
-	ASM_UNREACHABLE					\
-	".popsection\n"					\
-	"113:\n"					\
+
+asm ("\n"
+	".macro __REFCOUNT_EXCEPTION counter:vararg\n\t"
+	".pushsection .text..refcount\n"
+	"111:\tlea \\counter, %" _ASM_CX "\n"
+	"112:\t" ASM_UD2 "\n\t"
+	ASM_UNREACHABLE
+	".popsection\n\t"
+	"113:\n"
 	_ASM_EXTABLE_REFCOUNT(112b, 113b)
+	".endm");
 
 /* Trigger refcount exception if refcount result is negative. */
-#define REFCOUNT_CHECK_LT_ZERO				\
-	"js 111f\n\t"					\
-	_REFCOUNT_EXCEPTION
+asm ("\n"
+	".macro __REFCOUNT_CHECK_LT_ZERO counter:vararg\n"
+	"js 111f\n\t"
+	"__REFCOUNT_EXCEPTION \\counter\n"
+	".endm");
 
 /* Trigger refcount exception if refcount result is zero or negative. */
-#define REFCOUNT_CHECK_LE_ZERO				\
-	"jz 111f\n\t"					\
-	REFCOUNT_CHECK_LT_ZERO
+asm ("\n"
+	".macro __REFCOUNT_CHECK_LE_ZERO counter:vararg\n"
+	"jz 111f\n\t"
+	"__REFCOUNT_CHECK_LT_ZERO counter=\\counter\n"
+	".endm");
 
 /* Trigger refcount exception unconditionally. */
-#define REFCOUNT_ERROR					\
-	"jmp 111f\n\t"					\
-	_REFCOUNT_EXCEPTION
+asm ("\n"
+	".macro __REFCOUNT_ERROR counter:vararg\n\t"
+	"jmp 111f\n\t"
+	"__REFCOUNT_EXCEPTION counter=\\counter\n"
+	".endm");
 
 static __always_inline void refcount_add(unsigned int i, refcount_t *r)
 {
 	asm volatile(LOCK_PREFIX "addl %1,%0\n\t"
-		REFCOUNT_CHECK_LT_ZERO
+		"__REFCOUNT_CHECK_LT_ZERO %[counter]"
 		: [counter] "+m" (r->refs.counter)
 		: "ir" (i)
 		: "cc", "cx");
@@ -50,7 +59,7 @@ static __always_inline void refcount_add(unsigned int i, refcount_t *r)
 static __always_inline void refcount_inc(refcount_t *r)
 {
 	asm volatile(LOCK_PREFIX "incl %0\n\t"
-		REFCOUNT_CHECK_LT_ZERO
+		"__REFCOUNT_CHECK_LT_ZERO %[counter]"
 		: [counter] "+m" (r->refs.counter)
 		: : "cc", "cx");
 }
@@ -58,7 +67,7 @@ static __always_inline void refcount_inc(refcount_t *r)
 static __always_inline void refcount_dec(refcount_t *r)
 {
 	asm volatile(LOCK_PREFIX "decl %0\n\t"
-		REFCOUNT_CHECK_LE_ZERO
+		"__REFCOUNT_CHECK_LE_ZERO %[counter]"
 		: [counter] "+m" (r->refs.counter)
 		: : "cc", "cx");
 }
@@ -66,13 +75,15 @@ static __always_inline void refcount_dec(refcount_t *r)
 static __always_inline __must_check
 bool refcount_sub_and_test(unsigned int i, refcount_t *r)
 {
-	GEN_BINARY_SUFFIXED_RMWcc(LOCK_PREFIX "subl", REFCOUNT_CHECK_LT_ZERO,
+	GEN_BINARY_SUFFIXED_RMWcc(LOCK_PREFIX "subl",
+				  "__REFCOUNT_CHECK_LT_ZERO %[counter]",
 				  r->refs.counter, "er", i, "%0", e, "cx");
 }
 
 static __always_inline __must_check bool refcount_dec_and_test(refcount_t *r)
 {
-	GEN_UNARY_SUFFIXED_RMWcc(LOCK_PREFIX "decl", REFCOUNT_CHECK_LT_ZERO,
+	GEN_UNARY_SUFFIXED_RMWcc(LOCK_PREFIX "decl",
+				 "__REFCOUNT_CHECK_LT_ZERO %[counter]",
 				 r->refs.counter, "%0", e, "cx");
 }
 
@@ -90,7 +101,7 @@ bool refcount_add_not_zero(unsigned int i, refcount_t *r)
 
 		/* Did we try to increment from/to an undesirable state? */
 		if (unlikely(c < 0 || c == INT_MAX || result < c)) {
-			asm volatile(REFCOUNT_ERROR
+			asm volatile("__REFCOUNT_ERROR %[counter]"
 				     : : [counter] "m" (r->refs.counter)
 				     : "cc", "cx");
 			break;
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC 6/8] x86: removing unneeded new-lines
  2018-05-15 14:11 ` Nadav Amit
                   ` (5 preceding siblings ...)
  (?)
@ 2018-05-15 14:11 ` Nadav Amit
  -1 siblings, 0 replies; 35+ messages in thread
From: Nadav Amit @ 2018-05-15 14:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: nadav.amit, Nadav Amit, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, Josh Poimboeuf

GCC considers the number of statements in inline assembly blocks,
counted according to new-lines and semicolons, as an indication of the
cost of the block in time and space. This estimate is distorted by
kernel code that puts information in alternative sections. As a result,
the compiler may perform incorrect inlining and branch optimizations.

This patch removes unneeded new-lines and semicolons to prevent such
distortion.

Functions such as nfs_io_completion_put() now get inlined. The overall
effect is not visible in the absolute numbers, but the patch seems to
enable slightly better inlining:

   text	   data	    bss	    dec	    hex	filename
18148228 10063968 2936832 31149028 1db4be4 ./vmlinux before
18148888 10064016 2936832 31149736 1db4ea8 ./vmlinux after (+708)

Static text symbols:
Before:	39649
After:	39650	(+1)

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Josh Poimboeuf <jpoimboe@redhat.com>

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/asm.h           |  4 ++--
 arch/x86/include/asm/cmpxchg.h       | 10 +++++-----
 arch/x86/include/asm/special_insns.h | 12 ++++++------
 3 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index 219faaec51df..571ceec97976 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -51,10 +51,10 @@
  * The output operand must be type "bool".
  */
 #ifdef __GCC_ASM_FLAG_OUTPUTS__
-# define CC_SET(c) "\n\t/* output condition code " #c "*/\n"
+# define CC_SET(c) "\n\t/* output condition code " #c "*/"
 # define CC_OUT(c) "=@cc" #c
 #else
-# define CC_SET(c) "\n\tset" #c " %[_cc_" #c "]\n"
+# define CC_SET(c) "\n\tset" #c " %[_cc_" #c "]"
 # define CC_OUT(c) [_cc_ ## c] "=qm"
 #endif
 
diff --git a/arch/x86/include/asm/cmpxchg.h b/arch/x86/include/asm/cmpxchg.h
index e3efd8a06066..2be9582fcb2e 100644
--- a/arch/x86/include/asm/cmpxchg.h
+++ b/arch/x86/include/asm/cmpxchg.h
@@ -44,22 +44,22 @@ extern void __add_wrong_size(void)
 	        __typeof__ (*(ptr)) __ret = (arg);			\
 		switch (sizeof(*(ptr))) {				\
 		case __X86_CASE_B:					\
-			asm volatile (lock #op "b %b0, %1\n"		\
+			asm volatile (lock #op "b %b0, %1"		\
 				      : "+q" (__ret), "+m" (*(ptr))	\
 				      : : "memory", "cc");		\
 			break;						\
 		case __X86_CASE_W:					\
-			asm volatile (lock #op "w %w0, %1\n"		\
+			asm volatile (lock #op "w %w0, %1"		\
 				      : "+r" (__ret), "+m" (*(ptr))	\
 				      : : "memory", "cc");		\
 			break;						\
 		case __X86_CASE_L:					\
-			asm volatile (lock #op "l %0, %1\n"		\
+			asm volatile (lock #op "l %0, %1"		\
 				      : "+r" (__ret), "+m" (*(ptr))	\
 				      : : "memory", "cc");		\
 			break;						\
 		case __X86_CASE_Q:					\
-			asm volatile (lock #op "q %q0, %1\n"		\
+			asm volatile (lock #op "q %q0, %1"		\
 				      : "+r" (__ret), "+m" (*(ptr))	\
 				      : : "memory", "cc");		\
 			break;						\
@@ -134,7 +134,7 @@ extern void __add_wrong_size(void)
 	__raw_cmpxchg((ptr), (old), (new), (size), LOCK_PREFIX)
 
 #define __sync_cmpxchg(ptr, old, new, size)				\
-	__raw_cmpxchg((ptr), (old), (new), (size), "lock; ")
+	__raw_cmpxchg((ptr), (old), (new), (size), "lock ")
 
 #define __cmpxchg_local(ptr, old, new, size)				\
 	__raw_cmpxchg((ptr), (old), (new), (size), "")
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 317fc59b512c..9c56059aaf24 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -19,7 +19,7 @@ extern unsigned long __force_order;
 static inline unsigned long native_read_cr0(void)
 {
 	unsigned long val;
-	asm volatile("mov %%cr0,%0\n\t" : "=r" (val), "=m" (__force_order));
+	asm volatile("mov %%cr0,%0" : "=r" (val), "=m" (__force_order));
 	return val;
 }
 
@@ -31,7 +31,7 @@ static inline void native_write_cr0(unsigned long val)
 static inline unsigned long native_read_cr2(void)
 {
 	unsigned long val;
-	asm volatile("mov %%cr2,%0\n\t" : "=r" (val), "=m" (__force_order));
+	asm volatile("mov %%cr2,%0" : "=r" (val), "=m" (__force_order));
 	return val;
 }
 
@@ -43,7 +43,7 @@ static inline void native_write_cr2(unsigned long val)
 static inline unsigned long __native_read_cr3(void)
 {
 	unsigned long val;
-	asm volatile("mov %%cr3,%0\n\t" : "=r" (val), "=m" (__force_order));
+	asm volatile("mov %%cr3,%0" : "=r" (val), "=m" (__force_order));
 	return val;
 }
 
@@ -67,7 +67,7 @@ static inline unsigned long native_read_cr4(void)
 		     : "=r" (val), "=m" (__force_order) : "0" (0));
 #else
 	/* CR4 always exists on x86_64. */
-	asm volatile("mov %%cr4,%0\n\t" : "=r" (val), "=m" (__force_order));
+	asm volatile("mov %%cr4,%0" : "=r" (val), "=m" (__force_order));
 #endif
 	return val;
 }
@@ -101,7 +101,7 @@ static inline u32 __read_pkru(void)
 	 * "rdpkru" instruction.  Places PKRU contents in to EAX,
 	 * clears EDX and requires that ecx=0.
 	 */
-	asm volatile(".byte 0x0f,0x01,0xee\n\t"
+	asm volatile(".byte 0x0f,0x01,0xee"
 		     : "=a" (pkru), "=d" (edx)
 		     : "c" (ecx));
 	return pkru;
@@ -115,7 +115,7 @@ static inline void __write_pkru(u32 pkru)
 	 * "wrpkru" instruction.  Loads contents in EAX to PKRU,
 	 * requires that ecx = edx = 0.
 	 */
-	asm volatile(".byte 0x0f,0x01,0xef\n\t"
+	asm volatile(".byte 0x0f,0x01,0xef"
 		     : : "a" (pkru), "c"(ecx), "d"(edx));
 }
 #else
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC 7/8] ilog2: preventing compiler distortion due to big condition
  2018-05-15 14:11 ` Nadav Amit
                   ` (6 preceding siblings ...)
  (?)
@ 2018-05-15 14:11 ` Nadav Amit
  -1 siblings, 0 replies; 35+ messages in thread
From: Nadav Amit @ 2018-05-15 14:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: nadav.amit, Nadav Amit, Randy Dunlap, Jonathan Corbet

There are several places in the kernel with a condition based on
whether the input is known to be constant at compile time. If it is,
complex computations are performed, but only during compilation.

Although this scheme works correctly, when GCC computes the expected
cost of this code in time and size, it disregards the fact that the
computations of the "constant" case will never take place at runtime
in the non-constant case. The cost of these functions is therefore
considered much higher than it really is. As a result, the compiler's
inlining and branching decisions are distorted. Specifically, functions
are less likely to be inlined due to their perceived large size and
execution time.

One of these cases is ilog2(), which evaluates a complicated
conditional for constant inputs.

The solution is to use __builtin_choose_expr() to detect whether the
input is constant, instead of a C condition. GCC evaluates the builtin
earlier, which allows it to improve code-generation decisions.

This patch allows functions such as tbl_size() to be inlined.

   text	   data	    bss	    dec	    hex	filename
18148888 10064016 2936832 31149736 1db4ea8 ./vmlinux before
18149165 10064176 2936832 31150173 1db505d ./vmlinux after (+437)

Static text symbols:
Before:	39650
After:	39643	(-7)

Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Jonathan Corbet <corbet@lwn.net>

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 include/linux/log2.h | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/linux/log2.h b/include/linux/log2.h
index 41a1ae010993..aa4dac874339 100644
--- a/include/linux/log2.h
+++ b/include/linux/log2.h
@@ -83,7 +83,8 @@ unsigned long __rounddown_pow_of_two(unsigned long n)
  */
 #define ilog2(n)				\
 (						\
-	__builtin_constant_p(n) ? (		\
+	__builtin_choose_expr(			\
+		__builtin_constant_p(n), (	\
 		(n) < 2 ? 0 :			\
 		(n) & (1ULL << 63) ? 63 :	\
 		(n) & (1ULL << 62) ? 62 :	\
@@ -147,10 +148,10 @@ unsigned long __rounddown_pow_of_two(unsigned long n)
 		(n) & (1ULL <<  4) ?  4 :	\
 		(n) & (1ULL <<  3) ?  3 :	\
 		(n) & (1ULL <<  2) ?  2 :	\
-		1 ) :				\
-	(sizeof(n) <= 4) ?			\
-	__ilog2_u32(n) :			\
-	__ilog2_u64(n)				\
+		1),				\
+		(sizeof(n) <= 4) ?		\
+		__ilog2_u32(n) :		\
+		__ilog2_u64(n))			\
  )
 
 /**
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC 8/8] bitops: prevent compiler inline decision distortion
  2018-05-15 14:11 ` Nadav Amit
                   ` (7 preceding siblings ...)
  (?)
@ 2018-05-15 14:11 ` Nadav Amit
  -1 siblings, 0 replies; 35+ messages in thread
From: Nadav Amit @ 2018-05-15 14:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: nadav.amit, Nadav Amit, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

There are several places in the kernel with a condition based on
whether the input is known to be constant at compile time. If it is,
complex computations are performed, but only during compilation.

Although this scheme works correctly, when GCC computes the expected
cost of this code in time and size, it disregards the fact that the
computations of the "constant" case will never take place at runtime
in the non-constant case. The cost of these functions is therefore
considered much higher than it really is. As a result, the compiler's
inlining and branching decisions are distorted. Specifically, functions
are less likely to be inlined due to their perceived large size and
execution time.

One of these cases is test_bit(), which performs some computations for
constant inputs.

The solution is to use __builtin_choose_expr() to detect whether the
input is constant, instead of a C condition. GCC evaluates the builtin
earlier, which allows it to improve code-generation decisions.

This patch allows functions such as bitmap_pos_to_ord() to be inlined.
Its overall effect on the kernel size:

   text	   data	    bss	    dec	    hex	filename
18149165 10064176 2936832 31150173 1db505d ./vmlinux before
18149210 10064048 2936832 31150090 1db500a ./vmlinux after (-83)

Static text symbols:
Before:	39643
After:	39632	(-11)

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/bitops.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 9f645ba57dbb..f1cb1c9125a9 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -349,10 +349,10 @@ static __always_inline bool variable_test_bit(long nr, volatile const unsigned l
 static bool test_bit(int nr, const volatile unsigned long *addr);
 #endif
 
-#define test_bit(nr, addr)			\
-	(__builtin_constant_p((nr))		\
-	 ? constant_test_bit((nr), (addr))	\
-	 : variable_test_bit((nr), (addr)))
+#define test_bit(nr, addr)					\
+	__builtin_choose_expr(__builtin_constant_p((nr)),	\
+		constant_test_bit((nr), (addr)),		\
+		variable_test_bit((nr), (addr)))
 
 /**
  * __ffs - find first set bit in word
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC 0/8] Improving compiler inlining decisions
  2018-05-15 14:11 ` Nadav Amit
@ 2018-05-15 14:11   ` Nadav Amit
  -1 siblings, 0 replies; 35+ messages in thread
From: Nadav Amit @ 2018-05-15 14:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: nadav.amit, Nadav Amit, Alok Kataria, Christopher Li,
	H. Peter Anvin, Ingo Molnar, Jan Beulich, Jonathan Corbet,
	Josh Poimboeuf, Juergen Gross, Kees Cook, linux-sparse,
	Peter Zijlstra, Randy Dunlap, Thomas Gleixner, virtualization,
	x86

This patch-set deals with an interesting yet stupid problem: code that
does not get inlined despite its simplicity.

I find 5 classes of causes:

1. Inline assembly blocks in which code and data are added to
alternative sections. The compiler is oblivious to the content of the
blocks and assumes their cost in space and time is proportional to the
number of the perceived assembly "instruction", according to the number
of newlines and semicolons. Alternatives, paravirt and other mechanisms
are affected.

2. Inline assembly with redundant new-lines and semicolons. Similarly to
(1) this code is considered "heavier" than it actually is.

3. Code with constant value optimizations. Quite a few parts of the
kernel check whether a variable is constant (using
__builtin_constant_p()) and perform heavy computations in that case.
These computations are eventually optimized out so they do not land in
the binary. However, the cost of these computations is also associated
with the calling function, which might prevent inlining of the calling
function. ilog2() is an example for such case.

4. Code that is marked with the "cold" attribute, including all the
__init functions. Some may consider it the desired behavior.

5. Code that is marked with a different optimization levels. This
affects for example vmx_vcpu_run(), inducing overheads of up to 10% on
exit.


This patch-set deals with some instances of first 3 classes. 

For (1) we insert an assembly macro, and call it from the inline
assembly block.  As a result, the compiler sees a single "instruction"
and assigns the more appropriate cost to the code.

For (2) the solution is trivial: just remove the newlines.

(3) is somewhat tricky. The proposed solution is to use
__builtin_choose_expr() to check whether a variable is actually constant
instead of using an if-condition or the C ternary operator.
__builtin_choose_expr() is evaluated earlier in the compilation, so it
allows the compiler to associate the right cost for the variable case
before the inlining decisions take place.  So far so good.

Still, there is a drawback. Since __builtin_choose_expr() is evaluated
earlier, it can fail to recognize constants, which an if-condition would
recognize correctly.  As a result, this patch-set only applies it to the
simplest cases.

Overall this patch-set slightly increases the kernel size (my build was
done using localmodconfig + localyesconfig for the record):

   text    data     bss     dec     hex filename
18126699 10066728 2936832 31130259 1db0293 ./vmlinux before
18149210 10064048 2936832 31150090 1db500a ./vmlinux after (+0.06%)

The patch-set eliminates many of the static text symbols:
Before: 40033
After:  39632   (-10%)

There is a measurable effect on performance in some cases. A loop of
MADV_DONTNEED/page-fault shows a 2% performance improvement with this
patch-set.

Some inline comments or self-explaining C macros might still be needed.

[1] https://lkml.org/lkml/2018/5/5/159

Cc: Alok Kataria <akataria@vmware.com>
Cc: Christopher Li <sparse@chrisli.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: linux-sparse@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: virtualization@lists.linux-foundation.org
Cc: x86@kernel.org

Nadav Amit (8):
  x86: objtool: use asm macro for better compiler decisions
  x86: bug: prevent gcc distortions
  x86: alternative: macrofy locks for better inlining
  x86: prevent inline distortion by paravirt ops
  x86: refcount: prevent gcc distortions
  x86: removing unneeded new-lines
  ilog2: preventing compiler distortion due to big condition
  bitops: prevent compiler inline decision distortion

 arch/x86/include/asm/alternative.h    | 28 ++++++++++----
 arch/x86/include/asm/asm.h            |  4 +-
 arch/x86/include/asm/bitops.h         |  8 ++--
 arch/x86/include/asm/bug.h            | 48 ++++++++++++++---------
 arch/x86/include/asm/cmpxchg.h        | 10 ++---
 arch/x86/include/asm/paravirt_types.h | 53 +++++++++++++++-----------
 arch/x86/include/asm/refcount.h       | 55 ++++++++++++++++-----------
 arch/x86/include/asm/special_insns.h  | 12 +++---
 include/linux/compiler.h              | 29 ++++++++++----
 include/linux/log2.h                  | 11 +++---
 10 files changed, 156 insertions(+), 102 deletions(-)

-- 
2.17.0

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC 0/8] Improving compiler inlining decisions
@ 2018-05-15 14:11   ` Nadav Amit
  0 siblings, 0 replies; 35+ messages in thread
From: Nadav Amit @ 2018-05-15 14:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: nadav.amit, Nadav Amit, Alok Kataria, Christopher Li,
	H. Peter Anvin, Ingo Molnar, Jan Beulich, Jonathan Corbet,
	Josh Poimboeuf, Juergen Gross, Kees Cook, linux-sparse,
	Peter Zijlstra, Randy Dunlap, Thomas Gleixner, virtualization,
	x86

This patch-set deals with an interesting yet stupid problem: code that
does not get inlined despite its simplicity.

I find 5 classes of causes:

1. Inline assembly blocks in which code and data are added to
alternative sections. The compiler is oblivious to the content of the
blocks and assumes their cost in space and time is proportional to the
number of the perceived assembly "instruction", according to the number
of newlines and semicolons. Alternatives, paravirt and other mechanisms
are affected.

2. Inline assembly with redundant new-lines and semicolons. Similarly to
(1) this code is considered "heavier" than it actually is.

3. Code with constant value optimizations. Quite a few parts of the
kernel check whether a variable is constant (using
__builtin_constant_p()) and perform heavy computations in that case.
These computations are eventually optimized out so they do not land in
the binary. However, the cost of these computations is also associated
with the calling function, which might prevent inlining of the calling
function. ilog2() is an example for such case.

4. Code that is marked with the "cold" attribute, including all the
__init functions. Some may consider it the desired behavior.

5. Code that is marked with a different optimization levels. This
affects for example vmx_vcpu_run(), inducing overheads of up to 10% on
exit.


This patch-set deals with some instances of first 3 classes. 

For (1) we insert an assembly macro, and call it from the inline
assembly block.  As a result, the compiler sees a single "instruction"
and assigns the more appropriate cost to the code.

For (2) the solution is trivial: just remove the newlines.

(3) is somewhat tricky. The proposed solution is to use
__builtin_choose_expr() to check whether a variable is actually constant
instead of using an if-condition or the C ternary operator.
__builtin_choose_expr() is evaluated earlier in the compilation, so it
allows the compiler to associate the right cost for the variable case
before the inlining decisions take place.  So far so good.

Still, there is a drawback. Since __builtin_choose_expr() is evaluated
earlier, it can fail to recognize constants, which an if-condition would
recognize correctly.  As a result, this patch-set only applies it to the
simplest cases.

Overall this patch-set slightly increases the kernel size (my build was
done using localmodconfig + localyesconfig for the record):

   text    data     bss     dec     hex filename
18126699 10066728 2936832 31130259 1db0293 ./vmlinux before
18149210 10064048 2936832 31150090 1db500a ./vmlinux after (+0.06%)

The patch-set eliminates many of the static text symbols:
Before: 40033
After:  39632   (-10%)

There is a measurable effect on performance in some cases. A loop of
MADV_DONTNEED/page-fault shows a 2% performance improvement with this
patch-set.

Some inline comments or self-explaining C macros might still be needed.

[1] https://lkml.org/lkml/2018/5/5/159

Cc: Alok Kataria <akataria@vmware.com>
Cc: Christopher Li <sparse@chrisli.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: linux-sparse@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: virtualization@lists.linux-foundation.org
Cc: x86@kernel.org

Nadav Amit (8):
  x86: objtool: use asm macro for better compiler decisions
  x86: bug: prevent gcc distortions
  x86: alternative: macrofy locks for better inlining
  x86: prevent inline distortion by paravirt ops
  x86: refcount: prevent gcc distortions
  x86: removing unneeded new-lines
  ilog2: preventing compiler distortion due to big condition
  bitops: prevent compiler inline decision distortion

 arch/x86/include/asm/alternative.h    | 28 ++++++++++----
 arch/x86/include/asm/asm.h            |  4 +-
 arch/x86/include/asm/bitops.h         |  8 ++--
 arch/x86/include/asm/bug.h            | 48 ++++++++++++++---------
 arch/x86/include/asm/cmpxchg.h        | 10 ++---
 arch/x86/include/asm/paravirt_types.h | 53 +++++++++++++++-----------
 arch/x86/include/asm/refcount.h       | 55 ++++++++++++++++-----------
 arch/x86/include/asm/special_insns.h  | 12 +++---
 include/linux/compiler.h              | 29 ++++++++++----
 include/linux/log2.h                  | 11 +++---
 10 files changed, 156 insertions(+), 102 deletions(-)

-- 
2.17.0

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC 1/8] x86: objtool: use asm macro for better compiler decisions
  2018-05-15 14:11 ` Nadav Amit
@ 2018-05-15 14:11   ` Nadav Amit
  -1 siblings, 0 replies; 35+ messages in thread
From: Nadav Amit @ 2018-05-15 14:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: nadav.amit, Nadav Amit, Christopher Li, linux-sparse

GCC considers the number of statements in inlined assembly blocks,
according to new-lines and semicolons, as an indication to the cost of
the block in time and space. This data is distorted by the kernel code,
which puts information in alternative sections. As a result, the
compiler may perform incorrect inlining and branch optimizations.

In the case of objtool, this distortion is extreme, since anyhow the
annotations of objtool are discarded during linkage.

The solution is to set an assembly macro and call it from the inlinedv
assembly block. As a result GCC considers the inline assembly block as
a single instruction.

This patch slightly increases the kernel size.

   text    data     bss     dec     hex filename
18126699 10066728 2936832 31130259 1db0293 ./vmlinux before
18126824 10067268 2936832 31130924 1db052c ./vmlinux after (+665)

But allows more aggressive inlining. Static text symbols:
Before: 40033
After: 40015 (-18)

Cc: Christopher Li <sparse@chrisli.org>
Cc: linux-sparse@vger.kernel.org

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 include/linux/compiler.h | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index ab4711c63601..369753000541 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -98,17 +98,30 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
  * The __COUNTER__ based labels are a hack to make each instance of the macros
  * unique, to convince GCC not to merge duplicate inline asm statements.
  */
+
+asm ("\n"
+	".macro __annotate_reachable counter:req\n"
+	"\\counter:\n\t"
+	".pushsection .discard.reachable\n\t"
+	".long \\counter\\()b -.\n\t"
+	".popsection\n"
+	".endm\n");
+
 #define annotate_reachable() ({						\
-	asm volatile("%c0:\n\t"						\
-		     ".pushsection .discard.reachable\n\t"		\
-		     ".long %c0b - .\n\t"				\
-		     ".popsection\n\t" : : "i" (__COUNTER__));		\
+	asm volatile("__annotate_reachable %c0" : : "i" (__COUNTER__));	\
 })
+
+asm ("\n"
+	".macro __annotate_unreachable counter:req\n"
+	"\\counter:\n\t"
+	".pushsection .discard.unreachable\n\t"
+	".long \\counter\\()b -.\n\t"
+	".popsection\n"
+	".endm");
+
 #define annotate_unreachable() ({					\
-	asm volatile("%c0:\n\t"						\
-		     ".pushsection .discard.unreachable\n\t"		\
-		     ".long %c0b - .\n\t"				\
-		     ".popsection\n\t" : : "i" (__COUNTER__));		\
+	asm volatile("__annotate_unreachable %c0" : :			\
+		     "i" (__COUNTER__));				\
 })
 #define ASM_UNREACHABLE							\
 	"999:\n\t"							\
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC 2/8] x86: bug: prevent gcc distortions
  2018-05-15 14:11 ` Nadav Amit
                   ` (10 preceding siblings ...)
  (?)
@ 2018-05-15 14:11 ` Nadav Amit
  -1 siblings, 0 replies; 35+ messages in thread
From: Nadav Amit @ 2018-05-15 14:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: nadav.amit, Nadav Amit, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, Peter Zijlstra, Josh Poimboeuf

GCC considers the number of statements in an inline assembly block,
counted by new-lines and semicolons, as an indication of the cost of
the block in time and space. This estimate is distorted by the kernel
code, which puts information in alternative sections. As a result, the
compiler may perform incorrect inlining and branch optimizations.

The solution is to define an assembly macro and call it from the inline
assembly block. As a result, GCC considers the inline assembly block to
be a single instruction.

This patch increases the kernel size:

   text	   data	    bss	    dec	    hex	filename
18126824 10067268 2936832 31130924 1db052c ./vmlinux before
18127205 10068388 2936832 31132425 1db0b09 ./vmlinux after (+1501)

But it enables more aggressive inlining (and probably better branch
decisions). The number of static text symbols in vmlinux is lower.

Before:	40015
After:	39860	(-155)

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/bug.h | 48 +++++++++++++++++++++++---------------
 1 file changed, 29 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/bug.h b/arch/x86/include/asm/bug.h
index 6804d6642767..a2d981beb6cb 100644
--- a/arch/x86/include/asm/bug.h
+++ b/arch/x86/include/asm/bug.h
@@ -30,33 +30,43 @@
 
 #ifdef CONFIG_DEBUG_BUGVERBOSE
 
-#define _BUG_FLAGS(ins, flags)						\
+asm("\n"
+	".macro __BUG_FLAGS ins:req file:req line:req flags:req size:req\n"
+	"1:\t \\ins\n\t"
+	".pushsection __bug_table,\"aw\"\n"
+	"2:\t "__BUG_REL(1b)            "\t# bug_entry::bug_addr\n\t"
+	__BUG_REL(\\file)		"\t# bug_entry::file\n\t"
+	".word \\line"			"\t# bug_entry::line\n\t"
+	".word \\flags"			"\t# bug_entry::flags\n\t"
+	".org 2b+\\size\n\t"
+	".popsection\n"
+	".endm");
+
+#define _BUG_FLAGS(ins, flags)                                          \
 do {									\
-	asm volatile("1:\t" ins "\n"					\
-		     ".pushsection __bug_table,\"aw\"\n"		\
-		     "2:\t" __BUG_REL(1b) "\t# bug_entry::bug_addr\n"	\
-		     "\t"  __BUG_REL(%c0) "\t# bug_entry::file\n"	\
-		     "\t.word %c1"        "\t# bug_entry::line\n"	\
-		     "\t.word %c2"        "\t# bug_entry::flags\n"	\
-		     "\t.org 2b+%c3\n"					\
-		     ".popsection"					\
-		     : : "i" (__FILE__), "i" (__LINE__),		\
-			 "i" (flags),					\
+	asm volatile("__BUG_FLAGS \"" ins "\" %c0 %c1 %c2 %c3"		\
+		     : : "i" (__FILE__), "i" (__LINE__),                \
+			 "i" (flags),                                   \
 			 "i" (sizeof(struct bug_entry)));		\
 } while (0)
 
 #else /* !CONFIG_DEBUG_BUGVERBOSE */
 
+asm("\n"
+	".macro __BUG_FLAGS ins:req flags:req size:req\n"
+	"1:\t \\ins\n\t"
+	".pushsection __bug_table,\"aw\"\n"
+	"2:\t" __BUG_REL(1b)		"\t# bug_entry::bug_addr\n\t"
+	".word \\flags"			"\t# bug_entry::flags\n\t"
+	".org 2b+\\size\n\t"
+	".popsection\n"
+	".endm");
+
 #define _BUG_FLAGS(ins, flags)						\
 do {									\
-	asm volatile("1:\t" ins "\n"					\
-		     ".pushsection __bug_table,\"aw\"\n"		\
-		     "2:\t" __BUG_REL(1b) "\t# bug_entry::bug_addr\n"	\
-		     "\t.word %c0"        "\t# bug_entry::flags\n"	\
-		     "\t.org 2b+%c1\n"					\
-		     ".popsection"					\
-		     : : "i" (flags),					\
-			 "i" (sizeof(struct bug_entry)));		\
+	asm volatile("__BUG_FLAGS \"" ins "\" %c0 %c1"			\
+		    : : "i" (flags),					\
+			"i" (sizeof(struct bug_entry)));		\
 } while (0)
 
 #endif /* CONFIG_DEBUG_BUGVERBOSE */
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC 3/8] x86: alternative: macrofy locks for better inlining
  2018-05-15 14:11 ` Nadav Amit
                   ` (11 preceding siblings ...)
  (?)
@ 2018-05-15 14:11 ` Nadav Amit
  -1 siblings, 0 replies; 35+ messages in thread
From: Nadav Amit @ 2018-05-15 14:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: nadav.amit, Nadav Amit, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, Josh Poimboeuf

GCC considers the number of statements in an inline assembly block,
counted by new-lines and semicolons, as an indication of the cost of
the block in time and space. This estimate is distorted by the kernel
code, which puts information in alternative sections. As a result, the
compiler may perform incorrect inlining and branch optimizations.

The solution is to define an assembly macro and call it from the inline
assembly block. As a result, GCC considers the inline assembly block to
be a single instruction.

This patch handles the LOCK prefix, allowing more aggressive inlining.
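
As a usage sketch (illustrative; the function and variable are made
up), call sites keep their current shape and only the expansion
changes:

	static inline void example_inc(int *v)
	{
		/* With this patch the string below expands to the
		 * single line "__LOCK_PREFIX incl %0" rather than a
		 * multi-line .pushsection sequence, so GCC prices it
		 * as one instruction. */
		asm volatile(LOCK_PREFIX "incl %0" : "+m" (*v));
	}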

   text	   data	    bss	    dec	    hex	filename
18127205 10068388 2936832 31132425 1db0b09 ./vmlinux before
18131468 10068488 2936832 31136788 1db1c14 ./vmlinux after (+4363)

Static text symbols:
Before:	39860
After:	39788	(-72)

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Josh Poimboeuf <jpoimboe@redhat.com>

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/alternative.h | 28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 4cd6a3b71824..daa68ad51665 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -31,14 +31,26 @@
  */
 
 #ifdef CONFIG_SMP
-#define LOCK_PREFIX_HERE \
-		".pushsection .smp_locks,\"a\"\n"	\
-		".balign 4\n"				\
-		".long 671f - .\n" /* offset */		\
-		".popsection\n"				\
-		"671:"
-
-#define LOCK_PREFIX LOCK_PREFIX_HERE "\n\tlock; "
+
+asm ("\n"
+	".macro __LOCK_PREFIX_HERE\n\t"
+	".pushsection .smp_locks,\"a\"\n\t"
+	".balign 4\n\t"
+	".long 671f - .\n\t" /* offset */
+	".popsection\n"
+	"671:\n"
+	".endm");
+
+#define LOCK_PREFIX_HERE "__LOCK_PREFIX_HERE"
+
+asm ("\n"
+	".macro __LOCK_PREFIX ins:vararg\n\t"
+	"__LOCK_PREFIX_HERE\n\t"
+	"lock; \\ins\n"
+	".endm");
+
+#define LOCK_PREFIX "__LOCK_PREFIX "
+
 
 #else /* ! CONFIG_SMP */
 #define LOCK_PREFIX_HERE ""
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC 4/8] x86: prevent inline distortion by paravirt ops
  2018-05-15 14:11 ` Nadav Amit
                   ` (12 preceding siblings ...)
  (?)
@ 2018-05-15 14:11 ` Nadav Amit
  -1 siblings, 0 replies; 35+ messages in thread
From: Nadav Amit @ 2018-05-15 14:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: nadav.amit, Nadav Amit, Juergen Gross, Alok Kataria,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	virtualization

GCC considers the number of statements in an inline assembly block,
counted by new-lines and semicolons, as an indication of the cost of
the block in time and space. This estimate is distorted by the kernel
code, which puts information in alternative sections. As a result, the
compiler may perform incorrect inlining and branch optimizations.

The solution is to define an assembly macro and call it from the inline
assembly block. As a result, GCC considers the inline assembly block to
be a single instruction.

The effect of the patch is more aggressive inlining, which also
increases the kernel size.

   text	   data	    bss	    dec	    hex	filename
18131468 10068488 2936832 31136788 1db1c14 ./vmlinux before
18146418 10064100 2936832 31147350 1db4556 ./vmlinux after (+10562)

Static text symbols:
Before:	39788
After:	39673	(-115)

Cc: Juergen Gross <jgross@suse.com>
Cc: Alok Kataria <akataria@vmware.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: virtualization@lists.linux-foundation.org

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/paravirt_types.h | 53 +++++++++++++++------------
 1 file changed, 30 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 180bc0bff0fb..3fe99dab8266 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -346,20 +346,37 @@ extern struct pv_lock_ops pv_lock_ops;
 /*
  * Generate some code, and mark it as patchable by the
  * apply_paravirt() alternate instruction patcher.
+ *
+ * This generates an indirect call based on the operation type number.
+ * The type number, computed in PARAVIRT_PATCH, is derived from the
+ * offset into the paravirt_patch_template structure, and can therefore be
+ * freely converted back into a structure offset.
  */
-#define _paravirt_alt(insn_string, type, clobber)	\
-	"771:\n\t" insn_string "\n" "772:\n"		\
-	".pushsection .parainstructions,\"a\"\n"	\
-	_ASM_ALIGN "\n"					\
-	_ASM_PTR " 771b\n"				\
-	"  .byte " type "\n"				\
-	"  .byte 772b-771b\n"				\
-	"  .short " clobber "\n"			\
+
+asm ("\n"
+	".macro __paravirt_alt type:req clobber:req pv_opptr:req\n"
+	"771:\n\t"
+	ANNOTATE_RETPOLINE_SAFE "\n\t"
+	"call *\\pv_opptr\n"
+	"772:\n\t"
+	".pushsection .parainstructions,\"a\"\n\t"
+	_ASM_ALIGN "\n\t"
+	_ASM_PTR " 771b\n\t"
+	".byte \\type\n\t"
+	".byte 772b-771b\n\t"
+	".short \\clobber\n\t"
 	".popsection\n"
+	".endm");
+
+#define _paravirt_alt(type, clobber, pv_opptr)				\
+	"__paravirt_alt type=" __stringify(type)			\
+	" clobber=" __stringify(clobber)				\
+	" pv_opptr=" __stringify(pv_opptr) "\n\t"
 
 /* Generate patchable code, with the default asm parameters. */
-#define paravirt_alt(insn_string)					\
-	_paravirt_alt(insn_string, "%c[paravirt_typenum]", "%c[paravirt_clobber]")
+#define paravirt_alt							\
+	_paravirt_alt("%c[paravirt_typenum]", "%c[paravirt_clobber]",	\
+		      "%c[paravirt_opptr]")
 
 /* Simple instruction patching code. */
 #define NATIVE_LABEL(a,x,b) "\n\t.globl " a #x "_" #b "\n" a #x "_" #b ":\n\t"
@@ -387,16 +404,6 @@ unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
 
 int paravirt_disable_iospace(void);
 
-/*
- * This generates an indirect call based on the operation type number.
- * The type number, computed in PARAVIRT_PATCH, is derived from the
- * offset into the paravirt_patch_template structure, and can therefore be
- * freely converted back into a structure offset.
- */
-#define PARAVIRT_CALL					\
-	ANNOTATE_RETPOLINE_SAFE				\
-	"call *%c[paravirt_opptr];"
-
 /*
  * These macros are intended to wrap calls through one of the paravirt
  * ops structs, so that they can be later identified and patched at
@@ -534,7 +541,7 @@ int paravirt_disable_iospace(void);
 		/* since this condition will never hold */		\
 		if (sizeof(rettype) > sizeof(unsigned long)) {		\
 			asm volatile(pre				\
-				     paravirt_alt(PARAVIRT_CALL)	\
+				     paravirt_alt			\
 				     post				\
 				     : call_clbr, ASM_CALL_CONSTRAINT	\
 				     : paravirt_type(op),		\
@@ -544,7 +551,7 @@ int paravirt_disable_iospace(void);
 			__ret = (rettype)((((u64)__edx) << 32) | __eax); \
 		} else {						\
 			asm volatile(pre				\
-				     paravirt_alt(PARAVIRT_CALL)	\
+				     paravirt_alt			\
 				     post				\
 				     : call_clbr, ASM_CALL_CONSTRAINT	\
 				     : paravirt_type(op),		\
@@ -571,7 +578,7 @@ int paravirt_disable_iospace(void);
 		PVOP_VCALL_ARGS;					\
 		PVOP_TEST_NULL(op);					\
 		asm volatile(pre					\
-			     paravirt_alt(PARAVIRT_CALL)		\
+			     paravirt_alt				\
 			     post					\
 			     : call_clbr, ASM_CALL_CONSTRAINT		\
 			     : paravirt_type(op),			\
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC 5/8] x86: refcount: prevent gcc distortions
  2018-05-15 14:11 ` Nadav Amit
                   ` (13 preceding siblings ...)
  (?)
@ 2018-05-15 14:11 ` Nadav Amit
  2018-05-16  7:14   ` Jan Beulich
  -1 siblings, 1 reply; 35+ messages in thread
From: Nadav Amit @ 2018-05-15 14:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: nadav.amit, Nadav Amit, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, Kees Cook, Jan Beulich, Josh Poimboeuf

GCC considers the number of statements in an inline assembly block,
counted by new-lines and semicolons, as an indication of the cost of
the block in time and space. This estimate is distorted by the kernel
code, which puts information in alternative sections. As a result, the
compiler may perform incorrect inlining and branch optimizations.

The solution is to define an assembly macro and call it from the inline
assembly block. As a result, GCC considers the inline assembly block to
be a single instruction.

This patch allows functions such as __get_seccomp_filter() to be
inlined. The effect of the patch on the kernel size is as follows:

   text	   data	    bss	    dec	    hex	filename
18146418 10064100 2936832 31147350 1db4556 ./vmlinux before
18148228 10063968 2936832 31149028 1db4be4 ./vmlinux after (+1678)

Static text symbols:
Before:	39673
After:	39649	(-24)

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Kees Cook <keescook@chromium.org>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/refcount.h | 55 ++++++++++++++++++++-------------
 1 file changed, 33 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/refcount.h b/arch/x86/include/asm/refcount.h
index 4cf11d88d3b3..a668c534206d 100644
--- a/arch/x86/include/asm/refcount.h
+++ b/arch/x86/include/asm/refcount.h
@@ -14,34 +14,43 @@
  * central refcount exception. The fixup address for the exception points
  * back to the regular execution flow in .text.
  */
-#define _REFCOUNT_EXCEPTION				\
-	".pushsection .text..refcount\n"		\
-	"111:\tlea %[counter], %%" _ASM_CX "\n"		\
-	"112:\t" ASM_UD2 "\n"				\
-	ASM_UNREACHABLE					\
-	".popsection\n"					\
-	"113:\n"					\
+
+asm ("\n"
+	".macro __REFCOUNT_EXCEPTION counter:vararg\n\t"
+	".pushsection .text..refcount\n"
+	"111:\tlea \\counter, %" _ASM_CX "\n"
+	"112:\t" ASM_UD2 "\n\t"
+	ASM_UNREACHABLE
+	".popsection\n\t"
+	"113:\n"
 	_ASM_EXTABLE_REFCOUNT(112b, 113b)
+	".endm");
 
 /* Trigger refcount exception if refcount result is negative. */
-#define REFCOUNT_CHECK_LT_ZERO				\
-	"js 111f\n\t"					\
-	_REFCOUNT_EXCEPTION
+asm ("\n"
+	".macro __REFCOUNT_CHECK_LT_ZERO counter:vararg\n"
+	"js 111f\n\t"
+	"__REFCOUNT_EXCEPTION \\counter\n"
+	".endm");
 
 /* Trigger refcount exception if refcount result is zero or negative. */
-#define REFCOUNT_CHECK_LE_ZERO				\
-	"jz 111f\n\t"					\
-	REFCOUNT_CHECK_LT_ZERO
+asm ("\n"
+	".macro __REFCOUNT_CHECK_LE_ZERO counter:vararg\n"
+	"jz 111f\n\t"
+	"__REFCOUNT_CHECK_LT_ZERO counter=\\counter\n"
+	".endm");
 
 /* Trigger refcount exception unconditionally. */
-#define REFCOUNT_ERROR					\
-	"jmp 111f\n\t"					\
-	_REFCOUNT_EXCEPTION
+asm ("\n"
+	".macro __REFCOUNT_ERROR counter:vararg\n\t"
+	"jmp 111f\n\t"
+	"__REFCOUNT_EXCEPTION counter=\\counter\n"
+	".endm");
 
 static __always_inline void refcount_add(unsigned int i, refcount_t *r)
 {
 	asm volatile(LOCK_PREFIX "addl %1,%0\n\t"
-		REFCOUNT_CHECK_LT_ZERO
+		"__REFCOUNT_CHECK_LT_ZERO %[counter]"
 		: [counter] "+m" (r->refs.counter)
 		: "ir" (i)
 		: "cc", "cx");
@@ -50,7 +59,7 @@ static __always_inline void refcount_add(unsigned int i, refcount_t *r)
 static __always_inline void refcount_inc(refcount_t *r)
 {
 	asm volatile(LOCK_PREFIX "incl %0\n\t"
-		REFCOUNT_CHECK_LT_ZERO
+		"__REFCOUNT_CHECK_LT_ZERO %[counter]"
 		: [counter] "+m" (r->refs.counter)
 		: : "cc", "cx");
 }
@@ -58,7 +67,7 @@ static __always_inline void refcount_inc(refcount_t *r)
 static __always_inline void refcount_dec(refcount_t *r)
 {
 	asm volatile(LOCK_PREFIX "decl %0\n\t"
-		REFCOUNT_CHECK_LE_ZERO
+		"__REFCOUNT_CHECK_LE_ZERO %[counter]"
 		: [counter] "+m" (r->refs.counter)
 		: : "cc", "cx");
 }
@@ -66,13 +75,15 @@ static __always_inline void refcount_dec(refcount_t *r)
 static __always_inline __must_check
 bool refcount_sub_and_test(unsigned int i, refcount_t *r)
 {
-	GEN_BINARY_SUFFIXED_RMWcc(LOCK_PREFIX "subl", REFCOUNT_CHECK_LT_ZERO,
+	GEN_BINARY_SUFFIXED_RMWcc(LOCK_PREFIX "subl",
+				  "__REFCOUNT_CHECK_LT_ZERO %[counter]",
 				  r->refs.counter, "er", i, "%0", e, "cx");
 }
 
 static __always_inline __must_check bool refcount_dec_and_test(refcount_t *r)
 {
-	GEN_UNARY_SUFFIXED_RMWcc(LOCK_PREFIX "decl", REFCOUNT_CHECK_LT_ZERO,
+	GEN_UNARY_SUFFIXED_RMWcc(LOCK_PREFIX "decl",
+				 "__REFCOUNT_CHECK_LT_ZERO %[counter]",
 				 r->refs.counter, "%0", e, "cx");
 }
 
@@ -90,7 +101,7 @@ bool refcount_add_not_zero(unsigned int i, refcount_t *r)
 
 		/* Did we try to increment from/to an undesirable state? */
 		if (unlikely(c < 0 || c == INT_MAX || result < c)) {
-			asm volatile(REFCOUNT_ERROR
+			asm volatile("__REFCOUNT_ERROR %[counter]"
 				     : : [counter] "m" (r->refs.counter)
 				     : "cc", "cx");
 			break;
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC 6/8] x86: removing unneeded new-lines
  2018-05-15 14:11 ` Nadav Amit
                   ` (14 preceding siblings ...)
  (?)
@ 2018-05-15 14:11 ` Nadav Amit
  -1 siblings, 0 replies; 35+ messages in thread
From: Nadav Amit @ 2018-05-15 14:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: nadav.amit, Nadav Amit, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, Josh Poimboeuf

GCC considers the number of statements in an inline assembly block,
counted by new-lines and semicolons, as an indication of the cost of
the block in time and space. This estimate is distorted by the kernel
code, which puts information in alternative sections. As a result, the
compiler may perform incorrect inlining and branch optimizations.

This patch removes unneeded new-lines and semicolons to prevent such
distortion.
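
For example (a reduced sketch based on the native_read_cr2() change in
the diff below), a trailing "\n\t" makes GCC perceive a second,
nonexistent instruction:

	static inline unsigned long read_cr2_sketch(void)
	{
		unsigned long val;

		/* Perceived as two instructions with a trailing
		 * "\n\t" in the template, as one without it. */
		asm volatile("mov %%cr2,%0" : "=r" (val));
		return val;
	}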

Functions such as nfs_io_completion_put() now get inlined. The overall
effect is not apparent in the absolute numbers, but the change seems to
enable slightly better inlining:

   text	   data	    bss	    dec	    hex	filename
18148228 10063968 2936832 31149028 1db4be4 ./vmlinux before
18148888 10064016 2936832 31149736 1db4ea8 ./vmlinux after (+708)

Static text symbols:
Before:	39649
After:	39650	(+1)

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Josh Poimboeuf <jpoimboe@redhat.com>

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/asm.h           |  4 ++--
 arch/x86/include/asm/cmpxchg.h       | 10 +++++-----
 arch/x86/include/asm/special_insns.h | 12 ++++++------
 3 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index 219faaec51df..571ceec97976 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -51,10 +51,10 @@
  * The output operand must be type "bool".
  */
 #ifdef __GCC_ASM_FLAG_OUTPUTS__
-# define CC_SET(c) "\n\t/* output condition code " #c "*/\n"
+# define CC_SET(c) "\n\t/* output condition code " #c "*/"
 # define CC_OUT(c) "=@cc" #c
 #else
-# define CC_SET(c) "\n\tset" #c " %[_cc_" #c "]\n"
+# define CC_SET(c) "\n\tset" #c " %[_cc_" #c "]"
 # define CC_OUT(c) [_cc_ ## c] "=qm"
 #endif
 
diff --git a/arch/x86/include/asm/cmpxchg.h b/arch/x86/include/asm/cmpxchg.h
index e3efd8a06066..2be9582fcb2e 100644
--- a/arch/x86/include/asm/cmpxchg.h
+++ b/arch/x86/include/asm/cmpxchg.h
@@ -44,22 +44,22 @@ extern void __add_wrong_size(void)
 	        __typeof__ (*(ptr)) __ret = (arg);			\
 		switch (sizeof(*(ptr))) {				\
 		case __X86_CASE_B:					\
-			asm volatile (lock #op "b %b0, %1\n"		\
+			asm volatile (lock #op "b %b0, %1"		\
 				      : "+q" (__ret), "+m" (*(ptr))	\
 				      : : "memory", "cc");		\
 			break;						\
 		case __X86_CASE_W:					\
-			asm volatile (lock #op "w %w0, %1\n"		\
+			asm volatile (lock #op "w %w0, %1"		\
 				      : "+r" (__ret), "+m" (*(ptr))	\
 				      : : "memory", "cc");		\
 			break;						\
 		case __X86_CASE_L:					\
-			asm volatile (lock #op "l %0, %1\n"		\
+			asm volatile (lock #op "l %0, %1"		\
 				      : "+r" (__ret), "+m" (*(ptr))	\
 				      : : "memory", "cc");		\
 			break;						\
 		case __X86_CASE_Q:					\
-			asm volatile (lock #op "q %q0, %1\n"		\
+			asm volatile (lock #op "q %q0, %1"		\
 				      : "+r" (__ret), "+m" (*(ptr))	\
 				      : : "memory", "cc");		\
 			break;						\
@@ -134,7 +134,7 @@ extern void __add_wrong_size(void)
 	__raw_cmpxchg((ptr), (old), (new), (size), LOCK_PREFIX)
 
 #define __sync_cmpxchg(ptr, old, new, size)				\
-	__raw_cmpxchg((ptr), (old), (new), (size), "lock; ")
+	__raw_cmpxchg((ptr), (old), (new), (size), "lock ")
 
 #define __cmpxchg_local(ptr, old, new, size)				\
 	__raw_cmpxchg((ptr), (old), (new), (size), "")
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 317fc59b512c..9c56059aaf24 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -19,7 +19,7 @@ extern unsigned long __force_order;
 static inline unsigned long native_read_cr0(void)
 {
 	unsigned long val;
-	asm volatile("mov %%cr0,%0\n\t" : "=r" (val), "=m" (__force_order));
+	asm volatile("mov %%cr0,%0" : "=r" (val), "=m" (__force_order));
 	return val;
 }
 
@@ -31,7 +31,7 @@ static inline void native_write_cr0(unsigned long val)
 static inline unsigned long native_read_cr2(void)
 {
 	unsigned long val;
-	asm volatile("mov %%cr2,%0\n\t" : "=r" (val), "=m" (__force_order));
+	asm volatile("mov %%cr2,%0" : "=r" (val), "=m" (__force_order));
 	return val;
 }
 
@@ -43,7 +43,7 @@ static inline void native_write_cr2(unsigned long val)
 static inline unsigned long __native_read_cr3(void)
 {
 	unsigned long val;
-	asm volatile("mov %%cr3,%0\n\t" : "=r" (val), "=m" (__force_order));
+	asm volatile("mov %%cr3,%0" : "=r" (val), "=m" (__force_order));
 	return val;
 }
 
@@ -67,7 +67,7 @@ static inline unsigned long native_read_cr4(void)
 		     : "=r" (val), "=m" (__force_order) : "0" (0));
 #else
 	/* CR4 always exists on x86_64. */
-	asm volatile("mov %%cr4,%0\n\t" : "=r" (val), "=m" (__force_order));
+	asm volatile("mov %%cr4,%0" : "=r" (val), "=m" (__force_order));
 #endif
 	return val;
 }
@@ -101,7 +101,7 @@ static inline u32 __read_pkru(void)
 	 * "rdpkru" instruction.  Places PKRU contents in to EAX,
 	 * clears EDX and requires that ecx=0.
 	 */
-	asm volatile(".byte 0x0f,0x01,0xee\n\t"
+	asm volatile(".byte 0x0f,0x01,0xee"
 		     : "=a" (pkru), "=d" (edx)
 		     : "c" (ecx));
 	return pkru;
@@ -115,7 +115,7 @@ static inline void __write_pkru(u32 pkru)
 	 * "wrpkru" instruction.  Loads contents in EAX to PKRU,
 	 * requires that ecx = edx = 0.
 	 */
-	asm volatile(".byte 0x0f,0x01,0xef\n\t"
+	asm volatile(".byte 0x0f,0x01,0xef"
 		     : : "a" (pkru), "c"(ecx), "d"(edx));
 }
 #else
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC 7/8] ilog2: preventing compiler distortion due to big condition
  2018-05-15 14:11 ` Nadav Amit
                   ` (15 preceding siblings ...)
  (?)
@ 2018-05-15 14:11 ` Nadav Amit
  -1 siblings, 0 replies; 35+ messages in thread
From: Nadav Amit @ 2018-05-15 14:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: nadav.amit, Nadav Amit, Randy Dunlap, Jonathan Corbet

There are several places in the kernel with a condition based on
whether the input is known to be constant at compilation time. If it
is, complex computations take place, but only during compilation.

Although this scheme works correctly, when GCC computes the expected
cost of this code in time and size, it disregards the fact that the
computations for the "constant" case never take place at runtime in
the non-constant case. The cost of these functions is therefore
considered to be much higher than it really is. As a result, the
compiler's inlining and branching decisions are distorted.
Specifically, functions are less likely to be inlined due to their
perceived big size and execution time.

One of these cases is ilog2(), which evaluates a complicated
conditional for constant inputs.

The solution is to use __builtin_choose_expr() to detect whether the
input is constant instead of a C condition. GCC evaluates the builtin
earlier, which allows it to improve code-generation decisions.
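
A reduced sketch of the two forms (the constant arm is truncated and
the macro names are invented; __ilog2_u64() is the real fallback):

	/* Evaluated late: the constant arm is charged to the caller's
	 * size estimate until optimization removes it. */
	#define log2_late(n)						\
		(__builtin_constant_p(n) ?				\
			((n) & (1ULL << 63) ? 63 : 0) :			\
			__ilog2_u64(n))

	/* Evaluated early: the unused arm is dropped before inlining
	 * decisions are made. */
	#define log2_early(n)						\
		__builtin_choose_expr(__builtin_constant_p(n),		\
			((n) & (1ULL << 63) ? 63 : 0),			\
			__ilog2_u64(n))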

This patch allows functions such as tbl_size() to be inlined.

   text	   data	    bss	    dec	    hex	filename
18148888 10064016 2936832 31149736 1db4ea8 ./vmlinux before
18149165 10064176 2936832 31150173 1db505d ./vmlinux after (+437)

Static text symbols:
Before:	39650
After:	39643	(-7)

Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Jonathan Corbet <corbet@lwn.net>

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 include/linux/log2.h | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/linux/log2.h b/include/linux/log2.h
index 41a1ae010993..aa4dac874339 100644
--- a/include/linux/log2.h
+++ b/include/linux/log2.h
@@ -83,7 +83,8 @@ unsigned long __rounddown_pow_of_two(unsigned long n)
  */
 #define ilog2(n)				\
 (						\
-	__builtin_constant_p(n) ? (		\
+	__builtin_choose_expr(			\
+		__builtin_constant_p(n), (	\
 		(n) < 2 ? 0 :			\
 		(n) & (1ULL << 63) ? 63 :	\
 		(n) & (1ULL << 62) ? 62 :	\
@@ -147,10 +148,10 @@ unsigned long __rounddown_pow_of_two(unsigned long n)
 		(n) & (1ULL <<  4) ?  4 :	\
 		(n) & (1ULL <<  3) ?  3 :	\
 		(n) & (1ULL <<  2) ?  2 :	\
-		1 ) :				\
-	(sizeof(n) <= 4) ?			\
-	__ilog2_u32(n) :			\
-	__ilog2_u64(n)				\
+		1),				\
+		(sizeof(n) <= 4) ?		\
+		__ilog2_u32(n) :		\
+		__ilog2_u64(n))			\
  )
 
 /**
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC 8/8] bitops: prevent compiler inline decision distortion
  2018-05-15 14:11 ` Nadav Amit
                   ` (16 preceding siblings ...)
  (?)
@ 2018-05-15 14:11 ` Nadav Amit
  2018-05-16 14:09   ` Kees Cook
  -1 siblings, 1 reply; 35+ messages in thread
From: Nadav Amit @ 2018-05-15 14:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: nadav.amit, Nadav Amit, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

There are several places in the kernel with a condition based on
whether the input is known to be constant at compilation time. If it
is, complex computations take place, but only during compilation.

Although this scheme works correctly, when GCC computes the expected
cost of this code in time and size, it disregards the fact that the
computations for the "constant" case never take place at runtime in
the non-constant case. The cost of these functions is therefore
considered to be much higher than it really is. As a result, the
compiler's inlining and branching decisions are distorted.
Specifically, functions are less likely to be inlined due to their
perceived big size and execution time.

One of these cases is test_bit(), which performs some extra
computations for constant inputs.

The solution is to use __builtin_choose_expr() to detect whether the
input is constant instead of a C condition. GCC evaluates the builtin
earlier, which allows it to improve code-generation decisions.

This patch allows functions such as bitmap_pos_to_ord() to be inlined.
Its overall effect on size:

   text	   data	    bss	    dec	    hex	filename
18149165 10064176 2936832 31150173 1db505d ./vmlinux before
18149210 10064048 2936832 31150090 1db500a ./vmlinux after (-83)

Static text symbols:
Before:	39643
After:	39632	(-11)

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/bitops.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 9f645ba57dbb..f1cb1c9125a9 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -349,10 +349,10 @@ static __always_inline bool variable_test_bit(long nr, volatile const unsigned l
 static bool test_bit(int nr, const volatile unsigned long *addr);
 #endif
 
-#define test_bit(nr, addr)			\
-	(__builtin_constant_p((nr))		\
-	 ? constant_test_bit((nr), (addr))	\
-	 : variable_test_bit((nr), (addr)))
+#define test_bit(nr, addr)					\
+	__builtin_choose_expr(__builtin_constant_p((nr)),	\
+		constant_test_bit((nr), (addr)),		\
+		variable_test_bit((nr), (addr)))
 
 /**
  * __ffs - find first set bit in word
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [RFC 1/8] x86: objtool: use asm macro for better compiler decisions
  2018-05-15 14:11   ` Nadav Amit
  (?)
@ 2018-05-15 21:37   ` Josh Triplett
  2018-05-15 21:53     ` Nadav Amit
  -1 siblings, 1 reply; 35+ messages in thread
From: Josh Triplett @ 2018-05-15 21:37 UTC (permalink / raw)
  To: Nadav Amit; +Cc: linux-kernel, nadav.amit, Christopher Li, linux-sparse

On Tue, May 15, 2018 at 07:11:08AM -0700, Nadav Amit wrote:
> GCC considers the number of statements in an inline assembly block,
> counted by new-lines and semicolons, as an indication of the cost of
> the block in time and space. This estimate is distorted by the kernel
> code, which puts information in alternative sections. As a result, the
> compiler may perform incorrect inlining and branch optimizations.
> 
> In the case of objtool, this distortion is extreme, since the objtool
> annotations are discarded during linkage anyhow.
> 
> The solution is to define an assembly macro and call it from the inline
> assembly block. As a result, GCC considers the inline assembly block to
> be a single instruction.
> 
> This patch slightly increases the kernel size.
> 
>    text    data     bss     dec     hex filename
> 18126699 10066728 2936832 31130259 1db0293 ./vmlinux before
> 18126824 10067268 2936832 31130924 1db052c ./vmlinux after (+665)

With what kernel config? What's the text size change for an allnoconfig?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 1/8] x86: objtool: use asm macro for better compiler decisions
  2018-05-15 21:37   ` Josh Triplett
@ 2018-05-15 21:53     ` Nadav Amit
  2018-05-15 21:55       ` Josh Triplett
  0 siblings, 1 reply; 35+ messages in thread
From: Nadav Amit @ 2018-05-15 21:53 UTC (permalink / raw)
  To: Josh Triplett; +Cc: linux-kernel, Christopher Li, linux-sparse

Josh Triplett <josh@joshtriplett.org> wrote:

> On Tue, May 15, 2018 at 07:11:08AM -0700, Nadav Amit wrote:
>> GCC considers the number of statements in an inline assembly block,
>> counted by new-lines and semicolons, as an indication of the cost of
>> the block in time and space. This estimate is distorted by the kernel
>> code, which puts information in alternative sections. As a result, the
>> compiler may perform incorrect inlining and branch optimizations.
>> 
>> In the case of objtool, this distortion is extreme, since the objtool
>> annotations are discarded during linkage anyhow.
>> 
>> The solution is to define an assembly macro and call it from the inline
>> assembly block. As a result, GCC considers the inline assembly block to
>> be a single instruction.
>> 
>> This patch slightly increases the kernel size.
>> 
>>   text    data     bss     dec     hex filename
>> 18126699 10066728 2936832 31130259 1db0293 ./vmlinux before
>> 18126824 10067268 2936832 31130924 1db052c ./vmlinux after (+665)
> 
> With what kernel config? What's the text size change for an allnoconfig?

I use my custom config, which includes the drivers in the image. Other
configs were not appropriate: defconfig does not enable paravirt and
allyesconfig is too heavy.

There is no effect with allnoconfig on size:

   text	   data	    bss	    dec	    hex	filename
1018441	 230148	1215604	2464193	 2599c1	./vmlinux
1018441	 230148	1215604	2464193	 2599c1	./vmlinux

I think it is expected since CONFIG_STACK_VALIDATION is off. No?

Thanks,
Nadav

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 1/8] x86: objtool: use asm macro for better compiler decisions
  2018-05-15 21:53     ` Nadav Amit
@ 2018-05-15 21:55       ` Josh Triplett
  0 siblings, 0 replies; 35+ messages in thread
From: Josh Triplett @ 2018-05-15 21:55 UTC (permalink / raw)
  To: Nadav Amit; +Cc: linux-kernel, Christopher Li, linux-sparse

On Tue, May 15, 2018 at 02:53:52PM -0700, Nadav Amit wrote:
> Josh Triplett <josh@joshtriplett.org> wrote:
> 
> > On Tue, May 15, 2018 at 07:11:08AM -0700, Nadav Amit wrote:
> >> GCC considers the number of statements in an inline assembly block,
> >> counted by new-lines and semicolons, as an indication of the cost of
> >> the block in time and space. This estimate is distorted by the kernel
> >> code, which puts information in alternative sections. As a result, the
> >> compiler may perform incorrect inlining and branch optimizations.
> >> 
> >> In the case of objtool, this distortion is extreme, since the objtool
> >> annotations are discarded during linkage anyhow.
> >> 
> >> The solution is to define an assembly macro and call it from the inline
> >> assembly block. As a result, GCC considers the inline assembly block to
> >> be a single instruction.
> >> 
> >> This patch slightly increases the kernel size.
> >> 
> >>   text    data     bss     dec     hex filename
> >> 18126699 10066728 2936832 31130259 1db0293 ./vmlinux before
> >> 18126824 10067268 2936832 31130924 1db052c ./vmlinux after (+665)
> > 
> > With what kernel config? What's the text size change for an allnoconfig?
> 
> I use my custom config and to include the drivers in the image. Other
> configs were not too appropriate: defconfig does not enable paravirt and
> allyesconfig is too heavy.
> 
> There is no effect with allnoconfig on size:
> 
>    text	   data	    bss	    dec	    hex	filename
> 1018441	 230148	1215604	2464193	 2599c1	./vmlinux
> 1018441	 230148	1215604	2464193	 2599c1	./vmlinux
> 
> I think it is expected since CONFIG_STACK_VALIDATION is off. No?

Thanks for checking; fine by me then. :)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/8] Improving compiler inlining decisions
  2018-05-15 14:11 ` Nadav Amit
                   ` (17 preceding siblings ...)
  (?)
@ 2018-05-15 22:14 ` Nadav Amit
  -1 siblings, 0 replies; 35+ messages in thread
From: Nadav Amit @ 2018-05-15 22:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alok Kataria, Christopher Li, H. Peter Anvin, Ingo Molnar,
	Jan Beulich, Jonathan Corbet, Josh Poimboeuf, Juergen Gross,
	Kees Cook, linux-sparse, Peter Zijlstra, Randy Dunlap,
	Thomas Gleixner, virtualization, x86

Nadav Amit <namit@vmware.com> wrote:

> This patch-set deals with an interesting yet stupid problem: code that
> does not get inlined despite its simplicity.
> 
> I find 5 classes of causes:
> 
> 1. Inline assembly blocks in which code and data are added to
> alternative sections. The compiler is oblivious to the content of the
> blocks and assumes their cost in space and time is proportional to the
> number of the perceived assembly "instruction", according to the number
> of newlines and semicolons. Alternatives, paravirt and other mechanisms
> are affected.
> 
> 2. Inline assembly with redundant new-lines and semicolons. Similarly to
> (1) this code is considered "heavier" than it actually is.
> 
> 3. Code with constant value optimizations. Quite a few parts of the
> kernel check whether a variable is constant (using
> __builtin_constant_p()) and perform heavy computations in that case.
> These computations are eventually optimized out so they do not land in
> the binary. However, the cost of these computations is also associated
> with the calling function, which might prevent inlining of the calling
> function. ilog2() is an example for such case.
> 
> 4. Code that is marked with the "cold" attribute, including all the
> __init functions. Some may consider it the desired behavior.
> 
> 5. Code that is marked with a different optimization levels. This
> affects for example vmx_vcpu_run(), inducing overheads of up to 10% on
> exit.
> 
> 
> This patch-set deals with some instances of first 3 classes. 
> 
> For (1) we insert an assembly macro, and call it from the inline
> assembly block.  As a result, the compiler sees a single "instruction"
> and assigns the more appropriate cost to the code.
> 
> For (2) the solution is trivial: just remove the newlines.
> 
> (3) is somewhat tricky. The proposed solution is to use
> __builtin_choose_expr() to check whether a variable is actually constant
> instead of using an if-condition or the C ternary operator.
> __builtin_choose_expr() is evaluated earlier in the compilation, so it
> allows the compiler to associate the right cost for the variable case
> before the inlining decisions take place.  So far so good.
> 
> Still, there is a drawback. Since __builtin_choose_expr() is evaluated
> earlier, it can fail to recognize constants, which an if-condition would
> recognize correctly.  As a result, this patch-set only applies it to the
> simplest cases.
> 
> Overall this patch-set slightly increases the kernel size (my build was
> done using localmodconfig + localyesconfig for the record):
> 
>   text    data     bss     dec     hex filename
> 18126699 10066728 2936832 31130259 1db0293 ./vmlinux before
> 18149210 10064048 2936832 31150090 1db500a ./vmlinux after (+0.06%)
> 
> The patch-set eliminates many of the static text symbols:
> Before: 40033
> After:  39632   (-10%)

Oops. Should be -1%... 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/8] Improving compiler inlining decisions
  2018-05-15 14:11 ` Nadav Amit
@ 2018-05-16  3:48   ` Josh Poimboeuf
  -1 siblings, 0 replies; 35+ messages in thread
From: Josh Poimboeuf @ 2018-05-16  3:48 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-kernel, nadav.amit, Alok Kataria, Christopher Li,
	H. Peter Anvin, Ingo Molnar, Jan Beulich, Jonathan Corbet,
	Juergen Gross, Kees Cook, linux-sparse, Peter Zijlstra,
	Randy Dunlap, Thomas Gleixner, virtualization, x86

On Tue, May 15, 2018 at 07:11:07AM -0700, Nadav Amit wrote:
> This patch-set deals with an interesting yet stupid problem: code that
> does not get inlined despite its simplicity.

I got the 0/8 patch twice, and didn't get the 1/8 patch.  Was there an
issue with the sending of the patches?

-- 
Josh

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/8] Improving compiler inlining decisions
  2018-05-16  3:48   ` Josh Poimboeuf
  (?)
@ 2018-05-16  4:30   ` Nadav Amit
  -1 siblings, 0 replies; 35+ messages in thread
From: Nadav Amit @ 2018-05-16  4:30 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Linux Kernel Mailing List, Alok Kataria, Christopher Li,
	H. Peter Anvin, Ingo Molnar, Jan Beulich, Jonathan Corbet,
	Juergen Gross, Kees Cook, linux-sparse, Peter Zijlstra,
	Randy Dunlap, Thomas Gleixner, virtualization, x86

Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> On Tue, May 15, 2018 at 07:11:07AM -0700, Nadav Amit wrote:
>> This patch-set deals with an interesting yet stupid problem: code that
>> does not get inlined despite its simplicity.
> 
> I got the 0/8 patch twice, and didn't get the 1/8 patch.  Was there an
> issue with the sending of the patches?

Strange. I am not sure why it happened. Anyhow 1/8 is available here:
https://lkml.org/lkml/2018/5/15/961 . I will forward it to you.

Thanks,
Nadav

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 5/8] x86: refcount: prevent gcc distortions
  2018-05-15 14:11 ` [RFC 5/8] x86: refcount: prevent gcc distortions Nadav Amit
@ 2018-05-16  7:14   ` Jan Beulich
  2018-05-16 16:44     ` Nadav Amit
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2018-05-16  7:14 UTC (permalink / raw)
  To: Nadav Amit
  Cc: keescook, nadav.amit, the arch/x86 maintainers, tglx,
	Josh Poimboeuf, mingo, linux-kernel, hpa

>>> On 15.05.18 at 16:11, <namit@vmware.com> wrote:
> --- a/arch/x86/include/asm/refcount.h
> +++ b/arch/x86/include/asm/refcount.h
> @@ -14,34 +14,43 @@
>   * central refcount exception. The fixup address for the exception points
>   * back to the regular execution flow in .text.
>   */
> -#define _REFCOUNT_EXCEPTION				\
> -	".pushsection .text..refcount\n"		\
> -	"111:\tlea %[counter], %%" _ASM_CX "\n"		\
> -	"112:\t" ASM_UD2 "\n"				\
> -	ASM_UNREACHABLE					\
> -	".popsection\n"					\
> -	"113:\n"					\
> +
> +asm ("\n"
> +	".macro __REFCOUNT_EXCEPTION counter:vararg\n\t"
> +	".pushsection .text..refcount\n"
> +	"111:\tlea \\counter, %" _ASM_CX "\n"
> +	"112:\t" ASM_UD2 "\n\t"
> +	ASM_UNREACHABLE
> +	".popsection\n\t"
> +	"113:\n"
>  	_ASM_EXTABLE_REFCOUNT(112b, 113b)
> +	".endm");

A few comments on assembly code formatting - while gas at present is
relatively lax in this regard, I wouldn't exclude that there might be a
more strict mode in the future, and that such a mode might eventually
become the default. Furthermore these formatting aspects affect
readability of the assembly produced, should anyone ever find a need
to look at it (perhaps because of some breakage) - I certainly do every
once in a while.

Labels should be placed without any indentation (of course there may
be more than one on a line, in which case subsequent ones may be
arbitrarily indented). Instructions and directives, otoh, should be
placed with at least a single tab or space of indentation (unless
preceded by a label, in which case the extra white space still helps
readability).
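
For instance (an illustrative fragment, not from the series), the
convention would look like:

	.macro __EXAMPLE
111:					# label: flush left
	ud2				# instruction: indented
	.pushsection .discard.example	# directive: indented too
	.long 111b - .
	.popsection
	.endm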

I'm also not sure about the purpose of the leading plain newline here.
gcc annotates code resulting from inline assembly anyway iirc, so
proper visual separation should already be available.

If I was the maintainer of this code, I would also object to the
mis-alignment your file scope asm()-s have ("asm (" is 5 characters,
which doesn't equal a tab's width).

Jan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 5/8] x86: refcount: prevent gcc distortions
  2018-05-15 14:11 ` [RFC 5/8] x86: refcount: prevent gcc distortions Nadav Amit
@ 2018-05-16 13:59   ` Kees Cook
  2018-05-16 16:37     ` Nadav Amit
  0 siblings, 1 reply; 35+ messages in thread
From: Kees Cook @ 2018-05-16 13:59 UTC (permalink / raw)
  To: Nadav Amit
  Cc: LKML, Nadav Amit, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	X86 ML, Jan Beulich, Josh Poimboeuf

On Tue, May 15, 2018 at 7:11 AM, Nadav Amit <namit@vmware.com> wrote:
> GCC considers the number of statements in an inline assembly block,
> counted by new-lines and semicolons, as an indication of the cost of
> the block in time and space. This estimate is distorted by the kernel
> code, which puts information in alternative sections. As a result, the
> compiler may perform incorrect inlining and branch optimizations.
>
> The solution is to define an assembly macro and call it from the inline
> assembly block. As a result, GCC considers the inline assembly block to
> be a single instruction.
>
> This patch allows functions such as __get_seccomp_filter() to be
> inlined. The effect of the patch on the kernel size is as follows:
>
>    text    data     bss     dec     hex filename
> 18146418 10064100 2936832 31147350 1db4556 ./vmlinux before
> 18148228 10063968 2936832 31149028 1db4be4 ./vmlinux after (+1678)
>
> Static text symbols:
> Before: 39673
> After:  39649   (-24)
>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: x86@kernel.org
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Jan Beulich <JBeulich@suse.com>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  arch/x86/include/asm/refcount.h | 55 ++++++++++++++++++++-------------
>  1 file changed, 33 insertions(+), 22 deletions(-)
>
> diff --git a/arch/x86/include/asm/refcount.h b/arch/x86/include/asm/refcount.h
> index 4cf11d88d3b3..a668c534206d 100644
> --- a/arch/x86/include/asm/refcount.h
> +++ b/arch/x86/include/asm/refcount.h
> @@ -14,34 +14,43 @@
>   * central refcount exception. The fixup address for the exception points
>   * back to the regular execution flow in .text.
>   */
> -#define _REFCOUNT_EXCEPTION                            \
> -       ".pushsection .text..refcount\n"                \
> -       "111:\tlea %[counter], %%" _ASM_CX "\n"         \
> -       "112:\t" ASM_UD2 "\n"                           \
> -       ASM_UNREACHABLE                                 \
> -       ".popsection\n"                                 \
> -       "113:\n"                                        \
> +
> +asm ("\n"
> +       ".macro __REFCOUNT_EXCEPTION counter:vararg\n\t"

Why are these vararg?

Also, I think for the whole series, these #define-a-macro cases need a
comment in the code. It's not obvious from looking at the code why
they've defined a macro instead of just leaving the asm as it was.

Beyond that, as long as there is no behavioral changes, I'm fine with
the changes.

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 8/8] bitops: prevent compiler inline decision distortion
  2018-05-15 14:11 ` [RFC 8/8] bitops: prevent compiler inline decision distortion Nadav Amit
@ 2018-05-16 14:09   ` Kees Cook
  0 siblings, 0 replies; 35+ messages in thread
From: Kees Cook @ 2018-05-16 14:09 UTC (permalink / raw)
  To: Nadav Amit
  Cc: LKML, Nadav Amit, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML

On Tue, May 15, 2018 at 7:11 AM, Nadav Amit <namit@vmware.com> wrote:
> There are several places in the kernel with a condition based on
> whether the input is known to be constant at compilation time. If it
> is, complex computations take place, but only during compilation.

What we found while rewriting the min()/max() defines was that GCC's
__builtin_choose_expr() evaluates constant expressions, not the wider
set of constant values. And the older the GCC, the less often
__builtin_constant_p() would resolve to a constant expression, which
would force __builtin_choose_expr() along the "non constant" path.
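
A minimal sketch of the distinction (illustrative, not from the
min()/max() work):

	static inline int demo(int x)
	{
		/* An if() is evaluated late, so this may fold to true
		 * once demo() is inlined with a constant argument... */
		if (__builtin_constant_p(x))
			return 1;
		/* ...whereas __builtin_choose_expr() requires an
		 * integer constant expression at parse time, and a
		 * function parameter never is one, so
		 *	__builtin_choose_expr(__builtin_constant_p(x), a, b)
		 * would always pick 'b' here. */
		return 0;
	}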

> Although this scheme works correctly, when GCC computes the expected
> cost of this code in time and size, it disregards the fact that the
> computations for the "constant" case never take place at runtime in
> the non-constant case. The cost of these functions is therefore
> considered to be much higher than it really is. As a result, the
> compiler's inlining and branching decisions are distorted.
> Specifically, functions are less likely to be inlined due to their
> perceived big size and execution time.
>
> One of these cases is test_bit(), which performs some extra
> computations for constant inputs.
>
> The solution is to use __builtin_choose_expr() to detect whether the
> input is constant instead of a C condition. GCC evaluates the builtin
> earlier, which allows it to improve code-generation decisions.

In the case of test_bit(), I *think* all the optimization-desired
cases are already only constant expressions, so I think this is safe,
but I wouldn't want to do this generally for uses of
__builtin_constant_p().

-Kees

>
> This patch allows functions such as bitmap_pos_to_ord() to be inlined.
> Its overall effect on size:
>
>    text    data     bss     dec     hex filename
> 18149165 10064176 2936832 31150173 1db505d ./vmlinux before
> 18149210 10064048 2936832 31150090 1db500a ./vmlinux after (-83)
>
> Static text symbols:
> Before: 39643
> After:  39632   (-11)
>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: x86@kernel.org
>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  arch/x86/include/asm/bitops.h | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> index 9f645ba57dbb..f1cb1c9125a9 100644
> --- a/arch/x86/include/asm/bitops.h
> +++ b/arch/x86/include/asm/bitops.h
> @@ -349,10 +349,10 @@ static __always_inline bool variable_test_bit(long nr, volatile const unsigned l
>  static bool test_bit(int nr, const volatile unsigned long *addr);
>  #endif
>
> -#define test_bit(nr, addr)                     \
> -       (__builtin_constant_p((nr))             \
> -        ? constant_test_bit((nr), (addr))      \
> -        : variable_test_bit((nr), (addr)))
> +#define test_bit(nr, addr)                                     \
> +       __builtin_choose_expr(__builtin_constant_p((nr)),       \
> +               constant_test_bit((nr), (addr)),                \
> +               variable_test_bit((nr), (addr)))
>
>  /**
>   * __ffs - find first set bit in word
> --
> 2.17.0
>



-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 5/8] x86: refcount: prevent gcc distortions
  2018-05-16 13:59   ` Kees Cook
@ 2018-05-16 16:37     ` Nadav Amit
  0 siblings, 0 replies; 35+ messages in thread
From: Nadav Amit @ 2018-05-16 16:37 UTC (permalink / raw)
  To: Kees Cook
  Cc: LKML, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML,
	Jan Beulich, Josh Poimboeuf

Kees Cook <keescook@chromium.org> wrote:

> On Tue, May 15, 2018 at 7:11 AM, Nadav Amit <namit@vmware.com> wrote:
>> GCC considers the number of statements in an inline assembly block,
>> counted by new-lines and semicolons, as an indication of the cost of
>> the block in time and space. This estimate is distorted by the kernel
>> code, which puts information in alternative sections. As a result, the
>> compiler may perform incorrect inlining and branch optimizations.
>> 
>> The solution is to define an assembly macro and call it from the inline
>> assembly block. As a result, GCC considers the inline assembly block to
>> be a single instruction.
>> 
>> This patch allows functions such as __get_seccomp_filter() to be
>> inlined. The effect of the patch on the kernel size is as follows:
>> 
>>   text    data     bss     dec     hex filename
>> 18146418 10064100 2936832 31147350 1db4556 ./vmlinux before
>> 18148228 10063968 2936832 31149028 1db4be4 ./vmlinux after (+1678)
>> 
>> Static text symbols:
>> Before: 39673
>> After:  39649   (-24)
>> 
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: x86@kernel.org
>> Cc: Kees Cook <keescook@chromium.org>
>> Cc: Jan Beulich <JBeulich@suse.com>
>> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
>> 
>> Signed-off-by: Nadav Amit <namit@vmware.com>
>> ---
>> arch/x86/include/asm/refcount.h | 55 ++++++++++++++++++++-------------
>> 1 file changed, 33 insertions(+), 22 deletions(-)
>> 
>> diff --git a/arch/x86/include/asm/refcount.h b/arch/x86/include/asm/refcount.h
>> index 4cf11d88d3b3..a668c534206d 100644
>> --- a/arch/x86/include/asm/refcount.h
>> +++ b/arch/x86/include/asm/refcount.h
>> @@ -14,34 +14,43 @@
>>  * central refcount exception. The fixup address for the exception points
>>  * back to the regular execution flow in .text.
>>  */
>> -#define _REFCOUNT_EXCEPTION                            \
>> -       ".pushsection .text..refcount\n"                \
>> -       "111:\tlea %[counter], %%" _ASM_CX "\n"         \
>> -       "112:\t" ASM_UD2 "\n"                           \
>> -       ASM_UNREACHABLE                                 \
>> -       ".popsection\n"                                 \
>> -       "113:\n"                                        \
>> +
>> +asm ("\n"
>> +       ".macro __REFCOUNT_EXCEPTION counter:vararg\n\t"
> 
> Why are these vararg?

I don’t think it is needed here. I will fix it.

> 
> Also, I think for the whole series, these #define-a-macro cases need a
> comment in the code. It's not obvious from looking at the code why
> they've defined a macro instead of just leaving the asm as it was.

Right. I will add them.

> Beyond that, as long as there are no behavioral changes, I'm fine with
> the changes.

Thanks!

Nadav

^ permalink raw reply	[flat|nested] 35+ messages in thread
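
To make the asm-macro technique concrete, below is a minimal,
self-contained sketch for x86-64 GCC. All names (SKETCH_EXCEPTION,
.text.sketch_unlikely, sketch_dec) are illustrative, not the actual
refcount implementation, and the macro parameter is declared :req
rather than :vararg, matching the fix agreed above.

/*
 * The exception body is emitted once as an assembler macro, so GCC
 * sees each call site as a single "instruction" when estimating the
 * cost of the inline asm block.  (:vararg would only matter if the
 * operand could contain commas, e.g. base,index,scale addressing.)
 */
asm ("\t.macro SKETCH_EXCEPTION counter:req\n"
	"\t.pushsection .text.sketch_unlikely\n"
	"111:\tlea \\counter, %rcx\n"	/* label flush left, insn indented */
	"\tud2\n"			/* trap; never returns */
	"\t.popsection\n"
	"\t.endm\n");

static inline void sketch_dec(int *counter)
{
	/* One macro invocation instead of several asm lines at the
	 * call site; 111f resolves into the out-of-line section. */
	asm volatile ("decl %[counter]\n\t"
		      "js 111f\n\t"
		      "SKETCH_EXCEPTION %[counter]"
		      : [counter] "+m" (*counter)
		      : : "cc", "rcx");
}

int main(void)
{
	int refs = 2;

	sketch_dec(&refs);	/* 2 -> 1: sign clear, trap not taken */
	return refs == 1 ? 0 : 1;
}
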

* Re: [RFC 5/8] x86: refcount: prevent gcc distortions
  2018-05-16  7:14   ` Jan Beulich
@ 2018-05-16 16:44     ` Nadav Amit
  2018-05-17  7:18       ` Jan Beulich
  0 siblings, 1 reply; 35+ messages in thread
From: Nadav Amit @ 2018-05-16 16:44 UTC (permalink / raw)
  To: Jan Beulich
  Cc: keescook, the arch/x86 maintainers, tglx, Josh Poimboeuf, mingo,
	linux-kernel, hpa

Jan Beulich <JBeulich@suse.com> wrote:

>>>> On 15.05.18 at 16:11, <namit@vmware.com> wrote:
>> --- a/arch/x86/include/asm/refcount.h
>> +++ b/arch/x86/include/asm/refcount.h
>> @@ -14,34 +14,43 @@
>>  * central refcount exception. The fixup address for the exception points
>>  * back to the regular execution flow in .text.
>>  */
>> -#define _REFCOUNT_EXCEPTION				\
>> -	".pushsection .text..refcount\n"		\
>> -	"111:\tlea %[counter], %%" _ASM_CX "\n"		\
>> -	"112:\t" ASM_UD2 "\n"				\
>> -	ASM_UNREACHABLE					\
>> -	".popsection\n"					\
>> -	"113:\n"					\
>> +
>> +asm ("\n"
>> +	".macro __REFCOUNT_EXCEPTION counter:vararg\n\t"
>> +	".pushsection .text..refcount\n"
>> +	"111:\tlea \\counter, %" _ASM_CX "\n"
>> +	"112:\t" ASM_UD2 "\n\t"
>> +	ASM_UNREACHABLE
>> +	".popsection\n\t"
>> +	"113:\n"
>> 	_ASM_EXTABLE_REFCOUNT(112b, 113b)
>> +	".endm");
> 
> A few comments on assembly code formatting - while gas at present is
> relatively lax in this regard, I wouldn't exclude that a stricter mode
> might appear in the future, and that such a mode might eventually
> become the default. Furthermore, these formatting aspects affect the
> readability of the assembly produced, should anyone ever need to look
> at it (perhaps because of some breakage) - I certainly do every once
> in a while.
> 
> Labels should be placed without any indentation (though there may be
> more than one on a line, in which case subsequent ones may of course
> be arbitrarily indented). Instructions and directives, on the other
> hand, should be placed with at least a single tab or space of
> indentation (unless preceded by a label, in which case the extra
> white space still helps readability).

While writing these patches, I looked at the generated assembly, and
there did not appear to be a standard. IIRC, .pushsection directives
were not always indented. I will fix it according to your comments.

> I'm also not sure about the purpose of the leading plain newline here.
> gcc annotates code resulting from inline assembly anyway iirc, so
> proper visual separation should already be available.

Right. It was only there so that the .macro directive would not be
indented, but as you said, it should be indented, so I will remove the
newline.

> 
> If I were the maintainer of this code, I would also object to the
> misalignment of your file-scope asm()-s ("asm (" is 5 characters,
> which doesn't equal a tab's width).

I tried many formats (including the one you propose), and eventually
went with the one that kept checkpatch from yelling at me. I will switch
to the format you propose, which makes the most sense, and ignore the
checkpatch warnings.

Thanks,
Nadav

^ permalink raw reply	[flat|nested] 35+ messages in thread
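
Applying the above to a file-scope asm() gives the following sketch
(section and macro names are illustrative only): labels flush at
column 0, directives and instructions indented by a tab in the emitted
assembly, and the continuation lines of the C statement indented with
a tab rather than aligned under "asm (".

/* Formatting sketch only; the macro body is a placeholder. */
asm ("\t.macro SKETCH_FMT\n"
	"\t.pushsection .text.sketch_fmt\n"	/* directive: indented */
	"333:\tud2\n"				/* label flush, insn after a tab */
	"\t.popsection\n"
	"\t.endm\n");
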

* Re: [RFC 5/8] x86: refcount: prevent gcc distortions
  2018-05-16 16:44     ` Nadav Amit
@ 2018-05-17  7:18       ` Jan Beulich
  0 siblings, 0 replies; 35+ messages in thread
From: Jan Beulich @ 2018-05-17  7:18 UTC (permalink / raw)
  To: Nadav Amit
  Cc: keescook, the arch/x86 maintainers, tglx, Josh Poimboeuf, mingo,
	linux-kernel, hpa

>>> On 16.05.18 at 18:44, <namit@vmware.com> wrote:
> Jan Beulich <JBeulich@suse.com> wrote:
>>>>> On 15.05.18 at 16:11, <namit@vmware.com> wrote:
>>> --- a/arch/x86/include/asm/refcount.h
>>> +++ b/arch/x86/include/asm/refcount.h
>>> @@ -14,34 +14,43 @@
>>>  * central refcount exception. The fixup address for the exception points
>>>  * back to the regular execution flow in .text.
>>>  */
>>> -#define _REFCOUNT_EXCEPTION				\
>>> -	".pushsection .text..refcount\n"		\
>>> -	"111:\tlea %[counter], %%" _ASM_CX "\n"		\
>>> -	"112:\t" ASM_UD2 "\n"				\
>>> -	ASM_UNREACHABLE					\
>>> -	".popsection\n"					\
>>> -	"113:\n"					\
>>> +
>>> +asm ("\n"
>>> +	".macro __REFCOUNT_EXCEPTION counter:vararg\n\t"
>>> +	".pushsection .text..refcount\n"
>>> +	"111:\tlea \\counter, %" _ASM_CX "\n"
>>> +	"112:\t" ASM_UD2 "\n\t"
>>> +	ASM_UNREACHABLE
>>> +	".popsection\n\t"
>>> +	"113:\n"
>>> 	_ASM_EXTABLE_REFCOUNT(112b, 113b)
>>> +	".endm");
>> 
>> A few comments on assembly code formatting - while gas at present is
>> relatively lax in this regard, I wouldn't exclude that a stricter mode
>> might appear in the future, and that such a mode might eventually
>> become the default. Furthermore, these formatting aspects affect the
>> readability of the assembly produced, should anyone ever need to look
>> at it (perhaps because of some breakage) - I certainly do every once
>> in a while.
>> 
>> Labels should be placed without any indentation (though there may be
>> more than one on a line, in which case subsequent ones may of course
>> be arbitrarily indented). Instructions and directives, on the other
>> hand, should be placed with at least a single tab or space of
>> indentation (unless preceded by a label, in which case the extra
>> white space still helps readability).
> 
> While writing these patches, I looked at the generated assembly, and
> there did not appear to be a standard. IIRC, .pushsection directives
> were not always indented. I will fix it according to your comments.

Right, I should have made explicit that there's no consistency at all in
pre-existing code. I merely think the issue shouldn't be made worse.

Jan

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2018-05-17  7:18 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-15 14:11 [RFC 0/8] Improving compiler inlining decisions Nadav Amit
2018-05-15 14:11 ` [RFC 1/8] x86: objtool: use asm macro for better compiler decisions Nadav Amit
2018-05-15 21:37   ` Josh Triplett
2018-05-15 21:53     ` Nadav Amit
2018-05-15 21:55       ` Josh Triplett
2018-05-15 14:11 ` [RFC 2/8] x86: bug: prevent gcc distortions Nadav Amit
2018-05-15 14:11 ` [RFC 3/8] x86: alternative: macrofy locks for better inlining Nadav Amit
2018-05-15 14:11 ` [RFC 4/8] x86: prevent inline distortion by paravirt ops Nadav Amit
2018-05-15 14:11 ` [RFC 5/8] x86: refcount: prevent gcc distortions Nadav Amit
2018-05-16  7:14   ` Jan Beulich
2018-05-16 16:44     ` Nadav Amit
2018-05-17  7:18       ` Jan Beulich
2018-05-16 13:59   ` Kees Cook
2018-05-16 16:37     ` Nadav Amit
2018-05-15 14:11 ` [RFC 6/8] x86: removing unneeded new-lines Nadav Amit
2018-05-15 14:11 ` [RFC 7/8] ilog2: preventing compiler distortion due to big condition Nadav Amit
2018-05-15 14:11 ` [RFC 8/8] bitops: prevent compiler inline decision distortion Nadav Amit
2018-05-16 14:09   ` Kees Cook
2018-05-15 22:14 ` [RFC 0/8] Improving compiler inlining decisions Nadav Amit
2018-05-16  3:48 ` Josh Poimboeuf
2018-05-16  4:30   ` Nadav Amit
