linux-kernel.vger.kernel.org archive mirror
* [PATCH v9 00/10] x86: macrofying inline asm
@ 2018-10-03 21:30 Nadav Amit
  2018-10-03 21:30 ` [PATCH v9 01/10] xtensa: defining LINKER_SCRIPT for the linker script Nadav Amit
                   ` (10 more replies)
  0 siblings, 11 replies; 116+ messages in thread
From: Nadav Amit @ 2018-10-03 21:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, x86, Nadav Amit, Masahiro Yamada, Sam Ravnborg,
	Alok Kataria, Christopher Li, Greg Kroah-Hartman, H. Peter Anvin,
	Jan Beulich, Josh Poimboeuf, Juergen Gross, Kate Stewart,
	Kees Cook, linux-sparse, Peter Zijlstra, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa

This patch-set deals with an interesting yet stupid problem: kernel code
that does not get inlined despite its simplicity. There are several
causes for this behavior: the "cold" attribute on __init; different
function optimization levels; conditional constant computations based on
__builtin_constant_p(); and, finally, large inline assembly blocks.

This patch-set deals with the inline assembly problem. I separated these
patches from the others (that were sent in the RFC) for easier
inclusion. I also split out the removal of unnecessary new-lines, which
will be sent separately.

The problem with inline assembly is that the kernel often uses it for
things other than code - for example, assembly directives and data. GCC,
however, is oblivious to the content of the blocks and assumes that
their cost in space and time is proportional to the number of perceived
assembly "instructions", which it derives from the number of newlines
and semicolons. Alternatives, paravirt and other mechanisms are
affected, causing code not to be inlined and degrading compilation
quality in general.
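
As an illustration only (this is not part of the series and is not GCC's
source), the counting rule described above can be modelled in a few
lines of C. The function name estimate_asm_insns() is hypothetical; the
two templates are the annotate_reachable() strings from before and after
patch 03.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified model of the heuristic described in the cover letter: a
 * non-empty template counts as one "instruction", plus one more for
 * every newline or semicolon. Hypothetical helper, not GCC code.
 */
static int estimate_asm_insns(const char *templ)
{
	int count;

	assert(templ != NULL);
	if (*templ == '\0')
		return 0;

	count = 1;
	for (; *templ != '\0'; templ++)
		if (*templ == '\n' || *templ == ';')
			count++;
	return count;
}

/* annotate_reachable() template before this series (directives only). */
static const char old_annotate[] =
	"%c0:\n\t"
	".pushsection .discard.reachable\n\t"
	".long %c0b - .\n\t"
	".popsection\n\t";

/* annotate_reachable() template after this series (one macro call). */
static const char new_annotate[] = "ANNOTATE_REACHABLE counter=%c0";
```

Under this model the old template is charged as five "instructions"
although it emits no code in .text at all, while the macro call is
charged as one.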

The solution that this patch-set carries for this problem is to create
an assembly macro, and then call it from the inline assembly block. As
a result, the compiler sees a single "instruction" and assigns a more
appropriate cost to the code.

To avoid uglification of the code, as many noted, the macros are first
precompiled into an assembly file, which is later assembled together
with the C files. This also makes it possible to drop the duplicate
implementations that previously existed for the asm and C code, as can
be seen in the exception table changes.

Overall this patch-set slightly increases the kernel size (my build was
done using my Ubuntu 18.04 config + localyesconfig for the record):

   text	   data	    bss	    dec	    hex	filename
18140829 10224724 2957312 31322865 1ddf2f1 ./vmlinux before
18163608 10227348 2957312 31348268 1de562c ./vmlinux after (+0.1%)

The number of static functions in the image is reduced by 379, but
inlining actually improves even more than that, which does not always
show in these numbers: inlining one function may cause its caller not
to be inlined.

I ran a limited number of benchmarks, and in general the performance
impact is not very notable. You can still see >10 cycles shaved off some
syscalls that manipulate page-tables (e.g., mprotect()), in which
paravirt caused many functions not to be inlined. In addition, this
patch-set can prevent issues such as [1], and improves code readability
and maintainability.

Update: Rasmus recently caused me (inadvertently) to become paranoid
about the dependencies. To clarify: if any of the headers changes, any C
file which uses macros that are included in macros.S would be fine as
long as it includes the header as well (as it should). Adding an
assertion to check that this is the case might become slightly ugly,
and nobody else is concerned about it. Another minor issue is that
changes to macros.S would not trigger a global rebuild, but that is
pretty similar to changes to the Makefile that do not trigger a rebuild.

[1] https://patchwork.kernel.org/patch/10450037/

v8->v9: * Restoring the '-pipe' parameter (Rasmus)
	* Adding Kees's tested-by tag (Kees)

v7->v8:	* Add acks (Masahiro, Max)
	* Rebase on 4.19 (Ingo)

v6->v7: * Fix context switch tracking (Ingo)
	* Fix xtensa build error (Ingo)
	* Rebase on 4.18-rc8

v5->v6:	* Removing more code from jump-labels (PeterZ)
	* Fix build issue on i386 (0-day, PeterZ)

v4->v5:	* Makefile fixes (Masahiro, Sam)

v3->v4: * Changed naming of macros in 2 patches (PeterZ)
	* Minor cleanup of the paravirt patch

v2->v3: * Several build issues resolved (0-day)
	* Wrong comments fix (Josh)
	* Change asm vs C order in refcount (Kees)

v1->v2:	* Compiling the macros into a separate .s file, improving
	  readability (Linus)
	* Improving assembly formatting, applying most of the comments
	  according to my judgment (Jan)
	* Adding exception-table, cpufeature and jump-labels
	* Removing new-line cleanup; to be submitted separately

Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Alok Kataria <akataria@vmware.com>
Cc: Christopher Li <sparse@chrisli.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: linux-sparse@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: virtualization@lists.linux-foundation.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: x86@kernel.org
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: linux-xtensa@linux-xtensa.org

Nadav Amit (10):
  xtensa: defining LINKER_SCRIPT for the linker script
  Makefile: Prepare for using macros for inline asm
  x86: objtool: use asm macro for better compiler decisions
  x86: refcount: prevent gcc distortions
  x86: alternatives: macrofy locks for better inlining
  x86: bug: prevent gcc distortions
  x86: prevent inline distortion by paravirt ops
  x86: extable: use macros instead of inline assembly
  x86: cpufeature: use macros instead of inline assembly
  x86: jump-labels: use macros instead of inline assembly

 Makefile                               |  9 ++-
 arch/x86/Makefile                      |  7 ++
 arch/x86/entry/calling.h               |  2 +-
 arch/x86/include/asm/alternative-asm.h | 20 ++++--
 arch/x86/include/asm/alternative.h     | 11 +--
 arch/x86/include/asm/asm.h             | 61 +++++++---------
 arch/x86/include/asm/bug.h             | 98 +++++++++++++++-----------
 arch/x86/include/asm/cpufeature.h      | 82 ++++++++++++---------
 arch/x86/include/asm/jump_label.h      | 77 ++++++++------------
 arch/x86/include/asm/paravirt_types.h  | 56 +++++++--------
 arch/x86/include/asm/refcount.h        | 74 +++++++++++--------
 arch/x86/kernel/macros.S               | 16 +++++
 arch/xtensa/kernel/Makefile            |  4 +-
 include/asm-generic/bug.h              |  8 +--
 include/linux/compiler.h               | 56 +++++++++++----
 scripts/Kbuild.include                 |  4 +-
 scripts/mod/Makefile                   |  2 +
 17 files changed, 331 insertions(+), 256 deletions(-)
 create mode 100644 arch/x86/kernel/macros.S

-- 
2.17.1



* [PATCH v9 01/10] xtensa: defining LINKER_SCRIPT for the linker script
  2018-10-03 21:30 [PATCH v9 00/10] x86: macrofying inline asm Nadav Amit
@ 2018-10-03 21:30 ` Nadav Amit
  2018-10-04 10:00   ` [tip:x86/build] kbuild/arch/xtensa: Define " tip-bot for Nadav Amit
  2018-10-03 21:30 ` [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm Nadav Amit
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 116+ messages in thread
From: Nadav Amit @ 2018-10-03 21:30 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, x86, Nadav Amit, Chris Zankel, linux-xtensa

Define LINKER_SCRIPT when building the linker script, as is done on
other architectures. This is required because the upcoming Makefile
changes would otherwise break the build.

Cc: Chris Zankel <chris@zankel.net>
Cc: linux-xtensa@linux-xtensa.org
Acked-by: Max Filippov <jcmvbkbc@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/xtensa/kernel/Makefile | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/xtensa/kernel/Makefile b/arch/xtensa/kernel/Makefile
index 91907590d183..8dff506caf07 100644
--- a/arch/xtensa/kernel/Makefile
+++ b/arch/xtensa/kernel/Makefile
@@ -35,8 +35,8 @@ sed-y = -e ':a; s/\*(\([^)]*\)\.text\.unlikely/*(\1.literal.unlikely .{text}.unl
 	-e 's/\.{text}/.text/g'
 
 quiet_cmd__cpp_lds_S = LDS     $@
-cmd__cpp_lds_S = $(CPP) $(cpp_flags) -P -C -Uxtensa -D__ASSEMBLY__ $<    \
-                 | sed $(sed-y) >$@
+cmd__cpp_lds_S = $(CPP) $(cpp_flags) -P -C -Uxtensa -D__ASSEMBLY__ \
+		 -DLINKER_SCRIPT $< | sed $(sed-y) >$@
 
 $(obj)/vmlinux.lds: $(src)/vmlinux.lds.S FORCE
 	$(call if_changed_dep,_cpp_lds_S)
-- 
2.17.1



* [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm
  2018-10-03 21:30 [PATCH v9 00/10] x86: macrofying inline asm Nadav Amit
  2018-10-03 21:30 ` [PATCH v9 01/10] xtensa: defining LINKER_SCRIPT for the linker script Nadav Amit
@ 2018-10-03 21:30 ` Nadav Amit
  2018-10-04 10:01   ` [tip:x86/build] kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs tip-bot for Nadav Amit
  2018-11-06 18:57   ` [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm Logan Gunthorpe
  2018-10-03 21:30 ` [PATCH v9 03/10] x86: objtool: use asm macro for better compiler decisions Nadav Amit
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 116+ messages in thread
From: Nadav Amit @ 2018-10-03 21:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, x86, Nadav Amit, Sam Ravnborg, Michal Marek,
	Thomas Gleixner, H. Peter Anvin, linux-kbuild

Using macros for inline assembly improves both readability and
compilation decisions that are distorted by big assembly blocks that use
alternative sections. Compile macros.S and use it to assemble all C
files. Currently, only x86 will use it.

Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: linux-kbuild@vger.kernel.org
Acked-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 Makefile                 | 9 +++++++--
 arch/x86/Makefile        | 7 +++++++
 arch/x86/kernel/macros.S | 7 +++++++
 scripts/Kbuild.include   | 4 +++-
 scripts/mod/Makefile     | 2 ++
 5 files changed, 26 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/kernel/macros.S

diff --git a/Makefile b/Makefile
index 6c3da3e10f07..6c40d547513c 100644
--- a/Makefile
+++ b/Makefile
@@ -1071,7 +1071,7 @@ scripts: scripts_basic asm-generic gcc-plugins $(autoksyms_h)
 # version.h and scripts_basic is processed / created.
 
 # Listed in dependency order
-PHONY += prepare archprepare prepare0 prepare1 prepare2 prepare3
+PHONY += prepare archprepare macroprepare prepare0 prepare1 prepare2 prepare3
 
 # prepare3 is used to check if we are building in a separate output directory,
 # and if so do:
@@ -1094,7 +1094,9 @@ prepare2: prepare3 outputmakefile asm-generic
 prepare1: prepare2 $(version_h) $(autoksyms_h) include/generated/utsrelease.h
 	$(cmd_crmodverdir)
 
-archprepare: archheaders archscripts prepare1 scripts_basic
+macroprepare: prepare1 archmacros
+
+archprepare: archheaders archscripts macroprepare scripts_basic
 
 prepare0: archprepare gcc-plugins
 	$(Q)$(MAKE) $(build)=.
@@ -1162,6 +1164,9 @@ archheaders:
 PHONY += archscripts
 archscripts:
 
+PHONY += archmacros
+archmacros:
+
 PHONY += __headers
 __headers: $(version_h) scripts_basic uapi-asm-generic archheaders archscripts
 	$(Q)$(MAKE) $(build)=scripts build_unifdef
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 8f6e7eb8ae9f..18a0084f134b 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -237,6 +237,13 @@ archscripts: scripts_basic
 archheaders:
 	$(Q)$(MAKE) $(build)=arch/x86/entry/syscalls all
 
+archmacros:
+	$(Q)$(MAKE) $(build)=arch/x86/kernel arch/x86/kernel/macros.s
+
+ASM_MACRO_FLAGS = -Wa,arch/x86/kernel/macros.s -Wa,-
+export ASM_MACRO_FLAGS
+KBUILD_CFLAGS += $(ASM_MACRO_FLAGS)
+
 ###
 # Kernel objects
 
diff --git a/arch/x86/kernel/macros.S b/arch/x86/kernel/macros.S
new file mode 100644
index 000000000000..cfc1c7d1a6eb
--- /dev/null
+++ b/arch/x86/kernel/macros.S
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * This file includes headers whose assembly part includes macros which are
+ * commonly used. The macros are precompiled into assmebly file which is later
+ * assembled together with each compiled file.
+ */
diff --git a/scripts/Kbuild.include b/scripts/Kbuild.include
index ce53639a864a..8aeb60eb6ee3 100644
--- a/scripts/Kbuild.include
+++ b/scripts/Kbuild.include
@@ -115,7 +115,9 @@ __cc-option = $(call try-run,\
 
 # Do not attempt to build with gcc plugins during cc-option tests.
 # (And this uses delayed resolution so the flags will be up to date.)
-CC_OPTION_CFLAGS = $(filter-out $(GCC_PLUGINS_CFLAGS),$(KBUILD_CFLAGS))
+# In addition, do not include the asm macros which are built later.
+CC_OPTION_FILTERED = $(GCC_PLUGINS_CFLAGS) $(ASM_MACRO_FLAGS)
+CC_OPTION_CFLAGS = $(filter-out $(CC_OPTION_FILTERED),$(KBUILD_CFLAGS))
 
 # cc-option
 # Usage: cflags-y += $(call cc-option,-march=winchip-c6,-march=i586)
diff --git a/scripts/mod/Makefile b/scripts/mod/Makefile
index 42c5d50f2bcc..a5b4af47987a 100644
--- a/scripts/mod/Makefile
+++ b/scripts/mod/Makefile
@@ -4,6 +4,8 @@ OBJECT_FILES_NON_STANDARD := y
 hostprogs-y	:= modpost mk_elfconfig
 always		:= $(hostprogs-y) empty.o
 
+CFLAGS_REMOVE_empty.o := $(ASM_MACRO_FLAGS)
+
 modpost-objs	:= modpost.o file2alias.o sumversion.o
 
 devicetable-offsets-file := devicetable-offsets.h
-- 
2.17.1



* [PATCH v9 03/10] x86: objtool: use asm macro for better compiler decisions
  2018-10-03 21:30 [PATCH v9 00/10] x86: macrofying inline asm Nadav Amit
  2018-10-03 21:30 ` [PATCH v9 01/10] xtensa: defining LINKER_SCRIPT for the linker script Nadav Amit
  2018-10-03 21:30 ` [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm Nadav Amit
@ 2018-10-03 21:30 ` Nadav Amit
  2018-10-04 10:02   ` [tip:x86/build] x86/objtool: Use asm macros to work around GCC inlining bugs tip-bot for Nadav Amit
  2018-10-03 21:30 ` [PATCH v9 04/10] x86: refcount: prevent gcc distortions Nadav Amit
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 116+ messages in thread
From: Nadav Amit @ 2018-10-03 21:30 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, x86, Nadav Amit, Christopher Li, linux-sparse

GCC considers the number of statements in inlined assembly blocks,
according to new-lines and semicolons, as an indication of the cost of
the block in time and space. This data is distorted by the kernel code,
which puts information in alternative sections. As a result, the
compiler may perform incorrect inlining and branch optimizations.

In the case of objtool, this distortion is extreme, since the objtool
annotations are discarded during linkage anyway.

The solution is to set an assembly macro and call it from the inline
assembly block. As a result GCC considers the inline assembly block as
a single instruction.

This patch slightly increases the kernel size.

   text	   data	    bss	    dec	    hex	filename
18140829 10224724 2957312 31322865 1ddf2f1 ./vmlinux before
18140970 10225412 2957312 31323694 1ddf62e ./vmlinux after (+829)

Static text symbols:
Before:	40321
After:	40302	(-19)

Cc: Christopher Li <sparse@chrisli.org>
Cc: linux-sparse@vger.kernel.org
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/kernel/macros.S |  2 ++
 include/linux/compiler.h | 56 ++++++++++++++++++++++++++++++----------
 2 files changed, 45 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/macros.S b/arch/x86/kernel/macros.S
index cfc1c7d1a6eb..cee28c3246dc 100644
--- a/arch/x86/kernel/macros.S
+++ b/arch/x86/kernel/macros.S
@@ -5,3 +5,5 @@
  * commonly used. The macros are precompiled into assmebly file which is later
  * assembled together with each compiled file.
  */
+
+#include <linux/compiler.h>
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 681d866efb1e..1921545c6351 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -99,22 +99,13 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
  * unique, to convince GCC not to merge duplicate inline asm statements.
  */
 #define annotate_reachable() ({						\
-	asm volatile("%c0:\n\t"						\
-		     ".pushsection .discard.reachable\n\t"		\
-		     ".long %c0b - .\n\t"				\
-		     ".popsection\n\t" : : "i" (__COUNTER__));		\
+	asm volatile("ANNOTATE_REACHABLE counter=%c0"			\
+		     : : "i" (__COUNTER__));				\
 })
 #define annotate_unreachable() ({					\
-	asm volatile("%c0:\n\t"						\
-		     ".pushsection .discard.unreachable\n\t"		\
-		     ".long %c0b - .\n\t"				\
-		     ".popsection\n\t" : : "i" (__COUNTER__));		\
+	asm volatile("ANNOTATE_UNREACHABLE counter=%c0"			\
+		     : : "i" (__COUNTER__));				\
 })
-#define ASM_UNREACHABLE							\
-	"999:\n\t"							\
-	".pushsection .discard.unreachable\n\t"				\
-	".long 999b - .\n\t"						\
-	".popsection\n\t"
 #else
 #define annotate_reachable()
 #define annotate_unreachable()
@@ -299,6 +290,45 @@ static inline void *offset_to_ptr(const int *off)
 	return (void *)((unsigned long)off + *off);
 }
 
+#else /* __ASSEMBLY__ */
+
+#ifdef __KERNEL__
+#ifndef LINKER_SCRIPT
+
+#ifdef CONFIG_STACK_VALIDATION
+.macro ANNOTATE_UNREACHABLE counter:req
+\counter:
+	.pushsection .discard.unreachable
+	.long \counter\()b -.
+	.popsection
+.endm
+
+.macro ANNOTATE_REACHABLE counter:req
+\counter:
+	.pushsection .discard.reachable
+	.long \counter\()b -.
+	.popsection
+.endm
+
+.macro ASM_UNREACHABLE
+999:
+	.pushsection .discard.unreachable
+	.long 999b - .
+	.popsection
+.endm
+#else /* CONFIG_STACK_VALIDATION */
+.macro ANNOTATE_UNREACHABLE counter:req
+.endm
+
+.macro ANNOTATE_REACHABLE counter:req
+.endm
+
+.macro ASM_UNREACHABLE
+.endm
+#endif /* CONFIG_STACK_VALIDATION */
+
+#endif /* LINKER_SCRIPT */
+#endif /* __KERNEL__ */
 #endif /* __ASSEMBLY__ */
 
 #ifndef __optimize
-- 
2.17.1



* [PATCH v9 04/10] x86: refcount: prevent gcc distortions
  2018-10-03 21:30 [PATCH v9 00/10] x86: macrofying inline asm Nadav Amit
                   ` (2 preceding siblings ...)
  2018-10-03 21:30 ` [PATCH v9 03/10] x86: objtool: use asm macro for better compiler decisions Nadav Amit
@ 2018-10-03 21:30 ` Nadav Amit
  2018-10-04  7:57   ` Ingo Molnar
  2018-10-04 10:02   ` [tip:x86/build] x86/refcount: Work around GCC inlining bug tip-bot for Nadav Amit
  2018-10-03 21:30 ` [PATCH v9 05/10] x86: alternatives: macrofy locks for better inlining Nadav Amit
                   ` (6 subsequent siblings)
  10 siblings, 2 replies; 116+ messages in thread
From: Nadav Amit @ 2018-10-03 21:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, x86, Nadav Amit, Thomas Gleixner, H. Peter Anvin,
	Jan Beulich, Josh Poimboeuf

GCC considers the number of statements in inlined assembly blocks,
according to new-lines and semicolons, as an indication of the cost of
the block in time and space. This data is distorted by the kernel code,
which puts information in alternative sections. As a result, the
compiler may perform incorrect inlining and branch optimizations.

The solution is to set an assembly macro and call it from the inlined
assembly block. As a result GCC considers the inline assembly block as
a single instruction.

This patch allows functions such as __get_seccomp_filter() to be
inlined. Interestingly, this allows more aggressive inlining while
reducing the kernel size.

   text	   data	    bss	    dec	    hex	filename
18140970 10225412 2957312 31323694 1ddf62e ./vmlinux before
18140140 10225284 2957312 31322736 1ddf270 ./vmlinux after (-958)

Static text symbols:
Before:	40302
After:	40286	(-16)

Functions such as kref_get(), free_user(), fuse_file_get() now get
inlined.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/refcount.h | 74 ++++++++++++++++++++-------------
 arch/x86/kernel/macros.S        |  1 +
 2 files changed, 46 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/refcount.h b/arch/x86/include/asm/refcount.h
index 19b90521954c..c92909da0686 100644
--- a/arch/x86/include/asm/refcount.h
+++ b/arch/x86/include/asm/refcount.h
@@ -4,6 +4,41 @@
  * x86-specific implementation of refcount_t. Based on PAX_REFCOUNT from
  * PaX/grsecurity.
  */
+
+#ifdef __ASSEMBLY__
+
+#include <asm/asm.h>
+#include <asm/bug.h>
+
+.macro REFCOUNT_EXCEPTION counter:req
+	.pushsection .text..refcount
+111:	lea \counter, %_ASM_CX
+112:	ud2
+	ASM_UNREACHABLE
+	.popsection
+113:	_ASM_EXTABLE_REFCOUNT(112b, 113b)
+.endm
+
+/* Trigger refcount exception if refcount result is negative. */
+.macro REFCOUNT_CHECK_LT_ZERO counter:req
+	js 111f
+	REFCOUNT_EXCEPTION counter="\counter"
+.endm
+
+/* Trigger refcount exception if refcount result is zero or negative. */
+.macro REFCOUNT_CHECK_LE_ZERO counter:req
+	jz 111f
+	REFCOUNT_CHECK_LT_ZERO counter="\counter"
+.endm
+
+/* Trigger refcount exception unconditionally. */
+.macro REFCOUNT_ERROR counter:req
+	jmp 111f
+	REFCOUNT_EXCEPTION counter="\counter"
+.endm
+
+#else /* __ASSEMBLY__ */
+
 #include <linux/refcount.h>
 #include <asm/bug.h>
 
@@ -15,34 +50,11 @@
  * central refcount exception. The fixup address for the exception points
  * back to the regular execution flow in .text.
  */
-#define _REFCOUNT_EXCEPTION				\
-	".pushsection .text..refcount\n"		\
-	"111:\tlea %[counter], %%" _ASM_CX "\n"		\
-	"112:\t" ASM_UD2 "\n"				\
-	ASM_UNREACHABLE					\
-	".popsection\n"					\
-	"113:\n"					\
-	_ASM_EXTABLE_REFCOUNT(112b, 113b)
-
-/* Trigger refcount exception if refcount result is negative. */
-#define REFCOUNT_CHECK_LT_ZERO				\
-	"js 111f\n\t"					\
-	_REFCOUNT_EXCEPTION
-
-/* Trigger refcount exception if refcount result is zero or negative. */
-#define REFCOUNT_CHECK_LE_ZERO				\
-	"jz 111f\n\t"					\
-	REFCOUNT_CHECK_LT_ZERO
-
-/* Trigger refcount exception unconditionally. */
-#define REFCOUNT_ERROR					\
-	"jmp 111f\n\t"					\
-	_REFCOUNT_EXCEPTION
 
 static __always_inline void refcount_add(unsigned int i, refcount_t *r)
 {
 	asm volatile(LOCK_PREFIX "addl %1,%0\n\t"
-		REFCOUNT_CHECK_LT_ZERO
+		"REFCOUNT_CHECK_LT_ZERO counter=\"%[counter]\""
 		: [counter] "+m" (r->refs.counter)
 		: "ir" (i)
 		: "cc", "cx");
@@ -51,7 +63,7 @@ static __always_inline void refcount_add(unsigned int i, refcount_t *r)
 static __always_inline void refcount_inc(refcount_t *r)
 {
 	asm volatile(LOCK_PREFIX "incl %0\n\t"
-		REFCOUNT_CHECK_LT_ZERO
+		"REFCOUNT_CHECK_LT_ZERO counter=\"%[counter]\""
 		: [counter] "+m" (r->refs.counter)
 		: : "cc", "cx");
 }
@@ -59,7 +71,7 @@ static __always_inline void refcount_inc(refcount_t *r)
 static __always_inline void refcount_dec(refcount_t *r)
 {
 	asm volatile(LOCK_PREFIX "decl %0\n\t"
-		REFCOUNT_CHECK_LE_ZERO
+		"REFCOUNT_CHECK_LE_ZERO counter=\"%[counter]\""
 		: [counter] "+m" (r->refs.counter)
 		: : "cc", "cx");
 }
@@ -67,13 +79,15 @@ static __always_inline void refcount_dec(refcount_t *r)
 static __always_inline __must_check
 bool refcount_sub_and_test(unsigned int i, refcount_t *r)
 {
-	GEN_BINARY_SUFFIXED_RMWcc(LOCK_PREFIX "subl", REFCOUNT_CHECK_LT_ZERO,
+	GEN_BINARY_SUFFIXED_RMWcc(LOCK_PREFIX "subl",
+				  "REFCOUNT_CHECK_LT_ZERO counter=\"%0\"",
 				  r->refs.counter, "er", i, "%0", e, "cx");
 }
 
 static __always_inline __must_check bool refcount_dec_and_test(refcount_t *r)
 {
-	GEN_UNARY_SUFFIXED_RMWcc(LOCK_PREFIX "decl", REFCOUNT_CHECK_LT_ZERO,
+	GEN_UNARY_SUFFIXED_RMWcc(LOCK_PREFIX "decl",
+				 "REFCOUNT_CHECK_LT_ZERO counter=\"%0\"",
 				 r->refs.counter, "%0", e, "cx");
 }
 
@@ -91,7 +105,7 @@ bool refcount_add_not_zero(unsigned int i, refcount_t *r)
 
 		/* Did we try to increment from/to an undesirable state? */
 		if (unlikely(c < 0 || c == INT_MAX || result < c)) {
-			asm volatile(REFCOUNT_ERROR
+			asm volatile("REFCOUNT_ERROR counter=\"%[counter]\""
 				     : : [counter] "m" (r->refs.counter)
 				     : "cc", "cx");
 			break;
@@ -107,4 +121,6 @@ static __always_inline __must_check bool refcount_inc_not_zero(refcount_t *r)
 	return refcount_add_not_zero(1, r);
 }
 
+#endif /* __ASSEMBLY__ */
+
 #endif
diff --git a/arch/x86/kernel/macros.S b/arch/x86/kernel/macros.S
index cee28c3246dc..f1fe1d570365 100644
--- a/arch/x86/kernel/macros.S
+++ b/arch/x86/kernel/macros.S
@@ -7,3 +7,4 @@
  */
 
 #include <linux/compiler.h>
+#include <asm/refcount.h>
-- 
2.17.1



* [PATCH v9 05/10] x86: alternatives: macrofy locks for better inlining
  2018-10-03 21:30 [PATCH v9 00/10] x86: macrofying inline asm Nadav Amit
                   ` (3 preceding siblings ...)
  2018-10-03 21:30 ` [PATCH v9 04/10] x86: refcount: prevent gcc distortions Nadav Amit
@ 2018-10-03 21:30 ` Nadav Amit
  2018-10-04 10:03   ` [tip:x86/build] x86/alternatives: Macrofy lock prefixes to work around GCC inlining bugs tip-bot for Nadav Amit
  2018-10-03 21:30 ` [PATCH v9 06/10] x86: bug: prevent gcc distortions Nadav Amit
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 116+ messages in thread
From: Nadav Amit @ 2018-10-03 21:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, x86, Nadav Amit, Thomas Gleixner, H. Peter Anvin,
	Josh Poimboeuf

GCC considers the number of statements in inlined assembly blocks,
according to new-lines and semicolons, as an indication of the cost of
the block in time and space. This data is distorted by the kernel code,
which puts information in alternative sections. As a result, the
compiler may perform incorrect inlining and branch optimizations.

The solution is to set an assembly macro and call it from the inlined
assembly block. As a result GCC considers the inline assembly block as
a single instruction.

This patch handles the LOCK prefix, allowing more aggressive inlining.

   text	   data	    bss	    dec	    hex	filename
18140140 10225284 2957312 31322736 1ddf270 ./vmlinux before
18146889 10225380 2957312 31329581 1de0d2d ./vmlinux after (+6845)

Static text symbols:
Before:	40286
After:	40218	(-68)

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/alternative-asm.h | 20 ++++++++++++++------
 arch/x86/include/asm/alternative.h     | 11 ++---------
 arch/x86/kernel/macros.S               |  1 +
 3 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/alternative-asm.h b/arch/x86/include/asm/alternative-asm.h
index 31b627b43a8e..8e4ea39e55d0 100644
--- a/arch/x86/include/asm/alternative-asm.h
+++ b/arch/x86/include/asm/alternative-asm.h
@@ -7,16 +7,24 @@
 #include <asm/asm.h>
 
 #ifdef CONFIG_SMP
-	.macro LOCK_PREFIX
-672:	lock
+.macro LOCK_PREFIX_HERE
 	.pushsection .smp_locks,"a"
 	.balign 4
-	.long 672b - .
+	.long 671f - .		# offset
 	.popsection
-	.endm
+671:
+.endm
+
+.macro LOCK_PREFIX insn:vararg
+	LOCK_PREFIX_HERE
+	lock \insn
+.endm
 #else
-	.macro LOCK_PREFIX
-	.endm
+.macro LOCK_PREFIX_HERE
+.endm
+
+.macro LOCK_PREFIX insn:vararg
+.endm
 #endif
 
 /*
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 4cd6a3b71824..d7faa16622d8 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -31,15 +31,8 @@
  */
 
 #ifdef CONFIG_SMP
-#define LOCK_PREFIX_HERE \
-		".pushsection .smp_locks,\"a\"\n"	\
-		".balign 4\n"				\
-		".long 671f - .\n" /* offset */		\
-		".popsection\n"				\
-		"671:"
-
-#define LOCK_PREFIX LOCK_PREFIX_HERE "\n\tlock; "
-
+#define LOCK_PREFIX_HERE "LOCK_PREFIX_HERE\n\t"
+#define LOCK_PREFIX "LOCK_PREFIX "
 #else /* ! CONFIG_SMP */
 #define LOCK_PREFIX_HERE ""
 #define LOCK_PREFIX ""
diff --git a/arch/x86/kernel/macros.S b/arch/x86/kernel/macros.S
index f1fe1d570365..852487a9fc56 100644
--- a/arch/x86/kernel/macros.S
+++ b/arch/x86/kernel/macros.S
@@ -8,3 +8,4 @@
 
 #include <linux/compiler.h>
 #include <asm/refcount.h>
+#include <asm/alternative-asm.h>
-- 
2.17.1



* [PATCH v9 06/10] x86: bug: prevent gcc distortions
  2018-10-03 21:30 [PATCH v9 00/10] x86: macrofying inline asm Nadav Amit
                   ` (4 preceding siblings ...)
  2018-10-03 21:30 ` [PATCH v9 05/10] x86: alternatives: macrofy locks for better inlining Nadav Amit
@ 2018-10-03 21:30 ` Nadav Amit
  2018-10-04 10:03   ` [tip:x86/build] x86/bug: Macrofy the BUG table section handling, to work around GCC inlining bugs tip-bot for Nadav Amit
  2018-10-03 21:30 ` [PATCH v9 07/10] x86: prevent inline distortion by paravirt ops Nadav Amit
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 116+ messages in thread
From: Nadav Amit @ 2018-10-03 21:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, x86, Nadav Amit, Thomas Gleixner, H. Peter Anvin,
	Josh Poimboeuf

GCC considers the number of statements in inlined assembly blocks,
according to new-lines and semicolons, as an indication of the cost of
the block in time and space. This data is distorted by the kernel code,
which puts information in alternative sections. As a result, the
compiler may perform incorrect inlining and branch optimizations.

The solution is to set an assembly macro and call it from the inlined
assembly block. As a result GCC considers the inline assembly block as
a single instruction.

This patch increases the kernel size:

   text	   data	    bss	    dec	    hex	filename
18146889 10225380 2957312 31329581 1de0d2d ./vmlinux before
18147336 10226688 2957312 31331336 1de1408 ./vmlinux after (+1755)

But enables more aggressive inlining (and probably branch decisions).
The number of static text symbols in vmlinux is lower.

Static text symbols:
Before:	40218
After:	40053	(-165)

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/bug.h | 98 ++++++++++++++++++++++----------------
 arch/x86/kernel/macros.S   |  1 +
 include/asm-generic/bug.h  |  8 ++--
 3 files changed, 61 insertions(+), 46 deletions(-)

diff --git a/arch/x86/include/asm/bug.h b/arch/x86/include/asm/bug.h
index 6804d6642767..5090035e6d16 100644
--- a/arch/x86/include/asm/bug.h
+++ b/arch/x86/include/asm/bug.h
@@ -4,6 +4,8 @@
 
 #include <linux/stringify.h>
 
+#ifndef __ASSEMBLY__
+
 /*
  * Despite that some emulators terminate on UD2, we use it for WARN().
  *
@@ -20,53 +22,15 @@
 
 #define LEN_UD2		2
 
-#ifdef CONFIG_GENERIC_BUG
-
-#ifdef CONFIG_X86_32
-# define __BUG_REL(val)	".long " __stringify(val)
-#else
-# define __BUG_REL(val)	".long " __stringify(val) " - 2b"
-#endif
-
-#ifdef CONFIG_DEBUG_BUGVERBOSE
-
-#define _BUG_FLAGS(ins, flags)						\
-do {									\
-	asm volatile("1:\t" ins "\n"					\
-		     ".pushsection __bug_table,\"aw\"\n"		\
-		     "2:\t" __BUG_REL(1b) "\t# bug_entry::bug_addr\n"	\
-		     "\t"  __BUG_REL(%c0) "\t# bug_entry::file\n"	\
-		     "\t.word %c1"        "\t# bug_entry::line\n"	\
-		     "\t.word %c2"        "\t# bug_entry::flags\n"	\
-		     "\t.org 2b+%c3\n"					\
-		     ".popsection"					\
-		     : : "i" (__FILE__), "i" (__LINE__),		\
-			 "i" (flags),					\
-			 "i" (sizeof(struct bug_entry)));		\
-} while (0)
-
-#else /* !CONFIG_DEBUG_BUGVERBOSE */
-
 #define _BUG_FLAGS(ins, flags)						\
 do {									\
-	asm volatile("1:\t" ins "\n"					\
-		     ".pushsection __bug_table,\"aw\"\n"		\
-		     "2:\t" __BUG_REL(1b) "\t# bug_entry::bug_addr\n"	\
-		     "\t.word %c0"        "\t# bug_entry::flags\n"	\
-		     "\t.org 2b+%c1\n"					\
-		     ".popsection"					\
-		     : : "i" (flags),					\
+	asm volatile("ASM_BUG ins=\"" ins "\" file=%c0 line=%c1 "	\
+		     "flags=%c2 size=%c3"				\
+		     : : "i" (__FILE__), "i" (__LINE__),                \
+			 "i" (flags),                                   \
 			 "i" (sizeof(struct bug_entry)));		\
 } while (0)
 
-#endif /* CONFIG_DEBUG_BUGVERBOSE */
-
-#else
-
-#define _BUG_FLAGS(ins, flags)  asm volatile(ins)
-
-#endif /* CONFIG_GENERIC_BUG */
-
 #define HAVE_ARCH_BUG
 #define BUG()							\
 do {								\
@@ -82,4 +46,54 @@ do {								\
 
 #include <asm-generic/bug.h>
 
+#else /* __ASSEMBLY__ */
+
+#ifdef CONFIG_GENERIC_BUG
+
+#ifdef CONFIG_X86_32
+.macro __BUG_REL val:req
+	.long \val
+.endm
+#else
+.macro __BUG_REL val:req
+	.long \val - 2b
+.endm
+#endif
+
+#ifdef CONFIG_DEBUG_BUGVERBOSE
+
+.macro ASM_BUG ins:req file:req line:req flags:req size:req
+1:	\ins
+	.pushsection __bug_table,"aw"
+2:	__BUG_REL val=1b	# bug_entry::bug_addr
+	__BUG_REL val=\file	# bug_entry::file
+	.word \line		# bug_entry::line
+	.word \flags		# bug_entry::flags
+	.org 2b+\size
+	.popsection
+.endm
+
+#else /* !CONFIG_DEBUG_BUGVERBOSE */
+
+.macro ASM_BUG ins:req file:req line:req flags:req size:req
+1:	\ins
+	.pushsection __bug_table,"aw"
+2:	__BUG_REL val=1b	# bug_entry::bug_addr
+	.word \flags		# bug_entry::flags
+	.org 2b+\size
+	.popsection
+.endm
+
+#endif /* CONFIG_DEBUG_BUGVERBOSE */
+
+#else /* CONFIG_GENERIC_BUG */
+
+.macro ASM_BUG ins:req file:req line:req flags:req size:req
+	\ins
+.endm
+
+#endif /* CONFIG_GENERIC_BUG */
+
+#endif /* __ASSEMBLY__ */
+
 #endif /* _ASM_X86_BUG_H */
diff --git a/arch/x86/kernel/macros.S b/arch/x86/kernel/macros.S
index 852487a9fc56..66ccb8e823b1 100644
--- a/arch/x86/kernel/macros.S
+++ b/arch/x86/kernel/macros.S
@@ -9,3 +9,4 @@
 #include <linux/compiler.h>
 #include <asm/refcount.h>
 #include <asm/alternative-asm.h>
+#include <asm/bug.h>
diff --git a/include/asm-generic/bug.h b/include/asm-generic/bug.h
index 20561a60db9c..cdafa5edea49 100644
--- a/include/asm-generic/bug.h
+++ b/include/asm-generic/bug.h
@@ -17,10 +17,8 @@
 #ifndef __ASSEMBLY__
 #include <linux/kernel.h>
 
-#ifdef CONFIG_BUG
-
-#ifdef CONFIG_GENERIC_BUG
 struct bug_entry {
+#ifdef CONFIG_GENERIC_BUG
 #ifndef CONFIG_GENERIC_BUG_RELATIVE_POINTERS
 	unsigned long	bug_addr;
 #else
@@ -35,8 +33,10 @@ struct bug_entry {
 	unsigned short	line;
 #endif
 	unsigned short	flags;
-};
 #endif	/* CONFIG_GENERIC_BUG */
+};
+
+#ifdef CONFIG_BUG
 
 /*
  * Don't use BUG() or BUG_ON() unless there's really no way out; one
-- 
2.17.1



* [PATCH v9 07/10] x86: prevent inline distortion by paravirt ops
  2018-10-03 21:30 [PATCH v9 00/10] x86: macrofying inline asm Nadav Amit
                   ` (5 preceding siblings ...)
  2018-10-03 21:30 ` [PATCH v9 06/10] x86: bug: prevent gcc distortions Nadav Amit
@ 2018-10-03 21:30 ` Nadav Amit
  2018-10-04 10:04   ` [tip:x86/build] x86/paravirt: Work around GCC inlining bugs when compiling " tip-bot for Nadav Amit
  2018-10-03 21:30 ` [PATCH v9 08/10] x86: extable: use macros instead of inline assembly Nadav Amit
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 116+ messages in thread
From: Nadav Amit @ 2018-10-03 21:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, x86, Nadav Amit, Alok Kataria, Thomas Gleixner,
	H. Peter Anvin, virtualization

GCC considers the number of statements in inlined assembly blocks,
according to new-lines and semicolons, as an indication to the cost of
the block in time and space. This data is distorted by the kernel code,
which puts information in alternative sections. As a result, the
compiler may perform incorrect inlining and branch optimizations.

The solution is to set an assembly macro and call it from the inlined
assembly block. As a result GCC considers the inline assembly block as
a single instruction.

The effect of the patch is more aggressive inlining, which also
increases the kernel size.

   text	   data	    bss	    dec	    hex	filename
18147336 10226688 2957312 31331336 1de1408 ./vmlinux before
18162555 10226288 2957312 31346155 1de4deb ./vmlinux after (+14819)

Static text symbols:
Before:	40053
After:	39942	(-111)

Cc: Alok Kataria <akataria@vmware.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: virtualization@lists.linux-foundation.org
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/paravirt_types.h | 56 +++++++++++++--------------
 arch/x86/kernel/macros.S              |  1 +
 2 files changed, 28 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 4b75acc23b30..83ce282eed0a 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -346,23 +346,11 @@ extern struct pv_lock_ops pv_lock_ops;
 #define paravirt_clobber(clobber)		\
 	[paravirt_clobber] "i" (clobber)
 
-/*
- * Generate some code, and mark it as patchable by the
- * apply_paravirt() alternate instruction patcher.
- */
-#define _paravirt_alt(insn_string, type, clobber)	\
-	"771:\n\t" insn_string "\n" "772:\n"		\
-	".pushsection .parainstructions,\"a\"\n"	\
-	_ASM_ALIGN "\n"					\
-	_ASM_PTR " 771b\n"				\
-	"  .byte " type "\n"				\
-	"  .byte 772b-771b\n"				\
-	"  .short " clobber "\n"			\
-	".popsection\n"
-
 /* Generate patchable code, with the default asm parameters. */
-#define paravirt_alt(insn_string)					\
-	_paravirt_alt(insn_string, "%c[paravirt_typenum]", "%c[paravirt_clobber]")
+#define paravirt_call							\
+	"PARAVIRT_CALL type=\"%c[paravirt_typenum]\""			\
+	" clobber=\"%c[paravirt_clobber]\""				\
+	" pv_opptr=\"%c[paravirt_opptr]\";"
 
 /* Simple instruction patching code. */
 #define NATIVE_LABEL(a,x,b) "\n\t.globl " a #x "_" #b "\n" a #x "_" #b ":\n\t"
@@ -390,16 +378,6 @@ unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
 
 int paravirt_disable_iospace(void);
 
-/*
- * This generates an indirect call based on the operation type number.
- * The type number, computed in PARAVIRT_PATCH, is derived from the
- * offset into the paravirt_patch_template structure, and can therefore be
- * freely converted back into a structure offset.
- */
-#define PARAVIRT_CALL					\
-	ANNOTATE_RETPOLINE_SAFE				\
-	"call *%c[paravirt_opptr];"
-
 /*
  * These macros are intended to wrap calls through one of the paravirt
  * ops structs, so that they can be later identified and patched at
@@ -537,7 +515,7 @@ int paravirt_disable_iospace(void);
 		/* since this condition will never hold */		\
 		if (sizeof(rettype) > sizeof(unsigned long)) {		\
 			asm volatile(pre				\
-				     paravirt_alt(PARAVIRT_CALL)	\
+				     paravirt_call			\
 				     post				\
 				     : call_clbr, ASM_CALL_CONSTRAINT	\
 				     : paravirt_type(op),		\
@@ -547,7 +525,7 @@ int paravirt_disable_iospace(void);
 			__ret = (rettype)((((u64)__edx) << 32) | __eax); \
 		} else {						\
 			asm volatile(pre				\
-				     paravirt_alt(PARAVIRT_CALL)	\
+				     paravirt_call			\
 				     post				\
 				     : call_clbr, ASM_CALL_CONSTRAINT	\
 				     : paravirt_type(op),		\
@@ -574,7 +552,7 @@ int paravirt_disable_iospace(void);
 		PVOP_VCALL_ARGS;					\
 		PVOP_TEST_NULL(op);					\
 		asm volatile(pre					\
-			     paravirt_alt(PARAVIRT_CALL)		\
+			     paravirt_call				\
 			     post					\
 			     : call_clbr, ASM_CALL_CONSTRAINT		\
 			     : paravirt_type(op),			\
@@ -694,6 +672,26 @@ struct paravirt_patch_site {
 extern struct paravirt_patch_site __parainstructions[],
 	__parainstructions_end[];
 
+#else	/* __ASSEMBLY__ */
+
+/*
+ * This generates an indirect call based on the operation type number.
+ * The type number, computed in PARAVIRT_PATCH, is derived from the
+ * offset into the paravirt_patch_template structure, and can therefore be
+ * freely converted back into a structure offset.
+ */
+.macro PARAVIRT_CALL type:req clobber:req pv_opptr:req
+771:	ANNOTATE_RETPOLINE_SAFE
+	call *\pv_opptr
+772:	.pushsection .parainstructions,"a"
+	_ASM_ALIGN
+	_ASM_PTR 771b
+	.byte \type
+	.byte 772b-771b
+	.short \clobber
+	.popsection
+.endm
+
 #endif	/* __ASSEMBLY__ */
 
 #endif	/* _ASM_X86_PARAVIRT_TYPES_H */
diff --git a/arch/x86/kernel/macros.S b/arch/x86/kernel/macros.S
index 66ccb8e823b1..71d8b716b111 100644
--- a/arch/x86/kernel/macros.S
+++ b/arch/x86/kernel/macros.S
@@ -10,3 +10,4 @@
 #include <asm/refcount.h>
 #include <asm/alternative-asm.h>
 #include <asm/bug.h>
+#include <asm/paravirt.h>
-- 
2.17.1



* [PATCH v9 08/10] x86: extable: use macros instead of inline assembly
  2018-10-03 21:30 [PATCH v9 00/10] x86: macrofying inline asm Nadav Amit
                   ` (6 preceding siblings ...)
  2018-10-03 21:30 ` [PATCH v9 07/10] x86: prevent inline distortion by paravirt ops Nadav Amit
@ 2018-10-03 21:30 ` Nadav Amit
  2018-10-03 21:30 ` [PATCH v9 09/10] x86: cpufeature: " Nadav Amit
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 116+ messages in thread
From: Nadav Amit @ 2018-10-03 21:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, x86, Nadav Amit, Thomas Gleixner, H. Peter Anvin,
	Josh Poimboeuf

Use assembly macros for exception-tables and call them from inline
assembly.  This not only makes the code more readable and avoids the
duplicate implementation, but also improves compilation decisions,
specifically the inlining decisions that GCC bases on the number of
new lines in inline assembly.

   text	   data	    bss	    dec	    hex	filename
18162555 10226288 2957312 31346155 1de4deb ./vmlinux before
18162879 10226256 2957312 31346447 1de4f0f ./vmlinux after (+292)

This allows functions such as nested_vmx_exit_reflected(),
set_segment_reg() and __copy_xstate_to_user() to be inlined.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/asm.h | 61 +++++++++++++++++---------------------
 arch/x86/kernel/macros.S   |  1 +
 2 files changed, 28 insertions(+), 34 deletions(-)

diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index 990770f9e76b..e47a2256c64e 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -117,28 +117,44 @@
 # define CC_OUT(c) [_cc_ ## c] "=qm"
 #endif
 
-/* Exception table entry */
 #ifdef __ASSEMBLY__
 # define _ASM_EXTABLE_HANDLE(from, to, handler)			\
-	.pushsection "__ex_table","a" ;				\
-	.balign 4 ;						\
-	.long (from) - . ;					\
-	.long (to) - . ;					\
-	.long (handler) - . ;					\
-	.popsection
+	ASM_EXTABLE_HANDLE from to handler
+
+#else /* __ASSEMBLY__ */
+
+# define _ASM_EXTABLE_HANDLE(from, to, handler)			\
+	"ASM_EXTABLE_HANDLE from=" #from " to=" #to		\
+	" handler=\"" #handler "\"\n\t"
+
+/* For C file, we already have NOKPROBE_SYMBOL macro */
+
+#endif /* __ASSEMBLY__ */
 
-# define _ASM_EXTABLE(from, to)					\
+#define _ASM_EXTABLE(from, to)					\
 	_ASM_EXTABLE_HANDLE(from, to, ex_handler_default)
 
-# define _ASM_EXTABLE_FAULT(from, to)				\
+#define _ASM_EXTABLE_FAULT(from, to)				\
 	_ASM_EXTABLE_HANDLE(from, to, ex_handler_fault)
 
-# define _ASM_EXTABLE_EX(from, to)				\
+#define _ASM_EXTABLE_EX(from, to)				\
 	_ASM_EXTABLE_HANDLE(from, to, ex_handler_ext)
 
-# define _ASM_EXTABLE_REFCOUNT(from, to)			\
+#define _ASM_EXTABLE_REFCOUNT(from, to)				\
 	_ASM_EXTABLE_HANDLE(from, to, ex_handler_refcount)
 
+/* Exception table entry */
+#ifdef __ASSEMBLY__
+
+.macro ASM_EXTABLE_HANDLE from:req to:req handler:req
+	.pushsection "__ex_table","a"
+	.balign 4
+	.long (\from) - .
+	.long (\to) - .
+	.long (\handler) - .
+	.popsection
+.endm
+
 # define _ASM_NOKPROBE(entry)					\
 	.pushsection "_kprobe_blacklist","aw" ;			\
 	_ASM_ALIGN ;						\
@@ -169,29 +185,6 @@
 	_ASM_EXTABLE(101b,103b)
 	.endm
 
-#else
-# define _EXPAND_EXTABLE_HANDLE(x) #x
-# define _ASM_EXTABLE_HANDLE(from, to, handler)			\
-	" .pushsection \"__ex_table\",\"a\"\n"			\
-	" .balign 4\n"						\
-	" .long (" #from ") - .\n"				\
-	" .long (" #to ") - .\n"				\
-	" .long (" _EXPAND_EXTABLE_HANDLE(handler) ") - .\n"	\
-	" .popsection\n"
-
-# define _ASM_EXTABLE(from, to)					\
-	_ASM_EXTABLE_HANDLE(from, to, ex_handler_default)
-
-# define _ASM_EXTABLE_FAULT(from, to)				\
-	_ASM_EXTABLE_HANDLE(from, to, ex_handler_fault)
-
-# define _ASM_EXTABLE_EX(from, to)				\
-	_ASM_EXTABLE_HANDLE(from, to, ex_handler_ext)
-
-# define _ASM_EXTABLE_REFCOUNT(from, to)			\
-	_ASM_EXTABLE_HANDLE(from, to, ex_handler_refcount)
-
-/* For C file, we already have NOKPROBE_SYMBOL macro */
 #endif
 
 #ifndef __ASSEMBLY__
diff --git a/arch/x86/kernel/macros.S b/arch/x86/kernel/macros.S
index 71d8b716b111..7baa40d5bf16 100644
--- a/arch/x86/kernel/macros.S
+++ b/arch/x86/kernel/macros.S
@@ -11,3 +11,4 @@
 #include <asm/alternative-asm.h>
 #include <asm/bug.h>
 #include <asm/paravirt.h>
+#include <asm/asm.h>
-- 
2.17.1
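
The ASM_EXTABLE_HANDLE macro in the patch above emits three ".long (x) - ."
fields, i.e. each field stores a 32-bit displacement relative to its own
address. The C sketch below shows how such a self-relative field can be
written and read back. It is an assumption-laden illustration: struct
extable_entry, rel_store(), rel_load() and the image object are hypothetical
names, not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Three self-relative 32-bit fields, mirroring the three .long lines. */
struct extable_entry {
	int32_t insn;		/* (from) - . */
	int32_t fixup;		/* (to) - . */
	int32_t handler;	/* (handler) - . */
};

/* Store target as a displacement from the field itself ("target - ."). */
static void rel_store(int32_t *field, uintptr_t target)
{
	intptr_t disp = (intptr_t)(target - (uintptr_t)field);

	assert(disp >= INT32_MIN && disp <= INT32_MAX);
	*field = (int32_t)disp;
}

/* Recover the absolute address: field address plus stored displacement. */
static uintptr_t rel_load(const int32_t *field)
{
	return (uintptr_t)field + (uintptr_t)(intptr_t)*field;
}

/* Table and targets live in one object so displacements fit in 32 bits. */
static struct {
	struct extable_entry entry;
	char code[64];
} image;
```

Because every field is relative to its own location, two fields that point
at the same target hold different stored values, and the table needs no
relocation when the whole image is moved.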



* [PATCH v9 09/10] x86: cpufeature: use macros instead of inline assembly
  2018-10-03 21:30 [PATCH v9 00/10] x86: macrofying inline asm Nadav Amit
                   ` (7 preceding siblings ...)
  2018-10-03 21:30 ` [PATCH v9 08/10] x86: extable: use macros instead of inline assembly Nadav Amit
@ 2018-10-03 21:30 ` Nadav Amit
  2018-10-03 21:31 ` [PATCH v9 10/10] x86: jump-labels: " Nadav Amit
  2018-10-07  9:18 ` PROPOSAL: Extend inline asm syntax with size spec Borislav Petkov
  10 siblings, 0 replies; 116+ messages in thread
From: Nadav Amit @ 2018-10-03 21:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, x86, Nadav Amit, Thomas Gleixner, H. Peter Anvin

Use assembly macros for static_cpu_has() and call them from inline
assembly.  This not only makes the code more readable, but also improves
compilation decisions, specifically the inlining decisions that GCC
bases on the number of new lines in inline assembly.

The patch slightly increases the kernel size:

   text	   data	    bss	    dec	    hex	filename
18162879 10226256 2957312 31346447 1de4f0f ./vmlinux before
18163528 10226300 2957312 31347140 1de51c4 ./vmlinux after (+693)

And enables the inlining of functions such as free_ldt_pgtables().

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/cpufeature.h | 82 ++++++++++++++++++-------------
 arch/x86/kernel/macros.S          |  1 +
 2 files changed, 48 insertions(+), 35 deletions(-)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index aced6c9290d6..7d442722ef24 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -2,10 +2,10 @@
 #ifndef _ASM_X86_CPUFEATURE_H
 #define _ASM_X86_CPUFEATURE_H
 
-#include <asm/processor.h>
-
-#if defined(__KERNEL__) && !defined(__ASSEMBLY__)
+#ifdef __KERNEL__
+#ifndef __ASSEMBLY__
 
+#include <asm/processor.h>
 #include <asm/asm.h>
 #include <linux/bitops.h>
 
@@ -161,37 +161,10 @@ extern void clear_cpu_cap(struct cpuinfo_x86 *c, unsigned int bit);
  */
 static __always_inline __pure bool _static_cpu_has(u16 bit)
 {
-	asm_volatile_goto("1: jmp 6f\n"
-		 "2:\n"
-		 ".skip -(((5f-4f) - (2b-1b)) > 0) * "
-			 "((5f-4f) - (2b-1b)),0x90\n"
-		 "3:\n"
-		 ".section .altinstructions,\"a\"\n"
-		 " .long 1b - .\n"		/* src offset */
-		 " .long 4f - .\n"		/* repl offset */
-		 " .word %P[always]\n"		/* always replace */
-		 " .byte 3b - 1b\n"		/* src len */
-		 " .byte 5f - 4f\n"		/* repl len */
-		 " .byte 3b - 2b\n"		/* pad len */
-		 ".previous\n"
-		 ".section .altinstr_replacement,\"ax\"\n"
-		 "4: jmp %l[t_no]\n"
-		 "5:\n"
-		 ".previous\n"
-		 ".section .altinstructions,\"a\"\n"
-		 " .long 1b - .\n"		/* src offset */
-		 " .long 0\n"			/* no replacement */
-		 " .word %P[feature]\n"		/* feature bit */
-		 " .byte 3b - 1b\n"		/* src len */
-		 " .byte 0\n"			/* repl len */
-		 " .byte 0\n"			/* pad len */
-		 ".previous\n"
-		 ".section .altinstr_aux,\"ax\"\n"
-		 "6:\n"
-		 " testb %[bitnum],%[cap_byte]\n"
-		 " jnz %l[t_yes]\n"
-		 " jmp %l[t_no]\n"
-		 ".previous\n"
+	asm_volatile_goto("STATIC_CPU_HAS bitnum=%[bitnum] "
+			  "cap_byte=\"%[cap_byte]\" "
+			  "feature=%P[feature] t_yes=%l[t_yes] "
+			  "t_no=%l[t_no] always=%P[always]"
 		 : : [feature]  "i" (bit),
 		     [always]   "i" (X86_FEATURE_ALWAYS),
 		     [bitnum]   "i" (1 << (bit & 7)),
@@ -226,5 +199,44 @@ static __always_inline __pure bool _static_cpu_has(u16 bit)
 #define CPU_FEATURE_TYPEVAL		boot_cpu_data.x86_vendor, boot_cpu_data.x86, \
 					boot_cpu_data.x86_model
 
-#endif /* defined(__KERNEL__) && !defined(__ASSEMBLY__) */
+#else /* __ASSEMBLY__ */
+
+.macro STATIC_CPU_HAS bitnum:req cap_byte:req feature:req t_yes:req t_no:req always:req
+1:
+	jmp 6f
+2:
+	.skip -(((5f-4f) - (2b-1b)) > 0) * ((5f-4f) - (2b-1b)),0x90
+3:
+	.section .altinstructions,"a"
+	.long 1b - .		/* src offset */
+	.long 4f - .		/* repl offset */
+	.word \always		/* always replace */
+	.byte 3b - 1b		/* src len */
+	.byte 5f - 4f		/* repl len */
+	.byte 3b - 2b		/* pad len */
+	.previous
+	.section .altinstr_replacement,"ax"
+4:
+	jmp \t_no
+5:
+	.previous
+	.section .altinstructions,"a"
+	.long 1b - .		/* src offset */
+	.long 0			/* no replacement */
+	.word \feature		/* feature bit */
+	.byte 3b - 1b		/* src len */
+	.byte 0			/* repl len */
+	.byte 0			/* pad len */
+	.previous
+	.section .altinstr_aux,"ax"
+6:
+	testb \bitnum,\cap_byte
+	jnz \t_yes
+	jmp \t_no
+	.previous
+.endm
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* __KERNEL__ */
 #endif /* _ASM_X86_CPUFEATURE_H */
diff --git a/arch/x86/kernel/macros.S b/arch/x86/kernel/macros.S
index 7baa40d5bf16..bf8b9c93e255 100644
--- a/arch/x86/kernel/macros.S
+++ b/arch/x86/kernel/macros.S
@@ -12,3 +12,4 @@
 #include <asm/bug.h>
 #include <asm/paravirt.h>
 #include <asm/asm.h>
+#include <asm/cpufeature.h>
-- 
2.17.1
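
The ".skip -(((5f-4f) - (2b-1b)) > 0) * ((5f-4f) - (2b-1b)),0x90" line in
the STATIC_CPU_HAS macro above relies on the GNU assembler evaluating a true
comparison to -1 and a false one to 0. The C sketch below reproduces that
arithmetic; gas_gt_zero() and altinstr_pad_len() are hypothetical helper
names used only for this illustration.

```c
#include <assert.h>

/*
 * gas evaluates a true comparison to -1 and a false one to 0, so
 * "-(x > 0) * x" yields x when x is positive and 0 otherwise.
 */
static int gas_gt_zero(int x)
{
	return x > 0 ? -1 : 0;
}

/* Number of 0x90 (NOP) bytes that the .skip directive would emit. */
static int altinstr_pad_len(int repl_len, int src_len)
{
	int diff = repl_len - src_len;	/* (5f-4f) - (2b-1b) */

	assert(repl_len >= 0 && src_len >= 0);
	return -gas_gt_zero(diff) * diff;
}
```

For example, a 2-byte source jump with a 5-byte replacement gets 3 bytes of
padding, while a source that is already at least as long as the replacement
gets none.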



* [PATCH v9 10/10] x86: jump-labels: use macros instead of inline assembly
  2018-10-03 21:30 [PATCH v9 00/10] x86: macrofying inline asm Nadav Amit
                   ` (8 preceding siblings ...)
  2018-10-03 21:30 ` [PATCH v9 09/10] x86: cpufeature: " Nadav Amit
@ 2018-10-03 21:31 ` Nadav Amit
  2018-10-07  9:18 ` PROPOSAL: Extend inline asm syntax with size spec Borislav Petkov
  10 siblings, 0 replies; 116+ messages in thread
From: Nadav Amit @ 2018-10-03 21:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, x86, Nadav Amit, Thomas Gleixner, H. Peter Anvin,
	Greg Kroah-Hartman, Kate Stewart, Philippe Ombredanne

Use assembly macros for jump-labels and call them from inline assembly.
This not only makes the code more readable, but also improves
compilation decisions, specifically the inlining decisions that GCC
bases on the number of new lines in inline assembly.

As a result the code size is slightly increased.

   text	   data	    bss	    dec	    hex	filename
18163528 10226300 2957312 31347140 1de51c4 ./vmlinux before
18163608 10227348 2957312 31348268 1de562c ./vmlinux after (+1128)

And functions such as intel_pstate_adjust_policy_max(),
kvm_cpu_accept_dm_intr(), kvm_register_readl() are inlined.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/entry/calling.h          |  2 +-
 arch/x86/include/asm/jump_label.h | 77 +++++++++++--------------------
 arch/x86/kernel/macros.S          |  1 +
 3 files changed, 30 insertions(+), 50 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 352e70cd33e8..708b46a54578 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -338,7 +338,7 @@ For 32-bit we have the following conventions - kernel is built with
 .macro CALL_enter_from_user_mode
 #ifdef CONFIG_CONTEXT_TRACKING
 #ifdef HAVE_JUMP_LABEL
-	STATIC_JUMP_IF_FALSE .Lafter_call_\@, context_tracking_enabled, def=0
+	STATIC_BRANCH_JMP l_yes=.Lafter_call_\@, key=context_tracking_enabled, branch=1
 #endif
 	call enter_from_user_mode
 .Lafter_call_\@:
diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h
index 8c0de4282659..4606a7101f97 100644
--- a/arch/x86/include/asm/jump_label.h
+++ b/arch/x86/include/asm/jump_label.h
@@ -2,19 +2,6 @@
 #ifndef _ASM_X86_JUMP_LABEL_H
 #define _ASM_X86_JUMP_LABEL_H
 
-#ifndef HAVE_JUMP_LABEL
-/*
- * For better or for worse, if jump labels (the gcc extension) are missing,
- * then the entire static branch patching infrastructure is compiled out.
- * If that happens, the code in here will malfunction.  Raise a compiler
- * error instead.
- *
- * In theory, jump labels and the static branch patching infrastructure
- * could be decoupled to fix this.
- */
-#error asm/jump_label.h included on a non-jump-label kernel
-#endif
-
 #define JUMP_LABEL_NOP_SIZE 5
 
 #ifdef CONFIG_X86_64
@@ -28,18 +15,27 @@
 
 #ifndef __ASSEMBLY__
 
+#ifndef HAVE_JUMP_LABEL
+/*
+ * For better or for worse, if jump labels (the gcc extension) are missing,
+ * then the entire static branch patching infrastructure is compiled out.
+ * If that happens, the code in here will malfunction.  Raise a compiler
+ * error instead.
+ *
+ * In theory, jump labels and the static branch patching infrastructure
+ * could be decoupled to fix this.
+ */
+#error asm/jump_label.h included on a non-jump-label kernel
+#endif
+
 #include <linux/stringify.h>
 #include <linux/types.h>
 
 static __always_inline bool arch_static_branch(struct static_key *key, bool branch)
 {
-	asm_volatile_goto("1:"
-		".byte " __stringify(STATIC_KEY_INIT_NOP) "\n\t"
-		".pushsection __jump_table,  \"aw\" \n\t"
-		_ASM_ALIGN "\n\t"
-		_ASM_PTR "1b, %l[l_yes], %c0 + %c1 \n\t"
-		".popsection \n\t"
-		: :  "i" (key), "i" (branch) : : l_yes);
+	asm_volatile_goto("STATIC_BRANCH_NOP l_yes=\"%l[l_yes]\" key=\"%c0\" "
+			  "branch=\"%c1\""
+			: :  "i" (key), "i" (branch) : : l_yes);
 
 	return false;
 l_yes:
@@ -48,13 +44,8 @@ static __always_inline bool arch_static_branch(struct static_key *key, bool bran
 
 static __always_inline bool arch_static_branch_jump(struct static_key *key, bool branch)
 {
-	asm_volatile_goto("1:"
-		".byte 0xe9\n\t .long %l[l_yes] - 2f\n\t"
-		"2:\n\t"
-		".pushsection __jump_table,  \"aw\" \n\t"
-		_ASM_ALIGN "\n\t"
-		_ASM_PTR "1b, %l[l_yes], %c0 + %c1 \n\t"
-		".popsection \n\t"
+	asm_volatile_goto("STATIC_BRANCH_JMP l_yes=\"%l[l_yes]\" key=\"%c0\" "
+			  "branch=\"%c1\""
 		: :  "i" (key), "i" (branch) : : l_yes);
 
 	return false;
@@ -76,35 +67,23 @@ struct jump_entry {
 
 #else	/* __ASSEMBLY__ */
 
-.macro STATIC_JUMP_IF_TRUE target, key, def
-.Lstatic_jump_\@:
-	.if \def
-	/* Equivalent to "jmp.d32 \target" */
-	.byte		0xe9
-	.long		\target - .Lstatic_jump_after_\@
-.Lstatic_jump_after_\@:
-	.else
-	.byte		STATIC_KEY_INIT_NOP
-	.endif
+.macro STATIC_BRANCH_NOP l_yes:req key:req branch:req
+1:
+	.byte STATIC_KEY_INIT_NOP
 	.pushsection __jump_table, "aw"
 	_ASM_ALIGN
-	_ASM_PTR	.Lstatic_jump_\@, \target, \key
+	_ASM_PTR 1b, \l_yes, \key + \branch
 	.popsection
 .endm
 
-.macro STATIC_JUMP_IF_FALSE target, key, def
-.Lstatic_jump_\@:
-	.if \def
-	.byte		STATIC_KEY_INIT_NOP
-	.else
-	/* Equivalent to "jmp.d32 \target" */
-	.byte		0xe9
-	.long		\target - .Lstatic_jump_after_\@
-.Lstatic_jump_after_\@:
-	.endif
+.macro STATIC_BRANCH_JMP l_yes:req key:req branch:req
+1:
+	.byte 0xe9
+	.long \l_yes - 2f
+2:
 	.pushsection __jump_table, "aw"
 	_ASM_ALIGN
-	_ASM_PTR	.Lstatic_jump_\@, \target, \key + 1
+	_ASM_PTR 1b, \l_yes, \key + \branch
 	.popsection
 .endm
 
diff --git a/arch/x86/kernel/macros.S b/arch/x86/kernel/macros.S
index bf8b9c93e255..161c95059044 100644
--- a/arch/x86/kernel/macros.S
+++ b/arch/x86/kernel/macros.S
@@ -13,3 +13,4 @@
 #include <asm/paravirt.h>
 #include <asm/asm.h>
 #include <asm/cpufeature.h>
+#include <asm/jump_label.h>
-- 
2.17.1
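
Both STATIC_BRANCH_NOP and STATIC_BRANCH_JMP above record "\key + \branch"
as the third word of the __jump_table entry, so the branch flag travels in
the lowest bit of the key address. The C sketch below shows that packing
under the assumption that the key is at least 2-byte aligned; struct
demo_key and the encode/decode helpers are hypothetical stand-ins, not the
kernel's types or accessors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct static_key. */
struct demo_key {
	int enabled;	/* int member: aligned to more than 1 byte on common ABIs */
};

/* Mirrors "\key + \branch": the flag rides in bit 0 of the key address. */
static uintptr_t encode_key(const struct demo_key *key, bool branch)
{
	assert(((uintptr_t)key & 1) == 0);
	return (uintptr_t)key + (branch ? 1 : 0);
}

/* Mask off bit 0 to get the key pointer back. */
static const struct demo_key *decode_key(uintptr_t word)
{
	return (const struct demo_key *)(word & ~(uintptr_t)1);
}

/* Bit 0 alone carries the branch flag. */
static bool decode_branch(uintptr_t word)
{
	return word & 1;
}

static struct demo_key demo;
```

Packing the flag this way keeps each table entry at three pointer-sized
words, at the cost of requiring the alignment that the assertion checks.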



* Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions
  2018-10-03 21:30 ` [PATCH v9 04/10] x86: refcount: prevent gcc distortions Nadav Amit
@ 2018-10-04  7:57   ` Ingo Molnar
  2018-10-04  8:33     ` Ingo Molnar
  2018-10-04  8:40     ` Nadav Amit
  2018-10-04 10:02   ` [tip:x86/build] x86/refcount: Work around GCC inlining bug tip-bot for Nadav Amit
  1 sibling, 2 replies; 116+ messages in thread
From: Ingo Molnar @ 2018-10-04  7:57 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Ingo Molnar, linux-kernel, x86, Thomas Gleixner, H. Peter Anvin,
	Jan Beulich, Josh Poimboeuf, Linus Torvalds, Peter Zijlstra,
	Andy Lutomirski


* Nadav Amit <namit@vmware.com> wrote:

> GCC considers the number of statements in inlined assembly blocks,
> according to new-lines and semicolons, as an indication to the cost of
> the block in time and space. This data is distorted by the kernel code,
> which puts information in alternative sections. As a result, the
> compiler may perform incorrect inlining and branch optimizations.
> 
> The solution is to set an assembly macro and call it from the inlined
> assembly block. As a result GCC considers the inline assembly block as
> a single instruction.
> 
> This patch allows to inline functions such as __get_seccomp_filter().
> Interestingly, this allows more aggressive inlining while reducing the
> kernel size.
> 
>    text	   data	    bss	    dec	    hex	filename
> 18140970 10225412 2957312 31323694 1ddf62e ./vmlinux before
> 18140140 10225284 2957312 31322736 1ddf270 ./vmlinux after (-958)
> 
> Static text symbols:
> Before:	40302
> After:	40286	(-16)
> 
> Functions such as kref_get(), free_user(), fuse_file_get() now get
> inlined.

Yeah, so I kind of had your series on the back-burner (I'm sure you noticed!),
mostly because of what I complained about in a previous round of review a couple
of months ago: that the description of the series and the changelog of every
single patch in it is tiptoeing around the *real* problem and never truly
describes it:

   ** This is a GCC bug, plain and simple, and we are uglifying **
   ** and complicating kernel assembly code to work it around.  **

We'd never ever consider such uglification for Clang, not even _close_.

Surely this would have warranted a passing mention? Instead the changelogs are
lovingly calling it a "distortion" as if this was no-one's fault really, and
the patch a "solution".

How about calling it a "GCC inlining bug" and a "workaround with costs" 
which it is in reality, and stop whitewashing the problem?

At the same time I realize that we still need this series because GCC won't
get fixed, so as a consolation I wrote the changelog below that explains
how it really is, no holds barred.

Since the tone of the changelog is a bit ... frosty, I added this disclaimer:

  [ mingo: Wrote new changelog. ]

Let me know if you want me to make it more prominent that you had absolutely
no role in writing that changelog.

I'm also somewhat annoyed at the fact that this series carries a boatload
of reviewed-by's and acked-by's, yet none of those reviewers found it
important to point out the large chasm that is gaping between description
and reality.

Thanks,

    Ingo


=============>
Subject: x86/refcount: Prevent inlining related GCC distortions
From: Nadav Amit <namit@vmware.com>
Date: Wed, 3 Oct 2018 14:30:54 -0700

The inlining pass of GCC doesn't include an assembler, so it's not aware
of basic properties of the generated code, such as its size in bytes,
or that there are such things as discontinuous blocks of code and data
due to the newfangled linker feature called 'sections' ...

Instead GCC uses a lazy and fragile heuristic: it does a linear count of
certain syntactic and whitespace elements in inlined assembly block source
code, such as a count of new-lines and semicolons (!), as a poor substitute
for "code size and complexity".

Unsurprisingly this heuristic falls over and breaks its neck with certain
common types of kernel code that use inline assembly, such as the frequent
practice of putting useful information into alternative sections.

As a result of this fresh, 20+ years old GCC bug, GCC's inlining decisions
are effectively disabled for inlined functions that make use of such asm()
blocks, because GCC thinks those sections of code are "large" - when in
reality they often result in just a very low number of machine
instructions generated.

This absolute lack of inlining prowess when GCC comes across such asm()
blocks both increases generated kernel code size and causes performance
overhead, which is particularly noticeable on paravirt kernels, which make
frequent use of these inlining facilities in an attempt to stay out of the
way when running on baremetal hardware.

Instead of fixing the compiler we use a workaround: we set an assembly macro
and call it from the inlined assembly block. As a result GCC considers the
inline assembly block as a single instruction. (Which it often isn't but I digress.)

This uglifies and bloats the source code:

  2 files changed, 46 insertions(+), 29 deletions(-)

Yay readability and maintainability, it's not like assembly code is hard to read
and maintain ...

This patch allows GCC to inline simple functions such as __get_seccomp_filter().

To no-one's surprise the result is that GCC makes more aggressive (read: correct)
inlining decisions in these scenarios, which reduces the kernel size and presumably
also speeds it up:

      text     data     bss      dec     hex  filename
  18140970 10225412 2957312 31323694 1ddf62e  ./vmlinux before
  18140140 10225284 2957312 31322736 1ddf270  ./vmlinux after (-958)

Change in size of static text symbols:

   Before: 40302
    After: 40286 (-16)

Functions such as kref_get(), free_user(), fuse_file_get() now get inlined. Hurray!

We also hope that GCC will eventually get fixed, but we are not holding
our breath for that. Yet we are optimistic, it might still happen, any decade now.

[ mingo: Wrote new changelog. ]

Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181003213100.189959-5-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---


* Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions
  2018-10-04  7:57   ` Ingo Molnar
@ 2018-10-04  8:33     ` Ingo Molnar
  2018-10-04  8:40       ` hpa
  2018-10-04  8:40     ` Nadav Amit
  1 sibling, 1 reply; 116+ messages in thread
From: Ingo Molnar @ 2018-10-04  8:33 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Ingo Molnar, linux-kernel, x86, Thomas Gleixner, H. Peter Anvin,
	Jan Beulich, Josh Poimboeuf, Linus Torvalds, Peter Zijlstra,
	Andy Lutomirski


* Ingo Molnar <mingo@kernel.org> wrote:

> I'm also somewhat annoyed at the fact that this series carries a boatload
> of reviewed-by's and acked-by's, yet none of those reviewers found it
> important to point out the large chasm that is gaping between description
> and reality.

Another problem I just realized is that we now include arch/x86/kernel/macros.S in every 
translation pass when building the kernel, right?

But arch/x86/kernel/macros.S expands to a pretty large hierarchy of header files:

  $ make arch/x86/kernel/macros.s

  $ cat $(grep include arch/x86/kernel/macros.s | cut -d\" -f2 | sort | uniq) | wc -l
  4128

That's 4,100 extra lines of code to be preprocessed for every translation unit, of
which there are tens of thousands. More if other pieces of code get macrofied in
this fashion in the future.

If we assume that a typical distribution kernel build has ~20,000 translation units
then this change adds 82,560,000 more lines to be preprocessed, just to work around
a stupid GCC bug?
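
The estimate above is a straight multiplication. A minimal sketch, taking the
4128 lines from the wc -l output and the ~20,000 translation units as the
assumed build size (the constant and function names are illustrative):

```python
LINES_PER_UNIT = 4128        # from the `wc -l` output above
TRANSLATION_UNITS = 20_000   # assumed size of a distribution kernel build


def extra_lines(lines_per_unit: int, units: int) -> int:
    """Total extra lines preprocessed across the whole build."""
    return lines_per_unit * units


print(f"{extra_lines(LINES_PER_UNIT, TRANSLATION_UNITS):,}")
```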

I'm totally unhappy about that. Can we do this without adding macros.S?

It's also a pretty stupidly central file anyway that moves source code away
from where it's used.

Thanks,

	Ingo


* Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions
  2018-10-04  8:33     ` Ingo Molnar
@ 2018-10-04  8:40       ` hpa
  2018-10-04  8:56         ` Ingo Molnar
  2018-10-04  8:56         ` Nadav Amit
  0 siblings, 2 replies; 116+ messages in thread
From: hpa @ 2018-10-04  8:40 UTC (permalink / raw)
  To: Ingo Molnar, Nadav Amit
  Cc: Ingo Molnar, linux-kernel, x86, Thomas Gleixner, Jan Beulich,
	Josh Poimboeuf, Linus Torvalds, Peter Zijlstra, Andy Lutomirski

On October 4, 2018 1:33:33 AM PDT, Ingo Molnar <mingo@kernel.org> wrote:
>
>* Ingo Molnar <mingo@kernel.org> wrote:
>
>> I'm also somewhat annoyed at the fact that this series carries a
>boatload
>> of reviewed-by's and acked-by's, yet none of those reviewers found it
>> important to point out the large chasm that is gaping between
>description
>> and reality.
>
>Another problem I just realized is that we now include
>arch/x86/kernel/macros.S in every 
>translation pass when building the kernel, right?
>
>But arch/x86/kernel/macros.S expands to a pretty large hierarchy of
>header files:
>
>  $ make arch/x86/kernel/macros.s
>
>$ cat $(grep include arch/x86/kernel/macros.s | cut -d\" -f2 | sort |
>uniq) | wc -l
>  4128
>
>That's 4,100 extra lines of code to be preprocessed for every
>translation unit, of
>which there are tens of thousands. More if other pieces of code get
>macrofied in
>this fashion in the future.
>
>If we assume that a typical distribution kernel build has ~20,000
>translation units
>then this change adds 82,560,000 more lines to be preprocessed, just to
>work around
>a stupid GCC bug?
>
>I'm totally unhappy about that. Can we do this without adding macros.S?
>
>It's also a pretty stupidly central file anyway that moves source code
>away
>from where it's used.
>
>Thanks,
>
>	Ingo

It's not just for working around a stupid GCC bug, but it also has a huge potential for cleaning up the inline asm in general.

I would like to know if there is an actual number for the build overhead (an actual benchmark); I have asked for that once already.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


* Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions
  2018-10-04  7:57   ` Ingo Molnar
  2018-10-04  8:33     ` Ingo Molnar
@ 2018-10-04  8:40     ` Nadav Amit
  2018-10-04  9:01       ` Ingo Molnar
  1 sibling, 1 reply; 116+ messages in thread
From: Nadav Amit @ 2018-10-04  8:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ingo Molnar, LKML, X86 ML, Thomas Gleixner, H. Peter Anvin,
	Jan Beulich, Josh Poimboeuf, Linus Torvalds, Peter Zijlstra,
	Andy Lutomirski

at 12:57 AM, Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Nadav Amit <namit@vmware.com> wrote:
> 
>> GCC considers the number of statements in inlined assembly blocks,
>> according to new-lines and semicolons, as an indication of the cost of
>> the block in time and space. This data is distorted by the kernel code,
>> which puts information in alternative sections. As a result, the
>> compiler may perform incorrect inlining and branch optimizations.
>> 
>> The solution is to set an assembly macro and call it from the inlined
>> assembly block. As a result GCC considers the inline assembly block as
>> a single instruction.
>> 
>> This patch allows inlining functions such as __get_seccomp_filter().
>> Interestingly, this allows more aggressive inlining while reducing the
>> kernel size.
>> 
>>   text	   data	    bss	    dec	    hex	filename
>> 18140970 10225412 2957312 31323694 1ddf62e ./vmlinux before
>> 18140140 10225284 2957312 31322736 1ddf270 ./vmlinux after (-958)
>> 
>> Static text symbols:
>> Before:	40302
>> After:	40286	(-16)
>> 
>> Functions such as kref_get(), free_user(), fuse_file_get() now get
>> inlined.
> 
> Yeah, so I kind of had your series on the back-burner (I'm sure you noticed!),
> mostly because of what I complained about in a previous round of review a couple
> of months ago: that the description of the series and the changelog of every
> single patch in it is tiptoeing around the *real* problem and never truly
> describes it:
> 
>   ** This is a GCC bug, plain and simple, and we are uglifying **
>   ** and complicating kernel assembly code to work it around.  **
> 
> We'd never ever consider such uglification for Clang, not even _close_.
> Sure this would have warranted a passing mention? Instead the changelogs are
> lovingly calling it a "distortion" as if this was no-one's fault really, and
> the patch a "solution".
> 
> How about calling it a "GCC inlining bug" and a "workaround with costs" 
> which it is in reality, and stop whitewashing the problem?
> 
> At the same time I realize that we still need this series because GCC won't
> get fixed, so as a consolation I wrote the changelog below that explains
> how it really is, no holds barred.
> 
> Since the tone of the changelog is a bit ... frosty, I added this disclaimer:
> 
>  [ mingo: Wrote new changelog. ]
> 
> Let me know if you want me to make it more prominent that you had absolutely
> no role in writing that changelog.
> 
> I'm also somewhat annoyed at the fact that this series carries a boatload
> of reviewed-by's and acked-by's, yet none of those reviewers found it
> important to point out the large chasm that is gaping between description
> and reality.

So, I’m sorry for missing your comment about misrepresenting the problem.

Feel free to do whatever you want with the commit message (just fix the typo
in “attemt”). As long as you don’t NAK the patches or send me to redo them,
it’s fine. I just want to clarify a few things for you to consider.

First, you are right that clang does not have this issue (I checked), but
the behavior of gcc is clearly documented - once you know what to look for.

Second, I think the end result is not as ugly as you make it sound (and
maybe not ugly at all). Using this patch-set, you can write big blocks of
inlined assembly code without having those disgusting C macros. You can also
share the same code between inline asm and asm files.

You can have a look, for example, at ALTERNATIVE, which is defined both as
an assembly macro and as a C macro. Is the C macro readable? Is it easy to
maintain two different versions? I do have a patch that merges the two
implementations together (which I haven’t sent yet, since I am waiting for the
infra to be applied first), and which I think makes much more sense.

Finally, note that it’s not as if the binary always becomes smaller.
Overall, with the full patch-set it is slightly bigger. But still, that’s
how it was supposed to be if gcc wasn’t doing things badly.

Thanks again,
Nadav



* Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions
  2018-10-04  8:40       ` hpa
@ 2018-10-04  8:56         ` Ingo Molnar
  2018-10-04  8:56         ` Nadav Amit
  1 sibling, 0 replies; 116+ messages in thread
From: Ingo Molnar @ 2018-10-04  8:56 UTC (permalink / raw)
  To: hpa
  Cc: Nadav Amit, Ingo Molnar, linux-kernel, x86, Thomas Gleixner,
	Jan Beulich, Josh Poimboeuf, Linus Torvalds, Peter Zijlstra,
	Andy Lutomirski


* hpa@zytor.com <hpa@zytor.com> wrote:

> It's not just for working around a stupid GCC bug, but it also has a huge potential for 
> cleaning up the inline asm in general.

Sorry but that's just plain false. For example this patch:

   x86: cpufeature: use macros instead of inline assembly

... adds an extra macro indirection layer called STATIC_CPU_HAS, just to macrofy a single asm() 
statement that was perfectly readable and on-topic already...

There are some other cases where macrofying is also a cleanup due to sharing and naming a 
common pattern of code, but this is by no means an absolute quality of this approach.

Thanks,

	Ingo


* Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions
  2018-10-04  8:40       ` hpa
  2018-10-04  8:56         ` Ingo Molnar
@ 2018-10-04  8:56         ` Nadav Amit
  2018-10-04  9:02           ` hpa
  2018-10-04  9:12           ` Ingo Molnar
  1 sibling, 2 replies; 116+ messages in thread
From: Nadav Amit @ 2018-10-04  8:56 UTC (permalink / raw)
  To: hpa, Ingo Molnar
  Cc: Ingo Molnar, linux-kernel, x86, Thomas Gleixner, Jan Beulich,
	Josh Poimboeuf, Linus Torvalds, Peter Zijlstra, Andy Lutomirski

at 1:40 AM, hpa@zytor.com wrote:

> On October 4, 2018 1:33:33 AM PDT, Ingo Molnar <mingo@kernel.org> wrote:
>> * Ingo Molnar <mingo@kernel.org> wrote:
>> 
>>> I'm also somewhat annoyed at the fact that this series carries a
>> boatload
>>> of reviewed-by's and acked-by's, yet none of those reviewers found it
>>> important to point out the large chasm that is gaping between
>> description
>>> and reality.
>> 
>> Another problem I just realized is that we now include
>> arch/x86/kernel/macros.S in every 
>> translation pass when building the kernel, right?
>> 
>> But arch/x86/kernel/macros.S expands to a pretty large hierarchy of
>> header files:
>> 
>> $ make arch/x86/kernel/macros.s
>> 
>> $ cat $(grep include arch/x86/kernel/macros.s | cut -d\" -f2 | sort |
>> uniq) | wc -l
>> 4128
>> 
>> That's 4,100 extra lines of code to be preprocessed for every
>> translation unit, of
>> which there are tens of thousands. More if other pieces of code get
>> macrofied in
>> this fashion in the future.
>> 
>> If we assume that a typical distribution kernel build has ~20,000
>> translation units
>> then this change adds 82,560,000 more lines to be preprocessed, just to
>> work around
>> a stupid GCC bug?
>> 
>> I'm totally unhappy about that. Can we do this without adding macros.S?
>> 
>> It's also a pretty stupidly central file anyway that moves source code
>> away
>> from where it's used.
>> 
>> Thanks,
>> 
>> 	Ingo
> 
> It's not just for working around a stupid GCC bug, but it also has a huge
> potential for cleaning up the inline asm in general.
> 
> I would like to know if there is an actual number for the build overhead
> (an actual benchmark); I have asked for that once already.

I can run some tests. (@hpa: I thought you asked about the -pipe overhead;
perhaps I misunderstood).

I guess you are referring to the preprocessing of the assembler. Note that the C
preprocessing of macros.S obviously happens only once. That’s the reason
I assumed it’s not that expensive.

Anyhow, I remember that we discussed at some point doing something like
‘asm(".include XXX.s")’ and somebody said it is not good, but I don’t
remember why and don’t see any reason it is so. Unless I am missing
something, I think it is possible to take each individual header and
preprocess the assembly part of it into a separate .s file. Then we can put in
the C part of the header ‘asm(".include XXX.s")’.

What do you think?



* Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions
  2018-10-04  8:40     ` Nadav Amit
@ 2018-10-04  9:01       ` Ingo Molnar
  0 siblings, 0 replies; 116+ messages in thread
From: Ingo Molnar @ 2018-10-04  9:01 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Ingo Molnar, LKML, X86 ML, Thomas Gleixner, H. Peter Anvin,
	Jan Beulich, Josh Poimboeuf, Linus Torvalds, Peter Zijlstra,
	Andy Lutomirski


* Nadav Amit <namit@vmware.com> wrote:

> Finally, note that it’s not as if the binary always becomes smaller.
> Overall, with the full patch-set it is slightly bigger. But still, that’s
> how it was supposed to be if gcc wasn’t doing things badly.

So what I cited was the changelog for the refcount patch, which according to your
measurements reduced kernel image size.

For other patches where size grew I left that text intact.

And yes, in this particular case a slight increase in kernel size might actually be
beneficial, as we get a lot more straight-line execution in exchange. So this is
not a show-stopper property as long as the bloat isn't large.

Thanks,

	Ingo


* Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions
  2018-10-04  8:56         ` Nadav Amit
@ 2018-10-04  9:02           ` hpa
  2018-10-04  9:16             ` Ingo Molnar
  2018-10-04  9:12           ` Ingo Molnar
  1 sibling, 1 reply; 116+ messages in thread
From: hpa @ 2018-10-04  9:02 UTC (permalink / raw)
  To: Nadav Amit, Ingo Molnar
  Cc: Ingo Molnar, linux-kernel, x86, Thomas Gleixner, Jan Beulich,
	Josh Poimboeuf, Linus Torvalds, Peter Zijlstra, Andy Lutomirski

On October 4, 2018 1:56:37 AM PDT, Nadav Amit <namit@vmware.com> wrote:
>at 1:40 AM, hpa@zytor.com wrote:
>
>> On October 4, 2018 1:33:33 AM PDT, Ingo Molnar <mingo@kernel.org>
>wrote:
>>> * Ingo Molnar <mingo@kernel.org> wrote:
>>> 
>>>> I'm also somewhat annoyed at the fact that this series carries a
>>> boatload
>>>> of reviewed-by's and acked-by's, yet none of those reviewers found
>it
>>>> important to point out the large chasm that is gaping between
>>> description
>>>> and reality.
>>> 
>>> Another problem I just realized is that we now include
>>> arch/x86/kernel/macros.S in every 
>>> translation pass when building the kernel, right?
>>> 
>>> But arch/x86/kernel/macros.S expands to a pretty large hierarchy of
>>> header files:
>>> 
>>> $ make arch/x86/kernel/macros.s
>>> 
>>> $ cat $(grep include arch/x86/kernel/macros.s | cut -d\" -f2 | sort
>|
>>> uniq) | wc -l
>>> 4128
>>> 
>>> That's 4,100 extra lines of code to be preprocessed for every
>>> translation unit, of
>>> which there are tens of thousands. More if other pieces of code get
>>> macrofied in
>>> this fashion in the future.
>>> 
>>> If we assume that a typical distribution kernel build has ~20,000
>>> translation units
>>> then this change adds 82,560,000 more lines to be preprocessed, just
>to
>>> work around
>>> a stupid GCC bug?
>>> 
>>> I'm totally unhappy about that. Can we do this without adding
>macros.S?
>>> 
>>> It's also a pretty stupidly central file anyway that moves source
>code
>>> away
>>> from where it's used.
>>> 
>>> Thanks,
>>> 
>>> 	Ingo
>> 
>> It's not just for working around a stupid GCC bug, but it also has a
>huge
>> potential for cleaning up the inline asm in general.
>> 
>> I would like to know if there is an actual number for the build
>overhead
>> (an actual benchmark); I have asked for that once already.
>
>I can run some tests. (@hpa: I thought you asked about the -pipe
>overhead;
>perhaps I misunderstood).
>
>I guess you are referring to the preprocessing of the assembler. Note that the
>C 
>preprocessing of macros.S obviously happens only once. That’s the
>reason
>I assumed it’s not that expensive.
>
>Anyhow, I remember that we discussed at some point doing something like
>‘asm(".include XXX.s")’ and somebody said it is not good, but I don’t
>remember why and don’t see any reason it is so. Unless I am missing
>something, I think it is possible to take each individual header and
>preprocess the assembly part of it into a separate .s file. Then we can put in
>the C part of the header ‘asm(".include XXX.s")’.
>
>What do you think?

Ingo: I wasn't talking necessarily about the specifics of each bit, but rather the general concept about being able to use macros in inlines... I can send you something I have been working on in the background, but have been holding off on because of this, in the morning my time.

But that's no excuse for not doing it right.

Global asm() statements are problematic because there is no guarantee where in the assembly source they will end up. You'd almost need to intercept the assembly output, move all the includes to the top, and then process...
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


* Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions
  2018-10-04  8:56         ` Nadav Amit
  2018-10-04  9:02           ` hpa
@ 2018-10-04  9:12           ` Ingo Molnar
  2018-10-04  9:17             ` hpa
  2018-10-04  9:30             ` Nadav Amit
  1 sibling, 2 replies; 116+ messages in thread
From: Ingo Molnar @ 2018-10-04  9:12 UTC (permalink / raw)
  To: Nadav Amit
  Cc: hpa, Ingo Molnar, linux-kernel, x86, Thomas Gleixner,
	Jan Beulich, Josh Poimboeuf, Linus Torvalds, Peter Zijlstra,
	Andy Lutomirski


* Nadav Amit <namit@vmware.com> wrote:

> I can run some tests. (@hpa: I thought you asked about the -pipe overhead;
> perhaps I misunderstood).

Well, tests are unlikely to show the overhead of extra lines of this
magnitude, unless done very carefully, yet the added bloat exists and is not even
mentioned by the changelog, it just says:

  Subject: [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm

  Using macros for inline assembly improves both readability and
  compilation decisions that are distorted by big assembly blocks that use
  alternative sections. Compile macros.S and use it to assemble all C
  files. Currently, only x86 will use it.

> I guess you are referring to the preprocessing of the assembler. Note that the C
> preprocessing of macros.S obviously happens only once. That’s the reason
> I assumed it’s not that expensive.

True - so first we build macros.s, and that gets included in every C file build, right?

macros.s is smaller: 275 lines only in the distro test build I tried, which looks
a lot better than my first 4,200 lines guesstimate.

> Anyhow, I remember that we discussed at some point doing something like
> ‘asm(".include XXX.s")’ and somebody said it is not good, but I don’t
> remember why and don’t see any reason it is so. Unless I am missing
> something, I think it is possible to take each individual header and
> preprocess the assembly part of it into a separate .s file. Then we can put in
> the C part of the header ‘asm(".include XXX.s")’.
> 
> What do you think?

Hm, this looks quite complex - macros.s is better I think. Also, 275 straight assembly lines is 
a lot better than 4,200.

Another, separate question I wanted to ask: how do we ensure that the kernel stays fixed?
I.e. is there some tooling we can use to actually measure whether there's bad inlining decisions 
done, to detect all these bad patterns that cause bad GCC code generation?

Thanks,

	Ingo


* Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions
  2018-10-04  9:02           ` hpa
@ 2018-10-04  9:16             ` Ingo Molnar
  2018-10-04 19:33               ` H. Peter Anvin
  0 siblings, 1 reply; 116+ messages in thread
From: Ingo Molnar @ 2018-10-04  9:16 UTC (permalink / raw)
  To: hpa
  Cc: Nadav Amit, Ingo Molnar, linux-kernel, x86, Thomas Gleixner,
	Jan Beulich, Josh Poimboeuf, Linus Torvalds, Peter Zijlstra,
	Andy Lutomirski


* hpa@zytor.com <hpa@zytor.com> wrote:

> Ingo: I wasn't talking necessarily about the specifics of each bit, but rather the general 
> concept about being able to use macros in inlines...

Ok, agreed about that part - and some of the patches did improve readability.

Also, the 275 lines macros.s is a lot nicer than the 4,200 lines macros.S.

Also, I'm not against using workarounds when the benefits are larger than the costs, but I am 
against *hiding* the fact that these are workarounds and that for some of them there are costs.

> I can send you something I have been working on in the background, but have been holding off 
> on because of this, in the morning my time.

BTW., I have applied most of the series to tip:x86/kbuild already, and will push them out later 
today after some testing. I didn't apply the final 3 patches as they have dependencies, but 
applied the basics and fixed up the changelogs.

So you can rely on this.

Thanks,

	Ingo


* Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions
  2018-10-04  9:12           ` Ingo Molnar
@ 2018-10-04  9:17             ` hpa
  2018-10-04  9:30             ` Nadav Amit
  1 sibling, 0 replies; 116+ messages in thread
From: hpa @ 2018-10-04  9:17 UTC (permalink / raw)
  To: Ingo Molnar, Nadav Amit
  Cc: Ingo Molnar, linux-kernel, x86, Thomas Gleixner, Jan Beulich,
	Josh Poimboeuf, Linus Torvalds, Peter Zijlstra, Andy Lutomirski

On October 4, 2018 2:12:22 AM PDT, Ingo Molnar <mingo@kernel.org> wrote:
>
>* Nadav Amit <namit@vmware.com> wrote:
>
>> I can run some tests. (@hpa: I thought you asked about the -pipe
>overhead;
>> perhaps I misunderstood).
>
>Well, tests are unlikely to show the overhead of extra lines of this
>magnitude, unless done very carefully, yet the added bloat exists and
>is not even
>mentioned by the changelog, it just says:
>
>Subject: [PATCH v9 02/10] Makefile: Prepare for using macros for inline
>asm
>
>  Using macros for inline assembly improves both readability and
>compilation decisions that are distorted by big assembly blocks that
>use
>  alternative sections. Compile macros.S and use it to assemble all C
>  files. Currently, only x86 will use it.
>
>> I guess you are referring to the preprocessing of the assembler. Note that
>the C 
>> preprocessing of macros.S obviously happens only once. That’s the
>reason
>> I assumed it’s not that expensive.
>
>True - so first we build macros.s, and that gets included in every C
>file build, right?
>
>macros.s is smaller: 275 lines only in the distro test build I tried,
>which looks
>a lot better than my first 4,200 lines guesstimate.
>
>> Anyhow, I remember that we discussed at some point doing something
>like
>> ‘asm(".include XXX.s")’ and somebody said it is not good, but I don’t
>> remember why and don’t see any reason it is so. Unless I am missing
>> something, I think it is possible to take each individual header and
>> preprocess the assembly part of it into a separate .s file. Then we can put in
>> the C part of the header ‘asm(".include XXX.s")’.
>> 
>> What do you think?
>
>Hm, this looks quite complex - macros.s is better I think. Also, 275
>straight assembly lines is 
>a lot better than 4,200.
>
>Another, separate question I wanted to ask: how do we ensure that the
>kernel stays fixed?
>I.e. is there some tooling we can use to actually measure whether
>there's bad inlining decisions 
>done, to detect all these bad patterns that cause bad GCC code
>generation?
>
>Thanks,
>
>	Ingo

The assembly output from GCC is quite voluminous; I doubt tacking a few hundred lines on will matter one iota.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


* Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions
  2018-10-04  9:12           ` Ingo Molnar
  2018-10-04  9:17             ` hpa
@ 2018-10-04  9:30             ` Nadav Amit
  2018-10-04  9:45               ` Ingo Molnar
  1 sibling, 1 reply; 116+ messages in thread
From: Nadav Amit @ 2018-10-04  9:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: hpa, Ingo Molnar, linux-kernel, x86, Thomas Gleixner,
	Jan Beulich, Josh Poimboeuf, Linus Torvalds, Peter Zijlstra,
	Andy Lutomirski

at 2:12 AM, Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Nadav Amit <namit@vmware.com> wrote:
> 
>> I can run some tests. (@hpa: I thought you asked about the -pipe overhead;
>> perhaps I misunderstood).
> 
> Well, tests are unlikely to show the overhead of extra lines of this
> magnitude, unless done very carefully, yet the added bloat exists and is not even
> mentioned by the changelog, it just says:
> 
>  Subject: [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm
> 
>  Using macros for inline assembly improves both readability and
>  compilation decisions that are distorted by big assembly blocks that use
>  alternative sections. Compile macros.S and use it to assemble all C
>  files. Currently, only x86 will use it.
> 
>> I guess you are referring to the preprocessing of the assembler. Note that the C
>> preprocessing of macros.S obviously happens only once. That’s the reason
>> I assumed it’s not that expensive.
> 
> True - so first we build macros.s, and that gets included in every C file build, right?
Right.

> 
> macros.s is smaller: 275 lines only in the distro test build I tried, which looks
> a lot better than my first 4,200 lines guesstimate.
> 
>> Anyhow, I remember that we discussed at some point doing something like
>> ‘asm(".include XXX.s")’ and somebody said it is not good, but I don’t
>> remember why and don’t see any reason it is so. Unless I am missing
>> something, I think it is possible to take each individual header and
>> preprocess the assembly part of it into a separate .s file. Then we can put in
>> the C part of the header ‘asm(".include XXX.s")’.
>> 
>> What do you think?
> 
> Hm, this looks quite complex - macros.s is better I think. Also, 275 straight assembly lines is 
> a lot better than 4,200.

I’m really not into it, and hpa reminded me why it wouldn’t work. For some
reason I thought the order of macros doesn’t matter in asm (I probably
should go to sleep).

> Another, separate question I wanted to ask: how do we ensure that the kernel stays fixed?
> I.e. is there some tooling we can use to actually measure whether there's bad inlining decisions 
> done, to detect all these bad patterns that cause bad GCC code generation?

Good question. First, I should point out that this patch-set does not handle
all the issues. There is still the issue of conditional use of
__builtin_constant_p().

One indication of bad inlining decisions is that inline functions have
multiple (non-inlined) instances in the binary and are short. I don’t
have an automatic solution, but you can try, for example, running:

nm --print-size ./vmlinux | grep ' t ' | cut -d' ' -f2- | sort | uniq -c | \
	grep -v '^      1' | sort -n -r | head -n 5

There are however many false positives. After these patches, for example, I
get:

     11 000000000000012f t jhash
      7 0000000000000017 t dst_output
      6 0000000000000011 t kzalloc
      5 000000000000002f t acpi_os_allocate_zeroed
      5 0000000000000029 t acpi_os_allocate


jhash() should not have been inlined in my mind, and should have a
non-inlined implementation. dst_output() is used as a function pointer.
kzalloc() and the next two suffer from the __builtin_constant_p() problem I
described in the past.
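
For anyone who prefers a script over the shell pipeline, the same histogram
can be sketched in Python. This is a minimal sketch, not an existing kernel
tool: the function name duplicated_static_functions() is made up, the sample
input is invented for illustration (it is not real vmlinux output), and ties
between equal counts are broken by symbol name rather than by whatever order
sort -n -r produces.

```python
from collections import Counter


def duplicated_static_functions(nm_output: str, top: int = 5):
    """Return (count, name, size) for local text symbols seen more than once."""
    counts = Counter()
    for line in nm_output.splitlines():
        fields = line.split()
        # `nm --print-size` prints: address size type name
        # Symbols without a size have only three fields and are skipped.
        if len(fields) == 4 and fields[2] == "t":
            _addr, size, _type, name = fields
            counts[(name, int(size, 16))] += 1
    dups = [(n, name, size) for (name, size), n in counts.items() if n > 1]
    # Highest count first; equal counts are ordered by name.
    dups.sort(key=lambda row: (-row[0], row[1]))
    return dups[:top]


# Invented sample input, in the format printed by `nm --print-size`.
sample = """\
ffffffff81000010 0000000000000011 t kzalloc
ffffffff81000030 0000000000000011 t kzalloc
ffffffff81000050 0000000000000017 t dst_output
ffffffff81000070 0000000000000020 T schedule
ffffffff81000090 0000000000000011 t kzalloc
ffffffff810000b0 0000000000000017 t dst_output
ffffffff810000d0 0000000000000008 t only_once
"""

for count, name, size in duplicated_static_functions(sample):
    print(f"{count:7d} {size:#06x} {name}")
```

On a real build the input would come from the output of
nm --print-size ./vmlinux instead of the sample string.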

Regards,
Nadav


* Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions
  2018-10-04  9:30             ` Nadav Amit
@ 2018-10-04  9:45               ` Ingo Molnar
  2018-10-04 10:23                 ` Nadav Amit
  0 siblings, 1 reply; 116+ messages in thread
From: Ingo Molnar @ 2018-10-04  9:45 UTC (permalink / raw)
  To: Nadav Amit
  Cc: hpa, Ingo Molnar, linux-kernel, x86, Thomas Gleixner,
	Jan Beulich, Josh Poimboeuf, Linus Torvalds, Peter Zijlstra,
	Andy Lutomirski


* Nadav Amit <namit@vmware.com> wrote:

> > Another, separate question I wanted to ask: how do we ensure that the kernel stays fixed?
> > I.e. is there some tooling we can use to actually measure whether there's bad inlining decisions 
> > done, to detect all these bad patterns that cause bad GCC code generation?
> 
> Good question. First, I should point out that this patch-set does not handle
> all the issues. There is still the issue of conditional use of
> __builtin_constant_p().
> 
> One indication of bad inlining decisions is that inline functions have
> multiple (non-inlined) instances in the binary and are short. I don’t
> have an automatic solution, but you can try, for example, running:
> 
> nm --print-size ./vmlinux | grep ' t ' | cut -d' ' -f2- | sort | uniq -c | \
> 	grep -v '^      1' | sort -n -r | head -n 5
> 
> There are however many false positives. After these patches, for example, I
> get:
> 
>      11 000000000000012f t jhash
>       7 0000000000000017 t dst_output
>       6 0000000000000011 t kzalloc
>       5 000000000000002f t acpi_os_allocate_zeroed
>       5 0000000000000029 t acpi_os_allocate
> 
> 
> jhash() should not have been inlined in my mind, and should have a
> non-inlined implementation. dst_output() is used as a function pointer.
> kzalloc() and the next two suffer from the __builtin_constant_p() problem I
> described in the past.

Ok, that's useful info.

The histogram suggests that with all your patches applied the kernel is now in a pretty good 
state in terms of inlining decisions, right?

Are you using defconfig or a reasonable distro-config for your tests?

Thanks,

	Ingo


* [tip:x86/build] kbuild/arch/xtensa: Define LINKER_SCRIPT for the linker script
  2018-10-03 21:30 ` [PATCH v9 01/10] xtensa: defining LINKER_SCRIPT for the linker script Nadav Amit
@ 2018-10-04 10:00   ` tip-bot for Nadav Amit
  0 siblings, 0 replies; 116+ messages in thread
From: tip-bot for Nadav Amit @ 2018-10-04 10:00 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, jcmvbkbc, hpa, yamada.masahiro, peterz, bp, namit,
	michal.lkml, torvalds, mingo, chris, linux-kernel

Commit-ID:  35e76b99ddf20405a6196bb7c9eb152675c93106
Gitweb:     https://git.kernel.org/tip/35e76b99ddf20405a6196bb7c9eb152675c93106
Author:     Nadav Amit <namit@vmware.com>
AuthorDate: Wed, 3 Oct 2018 14:30:51 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 4 Oct 2018 10:05:38 +0200

kbuild/arch/xtensa: Define LINKER_SCRIPT for the linker script

Define LINKER_SCRIPT when building the linker script, as is done
in other architectures. This is required, because upcoming Makefile changes
would otherwise break things.

Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Max Filippov <jcmvbkbc@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris Zankel <chris@zankel.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-xtensa@linux-xtensa.org
Link: http://lkml.kernel.org/r/20181003213100.189959-2-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/xtensa/kernel/Makefile | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/xtensa/kernel/Makefile b/arch/xtensa/kernel/Makefile
index 91907590d183..8dff506caf07 100644
--- a/arch/xtensa/kernel/Makefile
+++ b/arch/xtensa/kernel/Makefile
@@ -35,8 +35,8 @@ sed-y = -e ':a; s/\*(\([^)]*\)\.text\.unlikely/*(\1.literal.unlikely .{text}.unl
 	-e 's/\.{text}/.text/g'
 
 quiet_cmd__cpp_lds_S = LDS     $@
-cmd__cpp_lds_S = $(CPP) $(cpp_flags) -P -C -Uxtensa -D__ASSEMBLY__ $<    \
-                 | sed $(sed-y) >$@
+cmd__cpp_lds_S = $(CPP) $(cpp_flags) -P -C -Uxtensa -D__ASSEMBLY__ \
+		 -DLINKER_SCRIPT $< | sed $(sed-y) >$@
 
 $(obj)/vmlinux.lds: $(src)/vmlinux.lds.S FORCE
 	$(call if_changed_dep,_cpp_lds_S)

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [tip:x86/build] kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs
  2018-10-03 21:30 ` [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm Nadav Amit
@ 2018-10-04 10:01   ` tip-bot for Nadav Amit
  2018-11-06 18:57   ` [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm Logan Gunthorpe
  1 sibling, 0 replies; 116+ messages in thread
From: tip-bot for Nadav Amit @ 2018-10-04 10:01 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: luto, linux-kernel, sam, mingo, yamada.masahiro, tglx, namit,
	michal.lkml, brgerst, hpa, keescook, peterz, dvlasenk, torvalds,
	bp

Commit-ID:  77b0bf55bc675233d22cd5df97605d516d64525e
Gitweb:     https://git.kernel.org/tip/77b0bf55bc675233d22cd5df97605d516d64525e
Author:     Nadav Amit <namit@vmware.com>
AuthorDate: Wed, 3 Oct 2018 14:30:52 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 4 Oct 2018 10:57:09 +0200

kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs

Using macros in inline assembly allows us to work around bugs
in GCC's inlining decisions.

Compile macros.S and use it to assemble all C files.
Currently only x86 will use it.

Background:

The inlining pass of GCC doesn't include an assembler, so it's not aware
of basic properties of the generated code, such as its size in bytes,
or that there are such things as discontiguous blocks of code and data
due to the newfangled linker feature called 'sections' ...

Instead GCC uses a lazy and fragile heuristic: it does a linear count of
certain syntactic and whitespace elements in inlined assembly block source
code, such as a count of new-lines and semicolons (!), as a poor substitute
for "code size and complexity".

Unsurprisingly this heuristic falls over and breaks its neck with certain
common types of kernel code that use inline assembly, such as the frequent
practice of putting useful information into alternative sections.

As a result of this fresh, 20+ years old GCC bug, GCC's inlining decisions
are effectively disabled for inlined functions that make use of such asm()
blocks, because GCC thinks those sections of code are "large" - when in
reality they often result in just a very low number of machine
instructions.

This absolute lack of inlining prowess when GCC comes across such asm()
blocks both increases generated kernel code size and causes performance
overhead, which is particularly noticeable on paravirt kernels, which make
frequent use of these inlining facilities in an attempt to stay out of the
way when running on baremetal hardware.

Instead of fixing the compiler we use a workaround: we set an assembly macro
and call it from the inlined assembly block. As a result GCC considers the
inline assembly block as a single instruction. (Which it often isn't but I digress.)

This uglifies and bloats the source code - for example just the refcount
related changes have this impact:

 Makefile                 |    9 +++++++--
 arch/x86/Makefile        |    7 +++++++
 arch/x86/kernel/macros.S |    7 +++++++
 scripts/Kbuild.include   |    4 +++-
 scripts/mod/Makefile     |    2 ++
 5 files changed, 26 insertions(+), 3 deletions(-)

Yay readability and maintainability, it's not like assembly code is hard to read
and maintain ...

We also hope that GCC will eventually get fixed, but we are not holding
our breath for that. Yet we are optimistic, it might still happen, any decade now.

[ mingo: Wrote new changelog describing the background. ]

Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kbuild@vger.kernel.org
Link: http://lkml.kernel.org/r/20181003213100.189959-3-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 Makefile                 | 9 +++++++--
 arch/x86/Makefile        | 7 +++++++
 arch/x86/kernel/macros.S | 7 +++++++
 scripts/Kbuild.include   | 4 +++-
 scripts/mod/Makefile     | 2 ++
 5 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/Makefile b/Makefile
index 6c3da3e10f07..6c40d547513c 100644
--- a/Makefile
+++ b/Makefile
@@ -1071,7 +1071,7 @@ scripts: scripts_basic asm-generic gcc-plugins $(autoksyms_h)
 # version.h and scripts_basic is processed / created.
 
 # Listed in dependency order
-PHONY += prepare archprepare prepare0 prepare1 prepare2 prepare3
+PHONY += prepare archprepare macroprepare prepare0 prepare1 prepare2 prepare3
 
 # prepare3 is used to check if we are building in a separate output directory,
 # and if so do:
@@ -1094,7 +1094,9 @@ prepare2: prepare3 outputmakefile asm-generic
 prepare1: prepare2 $(version_h) $(autoksyms_h) include/generated/utsrelease.h
 	$(cmd_crmodverdir)
 
-archprepare: archheaders archscripts prepare1 scripts_basic
+macroprepare: prepare1 archmacros
+
+archprepare: archheaders archscripts macroprepare scripts_basic
 
 prepare0: archprepare gcc-plugins
 	$(Q)$(MAKE) $(build)=.
@@ -1162,6 +1164,9 @@ archheaders:
 PHONY += archscripts
 archscripts:
 
+PHONY += archmacros
+archmacros:
+
 PHONY += __headers
 __headers: $(version_h) scripts_basic uapi-asm-generic archheaders archscripts
 	$(Q)$(MAKE) $(build)=scripts build_unifdef
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 0e496ed476d4..5b562e464009 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -236,6 +236,13 @@ archscripts: scripts_basic
 archheaders:
 	$(Q)$(MAKE) $(build)=arch/x86/entry/syscalls all
 
+archmacros:
+	$(Q)$(MAKE) $(build)=arch/x86/kernel arch/x86/kernel/macros.s
+
+ASM_MACRO_FLAGS = -Wa,arch/x86/kernel/macros.s -Wa,-
+export ASM_MACRO_FLAGS
+KBUILD_CFLAGS += $(ASM_MACRO_FLAGS)
+
 ###
 # Kernel objects
 
diff --git a/arch/x86/kernel/macros.S b/arch/x86/kernel/macros.S
new file mode 100644
index 000000000000..cfc1c7d1a6eb
--- /dev/null
+++ b/arch/x86/kernel/macros.S
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * This file includes headers whose assembly part includes macros which are
+ * commonly used. The macros are precompiled into assmebly file which is later
+ * assembled together with each compiled file.
+ */
diff --git a/scripts/Kbuild.include b/scripts/Kbuild.include
index ce53639a864a..8aeb60eb6ee3 100644
--- a/scripts/Kbuild.include
+++ b/scripts/Kbuild.include
@@ -115,7 +115,9 @@ __cc-option = $(call try-run,\
 
 # Do not attempt to build with gcc plugins during cc-option tests.
 # (And this uses delayed resolution so the flags will be up to date.)
-CC_OPTION_CFLAGS = $(filter-out $(GCC_PLUGINS_CFLAGS),$(KBUILD_CFLAGS))
+# In addition, do not include the asm macros which are built later.
+CC_OPTION_FILTERED = $(GCC_PLUGINS_CFLAGS) $(ASM_MACRO_FLAGS)
+CC_OPTION_CFLAGS = $(filter-out $(CC_OPTION_FILTERED),$(KBUILD_CFLAGS))
 
 # cc-option
 # Usage: cflags-y += $(call cc-option,-march=winchip-c6,-march=i586)
diff --git a/scripts/mod/Makefile b/scripts/mod/Makefile
index 42c5d50f2bcc..a5b4af47987a 100644
--- a/scripts/mod/Makefile
+++ b/scripts/mod/Makefile
@@ -4,6 +4,8 @@ OBJECT_FILES_NON_STANDARD := y
 hostprogs-y	:= modpost mk_elfconfig
 always		:= $(hostprogs-y) empty.o
 
+CFLAGS_REMOVE_empty.o := $(ASM_MACRO_FLAGS)
+
 modpost-objs	:= modpost.o file2alias.o sumversion.o
 
 devicetable-offsets-file := devicetable-offsets.h

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [tip:x86/build] x86/objtool: Use asm macros to work around GCC inlining bugs
  2018-10-03 21:30 ` [PATCH v9 03/10] x86: objtool: use asm macro for better compiler decisions Nadav Amit
@ 2018-10-04 10:02   ` tip-bot for Nadav Amit
  0 siblings, 0 replies; 116+ messages in thread
From: tip-bot for Nadav Amit @ 2018-10-04 10:02 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, tglx, peterz, keescook, luto, namit, sparse, hpa,
	dvlasenk, mingo, bp, torvalds, jpoimboe, brgerst

Commit-ID:  c06c4d8090513f2974dfdbed2ac98634357ac475
Gitweb:     https://git.kernel.org/tip/c06c4d8090513f2974dfdbed2ac98634357ac475
Author:     Nadav Amit <namit@vmware.com>
AuthorDate: Wed, 3 Oct 2018 14:30:53 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 4 Oct 2018 11:24:58 +0200

x86/objtool: Use asm macros to work around GCC inlining bugs

As described in:

  77b0bf55bc67: ("kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs")

GCC's inlining heuristics are broken with common asm() patterns used in
kernel code, resulting in the effective disabling of inlining.

In the case of objtool the resulting borkage can be significant, since all the
annotations of objtool are discarded during linkage and never inlined,
yet GCC bogusly considers most functions affected by objtool annotations
as 'too large'.

The workaround is to set an assembly macro and call it from the inline
assembly block. As a result GCC considers the inline assembly block as
a single instruction. (Which it isn't, but that's the best we can get.)

This increases the kernel size slightly:

      text     data     bss      dec     hex filename
  18140829 10224724 2957312 31322865 1ddf2f1 ./vmlinux before
  18140970 10225412 2957312 31323694 1ddf62e ./vmlinux after (+829)

The number of static text symbols (i.e. non-inlined functions) is reduced:

  Before:  40321
  After:   40302 (-19)

[ mingo: Rewrote the changelog. ]

Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Christopher Li <sparse@chrisli.org>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-sparse@vger.kernel.org
Link: http://lkml.kernel.org/r/20181003213100.189959-4-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/macros.S |  2 ++
 include/linux/compiler.h | 56 +++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 45 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/macros.S b/arch/x86/kernel/macros.S
index cfc1c7d1a6eb..cee28c3246dc 100644
--- a/arch/x86/kernel/macros.S
+++ b/arch/x86/kernel/macros.S
@@ -5,3 +5,5 @@
  * commonly used. The macros are precompiled into assmebly file which is later
  * assembled together with each compiled file.
  */
+
+#include <linux/compiler.h>
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 681d866efb1e..1921545c6351 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -99,22 +99,13 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
  * unique, to convince GCC not to merge duplicate inline asm statements.
  */
 #define annotate_reachable() ({						\
-	asm volatile("%c0:\n\t"						\
-		     ".pushsection .discard.reachable\n\t"		\
-		     ".long %c0b - .\n\t"				\
-		     ".popsection\n\t" : : "i" (__COUNTER__));		\
+	asm volatile("ANNOTATE_REACHABLE counter=%c0"			\
+		     : : "i" (__COUNTER__));				\
 })
 #define annotate_unreachable() ({					\
-	asm volatile("%c0:\n\t"						\
-		     ".pushsection .discard.unreachable\n\t"		\
-		     ".long %c0b - .\n\t"				\
-		     ".popsection\n\t" : : "i" (__COUNTER__));		\
+	asm volatile("ANNOTATE_UNREACHABLE counter=%c0"			\
+		     : : "i" (__COUNTER__));				\
 })
-#define ASM_UNREACHABLE							\
-	"999:\n\t"							\
-	".pushsection .discard.unreachable\n\t"				\
-	".long 999b - .\n\t"						\
-	".popsection\n\t"
 #else
 #define annotate_reachable()
 #define annotate_unreachable()
@@ -299,6 +290,45 @@ static inline void *offset_to_ptr(const int *off)
 	return (void *)((unsigned long)off + *off);
 }
 
+#else /* __ASSEMBLY__ */
+
+#ifdef __KERNEL__
+#ifndef LINKER_SCRIPT
+
+#ifdef CONFIG_STACK_VALIDATION
+.macro ANNOTATE_UNREACHABLE counter:req
+\counter:
+	.pushsection .discard.unreachable
+	.long \counter\()b -.
+	.popsection
+.endm
+
+.macro ANNOTATE_REACHABLE counter:req
+\counter:
+	.pushsection .discard.reachable
+	.long \counter\()b -.
+	.popsection
+.endm
+
+.macro ASM_UNREACHABLE
+999:
+	.pushsection .discard.unreachable
+	.long 999b - .
+	.popsection
+.endm
+#else /* CONFIG_STACK_VALIDATION */
+.macro ANNOTATE_UNREACHABLE counter:req
+.endm
+
+.macro ANNOTATE_REACHABLE counter:req
+.endm
+
+.macro ASM_UNREACHABLE
+.endm
+#endif /* CONFIG_STACK_VALIDATION */
+
+#endif /* LINKER_SCRIPT */
+#endif /* __KERNEL__ */
 #endif /* __ASSEMBLY__ */
 
 #ifndef __optimize

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [tip:x86/build] x86/refcount: Work around GCC inlining bug
  2018-10-03 21:30 ` [PATCH v9 04/10] x86: refcount: prevent gcc distortions Nadav Amit
  2018-10-04  7:57   ` Ingo Molnar
@ 2018-10-04 10:02   ` tip-bot for Nadav Amit
  1 sibling, 0 replies; 116+ messages in thread
From: tip-bot for Nadav Amit @ 2018-10-04 10:02 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: namit, tglx, brgerst, hpa, torvalds, jpoimboe, dvlasenk,
	JBeulich, bp, peterz, mingo, linux-kernel, luto, keescook

Commit-ID:  9e1725b410594911cc5981b6c7b4cea4ec054ca8
Gitweb:     https://git.kernel.org/tip/9e1725b410594911cc5981b6c7b4cea4ec054ca8
Author:     Nadav Amit <namit@vmware.com>
AuthorDate: Wed, 3 Oct 2018 14:30:54 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 4 Oct 2018 11:24:59 +0200

x86/refcount: Work around GCC inlining bug

As described in:

  77b0bf55bc67: ("kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs")

GCC's inlining heuristics are broken with common asm() patterns used in
kernel code, resulting in the effective disabling of inlining.

The workaround is to set an assembly macro and call it from the inline
assembly block. As a result GCC considers the inline assembly block as
a single instruction. (Which it isn't, but that's the best we can get.)

This patch allows GCC to inline simple functions such as __get_seccomp_filter().

To no-one's surprise the result is that GCC performs more aggressive (read: correct)
inlining decisions in these scenarios, which reduces the kernel size and presumably
also speeds it up:

      text     data     bss      dec     hex  filename
  18140970 10225412 2957312 31323694 1ddf62e  ./vmlinux before
  18140140 10225284 2957312 31322736 1ddf270  ./vmlinux after (-958)

16 fewer static text symbols:

   Before: 40302
    After: 40286 (-16)

these got inlined instead.

Functions such as kref_get(), free_user(), fuse_file_get() now get inlined. Hurray!

[ mingo: Rewrote the changelog. ]

Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181003213100.189959-5-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/refcount.h | 74 +++++++++++++++++++++++++----------------
 arch/x86/kernel/macros.S        |  1 +
 2 files changed, 46 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/refcount.h b/arch/x86/include/asm/refcount.h
index 19b90521954c..c92909da0686 100644
--- a/arch/x86/include/asm/refcount.h
+++ b/arch/x86/include/asm/refcount.h
@@ -4,6 +4,41 @@
  * x86-specific implementation of refcount_t. Based on PAX_REFCOUNT from
  * PaX/grsecurity.
  */
+
+#ifdef __ASSEMBLY__
+
+#include <asm/asm.h>
+#include <asm/bug.h>
+
+.macro REFCOUNT_EXCEPTION counter:req
+	.pushsection .text..refcount
+111:	lea \counter, %_ASM_CX
+112:	ud2
+	ASM_UNREACHABLE
+	.popsection
+113:	_ASM_EXTABLE_REFCOUNT(112b, 113b)
+.endm
+
+/* Trigger refcount exception if refcount result is negative. */
+.macro REFCOUNT_CHECK_LT_ZERO counter:req
+	js 111f
+	REFCOUNT_EXCEPTION counter="\counter"
+.endm
+
+/* Trigger refcount exception if refcount result is zero or negative. */
+.macro REFCOUNT_CHECK_LE_ZERO counter:req
+	jz 111f
+	REFCOUNT_CHECK_LT_ZERO counter="\counter"
+.endm
+
+/* Trigger refcount exception unconditionally. */
+.macro REFCOUNT_ERROR counter:req
+	jmp 111f
+	REFCOUNT_EXCEPTION counter="\counter"
+.endm
+
+#else /* __ASSEMBLY__ */
+
 #include <linux/refcount.h>
 #include <asm/bug.h>
 
@@ -15,34 +50,11 @@
  * central refcount exception. The fixup address for the exception points
  * back to the regular execution flow in .text.
  */
-#define _REFCOUNT_EXCEPTION				\
-	".pushsection .text..refcount\n"		\
-	"111:\tlea %[counter], %%" _ASM_CX "\n"		\
-	"112:\t" ASM_UD2 "\n"				\
-	ASM_UNREACHABLE					\
-	".popsection\n"					\
-	"113:\n"					\
-	_ASM_EXTABLE_REFCOUNT(112b, 113b)
-
-/* Trigger refcount exception if refcount result is negative. */
-#define REFCOUNT_CHECK_LT_ZERO				\
-	"js 111f\n\t"					\
-	_REFCOUNT_EXCEPTION
-
-/* Trigger refcount exception if refcount result is zero or negative. */
-#define REFCOUNT_CHECK_LE_ZERO				\
-	"jz 111f\n\t"					\
-	REFCOUNT_CHECK_LT_ZERO
-
-/* Trigger refcount exception unconditionally. */
-#define REFCOUNT_ERROR					\
-	"jmp 111f\n\t"					\
-	_REFCOUNT_EXCEPTION
 
 static __always_inline void refcount_add(unsigned int i, refcount_t *r)
 {
 	asm volatile(LOCK_PREFIX "addl %1,%0\n\t"
-		REFCOUNT_CHECK_LT_ZERO
+		"REFCOUNT_CHECK_LT_ZERO counter=\"%[counter]\""
 		: [counter] "+m" (r->refs.counter)
 		: "ir" (i)
 		: "cc", "cx");
@@ -51,7 +63,7 @@ static __always_inline void refcount_add(unsigned int i, refcount_t *r)
 static __always_inline void refcount_inc(refcount_t *r)
 {
 	asm volatile(LOCK_PREFIX "incl %0\n\t"
-		REFCOUNT_CHECK_LT_ZERO
+		"REFCOUNT_CHECK_LT_ZERO counter=\"%[counter]\""
 		: [counter] "+m" (r->refs.counter)
 		: : "cc", "cx");
 }
@@ -59,7 +71,7 @@ static __always_inline void refcount_inc(refcount_t *r)
 static __always_inline void refcount_dec(refcount_t *r)
 {
 	asm volatile(LOCK_PREFIX "decl %0\n\t"
-		REFCOUNT_CHECK_LE_ZERO
+		"REFCOUNT_CHECK_LE_ZERO counter=\"%[counter]\""
 		: [counter] "+m" (r->refs.counter)
 		: : "cc", "cx");
 }
@@ -67,13 +79,15 @@ static __always_inline void refcount_dec(refcount_t *r)
 static __always_inline __must_check
 bool refcount_sub_and_test(unsigned int i, refcount_t *r)
 {
-	GEN_BINARY_SUFFIXED_RMWcc(LOCK_PREFIX "subl", REFCOUNT_CHECK_LT_ZERO,
+	GEN_BINARY_SUFFIXED_RMWcc(LOCK_PREFIX "subl",
+				  "REFCOUNT_CHECK_LT_ZERO counter=\"%0\"",
 				  r->refs.counter, "er", i, "%0", e, "cx");
 }
 
 static __always_inline __must_check bool refcount_dec_and_test(refcount_t *r)
 {
-	GEN_UNARY_SUFFIXED_RMWcc(LOCK_PREFIX "decl", REFCOUNT_CHECK_LT_ZERO,
+	GEN_UNARY_SUFFIXED_RMWcc(LOCK_PREFIX "decl",
+				 "REFCOUNT_CHECK_LT_ZERO counter=\"%0\"",
 				 r->refs.counter, "%0", e, "cx");
 }
 
@@ -91,7 +105,7 @@ bool refcount_add_not_zero(unsigned int i, refcount_t *r)
 
 		/* Did we try to increment from/to an undesirable state? */
 		if (unlikely(c < 0 || c == INT_MAX || result < c)) {
-			asm volatile(REFCOUNT_ERROR
+			asm volatile("REFCOUNT_ERROR counter=\"%[counter]\""
 				     : : [counter] "m" (r->refs.counter)
 				     : "cc", "cx");
 			break;
@@ -107,4 +121,6 @@ static __always_inline __must_check bool refcount_inc_not_zero(refcount_t *r)
 	return refcount_add_not_zero(1, r);
 }
 
+#endif /* __ASSEMBLY__ */
+
 #endif
diff --git a/arch/x86/kernel/macros.S b/arch/x86/kernel/macros.S
index cee28c3246dc..f1fe1d570365 100644
--- a/arch/x86/kernel/macros.S
+++ b/arch/x86/kernel/macros.S
@@ -7,3 +7,4 @@
  */
 
 #include <linux/compiler.h>
+#include <asm/refcount.h>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [tip:x86/build] x86/alternatives: Macrofy lock prefixes to work around GCC inlining bugs
  2018-10-03 21:30 ` [PATCH v9 05/10] x86: alternatives: macrofy locks for better inlining Nadav Amit
@ 2018-10-04 10:03   ` tip-bot for Nadav Amit
  0 siblings, 0 replies; 116+ messages in thread
From: tip-bot for Nadav Amit @ 2018-10-04 10:03 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: keescook, luto, jpoimboe, bp, peterz, linux-kernel, dvlasenk,
	hpa, brgerst, namit, torvalds, mingo, tglx

Commit-ID:  77f48ec28e4ccff94d2e5f4260a83ac27a7f3099
Gitweb:     https://git.kernel.org/tip/77f48ec28e4ccff94d2e5f4260a83ac27a7f3099
Author:     Nadav Amit <namit@vmware.com>
AuthorDate: Wed, 3 Oct 2018 14:30:55 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 4 Oct 2018 11:24:59 +0200

x86/alternatives: Macrofy lock prefixes to work around GCC inlining bugs

As described in:

  77b0bf55bc67: ("kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs")

GCC's inlining heuristics are broken with common asm() patterns used in
kernel code, resulting in the effective disabling of inlining.

The workaround is to set an assembly macro and call it from the inline
assembly block - i.e. to macrofy the affected block.

As a result GCC considers the inline assembly block as a single instruction.

This patch handles the LOCK prefix, allowing more aggressive inlining:

      text     data     bss      dec     hex  filename
  18140140 10225284 2957312 31322736 1ddf270  ./vmlinux before
  18146889 10225380 2957312 31329581 1de0d2d  ./vmlinux after (+6845)

This is the reduction in non-inlined functions:

  Before: 40286
  After:  40218 (-68)

Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181003213100.189959-6-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/alternative-asm.h | 20 ++++++++++++++------
 arch/x86/include/asm/alternative.h     | 11 ++---------
 arch/x86/kernel/macros.S               |  1 +
 3 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/alternative-asm.h b/arch/x86/include/asm/alternative-asm.h
index 31b627b43a8e..8e4ea39e55d0 100644
--- a/arch/x86/include/asm/alternative-asm.h
+++ b/arch/x86/include/asm/alternative-asm.h
@@ -7,16 +7,24 @@
 #include <asm/asm.h>
 
 #ifdef CONFIG_SMP
-	.macro LOCK_PREFIX
-672:	lock
+.macro LOCK_PREFIX_HERE
 	.pushsection .smp_locks,"a"
 	.balign 4
-	.long 672b - .
+	.long 671f - .		# offset
 	.popsection
-	.endm
+671:
+.endm
+
+.macro LOCK_PREFIX insn:vararg
+	LOCK_PREFIX_HERE
+	lock \insn
+.endm
 #else
-	.macro LOCK_PREFIX
-	.endm
+.macro LOCK_PREFIX_HERE
+.endm
+
+.macro LOCK_PREFIX insn:vararg
+.endm
 #endif
 
 /*
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 4cd6a3b71824..d7faa16622d8 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -31,15 +31,8 @@
  */
 
 #ifdef CONFIG_SMP
-#define LOCK_PREFIX_HERE \
-		".pushsection .smp_locks,\"a\"\n"	\
-		".balign 4\n"				\
-		".long 671f - .\n" /* offset */		\
-		".popsection\n"				\
-		"671:"
-
-#define LOCK_PREFIX LOCK_PREFIX_HERE "\n\tlock; "
-
+#define LOCK_PREFIX_HERE "LOCK_PREFIX_HERE\n\t"
+#define LOCK_PREFIX "LOCK_PREFIX "
 #else /* ! CONFIG_SMP */
 #define LOCK_PREFIX_HERE ""
 #define LOCK_PREFIX ""
diff --git a/arch/x86/kernel/macros.S b/arch/x86/kernel/macros.S
index f1fe1d570365..852487a9fc56 100644
--- a/arch/x86/kernel/macros.S
+++ b/arch/x86/kernel/macros.S
@@ -8,3 +8,4 @@
 
 #include <linux/compiler.h>
 #include <asm/refcount.h>
+#include <asm/alternative-asm.h>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [tip:x86/build] x86/bug: Macrofy the BUG table section handling, to work around GCC inlining bugs
  2018-10-03 21:30 ` [PATCH v9 06/10] x86: bug: prevent gcc distortions Nadav Amit
@ 2018-10-04 10:03   ` tip-bot for Nadav Amit
  0 siblings, 0 replies; 116+ messages in thread
From: tip-bot for Nadav Amit @ 2018-10-04 10:03 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: dvlasenk, peterz, mingo, bp, brgerst, jpoimboe, linux-kernel,
	namit, luto, torvalds, tglx, hpa, keescook

Commit-ID:  f81f8ad56fd1c7b99b2ed1c314527f7d9ac447c6
Gitweb:     https://git.kernel.org/tip/f81f8ad56fd1c7b99b2ed1c314527f7d9ac447c6
Author:     Nadav Amit <namit@vmware.com>
AuthorDate: Wed, 3 Oct 2018 14:30:56 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 4 Oct 2018 11:25:00 +0200

x86/bug: Macrofy the BUG table section handling, to work around GCC inlining bugs

As described in:

  77b0bf55bc67: ("kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs")

GCC's inlining heuristics are broken with common asm() patterns used in
kernel code, resulting in the effective disabling of inlining.

The workaround is to set an assembly macro and call it from the inline
assembly block. As a result GCC considers the inline assembly block as
a single instruction. (Which it isn't, but that's the best we can get.)

This patch increases the kernel size:

      text     data     bss      dec     hex  filename
  18146889 10225380 2957312 31329581 1de0d2d  ./vmlinux before
  18147336 10226688 2957312 31331336 1de1408  ./vmlinux after (+1755)

But enables more aggressive inlining (and probably better branch decisions).

The number of static text symbols in vmlinux is much lower:

 Before: 40218
 After:  40053 (-165)

The assembly code gets harder to read due to the extra macro layer.

[ mingo: Rewrote the changelog. ]

Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181003213100.189959-7-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/bug.h | 98 ++++++++++++++++++++++++++--------------------
 arch/x86/kernel/macros.S   |  1 +
 include/asm-generic/bug.h  |  8 ++--
 3 files changed, 61 insertions(+), 46 deletions(-)

diff --git a/arch/x86/include/asm/bug.h b/arch/x86/include/asm/bug.h
index 6804d6642767..5090035e6d16 100644
--- a/arch/x86/include/asm/bug.h
+++ b/arch/x86/include/asm/bug.h
@@ -4,6 +4,8 @@
 
 #include <linux/stringify.h>
 
+#ifndef __ASSEMBLY__
+
 /*
  * Despite that some emulators terminate on UD2, we use it for WARN().
  *
@@ -20,53 +22,15 @@
 
 #define LEN_UD2		2
 
-#ifdef CONFIG_GENERIC_BUG
-
-#ifdef CONFIG_X86_32
-# define __BUG_REL(val)	".long " __stringify(val)
-#else
-# define __BUG_REL(val)	".long " __stringify(val) " - 2b"
-#endif
-
-#ifdef CONFIG_DEBUG_BUGVERBOSE
-
-#define _BUG_FLAGS(ins, flags)						\
-do {									\
-	asm volatile("1:\t" ins "\n"					\
-		     ".pushsection __bug_table,\"aw\"\n"		\
-		     "2:\t" __BUG_REL(1b) "\t# bug_entry::bug_addr\n"	\
-		     "\t"  __BUG_REL(%c0) "\t# bug_entry::file\n"	\
-		     "\t.word %c1"        "\t# bug_entry::line\n"	\
-		     "\t.word %c2"        "\t# bug_entry::flags\n"	\
-		     "\t.org 2b+%c3\n"					\
-		     ".popsection"					\
-		     : : "i" (__FILE__), "i" (__LINE__),		\
-			 "i" (flags),					\
-			 "i" (sizeof(struct bug_entry)));		\
-} while (0)
-
-#else /* !CONFIG_DEBUG_BUGVERBOSE */
-
 #define _BUG_FLAGS(ins, flags)						\
 do {									\
-	asm volatile("1:\t" ins "\n"					\
-		     ".pushsection __bug_table,\"aw\"\n"		\
-		     "2:\t" __BUG_REL(1b) "\t# bug_entry::bug_addr\n"	\
-		     "\t.word %c0"        "\t# bug_entry::flags\n"	\
-		     "\t.org 2b+%c1\n"					\
-		     ".popsection"					\
-		     : : "i" (flags),					\
+	asm volatile("ASM_BUG ins=\"" ins "\" file=%c0 line=%c1 "	\
+		     "flags=%c2 size=%c3"				\
+		     : : "i" (__FILE__), "i" (__LINE__),                \
+			 "i" (flags),                                   \
 			 "i" (sizeof(struct bug_entry)));		\
 } while (0)
 
-#endif /* CONFIG_DEBUG_BUGVERBOSE */
-
-#else
-
-#define _BUG_FLAGS(ins, flags)  asm volatile(ins)
-
-#endif /* CONFIG_GENERIC_BUG */
-
 #define HAVE_ARCH_BUG
 #define BUG()							\
 do {								\
@@ -82,4 +46,54 @@ do {								\
 
 #include <asm-generic/bug.h>
 
+#else /* __ASSEMBLY__ */
+
+#ifdef CONFIG_GENERIC_BUG
+
+#ifdef CONFIG_X86_32
+.macro __BUG_REL val:req
+	.long \val
+.endm
+#else
+.macro __BUG_REL val:req
+	.long \val - 2b
+.endm
+#endif
+
+#ifdef CONFIG_DEBUG_BUGVERBOSE
+
+.macro ASM_BUG ins:req file:req line:req flags:req size:req
+1:	\ins
+	.pushsection __bug_table,"aw"
+2:	__BUG_REL val=1b	# bug_entry::bug_addr
+	__BUG_REL val=\file	# bug_entry::file
+	.word \line		# bug_entry::line
+	.word \flags		# bug_entry::flags
+	.org 2b+\size
+	.popsection
+.endm
+
+#else /* !CONFIG_DEBUG_BUGVERBOSE */
+
+.macro ASM_BUG ins:req file:req line:req flags:req size:req
+1:	\ins
+	.pushsection __bug_table,"aw"
+2:	__BUG_REL val=1b	# bug_entry::bug_addr
+	.word \flags		# bug_entry::flags
+	.org 2b+\size
+	.popsection
+.endm
+
+#endif /* CONFIG_DEBUG_BUGVERBOSE */
+
+#else /* CONFIG_GENERIC_BUG */
+
+.macro ASM_BUG ins:req file:req line:req flags:req size:req
+	\ins
+.endm
+
+#endif /* CONFIG_GENERIC_BUG */
+
+#endif /* __ASSEMBLY__ */
+
 #endif /* _ASM_X86_BUG_H */
diff --git a/arch/x86/kernel/macros.S b/arch/x86/kernel/macros.S
index 852487a9fc56..66ccb8e823b1 100644
--- a/arch/x86/kernel/macros.S
+++ b/arch/x86/kernel/macros.S
@@ -9,3 +9,4 @@
 #include <linux/compiler.h>
 #include <asm/refcount.h>
 #include <asm/alternative-asm.h>
+#include <asm/bug.h>
diff --git a/include/asm-generic/bug.h b/include/asm-generic/bug.h
index 20561a60db9c..cdafa5edea49 100644
--- a/include/asm-generic/bug.h
+++ b/include/asm-generic/bug.h
@@ -17,10 +17,8 @@
 #ifndef __ASSEMBLY__
 #include <linux/kernel.h>
 
-#ifdef CONFIG_BUG
-
-#ifdef CONFIG_GENERIC_BUG
 struct bug_entry {
+#ifdef CONFIG_GENERIC_BUG
 #ifndef CONFIG_GENERIC_BUG_RELATIVE_POINTERS
 	unsigned long	bug_addr;
 #else
@@ -35,8 +33,10 @@ struct bug_entry {
 	unsigned short	line;
 #endif
 	unsigned short	flags;
-};
 #endif	/* CONFIG_GENERIC_BUG */
+};
+
+#ifdef CONFIG_BUG
 
 /*
  * Don't use BUG() or BUG_ON() unless there's really no way out; one

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [tip:x86/build] x86/paravirt: Work around GCC inlining bugs when compiling paravirt ops
  2018-10-03 21:30 ` [PATCH v9 07/10] x86: prevent inline distortion by paravirt ops Nadav Amit
@ 2018-10-04 10:04   ` tip-bot for Nadav Amit
  0 siblings, 0 replies; 116+ messages in thread
From: tip-bot for Nadav Amit @ 2018-10-04 10:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, keescook, mingo, bp, jgross, namit, akataria, torvalds,
	linux-kernel, peterz, tglx, dvlasenk, brgerst, luto

Commit-ID:  494b5168f2de009eb80f198f668da374295098dd
Gitweb:     https://git.kernel.org/tip/494b5168f2de009eb80f198f668da374295098dd
Author:     Nadav Amit <namit@vmware.com>
AuthorDate: Wed, 3 Oct 2018 14:30:57 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 4 Oct 2018 11:25:00 +0200

x86/paravirt: Work around GCC inlining bugs when compiling paravirt ops

As described in:

  77b0bf55bc67: ("kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs")

GCC's inlining heuristics are broken with common asm() patterns used in
kernel code, resulting in the effective disabling of inlining.

The workaround is to set an assembly macro and call it from the inline
assembly block. As a result GCC considers the inline assembly block as
a single instruction. (Which it isn't, but that's the best we can get.)

In this patch we wrap the paravirt call section tricks in a macro,
to hide it from GCC.
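
A minimal sketch of the technique, assuming GCC with GNU as on an ELF
target. DEMO_SITE, .demo_sites and demo_marked_add() are made-up names
and not part of this patch. The series collects its macros in
arch/x86/kernel/macros.S; the file-scope asm below is only a stand-in so
that the example fits in one file. The directives live in a gas macro, so
the inline asm block that GCC has to cost is a single line:

```c
/*
 * Stand-in for macros.S: define the gas macro once, at file scope.
 * DEMO_SITE is a hypothetical macro that records an id in a side
 * section, i.e. data and directives rather than code.
 * __asm__ is used instead of asm so this also builds with -std=c11.
 */
__asm__(".macro DEMO_SITE id:req\n\t"
	".pushsection .demo_sites,\"a\"\n\t"
	".long \\id\n\t"
	".popsection\n\t"
	".endm");

/*
 * The asm template is one line with no semicolons, so the compiler's
 * size estimate sees a single "instruction", although the macro
 * expands to several directives at assembly time.
 */
static inline int demo_marked_add(int a, int b)
{
	__asm__ volatile("DEMO_SITE id=%c0" : : "i" (42));
	return a + b;
}
```

The "%c0" operand modifier prints the constant without the "$" prefix,
which is the same way the patches pass C constants as macro arguments.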

The effect of the patch is more aggressive inlining, which also
increases the size of the kernel.

      text     data     bss      dec     hex  filename
  18147336 10226688 2957312 31331336 1de1408  ./vmlinux before
  18162555 10226288 2957312 31346155 1de4deb  ./vmlinux after (+14819)

The number of static text symbols (non-inlined functions) goes down:

  Before: 40053
  After:  39942 (-111)

[ mingo: Rewrote the changelog. ]

Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alok Kataria <akataria@vmware.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: virtualization@lists.linux-foundation.org
Link: http://lkml.kernel.org/r/20181003213100.189959-8-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/paravirt_types.h | 56 +++++++++++++++++------------------
 arch/x86/kernel/macros.S              |  1 +
 2 files changed, 28 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 4b75acc23b30..83ce282eed0a 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -346,23 +346,11 @@ extern struct pv_lock_ops pv_lock_ops;
 #define paravirt_clobber(clobber)		\
 	[paravirt_clobber] "i" (clobber)
 
-/*
- * Generate some code, and mark it as patchable by the
- * apply_paravirt() alternate instruction patcher.
- */
-#define _paravirt_alt(insn_string, type, clobber)	\
-	"771:\n\t" insn_string "\n" "772:\n"		\
-	".pushsection .parainstructions,\"a\"\n"	\
-	_ASM_ALIGN "\n"					\
-	_ASM_PTR " 771b\n"				\
-	"  .byte " type "\n"				\
-	"  .byte 772b-771b\n"				\
-	"  .short " clobber "\n"			\
-	".popsection\n"
-
 /* Generate patchable code, with the default asm parameters. */
-#define paravirt_alt(insn_string)					\
-	_paravirt_alt(insn_string, "%c[paravirt_typenum]", "%c[paravirt_clobber]")
+#define paravirt_call							\
+	"PARAVIRT_CALL type=\"%c[paravirt_typenum]\""			\
+	" clobber=\"%c[paravirt_clobber]\""				\
+	" pv_opptr=\"%c[paravirt_opptr]\";"
 
 /* Simple instruction patching code. */
 #define NATIVE_LABEL(a,x,b) "\n\t.globl " a #x "_" #b "\n" a #x "_" #b ":\n\t"
@@ -390,16 +378,6 @@ unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
 
 int paravirt_disable_iospace(void);
 
-/*
- * This generates an indirect call based on the operation type number.
- * The type number, computed in PARAVIRT_PATCH, is derived from the
- * offset into the paravirt_patch_template structure, and can therefore be
- * freely converted back into a structure offset.
- */
-#define PARAVIRT_CALL					\
-	ANNOTATE_RETPOLINE_SAFE				\
-	"call *%c[paravirt_opptr];"
-
 /*
  * These macros are intended to wrap calls through one of the paravirt
  * ops structs, so that they can be later identified and patched at
@@ -537,7 +515,7 @@ int paravirt_disable_iospace(void);
 		/* since this condition will never hold */		\
 		if (sizeof(rettype) > sizeof(unsigned long)) {		\
 			asm volatile(pre				\
-				     paravirt_alt(PARAVIRT_CALL)	\
+				     paravirt_call			\
 				     post				\
 				     : call_clbr, ASM_CALL_CONSTRAINT	\
 				     : paravirt_type(op),		\
@@ -547,7 +525,7 @@ int paravirt_disable_iospace(void);
 			__ret = (rettype)((((u64)__edx) << 32) | __eax); \
 		} else {						\
 			asm volatile(pre				\
-				     paravirt_alt(PARAVIRT_CALL)	\
+				     paravirt_call			\
 				     post				\
 				     : call_clbr, ASM_CALL_CONSTRAINT	\
 				     : paravirt_type(op),		\
@@ -574,7 +552,7 @@ int paravirt_disable_iospace(void);
 		PVOP_VCALL_ARGS;					\
 		PVOP_TEST_NULL(op);					\
 		asm volatile(pre					\
-			     paravirt_alt(PARAVIRT_CALL)		\
+			     paravirt_call				\
 			     post					\
 			     : call_clbr, ASM_CALL_CONSTRAINT		\
 			     : paravirt_type(op),			\
@@ -694,6 +672,26 @@ struct paravirt_patch_site {
 extern struct paravirt_patch_site __parainstructions[],
 	__parainstructions_end[];
 
+#else	/* __ASSEMBLY__ */
+
+/*
+ * This generates an indirect call based on the operation type number.
+ * The type number, computed in PARAVIRT_PATCH, is derived from the
+ * offset into the paravirt_patch_template structure, and can therefore be
+ * freely converted back into a structure offset.
+ */
+.macro PARAVIRT_CALL type:req clobber:req pv_opptr:req
+771:	ANNOTATE_RETPOLINE_SAFE
+	call *\pv_opptr
+772:	.pushsection .parainstructions,"a"
+	_ASM_ALIGN
+	_ASM_PTR 771b
+	.byte \type
+	.byte 772b-771b
+	.short \clobber
+	.popsection
+.endm
+
 #endif	/* __ASSEMBLY__ */
 
 #endif	/* _ASM_X86_PARAVIRT_TYPES_H */
diff --git a/arch/x86/kernel/macros.S b/arch/x86/kernel/macros.S
index 66ccb8e823b1..71d8b716b111 100644
--- a/arch/x86/kernel/macros.S
+++ b/arch/x86/kernel/macros.S
@@ -10,3 +10,4 @@
 #include <asm/refcount.h>
 #include <asm/alternative-asm.h>
 #include <asm/bug.h>
+#include <asm/paravirt.h>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions
  2018-10-04  9:45               ` Ingo Molnar
@ 2018-10-04 10:23                 ` Nadav Amit
  2018-10-05  9:31                   ` Ingo Molnar
  0 siblings, 1 reply; 116+ messages in thread
From: Nadav Amit @ 2018-10-04 10:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: hpa, Ingo Molnar, linux-kernel, x86, Thomas Gleixner,
	Jan Beulich, Josh Poimboeuf, Linus Torvalds, Peter Zijlstra,
	Andy Lutomirski

at 2:45 AM, Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Nadav Amit <namit@vmware.com> wrote:
> 
>>> Another, separate question I wanted to ask: how do we ensure that the kernel stays fixed?
>>> I.e. is there some tooling we can use to actually measure whether there's bad inlining decisions 
>>> done, to detect all these bad patterns that cause bad GCC code generation?
>> 
>> Good question. First, I’ll indicate that this patch-set does not handle all
>> the issues. There is still the issue of conditional use of
>> __builtin_constant_p().
>> 
>> One indication of bad inlining decisions is that functions meant to be
>> inlined are short yet have multiple (non-inlined) instances in the binary.
>> I don’t have an automatic solution, but you can try, for example, running:
>> 
>> nm --print-size ./vmlinux | grep ' t ' | cut -d' ' -f2- | sort | uniq -c | \
>> 	grep -v '^      1' | sort -n -r | head -n 5
>> 
>> There are however many false positives. After these patches, for example, I
>> get:
>> 
>>     11 000000000000012f t jhash
>>      7 0000000000000017 t dst_output
>>      6 0000000000000011 t kzalloc
>>      5 000000000000002f t acpi_os_allocate_zeroed
>>      5 0000000000000029 t acpi_os_allocate
>> 
>> 
>> jhash() should not have been inlined in my mind, and should have a
>> non-inlined implementation. dst_output() is used as a function pointer.
>> kzalloc() and the next two suffer from the __builtin_constant_p() problem I
>> described in the past.
> 
> Ok, that's useful info.
> 
> The histogram suggests that with all your patches applied the kernel is now in a pretty good 
> state in terms of inlining decisions, right?

It was just an example that I ran on the kernel I built right now (with a
custom config). Please don’t regard these results as anything indicative.

> Are you using defconfig or a reasonable distro-config for your tests?

I think it is best to take the kernel and run localyesconfig for testing.

Taking Ubuntu 18.04 and doing the same gives the following results:

     12 000000000000012f t jhash
      7 0000000000000017 t dst_output
      7 0000000000000014 t init_once
      5 00000000000000d8 t jhash2
      5 000000000000004e t put_page
      5 000000000000002f t acpi_os_allocate_zeroed
      5 0000000000000029 t acpi_os_allocate
      5 0000000000000028 t umask_show
      5 0000000000000011 t kzalloc
      4 0000000000000053 t trace_xhci_dbg_quirks

Not awful, but not great.

It is a bit hard to fix the __builtin_constant_p() problem without having
some negative side-effects.

Reminder: __builtin_constant_p() is evaluated after the inlining decisions
are made. You can use __builtin_choose_expr() instead of “if” statements
and ternary operators when evaluating __builtin_constant_p() to solve
the problem. However, this sometimes causes the compiler not to know that a
value is constant, because __builtin_choose_expr() evaluation happens too
early. This __builtin_choose_expr() problem is the reason that put_page() and
kzalloc() are not being inlined. Clang, again, does not suffer from this
problem.
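
A minimal sketch of the pattern under discussion, with hypothetical
names (demo_size_class() and demo_size_class_slow() are not kernel
functions). The wrapper only becomes cheap once a constant argument has
been propagated into it, which is the information that is not yet
available when the inlining decision is made:

```c
#include <stddef.h>

/* Hypothetical out-of-line path: works for any size, but needs a loop. */
size_t demo_size_class_slow(size_t size)
{
	size_t idx = 0;

	while (((size_t)8 << idx) < size)
		idx++;
	return idx;
}

/*
 * Hypothetical inline wrapper. For a compile-time constant size the
 * whole body folds down to a single return value; otherwise it is
 * just a call to the slow path. Before folding, the function looks
 * larger than either outcome, which can make it a poor inlining
 * candidate in the compiler's estimate.
 */
static inline size_t demo_size_class(size_t size)
{
	if (__builtin_constant_p(size)) {
		if (size <= 8)
			return 0;
		if (size <= 16)
			return 1;
		if (size <= 32)
			return 2;
	}
	return demo_size_class_slow(size);
}
```

Both paths return the same value for any size, so whether the constant
branch is taken only affects the generated code, not the result.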

Anyhow, it may be a good practice to try to get rid of the rest. For
example, dst_discard() has four instances because it is always given as a
function pointer. So it should not have been inlined.

Regards,
Nadav

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions
  2018-10-04  9:16             ` Ingo Molnar
@ 2018-10-04 19:33               ` H. Peter Anvin
  2018-10-04 20:05                 ` Nadav Amit
                                   ` (2 more replies)
  0 siblings, 3 replies; 116+ messages in thread
From: H. Peter Anvin @ 2018-10-04 19:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nadav Amit, Ingo Molnar, linux-kernel, x86, Thomas Gleixner,
	Jan Beulich, Josh Poimboeuf, Linus Torvalds, Peter Zijlstra,
	Andy Lutomirski

On 10/04/18 02:16, Ingo Molnar wrote:
> 
> * hpa@zytor.com <hpa@zytor.com> wrote:
> 
>> Ingo: I wasn't talking necessarily about the specifics of each bit, but rather the general 
>> concept about being able to use macros in inlines...
> 
> Ok, agreed about that part - and some of the patches did improve readability.
> 
> Also, the 275 lines macros.s is a lot nicer than the 4,200 lines macros.S.
> 
> Also, I'm not against using workarounds when the benefits are larger than the costs, but I am 
> against *hiding* the fact that these are workarounds and that for some of them there are costs.
> 

Agreed, of course.

>> I can send you something I have been working on in the background, but have been holding off 
>> on because of this, in the morning my time.
> 
> BTW., I have applied most of the series to tip:x86/kbuild already, and will push them out later 
> today after some testing. I didn't apply the final 3 patches as they have dependencies, but 
> applied the basics and fixed up the changelogs.
> 
> So you can rely on this.
> 

Wonderful.

Here is the horrible code I mentioned yesterday.  This is about
implementing the immediate-patching framework that Linus and others have
discussed (it helps both performance and kernel hardening):

Warning: this stuff can cause serious damage to your eyes, and this is
just a small chunk of the whole mess; and relying on gas macros, as
brain damaged as they are, really is much, much cleaner than not:

	http://www.zytor.com/~hpa/foo.S

	-hpa




^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions
  2018-10-04 19:33               ` H. Peter Anvin
@ 2018-10-04 20:05                 ` Nadav Amit
  2018-10-04 20:08                   ` H. Peter Anvin
  2018-10-04 20:29                 ` Andy Lutomirski
  2018-10-06  1:40                 ` Rasmus Villemoes
  2 siblings, 1 reply; 116+ messages in thread
From: Nadav Amit @ 2018-10-04 20:05 UTC (permalink / raw)
  To: H. Peter Anvin, Ingo Molnar
  Cc: Ingo Molnar, linux-kernel, x86, Thomas Gleixner, Jan Beulich,
	Josh Poimboeuf, Linus Torvalds, Peter Zijlstra, Andy Lutomirski

at 12:33 PM, H. Peter Anvin <hpa@zytor.com> wrote:

> On 10/04/18 02:16, Ingo Molnar wrote:
>> * hpa@zytor.com <hpa@zytor.com> wrote:
>> 
>>> Ingo: I wasn't talking necessarily about the specifics of each bit, but rather the general 
>>> concept about being able to use macros in inlines...
>> 
>> Ok, agreed about that part - and some of the patches did improve readability.
>> 
>> Also, the 275 lines macros.s is a lot nicer than the 4,200 lines macros.S.
>> 
>> Also, I'm not against using workarounds when the benefits are larger than the costs, but I am 
>> against *hiding* the fact that these are workarounds and that for some of them there are costs.
> 
> Agreed, of course.
> 
>>> I can send you something I have been working on in the background, but have been holding off 
>>> on because of this, in the morning my time.
>> 
>> BTW., I have applied most of the series to tip:x86/kbuild already, and will push them out later 
>> today after some testing. I didn't apply the final 3 patches as they have dependencies, but 
>> applied the basics and fixed up the changelogs.
>> 
>> So you can rely on this.
> 
> Wonderful.
> 
> Here is the horrible code I mentioned yesterday.  This is about
> implementing the immediate-patching framework that Linus and others have
> discussed (it helps both performance and kernel hardening):
> 
> Warning: this stuff can cause serious damage to your eyes, and this is
> just a small chunk of the whole mess; and relying on gas macros, as
> brain damaged as they are, really is much, much cleaner than not:
> 
> 	http://www.zytor.com/~hpa/foo.S

Funny. Immediate-patching is what I was playing with when I encountered the
gcc issue. Performance got worse instead of improving (or at least staying
the same), because inlining got crazy.

Anyhow, wait for my soon-to-be-sent RFC in which I define a macro called
“call” (to reduce the retpoline overhead) before you talk about damage to
the eyes.


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions
  2018-10-04 20:05                 ` Nadav Amit
@ 2018-10-04 20:08                   ` H. Peter Anvin
  0 siblings, 0 replies; 116+ messages in thread
From: H. Peter Anvin @ 2018-10-04 20:08 UTC (permalink / raw)
  To: Nadav Amit, Ingo Molnar
  Cc: Ingo Molnar, linux-kernel, x86, Thomas Gleixner, Jan Beulich,
	Josh Poimboeuf, Linus Torvalds, Peter Zijlstra, Andy Lutomirski

On 10/04/18 13:05, Nadav Amit wrote:
> 
> Funny. Immediate-patching is what I was playing with when I encountered the
> gcc issue. Performance got worse instead of improving (or at least staying
> the same), because inlining got crazy.
> 
> Anyhow, wait for my soon-to-be-sent RFC in which I define a macro called
> “call” (to reduce the retpoline overhead) before you talk about damage to
> the eyes.
> 

Doing it well is *hard*, as I discovered.  The actual patch is much,
much, larger.

	-hpa



^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions
  2018-10-04 19:33               ` H. Peter Anvin
  2018-10-04 20:05                 ` Nadav Amit
@ 2018-10-04 20:29                 ` Andy Lutomirski
  2018-10-04 23:11                   ` H. Peter Anvin
  2018-10-06  1:40                 ` Rasmus Villemoes
  2 siblings, 1 reply; 116+ messages in thread
From: Andy Lutomirski @ 2018-10-04 20:29 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Nadav Amit, Ingo Molnar, LKML, X86 ML,
	Thomas Gleixner, Jan Beulich, Josh Poimboeuf, Linus Torvalds,
	Peter Zijlstra, Andrew Lutomirski

On Thu, Oct 4, 2018 at 12:33 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On 10/04/18 02:16, Ingo Molnar wrote:
> >
> > * hpa@zytor.com <hpa@zytor.com> wrote:
> >
> >> Ingo: I wasn't talking necessarily about the specifics of each bit, but rather the general
> >> concept about being able to use macros in inlines...
> >
> > Ok, agreed about that part - and some of the patches did improve readability.
> >
> > Also, the 275 lines macros.s is a lot nicer than the 4,200 lines macros.S.
> >
> > Also, I'm not against using workarounds when the benefits are larger than the costs, but I am
> > against *hiding* the fact that these are workarounds and that for some of them there are costs.
> >
>
> Agreed, of course.
>
> >> I can send you something I have been working on in the background, but have been holding off
> >> on because of this, in the morning my time.
> >
> > BTW., I have applied most of the series to tip:x86/kbuild already, and will push them out later
> > today after some testing. I didn't apply the final 3 patches as they have dependencies, but
> > applied the basics and fixed up the changelogs.
> >
> > So you can rely on this.
> >
>
> Wonderful.
>
> Here is the horrible code I mentioned yesterday.  This is about
> implementing the immediate-patching framework that Linus and others have
> discussed (it helps both performance and kernel hardening):
>

I'm wondering if a production version should look more like:

.Lpatch_point:
mov $whatever, [target]
.pushsection ".immediate_patches"
.quad .Lpatch_point
.popsection

and let objtool parse the resulting assembled output and emit an entry
in some table mapping patch_point to the actual address and size of
the immediate to be patched.
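
A rough sketch of the section-recording half of this idea, assuming a
64-bit ELF target and a linker (GNU ld or lld) that provides
__start_/__stop_ symbols for sections named like C identifiers. All
names here (DEMO_PATCH_POINT, demo_patch_sites, ...) are hypothetical,
and the objtool post-processing step is not shown; the table is simply
counted at runtime instead:

```c
#include <stddef.h>
#include <stdint.h>

/* Bounds of the side section, defined by the linker (assumption above). */
extern const uintptr_t __start_demo_patch_sites[];
extern const uintptr_t __stop_demo_patch_sites[];

/*
 * Hypothetical patch point: a placeholder instruction, plus one
 * pointer-sized record of its address in a side section.
 * .quad assumes 64-bit pointers; .balign keeps the records aligned.
 */
#define DEMO_PATCH_POINT()						\
	__asm__ volatile("1:\tnop\n\t"					\
			 ".pushsection demo_patch_sites,\"aw\"\n\t"	\
			 ".balign 8\n\t"				\
			 ".quad 1b\n\t"					\
			 ".popsection")

/* Two patch points in one function, so the table gets two entries. */
void demo_hot_function(void)
{
	DEMO_PATCH_POINT();
	DEMO_PATCH_POINT();
}

/* Number of recorded patch points; a real patcher would walk the entries. */
size_t demo_patch_site_count(void)
{
	return (size_t)(__stop_demo_patch_sites - __start_demo_patch_sites);
}
```

Each record only holds the address of the patch point; working out the
offset and size of the immediate inside the instruction is the part that
would be left to the tool.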

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions
  2018-10-04 20:29                 ` Andy Lutomirski
@ 2018-10-04 23:11                   ` H. Peter Anvin
  0 siblings, 0 replies; 116+ messages in thread
From: H. Peter Anvin @ 2018-10-04 23:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Nadav Amit, Ingo Molnar, LKML, X86 ML,
	Thomas Gleixner, Jan Beulich, Josh Poimboeuf, Linus Torvalds,
	Peter Zijlstra

On 10/04/18 13:29, Andy Lutomirski wrote:
>>
>> Wonderful.
>>
>> Here is the horrible code I mentioned yesterday.  This is about
>> implementing the immediate-patching framework that Linus and others have
>> discussed (it helps both performance and kernel hardening):
>>
> 
> I'm wondering if a production version should look more like:
> 
> .Lpatch_point:
> mov $whatever, [target]
> .pushsection ".immediate_patches"
> .quad .Lpatch_point
> .popsection
> 
> and let objtool parse the resulting assembled output and emit an entry
> in some table mapping patch_point to the actual address and size of
> the immediate to be patched.
> 

Putting the generation of the ersatz code in objtool is an interesting
idea, although I'm wondering if these macros, as awful as they are,
wouldn't have generic applicability (what they do is they allow the cold
branch of an asm to push temporaries to the stack rather than having to
have gcc clobber them.)

I'll think about it.

	-hpa

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions
  2018-10-04 10:23                 ` Nadav Amit
@ 2018-10-05  9:31                   ` Ingo Molnar
  2018-10-05 11:20                     ` Borislav Petkov
                                       ` (2 more replies)
  0 siblings, 3 replies; 116+ messages in thread
From: Ingo Molnar @ 2018-10-05  9:31 UTC (permalink / raw)
  To: Nadav Amit
  Cc: hpa, Ingo Molnar, linux-kernel, x86, Thomas Gleixner,
	Jan Beulich, Josh Poimboeuf, Linus Torvalds, Peter Zijlstra,
	Andy Lutomirski


* Nadav Amit <namit@vmware.com> wrote:

> > Are you using defconfig or a reasonable distro-config for your tests?
> 
> I think it is best to take the kernel and run localyesconfig for testing.

Ok, agreed - and this makes the numbers you provided pretty representative.

Good - now that all of my concerns were addressed I'd like to merge the remaining 3 patches as 
well - but they are conflicting with ongoing x86 work in tip:x86/core. The extable conflict is 
trivial, the jump-label conflict a bit more involved.

Could you please pick up the updated changelogs below and resolve the conflicts against 
tip:master or tip:x86/build and submit the remaining patches as well?

Thanks,

	Ingo



=============>
commit b82b0b611740c7c88050ba743c398af7eb920029
Author: Nadav Amit <namit@vmware.com>
Date:   Wed Oct 3 14:31:00 2018 -0700

    x86/jump-labels: Macrofy inline assembly code to work around GCC inlining bugs
    
    As described in:
    
      77b0bf55bc67: ("kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs")
    
    GCC's inlining heuristics are broken with common asm() patterns used in
    kernel code, resulting in the effective disabling of inlining.
    
    The workaround is to set an assembly macro and call it from the inline
    assembly block - which is also a minor cleanup for the jump-label code.
    
    As a result the code size is slightly increased, but inlining decisions
    are better:
    
          text     data     bss      dec     hex  filename
      18163528 10226300 2957312 31347140 1de51c4  ./vmlinux before
      18163608 10227348 2957312 31348268 1de562c  ./vmlinux after (+1128)
    
    And functions such as intel_pstate_adjust_policy_max(),
    kvm_cpu_accept_dm_intr(), kvm_register_readl() are inlined.
    
    Tested-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: Nadav Amit <namit@vmware.com>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Andy Lutomirski <luto@amacapital.net>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Brian Gerst <brgerst@gmail.com>
    Cc: Denys Vlasenko <dvlasenk@redhat.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: H. Peter Anvin <hpa@zytor.com>
    Cc: Kate Stewart <kstewart@linuxfoundation.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Philippe Ombredanne <pombredanne@nexb.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Link: http://lkml.kernel.org/r/20181003213100.189959-11-namit@vmware.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

commit dfc243615d43bb477d1d16a0064fc3d69ade5b3a
Author: Nadav Amit <namit@vmware.com>
Date:   Wed Oct 3 14:30:59 2018 -0700

    x86/cpufeature: Macrofy inline assembly code to work around GCC inlining bugs
    
    As described in:
    
      77b0bf55bc67: ("kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs")
    
    GCC's inlining heuristics are broken with common asm() patterns used in
    kernel code, resulting in the effective disabling of inlining.
    
    The workaround is to set an assembly macro and call it from the inline
    assembly block - which is pretty pointless indirection in the static_cpu_has()
    case, but is worth it to improve overall inlining quality.
    
    The patch slightly increases the kernel size:
    
          text     data     bss      dec     hex  filename
      18162879 10226256 2957312 31346447 1de4f0f  ./vmlinux before
      18163528 10226300 2957312 31347140 1de51c4  ./vmlinux after (+693)
    
    And enables the inlining of functions such as free_ldt_pgtables().
    
    Tested-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: Nadav Amit <namit@vmware.com>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Andy Lutomirski <luto@amacapital.net>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Brian Gerst <brgerst@gmail.com>
    Cc: Denys Vlasenko <dvlasenk@redhat.com>
    Cc: H. Peter Anvin <hpa@zytor.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Link: http://lkml.kernel.org/r/20181003213100.189959-10-namit@vmware.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

commit 4021bdcd351fd63d8d5e74264ee18d09388f0221
Author: Nadav Amit <namit@vmware.com>
Date:   Wed Oct 3 14:30:58 2018 -0700

    x86/extable: Macrofy inline assembly code to work around GCC inlining bugs
    
    As described in:
    
      77b0bf55bc67: ("kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs")
    
    GCC's inlining heuristics are broken with common asm() patterns used in
    kernel code, resulting in the effective disabling of inlining.
    
    The workaround is to set an assembly macro and call it from the inline
    assembly block - which is also a minor cleanup for the exception table
    code.
    
    Text size goes up a bit:
    
          text     data     bss      dec     hex  filename
      18162555 10226288 2957312 31346155 1de4deb  ./vmlinux before
      18162879 10226256 2957312 31346447 1de4f0f  ./vmlinux after (+292)
    
    But this allows the inlining of functions such as nested_vmx_exit_reflected(),
    set_segment_reg(), __copy_xstate_to_user() which is a net benefit.
    
    Tested-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: Nadav Amit <namit@vmware.com>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Andy Lutomirski <luto@amacapital.net>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Brian Gerst <brgerst@gmail.com>
    Cc: Denys Vlasenko <dvlasenk@redhat.com>
    Cc: H. Peter Anvin <hpa@zytor.com>
    Cc: Josh Poimboeuf <jpoimboe@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Link: http://lkml.kernel.org/r/20181003213100.189959-9-namit@vmware.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions
  2018-10-05  9:31                   ` Ingo Molnar
@ 2018-10-05 11:20                     ` Borislav Petkov
  2018-10-05 12:52                       ` Ingo Molnar
  2018-10-05 20:27                     ` [PATCH 0/3] Macrofying inline asm rebased Nadav Amit
  2018-10-08  2:17                     ` [PATCH v9 04/10] x86: refcount: prevent gcc distortions Nadav Amit
  2 siblings, 1 reply; 116+ messages in thread
From: Borislav Petkov @ 2018-10-05 11:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nadav Amit, hpa, Ingo Molnar, linux-kernel, x86, Thomas Gleixner,
	Jan Beulich, Josh Poimboeuf, Linus Torvalds, Peter Zijlstra,
	Andy Lutomirski

On Fri, Oct 05, 2018 at 11:31:08AM +0200, Ingo Molnar wrote:
> 
> * Nadav Amit <namit@vmware.com> wrote:
> 
> > > Are you using defconfig or a reasonable distro-config for your tests?
> > 
> > I think it is best to take the kernel and run localyesconfig for testing.
> 
> Ok, agreed - and this makes the numbers you provided pretty representative.
> 
> Good - now that all of my concerns were addressed I'd like to merge the remaining 3 patches as 
> well - but they are conflicting with ongoing x86 work in tip:x86/core. The extable conflict is 
> trivial, the jump-label conflict a bit more involved.

FWIW, gcc guys are not too averse to the idea of enhancing gcc inline
asm syntax with a statement which specifies its size so that gcc can use
it to do better inlining cost estimation than counting lines.

Lemme work the process ...

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions
  2018-10-05 11:20                     ` Borislav Petkov
@ 2018-10-05 12:52                       ` Ingo Molnar
  0 siblings, 0 replies; 116+ messages in thread
From: Ingo Molnar @ 2018-10-05 12:52 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Nadav Amit, hpa, Ingo Molnar, linux-kernel, x86, Thomas Gleixner,
	Jan Beulich, Josh Poimboeuf, Linus Torvalds, Peter Zijlstra,
	Andy Lutomirski


* Borislav Petkov <bp@alien8.de> wrote:

> On Fri, Oct 05, 2018 at 11:31:08AM +0200, Ingo Molnar wrote:
> > 
> > * Nadav Amit <namit@vmware.com> wrote:
> > 
> > > > Are you using defconfig or a reasonable distro-config for your tests?
> > > 
> > > I think it is best to take the kernel and run localyesconfig for testing.
> > 
> > Ok, agreed - and this makes the numbers you provided pretty representative.
> > 
> > Good - now that all of my concerns were addressed I'd like to merge the remaining 3 patches as 
> > well - but they are conflicting with ongoing x86 work in tip:x86/core. The extable conflict is 
> > trivial, the jump-label conflict a bit more involved.
> 
> FWIW, gcc guys are not too averse to the idea of enhancing gcc inline
> asm syntax with a statement which specifies its size so that gcc can use
> it to do better inlining cost estimation than counting lines.
> 
> Lemme work the process ...

Thanks!

	Ingo

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH 0/3] Macrofying inline asm rebased
  2018-10-05  9:31                   ` Ingo Molnar
  2018-10-05 11:20                     ` Borislav Petkov
@ 2018-10-05 20:27                     ` Nadav Amit
  2018-10-05 20:27                       ` [PATCH 1/3] x86/extable: Macrofy inline assembly code to work around GCC inlining bugs Nadav Amit
                                         ` (2 more replies)
  2018-10-08  2:17                     ` [PATCH v9 04/10] x86: refcount: prevent gcc distortions Nadav Amit
  2 siblings, 3 replies; 116+ messages in thread
From: Nadav Amit @ 2018-10-05 20:27 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, x86, Nadav Amit

These are the last 3 patches from the "macrofying inline asm" series
after rebasing on TIP, per Ingo's request.

For the record: the new commit log was written by Ingo.

Link: https://lore.kernel.org/lkml/20181003213100.189959-11-namit@vmware.com/T/#m28ed17da046354dd8d897ff9703561ac3fd71410

Nadav Amit (3):
  x86/extable: Macrofy inline assembly code to work around GCC inlining
    bugs
  x86/cpufeature: Macrofy inline assembly code to work around GCC
    inlining bugs
  x86/jump-labels: Macrofy inline assembly code to work around GCC
    inlining bugs

 arch/x86/entry/calling.h          |  2 +-
 arch/x86/include/asm/asm.h        | 53 ++++++++------------
 arch/x86/include/asm/cpufeature.h | 82 ++++++++++++++++++-------------
 arch/x86/include/asm/jump_label.h | 72 +++++++--------------------
 arch/x86/kernel/macros.S          |  3 ++
 5 files changed, 89 insertions(+), 123 deletions(-)
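
For readers who want the cumulative cost of the three patches, the `size`
rows quoted in the individual commit logs can be chained together. The
following is only a sanity-check sketch: the struct and function names are
made up for illustration, and the numbers are copied verbatim from the
tables in patches 1/3 to 3/3.

```c
#include <assert.h>

/* One row of `size vmlinux` output; the struct name is illustrative only. */
struct size_row {
	long text, data, bss;
};

/* Rows from the commit logs: before 1/3, then after 1/3, 2/3 and 3/3. */
static const struct size_row series_rows[] = {
	{ 18162555, 10226288, 2957312 },
	{ 18162879, 10226256, 2957312 },
	{ 18163528, 10226300, 2957312 },
	{ 18163608, 10227348, 2957312 },
};

/* The "dec" column is simply text + data + bss. */
static long row_dec(struct size_row r)
{
	return r.text + r.data + r.bss;
}

/* Growth of the "dec" total caused by patch n of 3 (n = 1..3). */
static long patch_dec_delta(int n)
{
	assert(n >= 1 && n <= 3);
	return row_dec(series_rows[n]) - row_dec(series_rows[n - 1]);
}

/* Growth of the "dec" total over the whole rebased series. */
static long series_dec_delta(void)
{
	return row_dec(series_rows[3]) - row_dec(series_rows[0]);
}
```

With these rows the per-patch deltas come out as +292, +693 and +1128,
matching the figures in the commit logs, and the series as a whole adds
2113 bytes. Note that the quoted deltas are on the "dec" total, not on the
text column alone.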

-- 
2.17.1



* [PATCH 1/3] x86/extable: Macrofy inline assembly code to work around GCC inlining bugs
  2018-10-05 20:27                     ` [PATCH 0/3] Macrofying inline asm rebased Nadav Amit
@ 2018-10-05 20:27                       ` Nadav Amit
  2018-10-06 14:42                         ` [tip:x86/build] " tip-bot for Nadav Amit
  2018-10-05 20:27                       ` [PATCH 2/3] x86/cpufeature: " Nadav Amit
  2018-10-05 20:27                       ` [PATCH 3/3] x86/jump-labels: " Nadav Amit
  2 siblings, 1 reply; 116+ messages in thread
From: Nadav Amit @ 2018-10-05 20:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, x86, Nadav Amit, Andy Lutomirski, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, H . Peter Anvin, Josh Poimboeuf,
	Linus Torvalds, Peter Zijlstra, Thomas Gleixner, Ingo Molnar

As described in:

77b0bf55bc67: ("kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs")

GCC's inlining heuristics are broken with common asm() patterns used in
kernel code, resulting in the effective disabling of inlining.

The workaround is to set an assembly macro and call it from the inline
assembly block - which is also a minor cleanup for the exception table
code.

Text size goes up a bit:

 text     data     bss      dec     hex  filename
18162555 10226288 2957312 31346155 1de4deb  ./vmlinux before
18162879 10226256 2957312 31346447 1de4f0f  ./vmlinux after (+292)

But this allows the inlining of functions such as nested_vmx_exit_reflected(),
set_segment_reg() and __copy_xstate_to_user(), which is a net benefit.

Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/20181003213100.189959-9-namit@vmware.com/T/#u
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/asm.h | 53 ++++++++++++++------------------------
 arch/x86/kernel/macros.S   |  1 +
 2 files changed, 21 insertions(+), 33 deletions(-)
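
The pattern described in the commit log (define an assembler macro once,
then invoke it by name from the inline asm) can be tried outside the
kernel. The sketch below is not kernel code: it assumes GCC with GNU as on
an ELF target, and DEMO_TABLE_ENTRY, demo_table, demo_entry and demo_touch
are hypothetical names. A file-scope asm() stands in for the kernel's
macros.S, so the asm statement inside the function carries one short line
per entry while the assembler still expands each into the full table
record.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Assembler macro, defined once at file scope (assumption: the compiler
 * emits file-scope asm before function bodies, as GCC does for a single
 * translation unit). It appends an {id, value} record to "demo_table".
 */
__asm__(".macro DEMO_TABLE_ENTRY id:req value:req\n\t"
	".pushsection demo_table,\"a\"\n\t"
	".balign 4\n\t"
	".long \\id\n\t"
	".long \\value\n\t"
	".popsection\n\t"
	".endm");

/* C-side wrapper: expands to a one-line invocation of the macro above. */
#define DEMO_TABLE_ENTRY(id, value)				\
	"DEMO_TABLE_ENTRY id=" #id " value=" #value "\n\t"

struct demo_entry {
	uint32_t id;
	uint32_t value;
};

/* Provided by the linker because the section name is a C identifier. */
extern const struct demo_entry __start_demo_table[];
extern const struct demo_entry __stop_demo_table[];

/* noinline keeps a single copy of the asm, hence a single set of records. */
__attribute__((noinline)) void demo_touch(void)
{
	__asm__ volatile(DEMO_TABLE_ENTRY(1, 100)
			 DEMO_TABLE_ENTRY(2, 200));
}

/* Number of records the assembler placed in the section. */
static size_t demo_table_len(void)
{
	return (size_t)(__stop_demo_table - __start_demo_table);
}
```

The records exist whether or not demo_touch() is ever called, since they
are emitted at assembly time; this mirrors how exception table entries are
data attached to code rather than instructions. With a different compiler
or an integrated assembler the macro may not be visible inside the
function, so treat this strictly as an illustration.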

diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index 6467757bb39f..21b086786404 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -120,12 +120,25 @@
 /* Exception table entry */
 #ifdef __ASSEMBLY__
 # define _ASM_EXTABLE_HANDLE(from, to, handler)			\
-	.pushsection "__ex_table","a" ;				\
-	.balign 4 ;						\
-	.long (from) - . ;					\
-	.long (to) - . ;					\
-	.long (handler) - . ;					\
+	ASM_EXTABLE_HANDLE from to handler
+
+.macro ASM_EXTABLE_HANDLE from:req to:req handler:req
+	.pushsection "__ex_table","a"
+	.balign 4
+	.long (\from) - .
+	.long (\to) - .
+	.long (\handler) - .
 	.popsection
+.endm
+#else /* __ASSEMBLY__ */
+
+# define _ASM_EXTABLE_HANDLE(from, to, handler)			\
+	"ASM_EXTABLE_HANDLE from=" #from " to=" #to		\
+	" handler=\"" #handler "\"\n\t"
+
+/* For C file, we already have NOKPROBE_SYMBOL macro */
+
+#endif /* __ASSEMBLY__ */
 
 # define _ASM_EXTABLE(from, to)					\
 	_ASM_EXTABLE_HANDLE(from, to, ex_handler_default)
@@ -148,6 +161,7 @@
 	_ASM_PTR (entry);					\
 	.popsection
 
+#ifdef __ASSEMBLY__
 .macro ALIGN_DESTINATION
 	/* check for bad alignment of destination */
 	movl %edi,%ecx
@@ -171,34 +185,7 @@
 	_ASM_EXTABLE_UA(100b, 103b)
 	_ASM_EXTABLE_UA(101b, 103b)
 	.endm
-
-#else
-# define _EXPAND_EXTABLE_HANDLE(x) #x
-# define _ASM_EXTABLE_HANDLE(from, to, handler)			\
-	" .pushsection \"__ex_table\",\"a\"\n"			\
-	" .balign 4\n"						\
-	" .long (" #from ") - .\n"				\
-	" .long (" #to ") - .\n"				\
-	" .long (" _EXPAND_EXTABLE_HANDLE(handler) ") - .\n"	\
-	" .popsection\n"
-
-# define _ASM_EXTABLE(from, to)					\
-	_ASM_EXTABLE_HANDLE(from, to, ex_handler_default)
-
-# define _ASM_EXTABLE_UA(from, to)				\
-	_ASM_EXTABLE_HANDLE(from, to, ex_handler_uaccess)
-
-# define _ASM_EXTABLE_FAULT(from, to)				\
-	_ASM_EXTABLE_HANDLE(from, to, ex_handler_fault)
-
-# define _ASM_EXTABLE_EX(from, to)				\
-	_ASM_EXTABLE_HANDLE(from, to, ex_handler_ext)
-
-# define _ASM_EXTABLE_REFCOUNT(from, to)			\
-	_ASM_EXTABLE_HANDLE(from, to, ex_handler_refcount)
-
-/* For C file, we already have NOKPROBE_SYMBOL macro */
-#endif
+#endif /* __ASSEMBLY__ */
 
 #ifndef __ASSEMBLY__
 /*
diff --git a/arch/x86/kernel/macros.S b/arch/x86/kernel/macros.S
index 71d8b716b111..7baa40d5bf16 100644
--- a/arch/x86/kernel/macros.S
+++ b/arch/x86/kernel/macros.S
@@ -11,3 +11,4 @@
 #include <asm/alternative-asm.h>
 #include <asm/bug.h>
 #include <asm/paravirt.h>
+#include <asm/asm.h>
-- 
2.17.1



* [PATCH 2/3] x86/cpufeature: Macrofy inline assembly code to work around GCC inlining bugs
  2018-10-05 20:27                     ` [PATCH 0/3] Macrofying inline asm rebased Nadav Amit
  2018-10-05 20:27                       ` [PATCH 1/3] x86/extable: Macrofy inline assembly code to work around GCC inlining bugs Nadav Amit
@ 2018-10-05 20:27                       ` Nadav Amit
  2018-10-06 14:43                         ` [tip:x86/build] " tip-bot for Nadav Amit
  2018-10-05 20:27                       ` [PATCH 3/3] x86/jump-labels: " Nadav Amit
  2 siblings, 1 reply; 116+ messages in thread
From: Nadav Amit @ 2018-10-05 20:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, x86, Nadav Amit, Andy Lutomirski, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, H . Peter Anvin, Linus Torvalds,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar

As described in:

77b0bf55bc67: ("kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs")

GCC's inlining heuristics are broken with common asm() patterns used in
kernel code, resulting in the effective disabling of inlining.

The workaround is to set an assembly macro and call it from the inline
assembly block - which is pretty pointless indirection in the static_cpu_has()
case, but is worth it to improve overall inlining quality.

The patch slightly increases the kernel size:

 text     data     bss      dec     hex  filename
18162879 10226256 2957312 31346447 1de4f0f  ./vmlinux before
18163528 10226300 2957312 31347140 1de51c4  ./vmlinux after (+693)

And enables the inlining of functions such as free_ldt_pgtables().

Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/20181003213100.189959-10-namit@vmware.com/T/#u
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/cpufeature.h | 82 ++++++++++++++++++-------------
 arch/x86/kernel/macros.S          |  1 +
 2 files changed, 48 insertions(+), 35 deletions(-)
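
The ".skip" expression that this patch moves into the STATIC_CPU_HAS macro
is easy to misread, so here is a small C model of it. In GNU as a true
comparison evaluates to -1 and a false one to 0, which is why the
expression is negated. The function names below are made up for
illustration; src_len corresponds to (2b-1b) and repl_len to (5f-4f).

```c
#include <assert.h>

/* GNU as comparison semantics: true is -1, false is 0. */
static int gas_gt(int a, int b)
{
	return a > b ? -1 : 0;
}

/*
 * Models: .skip -(((5f-4f) - (2b-1b)) > 0) * ((5f-4f) - (2b-1b)),0x90
 * Returns the number of 0x90 (NOP) bytes emitted after the source
 * instruction so that a longer replacement still fits.
 */
static int altinstr_pad_len(int src_len, int repl_len)
{
	int diff;

	assert(src_len >= 0 && repl_len >= 0);
	diff = repl_len - src_len;
	return -gas_gt(diff, 0) * diff;
}
```

In other words the padding is max(0, repl_len - src_len). For instance, if
the original jmp assembled to 2 bytes and the replacement to 5, three NOP
bytes would be emitted; if the replacement is not longer than the source,
no padding is emitted.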

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index aced6c9290d6..7d442722ef24 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -2,10 +2,10 @@
 #ifndef _ASM_X86_CPUFEATURE_H
 #define _ASM_X86_CPUFEATURE_H
 
-#include <asm/processor.h>
-
-#if defined(__KERNEL__) && !defined(__ASSEMBLY__)
+#ifdef __KERNEL__
+#ifndef __ASSEMBLY__
 
+#include <asm/processor.h>
 #include <asm/asm.h>
 #include <linux/bitops.h>
 
@@ -161,37 +161,10 @@ extern void clear_cpu_cap(struct cpuinfo_x86 *c, unsigned int bit);
  */
 static __always_inline __pure bool _static_cpu_has(u16 bit)
 {
-	asm_volatile_goto("1: jmp 6f\n"
-		 "2:\n"
-		 ".skip -(((5f-4f) - (2b-1b)) > 0) * "
-			 "((5f-4f) - (2b-1b)),0x90\n"
-		 "3:\n"
-		 ".section .altinstructions,\"a\"\n"
-		 " .long 1b - .\n"		/* src offset */
-		 " .long 4f - .\n"		/* repl offset */
-		 " .word %P[always]\n"		/* always replace */
-		 " .byte 3b - 1b\n"		/* src len */
-		 " .byte 5f - 4f\n"		/* repl len */
-		 " .byte 3b - 2b\n"		/* pad len */
-		 ".previous\n"
-		 ".section .altinstr_replacement,\"ax\"\n"
-		 "4: jmp %l[t_no]\n"
-		 "5:\n"
-		 ".previous\n"
-		 ".section .altinstructions,\"a\"\n"
-		 " .long 1b - .\n"		/* src offset */
-		 " .long 0\n"			/* no replacement */
-		 " .word %P[feature]\n"		/* feature bit */
-		 " .byte 3b - 1b\n"		/* src len */
-		 " .byte 0\n"			/* repl len */
-		 " .byte 0\n"			/* pad len */
-		 ".previous\n"
-		 ".section .altinstr_aux,\"ax\"\n"
-		 "6:\n"
-		 " testb %[bitnum],%[cap_byte]\n"
-		 " jnz %l[t_yes]\n"
-		 " jmp %l[t_no]\n"
-		 ".previous\n"
+	asm_volatile_goto("STATIC_CPU_HAS bitnum=%[bitnum] "
+			  "cap_byte=\"%[cap_byte]\" "
+			  "feature=%P[feature] t_yes=%l[t_yes] "
+			  "t_no=%l[t_no] always=%P[always]"
 		 : : [feature]  "i" (bit),
 		     [always]   "i" (X86_FEATURE_ALWAYS),
 		     [bitnum]   "i" (1 << (bit & 7)),
@@ -226,5 +199,44 @@ static __always_inline __pure bool _static_cpu_has(u16 bit)
 #define CPU_FEATURE_TYPEVAL		boot_cpu_data.x86_vendor, boot_cpu_data.x86, \
 					boot_cpu_data.x86_model
 
-#endif /* defined(__KERNEL__) && !defined(__ASSEMBLY__) */
+#else /* __ASSEMBLY__ */
+
+.macro STATIC_CPU_HAS bitnum:req cap_byte:req feature:req t_yes:req t_no:req always:req
+1:
+	jmp 6f
+2:
+	.skip -(((5f-4f) - (2b-1b)) > 0) * ((5f-4f) - (2b-1b)),0x90
+3:
+	.section .altinstructions,"a"
+	.long 1b - .		/* src offset */
+	.long 4f - .		/* repl offset */
+	.word \always		/* always replace */
+	.byte 3b - 1b		/* src len */
+	.byte 5f - 4f		/* repl len */
+	.byte 3b - 2b		/* pad len */
+	.previous
+	.section .altinstr_replacement,"ax"
+4:
+	jmp \t_no
+5:
+	.previous
+	.section .altinstructions,"a"
+	.long 1b - .		/* src offset */
+	.long 0			/* no replacement */
+	.word \feature		/* feature bit */
+	.byte 3b - 1b		/* src len */
+	.byte 0			/* repl len */
+	.byte 0			/* pad len */
+	.previous
+	.section .altinstr_aux,"ax"
+6:
+	testb \bitnum,\cap_byte
+	jnz \t_yes
+	jmp \t_no
+	.previous
+.endm
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* __KERNEL__ */
 #endif /* _ASM_X86_CPUFEATURE_H */
diff --git a/arch/x86/kernel/macros.S b/arch/x86/kernel/macros.S
index 7baa40d5bf16..bf8b9c93e255 100644
--- a/arch/x86/kernel/macros.S
+++ b/arch/x86/kernel/macros.S
@@ -12,3 +12,4 @@
 #include <asm/bug.h>
 #include <asm/paravirt.h>
 #include <asm/asm.h>
+#include <asm/cpufeature.h>
-- 
2.17.1



* [PATCH 3/3] x86/jump-labels: Macrofy inline assembly code to work around GCC inlining bugs
  2018-10-05 20:27                     ` [PATCH 0/3] Macrofying inline asm rebased Nadav Amit
  2018-10-05 20:27                       ` [PATCH 1/3] x86/extable: Macrofy inline assembly code to work around GCC inlining bugs Nadav Amit
  2018-10-05 20:27                       ` [PATCH 2/3] x86/cpufeature: " Nadav Amit
@ 2018-10-05 20:27                       ` Nadav Amit
  2018-10-06 14:44                         ` [tip:x86/build] " tip-bot for Nadav Amit
  2 siblings, 1 reply; 116+ messages in thread
From: Nadav Amit @ 2018-10-05 20:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, x86, Nadav Amit, Andy Lutomirski, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, Greg Kroah-Hartman, H . Peter Anvin,
	Kate Stewart, Linus Torvalds, Peter Zijlstra,
	Philippe Ombredanne, Thomas Gleixner, Ingo Molnar

As described in:

77b0bf55bc67: ("kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs")

GCC's inlining heuristics are broken with common asm() patterns used in
kernel code, resulting in the effective disabling of inlining.

The workaround is to set an assembly macro and call it from the inline
assembly block - which is also a minor cleanup for the jump-label code.

As a result the code size is slightly increased, but inlining decisions
are better:

 text     data     bss      dec     hex  filename
18163528 10226300 2957312 31347140 1de51c4  ./vmlinux before
18163608 10227348 2957312 31348268 1de562c  ./vmlinux after (+1128)

And functions such as intel_pstate_adjust_policy_max(),
kvm_cpu_accept_dm_intr(), kvm_register_readl() are inlined.

Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/20181003213100.189959-11-namit@vmware.com/T/#u
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/calling.h          |  2 +-
 arch/x86/include/asm/jump_label.h | 72 ++++++++-----------------------
 arch/x86/kernel/macros.S          |  1 +
 3 files changed, 20 insertions(+), 55 deletions(-)
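
Both new macros store "\key + \branch - ." in the jump table, i.e. the
branch flag is added to the key address before the entry is made relative
to its own location. The C sketch below models only the flag packing and
leaves out the "- ." part. It assumes the key object is at least 2-byte
aligned so that bit 0 of its address is free; struct demo_key and the
demo_* helpers are hypothetical and are not the kernel's accessors.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct static_key; only alignment matters. */
struct demo_key {
	int enabled;
};

/* Pack the branch flag into bit 0 of the key address (key + branch). */
static uintptr_t demo_pack(const struct demo_key *key, int branch)
{
	uintptr_t addr = (uintptr_t)key;

	assert((addr & 1) == 0);		/* bit 0 must be free */
	assert(branch == 0 || branch == 1);
	return addr + (uintptr_t)branch;
}

/* Recover the key pointer by masking the flag bit off again. */
static const struct demo_key *demo_unpack_key(uintptr_t packed)
{
	return (const struct demo_key *)(packed & ~(uintptr_t)1);
}

/* Recover the branch flag from bit 0. */
static int demo_unpack_branch(uintptr_t packed)
{
	return (int)(packed & 1);
}
```

This also shows why the calling.h hunk passes branch=1 where the old
STATIC_JUMP_IF_FALSE macro hard-coded "\key + 1": the flag is now an
explicit macro argument instead of being baked into two separate macros.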

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 352e70cd33e8..708b46a54578 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -338,7 +338,7 @@ For 32-bit we have the following conventions - kernel is built with
 .macro CALL_enter_from_user_mode
 #ifdef CONFIG_CONTEXT_TRACKING
 #ifdef HAVE_JUMP_LABEL
-	STATIC_JUMP_IF_FALSE .Lafter_call_\@, context_tracking_enabled, def=0
+	STATIC_BRANCH_JMP l_yes=.Lafter_call_\@, key=context_tracking_enabled, branch=1
 #endif
 	call enter_from_user_mode
 .Lafter_call_\@:
diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h
index 21efc9d07ed9..a5fb34fe56a4 100644
--- a/arch/x86/include/asm/jump_label.h
+++ b/arch/x86/include/asm/jump_label.h
@@ -2,19 +2,6 @@
 #ifndef _ASM_X86_JUMP_LABEL_H
 #define _ASM_X86_JUMP_LABEL_H
 
-#ifndef HAVE_JUMP_LABEL
-/*
- * For better or for worse, if jump labels (the gcc extension) are missing,
- * then the entire static branch patching infrastructure is compiled out.
- * If that happens, the code in here will malfunction.  Raise a compiler
- * error instead.
- *
- * In theory, jump labels and the static branch patching infrastructure
- * could be decoupled to fix this.
- */
-#error asm/jump_label.h included on a non-jump-label kernel
-#endif
-
 #define JUMP_LABEL_NOP_SIZE 5
 
 #ifdef CONFIG_X86_64
@@ -33,15 +20,9 @@
 
 static __always_inline bool arch_static_branch(struct static_key *key, bool branch)
 {
-	asm_volatile_goto("1:"
-		".byte " __stringify(STATIC_KEY_INIT_NOP) "\n\t"
-		".pushsection __jump_table,  \"aw\" \n\t"
-		_ASM_ALIGN "\n\t"
-		".long 1b - ., %l[l_yes] - . \n\t"
-		_ASM_PTR "%c0 + %c1 - .\n\t"
-		".popsection \n\t"
-		: :  "i" (key), "i" (branch) : : l_yes);
-
+	asm_volatile_goto("STATIC_BRANCH_NOP l_yes=\"%l[l_yes]\" key=\"%c0\" "
+			  "branch=\"%c1\""
+			: :  "i" (key), "i" (branch) : : l_yes);
 	return false;
 l_yes:
 	return true;
@@ -49,14 +30,8 @@ static __always_inline bool arch_static_branch(struct static_key *key, bool bran
 
 static __always_inline bool arch_static_branch_jump(struct static_key *key, bool branch)
 {
-	asm_volatile_goto("1:"
-		".byte 0xe9\n\t .long %l[l_yes] - 2f\n\t"
-		"2:\n\t"
-		".pushsection __jump_table,  \"aw\" \n\t"
-		_ASM_ALIGN "\n\t"
-		".long 1b - ., %l[l_yes] - . \n\t"
-		_ASM_PTR "%c0 + %c1 - .\n\t"
-		".popsection \n\t"
+	asm_volatile_goto("STATIC_BRANCH_JMP l_yes=\"%l[l_yes]\" key=\"%c0\" "
+			  "branch=\"%c1\""
 		: :  "i" (key), "i" (branch) : : l_yes);
 
 	return false;
@@ -66,37 +41,26 @@ static __always_inline bool arch_static_branch_jump(struct static_key *key, bool
 
 #else	/* __ASSEMBLY__ */
 
-.macro STATIC_JUMP_IF_TRUE target, key, def
-.Lstatic_jump_\@:
-	.if \def
-	/* Equivalent to "jmp.d32 \target" */
-	.byte		0xe9
-	.long		\target - .Lstatic_jump_after_\@
-.Lstatic_jump_after_\@:
-	.else
-	.byte		STATIC_KEY_INIT_NOP
-	.endif
+.macro STATIC_BRANCH_NOP l_yes:req key:req branch:req
+.Lstatic_branch_nop_\@:
+	.byte STATIC_KEY_INIT_NOP
+.Lstatic_branch_no_after_\@:
 	.pushsection __jump_table, "aw"
 	_ASM_ALIGN
-	.long		.Lstatic_jump_\@ - ., \target - .
-	_ASM_PTR	\key - .
+	.long		.Lstatic_branch_nop_\@ - ., \l_yes - .
+	_ASM_PTR        \key + \branch - .
 	.popsection
 .endm
 
-.macro STATIC_JUMP_IF_FALSE target, key, def
-.Lstatic_jump_\@:
-	.if \def
-	.byte		STATIC_KEY_INIT_NOP
-	.else
-	/* Equivalent to "jmp.d32 \target" */
-	.byte		0xe9
-	.long		\target - .Lstatic_jump_after_\@
-.Lstatic_jump_after_\@:
-	.endif
+.macro STATIC_BRANCH_JMP l_yes:req key:req branch:req
+.Lstatic_branch_jmp_\@:
+	.byte 0xe9
+	.long \l_yes - .Lstatic_branch_jmp_after_\@
+.Lstatic_branch_jmp_after_\@:
 	.pushsection __jump_table, "aw"
 	_ASM_ALIGN
-	.long		.Lstatic_jump_\@ - ., \target - .
-	_ASM_PTR	\key + 1 - .
+	.long		.Lstatic_branch_jmp_\@ - ., \l_yes - .
+	_ASM_PTR	\key + \branch - .
 	.popsection
 .endm
 
diff --git a/arch/x86/kernel/macros.S b/arch/x86/kernel/macros.S
index bf8b9c93e255..161c95059044 100644
--- a/arch/x86/kernel/macros.S
+++ b/arch/x86/kernel/macros.S
@@ -13,3 +13,4 @@
 #include <asm/paravirt.h>
 #include <asm/asm.h>
 #include <asm/cpufeature.h>
+#include <asm/jump_label.h>
-- 
2.17.1



* Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions
  2018-10-04 19:33               ` H. Peter Anvin
  2018-10-04 20:05                 ` Nadav Amit
  2018-10-04 20:29                 ` Andy Lutomirski
@ 2018-10-06  1:40                 ` Rasmus Villemoes
  2 siblings, 0 replies; 116+ messages in thread
From: Rasmus Villemoes @ 2018-10-06  1:40 UTC (permalink / raw)
  To: H. Peter Anvin, Ingo Molnar
  Cc: Nadav Amit, Ingo Molnar, linux-kernel, x86, Thomas Gleixner,
	Jan Beulich, Josh Poimboeuf, Linus Torvalds, Peter Zijlstra,
	Andy Lutomirski

On 2018-10-04 21:33, H. Peter Anvin wrote:

> Here is the horrible code I mentioned yesterday.  This is about
> implementing the immediate-patching framework that Linus and others have
> discussed (it helps both performance and kernel hardening):

Heh, I did a POC in userspace some years ago for loading an "eventually
constant" value into a register - the idea being to avoid a load in
cases like kmemcache_alloc(foo_cachep) or kmemcache_free(foo_cachep, p),
and I assume this is something along the same lines? I didn't do
anything with it since I had no idea if the performance gain would be
worth it, and at the time (before __ro_after_init) there was no good way
to know that the values would really be constant eventually. Also, I had
hoped to come up with a way to avoid having to annotate the loads.

I just tried expanding this to deal with some of the hash tables sized
at init time which I can see was also mentioned on LKML some time ago.
I'm probably missing something fundamental, but there's some sorta
working code at https://github.com/villemoes/rai which is not too
horrible (IMO). Attaching gdb at various times and doing disassembly
shows that the patching does take place. Can I get you to take a look at
raimacros.S and tell me why that won't work?

Thanks,
Rasmus


* [tip:x86/build] x86/extable: Macrofy inline assembly code to work around GCC inlining bugs
  2018-10-05 20:27                       ` [PATCH 1/3] x86/extable: Macrofy inline assembly code to work around GCC inlining bugs Nadav Amit
@ 2018-10-06 14:42                         ` tip-bot for Nadav Amit
  0 siblings, 0 replies; 116+ messages in thread
From: tip-bot for Nadav Amit @ 2018-10-06 14:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, peterz, torvalds, dvlasenk, linux-kernel, keescook, namit,
	bp, mingo, brgerst, luto, tglx, jpoimboe

Commit-ID:  0474d5d9d2f7f3b11262f7bf87d0e7314ead9200
Gitweb:     https://git.kernel.org/tip/0474d5d9d2f7f3b11262f7bf87d0e7314ead9200
Author:     Nadav Amit <namit@vmware.com>
AuthorDate: Fri, 5 Oct 2018 13:27:16 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sat, 6 Oct 2018 15:52:15 +0200

x86/extable: Macrofy inline assembly code to work around GCC inlining bugs

As described in:

  77b0bf55bc67: ("kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs")

GCC's inlining heuristics are broken with common asm() patterns used in
kernel code, resulting in the effective disabling of inlining.

The workaround is to set an assembly macro and call it from the inline
assembly block - which is also a minor cleanup for the exception table
code.

Text size goes up a bit:

      text     data     bss      dec     hex  filename
  18162555 10226288 2957312 31346155 1de4deb  ./vmlinux before
  18162879 10226256 2957312 31346447 1de4f0f  ./vmlinux after (+292)

But this allows the inlining of functions such as nested_vmx_exit_reflected(),
set_segment_reg() and __copy_xstate_to_user(), which is a net benefit.

Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181005202718.229565-2-namit@vmware.com
Link: https://lore.kernel.org/lkml/20181003213100.189959-9-namit@vmware.com/T/#u
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/asm.h | 53 +++++++++++++++++-----------------------------
 arch/x86/kernel/macros.S   |  1 +
 2 files changed, 21 insertions(+), 33 deletions(-)

diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index 6467757bb39f..21b086786404 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -120,12 +120,25 @@
 /* Exception table entry */
 #ifdef __ASSEMBLY__
 # define _ASM_EXTABLE_HANDLE(from, to, handler)			\
-	.pushsection "__ex_table","a" ;				\
-	.balign 4 ;						\
-	.long (from) - . ;					\
-	.long (to) - . ;					\
-	.long (handler) - . ;					\
+	ASM_EXTABLE_HANDLE from to handler
+
+.macro ASM_EXTABLE_HANDLE from:req to:req handler:req
+	.pushsection "__ex_table","a"
+	.balign 4
+	.long (\from) - .
+	.long (\to) - .
+	.long (\handler) - .
 	.popsection
+.endm
+#else /* __ASSEMBLY__ */
+
+# define _ASM_EXTABLE_HANDLE(from, to, handler)			\
+	"ASM_EXTABLE_HANDLE from=" #from " to=" #to		\
+	" handler=\"" #handler "\"\n\t"
+
+/* For C file, we already have NOKPROBE_SYMBOL macro */
+
+#endif /* __ASSEMBLY__ */
 
 # define _ASM_EXTABLE(from, to)					\
 	_ASM_EXTABLE_HANDLE(from, to, ex_handler_default)
@@ -148,6 +161,7 @@
 	_ASM_PTR (entry);					\
 	.popsection
 
+#ifdef __ASSEMBLY__
 .macro ALIGN_DESTINATION
 	/* check for bad alignment of destination */
 	movl %edi,%ecx
@@ -171,34 +185,7 @@
 	_ASM_EXTABLE_UA(100b, 103b)
 	_ASM_EXTABLE_UA(101b, 103b)
 	.endm
-
-#else
-# define _EXPAND_EXTABLE_HANDLE(x) #x
-# define _ASM_EXTABLE_HANDLE(from, to, handler)			\
-	" .pushsection \"__ex_table\",\"a\"\n"			\
-	" .balign 4\n"						\
-	" .long (" #from ") - .\n"				\
-	" .long (" #to ") - .\n"				\
-	" .long (" _EXPAND_EXTABLE_HANDLE(handler) ") - .\n"	\
-	" .popsection\n"
-
-# define _ASM_EXTABLE(from, to)					\
-	_ASM_EXTABLE_HANDLE(from, to, ex_handler_default)
-
-# define _ASM_EXTABLE_UA(from, to)				\
-	_ASM_EXTABLE_HANDLE(from, to, ex_handler_uaccess)
-
-# define _ASM_EXTABLE_FAULT(from, to)				\
-	_ASM_EXTABLE_HANDLE(from, to, ex_handler_fault)
-
-# define _ASM_EXTABLE_EX(from, to)				\
-	_ASM_EXTABLE_HANDLE(from, to, ex_handler_ext)
-
-# define _ASM_EXTABLE_REFCOUNT(from, to)			\
-	_ASM_EXTABLE_HANDLE(from, to, ex_handler_refcount)
-
-/* For C file, we already have NOKPROBE_SYMBOL macro */
-#endif
+#endif /* __ASSEMBLY__ */
 
 #ifndef __ASSEMBLY__
 /*
diff --git a/arch/x86/kernel/macros.S b/arch/x86/kernel/macros.S
index 71d8b716b111..7baa40d5bf16 100644
--- a/arch/x86/kernel/macros.S
+++ b/arch/x86/kernel/macros.S
@@ -11,3 +11,4 @@
 #include <asm/alternative-asm.h>
 #include <asm/bug.h>
 #include <asm/paravirt.h>
+#include <asm/asm.h>


* [tip:x86/build] x86/cpufeature: Macrofy inline assembly code to work around GCC inlining bugs
  2018-10-05 20:27                       ` [PATCH 2/3] x86/cpufeature: " Nadav Amit
@ 2018-10-06 14:43                         ` tip-bot for Nadav Amit
  0 siblings, 0 replies; 116+ messages in thread
From: tip-bot for Nadav Amit @ 2018-10-06 14:43 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, bp, torvalds, luto, tglx, keescook, peterz, linux-kernel,
	brgerst, hpa, namit, dvlasenk

Commit-ID:  d5a581d84ae6b8a4a740464b80d8d9cf1e7947b2
Gitweb:     https://git.kernel.org/tip/d5a581d84ae6b8a4a740464b80d8d9cf1e7947b2
Author:     Nadav Amit <namit@vmware.com>
AuthorDate: Fri, 5 Oct 2018 13:27:17 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sat, 6 Oct 2018 15:52:16 +0200

x86/cpufeature: Macrofy inline assembly code to work around GCC inlining bugs

As described in:

  77b0bf55bc67: ("kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs")

GCC's inlining heuristics are broken with common asm() patterns used in
kernel code, resulting in the effective disabling of inlining.

The workaround is to set an assembly macro and call it from the inline
assembly block - which is pretty pointless indirection in the static_cpu_has()
case, but is worth it to improve overall inlining quality.

The patch slightly increases the kernel size:

      text     data     bss      dec     hex  filename
  18162879 10226256 2957312 31346447 1de4f0f  ./vmlinux before
  18163528 10226300 2957312 31347140 1de51c4  ./vmlinux after (+693)

And enables the inlining of functions such as free_ldt_pgtables().

Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181005202718.229565-3-namit@vmware.com
Link: https://lore.kernel.org/lkml/20181003213100.189959-10-namit@vmware.com/T/#u
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/cpufeature.h | 82 ++++++++++++++++++++++-----------------
 arch/x86/kernel/macros.S          |  1 +
 2 files changed, 48 insertions(+), 35 deletions(-)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index aced6c9290d6..7d442722ef24 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -2,10 +2,10 @@
 #ifndef _ASM_X86_CPUFEATURE_H
 #define _ASM_X86_CPUFEATURE_H
 
-#include <asm/processor.h>
-
-#if defined(__KERNEL__) && !defined(__ASSEMBLY__)
+#ifdef __KERNEL__
+#ifndef __ASSEMBLY__
 
+#include <asm/processor.h>
 #include <asm/asm.h>
 #include <linux/bitops.h>
 
@@ -161,37 +161,10 @@ extern void clear_cpu_cap(struct cpuinfo_x86 *c, unsigned int bit);
  */
 static __always_inline __pure bool _static_cpu_has(u16 bit)
 {
-	asm_volatile_goto("1: jmp 6f\n"
-		 "2:\n"
-		 ".skip -(((5f-4f) - (2b-1b)) > 0) * "
-			 "((5f-4f) - (2b-1b)),0x90\n"
-		 "3:\n"
-		 ".section .altinstructions,\"a\"\n"
-		 " .long 1b - .\n"		/* src offset */
-		 " .long 4f - .\n"		/* repl offset */
-		 " .word %P[always]\n"		/* always replace */
-		 " .byte 3b - 1b\n"		/* src len */
-		 " .byte 5f - 4f\n"		/* repl len */
-		 " .byte 3b - 2b\n"		/* pad len */
-		 ".previous\n"
-		 ".section .altinstr_replacement,\"ax\"\n"
-		 "4: jmp %l[t_no]\n"
-		 "5:\n"
-		 ".previous\n"
-		 ".section .altinstructions,\"a\"\n"
-		 " .long 1b - .\n"		/* src offset */
-		 " .long 0\n"			/* no replacement */
-		 " .word %P[feature]\n"		/* feature bit */
-		 " .byte 3b - 1b\n"		/* src len */
-		 " .byte 0\n"			/* repl len */
-		 " .byte 0\n"			/* pad len */
-		 ".previous\n"
-		 ".section .altinstr_aux,\"ax\"\n"
-		 "6:\n"
-		 " testb %[bitnum],%[cap_byte]\n"
-		 " jnz %l[t_yes]\n"
-		 " jmp %l[t_no]\n"
-		 ".previous\n"
+	asm_volatile_goto("STATIC_CPU_HAS bitnum=%[bitnum] "
+			  "cap_byte=\"%[cap_byte]\" "
+			  "feature=%P[feature] t_yes=%l[t_yes] "
+			  "t_no=%l[t_no] always=%P[always]"
 		 : : [feature]  "i" (bit),
 		     [always]   "i" (X86_FEATURE_ALWAYS),
 		     [bitnum]   "i" (1 << (bit & 7)),
@@ -226,5 +199,44 @@ t_no:
 #define CPU_FEATURE_TYPEVAL		boot_cpu_data.x86_vendor, boot_cpu_data.x86, \
 					boot_cpu_data.x86_model
 
-#endif /* defined(__KERNEL__) && !defined(__ASSEMBLY__) */
+#else /* __ASSEMBLY__ */
+
+.macro STATIC_CPU_HAS bitnum:req cap_byte:req feature:req t_yes:req t_no:req always:req
+1:
+	jmp 6f
+2:
+	.skip -(((5f-4f) - (2b-1b)) > 0) * ((5f-4f) - (2b-1b)),0x90
+3:
+	.section .altinstructions,"a"
+	.long 1b - .		/* src offset */
+	.long 4f - .		/* repl offset */
+	.word \always		/* always replace */
+	.byte 3b - 1b		/* src len */
+	.byte 5f - 4f		/* repl len */
+	.byte 3b - 2b		/* pad len */
+	.previous
+	.section .altinstr_replacement,"ax"
+4:
+	jmp \t_no
+5:
+	.previous
+	.section .altinstructions,"a"
+	.long 1b - .		/* src offset */
+	.long 0			/* no replacement */
+	.word \feature		/* feature bit */
+	.byte 3b - 1b		/* src len */
+	.byte 0			/* repl len */
+	.byte 0			/* pad len */
+	.previous
+	.section .altinstr_aux,"ax"
+6:
+	testb \bitnum,\cap_byte
+	jnz \t_yes
+	jmp \t_no
+	.previous
+.endm
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* __KERNEL__ */
 #endif /* _ASM_X86_CPUFEATURE_H */
diff --git a/arch/x86/kernel/macros.S b/arch/x86/kernel/macros.S
index 7baa40d5bf16..bf8b9c93e255 100644
--- a/arch/x86/kernel/macros.S
+++ b/arch/x86/kernel/macros.S
@@ -12,3 +12,4 @@
 #include <asm/bug.h>
 #include <asm/paravirt.h>
 #include <asm/asm.h>
+#include <asm/cpufeature.h>


* [tip:x86/build] x86/jump-labels: Macrofy inline assembly code to work around GCC inlining bugs
  2018-10-05 20:27                       ` [PATCH 3/3] x86/jump-labels: " Nadav Amit
@ 2018-10-06 14:44                         ` tip-bot for Nadav Amit
  0 siblings, 0 replies; 116+ messages in thread
From: tip-bot for Nadav Amit @ 2018-10-06 14:44 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: keescook, peterz, kstewart, gregkh, luto, mingo, pombredanne,
	hpa, namit, bp, linux-kernel, dvlasenk, torvalds, tglx, brgerst

Commit-ID:  5bdcd510c2ac9efaf55c4cbd8d46421d8e2320cd
Gitweb:     https://git.kernel.org/tip/5bdcd510c2ac9efaf55c4cbd8d46421d8e2320cd
Author:     Nadav Amit <namit@vmware.com>
AuthorDate: Fri, 5 Oct 2018 13:27:18 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sat, 6 Oct 2018 15:52:17 +0200

x86/jump-labels: Macrofy inline assembly code to work around GCC inlining bugs

As described in:

  77b0bf55bc67: ("kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs")

GCC's inlining heuristics are broken with common asm() patterns used in
kernel code, resulting in the effective disabling of inlining.

The workaround is to define an assembly macro and call it from the inline
assembly block - which is also a minor cleanup for the jump-label code.

As a result the code size is slightly increased, but inlining decisions
are better:

      text     data     bss      dec     hex  filename
  18163528 10226300 2957312 31347140 1de51c4  ./vmlinux before
  18163608 10227348 2957312 31348268 1de562c  ./vmlinux after (+1128)

And functions such as intel_pstate_adjust_policy_max(),
kvm_cpu_accept_dm_intr(), kvm_register_readl() are inlined.

Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181005202718.229565-4-namit@vmware.com
Link: https://lore.kernel.org/lkml/20181003213100.189959-11-namit@vmware.com/T/#u
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/calling.h          |  2 +-
 arch/x86/include/asm/jump_label.h | 72 ++++++++++-----------------------------
 arch/x86/kernel/macros.S          |  1 +
 3 files changed, 20 insertions(+), 55 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 352e70cd33e8..708b46a54578 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -338,7 +338,7 @@ For 32-bit we have the following conventions - kernel is built with
 .macro CALL_enter_from_user_mode
 #ifdef CONFIG_CONTEXT_TRACKING
 #ifdef HAVE_JUMP_LABEL
-	STATIC_JUMP_IF_FALSE .Lafter_call_\@, context_tracking_enabled, def=0
+	STATIC_BRANCH_JMP l_yes=.Lafter_call_\@, key=context_tracking_enabled, branch=1
 #endif
 	call enter_from_user_mode
 .Lafter_call_\@:
diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h
index 21efc9d07ed9..a5fb34fe56a4 100644
--- a/arch/x86/include/asm/jump_label.h
+++ b/arch/x86/include/asm/jump_label.h
@@ -2,19 +2,6 @@
 #ifndef _ASM_X86_JUMP_LABEL_H
 #define _ASM_X86_JUMP_LABEL_H
 
-#ifndef HAVE_JUMP_LABEL
-/*
- * For better or for worse, if jump labels (the gcc extension) are missing,
- * then the entire static branch patching infrastructure is compiled out.
- * If that happens, the code in here will malfunction.  Raise a compiler
- * error instead.
- *
- * In theory, jump labels and the static branch patching infrastructure
- * could be decoupled to fix this.
- */
-#error asm/jump_label.h included on a non-jump-label kernel
-#endif
-
 #define JUMP_LABEL_NOP_SIZE 5
 
 #ifdef CONFIG_X86_64
@@ -33,15 +20,9 @@
 
 static __always_inline bool arch_static_branch(struct static_key *key, bool branch)
 {
-	asm_volatile_goto("1:"
-		".byte " __stringify(STATIC_KEY_INIT_NOP) "\n\t"
-		".pushsection __jump_table,  \"aw\" \n\t"
-		_ASM_ALIGN "\n\t"
-		".long 1b - ., %l[l_yes] - . \n\t"
-		_ASM_PTR "%c0 + %c1 - .\n\t"
-		".popsection \n\t"
-		: :  "i" (key), "i" (branch) : : l_yes);
-
+	asm_volatile_goto("STATIC_BRANCH_NOP l_yes=\"%l[l_yes]\" key=\"%c0\" "
+			  "branch=\"%c1\""
+			: :  "i" (key), "i" (branch) : : l_yes);
 	return false;
 l_yes:
 	return true;
@@ -49,14 +30,8 @@ l_yes:
 
 static __always_inline bool arch_static_branch_jump(struct static_key *key, bool branch)
 {
-	asm_volatile_goto("1:"
-		".byte 0xe9\n\t .long %l[l_yes] - 2f\n\t"
-		"2:\n\t"
-		".pushsection __jump_table,  \"aw\" \n\t"
-		_ASM_ALIGN "\n\t"
-		".long 1b - ., %l[l_yes] - . \n\t"
-		_ASM_PTR "%c0 + %c1 - .\n\t"
-		".popsection \n\t"
+	asm_volatile_goto("STATIC_BRANCH_JMP l_yes=\"%l[l_yes]\" key=\"%c0\" "
+			  "branch=\"%c1\""
 		: :  "i" (key), "i" (branch) : : l_yes);
 
 	return false;
@@ -66,37 +41,26 @@ l_yes:
 
 #else	/* __ASSEMBLY__ */
 
-.macro STATIC_JUMP_IF_TRUE target, key, def
-.Lstatic_jump_\@:
-	.if \def
-	/* Equivalent to "jmp.d32 \target" */
-	.byte		0xe9
-	.long		\target - .Lstatic_jump_after_\@
-.Lstatic_jump_after_\@:
-	.else
-	.byte		STATIC_KEY_INIT_NOP
-	.endif
+.macro STATIC_BRANCH_NOP l_yes:req key:req branch:req
+.Lstatic_branch_nop_\@:
+	.byte STATIC_KEY_INIT_NOP
+.Lstatic_branch_no_after_\@:
 	.pushsection __jump_table, "aw"
 	_ASM_ALIGN
-	.long		.Lstatic_jump_\@ - ., \target - .
-	_ASM_PTR	\key - .
+	.long		.Lstatic_branch_nop_\@ - ., \l_yes - .
+	_ASM_PTR        \key + \branch - .
 	.popsection
 .endm
 
-.macro STATIC_JUMP_IF_FALSE target, key, def
-.Lstatic_jump_\@:
-	.if \def
-	.byte		STATIC_KEY_INIT_NOP
-	.else
-	/* Equivalent to "jmp.d32 \target" */
-	.byte		0xe9
-	.long		\target - .Lstatic_jump_after_\@
-.Lstatic_jump_after_\@:
-	.endif
+.macro STATIC_BRANCH_JMP l_yes:req key:req branch:req
+.Lstatic_branch_jmp_\@:
+	.byte 0xe9
+	.long \l_yes - .Lstatic_branch_jmp_after_\@
+.Lstatic_branch_jmp_after_\@:
 	.pushsection __jump_table, "aw"
 	_ASM_ALIGN
-	.long		.Lstatic_jump_\@ - ., \target - .
-	_ASM_PTR	\key + 1 - .
+	.long		.Lstatic_branch_jmp_\@ - ., \l_yes - .
+	_ASM_PTR	\key + \branch - .
 	.popsection
 .endm
 
diff --git a/arch/x86/kernel/macros.S b/arch/x86/kernel/macros.S
index bf8b9c93e255..161c95059044 100644
--- a/arch/x86/kernel/macros.S
+++ b/arch/x86/kernel/macros.S
@@ -13,3 +13,4 @@
 #include <asm/paravirt.h>
 #include <asm/asm.h>
 #include <asm/cpufeature.h>
+#include <asm/jump_label.h>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* PROPOSAL: Extend inline asm syntax with size spec
  2018-10-03 21:30 [PATCH v9 00/10] x86: macrofying inline asm Nadav Amit
                   ` (9 preceding siblings ...)
  2018-10-03 21:31 ` [PATCH v9 10/10] x86: jump-labels: " Nadav Amit
@ 2018-10-07  9:18 ` Borislav Petkov
       [not found]   ` <20181007132228.GJ29268@gate.crashing.org>
  2018-10-07 16:09   ` Nadav Amit
  10 siblings, 2 replies; 116+ messages in thread
From: Borislav Petkov @ 2018-10-07  9:18 UTC (permalink / raw)
  To: gcc, Richard Biener, Michael Matz
  Cc: Nadav Amit, Ingo Molnar, linux-kernel, x86, Masahiro Yamada,
	Sam Ravnborg, Alok Kataria, Christopher Li, Greg Kroah-Hartman,
	H. Peter Anvin, Jan Beulich, Josh Poimboeuf, Juergen Gross,
	Kate Stewart, Kees Cook, linux-sparse, Peter Zijlstra,
	Philippe Ombredanne, Thomas Gleixner, virtualization,
	Linus Torvalds, Chris Zankel, Max Filippov, linux-xtensa

Hi people,

this is an attempt to see whether gcc's inline asm heuristic when
estimating inline asm statements' cost for better inlining can be
improved.

AFAIU, the problem arises when one uses a lot of inline asm statements
in the kernel: the inline asm cost estimation heuristic counts lines, I
think, so for a macro like this one:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/include/asm/cpufeature.h#n162

the resulting code ends up not inlining the functions themselves which
use this macro. I.e., you see a CALL <function> instead of its body
getting inlined directly.

It should be inlined, though, because in most cases there are only a
couple of actual instructions and all those other directives end up in
another section anyway.

The issue is also explained in more detail in the forwarded mail below.

Now, Richard suggested doing something like:

 1) inline asm ("...")
 2) asm ("..." : : : : <size-expr>)
 3) asm ("...") __attribute__((asm_size(<size-expr>)));

with which the user can tell gcc what the size of that inline asm
statement is, and thus allow for more precise cost estimation and, in
the end, better inlining.

And FWIW 3) looks pretty straight-forward to me because attributes are
pretty common anyways.

But I'm sure there are other options and I'm sure people will have
better/different ideas so feel free to chime in.

Thx.

On Wed, Oct 03, 2018 at 02:30:50PM -0700, Nadav Amit wrote:
> This patch-set deals with an interesting yet stupid problem: kernel code
> that does not get inlined despite its simplicity. There are several
> causes for this behavior: "cold" attribute on __init, different function
> optimization levels; conditional constant computations based on
> __builtin_constant_p(); and finally large inline assembly blocks.
> 
> This patch-set deals with the inline assembly problem. I separated these
> patches from the others (that were sent in the RFC) for easier
> inclusion. I also separated the removal of unnecessary new-lines which
> would be sent separately.
> 
> The problem with inline assembly is that inline assembly is often used
> by the kernel for things that are other than code - for example,
> assembly directives and data. GCC however is oblivious to the content of
> the blocks and assumes their cost in space and time is proportional to
> the number of the perceived assembly "instruction", according to the
> number of newlines and semicolons. Alternatives, paravirt and other
> mechanisms are affected, causing code not to be inlined, and degrading
> compilation quality in general.
> 
> The solution that this patch-set carries for this problem is to create
> an assembly macro, and then call it from the inline assembly block.  As
> a result, the compiler sees a single "instruction" and assigns the more
> appropriate cost to the code.
> 
> To avoid uglification of the code, as many noted, the macros are first
> precompiled into an assembly file, which is later assembled together
> with the C files. This also enables to avoid duplicate implementation
> that was set before for the asm and C code. This can be seen in the
> exception table changes.
> 
> Overall this patch-set slightly increases the kernel size (my build was
> done using my Ubuntu 18.04 config + localyesconfig for the record):
> 
>    text	   data	    bss	    dec	    hex	filename
> 18140829 10224724 2957312 31322865 1ddf2f1 ./vmlinux before
> 18163608 10227348 2957312 31348268 1de562c ./vmlinux after (+0.1%)
> 
> The number of static functions in the image is reduced by 379, but
> actually inlining is even better, which does not always shows in these
> numbers: a function may be inlined causing the calling function not to
> be inlined.
> 
> I ran some limited number of benchmarks, and in general the performance
> impact is not very notable. You can still see >10 cycles shaved off some
> syscalls that manipulate page-tables (e.g., mprotect()), in which
> paravirt caused many functions not to be inlined. In addition this
> patch-set can prevent issues such as [1], and improves code readability
> and maintainability.
> 
> Update: Rasmus recently caused me (inadvertently) to become paranoid
> about the dependencies. To clarify: if any of the headers changes, any c
> file which uses macros that are included in macros.S would be fine as
> long as it includes the header as well (as it should). Adding an
> assertion to check this is done might become slightly ugly, and nobody
> else is concerned about it. Another minor issue is that changes of
> macros.S would not trigger a global rebuild, but that is pretty similar
> to changes of the Makefile that do not trigger a rebuild.
> 
> [1] https://patchwork.kernel.org/patch/10450037/
> 
> v8->v9: * Restoring the '-pipe' parameter (Rasmus)
> 	* Adding Kees's tested-by tag (Kees)
> 
> v7->v8:	* Add acks (Masahiro, Max)
> 	* Rebase on 4.19 (Ingo)
> 
> v6->v7: * Fix context switch tracking (Ingo)
> 	* Fix xtensa build error (Ingo)
> 	* Rebase on 4.18-rc8
> 
> v5->v6:	* Removing more code from jump-labels (PeterZ)
> 	* Fix build issue on i386 (0-day, PeterZ)
> 
> v4->v5:	* Makefile fixes (Masahiro, Sam)
> 
> v3->v4: * Changed naming of macros in 2 patches (PeterZ)
> 	* Minor cleanup of the paravirt patch
> 
> v2->v3: * Several build issues resolved (0-day)
> 	* Wrong comments fix (Josh)
> 	* Change asm vs C order in refcount (Kees)
> 
> v1->v2:	* Compiling the macros into a separate .s file, improving
> 	  readability (Linus)
> 	* Improving assembly formatting, applying most of the comments
> 	  according to my judgment (Jan)
> 	* Adding exception-table, cpufeature and jump-labels
> 	* Removing new-line cleanup; to be submitted separately
> 
> Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
> Cc: Sam Ravnborg <sam@ravnborg.org>
> Cc: Alok Kataria <akataria@vmware.com>
> Cc: Christopher Li <sparse@chrisli.org>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Jan Beulich <JBeulich@suse.com>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Kate Stewart <kstewart@linuxfoundation.org>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: linux-sparse@vger.kernel.org
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Philippe Ombredanne <pombredanne@nexb.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: virtualization@lists.linux-foundation.org
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: x86@kernel.org
> Cc: Chris Zankel <chris@zankel.net>
> Cc: Max Filippov <jcmvbkbc@gmail.com>
> Cc: linux-xtensa@linux-xtensa.org
> 
> Nadav Amit (10):
>   xtensa: defining LINKER_SCRIPT for the linker script
>   Makefile: Prepare for using macros for inline asm
>   x86: objtool: use asm macro for better compiler decisions
>   x86: refcount: prevent gcc distortions
>   x86: alternatives: macrofy locks for better inlining
>   x86: bug: prevent gcc distortions
>   x86: prevent inline distortion by paravirt ops
>   x86: extable: use macros instead of inline assembly
>   x86: cpufeature: use macros instead of inline assembly
>   x86: jump-labels: use macros instead of inline assembly
> 
>  Makefile                               |  9 ++-
>  arch/x86/Makefile                      |  7 ++
>  arch/x86/entry/calling.h               |  2 +-
>  arch/x86/include/asm/alternative-asm.h | 20 ++++--
>  arch/x86/include/asm/alternative.h     | 11 +--
>  arch/x86/include/asm/asm.h             | 61 +++++++---------
>  arch/x86/include/asm/bug.h             | 98 +++++++++++++++-----------
>  arch/x86/include/asm/cpufeature.h      | 82 ++++++++++++---------
>  arch/x86/include/asm/jump_label.h      | 77 ++++++++------------
>  arch/x86/include/asm/paravirt_types.h  | 56 +++++++--------
>  arch/x86/include/asm/refcount.h        | 74 +++++++++++--------
>  arch/x86/kernel/macros.S               | 16 +++++
>  arch/xtensa/kernel/Makefile            |  4 +-
>  include/asm-generic/bug.h              |  8 +--
>  include/linux/compiler.h               | 56 +++++++++++----
>  scripts/Kbuild.include                 |  4 +-
>  scripts/mod/Makefile                   |  2 +
>  17 files changed, 331 insertions(+), 256 deletions(-)
>  create mode 100644 arch/x86/kernel/macros.S
> 
> -- 
> 2.17.1
> 

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: PROPOSAL: Extend inline asm syntax with size spec
       [not found]   ` <20181007132228.GJ29268@gate.crashing.org>
@ 2018-10-07 14:13     ` Borislav Petkov
  2018-10-07 15:14       ` Segher Boessenkool
  2018-10-07 15:53     ` Michael Matz
  1 sibling, 1 reply; 116+ messages in thread
From: Borislav Petkov @ 2018-10-07 14:13 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: gcc, Richard Biener, Michael Matz, Nadav Amit, Ingo Molnar,
	linux-kernel, x86, Masahiro Yamada, Sam Ravnborg, Alok Kataria,
	Christopher Li, Greg Kroah-Hartman, H. Peter Anvin, Jan Beulich,
	Josh Poimboeuf, Juergen Gross, Kate Stewart, Kees Cook,
	linux-sparse, Peter Zijlstra, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa

On Sun, Oct 07, 2018 at 08:22:28AM -0500, Segher Boessenkool wrote:
> GCC already estimates the *size* of inline asm, and this is required
> *for correctness*.

I didn't say it didn't - but the heuristic could use improving.

> So I guess the real issue is that the inline asm size estimate for x86
> isn't very good (since it has to be pessimistic, and x86 insns can be
> huge)?

Well, the size thing could be just a "parameter" or "hint" of sorts, to
tell gcc to inline the function X which is inlining the asm statement
into the function Y which is calling function X. If you look at the
patchset, it is moving everything to asm macros where gcc is apparently
able to do better inlining.

> >  3) asm ("...") __attribute__((asm_size(<size-expr>)));
> 
> Eww.

Why?

> More precise *size* estimates, yes.  And if the user lies he should not
> be surprised to get assembler errors, etc.

Yes.

Another option would be for gcc to parse the inline asm directly and
do a more precise size estimation. That is a much more involved and
complicated solution, so I guess we wanna look at the simpler ones first.

:-)

> I don't like 2) either.  But 1) looks interesting, depends what its
> semantics would be?  "Don't count this insn's size for inlining decisions",
> maybe?

Or simply "this asm statement has a size of 1" to mean, inline it
everywhere. Which has the same caveats as above.

> Another option is to just force inlining for those few functions where
> GCC currently makes an inlining decision you don't like.  Or are there
> more than a few?

I'm afraid they're more than a few and this should work automatically,
if possible.

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-07 14:13     ` Borislav Petkov
@ 2018-10-07 15:14       ` Segher Boessenkool
  2018-10-08  5:58         ` Ingo Molnar
  0 siblings, 1 reply; 116+ messages in thread
From: Segher Boessenkool @ 2018-10-07 15:14 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: gcc, Richard Biener, Michael Matz, Nadav Amit, Ingo Molnar,
	linux-kernel, x86, Masahiro Yamada, Sam Ravnborg, Alok Kataria,
	Christopher Li, Greg Kroah-Hartman, H. Peter Anvin, Jan Beulich,
	Josh Poimboeuf, Juergen Gross, Kate Stewart, Kees Cook,
	linux-sparse, Peter Zijlstra, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa

On Sun, Oct 07, 2018 at 04:13:49PM +0200, Borislav Petkov wrote:
> On Sun, Oct 07, 2018 at 08:22:28AM -0500, Segher Boessenkool wrote:
> > GCC already estimates the *size* of inline asm, and this is required
> > *for correctness*.
> 
> I didn't say it didn't - but the heuristic could use improving.

How?  It is as sharp an estimate as can be *already*: number of insns
times maximum size per insn.

If you get this wrong, conditional branches (and similar things, but
conditional branches usually hit first, and hurt most) will stop working
correctly, unless binutils uses relaxation for those on your architecture
(most don't).

> > So I guess the real issue is that the inline asm size estimate for x86
> > isn't very good (since it has to be pessimistic, and x86 insns can be
> > huge)?
> 
> Well, the size thing could be just a "parameter" or "hint" of sorts, to
> tell gcc to inline the function X which is inlining the asm statement
> into the function Y which is calling function X. If you look at the
> patchset, it is moving everything to asm macros where gcc is apparently
> able to do better inlining.

Yes, that will cause fewer problems I think: do not override size _at all_,
but give a hint to the inliner.

> > >  3) asm ("...") __attribute__((asm_size(<size-expr>)));
> > 
> > Eww.
> 
> Why?

Attributes are clumsy and clunky and kludgy.

It never is well-defined how attributes interact, and the more attributes
we define and use, the more that matters.

> > More precise *size* estimates, yes.  And if the user lies he should not
> > be surprised to get assembler errors, etc.
> 
> Yes.
> 
> Another option would be if gcc parses the inline asm directly and
> does a more precise size estimation. Which is a lot more involved and
> complicated solution so I guess we wanna look at the simpler ones first.
> 
> :-)

Which is *impossible* to do.  Inline assembler is free-form text.

> > I don't like 2) either.  But 1) looks interesting, depends what its
> > semantics would be?  "Don't count this insn's size for inlining decisions",
> > maybe?
> 
> Or simply "this asm statement has a size of 1" to mean, inline it
> everywhere. Which has the same caveats as above.

"Has minimum length" then (size 1 cannot work on most archs).

> > Another option is to just force inlining for those few functions where
> > GCC currently makes an inlining decision you don't like.  Or are there
> > more than a few?
> 
> I'm afraid they're more than a few and this should work automatically,
> if possible.

Would counting *all* asms as having minimum length for inlining decisions
work?  Will that give bad side effects?

Or since this problem is quite specific to x86, maybe some target hook is
wanted?  Things work quite well elsewhere as-is, degrading that is not a
good idea.


Segher

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: PROPOSAL: Extend inline asm syntax with size spec
       [not found]   ` <20181007132228.GJ29268@gate.crashing.org>
  2018-10-07 14:13     ` Borislav Petkov
@ 2018-10-07 15:53     ` Michael Matz
  2018-10-08  6:13       ` Ingo Molnar
                         ` (2 more replies)
  1 sibling, 3 replies; 116+ messages in thread
From: Michael Matz @ 2018-10-07 15:53 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Borislav Petkov, gcc, Richard Biener, Nadav Amit, Ingo Molnar,
	linux-kernel, x86, Masahiro Yamada, Sam Ravnborg, Alok Kataria,
	Christopher Li, Greg Kroah-Hartman, H. Peter Anvin, Jan Beulich,
	Josh Poimboeuf, Juergen Gross, Kate Stewart, Kees Cook,
	linux-sparse, Peter Zijlstra, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa

Hi Segher,

On Sun, 7 Oct 2018, Segher Boessenkool wrote:

> On Sun, Oct 07, 2018 at 11:18:06AM +0200, Borislav Petkov wrote:
> > this is an attempt to see whether gcc's inline asm heuristic when
> > estimating inline asm statements' cost for better inlining can be
> > improved.
> 
> GCC already estimates the *size* of inline asm, and this is required
> *for correctness*.  So any workaround that works against this will only
> end in tears.

You're right and wrong.  GCC can't even estimate the size of mildly 
complicated inline asms right now, so your claim of it being necessary for 
correctness can't be true in this absolute form.  I know what you try to 
say, but still, consider inline asms like this:

     insn1
  .section bla
     insn2
  .previous

or
   invoke_asm_macro foo,bar

in both cases GCC's size estimate will be wrong however you want to count
it.  This is actually the motivating example for the kernel guys: the
games they play within their inline asms make the estimates wildly
wrong, to the point that it interacts with the inliner.
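
To make the miscount concrete, here is a minimal sketch in C of a
line-counting estimate of the kind discussed in this thread. It is a
simplified model, not GCC's actual code: the name asm_stmt_count, the
choice of ';' as the only separator besides newline, and the 64-bit
expansions used in the open-coded string (.balign 8, .quad, and the
NOP bytes) are assumptions made for illustration.

```c
#include <stddef.h>

/*
 * Simplified model of a line-counting heuristic (hypothetical helper,
 * not GCC's implementation): every newline or ';' in the asm template
 * is taken to start another "instruction".
 */
static unsigned int asm_stmt_count(const char *templ)
{
	unsigned int count = 1;

	if (templ == NULL || *templ == '\0')
		return 0;

	for (; *templ != '\0'; templ++)
		if (*templ == '\n' || *templ == ';')
			count++;

	return count;
}

/* Open-coded jump-label template: one 5-byte NOP plus bookkeeping. */
static const char open_coded[] =
	"1:"
	".byte 0x0f,0x1f,0x44,0x00,0\n\t"
	".pushsection __jump_table, \"aw\"\n\t"
	".balign 8\n\t"
	".long 1b - ., %l[l_yes] - .\n\t"
	".quad %c0 + %c1 - .\n\t"
	".popsection\n\t";

/* The same statement expressed as a single assembler macro call. */
static const char macro_call[] =
	"STATIC_BRANCH_NOP l_yes=\"%l[l_yes]\" key=\"%c0\" branch=\"%c1\"";
```

Under this model the open-coded template counts as 7 "instructions"
although it places a single NOP in the text section, while the macro
invocation counts as 1 regardless of what the macro expands to.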

> So I guess the real issue is that the inline asm size estimate for x86 
> isn't very good (since it has to be pessimistic, and x86 insns can be 
> huge)?

No, see above: even if we were to improve the size estimates (e.g. based
on some average instruction size) the kernel examples would still be off
because they switch sections back and forth, use asm macros and computed
.fill directives and maybe further stuff.  GCC will never be able to
accurately calculate these sizes (without a built-in assembler, which
hopefully no one proposes).

So, there is a case for extending the inline-asm facility to say 
"size is complicated here, assume this for inline decisions".

> > Now, Richard suggested doing something like:
> > 
> >  1) inline asm ("...")
> 
> What would the semantics of this be?

The size of the inline asm wouldn't be counted towards the inliner size 
limits (or be counted as "1").

> I don't like 2) either.  But 1) looks interesting, depends what its
> semantics would be?  "Don't count this insn's size for inlining decisions",
> maybe?

TBH, I like the inline asm (...) suggestion most currently, but what if we 
want to add more attributes to asms?  We could add further special 
keywords to the clobber list:
  asm ("...." : : : "cc,memory,inline");
sure, it might seem strange to "clobber" inline, but if we reinterpret the 
clobber list as arbitrary set of attributes for this asm, it'd be fine.

> Another option is to just force inlining for those few functions where 
> GCC currently makes an inlining decision you don't like.  Or are there 
> more than a few?

I think the examples I saw from Boris were all indirect inlines:

  static inline void foo() { asm("large-looking-but-small-asm"); }
  static void bar1() { ... foo() ... }
  static void bar2() { ... foo() ... }
  void goo (void) { bar1(); }  // bar1 should have been inlined

So, while the immediate asm user was marked as always inline, that in turn
caused users of it to become non-inlined.  I'm assuming the kernel guys
did proper measurements that they _really_ get some non-trivial speed
benefit by inlining bar1/bar2, but for some reason (I didn't inquire)
didn't want to mark them all as inline as well.


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-07  9:18 ` PROPOSAL: Extend inline asm syntax with size spec Borislav Petkov
       [not found]   ` <20181007132228.GJ29268@gate.crashing.org>
@ 2018-10-07 16:09   ` Nadav Amit
  2018-10-07 16:46     ` Richard Biener
  1 sibling, 1 reply; 116+ messages in thread
From: Nadav Amit @ 2018-10-07 16:09 UTC (permalink / raw)
  To: Borislav Petkov, gcc, Richard Biener, Michael Matz
  Cc: Ingo Molnar, linux-kernel, x86, Masahiro Yamada, Sam Ravnborg,
	Alok Kataria, Christopher Li, Greg Kroah-Hartman, H. Peter Anvin,
	Jan Beulich, Josh Poimboeuf, Juergen Gross, Kate Stewart,
	Kees Cook, linux-sparse, Peter Zijlstra, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa

at 2:18 AM, Borislav Petkov <bp@alien8.de> wrote:

> Hi people,
> 
> this is an attempt to see whether gcc's inline asm heuristic when
> estimating inline asm statements' cost for better inlining can be
> improved.
> 
> AFAIU, the problematic arises when one ends up using a lot of inline
> asm statements in the kernel but due to the inline asm cost estimation
> heuristic which counts lines, I think, for example like in this here
> macro:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/include/asm/cpufeature.h#n162
> 
> the resulting code ends up not inlining the functions themselves which
> use this macro. I.e., you see a CALL <function> instead of its body
> getting inlined directly.
> 
> Even though it should be because the actual instructions are only a
> couple in most cases and all those other directives end up in another
> section anyway.
> 
> The issue is explained below in the forwarded mail in a larger detail
> too.
> 
> Now, Richard suggested doing something like:
> 
> 1) inline asm ("...")
> 2) asm ("..." : : : : <size-expr>)
> 3) asm ("...") __attribute__((asm_size(<size-expr>)));
> 
> with which user can tell gcc what the size of that inline asm statement
> is and thus allow for more precise cost estimation and in the end better
> inlining.
> 
> And FWIW 3) looks pretty straight-forward to me because attributes are
> pretty common anyways.
> 
> But I'm sure there are other options and I'm sure people will have
> better/different ideas so feel free to chime in.

Thanks for taking care of it. I would like to mention a second issue, since
you may want to resolve both with a single solution: not inlining
conditional __builtin_constant_p(), in which there are two code-paths - one
for constants and one for variables.

Consider for example the Linux kernel ilog2 macro, which has a condition
based on __builtin_constant_p() (
https://elixir.bootlin.com/linux/v4.19-rc7/source/include/linux/log2.h#L160
). When estimating code size, the compiler mistakenly counts the “heavy”
code-path that is supposed to be evaluated only at compilation time. This
causes the compiler to consider functions such as kmalloc() “big”.
kmalloc() is marked with the always_inline attribute, so instead the
calling functions, such as kzalloc(), are not inlined.
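
The two-path pattern can be sketched in C as follows. This is not the
kernel's ilog2(): the names my_ilog2, SMALL_CONST_ILOG2 and
runtime_ilog2 are hypothetical, and the constant path is cut down to
values below 256 to keep the example short. It relies on the GCC/Clang
builtin __builtin_constant_p().

```c
/* Runtime path: a small loop, cheap in code size. */
static inline int runtime_ilog2(unsigned long n)
{
	int log = -1;

	while (n) {
		n >>= 1;
		log++;
	}
	return log;
}

/*
 * Constant path: a long chain of conditionals.  It folds to a single
 * constant when n is known at compile time, but looks "heavy" to a
 * size estimate made before that folding happens.
 */
#define SMALL_CONST_ILOG2(n)		\
	((n) & 0x80 ? 7 :		\
	 (n) & 0x40 ? 6 :		\
	 (n) & 0x20 ? 5 :		\
	 (n) & 0x10 ? 4 :		\
	 (n) & 0x08 ? 3 :		\
	 (n) & 0x04 ? 2 :		\
	 (n) & 0x02 ? 1 :		\
	 (n) & 0x01 ? 0 : -1)

/* Pick the constant path only for small compile-time constants. */
#define my_ilog2(n)						\
	((__builtin_constant_p(n) && (unsigned long)(n) < 256UL) ?	\
	 SMALL_CONST_ILOG2(n) : runtime_ilog2(n))
```

For a caller such as my_ilog2(8) the whole expression reduces to the
constant 3, while a caller passing a variable only needs the loop; the
concern raised above is that the cost estimate may account for both.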

When I thought about hacking gcc to solve this issue, I considered an
intrinsic that would override the cost of a given statement. This solution
is not too nice, but may solve both issues.

In addition, note that AFAIU a wrong code cost estimate can also affect
loop and other optimizations.

Regards,
Nadav

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-07 16:09   ` Nadav Amit
@ 2018-10-07 16:46     ` Richard Biener
  2018-10-07 19:06       ` Nadav Amit
  0 siblings, 1 reply; 116+ messages in thread
From: Richard Biener @ 2018-10-07 16:46 UTC (permalink / raw)
  To: Nadav Amit, Borislav Petkov, gcc, Michael Matz
  Cc: Ingo Molnar, linux-kernel, x86, Masahiro Yamada, Sam Ravnborg,
	Alok Kataria, Christopher Li, Greg Kroah-Hartman, H. Peter Anvin,
	Jan Beulich, Josh Poimboeuf, Juergen Gross, Kate Stewart,
	Kees Cook, linux-sparse, Peter Zijlstra, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa

On October 7, 2018 6:09:30 PM GMT+02:00, Nadav Amit <namit@vmware.com> wrote:
>at 2:18 AM, Borislav Petkov <bp@alien8.de> wrote:
>
>> Hi people,
>> 
>> this is an attempt to see whether gcc's inline asm heuristic when
>> estimating inline asm statements' cost for better inlining can be
>> improved.
>> 
>> AFAIU, the problematic arises when one ends up using a lot of inline
>> asm statements in the kernel but due to the inline asm cost
>estimation
>> heuristic which counts lines, I think, for example like in this here
>> macro:
>> 
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/include/asm/cpufeature.h#n162
>> 
>> the resulting code ends up not inlining the functions themselves
>which
>> use this macro. I.e., you see a CALL <function> instead of its body
>> getting inlined directly.
>> 
>> Even though it should be because the actual instructions are only a
>> couple in most cases and all those other directives end up in another
>> section anyway.
>> 
>> The issue is explained below in the forwarded mail in a larger detail
>> too.
>> 
>> Now, Richard suggested doing something like:
>> 
>> 1) inline asm ("...")
>> 2) asm ("..." : : : : <size-expr>)
>> 3) asm ("...") __attribute__((asm_size(<size-expr>)));
>> 
>> with which user can tell gcc what the size of that inline asm
>statement
>> is and thus allow for more precise cost estimation and in the end
>better
>> inlining.
>> 
>> And FWIW 3) looks pretty straight-forward to me because attributes
>are
>> pretty common anyways.
>> 
>> But I'm sure there are other options and I'm sure people will have
>> better/different ideas so feel free to chime in.
>
>Thanks for taking care of it. I would like to mention a second issue, since
>you may want to resolve both with a single solution: not inlining
>conditional __builtin_constant_p(), in which there are two code-paths - one
>for constants and one for variables.
>
>Consider for example the Linux kernel ilog2 macro, which has a condition
>based on __builtin_constant_p() (
>https://elixir.bootlin.com/linux/v4.19-rc7/source/include/linux/log2.h#L160
>). The compiler mistakenly considers the “heavy” code-path, which is
>supposed to be evaluated only at compile time, when estimating the code size.

But this is a misconception about __builtin_constant_p. It doesn't guard something like 'constexpr' regions. If you try to use it with those semantics you'll fail (apparently you do).

Of course, IPA-CP code size estimates when seeing a constant fed to __builtin_constant_p() might not be optimal, but that is another issue.

Richard. 

>This causes the compiler to consider functions such as kmalloc() as “big”.
>kmalloc() is marked with the always_inline attribute, so instead the
>calling functions, such as kzalloc(), are not inlined.
>
>When I thought about hacking gcc to solve this issue, I considered an
>intrinsic that would override the cost of a given statement. This solution
>is not too nice, but may solve both issues.
>
>In addition, note that AFAIU a wrong code cost estimate can also affect
>loop and other optimizations.
>
>Regards,
>Nadav


* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-07 16:46     ` Richard Biener
@ 2018-10-07 19:06       ` Nadav Amit
  2018-10-07 19:52         ` Jeff Law
  2018-10-08  7:46         ` Richard Biener
  0 siblings, 2 replies; 116+ messages in thread
From: Nadav Amit @ 2018-10-07 19:06 UTC (permalink / raw)
  To: Richard Biener, Borislav Petkov, gcc, Michael Matz
  Cc: Ingo Molnar, linux-kernel, x86, Masahiro Yamada, Sam Ravnborg,
	Alok Kataria, Christopher Li, Greg Kroah-Hartman, H. Peter Anvin,
	Jan Beulich, Josh Poimboeuf, Juergen Gross, Kate Stewart,
	Kees Cook, linux-sparse, Peter Zijlstra, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa

at 9:46 AM, Richard Biener <rguenther@suse.de> wrote:

> On October 7, 2018 6:09:30 PM GMT+02:00, Nadav Amit <namit@vmware.com> wrote:
>> at 2:18 AM, Borislav Petkov <bp@alien8.de> wrote:
>> 
>>> Hi people,
>>> 
>>> this is an attempt to see whether gcc's inline asm heuristic when
>>> estimating inline asm statements' cost for better inlining can be
>>> improved.
>>> 
>>> AFAIU, the problem arises when one ends up using a lot of inline
>>> asm statements in the kernel but due to the inline asm cost
>> estimation
>>> heuristic which counts lines, I think, for example like in this here
>>> macro:
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/include/asm/cpufeature.h#n162
>>> the resulting code ends up not inlining the functions themselves
>> which
>>> use this macro. I.e., you see a CALL <function> instead of its body
>>> getting inlined directly.
>>> 
>>> Even though it should be because the actual instructions are only a
>>> couple in most cases and all those other directives end up in another
>>> section anyway.
>>> 
>>> The issue is explained below in the forwarded mail in a larger detail
>>> too.
>>> 
>>> Now, Richard suggested doing something like:
>>> 
>>> 1) inline asm ("...")
>>> 2) asm ("..." : : : : <size-expr>)
>>> 3) asm ("...") __attribute__((asm_size(<size-expr>)));
>>> 
>>> with which user can tell gcc what the size of that inline asm
>> statement
>>> is and thus allow for more precise cost estimation and in the end
>> better
>>> inlining.
>>> 
>>> And FWIW 3) looks pretty straight-forward to me because attributes
>> are
>>> pretty common anyways.
>>> 
>>> But I'm sure there are other options and I'm sure people will have
>>> better/different ideas so feel free to chime in.
>> 
>> Thanks for taking care of it. I would like to mention a second issue,
>> since
>> you may want to resolve both with a single solution: not inlining
>> conditional __builtin_constant_p(), in which there are two code-paths -
>> one
>> for constants and one for variables.
>> 
>> Consider for example the Linux kernel ilog2 macro, which has a
>> condition
>> based on __builtin_constant_p() (
>> https://elixir.bootlin.com/linux/v4.19-rc7/source/include/linux/log2.h#L160
>> ). The compiler mistakenly considers the “heavy” code-path that is
>> supposed
>> to be evaluated only in compilation time to evaluate the code size.
> 
> But this is a misconception about __builtin_constant_p. It doesn't guard something like 'constexpr' regions. If you try to use it with those semantics you'll fail (apparently you do).
> 
> Of course, IPA-CP code size estimates when seeing a constant fed to __builtin_constant_p() might not be optimal, but that is another issue.

I understand that this might not be the right way to implement macros
such as ilog2() and test_bit(), but this code has been around for some time.

I thought of using __builtin_choose_expr() instead of ternary operator, but
this introduces a different problem, as the variable version is used instead
of the constant one in many cases. From my brief experiments with llvm, it
appears that llvm does not have both of these issues (wrong cost attributed
to inline asm and conditions based on __builtin_constant_p()).

So what alternative do you propose to implement ilog2() like behavior? I was
digging through the gcc code to find a workaround with no success.
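
For illustration, the pattern in question looks roughly like the sketch
below. It is not the kernel's actual log2.h code: my_ilog2(), CONST_ILOG2()
and runtime_ilog2() are made-up names, only 32 bits are handled, and a
GCC-compatible compiler with a 32-bit unsigned int is assumed.

```c
#include <stdint.h>

/* Runtime path: a single count-leading-zeros operation.
 * Undefined for n == 0, because __builtin_clz(0) is undefined. */
static inline int runtime_ilog2(uint32_t n)
{
	return 31 - __builtin_clz(n);
}

/* "Heavy" path: a long comparison chain that is only meant to be
 * folded away when n is a compile-time constant. */
#define CONST_ILOG2(n) (                                          \
	(n) & (1u << 31) ? 31 : (n) & (1u << 30) ? 30 :           \
	(n) & (1u << 29) ? 29 : (n) & (1u << 28) ? 28 :           \
	(n) & (1u << 27) ? 27 : (n) & (1u << 26) ? 26 :           \
	(n) & (1u << 25) ? 25 : (n) & (1u << 24) ? 24 :           \
	(n) & (1u << 23) ? 23 : (n) & (1u << 22) ? 22 :           \
	(n) & (1u << 21) ? 21 : (n) & (1u << 20) ? 20 :           \
	(n) & (1u << 19) ? 19 : (n) & (1u << 18) ? 18 :           \
	(n) & (1u << 17) ? 17 : (n) & (1u << 16) ? 16 :           \
	(n) & (1u << 15) ? 15 : (n) & (1u << 14) ? 14 :           \
	(n) & (1u << 13) ? 13 : (n) & (1u << 12) ? 12 :           \
	(n) & (1u << 11) ? 11 : (n) & (1u << 10) ? 10 :           \
	(n) & (1u <<  9) ?  9 : (n) & (1u <<  8) ?  8 :           \
	(n) & (1u <<  7) ?  7 : (n) & (1u <<  6) ?  6 :           \
	(n) & (1u <<  5) ?  5 : (n) & (1u <<  4) ?  4 :           \
	(n) & (1u <<  3) ?  3 : (n) & (1u <<  2) ?  2 :           \
	(n) & (1u <<  1) ?  1 : 0)

/* Select the path with a ternary on __builtin_constant_p(). */
#define my_ilog2(n) \
	(__builtin_constant_p(n) ? CONST_ILOG2(n) : runtime_ilog2(n))
```

With a literal argument the ternary collapses to a constant. With a
variable, both arms remain in the function body until the compiler resolves
__builtin_constant_p(), which is the situation described above in which
the comparison chain can count towards the size estimate.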

Thanks,
Nadav


* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-07 19:06       ` Nadav Amit
@ 2018-10-07 19:52         ` Jeff Law
  2018-10-08  7:46         ` Richard Biener
  1 sibling, 0 replies; 116+ messages in thread
From: Jeff Law @ 2018-10-07 19:52 UTC (permalink / raw)
  To: Nadav Amit, Richard Biener, Borislav Petkov, gcc, Michael Matz
  Cc: Ingo Molnar, linux-kernel, x86, Masahiro Yamada, Sam Ravnborg,
	Alok Kataria, Christopher Li, Greg Kroah-Hartman, H. Peter Anvin,
	Jan Beulich, Josh Poimboeuf, Juergen Gross, Kate Stewart,
	Kees Cook, linux-sparse, Peter Zijlstra, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov

On 10/7/18 1:06 PM, Nadav Amit wrote:
> at 9:46 AM, Richard Biener <rguenther@suse.de> wrote:
> 
>> On October 7, 2018 6:09:30 PM GMT+02:00, Nadav Amit <namit@vmware.com> wrote:
>>> at 2:18 AM, Borislav Petkov <bp@alien8.de> wrote:
>>>
>>>> Hi people,
>>>>
>>>> this is an attempt to see whether gcc's inline asm heuristic when
>>>> estimating inline asm statements' cost for better inlining can be
>>>> improved.
>>>>
>>>> AFAIU, the problem arises when one ends up using a lot of inline
>>>> asm statements in the kernel but due to the inline asm cost
>>> estimation
>>>> heuristic which counts lines, I think, for example like in this here
>>>> macro:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/include/asm/cpufeature.h#n162
>>>> the resulting code ends up not inlining the functions themselves
>>> which
>>>> use this macro. I.e., you see a CALL <function> instead of its body
>>>> getting inlined directly.
>>>>
>>>> Even though it should be because the actual instructions are only a
>>>> couple in most cases and all those other directives end up in another
>>>> section anyway.
>>>>
>>>> The issue is explained below in the forwarded mail in a larger detail
>>>> too.
>>>>
>>>> Now, Richard suggested doing something like:
>>>>
>>>> 1) inline asm ("...")
>>>> 2) asm ("..." : : : : <size-expr>)
>>>> 3) asm ("...") __attribute__((asm_size(<size-expr>)));
>>>>
>>>> with which user can tell gcc what the size of that inline asm
>>> statement
>>>> is and thus allow for more precise cost estimation and in the end
>>> better
>>>> inlining.
>>>>
>>>> And FWIW 3) looks pretty straight-forward to me because attributes
>>> are
>>>> pretty common anyways.
>>>>
>>>> But I'm sure there are other options and I'm sure people will have
>>>> better/different ideas so feel free to chime in.
>>>
>>> Thanks for taking care of it. I would like to mention a second issue,
>>> since
>>> you may want to resolve both with a single solution: not inlining
>>> conditional __builtin_constant_p(), in which there are two code-paths -
>>> one
>>> for constants and one for variables.
>>>
>>> Consider for example the Linux kernel ilog2 macro, which has a
>>> condition
>>> based on __builtin_constant_p() (
>>> https://elixir.bootlin.com/linux/v4.19-rc7/source/include/linux/log2.h#L160
>>> ). The compiler mistakenly considers the “heavy” code-path that is
>>> supposed
>>> to be evaluated only in compilation time to evaluate the code size.
>>
>> But this is a misconception about __builtin_constant_p. It doesn't guard something like 'constexpr' regions. If you try to use it with those semantics you'll fail (apparently you do).
>>
>> Of course, IPA-CP code size estimates when seeing a constant fed to __builtin_constant_p() might not be optimal, but that is another issue.
> 
> I understand that this might not be the right way to implement macros
> such as ilog2() and test_bit(), but this code has been around for some time.
That doesn't make it right -- and there have been numerous bogus bugs
reported against ilog2 because the authors of ilog2 haven't had a clear
understanding of the semantics of __builtin_constant_p.


Jeff

* Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions
  2018-10-05  9:31                   ` Ingo Molnar
  2018-10-05 11:20                     ` Borislav Petkov
  2018-10-05 20:27                     ` [PATCH 0/3] Macrofying inline asm rebased Nadav Amit
@ 2018-10-08  2:17                     ` Nadav Amit
  2 siblings, 0 replies; 116+ messages in thread
From: Nadav Amit @ 2018-10-08  2:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: hpa, Ingo Molnar, linux-kernel, x86, Thomas Gleixner,
	Jan Beulich, Josh Poimboeuf, Linus Torvalds, Peter Zijlstra,
	Andy Lutomirski

at 2:31 AM, Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Nadav Amit <namit@vmware.com> wrote:
> 
>>> Are you using defconfig or a reasonable distro-config for your tests?
>> 
>> I think it is best to take the kernel and run localyesconfig for testing.
> 
> Ok, agreed - and this makes the numbers you provided pretty representative.
> 
> Good - now that all of my concerns were addressed I'd like to merge the remaining 3 patches as 
> well - but they are conflicting with ongoing x86 work in tip:x86/core. The extable conflict is 
> trivial, the jump-label conflict a bit more involved.
> 
> Could you please pick up the updated changelogs below and resolve the conflicts against 
> tip:master or tip:x86/build and submit the remaining patches as well?

For the record, I summarized my analysis of the poor function inlining
decisions in Linux in the following blog-post:

https://nadav.amit.zone/blog/linux-inline

Regards,
Nadav



* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-07 15:14       ` Segher Boessenkool
@ 2018-10-08  5:58         ` Ingo Molnar
  2018-10-08  7:53           ` Segher Boessenkool
  0 siblings, 1 reply; 116+ messages in thread
From: Ingo Molnar @ 2018-10-08  5:58 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Borislav Petkov, gcc, Richard Biener, Michael Matz, Nadav Amit,
	Ingo Molnar, linux-kernel, x86, Masahiro Yamada, Sam Ravnborg,
	Alok Kataria, Christopher Li, Greg Kroah-Hartman, H. Peter Anvin,
	Jan Beulich, Josh Poimboeuf, Juergen Gross, Kate Stewart,
	Kees Cook, linux-sparse, Peter Zijlstra, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa


* Segher Boessenkool <segher@kernel.crashing.org> wrote:

> > > More precise *size* estimates, yes.  And if the user lies he should not
> > > be surprised to get assembler errors, etc.
> > 
> > Yes.
> > 
> > Another option would be if gcc parses the inline asm directly and
> > does a more precise size estimation. Which is a lot more involved and
> > complicated solution so I guess we wanna look at the simpler ones first.
> > 
> > :-)
> 
> Which is *impossible* to do.  Inline assembler is free-form text.

"Impossible" is false: only under GCC's model and semantics of inline
asm that is, and only under the (false) assumption that the semantics
of the asm statement (which is a GCC extension to begin with) cannot
be changed like it has been changed multiple times in the past.

"Difficult", "not worth our while", perhaps.

Thanks,

	Ingo

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-07 15:53     ` Michael Matz
@ 2018-10-08  6:13       ` Ingo Molnar
  2018-10-08  8:18         ` Segher Boessenkool
  2018-10-08  7:31       ` Segher Boessenkool
  2018-10-08 16:24       ` David Laight
  2 siblings, 1 reply; 116+ messages in thread
From: Ingo Molnar @ 2018-10-08  6:13 UTC (permalink / raw)
  To: Michael Matz
  Cc: Segher Boessenkool, Borislav Petkov, gcc, Richard Biener,
	Nadav Amit, Ingo Molnar, linux-kernel, x86, Masahiro Yamada,
	Sam Ravnborg, Alok Kataria, Christopher Li, Greg Kroah-Hartman,
	H. Peter Anvin, Jan Beulich, Josh Poimboeuf, Juergen Gross,
	Kate Stewart, Kees Cook, linux-sparse, Peter Zijlstra,
	Philippe Ombredanne, Thomas Gleixner, virtualization,
	Linus Torvalds, Chris Zankel, Max Filippov, linux-xtensa


* Michael Matz <matz@suse.de> wrote:

> (without a built-in assembler, which hopefully no one proposes).

There are disadvantages (the main one is having to implement it), but a built-in assembler has 
numerous advantages as well:

 - Better optimizations: for example -Os could more accurately estimate true instruction size.

 - Better inlining: as the examples in this thread are showing.

 - Better padding/alignment: right now GCC has no notion about the precise cache layout of the 
   assembly code it generates and the code alignment options it has are crude. It got away with 
   this so far because the x86 rule of thumb is that dense code is usually the right choice.

 - Better compiler performance: it would be faster as well to immediately emit assembly
   instructions, just like GCC's preprocessor library use speeds up compilation *significantly*
   instead of creating a separate preprocessor task.

 - Better future integration of assembly blocks: GCC could begin to actually understand the 
   assembly statements in inline asm and allow more user-friendly extensions to its 
   historically complex and difficult to master inline asm syntax.

I mean, it's a fact that the GNU project has *already* defined their own assembly syntax which 
departs from decades old platform assembly syntax - and how the assembler is called by the 
compiler is basically an implementation detail, not a conceptual choice. The random 
multi-process unidirectional assembler choice of the past should not be treated as orthodoxy.

Thanks,

	Ingo

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-07 15:53     ` Michael Matz
  2018-10-08  6:13       ` Ingo Molnar
@ 2018-10-08  7:31       ` Segher Boessenkool
  2018-10-08  9:07         ` Richard Biener
  2018-10-08 16:24       ` David Laight
  2 siblings, 1 reply; 116+ messages in thread
From: Segher Boessenkool @ 2018-10-08  7:31 UTC (permalink / raw)
  To: Michael Matz
  Cc: Borislav Petkov, gcc, Richard Biener, Nadav Amit, Ingo Molnar,
	linux-kernel, x86, Masahiro Yamada, Sam Ravnborg, Alok Kataria,
	Christopher Li, Greg Kroah-Hartman, H. Peter Anvin, Jan Beulich,
	Josh Poimboeuf, Juergen Gross, Kate Stewart, Kees Cook,
	linux-sparse, Peter Zijlstra, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa

Hi!

On Sun, Oct 07, 2018 at 03:53:26PM +0000, Michael Matz wrote:
> On Sun, 7 Oct 2018, Segher Boessenkool wrote:
> > On Sun, Oct 07, 2018 at 11:18:06AM +0200, Borislav Petkov wrote:
> > > this is an attempt to see whether gcc's inline asm heuristic when
> > > estimating inline asm statements' cost for better inlining can be
> > > improved.
> > 
> > GCC already estimates the *size* of inline asm, and this is required
> > *for correctness*.  So any workaround that works against this will only
> > end in tears.
> 
> You're right and wrong.  GCC can't even estimate the size of mildly 
> complicated inline asms right now, so your claim of it being necessary for 
> correctness can't be true in this absolute form.  I know what you try to 
> say, but still, consider inline asms like this:
> 
>      insn1
>   .section bla
>      insn2
>   .previous
> 
> or
>    invoke_asm_macro foo,bar
> 
> in both cases GCC's size estimate will be wrong however you want to count
> it.  This is actually the motivating example for the kernel guys, the 
> games they play within their inline asms make the estimates be wildly 
> wrong to a point it interacts with the inliner.

Right.  The manual says:

"""
Some targets require that GCC track the size of each instruction used
in order to generate correct code.  Because the final length of the
code produced by an @code{asm} statement is only known by the
assembler, GCC must make an estimate as to how big it will be.  It
does this by counting the number of instructions in the pattern of the
@code{asm} and multiplying that by the length of the longest
instruction supported by that processor.  (When working out the number
of instructions, it assumes that any occurrence of a newline or of
whatever statement separator character is supported by the assembler --
typically @samp{;} --- indicates the end of an instruction.)

Normally, GCC's estimate is adequate to ensure that correct
code is generated, but it is possible to confuse the compiler if you use
pseudo instructions or assembler macros that expand into multiple real
instructions, or if you use assembler directives that expand to more
space in the object file than is needed for a single instruction.
If this happens then the assembler may produce a diagnostic saying that
a label is unreachable.
"""

It *is* necessary for correctness, except you can do things that can
confuse the compiler and then you are on your own anyway.
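
The counting rule quoted from the manual can be sketched in a few lines.
This is only an illustration of the documented rule, not GCC's source:
asm_insn_count() and asm_size_estimate() are hypothetical names, and the
maximum instruction length is simply passed in by the caller.

```c
#include <stddef.h>

/* Count "instructions" in an asm template as the manual describes:
 * every newline or ';' is taken to end one instruction, so a template
 * with k separators is counted as k + 1 instructions. */
static size_t asm_insn_count(const char *tmpl)
{
	size_t count = 1;

	for (; *tmpl; tmpl++)
		if (*tmpl == '\n' || *tmpl == ';')
			count++;
	return count;
}

/* Pessimistic size: instruction count times the longest instruction
 * the target supports (max_insn_len, in bytes). */
static size_t asm_size_estimate(const char *tmpl, size_t max_insn_len)
{
	return asm_insn_count(tmpl) * max_insn_len;
}
```

For the first quoted example, "insn1\n.section bla\ninsn2\n.previous", this
rule sees four instructions although only insn1 ends up in the current
section. With a 15-byte maximum (the x86 limit) the estimate is 60 bytes.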

> > So I guess the real issue is that the inline asm size estimate for x86 
> > isn't very good (since it has to be pessimistic, and x86 insns can be 
> > huge)?
> 
> No, see above, even if we were to improve the size estimates (e.g. based 
> on some average instruction size) the kernel examples would still be off 
> because they switch sections back and forth, use asm macros and computed 
> .fill directives and maybe further stuff.  GCC will never be able to 
> accurately calculate these sizes

What *is* such a size, anyway?  If it can be spread over multiple sections
(some of which support section merging), and you can have huge alignments,
etc.  What is needed here is not knowing the maximum size of the binary
output (however you want to define that), but some way for the compiler
to understand how bad it is to inline some assembler.  Maybe manual
direction, maybe just the current heuristics can be tweaked a bit, maybe
we need to invent some attribute or two.

> (without a built-in assembler, which hopefully no one proposes).

Not me, that's for sure.

> So, there is a case for extending the inline-asm facility to say 
> "size is complicated here, assume this for inline decisions".

Yeah, that's an option.  It may be too complicated though, or just not
useful in its generality, so that everyone will use "1" (or "1 normal
size instruction"), and then we are better off just making something
for _that_ (or making it the default).

> > > Now, Richard suggested doing something like:
> > > 
> > >  1) inline asm ("...")
> > 
> > What would the semantics of this be?
> 
> The size of the inline asm wouldn't be counted towards the inliner size 
> limits (or be counted as "1").

That sounds like a good option.

> > I don't like 2) either.  But 1) looks interesting, depends what its
> > semantics would be?  "Don't count this insn's size for inlining decisions",
> > maybe?
> 
> TBH, I like the inline asm (...) suggestion most currently, but what if we 
> want to add more attributes to asms?  We could add further special 
> keywords to the clobber list:
>   asm ("...." : : : "cc,memory,inline");
> sure, it might seem strange to "clobber" inline, but if we reinterpret the 
> clobber list as arbitrary set of attributes for this asm, it'd be fine.

All of a target's register names and alternative register names are
allowed in the clobber list.  Will that never conflict with an attribute
name?  We already *have* syntax for specifying attributes on an asm (on
*any* statement even), so mixing these two things has no advantage.

Both "cc" and "memory" have their own problems of course, adding more
things to this just feels bad.  It may not be so bad ;-)

> > Another option is to just force inlining for those few functions where 
> > GCC currently makes an inlining decision you don't like.  Or are there 
> > more than a few?
> 
> I think the examples I saw from Boris were all indirect inlines:
> 
>   static inline void foo() { asm("large-looking-but-small-asm"); }
>   static void bar1() { ... foo() ... }
>   static void bar2() { ... foo() ... }
>   void goo (void) { bar1(); }  // bar1 should have been inlined
> 
> So, while the immediate asm user was marked as always inline that in turn 
> caused users of it to become non-inlined.  I'm assuming the kernel guys 
> did proper measurements that they _really_ get some non-trivial speed 
> benefit by inlining bar1/bar2, but for some reasons (I didn't inquire) 
> didn't want to mark them all as inline as well.

Yeah that makes sense, like if this happens with the fixup stuff, it will
quickly spiral out of control.
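
A compilable version of the quoted indirect-inline pattern might look like
the sketch below. The asm template is deliberately artificial: it contains
only newlines, so it assembles to nothing while still looking like several
instructions to a separator-counting estimate. Function names follow the
quoted example; the int arguments and return values are added here only so
that the result can be checked. Whether bar1() and bar2() actually get
inlined depends on the compiler version and options.

```c
/* Emits no machine code, but the template holds six newlines, so a
 * newline-counting heuristic treats it as seven "instructions". */
static inline __attribute__((always_inline)) void foo(void)
{
	__asm__ volatile("\n\n\n\n\n\n");
}

/* Not marked inline: the compiler decides, and foo()'s inflated cost
 * is charged to these callers once foo() has been inlined into them. */
static int bar1(int x) { foo(); return x + 1; }
static int bar2(int x) { foo(); return x + 2; }

/* Ideally bar1() and bar2() would be inlined here. */
int goo(int x)
{
	return bar1(x) + bar2(x);
}
```

Comparing the output of "gcc -O2 -S" with and without the asm statement is
one way to see whether the estimate changes the inlining decision.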


Segher

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-07 19:06       ` Nadav Amit
  2018-10-07 19:52         ` Jeff Law
@ 2018-10-08  7:46         ` Richard Biener
  1 sibling, 0 replies; 116+ messages in thread
From: Richard Biener @ 2018-10-08  7:46 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Borislav Petkov, gcc, Michael Matz, Ingo Molnar, linux-kernel,
	x86, Masahiro Yamada, Sam Ravnborg, Alok Kataria, Christopher Li,
	Greg Kroah-Hartman, H. Peter Anvin, Jan Beulich, Josh Poimboeuf,
	Juergen Gross, Kate Stewart, Kees Cook, linux-sparse,
	Peter Zijlstra, Philippe Ombredanne, Thomas Gleixner,
	virtualization, Linus Torvalds, Chris Zankel, Max Filippov,
	linux-xtensa

On Sun, 7 Oct 2018, Nadav Amit wrote:

> at 9:46 AM, Richard Biener <rguenther@suse.de> wrote:
> 
> > On October 7, 2018 6:09:30 PM GMT+02:00, Nadav Amit <namit@vmware.com> wrote:
> >> at 2:18 AM, Borislav Petkov <bp@alien8.de> wrote:
> >> 
> >>> Hi people,
> >>> 
> >>> this is an attempt to see whether gcc's inline asm heuristic when
> >>> estimating inline asm statements' cost for better inlining can be
> >>> improved.
> >>> 
> >>> AFAIU, the problem arises when one ends up using a lot of inline
> >>> asm statements in the kernel but due to the inline asm cost
> >> estimation
> >>> heuristic which counts lines, I think, for example like in this here
> >>> macro:
> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/include/asm/cpufeature.h#n162
> >>> the resulting code ends up not inlining the functions themselves
> >> which
> >>> use this macro. I.e., you see a CALL <function> instead of its body
> >>> getting inlined directly.
> >>> 
> >>> Even though it should be because the actual instructions are only a
> >>> couple in most cases and all those other directives end up in another
> >>> section anyway.
> >>> 
> >>> The issue is explained below in the forwarded mail in a larger detail
> >>> too.
> >>> 
> >>> Now, Richard suggested doing something like:
> >>> 
> >>> 1) inline asm ("...")
> >>> 2) asm ("..." : : : : <size-expr>)
> >>> 3) asm ("...") __attribute__((asm_size(<size-expr>)));
> >>> 
> >>> with which user can tell gcc what the size of that inline asm
> >> statement
> >>> is and thus allow for more precise cost estimation and in the end
> >> better
> >>> inlining.
> >>> 
> >>> And FWIW 3) looks pretty straight-forward to me because attributes
> >> are
> >>> pretty common anyways.
> >>> 
> >>> But I'm sure there are other options and I'm sure people will have
> >>> better/different ideas so feel free to chime in.
> >> 
> >> Thanks for taking care of it. I would like to mention a second issue,
> >> since
> >> you may want to resolve both with a single solution: not inlining
> >> conditional __builtin_constant_p(), in which there are two code-paths -
> >> one
> >> for constants and one for variables.
> >> 
> >> Consider for example the Linux kernel ilog2 macro, which has a
> >> condition
> >> based on __builtin_constant_p() (
> >> https://elixir.bootlin.com/linux/v4.19-rc7/source/include/linux/log2.h#L160
> >> ). The compiler mistakenly considers the “heavy” code-path that is
> >> supposed
> >> to be evaluated only in compilation time to evaluate the code size.
> > 
> > But this is a misconception about __builtin_constant_p. It doesn't guard something like 'constexpr' regions. If you try to use it with those semantics you'll fail (apparently you do).
> > 
> > Of course, IPA-CP code size estimates when seeing a constant fed to __builtin_constant_p() might not be optimal, but that is another issue.
> 
> I understand that this might not be the right way to implement macros
> such as ilog2() and test_bit(), but this code has been around for some time.
> 
> I thought of using __builtin_choose_expr() instead of ternary operator, but
> this introduces a different problem, as the variable version is used instead
> of the constant one in many cases. From my brief experiments with llvm, it
> appears that llvm does not have both of these issues (wrong cost attributed
> to inline asm and conditions based on __builtin_constant_p()).
> 
> So what alternative do you propose to implement ilog2() like behavior? I was
> digging through the gcc code to find a workaround with no success.

1) Don't try to cheat the compiler's constant propagation abilities
2) Use a language that allows this (C++)
3) Define (and implement) the corresponding GNU C extension

__builtin_constant_p() isn't the right fit (I wonder what it was
implemented for in the first place though...).

I suppose you want something like

 if (__builtin_constant_p (x))
   return __constexpr ...;

or use a call and have constexpr functions.  Note it wouldn't be
C++-constexpr like since you want the constexpr evaluation to
happen very late in the compilation to benefit from optimizations
and you are fine with the non-constexpr path.

Properly defining a language extension is hard.
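
As a rough illustration of option 1), a single code path can be written
with a builtin that the compiler is able to fold by itself once the
argument turns out to be constant. plain_ilog2() and table_index_bits() are
hypothetical names, and the sketch assumes a GCC-compatible compiler with a
32-bit unsigned int.

```c
#include <stdint.h>

/* One path for constant and variable arguments alike, with no
 * __builtin_constant_p() test. __builtin_clz(0) is undefined, so
 * callers must pass n != 0. */
static inline int plain_ilog2(uint32_t n)
{
	return 31 - __builtin_clz(n);
}

/* Example caller: number of bits needed to index a table whose size
 * (entries) is a power of two. */
static inline int table_index_bits(uint32_t entries)
{
	return plain_ilog2(entries);
}
```

Unlike a macro built on a comparison chain, the result here is not an
integer constant expression, so it cannot be used where the language
requires one, for example to size a static array.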

Richard.

> 
> Thanks,
> Nadav
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-08  5:58         ` Ingo Molnar
@ 2018-10-08  7:53           ` Segher Boessenkool
  0 siblings, 0 replies; 116+ messages in thread
From: Segher Boessenkool @ 2018-10-08  7:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Borislav Petkov, gcc, Richard Biener, Michael Matz, Nadav Amit,
	Ingo Molnar, linux-kernel, x86, Masahiro Yamada, Sam Ravnborg,
	Alok Kataria, Christopher Li, Greg Kroah-Hartman, H. Peter Anvin,
	Jan Beulich, Josh Poimboeuf, Juergen Gross, Kate Stewart,
	Kees Cook, linux-sparse, Peter Zijlstra, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa

On Mon, Oct 08, 2018 at 07:58:38AM +0200, Ingo Molnar wrote:
> * Segher Boessenkool <segher@kernel.crashing.org> wrote:
> > > > More precise *size* estimates, yes.  And if the user lies he should not
> > > > be surprised to get assembler errors, etc.
> > > 
> > > Yes.
> > > 
> > > Another option would be if gcc parses the inline asm directly and
> > > does a more precise size estimation. Which is a lot more involved and
> > > complicated solution so I guess we wanna look at the simpler ones first.
> > > 
> > > :-)
> > 
> > Which is *impossible* to do.  Inline assembler is free-form text.
> 
> "Impossible" is false: only under GCC's model and semantics of inline
> asm that is, and only under the (false) assumption that the semantics
> of the asm statement (which is a GCC extension to begin with) cannot
> be changed like it has been changed multiple times in the past.
> 
> "Difficult", "not worth our while", perhaps.

If we throw out our current definition of inline assembler, and of the
internal backend interfaces, then sure you can do it.  This of course
invalidates all code that uses GCC inline assembler, and all GCC backends
(both in-tree and out-of-tree, both current and historical).

If other compilers think everyone should rewrite all of their code because
those compilers do inline asm wro^H^H^Hdifferently, that is their problem;
GCC should not deny all history and screw over all its users.


Segher

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-08  6:13       ` Ingo Molnar
@ 2018-10-08  8:18         ` Segher Boessenkool
  0 siblings, 0 replies; 116+ messages in thread
From: Segher Boessenkool @ 2018-10-08  8:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Michael Matz, Borislav Petkov, gcc, Richard Biener, Nadav Amit,
	Ingo Molnar, linux-kernel, x86, Masahiro Yamada, Sam Ravnborg,
	Alok Kataria, Christopher Li, Greg Kroah-Hartman, H. Peter Anvin,
	Jan Beulich, Josh Poimboeuf, Juergen Gross, Kate Stewart,
	Kees Cook, linux-sparse, Peter Zijlstra, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa

On Mon, Oct 08, 2018 at 08:13:23AM +0200, Ingo Molnar wrote:
> * Michael Matz <matz@suse.de> wrote:
> > (without a built-in assembler which hopefully no one proposes).
> There are disadvantages (the main one is having to implement it), but a built-in assembler has 
> numerous advantages as well:
> 
>  - Better optimizations: for example -Os could more accurately estimate true instruction size.

GCC already knows the exact instruction size in almost all cases, for most
backends.  It is an advantage to not *have to* keep track of exact insn
size in all cases.

>  - Better inlining: as the examples in this thread are showing.

That's a red herring, the actual issue is inlining makes some spectacularly
wrong decisions for code involving asm currently.  That would not be solved
by outputting binary code instead of assembler code.  All those decisions
are done long before code is output.

>  - Better padding/alignment: right now GCC has no notion about the precise cache layout of the 
>    assembly code it generates and the code alignment options it has are crude.

This isn't true.  Maybe some targets do not care.

And of course GCC only knows this as far as it knows the alignments of the
sections it outputs into; for example, ASLR is a nice performance killer
at times.  And if your linker scripts align sections to less than a cache
line things do not look rosy either, etc.

> It got away with 
>    this so far because the x86 rule of thumb is that dense code is usually the right choice.

>  - Better compiler performance: it would be faster as well to immediately emit assembly
>    instructions, just like GCC's preprocessor library use speeds up compilation *significantly*
>    instead of creating a separate preprocessor task.

So how much faster do you think it would be?  Do you have actual numbers?
And no, -O0 does not count.

>  - Better future integration of assembly blocks: GCC could begin to actually understand the 
>    assembly statements in inline asm and allow more user-friendly extensions to its 
>    historically complex and difficult to master inline asm syntax.

If you want to add a different kind of inline assembler, you can already.
There is no need for outputting binary code for that.

> I mean, it's a fact that the GNU project has *already* defined their own assembly syntax which 
> departs from decades old platform assembly syntax

GCC's asm syntax is over three decades old itself.  When it was defined
all other C inline assembler syntaxes were much younger than that.  I don't
see what your argument is here.

> - and how the assembler is called by the 
> compiler is basically an implementation detail, not a conceptual choice.

But the *format* of the interface representation, in this case textual
assembler code, is quite fundamental.

> The random 
> multi-process unidirectional assembler choice of the past should not be treated as orthodoxy.

Then I have good news for you: no assembler works that way these days, and
that has been true for decades.


Segher

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-08  7:31       ` Segher Boessenkool
@ 2018-10-08  9:07         ` Richard Biener
  2018-10-08 10:02           ` Segher Boessenkool
  2018-10-09 14:53           ` Segher Boessenkool
  0 siblings, 2 replies; 116+ messages in thread
From: Richard Biener @ 2018-10-08  9:07 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Michael Matz, Borislav Petkov, gcc, Nadav Amit, Ingo Molnar,
	linux-kernel, x86, Masahiro Yamada, Sam Ravnborg, Alok Kataria,
	Christopher Li, Greg Kroah-Hartman, H. Peter Anvin, Jan Beulich,
	Josh Poimboeuf, Juergen Gross, Kate Stewart, Kees Cook,
	linux-sparse, Peter Zijlstra, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa

On Mon, 8 Oct 2018, Segher Boessenkool wrote:

> Hi!
> 
> On Sun, Oct 07, 2018 at 03:53:26PM +0000, Michael Matz wrote:
> > On Sun, 7 Oct 2018, Segher Boessenkool wrote:
> > > On Sun, Oct 07, 2018 at 11:18:06AM +0200, Borislav Petkov wrote:
> > > > this is an attempt to see whether gcc's inline asm heuristic when
> > > > estimating inline asm statements' cost for better inlining can be
> > > > improved.
> > > 
> > > GCC already estimates the *size* of inline asm, and this is required
> > > *for correctness*.  So any workaround that works against this will only
> > > end in tears.
> > 
> > You're right and wrong.  GCC can't even estimate the size of mildly 
> > complicated inline asms right now, so your claim of it being necessary for 
> > correctness can't be true in this absolute form.  I know what you try to 
> > say, but still, consider inline asms like this:
> > 
> >      insn1
> >   .section bla
> >      insn2
> >   .previous
> > 
> > or
> >    invoke_asm_macro foo,bar
> > 
> > in both cases GCC's size estimate will be wrong however you want to count 
> > it.  This is actually the motivating example for the kernel guys, the 
> > games they play within their inline asms make the estimates be wildly 
> > wrong to a point it interacts with the inliner.
> 
> Right.  The manual says:
> 
> """
> Some targets require that GCC track the size of each instruction used
> in order to generate correct code.  Because the final length of the
> code produced by an @code{asm} statement is only known by the
> assembler, GCC must make an estimate as to how big it will be.  It
> does this by counting the number of instructions in the pattern of the
> @code{asm} and multiplying that by the length of the longest
> instruction supported by that processor.  (When working out the number
> of instructions, it assumes that any occurrence of a newline or of
> whatever statement separator character is supported by the assembler --
> typically @samp{;} --- indicates the end of an instruction.)
> 
> Normally, GCC's estimate is adequate to ensure that correct
> code is generated, but it is possible to confuse the compiler if you use
> pseudo instructions or assembler macros that expand into multiple real
> instructions, or if you use assembler directives that expand to more
> space in the object file than is needed for a single instruction.
> If this happens then the assembler may produce a diagnostic saying that
> a label is unreachable.
> """
> 
> It *is* necessary for correctness, except you can do things that can
> confuse the compiler and then you are on your own anyway.
> 
> > > So I guess the real issue is that the inline asm size estimate for x86 
> > > isn't very good (since it has to be pessimistic, and x86 insns can be 
> > > huge)?
> > 
> > No, see above, even if we were to improve the size estimates (e.g. based 
> > on some average instruction size) the kernel examples would still be off 
> > because they switch sections back and forth, use asm macros and computed 
> > .fill directives and maybe further stuff.  GCC will never be able to 
> > accurately calculate these sizes
> 
> What *is* such a size, anyway?  If it can be spread over multiple sections
> (some of which support section merging), and you can have huge alignments,
> etc.  What is needed here is not knowing the maximum size of the binary
> output (however you want to define that), but some way for the compiler
> to understand how bad it is to inline some assembler.  Maybe manual
> direction, maybe just the current heuristics can be tweaked a bit, maybe
> we need to invent some attribute or two.
> 
> > (without a built-in assembler which hopefully no one proposes).
> 
> Not me, that's for sure.
> 
> > So, there is a case for extending the inline-asm facility to say 
> > "size is complicated here, assume this for inline decisions".
> 
> Yeah, that's an option.  It may be too complicated though, or just not
> useful in its generality, so that everyone will use "1" (or "1 normal
> size instruction"), and then we are better off just making something
> for _that_ (or making it the default).
> 
> > > > Now, Richard suggested doing something like:
> > > > 
> > > >  1) inline asm ("...")
> > > 
> > > What would the semantics of this be?
> > 
> > The size of the inline asm wouldn't be counted towards the inliner size 
> > limits (or be counted as "1").
> 
> That sounds like a good option.

Yes, I also like it for simplicity.  It also avoids the requirement
of translating the number (in bytes?) given by the user to
"number of GIMPLE instructions" as needed by the inliner.

> > > I don't like 2) either.  But 1) looks interesting, depends what its
> > > semantics would be?  "Don't count this insn's size for inlining decisions",
> > > maybe?
> > 
> > TBH, I like the inline asm (...) suggestion most currently, but what if we 
> > want to add more attributes to asms?  We could add further special 
> > keywords to the clobber list:
> >   asm ("...." : : : "cc,memory,inline");
> > sure, it might seem strange to "clobber" inline, but if we reinterpret the 
> > clobber list as arbitrary set of attributes for this asm, it'd be fine.
> 
> All of a target's register names and alternative register names are
> allowed in the clobber list.  Will that never conflict with an attribute
> name?  We already *have* syntax for specifying attributes on an asm (on
> *any* statement even), so mixing these two things has no advantage.

Heh, but I failed to make an example with attribute syntax working.
IIRC attributes do not work on stmts.  What could work is to use
a #pragma though.

Richard.

> Both "cc" and "memory" have their own problems of course, adding more
> things to this just feels bad.  It may not be so bad ;-)
> 
> > > Another option is to just force inlining for those few functions where 
> > > GCC currently makes an inlining decision you don't like.  Or are there 
> > > more than a few?
> > 
> > I think the examples I saw from Boris were all indirect inlines:
> > 
> >   static inline void foo() { asm("large-looking-but-small-asm"); }
> >   static void bar1() { ... foo() ... }
> >   static void bar2() { ... foo() ... }
> >   void goo (void) { bar1(); }  // bar1 should have been inlined
> > 
> > So, while the immediate asm user was marked as always inline that in turn 
> > caused users of it to become non-inlined.  I'm assuming the kernel guys 
> > did proper measurements that they _really_ get some non-trivial speed 
> > benefit by inlining bar1/bar2, but for some reasons (I didn't inquire) 
> > didn't want to mark them all as inline as well.
> 
> Yeah that makes sense, like if this happens with the fixup stuff, it will
> quickly spiral out of control.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-08  9:07         ` Richard Biener
@ 2018-10-08 10:02           ` Segher Boessenkool
  2018-10-09 14:53           ` Segher Boessenkool
  1 sibling, 0 replies; 116+ messages in thread
From: Segher Boessenkool @ 2018-10-08 10:02 UTC (permalink / raw)
  To: Richard Biener
  Cc: Michael Matz, Borislav Petkov, gcc, Nadav Amit, Ingo Molnar,
	linux-kernel, x86, Masahiro Yamada, Sam Ravnborg, Alok Kataria,
	Christopher Li, Greg Kroah-Hartman, H. Peter Anvin, Jan Beulich,
	Josh Poimboeuf, Juergen Gross, Kate Stewart, Kees Cook,
	linux-sparse, Peter Zijlstra, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa

On Mon, Oct 08, 2018 at 11:07:46AM +0200, Richard Biener wrote:
> > All of a target's register names and alternative register names are
> > allowed in the clobber list.  Will that never conflict with an attribute
> > name?  We already *have* syntax for specifying attributes on an asm (on
> > *any* statement even), so mixing these two things has no advantage.
> 
> Heh, but I failed to make an example with attribute syntax working.
> IIRC attributes do not work on stmts.  What could work is to use
> a #pragma though.

Apparently statement attributes currently(?) only work for null statements.
Oh well.
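
For illustration, the null-statement case looks like this.  This is only a
minimal sketch, assuming GCC 7 or later (where the "fallthrough" statement
attribute is accepted); classify() is a made-up name, not something from
this thread:

```c
/* Hypothetical example: the attribute is attached to a null statement
   (a lone ";"), the one statement form that takes attributes here.  */
static int classify(int n)
{
    int score = 0;

    switch (n) {
    case 0:
        score += 1;
        /* Attribute on a null statement: marks the fall-through
           into "case 1" as intentional.  */
        __attribute__((fallthrough));
    case 1:
        score += 10;
        break;
    default:
        score = -1;
        break;
    }
    return score;
}
```

Putting an attribute in front of an asm statement in the same way is what
did not work in the experiment described above.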


Segher

^ permalink raw reply	[flat|nested] 116+ messages in thread

* RE: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-07 15:53     ` Michael Matz
  2018-10-08  6:13       ` Ingo Molnar
  2018-10-08  7:31       ` Segher Boessenkool
@ 2018-10-08 16:24       ` David Laight
  2 siblings, 0 replies; 116+ messages in thread
From: David Laight @ 2018-10-08 16:24 UTC (permalink / raw)
  To: 'Michael Matz', Segher Boessenkool
  Cc: Borislav Petkov, gcc, Richard Biener, Nadav Amit, Ingo Molnar,
	linux-kernel, x86, Masahiro Yamada, Sam Ravnborg, Alok Kataria,
	Christopher Li, Greg Kroah-Hartman, H. Peter Anvin, Jan Beulich,
	Josh Poimboeuf, Juergen Gross, Kate Stewart, Kees Cook,
	linux-sparse, Peter Zijlstra, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa

From: Michael Matz
> Sent: 07 October 2018 16:53
...
> I think the examples I saw from Boris were all indirect inlines:
> 
>   static inline void foo() { asm("large-looking-but-small-asm"); }
>   static void bar1() { ... foo() ... }
>   static void bar2() { ... foo() ... }
>   void goo (void) { bar1(); }  // bar1 should have been inlined
> 
> So, while the immediate asm user was marked as always inline that in turn
> caused users of it to become non-inlined.  I'm assuming the kernel guys
> did proper measurements that they _really_ get some non-trivial speed
> benefit by inlining bar1/bar2, but for some reasons (I didn't inquire)
> didn't want to mark them all as inline as well.

Could you add a 'size' attribute to the 'always inlined' foo() above
rather than trying to add one to the asm() statement itself?
Then add a warning in the documentation that small size attributes
might make the assembly fail due to limited branch offsets (etc).

Size '1' probably ought to be reserved for things that definitely
fit in a delay slot.

	David



^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-08  9:07         ` Richard Biener
  2018-10-08 10:02           ` Segher Boessenkool
@ 2018-10-09 14:53           ` Segher Boessenkool
  2018-10-10  6:35             ` Ingo Molnar
                               ` (3 more replies)
  1 sibling, 4 replies; 116+ messages in thread
From: Segher Boessenkool @ 2018-10-09 14:53 UTC (permalink / raw)
  To: Richard Biener
  Cc: Michael Matz, Borislav Petkov, gcc, Nadav Amit, Ingo Molnar,
	linux-kernel, x86, Masahiro Yamada, Sam Ravnborg, Alok Kataria,
	Christopher Li, Greg Kroah-Hartman, H. Peter Anvin, Jan Beulich,
	Josh Poimboeuf, Juergen Gross, Kate Stewart, Kees Cook,
	linux-sparse, Peter Zijlstra, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa

On Mon, Oct 08, 2018 at 11:07:46AM +0200, Richard Biener wrote:
> On Mon, 8 Oct 2018, Segher Boessenkool wrote:
> > On Sun, Oct 07, 2018 at 03:53:26PM +0000, Michael Matz wrote:
> > > On Sun, 7 Oct 2018, Segher Boessenkool wrote:
> > > > On Sun, Oct 07, 2018 at 11:18:06AM +0200, Borislav Petkov wrote:
> > > > > Now, Richard suggested doing something like:
> > > > > 
> > > > >  1) inline asm ("...")
> > > > 
> > > > What would the semantics of this be?
> > > 
> > > The size of the inline asm wouldn't be counted towards the inliner size 
> > > limits (or be counted as "1").
> > 
> > That sounds like a good option.
> 
> Yes, I also like it for simplicity.  It also avoids the requirement
> of translating the number (in bytes?) given by the user to
> "number of GIMPLE instructions" as needed by the inliner.

This patch implements this, for C only so far.  And the syntax is
"asm inline", which is more in line with other syntax.

How does this look?


Segher



diff --git a/gcc/c/c-parser.c b/gcc/c/c-parser.c
index 1f173fc..94b1d41 100644
--- a/gcc/c/c-parser.c
+++ b/gcc/c/c-parser.c
@@ -6287,7 +6287,7 @@ static tree
 c_parser_asm_statement (c_parser *parser)
 {
   tree quals, str, outputs, inputs, clobbers, labels, ret;
-  bool simple, is_goto;
+  bool simple, is_goto, is_inline;
   location_t asm_loc = c_parser_peek_token (parser)->location;
   int section, nsections;
 
@@ -6318,6 +6318,13 @@ c_parser_asm_statement (c_parser *parser)
       is_goto = true;
     }
 
+  is_inline = false;
+  if (c_parser_next_token_is_keyword (parser, RID_INLINE))
+    {
+      c_parser_consume_token (parser);
+      is_inline = true;
+    }
+
   /* ??? Follow the C++ parser rather than using the
      lex_untranslated_string kludge.  */
   parser->lex_untranslated_string = true;
@@ -6393,7 +6400,8 @@ c_parser_asm_statement (c_parser *parser)
     c_parser_skip_to_end_of_block_or_statement (parser);
 
   ret = build_asm_stmt (quals, build_asm_expr (asm_loc, str, outputs, inputs,
-					       clobbers, labels, simple));
+					       clobbers, labels, simple,
+					       is_inline));
 
  error:
   parser->lex_untranslated_string = false;
diff --git a/gcc/c/c-tree.h b/gcc/c/c-tree.h
index 017c01c..f5629300 100644
--- a/gcc/c/c-tree.h
+++ b/gcc/c/c-tree.h
@@ -677,7 +677,8 @@ extern tree build_compound_literal (location_t, tree, tree, bool,
 extern void check_compound_literal_type (location_t, struct c_type_name *);
 extern tree c_start_case (location_t, location_t, tree, bool);
 extern void c_finish_case (tree, tree);
-extern tree build_asm_expr (location_t, tree, tree, tree, tree, tree, bool);
+extern tree build_asm_expr (location_t, tree, tree, tree, tree, tree, bool,
+			    bool);
 extern tree build_asm_stmt (tree, tree);
 extern int c_types_compatible_p (tree, tree);
 extern tree c_begin_compound_stmt (bool);
diff --git a/gcc/c/c-typeck.c b/gcc/c/c-typeck.c
index 9d09b8d..e013100 100644
--- a/gcc/c/c-typeck.c
+++ b/gcc/c/c-typeck.c
@@ -10064,7 +10064,7 @@ build_asm_stmt (tree cv_qualifier, tree args)
    are subtly different.  We use a ASM_EXPR node to represent this.  */
 tree
 build_asm_expr (location_t loc, tree string, tree outputs, tree inputs,
-		tree clobbers, tree labels, bool simple)
+		tree clobbers, tree labels, bool simple, bool is_inline)
 {
   tree tail;
   tree args;
@@ -10182,6 +10182,7 @@ build_asm_expr (location_t loc, tree string, tree outputs, tree inputs,
      as volatile.  */
   ASM_INPUT_P (args) = simple;
   ASM_VOLATILE_P (args) = (noutputs == 0);
+  ASM_INLINE_P (args) = is_inline;
 
   return args;
 }
diff --git a/gcc/gimple-pretty-print.c b/gcc/gimple-pretty-print.c
index 83e2273..1a00fa3 100644
--- a/gcc/gimple-pretty-print.c
+++ b/gcc/gimple-pretty-print.c
@@ -2019,6 +2019,8 @@ dump_gimple_asm (pretty_printer *buffer, gasm *gs, int spc, dump_flags_t flags)
       pp_string (buffer, "__asm__");
       if (gimple_asm_volatile_p (gs))
 	pp_string (buffer, " __volatile__");
+      if (gimple_asm_inline_p (gs))
+	pp_string (buffer, " __inline__");
       if (gimple_asm_nlabels (gs))
 	pp_string (buffer, " goto");
       pp_string (buffer, "(\"");
diff --git a/gcc/gimple.h b/gcc/gimple.h
index a5dda93..8a58e07 100644
--- a/gcc/gimple.h
+++ b/gcc/gimple.h
@@ -137,6 +137,7 @@ enum gimple_rhs_class
 enum gf_mask {
     GF_ASM_INPUT		= 1 << 0,
     GF_ASM_VOLATILE		= 1 << 1,
+    GF_ASM_INLINE		= 1 << 2,
     GF_CALL_FROM_THUNK		= 1 << 0,
     GF_CALL_RETURN_SLOT_OPT	= 1 << 1,
     GF_CALL_TAILCALL		= 1 << 2,
@@ -3911,7 +3912,7 @@ gimple_asm_volatile_p (const gasm *asm_stmt)
 }
 
 
-/* If VOLATLE_P is true, mark asm statement ASM_STMT as volatile.  */
+/* If VOLATILE_P is true, mark asm statement ASM_STMT as volatile.  */
 
 static inline void
 gimple_asm_set_volatile (gasm *asm_stmt, bool volatile_p)
@@ -3923,6 +3924,27 @@ gimple_asm_set_volatile (gasm *asm_stmt, bool volatile_p)
 }
 
 
+/* Return true if ASM_STMT is an asm statement marked inline.  */
+
+static inline bool
+gimple_asm_inline_p (const gasm *asm_stmt)
+{
+  return (asm_stmt->subcode & GF_ASM_INLINE) != 0;
+}
+
+
+/* If INLINE_P is true, mark asm statement ASM_STMT as inline.  */
+
+static inline void
+gimple_asm_set_inline (gasm *asm_stmt, bool inline_p)
+{
+  if (inline_p)
+    asm_stmt->subcode |= GF_ASM_INLINE;
+  else
+    asm_stmt->subcode &= ~GF_ASM_INLINE;
+}
+
+
 /* If INPUT_P is true, mark asm ASM_STMT as an ASM_INPUT.  */
 
 static inline void
diff --git a/gcc/gimplify.c b/gcc/gimplify.c
index 509fc2f..10b80f2 100644
--- a/gcc/gimplify.c
+++ b/gcc/gimplify.c
@@ -6315,6 +6315,7 @@ gimplify_asm_expr (tree *expr_p, gimple_seq *pre_p, gimple_seq *post_p)
 
       gimple_asm_set_volatile (stmt, ASM_VOLATILE_P (expr) || noutputs == 0);
       gimple_asm_set_input (stmt, ASM_INPUT_P (expr));
+      gimple_asm_set_inline (stmt, ASM_INLINE_P (expr));
 
       gimplify_seq_add_stmt (pre_p, stmt);
     }
diff --git a/gcc/ipa-icf-gimple.c b/gcc/ipa-icf-gimple.c
index ba39ea3..5361139 100644
--- a/gcc/ipa-icf-gimple.c
+++ b/gcc/ipa-icf-gimple.c
@@ -993,6 +993,9 @@ func_checker::compare_gimple_asm (const gasm *g1, const gasm *g2)
   if (gimple_asm_input_p (g1) != gimple_asm_input_p (g2))
     return false;
 
+  if (gimple_asm_inline_p (g1) != gimple_asm_inline_p (g2))
+    return false;
+
   if (gimple_asm_ninputs (g1) != gimple_asm_ninputs (g2))
     return false;
 
diff --git a/gcc/tree-inline.c b/gcc/tree-inline.c
index 9134253..6b1d2ea 100644
--- a/gcc/tree-inline.c
+++ b/gcc/tree-inline.c
@@ -4108,6 +4108,8 @@ estimate_num_insns (gimple *stmt, eni_weights *weights)
 	   with very long asm statements.  */
 	if (count > 1000)
 	  count = 1000;
+	if (gimple_asm_inline_p (as_a <gasm *> (stmt)))
+	  count = !!count;
 	return MAX (1, count);
       }
 
diff --git a/gcc/tree.h b/gcc/tree.h
index 2e45f9d..160b3a7 100644
--- a/gcc/tree.h
+++ b/gcc/tree.h
@@ -1245,6 +1245,9 @@ extern tree maybe_wrap_with_location (tree, location_t);
    ASM_OPERAND with no operands.  */
 #define ASM_INPUT_P(NODE) (ASM_EXPR_CHECK (NODE)->base.static_flag)
 #define ASM_VOLATILE_P(NODE) (ASM_EXPR_CHECK (NODE)->base.public_flag)
+/* Nonzero if we want to consider this asm as minimum length and cost
+   for inlining decisions.  */
+#define ASM_INLINE_P(NODE) (ASM_EXPR_CHECK (NODE)->base.protected_flag)
 
 /* COND_EXPR accessors.  */
 #define COND_EXPR_COND(NODE)	(TREE_OPERAND (COND_EXPR_CHECK (NODE), 0))
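
The effect of the tree-inline.c hunk above can be modelled outside of GCC
roughly as follows.  This is a simplified sketch, not GCC code: the names
model_asm_str_count() and model_asm_cost() are made up, and it assumes that
';' and newline are the only statement separators and that a non-empty
template counts as at least one instruction.

```c
#include <stdbool.h>

/* Hypothetical model of the separator-counting heuristic: one
   "instruction" per logical line of the asm template.  */
static int model_asm_str_count(const char *templ)
{
    int count;

    if (*templ == '\0')
        return 0;

    count = 1;
    for (; *templ != '\0'; templ++)
        if (*templ == ';' || *templ == '\n')
            count++;
    return count;
}

/* Hypothetical model of the inliner cost: cap the count at 1000,
   collapse it to 0 or 1 for "asm inline", never return less than 1.  */
static int model_asm_cost(const char *templ, bool is_inline)
{
    int count = model_asm_str_count(templ);

    if (count > 1000)
        count = 1000;
    if (is_inline)
        count = !!count;
    return count > 1 ? count : 1;
}
```

Under these assumptions a template like "insn1\n.section bla\ninsn2\n.previous"
is costed as 4 without the qualifier and as 1 with it, even though only one
of its lines lands in the current section.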

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-09 14:53           ` Segher Boessenkool
@ 2018-10-10  6:35             ` Ingo Molnar
  2018-10-10  7:12             ` Richard Biener
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 116+ messages in thread
From: Ingo Molnar @ 2018-10-10  6:35 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Richard Biener, Michael Matz, Borislav Petkov, gcc, Nadav Amit,
	Ingo Molnar, linux-kernel, x86, Masahiro Yamada, Sam Ravnborg,
	Alok Kataria, Christopher Li, Greg Kroah-Hartman, H. Peter Anvin,
	Jan Beulich, Josh Poimboeuf, Juergen Gross, Kate Stewart,
	Kees Cook, linux-sparse, Peter Zijlstra, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa


* Segher Boessenkool <segher@kernel.crashing.org> wrote:

> On Mon, Oct 08, 2018 at 11:07:46AM +0200, Richard Biener wrote:
> > On Mon, 8 Oct 2018, Segher Boessenkool wrote:
> > > On Sun, Oct 07, 2018 at 03:53:26PM +0000, Michael Matz wrote:
> > > > On Sun, 7 Oct 2018, Segher Boessenkool wrote:
> > > > > On Sun, Oct 07, 2018 at 11:18:06AM +0200, Borislav Petkov wrote:
> > > > > > Now, Richard suggested doing something like:
> > > > > > 
> > > > > >  1) inline asm ("...")
> > > > > 
> > > > > What would the semantics of this be?
> > > > 
> > > > The size of the inline asm wouldn't be counted towards the inliner size 
> > > > limits (or be counted as "1").
> > > 
> > > That sounds like a good option.
> > 
> > Yes, I also like it for simplicity.  It also avoids the requirement
> > of translating the number (in bytes?) given by the user to
> > "number of GIMPLE instructions" as needed by the inliner.
> 
> This patch implements this, for C only so far.  And the syntax is
> "asm inline", which is more in line with other syntax.
> 
> How does this look?

Cool, thanks for implementing this!

In the kernel we'd likely wrap this in some "asm_inline()" type of construct to be
compatible with older toolchains and other compilers.
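
Such a wrapper could look roughly like the sketch below.  It is only an
illustration: HAVE_ASM_INLINE is a hypothetical feature macro that a build
system would define after probing the compiler, and opaque_add_one() is a
made-up user of the wrapper.

```c
/* HAVE_ASM_INLINE is a hypothetical macro, assumed to be set by the
   build system only when the compiler accepts "asm inline".  */
#ifdef HAVE_ASM_INLINE
#define asm_inline __asm__ __inline__
#else
/* Older toolchains and other compilers: plain asm, unchanged costing.  */
#define asm_inline __asm__
#endif

/* Made-up example user: the empty asm only makes x opaque to the
   optimizer, so the code builds on any GCC-compatible target.  */
static inline int opaque_add_one(int x)
{
    asm_inline ("" : "+r" (x));
    return x + 1;
}
```

Call sites then use asm_inline() everywhere and get the cheaper inlining
cost only where the compiler supports it.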

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-09 14:53           ` Segher Boessenkool
  2018-10-10  6:35             ` Ingo Molnar
@ 2018-10-10  7:12             ` Richard Biener
  2018-10-10  7:22               ` Ingo Molnar
  2018-10-10  7:53               ` Segher Boessenkool
  2018-10-10 16:31             ` Nadav Amit
  2018-11-29 11:46             ` Masahiro Yamada
  3 siblings, 2 replies; 116+ messages in thread
From: Richard Biener @ 2018-10-10  7:12 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Michael Matz, Borislav Petkov, gcc, Nadav Amit, Ingo Molnar,
	linux-kernel, x86, Masahiro Yamada, Sam Ravnborg, Alok Kataria,
	Christopher Li, Greg Kroah-Hartman, H. Peter Anvin, Jan Beulich,
	Josh Poimboeuf, Juergen Gross, Kate Stewart, Kees Cook,
	linux-sparse, Peter Zijlstra, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa

On Tue, 9 Oct 2018, Segher Boessenkool wrote:

> On Mon, Oct 08, 2018 at 11:07:46AM +0200, Richard Biener wrote:
> > On Mon, 8 Oct 2018, Segher Boessenkool wrote:
> > > On Sun, Oct 07, 2018 at 03:53:26PM +0000, Michael Matz wrote:
> > > > On Sun, 7 Oct 2018, Segher Boessenkool wrote:
> > > > > On Sun, Oct 07, 2018 at 11:18:06AM +0200, Borislav Petkov wrote:
> > > > > > Now, Richard suggested doing something like:
> > > > > > 
> > > > > >  1) inline asm ("...")
> > > > > 
> > > > > What would the semantics of this be?
> > > > 
> > > > The size of the inline asm wouldn't be counted towards the inliner size 
> > > > limits (or be counted as "1").
> > > 
> > > That sounds like a good option.
> > 
> > Yes, I also like it for simplicity.  It also avoids the requirement
> > of translating the number (in bytes?) given by the user to
> > "number of GIMPLE instructions" as needed by the inliner.
> 
> This patch implements this, for C only so far.  And the syntax is
> "asm inline", which is more in line with other syntax.
> 
> How does this look?

Looks good.  A few nits - you need to document this in extend.texi, the
tree flag use needs documenting in tree-core.h, and we need a testcase
(I'd suggest one that shows we inline a function with "large" asm inline 
() even at -Os).

Oh, and I don't think we want C and C++ to diverge - so you need to
cook up C++ support as well.

Can kernel folks give this a second and third thought please so we
don't implement something that in the end won't satisfy you guys?

Thanks for doing this,
Richard.

> 
> Segher
> 
> 
> 
> diff --git a/gcc/c/c-parser.c b/gcc/c/c-parser.c
> index 1f173fc..94b1d41 100644
> --- a/gcc/c/c-parser.c
> +++ b/gcc/c/c-parser.c
> @@ -6287,7 +6287,7 @@ static tree
>  c_parser_asm_statement (c_parser *parser)
>  {
>    tree quals, str, outputs, inputs, clobbers, labels, ret;
> -  bool simple, is_goto;
> +  bool simple, is_goto, is_inline;
>    location_t asm_loc = c_parser_peek_token (parser)->location;
>    int section, nsections;
>  
> @@ -6318,6 +6318,13 @@ c_parser_asm_statement (c_parser *parser)
>        is_goto = true;
>      }
>  
> +  is_inline = false;
> +  if (c_parser_next_token_is_keyword (parser, RID_INLINE))
> +    {
> +      c_parser_consume_token (parser);
> +      is_inline = true;
> +    }
> +
>    /* ??? Follow the C++ parser rather than using the
>       lex_untranslated_string kludge.  */
>    parser->lex_untranslated_string = true;
> @@ -6393,7 +6400,8 @@ c_parser_asm_statement (c_parser *parser)
>      c_parser_skip_to_end_of_block_or_statement (parser);
>  
>    ret = build_asm_stmt (quals, build_asm_expr (asm_loc, str, outputs, inputs,
> -					       clobbers, labels, simple));
> +					       clobbers, labels, simple,
> +					       is_inline));
>  
>   error:
>    parser->lex_untranslated_string = false;
> diff --git a/gcc/c/c-tree.h b/gcc/c/c-tree.h
> index 017c01c..f5629300 100644
> --- a/gcc/c/c-tree.h
> +++ b/gcc/c/c-tree.h
> @@ -677,7 +677,8 @@ extern tree build_compound_literal (location_t, tree, tree, bool,
>  extern void check_compound_literal_type (location_t, struct c_type_name *);
>  extern tree c_start_case (location_t, location_t, tree, bool);
>  extern void c_finish_case (tree, tree);
> -extern tree build_asm_expr (location_t, tree, tree, tree, tree, tree, bool);
> +extern tree build_asm_expr (location_t, tree, tree, tree, tree, tree, bool,
> +			    bool);
>  extern tree build_asm_stmt (tree, tree);
>  extern int c_types_compatible_p (tree, tree);
>  extern tree c_begin_compound_stmt (bool);
> diff --git a/gcc/c/c-typeck.c b/gcc/c/c-typeck.c
> index 9d09b8d..e013100 100644
> --- a/gcc/c/c-typeck.c
> +++ b/gcc/c/c-typeck.c
> @@ -10064,7 +10064,7 @@ build_asm_stmt (tree cv_qualifier, tree args)
>     are subtly different.  We use a ASM_EXPR node to represent this.  */
>  tree
>  build_asm_expr (location_t loc, tree string, tree outputs, tree inputs,
> -		tree clobbers, tree labels, bool simple)
> +		tree clobbers, tree labels, bool simple, bool is_inline)
>  {
>    tree tail;
>    tree args;
> @@ -10182,6 +10182,7 @@ build_asm_expr (location_t loc, tree string, tree outputs, tree inputs,
>       as volatile.  */
>    ASM_INPUT_P (args) = simple;
>    ASM_VOLATILE_P (args) = (noutputs == 0);
> +  ASM_INLINE_P (args) = is_inline;
>  
>    return args;
>  }
> diff --git a/gcc/gimple-pretty-print.c b/gcc/gimple-pretty-print.c
> index 83e2273..1a00fa3 100644
> --- a/gcc/gimple-pretty-print.c
> +++ b/gcc/gimple-pretty-print.c
> @@ -2019,6 +2019,8 @@ dump_gimple_asm (pretty_printer *buffer, gasm *gs, int spc, dump_flags_t flags)
>        pp_string (buffer, "__asm__");
>        if (gimple_asm_volatile_p (gs))
>  	pp_string (buffer, " __volatile__");
> +      if (gimple_asm_inline_p (gs))
> +	pp_string (buffer, " __inline__");
>        if (gimple_asm_nlabels (gs))
>  	pp_string (buffer, " goto");
>        pp_string (buffer, "(\"");
> diff --git a/gcc/gimple.h b/gcc/gimple.h
> index a5dda93..8a58e07 100644
> --- a/gcc/gimple.h
> +++ b/gcc/gimple.h
> @@ -137,6 +137,7 @@ enum gimple_rhs_class
>  enum gf_mask {
>      GF_ASM_INPUT		= 1 << 0,
>      GF_ASM_VOLATILE		= 1 << 1,
> +    GF_ASM_INLINE		= 1 << 2,
>      GF_CALL_FROM_THUNK		= 1 << 0,
>      GF_CALL_RETURN_SLOT_OPT	= 1 << 1,
>      GF_CALL_TAILCALL		= 1 << 2,
> @@ -3911,7 +3912,7 @@ gimple_asm_volatile_p (const gasm *asm_stmt)
>  }
>  
>  
> -/* If VOLATLE_P is true, mark asm statement ASM_STMT as volatile.  */
> +/* If VOLATILE_P is true, mark asm statement ASM_STMT as volatile.  */
>  
>  static inline void
>  gimple_asm_set_volatile (gasm *asm_stmt, bool volatile_p)
> @@ -3923,6 +3924,27 @@ gimple_asm_set_volatile (gasm *asm_stmt, bool volatile_p)
>  }
>  
>  
> +/* Return true if ASM_STMT is an asm statement marked inline.  */
> +
> +static inline bool
> +gimple_asm_inline_p (const gasm *asm_stmt)
> +{
> +  return (asm_stmt->subcode & GF_ASM_INLINE) != 0;
> +}
> +
> +
> +/* If INLINE_P is true, mark asm statement ASM_STMT as inline.  */
> +
> +static inline void
> +gimple_asm_set_inline (gasm *asm_stmt, bool inline_p)
> +{
> +  if (inline_p)
> +    asm_stmt->subcode |= GF_ASM_INLINE;
> +  else
> +    asm_stmt->subcode &= ~GF_ASM_INLINE;
> +}
> +
> +
>  /* If INPUT_P is true, mark asm ASM_STMT as an ASM_INPUT.  */
>  
>  static inline void
> diff --git a/gcc/gimplify.c b/gcc/gimplify.c
> index 509fc2f..10b80f2 100644
> --- a/gcc/gimplify.c
> +++ b/gcc/gimplify.c
> @@ -6315,6 +6315,7 @@ gimplify_asm_expr (tree *expr_p, gimple_seq *pre_p, gimple_seq *post_p)
>  
>        gimple_asm_set_volatile (stmt, ASM_VOLATILE_P (expr) || noutputs == 0);
>        gimple_asm_set_input (stmt, ASM_INPUT_P (expr));
> +      gimple_asm_set_inline (stmt, ASM_INLINE_P (expr));
>  
>        gimplify_seq_add_stmt (pre_p, stmt);
>      }
> diff --git a/gcc/ipa-icf-gimple.c b/gcc/ipa-icf-gimple.c
> index ba39ea3..5361139 100644
> --- a/gcc/ipa-icf-gimple.c
> +++ b/gcc/ipa-icf-gimple.c
> @@ -993,6 +993,9 @@ func_checker::compare_gimple_asm (const gasm *g1, const gasm *g2)
>    if (gimple_asm_input_p (g1) != gimple_asm_input_p (g2))
>      return false;
>  
> +  if (gimple_asm_inline_p (g1) != gimple_asm_inline_p (g2))
> +    return false;
> +
>    if (gimple_asm_ninputs (g1) != gimple_asm_ninputs (g2))
>      return false;
>  
> diff --git a/gcc/tree-inline.c b/gcc/tree-inline.c
> index 9134253..6b1d2ea 100644
> --- a/gcc/tree-inline.c
> +++ b/gcc/tree-inline.c
> @@ -4108,6 +4108,8 @@ estimate_num_insns (gimple *stmt, eni_weights *weights)
>  	   with very long asm statements.  */
>  	if (count > 1000)
>  	  count = 1000;
> +	if (gimple_asm_inline_p (as_a <gasm *> (stmt)))
> +	  count = !!count;
>  	return MAX (1, count);
>        }
>  
> diff --git a/gcc/tree.h b/gcc/tree.h
> index 2e45f9d..160b3a7 100644
> --- a/gcc/tree.h
> +++ b/gcc/tree.h
> @@ -1245,6 +1245,9 @@ extern tree maybe_wrap_with_location (tree, location_t);
>     ASM_OPERAND with no operands.  */
>  #define ASM_INPUT_P(NODE) (ASM_EXPR_CHECK (NODE)->base.static_flag)
>  #define ASM_VOLATILE_P(NODE) (ASM_EXPR_CHECK (NODE)->base.public_flag)
> +/* Nonzero if we want to consider this asm as minimum length and cost
> +   for inlining decisions.  */
> +#define ASM_INLINE_P(NODE) (ASM_EXPR_CHECK (NODE)->base.protected_flag)
>  
>  /* COND_EXPR accessors.  */
>  #define COND_EXPR_COND(NODE)	(TREE_OPERAND (COND_EXPR_CHECK (NODE), 0))
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-10  7:12             ` Richard Biener
@ 2018-10-10  7:22               ` Ingo Molnar
  2018-10-10  8:03                 ` Segher Boessenkool
  2018-10-10  7:53               ` Segher Boessenkool
  1 sibling, 1 reply; 116+ messages in thread
From: Ingo Molnar @ 2018-10-10  7:22 UTC (permalink / raw)
  To: Richard Biener
  Cc: Segher Boessenkool, Michael Matz, Borislav Petkov, gcc,
	Nadav Amit, Ingo Molnar, linux-kernel, x86, Masahiro Yamada,
	Sam Ravnborg, Alok Kataria, Christopher Li, Greg Kroah-Hartman,
	H. Peter Anvin, Jan Beulich, Josh Poimboeuf, Juergen Gross,
	Kate Stewart, Kees Cook, linux-sparse, Peter Zijlstra,
	Philippe Ombredanne, Thomas Gleixner, virtualization,
	Linus Torvalds, Chris Zankel, Max Filippov, linux-xtensa,
	Andrew Morton


* Richard Biener <rguenther@suse.de> wrote:

> Can kernel folks give this a second and third thought please so we
> don't implement sth that in the end won't satisfy you guys?

So this basically passes '0 size' to the inliner, which should be better
than passing in the explicit size, as we'd inevitably get it wrong
in cases.

I also like 'size 0' for the reason that we tend to write assembly code
and mark it 'inline' if we really think it matters to performance,
so making it more likely to be inlined when used within another inline
function is a plus as well.

Does anyone have any concerns about this?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-10  7:12             ` Richard Biener
  2018-10-10  7:22               ` Ingo Molnar
@ 2018-10-10  7:53               ` Segher Boessenkool
  1 sibling, 0 replies; 116+ messages in thread
From: Segher Boessenkool @ 2018-10-10  7:53 UTC (permalink / raw)
  To: Richard Biener
  Cc: Michael Matz, Borislav Petkov, gcc, Nadav Amit, Ingo Molnar,
	linux-kernel, x86, Masahiro Yamada, Sam Ravnborg, Alok Kataria,
	Christopher Li, Greg Kroah-Hartman, H. Peter Anvin, Jan Beulich,
	Josh Poimboeuf, Juergen Gross, Kate Stewart, Kees Cook,
	linux-sparse, Peter Zijlstra, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa

On Wed, Oct 10, 2018 at 09:12:48AM +0200, Richard Biener wrote:
> On Tue, 9 Oct 2018, Segher Boessenkool wrote:
> > On Mon, Oct 08, 2018 at 11:07:46AM +0200, Richard Biener wrote:
> > > On Mon, 8 Oct 2018, Segher Boessenkool wrote:
> > > > On Sun, Oct 07, 2018 at 03:53:26PM +0000, Michael Matz wrote:
> > > > > On Sun, 7 Oct 2018, Segher Boessenkool wrote:
> > > > > > On Sun, Oct 07, 2018 at 11:18:06AM +0200, Borislav Petkov wrote:
> > > > > > > Now, Richard suggested doing something like:
> > > > > > > 
> > > > > > >  1) inline asm ("...")
> > > > > > 
> > > > > > What would the semantics of this be?
> > > > > 
> > > > > The size of the inline asm wouldn't be counted towards the inliner size 
> > > > > limits (or be counted as "1").
> > > > 
> > > > That sounds like a good option.
> > > 
> > > Yes, I also like it for simplicity.  It also avoids the requirement
> > > of translating the number (in bytes?) given by the user to
> > > "number of GIMPLE instructions" as needed by the inliner.
> > 
> > This patch implements this, for C only so far.  And the syntax is
> > "asm inline", which is more in line with other syntax.
> > 
> > How does this look?
> 
> Looks good.  A few nits - you need to document this in extend.texi, the

Yup.

> tree flag use needs documenting in tree-core.h,

Ah yes.

> and we need a testcase
> (I'd suggest one that shows we inline a function with "large" asm inline 
> () even at -Os).

I have one.  Oh, and I probably should do a comment at the one line of
code that isn't just bookkeeping ;-)

> Oh, and I don't think we want C and C++ to diverge - so you need to
> cook up C++ support as well.

Right, that's why I said "C only so far".

> Can kernel folks give this a second and third thought please so we
> don't implement sth that in the end won't satisfy you guys?

Or actually try it out and see if it has the desired effect!  Nothing
beats field trials.

I'll do the C++ thing today hopefully, and send things to gcc-patches@.


Segher

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-10  7:22               ` Ingo Molnar
@ 2018-10-10  8:03                 ` Segher Boessenkool
  2018-10-10  8:19                   ` Borislav Petkov
  2018-10-10 10:29                   ` Richard Biener
  0 siblings, 2 replies; 116+ messages in thread
From: Segher Boessenkool @ 2018-10-10  8:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Richard Biener, Michael Matz, Borislav Petkov, gcc, Nadav Amit,
	Ingo Molnar, linux-kernel, x86, Masahiro Yamada, Sam Ravnborg,
	Alok Kataria, Christopher Li, Greg Kroah-Hartman, H. Peter Anvin,
	Jan Beulich, Josh Poimboeuf, Juergen Gross, Kate Stewart,
	Kees Cook, linux-sparse, Peter Zijlstra, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa, Andrew Morton

On Wed, Oct 10, 2018 at 09:22:40AM +0200, Ingo Molnar wrote:
> * Richard Biener <rguenther@suse.de> wrote:
> > Can kernel folks give this a second and third thought please so we
> > don't implement sth that in the end won't satisfy you guys?
> 
> So this basically passes '0 size' to the inliner, which should be better
> than passing in the explicit size, as we'd inevitably get it wrong
> in cases.

The code immediately after this makes it size 1, even for things like
asm(""), I suppose this works better for the inliner.  But that's a detail
(and it might change); the description says "consider this asm as minimum
length and cost for inlining decisions", which works for either 0 or 1.
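
To make the 0-versus-1 point concrete, here is a rough sketch of that cost
logic in plain C. It is only a model: count_asm_insns() and
estimate_asm_cost() are made-up names, and the separator counting is a
simplified approximation of what GCC does, not the real implementation.

```c
#include <stdbool.h>

/*
 * Simplified stand-in for GCC's statement counting (an assumption for
 * illustration): one "instruction", plus one more for every newline or
 * semicolon found in the asm template.
 */
static int count_asm_insns(const char *tmpl)
{
	int count = 1;

	for (; *tmpl; tmpl++)
		if (*tmpl == '\n' || *tmpl == ';')
			count++;
	return count;
}

/* Follows the shape of the estimate_num_insns() hunk in the patch. */
static int estimate_asm_cost(const char *tmpl, bool asm_inline)
{
	int count = count_asm_insns(tmpl);

	if (count > 1000)
		count = 1000;
	if (asm_inline)
		count = !!count;	/* any nonzero count collapses to 1 */
	return count > 1 ? count : 1;	/* same effect as MAX (1, count) */
}
```

With this model a plain asm is charged per perceived line, while the
inline-marked one is always charged 1, no matter how long the template is.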

> I also like 'size 0' for the reason that we tend to write assembly code
> and mark it 'inline' if we really think it matters to performance,
> so making it more likely to be inlined when used within another inline
> function is a plus as well.

You can think of it as meaning "we want this asm inlined always", and then
whether that actually happens depends on if the function around it is
inlined or not.


Segher

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-10  8:03                 ` Segher Boessenkool
@ 2018-10-10  8:19                   ` Borislav Petkov
  2018-10-10  8:35                     ` Richard Biener
  2018-10-10 18:54                     ` Segher Boessenkool
  2018-10-10 10:29                   ` Richard Biener
  1 sibling, 2 replies; 116+ messages in thread
From: Borislav Petkov @ 2018-10-10  8:19 UTC (permalink / raw)
  To: Segher Boessenkool, Ingo Molnar
  Cc: Richard Biener, Michael Matz, gcc, Nadav Amit, Ingo Molnar,
	linux-kernel, x86, Masahiro Yamada, Sam Ravnborg, Alok Kataria,
	Christopher Li, Greg Kroah-Hartman, H. Peter Anvin, Jan Beulich,
	Josh Poimboeuf, Juergen Gross, Kate Stewart, Kees Cook,
	linux-sparse, Peter Zijlstra, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa, Andrew Morton

On Wed, Oct 10, 2018 at 03:03:25AM -0500, Segher Boessenkool wrote:
> The code immediately after this makes it size 1, even for things like
> asm(""), I suppose this works better for the inliner.  But that's a detail
> (and it might change); the description says "consider this asm as minimum
> length and cost for inlining decisions", which works for either 0 or 1.

Thanks for implementing this, much appreciated. If you need people to
test stuff, lemme know.

> You can think of it as meaning "we want this asm inlined always", and then
> whether that actually happens depends on if the function around it is
> inlined or not.

My only concern is how we would catch the other extremity where the
inline asm grows too big and we end up inlining it everywhere and thus
getting fat. The 0day bot already builds tinyconfigs but we should be
looking at vmlinux size growth too.

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-10  8:19                   ` Borislav Petkov
@ 2018-10-10  8:35                     ` Richard Biener
  2018-10-10 18:54                     ` Segher Boessenkool
  1 sibling, 0 replies; 116+ messages in thread
From: Richard Biener @ 2018-10-10  8:35 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Segher Boessenkool, Ingo Molnar, Michael Matz, gcc, Nadav Amit,
	Ingo Molnar, linux-kernel, x86, Masahiro Yamada, Sam Ravnborg,
	Alok Kataria, Christopher Li, Greg Kroah-Hartman, H. Peter Anvin,
	Jan Beulich, Josh Poimboeuf, Juergen Gross, Kate Stewart,
	Kees Cook, linux-sparse, Peter Zijlstra, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa, Andrew Morton

On Wed, 10 Oct 2018, Borislav Petkov wrote:

> On Wed, Oct 10, 2018 at 03:03:25AM -0500, Segher Boessenkool wrote:
> > The code immediately after this makes it size 1, even for things like
> > asm(""), I suppose this works better for the inliner.  But that's a detail
> > (and it might change); the description says "consider this asm as minimum
> > length and cost for inlining decisions", which works for either 0 or 1.
> 
> Thanks for implementing this, much appreciated. If you need people to
> test stuff, lemme know.
> 
> > You can think of it as meaning "we want this asm inlined always", and then
> > whether that actually happens depends on if the function around it is
> > inlined or not.
> 
> My only concern is how we would catch the other extremity where the
> inline asm grows too big and we end up inlining it everywhere and thus
> getting fat. The 0day bot already builds tinyconfigs but we should be
> looking at vmlinux size growth too.

Well, it's like always-inline functions, you have to be careful
and _not_ put it everywhere...  It's also somewhat different to
always-inline functions as those lose their special-ness once
inlined (the inlined body is properly accounted for size).

Richard.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-10  8:03                 ` Segher Boessenkool
  2018-10-10  8:19                   ` Borislav Petkov
@ 2018-10-10 10:29                   ` Richard Biener
  1 sibling, 0 replies; 116+ messages in thread
From: Richard Biener @ 2018-10-10 10:29 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Ingo Molnar, Michael Matz, Borislav Petkov, gcc, Nadav Amit,
	Ingo Molnar, linux-kernel, x86, Masahiro Yamada, Sam Ravnborg,
	Alok Kataria, Christopher Li, Greg Kroah-Hartman, H. Peter Anvin,
	Jan Beulich, Josh Poimboeuf, Juergen Gross, Kate Stewart,
	Kees Cook, linux-sparse, Peter Zijlstra, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa, Andrew Morton

On Wed, 10 Oct 2018, Segher Boessenkool wrote:

> On Wed, Oct 10, 2018 at 09:22:40AM +0200, Ingo Molnar wrote:
> > * Richard Biener <rguenther@suse.de> wrote:
> > > Can kernel folks give this a second and third thought please so we
> > > don't implement sth that in the end won't satisfy you guys?
> > 
> > So this basically passes '0 size' to the inliner, which should be better
> > than passing in the explicit size, as we'd inevitably get it wrong
> > in cases.
> 
> The code immediately after this makes it size 1, even for things like
> asm(""), I suppose this works better for the inliner.  But that's a detail
> (and it might change); the description says "consider this asm as minimum
> length and cost for inlining decisions", which works for either 0 or 1.

It was made 1 as otherwise the inliner happily explodes on

void foo () { asm(""); foo(); }

> > I also like 'size 0' for the reason that we tend to write assembly code
> > and mark it 'inline' if we really think it matters to performance,
> > so making it more likely to be inlined when used within another inline
> > function is a plus as well.
> 
> You can think of it as meaning "we want this asm inlined always", and then
> whether that actually happens depends on if the function around it is
> inlined or not.

Richard.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-09 14:53           ` Segher Boessenkool
  2018-10-10  6:35             ` Ingo Molnar
  2018-10-10  7:12             ` Richard Biener
@ 2018-10-10 16:31             ` Nadav Amit
  2018-10-10 19:21               ` Segher Boessenkool
  2018-10-11  7:04               ` Richard Biener
  2018-11-29 11:46             ` Masahiro Yamada
  3 siblings, 2 replies; 116+ messages in thread
From: Nadav Amit @ 2018-10-10 16:31 UTC (permalink / raw)
  To: Segher Boessenkool, Richard Biener
  Cc: Michael Matz, Borislav Petkov, gcc, Ingo Molnar, LKML, X86 ML,
	Masahiro Yamada, Sam Ravnborg, Alok Kataria, Christopher Li,
	Greg Kroah-Hartman, H. Peter Anvin, Jan Beulich, Josh Poimboeuf,
	Juergen Gross, Kate Stewart, Kees Cook, linux-sparse,
	Peter Zijlstra, Philippe Ombredanne, Thomas Gleixner,
	virtualization, Linus Torvalds, Chris Zankel, Max Filippov,
	linux-xtensa

at 7:53 AM, Segher Boessenkool <segher@kernel.crashing.org> wrote:

> On Mon, Oct 08, 2018 at 11:07:46AM +0200, Richard Biener wrote:
>> On Mon, 8 Oct 2018, Segher Boessenkool wrote:
>>> On Sun, Oct 07, 2018 at 03:53:26PM +0000, Michael Matz wrote:
>>>> On Sun, 7 Oct 2018, Segher Boessenkool wrote:
>>>>> On Sun, Oct 07, 2018 at 11:18:06AM +0200, Borislav Petkov wrote:
>>>>>> Now, Richard suggested doing something like:
>>>>>> 
>>>>>> 1) inline asm ("...")
>>>>> 
>>>>> What would the semantics of this be?
>>>> 
>>>> The size of the inline asm wouldn't be counted towards the inliner size 
>>>> limits (or be counted as "1").
>>> 
>>> That sounds like a good option.
>> 
>> Yes, I also like it for simplicity.  It also avoids the requirement
>> of translating the number (in bytes?) given by the user to
>> "number of GIMPLE instructions" as needed by the inliner.
> 
> This patch implements this, for C only so far.  And the syntax is
> "asm inline", which is more in line with other syntax.
> 
> How does this look?

It looks good to me in general. I have a couple of reservations, but I
suspect you will not want to address them:

1. It is not backward compatible, requiring a C macro to wrap it, as the 
kernel might be built with different compilers.

2. It is specific to asm. I do not have in mind another use case (excluding
the __builtin_constant_p), but it would be nicer IMHO to have a builtin
saying “ignore the cost of this statement” for the matter of optimizations.

Regards,
Nadav

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-10  8:19                   ` Borislav Petkov
  2018-10-10  8:35                     ` Richard Biener
@ 2018-10-10 18:54                     ` Segher Boessenkool
  2018-10-10 19:14                       ` Borislav Petkov
  1 sibling, 1 reply; 116+ messages in thread
From: Segher Boessenkool @ 2018-10-10 18:54 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Ingo Molnar, Richard Biener, Michael Matz, gcc, Nadav Amit,
	Ingo Molnar, linux-kernel, x86, Masahiro Yamada, Sam Ravnborg,
	Alok Kataria, Christopher Li, Greg Kroah-Hartman, H. Peter Anvin,
	Jan Beulich, Josh Poimboeuf, Juergen Gross, Kate Stewart,
	Kees Cook, linux-sparse, Peter Zijlstra, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa, Andrew Morton

On Wed, Oct 10, 2018 at 10:19:06AM +0200, Borislav Petkov wrote:
> On Wed, Oct 10, 2018 at 03:03:25AM -0500, Segher Boessenkool wrote:
> > The code immediately after this makes it size 1, even for things like
> > asm(""), I suppose this works better for the inliner.  But that's a detail
> > (and it might change); the description says "consider this asm as minimum
> > length and cost for inlining decisions", which works for either 0 or 1.
> 
> Thanks for implementing this, much appreciated. If you need people to
> test stuff, lemme know.

It would be great to hear from kernel people if it works adequately for
what you guys want it for :-)

> > You can think of it as meaning "we want this asm inlined always", and then
> > whether that actually happens depends on if the function around it is
> > inlined or not.
> 
> My only concern is how we would catch the other extremity where the
> inline asm grows too big and we end up inlining it everywhere and thus
> getting fat. The 0day bot already builds tinyconfigs but we should be
> looking at vmlinux size growth too.

But this isn't really different from other always_inline concerns afaics?
So you should be able to catch it the same way, too.


Segher

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-10 18:54                     ` Segher Boessenkool
@ 2018-10-10 19:14                       ` Borislav Petkov
  2018-10-13 19:33                         ` Borislav Petkov
  0 siblings, 1 reply; 116+ messages in thread
From: Borislav Petkov @ 2018-10-10 19:14 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Ingo Molnar, Richard Biener, Michael Matz, gcc, Nadav Amit,
	Ingo Molnar, linux-kernel, x86, Masahiro Yamada, Sam Ravnborg,
	Alok Kataria, Christopher Li, Greg Kroah-Hartman, H. Peter Anvin,
	Jan Beulich, Josh Poimboeuf, Juergen Gross, Kate Stewart,
	Kees Cook, linux-sparse, Peter Zijlstra, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa, Andrew Morton

On Wed, Oct 10, 2018 at 01:54:33PM -0500, Segher Boessenkool wrote:
> It would be great to hear from kernel people if it works adequately for
> what you guys want it for :-)

Sure, ping me when you have the final version and I'll try to build gcc
with it and do some size comparisons.

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-10 16:31             ` Nadav Amit
@ 2018-10-10 19:21               ` Segher Boessenkool
  2018-10-11  7:04               ` Richard Biener
  1 sibling, 0 replies; 116+ messages in thread
From: Segher Boessenkool @ 2018-10-10 19:21 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Richard Biener, Michael Matz, Borislav Petkov, gcc, Ingo Molnar,
	LKML, X86 ML, Masahiro Yamada, Sam Ravnborg, Alok Kataria,
	Christopher Li, Greg Kroah-Hartman, H. Peter Anvin, Jan Beulich,
	Josh Poimboeuf, Juergen Gross, Kate Stewart, Kees Cook,
	linux-sparse, Peter Zijlstra, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa

Hi Nadav,

On Wed, Oct 10, 2018 at 04:31:41PM +0000, Nadav Amit wrote:
> at 7:53 AM, Segher Boessenkool <segher@kernel.crashing.org> wrote:
> > How does this look?
> 
> It looks good to me in general. I have a couple of reservations, but I
> suspect you will not want to address them:
> 
> 1. It is not backward compatible, requiring a C macro to wrap it, as the 
> kernel might be built with different compilers.

How *could* it be backward compatible?  There should be an error or at least
a warning if the compiler does not support this, in general.

For the kernel, the kernel already has plenty of infrastructure to support
this (compiler.h etc.)  For other applications it is quite trivial, too.

> 2. It is specific to asm.

Yes, and that is on purpose.

> I do not have in mind another use case (excluding
> the __builtin_constant_p), but it would be nicer IMHO to have a builtin
> saying “ignore the cost of this statement” for the matter of optimizations.

That is a hundred or a thousand times more work to design and implement
(including testing etc.)  I'm not going to do it, but feel free to try
yourself!


Segher

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-10 16:31             ` Nadav Amit
  2018-10-10 19:21               ` Segher Boessenkool
@ 2018-10-11  7:04               ` Richard Biener
  1 sibling, 0 replies; 116+ messages in thread
From: Richard Biener @ 2018-10-11  7:04 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Segher Boessenkool, Michael Matz, Borislav Petkov, gcc,
	Ingo Molnar, LKML, X86 ML, Masahiro Yamada, Sam Ravnborg,
	Alok Kataria, Christopher Li, Greg Kroah-Hartman, H. Peter Anvin,
	Jan Beulich, Josh Poimboeuf, Juergen Gross, Kate Stewart,
	Kees Cook, linux-sparse, Peter Zijlstra, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa

On Wed, 10 Oct 2018, Nadav Amit wrote:

> at 7:53 AM, Segher Boessenkool <segher@kernel.crashing.org> wrote:
> 
> > On Mon, Oct 08, 2018 at 11:07:46AM +0200, Richard Biener wrote:
> >> On Mon, 8 Oct 2018, Segher Boessenkool wrote:
> >>> On Sun, Oct 07, 2018 at 03:53:26PM +0000, Michael Matz wrote:
> >>>> On Sun, 7 Oct 2018, Segher Boessenkool wrote:
> >>>>> On Sun, Oct 07, 2018 at 11:18:06AM +0200, Borislav Petkov wrote:
> >>>>>> Now, Richard suggested doing something like:
> >>>>>> 
> >>>>>> 1) inline asm ("...")
> >>>>> 
> >>>>> What would the semantics of this be?
> >>>> 
> >>>> The size of the inline asm wouldn't be counted towards the inliner size 
> >>>> limits (or be counted as "1").
> >>> 
> >>> That sounds like a good option.
> >> 
> >> Yes, I also like it for simplicity.  It also avoids the requirement
> >> of translating the number (in bytes?) given by the user to
> >> "number of GIMPLE instructions" as needed by the inliner.
> > 
> > This patch implements this, for C only so far.  And the syntax is
> > "asm inline", which is more in line with other syntax.
> > 
> > How does this look?
> 
> It looks good to me in general. I have a couple of reservations, but I
> suspect you will not want to address them:
> 
> 1. It is not backward compatible, requiring a C macro to wrap it, as the 
> kernel might be built with different compilers.
> 
> 2. It is specific to asm. I do not have in mind another use case (excluding
> the __builtin_constant_p), but it would be nicer IMHO to have a builtin
> saying “ignore the cost of this statement” for the matter of optimizations.

The only easy possibility that comes to my mind is sth like

__attribute__((always_inline, zero_cost)) foo ()
{
  ... your stmts ...
}

and us, upon inlining, marking the inlined stmts properly.  That would
also work for the asm() case and it would be backwards compatible
(well, you'd get a diagnostic for the unknown zero_cost attribute).

There's the slight complication that if you have, say

 _1 = _2 * 3; // zero-cost
 _4 = _1 * 2;

and optimization ends up combining those to

 _4 = _2 * 6;

then is this stmt supposed to be zero-cost or not?  Compare that to

 _1 = _2 * 3;
 _4 = _1 * 2; // zero-cost

So outside of asm() there are new issues that come up with respect
to expected (cost) semantics.

Richard.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-10 19:14                       ` Borislav Petkov
@ 2018-10-13 19:33                         ` Borislav Petkov
  2018-10-13 21:14                           ` Alexander Monakov
                                             ` (2 more replies)
  0 siblings, 3 replies; 116+ messages in thread
From: Borislav Petkov @ 2018-10-13 19:33 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Ingo Molnar, Richard Biener, Michael Matz, gcc, Nadav Amit,
	Ingo Molnar, linux-kernel, x86, Masahiro Yamada, Sam Ravnborg,
	Alok Kataria, Christopher Li, Greg Kroah-Hartman, H. Peter Anvin,
	Jan Beulich, Josh Poimboeuf, Juergen Gross, Kate Stewart,
	Kees Cook, linux-sparse, Peter Zijlstra, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa, Andrew Morton

Ok,

with Segher's help I've been playing with his patch on top of bleeding
edge gcc 9 and here are my observations. Please double-check me for
booboos so that they can be addressed while there's time.

So here's what I see on top of 4.19-rc7:

First I marked the alternative asm() as inline and undeffed the "inline"
keyword. I need to do that because of the funky games we do with
"inline" redefinitions in include/linux/compiler_types.h.

And Segher hinted at either doing:

asm volatile inline(...

or

asm volatile __inline__(...

but both "inline" variants are defined as macros in that file.

Which means we either need to #undef inline before using it in asm() or
come up with something cleverer.

Anyway, did this:

---
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 4cd6a3b71824..7c0639087da7 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -165,11 +165,13 @@ static inline int alternatives_text_reserved(void *start, void *end)
  * For non barrier like inlines please define new variants
  * without volatile and memory clobber.
  */
+
+#undef inline
 #define alternative(oldinstr, newinstr, feature)                       \
-       asm volatile (ALTERNATIVE(oldinstr, newinstr, feature) : : : "memory")
+       asm volatile inline(ALTERNATIVE(oldinstr, newinstr, feature) : : : "memory")
 
 #define alternative_2(oldinstr, newinstr1, feature1, newinstr2, feature2) \
-       asm volatile(ALTERNATIVE_2(oldinstr, newinstr1, feature1, newinstr2, feature2) ::: "memory")
+       asm volatile inline(ALTERNATIVE_2(oldinstr, newinstr1, feature1, newinstr2, feature2) ::: "memory")
 
 /*
  * Alternative inline assembly with input.
---

Build failed at link time with:

arch/x86/boot/compressed/cmdline.o: In function `native_save_fl':
cmdline.c:(.text+0x0): multiple definition of `native_save_fl'
arch/x86/boot/compressed/misc.o:misc.c:(.text+0x1e0): first defined here
arch/x86/boot/compressed/cmdline.o: In function `native_restore_fl':
cmdline.c:(.text+0x10): multiple definition of `native_restore_fl'
arch/x86/boot/compressed/misc.o:misc.c:(.text+0x1f0): first defined here
arch/x86/boot/compressed/error.o: In function `native_save_fl':
error.c:(.text+0x0): multiple definition of `native_save_fl'

which I had to fix with this:

---
diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index 15450a675031..0d772598c37c 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -14,8 +14,7 @@
  */
 
 /* Declaration required for gcc < 4.9 to prevent -Werror=missing-prototypes */
-extern inline unsigned long native_save_fl(void);
-extern inline unsigned long native_save_fl(void)
+static inline unsigned long native_save_fl(void)
 {
        unsigned long flags;
 
@@ -33,8 +32,7 @@ ex
---

That "extern inline" declaration looks fishy to me anyway, maybe not really
needed? Apparently, gcc < 4.9 complains with -Werror=missing-prototypes...

Then the build worked and the results look like this:

    text     data      bss       dec      hex  filename
17287384  5040656  2019532  24347572  17383b4  vmlinux-before
17288020  5040656  2015436  24344112  1737630  vmlinux-2nd-version
so some inlining must be happening.

Then I did this:

---
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 9c5606d88f61..a0170344cf08 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -20,7 +20,7 @@ unsigned long __clear_user(void __user *addr, unsigned long size)
        /* no memory constraint because it doesn't change any memory gcc knows
           about */
        stac();
-       asm volatile(
+       asm volatile inline(
                "       testq  %[size8],%[size8]\n"
                "       jz     4f\n"
                "0:     movq $0,(%[dst])\n"
---

to force inlining of a somewhat bigger asm() statement. And yap, more
got inlined:

    text     data      bss       dec      hex  filename
17287384  5040656  2019532  24347572  17383b4  vmlinux-before
17288020  5040656  2015436  24344112  1737630  vmlinux-2nd-version
17288076  5040656  2015436  24344168  1737668  vmlinux-2nd-version__clear_user

so more stuff gets inlined.
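
For reference, the deltas work out as follows: marking the alternatives
grows text by 636 bytes while bss shrinks by 4096, so the total drops by
3460; marking __clear_user() as well adds another 56 bytes of text. A
throwaway helper to recompute this from the size(1) numbers could look
like the sketch below (struct image_size, total_size() and print_delta()
are names made up for this example):

```c
#include <stdio.h>

/* One row of size(1) output, filled in with the numbers quoted above. */
struct image_size {
	const char *name;
	long text, data, bss;
};

static const struct image_size sizes[] = {
	{ "vmlinux-before",                  17287384, 5040656, 2019532 },
	{ "vmlinux-2nd-version",             17288020, 5040656, 2015436 },
	{ "vmlinux-2nd-version__clear_user", 17288076, 5040656, 2015436 },
};

/* The "dec" column is simply text + data + bss. */
static long total_size(const struct image_size *s)
{
	return s->text + s->data + s->bss;
}

/* Print how each section changed between two builds. */
static void print_delta(const struct image_size *from,
			const struct image_size *to)
{
	printf("%s -> %s: text %+ld, data %+ld, bss %+ld, total %+ld\n",
	       from->name, to->name,
	       to->text - from->text, to->data - from->data,
	       to->bss - from->bss, total_size(to) - total_size(from));
}
```

Calling print_delta(&sizes[0], &sizes[1]) and then
print_delta(&sizes[1], &sizes[2]) prints the two steps.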

Looking at the asm output, it had before:

---
clear_user:
# ./arch/x86/include/asm/current.h:15:  return this_cpu_read_stable(current_task);
#APP
# 15 "./arch/x86/include/asm/current.h" 1
        movq %gs:current_task,%rax      #, pfo_ret__
# 0 "" 2
# arch/x86/lib/usercopy_64.c:51:        if (access_ok(VERIFY_WRITE, to, n))
#NO_APP
        movq    2520(%rax), %rdx        # pfo_ret___12->thread.addr_limit.seg, _1
        movq    %rdi, %rax      # to, tmp93
        addq    %rsi, %rax      # n, tmp93
        jc      .L3     #,
# arch/x86/lib/usercopy_64.c:51:        if (access_ok(VERIFY_WRITE, to, n))
        cmpq    %rax, %rdx      # tmp93, _1
        jb      .L3     #,
# arch/x86/lib/usercopy_64.c:52:                return __clear_user(to, n);
        jmp     __clear_user    #
---

note the JMP to __clear_user. After marking the asm() in __clear_user() as
inline, clear_user() inlines __clear_user() directly:

---
clear_user:
# ./arch/x86/include/asm/current.h:15:  return this_cpu_read_stable(current_task);
#APP
# 15 "./arch/x86/include/asm/current.h" 1
        movq %gs:current_task,%rax      #, pfo_ret__
# 0 "" 2
# arch/x86/lib/usercopy_64.c:51:        if (access_ok(VERIFY_WRITE, to, n))
#NO_APP
        movq    2520(%rax), %rdx        # pfo_ret___12->thread.addr_limit.seg, _1
        movq    %rdi, %rax      # to, tmp95
        addq    %rsi, %rax      # n, tmp95
        jc      .L8     #,
# arch/x86/lib/usercopy_64.c:51:        if (access_ok(VERIFY_WRITE, to, n))
        cmpq    %rax, %rdx      # tmp95, _1
        jb      .L8     #,
# ./arch/x86/include/asm/smap.h:58:     alternative("", __stringify(__ASM_STAC), X86_FEATURE_SMAP);
...

this last line is the stac() macro, which now gets inlined because the
alternative() macro was marked inline above.

So I guess this all looks like what we wanted.

And if this lands in gcc 9, we would need to add an asm_volatile() macro
which is defined differently depending on the compiler used.
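
A minimal sketch of what such a wrapper could look like, assuming the
qualifier lands in gcc 9 in the "asm volatile inline(...)" form discussed
above. The version check and the launder() helper are illustrative
assumptions, not code from the kernel tree. The sketch is standalone C, so
it sidesteps the kernel's own "inline" redefinitions by spelling the
qualifier __inline__:

```c
/*
 * Assumption: the "inline" asm qualifier is available from GCC 9 on.
 * Older GCC, and in this sketch any other compiler, falls back to a
 * plain asm volatile, so callers never spell the qualifier themselves.
 */
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 9
# define asm_volatile(...)	__asm__ __volatile__ __inline__(__VA_ARGS__)
#else
# define asm_volatile(...)	__asm__ __volatile__(__VA_ARGS__)
#endif

/*
 * Hypothetical user: an empty asm template that only acts as a compiler
 * barrier on x, so the example does not depend on the target
 * architecture.
 */
static inline unsigned long launder(unsigned long x)
{
	asm_volatile("" : "+r" (x) : : "memory");
	return x;
}
```

With a wrapper like this, the compiler check lives in one header and the
asm() users stay identical for old and new compilers.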

Thoughts, suggestions, etc are most welcome.

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-13 19:33                         ` Borislav Petkov
@ 2018-10-13 21:14                           ` Alexander Monakov
  2018-10-13 21:30                             ` Borislav Petkov
  2018-10-25 10:24                           ` Borislav Petkov
  2018-10-31 12:55                           ` Peter Zijlstra
  2 siblings, 1 reply; 116+ messages in thread
From: Alexander Monakov @ 2018-10-13 21:14 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Segher Boessenkool, Ingo Molnar, Richard Biener, Michael Matz,
	gcc, Nadav Amit, Ingo Molnar, linux-kernel, x86, Masahiro Yamada,
	Sam Ravnborg, Alok Kataria, Christopher Li, Greg Kroah-Hartman,
	H. Peter Anvin, Jan Beulich, Josh Poimboeuf, Juergen Gross,
	Kate Stewart, Kees Cook, linux-sparse, Peter Zijlstra,
	Philippe Ombredanne, Thomas Gleixner, virtualization,
	Linus Torvalds, Chris Zankel, Max Filippov, linux-xtensa,
	Andrew Morton

Hi,

On Sat, 13 Oct 2018, Borislav Petkov wrote:
> 
> Thoughts, suggestions, etc are most welcome.

I apologize for coming in late here with an alternative proposal, but would
you be happy if GCC gave you a way to designate a portion of the asm template
string that shouldn't be counted as its cost because it doesn't go into the
.text section? This wouldn't interact with your redefinitions of the inline
keyword, and you could do something like (assuming we go with %` ... %`
delimiters)

[if gcc-9 or compatible]

#define ASM_NONTEXT_START "%`\n"
#define ASM_NONTEXT_END   "%`\n"

[else]

#define ASM_NONTEXT_START "\n"
#define ASM_NONTEXT_END   "\n"

[endif]

#define _BUG_FLAGS(ins, flags)						\
do {									\
	asm volatile("1:\t" ins "\n"					\
	             ASM_NONTEXT_START                                  \
		     ".pushsection __bug_table,\"aw\"\n"		\
		     "2:\t" __BUG_REL(1b) "\t# bug_entry::bug_addr\n"	\
		     "\t.word %c0"        "\t# bug_entry::flags\n"	\
		     "\t.org 2b+%c1\n"					\
		     ".popsection"					\
	             ASM_NONTEXT_END                                    \
		     : : "i" (flags),					\
			 "i" (sizeof(struct bug_entry)));		\
} while (0)


I think it's nicer because it also allows the compiler to estimate asm length
for branch range optimization more accurately.
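
To make the cost model concrete: GCC estimates the size of an asm
statement by counting logical lines in the template, i.e. newline- and
semicolon-separated chunks, whatever they contain. Below is a rough sketch
of that count, plus a hypothetical variant that ignores separators between
the proposed %` markers. The function names and the exact marker handling
are assumptions for illustration, not GCC code:

```c
/*
 * Roughly how an asm template is sized today: one "instruction" per
 * newline- or semicolon-separated line, regardless of whether the line
 * is an instruction, a directive or data.
 */
int count_asm_lines(const char *tmpl)
{
	int count;

	if (!*tmpl)
		return 0;

	count = 1;
	for (; *tmpl; tmpl++)
		if (*tmpl == '\n' || *tmpl == ';')
			count++;
	return count;
}

/*
 * Hypothetical variant: same count, but separators that sit between a
 * pair of %` markers are skipped, so non-.text content is not charged.
 */
int count_asm_text_lines(const char *tmpl)
{
	int count, nontext = 0;

	if (!*tmpl)
		return 0;

	count = 1;
	for (; *tmpl; tmpl++) {
		if (tmpl[0] == '%' && tmpl[1] == '`') {
			nontext = !nontext;	/* entering or leaving a marked region */
			tmpl++;			/* step over the backtick as well */
			continue;
		}
		if (!nontext && (*tmpl == '\n' || *tmpl == ';'))
			count++;
	}
	return count;
}
```

For a _BUG_FLAGS()-style template, the first function charges every
.pushsection line as if it were an instruction, while the second charges
only what ends up in .text.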

Alexander


* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-13 21:14                           ` Alexander Monakov
@ 2018-10-13 21:30                             ` Borislav Petkov
  0 siblings, 0 replies; 116+ messages in thread
From: Borislav Petkov @ 2018-10-13 21:30 UTC (permalink / raw)
  To: Alexander Monakov
  Cc: Segher Boessenkool, Ingo Molnar, Richard Biener, Michael Matz,
	gcc, Nadav Amit, Ingo Molnar, linux-kernel, x86, Masahiro Yamada,
	Sam Ravnborg, Alok Kataria, Christopher Li, Greg Kroah-Hartman,
	H. Peter Anvin, Jan Beulich, Josh Poimboeuf, Juergen Gross,
	Kate Stewart, Kees Cook, linux-sparse, Peter Zijlstra,
	Philippe Ombredanne, Thomas Gleixner, virtualization,
	Linus Torvalds, Chris Zankel, Max Filippov, linux-xtensa,
	Andrew Morton

On Sun, Oct 14, 2018 at 12:14:02AM +0300, Alexander Monakov wrote:
> I apologize for coming in late here with an alternative proposal, but would
> you be happy if GCC gave you a way to designate a portion of the asm template
> string that shouldn't be counted as its cost because it doesn't go into the
> .text section? This wouldn't interact with your redefinitions of the inline
> keyword, and you could do something like (assuming we go with %` ... %`
> delimiters)

I don't mind it, but I see you guys are still discussing on the gcc ML
what the better solution here would be. And we, as one user, are happy
campers as long as it does what it is meant to do. But what the feature
looks like syntactically is something for the gcc folk to decide, as
they're going to support it for the foreseeable future, and I'm very
well aware of how important it is for a supportable feature to be
designed properly.

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-13 19:33                         ` Borislav Petkov
  2018-10-13 21:14                           ` Alexander Monakov
@ 2018-10-25 10:24                           ` Borislav Petkov
  2018-10-31 12:55                           ` Peter Zijlstra
  2 siblings, 0 replies; 116+ messages in thread
From: Borislav Petkov @ 2018-10-25 10:24 UTC (permalink / raw)
  To: Segher Boessenkool, Ingo Molnar, Richard Biener, Michael Matz,
	gcc, Nadav Amit, Ingo Molnar, linux-kernel, x86, Masahiro Yamada,
	Sam Ravnborg, Alok Kataria, Christopher Li, Greg Kroah-Hartman,
	H. Peter Anvin, Jan Beulich, Josh Poimboeuf, Juergen Gross,
	Kate Stewart, Kees Cook, linux-sparse, Peter Zijlstra,
	Philippe Ombredanne, Thomas Gleixner, virtualization,
	Linus Torvalds, Chris Zankel, Max Filippov, linux-xtensa,
	Andrew Morton

Ping.

This is a good point in time, methinks, where kernel folk on CC here
should have a look at this and speak up whether it is useful for us in
this form.

Frankly, I'm a bit unsure on the aspect of us using this and supporting
old compilers which don't have it and new compilers which do. Because
the old compilers should get to see the now upstreamed assembler macros
and the new compilers will simply inline the "asm volatile inline"
constructs. And how ugly that would become...

I guess we can try this with smaller macros first and keep them all
nicely hidden in header files.

On Sat, Oct 13, 2018 at 09:33:35PM +0200, Borislav Petkov wrote:
> Ok,
> 
> with Segher's help I've been playing with his patch ontop of bleeding
> edge gcc 9 and here are my observations. Please double-check me for
> booboos so that they can be addressed while there's time.
> 
> So here's what I see ontop of 4.19-rc7:
> 
> First marked the alternative asm() as inline and undeffed the "inline"
> keyword. I need to do that because of the funky games we do with
> "inline" redefinitions in include/linux/compiler_types.h.
> 
> And Segher hinted at either doing:
> 
> asm volatile inline(...
> 
> or
> 
> asm volatile __inline__(...
> 
> but both "inline" variants are defined as macros in that file.
> 
> Which means we either need to #undef inline before using it in asm() or
> come up with something cleverer.
> 
> Anyway, did this:
> 
> ---
> diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
> index 4cd6a3b71824..7c0639087da7 100644
> --- a/arch/x86/include/asm/alternative.h
> +++ b/arch/x86/include/asm/alternative.h
> @@ -165,11 +165,13 @@ static inline int alternatives_text_reserved(void *start, void *end)
>   * For non barrier like inlines please define new variants
>   * without volatile and memory clobber.
>   */
> +
> +#undef inline
>  #define alternative(oldinstr, newinstr, feature)                       \
> -       asm volatile (ALTERNATIVE(oldinstr, newinstr, feature) : : : "memory")
> +       asm volatile inline(ALTERNATIVE(oldinstr, newinstr, feature) : : : "memory")
>  
>  #define alternative_2(oldinstr, newinstr1, feature1, newinstr2, feature2) \
> -       asm volatile(ALTERNATIVE_2(oldinstr, newinstr1, feature1, newinstr2, feature2) ::: "memory")
> +       asm volatile inline(ALTERNATIVE_2(oldinstr, newinstr1, feature1, newinstr2, feature2) ::: "memory")
>  
>  /*
>   * Alternative inline assembly with input.
> ---
> 
> Build failed at link time with:
> 
> arch/x86/boot/compressed/cmdline.o: In function `native_save_fl':
> cmdline.c:(.text+0x0): multiple definition of `native_save_fl'
> arch/x86/boot/compressed/misc.o:misc.c:(.text+0x1e0): first defined here
> arch/x86/boot/compressed/cmdline.o: In function `native_restore_fl':
> cmdline.c:(.text+0x10): multiple definition of `native_restore_fl'
> arch/x86/boot/compressed/misc.o:misc.c:(.text+0x1f0): first defined here
> arch/x86/boot/compressed/error.o: In function `native_save_fl':
> error.c:(.text+0x0): multiple definition of `native_save_fl'
> 
> which I had to fix with this:
> 
> ---
> diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
> index 15450a675031..0d772598c37c 100644
> --- a/arch/x86/include/asm/irqflags.h
> +++ b/arch/x86/include/asm/irqflags.h
> @@ -14,8 +14,7 @@
>   */
>  
>  /* Declaration required for gcc < 4.9 to prevent -Werror=missing-prototypes */
> -extern inline unsigned long native_save_fl(void);
> -extern inline unsigned long native_save_fl(void)
> +static inline unsigned long native_save_fl(void)
>  {
>         unsigned long flags;
>  
> @@ -33,8 +32,7 @@ ex
> ---
> 
> That "extern inline" declaration looks fishy to me anyway, maybe not really
> needed ? Apparently, gcc < 4.9 complains with -Werror=missing-prototypes...
> 
> Then the build worked and the results look like this:
> 
>     text     data      bss       dec      hex filename
> 17287384  5040656  2019532  24347572  17383b4 vmlinux-before
> 17288020  5040656  2015436  24344112  1737630 vmlinux-2nd-version
> 
> so some inlining must be happening.
> 
> Then I did this:
> 
> ---
> diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
> index 9c5606d88f61..a0170344cf08 100644
> --- a/arch/x86/lib/usercopy_64.c
> +++ b/arch/x86/lib/usercopy_64.c
> @@ -20,7 +20,7 @@ unsigned long __clear_user(void __user *addr, unsigned long size)
>         /* no memory constraint because it doesn't change any memory gcc knows
>            about */
>         stac();
> -       asm volatile(
> +       asm volatile inline(
>                 "       testq  %[size8],%[size8]\n"
>                 "       jz     4f\n"
>                 "0:     movq $0,(%[dst])\n"
> ---
> 
> to force inlining of a somewhat bigger asm() statement. And yap, more
> got inlined:
> 
>     text     data      bss       dec      hex filename
> 17287384  5040656  2019532  24347572  17383b4 vmlinux-before
> 17288020  5040656  2015436  24344112  1737630 vmlinux-2nd-version
> 17288076  5040656  2015436  24344168  1737668 vmlinux-2nd-version__clear_user
> 
> so more stuff gets inlined.
> 
> Looking at the asm output, it had before:
> 
> ---
> clear_user:
> # ./arch/x86/include/asm/current.h:15:  return this_cpu_read_stable(current_task);
> #APP
> # 15 "./arch/x86/include/asm/current.h" 1
>         movq %gs:current_task,%rax      #, pfo_ret__
> # 0 "" 2
> # arch/x86/lib/usercopy_64.c:51:        if (access_ok(VERIFY_WRITE, to, n))
> #NO_APP
>         movq    2520(%rax), %rdx        # pfo_ret___12->thread.addr_limit.seg, _1
>         movq    %rdi, %rax      # to, tmp93
>         addq    %rsi, %rax      # n, tmp93
>         jc      .L3     #,
> # arch/x86/lib/usercopy_64.c:51:        if (access_ok(VERIFY_WRITE, to, n))
>         cmpq    %rax, %rdx      # tmp93, _1
>         jb      .L3     #,
> # arch/x86/lib/usercopy_64.c:52:                return __clear_user(to, n);
>         jmp     __clear_user    #
> ---
> 
> note the JMP to __clear_user. After marking the asm() in __clear_user() as
> inline, clear_user() inlines __clear_user() directly:
> 
> ---
> clear_user:
> # ./arch/x86/include/asm/current.h:15:  return this_cpu_read_stable(current_task);
> #APP
> # 15 "./arch/x86/include/asm/current.h" 1
>         movq %gs:current_task,%rax      #, pfo_ret__
> # 0 "" 2
> # arch/x86/lib/usercopy_64.c:51:        if (access_ok(VERIFY_WRITE, to, n))
> #NO_APP
>         movq    2520(%rax), %rdx        # pfo_ret___12->thread.addr_limit.seg, _1
>         movq    %rdi, %rax      # to, tmp95
>         addq    %rsi, %rax      # n, tmp95
>         jc      .L8     #,
> # arch/x86/lib/usercopy_64.c:51:        if (access_ok(VERIFY_WRITE, to, n))
>         cmpq    %rax, %rdx      # tmp95, _1
>         jb      .L8     #,
> # ./arch/x86/include/asm/smap.h:58:     alternative("", __stringify(__ASM_STAC), X86_FEATURE_SMAP);
> ...
> 
> this last line is the stac() macro which gets inlined due to the
> alternative() inlined macro above.
> 
> So I guess this all looks like what we wanted.
> 
> And if this lands in gcc9, we would need to do a asm_volatile() macro
> which is defined differently based on the compiler used.
> 
> Thoughts, suggestions, etc are most welcome.
> 
> Thx.
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> Good mailing practices for 400: avoid top-posting and trim the reply.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--


* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-13 19:33                         ` Borislav Petkov
  2018-10-13 21:14                           ` Alexander Monakov
  2018-10-25 10:24                           ` Borislav Petkov
@ 2018-10-31 12:55                           ` Peter Zijlstra
  2018-10-31 13:11                             ` Peter Zijlstra
                                               ` (3 more replies)
  2 siblings, 4 replies; 116+ messages in thread
From: Peter Zijlstra @ 2018-10-31 12:55 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Segher Boessenkool, Ingo Molnar, Richard Biener, Michael Matz,
	gcc, Nadav Amit, Ingo Molnar, linux-kernel, x86, Masahiro Yamada,
	Sam Ravnborg, Alok Kataria, Christopher Li, Greg Kroah-Hartman,
	H. Peter Anvin, Jan Beulich, Josh Poimboeuf, Juergen Gross,
	Kate Stewart, Kees Cook, linux-sparse, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa, Andrew Morton

On Sat, Oct 13, 2018 at 09:33:35PM +0200, Borislav Petkov wrote:
> Ok,
> 
> with Segher's help I've been playing with his patch ontop of bleeding
> edge gcc 9 and here are my observations. Please double-check me for
> booboos so that they can be addressed while there's time.
> 
> So here's what I see ontop of 4.19-rc7:
> 
> First marked the alternative asm() as inline and undeffed the "inline"
> keyword. I need to do that because of the funky games we do with
> "inline" redefinitions in include/linux/compiler_types.h.
> 
> And Segher hinted at either doing:
> 
> asm volatile inline(...
> 
> or
> 
> asm volatile __inline__(...
> 
> but both "inline" variants are defined as macros in that file.
> 
> Which means we either need to #undef inline before using it in asm() or
> come up with something cleverer.

# git grep -e "\<__inline__\>" | wc -l
488
# git grep -e "\<__inline\>" | wc -l
56
# git grep -e "\<inline\>" | wc -l
69598

And we already have scripts/checkpatch.pl:

  # Check for __inline__ and __inline, prefer inline

Which suggests we do:

git grep -l "\<__inline\(\|__\)\>" | while read -r file
do
	sed -i -e 's/\<__inline\(\|__\)\>/inline/g' "$file"
done

and get it over with.


Anyway, with the below patch, I get:

    text     data      bss       dec      hex filename
17385183  5064780  1953892  24403855  1745f8f defconfig-build/vmlinux
17385678  5064780  1953892  24404350  174617e defconfig-build/vmlinux

Which shows we inline more (look for asm_volatile for the actual
changes).


So yes, this seems like something we could work with.

---
 Documentation/trace/tracepoint-analysis.rst        |  2 +-
 Documentation/translations/ja_JP/SubmittingPatches |  4 +--
 Documentation/translations/zh_CN/SubmittingPatches |  4 +--
 arch/alpha/include/asm/atomic.h                    | 12 +++----
 arch/alpha/include/asm/bitops.h                    |  6 ++--
 arch/alpha/include/asm/compiler.h                  |  4 +--
 arch/alpha/include/asm/dma.h                       | 22 ++++++------
 arch/alpha/include/asm/floppy.h                    |  4 +--
 arch/alpha/include/asm/irq.h                       |  2 +-
 arch/alpha/include/asm/local.h                     |  4 +--
 arch/alpha/include/asm/smp.h                       |  2 +-
 arch/arm/mach-iop32x/include/mach/uncompress.h     |  2 +-
 arch/arm/mach-iop33x/include/mach/uncompress.h     |  2 +-
 arch/arm/mach-ixp4xx/include/mach/uncompress.h     |  2 +-
 arch/ia64/hp/common/sba_iommu.c                    |  2 +-
 arch/ia64/hp/sim/simeth.c                          |  2 +-
 arch/ia64/include/asm/atomic.h                     |  8 ++---
 arch/ia64/include/asm/bitops.h                     | 34 +++++++++---------
 arch/ia64/include/asm/delay.h                      | 14 ++++----
 arch/ia64/include/asm/irq.h                        |  2 +-
 arch/ia64/include/asm/page.h                       |  2 +-
 arch/ia64/include/asm/sn/leds.h                    |  2 +-
 arch/ia64/include/asm/uaccess.h                    |  4 +--
 arch/ia64/include/uapi/asm/rse.h                   | 12 +++----
 arch/ia64/include/uapi/asm/swab.h                  |  6 ++--
 arch/ia64/oprofile/backtrace.c                     |  4 +--
 arch/m68k/include/asm/blinken.h                    |  2 +-
 arch/m68k/include/asm/checksum.h                   |  2 +-
 arch/m68k/include/asm/dma.h                        | 32 ++++++++---------
 arch/m68k/include/asm/floppy.h                     |  8 ++---
 arch/m68k/include/asm/nettel.h                     |  8 ++---
 arch/m68k/mac/iop.c                                | 14 ++++----
 arch/mips/include/asm/atomic.h                     | 16 ++++-----
 arch/mips/include/asm/checksum.h                   |  2 +-
 arch/mips/include/asm/dma.h                        | 20 +++++------
 arch/mips/include/asm/jazz.h                       |  2 +-
 arch/mips/include/asm/local.h                      |  4 +--
 arch/mips/include/asm/string.h                     |  8 ++---
 arch/mips/kernel/binfmt_elfn32.c                   |  2 +-
 arch/nds32/include/asm/swab.h                      |  4 +--
 arch/parisc/include/asm/atomic.h                   | 20 +++++------
 arch/parisc/include/asm/bitops.h                   | 18 +++++-----
 arch/parisc/include/asm/checksum.h                 |  4 +--
 arch/parisc/include/asm/compat.h                   |  2 +-
 arch/parisc/include/asm/delay.h                    |  2 +-
 arch/parisc/include/asm/dma.h                      | 20 +++++------
 arch/parisc/include/asm/ide.h                      |  8 ++---
 arch/parisc/include/asm/irq.h                      |  2 +-
 arch/parisc/include/asm/spinlock.h                 | 12 +++----
 arch/powerpc/include/asm/atomic.h                  | 40 +++++++++++-----------
 arch/powerpc/include/asm/bitops.h                  | 28 +++++++--------
 arch/powerpc/include/asm/dma.h                     | 20 +++++------
 arch/powerpc/include/asm/edac.h                    |  2 +-
 arch/powerpc/include/asm/irq.h                     |  2 +-
 arch/powerpc/include/asm/local.h                   | 14 ++++----
 arch/sh/include/asm/pgtable_64.h                   |  2 +-
 arch/sh/include/asm/processor_32.h                 |  4 +--
 arch/sh/include/cpu-sh3/cpu/dac.h                  |  6 ++--
 arch/x86/include/asm/alternative.h                 | 14 ++++----
 arch/x86/um/asm/checksum.h                         |  4 +--
 arch/x86/um/asm/checksum_32.h                      |  4 +--
 arch/xtensa/include/asm/checksum.h                 | 14 ++++----
 arch/xtensa/include/asm/cmpxchg.h                  |  4 +--
 arch/xtensa/include/asm/irq.h                      |  2 +-
 block/partitions/amiga.c                           |  2 +-
 drivers/atm/he.c                                   |  6 ++--
 drivers/atm/idt77252.c                             |  6 ++--
 drivers/gpu/drm/mga/mga_drv.h                      |  2 +-
 drivers/gpu/drm/mga/mga_state.c                    | 14 ++++----
 drivers/gpu/drm/r128/r128_drv.h                    |  2 +-
 drivers/gpu/drm/r128/r128_state.c                  | 14 ++++----
 drivers/gpu/drm/via/via_irq.c                      |  2 +-
 drivers/gpu/drm/via/via_verifier.c                 | 30 ++++++++--------
 drivers/isdn/hardware/eicon/platform.h             | 14 ++++----
 drivers/isdn/i4l/isdn_net.c                        | 14 ++++----
 drivers/isdn/i4l/isdn_net.h                        |  8 ++---
 drivers/media/pci/ivtv/ivtv-ioctl.c                |  2 +-
 drivers/net/ethernet/sun/sungem.c                  |  8 ++---
 drivers/net/ethernet/sun/sunhme.c                  |  6 ++--
 drivers/net/hamradio/baycom_ser_fdx.c              |  2 +-
 drivers/net/wan/lapbether.c                        |  2 +-
 drivers/net/wan/n2.c                               |  4 +--
 drivers/parisc/led.c                               |  4 +--
 drivers/parisc/sba_iommu.c                         |  2 +-
 drivers/parport/parport_gsc.c                      |  2 +-
 drivers/parport/parport_gsc.h                      |  4 +--
 drivers/parport/parport_pc.c                       |  2 +-
 drivers/scsi/lpfc/lpfc_scsi.c                      |  2 +-
 drivers/scsi/pcmcia/sym53c500_cs.c                 |  4 +--
 drivers/scsi/qla2xxx/qla_inline.h                  |  2 +-
 drivers/scsi/qla2xxx/qla_os.c                      |  4 +--
 drivers/staging/rtl8723bs/core/rtw_pwrctrl.c       |  4 +--
 drivers/staging/rtl8723bs/core/rtw_wlan_util.c     |  2 +-
 drivers/staging/rtl8723bs/include/drv_types.h      |  6 ++--
 drivers/staging/rtl8723bs/include/ieee80211.h      |  6 ++--
 drivers/staging/rtl8723bs/include/osdep_service.h  | 10 +++---
 .../rtl8723bs/include/osdep_service_linux.h        | 14 ++++----
 drivers/staging/rtl8723bs/include/rtw_mlme.h       | 14 ++++----
 drivers/staging/rtl8723bs/include/rtw_recv.h       | 16 ++++-----
 drivers/staging/rtl8723bs/include/sta_info.h       |  2 +-
 drivers/staging/rtl8723bs/include/wifi.h           | 14 ++++----
 drivers/staging/rtl8723bs/include/wlan_bssdef.h    |  2 +-
 drivers/tty/amiserial.c                            |  2 +-
 drivers/tty/serial/ip22zilog.c                     |  2 +-
 drivers/tty/serial/sunsab.c                        |  4 +--
 drivers/tty/serial/sunzilog.c                      |  2 +-
 drivers/video/fbdev/core/fbcon.c                   | 20 +++++------
 drivers/video/fbdev/ffb.c                          |  2 +-
 drivers/video/fbdev/intelfb/intelfbdrv.c           | 10 +++---
 drivers/video/fbdev/intelfb/intelfbhw.c            |  2 +-
 drivers/w1/masters/matrox_w1.c                     |  4 +--
 fs/coda/coda_linux.h                               |  6 ++--
 fs/freevxfs/vxfs_inode.c                           |  2 +-
 fs/nfsd/nfsfh.h                                    |  4 +--
 include/acpi/platform/acgcc.h                      |  2 +-
 include/acpi/platform/acintel.h                    |  2 +-
 include/asm-generic/ide_iops.h                     |  8 ++---
 include/linux/atalk.h                              |  4 +--
 include/linux/ceph/messenger.h                     |  2 +-
 include/linux/compiler_types.h                     |  4 +--
 include/linux/hdlc.h                               |  4 +--
 include/linux/inetdevice.h                         |  8 ++---
 include/linux/parport.h                            |  4 +--
 include/linux/parport_pc.h                         | 22 ++++++------
 include/net/ax25.h                                 |  2 +-
 include/net/checksum.h                             |  2 +-
 include/net/dn_nsp.h                               | 16 ++++-----
 include/net/ip.h                                   |  2 +-
 include/net/ip6_checksum.h                         |  2 +-
 include/net/ipx.h                                  | 10 +++---
 include/net/llc_c_ev.h                             |  4 +--
 include/net/llc_conn.h                             |  4 +--
 include/net/llc_s_ev.h                             |  2 +-
 include/net/netrom.h                               |  8 ++---
 include/net/scm.h                                  | 14 ++++----
 include/net/udplite.h                              |  2 +-
 include/net/x25.h                                  |  8 ++---
 include/net/xfrm.h                                 | 18 +++++-----
 include/uapi/linux/atm.h                           |  4 +--
 include/uapi/linux/atmsap.h                        |  2 +-
 include/uapi/linux/map_to_7segment.h               |  2 +-
 include/uapi/linux/netfilter_arp/arp_tables.h      |  2 +-
 include/uapi/linux/netfilter_bridge/ebtables.h     |  2 +-
 include/uapi/linux/netfilter_ipv4/ip_tables.h      |  2 +-
 include/uapi/linux/netfilter_ipv6/ip6_tables.h     |  2 +-
 include/video/newport.h                            | 12 +++----
 lib/zstd/mem.h                                     |  2 +-
 net/appletalk/atalk_proc.c                         |  4 +--
 net/appletalk/ddp.c                                |  2 +-
 net/core/neighbour.c                               |  2 +-
 net/core/scm.c                                     |  2 +-
 net/decnet/dn_nsp_in.c                             |  2 +-
 net/decnet/dn_nsp_out.c                            |  2 +-
 net/decnet/dn_route.c                              |  2 +-
 net/decnet/dn_table.c                              |  4 +--
 net/ipv4/igmp.c                                    |  2 +-
 net/ipv6/af_inet6.c                                |  2 +-
 net/ipv6/icmp.c                                    |  4 +--
 net/ipv6/udp.c                                     |  4 +--
 net/lapb/lapb_iface.c                              |  4 +--
 net/llc/llc_input.c                                |  2 +-
 scripts/checkpatch.pl                              |  8 ++---
 scripts/genksyms/keywords.c                        |  4 +--
 scripts/kernel-doc                                 |  4 +--
 sound/sparc/amd7930.c                              |  6 ++--
 165 files changed, 547 insertions(+), 547 deletions(-)

diff --git a/Documentation/trace/tracepoint-analysis.rst b/Documentation/trace/tracepoint-analysis.rst
index 716326b9f152..edd735a9ba2d 100644
--- a/Documentation/trace/tracepoint-analysis.rst
+++ b/Documentation/trace/tracepoint-analysis.rst
@@ -315,7 +315,7 @@ To see where within the function pixmanFillsse2 things are going wrong:
     0.00 :         34eeb:       0f 18 08                prefetcht0 (%eax)
          :      }
          :
-         :      extern __inline void __attribute__((__gnu_inline__, __always_inline__, _
+         :      extern inline void __attribute__((__gnu_inline__, __always_inline__, _
          :      _mm_store_si128 (__m128i *__P, __m128i __B) :      {
          :        *__P = __B;
    12.40 :         34eee:       66 0f 7f 80 40 ff ff    movdqa %xmm0,-0xc0(%eax)
diff --git a/Documentation/translations/ja_JP/SubmittingPatches b/Documentation/translations/ja_JP/SubmittingPatches
index 02139656463e..188846f1184b 100644
--- a/Documentation/translations/ja_JP/SubmittingPatches
+++ b/Documentation/translations/ja_JP/SubmittingPatches
@@ -676,8 +676,8 @@ gcc においては、マクロと同じくらい軽いです。
 いくつかの特定のケース)や「 static inline 」関数を使うことができないような
 場所(マクロの引数の文字列連結のような)にだけ使われるべきです。
 
-「 static inline 」は「 static __inline__ 」や「 extern inline 」や
-「 extern __inline__ 」よりも適切です。
+「 static inline 」は「 static inline 」や「 extern inline 」や
+「 extern inline 」よりも適切です。
 
 4) 設計に凝りすぎるな
 
diff --git a/Documentation/translations/zh_CN/SubmittingPatches b/Documentation/translations/zh_CN/SubmittingPatches
index e9098da8f1a4..5c39b6c487b5 100644
--- a/Documentation/translations/zh_CN/SubmittingPatches
+++ b/Documentation/translations/zh_CN/SubmittingPatches
@@ -377,8 +377,8 @@ Static inline 函数相比宏来说,是好得多的选择。Static inline 函
 
 宏只在 static inline 函数不是最优的时候[在 fast paths 里有很少的独立的
 案例],或者不可能用 static inline 函数的时候[例如字符串分配]。
-应该用 'static inline' 而不是 'static __inline__', 'extern inline' 和
-'extern __inline__' 。
+应该用 'static inline' 而不是 'static inline', 'extern inline' 和
+'extern inline' 。
 
 4) 不要过度设计
 
diff --git a/arch/alpha/include/asm/atomic.h b/arch/alpha/include/asm/atomic.h
index 150a1c5d6a2c..f8d494bcd72c 100644
--- a/arch/alpha/include/asm/atomic.h
+++ b/arch/alpha/include/asm/atomic.h
@@ -40,7 +40,7 @@
  */
 
 #define ATOMIC_OP(op, asm_op)						\
-static __inline__ void atomic_##op(int i, atomic_t * v)			\
+static inline void atomic_##op(int i, atomic_t * v)			\
 {									\
 	unsigned long temp;						\
 	__asm__ __volatile__(						\
@@ -93,7 +93,7 @@ static inline int atomic_fetch_##op##_relaxed(int i, atomic_t *v)	\
 }
 
 #define ATOMIC64_OP(op, asm_op)						\
-static __inline__ void atomic64_##op(long i, atomic64_t * v)		\
+static inline void atomic64_##op(long i, atomic64_t * v)		\
 {									\
 	unsigned long temp;						\
 	__asm__ __volatile__(						\
@@ -109,7 +109,7 @@ static __inline__ void atomic64_##op(long i, atomic64_t * v)		\
 }									\
 
 #define ATOMIC64_OP_RETURN(op, asm_op)					\
-static __inline__ long atomic64_##op##_return_relaxed(long i, atomic64_t * v)	\
+static inline long atomic64_##op##_return_relaxed(long i, atomic64_t * v)	\
 {									\
 	long temp, result;						\
 	__asm__ __volatile__(						\
@@ -128,7 +128,7 @@ static __inline__ long atomic64_##op##_return_relaxed(long i, atomic64_t * v)	\
 }
 
 #define ATOMIC64_FETCH_OP(op, asm_op)					\
-static __inline__ long atomic64_fetch_##op##_relaxed(long i, atomic64_t * v)	\
+static inline long atomic64_fetch_##op##_relaxed(long i, atomic64_t * v)	\
 {									\
 	long temp, result;						\
 	__asm__ __volatile__(						\
@@ -214,7 +214,7 @@ ATOMIC_OPS(xor, xor)
  * Atomically adds @a to @v, so long as it was not @u.
  * Returns the old value of @v.
  */
-static __inline__ int atomic_fetch_add_unless(atomic_t *v, int a, int u)
+static inline int atomic_fetch_add_unless(atomic_t *v, int a, int u)
 {
 	int c, new, old;
 	smp_mb();
@@ -246,7 +246,7 @@ static __inline__ int atomic_fetch_add_unless(atomic_t *v, int a, int u)
  * Atomically adds @a to @v, so long as it was not @u.
  * Returns the old value of @v.
  */
-static __inline__ long atomic64_fetch_add_unless(atomic64_t *v, long a, long u)
+static inline long atomic64_fetch_add_unless(atomic64_t *v, long a, long u)
 {
 	long c, new, old;
 	smp_mb();
diff --git a/arch/alpha/include/asm/bitops.h b/arch/alpha/include/asm/bitops.h
index ca43f4d0b937..4d08bd8243c0 100644
--- a/arch/alpha/include/asm/bitops.h
+++ b/arch/alpha/include/asm/bitops.h
@@ -82,7 +82,7 @@ clear_bit_unlock(unsigned long nr, volatile void * addr)
 /*
  * WARNING: non atomic version.
  */
-static __inline__ void
+static inline void
 __clear_bit(unsigned long nr, volatile void * addr)
 {
 	int *m = ((int *) addr) + (nr >> 5);
@@ -118,7 +118,7 @@ change_bit(unsigned long nr, volatile void * addr)
 /*
  * WARNING: non atomic version.
  */
-static __inline__ void
+static inline void
 __change_bit(unsigned long nr, volatile void * addr)
 {
 	int *m = ((int *) addr) + (nr >> 5);
@@ -272,7 +272,7 @@ test_and_change_bit(unsigned long nr, volatile void * addr)
 /*
  * WARNING: non atomic version.
  */
-static __inline__ int
+static inline int
 __test_and_change_bit(unsigned long nr, volatile void * addr)
 {
 	unsigned long mask = 1 << (nr & 0x1f);
diff --git a/arch/alpha/include/asm/compiler.h b/arch/alpha/include/asm/compiler.h
index 5159ba259d65..11176dc182f8 100644
--- a/arch/alpha/include/asm/compiler.h
+++ b/arch/alpha/include/asm/compiler.h
@@ -10,8 +10,8 @@
 
 #include <linux/compiler.h>
 #undef inline
-#undef __inline__
-#undef __inline
+#undef inline
+#undef inline
 #undef __always_inline
 #define __always_inline		inline __attribute__((always_inline))
 
diff --git a/arch/alpha/include/asm/dma.h b/arch/alpha/include/asm/dma.h
index 28610ea7786d..73bf4154b6a5 100644
--- a/arch/alpha/include/asm/dma.h
+++ b/arch/alpha/include/asm/dma.h
@@ -198,20 +198,20 @@
 
 extern spinlock_t  dma_spin_lock;
 
-static __inline__ unsigned long claim_dma_lock(void)
+static inline unsigned long claim_dma_lock(void)
 {
 	unsigned long flags;
 	spin_lock_irqsave(&dma_spin_lock, flags);
 	return flags;
 }
 
-static __inline__ void release_dma_lock(unsigned long flags)
+static inline void release_dma_lock(unsigned long flags)
 {
 	spin_unlock_irqrestore(&dma_spin_lock, flags);
 }
 
 /* enable/disable a specific DMA channel */
-static __inline__ void enable_dma(unsigned int dmanr)
+static inline void enable_dma(unsigned int dmanr)
 {
 	if (dmanr<=3)
 		dma_outb(dmanr,  DMA1_MASK_REG);
@@ -219,7 +219,7 @@ static __inline__ void enable_dma(unsigned int dmanr)
 		dma_outb(dmanr & 3,  DMA2_MASK_REG);
 }
 
-static __inline__ void disable_dma(unsigned int dmanr)
+static inline void disable_dma(unsigned int dmanr)
 {
 	if (dmanr<=3)
 		dma_outb(dmanr | 4,  DMA1_MASK_REG);
@@ -234,7 +234,7 @@ static __inline__ void disable_dma(unsigned int dmanr)
  * --- In order to do that, the DMA routines below should ---
  * --- only be used while interrupts are disabled! ---
  */
-static __inline__ void clear_dma_ff(unsigned int dmanr)
+static inline void clear_dma_ff(unsigned int dmanr)
 {
 	if (dmanr<=3)
 		dma_outb(0,  DMA1_CLEAR_FF_REG);
@@ -243,7 +243,7 @@ static __inline__ void clear_dma_ff(unsigned int dmanr)
 }
 
 /* set mode (above) for a specific DMA channel */
-static __inline__ void set_dma_mode(unsigned int dmanr, char mode)
+static inline void set_dma_mode(unsigned int dmanr, char mode)
 {
 	if (dmanr<=3)
 		dma_outb(mode | dmanr,  DMA1_MODE_REG);
@@ -252,7 +252,7 @@ static __inline__ void set_dma_mode(unsigned int dmanr, char mode)
 }
 
 /* set extended mode for a specific DMA channel */
-static __inline__ void set_dma_ext_mode(unsigned int dmanr, char ext_mode)
+static inline void set_dma_ext_mode(unsigned int dmanr, char ext_mode)
 {
 	if (dmanr<=3)
 		dma_outb(ext_mode | dmanr,  DMA1_EXT_MODE_REG);
@@ -264,7 +264,7 @@ static __inline__ void set_dma_ext_mode(unsigned int dmanr, char ext_mode)
  * This is used for successive transfers when we know the contents of
  * the lower 16 bits of the DMA current address register.
  */
-static __inline__ void set_dma_page(unsigned int dmanr, unsigned int pagenr)
+static inline void set_dma_page(unsigned int dmanr, unsigned int pagenr)
 {
 	switch(dmanr) {
 		case 0:
@@ -302,7 +302,7 @@ static __inline__ void set_dma_page(unsigned int dmanr, unsigned int pagenr)
 /* Set transfer address & page bits for specific DMA channel.
  * Assumes dma flipflop is clear.
  */
-static __inline__ void set_dma_addr(unsigned int dmanr, unsigned int a)
+static inline void set_dma_addr(unsigned int dmanr, unsigned int a)
 {
 	if (dmanr <= 3)  {
 	    dma_outb( a & 0xff, ((dmanr&3)<<1) + IO_DMA1_BASE );
@@ -323,7 +323,7 @@ static __inline__ void set_dma_addr(unsigned int dmanr, unsigned int a)
  * Assumes dma flip-flop is clear.
  * NOTE 2: "count" represents _bytes_ and must be even for channels 5-7.
  */
-static __inline__ void set_dma_count(unsigned int dmanr, unsigned int count)
+static inline void set_dma_count(unsigned int dmanr, unsigned int count)
 {
         count--;
 	if (dmanr <= 3)  {
@@ -344,7 +344,7 @@ static __inline__ void set_dma_count(unsigned int dmanr, unsigned int count)
  *
  * Assumes DMA flip-flop is clear.
  */
-static __inline__ int get_dma_residue(unsigned int dmanr)
+static inline int get_dma_residue(unsigned int dmanr)
 {
 	unsigned int io_port = (dmanr<=3)? ((dmanr&3)<<1) + 1 + IO_DMA1_BASE
 					 : ((dmanr&3)<<2) + 2 + IO_DMA2_BASE;
diff --git a/arch/alpha/include/asm/floppy.h b/arch/alpha/include/asm/floppy.h
index 942924756cf2..37947c2a8336 100644
--- a/arch/alpha/include/asm/floppy.h
+++ b/arch/alpha/include/asm/floppy.h
@@ -34,7 +34,7 @@
 
 #define fd_dma_setup(addr,size,mode,io) alpha_fd_dma_setup(addr,size,mode,io)
 
-static __inline__ int 
+static inline int 
 alpha_fd_dma_setup(char *addr, unsigned long size, int mode, int io)
 {
 	static unsigned long prev_size;
@@ -72,7 +72,7 @@ alpha_fd_dma_setup(char *addr, unsigned long size, int mode, int io)
 
 #endif /* CONFIG_PCI */
 
-__inline__ void virtual_dma_init(void)
+inline void virtual_dma_init(void)
 {
 	/* Nothing to do on an Alpha */
 }
diff --git a/arch/alpha/include/asm/irq.h b/arch/alpha/include/asm/irq.h
index 4d17cacd1462..bfad8598e972 100644
--- a/arch/alpha/include/asm/irq.h
+++ b/arch/alpha/include/asm/irq.h
@@ -77,7 +77,7 @@
 # define NR_IRQS	16
 #endif
 
-static __inline__ int irq_canonicalize(int irq)
+static inline int irq_canonicalize(int irq)
 {
 	/*
 	 * XXX is this true for all Alpha's?  The old serial driver
diff --git a/arch/alpha/include/asm/local.h b/arch/alpha/include/asm/local.h
index fab26a1c93d5..b13615ee5431 100644
--- a/arch/alpha/include/asm/local.h
+++ b/arch/alpha/include/asm/local.h
@@ -18,7 +18,7 @@ typedef struct
 #define local_add(i,l)	atomic_long_add((i),(&(l)->a))
 #define local_sub(i,l)	atomic_long_sub((i),(&(l)->a))
 
-static __inline__ long local_add_return(long i, local_t * l)
+static inline long local_add_return(long i, local_t * l)
 {
 	long temp, result;
 	__asm__ __volatile__(
@@ -35,7 +35,7 @@ static __inline__ long local_add_return(long i, local_t * l)
 	return result;
 }
 
-static __inline__ long local_sub_return(long i, local_t * l)
+static inline long local_sub_return(long i, local_t * l)
 {
 	long temp, result;
 	__asm__ __volatile__(
diff --git a/arch/alpha/include/asm/smp.h b/arch/alpha/include/asm/smp.h
index 2264ae72673b..f5e49acf60a4 100644
--- a/arch/alpha/include/asm/smp.h
+++ b/arch/alpha/include/asm/smp.h
@@ -9,7 +9,7 @@
 
 /* HACK: Cabrio WHAMI return value is bogus if more than 8 bits used.. :-( */
 
-static __inline__ unsigned char
+static inline unsigned char
 __hard_smp_processor_id(void)
 {
 	register unsigned char __r0 __asm__("$0");
diff --git a/arch/arm/mach-iop32x/include/mach/uncompress.h b/arch/arm/mach-iop32x/include/mach/uncompress.h
index ed4ac3e28fa1..24d1dd4ea27f 100644
--- a/arch/arm/mach-iop32x/include/mach/uncompress.h
+++ b/arch/arm/mach-iop32x/include/mach/uncompress.h
@@ -23,7 +23,7 @@ static inline void flush(void)
 {
 }
 
-static __inline__ void __arch_decomp_setup(unsigned long arch_id)
+static inline void __arch_decomp_setup(unsigned long arch_id)
 {
 	if (machine_is_iq80321())
 		uart_base = (volatile u8 *)IQ80321_UART;
diff --git a/arch/arm/mach-iop33x/include/mach/uncompress.h b/arch/arm/mach-iop33x/include/mach/uncompress.h
index 62b71cde1f79..9fc5a2aae8de 100644
--- a/arch/arm/mach-iop33x/include/mach/uncompress.h
+++ b/arch/arm/mach-iop33x/include/mach/uncompress.h
@@ -23,7 +23,7 @@ static inline void flush(void)
 {
 }
 
-static __inline__ void __arch_decomp_setup(unsigned long arch_id)
+static inline void __arch_decomp_setup(unsigned long arch_id)
 {
 	if (machine_is_iq80331() || machine_is_iq80332())
 		uart_base = (volatile u32 *)IOP33X_UART0_PHYS;
diff --git a/arch/arm/mach-ixp4xx/include/mach/uncompress.h b/arch/arm/mach-ixp4xx/include/mach/uncompress.h
index 7b25c0225e46..d9e698f7d7e1 100644
--- a/arch/arm/mach-ixp4xx/include/mach/uncompress.h
+++ b/arch/arm/mach-ixp4xx/include/mach/uncompress.h
@@ -35,7 +35,7 @@ static void flush(void)
 {
 }
 
-static __inline__ void __arch_decomp_setup(unsigned long arch_id)
+static inline void __arch_decomp_setup(unsigned long arch_id)
 {
 	/*
 	 * Some boards are using UART2 as console
diff --git a/arch/ia64/hp/common/sba_iommu.c b/arch/ia64/hp/common/sba_iommu.c
index 671ce1e3f6f2..d597f810bb84 100644
--- a/arch/ia64/hp/common/sba_iommu.c
+++ b/arch/ia64/hp/common/sba_iommu.c
@@ -107,7 +107,7 @@ extern int swiotlb_late_init_with_default_size (size_t size);
 #error FULL_VALID_PDIR and ASSERT_PDIR_SANITY are mutually exclusive
 #endif
 
-#define SBA_INLINE	__inline__
+#define SBA_INLINE	inline
 /* #define SBA_INLINE */
 
 #ifdef DEBUG_SBA_INIT
diff --git a/arch/ia64/hp/sim/simeth.c b/arch/ia64/hp/sim/simeth.c
index f39ef2b4ed72..e0620283dfc6 100644
--- a/arch/ia64/hp/sim/simeth.c
+++ b/arch/ia64/hp/sim/simeth.c
@@ -246,7 +246,7 @@ simeth_open(struct net_device *dev)
 }
 
 /* copied from lapbether.c */
-static __inline__ int dev_is_ethdev(struct net_device *dev)
+static inline int dev_is_ethdev(struct net_device *dev)
 {
        return ( dev->type == ARPHRD_ETHER && strncmp(dev->name, "dummy", 5));
 }
diff --git a/arch/ia64/include/asm/atomic.h b/arch/ia64/include/asm/atomic.h
index 206530d0751b..9408fa96e2dd 100644
--- a/arch/ia64/include/asm/atomic.h
+++ b/arch/ia64/include/asm/atomic.h
@@ -29,7 +29,7 @@
 #define atomic64_set(v,i)	WRITE_ONCE(((v)->counter), (i))
 
 #define ATOMIC_OP(op, c_op)						\
-static __inline__ int							\
+static inline int							\
 ia64_atomic_##op (int i, atomic_t *v)					\
 {									\
 	__s32 old, new;							\
@@ -44,7 +44,7 @@ ia64_atomic_##op (int i, atomic_t *v)					\
 }
 
 #define ATOMIC_FETCH_OP(op, c_op)					\
-static __inline__ int							\
+static inline int							\
 ia64_atomic_fetch_##op (int i, atomic_t *v)				\
 {									\
 	__s32 old, new;							\
@@ -124,7 +124,7 @@ ATOMIC_FETCH_OP(xor, ^)
 #undef ATOMIC_OP
 
 #define ATOMIC64_OP(op, c_op)						\
-static __inline__ long							\
+static inline long							\
 ia64_atomic64_##op (__s64 i, atomic64_t *v)				\
 {									\
 	__s64 old, new;							\
@@ -139,7 +139,7 @@ ia64_atomic64_##op (__s64 i, atomic64_t *v)				\
 }
 
 #define ATOMIC64_FETCH_OP(op, c_op)					\
-static __inline__ long							\
+static inline long							\
 ia64_atomic64_fetch_##op (__s64 i, atomic64_t *v)			\
 {									\
 	__s64 old, new;							\
diff --git a/arch/ia64/include/asm/bitops.h b/arch/ia64/include/asm/bitops.h
index 56a774bf13fa..d3249d081ab6 100644
--- a/arch/ia64/include/asm/bitops.h
+++ b/arch/ia64/include/asm/bitops.h
@@ -36,7 +36,7 @@
  *
  * bit 0 is the LSB of addr; bit 32 is the LSB of (addr+1).
  */
-static __inline__ void
+static inline void
 set_bit (int nr, volatile void *addr)
 {
 	__u32 bit, old, new;
@@ -61,7 +61,7 @@ set_bit (int nr, volatile void *addr)
  * If it's called on the same region of memory simultaneously, the effect
  * may be that only one operation succeeds.
  */
-static __inline__ void
+static inline void
 __set_bit (int nr, volatile void *addr)
 {
 	*((__u32 *) addr + (nr >> 5)) |= (1 << (nr & 31));
@@ -77,7 +77,7 @@ __set_bit (int nr, volatile void *addr)
  * you should call smp_mb__before_atomic() and/or smp_mb__after_atomic()
  * in order to ensure changes are visible on other processors.
  */
-static __inline__ void
+static inline void
 clear_bit (int nr, volatile void *addr)
 {
 	__u32 mask, old, new;
@@ -101,7 +101,7 @@ clear_bit (int nr, volatile void *addr)
  * clear_bit_unlock() is atomic and may not be reordered.  It does
  * contain a memory barrier suitable for unlock type operations.
  */
-static __inline__ void
+static inline void
 clear_bit_unlock (int nr, volatile void *addr)
 {
 	__u32 mask, old, new;
@@ -125,7 +125,7 @@ clear_bit_unlock (int nr, volatile void *addr)
  * Similarly to clear_bit_unlock, the implementation uses a store
  * with release semantics. See also arch_spin_unlock().
  */
-static __inline__ void
+static inline void
 __clear_bit_unlock(int nr, void *addr)
 {
 	__u32 * const m = (__u32 *) addr + (nr >> 5);
@@ -143,7 +143,7 @@ __clear_bit_unlock(int nr, void *addr)
  * If it's called on the same region of memory simultaneously, the effect
  * may be that only one operation succeeds.
  */
-static __inline__ void
+static inline void
 __clear_bit (int nr, volatile void *addr)
 {
 	*((__u32 *) addr + (nr >> 5)) &= ~(1 << (nr & 31));
@@ -158,7 +158,7 @@ __clear_bit (int nr, volatile void *addr)
  * Note that @nr may be almost arbitrarily large; this function is not
  * restricted to acting on a single-word quantity.
  */
-static __inline__ void
+static inline void
 change_bit (int nr, volatile void *addr)
 {
 	__u32 bit, old, new;
@@ -183,7 +183,7 @@ change_bit (int nr, volatile void *addr)
  * If it's called on the same region of memory simultaneously, the effect
  * may be that only one operation succeeds.
  */
-static __inline__ void
+static inline void
 __change_bit (int nr, volatile void *addr)
 {
 	*((__u32 *) addr + (nr >> 5)) ^= (1 << (nr & 31));
@@ -197,7 +197,7 @@ __change_bit (int nr, volatile void *addr)
  * This operation is atomic and cannot be reordered.  
  * It also implies the acquisition side of the memory barrier.
  */
-static __inline__ int
+static inline int
 test_and_set_bit (int nr, volatile void *addr)
 {
 	__u32 bit, old, new;
@@ -232,7 +232,7 @@ test_and_set_bit (int nr, volatile void *addr)
  * If two examples of this operation race, one can appear to succeed
  * but actually fail.  You must protect multiple accesses with a lock.
  */
-static __inline__ int
+static inline int
 __test_and_set_bit (int nr, volatile void *addr)
 {
 	__u32 *p = (__u32 *) addr + (nr >> 5);
@@ -251,7 +251,7 @@ __test_and_set_bit (int nr, volatile void *addr)
  * This operation is atomic and cannot be reordered.  
  * It also implies the acquisition side of the memory barrier.
  */
-static __inline__ int
+static inline int
 test_and_clear_bit (int nr, volatile void *addr)
 {
 	__u32 mask, old, new;
@@ -277,7 +277,7 @@ test_and_clear_bit (int nr, volatile void *addr)
  * If two examples of this operation race, one can appear to succeed
  * but actually fail.  You must protect multiple accesses with a lock.
  */
-static __inline__ int
+static inline int
 __test_and_clear_bit(int nr, volatile void * addr)
 {
 	__u32 *p = (__u32 *) addr + (nr >> 5);
@@ -296,7 +296,7 @@ __test_and_clear_bit(int nr, volatile void * addr)
  * This operation is atomic and cannot be reordered.  
  * It also implies the acquisition side of the memory barrier.
  */
-static __inline__ int
+static inline int
 test_and_change_bit (int nr, volatile void *addr)
 {
 	__u32 bit, old, new;
@@ -320,7 +320,7 @@ test_and_change_bit (int nr, volatile void *addr)
  *
  * This operation is non-atomic and can be reordered.
  */
-static __inline__ int
+static inline int
 __test_and_change_bit (int nr, void *addr)
 {
 	__u32 old, bit = (1 << (nr & 31));
@@ -331,7 +331,7 @@ __test_and_change_bit (int nr, void *addr)
 	return (old & bit) != 0;
 }
 
-static __inline__ int
+static inline int
 test_bit (int nr, const volatile void *addr)
 {
 	return 1 & (((const volatile __u32 *) addr)[nr >> 5] >> (nr & 31));
@@ -359,7 +359,7 @@ ffz (unsigned long x)
  *
  * Undefined if no bit exists, so code should check against 0 first.
  */
-static __inline__ unsigned long
+static inline unsigned long
 __ffs (unsigned long x)
 {
 	unsigned long result;
@@ -427,7 +427,7 @@ __fls (unsigned long x)
  * hweightN: returns the hamming weight (i.e. the number
  * of bits set) of a N-bit word
  */
-static __inline__ unsigned long __arch_hweight64(unsigned long x)
+static inline unsigned long __arch_hweight64(unsigned long x)
 {
 	unsigned long result;
 	result = ia64_popcnt(x);
diff --git a/arch/ia64/include/asm/delay.h b/arch/ia64/include/asm/delay.h
index 0227ac586107..9f7e60496feb 100644
--- a/arch/ia64/include/asm/delay.h
+++ b/arch/ia64/include/asm/delay.h
@@ -20,14 +20,14 @@
 #include <asm/intrinsics.h>
 #include <asm/processor.h>
 
-static __inline__ void
+static inline void
 ia64_set_itm (unsigned long val)
 {
 	ia64_setreg(_IA64_REG_CR_ITM, val);
 	ia64_srlz_d();
 }
 
-static __inline__ unsigned long
+static inline unsigned long
 ia64_get_itm (void)
 {
 	unsigned long result;
@@ -37,27 +37,27 @@ ia64_get_itm (void)
 	return result;
 }
 
-static __inline__ void
+static inline void
 ia64_set_itv (unsigned long val)
 {
 	ia64_setreg(_IA64_REG_CR_ITV, val);
 	ia64_srlz_d();
 }
 
-static __inline__ unsigned long
+static inline unsigned long
 ia64_get_itv (void)
 {
 	return ia64_getreg(_IA64_REG_CR_ITV);
 }
 
-static __inline__ void
+static inline void
 ia64_set_itc (unsigned long val)
 {
 	ia64_setreg(_IA64_REG_AR_ITC, val);
 	ia64_srlz_d();
 }
 
-static __inline__ unsigned long
+static inline unsigned long
 ia64_get_itc (void)
 {
 	unsigned long result;
@@ -75,7 +75,7 @@ ia64_get_itc (void)
 
 extern void ia64_delay_loop (unsigned long loops);
 
-static __inline__ void
+static inline void
 __delay (unsigned long loops)
 {
 	if (unlikely(loops < 1))
diff --git a/arch/ia64/include/asm/irq.h b/arch/ia64/include/asm/irq.h
index 8b84a55ed38a..eab8defa30de 100644
--- a/arch/ia64/include/asm/irq.h
+++ b/arch/ia64/include/asm/irq.h
@@ -16,7 +16,7 @@
 #include <linux/cpumask.h>
 #include <generated/nr-irqs.h>
 
-static __inline__ int
+static inline int
 irq_canonicalize (int irq)
 {
 	/*
diff --git a/arch/ia64/include/asm/page.h b/arch/ia64/include/asm/page.h
index 5798bd2b462c..858317be10cb 100644
--- a/arch/ia64/include/asm/page.h
+++ b/arch/ia64/include/asm/page.h
@@ -154,7 +154,7 @@ typedef union ia64_va {
 extern unsigned int hpage_shift;
 #endif
 
-static __inline__ int
+static inline int
 get_order (unsigned long size)
 {
 	long double d = size - 1;
diff --git a/arch/ia64/include/asm/sn/leds.h b/arch/ia64/include/asm/sn/leds.h
index 66cf8c4d92c9..48aabe95523f 100644
--- a/arch/ia64/include/asm/sn/leds.h
+++ b/arch/ia64/include/asm/sn/leds.h
@@ -22,7 +22,7 @@
  * Basic macros for flashing the LEDS on an SGI SN.
  */
 
-static __inline__ void
+static inline void
 set_led_bits(u8 value, u8 mask)
 {
 	pda->led_state = (pda->led_state & ~mask) | (value & mask);
diff --git a/arch/ia64/include/asm/uaccess.h b/arch/ia64/include/asm/uaccess.h
index a74524f2d625..386bfe1550c0 100644
--- a/arch/ia64/include/asm/uaccess.h
+++ b/arch/ia64/include/asm/uaccess.h
@@ -259,7 +259,7 @@ extern unsigned long __strnlen_user (const char __user *, long);
 })
 
 #define ARCH_HAS_TRANSLATE_MEM_PTR	1
-static __inline__ void *
+static inline void *
 xlate_dev_mem_ptr(phys_addr_t p)
 {
 	struct page *page;
@@ -277,7 +277,7 @@ xlate_dev_mem_ptr(phys_addr_t p)
 /*
  * Convert a virtual cached kernel memory pointer to an uncached pointer
  */
-static __inline__ void *
+static inline void *
 xlate_dev_kmem_ptr(void *p)
 {
 	struct page *page;
diff --git a/arch/ia64/include/uapi/asm/rse.h b/arch/ia64/include/uapi/asm/rse.h
index 6d260af571c5..55e66cae10e1 100644
--- a/arch/ia64/include/uapi/asm/rse.h
+++ b/arch/ia64/include/uapi/asm/rse.h
@@ -8,11 +8,11 @@
  *
  * Register stack engine related helper functions.  This file may be
  * used in applications, so be careful about the name-space and give
- * some consideration to non-GNU C compilers (though __inline__ is
+ * some consideration to non-GNU C compilers (though inline is
  * fine).
  */
 
-static __inline__ unsigned long
+static inline unsigned long
 ia64_rse_slot_num (unsigned long *addr)
 {
 	return (((unsigned long) addr) >> 3) & 0x3f;
@@ -21,7 +21,7 @@ ia64_rse_slot_num (unsigned long *addr)
 /*
  * Return TRUE if ADDR is the address of an RNAT slot.
  */
-static __inline__ unsigned long
+static inline unsigned long
 ia64_rse_is_rnat_slot (unsigned long *addr)
 {
 	return ia64_rse_slot_num(addr) == 0x3f;
@@ -31,7 +31,7 @@ ia64_rse_is_rnat_slot (unsigned long *addr)
  * Returns the address of the RNAT slot that covers the slot at
  * address SLOT_ADDR.
  */
-static __inline__ unsigned long *
+static inline unsigned long *
 ia64_rse_rnat_addr (unsigned long *slot_addr)
 {
 	return (unsigned long *) ((unsigned long) slot_addr | (0x3f << 3));
@@ -42,7 +42,7 @@ ia64_rse_rnat_addr (unsigned long *slot_addr)
  * ending at BSP.  This isn't simply (BSP-BSPSTORE)/8 because every 64th slot stores
  * ar.rnat.
  */
-static __inline__ unsigned long
+static inline unsigned long
 ia64_rse_num_regs (unsigned long *bspstore, unsigned long *bsp)
 {
 	unsigned long slots = (bsp - bspstore);
@@ -54,7 +54,7 @@ ia64_rse_num_regs (unsigned long *bspstore, unsigned long *bsp)
  * The inverse of the above: given bspstore and the number of
  * registers, calculate ar.bsp.
  */
-static __inline__ unsigned long *
+static inline unsigned long *
 ia64_rse_skip_regs (unsigned long *addr, long num_regs)
 {
 	long delta = ia64_rse_slot_num(addr) + num_regs;
diff --git a/arch/ia64/include/uapi/asm/swab.h b/arch/ia64/include/uapi/asm/swab.h
index 79f3fef1a05e..5b5df7a05f7e 100644
--- a/arch/ia64/include/uapi/asm/swab.h
+++ b/arch/ia64/include/uapi/asm/swab.h
@@ -11,7 +11,7 @@
 #include <asm/intrinsics.h>
 #include <linux/compiler.h>
 
-static __inline__ __attribute_const__ __u64 __arch_swab64(__u64 x)
+static inline __attribute_const__ __u64 __arch_swab64(__u64 x)
 {
 	__u64 result;
 
@@ -20,13 +20,13 @@ static __inline__ __attribute_const__ __u64 __arch_swab64(__u64 x)
 }
 #define __arch_swab64 __arch_swab64
 
-static __inline__ __attribute_const__ __u32 __arch_swab32(__u32 x)
+static inline __attribute_const__ __u32 __arch_swab32(__u32 x)
 {
 	return __arch_swab64(x) >> 32;
 }
 #define __arch_swab32 __arch_swab32
 
-static __inline__ __attribute_const__ __u16 __arch_swab16(__u16 x)
+static inline __attribute_const__ __u16 __arch_swab16(__u16 x)
 {
 	return __arch_swab64(x) >> 48;
 }
diff --git a/arch/ia64/oprofile/backtrace.c b/arch/ia64/oprofile/backtrace.c
index 6a219a946050..c9c8282f7a73 100644
--- a/arch/ia64/oprofile/backtrace.c
+++ b/arch/ia64/oprofile/backtrace.c
@@ -32,7 +32,7 @@ typedef struct
 } ia64_backtrace_t;
 
 /* Returns non-zero if the PC is in the Interrupt Vector Table */
-static __inline__ int in_ivt_code(unsigned long pc)
+static inline int in_ivt_code(unsigned long pc)
 {
 	extern char ia64_ivt[];
 	return (pc >= (u_long)ia64_ivt && pc < (u_long)ia64_ivt+32768);
@@ -41,7 +41,7 @@ static __inline__ int in_ivt_code(unsigned long pc)
 /*
  * Unwind to next stack frame.
  */
-static __inline__ int next_frame(ia64_backtrace_t *bt)
+static inline int next_frame(ia64_backtrace_t *bt)
 {
 	/*
 	 * Avoid unsightly console message from unw_unwind() when attempting
diff --git a/arch/m68k/include/asm/blinken.h b/arch/m68k/include/asm/blinken.h
index 0626582a7db4..2e5e74d74f82 100644
--- a/arch/m68k/include/asm/blinken.h
+++ b/arch/m68k/include/asm/blinken.h
@@ -19,7 +19,7 @@
 
 extern unsigned char hp300_ledstate;
 
-static __inline__ void blinken_leds(int on, int off)
+static inline void blinken_leds(int on, int off)
 {
 	if (MACH_IS_HP300)
 	{
diff --git a/arch/m68k/include/asm/checksum.h b/arch/m68k/include/asm/checksum.h
index f9b94e4b94f9..3724307be4a1 100644
--- a/arch/m68k/include/asm/checksum.h
+++ b/arch/m68k/include/asm/checksum.h
@@ -116,7 +116,7 @@ static inline __sum16 ip_compute_csum(const void *buff, int len)
 }
 
 #define _HAVE_ARCH_IPV6_CSUM
-static __inline__ __sum16
+static inline __sum16
 csum_ipv6_magic(const struct in6_addr *saddr, const struct in6_addr *daddr,
 		__u32 len, __u8 proto, __wsum sum)
 {
diff --git a/arch/m68k/include/asm/dma.h b/arch/m68k/include/asm/dma.h
index ae2021964e32..a451778fe112 100644
--- a/arch/m68k/include/asm/dma.h
+++ b/arch/m68k/include/asm/dma.h
@@ -123,7 +123,7 @@ extern unsigned int dma_device_address[MAX_M68K_DMA_CHANNELS];
 
 #if !defined(CONFIG_M5272)
 /* enable/disable a specific DMA channel */
-static __inline__ void enable_dma(unsigned int dmanr)
+static inline void enable_dma(unsigned int dmanr)
 {
   volatile unsigned short *dmawp;
 
@@ -135,7 +135,7 @@ static __inline__ void enable_dma(unsigned int dmanr)
   dmawp[MCFDMA_DCR] |= MCFDMA_DCR_EEXT;
 }
 
-static __inline__ void disable_dma(unsigned int dmanr)
+static inline void disable_dma(unsigned int dmanr)
 {
   volatile unsigned short *dmawp;
   volatile unsigned char  *dmapb;
@@ -162,12 +162,12 @@ static __inline__ void disable_dma(unsigned int dmanr)
  *
  * This is a NOP for ColdFire. Provide a stub for compatibility.
  */
-static __inline__ void clear_dma_ff(unsigned int dmanr)
+static inline void clear_dma_ff(unsigned int dmanr)
 {
 }
 
 /* set mode (above) for a specific DMA channel */
-static __inline__ void set_dma_mode(unsigned int dmanr, char mode)
+static inline void set_dma_mode(unsigned int dmanr, char mode)
 {
 
   volatile unsigned char  *dmabp;
@@ -210,7 +210,7 @@ static __inline__ void set_dma_mode(unsigned int dmanr, char mode)
 }
 
 /* Set transfer address for specific DMA channel */
-static __inline__ void set_dma_addr(unsigned int dmanr, unsigned int a)
+static inline void set_dma_addr(unsigned int dmanr, unsigned int a)
 {
   volatile unsigned short *dmawp;
   volatile unsigned int   *dmalp;
@@ -247,7 +247,7 @@ static __inline__ void set_dma_addr(unsigned int dmanr, unsigned int a)
  * Specific for Coldfire - sets device address.
  * Should be called after the mode set call, and before set DMA address.
  */
-static __inline__ void set_dma_device_addr(unsigned int dmanr, unsigned int a)
+static inline void set_dma_device_addr(unsigned int dmanr, unsigned int a)
 {
 #ifdef DMA_DEBUG
   printk("set_dma_device_addr(dmanr=%d,a=%x)\n", dmanr, a);
@@ -259,7 +259,7 @@ static __inline__ void set_dma_device_addr(unsigned int dmanr, unsigned int a)
 /*
  * NOTE 2: "count" represents _bytes_.
  */
-static __inline__ void set_dma_count(unsigned int dmanr, unsigned int count)
+static inline void set_dma_count(unsigned int dmanr, unsigned int count)
 {
   volatile unsigned short *dmawp;
 
@@ -277,7 +277,7 @@ static __inline__ void set_dma_count(unsigned int dmanr, unsigned int count)
  * still in progress will return unpredictable results.
  * Otherwise, it returns the number of _bytes_ left to transfer.
  */
-static __inline__ int get_dma_residue(unsigned int dmanr)
+static inline int get_dma_residue(unsigned int dmanr)
 {
   volatile unsigned short *dmawp;
   unsigned short count;
@@ -316,7 +316,7 @@ static __inline__ int get_dma_residue(unsigned int dmanr)
  */
 
 /* enable/disable a specific DMA channel */
-static __inline__ void enable_dma(unsigned int dmanr)
+static inline void enable_dma(unsigned int dmanr)
 {
   volatile unsigned int  *dmalp;
 
@@ -328,7 +328,7 @@ static __inline__ void enable_dma(unsigned int dmanr)
   dmalp[MCFDMA_DMR] |= MCFDMA_DMR_EN;
 }
 
-static __inline__ void disable_dma(unsigned int dmanr)
+static inline void disable_dma(unsigned int dmanr)
 {
   volatile unsigned int   *dmalp;
 
@@ -353,12 +353,12 @@ static __inline__ void disable_dma(unsigned int dmanr)
  *
  * This is a NOP for ColdFire. Provide a stub for compatibility.
  */
-static __inline__ void clear_dma_ff(unsigned int dmanr)
+static inline void clear_dma_ff(unsigned int dmanr)
 {
 }
 
 /* set mode (above) for a specific DMA channel */
-static __inline__ void set_dma_mode(unsigned int dmanr, char mode)
+static inline void set_dma_mode(unsigned int dmanr, char mode)
 {
 
   volatile unsigned int   *dmalp;
@@ -396,7 +396,7 @@ static __inline__ void set_dma_mode(unsigned int dmanr, char mode)
 }
 
 /* Set transfer address for specific DMA channel */
-static __inline__ void set_dma_addr(unsigned int dmanr, unsigned int a)
+static inline void set_dma_addr(unsigned int dmanr, unsigned int a)
 {
   volatile unsigned int   *dmalp;
 
@@ -431,7 +431,7 @@ static __inline__ void set_dma_addr(unsigned int dmanr, unsigned int a)
  * Specific for Coldfire - sets device address.
  * Should be called after the mode set call, and before set DMA address.
  */
-static __inline__ void set_dma_device_addr(unsigned int dmanr, unsigned int a)
+static inline void set_dma_device_addr(unsigned int dmanr, unsigned int a)
 {
 #ifdef DMA_DEBUG
   printk("set_dma_device_addr(dmanr=%d,a=%x)\n", dmanr, a);
@@ -445,7 +445,7 @@ static __inline__ void set_dma_device_addr(unsigned int dmanr, unsigned int a)
  *
  * NOTE 3: While a 32-bit register, "count" is only a maximum 24-bit value.
  */
-static __inline__ void set_dma_count(unsigned int dmanr, unsigned int count)
+static inline void set_dma_count(unsigned int dmanr, unsigned int count)
 {
   volatile unsigned int *dmalp;
 
@@ -463,7 +463,7 @@ static __inline__ void set_dma_count(unsigned int dmanr, unsigned int count)
  * still in progress will return unpredictable results.
  * Otherwise, it returns the number of _bytes_ left to transfer.
  */
-static __inline__ int get_dma_residue(unsigned int dmanr)
+static inline int get_dma_residue(unsigned int dmanr)
 {
   volatile unsigned int *dmalp;
   unsigned int count;
diff --git a/arch/m68k/include/asm/floppy.h b/arch/m68k/include/asm/floppy.h
index c3b9ad6732fc..7a658a5ca746 100644
--- a/arch/m68k/include/asm/floppy.h
+++ b/arch/m68k/include/asm/floppy.h
@@ -50,20 +50,20 @@ static int doing_pdma=0;
 
 extern spinlock_t  dma_spin_lock;
 
-static __inline__ unsigned long claim_dma_lock(void)
+static inline unsigned long claim_dma_lock(void)
 {
 	unsigned long flags;
 	spin_lock_irqsave(&dma_spin_lock, flags);
 	return flags;
 }
 
-static __inline__ void release_dma_lock(unsigned long flags)
+static inline void release_dma_lock(unsigned long flags)
 {
 	spin_unlock_irqrestore(&dma_spin_lock, flags);
 }
 
 
-static __inline__ unsigned char fd_inb(int port)
+static inline unsigned char fd_inb(int port)
 {
 	if(MACH_IS_Q40)
 		return inb_p(port);
@@ -72,7 +72,7 @@ static __inline__ unsigned char fd_inb(int port)
 	return 0;
 }
 
-static __inline__ void fd_outb(unsigned char value, int port)
+static inline void fd_outb(unsigned char value, int port)
 {
 	if(MACH_IS_Q40)
 		outb_p(value, port);
diff --git a/arch/m68k/include/asm/nettel.h b/arch/m68k/include/asm/nettel.h
index 45716ead7b9d..4037377c0048 100644
--- a/arch/m68k/include/asm/nettel.h
+++ b/arch/m68k/include/asm/nettel.h
@@ -47,14 +47,14 @@ extern volatile unsigned short ppdata;
  *	These functions defined to give quasi generic access to the
  *	PPIO bits used for DTR/DCD.
  */
-static __inline__ unsigned int mcf_getppdata(void)
+static inline unsigned int mcf_getppdata(void)
 {
 	volatile unsigned short *pp;
 	pp = (volatile unsigned short *) MCFSIM_PADAT;
 	return((unsigned int) *pp);
 }
 
-static __inline__ void mcf_setppdata(unsigned int mask, unsigned int bits)
+static inline void mcf_setppdata(unsigned int mask, unsigned int bits)
 {
 	volatile unsigned short *pp;
 	pp = (volatile unsigned short *) MCFSIM_PADAT;
@@ -86,12 +86,12 @@ static __inline__ void mcf_setppdata(unsigned int mask, unsigned int bits)
  *	These functions defined to give quasi generic access to the
  *	PPIO bits used for DTR/DCD.
  */
-static __inline__ unsigned int mcf_getppdata(void)
+static inline unsigned int mcf_getppdata(void)
 {
 	return readw(MCFSIM_PBDAT);
 }
 
-static __inline__ void mcf_setppdata(unsigned int mask, unsigned int bits)
+static inline void mcf_setppdata(unsigned int mask, unsigned int bits)
 {
 	writew((readw(MCFSIM_PBDAT) & ~mask) | bits, MCFSIM_PBDAT);
 }
diff --git a/arch/m68k/mac/iop.c b/arch/m68k/mac/iop.c
index 9bfa17015768..288b07530802 100644
--- a/arch/m68k/mac/iop.c
+++ b/arch/m68k/mac/iop.c
@@ -161,42 +161,42 @@ irqreturn_t iop_ism_irq(int, void *);
  * Private access functions
  */
 
-static __inline__ void iop_loadaddr(volatile struct mac_iop *iop, __u16 addr)
+static inline void iop_loadaddr(volatile struct mac_iop *iop, __u16 addr)
 {
 	iop->ram_addr_lo = addr;
 	iop->ram_addr_hi = addr >> 8;
 }
 
-static __inline__ __u8 iop_readb(volatile struct mac_iop *iop, __u16 addr)
+static inline __u8 iop_readb(volatile struct mac_iop *iop, __u16 addr)
 {
 	iop->ram_addr_lo = addr;
 	iop->ram_addr_hi = addr >> 8;
 	return iop->ram_data;
 }
 
-static __inline__ void iop_writeb(volatile struct mac_iop *iop, __u16 addr, __u8 data)
+static inline void iop_writeb(volatile struct mac_iop *iop, __u16 addr, __u8 data)
 {
 	iop->ram_addr_lo = addr;
 	iop->ram_addr_hi = addr >> 8;
 	iop->ram_data = data;
 }
 
-static __inline__ void iop_stop(volatile struct mac_iop *iop)
+static inline void iop_stop(volatile struct mac_iop *iop)
 {
 	iop->status_ctrl &= ~IOP_RUN;
 }
 
-static __inline__ void iop_start(volatile struct mac_iop *iop)
+static inline void iop_start(volatile struct mac_iop *iop)
 {
 	iop->status_ctrl = IOP_RUN | IOP_AUTOINC;
 }
 
-static __inline__ void iop_bypass(volatile struct mac_iop *iop)
+static inline void iop_bypass(volatile struct mac_iop *iop)
 {
 	iop->status_ctrl |= IOP_BYPASS;
 }
 
-static __inline__ void iop_interrupt(volatile struct mac_iop *iop)
+static inline void iop_interrupt(volatile struct mac_iop *iop)
 {
 	iop->status_ctrl |= IOP_IRQ;
 }
diff --git a/arch/mips/include/asm/atomic.h b/arch/mips/include/asm/atomic.h
index d4ea7a5b60cf..c179c12cd7c7 100644
--- a/arch/mips/include/asm/atomic.h
+++ b/arch/mips/include/asm/atomic.h
@@ -53,7 +53,7 @@
 #define atomic_set(v, i)	WRITE_ONCE((v)->counter, (i))
 
 #define ATOMIC_OP(op, c_op, asm_op)					      \
-static __inline__ void atomic_##op(int i, atomic_t * v)			      \
+static inline void atomic_##op(int i, atomic_t * v)			      \
 {									      \
 	if (kernel_uses_llsc) {						      \
 		int temp;						      \
@@ -77,7 +77,7 @@ static __inline__ void atomic_##op(int i, atomic_t * v)			      \
 }
 
 #define ATOMIC_OP_RETURN(op, c_op, asm_op)				      \
-static __inline__ int atomic_##op##_return_relaxed(int i, atomic_t * v)	      \
+static inline int atomic_##op##_return_relaxed(int i, atomic_t * v)	      \
 {									      \
 	int result;							      \
 									      \
@@ -109,7 +109,7 @@ static __inline__ int atomic_##op##_return_relaxed(int i, atomic_t * v)	      \
 }
 
 #define ATOMIC_FETCH_OP(op, c_op, asm_op)				      \
-static __inline__ int atomic_fetch_##op##_relaxed(int i, atomic_t * v)	      \
+static inline int atomic_fetch_##op##_relaxed(int i, atomic_t * v)	      \
 {									      \
 	int result;							      \
 									      \
@@ -178,7 +178,7 @@ ATOMIC_OPS(xor, ^=, xor)
  * Atomically test @v and subtract @i if @v is greater or equal than @i.
  * The function returns the old value of @v minus @i.
  */
-static __inline__ int atomic_sub_if_positive(int i, atomic_t * v)
+static inline int atomic_sub_if_positive(int i, atomic_t * v)
 {
 	int result;
 
@@ -246,7 +246,7 @@ static __inline__ int atomic_sub_if_positive(int i, atomic_t * v)
 #define atomic64_set(v, i)	WRITE_ONCE((v)->counter, (i))
 
 #define ATOMIC64_OP(op, c_op, asm_op)					      \
-static __inline__ void atomic64_##op(long i, atomic64_t * v)		      \
+static inline void atomic64_##op(long i, atomic64_t * v)		      \
 {									      \
 	if (kernel_uses_llsc) {						      \
 		long temp;						      \
@@ -270,7 +270,7 @@ static __inline__ void atomic64_##op(long i, atomic64_t * v)		      \
 }
 
 #define ATOMIC64_OP_RETURN(op, c_op, asm_op)				      \
-static __inline__ long atomic64_##op##_return_relaxed(long i, atomic64_t * v) \
+static inline long atomic64_##op##_return_relaxed(long i, atomic64_t * v) \
 {									      \
 	long result;							      \
 									      \
@@ -302,7 +302,7 @@ static __inline__ long atomic64_##op##_return_relaxed(long i, atomic64_t * v) \
 }
 
 #define ATOMIC64_FETCH_OP(op, c_op, asm_op)				      \
-static __inline__ long atomic64_fetch_##op##_relaxed(long i, atomic64_t * v)  \
+static inline long atomic64_fetch_##op##_relaxed(long i, atomic64_t * v)  \
 {									      \
 	long result;							      \
 									      \
@@ -372,7 +372,7 @@ ATOMIC64_OPS(xor, ^=, xor)
  * Atomically test @v and subtract @i if @v is greater or equal than @i.
  * The function returns the old value of @v minus @i.
  */
-static __inline__ long atomic64_sub_if_positive(long i, atomic64_t * v)
+static inline long atomic64_sub_if_positive(long i, atomic64_t * v)
 {
 	long result;
 
diff --git a/arch/mips/include/asm/checksum.h b/arch/mips/include/asm/checksum.h
index e8161e4dfde7..93f3fbc6e917 100644
--- a/arch/mips/include/asm/checksum.h
+++ b/arch/mips/include/asm/checksum.h
@@ -215,7 +215,7 @@ static inline __sum16 ip_compute_csum(const void *buff, int len)
 }
 
 #define _HAVE_ARCH_IPV6_CSUM
-static __inline__ __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
+static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
 					  const struct in6_addr *daddr,
 					  __u32 len, __u8 proto,
 					  __wsum sum)
diff --git a/arch/mips/include/asm/dma.h b/arch/mips/include/asm/dma.h
index be726b943530..85cc6b5df8c9 100644
--- a/arch/mips/include/asm/dma.h
+++ b/arch/mips/include/asm/dma.h
@@ -157,20 +157,20 @@
 
 extern spinlock_t  dma_spin_lock;
 
-static __inline__ unsigned long claim_dma_lock(void)
+static inline unsigned long claim_dma_lock(void)
 {
 	unsigned long flags;
 	spin_lock_irqsave(&dma_spin_lock, flags);
 	return flags;
 }
 
-static __inline__ void release_dma_lock(unsigned long flags)
+static inline void release_dma_lock(unsigned long flags)
 {
 	spin_unlock_irqrestore(&dma_spin_lock, flags);
 }
 
 /* enable/disable a specific DMA channel */
-static __inline__ void enable_dma(unsigned int dmanr)
+static inline void enable_dma(unsigned int dmanr)
 {
 	if (dmanr<=3)
 		dma_outb(dmanr,	 DMA1_MASK_REG);
@@ -178,7 +178,7 @@ static __inline__ void enable_dma(unsigned int dmanr)
 		dma_outb(dmanr & 3,  DMA2_MASK_REG);
 }
 
-static __inline__ void disable_dma(unsigned int dmanr)
+static inline void disable_dma(unsigned int dmanr)
 {
 	if (dmanr<=3)
 		dma_outb(dmanr | 4,  DMA1_MASK_REG);
@@ -193,7 +193,7 @@ static __inline__ void disable_dma(unsigned int dmanr)
  * --- In order to do that, the DMA routines below should ---
  * --- only be used while holding the DMA lock ! ---
  */
-static __inline__ void clear_dma_ff(unsigned int dmanr)
+static inline void clear_dma_ff(unsigned int dmanr)
 {
 	if (dmanr<=3)
 		dma_outb(0,  DMA1_CLEAR_FF_REG);
@@ -202,7 +202,7 @@ static __inline__ void clear_dma_ff(unsigned int dmanr)
 }
 
 /* set mode (above) for a specific DMA channel */
-static __inline__ void set_dma_mode(unsigned int dmanr, char mode)
+static inline void set_dma_mode(unsigned int dmanr, char mode)
 {
 	if (dmanr<=3)
 		dma_outb(mode | dmanr,	DMA1_MODE_REG);
@@ -215,7 +215,7 @@ static __inline__ void set_dma_mode(unsigned int dmanr, char mode)
  * the lower 16 bits of the DMA current address register, but a 64k boundary
  * may have been crossed.
  */
-static __inline__ void set_dma_page(unsigned int dmanr, char pagenr)
+static inline void set_dma_page(unsigned int dmanr, char pagenr)
 {
 	switch(dmanr) {
 		case 0:
@@ -246,7 +246,7 @@ static __inline__ void set_dma_page(unsigned int dmanr, char pagenr)
 /* Set transfer address & page bits for specific DMA channel.
  * Assumes dma flipflop is clear.
  */
-static __inline__ void set_dma_addr(unsigned int dmanr, unsigned int a)
+static inline void set_dma_addr(unsigned int dmanr, unsigned int a)
 {
 	set_dma_page(dmanr, a>>16);
 	if (dmanr <= 3)	 {
@@ -267,7 +267,7 @@ static __inline__ void set_dma_addr(unsigned int dmanr, unsigned int a)
  * Assumes dma flip-flop is clear.
  * NOTE 2: "count" represents _bytes_ and must be even for channels 5-7.
  */
-static __inline__ void set_dma_count(unsigned int dmanr, unsigned int count)
+static inline void set_dma_count(unsigned int dmanr, unsigned int count)
 {
 	count--;
 	if (dmanr <= 3)	 {
@@ -288,7 +288,7 @@ static __inline__ void set_dma_count(unsigned int dmanr, unsigned int count)
  *
  * Assumes DMA flip-flop is clear.
  */
-static __inline__ int get_dma_residue(unsigned int dmanr)
+static inline int get_dma_residue(unsigned int dmanr)
 {
 	unsigned int io_port = (dmanr<=3)? ((dmanr&3)<<1) + 1 + IO_DMA1_BASE
 					 : ((dmanr&3)<<2) + 2 + IO_DMA2_BASE;
diff --git a/arch/mips/include/asm/jazz.h b/arch/mips/include/asm/jazz.h
index a61970d01a81..9401780519eb 100644
--- a/arch/mips/include/asm/jazz.h
+++ b/arch/mips/include/asm/jazz.h
@@ -72,7 +72,7 @@
 
 #ifndef __ASSEMBLY__
 
-static __inline__ void pica_set_led(unsigned int bits)
+static inline void pica_set_led(unsigned int bits)
 {
 	volatile unsigned int *led_register = (unsigned int *) PICA_LED;
 
diff --git a/arch/mips/include/asm/local.h b/arch/mips/include/asm/local.h
index ac8264eca1e9..056b4a5e00f4 100644
--- a/arch/mips/include/asm/local.h
+++ b/arch/mips/include/asm/local.h
@@ -27,7 +27,7 @@ typedef struct
 /*
  * Same as above, but return the result value
  */
-static __inline__ long local_add_return(long i, local_t * l)
+static inline long local_add_return(long i, local_t * l)
 {
 	unsigned long result;
 
@@ -72,7 +72,7 @@ static __inline__ long local_add_return(long i, local_t * l)
 	return result;
 }
 
-static __inline__ long local_sub_return(long i, local_t * l)
+static inline long local_sub_return(long i, local_t * l)
 {
 	unsigned long result;
 
diff --git a/arch/mips/include/asm/string.h b/arch/mips/include/asm/string.h
index 29030cb398ee..8764cba30e50 100644
--- a/arch/mips/include/asm/string.h
+++ b/arch/mips/include/asm/string.h
@@ -20,7 +20,7 @@
 #ifndef IN_STRING_C
 
 #define __HAVE_ARCH_STRCPY
-static __inline__ char *strcpy(char *__dest, __const__ char *__src)
+static inline char *strcpy(char *__dest, __const__ char *__src)
 {
   char *__xdest = __dest;
 
@@ -42,7 +42,7 @@ static __inline__ char *strcpy(char *__dest, __const__ char *__src)
 }
 
 #define __HAVE_ARCH_STRNCPY
-static __inline__ char *strncpy(char *__dest, __const__ char *__src, size_t __n)
+static inline char *strncpy(char *__dest, __const__ char *__src, size_t __n)
 {
   char *__xdest = __dest;
 
@@ -70,7 +70,7 @@ static __inline__ char *strncpy(char *__dest, __const__ char *__src, size_t __n)
 }
 
 #define __HAVE_ARCH_STRCMP
-static __inline__ int strcmp(__const__ char *__cs, __const__ char *__ct)
+static inline int strcmp(__const__ char *__cs, __const__ char *__ct)
 {
   int __res;
 
@@ -100,7 +100,7 @@ static __inline__ int strcmp(__const__ char *__cs, __const__ char *__ct)
 #endif /* !defined(IN_STRING_C) */
 
 #define __HAVE_ARCH_STRNCMP
-static __inline__ int
+static inline int
 strncmp(__const__ char *__cs, __const__ char *__ct, size_t __count)
 {
 	int __res;
diff --git a/arch/mips/kernel/binfmt_elfn32.c b/arch/mips/kernel/binfmt_elfn32.c
index 7a12763d553a..c84b0d718dd4 100644
--- a/arch/mips/kernel/binfmt_elfn32.c
+++ b/arch/mips/kernel/binfmt_elfn32.c
@@ -82,7 +82,7 @@ struct elf_prpsinfo32
 #define init_elf_binfmt init_elfn32_binfmt
 
 #define jiffies_to_timeval jiffies_to_old_timeval32
-static __inline__ void
+static inline void
 jiffies_to_old_timeval32(unsigned long jiffies, struct old_timeval32 *value)
 {
 	/*
diff --git a/arch/nds32/include/asm/swab.h b/arch/nds32/include/asm/swab.h
index e01a755a37d2..8e87949e59fd 100644
--- a/arch/nds32/include/asm/swab.h
+++ b/arch/nds32/include/asm/swab.h
@@ -7,7 +7,7 @@
 #include <linux/types.h>
 #include <linux/compiler.h>
 
-static __inline__ __attribute_const__ __u32 ___arch__swab32(__u32 x)
+static inline __attribute_const__ __u32 ___arch__swab32(__u32 x)
 {
 	__asm__("wsbh   %0, %0\n\t"	/* word swap byte within halfword */
 		"rotri  %0, %0, #16\n"
@@ -16,7 +16,7 @@ static __inline__ __attribute_const__ __u32 ___arch__swab32(__u32 x)
 	return x;
 }
 
-static __inline__ __attribute_const__ __u16 ___arch__swab16(__u16 x)
+static inline __attribute_const__ __u16 ___arch__swab16(__u16 x)
 {
 	__asm__("wsbh   %0, %0\n"	/* word swap byte within halfword */
 		:"=r"(x)
diff --git a/arch/parisc/include/asm/atomic.h b/arch/parisc/include/asm/atomic.h
index 118953d41763..60efc1896b5e 100644
--- a/arch/parisc/include/asm/atomic.h
+++ b/arch/parisc/include/asm/atomic.h
@@ -56,7 +56,7 @@ extern arch_spinlock_t __atomic_hash[ATOMIC_HASH_SIZE] __lock_aligned;
  * are atomic, so a reader never sees inconsistent values.
  */
 
-static __inline__ void atomic_set(atomic_t *v, int i)
+static inline void atomic_set(atomic_t *v, int i)
 {
 	unsigned long flags;
 	_atomic_spin_lock_irqsave(v, flags);
@@ -68,7 +68,7 @@ static __inline__ void atomic_set(atomic_t *v, int i)
 
 #define atomic_set_release(v, i)	atomic_set((v), (i))
 
-static __inline__ int atomic_read(const atomic_t *v)
+static inline int atomic_read(const atomic_t *v)
 {
 	return READ_ONCE((v)->counter);
 }
@@ -78,7 +78,7 @@ static __inline__ int atomic_read(const atomic_t *v)
 #define atomic_xchg(v, new) (xchg(&((v)->counter), new))
 
 #define ATOMIC_OP(op, c_op)						\
-static __inline__ void atomic_##op(int i, atomic_t *v)			\
+static inline void atomic_##op(int i, atomic_t *v)			\
 {									\
 	unsigned long flags;						\
 									\
@@ -88,7 +88,7 @@ static __inline__ void atomic_##op(int i, atomic_t *v)			\
 }									\
 
 #define ATOMIC_OP_RETURN(op, c_op)					\
-static __inline__ int atomic_##op##_return(int i, atomic_t *v)		\
+static inline int atomic_##op##_return(int i, atomic_t *v)		\
 {									\
 	unsigned long flags;						\
 	int ret;							\
@@ -101,7 +101,7 @@ static __inline__ int atomic_##op##_return(int i, atomic_t *v)		\
 }
 
 #define ATOMIC_FETCH_OP(op, c_op)					\
-static __inline__ int atomic_fetch_##op(int i, atomic_t *v)		\
+static inline int atomic_fetch_##op(int i, atomic_t *v)		\
 {									\
 	unsigned long flags;						\
 	int ret;							\
@@ -143,7 +143,7 @@ ATOMIC_OPS(xor, ^=)
 #define ATOMIC64_INIT(i) { (i) }
 
 #define ATOMIC64_OP(op, c_op)						\
-static __inline__ void atomic64_##op(s64 i, atomic64_t *v)		\
+static inline void atomic64_##op(s64 i, atomic64_t *v)		\
 {									\
 	unsigned long flags;						\
 									\
@@ -153,7 +153,7 @@ static __inline__ void atomic64_##op(s64 i, atomic64_t *v)		\
 }									\
 
 #define ATOMIC64_OP_RETURN(op, c_op)					\
-static __inline__ s64 atomic64_##op##_return(s64 i, atomic64_t *v)	\
+static inline s64 atomic64_##op##_return(s64 i, atomic64_t *v)	\
 {									\
 	unsigned long flags;						\
 	s64 ret;							\
@@ -166,7 +166,7 @@ static __inline__ s64 atomic64_##op##_return(s64 i, atomic64_t *v)	\
 }
 
 #define ATOMIC64_FETCH_OP(op, c_op)					\
-static __inline__ s64 atomic64_fetch_##op(s64 i, atomic64_t *v)		\
+static inline s64 atomic64_fetch_##op(s64 i, atomic64_t *v)		\
 {									\
 	unsigned long flags;						\
 	s64 ret;							\
@@ -201,7 +201,7 @@ ATOMIC64_OPS(xor, ^=)
 #undef ATOMIC64_OP_RETURN
 #undef ATOMIC64_OP
 
-static __inline__ void
+static inline void
 atomic64_set(atomic64_t *v, s64 i)
 {
 	unsigned long flags;
@@ -212,7 +212,7 @@ atomic64_set(atomic64_t *v, s64 i)
 	_atomic_spin_unlock_irqrestore(v, flags);
 }
 
-static __inline__ s64
+static inline s64
 atomic64_read(const atomic64_t *v)
 {
 	return READ_ONCE((v)->counter);
diff --git a/arch/parisc/include/asm/bitops.h b/arch/parisc/include/asm/bitops.h
index 53252d4f9a57..58d4dc3c6ddf 100644
--- a/arch/parisc/include/asm/bitops.h
+++ b/arch/parisc/include/asm/bitops.h
@@ -33,7 +33,7 @@
  *	__*_bit() are "relaxed" and don't use spinlock or volatile.
  */
 
-static __inline__ void set_bit(int nr, volatile unsigned long * addr)
+static inline void set_bit(int nr, volatile unsigned long * addr)
 {
 	unsigned long mask = 1UL << CHOP_SHIFTCOUNT(nr);
 	unsigned long flags;
@@ -44,7 +44,7 @@ static __inline__ void set_bit(int nr, volatile unsigned long * addr)
 	_atomic_spin_unlock_irqrestore(addr, flags);
 }
 
-static __inline__ void clear_bit(int nr, volatile unsigned long * addr)
+static inline void clear_bit(int nr, volatile unsigned long * addr)
 {
 	unsigned long mask = ~(1UL << CHOP_SHIFTCOUNT(nr));
 	unsigned long flags;
@@ -55,7 +55,7 @@ static __inline__ void clear_bit(int nr, volatile unsigned long * addr)
 	_atomic_spin_unlock_irqrestore(addr, flags);
 }
 
-static __inline__ void change_bit(int nr, volatile unsigned long * addr)
+static inline void change_bit(int nr, volatile unsigned long * addr)
 {
 	unsigned long mask = 1UL << CHOP_SHIFTCOUNT(nr);
 	unsigned long flags;
@@ -66,7 +66,7 @@ static __inline__ void change_bit(int nr, volatile unsigned long * addr)
 	_atomic_spin_unlock_irqrestore(addr, flags);
 }
 
-static __inline__ int test_and_set_bit(int nr, volatile unsigned long * addr)
+static inline int test_and_set_bit(int nr, volatile unsigned long * addr)
 {
 	unsigned long mask = 1UL << CHOP_SHIFTCOUNT(nr);
 	unsigned long old;
@@ -84,7 +84,7 @@ static __inline__ int test_and_set_bit(int nr, volatile unsigned long * addr)
 	return set;
 }
 
-static __inline__ int test_and_clear_bit(int nr, volatile unsigned long * addr)
+static inline int test_and_clear_bit(int nr, volatile unsigned long * addr)
 {
 	unsigned long mask = 1UL << CHOP_SHIFTCOUNT(nr);
 	unsigned long old;
@@ -102,7 +102,7 @@ static __inline__ int test_and_clear_bit(int nr, volatile unsigned long * addr)
 	return set;
 }
 
-static __inline__ int test_and_change_bit(int nr, volatile unsigned long * addr)
+static inline int test_and_change_bit(int nr, volatile unsigned long * addr)
 {
 	unsigned long mask = 1UL << CHOP_SHIFTCOUNT(nr);
 	unsigned long oldbit;
@@ -140,7 +140,7 @@ static __inline__ int test_and_change_bit(int nr, volatile unsigned long * addr)
  * cycles for each mispredicted branch.
  */
 
-static __inline__ unsigned long __ffs(unsigned long x)
+static inline unsigned long __ffs(unsigned long x)
 {
 	unsigned long ret;
 
@@ -178,7 +178,7 @@ static __inline__ unsigned long __ffs(unsigned long x)
  * This is defined the same way as the libc and compiler builtin
  * ffs routines, therefore differs in spirit from the above ffz (man ffs).
  */
-static __inline__ int ffs(int x)
+static inline int ffs(int x)
 {
 	return x ? (__ffs((unsigned long)x) + 1) : 0;
 }
@@ -188,7 +188,7 @@ static __inline__ int ffs(int x)
  * fls(0) = 0, fls(1) = 1, fls(0x80000000) = 32.
  */
 
-static __inline__ int fls(int x)
+static inline int fls(int x)
 {
 	int ret;
 	if (!x)
diff --git a/arch/parisc/include/asm/checksum.h b/arch/parisc/include/asm/checksum.h
index 3cbf1f1c1188..2e441a55adc9 100644
--- a/arch/parisc/include/asm/checksum.h
+++ b/arch/parisc/include/asm/checksum.h
@@ -121,7 +121,7 @@ static inline __sum16 ip_compute_csum(const void *buf, int len)
 
 
 #define _HAVE_ARCH_IPV6_CSUM
-static __inline__ __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
+static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
 					  const struct in6_addr *daddr,
 					  __u32 len, __u8 proto,
 					  __wsum sum)
@@ -189,7 +189,7 @@ static __inline__ __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
  *	Copy and checksum to user
  */
 #define HAVE_CSUM_COPY_USER
-static __inline__ __wsum csum_and_copy_to_user(const void *src,
+static inline __wsum csum_and_copy_to_user(const void *src,
 						      void __user *dst,
 						      int len, __wsum sum,
 						      int *err_ptr)
diff --git a/arch/parisc/include/asm/compat.h b/arch/parisc/include/asm/compat.h
index e03e3c849f40..e2269acf3d2b 100644
--- a/arch/parisc/include/asm/compat.h
+++ b/arch/parisc/include/asm/compat.h
@@ -190,7 +190,7 @@ static inline compat_uptr_t ptr_to_compat(void __user *uptr)
 	return (u32)(unsigned long)uptr;
 }
 
-static __inline__ void __user *arch_compat_alloc_user_space(long len)
+static inline void __user *arch_compat_alloc_user_space(long len)
 {
 	struct pt_regs *regs = &current->thread.regs;
 	return (void __user *)regs->gr[30];
diff --git a/arch/parisc/include/asm/delay.h b/arch/parisc/include/asm/delay.h
index 841b506b702a..e8f35d605bb2 100644
--- a/arch/parisc/include/asm/delay.h
+++ b/arch/parisc/include/asm/delay.h
@@ -2,7 +2,7 @@
 #ifndef _ASM_PARISC_DELAY_H
 #define _ASM_PARISC_DELAY_H
 
-static __inline__ void __delay(unsigned long loops) {
+static inline void __delay(unsigned long loops) {
 	asm volatile(
 	"	.balignl	64,0x34000034\n"
 	"	addib,UV -1,%0,.\n"
diff --git a/arch/parisc/include/asm/dma.h b/arch/parisc/include/asm/dma.h
index eea80ed34e6d..55170ca19ac9 100644
--- a/arch/parisc/include/asm/dma.h
+++ b/arch/parisc/include/asm/dma.h
@@ -71,12 +71,12 @@
 #define DMA2_MASK_ALL_REG       0xDE    /* all-channels mask (w) */
 #define DMA2_EXT_MODE_REG	(0x400 | DMA2_MODE_REG)
 
-static __inline__ unsigned long claim_dma_lock(void)
+static inline unsigned long claim_dma_lock(void)
 {
 	return 0;
 }
 
-static __inline__ void release_dma_lock(unsigned long flags)
+static inline void release_dma_lock(unsigned long flags)
 {
 }
 
@@ -89,7 +89,7 @@ static __inline__ void release_dma_lock(unsigned long flags)
  *
  * Assumes DMA flip-flop is clear.
  */
-static __inline__ int get_dma_residue(unsigned int dmanr)
+static inline int get_dma_residue(unsigned int dmanr)
 {
 	unsigned int io_port = (dmanr<=3)? ((dmanr&3)<<1) + 1 + IO_DMA1_BASE
 					 : ((dmanr&3)<<2) + 2 + IO_DMA2_BASE;
@@ -104,7 +104,7 @@ static __inline__ int get_dma_residue(unsigned int dmanr)
 }
 
 /* enable/disable a specific DMA channel */
-static __inline__ void enable_dma(unsigned int dmanr)
+static inline void enable_dma(unsigned int dmanr)
 {
 #ifdef CONFIG_SUPERIO
 	if (dmanr<=3)
@@ -114,7 +114,7 @@ static __inline__ void enable_dma(unsigned int dmanr)
 #endif
 }
 
-static __inline__ void disable_dma(unsigned int dmanr)
+static inline void disable_dma(unsigned int dmanr)
 {
 #ifdef CONFIG_SUPERIO
 	if (dmanr<=3)
@@ -134,12 +134,12 @@ static __inline__ void disable_dma(unsigned int dmanr)
  * --- In order to do that, the DMA routines below should ---
  * --- only be used while holding the DMA lock ! ---
  */
-static __inline__ void clear_dma_ff(unsigned int dmanr)
+static inline void clear_dma_ff(unsigned int dmanr)
 {
 }
 
 /* set mode (above) for a specific DMA channel */
-static __inline__ void set_dma_mode(unsigned int dmanr, char mode)
+static inline void set_dma_mode(unsigned int dmanr, char mode)
 {
 }
 
@@ -148,7 +148,7 @@ static __inline__ void set_dma_mode(unsigned int dmanr, char mode)
  * the lower 16 bits of the DMA current address register, but a 64k boundary
  * may have been crossed.
  */
-static __inline__ void set_dma_page(unsigned int dmanr, char pagenr)
+static inline void set_dma_page(unsigned int dmanr, char pagenr)
 {
 }
 
@@ -156,7 +156,7 @@ static __inline__ void set_dma_page(unsigned int dmanr, char pagenr)
 /* Set transfer address & page bits for specific DMA channel.
  * Assumes dma flipflop is clear.
  */
-static __inline__ void set_dma_addr(unsigned int dmanr, unsigned int a)
+static inline void set_dma_addr(unsigned int dmanr, unsigned int a)
 {
 }
 
@@ -169,7 +169,7 @@ static __inline__ void set_dma_addr(unsigned int dmanr, unsigned int a)
  * Assumes dma flip-flop is clear.
  * NOTE 2: "count" represents _bytes_ and must be even for channels 5-7.
  */
-static __inline__ void set_dma_count(unsigned int dmanr, unsigned int count)
+static inline void set_dma_count(unsigned int dmanr, unsigned int count)
 {
 }
 
diff --git a/arch/parisc/include/asm/ide.h b/arch/parisc/include/asm/ide.h
index 34cdac01ed35..edd7fe03b90a 100644
--- a/arch/parisc/include/asm/ide.h
+++ b/arch/parisc/include/asm/ide.h
@@ -21,7 +21,7 @@
 #define __ide_outsw	outsw
 #define __ide_outsl	outsl
 
-static __inline__ void __ide_mm_insw(void __iomem *port, void *addr, u32 count)
+static inline void __ide_mm_insw(void __iomem *port, void *addr, u32 count)
 {
 	while (count--) {
 		*(u16 *)addr = __raw_readw(port);
@@ -29,7 +29,7 @@ static __inline__ void __ide_mm_insw(void __iomem *port, void *addr, u32 count)
 	}
 }
 
-static __inline__ void __ide_mm_insl(void __iomem *port, void *addr, u32 count)
+static inline void __ide_mm_insl(void __iomem *port, void *addr, u32 count)
 {
 	while (count--) {
 		*(u32 *)addr = __raw_readl(port);
@@ -37,7 +37,7 @@ static __inline__ void __ide_mm_insl(void __iomem *port, void *addr, u32 count)
 	}
 }
 
-static __inline__ void __ide_mm_outsw(void __iomem *port, void *addr, u32 count)
+static inline void __ide_mm_outsw(void __iomem *port, void *addr, u32 count)
 {
 	while (count--) {
 		__raw_writew(*(u16 *)addr, port);
@@ -45,7 +45,7 @@ static __inline__ void __ide_mm_outsw(void __iomem *port, void *addr, u32 count)
 	}
 }
 
-static __inline__ void __ide_mm_outsl(void __iomem *port, void *addr, u32 count)
+static inline void __ide_mm_outsl(void __iomem *port, void *addr, u32 count)
 {
 	while (count--) {
 		__raw_writel(*(u32 *)addr, port);
diff --git a/arch/parisc/include/asm/irq.h b/arch/parisc/include/asm/irq.h
index 959e79cd2c14..b1c23a163114 100644
--- a/arch/parisc/include/asm/irq.h
+++ b/arch/parisc/include/asm/irq.h
@@ -27,7 +27,7 @@
 
 #define NR_IRQS		(CPU_IRQ_MAX + 1)
 
-static __inline__ int irq_canonicalize(int irq)
+static inline int irq_canonicalize(int irq)
 {
 	return (irq == 2) ? 9 : irq;
 }
diff --git a/arch/parisc/include/asm/spinlock.h b/arch/parisc/include/asm/spinlock.h
index 8a63515f03bf..0d4620a1f81b 100644
--- a/arch/parisc/include/asm/spinlock.h
+++ b/arch/parisc/include/asm/spinlock.h
@@ -66,7 +66,7 @@ static inline int arch_spin_trylock(arch_spinlock_t *x)
 
 /* Note that we have to ensure interrupts are disabled in case we're
  * interrupted by some other code that wants to grab the same read lock */
-static  __inline__ void arch_read_lock(arch_rwlock_t *rw)
+static  inline void arch_read_lock(arch_rwlock_t *rw)
 {
 	unsigned long flags;
 	local_irq_save(flags);
@@ -78,7 +78,7 @@ static  __inline__ void arch_read_lock(arch_rwlock_t *rw)
 
 /* Note that we have to ensure interrupts are disabled in case we're
  * interrupted by some other code that wants to grab the same read lock */
-static  __inline__ void arch_read_unlock(arch_rwlock_t *rw)
+static  inline void arch_read_unlock(arch_rwlock_t *rw)
 {
 	unsigned long flags;
 	local_irq_save(flags);
@@ -90,7 +90,7 @@ static  __inline__ void arch_read_unlock(arch_rwlock_t *rw)
 
 /* Note that we have to ensure interrupts are disabled in case we're
  * interrupted by some other code that wants to grab the same read lock */
-static __inline__ int arch_read_trylock(arch_rwlock_t *rw)
+static inline int arch_read_trylock(arch_rwlock_t *rw)
 {
 	unsigned long flags;
  retry:
@@ -116,7 +116,7 @@ static __inline__ int arch_read_trylock(arch_rwlock_t *rw)
 
 /* Note that we have to ensure interrupts are disabled in case we're
  * interrupted by some other code that wants to read_trylock() this lock */
-static __inline__ void arch_write_lock(arch_rwlock_t *rw)
+static inline void arch_write_lock(arch_rwlock_t *rw)
 {
 	unsigned long flags;
 retry:
@@ -138,7 +138,7 @@ static __inline__ void arch_write_lock(arch_rwlock_t *rw)
 	local_irq_restore(flags);
 }
 
-static __inline__ void arch_write_unlock(arch_rwlock_t *rw)
+static inline void arch_write_unlock(arch_rwlock_t *rw)
 {
 	rw->counter = 0;
 	arch_spin_unlock(&rw->lock);
@@ -146,7 +146,7 @@ static __inline__ void arch_write_unlock(arch_rwlock_t *rw)
 
 /* Note that we have to ensure interrupts are disabled in case we're
  * interrupted by some other code that wants to read_trylock() this lock */
-static __inline__ int arch_write_trylock(arch_rwlock_t *rw)
+static inline int arch_write_trylock(arch_rwlock_t *rw)
 {
 	unsigned long flags;
 	int result = 0;
diff --git a/arch/powerpc/include/asm/atomic.h b/arch/powerpc/include/asm/atomic.h
index 52eafaf74054..4e33807c75cf 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -25,7 +25,7 @@
 #define __atomic_release_fence()					\
 	__asm__ __volatile__(PPC_RELEASE_BARRIER "" : : : "memory")
 
-static __inline__ int atomic_read(const atomic_t *v)
+static inline int atomic_read(const atomic_t *v)
 {
 	int t;
 
@@ -34,13 +34,13 @@ static __inline__ int atomic_read(const atomic_t *v)
 	return t;
 }
 
-static __inline__ void atomic_set(atomic_t *v, int i)
+static inline void atomic_set(atomic_t *v, int i)
 {
 	__asm__ __volatile__("stw%U0%X0 %1,%0" : "=m"(v->counter) : "r"(i));
 }
 
 #define ATOMIC_OP(op, asm_op)						\
-static __inline__ void atomic_##op(int a, atomic_t *v)			\
+static inline void atomic_##op(int a, atomic_t *v)			\
 {									\
 	int t;								\
 									\
@@ -123,7 +123,7 @@ ATOMIC_OPS(xor, xor)
 #undef ATOMIC_OP_RETURN_RELAXED
 #undef ATOMIC_OP
 
-static __inline__ void atomic_inc(atomic_t *v)
+static inline void atomic_inc(atomic_t *v)
 {
 	int t;
 
@@ -139,7 +139,7 @@ static __inline__ void atomic_inc(atomic_t *v)
 }
 #define atomic_inc atomic_inc
 
-static __inline__ int atomic_inc_return_relaxed(atomic_t *v)
+static inline int atomic_inc_return_relaxed(atomic_t *v)
 {
 	int t;
 
@@ -156,7 +156,7 @@ static __inline__ int atomic_inc_return_relaxed(atomic_t *v)
 	return t;
 }
 
-static __inline__ void atomic_dec(atomic_t *v)
+static inline void atomic_dec(atomic_t *v)
 {
 	int t;
 
@@ -172,7 +172,7 @@ static __inline__ void atomic_dec(atomic_t *v)
 }
 #define atomic_dec atomic_dec
 
-static __inline__ int atomic_dec_return_relaxed(atomic_t *v)
+static inline int atomic_dec_return_relaxed(atomic_t *v)
 {
 	int t;
 
@@ -210,7 +210,7 @@ static __inline__ int atomic_dec_return_relaxed(atomic_t *v)
  * Atomically adds @a to @v, so long as it was not @u.
  * Returns the old value of @v.
  */
-static __inline__ int atomic_fetch_add_unless(atomic_t *v, int a, int u)
+static inline int atomic_fetch_add_unless(atomic_t *v, int a, int u)
 {
 	int t;
 
@@ -241,7 +241,7 @@ static __inline__ int atomic_fetch_add_unless(atomic_t *v, int a, int u)
  * Atomically increments @v by 1, so long as @v is non-zero.
  * Returns non-zero if @v was non-zero, and zero otherwise.
  */
-static __inline__ int atomic_inc_not_zero(atomic_t *v)
+static inline int atomic_inc_not_zero(atomic_t *v)
 {
 	int t1, t2;
 
@@ -270,7 +270,7 @@ static __inline__ int atomic_inc_not_zero(atomic_t *v)
  * The function returns the old value of *v minus 1, even if
  * the atomic variable, v, was not decremented.
  */
-static __inline__ int atomic_dec_if_positive(atomic_t *v)
+static inline int atomic_dec_if_positive(atomic_t *v)
 {
 	int t;
 
@@ -297,7 +297,7 @@ static __inline__ int atomic_dec_if_positive(atomic_t *v)
 
 #define ATOMIC64_INIT(i)	{ (i) }
 
-static __inline__ long atomic64_read(const atomic64_t *v)
+static inline long atomic64_read(const atomic64_t *v)
 {
 	long t;
 
@@ -306,13 +306,13 @@ static __inline__ long atomic64_read(const atomic64_t *v)
 	return t;
 }
 
-static __inline__ void atomic64_set(atomic64_t *v, long i)
+static inline void atomic64_set(atomic64_t *v, long i)
 {
 	__asm__ __volatile__("std%U0%X0 %1,%0" : "=m"(v->counter) : "r"(i));
 }
 
 #define ATOMIC64_OP(op, asm_op)						\
-static __inline__ void atomic64_##op(long a, atomic64_t *v)		\
+static inline void atomic64_##op(long a, atomic64_t *v)		\
 {									\
 	long t;								\
 									\
@@ -394,7 +394,7 @@ ATOMIC64_OPS(xor, xor)
 #undef ATOMIC64_OP_RETURN_RELAXED
 #undef ATOMIC64_OP
 
-static __inline__ void atomic64_inc(atomic64_t *v)
+static inline void atomic64_inc(atomic64_t *v)
 {
 	long t;
 
@@ -409,7 +409,7 @@ static __inline__ void atomic64_inc(atomic64_t *v)
 }
 #define atomic64_inc atomic64_inc
 
-static __inline__ long atomic64_inc_return_relaxed(atomic64_t *v)
+static inline long atomic64_inc_return_relaxed(atomic64_t *v)
 {
 	long t;
 
@@ -425,7 +425,7 @@ static __inline__ long atomic64_inc_return_relaxed(atomic64_t *v)
 	return t;
 }
 
-static __inline__ void atomic64_dec(atomic64_t *v)
+static inline void atomic64_dec(atomic64_t *v)
 {
 	long t;
 
@@ -440,7 +440,7 @@ static __inline__ void atomic64_dec(atomic64_t *v)
 }
 #define atomic64_dec atomic64_dec
 
-static __inline__ long atomic64_dec_return_relaxed(atomic64_t *v)
+static inline long atomic64_dec_return_relaxed(atomic64_t *v)
 {
 	long t;
 
@@ -463,7 +463,7 @@ static __inline__ long atomic64_dec_return_relaxed(atomic64_t *v)
  * Atomically test *v and decrement if it is greater than 0.
  * The function returns the old value of *v minus 1.
  */
-static __inline__ long atomic64_dec_if_positive(atomic64_t *v)
+static inline long atomic64_dec_if_positive(atomic64_t *v)
 {
 	long t;
 
@@ -502,7 +502,7 @@ static __inline__ long atomic64_dec_if_positive(atomic64_t *v)
  * Atomically adds @a to @v, so long as it was not @u.
  * Returns the old value of @v.
  */
-static __inline__ long atomic64_fetch_add_unless(atomic64_t *v, long a, long u)
+static inline long atomic64_fetch_add_unless(atomic64_t *v, long a, long u)
 {
 	long t;
 
@@ -532,7 +532,7 @@ static __inline__ long atomic64_fetch_add_unless(atomic64_t *v, long a, long u)
  * Atomically increments @v by 1, so long as @v is non-zero.
  * Returns non-zero if @v was non-zero, and zero otherwise.
  */
-static __inline__ int atomic64_inc_not_zero(atomic64_t *v)
+static inline int atomic64_inc_not_zero(atomic64_t *v)
 {
 	long t1, t2;
 
diff --git a/arch/powerpc/include/asm/bitops.h b/arch/powerpc/include/asm/bitops.h
index ff71566dadee..0858c8ddea92 100644
--- a/arch/powerpc/include/asm/bitops.h
+++ b/arch/powerpc/include/asm/bitops.h
@@ -68,7 +68,7 @@
 
 /* Macro for generating the ***_bits() functions */
 #define DEFINE_BITOP(fn, op, prefix)		\
-static __inline__ void fn(unsigned long mask,	\
+static inline void fn(unsigned long mask,	\
 		volatile unsigned long *_p)	\
 {						\
 	unsigned long old;			\
@@ -90,22 +90,22 @@ DEFINE_BITOP(clear_bits, andc, "")
 DEFINE_BITOP(clear_bits_unlock, andc, PPC_RELEASE_BARRIER)
 DEFINE_BITOP(change_bits, xor, "")
 
-static __inline__ void set_bit(int nr, volatile unsigned long *addr)
+static inline void set_bit(int nr, volatile unsigned long *addr)
 {
 	set_bits(BIT_MASK(nr), addr + BIT_WORD(nr));
 }
 
-static __inline__ void clear_bit(int nr, volatile unsigned long *addr)
+static inline void clear_bit(int nr, volatile unsigned long *addr)
 {
 	clear_bits(BIT_MASK(nr), addr + BIT_WORD(nr));
 }
 
-static __inline__ void clear_bit_unlock(int nr, volatile unsigned long *addr)
+static inline void clear_bit_unlock(int nr, volatile unsigned long *addr)
 {
 	clear_bits_unlock(BIT_MASK(nr), addr + BIT_WORD(nr));
 }
 
-static __inline__ void change_bit(int nr, volatile unsigned long *addr)
+static inline void change_bit(int nr, volatile unsigned long *addr)
 {
 	change_bits(BIT_MASK(nr), addr + BIT_WORD(nr));
 }
@@ -113,7 +113,7 @@ static __inline__ void change_bit(int nr, volatile unsigned long *addr)
 /* Like DEFINE_BITOP(), with changes to the arguments to 'op' and the output
  * operands. */
 #define DEFINE_TESTOP(fn, op, prefix, postfix, eh)	\
-static __inline__ unsigned long fn(			\
+static inline unsigned long fn(			\
 		unsigned long mask,			\
 		volatile unsigned long *_p)		\
 {							\
@@ -142,33 +142,33 @@ DEFINE_TESTOP(test_and_clear_bits, andc, PPC_ATOMIC_ENTRY_BARRIER,
 DEFINE_TESTOP(test_and_change_bits, xor, PPC_ATOMIC_ENTRY_BARRIER,
 	      PPC_ATOMIC_EXIT_BARRIER, 0)
 
-static __inline__ int test_and_set_bit(unsigned long nr,
+static inline int test_and_set_bit(unsigned long nr,
 				       volatile unsigned long *addr)
 {
 	return test_and_set_bits(BIT_MASK(nr), addr + BIT_WORD(nr)) != 0;
 }
 
-static __inline__ int test_and_set_bit_lock(unsigned long nr,
+static inline int test_and_set_bit_lock(unsigned long nr,
 				       volatile unsigned long *addr)
 {
 	return test_and_set_bits_lock(BIT_MASK(nr),
 				addr + BIT_WORD(nr)) != 0;
 }
 
-static __inline__ int test_and_clear_bit(unsigned long nr,
+static inline int test_and_clear_bit(unsigned long nr,
 					 volatile unsigned long *addr)
 {
 	return test_and_clear_bits(BIT_MASK(nr), addr + BIT_WORD(nr)) != 0;
 }
 
-static __inline__ int test_and_change_bit(unsigned long nr,
+static inline int test_and_change_bit(unsigned long nr,
 					  volatile unsigned long *addr)
 {
 	return test_and_change_bits(BIT_MASK(nr), addr + BIT_WORD(nr)) != 0;
 }
 
 #ifdef CONFIG_PPC64
-static __inline__ unsigned long clear_bit_unlock_return_word(int nr,
+static inline unsigned long clear_bit_unlock_return_word(int nr,
 						volatile unsigned long *addr)
 {
 	unsigned long old, t;
@@ -197,7 +197,7 @@ static __inline__ unsigned long clear_bit_unlock_return_word(int nr,
 
 #include <asm-generic/bitops/non-atomic.h>
 
-static __inline__ void __clear_bit_unlock(int nr, volatile unsigned long *addr)
+static inline void __clear_bit_unlock(int nr, volatile unsigned long *addr)
 {
 	__asm__ __volatile__(PPC_RELEASE_BARRIER "" ::: "memory");
 	__clear_bit(nr, addr);
@@ -219,14 +219,14 @@ static __inline__ void __clear_bit_unlock(int nr, volatile unsigned long *addr)
  * fls: find last (most-significant) bit set.
  * Note fls(0) = 0, fls(1) = 1, fls(0x80000000) = 32.
  */
-static __inline__ int fls(unsigned int x)
+static inline int fls(unsigned int x)
 {
 	return 32 - __builtin_clz(x);
 }
 
 #include <asm-generic/bitops/builtin-__fls.h>
 
-static __inline__ int fls64(__u64 x)
+static inline int fls64(__u64 x)
 {
 	return 64 - __builtin_clzll(x);
 }
diff --git a/arch/powerpc/include/asm/dma.h b/arch/powerpc/include/asm/dma.h
index 1b4f0254868f..d5be698b831a 100644
--- a/arch/powerpc/include/asm/dma.h
+++ b/arch/powerpc/include/asm/dma.h
@@ -166,20 +166,20 @@
 
 extern spinlock_t dma_spin_lock;
 
-static __inline__ unsigned long claim_dma_lock(void)
+static inline unsigned long claim_dma_lock(void)
 {
 	unsigned long flags;
 	spin_lock_irqsave(&dma_spin_lock, flags);
 	return flags;
 }
 
-static __inline__ void release_dma_lock(unsigned long flags)
+static inline void release_dma_lock(unsigned long flags)
 {
 	spin_unlock_irqrestore(&dma_spin_lock, flags);
 }
 
 /* enable/disable a specific DMA channel */
-static __inline__ void enable_dma(unsigned int dmanr)
+static inline void enable_dma(unsigned int dmanr)
 {
 	unsigned char ucDmaCmd = 0x00;
 
@@ -195,7 +195,7 @@ static __inline__ void enable_dma(unsigned int dmanr)
 	}
 }
 
-static __inline__ void disable_dma(unsigned int dmanr)
+static inline void disable_dma(unsigned int dmanr)
 {
 	if (dmanr <= 3)
 		dma_outb(dmanr | 4, DMA1_MASK_REG);
@@ -210,7 +210,7 @@ static __inline__ void disable_dma(unsigned int dmanr)
  * --- In order to do that, the DMA routines below should ---
  * --- only be used while interrupts are disabled! ---
  */
-static __inline__ void clear_dma_ff(unsigned int dmanr)
+static inline void clear_dma_ff(unsigned int dmanr)
 {
 	if (dmanr <= 3)
 		dma_outb(0, DMA1_CLEAR_FF_REG);
@@ -219,7 +219,7 @@ static __inline__ void clear_dma_ff(unsigned int dmanr)
 }
 
 /* set mode (above) for a specific DMA channel */
-static __inline__ void set_dma_mode(unsigned int dmanr, char mode)
+static inline void set_dma_mode(unsigned int dmanr, char mode)
 {
 	if (dmanr <= 3)
 		dma_outb(mode | dmanr, DMA1_MODE_REG);
@@ -232,7 +232,7 @@ static __inline__ void set_dma_mode(unsigned int dmanr, char mode)
  * the lower 16 bits of the DMA current address register, but a 64k boundary
  * may have been crossed.
  */
-static __inline__ void set_dma_page(unsigned int dmanr, int pagenr)
+static inline void set_dma_page(unsigned int dmanr, int pagenr)
 {
 	switch (dmanr) {
 	case 0:
@@ -269,7 +269,7 @@ static __inline__ void set_dma_page(unsigned int dmanr, int pagenr)
 /* Set transfer address & page bits for specific DMA channel.
  * Assumes dma flipflop is clear.
  */
-static __inline__ void set_dma_addr(unsigned int dmanr, unsigned int phys)
+static inline void set_dma_addr(unsigned int dmanr, unsigned int phys)
 {
 	if (dmanr <= 3) {
 		dma_outb(phys & 0xff,
@@ -294,7 +294,7 @@ static __inline__ void set_dma_addr(unsigned int dmanr, unsigned int phys)
  * Assumes dma flip-flop is clear.
  * NOTE 2: "count" represents _bytes_ and must be even for channels 5-7.
  */
-static __inline__ void set_dma_count(unsigned int dmanr, unsigned int count)
+static inline void set_dma_count(unsigned int dmanr, unsigned int count)
 {
 	count--;
 	if (dmanr <= 3) {
@@ -319,7 +319,7 @@ static __inline__ void set_dma_count(unsigned int dmanr, unsigned int count)
  *
  * Assumes DMA flip-flop is clear.
  */
-static __inline__ int get_dma_residue(unsigned int dmanr)
+static inline int get_dma_residue(unsigned int dmanr)
 {
 	unsigned int io_port = (dmanr <= 3)
 	    ? ((dmanr & 3) << 1) + 1 + IO_DMA1_BASE
diff --git a/arch/powerpc/include/asm/edac.h b/arch/powerpc/include/asm/edac.h
index 5571e23d253e..f331c7f8525f 100644
--- a/arch/powerpc/include/asm/edac.h
+++ b/arch/powerpc/include/asm/edac.h
@@ -16,7 +16,7 @@
  * ECC scrubbing.  It reads memory and then writes back the original
  * value, allowing the hardware to detect and correct memory errors.
  */
-static __inline__ void edac_atomic_scrub(void *va, u32 size)
+static inline void edac_atomic_scrub(void *va, u32 size)
 {
 	unsigned int *virt_addr = va;
 	unsigned int temp;
diff --git a/arch/powerpc/include/asm/irq.h b/arch/powerpc/include/asm/irq.h
index ee39ce56b2a2..676a7e7c66e7 100644
--- a/arch/powerpc/include/asm/irq.h
+++ b/arch/powerpc/include/asm/irq.h
@@ -31,7 +31,7 @@ extern atomic_t ppc_n_lost_interrupts;
 
 extern irq_hw_number_t virq_to_hw(unsigned int virq);
 
-static __inline__ int irq_canonicalize(int irq)
+static inline int irq_canonicalize(int irq)
 {
 	return irq;
 }
diff --git a/arch/powerpc/include/asm/local.h b/arch/powerpc/include/asm/local.h
index fdd00939270b..95702e9dc1d0 100644
--- a/arch/powerpc/include/asm/local.h
+++ b/arch/powerpc/include/asm/local.h
@@ -17,18 +17,18 @@ typedef struct
 
 #define LOCAL_INIT(i)	{ (i) }
 
-static __inline__ long local_read(local_t *l)
+static inline long local_read(local_t *l)
 {
 	return READ_ONCE(l->v);
 }
 
-static __inline__ void local_set(local_t *l, long i)
+static inline void local_set(local_t *l, long i)
 {
 	WRITE_ONCE(l->v, i);
 }
 
 #define LOCAL_OP(op, c_op)						\
-static __inline__ void local_##op(long i, local_t *l)			\
+static inline void local_##op(long i, local_t *l)			\
 {									\
 	unsigned long flags;						\
 									\
@@ -38,7 +38,7 @@ static __inline__ void local_##op(long i, local_t *l)			\
 }
 
 #define LOCAL_OP_RETURN(op, c_op)					\
-static __inline__ long local_##op##_return(long a, local_t *l)		\
+static inline long local_##op##_return(long a, local_t *l)		\
 {									\
 	long t;								\
 	unsigned long flags;						\
@@ -76,7 +76,7 @@ LOCAL_OPS(sub, -=)
 #define local_sub_and_test(a, l)	(local_sub_return((a), (l)) == 0)
 #define local_dec_and_test(l)		(local_dec_return((l)) == 0)
 
-static __inline__ long local_cmpxchg(local_t *l, long o, long n)
+static inline long local_cmpxchg(local_t *l, long o, long n)
 {
 	long t;
 	unsigned long flags;
@@ -90,7 +90,7 @@ static __inline__ long local_cmpxchg(local_t *l, long o, long n)
 	return t;
 }
 
-static __inline__ long local_xchg(local_t *l, long n)
+static inline long local_xchg(local_t *l, long n)
 {
 	long t;
 	unsigned long flags;
@@ -112,7 +112,7 @@ static __inline__ long local_xchg(local_t *l, long n)
  * Atomically adds @a to @l, so long as it was not @u.
  * Returns non-zero if @l was not @u, and zero otherwise.
  */
-static __inline__ int local_add_unless(local_t *l, long a, long u)
+static inline int local_add_unless(local_t *l, long a, long u)
 {
 	unsigned long flags;
 	int ret = 0;
diff --git a/arch/sh/include/asm/pgtable_64.h b/arch/sh/include/asm/pgtable_64.h
index 07424968df62..12ca0d9bdebb 100644
--- a/arch/sh/include/asm/pgtable_64.h
+++ b/arch/sh/include/asm/pgtable_64.h
@@ -32,7 +32,7 @@
  */
 #define set_pmd(pmdptr, pmdval) (*(pmdptr) = pmdval)
 
-static __inline__ void set_pte(pte_t *pteptr, pte_t pteval)
+static inline void set_pte(pte_t *pteptr, pte_t pteval)
 {
 	unsigned long long x = ((unsigned long long) pteval.pte_low);
 	unsigned long long *xp = (unsigned long long *) pteptr;
diff --git a/arch/sh/include/asm/processor_32.h b/arch/sh/include/asm/processor_32.h
index 95100d8a0b7b..87aa8cb16b5a 100644
--- a/arch/sh/include/asm/processor_32.h
+++ b/arch/sh/include/asm/processor_32.h
@@ -141,7 +141,7 @@ extern void release_thread(struct task_struct *);
  * FPU lazy state save handling.
  */
 
-static __inline__ void disable_fpu(void)
+static inline void disable_fpu(void)
 {
 	unsigned long __dummy;
 
@@ -153,7 +153,7 @@ static __inline__ void disable_fpu(void)
 			     : "r" (SR_FD));
 }
 
-static __inline__ void enable_fpu(void)
+static inline void enable_fpu(void)
 {
 	unsigned long __dummy;
 
diff --git a/arch/sh/include/cpu-sh3/cpu/dac.h b/arch/sh/include/cpu-sh3/cpu/dac.h
index fd02331608a8..67ae1ae03c47 100644
--- a/arch/sh/include/cpu-sh3/cpu/dac.h
+++ b/arch/sh/include/cpu-sh3/cpu/dac.h
@@ -15,7 +15,7 @@
 #define DACR_DAE	0x20
 
 
-static __inline__ void sh_dac_enable(int channel)
+static inline void sh_dac_enable(int channel)
 {
 	unsigned char v;
 	v = __raw_readb(DACR);
@@ -24,7 +24,7 @@ static __inline__ void sh_dac_enable(int channel)
 	__raw_writeb(v,DACR);
 }
 
-static __inline__ void sh_dac_disable(int channel)
+static inline void sh_dac_disable(int channel)
 {
 	unsigned char v;
 	v = __raw_readb(DACR);
@@ -33,7 +33,7 @@ static __inline__ void sh_dac_disable(int channel)
 	__raw_writeb(v,DACR);
 }
 
-static __inline__ void sh_dac_output(u8 value, int channel)
+static inline void sh_dac_output(u8 value, int channel)
 {
 	if(channel) __raw_writeb(value,DADR1);
 	else __raw_writeb(value,DADR0);
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index d7faa16622d8..8eb160bb08eb 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -159,10 +159,10 @@ static inline int alternatives_text_reserved(void *start, void *end)
  * without volatile and memory clobber.
  */
 #define alternative(oldinstr, newinstr, feature)			\
-	asm volatile (ALTERNATIVE(oldinstr, newinstr, feature) : : : "memory")
+	asm_volatile (ALTERNATIVE(oldinstr, newinstr, feature) : : : "memory")
 
 #define alternative_2(oldinstr, newinstr1, feature1, newinstr2, feature2) \
-	asm volatile(ALTERNATIVE_2(oldinstr, newinstr1, feature1, newinstr2, feature2) ::: "memory")
+	asm_volatile(ALTERNATIVE_2(oldinstr, newinstr1, feature1, newinstr2, feature2) ::: "memory")
 
 /*
  * Alternative inline assembly with input.
@@ -176,7 +176,7 @@ static inline int alternatives_text_reserved(void *start, void *end)
  * Leaving an unused argument 0 to keep API compatibility.
  */
 #define alternative_input(oldinstr, newinstr, feature, input...)	\
-	asm volatile (ALTERNATIVE(oldinstr, newinstr, feature)		\
+	asm_volatile (ALTERNATIVE(oldinstr, newinstr, feature)		\
 		: : "i" (0), ## input)
 
 /*
@@ -189,18 +189,18 @@ static inline int alternatives_text_reserved(void *start, void *end)
  */
 #define alternative_input_2(oldinstr, newinstr1, feature1, newinstr2,	     \
 			   feature2, input...)				     \
-	asm volatile(ALTERNATIVE_2(oldinstr, newinstr1, feature1,	     \
+	asm_volatile(ALTERNATIVE_2(oldinstr, newinstr1, feature1,	     \
 		newinstr2, feature2)					     \
 		: : "i" (0), ## input)
 
 /* Like alternative_input, but with a single output argument */
 #define alternative_io(oldinstr, newinstr, feature, output, input...)	\
-	asm volatile (ALTERNATIVE(oldinstr, newinstr, feature)		\
+	asm_volatile (ALTERNATIVE(oldinstr, newinstr, feature)		\
 		: output : "i" (0), ## input)
 
 /* Like alternative_io, but for replacing a direct call with another one. */
 #define alternative_call(oldfunc, newfunc, feature, output, input...)	\
-	asm volatile (ALTERNATIVE("call %P[old]", "call %P[new]", feature) \
+	asm_volatile (ALTERNATIVE("call %P[old]", "call %P[new]", feature) \
 		: output : [old] "i" (oldfunc), [new] "i" (newfunc), ## input)
 
 /*
@@ -211,7 +211,7 @@ static inline int alternatives_text_reserved(void *start, void *end)
  */
 #define alternative_call_2(oldfunc, newfunc1, feature1, newfunc2, feature2,   \
 			   output, input...)				      \
-	asm volatile (ALTERNATIVE_2("call %P[old]", "call %P[new1]", feature1,\
+	asm_volatile (ALTERNATIVE_2("call %P[old]", "call %P[new1]", feature1,\
 		"call %P[new2]", feature2)				      \
 		: output, ASM_CALL_CONSTRAINT				      \
 		: [old] "i" (oldfunc), [new1] "i" (newfunc1),		      \
diff --git a/arch/x86/um/asm/checksum.h b/arch/x86/um/asm/checksum.h
index 2a56cac64687..8f1c7f32d420 100644
--- a/arch/x86/um/asm/checksum.h
+++ b/arch/x86/um/asm/checksum.h
@@ -28,7 +28,7 @@ extern __wsum csum_partial(const void *buff, int len, __wsum sum);
  *	access_ok().
  */
 
-static __inline__
+static inline
 __wsum csum_partial_copy_nocheck(const void *src, void *dst,
 				       int len, __wsum sum)
 {
@@ -44,7 +44,7 @@ __wsum csum_partial_copy_nocheck(const void *src, void *dst,
  * better 64-bit) boundary
  */
 
-static __inline__
+static inline
 __wsum csum_partial_copy_from_user(const void __user *src, void *dst,
 					 int len, __wsum sum, int *err_ptr)
 {
diff --git a/arch/x86/um/asm/checksum_32.h b/arch/x86/um/asm/checksum_32.h
index 83a75f8a1233..8364af309b85 100644
--- a/arch/x86/um/asm/checksum_32.h
+++ b/arch/x86/um/asm/checksum_32.h
@@ -11,7 +11,7 @@ static inline __sum16 ip_compute_csum(const void *buff, int len)
 }
 
 #define _HAVE_ARCH_IPV6_CSUM
-static __inline__ __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
+static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
 					  const struct in6_addr *daddr,
 					  __u32 len, __u8 proto,
 					  __wsum sum)
@@ -39,7 +39,7 @@ static __inline__ __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
  *	Copy and checksum to user
  */
 #define HAVE_CSUM_COPY_USER
-static __inline__ __wsum csum_and_copy_to_user(const void *src,
+static inline __wsum csum_and_copy_to_user(const void *src,
 						     void __user *dst,
 						     int len, __wsum sum, int *err_ptr)
 {
diff --git a/arch/xtensa/include/asm/checksum.h b/arch/xtensa/include/asm/checksum.h
index 3ae74d7e074b..4eeb36d2082c 100644
--- a/arch/xtensa/include/asm/checksum.h
+++ b/arch/xtensa/include/asm/checksum.h
@@ -66,7 +66,7 @@ __wsum csum_partial_copy_from_user(const void __user *src, void *dst,
  *	Fold a partial checksum
  */
 
-static __inline__ __sum16 csum_fold(__wsum sum)
+static inline __sum16 csum_fold(__wsum sum)
 {
 	unsigned int __dummy;
 	__asm__("extui	%1, %0, 16, 16\n\t"
@@ -87,7 +87,7 @@ static __inline__ __sum16 csum_fold(__wsum sum)
  *	This is a version of ip_compute_csum() optimized for IP headers,
  *	which always checksum on 4 octet boundaries.
  */
-static __inline__ __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
+static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
 {
 	unsigned int sum, tmp, endaddr;
 
@@ -122,7 +122,7 @@ static __inline__ __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
 	return	csum_fold(sum);
 }
 
-static __inline__ __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
+static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
 					    __u32 len, __u8 proto,
 					    __wsum sum)
 {
@@ -155,7 +155,7 @@ static __inline__ __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
  * computes the checksum of the TCP/UDP pseudo-header
  * returns a 16-bit checksum, already complemented
  */
-static __inline__ __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
+static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
 					    __u32 len, __u8 proto,
 					    __wsum sum)
 {
@@ -167,13 +167,13 @@ static __inline__ __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
  * in icmp.c
  */
 
-static __inline__ __sum16 ip_compute_csum(const void *buff, int len)
+static inline __sum16 ip_compute_csum(const void *buff, int len)
 {
 	return csum_fold (csum_partial(buff, len, 0));
 }
 
 #define _HAVE_ARCH_IPV6_CSUM
-static __inline__ __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
+static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
 					  const struct in6_addr *daddr,
 					  __u32 len, __u8 proto,
 					  __wsum sum)
@@ -239,7 +239,7 @@ static __inline__ __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
  *	Copy and checksum to user
  */
 #define HAVE_CSUM_COPY_USER
-static __inline__ __wsum csum_and_copy_to_user(const void *src,
+static inline __wsum csum_and_copy_to_user(const void *src,
 					       void __user *dst, int len,
 					       __wsum sum, int *err_ptr)
 {
diff --git a/arch/xtensa/include/asm/cmpxchg.h b/arch/xtensa/include/asm/cmpxchg.h
index 201e9009efd8..15d62c16de7a 100644
--- a/arch/xtensa/include/asm/cmpxchg.h
+++ b/arch/xtensa/include/asm/cmpxchg.h
@@ -52,7 +52,7 @@ __cmpxchg_u32(volatile int *p, int old, int new)
 
 extern void __cmpxchg_called_with_bad_pointer(void);
 
-static __inline__ unsigned long
+static inline unsigned long
 __cmpxchg(volatile void *ptr, unsigned long old, unsigned long new, int size)
 {
 	switch (size) {
@@ -146,7 +146,7 @@ static inline unsigned long xchg_u32(volatile int * m, unsigned long val)
 
 extern void __xchg_called_with_bad_pointer(void);
 
-static __inline__ unsigned long
+static inline unsigned long
 __xchg(unsigned long x, volatile void * ptr, int size)
 {
 	switch (size) {
diff --git a/arch/xtensa/include/asm/irq.h b/arch/xtensa/include/asm/irq.h
index 6c6ed23e0c79..ae53e599255f 100644
--- a/arch/xtensa/include/asm/irq.h
+++ b/arch/xtensa/include/asm/irq.h
@@ -23,7 +23,7 @@
 #define NR_IRQS (XTENSA_NR_IRQS + PLATFORM_NR_IRQS + 1)
 #define XTENSA_PIC_LINUX_IRQ(hwirq) ((hwirq) + 1)
 
-static __inline__ int irq_canonicalize(int irq)
+static inline int irq_canonicalize(int irq)
 {
 	return (irq);
 }
diff --git a/block/partitions/amiga.c b/block/partitions/amiga.c
index 560936617d9c..7434b0a0f86c 100644
--- a/block/partitions/amiga.c
+++ b/block/partitions/amiga.c
@@ -16,7 +16,7 @@
 #include "check.h"
 #include "amiga.h"
 
-static __inline__ u32
+static inline u32
 checksum_block(__be32 *m, int size)
 {
 	u32 sum = 0;
diff --git a/drivers/atm/he.c b/drivers/atm/he.c
index 29f102dcfec4..abb6415f9565 100644
--- a/drivers/atm/he.c
+++ b/drivers/atm/he.c
@@ -178,7 +178,7 @@ static const struct atmdev_ops he_ops =
 
 /* section 2.12 connection memory access */
 
-static __inline__ void
+static inline void
 he_writel_internal(struct he_dev *he_dev, unsigned val, unsigned addr,
 								unsigned flags)
 {
@@ -324,7 +324,7 @@ he_readl_internal(struct he_dev *he_dev, unsigned addr, unsigned flags)
 #define he_writel_rsr7(dev, val, cid) \
 		he_writel_rcm(dev, val, 0x00000 | (cid << 3) | 7)
 
-static __inline__ struct atm_vcc*
+static inline struct atm_vcc*
 __find_vcc(struct he_dev *he_dev, unsigned cid)
 {
 	struct hlist_head *head;
@@ -2050,7 +2050,7 @@ he_irq_handler(int irq, void *dev_id)
 
 }
 
-static __inline__ void
+static inline void
 __enqueue_tpd(struct he_dev *he_dev, struct he_tpd *tpd, unsigned cid)
 {
 	struct he_tpdrq *new_tail;
diff --git a/drivers/atm/idt77252.c b/drivers/atm/idt77252.c
index 6e737142ceaa..226a65a03b70 100644
--- a/drivers/atm/idt77252.c
+++ b/drivers/atm/idt77252.c
@@ -1783,13 +1783,13 @@ set_tct(struct idt77252_dev *card, struct vc_map *vc)
 /*                                                                           */
 /*****************************************************************************/
 
-static __inline__ int
+static inline int
 idt77252_fbq_level(struct idt77252_dev *card, int queue)
 {
 	return (readl(SAR_REG_STAT) >> (16 + (queue << 2))) & 0x0f;
 }
 
-static __inline__ int
+static inline int
 idt77252_fbq_full(struct idt77252_dev *card, int queue)
 {
 	return (readl(SAR_REG_STAT) >> (16 + (queue << 2))) == 0x0f;
@@ -2016,7 +2016,7 @@ idt77252_send_oam(struct atm_vcc *vcc, void *cell, int flags)
 	return idt77252_send_skb(vcc, skb, 1);
 }
 
-static __inline__ unsigned int
+static inline unsigned int
 idt77252_fls(unsigned int x)
 {
 	int r = 1;
diff --git a/drivers/gpu/drm/mga/mga_drv.h b/drivers/gpu/drm/mga/mga_drv.h
index a45bb22275a7..a4bb4ead677f 100644
--- a/drivers/gpu/drm/mga/mga_drv.h
+++ b/drivers/gpu/drm/mga/mga_drv.h
@@ -661,7 +661,7 @@ do {									\
 
 /* Simple idle test.
  */
-static __inline__ int mga_is_idle(drm_mga_private_t *dev_priv)
+static inline int mga_is_idle(drm_mga_private_t *dev_priv)
 {
 	u32 status = MGA_READ(MGA_STATUS) & MGA_ENGINE_IDLE_MASK;
 	return (status == MGA_ENDPRDMASTS);
diff --git a/drivers/gpu/drm/mga/mga_state.c b/drivers/gpu/drm/mga/mga_state.c
index e5f6b735f575..67f261c59111 100644
--- a/drivers/gpu/drm/mga/mga_state.c
+++ b/drivers/gpu/drm/mga/mga_state.c
@@ -65,7 +65,7 @@ static void mga_emit_clip_rect(drm_mga_private_t *dev_priv,
 	ADVANCE_DMA();
 }
 
-static __inline__ void mga_g200_emit_context(drm_mga_private_t *dev_priv)
+static inline void mga_g200_emit_context(drm_mga_private_t *dev_priv)
 {
 	drm_mga_sarea_t *sarea_priv = dev_priv->sarea_priv;
 	drm_mga_context_regs_t *ctx = &sarea_priv->context_state;
@@ -88,7 +88,7 @@ static __inline__ void mga_g200_emit_context(drm_mga_private_t *dev_priv)
 	ADVANCE_DMA();
 }
 
-static __inline__ void mga_g400_emit_context(drm_mga_private_t *dev_priv)
+static inline void mga_g400_emit_context(drm_mga_private_t *dev_priv)
 {
 	drm_mga_sarea_t *sarea_priv = dev_priv->sarea_priv;
 	drm_mga_context_regs_t *ctx = &sarea_priv->context_state;
@@ -115,7 +115,7 @@ static __inline__ void mga_g400_emit_context(drm_mga_private_t *dev_priv)
 	ADVANCE_DMA();
 }
 
-static __inline__ void mga_g200_emit_tex0(drm_mga_private_t *dev_priv)
+static inline void mga_g200_emit_tex0(drm_mga_private_t *dev_priv)
 {
 	drm_mga_sarea_t *sarea_priv = dev_priv->sarea_priv;
 	drm_mga_texture_regs_t *tex = &sarea_priv->tex_state[0];
@@ -143,7 +143,7 @@ static __inline__ void mga_g200_emit_tex0(drm_mga_private_t *dev_priv)
 	ADVANCE_DMA();
 }
 
-static __inline__ void mga_g400_emit_tex0(drm_mga_private_t *dev_priv)
+static inline void mga_g400_emit_tex0(drm_mga_private_t *dev_priv)
 {
 	drm_mga_sarea_t *sarea_priv = dev_priv->sarea_priv;
 	drm_mga_texture_regs_t *tex = &sarea_priv->tex_state[0];
@@ -183,7 +183,7 @@ static __inline__ void mga_g400_emit_tex0(drm_mga_private_t *dev_priv)
 	ADVANCE_DMA();
 }
 
-static __inline__ void mga_g400_emit_tex1(drm_mga_private_t *dev_priv)
+static inline void mga_g400_emit_tex1(drm_mga_private_t *dev_priv)
 {
 	drm_mga_sarea_t *sarea_priv = dev_priv->sarea_priv;
 	drm_mga_texture_regs_t *tex = &sarea_priv->tex_state[1];
@@ -222,7 +222,7 @@ static __inline__ void mga_g400_emit_tex1(drm_mga_private_t *dev_priv)
 	ADVANCE_DMA();
 }
 
-static __inline__ void mga_g200_emit_pipe(drm_mga_private_t *dev_priv)
+static inline void mga_g200_emit_pipe(drm_mga_private_t *dev_priv)
 {
 	drm_mga_sarea_t *sarea_priv = dev_priv->sarea_priv;
 	unsigned int pipe = sarea_priv->warp_pipe;
@@ -249,7 +249,7 @@ static __inline__ void mga_g200_emit_pipe(drm_mga_private_t *dev_priv)
 	ADVANCE_DMA();
 }
 
-static __inline__ void mga_g400_emit_pipe(drm_mga_private_t *dev_priv)
+static inline void mga_g400_emit_pipe(drm_mga_private_t *dev_priv)
 {
 	drm_mga_sarea_t *sarea_priv = dev_priv->sarea_priv;
 	unsigned int pipe = sarea_priv->warp_pipe;
diff --git a/drivers/gpu/drm/r128/r128_drv.h b/drivers/gpu/drm/r128/r128_drv.h
index 2de40d276116..ae32a67dec16 100644
--- a/drivers/gpu/drm/r128/r128_drv.h
+++ b/drivers/gpu/drm/r128/r128_drv.h
@@ -417,7 +417,7 @@ do {									\
 #define CCE_PACKET3(pkt, n)		(R128_CCE_PACKET3 |		\
 					 (pkt) | ((n) << 16))
 
-static __inline__ void r128_update_ring_snapshot(drm_r128_private_t *dev_priv)
+static inline void r128_update_ring_snapshot(drm_r128_private_t *dev_priv)
 {
 	drm_r128_ring_buffer_t *ring = &dev_priv->ring;
 	ring->space = (GET_RING_HEAD(dev_priv) - ring->tail) * sizeof(u32);
diff --git a/drivers/gpu/drm/r128/r128_state.c b/drivers/gpu/drm/r128/r128_state.c
index b9bfa806d346..36a6642f485d 100644
--- a/drivers/gpu/drm/r128/r128_state.c
+++ b/drivers/gpu/drm/r128/r128_state.c
@@ -79,7 +79,7 @@ static void r128_emit_clip_rects(drm_r128_private_t *dev_priv,
 	ADVANCE_RING();
 }
 
-static __inline__ void r128_emit_core(drm_r128_private_t *dev_priv)
+static inline void r128_emit_core(drm_r128_private_t *dev_priv)
 {
 	drm_r128_sarea_t *sarea_priv = dev_priv->sarea_priv;
 	drm_r128_context_regs_t *ctx = &sarea_priv->context_state;
@@ -94,7 +94,7 @@ static __inline__ void r128_emit_core(drm_r128_private_t *dev_priv)
 	ADVANCE_RING();
 }
 
-static __inline__ void r128_emit_context(drm_r128_private_t *dev_priv)
+static inline void r128_emit_context(drm_r128_private_t *dev_priv)
 {
 	drm_r128_sarea_t *sarea_priv = dev_priv->sarea_priv;
 	drm_r128_context_regs_t *ctx = &sarea_priv->context_state;
@@ -120,7 +120,7 @@ static __inline__ void r128_emit_context(drm_r128_private_t *dev_priv)
 	ADVANCE_RING();
 }
 
-static __inline__ void r128_emit_setup(drm_r128_private_t *dev_priv)
+static inline void r128_emit_setup(drm_r128_private_t *dev_priv)
 {
 	drm_r128_sarea_t *sarea_priv = dev_priv->sarea_priv;
 	drm_r128_context_regs_t *ctx = &sarea_priv->context_state;
@@ -136,7 +136,7 @@ static __inline__ void r128_emit_setup(drm_r128_private_t *dev_priv)
 	ADVANCE_RING();
 }
 
-static __inline__ void r128_emit_masks(drm_r128_private_t *dev_priv)
+static inline void r128_emit_masks(drm_r128_private_t *dev_priv)
 {
 	drm_r128_sarea_t *sarea_priv = dev_priv->sarea_priv;
 	drm_r128_context_regs_t *ctx = &sarea_priv->context_state;
@@ -155,7 +155,7 @@ static __inline__ void r128_emit_masks(drm_r128_private_t *dev_priv)
 	ADVANCE_RING();
 }
 
-static __inline__ void r128_emit_window(drm_r128_private_t *dev_priv)
+static inline void r128_emit_window(drm_r128_private_t *dev_priv)
 {
 	drm_r128_sarea_t *sarea_priv = dev_priv->sarea_priv;
 	drm_r128_context_regs_t *ctx = &sarea_priv->context_state;
@@ -170,7 +170,7 @@ static __inline__ void r128_emit_window(drm_r128_private_t *dev_priv)
 	ADVANCE_RING();
 }
 
-static __inline__ void r128_emit_tex0(drm_r128_private_t *dev_priv)
+static inline void r128_emit_tex0(drm_r128_private_t *dev_priv)
 {
 	drm_r128_sarea_t *sarea_priv = dev_priv->sarea_priv;
 	drm_r128_context_regs_t *ctx = &sarea_priv->context_state;
@@ -196,7 +196,7 @@ static __inline__ void r128_emit_tex0(drm_r128_private_t *dev_priv)
 	ADVANCE_RING();
 }
 
-static __inline__ void r128_emit_tex1(drm_r128_private_t *dev_priv)
+static inline void r128_emit_tex1(drm_r128_private_t *dev_priv)
 {
 	drm_r128_sarea_t *sarea_priv = dev_priv->sarea_priv;
 	drm_r128_texture_regs_t *tex = &sarea_priv->tex_state[1];
diff --git a/drivers/gpu/drm/via/via_irq.c b/drivers/gpu/drm/via/via_irq.c
index c96830ccc0ec..02f2a309ea8d 100644
--- a/drivers/gpu/drm/via/via_irq.c
+++ b/drivers/gpu/drm/via/via_irq.c
@@ -152,7 +152,7 @@ irqreturn_t via_driver_irq_handler(int irq, void *arg)
 		return IRQ_NONE;
 }
 
-static __inline__ void viadrv_acknowledge_irqs(drm_via_private_t *dev_priv)
+static inline void viadrv_acknowledge_irqs(drm_via_private_t *dev_priv)
 {
 	u32 status;
 
diff --git a/drivers/gpu/drm/via/via_verifier.c b/drivers/gpu/drm/via/via_verifier.c
index fb2609434df7..400fe11b128d 100644
--- a/drivers/gpu/drm/via/via_verifier.c
+++ b/drivers/gpu/drm/via/via_verifier.c
@@ -235,7 +235,7 @@ static hazard_t table1[256];
 static hazard_t table2[256];
 static hazard_t table3[256];
 
-static __inline__ int
+static inline int
 eat_words(const uint32_t **buf, const uint32_t *buf_end, unsigned num_words)
 {
 	if ((buf_end - *buf) >= num_words) {
@@ -250,7 +250,7 @@ eat_words(const uint32_t **buf, const uint32_t *buf_end, unsigned num_words)
  * Partially stolen from drm_memory.h
  */
 
-static __inline__ drm_local_map_t *via_drm_lookup_agp_map(drm_via_state_t *seq,
+static inline drm_local_map_t *via_drm_lookup_agp_map(drm_via_state_t *seq,
 						    unsigned long offset,
 						    unsigned long size,
 						    struct drm_device *dev)
@@ -287,7 +287,7 @@ static __inline__ drm_local_map_t *via_drm_lookup_agp_map(drm_via_state_t *seq,
  * very little CPU time.
  */
 
-static __inline__ int finish_current_sequence(drm_via_state_t * cur_seq)
+static inline int finish_current_sequence(drm_via_state_t * cur_seq)
 {
 	switch (cur_seq->unfinished) {
 	case z_address:
@@ -344,7 +344,7 @@ static __inline__ int finish_current_sequence(drm_via_state_t * cur_seq)
 	return 0;
 }
 
-static __inline__ int
+static inline int
 investigate_hazard(uint32_t cmd, hazard_t hz, drm_via_state_t *cur_seq)
 {
 	register uint32_t tmp, *tmp_addr;
@@ -517,7 +517,7 @@ investigate_hazard(uint32_t cmd, hazard_t hz, drm_via_state_t *cur_seq)
 	return 2;
 }
 
-static __inline__ int
+static inline int
 via_check_prim_list(uint32_t const **buffer, const uint32_t * buf_end,
 		    drm_via_state_t *cur_seq)
 {
@@ -621,7 +621,7 @@ via_check_prim_list(uint32_t const **buffer, const uint32_t * buf_end,
 	return ret;
 }
 
-static __inline__ verifier_state_t
+static inline verifier_state_t
 via_check_header2(uint32_t const **buffer, const uint32_t *buf_end,
 		  drm_via_state_t *hc_state)
 {
@@ -713,7 +713,7 @@ via_check_header2(uint32_t const **buffer, const uint32_t *buf_end,
 	return state_command;
 }
 
-static __inline__ verifier_state_t
+static inline verifier_state_t
 via_parse_header2(drm_via_private_t *dev_priv, uint32_t const **buffer,
 		  const uint32_t *buf_end, int *fire_count)
 {
@@ -762,7 +762,7 @@ via_parse_header2(drm_via_private_t *dev_priv, uint32_t const **buffer,
 	return state_command;
 }
 
-static __inline__ int verify_mmio_address(uint32_t address)
+static inline int verify_mmio_address(uint32_t address)
 {
 	if ((address > 0x3FF) && (address < 0xC00)) {
 		DRM_ERROR("Invalid VIDEO DMA command. "
@@ -780,7 +780,7 @@ static __inline__ int verify_mmio_address(uint32_t address)
 	return 0;
 }
 
-static __inline__ int
+static inline int
 verify_video_tail(uint32_t const **buffer, const uint32_t * buf_end,
 		  uint32_t dwords)
 {
@@ -800,7 +800,7 @@ verify_video_tail(uint32_t const **buffer, const uint32_t * buf_end,
 	return 0;
 }
 
-static __inline__ verifier_state_t
+static inline verifier_state_t
 via_check_header1(uint32_t const **buffer, const uint32_t * buf_end)
 {
 	uint32_t cmd;
@@ -832,7 +832,7 @@ via_check_header1(uint32_t const **buffer, const uint32_t * buf_end)
 	return ret;
 }
 
-static __inline__ verifier_state_t
+static inline verifier_state_t
 via_parse_header1(drm_via_private_t *dev_priv, uint32_t const **buffer,
 		  const uint32_t *buf_end)
 {
@@ -850,7 +850,7 @@ via_parse_header1(drm_via_private_t *dev_priv, uint32_t const **buffer,
 	return state_command;
 }
 
-static __inline__ verifier_state_t
+static inline verifier_state_t
 via_check_vheader5(uint32_t const **buffer, const uint32_t *buf_end)
 {
 	uint32_t data;
@@ -883,7 +883,7 @@ via_check_vheader5(uint32_t const **buffer, const uint32_t *buf_end)
 
 }
 
-static __inline__ verifier_state_t
+static inline verifier_state_t
 via_parse_vheader5(drm_via_private_t *dev_priv, uint32_t const **buffer,
 		   const uint32_t *buf_end)
 {
@@ -901,7 +901,7 @@ via_parse_vheader5(drm_via_private_t *dev_priv, uint32_t const **buffer,
 	return state_command;
 }
 
-static __inline__ verifier_state_t
+static inline verifier_state_t
 via_check_vheader6(uint32_t const **buffer, const uint32_t * buf_end)
 {
 	uint32_t data;
@@ -938,7 +938,7 @@ via_check_vheader6(uint32_t const **buffer, const uint32_t * buf_end)
 	return state_command;
 }
 
-static __inline__ verifier_state_t
+static inline verifier_state_t
 via_parse_vheader6(drm_via_private_t *dev_priv, uint32_t const **buffer,
 		   const uint32_t *buf_end)
 {
diff --git a/drivers/isdn/hardware/eicon/platform.h b/drivers/isdn/hardware/eicon/platform.h
index 62e2073c3690..ea861c904cd2 100644
--- a/drivers/isdn/hardware/eicon/platform.h
+++ b/drivers/isdn/hardware/eicon/platform.h
@@ -159,7 +159,7 @@ void diva_xdi_didd_remove_adapter(int card);
 /*
 ** memory allocation
 */
-static __inline__ void *diva_os_malloc(unsigned long flags, unsigned long size)
+static inline void *diva_os_malloc(unsigned long flags, unsigned long size)
 {
 	void *ret = NULL;
 
@@ -168,7 +168,7 @@ static __inline__ void *diva_os_malloc(unsigned long flags, unsigned long size)
 	}
 	return (ret);
 }
-static __inline__ void diva_os_free(unsigned long flags, void *ptr)
+static inline void diva_os_free(unsigned long flags, void *ptr)
 {
 	vfree(ptr);
 }
@@ -185,11 +185,11 @@ void diva_os_free_message_buffer(diva_os_message_buffer_s *dmb);
 /*
 ** mSeconds waiting
 */
-static __inline__ void diva_os_sleep(dword mSec)
+static inline void diva_os_sleep(dword mSec)
 {
 	msleep(mSec);
 }
-static __inline__ void diva_os_wait(dword mSec)
+static inline void diva_os_wait(dword mSec)
 {
 	mdelay(mSec);
 }
@@ -233,12 +233,12 @@ void diva_os_remove_irq(void *context, byte irq);
 */
 typedef long diva_os_spin_lock_magic_t;
 typedef spinlock_t diva_os_spin_lock_t;
-static __inline__ int diva_os_initialize_spin_lock(spinlock_t *lock, void *unused) { \
+static inline int diva_os_initialize_spin_lock(spinlock_t *lock, void *unused) { \
 	spin_lock_init(lock); return (0); }
-static __inline__ void diva_os_enter_spin_lock(diva_os_spin_lock_t *a, \
+static inline void diva_os_enter_spin_lock(diva_os_spin_lock_t *a, \
 					       diva_os_spin_lock_magic_t *old_irql, \
 					       void *dbg) { spin_lock_bh(a); }
-static __inline__ void diva_os_leave_spin_lock(diva_os_spin_lock_t *a, \
+static inline void diva_os_leave_spin_lock(diva_os_spin_lock_t *a, \
 					       diva_os_spin_lock_magic_t *old_irql, \
 					       void *dbg) { spin_unlock_bh(a); }
 
diff --git a/drivers/isdn/i4l/isdn_net.c b/drivers/isdn/i4l/isdn_net.c
index c138f66f2659..09d4cb136382 100644
--- a/drivers/isdn/i4l/isdn_net.c
+++ b/drivers/isdn/i4l/isdn_net.c
@@ -70,7 +70,7 @@
  * Find out if the netdevice has been ifup-ed yet.
  * For slaves, look at the corresponding master.
  */
-static __inline__ int isdn_net_device_started(isdn_net_dev *n)
+static inline int isdn_net_device_started(isdn_net_dev *n)
 {
 	isdn_net_local *lp = n->local;
 	struct net_device *dev;
@@ -86,7 +86,7 @@ static __inline__ int isdn_net_device_started(isdn_net_dev *n)
  * wake up the network -> net_device queue.
  * For slaves, wake the corresponding master interface.
  */
-static __inline__ void isdn_net_device_wake_queue(isdn_net_local *lp)
+static inline void isdn_net_device_wake_queue(isdn_net_local *lp)
 {
 	if (lp->master)
 		netif_wake_queue(lp->master);
@@ -98,7 +98,7 @@ static __inline__ void isdn_net_device_wake_queue(isdn_net_local *lp)
  * stop the network -> net_device queue.
  * For slaves, stop the corresponding master interface.
  */
-static __inline__ void isdn_net_device_stop_queue(isdn_net_local *lp)
+static inline void isdn_net_device_stop_queue(isdn_net_local *lp)
 {
 	if (lp->master)
 		netif_stop_queue(lp->master);
@@ -111,7 +111,7 @@ static __inline__ void isdn_net_device_stop_queue(isdn_net_local *lp)
  * master or slave) is busy. It's busy iff all (master and slave)
  * queues are busy
  */
-static __inline__ int isdn_net_device_busy(isdn_net_local *lp)
+static inline int isdn_net_device_busy(isdn_net_local *lp)
 {
 	isdn_net_local *nlp;
 	isdn_net_dev *nd;
@@ -138,14 +138,14 @@ static __inline__ int isdn_net_device_busy(isdn_net_local *lp)
 	return 1;
 }
 
-static __inline__ void isdn_net_inc_frame_cnt(isdn_net_local *lp)
+static inline void isdn_net_inc_frame_cnt(isdn_net_local *lp)
 {
 	atomic_inc(&lp->frame_cnt);
 	if (isdn_net_device_busy(lp))
 		isdn_net_device_stop_queue(lp);
 }
 
-static __inline__ void isdn_net_dec_frame_cnt(isdn_net_local *lp)
+static inline void isdn_net_dec_frame_cnt(isdn_net_local *lp)
 {
 	atomic_dec(&lp->frame_cnt);
 
@@ -158,7 +158,7 @@ static __inline__ void isdn_net_dec_frame_cnt(isdn_net_local *lp)
 	}
 }
 
-static __inline__ void isdn_net_zero_frame_cnt(isdn_net_local *lp)
+static inline void isdn_net_zero_frame_cnt(isdn_net_local *lp)
 {
 	atomic_set(&lp->frame_cnt, 0);
 }
diff --git a/drivers/isdn/i4l/isdn_net.h b/drivers/isdn/i4l/isdn_net.h
index cca6d68da171..5fdff8a0ac8d 100644
--- a/drivers/isdn/i4l/isdn_net.h
+++ b/drivers/isdn/i4l/isdn_net.h
@@ -64,7 +64,7 @@ extern void isdn_net_write_super(isdn_net_local *lp, struct sk_buff *skb);
 /*
  * is this particular channel busy?
  */
-static __inline__ int isdn_net_lp_busy(isdn_net_local *lp)
+static inline int isdn_net_lp_busy(isdn_net_local *lp)
 {
 	if (atomic_read(&lp->frame_cnt) < ISDN_NET_MAX_QUEUE_LENGTH)
 		return 0;
@@ -76,7 +76,7 @@ static __inline__ int isdn_net_lp_busy(isdn_net_local *lp)
  * For the given net device, this will get a non-busy channel out of the
  * corresponding bundle. The returned channel is locked.
  */
-static __inline__ isdn_net_local *isdn_net_get_locked_lp(isdn_net_dev *nd)
+static inline isdn_net_local *isdn_net_get_locked_lp(isdn_net_dev *nd)
 {
 	unsigned long flags;
 	isdn_net_local *lp;
@@ -104,7 +104,7 @@ static __inline__ isdn_net_local *isdn_net_get_locked_lp(isdn_net_dev *nd)
 /*
  * add a channel to a bundle
  */
-static __inline__ void isdn_net_add_to_bundle(isdn_net_dev *nd, isdn_net_local *nlp)
+static inline void isdn_net_add_to_bundle(isdn_net_dev *nd, isdn_net_local *nlp)
 {
 	isdn_net_local *lp;
 	unsigned long flags;
@@ -125,7 +125,7 @@ static __inline__ void isdn_net_add_to_bundle(isdn_net_dev *nd, isdn_net_local *
 /*
  * remove a channel from the bundle it belongs to
  */
-static __inline__ void isdn_net_rm_from_bundle(isdn_net_local *lp)
+static inline void isdn_net_rm_from_bundle(isdn_net_local *lp)
 {
 	isdn_net_local *master_lp = lp;
 	unsigned long flags;
diff --git a/drivers/media/pci/ivtv/ivtv-ioctl.c b/drivers/media/pci/ivtv/ivtv-ioctl.c
index 4cdc6d2be85d..e4e8b38387ec 100644
--- a/drivers/media/pci/ivtv/ivtv-ioctl.c
+++ b/drivers/media/pci/ivtv/ivtv-ioctl.c
@@ -1622,7 +1622,7 @@ static int ivtv_try_decoder_cmd(struct file *file, void *fh, struct v4l2_decoder
 }
 
 #ifdef CONFIG_VIDEO_IVTV_DEPRECATED_IOCTLS
-static __inline__ void warn_deprecated_ioctl(const char *name)
+static inline void warn_deprecated_ioctl(const char *name)
 {
 	pr_warn_once("warning: the %s ioctl is deprecated. Don't use it, as it will be removed soon\n",
 		     name);
diff --git a/drivers/net/ethernet/sun/sungem.c b/drivers/net/ethernet/sun/sungem.c
index b9221fc1674d..4da7a3ac3c87 100644
--- a/drivers/net/ethernet/sun/sungem.c
+++ b/drivers/net/ethernet/sun/sungem.c
@@ -640,7 +640,7 @@ static int gem_abnormal_irq(struct net_device *dev, struct gem *gp, u32 gem_stat
 	return 0;
 }
 
-static __inline__ void gem_tx(struct net_device *dev, struct gem *gp, u32 gem_status)
+static inline void gem_tx(struct net_device *dev, struct gem *gp, u32 gem_status)
 {
 	int entry, limit;
 
@@ -710,7 +710,7 @@ static __inline__ void gem_tx(struct net_device *dev, struct gem *gp, u32 gem_st
 	}
 }
 
-static __inline__ void gem_post_rxds(struct gem *gp, int limit)
+static inline void gem_post_rxds(struct gem *gp, int limit)
 {
 	int cluster_start, curr, count, kick;
 
@@ -742,7 +742,7 @@ static __inline__ void gem_post_rxds(struct gem *gp, int limit)
 
 #define ALIGNED_RX_SKB_ADDR(addr) \
         ((((unsigned long)(addr) + (64UL - 1UL)) & ~(64UL - 1UL)) - (unsigned long)(addr))
-static __inline__ struct sk_buff *gem_alloc_skb(struct net_device *dev, int size,
+static inline struct sk_buff *gem_alloc_skb(struct net_device *dev, int size,
 						gfp_t gfp_flags)
 {
 	struct sk_buff *skb = alloc_skb(size + 64, gfp_flags);
@@ -988,7 +988,7 @@ static void gem_tx_timeout(struct net_device *dev)
 	gem_schedule_reset(gp);
 }
 
-static __inline__ int gem_intme(int entry)
+static inline int gem_intme(int entry)
 {
 	/* Algorithm: IRQ every 1/2 of descriptors. */
 	if (!(entry & ((TX_RING_SIZE>>1)-1)))
diff --git a/drivers/net/ethernet/sun/sunhme.c b/drivers/net/ethernet/sun/sunhme.c
index 06da2f59fcbf..1b7c95d17e46 100644
--- a/drivers/net/ethernet/sun/sunhme.c
+++ b/drivers/net/ethernet/sun/sunhme.c
@@ -108,7 +108,7 @@ struct hme_tx_logent {
 #define TX_LOG_LEN	128
 static struct hme_tx_logent tx_log[TX_LOG_LEN];
 static int txlog_cur_entry;
-static __inline__ void tx_add_log(struct happy_meal *hp, unsigned int a, unsigned int s)
+static inline void tx_add_log(struct happy_meal *hp, unsigned int a, unsigned int s)
 {
 	struct hme_tx_logent *tlp;
 	unsigned long flags;
@@ -123,7 +123,7 @@ static __inline__ void tx_add_log(struct happy_meal *hp, unsigned int a, unsigne
 	txlog_cur_entry = (txlog_cur_entry + 1) & (TX_LOG_LEN - 1);
 	local_irq_restore(flags);
 }
-static __inline__ void tx_dump_log(void)
+static inline void tx_dump_log(void)
 {
 	int i, this;
 
@@ -136,7 +136,7 @@ static __inline__ void tx_dump_log(void)
 		this = (this + 1) & (TX_LOG_LEN - 1);
 	}
 }
-static __inline__ void tx_dump_ring(struct happy_meal *hp)
+static inline void tx_dump_ring(struct happy_meal *hp)
 {
 	struct hmeal_init_block *hb = hp->happy_block;
 	struct happy_meal_txd *tp = &hb->happy_meal_txd[0];
diff --git a/drivers/net/hamradio/baycom_ser_fdx.c b/drivers/net/hamradio/baycom_ser_fdx.c
index 190f66c88479..bd91fb571927 100644
--- a/drivers/net/hamradio/baycom_ser_fdx.c
+++ b/drivers/net/hamradio/baycom_ser_fdx.c
@@ -229,7 +229,7 @@ static inline unsigned int hweight8(unsigned int w)
 
 /* --------------------------------------------------------------------- */
 
-static __inline__ void ser12_rx(struct net_device *dev, struct baycom_state *bc, struct timespec64 *ts, unsigned char curs)
+static inline void ser12_rx(struct net_device *dev, struct baycom_state *bc, struct timespec64 *ts, unsigned char curs)
 {
 	int timediff;
 	int bdus8 = bc->baud_us >> 3;
diff --git a/drivers/net/wan/lapbether.c b/drivers/net/wan/lapbether.c
index 0e3f8ed84660..84145ec82a36 100644
--- a/drivers/net/wan/lapbether.c
+++ b/drivers/net/wan/lapbether.c
@@ -76,7 +76,7 @@ static struct lapbethdev *lapbeth_get_x25_dev(struct net_device *dev)
 	return NULL;
 }
 
-static __inline__ int dev_is_ethdev(struct net_device *dev)
+static inline int dev_is_ethdev(struct net_device *dev)
 {
 	return dev->type == ARPHRD_ETHER && strncmp(dev->name, "dummy", 5);
 }
diff --git a/drivers/net/wan/n2.c b/drivers/net/wan/n2.c
index c8f4517db3a0..45ea8da79fa5 100644
--- a/drivers/net/wan/n2.c
+++ b/drivers/net/wan/n2.c
@@ -148,13 +148,13 @@ static card_t **new_card = &first_card;
 					 &(card)->ports[port] : NULL)
 
 
-static __inline__ u8 sca_get_page(card_t *card)
+static inline u8 sca_get_page(card_t *card)
 {
 	return inb(card->io + N2_PSR) & PSR_PAGEBITS;
 }
 
 
-static __inline__ void openwin(card_t *card, u8 page)
+static inline void openwin(card_t *card, u8 page)
 {
 	u8 psr = inb(card->io + N2_PSR);
 	outb((psr & ~PSR_PAGEBITS) | page, card->io + N2_PSR);
diff --git a/drivers/parisc/led.c b/drivers/parisc/led.c
index 0c6e8b44b4ed..65baa86dbae7 100644
--- a/drivers/parisc/led.c
+++ b/drivers/parisc/led.c
@@ -349,7 +349,7 @@ static void led_LCD_driver(unsigned char leds)
    ** (analog to dev_get_info() from net/core/dev.c)
    **   
  */
-static __inline__ int led_get_net_activity(void)
+static inline int led_get_net_activity(void)
 { 
 #ifndef CONFIG_NET
 	return 0;
@@ -401,7 +401,7 @@ static __inline__ int led_get_net_activity(void)
    ** calculate if there was disk-io in the system
    **   
  */
-static __inline__ int led_get_diskio_activity(void)
+static inline int led_get_diskio_activity(void)
 {	
 	static unsigned long last_pgpgin, last_pgpgout;
 	unsigned long events[NR_VM_EVENT_ITEMS];
diff --git a/drivers/parisc/sba_iommu.c b/drivers/parisc/sba_iommu.c
index 11de0eccf968..a7a79aebae17 100644
--- a/drivers/parisc/sba_iommu.c
+++ b/drivers/parisc/sba_iommu.c
@@ -89,7 +89,7 @@
 #define DBG_RES(x...)
 #endif
 
-#define SBA_INLINE	__inline__
+#define SBA_INLINE	inline
 
 #define DEFAULT_DMA_HINT_REG	0
 
diff --git a/drivers/parport/parport_gsc.c b/drivers/parport/parport_gsc.c
index 190c0a7a1c52..20dab463e40d 100644
--- a/drivers/parport/parport_gsc.c
+++ b/drivers/parport/parport_gsc.c
@@ -76,7 +76,7 @@ static int clear_epp_timeout(struct parport *pb)
  * Access functions.
  *
  * Most of these aren't static because they may be used by the
- * parport_xxx_yyy macros.  extern __inline__ versions of several
+ * parport_xxx_yyy macros.  extern inline versions of several
  * of these are in parport_gsc.h.
  */
 
diff --git a/drivers/parport/parport_gsc.h b/drivers/parport/parport_gsc.h
index 812214768d27..21a3f96806f9 100644
--- a/drivers/parport/parport_gsc.h
+++ b/drivers/parport/parport_gsc.h
@@ -41,13 +41,13 @@
 #define parport_readb	gsc_readb
 #define parport_writeb	gsc_writeb
 #else
-static __inline__ unsigned char parport_readb( unsigned long port )
+static inline unsigned char parport_readb( unsigned long port )
 {
     udelay(DELAY_TIME);
     return gsc_readb(port);
 }
 
-static __inline__ void parport_writeb( unsigned char value, unsigned long port )
+static inline void parport_writeb( unsigned char value, unsigned long port )
 {
     gsc_writeb(value,port);
     udelay(DELAY_TIME);
diff --git a/drivers/parport/parport_pc.c b/drivers/parport/parport_pc.c
index 380916bff9e0..c5d3a1053dff 100644
--- a/drivers/parport/parport_pc.c
+++ b/drivers/parport/parport_pc.c
@@ -225,7 +225,7 @@ static int clear_epp_timeout(struct parport *pb)
  * Access functions.
  *
  * Most of these aren't static because they may be used by the
- * parport_xxx_yyy macros.  extern __inline__ versions of several
+ * parport_xxx_yyy macros.  extern inline versions of several
  * of these are in parport_pc.h.
  */
 
diff --git a/drivers/scsi/lpfc/lpfc_scsi.c b/drivers/scsi/lpfc/lpfc_scsi.c
index 5c7858e735c9..c278783ad59b 100644
--- a/drivers/scsi/lpfc/lpfc_scsi.c
+++ b/drivers/scsi/lpfc/lpfc_scsi.c
@@ -4481,7 +4481,7 @@ lpfc_info(struct Scsi_Host *host)
  * This routine modifies fcp_poll_timer  field of @phba by cfg_poll_tmo.
  * The default value of cfg_poll_tmo is 10 milliseconds.
  **/
-static __inline__ void lpfc_poll_rearm_timer(struct lpfc_hba * phba)
+static inline void lpfc_poll_rearm_timer(struct lpfc_hba * phba)
 {
 	unsigned long  poll_tmo_expires =
 		(jiffies + msecs_to_jiffies(phba->cfg_poll_tmo));
diff --git a/drivers/scsi/pcmcia/sym53c500_cs.c b/drivers/scsi/pcmcia/sym53c500_cs.c
index 20011c8afbb5..299963f8d0ac 100644
--- a/drivers/scsi/pcmcia/sym53c500_cs.c
+++ b/drivers/scsi/pcmcia/sym53c500_cs.c
@@ -241,7 +241,7 @@ SYM53C500_int_host_reset(int io_port)
 	chip_init(io_port);
 }
 
-static __inline__ int
+static inline int
 SYM53C500_pio_read(int fast_pio, int base, unsigned char *request, unsigned int reqlen)
 {
 	int i;
@@ -296,7 +296,7 @@ SYM53C500_pio_read(int fast_pio, int base, unsigned char *request, unsigned int
 	return 0;
 }
 
-static __inline__ int
+static inline int
 SYM53C500_pio_write(int fast_pio, int base, unsigned char *request, unsigned int reqlen)
 {
 	int i = 0;
diff --git a/drivers/scsi/qla2xxx/qla_inline.h b/drivers/scsi/qla2xxx/qla_inline.h
index 4351736b2426..d6db2ad8bf57 100644
--- a/drivers/scsi/qla2xxx/qla_inline.h
+++ b/drivers/scsi/qla2xxx/qla_inline.h
@@ -39,7 +39,7 @@ qla24xx_calc_iocbs(scsi_qla_host_t *vha, uint16_t dsds)
  * Returns:
  *      register value.
  */
-static __inline__ uint16_t
+static inline uint16_t
 qla2x00_debounce_register(volatile uint16_t __iomem *addr)
 {
 	volatile uint16_t first;
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 42b8f0d3e580..f48f0b8d0b69 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -344,7 +344,7 @@ struct scsi_transport_template *qla2xxx_transport_vport_template = NULL;
  * Timer routines
  */
 
-__inline__ void
+inline void
 qla2x00_start_timer(scsi_qla_host_t *vha, unsigned long interval)
 {
 	timer_setup(&vha->timer, qla2x00_timer, 0);
@@ -366,7 +366,7 @@ qla2x00_restart_timer(scsi_qla_host_t *vha, unsigned long interval)
 	mod_timer(&vha->timer, jiffies + interval * HZ);
 }
 
-static __inline__ void
+static inline void
 qla2x00_stop_timer(scsi_qla_host_t *vha)
 {
 	del_timer_sync(&vha->timer);
diff --git a/drivers/staging/rtl8723bs/core/rtw_pwrctrl.c b/drivers/staging/rtl8723bs/core/rtw_pwrctrl.c
index 110bbe340b78..969b93adc214 100644
--- a/drivers/staging/rtl8723bs/core/rtw_pwrctrl.c
+++ b/drivers/staging/rtl8723bs/core/rtw_pwrctrl.c
@@ -831,12 +831,12 @@ static void pwr_rpwm_timeout_handler(struct timer_list *t)
 	_set_workitem(&pwrpriv->rpwmtimeoutwi);
 }
 
-static __inline void register_task_alive(struct pwrctrl_priv *pwrctrl, u32 tag)
+static inline void register_task_alive(struct pwrctrl_priv *pwrctrl, u32 tag)
 {
 	pwrctrl->alives |= tag;
 }
 
-static __inline void unregister_task_alive(struct pwrctrl_priv *pwrctrl, u32 tag)
+static inline void unregister_task_alive(struct pwrctrl_priv *pwrctrl, u32 tag)
 {
 	pwrctrl->alives &= ~tag;
 }
diff --git a/drivers/staging/rtl8723bs/core/rtw_wlan_util.c b/drivers/staging/rtl8723bs/core/rtw_wlan_util.c
index 2c65af319a60..de0476f1f23b 100644
--- a/drivers/staging/rtl8723bs/core/rtw_wlan_util.c
+++ b/drivers/staging/rtl8723bs/core/rtw_wlan_util.c
@@ -456,7 +456,7 @@ void set_channel_bwmode(struct adapter *padapter, unsigned char channel, unsigne
 	mutex_unlock(&(adapter_to_dvobj(padapter)->setch_mutex));
 }
 
-__inline u8 *get_my_bssid(struct wlan_bssid_ex *pnetwork)
+inline u8 *get_my_bssid(struct wlan_bssid_ex *pnetwork)
 {
 	return pnetwork->MacAddress;
 }
diff --git a/drivers/staging/rtl8723bs/include/drv_types.h b/drivers/staging/rtl8723bs/include/drv_types.h
index c57f290f605a..9a56fc6c2d1e 100644
--- a/drivers/staging/rtl8723bs/include/drv_types.h
+++ b/drivers/staging/rtl8723bs/include/drv_types.h
@@ -485,7 +485,7 @@ struct dvobj_priv
 #define dvobj_to_pwrctl(dvobj) (&(dvobj->pwrctl_priv))
 #define pwrctl_to_dvobj(pwrctl) container_of(pwrctl, struct dvobj_priv, pwrctl_priv)
 
-__inline static struct device *dvobj_to_dev(struct dvobj_priv *dvobj)
+inline static struct device *dvobj_to_dev(struct dvobj_priv *dvobj)
 {
 	/* todo: get interface type from dvobj and the return the dev accordingly */
 #ifdef RTW_DVOBJ_CHIP_HW_TYPE
@@ -643,14 +643,14 @@ struct adapter {
 
 /* define RTW_DISABLE_FUNC(padapter, func) (atomic_add(&adapter_to_dvobj(padapter)->disable_func, (func))) */
 /* define RTW_ENABLE_FUNC(padapter, func) (atomic_sub(&adapter_to_dvobj(padapter)->disable_func, (func))) */
-__inline static void RTW_DISABLE_FUNC(struct adapter *padapter, int func_bit)
+inline static void RTW_DISABLE_FUNC(struct adapter *padapter, int func_bit)
 {
 	int	df = atomic_read(&adapter_to_dvobj(padapter)->disable_func);
 	df |= func_bit;
 	atomic_set(&adapter_to_dvobj(padapter)->disable_func, df);
 }
 
-__inline static void RTW_ENABLE_FUNC(struct adapter *padapter, int func_bit)
+inline static void RTW_ENABLE_FUNC(struct adapter *padapter, int func_bit)
 {
 	int	df = atomic_read(&adapter_to_dvobj(padapter)->disable_func);
 	df &= ~(func_bit);
diff --git a/drivers/staging/rtl8723bs/include/ieee80211.h b/drivers/staging/rtl8723bs/include/ieee80211.h
index bcc8dfa8e672..b95ba81036f4 100644
--- a/drivers/staging/rtl8723bs/include/ieee80211.h
+++ b/drivers/staging/rtl8723bs/include/ieee80211.h
@@ -850,18 +850,18 @@ enum ieee80211_state {
 #define IP_FMT "%pI4"
 #define IP_ARG(x) (x)
 
-extern __inline int is_multicast_mac_addr(const u8 *addr)
+extern inline int is_multicast_mac_addr(const u8 *addr)
 {
         return ((addr[0] != 0xff) && (0x01 & addr[0]));
 }
 
-extern __inline int is_broadcast_mac_addr(const u8 *addr)
+extern inline int is_broadcast_mac_addr(const u8 *addr)
 {
 	return ((addr[0] == 0xff) && (addr[1] == 0xff) && (addr[2] == 0xff) &&   \
 		(addr[3] == 0xff) && (addr[4] == 0xff) && (addr[5] == 0xff));
 }
 
-extern __inline int is_zero_mac_addr(const u8 *addr)
+extern inline int is_zero_mac_addr(const u8 *addr)
 {
 	return ((addr[0] == 0x00) && (addr[1] == 0x00) && (addr[2] == 0x00) &&   \
 		(addr[3] == 0x00) && (addr[4] == 0x00) && (addr[5] == 0x00));
diff --git a/drivers/staging/rtl8723bs/include/osdep_service.h b/drivers/staging/rtl8723bs/include/osdep_service.h
index 76d619585046..3a2eb3cbc0e4 100644
--- a/drivers/staging/rtl8723bs/include/osdep_service.h
+++ b/drivers/staging/rtl8723bs/include/osdep_service.h
@@ -110,12 +110,12 @@ int _rtw_netif_rx(_nic_hdl ndev, struct sk_buff *skb);
 
 extern void _rtw_init_queue(struct __queue	*pqueue);
 
-static __inline void thread_enter(char *name)
+static inline void thread_enter(char *name)
 {
 	allow_signal(SIGTERM);
 }
 
-__inline static void flush_signals_thread(void)
+inline static void flush_signals_thread(void)
 {
 	if (signal_pending (current))
 	{
@@ -125,7 +125,7 @@ __inline static void flush_signals_thread(void)
 
 #define rtw_warn_on(condition) WARN_ON(condition)
 
-__inline static int rtw_bug_check(void *parg1, void *parg2, void *parg3, void *parg4)
+inline static int rtw_bug_check(void *parg1, void *parg2, void *parg3, void *parg4)
 {
 	int ret = true;
 
@@ -136,7 +136,7 @@ __inline static int rtw_bug_check(void *parg1, void *parg2, void *parg3, void *p
 #define _RND(sz, r) ((((sz)+((r)-1))/(r))*(r))
 #define RND4(x)	(((x >> 2) + (((x & 3) == 0) ?  0: 1)) << 2)
 
-__inline static u32 _RND4(u32 sz)
+inline static u32 _RND4(u32 sz)
 {
 
 	u32 val;
@@ -147,7 +147,7 @@ __inline static u32 _RND4(u32 sz)
 
 }
 
-__inline static u32 _RND8(u32 sz)
+inline static u32 _RND8(u32 sz)
 {
 
 	u32 val;
diff --git a/drivers/staging/rtl8723bs/include/osdep_service_linux.h b/drivers/staging/rtl8723bs/include/osdep_service_linux.h
index 58d1e1019241..d5ca84b326ec 100644
--- a/drivers/staging/rtl8723bs/include/osdep_service_linux.h
+++ b/drivers/staging/rtl8723bs/include/osdep_service_linux.h
@@ -66,12 +66,12 @@
 
 	typedef struct work_struct _workitem;
 
-__inline static struct list_head *get_next(struct list_head	*list)
+inline static struct list_head *get_next(struct list_head	*list)
 {
 	return list->next;
 }
 
-__inline static struct list_head	*get_list_head(struct __queue	*queue)
+inline static struct list_head	*get_list_head(struct __queue	*queue)
 {
 	return (&(queue->queue));
 }
@@ -80,28 +80,28 @@ __inline static struct list_head	*get_list_head(struct __queue	*queue)
 #define LIST_CONTAINOR(ptr, type, member) \
 	container_of(ptr, type, member)
 
-__inline static void _set_timer(_timer *ptimer, u32 delay_time)
+inline static void _set_timer(_timer *ptimer, u32 delay_time)
 {
 	mod_timer(ptimer , (jiffies+(delay_time*HZ/1000)));
 }
 
-__inline static void _cancel_timer(_timer *ptimer, u8 *bcancelled)
+inline static void _cancel_timer(_timer *ptimer, u8 *bcancelled)
 {
 	del_timer_sync(ptimer);
 	*bcancelled =  true;/* true == 1; false == 0 */
 }
 
-__inline static void _init_workitem(_workitem *pwork, void *pfunc, void *cntx)
+inline static void _init_workitem(_workitem *pwork, void *pfunc, void *cntx)
 {
 	INIT_WORK(pwork, pfunc);
 }
 
-__inline static void _set_workitem(_workitem *pwork)
+inline static void _set_workitem(_workitem *pwork)
 {
 	schedule_work(pwork);
 }
 
-__inline static void _cancel_workitem_sync(_workitem *pwork)
+inline static void _cancel_workitem_sync(_workitem *pwork)
 {
 	cancel_work_sync(pwork);
 }
diff --git a/drivers/staging/rtl8723bs/include/rtw_mlme.h b/drivers/staging/rtl8723bs/include/rtw_mlme.h
index 1ea9ea0e8d2e..80882a05b6fb 100644
--- a/drivers/staging/rtl8723bs/include/rtw_mlme.h
+++ b/drivers/staging/rtl8723bs/include/rtw_mlme.h
@@ -525,13 +525,13 @@ extern sint rtw_select_and_join_from_scanned_queue(struct mlme_priv *pmlmepriv);
 extern sint rtw_set_key(struct adapter *adapter, struct security_priv *psecuritypriv, sint keyid, u8 set_tx, bool enqueue);
 extern sint rtw_set_auth(struct adapter *adapter, struct security_priv *psecuritypriv);
 
-__inline static u8 *get_bssid(struct mlme_priv *pmlmepriv)
+inline static u8 *get_bssid(struct mlme_priv *pmlmepriv)
 {	/* if sta_mode:pmlmepriv->cur_network.network.MacAddress => bssid */
 	/*  if adhoc_mode:pmlmepriv->cur_network.network.MacAddress => ibss mac address */
 	return pmlmepriv->cur_network.network.MacAddress;
 }
 
-__inline static sint check_fwstate(struct mlme_priv *pmlmepriv, sint state)
+inline static sint check_fwstate(struct mlme_priv *pmlmepriv, sint state)
 {
 	if (pmlmepriv->fw_state & state)
 		return true;
@@ -539,7 +539,7 @@ __inline static sint check_fwstate(struct mlme_priv *pmlmepriv, sint state)
 	return false;
 }
 
-__inline static sint get_fwstate(struct mlme_priv *pmlmepriv)
+inline static sint get_fwstate(struct mlme_priv *pmlmepriv)
 {
 	return pmlmepriv->fw_state;
 }
@@ -551,7 +551,7 @@ __inline static sint get_fwstate(struct mlme_priv *pmlmepriv)
  * ### NOTE:#### (!!!!)
  * MUST TAKE CARE THAT BEFORE CALLING THIS FUNC, YOU SHOULD HAVE LOCKED pmlmepriv->lock
  */
-__inline static void set_fwstate(struct mlme_priv *pmlmepriv, sint state)
+inline static void set_fwstate(struct mlme_priv *pmlmepriv, sint state)
 {
 	pmlmepriv->fw_state |= state;
 	/* FOR HW integration */
@@ -560,7 +560,7 @@ __inline static void set_fwstate(struct mlme_priv *pmlmepriv, sint state)
 	}
 }
 
-__inline static void _clr_fwstate_(struct mlme_priv *pmlmepriv, sint state)
+inline static void _clr_fwstate_(struct mlme_priv *pmlmepriv, sint state)
 {
 	pmlmepriv->fw_state &= ~state;
 	/* FOR HW integration */
@@ -573,7 +573,7 @@ __inline static void _clr_fwstate_(struct mlme_priv *pmlmepriv, sint state)
  * No Limit on the calling context,
  * therefore set it to be the critical section...
  */
-__inline static void clr_fwstate(struct mlme_priv *pmlmepriv, sint state)
+inline static void clr_fwstate(struct mlme_priv *pmlmepriv, sint state)
 {
 	spin_lock_bh(&pmlmepriv->lock);
 	if (check_fwstate(pmlmepriv, state) == true)
@@ -581,7 +581,7 @@ __inline static void clr_fwstate(struct mlme_priv *pmlmepriv, sint state)
 	spin_unlock_bh(&pmlmepriv->lock);
 }
 
-__inline static void set_scanned_network_val(struct mlme_priv *pmlmepriv, sint val)
+inline static void set_scanned_network_val(struct mlme_priv *pmlmepriv, sint val)
 {
 	spin_lock_bh(&pmlmepriv->lock);
 	pmlmepriv->num_of_scanned = val;
diff --git a/drivers/staging/rtl8723bs/include/rtw_recv.h b/drivers/staging/rtl8723bs/include/rtw_recv.h
index 1f53c1c7b0da..ca14532df09f 100644
--- a/drivers/staging/rtl8723bs/include/rtw_recv.h
+++ b/drivers/staging/rtl8723bs/include/rtw_recv.h
@@ -405,7 +405,7 @@ struct recv_buf *rtw_dequeue_recvbuf (struct __queue *queue);
 
 void rtw_reordering_ctrl_timeout_handler(struct timer_list *t);
 
-__inline static u8 *get_rxmem(union recv_frame *precvframe)
+inline static u8 *get_rxmem(union recv_frame *precvframe)
 {
 	/* always return rx_head... */
 	if (precvframe == NULL)
@@ -414,7 +414,7 @@ __inline static u8 *get_rxmem(union recv_frame *precvframe)
 	return precvframe->u.hdr.rx_head;
 }
 
-__inline static u8 *get_recvframe_data(union recv_frame *precvframe)
+inline static u8 *get_recvframe_data(union recv_frame *precvframe)
 {
 
 	/* alwasy return rx_data */
@@ -425,7 +425,7 @@ __inline static u8 *get_recvframe_data(union recv_frame *precvframe)
 
 }
 
-__inline static u8 *recvframe_pull(union recv_frame *precvframe, sint sz)
+inline static u8 *recvframe_pull(union recv_frame *precvframe, sint sz)
 {
 	/*  rx_data += sz; move rx_data sz bytes  hereafter */
 
@@ -450,7 +450,7 @@ __inline static u8 *recvframe_pull(union recv_frame *precvframe, sint sz)
 
 }
 
-__inline static u8 *recvframe_put(union recv_frame *precvframe, sint sz)
+inline static u8 *recvframe_put(union recv_frame *precvframe, sint sz)
 {
 	/*  rx_tai += sz; move rx_tail sz bytes  hereafter */
 
@@ -479,7 +479,7 @@ __inline static u8 *recvframe_put(union recv_frame *precvframe, sint sz)
 
 
 
-__inline static u8 *recvframe_pull_tail(union recv_frame *precvframe, sint sz)
+inline static u8 *recvframe_pull_tail(union recv_frame *precvframe, sint sz)
 {
 	/*  rmv data from rx_tail (by yitsen) */
 
@@ -503,7 +503,7 @@ __inline static u8 *recvframe_pull_tail(union recv_frame *precvframe, sint sz)
 
 }
 
-__inline static union recv_frame *rxmem_to_recvframe(u8 *rxmem)
+inline static union recv_frame *rxmem_to_recvframe(u8 *rxmem)
 {
 	/* due to the design of 2048 bytes alignment of recv_frame, we can reference the union recv_frame */
 	/* from any given member of recv_frame. */
@@ -513,13 +513,13 @@ __inline static union recv_frame *rxmem_to_recvframe(u8 *rxmem)
 
 }
 
-__inline static sint get_recvframe_len(union recv_frame *precvframe)
+inline static sint get_recvframe_len(union recv_frame *precvframe)
 {
 	return precvframe->u.hdr.len;
 }
 
 
-__inline static s32 translate_percentage_to_dbm(u32 SignalStrengthIndex)
+inline static s32 translate_percentage_to_dbm(u32 SignalStrengthIndex)
 {
 	s32	SignalPower; /*  in dBm. */
 
diff --git a/drivers/staging/rtl8723bs/include/sta_info.h b/drivers/staging/rtl8723bs/include/sta_info.h
index b9df42d0677e..a4817ea9e4a1 100644
--- a/drivers/staging/rtl8723bs/include/sta_info.h
+++ b/drivers/staging/rtl8723bs/include/sta_info.h
@@ -348,7 +348,7 @@ struct	sta_priv {
 };
 
 
-__inline static u32 wifi_mac_hash(u8 *mac)
+inline static u32 wifi_mac_hash(u8 *mac)
 {
         u32 x;
 
diff --git a/drivers/staging/rtl8723bs/include/wifi.h b/drivers/staging/rtl8723bs/include/wifi.h
index 559bf2606fb7..74d5cf442354 100644
--- a/drivers/staging/rtl8723bs/include/wifi.h
+++ b/drivers/staging/rtl8723bs/include/wifi.h
@@ -347,7 +347,7 @@ enum WIFI_REG_DOMAIN {
 	(addr[4] == 0xff) && (addr[5] == 0xff))  ? true : false \
 )
 
-__inline static int IS_MCAST(unsigned char *da)
+inline static int IS_MCAST(unsigned char *da)
 {
 	if ((*da) & 0x01)
 		return true;
@@ -355,20 +355,20 @@ __inline static int IS_MCAST(unsigned char *da)
 		return false;
 }
 
-__inline static unsigned char * get_ra(unsigned char *pframe)
+inline static unsigned char * get_ra(unsigned char *pframe)
 {
 	unsigned char *ra;
 	ra = GetAddr1Ptr(pframe);
 	return ra;
 }
-__inline static unsigned char * get_ta(unsigned char *pframe)
+inline static unsigned char * get_ta(unsigned char *pframe)
 {
 	unsigned char *ta;
 	ta = GetAddr2Ptr(pframe);
 	return ta;
 }
 
-__inline static unsigned char * get_da(unsigned char *pframe)
+inline static unsigned char * get_da(unsigned char *pframe)
 {
 	unsigned char *da;
 	unsigned int	to_fr_ds	= (GetToDs(pframe) << 1) | GetFrDs(pframe);
@@ -392,7 +392,7 @@ __inline static unsigned char * get_da(unsigned char *pframe)
 }
 
 
-__inline static unsigned char * get_sa(unsigned char *pframe)
+inline static unsigned char * get_sa(unsigned char *pframe)
 {
 	unsigned char *sa;
 	unsigned int	to_fr_ds	= (GetToDs(pframe) << 1) | GetFrDs(pframe);
@@ -415,7 +415,7 @@ __inline static unsigned char * get_sa(unsigned char *pframe)
 	return sa;
 }
 
-__inline static unsigned char * get_hdr_bssid(unsigned char *pframe)
+inline static unsigned char * get_hdr_bssid(unsigned char *pframe)
 {
 	unsigned char *sa = NULL;
 	unsigned int	to_fr_ds	= (GetToDs(pframe) << 1) | GetFrDs(pframe);
@@ -439,7 +439,7 @@ __inline static unsigned char * get_hdr_bssid(unsigned char *pframe)
 }
 
 
-__inline static int IsFrameTypeCtrl(unsigned char *pframe)
+inline static int IsFrameTypeCtrl(unsigned char *pframe)
 {
 	if (WIFI_CTRL_TYPE == GetFrameType(pframe))
 		return true;
diff --git a/drivers/staging/rtl8723bs/include/wlan_bssdef.h b/drivers/staging/rtl8723bs/include/wlan_bssdef.h
index bdb14a84e5a5..94cb3248bc71 100644
--- a/drivers/staging/rtl8723bs/include/wlan_bssdef.h
+++ b/drivers/staging/rtl8723bs/include/wlan_bssdef.h
@@ -223,7 +223,7 @@ struct wlan_bssid_ex {
 	u8  IEs[MAX_IE_SZ];	/* timestamp, beacon interval, and capability information) */
 } __packed;
 
-__inline  static uint get_wlan_bssid_ex_sz(struct wlan_bssid_ex *bss)
+inline  static uint get_wlan_bssid_ex_sz(struct wlan_bssid_ex *bss)
 {
 	return (sizeof(struct wlan_bssid_ex) - MAX_IE_SZ + bss->IELength);
 }
diff --git a/drivers/tty/amiserial.c b/drivers/tty/amiserial.c
index 34dead614149..c1bd9b74f4d5 100644
--- a/drivers/tty/amiserial.c
+++ b/drivers/tty/amiserial.c
@@ -171,7 +171,7 @@ static inline int serial_paranoia_check(struct serial_state *info,
 #define SER_CTS     (1<<4)
 #define SER_DSR     (1<<3)
 
-static __inline__ void rtsdtr_ctrl(int bits)
+static inline void rtsdtr_ctrl(int bits)
 {
     ciab.pra = ((bits & (SER_RTS | SER_DTR)) ^ (SER_RTS | SER_DTR)) | (ciab.pra & ~(SER_RTS | SER_DTR));
 }
diff --git a/drivers/tty/serial/ip22zilog.c b/drivers/tty/serial/ip22zilog.c
index 8c810733df3d..8ccf8b7ee8bb 100644
--- a/drivers/tty/serial/ip22zilog.c
+++ b/drivers/tty/serial/ip22zilog.c
@@ -490,7 +490,7 @@ static irqreturn_t ip22zilog_interrupt(int irq, void *dev_id)
 /* A convenient way to quickly get R0 status.  The caller must _not_ hold the
  * port lock, it is acquired here.
  */
-static __inline__ unsigned char ip22zilog_read_channel_status(struct uart_port *port)
+static inline unsigned char ip22zilog_read_channel_status(struct uart_port *port)
 {
 	struct zilog_channel *channel;
 	unsigned char status;
diff --git a/drivers/tty/serial/sunsab.c b/drivers/tty/serial/sunsab.c
index 72131b5e132e..ec62843a39f4 100644
--- a/drivers/tty/serial/sunsab.c
+++ b/drivers/tty/serial/sunsab.c
@@ -92,7 +92,7 @@ static char *sab82532_version[16] = {
 #define SAB82532_RECV_FIFO_SIZE	32      /* Standard async fifo sizes */
 #define SAB82532_XMIT_FIFO_SIZE	32
 
-static __inline__ void sunsab_tec_wait(struct uart_sunsab_port *up)
+static inline void sunsab_tec_wait(struct uart_sunsab_port *up)
 {
 	int timeout = up->tec_timeout;
 
@@ -100,7 +100,7 @@ static __inline__ void sunsab_tec_wait(struct uart_sunsab_port *up)
 		udelay(1);
 }
 
-static __inline__ void sunsab_cec_wait(struct uart_sunsab_port *up)
+static inline void sunsab_cec_wait(struct uart_sunsab_port *up)
 {
 	int timeout = up->cec_timeout;
 
diff --git a/drivers/tty/serial/sunzilog.c b/drivers/tty/serial/sunzilog.c
index bc7af8b08a72..5c7f1648f54d 100644
--- a/drivers/tty/serial/sunzilog.c
+++ b/drivers/tty/serial/sunzilog.c
@@ -590,7 +590,7 @@ static irqreturn_t sunzilog_interrupt(int irq, void *dev_id)
 /* A convenient way to quickly get R0 status.  The caller must _not_ hold the
  * port lock, it is acquired here.
  */
-static __inline__ unsigned char sunzilog_read_channel_status(struct uart_port *port)
+static inline unsigned char sunzilog_read_channel_status(struct uart_port *port)
 {
 	struct zilog_channel __iomem *channel;
 	unsigned char status;
diff --git a/drivers/video/fbdev/core/fbcon.c b/drivers/video/fbdev/core/fbcon.c
index 75ebbbf0a1fb..57f74bf7731f 100644
--- a/drivers/video/fbdev/core/fbcon.c
+++ b/drivers/video/fbdev/core/fbcon.c
@@ -181,10 +181,10 @@ static void fbcon_set_palette(struct vc_data *vc, const unsigned char *table);
 /*
  *  Internal routines
  */
-static __inline__ void ywrap_up(struct vc_data *vc, int count);
-static __inline__ void ywrap_down(struct vc_data *vc, int count);
-static __inline__ void ypan_up(struct vc_data *vc, int count);
-static __inline__ void ypan_down(struct vc_data *vc, int count);
+static inline void ywrap_up(struct vc_data *vc, int count);
+static inline void ywrap_down(struct vc_data *vc, int count);
+static inline void ypan_up(struct vc_data *vc, int count);
+static inline void ypan_down(struct vc_data *vc, int count);
 static void fbcon_bmove_rec(struct vc_data *vc, struct display *p, int sy, int sx,
 			    int dy, int dx, int height, int width, u_int y_break);
 static void fbcon_set_disp(struct fb_info *info, struct fb_var_screeninfo *var,
@@ -1442,7 +1442,7 @@ static void fbcon_set_disp(struct fb_info *info, struct fb_var_screeninfo *var,
 	}
 }
 
-static __inline__ void ywrap_up(struct vc_data *vc, int count)
+static inline void ywrap_up(struct vc_data *vc, int count)
 {
 	struct fb_info *info = registered_fb[con2fb_map[vc->vc_num]];
 	struct fbcon_ops *ops = info->fbcon_par;
@@ -1461,7 +1461,7 @@ static __inline__ void ywrap_up(struct vc_data *vc, int count)
 	scrollback_current = 0;
 }
 
-static __inline__ void ywrap_down(struct vc_data *vc, int count)
+static inline void ywrap_down(struct vc_data *vc, int count)
 {
 	struct fb_info *info = registered_fb[con2fb_map[vc->vc_num]];
 	struct fbcon_ops *ops = info->fbcon_par;
@@ -1480,7 +1480,7 @@ static __inline__ void ywrap_down(struct vc_data *vc, int count)
 	scrollback_current = 0;
 }
 
-static __inline__ void ypan_up(struct vc_data *vc, int count)
+static inline void ypan_up(struct vc_data *vc, int count)
 {
 	struct fb_info *info = registered_fb[con2fb_map[vc->vc_num]];
 	struct display *p = &fb_display[vc->vc_num];
@@ -1504,7 +1504,7 @@ static __inline__ void ypan_up(struct vc_data *vc, int count)
 	scrollback_current = 0;
 }
 
-static __inline__ void ypan_up_redraw(struct vc_data *vc, int t, int count)
+static inline void ypan_up_redraw(struct vc_data *vc, int t, int count)
 {
 	struct fb_info *info = registered_fb[con2fb_map[vc->vc_num]];
 	struct fbcon_ops *ops = info->fbcon_par;
@@ -1528,7 +1528,7 @@ static __inline__ void ypan_up_redraw(struct vc_data *vc, int t, int count)
 	scrollback_current = 0;
 }
 
-static __inline__ void ypan_down(struct vc_data *vc, int count)
+static inline void ypan_down(struct vc_data *vc, int count)
 {
 	struct fb_info *info = registered_fb[con2fb_map[vc->vc_num]];
 	struct display *p = &fb_display[vc->vc_num];
@@ -1552,7 +1552,7 @@ static __inline__ void ypan_down(struct vc_data *vc, int count)
 	scrollback_current = 0;
 }
 
-static __inline__ void ypan_down_redraw(struct vc_data *vc, int t, int count)
+static inline void ypan_down_redraw(struct vc_data *vc, int t, int count)
 {
 	struct fb_info *info = registered_fb[con2fb_map[vc->vc_num]];
 	struct fbcon_ops *ops = info->fbcon_par;
diff --git a/drivers/video/fbdev/ffb.c b/drivers/video/fbdev/ffb.c
index 6b1915872af1..d86911707e54 100644
--- a/drivers/video/fbdev/ffb.c
+++ b/drivers/video/fbdev/ffb.c
@@ -411,7 +411,7 @@ static int ffb_sync(struct fb_info *p)
 	return 0;
 }
 
-static __inline__ void ffb_rop(struct ffb_par *par, u32 rop)
+static inline void ffb_rop(struct ffb_par *par, u32 rop)
 {
 	if (par->rop_cache != rop) {
 		FFBFifo(par, 1);
diff --git a/drivers/video/fbdev/intelfb/intelfbdrv.c b/drivers/video/fbdev/intelfb/intelfbdrv.c
index d7463a2a5d83..de1ee0f5406a 100644
--- a/drivers/video/fbdev/intelfb/intelfbdrv.c
+++ b/drivers/video/fbdev/intelfb/intelfbdrv.c
@@ -270,7 +270,7 @@ MODULE_PARM_DESC(mode,
 #define OPT_INTVAL(opt, name) simple_strtoul(opt + strlen(name) + 1, NULL, 0)
 #define OPT_STRVAL(opt, name) (opt + strlen(name))
 
-static __inline__ char * get_opt_string(const char *this_opt, const char *name)
+static inline char * get_opt_string(const char *this_opt, const char *name)
 {
 	const char *p;
 	int i;
@@ -288,7 +288,7 @@ static __inline__ char * get_opt_string(const char *this_opt, const char *name)
 	return ret;
 }
 
-static __inline__ int get_opt_int(const char *this_opt, const char *name,
+static inline int get_opt_int(const char *this_opt, const char *name,
 				  int *ret)
 {
 	if (!ret)
@@ -301,7 +301,7 @@ static __inline__ int get_opt_int(const char *this_opt, const char *name,
 	return 1;
 }
 
-static __inline__ int get_opt_bool(const char *this_opt, const char *name,
+static inline int get_opt_bool(const char *this_opt, const char *name,
 				   bool *ret)
 {
 	if (!ret)
@@ -907,7 +907,7 @@ static void intelfb_pci_unregister(struct pci_dev *pdev)
  *                       helper functions                      *
  ***************************************************************/
 
-__inline__ int intelfb_var_to_depth(const struct fb_var_screeninfo *var)
+inline int intelfb_var_to_depth(const struct fb_var_screeninfo *var)
 {
 	DBG_MSG("intelfb_var_to_depth: bpp: %d, green.length is %d\n",
 		var->bits_per_pixel, var->green.length);
@@ -923,7 +923,7 @@ __inline__ int intelfb_var_to_depth(const struct fb_var_screeninfo *var)
 }
 
 
-static __inline__ int var_to_refresh(const struct fb_var_screeninfo *var)
+static inline int var_to_refresh(const struct fb_var_screeninfo *var)
 {
 	int xtot = var->xres + var->left_margin + var->right_margin +
 		   var->hsync_len;
diff --git a/drivers/video/fbdev/intelfb/intelfbhw.c b/drivers/video/fbdev/intelfb/intelfbhw.c
index 57aff7450bce..bbd258330f21 100644
--- a/drivers/video/fbdev/intelfb/intelfbhw.c
+++ b/drivers/video/fbdev/intelfb/intelfbhw.c
@@ -1025,7 +1025,7 @@ static int calc_pll_params(int index, int clock, u32 *retm1, u32 *retm2,
 	return 0;
 }
 
-static __inline__ int check_overflow(u32 value, u32 limit,
+static inline int check_overflow(u32 value, u32 limit,
 				     const char *description)
 {
 	if (value > limit) {
diff --git a/drivers/w1/masters/matrox_w1.c b/drivers/w1/masters/matrox_w1.c
index d83d7c99d81d..8bdee261770f 100644
--- a/drivers/w1/masters/matrox_w1.c
+++ b/drivers/w1/masters/matrox_w1.c
@@ -78,7 +78,7 @@ struct matrox_device
  *
  * Port mapping.
  */
-static __inline__ u8 matrox_w1_read_reg(struct matrox_device *dev, u8 reg)
+static inline u8 matrox_w1_read_reg(struct matrox_device *dev, u8 reg)
 {
 	u8 ret;
 
@@ -89,7 +89,7 @@ static __inline__ u8 matrox_w1_read_reg(struct matrox_device *dev, u8 reg)
 	return ret;
 }
 
-static __inline__ void matrox_w1_write_reg(struct matrox_device *dev, u8 reg, u8 val)
+static inline void matrox_w1_write_reg(struct matrox_device *dev, u8 reg, u8 val)
 {
 	writeb(reg, dev->port_index);
 	writeb(val, dev->port_data);
diff --git a/fs/coda/coda_linux.h b/fs/coda/coda_linux.h
index 126155cadfa9..5f324bb0bd13 100644
--- a/fs/coda/coda_linux.h
+++ b/fs/coda/coda_linux.h
@@ -82,18 +82,18 @@ static inline struct coda_inode_info *ITOC(struct inode *inode)
 	return container_of(inode, struct coda_inode_info, vfs_inode);
 }
 
-static __inline__ struct CodaFid *coda_i2f(struct inode *inode)
+static inline struct CodaFid *coda_i2f(struct inode *inode)
 {
 	return &(ITOC(inode)->c_fid);
 }
 
-static __inline__ char *coda_i2s(struct inode *inode)
+static inline char *coda_i2s(struct inode *inode)
 {
 	return coda_f2s(&(ITOC(inode)->c_fid));
 }
 
 /* this will not zap the inode away */
-static __inline__ void coda_flag_inode(struct inode *inode, int flag)
+static inline void coda_flag_inode(struct inode *inode, int flag)
 {
 	struct coda_inode_info *cii = ITOC(inode);
 
diff --git a/fs/freevxfs/vxfs_inode.c b/fs/freevxfs/vxfs_inode.c
index 1f41b25ef38b..a0b32934b1e4 100644
--- a/fs/freevxfs/vxfs_inode.c
+++ b/fs/freevxfs/vxfs_inode.c
@@ -74,7 +74,7 @@ vxfs_dumpi(struct vxfs_inode_info *vip, ino_t ino)
  *  vxfs_transmod returns a Linux mode_t for a given
  *  VxFS inode structure.
  */
-static __inline__ umode_t
+static inline umode_t
 vxfs_transmod(struct vxfs_inode_info *vip)
 {
 	umode_t			ret = vip->vii_mode & ~VXFS_TYPE_MASK;
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index 755e256a9103..f399f4a16fc9 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -166,7 +166,7 @@ __be32	fh_compose(struct svc_fh *, struct svc_export *, struct dentry *, struct
 __be32	fh_update(struct svc_fh *);
 void	fh_put(struct svc_fh *);
 
-static __inline__ struct svc_fh *
+static inline struct svc_fh *
 fh_copy(struct svc_fh *dst, struct svc_fh *src)
 {
 	WARN_ON(src->fh_dentry || src->fh_locked);
@@ -182,7 +182,7 @@ fh_copy_shallow(struct knfsd_fh *dst, struct knfsd_fh *src)
 	memcpy(&dst->fh_base, &src->fh_base, src->fh_size);
 }
 
-static __inline__ struct svc_fh *
+static inline struct svc_fh *
 fh_init(struct svc_fh *fhp, int maxsize)
 {
 	memset(fhp, 0, sizeof(*fhp));
diff --git a/include/acpi/platform/acgcc.h b/include/acpi/platform/acgcc.h
index 085db95a3dae..32abaa61ef96 100644
--- a/include/acpi/platform/acgcc.h
+++ b/include/acpi/platform/acgcc.h
@@ -26,7 +26,7 @@ typedef __builtin_va_list va_list;
 #endif
 #endif
 
-#define ACPI_INLINE             __inline__
+#define ACPI_INLINE             inline
 
 /* Function name is used for debug output. Non-ANSI, compiler-dependent */
 
diff --git a/include/acpi/platform/acintel.h b/include/acpi/platform/acintel.h
index 626265833a54..fc013efe606d 100644
--- a/include/acpi/platform/acintel.h
+++ b/include/acpi/platform/acintel.h
@@ -22,7 +22,7 @@
 
 #define COMPILER_DEPENDENT_INT64    __int64
 #define COMPILER_DEPENDENT_UINT64   unsigned __int64
-#define ACPI_INLINE                 __inline
+#define ACPI_INLINE                 inline
 
 /*
  * Calling conventions:
diff --git a/include/asm-generic/ide_iops.h b/include/asm-generic/ide_iops.h
index 81dfa3ee5e06..c7028674a03d 100644
--- a/include/asm-generic/ide_iops.h
+++ b/include/asm-generic/ide_iops.h
@@ -6,7 +6,7 @@
 #define __ide_outsw	outsw
 #define __ide_outsl	outsl
 
-static __inline__ void __ide_mm_insw(void __iomem *port, void *addr, u32 count)
+static inline void __ide_mm_insw(void __iomem *port, void *addr, u32 count)
 {
 	while (count--) {
 		*(u16 *)addr = readw(port);
@@ -14,7 +14,7 @@ static __inline__ void __ide_mm_insw(void __iomem *port, void *addr, u32 count)
 	}
 }
 
-static __inline__ void __ide_mm_insl(void __iomem *port, void *addr, u32 count)
+static inline void __ide_mm_insl(void __iomem *port, void *addr, u32 count)
 {
 	while (count--) {
 		*(u32 *)addr = readl(port);
@@ -22,7 +22,7 @@ static __inline__ void __ide_mm_insl(void __iomem *port, void *addr, u32 count)
 	}
 }
 
-static __inline__ void __ide_mm_outsw(void __iomem *port, void *addr, u32 count)
+static inline void __ide_mm_outsw(void __iomem *port, void *addr, u32 count)
 {
 	while (count--) {
 		writew(*(u16 *)addr, port);
@@ -30,7 +30,7 @@ static __inline__ void __ide_mm_outsw(void __iomem *port, void *addr, u32 count)
 	}
 }
 
-static __inline__ void __ide_mm_outsl(void __iomem * port, void *addr, u32 count)
+static inline void __ide_mm_outsl(void __iomem * port, void *addr, u32 count)
 {
 	while (count--) {
 		writel(*(u32 *)addr, port);
diff --git a/include/linux/atalk.h b/include/linux/atalk.h
index 23f805562f4e..257c986f7f40 100644
--- a/include/linux/atalk.h
+++ b/include/linux/atalk.h
@@ -60,7 +60,7 @@ struct ddpehdr {
 	/* And netatalk apps expect to stick the type in themselves */
 };
 
-static __inline__ struct ddpehdr *ddp_hdr(struct sk_buff *skb)
+static inline struct ddpehdr *ddp_hdr(struct sk_buff *skb)
 {
 	return (struct ddpehdr *)skb_transport_header(skb);
 }
@@ -88,7 +88,7 @@ struct elapaarp {
 	__u8	pa_dst_node;
 } __attribute__ ((packed));
 
-static __inline__ struct elapaarp *aarp_hdr(struct sk_buff *skb)
+static inline struct elapaarp *aarp_hdr(struct sk_buff *skb)
 {
 	return (struct elapaarp *)skb_transport_header(skb);
 }
diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index fc2b4491ee0a..63583db288e5 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -82,7 +82,7 @@ enum ceph_msg_data_type {
 	CEPH_MSG_DATA_BVECS,	/* data source/destination is a bio_vec array */
 };
 
-static __inline__ bool ceph_msg_data_type_valid(enum ceph_msg_data_type type)
+static inline bool ceph_msg_data_type_valid(enum ceph_msg_data_type type)
 {
 	switch (type) {
 	case CEPH_MSG_DATA_NONE:
diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index db192becfec4..8cc282b7386b 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -248,6 +248,8 @@ struct ftrace_likely_data {
 # define __gnu_inline
 #endif
 
+#define asm_volatile(stmt...) asm volatile __inline__(stmt)
+
 /*
  * Force always-inline if the user requests it so via the .config.
  * GCC does not warn about unused static inline functions for
@@ -268,8 +270,6 @@ struct ftrace_likely_data {
 #define inline inline	__attribute__((unused)) notrace __gnu_inline
 #endif
 
-#define __inline__ inline
-#define __inline inline
 #define noinline	__attribute__((noinline))
 
 #ifndef __always_inline
diff --git a/include/linux/hdlc.h b/include/linux/hdlc.h
index 97585d9679f3..c9e58c889548 100644
--- a/include/linux/hdlc.h
+++ b/include/linux/hdlc.h
@@ -74,7 +74,7 @@ static inline struct hdlc_device* dev_to_hdlc(struct net_device *dev)
 	return netdev_priv(dev);
 }
 
-static __inline__ void debug_frame(const struct sk_buff *skb)
+static inline void debug_frame(const struct sk_buff *skb)
 {
 	int i;
 
@@ -101,7 +101,7 @@ int attach_hdlc_protocol(struct net_device *dev, struct hdlc_proto *proto,
 /* May be used by hardware driver to gain control over HDLC device */
 int detach_hdlc_protocol(struct net_device *dev);
 
-static __inline__ __be16 hdlc_type_trans(struct sk_buff *skb,
+static inline __be16 hdlc_type_trans(struct sk_buff *skb,
 					 struct net_device *dev)
 {
 	hdlc_device *hdlc = dev_to_hdlc(dev);
diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index c759d1cbcedd..76c97154dc81 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -184,7 +184,7 @@ __be32 inet_confirm_addr(struct net *net, struct in_device *in_dev, __be32 dst,
 struct in_ifaddr *inet_ifa_byprefix(struct in_device *in_dev, __be32 prefix,
 				    __be32 mask);
 struct in_ifaddr *inet_lookup_ifaddr_rcu(struct net *net, __be32 addr);
-static __inline__ bool inet_ifa_match(__be32 addr, struct in_ifaddr *ifa)
+static inline bool inet_ifa_match(__be32 addr, struct in_ifaddr *ifa)
 {
 	return !((addr^ifa->ifa_address)&ifa->ifa_mask);
 }
@@ -193,7 +193,7 @@ static __inline__ bool inet_ifa_match(__be32 addr, struct in_ifaddr *ifa)
  *	Check if a mask is acceptable.
  */
  
-static __inline__ bool bad_mask(__be32 mask, __be32 addr)
+static inline bool bad_mask(__be32 mask, __be32 addr)
 {
 	__u32 hmask;
 	if (addr & (mask = ~mask))
@@ -255,14 +255,14 @@ static inline void in_dev_put(struct in_device *idev)
 
 #endif /* __KERNEL__ */
 
-static __inline__ __be32 inet_make_mask(int logmask)
+static inline __be32 inet_make_mask(int logmask)
 {
 	if (logmask)
 		return htonl(~((1U<<(32-logmask))-1));
 	return 0;
 }
 
-static __inline__ int inet_mask_len(__be32 mask)
+static inline int inet_mask_len(__be32 mask)
 {
 	__u32 hmask = ntohl(mask);
 	if (!hmask)
diff --git a/include/linux/parport.h b/include/linux/parport.h
index 397607a0c0eb..a28dc3e22074 100644
--- a/include/linux/parport.h
+++ b/include/linux/parport.h
@@ -385,7 +385,7 @@ extern void parport_release(struct pardevice *dev);
  * timeslice is half a second, but it can be adjusted via the /proc
  * interface.
  **/
-static __inline__ int parport_yield(struct pardevice *dev)
+static inline int parport_yield(struct pardevice *dev)
 {
 	unsigned long int timeslip = (jiffies - dev->time);
 	if ((dev->port->waithead == NULL) || (timeslip < dev->timeslice))
@@ -403,7 +403,7 @@ static __inline__ int parport_yield(struct pardevice *dev)
  * parport_claim_or_block(), and the return value is the same as for
  * parport_claim_or_block().
  **/
-static __inline__ int parport_yield_blocking(struct pardevice *dev)
+static inline int parport_yield_blocking(struct pardevice *dev)
 {
 	unsigned long int timeslip = (jiffies - dev->time);
 	if ((dev->port->waithead == NULL) || (timeslip < dev->timeslice))
diff --git a/include/linux/parport_pc.h b/include/linux/parport_pc.h
index 3d6fc576d6a1..ea368c35a5e7 100644
--- a/include/linux/parport_pc.h
+++ b/include/linux/parport_pc.h
@@ -60,7 +60,7 @@ struct parport_pc_via_data
 	u8 viacfg_parport_base;
 };
 
-static __inline__ void parport_pc_write_data(struct parport *p, unsigned char d)
+static inline void parport_pc_write_data(struct parport *p, unsigned char d)
 {
 #ifdef DEBUG_PARPORT
 	printk (KERN_DEBUG "parport_pc_write_data(%p,0x%02x)\n", p, d);
@@ -68,7 +68,7 @@ static __inline__ void parport_pc_write_data(struct parport *p, unsigned char d)
 	outb(d, DATA(p));
 }
 
-static __inline__ unsigned char parport_pc_read_data(struct parport *p)
+static inline unsigned char parport_pc_read_data(struct parport *p)
 {
 	unsigned char val = inb (DATA (p));
 #ifdef DEBUG_PARPORT
@@ -125,7 +125,7 @@ static inline void dump_parport_state (char *str, struct parport *p)
 
 /* __parport_pc_frob_control differs from parport_pc_frob_control in that
  * it doesn't do any extra masking. */
-static __inline__ unsigned char __parport_pc_frob_control (struct parport *p,
+static inline unsigned char __parport_pc_frob_control (struct parport *p,
 							   unsigned char mask,
 							   unsigned char val)
 {
@@ -143,17 +143,17 @@ static __inline__ unsigned char __parport_pc_frob_control (struct parport *p,
 	return ctr;
 }
 
-static __inline__ void parport_pc_data_reverse (struct parport *p)
+static inline void parport_pc_data_reverse (struct parport *p)
 {
 	__parport_pc_frob_control (p, 0x20, 0x20);
 }
 
-static __inline__ void parport_pc_data_forward (struct parport *p)
+static inline void parport_pc_data_forward (struct parport *p)
 {
 	__parport_pc_frob_control (p, 0x20, 0x00);
 }
 
-static __inline__ void parport_pc_write_control (struct parport *p,
+static inline void parport_pc_write_control (struct parport *p,
 						 unsigned char d)
 {
 	const unsigned char wm = (PARPORT_CONTROL_STROBE |
@@ -171,7 +171,7 @@ static __inline__ void parport_pc_write_control (struct parport *p,
 	__parport_pc_frob_control (p, wm, d & wm);
 }
 
-static __inline__ unsigned char parport_pc_read_control(struct parport *p)
+static inline unsigned char parport_pc_read_control(struct parport *p)
 {
 	const unsigned char rm = (PARPORT_CONTROL_STROBE |
 				  PARPORT_CONTROL_AUTOFD |
@@ -181,7 +181,7 @@ static __inline__ unsigned char parport_pc_read_control(struct parport *p)
 	return priv->ctr & rm; /* Use soft copy */
 }
 
-static __inline__ unsigned char parport_pc_frob_control (struct parport *p,
+static inline unsigned char parport_pc_frob_control (struct parport *p,
 							 unsigned char mask,
 							 unsigned char val)
 {
@@ -208,18 +208,18 @@ static __inline__ unsigned char parport_pc_frob_control (struct parport *p,
 	return __parport_pc_frob_control (p, mask, val);
 }
 
-static __inline__ unsigned char parport_pc_read_status(struct parport *p)
+static inline unsigned char parport_pc_read_status(struct parport *p)
 {
 	return inb(STATUS(p));
 }
 
 
-static __inline__ void parport_pc_disable_irq(struct parport *p)
+static inline void parport_pc_disable_irq(struct parport *p)
 {
 	__parport_pc_frob_control (p, 0x10, 0x00);
 }
 
-static __inline__ void parport_pc_enable_irq(struct parport *p)
+static inline void parport_pc_enable_irq(struct parport *p)
 {
 	__parport_pc_frob_control (p, 0x10, 0x10);
 }
diff --git a/include/net/ax25.h b/include/net/ax25.h
index 3f9aea8087e3..cd0bec7a6d4d 100644
--- a/include/net/ax25.h
+++ b/include/net/ax25.h
@@ -270,7 +270,7 @@ static inline struct ax25_cb *sk_to_ax25(const struct sock *sk)
 #define ax25_cb_hold(__ax25) \
 	refcount_inc(&((__ax25)->refcount))
 
-static __inline__ void ax25_cb_put(ax25_cb *ax25)
+static inline void ax25_cb_put(ax25_cb *ax25)
 {
 	if (refcount_dec_and_test(&ax25->refcount)) {
 		kfree(ax25->digipeat);
diff --git a/include/net/checksum.h b/include/net/checksum.h
index aef2b2bb6603..e03914ab9197 100644
--- a/include/net/checksum.h
+++ b/include/net/checksum.h
@@ -41,7 +41,7 @@ __wsum csum_and_copy_from_user (const void __user *src, void *dst,
 #endif
 
 #ifndef HAVE_CSUM_COPY_USER
-static __inline__ __wsum csum_and_copy_to_user
+static inline __wsum csum_and_copy_to_user
 (const void *src, void __user *dst, int len, __wsum sum, int *err_ptr)
 {
 	sum = csum_partial(src, len, sum);
diff --git a/include/net/dn_nsp.h b/include/net/dn_nsp.h
index 413a15e5339c..771d3943a8f9 100644
--- a/include/net/dn_nsp.h
+++ b/include/net/dn_nsp.h
@@ -144,7 +144,7 @@ struct  srcobj_fmt {
  * numbers used in NSP. Similar in operation to the functions
  * of the same name in TCP.
  */
-static __inline__ int dn_before(__u16 seq1, __u16 seq2)
+static inline int dn_before(__u16 seq1, __u16 seq2)
 {
         seq1 &= 0x0fff;
         seq2 &= 0x0fff;
@@ -153,7 +153,7 @@ static __inline__ int dn_before(__u16 seq1, __u16 seq2)
 }
 
 
-static __inline__ int dn_after(__u16 seq1, __u16 seq2)
+static inline int dn_after(__u16 seq1, __u16 seq2)
 {
         seq1 &= 0x0fff;
         seq2 &= 0x0fff;
@@ -161,23 +161,23 @@ static __inline__ int dn_after(__u16 seq1, __u16 seq2)
         return (int)((seq2 - seq1) & 0x0fff) > 2048;
 }
 
-static __inline__ int dn_equal(__u16 seq1, __u16 seq2)
+static inline int dn_equal(__u16 seq1, __u16 seq2)
 {
         return ((seq1 ^ seq2) & 0x0fff) == 0;
 }
 
-static __inline__ int dn_before_or_equal(__u16 seq1, __u16 seq2)
+static inline int dn_before_or_equal(__u16 seq1, __u16 seq2)
 {
 	return (dn_before(seq1, seq2) || dn_equal(seq1, seq2));
 }
 
-static __inline__ void seq_add(__u16 *seq, __u16 off)
+static inline void seq_add(__u16 *seq, __u16 off)
 {
         (*seq) += off;
         (*seq) &= 0x0fff;
 }
 
-static __inline__ int seq_next(__u16 seq1, __u16 seq2)
+static inline int seq_next(__u16 seq1, __u16 seq2)
 {
 	return dn_equal(seq1 + 1, seq2);
 }
@@ -185,7 +185,7 @@ static __inline__ int seq_next(__u16 seq1, __u16 seq2)
 /*
  * Can we delay the ack ?
  */
-static __inline__ int sendack(__u16 seq)
+static inline int sendack(__u16 seq)
 {
         return (int)((seq & 0x1000) ? 0 : 1);
 }
@@ -193,7 +193,7 @@ static __inline__ int sendack(__u16 seq)
 /*
  * Is socket congested ?
  */
-static __inline__ int dn_congested(struct sock *sk)
+static inline int dn_congested(struct sock *sk)
 {
         return atomic_read(&sk->sk_rmem_alloc) > (sk->sk_rcvbuf >> 1);
 }
diff --git a/include/net/ip.h b/include/net/ip.h
index e44b1a44f67a..0e9cf4a778f9 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -547,7 +547,7 @@ static inline void ip_ipgre_mc_map(__be32 naddr, const unsigned char *broadcast,
 #include <linux/ipv6.h>
 #endif
 
-static __inline__ void inet_reset_saddr(struct sock *sk)
+static inline void inet_reset_saddr(struct sock *sk)
 {
 	inet_sk(sk)->inet_rcv_saddr = inet_sk(sk)->inet_saddr = 0;
 #if IS_ENABLED(CONFIG_IPV6)
diff --git a/include/net/ip6_checksum.h b/include/net/ip6_checksum.h
index cca840584c88..27567477dcc6 100644
--- a/include/net/ip6_checksum.h
+++ b/include/net/ip6_checksum.h
@@ -55,7 +55,7 @@ static inline __wsum ip6_gro_compute_pseudo(struct sk_buff *skb, int proto)
 					    skb_gro_len(skb), proto, 0));
 }
 
-static __inline__ __sum16 tcp_v6_check(int len,
+static inline __sum16 tcp_v6_check(int len,
 				   const struct in6_addr *saddr,
 				   const struct in6_addr *daddr,
 				   __wsum base)
diff --git a/include/net/ipx.h b/include/net/ipx.h
index baf090390998..cf89ef92a5f7 100644
--- a/include/net/ipx.h
+++ b/include/net/ipx.h
@@ -47,7 +47,7 @@ struct ipxhdr {
 /* From af_ipx.c */
 extern int sysctl_ipx_pprop_broadcasting;
 
-static __inline__ struct ipxhdr *ipx_hdr(struct sk_buff *skb)
+static inline struct ipxhdr *ipx_hdr(struct sk_buff *skb)
 {
 	return (struct ipxhdr *)skb_transport_header(skb);
 }
@@ -139,7 +139,7 @@ void ipx_proc_exit(void);
 const char *ipx_frame_name(__be16);
 const char *ipx_device_name(struct ipx_interface *intrfc);
 
-static __inline__ void ipxitf_hold(struct ipx_interface *intrfc)
+static inline void ipxitf_hold(struct ipx_interface *intrfc)
 {
 	refcount_inc(&intrfc->refcnt);
 }
@@ -157,18 +157,18 @@ int ipxrtr_route_skb(struct sk_buff *skb);
 struct ipx_route *ipxrtr_lookup(__be32 net);
 int ipxrtr_ioctl(unsigned int cmd, void __user *arg);
 
-static __inline__ void ipxitf_put(struct ipx_interface *intrfc)
+static inline void ipxitf_put(struct ipx_interface *intrfc)
 {
 	if (refcount_dec_and_test(&intrfc->refcnt))
 		ipxitf_down(intrfc);
 }
 
-static __inline__ void ipxrtr_hold(struct ipx_route *rt)
+static inline void ipxrtr_hold(struct ipx_route *rt)
 {
 	        refcount_inc(&rt->refcnt);
 }
 
-static __inline__ void ipxrtr_put(struct ipx_route *rt)
+static inline void ipxrtr_put(struct ipx_route *rt)
 {
 	        if (refcount_dec_and_test(&rt->refcnt))
 			                kfree(rt);
diff --git a/include/net/llc_c_ev.h b/include/net/llc_c_ev.h
index 3948cf111dd0..266275a945b4 100644
--- a/include/net/llc_c_ev.h
+++ b/include/net/llc_c_ev.h
@@ -120,7 +120,7 @@ struct llc_conn_state_ev {
 	u8 cfm_prim;
 };
 
-static __inline__ struct llc_conn_state_ev *llc_conn_ev(struct sk_buff *skb)
+static inline struct llc_conn_state_ev *llc_conn_ev(struct sk_buff *skb)
 {
 	return (struct llc_conn_state_ev *)skb->cb;
 }
@@ -216,7 +216,7 @@ int llc_conn_ev_qlfy_set_status_refuse(struct sock *sk, struct sk_buff *skb);
 int llc_conn_ev_qlfy_set_status_conflict(struct sock *sk, struct sk_buff *skb);
 int llc_conn_ev_qlfy_set_status_rst_done(struct sock *sk, struct sk_buff *skb);
 
-static __inline__ int llc_conn_space(struct sock *sk, struct sk_buff *skb)
+static inline int llc_conn_space(struct sock *sk, struct sk_buff *skb)
 {
 	return atomic_read(&sk->sk_rmem_alloc) + skb->truesize <
 	       (unsigned int)sk->sk_rcvbuf;
diff --git a/include/net/llc_conn.h b/include/net/llc_conn.h
index df528a623548..27880d1bfd99 100644
--- a/include/net/llc_conn.h
+++ b/include/net/llc_conn.h
@@ -85,12 +85,12 @@ static inline struct llc_sock *llc_sk(const struct sock *sk)
 	return (struct llc_sock *)sk;
 }
 
-static __inline__ void llc_set_backlog_type(struct sk_buff *skb, char type)
+static inline void llc_set_backlog_type(struct sk_buff *skb, char type)
 {
 	skb->cb[sizeof(skb->cb) - 1] = type;
 }
 
-static __inline__ char llc_backlog_type(struct sk_buff *skb)
+static inline char llc_backlog_type(struct sk_buff *skb)
 {
 	return skb->cb[sizeof(skb->cb) - 1];
 }
diff --git a/include/net/llc_s_ev.h b/include/net/llc_s_ev.h
index 84db3a59ed28..00439d3e9f5d 100644
--- a/include/net/llc_s_ev.h
+++ b/include/net/llc_s_ev.h
@@ -44,7 +44,7 @@ struct llc_sap_state_ev {
 	struct llc_addr daddr;
 };
 
-static __inline__ struct llc_sap_state_ev *llc_sap_ev(struct sk_buff *skb)
+static inline struct llc_sap_state_ev *llc_sap_ev(struct sk_buff *skb)
 {
 	return (struct llc_sap_state_ev *)skb->cb;
 }
diff --git a/include/net/netrom.h b/include/net/netrom.h
index 5a0714ff500f..1741b7cc8962 100644
--- a/include/net/netrom.h
+++ b/include/net/netrom.h
@@ -123,7 +123,7 @@ struct nr_node {
 #define nr_node_hold(__nr_node) \
 	refcount_inc(&((__nr_node)->refcount))
 
-static __inline__ void nr_node_put(struct nr_node *nr_node)
+static inline void nr_node_put(struct nr_node *nr_node)
 {
 	if (refcount_dec_and_test(&nr_node->refcount)) {
 		kfree(nr_node);
@@ -133,7 +133,7 @@ static __inline__ void nr_node_put(struct nr_node *nr_node)
 #define nr_neigh_hold(__nr_neigh) \
 	refcount_inc(&((__nr_neigh)->refcount))
 
-static __inline__ void nr_neigh_put(struct nr_neigh *nr_neigh)
+static inline void nr_neigh_put(struct nr_neigh *nr_neigh)
 {
 	if (refcount_dec_and_test(&nr_neigh->refcount)) {
 		if (nr_neigh->ax25)
@@ -145,13 +145,13 @@ static __inline__ void nr_neigh_put(struct nr_neigh *nr_neigh)
 
 /* nr_node_lock and nr_node_unlock also hold/put the node's refcounter.
  */
-static __inline__ void nr_node_lock(struct nr_node *nr_node)
+static inline void nr_node_lock(struct nr_node *nr_node)
 {
 	nr_node_hold(nr_node);
 	spin_lock_bh(&nr_node->node_lock);
 }
 
-static __inline__ void nr_node_unlock(struct nr_node *nr_node)
+static inline void nr_node_unlock(struct nr_node *nr_node)
 {
 	spin_unlock_bh(&nr_node->node_lock);
 	nr_node_put(nr_node);
diff --git a/include/net/scm.h b/include/net/scm.h
index 1ce365f4c256..b77be632b440 100644
--- a/include/net/scm.h
+++ b/include/net/scm.h
@@ -44,16 +44,16 @@ void __scm_destroy(struct scm_cookie *scm);
 struct scm_fp_list *scm_fp_dup(struct scm_fp_list *fpl);
 
 #ifdef CONFIG_SECURITY_NETWORK
-static __inline__ void unix_get_peersec_dgram(struct socket *sock, struct scm_cookie *scm)
+static inline void unix_get_peersec_dgram(struct socket *sock, struct scm_cookie *scm)
 {
 	security_socket_getpeersec_dgram(sock, NULL, &scm->secid);
 }
 #else
-static __inline__ void unix_get_peersec_dgram(struct socket *sock, struct scm_cookie *scm)
+static inline void unix_get_peersec_dgram(struct socket *sock, struct scm_cookie *scm)
 { }
 #endif /* CONFIG_SECURITY_NETWORK */
 
-static __inline__ void scm_set_cred(struct scm_cookie *scm,
+static inline void scm_set_cred(struct scm_cookie *scm,
 				    struct pid *pid, kuid_t uid, kgid_t gid)
 {
 	scm->pid  = get_pid(pid);
@@ -62,20 +62,20 @@ static __inline__ void scm_set_cred(struct scm_cookie *scm,
 	scm->creds.gid = gid;
 }
 
-static __inline__ void scm_destroy_cred(struct scm_cookie *scm)
+static inline void scm_destroy_cred(struct scm_cookie *scm)
 {
 	put_pid(scm->pid);
 	scm->pid  = NULL;
 }
 
-static __inline__ void scm_destroy(struct scm_cookie *scm)
+static inline void scm_destroy(struct scm_cookie *scm)
 {
 	scm_destroy_cred(scm);
 	if (scm->fp)
 		__scm_destroy(scm);
 }
 
-static __inline__ int scm_send(struct socket *sock, struct msghdr *msg,
+static inline int scm_send(struct socket *sock, struct msghdr *msg,
 			       struct scm_cookie *scm, bool forcecreds)
 {
 	memset(scm, 0, sizeof(*scm));
@@ -110,7 +110,7 @@ static inline void scm_passec(struct socket *sock, struct msghdr *msg, struct sc
 { }
 #endif /* CONFIG_SECURITY_NETWORK */
 
-static __inline__ void scm_recv(struct socket *sock, struct msghdr *msg,
+static inline void scm_recv(struct socket *sock, struct msghdr *msg,
 				struct scm_cookie *scm, int flags)
 {
 	if (!msg->msg_control) {
diff --git a/include/net/udplite.h b/include/net/udplite.h
index 9185e45b997f..747859a3a00f 100644
--- a/include/net/udplite.h
+++ b/include/net/udplite.h
@@ -17,7 +17,7 @@ extern struct udp_table		udplite_table;
 /*
  *	Checksum computation is all in software, hence simpler getfrag.
  */
-static __inline__ int udplite_getfrag(void *from, char *to, int  offset,
+static inline int udplite_getfrag(void *from, char *to, int  offset,
 				      int len, int odd, struct sk_buff *skb)
 {
 	struct msghdr *msg = from;
diff --git a/include/net/x25.h b/include/net/x25.h
index ed1acc3044ac..4cb533479d61 100644
--- a/include/net/x25.h
+++ b/include/net/x25.h
@@ -242,12 +242,12 @@ struct x25_neigh *x25_get_neigh(struct net_device *);
 void x25_link_free(void);
 
 /* x25_neigh.c */
-static __inline__ void x25_neigh_hold(struct x25_neigh *nb)
+static inline void x25_neigh_hold(struct x25_neigh *nb)
 {
 	refcount_inc(&nb->refcnt);
 }
 
-static __inline__ void x25_neigh_put(struct x25_neigh *nb)
+static inline void x25_neigh_put(struct x25_neigh *nb)
 {
 	if (refcount_dec_and_test(&nb->refcnt))
 		kfree(nb);
@@ -265,12 +265,12 @@ void x25_route_device_down(struct net_device *dev);
 int x25_route_ioctl(unsigned int, void __user *);
 void x25_route_free(void);
 
-static __inline__ void x25_route_hold(struct x25_route *rt)
+static inline void x25_route_hold(struct x25_route *rt)
 {
 	refcount_inc(&rt->refcnt);
 }
 
-static __inline__ void x25_route_put(struct x25_route *rt)
+static inline void x25_route_put(struct x25_route *rt)
 {
 	if (refcount_dec_and_test(&rt->refcnt))
 		kfree(rt);
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 0eb390c205af..2f03fd96dce9 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -903,7 +903,7 @@ static inline bool addr4_match(__be32 a1, __be32 a2, u8 prefixlen)
 	return !((a1 ^ a2) & htonl(~0UL << (32 - prefixlen)));
 }
 
-static __inline__
+static inline
 __be16 xfrm_flowi_sport(const struct flowi *fl, const union flowi_uli *uli)
 {
 	__be16 port;
@@ -930,7 +930,7 @@ __be16 xfrm_flowi_sport(const struct flowi *fl, const union flowi_uli *uli)
 	return port;
 }
 
-static __inline__
+static inline
 __be16 xfrm_flowi_dport(const struct flowi *fl, const union flowi_uli *uli)
 {
 	__be16 port;
@@ -1325,7 +1325,7 @@ static inline int xfrm6_policy_check_reverse(struct sock *sk, int dir,
 }
 #endif
 
-static __inline__
+static inline
 xfrm_address_t *xfrm_flowi_daddr(const struct flowi *fl, unsigned short family)
 {
 	switch (family){
@@ -1337,7 +1337,7 @@ xfrm_address_t *xfrm_flowi_daddr(const struct flowi *fl, unsigned short family)
 	return NULL;
 }
 
-static __inline__
+static inline
 xfrm_address_t *xfrm_flowi_saddr(const struct flowi *fl, unsigned short family)
 {
 	switch (family){
@@ -1349,7 +1349,7 @@ xfrm_address_t *xfrm_flowi_saddr(const struct flowi *fl, unsigned short family)
 	return NULL;
 }
 
-static __inline__
+static inline
 void xfrm_flowi_addr_get(const struct flowi *fl,
 			 xfrm_address_t *saddr, xfrm_address_t *daddr,
 			 unsigned short family)
@@ -1366,7 +1366,7 @@ void xfrm_flowi_addr_get(const struct flowi *fl,
 	}
 }
 
-static __inline__ int
+static inline int
 __xfrm4_state_addr_check(const struct xfrm_state *x,
 			 const xfrm_address_t *daddr, const xfrm_address_t *saddr)
 {
@@ -1376,7 +1376,7 @@ __xfrm4_state_addr_check(const struct xfrm_state *x,
 	return 0;
 }
 
-static __inline__ int
+static inline int
 __xfrm6_state_addr_check(const struct xfrm_state *x,
 			 const xfrm_address_t *daddr, const xfrm_address_t *saddr)
 {
@@ -1388,7 +1388,7 @@ __xfrm6_state_addr_check(const struct xfrm_state *x,
 	return 0;
 }
 
-static __inline__ int
+static inline int
 xfrm_state_addr_check(const struct xfrm_state *x,
 		      const xfrm_address_t *daddr, const xfrm_address_t *saddr,
 		      unsigned short family)
@@ -1402,7 +1402,7 @@ xfrm_state_addr_check(const struct xfrm_state *x,
 	return 0;
 }
 
-static __inline__ int
+static inline int
 xfrm_state_addr_flow_check(const struct xfrm_state *x, const struct flowi *fl,
 			   unsigned short family)
 {
diff --git a/include/uapi/linux/atm.h b/include/uapi/linux/atm.h
index 95ebdcf4fe88..a89c6529d0e0 100644
--- a/include/uapi/linux/atm.h
+++ b/include/uapi/linux/atm.h
@@ -215,13 +215,13 @@ struct sockaddr_atmsvc {
 };
 
 
-static __inline__ int atmsvc_addr_in_use(struct sockaddr_atmsvc addr)
+static inline int atmsvc_addr_in_use(struct sockaddr_atmsvc addr)
 {
 	return *addr.sas_addr.prv || *addr.sas_addr.pub;
 }
 
 
-static __inline__ int atmpvc_addr_in_use(struct sockaddr_atmpvc addr)
+static inline int atmpvc_addr_in_use(struct sockaddr_atmpvc addr)
 {
 	return addr.sap_addr.itf || addr.sap_addr.vpi || addr.sap_addr.vci;
 }
diff --git a/include/uapi/linux/atmsap.h b/include/uapi/linux/atmsap.h
index fc052481eae0..97f3699ead0b 100644
--- a/include/uapi/linux/atmsap.h
+++ b/include/uapi/linux/atmsap.h
@@ -155,7 +155,7 @@ struct atm_sap {
 };
 
 
-static __inline__ int blli_in_use(struct atm_blli blli)
+static inline int blli_in_use(struct atm_blli blli)
 {
 	return blli.l2_proto || blli.l3_proto;
 }
diff --git a/include/uapi/linux/map_to_7segment.h b/include/uapi/linux/map_to_7segment.h
index f9ed18134b83..16c89adab225 100644
--- a/include/uapi/linux/map_to_7segment.h
+++ b/include/uapi/linux/map_to_7segment.h
@@ -76,7 +76,7 @@ struct seg7_conversion_map {
 	unsigned char	table[128];
 };
 
-static __inline__ int map_to_seg7(struct seg7_conversion_map *map, int c)
+static inline int map_to_seg7(struct seg7_conversion_map *map, int c)
 {
 	return c >= 0 && c < sizeof(map->table) ? map->table[c] : -EINVAL;
 }
diff --git a/include/uapi/linux/netfilter_arp/arp_tables.h b/include/uapi/linux/netfilter_arp/arp_tables.h
index a2a0927d9bd6..4d733bbaa183 100644
--- a/include/uapi/linux/netfilter_arp/arp_tables.h
+++ b/include/uapi/linux/netfilter_arp/arp_tables.h
@@ -197,7 +197,7 @@ struct arpt_get_entries {
 };
 
 /* Helper functions */
-static __inline__ struct xt_entry_target *arpt_get_target(struct arpt_entry *e)
+static inline struct xt_entry_target *arpt_get_target(struct arpt_entry *e)
 {
 	return (void *)e + e->target_offset;
 }
diff --git a/include/uapi/linux/netfilter_bridge/ebtables.h b/include/uapi/linux/netfilter_bridge/ebtables.h
index 3b86c14ea49d..0c5bf95a8d44 100644
--- a/include/uapi/linux/netfilter_bridge/ebtables.h
+++ b/include/uapi/linux/netfilter_bridge/ebtables.h
@@ -191,7 +191,7 @@ struct ebt_entry {
 	unsigned char elems[0] __attribute__ ((aligned (__alignof__(struct ebt_replace))));
 };
 
-static __inline__ struct ebt_entry_target *
+static inline struct ebt_entry_target *
 ebt_get_target(struct ebt_entry *e)
 {
 	return (void *)e + e->target_offset;
diff --git a/include/uapi/linux/netfilter_ipv4/ip_tables.h b/include/uapi/linux/netfilter_ipv4/ip_tables.h
index 6aaeb14bfce1..06c7900e36c4 100644
--- a/include/uapi/linux/netfilter_ipv4/ip_tables.h
+++ b/include/uapi/linux/netfilter_ipv4/ip_tables.h
@@ -219,7 +219,7 @@ struct ipt_get_entries {
 };
 
 /* Helper functions */
-static __inline__ struct xt_entry_target *
+static inline struct xt_entry_target *
 ipt_get_target(struct ipt_entry *e)
 {
 	return (void *)e + e->target_offset;
diff --git a/include/uapi/linux/netfilter_ipv6/ip6_tables.h b/include/uapi/linux/netfilter_ipv6/ip6_tables.h
index 031d0a43bed2..f84464aab65b 100644
--- a/include/uapi/linux/netfilter_ipv6/ip6_tables.h
+++ b/include/uapi/linux/netfilter_ipv6/ip6_tables.h
@@ -259,7 +259,7 @@ struct ip6t_get_entries {
 };
 
 /* Helper functions */
-static __inline__ struct xt_entry_target *
+static inline struct xt_entry_target *
 ip6t_get_target(struct ip6t_entry *e)
 {
 	return (void *)e + e->target_offset;
diff --git a/include/video/newport.h b/include/video/newport.h
index bcbb3d1b6bf9..108bc554c4b8 100644
--- a/include/video/newport.h
+++ b/include/video/newport.h
@@ -426,7 +426,7 @@ static inline unsigned short newport_vc2_get(struct newport_regs *regs,
 #define NCMAP_REGADDR_RREG   0x00000060
 #define NCMAP_PROTOCOL       (0x00008000 | 0x00040000 | 0x00800000)
 
-static __inline__ void newport_cmap_setaddr(struct newport_regs *regs,
+static inline void newport_cmap_setaddr(struct newport_regs *regs,
 					unsigned short addr)
 {
 	regs->set.dcbmode = (NPORT_DMODE_ACMALL | NCMAP_PROTOCOL |
@@ -437,7 +437,7 @@ static __inline__ void newport_cmap_setaddr(struct newport_regs *regs,
 			   NCMAP_REGADDR_PBUF | NPORT_DMODE_W3);
 }
 
-static __inline__ void newport_cmap_setrgb(struct newport_regs *regs,
+static inline void newport_cmap_setrgb(struct newport_regs *regs,
 				       unsigned char red,
 				       unsigned char green,
 				       unsigned char blue)
@@ -450,7 +450,7 @@ static __inline__ void newport_cmap_setrgb(struct newport_regs *regs,
 
 /* Miscellaneous NEWPORT routines. */
 #define BUSY_TIMEOUT 100000
-static __inline__ int newport_wait(struct newport_regs *regs)
+static inline int newport_wait(struct newport_regs *regs)
 {
 	int t = BUSY_TIMEOUT;
 
@@ -460,7 +460,7 @@ static __inline__ int newport_wait(struct newport_regs *regs)
 	return !t;
 }
 
-static __inline__ int newport_bfwait(struct newport_regs *regs)
+static inline int newport_bfwait(struct newport_regs *regs)
 {
 	int t = BUSY_TIMEOUT;
 
@@ -547,7 +547,7 @@ static __inline__ int newport_bfwait(struct newport_regs *regs)
 #define WAYSLOW_DCB_XMAP9_PROTOCOL DCB_CYCLES (12, 12, 0)
 #define R_DCB_XMAP9_PROTOCOL       DCB_CYCLES (2, 1, 3)
 
-static __inline__ void
+static inline void
 xmap9FIFOWait (struct newport_regs *rex)
 {
         rex->set.dcbmode = DCB_XMAP0 | XM9_CRS_FIFO_AVAIL |
@@ -558,7 +558,7 @@ xmap9FIFOWait (struct newport_regs *rex)
 		;
 }
 
-static __inline__ void
+static inline void
 xmap9SetModeReg (struct newport_regs *rex, unsigned int modereg, unsigned int data24, int cfreq)
 {
         if (cfreq > 119)
diff --git a/lib/zstd/mem.h b/lib/zstd/mem.h
index 3a0f34c8706c..739837a59ad6 100644
--- a/lib/zstd/mem.h
+++ b/lib/zstd/mem.h
@@ -27,7 +27,7 @@
 /*-****************************************
 *  Compiler specifics
 ******************************************/
-#define ZSTD_STATIC static __inline __attribute__((unused))
+#define ZSTD_STATIC static inline __attribute__((unused))
 
 /*-**************************************************************
 *  Basic Types
diff --git a/net/appletalk/atalk_proc.c b/net/appletalk/atalk_proc.c
index 8006295f8bd7..6bc7c80a44de 100644
--- a/net/appletalk/atalk_proc.c
+++ b/net/appletalk/atalk_proc.c
@@ -17,7 +17,7 @@
 #include <linux/export.h>
 
 
-static __inline__ struct atalk_iface *atalk_get_interface_idx(loff_t pos)
+static inline struct atalk_iface *atalk_get_interface_idx(loff_t pos)
 {
 	struct atalk_iface *i;
 
@@ -78,7 +78,7 @@ static int atalk_seq_interface_show(struct seq_file *seq, void *v)
 	return 0;
 }
 
-static __inline__ struct atalk_route *atalk_get_route_idx(loff_t pos)
+static inline struct atalk_route *atalk_get_route_idx(loff_t pos)
 {
 	struct atalk_route *r;
 
diff --git a/net/appletalk/ddp.c b/net/appletalk/ddp.c
index 9b6bc5abe946..822bf45adb73 100644
--- a/net/appletalk/ddp.c
+++ b/net/appletalk/ddp.c
@@ -1277,7 +1277,7 @@ static int atalk_getname(struct socket *sock, struct sockaddr *uaddr,
 }
 
 #if IS_ENABLED(CONFIG_IPDDP)
-static __inline__ int is_ip_over_ddp(struct sk_buff *skb)
+static inline int is_ip_over_ddp(struct sk_buff *skb)
 {
 	return skb->data[12] == 22;
 }
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 91592fceeaad..8d2fd114f6e2 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -865,7 +865,7 @@ static void neigh_periodic_work(struct work_struct *work)
 	write_unlock_bh(&tbl->lock);
 }
 
-static __inline__ int neigh_max_probes(struct neighbour *n)
+static inline int neigh_max_probes(struct neighbour *n)
 {
 	struct neigh_parms *p = n->parms;
 	return NEIGH_VAR(p, UCAST_PROBES) + NEIGH_VAR(p, APP_PROBES) +
diff --git a/net/core/scm.c b/net/core/scm.c
index b1ff8a441748..d508a27f8552 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -45,7 +45,7 @@
  *	setu(g)id.
  */
 
-static __inline__ int scm_check_creds(struct ucred *creds)
+static inline int scm_check_creds(struct ucred *creds)
 {
 	const struct cred *cred = current_cred();
 	kuid_t uid = make_kuid(cred->user_ns, creds->uid);
diff --git a/net/decnet/dn_nsp_in.c b/net/decnet/dn_nsp_in.c
index 2fb5e055ba25..508bb7a341aa 100644
--- a/net/decnet/dn_nsp_in.c
+++ b/net/decnet/dn_nsp_in.c
@@ -583,7 +583,7 @@ static void dn_nsp_linkservice(struct sock *sk, struct sk_buff *skb)
  * bh_lock_sock() (its already held when this is called) which
  * also allows data and other data to be queued to a socket.
  */
-static __inline__ int dn_queue_skb(struct sock *sk, struct sk_buff *skb, int sig, struct sk_buff_head *queue)
+static inline int dn_queue_skb(struct sock *sk, struct sk_buff *skb, int sig, struct sk_buff_head *queue)
 {
 	int err;
 
diff --git a/net/decnet/dn_nsp_out.c b/net/decnet/dn_nsp_out.c
index a1779de6bd9c..2ed306ce9010 100644
--- a/net/decnet/dn_nsp_out.c
+++ b/net/decnet/dn_nsp_out.c
@@ -529,7 +529,7 @@ void dn_send_conn_conf(struct sock *sk, gfp_t gfp)
 }
 
 
-static __inline__ void dn_nsp_do_disc(struct sock *sk, unsigned char msgflg,
+static inline void dn_nsp_do_disc(struct sock *sk, unsigned char msgflg,
 			unsigned short reason, gfp_t gfp,
 			struct dst_entry *dst,
 			int ddl, unsigned char *dd, __le16 rem, __le16 loc)
diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c
index 1c002c0fb712..6b7c61544d76 100644
--- a/net/decnet/dn_route.c
+++ b/net/decnet/dn_route.c
@@ -174,7 +174,7 @@ static void dn_dst_ifdown(struct dst_entry *dst, struct net_device *dev, int how
 	}
 }
 
-static __inline__ unsigned int dn_hash(__le16 src, __le16 dst)
+static inline unsigned int dn_hash(__le16 src, __le16 dst)
 {
 	__u16 tmp = (__u16 __force)(src ^ dst);
 	tmp ^= (tmp >> 3);
diff --git a/net/decnet/dn_table.c b/net/decnet/dn_table.c
index f0710b5d037d..90b5144c20bf 100644
--- a/net/decnet/dn_table.c
+++ b/net/decnet/dn_table.c
@@ -405,7 +405,7 @@ static void dn_rtmsg_fib(int event, struct dn_fib_node *f, int z, u32 tb_id,
 		rtnl_set_sk_err(&init_net, RTNLGRP_DECnet_ROUTE, err);
 }
 
-static __inline__ int dn_hash_dump_bucket(struct sk_buff *skb,
+static inline int dn_hash_dump_bucket(struct sk_buff *skb,
 				struct netlink_callback *cb,
 				struct dn_fib_table *tb,
 				struct dn_zone *dz,
@@ -434,7 +434,7 @@ static __inline__ int dn_hash_dump_bucket(struct sk_buff *skb,
 	return skb->len;
 }
 
-static __inline__ int dn_hash_dump_zone(struct sk_buff *skb,
+static inline int dn_hash_dump_zone(struct sk_buff *skb,
 				struct netlink_callback *cb,
 				struct dn_fib_table *tb,
 				struct dn_zone *dz)
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 4da39446da2d..177e1f9c6bce 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -18,7 +18,7 @@
  *
  *	Fixes:
  *
- *		Alan Cox	:	Added lots of __inline__ to optimise
+ *		Alan Cox	:	Added lots of inline to optimise
  *					the memory usage of all the tiny little
  *					functions.
  *		Alan Cox	:	Dumped the header building experiment.
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 9a4261e50272..35896e9897c1 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -100,7 +100,7 @@ bool ipv6_mod_enabled(void)
 }
 EXPORT_SYMBOL_GPL(ipv6_mod_enabled);
 
-static __inline__ struct ipv6_pinfo *inet6_sk_generic(struct sock *sk)
+static inline struct ipv6_pinfo *inet6_sk_generic(struct sock *sk)
 {
 	const int offset = sk->sk_prot->obj_size - sizeof(struct ipv6_pinfo);
 
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index c9c53ade55c3..488a04e188c0 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -111,7 +111,7 @@ static const struct inet6_protocol icmpv6_protocol = {
 };
 
 /* Called with BH disabled */
-static __inline__ struct sock *icmpv6_xmit_lock(struct net *net)
+static inline struct sock *icmpv6_xmit_lock(struct net *net)
 {
 	struct sock *sk;
 
@@ -126,7 +126,7 @@ static __inline__ struct sock *icmpv6_xmit_lock(struct net *net)
 	return sk;
 }
 
-static __inline__ void icmpv6_xmit_unlock(struct sock *sk)
+static inline void icmpv6_xmit_unlock(struct sock *sk)
 {
 	spin_unlock(&sk->sk_lock.slock);
 }
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 28c4aa5078fc..c5165652929a 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -541,7 +541,7 @@ static int __udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 	return 0;
 }
 
-static __inline__ void udpv6_err(struct sk_buff *skb,
+static inline void udpv6_err(struct sk_buff *skb,
 				 struct inet6_skb_parm *opt, u8 type,
 				 u8 code, int offset, __be32 info)
 {
@@ -947,7 +947,7 @@ static void udp_v6_early_demux(struct sk_buff *skb)
 	}
 }
 
-static __inline__ int udpv6_rcv(struct sk_buff *skb)
+static inline int udpv6_rcv(struct sk_buff *skb)
 {
 	return __udp6_lib_rcv(skb, &udp_table, IPPROTO_UDP);
 }
diff --git a/net/lapb/lapb_iface.c b/net/lapb/lapb_iface.c
index db6e0afe3a20..b797138fec8c 100644
--- a/net/lapb/lapb_iface.c
+++ b/net/lapb/lapb_iface.c
@@ -52,12 +52,12 @@ static void lapb_free_cb(struct lapb_cb *lapb)
 	kfree(lapb);
 }
 
-static __inline__ void lapb_hold(struct lapb_cb *lapb)
+static inline void lapb_hold(struct lapb_cb *lapb)
 {
 	refcount_inc(&lapb->refcnt);
 }
 
-static __inline__ void lapb_put(struct lapb_cb *lapb)
+static inline void lapb_put(struct lapb_cb *lapb)
 {
 	if (refcount_dec_and_test(&lapb->refcnt))
 		lapb_free_cb(lapb);
diff --git a/net/llc/llc_input.c b/net/llc/llc_input.c
index 82cb93f66b9b..6d467c8b1d70 100644
--- a/net/llc/llc_input.c
+++ b/net/llc/llc_input.c
@@ -72,7 +72,7 @@ void llc_set_station_handler(void (*handler)(struct sk_buff *skb))
  *
  *	This function returns which LLC component must handle this PDU.
  */
-static __inline__ int llc_pdu_type(struct sk_buff *skb)
+static inline int llc_pdu_type(struct sk_buff *skb)
 {
 	int type = LLC_DEST_CONN; /* I-PDU or S-PDU type */
 	struct llc_pdu_sn *pdu = llc_pdu_sn_hdr(skb);
diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 161b0224d6ae..0cbe8de662a6 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -389,7 +389,7 @@ our $Attribute	= qr{
 			__weak
 		  }x;
 our $Modifier;
-our $Inline	= qr{inline|__always_inline|noinline|__inline|__inline__};
+our $Inline	= qr{inline|__always_inline|noinline|inline|inline};
 our $Member	= qr{->$Ident|\.$Ident|\[[^]]*\]};
 our $Lval	= qr{$Ident(?:$Member)*};
 
@@ -5782,13 +5782,13 @@ sub process {
 			      "inline keyword should sit between storage class and type\n" . $herecurr);
 		}
 
-# Check for __inline__ and __inline, prefer inline
+# Check for inline and inline, prefer inline
 		if ($realfile !~ m@\binclude/uapi/@ &&
-		    $line =~ /\b(__inline__|__inline)\b/) {
+		    $line =~ /\b(inline|inline)\b/) {
 			if (WARN("INLINE",
 				 "plain inline is preferred over $1\n" . $herecurr) &&
 			    $fix) {
-				$fixed[$fixlinenr] =~ s/\b(__inline__|__inline)\b/inline/;
+				$fixed[$fixlinenr] =~ s/\b(inline|inline)\b/inline/;
 
 			}
 		}
diff --git a/scripts/genksyms/keywords.c b/scripts/genksyms/keywords.c
index 9f40bcd17d07..f8cb811cd97d 100644
--- a/scripts/genksyms/keywords.c
+++ b/scripts/genksyms/keywords.c
@@ -14,8 +14,8 @@ static struct resword {
 	{ "__const", CONST_KEYW },
 	{ "__const__", CONST_KEYW },
 	{ "__extension__", EXTENSION_KEYW },
-	{ "__inline", INLINE_KEYW },
-	{ "__inline__", INLINE_KEYW },
+	{ "inline", INLINE_KEYW },
+	{ "inline", INLINE_KEYW },
 	{ "__signed", SIGNED_KEYW },
 	{ "__signed__", SIGNED_KEYW },
 	{ "__typeof", TYPEOF_KEYW },
diff --git a/scripts/kernel-doc b/scripts/kernel-doc
index 8f0f508a78e9..9caed6640b58 100755
--- a/scripts/kernel-doc
+++ b/scripts/kernel-doc
@@ -1569,8 +1569,8 @@ sub dump_function($$) {
     $prototype =~ s/^extern +//;
     $prototype =~ s/^asmlinkage +//;
     $prototype =~ s/^inline +//;
-    $prototype =~ s/^__inline__ +//;
-    $prototype =~ s/^__inline +//;
+    $prototype =~ s/^inline +//;
+    $prototype =~ s/^inline +//;
     $prototype =~ s/^__always_inline +//;
     $prototype =~ s/^noinline +//;
     $prototype =~ s/__init +//;
diff --git a/sound/sparc/amd7930.c b/sound/sparc/amd7930.c
index 56f17410fcea..3d1e603fdcfe 100644
--- a/sound/sparc/amd7930.c
+++ b/sound/sparc/amd7930.c
@@ -344,7 +344,7 @@ struct snd_amd7930 {
 static struct snd_amd7930 *amd7930_list;
 
 /* Idle the AMD7930 chip.  The amd->lock is not held.  */
-static __inline__ void amd7930_idle(struct snd_amd7930 *amd)
+static inline void amd7930_idle(struct snd_amd7930 *amd)
 {
 	unsigned long flags;
 
@@ -355,7 +355,7 @@ static __inline__ void amd7930_idle(struct snd_amd7930 *amd)
 }
 
 /* Enable chip interrupts.  The amd->lock is not held.  */
-static __inline__ void amd7930_enable_ints(struct snd_amd7930 *amd)
+static inline void amd7930_enable_ints(struct snd_amd7930 *amd)
 {
 	unsigned long flags;
 
@@ -366,7 +366,7 @@ static __inline__ void amd7930_enable_ints(struct snd_amd7930 *amd)
 }
 
 /* Disable chip interrupts.  The amd->lock is not held.  */
-static __inline__ void amd7930_disable_ints(struct snd_amd7930 *amd)
+static inline void amd7930_disable_ints(struct snd_amd7930 *amd)
 {
 	unsigned long flags;
 

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-31 12:55                           ` Peter Zijlstra
@ 2018-10-31 13:11                             ` Peter Zijlstra
  2018-10-31 16:31                             ` Segher Boessenkool
                                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 116+ messages in thread
From: Peter Zijlstra @ 2018-10-31 13:11 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Segher Boessenkool, Ingo Molnar, Richard Biener, Michael Matz,
	gcc, Nadav Amit, Ingo Molnar, linux-kernel, x86, Masahiro Yamada,
	Sam Ravnborg, Alok Kataria, Christopher Li, Greg Kroah-Hartman,
	H. Peter Anvin, Jan Beulich, Josh Poimboeuf, Juergen Gross,
	Kate Stewart, Kees Cook, linux-sparse, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa, Andrew Morton

On Wed, Oct 31, 2018 at 01:55:26PM +0100, Peter Zijlstra wrote:
> On Sat, Oct 13, 2018 at 09:33:35PM +0200, Borislav Petkov wrote:
> > Ok,
> > 
> > with Segher's help I've been playing with his patch on top of bleeding
> > edge gcc 9 and here are my observations. Please double-check me for
> > booboos so that they can be addressed while there's time.
> > 
> > So here's what I see on top of 4.19-rc7:
> > 
> > First marked the alternative asm() as inline and undeffed the "inline"
> > keyword. I need to do that because of the funky games we do with
> > "inline" redefinitions in include/linux/compiler_types.h.
> > 
> > And Segher hinted at either doing:
> > 
> > asm volatile inline(...
> > 
> > or
> > 
> > asm volatile __inline__(...
> > 
> > but both "inline" variants are defined as macros in that file.
> > 
> > Which means we either need to #undef inline before using it in asm() or
> > come up with something cleverer.
> 
> # git grep -e "\<__inline__\>" | wc -l
> 488
> # git grep -e "\<__inline\>" | wc -l
> 56
> # git grep -e "\<inline\>" | wc -l
> 69598
> 
> And we already have scripts/checkpatch.pl:
> 
>   # Check for __inline__ and __inline, prefer inline
> 
> Which suggests we do:
> 
> git grep -l "\<__inline\(\|__\)\>" | while read file
> do
> 	sed -i -e 's/\<__inline\(\|__\)\>/inline/g' $file
> done
> 
> and get it over with.
> 
> 
> Anyway, with the below patch, I get:
> 
>    text    data     bss     dec     hex filename
> 17385183        5064780 1953892 24403855        1745f8f defconfig-build/vmlinux
> 17385678        5064780 1953892 24404350        174617e defconfig-build/vmlinux

  17387603        5065468 1953892 24406963        1746bb3 defconfig-build/vmlinux

If I do an additional:

git grep -l "asm volatile" | while read file
do
	sed -i -e 's/asm volatile/asm_volatile/g' $file
done

on the tree...

No changes for:

-#define asm_volatile_goto(x...)        do { asm goto(x); asm (""); } while (0)
+#define asm_volatile_goto(x...)        do { asm __inline__ goto (x); asm (""); } while (0)

I suppose all our gotos are small now (my tree includes Nadav's patch
to static_cpu_has).
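
For reference, the growth between the three size(1) rows above can be
checked with a little shell arithmetic. This is only a sketch: it assumes
the rows are listed in the order unpatched, patched, and patched plus the
asm_volatile conversion, which the thread implies but does not state
outright. The variable names are made up for illustration.

```shell
#!/bin/sh
# Text-segment sizes copied from the size(1) rows quoted above.
# Assumption: rows are in the order baseline, patched, converted.
base=17385183
patched=17385678
converted=17387603

# Each difference is how much .text grew relative to the previous build.
echo "patch:      +$((patched - base)) bytes of text"
echo "conversion: +$((converted - patched)) bytes of text"
```

Under that assumption the patch adds 495 bytes of text and the
asm_volatile conversion a further 1925 bytes.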

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-31 12:55                           ` Peter Zijlstra
  2018-10-31 13:11                             ` Peter Zijlstra
@ 2018-10-31 16:31                             ` Segher Boessenkool
  2018-11-01  5:20                             ` Joe Perches
  2018-12-27  4:47                             ` Masahiro Yamada
  3 siblings, 0 replies; 116+ messages in thread
From: Segher Boessenkool @ 2018-10-31 16:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Borislav Petkov, Ingo Molnar, Richard Biener, Michael Matz, gcc,
	Nadav Amit, Ingo Molnar, linux-kernel, x86, Masahiro Yamada,
	Sam Ravnborg, Alok Kataria, Christopher Li, Greg Kroah-Hartman,
	H. Peter Anvin, Jan Beulich, Josh Poimboeuf, Juergen Gross,
	Kate Stewart, Kees Cook, linux-sparse, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa, Andrew Morton

On Wed, Oct 31, 2018 at 01:55:26PM +0100, Peter Zijlstra wrote:
> Anyway, with the below patch, I get:
> 
>    text    data     bss     dec     hex filename
> 17385183        5064780 1953892 24403855        1745f8f defconfig-build/vmlinux
> 17385678        5064780 1953892 24404350        174617e defconfig-build/vmlinux
> 
> Which shows we inline more (look for asm_volatile for the actual
> changes).
> 
> 
> So yes, this seems like something we could work with.

Great to hear.  Note that with my latest patches
(see https://gcc.gnu.org/ml/gcc-patches/2018-10/msg01931.html ) there
is no required order "asm volatile inline" anymore, so you can just say
"asm inline volatile".  (And similar for "goto" as well).


Segher

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-31 12:55                           ` Peter Zijlstra
  2018-10-31 13:11                             ` Peter Zijlstra
  2018-10-31 16:31                             ` Segher Boessenkool
@ 2018-11-01  5:20                             ` Joe Perches
  2018-11-01  9:01                               ` Peter Zijlstra
  2018-12-27  4:47                             ` Masahiro Yamada
  3 siblings, 1 reply; 116+ messages in thread
From: Joe Perches @ 2018-11-01  5:20 UTC (permalink / raw)
  To: Peter Zijlstra, Borislav Petkov
  Cc: Segher Boessenkool, Ingo Molnar, Richard Biener, Michael Matz,
	gcc, Nadav Amit, Ingo Molnar, linux-kernel, x86, Masahiro Yamada,
	Sam Ravnborg, Alok Kataria, Christopher Li, Greg Kroah-Hartman,
	H. Peter Anvin, Jan Beulich, Josh Poimboeuf, Juergen Gross,
	Kate Stewart, Kees Cook, linux-sparse, Philippe Ombredanne,
	Thomas Gleixner, virtualization, Linus Torvalds, Chris Zankel,
	Max Filippov, linux-xtensa, Andrew Morton

On Wed, 2018-10-31 at 13:55 +0100, Peter Zijlstra wrote:
> 
> Anyway, with the below patch, I get:
> 
>    text    data     bss     dec     hex filename
> 17385183        5064780 1953892 24403855        1745f8f defconfig-build/vmlinux
> 17385678        5064780 1953892 24404350        174617e defconfig-build/vmlinux
> 
> Which shows we inline more (look for asm_volatile for the actual
> changes).
[]
>  scripts/checkpatch.pl                              |  8 ++---
>  scripts/genksyms/keywords.c                        |  4 +--
>  scripts/kernel-doc                                 |  4 +--

I believe these should be excluded from the conversions.

Other than that, generic conversion by script seems a good idea.
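
A minimal sketch of such a scripted conversion that leaves scripts/ alone
might look like the following. It reuses the sed expression from earlier in
the thread, assumes GNU grep and GNU sed (for the \< and \> word anchors and
for sed -i), and uses plain grep -r instead of git grep so that it can run
outside a git checkout. The function name convert_inline and the demo tree
are hypothetical, not part of any posted patch.

```shell
#!/bin/sh
# Rewrite __inline and __inline__ to plain inline under the tree at $1,
# skipping everything below scripts/ (checkpatch.pl, genksyms and
# kernel-doc match on the old spellings and must keep them).
convert_inline() {
	(
		cd "$1" || exit 1
		grep -rlE '\<__inline(__)?\>' . |
		grep -vE '^\./scripts/' |
		while read -r file
		do
			sed -i -e 's/\<__inline\(\|__\)\>/inline/g' "$file"
		done
	)
}

# Tiny throwaway tree to try the function on.
demo=$(mktemp -d)
mkdir -p "$demo/net" "$demo/scripts"
printf 'static __inline__ int f(void);\n' > "$demo/net/a.h"
printf 'our $Inline = qr{inline|__inline|__inline__};\n' > "$demo/scripts/checkpatch.pl"

convert_inline "$demo"
cat "$demo/net/a.h" "$demo/scripts/checkpatch.pl"
```

Filtering the file list, rather than reverting the scripts/ hunks afterwards,
keeps checkpatch's $Inline pattern intact, so it can still recognise the old
spellings. Further directories could be added to the exclusion pattern in the
same way.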


* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-11-01  5:20                             ` Joe Perches
@ 2018-11-01  9:01                               ` Peter Zijlstra
  2018-11-01  9:20                                 ` Joe Perches
  0 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2018-11-01  9:01 UTC (permalink / raw)
  To: Joe Perches
  Cc: Borislav Petkov, Segher Boessenkool, Ingo Molnar, Richard Biener,
	Michael Matz, gcc, Nadav Amit, Ingo Molnar, linux-kernel, x86,
	Masahiro Yamada, Sam Ravnborg, Alok Kataria, Christopher Li,
	Greg Kroah-Hartman, H. Peter Anvin, Jan Beulich, Josh Poimboeuf,
	Juergen Gross, Kate Stewart, Kees Cook, linux-sparse,
	Philippe Ombredanne, Thomas Gleixner, virtualization,
	Linus Torvalds, Chris Zankel, Max Filippov, linux-xtensa,
	Andrew Morton

On Wed, Oct 31, 2018 at 10:20:00PM -0700, Joe Perches wrote:
> On Wed, 2018-10-31 at 13:55 +0100, Peter Zijlstra wrote:
> > 
> > Anyway, with the below patch, I get:
> > 
> >    text    data     bss     dec     hex filename
> > 17385183        5064780 1953892 24403855        1745f8f defconfig-build/vmlinux
> > 17385678        5064780 1953892 24404350        174617e defconfig-build/vmlinux
> > 
> > Which shows we inline more (look for asm_volatile for the actual
> > changes).
> []
> >  scripts/checkpatch.pl                              |  8 ++---
> >  scripts/genksyms/keywords.c                        |  4 +--
> >  scripts/kernel-doc                                 |  4 +--
> 
> I believe these should be excluded from the conversions.

Probably, yes. It compiled, which was all I cared about :-)

BTW, if we do that conversion, we should upgrade the checkpatch warn to
an error I suppose.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-11-01  9:01                               ` Peter Zijlstra
@ 2018-11-01  9:20                                 ` Joe Perches
  2018-11-01 11:15                                   ` Peter Zijlstra
  0 siblings, 1 reply; 116+ messages in thread
From: Joe Perches @ 2018-11-01  9:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Borislav Petkov, Segher Boessenkool, Ingo Molnar, Richard Biener,
	Michael Matz, gcc, Nadav Amit, Ingo Molnar, linux-kernel, x86,
	Masahiro Yamada, Sam Ravnborg, Alok Kataria, Christopher Li,
	Greg Kroah-Hartman, H. Peter Anvin, Jan Beulich, Josh Poimboeuf,
	Juergen Gross, Kate Stewart, Kees Cook, linux-sparse,
	Philippe Ombredanne, Thomas Gleixner, virtualization,
	Linus Torvalds, Chris Zankel, Max Filippov, linux-xtensa,
	Andrew Morton

On Thu, 2018-11-01 at 10:01 +0100, Peter Zijlstra wrote:
> On Wed, Oct 31, 2018 at 10:20:00PM -0700, Joe Perches wrote:
> > On Wed, 2018-10-31 at 13:55 +0100, Peter Zijlstra wrote:
> > > Anyway, with the below patch, I get:
> > > 
> > >    text    data     bss     dec     hex filename
> > > 17385183        5064780 1953892 24403855        1745f8f defconfig-build/vmlinux
> > > 17385678        5064780 1953892 24404350        174617e defconfig-build/vmlinux
> > > 
> > > Which shows we inline more (look for asm_volatile for the actual
> > > changes).
> > []
> > >  scripts/checkpatch.pl                              |  8 ++---
> > >  scripts/genksyms/keywords.c                        |  4 +--
> > >  scripts/kernel-doc                                 |  4 +--
> > 
> > I believe these should be excluded from the conversions.
> 
> Probably, yes. It compiled, which was all I cared about :-)
> 
> BTW, if we do that conversion, we should upgrade the checkpatch warn to
> an error I suppose.

More like remove altogether as __inline and __inline__
will no longer be #defined

$ git grep -P 'define\s+__inline'
include/linux/compiler_types.h:#define __inline__ inline
include/linux/compiler_types.h:#define __inline   inline


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-11-01  9:20                                 ` Joe Perches
@ 2018-11-01 11:15                                   ` Peter Zijlstra
  0 siblings, 0 replies; 116+ messages in thread
From: Peter Zijlstra @ 2018-11-01 11:15 UTC (permalink / raw)
  To: Joe Perches
  Cc: Borislav Petkov, Segher Boessenkool, Ingo Molnar, Richard Biener,
	Michael Matz, gcc, Nadav Amit, Ingo Molnar, linux-kernel, x86,
	Masahiro Yamada, Sam Ravnborg, Alok Kataria, Christopher Li,
	Greg Kroah-Hartman, H. Peter Anvin, Jan Beulich, Josh Poimboeuf,
	Juergen Gross, Kate Stewart, Kees Cook, linux-sparse,
	Philippe Ombredanne, Thomas Gleixner, virtualization,
	Linus Torvalds, Chris Zankel, Max Filippov, linux-xtensa,
	Andrew Morton

On Thu, Nov 01, 2018 at 02:20:40AM -0700, Joe Perches wrote:
> On Thu, 2018-11-01 at 10:01 +0100, Peter Zijlstra wrote:
> > On Wed, Oct 31, 2018 at 10:20:00PM -0700, Joe Perches wrote:
> > > On Wed, 2018-10-31 at 13:55 +0100, Peter Zijlstra wrote:
> > > > Anyway, with the below patch, I get:
> > > > 
> > > >    text    data     bss     dec     hex filename
> > > > 17385183        5064780 1953892 24403855        1745f8f defconfig-build/vmlinux
> > > > 17385678        5064780 1953892 24404350        174617e defconfig-build/vmlinux
> > > > 
> > > > Which shows we inline more (look for asm_volatile for the actual
> > > > changes).
> > > []
> > > >  scripts/checkpatch.pl                              |  8 ++---
> > > >  scripts/genksyms/keywords.c                        |  4 +--
> > > >  scripts/kernel-doc                                 |  4 +--
> > > 
> > > I believe these should be excluded from the conversions.
> > 
> > Probably, yes. It compiled, which was all I cared about :-)
> > 
> > BTW, if we do that conversion, we should upgrade the checkpatch warn to
> > an error I suppose.
> 
> More like remove altogether as __inline and __inline__
> will no longer be #defined

That's the point, therefore checkpatch should error when it sees them.
Otherwise we'll grow new instances, because it will compile just fine.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm
  2018-10-03 21:30 ` [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm Nadav Amit
  2018-10-04 10:01   ` [tip:x86/build] kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs tip-bot for Nadav Amit
@ 2018-11-06 18:57   ` Logan Gunthorpe
  2018-11-06 19:18     ` Nadav Amit
  1 sibling, 1 reply; 116+ messages in thread
From: Logan Gunthorpe @ 2018-11-06 18:57 UTC (permalink / raw)
  To: Nadav Amit, Ingo Molnar
  Cc: linux-kernel, x86, Sam Ravnborg, Michal Marek, Thomas Gleixner,
	H. Peter Anvin, linux-kbuild, Stephen Bates



On 2018-10-03 3:30 p.m., Nadav Amit wrote:
> +ASM_MACRO_FLAGS = -Wa,arch/x86/kernel/macros.s -Wa,-
> +export ASM_MACRO_FLAGS
> +KBUILD_CFLAGS += $(ASM_MACRO_FLAGS)

I'm not sure how much we care about this or what we can do about it, but
adding the macros.s to the command line like this has effectively broken
distributed compiling with distcc and icecc as of v4.20-rc1. Neither
tool will successfully distribute any compile processes because the
non-local machines won't have a copy of macros.s.

Thanks,

Logan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm
  2018-11-06 18:57   ` [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm Logan Gunthorpe
@ 2018-11-06 19:18     ` Nadav Amit
  2018-11-06 20:01       ` Logan Gunthorpe
  0 siblings, 1 reply; 116+ messages in thread
From: Nadav Amit @ 2018-11-06 19:18 UTC (permalink / raw)
  To: Logan Gunthorpe, Ingo Molnar
  Cc: LKML, X86 ML, Sam Ravnborg, Michal Marek, Thomas Gleixner,
	H. Peter Anvin, Linux Kbuild mailing list, Stephen Bates

From: Logan Gunthorpe
Sent: November 6, 2018 at 6:57:57 PM GMT
> To: Nadav Amit <namit@vmware.com>, Ingo Molnar <mingo@redhat.com>
> Cc: linux-kernel@vger.kernel.org, x86@kernel.org, Sam Ravnborg <sam@ravnborg.org>, Michal Marek <michal.lkml@markovi.net>, Thomas Gleixner <tglx@linutronix.de>, H. Peter Anvin <hpa@zytor.com>, linux-kbuild@vger.kernel.org, Stephen Bates <sbates@raithlin.com>
> Subject: Re: [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm
> 
> 
> 
> 
> On 2018-10-03 3:30 p.m., Nadav Amit wrote:
>> +ASM_MACRO_FLAGS = -Wa,arch/x86/kernel/macros.s -Wa,-
>> +export ASM_MACRO_FLAGS
>> +KBUILD_CFLAGS += $(ASM_MACRO_FLAGS)
> 
> I'm not sure how much we care about this or what we can do about it, but
> adding the macros.s to the command line like this has effectively broken
> distributed compiling with distcc and icecc as of v4.20-rc1. Neither
> tool will successfully distribute any compile processes because the
> non-local machines won't have a copy of macros.s.

Err.. I don’t have a dist-cc environment. I wonder whether something like
that would do the trick:

-- >8 --

Subject: [PATCH] x86: Fix dist-cc

---
 arch/x86/Makefile | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 5b562e464009..81d76cbcffff 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -237,9 +237,9 @@ archheaders:
 	$(Q)$(MAKE) $(build)=arch/x86/entry/syscalls all
 
 archmacros:
-	$(Q)$(MAKE) $(build)=arch/x86/kernel arch/x86/kernel/macros.s
+	$(Q)$(MAKE) $(build)=arch/x86/kernel ${objtree}/arch/x86/kernel/macros.s
 
-ASM_MACRO_FLAGS = -Wa,arch/x86/kernel/macros.s -Wa,-
+ASM_MACRO_FLAGS = -Wa,${objtree}/arch/x86/kernel/macros.s -Wa,-
 export ASM_MACRO_FLAGS
 KBUILD_CFLAGS += $(ASM_MACRO_FLAGS)
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm
  2018-11-06 19:18     ` Nadav Amit
@ 2018-11-06 20:01       ` Logan Gunthorpe
  2018-11-07 18:01         ` Nadav Amit
  0 siblings, 1 reply; 116+ messages in thread
From: Logan Gunthorpe @ 2018-11-06 20:01 UTC (permalink / raw)
  To: Nadav Amit, Ingo Molnar
  Cc: LKML, X86 ML, Sam Ravnborg, Michal Marek, Thomas Gleixner,
	H. Peter Anvin, Linux Kbuild mailing list, Stephen Bates



On 2018-11-06 12:18 p.m., Nadav Amit wrote:
> Err.. I don’t have a dist-cc environment. I wonder whether something like
> that would do the trick:

I tested it to be sure -- but, no, this does not have any effect on
distcc. Distcc usually does pre-processing locally and then sends the
single output of cpp to a remote for compilation and assembly. It
doesn't expect there to be additional assembly files tacked on the
command line and isn't going to handle that case correctly. Thus,
changing the path of the file won't fix anything.
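
To illustrate the point, here is a minimal shell sketch (not anything distcc
or icecc actually runs) that lists the files a compiler command line hands to
the assembler through -Wa, options. The helper name wa_file_args is made up
for this example, and it assumes a single assembler argument per -Wa, option.
Any path it prints is read on the machine that runs the assembler, so it does
not travel with the preprocessed source:

```shell
#!/bin/sh
# Hypothetical helper (an assumption for illustration, not part of distcc):
# print each "-Wa,<arg>" argument whose <arg> is an existing regular file.
# Simplification: assumes a single assembler argument per -Wa, option.
wa_file_args() {
    for arg in "$@"; do
        case "$arg" in
            -Wa,*)
                path=${arg#-Wa,}
                # "-Wa,-" (stdin) and real assembler flags are not files.
                if [ -f "$path" ]; then
                    printf '%s\n' "$path"
                fi
                ;;
        esac
    done
}
```

Run from the top of a built tree with the flags quoted above, this should
print arch/x86/kernel/macros.s (provided the file has been generated) and
skip "-Wa,-".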

Logan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm
  2018-11-06 20:01       ` Logan Gunthorpe
@ 2018-11-07 18:01         ` Nadav Amit
  2018-11-07 18:53           ` Logan Gunthorpe
  0 siblings, 1 reply; 116+ messages in thread
From: Nadav Amit @ 2018-11-07 18:01 UTC (permalink / raw)
  To: Logan Gunthorpe, Ingo Molnar
  Cc: LKML, X86 ML, Sam Ravnborg, Michal Marek, Thomas Gleixner,
	H. Peter Anvin, Linux Kbuild mailing list, Stephen Bates

From: Logan Gunthorpe
Sent: November 6, 2018 at 8:01:31 PM GMT
> To: Nadav Amit <namit@vmware.com>, Ingo Molnar <mingo@redhat.com>
> Cc: LKML <linux-kernel@vger.kernel.org>, X86 ML <x86@kernel.org>, Sam Ravnborg <sam@ravnborg.org>, Michal Marek <michal.lkml@markovi.net>, Thomas Gleixner <tglx@linutronix.de>, H. Peter Anvin <hpa@zytor.com>, Linux Kbuild mailing list <linux-kbuild@vger.kernel.org>, Stephen Bates <sbates@raithlin.com>
> Subject: Re: [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm
> 
> 
> 
> 
> On 2018-11-06 12:18 p.m., Nadav Amit wrote:
>> Err.. I don’t have a dist-cc environment. I wonder whether something like
>> that would do the trick:
> 
> I tested it to be sure -- but, no, this does not have any effect on
> distcc. Distcc usually does pre-processing locally and then sends the
> single output of cpp to a remote for compilation and assembly. It
> doesn't expect there to be additional assembly files tacked on the
> command line and isn't going to handle that case correctly. Thus,
> changing the path of the file won't fix anything.

Thanks for the explanation.

I played with dist-cc and gave it some thought, but I do not have a
solution. I thought that perhaps some hack is possible in which the same
preprocessed file is used as both the source file and the assembly file,
but that’s as far as I managed to get.

Ideas? Do people care about it?

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm
  2018-11-07 18:01         ` Nadav Amit
@ 2018-11-07 18:53           ` Logan Gunthorpe
  2018-11-07 18:56             ` Nadav Amit
  0 siblings, 1 reply; 116+ messages in thread
From: Logan Gunthorpe @ 2018-11-07 18:53 UTC (permalink / raw)
  To: Nadav Amit, Ingo Molnar
  Cc: LKML, X86 ML, Sam Ravnborg, Michal Marek, Thomas Gleixner,
	H. Peter Anvin, Linux Kbuild mailing list, Stephen Bates



On 2018-11-07 11:01 a.m., Nadav Amit wrote:
> Ideas? Do people care about it?

Just spitballing, but is there a reason we didn't just put the macros
for inline assembly directly in the header? Something like this:

asm(".macro ANNOTATE_UNREACHABLE counter:req\n\t"
    "\\counter:\n\t"
        ".pushsection .discard.unreachable\n\t"
        ".long \\counter\\()b -.\n\t"
        ".popsection\n\t"
    ".endm\n");

#define annotate_unreachable() ({                   \
    asm volatile("ANNOTATE_UNREACHABLE counter=%c0" \
                 : : "i" (__COUNTER__));            \
})

It does mean the macros won't be usable in non-inline assembly, without
duplicating them, but do we care about that?

I would have expected people compiling the kernel with distcc to be
fairly common -- it certainly speeds up my kernel compiles quite
significantly and thus saves me a lot of time.

Logan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm
  2018-11-07 18:53           ` Logan Gunthorpe
@ 2018-11-07 18:56             ` Nadav Amit
  2018-11-07 21:43               ` Logan Gunthorpe
  0 siblings, 1 reply; 116+ messages in thread
From: Nadav Amit @ 2018-11-07 18:56 UTC (permalink / raw)
  To: Logan Gunthorpe, Ingo Molnar
  Cc: LKML, X86 ML, Sam Ravnborg, Michal Marek, Thomas Gleixner,
	H. Peter Anvin, Linux Kbuild mailing list, Stephen Bates

From: Logan Gunthorpe
Sent: November 7, 2018 at 6:53:02 PM GMT
> Subject: Re: [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm
> 
> 
> 
> 
> On 2018-11-07 11:01 a.m., Nadav Amit wrote:
>> Ideas? Do people care about it?
> 
> Just spit balling, but is there a reason we didn't just put the macros
> for inline assembly directly in the header? Something like this:
> 
> asm(".macro ANNOTATE_UNREACHABLE counter:req\n\t"
>    "\\counter:\n\t"
>        ".pushsection .discard.unreachable\n\t"
>        ".long \\counter\\()b -.\n\t"
>        ".popsection\n\t"
>    ".endm\n");
> 
> #define annotate_unreachable() ({                   \
>    asm volatile("ANNOTATE_UNREACHABLE counter=%c0" \
>                 : : "i" (__COUNTER__));            \
> })
> 
> It does mean the macros won't be usable in non-inline assembly, without
> duplicating them, but do we care about that?

HPA indicated more than once that this is wrong (and that was indeed my
initial solution), since it is not guaranteed that the compiler would put
the macro assembly before its use.
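
One way to see whether a particular build is affected is to look at the
compiler's assembly output (gcc -S) and check that the .macro definition
comes before the first use. A rough sketch of such a check follows; the
helper name and the assumption that every directive sits on its own line are
made up for this example and are not taken from the kernel tree:

```shell
#!/bin/sh
# Hypothetical check: succeed only if ".macro NAME" appears in the assembly
# file and is not preceded by a line that invokes NAME.
# $1: assembly file (e.g. the output of gcc -S), $2: macro name.
macro_defined_before_use() {
    awk -v name="$2" '
        # First definition of the macro.
        $1 == ".macro" && $2 == name { if (!def) def = NR; next }
        # First line that invokes the macro.
        $1 == name                   { if (!use) use = NR }
        # Exit 0 when defined and either unused or defined before use.
        END { exit !(def && (!use || def < use)) }
    ' "$1"
}
```

A non-zero exit status means the assembler would hit the macro invocation
before it has seen the definition, which is the failure mode described above.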


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm
  2018-11-07 18:56             ` Nadav Amit
@ 2018-11-07 21:43               ` Logan Gunthorpe
  2018-11-07 21:50                 ` hpa
  0 siblings, 1 reply; 116+ messages in thread
From: Logan Gunthorpe @ 2018-11-07 21:43 UTC (permalink / raw)
  To: Nadav Amit, Ingo Molnar
  Cc: LKML, X86 ML, Sam Ravnborg, Michal Marek, Thomas Gleixner,
	H. Peter Anvin, Linux Kbuild mailing list, Stephen Bates



On 2018-11-07 11:56 a.m., Nadav Amit wrote:
> HPA indicated more than once that this is wrong (and that was indeed my
> initial solution), since it is not guaranteed that the compiler would put
> the macro assembly before its use.

Hmm, that's very unfortunate. I don't really understand why the compiler
would not put the macro assembly in the same order as it appears in the
code, where it would be in the correct order.

In any case, I've submitted a couple of issues to icecc[1] and distcc[2]
to see if they have any ideas about supporting this on their sides.

Thanks,

Logan

[1] https://github.com/icecc/icecream/issues/428
[2] https://github.com/distcc/distcc/issues/312



^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm
  2018-11-07 21:43               ` Logan Gunthorpe
@ 2018-11-07 21:50                 ` hpa
  2018-11-08  6:18                   ` Nadav Amit
  0 siblings, 1 reply; 116+ messages in thread
From: hpa @ 2018-11-07 21:50 UTC (permalink / raw)
  To: Logan Gunthorpe, Nadav Amit, Ingo Molnar
  Cc: LKML, X86 ML, Sam Ravnborg, Michal Marek, Thomas Gleixner,
	Linux Kbuild mailing list, Stephen Bates

On November 7, 2018 1:43:39 PM PST, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
>On 2018-11-07 11:56 a.m., Nadav Amit wrote:
>> HPA indicated more than once that this is wrong (and that was indeed my
>> initial solution), since it is not guaranteed that the compiler would put
>> the macro assembly before its use.
>
>Hmm, that's very unfortunate. I don't really understand why the compiler
>would not put the macro assembly in the same order as it appears in the
>code and it would be in the correct order there.
>
>In any case, I've submitted a couple of issues to icecc[1] and distcc[2]
>to see if they have any ideas about supporting this on their sides.
>
>Thanks,
>
>Logan
>
>[1] https://github.com/icecc/icecream/issues/428
>[2] https://github.com/distcc/distcc/issues/312

Apparently gcc will treat them like basic blocks and possibly move them around.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm
  2018-11-07 21:50                 ` hpa
@ 2018-11-08  6:18                   ` Nadav Amit
  2018-11-08 17:14                     ` Logan Gunthorpe
  0 siblings, 1 reply; 116+ messages in thread
From: Nadav Amit @ 2018-11-08  6:18 UTC (permalink / raw)
  To: hpa, Logan Gunthorpe, Ingo Molnar
  Cc: LKML, X86 ML, Sam Ravnborg, Michal Marek, Thomas Gleixner,
	Linux Kbuild mailing list, Stephen Bates

From: hpa@zytor.com
Sent: November 7, 2018 at 9:50:28 PM GMT
> To: Logan Gunthorpe <logang@deltatee.com>, Nadav Amit <namit@vmware.com>, Ingo Molnar <mingo@redhat.com>
> Cc: LKML <linux-kernel@vger.kernel.org>, X86 ML <x86@kernel.org>, Sam Ravnborg <sam@ravnborg.org>, Michal Marek <michal.lkml@markovi.net>, Thomas Gleixner <tglx@linutronix.de>, Linux Kbuild mailing list <linux-kbuild@vger.kernel.org>, Stephen Bates <sbates@raithlin.com>
> Subject: Re: [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm
> 
> 
> On November 7, 2018 1:43:39 PM PST, Logan Gunthorpe <logang@deltatee.com> wrote:
>> On 2018-11-07 11:56 a.m., Nadav Amit wrote:
>>> HPA indicated more than once that this is wrong (and that was indeed my
>>> initial solution), since it is not guaranteed that the compiler would put
>>> the macro assembly before its use.
>> 
>> Hmm, that's very unfortunate. I don't really understand why the compiler
>> would not put the macro assembly in the same order as it appears in the
>> code and it would be in the correct order there.
>> 
>> In any case, I've submitted a couple of issues to icecc[1] and distcc[2]
>> to see if they have any ideas about supporting this on their sides.
>> 
>> Thanks,
>> 
>> Logan
>> 
>> [1] https://github.com/icecc/icecream/issues/428
>> [2] https://github.com/distcc/distcc/issues/312
> 
> Apparently gcc will treat them like basic blocks and possibly move them around.

Maybe it is possible to break the compilation of each object into two
stages: first, compile the source without assembly, and then take the
generated .s file and assemble it with the .s file of the macros.

Does it sound like something that may work? I guess it should only be done
when distcc is used.


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm
  2018-11-08  6:18                   ` Nadav Amit
@ 2018-11-08 17:14                     ` Logan Gunthorpe
  2018-11-08 19:54                       ` Nadav Amit
  0 siblings, 1 reply; 116+ messages in thread
From: Logan Gunthorpe @ 2018-11-08 17:14 UTC (permalink / raw)
  To: Nadav Amit, hpa, Ingo Molnar
  Cc: LKML, X86 ML, Sam Ravnborg, Michal Marek, Thomas Gleixner,
	Linux Kbuild mailing list, Stephen Bates



On 2018-11-07 11:18 p.m., Nadav Amit wrote:
>> Apparently gcc will treat them like basic blocks and possibly move them around.
> 
> Maybe it is possible to break the compilation of each object into two
> stages: first, compile the source without assembly, and then take the
> generated .s file and assemble it with the .s file of the macros.
> 
> Does it sound like something that may work? I guess it should only be done
> when distcc is used.

In theory it would at least allow the compile step to be distributed,
the assembly step would still have to be done locally... It'd be better
than nothing, I guess.

It'd also be difficult to know when distribution is being done and that
it's necessary to split the steps. We'd have to add an environment
variable or something and users would need to know they have to set it
when using a distributed compile.
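
As a rough sketch of what such an opt-in could look like (the variable name
DISTCC and the function name are placeholders for illustration, not an
existing kbuild interface):

```shell
#!/bin/sh
# Hypothetical opt-in switch: DISTCC=y in the environment selects the split
# compile/assemble path; anything else keeps the single-step default.
compile_mode() {
    if [ "${DISTCC:-n}" = "y" ]; then
        echo "two-step"
    else
        echo "single-step"
    fi
}
```

The default stays on the single-step path, so people who do not distribute
their builds would not pay for the extra stage.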

Logan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm
  2018-11-08 17:14                     ` Logan Gunthorpe
@ 2018-11-08 19:54                       ` Nadav Amit
  2018-11-08 20:00                         ` Logan Gunthorpe
  0 siblings, 1 reply; 116+ messages in thread
From: Nadav Amit @ 2018-11-08 19:54 UTC (permalink / raw)
  To: Logan Gunthorpe, hpa, Ingo Molnar
  Cc: LKML, X86 ML, Sam Ravnborg, Michal Marek, Thomas Gleixner,
	Linux Kbuild mailing list, Stephen Bates

From: Logan Gunthorpe
Sent: November 8, 2018 at 5:14:58 PM GMT
> To: Nadav Amit <namit@vmware.com>, hpa@zytor.com <hpa@zytor.com>, Ingo Molnar <mingo@redhat.com>
> Cc: LKML <linux-kernel@vger.kernel.org>, X86 ML <x86@kernel.org>, Sam Ravnborg <sam@ravnborg.org>, Michal Marek <michal.lkml@markovi.net>, Thomas Gleixner <tglx@linutronix.de>, Linux Kbuild mailing list <linux-kbuild@vger.kernel.org>, Stephen Bates <sbates@raithlin.com>
> Subject: Re: [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm
> 
> 
> 
> 
> On 2018-11-07 11:18 p.m., Nadav Amit wrote:
>>> Apparently gcc will treat them like basic blocks and possibly move them around.
>> 
>> Maybe it is possible to break the compilation of each object into two
>> stages: first, compile the source without assembly, and then take the
>> generated .s file and assemble it with the .s file of the macros.
>> 
>> Does it sound like something that may work? I guess it should only be done
>> when distcc is used.
> 
> In theory it would at least allow the compile step to be distributed,
> the assembly step would still have to be done locally... It'd be better
> than nothing, I guess.

I don’t think the assembly stage needs to be done locally. gcc can still be
used to deploy the assembler. I am not too familiar with distcc, so I don’t
know whether the preprocessing supports multiple source-files, and whether
it has some strange behavior when it comes to .S/.s files.

Well, I’ll give it a try.


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm
  2018-11-08 19:54                       ` Nadav Amit
@ 2018-11-08 20:00                         ` Logan Gunthorpe
  2018-11-08 20:18                           ` Nadav Amit
  0 siblings, 1 reply; 116+ messages in thread
From: Logan Gunthorpe @ 2018-11-08 20:00 UTC (permalink / raw)
  To: Nadav Amit, hpa, Ingo Molnar
  Cc: LKML, X86 ML, Sam Ravnborg, Michal Marek, Thomas Gleixner,
	Linux Kbuild mailing list, Stephen Bates



On 2018-11-08 12:54 p.m., Nadav Amit wrote:
> I don’t think the assembly stage needs to be done locally. gcc can still be
> used to deploy the assembler. I am not too familiar with distcc, so I don’t
> know whether the preprocessing supports multiple source-files, and whether
> it has some strange-behavior when it comes to .S/.s files.

The problem is that it's the assembly stage that needs the extra
macros.s and without that file being transferred somehow, that stage
must be done locally.

The distcc guys have a thought that the macros.s file might be able to
be transferred using pump mode with a change to include_server.py. I
don't think that will work for icecc though.

Logan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm
  2018-11-08 20:00                         ` Logan Gunthorpe
@ 2018-11-08 20:18                           ` Nadav Amit
  2018-11-10 22:04                             ` Nadav Amit
  0 siblings, 1 reply; 116+ messages in thread
From: Nadav Amit @ 2018-11-08 20:18 UTC (permalink / raw)
  To: Logan Gunthorpe, hpa, Ingo Molnar
  Cc: LKML, X86 ML, Sam Ravnborg, Michal Marek, Thomas Gleixner,
	Linux Kbuild mailing list, Stephen Bates

From: Logan Gunthorpe
Sent: November 8, 2018 at 8:00:33 PM GMT
> To: Nadav Amit <namit@vmware.com>, hpa@zytor.com <hpa@zytor.com>, Ingo Molnar <mingo@redhat.com>
> Cc: LKML <linux-kernel@vger.kernel.org>, X86 ML <x86@kernel.org>, Sam Ravnborg <sam@ravnborg.org>, Michal Marek <michal.lkml@markovi.net>, Thomas Gleixner <tglx@linutronix.de>, Linux Kbuild mailing list <linux-kbuild@vger.kernel.org>, Stephen Bates <sbates@raithlin.com>
> Subject: Re: [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm
> 
> 
> 
> 
> On 2018-11-08 12:54 p.m., Nadav Amit wrote:
>> I don’t think the assembly stage needs to be done locally. gcc can still be
>> used to deploy the assembler. I am not too familiar with distcc, so I don’t
>> know whether the preprocessing supports multiple source-files, and whether
>> it has some strange-behavior when it comes to .S/.s files.
> 
> The problem is that it's the assembly stage that needs the extra
> macros.s and without that file being transferred somehow, that stage
> must be done locally.

I understand, but the idea is to have two stages, for instance:

  gcc ... -S -o .tmp_[unit].S [unit].c

and then

  gcc ... -D__ASSEMBLY__ arch/x86/kernel/macros.S .tmp_[unit].S

(Yes, I realize the .tmp_[unit].S was already preprocessed, but this way you
can combine both inputs)

Unfortunately, as I write this email (and run tests) I realize distcc is too
dumb to handle two input files:

  "(dcc_scan_args) do we have two inputs?  i give up "

Just great. So I guess macros.s would need to be concatenated with
.tmp_[unit].s as a separate (local) interim stage.
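
Spelled out as a shell sketch, the three steps would look roughly like this.
The function names and the use of mktemp are assumptions for illustration; a
real change would live in scripts/Makefile.build, and a kernel build needs
the full set of flags rather than the bare compiler invocations shown here:

```shell
#!/bin/sh
# Interim local stage: put the macro definitions ahead of the compiler output.
# $1: macros file, $2: compiled (not yet assembled) unit, $3: combined output.
prepend_macros() {
    cat "$1" "$2" > "$3"
}

# Hypothetical driver: two_stage_compile SRC OBJ MACROS
two_stage_compile() {
    src=$1 obj=$2 macros=$3
    tmpdir=$(mktemp -d) || return 1
    # Stage 1: compile only; this step needs nothing but the C source.
    ${CC:-gcc} -S -o "$tmpdir/unit.s" "$src" &&
    # Glue the macros in front of the generated assembly.
    prepend_macros "$macros" "$tmpdir/unit.s" "$tmpdir/combined.s" &&
    # Stage 2: assemble a single, self-contained input file.
    ${CC:-gcc} -c -o "$obj" "$tmpdir/combined.s"
    status=$?
    rm -rf "$tmpdir"
    return "$status"
}
```

Each compiler invocation then sees exactly one input file, which is the case
distcc is prepared to handle.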


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm
  2018-11-08 20:18                           ` Nadav Amit
@ 2018-11-10 22:04                             ` Nadav Amit
  2018-11-13  4:56                               ` Logan Gunthorpe
  0 siblings, 1 reply; 116+ messages in thread
From: Nadav Amit @ 2018-11-10 22:04 UTC (permalink / raw)
  To: Logan Gunthorpe, hpa, Ingo Molnar
  Cc: LKML, X86 ML, Sam Ravnborg, Michal Marek, Thomas Gleixner,
	Linux Kbuild mailing list, Stephen Bates, Masami Hiramatsu

From: Nadav Amit
Sent: November 8, 2018 at 8:18:23 PM GMT
> To: Logan Gunthorpe <logang@deltatee.com>, hpa@zytor.com <hpa@zytor.com>, Ingo Molnar <mingo@redhat.com>
> Cc: LKML <linux-kernel@vger.kernel.org>, X86 ML <x86@kernel.org>, Sam Ravnborg <sam@ravnborg.org>, Michal Marek <michal.lkml@markovi.net>, Thomas Gleixner <tglx@linutronix.de>, Linux Kbuild mailing list <linux-kbuild@vger.kernel.org>, Stephen Bates <sbates@raithlin.com>
> Subject: Re: [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm
> 
> 
> From: Logan Gunthorpe
> Sent: November 8, 2018 at 8:00:33 PM GMT
>> To: Nadav Amit <namit@vmware.com>, hpa@zytor.com <hpa@zytor.com>, Ingo Molnar <mingo@redhat.com>
>> Cc: LKML <linux-kernel@vger.kernel.org>, X86 ML <x86@kernel.org>, Sam Ravnborg <sam@ravnborg.org>, Michal Marek <michal.lkml@markovi.net>, Thomas Gleixner <tglx@linutronix.de>, Linux Kbuild mailing list <linux-kbuild@vger.kernel.org>, Stephen Bates <sbates@raithlin.com>
>> Subject: Re: [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm
>> 
>> 
>> 
>> 
>> On 2018-11-08 12:54 p.m., Nadav Amit wrote:
>>> I don’t think the assembly stage needs to be done locally. gcc can still be
>>> used to deploy the assembler. I am not too familiar with distcc, so I don’t
>>> know whether the preprocessing supports multiple source-files, and whether
>>> it has some strange-behavior when it comes to .S/.s files.
>> 
>> The problem is that it's the assembly stage that needs the extra
>> macros.s and without that file being transferred somehow, that stage
>> must be done locally.
> 
> I understand, but the idea is to have two stages, for instance:
> 
>  gcc ... -S -o .tmp_[unit].S [unit].c
> 
> and then
> 
>  gcc ... -D__ASSEMBLY__ arch/x86/kernel/macros.S .tmp_[unit].S
> 
> (Yes, I realize the .tmp_[unit].S was already preprocessed, but this way you
> can combine both inputs)
> 
> Unfortunately, as I write this email (and run tests) I realize distcc is too
> dumb to handle two input files:
> 
>  "(dcc_scan_args) do we have two inputs?  i give up "
> 
> Just great. So I guess macros.s would need to be concatenated with
> .tmp_[unit].s as a separate (local) interim stage.

Err.. I hate makefiles and distcc doesn’t make life easier. For instance, if
it sees two source files on the command-line, it freaks out and compiles them
locally. It also has an option to distribute assembly, but it is not enabled
by default.

Anyhow, can you try the following patch?

-- >8 --

Subject: [PATCH] Makefile: Fix distcc compilation with x86 macros

Introducing the use of asm macros in c-code broke distcc, since it only
sends the preprocessed source file. The solution is to break the
compilation into two separate phases of compilation and assembly, and
between the two concatenate the assembly macros and the compiled (yet
not assembled) source file. Since this is less efficient, this
compilation mode is only used when make is called with the "DISTCC=y"
parameter.

Note that the assembly stage should also be distributable, if distcc is
configured using "CFLAGS=-DENABLE_REMOTE_ASSEMBLE".

Reported-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 Makefile               |  4 +++-
 arch/x86/Makefile      |  4 +++-
 scripts/Makefile.build | 29 +++++++++++++++++++++++++++--
 3 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/Makefile b/Makefile
index 9fce8b91c15f..c07349fc38c7 100644
--- a/Makefile
+++ b/Makefile
@@ -743,7 +743,9 @@ KBUILD_CFLAGS   += $(call cc-option, -gsplit-dwarf, -g)
 else
 KBUILD_CFLAGS	+= -g
 endif
-KBUILD_AFLAGS	+= -Wa,-gdwarf-2
+AFLAGS_DEBUG_INFO = -Wa,-gdwarf-2
+export AFLAGS_DEBUG_INFO
+KBUILD_AFLAGS	+= $(AFLAGS_DEBUG_INFO)
 endif
 ifdef CONFIG_DEBUG_INFO_DWARF4
 KBUILD_CFLAGS	+= $(call cc-option, -gdwarf-4,)
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index f5d7f4134524..080bd9cbc4e1 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -238,7 +238,9 @@ archheaders:
 archmacros:
 	$(Q)$(MAKE) $(build)=arch/x86/kernel arch/x86/kernel/macros.s
 
-ASM_MACRO_FLAGS = -Wa,arch/x86/kernel/macros.s
+ASM_MACRO_FILE = arch/x86/kernel/macros.s
+export ASM_MACRO_FILE
+ASM_MACRO_FLAGS = -Wa,$(ASM_MACRO_FILE)
 export ASM_MACRO_FLAGS
 KBUILD_CFLAGS += $(ASM_MACRO_FLAGS)
 
diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index 6a6be9f440cf..2b79789a3e10 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -155,8 +155,33 @@ $(obj)/%.ll: $(src)/%.c FORCE
 
 quiet_cmd_cc_o_c = CC $(quiet_modtag)  $@
 
+# If distcc is used, then when an assembly macro file is needed, the
+# compilation stage and the assembly stage need to be separated. Providing
+# "DISTCC=y" option enables the separate compilation and assembly.
+cmd_cc_o_c_simple = $(CC) $(c_flags) -c -o $(1) $<
+
+ifeq ($(DISTCC),y)
+a_flags_no_debug = $(filter-out $(AFLAGS_DEBUG_INFO), $(a_flags))
+c_flags_no_macros = $(filter-out $(ASM_MACRO_FLAGS), $(c_flags))
+
+cmd_cc_o_c_two_steps =							\
+	$(CC) $(c_flags_no_macros) $(DISABLE_LTO) -fverbose-asm -S 	\
+		-o $(@D)/.$(@F:.o=.s) $< ;				\
+	cat $(ASM_MACRO_FILE) $(@D)/.$(@F:.o=.s) > 			\
+		$(@D)/.tmp_$(@F:.o=.s);					\
+	$(CC) $(a_flags_no_debug) -c -o $(1) $(@D)/.tmp_$(@F:.o=.s) ; 	\
+	rm -f $(@D)/.$(@F:.o=.s) $(@D)/.tmp_$(@F:.o=.s)			\
+
+cmd_cc_o_c_helper =    							\
+	$(if $(findstring $(ASM_MACRO_FLAGS),$(c_flags)),		\
+		$(call cmd_cc_o_c_two_steps, $(1)),			\
+		$(call cmd_cc_o_c_simple, $(1)))
+else
+cmd_cc_o_c_helper = $(call cmd_cc_o_c_simple, $(1))
+endif
+
 ifndef CONFIG_MODVERSIONS
-cmd_cc_o_c = $(CC) $(c_flags) -c -o $@ $<
+cmd_cc_o_c = $(call cmd_cc_o_c_helper,$@)
 
 else
 # When module versioning is enabled the following steps are executed:
@@ -171,7 +196,7 @@ else
 #   replace the unresolved symbols __crc_exported_symbol with
 #   the actual value of the checksum generated by genksyms
 
-cmd_cc_o_c = $(CC) $(c_flags) -c -o $(@D)/.tmp_$(@F) $<
+cmd_cc_o_c = $(call cmd_cc_o_c_helper,$(@D)/.tmp_$(@F))
 
 cmd_modversions_c =								\
 	if $(OBJDUMP) -h $(@D)/.tmp_$(@F) | grep -q __ksymtab; then		\
-- 
2.17.1





* Re: [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm
  2018-11-10 22:04                             ` Nadav Amit
@ 2018-11-13  4:56                               ` Logan Gunthorpe
  0 siblings, 0 replies; 116+ messages in thread
From: Logan Gunthorpe @ 2018-11-13  4:56 UTC (permalink / raw)
  To: Nadav Amit, hpa, Ingo Molnar
  Cc: LKML, X86 ML, Sam Ravnborg, Michal Marek, Thomas Gleixner,
	Linux Kbuild mailing list, Stephen Bates, Masami Hiramatsu



On 10/11/18 03:04 PM, Nadav Amit wrote:
> Err.. I hate makefiles and distcc doesn’t make life easier. For instance, if
> it sees two source files on the command-line, it freaks out and compiles it
> locally. It also has an option to distribute assembly, but it is not enabled
> by default.
> 
> Anyhow, can you try the following patch?

Very nice, great work! It works as expected. I tested with both distcc
and icecc and both work well with DISTCC=y.

Also worth noting is that distcc has a patch to work without DISTCC=y
in pump mode (though pump mode doesn't work super great with the kernel,
as some of the include processing takes a long time). When I have some
free time, I may also try to patch icecc to work with it, as in my
experience it performs better.

Regardless, I think it would be very valuable to have your patch upstream
so there's at least some way to use the existing versions of distcc and
icecc.

Thanks!

Logan


* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-09 14:53           ` Segher Boessenkool
                               ` (2 preceding siblings ...)
  2018-10-10 16:31             ` Nadav Amit
@ 2018-11-29 11:46             ` Masahiro Yamada
  2018-11-29 12:25               ` Segher Boessenkool
  2018-11-29 13:07               ` Borislav Petkov
  3 siblings, 2 replies; 116+ messages in thread
From: Masahiro Yamada @ 2018-11-29 11:46 UTC (permalink / raw)
  To: Segher Boessenkool, Nadav Amit, Ingo Molnar, H. Peter Anvin
  Cc: rguenther, matz, Borislav Petkov, gcc, Linux Kernel Mailing List,
	X86 ML, Sam Ravnborg, Alok Kataria, Christopher Li,
	Greg Kroah-Hartman, Jan Beulich, Josh Poimboeuf, Juergen Gross,
	Kate Stewart, Kees Cook, linux-sparse, Peter Zijlstra (Intel),
	Philippe Ombredanne, Thomas Gleixner, virtualization,
	Linus Torvalds, Chris Zankel, Max Filippov, linux-xtensa

Hi.


On Wed, Oct 10, 2018 at 1:14 AM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> On Mon, Oct 08, 2018 at 11:07:46AM +0200, Richard Biener wrote:
> > On Mon, 8 Oct 2018, Segher Boessenkool wrote:
> > > On Sun, Oct 07, 2018 at 03:53:26PM +0000, Michael Matz wrote:
> > > > On Sun, 7 Oct 2018, Segher Boessenkool wrote:
> > > > > On Sun, Oct 07, 2018 at 11:18:06AM +0200, Borislav Petkov wrote:
> > > > > > Now, Richard suggested doing something like:
> > > > > >
> > > > > >  1) inline asm ("...")
> > > > >
> > > > > What would the semantics of this be?
> > > >
> > > > The size of the inline asm wouldn't be counted towards the inliner size
> > > > limits (or be counted as "1").
> > >
> > > That sounds like a good option.
> >
> > Yes, I also like it for simplicity.  It also avoids the requirement
> > of translating the number (in bytes?) given by the user to
> > "number of GIMPLE instructions" as needed by the inliner.
>
> This patch implements this, for C only so far.  And the syntax is
> "asm inline", which is more in line with other syntax.
>
> How does this look?


Thank you very much for your work.


https://gcc.gnu.org/ml/gcc-patches/2018-10/msg01932.html

How is the progress of this in GCC ML?



I am really hoping the issue will be solved by the compiler
instead of the in-kernel workaround.


Since commit 77b0bf55bc675233d22cd5df97605d516d64525e,
DISTCC breakage has been reported.


Then, another problem showed up.

The Debian linux-headers package is broken
because arch/x86/kernel/macros.s is missing.

https://www.spinics.net/lists/linux-kbuild/msg20037.html

The kernel-devel RPM package is broken as well.

More fundamentally, the external module building itself is broken;
'make clean' must keep all files needed for external modules, but
*.s files are all gone.


Of course, we can fix the problems at the cost of uglifying Makefiles.
I wrote a patch to fix the external module building
and packages, and now have it in hand locally.


But, I'd like to ask if x86 people want to keep this macros.s approach.
Or should we revert 77b0bf55bc675 right now,
assuming the compiler will eventually solve the issue?

Do you have ideas? Comments?


-- 
Best Regards
Masahiro Yamada


* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-11-29 11:46             ` Masahiro Yamada
@ 2018-11-29 12:25               ` Segher Boessenkool
  2018-11-30  9:06                 ` Boris Petkov
  2018-11-29 13:07               ` Borislav Petkov
  1 sibling, 1 reply; 116+ messages in thread
From: Segher Boessenkool @ 2018-11-29 12:25 UTC (permalink / raw)
  To: Masahiro Yamada
  Cc: Nadav Amit, Ingo Molnar, H. Peter Anvin, rguenther, matz,
	Borislav Petkov, gcc, Linux Kernel Mailing List, X86 ML,
	Sam Ravnborg, Alok Kataria, Christopher Li, Greg Kroah-Hartman,
	Jan Beulich, Josh Poimboeuf, Juergen Gross, Kate Stewart,
	Kees Cook, linux-sparse, Peter Zijlstra (Intel),
	Philippe Ombredanne, Thomas Gleixner, virtualization,
	Linus Torvalds, Chris Zankel, Max Filippov, linux-xtensa

On Thu, Nov 29, 2018 at 08:46:34PM +0900, Masahiro Yamada wrote:
> On Wed, Oct 10, 2018 at 1:14 AM Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
> >
> > On Mon, Oct 08, 2018 at 11:07:46AM +0200, Richard Biener wrote:
> > > On Mon, 8 Oct 2018, Segher Boessenkool wrote:
> > > > On Sun, Oct 07, 2018 at 03:53:26PM +0000, Michael Matz wrote:
> > > > > On Sun, 7 Oct 2018, Segher Boessenkool wrote:
> > > > > > On Sun, Oct 07, 2018 at 11:18:06AM +0200, Borislav Petkov wrote:
> > > > > > > Now, Richard suggested doing something like:
> > > > > > >
> > > > > > >  1) inline asm ("...")
> > > > > >
> > > > > > What would the semantics of this be?
> > > > >
> > > > > The size of the inline asm wouldn't be counted towards the inliner size
> > > > > limits (or be counted as "1").
> > > >
> > > > That sounds like a good option.
> > >
> > > Yes, I also like it for simplicity.  It also avoids the requirement
> > > of translating the number (in bytes?) given by the user to
> > > "number of GIMPLE instructions" as needed by the inliner.
> >
> > This patch implements this, for C only so far.  And the syntax is
> > "asm inline", which is more in line with other syntax.
> >
> > How does this look?
> 
> 
> Thank you very much for your work.
> 
> 
> https://gcc.gnu.org/ml/gcc-patches/2018-10/msg01932.html
> 
> How is the progress of this in GCC ML?

Latest patch was pinged a few times:
https://gcc.gnu.org/ml/gcc-patches/2018-11/msg01569.html .

I'll ping it again.  Will fix the subject as well if I remember to, sigh.

> I am really hoping the issue will be solved by compiler
> instead of the in-kernel workaround.

This will only be fixed from GCC 9 on, if the compiler adopts it.  The
kernel wants to support ancient GCC, so it will need to have a workaround
for older GCC versions anyway.
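
As a rough illustration of such a workaround, the sketch below wraps the
new qualifier in a compatibility macro that degrades to a plain asm
statement on compilers without it. The names asm_inline and
opaque_add_one are hypothetical, and the __GNUC__ >= 9 test is an
assumption made for brevity; it is conservative (it ignores backports
and other compilers), so a build-time probe of the compiler would be
more robust. The asm body contains only blank lines, so it assembles on
any target while still looking "large" to a newline-counting size
estimate.

```c
/*
 * Hypothetical compatibility macro (illustration only).
 * Assumption: only GCC 9 or later is treated as supporting "asm inline";
 * every other compiler falls back to a plain asm statement.
 */
#if defined(__GNUC__) && !defined(__clang__) && (__GNUC__ >= 9)
#define asm_inline __asm__ __inline
#else
#define asm_inline __asm__
#endif

/*
 * The asm body holds no instructions, only blank lines. Without the
 * "inline" qualifier the compiler may count each line as a statement
 * when estimating the size of this function for inlining decisions.
 * The "+r" constraint only ties x to the asm; its value is unchanged.
 */
static inline int opaque_add_one(int x)
{
	asm_inline __volatile__("\n\n\n\n\n\n\n\n" : "+r"(x));
	return x + 1;
}
```

Callers use asm_inline exactly like asm; only the inlining cost estimate
is meant to differ, not the generated code for the asm body itself.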


Segher


* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-11-29 11:46             ` Masahiro Yamada
  2018-11-29 12:25               ` Segher Boessenkool
@ 2018-11-29 13:07               ` Borislav Petkov
  2018-11-29 13:09                 ` Richard Biener
  1 sibling, 1 reply; 116+ messages in thread
From: Borislav Petkov @ 2018-11-29 13:07 UTC (permalink / raw)
  To: Masahiro Yamada
  Cc: Segher Boessenkool, Nadav Amit, Ingo Molnar, H. Peter Anvin,
	rguenther, matz, gcc, Linux Kernel Mailing List, X86 ML,
	Sam Ravnborg, Alok Kataria, Christopher Li, Greg Kroah-Hartman,
	Jan Beulich, Josh Poimboeuf, Juergen Gross, Kate Stewart,
	Kees Cook, linux-sparse, Peter Zijlstra (Intel),
	Philippe Ombredanne, Thomas Gleixner, virtualization,
	Linus Torvalds, Chris Zankel, Max Filippov, linux-xtensa

On Thu, Nov 29, 2018 at 08:46:34PM +0900, Masahiro Yamada wrote:
> But, I'd like to ask if x86 people want to keep this macros.s approach.
> Revert 77b0bf55bc675 right now
> assuming the compiler will eventually solve the issue?

Yap, considering how elegant the compiler solution is and how many
problems this macros-in-asm thing causes, I think we should patch
out the latter and wait for gcc9. I mean, the savings are not so
mind-blowing to have to deal with the fallout.

But this is just my opinion.

Thx.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--


* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-11-29 13:07               ` Borislav Petkov
@ 2018-11-29 13:09                 ` Richard Biener
  2018-11-29 13:16                   ` Borislav Petkov
  0 siblings, 1 reply; 116+ messages in thread
From: Richard Biener @ 2018-11-29 13:09 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Masahiro Yamada, Segher Boessenkool, Nadav Amit, Ingo Molnar,
	H. Peter Anvin, matz, gcc, Linux Kernel Mailing List, X86 ML,
	Sam Ravnborg, Alok Kataria, Christopher Li, Greg Kroah-Hartman,
	Jan Beulich, Josh Poimboeuf, Juergen Gross, Kate Stewart,
	Kees Cook, linux-sparse, Peter Zijlstra (Intel),
	Philippe Ombredanne, Thomas Gleixner, virtualization,
	Linus Torvalds, Chris Zankel, Max Filippov, linux-xtensa

On Thu, 29 Nov 2018, Borislav Petkov wrote:

> On Thu, Nov 29, 2018 at 08:46:34PM +0900, Masahiro Yamada wrote:
> > But, I'd like to ask if x86 people want to keep this macros.s approach.
> > Revert 77b0bf55bc675 right now
> > assuming the compiler will eventually solve the issue?
> 
> Yap, considering how elegant the compiler solution is and how much
> problems this macros-in-asm thing causes, I think we should patch
> out the latter and wait for gcc9. I mean, the savings are not so
> mind-blowing to have to deal with the fallout.
> 
> But this is just my opinion.

I'd be not opposed to backporting the asm inline support.

Of course we still have to be happy with it and install the patch ;)

Are you (kernel folks) happy with asm inline ()?

Richard.

> Thx.
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)


* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-11-29 13:09                 ` Richard Biener
@ 2018-11-29 13:16                   ` Borislav Petkov
  2018-11-29 13:24                     ` Richard Biener
  0 siblings, 1 reply; 116+ messages in thread
From: Borislav Petkov @ 2018-11-29 13:16 UTC (permalink / raw)
  To: Richard Biener
  Cc: Masahiro Yamada, Segher Boessenkool, Nadav Amit, Ingo Molnar,
	H. Peter Anvin, matz, gcc, Linux Kernel Mailing List, X86 ML,
	Sam Ravnborg, Alok Kataria, Christopher Li, Greg Kroah-Hartman,
	Jan Beulich, Josh Poimboeuf, Juergen Gross, Kate Stewart,
	Kees Cook, linux-sparse, Peter Zijlstra (Intel),
	Philippe Ombredanne, Thomas Gleixner, virtualization,
	Linus Torvalds, Chris Zankel, Max Filippov, linux-xtensa

On Thu, Nov 29, 2018 at 02:09:25PM +0100, Richard Biener wrote:
> I'd be not opposed to backporting the asm inline support.

Even better! :-)

> Of course we still have to be happy with it and install the patch ;)
> 
> Are you (kernel folks) happy with asm inline ()?

Yes, I think we are:

https://lkml.kernel.org/r/20181031125526.GA13219@hirez.programming.kicks-ass.net

Thx.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--


* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-11-29 13:16                   ` Borislav Petkov
@ 2018-11-29 13:24                     ` Richard Biener
  0 siblings, 0 replies; 116+ messages in thread
From: Richard Biener @ 2018-11-29 13:24 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Masahiro Yamada, Segher Boessenkool, Nadav Amit, Ingo Molnar,
	H. Peter Anvin, matz, gcc, Linux Kernel Mailing List, X86 ML,
	Sam Ravnborg, Alok Kataria, Christopher Li, Greg Kroah-Hartman,
	Jan Beulich, Josh Poimboeuf, Juergen Gross, Kate Stewart,
	Kees Cook, linux-sparse, Peter Zijlstra (Intel),
	Philippe Ombredanne, Thomas Gleixner, virtualization,
	Linus Torvalds, Chris Zankel, Max Filippov, linux-xtensa

On Thu, 29 Nov 2018, Borislav Petkov wrote:

> On Thu, Nov 29, 2018 at 02:09:25PM +0100, Richard Biener wrote:
> > I'd be not opposed to backporting the asm inline support.
> 
> Even better! :-)
> 
> > Of course we still have to be happy with it and install the patch ;)
> > 
> > Are you (kernel folks) happy with asm inline ()?
> 
> Yes, I think we are:
> 
> https://lkml.kernel.org/r/20181031125526.GA13219@hirez.programming.kicks-ass.net

OK, Segher - can you ping the latest version of the patch please?

Thanks,
Richard.


* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-11-29 12:25               ` Segher Boessenkool
@ 2018-11-30  9:06                 ` Boris Petkov
  2018-11-30 13:16                   ` Segher Boessenkool
  0 siblings, 1 reply; 116+ messages in thread
From: Boris Petkov @ 2018-11-30  9:06 UTC (permalink / raw)
  To: Segher Boessenkool, Masahiro Yamada
  Cc: Nadav Amit, Ingo Molnar, H. Peter Anvin, rguenther, matz, gcc,
	Linux Kernel Mailing List, X86 ML, Sam Ravnborg, Alok Kataria,
	Christopher Li, Greg Kroah-Hartman, Jan Beulich, Josh Poimboeuf,
	Juergen Gross, Kate Stewart, Kees Cook, linux-sparse,
	Peter Zijlstra (Intel),
	Philippe Ombredanne, Thomas Gleixner, virtualization,
	Linus Torvalds, Chris Zankel, Max Filippov, linux-xtensa

On November 29, 2018 1:25:02 PM GMT+01:00, Segher Boessenkool <segher@kernel.crashing.org> wrote:
>This will only be fixed from GCC 9 on, if the compiler adopts it.  The
>kernel wants to support ancient GCC, so it will need to have a
>workaround
>for older GCC versions anyway.

What about backporting it, like Richard says?


-- 
Sent from a small device: formatting sux and brevity is inevitable.


* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-11-30  9:06                 ` Boris Petkov
@ 2018-11-30 13:16                   ` Segher Boessenkool
  2018-12-10  8:16                     ` Masahiro Yamada
  0 siblings, 1 reply; 116+ messages in thread
From: Segher Boessenkool @ 2018-11-30 13:16 UTC (permalink / raw)
  To: Boris Petkov
  Cc: Masahiro Yamada, Nadav Amit, Ingo Molnar, H. Peter Anvin,
	rguenther, matz, gcc, Linux Kernel Mailing List, X86 ML,
	Sam Ravnborg, Alok Kataria, Christopher Li, Greg Kroah-Hartman,
	Jan Beulich, Josh Poimboeuf, Juergen Gross, Kate Stewart,
	Kees Cook, linux-sparse, Peter Zijlstra (Intel),
	Philippe Ombredanne, Thomas Gleixner, virtualization,
	Linus Torvalds, Chris Zankel, Max Filippov, linux-xtensa

On Fri, Nov 30, 2018 at 10:06:02AM +0100, Boris Petkov wrote:
> On November 29, 2018 1:25:02 PM GMT+01:00, Segher Boessenkool <segher@kernel.crashing.org> wrote:
> >This will only be fixed from GCC 9 on, if the compiler adopts it.  The
> >kernel wants to support ancient GCC, so it will need to have a
> >workaround
> >for older GCC versions anyway.
> 
> What about backporting it, like Richard says?

Let me first get it into GCC trunk :-)

It should backport fine, sure; and I'll work on that.


Segher


* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-11-30 13:16                   ` Segher Boessenkool
@ 2018-12-10  8:16                     ` Masahiro Yamada
  0 siblings, 0 replies; 116+ messages in thread
From: Masahiro Yamada @ 2018-12-10  8:16 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Borislav Petkov, Nadav Amit, Ingo Molnar, H. Peter Anvin,
	rguenther, matz, gcc, Linux Kernel Mailing List, X86 ML,
	Sam Ravnborg, Alok Kataria, Christopher Li, Greg Kroah-Hartman,
	Jan Beulich, Josh Poimboeuf, Juergen Gross, Kate Stewart,
	Kees Cook, linux-sparse, Peter Zijlstra (Intel),
	Philippe Ombredanne, Thomas Gleixner, virtualization,
	Linus Torvalds, Chris Zankel, Max Filippov, linux-xtensa

Hi Segher,


On Sun, Dec 2, 2018 at 3:48 PM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> On Fri, Nov 30, 2018 at 10:06:02AM +0100, Boris Petkov wrote:
> > On November 29, 2018 1:25:02 PM GMT+01:00, Segher Boessenkool <segher@kernel.crashing.org> wrote:
> > >This will only be fixed from GCC 9 on, if the compiler adopts it.  The
> > >kernel wants to support ancient GCC, so it will need to have a
> > >workaround
> > >for older GCC versions anyway.
> >
> > What about backporting it, like Richard says?
>
> Let me first get it into GCC trunk :-)
>
> It should backport fine, sure; and I'll work on that.
>



Now, I can see it in the GCC trunk.  Hooray!!!



commit 6de46ad5326fc5e6b730a2feb8c62d09c1561f92
Author: segher <segher@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Thu Dec 6 17:56:58 2018 +0000

    asm inline

    The Linux kernel people want a feature that makes GCC pretend some
    inline assembler code is tiny (while it would think it is huge), so
    that such code will be inlined essentially always instead of
    essentially never.

    This patch lets you say "asm inline" instead of just "asm", with the
    result that that inline assembler is always counted as minimum cost
    for inlining.  It implements this for C and C++, making "inline"
    another asm-qualifier (supplementing "volatile" and "goto").
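
To make the "while it would think it is huge" remark concrete, here is a
small model of the size estimate as the cover letter of this series
describes it: one perceived instruction per newline or semicolon in the
asm template. The function name estimate_asm_statements is hypothetical,
and this is only a sketch of that description, not the compiler's actual
code.

```c
#include <stddef.h>

/*
 * Sketch of the described heuristic (assumption: every newline or
 * semicolon in the template starts another perceived statement, and an
 * empty template counts as zero statements).
 */
static int estimate_asm_statements(const char *templ)
{
	int count;

	if (templ == NULL || *templ == '\0')
		return 0;

	count = 1;
	for (; *templ != '\0'; templ++) {
		if (*templ == '\n' || *templ == ';')
			count++;
	}
	return count;
}
```

Under this model a template that emits a single nop plus section
directives on separate lines is charged for every line, which is why
blocks made mostly of directives and data look far larger to the inliner
than the code they produce.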




-- 
Best Regards
Masahiro Yamada


* Re: PROPOSAL: Extend inline asm syntax with size spec
  2018-10-31 12:55                           ` Peter Zijlstra
                                               ` (2 preceding siblings ...)
  2018-11-01  5:20                             ` Joe Perches
@ 2018-12-27  4:47                             ` Masahiro Yamada
  3 siblings, 0 replies; 116+ messages in thread
From: Masahiro Yamada @ 2018-12-27  4:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Borislav Petkov, Segher Boessenkool, Ingo Molnar, Richard Biener,
	Michael Matz, gcc, Nadav Amit, Ingo Molnar,
	Linux Kernel Mailing List, X86 ML, Sam Ravnborg, Alok Kataria,
	Christopher Li, Greg Kroah-Hartman, H. Peter Anvin, Jan Beulich,
	Josh Poimboeuf, Juergen Gross, Kate Stewart, Kees Cook,
	linux-sparse, Philippe Ombredanne, Thomas Gleixner,
	virtualization, Linus Torvalds, Chris Zankel, Max Filippov,
	linux-xtensa, Andrew Morton

Hi Peter,


On Wed, Oct 31, 2018 at 9:58 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sat, Oct 13, 2018 at 09:33:35PM +0200, Borislav Petkov wrote:
> > Ok,
> >
> > with Segher's help I've been playing with his patch ontop of bleeding
> > edge gcc 9 and here are my observations. Please double-check me for
> > booboos so that they can be addressed while there's time.
> >
> > So here's what I see ontop of 4.19-rc7:
> >
> > First marked the alternative asm() as inline and undeffed the "inline"
> > keyword. I need to do that because of the funky games we do with
> > "inline" redefinitions in include/linux/compiler_types.h.
> >
> > And Segher hinted at either doing:
> >
> > asm volatile inline(...
> >
> > or
> >
> > asm volatile __inline__(...
> >
> > but both "inline" variants are defined as macros in that file.
> >
> > Which means we either need to #undef inline before using it in asm() or
> > come up with something cleverer.
>
> # git grep -e "\<__inline__\>" | wc -l
> 488
> # git grep -e "\<__inline\>" | wc -l
> 56
> # git grep -e "\<inline\>" | wc -l
> 69598
>
> And we already have scripts/checkpatch.pl:
>
>   # Check for __inline__ and __inline, prefer inline
>
> Which suggests we do:
>
> git grep -l "\<__inline\(\|__\)\>" | while read file
> do
>         sed -i -e 's/\<__inline\(\|__\)\>/inline/g' $file
> done
>
> and get it over with.


Do you have a plan to really do this?

This is a nice cleanup anyway.

I think the last minute of the merge window (MW) is
a good time for a global replacement like this.




>
> Anyway, with the below patch, I get:
>
>    text    data     bss     dec     hex filename
> 17385183        5064780 1953892 24403855        1745f8f defconfig-build/vmlinux
> 17385678        5064780 1953892 24404350        174617e defconfig-build/vmlinux
>
> Which shows we inline more (look for asm_volatile for the actual
> changes).
>
>
> So yes, this seems like something we could work with.
>
> ---
>  Documentation/trace/tracepoint-analysis.rst        |  2 +-
>  Documentation/translations/ja_JP/SubmittingPatches |  4 +--
>  Documentation/translations/zh_CN/SubmittingPatches |  4 +--
>  arch/alpha/include/asm/atomic.h                    | 12 +++----
>  arch/alpha/include/asm/bitops.h                    |  6 ++--
>  arch/alpha/include/asm/compiler.h                  |  4 +--
>  arch/alpha/include/asm/dma.h                       | 22 ++++++------
>  arch/alpha/include/asm/floppy.h                    |  4 +--
>  arch/alpha/include/asm/irq.h                       |  2 +-
>  arch/alpha/include/asm/local.h                     |  4 +--
>  arch/alpha/include/asm/smp.h                       |  2 +-
>  arch/arm/mach-iop32x/include/mach/uncompress.h     |  2 +-
>  arch/arm/mach-iop33x/include/mach/uncompress.h     |  2 +-
>  arch/arm/mach-ixp4xx/include/mach/uncompress.h     |  2 +-
>  arch/ia64/hp/common/sba_iommu.c                    |  2 +-
>  arch/ia64/hp/sim/simeth.c                          |  2 +-
>  arch/ia64/include/asm/atomic.h                     |  8 ++---
>  arch/ia64/include/asm/bitops.h                     | 34 +++++++++---------
>  arch/ia64/include/asm/delay.h                      | 14 ++++----
>  arch/ia64/include/asm/irq.h                        |  2 +-
>  arch/ia64/include/asm/page.h                       |  2 +-
>  arch/ia64/include/asm/sn/leds.h                    |  2 +-
>  arch/ia64/include/asm/uaccess.h                    |  4 +--
>  arch/ia64/include/uapi/asm/rse.h                   | 12 +++----
>  arch/ia64/include/uapi/asm/swab.h                  |  6 ++--
>  arch/ia64/oprofile/backtrace.c                     |  4 +--
>  arch/m68k/include/asm/blinken.h                    |  2 +-
>  arch/m68k/include/asm/checksum.h                   |  2 +-
>  arch/m68k/include/asm/dma.h                        | 32 ++++++++---------
>  arch/m68k/include/asm/floppy.h                     |  8 ++---
>  arch/m68k/include/asm/nettel.h                     |  8 ++---
>  arch/m68k/mac/iop.c                                | 14 ++++----
>  arch/mips/include/asm/atomic.h                     | 16 ++++-----
>  arch/mips/include/asm/checksum.h                   |  2 +-
>  arch/mips/include/asm/dma.h                        | 20 +++++------
>  arch/mips/include/asm/jazz.h                       |  2 +-
>  arch/mips/include/asm/local.h                      |  4 +--
>  arch/mips/include/asm/string.h                     |  8 ++---
>  arch/mips/kernel/binfmt_elfn32.c                   |  2 +-
>  arch/nds32/include/asm/swab.h                      |  4 +--
>  arch/parisc/include/asm/atomic.h                   | 20 +++++------
>  arch/parisc/include/asm/bitops.h                   | 18 +++++-----
>  arch/parisc/include/asm/checksum.h                 |  4 +--
>  arch/parisc/include/asm/compat.h                   |  2 +-
>  arch/parisc/include/asm/delay.h                    |  2 +-
>  arch/parisc/include/asm/dma.h                      | 20 +++++------
>  arch/parisc/include/asm/ide.h                      |  8 ++---
>  arch/parisc/include/asm/irq.h                      |  2 +-
>  arch/parisc/include/asm/spinlock.h                 | 12 +++----
>  arch/powerpc/include/asm/atomic.h                  | 40 +++++++++++-----------
>  arch/powerpc/include/asm/bitops.h                  | 28 +++++++--------
>  arch/powerpc/include/asm/dma.h                     | 20 +++++------
>  arch/powerpc/include/asm/edac.h                    |  2 +-
>  arch/powerpc/include/asm/irq.h                     |  2 +-
>  arch/powerpc/include/asm/local.h                   | 14 ++++----
>  arch/sh/include/asm/pgtable_64.h                   |  2 +-
>  arch/sh/include/asm/processor_32.h                 |  4 +--
>  arch/sh/include/cpu-sh3/cpu/dac.h                  |  6 ++--
>  arch/x86/include/asm/alternative.h                 | 14 ++++----
>  arch/x86/um/asm/checksum.h                         |  4 +--
>  arch/x86/um/asm/checksum_32.h                      |  4 +--
>  arch/xtensa/include/asm/checksum.h                 | 14 ++++----
>  arch/xtensa/include/asm/cmpxchg.h                  |  4 +--
>  arch/xtensa/include/asm/irq.h                      |  2 +-
>  block/partitions/amiga.c                           |  2 +-
>  drivers/atm/he.c                                   |  6 ++--
>  drivers/atm/idt77252.c                             |  6 ++--
>  drivers/gpu/drm/mga/mga_drv.h                      |  2 +-
>  drivers/gpu/drm/mga/mga_state.c                    | 14 ++++----
>  drivers/gpu/drm/r128/r128_drv.h                    |  2 +-
>  drivers/gpu/drm/r128/r128_state.c                  | 14 ++++----
>  drivers/gpu/drm/via/via_irq.c                      |  2 +-
>  drivers/gpu/drm/via/via_verifier.c                 | 30 ++++++++--------
>  drivers/isdn/hardware/eicon/platform.h             | 14 ++++----
>  drivers/isdn/i4l/isdn_net.c                        | 14 ++++----
>  drivers/isdn/i4l/isdn_net.h                        |  8 ++---
>  drivers/media/pci/ivtv/ivtv-ioctl.c                |  2 +-
>  drivers/net/ethernet/sun/sungem.c                  |  8 ++---
>  drivers/net/ethernet/sun/sunhme.c                  |  6 ++--
>  drivers/net/hamradio/baycom_ser_fdx.c              |  2 +-
>  drivers/net/wan/lapbether.c                        |  2 +-
>  drivers/net/wan/n2.c                               |  4 +--
>  drivers/parisc/led.c                               |  4 +--
>  drivers/parisc/sba_iommu.c                         |  2 +-
>  drivers/parport/parport_gsc.c                      |  2 +-
>  drivers/parport/parport_gsc.h                      |  4 +--
>  drivers/parport/parport_pc.c                       |  2 +-
>  drivers/scsi/lpfc/lpfc_scsi.c                      |  2 +-
>  drivers/scsi/pcmcia/sym53c500_cs.c                 |  4 +--
>  drivers/scsi/qla2xxx/qla_inline.h                  |  2 +-
>  drivers/scsi/qla2xxx/qla_os.c                      |  4 +--
>  drivers/staging/rtl8723bs/core/rtw_pwrctrl.c       |  4 +--
>  drivers/staging/rtl8723bs/core/rtw_wlan_util.c     |  2 +-
>  drivers/staging/rtl8723bs/include/drv_types.h      |  6 ++--
>  drivers/staging/rtl8723bs/include/ieee80211.h      |  6 ++--
>  drivers/staging/rtl8723bs/include/osdep_service.h  | 10 +++---
>  .../rtl8723bs/include/osdep_service_linux.h        | 14 ++++----
>  drivers/staging/rtl8723bs/include/rtw_mlme.h       | 14 ++++----
>  drivers/staging/rtl8723bs/include/rtw_recv.h       | 16 ++++-----
>  drivers/staging/rtl8723bs/include/sta_info.h       |  2 +-
>  drivers/staging/rtl8723bs/include/wifi.h           | 14 ++++----
>  drivers/staging/rtl8723bs/include/wlan_bssdef.h    |  2 +-
>  drivers/tty/amiserial.c                            |  2 +-
>  drivers/tty/serial/ip22zilog.c                     |  2 +-
>  drivers/tty/serial/sunsab.c                        |  4 +--
>  drivers/tty/serial/sunzilog.c                      |  2 +-
>  drivers/video/fbdev/core/fbcon.c                   | 20 +++++------
>  drivers/video/fbdev/ffb.c                          |  2 +-
>  drivers/video/fbdev/intelfb/intelfbdrv.c           | 10 +++---
>  drivers/video/fbdev/intelfb/intelfbhw.c            |  2 +-
>  drivers/w1/masters/matrox_w1.c                     |  4 +--
>  fs/coda/coda_linux.h                               |  6 ++--
>  fs/freevxfs/vxfs_inode.c                           |  2 +-
>  fs/nfsd/nfsfh.h                                    |  4 +--
>  include/acpi/platform/acgcc.h                      |  2 +-
>  include/acpi/platform/acintel.h                    |  2 +-
>  include/asm-generic/ide_iops.h                     |  8 ++---
>  include/linux/atalk.h                              |  4 +--
>  include/linux/ceph/messenger.h                     |  2 +-
>  include/linux/compiler_types.h                     |  4 +--
>  include/linux/hdlc.h                               |  4 +--
>  include/linux/inetdevice.h                         |  8 ++---
>  include/linux/parport.h                            |  4 +--
>  include/linux/parport_pc.h                         | 22 ++++++------
>  include/net/ax25.h                                 |  2 +-
>  include/net/checksum.h                             |  2 +-
>  include/net/dn_nsp.h                               | 16 ++++-----
>  include/net/ip.h                                   |  2 +-
>  include/net/ip6_checksum.h                         |  2 +-
>  include/net/ipx.h                                  | 10 +++---
>  include/net/llc_c_ev.h                             |  4 +--
>  include/net/llc_conn.h                             |  4 +--
>  include/net/llc_s_ev.h                             |  2 +-
>  include/net/netrom.h                               |  8 ++---
>  include/net/scm.h                                  | 14 ++++----
>  include/net/udplite.h                              |  2 +-
>  include/net/x25.h                                  |  8 ++---
>  include/net/xfrm.h                                 | 18 +++++-----
>  include/uapi/linux/atm.h                           |  4 +--
>  include/uapi/linux/atmsap.h                        |  2 +-
>  include/uapi/linux/map_to_7segment.h               |  2 +-
>  include/uapi/linux/netfilter_arp/arp_tables.h      |  2 +-
>  include/uapi/linux/netfilter_bridge/ebtables.h     |  2 +-
>  include/uapi/linux/netfilter_ipv4/ip_tables.h      |  2 +-
>  include/uapi/linux/netfilter_ipv6/ip6_tables.h     |  2 +-
>  include/video/newport.h                            | 12 +++----
>  lib/zstd/mem.h                                     |  2 +-
>  net/appletalk/atalk_proc.c                         |  4 +--
>  net/appletalk/ddp.c                                |  2 +-
>  net/core/neighbour.c                               |  2 +-
>  net/core/scm.c                                     |  2 +-
>  net/decnet/dn_nsp_in.c                             |  2 +-
>  net/decnet/dn_nsp_out.c                            |  2 +-
>  net/decnet/dn_route.c                              |  2 +-
>  net/decnet/dn_table.c                              |  4 +--
>  net/ipv4/igmp.c                                    |  2 +-
>  net/ipv6/af_inet6.c                                |  2 +-
>  net/ipv6/icmp.c                                    |  4 +--
>  net/ipv6/udp.c                                     |  4 +--
>  net/lapb/lapb_iface.c                              |  4 +--
>  net/llc/llc_input.c                                |  2 +-
>  scripts/checkpatch.pl                              |  8 ++---
>  scripts/genksyms/keywords.c                        |  4 +--
>  scripts/kernel-doc                                 |  4 +--
>  sound/sparc/amd7930.c                              |  6 ++--
>  165 files changed, 547 insertions(+), 547 deletions(-)


-- 
Best Regards
Masahiro Yamada

Thread overview: 116+ messages
2018-10-03 21:30 [PATCH v9 00/10] x86: macrofying inline asm Nadav Amit
2018-10-03 21:30 ` [PATCH v9 01/10] xtensa: defining LINKER_SCRIPT for the linker script Nadav Amit
2018-10-04 10:00   ` [tip:x86/build] kbuild/arch/xtensa: Define " tip-bot for Nadav Amit
2018-10-03 21:30 ` [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm Nadav Amit
2018-10-04 10:01   ` [tip:x86/build] kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs tip-bot for Nadav Amit
2018-11-06 18:57   ` [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm Logan Gunthorpe
2018-11-06 19:18     ` Nadav Amit
2018-11-06 20:01       ` Logan Gunthorpe
2018-11-07 18:01         ` Nadav Amit
2018-11-07 18:53           ` Logan Gunthorpe
2018-11-07 18:56             ` Nadav Amit
2018-11-07 21:43               ` Logan Gunthorpe
2018-11-07 21:50                 ` hpa
2018-11-08  6:18                   ` Nadav Amit
2018-11-08 17:14                     ` Logan Gunthorpe
2018-11-08 19:54                       ` Nadav Amit
2018-11-08 20:00                         ` Logan Gunthorpe
2018-11-08 20:18                           ` Nadav Amit
2018-11-10 22:04                             ` Nadav Amit
2018-11-13  4:56                               ` Logan Gunthorpe
2018-10-03 21:30 ` [PATCH v9 03/10] x86: objtool: use asm macro for better compiler decisions Nadav Amit
2018-10-04 10:02   ` [tip:x86/build] x86/objtool: Use asm macros to work around GCC inlining bugs tip-bot for Nadav Amit
2018-10-03 21:30 ` [PATCH v9 04/10] x86: refcount: prevent gcc distortions Nadav Amit
2018-10-04  7:57   ` Ingo Molnar
2018-10-04  8:33     ` Ingo Molnar
2018-10-04  8:40       ` hpa
2018-10-04  8:56         ` Ingo Molnar
2018-10-04  8:56         ` Nadav Amit
2018-10-04  9:02           ` hpa
2018-10-04  9:16             ` Ingo Molnar
2018-10-04 19:33               ` H. Peter Anvin
2018-10-04 20:05                 ` Nadav Amit
2018-10-04 20:08                   ` H. Peter Anvin
2018-10-04 20:29                 ` Andy Lutomirski
2018-10-04 23:11                   ` H. Peter Anvin
2018-10-06  1:40                 ` Rasmus Villemoes
2018-10-04  9:12           ` Ingo Molnar
2018-10-04  9:17             ` hpa
2018-10-04  9:30             ` Nadav Amit
2018-10-04  9:45               ` Ingo Molnar
2018-10-04 10:23                 ` Nadav Amit
2018-10-05  9:31                   ` Ingo Molnar
2018-10-05 11:20                     ` Borislav Petkov
2018-10-05 12:52                       ` Ingo Molnar
2018-10-05 20:27                     ` [PATCH 0/3] Macrofying inline asm rebased Nadav Amit
2018-10-05 20:27                       ` [PATCH 1/3] x86/extable: Macrofy inline assembly code to work around GCC inlining bugs Nadav Amit
2018-10-06 14:42                         ` [tip:x86/build] " tip-bot for Nadav Amit
2018-10-05 20:27                       ` [PATCH 2/3] x86/cpufeature: " Nadav Amit
2018-10-06 14:43                         ` [tip:x86/build] " tip-bot for Nadav Amit
2018-10-05 20:27                       ` [PATCH 3/3] x86/jump-labels: " Nadav Amit
2018-10-06 14:44                         ` [tip:x86/build] " tip-bot for Nadav Amit
2018-10-08  2:17                     ` [PATCH v9 04/10] x86: refcount: prevent gcc distortions Nadav Amit
2018-10-04  8:40     ` Nadav Amit
2018-10-04  9:01       ` Ingo Molnar
2018-10-04 10:02   ` [tip:x86/build] x86/refcount: Work around GCC inlining bug tip-bot for Nadav Amit
2018-10-03 21:30 ` [PATCH v9 05/10] x86: alternatives: macrofy locks for better inlining Nadav Amit
2018-10-04 10:03   ` [tip:x86/build] x86/alternatives: Macrofy lock prefixes to work around GCC inlining bugs tip-bot for Nadav Amit
2018-10-03 21:30 ` [PATCH v9 06/10] x86: bug: prevent gcc distortions Nadav Amit
2018-10-04 10:03   ` [tip:x86/build] x86/bug: Macrofy the BUG table section handling, to work around GCC inlining bugs tip-bot for Nadav Amit
2018-10-03 21:30 ` [PATCH v9 07/10] x86: prevent inline distortion by paravirt ops Nadav Amit
2018-10-04 10:04   ` [tip:x86/build] x86/paravirt: Work around GCC inlining bugs when compiling " tip-bot for Nadav Amit
2018-10-03 21:30 ` [PATCH v9 08/10] x86: extable: use macros instead of inline assembly Nadav Amit
2018-10-03 21:30 ` [PATCH v9 09/10] x86: cpufeature: " Nadav Amit
2018-10-03 21:31 ` [PATCH v9 10/10] x86: jump-labels: " Nadav Amit
2018-10-07  9:18 ` PROPOSAL: Extend inline asm syntax with size spec Borislav Petkov
     [not found]   ` <20181007132228.GJ29268@gate.crashing.org>
2018-10-07 14:13     ` Borislav Petkov
2018-10-07 15:14       ` Segher Boessenkool
2018-10-08  5:58         ` Ingo Molnar
2018-10-08  7:53           ` Segher Boessenkool
2018-10-07 15:53     ` Michael Matz
2018-10-08  6:13       ` Ingo Molnar
2018-10-08  8:18         ` Segher Boessenkool
2018-10-08  7:31       ` Segher Boessenkool
2018-10-08  9:07         ` Richard Biener
2018-10-08 10:02           ` Segher Boessenkool
2018-10-09 14:53           ` Segher Boessenkool
2018-10-10  6:35             ` Ingo Molnar
2018-10-10  7:12             ` Richard Biener
2018-10-10  7:22               ` Ingo Molnar
2018-10-10  8:03                 ` Segher Boessenkool
2018-10-10  8:19                   ` Borislav Petkov
2018-10-10  8:35                     ` Richard Biener
2018-10-10 18:54                     ` Segher Boessenkool
2018-10-10 19:14                       ` Borislav Petkov
2018-10-13 19:33                         ` Borislav Petkov
2018-10-13 21:14                           ` Alexander Monakov
2018-10-13 21:30                             ` Borislav Petkov
2018-10-25 10:24                           ` Borislav Petkov
2018-10-31 12:55                           ` Peter Zijlstra
2018-10-31 13:11                             ` Peter Zijlstra
2018-10-31 16:31                             ` Segher Boessenkool
2018-11-01  5:20                             ` Joe Perches
2018-11-01  9:01                               ` Peter Zijlstra
2018-11-01  9:20                                 ` Joe Perches
2018-11-01 11:15                                   ` Peter Zijlstra
2018-12-27  4:47                             ` Masahiro Yamada
2018-10-10 10:29                   ` Richard Biener
2018-10-10  7:53               ` Segher Boessenkool
2018-10-10 16:31             ` Nadav Amit
2018-10-10 19:21               ` Segher Boessenkool
2018-10-11  7:04               ` Richard Biener
2018-11-29 11:46             ` Masahiro Yamada
2018-11-29 12:25               ` Segher Boessenkool
2018-11-30  9:06                 ` Boris Petkov
2018-11-30 13:16                   ` Segher Boessenkool
2018-12-10  8:16                     ` Masahiro Yamada
2018-11-29 13:07               ` Borislav Petkov
2018-11-29 13:09                 ` Richard Biener
2018-11-29 13:16                   ` Borislav Petkov
2018-11-29 13:24                     ` Richard Biener
2018-10-08 16:24       ` David Laight
2018-10-07 16:09   ` Nadav Amit
2018-10-07 16:46     ` Richard Biener
2018-10-07 19:06       ` Nadav Amit
2018-10-07 19:52         ` Jeff Law
2018-10-08  7:46         ` Richard Biener
