* [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation
@ 2022-09-02 13:06 Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 01/59] x86/paravirt: Ensure proper alignment Peter Zijlstra
                   ` (59 more replies)
  0 siblings, 60 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet


Hi!

At long last a new version of the call depth tracking patches.

  v1: https://lkml.kernel.org/r/20220716230344.239749011@linutronix.de

This version is significantly different from the last in that it no longer
makes use of external call thunks allocated from the module space. Instead
every function gets aligned to 16 bytes and gets 16 bytes of (pre-symbol)
padding. (This padding will also come in handy for other things, like the
kCFI/FineIBT work.)

Prior to these patches, function alignment is basically non-existent; as such,
any instruction fetch for the first instructions of a function will have (on
average) half the fetch window filled with whatever comes before. Pushing the
alignment up to 16 bytes improves matters for chips that happen to have a 16
byte i-fetch window (Intel) while not making matters worse for chips that have
a larger 32 byte i-fetch window (AMD Zen). In fact, it improves the worst case
for Zen from 31 bytes of garbage to 16 bytes of garbage.

As such, the first several patches of the series fix up a lot of alignment
quirks.


The second big difference is the introduction of struct pcpu_hot. Because the
compiler managed to place two adjacent (in code) DEFINE_PER_CPU() variables in
random cachelines (it is absolutely free to do so), the introduction of the
per-cpu x86_call_depth variable sometimes added significant additional cache
pressure, while at other times it would sit nicely in the same cacheline as
preempt_count and not show up at all.

To alleviate this problem, introduce struct pcpu_hot and collect a number of
hot per-cpu variables in a way the compiler can't mess up.
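
As a rough sketch (illustrative only -- the exact field list and names are
introduced later in the series and may differ), the idea is to group the hot
fields and pad the group to a single cacheline so the compiler has no
placement freedom left:

struct pcpu_hot {
	union {
		struct {
			struct task_struct	*current_task;
			int			preempt_count;
			int			cpu_number;
			u64			call_depth;
			/* ... further hot fields ... */
		};
		u8	pad[64];	/* pin the group into one cacheline */
	};
};
DECLARE_PER_CPU_ALIGNED(struct pcpu_hot, pcpu_hot);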


Since these changes are 'unconditional', Mel was gracious enough to help test
this on his test setup across all the relevant uarchs (very much including both
Intel and AMD machines) and found that, while these changes cause some very
small wins and losses across the board, it is mostly noise.


Aside from these changes, the core of the depth tracking is still the same:

 - objtool creates a list of (function) call sites.

 - for every call, overwrite the padding of the target function with the
   accounting thunk (if not already done) and adjust the call site to
   target this thunk.

 - the retbleed return thunk mechanism is used for a custom return thunk
   that includes return accounting and does RSB stuffing when required.

This ensures no new compiler is required and avoids almost all overhead for
unaffected machines. This new option can still be selected using:

  "retbleed=stuff"

on the kernel command line.


As a refresher, the theory behind call depth tracking is:

The Return-Stack-Buffer (RSB) is a 16-entry deep stack that is filled on every
call. On the return path, speculation will "pop" an entry and take that as the
return target. Once the RSB is empty, the CPU falls back to other predictors,
e.g. the Branch History Buffer, which can be mistrained by user space to
misguide the (return) speculation path to a disclosure gadget of your choice --
as described in the retbleed paper.

Call depth tracking is designed to break this speculation path by stuffing
speculation trap calls into the RSB whenever the RSB is running low. This way
the speculation stalls and never falls back to other predictors.

The assumption is that stuffing at the 12th return is sufficient to break the
speculation before it hits the underflow and the fallback to the other
predictors. Testing confirms that it works. Johannes, one of the retbleed
researchers, tried to attack this approach and confirmed that it brings the
signal-to-noise ratio down to the crystal ball level.
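
For illustration, here is a minimal userspace C model of that accounting. The
counter name, the starting value and the exact stuffing threshold are
assumptions made for this sketch; the real implementation lives in the per-CPU
call depth counter and the custom return thunk introduced later in the series:

#include <stdio.h>

#define RSB_DEPTH	16	/* hardware return stack entries */
#define STUFF_AT	12	/* stuff around the 12th unbalanced return */

static int call_depth = RSB_DEPTH;	/* per-CPU x86_call_depth in the kernel */

static void stuff_rsb(void)
{
	/*
	 * The real return thunk issues a burst of calls to a speculation
	 * trap, refilling the RSB so speculation stalls instead of falling
	 * back to the other predictors. Modeled as a simple reset here.
	 */
	puts("stuffing RSB");
	call_depth = RSB_DEPTH;
}

/* run by the accounting thunk written into the callee's padding */
static void track_call(void)
{
	if (call_depth < RSB_DEPTH)
		call_depth++;
}

/* run by the custom return thunk */
static void track_return(void)
{
	if (call_depth > 0)
		call_depth--;
	if (call_depth <= RSB_DEPTH - STUFF_AT)
		stuff_rsb();
}

int main(void)
{
	int i;

	for (i = 0; i < 4; i++)		/* a few nested calls ... */
		track_call();
	for (i = 0; i < 18; i++)	/* ... then a long unwind of returns */
		track_return();
	return 0;
}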


Excerpts from IBRS vs stuff from Mel's testing:

perfsyscall

                       6.0.0-rc1              6.0.0-rc1
                       tglx-mit-spectre-ibrs  tglx-mit-spectre-retpoline-retstuff
Duration User              136.16                  69.10
Duration System            100.50                  33.04
Duration Elapsed           237.20                 102.65

That's a massive improvement with a major reduction in system CPU usage.

Kernel compilation is variable; Skylake-X was modest with a 2-18% gain
depending on the degree of parallelisation.

Git checkouts are roughly 14% faster on Skylake-X.

Network tests were localhost-only and so are limited but, even so, the gain is
large. Skylake-X again:

Netperf-TCP
                                  6.0.0-rc1              6.0.0-rc1
                      tglx-mit-spectre-ibrs  tglx-mit-spectre-retpoline-retstuff
Hmean     send-64         241.39 (   0.00%)      298.00 *  23.45%*
Hmean     send-128        489.55 (   0.00%)      610.46 *  24.70%*
Hmean     send-256        990.85 (   0.00%)     1201.73 *  21.28%*
Hmean     send-1024      4051.84 (   0.00%)     5006.19 *  23.55%*
Hmean     send-2048      7924.75 (   0.00%)     9777.14 *  23.37%*
Hmean     send-3312     12319.98 (   0.00%)    15210.07 *  23.46%*
Hmean     send-4096     14770.62 (   0.00%)    17941.32 *  21.47%*
Hmean     send-8192     26302.00 (   0.00%)    30170.04 *  14.71%*
Hmean     send-16384    42449.51 (   0.00%)    48036.45 *  13.16%*

While this is UDP_STREAM, TCP_STREAM is similarly impressive.

FIO measurements done by Tim Chen:

                                        Baseline        Baseline
read (kIOPs)            Mean    stdev   mitigations=off retbleed=off    CPU util
================================================================================
mitigations=off         357.33  3.79    0.00%           6.14%           98.93%
retbleed=off            336.67  6.43    -5.78%          0.00%           99.01%
retbleed=ibrs           242.00  0.00    -32.28%         -28.12%         99.41%
retbleed=stuff (pad)    314.33  1.53    -12.03%         -6.63%          99.31%

read/write                              Baseline        Baseline
70/30 (kIOPs)           Mean    stdev   mitigations=off retbleed=off    CPU util
================================================================================
mitigations=off         349.00  5.29    0.00%           9.06%           96.66%
retbleed=off            320.00  5.05    -8.31%          0.00%           95.54%
retbleed=ibrs           238.60  0.17    -31.63%         -25.44%         98.18%
retbleed=stuff (pad)    293.37  0.81    -15.94%         -8.32%          97.71%

                                        Baseline        Baseline
write (kIOPs)           Mean    stdev   mitigations=off retbleed=off    CPU util
================================================================================
mitigations=off         296.33  8.08    0.00%           6.21%           93.96%
retbleed=off            279.00  2.65    -5.85%          0.00%           93.63%
retbleed=ibrs           230.33  0.58    -22.27%         -17.44%         95.92%
retbleed=stuff (pad)    266.67  1.53    -10.01%         -4.42%          94.75%


The patches can also be found in git here:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git depthtracking





* [PATCH v2 01/59] x86/paravirt: Ensure proper alignment
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 16:05   ` Juergen Gross
  2022-09-02 13:06 ` [PATCH v2 02/59] x86/cpu: Remove segment load from switch_to_new_gdt() Peter Zijlstra
                   ` (58 subsequent siblings)
  59 siblings, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

The entries in the .parainstructions sections are 8 byte aligned and the
corresponding C struct makes the array stride 16 bytes.

The pushed entries, however, only use 12 bytes, so .parainstructions_end ends
up 4 bytes short.

That works by chance because it's only used in a loop:

     for (p = start; p < end; p++)

But this falls flat when calculating the number of elements:

    n = end - start

That's obviously off by one.
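
A quick userspace model of the arithmetic (the sizes are taken from the
changelog, everything else is made up for the sketch):

#include <stdio.h>

int main(void)
{
	const size_t struct_size  = 16;	/* padded element size seen by C */
	const size_t emitted_size = 12;	/* bytes actually pushed per entry */
	const size_t nr_entries   = 3;	/* arbitrary example */

	/* The last entry is not padded, so the section is 4 bytes short. */
	size_t section_bytes = (nr_entries - 1) * struct_size + emitted_size;

	/*
	 * This is what "n = end - start" computes with struct pointers:
	 * the byte distance divided by the element size, truncated.
	 */
	size_t n = section_bytes / struct_size;

	/* prints: entries emitted: 3, entries counted: 2 */
	printf("entries emitted: %zu, entries counted: %zu\n", nr_entries, n);
	return 0;
}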

Ensure that the gap is filled and the last entry occupies a full 16 bytes.

Cc: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/paravirt.h       |    1 +
 arch/x86/include/asm/paravirt_types.h |    1 +
 2 files changed, 2 insertions(+)

--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -743,6 +743,7 @@ extern void default_banner(void);
 	 word 771b;				\
 	 .byte ptype;				\
 	 .byte 772b-771b;			\
+	 _ASM_ALIGN;				\
 	.popsection
 
 
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -294,6 +294,7 @@ extern struct paravirt_patch_template pv
 	"  .byte " type "\n"				\
 	"  .byte 772b-771b\n"				\
 	"  .short " clobber "\n"			\
+	_ASM_ALIGN "\n"					\
 	".popsection\n"
 
 /* Generate patchable code, with the default asm parameters. */




* [PATCH v2 02/59] x86/cpu: Remove segment load from switch_to_new_gdt()
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 01/59] x86/paravirt: Ensure proper alignment Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 03/59] x86/cpu: Get rid of redundant switch_to_new_gdt() invocations Peter Zijlstra
                   ` (57 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

On 32bit the FS segment and on 64bit the GS segment are already set up
correctly, but load_percpu_segment() still sets [FG]S after switching from the
early GDT to the direct GDT.

For 32bit the segment load has no side effects, but on 64bit it causes GSBASE
to become 0, which means that any per CPU access before GSBASE is set to the
new value is going to fault. That's the reason why the whole file containing
this code has the stackprotector removed.

But that's a pointless exercise for both 32 and 64 bit as the relevant
segment selector is already correct. Loading the new GDT does not change
that.

Remove the segment loads and add comments.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/processor.h |    1 
 arch/x86/kernel/cpu/common.c     |   47 +++++++++++++++++++++++++--------------
 2 files changed, 31 insertions(+), 17 deletions(-)

--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -673,7 +673,6 @@ extern struct desc_ptr		early_gdt_descr;
 extern void switch_to_new_gdt(int);
 extern void load_direct_gdt(int);
 extern void load_fixmap_gdt(int);
-extern void load_percpu_segment(int);
 extern void cpu_init(void);
 extern void cpu_init_secondary(void);
 extern void cpu_init_exception_handling(void);
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -701,16 +701,6 @@ static const char *table_lookup_model(st
 __u32 cpu_caps_cleared[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
 __u32 cpu_caps_set[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
 
-void load_percpu_segment(int cpu)
-{
-#ifdef CONFIG_X86_32
-	loadsegment(fs, __KERNEL_PERCPU);
-#else
-	__loadsegment_simple(gs, 0);
-	wrmsrl(MSR_GS_BASE, cpu_kernelmode_gs_base(cpu));
-#endif
-}
-
 #ifdef CONFIG_X86_32
 /* The 32-bit entry code needs to find cpu_entry_area. */
 DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
@@ -738,16 +728,41 @@ void load_fixmap_gdt(int cpu)
 }
 EXPORT_SYMBOL_GPL(load_fixmap_gdt);
 
-/*
- * Current gdt points %fs at the "master" per-cpu area: after this,
- * it's on the real one.
+/**
+ * switch_to_new_gdt - Switch from early GDT to the direct one
+ * @cpu:	The CPU number for which this is invoked
+ *
+ * Invoked during early boot to switch from early GDT and early per CPU
+ * (%fs on 32bit, GS_BASE on 64bit) to the direct GDT and the runtime per
+ * CPU area.
  */
 void switch_to_new_gdt(int cpu)
 {
-	/* Load the original GDT */
 	load_direct_gdt(cpu);
-	/* Reload the per-cpu base */
-	load_percpu_segment(cpu);
+
+#ifdef CONFIG_X86_64
+	/*
+	 * No need to load %gs. It is already correct.
+	 *
+	 * Writing %gs on 64bit would zero GSBASE which would make any per
+	 * CPU operation up to the point of the wrmsrl() fault.
+	 *
+	 * Set GSBASE to the new offset. Until the wrmsrl() happens the
+	 * early mapping is still valid. That means the GSBASE update will
+	 * lose any prior per CPU data which was not copied over in
+	 * setup_per_cpu_areas().
+	 */
+	wrmsrl(MSR_GS_BASE, cpu_kernelmode_gs_base(cpu));
+#else
+	/*
+	 * %fs is already set to __KERNEL_PERCPU, but after switching GDT
+	 * it is required to load FS again so that the 'hidden' part is
+	 * updated from the new GDT. Up to this point the early per CPU
+	 * translation is active. Any content of the early per CPU data
+	 * which was not copied over in setup_per_cpu_areas() is lost.
+	 */
+	loadsegment(fs, __KERNEL_PERCPU);
+#endif
 }
 
 static const struct cpu_dev *cpu_devs[X86_VENDOR_NUM] = {};




* [PATCH v2 03/59] x86/cpu: Get rid of redundant switch_to_new_gdt() invocations
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 01/59] x86/paravirt: Ensure proper alignment Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 02/59] x86/cpu: Remove segment load from switch_to_new_gdt() Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 04/59] x86/cpu: Re-enable stackprotector Peter Zijlstra
                   ` (56 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

The only place where switch_to_new_gdt() is required is early boot to
switch from the early GDT to the direct GDT. Any other invocation is
completely redundant because it does not change anything.

Secondary CPUs come out of the ASM code with GDT and GSBASE correctly set
up. The same is true for XEN_PV.

Remove all the voodoo invocations which are leftovers from the ancient past,
rename the function to switch_gdt_and_percpu_base() and mark it __init.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
V2: Rename it (Linus)
---
 arch/x86/include/asm/processor.h |    2 +-
 arch/x86/kernel/cpu/common.c     |   17 ++++++-----------
 arch/x86/kernel/setup_percpu.c   |    2 +-
 arch/x86/kernel/smpboot.c        |    6 +++++-
 arch/x86/xen/enlighten_pv.c      |    2 +-
 5 files changed, 14 insertions(+), 15 deletions(-)

--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -670,7 +670,7 @@ extern int sysenter_setup(void);
 /* Defined in head.S */
 extern struct desc_ptr		early_gdt_descr;
 
-extern void switch_to_new_gdt(int);
+extern void switch_gdt_and_percpu_base(int);
 extern void load_direct_gdt(int);
 extern void load_fixmap_gdt(int);
 extern void cpu_init(void);
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -729,14 +729,15 @@ void load_fixmap_gdt(int cpu)
 EXPORT_SYMBOL_GPL(load_fixmap_gdt);
 
 /**
- * switch_to_new_gdt - Switch from early GDT to the direct one
+ * switch_gdt_and_percpu_base - Switch to direct GDT and runtime per CPU base
  * @cpu:	The CPU number for which this is invoked
  *
- * Invoked during early boot to switch from early GDT and early per CPU
- * (%fs on 32bit, GS_BASE on 64bit) to the direct GDT and the runtime per
- * CPU area.
+ * Invoked during early boot to switch from early GDT and early per CPU to
+ * the direct GDT and the runtime per CPU area. On 32-bit the percpu base
+ * switch is implicit by loading the direct GDT. On 64bit this requires
+ * to update GSBASE.
  */
-void switch_to_new_gdt(int cpu)
+void __init switch_gdt_and_percpu_base(int cpu)
 {
 	load_direct_gdt(cpu);
 
@@ -2251,12 +2252,6 @@ void cpu_init(void)
 	    boot_cpu_has(X86_FEATURE_TSC) || boot_cpu_has(X86_FEATURE_DE))
 		cr4_clear_bits(X86_CR4_VME|X86_CR4_PVI|X86_CR4_TSD|X86_CR4_DE);
 
-	/*
-	 * Initialize the per-CPU GDT with the boot GDT,
-	 * and set up the GDT descriptor:
-	 */
-	switch_to_new_gdt(cpu);
-
 	if (IS_ENABLED(CONFIG_X86_64)) {
 		loadsegment(fs, 0);
 		memset(cur->thread.tls_array, 0, GDT_ENTRY_TLS_ENTRIES * 8);
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -211,7 +211,7 @@ void __init setup_per_cpu_areas(void)
 		 * area.  Reload any changed state for the boot CPU.
 		 */
 		if (!cpu)
-			switch_to_new_gdt(cpu);
+			switch_gdt_and_percpu_base(cpu);
 	}
 
 	/* indicate the early static arrays will soon be gone */
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1453,7 +1453,11 @@ void arch_thaw_secondary_cpus_end(void)
 void __init native_smp_prepare_boot_cpu(void)
 {
 	int me = smp_processor_id();
-	switch_to_new_gdt(me);
+
+	/* SMP handles this from setup_per_cpu_areas() */
+	if (!IS_ENABLED(CONFIG_SMP))
+		switch_gdt_and_percpu_base(me);
+
 	/* already set me in cpu_online_mask in boot_cpu_init() */
 	cpumask_set_cpu(me, cpu_callout_mask);
 	cpu_set_state_online(me);
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -1167,7 +1167,7 @@ static void __init xen_setup_gdt(int cpu
 	pv_ops.cpu.write_gdt_entry = xen_write_gdt_entry_boot;
 	pv_ops.cpu.load_gdt = xen_load_gdt_boot;
 
-	switch_to_new_gdt(cpu);
+	switch_gdt_and_percpu_base(cpu);
 
 	pv_ops.cpu.write_gdt_entry = xen_write_gdt_entry;
 	pv_ops.cpu.load_gdt = xen_load_gdt;




* [PATCH v2 04/59] x86/cpu: Re-enable stackprotector
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (2 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 03/59] x86/cpu: Get rid of redundant switch_to_new_gdt() invocations Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 05/59] x86/modules: Set VM_FLUSH_RESET_PERMS in module_alloc() Peter Zijlstra
                   ` (55 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

Commit 5416c2663517 ("x86: make sure load_percpu_segment has no
stackprotector") disabled the stackprotector for cpu/common.c because of
load_percpu_segment(). Back then the boot stack canary was initialized very
early in start_kernel(). Switching the per CPU area by loading the GDT
caused the stackprotector to fail with paravirt-enabled kernels as the
GSBASE was not updated yet. In hindsight this was a wrong change because it
would have been sufficient to ensure that the canary is the same in both per
CPU areas.

Commit d55535232c3d ("random: move rand_initialize() earlier") moved the
stack canary initialization to a later point in the init sequence. As a
consequence the per CPU stack canary is 0 when switching the per CPU areas,
so there is no requirement anymore to exclude this file.

Add a comment to switch_gdt_and_percpu_base().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/cpu/Makefile |    3 ---
 arch/x86/kernel/cpu/common.c |    3 +++
 2 files changed, 3 insertions(+), 3 deletions(-)

--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -16,9 +16,6 @@ KCOV_INSTRUMENT_perf_event.o := n
 # As above, instrumenting secondary CPU boot code causes boot hangs.
 KCSAN_SANITIZE_common.o := n
 
-# Make sure load_percpu_segment has no stackprotector
-CFLAGS_common.o		:= -fno-stack-protector
-
 obj-y			:= cacheinfo.o scattered.o topology.o
 obj-y			+= common.o
 obj-y			+= rdrand.o
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -752,6 +752,9 @@ void __init switch_gdt_and_percpu_base(i
 	 * early mapping is still valid. That means the GSBASE update will
 	 * lose any prior per CPU data which was not copied over in
 	 * setup_per_cpu_areas().
+	 *
+	 * This works even with stackprotector enabled because the
+	 * per CPU stack canary is 0 in both per CPU areas.
 	 */
 	wrmsrl(MSR_GS_BASE, cpu_kernelmode_gs_base(cpu));
 #else




* [PATCH v2 05/59] x86/modules: Set VM_FLUSH_RESET_PERMS in module_alloc()
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (3 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 04/59] x86/cpu: Re-enable stackprotector Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 06/59] x86/vdso: Ensure all kernel code is seen by objtool Peter Zijlstra
                   ` (54 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

Instead of resetting permissions all over the place when freeing module
memory, tell the vmalloc code to do so. This avoids the exercise for the next
upcoming user.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/ftrace.c       |    2 --
 arch/x86/kernel/kprobes/core.c |    1 -
 arch/x86/kernel/module.c       |    9 +++++----
 3 files changed, 5 insertions(+), 7 deletions(-)

--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -413,8 +413,6 @@ create_trampoline(struct ftrace_ops *ops
 	/* ALLOC_TRAMP flags lets us know we created it */
 	ops->flags |= FTRACE_OPS_FL_ALLOC_TRAMP;
 
-	set_vm_flush_reset_perms(trampoline);
-
 	if (likely(system_state != SYSTEM_BOOTING))
 		set_memory_ro((unsigned long)trampoline, npages);
 	set_memory_x((unsigned long)trampoline, npages);
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -416,7 +416,6 @@ void *alloc_insn_page(void)
 	if (!page)
 		return NULL;
 
-	set_vm_flush_reset_perms(page);
 	/*
 	 * First make the page read-only, and only then make it executable to
 	 * prevent it from being W+X in between.
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -74,10 +74,11 @@ void *module_alloc(unsigned long size)
 		return NULL;
 
 	p = __vmalloc_node_range(size, MODULE_ALIGN,
-				    MODULES_VADDR + get_module_load_offset(),
-				    MODULES_END, gfp_mask,
-				    PAGE_KERNEL, VM_DEFER_KMEMLEAK, NUMA_NO_NODE,
-				    __builtin_return_address(0));
+				 MODULES_VADDR + get_module_load_offset(),
+				 MODULES_END, gfp_mask, PAGE_KERNEL,
+				 VM_FLUSH_RESET_PERMS | VM_DEFER_KMEMLEAK,
+				 NUMA_NO_NODE, __builtin_return_address(0));
+
 	if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
 		vfree(p);
 		return NULL;




* [PATCH v2 06/59] x86/vdso: Ensure all kernel code is seen by objtool
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (4 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 05/59] x86/modules: Set VM_FLUSH_RESET_PERMS in module_alloc() Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 07/59] x86: Sanitize linker script Peter Zijlstra
                   ` (53 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

extable.c is kernel code and not part of the VDSO, so it must not be excluded
from objtool's coverage.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/entry/vdso/Makefile |   11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -30,11 +30,12 @@ vobjs32-y += vdso32/vclock_gettime.o
 vobjs-$(CONFIG_X86_SGX)	+= vsgx.o
 
 # files to link into kernel
-obj-y				+= vma.o extable.o
-KASAN_SANITIZE_vma.o		:= y
-UBSAN_SANITIZE_vma.o		:= y
-KCSAN_SANITIZE_vma.o		:= y
-OBJECT_FILES_NON_STANDARD_vma.o	:= n
+obj-y					+= vma.o extable.o
+KASAN_SANITIZE_vma.o			:= y
+UBSAN_SANITIZE_vma.o			:= y
+KCSAN_SANITIZE_vma.o			:= y
+OBJECT_FILES_NON_STANDARD_vma.o		:= n
+OBJECT_FILES_NON_STANDARD_extable.o	:= n
 
 # vDSO images to build
 vdso_img-$(VDSO64-y)		+= 64




* [PATCH v2 07/59] x86: Sanitize linker script
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (5 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 06/59] x86/vdso: Ensure all kernel code is seen by objtool Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 08/59] x86/build: Ensure proper function alignment Peter Zijlstra
                   ` (52 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

The section ordering in the text section is more than suboptimal:

    ALIGN_ENTRY_TEXT_BEGIN
    ENTRY_TEXT
    ALIGN_ENTRY_TEXT_END
    SOFTIRQENTRY_TEXT
    STATIC_CALL_TEXT
    INDIRECT_THUNK_TEXT

ENTRY_TEXT is in a separate PMD so it can be mapped into the cpu entry area
when KPTI is enabled. That means the sections after it are also in a
separate PMD. That's wasteful, especially as the indirect thunk text is a
hotpath on retpoline-enabled systems and the static call text is fairly hot
on 32bit.

Move the entry text section last so that the other sections share a PMD
with the text before it. This is obviously just best effort and not
guaranteed when the previous text is just at a PMD boundary.

The text section placement needs an overhaul in general. There is e.g. no
point in having debugfs, sysfs, cpu hotplug and other rarely used functions
next to hot path text.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/vmlinux.lds.S |   13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -132,18 +132,19 @@ SECTIONS
 		CPUIDLE_TEXT
 		LOCK_TEXT
 		KPROBES_TEXT
-		ALIGN_ENTRY_TEXT_BEGIN
-		ENTRY_TEXT
-		ALIGN_ENTRY_TEXT_END
 		SOFTIRQENTRY_TEXT
-		STATIC_CALL_TEXT
-		*(.gnu.warning)
-
 #ifdef CONFIG_RETPOLINE
 		__indirect_thunk_start = .;
 		*(.text.__x86.*)
 		__indirect_thunk_end = .;
 #endif
+		STATIC_CALL_TEXT
+
+		ALIGN_ENTRY_TEXT_BEGIN
+		ENTRY_TEXT
+		ALIGN_ENTRY_TEXT_END
+		*(.gnu.warning)
+
 	} :text =0xcccc
 
 	/* End of text section, which should occupy whole number of pages */




* [PATCH v2 08/59] x86/build: Ensure proper function alignment
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (6 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 07/59] x86: Sanitize linker script Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 16:51   ` Linus Torvalds
  2022-09-05  2:09   ` David Laight
  2022-09-02 13:06 ` [PATCH v2 09/59] x86/asm: " Peter Zijlstra
                   ` (51 subsequent siblings)
  59 siblings, 2 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

The Intel Architectures Optimization Reference Manual explains that
functions should be aligned at 16 bytes because for a lot of (Intel)
uarchs the I-fetch width is 16 bytes. The AMD Software Optimization
Guide (for recent chips) mentions a 32 byte I-fetch window but a 16
byte decode window.

Follow this advice and align functions to 16 bytes to optimize
instruction delivery to decode and reduce front-end bottlenecks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/Kconfig.cpu              |    6 ++++++
 arch/x86/Makefile                 |    4 ++++
 arch/x86/include/asm/linkage.h    |    7 ++++---
 include/asm-generic/vmlinux.lds.h |    7 ++++++-
 4 files changed, 20 insertions(+), 4 deletions(-)

--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -517,3 +517,9 @@ config CPU_SUP_VORTEX_32
 	  makes the kernel a tiny bit smaller.
 
 	  If unsure, say N.
+
+# Defined here so it is defined for UM too
+config FUNCTION_ALIGNMENT
+	int
+	default 16 if X86_64 || X86_ALIGNMENT_16
+	default 8
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -84,6 +84,10 @@ else
 KBUILD_CFLAGS += $(call cc-option,-fcf-protection=none)
 endif
 
+ifneq ($(CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B),y)
+KBUILD_CFLAGS += -falign-functions=$(CONFIG_FUNCTION_ALIGNMENT)
+endif
+
 ifeq ($(CONFIG_X86_32),y)
         BITS := 32
         UTS_MACHINE := i386
--- a/arch/x86/include/asm/linkage.h
+++ b/arch/x86/include/asm/linkage.h
@@ -14,9 +14,10 @@
 
 #ifdef __ASSEMBLY__
 
-#if defined(CONFIG_X86_64) || defined(CONFIG_X86_ALIGNMENT_16)
-#define __ALIGN		.p2align 4, 0x90
-#define __ALIGN_STR	__stringify(__ALIGN)
+#if CONFIG_FUNCTION_ALIGNMENT == 16
+#define __ALIGN			.p2align 4, 0x90
+#define __ALIGN_STR		__stringify(__ALIGN)
+#define FUNCTION_ALIGNMENT	16
 #endif
 
 #if defined(CONFIG_RETHUNK) && !defined(__DISABLE_EXPORTS) && !defined(BUILD_VDSO)
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -82,7 +82,12 @@
 #endif
 
 /* Align . to a 8 byte boundary equals to maximum function alignment. */
-#define ALIGN_FUNCTION()  . = ALIGN(8)
+#ifndef CONFIG_FUNCTION_ALIGNMENT
+#define __FUNCTION_ALIGNMENT	8
+#else
+#define __FUNCTION_ALIGNMENT	CONFIG_FUNCTION_ALIGNMENT
+#endif
+#define ALIGN_FUNCTION()  . = ALIGN(__FUNCTION_ALIGNMENT)
 
 /*
  * LD_DEAD_CODE_DATA_ELIMINATION option enables -fdata-sections, which




* [PATCH v2 09/59] x86/asm: Ensure proper function alignment
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (7 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 08/59] x86/build: Ensure proper function alignment Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 10/59] x86/error_inject: Align function properly Peter Zijlstra
                   ` (50 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

With the compiler now aligning functions to 16 bytes, make sure the
assembler does the same. Change the SYM_FUNC_START*() variants to have
matching alignment.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/linkage.h |   23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

--- a/arch/x86/include/asm/linkage.h
+++ b/arch/x86/include/asm/linkage.h
@@ -12,13 +12,20 @@
 #define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(0)))
 #endif /* CONFIG_X86_32 */
 
-#ifdef __ASSEMBLY__
+#if CONFIG_FUNCTION_ALIGNMENT == 8
+#define __ALIGN			.p2align 3, 0x90;
+#elif CONFIG_FUNCTION_ALIGNMENT == 16
+#define __ALIGN			.p2align 4, 0x90;
+#else
+# error Unsupported function alignment
+#endif
 
-#if CONFIG_FUNCTION_ALIGNMENT == 16
-#define __ALIGN			.p2align 4, 0x90
 #define __ALIGN_STR		__stringify(__ALIGN)
-#define FUNCTION_ALIGNMENT	16
-#endif
+#define ASM_FUNC_ALIGN		__ALIGN_STR
+#define __FUNC_ALIGN		__ALIGN
+#define SYM_F_ALIGN		__FUNC_ALIGN
+
+#ifdef __ASSEMBLY__
 
 #if defined(CONFIG_RETHUNK) && !defined(__DISABLE_EXPORTS) && !defined(BUILD_VDSO)
 #define RET	jmp __x86_return_thunk
@@ -46,7 +53,7 @@
 
 /* SYM_FUNC_START -- use for global functions */
 #define SYM_FUNC_START(name)				\
-	SYM_START(name, SYM_L_GLOBAL, SYM_A_ALIGN)	\
+	SYM_START(name, SYM_L_GLOBAL, SYM_F_ALIGN)	\
 	ENDBR
 
 /* SYM_FUNC_START_NOALIGN -- use for global functions, w/o alignment */
@@ -56,7 +63,7 @@
 
 /* SYM_FUNC_START_LOCAL -- use for local functions */
 #define SYM_FUNC_START_LOCAL(name)			\
-	SYM_START(name, SYM_L_LOCAL, SYM_A_ALIGN)	\
+	SYM_START(name, SYM_L_LOCAL, SYM_F_ALIGN)	\
 	ENDBR
 
 /* SYM_FUNC_START_LOCAL_NOALIGN -- use for local functions, w/o alignment */
@@ -66,7 +73,7 @@
 
 /* SYM_FUNC_START_WEAK -- use for weak functions */
 #define SYM_FUNC_START_WEAK(name)			\
-	SYM_START(name, SYM_L_WEAK, SYM_A_ALIGN)	\
+	SYM_START(name, SYM_L_WEAK, SYM_F_ALIGN)	\
 	ENDBR
 
 /* SYM_FUNC_START_WEAK_NOALIGN -- use for weak functions, w/o alignment */




* [PATCH v2 10/59] x86/error_inject: Align function properly
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (8 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 09/59] x86/asm: " Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 11/59] x86/paravirt: Properly align PV functions Peter Zijlstra
                   ` (49 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Peter Zijlstra <peterz@infradead.org>

Ensure inline asm functions are consistently aligned with compiler
generated and SYM_FUNC_START*() functions.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/lib/error-inject.c |    1 +
 1 file changed, 1 insertion(+)

--- a/arch/x86/lib/error-inject.c
+++ b/arch/x86/lib/error-inject.c
@@ -11,6 +11,7 @@ asm(
 	".text\n"
 	".type just_return_func, @function\n"
 	".globl just_return_func\n"
+	ASM_FUNC_ALIGN
 	"just_return_func:\n"
 		ANNOTATE_NOENDBR
 		ASM_RET




* [PATCH v2 11/59] x86/paravirt: Properly align PV functions
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (9 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 10/59] x86/error_inject: Align function properly Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 12/59] x86/entry: Align SYM_CODE_START() variants Peter Zijlstra
                   ` (48 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

Ensure inline asm functions are consistently aligned with compiler
generated and SYM_FUNC_START*() functions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/paravirt.h           |    1 +
 arch/x86/include/asm/qspinlock_paravirt.h |    2 +-
 arch/x86/kernel/kvm.c                     |    1 +
 arch/x86/kernel/paravirt.c                |    2 ++
 4 files changed, 5 insertions(+), 1 deletion(-)

--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -665,6 +665,7 @@ bool __raw_callee_save___native_vcpu_is_
 	asm(".pushsection " section ", \"ax\";"				\
 	    ".globl " PV_THUNK_NAME(func) ";"				\
 	    ".type " PV_THUNK_NAME(func) ", @function;"			\
+	    ASM_FUNC_ALIGN						\
 	    PV_THUNK_NAME(func) ":"					\
 	    ASM_ENDBR							\
 	    FRAME_BEGIN							\
--- a/arch/x86/include/asm/qspinlock_paravirt.h
+++ b/arch/x86/include/asm/qspinlock_paravirt.h
@@ -39,7 +39,7 @@ PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_sp
 asm    (".pushsection .text;"
 	".globl " PV_UNLOCK ";"
 	".type " PV_UNLOCK ", @function;"
-	".align 4,0x90;"
+	ASM_FUNC_ALIGN
 	PV_UNLOCK ": "
 	ASM_ENDBR
 	FRAME_BEGIN
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -802,6 +802,7 @@ asm(
 ".pushsection .text;"
 ".global __raw_callee_save___kvm_vcpu_is_preempted;"
 ".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
+ASM_FUNC_ALIGN
 "__raw_callee_save___kvm_vcpu_is_preempted:"
 ASM_ENDBR
 "movq	__per_cpu_offset(,%rdi,8), %rax;"
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -40,6 +40,7 @@
 extern void _paravirt_nop(void);
 asm (".pushsection .entry.text, \"ax\"\n"
      ".global _paravirt_nop\n"
+     ASM_FUNC_ALIGN
      "_paravirt_nop:\n\t"
      ASM_ENDBR
      ASM_RET
@@ -50,6 +51,7 @@ asm (".pushsection .entry.text, \"ax\"\n
 /* stub always returning 0. */
 asm (".pushsection .entry.text, \"ax\"\n"
      ".global paravirt_ret0\n"
+     ASM_FUNC_ALIGN
      "paravirt_ret0:\n\t"
      ASM_ENDBR
      "xor %" _ASM_AX ", %" _ASM_AX ";\n\t"




* [PATCH v2 12/59] x86/entry: Align SYM_CODE_START() variants
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (10 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 11/59] x86/paravirt: Properly align PV functions Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 13/59] crypto: x86/camellia: Remove redundant alignments Peter Zijlstra
                   ` (47 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

Explicitly align a bunch of commonly called SYM_CODE_START() symbols.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/entry/entry_64.S |   12 ++++++++----
 arch/x86/entry/thunk_64.S |    4 ++--
 2 files changed, 10 insertions(+), 6 deletions(-)

--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -284,7 +284,8 @@ SYM_FUNC_END(__switch_to_asm)
  * r12: kernel thread arg
  */
 .pushsection .text, "ax"
-SYM_CODE_START(ret_from_fork)
+	__FUNC_ALIGN
+SYM_CODE_START_NOALIGN(ret_from_fork)
 	UNWIND_HINT_EMPTY
 	ANNOTATE_NOENDBR // copy_thread
 	movq	%rax, %rdi
@@ -828,7 +829,8 @@ EXPORT_SYMBOL(asm_load_gs_index)
  *
  * C calling convention: exc_xen_hypervisor_callback(struct *pt_regs)
  */
-SYM_CODE_START_LOCAL(exc_xen_hypervisor_callback)
+	__FUNC_ALIGN
+SYM_CODE_START_LOCAL_NOALIGN(exc_xen_hypervisor_callback)
 
 /*
  * Since we don't modify %rdi, evtchn_do_upall(struct *pt_regs) will
@@ -856,7 +858,8 @@ SYM_CODE_END(exc_xen_hypervisor_callback
  * We distinguish between categories by comparing each saved segment register
  * with its current contents: any discrepancy means we in category 1.
  */
-SYM_CODE_START(xen_failsafe_callback)
+	__FUNC_ALIGN
+SYM_CODE_START_NOALIGN(xen_failsafe_callback)
 	UNWIND_HINT_EMPTY
 	ENDBR
 	movl	%ds, %ecx
@@ -1516,7 +1519,8 @@ SYM_CODE_END(ignore_sysret)
 #endif
 
 .pushsection .text, "ax"
-SYM_CODE_START(rewind_stack_and_make_dead)
+	__FUNC_ALIGN
+SYM_CODE_START_NOALIGN(rewind_stack_and_make_dead)
 	UNWIND_HINT_FUNC
 	/* Prevent any naive code from trying to unwind to our caller. */
 	xorl	%ebp, %ebp
--- a/arch/x86/entry/thunk_64.S
+++ b/arch/x86/entry/thunk_64.S
@@ -11,7 +11,7 @@
 
 	/* rdi:	arg1 ... normal C conventions. rax is saved/restored. */
 	.macro THUNK name, func
-SYM_FUNC_START_NOALIGN(\name)
+SYM_FUNC_START(\name)
 	pushq %rbp
 	movq %rsp, %rbp
 
@@ -36,7 +36,7 @@ SYM_FUNC_END(\name)
 	EXPORT_SYMBOL(preempt_schedule_thunk)
 	EXPORT_SYMBOL(preempt_schedule_notrace_thunk)
 
-SYM_CODE_START_LOCAL_NOALIGN(__thunk_restore)
+SYM_CODE_START_LOCAL(__thunk_restore)
 	popq %r11
 	popq %r10
 	popq %r9




* [PATCH v2 13/59] crypto: x86/camellia: Remove redundant alignments
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (11 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 12/59] x86/entry: Align SYM_CODE_START() variants Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 14/59] crypto: x86/cast5: " Peter Zijlstra
                   ` (46 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

SYM_FUNC_START*() and friends already imply alignment, remove custom
alignment hacks to make code consistent. This prepares for future
function call ABI changes.

Also, having pushed the function alignment to 16 bytes, this custom
alignment is completely superfluous.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/crypto/camellia-aesni-avx-asm_64.S  |    2 --
 arch/x86/crypto/camellia-aesni-avx2-asm_64.S |    4 ----
 2 files changed, 6 deletions(-)

--- a/arch/x86/crypto/camellia-aesni-avx-asm_64.S
+++ b/arch/x86/crypto/camellia-aesni-avx-asm_64.S
@@ -712,7 +712,6 @@ SYM_FUNC_END(roundsm16_x4_x5_x6_x7_x0_x1
 
 .text
 
-.align 8
 SYM_FUNC_START_LOCAL(__camellia_enc_blk16)
 	/* input:
 	 *	%rdi: ctx, CTX
@@ -799,7 +798,6 @@ SYM_FUNC_START_LOCAL(__camellia_enc_blk1
 	jmp .Lenc_done;
 SYM_FUNC_END(__camellia_enc_blk16)
 
-.align 8
 SYM_FUNC_START_LOCAL(__camellia_dec_blk16)
 	/* input:
 	 *	%rdi: ctx, CTX
--- a/arch/x86/crypto/camellia-aesni-avx2-asm_64.S
+++ b/arch/x86/crypto/camellia-aesni-avx2-asm_64.S
@@ -221,7 +221,6 @@
  * Size optimization... with inlined roundsm32 binary would be over 5 times
  * larger and would only marginally faster.
  */
-.align 8
 SYM_FUNC_START_LOCAL(roundsm32_x0_x1_x2_x3_x4_x5_x6_x7_y0_y1_y2_y3_y4_y5_y6_y7_cd)
 	roundsm32(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7,
 		  %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14, %ymm15,
@@ -229,7 +228,6 @@ SYM_FUNC_START_LOCAL(roundsm32_x0_x1_x2_
 	RET;
 SYM_FUNC_END(roundsm32_x0_x1_x2_x3_x4_x5_x6_x7_y0_y1_y2_y3_y4_y5_y6_y7_cd)
 
-.align 8
 SYM_FUNC_START_LOCAL(roundsm32_x4_x5_x6_x7_x0_x1_x2_x3_y4_y5_y6_y7_y0_y1_y2_y3_ab)
 	roundsm32(%ymm4, %ymm5, %ymm6, %ymm7, %ymm0, %ymm1, %ymm2, %ymm3,
 		  %ymm12, %ymm13, %ymm14, %ymm15, %ymm8, %ymm9, %ymm10, %ymm11,
@@ -748,7 +746,6 @@ SYM_FUNC_END(roundsm32_x4_x5_x6_x7_x0_x1
 
 .text
 
-.align 8
 SYM_FUNC_START_LOCAL(__camellia_enc_blk32)
 	/* input:
 	 *	%rdi: ctx, CTX
@@ -835,7 +832,6 @@ SYM_FUNC_START_LOCAL(__camellia_enc_blk3
 	jmp .Lenc_done;
 SYM_FUNC_END(__camellia_enc_blk32)
 
-.align 8
 SYM_FUNC_START_LOCAL(__camellia_dec_blk32)
 	/* input:
 	 *	%rdi: ctx, CTX




* [PATCH v2 14/59] crypto: x86/cast5: Remove redundant alignments
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (12 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 13/59] crypto: x86/camellia: Remove redundant alignments Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 15/59] crypto: x86/crct10dif-pcl: " Peter Zijlstra
                   ` (45 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

SYM_FUNC_START*() and friends already imply alignment, remove custom
alignment hacks to make code consistent. This prepares for future
function call ABI changes.

Also, having pushed the function alignment to 16 bytes, this custom
alignment is completely superfluous.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/crypto/cast5-avx-x86_64-asm_64.S |    2 --
 1 file changed, 2 deletions(-)

--- a/arch/x86/crypto/cast5-avx-x86_64-asm_64.S
+++ b/arch/x86/crypto/cast5-avx-x86_64-asm_64.S
@@ -208,7 +208,6 @@
 
 .text
 
-.align 16
 SYM_FUNC_START_LOCAL(__cast5_enc_blk16)
 	/* input:
 	 *	%rdi: ctx
@@ -282,7 +281,6 @@ SYM_FUNC_START_LOCAL(__cast5_enc_blk16)
 	RET;
 SYM_FUNC_END(__cast5_enc_blk16)
 
-.align 16
 SYM_FUNC_START_LOCAL(__cast5_dec_blk16)
 	/* input:
 	 *	%rdi: ctx




* [PATCH v2 15/59] crypto: x86/crct10dif-pcl: Remove redundant alignments
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (13 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 14/59] crypto: x86/cast5: " Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 16/59] crypto: x86/serpent: " Peter Zijlstra
                   ` (44 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

SYM_FUNC_START*() and friends already imply alignment, remove custom
alignment hacks to make code consistent. This prepares for future
function call ABI changes.

Also, having pushed the function alignment to 16 bytes, this custom
alignment is completely superfluous.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/crypto/crct10dif-pcl-asm_64.S |    1 -
 1 file changed, 1 deletion(-)

--- a/arch/x86/crypto/crct10dif-pcl-asm_64.S
+++ b/arch/x86/crypto/crct10dif-pcl-asm_64.S
@@ -94,7 +94,6 @@
 #
 # Assumes len >= 16.
 #
-.align 16
 SYM_FUNC_START(crc_t10dif_pcl)
 
 	movdqa	.Lbswap_mask(%rip), BSWAP_MASK




* [PATCH v2 16/59] crypto: x86/serpent: Remove redundant alignments
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (14 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 15/59] crypto: x86/crct10dif-pcl: " Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 17/59] crypto: x86/sha1: Remove custom alignments Peter Zijlstra
                   ` (43 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

SYM_FUNC_START*() and friends already imply alignment, remove custom
alignment hacks to make code consistent. This prepares for future
function call ABI changes.

Also, having pushed the function alignment to 16 bytes, this custom
alignment is completely superfluous.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/crypto/serpent-avx-x86_64-asm_64.S |    2 --
 arch/x86/crypto/serpent-avx2-asm_64.S       |    2 --
 2 files changed, 4 deletions(-)

--- a/arch/x86/crypto/serpent-avx-x86_64-asm_64.S
+++ b/arch/x86/crypto/serpent-avx-x86_64-asm_64.S
@@ -550,7 +550,6 @@
 #define write_blocks(x0, x1, x2, x3, t0, t1, t2) \
 	transpose_4x4(x0, x1, x2, x3, t0, t1, t2)
 
-.align 8
 SYM_FUNC_START_LOCAL(__serpent_enc_blk8_avx)
 	/* input:
 	 *	%rdi: ctx, CTX
@@ -604,7 +603,6 @@ SYM_FUNC_START_LOCAL(__serpent_enc_blk8_
 	RET;
 SYM_FUNC_END(__serpent_enc_blk8_avx)
 
-.align 8
 SYM_FUNC_START_LOCAL(__serpent_dec_blk8_avx)
 	/* input:
 	 *	%rdi: ctx, CTX
--- a/arch/x86/crypto/serpent-avx2-asm_64.S
+++ b/arch/x86/crypto/serpent-avx2-asm_64.S
@@ -550,7 +550,6 @@
 #define write_blocks(x0, x1, x2, x3, t0, t1, t2) \
 	transpose_4x4(x0, x1, x2, x3, t0, t1, t2)
 
-.align 8
 SYM_FUNC_START_LOCAL(__serpent_enc_blk16)
 	/* input:
 	 *	%rdi: ctx, CTX
@@ -604,7 +603,6 @@ SYM_FUNC_START_LOCAL(__serpent_enc_blk16
 	RET;
 SYM_FUNC_END(__serpent_enc_blk16)
 
-.align 8
 SYM_FUNC_START_LOCAL(__serpent_dec_blk16)
 	/* input:
 	 *	%rdi: ctx, CTX




* [PATCH v2 17/59] crypto: x86/sha1: Remove custom alignments
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (15 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 16/59] crypto: x86/serpent: " Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 18/59] crypto: x86/sha256: " Peter Zijlstra
                   ` (42 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

SYM_FUNC_START*() and friends already imply alignment, remove custom
alignment hacks to make code consistent. This prepares for future
function call ABI changes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/crypto/sha1_ni_asm.S |    1 -
 1 file changed, 1 deletion(-)

--- a/arch/x86/crypto/sha1_ni_asm.S
+++ b/arch/x86/crypto/sha1_ni_asm.S
@@ -92,7 +92,6 @@
  * numBlocks: Number of blocks to process
  */
 .text
-.align 32
 SYM_FUNC_START(sha1_ni_transform)
 	push		%rbp
 	mov		%rsp, %rbp




* [PATCH v2 18/59] crypto: x86/sha256: Remove custom alignments
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (16 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 17/59] crypto: x86/sha1: Remove custom alignments Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 19/59] crypto: x86/sm[34]: Remove redundant alignments Peter Zijlstra
                   ` (41 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

SYM_FUNC_START*() and friends already imply alignment, remove custom
alignment hacks to make code consistent. This prepares for future
function call ABI changes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/crypto/sha256-avx-asm.S   |    1 -
 arch/x86/crypto/sha256-avx2-asm.S  |    1 -
 arch/x86/crypto/sha256-ssse3-asm.S |    1 -
 arch/x86/crypto/sha256_ni_asm.S    |    1 -
 4 files changed, 4 deletions(-)

--- a/arch/x86/crypto/sha256-avx-asm.S
+++ b/arch/x86/crypto/sha256-avx-asm.S
@@ -347,7 +347,6 @@ a = TMP_
 ########################################################################
 .text
 SYM_FUNC_START(sha256_transform_avx)
-.align 32
 	pushq   %rbx
 	pushq   %r12
 	pushq   %r13
--- a/arch/x86/crypto/sha256-avx2-asm.S
+++ b/arch/x86/crypto/sha256-avx2-asm.S
@@ -524,7 +524,6 @@ STACK_SIZE	= _CTX      + _CTX_SIZE
 ########################################################################
 .text
 SYM_FUNC_START(sha256_transform_rorx)
-.align 32
 	pushq	%rbx
 	pushq	%r12
 	pushq	%r13
--- a/arch/x86/crypto/sha256-ssse3-asm.S
+++ b/arch/x86/crypto/sha256-ssse3-asm.S
@@ -356,7 +356,6 @@ a = TMP_
 ########################################################################
 .text
 SYM_FUNC_START(sha256_transform_ssse3)
-.align 32
 	pushq   %rbx
 	pushq   %r12
 	pushq   %r13
--- a/arch/x86/crypto/sha256_ni_asm.S
+++ b/arch/x86/crypto/sha256_ni_asm.S
@@ -96,7 +96,6 @@
  */
 
 .text
-.align 32
 SYM_FUNC_START(sha256_ni_transform)
 
 	shl		$6, NUM_BLKS		/*  convert to bytes */




* [PATCH v2 19/59] crypto: x86/sm[34]: Remove redundant alignments
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (17 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 18/59] crypto: x86/sha256: " Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 20/59] crypto: twofish: " Peter Zijlstra
                   ` (40 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

SYM_FUNC_START*() and friends already imply alignment; remove the custom
alignment hacks to make the code consistent. This prepares for future
function call ABI changes.

Also, now that the function alignment has been pushed to 16 bytes, this
custom alignment is completely superfluous.

( this code couldn't seem to make up its mind about what alignment it
  actually wanted, randomly mixing 8 and 16 bytes )

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/crypto/sm3-avx-asm_64.S        |    1 -
 arch/x86/crypto/sm4-aesni-avx-asm_64.S  |    7 -------
 arch/x86/crypto/sm4-aesni-avx2-asm_64.S |    6 ------
 3 files changed, 14 deletions(-)

--- a/arch/x86/crypto/sm3-avx-asm_64.S
+++ b/arch/x86/crypto/sm3-avx-asm_64.S
@@ -327,7 +327,6 @@
  * void sm3_transform_avx(struct sm3_state *state,
  *                        const u8 *data, int nblocks);
  */
-.align 16
 SYM_FUNC_START(sm3_transform_avx)
 	/* input:
 	 *	%rdi: ctx, CTX
--- a/arch/x86/crypto/sm4-aesni-avx-asm_64.S
+++ b/arch/x86/crypto/sm4-aesni-avx-asm_64.S
@@ -139,13 +139,11 @@
 
 
 .text
-.align 16
 
 /*
  * void sm4_aesni_avx_crypt4(const u32 *rk, u8 *dst,
  *                           const u8 *src, int nblocks)
  */
-.align 8
 SYM_FUNC_START(sm4_aesni_avx_crypt4)
 	/* input:
 	 *	%rdi: round key array, CTX
@@ -249,7 +247,6 @@ SYM_FUNC_START(sm4_aesni_avx_crypt4)
 	RET;
 SYM_FUNC_END(sm4_aesni_avx_crypt4)
 
-.align 8
 SYM_FUNC_START_LOCAL(__sm4_crypt_blk8)
 	/* input:
 	 *	%rdi: round key array, CTX
@@ -363,7 +360,6 @@ SYM_FUNC_END(__sm4_crypt_blk8)
  * void sm4_aesni_avx_crypt8(const u32 *rk, u8 *dst,
  *                           const u8 *src, int nblocks)
  */
-.align 8
 SYM_FUNC_START(sm4_aesni_avx_crypt8)
 	/* input:
 	 *	%rdi: round key array, CTX
@@ -419,7 +415,6 @@ SYM_FUNC_END(sm4_aesni_avx_crypt8)
  * void sm4_aesni_avx_ctr_enc_blk8(const u32 *rk, u8 *dst,
  *                                 const u8 *src, u8 *iv)
  */
-.align 8
 SYM_FUNC_START(sm4_aesni_avx_ctr_enc_blk8)
 	/* input:
 	 *	%rdi: round key array, CTX
@@ -494,7 +489,6 @@ SYM_FUNC_END(sm4_aesni_avx_ctr_enc_blk8)
  * void sm4_aesni_avx_cbc_dec_blk8(const u32 *rk, u8 *dst,
  *                                 const u8 *src, u8 *iv)
  */
-.align 8
 SYM_FUNC_START(sm4_aesni_avx_cbc_dec_blk8)
 	/* input:
 	 *	%rdi: round key array, CTX
@@ -544,7 +538,6 @@ SYM_FUNC_END(sm4_aesni_avx_cbc_dec_blk8)
  * void sm4_aesni_avx_cfb_dec_blk8(const u32 *rk, u8 *dst,
  *                                 const u8 *src, u8 *iv)
  */
-.align 8
 SYM_FUNC_START(sm4_aesni_avx_cfb_dec_blk8)
 	/* input:
 	 *	%rdi: round key array, CTX
--- a/arch/x86/crypto/sm4-aesni-avx2-asm_64.S
+++ b/arch/x86/crypto/sm4-aesni-avx2-asm_64.S
@@ -153,9 +153,6 @@
 	.long 0xdeadbeef, 0xdeadbeef, 0xdeadbeef
 
 .text
-.align 16
-
-.align 8
 SYM_FUNC_START_LOCAL(__sm4_crypt_blk16)
 	/* input:
 	 *	%rdi: round key array, CTX
@@ -281,7 +278,6 @@ SYM_FUNC_END(__sm4_crypt_blk16)
  * void sm4_aesni_avx2_ctr_enc_blk16(const u32 *rk, u8 *dst,
  *                                   const u8 *src, u8 *iv)
  */
-.align 8
 SYM_FUNC_START(sm4_aesni_avx2_ctr_enc_blk16)
 	/* input:
 	 *	%rdi: round key array, CTX
@@ -394,7 +390,6 @@ SYM_FUNC_END(sm4_aesni_avx2_ctr_enc_blk1
  * void sm4_aesni_avx2_cbc_dec_blk16(const u32 *rk, u8 *dst,
  *                                   const u8 *src, u8 *iv)
  */
-.align 8
 SYM_FUNC_START(sm4_aesni_avx2_cbc_dec_blk16)
 	/* input:
 	 *	%rdi: round key array, CTX
@@ -448,7 +443,6 @@ SYM_FUNC_END(sm4_aesni_avx2_cbc_dec_blk1
  * void sm4_aesni_avx2_cfb_dec_blk16(const u32 *rk, u8 *dst,
  *                                   const u8 *src, u8 *iv)
  */
-.align 8
 SYM_FUNC_START(sm4_aesni_avx2_cfb_dec_blk16)
 	/* input:
 	 *	%rdi: round key array, CTX




* [PATCH v2 20/59] crypto: twofish: Remove redundant alignments
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (18 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 19/59] crypto: x86/sm[34]: Remove redundant alignments Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 21/59] crypto: x86/poly1305: Remove custom function alignment Peter Zijlstra
                   ` (39 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

SYM_FUNC_START*() and friends already imply alignment; remove the custom
alignment hacks to make the code consistent. This prepares for future
function call ABI changes.

Also, now that the function alignment has been pushed to 16 bytes, this
custom alignment is completely superfluous.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/crypto/twofish-avx-x86_64-asm_64.S |    2 --
 1 file changed, 2 deletions(-)

--- a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
+++ b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
@@ -228,7 +228,6 @@
 	vpxor		x2, wkey, x2; \
 	vpxor		x3, wkey, x3;
 
-.align 8
 SYM_FUNC_START_LOCAL(__twofish_enc_blk8)
 	/* input:
 	 *	%rdi: ctx, CTX
@@ -270,7 +269,6 @@ SYM_FUNC_START_LOCAL(__twofish_enc_blk8)
 	RET;
 SYM_FUNC_END(__twofish_enc_blk8)
 
-.align 8
 SYM_FUNC_START_LOCAL(__twofish_dec_blk8)
 	/* input:
 	 *	%rdi: ctx, CTX




* [PATCH v2 21/59] crypto: x86/poly1305: Remove custom function alignment
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (19 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 20/59] crypto: twofish: " Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 22/59] x86: Put hot per CPU variables into a struct Peter Zijlstra
                   ` (38 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

SYM_FUNC_START*() and friends already imply alignment; remove the custom
alignment hacks to make the code consistent. This prepares for future
function call ABI changes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/crypto/poly1305-x86_64-cryptogams.pl |    1 -
 1 file changed, 1 deletion(-)

--- a/arch/x86/crypto/poly1305-x86_64-cryptogams.pl
+++ b/arch/x86/crypto/poly1305-x86_64-cryptogams.pl
@@ -108,7 +108,6 @@ if (!$kernel) {
 sub declare_function() {
 	my ($name, $align, $nargs) = @_;
 	if($kernel) {
-		$code .= ".align $align\n";
 		$code .= "SYM_FUNC_START($name)\n";
 		$code .= ".L$name:\n";
 	} else {




* [PATCH v2 22/59] x86: Put hot per CPU variables into a struct
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (20 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 21/59] crypto: x86/poly1305: Remove custom function alignment Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 18:02   ` Jann Horn
  2022-09-02 13:06 ` [PATCH v2 23/59] x86/percpu: Move preempt_count next to current_task Peter Zijlstra
                   ` (37 subsequent siblings)
  59 siblings, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

The layout of per-cpu variables is at the mercy of the compiler. This
can lead to random performance fluctuations from build to build.

Create a structure to hold some of the hottest per-cpu variables,
starting with current_task.
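
As a rough illustration (sketch only, not part of the patch): two
separately defined per-cpu variables come with no adjacency guarantee,
while members of a single cacheline-sized, cacheline-aligned object can
never be split across lines:

  /* old scheme: nothing ties these two together */
  DEFINE_PER_CPU(struct task_struct *, current_task);
  DEFINE_PER_CPU(int, __preempt_count);

  /* new scheme: one 64-byte object, members always share the line */
  DEFINE_PER_CPU_ALIGNED(struct pcpu_hot, pcpu_hot) = {
	.current_task	= &init_task,
  };

Follow-up patches then move further hot variables into the struct.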

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/current.h |   19 ++++++++++++++++---
 arch/x86/kernel/cpu/common.c   |   14 +++++---------
 arch/x86/kernel/process_32.c   |    2 +-
 arch/x86/kernel/process_64.c   |    2 +-
 arch/x86/kernel/smpboot.c      |    2 +-
 5 files changed, 24 insertions(+), 15 deletions(-)

--- a/arch/x86/include/asm/current.h
+++ b/arch/x86/include/asm/current.h
@@ -3,16 +3,29 @@
 #define _ASM_X86_CURRENT_H
 
 #include <linux/compiler.h>
-#include <asm/percpu.h>
 
 #ifndef __ASSEMBLY__
+
+#include <linux/cache.h>
+#include <asm/percpu.h>
+
 struct task_struct;
 
-DECLARE_PER_CPU(struct task_struct *, current_task);
+struct pcpu_hot {
+	union {
+		struct {
+			struct task_struct	*current_task;
+		};
+		u8	pad[64];
+	};
+};
+static_assert(sizeof(struct pcpu_hot) == 64);
+
+DECLARE_PER_CPU_ALIGNED(struct pcpu_hot, pcpu_hot);
 
 static __always_inline struct task_struct *get_current(void)
 {
-	return this_cpu_read_stable(current_task);
+	return this_cpu_read_stable(pcpu_hot.current_task);
 }
 
 #define current get_current()
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -2000,18 +2000,16 @@ static __init int setup_clearcpuid(char
 }
 __setup("clearcpuid=", setup_clearcpuid);
 
+DEFINE_PER_CPU_ALIGNED(struct pcpu_hot, pcpu_hot) = {
+	.current_task	= &init_task,
+};
+EXPORT_PER_CPU_SYMBOL(pcpu_hot);
+
 #ifdef CONFIG_X86_64
 DEFINE_PER_CPU_FIRST(struct fixed_percpu_data,
 		     fixed_percpu_data) __aligned(PAGE_SIZE) __visible;
 EXPORT_PER_CPU_SYMBOL_GPL(fixed_percpu_data);
 
-/*
- * The following percpu variables are hot.  Align current_task to
- * cacheline size such that they fall in the same cacheline.
- */
-DEFINE_PER_CPU(struct task_struct *, current_task) ____cacheline_aligned =
-	&init_task;
-EXPORT_PER_CPU_SYMBOL(current_task);
 
 DEFINE_PER_CPU(void *, hardirq_stack_ptr);
 DEFINE_PER_CPU(bool, hardirq_stack_inuse);
@@ -2071,8 +2069,6 @@ void syscall_init(void)
 
 #else	/* CONFIG_X86_64 */
 
-DEFINE_PER_CPU(struct task_struct *, current_task) = &init_task;
-EXPORT_PER_CPU_SYMBOL(current_task);
 DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
 EXPORT_PER_CPU_SYMBOL(__preempt_count);
 
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -207,7 +207,7 @@ EXPORT_SYMBOL_GPL(start_thread);
 	if (prev->gs | next->gs)
 		loadsegment(gs, next->gs);
 
-	this_cpu_write(current_task, next_p);
+	raw_cpu_write(pcpu_hot.current_task, next_p);
 
 	switch_fpu_finish();
 
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -616,7 +616,7 @@ void compat_start_thread(struct pt_regs
 	/*
 	 * Switch the PDA and FPU contexts.
 	 */
-	this_cpu_write(current_task, next_p);
+	raw_cpu_write(pcpu_hot.current_task, next_p);
 	this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p));
 
 	switch_fpu_finish();
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1046,7 +1046,7 @@ int common_cpu_up(unsigned int cpu, stru
 	/* Just in case we booted with a single CPU. */
 	alternatives_enable_smp();
 
-	per_cpu(current_task, cpu) = idle;
+	per_cpu(pcpu_hot.current_task, cpu) = idle;
 	cpu_init_stack_canary(cpu, idle);
 
 	/* Initialize the interrupt stack(s) */




* [PATCH v2 23/59] x86/percpu: Move preempt_count next to current_task
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (21 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 22/59] x86: Put hot per CPU variables into a struct Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 24/59] x86/percpu: Move cpu_number " Peter Zijlstra
                   ` (36 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

Add preempt_count to pcpu_hot, since it is one of the most used
per-cpu variables.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/current.h |    1 +
 arch/x86/include/asm/preempt.h |   27 ++++++++++++++-------------
 arch/x86/kernel/cpu/common.c   |    8 +-------
 3 files changed, 16 insertions(+), 20 deletions(-)

--- a/arch/x86/include/asm/current.h
+++ b/arch/x86/include/asm/current.h
@@ -15,6 +15,7 @@ struct pcpu_hot {
 	union {
 		struct {
 			struct task_struct	*current_task;
+			int			preempt_count;
 		};
 		u8	pad[64];
 	};
--- a/arch/x86/include/asm/preempt.h
+++ b/arch/x86/include/asm/preempt.h
@@ -4,11 +4,11 @@
 
 #include <asm/rmwcc.h>
 #include <asm/percpu.h>
+#include <asm/current.h>
+
 #include <linux/thread_info.h>
 #include <linux/static_call_types.h>
 
-DECLARE_PER_CPU(int, __preempt_count);
-
 /* We use the MSB mostly because its available */
 #define PREEMPT_NEED_RESCHED	0x80000000
 
@@ -24,7 +24,7 @@ DECLARE_PER_CPU(int, __preempt_count);
  */
 static __always_inline int preempt_count(void)
 {
-	return raw_cpu_read_4(__preempt_count) & ~PREEMPT_NEED_RESCHED;
+	return raw_cpu_read_4(pcpu_hot.preempt_count) & ~PREEMPT_NEED_RESCHED;
 }
 
 static __always_inline void preempt_count_set(int pc)
@@ -32,10 +32,10 @@ static __always_inline void preempt_coun
 	int old, new;
 
 	do {
-		old = raw_cpu_read_4(__preempt_count);
+		old = raw_cpu_read_4(pcpu_hot.preempt_count);
 		new = (old & PREEMPT_NEED_RESCHED) |
 			(pc & ~PREEMPT_NEED_RESCHED);
-	} while (raw_cpu_cmpxchg_4(__preempt_count, old, new) != old);
+	} while (raw_cpu_cmpxchg_4(pcpu_hot.preempt_count, old, new) != old);
 }
 
 /*
@@ -44,7 +44,7 @@ static __always_inline void preempt_coun
 #define init_task_preempt_count(p) do { } while (0)
 
 #define init_idle_preempt_count(p, cpu) do { \
-	per_cpu(__preempt_count, (cpu)) = PREEMPT_DISABLED; \
+	per_cpu(pcpu_hot.preempt_count, (cpu)) = PREEMPT_DISABLED; \
 } while (0)
 
 /*
@@ -58,17 +58,17 @@ static __always_inline void preempt_coun
 
 static __always_inline void set_preempt_need_resched(void)
 {
-	raw_cpu_and_4(__preempt_count, ~PREEMPT_NEED_RESCHED);
+	raw_cpu_and_4(pcpu_hot.preempt_count, ~PREEMPT_NEED_RESCHED);
 }
 
 static __always_inline void clear_preempt_need_resched(void)
 {
-	raw_cpu_or_4(__preempt_count, PREEMPT_NEED_RESCHED);
+	raw_cpu_or_4(pcpu_hot.preempt_count, PREEMPT_NEED_RESCHED);
 }
 
 static __always_inline bool test_preempt_need_resched(void)
 {
-	return !(raw_cpu_read_4(__preempt_count) & PREEMPT_NEED_RESCHED);
+	return !(raw_cpu_read_4(pcpu_hot.preempt_count) & PREEMPT_NEED_RESCHED);
 }
 
 /*
@@ -77,12 +77,12 @@ static __always_inline bool test_preempt
 
 static __always_inline void __preempt_count_add(int val)
 {
-	raw_cpu_add_4(__preempt_count, val);
+	raw_cpu_add_4(pcpu_hot.preempt_count, val);
 }
 
 static __always_inline void __preempt_count_sub(int val)
 {
-	raw_cpu_add_4(__preempt_count, -val);
+	raw_cpu_add_4(pcpu_hot.preempt_count, -val);
 }
 
 /*
@@ -92,7 +92,8 @@ static __always_inline void __preempt_co
  */
 static __always_inline bool __preempt_count_dec_and_test(void)
 {
-	return GEN_UNARY_RMWcc("decl", __preempt_count, e, __percpu_arg([var]));
+	return GEN_UNARY_RMWcc("decl", pcpu_hot.preempt_count, e,
+			       __percpu_arg([var]));
 }
 
 /*
@@ -100,7 +101,7 @@ static __always_inline bool __preempt_co
  */
 static __always_inline bool should_resched(int preempt_offset)
 {
-	return unlikely(raw_cpu_read_4(__preempt_count) == preempt_offset);
+	return unlikely(raw_cpu_read_4(pcpu_hot.preempt_count) == preempt_offset);
 }
 
 #ifdef CONFIG_PREEMPTION
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -2002,6 +2002,7 @@ static __init int setup_clearcpuid(char
 
 DEFINE_PER_CPU_ALIGNED(struct pcpu_hot, pcpu_hot) = {
 	.current_task	= &init_task,
+	.preempt_count	= INIT_PREEMPT_COUNT,
 };
 EXPORT_PER_CPU_SYMBOL(pcpu_hot);
 
@@ -2010,13 +2011,9 @@ DEFINE_PER_CPU_FIRST(struct fixed_percpu
 		     fixed_percpu_data) __aligned(PAGE_SIZE) __visible;
 EXPORT_PER_CPU_SYMBOL_GPL(fixed_percpu_data);
 
-
 DEFINE_PER_CPU(void *, hardirq_stack_ptr);
 DEFINE_PER_CPU(bool, hardirq_stack_inuse);
 
-DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
-EXPORT_PER_CPU_SYMBOL(__preempt_count);
-
 DEFINE_PER_CPU(unsigned long, cpu_current_top_of_stack) = TOP_OF_INIT_STACK;
 
 static void wrmsrl_cstar(unsigned long val)
@@ -2069,9 +2066,6 @@ void syscall_init(void)
 
 #else	/* CONFIG_X86_64 */
 
-DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
-EXPORT_PER_CPU_SYMBOL(__preempt_count);
-
 /*
  * On x86_32, vm86 modifies tss.sp0, so sp0 isn't a reliable way to find
  * the top of the kernel stack.  Use an extra percpu variable to track the




* [PATCH v2 24/59] x86/percpu: Move cpu_number next to current_task
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (22 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 23/59] x86/percpu: Move preempt_count next to current_task Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 25/59] x86/percpu: Move current_top_of_stack " Peter Zijlstra
                   ` (35 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

Also add cpu_number to the pcpu_hot structure; it is referenced often
and the pcpu_hot cacheline is already hot.
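
For reference, the two accessors feed the generic wrapper roughly as
follows (abridged from include/linux/smp.h):

  #ifdef CONFIG_DEBUG_PREEMPT
    extern unsigned int debug_smp_processor_id(void);
  # define smp_processor_id() debug_smp_processor_id()
  #else
  # define smp_processor_id() __smp_processor_id()
  #endif

so smp_processor_id() keeps its preemption debugging while the raw/__
variants now read pcpu_hot.cpu_number directly.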

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/current.h |    1 +
 arch/x86/include/asm/smp.h     |   12 +++++-------
 arch/x86/kernel/setup_percpu.c |    5 +----
 3 files changed, 7 insertions(+), 11 deletions(-)

--- a/arch/x86/include/asm/current.h
+++ b/arch/x86/include/asm/current.h
@@ -16,6 +16,7 @@ struct pcpu_hot {
 		struct {
 			struct task_struct	*current_task;
 			int			preempt_count;
+			int			cpu_number;
 		};
 		u8	pad[64];
 	};
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -3,10 +3,10 @@
 #define _ASM_X86_SMP_H
 #ifndef __ASSEMBLY__
 #include <linux/cpumask.h>
-#include <asm/percpu.h>
 
-#include <asm/thread_info.h>
 #include <asm/cpumask.h>
+#include <asm/current.h>
+#include <asm/thread_info.h>
 
 extern int smp_num_siblings;
 extern unsigned int num_processors;
@@ -19,7 +19,6 @@ DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_l2c_shared_map);
 DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id);
 DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_l2c_id);
-DECLARE_PER_CPU_READ_MOSTLY(int, cpu_number);
 
 static inline struct cpumask *cpu_llc_shared_mask(int cpu)
 {
@@ -160,11 +159,10 @@ asmlinkage __visible void smp_reboot_int
 
 /*
  * This function is needed by all SMP systems. It must _always_ be valid
- * from the initial startup. We map APIC_BASE very early in page_setup(),
- * so this is correct in the x86 case.
+ * from the initial startup.
  */
-#define raw_smp_processor_id()  this_cpu_read(cpu_number)
-#define __smp_processor_id() __this_cpu_read(cpu_number)
+#define raw_smp_processor_id()  this_cpu_read(pcpu_hot.cpu_number)
+#define __smp_processor_id() __this_cpu_read(pcpu_hot.cpu_number)
 
 #ifdef CONFIG_X86_32
 extern int safe_smp_processor_id(void);
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -23,9 +23,6 @@
 #include <asm/cpu.h>
 #include <asm/stackprotector.h>
 
-DEFINE_PER_CPU_READ_MOSTLY(int, cpu_number);
-EXPORT_PER_CPU_SYMBOL(cpu_number);
-
 #ifdef CONFIG_X86_64
 #define BOOT_PERCPU_OFFSET ((unsigned long)__per_cpu_load)
 #else
@@ -172,7 +169,7 @@ void __init setup_per_cpu_areas(void)
 	for_each_possible_cpu(cpu) {
 		per_cpu_offset(cpu) = delta + pcpu_unit_offsets[cpu];
 		per_cpu(this_cpu_off, cpu) = per_cpu_offset(cpu);
-		per_cpu(cpu_number, cpu) = cpu;
+		per_cpu(pcpu_hot.cpu_number, cpu) = cpu;
 		setup_percpu_segment(cpu);
 		/*
 		 * Copy data used in early init routines from the




* [PATCH v2 25/59] x86/percpu: Move current_top_of_stack next to current_task
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (23 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 24/59] x86/percpu: Move cpu_number " Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 26/59] x86/percpu: Move irq_stack variables " Peter Zijlstra
                   ` (34 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

Extend the struct pcpu_hot cacheline with current_top_of_stack,
another very frequently used value.
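
Because entry assembly cannot use C member syntax, the member offset is
exported through asm-offsets; roughly (the numeric value is just
offsetof(struct pcpu_hot, top_of_stack), 16 with the fields added so far):

  /* arch/x86/kernel/asm-offsets.c, as in this patch */
  OFFSET(X86_top_of_stack, pcpu_hot, top_of_stack);

  /* ends up in the generated asm-offsets.h as, roughly:   */
  /*   #define X86_top_of_stack 16                         */
  /* which lets the entry code do:                         */
  /*   movq PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %rsp */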

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/entry/entry_32.S        |    4 ++--
 arch/x86/entry/entry_64.S        |    6 +++---
 arch/x86/entry/entry_64_compat.S |    6 +++---
 arch/x86/include/asm/current.h   |    1 +
 arch/x86/include/asm/processor.h |    4 +---
 arch/x86/kernel/asm-offsets.c    |    2 ++
 arch/x86/kernel/cpu/common.c     |   12 +-----------
 arch/x86/kernel/process_32.c     |    4 ++--
 arch/x86/kernel/process_64.c     |    2 +-
 arch/x86/kernel/smpboot.c        |    2 +-
 arch/x86/kernel/traps.c          |    4 ++--
 11 files changed, 19 insertions(+), 28 deletions(-)

--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1181,7 +1181,7 @@ SYM_CODE_START(asm_exc_nmi)
 	 * is using the thread stack right now, so it's safe for us to use it.
 	 */
 	movl	%esp, %ebx
-	movl	PER_CPU_VAR(cpu_current_top_of_stack), %esp
+	movl	PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %esp
 	call	exc_nmi
 	movl	%ebx, %esp
 
@@ -1243,7 +1243,7 @@ SYM_CODE_START(rewind_stack_and_make_dea
 	/* Prevent any naive code from trying to unwind to our caller. */
 	xorl	%ebp, %ebp
 
-	movl	PER_CPU_VAR(cpu_current_top_of_stack), %esi
+	movl	PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %esi
 	leal	-TOP_OF_KERNEL_STACK_PADDING-PTREGS_SIZE(%esi), %esp
 
 	call	make_task_dead
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -92,7 +92,7 @@ SYM_CODE_START(entry_SYSCALL_64)
 	/* tss.sp2 is scratch space. */
 	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
 	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
-	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
+	movq	PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %rsp
 
 SYM_INNER_LABEL(entry_SYSCALL_64_safe_stack, SYM_L_GLOBAL)
 	ANNOTATE_NOENDBR
@@ -1209,7 +1209,7 @@ SYM_CODE_START(asm_exc_nmi)
 	FENCE_SWAPGS_USER_ENTRY
 	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx
 	movq	%rsp, %rdx
-	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
+	movq	PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %rsp
 	UNWIND_HINT_IRET_REGS base=%rdx offset=8
 	pushq	5*8(%rdx)	/* pt_regs->ss */
 	pushq	4*8(%rdx)	/* pt_regs->rsp */
@@ -1525,7 +1525,7 @@ SYM_CODE_START_NOALIGN(rewind_stack_and_
 	/* Prevent any naive code from trying to unwind to our caller. */
 	xorl	%ebp, %ebp
 
-	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rax
+	movq	PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %rax
 	leaq	-PTREGS_SIZE(%rax), %rsp
 	UNWIND_HINT_REGS
 
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -58,7 +58,7 @@ SYM_CODE_START(entry_SYSENTER_compat)
 	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 	popq	%rax
 
-	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
+	movq	PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %rsp
 
 	/* Construct struct pt_regs on stack */
 	pushq	$__USER32_DS		/* pt_regs->ss */
@@ -191,7 +191,7 @@ SYM_CODE_START(entry_SYSCALL_compat)
 	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
 
 	/* Switch to the kernel stack */
-	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
+	movq	PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %rsp
 
 SYM_INNER_LABEL(entry_SYSCALL_compat_safe_stack, SYM_L_GLOBAL)
 	ANNOTATE_NOENDBR
@@ -332,7 +332,7 @@ SYM_CODE_START(entry_INT80_compat)
 	ALTERNATIVE "", "jmp .Lint80_keep_stack", X86_FEATURE_XENPV
 
 	movq	%rsp, %rax
-	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
+	movq	PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %rsp
 
 	pushq	5*8(%rax)		/* regs->ss */
 	pushq	4*8(%rax)		/* regs->rsp */
--- a/arch/x86/include/asm/current.h
+++ b/arch/x86/include/asm/current.h
@@ -17,6 +17,7 @@ struct pcpu_hot {
 			struct task_struct	*current_task;
 			int			preempt_count;
 			int			cpu_number;
+			unsigned long		top_of_stack;
 		};
 		u8	pad[64];
 	};
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -426,8 +426,6 @@ struct irq_stack {
 	char		stack[IRQ_STACK_SIZE];
 } __aligned(IRQ_STACK_SIZE);
 
-DECLARE_PER_CPU(unsigned long, cpu_current_top_of_stack);
-
 #ifdef CONFIG_X86_64
 struct fixed_percpu_data {
 	/*
@@ -566,7 +564,7 @@ static __always_inline unsigned long cur
 	 *  and around vm86 mode and sp0 on x86_64 is special because of the
 	 *  entry trampoline.
 	 */
-	return this_cpu_read_stable(cpu_current_top_of_stack);
+	return this_cpu_read_stable(pcpu_hot.top_of_stack);
 }
 
 static __always_inline bool on_thread_stack(void)
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -109,6 +109,8 @@ static void __used common(void)
 	OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
 	OFFSET(TSS_sp2, tss_struct, x86_tss.sp2);
 
+	OFFSET(X86_top_of_stack, pcpu_hot, top_of_stack);
+
 	if (IS_ENABLED(CONFIG_KVM_INTEL)) {
 		BLANK();
 		OFFSET(VMX_spec_ctrl, vcpu_vmx, spec_ctrl);
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -2003,6 +2003,7 @@ static __init int setup_clearcpuid(char
 DEFINE_PER_CPU_ALIGNED(struct pcpu_hot, pcpu_hot) = {
 	.current_task	= &init_task,
 	.preempt_count	= INIT_PREEMPT_COUNT,
+	.top_of_stack	= TOP_OF_INIT_STACK,
 };
 EXPORT_PER_CPU_SYMBOL(pcpu_hot);
 
@@ -2014,8 +2015,6 @@ EXPORT_PER_CPU_SYMBOL_GPL(fixed_percpu_d
 DEFINE_PER_CPU(void *, hardirq_stack_ptr);
 DEFINE_PER_CPU(bool, hardirq_stack_inuse);
 
-DEFINE_PER_CPU(unsigned long, cpu_current_top_of_stack) = TOP_OF_INIT_STACK;
-
 static void wrmsrl_cstar(unsigned long val)
 {
 	/*
@@ -2066,15 +2065,6 @@ void syscall_init(void)
 
 #else	/* CONFIG_X86_64 */
 
-/*
- * On x86_32, vm86 modifies tss.sp0, so sp0 isn't a reliable way to find
- * the top of the kernel stack.  Use an extra percpu variable to track the
- * top of the kernel stack directly.
- */
-DEFINE_PER_CPU(unsigned long, cpu_current_top_of_stack) =
-	(unsigned long)&init_thread_union + THREAD_SIZE;
-EXPORT_PER_CPU_SYMBOL(cpu_current_top_of_stack);
-
 #ifdef CONFIG_STACKPROTECTOR
 DEFINE_PER_CPU(unsigned long, __stack_chk_guard);
 EXPORT_PER_CPU_SYMBOL(__stack_chk_guard);
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -191,13 +191,13 @@ EXPORT_SYMBOL_GPL(start_thread);
 	arch_end_context_switch(next_p);
 
 	/*
-	 * Reload esp0 and cpu_current_top_of_stack.  This changes
+	 * Reload esp0 and pcpu_hot.top_of_stack.  This changes
 	 * current_thread_info().  Refresh the SYSENTER configuration in
 	 * case prev or next is vm86.
 	 */
 	update_task_stack(next_p);
 	refresh_sysenter_cs(next);
-	this_cpu_write(cpu_current_top_of_stack,
+	this_cpu_write(pcpu_hot.top_of_stack,
 		       (unsigned long)task_stack_page(next_p) +
 		       THREAD_SIZE);
 
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -617,7 +617,7 @@ void compat_start_thread(struct pt_regs
 	 * Switch the PDA and FPU contexts.
 	 */
 	raw_cpu_write(pcpu_hot.current_task, next_p);
-	this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p));
+	raw_cpu_write(pcpu_hot.top_of_stack, task_top_of_stack(next_p));
 
 	switch_fpu_finish();
 
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1056,7 +1056,7 @@ int common_cpu_up(unsigned int cpu, stru
 
 #ifdef CONFIG_X86_32
 	/* Stack for startup_32 can be just as for start_secondary onwards */
-	per_cpu(cpu_current_top_of_stack, cpu) = task_top_of_stack(idle);
+	per_cpu(pcpu_hot.top_of_stack, cpu) = task_top_of_stack(idle);
 #else
 	initial_gs = per_cpu_offset(cpu);
 #endif
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -849,7 +849,7 @@ DEFINE_IDTENTRY_RAW(exc_int3)
  */
 asmlinkage __visible noinstr struct pt_regs *sync_regs(struct pt_regs *eregs)
 {
-	struct pt_regs *regs = (struct pt_regs *)this_cpu_read(cpu_current_top_of_stack) - 1;
+	struct pt_regs *regs = (struct pt_regs *)this_cpu_read(pcpu_hot.top_of_stack) - 1;
 	if (regs != eregs)
 		*regs = *eregs;
 	return regs;
@@ -867,7 +867,7 @@ asmlinkage __visible noinstr struct pt_r
 	 * trust it and switch to the current kernel stack
 	 */
 	if (ip_within_syscall_gap(regs)) {
-		sp = this_cpu_read(cpu_current_top_of_stack);
+		sp = this_cpu_read(pcpu_hot.top_of_stack);
 		goto sync;
 	}
 




* [PATCH v2 26/59] x86/percpu: Move irq_stack variables next to current_task
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (24 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 25/59] x86/percpu: Move current_top_of_stack " Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 27/59] x86/softirq: Move softirq pending next to current task Peter Zijlstra
                   ` (33 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

Further extend struct pcpu_hot with the hard and soft irq stack
pointers.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/current.h   |    6 ++++++
 arch/x86/include/asm/irq_stack.h |   12 ++++++------
 arch/x86/include/asm/processor.h |    4 ----
 arch/x86/kernel/cpu/common.c     |    3 ---
 arch/x86/kernel/dumpstack_32.c   |    4 ++--
 arch/x86/kernel/dumpstack_64.c   |    2 +-
 arch/x86/kernel/irq_32.c         |   13 +++++--------
 arch/x86/kernel/irq_64.c         |    6 +++---
 arch/x86/kernel/process_64.c     |    2 +-
 9 files changed, 24 insertions(+), 28 deletions(-)

--- a/arch/x86/include/asm/current.h
+++ b/arch/x86/include/asm/current.h
@@ -18,6 +18,12 @@ struct pcpu_hot {
 			int			preempt_count;
 			int			cpu_number;
 			unsigned long		top_of_stack;
+			void			*hardirq_stack_ptr;
+#ifdef CONFIG_X86_64
+			bool			hardirq_stack_inuse;
+#else
+			void			*softirq_stack_ptr;
+#endif
 		};
 		u8	pad[64];
 	};
--- a/arch/x86/include/asm/irq_stack.h
+++ b/arch/x86/include/asm/irq_stack.h
@@ -116,7 +116,7 @@
 	ASM_CALL_ARG2
 
 #define call_on_irqstack(func, asm_call, argconstr...)			\
-	call_on_stack(__this_cpu_read(hardirq_stack_ptr),		\
+	call_on_stack(__this_cpu_read(pcpu_hot.hardirq_stack_ptr),	\
 		      func, asm_call, argconstr)
 
 /* Macros to assert type correctness for run_*_on_irqstack macros */
@@ -135,7 +135,7 @@
 	 * User mode entry and interrupt on the irq stack do not	\
 	 * switch stacks. If from user mode the task stack is empty.	\
 	 */								\
-	if (user_mode(regs) || __this_cpu_read(hardirq_stack_inuse)) {	\
+	if (user_mode(regs) || __this_cpu_read(pcpu_hot.hardirq_stack_inuse)) { \
 		irq_enter_rcu();					\
 		func(c_args);						\
 		irq_exit_rcu();						\
@@ -146,9 +146,9 @@
 		 * places. Invoke the stack switch macro with the call	\
 		 * sequence which matches the above direct invocation.	\
 		 */							\
-		__this_cpu_write(hardirq_stack_inuse, true);		\
+		__this_cpu_write(pcpu_hot.hardirq_stack_inuse, true);	\
 		call_on_irqstack(func, asm_call, constr);		\
-		__this_cpu_write(hardirq_stack_inuse, false);		\
+		__this_cpu_write(pcpu_hot.hardirq_stack_inuse, false);	\
 	}								\
 }
 
@@ -212,9 +212,9 @@
  */
 #define do_softirq_own_stack()						\
 {									\
-	__this_cpu_write(hardirq_stack_inuse, true);			\
+	__this_cpu_write(pcpu_hot.hardirq_stack_inuse, true);		\
 	call_on_irqstack(__do_softirq, ASM_CALL_ARG0);			\
-	__this_cpu_write(hardirq_stack_inuse, false);			\
+	__this_cpu_write(pcpu_hot.hardirq_stack_inuse, false);		\
 }
 
 #endif
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -448,8 +448,6 @@ static inline unsigned long cpu_kernelmo
 	return (unsigned long)per_cpu(fixed_percpu_data.gs_base, cpu);
 }
 
-DECLARE_PER_CPU(void *, hardirq_stack_ptr);
-DECLARE_PER_CPU(bool, hardirq_stack_inuse);
 extern asmlinkage void ignore_sysret(void);
 
 /* Save actual FS/GS selectors and bases to current->thread */
@@ -458,8 +456,6 @@ void current_save_fsgs(void);
 #ifdef CONFIG_STACKPROTECTOR
 DECLARE_PER_CPU(unsigned long, __stack_chk_guard);
 #endif
-DECLARE_PER_CPU(struct irq_stack *, hardirq_stack_ptr);
-DECLARE_PER_CPU(struct irq_stack *, softirq_stack_ptr);
 #endif	/* !X86_64 */
 
 struct perf_event;
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -2012,9 +2012,6 @@ DEFINE_PER_CPU_FIRST(struct fixed_percpu
 		     fixed_percpu_data) __aligned(PAGE_SIZE) __visible;
 EXPORT_PER_CPU_SYMBOL_GPL(fixed_percpu_data);
 
-DEFINE_PER_CPU(void *, hardirq_stack_ptr);
-DEFINE_PER_CPU(bool, hardirq_stack_inuse);
-
 static void wrmsrl_cstar(unsigned long val)
 {
 	/*
--- a/arch/x86/kernel/dumpstack_32.c
+++ b/arch/x86/kernel/dumpstack_32.c
@@ -37,7 +37,7 @@ const char *stack_type_name(enum stack_t
 
 static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info)
 {
-	unsigned long *begin = (unsigned long *)this_cpu_read(hardirq_stack_ptr);
+	unsigned long *begin = (unsigned long *)this_cpu_read(pcpu_hot.hardirq_stack_ptr);
 	unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
 
 	/*
@@ -62,7 +62,7 @@ static bool in_hardirq_stack(unsigned lo
 
 static bool in_softirq_stack(unsigned long *stack, struct stack_info *info)
 {
-	unsigned long *begin = (unsigned long *)this_cpu_read(softirq_stack_ptr);
+	unsigned long *begin = (unsigned long *)this_cpu_read(pcpu_hot.softirq_stack_ptr);
 	unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
 
 	/*
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -134,7 +134,7 @@ static __always_inline bool in_exception
 
 static __always_inline bool in_irq_stack(unsigned long *stack, struct stack_info *info)
 {
-	unsigned long *end = (unsigned long *)this_cpu_read(hardirq_stack_ptr);
+	unsigned long *end = (unsigned long *)this_cpu_read(pcpu_hot.hardirq_stack_ptr);
 	unsigned long *begin;
 
 	/*
--- a/arch/x86/kernel/irq_32.c
+++ b/arch/x86/kernel/irq_32.c
@@ -52,9 +52,6 @@ static inline int check_stack_overflow(v
 static inline void print_stack_overflow(void) { }
 #endif
 
-DEFINE_PER_CPU(struct irq_stack *, hardirq_stack_ptr);
-DEFINE_PER_CPU(struct irq_stack *, softirq_stack_ptr);
-
 static void call_on_stack(void *func, void *stack)
 {
 	asm volatile("xchgl	%%ebx,%%esp	\n"
@@ -77,7 +74,7 @@ static inline int execute_on_irq_stack(i
 	u32 *isp, *prev_esp, arg1;
 
 	curstk = (struct irq_stack *) current_stack();
-	irqstk = __this_cpu_read(hardirq_stack_ptr);
+	irqstk = __this_cpu_read(pcpu_hot.hardirq_stack_ptr);
 
 	/*
 	 * this is where we switch to the IRQ stack. However, if we are
@@ -115,7 +112,7 @@ int irq_init_percpu_irqstack(unsigned in
 	int node = cpu_to_node(cpu);
 	struct page *ph, *ps;
 
-	if (per_cpu(hardirq_stack_ptr, cpu))
+	if (per_cpu(pcpu_hot.hardirq_stack_ptr, cpu))
 		return 0;
 
 	ph = alloc_pages_node(node, THREADINFO_GFP, THREAD_SIZE_ORDER);
@@ -127,8 +124,8 @@ int irq_init_percpu_irqstack(unsigned in
 		return -ENOMEM;
 	}
 
-	per_cpu(hardirq_stack_ptr, cpu) = page_address(ph);
-	per_cpu(softirq_stack_ptr, cpu) = page_address(ps);
+	per_cpu(pcpu_hot.hardirq_stack_ptr, cpu) = page_address(ph);
+	per_cpu(pcpu_hot.softirq_stack_ptr, cpu) = page_address(ps);
 	return 0;
 }
 
@@ -138,7 +135,7 @@ void do_softirq_own_stack(void)
 	struct irq_stack *irqstk;
 	u32 *isp, *prev_esp;
 
-	irqstk = __this_cpu_read(softirq_stack_ptr);
+	irqstk = __this_cpu_read(pcpu_hot.softirq_stack_ptr);
 
 	/* build the stack frame on the softirq stack */
 	isp = (u32 *) ((char *)irqstk + sizeof(*irqstk));
--- a/arch/x86/kernel/irq_64.c
+++ b/arch/x86/kernel/irq_64.c
@@ -50,7 +50,7 @@ static int map_irq_stack(unsigned int cp
 		return -ENOMEM;
 
 	/* Store actual TOS to avoid adjustment in the hotpath */
-	per_cpu(hardirq_stack_ptr, cpu) = va + IRQ_STACK_SIZE - 8;
+	per_cpu(pcpu_hot.hardirq_stack_ptr, cpu) = va + IRQ_STACK_SIZE - 8;
 	return 0;
 }
 #else
@@ -63,14 +63,14 @@ static int map_irq_stack(unsigned int cp
 	void *va = per_cpu_ptr(&irq_stack_backing_store, cpu);
 
 	/* Store actual TOS to avoid adjustment in the hotpath */
-	per_cpu(hardirq_stack_ptr, cpu) = va + IRQ_STACK_SIZE - 8;
+	per_cpu(pcpu_hot.hardirq_stack_ptr, cpu) = va + IRQ_STACK_SIZE - 8;
 	return 0;
 }
 #endif
 
 int irq_init_percpu_irqstack(unsigned int cpu)
 {
-	if (per_cpu(hardirq_stack_ptr, cpu))
+	if (per_cpu(pcpu_hot.hardirq_stack_ptr, cpu))
 		return 0;
 	return map_irq_stack(cpu);
 }
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -562,7 +562,7 @@ void compat_start_thread(struct pt_regs
 	int cpu = smp_processor_id();
 
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_DEBUG_ENTRY) &&
-		     this_cpu_read(hardirq_stack_inuse));
+		     this_cpu_read(pcpu_hot.hardirq_stack_inuse));
 
 	if (!test_thread_flag(TIF_NEED_FPU_LOAD))
 		switch_fpu_prepare(prev_fpu, cpu);




* [PATCH v2 27/59] x86/softirq: Move softirq pending next to current task
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (25 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 26/59] x86/percpu: Move irq_stack variables " Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 28/59] objtool: Allow !PC relative relocations Peter Zijlstra
                   ` (32 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

The softirq pending mask is another hot variable which is strictly
per-CPU and benefits from being in the same cache line.
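
The override is picked up by the generic accessors, roughly (abridged
from include/linux/interrupt.h):

  #ifndef local_softirq_pending_ref
  #define local_softirq_pending_ref	irq_stat.__softirq_pending
  #endif

  #define local_softirq_pending()	(__this_cpu_read(local_softirq_pending_ref))
  #define set_softirq_pending(x)	(__this_cpu_write(local_softirq_pending_ref, (x)))
  #define or_softirq_pending(x)	(__this_cpu_or(local_softirq_pending_ref, (x)))

so defining local_softirq_pending_ref to pcpu_hot.softirq_pending is all
that is needed to relocate the pending bits.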

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/current.h |    1 +
 arch/x86/include/asm/hardirq.h |    3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

--- a/arch/x86/include/asm/current.h
+++ b/arch/x86/include/asm/current.h
@@ -19,6 +19,7 @@ struct pcpu_hot {
 			int			cpu_number;
 			unsigned long		top_of_stack;
 			void			*hardirq_stack_ptr;
+			u16			softirq_pending;
 #ifdef CONFIG_X86_64
 			bool			hardirq_stack_inuse;
 #else
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -3,9 +3,9 @@
 #define _ASM_X86_HARDIRQ_H
 
 #include <linux/threads.h>
+#include <asm/current.h>
 
 typedef struct {
-	u16	     __softirq_pending;
 #if IS_ENABLED(CONFIG_KVM_INTEL)
 	u8	     kvm_cpu_l1tf_flush_l1d;
 #endif
@@ -60,6 +60,7 @@ extern u64 arch_irq_stat_cpu(unsigned in
 extern u64 arch_irq_stat(void);
 #define arch_irq_stat		arch_irq_stat
 
+#define local_softirq_pending_ref       pcpu_hot.softirq_pending
 
 #if IS_ENABLED(CONFIG_KVM_INTEL)
 static inline void kvm_set_cpu_l1tf_flush_l1d(void)




* [PATCH v2 28/59] objtool: Allow !PC relative relocations
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (26 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 27/59] x86/softirq: Move softirq pending next to current task Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 29/59] objtool: Track init section Peter Zijlstra
                   ` (31 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Peter Zijlstra <peterz@infradead.org>

Objtool doesn't currently much like per-cpu usage in alternatives:

arch/x86/entry/entry_64.o: warning: objtool: .altinstr_replacement+0xf: unsupported relocation in alternatives section
  f:   65 c7 04 25 00 00 00 00 00 00 00 80     movl   $0x80000000,%gs:0x0      13: R_X86_64_32S        __x86_call_depth

Since the R_X86_64_32S relocation is location invariant (its
computation doesn't include P, the address of the location itself),
it can be trivially allowed.
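
For reference (summarizing the x86-64 psABI, where S is the symbol
value, A the addend and P the place of the relocation):

  R_X86_64_PC32:  value = S + A - P	/* depends on P; unsafe to copy around */
  R_X86_64_32S:   value = S + A		/* location invariant; fine in alternatives */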

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 tools/objtool/arch/x86/decode.c       |   42 +++++++++++++++++++++++++++++-----
 tools/objtool/check.c                 |    6 +---
 tools/objtool/include/objtool/arch.h  |    6 ++--
 tools/objtool/include/objtool/check.h |   17 +++++++------
 4 files changed, 51 insertions(+), 20 deletions(-)

--- a/tools/objtool/arch/x86/decode.c
+++ b/tools/objtool/arch/x86/decode.c
@@ -73,6 +73,30 @@ unsigned long arch_jump_destination(stru
 	return insn->offset + insn->len + insn->immediate;
 }
 
+bool arch_pc_relative_reloc(struct reloc *reloc)
+{
+	/*
+	 * All relocation types where P (the address of the target)
+	 * is included in the computation.
+	 */
+	switch (reloc->type) {
+	case R_X86_64_PC8:
+	case R_X86_64_PC16:
+	case R_X86_64_PC32:
+	case R_X86_64_PC64:
+
+	case R_X86_64_PLT32:
+	case R_X86_64_GOTPC32:
+	case R_X86_64_GOTPCREL:
+		return true;
+
+	default:
+		break;
+	}
+
+	return false;
+}
+
 #define ADD_OP(op) \
 	if (!(op = calloc(1, sizeof(*op)))) \
 		return -1; \
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -1622,7 +1620,7 @@ static int handle_group_alt(struct objto
 		 * accordingly.
 		 */
 		alt_reloc = insn_reloc(file, insn);
-		if (alt_reloc &&
+		if (alt_reloc && arch_pc_relative_reloc(alt_reloc) &&
 		    !arch_support_alt_relocation(special_alt, insn, alt_reloc)) {
 
 			WARN_FUNC("unsupported relocation in alternatives section",
--- a/tools/objtool/include/objtool/arch.h
+++ b/tools/objtool/include/objtool/arch.h
@@ -93,4 +91,6 @@ bool arch_is_rethunk(struct symbol *sym)
 
 int arch_rewrite_retpolines(struct objtool_file *file);
 
+bool arch_pc_relative_reloc(struct reloc *reloc);
+
 #endif /* _ARCH_H */




* [PATCH v2 29/59] objtool: Track init section
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (27 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 28/59] objtool: Allow !PC relative relocations Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 30/59] objtool: Add .call_sites section Peter Zijlstra
                   ` (30 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Peter Zijlstra <peterz@infradead.org>

For future use of .init.text exclusion, track the init section in the
instruction decoder and use the result in retpoline validation.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 tools/objtool/check.c               |   17 ++++++++++-------
 tools/objtool/include/objtool/elf.h |    2 +-
 2 files changed, 11 insertions(+), 8 deletions(-)

--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -380,6 +380,15 @@ static int decode_instructions(struct ob
 		    !strncmp(sec->name, ".text.__x86.", 12))
 			sec->noinstr = true;
 
+		/*
+		 * .init.text code is ran before userspace and thus doesn't
+		 * strictly need retpolines, except for modules which are
+		 * loaded late, they very much do need retpoline in their
+		 * .init.text
+		 */
+		if (!strcmp(sec->name, ".init.text") && !opts.module)
+			sec->init = true;
+
 		for (offset = 0; offset < sec->sh.sh_size; offset += insn->len) {
 			insn = malloc(sizeof(*insn));
 			if (!insn) {
@@ -3720,13 +3729,7 @@ static int validate_retpoline(struct obj
 		if (insn->retpoline_safe)
 			continue;
 
-		/*
-		 * .init.text code is ran before userspace and thus doesn't
-		 * strictly need retpolines, except for modules which are
-		 * loaded late, they very much do need retpoline in their
-		 * .init.text
-		 */
-		if (!strcmp(insn->sec->name, ".init.text") && !opts.module)
+		if (insn->sec->init)
 			continue;
 
 		if (insn->type == INSN_RETURN) {
--- a/tools/objtool/include/objtool/elf.h
+++ b/tools/objtool/include/objtool/elf.h
@@ -38,7 +38,7 @@ struct section {
 	Elf_Data *data;
 	char *name;
 	int idx;
-	bool changed, text, rodata, noinstr;
+	bool changed, text, rodata, noinstr, init;
 };
 
 struct symbol {




* [PATCH v2 30/59] objtool: Add .call_sites section
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (28 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 29/59] objtool: Track init section Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 31/59] objtool: Add --hacks=skylake Peter Zijlstra
                   ` (29 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Peter Zijlstra <peterz@infradead.org>

In preparation for call depth tracking, provide a section which collects
all direct call sites.
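
Each entry is a 32-bit PC-relative offset to a direct call instruction,
bracketed by the __call_sites/__call_sites_end symbols from the
vmlinux.lds.S hunk below. A later consumer would walk it like the other
*_sites tables; a sketch with a made-up helper name:

  extern s32 __call_sites[], __call_sites_end[];

  static void __init patch_call_sites(void)	/* hypothetical */
  {
	s32 *s;

	for (s = __call_sites; s < __call_sites_end; s++) {
		void *call = (void *)s + *s;	/* decode the PC32 entry */

		/* rewrite the call at 'call' to go via the accounting thunk */
	}
  }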

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/vmlinux.lds.S           |    7 ++++
 tools/objtool/check.c                   |   51 ++++++++++++++++++++++++++++++++
 tools/objtool/include/objtool/objtool.h |    1 
 tools/objtool/objtool.c                 |    1 
 4 files changed, 60 insertions(+)

--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -291,6 +291,13 @@ SECTIONS
 		*(.return_sites)
 		__return_sites_end = .;
 	}
+
+	. = ALIGN(8);
+	.call_sites : AT(ADDR(.call_sites) - LOAD_OFFSET) {
+		__call_sites = .;
+		*(.call_sites)
+		__call_sites_end = .;
+	}
 #endif
 
 #ifdef CONFIG_X86_KERNEL_IBT
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -898,6 +898,49 @@ static int create_mcount_loc_sections(st
 	return 0;
 }
 
+static int create_direct_call_sections(struct objtool_file *file)
+{
+	struct instruction *insn;
+	struct section *sec;
+	unsigned int *loc;
+	int idx;
+
+	sec = find_section_by_name(file->elf, ".call_sites");
+	if (sec) {
+		INIT_LIST_HEAD(&file->call_list);
+		WARN("file already has .call_sites section, skipping");
+		return 0;
+	}
+
+	if (list_empty(&file->call_list))
+		return 0;
+
+	idx = 0;
+	list_for_each_entry(insn, &file->call_list, call_node)
+		idx++;
+
+	sec = elf_create_section(file->elf, ".call_sites", 0, sizeof(unsigned int), idx);
+	if (!sec)
+		return -1;
+
+	idx = 0;
+	list_for_each_entry(insn, &file->call_list, call_node) {
+
+		loc = (unsigned int *)sec->data->d_buf + idx;
+		memset(loc, 0, sizeof(unsigned int));
+
+		if (elf_add_reloc_to_insn(file->elf, sec,
+					  idx * sizeof(unsigned int),
+					  R_X86_64_PC32,
+					  insn->sec, insn->offset))
+			return -1;
+
+		idx++;
+	}
+
+	return 0;
+}
+
 /*
  * Warnings shouldn't be reported for ignored functions.
  */
@@ -1252,6 +1295,9 @@ static void annotate_call_site(struct ob
 		return;
 	}
 
+	if (insn->type == INSN_CALL && !insn->sec->init)
+		list_add_tail(&insn->call_node, &file->call_list);
+
 	if (!sibling && dead_end_function(file, sym))
 		insn->dead_end = true;
 }
@@ -4274,6 +4320,11 @@ int check(struct objtool_file *file)
 		if (ret < 0)
 			goto out;
 		warnings += ret;
+
+		ret = create_direct_call_sections(file);
+		if (ret < 0)
+			goto out;
+		warnings += ret;
 	}
 
 	if (opts.mcount) {
--- a/tools/objtool/include/objtool/objtool.h
+++ b/tools/objtool/include/objtool/objtool.h
@@ -28,6 +28,7 @@ struct objtool_file {
 	struct list_head static_call_list;
 	struct list_head mcount_loc_list;
 	struct list_head endbr_list;
+	struct list_head call_list;
 	bool ignore_unreachables, hints, rodata;
 
 	unsigned int nr_endbr;
--- a/tools/objtool/objtool.c
+++ b/tools/objtool/objtool.c
@@ -106,6 +106,7 @@ struct objtool_file *objtool_open_read(c
 	INIT_LIST_HEAD(&file.static_call_list);
 	INIT_LIST_HEAD(&file.mcount_loc_list);
 	INIT_LIST_HEAD(&file.endbr_list);
+	INIT_LIST_HEAD(&file.call_list);
 	file.ignore_unreachables = opts.no_unreachable;
 	file.hints = false;
 




* [PATCH v2 31/59] objtool: Add --hacks=skylake
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (29 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 30/59] objtool: Add .call_sites section Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 32/59] objtool: Allow STT_NOTYPE -> STT_FUNC+0 tail-calls Peter Zijlstra
                   ` (28 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Peter Zijlstra <peterz@infradead.org>

Make the call/func sections selectable via the --hacks option.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 scripts/Makefile.lib                    |    3 ++-
 tools/objtool/builtin-check.c           |    7 ++++++-
 tools/objtool/check.c                   |   10 ++++++----
 tools/objtool/include/objtool/builtin.h |    1 +
 4 files changed, 15 insertions(+), 6 deletions(-)

--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -231,7 +231,8 @@ objtool := $(objtree)/tools/objtool/objt
 
 objtool_args =								\
 	$(if $(CONFIG_HAVE_JUMP_LABEL_HACK), --hacks=jump_label)	\
-	$(if $(CONFIG_HAVE_NOINSTR_HACK), --hacks=noinstr)		\
+	$(if $(CONFIG_HAVE_NOINSTR_HACK), --hacks=noinstr)              \
+	$(if $(CONFIG_CALL_DEPTH_TRACKING), --hacks=skylake)            \
 	$(if $(CONFIG_X86_KERNEL_IBT), --ibt)				\
 	$(if $(CONFIG_FTRACE_MCOUNT_USE_OBJTOOL), --mcount)		\
 	$(if $(CONFIG_UNWINDER_ORC), --orc)				\
--- a/tools/objtool/builtin-check.c
+++ b/tools/objtool/builtin-check.c
@@ -57,12 +57,17 @@ static int parse_hacks(const struct opti
 		found = true;
 	}
 
+	if (!str || strstr(str, "skylake")) {
+		opts.hack_skylake = true;
+		found = true;
+	}
+
 	return found ? 0 : -1;
 }
 
 const struct option check_options[] = {
 	OPT_GROUP("Actions:"),
-	OPT_CALLBACK_OPTARG('h', "hacks", NULL, NULL, "jump_label,noinstr", "patch toolchain bugs/limitations", parse_hacks),
+	OPT_CALLBACK_OPTARG('h', "hacks", NULL, NULL, "jump_label,noinstr,skylake", "patch toolchain bugs/limitations", parse_hacks),
 	OPT_BOOLEAN('i', "ibt", &opts.ibt, "validate and annotate IBT"),
 	OPT_BOOLEAN('m', "mcount", &opts.mcount, "annotate mcount/fentry calls for ftrace"),
 	OPT_BOOLEAN('n', "noinstr", &opts.noinstr, "validate noinstr rules"),
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -4321,10 +4321,12 @@ int check(struct objtool_file *file)
 			goto out;
 		warnings += ret;
 
-		ret = create_direct_call_sections(file);
-		if (ret < 0)
-			goto out;
-		warnings += ret;
+		if (opts.hack_skylake) {
+			ret = create_direct_call_sections(file);
+			if (ret < 0)
+				goto out;
+			warnings += ret;
+		}
 	}
 
 	if (opts.mcount) {
--- a/tools/objtool/include/objtool/builtin.h
+++ b/tools/objtool/include/objtool/builtin.h
@@ -14,6 +14,7 @@ struct opts {
 	bool dump_orc;
 	bool hack_jump_label;
 	bool hack_noinstr;
+	bool hack_skylake;
 	bool ibt;
 	bool mcount;
 	bool noinstr;




* [PATCH v2 32/59] objtool: Allow STT_NOTYPE -> STT_FUNC+0 tail-calls
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (30 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 31/59] objtool: Add --hacks=skylake Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 33/59] objtool: Fix find_{symbol,func}_containing() Peter Zijlstra
                   ` (27 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Peter Zijlstra <peterz@infradead.org>

Allow STT_NOTYPE symbols to tail-call into STT_FUNC+0; by definition
an STT_NOTYPE symbol is not a sub-function of the STT_FUNC it jumps
to.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 tools/objtool/check.c |   29 ++++++++++++++++++++---------
 1 file changed, 20 insertions(+), 9 deletions(-)

--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -1370,6 +1370,16 @@ static void add_return_call(struct objto
 
 static bool same_function(struct instruction *insn1, struct instruction *insn2)
 {
+	if (!insn1->func && !insn2->func)
+		return true;
+
+	/* Allow STT_NOTYPE -> STT_FUNC+0 tail-calls */
+	if (!insn1->func && insn1->func != insn2->func)
+		return false;
+
+	if (!insn2->func)
+		return true;
+
 	return insn1->func->pfunc == insn2->func->pfunc;
 }
 
@@ -1487,18 +1497,19 @@ static int add_jump_destinations(struct
 			    strstr(jump_dest->func->name, ".cold")) {
 				insn->func->cfunc = jump_dest->func;
 				jump_dest->func->pfunc = insn->func;
-
-			} else if (!same_function(insn, jump_dest) &&
-				   is_first_func_insn(file, jump_dest)) {
-				/*
-				 * Internal sibling call without reloc or with
-				 * STT_SECTION reloc.
-				 */
-				add_call_dest(file, insn, jump_dest->func, true);
-				continue;
 			}
 		}
 
+		if (!same_function(insn, jump_dest) &&
+		    is_first_func_insn(file, jump_dest)) {
+			/*
+			 * Internal sibling call without reloc or with
+			 * STT_SECTION reloc.
+			 */
+			add_call_dest(file, insn, jump_dest->func, true);
+			continue;
+		}
+
 		insn->jump_dest = jump_dest;
 	}
 




* [PATCH v2 33/59] objtool: Fix find_{symbol,func}_containing()
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (31 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 32/59] objtool: Allow STT_NOTYPE -> STT_FUNC+0 tail-calls Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 34/59] objtool: Allow symbol range comparisons for IBT/ENDBR Peter Zijlstra
                   ` (26 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Peter Zijlstra <peterz@infradead.org>

The current find_{symbol,func}_containing() functions are broken in
the face of overlapping symbols, which is exactly the case needed for
the new ibt/endbr suppression.

Import interval_tree_generic.h into the tools tree and convert the
symbol tree to an interval tree to support proper range stabs
(point-in-range lookups).
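
To see the problem outside of objtool: with nested or overlapping
symbols, more than one symbol can contain a given offset, so a lookup
that can only ever return a single node is not enough. The standalone
toy below (made-up names, not objtool code) shows two symbols both
containing offset 0x130; the interval tree is what lets the real code
walk all of them.

  /* toy_overlap.c - standalone illustration only */
  #include <stdio.h>

  struct sym {
          const char *name;
          unsigned long offset, len;
  };

  /* inner_label is nested inside outer_func; both contain 0x130 */
  static struct sym syms[] = {
          { "outer_func",  0x100, 0x80 },         /* [0x100, 0x180) */
          { "inner_label", 0x120, 0x20 },         /* [0x120, 0x140) */
  };

  int main(void)
  {
          unsigned long offset = 0x130;

          for (unsigned int i = 0; i < sizeof(syms) / sizeof(syms[0]); i++) {
                  if (offset >= syms[i].offset &&
                      offset <  syms[i].offset + syms[i].len)
                          printf("%s contains 0x%lx\n", syms[i].name, offset);
          }
          return 0;
  }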

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 tools/include/linux/interval_tree_generic.h |  187 ++++++++++++++++++++++++++++
 tools/objtool/elf.c                         |   93 +++++--------
 tools/objtool/include/objtool/elf.h         |    3 
 3 files changed, 229 insertions(+), 54 deletions(-)

--- /dev/null
+++ b/tools/include/linux/interval_tree_generic.h
@@ -0,0 +1,187 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+  Interval Trees
+  (C) 2012  Michel Lespinasse <walken@google.com>
+
+
+  include/linux/interval_tree_generic.h
+*/
+
+#include <linux/rbtree_augmented.h>
+
+/*
+ * Template for implementing interval trees
+ *
+ * ITSTRUCT:   struct type of the interval tree nodes
+ * ITRB:       name of struct rb_node field within ITSTRUCT
+ * ITTYPE:     type of the interval endpoints
+ * ITSUBTREE:  name of ITTYPE field within ITSTRUCT holding last-in-subtree
+ * ITSTART(n): start endpoint of ITSTRUCT node n
+ * ITLAST(n):  last endpoint of ITSTRUCT node n
+ * ITSTATIC:   'static' or empty
+ * ITPREFIX:   prefix to use for the inline tree definitions
+ *
+ * Note - before using this, please consider if generic version
+ * (interval_tree.h) would work for you...
+ */
+
+#define INTERVAL_TREE_DEFINE(ITSTRUCT, ITRB, ITTYPE, ITSUBTREE,		      \
+			     ITSTART, ITLAST, ITSTATIC, ITPREFIX)	      \
+									      \
+/* Callbacks for augmented rbtree insert and remove */			      \
+									      \
+RB_DECLARE_CALLBACKS_MAX(static, ITPREFIX ## _augment,			      \
+			 ITSTRUCT, ITRB, ITTYPE, ITSUBTREE, ITLAST)	      \
+									      \
+/* Insert / remove interval nodes from the tree */			      \
+									      \
+ITSTATIC void ITPREFIX ## _insert(ITSTRUCT *node,			      \
+				  struct rb_root_cached *root)	 	      \
+{									      \
+	struct rb_node **link = &root->rb_root.rb_node, *rb_parent = NULL;    \
+	ITTYPE start = ITSTART(node), last = ITLAST(node);		      \
+	ITSTRUCT *parent;						      \
+	bool leftmost = true;						      \
+									      \
+	while (*link) {							      \
+		rb_parent = *link;					      \
+		parent = rb_entry(rb_parent, ITSTRUCT, ITRB);		      \
+		if (parent->ITSUBTREE < last)				      \
+			parent->ITSUBTREE = last;			      \
+		if (start < ITSTART(parent))				      \
+			link = &parent->ITRB.rb_left;			      \
+		else {							      \
+			link = &parent->ITRB.rb_right;			      \
+			leftmost = false;				      \
+		}							      \
+	}								      \
+									      \
+	node->ITSUBTREE = last;						      \
+	rb_link_node(&node->ITRB, rb_parent, link);			      \
+	rb_insert_augmented_cached(&node->ITRB, root,			      \
+				   leftmost, &ITPREFIX ## _augment);	      \
+}									      \
+									      \
+ITSTATIC void ITPREFIX ## _remove(ITSTRUCT *node,			      \
+				  struct rb_root_cached *root)		      \
+{									      \
+	rb_erase_augmented_cached(&node->ITRB, root, &ITPREFIX ## _augment);  \
+}									      \
+									      \
+/*									      \
+ * Iterate over intervals intersecting [start;last]			      \
+ *									      \
+ * Note that a node's interval intersects [start;last] iff:		      \
+ *   Cond1: ITSTART(node) <= last					      \
+ * and									      \
+ *   Cond2: start <= ITLAST(node)					      \
+ */									      \
+									      \
+static ITSTRUCT *							      \
+ITPREFIX ## _subtree_search(ITSTRUCT *node, ITTYPE start, ITTYPE last)	      \
+{									      \
+	while (true) {							      \
+		/*							      \
+		 * Loop invariant: start <= node->ITSUBTREE		      \
+		 * (Cond2 is satisfied by one of the subtree nodes)	      \
+		 */							      \
+		if (node->ITRB.rb_left) {				      \
+			ITSTRUCT *left = rb_entry(node->ITRB.rb_left,	      \
+						  ITSTRUCT, ITRB);	      \
+			if (start <= left->ITSUBTREE) {			      \
+				/*					      \
+				 * Some nodes in left subtree satisfy Cond2.  \
+				 * Iterate to find the leftmost such node N.  \
+				 * If it also satisfies Cond1, that's the     \
+				 * match we are looking for. Otherwise, there \
+				 * is no matching interval as nodes to the    \
+				 * right of N can't satisfy Cond1 either.     \
+				 */					      \
+				node = left;				      \
+				continue;				      \
+			}						      \
+		}							      \
+		if (ITSTART(node) <= last) {		/* Cond1 */	      \
+			if (start <= ITLAST(node))	/* Cond2 */	      \
+				return node;	/* node is leftmost match */  \
+			if (node->ITRB.rb_right) {			      \
+				node = rb_entry(node->ITRB.rb_right,	      \
+						ITSTRUCT, ITRB);	      \
+				if (start <= node->ITSUBTREE)		      \
+					continue;			      \
+			}						      \
+		}							      \
+		return NULL;	/* No match */				      \
+	}								      \
+}									      \
+									      \
+ITSTATIC ITSTRUCT *							      \
+ITPREFIX ## _iter_first(struct rb_root_cached *root,			      \
+			ITTYPE start, ITTYPE last)			      \
+{									      \
+	ITSTRUCT *node, *leftmost;					      \
+									      \
+	if (!root->rb_root.rb_node)					      \
+		return NULL;						      \
+									      \
+	/*								      \
+	 * Fastpath range intersection/overlap between A: [a0, a1] and	      \
+	 * B: [b0, b1] is given by:					      \
+	 *								      \
+	 *         a0 <= b1 && b0 <= a1					      \
+	 *								      \
+	 *  ... where A holds the lock range and B holds the smallest	      \
+	 * 'start' and largest 'last' in the tree. For the later, we	      \
+	 * rely on the root node, which by augmented interval tree	      \
+	 * property, holds the largest value in its last-in-subtree.	      \
+	 * This allows mitigating some of the tree walk overhead for	      \
+	 * for non-intersecting ranges, maintained and consulted in O(1).     \
+	 */								      \
+	node = rb_entry(root->rb_root.rb_node, ITSTRUCT, ITRB);		      \
+	if (node->ITSUBTREE < start)					      \
+		return NULL;						      \
+									      \
+	leftmost = rb_entry(root->rb_leftmost, ITSTRUCT, ITRB);		      \
+	if (ITSTART(leftmost) > last)					      \
+		return NULL;						      \
+									      \
+	return ITPREFIX ## _subtree_search(node, start, last);		      \
+}									      \
+									      \
+ITSTATIC ITSTRUCT *							      \
+ITPREFIX ## _iter_next(ITSTRUCT *node, ITTYPE start, ITTYPE last)	      \
+{									      \
+	struct rb_node *rb = node->ITRB.rb_right, *prev;		      \
+									      \
+	while (true) {							      \
+		/*							      \
+		 * Loop invariants:					      \
+		 *   Cond1: ITSTART(node) <= last			      \
+		 *   rb == node->ITRB.rb_right				      \
+		 *							      \
+		 * First, search right subtree if suitable		      \
+		 */							      \
+		if (rb) {						      \
+			ITSTRUCT *right = rb_entry(rb, ITSTRUCT, ITRB);	      \
+			if (start <= right->ITSUBTREE)			      \
+				return ITPREFIX ## _subtree_search(right,     \
+								start, last); \
+		}							      \
+									      \
+		/* Move up the tree until we come from a node's left child */ \
+		do {							      \
+			rb = rb_parent(&node->ITRB);			      \
+			if (!rb)					      \
+				return NULL;				      \
+			prev = &node->ITRB;				      \
+			node = rb_entry(rb, ITSTRUCT, ITRB);		      \
+			rb = node->ITRB.rb_right;			      \
+		} while (prev == rb);					      \
+									      \
+		/* Check if the node intersects [start;last] */		      \
+		if (last < ITSTART(node))		/* !Cond1 */	      \
+			return NULL;					      \
+		else if (start <= ITLAST(node))		/* Cond2 */	      \
+			return node;					      \
+	}								      \
+}
--- a/tools/objtool/elf.c
+++ b/tools/objtool/elf.c
@@ -16,6 +16,7 @@
 #include <string.h>
 #include <unistd.h>
 #include <errno.h>
+#include <linux/interval_tree_generic.h>
 #include <objtool/builtin.h>
 
 #include <objtool/elf.h>
@@ -50,38 +51,22 @@ static inline u32 str_hash(const char *s
 	__elf_table(name); \
 })
 
-static bool symbol_to_offset(struct rb_node *a, const struct rb_node *b)
+static inline unsigned long __sym_start(struct symbol *s)
 {
-	struct symbol *sa = rb_entry(a, struct symbol, node);
-	struct symbol *sb = rb_entry(b, struct symbol, node);
-
-	if (sa->offset < sb->offset)
-		return true;
-	if (sa->offset > sb->offset)
-		return false;
-
-	if (sa->len < sb->len)
-		return true;
-	if (sa->len > sb->len)
-		return false;
-
-	sa->alias = sb;
-
-	return false;
+	return s->offset;
 }
 
-static int symbol_by_offset(const void *key, const struct rb_node *node)
+static inline unsigned long __sym_last(struct symbol *s)
 {
-	const struct symbol *s = rb_entry(node, struct symbol, node);
-	const unsigned long *o = key;
+	return s->offset + s->len - 1;
+}
 
-	if (*o < s->offset)
-		return -1;
-	if (*o >= s->offset + s->len)
-		return 1;
+INTERVAL_TREE_DEFINE(struct symbol, node, unsigned long, __subtree_last,
+		     __sym_start, __sym_last, static, __sym)
 
-	return 0;
-}
+#define __sym_for_each(_iter, _tree, _start, _end)			\
+	for (_iter = __sym_iter_first((_tree), (_start), (_end));	\
+	     _iter; _iter = __sym_iter_next(_iter, (_start), (_end)))
 
 struct symbol_hole {
 	unsigned long key;
@@ -147,13 +132,12 @@ static struct symbol *find_symbol_by_ind
 
 struct symbol *find_symbol_by_offset(struct section *sec, unsigned long offset)
 {
-	struct rb_node *node;
-
-	rb_for_each(node, &offset, &sec->symbol_tree, symbol_by_offset) {
-		struct symbol *s = rb_entry(node, struct symbol, node);
+	struct rb_root_cached *tree = (struct rb_root_cached *)&sec->symbol_tree;
+	struct symbol *iter;
 
-		if (s->offset == offset && s->type != STT_SECTION)
-			return s;
+	__sym_for_each(iter, tree, offset, offset) {
+		if (iter->offset == offset && iter->type != STT_SECTION)
+			return iter;
 	}
 
 	return NULL;
@@ -161,13 +145,12 @@ struct symbol *find_symbol_by_offset(str
 
 struct symbol *find_func_by_offset(struct section *sec, unsigned long offset)
 {
-	struct rb_node *node;
+	struct rb_root_cached *tree = (struct rb_root_cached *)&sec->symbol_tree;
+	struct symbol *iter;
 
-	rb_for_each(node, &offset, &sec->symbol_tree, symbol_by_offset) {
-		struct symbol *s = rb_entry(node, struct symbol, node);
-
-		if (s->offset == offset && s->type == STT_FUNC)
-			return s;
+	__sym_for_each(iter, tree, offset, offset) {
+		if (iter->offset == offset && iter->type == STT_FUNC)
+			return iter;
 	}
 
 	return NULL;
@@ -175,13 +158,12 @@ struct symbol *find_func_by_offset(struc
 
 struct symbol *find_symbol_containing(const struct section *sec, unsigned long offset)
 {
-	struct rb_node *node;
-
-	rb_for_each(node, &offset, &sec->symbol_tree, symbol_by_offset) {
-		struct symbol *s = rb_entry(node, struct symbol, node);
+	struct rb_root_cached *tree = (struct rb_root_cached *)&sec->symbol_tree;
+	struct symbol *iter;
 
-		if (s->type != STT_SECTION)
-			return s;
+	__sym_for_each(iter, tree, offset, offset) {
+		if (iter->type != STT_SECTION)
+			return iter;
 	}
 
 	return NULL;
@@ -202,7 +184,7 @@ int find_symbol_hole_containing(const st
 	/*
 	 * Find the rightmost symbol for which @offset is after it.
 	 */
-	n = rb_find(&hole, &sec->symbol_tree, symbol_hole_by_offset);
+	n = rb_find(&hole, &sec->symbol_tree.rb_root, symbol_hole_by_offset);
 
 	/* found a symbol that contains @offset */
 	if (n)
@@ -224,13 +206,12 @@ int find_symbol_hole_containing(const st
 
 struct symbol *find_func_containing(struct section *sec, unsigned long offset)
 {
-	struct rb_node *node;
-
-	rb_for_each(node, &offset, &sec->symbol_tree, symbol_by_offset) {
-		struct symbol *s = rb_entry(node, struct symbol, node);
+	struct rb_root_cached *tree = (struct rb_root_cached *)&sec->symbol_tree;
+	struct symbol *iter;
 
-		if (s->type == STT_FUNC)
-			return s;
+	__sym_for_each(iter, tree, offset, offset) {
+		if (iter->type == STT_FUNC)
+			return iter;
 	}
 
 	return NULL;
@@ -373,6 +354,7 @@ static void elf_add_symbol(struct elf *e
 {
 	struct list_head *entry;
 	struct rb_node *pnode;
+	struct symbol *iter;
 
 	INIT_LIST_HEAD(&sym->pv_target);
 	sym->alias = sym;
@@ -386,7 +368,12 @@ static void elf_add_symbol(struct elf *e
 	sym->offset = sym->sym.st_value;
 	sym->len = sym->sym.st_size;
 
-	rb_add(&sym->node, &sym->sec->symbol_tree, symbol_to_offset);
+	__sym_for_each(iter, &sym->sec->symbol_tree, sym->offset, sym->offset) {
+		if (iter->offset == sym->offset && iter->type == sym->type)
+			iter->alias = sym;
+	}
+
+	__sym_insert(sym, &sym->sec->symbol_tree);
 	pnode = rb_prev(&sym->node);
 	if (pnode)
 		entry = &rb_entry(pnode, struct symbol, node)->list;
@@ -401,7 +388,7 @@ static void elf_add_symbol(struct elf *e
 	 * can exist within a function, confusing the sorting.
 	 */
 	if (!sym->len)
-		rb_erase(&sym->node, &sym->sec->symbol_tree);
+		__sym_remove(sym, &sym->sec->symbol_tree);
 }
 
 static int read_symbols(struct elf *elf)
--- a/tools/objtool/include/objtool/elf.h
+++ b/tools/objtool/include/objtool/elf.h
@@ -30,7 +30,7 @@ struct section {
 	struct hlist_node hash;
 	struct hlist_node name_hash;
 	GElf_Shdr sh;
-	struct rb_root symbol_tree;
+	struct rb_root_cached symbol_tree;
 	struct list_head symbol_list;
 	struct list_head reloc_list;
 	struct section *base, *reloc;
@@ -53,6 +53,7 @@ struct symbol {
 	unsigned char bind, type;
 	unsigned long offset;
 	unsigned int len;
+	unsigned long __subtree_last;
 	struct symbol *pfunc, *cfunc, *alias;
 	u8 uaccess_safe      : 1;
 	u8 static_call_tramp : 1;




* [PATCH v2 34/59] objtool: Allow symbol range comparisons for IBT/ENDBR
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (32 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 33/59] objtool: Fix find_{symbol,func}_containing() Peter Zijlstra
@ 2022-09-02 13:06 ` Peter Zijlstra
  2022-09-02 13:07 ` [PATCH v2 35/59] x86/entry: Make sync_regs() invocation a tail call Peter Zijlstra
                   ` (25 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Peter Zijlstra <peterz@infradead.org>

A semi-common pattern is code that checks whether a code address falls
within a specific range. All text addresses require either ENDBR or
ANNOTATE_NOENDBR; however, placing an ANNOTATE_NOENDBR on the label
just past the range is unnatural.

Instead, suppress the !ENDBR warning when the relocation target is
exactly at the end of a symbol that itself starts with either ENDBR or
ANNOTATE_NOENDBR.
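
For reference, the range check pattern in question looks roughly like
the sketch below. It is modeled on the is_sysenter_singlestep() check
referenced in the hunk above, but simplified; the actual kernel helper
differs in detail.

  /* Simplified sketch of a code-range check (kernel context assumed). */
  #include <linux/types.h>

  extern char entry_SYSENTER_compat[];
  extern char __end_entry_SYSENTER_compat[];

  static bool in_sysenter_range(unsigned long ip)
  {
          /*
           * __end_entry_SYSENTER_compat is only used as a range bound
           * here; it is never a branch target, which is why requiring
           * ENDBR or an explicit ANNOTATE_NOENDBR at that address is
           * what this patch avoids.
           */
          return ip >= (unsigned long)entry_SYSENTER_compat &&
                 ip <  (unsigned long)__end_entry_SYSENTER_compat;
  }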

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/entry/entry_64_compat.S |    1 -
 tools/objtool/check.c            |   28 ++++++++++++++++++++++++++++
 2 files changed, 28 insertions(+), 1 deletion(-)

--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -128,7 +128,6 @@ SYM_INNER_LABEL(entry_SYSENTER_compat_af
 	popfq
 	jmp	.Lsysenter_flags_fixed
 SYM_INNER_LABEL(__end_entry_SYSENTER_compat, SYM_L_GLOBAL)
-	ANNOTATE_NOENDBR // is_sysenter_singlestep
 SYM_CODE_END(entry_SYSENTER_compat)
 
 /*
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -4016,6 +4016,24 @@ static void mark_endbr_used(struct instr
 		list_del_init(&insn->call_node);
 }
 
+static bool noendbr_range(struct objtool_file *file, struct instruction *insn)
+{
+	struct symbol *sym = find_symbol_containing(insn->sec, insn->offset-1);
+	struct instruction *first;
+
+	if (!sym)
+		return false;
+
+	first = find_insn(file, sym->sec, sym->offset);
+	if (!first)
+		return false;
+
+	if (first->type != INSN_ENDBR && !first->noendbr)
+		return false;
+
+	return insn->offset == sym->offset + sym->len;
+}
+
 static int validate_ibt_insn(struct objtool_file *file, struct instruction *insn)
 {
 	struct instruction *dest;
@@ -4088,9 +4106,19 @@ static int validate_ibt_insn(struct objt
 			continue;
 		}
 
+		/*
+		 * Accept anything ANNOTATE_NOENDBR.
+		 */
 		if (dest->noendbr)
 			continue;
 
+		/*
+		 * Accept if this is the instruction after a symbol
+		 * that is (no)endbr -- typical code-range usage.
+		 */
+		if (noendbr_range(file, dest))
+			continue;
+
 		WARN_FUNC("relocation to !ENDBR: %s",
 			  insn->sec, insn->offset,
 			  offstr(dest->sec, dest->offset));




* [PATCH v2 35/59] x86/entry: Make sync_regs() invocation a tail call
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (33 preceding siblings ...)
  2022-09-02 13:06 ` [PATCH v2 34/59] objtool: Allow symbol range comparisons for IBT/ENDBR Peter Zijlstra
@ 2022-09-02 13:07 ` Peter Zijlstra
  2022-09-02 13:07 ` [PATCH v2 36/59] ftrace: Add HAVE_DYNAMIC_FTRACE_NO_PATCHABLE Peter Zijlstra
                   ` (24 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Peter Zijlstra <peterz@infradead.org>

No point in having a call there. Spare the call/ret overhead.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/entry/entry_64.S |    7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1062,11 +1062,8 @@ SYM_CODE_START_LOCAL(error_entry)
 	UNTRAIN_RET
 
 	leaq	8(%rsp), %rdi			/* arg0 = pt_regs pointer */
-.Lerror_entry_from_usermode_after_swapgs:
-
 	/* Put us onto the real thread stack. */
-	call	sync_regs
-	RET
+	jmp	sync_regs
 
 	/*
 	 * There are two places in the kernel that can potentially fault with
@@ -1124,7 +1121,7 @@ SYM_CODE_START_LOCAL(error_entry)
 	leaq	8(%rsp), %rdi			/* arg0 = pt_regs pointer */
 	call	fixup_bad_iret
 	mov	%rax, %rdi
-	jmp	.Lerror_entry_from_usermode_after_swapgs
+	jmp	sync_regs
 SYM_CODE_END(error_entry)
 
 SYM_CODE_START_LOCAL(error_return)




* [PATCH v2 36/59] ftrace: Add HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (34 preceding siblings ...)
  2022-09-02 13:07 ` [PATCH v2 35/59] x86/entry: Make sync_regs() invocation a tail call Peter Zijlstra
@ 2022-09-02 13:07 ` Peter Zijlstra
  2022-09-02 13:07 ` [PATCH v2 37/59] x86/putuser: Provide room for padding Peter Zijlstra
                   ` (23 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Peter Zijlstra (Intel) <peterz@infradead.org>

x86 will shortly start using -fpatchable-function-entry for purposes
other than ftrace; make sure the __patchable_function_entries section
isn't merged into the mcount_loc section.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/asm-generic/vmlinux.lds.h |   11 ++++++++++-
 kernel/trace/Kconfig              |    6 ++++++
 tools/objtool/check.c             |    3 ++-
 3 files changed, 18 insertions(+), 2 deletions(-)

--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -159,6 +159,14 @@
 #define MEM_DISCARD(sec) *(.mem##sec)
 #endif
 
+#ifndef CONFIG_HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
+#define KEEP_PATCHABLE		KEEP(*(__patchable_function_entries))
+#define PATCHABLE_DISCARDS
+#else
+#define KEEP_PATCHABLE
+#define PATCHABLE_DISCARDS	*(__patchable_function_entries)
+#endif
+
 #ifdef CONFIG_FTRACE_MCOUNT_RECORD
 /*
  * The ftrace call sites are logged to a section whose name depends on the
@@ -177,7 +185,7 @@
 #define MCOUNT_REC()	. = ALIGN(8);				\
 			__start_mcount_loc = .;			\
 			KEEP(*(__mcount_loc))			\
-			KEEP(*(__patchable_function_entries))	\
+			KEEP_PATCHABLE				\
 			__stop_mcount_loc = .;			\
 			ftrace_stub_graph = ftrace_stub;	\
 			ftrace_ops_list_func = arch_ftrace_ops_list_func;
@@ -1029,6 +1037,7 @@
 
 #define COMMON_DISCARDS							\
 	SANITIZER_DISCARDS						\
+	PATCHABLE_DISCARDS						\
 	*(.discard)							\
 	*(.discard.*)							\
 	*(.modinfo)							\
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -51,6 +51,12 @@ config HAVE_DYNAMIC_FTRACE_WITH_ARGS
 	 This allows for use of regs_get_kernel_argument() and
 	 kernel_stack_pointer().
 
+config HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
+	bool
+	help
+	  If the architecture generates __patchable_function_entries sections
+	  but does not want them included in the ftrace locations.
+
 config HAVE_FTRACE_MCOUNT_RECORD
 	bool
 	help
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -4196,7 +4196,8 @@ static int validate_ibt(struct objtool_f
 		    !strcmp(sec->name, "__bug_table")			||
 		    !strcmp(sec->name, "__ex_table")			||
 		    !strcmp(sec->name, "__jump_table")			||
-		    !strcmp(sec->name, "__mcount_loc"))
+		    !strcmp(sec->name, "__mcount_loc")			||
+		    strstr(sec->name, "__patchable_function_entries"))
 			continue;
 
 		list_for_each_entry(reloc, &sec->reloc->reloc_list, list)




* [PATCH v2 37/59] x86/putuser: Provide room for padding
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (35 preceding siblings ...)
  2022-09-02 13:07 ` [PATCH v2 36/59] ftrace: Add HAVE_DYNAMIC_FTRACE_NO_PATCHABLE Peter Zijlstra
@ 2022-09-02 13:07 ` Peter Zijlstra
  2022-09-02 16:43   ` Linus Torvalds
  2022-09-02 13:07 ` [PATCH v2 38/59] x86/Kconfig: Add CONFIG_CALL_THUNKS Peter Zijlstra
                   ` (22 subsequent siblings)
  59 siblings, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/lib/putuser.S |   62 ++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 49 insertions(+), 13 deletions(-)

--- a/arch/x86/lib/putuser.S
+++ b/arch/x86/lib/putuser.S
@@ -47,8 +47,6 @@ SYM_FUNC_START(__put_user_1)
 	LOAD_TASK_SIZE_MINUS_N(0)
 	cmp %_ASM_BX,%_ASM_CX
 	jae .Lbad_put_user
-SYM_INNER_LABEL(__put_user_nocheck_1, SYM_L_GLOBAL)
-	ENDBR
 	ASM_STAC
 1:	movb %al,(%_ASM_CX)
 	xor %ecx,%ecx
@@ -56,54 +54,87 @@ SYM_INNER_LABEL(__put_user_nocheck_1, SY
 	RET
 SYM_FUNC_END(__put_user_1)
 EXPORT_SYMBOL(__put_user_1)
+
+SYM_FUNC_START(__put_user_nocheck_1)
+	ENDBR
+	ASM_STAC
+2:	movb %al,(%_ASM_CX)
+	xor %ecx,%ecx
+	ASM_CLAC
+	RET
+SYM_FUNC_END(__put_user_nocheck_1)
 EXPORT_SYMBOL(__put_user_nocheck_1)
 
 SYM_FUNC_START(__put_user_2)
 	LOAD_TASK_SIZE_MINUS_N(1)
 	cmp %_ASM_BX,%_ASM_CX
 	jae .Lbad_put_user
-SYM_INNER_LABEL(__put_user_nocheck_2, SYM_L_GLOBAL)
-	ENDBR
 	ASM_STAC
-2:	movw %ax,(%_ASM_CX)
+3:	movw %ax,(%_ASM_CX)
 	xor %ecx,%ecx
 	ASM_CLAC
 	RET
 SYM_FUNC_END(__put_user_2)
 EXPORT_SYMBOL(__put_user_2)
+
+SYM_FUNC_START(__put_user_nocheck_2)
+	ENDBR
+	ASM_STAC
+4:	movw %ax,(%_ASM_CX)
+	xor %ecx,%ecx
+	ASM_CLAC
+	RET
+SYM_FUNC_END(__put_user_nocheck_2)
 EXPORT_SYMBOL(__put_user_nocheck_2)
 
 SYM_FUNC_START(__put_user_4)
 	LOAD_TASK_SIZE_MINUS_N(3)
 	cmp %_ASM_BX,%_ASM_CX
 	jae .Lbad_put_user
-SYM_INNER_LABEL(__put_user_nocheck_4, SYM_L_GLOBAL)
-	ENDBR
 	ASM_STAC
-3:	movl %eax,(%_ASM_CX)
+5:	movl %eax,(%_ASM_CX)
 	xor %ecx,%ecx
 	ASM_CLAC
 	RET
 SYM_FUNC_END(__put_user_4)
 EXPORT_SYMBOL(__put_user_4)
+
+SYM_FUNC_START(__put_user_nocheck_4)
+	ENDBR
+	ASM_STAC
+6:	movl %eax,(%_ASM_CX)
+	xor %ecx,%ecx
+	ASM_CLAC
+	RET
+SYM_FUNC_END(__put_user_nocheck_4)
 EXPORT_SYMBOL(__put_user_nocheck_4)
 
 SYM_FUNC_START(__put_user_8)
 	LOAD_TASK_SIZE_MINUS_N(7)
 	cmp %_ASM_BX,%_ASM_CX
 	jae .Lbad_put_user
-SYM_INNER_LABEL(__put_user_nocheck_8, SYM_L_GLOBAL)
-	ENDBR
 	ASM_STAC
-4:	mov %_ASM_AX,(%_ASM_CX)
+7:	mov %_ASM_AX,(%_ASM_CX)
 #ifdef CONFIG_X86_32
-5:	movl %edx,4(%_ASM_CX)
+8:	movl %edx,4(%_ASM_CX)
 #endif
 	xor %ecx,%ecx
 	ASM_CLAC
 	RET
 SYM_FUNC_END(__put_user_8)
 EXPORT_SYMBOL(__put_user_8)
+
+SYM_FUNC_START(__put_user_nocheck_8)
+	ENDBR
+	ASM_STAC
+9:	mov %_ASM_AX,(%_ASM_CX)
+#ifdef CONFIG_X86_32
+10:	movl %edx,4(%_ASM_CX)
+#endif
+	xor %ecx,%ecx
+	ASM_CLAC
+	RET
+SYM_FUNC_END(__put_user_nocheck_8)
 EXPORT_SYMBOL(__put_user_nocheck_8)
 
 SYM_CODE_START_LOCAL(.Lbad_put_user_clac)
@@ -117,6 +148,11 @@ SYM_CODE_END(.Lbad_put_user_clac)
 	_ASM_EXTABLE_UA(2b, .Lbad_put_user_clac)
 	_ASM_EXTABLE_UA(3b, .Lbad_put_user_clac)
 	_ASM_EXTABLE_UA(4b, .Lbad_put_user_clac)
-#ifdef CONFIG_X86_32
 	_ASM_EXTABLE_UA(5b, .Lbad_put_user_clac)
+	_ASM_EXTABLE_UA(6b, .Lbad_put_user_clac)
+	_ASM_EXTABLE_UA(7b, .Lbad_put_user_clac)
+	_ASM_EXTABLE_UA(9b, .Lbad_put_user_clac)
+#ifdef CONFIG_X86_32
+	_ASM_EXTABLE_UA(8b, .Lbad_put_user_clac)
+	_ASM_EXTABLE_UA(10b, .Lbad_put_user_clac)
 #endif




* [PATCH v2 38/59] x86/Kconfig: Add CONFIG_CALL_THUNKS
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (36 preceding siblings ...)
  2022-09-02 13:07 ` [PATCH v2 37/59] x86/putuser: Provide room for padding Peter Zijlstra
@ 2022-09-02 13:07 ` Peter Zijlstra
  2022-09-02 13:07 ` [PATCH v2 39/59] x86/Kconfig: Introduce function padding Peter Zijlstra
                   ` (21 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

In preparation for mitigating the Intel SKL RSB underflow issue in
software, add a new configuration symbol which allows the required
call thunk infrastructure to be built conditionally.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/Kconfig |    7 +++++++
 1 file changed, 7 insertions(+)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2432,6 +2432,13 @@ config CC_HAS_SLS
 config CC_HAS_RETURN_THUNK
 	def_bool $(cc-option,-mfunction-return=thunk-extern)
 
+config HAVE_CALL_THUNKS
+	def_bool y
+	depends on RETHUNK && OBJTOOL
+
+config CALL_THUNKS
+	def_bool n
+
 menuconfig SPECULATION_MITIGATIONS
 	bool "Mitigations for speculative execution vulnerabilities"
 	default y




* [PATCH v2 39/59] x86/Kconfig: Introduce function padding
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (37 preceding siblings ...)
  2022-09-02 13:07 ` [PATCH v2 38/59] x86/Kconfig: Add CONFIG_CALL_THUNKS Peter Zijlstra
@ 2022-09-02 13:07 ` Peter Zijlstra
  2022-09-02 13:07 ` [PATCH v2 40/59] x86/retbleed: Add X86_FEATURE_CALL_DEPTH Peter Zijlstra
                   ` (20 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

Now that all functions are 16 byte aligned, add 16 bytes of NOP
padding in front of each function. This prepares things for software
call stack tracking and kCFI/FineIBT.

This significantly increases kernel .text size, around 5.1% on an
x86_64-defconfig-ish build.

However, per the random access argument used for alignment, these 16
extra bytes are code that wouldn't be used. Performance measurements
back this up by showing no significant performance regressions.
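
As a reference for what this does to a single function, here is a
minimal sketch. It assumes GCC's documented
-fpatchable-function-entry=N,M semantics (N NOP bytes in total, M of
them placed before the function's entry point); the disassembly in the
comment is approximate.

  /* pad.c - illustration only */
  int add_one(int x)
  {
          return x + 1;
  }

  /*
   * $ gcc -O2 -fpatchable-function-entry=16,16 -c pad.c
   *
   * Approximate resulting layout:
   *
   *          nop                      <- 16 bytes of NOPs emitted
   *          ...                         before the add_one symbol
   *          nop
   *  add_one:
   *          lea     0x1(%rdi),%eax
   *          ret
   *
   * The padding is never executed on unaffected machines; the call
   * depth tracking patches later in the series write their accounting
   * thunk into it at runtime.
   */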

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 Makefile                       |    3 +++
 arch/x86/Kconfig               |   12 +++++++++++-
 arch/x86/Makefile              |    5 +++++
 arch/x86/entry/vdso/Makefile   |    3 ++-
 arch/x86/include/asm/linkage.h |   28 ++++++++++++++++++++++++----
 5 files changed, 45 insertions(+), 6 deletions(-)

--- a/Makefile
+++ b/Makefile
@@ -920,6 +920,9 @@ KBUILD_AFLAGS	+= -fno-lto
 export CC_FLAGS_LTO
 endif
 
+PADDING_CFLAGS := -fpatchable-function-entry=$(CONFIG_FUNCTION_PADDING_BYTES),$(CONFIG_FUNCTION_PADDING_BYTES)
+export PADDING_CFLAGS
+
 ifdef CONFIG_CFI_CLANG
 CC_FLAGS_CFI	:= -fsanitize=cfi \
 		   -fsanitize-cfi-cross-dso \
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2427,9 +2427,19 @@ config CC_HAS_SLS
 config CC_HAS_RETURN_THUNK
 	def_bool $(cc-option,-mfunction-return=thunk-extern)
 
+config CC_HAS_ENTRY_PADDING
+	def_bool $(cc-option,-fpatchable-function-entry=16,16)
+
+config FUNCTION_PADDING_BYTES
+	int
+	default 32 if CALL_THUNKS_DEBUG && !CFI_CLANG
+	default 27 if CALL_THUNKS_DEBUG && CFI_CLANG
+	default 16 if !CFI_CLANG
+	default 11
+
 config HAVE_CALL_THUNKS
 	def_bool y
-	depends on RETHUNK && OBJTOOL
+	depends on CC_HAS_ENTRY_PADDING && RETHUNK && OBJTOOL
 
 config CALL_THUNKS
 	def_bool n
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -202,6 +202,11 @@ ifdef CONFIG_SLS
   KBUILD_CFLAGS += -mharden-sls=all
 endif
 
+PADDING_CFLAGS := -fpatchable-function-entry=$(CONFIG_FUNCTION_PADDING_BYTES),$(CONFIG_FUNCTION_PADDING_BYTES)
+ifdef CONFIG_CALL_THUNKS
+  KBUILD_CFLAGS += $(PADDING_CFLAGS)
+endif
+
 KBUILD_LDFLAGS += -m elf_$(UTS_MACHINE)
 
 ifdef CONFIG_LTO_CLANG
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -92,7 +92,7 @@ ifneq ($(RETPOLINE_VDSO_CFLAGS),)
 endif
 endif
 
-$(vobjs): KBUILD_CFLAGS := $(filter-out $(CC_FLAGS_LTO) $(RANDSTRUCT_CFLAGS) $(GCC_PLUGINS_CFLAGS) $(RETPOLINE_CFLAGS),$(KBUILD_CFLAGS)) $(CFL)
+$(vobjs): KBUILD_CFLAGS := $(filter-out $(PADDING_CFLAGS) $(CC_FLAGS_LTO) $(RANDSTRUCT_CFLAGS) $(GCC_PLUGINS_CFLAGS) $(RETPOLINE_CFLAGS),$(KBUILD_CFLAGS)) $(CFL)
 $(vobjs): KBUILD_AFLAGS += -DBUILD_VDSO
 
 #
@@ -154,6 +154,7 @@ KBUILD_CFLAGS_32 := $(filter-out $(RANDS
 KBUILD_CFLAGS_32 := $(filter-out $(GCC_PLUGINS_CFLAGS),$(KBUILD_CFLAGS_32))
 KBUILD_CFLAGS_32 := $(filter-out $(RETPOLINE_CFLAGS),$(KBUILD_CFLAGS_32))
 KBUILD_CFLAGS_32 := $(filter-out $(CC_FLAGS_LTO),$(KBUILD_CFLAGS_32))
+KBUILD_CFLAGS_32 := $(filter-out $(PADDING_CFLAGS),$(KBUILD_CFLAGS_32))
 KBUILD_CFLAGS_32 += -m32 -msoft-float -mregparm=0 -fpic
 KBUILD_CFLAGS_32 += -fno-stack-protector
 KBUILD_CFLAGS_32 += $(call cc-option, -foptimize-sibling-calls)
--- a/arch/x86/include/asm/linkage.h
+++ b/arch/x86/include/asm/linkage.h
@@ -13,17 +13,37 @@
 #endif /* CONFIG_X86_32 */
 
 #if CONFIG_FUNCTION_ALIGNMENT == 8
-#define __ALIGN			.p2align 3, 0x90;
+# define __ALIGN		.p2align 3, 0x90;
 #elif CONFIG_FUNCTION_ALIGNMENT == 16
-#define __ALIGN			.p2align 4, 0x90;
+# define __ALIGN		.p2align 4, 0x90;
+#elif CONFIG_FUNCTION_ALIGNMENT == 32
+# define __ALIGN		.p2align 5, 0x90
 #else
 # error Unsupported function alignment
 #endif
 
 #define __ALIGN_STR		__stringify(__ALIGN)
-#define ASM_FUNC_ALIGN		__ALIGN_STR
-#define __FUNC_ALIGN		__ALIGN
+
+#ifdef CONFIG_CFI_CLANG
+#define __FUNCTION_PADDING	(CONFIG_FUNCTION_ALIGNMENT - 5)
+#else
+#define __FUNCTION_PADDING	CONFIG_FUNCTION_ALIGNMENT
+#endif
+
+#if defined(CONFIG_CALL_THUNKS) && !defined(__DISABLE_EXPORTS) && !defined(BUILD_VDSO)
+#define FUNCTION_PADDING	.skip __FUNCTION_PADDING, 0x90;
+#else
+#define FUNCTION_PADDING
+#endif
+
+#if (CONFIG_FUNCTION_ALIGNMENT > 8) && !defined(__DISABLE_EXPORTS) && !defined(BUILD_VDSO)
+# define __FUNC_ALIGN		__ALIGN; FUNCTION_PADDING
+#else
+# define __FUNC_ALIGN		__ALIGN
+#endif
+
 #define SYM_F_ALIGN		__FUNC_ALIGN
+#define ASM_FUNC_ALIGN		__stringify(__FUNC_ALIGN)
 
 #ifdef __ASSEMBLY__
 




* [PATCH v2 40/59] x86/retbleed: Add X86_FEATURE_CALL_DEPTH
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (38 preceding siblings ...)
  2022-09-02 13:07 ` [PATCH v2 39/59] x86/Kconfig: Introduce function padding Peter Zijlstra
@ 2022-09-02 13:07 ` Peter Zijlstra
  2022-09-02 13:07 ` [PATCH v2 41/59] x86/alternatives: Provide text_poke_copy_locked() Peter Zijlstra
                   ` (19 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

Intel SKL CPUs fall back to other predictors when the RSB underflows.
The only microcode mitigation is IBRS, which is insanely expensive: it
comes with performance drops of up to 30% depending on the workload.

A way less expensive, but nevertheless horrible, mitigation is to
track the call depth in software and overeagerly fill the RSB when
returns underflow the software counter.

Provide a configuration symbol and a CPU misfeature bit.
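
As a rough mental model only: the real accounting is done in per-CPU
assembly thunks added later in the series, and stuff_rsb() below is
just a stand-in, but the scheme amounts to something like this.

  /* Conceptual model of call depth tracking; not the kernel code. */
  #define RSB_DEPTH       16

  static int call_depth;                  /* per-CPU in the real thing */

  static void stuff_rsb(void)             /* stand-in for RSB stuffing */
  {
          /* execute a series of benign calls/returns to refill the RSB */
  }

  static void account_call(void)          /* runs from the function padding */
  {
          if (call_depth < RSB_DEPTH)
                  call_depth++;
  }

  static void account_return(void)        /* runs from the return thunk */
  {
          if (--call_depth <= 0) {
                  stuff_rsb();            /* refill before the RSB runs dry */
                  call_depth = RSB_DEPTH;
          }
  }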

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/Kconfig                         |   19 +++++++++++++++++++
 arch/x86/include/asm/cpufeatures.h       |    1 +
 arch/x86/include/asm/disabled-features.h |    9 ++++++++-
 3 files changed, 28 insertions(+), 1 deletion(-)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2501,6 +2501,25 @@ config CPU_UNRET_ENTRY
 	help
 	  Compile the kernel with support for the retbleed=unret mitigation.
 
+config CALL_DEPTH_TRACKING
+	bool "Mitigate RSB underflow with call depth tracking"
+	depends on CPU_SUP_INTEL && HAVE_CALL_THUNKS
+	select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
+	select CALL_THUNKS
+	default y
+	help
+	  Compile the kernel with call depth tracking to mitigate the Intel
+	  SKL Return-Stack-Buffer (RSB) underflow issue. The mitigation is
+	  off by default and needs to be enabled on the kernel command line
+	  via the retbleed=stuff option. For non-affected systems the
+	  overhead of this option is marginal as the call depth tracking
+	  uses run-time generated call thunks in a compiler-generated
+	  padding area and call patching. This increases text size by ~5%.
+	  For non-affected systems this space is unused. On affected SKL
+	  systems this results in a significant performance gain over the
+	  IBRS mitigation.
+
+
 config CPU_IBPB_ENTRY
 	bool "Enable IBPB on kernel entry"
 	depends on CPU_SUP_AMD && X86_64
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -304,6 +304,7 @@
 #define X86_FEATURE_UNRET		(11*32+15) /* "" AMD BTB untrain return */
 #define X86_FEATURE_USE_IBPB_FW		(11*32+16) /* "" Use IBPB during runtime firmware calls */
 #define X86_FEATURE_RSB_VMEXIT_LITE	(11*32+17) /* "" Fill RSB on VM exit when EIBRS is enabled */
+#define X86_FEATURE_CALL_DEPTH		(11*32+18) /* "" Call depth tracking for RSB stuffing */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
 #define X86_FEATURE_AVX_VNNI		(12*32+ 4) /* AVX VNNI instructions */
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -69,6 +69,12 @@
 # define DISABLE_UNRET		(1 << (X86_FEATURE_UNRET & 31))
 #endif
 
+#ifdef CONFIG_CALL_DEPTH_TRACKING
+# define DISABLE_CALL_DEPTH_TRACKING	0
+#else
+# define DISABLE_CALL_DEPTH_TRACKING	(1 << (X86_FEATURE_CALL_DEPTH & 31))
+#endif
+
 #ifdef CONFIG_INTEL_IOMMU_SVM
 # define DISABLE_ENQCMD		0
 #else
@@ -101,7 +107,8 @@
 #define DISABLED_MASK8	(DISABLE_TDX_GUEST)
 #define DISABLED_MASK9	(DISABLE_SGX)
 #define DISABLED_MASK10	0
-#define DISABLED_MASK11	(DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET)
+#define DISABLED_MASK11	(DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \
+			 DISABLE_CALL_DEPTH_TRACKING)
 #define DISABLED_MASK12	0
 #define DISABLED_MASK13	0
 #define DISABLED_MASK14	0




* [PATCH v2 41/59] x86/alternatives: Provide text_poke_copy_locked()
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (39 preceding siblings ...)
  2022-09-02 13:07 ` [PATCH v2 40/59] x86/retbleed: Add X86_FEATURE_CALL_DEPTH Peter Zijlstra
@ 2022-09-02 13:07 ` Peter Zijlstra
  2022-09-02 13:07 ` [PATCH v2 42/59] x86/entry: Make some entry symbols global Peter Zijlstra
                   ` (18 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

The upcoming call thunk patching must hold text_mutex and needs access to
text_poke_copy(), which takes text_mutex.

Provide a _locked suffixed variant that exposes the inner workings.
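
A minimal sketch of the intended usage pattern; patch_padding_site()
is a made-up caller name, the real user arrives with the call thunk
patches later in the series.

  #include <linux/lockdep.h>
  #include <linux/memory.h>               /* text_mutex */
  #include <asm/text-patching.h>

  /* Caller that already holds text_mutex, e.g. the call thunk patching. */
  static void patch_padding_site(void *pad, const void *tmpl, size_t tmpl_len)
  {
          lockdep_assert_held(&text_mutex);

          /* core_ok == true: patching core kernel text is intentional here */
          text_poke_copy_locked(pad, tmpl, tmpl_len, true);
  }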

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/text-patching.h |    1 
 arch/x86/kernel/alternative.c        |   37 ++++++++++++++++++++---------------
 2 files changed, 23 insertions(+), 15 deletions(-)

--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -45,6 +45,7 @@ extern void *text_poke(void *addr, const
 extern void text_poke_sync(void);
 extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
 extern void *text_poke_copy(void *addr, const void *opcode, size_t len);
+extern void *text_poke_copy_locked(void *addr, const void *opcode, size_t len, bool core_ok);
 extern void *text_poke_set(void *addr, int c, size_t len);
 extern int poke_int3_handler(struct pt_regs *regs);
 extern void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate);
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1227,27 +1227,15 @@ void *text_poke_kgdb(void *addr, const v
 	return __text_poke(text_poke_memcpy, addr, opcode, len);
 }
 
-/**
- * text_poke_copy - Copy instructions into (an unused part of) RX memory
- * @addr: address to modify
- * @opcode: source of the copy
- * @len: length to copy, could be more than 2x PAGE_SIZE
- *
- * Not safe against concurrent execution; useful for JITs to dump
- * new code blocks into unused regions of RX memory. Can be used in
- * conjunction with synchronize_rcu_tasks() to wait for existing
- * execution to quiesce after having made sure no existing functions
- * pointers are live.
- */
-void *text_poke_copy(void *addr, const void *opcode, size_t len)
+void *text_poke_copy_locked(void *addr, const void *opcode, size_t len,
+			    bool core_ok)
 {
 	unsigned long start = (unsigned long)addr;
 	size_t patched = 0;
 
-	if (WARN_ON_ONCE(core_kernel_text(start)))
+	if (WARN_ON_ONCE(!core_ok && core_kernel_text(start)))
 		return NULL;
 
-	mutex_lock(&text_mutex);
 	while (patched < len) {
 		unsigned long ptr = start + patched;
 		size_t s;
@@ -1257,6 +1245,25 @@ void *text_poke_copy(void *addr, const v
 		__text_poke(text_poke_memcpy, (void *)ptr, opcode + patched, s);
 		patched += s;
 	}
+	return addr;
+}
+
+/**
+ * text_poke_copy - Copy instructions into (an unused part of) RX memory
+ * @addr: address to modify
+ * @opcode: source of the copy
+ * @len: length to copy, could be more than 2x PAGE_SIZE
+ *
+ * Not safe against concurrent execution; useful for JITs to dump
+ * new code blocks into unused regions of RX memory. Can be used in
+ * conjunction with synchronize_rcu_tasks() to wait for existing
+ * execution to quiesce after having made sure no existing functions
+ * pointers are live.
+ */
+void *text_poke_copy(void *addr, const void *opcode, size_t len)
+{
+	mutex_lock(&text_mutex);
+	addr = text_poke_copy_locked(addr, opcode, len, false);
 	mutex_unlock(&text_mutex);
 	return addr;
 }




* [PATCH v2 42/59] x86/entry: Make some entry symbols global
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (40 preceding siblings ...)
  2022-09-02 13:07 ` [PATCH v2 41/59] x86/alternatives: Provide text_poke_copy_locked() Peter Zijlstra
@ 2022-09-02 13:07 ` Peter Zijlstra
  2022-09-02 13:07 ` [PATCH v2 43/59] x86/paravirt: Make struct paravirt_call_site unconditionally available Peter Zijlstra
                   ` (17 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

paranoid_entry(), error_entry() and xen_error_entry() have to be
exempted from call accounting by thunk patching because they run
before UNTRAIN_RET.

Expose them so they are available in the alternative code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/entry/entry_64.S |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -327,7 +327,8 @@ SYM_CODE_END(ret_from_fork)
 #endif
 .endm
 
-SYM_CODE_START_LOCAL(xen_error_entry)
+SYM_CODE_START(xen_error_entry)
+	ANNOTATE_NOENDBR
 	UNWIND_HINT_FUNC
 	PUSH_AND_CLEAR_REGS save_ret=1
 	ENCODE_FRAME_POINTER 8
@@ -906,7 +907,8 @@ SYM_CODE_END(xen_failsafe_callback)
  * R14 - old CR3
  * R15 - old SPEC_CTRL
  */
-SYM_CODE_START_LOCAL(paranoid_entry)
+SYM_CODE_START(paranoid_entry)
+	ANNOTATE_NOENDBR
 	UNWIND_HINT_FUNC
 	PUSH_AND_CLEAR_REGS save_ret=1
 	ENCODE_FRAME_POINTER 8
@@ -1041,7 +1043,8 @@ SYM_CODE_END(paranoid_exit)
 /*
  * Switch GS and CR3 if needed.
  */
-SYM_CODE_START_LOCAL(error_entry)
+SYM_CODE_START(error_entry)
+	ANNOTATE_NOENDBR
 	UNWIND_HINT_FUNC
 
 	PUSH_AND_CLEAR_REGS save_ret=1




* [PATCH v2 43/59] x86/paravirt: Make struct paravirt_call_site unconditionally available
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (41 preceding siblings ...)
  2022-09-02 13:07 ` [PATCH v2 42/59] x86/entry: Make some entry symbols global Peter Zijlstra
@ 2022-09-02 13:07 ` Peter Zijlstra
  2022-09-02 16:09   ` Juergen Gross
  2022-09-02 13:07 ` [PATCH v2 44/59] x86/callthunks: Add call patching for call depth tracking Peter Zijlstra
                   ` (16 subsequent siblings)
  59 siblings, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

For the upcoming call thunk patching it's less ifdeffery when the data
structure is unconditionally available. The code can then be trivially
fenced off with IS_ENABLED().
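
A sketch of the resulting pattern; patch_pv_sites() and
handle_pv_site() are hypothetical and only illustrate the IS_ENABLED()
fencing.

  #include <linux/kconfig.h>
  #include <linux/types.h>
  #include <asm/paravirt_types.h>

  void handle_pv_site(u8 *instr, u8 len);         /* hypothetical helper */

  /*
   * Compiles even with CONFIG_PARAVIRT=n now that the struct is always
   * visible; the dead branch and loop are discarded by the compiler.
   */
  static void patch_pv_sites(struct paravirt_patch_site *start,
                             struct paravirt_patch_site *end)
  {
          struct paravirt_patch_site *p;

          if (!IS_ENABLED(CONFIG_PARAVIRT))
                  return;

          for (p = start; p < end; p++)
                  handle_pv_site(p->instr, p->len);
  }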

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/paravirt.h       |    4 ++--
 arch/x86/include/asm/paravirt_types.h |   20 ++++++++++++--------
 2 files changed, 14 insertions(+), 10 deletions(-)

--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -4,13 +4,13 @@
 /* Various instructions on x86 need to be replaced for
  * para-virtualization: those hooks are defined here. */
 
+#include <asm/paravirt_types.h>
+
 #ifdef CONFIG_PARAVIRT
 #include <asm/pgtable_types.h>
 #include <asm/asm.h>
 #include <asm/nospec-branch.h>
 
-#include <asm/paravirt_types.h>
-
 #ifndef __ASSEMBLY__
 #include <linux/bug.h>
 #include <linux/types.h>
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -2,6 +2,17 @@
 #ifndef _ASM_X86_PARAVIRT_TYPES_H
 #define _ASM_X86_PARAVIRT_TYPES_H
 
+#ifndef __ASSEMBLY__
+/* These all sit in the .parainstructions section to tell us what to patch. */
+struct paravirt_patch_site {
+	u8 *instr;		/* original instructions */
+	u8 type;		/* type of this instruction */
+	u8 len;			/* length of original instruction */
+};
+#endif
+
+#ifdef CONFIG_PARAVIRT
+
 /* Bitmask of what can be clobbered: usually at least eax. */
 #define CLBR_EAX  (1 << 0)
 #define CLBR_ECX  (1 << 1)
@@ -584,16 +595,9 @@ unsigned long paravirt_ret0(void);
 
 #define paravirt_nop	((void *)_paravirt_nop)
 
-/* These all sit in the .parainstructions section to tell us what to patch. */
-struct paravirt_patch_site {
-	u8 *instr;		/* original instructions */
-	u8 type;		/* type of this instruction */
-	u8 len;			/* length of original instruction */
-};
-
 extern struct paravirt_patch_site __parainstructions[],
 	__parainstructions_end[];
 
 #endif	/* __ASSEMBLY__ */
-
+#endif  /* CONFIG_PARAVIRT */
 #endif	/* _ASM_X86_PARAVIRT_TYPES_H */




* [PATCH v2 44/59] x86/callthunks: Add call patching for call depth tracking
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (42 preceding siblings ...)
  2022-09-02 13:07 ` [PATCH v2 43/59] x86/paravirt: Make struct paravirt_call_site unconditionally available Peter Zijlstra
@ 2022-09-02 13:07 ` Peter Zijlstra
  2022-09-02 13:07 ` [PATCH v2 45/59] x86/modules: Add call patching Peter Zijlstra
                   ` (15 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

Mitigating the Intel SKL RSB underflow issue in software requires
tracking the call depth. That is, every CALL and every RET needs to be
intercepted and additional code injected.

The existing retbleed mitigations already include means of redirecting
RET to __x86_return_thunk; this can be re-purposed and RET can be
redirected to another function doing RET accounting.

CALL accounting will use the function padding introduced in prior
patches. For each CALL instruction, the destination symbol's padding
is rewritten to do the accounting and the CALL instruction is adjusted
to call into the padding.

This ensures only affected CPUs pay the overhead of this accounting.
Unaffected CPUs will leave the padding unused and have their 'JMP
__x86_return_thunk' replaced with an actual 'RET' instruction.

Objtool has been modified to supply a .call_sites section that lists
all the 'CALL' instructions. Additionally, the paravirt patch sites
are iterated, since they will have been patched from an indirect call
into either a direct call or direct instructions; the latter are
simply ignored.

Module handling and the actual thunk code for SKL will be added in
subsequent steps.
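
A conceptual sketch of the call-site adjustment described above;
redirect_call() and write_call_insn() are illustrative stand-ins, not
the helpers this series actually adds, and the thunk layout is
simplified.

  #include <linux/string.h>
  #include <linux/types.h>

  #define CALL_INSN_SIZE  5                       /* e8 <rel32> */

  void write_call_insn(void *at, const void *insn, size_t len);  /* stand-in */

  static void redirect_call(void *callsite, void *dest, unsigned int pad_bytes)
  {
          /* the accounting code was written into the padding in front of dest */
          void *thunk = dest - pad_bytes;
          s32 rel = (s32)(thunk - (callsite + CALL_INSN_SIZE));
          u8 insn[CALL_INSN_SIZE] = { 0xe8, };

          memcpy(insn + 1, &rel, sizeof(rel));
          write_call_insn(callsite, insn, sizeof(insn));
  }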

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/Kconfig                     |   11 +
 arch/x86/include/asm/alternative.h   |   12 +
 arch/x86/kernel/Makefile             |    2 
 arch/x86/kernel/alternative.c        |    6 
 arch/x86/kernel/callthunks.c         |  251 +++++++++++++++++++++++++++++++++++
 arch/x86/kernel/head_64.S            |    1 
 arch/x86/kernel/relocate_kernel_64.S |    5 
 arch/x86/kernel/vmlinux.lds.S        |    8 -
 8 files changed, 286 insertions(+), 10 deletions(-)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2519,6 +2519,17 @@ config CALL_DEPTH_TRACKING
 	  is unused. On affected SKL systems this results in a significant
 	  performance gain over the IBRS mitigation.
 
+config CALL_THUNKS_DEBUG
+	bool "Enable call thunks and call depth tracking debugging"
+	depends on CALL_DEPTH_TRACKING
+	default n
+	help
+	  Enable call/ret counters for imbalance detection and build in
+	  a noisy dmesg about callthunks generation and call patching for
+	  troubleshooting. The debug prints need to be enabled on the
+	  kernel command line with 'debug-callthunks'.
+	  Only enable this when you are debugging call thunks, as this
+	  creates a noticeable runtime overhead. If unsure say N.
 
 config CPU_IBPB_ENTRY
 	bool "Enable IBPB on kernel entry"
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -80,6 +80,18 @@ extern void apply_returns(s32 *start, s3
 extern void apply_ibt_endbr(s32 *start, s32 *end);
 
 struct module;
+struct paravirt_patch_site;
+
+struct callthunk_sites {
+	s32				*call_start, *call_end;
+	struct paravirt_patch_site	*pv_start, *pv_end;
+};
+
+#ifdef CONFIG_CALL_THUNKS
+extern void callthunks_patch_builtin_calls(void);
+#else
+static __always_inline void callthunks_patch_builtin_calls(void) {}
+#endif
 
 #ifdef CONFIG_SMP
 extern void alternatives_smp_module_add(struct module *mod, char *name,
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -139,6 +139,8 @@ obj-$(CONFIG_UNWINDER_GUESS)		+= unwind_
 
 obj-$(CONFIG_AMD_MEM_ENCRYPT)		+= sev.o
 
+obj-$(CONFIG_CALL_THUNKS)		+= callthunks.o
+
 ###
 # 64 bit specific files
 ifeq ($(CONFIG_X86_64),y)
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -938,6 +938,12 @@ void __init alternative_instructions(voi
 	 */
 	apply_alternatives(__alt_instructions, __alt_instructions_end);
 
+	/*
+	 * Now all calls are established. Apply the call thunks if
+	 * required.
+	 */
+	callthunks_patch_builtin_calls();
+
 	apply_ibt_endbr(__ibt_endbr_seal, __ibt_endbr_seal_end);
 
 #ifdef CONFIG_SMP
--- /dev/null
+++ b/arch/x86/kernel/callthunks.c
@@ -0,0 +1,251 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#define pr_fmt(fmt) "callthunks: " fmt
+
+#include <linux/kallsyms.h>
+#include <linux/memory.h>
+#include <linux/moduleloader.h>
+
+#include <asm/alternative.h>
+#include <asm/cpu.h>
+#include <asm/ftrace.h>
+#include <asm/insn.h>
+#include <asm/kexec.h>
+#include <asm/nospec-branch.h>
+#include <asm/paravirt.h>
+#include <asm/sections.h>
+#include <asm/switch_to.h>
+#include <asm/sync_core.h>
+#include <asm/text-patching.h>
+#include <asm/xen/hypercall.h>
+
+static int __initdata_or_module debug_callthunks;
+
+#define prdbg(fmt, args...)					\
+do {								\
+	if (debug_callthunks)					\
+		printk(KERN_DEBUG pr_fmt(fmt), ##args);		\
+} while(0)
+
+static int __init debug_thunks(char *str)
+{
+	debug_callthunks = 1;
+	return 1;
+}
+__setup("debug-callthunks", debug_thunks);
+
+extern s32 __call_sites[], __call_sites_end[];
+
+struct thunk_desc {
+	void		*template;
+	unsigned int	template_size;
+};
+
+struct core_text {
+	unsigned long	base;
+	unsigned long	end;
+	const char	*name;
+};
+
+static bool thunks_initialized __ro_after_init;
+
+static const struct core_text builtin_coretext = {
+	.base = (unsigned long)_text,
+	.end  = (unsigned long)_etext,
+	.name = "builtin",
+};
+
+static struct thunk_desc callthunk_desc __ro_after_init;
+
+extern void error_entry(void);
+extern void xen_error_entry(void);
+extern void paranoid_entry(void);
+
+static inline bool within_coretext(const struct core_text *ct, void *addr)
+{
+	unsigned long p = (unsigned long)addr;
+
+	return ct->base <= p && p < ct->end;
+}
+
+static inline bool within_module_coretext(void *addr)
+{
+	bool ret = false;
+
+#ifdef CONFIG_MODULES
+	struct module *mod;
+
+	preempt_disable();
+	mod = __module_address((unsigned long)addr);
+	if (mod && within_module_core((unsigned long)addr, mod))
+		ret = true;
+	preempt_enable();
+#endif
+	return ret;
+}
+
+static bool is_coretext(const struct core_text *ct, void *addr)
+{
+	if (ct && within_coretext(ct, addr))
+		return true;
+	if (within_coretext(&builtin_coretext, addr))
+		return true;
+	return within_module_coretext(addr);
+}
+
+static __init_or_module bool skip_addr(void *dest)
+{
+	if (dest == error_entry)
+		return true;
+	if (dest == paranoid_entry)
+		return true;
+	if (dest == xen_error_entry)
+		return true;
+	/* Does FILL_RSB... */
+	if (dest == __switch_to_asm)
+		return true;
+	/* Accounts directly */
+	if (dest == ret_from_fork)
+		return true;
+#ifdef CONFIG_HOTPLUG_CPU
+	if (dest == start_cpu0)
+		return true;
+#endif
+#ifdef CONFIG_FUNCTION_TRACER
+	if (dest == __fentry__)
+		return true;
+#endif
+#ifdef CONFIG_KEXEC_CORE
+	if (dest >= (void *)relocate_kernel &&
+	    dest < (void*)relocate_kernel + KEXEC_CONTROL_CODE_MAX_SIZE)
+		return true;
+#endif
+#ifdef CONFIG_XEN
+	if (dest >= (void *)hypercall_page &&
+	    dest < (void*)hypercall_page + PAGE_SIZE)
+		return true;
+#endif
+	return false;
+}
+
+static __init_or_module void *call_get_dest(void *addr)
+{
+	struct insn insn;
+	void *dest;
+	int ret;
+
+	ret = insn_decode_kernel(&insn, addr);
+	if (ret)
+		return ERR_PTR(ret);
+
+	/* Patched out call? */
+	if (insn.opcode.bytes[0] != CALL_INSN_OPCODE)
+		return NULL;
+
+	dest = addr + insn.length + insn.immediate.value;
+	if (skip_addr(dest))
+		return NULL;
+	return dest;
+}
+
+static const u8 nops[] = {
+	0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90,
+	0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90,
+	0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90,
+	0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90,
+};
+
+static __init_or_module void *patch_dest(void *dest, bool direct)
+{
+	unsigned int tsize = callthunk_desc.template_size;
+	u8 *pad = dest - tsize;
+
+	/* Already patched? */
+	if (!bcmp(pad, callthunk_desc.template, tsize))
+		return pad;
+
+	/* Ensure there are nops */
+	if (bcmp(pad, nops, tsize)) {
+		pr_warn_once("Invalid padding area for %pS\n", dest);
+		return NULL;
+	}
+
+	if (direct)
+		memcpy(pad, callthunk_desc.template, tsize);
+	else
+		text_poke_copy_locked(pad, callthunk_desc.template, tsize, true);
+	return pad;
+}
+
+static __init_or_module void patch_call(void *addr, const struct core_text *ct)
+{
+	void *pad, *dest;
+	u8 bytes[8];
+
+	if (!within_coretext(ct, addr))
+		return;
+
+	dest = call_get_dest(addr);
+	if (!dest || WARN_ON_ONCE(IS_ERR(dest)))
+		return;
+
+	if (!is_coretext(ct, dest))
+		return;
+
+	pad = patch_dest(dest, within_coretext(ct, dest));
+	if (!pad)
+		return;
+
+	prdbg("Patch call at: %pS %px to %pS %px -> %px \n", addr, addr,
+		dest, dest, pad);
+	__text_gen_insn(bytes, CALL_INSN_OPCODE, addr, pad, CALL_INSN_SIZE);
+	text_poke_early(addr, bytes, CALL_INSN_SIZE);
+}
+
+static __init_or_module void
+patch_call_sites(s32 *start, s32 *end, const struct core_text *ct)
+{
+	s32 *s;
+
+	for (s = start; s < end; s++)
+		patch_call((void *)s + *s, ct);
+}
+
+static __init_or_module void
+patch_paravirt_call_sites(struct paravirt_patch_site *start,
+			  struct paravirt_patch_site *end,
+			  const struct core_text *ct)
+{
+	struct paravirt_patch_site *p;
+
+	for (p = start; p < end; p++)
+		patch_call(p->instr, ct);
+}
+
+static __init_or_module void
+callthunks_setup(struct callthunk_sites *cs, const struct core_text *ct)
+{
+	prdbg("Patching call sites %s\n", ct->name);
+	patch_call_sites(cs->call_start, cs->call_end, ct);
+	patch_paravirt_call_sites(cs->pv_start, cs->pv_end, ct);
+	prdbg("Patching call sites done %s\n", ct->name);
+}
+
+void __init callthunks_patch_builtin_calls(void)
+{
+	struct callthunk_sites cs = {
+		.call_start	= __call_sites,
+		.call_end	= __call_sites_end,
+		.pv_start	= __parainstructions,
+		.pv_end		= __parainstructions_end
+	};
+
+	if (!cpu_feature_enabled(X86_FEATURE_CALL_DEPTH))
+		return;
+
+	pr_info("Setting up call depth tracking\n");
+	mutex_lock(&text_mutex);
+	callthunks_setup(&cs, &builtin_coretext);
+	thunks_initialized = true;
+	mutex_unlock(&text_mutex);
+}
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -370,6 +370,7 @@ SYM_CODE_END(secondary_startup_64)
  * start_secondary() via .Ljump_to_C_code.
  */
 SYM_CODE_START(start_cpu0)
+	ANNOTATE_NOENDBR
 	UNWIND_HINT_EMPTY
 	movq	initial_stack(%rip), %rsp
 	jmp	.Ljump_to_C_code
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -41,6 +41,7 @@
 	.text
 	.align PAGE_SIZE
 	.code64
+SYM_CODE_START_NOALIGN(relocate_range)
 SYM_CODE_START_NOALIGN(relocate_kernel)
 	UNWIND_HINT_EMPTY
 	ANNOTATE_NOENDBR
@@ -312,5 +313,5 @@ SYM_CODE_START_LOCAL_NOALIGN(swap_pages)
 	int3
 SYM_CODE_END(swap_pages)
 
-	.globl kexec_control_code_size
-.set kexec_control_code_size, . - relocate_kernel
+	.skip KEXEC_CONTROL_CODE_MAX_SIZE - (. - relocate_kernel), 0xcc
+SYM_CODE_END(relocate_range);
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -501,11 +501,3 @@ INIT_PER_CPU(irq_stack_backing_store);
 #endif
 
 #endif /* CONFIG_X86_64 */
-
-#ifdef CONFIG_KEXEC_CORE
-#include <asm/kexec.h>
-
-. = ASSERT(kexec_control_code_size <= KEXEC_CONTROL_CODE_MAX_SIZE,
-           "kexec control code size is too big");
-#endif
-



^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v2 45/59] x86/modules: Add call patching
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (43 preceding siblings ...)
  2022-09-02 13:07 ` [PATCH v2 44/59] x86/callthunks: Add call patching for call depth tracking Peter Zijlstra
@ 2022-09-02 13:07 ` Peter Zijlstra
  2022-09-02 13:07 ` [PATCH v2 46/59] x86/returnthunk: Allow different return thunks Peter Zijlstra
                   ` (14 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

As for the builtins, create call thunks and patch the call sites to call
the thunk on Intel SKL CPUs for retbleed mitigation.

Note that module init functions are ignored for the sake of simplicity
because module loading is not done in hot loops and an attacker has no
real handle on when it happens, so it cannot be timed to launch a
matching attack. The depth tracking still works for calls into the
builtins; because those calls are not accounted, the tracking underflows
faster and stuffs more often than necessary, but that's mitigated by the
saturating counter and the side effect is only temporary.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/alternative.h |    5 +++++
 arch/x86/kernel/callthunks.c       |   19 +++++++++++++++++++
 arch/x86/kernel/module.c           |   20 +++++++++++++++++++-
 3 files changed, 43 insertions(+), 1 deletion(-)

--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -89,8 +89,13 @@ struct callthunk_sites {
 
 #ifdef CONFIG_CALL_THUNKS
 extern void callthunks_patch_builtin_calls(void);
+extern void callthunks_patch_module_calls(struct callthunk_sites *sites,
+					  struct module *mod);
 #else
 static __always_inline void callthunks_patch_builtin_calls(void) {}
+static __always_inline void
+callthunks_patch_module_calls(struct callthunk_sites *sites,
+			      struct module *mod) {}
 #endif
 
 #ifdef CONFIG_SMP
--- a/arch/x86/kernel/callthunks.c
+++ b/arch/x86/kernel/callthunks.c
@@ -249,3 +249,22 @@ void __init callthunks_patch_builtin_cal
 	thunks_initialized = true;
 	mutex_unlock(&text_mutex);
 }
+
+#ifdef CONFIG_MODULES
+void noinline callthunks_patch_module_calls(struct callthunk_sites *cs,
+					    struct module *mod)
+{
+	struct core_text ct = {
+		.base = (unsigned long)mod->core_layout.base,
+		.end  = (unsigned long)mod->core_layout.base + mod->core_layout.size,
+		.name = mod->name,
+	};
+
+	if (!thunks_initialized)
+		return;
+
+	mutex_lock(&text_mutex);
+	callthunks_setup(cs, &ct);
+	mutex_unlock(&text_mutex);
+}
+#endif /* CONFIG_MODULES */
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -254,7 +254,8 @@ int module_finalize(const Elf_Ehdr *hdr,
 {
 	const Elf_Shdr *s, *text = NULL, *alt = NULL, *locks = NULL,
 		*para = NULL, *orc = NULL, *orc_ip = NULL,
-		*retpolines = NULL, *returns = NULL, *ibt_endbr = NULL;
+		*retpolines = NULL, *returns = NULL, *ibt_endbr = NULL,
+		*calls = NULL;
 	char *secstrings = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset;
 
 	for (s = sechdrs; s < sechdrs + hdr->e_shnum; s++) {
@@ -274,6 +275,8 @@ int module_finalize(const Elf_Ehdr *hdr,
 			retpolines = s;
 		if (!strcmp(".return_sites", secstrings + s->sh_name))
 			returns = s;
+		if (!strcmp(".call_sites", secstrings + s->sh_name))
+			calls = s;
 		if (!strcmp(".ibt_endbr_seal", secstrings + s->sh_name))
 			ibt_endbr = s;
 	}
@@ -299,6 +302,21 @@ int module_finalize(const Elf_Ehdr *hdr,
 		void *aseg = (void *)alt->sh_addr;
 		apply_alternatives(aseg, aseg + alt->sh_size);
 	}
+	if (calls || para) {
+		struct callthunk_sites cs = {};
+
+		if (calls) {
+			cs.call_start = (void *)calls->sh_addr;
+			cs.call_end = (void *)calls->sh_addr + calls->sh_size;
+		}
+
+		if (para) {
+			cs.pv_start = (void *)para->sh_addr;
+			cs.pv_end = (void *)para->sh_addr + para->sh_size;
+		}
+
+		callthunks_patch_module_calls(&cs, me);
+	}
 	if (ibt_endbr) {
 		void *iseg = (void *)ibt_endbr->sh_addr;
 		apply_ibt_endbr(iseg, iseg + ibt_endbr->sh_size);



^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v2 46/59] x86/returnthunk: Allow different return thunks
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (44 preceding siblings ...)
  2022-09-02 13:07 ` [PATCH v2 45/59] x86/modules: Add call patching Peter Zijlstra
@ 2022-09-02 13:07 ` Peter Zijlstra
  2022-09-02 13:07 ` [PATCH v2 47/59] x86/asm: Provide ALTERNATIVE_3 Peter Zijlstra
                   ` (13 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Peter Zijlstra <peterz@infradead.org>

In preparation for call depth tracking on Intel SKL CPUs, make it possible
to patch in a SKL specific return thunk.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/nospec-branch.h |    6 ++++++
 arch/x86/kernel/alternative.c        |   19 ++++++++++++++-----
 arch/x86/kernel/ftrace.c             |    2 +-
 arch/x86/kernel/static_call.c        |    2 +-
 arch/x86/net/bpf_jit_comp.c          |    2 +-
 5 files changed, 23 insertions(+), 8 deletions(-)

--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -198,6 +198,12 @@ extern void __x86_return_thunk(void);
 extern void zen_untrain_ret(void);
 extern void entry_ibpb(void);
 
+#ifdef CONFIG_CALL_THUNKS
+extern void (*x86_return_thunk)(void);
+#else
+#define x86_return_thunk	(&__x86_return_thunk)
+#endif
+
 #ifdef CONFIG_RETPOLINE
 
 #define GEN(reg) \
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -509,6 +509,11 @@ void __init_or_module noinline apply_ret
 }
 
 #ifdef CONFIG_RETHUNK
+
+#ifdef CONFIG_CALL_THUNKS
+void (*x86_return_thunk)(void) __ro_after_init = &__x86_return_thunk;
+#endif
+
 /*
  * Rewrite the compiler generated return thunk tail-calls.
  *
@@ -524,14 +529,18 @@ static int patch_return(void *addr, stru
 {
 	int i = 0;
 
-	if (cpu_feature_enabled(X86_FEATURE_RETHUNK))
-		return -1;
-
-	bytes[i++] = RET_INSN_OPCODE;
+	if (cpu_feature_enabled(X86_FEATURE_RETHUNK)) {
+		if (x86_return_thunk == __x86_return_thunk)
+			return -1;
+
+		i = JMP32_INSN_SIZE;
+		__text_gen_insn(bytes, JMP32_INSN_OPCODE, addr, x86_return_thunk, i);
+	} else {
+		bytes[i++] = RET_INSN_OPCODE;
+	}
 
 	for (; i < insn->length;)
 		bytes[i++] = INT3_INSN_OPCODE;
-
 	return i;
 }
 
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -359,7 +359,7 @@ create_trampoline(struct ftrace_ops *ops
 
 	ip = trampoline + size;
 	if (cpu_feature_enabled(X86_FEATURE_RETHUNK))
-		__text_gen_insn(ip, JMP32_INSN_OPCODE, ip, &__x86_return_thunk, JMP32_INSN_SIZE);
+		__text_gen_insn(ip, JMP32_INSN_OPCODE, ip, x86_return_thunk, JMP32_INSN_SIZE);
 	else
 		memcpy(ip, retq, sizeof(retq));
 
--- a/arch/x86/kernel/static_call.c
+++ b/arch/x86/kernel/static_call.c
@@ -52,7 +52,7 @@ static void __ref __static_call_transfor
 
 	case RET:
 		if (cpu_feature_enabled(X86_FEATURE_RETHUNK))
-			code = text_gen_insn(JMP32_INSN_OPCODE, insn, &__x86_return_thunk);
+			code = text_gen_insn(JMP32_INSN_OPCODE, insn, x86_return_thunk);
 		else
 			code = &retinsn;
 		break;
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -430,7 +430,7 @@ static void emit_return(u8 **pprog, u8 *
 	u8 *prog = *pprog;
 
 	if (cpu_feature_enabled(X86_FEATURE_RETHUNK)) {
-		emit_jump(&prog, &__x86_return_thunk, ip);
+		emit_jump(&prog, x86_return_thunk, ip);
 	} else {
 		EMIT1(0xC3);		/* ret */
 		if (IS_ENABLED(CONFIG_SLS))



^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v2 47/59] x86/asm: Provide ALTERNATIVE_3
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (45 preceding siblings ...)
  2022-09-02 13:07 ` [PATCH v2 46/59] x86/returnthunk: Allow different return thunks Peter Zijlstra
@ 2022-09-02 13:07 ` Peter Zijlstra
  2022-09-02 13:07 ` [PATCH v2 48/59] x86/retbleed: Add SKL return thunk Peter Zijlstra
                   ` (12 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Peter Zijlstra <peterz@infradead.org>

Fairly straightforward adaptation/extension of ALTERNATIVE_2.

Required for call depth tracking.
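
As a side note, alt_max_short() below is renamed to alt_max_2() and
alt_max_3() is layered on top of it; both rely on a branchless
maximum. A small user-space C model of the same idiom (illustration
only; in C a true comparison yields 1, so the extra '-' that the gas
version needs for its "true" value of -1 is dropped):

  #include <assert.h>
  #include <stddef.h>

  /* Branchless max: the mask is all ones when a < b, zero otherwise. */
  static size_t alt_max_2(size_t a, size_t b)
  {
  	return a ^ ((a ^ b) & -(size_t)(a < b));
  }

  static size_t alt_max_3(size_t a, size_t b, size_t c)
  {
  	return alt_max_2(alt_max_2(a, b), c);
  }

  int main(void)
  {
  	assert(alt_max_2(3, 7) == 7);
  	assert(alt_max_2(7, 3) == 7);
  	assert(alt_max_3(4, 9, 6) == 9);
  	return 0;
  }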

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/alternative.h |   33 ++++++++++++++++++++++++++++++---
 1 file changed, 30 insertions(+), 3 deletions(-)

--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -364,6 +364,7 @@ static inline int alternatives_text_rese
 #define old_len			141b-140b
 #define new_len1		144f-143f
 #define new_len2		145f-144f
+#define new_len3		146f-145f
 
 /*
  * gas compatible max based on the idea from:
@@ -371,7 +372,8 @@ static inline int alternatives_text_rese
  *
  * The additional "-" is needed because gas uses a "true" value of -1.
  */
-#define alt_max_short(a, b)	((a) ^ (((a) ^ (b)) & -(-((a) < (b)))))
+#define alt_max_2(a, b)		((a) ^ (((a) ^ (b)) & -(-((a) < (b)))))
+#define alt_max_3(a, b, c)	(alt_max_2(alt_max_2(a, b), c))
 
 
 /*
@@ -383,8 +385,8 @@ static inline int alternatives_text_rese
 140:
 	\oldinstr
 141:
-	.skip -((alt_max_short(new_len1, new_len2) - (old_len)) > 0) * \
-		(alt_max_short(new_len1, new_len2) - (old_len)),0x90
+	.skip -((alt_max_2(new_len1, new_len2) - (old_len)) > 0) * \
+		(alt_max_2(new_len1, new_len2) - (old_len)),0x90
 142:
 
 	.pushsection .altinstructions,"a"
@@ -401,6 +403,31 @@ static inline int alternatives_text_rese
 	.popsection
 .endm
 
+.macro ALTERNATIVE_3 oldinstr, newinstr1, feature1, newinstr2, feature2, newinstr3, feature3
+140:
+	\oldinstr
+141:
+	.skip -((alt_max_3(new_len1, new_len2, new_len3) - (old_len)) > 0) * \
+		(alt_max_3(new_len1, new_len2, new_len3) - (old_len)),0x90
+142:
+
+	.pushsection .altinstructions,"a"
+	altinstruction_entry 140b,143f,\feature1,142b-140b,144f-143f
+	altinstruction_entry 140b,144f,\feature2,142b-140b,145f-144f
+	altinstruction_entry 140b,145f,\feature3,142b-140b,146f-145f
+	.popsection
+
+	.pushsection .altinstr_replacement,"ax"
+143:
+	\newinstr1
+144:
+	\newinstr2
+145:
+	\newinstr3
+146:
+	.popsection
+.endm
+
 /* If @feature is set, patch in @newinstr_yes, otherwise @newinstr_no. */
 #define ALTERNATIVE_TERNARY(oldinstr, feature, newinstr_yes, newinstr_no) \
 	ALTERNATIVE_2 oldinstr, newinstr_no, X86_FEATURE_ALWAYS,	\



^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v2 48/59] x86/retbleed: Add SKL return thunk
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (46 preceding siblings ...)
  2022-09-02 13:07 ` [PATCH v2 47/59] x86/asm: Provide ALTERNATIVE_3 Peter Zijlstra
@ 2022-09-02 13:07 ` Peter Zijlstra
  2022-09-02 13:07 ` [PATCH v2 49/59] x86/retpoline: Add SKL retthunk retpolines Peter Zijlstra
                   ` (11 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

To address the Intel SKL RSB underflow issue in software it's required to
do call depth tracking.

Provide a return thunk for call depth tracking on Intel SKL CPUs.

The tracking does not use a counter. It uses arithmetic shift right
on call entry and logical shift left on return.

The depth tracking variable is initialized to 0x8000.... when the call
depth is zero. The arithmetic shift right sign extends the MSB and
saturates after the 12th call. The shift count is 5 so the tracking covers
12 nested calls. On return the variable is shifted left logically so it
becomes zero again.

 CALL	 	   	RET
 0: 0x8000000000000000	0x0000000000000000
 1: 0xfc00000000000000	0xf000000000000000
...
11: 0xfffffffffffffff8	0xfffffffffffffc00
12: 0xffffffffffffffff	0xffffffffffffffe0

After a return buffer fill the depth is credited 12 calls before the next
stuffing has to take place.
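
For illustration, a minimal user-space model of this arithmetic (not
kernel code; the kernel operates on the per-CPU tracking word with
sarq/shlq as in the hunks below):

  #include <stdint.h>
  #include <stdio.h>

  static uint64_t depth = 0x8000000000000000ULL;	/* call depth 0 */

  static void on_call(void)
  {
  	/* sarq $5: arithmetic shift right sign-extends the MSB, so the
  	 * word saturates at all ones after enough calls.  (Signed right
  	 * shift is arithmetic on the usual x86 compilers.) */
  	depth = (uint64_t)((int64_t)depth >> 5);
  }

  /* shlq $5: logical shift left; true when the word hits zero, i.e.
   * the tracked depth is exhausted and the RSB has to be stuffed. */
  static int on_ret(void)
  {
  	depth <<= 5;
  	return depth == 0;
  }

  int main(void)
  {
  	for (int i = 1; i <= 14; i++) {
  		on_call();
  		printf("after call %2d: %016llx\n", i,
  		       (unsigned long long)depth);
  	}
  	for (int i = 1; i <= 14; i++) {
  		if (on_ret()) {
  			printf("ret %2d: stuff RSB, credit counter\n", i);
  			depth = 0xffffffffffffffffULL; /* CREDIT_CALL_DEPTH */
  			break;
  		}
  	}
  	return 0;
  }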

There is an inaccuracy for situations like this:

   10 calls
    5 returns
    3 calls
    4 returns
    3 calls
    ....

The shift count might cause this to be off by one in either direction, but
there is still a cushion vs. the RSB depth. The algorithm does not claim to
be perfect, but it should obfuscate the problem enough to make exploitation
extremely difficult.

The theory behind this is:

RSB is a stack with depth 16 which is filled on every call. On the return
path speculation "pops" entries to speculate down the call chain. Once the
speculative RSB is empty it switches to other predictors, e.g. the Branch
History Buffer, which can be mistrained by user space and misguide the
speculation path to a gadget.

Call depth tracking is designed to break this speculation path by stuffing
speculation trap calls into the RSB, which never get a corresponding
return executed. This stalls the prediction path until it gets resteered.

The assumption is that stuffing at the 12th return is sufficient to break
the speculation before it hits the underflow and the fallback to the other
predictors. Testing confirms that it works. Johannes, one of the retbleed
researchers, tried to attack this approach but failed.

There is obviously no scientific proof that this will withstand future
research progress, but all we can do right now is to speculate about it.

The SAR/SHL usage was suggested by Andi Kleen.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/entry/entry_64.S            |   10 +-
 arch/x86/include/asm/current.h       |    3 
 arch/x86/include/asm/nospec-branch.h |  118 +++++++++++++++++++++++++++++++++--
 arch/x86/kernel/asm-offsets.c        |    3 
 arch/x86/kvm/svm/vmenter.S           |    1 
 arch/x86/lib/retpoline.S             |   31 +++++++++
 6 files changed, 156 insertions(+), 10 deletions(-)

--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -288,6 +288,7 @@ SYM_FUNC_END(__switch_to_asm)
 SYM_CODE_START_NOALIGN(ret_from_fork)
 	UNWIND_HINT_EMPTY
 	ANNOTATE_NOENDBR // copy_thread
+	CALL_DEPTH_ACCOUNT
 	movq	%rax, %rdi
 	call	schedule_tail			/* rdi: 'prev' task parameter */
 
@@ -332,7 +333,7 @@ SYM_CODE_START(xen_error_entry)
 	UNWIND_HINT_FUNC
 	PUSH_AND_CLEAR_REGS save_ret=1
 	ENCODE_FRAME_POINTER 8
-	UNTRAIN_RET
+	UNTRAIN_RET_FROM_CALL
 	RET
 SYM_CODE_END(xen_error_entry)
 
@@ -977,7 +978,7 @@ SYM_CODE_START(paranoid_entry)
 	 * CR3 above, keep the old value in a callee saved register.
 	 */
 	IBRS_ENTER save_reg=%r15
-	UNTRAIN_RET
+	UNTRAIN_RET_FROM_CALL
 
 	RET
 SYM_CODE_END(paranoid_entry)
@@ -1062,7 +1063,7 @@ SYM_CODE_START(error_entry)
 	/* We have user CR3.  Change to kernel CR3. */
 	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 	IBRS_ENTER
-	UNTRAIN_RET
+	UNTRAIN_RET_FROM_CALL
 
 	leaq	8(%rsp), %rdi			/* arg0 = pt_regs pointer */
 	/* Put us onto the real thread stack. */
@@ -1097,6 +1098,7 @@ SYM_CODE_START(error_entry)
 	 */
 .Lerror_entry_done_lfence:
 	FENCE_SWAPGS_KERNEL_ENTRY
+	CALL_DEPTH_ACCOUNT
 	leaq	8(%rsp), %rax			/* return pt_regs pointer */
 	ANNOTATE_UNRET_END
 	RET
@@ -1115,7 +1117,7 @@ SYM_CODE_START(error_entry)
 	FENCE_SWAPGS_USER_ENTRY
 	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 	IBRS_ENTER
-	UNTRAIN_RET
+	UNTRAIN_RET_FROM_CALL
 
 	/*
 	 * Pretend that the exception came from user mode: set up pt_regs
--- a/arch/x86/include/asm/current.h
+++ b/arch/x86/include/asm/current.h
@@ -17,6 +17,9 @@ struct pcpu_hot {
 			struct task_struct	*current_task;
 			int			preempt_count;
 			int			cpu_number;
+#ifdef CONFIG_CALL_DEPTH_TRACKING
+			u64			call_depth;
+#endif
 			unsigned long		top_of_stack;
 			void			*hardirq_stack_ptr;
 			u16			softirq_pending;
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -12,8 +12,80 @@
 #include <asm/msr-index.h>
 #include <asm/unwind_hints.h>
 #include <asm/percpu.h>
+#include <asm/current.h>
 
-#define RETPOLINE_THUNK_SIZE	32
+/*
+ * Call depth tracking for Intel SKL CPUs to address the RSB underflow
+ * issue in software.
+ *
+ * The tracking does not use a counter. It uses arithmetic shift
+ * right on call entry and logical shift left on return.
+ *
+ * The depth tracking variable is initialized to 0x8000.... when the call
+ * depth is zero. The arithmetic shift right sign extends the MSB and
+ * saturates after the 12th call. The shift count is 5 for both directions
+ * so the tracking covers 12 nested calls.
+ *
+ *  Call
+ *  0: 0x8000000000000000	0x0000000000000000
+ *  1: 0xfc00000000000000	0xf000000000000000
+ * ...
+ * 11: 0xfffffffffffffff8	0xfffffffffffffc00
+ * 12: 0xffffffffffffffff	0xffffffffffffffe0
+ *
+ * After a return buffer fill the depth is credited 12 calls before the
+ * next stuffing has to take place.
+ *
+ * There is an inaccuracy for situations like this:
+ *
+ *  10 calls
+ *   5 returns
+ *   3 calls
+ *   4 returns
+ *   3 calls
+ *   ....
+ *
+ * The shift count might cause this to be off by one in either direction,
+ * but there is still a cushion vs. the RSB depth. The algorithm does not
+ * claim to be perfect and it can be speculated around by the CPU, but it
+ * is considered that it obfuscates the problem enough to make exploitation
+ * extremely difficult.
+ */
+#define RET_DEPTH_SHIFT			5
+#define RSB_RET_STUFF_LOOPS		16
+#define RET_DEPTH_INIT			0x8000000000000000ULL
+#define RET_DEPTH_INIT_FROM_CALL	0xfc00000000000000ULL
+#define RET_DEPTH_CREDIT		0xffffffffffffffffULL
+
+#ifdef CONFIG_CALL_DEPTH_TRACKING
+#define CREDIT_CALL_DEPTH					\
+	movq	$-1, PER_CPU_VAR(pcpu_hot + X86_call_depth);
+
+#define ASM_CREDIT_CALL_DEPTH					\
+	movq	$-1, PER_CPU_VAR(pcpu_hot + X86_call_depth);
+
+#define RESET_CALL_DEPTH					\
+	mov	$0x80, %rax;					\
+	shl	$56, %rax;					\
+	movq	%rax, PER_CPU_VAR(pcpu_hot + X86_call_depth);
+
+#define RESET_CALL_DEPTH_FROM_CALL				\
+	mov	$0xfc, %rax;					\
+	shl	$56, %rax;					\
+	movq	%rax, PER_CPU_VAR(pcpu_hot + X86_call_depth);
+
+#define INCREMENT_CALL_DEPTH					\
+	sarq	$5, %gs:pcpu_hot + X86_call_depth;
+
+#define ASM_INCREMENT_CALL_DEPTH				\
+	sarq	$5, PER_CPU_VAR(pcpu_hot + X86_call_depth);
+
+#else
+#define CREDIT_CALL_DEPTH
+#define RESET_CALL_DEPTH
+#define INCREMENT_CALL_DEPTH
+#define RESET_CALL_DEPTH_FROM_CALL
+#endif
 
 /*
  * Fill the CPU return stack buffer.
@@ -32,6 +104,7 @@
  * from C via asm(".include <asm/nospec-branch.h>") but let's not go there.
  */
 
+#define RETPOLINE_THUNK_SIZE	32
 #define RSB_CLEAR_LOOPS		32	/* To forcibly overwrite all entries */
 
 /*
@@ -60,7 +133,8 @@
 	dec	reg;					\
 	jnz	771b;					\
 	/* barrier for jnz misprediction */		\
-	lfence;
+	lfence;						\
+	ASM_CREDIT_CALL_DEPTH
 #else
 /*
  * i386 doesn't unconditionally have LFENCE, as such it can't
@@ -185,11 +259,32 @@
  * where we have a stack but before any RET instruction.
  */
 .macro UNTRAIN_RET
-#if defined(CONFIG_CPU_UNRET_ENTRY) || defined(CONFIG_CPU_IBPB_ENTRY)
+#if defined(CONFIG_CPU_UNRET_ENTRY) || defined(CONFIG_CPU_IBPB_ENTRY) || \
+	defined(CONFIG_X86_FEATURE_CALL_DEPTH)
 	ANNOTATE_UNRET_END
-	ALTERNATIVE_2 "",						\
-	              CALL_ZEN_UNTRAIN_RET, X86_FEATURE_UNRET,		\
-		      "call entry_ibpb", X86_FEATURE_ENTRY_IBPB
+	ALTERNATIVE_3 "",						\
+		      CALL_ZEN_UNTRAIN_RET, X86_FEATURE_UNRET,		\
+		      "call entry_ibpb", X86_FEATURE_ENTRY_IBPB,	\
+		      __stringify(RESET_CALL_DEPTH), X86_FEATURE_CALL_DEPTH
+#endif
+.endm
+
+.macro UNTRAIN_RET_FROM_CALL
+#if defined(CONFIG_CPU_UNRET_ENTRY) || defined(CONFIG_CPU_IBPB_ENTRY) || \
+	defined(CONFIG_X86_FEATURE_CALL_DEPTH)
+	ANNOTATE_UNRET_END
+	ALTERNATIVE_3 "",						\
+		      CALL_ZEN_UNTRAIN_RET, X86_FEATURE_UNRET,		\
+		      "call entry_ibpb", X86_FEATURE_ENTRY_IBPB,	\
+		      __stringify(RESET_CALL_DEPTH_FROM_CALL), X86_FEATURE_CALL_DEPTH
+#endif
+.endm
+
+
+.macro CALL_DEPTH_ACCOUNT
+#ifdef CONFIG_CALL_DEPTH_TRACKING
+	ALTERNATIVE "",							\
+		    __stringify(ASM_INCREMENT_CALL_DEPTH), X86_FEATURE_CALL_DEPTH
 #endif
 .endm
 
@@ -214,6 +309,17 @@ extern void (*x86_return_thunk)(void);
 #define x86_return_thunk	(&__x86_return_thunk)
 #endif
 
+#ifdef CONFIG_CALL_DEPTH_TRACKING
+extern void __x86_return_skl(void);
+
+static inline void x86_set_skl_return_thunk(void)
+{
+	x86_return_thunk = &__x86_return_skl;
+}
+#else
+static inline void x86_set_skl_return_thunk(void) {}
+#endif
+
 #ifdef CONFIG_RETPOLINE
 
 #define GEN(reg) \
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -110,6 +110,9 @@ static void __used common(void)
 	OFFSET(TSS_sp2, tss_struct, x86_tss.sp2);
 
 	OFFSET(X86_top_of_stack, pcpu_hot, top_of_stack);
+#ifdef CONFIG_CALL_DEPTH_TRACKING
+	OFFSET(X86_call_depth, pcpu_hot, call_depth);
+#endif
 
 	if (IS_ENABLED(CONFIG_KVM_INTEL)) {
 		BLANK();
--- a/arch/x86/kvm/svm/vmenter.S
+++ b/arch/x86/kvm/svm/vmenter.S
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #include <linux/linkage.h>
 #include <asm/asm.h>
+#include <asm/asm-offsets.h>
 #include <asm/bitsperlong.h>
 #include <asm/kvm_vcpu_regs.h>
 #include <asm/nospec-branch.h>
--- a/arch/x86/lib/retpoline.S
+++ b/arch/x86/lib/retpoline.S
@@ -5,9 +5,11 @@
 #include <asm/dwarf2.h>
 #include <asm/cpufeatures.h>
 #include <asm/alternative.h>
+#include <asm/asm-offsets.h>
 #include <asm/export.h>
 #include <asm/nospec-branch.h>
 #include <asm/unwind_hints.h>
+#include <asm/percpu.h>
 #include <asm/frame.h>
 
 	.section .text.__x86.indirect_thunk
@@ -140,3 +142,32 @@ SYM_FUNC_END(zen_untrain_ret)
 EXPORT_SYMBOL(__x86_return_thunk)
 
 #endif /* CONFIG_RETHUNK */
+
+#ifdef CONFIG_CALL_DEPTH_TRACKING
+
+	.align 64
+SYM_FUNC_START(__x86_return_skl)
+	ANNOTATE_NOENDBR
+	/* Keep the hotpath in a 16byte I-fetch */
+	shlq	$5, PER_CPU_VAR(pcpu_hot + X86_call_depth)
+	jz	1f
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+1:
+	.rept	16
+	ANNOTATE_INTRA_FUNCTION_CALL
+	call	2f
+	int3
+2:
+	.endr
+	add	$(8*16), %rsp
+
+	CREDIT_CALL_DEPTH
+
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+SYM_FUNC_END(__x86_return_skl)
+
+#endif /* CONFIG_CALL_DEPTH_TRACKING */



^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v2 49/59] x86/retpoline: Add SKL retthunk retpolines
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (47 preceding siblings ...)
  2022-09-02 13:07 ` [PATCH v2 48/59] x86/retbleed: Add SKL return thunk Peter Zijlstra
@ 2022-09-02 13:07 ` Peter Zijlstra
  2022-09-02 13:07 ` [PATCH v2 50/59] x86/retbleed: Add SKL call thunk Peter Zijlstra
                   ` (10 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Peter Zijlstra <peterz@infradead.org>

Ensure that retpolines do the proper call accounting so that the return
accounting works correctly.

Specifically, retpolines are used to replace both 'jmp *%reg' and
'call *%reg'; however, these two cases do not have the same accounting
requirements. Therefore split things up and provide two different
retpoline arrays for SKL.

The 'jmp *%reg' case needs no accounting; the
__x86_indirect_jump_thunk_array[] covers this. The retpoline is
changed to not use the return thunk; it's a simple call;ret construct.

[ strictly speaking it should do:
	andq $(~0x1f), PER_CPU_VAR(__x86_call_depth)
  but we can argue this can be covered by the fuzz we already have
  in the accounting depth (12) vs the RSB depth (16) ]

The 'call *%reg' case does need accounting; the
__x86_indirect_call_thunk_array[] covers this. Again, this retpoline
avoids the use of the return-thunk, in this case to avoid double
accounting.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/nospec-branch.h |   12 +++++
 arch/x86/kernel/alternative.c        |   59 +++++++++++++++++++++++++++--
 arch/x86/lib/retpoline.S             |   71 +++++++++++++++++++++++++++++++----
 arch/x86/net/bpf_jit_comp.c          |    5 +-
 4 files changed, 135 insertions(+), 12 deletions(-)

--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -288,6 +288,8 @@
 
 typedef u8 retpoline_thunk_t[RETPOLINE_THUNK_SIZE];
 extern retpoline_thunk_t __x86_indirect_thunk_array[];
+extern retpoline_thunk_t __x86_indirect_call_thunk_array[];
+extern retpoline_thunk_t __x86_indirect_jump_thunk_array[];
 
 extern void __x86_return_thunk(void);
 extern void zen_untrain_ret(void);
@@ -317,6 +319,16 @@ static inline void x86_set_skl_return_th
 #include <asm/GEN-for-each-reg.h>
 #undef GEN
 
+#define GEN(reg)						\
+	extern retpoline_thunk_t __x86_indirect_call_thunk_ ## reg;
+#include <asm/GEN-for-each-reg.h>
+#undef GEN
+
+#define GEN(reg)						\
+	extern retpoline_thunk_t __x86_indirect_jump_thunk_ ## reg;
+#include <asm/GEN-for-each-reg.h>
+#undef GEN
+
 #ifdef CONFIG_X86_64
 
 /*
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -377,6 +377,56 @@ static int emit_indirect(int op, int reg
 	return i;
 }
 
+static inline bool is_jcc32(struct insn *insn)
+{
+	/* Jcc.d32 second opcode byte is in the range: 0x80-0x8f */
+	return insn->opcode.bytes[0] == 0x0f && (insn->opcode.bytes[1] & 0xf0) == 0x80;
+}
+
+static int emit_call_track_retpoline(void *addr, struct insn *insn, int reg, u8 *bytes)
+{
+	u8 op = insn->opcode.bytes[0];
+	int i = 0;
+
+	/*
+	 * Clang does 'weird' Jcc __x86_indirect_thunk_r11 conditional
+	 * tail-calls. Deal with them.
+	 */
+	if (is_jcc32(insn)) {
+		bytes[i++] = op;
+		op = insn->opcode.bytes[1];
+		goto clang_jcc;
+	}
+
+	if (insn->length == 6)
+		bytes[i++] = 0x2e; /* CS-prefix */
+
+	switch (op) {
+	case CALL_INSN_OPCODE:
+		__text_gen_insn(bytes+i, op, addr+i,
+				__x86_indirect_call_thunk_array[reg],
+				CALL_INSN_SIZE);
+		i += CALL_INSN_SIZE;
+		break;
+
+	case JMP32_INSN_OPCODE:
+clang_jcc:
+		__text_gen_insn(bytes+i, op, addr+i,
+				__x86_indirect_jump_thunk_array[reg],
+				JMP32_INSN_SIZE);
+		i += JMP32_INSN_SIZE;
+		break;
+
+	default:
+		WARN("%pS %px %*ph\n", addr, addr, 6, addr);
+		return -1;
+	}
+
+	WARN_ON_ONCE(i != insn->length);
+
+	return i;
+}
+
 /*
  * Rewrite the compiler generated retpoline thunk calls.
  *
@@ -409,8 +459,12 @@ static int patch_retpoline(void *addr, s
 	BUG_ON(reg == 4);
 
 	if (cpu_feature_enabled(X86_FEATURE_RETPOLINE) &&
-	    !cpu_feature_enabled(X86_FEATURE_RETPOLINE_LFENCE))
+	    !cpu_feature_enabled(X86_FEATURE_RETPOLINE_LFENCE)) {
+		if (cpu_feature_enabled(X86_FEATURE_CALL_DEPTH))
+			return emit_call_track_retpoline(addr, insn, reg, bytes);
+
 		return -1;
+	}
 
 	op = insn->opcode.bytes[0];
 
@@ -427,8 +481,7 @@ static int patch_retpoline(void *addr, s
 	 *   [ NOP ]
 	 * 1:
 	 */
-	/* Jcc.d32 second opcode byte is in the range: 0x80-0x8f */
-	if (op == 0x0f && (insn->opcode.bytes[1] & 0xf0) == 0x80) {
+	if (is_jcc32(insn)) {
 		cc = insn->opcode.bytes[1] & 0xf;
 		cc ^= 1; /* invert condition */
 
--- a/arch/x86/lib/retpoline.S
+++ b/arch/x86/lib/retpoline.S
@@ -14,17 +14,18 @@
 
 	.section .text.__x86.indirect_thunk
 
-.macro RETPOLINE reg
+
+.macro POLINE reg
 	ANNOTATE_INTRA_FUNCTION_CALL
 	call    .Ldo_rop_\@
-.Lspec_trap_\@:
-	UNWIND_HINT_EMPTY
-	pause
-	lfence
-	jmp .Lspec_trap_\@
+	int3
 .Ldo_rop_\@:
 	mov     %\reg, (%_ASM_SP)
 	UNWIND_HINT_FUNC
+.endm
+
+.macro RETPOLINE reg
+	POLINE \reg
 	RET
 .endm
 
@@ -54,7 +55,6 @@ SYM_INNER_LABEL(__x86_indirect_thunk_\re
  */
 
 #define __EXPORT_THUNK(sym)	_ASM_NOKPROBE(sym); EXPORT_SYMBOL(sym)
-#define EXPORT_THUNK(reg)	__EXPORT_THUNK(__x86_indirect_thunk_ ## reg)
 
 	.align RETPOLINE_THUNK_SIZE
 SYM_CODE_START(__x86_indirect_thunk_array)
@@ -66,10 +66,65 @@ SYM_CODE_START(__x86_indirect_thunk_arra
 	.align RETPOLINE_THUNK_SIZE
 SYM_CODE_END(__x86_indirect_thunk_array)
 
-#define GEN(reg) EXPORT_THUNK(reg)
+#define GEN(reg) __EXPORT_THUNK(__x86_indirect_thunk_ ## reg)
+#include <asm/GEN-for-each-reg.h>
+#undef GEN
+
+#ifdef CONFIG_CALL_DEPTH_TRACKING
+.macro CALL_THUNK reg
+	.align RETPOLINE_THUNK_SIZE
+
+SYM_INNER_LABEL(__x86_indirect_call_thunk_\reg, SYM_L_GLOBAL)
+	UNWIND_HINT_EMPTY
+	ANNOTATE_NOENDBR
+
+	CALL_DEPTH_ACCOUNT
+	POLINE \reg
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+.endm
+
+	.align RETPOLINE_THUNK_SIZE
+SYM_CODE_START(__x86_indirect_call_thunk_array)
+
+#define GEN(reg) CALL_THUNK reg
+#include <asm/GEN-for-each-reg.h>
+#undef GEN
+
+	.align RETPOLINE_THUNK_SIZE
+SYM_CODE_END(__x86_indirect_call_thunk_array)
+
+#define GEN(reg) __EXPORT_THUNK(__x86_indirect_call_thunk_ ## reg)
 #include <asm/GEN-for-each-reg.h>
 #undef GEN
 
+.macro JUMP_THUNK reg
+	.align RETPOLINE_THUNK_SIZE
+
+SYM_INNER_LABEL(__x86_indirect_jump_thunk_\reg, SYM_L_GLOBAL)
+	UNWIND_HINT_EMPTY
+	ANNOTATE_NOENDBR
+	POLINE \reg
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+.endm
+
+	.align RETPOLINE_THUNK_SIZE
+SYM_CODE_START(__x86_indirect_jump_thunk_array)
+
+#define GEN(reg) JUMP_THUNK reg
+#include <asm/GEN-for-each-reg.h>
+#undef GEN
+
+	.align RETPOLINE_THUNK_SIZE
+SYM_CODE_END(__x86_indirect_jump_thunk_array)
+
+#define GEN(reg) __EXPORT_THUNK(__x86_indirect_jump_thunk_ ## reg)
+#include <asm/GEN-for-each-reg.h>
+#undef GEN
+#endif
 /*
  * This function name is magical and is used by -mfunction-return=thunk-extern
  * for the compiler to generate JMPs to it.
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -417,7 +417,10 @@ static void emit_indirect_jump(u8 **ppro
 		EMIT2(0xFF, 0xE0 + reg);
 	} else if (cpu_feature_enabled(X86_FEATURE_RETPOLINE)) {
 		OPTIMIZER_HIDE_VAR(reg);
-		emit_jump(&prog, &__x86_indirect_thunk_array[reg], ip);
+		if (cpu_feature_enabled(X86_FEATURE_CALL_DEPTH))
+			emit_jump(&prog, &__x86_indirect_jump_thunk_array[reg], ip);
+		else
+			emit_jump(&prog, &__x86_indirect_thunk_array[reg], ip);
 	} else {
 		EMIT2(0xFF, 0xE0 + reg);
 	}



^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v2 50/59] x86/retbleed: Add SKL call thunk
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (48 preceding siblings ...)
  2022-09-02 13:07 ` [PATCH v2 49/59] x86/retpoline: Add SKL retthunk retpolines Peter Zijlstra
@ 2022-09-02 13:07 ` Peter Zijlstra
  2022-09-02 13:07 ` [PATCH v2 51/59] x86/calldepth: Add ret/call counting for debug Peter Zijlstra
                   ` (9 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

Add the actual SKL call thunk for call depth accounting.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/callthunks.c |   25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

--- a/arch/x86/kernel/callthunks.c
+++ b/arch/x86/kernel/callthunks.c
@@ -7,6 +7,7 @@
 #include <linux/moduleloader.h>
 
 #include <asm/alternative.h>
+#include <asm/asm-offsets.h>
 #include <asm/cpu.h>
 #include <asm/ftrace.h>
 #include <asm/insn.h>
@@ -55,7 +56,21 @@ static const struct core_text builtin_co
 	.name = "builtin",
 };
 
-static struct thunk_desc callthunk_desc __ro_after_init;
+asm (
+	".pushsection .rodata				\n"
+	".global skl_call_thunk_template		\n"
+	"skl_call_thunk_template:			\n"
+		__stringify(INCREMENT_CALL_DEPTH)"	\n"
+	".global skl_call_thunk_tail			\n"
+	"skl_call_thunk_tail:				\n"
+	".popsection					\n"
+);
+
+extern u8 skl_call_thunk_template[];
+extern u8 skl_call_thunk_tail[];
+
+#define SKL_TMPL_SIZE \
+	((unsigned int)(skl_call_thunk_tail - skl_call_thunk_template))
 
 extern void error_entry(void);
 extern void xen_error_entry(void);
@@ -157,11 +172,11 @@ static const u8 nops[] = {
 
 static __init_or_module void *patch_dest(void *dest, bool direct)
 {
-	unsigned int tsize = callthunk_desc.template_size;
+	unsigned int tsize = SKL_TMPL_SIZE;
 	u8 *pad = dest - tsize;
 
 	/* Already patched? */
-	if (!bcmp(pad, callthunk_desc.template, tsize))
+	if (!bcmp(pad, skl_call_thunk_template, tsize))
 		return pad;
 
 	/* Ensure there are nops */
@@ -171,9 +186,9 @@ static __init_or_module void *patch_dest
 	}
 
 	if (direct)
-		memcpy(pad, callthunk_desc.template, tsize);
+		memcpy(pad, skl_call_thunk_template, tsize);
 	else
-		text_poke_copy_locked(pad, callthunk_desc.template, tsize, true);
+		text_poke_copy_locked(pad, skl_call_thunk_template, tsize, true);
 	return pad;
 }
 



^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v2 51/59] x86/calldepth: Add ret/call counting for debug
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (49 preceding siblings ...)
  2022-09-02 13:07 ` [PATCH v2 50/59] x86/retbleed: Add SKL call thunk Peter Zijlstra
@ 2022-09-02 13:07 ` Peter Zijlstra
  2022-09-02 13:07 ` [PATCH v2 52/59] static_call: Add call depth tracking support Peter Zijlstra
                   ` (8 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

Add a debugfs mechanism to validate the accounting, e.g. the call/ret
balance, and to gather statistics about the stuffing-to-call ratio.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/nospec-branch.h |   36 +++++++++++++++++++++--
 arch/x86/kernel/callthunks.c         |   53 +++++++++++++++++++++++++++++++++++
 arch/x86/lib/retpoline.S             |    7 +++-
 3 files changed, 91 insertions(+), 5 deletions(-)

--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -57,6 +57,22 @@
 #define RET_DEPTH_INIT_FROM_CALL	0xfc00000000000000ULL
 #define RET_DEPTH_CREDIT		0xffffffffffffffffULL
 
+#ifdef CONFIG_CALL_THUNKS_DEBUG
+# define CALL_THUNKS_DEBUG_INC_CALLS				\
+	incq	%gs:__x86_call_count;
+# define CALL_THUNKS_DEBUG_INC_RETS				\
+	incq	%gs:__x86_ret_count;
+# define CALL_THUNKS_DEBUG_INC_STUFFS				\
+	incq	%gs:__x86_stuffs_count;
+# define CALL_THUNKS_DEBUG_INC_CTXSW				\
+	incq	%gs:__x86_ctxsw_count;
+#else
+# define CALL_THUNKS_DEBUG_INC_CALLS
+# define CALL_THUNKS_DEBUG_INC_RETS
+# define CALL_THUNKS_DEBUG_INC_STUFFS
+# define CALL_THUNKS_DEBUG_INC_CTXSW
+#endif
+
 #ifdef CONFIG_CALL_DEPTH_TRACKING
 #define CREDIT_CALL_DEPTH					\
 	movq	$-1, PER_CPU_VAR(pcpu_hot + X86_call_depth);
@@ -72,18 +88,23 @@
 #define RESET_CALL_DEPTH_FROM_CALL				\
 	mov	$0xfc, %rax;					\
 	shl	$56, %rax;					\
-	movq	%rax, PER_CPU_VAR(pcpu_hot + X86_call_depth);
+	movq	%rax, PER_CPU_VAR(pcpu_hot + X86_call_depth);	\
+	CALL_THUNKS_DEBUG_INC_CALLS
 
 #define INCREMENT_CALL_DEPTH					\
-	sarq	$5, %gs:pcpu_hot + X86_call_depth;
+	sarq	$5, %gs:pcpu_hot + X86_call_depth;		\
+	CALL_THUNKS_DEBUG_INC_CALLS
 
 #define ASM_INCREMENT_CALL_DEPTH				\
-	sarq	$5, PER_CPU_VAR(pcpu_hot + X86_call_depth);
+	sarq	$5, PER_CPU_VAR(pcpu_hot + X86_call_depth);	\
+	CALL_THUNKS_DEBUG_INC_CALLS
 
 #else
 #define CREDIT_CALL_DEPTH
+#define ASM_CREDIT_CALL_DEPTH
 #define RESET_CALL_DEPTH
 #define INCREMENT_CALL_DEPTH
+#define ASM_INCREMENT_CALL_DEPTH
 #define RESET_CALL_DEPTH_FROM_CALL
 #endif
 
@@ -134,7 +155,8 @@
 	jnz	771b;					\
 	/* barrier for jnz misprediction */		\
 	lfence;						\
-	ASM_CREDIT_CALL_DEPTH
+	ASM_CREDIT_CALL_DEPTH				\
+	CALL_THUNKS_DEBUG_INC_CTXSW
 #else
 /*
  * i386 doesn't unconditionally have LFENCE, as such it can't
@@ -319,6 +341,12 @@ static inline void x86_set_skl_return_th
 {
 	x86_return_thunk = &__x86_return_skl;
 }
+#ifdef CONFIG_CALL_THUNKS_DEBUG
+DECLARE_PER_CPU(u64, __x86_call_count);
+DECLARE_PER_CPU(u64, __x86_ret_count);
+DECLARE_PER_CPU(u64, __x86_stuffs_count);
+DECLARE_PER_CPU(u64, __x86_ctxsw_count);
+#endif
 #else
 static inline void x86_set_skl_return_thunk(void) {}
 #endif
--- a/arch/x86/kernel/callthunks.c
+++ b/arch/x86/kernel/callthunks.c
@@ -2,6 +2,7 @@
 
 #define pr_fmt(fmt) "callthunks: " fmt
 
+#include <linux/debugfs.h>
 #include <linux/kallsyms.h>
 #include <linux/memory.h>
 #include <linux/moduleloader.h>
@@ -35,6 +36,15 @@ static int __init debug_thunks(char *str
 }
 __setup("debug-callthunks", debug_thunks);
 
+#ifdef CONFIG_CALL_THUNKS_DEBUG
+DEFINE_PER_CPU(u64, __x86_call_count);
+DEFINE_PER_CPU(u64, __x86_ret_count);
+DEFINE_PER_CPU(u64, __x86_stuffs_count);
+DEFINE_PER_CPU(u64, __x86_ctxsw_count);
+EXPORT_SYMBOL_GPL(__x86_ctxsw_count);
+EXPORT_SYMBOL_GPL(__x86_call_count);
+#endif
+
 extern s32 __call_sites[], __call_sites_end[];
 
 struct thunk_desc {
@@ -283,3 +293,46 @@ void noinline callthunks_patch_module_ca
 	mutex_unlock(&text_mutex);
 }
 #endif /* CONFIG_MODULES */
+
+#if defined(CONFIG_CALL_THUNKS_DEBUG) && defined(CONFIG_DEBUG_FS)
+static int callthunks_debug_show(struct seq_file *m, void *p)
+{
+	unsigned long cpu = (unsigned long)m->private;
+
+	seq_printf(m, "C: %16llu R: %16llu S: %16llu X: %16llu\n",
+		   per_cpu(__x86_call_count, cpu),
+		   per_cpu(__x86_ret_count, cpu),
+		   per_cpu(__x86_stuffs_count, cpu),
+		   per_cpu(__x86_ctxsw_count, cpu));
+	return 0;
+}
+
+static int callthunks_debug_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, callthunks_debug_show, inode->i_private);
+}
+
+static const struct file_operations dfs_ops = {
+	.open		= callthunks_debug_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static int __init callthunks_debugfs_init(void)
+{
+	struct dentry *dir;
+	unsigned long cpu;
+
+	dir = debugfs_create_dir("callthunks", NULL);
+	for_each_possible_cpu(cpu) {
+		void *arg = (void *)cpu;
+		char name [10];
+
+		sprintf(name, "cpu%lu", cpu);
+		debugfs_create_file(name, 0644, dir, arg, &dfs_ops);
+	}
+	return 0;
+}
+__initcall(callthunks_debugfs_init);
+#endif
--- a/arch/x86/lib/retpoline.S
+++ b/arch/x86/lib/retpoline.S
@@ -203,13 +203,18 @@ EXPORT_SYMBOL(__x86_return_thunk)
 	.align 64
 SYM_FUNC_START(__x86_return_skl)
 	ANNOTATE_NOENDBR
-	/* Keep the hotpath in a 16byte I-fetch */
+	/*
+	 * Keep the hotpath in a 16byte I-fetch for the non-debug
+	 * case.
+	 */
+	CALL_THUNKS_DEBUG_INC_RETS
 	shlq	$5, PER_CPU_VAR(pcpu_hot + X86_call_depth)
 	jz	1f
 	ANNOTATE_UNRET_SAFE
 	ret
 	int3
 1:
+	CALL_THUNKS_DEBUG_INC_STUFFS
 	.rept	16
 	ANNOTATE_INTRA_FUNCTION_CALL
 	call	2f



^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v2 52/59] static_call: Add call depth tracking support
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (50 preceding siblings ...)
  2022-09-02 13:07 ` [PATCH v2 51/59] x86/calldepth: Add ret/call counting for debug Peter Zijlstra
@ 2022-09-02 13:07 ` Peter Zijlstra
  2022-09-02 13:07 ` [PATCH v2 53/59] kallsyms: Take callthunks into account Peter Zijlstra
                   ` (7 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Peter Zijlstra <peterz@infradead.org>

When indirect calls are switched to direct calls, it has to be ensured
that the call target is not the function itself but its call thunk when
call depth tracking is enabled. But static calls are available before call
thunks have been set up.

Ensure a second run through the static call patching code after call thunks
have been created. When call thunks are not enabled this has no side
effects.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/alternative.h |    5 +++++
 arch/x86/kernel/callthunks.c       |   18 ++++++++++++++++++
 arch/x86/kernel/static_call.c      |    1 +
 include/linux/static_call.h        |    2 ++
 kernel/static_call_inline.c        |   23 ++++++++++++++++++-----
 5 files changed, 44 insertions(+), 5 deletions(-)

--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -91,11 +91,16 @@ struct callthunk_sites {
 extern void callthunks_patch_builtin_calls(void);
 extern void callthunks_patch_module_calls(struct callthunk_sites *sites,
 					  struct module *mod);
+extern void *callthunks_translate_call_dest(void *dest);
 #else
 static __always_inline void callthunks_patch_builtin_calls(void) {}
 static __always_inline void
 callthunks_patch_module_calls(struct callthunk_sites *sites,
 			      struct module *mod) {}
+static __always_inline void *callthunks_translate_call_dest(void *dest)
+{
+	return dest;
+}
 #endif
 
 #ifdef CONFIG_SMP
--- a/arch/x86/kernel/callthunks.c
+++ b/arch/x86/kernel/callthunks.c
@@ -6,6 +6,7 @@
 #include <linux/kallsyms.h>
 #include <linux/memory.h>
 #include <linux/moduleloader.h>
+#include <linux/static_call.h>
 
 #include <asm/alternative.h>
 #include <asm/asm-offsets.h>
@@ -270,10 +271,27 @@ void __init callthunks_patch_builtin_cal
 	pr_info("Setting up call depth tracking\n");
 	mutex_lock(&text_mutex);
 	callthunks_setup(&cs, &builtin_coretext);
+	static_call_force_reinit();
 	thunks_initialized = true;
 	mutex_unlock(&text_mutex);
 }
 
+void *callthunks_translate_call_dest(void *dest)
+{
+	void *target;
+
+	lockdep_assert_held(&text_mutex);
+
+	if (!thunks_initialized || skip_addr(dest))
+		return dest;
+
+	if (!is_coretext(NULL, dest))
+		return dest;
+
+	target = patch_dest(dest, false);
+	return target ? : dest;
+}
+
 #ifdef CONFIG_MODULES
 void noinline callthunks_patch_module_calls(struct callthunk_sites *cs,
 					    struct module *mod)
--- a/arch/x86/kernel/static_call.c
+++ b/arch/x86/kernel/static_call.c
@@ -34,6 +34,7 @@ static void __ref __static_call_transfor
 
 	switch (type) {
 	case CALL:
+		func = callthunks_translate_call_dest(func);
 		code = text_gen_insn(CALL_INSN_OPCODE, insn, func);
 		if (func == &__static_call_return0) {
 			emulate = code;
--- a/include/linux/static_call.h
+++ b/include/linux/static_call.h
@@ -162,6 +162,8 @@ extern void arch_static_call_transform(v
 
 extern int __init static_call_init(void);
 
+extern void static_call_force_reinit(void);
+
 struct static_call_mod {
 	struct static_call_mod *next;
 	struct module *mod; /* for vmlinux, mod == NULL */
--- a/kernel/static_call_inline.c
+++ b/kernel/static_call_inline.c
@@ -15,7 +15,18 @@ extern struct static_call_site __start_s
 extern struct static_call_tramp_key __start_static_call_tramp_key[],
 				    __stop_static_call_tramp_key[];
 
-static bool static_call_initialized;
+static int static_call_initialized;
+
+/*
+ * Must be called before early_initcall() to be effective.
+ */
+void static_call_force_reinit(void)
+{
+	if (WARN_ON_ONCE(!static_call_initialized))
+		return;
+
+	static_call_initialized++;
+}
 
 /* mutex to protect key modules/sites */
 static DEFINE_MUTEX(static_call_mutex);
@@ -475,7 +486,8 @@ int __init static_call_init(void)
 {
 	int ret;
 
-	if (static_call_initialized)
+	/* See static_call_force_reinit(). */
+	if (static_call_initialized == 1)
 		return 0;
 
 	cpus_read_lock();
@@ -490,11 +502,12 @@ int __init static_call_init(void)
 		BUG();
 	}
 
-	static_call_initialized = true;
-
 #ifdef CONFIG_MODULES
-	register_module_notifier(&static_call_module_nb);
+	if (!static_call_initialized)
+		register_module_notifier(&static_call_module_nb);
 #endif
+
+	static_call_initialized = 1;
 	return 0;
 }
 early_initcall(static_call_init);



^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v2 53/59] kallsyms: Take callthunks into account
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (51 preceding siblings ...)
  2022-09-02 13:07 ` [PATCH v2 52/59] static_call: Add call depth tracking support Peter Zijlstra
@ 2022-09-02 13:07 ` Peter Zijlstra
  2022-09-02 13:07 ` [PATCH v2 54/59] x86/orc: Make it callthunk aware Peter Zijlstra
                   ` (6 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Peter Zijlstra <peterz@infradead.org>

Since the pre-symbol function padding is an integral part of the
symbol, make kallsyms report addresses inside it as part of that
symbol, i.e. as sym-x instead of prev_sym+y.
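
As a purely hypothetical illustration (symbol names, sizes and the
16-byte padding are made up for the example): an address 6 bytes into
the padding in front of bar() used to be resolved against the
preceding symbol, e.g.

  foo+0x2a/0x30

and is now reported relative to the symbol it actually belongs to:

  bar-0x6/0x40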

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/kallsyms.c |   45 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 40 insertions(+), 5 deletions(-)

--- a/kernel/kallsyms.c
+++ b/kernel/kallsyms.c
@@ -292,6 +292,12 @@ static unsigned long get_symbol_pos(unsi
 	return low;
 }
 
+#ifdef CONFIG_FUNCTION_PADDING_BYTES
+#define PADDING_BYTES	CONFIG_FUNCTION_PADDING_BYTES
+#else
+#define PADDING_BYTES	0
+#endif
+
 /*
  * Lookup an address but don't bother to find any names.
  */
@@ -299,13 +305,25 @@ int kallsyms_lookup_size_offset(unsigned
 				unsigned long *offset)
 {
 	char namebuf[KSYM_NAME_LEN];
+	int ret;
+
+	addr += PADDING_BYTES;
 
 	if (is_ksym_addr(addr)) {
 		get_symbol_pos(addr, symbolsize, offset);
-		return 1;
+		ret = 1;
+		goto found;
 	}
-	return !!module_address_lookup(addr, symbolsize, offset, NULL, NULL, namebuf) ||
-	       !!__bpf_address_lookup(addr, symbolsize, offset, namebuf);
+
+	ret = !!module_address_lookup(addr, symbolsize, offset, NULL, NULL, namebuf);
+	if (!ret) {
+		ret = !!__bpf_address_lookup(addr, symbolsize,
+					     offset, namebuf);
+	}
+found:
+	if (ret && offset)
+		*offset -= PADDING_BYTES;
+	return ret;
 }
 
 static const char *kallsyms_lookup_buildid(unsigned long addr,
@@ -318,6 +336,8 @@ static const char *kallsyms_lookup_build
 	namebuf[KSYM_NAME_LEN - 1] = 0;
 	namebuf[0] = 0;
 
+	addr += PADDING_BYTES;
+
 	if (is_ksym_addr(addr)) {
 		unsigned long pos;
 
@@ -347,6 +367,8 @@ static const char *kallsyms_lookup_build
 
 found:
 	cleanup_symbol_name(namebuf);
+	if (ret && offset)
+		*offset -= PADDING_BYTES;
 	return ret;
 }
 
@@ -373,6 +395,8 @@ int lookup_symbol_name(unsigned long add
 	symname[0] = '\0';
 	symname[KSYM_NAME_LEN - 1] = '\0';
 
+	addr += PADDING_BYTES;
+
 	if (is_ksym_addr(addr)) {
 		unsigned long pos;
 
@@ -400,6 +424,8 @@ int lookup_symbol_attrs(unsigned long ad
 	name[0] = '\0';
 	name[KSYM_NAME_LEN - 1] = '\0';
 
+	addr += PADDING_BYTES;
+
 	if (is_ksym_addr(addr)) {
 		unsigned long pos;
 
@@ -416,6 +442,8 @@ int lookup_symbol_attrs(unsigned long ad
 		return res;
 
 found:
+	if (offset)
+		*offset -= PADDING_BYTES;
 	cleanup_symbol_name(name);
 	return 0;
 }
@@ -441,8 +469,15 @@ static int __sprint_symbol(char *buffer,
 	len = strlen(buffer);
 	offset -= symbol_offset;
 
-	if (add_offset)
-		len += sprintf(buffer + len, "+%#lx/%#lx", offset, size);
+	if (add_offset) {
+		char s = '+';
+
+		if ((long)offset < 0) {
+			s = '-';
+			offset = 0UL - offset;
+		}
+		len += sprintf(buffer + len, "%c%#lx/%#lx", s, offset, size);
+	}
 
 	if (modname) {
 		len += sprintf(buffer + len, " [%s", modname);



^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v2 54/59] x86/orc: Make it callthunk aware
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (52 preceding siblings ...)
  2022-09-02 13:07 ` [PATCH v2 53/59] kallsyms: Take callthunks into account Peter Zijlstra
@ 2022-09-02 13:07 ` Peter Zijlstra
  2022-09-02 13:07 ` [PATCH v2 55/59] x86/bpf: Emit call depth accounting if required Peter Zijlstra
                   ` (5 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Peter Zijlstra <peterz@infradead.org>

Callthunk addresses on the stack would confuse the ORC unwinder. Handle
them correctly and tell ORC to proceed further down the stack.

Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/alternative.h |    5 +++++
 arch/x86/kernel/callthunks.c       |   13 +++++++++++++
 arch/x86/kernel/unwind_orc.c       |   21 ++++++++++++++++++++-
 3 files changed, 38 insertions(+), 1 deletion(-)

--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -92,6 +92,7 @@ extern void callthunks_patch_builtin_cal
 extern void callthunks_patch_module_calls(struct callthunk_sites *sites,
 					  struct module *mod);
 extern void *callthunks_translate_call_dest(void *dest);
+extern bool is_callthunk(void *addr);
 #else
 static __always_inline void callthunks_patch_builtin_calls(void) {}
 static __always_inline void
@@ -101,6 +102,10 @@ static __always_inline void *callthunks_
 {
 	return dest;
 }
+static __always_inline bool is_callthunk(void *addr)
+{
+	return false;
+}
 #endif
 
 #ifdef CONFIG_SMP
--- a/arch/x86/kernel/callthunks.c
+++ b/arch/x86/kernel/callthunks.c
@@ -292,6 +292,19 @@ void *callthunks_translate_call_dest(voi
 	return target ? : dest;
 }
 
+bool is_callthunk(void *addr)
+{
+	unsigned int tmpl_size = SKL_TMPL_SIZE;
+	void *tmpl = skl_call_thunk_template;
+	unsigned long dest;
+
+	dest = roundup((unsigned long)addr, CONFIG_FUNCTION_ALIGNMENT);
+	if (!thunks_initialized || skip_addr((void *)dest))
+		return false;
+
+	return !bcmp((void *)(dest - tmpl_size), tmpl, tmpl_size);
+}
+
 #ifdef CONFIG_MODULES
 void noinline callthunks_patch_module_calls(struct callthunk_sites *cs,
 					    struct module *mod)
--- a/arch/x86/kernel/unwind_orc.c
+++ b/arch/x86/kernel/unwind_orc.c
@@ -131,6 +131,21 @@ static struct orc_entry null_orc_entry =
 	.type = UNWIND_HINT_TYPE_CALL
 };
 
+#ifdef CONFIG_CALL_THUNKS
+static struct orc_entry *orc_callthunk_find(unsigned long ip)
+{
+	if (!is_callthunk((void *)ip))
+		return NULL;
+
+	return &null_orc_entry;
+}
+#else
+static struct orc_entry *orc_callthunk_find(unsigned long ip)
+{
+	return NULL;
+}
+#endif
+
 /* Fake frame pointer entry -- used as a fallback for generated code */
 static struct orc_entry orc_fp_entry = {
 	.type		= UNWIND_HINT_TYPE_CALL,
@@ -184,7 +199,11 @@ static struct orc_entry *orc_find(unsign
 	if (orc)
 		return orc;
 
-	return orc_ftrace_find(ip);
+	orc =  orc_ftrace_find(ip);
+	if (orc)
+		return orc;
+
+	return orc_callthunk_find(ip);
 }
 
 #ifdef CONFIG_MODULES



^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v2 55/59] x86/bpf: Emit call depth accounting if required
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (53 preceding siblings ...)
  2022-09-02 13:07 ` [PATCH v2 54/59] x86/orc: Make it callthunk aware Peter Zijlstra
@ 2022-09-02 13:07 ` Peter Zijlstra
  2022-09-02 13:07 ` [PATCH v2 56/59] x86/ftrace: Remove ftrace_epilogue() Peter Zijlstra
                   ` (4 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

Ensure that calls in BPF JITed programs emit call depth accounting when
it is enabled, to keep the call/return accounting balanced. The return
thunk jump is already injected due to the earlier retbleed mitigations.
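
As a rough sketch of the resulting code layout (not the exact byte
sequence; the accounting template and its size come from the callthunk
code as SKL_TMPL_SIZE):

	<call depth accounting template>	# x86_call_depth_emit_accounting()
	call	<target>			# 5 byte E8 rel32

which is why the instruction pointer handed to emit_patch()/emit_call()
has to be advanced by the number of accounting bytes emitted in front of
the call, as done via the offs calculation below.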

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/alternative.h |    6 ++++++
 arch/x86/kernel/callthunks.c       |   19 +++++++++++++++++++
 arch/x86/net/bpf_jit_comp.c        |   32 +++++++++++++++++++++++---------
 3 files changed, 48 insertions(+), 9 deletions(-)

--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -93,6 +93,7 @@ extern void callthunks_patch_module_call
 					  struct module *mod);
 extern void *callthunks_translate_call_dest(void *dest);
 extern bool is_callthunk(void *addr);
+extern int x86_call_depth_emit_accounting(u8 **pprog, void *func);
 #else
 static __always_inline void callthunks_patch_builtin_calls(void) {}
 static __always_inline void
@@ -106,6 +107,11 @@ static __always_inline bool is_callthunk
 {
 	return false;
 }
+static __always_inline int x86_call_depth_emit_accounting(u8 **pprog,
+							  void *func)
+{
+	return 0;
+}
 #endif
 
 #ifdef CONFIG_SMP
--- a/arch/x86/kernel/callthunks.c
+++ b/arch/x86/kernel/callthunks.c
@@ -305,6 +305,25 @@ bool is_callthunk(void *addr)
 	return !bcmp((void *)(dest - tmpl_size), tmpl, tmpl_size);
 }
 
+#ifdef CONFIG_BPF_JIT
+int x86_call_depth_emit_accounting(u8 **pprog, void *func)
+{
+	unsigned int tmpl_size = SKL_TMPL_SIZE;
+	void *tmpl = skl_call_thunk_template;
+
+	if (!thunks_initialized)
+		return 0;
+
+	/* Is function call target a thunk? */
+	if (is_callthunk(func))
+		return 0;
+
+	memcpy(*pprog, tmpl, tmpl_size);
+	*pprog += tmpl_size;
+	return tmpl_size;
+}
+#endif
+
 #ifdef CONFIG_MODULES
 void noinline callthunks_patch_module_calls(struct callthunk_sites *cs,
 					    struct module *mod)
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -340,6 +340,13 @@ static int emit_call(u8 **pprog, void *f
 	return emit_patch(pprog, func, ip, 0xE8);
 }
 
+static int emit_rsb_call(u8 **pprog, void *func, void *ip)
+{
+	OPTIMIZER_HIDE_VAR(func);
+	x86_call_depth_emit_accounting(pprog, func);
+	return emit_patch(pprog, func, ip, 0xE8);
+}
+
 static int emit_jump(u8 **pprog, void *func, void *ip)
 {
 	return emit_patch(pprog, func, ip, 0xE9);
@@ -1434,19 +1441,26 @@ st:			if (is_imm8(insn->off))
 			break;
 
 			/* call */
-		case BPF_JMP | BPF_CALL:
+		case BPF_JMP | BPF_CALL: {
+			int offs;
+
 			func = (u8 *) __bpf_call_base + imm32;
 			if (tail_call_reachable) {
 				/* mov rax, qword ptr [rbp - rounded_stack_depth - 8] */
 				EMIT3_off32(0x48, 0x8B, 0x85,
 					    -round_up(bpf_prog->aux->stack_depth, 8) - 8);
-				if (!imm32 || emit_call(&prog, func, image + addrs[i - 1] + 7))
+				if (!imm32)
 					return -EINVAL;
+				offs = 7 + x86_call_depth_emit_accounting(&prog, func);
 			} else {
-				if (!imm32 || emit_call(&prog, func, image + addrs[i - 1]))
+				if (!imm32)
 					return -EINVAL;
+				offs = x86_call_depth_emit_accounting(&prog, func);
 			}
+			if (emit_call(&prog, func, image + addrs[i - 1] + offs))
+				return -EINVAL;
 			break;
+		}
 
 		case BPF_JMP | BPF_TAIL_CALL:
 			if (imm32)
@@ -1823,7 +1837,7 @@ static int invoke_bpf_prog(const struct
 	/* arg2: lea rsi, [rbp - ctx_cookie_off] */
 	EMIT4(0x48, 0x8D, 0x75, -run_ctx_off);
 
-	if (emit_call(&prog, enter, prog))
+	if (emit_rsb_call(&prog, enter, prog))
 		return -EINVAL;
 	/* remember prog start time returned by __bpf_prog_enter */
 	emit_mov_reg(&prog, true, BPF_REG_6, BPF_REG_0);
@@ -1844,7 +1858,7 @@ static int invoke_bpf_prog(const struct
 			       (long) p->insnsi >> 32,
 			       (u32) (long) p->insnsi);
 	/* call JITed bpf program or interpreter */
-	if (emit_call(&prog, p->bpf_func, prog))
+	if (emit_rsb_call(&prog, p->bpf_func, prog))
 		return -EINVAL;
 
 	/*
@@ -1868,7 +1882,7 @@ static int invoke_bpf_prog(const struct
 	emit_mov_reg(&prog, true, BPF_REG_2, BPF_REG_6);
 	/* arg3: lea rdx, [rbp - run_ctx_off] */
 	EMIT4(0x48, 0x8D, 0x55, -run_ctx_off);
-	if (emit_call(&prog, exit, prog))
+	if (emit_rsb_call(&prog, exit, prog))
 		return -EINVAL;
 
 	*pprog = prog;
@@ -2109,7 +2123,7 @@ int arch_prepare_bpf_trampoline(struct b
 	if (flags & BPF_TRAMP_F_CALL_ORIG) {
 		/* arg1: mov rdi, im */
 		emit_mov_imm64(&prog, BPF_REG_1, (long) im >> 32, (u32) (long) im);
-		if (emit_call(&prog, __bpf_tramp_enter, prog)) {
+		if (emit_rsb_call(&prog, __bpf_tramp_enter, prog)) {
 			ret = -EINVAL;
 			goto cleanup;
 		}
@@ -2141,7 +2155,7 @@ int arch_prepare_bpf_trampoline(struct b
 			EMIT2(0xff, 0xd0); /* call *rax */
 		} else {
 			/* call original function */
-			if (emit_call(&prog, orig_call, prog)) {
+			if (emit_rsb_call(&prog, orig_call, prog)) {
 				ret = -EINVAL;
 				goto cleanup;
 			}
@@ -2185,7 +2199,7 @@ int arch_prepare_bpf_trampoline(struct b
 		im->ip_epilogue = prog;
 		/* arg1: mov rdi, im */
 		emit_mov_imm64(&prog, BPF_REG_1, (long) im >> 32, (u32) (long) im);
-		if (emit_call(&prog, __bpf_tramp_exit, prog)) {
+		if (emit_rsb_call(&prog, __bpf_tramp_exit, prog)) {
 			ret = -EINVAL;
 			goto cleanup;
 		}



^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v2 56/59] x86/ftrace: Remove ftrace_epilogue()
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (54 preceding siblings ...)
  2022-09-02 13:07 ` [PATCH v2 55/59] x86/bpf: Emit call depth accounting if required Peter Zijlstra
@ 2022-09-02 13:07 ` Peter Zijlstra
  2022-09-02 13:07 ` [PATCH v2 57/59] x86/ftrace: Rebalance RSB Peter Zijlstra
                   ` (3 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Peter Zijlstra <peterz@infradead.org>

Remove the weird jumps to RET and simply use RET.

This then promotes ftrace_stub() to a real function, which becomes
important for kcfi.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/ftrace_64.S |   21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

--- a/arch/x86/kernel/ftrace_64.S
+++ b/arch/x86/kernel/ftrace_64.S
@@ -172,20 +172,14 @@ SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBA
 	 */
 SYM_INNER_LABEL(ftrace_caller_end, SYM_L_GLOBAL)
 	ANNOTATE_NOENDBR
-
-	jmp ftrace_epilogue
+	RET
 SYM_FUNC_END(ftrace_caller);
 STACK_FRAME_NON_STANDARD_FP(ftrace_caller)
 
-SYM_FUNC_START(ftrace_epilogue)
-/*
- * This is weak to keep gas from relaxing the jumps.
- */
-SYM_INNER_LABEL_ALIGN(ftrace_stub, SYM_L_WEAK)
+SYM_FUNC_START(ftrace_stub)
 	UNWIND_HINT_FUNC
-	ENDBR
 	RET
-SYM_FUNC_END(ftrace_epilogue)
+SYM_FUNC_END(ftrace_stub)
 
 SYM_FUNC_START(ftrace_regs_caller)
 	/* Save the current flags before any operations that can change them */
@@ -262,14 +256,11 @@ SYM_INNER_LABEL(ftrace_regs_caller_jmp,
 	popfq
 
 	/*
-	 * As this jmp to ftrace_epilogue can be a short jump
-	 * it must not be copied into the trampoline.
-	 * The trampoline will add the code to jump
-	 * to the return.
+	 * The trampoline will add the return.
 	 */
 SYM_INNER_LABEL(ftrace_regs_caller_end, SYM_L_GLOBAL)
 	ANNOTATE_NOENDBR
-	jmp ftrace_epilogue
+	RET
 
 	/* Swap the flags with orig_rax */
 1:	movq MCOUNT_REG_SIZE(%rsp), %rdi
@@ -280,7 +271,7 @@ SYM_INNER_LABEL(ftrace_regs_caller_end,
 	/* Restore flags */
 	popfq
 	UNWIND_HINT_FUNC
-	jmp	ftrace_epilogue
+	RET
 
 SYM_FUNC_END(ftrace_regs_caller)
 STACK_FRAME_NON_STANDARD_FP(ftrace_regs_caller)



^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v2 57/59] x86/ftrace: Rebalance RSB
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (55 preceding siblings ...)
  2022-09-02 13:07 ` [PATCH v2 56/59] x86/ftrace: Remove ftrace_epilogue() Peter Zijlstra
@ 2022-09-02 13:07 ` Peter Zijlstra
  2022-09-02 13:07 ` [PATCH v2 58/59] x86/ftrace: Make it call depth tracking aware Peter Zijlstra
                   ` (2 subsequent siblings)
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Peter Zijlstra (Intel) <peterz@infradead.org>

ftrace_regs_caller() uses a PUSH;RET pattern to tail-call into a
direct-call function. This unbalances the RSB; fix that by injecting a
balancing CALL.
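
Conceptually the fix amounts to the following (a simplified sketch of
what the hunk below adds, not the literal code):

	call	1f		# provides the RSB entry the final RET will consume
	int3			# speculation trap
1:	add	$8, %rsp	# drop the return address the call just pushed
	ret			# 'returns' into the target left on the stack earlier

so the RET that performs the tail-call now has a CALL to pair with and
the RSB stays balanced.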

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/ftrace_64.S |   11 +++++++++++
 1 file changed, 11 insertions(+)

--- a/arch/x86/kernel/ftrace_64.S
+++ b/arch/x86/kernel/ftrace_64.S
@@ -271,6 +271,17 @@ SYM_INNER_LABEL(ftrace_regs_caller_end,
 	/* Restore flags */
 	popfq
 	UNWIND_HINT_FUNC
+
+	/*
+	 * The above left an extra return value on the stack; effectively
+	 * doing a tail-call without using a register. This PUSH;RET
+	 * pattern unbalances the RSB, inject a pointless CALL to rebalance.
+	 */
+	ANNOTATE_INTRA_FUNCTION_CALL
+	CALL .Ldo_rebalance
+	int3
+.Ldo_rebalance:
+	add $8, %rsp
 	RET
 
 SYM_FUNC_END(ftrace_regs_caller)



^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v2 58/59] x86/ftrace: Make it call depth tracking aware
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (56 preceding siblings ...)
  2022-09-02 13:07 ` [PATCH v2 57/59] x86/ftrace: Rebalance RSB Peter Zijlstra
@ 2022-09-02 13:07 ` Peter Zijlstra
  2022-09-02 13:07 ` [PATCH v2 59/59] x86/retbleed: Add call depth tracking mitigation Peter Zijlstra
  2022-09-16  9:35 ` [PATCH v2 00/59] x86/retbleed: Call " Mel Gorman
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Peter Zijlstra <peterz@infradead.org>

Since ftrace has trampolines, don't use thunks for the __fentry__ site
but instead require that every function called from there includes
accounting. This very much includes all the direct-call functions.

Additionally, ftrace uses ROP tricks in two places:

 - return_to_handler(), and
 - ftrace_regs_caller() when pt_regs->orig_ax is set by a direct-call.

return_to_handler() already uses a retpoline to replace an
indirect-jump to defeat IBT. Since this is a jump-type retpoline, make
sure there is no accounting done and ALTERNATIVE the RET into a ret.

ftrace_regs_caller() does much the same and gets the same treatment.
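
Condensed, a direct-call trampoline then has to look roughly like this
(a sketch mirroring the sample updates below; the actual samples spell
this out as inline asm() strings):

	my_tramp:
		ASM_ENDBR
		pushq	%rbp
		movq	%rsp, %rbp
		CALL_DEPTH_ACCOUNT	# account for the __fentry__ call into the trampoline
		call	my_direct_func
		leave
		ASM_RET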

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/nospec-branch.h        |    9 +++++++++
 arch/x86/kernel/callthunks.c                |    2 +-
 arch/x86/kernel/ftrace.c                    |   16 ++++++++++++----
 arch/x86/kernel/ftrace_64.S                 |   22 ++++++++++++++++++++--
 arch/x86/net/bpf_jit_comp.c                 |    6 ++++++
 kernel/trace/trace_selftest.c               |    5 ++++-
 samples/ftrace/ftrace-direct-modify.c       |    3 +++
 samples/ftrace/ftrace-direct-multi-modify.c |    3 +++
 samples/ftrace/ftrace-direct-multi.c        |    2 ++
 samples/ftrace/ftrace-direct-too.c          |    2 ++
 samples/ftrace/ftrace-direct.c              |    2 ++
 11 files changed, 64 insertions(+), 8 deletions(-)

--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -331,6 +331,12 @@ static inline void x86_set_skl_return_th
 {
 	x86_return_thunk = &__x86_return_skl;
 }
+
+#define CALL_DEPTH_ACCOUNT					\
+	ALTERNATIVE("",						\
+		    __stringify(INCREMENT_CALL_DEPTH),		\
+		    X86_FEATURE_CALL_DEPTH)
+
 #ifdef CONFIG_CALL_THUNKS_DEBUG
 DECLARE_PER_CPU(u64, __x86_call_count);
 DECLARE_PER_CPU(u64, __x86_ret_count);
@@ -339,6 +345,9 @@ DECLARE_PER_CPU(u64, __x86_ctxsw_count);
 #endif
 #else
 static inline void x86_set_skl_return_thunk(void) {}
+
+#define CALL_DEPTH_ACCOUNT ""
+
 #endif
 
 #ifdef CONFIG_RETPOLINE
--- a/arch/x86/kernel/callthunks.c
+++ b/arch/x86/kernel/callthunks.c
@@ -315,7 +315,7 @@ int x86_call_depth_emit_accounting(u8 **
 		return 0;
 
 	/* Is function call target a thunk? */
-	if (is_callthunk(func))
+	if (func && is_callthunk(func))
 		return 0;
 
 	memcpy(*pprog, tmpl, tmpl_size);
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -69,6 +69,10 @@ static const char *ftrace_nop_replace(vo
 
 static const char *ftrace_call_replace(unsigned long ip, unsigned long addr)
 {
+	/*
+	 * No need to translate into a callthunk. The trampoline does
+	 * the depth accounting itself.
+	 */
 	return text_gen_insn(CALL_INSN_OPCODE, (void *)ip, (void *)addr);
 }
 
@@ -317,7 +321,7 @@ create_trampoline(struct ftrace_ops *ops
 	unsigned long size;
 	unsigned long *ptr;
 	void *trampoline;
-	void *ip;
+	void *ip, *dest;
 	/* 48 8b 15 <offset> is movq <offset>(%rip), %rdx */
 	unsigned const char op_ref[] = { 0x48, 0x8b, 0x15 };
 	unsigned const char retq[] = { RET_INSN_OPCODE, INT3_INSN_OPCODE };
@@ -404,10 +408,14 @@ create_trampoline(struct ftrace_ops *ops
 	/* put in the call to the function */
 	mutex_lock(&text_mutex);
 	call_offset -= start_offset;
+	/*
+	 * No need to translate into a callthunk. The trampoline does
+	 * the depth accounting before the call already.
+	 */
+	dest = ftrace_ops_get_func(ops);
 	memcpy(trampoline + call_offset,
-	       text_gen_insn(CALL_INSN_OPCODE,
-			     trampoline + call_offset,
-			     ftrace_ops_get_func(ops)), CALL_INSN_SIZE);
+	       text_gen_insn(CALL_INSN_OPCODE, trampoline + call_offset, dest),
+	       CALL_INSN_SIZE);
 	mutex_unlock(&text_mutex);
 
 	/* ALLOC_TRAMP flags lets us know we created it */
--- a/arch/x86/kernel/ftrace_64.S
+++ b/arch/x86/kernel/ftrace_64.S
@@ -4,6 +4,7 @@
  */
 
 #include <linux/linkage.h>
+#include <asm/asm-offsets.h>
 #include <asm/ptrace.h>
 #include <asm/ftrace.h>
 #include <asm/export.h>
@@ -132,6 +133,7 @@
 #ifdef CONFIG_DYNAMIC_FTRACE
 
 SYM_FUNC_START(__fentry__)
+	CALL_DEPTH_ACCOUNT
 	RET
 SYM_FUNC_END(__fentry__)
 EXPORT_SYMBOL(__fentry__)
@@ -140,6 +142,8 @@ SYM_FUNC_START(ftrace_caller)
 	/* save_mcount_regs fills in first two parameters */
 	save_mcount_regs
 
+	CALL_DEPTH_ACCOUNT
+
 	/* Stack - skipping return address of ftrace_caller */
 	leaq MCOUNT_REG_SIZE+8(%rsp), %rcx
 	movq %rcx, RSP(%rsp)
@@ -155,6 +159,9 @@ SYM_INNER_LABEL(ftrace_caller_op_ptr, SY
 	/* Only ops with REGS flag set should have CS register set */
 	movq $0, CS(%rsp)
 
+	/* Account for the function call below */
+	CALL_DEPTH_ACCOUNT
+
 SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL)
 	ANNOTATE_NOENDBR
 	call ftrace_stub
@@ -189,6 +196,8 @@ SYM_FUNC_START(ftrace_regs_caller)
 	save_mcount_regs 8
 	/* save_mcount_regs fills in first two parameters */
 
+	CALL_DEPTH_ACCOUNT
+
 SYM_INNER_LABEL(ftrace_regs_caller_op_ptr, SYM_L_GLOBAL)
 	ANNOTATE_NOENDBR
 	/* Load the ftrace_ops into the 3rd parameter */
@@ -219,6 +228,9 @@ SYM_INNER_LABEL(ftrace_regs_caller_op_pt
 	/* regs go into 4th parameter */
 	leaq (%rsp), %rcx
 
+	/* Account for the function call below */
+	CALL_DEPTH_ACCOUNT
+
 SYM_INNER_LABEL(ftrace_regs_call, SYM_L_GLOBAL)
 	ANNOTATE_NOENDBR
 	call ftrace_stub
@@ -282,7 +294,9 @@ SYM_INNER_LABEL(ftrace_regs_caller_end,
 	int3
 .Ldo_rebalance:
 	add $8, %rsp
-	RET
+	ALTERNATIVE __stringify(RET), \
+		    __stringify(ANNOTATE_UNRET_SAFE; ret; int3), \
+		    X86_FEATURE_CALL_DEPTH
 
 SYM_FUNC_END(ftrace_regs_caller)
 STACK_FRAME_NON_STANDARD_FP(ftrace_regs_caller)
@@ -291,6 +305,8 @@ STACK_FRAME_NON_STANDARD_FP(ftrace_regs_
 #else /* ! CONFIG_DYNAMIC_FTRACE */
 
 SYM_FUNC_START(__fentry__)
+	CALL_DEPTH_ACCOUNT
+
 	cmpq $ftrace_stub, ftrace_trace_function
 	jnz trace
 
@@ -347,6 +363,8 @@ SYM_CODE_START(return_to_handler)
 	int3
 .Ldo_rop:
 	mov %rdi, (%rsp)
-	RET
+	ALTERNATIVE __stringify(RET), \
+		    __stringify(ANNOTATE_UNRET_SAFE; ret; int3), \
+		    X86_FEATURE_CALL_DEPTH
 SYM_CODE_END(return_to_handler)
 #endif
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -12,6 +12,7 @@
 #include <linux/memory.h>
 #include <linux/sort.h>
 #include <asm/extable.h>
+#include <asm/ftrace.h>
 #include <asm/set_memory.h>
 #include <asm/nospec-branch.h>
 #include <asm/text-patching.h>
@@ -2095,6 +2096,11 @@ int arch_prepare_bpf_trampoline(struct b
 	prog = image;
 
 	EMIT_ENDBR();
+	/*
+	 * This is the direct-call trampoline, as such it needs accounting
+	 * for the __fentry__ call.
+	 */
+	x86_call_depth_emit_accounting(&prog, NULL);
 	EMIT1(0x55);		 /* push rbp */
 	EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */
 	EMIT4(0x48, 0x83, 0xEC, stack_size); /* sub rsp, stack_size */
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -785,7 +785,10 @@ static struct fgraph_ops fgraph_ops __in
 };
 
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
-noinline __noclone static void trace_direct_tramp(void) { }
+noinline __noclone static void trace_direct_tramp(void)
+{
+	asm(CALL_DEPTH_ACCOUNT);
+}
 #endif
 
 /*
--- a/samples/ftrace/ftrace-direct-modify.c
+++ b/samples/ftrace/ftrace-direct-modify.c
@@ -3,6 +3,7 @@
 #include <linux/kthread.h>
 #include <linux/ftrace.h>
 #include <asm/asm-offsets.h>
+#include <asm/nospec-branch.h>
 
 extern void my_direct_func1(void);
 extern void my_direct_func2(void);
@@ -34,6 +35,7 @@ asm (
 	ASM_ENDBR
 "	pushq %rbp\n"
 "	movq %rsp, %rbp\n"
+	CALL_DEPTH_ACCOUNT
 "	call my_direct_func1\n"
 "	leave\n"
 "	.size		my_tramp1, .-my_tramp1\n"
@@ -45,6 +47,7 @@ asm (
 	ASM_ENDBR
 "	pushq %rbp\n"
 "	movq %rsp, %rbp\n"
+	CALL_DEPTH_ACCOUNT
 "	call my_direct_func2\n"
 "	leave\n"
 	ASM_RET
--- a/samples/ftrace/ftrace-direct-multi-modify.c
+++ b/samples/ftrace/ftrace-direct-multi-modify.c
@@ -3,6 +3,7 @@
 #include <linux/kthread.h>
 #include <linux/ftrace.h>
 #include <asm/asm-offsets.h>
+#include <asm/nospec-branch.h>
 
 extern void my_direct_func1(unsigned long ip);
 extern void my_direct_func2(unsigned long ip);
@@ -32,6 +33,7 @@ asm (
 	ASM_ENDBR
 "	pushq %rbp\n"
 "	movq %rsp, %rbp\n"
+	CALL_DEPTH_ACCOUNT
 "	pushq %rdi\n"
 "	movq 8(%rbp), %rdi\n"
 "	call my_direct_func1\n"
@@ -46,6 +48,7 @@ asm (
 	ASM_ENDBR
 "	pushq %rbp\n"
 "	movq %rsp, %rbp\n"
+	CALL_DEPTH_ACCOUNT
 "	pushq %rdi\n"
 "	movq 8(%rbp), %rdi\n"
 "	call my_direct_func2\n"
--- a/samples/ftrace/ftrace-direct-multi.c
+++ b/samples/ftrace/ftrace-direct-multi.c
@@ -5,6 +5,7 @@
 #include <linux/ftrace.h>
 #include <linux/sched/stat.h>
 #include <asm/asm-offsets.h>
+#include <asm/nospec-branch.h>
 
 extern void my_direct_func(unsigned long ip);
 
@@ -27,6 +28,7 @@ asm (
 	ASM_ENDBR
 "	pushq %rbp\n"
 "	movq %rsp, %rbp\n"
+	CALL_DEPTH_ACCOUNT
 "	pushq %rdi\n"
 "	movq 8(%rbp), %rdi\n"
 "	call my_direct_func\n"
--- a/samples/ftrace/ftrace-direct-too.c
+++ b/samples/ftrace/ftrace-direct-too.c
@@ -4,6 +4,7 @@
 #include <linux/mm.h> /* for handle_mm_fault() */
 #include <linux/ftrace.h>
 #include <asm/asm-offsets.h>
+#include <asm/nospec-branch.h>
 
 extern void my_direct_func(struct vm_area_struct *vma,
 			   unsigned long address, unsigned int flags);
@@ -29,6 +30,7 @@ asm (
 	ASM_ENDBR
 "	pushq %rbp\n"
 "	movq %rsp, %rbp\n"
+	CALL_DEPTH_ACCOUNT
 "	pushq %rdi\n"
 "	pushq %rsi\n"
 "	pushq %rdx\n"
--- a/samples/ftrace/ftrace-direct.c
+++ b/samples/ftrace/ftrace-direct.c
@@ -4,6 +4,7 @@
 #include <linux/sched.h> /* for wake_up_process() */
 #include <linux/ftrace.h>
 #include <asm/asm-offsets.h>
+#include <asm/nospec-branch.h>
 
 extern void my_direct_func(struct task_struct *p);
 
@@ -26,6 +27,7 @@ asm (
 	ASM_ENDBR
 "	pushq %rbp\n"
 "	movq %rsp, %rbp\n"
+	CALL_DEPTH_ACCOUNT
 "	pushq %rdi\n"
 "	call my_direct_func\n"
 "	popq %rdi\n"



^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v2 59/59] x86/retbleed: Add call depth tracking mitigation
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (57 preceding siblings ...)
  2022-09-02 13:07 ` [PATCH v2 58/59] x86/ftrace: Make it call depth tracking aware Peter Zijlstra
@ 2022-09-02 13:07 ` Peter Zijlstra
  2022-09-16  9:35 ` [PATCH v2 00/59] x86/retbleed: Call " Mel Gorman
  59 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

From: Thomas Gleixner <tglx@linutronix.de>

The fully secure mitigation for RSB underflow on Intel SKL CPUs is IBRS,
which inflicts up to a 30% penalty for pathological syscall-heavy workloads.

Software-based call depth tracking and RSB refill is not perfect, but it
reduces the attack surface massively. The penalty for the pathological case
is about 8%, which is still annoying but definitely more palatable than IBRS.

Add a retbleed=stuff command line option to enable the call depth tracking
and software refill of the RSB.

This gives admins a choice. IBeeRS are safe and cause headaches; call depth
tracking is considered to be s(t)ufficiently safe.
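
For example, combined with an explicit retpoline selection (a hypothetical
command line; the spectre_v2=retpoline dependency is what
retbleed_select_mitigation() checks for below):

	retbleed=stuff spectre_v2=retpoline

Without CONFIG_CALL_DEPTH_TRACKING, or with a different spectre_v2 mode,
the option falls back to the automatic selection.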

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/cpu/bugs.c |   32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -787,6 +787,7 @@ enum retbleed_mitigation {
 	RETBLEED_MITIGATION_IBPB,
 	RETBLEED_MITIGATION_IBRS,
 	RETBLEED_MITIGATION_EIBRS,
+	RETBLEED_MITIGATION_STUFF,
 };
 
 enum retbleed_mitigation_cmd {
@@ -794,6 +795,7 @@ enum retbleed_mitigation_cmd {
 	RETBLEED_CMD_AUTO,
 	RETBLEED_CMD_UNRET,
 	RETBLEED_CMD_IBPB,
+	RETBLEED_CMD_STUFF,
 };
 
 static const char * const retbleed_strings[] = {
@@ -802,6 +804,7 @@ static const char * const retbleed_strin
 	[RETBLEED_MITIGATION_IBPB]	= "Mitigation: IBPB",
 	[RETBLEED_MITIGATION_IBRS]	= "Mitigation: IBRS",
 	[RETBLEED_MITIGATION_EIBRS]	= "Mitigation: Enhanced IBRS",
+	[RETBLEED_MITIGATION_STUFF]	= "Mitigation: Stuffing",
 };
 
 static enum retbleed_mitigation retbleed_mitigation __ro_after_init =
@@ -831,6 +834,8 @@ static int __init retbleed_parse_cmdline
 			retbleed_cmd = RETBLEED_CMD_UNRET;
 		} else if (!strcmp(str, "ibpb")) {
 			retbleed_cmd = RETBLEED_CMD_IBPB;
+		} else if (!strcmp(str, "stuff")) {
+			retbleed_cmd = RETBLEED_CMD_STUFF;
 		} else if (!strcmp(str, "nosmt")) {
 			retbleed_nosmt = true;
 		} else {
@@ -879,6 +884,21 @@ static void __init retbleed_select_mitig
 		}
 		break;
 
+	case RETBLEED_CMD_STUFF:
+		if (IS_ENABLED(CONFIG_CALL_DEPTH_TRACKING) &&
+		    spectre_v2_enabled == SPECTRE_V2_RETPOLINE) {
+			retbleed_mitigation = RETBLEED_MITIGATION_STUFF;
+
+		} else {
+			if (IS_ENABLED(CONFIG_CALL_DEPTH_TRACKING))
+				pr_err("WARNING: retbleed=stuff depends on spectre_v2=retpoline\n");
+			else
+				pr_err("WARNING: kernel not compiled with CALL_DEPTH_TRACKING.\n");
+
+			goto do_cmd_auto;
+		}
+		break;
+
 do_cmd_auto:
 	case RETBLEED_CMD_AUTO:
 	default:
@@ -916,6 +936,12 @@ static void __init retbleed_select_mitig
 		mitigate_smt = true;
 		break;
 
+	case RETBLEED_MITIGATION_STUFF:
+		setup_force_cpu_cap(X86_FEATURE_RETHUNK);
+		setup_force_cpu_cap(X86_FEATURE_CALL_DEPTH);
+		x86_set_skl_return_thunk();
+		break;
+
 	default:
 		break;
 	}
@@ -926,7 +952,7 @@ static void __init retbleed_select_mitig
 
 	/*
 	 * Let IBRS trump all on Intel without affecting the effects of the
-	 * retbleed= cmdline option.
+	 * retbleed= cmdline option except for call depth based stuffing
 	 */
 	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) {
 		switch (spectre_v2_enabled) {
@@ -939,7 +965,8 @@ static void __init retbleed_select_mitig
 			retbleed_mitigation = RETBLEED_MITIGATION_EIBRS;
 			break;
 		default:
-			pr_err(RETBLEED_INTEL_MSG);
+			if (retbleed_mitigation != RETBLEED_MITIGATION_STUFF)
+				pr_err(RETBLEED_INTEL_MSG);
 		}
 	}
 
@@ -1413,6 +1440,7 @@ static void __init spectre_v2_select_mit
 		if (IS_ENABLED(CONFIG_CPU_IBRS_ENTRY) &&
 		    boot_cpu_has_bug(X86_BUG_RETBLEED) &&
 		    retbleed_cmd != RETBLEED_CMD_OFF &&
+		    retbleed_cmd != RETBLEED_CMD_STUFF &&
 		    boot_cpu_has(X86_FEATURE_IBRS) &&
 		    boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) {
 			mode = SPECTRE_V2_IBRS;



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v2 01/59] x86/paravirt: Ensure proper alignment
  2022-09-02 13:06 ` [PATCH v2 01/59] x86/paravirt: Ensure proper alignment Peter Zijlstra
@ 2022-09-02 16:05   ` Juergen Gross
  0 siblings, 0 replies; 81+ messages in thread
From: Juergen Gross @ 2022-09-02 16:05 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner
  Cc: linux-kernel, x86, Linus Torvalds, Tim Chen, Josh Poimboeuf,
	Andrew Cooper, Pawan Gupta, Johannes Wikner, Alyssa Milburn,
	Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman, Steven Rostedt,
	Masami Hiramatsu, Alexei Starovoitov, Daniel Borkmann,
	K Prateek Nayak, Eric Dumazet



On 02.09.22 15:06, Peter Zijlstra wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> The entries in the .parainstr sections are 8 byte aligned and the
> corresponding C struct makes the array offset 16 bytes.
> 
> Though the pushed entries are only using 12 bytes. .parainstr_end is
> therefore 4 bytes short.
> 
> That works by chance because it's only used in a loop:
> 
>       for (p = start; p < end; p++)
> 
> But this falls flat when calculating the number of elements:
> 
>      n = end - start
> 
> That's obviously off by one.
> 
> Ensure that the gap is filled and the last entry is occupying 16 bytes.
> 
> Cc: Juergen Gross <jgross@suse.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Reviewed-by: Juergen Gross <jgross@suse.com>


Juergen


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v2 43/59] x86/paravirt: Make struct paravirt_call_site unconditionally available
  2022-09-02 13:07 ` [PATCH v2 43/59] x86/paravirt: Make struct paravirt_call_site unconditionally available Peter Zijlstra
@ 2022-09-02 16:09   ` Juergen Gross
  0 siblings, 0 replies; 81+ messages in thread
From: Juergen Gross @ 2022-09-02 16:09 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner
  Cc: linux-kernel, x86, Linus Torvalds, Tim Chen, Josh Poimboeuf,
	Andrew Cooper, Pawan Gupta, Johannes Wikner, Alyssa Milburn,
	Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman, Steven Rostedt,
	Masami Hiramatsu, Alexei Starovoitov, Daniel Borkmann,
	K Prateek Nayak, Eric Dumazet



On 02.09.22 15:07, Peter Zijlstra wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> For the upcoming call thunk patching it's less ifdeffery when the data
> structure is unconditionally available. The code can then be trivially
> fenced off with IS_ENABLED().
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Reviewed-by: Juergen Gross <jgross@suse.com>


Juergen


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v2 37/59] x86/putuser: Provide room for padding
  2022-09-02 13:07 ` [PATCH v2 37/59] x86/putuser: Provide room for padding Peter Zijlstra
@ 2022-09-02 16:43   ` Linus Torvalds
  2022-09-02 17:03     ` Peter Zijlstra
  0 siblings, 1 reply; 81+ messages in thread
From: Linus Torvalds @ 2022-09-02 16:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, linux-kernel, x86, Tim Chen, Josh Poimboeuf,
	Andrew Cooper, Pawan Gupta, Johannes Wikner, Alyssa Milburn,
	Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman, Steven Rostedt,
	Juergen Gross, Masami Hiramatsu, Alexei Starovoitov,
	Daniel Borkmann, K Prateek Nayak, Eric Dumazet

So I don't hate this patch and it's probably good for consistency, but
I really think that the retbleed tracking could perhaps be improved to
let this be all unnecessary.

The whole return stack depth counting is already not 100% exact, and I
think we could just make the rule be that we don't track leaf
functions.

Why? It's just an off-by-one in the already not exact tracking. And -
perhaps equally importantly - leaf functions are very very common
dynamically, and I suspect it's trivial to see them.

Yes, yes, you could make objtool even smarter and actually do some
kind of function flow graph thing (and I think some people were
talking about that with the whole ret counting long long ago), but the
leaf function thing is the really simple low-hanging fruit case of
that.

            Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v2 08/59] x86/build: Ensure proper function alignment
  2022-09-02 13:06 ` [PATCH v2 08/59] x86/build: Ensure proper function alignment Peter Zijlstra
@ 2022-09-02 16:51   ` Linus Torvalds
  2022-09-02 17:32     ` Peter Zijlstra
  2022-09-05  2:09   ` David Laight
  1 sibling, 1 reply; 81+ messages in thread
From: Linus Torvalds @ 2022-09-02 16:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, linux-kernel, x86, Tim Chen, Josh Poimboeuf,
	Andrew Cooper, Pawan Gupta, Johannes Wikner, Alyssa Milburn,
	Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman, Steven Rostedt,
	Juergen Gross, Masami Hiramatsu, Alexei Starovoitov,
	Daniel Borkmann, K Prateek Nayak, Eric Dumazet

On Fri, Sep 2, 2022 at 6:55 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> --- a/arch/x86/include/asm/linkage.h
> +++ b/arch/x86/include/asm/linkage.h
> @@ -14,9 +14,10 @@
>
>  #ifdef __ASSEMBLY__
>
> -#if defined(CONFIG_X86_64) || defined(CONFIG_X86_ALIGNMENT_16)
> -#define __ALIGN                .p2align 4, 0x90
> -#define __ALIGN_STR    __stringify(__ALIGN)
> +#if CONFIG_FUNCTION_ALIGNMENT == 16
> +#define __ALIGN                        .p2align 4, 0x90
> +#define __ALIGN_STR            __stringify(__ALIGN)
> +#define FUNCTION_ALIGNMENT     16
>  #endif

Ugh.

Why is this conditional on that alignment being 16?

Is it purely because the CONFIG variable was mis-designed, and is the
number of bytes, instead of being the shift? If so, just fix that, and
then do an unconditional

  #define __ALIGN                        .p2align CONFIG_FUNCTION_ALIGNMENT_SHIFT, 0x90
  #define __ALIGN_STR            __stringify(__ALIGN)

(leave the asm-generic one conditional, since then the condition makes
sense - it's arch-specific).

           Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v2 37/59] x86/putuser: Provide room for padding
  2022-09-02 16:43   ` Linus Torvalds
@ 2022-09-02 17:03     ` Peter Zijlstra
  2022-09-02 20:24       ` Peter Zijlstra
  0 siblings, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 17:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, linux-kernel, x86, Tim Chen, Josh Poimboeuf,
	Andrew Cooper, Pawan Gupta, Johannes Wikner, Alyssa Milburn,
	Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman, Steven Rostedt,
	Juergen Gross, Masami Hiramatsu, Alexei Starovoitov,
	Daniel Borkmann, K Prateek Nayak, Eric Dumazet

On Fri, Sep 02, 2022 at 09:43:45AM -0700, Linus Torvalds wrote:
> So I don't hate this patch and it's probably good for consistency, but
> I really think that the retbleed tracking could perhaps be improved to
> let this be all unnecessary.
> 
> The whole return stack depth counting is already not 100% exact, and I
> think we could just make the rule be that we don't track leaf
> functions.
> 
> Why? It's just a off-by-one in the already not exact tracking. And -
> perhaps equally importantly - leaf functions are very very common
> dynamically, and I suspect it's trivial to see them.
> 
> Yes, yes, you could make objtool even smarter and actually do some
> kind of function flow graph thing (and I think some people were
> talking about that with the whole ret counting long long ago), but the
> leaf function thing is the really simple low-hanging fruit case of
> that.

So I did the leaf thing a few weeks ago, and at the time the perf gains
where not worth the complexity.

I can try again :-)

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v2 08/59] x86/build: Ensure proper function alignment
  2022-09-02 16:51   ` Linus Torvalds
@ 2022-09-02 17:32     ` Peter Zijlstra
  2022-09-02 18:08       ` Linus Torvalds
  0 siblings, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 17:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, linux-kernel, x86, Tim Chen, Josh Poimboeuf,
	Andrew Cooper, Pawan Gupta, Johannes Wikner, Alyssa Milburn,
	Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman, Steven Rostedt,
	Juergen Gross, Masami Hiramatsu, Alexei Starovoitov,
	Daniel Borkmann, K Prateek Nayak, Eric Dumazet

On Fri, Sep 02, 2022 at 09:51:17AM -0700, Linus Torvalds wrote:
> On Fri, Sep 2, 2022 at 6:55 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > --- a/arch/x86/include/asm/linkage.h
> > +++ b/arch/x86/include/asm/linkage.h
> > @@ -14,9 +14,10 @@
> >
> >  #ifdef __ASSEMBLY__
> >
> > -#if defined(CONFIG_X86_64) || defined(CONFIG_X86_ALIGNMENT_16)
> > -#define __ALIGN                .p2align 4, 0x90
> > -#define __ALIGN_STR    __stringify(__ALIGN)
> > +#if CONFIG_FUNCTION_ALIGNMENT == 16
> > +#define __ALIGN                        .p2align 4, 0x90
> > +#define __ALIGN_STR            __stringify(__ALIGN)
> > +#define FUNCTION_ALIGNMENT     16
> >  #endif
> 
> Ugh.
> 
> Why is this conditional on that alignment being 16?

There is a DEBUG case that increases the thing to 32.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v2 22/59] x86: Put hot per CPU variables into a struct
  2022-09-02 13:06 ` [PATCH v2 22/59] x86: Put hot per CPU variables into a struct Peter Zijlstra
@ 2022-09-02 18:02   ` Jann Horn
  2022-09-15 11:22     ` Peter Zijlstra
  0 siblings, 1 reply; 81+ messages in thread
From: Jann Horn @ 2022-09-02 18:02 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner
  Cc: linux-kernel, x86, Linus Torvalds, Tim Chen, Josh Poimboeuf,
	Andrew Cooper, Pawan Gupta, Johannes Wikner, Alyssa Milburn,
	H.J. Lu, Joao Moreira, Joseph Nuzman, Steven Rostedt,
	Juergen Gross, Masami Hiramatsu, Alexei Starovoitov,
	Daniel Borkmann, K Prateek Nayak, Eric Dumazet

On Fri, Sep 2, 2022 at 3:54 PM Peter Zijlstra <peterz@infradead.org> wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
>
> The layout of per-cpu variables is at the mercy of the compiler. This
> can lead to random performance fluctuations from build to build.
>
> Create a structure to hold some of the hottest per-cpu variables,
> starting with current_task.
[...]
> -DECLARE_PER_CPU(struct task_struct *, current_task);
> +struct pcpu_hot {
> +       union {
> +               struct {
> +                       struct task_struct      *current_task;
> +               };
> +               u8      pad[64];
> +       };
> +};

fixed_percpu_data::stack_canary is probably also a fairly hot per-cpu
variable on distro kernels with CONFIG_STACKPROTECTOR_STRONG (which
e.g. Debian enables), so perhaps it'd make sense to reuse
fixed_percpu_data as the struct for hot percpu variables? But I don't
have any numbers to actually back up that idea.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v2 08/59] x86/build: Ensure proper function alignment
  2022-09-02 17:32     ` Peter Zijlstra
@ 2022-09-02 18:08       ` Linus Torvalds
  2022-09-05 10:04         ` Peter Zijlstra
  0 siblings, 1 reply; 81+ messages in thread
From: Linus Torvalds @ 2022-09-02 18:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, linux-kernel, x86, Tim Chen, Josh Poimboeuf,
	Andrew Cooper, Pawan Gupta, Johannes Wikner, Alyssa Milburn,
	Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman, Steven Rostedt,
	Juergen Gross, Masami Hiramatsu, Alexei Starovoitov,
	Daniel Borkmann, K Prateek Nayak, Eric Dumazet

On Fri, Sep 2, 2022 at 10:32 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> There is a DEBUG case that increases the thing to 32.

Well, but that should be part of the Kconfig rules too.

In fact, I think that argues for moving that FUNCTION_ALIGNMENT into
the generic Kconfig, since we already have that much hackier debug
thing there.

That would then get rid of the conditional in asm-generic too, and get
rid of the horrid hack in the main Makefile as well.

I love how commit cf536e185869 ("Makefile: extend 32B aligned debug
option to 64B aligned") took that previous random debug entry and just
made it 64B instead of 32B. What a crock that all is.

Let's just do this right.

             Linus



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v2 37/59] x86/putuser: Provide room for padding
  2022-09-02 17:03     ` Peter Zijlstra
@ 2022-09-02 20:24       ` Peter Zijlstra
  2022-09-02 21:46         ` Linus Torvalds
  0 siblings, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 20:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, linux-kernel, x86, Tim Chen, Josh Poimboeuf,
	Andrew Cooper, Pawan Gupta, Johannes Wikner, Alyssa Milburn,
	Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman, Steven Rostedt,
	Juergen Gross, Masami Hiramatsu, Alexei Starovoitov,
	Daniel Borkmann, K Prateek Nayak, Eric Dumazet

On Fri, Sep 02, 2022 at 07:03:33PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 02, 2022 at 09:43:45AM -0700, Linus Torvalds wrote:
> > So I don't hate this patch and it's probably good for consistency, but
> > I really think that the retbleed tracking could perhaps be improved to
> > let this be all unnecessary.
> > 
> > The whole return stack depth counting is already not 100% exact, and I
> > think we could just make the rule be that we don't track leaf
> > functions.
> > 
> > Why? It's just a off-by-one in the already not exact tracking. And -
> > perhaps equally importantly - leaf functions are very very common
> > dynamically, and I suspect it's trivial to see them.
> > 
> > Yes, yes, you could make objtool even smarter and actually do some
> > kind of function flow graph thing (and I think some people were
> > talking about that with the whole ret counting long long ago), but the
> > leaf function thing is the really simple low-hanging fruit case of
> > that.
> 
> So I did the leaf thing a few weeks ago, and at the time the perf gains
> where not worth the complexity.
> 
> I can try again :-)

The below (mashup of a handful of patches) is the best I could come up
with in a hurry.

Specifically, the approach taken is that instead of the 10 byte sarq for
accounting, a leaf function has a 10 byte NOP in front of it. When this 'leaf'-nop
is found, the return thunk is also not used and a regular 'ret'
instruction is patched in.

Direct calls to leaf functions are unmodified; they still go to +0.

However, indirect calls will unconditionally jump to -10. These will then
either hit the sarq or the fancy nop.

Seems to boot in kvm (defconfig+kvm_guest.config)

That is, the thing you complained about isn't actually 'fixed'; leaf
functions still need their padding.

If this patch makes you feel warm and fuzzy, I can clean it up,
otherwise I would suggest getting the stuff we have merged before adding
even more things on top.

I'll see if I get time to clean up the alignment thing this weekend,
otherwise it'll have to wait till next week or so.

---
 arch/x86/include/asm/alternative.h  |    7 ++
 arch/x86/kernel/alternative.c       |   11 ++++
 arch/x86/kernel/callthunks.c        |   51 +++++++++++++++++++
 arch/x86/kernel/cpu/bugs.c          |    2 
 arch/x86/kernel/module.c            |    8 ++-
 arch/x86/kernel/vmlinux.lds.S       |    7 ++
 arch/x86/lib/retpoline.S            |   11 +---
 tools/objtool/check.c               |   95 ++++++++++++++++++++++++++++++++++++
 tools/objtool/include/objtool/elf.h |    2 
 9 files changed, 186 insertions(+), 8 deletions(-)

--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -94,6 +94,8 @@ extern void callthunks_patch_module_call
 extern void *callthunks_translate_call_dest(void *dest);
 extern bool is_callthunk(void *addr);
 extern int x86_call_depth_emit_accounting(u8 **pprog, void *func);
+extern void apply_leafs(s32 *start, s32 *end);
+extern bool is_leaf_function(void *addr);
 #else
 static __always_inline void callthunks_patch_builtin_calls(void) {}
 static __always_inline void
@@ -112,6 +114,11 @@ static __always_inline int x86_call_dept
 {
 	return 0;
 }
+static __always_inline void apply_leafs(s32 *start, s32 *end) {}
+static __always_inline bool is_leaf_function(void *addr)
+{
+	return false;
+}
 #endif
 
 #ifdef CONFIG_SMP
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -114,6 +114,7 @@ static void __init_or_module add_nops(vo
 	}
 }
 
+extern s32 __leaf_sites[], __leaf_sites_end[];
 extern s32 __retpoline_sites[], __retpoline_sites_end[];
 extern s32 __return_sites[], __return_sites_end[];
 extern s32 __ibt_endbr_seal[], __ibt_endbr_seal_end[];
@@ -586,9 +587,14 @@ static int patch_return(void *addr, stru
 		if (x86_return_thunk == __x86_return_thunk)
 			return -1;
 
+		if (x86_return_thunk == __x86_return_skl &&
+		    is_leaf_function(addr))
+			goto plain_ret;
+
 		i = JMP32_INSN_SIZE;
 		__text_gen_insn(bytes, JMP32_INSN_OPCODE, addr, x86_return_thunk, i);
 	} else {
+plain_ret:
 		bytes[i++] = RET_INSN_OPCODE;
 	}
 
@@ -988,6 +994,11 @@ void __init alternative_instructions(voi
 	apply_paravirt(__parainstructions, __parainstructions_end);
 
 	/*
+	 * Mark the leaf sites; this affects apply_returns() and callthunks_patch*().
+	 */
+	apply_leafs(__leaf_sites, __leaf_sites_end);
+
+	/*
 	 * Rewrite the retpolines, must be done before alternatives since
 	 * those can rewrite the retpoline thunks.
 	 */
--- a/arch/x86/kernel/callthunks.c
+++ b/arch/x86/kernel/callthunks.c
@@ -181,6 +181,54 @@ static const u8 nops[] = {
 	0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90,
 };
 
+/*
+ * 10 byte nop that spells 'leaf' in the immediate.
+ */
+static const u8 leaf_nop[] = {            /* 'l',  'e',  'a',  'f' */
+	0x2e, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x6c, 0x65, 0x61, 0x66,
+};
+
+void __init_or_module noinline apply_leafs(s32 *start, s32 *end)
+{
+	u8 buffer[16];
+	s32 *s;
+
+	for (s = start; s < end; s++) {
+		void *addr = (void *)s + *s;
+
+		if (skip_addr(addr))
+			continue;
+
+		if (copy_from_kernel_nofault(buffer, addr-10, 10))
+			continue;
+
+		/* already patched */
+		if (!memcmp(buffer, leaf_nop, 10))
+			continue;
+
+		if (memcmp(buffer, nops, 10)) {
+			pr_warn("Not NOPs: %pS %px %*ph\n", addr, addr, 10, addr);
+			continue;
+		}
+
+		text_poke_early(addr-10, leaf_nop, 10);
+	}
+}
+
+bool is_leaf_function(void *addr)
+{
+	unsigned long size, offset;
+	u8 buffer[10];
+
+	if (kallsyms_lookup_size_offset((unsigned long)addr, &size, &offset))
+		addr -= offset;
+
+	if (copy_from_kernel_nofault(buffer, addr-10, 10))
+		return false;
+
+	return memcmp(buffer, leaf_nop, 10) == 0;
+}
+
 static __init_or_module void *patch_dest(void *dest, bool direct)
 {
 	unsigned int tsize = SKL_TMPL_SIZE;
@@ -190,6 +238,9 @@ static __init_or_module void *patch_dest
 	if (!bcmp(pad, skl_call_thunk_template, tsize))
 		return pad;
 
+	if (!bcmp(pad, leaf_nop, tsize))
+		return dest;
+
 	/* Ensure there are nops */
 	if (bcmp(pad, nops, tsize)) {
 		pr_warn_once("Invalid padding area for %pS\n", dest);
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -838,6 +838,8 @@ static int __init retbleed_parse_cmdline
 			retbleed_cmd = RETBLEED_CMD_STUFF;
 		} else if (!strcmp(str, "nosmt")) {
 			retbleed_nosmt = true;
+		} else if (!strcmp(str, "force")) {
+			setup_force_cpu_bug(X86_BUG_RETBLEED);
 		} else {
 			pr_err("Ignoring unknown retbleed option (%s).", str);
 		}
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -253,7 +253,7 @@ int module_finalize(const Elf_Ehdr *hdr,
 		    struct module *me)
 {
 	const Elf_Shdr *s, *text = NULL, *alt = NULL, *locks = NULL,
-		*para = NULL, *orc = NULL, *orc_ip = NULL,
+		*para = NULL, *orc = NULL, *orc_ip = NULL, *leafs = NULL,
 		*retpolines = NULL, *returns = NULL, *ibt_endbr = NULL,
 		*calls = NULL;
 	char *secstrings = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset;
@@ -271,6 +271,8 @@ int module_finalize(const Elf_Ehdr *hdr,
 			orc = s;
 		if (!strcmp(".orc_unwind_ip", secstrings + s->sh_name))
 			orc_ip = s;
+		if (!strcmp(".leaf_sites", secstrings + s->sh_name))
+			leafs = s;
 		if (!strcmp(".retpoline_sites", secstrings + s->sh_name))
 			retpolines = s;
 		if (!strcmp(".return_sites", secstrings + s->sh_name))
@@ -289,6 +291,10 @@ int module_finalize(const Elf_Ehdr *hdr,
 		void *pseg = (void *)para->sh_addr;
 		apply_paravirt(pseg, pseg + para->sh_size);
 	}
+	if (leafs) {
+		void *rseg = (void *)leafs->sh_addr;
+		apply_leafs(rseg, rseg + leafs->sh_size);
+	}
 	if (retpolines) {
 		void *rseg = (void *)retpolines->sh_addr;
 		apply_retpolines(rseg, rseg + retpolines->sh_size);
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -298,6 +298,13 @@ SECTIONS
 		*(.call_sites)
 		__call_sites_end = .;
 	}
+
+	. = ALIGN(8);
+	.leaf_sites : AT(ADDR(.leaf_sites) - LOAD_OFFSET) {
+		__leaf_sites = .;
+		*(.leaf_sites)
+		__leaf_sites_end = .;
+	}
 #endif
 
 #ifdef CONFIG_X86_KERNEL_IBT
--- a/arch/x86/lib/retpoline.S
+++ b/arch/x86/lib/retpoline.S
@@ -77,9 +77,9 @@ SYM_CODE_END(__x86_indirect_thunk_array)
 SYM_INNER_LABEL(__x86_indirect_call_thunk_\reg, SYM_L_GLOBAL)
 	UNWIND_HINT_EMPTY
 	ANNOTATE_NOENDBR
-
-	CALL_DEPTH_ACCOUNT
-	POLINE \reg
+	sub	$10, %\reg
+	POLINE	\reg
+	add	$10, %\reg
 	ANNOTATE_UNRET_SAFE
 	ret
 	int3
@@ -216,10 +216,7 @@ SYM_FUNC_START(__x86_return_skl)
 1:
 	CALL_THUNKS_DEBUG_INC_STUFFS
 	.rept	16
-	ANNOTATE_INTRA_FUNCTION_CALL
-	call	2f
-	int3
-2:
+	__FILL_RETURN_SLOT
 	.endr
 	add	$(8*16), %rsp
 
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -945,6 +945,74 @@ static int create_direct_call_sections(s
 	return 0;
 }
 
+static int create_leaf_sites_sections(struct objtool_file *file)
+{
+	struct section *sec, *s;
+	struct symbol *sym;
+	unsigned int *loc;
+	int idx;
+
+	sec = find_section_by_name(file->elf, ".leaf_sites");
+	if (sec) {
+		INIT_LIST_HEAD(&file->call_list);
+		WARN("file already has .leaf_sites section, skipping");
+		return 0;
+	}
+
+	idx = 0;
+	for_each_sec(file, s) {
+		if (!s->text || s->init)
+			continue;
+
+		list_for_each_entry(sym, &s->symbol_list, list) {
+			if (sym->pfunc != sym)
+				continue;
+
+			if (sym->static_call_tramp)
+				continue;
+
+			if (!sym->leaf)
+				continue;
+
+			idx++;
+		}
+	}
+
+	sec = elf_create_section(file->elf, ".leaf_sites", 0, sizeof(unsigned int), idx);
+	if (!sec)
+		return -1;
+
+	idx = 0;
+	for_each_sec(file, s) {
+		if (!s->text || s->init)
+			continue;
+
+		list_for_each_entry(sym, &s->symbol_list, list) {
+			if (sym->pfunc != sym)
+				continue;
+
+			if (sym->static_call_tramp)
+				continue;
+
+			if (!sym->leaf)
+				continue;
+
+			loc = (unsigned int *)sec->data->d_buf + idx;
+			memset(loc, 0, sizeof(unsigned int));
+
+			if (elf_add_reloc_to_insn(file->elf, sec,
+						  idx * sizeof(unsigned int),
+						  R_X86_64_PC32,
+						  s, sym->offset))
+				return -1;
+
+			idx++;
+		}
+	}
+
+	return 0;
+}
+
 /*
  * Warnings shouldn't be reported for ignored functions.
  */
@@ -2362,6 +2430,9 @@ static int classify_symbols(struct objto
 			if (!strcmp(func->name, "__fentry__"))
 				func->fentry = true;
 
+			if (!strcmp(func->name, "__stack_chk_fail"))
+				func->stack_chk = true;
+
 			if (is_profiling_func(func->name))
 				func->profiling_func = true;
 		}
@@ -2492,6 +2563,16 @@ static bool is_fentry_call(struct instru
 	return false;
 }
 
+static bool is_stack_chk_call(struct instruction *insn)
+{
+	if (insn->type == INSN_CALL &&
+	    insn->call_dest &&
+	    insn->call_dest->stack_chk)
+		return true;
+
+	return false;
+}
+
 static bool has_modified_stack_frame(struct instruction *insn, struct insn_state *state)
 {
 	struct cfi_state *cfi = &state->cfi;
@@ -3269,6 +3350,9 @@ static int validate_call(struct objtool_
 			 struct instruction *insn,
 			 struct insn_state *state)
 {
+	if (insn->func && !is_fentry_call(insn) && !is_stack_chk_call(insn))
+		insn->func->leaf = 0;
+
 	if (state->noinstr && state->instr <= 0 &&
 	    !noinstr_call_dest(file, insn, insn->call_dest)) {
 		WARN_FUNC("call to %s() leaves .noinstr.text section",
@@ -3973,6 +4057,12 @@ static int validate_section(struct objto
 		init_insn_state(file, &state, sec);
 		set_func_state(&state.cfi);
 
+		/*
+		 * Assume it is a leaf function; will be cleared for any CALL
+		 * encountered while validating the branches.
+		 */
+		func->leaf = 1;
+
 		warnings += validate_symbol(file, sec, func, &state);
 	}
 
@@ -4358,6 +4448,11 @@ int check(struct objtool_file *file)
 		if (ret < 0)
 			goto out;
 		warnings += ret;
+
+		ret = create_leaf_sites_sections(file);
+		if (ret < 0)
+			goto out;
+		warnings += ret;
 	}
 
 	if (opts.rethunk) {
--- a/tools/objtool/include/objtool/elf.h
+++ b/tools/objtool/include/objtool/elf.h
@@ -61,6 +61,8 @@ struct symbol {
 	u8 return_thunk      : 1;
 	u8 fentry            : 1;
 	u8 profiling_func    : 1;
+	u8 leaf	             : 1;
+	u8 stack_chk         : 1;
 	struct list_head pv_target;
 };
 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v2 37/59] x86/putuser: Provide room for padding
  2022-09-02 20:24       ` Peter Zijlstra
@ 2022-09-02 21:46         ` Linus Torvalds
  2022-09-03 17:26           ` Linus Torvalds
  0 siblings, 1 reply; 81+ messages in thread
From: Linus Torvalds @ 2022-09-02 21:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, linux-kernel, x86, Tim Chen, Josh Poimboeuf,
	Andrew Cooper, Pawan Gupta, Johannes Wikner, Alyssa Milburn,
	Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman, Steven Rostedt,
	Juergen Gross, Masami Hiramatsu, Alexei Starovoitov,
	Daniel Borkmann, K Prateek Nayak, Eric Dumazet

On Fri, Sep 2, 2022 at 1:25 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> The below (mashup of a handful of patches) is the best I could come up
> with in a hurry.

Hmm. It doesn't look too horrible, but yeah, if it still ends up
getting the same padding overhead I'm not sure it ends up really
mattering.

I was hoping that some of the overhead would just go away - looking at
my kernel build, we do have a fair number of functions that are 1-31
bytes (according to a random and probably broken little shell script:

    objdump -t vmlinux |
        grep 'F .text' |
        sort |
        cut -f2- |
        grep '^00000000000000[01]'

but from a quick look, a fair number of them aren't actually even leaf
functions (ie they are small because they just call another function
with a set of simple arguments, often as a tail-call).

So maybe it's just not worth it.
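
[Editorial note: a more direct way to get the count being discussed, assuming GNU objdump's "addr flags F .text<TAB>size name" symbol-table layout and gawk's strtonum(); it tallies .text functions of 1-31 bytes instead of listing them:]

    # count .text functions smaller than 32 bytes
    objdump -t vmlinux |
        grep 'F .text' |
        awk '{ s = strtonum("0x" $5); if (s && s < 32) n++ } END { print n + 0 }'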

> If this patch makes you feel warm and fuzzy, I can clean it up,
> otherwise I would suggest getting the stuff we have merged before adding
> even more things on top.

Yeah, it doesn't look that convincing.  I have no real numbers - a lot
of small functions, but I'm not convinced that it is worth worrying
about them, particularly if it doesn't really help the actual text
size.

              Linus


* Re: [PATCH v2 37/59] x86/putuser: Provide room for padding
  2022-09-02 21:46         ` Linus Torvalds
@ 2022-09-03 17:26           ` Linus Torvalds
  2022-09-05  7:16             ` Peter Zijlstra
  0 siblings, 1 reply; 81+ messages in thread
From: Linus Torvalds @ 2022-09-03 17:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, linux-kernel, x86, Tim Chen, Josh Poimboeuf,
	Andrew Cooper, Pawan Gupta, Johannes Wikner, Alyssa Milburn,
	Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman, Steven Rostedt,
	Juergen Gross, Masami Hiramatsu, Alexei Starovoitov,
	Daniel Borkmann, K Prateek Nayak, Eric Dumazet

On Fri, Sep 2, 2022 at 2:46 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Hmm. It doesn't look too horrible, but yeah, if it still ends up
> getting the same padding overhead I'm not sure it ends up really
> mattering.

Oh, and I think it's buggy anyway.

If there are any tail-calls to a leaf function, the tracking now gets
out of whack. So it's no longer a "don't bother counting the last
level", now it ends up being a "count got off by one".

Oh well.

             Linus


* RE: [PATCH v2 08/59] x86/build: Ensure proper function alignment
  2022-09-02 13:06 ` [PATCH v2 08/59] x86/build: Ensure proper function alignment Peter Zijlstra
  2022-09-02 16:51   ` Linus Torvalds
@ 2022-09-05  2:09   ` David Laight
  1 sibling, 0 replies; 81+ messages in thread
From: David Laight @ 2022-09-05  2:09 UTC (permalink / raw)
  To: 'Peter Zijlstra', Thomas Gleixner
  Cc: linux-kernel, x86, Linus Torvalds, Tim Chen, Josh Poimboeuf,
	Andrew Cooper, Pawan Gupta, Johannes Wikner, Alyssa Milburn,
	Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman, Steven Rostedt,
	Juergen Gross, Masami Hiramatsu, Alexei Starovoitov,
	Daniel Borkmann, K Prateek Nayak, Eric Dumazet

From: Peter Zijlstra
> Sent: 02 September 2022 14:07
> 
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> The Intel Architectures Optimization Reference Manual explains that
> functions should be aligned at 16 bytes because for a lot of (Intel)
> uarchs the I-fetch width is 16 bytes. The AMD Software Optimization
> Guide (for recent chips) mentions a 32 byte I-fetch window but a 16
> byte decode window.
> 
> Follow this advice and align functions to 16 bytes to optimize
> instruction delivery to decode and reduce front-end bottlenecks.

Performance figures?

IIRC the same document will suggest aligning all jump labels.
That is pretty much known to be harmful because of the bloat
it generates.

Also things like CFI and ENDBR have a habit of making the
entry point unaligned unless you can pad to 16n+x values.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


* Re: [PATCH v2 37/59] x86/putuser: Provide room for padding
  2022-09-03 17:26           ` Linus Torvalds
@ 2022-09-05  7:16             ` Peter Zijlstra
  2022-09-05 11:26               ` Peter Zijlstra
  0 siblings, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-05  7:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, linux-kernel, x86, Tim Chen, Josh Poimboeuf,
	Andrew Cooper, Pawan Gupta, Johannes Wikner, Alyssa Milburn,
	Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman, Steven Rostedt,
	Juergen Gross, Masami Hiramatsu, Alexei Starovoitov,
	Daniel Borkmann, K Prateek Nayak, Eric Dumazet

On Sat, Sep 03, 2022 at 10:26:45AM -0700, Linus Torvalds wrote:
> On Fri, Sep 2, 2022 at 2:46 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Hmm. It doesn't look too horrible, but yeah, if it still ends up
> > getting the same padding overhead I'm not sure it ends up really
> > mattering.
> 
> Oh, and I think it's buggy anyway.
> 
> If there are any tail-calls to a leaf function, the tracking now gets
> out of whack. So it's no longer a "don't bother counting the last
> level", now it ends up being a "count got off by one".

See validate_sibling_call(), it too calls validate_call().

(Although perhaps I should go s/sibling/tail/ on all that for clarity)
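
[Editorial note: a rough, from-memory sketch (not an exact quote of the tree) of the path being pointed at; objtool funnels sibling/tail calls into the same validate_call() that clears the leaf flag:]

static int validate_sibling_call(struct objtool_file *file,
				 struct instruction *insn,
				 struct insn_state *state)
{
	/* a tail call made with a modified stack frame is already an error */
	if (has_modified_stack_frame(insn, state)) {
		WARN_FUNC("sibling call from callable instruction with modified stack frame",
			  insn->sec, insn->offset);
		return 1;
	}

	/* classified like a regular call, so ->leaf gets cleared here too */
	return validate_call(file, insn, state);
}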


* Re: [PATCH v2 08/59] x86/build: Ensure proper function alignment
  2022-09-02 18:08       ` Linus Torvalds
@ 2022-09-05 10:04         ` Peter Zijlstra
  2022-09-12 14:09           ` Linus Torvalds
  0 siblings, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-05 10:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, linux-kernel, x86, Tim Chen, Josh Poimboeuf,
	Andrew Cooper, Pawan Gupta, Johannes Wikner, Alyssa Milburn,
	Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman, Steven Rostedt,
	Juergen Gross, Masami Hiramatsu, Alexei Starovoitov,
	Daniel Borkmann, K Prateek Nayak, Eric Dumazet

On Fri, Sep 02, 2022 at 11:08:54AM -0700, Linus Torvalds wrote:

> Let's just do this right.

Something like so then?

---
--- a/Makefile
+++ b/Makefile
@@ -940,8 +940,8 @@ KBUILD_CFLAGS	+= $(CC_FLAGS_CFI)
 export CC_FLAGS_CFI
 endif
 
-ifdef CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B
-KBUILD_CFLAGS += -falign-functions=64
+ifneq ($(CONFIG_FUNCTION_ALIGNMENT),0)
+KBUILD_CFLAGS += -falign-functions=$(CONFIG_FUNCTION_ALIGNMENT)
 endif
 
 # arch Makefile may override CC so keep this after arch Makefile is included
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1419,4 +1419,24 @@ source "kernel/gcov/Kconfig"
 
 source "scripts/gcc-plugins/Kconfig"
 
+config FUNCTION_ALIGNMENT_8B
+	bool
+
+config FUNCTION_ALIGNMENT_16B
+	bool
+
+config FUNCTION_ALIGNMENT_32B
+	bool
+
+config FUNCTION_ALIGNMENT_64B
+	bool
+
+config FUNCTION_ALIGNMENT
+	int
+	default 64 if FUNCTION_ALIGNMENT_64B
+	default 32 if FUNCTION_ALIGNMENT_32B
+	default 16 if FUNCTION_ALIGNMENT_16B
+	default 8 if FUNCTION_ALIGNMENT_8B
+	default 0
+
 endmenu
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -63,6 +63,7 @@ config IA64
 	select NUMA if !FLATMEM
 	select PCI_MSI_ARCH_FALLBACKS if PCI_MSI
 	select ZONE_DMA32
+	select FUNCTION_ALIGNMENT_32B
 	default y
 	help
 	  The Itanium Processor Family is Intel's 64-bit successor to
--- a/arch/ia64/Makefile
+++ b/arch/ia64/Makefile
@@ -23,7 +23,7 @@ KBUILD_AFLAGS_KERNEL := -mconstant-gp
 EXTRA		:=
 
 cflags-y	:= -pipe $(EXTRA) -ffixed-r13 -mfixed-range=f12-f15,f32-f127 \
-		   -falign-functions=32 -frename-registers -fno-optimize-sibling-calls
+		   -frename-registers -fno-optimize-sibling-calls
 KBUILD_CFLAGS_KERNEL := -mconstant-gp
 
 GAS_STATUS	= $(shell $(srctree)/arch/ia64/scripts/check-gas "$(CC)" "$(OBJDUMP)")
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -283,6 +283,8 @@ config X86
 	select X86_FEATURE_NAMES		if PROC_FS
 	select PROC_PID_ARCH_STATUS		if PROC_FS
 	select HAVE_ARCH_NODE_DEV_GROUP		if X86_SGX
+	select FUNCTION_ALIGNMENT_16B		if X86_64 || X86_ALIGNMENT_16
+	select FUNCTION_ALIGNMENT_8B
 	imply IMA_SECURE_AND_OR_TRUSTED_BOOT    if EFI
 
 config INSTRUCTION_DECODER
@@ -2442,6 +2444,7 @@ config HAVE_CALL_THUNKS
 	depends on CC_HAS_ENTRY_PADDING && RETHUNK && OBJTOOL
 
 config CALL_THUNKS
+	select FUNCTION_ALIGNMENT_16B
 	def_bool n
 
 menuconfig SPECULATION_MITIGATIONS
@@ -2515,6 +2518,7 @@ config CALL_DEPTH_TRACKING
 
 config CALL_THUNKS_DEBUG
 	bool "Enable call thunks and call depth tracking debugging"
+	select FUNCTION_ALIGNMENT_32B
 	depends on CALL_DEPTH_TRACKING
 	default n
 	help
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -517,9 +517,3 @@ config CPU_SUP_VORTEX_32
 	  makes the kernel a tiny bit smaller.
 
 	  If unsure, say N.
-
-# Defined here so it is defined for UM too
-config FUNCTION_ALIGNMENT
-	int
-	default 16 if X86_64 || X86_ALIGNMENT_16
-	default 8
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -84,10 +84,6 @@ else
 KBUILD_CFLAGS += $(call cc-option,-fcf-protection=none)
 endif
 
-ifneq ($(CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B),y)
-KBUILD_CFLAGS += -falign-functions=$(CONFIG_FUNCTION_ALIGNMENT)
-endif
-
 ifeq ($(CONFIG_X86_32),y)
         BITS := 32
         UTS_MACHINE := i386
--- a/arch/x86/include/asm/linkage.h
+++ b/arch/x86/include/asm/linkage.h
@@ -12,16 +12,7 @@
 #define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(0)))
 #endif /* CONFIG_X86_32 */
 
-#if CONFIG_FUNCTION_ALIGNMENT == 8
-# define __ALIGN		.p2align 3, 0x90;
-#elif CONFIG_FUNCTION_ALIGNMENT == 16
-# define __ALIGN		.p2align 4, 0x90;
-#elif CONFIG_FUNCTION_ALIGNMENT == 32
-# define __ALIGN		.p2align 5, 0x90
-#else
-# error Unsupported function alignment
-#endif
-
+#define __ALIGN			.balign CONFIG_FUNCTION_ALIGNMENT, 0x90
 #define __ALIGN_STR		__stringify(__ALIGN)
 
 #ifdef CONFIG_CFI_CLANG
--- a/include/linux/linkage.h
+++ b/include/linux/linkage.h
@@ -69,8 +69,8 @@
 #endif
 
 #ifndef __ALIGN
-#define __ALIGN		.align 4,0x90
-#define __ALIGN_STR	".align 4,0x90"
+#define __ALIGN			.balign CONFIG_FUNCTION_ALIGNMENT
+#define __ALIGN_STR		__stringify(__ALIGN)
 #endif
 
 #ifdef __ASSEMBLY__
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -459,6 +459,7 @@ config SECTION_MISMATCH_WARN_ONLY
 config DEBUG_FORCE_FUNCTION_ALIGN_64B
 	bool "Force all function address 64B aligned"
 	depends on EXPERT && (X86_64 || ARM64 || PPC32 || PPC64 || ARC)
+	select FUNCTION_ALIGNMENT_64B
 	help
 	  There are cases that a commit from one domain changes the function
 	  address alignment of other domains, and cause magic performance


* Re: [PATCH v2 37/59] x86/putuser: Provide room for padding
  2022-09-05  7:16             ` Peter Zijlstra
@ 2022-09-05 11:26               ` Peter Zijlstra
  0 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-05 11:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, linux-kernel, x86, Tim Chen, Josh Poimboeuf,
	Andrew Cooper, Pawan Gupta, Johannes Wikner, Alyssa Milburn,
	Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman, Steven Rostedt,
	Juergen Gross, Masami Hiramatsu, Alexei Starovoitov,
	Daniel Borkmann, K Prateek Nayak, Eric Dumazet

On Mon, Sep 05, 2022 at 09:16:43AM +0200, Peter Zijlstra wrote:
> On Sat, Sep 03, 2022 at 10:26:45AM -0700, Linus Torvalds wrote:
> > On Fri, Sep 2, 2022 at 2:46 PM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > Hmm. It doesn't look too horrible, but yeah, if it still ends up
> > > getting the same padding overhead I'm not sure it ends up really
> > > mattering.
> > 
> > Oh, and I think it's buggy anyway.
> > 
> > If there are any tail-calls to a leaf function, the tracking now gets
> > out of whack. So it's no longer a "don't bother counting the last
> > level", now it ends up being a "count got off by one".
> 
> See validate_sibling_call(), it too calls validate_call().
> 
> (Although prehaps I should go s/sibling/tail/ on all that for clarity)

Ah, no, you're right and I needed more wake-up juice.


* Re: [PATCH v2 08/59] x86/build: Ensure proper function alignment
  2022-09-05 10:04         ` Peter Zijlstra
@ 2022-09-12 14:09           ` Linus Torvalds
  2022-09-12 19:44             ` Peter Zijlstra
  0 siblings, 1 reply; 81+ messages in thread
From: Linus Torvalds @ 2022-09-12 14:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, linux-kernel, x86, Tim Chen, Josh Poimboeuf,
	Andrew Cooper, Pawan Gupta, Johannes Wikner, Alyssa Milburn,
	Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman, Steven Rostedt,
	Juergen Gross, Masami Hiramatsu, Alexei Starovoitov,
	Daniel Borkmann, K Prateek Nayak, Eric Dumazet

On Mon, Sep 5, 2022 at 6:07 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Sep 02, 2022 at 11:08:54AM -0700, Linus Torvalds wrote:
>
> > Let's just do this right.
>
> Something like so then?

Sorry, I dropped this due to travel.

The patch looks sane, the only thing I worry a bit about is

> +config FUNCTION_ALIGNMENT
> +       int
> +       default 64 if FUNCTION_ALIGNMENT_64B
..
> +       default 0

Is '0' even a valid value then for things like

> +#define __ALIGN                        .balign CONFIG_FUNCTION_ALIGNMENT
> +#define __ALIGN_STR            __stringify(__ALIGN)

because it doesn't really seem like a sensible byte alignment.

Maybe "default 4" would be a safer choice?

                   Linus


* Re: [PATCH v2 08/59] x86/build: Ensure proper function alignment
  2022-09-12 14:09           ` Linus Torvalds
@ 2022-09-12 19:44             ` Peter Zijlstra
  2022-09-13  8:08               ` Peter Zijlstra
  0 siblings, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-12 19:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, linux-kernel, x86, Tim Chen, Josh Poimboeuf,
	Andrew Cooper, Pawan Gupta, Johannes Wikner, Alyssa Milburn,
	Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman, Steven Rostedt,
	Juergen Gross, Masami Hiramatsu, Alexei Starovoitov,
	Daniel Borkmann, K Prateek Nayak, Eric Dumazet

On Mon, Sep 12, 2022 at 10:09:38AM -0400, Linus Torvalds wrote:

> The patch looks sane, the only thing I worry a bit about is
> 
> > +config FUNCTION_ALIGNMENT
> > +       int
> > +       default 64 if FUNCTION_ALIGNMENT_64B
> ..
> > +       default 0
> 
> Is '0' even a valid value then for things like

At the time I thought I had read that a 0 alignment effectively no-ops
the statement, but now I can't find it in a hurry; happy to make it
default to 4.



* Re: [PATCH v2 08/59] x86/build: Ensure proper function alignment
  2022-09-12 19:44             ` Peter Zijlstra
@ 2022-09-13  8:08               ` Peter Zijlstra
  2022-09-13 13:08                 ` Linus Torvalds
  0 siblings, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-13  8:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, linux-kernel, x86, Tim Chen, Josh Poimboeuf,
	Andrew Cooper, Pawan Gupta, Johannes Wikner, Alyssa Milburn,
	Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman, Steven Rostedt,
	Juergen Gross, Masami Hiramatsu, Alexei Starovoitov,
	Daniel Borkmann, K Prateek Nayak, Eric Dumazet

On Mon, Sep 12, 2022 at 09:44:05PM +0200, Peter Zijlstra wrote:
> On Mon, Sep 12, 2022 at 10:09:38AM -0400, Linus Torvalds wrote:
> 
> > The patch looks sane, the only thing I worry a bit about is
> > 
> > > +config FUNCTION_ALIGNMENT
> > > +       int
> > > +       default 64 if FUNCTION_ALIGNMENT_64B
> > ..
> > > +       default 0
> > 
> > Is '0' even a valid value then for things like
> 
> At the time I thought I had read that a 0 alignment effectively no-ops
> the statement, but now I can't find it in a hurry; happy to make it
> default to 4.

Found it: https://sourceware.org/binutils/docs-2.39/as/Balign.html

7.8 .balign[wl] [abs-expr[, abs-expr[, abs-expr]]]

Pad the location counter (in the current subsection) to a particular
storage boundary. The first expression (which must be absolute) is the
alignment request in bytes. ...  If the expression is omitted then a
default value of 0 is used, effectively disabling alignment
requirements.

(for some raisin google served me a very old binutils document last
night that doesn't mention the 0 thing)
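
[Editorial note: for readers unfamiliar with the directive, ".balign 16, 0x90" requests the same 16-byte alignment as the old ".p2align 4, 0x90", padding with NOP (0x90) bytes; a minimal sketch with a made-up symbol name:]

	.text
	.balign	16, 0x90	/* pad to the next 16-byte boundary with NOPs */
pad_example:
	ret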


* Re: [PATCH v2 08/59] x86/build: Ensure proper function alignment
  2022-09-13  8:08               ` Peter Zijlstra
@ 2022-09-13 13:08                 ` Linus Torvalds
  0 siblings, 0 replies; 81+ messages in thread
From: Linus Torvalds @ 2022-09-13 13:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, linux-kernel, x86, Tim Chen, Josh Poimboeuf,
	Andrew Cooper, Pawan Gupta, Johannes Wikner, Alyssa Milburn,
	Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman, Steven Rostedt,
	Juergen Gross, Masami Hiramatsu, Alexei Starovoitov,
	Daniel Borkmann, K Prateek Nayak, Eric Dumazet

On Tue, Sep 13, 2022 at 4:09 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Found it: https://sourceware.org/binutils/docs-2.39/as/Balign.html
>
> 7.8 .balign[wl] [abs-expr[, abs-expr[, abs-expr]]]
>
> Pad the location counter [...]

Very good. All looks sane to me then.

              Linus


* Re: [PATCH v2 22/59] x86: Put hot per CPU variables into a struct
  2022-09-02 18:02   ` Jann Horn
@ 2022-09-15 11:22     ` Peter Zijlstra
  0 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-15 11:22 UTC (permalink / raw)
  To: Jann Horn
  Cc: Thomas Gleixner, linux-kernel, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

On Fri, Sep 02, 2022 at 08:02:46PM +0200, Jann Horn wrote:
> On Fri, Sep 2, 2022 at 3:54 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > From: Thomas Gleixner <tglx@linutronix.de>
> >
> > The layout of per-cpu variables is at the mercy of the compiler. This
> > can lead to random performance fluctuations from build to build.
> >
> > Create a structure to hold some of the hottest per-cpu variables,
> > starting with current_task.
> [...]
> > -DECLARE_PER_CPU(struct task_struct *, current_task);
> > +struct pcpu_hot {
> > +       union {
> > +               struct {
> > +                       struct task_struct      *current_task;
> > +               };
> > +               u8      pad[64];
> > +       };
> > +};
> 
> fixed_percpu_data::stack_canary is probably also a fairly hot per-cpu
> variable on distro kernels with CONFIG_STACKPROTECTOR_STRONG (which
> e.g. Debian enables), so perhaps it'd make sense to reuse
> fixed_percpu_data as the struct for hot percpu variables? But I don't
> have any numbers to actually back up that idea.

Not a bad idea; but the immediate problem I see with this is that
fixed_percpu_data is x86_64 only.

Also; I'm thinking the current stack-protector thing is somewhat of a
hack due to GCC limitations (per the comment there) and once that gets
cleaned up it can come live in the pcpu_hot thing.


* Re: [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation
  2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
                   ` (58 preceding siblings ...)
  2022-09-02 13:07 ` [PATCH v2 59/59] x86/retbleed: Add call depth tracking mitigation Peter Zijlstra
@ 2022-09-16  9:35 ` Mel Gorman
  59 siblings, 0 replies; 81+ messages in thread
From: Mel Gorman @ 2022-09-16  9:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, linux-kernel, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

On Fri, Sep 02, 2022 at 03:06:25PM +0200, Peter Zijlstra wrote:
> Excerpts from IBRS vs stuff from Mel's testing:
> 

FWIW, I retested this version as there were slight changes and retbleed=stuff
on Skylake is still far faster than the default. The default performance
differences are mostly within the noise versus a vanilla 6.0-rc3 kernel.
Other generations of Intel machines showed mostly noise. Zen[1-3] showed
no significant difference (immune to retbleed, but the function alignment or
percpu structure changes might have mattered). For the series:

Tested-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs


Thread overview: 81+ messages
2022-09-02 13:06 [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 01/59] x86/paravirt: Ensure proper alignment Peter Zijlstra
2022-09-02 16:05   ` Juergen Gross
2022-09-02 13:06 ` [PATCH v2 02/59] x86/cpu: Remove segment load from switch_to_new_gdt() Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 03/59] x86/cpu: Get rid of redundant switch_to_new_gdt() invocations Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 04/59] x86/cpu: Re-enable stackprotector Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 05/59] x86/modules: Set VM_FLUSH_RESET_PERMS in module_alloc() Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 06/59] x86/vdso: Ensure all kernel code is seen by objtool Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 07/59] x86: Sanitize linker script Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 08/59] x86/build: Ensure proper function alignment Peter Zijlstra
2022-09-02 16:51   ` Linus Torvalds
2022-09-02 17:32     ` Peter Zijlstra
2022-09-02 18:08       ` Linus Torvalds
2022-09-05 10:04         ` Peter Zijlstra
2022-09-12 14:09           ` Linus Torvalds
2022-09-12 19:44             ` Peter Zijlstra
2022-09-13  8:08               ` Peter Zijlstra
2022-09-13 13:08                 ` Linus Torvalds
2022-09-05  2:09   ` David Laight
2022-09-02 13:06 ` [PATCH v2 09/59] x86/asm: " Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 10/59] x86/error_inject: Align function properly Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 11/59] x86/paravirt: Properly align PV functions Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 12/59] x86/entry: Align SYM_CODE_START() variants Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 13/59] crypto: x86/camellia: Remove redundant alignments Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 14/59] crypto: x86/cast5: " Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 15/59] crypto: x86/crct10dif-pcl: " Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 16/59] crypto: x86/serpent: " Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 17/59] crypto: x86/sha1: Remove custom alignments Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 18/59] crypto: x86/sha256: " Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 19/59] crypto: x86/sm[34]: Remove redundant alignments Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 20/59] crypto: twofish: " Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 21/59] crypto: x86/poly1305: Remove custom function alignment Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 22/59] x86: Put hot per CPU variables into a struct Peter Zijlstra
2022-09-02 18:02   ` Jann Horn
2022-09-15 11:22     ` Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 23/59] x86/percpu: Move preempt_count next to current_task Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 24/59] x86/percpu: Move cpu_number " Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 25/59] x86/percpu: Move current_top_of_stack " Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 26/59] x86/percpu: Move irq_stack variables " Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 27/59] x86/softirq: Move softirq pending next to current task Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 28/59] objtool: Allow !PC relative relocations Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 29/59] objtool: Track init section Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 30/59] objtool: Add .call_sites section Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 31/59] objtool: Add --hacks=skylake Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 32/59] objtool: Allow STT_NOTYPE -> STT_FUNC+0 tail-calls Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 33/59] objtool: Fix find_{symbol,func}_containing() Peter Zijlstra
2022-09-02 13:06 ` [PATCH v2 34/59] objtool: Allow symbol range comparisons for IBT/ENDBR Peter Zijlstra
2022-09-02 13:07 ` [PATCH v2 35/59] x86/entry: Make sync_regs() invocation a tail call Peter Zijlstra
2022-09-02 13:07 ` [PATCH v2 36/59] ftrace: Add HAVE_DYNAMIC_FTRACE_NO_PATCHABLE Peter Zijlstra
2022-09-02 13:07 ` [PATCH v2 37/59] x86/putuser: Provide room for padding Peter Zijlstra
2022-09-02 16:43   ` Linus Torvalds
2022-09-02 17:03     ` Peter Zijlstra
2022-09-02 20:24       ` Peter Zijlstra
2022-09-02 21:46         ` Linus Torvalds
2022-09-03 17:26           ` Linus Torvalds
2022-09-05  7:16             ` Peter Zijlstra
2022-09-05 11:26               ` Peter Zijlstra
2022-09-02 13:07 ` [PATCH v2 38/59] x86/Kconfig: Add CONFIG_CALL_THUNKS Peter Zijlstra
2022-09-02 13:07 ` [PATCH v2 39/59] x86/Kconfig: Introduce function padding Peter Zijlstra
2022-09-02 13:07 ` [PATCH v2 40/59] x86/retbleed: Add X86_FEATURE_CALL_DEPTH Peter Zijlstra
2022-09-02 13:07 ` [PATCH v2 41/59] x86/alternatives: Provide text_poke_copy_locked() Peter Zijlstra
2022-09-02 13:07 ` [PATCH v2 42/59] x86/entry: Make some entry symbols global Peter Zijlstra
2022-09-02 13:07 ` [PATCH v2 43/59] x86/paravirt: Make struct paravirt_call_site unconditionally available Peter Zijlstra
2022-09-02 16:09   ` Juergen Gross
2022-09-02 13:07 ` [PATCH v2 44/59] x86/callthunks: Add call patching for call depth tracking Peter Zijlstra
2022-09-02 13:07 ` [PATCH v2 45/59] x86/modules: Add call patching Peter Zijlstra
2022-09-02 13:07 ` [PATCH v2 46/59] x86/returnthunk: Allow different return thunks Peter Zijlstra
2022-09-02 13:07 ` [PATCH v2 47/59] x86/asm: Provide ALTERNATIVE_3 Peter Zijlstra
2022-09-02 13:07 ` [PATCH v2 48/59] x86/retbleed: Add SKL return thunk Peter Zijlstra
2022-09-02 13:07 ` [PATCH v2 49/59] x86/retpoline: Add SKL retthunk retpolines Peter Zijlstra
2022-09-02 13:07 ` [PATCH v2 50/59] x86/retbleed: Add SKL call thunk Peter Zijlstra
2022-09-02 13:07 ` [PATCH v2 51/59] x86/calldepth: Add ret/call counting for debug Peter Zijlstra
2022-09-02 13:07 ` [PATCH v2 52/59] static_call: Add call depth tracking support Peter Zijlstra
2022-09-02 13:07 ` [PATCH v2 53/59] kallsyms: Take callthunks into account Peter Zijlstra
2022-09-02 13:07 ` [PATCH v2 54/59] x86/orc: Make it callthunk aware Peter Zijlstra
2022-09-02 13:07 ` [PATCH v2 55/59] x86/bpf: Emit call depth accounting if required Peter Zijlstra
2022-09-02 13:07 ` [PATCH v2 56/59] x86/ftrace: Remove ftrace_epilogue() Peter Zijlstra
2022-09-02 13:07 ` [PATCH v2 57/59] x86/ftrace: Rebalance RSB Peter Zijlstra
2022-09-02 13:07 ` [PATCH v2 58/59] x86/ftrace: Make it call depth tracking aware Peter Zijlstra
2022-09-02 13:07 ` [PATCH v2 59/59] x86/retbleed: Add call depth tracking mitigation Peter Zijlstra
2022-09-16  9:35 ` [PATCH v2 00/59] x86/retbleed: Call " Mel Gorman
