[PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation

* [PATCH v2 00/59] x86/retbleed: Call depth tracking mitigation
@ 2022-09-02 13:06 Peter Zijlstra
  2022-09-02 13:06 ` [PATCH v2 01/59] x86/paravirt: Ensure proper alignment Peter Zijlstra
                   ` (59 more replies)
  0 siblings, 60 replies; 81+ messages in thread
From: Peter Zijlstra @ 2022-09-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, x86, Linus Torvalds, Tim Chen,
	Josh Poimboeuf, Andrew Cooper, Pawan Gupta, Johannes Wikner,
	Alyssa Milburn, Jann Horn, H.J. Lu, Joao Moreira, Joseph Nuzman,
	Steven Rostedt, Juergen Gross, Masami Hiramatsu,
	Alexei Starovoitov, Daniel Borkmann, K Prateek Nayak,
	Eric Dumazet

Hi!

At long last a new version of the call depth tracking patches.

  v1: https://lkml.kernel.org/r/20220716230344.239749011@linutronix.de

This version is significantly different from the last in that it no longer
makes use of external call thunks allocated from the module space. Instead
every function gets aligned to 16 bytes and gets 16 bytes of (pre-symbol)
padding. (This padding will also come in handy for other things, like the
kCFI/FineIBT work.)

Prior to these patches function alignment is basically non-existent, as such
any instruction fetch for the first instructions of a function will have (on
average) half the fetch window filled with whatever comes before. By pushing
the alignment up to 16 bytes this improves matters for chips that happen to
have a 16 byte i-fetch window size (Intel) while not making matters worse for
chips that have a larger 32 byte i-fetch window (AMD Zen). In fact, it improves
the worst case for Zen from 31 bytes of garbage to 16 bytes of garbage.

As such the first many patches of the series fix up lots of alignment quirks.

The second big difference is the introduction of struct pcpu_hot. Because the
compiler managed to place two adjacent (in code) DEFINE_PER_CPU() variables in
random cachelines (it is absolutely free to do so) the introduction of the
per-cpu x86_call_depth variable sometimes introduced significant additional
cache pressure, while other times it would sit nicely in the same line with
preempt_count and not show up at all.

In order to alleviate this problem; introduce struct pcpu_hot and collect a
number of hot per-cpu variables in a way the compiler can't mess up.

Since these changes are 'unconditional', Mel was gracious enough to help test
this on his test setup across all the relevant uarchs (very much including both
Intel and AMD machines) and found that while these changes cause some very
small wins and losses across the board it is mostly noise.

Aside from these changes; the core of the depth tracking is still the same.

 - objtool creates a list of (function) call sites.

 - for every call; overwrite the padding of the target function with the
   accounting thunk (if not already done) and adjust the call site to
   target this thunk.

 - the retbleed return thunk mechanism is used for a custom return thunk
   that includes return accounting and does RSB stuffing when required.

This ensures no new compiler is required and avoids almost all overhead for
non affected machines. This new option can still be selected using:

  "retbleed=stuff"

on the kernel command line.

As a refresher; the theory behind call depth tracking is:

The Return-Stack-Buffer (RSB) is a 16 deep stack that is filled on every call.
On the return path speculation will "pop" an entry and takes that as the return
target. Once the RSB is empty, the CPU falls back to other predictors, e.g. the
Branch History Buffer, which can be mistrained by user space and misguides the
(return) speculation path to a disclosure gadget of your choice -- as described
in the retbleed paper.

Call depth tracking is designed to break this speculation path by stuffing
speculation trap calls into the RSB whenver the RSB is running low. This way
the speculation stalls and never falls back to other predictors.

The assumption is that stuffing at the 12th return is sufficient to break the
speculation before it hits the underflow and the fallback to the other
predictors. Testing confirms that it works. Johannes, one of the retbleed
researchers, tried to attack this approach and confirmed that it brings the
signal to noise ratio down to the crystal ball level.

Excerpts from IBRS vs stuff from Mel's testing:

perfsyscall

		6.0.0-rc1		6.0.0-rc1
                tglx-mit-spectre-ibrs	tglx-mit-spectre-retpoline-retstuff
Duration User         136.16		   69.10
Duration System       100.50		   33.04
Duration Elapsed      237.20		  102.65

That's a massive improvement with a major reduction in system CPU usage.

Kernel compilation is variable. Skylake-X was modest with 2-18% gain depending
on degree of parallelisation.

Git checkouts are roughly 14% faster on Skylake-X

Network test were localhost only so are limited but even so, the gain is
large. Skylake-X again;

Netperf-TCP
                                  6.0.0-rc1              6.0.0-rc1
                      tglx-mit-spectre-ibrs  tglx-mit-spectre-retpoline-retstuff
Hmean     send-64         241.39 (   0.00%)      298.00 *  23.45%*
Hmean     send-128        489.55 (   0.00%)      610.46 *  24.70%*
Hmean     send-256        990.85 (   0.00%)     1201.73 *  21.28%*
Hmean     send-1024      4051.84 (   0.00%)     5006.19 *  23.55%*
Hmean     send-2048      7924.75 (   0.00%)     9777.14 *  23.37%*
Hmean     send-3312     12319.98 (   0.00%)    15210.07 *  23.46%*
Hmean     send-4096     14770.62 (   0.00%)    17941.32 *  21.47%*
Hmean     send-8192     26302.00 (   0.00%)    30170.04 *  14.71%*
Hmean     send-16384    42449.51 (   0.00%)    48036.45 *  13.16%*

While this is UDP_STREAM, TCP_STREAM is similarly impressive.

FIO measurements done by Tim Chen:

read (kIOPs)            Mean    stdev   mitigations=off retbleed=off    CPU util
================================================================================
mitigations=off         357.33  3.79    0.00%           6.14%           98.93%
retbleed=off            336.67  6.43    -5.78%          0.00%           99.01%
retbleed=ibrs           242.00  0.00    -32.28%         -28.12%         99.41%
retbleed=stuff (pad)    314.33  1.53    -12.03%         -6.63%          99.31%

read/write                              Baseline        Baseline
70/30 (kIOPs)           Mean    stdev   mitigations=off retbleed=off    CPU util
================================================================================
mitigations=off         349.00  5.29    0.00%           9.06%           96.66%
retbleed=off            320.00  5.05    -8.31%          0.00%           95.54%
retbleed=ibrs           238.60  0.17    -31.63%         -25.44%         98.18%
retbleed=stuff (pad)    293.37  0.81    -15.94%         -8.32%          97.71%

                                        Baseline        Baseline
write (kIOPs)           Mean    stdev   mitigations=off retbleed=off    CPU util
================================================================================
mitigations=off         296.33  8.08    0.00%           6.21%           93.96%
retbleed=off            279.00  2.65    -5.85%          0.00%           93.63%
retbleed=ibrs           230.33  0.58    -22.27%         -17.44%         95.92%
retbleed=stuff (pad)    266.67  1.53    -10.01%         -4.42%          94.75%

The patches can also be found in git here:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git depthtracking

^ permalink raw reply	[flat|nested] 81+ messages in thread