From: Jan Beulich <jbeulich@suse.com>
To: Andrew Cooper <amc96@srcf.net>
Cc: "Andrew Cooper" <andrew.cooper3@citrix.com>,
	"Wei Liu" <wl@xen.org>, "Roger Pau Monné" <roger.pau@citrix.com>,
	"Kevin Tian" <kevin.tian@intel.com>,
	"Jun Nakajima" <jun.nakajima@intel.com>,
	"xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>
Subject: Re: [PATCH v3 1/4] x86/spec-ctrl: add logic to issue IBPB on exit to guest
Date: Mon, 6 Feb 2023 14:58:11 +0100
Message-ID: <38621f02-83a3-d1dc-560f-905dd0d963b0@suse.com>
In-Reply-To: <df86e0a6-1415-3fd3-5202-faff5edfc271@srcf.net>

On 26.01.2023 21:27, Andrew Cooper wrote:
> On 26/01/2023 8:02 am, Jan Beulich wrote:
>> On 25.01.2023 22:10, Andrew Cooper wrote:
>>> On 25/01/2023 3:25 pm, Jan Beulich wrote:
>>>> In order to be able to defer the context switch IBPB to the last
>>>> possible point, add logic to the exit-to-guest paths to issue the
>>>> barrier there, including the "IBPB doesn't flush the RSB/RAS"
>>>> workaround. Since alternatives, for now at least, can't nest, emit JMP
>>>> to skip past both constructs where both are needed. This may be more
>>>> efficient anyway, as the sequence of NOPs is pretty long.
>>> It is very uarch specific as to when a jump is less overhead than a line
>>> of nops.
>>>
>>> In all CPUs liable to be running Xen, even unconditional jumps take up
>>> branch prediction resource, because all branch prediction is pre-decode
>>> these days, so branch locations/types/destinations all need deriving
>>> from %rip and "history" alone.
>>>
>>> So whether a branch or a line of nops is better is a tradeoff between
>>> how much competition there is for branch prediction resource, and how
>>> efficiently the CPU can brute-force its way through a long line of nops.
>>>
>>> But a very interesting datapoint.  It turns out that AMD Zen4 CPUs
>>> macrofuse adjacent nops, including longnops, because it reduces the
>>> amount of execute/retire resources required.  And a lot of
>>> kernel/hypervisor fastpaths have a lot of nops these days.
>>>
>>>
>>> For us, the "can't nest" is singularly more important than any worry
>>> about uarch behaviour.  We've frankly got much lower hanging fruit than
>>> worrying about one branch vs a few nops.
>>>
>>>> LFENCEs are omitted - for HVM a VM entry is imminent, an event we
>>>> already deem sufficiently serializing elsewhere. For 32-bit PV
>>>> we're going through IRET, which ought to be good enough as well. While
>>>> 64-bit PV may use SYSRET, there are several more conditional branches
>>>> there which are all unprotected.
>>> Privilege changes are serialising-ish, and this behaviour has been
>>> guaranteed moving forwards, although not documented coherently.
>>>
>>> CPL (well - privilege, which includes SMM, root/non-root, etc) is not
>>> written speculatively.  So any logic which needs to modify privilege has
>>> to block until it is known to be an architectural execution path.
>>>
>>> This gets us "lfence-like" or "dispatch serialising" behaviour, which is
>>> also the reason why INT3 is our go-to speculation halting instruction. 
>>> Microcode has to be entirely certain we are going to deliver an
>>> interrupt/exception/etc before it can start reading the IDT/etc.
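>>>
>>> Illustrative sketch (not from this patch) of why that property makes
>>> INT3 useful after an indirect branch:
>>>
>>>     jmp *%rax   /* speculation may run straight past the branch ... */
>>>     int3        /* ... and is guaranteed to stop here               */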
>>>
>>> Either way, we've been promised that all instructions like IRET,
>>> SYS{CALL,RET,ENTER,EXIT}, VM{RUN,LAUNCH,RESUME} (and ERET{U,S} in the
>>> future FRED world) do not, and shall continue not to, execute
>>> speculatively.
>>>
>>> Which in practice means we don't need to worry about Spectre-v1 attacks
>>> against codepaths which hit an exit-from-xen path, in terms of skipping
>>> protections.
>>>
>>> We do need to be careful about memory accesses and potential double
>>> dereferences, but all the data is on the top of the stack for XPTI
>>> reasons.  About the only concern is v->arch.msrs->* in the HVM path, and
>>> we're fine with the current layout (AFAICT).
>>>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>> ---
>>>> I have to admit that I'm not really certain about the placement of the
>>>> IBPB wrt the MSR_SPEC_CTRL writes. For now I've simply used "opposite of
>>>> entry".
>>> It really doesn't matter.  They're independent operations that both need
>>> doing, and are fully serialising so can't parallelise.
>>>
>>> But on this note, WRMSRNS and WRMSRLIST are on the horizon.  The CPUs
>>> which implement these instructions are the ones which also ought not to
>>> need any adjustments in the exit paths.  So I think it is specifically
>>> not worth trying to make any effort to turn *these* WRMSRs into more
>>> optimised forms.
>>>
>>> But WRMSRLIST was designed specifically for this kind of usecase
>>> (actually, more for the main context switch path) where you can prepare
>>> the list of MSRs in memory, including the ability to conditionally skip
>>> certain entries by adjusting the index field.
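>>>
>>> IIRC the rough shape is (hedged sketch - operand roles per the ISE
>>> spec, table names invented, and we don't implement any of this yet):
>>>
>>>     lea  msr_addr_table(%rip), %rsi   /* table of MSR indices       */
>>>     lea  msr_data_table(%rip), %rdi   /* table of values to write   */
>>>     mov  $0x5, %rcx                   /* bitmask of entries to use; */
>>>                                       /* clearing a bit skips one   */
>>>     wrmsrlist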
>>>
>>>
>>> It occurs to me, having written this out, that what we actually want
>>> to do is have slightly custom not-quite-alternative blocks.  We have a
>>> sequence of independent code blocks, and a small block at the end that
>>> happens to contain an IRET.
>>>
>>> We could remove the nops at boot time if we treated it as one large
>>> region, with the IRET at the end also able to have a variable position,
>>> and assembled the "active" blocks tightly from the start.  Complications
>>> would include adjusting the IRET extable entry, but this isn't
>>> insurmountable.  Entrypoints are a bit more tricky but could be done by
>>> packing from the back forward, and adjusting the entry position.
>>>
>>> Either way, something to ponder.  (It's also possible that it doesn't
>>> make a measurable difference until we get to FRED, at which point we
>>> have a set of fresh entry-points to write anyway, and a slight glimmer
>>> of hope of not needing to pollute them with speculation workarounds...)
>>>
>>>> Since we're going to run out of SCF_* bits soon and since the new flag
>>>> is meaningful only in struct cpu_info's spec_ctrl_flags, we could choose
>>>> to widen that field to 16 bits right away and then use bit 8 (or higher)
>>>> for the purpose here.
>>> I really don't think it matters.  We've got plenty of room, and the
>>> flexibility to shuffle, in both structures.  It's absolutely not worth
>>> trying to introduce asymmetries to save 1 bit.
>> Thanks for all the comments up to here. Just to clarify - I've not spotted
>> anything there that would result in me being expected to take any action.
> 
> I'm tempted to suggest dropping the sentence about "might be more
> efficient".  It's not necessary at all IMO, and it's probably not
> correct if you were to compare an atom microserver vs a big Xeon.

Hmm - "might" still isn't "will". ISTR us actually discussing to limit
how long a sequence of NOPs we'd tolerate before switching to JMP.

>>>> --- a/xen/arch/x86/include/asm/spec_ctrl.h
>>>> +++ b/xen/arch/x86/include/asm/spec_ctrl.h
>>>> @@ -36,6 +36,8 @@
>>>>  #define SCF_verw       (1 << 3)
>>>>  #define SCF_ist_ibpb   (1 << 4)
>>>>  #define SCF_entry_ibpb (1 << 5)
>>>> +#define SCF_exit_ibpb_bit 6
>>>> +#define SCF_exit_ibpb  (1 << SCF_exit_ibpb_bit)
>>> One option to avoid the second define is to use ILOG2() with btrl.
>> Specifically not. The assembler doesn't know the conditional operator,
>> and the pre-processor won't collapse the expression resulting from
>> expanding ilog2().
> 
> Oh that's dull.
> 
> I suspect we could construct equivalent logic with an .if/.else chain,
> but I have no idea if the order of evaluation would be conducive to
> doing so as part of evaluating an immediate operand.  We would
> specifically not want something that ended up looking like `ilog2 const
> "btrl $" ",%eax"`, even though I could see how to make that work.
> 
> It would be nice if we could make something suitable here, but if not we
> can live with the second _bit constant.

How would .if/.else be able to go inside an expression? You can't even
put this in a .macro, as that can't be invoked as part of an expression
either.
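
For reference, the shape of the problem, with ILOG2() shrunk to a
hypothetical two-case variant:

    #define ILOG2(x) ((x) & (1 << 1) ? 1 : 0)
    btrl $ILOG2(SCF_exit_ibpb), %eax
    /* cpp leaves gas to evaluate
     *     btrl $((1 << 6) & (1 << 1) ? 1 : 0), %eax
     * but gas expressions have no ?: operator. */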

>>>> @@ -272,6 +293,14 @@
>>>>  #define SPEC_CTRL_EXIT_TO_PV                                            \
>>>>      ALTERNATIVE "",                                                     \
>>>>          DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_SC_MSR_PV;              \
>>>> +    ALTERNATIVE __stringify(jmp PASTE(.Lscexitpv_done, __LINE__)),      \
>>>> +        __stringify(DO_SPEC_CTRL_EXIT_IBPB                              \
>>>> +                    disp=(PASTE(.Lscexitpv_done, __LINE__) -            \
>>>> +                          PASTE(.Lscexitpv_rsb, __LINE__))),            \
>>>> +        X86_FEATURE_IBPB_EXIT_PV;                                       \
>>>> +PASTE(.Lscexitpv_rsb, __LINE__):                                        \
>>>> +    ALTERNATIVE "", DO_OVERWRITE_RSB, X86_BUG_IBPB_NO_RET;              \
>>>> +PASTE(.Lscexitpv_done, __LINE__):                                       \
>>>>      DO_SPEC_CTRL_COND_VERW
>>> What's wrong with the normal %= trick?
>> We're in a C macro here which is then used in assembly code. %= only
>> works in asm(), though. If we were in an assembler macro, I'd have
>> used \@. Yet wrapping the whole thing in an assembler macro would, for
>> my taste, hide too much information from the use sites (in particular
>> the X86_{FEATURE,BUG}_* which are imo relevant to be visible there).
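>>
>> I.e., the two mechanisms side by side (sketch):
>>
>>     /* In C: %= expands to a number unique to each asm() instance. */
>>     asm ( ".Luniq%=: nop" );
>>
>>     /* In an assembler macro: \@ is gas's macro invocation counter. */
>>     .macro scexit
>>     .Lscexit_\@:
>>     .endm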
>>
>>>   The use of __LINE__ makes this
>>> hard to subsequently livepatch, so I'd prefer to avoid it if possible.
>> Yes, I was certainly aware this would be a concern. I couldn't think of
>> a (forward looking) clean solution, though: Right now we have only one
>> use per source file (the native and compat PV entry.S), so we could use
>> a context-independent label name. But as you say above, for FRED we're
>> likely to get new entry points, and they're likely better placed in the
>> same files.
> 
> I was going to suggest using __COUNTER__ but I've just realised why that
> won't work.
> 
> But on further consideration, this might be ok for livepatching, so long
> as __LINE__ is only ever used with a local label name.  By the time it
> has been compiled to a binary, the label name won't survive; only the
> resulting displacement will.
> 
> I think we probably want to merge this with TEMP_NAME() (perhaps as
> UNIQ_NAME(), as it will have to move elsewhere to become common with
> this) to avoid proliferating our livepatching reasoning.

Even if I had recalled that we have TEMP_NAME() somewhere, I'd be very
hesitant to make anything like that into more generally accessible
infrastructure. I consider TEMP_NAME() itself already too widely
exposed. The problem with it is that you can easily end up with two uses
as the result of expanding something that's all contained on a single
source line. Hence I very specifically want to have uses of __LINE__
only in places where either it is the top level source line, or where
- as is the case here - it is clear that no two instances of the same or
a similar macro will ever sensibly be put on one line. (Even then there's
still the risk of using the C macro inside an assembler macro, where two
instances would cause problems. But if such a use appeared, making the
assembler macro suitably use \@ instead shouldn't be overly difficult.)
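
To illustrate the collision with a hypothetical example:

    #define MK_SKIP() jmp PASTE(.Lskip, __LINE__); PASTE(.Lskip, __LINE__):
    MK_SKIP()            /* fine on a line of its own */
    MK_SKIP() MK_SKIP()  /* same __LINE__ twice -> .LskipNNN defined twice */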

Jan


