From: Jan Beulich <jbeulich@suse.com>
To: Andrew Cooper <amc96@srcf.net>
Cc: "Andrew Cooper" <andrew.cooper3@citrix.com>,
	"Wei Liu" <wl@xen.org>, "Roger Pau Monné" <roger.pau@citrix.com>,
	"Kevin Tian" <kevin.tian@intel.com>,
	"Jun Nakajima" <jun.nakajima@intel.com>,
	"xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>
Subject: Re: [PATCH v3 1/4] x86/spec-ctrl: add logic to issue IBPB on exit to guest
Date: Mon, 6 Feb 2023 14:58:11 +0100
Message-ID: <38621f02-83a3-d1dc-560f-905dd0d963b0@suse.com>
In-Reply-To: <df86e0a6-1415-3fd3-5202-faff5edfc271@srcf.net>

On 26.01.2023 21:27, Andrew Cooper wrote:
> On 26/01/2023 8:02 am, Jan Beulich wrote:
>> On 25.01.2023 22:10, Andrew Cooper wrote:
>>> On 25/01/2023 3:25 pm, Jan Beulich wrote:
>>>> In order to be able to defer the context switch IBPB to the last
>>>> possible point, add logic to the exit-to-guest paths to issue the
>>>> barrier there, including the "IBPB doesn't flush the RSB/RAS"
>>>> workaround. Since alternatives, for now at least, can't nest, emit JMP
>>>> to skip past both constructs where both are needed. This may be more
>>>> efficient anyway, as the sequence of NOPs is pretty long.
>>> It is very uarch specific as to when a jump is less overhead than a line
>>> of nops.
>>>
>>> In all CPUs liable to be running Xen, even unconditional jumps take up
>>> branch prediction resource, because all branch prediction is pre-decode
>>> these days, so branch locations/types/destinations all need deriving
>>> from %rip and "history" alone.
>>>
>>> So whether a branch or a line of nops is better is a tradeoff between
>>> how much competition there is for branch prediction resource, and how
>>> efficiently the CPU can brute-force its way through a long line of nops.
>>>
>>> But a very interesting datapoint.  It turns out that AMD Zen4 CPUs
>>> macrofuse adjacent nops, including longnops, because it reduces the
>>> amount of execute/retire resources required.  And a lot of
>>> kernel/hypervisor fastpaths have a lot of nops these days.
>>>
>>>
>>> For us, the "can't nest" is singularly more important than any worry
>>> about uarch behaviour.  We've frankly got much lower hanging fruit than
>>> worrying about one branch vs a few nops.
>>>
>>>> LFENCEs are omitted - for HVM a VM entry is imminent, an event we
>>>> already deem sufficiently serializing elsewhere. For 32-bit PV
>>>> we're going through IRET, which ought to be good enough as well. While
>>>> 64-bit PV may use SYSRET, there are several more conditional branches
>>>> there which are all unprotected.
>>> Privilege changes are serialising-ish, and this behaviour has been
>>> guaranteed moving forwards, although not documented coherently.
>>>
>>> CPL (well - privilege, which includes SMM, root/non-root, etc) is not
>>> written speculatively.  So any logic which needs to modify privilege has
>>> to block until it is known to be an architectural execution path.
>>>
>>> This gets us "lfence-like" or "dispatch serialising" behaviour, which is
>>> also the reason why INT3 is our go-to speculation halting instruction. 
>>> Microcode has to be entirely certain we are going to deliver an
>>> interrupt/exception/etc before it can start reading the IDT/etc.
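>>>
>>> Illustrative sketch (not from this patch) of why that property makes
>>> INT3 useful after an indirect branch:
>>>
>>>     jmp *%rax   /* speculation may run straight past the branch ... */
>>>     int3        /* ... and is guaranteed to stop here               */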
>>>
>>> Either way, we've been promised that all instructions like IRET,
>>> SYS{CALL,RET,ENTER,EXIT}, VM{RUN,LAUNCH,RESUME} (and ERET{U,S} in the
>>> future FRED world) do not, and shall continue not to, execute
>>> speculatively.
>>>
>>> Which in practice means we don't need to worry about Spectre-v1 attacks
>>> against codepaths which hit an exit-from-xen path, in terms of skipping
>>> protections.
>>>
>>> We do need to be careful about memory accesses and potential double
>>> dereferences, but all the data is on the top of the stack for XPTI
>>> reasons.  About the only concern is v->arch.msrs->* in the HVM path, and
>>> we're fine with the current layout (AFAICT).
>>>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>> ---
>>>> I have to admit that I'm not really certain about the placement of the
>>>> IBPB wrt the MSR_SPEC_CTRL writes. For now I've simply used "opposite of
>>>> entry".
>>> It really doesn't matter.  They're independent operations that both need
>>> doing, and are fully serialising so can't parallelise.
>>>
>>> But on this note, WRMSRNS and WRMSRLIST are on the horizon.  The CPUs
>>> which implement these instructions are the ones which also ought not to
>>> need any adjustments in the exit paths.  So I think it is specifically
>>> not worth trying to make any effort to turn *these* WRMSRs into more
>>> optimised forms.
>>>
>>> But WRMSRLIST was designed specifically for this kind of usecase
>>> (actually, more for the main context switch path) where you can prepare
>>> the list of MSRs in memory, including the ability to conditionally skip
>>> certain entries by adjusting the index field.
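>>>
>>> IIRC the rough shape is (hedged sketch - operand roles per the ISE
>>> spec, table names invented, and we don't implement any of this yet):
>>>
>>>     lea  msr_addr_table(%rip), %rsi   /* table of MSR indices       */
>>>     lea  msr_data_table(%rip), %rdi   /* table of values to write   */
>>>     mov  $0x5, %rcx                   /* bitmask of entries to use; */
>>>                                       /* clearing a bit skips one   */
>>>     wrmsrlist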
>>>
>>>
>>> It occurs to me, having written this out, that what we actually want
>>> to do is have slightly custom not-quite-alternative blocks.  We have a
>>> sequence of independent code blocks, and a small block at the end that
>>> happens to contain an IRET.
>>>
>>> We could remove the nops at boot time if we treated it as one large
>>> region, with the IRET at the end also able to have a variable position,
>>> and assembled the "active" blocks tightly from the start.  Complications
>>> would include adjusting the IRET extable entry, but this isn't
>>> insurmountable.  Entrypoints are a bit more tricky but could be done by
>>> packing from the back forward, and adjusting the entry position.
>>>
>>> Either way, something to ponder.  (It's also possible that it doesn't
>>> make a measurable difference until we get to FRED, at which point we
>>> have a set of fresh entry-points to write anyway, and a slight glimmer
>>> of hope of not needing to pollute them with speculation workarounds...)
>>>
>>>> Since we're going to run out of SCF_* bits soon and since the new flag
>>>> is meaningful only in struct cpu_info's spec_ctrl_flags, we could choose
>>>> to widen that field to 16 bits right away and then use bit 8 (or higher)
>>>> for the purpose here.
>>> I really don't think it matters.  We've got plenty of room, and the
>>> flexibility to shuffle, in both structures.  It's absolutely not worth
>>> trying to introduce asymmetries to save 1 bit.
>> Thanks for all the comments up to here. Just to clarify - I've not spotted
>> anything there that would result in me being expected to take any action.
> 
> I'm tempted to suggest dropping the sentence about "might be more
> efficient".  It's not necessary at all IMO, and it's probably not
> correct if you were to compare an atom microserver vs a big Xeon.

Hmm - "might" still isn't "will". ISTR us actually discussing to limit
how long a sequence of NOPs we'd tolerate before switching to JMP.

>>>> --- a/xen/arch/x86/include/asm/spec_ctrl.h
>>>> +++ b/xen/arch/x86/include/asm/spec_ctrl.h
>>>> @@ -36,6 +36,8 @@
>>>>  #define SCF_verw       (1 << 3)
>>>>  #define SCF_ist_ibpb   (1 << 4)
>>>>  #define SCF_entry_ibpb (1 << 5)
>>>> +#define SCF_exit_ibpb_bit 6
>>>> +#define SCF_exit_ibpb  (1 << SCF_exit_ibpb_bit)
>>> One option to avoid the second define is to use ILOG2() with btrl.
>> Specifically not. The assembler doesn't know the conditional operator,
>> and the pre-processor won't collapse the expression resulting from
>> expanding ilog2().
> 
> Oh that's dull.
> 
> I suspect we could construct equivalent logic with an .if/.else chain,
> but I have no idea if the order of evaluation would be conducive to
> doing so as part of evaluating an immediate operand.  We would
> specifically not want something that ended up looking like `ilog2 const
> "btrl $" ",%eax"`, even though I could see how to make that work.
> 
> It would be nice if we could make something suitable here, but if not we
> can live with the second _bit constant.

How would .if/.else be able to go inside an expression? You can't even
put this in a .macro, as that can't be invoked as part of an expression
either.
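
For reference, the shape of the problem, with ILOG2() shrunk to a
hypothetical two-case variant:

    #define ILOG2(x) ((x) & (1 << 1) ? 1 : 0)
    btrl $ILOG2(SCF_exit_ibpb), %eax
    /* cpp leaves gas to evaluate
     *     btrl $((1 << 6) & (1 << 1) ? 1 : 0), %eax
     * but gas expressions have no ?: operator. */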

>>>> @@ -272,6 +293,14 @@
>>>>  #define SPEC_CTRL_EXIT_TO_PV                                            \
>>>>      ALTERNATIVE "",                                                     \
>>>>          DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_SC_MSR_PV;              \
>>>> +    ALTERNATIVE __stringify(jmp PASTE(.Lscexitpv_done, __LINE__)),      \
>>>> +        __stringify(DO_SPEC_CTRL_EXIT_IBPB                              \
>>>> +                    disp=(PASTE(.Lscexitpv_done, __LINE__) -            \
>>>> +                          PASTE(.Lscexitpv_rsb, __LINE__))),            \
>>>> +        X86_FEATURE_IBPB_EXIT_PV;                                       \
>>>> +PASTE(.Lscexitpv_rsb, __LINE__):                                        \
>>>> +    ALTERNATIVE "", DO_OVERWRITE_RSB, X86_BUG_IBPB_NO_RET;              \
>>>> +PASTE(.Lscexitpv_done, __LINE__):                                       \
>>>>      DO_SPEC_CTRL_COND_VERW
>>> What's wrong with the normal %= trick?
>> We're in a C macro here which is then used in assembly code. %= only
>> works in asm(), though. If we were in an assembler macro, I'd have
>> used \@. Yet wrapping the whole thing in an assembler macro would, for
>> my taste, hide too much information from the use sites (in particular
>> the X86_{FEATURE,BUG}_* which are imo relevant to be visible there).
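>>
>> I.e., the two mechanisms side by side (sketch):
>>
>>     /* In C: %= expands to a number unique to each asm() instance. */
>>     asm ( ".Luniq%=: nop" );
>>
>>     /* In an assembler macro: \@ is gas's macro invocation counter. */
>>     .macro scexit
>>     .Lscexit_\@:
>>     .endm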
>>
>>>   The use of __LINE__ makes this
>>> hard to subsequently livepatch, so I'd prefer to avoid it if possible.
>> Yes, I was certainly aware this would be a concern. I couldn't think of
>> a (forward looking) clean solution, though: Right now we have only one
>> use per source file (the native and compat PV entry.S), so we could use
>> a context-independent label name. But as you say above, for FRED we're
>> likely to get new entry points, and they're likely better placed in the
>> same files.
> 
> I was going to suggest using __COUNTER__ but I've just realised why that
> won't work.
> 
> But on further consideration, this might be ok for livepatching, so long
> as __LINE__ is only ever used with a local label name.  By the time it
> has been compiled to a binary, the label name won't survive; only the
> resulting displacement will.
> 
> I think we probably want to merge this with TEMP_NAME() (perhaps as
> UNIQ_NAME(), as it will have to move elsewhere to become common with
> this) to avoid proliferating our livepatching reasoning.

Even if I had recalled that we have TEMP_NAME() somewhere, I'd be very
hesitant to make anything like that into more generally accessible
infrastructure. I consider TEMP_NAME() itself already too widely
exposed. The problem with it is that you can easily end up with two uses
as the result of expanding something that's all contained on a single
source line. Hence I very specifically want to have uses of __LINE__
only in places where either it is the top level source line, or where
- as is the case here - it is clear that no two instances of the same or
a similar macro will ever sensibly be put on one line. (Even then there's
still the risk of using the C macro inside an assembler macro, where two
instances would cause problems. But if such a use appeared, making the
assembler macro suitably use \@ instead shouldn't be overly difficult.)
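
To illustrate the collision with a hypothetical example:

    #define MK_SKIP() jmp PASTE(.Lskip, __LINE__); PASTE(.Lskip, __LINE__):
    MK_SKIP()            /* fine on a line of its own */
    MK_SKIP() MK_SKIP()  /* same __LINE__ twice -> .LskipNNN defined twice */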

Jan


