* x86: Fix crash on S3 resume

From: Andrew Cooper
Date: 2022-02-24 19:48 UTC
To: Xen-devel
Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Wei Liu, Thiner Logoer, Marek Marczykowski-Górecki, Kevin Tian

Two fixes from investigating a QubesOS bug report.

Andrew Cooper (2):
  x86/CET: Fix S3 resume with shadow stacks active
  x86/vmx: Don't spuriously crash the domain when INIT is received

 xen/arch/x86/boot/x86_64.S | 18 +++++++++++++-----
 xen/arch/x86/hvm/vmx/vmx.c |  4 ++++
 2 files changed, 17 insertions(+), 5 deletions(-)
* x86/CET: Fix S3 resume with shadow stacks active

From: Andrew Cooper
Date: 2022-02-24 19:48 UTC
To: Xen-devel
Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Wei Liu, Thiner Logoer, Marek Marczykowski-Górecki

The original shadow stack support has an error on S3 resume with very bizarre
fallout.  The BSP comes back up, but APs fail with:

  (XEN) Enabling non-boot CPUs ...
  (XEN) Stuck ??
  (XEN) Error bringing CPU1 up: -5

and then later (on at least two Intel TigerLake platforms), the next HVM vCPU
to be scheduled on the BSP dies with:

  (XEN) d1v0 Unexpected vmexit: reason 3
  (XEN) domain_crash called from vmx.c:4304
  (XEN) Domain 1 (vcpu#0) crashed on cpu#0:

The VMExit reason is EXIT_REASON_INIT, which has nothing to do with the
scheduled vCPU and will be addressed in a subsequent patch.  It is a
consequence of the APs triple faulting.

The APs triple fault because we don't tear down the stacks on suspend.  The
idle/play_dead loop is killed in the middle of running, meaning that the
supervisor token is left busy.

On resume, SETSSBSY finds the token already busy, suffers #CP, and triple
faults because the IDT isn't configured this early.

Rework the AP bringup path to (re)create the supervisor token.  This ensures
the primary stack is non-busy before use.

Fixes: b60ab42db2f0 ("x86/shstk: Activate Supervisor Shadow Stacks")
Link: https://github.com/QubesOS/qubes-issues/issues/7283
Reported-by: Thiner Logoer <logoerthiner1@163.com>
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Thiner Logoer <logoerthiner1@163.com>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Thiner Logoer <logoerthiner1@163.com>
CC: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>

Slightly RFC.  This does fix the crash encountered, but it occurs to me that
there's a race condition when S3 platform powerdown is incident with an
NMI/#MC, where more than just the primary shadow stack can end up busy on
resume.

A larger fix would be to change how we allocate tokens, and always have each
CPU set up its own tokens.  I didn't do this originally in the hopes of
having WRSSQ generally disabled, but that plan failed when encountering
reality...

diff --git a/xen/arch/x86/boot/x86_64.S b/xen/arch/x86/boot/x86_64.S
index fa41990dde0f..5d12937a0e40 100644
--- a/xen/arch/x86/boot/x86_64.S
+++ b/xen/arch/x86/boot/x86_64.S
@@ -51,13 +51,21 @@ ENTRY(__high_start)
         test    $CET_SHSTK_EN, %al
         jz      .L_ap_cet_done

-        /* Derive MSR_PL0_SSP from %rsp (token written when stack is allocated). */
-        mov     $MSR_PL0_SSP, %ecx
         mov     %rsp, %rdx
+        and     $~(STACK_SIZE - 1), %rdx
+        or      $(PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8, %rdx
+
+        /*
+         * Write a new supervisor token.  Doesn't matter on boot, but for S3
+         * resume this clears the busy bit.
+         */
+        wrssq   %rdx, (%rdx)
+
+        /* Point MSR_PL0_SSP at the token. */
+        mov     $MSR_PL0_SSP, %ecx
+        mov     %edx, %eax
         shr     $32, %rdx
-        mov     %esp, %eax
-        and     $~(STACK_SIZE - 1), %eax
-        or      $(PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8, %eax
         wrmsr

         setssbsy
* Re: x86/CET: Fix S3 resume with shadow stacks active

From: Jan Beulich
Date: 2022-02-25 8:38 UTC
To: Andrew Cooper
Cc: Roger Pau Monné, Wei Liu, Thiner Logoer, Marek Marczykowski-Górecki, Xen-devel

On 24.02.2022 20:48, Andrew Cooper wrote:
> The original shadow stack support has an error on S3 resume with very
> bizarre fallout.  The BSP comes back up, but APs fail with:
>
>   (XEN) Enabling non-boot CPUs ...
>   (XEN) Stuck ??
>   (XEN) Error bringing CPU1 up: -5
>
> and then later (on at least two Intel TigerLake platforms), the next HVM
> vCPU to be scheduled on the BSP dies with:
>
>   (XEN) d1v0 Unexpected vmexit: reason 3
>   (XEN) domain_crash called from vmx.c:4304
>   (XEN) Domain 1 (vcpu#0) crashed on cpu#0:
>
> The VMExit reason is EXIT_REASON_INIT, which has nothing to do with the
> scheduled vCPU and will be addressed in a subsequent patch.  It is a
> consequence of the APs triple faulting.
>
> The APs triple fault because we don't tear down the stacks on suspend.
> The idle/play_dead loop is killed in the middle of running, meaning that
> the supervisor token is left busy.
>
> On resume, SETSSBSY finds the token already busy, suffers #CP, and triple
> faults because the IDT isn't configured this early.
>
> Rework the AP bringup path to (re)create the supervisor token.  This
> ensures the primary stack is non-busy before use.
>
> Fixes: b60ab42db2f0 ("x86/shstk: Activate Supervisor Shadow Stacks")
> Link: https://github.com/QubesOS/qubes-issues/issues/7283
> Reported-by: Thiner Logoer <logoerthiner1@163.com>
> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Tested-by: Thiner Logoer <logoerthiner1@163.com>
> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>

> Slightly RFC.  This does fix the crash encountered, but it occurs to me
> that there's a race condition when S3 platform powerdown is incident with
> an NMI/#MC, where more than just the primary shadow stack can end up busy
> on resume.
>
> A larger fix would be to change how we allocate tokens, and always have
> each CPU set up its own tokens.  I didn't do this originally in the hopes
> of having WRSSQ generally disabled, but that plan failed when
> encountering reality...

While I think this wants fixing one way or another, I also think this
shouldn't block the immediate fix here (which addresses an unconditional
crash rather than a pretty unlikely one).

Jan
* Re: x86/CET: Fix S3 resume with shadow stacks active

From: Andrew Cooper
Date: 2022-02-25 12:41 UTC
To: Jan Beulich
Cc: Roger Pau Monné, Wei Liu, Thiner Logoer, Marek Marczykowski-Górecki, Xen-devel

On 25/02/2022 08:38, Jan Beulich wrote:
> On 24.02.2022 20:48, Andrew Cooper wrote:
>> [...]
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Thanks.

>> Slightly RFC.  This does fix the crash encountered, but it occurs to me
>> that there's a race condition when S3 platform powerdown is incident
>> with an NMI/#MC, where more than just the primary shadow stack can end
>> up busy on resume.
>>
>> A larger fix would be to change how we allocate tokens, and always have
>> each CPU set up its own tokens.  I didn't do this originally in the
>> hopes of having WRSSQ generally disabled, but that plan failed when
>> encountering reality...
> While I think this wants fixing one way or another, I also think this
> shouldn't block the immediate fix here (which addresses an unconditional
> crash rather than a pretty unlikely one).

Fair point.  I'll get this committed now, and work on the IST shstks later.

As a note for backporting, this depends on the adjustments made in c/s
311434bfc9d1, so isn't safe to backport in exactly this form.  I'll sort
something out in due course.

~Andrew
* x86/vmx: Don't spuriously crash the domain when INIT is received

From: Andrew Cooper
Date: 2022-02-24 19:48 UTC
To: Xen-devel
Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Wei Liu, Jun Nakajima, Kevin Tian, Thiner Logoer, Marek Marczykowski-Górecki

In VMX operation, the handling of INIT IPIs is changed.  EXIT_REASON_INIT
has nothing to do with the guest in question; it simply signals that an INIT
was received.

Ignoring the INIT is probably the wrong thing to do, but is helpful for
debugging.  Crashing the domain which happens to be in context is definitely
wrong.  Print an error message and continue.

Discovered as collateral damage from when an AP triple faults on S3 resume
on Intel TigerLake platforms.

Link: https://github.com/QubesOS/qubes-issues/issues/7283
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Jun Nakajima <jun.nakajima@intel.com>
CC: Kevin Tian <kevin.tian@intel.com>
CC: Thiner Logoer <logoerthiner1@163.com>
CC: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index c075370f642a..883213ce8f6a 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -3965,6 +3965,10 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
     case EXIT_REASON_MCE_DURING_VMENTRY:
         do_machine_check(regs);
         break;
+
+    case EXIT_REASON_INIT:
+        printk(XENLOG_ERR "Error: INIT received - ignoring\n");
+        return; /* Re-enter the guest without further processing */
     }

     /* Now enable interrupts so it's safe to take locks. */
* Re: x86/vmx: Don't spuriously crash the domain when INIT is received

From: Jan Beulich
Date: 2022-02-25 8:44 UTC
To: Andrew Cooper
Cc: Roger Pau Monné, Wei Liu, Jun Nakajima, Kevin Tian, Thiner Logoer, Marek Marczykowski-Górecki, Xen-devel

On 24.02.2022 20:48, Andrew Cooper wrote:
> In VMX operation, the handling of INIT IPIs is changed.
> EXIT_REASON_INIT has nothing to do with the guest in question; it simply
> signals that an INIT was received.
>
> Ignoring the INIT is probably the wrong thing to do, but is helpful for
> debugging.  Crashing the domain which happens to be in context is
> definitely wrong.  Print an error message and continue.
>
> Discovered as collateral damage from when an AP triple faults on S3
> resume on Intel TigerLake platforms.

I'm afraid I don't follow the scenario, which was (only) outlined in
patch 1: Why would the BSP receive INIT in this case?

And it also cannot be that the INIT was received by the vCPU while
running on another CPU: With APs not coming back up, it cannot have been
scheduled to run there.  And it would have been de-scheduled before
suspending (i.e. before any INITs are sent).

Jan
* Re: x86/vmx: Don't spuriously crash the domain when INIT is received

From: Andrew Cooper
Date: 2022-02-25 12:28 UTC
To: Jan Beulich
Cc: Roger Pau Monné, Wei Liu, Jun Nakajima, Kevin Tian, Thiner Logoer, Marek Marczykowski-Górecki, Xen-devel

On 25/02/2022 08:44, Jan Beulich wrote:
> On 24.02.2022 20:48, Andrew Cooper wrote:
>> [...]
> I'm afraid I don't follow the scenario, which was (only) outlined in
> patch 1: Why would the BSP receive INIT in this case?

SHUTDOWN is a signal emitted by a core when it can't continue.  Triple
fault is one cause, but other sources include a double #MC, etc.

Some external component, in the PCH I expect, needs to turn this into a
platform reset, because one malfunctioning core can't.  It is why a
triple fault on any logical processor brings the whole system down.

> And it also cannot be that the INIT was received by the vCPU while
> running on another CPU:

It's nothing (really) to do with the vCPU.  INIT is an external signal to
the (real) APIC, just like NMI/etc.

It is the next VMEntry on a CPU which received INIT that suffers a
VMEntry failure, and the VMEntry failure has nothing to do with the
contents of the VMCS.

Importantly for Xen however, this isn't applicable when scheduling PV
vCPUs, which is why dom0 wasn't the one that crashed.  This actually
meant that dom0 was alive and usable, albeit sharing all vCPUs on a
single CPU.

The change in INIT behaviour exists for TXT, where it is critical that
software can clear secrets from RAM before resetting.  I'm not wanting
to get into any of that because it's far more complicated than I have
time to fix.

Xen still ignores the INIT, but now doesn't crash an entirely innocent
domain as a side effect of the platform sending an INIT IPI.

~Andrew

P.S. This is also fun without interrupt remapping.  XSA-3 didn't imagine
the full scope of problems which could occur.
* Re: x86/vmx: Don't spuriously crash the domain when INIT is received

From: Jan Beulich
Date: 2022-02-25 13:19 UTC
To: Andrew Cooper
Cc: Roger Pau Monné, Wei Liu, Jun Nakajima, Kevin Tian, Thiner Logoer, Marek Marczykowski-Górecki, Xen-devel

On 25.02.2022 13:28, Andrew Cooper wrote:
> On 25/02/2022 08:44, Jan Beulich wrote:
>> On 24.02.2022 20:48, Andrew Cooper wrote:
>>> [...]
>> I'm afraid I don't follow the scenario, which was (only) outlined in
>> patch 1: Why would the BSP receive INIT in this case?
>
> SHUTDOWN is a signal emitted by a core when it can't continue.  Triple
> fault is one cause, but other sources include a double #MC, etc.
>
> Some external component, in the PCH I expect, needs to turn this into a
> platform reset, because one malfunctioning core can't.  It is why a
> triple fault on any logical processor brings the whole system down.

I'm afraid this doesn't answer my question.  Clearly the system didn't
shut down.  Hence I still don't see why the BSP would see INIT in the
first place.

>> And it also cannot be that the INIT was received by the vCPU while
>> running on another CPU:
>
> It's nothing (really) to do with the vCPU.  INIT is an external signal
> to the (real) APIC, just like NMI/etc.
>
> It is the next VMEntry on a CPU which received INIT that suffers a
> VMEntry failure, and the VMEntry failure has nothing to do with the
> contents of the VMCS.
>
> Importantly for Xen however, this isn't applicable when scheduling PV
> vCPUs, which is why dom0 wasn't the one that crashed.  This actually
> meant that dom0 was alive and usable, albeit sharing all vCPUs on a
> single CPU.
>
> The change in INIT behaviour exists for TXT, where it is critical that
> software can clear secrets from RAM before resetting.  I'm not wanting
> to get into any of that because it's far more complicated than I have
> time to fix.

I guess there's something hidden behind what you say here, like INIT
only being latched, but this latched state then causing the VM entry
failure.  Which would mean that really the INIT was a signal for the
system to shut down / shutting down.  In which case arranging to
continue by ignoring the event in VMX looks wrong.  Simply crashing
the guest would then be wrong as well, of course.  We should shut
down instead.

But I don't think I see the full picture here yet, unless your
mentioning of TXT was actually implying that TXT was active at the
point of the crash (which I don't think was said anywhere).

Jan
* Re: x86/vmx: Don't spuriously crash the domain when INIT is received

From: Marek Marczykowski-Górecki
Date: 2022-02-25 13:51 UTC
To: Jan Beulich
Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, Jun Nakajima, Kevin Tian, Thiner Logoer, Xen-devel

On Fri, Feb 25, 2022 at 02:19:39PM +0100, Jan Beulich wrote:
> On 25.02.2022 13:28, Andrew Cooper wrote:
>> [...]
> I guess there's something hidden behind what you say here, like INIT
> only being latched, but this latched state then causing the VM entry
> failure.  Which would mean that really the INIT was a signal for the
> system to shut down / shutting down.  In which case arranging to
> continue by ignoring the event in VMX looks wrong.  Simply crashing
> the guest would then be wrong as well, of course.  We should shut
> down instead.

A shutdown could be an alternative here, with the remark that it would
make debugging such issues significantly harder.  Note the INIT is
delivered to the BSP, but the actual reason (in this case) is on some AP.
Shutdown (crash) in this case would prevent the (still functioning) BSP
from showing you the message (unless you have a serial console, which is
rather rare in laptops - and laptops are a significant target for
Qubes OS).

> But I don't think I see the full picture here yet, unless your
> mentioning of TXT was actually implying that TXT was active at the
> point of the crash (which I don't think was said anywhere).

No, TXT wasn't (intentionally) active.  I think Andrew mentioned it as an
explanation of why VMX behaves this way: to let the OS do something _if_
TXT is active (the check for TXT is the OS's responsibility here).  But
that's my guess only...

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
* Re: x86/vmx: Don't spuriously crash the domain when INIT is received

From: Jan Beulich
Date: 2022-02-25 14:18 UTC
To: Marek Marczykowski-Górecki
Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, Jun Nakajima, Kevin Tian, Thiner Logoer, Xen-devel

On 25.02.2022 14:51, Marek Marczykowski-Górecki wrote:
> On Fri, Feb 25, 2022 at 02:19:39PM +0100, Jan Beulich wrote:
>> [...]
> A shutdown could be an alternative here, with the remark that it would
> make debugging such issues significantly harder.  Note the INIT is
> delivered to the BSP, but the actual reason (in this case) is on some
> AP.  Shutdown (crash) in this case would prevent the (still functioning)
> BSP from showing you the message (unless you have a serial console,
> which is rather rare in laptops - and laptops are a significant target
> for Qubes OS).

Well, I didn't necessarily mean shutting down silently.  I fully
appreciate the usefulness of getting state dumped out for debugging of
an issue.

>> But I don't think I see the full picture here yet, unless your
>> mentioning of TXT was actually implying that TXT was active at the
>> point of the crash (which I don't think was said anywhere).
>
> No, TXT wasn't (intentionally) active.  I think Andrew mentioned it as
> an explanation of why VMX behaves this way: to let the OS do something
> _if_ TXT is active (the check for TXT is the OS's responsibility here).
> But that's my guess only...

One part here that I don't understand: How would the OS become aware of
the INIT if it didn't try to enter a VMX guest (i.e. non-root mode)?

Jan
* Re: x86/vmx: Don't spuriously crash the domain when INIT is received

From: Andrew Cooper
Date: 2022-02-25 17:11 UTC
To: Jan Beulich
Cc: Roger Pau Monné, Wei Liu, Jun Nakajima, Kevin Tian, Thiner Logoer, Marek Marczykowski-Górecki, Xen-devel

On 25/02/2022 13:19, Jan Beulich wrote:
> On 25.02.2022 13:28, Andrew Cooper wrote:
>> On 25/02/2022 08:44, Jan Beulich wrote:
>>> [...]
>> SHUTDOWN is a signal emitted by a core when it can't continue.  Triple
>> fault is one cause, but other sources include a double #MC, etc.
>>
>> Some external component, in the PCH I expect, needs to turn this into a
>> platform reset, because one malfunctioning core can't.  It is why a
>> triple fault on any logical processor brings the whole system down.
> I'm afraid this doesn't answer my question.  Clearly the system didn't
> shut down.

Indeed, *because* Xen caught and ignored the INIT which was otherwise
supposed to do it.

> Hence I still don't see why the BSP would see INIT in the first place.

[...]

> I guess there's something hidden behind what you say here, like INIT
> only being latched, but this latched state then causing the VM entry
> failure.  Which would mean that really the INIT was a signal for the
> system to shut down / shutting down.

Yes.

> In which case arranging to continue by ignoring the event in VMX looks
> wrong.  Simply crashing the guest would then be wrong as well, of
> course.  We should shut down instead.

It is software's discretion what to do when an INIT is caught, even if
the expectation is to honour it fairly promptly.

> But I don't think I see the full picture here yet, unless your
> mentioning of TXT was actually implying that TXT was active at the
> point of the crash (which I don't think was said anywhere).

This did cause confusion during debugging.  As far as we can tell, TXT
is not active, but the observed behaviour certainly looks like TXT is
active.  Then again, reset is a platform behaviour, not architectural.

Also, it's my understanding that Intel does not support S3 on TigerLake
(opting to only support S0ix instead), so I'm guessing that "Linux S3"
as it's called in the menu is something retrofitted by the OEM.

But overall, the point isn't really about what triggered the INIT.  We
also shouldn't nuke an innocent VM if an INIT IPI slips through
interrupt remapping.

~Andrew
* Re: x86/vmx: Don't spuriously crash the domain when INIT is received
  2022-02-25 17:11           ` Andrew Cooper
@ 2022-02-26 22:55             ` Jason Andryuk
  2022-02-28  7:36             ` Jan Beulich
  1 sibling, 0 replies; 16+ messages in thread
From: Jason Andryuk @ 2022-02-26 22:55 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Jan Beulich, Roger Pau Monne, Wei Liu, Jun Nakajima, Kevin Tian,
      Thiner Logoer, Marek Marczykowski-Górecki, Xen-devel

On Fri, Feb 25, 2022 at 12:12 PM Andrew Cooper <Andrew.Cooper3@citrix.com> wrote:
>
> On 25/02/2022 13:19, Jan Beulich wrote:
> > But I don't think I see the full picture here yet, unless your
> > mentioning of TXT was actually implying that TXT was active at the
> > point of the crash (which I don't think was said anywhere).
>
> This did cause confusion during debugging.  As far as we can tell, TXT
> is not active, but the observed behaviour certainly looks like TXT is
> active.

It's curious since the CPU, i5-1135G7, is listed as *not* supporting
TXT.  However, it does support Boot Guard, and both Boot Guard and TXT
use Authenticated Code Modules (ACMs) to implement their functionality.

There is the below quote from the Measured Launched Environment
Developer’s Guide:
http://kib.kiev.ua/x86docs/Intel/TXT/315168-014.pdf

"Both Server TXT and Boot Guard (BtG) technologies require Startup ACM
to be executed at platform reset.  Intel® CPUs can support only single
such ACM and therefore combining of BtG ACM with a Startup ACM is
inevitable for platforms supporting both technologies.  This combining
requirement triggered the whole set of upgrades targeted to better
alignment of both technologies, and their mutual benefits."

So I'm just speculating, but it seems there is TXT-ish stuff going on
when it resumes.

Regards,
Jason
^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: x86/vmx: Don't spuriously crash the domain when INIT is received 2022-02-25 17:11 ` Andrew Cooper 2022-02-26 22:55 ` Jason Andryuk @ 2022-02-28 7:36 ` Jan Beulich 2022-03-14 6:35 ` Tian, Kevin 1 sibling, 1 reply; 16+ messages in thread From: Jan Beulich @ 2022-02-28 7:36 UTC (permalink / raw) To: Andrew Cooper Cc: Roger Pau Monne, Wei Liu, Jun Nakajima, Kevin Tian, Thiner Logoer, Marek Marczykowski-Górecki, Xen-devel On 25.02.2022 18:11, Andrew Cooper wrote: > On 25/02/2022 13:19, Jan Beulich wrote: >> On 25.02.2022 13:28, Andrew Cooper wrote: >>> On 25/02/2022 08:44, Jan Beulich wrote: >>>> On 24.02.2022 20:48, Andrew Cooper wrote: >>>>> In VMX operation, the handling of INIT IPIs is changed. EXIT_REASON_INIT has >>>>> nothing to do with the guest in question, simply signals that an INIT was >>>>> received. >>>>> >>>>> Ignoring the INIT is probably the wrong thing to do, but is helpful for >>>>> debugging. Crashing the domain which happens to be in context is definitely >>>>> wrong. Print an error message and continue. >>>>> >>>>> Discovered as collateral damage from when an AP triple faults on S3 resume on >>>>> Intel TigerLake platforms. >>>> I'm afraid I don't follow the scenario, which was (only) outlined in >>>> patch 1: Why would the BSP receive INIT in this case? >>> SHUTDOWN is a signal emitted by a core when it can't continue. Triple >>> fault is one cause, but other sources include a double #MC, etc. >>> >>> Some external component, in the PCH I expect, needs to turn this into a >>> platform reset, because one malfunctioning core can't. It is why a >>> triple fault on any logical processor brings the whole system down. >> I'm afraid this doesn't answer my question. Clearly the system didn't >> shut down. > > Indeed, *because* Xen caught and ignored the INIT which was otherwise > supposed to do it. > >> Hence I still don't see why the BSP would see INIT in the >> first place. 
>> >>>> And it also cannot be that the INIT was received by the vCPU while running on >>>> another CPU: >>> It's nothing (really) to do with the vCPU. INIT is a external signal to >>> the (real) APIC, just like NMI/etc. >>> >>> It is the next VMEntry on a CPU which received INIT that suffers a >>> VMEntry failure, and the VMEntry failure has nothing to do with the >>> contents of the VMCS. >>> >>> Importantly for Xen however, this isn't applicable for scheduling PV >>> vCPUs, which is why dom0 wasn't the one that crashed. This actually >>> meant that dom0 was alive an usable, albeit it sharing all vCPUs on a >>> single CPU. >>> >>> >>> The change in INIT behaviour exists for TXT, where is it critical that >>> software can clear secrets from RAM before resetting. I'm not wanting >>> to get into any of that because it's far more complicated than I have >>> time to fix. >> I guess there's something hidden behind what you say here, like INIT >> only being latched, but this latched state then causing the VM entry >> failure. Which would mean that really the INIT was a signal for the >> system to shut down / shutting down. > > Yes. > >> In which case arranging to >> continue by ignoring the event in VMX looks wrong. Simply crashing >> the guest would then be wrong as well, of course. We should shut >> down instead. > > It is software's discretion what to do when an INIT is caught, even if > the expectation is to honour it fairly promptly. > >> But I don't think I see the full picture here yet, unless your >> mentioning of TXT was actually implying that TXT was active at the >> point of the crash (which I don't think was said anywhere). > > This did cause confusion during debugging. As far as we can tell, TXT > is not active, but the observed behaviour certainly looks like TXT is > active. > > Then again, reset is a platform behaviour, not architectural. 
Also, > it's my understanding that Intel does not support S3 on TigerLake > (opting to only support S0ix instead), so I'm guessing that "Linux S3" > as it's called in the menu is something retrofitted by the OEM. > > But overall, the point isn't really about what triggered the INIT. We > also shouldn't nuke an innocent VM if an INIT IPI slips through > interrupt remapping. But we also shouldn't continue in such a case as if nothing had happened at all, should we? Jan ^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: x86/vmx: Don't spuriously crash the domain when INIT is received 2022-02-28 7:36 ` Jan Beulich @ 2022-03-14 6:35 ` Tian, Kevin 2022-03-14 7:43 ` Jan Beulich 0 siblings, 1 reply; 16+ messages in thread From: Tian, Kevin @ 2022-03-14 6:35 UTC (permalink / raw) To: Beulich, Jan, Cooper, Andrew Cc: Pau Monné, Roger, Wei Liu, Nakajima, Jun, Thiner Logoer, Marczykowski, Marek, Xen-devel > From: Jan Beulich <jbeulich@suse.com> > Sent: Monday, February 28, 2022 3:36 PM > > On 25.02.2022 18:11, Andrew Cooper wrote: > > On 25/02/2022 13:19, Jan Beulich wrote: > >> On 25.02.2022 13:28, Andrew Cooper wrote: > >>> On 25/02/2022 08:44, Jan Beulich wrote: > >>>> On 24.02.2022 20:48, Andrew Cooper wrote: > >>>>> In VMX operation, the handling of INIT IPIs is changed. > EXIT_REASON_INIT has > >>>>> nothing to do with the guest in question, simply signals that an INIT > was > >>>>> received. > >>>>> > >>>>> Ignoring the INIT is probably the wrong thing to do, but is helpful for > >>>>> debugging. Crashing the domain which happens to be in context is > definitely > >>>>> wrong. Print an error message and continue. > >>>>> > >>>>> Discovered as collateral damage from when an AP triple faults on S3 > resume on > >>>>> Intel TigerLake platforms. > >>>> I'm afraid I don't follow the scenario, which was (only) outlined in > >>>> patch 1: Why would the BSP receive INIT in this case? > >>> SHUTDOWN is a signal emitted by a core when it can't continue. Triple > >>> fault is one cause, but other sources include a double #MC, etc. > >>> > >>> Some external component, in the PCH I expect, needs to turn this into a > >>> platform reset, because one malfunctioning core can't. It is why a > >>> triple fault on any logical processor brings the whole system down. > >> I'm afraid this doesn't answer my question. Clearly the system didn't > >> shut down. > > > > Indeed, *because* Xen caught and ignored the INIT which was otherwise > > supposed to do it. 
> > > >> Hence I still don't see why the BSP would see INIT in the > >> first place. > >> > >>>> And it also cannot be that the INIT was received by the vCPU while > running on > >>>> another CPU: > >>> It's nothing (really) to do with the vCPU. INIT is a external signal to > >>> the (real) APIC, just like NMI/etc. > >>> > >>> It is the next VMEntry on a CPU which received INIT that suffers a > >>> VMEntry failure, and the VMEntry failure has nothing to do with the > >>> contents of the VMCS. > >>> > >>> Importantly for Xen however, this isn't applicable for scheduling PV > >>> vCPUs, which is why dom0 wasn't the one that crashed. This actually > >>> meant that dom0 was alive an usable, albeit it sharing all vCPUs on a > >>> single CPU. > >>> > >>> > >>> The change in INIT behaviour exists for TXT, where is it critical that > >>> software can clear secrets from RAM before resetting. I'm not wanting > >>> to get into any of that because it's far more complicated than I have > >>> time to fix. > >> I guess there's something hidden behind what you say here, like INIT > >> only being latched, but this latched state then causing the VM entry > >> failure. Which would mean that really the INIT was a signal for the > >> system to shut down / shutting down. > > > > Yes. why is INIT latched in root mode (take effect until vmentry) instead of directly causing the CPU to reset? > > > >> In which case arranging to > >> continue by ignoring the event in VMX looks wrong. Simply crashing > >> the guest would then be wrong as well, of course. We should shut > >> down instead. > > > > It is software's discretion what to do when an INIT is caught, even if > > the expectation is to honour it fairly promptly. > > > >> But I don't think I see the full picture here yet, unless your > >> mentioning of TXT was actually implying that TXT was active at the > >> point of the crash (which I don't think was said anywhere). > > > > This did cause confusion during debugging. 
As far as we can tell, TXT > > is not active, but the observed behaviour certainly looks like TXT is > > active. > > > > Then again, reset is a platform behaviour, not architectural. Also, > > it's my understanding that Intel does not support S3 on TigerLake > > (opting to only support S0ix instead), so I'm guessing that "Linux S3" > > as it's called in the menu is something retrofitted by the OEM. > > > > But overall, the point isn't really about what triggered the INIT. We > > also shouldn't nuke an innocent VM if an INIT IPI slips through > > interrupt remapping. > > But we also shouldn't continue in such a case as if nothing had happened > at all, should we? > Now there are two problems: 1) An innocent VM is killed; 2) The system continues as if nothing had happened; Andrew's patch fixes 1) which imo is welcomed anyway. 2) certainly needs more work but could come after 1). Thanks Kevin ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: x86/vmx: Don't spuriously crash the domain when INIT is received 2022-03-14 6:35 ` Tian, Kevin @ 2022-03-14 7:43 ` Jan Beulich 2022-03-17 5:57 ` Tian, Kevin 0 siblings, 1 reply; 16+ messages in thread From: Jan Beulich @ 2022-03-14 7:43 UTC (permalink / raw) To: Tian, Kevin Cc: Pau Monné, Roger, Wei Liu, Nakajima, Jun, Thiner Logoer, Marczykowski, Marek, Xen-devel, Cooper, Andrew On 14.03.2022 07:35, Tian, Kevin wrote: >> From: Jan Beulich <jbeulich@suse.com> >> Sent: Monday, February 28, 2022 3:36 PM >> >> On 25.02.2022 18:11, Andrew Cooper wrote: >>> On 25/02/2022 13:19, Jan Beulich wrote: >>>> On 25.02.2022 13:28, Andrew Cooper wrote: >>>>> On 25/02/2022 08:44, Jan Beulich wrote: >>>>>> On 24.02.2022 20:48, Andrew Cooper wrote: >>>>>>> In VMX operation, the handling of INIT IPIs is changed. >> EXIT_REASON_INIT has >>>>>>> nothing to do with the guest in question, simply signals that an INIT >> was >>>>>>> received. >>>>>>> >>>>>>> Ignoring the INIT is probably the wrong thing to do, but is helpful for >>>>>>> debugging. Crashing the domain which happens to be in context is >> definitely >>>>>>> wrong. Print an error message and continue. >>>>>>> >>>>>>> Discovered as collateral damage from when an AP triple faults on S3 >> resume on >>>>>>> Intel TigerLake platforms. >>>>>> I'm afraid I don't follow the scenario, which was (only) outlined in >>>>>> patch 1: Why would the BSP receive INIT in this case? >>>>> SHUTDOWN is a signal emitted by a core when it can't continue. Triple >>>>> fault is one cause, but other sources include a double #MC, etc. >>>>> >>>>> Some external component, in the PCH I expect, needs to turn this into a >>>>> platform reset, because one malfunctioning core can't. It is why a >>>>> triple fault on any logical processor brings the whole system down. >>>> I'm afraid this doesn't answer my question. Clearly the system didn't >>>> shut down. 
>>> >>> Indeed, *because* Xen caught and ignored the INIT which was otherwise >>> supposed to do it. >>> >>>> Hence I still don't see why the BSP would see INIT in the >>>> first place. >>>> >>>>>> And it also cannot be that the INIT was received by the vCPU while >> running on >>>>>> another CPU: >>>>> It's nothing (really) to do with the vCPU. INIT is a external signal to >>>>> the (real) APIC, just like NMI/etc. >>>>> >>>>> It is the next VMEntry on a CPU which received INIT that suffers a >>>>> VMEntry failure, and the VMEntry failure has nothing to do with the >>>>> contents of the VMCS. >>>>> >>>>> Importantly for Xen however, this isn't applicable for scheduling PV >>>>> vCPUs, which is why dom0 wasn't the one that crashed. This actually >>>>> meant that dom0 was alive an usable, albeit it sharing all vCPUs on a >>>>> single CPU. >>>>> >>>>> >>>>> The change in INIT behaviour exists for TXT, where is it critical that >>>>> software can clear secrets from RAM before resetting. I'm not wanting >>>>> to get into any of that because it's far more complicated than I have >>>>> time to fix. >>>> I guess there's something hidden behind what you say here, like INIT >>>> only being latched, but this latched state then causing the VM entry >>>> failure. Which would mean that really the INIT was a signal for the >>>> system to shut down / shutting down. >>> >>> Yes. > > why is INIT latched in root mode (take effect until vmentry) instead of > directly causing the CPU to reset? > >>> >>>> In which case arranging to >>>> continue by ignoring the event in VMX looks wrong. Simply crashing >>>> the guest would then be wrong as well, of course. We should shut >>>> down instead. >>> >>> It is software's discretion what to do when an INIT is caught, even if >>> the expectation is to honour it fairly promptly. 
>>> >>>> But I don't think I see the full picture here yet, unless your >>>> mentioning of TXT was actually implying that TXT was active at the >>>> point of the crash (which I don't think was said anywhere). >>> >>> This did cause confusion during debugging. As far as we can tell, TXT >>> is not active, but the observed behaviour certainly looks like TXT is >>> active. >>> >>> Then again, reset is a platform behaviour, not architectural. Also, >>> it's my understanding that Intel does not support S3 on TigerLake >>> (opting to only support S0ix instead), so I'm guessing that "Linux S3" >>> as it's called in the menu is something retrofitted by the OEM. >>> >>> But overall, the point isn't really about what triggered the INIT. We >>> also shouldn't nuke an innocent VM if an INIT IPI slips through >>> interrupt remapping. >> >> But we also shouldn't continue in such a case as if nothing had happened >> at all, should we? >> > > Now there are two problems: > > 1) An innocent VM is killed; > 2) The system continues as if nothing had happened; > > Andrew's patch fixes 1) which imo is welcomed anyway. > > 2) certainly needs more work but could come after 1). That's one way to look at things, sure, and if you agree it makes sense to address 1), I won't go as far as trying to block such a change. But it feels wrong to me - 2) working properly really includes 1) plus the killing of all other innocent VMs that were running at the time. Jan ^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: x86/vmx: Don't spuriously crash the domain when INIT is received 2022-03-14 7:43 ` Jan Beulich @ 2022-03-17 5:57 ` Tian, Kevin 0 siblings, 0 replies; 16+ messages in thread From: Tian, Kevin @ 2022-03-17 5:57 UTC (permalink / raw) To: Beulich, Jan Cc: Pau Monné, Roger, Wei Liu, Nakajima, Jun, Thiner Logoer, Marczykowski, Marek, Xen-devel, Cooper, Andrew > From: Jan Beulich <jbeulich@suse.com> > Sent: Monday, March 14, 2022 3:43 PM > > On 14.03.2022 07:35, Tian, Kevin wrote: > >> From: Jan Beulich <jbeulich@suse.com> > >> Sent: Monday, February 28, 2022 3:36 PM > >> > >> On 25.02.2022 18:11, Andrew Cooper wrote: > >>> On 25/02/2022 13:19, Jan Beulich wrote: > >>>> On 25.02.2022 13:28, Andrew Cooper wrote: > >>>>> On 25/02/2022 08:44, Jan Beulich wrote: > >>>>>> On 24.02.2022 20:48, Andrew Cooper wrote: > >>>>>>> In VMX operation, the handling of INIT IPIs is changed. > >> EXIT_REASON_INIT has > >>>>>>> nothing to do with the guest in question, simply signals that an INIT > >> was > >>>>>>> received. > >>>>>>> > >>>>>>> Ignoring the INIT is probably the wrong thing to do, but is helpful for > >>>>>>> debugging. Crashing the domain which happens to be in context is > >> definitely > >>>>>>> wrong. Print an error message and continue. > >>>>>>> > >>>>>>> Discovered as collateral damage from when an AP triple faults on S3 > >> resume on > >>>>>>> Intel TigerLake platforms. > >>>>>> I'm afraid I don't follow the scenario, which was (only) outlined in > >>>>>> patch 1: Why would the BSP receive INIT in this case? > >>>>> SHUTDOWN is a signal emitted by a core when it can't continue. Triple > >>>>> fault is one cause, but other sources include a double #MC, etc. > >>>>> > >>>>> Some external component, in the PCH I expect, needs to turn this into > a > >>>>> platform reset, because one malfunctioning core can't. It is why a > >>>>> triple fault on any logical processor brings the whole system down. > >>>> I'm afraid this doesn't answer my question. 
Clearly the system didn't > >>>> shut down. > >>> > >>> Indeed, *because* Xen caught and ignored the INIT which was otherwise > >>> supposed to do it. > >>> > >>>> Hence I still don't see why the BSP would see INIT in the > >>>> first place. > >>>> > >>>>>> And it also cannot be that the INIT was received by the vCPU while > >> running on > >>>>>> another CPU: > >>>>> It's nothing (really) to do with the vCPU. INIT is a external signal to > >>>>> the (real) APIC, just like NMI/etc. > >>>>> > >>>>> It is the next VMEntry on a CPU which received INIT that suffers a > >>>>> VMEntry failure, and the VMEntry failure has nothing to do with the > >>>>> contents of the VMCS. > >>>>> > >>>>> Importantly for Xen however, this isn't applicable for scheduling PV > >>>>> vCPUs, which is why dom0 wasn't the one that crashed. This actually > >>>>> meant that dom0 was alive an usable, albeit it sharing all vCPUs on a > >>>>> single CPU. > >>>>> > >>>>> > >>>>> The change in INIT behaviour exists for TXT, where is it critical that > >>>>> software can clear secrets from RAM before resetting. I'm not wanting > >>>>> to get into any of that because it's far more complicated than I have > >>>>> time to fix. > >>>> I guess there's something hidden behind what you say here, like INIT > >>>> only being latched, but this latched state then causing the VM entry > >>>> failure. Which would mean that really the INIT was a signal for the > >>>> system to shut down / shutting down. > >>> > >>> Yes. > > > > why is INIT latched in root mode (take effect until vmentry) instead of > > directly causing the CPU to reset? > > > >>> > >>>> In which case arranging to > >>>> continue by ignoring the event in VMX looks wrong. Simply crashing > >>>> the guest would then be wrong as well, of course. We should shut > >>>> down instead. > >>> > >>> It is software's discretion what to do when an INIT is caught, even if > >>> the expectation is to honour it fairly promptly. 
> >>> > >>>> But I don't think I see the full picture here yet, unless your > >>>> mentioning of TXT was actually implying that TXT was active at the > >>>> point of the crash (which I don't think was said anywhere). > >>> > >>> This did cause confusion during debugging. As far as we can tell, TXT > >>> is not active, but the observed behaviour certainly looks like TXT is > >>> active. > >>> > >>> Then again, reset is a platform behaviour, not architectural. Also, > >>> it's my understanding that Intel does not support S3 on TigerLake > >>> (opting to only support S0ix instead), so I'm guessing that "Linux S3" > >>> as it's called in the menu is something retrofitted by the OEM. > >>> > >>> But overall, the point isn't really about what triggered the INIT. We > >>> also shouldn't nuke an innocent VM if an INIT IPI slips through > >>> interrupt remapping. > >> > >> But we also shouldn't continue in such a case as if nothing had happened > >> at all, should we? > >> > > > > Now there are two problems: > > > > 1) An innocent VM is killed; > > 2) The system continues as if nothing had happened; > > > > Andrew's patch fixes 1) which imo is welcomed anyway. > > > > 2) certainly needs more work but could come after 1). > > That's one way to look at things, sure, and if you agree it makes sense > to address 1), I won't go as far as trying to block such a change. But > it feels wrong to me - 2) working properly really includes 1) plus the > killing of all other innocent VMs that were running at the time. > I feel that 2) will be done in a way that the admin can choose the policy. It could be killing all VMs or in a mode where further diagnose is allowed. Given that part is unclear at this point, I'm inclined to ack 1) first: Reviewed-by: Kevin Tian <kevin.tian@intel.com> Thanks Kevin ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2022-03-17  5:58 UTC | newest]

Thread overview: 16+ messages
-- links below jump to the message on this page --
2022-02-24 19:48 x86: Fix crash on S3 resume Andrew Cooper
2022-02-24 19:48 ` x86/CET: Fix S3 resume with shadow stacks active Andrew Cooper
2022-02-25  8:38   ` Jan Beulich
2022-02-25 12:41     ` Andrew Cooper
2022-02-24 19:48 ` x86/vmx: Don't spuriously crash the domain when INIT is received Andrew Cooper
2022-02-25  8:44   ` Jan Beulich
2022-02-25 12:28     ` Andrew Cooper
2022-02-25 13:19       ` Jan Beulich
2022-02-25 13:51         ` Marek Marczykowski-Górecki
2022-02-25 14:18           ` Jan Beulich
2022-02-25 17:11         ` Andrew Cooper
2022-02-26 22:55           ` Jason Andryuk
2022-02-28  7:36           ` Jan Beulich
2022-03-14  6:35             ` Tian, Kevin
2022-03-14  7:43               ` Jan Beulich
2022-03-17  5:57                 ` Tian, Kevin