Re: x86/vmx: Don't spuriously crash the domain when INIT is received

From: Jan Beulich <jbeulich@suse.com>
To: "Tian, Kevin" <kevin.tian@intel.com>
Cc: "Pau Monné, Roger" <roger.pau@citrix.com>, "Wei Liu" <wl@xen.org>,
	"Nakajima, Jun" <jun.nakajima@intel.com>,
	"Thiner Logoer" <logoerthiner1@163.com>,
	"Marczykowski, Marek" <marmarek@invisiblethingslab.com>,
	Xen-devel <xen-devel@lists.xenproject.org>,
	"Cooper, Andrew" <andrew.cooper3@citrix.com>
Subject: Re: x86/vmx: Don't spuriously crash the domain when INIT is received
Date: Mon, 14 Mar 2022 08:43:28 +0100	[thread overview]
Message-ID: <cdbd90a7-bf4e-4ff6-0e95-0671ad553b83@suse.com> (raw)
In-Reply-To: <BN9PR11MB52767BF5573E31241734F2A28C0F9@BN9PR11MB5276.namprd11.prod.outlook.com>

On 14.03.2022 07:35, Tian, Kevin wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Monday, February 28, 2022 3:36 PM
>>
>> On 25.02.2022 18:11, Andrew Cooper wrote:
>>> On 25/02/2022 13:19, Jan Beulich wrote:
>>>> On 25.02.2022 13:28, Andrew Cooper wrote:
>>>>> On 25/02/2022 08:44, Jan Beulich wrote:
>>>>>> On 24.02.2022 20:48, Andrew Cooper wrote:
>>>>>>> In VMX operation, the handling of INIT IPIs is changed.
>> EXIT_REASON_INIT has
>>>>>>> nothing to do with the guest in question, simply signals that an INIT
>> was
>>>>>>> received.
>>>>>>>
>>>>>>> Ignoring the INIT is probably the wrong thing to do, but is helpful for
>>>>>>> debugging.  Crashing the domain which happens to be in context is
>> definitely
>>>>>>> wrong.  Print an error message and continue.
>>>>>>>
>>>>>>> Discovered as collateral damage from when an AP triple faults on S3
>> resume on
>>>>>>> Intel TigerLake platforms.
>>>>>> I'm afraid I don't follow the scenario, which was (only) outlined in
>>>>>> patch 1: Why would the BSP receive INIT in this case?
>>>>> SHUTDOWN is a signal emitted by a core when it can't continue.  Triple
>>>>> fault is one cause, but other sources include a double #MC, etc.
>>>>>
>>>>> Some external component, in the PCH I expect, needs to turn this into a
>>>>> platform reset, because one malfunctioning core can't.  It is why a
>>>>> triple fault on any logical processor brings the whole system down.
>>>> I'm afraid this doesn't answer my question. Clearly the system didn't
>>>> shut down.
>>>
>>> Indeed, *because* Xen caught and ignored the INIT which was otherwise
>>> supposed to do it.
>>>
>>>>  Hence I still don't see why the BSP would see INIT in the
>>>> first place.
>>>>
>>>>>> And it also cannot be that the INIT was received by the vCPU while
>> running on
>>>>>> another CPU:
>>>>> It's nothing (really) to do with the vCPU.  INIT is a external signal to
>>>>> the (real) APIC, just like NMI/etc.
>>>>>
>>>>> It is the next VMEntry on a CPU which received INIT that suffers a
>>>>> VMEntry failure, and the VMEntry failure has nothing to do with the
>>>>> contents of the VMCS.
>>>>>
>>>>> Importantly for Xen however, this isn't applicable for scheduling PV
>>>>> vCPUs, which is why dom0 wasn't the one that crashed.  This actually
>>>>> meant that dom0 was alive an usable, albeit it sharing all vCPUs on a
>>>>> single CPU.
>>>>>
>>>>>
>>>>> The change in INIT behaviour exists for TXT, where is it critical that
>>>>> software can clear secrets from RAM before resetting.  I'm not wanting
>>>>> to get into any of that because it's far more complicated than I have
>>>>> time to fix.
>>>> I guess there's something hidden behind what you say here, like INIT
>>>> only being latched, but this latched state then causing the VM entry
>>>> failure. Which would mean that really the INIT was a signal for the
>>>> system to shut down / shutting down.
>>>
>>> Yes.
> 
> why is INIT latched in root mode (take effect until vmentry) instead of
> directly causing the CPU to reset?
> 
>>>
>>>> In which case arranging to
>>>> continue by ignoring the event in VMX looks wrong. Simply crashing
>>>> the guest would then be wrong as well, of course. We should shut
>>>> down instead.
>>>
>>> It is software's discretion what to do when an INIT is caught, even if
>>> the expectation is to honour it fairly promptly.
>>>
>>>> But I don't think I see the full picture here yet, unless your
>>>> mentioning of TXT was actually implying that TXT was active at the
>>>> point of the crash (which I don't think was said anywhere).
>>>
>>> This did cause confusion during debugging.  As far as we can tell, TXT
>>> is not active, but the observed behaviour certainly looks like TXT is
>>> active.
>>>
>>> Then again, reset is a platform behaviour, not architectural.  Also,
>>> it's my understanding that Intel does not support S3 on TigerLake
>>> (opting to only support S0ix instead), so I'm guessing that "Linux S3"
>>> as it's called in the menu is something retrofitted by the OEM.
>>>
>>> But overall, the point isn't really about what triggered the INIT.  We
>>> also shouldn't nuke an innocent VM if an INIT IPI slips through
>>> interrupt remapping.
>>
>> But we also shouldn't continue in such a case as if nothing had happened
>> at all, should we?
>>
> 
> Now there are two problems:
> 
> 1) An innocent VM is killed;
> 2) The system continues as if nothing had happened;
> 
> Andrew's patch fixes 1) which imo is welcomed anyway.
> 
> 2) certainly needs more work but could come after 1). 

That's one way to look at things, sure, and if you agree it makes sense
to address 1), I won't go as far as trying to block such a change. But
it feels wrong to me - 2) working properly really includes 1) plus the
killing of all other innocent VMs that were running at the time.

Jan