Re: [PATCH v4 2/3] i386: Explicitly ignore unsupported BUS_MCEERR_AO MCE on AMD guest

From: Yazen Ghannam <yazen.ghannam@amd.com>
To: Joao Martins <joao.m.martins@oracle.com>,
	William Roche <william.roche@oracle.com>,
	John Allen <john.allen@amd.com>,
	qemu-devel@nongnu.org
Cc: yazen.ghannam@amd.com, michael.roth@amd.com, babu.moger@amd.com,
	pbonzini@redhat.com, richard.henderson@linaro.org,
	eduardo@habkost.net
Subject: Re: [PATCH v4 2/3] i386: Explicitly ignore unsupported BUS_MCEERR_AO MCE on AMD guest
Date: Thu, 21 Sep 2023 13:41:59 -0400	[thread overview]
Message-ID: <39a471b1-9ef6-47e9-97a8-b315f63b4917@amd.com> (raw)
In-Reply-To: <948a0ac5-379d-44a7-92b1-d2cc0995e187@oracle.com>

On 9/20/23 7:13 AM, Joao Martins wrote:
> On 18/09/2023 23:00, William Roche wrote:
>> Hi John,
>>
>> I'd like to put the emphasis on the fact that ignoring the SRAO error
>> for a VM is a real problem at least for a specific (rare) case I'm
>> currently working on: The VM migration.
>>
>> Context:
>>
>> - In the case of a poisoned page in the VM address space, the migration
>> can't read it and will skip this page, considering it as a zero-filled
>> page. The VM kernel (that handled the vMCE) would have marked it's
>> associated page as poisoned, and if the VM touches the page, the VM
>> kernel generates the associated MCE because it already knows about the
>> poisoned page.
>>
>> - When we ignore the vMCE in the case of a SIGBUS/BUS_MCEERR_AO error
>> (what this patch does), we entirely rely on the Hypervisor to send an
>> SRAR error to qemu when the page is touched: The AMD VM kernel will
>> receive the SIGBUS/BUS_MCEERR_AR and deal with it, thanks to your
>> changes here.
>>
>> So it looks like the mechanism works fine... unless the VM has migrated
>> between the SRAO error and the first time it really touches the poisoned
>> page to get an SRAR error !  In this case, its new address space
>> (created on the migration destination) will have a zero-page where we
>> had a poisoned page, and the AMD VM Kernel (that never dealt with the
>> SRAO) doesn't know about the poisoned page and will access the page
>> finding only zeros...  We have a memory corruption !

I don't understand this. Why would the page be zero? Even so, why would
that affect poison?

Also, during page migration, does the data flow through the CPU core?
Sorry for the basic question. I haven't done a lot with virtualization.

Please note that current AMD systems use an internal poison marker on
memory. This cannot be cleared through normal memory operations. The
only exception, I think, is to use the CLZERO instruction. This will
completely wipe a cacheline including metadata like poison, etc.

So the hardware should not (by design) loose track of poisoned data.

>>
>> It is a very rare window, but in order to fix it the most reasonable
>> course of action would be to make the AMD emulation deal with SRAO
>> errors, instead of ignoring them.
>>
>> Do you agree with my analysis ?
> 
> Under the case that SRAO aren't handled well in the kernel today[*] for AMD, we
> could always add a migration blocker when we hit AO sigbus, in case ignoring
> is our only option. But this would be less than ideal to propagating the
> SRAO into the guest.
> 
> [*] Meaning knowing that handling the SRAO would generate a crash in the guest
> 
> Perhaps as an improvement, perhaps allow qemu to choose to propagate should this
> limitation be lifted via a new -action value and allow it to ignore/propagate or
> not e.g.
> 
>  -action mce=none # default on Intel to propagate all MCE events to the guest
>  -action mce=ignore-optional # Ignore SRAO
> 
> I suppose the second is also useful for ARM64 considering they currently ignore
> SRAO events too.
> 
>> Would an AMD platform generate SRAO signal to a process
>> (SIGBUS/BUS_MCEERR_AO) in case of a real hardware error ?
>>
> This would be useful to confirm.
>

There is no SRAO signal on AMD. The closest equivalent may be a
"Deferred" error interrupt. This is an x86 APIC LVT interrupt, and it's
sent when a deferred (uncorrectable non-urgent) error is detected by a
memory controller.

In this case, the CPU will get the interrupt and log the error (in the
host).

An enhancement will be to take the MCA error information collected
during the interrupt and extract useful data. For example, we'll need to
translate the reported address to a system physical address that can be
mapped to a page.

Once we have the page, then we can decide how we want to signal the
process(es). We could get a deferred/AO error in the host, and signal the
guest with an AR. So the guest handling could be the same in both cases.

Would this be okay? Or is it important that the guest can distinguish
between the A0/AR cases? IOW, will guests have their own policies on
when to take action? Or is it more about allowing the guest to handle
the error less urgently?

Thanks,
Yazen