Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem

From: Jane Chu <jane.chu@oracle.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: Borislav Petkov <bp@alien8.de>,
	"Luck, Tony" <tony.luck@intel.com>,
	Linux NVDIMM <nvdimm@lists.linux.dev>,
	Luis Chamberlain <mcgrof@suse.com>
Subject: Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
Date: Thu, 18 Nov 2021 19:03:59 +0000	[thread overview]
Message-ID: <b51fb3d6-6d39-c450-e0a1-94a1645a22ec@oracle.com> (raw)
In-Reply-To: <CAPcyv4jBHnYtqoxoJY1NGNE1DXOv3bAg0gBzjZ=eOvarVXDRbA@mail.gmail.com>

On 11/13/2021 12:47 PM, Dan Williams wrote:
<snip>
>>> It should know because the MCE that unmapped the page will have
>>> communicated a "whole_page()" MCE. When dax_recovery_read() goes to
>>> consult the badblocks list to try to read the remaining good data it
>>> will see that every single cacheline is covered by badblocks, so
>>> nothing to read, and no need to establish the UC mapping. So the the
>>> "Tony fix" was incomplete in retrospect. It neglected to update the
>>> NVDIMM badblocks tracking for the whole page case.
>>
>> So the call in nfit_handle_mce():
>>     nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
>>                   ALIGN(mce->addr, L1_CACHE_BYTES),
>>                   L1_CACHE_BYTES);
>> should be replaced by
>>     nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
>>                   ALIGN(mce->addr, L1_CACHE_BYTES),
>>                   (1 << MCI_MISC_ADDR_LSB(m->misc)));
>> right?
> 
> Yes.
> 
>>
>> And when dax_recovery_read() calls
>>     badblocks_check(bb, sector, len / 512, &first_bad, &num_bad)
>> it should always, in case of 'NP', discover that 'first_bad'
>> is the first sector in the poisoned page,  hence no need
>> to switch to 'UC', right?
> 
> Yes.
> 
>>
>> In case the 'first_bad' is in the middle of the poisoned page,
>> that is, dax_recover_read() could potentially read some clean
>> sectors, is there problem to
>>     call _set_memory_UC(pfn, 1),
>>     do the mc_safe read,
>>     and then call set_memory_NP(pfn, 1)
>> ?
>> Why do we need to call ioremap() or vmap()?
> 
> I'm worried about concurrent operations and enabling access to threads
> outside of the one currently in dax_recovery_read(). If a local vmap()
> / ioremap() is used it effectively makes the access thread local.
> There might still need to be an rwsem to allow dax_recovery_write() to
> fixup the pfn access and syncrhonize with dax_recovery_read()
> operations.
> 

<snip>
>> I didn't even know that guest could clear poison by trapping hypervisor
>> with the ClearError DSM method,
> 
> The guest can call the Clear Error DSM if the virtual BIOS provides
> it. Whether that actually clears errors or not is up to the
> hypervisor.
> 
>> I thought guest isn't privileged with that.
> 
> The guest does not have access to the bare metal DSM path, but the
> hypervisor can certainly offer translation service for that operation.
> 
>> Would you mind to elaborate about the mechanism and maybe point
>> out the code, and perhaps if you have test case to share?
> 
> I don't have a test case because until Tony's fix I did not realize
> that a virtual #MC would allow the guest to learn of poisoned
> locations without necessarily allowing the guest to trigger actual
> poison consumption.
> 
> In other words I was operating under the assumption that telling
> guests where poison is located is potentially handing the guest a way
> to DoS the VMM. However, Tony's fix shows that when the hypervisor
> unmaps the guest physical page it can prevent the guest from accessing
> it again. So it follows that it should be ok to inject virtual #MC to
> the guest, and unmap the guest physical range, but later allow that
> guest physical range to be repaired if the guest asks the hypervisor
> to repair the page.
> 
> Tony, does this match your understanding?
> 
>>
>> but I'm not sure what to do about
>>> guests that later want to use MOVDIR64B to clear errors.
>>>
>>
>> Yeah, perhaps there is no way to prevent guest from accidentally
>> clear error via MOVDIR64B, given some application rely on MOVDIR64B
>> for fast data movement (straight to the media). I guess in that case,
>> the consequence is false alarm, nothing disastrous, right?
> 
> You'll just continue to get false positive failures because the error
> tracking will be out-of-sync with reality.
> 
>> How about allowing the potential bad-block bookkeeping gap, and
>> manage to close the gap at certain checkpoints? I guess one of
>> the checkpoints might be when page fault discovers a poisoned
>> page?
> 
> Not sure how that would work... it's already the case that new error
> entries are appended to the list at #MC time, the problem is knowing
> when to clear those stale entries. I still think that needs to be at
> dax_recovery_write() time.
> 

Thanks Dan for taking the time elaborating so much details!

After some amount of digging, I have a feel that we need to take
dax error handling in phases.

Phase-1: the simplest dax_recovery_write on page granularity, along
          with fix to set poisoned page to 'NP', serialize
          dax_recovery_write threads.
Phase-2: provide dax_recovery_read support and hence shrink the error
          recovery granularity.  As ioremap returns __iomem pointer
          that is only allowed to be referenced with helpers like
          readl() which do not have a mc_safe variant, and I'm
          not sure whether there should be.  Also the synchronization
          between dax_recovery_read and dax_recovery_write threads.
Phase-3: the hypervisor error-record keeping issue, suppose there is
          an issue, I'll need to figure out how to setup a test case.
Phase-4: the how-to-mitigate-MOVDIR64B-false-alarm issue.

Right now, it seems to me providing Phase-1 solution is urgent, to give
something that customers can rely on.

How does this sound to you?

thanks,
-jane