Re: [PATCH v7 3/6] mce: fix set_mce_nospec to always unmap the whole page

From: Dan Williams <dan.j.williams@intel.com>
To: Jane Chu <jane.chu@oracle.com>
Cc: david <david@fromorbit.com>,
	"Darrick J. Wong" <djwong@kernel.org>,
	 Christoph Hellwig <hch@infradead.org>,
	Vishal L Verma <vishal.l.verma@intel.com>,
	 Dave Jiang <dave.jiang@intel.com>,
	Alasdair Kergon <agk@redhat.com>,
	Mike Snitzer <snitzer@redhat.com>,
	 device-mapper development <dm-devel@redhat.com>,
	"Weiny, Ira" <ira.weiny@intel.com>,
	 Matthew Wilcox <willy@infradead.org>,
	Vivek Goyal <vgoyal@redhat.com>,
	 linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Linux NVDIMM <nvdimm@lists.linux.dev>,
	 Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-xfs <linux-xfs@vger.kernel.org>,  X86 ML <x86@kernel.org>,
	"luto@kernel.org" <luto@kernel.org>,
	 "peterz@infradead.org" <peterz@infradead.org>,
	"dave.hansen@intel.com" <dave.hansen@intel.com>
Subject: Re: [PATCH v7 3/6] mce: fix set_mce_nospec to always unmap the whole page
Date: Wed, 13 Apr 2022 19:32:47 -0700	[thread overview]
Message-ID: <CAPcyv4jpwzMPKtzzc=DEbC340+zmzXkj+QtPVxfYbraskLKv8g@mail.gmail.com> (raw)
In-Reply-To: <b511a483-4260-656a-ab04-2ba319e65ca7@oracle.com>

On Wed, Apr 13, 2022 at 4:36 PM Jane Chu <jane.chu@oracle.com> wrote:
>
> On 4/11/2022 4:27 PM, Dan Williams wrote:
> > On Tue, Apr 5, 2022 at 12:48 PM Jane Chu <jane.chu@oracle.com> wrote:
> >>
> >> The set_memory_uc() approach doesn't work well in all cases.
> >> For example, when "The VMM unmapped the bad page from guest
> >> physical space and passed the machine check to the guest."
> >> "The guest gets virtual #MC on an access to that page.
> >>   When the guest tries to do set_memory_uc() and instructs
> >>   cpa_flush() to do clean caches that results in taking another
> >>   fault / exception perhaps because the VMM unmapped the page
> >>   from the guest."
> >>
> >> Since the driver has special knowledge to handle NP or UC,
> >
> > I think a patch is needed before this one to make this statement true? I.e.:
> >
> > diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
> > index ee8d9973f60b..11641f55025a 100644
> > --- a/drivers/acpi/nfit/mce.c
> > +++ b/drivers/acpi/nfit/mce.c
> > @@ -32,6 +32,7 @@ static int nfit_handle_mce(struct notifier_block
> > *nb, unsigned long val,
> >           */
> >          mutex_lock(&acpi_desc_lock);
> >          list_for_each_entry(acpi_desc, &acpi_descs, list) {
> > +               unsigned int align = 1UL << MCI_MISC_ADDR_LSB(mce->misc);
> >                  struct device *dev = acpi_desc->dev;
> >                  int found_match = 0;
> >
> > @@ -63,8 +64,7 @@ static int nfit_handle_mce(struct notifier_block
> > *nb, unsigned long val,
> >
> >                  /* If this fails due to an -ENOMEM, there is little we can do */
> >                  nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
> > -                               ALIGN(mce->addr, L1_CACHE_BYTES),
> > -                               L1_CACHE_BYTES);
> > +                                       ALIGN(mce->addr, align), align);
> >                  nvdimm_region_notify(nfit_spa->nd_region,
> >                                  NVDIMM_REVALIDATE_POISON);
> >
>
> Dan, I tried the above change, and this is what I got after injecting 8
> back-to-back poisons, then read them and received  SIGBUS/BUS_MCEERR_AR,
> then repair via the v7 patch which works until this change is added.
>
> [ 6240.955331] nfit ACPI0012:00: XXX, align = 100
> [ 6240.960300] nfit ACPI0012:00: XXX, ALIGN(mce->addr,
> L1_CACHE_BYTES)=1851600400, L1_CACHE_BYTES=40, ALIGN(mce->addr,
> align)=1851600400
> [..]
> [ 6242.052277] nfit ACPI0012:00: XXX, align = 100
> [ 6242.057243] nfit ACPI0012:00: XXX, ALIGN(mce->addr,
> L1_CACHE_BYTES)=1851601000, L1_CACHE_BYTES=40, ALIGN(mce->addr,
> align)=1851601000
> [..]
> [ 6244.917198] nfit ACPI0012:00: XXX, align = 1000
> [ 6244.922258] nfit ACPI0012:00: XXX, ALIGN(mce->addr,
> L1_CACHE_BYTES)=1851601200, L1_CACHE_BYTES=40, ALIGN(mce->addr,
> align)=1851602000
> [..]
>
> All 8 poisons remain uncleared.
>
> Without further investigation, I don't know why the failure.
> Could we mark this change to a follow-on task?

Perhaps a bit more debug before kicking this can down the road...

I'm worried that this means that the driver is not accurately tracking
poison data For example, that last case the hardware is indicating a
full page clobber, but the old code would only track poison from
1851601200 to 1851601400 (i.e. the first 512 bytes from the base error
address).

Oh... wait, I think there is a second bug here, that ALIGN should be
ALIGN_DOWN(). Does this restore the result you expect?

diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
index ee8d9973f60b..d7a52238a741 100644
--- a/drivers/acpi/nfit/mce.c
+++ b/drivers/acpi/nfit/mce.c
@@ -63,8 +63,7 @@ static int nfit_handle_mce(struct notifier_block
*nb, unsigned long val,

                /* If this fails due to an -ENOMEM, there is little we can do */
                nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
-                               ALIGN(mce->addr, L1_CACHE_BYTES),
-                               L1_CACHE_BYTES);
+                                       ALIGN_DOWN(mce->addr, align), align);
                nvdimm_region_notify(nfit_spa->nd_region,
                                NVDIMM_REVALIDATE_POISON);


> The driver knows a lot about how to clear poisons besides hardcoding
> poison alignment to 0x40 bytes.

It does, but the badblocks tracking should still be reliable, and if
it's not reliable I expect there are cases where recovery_write() will
not be triggered because the driver will not fail the
dax_direct_access() attempt.