nvdimm.lists.linux.dev archive mirror
* [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
@ 2021-07-07  1:01 Dan Williams
  2021-08-26 19:08 ` Dan Williams
  2021-09-13 10:29 ` Borislav Petkov
  0 siblings, 2 replies; 35+ messages in thread
From: Dan Williams @ 2021-07-07  1:01 UTC (permalink / raw)
  To: nvdimm; +Cc: Jane Chu, Luis Chamberlain, Borislav Petkov, Tony Luck

When poison is discovered and triggers memory_failure() the physical
page is unmapped from all process address space. However, it is not
unmapped from kernel address space. Unlike a typical memory page that
can be retired from use in the page allocator and marked 'not present',
pmem needs to remain accessible given it can not be physically remapped
or retired. set_memory_uc() tries to maintain consistent nominal memtype
mappings for a given pfn, but memory_failure() is an exceptional
condition.

For the same reason that set_memory_np() bypasses memtype checks
because they do not apply in the memory failure case, memtype validation
is not applicable for marking the pmem pfn uncacheable. Use
_set_memory_uc().

Reported-by: Jane Chu <jane.chu@oracle.com>
Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set,clear}_mce_nospec()")
Cc: Luis Chamberlain <mcgrof@suse.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
Jane, can you give this a try and see if it cleans up the error you are
seeing?

Thanks for the help.

 arch/x86/include/asm/set_memory.h |    9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index 43fa081a1adb..0bf2274c5186 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -114,8 +114,13 @@ static inline int set_mce_nospec(unsigned long pfn, bool unmap)
 
 	if (unmap)
 		rc = set_memory_np(decoy_addr, 1);
-	else
-		rc = set_memory_uc(decoy_addr, 1);
+	else {
+		/*
+		 * Bypass memtype checks since memory-failure has shot
+		 * down mappings.
+		 */
+		rc = _set_memory_uc(decoy_addr, 1);
+	}
 	if (rc)
 		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
 	return rc;


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-07-07  1:01 [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem Dan Williams
@ 2021-08-26 19:08 ` Dan Williams
  2021-08-27  7:12   ` Jane Chu
  2021-09-13 10:29 ` Borislav Petkov
  1 sibling, 1 reply; 35+ messages in thread
From: Dan Williams @ 2021-08-26 19:08 UTC (permalink / raw)
  To: Linux NVDIMM; +Cc: Jane Chu, Luis Chamberlain, Borislav Petkov, Tony Luck

On Tue, Jul 6, 2021 at 6:01 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> When poison is discovered and triggers memory_failure() the physical
> page is unmapped from all process address space. However, it is not
> unmapped from kernel address space. Unlike a typical memory page that
> can be retired from use in the page allocator and marked 'not present',
> pmem needs to remain accessible given it can not be physically remapped
> or retired. set_memory_uc() tries to maintain consistent nominal memtype
> mappings for a given pfn, but memory_failure() is an exceptional
> condition.
>
> For the same reason that set_memory_np() bypasses memtype checks
> because they do not apply in the memory failure case, memtype validation
> is not applicable for marking the pmem pfn uncacheable. Use
> _set_memory_uc().
>
> Reported-by: Jane Chu <jane.chu@oracle.com>
> Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set,clear}_mce_nospec()")
> Cc: Luis Chamberlain <mcgrof@suse.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> Jane, can you give this a try and see if it cleans up the error you are
> seeing?
>
> Thanks for the help.

Jane, does this resolve the failure you reported [1]?

[1]: https://lore.kernel.org/r/327f9156-9b28-d20e-2850-21c125ece8c7@oracle.com

>
>  arch/x86/include/asm/set_memory.h |    9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
> index 43fa081a1adb..0bf2274c5186 100644
> --- a/arch/x86/include/asm/set_memory.h
> +++ b/arch/x86/include/asm/set_memory.h
> @@ -114,8 +114,13 @@ static inline int set_mce_nospec(unsigned long pfn, bool unmap)
>
>         if (unmap)
>                 rc = set_memory_np(decoy_addr, 1);
> -       else
> -               rc = set_memory_uc(decoy_addr, 1);
> +       else {
> +               /*
> +                * Bypass memtype checks since memory-failure has shot
> +                * down mappings.
> +                */
> +               rc = _set_memory_uc(decoy_addr, 1);
> +       }
>         if (rc)
>                 pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
>         return rc;
>


* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-08-26 19:08 ` Dan Williams
@ 2021-08-27  7:12   ` Jane Chu
  0 siblings, 0 replies; 35+ messages in thread
From: Jane Chu @ 2021-08-27  7:12 UTC (permalink / raw)
  To: Dan Williams, Linux NVDIMM; +Cc: Luis Chamberlain, Borislav Petkov, Tony Luck

Hi, Dan,

On 8/26/2021 12:08 PM, Dan Williams wrote:
> On Tue, Jul 6, 2021 at 6:01 PM Dan Williams<dan.j.williams@intel.com>  wrote:
>> When poison is discovered and triggers memory_failure() the physical
>> page is unmapped from all process address space. However, it is not
>> unmapped from kernel address space. Unlike a typical memory page that
>> can be retired from use in the page allocator and marked 'not present',
>> pmem needs to remain accessible given it can not be physically remapped
>> or retired. set_memory_uc() tries to maintain consistent nominal memtype
>> mappings for a given pfn, but memory_failure() is an exceptional
>> condition.
>>
>> For the same reason that set_memory_np() bypasses memtype checks
>> because they do not apply in the memory failure case, memtype validation
>> is not applicable for marking the pmem pfn uncacheable. Use
>> _set_memory_uc().
>>
>> Reported-by: Jane Chu<jane.chu@oracle.com>
>> Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set,clear}_mce_nospec()")
>> Cc: Luis Chamberlain<mcgrof@suse.com>
>> Cc: Borislav Petkov<bp@alien8.de>
>> Cc: Tony Luck<tony.luck@intel.com>
>> Signed-off-by: Dan Williams<dan.j.williams@intel.com>
>> ---
>> Jane, can you give this a try and see if it cleans up the error you are
>> seeing?
>>
>> Thanks for the help.
> Jane, does this resolve the failure you reported [1]?
> 
> [1]:https://lore.kernel.org/r/327f9156-9b28-d20e-2850-21c125ece8c7@oracle.com
> 

Sorry for taking so long.  With the patch applied, the dmesg is displaying
[ 2111.282759] Memory failure: 0x1850600: recovery action for dax page: Recovered
[ 2112.415412] x86/PAT: fsdax_poison_v1:3214 freeing invalid memtype [mem 0x1850600000-0x1850600fff]

instead of the problematic

[10683.426147] x86/PAT: fsdax_poison_v1:5018 conflicting memory types 1850600000-1850601000  uncached-minus<->write-back

Please feel free to add Tested-by: Jane Chu <jane.chu@oracle.com>

Thanks for the fix!

-jane


* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-07-07  1:01 [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem Dan Williams
  2021-08-26 19:08 ` Dan Williams
@ 2021-09-13 10:29 ` Borislav Petkov
  2021-09-14 18:08   ` Dan Williams
  1 sibling, 1 reply; 35+ messages in thread
From: Borislav Petkov @ 2021-09-13 10:29 UTC (permalink / raw)
  To: Dan Williams; +Cc: nvdimm, Jane Chu, Luis Chamberlain, Tony Luck

On Tue, Jul 06, 2021 at 06:01:05PM -0700, Dan Williams wrote:
> When poison is discovered and triggers memory_failure() the physical
> page is unmapped from all process address space. However, it is not
> unmapped from kernel address space. Unlike a typical memory page that
> can be retired from use in the page allocator and marked 'not present',
> pmem needs to remain accessible given it can not be physically remapped
> or retired.

I'm surely missing something obvious but why does it need to remain
accessible? Spell it out please.

> set_memory_uc() tries to maintain consistent nominal memtype
> mappings for a given pfn, but memory_failure() is an exceptional
> condition.

That's not clear to me either. So looking at the failure:

[10683.426147] x86/PAT: fsdax_poison_v1:5018 conflicting memory types 1850600000-1850601000  uncached-minus<->write-back

set_memory_uc() marked it UC- but something? wants it to be WB. Why?

I guess I need some more info on the whole memory offlining for pmem and
why that should be done differently than with normal memory.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-09-13 10:29 ` Borislav Petkov
@ 2021-09-14 18:08   ` Dan Williams
  2021-09-15 10:41     ` Borislav Petkov
  0 siblings, 1 reply; 35+ messages in thread
From: Dan Williams @ 2021-09-14 18:08 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Linux NVDIMM, Jane Chu, Luis Chamberlain, Tony Luck

On Mon, Sep 13, 2021 at 3:29 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Tue, Jul 06, 2021 at 06:01:05PM -0700, Dan Williams wrote:
> > When poison is discovered and triggers memory_failure() the physical
> > page is unmapped from all process address space. However, it is not
> > unmapped from kernel address space. Unlike a typical memory page that
> > can be retired from use in the page allocator and marked 'not present',
> > pmem needs to remain accessible given it can not be physically remapped
> > or retired.
>
> I'm surely missing something obvious but why does it need to remain
> accessible? Spell it out please.

Sure, I should probably include this following note in all patches
touching the DAX-memory_failure() path, because it is a frequently
asked question. The tl;dr is:

Typical memory_failure() does not assume the physical page can be
recovered and put back into circulation, PMEM memory_failure() allows
for recovery of the page.

The longer description is:
Typical memory_failure() for anonymous, or page-cache pages, has the
flexibility to invalidate bad pages and trigger any users to request a
new page from the page allocator to replace the quarantined one. DAX
removes that flexibility. The page is a handle for a fixed storage
location, i.e. no mechanism to remap a physical page to a different
logical address. Software expects to be able to repair an error in
PMEM by reading around the poisoned cache lines and writing zeros via
fallocate(...FALLOC_FL_PUNCH_HOLE...) to overwrite the poison. The page
needs to remain accessible to enable recovery.

>
> > set_memory_uc() tries to maintain consistent nominal memtype
> > mappings for a given pfn, but memory_failure() is an exceptional
> > condition.
>
> That's not clear to me too. So looking at the failure:
>
> [10683.426147] x86/PAT: fsdax_poison_v1:5018 conflicting memory types 1850600000-1850601000  uncached-minus<->write-back
>
> set_memory_uc() marked it UC- but something? wants it to be WB. Why?

PMEM is mapped WB at the beginning of time for nominal operation.
track_pfn_remap() records that driver setting and forwards it to any
track_pfn_insert() of the same pfn, i.e. this is how DAX mappings
inherit the WB cache mode. memory_failure() wants to arrange avoidance
of speculative consumption of poison. set_memory_uc() checks against the
track_pfn_remap() setting, but we know this is an exceptional
condition and it is ok to force the mapping UC against the typical
memtype expectation.

> I guess I need some more info on the whole memory offlining for pmem and
> why that should be done differently than with normal memory.

Short answer, PMEM never goes "offline" because it was never "online"
in the first place. Where "online" in this context is specifically
referring to pfns that are under the watchful eye of the core-mm page
allocator.


* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-09-14 18:08   ` Dan Williams
@ 2021-09-15 10:41     ` Borislav Petkov
  2021-09-16 20:33       ` Dan Williams
  0 siblings, 1 reply; 35+ messages in thread
From: Borislav Petkov @ 2021-09-15 10:41 UTC (permalink / raw)
  To: Dan Williams; +Cc: Linux NVDIMM, Jane Chu, Luis Chamberlain, Tony Luck

On Tue, Sep 14, 2021 at 11:08:00AM -0700, Dan Williams wrote:
> Sure, I should probably include this following note in all patches
> touching the DAX-memory_failure() path, because it is a frequently
> asked question. The tl;dr is:
> 
> Typical memory_failure() does not assume the physical page can be
> recovered and put back into circulation, PMEM memory_failure() allows
> for recovery of the page.

Hmm, I think by "PMEM memory_failure()" you mean what pmem_do_write()
does with that poison clearing or?

But then I have no clue what the "DAX-memory_failure()" path is.

> The longer description is:
> Typical memory_failure() for anonymous, or page-cache pages, has the
> flexibility to invalidate bad pages and trigger any users to request a
> new page from the page allocator to replace the quarantined one. DAX
> removes that flexibility. The page is a handle for a fixed storage
> location, i.e. no mechanism to remap a physical page to a different
> logical address. Software expects to be able to repair an error in
> PMEM by reading around the poisoned cache lines and writing zeros,
> fallocate(...FALLOC_FL_PUNCH_HOLE...), to overwrite poison. The page
> needs to remain accessible to enable recovery.

Aha, so memory failure there is not offlining the 4K page but simply
zeroing the poison. What happens if the same physical location triggers
more read errors, i.e., the underlying hw is going bad? Don't you want
to exclude that location from ever being written again?

Or maybe such recovery action doesn't make sense for pmem?

> Short answer, PMEM never goes "offline" because it was never "online"
> in the first place. Where "online" in this context is specifically
> referring to pfns that are under the watchful eye of the core-mm page
> allocator.

Ok, so pmem wants to be handled differently during memory failure.
Looking at the patch again, you change the !unmap case to do
_set_memory_uc().

That bool unmap thing gets *not* set in whole_page():

	return MCI_MISC_ADDR_LSB(m->misc) >= PAGE_SHIFT;

so I'm guessing that "Recoverable Address LSB (bits 5:0): The lowest
valid recoverable address bit." thing is < 12 for pmem.

But are you really saying that the hardware would never report a lower
than 12 value for normal memory?

If it does, then that is wrong here.

In any case, I'd prefer if the code would do an explicit check somewhere
saying "is this pmem" in order to do that special action only for when
it really is pmem and not rely on it implicitly.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-09-15 10:41     ` Borislav Petkov
@ 2021-09-16 20:33       ` Dan Williams
  2021-09-17 11:30         ` Borislav Petkov
  2021-09-30 18:15         ` Jane Chu
  0 siblings, 2 replies; 35+ messages in thread
From: Dan Williams @ 2021-09-16 20:33 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Linux NVDIMM, Jane Chu, Luis Chamberlain, Tony Luck

On Wed, Sep 15, 2021 at 3:41 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Tue, Sep 14, 2021 at 11:08:00AM -0700, Dan Williams wrote:
> > Sure, I should probably include this following note in all patches
> > touching the DAX-memory_failure() path, because it is a frequently
> > asked question. The tl;dr is:
> >
> > Typical memory_failure() does not assume the physical page can be
> > recovered and put back into circulation, PMEM memory_failure() allows
> > for recovery of the page.
>
> Hmm, I think by "PMEM memory_failure()" you mean what pmem_do_write()
> does with that poison clearing or?

I am specifically talking about the memory_failure_dev_pagemap() path
taken from memory_failure().

> But then I have no clue what the "DAX-memory_failure()" path is.

Sorry, I should not have lazily used {PMEM,DAX} memory_failure() to
refer to the exact routine in question: memory_failure_dev_pagemap().
That path is taken when memory_failure() sees that a pfn was mapped by
memremap_pages(). It determines that the pfn does not represent a
dynamic page allocator page and may refer to a static page of storage
from a device like /dev/pmem, or /dev/dax.

>
> > The longer description is:
> > Typical memory_failure() for anonymous, or page-cache pages, has the
> > flexibility to invalidate bad pages and trigger any users to request a
> > new page from the page allocator to replace the quarantined one. DAX
> > removes that flexibility. The page is a handle for a fixed storage
> > location, i.e. no mechanism to remap a physical page to a different
> > logical address. Software expects to be able to repair an error in
> > PMEM by reading around the poisoned cache lines and writing zeros,
> > fallocate(...FALLOC_FL_PUNCH_HOLE...), to overwrite poison. The page
> > needs to remain accessible to enable recovery.
>
> Aha, so memory failure there is not offlining the 4K page but simply
> zeroing the poison.

The zeroing of the poison comes optionally later if userspace
explicitly asks for it to be cleared.

> What happens if the same physical location triggers
> more read errors, i.e., the underlying hw is going bad? Don't you want
> to exclude that location from ever being written again?

Yes. The device driver avoids reconsumption of the same error by
recording the error in a "badblocks" structure (block/badblocks.c).
The driver consults its badblocks instance on every subsequent access
and preemptively returns EIO. The driver registers for machine-check
notifications and translates those events into badblocks entries. So,
repeat consumption is avoided unless/until the badblocks entry can be
cleared along with the poison itself.

I'll note that going forward the filesystem wants to be more involved
in tracking and managing these errors so the driver will grow the
ability to forward those notifications up the stack.

> Or maybe such recovery action doesn't make sense for pmem?

It does. In addition to machine-check synchronous notification of
poison, there are asynchronous scanning mechanisms that prepopulate
the badblocks entries to get ahead of consumption events.

> > Short answer, PMEM never goes "offline" because it was never "online"
> > in the first place. Where "online" in this context is specifically
> > referring to pfns that are under the watchful eye of the core-mm page
> > allocator.
>
> Ok, so pmem wants to be handled differently during memory failure.
> Looking at the patch again, you change the !unmap case to do
> _set_memory_uc().
>
> That bool unmap thing gets *not* set in whole_page():
>
>         return MCI_MISC_ADDR_LSB(m->misc) >= PAGE_SHIFT;
>
> so I'm guessing that "Recoverable Address LSB (bits 5:0): The lowest
> valid recoverable address bit." thing is < 12 for pmem.
>
> But are you really saying that the hardware would never report a lower
> than 12 value for normal memory?

Ugh, no, I failed to update my mental model of set_mce_nospec() usage
after commit 17fae1294ad9 "x86/{mce,mm}: Unmap the entire page if the
whole page is affected and poisoned". I was just looking for "where
does set_mce_nospec() do the UC vs NP decision", and did not read the
new calling convention... more below.

> If it does, then that is wrong here.
>
> In any case, I'd prefer if the code would do an explicit check somewhere
> saying "is this pmem" in order to do that special action only for when
> it really is pmem and not rely on it implicitly.

After thinking this through a bit I don't think we want that, but I
did find a bug, and this patch needs a rewritten changelog to say why
pmem does not need to be explicitly called out in this path. The
reasoning is that set_mce_nospec() was introduced as a way to achieve
the goal of preventing speculative consumption of poison *and* allow
for reads around poisoned cachelines for PMEM recovery. UC achieves
that goal for both PMEM and DRAM, unlike what this changelog alluded
to, which was NP for DRAM and UC for PMEM. The new consideration of
whole_page() errors means that the code in nfit_handle_mce() is wrong
to assume that the length of the error is just 1 cacheline. When that
bug is fixed the badblocks range will cover the entire page in
whole_page() notification cases. Because the entire page is dead there
is no need for UC because all the cachelines are gone, nothing to
read-around in this page. In short DRAM and PMEM want to share the
exact same policy here and use NP for whole_page() and UC for not.
Just the small matter of ignoring the memtype by using
_set_memory_uc().

Thanks for triggering that deeper look.


* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-09-16 20:33       ` Dan Williams
@ 2021-09-17 11:30         ` Borislav Petkov
  2021-09-21  2:04           ` Dan Williams
  2021-09-30 18:15         ` Jane Chu
  1 sibling, 1 reply; 35+ messages in thread
From: Borislav Petkov @ 2021-09-17 11:30 UTC (permalink / raw)
  To: Dan Williams; +Cc: Linux NVDIMM, Jane Chu, Luis Chamberlain, Tony Luck

On Thu, Sep 16, 2021 at 01:33:42PM -0700, Dan Williams wrote:
> I am specifically talking about the memory_failure_dev_pagemap() path
> taken from memory_failure().
> 
> > But then I have no clue what the "DAX-memory_failure()" path is.
> 
> Sorry, I should not have lazily used {PMEM,DAX} memory_failure() to
> refer to the exact routine in question: memory_failure_dev_pagemap().
> That path is taken when memory_failure() sees that a pfn was mapped by
> memremap_pages(). It determines that the pfn does not represent a
> dynamic page allocator page and may refer to a static page of storage
> from a device like /dev/pmem, or /dev/dax.

Aaaha, that's that innocent-looking pfn_to_online_page() call there in
memory_failure() which would return NULL for pmem/dax pages.

> The zeroing of the poison comes optionally later if userspace
> explicitly asks for it to be cleared.

I see.

> Yes. The device driver avoids reconsumption of the same error by
> recording the error in a "badblocks" structure (block/badblocks.c).
> The driver consults its badblocks instance on every subsequent access
> and preemptively returns EIO. The driver registers for machine-check
> notifications and translates those events into badblocks entries. So,
> repeat consumption is avoided unless/until the babblocks entry can be
> cleared along with the poison itself.

Sounds good.

> I'll note that going forward the filesystem wants to be more involved
> in tracking and managing these errors so the driver will grow the
> ability to forward those notifications up the stack.

... which would save the ask-the-driver each time thing. Yap.

> It does. In addition to machine-check synchronous notification of
> poison, there are asynchronous scanning mechanisms that prepopulate
> the badblocks entries to get ahead of consumption events.

Probably something similar to patrol scrubbing on normal DRAM.

> ... Because the entire page is dead there is no need for UC because all
> the cachelines are gone, nothing to read-around in this page. In short
> DRAM and PMEM want to share the exact same policy here and use NP for
> whole_page() and UC for not. Just the small matter of ignoring the
> memtype by using _set_memory_uc().

Hmm, so looking at this more:

        else
                rc = set_memory_uc(decoy_addr, 1);

->  memtype_reserve(__pa(addr), __pa(addr) + numpages * PAGE_SIZE,

->  memtype_reserve(__pa(addr), __pa(addr) + 1,

Looking at

  17fae1294ad9 ("x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned")

it says

    Fix is to check the scope of the poison by checking the MCi_MISC register.
    If the entire page is affected, then unmap the page. If only part of the
    page is affected, then mark the page as uncacheable.

so I guess for normal DRAM we still want to map the page uncacheable so
that other parts except that cacheline can still be read.

And PMEM should be excluded from that treatise here because it needs to
be WB, as you said earlier.

Right?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-09-17 11:30         ` Borislav Petkov
@ 2021-09-21  2:04           ` Dan Williams
  2021-09-30 17:19             ` Borislav Petkov
  0 siblings, 1 reply; 35+ messages in thread
From: Dan Williams @ 2021-09-21  2:04 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Linux NVDIMM, Jane Chu, Luis Chamberlain, Tony Luck

On Fri, Sep 17, 2021 at 4:30 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Thu, Sep 16, 2021 at 01:33:42PM -0700, Dan Williams wrote:
> > I am specifically talking about the memory_failure_dev_pagemap() path
> > taken from memory_failure().
> >
> > > But then I have no clue what the "DAX-memory_failure()" path is.
> >
> > Sorry, I should not have lazily used {PMEM,DAX} memory_failure() to
> > refer to the exact routine in question: memory_failure_dev_pagemap().
> > That path is taken when memory_failure() sees that a pfn was mapped by
> > memremap_pages(). It determines that the pfn does not represent a
> > dynamic page allocator page and may refer to a static page of storage
> > from a device like /dev/pmem, or /dev/dax.
>
> Aaaha, that's that innocent-looking pfn_to_online_page() call there in
> memory_failure() which would return NULL for pmem/dax pages.

Exactly.

>
> > The zeroing of the poison comes optionally later if userspace
> > explicitly asks for it to be cleared.
>
> I see.
>
> > Yes. The device driver avoids reconsumption of the same error by
> > recording the error in a "badblocks" structure (block/badblocks.c).
> > The driver consults its badblocks instance on every subsequent access
> > and preemptively returns EIO. The driver registers for machine-check
> > notifications and translates those events into badblocks entries. So,
> > repeat consumption is avoided unless/until the badblocks entry can be
> > cleared along with the poison itself.
>
> Sounds good.
>
> > I'll note that going forward the filesystem wants to be more involved
> > in tracking and managing these errors so the driver will grow the
> > ability to forward those notifications up the stack.
>
> ... which would save the ask-the-driver each time thing. Yap.
>
> > It does. In addition to machine-check synchronous notification of
> > poison, there are asynchronous scanning mechanisms that prepopulate
> > the badblocks entries to get ahead of consumption events.
>
> Probably something similar to patrol scrubbing on normal DRAM.

Yes, although I believe that DRAM patrol scrubbing is done by the
host memory controller, whereas these PMEM DIMMs have firmware and DMA
engines *in the DIMM* to do this scrub work.

> > ... Because the entire page is dead there is no need for UC because all
> > the cachelines are gone, nothing to read-around in this page. In short
> > DRAM and PMEM want to share the exact same policy here and use NP for
> > whole_page() and UC for not. Just the small matter of ignoring the
> > memtype by using _set_memory_uc().
>
> Hmm, so looking at this more:
>
>         else
>                 rc = set_memory_uc(decoy_addr, 1);
>
> ->  memtype_reserve(__pa(addr), __pa(addr) + numpages * PAGE_SIZE,
>
> ->  memtype_reserve(__pa(addr), __pa(addr) + 1,
>
> Looking at
>
>   17fae1294ad9 ("x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned")
>
> it says
>
>     Fix is to check the scope of the poison by checking the MCi_MISC register.
>     If the entire page is affected, then unmap the page. If only part of the
>     page is affected, then mark the page as uncacheable.
>
> so I guess for normal DRAM we still want to map the page uncacheable so
> that other parts except that cacheline can still be read.

Perhaps, but I don't know how you do that if memory_failure() has
"offlined" the DRAM page, in the case of PMEM you can issue a
byte-aligned direct-I/O access to the exact storage locations around
the poisoned cachelines.

> And PMEM should be excluded from that treatise here because it needs to
> be WB, as you said earlier.
>
> Right?

PMEM can still go NP if the entire page is failed, so no need to
exclude PMEM from the treatise because the driver's badblocks
implementation will cover the NP page, and the driver can use
clear_mce_nospec() to recover the WB mapping / access after the poison
has been cleared.


* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-09-21  2:04           ` Dan Williams
@ 2021-09-30 17:19             ` Borislav Petkov
  2021-09-30 17:28               ` Luck, Tony
  0 siblings, 1 reply; 35+ messages in thread
From: Borislav Petkov @ 2021-09-30 17:19 UTC (permalink / raw)
  To: Dan Williams, Tony Luck; +Cc: Linux NVDIMM, Jane Chu, Luis Chamberlain

On Mon, Sep 20, 2021 at 07:04:50PM -0700, Dan Williams wrote:
> Yes, although I believe that DRAM patrol scrubbing is being done from
> the host memory controller, these PMEM DIMMs have firmware and DMA
> engines *in the DIMM* to do this scrub work.

Oh great. Lemme guess, they even have small OSes inside ... <eyeroll>

> Perhaps, but I don't know how you do that if memory_failure() has
> "offlined" the DRAM page, in the case of PMEM you can issue a
> byte-aligned direct-I/O access to the exact storage locations around
> the poisoned cachelines.

Well, looking at the (exactly two) call sites of set_mce_nospec(), it
does:

	if (!memory_failure(..))
		set_mce_nospec(pfn, whole_page...);

so IINM, if that thing returns 0, then we have hwpoisoned the page. And
if that is the case, then why are we even diddling with reads around the
poisoned cacheline when the whole page has been poisoned and we can't
access it anymore?

Or am I missing something?

Because the comment over set_mce_nospec() says

"or marking it uncacheable (if we want to try to retrieve data from
non-poisoned lines in the page)."

Question is, can we even access a hwpoisoned page to retrieve that data
or are we completely in the wrong weeds here? Tony?

> PMEM can still go NP if the entire page is failed, so no need to
> exclude PMEM from the treatise because the driver's badblocks
> implementation will cover the NP page, and the driver can use
> clear_mce_nospec() to recover the WB mapping / access after the poison
> has been cleared.

Now I'm all confused again. ;-(

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* RE: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-09-30 17:19             ` Borislav Petkov
@ 2021-09-30 17:28               ` Luck, Tony
  2021-09-30 19:30                 ` Borislav Petkov
  0 siblings, 1 reply; 35+ messages in thread
From: Luck, Tony @ 2021-09-30 17:28 UTC (permalink / raw)
  To: Borislav Petkov, Williams, Dan J; +Cc: Linux NVDIMM, Jane Chu, Luis Chamberlain

> Question is, can we even access a hwpoisoned page to retrieve that data
> or are we completely in the wrong weeds here? Tony?

Hardware scope for poison is a cache line (64 bytes for DDR, may be larger
for the internals of 3D-Xpoint memory).

So you can access the other lines in the page. You just need to be careful
not to access the poison. Setting the cache mode for the page to uncacheable
ensures that h/w prefetchers and speculative access don't accidentally touch
the poison while accessing nearby cache lines.

-Tony

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-09-16 20:33       ` Dan Williams
  2021-09-17 11:30         ` Borislav Petkov
@ 2021-09-30 18:15         ` Jane Chu
  2021-09-30 19:11           ` Dan Williams
  1 sibling, 1 reply; 35+ messages in thread
From: Jane Chu @ 2021-09-30 18:15 UTC (permalink / raw)
  To: Dan Williams, Borislav Petkov; +Cc: Linux NVDIMM, Luis Chamberlain, Tony Luck

Hi, Dan,

On 9/16/2021 1:33 PM, Dan Williams wrote:
> The new consideration of
> whole_page() errors means that the code in nfit_handle_mce() is wrong
> to assume that the length of the error is just 1 cacheline. When that
> bug is fixed the badblocks range will cover the entire page in
> whole_page() notification cases.

Does this explain what I saw a while ago: inject two poisons
to the same block, and pwrite clears one poison at a time?

thanks!
-jane



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-09-30 18:15         ` Jane Chu
@ 2021-09-30 19:11           ` Dan Williams
  2021-09-30 21:23             ` Jane Chu
  0 siblings, 1 reply; 35+ messages in thread
From: Dan Williams @ 2021-09-30 19:11 UTC (permalink / raw)
  To: Jane Chu; +Cc: Borislav Petkov, Linux NVDIMM, Luis Chamberlain, Tony Luck

On Thu, Sep 30, 2021 at 11:15 AM Jane Chu <jane.chu@oracle.com> wrote:
>
> Hi, Dan,
>
> On 9/16/2021 1:33 PM, Dan Williams wrote:
> > The new consideration of
> > whole_page() errors means that the code in nfit_handle_mce() is wrong
> > to assume that the length of the error is just 1 cacheline. When that
> > bug is fixed the badblocks range will cover the entire page in
> > whole_page() notification cases.
>
> Does this explain what I saw a while ago: inject two poisons
> to the same block, and pwrite clears one poison at a time?

Potentially, yes. If you injected poison over 512 bytes then it could
take 2 pwrites() of 256-bytes each to clear that poison. In the DAX
case nothing is enforcing sector aligned I/O for pwrite().

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-09-30 17:28               ` Luck, Tony
@ 2021-09-30 19:30                 ` Borislav Petkov
  2021-09-30 19:41                   ` Dan Williams
  2021-09-30 19:44                   ` Luck, Tony
  0 siblings, 2 replies; 35+ messages in thread
From: Borislav Petkov @ 2021-09-30 19:30 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Williams, Dan J, Linux NVDIMM, Jane Chu, Luis Chamberlain

On Thu, Sep 30, 2021 at 05:28:12PM +0000, Luck, Tony wrote:
> > Question is, can we even access a hwpoisoned page to retrieve that data
> > or are we completely in the wrong weeds here? Tony?
> 
> Hardware scope for poison is a cache line (64 bytes for DDR, may be larger
> for the internals of 3D-Xpoint memory).

I don't mean from the hw aspect but from the OS one: my simple thinking
is, *if* a page is marked as HW poison, any further mapping or accessing
of the page frame is prevented by the mm code.

So you can't access *any* bits there so why do we even bother with whole
or not whole page? Page is gone...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-09-30 19:30                 ` Borislav Petkov
@ 2021-09-30 19:41                   ` Dan Williams
  2021-09-30 19:44                   ` Luck, Tony
  1 sibling, 0 replies; 35+ messages in thread
From: Dan Williams @ 2021-09-30 19:41 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Luck, Tony, Linux NVDIMM, Jane Chu, Luis Chamberlain

On Thu, Sep 30, 2021 at 12:30 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Thu, Sep 30, 2021 at 05:28:12PM +0000, Luck, Tony wrote:
> > > Question is, can we even access a hwpoisoned page to retrieve that data
> > > or are we completely in the wrong weeds here? Tony?
> >
> > Hardware scope for poison is a cache line (64 bytes for DDR, may be larger
> > for the internals of 3D-Xpoint memory).
>
> I don't mean from the hw aspect but from the OS one: my simple thinking
> is, *if* a page is marked as HW poison, any further mapping or accessing
> of the page frame is prevented by the mm code.
>
> So you can't access *any* bits there so why do we even bother with whole
> or not whole page? Page is gone...

I think the disconnect is that in the
typical-memory-from-the-page-allocator case it's ok to throw away the
whole page and get another one, they are interchangeable. In the PMEM
case they are not, they are fixed physical offsets known to things
like filesystem-metadata. So PageHWPoison in this latter case is just
a flag to indicate that poison mitigations are in effect, page marked
UC or NP, and the owner of that page, PMEM driver, knows how to
navigate around that poison to maximize data recovery flows. The owner
of the page in the latter / typical case, page allocator, says "bah,
I'll just throw the whole thing away because there's no chance at
repair and consumers can just as well use a different pfn for the same
role."

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-09-30 19:30                 ` Borislav Petkov
  2021-09-30 19:41                   ` Dan Williams
@ 2021-09-30 19:44                   ` Luck, Tony
  2021-09-30 20:01                     ` Borislav Petkov
  1 sibling, 1 reply; 35+ messages in thread
From: Luck, Tony @ 2021-09-30 19:44 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Williams, Dan J, Linux NVDIMM, Jane Chu, Luis Chamberlain

>I don't mean from the hw aspect but from the OS one: my simple thinking
> is, *if* a page is marked as HW poison, any further mapping or accessing
> of the page frame is prevented by the mm code.
>
> So you can't access *any* bits there so why do we even bother with whole
> or not whole page? Page is gone...

See the comment above set_mce_nospec() ...

/*
 * Prevent speculative access to the page by either unmapping
 * it (if we do not require access to any part of the page) or
 * marking it uncacheable (if we want to try to retrieve data
 * from non-poisoned lines in the page).
 */
static inline int set_mce_nospec(unsigned long pfn, bool unmap)


It's a choice as to whether the whole page is gone or not. The history for
this is using pmem as storage. The filesystem block size may be less than
the page size. An error in a "block" should only result in that block disappearing
from the file, not the surrounding 4k.

Of course that only works for users of read(2)/write(2) to access the file.
They can't mmap(2) and just get the good bits.

-Tony

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-09-30 19:44                   ` Luck, Tony
@ 2021-09-30 20:01                     ` Borislav Petkov
  2021-09-30 20:15                       ` Luck, Tony
  0 siblings, 1 reply; 35+ messages in thread
From: Borislav Petkov @ 2021-09-30 20:01 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Williams, Dan J, Linux NVDIMM, Jane Chu, Luis Chamberlain

On Thu, Sep 30, 2021 at 07:44:48PM +0000, Luck, Tony wrote:
> See the comment above set_mce_nospec() ...
> 
> /*
>  * Prevent speculative access to the page by either unmapping
>  * it (if we do not require access to any part of the page) or
>  * marking it uncacheable (if we want to try to retrieve data
>  * from non-poisoned lines in the page).
>  */
> static inline int set_mce_nospec(unsigned long pfn, bool unmap)

I've seen that comment - I've quoted it upthread...

> It's a choice as to whether the whole page is gone or not. The history for
> this is using pmem as storage. The filesystem block size may be less than
> the page size. An error in a "block" should only result in that block disappearing
> from the file, not the surrounding 4k.

So let me cut to the chase:

        if (!memory_failure(..))
                set_mce_nospec(pfn, whole_page...);

when memory_failure() returns 0, is a whole page marked as hwpoison or
not?

Because I see there close to the top of the function:

	if (TestSetPageHWPoison(p)) {
		...

after this, that whole page is hwpoison I'd say. Not a cacheline but the
whole thing.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-09-30 20:01                     ` Borislav Petkov
@ 2021-09-30 20:15                       ` Luck, Tony
  2021-09-30 20:32                         ` Borislav Petkov
  0 siblings, 1 reply; 35+ messages in thread
From: Luck, Tony @ 2021-09-30 20:15 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Williams, Dan J, Linux NVDIMM, Jane Chu, Luis Chamberlain

> So let me cut to the chase:
>
>        if (!memory_failure(..))
>                set_mce_nospec(pfn, whole_page...);
>
> when memory_failure() returns 0, is a whole page marked as hwpoison or
> not?

Depends on what the whole_page() helper said.

static bool whole_page(struct mce *m)
{
        if (!mca_cfg.ser || !(m->status & MCI_STATUS_MISCV))
                return true;

        return MCI_MISC_ADDR_LSB(m->misc) >= PAGE_SHIFT;
}

> Because I see there close to the top of the function:
>
>	if (TestSetPageHWPoison(p)) {
>		...
>
> after this, that whole page is hwpoison I'd say. Not a cacheline but the
> whole thing.

That may now be a confusing name for the page flag bit. Until the
pmem/storage use case we just simply lost the whole page (back
then set_mce_nospec() didn't take an argument, it just marked the
whole page as "not present" in the kernel 1:1 map).

So the meaning of HWPoison has subtly changed from "this whole
page is poisoned" to "there is some poison in some/all[1] of this page"

-Tony

[1] even in the "all" case the poison is likely just in one cache line, but
the MCI_MISC_ADDR_LSB indication in the machine check bank said
the scope was the whole page. This happened on some older systems
for page scrub errors where the memory controller wasn't smart enough
to translate the channel address back to a cache granular system address.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-09-30 20:15                       ` Luck, Tony
@ 2021-09-30 20:32                         ` Borislav Petkov
  2021-09-30 20:39                           ` Dan Williams
  0 siblings, 1 reply; 35+ messages in thread
From: Borislav Petkov @ 2021-09-30 20:32 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Williams, Dan J, Linux NVDIMM, Jane Chu, Luis Chamberlain

On Thu, Sep 30, 2021 at 08:15:51PM +0000, Luck, Tony wrote:
> That may now be a confusing name for the page flag bit. Until the
> pmem/storage use case we just simply lost the whole page (back
> then set_mce_nospec() didn't take an argument, it just marked the
> whole page as "not present" in the kernel 1:1 map).
> 
> So the meaning of HWPoison has subtly changed from "this whole
> page is poisoned" to "there is some poison in some/all[1] of this page"

Is that that PMEM driver case Dan is talking about: "the owner of that
page, PMEM driver, knows how to navigate around that poison to maximize
data recovery flows."?

IOW, even if the page is marked as poison, in the PMEM case the driver
can access those sub-page ranges to salvage data? And in the "normal"
case, we only deal with whole pages anyway because memory_failure()
will mark the whole page as poison and nothing will be able to access
sub-page ranges there to salvage data?

Closer?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-09-30 20:32                         ` Borislav Petkov
@ 2021-09-30 20:39                           ` Dan Williams
  2021-09-30 20:54                             ` Borislav Petkov
  0 siblings, 1 reply; 35+ messages in thread
From: Dan Williams @ 2021-09-30 20:39 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Luck, Tony, Linux NVDIMM, Jane Chu, Luis Chamberlain

On Thu, Sep 30, 2021 at 1:34 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Thu, Sep 30, 2021 at 08:15:51PM +0000, Luck, Tony wrote:
> > That may now be a confusing name for the page flag bit. Until the
> > pmem/storage use case we just simply lost the whole page (back
> > then set_mce_nospec() didn't take an argument, it just marked the
> > whole page as "not present" in the kernel 1:1 map).
> >
> > So the meaning of HWPoison has subtly changed from "this whole
> > page is poisoned" to "there is some poison in some/all[1] of this page"
>
> Is that that PMEM driver case Dan is talking about: "the owner of that
> page, PMEM driver, knows how to navigate around that poison to maximize
> data recovery flows."?
>
> IOW, even if the page is marked as poison, in the PMEM case the driver
> can access those sub-page ranges to salvage data? And in the "normal"
> case, we only deal with whole pages anyway because memory_failure()
> will mark the whole page as poison and nothing will be able to access
> sub-page ranges there to salvage data?
>
> Closer?
>

Yes, that's a good way to think about it. The only way to avoid poison
for page allocator pages is to just ditch the page. In the case of
PMEM the driver can do this fine grained dance because it gets precise
sub-page poison lists to consult and implements a non-mmap path to
access the page contents.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-09-30 20:39                           ` Dan Williams
@ 2021-09-30 20:54                             ` Borislav Petkov
  2021-09-30 21:05                               ` Dan Williams
  0 siblings, 1 reply; 35+ messages in thread
From: Borislav Petkov @ 2021-09-30 20:54 UTC (permalink / raw)
  To: Dan Williams; +Cc: Luck, Tony, Linux NVDIMM, Jane Chu, Luis Chamberlain

On Thu, Sep 30, 2021 at 01:39:03PM -0700, Dan Williams wrote:
> Yes, that's a good way to think about it. The only way to avoid poison
> for page allocator pages is to just ditch the page. In the case of
> PMEM the driver can do this fine grained dance because it gets precise
> sub-page poison lists to consult and implements a non-mmap path to
> access the page contents.

Ok, good.

Now, before we do anything further here, I'd like for this very much
non-obvious situation to be documented in detail so that we know what's
going on there and what that whole_page notion even means. Because this
is at least bending the meaning of page states like poison and what that
really means for the underlying thing - PMEM or general purpose DIMMs.

And then that test could be something like:

	/*
	 * Normal DRAM gets poisoned as a whole page, yadda yadda...
	 */
	if (whole_page) {

	/*
	 * Special handling for PMEM case, driver can handle accessing
	 * sub-page ranges even if the whole "page" is poisoned, blabla
	 */
	} else {
		rc = _set_memory_uc(decoy_addr, 1);
	...


so that it is crystal clear what's going on there.

Thx!

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-09-30 20:54                             ` Borislav Petkov
@ 2021-09-30 21:05                               ` Dan Williams
  2021-09-30 21:20                                 ` Borislav Petkov
  0 siblings, 1 reply; 35+ messages in thread
From: Dan Williams @ 2021-09-30 21:05 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Luck, Tony, Linux NVDIMM, Jane Chu, Luis Chamberlain

On Thu, Sep 30, 2021 at 1:54 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Thu, Sep 30, 2021 at 01:39:03PM -0700, Dan Williams wrote:
> > Yes, that's a good way to think about it. The only way to avoid poison
> > for page allocator pages is to just ditch the page. In the case of
> > PMEM the driver can do this fine grained dance because it gets precise
> > sub-page poison lists to consult and implements a non-mmap path to
> > access the page contents.
>
> Ok, good.
>
> Now, before we do anything further here, I'd like for this very much
> non-obvious situation to be documented in detail so that we know what's
> going on there and what that whole_page notion even means. Because this
> is at least bending the meaning of page states like poison and what that
> really means for the underlying thing - PMEM or general purpose DIMMs.

Hmm, memory type does not matter in this path.

>
> And then that test could be something like:
>
>         /*
>          * Normal DRAM gets poisoned as a whole page, yadda yadda...

No, the whole_page case is equally relevant for PMEM...

>          /
>         if (whole_page) {
>
>         /*
>          * Special handling for PMEM case, driver can handle accessing sub-page ranges
>          * even if the whole "page" is poisoned, blabla
>         } else {

...and UC is acceptable to DRAM.

>                 rc = _set_memory_uc(decoy_addr, 1);
>         ...
>
> so that it is crystal clear what's going on there.
>

The only distinction that matters here is whether the reported
blast-radius of the poison reported by the hardware is a sub-page or
whole-page. Memory type does not matter.

I.e. it's fine if a DRAM page with a single cacheline error only gets
marked UC. Speculation is disabled and the page allocator will still
throw it away and never use it again. Similarly NP is fine for PMEM
when the machine-check-registers indicate that the entire page is full
of poison. The driver will record that and block any attempt to
recover any data in that page.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-09-30 21:05                               ` Dan Williams
@ 2021-09-30 21:20                                 ` Borislav Petkov
  2021-09-30 21:41                                   ` Dan Williams
  0 siblings, 1 reply; 35+ messages in thread
From: Borislav Petkov @ 2021-09-30 21:20 UTC (permalink / raw)
  To: Dan Williams; +Cc: Luck, Tony, Linux NVDIMM, Jane Chu, Luis Chamberlain

On Thu, Sep 30, 2021 at 02:05:49PM -0700, Dan Williams wrote:
> I.e. it's fine

A lot of things are fine - question is, what makes sense and what makes
this thing as simple as required to be.

> if a DRAM page with a single cacheline error only gets marked UC.
> Speculation is disabled and the page allocator will still throw it
> away and never use it again.

Normal DRAM is poisoned as a whole page, as we established. So whatever
it is marked with - UC or NP - it probably doesn't matter. But since
the page won't be ever used, then that page is practically not present.
So I say, let's mark normal DRAM pages which have been poisoned as not
present and be done with it. Period.

> Similarly NP is fine for PMEM when the machine-check-registers
> indicate that the entire page is full of poison. The driver will
> record that and block any attempt to recover any data in that page.

So this is still kinda weird. We will mark it with either NP or UC but
the driver has some special knowledge how to tiptoe around poison. So
for simplicity, let's mark it with whatever fits best and be done with
it - driver can handle it just fine.

I hope you're catching my drift: it doesn't really matter what's possible
wrt marking - it matters what the practical side of the whole thing
is wrt further page handling and recovery action. And we should do
here whatever does not impede that further page handling even if other
markings are possible.

Ok?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-09-30 19:11           ` Dan Williams
@ 2021-09-30 21:23             ` Jane Chu
  0 siblings, 0 replies; 35+ messages in thread
From: Jane Chu @ 2021-09-30 21:23 UTC (permalink / raw)
  To: Dan Williams; +Cc: Borislav Petkov, Linux NVDIMM, Luis Chamberlain, Tony Luck


On 9/30/2021 12:11 PM, Dan Williams wrote:
> On Thu, Sep 30, 2021 at 11:15 AM Jane Chu <jane.chu@oracle.com> wrote:
>>
>> Hi, Dan,
>>
>> On 9/16/2021 1:33 PM, Dan Williams wrote:
>>> The new consideration of
>>> whole_page() errors means that the code in nfit_handle_mce() is wrong
>>> to assume that the length of the error is just 1 cacheline. When that
>>> bug is fixed the badblocks range will cover the entire page in
>>> whole_page() notification cases.
>>
>> Does this explain what I saw a while ago: inject two poisons
>> to the same block, and pwrite clears one poison at a time?
> 
> Potentially, yes. If you injected poison over 512 bytes then it could
> take 2 pwrites() of 256-bytes each to clear that poison. In the DAX
> case nothing is enforcing sector aligned I/O for pwrite().
> 

Thanks!

-jane

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-09-30 21:20                                 ` Borislav Petkov
@ 2021-09-30 21:41                                   ` Dan Williams
  2021-09-30 22:35                                     ` Borislav Petkov
  0 siblings, 1 reply; 35+ messages in thread
From: Dan Williams @ 2021-09-30 21:41 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Luck, Tony, Linux NVDIMM, Jane Chu, Luis Chamberlain

On Thu, Sep 30, 2021 at 2:21 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Thu, Sep 30, 2021 at 02:05:49PM -0700, Dan Williams wrote:
> > I.e. it's fine
>
> A lot of things are fine - question is, what makes sense and what makes
> this thing as simple as required to be.
>
> > if a DRAM page with a single cacheline error only gets marked UC.
> > Speculation is disabled and the page allocator will still throw it
> > away and never use it again.
>
> Normal DRAM is poisoned as a whole page, as we established.

It is not in all cases, as I read:

17fae1294ad9 x86/{mce,mm}: Unmap the entire page if the whole page is
affected and poisoned

...it is MCi_MISC that determines what happens and MCi_MISC is not
memory-type aware.

> So whatever
> it is marked with - UC or NP - it probably doesn't matter. But since
> the page won't be ever used, then that page is practically not present.
> So I say, let's mark normal DRAM pages which have been poisoned as not
> present and be done with it. Period.

Where would this routine learn the memory type? Pass it in via
memory_failure(), to what end?

> > Similarly NP is fine for PMEM when the machine-check-registers
> > indicate that the entire page is full of poison. The driver will
> > record that and block any attempt to recover any data in that page.
>
> So this is still kinda weird. We will mark it with either NP or UC but
> the driver has some special knowledge how to tiptoe around poison. So
> for simplicity, let's mark it with whatever fits best and be done with
> it - driver can handle it just fine.
>
> I hope you're cathing my drift: it doesn't really matter what's possible
> wrt marking - it matters what the practical side of the whole thing
> is wrt further page handling and recovery action. And we should do
> here whatever does not impede that further page handling even if other
> markings are possible.
>
> Ok?

I'm sorry, but I'm not tracking.

set_mce_nospec() just wants to prevent speculative consumption of
poison going forward. It has 2 options NP and UC. UC is suitable for
DRAM and desired for PMEM as long as MCi_MISC indicates a sub-page
size poison event.

If you want set_mce_nospec() to specify NP for the page even when
MCi_MISC indicates sub-page blast radius it would require extra
plumbing to convey the memory type to set_mce_nospec().

I fail to see the point of that extra plumbing when MCi_MISC
indicating "whole_page", or not is sufficient. What am I missing?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-09-30 21:41                                   ` Dan Williams
@ 2021-09-30 22:35                                     ` Borislav Petkov
  2021-09-30 22:44                                       ` Dan Williams
  2021-10-01  0:43                                       ` Jane Chu
  0 siblings, 2 replies; 35+ messages in thread
From: Borislav Petkov @ 2021-09-30 22:35 UTC (permalink / raw)
  To: Dan Williams; +Cc: Luck, Tony, Linux NVDIMM, Jane Chu, Luis Chamberlain

On Thu, Sep 30, 2021 at 02:41:52PM -0700, Dan Williams wrote:
> I fail to see the point of that extra plumbing when MCi_MISC
> indicating "whole_page", or not is sufficient. What am I missing?

I think you're looking at it from the wrong side... (or it is too late
here, but we'll see). Forget how a memory type can be mapped but think
about how the recovery action looks like.

- DRAM: when a DRAM page is poisoned, it is only poisoned as a whole
page by memory_failure(). whole_page is always true here, no matter what
the hardware says because we don't and cannot do any sub-page recovery
actions. So it doesn't matter how we map it, UC, NP... I suggested NP
because the page is practically not present if you want to access it
because mm won't allow it...

- PMEM: reportedly, we can do sub-page recovery here so PMEM should be
mapped in the way it is better for the recovery action to work.

In both cases, the recovery action should control how the memory type is
mapped.

Now, you say we cannot know the memory type when the error gets
reported.

And I say: for simplicity's sake, we simply go and work with whole
pages. Always. That is the case anyway for DRAM.

For PMEM, AFAIU, it doesn't matter whether it is a whole page or not -
the PMEM driver knows how to do those sub-pages accesses.

IOW, set_mce_nospec() should simply do:

	rc = set_memory_np(decoy_addr, 1);

and that's it.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-09-30 22:35                                     ` Borislav Petkov
@ 2021-09-30 22:44                                       ` Dan Williams
  2021-10-01 10:41                                         ` Borislav Petkov
  2021-10-01  0:43                                       ` Jane Chu
  1 sibling, 1 reply; 35+ messages in thread
From: Dan Williams @ 2021-09-30 22:44 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Luck, Tony, Linux NVDIMM, Jane Chu, Luis Chamberlain

On Thu, Sep 30, 2021 at 3:36 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Thu, Sep 30, 2021 at 02:41:52PM -0700, Dan Williams wrote:
> > I fail to see the point of that extra plumbing when MCi_MISC
> > indicating "whole_page", or not is sufficient. What am I missing?
>
> I think you're looking at it from the wrong side... (or it is too late
> here, but we'll see). Forget how a memory type can be mapped but think
> about how the recovery action looks like.
>
> - DRAM: when a DRAM page is poisoned, it is only poisoned as a whole
> page by memory_failure(). whole_page is always true here, no matter what
> the hardware says because we don't and cannot do any sub-page recovery
> actions. So it doesn't matter how we map it, UC, NP... I suggested NP
> because the page is practically not present if you want to access it
> because mm won't allow it...
>
> - PMEM: reportedly, we can do sub-page recovery here so PMEM should be
> mapped in the way it is better for the recovery action to work.
>
> In both cases, the recovery action should control how the memory type is
> mapped.
>
> Now, you say we cannot know the memory type when the error gets
> reported.
>
> And I say: for simplicity's sake, we simply go and work with whole
> pages. Always. That is the case anyway for DRAM.
>
> For PMEM, AFAIU, it doesn't matter whether it is a whole page or not -
> the PMEM driver knows how to do those sub-pages accesses.
>
> IOW, set_mce_nospec() should simply do:
>
>         rc = set_memory_np(decoy_addr, 1);
>
> and that's it.

The driver uses the direct-map to do the access. It uses the
direct-map because it has also arranged for pfn_to_page() to work for
PMEM pages. So if PMEM is in the direct-map is marked NP then the
sub-page accesses will fault.

Now, the driver could set up and tear down page tables for the pfn
whenever it is asked to do I/O over a potentially poisoned pfn. Is
that what you are suggesting? It seems like a significant amount of
overhead, but it would at least kick this question out of the purview
of the MCE code.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-09-30 22:35                                     ` Borislav Petkov
  2021-09-30 22:44                                       ` Dan Williams
@ 2021-10-01  0:43                                       ` Jane Chu
  2021-10-01  2:02                                         ` Dan Williams
  1 sibling, 1 reply; 35+ messages in thread
From: Jane Chu @ 2021-10-01  0:43 UTC (permalink / raw)
  To: Borislav Petkov, Dan Williams; +Cc: Luck, Tony, Linux NVDIMM, Luis Chamberlain


On 9/30/2021 3:35 PM, Borislav Petkov wrote:
> On Thu, Sep 30, 2021 at 02:41:52PM -0700, Dan Williams wrote:
>> I fail to see the point of that extra plumbing when MCi_MISC
>> indicating "whole_page", or not is sufficient. What am I missing?
> 
> I think you're looking at it from the wrong side... (or it is too late
> here, but we'll see). Forget how a memory type can be mapped but think
> about how the recovery action looks like.
> 
> - DRAM: when a DRAM page is poisoned, it is only poisoned as a whole
> page by memory_failure(). whole_page is always true here, no matter what
> the hardware says because we don't and cannot do any sub-page recovery
> actions. So it doesn't matter how we map it, UC, NP... I suggested NP
> because the page is practically not present if you want to access it
> because mm won't allow it...
> 
> - PMEM: reportedly, we can do sub-page recovery here so PMEM should be
> mapped in the way it is better for the recovery action to work.
> 
> In both cases, the recovery action should control how the memory type is
> mapped.
> 
> Now, you say we cannot know the memory type when the error gets
> reported.
> 
> And I say: for simplicity's sake, we simply go and work with whole
> pages. Always. That is the case anyway for DRAM.

Sorry, please correct me if I misunderstand. The DRAM poison handling
at page frame granularity is a helpless compromise due to lack of
guarantee to decipher the precise error blast radius given all
types of DRAM and architectures, right?  But that's not true for
the PMEM case. So why should PMEM poison handling follow the lead
of DRAM?

Regards,
-jane

> 
> For PMEM, AFAIU, it doesn't matter whether it is a whole page or not -
> the PMEM driver knows how to do those sub-pages accesses.
> 
> IOW, set_mce_nospec() should simply do:
> 
> 	rc = set_memory_np(decoy_addr, 1);
> 
> and that's it.
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-10-01  0:43                                       ` Jane Chu
@ 2021-10-01  2:02                                         ` Dan Williams
  2021-10-01 10:50                                           ` Borislav Petkov
  0 siblings, 1 reply; 35+ messages in thread
From: Dan Williams @ 2021-10-01  2:02 UTC (permalink / raw)
  To: Jane Chu; +Cc: Borislav Petkov, Luck, Tony, Linux NVDIMM, Luis Chamberlain

On Thu, Sep 30, 2021 at 5:43 PM Jane Chu <jane.chu@oracle.com> wrote:
>
>
> On 9/30/2021 3:35 PM, Borislav Petkov wrote:
> > On Thu, Sep 30, 2021 at 02:41:52PM -0700, Dan Williams wrote:
> >> I fail to see the point of that extra plumbing when MCi_MISC
> >> indicating "whole_page" (or not) is sufficient. What am I missing?
> >
> > I think you're looking at it from the wrong side... (or it is too late
> > here, but we'll see). Forget how a memory type can be mapped but think
> > about what the recovery action looks like.
> >
> > - DRAM: when a DRAM page is poisoned, it is only poisoned as a whole
> > page by memory_failure(). whole_page is always true here, no matter what
> > the hardware says because we don't and cannot do any sub-page recovery
> > actions. So it doesn't matter how we map it, UC, NP... I suggested NP
> > because the page is practically not present if you want to access it
> > because mm won't allow it...
> >
> > - PMEM: reportedly, we can do sub-page recovery here so PMEM should be
> > mapped in the way that is best for the recovery action to work.
> >
> > In both cases, the recovery action should control how the memory type is
> > mapped.
> >
> > Now, you say we cannot know the memory type when the error gets
> > reported.
> >
> > And I say: for simplicity's sake, we simply go and work with whole
> > pages. Always. That is the case anyway for DRAM.
>
> Sorry, please correct me if I misunderstand. The DRAM poison handling
> at page frame granularity is a helpless compromise, due to the lack of
> any guarantee that the precise error blast radius can be deciphered
> across all types of DRAM and architectures, right?  But that's not true for
> the PMEM case. So why should PMEM poison handling follow the lead
> of DRAM?

If I understand the proposal correctly Boris is basically saying
"figure out how to do your special PMEM stuff in the driver directly
and make it so MCE code has no knowledge of the PMEM case". The flow
is:

memory_failure(pfn, flags)
nfit_handle_mce(...) <--- right now this is on the mce notifier chain
set_mce_nospec(pfn) <--- drop the "whole page" concept

This poses a problem because not all memory_failure() paths trigger
set_mce_nospec() or the mce notifier chain. If that disconnect
happens, attempts to read PMEM pages that have been signalled to
memory_failure() will now crash in the driver without workarounds for
NP pages.

So memory_failure() needs to ensure that it communicates with the
driver before any possible NP page attribute changes. I.e. the driver
needs to know that regardless of how many cachelines are poisoned the
entire page is always unmapped in the direct map.
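That contract can be sketched in plain C as follows (names like
poison_track and record_poison are made up for illustration, not
kernel API): poison is remembered per cacheline for later recovery,
but the moment any line in a page is bad the whole page is treated as
gone from the direct map.

```c
#include <assert.h>
#include <stdint.h>

#define CL_PER_PAGE 64        /* 4096-byte page / 64-byte cachelines */
#define NPAGES      8

/* Hypothetical driver-side poison tracking, per the contract above. */
struct poison_track {
	uint64_t page_np;            /* bit n set: page n is NP */
	uint64_t bad_lines[NPAGES];  /* bit m set: cacheline m of page n is bad */
};

static void record_poison(struct poison_track *t, unsigned int page,
			  unsigned int line)
{
	t->bad_lines[page] |= 1ULL << line;
	t->page_np |= 1ULL << page;  /* whole page leaves the direct map */
}

static int page_accessible(const struct poison_track *t, unsigned int page)
{
	return !(t->page_np & (1ULL << page));
}
```

The asymmetry is the point: unmapping is always page-granular, while
the recovery bookkeeping stays cacheline-granular.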

Then, when the driver is called with the new RWF_RECOVER_DATA flag, it
can set up a new UC alias mapping for the pfn and access the good data
in the page while being careful to read around the poisoned cache
lines.

In my mind this moves the RWF_RECOVER_DATA flag proposal from "nice to
have" to "critical for properly coordinating with memory_failure() and
mce expectations".
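The read-around step could look roughly like this, a user-space sketch
with a made-up helper name and a flat bitmap standing in for the
driver's real poison tracking:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096
#define CL_SIZE   64
#define CL_PER_PAGE (PAGE_SIZE / CL_SIZE)   /* 64 lines: one bitmap word */

/*
 * Hypothetical recovery helper (not an existing kernel function):
 * copy the readable cachelines of a poisoned page out of a UC alias
 * mapping, skipping the lines marked bad in @bad (one bit per
 * 64-byte cacheline). Returns the number of bytes salvaged.
 */
static size_t copy_around_poison(void *dst, const void *uc_alias,
				 uint64_t bad)
{
	size_t copied = 0;

	for (unsigned int i = 0; i < CL_PER_PAGE; i++) {
		if (bad & (1ULL << i))
			continue;   /* touching this line would fault again */
		memcpy((char *)dst + i * CL_SIZE,
		       (const char *)uc_alias + i * CL_SIZE, CL_SIZE);
		copied += CL_SIZE;
	}
	return copied;
}
```

The UC alias is what makes this safe on real hardware: the reads
bypass the cache hierarchy, so only the deliberately skipped lines can
signal poison.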

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-09-30 22:44                                       ` Dan Williams
@ 2021-10-01 10:41                                         ` Borislav Petkov
  0 siblings, 0 replies; 35+ messages in thread
From: Borislav Petkov @ 2021-10-01 10:41 UTC (permalink / raw)
  To: Dan Williams; +Cc: Luck, Tony, Linux NVDIMM, Jane Chu, Luis Chamberlain

On Thu, Sep 30, 2021 at 03:44:40PM -0700, Dan Williams wrote:
> The driver uses the direct-map to do the access. It uses the
> direct-map because it has also arranged for pfn_to_page() to work for
PMEM pages. So if PMEM in the direct-map is marked NP then the
> sub-page accesses will fault.

Well, the driver has special knowledge so *actually* it could go and use
the NP marking as "this page has been poisoned" and mark it NC only for
itself, so that it gets the job done. Dunno if that would end up being
too ugly to live and turn into a layering violation or so.

Or we can mark the page NC - that should stop the speculative access for
DRAM and the PMEM driver can do its job. The NC should take care so that
no cachelines are in the cache hierarchy.

> Now, the driver could set up and tear down page tables for the pfn
> whenever it is asked to do I/O over a potentially poisoned pfn. Is
> that what you are suggesting? It seems like a significant amount of
> overhead, but it would at least kick this question out of the purview
> of the MCE code.

Yeah, I'm looking for the simplest solution first which satisfies
everyone. Let's keep that one in the bag for now...

:-)

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-10-01  2:02                                         ` Dan Williams
@ 2021-10-01 10:50                                           ` Borislav Petkov
  2021-10-01 16:52                                             ` Dan Williams
  0 siblings, 1 reply; 35+ messages in thread
From: Borislav Petkov @ 2021-10-01 10:50 UTC (permalink / raw)
  To: Dan Williams; +Cc: Jane Chu, Luck, Tony, Linux NVDIMM, Luis Chamberlain

On Thu, Sep 30, 2021 at 07:02:13PM -0700, Dan Williams wrote:
> If I understand the proposal correctly Boris is basically saying
> "figure out how to do your special PMEM stuff in the driver directly
> and make it so MCE code has no knowledge of the PMEM case".

I don't mind if it has to because of a good reason X - I'm just trying
to simplify things and not do stuff which is practically unnecessary.

> The flow
> is:
> 
> memory_failure(pfn, flags)
> nfit_handle_mce(...) <--- right now this is on the mce notifier chain
> set_mce_nospec(pfn) <--- drop the "whole page" concept
> 
> This poses a problem because not all memory_failure() paths trigger
> set_mce_nospec() or the mce notifier chain.

Hmm, so this sounds like more is needed than this small patch.

> If that disconnect happens, attempts to read PMEM pages that have been
> signalled to memory_failure() will now crash in the driver without
> workarounds for NP pages.
>
> So memory_failure() needs to ensure that it communicates with the
> driver before any possible NP page attribute changes. I.e. the driver
> needs to know that regardless of how many cachelines are poisoned the
> entire page is always unmapped in the direct map.

Ok, so those errors do get reported through MCE so MCE code does need to
know about them.

So the question is, how should MCE mark them so that the driver can deal
with them properly.

> Then, when the driver is called with the new RWF_RECOVER_DATA flag, it
> can set up a new UC alias mapping for the pfn and access the good data
> in the page while being careful to read around the poisoned cache
> lines.

See my other mail - that sounds good.

> In my mind this moves the RWF_RECOVER_DATA flag proposal from "nice to
> have" to "critical for properly coordinating with memory_failure() and
> mce expectations"

Right, so to reiterate: I don't mind if the MCE code knows about this -
at the end of the day, it is that code which does the error reporting.
My goal is to make that reporting and marking optimal so that the driver
can work with that marking and simplify and document *why* we're doing
that so that future MCE changes can keep that in mind.

So to sum up: it sounds to me like it should simply mark *whole* pages
as NC:

1. this should prevent the spec. access for DRAM.
2. PMEM can do UC alias mapping and do sub-page access to salvage data.

How's that?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-10-01 10:50                                           ` Borislav Petkov
@ 2021-10-01 16:52                                             ` Dan Williams
  2021-10-01 18:11                                               ` Borislav Petkov
  0 siblings, 1 reply; 35+ messages in thread
From: Dan Williams @ 2021-10-01 16:52 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Jane Chu, Luck, Tony, Linux NVDIMM, Luis Chamberlain

On Fri, Oct 1, 2021 at 3:50 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Thu, Sep 30, 2021 at 07:02:13PM -0700, Dan Williams wrote:
> > If I understand the proposal correctly Boris is basically saying
> > "figure out how to do your special PMEM stuff in the driver directly
> > and make it so MCE code has no knowledge of the PMEM case".
>
> I don't mind if it has to because of a good reason X - I'm just trying
> to simplify things and not do stuff which is practically unnecessary.
>
> > The flow
> > is:
> >
> > memory_failure(pfn, flags)
> > nfit_handle_mce(...) <--- right now this is on the mce notifier chain
> > set_mce_nospec(pfn) <--- drop the "whole page" concept
> >
> > This poses a problem because not all memory_failure() paths trigger
> > set_mce_nospec() or the mce notifier chain.
>
> Hmm, so this sounds like more is needed than this small patch.
>
> > If that disconnect happens, attempts to read PMEM pages that have been
> > signalled to memory_failure() will now crash in the driver without
> > workarounds for NP pages.
> >
> > So memory_failure() needs to ensure that it communicates with the
> > driver before any possible NP page attribute changes. I.e. the driver
> > needs to know that regardless of how many cachelines are poisoned the
> > entire page is always unmapped in the direct map.
>
> Ok, so those errors do get reported through MCE so MCE code does need to
> know about them.
>
> So the question is, how should MCE mark them so that the driver can deal
> with them properly.
>
> > Then, when the driver is called with the new RWF_RECOVER_DATA flag, it
> > can set up a new UC alias mapping for the pfn and access the good data
> > in the page while being careful to read around the poisoned cache
> > lines.
>
> See my other mail - that sounds good.
>
> > In my mind this moves the RWF_RECOVER_DATA flag proposal from "nice to
> > have" to "critical for properly coordinating with memory_failure() and
> > mce expectations"
>
> Right, so to reiterate: I don't mind if the MCE code knows about this -
> at the end of the day, it is that code which does the error reporting.
> My goal is to make that reporting and marking optimal so that the driver
> can work with that marking and simplify and document *why* we're doing
> that so that future MCE changes can keep that in mind.
>
> So to sum up: it sounds to me like it should simply mark *whole* pages
> as NC:
>
> 1. this should prevent the spec. access for DRAM.
> 2. PMEM can do UC alias mapping and do sub-page access to salvage data.
>
> How's that?

I think that puts us back in the situation that Tony fixed in:

17fae1294ad9 x86/{mce,mm}: Unmap the entire page if the whole page is
affected and poisoned

...where the clflush in _set_memory_uc() causes more unwanted virtual
#MC injection.

I think that means that we have no choice but to mark the page NP
unconditionally and do the work to ensure that the driver has updated
its poisoned cacheline tracking for data recovery requests.

Right?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-10-01 16:52                                             ` Dan Williams
@ 2021-10-01 18:11                                               ` Borislav Petkov
  2021-10-01 18:29                                                 ` Dan Williams
  0 siblings, 1 reply; 35+ messages in thread
From: Borislav Petkov @ 2021-10-01 18:11 UTC (permalink / raw)
  To: Dan Williams; +Cc: Jane Chu, Luck, Tony, Linux NVDIMM, Luis Chamberlain

On Fri, Oct 01, 2021 at 09:52:21AM -0700, Dan Williams wrote:
> I think that puts us back in the situation that Tony fixed in:
> 
> 17fae1294ad9 x86/{mce,mm}: Unmap the entire page if the whole page is
> affected and poisoned
> 
> ...where the clflush in _set_memory_uc() causes more unwanted virtual
> #MC injection.

Hmm, lemme read that commit message again: so the guest kernel sees a
*second* MCE while doing set_memory_uc().

So what prevents the guest kernel from seeing a second MCE when it does
set_memory_np() instead?

"Further investigation showed that the VMM had passed in another machine
check because it appeared that the guest was accessing the bad page."

but then the beginning of the commit message says:

"The VMM unmapped the bad page from guest physical space and passed the
machine check to the guest."

so I'm really confused here what actually happens. Did the VMM manage to
unmap it or not really? Because if the VMM had unmapped it, then how was
the guest still accessing the bad page? ... Maybe I'm reading that wrong.

> I think that means that we have no choice but to mark the page NP
> unconditionally and do the work to ensure that the driver has updated
> its poisoned cacheline tracking for data recovery requests.

So a couple of emails earlier I had this:

|Well, the driver has special knowledge so *actually* it could go and use
|the NP marking as "this page has been poisoned" and mark it NC only for
|itself, so that it gets the job done. Dunno if that would end up being
|too ugly to live and turn into a layering violation or so.

So if we have marked the page NP, then nothing would be able to access
it anymore and it will be marked as hwpoison additionally, which will
prevent that access too.

Then, the PMEM driver would go and map the page however it wants to, it
could even remove it from the direct map so that nothing else accesses
it, even in the kernel, and then do all kinds of recovery.

Hmm?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-10-01 18:11                                               ` Borislav Petkov
@ 2021-10-01 18:29                                                 ` Dan Williams
  2021-10-02 10:17                                                   ` Borislav Petkov
  0 siblings, 1 reply; 35+ messages in thread
From: Dan Williams @ 2021-10-01 18:29 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Jane Chu, Luck, Tony, Linux NVDIMM, Luis Chamberlain

On Fri, Oct 1, 2021 at 11:12 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Fri, Oct 01, 2021 at 09:52:21AM -0700, Dan Williams wrote:
> > I think that puts us back in the situation that Tony fixed in:
> >
> > 17fae1294ad9 x86/{mce,mm}: Unmap the entire page if the whole page is
> > affected and poisoned
> >
> > ...where the clflush in _set_memory_uc() causes more unwanted virtual
> > #MC injection.
>
> Hmm, lemme read that commit message again: so the guest kernel sees a
> *second* MCE while doing set_memory_uc().
>
> So what prevents the guest kernel from seeing a second MCE when it does
> set_memory_np() instead?
>
> "Further investigation showed that the VMM had passed in another machine
> check because it appeared that the guest was accessing the bad page."
>
> but then the beginning of the commit message says:
>
> "The VMM unmapped the bad page from guest physical space and passed the
> machine check to the guest."
>
> so I'm really confused here what actually happens. Did the VMM manage to
> unmap it or not really? Because if the VMM had unmapped it, then how was
> the guest still accessing the bad page? ... Maybe I'm reading that wrong.

My read is that the guest gets virtual #MC on an access to that page.
When the guest tries to do set_memory_uc() and instructs cpa_flush()
to clean caches, that results in taking another fault / exception,
perhaps because the VMM unmapped the page from the guest? If the guest
had flipped the page to NP then cpa_flush() says "oh, no caching
change, skip the clflush() loop".
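A toy model of that cpa_flush() distinction (the bit values are
illustrative, not the real x86 pgtable definitions):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified PTE bits; a sketch, not the actual x86 layout. */
#define _PAGE_PRESENT 0x001UL
#define _PAGE_PWT     0x008UL   /* write-through */
#define _PAGE_PCD     0x010UL   /* cache disable */

#define CACHE_BITS (_PAGE_PWT | _PAGE_PCD)

/*
 * cpa_flush() only walks the clflush loop when the caching attributes
 * changed. set_memory_uc() flips PCD/PWT, so it flushes (and the
 * CLFLUSH itself can re-trip #MC on a poisoned line); set_memory_np()
 * only clears Present, so no flush is issued.
 */
static bool needs_clflush(unsigned long old_pte, unsigned long new_pte)
{
	return (old_pte & CACHE_BITS) != (new_pte & CACHE_BITS);
}
```

Which would be why the NP path avoids the second machine check that
the UC path triggers.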

>
> > I think that means that we have no choice but to mark the page NP
> > unconditionally and do the work to ensure that the driver has updated
> > its poisoned cacheline tracking for data recovery requests.
>
> So a couple of emails earlier I had this:
>
> |Well, the driver has special knowledge so *actually* it could go and use
> |the NP marking as "this page has been poisoned" and mark it NC only for
> |itself, so that it gets the job done. Dunno if that would end up being
> |too ugly to live and turn into a layering violation or so.
>
> So if we have marked the page NP, then nothing would be able to access
> it anymore and it will be marked as hwpoison additionally, which will
> prevent that access too.
>
> Then, the PMEM driver would go and map the page however it wants to, it
> could even remove it from the direct map so that nothing else accesses
> it, even in the kernel, and then do all kinds of recovery.
>
> Hmm?

Yeah, I thought UC would make the PMEM driver's life easier, but if it
has to contend with an NP case at all, might as well make it handle
that case all the time.

Safe to say this patch of mine is woefully insufficient and I need to
go look at how to make the guarantees needed by the PMEM driver so it
can handle NP and set up alias maps.

This was a useful discussion. It proves that my commit:

284ce4011ba6 x86/memory_failure: Introduce {set, clear}_mce_nospec()

...was broken from the outset.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
  2021-10-01 18:29                                                 ` Dan Williams
@ 2021-10-02 10:17                                                   ` Borislav Petkov
  0 siblings, 0 replies; 35+ messages in thread
From: Borislav Petkov @ 2021-10-02 10:17 UTC (permalink / raw)
  To: Dan Williams; +Cc: Jane Chu, Luck, Tony, Linux NVDIMM, Luis Chamberlain

On Fri, Oct 01, 2021 at 11:29:43AM -0700, Dan Williams wrote:
> My read is that the guest gets virtual #MC on an access to that page.
> When the guest tries to do set_memory_uc() and instructs cpa_flush()
> to clean caches, that results in taking another fault / exception,
> perhaps because the VMM unmapped the page from the guest? If the guest
> had flipped the page to NP then cpa_flush() says "oh, no caching
> change, skip the clflush() loop".

... and the CLFLUSH is the insn which caused the second MCE because it
"appeared that the guest was accessing the bad page."

Uuf, that could be. Nasty.

> Yeah, I thought UC would make the PMEM driver's life easier, but if it
> has to contend with an NP case at all, might as well make it handle
> that case all the time.
> 
> Safe to say this patch of mine is woefully insufficient and I need to
> go look at how to make the guarantees needed by the PMEM driver so it
> can handle NP and set up alias maps.
> 
> This was a useful discussion.

Oh yeah, thanks for taking the time!

> It proves that my commit:
> 
> 284ce4011ba6 x86/memory_failure: Introduce {set, clear}_mce_nospec()
> 
> ...was broken from the outset.

Well, the problem with hw errors is that it is always very hard to test
the code. But I hear injection works now soo... :-)

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2021-10-02 10:17 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-07  1:01 [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem Dan Williams
2021-08-26 19:08 ` Dan Williams
2021-08-27  7:12   ` Jane Chu
2021-09-13 10:29 ` Borislav Petkov
2021-09-14 18:08   ` Dan Williams
2021-09-15 10:41     ` Borislav Petkov
2021-09-16 20:33       ` Dan Williams
2021-09-17 11:30         ` Borislav Petkov
2021-09-21  2:04           ` Dan Williams
2021-09-30 17:19             ` Borislav Petkov
2021-09-30 17:28               ` Luck, Tony
2021-09-30 19:30                 ` Borislav Petkov
2021-09-30 19:41                   ` Dan Williams
2021-09-30 19:44                   ` Luck, Tony
2021-09-30 20:01                     ` Borislav Petkov
2021-09-30 20:15                       ` Luck, Tony
2021-09-30 20:32                         ` Borislav Petkov
2021-09-30 20:39                           ` Dan Williams
2021-09-30 20:54                             ` Borislav Petkov
2021-09-30 21:05                               ` Dan Williams
2021-09-30 21:20                                 ` Borislav Petkov
2021-09-30 21:41                                   ` Dan Williams
2021-09-30 22:35                                     ` Borislav Petkov
2021-09-30 22:44                                       ` Dan Williams
2021-10-01 10:41                                         ` Borislav Petkov
2021-10-01  0:43                                       ` Jane Chu
2021-10-01  2:02                                         ` Dan Williams
2021-10-01 10:50                                           ` Borislav Petkov
2021-10-01 16:52                                             ` Dan Williams
2021-10-01 18:11                                               ` Borislav Petkov
2021-10-01 18:29                                                 ` Dan Williams
2021-10-02 10:17                                                   ` Borislav Petkov
2021-09-30 18:15         ` Jane Chu
2021-09-30 19:11           ` Dan Williams
2021-09-30 21:23             ` Jane Chu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).