* [PATCH] x86/mm: Don't try to change poison pages to uncacheable in a guest
@ 2020-05-05 18:46 Tony Luck
  2020-05-16  6:54 ` Borislav Petkov
  0 siblings, 1 reply; 19+ messages in thread
From: Tony Luck @ 2020-05-05 18:46 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Tony Luck, Jue Wang, Dan Williams, x86, linux-kernel

An interesting thing happened when a guest Linux instance took
a machine check. The VMM unmapped the bad page from guest physical
space and passed the machine check to the guest.

Linux took all the normal actions to offline the page from the process
that was using it. But then guest Linux crashed because it said there
was a second machine check inside the kernel with this stack trace:

do_memory_failure
    set_mce_nospec
         set_memory_uc
              _set_memory_uc
                   change_page_attr_set_clr
                        cpa_flush
                             clflush_cache_range_opt

This was odd, because a CLFLUSH instruction shouldn't raise a machine
check (it isn't consuming the data). Further investigation showed that
the VMM had passed in another machine check because it appeared that the
guest was accessing the bad page.

The fix is to check whether Linux is running as a guest. If it is, there
is no point in trying to change the cache mode of the bad page: the VMM
has taken the whole page away.

Reported-by: Jue Wang <juew@google.com>
Tested-by: Jue Wang <juew@google.com>
Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Cc: <stable@vger.kernel.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/set_memory.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index ec2c0a094b5d..e8b7e8a82d7a 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -91,6 +91,14 @@ static inline int set_mce_nospec(unsigned long pfn)
 	unsigned long decoy_addr;
 	int rc;
 
+	/*
+	 * If we are running as a guest, then the hypervisor
+	 * will have removed all access to the page, so there
+	 * is no point in trying to make it uncacheable.
+	 */
+	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+		return 0;
+
 	/*
 	 * Mark the linear address as UC to make sure we don't log more
 	 * errors because of speculative access to the page.
-- 
2.21.1



* Re: [PATCH] x86/mm: Don't try to change poison pages to uncacheable in a guest
  2020-05-05 18:46 [PATCH] x86/mm: Don't try to change poison pages to uncacheable in a guest Tony Luck
@ 2020-05-16  6:54 ` Borislav Petkov
  2020-05-16 14:47   ` Luck, Tony
  0 siblings, 1 reply; 19+ messages in thread
From: Borislav Petkov @ 2020-05-16  6:54 UTC (permalink / raw)
  To: Tony Luck; +Cc: Jue Wang, Dan Williams, x86, linux-kernel

On Tue, May 05, 2020 at 11:46:48AM -0700, Tony Luck wrote:
> An interesting thing happened when a guest Linux instance took
> a machine check. The VMM unmapped the bad page from guest physical
> space and passed the machine check to the guest.
> 
> Linux took all the normal actions to offline the page from the process
> that was using it. But then guest Linux crashed because it said there
> was a second machine check inside the kernel with this stack trace:
> 
> do_memory_failure
>     set_mce_nospec
>          set_memory_uc
>               _set_memory_uc
>                    change_page_attr_set_clr
>                         cpa_flush
>                              clflush_cache_range_opt

Maybe I don't see it but how can clflush_cache_range_opt() call
cpa_flush()?

> This was odd, because a CLFLUSH instruction shouldn't raise a machine
> check (it isn't consuming the data). Further investigation showed that
> the VMM had passed in another machine check because it appeared that the
> guest was accessing the bad page.

This is where you lost me - if the VMM unmaps the page during the first
MCE, how can the guest even attempt to touch it and do this stack trace
above?

/me is confused.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH] x86/mm: Don't try to change poison pages to uncacheable in a guest
  2020-05-16  6:54 ` Borislav Petkov
@ 2020-05-16 14:47   ` Luck, Tony
  2020-05-16 15:02     ` Borislav Petkov
  0 siblings, 1 reply; 19+ messages in thread
From: Luck, Tony @ 2020-05-16 14:47 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Jue Wang, Williams, Dan J, x86, linux-kernel

There is only one actual machine check. But the VMM simulates a second machine check to the guest when the guest tries to access the poisoned page.

The stack trace was from Jue. I didn’t try to check it. But it looked reasonable that Linux would flush the cache for a page that is transitioning from cacheable to uncacheable.

Sent from my iPhone

> On May 15, 2020, at 23:54, Borislav Petkov <bp@alien8.de> wrote:
> 
> On Tue, May 05, 2020 at 11:46:48AM -0700, Tony Luck wrote:
>> An interesting thing happened when a guest Linux instance took
>> a machine check. The VMM unmapped the bad page from guest physical
>> space and passed the machine check to the guest.
>> 
>> Linux took all the normal actions to offline the page from the process
>> that was using it. But then guest Linux crashed because it said there
>> was a second machine check inside the kernel with this stack trace:
>> 
>> do_memory_failure
>>    set_mce_nospec
>>         set_memory_uc
>>              _set_memory_uc
>>                   change_page_attr_set_clr
>>                        cpa_flush
>>                             clflush_cache_range_opt
> 
> Maybe I don't see it but how can clflush_cache_range_opt() call
> cpa_flush()?
> 
>> This was odd, because a CLFLUSH instruction shouldn't raise a machine
>> check (it isn't consuming the data). Further investigation showed that
>> the VMM had passed in another machine check because it appeared that the
>> guest was accessing the bad page.
> 
> This is where you lost me - if the VMM unmaps the page during the first
> MCE, how can the guest even attempt to touch it and do this stack trace
> above?
> 
> /me is confused.
> 
> -- 
> Regards/Gruss,
>    Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH] x86/mm: Don't try to change poison pages to uncacheable in a guest
  2020-05-16 14:47   ` Luck, Tony
@ 2020-05-16 15:02     ` Borislav Petkov
  2020-05-17  1:52       ` Luck, Tony
  0 siblings, 1 reply; 19+ messages in thread
From: Borislav Petkov @ 2020-05-16 15:02 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Jue Wang, Williams, Dan J, x86, linux-kernel

On Sat, May 16, 2020 at 02:47:42PM +0000, Luck, Tony wrote:
> There is only one actual machine check. But the VMM simulates a second
> machine check to the guest when the guest tries to access the poisoned
> page.

If the VMM unmaps the bad page, why doesn't it inject a #PF into the
guest instead of injecting a second #MCE?

If the guest tries to access an unmapped page, it should get a #PF, I'd
expect.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH] x86/mm: Don't try to change poison pages to uncacheable in a guest
  2020-05-16 15:02     ` Borislav Petkov
@ 2020-05-17  1:52       ` Luck, Tony
       [not found]         ` <CAPcxDJ50pbuTbittyvPwKq1uUT8q8jJ+dHH8rCug8a1DDZXVYw@mail.gmail.com>
  0 siblings, 1 reply; 19+ messages in thread
From: Luck, Tony @ 2020-05-17  1:52 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Jue Wang, Williams, Dan J, x86, linux-kernel

But the guest isn’t likely to do the right thing with a page fault. The guest just accessed a page that it knows is poisoned (VMM just told it once that it was poisoned). There is no reason that the VMM should let the guest actually touch the poison a second time. But if the guest does, then the guest should get the expected response.  I.e. another machine check.

Sent from my iPhone

> On May 16, 2020, at 08:03, Borislav Petkov <bp@alien8.de> wrote:
> 
> On Sat, May 16, 2020 at 02:47:42PM +0000, Luck, Tony wrote:
>> There is only one actual machine check. But the VMM simulates a second
>> machine check to the guest when the guest tries to access the poisoned
>> page.
> 
> If the VMM unmaps the bad page, why doesn't it inject a #PF into the
> guest instead of injecting a second #MCE?
> 
> If the guest tries to access an unmapped page, it should get a #PF, I'd
> expect.
> 
> -- 
> Regards/Gruss,
>    Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH] x86/mm: Don't try to change poison pages to uncacheable in a guest
       [not found]           ` <CAPcxDJ6f3pBpwiR9nvXN_g_HBa1RAMG+aOmgfXLFT6aZ9HQn3w@mail.gmail.com>
@ 2020-05-18 13:48             ` Borislav Petkov
  2020-05-18 15:36               ` Luck, Tony
  0 siblings, 1 reply; 19+ messages in thread
From: Borislav Petkov @ 2020-05-18 13:48 UTC (permalink / raw)
  To: Jue Wang; +Cc: Luck, Tony, Williams, Dan J, x86, linux-kernel

Hi,

lemme try to reply to three emails at once.

First of all, the two of you: pls do not top-post.

On Sat, May 16, 2020 at 6:52 PM Luck, Tony <tony.luck@intel.com> wrote:
> But the guest isn’t likely to do the right thing with a page fault.
> The guest just accessed a page that it knows is poisoned (VMM just told
> it once that it was poisoned). There is no reason that the VMM should
> let the guest actually touch the poison a second time. But if the guest
> does, then the guest should get the expected response. I.e. another
> machine check.

So Jue says below that the VMM has unmapped the guest page from the EPT.
So the guest cannot touch the poison anymore.

How is then possible for the guest to touch it again if the page is not
mapped anymore?

The guest should get a #PF when the page is unmapped and cause a new
page to be mapped.

On Sun, May 17, 2020 at 07:36:00AM -0700, Jue Wang wrote:
> The stack is from guest MCE handler's processing of the first MCE injected.

Aha, so you've flipped the functions order in the trace. It all starts
at

  set_mce_nospec(m->addr >> PAGE_SHIFT);

Now it makes sense.

> Note before the first MCE is injected into guest (i.e., after the host MCE
> handler successfully finished MCE handling and notified VMM via SIGBUS with
> BUS_MCEERR_AR), VMM unmaps the guest page from EPT.

Ok, good.

> The guest MCE handler finished the "normal" MCE handling and recovery
> (memory_failure() in mm/memory-failure.c) successfully, it's the aftermath
> below leading to the stack trace:
> https://github.com/torvalds/linux/blob/5a9ffb954a3933d7867f4341684a23e008d6839b/arch/x86/kernel/cpu/mce/core.c#L1101

On Sun, May 17, 2020 at 08:33:00AM -0700, Jue Wang wrote:
> In other words, it's the *do_memory_failure -> set_mce_nospec*  flow of
> guest MCE handler acting on the *first* MCE injection. As a result, the
> guest panics and resets *whenever* there is an MCE injected, even when the
> injected MCE is recoverable.

So IIUC that set_mce_nospec() thing should check whether m->addr is
mapped first and only then mark it _uc and whatever monkey business it
does. Not this blanket

  if am I a guest?

test.

Imagine a hypervisor which doesn't set X86_FEATURE_HYPERVISOR, i.e.,
CPUID(1)[EDX, bit 31]?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH] x86/mm: Don't try to change poison pages to uncacheable in a guest
  2020-05-18 13:48             ` Borislav Petkov
@ 2020-05-18 15:36               ` Luck, Tony
  2020-05-18 16:55                 ` Borislav Petkov
  0 siblings, 1 reply; 19+ messages in thread
From: Luck, Tony @ 2020-05-18 15:36 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Jue Wang, Williams, Dan J, x86, linux-kernel

On Mon, May 18, 2020 at 03:48:13PM +0200, Borislav Petkov wrote:
> Hi,
> 
> lemme try to reply to three emails at once.
> 
> First of all, the two of you: pls do not top-post.

Sorry. Phone e-mail client is dumb.

> On Sat, May 16, 2020 at 6:52 PM Luck, Tony <tony.luck@intel.com> wrote:
> > But the guest isn’t likely to do the right thing with a page fault.
> > The guest just accessed a page that it knows is poisoned (VMM just told
> > it once that it was poisoned). There is no reason that the VMM should
> > let the guest actually touch the poison a second time. But if the guest
> > does, then the guest should get the expected response. I.e. another
> > machine check.
> 
> So Jue says below that the VMM has unmapped the guest page from the EPT.
> So the guest cannot touch the poison anymore.
> 
> How is then possible for the guest to touch it again if the page is not
> mapped anymore?

The VMM wants to make sure that the guest can't touch the poison
(this is important because not every touch of poison results in a
recoverable machine check. If the guest's next touch is an unaligned
access that crosses from the poison cache line to a non-poisoned line
then h/w will signal a fatal machine check and the whole machine will
go down).

> The guest should get a #PF when the page is unmapped and cause a new
> page to be mapped.

The VMM gets the page fault (because the unmapping of the guest
physical address is at the VMM EPT level).  The VMM can't map a new
page into that guest physical address because it has no way to
replace the contents of the old page.  The VMM could pass the #PF
to the guest, but that would just confuse the guest (its page tables
all say that the page is still valid). In this particular case the
page is part of the 1:1 kernel map. So the kernel will OOPS (I think).

> On Sun, May 17, 2020 at 07:36:00AM -0700, Jue Wang wrote:
> > The stack is from guest MCE handler's processing of the first MCE injected.
> 
> Aha, so you've flipped the functions order in the trace. It all starts
> at
> 
>   set_mce_nospec(m->addr >> PAGE_SHIFT);
> 
> Now it makes sense.
> 
> > Note before the first MCE is injected into guest (i.e., after the host MCE
> > handler successfully finished MCE handling and notified VMM via SIGBUS with
> > BUS_MCEERR_AR), VMM unmaps the guest page from EPT.
> 
> Ok, good.
> 
> > The guest MCE handler finished the "normal" MCE handling and recovery
> > (memory_failure() in mm/memory-failure.c) successfully, it's the aftermath
> > below leading to the stack trace:
> > https://github.com/torvalds/linux/blob/5a9ffb954a3933d7867f4341684a23e008d6839b/arch/x86/kernel/cpu/mce/core.c#L1101
> 
> On Sun, May 17, 2020 at 08:33:00AM -0700, Jue Wang wrote:
> > In other words, it's the *do_memory_failure -> set_mce_nospec*  flow of
> > guest MCE handler acting on the *first* MCE injection. As a result, the
> > guest panics and resets *whenever* there is an MCE injected, even when the
> > injected MCE is recoverable.
> 
> So IIUC that set_mce_nospec() thing should check whether m->addr is
> mapped first and only then mark it _uc and whatever monkey business it
> does. Not this blanket

Please explain how a guest (that doesn't even know that it is a guest)
is going to figure out that the EPT tables (that it has no way to access)
have marked this page invalid in guest physical address space.

>   if am I a guest?
> 
> test.
> 
> Imagine a hypervisor which doesn't set X86_FEATURE_HYPERVISOR, i.e.,
> CPUID(1)[ECX, bit 31]?

Guest is going to be screwed in this case.

-Tony


* Re: [PATCH] x86/mm: Don't try to change poison pages to uncacheable in a guest
  2020-05-18 15:36               ` Luck, Tony
@ 2020-05-18 16:55                 ` Borislav Petkov
  2020-05-18 18:26                   ` Luck, Tony
  2020-05-19  5:04                   ` [PATCH] x86/mm: Don't try to change poison pages to uncacheable in a guest Sean Christopherson
  0 siblings, 2 replies; 19+ messages in thread
From: Borislav Petkov @ 2020-05-18 16:55 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Jue Wang, Williams, Dan J, x86, linux-kernel

On Mon, May 18, 2020 at 08:36:25AM -0700, Luck, Tony wrote:
> The VMM gets the page fault (because the unmapping of the guest
> physical address is at the VMM EPT level).  The VMM can't map a new
> page into that guest physical address because it has no way to
> replace the contents of the old page.  The VMM could pass the #PF
> to the guest, but that would just confuse the guest (its page tables
> all say that the page is still valid). In this particular case the
> page is part of the 1:1 kernel map. So the kernel will OOPS (I think).

...

> Please explain how a guest (that doesn't even know that it is a guest)
> is going to figure out that the EPT tables (that it has no way to access)
> have marked this page invalid in guest physical address space.

So somewhere BUS_MCEERR_AR was mentioned. So I'm assuming the error
severity was "action required". What does happen in the kernel, on
baremetal, with an AR error in kernel space, i.e., kernel memory?

If we can't fixup the exception, we die.

So why should the guest behave any differently?

Now, if you want for the guest to be more "robust" and handle that
thing, fine. But then you'd need an explicit way to tell the guest
kernel: "you've just had an MCE and I unmapped the page" so that the
guest kernel can figure out what to do. Even if it means to panic.

I.e., signal in an explicit way that EPT violation Jue is talking about
in the other mail.

You can inject a #PF or better yet the *first* MCE which is being
injected should say with a bit somewhere "I unmapped the address in
m->addr". So that the guest kernel can handle that properly and know
what *exactly* it is getting an MCE for.

What I don't like is the "am I running as a guest" check. Because
someone else would come later and say, err, I'm not virtualizing this
portion of MCA either, lemme add another "am I guest" check.

Sure, it is a lot easier but when stuff like that starts spreading
around in the MCE code, then we can just as well disable MCE when
virtualized altogether. It would be a lot easier for everybody.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH] x86/mm: Don't try to change poison pages to uncacheable in a guest
  2020-05-18 16:55                 ` Borislav Petkov
@ 2020-05-18 18:26                   ` Luck, Tony
  2020-05-18 19:20                     ` Dan Williams
                                       ` (2 more replies)
  2020-05-19  5:04                   ` [PATCH] x86/mm: Don't try to change poison pages to uncacheable in a guest Sean Christopherson
  1 sibling, 3 replies; 19+ messages in thread
From: Luck, Tony @ 2020-05-18 18:26 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Jue Wang, Williams, Dan J, x86, linux-kernel

On Mon, May 18, 2020 at 06:55:00PM +0200, Borislav Petkov wrote:
> On Mon, May 18, 2020 at 08:36:25AM -0700, Luck, Tony wrote:
> > The VMM gets the page fault (because the unmapping of the guest
> > physical address is at the VMM EPT level).  The VMM can't map a new
> > page into that guest physical address because it has no way to
> > replace the contents of the old page.  The VMM could pass the #PF
> > to the guest, but that would just confuse the guest (its page tables
> > all say that the page is still valid). In this particular case the
> > page is part of the 1:1 kernel map. So the kernel will OOPS (I think).
> 
> ...
> 
> > Please explain how a guest (that doesn't even know that it is a guest)
> > is going to figure out that the EPT tables (that it has no way to access)
> > have marked this page invalid in guest physical address space.
> 
> So somewhere BUS_MCEERR_AR was mentioned. So I'm assuming the error
> severity was "action required". What does happen in the kernel, on
> baremetal, with an AR error in kernel space, i.e., kernel memory?

Outside of the now infamous memcpy_mcsafe(), any kernel consumption
of poison results in a panic as the mce_severity() code will trip
this case:

        MCESEV(
                PANIC, "Data load in unrecoverable area of kernel",
                SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA),
                KERNEL
                ),

> If we can't fixup the exception, we die.
> 
> So why should the guest behave any differently?

We don't see this particular problem on baremetal because a CLFLUSH
instruction isn't *consuming* data. It's just evicting things from
the cache to memory. So we reference the virtual address, which works
fine on baremetal because the kernel 1:1 map is still active. But in
the guest case the guest physical address has gone away. So we trap
to the VMM.
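
For reference, the flush loop in question looks roughly like this (a
simplified sketch of the kernel's clflush_cache_range_opt(), not the
verbatim source):

static void clflush_cache_range_opt(void *vaddr, unsigned int size)
{
	const unsigned long clflush_size = boot_cpu_data.x86_clflush_size;
	void *p = (void *)((unsigned long)vaddr & ~(clflush_size - 1));
	void *vend = vaddr + size;

	/*
	 * CLFLUSHOPT references the virtual address only to evict the
	 * line; it never consumes the data, so on bare metal it cannot
	 * take a recoverable machine check. In a guest, each CLFLUSHOPT
	 * to the unmapped guest physical page exits to the VMM instead.
	 */
	for (; p < vend; p += clflush_size)
		clflushopt(p);
}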

> Now, if you want for the guest to be more "robust" and handle that
> thing, fine. But then you'd need an explicit way to tell the guest
> kernel: "you've just had an MCE and I unmapped the page" so that the
> guest kernel can figure out what to do. Even if it means to panic.
> 
> I.e., signal in an explicit way that EPT violation Jue is talking about
> in the other mail.
> 
> You can inject a #PF or better yet the *first* MCE which is being
> injected should say with a bit somewhere "I unmapped the address in
> m->addr". So that the guest kernel can handle that properly and know
> what *exactly* it is getting an MCE for.

That question only makes any sense if you know you are running as a
guest and that someone else has unmapped the address. It's a meaningless
question to ask if you are running bare metal. So we'd still have a check
for FEATURE_HYPERVISOR.

> What I don't like is the "am I running as a guest" check. Because
> someone else would come later and say, err, I'm not virtualizing this
> portion of MCA either, lemme add another "am I guest" check.
> 
> Sure, it is a lot easier but when stuff like that starts spreading
> around in the MCE code, then we can just as well disable MCE when
> virtualized altogether. It would be a lot easier for everybody.

Maybe it isn't pretty. But I don't see another practical solution.

The VMM is doing exactly the right thing here. It should not trust
that the guest will behave and not touch the poison location again.
If/when the guest does touch the poison, the right action is
for the VMM to fake a new machine check to the guest.

Theoretically the VMM could decode the instruction that the guest
was trying to use on the poison page and decide "oh, this is that
weird case in Linux where it's just trying to CLFLUSH the page. I'll
just step the return IP past the CLFLUSH and let the guest continue".

But that doesn't sound at all reasonable to me (especially as the
next step is to realize that Linux is going to repeat that for every
cache line in the page, so you also want the VMM to fudge the register
contents to skip to the end of the loop and avoid another 63 VMEXITs).

N.B. Linux wants to switch the page to uncacheable so that in the
persistent memory case the filesystem code can continue to access
the other "blocks" in the page, rather than lose all of them. That's
futile in the case where the VMM took the whole 4K away. Maybe Dan
needs to think about the guest case too.

-Tony


* Re: [PATCH] x86/mm: Don't try to change poison pages to uncacheable in a guest
  2020-05-18 18:26                   ` Luck, Tony
@ 2020-05-18 19:20                     ` Dan Williams
  2020-05-19  5:22                     ` Sean Christopherson
  2020-05-19  8:50                     ` Borislav Petkov
  2 siblings, 0 replies; 19+ messages in thread
From: Dan Williams @ 2020-05-18 19:20 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Borislav Petkov, Jue Wang, x86, linux-kernel

On Mon, May 18, 2020 at 11:26 AM Luck, Tony <tony.luck@intel.com> wrote:
[..]
> N.B. Linux wants to switch the page to uncacheable so that in the
> persistent memory case the filesystem code can continue to access
> the other "blocks" in the page, rather than lose all of them. That's
> futile in the case where the VMM took the whole 4K away. Maybe Dan
> needs to think about the guest case too.

I think increasing the blast-radius to 4K is the best we can do
without a paravirt mechanism to coordinate errors. There's also the
existing problem that set_mce_nospec() fails on pmem due to the
memtype lookup.


* Re: [PATCH] x86/mm: Don't try to change poison pages to uncacheable in a guest
  2020-05-18 16:55                 ` Borislav Petkov
  2020-05-18 18:26                   ` Luck, Tony
@ 2020-05-19  5:04                   ` Sean Christopherson
  1 sibling, 0 replies; 19+ messages in thread
From: Sean Christopherson @ 2020-05-19  5:04 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Luck, Tony, Jue Wang, Williams, Dan J, x86, linux-kernel

On Mon, May 18, 2020 at 06:55:00PM +0200, Borislav Petkov wrote:
> On Mon, May 18, 2020 at 08:36:25AM -0700, Luck, Tony wrote:
> > The VMM gets the page fault (because the unmapping of the guest
> > physical address is at the VMM EPT level).  The VMM can't map a new
> > page into that guest physical address because it has no way to
> > replace the contents of the old page.  The VMM could pass the #PF
> > to the guest, but that would just confuse the guest (its page tables
> > all say that the page is still valid). In this particular case the
> > page is part of the 1:1 kernel map. So the kernel will OOPS (I think).
> 
> ...
> 
> > Please explain how a guest (that doesn't even know that it is a guest)
> > is going to figure out that the EPT tables (that it has no way to access)
> > have marked this page invalid in guest physical address space.
> 
> So somewhere BUS_MCEERR_AR was mentioned. So I'm assuming the error
> severity was "action required". What does happen in the kernel, on
> baremetal, with an AR error in kernel space, i.e., kernel memory?
> 
> If we can't fixup the exception, we die.
> 
> So why should the guest behave any differently?
> 
> Now, if you want for the guest to be more "robust" and handle that
> thing, fine. But then you'd need an explicit way to tell the guest
> kernel: "you've just had an MCE and I unmapped the page" so that the
> guest kernel can figure out what to do. Even if it means to panic.
> 
> I.e., signal in an explicit way that EPT violation Jue is talking about
> in the other mail.

Well, technically the CLFLUSH thing is a KVM emulation bug, but it sounds
like that's a moot point since the pmem-enabled guest will make real
accesses to the poisoned page shortly thereafter.  E.g. teaching KVM to
eat the -EHWPOISON on CLFLUSH would only postpone the guest's death.

As for how the second #MC occurs, on the EPT violation, KVM does a gup() to
translate the virtual address to a pfn (KVM maintains a simple GPA->HVA
lookup).  gup() returns -EHWPOISON for the poisoned page, which KVM
redirects into a BUS_MCEERR_AR.  The userspace VMM, e.g. Qemu, sees the
BUS_MCEERR_AR and sends it back into the guest as a virtual #MC.
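
The user-space half of that handoff looks roughly like this (a sketch
only; the two helpers are hypothetical stand-ins, not Qemu's real API):

#include <signal.h>
#include <stdint.h>

extern uint64_t hva_to_gpa(void *hva);				/* hypothetical */
extern void inject_virtual_mce(uint64_t gpa, short lsb);	/* hypothetical */

static void vmm_sigbus_handler(int sig, siginfo_t *si, void *ucontext)
{
	if (si->si_code != BUS_MCEERR_AR)
		return;

	/*
	 * si_addr is the poisoned host virtual address; si_addr_lsb is
	 * the blast radius the kernel reports (PAGE_SHIFT when the whole
	 * page is gone), which the VMM can echo into the guest's MCi_MISC.
	 */
	inject_virtual_mce(hva_to_gpa(si->si_addr), si->si_addr_lsb);
}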

> You can inject a #PF or better yet the *first* MCE which is being
> injected should say with a bit somewhere "I unmapped the address in
> m->addr". So that the guest kernel can handle that properly and know
> what *exactly* it is getting an MCE for.
> 
> What I don't like is the "am I running as a guest" check. Because
> someone else would come later and say, err, I'm not virtualizing this
> portion of MCA either, lemme add another "am I guest" check.
> 
> Sure, it is a lot easier but when stuff like that starts spreading
> around in the MCE code, then we can just as well disable MCE when
> virtualized altogether. It would be a lot easier for everybody.


* Re: [PATCH] x86/mm: Don't try to change poison pages to uncacheable in a guest
  2020-05-18 18:26                   ` Luck, Tony
  2020-05-18 19:20                     ` Dan Williams
@ 2020-05-19  5:22                     ` Sean Christopherson
  2020-05-19  8:50                     ` Borislav Petkov
  2 siblings, 0 replies; 19+ messages in thread
From: Sean Christopherson @ 2020-05-19  5:22 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Borislav Petkov, Jue Wang, Williams, Dan J, x86, linux-kernel

On Mon, May 18, 2020 at 11:26:29AM -0700, Luck, Tony wrote:
> Maybe it isn't pretty. But I don't see another practical solution.
> 
> The VMM is doing exactly the right thing here. It should not trust
> that the guest will behave and not touch the poison location again.
> If/when the guest does touch the poison, the right action is
> for the VMM to fake a new machine check to the guest.
> 
> Theoretically the VMM could decode the instruction that the guest
> was trying to use on the poison page and decide "oh, this is that
> weird case in Linux where it's just trying to CLFLUSH the page. I'll
> just step the return IP past the CLFLUSH and let the guest continue".

That's actually doable in KVM, e.g. a hack could be done in <10 lines of
code.  A proper fix that integrates with KVM's emulator would be
substantially more code and effort though.

> But that doesn't sound at all reasonable to me (especially as the
> next step is to realize that Linux is going to repeat that for every
> cache line in the page, so you also want the VMM to fudge the register
> contents to skip to the end of the loop and avoid another 63 VMEXITs).

Eh, 63 VM-Exits is peanuts in the grand scheme.  Even with the host-side
gup() that's probably less than 50us.

> N.B. Linux wants to switch the page to uncacheable so that in the
> persistent memory case the filesystem code can continue to access
> the other "blocks" in the page, rather than lose all of them. That's
> futile in the case where the VMM took the whole 4K away. Maybe Dan
> needs to think about the guest case too.

This is where I'm unclear as to the guest behavior.  Is it doing *just*
CLFLUSH, or is it doing CLFLUSH followed by other accesses to the poisoned
page?  If it's the former, then it's probably worth at least exploring a
KVM fix.  If it's the latter, then yeah, emulating CLFLUSH for a poisoned
#MC is pointless.  I assume it's the latter since the goal is to recover
data?

Oh, and FWIW, the guest won't actually get UC for that page.


* Re: [PATCH] x86/mm: Don't try to change poison pages to uncacheable in a guest
  2020-05-18 18:26                   ` Luck, Tony
  2020-05-18 19:20                     ` Dan Williams
  2020-05-19  5:22                     ` Sean Christopherson
@ 2020-05-19  8:50                     ` Borislav Petkov
  2020-05-20 16:35                       ` [PATCH v2] x86/mm: Change so poison pages are either unmapped or marked uncacheable Luck, Tony
  2 siblings, 1 reply; 19+ messages in thread
From: Borislav Petkov @ 2020-05-19  8:50 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Jue Wang, Williams, Dan J, x86, linux-kernel

On Mon, May 18, 2020 at 11:26:29AM -0700, Luck, Tony wrote:
> That question only makes any sense if you know you are running as a
> guest and that someone else has unmapped the address. It's a meaningless
> question to ask if you are running bare metal. So we'd still have a check
> for FEATURE_HYPERVISOR.

Maybe I'm not making myself clear enough here: I'm talking about using
a *special* MCE signature which says "your mapping has disappeared from
underneath you." Maybe a bit in MCi_MISC which the hw doesn't use. Or
some other deprecated bit, whatever.

If that signature is unique you won't have to check for hypervisor - you
*know* it comes from it.

Because the hypervisor would be telling the guest: "I have removed the
page from under you, you should act accordingly" with that signature. Vs
the kernel going with the unspecific "am I running as a guest"?

See the huge difference?

> Maybe it isn't pretty. But I don't see another practical solution.

See above. Below too. I actually got two.

> The VMM is doing exactly the right thing here. It should not trust
> that the guest will behave and not touch the poison location again.
> If/when the guest does touch the poison, the right action is
> for the VMM to fake a new machine check to the guest.

Yes, and that new machine check should tell the guest: "do not CLFLUSH
this address - I've unmapped it and you don't have to do anything."
Basically what your hypervisor check does but *actually* stating why it
raised the second MCE.

> Theoretically the VMM could decode the instruction that the guest
> was trying to use on the poison page and decide "oh, this is that
> weird case in Linux where it's just trying to CLFLUSH the page. I'll
> just step the return IP past the CLFLUSH and let the guest continue".

... or not inject the second MCE at all.

That would be fixing it in the HV.

Because there's this other way to look at it and come to think of it,
fixing this in the HV makes a lot more sense. Why?

Well, let me elaborate:

The hypervisor just removed that page under the guest's feet and if that
hypervisor wants to support unenlightened guests which cannot even deal
with pages disappearing from under their feet, then that hypervisor
better not inject that second MCE.

Why would it even inject the MCE - what can the guest even do about
it? Exactly *nothing*. The page is unmapped and gone, the guest cannot
salvage any information anymore from it.

And yes, the hypervisor has *all* the information, it knows which page
it just removed so if the guest tries to access memory which HV just
poisoned and is within the range which was covered by that page, then it
should *not* inject that MCE. The guest can't handle it and why would it
inject it - it is an access to a poisoned page which the HV *knows* it
won't succeed so why bother?

The HV simply returns without injecting the MCE and so on, until the
4K page's end. It simply ignores guest accesses to the poisoned page.
Without any guest changes.

> N.B. Linux wants to switch the page to uncacheable so that in the
> persistent memory case the filesystem code can continue to access
> the other "blocks" in the page, rather than lose all of them. That's
> futile in the case where the VMM took the whole 4K away. Maybe Dan
> needs to think about the guest case too.

Yes, if the 4K page just went away, marking it UC doesn't make any
sense.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* [PATCH v2] x86/mm: Change so poison pages are either unmapped or marked uncacheable
  2020-05-19  8:50                     ` Borislav Petkov
@ 2020-05-20 16:35                       ` Luck, Tony
  2020-05-25 11:00                         ` [tip: ras/core] x86/{mce,mm}: " tip-bot2 for Tony Luck
  2020-05-26 19:56                         ` [tip: ras/core] x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned tip-bot2 for Tony Luck
  0 siblings, 2 replies; 19+ messages in thread
From: Luck, Tony @ 2020-05-20 16:35 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Jue Wang, Williams, Dan J, x86, linux-kernel

An interesting thing happened when a guest Linux instance took
a machine check. The VMM unmapped the bad page from guest physical
space and passed the machine check to the guest.

Linux took all the normal actions to offline the page from the process
that was using it. But then guest Linux crashed because it said there
was a second machine check inside the kernel with this stack trace:

do_memory_failure
    set_mce_nospec
         set_memory_uc
              _set_memory_uc
                   change_page_attr_set_clr
                        cpa_flush
                             clflush_cache_range_opt

This was odd, because a CLFLUSH instruction shouldn't raise a machine
check (it isn't consuming the data). Further investigation showed that
the VMM had passed in another machine check because it appeared that the
guest was accessing the bad page.

The fix is to check the scope of the poison by examining the MCi_MISC
register. If the entire page is affected, unmap the page; if only part
of the page is affected, mark the page as uncacheable.

This assumes that VMMs will do the logical thing and pass in the "whole
page scope" via the MCi_MISC register (since they unmapped the entire
page).

Reported-by: Jue Wang <juew@google.com>
Tested-by: Jue Wang <juew@google.com>
Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Cc: <stable@vger.kernel.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Boris wrote:
> Maybe I'm not making myself clear enough here: I'm talking about using
> a *special* MCE signature which says "your mapping has disappeared from
> underneath you." Maybe a bit in MCi_MISC which the hw doesn't use. Or
> some other deprecated bit, whatever.

My reactions (in order)
1) How can I do that for 10 year old h/w?
2) Oh wait, Boris is talking about the VMM faking a signature to the guest.
   They can mess with whatever bits they want.
3) Unused/deprecated bits are in short supply. Taking one over and getting
   all the VMM vendors to agree to this "standard" is going to be a huge effort.
4) Could we do something inside the existing architecture?
5) MCi_MISC uses low order bits to indicate the "blast radius" of an error
   (see the sketch below)
6) Asks Jue: "Could you set those MCi_MISC bits to say the whole page is gone?"
7) Jue: We already do that!
8) Me: writes this patch
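
For reference, the "blast radius" in 5) is MCi_MISC[5:0]: the least
significant valid bit of the recoverable address. A value of 6 means one
poisoned 64-byte cache line; PAGE_SHIFT (12) or more means the whole 4K
page. A sketch of the decode, equivalent to the whole_page() helper in
the patch below (MCI_MISC_ADDR_LSB() is the kernel's existing macro,
"(m) & 0x3f"):

static bool misc_covers_whole_page(u64 misc)
{
	/* e.g. an LSB field of 6 -> false; 12 or more -> true */
	return MCI_MISC_ADDR_LSB(misc) >= PAGE_SHIFT;
}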

 arch/x86/include/asm/set_memory.h | 19 +++++++++++++------
 arch/x86/kernel/cpu/mce/core.c    | 11 +++++++++--
 include/linux/set_memory.h        |  2 +-
 3 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index ec2c0a094b5d..5948218f35c5 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -86,28 +86,35 @@ int set_direct_map_default_noflush(struct page *page);
 extern int kernel_set_to_readonly;
 
 #ifdef CONFIG_X86_64
-static inline int set_mce_nospec(unsigned long pfn)
+/*
+ * Prevent speculative access to the page by either unmapping
+ * it (if we do not require access to any part of the page) or
+ * marking it uncacheable (if we want to try to retrieve data
+ * from non-poisoned lines in the page).
+ */
+static inline int set_mce_nospec(unsigned long pfn, bool unmap)
 {
 	unsigned long decoy_addr;
 	int rc;
 
 	/*
-	 * Mark the linear address as UC to make sure we don't log more
-	 * errors because of speculative access to the page.
 	 * We would like to just call:
-	 *      set_memory_uc((unsigned long)pfn_to_kaddr(pfn), 1);
+	 *      set_memory_XX((unsigned long)pfn_to_kaddr(pfn), 1);
 	 * but doing that would radically increase the odds of a
 	 * speculative access to the poison page because we'd have
 	 * the virtual address of the kernel 1:1 mapping sitting
 	 * around in registers.
 	 * Instead we get tricky.  We create a non-canonical address
 	 * that looks just like the one we want, but has bit 63 flipped.
-	 * This relies on set_memory_uc() properly sanitizing any __pa()
+	 * This relies on set_memory_XX() properly sanitizing any __pa()
 	 * results with __PHYSICAL_MASK or PTE_PFN_MASK.
 	 */
 	decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
 
-	rc = set_memory_uc(decoy_addr, 1);
+	if (unmap)
+		rc = set_memory_np(decoy_addr, 1);
+	else
+		rc = set_memory_uc(decoy_addr, 1);
 	if (rc)
 		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
 	return rc;
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 54165f3569e8..c1a480a27164 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -529,6 +529,13 @@ bool mce_is_memory_error(struct mce *m)
 }
 EXPORT_SYMBOL_GPL(mce_is_memory_error);
 
+static bool whole_page(struct mce *m)
+{
+	if (!mca_cfg.ser || !(m->status & MCI_STATUS_MISCV))
+		return true;
+	return MCI_MISC_ADDR_LSB(m->misc) >= PAGE_SHIFT;
+}
+
 bool mce_is_correctable(struct mce *m)
 {
 	if (m->cpuvendor == X86_VENDOR_AMD && m->status & MCI_STATUS_DEFERRED)
@@ -600,7 +607,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
 
 	pfn = mce->addr >> PAGE_SHIFT;
 	if (!memory_failure(pfn, 0))
-		set_mce_nospec(pfn);
+		set_mce_nospec(pfn, whole_page(mce));
 
 	return NOTIFY_OK;
 }
@@ -1098,7 +1105,7 @@ static int do_memory_failure(struct mce *m)
 	if (ret)
 		pr_err("Memory error not recovered");
 	else
-		set_mce_nospec(m->addr >> PAGE_SHIFT);
+		set_mce_nospec(m->addr >> PAGE_SHIFT, whole_page(m));
 	return ret;
 }
 
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index 86281ac7c305..860e0f843c12 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -26,7 +26,7 @@ static inline int set_direct_map_default_noflush(struct page *page)
 #endif
 
 #ifndef set_mce_nospec
-static inline int set_mce_nospec(unsigned long pfn)
+static inline int set_mce_nospec(unsigned long pfn, bool unmap)
 {
 	return 0;
 }
-- 
2.21.1



* [tip: ras/core] x86/{mce,mm}: Change so poison pages are either unmapped or marked uncacheable
  2020-05-20 16:35                       ` [PATCH v2] x86/mm: Change so poison pages are either unmapped or marked uncacheable Luck, Tony
@ 2020-05-25 11:00                         ` tip-bot2 for Tony Luck
  2020-05-25 20:40                           ` Borislav Petkov
  2020-05-26 19:56                         ` [tip: ras/core] x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned tip-bot2 for Tony Luck
  1 sibling, 1 reply; 19+ messages in thread
From: tip-bot2 for Tony Luck @ 2020-05-25 11:00 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Jue Wang, Tony Luck, Borislav Petkov, stable, x86, LKML

The following commit has been merged into the ras/core branch of tip:

Commit-ID:     3cb1ada80fe29e2fa022b5f20370b65718e0a744
Gitweb:        https://git.kernel.org/tip/3cb1ada80fe29e2fa022b5f20370b65718e0a744
Author:        Tony Luck <tony.luck@intel.com>
AuthorDate:    Wed, 20 May 2020 09:35:46 -07:00
Committer:     Borislav Petkov <bp@suse.de>
CommitterDate: Mon, 25 May 2020 12:46:40 +02:00

x86/{mce,mm}: Change so poison pages are either unmapped or marked uncacheable

An interesting thing happened when a guest Linux instance took a machine
check. The VMM unmapped the bad page from guest physical space and
passed the machine check to the guest.

Linux took all the normal actions to offline the page from the process
that was using it. But then guest Linux crashed because it said there
was a second machine check inside the kernel with this stack trace:

do_memory_failure
    set_mce_nospec
         set_memory_uc
              _set_memory_uc
                   change_page_attr_set_clr
                        cpa_flush
                             clflush_cache_range_opt

This was odd, because a CLFLUSH instruction shouldn't raise a machine
check (it isn't consuming the data). Further investigation showed that
the VMM had passed in another machine check because it appeared that the
guest was accessing the bad page.

The fix is to check the scope of the poison by examining the MCi_MISC
register. If the entire page is affected, unmap the page; if only part
of the page is affected, mark the page as uncacheable.

This assumes that VMMs will do the logical thing and pass in the "whole
page scope" via the MCi_MISC register (since they unmapped the entire
page).

Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Reported-by: Jue Wang <juew@google.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Jue Wang <juew@google.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200520163546.GA7977@agluck-desk2.amr.corp.intel.com
---
 arch/x86/include/asm/set_memory.h | 19 +++++++++++++------
 arch/x86/kernel/cpu/mce/core.c    | 11 +++++++++--
 include/linux/set_memory.h        |  2 +-
 3 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index ec2c0a0..5948218 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -86,28 +86,35 @@ int set_direct_map_default_noflush(struct page *page);
 extern int kernel_set_to_readonly;
 
 #ifdef CONFIG_X86_64
-static inline int set_mce_nospec(unsigned long pfn)
+/*
+ * Prevent speculative access to the page by either unmapping
+ * it (if we do not require access to any part of the page) or
+ * marking it uncacheable (if we want to try to retrieve data
+ * from non-poisoned lines in the page).
+ */
+static inline int set_mce_nospec(unsigned long pfn, bool unmap)
 {
 	unsigned long decoy_addr;
 	int rc;
 
 	/*
-	 * Mark the linear address as UC to make sure we don't log more
-	 * errors because of speculative access to the page.
 	 * We would like to just call:
-	 *      set_memory_uc((unsigned long)pfn_to_kaddr(pfn), 1);
+	 *      set_memory_XX((unsigned long)pfn_to_kaddr(pfn), 1);
 	 * but doing that would radically increase the odds of a
 	 * speculative access to the poison page because we'd have
 	 * the virtual address of the kernel 1:1 mapping sitting
 	 * around in registers.
 	 * Instead we get tricky.  We create a non-canonical address
 	 * that looks just like the one we want, but has bit 63 flipped.
-	 * This relies on set_memory_uc() properly sanitizing any __pa()
+	 * This relies on set_memory_XX() properly sanitizing any __pa()
 	 * results with __PHYSICAL_MASK or PTE_PFN_MASK.
 	 */
 	decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
 
-	rc = set_memory_uc(decoy_addr, 1);
+	if (unmap)
+		rc = set_memory_np(decoy_addr, 1);
+	else
+		rc = set_memory_uc(decoy_addr, 1);
 	if (rc)
 		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
 	return rc;
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 02e1f16..e35aece 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -518,6 +518,13 @@ bool mce_is_memory_error(struct mce *m)
 }
 EXPORT_SYMBOL_GPL(mce_is_memory_error);
 
+static bool whole_page(struct mce *m)
+{
+	if (!mca_cfg.ser || !(m->status & MCI_STATUS_MISCV))
+		return true;
+	return MCI_MISC_ADDR_LSB(m->misc) >= PAGE_SHIFT;
+}
+
 bool mce_is_correctable(struct mce *m)
 {
 	if (m->cpuvendor == X86_VENDOR_AMD && m->status & MCI_STATUS_DEFERRED)
@@ -571,7 +578,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
 
 	pfn = mce->addr >> PAGE_SHIFT;
 	if (!memory_failure(pfn, 0)) {
-		set_mce_nospec(pfn);
+		set_mce_nospec(pfn, whole_page(mce));
 		mce->kflags |= MCE_HANDLED_UC;
 	}
 
@@ -1069,7 +1076,7 @@ static int do_memory_failure(struct mce *m)
 	if (ret)
 		pr_err("Memory error not recovered");
 	else
-		set_mce_nospec(m->addr >> PAGE_SHIFT);
+		set_mce_nospec(m->addr >> PAGE_SHIFT, whole_page(m));
 	return ret;
 }
 
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index 86281ac..860e0f8 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -26,7 +26,7 @@ static inline int set_direct_map_default_noflush(struct page *page)
 #endif
 
 #ifndef set_mce_nospec
-static inline int set_mce_nospec(unsigned long pfn)
+static inline int set_mce_nospec(unsigned long pfn, bool unmap)
 {
 	return 0;
 }


* Re: [tip: ras/core] x86/{mce,mm}: Change so poison pages are either unmapped or marked uncacheable
  2020-05-25 11:00                         ` [tip: ras/core] x86/{mce,mm}: " tip-bot2 for Tony Luck
@ 2020-05-25 20:40                           ` Borislav Petkov
  2020-05-26 17:37                             ` Luck, Tony
  0 siblings, 1 reply; 19+ messages in thread
From: Borislav Petkov @ 2020-05-25 20:40 UTC (permalink / raw)
  To: Jue Wang, Tony Luck; +Cc: linux-kernel, linux-tip-commits, stable, x86

On Mon, May 25, 2020 at 11:00:03AM -0000, tip-bot2 for Tony Luck wrote:
> The following commit has been merged into the ras/core branch of tip:
> 
> Commit-ID:     3cb1ada80fe29e2fa022b5f20370b65718e0a744
> Gitweb:        https://git.kernel.org/tip/3cb1ada80fe29e2fa022b5f20370b65718e0a744
> Author:        Tony Luck <tony.luck@intel.com>
> AuthorDate:    Wed, 20 May 2020 09:35:46 -07:00
> Committer:     Borislav Petkov <bp@suse.de>
> CommitterDate: Mon, 25 May 2020 12:46:40 +02:00

Ok, I had to change this one due to other pending changes in
tip:x86/entry. The new version below.

Can you guys run this branch to make sure it still works as expected?

https://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git/log/?h=tip-ras-core

Thx.

---
From 4d37444a762f4c35289ac86fe880e018731701f9 Mon Sep 17 00:00:00 2001
From: Tony Luck <tony.luck@intel.com>
Date: Wed, 20 May 2020 09:35:46 -0700
Subject: [PATCH] x86/{mce,mm}: Unmap the entire page if the whole page is
 affected and poisoned

An interesting thing happened when a guest Linux instance took a machine
check. The VMM unmapped the bad page from guest physical space and
passed the machine check to the guest.

Linux took all the normal actions to offline the page from the process
that was using it. But then guest Linux crashed because it said there
was a second machine check inside the kernel with this stack trace:

do_memory_failure
    set_mce_nospec
         set_memory_uc
              _set_memory_uc
                   change_page_attr_set_clr
                        cpa_flush
                             clflush_cache_range_opt

This was odd, because a CLFLUSH instruction shouldn't raise a machine
check (it isn't consuming the data). Further investigation showed that
the VMM had passed in another machine check because it appeared that the
guest was accessing the bad page.

The fix is to check the scope of the poison by examining the MCi_MISC
register. If the entire page is affected, unmap the page; if only part
of the page is affected, mark the page as uncacheable.

This assumes that VMMs will do the logical thing and pass in the "whole
page scope" via the MCi_MISC register (since they unmapped the entire
page).

 [ bp: Adjust to x86/entry changes. ]

Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Reported-by: Jue Wang <juew@google.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Jue Wang <juew@google.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200520163546.GA7977@agluck-desk2.amr.corp.intel.com
---
 arch/x86/include/asm/set_memory.h | 19 +++++++++++++------
 arch/x86/kernel/cpu/mce/core.c    | 18 ++++++++++++++----
 include/linux/sched.h             |  4 +++-
 include/linux/set_memory.h        |  2 +-
 4 files changed, 31 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index ec2c0a094b5d..5948218f35c5 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -86,28 +86,35 @@ int set_direct_map_default_noflush(struct page *page);
 extern int kernel_set_to_readonly;
 
 #ifdef CONFIG_X86_64
-static inline int set_mce_nospec(unsigned long pfn)
+/*
+ * Prevent speculative access to the page by either unmapping
+ * it (if we do not require access to any part of the page) or
+ * marking it uncacheable (if we want to try to retrieve data
+ * from non-poisoned lines in the page).
+ */
+static inline int set_mce_nospec(unsigned long pfn, bool unmap)
 {
 	unsigned long decoy_addr;
 	int rc;
 
 	/*
-	 * Mark the linear address as UC to make sure we don't log more
-	 * errors because of speculative access to the page.
 	 * We would like to just call:
-	 *      set_memory_uc((unsigned long)pfn_to_kaddr(pfn), 1);
+	 *      set_memory_XX((unsigned long)pfn_to_kaddr(pfn), 1);
 	 * but doing that would radically increase the odds of a
 	 * speculative access to the poison page because we'd have
 	 * the virtual address of the kernel 1:1 mapping sitting
 	 * around in registers.
 	 * Instead we get tricky.  We create a non-canonical address
 	 * that looks just like the one we want, but has bit 63 flipped.
-	 * This relies on set_memory_uc() properly sanitizing any __pa()
+	 * This relies on set_memory_XX() properly sanitizing any __pa()
 	 * results with __PHYSICAL_MASK or PTE_PFN_MASK.
 	 */
 	decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
 
-	rc = set_memory_uc(decoy_addr, 1);
+	if (unmap)
+		rc = set_memory_np(decoy_addr, 1);
+	else
+		rc = set_memory_uc(decoy_addr, 1);
 	if (rc)
 		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
 	return rc;
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index ffee8a2f435d..753bc7731f12 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -520,6 +520,14 @@ bool mce_is_memory_error(struct mce *m)
 }
 EXPORT_SYMBOL_GPL(mce_is_memory_error);
 
+static bool whole_page(struct mce *m)
+{
+	if (!mca_cfg.ser || !(m->status & MCI_STATUS_MISCV))
+		return true;
+
+	return MCI_MISC_ADDR_LSB(m->misc) >= PAGE_SHIFT;
+}
+
 bool mce_is_correctable(struct mce *m)
 {
 	if (m->cpuvendor == X86_VENDOR_AMD && m->status & MCI_STATUS_DEFERRED)
@@ -573,7 +581,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
 
 	pfn = mce->addr >> PAGE_SHIFT;
 	if (!memory_failure(pfn, 0)) {
-		set_mce_nospec(pfn);
+		set_mce_nospec(pfn, whole_page(mce));
 		mce->kflags |= MCE_HANDLED_UC;
 	}
 
@@ -1173,11 +1181,12 @@ static void kill_me_maybe(struct callback_head *cb)
 	int flags = MF_ACTION_REQUIRED;
 
 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
-	if (!(p->mce_status & MCG_STATUS_RIPV))
+
+	if (!p->mce_ripv)
 		flags |= MF_MUST_KILL;
 
 	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags)) {
-		set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
+		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
 		return;
 	}
 
@@ -1331,7 +1340,8 @@ void noinstr do_machine_check(struct pt_regs *regs)
 		BUG_ON(!on_thread_stack() || !user_mode(regs));
 
 		current->mce_addr = m.addr;
-		current->mce_status = m.mcgstatus;
+		current->mce_ripv = !!(m.mcgstatus & MCG_STATUS_RIPV);
+		current->mce_whole_page = whole_page(&m);
 		current->mce_kill_me.func = kill_me_maybe;
 		if (kill_it)
 			current->mce_kill_me.func = kill_me_now;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1d68ee36c583..6293fc2f3fc0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1304,7 +1304,9 @@ struct task_struct {
 
 #ifdef CONFIG_X86_MCE
 	u64				mce_addr;
-	u64				mce_status;
+	__u64				mce_ripv : 1,
+					mce_whole_page : 1,
+					__mce_reserved : 62;
 	struct callback_head		mce_kill_me;
 #endif
 
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index 86281ac7c305..860e0f843c12 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -26,7 +26,7 @@ static inline int set_direct_map_default_noflush(struct page *page)
 #endif
 
 #ifndef set_mce_nospec
-static inline int set_mce_nospec(unsigned long pfn)
+static inline int set_mce_nospec(unsigned long pfn, bool unmap)
 {
 	return 0;
 }
-- 
2.21.0

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* RE: [tip: ras/core] x86/{mce,mm}: Change so poison pages are either unmapped or marked uncacheable
  2020-05-25 20:40                           ` Borislav Petkov
@ 2020-05-26 17:37                             ` Luck, Tony
       [not found]                               ` <CAPcxDJ5arJojbY4pzOvYh=waSPd3X_JJb1_PSuzd+jQ0qbvFsA@mail.gmail.com>
  0 siblings, 1 reply; 19+ messages in thread
From: Luck, Tony @ 2020-05-26 17:37 UTC (permalink / raw)
  To: Borislav Petkov, Jue Wang; +Cc: linux-kernel, linux-tip-commits, stable, x86

> Ok, I had to change this one due to other pending changes in
> tip:x86/entry. The new version below.
>
> Can you guys run this branch to make sure it still works as expected?
>
> https://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git/log/?h=tip-ras-core

Tested the native case. We correctly try to set the page uncacheable because
the scope of the error is a cache line.

I don't have the right setup to test the virtualization case. Maybe Jue can test again?

-Tony


* Re: [tip: ras/core] x86/{mce,mm}: Change so poison pages are either unmapped or marked uncacheable
       [not found]                                 ` <CAPcxDJ54EgX-SaDV=Lm+a2-43O68LhomyYfYdCDz38HGJCkh7g@mail.gmail.com>
@ 2020-05-26 19:46                                   ` Borislav Petkov
  0 siblings, 0 replies; 19+ messages in thread
From: Borislav Petkov @ 2020-05-26 19:46 UTC (permalink / raw)
  To: Jue Wang; +Cc: Luck, Tony, linux-kernel, linux-tip-commits, stable, x86

On Tue, May 26, 2020 at 11:44:18AM -0700, Jue Wang wrote:
> On Tue, May 26, 2020 at 11:03 AM Jue Wang <juew@google.com> wrote:
> 
> > I tried to test this but my guest image build setup was not able to build
> > from the kernel/git/bp/bp.git tip-ras-core branch. It appeared there was some
> > bindeb-pkg issue.
> >
> The bindeb-pkg issue is resolved, and I tested the following branch in a KVM
> guest; the injected MCE was recovered:
> https://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git/log/?h=tip-ras-core

Thanks to both of you!

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [tip: ras/core] x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned
  2020-05-20 16:35                       ` [PATCH v2] x86/mm: Change so poison pages are either unmapped or marked uncacheable Luck, Tony
  2020-05-25 11:00                         ` [tip: ras/core] x86/{mce,mm}: " tip-bot2 for Tony Luck
@ 2020-05-26 19:56                         ` tip-bot2 for Tony Luck
  1 sibling, 0 replies; 19+ messages in thread
From: tip-bot2 for Tony Luck @ 2020-05-26 19:56 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Jue Wang, Tony Luck, Borislav Petkov, stable, x86, LKML

The following commit has been merged into the ras/core branch of tip:

Commit-ID:     be69f6c5cd38c457c22f6e718077f6524437369d
Gitweb:        https://git.kernel.org/tip/be69f6c5cd38c457c22f6e718077f6524437369d
Author:        Tony Luck <tony.luck@intel.com>
AuthorDate:    Wed, 20 May 2020 09:35:46 -07:00
Committer:     Borislav Petkov <bp@suse.de>
CommitterDate: Mon, 25 May 2020 22:37:41 +02:00

x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned

An interesting thing happened when a guest Linux instance took a machine
check. The VMM unmapped the bad page from guest physical space and
passed the machine check to the guest.

Linux took all the normal actions to offline the page from the process
that was using it. But then guest Linux crashed because it said there
was a second machine check inside the kernel with this stack trace:

do_memory_failure
    set_mce_nospec
         set_memory_uc
              _set_memory_uc
                   change_page_attr_set_clr
                        cpa_flush
                             clflush_cache_range_opt

This was odd, because a CLFLUSH instruction shouldn't raise a machine
check (it isn't consuming the data). Further investigation showed that
the VMM had passed in another machine check because it appeared that the
guest was accessing the bad page.

The fix is to check the scope of the poison via the MCi_MISC register.
If the entire page is affected, then unmap the page. If only part of the
page is affected, then mark the page as uncacheable.

This assumes that VMMs will do the logical thing and pass in the "whole
page scope" via the MCi_MISC register (since they unmapped the entire
page).
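
As a rough worked example of the scope decode (the MCi_MISC values are
made up for illustration; MCI_MISC_ADDR_LSB() extracts bits 5:0 of
MCi_MISC, the position of the least significant valid address bit):

	#define MCI_MISC_ADDR_LSB(m)	((m) & 0x3f)	/* as in asm/mce.h */

	/*
	 * misc = 0x86: ADDR_LSB =  6 -> 64-byte (cache line) scope;
	 *              only part of the 4K page is poisoned -> mark UC.
	 * misc = 0x8c: ADDR_LSB = 12 -> whole-4K-page scope -> unmap.
	 */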

  [ bp: Adjust to x86/entry changes. ]

Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Reported-by: Jue Wang <juew@google.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Jue Wang <juew@google.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200520163546.GA7977@agluck-desk2.amr.corp.intel.com
---
 arch/x86/include/asm/set_memory.h | 19 +++++++++++++------
 arch/x86/kernel/cpu/mce/core.c    | 18 ++++++++++++++----
 include/linux/sched.h             |  4 +++-
 include/linux/set_memory.h        |  2 +-
 4 files changed, 31 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index ec2c0a0..5948218 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -86,28 +86,35 @@ int set_direct_map_default_noflush(struct page *page);
 extern int kernel_set_to_readonly;
 
 #ifdef CONFIG_X86_64
-static inline int set_mce_nospec(unsigned long pfn)
+/*
+ * Prevent speculative access to the page by either unmapping
+ * it (if we do not require access to any part of the page) or
+ * marking it uncacheable (if we want to try to retrieve data
+ * from non-poisoned lines in the page).
+ */
+static inline int set_mce_nospec(unsigned long pfn, bool unmap)
 {
 	unsigned long decoy_addr;
 	int rc;
 
 	/*
-	 * Mark the linear address as UC to make sure we don't log more
-	 * errors because of speculative access to the page.
 	 * We would like to just call:
-	 *      set_memory_uc((unsigned long)pfn_to_kaddr(pfn), 1);
+	 *      set_memory_XX((unsigned long)pfn_to_kaddr(pfn), 1);
 	 * but doing that would radically increase the odds of a
 	 * speculative access to the poison page because we'd have
 	 * the virtual address of the kernel 1:1 mapping sitting
 	 * around in registers.
 	 * Instead we get tricky.  We create a non-canonical address
 	 * that looks just like the one we want, but has bit 63 flipped.
-	 * This relies on set_memory_uc() properly sanitizing any __pa()
+	 * This relies on set_memory_XX() properly sanitizing any __pa()
 	 * results with __PHYSICAL_MASK or PTE_PFN_MASK.
 	 */
 	decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
 
-	rc = set_memory_uc(decoy_addr, 1);
+	if (unmap)
+		rc = set_memory_np(decoy_addr, 1);
+	else
+		rc = set_memory_uc(decoy_addr, 1);
 	if (rc)
 		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
 	return rc;
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index ffee8a2..753bc77 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -520,6 +520,14 @@ bool mce_is_memory_error(struct mce *m)
 }
 EXPORT_SYMBOL_GPL(mce_is_memory_error);
 
+static bool whole_page(struct mce *m)
+{
+	if (!mca_cfg.ser || !(m->status & MCI_STATUS_MISCV))
+		return true;
+
+	return MCI_MISC_ADDR_LSB(m->misc) >= PAGE_SHIFT;
+}
+
 bool mce_is_correctable(struct mce *m)
 {
 	if (m->cpuvendor == X86_VENDOR_AMD && m->status & MCI_STATUS_DEFERRED)
@@ -573,7 +581,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
 
 	pfn = mce->addr >> PAGE_SHIFT;
 	if (!memory_failure(pfn, 0)) {
-		set_mce_nospec(pfn);
+		set_mce_nospec(pfn, whole_page(mce));
 		mce->kflags |= MCE_HANDLED_UC;
 	}
 
@@ -1173,11 +1181,12 @@ static void kill_me_maybe(struct callback_head *cb)
 	int flags = MF_ACTION_REQUIRED;
 
 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
-	if (!(p->mce_status & MCG_STATUS_RIPV))
+
+	if (!p->mce_ripv)
 		flags |= MF_MUST_KILL;
 
 	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags)) {
-		set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
+		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
 		return;
 	}
 
@@ -1331,7 +1340,8 @@ void noinstr do_machine_check(struct pt_regs *regs)
 		BUG_ON(!on_thread_stack() || !user_mode(regs));
 
 		current->mce_addr = m.addr;
-		current->mce_status = m.mcgstatus;
+		current->mce_ripv = !!(m.mcgstatus & MCG_STATUS_RIPV);
+		current->mce_whole_page = whole_page(&m);
 		current->mce_kill_me.func = kill_me_maybe;
 		if (kill_it)
 			current->mce_kill_me.func = kill_me_now;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1d68ee3..6293fc2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1304,7 +1304,9 @@ struct task_struct {
 
 #ifdef CONFIG_X86_MCE
 	u64				mce_addr;
-	u64				mce_status;
+	__u64				mce_ripv : 1,
+					mce_whole_page : 1,
+					__mce_reserved : 62;
 	struct callback_head		mce_kill_me;
 #endif
 
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index 86281ac..860e0f8 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -26,7 +26,7 @@ static inline int set_direct_map_default_noflush(struct page *page)
 #endif
 
 #ifndef set_mce_nospec
-static inline int set_mce_nospec(unsigned long pfn)
+static inline int set_mce_nospec(unsigned long pfn, bool unmap)
 {
 	return 0;
 }
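
For reference, a worked example of the decoy trick in the set_memory.h
hunk above (the pfn is made up; PAGE_OFFSET assumes the default
4-level, non-KASLR layout):

	/*
	 * pfn        = 0x12345
	 * 1:1 map va = 0xffff888012345000  (canonical - exactly what we
	 *              do not want sitting around in a register)
	 * decoy_addr = (0x12345 << 12) + (0xffff888000000000 ^ BIT(63))
	 *            = 0x7fff888012345000  (non-canonical: bits 63 and
	 *              47 disagree, so a stray dereference faults
	 *              instead of touching the poison)
	 * __pa(decoy_addr) & __PHYSICAL_MASK == 0x12345000, so
	 * set_memory_np()/set_memory_uc() still hit the right PTE.
	 */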

^ permalink raw reply related	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2020-05-26 19:56 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-05 18:46 [PATCH] x86/mm: Don't try to change poison pages to uncacheable in a guest Tony Luck
2020-05-16  6:54 ` Borislav Petkov
2020-05-16 14:47   ` Luck, Tony
2020-05-16 15:02     ` Borislav Petkov
2020-05-17  1:52       ` Luck, Tony
     [not found]         ` <CAPcxDJ50pbuTbittyvPwKq1uUT8q8jJ+dHH8rCug8a1DDZXVYw@mail.gmail.com>
     [not found]           ` <CAPcxDJ6f3pBpwiR9nvXN_g_HBa1RAMG+aOmgfXLFT6aZ9HQn3w@mail.gmail.com>
2020-05-18 13:48             ` Borislav Petkov
2020-05-18 15:36               ` Luck, Tony
2020-05-18 16:55                 ` Borislav Petkov
2020-05-18 18:26                   ` Luck, Tony
2020-05-18 19:20                     ` Dan Williams
2020-05-19  5:22                     ` Sean Christopherson
2020-05-19  8:50                     ` Borislav Petkov
2020-05-20 16:35                       ` [PATCH v2] x86/mm: Change so poison pages are either unmapped or marked uncacheable Luck, Tony
2020-05-25 11:00                         ` [tip: ras/core] x86/{mce,mm}: " tip-bot2 for Tony Luck
2020-05-25 20:40                           ` Borislav Petkov
2020-05-26 17:37                             ` Luck, Tony
     [not found]                               ` <CAPcxDJ5arJojbY4pzOvYh=waSPd3X_JJb1_PSuzd+jQ0qbvFsA@mail.gmail.com>
     [not found]                                 ` <CAPcxDJ54EgX-SaDV=Lm+a2-43O68LhomyYfYdCDz38HGJCkh7g@mail.gmail.com>
2020-05-26 19:46                                   ` Borislav Petkov
2020-05-26 19:56                         ` [tip: ras/core] x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned tip-bot2 for Tony Luck
2020-05-19  5:04                   ` [PATCH] x86/mm: Don't try to change poison pages to uncacheable in a guest Sean Christopherson
