* Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon
From: Jue Wang @ 2022-05-25 20:16 UTC (permalink / raw)
  To: pizhenwei
  Cc: Andrew Morton, David Hildenbrand, jasowang, LKML, Linux MM, mst,
	HORIGUCHI NAOYA(堀口 直也),
	Paolo Bonzini, Peter Xu, qemu-devel, virtualization

Some points to consider:

The injected MCE has already _done_ the damage to the guest workload.
Recovering the guest's poisoned memory doesn't help with the guest
workload memory corruption / loss / interruption that has already
happened because of the injected MCEs.

The hypervisor _must_ emulate poison identified in guest physical
address space (it could have been transported from the source VM); this
is to prevent silent data corruption in the guest. With a paravirtual
approach like this patch series, the hypervisor can clear some of the
poisoned HVAs, knowing for certain that the guest OS has isolated the
poisoned page. I wonder how much value this provides to the guest if the
guest and its workload are _not_ in pressing need of the extra KB/MB
worth of memory.
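
To make the emulation step concrete, here is a minimal user-space sketch
of how a VMM might pick up host-side poison in guest RAM and forward it
to the guest. It only assumes the existing kernel behavior of reporting
poison to user space as SIGBUS with si_code BUS_MCEERR_AR/BUS_MCEERR_AO
and si_addr set to the affected HVA; hva_to_gpa() and inject_guest_mce()
are hypothetical placeholders for the VMM's own memory-map lookup and
virtual-MCE injection paths, not real QEMU APIs.

#define _GNU_SOURCE
#include <inttypes.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical placeholders for the VMM's own memory-map lookup and
 * virtual-MCE injection; a real VMM has its own versions of these. */
static uint64_t hva_to_gpa(void *hva)
{
        return (uint64_t)(uintptr_t)hva;        /* identity map, sketch only */
}

static void inject_guest_mce(uint64_t gpa)
{
        /* Not async-signal-safe; good enough for a sketch. */
        fprintf(stderr, "emulate poison at GPA 0x%" PRIx64 "\n", gpa);
}

static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
        /* The host MCE path reports poisoned guest RAM as SIGBUS. */
        if (si->si_code != BUS_MCEERR_AR && si->si_code != BUS_MCEERR_AO)
                return;
        inject_guest_mce(hva_to_gpa(si->si_addr));
}

int main(void)
{
        struct sigaction sa = {
                .sa_sigaction = sigbus_handler,
                .sa_flags = SA_SIGINFO,
        };

        sigaction(SIGBUS, &sa, NULL);
        pause();        /* a real VMM would run its vCPU threads here */
        return 0;
}

The point is only that the host keeps the page poisoned while the guest
receives an equivalent, emulated error at the matching GPA.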

Thanks,
-Jue


* Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon
From: Peter Xu @ 2022-05-26 18:37 UTC (permalink / raw)
  To: Jue Wang
  Cc: pizhenwei, Andrew Morton, David Hildenbrand, jasowang, LKML,
	Linux MM, mst, HORIGUCHI NAOYA(堀口 直也),
	Paolo Bonzini, qemu-devel, virtualization

On Wed, May 25, 2022 at 01:16:34PM -0700, Jue Wang wrote:
> The hypervisor _must_ emulate poisons identified in guest physical
> address space (could be transported from the source VM), this is to
> prevent silent data corruption in the guest. With a paravirtual
> approach like this patch series, the hypervisor can clear some of the
> poisoned HVAs knowing for certain that the guest OS has isolated the
> poisoned page. I wonder how much value it provides to the guest if the
> guest and workload are _not_ in a pressing need for the extra KB/MB
> worth of memory.

I'm curious about the same thing: how would unpoisoning help here?  The
reasoning behind it would be great material to mention in the next cover
letter.

Shouldn't we instead consider migrating serious workloads off a host as
soon as there's a sign of more severe hardware issues?

Thanks,

-- 
Peter Xu



* Re: Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon
From: zhenwei pi @ 2022-05-27  6:32 UTC (permalink / raw)
  To: Peter Xu, Jue Wang
  Cc: Andrew Morton, David Hildenbrand, jasowang, LKML, Linux MM, mst,
	HORIGUCHI NAOYA(堀口 直也),
	Paolo Bonzini, qemu-devel, virtualization

On 5/27/22 02:37, Peter Xu wrote:
> On Wed, May 25, 2022 at 01:16:34PM -0700, Jue Wang wrote:
>> The hypervisor _must_ emulate poisons identified in guest physical
>> address space (could be transported from the source VM), this is to
>> prevent silent data corruption in the guest. With a paravirtual
>> approach like this patch series, the hypervisor can clear some of the
>> poisoned HVAs knowing for certain that the guest OS has isolated the
>> poisoned page. I wonder how much value it provides to the guest if the
>> guest and workload are _not_ in a pressing need for the extra KB/MB
>> worth of memory.
> 
> I'm curious the same on how unpoisoning could help here.  The reasoning
> behind would be great material to be mentioned in the next cover letter.
> 
> Shouldn't we consider migrating serious workloads off the host already
> where there's a sign of more severe hardware issues, instead?
> 
> Thanks,
> 

I'm maintaining 1,000,000+ virtual machines. From my experience:
UEs are quite unusual and occur randomly, and I have not hit a UE storm
case in the past years. Memory also shows no obvious performance drop
after hitting a UE.

I have hit several CE storm cases where memory performance drops a lot,
but I can't find an obvious relationship between UEs and CEs.

So from my point of view, fixing the corrupted pages for a VM seems
good enough. And yes, unpoisoning several pages does not help
significantly, but it is still a chance to make virtualization better.

-- 
zhenwei pi


* Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon
From: David Hildenbrand @ 2022-05-30  7:41 UTC (permalink / raw)
  To: zhenwei pi, Peter Xu, Jue Wang
  Cc: Andrew Morton, jasowang, LKML, Linux MM, mst,
	HORIGUCHI NAOYA(堀口 直也),
	Paolo Bonzini, qemu-devel, virtualization

On 27.05.22 08:32, zhenwei pi wrote:
> On 5/27/22 02:37, Peter Xu wrote:
>> On Wed, May 25, 2022 at 01:16:34PM -0700, Jue Wang wrote:
>>> The hypervisor _must_ emulate poisons identified in guest physical
>>> address space (could be transported from the source VM), this is to
>>> prevent silent data corruption in the guest. With a paravirtual
>>> approach like this patch series, the hypervisor can clear some of the
>>> poisoned HVAs knowing for certain that the guest OS has isolated the
>>> poisoned page. I wonder how much value it provides to the guest if the
>>> guest and workload are _not_ in a pressing need for the extra KB/MB
>>> worth of memory.
>>
>> I'm curious the same on how unpoisoning could help here.  The reasoning
>> behind would be great material to be mentioned in the next cover letter.
>>
>> Shouldn't we consider migrating serious workloads off the host already
>> where there's a sign of more severe hardware issues, instead?
>>
>> Thanks,
>>
> 
> I'm maintaining 1000,000+ virtual machines, from my experience:
> UE is quite unusual and occurs randomly, and I did not hit UE storm case 
> in the past years. The memory also has no obvious performance drop after 
> hitting UE.
> 
> I hit several CE storm case, the performance memory drops a lot. But I 
> can't find obvious relationship between UE and CE.
> 
> So from the point of my view, to fix the corrupted page for VM seems 
> good enough. And yes, unpoisoning several pages does not help 
> significantly, but it is still a chance to make the virtualization better.
> 

I'm curious why we should care about resurrecting a handful of poisoned
pages in a VM. The cover letter doesn't touch on that.

IOW, I'm missing the motivation why we should add additional
code+complexity to unpoison pages at all.

If we're talking about individual 4k pages, it's certainly sub-optimal,
but does it matter in practice? I could understand if we're losing
megabytes of memory. But then, I assume the workload might be seriously
harmed either way already?


I assume that when talking about "memory performance drops a lot", you
imply that this patch set can mitigate that performance drop?

But why do you see a performance drop? Because we might lose some
possible THP candidates (in the host or the guest) and you want to plug
those holes? I assume you'll see a performance drop simply because
poisoning memory is expensive, including migrating pages around on CE.

If you have some numbers to share, especially before/after this change,
that would be great.

-- 
Thanks,

David / dhildenb



* Re: Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon
From: zhenwei pi @ 2022-05-30 11:33 UTC (permalink / raw)
  To: David Hildenbrand, Peter Xu, Jue Wang, Paolo Bonzini
  Cc: Andrew Morton, jasowang, LKML, Linux MM, mst,
	HORIGUCHI NAOYA(堀口 直也),
	qemu-devel, virtualization



On 5/30/22 15:41, David Hildenbrand wrote:
> On 27.05.22 08:32, zhenwei pi wrote:
>> On 5/27/22 02:37, Peter Xu wrote:
>>> On Wed, May 25, 2022 at 01:16:34PM -0700, Jue Wang wrote:
>>>> The hypervisor _must_ emulate poisons identified in guest physical
>>>> address space (could be transported from the source VM), this is to
>>>> prevent silent data corruption in the guest. With a paravirtual
>>>> approach like this patch series, the hypervisor can clear some of the
>>>> poisoned HVAs knowing for certain that the guest OS has isolated the
>>>> poisoned page. I wonder how much value it provides to the guest if the
>>>> guest and workload are _not_ in a pressing need for the extra KB/MB
>>>> worth of memory.
>>>
>>> I'm curious the same on how unpoisoning could help here.  The reasoning
>>> behind would be great material to be mentioned in the next cover letter.
>>>
>>> Shouldn't we consider migrating serious workloads off the host already
>>> where there's a sign of more severe hardware issues, instead?
>>>
>>> Thanks,
>>>
>>
>> I'm maintaining 1000,000+ virtual machines, from my experience:
>> UE is quite unusual and occurs randomly, and I did not hit UE storm case
>> in the past years. The memory also has no obvious performance drop after
>> hitting UE.
>>
>> I hit several CE storm case, the performance memory drops a lot. But I
>> can't find obvious relationship between UE and CE.
>>
>> So from the point of my view, to fix the corrupted page for VM seems
>> good enough. And yes, unpoisoning several pages does not help
>> significantly, but it is still a chance to make the virtualization better.
>>
> 
> I'm curious why we should care about resurrecting a handful of poisoned
> pages in a VM. The cover letter doesn't touch on that.
> 
> IOW, I'm missing the motivation why we should add additional
> code+complexity to unpoison pages at all.
> 
> If we're talking about individual 4k pages, it's certainly sub-optimal,
> but does it matter in practice? I could understand if we're losing
> megabytes of memory. But then, I assume the workload might be seriously
> harmed either way already?
> 

Yes, resurrecting a handful of poisoned pages does not help
significantly; in some ways it is just nice to have. :D

Consider a VM whose RAM is backed by 2M huge pages. Once an MCE (@HVAy
in [HVAx, HVAz)) occurs, the whole 2M ([HVAx, HVAz)) on the hypervisor
side becomes inaccessible, but the guest poisons only the 4K page (@GPAy
in [GPAx, GPAz)); it may later hit MCEs on the other 511 pages ([GPAx,
GPAz) except GPAy). This is the worst case, so I want to add
'__le32 corrupted_pages' to struct virtio_balloon_config, to be used in
the next step: reporting all 512 4K 'corrupted_pages' to the guest, so
the guest has a chance to isolate the other 511 pages ahead of time.
Since the guest actually loses 2M, fixing 512*4K seems to help
significantly.
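
To make the proposed layout concrete, here is a sketch of the config
space with the new field appended. The first four fields follow the
current include/uapi/linux/virtio_balloon.h; 'corrupted_pages' exists
only in this proposal and is not part of the merged virtio spec.

struct virtio_balloon_config {
        /* Number of pages host wants Guest to give up. */
        __le32 num_pages;
        /* Number of pages we've actually got in balloon. */
        __le32 actual;
        /* Free page hint command id, readonly by guest. */
        __le32 free_page_hint_cmd_id;
        /* Stores PAGE_POISON if page poisoning is in use. */
        __le32 poison_val;
        /*
         * Proposed: number of corrupted guest 4K pages the device wants
         * to report, e.g. 2M / 4K = 512 for a fully poisoned huge page.
         */
        __le32 corrupted_pages;
};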

> 
> I assume when talking about "the performance memory drops a lot", you
> imply that this patch set can mitigate that performance drop?
> 
> But why do you see a performance drop? Because we might lose some
> possible THP candidates (in the host or the guest) and you want to plug
> does holes? I assume you'll see a performance drop simply because
> poisoning memory is expensive, including migrating pages around on CE.
> 
> If you have some numbers to share, especially before/after this change,
> that would be great.
> 

The CE storm leads to two problems I have seen:
1. Memory bandwidth drops to 10%~20% of normal, and the CPU's cycles
per instruction increase a lot.
2. The THR interrupt (/proc/interrupts) fires frequently, and the CPU
has to spend a lot of time handling the IRQ.

But no corrupted page occurs in that case. Migrating the VM to another,
healthy host seems a good choice; this patch series does not handle the
CE storm case.

-- 
zhenwei pi


* Re: Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon
From: Peter Xu @ 2022-05-30 15:49 UTC (permalink / raw)
  To: zhenwei pi
  Cc: David Hildenbrand, Jue Wang, Paolo Bonzini, Andrew Morton,
	jasowang, LKML, Linux MM, mst,
	HORIGUCHI NAOYA(堀口 直也),
	qemu-devel, virtualization

On Mon, May 30, 2022 at 07:33:35PM +0800, zhenwei pi wrote:
> A VM uses RAM of 2M huge page. Once a MCE(@HVAy in [HVAx,HVAz)) occurs, the
> 2M([HVAx,HVAz)) of hypervisor becomes unaccessible, but the guest poisons 4K
> (@GPAy in [GPAx, GPAz)) only, it may hit another 511 MCE ([GPAx, GPAz)
> except GPAy). This is the worse case, so I want to add
>  '__le32 corrupted_pages' in struct virtio_balloon_config, it is used in the
> next step: reporting 512 * 4K 'corrupted_pages' to the guest, the guest has
> a chance to isolate the other 511 pages ahead of time. And the guest
> actually loses 2M, fixing 512*4K seems to help significantly.

It sounds hackish to teach a virtio device to assume a page will always
be poisoned at huge page granularity.  That's only a limitation of the
host kernel, not of virtio itself.

E.g. there's an upstream effort ongoing to enable double mapping of
hugetlbfs pages, so hugetlb pages can be mapped in 4k with it.  That
opens up the possibility of doing page poisoning of huge pages at 4k
granularity too.  When that's ready the assumption can go away, and that
does sound like a better approach to this problem.

> 
> > 
> > I assume when talking about "the performance memory drops a lot", you
> > imply that this patch set can mitigate that performance drop?
> > 
> > But why do you see a performance drop? Because we might lose some
> > possible THP candidates (in the host or the guest) and you want to plug
> > does holes? I assume you'll see a performance drop simply because
> > poisoning memory is expensive, including migrating pages around on CE.
> > 
> > If you have some numbers to share, especially before/after this change,
> > that would be great.
> > 
> 
> The CE storm leads 2 problems I have even seen:
> 1, the memory bandwidth slows down to 10%~20%, and the cycles per
> instruction of CPU increases a lot.
> 2, the THR (/proc/interrupts) interrupts frequently, the CPU has to use a
> lot time to handle IRQ.

I have no real knowledge of CMCI, but if 2) is true then I'm wondering
whether it's necessary to handle the interrupts that frequently.  When I
was reading the Intel CMCI vector handler I stumbled over this comment:

/*
 * The interrupt handler. This is called on every event.
 * Just call the poller directly to log any events.
 * This could in theory increase the threshold under high load,
 * but doesn't for now.
 */
static void intel_threshold_interrupt(void)

I think that matches what I was thinking.  I mean, for 2) I'm not sure
whether it can be seen as a CMCI problem, and whether it could
potentially be mitigated by adjusting the CMCI threshold dynamically.
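
For reference, the "increase the threshold under high load" idea could
look roughly like the sketch below, assuming the IA32_MCi_CTL2 layout
(threshold in bits 14:0, CMCI_EN in bit 30) and the existing MCI_CTL2_*
and MSR_IA32_MCx_CTL2 helpers in arch/x86. Nothing like this exists in
the current handler; it only illustrates the knob.

#include <asm/mce.h>
#include <asm/msr.h>

/*
 * Sketch only: raise the per-bank CMCI threshold so the bank raises an
 * interrupt only after more corrected errors have accumulated.  Would
 * have to run on the CPU that owns the bank.
 */
static void cmci_raise_threshold(int bank, u16 new_threshold)
{
        u64 val;

        rdmsrl(MSR_IA32_MCx_CTL2(bank), val);
        val &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
        val |= (new_threshold & MCI_CTL2_CMCI_THRESHOLD_MASK) | MCI_CTL2_CMCI_EN;
        wrmsrl(MSR_IA32_MCx_CTL2(bank), val);
}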

-- 
Peter Xu



* Re: Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon
From: Jue Wang @ 2022-05-31  4:08 UTC (permalink / raw)
  To: Peter Xu, zhenwei pi, David Hildenbrand
  Cc: Paolo Bonzini, Andrew Morton, jasowang, LKML, Linux MM, mst,
	HORIGUCHI NAOYA(堀口 直也),
	qemu-devel, virtualization

On Mon, May 30, 2022 at 8:49 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, May 30, 2022 at 07:33:35PM +0800, zhenwei pi wrote:
> > A VM uses RAM of 2M huge page. Once a MCE(@HVAy in [HVAx,HVAz)) occurs, the
> > 2M([HVAx,HVAz)) of hypervisor becomes unaccessible, but the guest poisons 4K
> > (@GPAy in [GPAx, GPAz)) only, it may hit another 511 MCE ([GPAx, GPAz)
> > except GPAy). This is the worse case, so I want to add
> >  '__le32 corrupted_pages' in struct virtio_balloon_config, it is used in the
> > next step: reporting 512 * 4K 'corrupted_pages' to the guest, the guest has
> > a chance to isolate the other 511 pages ahead of time. And the guest
> > actually loses 2M, fixing 512*4K seems to help significantly.
>
> It sounds hackish to teach a virtio device to assume one page will always
> be poisoned in huge page granule.  That's only a limitation to host kernel
> not virtio itself.
>
> E.g. there're upstream effort ongoing with enabling doublemap on hugetlbfs
> pages so hugetlb pages can be mapped in 4k with it.  It provides potential
> possibility to do page poisoning with huge pages in 4k too.  When that'll
> be ready the assumption can go away, and that does sound like a better
> approach towards this problem.

+1.

A hypervisor should always strive to minimize the guest memory loss.

The HugeTLB double-mapping enlightened memory poisoning behavior (only
poison 4K out of a 2MB huge page on the host, and 4K in the guest) is a
much better solution here. To be completely transparent, it's not
_strictly_ required to poison the page (whatever its granularity) on the
host side, as long as the following are true:

1. The hypervisor can emulate the _minimized_ (e.g., 4K) poison to the guest.
2. The host page with the UC error is "isolated" (could be PG_HWPOISON
or in some other way) and prevented from being reused by other
processes.

For #2, PG_HWPOISON and HugeTLB double mapping enlightened memory
poisoning is a good solution.

>
> >
> > >
> > > I assume when talking about "the performance memory drops a lot", you
> > > imply that this patch set can mitigate that performance drop?
> > >
> > > But why do you see a performance drop? Because we might lose some
> > > possible THP candidates (in the host or the guest) and you want to plug
> > > does holes? I assume you'll see a performance drop simply because
> > > poisoning memory is expensive, including migrating pages around on CE.
> > >
> > > If you have some numbers to share, especially before/after this change,
> > > that would be great.
> > >
> >
> > The CE storm leads 2 problems I have even seen:
> > 1, the memory bandwidth slows down to 10%~20%, and the cycles per
> > instruction of CPU increases a lot.
> > 2, the THR (/proc/interrupts) interrupts frequently, the CPU has to use a
> > lot time to handle IRQ.
>
> Totally no good knowledge on CMCI, but if 2) is true then I'm wondering
> whether it's necessary to handle the interrupts that frequently.  When I
> was reading the Intel CMCI vector handler I stumbled over this comment:
>
> /*
>  * The interrupt handler. This is called on every event.
>  * Just call the poller directly to log any events.
>  * This could in theory increase the threshold under high load,
>  * but doesn't for now.
>  */
> static void intel_threshold_interrupt(void)
>
> I think that matches with what I was thinking..  I mean for 2) not sure
> whether it can be seen as a CMCI problem and potentially can be optimized
> by adjust the cmci threshold dynamically.

The performance drop during a CE storm is caused by the extra cycles
spent on the ECC steps in the memory controller, not by CMCI handling.
This is observed in the Google fleet as well. A good solution is to
monitor the CE rate closely in user space via /dev/mcelog and migrate
all VMs off the host once the CE rate exceeds some threshold.

CMCI is a _background_ interrupt that is not handled in the process
execution context, and its handler is set up to switch to poll mode
(1 / 5 min) if more than roughly a dozen CEs are reported via CMCI per
second.
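
As an illustration of the monitoring idea, here is a small sketch that
watches a corrected-error counter and flags a potential storm. It reads
the EDAC sysfs counter as a stand-in for a real /dev/mcelog (or mcelog
daemon) consumer; the path and the threshold are assumptions for the
example, not recommendations.

#include <stdio.h>
#include <unistd.h>

#define CE_COUNT_PATH   "/sys/devices/system/edac/mc/mc0/ce_count"
#define CE_PER_MIN_MAX  100     /* example threshold only */

static long read_ce_count(void)
{
        long count = -1;
        FILE *f = fopen(CE_COUNT_PATH, "r");

        if (f) {
                if (fscanf(f, "%ld", &count) != 1)
                        count = -1;
                fclose(f);
        }
        return count;
}

int main(void)
{
        long prev = read_ce_count();

        for (;;) {
                sleep(60);
                long cur = read_ce_count();

                if (prev >= 0 && cur >= 0 && cur - prev > CE_PER_MIN_MAX)
                        printf("CE storm suspected (%ld CEs/min), "
                               "consider migrating VMs off this host\n",
                               cur - prev);
                prev = cur;
        }
        return 0;
}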
>
> --
> Peter Xu
>


* Re: Re: Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon
From: zhenwei pi @ 2022-06-01  2:17 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand,
	HORIGUCHI NAOYA(堀口 直也)
  Cc: Peter Xu, Jue Wang, Paolo Bonzini, jasowang, LKML, Linux MM, mst,
	qemu-devel, virtualization

On 5/31/22 12:08, Jue Wang wrote:
> On Mon, May 30, 2022 at 8:49 AM Peter Xu <peterx@redhat.com> wrote:
>>
>> On Mon, May 30, 2022 at 07:33:35PM +0800, zhenwei pi wrote:
>>> A VM uses RAM of 2M huge page. Once a MCE(@HVAy in [HVAx,HVAz)) occurs, the
>>> 2M([HVAx,HVAz)) of hypervisor becomes unaccessible, but the guest poisons 4K
>>> (@GPAy in [GPAx, GPAz)) only, it may hit another 511 MCE ([GPAx, GPAz)
>>> except GPAy). This is the worse case, so I want to add
>>>   '__le32 corrupted_pages' in struct virtio_balloon_config, it is used in the
>>> next step: reporting 512 * 4K 'corrupted_pages' to the guest, the guest has
>>> a chance to isolate the other 511 pages ahead of time. And the guest
>>> actually loses 2M, fixing 512*4K seems to help significantly.
>>
>> It sounds hackish to teach a virtio device to assume one page will always
>> be poisoned in huge page granule.  That's only a limitation to host kernel
>> not virtio itself.
>>
>> E.g. there're upstream effort ongoing with enabling doublemap on hugetlbfs
>> pages so hugetlb pages can be mapped in 4k with it.  It provides potential
>> possibility to do page poisoning with huge pages in 4k too.  When that'll
>> be ready the assumption can go away, and that does sound like a better
>> approach towards this problem.
> 
> +1.
> 
> A hypervisor should always strive to minimize the guest memory loss.
> 
> The HugeTLB double mapping enlightened memory poisoning behavior (only
> poison 4K out of a 2MB huge page and 4K in guest) is a much better
> solution here. To be completely transparent, it's not _strictly_
> required to poison the page (whatever the granularity it is) on the
> host side, as long as the following are true:
> 
> 1. A hypervisor can emulate the _minimized_ (e.g., 4K) the poison to the guest.
> 2. The host page with the UC error is "isolated" (could be PG_HWPOISON
> or in some other way) and prevented from being reused by other
> processes.
> 
> For #2, PG_HWPOISON and HugeTLB double mapping enlightened memory
> poisoning is a good solution.
> 
>>
>>>
>>>>
>>>> I assume when talking about "the performance memory drops a lot", you
>>>> imply that this patch set can mitigate that performance drop?
>>>>
>>>> But why do you see a performance drop? Because we might lose some
>>>> possible THP candidates (in the host or the guest) and you want to plug
>>>> does holes? I assume you'll see a performance drop simply because
>>>> poisoning memory is expensive, including migrating pages around on CE.
>>>>
>>>> If you have some numbers to share, especially before/after this change,
>>>> that would be great.
>>>>
>>>
>>> The CE storm leads 2 problems I have even seen:
>>> 1, the memory bandwidth slows down to 10%~20%, and the cycles per
>>> instruction of CPU increases a lot.
>>> 2, the THR (/proc/interrupts) interrupts frequently, the CPU has to use a
>>> lot time to handle IRQ.
>>
>> Totally no good knowledge on CMCI, but if 2) is true then I'm wondering
>> whether it's necessary to handle the interrupts that frequently.  When I
>> was reading the Intel CMCI vector handler I stumbled over this comment:
>>
>> /*
>>   * The interrupt handler. This is called on every event.
>>   * Just call the poller directly to log any events.
>>   * This could in theory increase the threshold under high load,
>>   * but doesn't for now.
>>   */
>> static void intel_threshold_interrupt(void)
>>
>> I think that matches with what I was thinking..  I mean for 2) not sure
>> whether it can be seen as a CMCI problem and potentially can be optimized
>> by adjust the cmci threshold dynamically.
> 
> The CE storm caused performance drop is caused by the extra cycles
> spent by the ECC steps in memory controller, not in CMCI handling.
> This is observed in the Google fleet as well. A good solution is to
> monitor the CE rate closely in user space via /dev/mcelog and migrate
> all VMs to another host once the CE rate exceeds some threshold.
> 
> CMCI is a _background_ interrupt that is not handled in the process
> execution context and its handler is setup to switch to poll (1 / 5
> min) mode if there are more than ~ a dozen CEs reported via CMCI per
> second.
>>
>> --
>> Peter Xu
>>

Hi, Andrew, David, Naoya

Following the suggestions, I'll drop the memory-failure improvement for
huge pages from this series.

Is it still worth recovering corrupted pages for the guest kernel? I'll
follow your decision.

-- 
zhenwei pi


* Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon
From: David Hildenbrand @ 2022-06-01  7:59 UTC (permalink / raw)
  To: zhenwei pi, Andrew Morton,
	HORIGUCHI NAOYA(堀口 直也)
  Cc: Peter Xu, Jue Wang, Paolo Bonzini, jasowang, LKML, Linux MM, mst,
	qemu-devel, virtualization

On 01.06.22 04:17, zhenwei pi wrote:
> On 5/31/22 12:08, Jue Wang wrote:
>> On Mon, May 30, 2022 at 8:49 AM Peter Xu <peterx@redhat.com> wrote:
>>>
>>> On Mon, May 30, 2022 at 07:33:35PM +0800, zhenwei pi wrote:
>>>> A VM uses RAM of 2M huge page. Once a MCE(@HVAy in [HVAx,HVAz)) occurs, the
>>>> 2M([HVAx,HVAz)) of hypervisor becomes unaccessible, but the guest poisons 4K
>>>> (@GPAy in [GPAx, GPAz)) only, it may hit another 511 MCE ([GPAx, GPAz)
>>>> except GPAy). This is the worse case, so I want to add
>>>>   '__le32 corrupted_pages' in struct virtio_balloon_config, it is used in the
>>>> next step: reporting 512 * 4K 'corrupted_pages' to the guest, the guest has
>>>> a chance to isolate the other 511 pages ahead of time. And the guest
>>>> actually loses 2M, fixing 512*4K seems to help significantly.
>>>
>>> It sounds hackish to teach a virtio device to assume one page will always
>>> be poisoned in huge page granule.  That's only a limitation to host kernel
>>> not virtio itself.
>>>
>>> E.g. there're upstream effort ongoing with enabling doublemap on hugetlbfs
>>> pages so hugetlb pages can be mapped in 4k with it.  It provides potential
>>> possibility to do page poisoning with huge pages in 4k too.  When that'll
>>> be ready the assumption can go away, and that does sound like a better
>>> approach towards this problem.
>>
>> +1.
>>
>> A hypervisor should always strive to minimize the guest memory loss.
>>
>> The HugeTLB double mapping enlightened memory poisoning behavior (only
>> poison 4K out of a 2MB huge page and 4K in guest) is a much better
>> solution here. To be completely transparent, it's not _strictly_
>> required to poison the page (whatever the granularity it is) on the
>> host side, as long as the following are true:
>>
>> 1. A hypervisor can emulate the _minimized_ (e.g., 4K) the poison to the guest.
>> 2. The host page with the UC error is "isolated" (could be PG_HWPOISON
>> or in some other way) and prevented from being reused by other
>> processes.
>>
>> For #2, PG_HWPOISON and HugeTLB double mapping enlightened memory
>> poisoning is a good solution.
>>
>>>
>>>>
>>>>>
>>>>> I assume when talking about "the performance memory drops a lot", you
>>>>> imply that this patch set can mitigate that performance drop?
>>>>>
>>>>> But why do you see a performance drop? Because we might lose some
>>>>> possible THP candidates (in the host or the guest) and you want to plug
>>>>> does holes? I assume you'll see a performance drop simply because
>>>>> poisoning memory is expensive, including migrating pages around on CE.
>>>>>
>>>>> If you have some numbers to share, especially before/after this change,
>>>>> that would be great.
>>>>>
>>>>
>>>> The CE storm leads 2 problems I have even seen:
>>>> 1, the memory bandwidth slows down to 10%~20%, and the cycles per
>>>> instruction of CPU increases a lot.
>>>> 2, the THR (/proc/interrupts) interrupts frequently, the CPU has to use a
>>>> lot time to handle IRQ.
>>>
>>> Totally no good knowledge on CMCI, but if 2) is true then I'm wondering
>>> whether it's necessary to handle the interrupts that frequently.  When I
>>> was reading the Intel CMCI vector handler I stumbled over this comment:
>>>
>>> /*
>>>   * The interrupt handler. This is called on every event.
>>>   * Just call the poller directly to log any events.
>>>   * This could in theory increase the threshold under high load,
>>>   * but doesn't for now.
>>>   */
>>> static void intel_threshold_interrupt(void)
>>>
>>> I think that matches with what I was thinking..  I mean for 2) not sure
>>> whether it can be seen as a CMCI problem and potentially can be optimized
>>> by adjust the cmci threshold dynamically.
>>
>> The CE storm caused performance drop is caused by the extra cycles
>> spent by the ECC steps in memory controller, not in CMCI handling.
>> This is observed in the Google fleet as well. A good solution is to
>> monitor the CE rate closely in user space via /dev/mcelog and migrate
>> all VMs to another host once the CE rate exceeds some threshold.
>>
>> CMCI is a _background_ interrupt that is not handled in the process
>> execution context and its handler is setup to switch to poll (1 / 5
>> min) mode if there are more than ~ a dozen CEs reported via CMCI per
>> second.
>>>
>>> --
>>> Peter Xu
>>>
> 
> Hi, Andrew, David, Naoya
> 
> According to the suggestions, I'd give up the improvement of memory 
> failure on huge page in this series.
> 
> Is it worth recovering corrupted pages for the guest kernel? I'd follow 
> your decision.

Well, as I said, I am not sure we really need/want this for a handful
of 4k poisoned pages in a VM. As I suspected, doing so might primarily
be interesting for some sort of de-fragmentation (allowing a higher-order
page to be placed at the affected PFNs again), not because of the slight
reduction of available memory. A simple VM reboot would get the job done
similarly well.

As the poisoning refcount code is already a bit shaky as I learned
recently in the context of memory offlining, I do wonder if we really
want to expose the unpoisoning code outside of debugfs (hwpoison) usage.

Interestingly, unpoison_memory() documents: "This is only done on the
software-level, so it only works for linux injected failures, not real
hardware failures" -- ehm?
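
For context, the debugfs usage referred to above looks roughly like the
sketch below: with CONFIG_HWPOISON_INJECT, user space writes a PFN to
/sys/kernel/debug/hwpoison/unpoison-pfn, which is what ends up calling
unpoison_memory() today. The path comes from the hwpoison-inject test
module; error handling is trimmed.

#include <stdio.h>
#include <stdlib.h>

/* Sketch: software-unpoison one PFN via the hwpoison debugfs interface. */
static int unpoison_pfn(unsigned long pfn)
{
        FILE *f = fopen("/sys/kernel/debug/hwpoison/unpoison-pfn", "w");

        if (!f)
                return -1;
        fprintf(f, "%#lx\n", pfn);
        fclose(f);
        return 0;
}

int main(int argc, char **argv)
{
        if (argc != 2)
                return 1;
        return unpoison_pfn(strtoul(argv[1], NULL, 0)) ? 1 : 0;
}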

-- 
Thanks,

David / dhildenb



* Re: Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon
From: zhenwei pi @ 2022-06-02  9:28 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton,
	HORIGUCHI NAOYA(堀口 直也)
  Cc: Peter Xu, Jue Wang, Paolo Bonzini, jasowang, LKML, Linux MM, mst,
	qemu-devel, virtualization

On 6/1/22 15:59, David Hildenbrand wrote:
> On 01.06.22 04:17, zhenwei pi wrote:
>> On 5/31/22 12:08, Jue Wang wrote:
>>> On Mon, May 30, 2022 at 8:49 AM Peter Xu <peterx@redhat.com> wrote:
>>>>
>>>> On Mon, May 30, 2022 at 07:33:35PM +0800, zhenwei pi wrote:
>>>>> A VM uses RAM of 2M huge page. Once a MCE(@HVAy in [HVAx,HVAz)) occurs, the
>>>>> 2M([HVAx,HVAz)) of hypervisor becomes unaccessible, but the guest poisons 4K
>>>>> (@GPAy in [GPAx, GPAz)) only, it may hit another 511 MCE ([GPAx, GPAz)
>>>>> except GPAy). This is the worse case, so I want to add
>>>>>    '__le32 corrupted_pages' in struct virtio_balloon_config, it is used in the
>>>>> next step: reporting 512 * 4K 'corrupted_pages' to the guest, the guest has
>>>>> a chance to isolate the other 511 pages ahead of time. And the guest
>>>>> actually loses 2M, fixing 512*4K seems to help significantly.
>>>>
>>>> It sounds hackish to teach a virtio device to assume one page will always
>>>> be poisoned in huge page granule.  That's only a limitation to host kernel
>>>> not virtio itself.
>>>>
>>>> E.g. there're upstream effort ongoing with enabling doublemap on hugetlbfs
>>>> pages so hugetlb pages can be mapped in 4k with it.  It provides potential
>>>> possibility to do page poisoning with huge pages in 4k too.  When that'll
>>>> be ready the assumption can go away, and that does sound like a better
>>>> approach towards this problem.
>>>
>>> +1.
>>>
>>> A hypervisor should always strive to minimize the guest memory loss.
>>>
>>> The HugeTLB double mapping enlightened memory poisoning behavior (only
>>> poison 4K out of a 2MB huge page and 4K in guest) is a much better
>>> solution here. To be completely transparent, it's not _strictly_
>>> required to poison the page (whatever the granularity it is) on the
>>> host side, as long as the following are true:
>>>
>>> 1. A hypervisor can emulate the _minimized_ (e.g., 4K) the poison to the guest.
>>> 2. The host page with the UC error is "isolated" (could be PG_HWPOISON
>>> or in some other way) and prevented from being reused by other
>>> processes.
>>>
>>> For #2, PG_HWPOISON and HugeTLB double mapping enlightened memory
>>> poisoning is a good solution.
>>>
>>>>
>>>>>
>>>>>>
>>>>>> I assume when talking about "the performance memory drops a lot", you
>>>>>> imply that this patch set can mitigate that performance drop?
>>>>>>
>>>>>> But why do you see a performance drop? Because we might lose some
>>>>>> possible THP candidates (in the host or the guest) and you want to plug
>>>>>> does holes? I assume you'll see a performance drop simply because
>>>>>> poisoning memory is expensive, including migrating pages around on CE.
>>>>>>
>>>>>> If you have some numbers to share, especially before/after this change,
>>>>>> that would be great.
>>>>>>
>>>>>
>>>>> The CE storm leads 2 problems I have even seen:
>>>>> 1, the memory bandwidth slows down to 10%~20%, and the cycles per
>>>>> instruction of CPU increases a lot.
>>>>> 2, the THR (/proc/interrupts) interrupts frequently, the CPU has to use a
>>>>> lot time to handle IRQ.
>>>>
>>>> Totally no good knowledge on CMCI, but if 2) is true then I'm wondering
>>>> whether it's necessary to handle the interrupts that frequently.  When I
>>>> was reading the Intel CMCI vector handler I stumbled over this comment:
>>>>
>>>> /*
>>>>    * The interrupt handler. This is called on every event.
>>>>    * Just call the poller directly to log any events.
>>>>    * This could in theory increase the threshold under high load,
>>>>    * but doesn't for now.
>>>>    */
>>>> static void intel_threshold_interrupt(void)
>>>>
>>>> I think that matches with what I was thinking..  I mean for 2) not sure
>>>> whether it can be seen as a CMCI problem and potentially can be optimized
>>>> by adjust the cmci threshold dynamically.
>>>
>>> The CE storm caused performance drop is caused by the extra cycles
>>> spent by the ECC steps in memory controller, not in CMCI handling.
>>> This is observed in the Google fleet as well. A good solution is to
>>> monitor the CE rate closely in user space via /dev/mcelog and migrate
>>> all VMs to another host once the CE rate exceeds some threshold.
>>>
>>> CMCI is a _background_ interrupt that is not handled in the process
>>> execution context and its handler is setup to switch to poll (1 / 5
>>> min) mode if there are more than ~ a dozen CEs reported via CMCI per
>>> second.
>>>>
>>>> --
>>>> Peter Xu
>>>>
>>
>> Hi, Andrew, David, Naoya
>>
>> According to the suggestions, I'd give up the improvement of memory
>> failure on huge page in this series.
>>
>> Is it worth recovering corrupted pages for the guest kernel? I'd follow
>> your decision.
> 
> Well, as I said, I am not sure if we really need/want this for a handful
> of 4k poisoned pages in a VM. As I suspected, doing so might primarily
> be interesting for some sort of de-fragmentation (allow again a higher
> order page to be placed at the affected PFNs), not because of the slight
> reduction of available memory. A simple VM reboot would get the job
> similarly done.
> 

Sure, let's drop this idea. Thanks to all for the suggestions.

Hi, Naoya,
It seems the memory failure notifier is not required at the moment, so I
won't push the next version of:
[PATCH 1/3] memory-failure: Introduce memory failure notifier
[PATCH 2/3] mm/memory-failure.c: support reset PTE during unpoison

Thank you for the review work!

> As the poisoning refcount code is already a bit shaky as I learned
> recently in the context of memory offlining, I do wonder if we really
> want to expose the unpoisoning code outside of debugfs (hwpoison) usage.
> 
> Interestingly, unpoison_memory() documents: "This is only done on the
> software-level, so it only works for linux injected failures, not real
> hardware failures" -- ehm?
> 

I guess unpoison_memory() was designed and tested for hwpoison-inject
only; I have no idea how to fix a memory failure on a real hardware
platform. I suppose this is the first time unpoison_memory() has been
needed for a hardware-level failure (via the balloon VQ).

-- 
zhenwei pi


* Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon
From: David Hildenbrand @ 2022-06-02  9:40 UTC (permalink / raw)
  To: zhenwei pi, Andrew Morton,
	HORIGUCHI NAOYA(堀口 直也)
  Cc: Peter Xu, Jue Wang, Paolo Bonzini, jasowang, LKML, Linux MM, mst,
	qemu-devel, virtualization

On 02.06.22 11:28, zhenwei pi wrote:
> On 6/1/22 15:59, David Hildenbrand wrote:
>> On 01.06.22 04:17, zhenwei pi wrote:
>>> On 5/31/22 12:08, Jue Wang wrote:
>>>> On Mon, May 30, 2022 at 8:49 AM Peter Xu <peterx@redhat.com> wrote:
>>>>>
>>>>> On Mon, May 30, 2022 at 07:33:35PM +0800, zhenwei pi wrote:
>>>>>> A VM uses RAM of 2M huge page. Once a MCE(@HVAy in [HVAx,HVAz)) occurs, the
>>>>>> 2M([HVAx,HVAz)) of hypervisor becomes unaccessible, but the guest poisons 4K
>>>>>> (@GPAy in [GPAx, GPAz)) only, it may hit another 511 MCE ([GPAx, GPAz)
>>>>>> except GPAy). This is the worse case, so I want to add
>>>>>>    '__le32 corrupted_pages' in struct virtio_balloon_config, it is used in the
>>>>>> next step: reporting 512 * 4K 'corrupted_pages' to the guest, the guest has
>>>>>> a chance to isolate the other 511 pages ahead of time. And the guest
>>>>>> actually loses 2M, fixing 512*4K seems to help significantly.
>>>>>
>>>>> It sounds hackish to teach a virtio device to assume one page will always
>>>>> be poisoned in huge page granule.  That's only a limitation to host kernel
>>>>> not virtio itself.
>>>>>
>>>>> E.g. there're upstream effort ongoing with enabling doublemap on hugetlbfs
>>>>> pages so hugetlb pages can be mapped in 4k with it.  It provides potential
>>>>> possibility to do page poisoning with huge pages in 4k too.  When that'll
>>>>> be ready the assumption can go away, and that does sound like a better
>>>>> approach towards this problem.
>>>>
>>>> +1.
>>>>
>>>> A hypervisor should always strive to minimize the guest memory loss.
>>>>
>>>> The HugeTLB double mapping enlightened memory poisoning behavior (only
>>>> poison 4K out of a 2MB huge page and 4K in guest) is a much better
>>>> solution here. To be completely transparent, it's not _strictly_
>>>> required to poison the page (whatever the granularity it is) on the
>>>> host side, as long as the following are true:
>>>>
>>>> 1. A hypervisor can emulate the _minimized_ (e.g., 4K) the poison to the guest.
>>>> 2. The host page with the UC error is "isolated" (could be PG_HWPOISON
>>>> or in some other way) and prevented from being reused by other
>>>> processes.
>>>>
>>>> For #2, PG_HWPOISON and HugeTLB double mapping enlightened memory
>>>> poisoning is a good solution.
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> I assume when talking about "the performance memory drops a lot", you
>>>>>>> imply that this patch set can mitigate that performance drop?
>>>>>>>
>>>>>>> But why do you see a performance drop? Because we might lose some
>>>>>>> possible THP candidates (in the host or the guest) and you want to plug
>>>>>>> does holes? I assume you'll see a performance drop simply because
>>>>>>> poisoning memory is expensive, including migrating pages around on CE.
>>>>>>>
>>>>>>> If you have some numbers to share, especially before/after this change,
>>>>>>> that would be great.
>>>>>>>
>>>>>>
>>>>>> The CE storm leads 2 problems I have even seen:
>>>>>> 1, the memory bandwidth slows down to 10%~20%, and the cycles per
>>>>>> instruction of CPU increases a lot.
>>>>>> 2, the THR (/proc/interrupts) interrupts frequently, the CPU has to use a
>>>>>> lot time to handle IRQ.
>>>>>
>>>>> Totally no good knowledge on CMCI, but if 2) is true then I'm wondering
>>>>> whether it's necessary to handle the interrupts that frequently.  When I
>>>>> was reading the Intel CMCI vector handler I stumbled over this comment:
>>>>>
>>>>> /*
>>>>>    * The interrupt handler. This is called on every event.
>>>>>    * Just call the poller directly to log any events.
>>>>>    * This could in theory increase the threshold under high load,
>>>>>    * but doesn't for now.
>>>>>    */
>>>>> static void intel_threshold_interrupt(void)
>>>>>
>>>>> I think that matches with what I was thinking..  I mean for 2) not sure
>>>>> whether it can be seen as a CMCI problem and potentially can be optimized
>>>>> by adjust the cmci threshold dynamically.
>>>>
>>>> The CE storm caused performance drop is caused by the extra cycles
>>>> spent by the ECC steps in memory controller, not in CMCI handling.
>>>> This is observed in the Google fleet as well. A good solution is to
>>>> monitor the CE rate closely in user space via /dev/mcelog and migrate
>>>> all VMs to another host once the CE rate exceeds some threshold.
>>>>
>>>> CMCI is a _background_ interrupt that is not handled in the process
>>>> execution context and its handler is setup to switch to poll (1 / 5
>>>> min) mode if there are more than ~ a dozen CEs reported via CMCI per
>>>> second.
>>>>>
>>>>> --
>>>>> Peter Xu
>>>>>
>>>
>>> Hi, Andrew, David, Naoya
>>>
>>> According to the suggestions, I'd give up the improvement of memory
>>> failure on huge page in this series.
>>>
>>> Is it worth recovering corrupted pages for the guest kernel? I'd follow
>>> your decision.
>>
>> Well, as I said, I am not sure if we really need/want this for a handful
>> of 4k poisoned pages in a VM. As I suspected, doing so might primarily
>> be interesting for some sort of de-fragmentation (allow again a higher
>> order page to be placed at the affected PFNs), not because of the slight
>> reduction of available memory. A simple VM reboot would get the job
>> similarly done.
>>
> 
> Sure, Let's drop this idea. Thanks to all for the suggestions.

Thanks for the interesting idea + discussions.

Just a note that if you believe that we want/need something like that,
and there is a reasonable use case, please tell us we're wrong and push
back :)

-- 
Thanks,

David / dhildenb

