kvm.vger.kernel.org archive mirror
* Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage
       [not found]             ` <e673f38a-9e5f-21f6-421b-b3cb4ff02e91@oracle.com>
@ 2019-05-28  9:49               ` Wanpeng Li
  2019-05-29 23:31                 ` Mike Kravetz
  0 siblings, 1 reply; 7+ messages in thread
From: Wanpeng Li @ 2019-05-28  9:49 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Michael Ellerman, Andrew Morton, Punit Agrawal, Naoya Horiguchi,
	linux-mm, Michal Hocko, Aneesh Kumar K.V, Anshuman Khandual,
	linux-kernel, Benjamin Herrenschmidt, linuxppc-dev, kvm,
	Paolo Bonzini, Xiao Guangrong, lidongchen, yongkaiwu

Cc Paolo,
Hi all,
On Wed, 14 Feb 2018 at 06:34, Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 02/12/2018 06:48 PM, Michael Ellerman wrote:
> > Andrew Morton <akpm@linux-foundation.org> writes:
> >
> >> On Thu, 08 Feb 2018 12:30:45 +0000 Punit Agrawal <punit.agrawal@arm.com> wrote:
> >>
> >>>>
> >>>> So I don't think that the above test result means that errors are properly
> >>>> handled, and the proposed patch should help for arm64.
> >>>
> >>> Although the deviation in pud_huge() avoids a kernel crash, the code
> >>> would be easier to maintain and reason about if the arm64 helpers were
> >>> consistent with the expectations of the core code.
> >>>
> >>> I'll look to update the arm64 helpers once this patch gets merged. But
> >>> it would be helpful if there were a clear expression of the semantics of
> >>> pud_huge() for the various cases. Is there any version that can be used
> >>> as a reference?
> >>
> >> Is that an ack or tested-by?
> >>
> >> Mike keeps plaintively asking the powerpc developers to take a look,
> >> but they remain steadfastly in hiding.
> >
> > Cc'ing linuxppc-dev is always a good idea :)
> >
>
> Thanks Michael,
>
> I was mostly concerned about use cases for soft/hard offline of huge pages
> larger than PMD_SIZE on powerpc.  I know that powerpc supports PGD_SIZE
> huge pages, and soft/hard offline support was specifically added for this.
> See commit 94310cbcaa3c "mm/madvise: enable (soft|hard) offline of HugeTLB
> pages at PGD level".
>
> This patch will disable that functionality.  So, at a minimum, this is a
> 'heads up'.  If there are actual use cases that depend on this, then more
> work/discussions will need to happen.  From the e-mail thread on PGD_SIZE
> support, I cannot tell whether there is a real use case or this is just a
> 'nice to have'.

1GB hugetlbfs pages are used by DPDK and by VMs in cloud deployments; we
have encountered gup_pud_range() panics several times in production
environments. Is there any plan to fix the arch code and re-enable this
handling?

In addition, see https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kvm/mmu.c#n3213
Guest memory can be mapped at 1GB/2MB/4KB granularity even though the
host backing consists of 1GB hugetlbfs pages. Once the above PUD panic is
fixed, try_to_unmap(), which is called on the MCA recovery path, will
install a PUD hwpoison entry. The guest will then vmexit and retry
endlessly when it accesses any memory backed by this poisoned 1GB
hugetlbfs page. We have a plan to split this 1GB hugetlbfs page into 2MB
hugetlbfs pages/4KB pages, perhaps by remapping the file over a virtual
address range at 2MB/4KB page granularity, and to likewise split the 1GB
SPTE in the KVM MMU into 2MB/4KB SPTEs, marking the offending SPTE with a
hwpoison flag; a SIGBUS will then be delivered to the VM at the next page
fault on the offending SPTE. Is this proposal acceptable?
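
To make the idea concrete, below is a rough sketch of the intended
fault-path behaviour. kvm_map_poisoned_1g() and its level-splitting logic
are hypothetical names invented purely for illustration; only
kvm_send_hwpoison_signal(), kvm_vcpu_gfn_to_hva(), RET_PF_RETRY,
KVM_PFN_ERR_HWPOISON and PT_PAGE_TABLE_LEVEL are existing mmu.c symbols:

/*
 * Hypothetical sketch only, not existing KVM code: when the backing
 * page is hwpoisoned, refuse to install a huge SPTE and retry the
 * fault at the next smaller level; once 4KB granularity is reached,
 * deliver SIGBUS for the precise guest address instead of letting
 * the guest retry forever.
 */
static int kvm_map_poisoned_1g(struct kvm_vcpu *vcpu, gfn_t gfn,
			       kvm_pfn_t pfn, int *levelp)
{
	if (pfn != KVM_PFN_ERR_HWPOISON)
		return 0;

	if (*levelp > PT_PAGE_TABLE_LEVEL) {
		/* Split: map this range at the next smaller level. */
		*levelp -= 1;
		return RET_PF_RETRY;
	}

	/* 4KB SPTE reached: only this one page stays unusable. */
	kvm_send_hwpoison_signal(kvm_vcpu_gfn_to_hva(vcpu, gfn), current);
	return RET_PF_RETRY;
}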

Regards,
Wanpeng Li


* Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage
  2019-05-28  9:49               ` [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage Wanpeng Li
@ 2019-05-29 23:31                 ` Mike Kravetz
  2019-06-10 23:50                   ` Naoya Horiguchi
  0 siblings, 1 reply; 7+ messages in thread
From: Mike Kravetz @ 2019-05-29 23:31 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Michael Ellerman, Andrew Morton, Punit Agrawal, Naoya Horiguchi,
	linux-mm, Michal Hocko, Aneesh Kumar K.V, Anshuman Khandual,
	linux-kernel, Benjamin Herrenschmidt, linuxppc-dev, kvm,
	Paolo Bonzini, Xiao Guangrong, lidongchen, yongkaiwu

On 5/28/19 2:49 AM, Wanpeng Li wrote:
> Cc Paolo,
> Hi all,
> On Wed, 14 Feb 2018 at 06:34, Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>> On 02/12/2018 06:48 PM, Michael Ellerman wrote:
>>> Andrew Morton <akpm@linux-foundation.org> writes:
>>>
>>>> On Thu, 08 Feb 2018 12:30:45 +0000 Punit Agrawal <punit.agrawal@arm.com> wrote:
>>>>
>>>>>>
>>>>>> So I don't think that the above test result means that errors are properly
>>>>>> handled, and the proposed patch should help for arm64.
>>>>>
>>>>> Although the deviation in pud_huge() avoids a kernel crash, the code
>>>>> would be easier to maintain and reason about if the arm64 helpers were
>>>>> consistent with the expectations of the core code.
>>>>>
>>>>> I'll look to update the arm64 helpers once this patch gets merged. But
>>>>> it would be helpful if there were a clear expression of the semantics of
>>>>> pud_huge() for the various cases. Is there any version that can be used
>>>>> as a reference?
>>>>
>>>> Is that an ack or tested-by?
>>>>
>>>> Mike keeps plaintively asking the powerpc developers to take a look,
>>>> but they remain steadfastly in hiding.
>>>
>>> Cc'ing linuxppc-dev is always a good idea :)
>>>
>>
>> Thanks Michael,
>>
>> I was mostly concerned about use cases for soft/hard offline of huge pages
>> larger than PMD_SIZE on powerpc.  I know that powerpc supports PGD_SIZE
>> huge pages, and soft/hard offline support was specifically added for this.
>> See commit 94310cbcaa3c "mm/madvise: enable (soft|hard) offline of HugeTLB
>> pages at PGD level".
>>
>> This patch will disable that functionality.  So, at a minimum, this is a
>> 'heads up'.  If there are actual use cases that depend on this, then more
>> work/discussions will need to happen.  From the e-mail thread on PGD_SIZE
>> support, I cannot tell whether there is a real use case or this is just a
>> 'nice to have'.
> 
> 1GB hugetlbfs pages are used by DPDK and by VMs in cloud deployments; we
> have encountered gup_pud_range() panics several times in production
> environments. Is there any plan to fix the arch code and re-enable this
> handling?

I too am aware of slightly more interest in 1G huge pages.  I suspect that
as Intel MMU capacity increases to handle more TLB entries, there will be
more and more interest.

Personally, I am not looking at this issue.  Perhaps Naoya will comment,
as he knows the most about this code.

> In addition, see https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kvm/mmu.c#n3213
> Guest memory can be mapped at 1GB/2MB/4KB granularity even though the
> host backing consists of 1GB hugetlbfs pages. Once the above PUD panic is
> fixed, try_to_unmap(), which is called on the MCA recovery path, will
> install a PUD hwpoison entry. The guest will then vmexit and retry
> endlessly when it accesses any memory backed by this poisoned 1GB
> hugetlbfs page. We have a plan to split this 1GB hugetlbfs page into 2MB
> hugetlbfs pages/4KB pages, perhaps by remapping the file over a virtual
> address range at 2MB/4KB page granularity, and to likewise split the 1GB
> SPTE in the KVM MMU into 2MB/4KB SPTEs, marking the offending SPTE with a
> hwpoison flag; a SIGBUS will then be delivered to the VM at the next page
> fault on the offending SPTE. Is this proposal acceptable?

I am not sure of the error handling design, but this does sound reasonable.
That block of code which potentially dissolves a huge page on memory error
is hard to understand and I'm not sure if that is even the 'normal'
functionality.  Certainly, we would hate to waste/poison an entire 1G page
for an error on a small subsection.
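
(For anyone reading along: 'dissolving' here means handing the free huge
page's 4KB subpages back to the buddy allocator so that only the single
bad subpage stays poisoned. A loose sketch with a hypothetical wrapper
name, not the actual mm/memory-failure.c logic:)

/* Illustration only; dissolve_on_error() is a hypothetical wrapper. */
static int dissolve_on_error(struct page *bad_page)
{
	struct page *head = compound_head(bad_page);

	if (!PageHuge(head))
		return -EINVAL;
	/* Only works on a free huge page; an in-use one must be unmapped first. */
	if (dissolve_free_huge_page(bad_page))
		return -EBUSY;
	/* Poison just the 4KB subpage, not the whole 1GB page. */
	SetPageHWPoison(bad_page);
	return 0;
}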

-- 
Mike Kravetz


* Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage
  2019-05-29 23:31                 ` Mike Kravetz
@ 2019-06-10 23:50                   ` Naoya Horiguchi
  2019-06-11  8:42                     ` Wanpeng Li
  2019-08-20  7:03                     ` Wanpeng Li
  0 siblings, 2 replies; 7+ messages in thread
From: Naoya Horiguchi @ 2019-06-10 23:50 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Wanpeng Li, Michael Ellerman, Andrew Morton, Punit Agrawal,
	linux-mm, Michal Hocko, Aneesh Kumar K.V, Anshuman Khandual,
	linux-kernel, Benjamin Herrenschmidt, linuxppc-dev, kvm,
	Paolo Bonzini, Xiao Guangrong, lidongchen, yongkaiwu

On Wed, May 29, 2019 at 04:31:01PM -0700, Mike Kravetz wrote:
> On 5/28/19 2:49 AM, Wanpeng Li wrote:
> > Cc Paolo,
> > Hi all,
> > On Wed, 14 Feb 2018 at 06:34, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >>
> >> On 02/12/2018 06:48 PM, Michael Ellerman wrote:
> >>> Andrew Morton <akpm@linux-foundation.org> writes:
> >>>
> >>>> On Thu, 08 Feb 2018 12:30:45 +0000 Punit Agrawal <punit.agrawal@arm.com> wrote:
> >>>>
> >>>>>>
> >>>>>> So I don't think that the above test result means that errors are properly
> >>>>>> handled, and the proposed patch should help for arm64.
> >>>>>
> >>>>> Although the deviation in pud_huge() avoids a kernel crash, the code
> >>>>> would be easier to maintain and reason about if the arm64 helpers were
> >>>>> consistent with the expectations of the core code.
> >>>>>
> >>>>> I'll look to update the arm64 helpers once this patch gets merged. But
> >>>>> it would be helpful if there were a clear expression of the semantics of
> >>>>> pud_huge() for the various cases. Is there any version that can be used
> >>>>> as a reference?
> >>>>
> >>>> Is that an ack or tested-by?
> >>>>
> >>>> Mike keeps plaintively asking the powerpc developers to take a look,
> >>>> but they remain steadfastly in hiding.
> >>>
> >>> Cc'ing linuxppc-dev is always a good idea :)
> >>>
> >>
> >> Thanks Michael,
> >>
> >> I was mostly concerned about use cases for soft/hard offline of huge pages
> >> larger than PMD_SIZE on powerpc.  I know that powerpc supports PGD_SIZE
> >> huge pages, and soft/hard offline support was specifically added for this.
> >> See commit 94310cbcaa3c "mm/madvise: enable (soft|hard) offline of HugeTLB
> >> pages at PGD level".
> >>
> >> This patch will disable that functionality.  So, at a minimum, this is a
> >> 'heads up'.  If there are actual use cases that depend on this, then more
> >> work/discussions will need to happen.  From the e-mail thread on PGD_SIZE
> >> support, I cannot tell whether there is a real use case or this is just a
> >> 'nice to have'.
> > 
> > 1GB hugetlbfs pages are used by DPDK and by VMs in cloud deployments; we
> > have encountered gup_pud_range() panics several times in production
> > environments. Is there any plan to fix the arch code and re-enable this
> > handling?
> 
> I too am aware of slightly more interest in 1G huge pages.  I suspect that
> as Intel MMU capacity increases to handle more TLB entries, there will be
> more and more interest.
>
> Personally, I am not looking at this issue.  Perhaps Naoya will comment,
> as he knows the most about this code.

Thanks for forwarding this to me. My feeling is that memory error
handling on 1GB hugepages is in demand as a real use case.

> 
> > In addition, see https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kvm/mmu.c#n3213
> > Guest memory can be mapped at 1GB/2MB/4KB granularity even though the
> > host backing consists of 1GB hugetlbfs pages. Once the above PUD panic is
> > fixed, try_to_unmap(), which is called on the MCA recovery path, will
> > install a PUD hwpoison entry. The guest will then vmexit and retry
> > endlessly when it accesses any memory backed by this poisoned 1GB
> > hugetlbfs page. We have a plan to split this 1GB hugetlbfs page into 2MB
> > hugetlbfs pages/4KB pages, perhaps by remapping the file over a virtual
> > address range at 2MB/4KB page granularity, and to likewise split the 1GB
> > SPTE in the KVM MMU into 2MB/4KB SPTEs, marking the offending SPTE with a
> > hwpoison flag; a SIGBUS will then be delivered to the VM at the next page
> > fault on the offending SPTE. Is this proposal acceptable?
> 
> I am not sure of the error handling design, but this does sound reasonable.

I agree that that's better.

> That block of code which potentially dissolves a huge page on memory error
> is hard to understand and I'm not sure if that is even the 'normal'
> functionality.  Certainly, we would hate to waste/poison an entire 1G page
> for an error on a small subsection.

Yes, that's not practical, so we first need to establish the code base
for 2MB hugetlb splitting and then extend it to 1GB.

Thanks,
Naoya Horiguchi


* Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage
  2019-06-10 23:50                   ` Naoya Horiguchi
@ 2019-06-11  8:42                     ` Wanpeng Li
  2019-08-20  7:03                     ` Wanpeng Li
  1 sibling, 0 replies; 7+ messages in thread
From: Wanpeng Li @ 2019-06-11  8:42 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Mike Kravetz, Michael Ellerman, Andrew Morton, Punit Agrawal,
	linux-mm, Michal Hocko, Aneesh Kumar K.V, Anshuman Khandual,
	linux-kernel, Benjamin Herrenschmidt, linuxppc-dev, kvm,
	Paolo Bonzini, Xiao Guangrong, lidongchen, yongkaiwu

On Tue, 11 Jun 2019 at 07:51, Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
>
> On Wed, May 29, 2019 at 04:31:01PM -0700, Mike Kravetz wrote:
> > On 5/28/19 2:49 AM, Wanpeng Li wrote:
> > > Cc Paolo,
> > > Hi all,
> > > On Wed, 14 Feb 2018 at 06:34, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > >>
> > >> On 02/12/2018 06:48 PM, Michael Ellerman wrote:
> > >>> Andrew Morton <akpm@linux-foundation.org> writes:
> > >>>
> > >>>> On Thu, 08 Feb 2018 12:30:45 +0000 Punit Agrawal <punit.agrawal@arm.com> wrote:
> > >>>>
> > >>>>>>
> > >>>>>> So I don't think that the above test result means that errors are properly
> > >>>>>> handled, and the proposed patch should help for arm64.
> > >>>>>
> > >>>>> Although the deviation in pud_huge() avoids a kernel crash, the code
> > >>>>> would be easier to maintain and reason about if the arm64 helpers were
> > >>>>> consistent with the expectations of the core code.
> > >>>>>
> > >>>>> I'll look to update the arm64 helpers once this patch gets merged. But
> > >>>>> it would be helpful if there were a clear expression of the semantics of
> > >>>>> pud_huge() for the various cases. Is there any version that can be used
> > >>>>> as a reference?
> > >>>>
> > >>>> Is that an ack or tested-by?
> > >>>>
> > >>>> Mike keeps plaintively asking the powerpc developers to take a look,
> > >>>> but they remain steadfastly in hiding.
> > >>>
> > >>> Cc'ing linuxppc-dev is always a good idea :)
> > >>>
> > >>
> > >> Thanks Michael,
> > >>
> > >> I was mostly concerned about use cases for soft/hard offline of huge pages
> > >> larger than PMD_SIZE on powerpc.  I know that powerpc supports PGD_SIZE
> > >> huge pages, and soft/hard offline support was specifically added for this.
> > >> See commit 94310cbcaa3c "mm/madvise: enable (soft|hard) offline of HugeTLB
> > >> pages at PGD level".
> > >>
> > >> This patch will disable that functionality.  So, at a minimum, this is a
> > >> 'heads up'.  If there are actual use cases that depend on this, then more
> > >> work/discussions will need to happen.  From the e-mail thread on PGD_SIZE
> > >> support, I cannot tell whether there is a real use case or this is just a
> > >> 'nice to have'.
> > >
> > > 1GB hugetlbfs pages are used by DPDK and by VMs in cloud deployments; we
> > > have encountered gup_pud_range() panics several times in production
> > > environments. Is there any plan to fix the arch code and re-enable this
> > > handling?
> >
> > I too am aware of slightly more interest in 1G huge pages.  I suspect that
> > as Intel MMU capacity increases to handle more TLB entries, there will be
> > more and more interest.
> >
> > Personally, I am not looking at this issue.  Perhaps Naoya will comment,
> > as he knows the most about this code.
>
> Thanks for forwarding this to me. My feeling is that memory error
> handling on 1GB hugepages is in demand as a real use case.
>
> >
> > > In addition, see https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kvm/mmu.c#n3213
> > > Guest memory can be mapped at 1GB/2MB/4KB granularity even though the
> > > host backing consists of 1GB hugetlbfs pages. Once the above PUD panic is
> > > fixed, try_to_unmap(), which is called on the MCA recovery path, will
> > > install a PUD hwpoison entry. The guest will then vmexit and retry
> > > endlessly when it accesses any memory backed by this poisoned 1GB
> > > hugetlbfs page. We have a plan to split this 1GB hugetlbfs page into 2MB
> > > hugetlbfs pages/4KB pages, perhaps by remapping the file over a virtual
> > > address range at 2MB/4KB page granularity, and to likewise split the 1GB
> > > SPTE in the KVM MMU into 2MB/4KB SPTEs, marking the offending SPTE with a
> > > hwpoison flag; a SIGBUS will then be delivered to the VM at the next page
> > > fault on the offending SPTE. Is this proposal acceptable?
> >
> > I am not sure of the error handling design, but this does sound reasonable.
>
> I agree that that's better.
>
> > That block of code which potentially dissolves a huge page on memory error
> > is hard to understand and I'm not sure if that is even the 'normal'
> > functionality.  Certainly, we would hate to waste/poison an entire 1G page
> > for an error on a small subsection.
>
> Yes, that's not practical, so we first need to establish the code base
> for 2MB hugetlb splitting and then extend it to 1GB.

I'm working on this, thanks for the inputs.

Regards,
Wanpeng Li


* Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage
  2019-06-10 23:50                   ` Naoya Horiguchi
  2019-06-11  8:42                     ` Wanpeng Li
@ 2019-08-20  7:03                     ` Wanpeng Li
  2019-08-21  5:39                       ` Naoya Horiguchi
  1 sibling, 1 reply; 7+ messages in thread
From: Wanpeng Li @ 2019-08-20  7:03 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Mike Kravetz, Michael Ellerman, Andrew Morton, Punit Agrawal,
	linux-mm, Michal Hocko, Aneesh Kumar K.V, Anshuman Khandual,
	linux-kernel, Benjamin Herrenschmidt, linuxppc-dev, kvm,
	Paolo Bonzini, Xiao Guangrong, lidongchen, yongkaiwu, Mel Gorman,
	Kirill A. Shutemov, Hansen, Dave, Hugh Dickins

Cc Mel Gorman, Kirill, Dave Hansen,
On Tue, 11 Jun 2019 at 07:51, Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
>
> On Wed, May 29, 2019 at 04:31:01PM -0700, Mike Kravetz wrote:
> > On 5/28/19 2:49 AM, Wanpeng Li wrote:
> > > Cc Paolo,
> > > Hi all,
> > > On Wed, 14 Feb 2018 at 06:34, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > >>
> > >> On 02/12/2018 06:48 PM, Michael Ellerman wrote:
> > >>> Andrew Morton <akpm@linux-foundation.org> writes:
> > >>>
> > >>>> On Thu, 08 Feb 2018 12:30:45 +0000 Punit Agrawal <punit.agrawal@arm.com> wrote:
> > >>>>
> > >>>>>>
> > >>>>>> So I don't think that the above test result means that errors are properly
> > >>>>>> handled, and the proposed patch should help for arm64.
> > >>>>>
> > >>>>> Although the deviation in pud_huge() avoids a kernel crash, the code
> > >>>>> would be easier to maintain and reason about if the arm64 helpers were
> > >>>>> consistent with the expectations of the core code.
> > >>>>>
> > >>>>> I'll look to update the arm64 helpers once this patch gets merged. But
> > >>>>> it would be helpful if there were a clear expression of the semantics of
> > >>>>> pud_huge() for the various cases. Is there any version that can be used
> > >>>>> as a reference?
> > >>>>
> > >>>> Is that an ack or tested-by?
> > >>>>
> > >>>> Mike keeps plaintively asking the powerpc developers to take a look,
> > >>>> but they remain steadfastly in hiding.
> > >>>
> > >>> Cc'ing linuxppc-dev is always a good idea :)
> > >>>
> > >>
> > >> Thanks Michael,
> > >>
> > >> I was mostly concerned about use cases for soft/hard offline of huge pages
> > >> larger than PMD_SIZE on powerpc.  I know that powerpc supports PGD_SIZE
> > >> huge pages, and soft/hard offline support was specifically added for this.
> > >> See commit 94310cbcaa3c "mm/madvise: enable (soft|hard) offline of HugeTLB
> > >> pages at PGD level".
> > >>
> > >> This patch will disable that functionality.  So, at a minimum, this is a
> > >> 'heads up'.  If there are actual use cases that depend on this, then more
> > >> work/discussions will need to happen.  From the e-mail thread on PGD_SIZE
> > >> support, I cannot tell whether there is a real use case or this is just a
> > >> 'nice to have'.
> > >
> > > 1GB hugetlbfs pages are used by DPDK and by VMs in cloud deployments; we
> > > have encountered gup_pud_range() panics several times in production
> > > environments. Is there any plan to fix the arch code and re-enable this
> > > handling?
> >
> > I too am aware of slightly more interest in 1G huge pages.  I suspect that
> > as Intel MMU capacity increases to handle more TLB entries, there will be
> > more and more interest.
> >
> > Personally, I am not looking at this issue.  Perhaps Naoya will comment,
> > as he knows the most about this code.
>
> Thanks for forwarding this to me. My feeling is that memory error
> handling on 1GB hugepages is in demand as a real use case.
>
> >
> > > In addition, see https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kvm/mmu.c#n3213
> > > Guest memory can be mapped at 1GB/2MB/4KB granularity even though the
> > > host backing consists of 1GB hugetlbfs pages. Once the above PUD panic is
> > > fixed, try_to_unmap(), which is called on the MCA recovery path, will
> > > install a PUD hwpoison entry. The guest will then vmexit and retry
> > > endlessly when it accesses any memory backed by this poisoned 1GB
> > > hugetlbfs page. We have a plan to split this 1GB hugetlbfs page into 2MB
> > > hugetlbfs pages/4KB pages, perhaps by remapping the file over a virtual
> > > address range at 2MB/4KB page granularity, and to likewise split the 1GB
> > > SPTE in the KVM MMU into 2MB/4KB SPTEs, marking the offending SPTE with a
> > > hwpoison flag; a SIGBUS will then be delivered to the VM at the next page
> > > fault on the offending SPTE. Is this proposal acceptable?
> >
> > I am not sure of the error handling design, but this does sound reasonable.
>
> I agree that that's better.
>
> > That block of code which potentially dissolves a huge page on memory error
> > is hard to understand and I'm not sure if that is even the 'normal'
> > functionality.  Certainly, we would hate to waste/poison an entire 1G page
> > for an error on a small subsection.
>
> Yes, that's not practical, so we first need to establish the code base
> for 2MB hugetlb splitting and then extend it to 1GB.

I found it is not easy to split. A mounted hugetlbfs filesystem is
associated with a single hugetlb page size, and remapping the file at
2MB/4KB granularity would break that. How about hard-offlining the 1GB
hugetlb page the way soft offline already works: replace the corrupted
1GB page with a new 1GB page through page migration. The
offending/corrupted area of the original 1GB page does not need to be
copied into the new page; the corresponding area of the new page can
stay all zeroes, just as it is cleared during hugetlb page fault, and
the other sub-pages of the original 1GB page can be freed to the buddy
system. A SIGBUS is sent to userspace with the offending/corrupted
virtual address and signal code, and userspace should take care of the
rest.
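
As a rough outline of the idea (every helper marked 'hypothetical' below
is an invented name; copy_highpage(), page_to_pfn(), PUD_SHIFT and
PAGE_SHIFT are real kernel symbols; this is a sketch, not a working
implementation):

/* Hypothetical sketch of hard offline via migration. */
static int hard_offline_1g_hugepage(struct page *old, unsigned long bad_pfn)
{
	/* hypothetical: allocates a zeroed 1GB hugetlb page */
	struct page *new = alloc_fresh_1g_hugepage();
	unsigned long i, nr = 1UL << (PUD_SHIFT - PAGE_SHIFT);

	if (!new)
		return -ENOMEM;

	for (i = 0; i < nr; i++) {
		/* Skip the corrupted area; the new subpage stays zeroed. */
		if (page_to_pfn(old) + i == bad_pfn)
			continue;
		copy_highpage(new + i, old + i);
	}

	migrate_mapping_to(new, old);              /* hypothetical */
	free_good_subpages_to_buddy(old, bad_pfn); /* hypothetical */
	send_sigbus_with_vaddr(bad_pfn);           /* hypothetical */
	return 0;
}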

Regards,
Wanpeng Li


* Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage
  2019-08-20  7:03                     ` Wanpeng Li
@ 2019-08-21  5:39                       ` Naoya Horiguchi
  2019-08-21  7:15                         ` Wanpeng Li
  0 siblings, 1 reply; 7+ messages in thread
From: Naoya Horiguchi @ 2019-08-21  5:39 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Mike Kravetz, Michael Ellerman, Andrew Morton, Punit Agrawal,
	linux-mm, Michal Hocko, Aneesh Kumar K.V, Anshuman Khandual,
	linux-kernel, Benjamin Herrenschmidt, linuxppc-dev, kvm,
	Paolo Bonzini, Xiao Guangrong, lidongchen, yongkaiwu, Mel Gorman,
	Kirill A. Shutemov, Hansen, Dave, Hugh Dickins

On Tue, Aug 20, 2019 at 03:03:55PM +0800, Wanpeng Li wrote:
> Cc Mel Gorman, Kirill, Dave Hansen,
> On Tue, 11 Jun 2019 at 07:51, Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
> >
> > On Wed, May 29, 2019 at 04:31:01PM -0700, Mike Kravetz wrote:
> > > On 5/28/19 2:49 AM, Wanpeng Li wrote:
> > > > Cc Paolo,
> > > > Hi all,
> > > > On Wed, 14 Feb 2018 at 06:34, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > > >>
> > > >> On 02/12/2018 06:48 PM, Michael Ellerman wrote:
> > > >>> Andrew Morton <akpm@linux-foundation.org> writes:
> > > >>>
> > > >>>> On Thu, 08 Feb 2018 12:30:45 +0000 Punit Agrawal <punit.agrawal@arm.com> wrote:
> > > >>>>
> > > >>>>>>
> > > >>>>>> So I don't think that the above test result means that errors are properly
> > > >>>>>> handled, and the proposed patch should help for arm64.
> > > >>>>>
> > > >>>>> Although the deviation in pud_huge() avoids a kernel crash, the code
> > > >>>>> would be easier to maintain and reason about if the arm64 helpers were
> > > >>>>> consistent with the expectations of the core code.
> > > >>>>>
> > > >>>>> I'll look to update the arm64 helpers once this patch gets merged. But
> > > >>>>> it would be helpful if there were a clear expression of the semantics of
> > > >>>>> pud_huge() for the various cases. Is there any version that can be used
> > > >>>>> as a reference?
> > > >>>>
> > > >>>> Is that an ack or tested-by?
> > > >>>>
> > > >>>> Mike keeps plaintively asking the powerpc developers to take a look,
> > > >>>> but they remain steadfastly in hiding.
> > > >>>
> > > >>> Cc'ing linuxppc-dev is always a good idea :)
> > > >>>
> > > >>
> > > >> Thanks Michael,
> > > >>
> > > >> I was mostly concerned about use cases for soft/hard offline of huge pages
> > > >> larger than PMD_SIZE on powerpc.  I know that powerpc supports PGD_SIZE
> > > >> huge pages, and soft/hard offline support was specifically added for this.
> > > >> See commit 94310cbcaa3c "mm/madvise: enable (soft|hard) offline of HugeTLB
> > > >> pages at PGD level".
> > > >>
> > > >> This patch will disable that functionality.  So, at a minimum, this is a
> > > >> 'heads up'.  If there are actual use cases that depend on this, then more
> > > >> work/discussions will need to happen.  From the e-mail thread on PGD_SIZE
> > > >> support, I cannot tell whether there is a real use case or this is just a
> > > >> 'nice to have'.
> > > >
> > > > 1GB hugetlbfs pages are used by DPDK and by VMs in cloud deployments; we
> > > > have encountered gup_pud_range() panics several times in production
> > > > environments. Is there any plan to fix the arch code and re-enable this
> > > > handling?
> > >
> > > I too am aware of slightly more interest in 1G huge pages.  I suspect that
> > > as Intel MMU capacity increases to handle more TLB entries, there will be
> > > more and more interest.
> > >
> > > Personally, I am not looking at this issue.  Perhaps Naoya will comment,
> > > as he knows the most about this code.
> >
> > Thanks for forwarding this to me. My feeling is that memory error
> > handling on 1GB hugepages is in demand as a real use case.
> >
> > >
> > > > In addition, see https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kvm/mmu.c#n3213
> > > > Guest memory can be mapped at 1GB/2MB/4KB granularity even though the
> > > > host backing consists of 1GB hugetlbfs pages. Once the above PUD panic is
> > > > fixed, try_to_unmap(), which is called on the MCA recovery path, will
> > > > install a PUD hwpoison entry. The guest will then vmexit and retry
> > > > endlessly when it accesses any memory backed by this poisoned 1GB
> > > > hugetlbfs page. We have a plan to split this 1GB hugetlbfs page into 2MB
> > > > hugetlbfs pages/4KB pages, perhaps by remapping the file over a virtual
> > > > address range at 2MB/4KB page granularity, and to likewise split the 1GB
> > > > SPTE in the KVM MMU into 2MB/4KB SPTEs, marking the offending SPTE with a
> > > > hwpoison flag; a SIGBUS will then be delivered to the VM at the next page
> > > > fault on the offending SPTE. Is this proposal acceptable?
> > >
> > > I am not sure of the error handling design, but this does sound reasonable.
> >
> > I agree that that's better.
> >
> > > That block of code which potentially dissolves a huge page on memory error
> > > is hard to understand and I'm not sure if that is even the 'normal'
> > > functionality.  Certainly, we would hate to waste/poison an entire 1G page
> > > for an error on a small subsection.
> >
> > Yes, that's not practical, so we first need to establish the code base
> > for 2MB hugetlb splitting and then extend it to 1GB.
> 
> I found it is not easy to split. A mounted hugetlbfs filesystem is
> associated with a single hugetlb page size, and remapping the file at
> 2MB/4KB granularity would break that. How about hard-offlining the 1GB
> hugetlb page the way soft offline already works: replace the corrupted
> 1GB page with a new 1GB page through page migration. The
> offending/corrupted area of the original 1GB page does not need to be
> copied into the new page; the corresponding area of the new page can
> stay all zeroes, just as it is cleared during hugetlb page fault, and
> the other sub-pages of the original 1GB page can be freed to the buddy
> system. A SIGBUS is sent to userspace with the offending/corrupted
> virtual address and signal code, and userspace should take care of the
> rest.

Splitting hugetlb is simply hard, IMHO. THP splitting took years of
effort by many great kernel developers, and I don't think doing similar
development for hugetlb is a good idea.  I thought of converting hugetlb
pages into THPs, but maybe that's not an easy task either.
The "hard offlining via soft offlining" approach sounds new and promising
to me. I guess we don't need a large patchset to do this. So, thanks for
the idea!

- Naoya


* Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage
  2019-08-21  5:39                       ` Naoya Horiguchi
@ 2019-08-21  7:15                         ` Wanpeng Li
  0 siblings, 0 replies; 7+ messages in thread
From: Wanpeng Li @ 2019-08-21  7:15 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Mike Kravetz, Michael Ellerman, Andrew Morton, Punit Agrawal,
	linux-mm, Michal Hocko, Aneesh Kumar K.V, Anshuman Khandual,
	linux-kernel, Benjamin Herrenschmidt, linuxppc-dev, kvm,
	Paolo Bonzini, Xiao Guangrong, lidongchen, yongkaiwu, Mel Gorman,
	Kirill A. Shutemov, Hansen, Dave, Hugh Dickins

On Wed, 21 Aug 2019 at 13:41, Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
>
> On Tue, Aug 20, 2019 at 03:03:55PM +0800, Wanpeng Li wrote:
> > Cc Mel Gorman, Kirill, Dave Hansen,
> > On Tue, 11 Jun 2019 at 07:51, Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
> > >
> > > On Wed, May 29, 2019 at 04:31:01PM -0700, Mike Kravetz wrote:
> > > > On 5/28/19 2:49 AM, Wanpeng Li wrote:
> > > > > Cc Paolo,
> > > > > Hi all,
> > > > > On Wed, 14 Feb 2018 at 06:34, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > > > >>
> > > > >> On 02/12/2018 06:48 PM, Michael Ellerman wrote:
> > > > >>> Andrew Morton <akpm@linux-foundation.org> writes:
> > > > >>>
> > > > >>>> On Thu, 08 Feb 2018 12:30:45 +0000 Punit Agrawal <punit.agrawal@arm.com> wrote:
> > > > >>>>
> > > > >>>>>>
> > > > >>>>>> So I don't think that the above test result means that errors are properly
> > > > >>>>>> handled, and the proposed patch should help for arm64.
> > > > >>>>>
> > > > >>>>> Although the deviation in pud_huge() avoids a kernel crash, the code
> > > > >>>>> would be easier to maintain and reason about if the arm64 helpers were
> > > > >>>>> consistent with the expectations of the core code.
> > > > >>>>>
> > > > >>>>> I'll look to update the arm64 helpers once this patch gets merged. But
> > > > >>>>> it would be helpful if there were a clear expression of the semantics of
> > > > >>>>> pud_huge() for the various cases. Is there any version that can be used
> > > > >>>>> as a reference?
> > > > >>>>
> > > > >>>> Is that an ack or tested-by?
> > > > >>>>
> > > > >>>> Mike keeps plaintively asking the powerpc developers to take a look,
> > > > >>>> but they remain steadfastly in hiding.
> > > > >>>
> > > > >>> Cc'ing linuxppc-dev is always a good idea :)
> > > > >>>
> > > > >>
> > > > >> Thanks Michael,
> > > > >>
> > > > >> I was mostly concerned about use cases for soft/hard offline of huge pages
> > > > >> larger than PMD_SIZE on powerpc.  I know that powerpc supports PGD_SIZE
> > > > >> huge pages, and soft/hard offline support was specifically added for this.
> > > > >> See commit 94310cbcaa3c "mm/madvise: enable (soft|hard) offline of HugeTLB
> > > > >> pages at PGD level".
> > > > >>
> > > > >> This patch will disable that functionality.  So, at a minimum, this is a
> > > > >> 'heads up'.  If there are actual use cases that depend on this, then more
> > > > >> work/discussions will need to happen.  From the e-mail thread on PGD_SIZE
> > > > >> support, I cannot tell whether there is a real use case or this is just a
> > > > >> 'nice to have'.
> > > > >
> > > > > 1GB hugetlbfs pages are used by DPDK and by VMs in cloud deployments; we
> > > > > have encountered gup_pud_range() panics several times in production
> > > > > environments. Is there any plan to fix the arch code and re-enable this
> > > > > handling?
> > > >
> > > > I too am aware of slightly more interest in 1G huge pages.  I suspect that
> > > > as Intel MMU capacity increases to handle more TLB entries, there will be
> > > > more and more interest.
> > > >
> > > > Personally, I am not looking at this issue.  Perhaps Naoya will comment,
> > > > as he knows the most about this code.
> > >
> > > Thanks for forwarding this to me. My feeling is that memory error
> > > handling on 1GB hugepages is in demand as a real use case.
> > >
> > > >
> > > > > In addition, see https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kvm/mmu.c#n3213
> > > > > Guest memory can be mapped at 1GB/2MB/4KB granularity even though the
> > > > > host backing consists of 1GB hugetlbfs pages. Once the above PUD panic is
> > > > > fixed, try_to_unmap(), which is called on the MCA recovery path, will
> > > > > install a PUD hwpoison entry. The guest will then vmexit and retry
> > > > > endlessly when it accesses any memory backed by this poisoned 1GB
> > > > > hugetlbfs page. We have a plan to split this 1GB hugetlbfs page into 2MB
> > > > > hugetlbfs pages/4KB pages, perhaps by remapping the file over a virtual
> > > > > address range at 2MB/4KB page granularity, and to likewise split the 1GB
> > > > > SPTE in the KVM MMU into 2MB/4KB SPTEs, marking the offending SPTE with a
> > > > > hwpoison flag; a SIGBUS will then be delivered to the VM at the next page
> > > > > fault on the offending SPTE. Is this proposal acceptable?
> > > >
> > > > I am not sure of the error handling design, but this does sound reasonable.
> > >
> > > I agree that that's better.
> > >
> > > > That block of code which potentially dissolves a huge page on memory error
> > > > is hard to understand and I'm not sure if that is even the 'normal'
> > > > functionality.  Certainly, we would hate to waste/poison an entire 1G page
> > > > for an error on a small subsection.
> > >
> > > Yes, that's not practical, so we first need to establish the code base
> > > for 2MB hugetlb splitting and then extend it to 1GB.
> >
> > I found it is not easy to split. A mounted hugetlbfs filesystem is
> > associated with a single hugetlb page size, and remapping the file at
> > 2MB/4KB granularity would break that. How about hard-offlining the 1GB
> > hugetlb page the way soft offline already works: replace the corrupted
> > 1GB page with a new 1GB page through page migration. The
> > offending/corrupted area of the original 1GB page does not need to be
> > copied into the new page; the corresponding area of the new page can
> > stay all zeroes, just as it is cleared during hugetlb page fault, and
> > the other sub-pages of the original 1GB page can be freed to the buddy
> > system. A SIGBUS is sent to userspace with the offending/corrupted
> > virtual address and signal code, and userspace should take care of the
> > rest.
>
> Splitting hugetlb is simply hard, IMHO. THP splitting took years of
> effort by many great kernel developers, and I don't think doing similar
> development for hugetlb is a good idea.  I thought of converting hugetlb
> pages into THPs, but maybe that's not an easy task either.
> The "hard offlining via soft offlining" approach sounds new and promising
> to me. I guess we don't need a large patchset to do this. So, thanks for
> the idea!

Good. I will wait a while and then start cooking the patches if there are
no objections.

Regards,
Wanpeng Li


Thread overview: 7+ messages
     [not found] <20180130013919.GA19959@hori1.linux.bs1.fc.nec.co.jp>
     [not found] ` <1517284444-18149-1-git-send-email-n-horiguchi@ah.jp.nec.com>
     [not found]   ` <87inbbjx2w.fsf@e105922-lin.cambridge.arm.com>
     [not found]     ` <20180207011455.GA15214@hori1.linux.bs1.fc.nec.co.jp>
     [not found]       ` <87fu6bfytm.fsf@e105922-lin.cambridge.arm.com>
     [not found]         ` <20180208121749.0ac09af2b5a143106f339f55@linux-foundation.org>
     [not found]           ` <87wozhvc49.fsf@concordia.ellerman.id.au>
     [not found]             ` <e673f38a-9e5f-21f6-421b-b3cb4ff02e91@oracle.com>
2019-05-28  9:49               ` [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage Wanpeng Li
2019-05-29 23:31                 ` Mike Kravetz
2019-06-10 23:50                   ` Naoya Horiguchi
2019-06-11  8:42                     ` Wanpeng Li
2019-08-20  7:03                     ` Wanpeng Li
2019-08-21  5:39                       ` Naoya Horiguchi
2019-08-21  7:15                         ` Wanpeng Li
