linux-api.vger.kernel.org archive mirror
* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
       [not found]           ` <20170524075043.GB3063@rapoport-lnx>
@ 2017-05-24  7:58             ` Vlastimil Babka
  2017-05-24 10:39               ` Mike Rapoport
  0 siblings, 1 reply; 40+ messages in thread
From: Vlastimil Babka @ 2017-05-24  7:58 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Michal Hocko, Kirill A. Shutemov, Andrew Morton, Arnd Bergmann,
	Kirill A. Shutemov, Andrea Arcangeli, Pavel Emelyanov, linux-mm,
	lkml, Linux API

On 05/24/2017 09:50 AM, Mike Rapoport wrote:
> On Mon, May 22, 2017 at 05:52:47PM +0200, Vlastimil Babka wrote:
>> On 05/22/2017 04:29 PM, Mike Rapoport wrote:
>>> On Mon, May 22, 2017 at 03:55:48PM +0200, Michal Hocko wrote:
>>>> On Mon 22-05-17 16:36:00, Mike Rapoport wrote:
>>>>> On Mon, May 22, 2017 at 02:42:43PM +0300, Kirill A. Shutemov wrote:
>>>>>> On Mon, May 22, 2017 at 09:12:42AM +0300, Mike Rapoport wrote:
>>>>>>> Currently applications can explicitly enable or disable THP for a memory
>>>>>>> region using MADV_HUGEPAGE or MADV_NOHUGEPAGE. However, once either of
>>>>>>> these advices is applied, the region will always have the
>>>>>>> VM_HUGEPAGE/VM_NOHUGEPAGE flag set in vma->vm_flags.
>>>>>>> MADV_CLR_HUGEPAGE resets both of these flags and allows managing THP in
>>>>>>> the region according to the system-wide settings.
>>>>>>
>>>>>> Seems reasonable. But could you describe a use-case where it's useful in
>>>>>> the real world?
>>>>>
>>>>> My use-case was a combination of pre- and post-copy migration of containers
>>>>> with CRIU.
>>>>> In this case we populate a part of a memory region with data that was saved
>>>>> during the pre-copy stage. Afterwards, the region is registered with
>>>>> userfaultfd and we expect to get page faults for the parts of the region
>>>>> that were not yet populated. However, khugepaged collapses the pages and
>>>>> the page faults we would expect do not occur.
>>>>
>>>> I am not sure I understand the problem. Do I get it right that
>>>> khugepaged will effectively corrupt the memory by collapsing a range
>>>> which is not yet fully populated? If yes, shouldn't that be fixed in
>>>> khugepaged rather than by adding yet another madvise command? Also, how do
>>>> you prevent races? (Say you set VM_NOHUGEPAGE, khugepaged is in the
>>>> middle of the operation, sees a collapsible vma, and you get the same
>>>> result.)
>>>
>>> Probably I didn't explain it too well.
>>>
>>> The range is intentionally not populated. When we combine pre- and
>>> post-copy for process migration, we create a memory pre-dump without stopping
>>> the process, then we freeze the process without dumping the pages it has
>>> dirtied between pre-dump and freeze, and then, during restore, we populate
>>> the dirtied pages using userfaultfd.
>>>
>>> When CRIU restores a process in such a scenario, it does something like:
>>>
>>> * mmap() memory region
>>> * fill in the pages that were collected during the pre-dump
>>> * do some other stuff
>>> * register memory region with userfaultfd
>>> * populate the missing memory on demand
>>>
>>> khugepaged collapses the pages in the partially populated regions before we
>>> have a chance to register these regions with userfaultfd, which would
>>> prevent the collapse.
>>>
>>> We could have used MADV_NOHUGEPAGE right after the mmap() call, and then
>>> there would be no race because there would be nothing for khugepaged to
>>> collapse at that point. But the problem is that we have no way to reset
>>> *HUGEPAGE flags after the memory restore is complete.
>>
>> Hmm, I wouldn't be so sure that this is indeed race-free. Can you check that
>> this scenario is indeed impossible?
>>
>> - you do the mmap
>> - khugepaged will choose the process' mm to scan
>> - khugepaged will get to the vma in question, it doesn't have
>> MADV_NOHUGEPAGE yet
>> - you set MADV_NOHUGEPAGE on the vma
>> - you start populating the vma
>> - khugepaged sees the vma is non-empty, collapses
>>
>> Unless I'm wrong, the racers will hold mmap_sem only for reading when
>> setting/checking MADV_NOHUGEPAGE? That might actually be considered a bug.
>>
>> However, can't you use prctl(PR_SET_THP_DISABLE) instead? "If arg2 has a
>> nonzero value, the flag is set, otherwise it is cleared." says the
>> manpage. Do it before the mmap and you avoid the race as well?
> 
> Unfortunately, prctl(PR_SET_THP_DISABLE) didn't help :(
> When I tried to use it, I ended up with VM_NOHUGEPAGE set on all VMAs
> created after the prctl(). This brings me back to the state where checkpoint-restore
> alters the application's vma->vm_flags although it shouldn't, and I do not see
> a way to fix it using the existing interfaces.

[CC linux-api, should have been done in the initial posting already]

Hm so the prctl does:

                if (arg2)
                        me->mm->def_flags |= VM_NOHUGEPAGE;
                else
                        me->mm->def_flags &= ~VM_NOHUGEPAGE;

That's a rather lazy implementation IMHO. Could we change it so that the flag
is stored elsewhere in the mm, and the code that decides whether to use
THP checks both the per-vma flag and the per-mm flag?
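
Roughly what I have in mind - just a sketch, with MMF_THP_DISABLE being a
made-up bit name:

                if (arg2)
                        set_bit(MMF_THP_DISABLE, &me->mm->flags);
                else
                        clear_bit(MMF_THP_DISABLE, &me->mm->flags);

and khugepaged / the fault path would then check it next to the per-vma
flags:

                if ((vma->vm_flags & VM_NOHUGEPAGE) ||
                    test_bit(MMF_THP_DISABLE, &vma->vm_mm->flags))
                        /* vma not eligible for THP */

That way clearing the prctl later wouldn't leave any permanent mark in
vma->vm_flags.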

> --
> Sincerely yours,
> Mike. 
> 


* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
  2017-05-24  7:58             ` [PATCH] mm: introduce MADV_CLR_HUGEPAGE Vlastimil Babka
@ 2017-05-24 10:39               ` Mike Rapoport
  2017-05-24 11:18                 ` Michal Hocko
  2017-05-24 11:31                 ` Vlastimil Babka
  0 siblings, 2 replies; 40+ messages in thread
From: Mike Rapoport @ 2017-05-24 10:39 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Kirill A. Shutemov, Andrew Morton, Arnd Bergmann,
	Kirill A. Shutemov, Andrea Arcangeli, Pavel Emelyanov, linux-mm,
	lkml, Linux API

On Wed, May 24, 2017 at 09:58:06AM +0200, Vlastimil Babka wrote:
> On 05/24/2017 09:50 AM, Mike Rapoport wrote:
> > On Mon, May 22, 2017 at 05:52:47PM +0200, Vlastimil Babka wrote:
> >> On 05/22/2017 04:29 PM, Mike Rapoport wrote:
> >>>
> >>> Probably I didn't explain it too well.
> >>>
> >>> The range is intentionally not populated. When we combine pre- and
> >>> post-copy for process migration, we create a memory pre-dump without stopping
> >>> the process, then we freeze the process without dumping the pages it has
> >>> dirtied between pre-dump and freeze, and then, during restore, we populate
> >>> the dirtied pages using userfaultfd.
> >>>
> >>> When CRIU restores a process in such a scenario, it does something like:
> >>>
> >>> * mmap() memory region
> >>> * fill in the pages that were collected during the pre-dump
> >>> * do some other stuff
> >>> * register memory region with userfaultfd
> >>> * populate the missing memory on demand
> >>>
> >>> khugepaged collapses the pages in the partially populated regions before we
> >>> have a chance to register these regions with userfaultfd, which would
> >>> prevent the collapse.
> >>>
> >>> We could have used MADV_NOHUGEPAGE right after the mmap() call, and then
> >>> there would be no race because there would be nothing for khugepaged to
> >>> collapse at that point. But the problem is that we have no way to reset
> >>> *HUGEPAGE flags after the memory restore is complete.
> >>
> >> Hmm, I wouldn't be so sure that this is indeed race-free. Can you check that
> >> this scenario is indeed impossible?
> >>
> >> - you do the mmap
> >> - khugepaged will choose the process' mm to scan
> >> - khugepaged will get to the vma in question, it doesn't have
> >> MADV_NOHUGEPAGE yet
> >> - you set MADV_NOHUGEPAGE on the vma
> >> - you start populating the vma
> >> - khugepaged sees the vma is non-empty, collapses
> >>
> >> Unless I'm wrong, the racers will hold mmap_sem only for reading when
> >> setting/checking MADV_NOHUGEPAGE? That might actually be considered a bug.
> >>
> >> However, can't you use prctl(PR_SET_THP_DISABLE) instead? "If arg2 has a
> >> nonzero value, the flag is set, otherwise it is cleared." says the
> >> manpage. Do it before the mmap and you avoid the race as well?
> > 
> > Unfortunately, prctl(PR_SET_THP_DISABLE) didn't help :(
> > When I tried to use it, I ended up with VM_NOHUGEPAGE set on all VMAs
> > created after the prctl(). This brings me back to the state where checkpoint-restore
> > alters the application's vma->vm_flags although it shouldn't, and I do not see
> > a way to fix it using the existing interfaces.
> 
> [CC linux-api, should have been done in the initial posting already]

Sorry, missed that.
 
> Hm so the prctl does:
> 
>                 if (arg2)
>                         me->mm->def_flags |= VM_NOHUGEPAGE;
>                 else
>                         me->mm->def_flags &= ~VM_NOHUGEPAGE;
> 
> That's rather lazy implementation IMHO. Could we change it so the flag
> is stored elsewhere in the mm, and the code that decides to (not) use
> THP will check both the per-vma flag and the per-mm flag?

I'm afraid I don't understand how that can help.
What we need is the ability to temporarily disable collapsing of pages in
VMAs that do not have the VM_*HUGEPAGE flags set, in such a way that after
we re-enable THP, the vma->vm_flags of those VMAs remain intact.
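
To illustrate, the restore sequence we would like to end up with is roughly
the following (a sketch only; MADV_CLR_HUGEPAGE is the value proposed by
this patch, 'len'/'predump_buf'/'predump_len' stand for whatever was
collected during the pre-dump, error handling omitted):

        #include <sys/mman.h>
        #include <string.h>

        char *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        madvise(addr, len, MADV_NOHUGEPAGE);     /* keep khugepaged away */
        memcpy(addr, predump_buf, predump_len);  /* fill the pre-dumped pages */
        /* ... register the region with userfaultfd ... */
        madvise(addr, len, MADV_CLR_HUGEPAGE);   /* drop VM_NOHUGEPAGE again */
        /* ... the remaining pages are served on demand via userfaultfd ... */

The last madvise() is the missing piece: it would return the vma to the
system-wide THP policy without leaving VM_NOHUGEPAGE behind.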

--
Sincerely yours,
Mike.


* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
  2017-05-24 10:39               ` Mike Rapoport
@ 2017-05-24 11:18                 ` Michal Hocko
  2017-05-24 14:25                   ` Pavel Emelyanov
  2017-05-24 14:27                   ` Mike Rapoport
  2017-05-24 11:31                 ` Vlastimil Babka
  1 sibling, 2 replies; 40+ messages in thread
From: Michal Hocko @ 2017-05-24 11:18 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Arnd Bergmann, Kirill A. Shutemov, Andrea Arcangeli,
	Pavel Emelyanov, linux-mm, lkml, Linux API

On Wed 24-05-17 13:39:48, Mike Rapoport wrote:
> On Wed, May 24, 2017 at 09:58:06AM +0200, Vlastimil Babka wrote:
> > On 05/24/2017 09:50 AM, Mike Rapoport wrote:
> > > On Mon, May 22, 2017 at 05:52:47PM +0200, Vlastimil Babka wrote:
> > >> On 05/22/2017 04:29 PM, Mike Rapoport wrote:
> > >>>
> > >>> Probably I didn't explain it too well.
> > >>>
> > >>> The range is intentionally not populated. When we combine pre- and
> > >>> post-copy for process migration, we create a memory pre-dump without stopping
> > >>> the process, then we freeze the process without dumping the pages it has
> > >>> dirtied between pre-dump and freeze, and then, during restore, we populate
> > >>> the dirtied pages using userfaultfd.
> > >>>
> > >>> When CRIU restores a process in such a scenario, it does something like:
> > >>>
> > >>> * mmap() memory region
> > >>> * fill in the pages that were collected during the pre-dump
> > >>> * do some other stuff
> > >>> * register memory region with userfaultfd
> > >>> * populate the missing memory on demand
> > >>>
> > >>> khugepaged collapses the pages in the partially populated regions before we
> > >>> have a chance to register these regions with userfaultfd, which would
> > >>> prevent the collapse.
> > >>>
> > >>> We could have used MADV_NOHUGEPAGE right after the mmap() call, and then
> > >>> there would be no race because there would be nothing for khugepaged to
> > >>> collapse at that point. But the problem is that we have no way to reset
> > >>> *HUGEPAGE flags after the memory restore is complete.
> > >>
> > >> Hmm, I wouldn't be so sure that this is indeed race-free. Can you check that
> > >> this scenario is indeed impossible?
> > >>
> > >> - you do the mmap
> > >> - khugepaged will choose the process' mm to scan
> > >> - khugepaged will get to the vma in question, it doesn't have
> > >> MADV_NOHUGEPAGE yet
> > >> - you set MADV_NOHUGEPAGE on the vma
> > >> - you start populating the vma
> > >> - khugepaged sees the vma is non-empty, collapses
> > >>
> > >> Unless I'm wrong, the racers will hold mmap_sem only for reading when
> > >> setting/checking MADV_NOHUGEPAGE? That might actually be considered a bug.
> > >>
> > >> However, can't you use prctl(PR_SET_THP_DISABLE) instead? "If arg2 has a
> > >> nonzero value, the flag is set, otherwise it is cleared." says the
> > >> manpage. Do it before the mmap and you avoid the race as well?
> > > 
> > > Unfortunately, prctl(PR_SET_THP_DISABLE) didn't help :(
> > > When I tried to use it, I ended up with VM_NOHUGEPAGE set on all VMAs
> > > created after the prctl(). This brings me back to the state where checkpoint-restore
> > > alters the application's vma->vm_flags although it shouldn't, and I do not see
> > > a way to fix it using the existing interfaces.
> > 
> > [CC linux-api, should have been done in the initial posting already]
> 
> Sorry, missed that.
>  
> > Hm so the prctl does:
> > 
> >                 if (arg2)
> >                         me->mm->def_flags |= VM_NOHUGEPAGE;
> >                 else
> >                         me->mm->def_flags &= ~VM_NOHUGEPAGE;
> > 
> > That's rather lazy implementation IMHO. Could we change it so the flag
> > is stored elsewhere in the mm, and the code that decides to (not) use
> > THP will check both the per-vma flag and the per-mm flag?
> 
> I'm afraid I don't understand how that can help.
> What we need is the ability to temporarily disable collapsing of pages in
> VMAs that do not have the VM_*HUGEPAGE flags set, in such a way that after
> we re-enable THP, the vma->vm_flags of those VMAs remain intact.

Why cannot khugepaged simply skip over all VMAs which have userfault
regions registered? This would sound like a less error prone approach to
me.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
  2017-05-24 10:39               ` Mike Rapoport
  2017-05-24 11:18                 ` Michal Hocko
@ 2017-05-24 11:31                 ` Vlastimil Babka
  2017-05-24 14:28                   ` Pavel Emelyanov
  1 sibling, 1 reply; 40+ messages in thread
From: Vlastimil Babka @ 2017-05-24 11:31 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Michal Hocko, Kirill A. Shutemov, Andrew Morton, Arnd Bergmann,
	Kirill A. Shutemov, Andrea Arcangeli, Pavel Emelyanov, linux-mm,
	lkml, Linux API

On 05/24/2017 12:39 PM, Mike Rapoport wrote:
>> Hm so the prctl does:
>>
>>                 if (arg2)
>>                         me->mm->def_flags |= VM_NOHUGEPAGE;
>>                 else
>>                         me->mm->def_flags &= ~VM_NOHUGEPAGE;
>>
>> That's rather lazy implementation IMHO. Could we change it so the flag
>> is stored elsewhere in the mm, and the code that decides to (not) use
>> THP will check both the per-vma flag and the per-mm flag?
> 
> I'm afraid I don't understand how that can help.
> What we need is the ability to temporarily disable collapsing of pages in
> VMAs that do not have the VM_*HUGEPAGE flags set, in such a way that after
> we re-enable THP, the vma->vm_flags of those VMAs remain intact.

That's what I'm saying - instead of implementing the prctl flag via
mm->def_flags (which gets permanently propagated to newly created vma's
but e.g. doesn't affect already existing ones), it would be setting a
flag somewhere in mm, which khugepaged (and page faults) would check in
addition to the per-vma flags.


> --
> Sincerely yours,
> Mike.
> 


* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
  2017-05-24 11:18                 ` Michal Hocko
@ 2017-05-24 14:25                   ` Pavel Emelyanov
  2017-05-24 14:27                   ` Mike Rapoport
  1 sibling, 0 replies; 40+ messages in thread
From: Pavel Emelyanov @ 2017-05-24 14:25 UTC (permalink / raw)
  To: Michal Hocko, Mike Rapoport
  Cc: Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Arnd Bergmann, Kirill A. Shutemov, Andrea Arcangeli, linux-mm,
	lkml, Linux API

On 05/24/2017 02:18 PM, Michal Hocko wrote:
> On Wed 24-05-17 13:39:48, Mike Rapoport wrote:
>> On Wed, May 24, 2017 at 09:58:06AM +0200, Vlastimil Babka wrote:
>>> On 05/24/2017 09:50 AM, Mike Rapoport wrote:
>>>> On Mon, May 22, 2017 at 05:52:47PM +0200, Vlastimil Babka wrote:
>>>>> On 05/22/2017 04:29 PM, Mike Rapoport wrote:
>>>>>>
>>>>>> Probably I didn't explain it too well.
>>>>>>
>>>>>> The range is intentionally not populated. When we combine pre- and
>>>>>> post-copy for process migration, we create a memory pre-dump without stopping
>>>>>> the process, then we freeze the process without dumping the pages it has
>>>>>> dirtied between pre-dump and freeze, and then, during restore, we populate
>>>>>> the dirtied pages using userfaultfd.
>>>>>>
>>>>>> When CRIU restores a process in such a scenario, it does something like:
>>>>>>
>>>>>> * mmap() memory region
>>>>>> * fill in the pages that were collected during the pre-dump
>>>>>> * do some other stuff
>>>>>> * register memory region with userfaultfd
>>>>>> * populate the missing memory on demand
>>>>>>
>>>>>> khugepaged collapses the pages in the partially populated regions before we
>>>>>> have a chance to register these regions with userfaultfd, which would
>>>>>> prevent the collapse.
>>>>>>
>>>>>> We could have used MADV_NOHUGEPAGE right after the mmap() call, and then
>>>>>> there would be no race because there would be nothing for khugepaged to
>>>>>> collapse at that point. But the problem is that we have no way to reset
>>>>>> *HUGEPAGE flags after the memory restore is complete.
>>>>>
>>>>> Hmm, I wouldn't be so sure that this is indeed race-free. Can you check that
>>>>> this scenario is indeed impossible?
>>>>>
>>>>> - you do the mmap
>>>>> - khugepaged will choose the process' mm to scan
>>>>> - khugepaged will get to the vma in question, it doesn't have
>>>>> MADV_NOHUGEPAGE yet
>>>>> - you set MADV_NOHUGEPAGE on the vma
>>>>> - you start populating the vma
>>>>> - khugepaged sees the vma is non-empty, collapses
>>>>>
>>>>> Unless I'm wrong, the racers will hold mmap_sem only for reading when
>>>>> setting/checking MADV_NOHUGEPAGE? That might actually be considered a bug.
>>>>>
>>>>> However, can't you use prctl(PR_SET_THP_DISABLE) instead? "If arg2 has a
>>>>> nonzero value, the flag is set, otherwise it is cleared." says the
>>>>> manpage. Do it before the mmap and you avoid the race as well?
>>>>
>>>> Unfortunately, prctl(PR_SET_THP_DISABLE) didn't help :(
>>>> When I tried to use it, I ended up with VM_NOHUGEPAGE set on all VMAs
>>>> created after the prctl(). This brings me back to the state where checkpoint-restore
>>>> alters the application's vma->vm_flags although it shouldn't, and I do not see
>>>> a way to fix it using the existing interfaces.
>>>
>>> [CC linux-api, should have been done in the initial posting already]
>>
>> Sorry, missed that.
>>  
>>> Hm so the prctl does:
>>>
>>>                 if (arg2)
>>>                         me->mm->def_flags |= VM_NOHUGEPAGE;
>>>                 else
>>>                         me->mm->def_flags &= ~VM_NOHUGEPAGE;
>>>
>>> That's rather lazy implementation IMHO. Could we change it so the flag
>>> is stored elsewhere in the mm, and the code that decides to (not) use
>>> THP will check both the per-vma flag and the per-mm flag?
>>
>> I'm afraid I don't understand how that can help.
>> What we need is the ability to temporarily disable collapsing of pages in
>> VMAs that do not have the VM_*HUGEPAGE flags set, in such a way that after
>> we re-enable THP, the vma->vm_flags of those VMAs remain intact.
> 
> Why cannot khugepaged simply skip over all VMAs which have userfault
> regions registered? This would sound like a less error prone approach to
> me.

It already does so. The problem is that there's a race window. We first populate the VMA
with pages, then register it with UFFD. Between these two actions khugepaged comes along
and generates a huge page out of the populated pages and holes. And the holes in question
are not, well, holes -- they are meant to be populated later via UFFD, but the
generated huge page prevents this from happening.

-- Pavel


* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
  2017-05-24 11:18                 ` Michal Hocko
  2017-05-24 14:25                   ` Pavel Emelyanov
@ 2017-05-24 14:27                   ` Mike Rapoport
  2017-05-24 15:22                     ` Andrea Arcangeli
  2017-05-30  7:44                     ` Michal Hocko
  1 sibling, 2 replies; 40+ messages in thread
From: Mike Rapoport @ 2017-05-24 14:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Arnd Bergmann, Kirill A. Shutemov, Andrea Arcangeli,
	Pavel Emelyanov, linux-mm, lkml, Linux API

On Wed, May 24, 2017 at 01:18:00PM +0200, Michal Hocko wrote:
> On Wed 24-05-17 13:39:48, Mike Rapoport wrote:
> > On Wed, May 24, 2017 at 09:58:06AM +0200, Vlastimil Babka wrote:
> > > On 05/24/2017 09:50 AM, Mike Rapoport wrote:
> > > > On Mon, May 22, 2017 at 05:52:47PM +0200, Vlastimil Babka wrote:
> > > >> On 05/22/2017 04:29 PM, Mike Rapoport wrote:
> > > >>>
> > > >>> Probably I didn't explain it too well.
> > > >>>
> > > >>> The range is intentionally not populated. When we combine pre- and
> > > >>> post-copy for process migration, we create a memory pre-dump without stopping
> > > >>> the process, then we freeze the process without dumping the pages it has
> > > >>> dirtied between pre-dump and freeze, and then, during restore, we populate
> > > >>> the dirtied pages using userfaultfd.
> > > >>>
> > > >>> When CRIU restores a process in such a scenario, it does something like:
> > > >>>
> > > >>> * mmap() memory region
> > > >>> * fill in the pages that were collected during the pre-dump
> > > >>> * do some other stuff
> > > >>> * register memory region with userfaultfd
> > > >>> * populate the missing memory on demand
> > > >>>
> > > >>> khugepaged collapses the pages in the partially populated regions before we
> > > >>> have a chance to register these regions with userfaultfd, which would
> > > >>> prevent the collapse.
> > > >>>
> > > >>> We could have used MADV_NOHUGEPAGE right after the mmap() call, and then
> > > >>> there would be no race because there would be nothing for khugepaged to
> > > >>> collapse at that point. But the problem is that we have no way to reset
> > > >>> *HUGEPAGE flags after the memory restore is complete.
> > > >>
> > > >> Hmm, I wouldn't be so sure that this is indeed race-free. Can you check that
> > > >> this scenario is indeed impossible?
> > > >>
> > > >> - you do the mmap
> > > >> - khugepaged will choose the process' mm to scan
> > > >> - khugepaged will get to the vma in question, it doesn't have
> > > >> MADV_NOHUGEPAGE yet
> > > >> - you set MADV_NOHUGEPAGE on the vma
> > > >> - you start populating the vma
> > > >> - khugepaged sees the vma is non-empty, collapses
> > > >>
> > > >> Unless I'm wrong, the racers will hold mmap_sem only for reading when
> > > >> setting/checking MADV_NOHUGEPAGE? That might actually be considered a bug.
> > > >>
> > > >> However, can't you use prctl(PR_SET_THP_DISABLE) instead? "If arg2 has a
> > > >> nonzero value, the flag is set, otherwise it is cleared." says the
> > > >> manpage. Do it before the mmap and you avoid the race as well?
> > > > 
> > > > Unfortunately, prctl(PR_SET_THP_DISABLE) didn't help :(
> > > > When I tried to use it, I ended up with VM_NOHUGEPAGE set on all VMAs
> > > > created after the prctl(). This brings me back to the state where checkpoint-restore
> > > > alters the application's vma->vm_flags although it shouldn't, and I do not see
> > > > a way to fix it using the existing interfaces.
> > > 
> > > [CC linux-api, should have been done in the initial posting already]
> > 
> > Sorry, missed that.
> >  
> > > Hm so the prctl does:
> > > 
> > >                 if (arg2)
> > >                         me->mm->def_flags |= VM_NOHUGEPAGE;
> > >                 else
> > >                         me->mm->def_flags &= ~VM_NOHUGEPAGE;
> > > 
> > > That's rather lazy implementation IMHO. Could we change it so the flag
> > > is stored elsewhere in the mm, and the code that decides to (not) use
> > > THP will check both the per-vma flag and the per-mm flag?
> > 
> > I'm afraid I don't understand how that can help.
> > What we need is the ability to temporarily disable collapsing of pages in
> > VMAs that do not have the VM_*HUGEPAGE flags set, in such a way that after
> > we re-enable THP, the vma->vm_flags of those VMAs remain intact.
> 
> Why cannot khugepaged simply skip over all VMAs which have userfault
> regions registered? This would sound like a less error prone approach to
> me.

khugepaged does skip over VMAs which have userfault registered. We could register the
regions with userfaultfd before populating them to avoid collapses in the
transition period. But then we would have to populate these regions with
UFFDIO_COPY, which adds quite a bit of overhead.
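
To make the overhead concrete: each page populated that way goes through an
ioctl() instead of a plain memcpy(), along these lines (a sketch; 'uffd',
'dst', 'src', 'page_size' and handle_error() are placeholders assumed to be
set up elsewhere):

        #include <linux/userfaultfd.h>
        #include <sys/ioctl.h>

        struct uffdio_copy copy = {
                .dst  = (unsigned long)dst,   /* page in the registered range */
                .src  = (unsigned long)src,   /* local buffer with the data */
                .len  = page_size,
                .mode = 0,
        };
        if (ioctl(uffd, UFFDIO_COPY, &copy) == -1)
                handle_error();   /* copy.copy reports progress on failure */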

> -- 
> Michal Hocko
> SUSE Labs
> 


* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
  2017-05-24 11:31                 ` Vlastimil Babka
@ 2017-05-24 14:28                   ` Pavel Emelyanov
  2017-05-24 14:54                     ` Vlastimil Babka
  0 siblings, 1 reply; 40+ messages in thread
From: Pavel Emelyanov @ 2017-05-24 14:28 UTC (permalink / raw)
  To: Vlastimil Babka, Mike Rapoport
  Cc: Michal Hocko, Kirill A. Shutemov, Andrew Morton, Arnd Bergmann,
	Kirill A. Shutemov, Andrea Arcangeli, linux-mm, lkml, Linux API

On 05/24/2017 02:31 PM, Vlastimil Babka wrote:
> On 05/24/2017 12:39 PM, Mike Rapoport wrote:
>>> Hm so the prctl does:
>>>
>>>                 if (arg2)
>>>                         me->mm->def_flags |= VM_NOHUGEPAGE;
>>>                 else
>>>                         me->mm->def_flags &= ~VM_NOHUGEPAGE;
>>>
>>> That's rather lazy implementation IMHO. Could we change it so the flag
>>> is stored elsewhere in the mm, and the code that decides to (not) use
>>> THP will check both the per-vma flag and the per-mm flag?
>>
>> I'm afraid I don't understand how that can help.
>> What we need is the ability to temporarily disable collapsing of pages in
>> VMAs that do not have the VM_*HUGEPAGE flags set, in such a way that after
>> we re-enable THP, the vma->vm_flags of those VMAs remain intact.
> 
> That's what I'm saying - instead of implementing the prctl flag via
> mm->def_flags (which gets permanently propagated to newly created vma's
> but e.g. doesn't affect already existing ones), it would be setting a
> flag somewhere in mm, which khugepaged (and page faults) would check in
> addition to the per-vma flags.

I do not insist, but this would make the existing paths (checking the flags)
twice as slow -- from now on they would need to check two bits (vma flags
and mm flags) which are 100% in different cache lines.

What Mike is proposing is a way to fine-tune the existing vma flags. This
would keep the current paths as fast (or slow ;) ) as they are now. All the
complexity would go to the rare cases where someone needs to turn THP off for a
while and then turn it back on.

-- Pavel


* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
  2017-05-24 14:28                   ` Pavel Emelyanov
@ 2017-05-24 14:54                     ` Vlastimil Babka
  2017-05-24 15:13                       ` Mike Rapoport
  0 siblings, 1 reply; 40+ messages in thread
From: Vlastimil Babka @ 2017-05-24 14:54 UTC (permalink / raw)
  To: Pavel Emelyanov, Mike Rapoport
  Cc: Michal Hocko, Kirill A. Shutemov, Andrew Morton, Arnd Bergmann,
	Kirill A. Shutemov, Andrea Arcangeli, linux-mm, lkml, Linux API

On 05/24/2017 04:28 PM, Pavel Emelyanov wrote:
> On 05/24/2017 02:31 PM, Vlastimil Babka wrote:
>> On 05/24/2017 12:39 PM, Mike Rapoport wrote:
>>>> Hm so the prctl does:
>>>>
>>>>                 if (arg2)
>>>>                         me->mm->def_flags |= VM_NOHUGEPAGE;
>>>>                 else
>>>>                         me->mm->def_flags &= ~VM_NOHUGEPAGE;
>>>>
>>>> That's rather lazy implementation IMHO. Could we change it so the flag
>>>> is stored elsewhere in the mm, and the code that decides to (not) use
>>>> THP will check both the per-vma flag and the per-mm flag?
>>>
> >>> I'm afraid I don't understand how that can help.
> >>> What we need is the ability to temporarily disable collapsing of pages in
> >>> VMAs that do not have the VM_*HUGEPAGE flags set, in such a way that after
> >>> we re-enable THP, the vma->vm_flags of those VMAs remain intact.
>>
>> That's what I'm saying - instead of implementing the prctl flag via
>> mm->def_flags (which gets permanently propagated to newly created vma's
>> but e.g. doesn't affect already existing ones), it would be setting a
>> flag somewhere in mm, which khugepaged (and page faults) would check in
>> addition to the per-vma flags.
> 
> I do not insist, but this would make existing paths (checking for flags) be 
> 2 times slower -- from now on these would need to check two bits (vma flags
> and mm flags) which are 100% in different cache lines.

I'd expect you already have the mm struct cached during a page fault. And
a THP-eligible page fault happens just once per pmd, so the overhead should be
practically zero.

> What Mike is proposing is the way to fine-tune the existing vma flags. This
> would keep current paths as fast (or slow ;) ) as they are now. All the
> complexity would go to rare cases when someone needs to turn thp off for a
> while and then turn it back on.

Yeah but it's extending user-space API for a corner case. We should do
that only when there's no other option.

> -- Pavel
> 


* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
  2017-05-24 14:54                     ` Vlastimil Babka
@ 2017-05-24 15:13                       ` Mike Rapoport
  0 siblings, 0 replies; 40+ messages in thread
From: Mike Rapoport @ 2017-05-24 15:13 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Pavel Emelyanov, Michal Hocko, Kirill A. Shutemov, Andrew Morton,
	Arnd Bergmann, Kirill A. Shutemov, Andrea Arcangeli, linux-mm,
	lkml, Linux API

On Wed, May 24, 2017 at 04:54:38PM +0200, Vlastimil Babka wrote:
> On 05/24/2017 04:28 PM, Pavel Emelyanov wrote:
> > On 05/24/2017 02:31 PM, Vlastimil Babka wrote:
> >> On 05/24/2017 12:39 PM, Mike Rapoport wrote:
> >>>> Hm so the prctl does:
> >>>>
> >>>>                 if (arg2)
> >>>>                         me->mm->def_flags |= VM_NOHUGEPAGE;
> >>>>                 else
> >>>>                         me->mm->def_flags &= ~VM_NOHUGEPAGE;
> >>>>
> >>>> That's rather lazy implementation IMHO. Could we change it so the flag
> >>>> is stored elsewhere in the mm, and the code that decides to (not) use
> >>>> THP will check both the per-vma flag and the per-mm flag?
> >>>
> > >>> I'm afraid I don't understand how that can help.
> > >>> What we need is the ability to temporarily disable collapsing of pages in
> > >>> VMAs that do not have the VM_*HUGEPAGE flags set, in such a way that after
> > >>> we re-enable THP, the vma->vm_flags of those VMAs remain intact.
> >>
> >> That's what I'm saying - instead of implementing the prctl flag via
> >> mm->def_flags (which gets permanently propagated to newly created vma's
> >> but e.g. doesn't affect already existing ones), it would be setting a
> >> flag somewhere in mm, which khugepaged (and page faults) would check in
> >> addition to the per-vma flags.
> > 
> > I do not insist, but this would make existing paths (checking for flags) be 
> > 2 times slower -- from now on these would need to check two bits (vma flags
> > and mm flags) which are 100% in different cache lines.
> 
> I'd expect you already have mm struct cached during a page fault. And
> THP-eligible page fault is just one per pmd, the overhead should be
> practically zero.
> 
> > What Mike is proposing is the way to fine-tune the existing vma flags. This
> > would keep current paths as fast (or slow ;) ) as they are now. All the
> > complexity would go to rare cases when someone needs to turn thp off for a
> > while and then turn it back on.
> 
> Yeah but it's extending user-space API for a corner case. We should do
> that only when there's no other option.

With madvise(), I'm suggesting we rather add "completeness" to the existing
API, IMHO. We do have an API that sets VM_HUGEPAGE and clears VM_NOHUGEPAGE or
vice versa, but we do not have an API that can clear both flags...

And if we used prctl(), we would either change user-visible behaviour or we
would still need to extend the API and use, say, arg2 to distinguish between the
current behaviour and the new one.

--
Sincerely yours,
Mike. 


* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
  2017-05-24 14:27                   ` Mike Rapoport
@ 2017-05-24 15:22                     ` Andrea Arcangeli
  2017-05-30  7:44                     ` Michal Hocko
  1 sibling, 0 replies; 40+ messages in thread
From: Andrea Arcangeli @ 2017-05-24 15:22 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Michal Hocko, Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Arnd Bergmann, Kirill A. Shutemov, Pavel Emelyanov, linux-mm,
	lkml, Linux API

Hello,

On Wed, May 24, 2017 at 05:27:36PM +0300, Mike Rapoport wrote:
> khugepaged does skip over VMAs which have userfault. We could register the
> regions with userfault before populating them to avoid collapses in the
> transition period. But then we'll have to populate these regions with
> UFFDIO_COPY which adds quite an overhead.

Yes, in fact with postcopy-only mode, there's no issue because of the
above.

The case where THP has to be temporarily disabled by CRIU is before
postcopy/userfaults engage, i.e. during the precopy phase of a
precopy+postcopy mode.

QEMU's preferred mode is to do one pass of precopy before starting
postcopy/userfaults. During the QEMU precopy phase VM_HUGEPAGE is set for
maximum performance and to back as many read-only (i.e. not
source-redirtied) pages as possible with THP on the destination. The dirty
logging in the source happens at 4k granularity by forcing the KVM
shadow MMU to map all pages at 4k granularity and by tracking the
dirty bit in software for the updates happening through the primary
MMU (the Linux pagetable dirty bits are ignored because soft dirty would
be too slow, with O(N) complexity where N is linear in the size of
the VM, not in the number of re-dirtied pages in a precopy
pass). After that we track which 4k pages aren't uptodate on the
destination and we zap them at 4k granularity with MADV_DONTNEED (we
badly need madvisev in fact, to reduce the totally unnecessary flood of
4k wide MADV_DONTNEED calls there). So before issuing the MADV_DONTNEED
flood, QEMU sets VM_NOHUGEPAGE, and after calling UFFDIO_REGISTER QEMU
sets VM_HUGEPAGE back (as the UFFDIO registration will keep khugepaged
at bay until postcopy completes). QEMU then finally calls
UFFDIO_UNREGISTER and khugepaged starts collapsing everything that was
migrated through 4k wide userfaults.

CRIU doesn't attempt to populate the destination with THP at all, to keep
things simpler, but the problem is similar. It still has to set
VM_NOHUGEPAGE somehow during precopy (i.e. during the whole precopy
phase), precisely to avoid having to call MADV_DONTNEED to zap
4k not-uptodate fragments.

QEMU gets away with setting VM_NOHUGEPAGE and then switching back to VM_HUGEPAGE
without any issue because it's cooperative. CRIU, by contrast, has to
restore the same vm_flags that the vma had on the source to avoid
changing the behavior of the app after precopy+postcopy
completes. This is where the need to clear the VM_*HUGEPAGE bits
from vm_flags comes into play.
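
Spelled out as code, the destination-side sequence described above is
roughly the following (a sketch only; 'start' is a char *, 'reg' is the
struct uffdio_register used for the range, and not_uptodate() is a
stand-in for the received-pages tracking):

        madvise(start, len, MADV_HUGEPAGE);     /* precopy: back with THP */
        /* ... receive the precopy pass ... */
        madvise(start, len, MADV_NOHUGEPAGE);   /* stop khugepaged before zapping */
        for (i = 0; i < len / 4096; i++)
                if (not_uptodate(i))
                        madvise(start + i * 4096, 4096, MADV_DONTNEED);
        ioctl(uffd, UFFDIO_REGISTER, &reg);     /* khugepaged now stays away */
        madvise(start, len, MADV_HUGEPAGE);     /* safe to set it back */
        /* ... postcopy: missing pages arrive via UFFDIO_COPY ... */
        ioctl(uffd, UFFDIO_UNREGISTER, &reg.range);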

Thanks,
Andrea


* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
  2017-05-24 14:27                   ` Mike Rapoport
  2017-05-24 15:22                     ` Andrea Arcangeli
@ 2017-05-30  7:44                     ` Michal Hocko
       [not found]                       ` <20170530074408.GA7969-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  1 sibling, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2017-05-30  7:44 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Arnd Bergmann, Kirill A. Shutemov, Andrea Arcangeli,
	Pavel Emelyanov, linux-mm, lkml, Linux API

On Wed 24-05-17 17:27:36, Mike Rapoport wrote:
> On Wed, May 24, 2017 at 01:18:00PM +0200, Michal Hocko wrote:
[...]
> > Why cannot khugepaged simply skip over all VMAs which have userfault
> > regions registered? This would sound like a less error prone approach to
> > me.
> 
> khugepaged does skip over VMAs which have userfault. We could register the
> regions with userfault before populating them to avoid collapses in the
> transition period.

Why cannot you register only post-copy regions and "manually" copy the
pre-copy parts?

> But then we'll have to populate these regions with
> UFFDIO_COPY which adds quite an overhead.

How big is the performance impact?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
       [not found]                       ` <20170530074408.GA7969-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2017-05-30 10:19                         ` Mike Rapoport
  2017-05-30 10:39                           ` Michal Hocko
  0 siblings, 1 reply; 40+ messages in thread
From: Mike Rapoport @ 2017-05-30 10:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Arnd Bergmann, Kirill A. Shutemov, Andrea Arcangeli,
	Pavel Emelyanov, linux-mm, lkml, Linux API

On Tue, May 30, 2017 at 09:44:08AM +0200, Michal Hocko wrote:
> On Wed 24-05-17 17:27:36, Mike Rapoport wrote:
> > On Wed, May 24, 2017 at 01:18:00PM +0200, Michal Hocko wrote:
> [...]
> > > Why cannot khugepaged simply skip over all VMAs which have userfault
> > > regions registered? This would sound like a less error prone approach to
> > > me.
> > 
> > khugepaged does skip over VMAs which have userfault. We could register the
> > regions with userfault before populating them to avoid collapses in the
> > transition period.
> 
> Why cannot you register only post-copy regions and "manually" copy the
> pre-copy parts?

We can register only post-copy regions, but this will cause VMA
fragmentation. Now we register the entire VMA with userfaultfd, no matter
how many pages were dirtied there since the pre-dump. If we register only
post-copy regions, we will split out the VMAs for those regions.
 
> > But then we'll have to populate these regions with
> > UFFDIO_COPY which adds quite an overhead.
> 
> How big is the performance impact?

I don't have the numbers handy, but for each post-copy range it means that
instead of memcpy() we will use ioctl(UFFDIO_COPY).

> -- 
> Michal Hocko
> SUSE Labs
 
--
Sincerely yours,
Mike.


* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
  2017-05-30 10:19                         ` Mike Rapoport
@ 2017-05-30 10:39                           ` Michal Hocko
       [not found]                             ` <20170530103930.GB7969-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2017-05-30 10:39 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Arnd Bergmann, Kirill A. Shutemov, Andrea Arcangeli,
	Pavel Emelyanov, linux-mm, lkml, Linux API

On Tue 30-05-17 13:19:22, Mike Rapoport wrote:
> On Tue, May 30, 2017 at 09:44:08AM +0200, Michal Hocko wrote:
> > On Wed 24-05-17 17:27:36, Mike Rapoport wrote:
> > > On Wed, May 24, 2017 at 01:18:00PM +0200, Michal Hocko wrote:
> > [...]
> > > > Why cannot khugepaged simply skip over all VMAs which have userfault
> > > > regions registered? This would sound like a less error prone approach to
> > > > me.
> > > 
> > > khugepaged does skip over VMAs which have userfault. We could register the
> > > regions with userfault before populating them to avoid collapses in the
> > > transition period.
> > 
> > Why cannot you register only post-copy regions and "manually" copy the
> > pre-copy parts?
> 
> We can register only post-copy regions, but this will cause VMA
> fragmentation. Now we register the entire VMA with userfaultfd, no matter
> how many pages were dirtied there since the pre-dump. If we register only
> post-copy regions, we will split out the VMAs for those regions.

Is this really a problem, though?

> > > But then we'll have to populate these regions with
> > > UFFDIO_COPY which adds quite an overhead.
> > 
> > How big is the performance impact?
> 
> I don't have the numbers handy, but for each post-copy range it means that
> instead of memcpy() we will use ioctl(UFFDIO_COPY).

It would be good to measure that, though. You are proposing a new user
API and the THP API is quite convoluted already, so there had better be a
very good reason to add a new one. So far I can only see that it would
be more convenient to add another madvise command, and that is rather
insufficient justification IMHO. Also, do you expect somebody else would
use the new madvise? What would be the use case?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
       [not found]                             ` <20170530103930.GB7969-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2017-05-30 14:04                               ` Andrea Arcangeli
  2017-05-30 14:39                                 ` Michal Hocko
  2017-05-31  9:08                               ` Mike Rapoport
  1 sibling, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 2017-05-30 14:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Rapoport, Vlastimil Babka, Kirill A. Shutemov,
	Andrew Morton, Arnd Bergmann, Kirill A. Shutemov,
	Pavel Emelyanov, linux-mm, lkml, Linux API

On Tue, May 30, 2017 at 12:39:30PM +0200, Michal Hocko wrote:
> On Tue 30-05-17 13:19:22, Mike Rapoport wrote:
> > On Tue, May 30, 2017 at 09:44:08AM +0200, Michal Hocko wrote:
> > > On Wed 24-05-17 17:27:36, Mike Rapoport wrote:
> > > > On Wed, May 24, 2017 at 01:18:00PM +0200, Michal Hocko wrote:
> > > [...]
> > > > > Why cannot khugepaged simply skip over all VMAs which have userfault
> > > > > regions registered? This would sound like a less error prone approach to
> > > > > me.
> > > > 
> > > > khugepaged does skip over VMAs which have userfault. We could register the
> > > > regions with userfault before populating them to avoid collapses in the
> > > > transition period.
> > > 
> > > Why cannot you register only post-copy regions and "manually" copy the
> > > pre-copy parts?
> > 
> > We can register only post-copy regions, but this will cause VMA
> > fragmentation. Now we register the entire VMA with userfaultfd, no matter
> > how many pages were dirtied there since the pre-dump. If we register only
> > post-copy regions, we will split out the VMAs for those regions.
> 
> Is this really a problem, though?

It would eventually get -ENOMEM or at best create lots of unnecessary
vmas (at least UFFDIO_COPY would never risk triggering -ENOMEM).

The only attractive alternative is to use UFFDIO_COPY for precopy too
after pre-registering the whole range in uffd (which would happen
later anyway to start postcopy).

> It would be good to measure that, though. You are proposing a new user
> API and the THP API is quite convoluted already, so there had better be a
> very good reason to add a new one. So far I can only see that it would
> be more convenient to add another madvise command, and that is rather
> insufficient justification IMHO. Also, do you expect somebody else would
> use the new madvise? What would be the use case?

UFFDIO_COPY, while certainly not a major slowdown, is likely
measurable at the microbenchmark level because it would add a
kernel enter/exit to every 4k memcpy. It's not hard to imagine that being
measurable. How that impacts the total precopy time I don't know; it
would need to be benchmarked to be sure. The main benefit of this
madvise is precisely to skip those kernel enter/exits that UFFDIO_COPY
would add. Even if the impact on the total precopy time weren't
measurable (i.e. if it's a network-bound load), the madvise that allows
using memcpy after setting VM_NOHUGEPAGE would free up some CPU
cycles on the destination that could be used by other processes.

About the proposed madvise: it just clears bits, it doesn't change
at all how those bits are computed in the THP code. So I don't see it as
convoluted.

If it added new bits to be computed, it would add to the
complexity. Just clearing the same bits that already exist, without
altering how they're computed, doesn't move the needle in terms of
complexity. If that weren't the case, the "operational" part of the patch
wouldn't be just a one-liner.

+               *vm_flags &= ~(VM_HUGEPAGE | VM_NOHUGEPAGE);
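
which presumably just becomes another case in the existing madvise() ->
hugepage_madvise() switch, something like:

+               case MADV_CLR_HUGEPAGE:
+                       *vm_flags &= ~(VM_HUGEPAGE | VM_NOHUGEPAGE);
+                       break;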

Thanks,
Andrea


* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
  2017-05-30 14:04                               ` Andrea Arcangeli
@ 2017-05-30 14:39                                 ` Michal Hocko
       [not found]                                   ` <20170530143941.GK7969-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
                                                     ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Michal Hocko @ 2017-05-30 14:39 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mike Rapoport, Vlastimil Babka, Kirill A. Shutemov,
	Andrew Morton, Arnd Bergmann, Kirill A. Shutemov,
	Pavel Emelyanov, linux-mm, lkml, Linux API

On Tue 30-05-17 16:04:56, Andrea Arcangeli wrote:
> On Tue, May 30, 2017 at 12:39:30PM +0200, Michal Hocko wrote:
> > On Tue 30-05-17 13:19:22, Mike Rapoport wrote:
> > > On Tue, May 30, 2017 at 09:44:08AM +0200, Michal Hocko wrote:
> > > > On Wed 24-05-17 17:27:36, Mike Rapoport wrote:
> > > > > On Wed, May 24, 2017 at 01:18:00PM +0200, Michal Hocko wrote:
> > > > [...]
> > > > > > Why cannot khugepaged simply skip over all VMAs which have userfault
> > > > > > regions registered? This would sound like a less error prone approach to
> > > > > > me.
> > > > > 
> > > > > khugepaged does skip over VMAs which have userfault. We could register the
> > > > > regions with userfault before populating them to avoid collapses in the
> > > > > transition period.
> > > > 
> > > > Why cannot you register only post-copy regions and "manually" copy the
> > > > pre-copy parts?
> > > 
> > > We can register only post-copy regions, but this will cause VMA
> > > fragmentation. Now we register the entire VMA with userfaultfd, no matter
> > > how many pages were dirtied there since the pre-dump. If we register only
> > > post-copy regions, we will split out the VMAs for those regions.
> > 
> > Is this really a problem, though?
> 
> It would eventually get -ENOMEM or at best create lots of unnecessary
> vmas (at least UFFDIO_COPY would never risk to trigger -ENOMEM).

The sysctl for the map count can be increased, right? I also assume that
those vmas will get merged after the post-copy is done.

> The only attractive alternative is to use UFFDIO_COPY for precopy too
> after pre-registering the whole range in uffd (which would happen
> later anyway to start postcopy).
> 
> > It would be good to measure that, though. You are proposing a new user
> > API and the THP API is quite convoluted already, so there had better be a
> > very good reason to add a new one. So far I can only see that it would
> > be more convenient to add another madvise command, and that is rather
> > insufficient justification IMHO. Also, do you expect somebody else would
> > use the new madvise? What would be the use case?
> 
> UFFDIO_COPY, while certainly not a major slowdown, is likely
> measurable at the microbenchmark level because it would add a
> kernel enter/exit to every 4k memcpy. It's not hard to imagine that being
> measurable. How that impacts the total precopy time I don't know; it
> would need to be benchmarked to be sure.

Yes, please!

> The main benefit of this
> madvise is precisely to skip those kernel enter/exits that UFFDIO_COPY
> would add. Even if the impact on the total precopy time weren't
> measurable (i.e. if it's a network-bound load), the madvise that allows
> using memcpy after setting VM_NOHUGEPAGE would free up some CPU
> cycles on the destination that could be used by other processes.

I understand that part, but it sounds like an awfully single-purpose thing to me.
Are we going to add other MADVISE_RESET_$FOO commands to clear other flags just
because we can race in this specific use case?

> About the proposed madvise, it just clear bits, but it doesn't change
> at all how those bits are computed in THP code. So I don't see it as
> convoluted.

But we already have MADV_HUGEPAGE, MADV_NOHUGEPAGE and a prctl to
enable/disable THP. Doesn't that sound like a little bit too much for a single
feature to you?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
       [not found]                                   ` <20170530143941.GK7969-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2017-05-30 14:56                                     ` Michal Hocko
       [not found]                                       ` <20170530145632.GL7969-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2017-05-30 14:56 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mike Rapoport, Vlastimil Babka, Kirill A. Shutemov,
	Andrew Morton, Arnd Bergmann, Kirill A. Shutemov,
	Pavel Emelyanov, linux-mm, lkml, Linux API

On Tue 30-05-17 16:39:41, Michal Hocko wrote:
> On Tue 30-05-17 16:04:56, Andrea Arcangeli wrote:
[...]
> > About the proposed madvise, it just clear bits, but it doesn't change
> > at all how those bits are computed in THP code. So I don't see it as
> > convoluted.
> 
> But we already have MADV_HUGEPAGE, MADV_NOHUGEPAGE and prctl to
> enable/disable thp. Doesn't that sound little bit too much for a single
> feature to you?

And also I would argue that the prctl should be usable for this specific
use case. The man page says
"
Setting this flag provides a method for disabling transparent huge pages
for jobs where the code cannot be modified
"

and that fits the described case AFAIU. The fact that the current
implementation doesn't work that way is a mere detail. I would even argue that
it is non-intuitive, if not buggy right away. Whoever calls this prctl
later in the process lifetime will simply not stop THPs from being created.

So again, why cannot we fix that? There was some handwaving about
potential overhead but has anybody actually measured that?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
  2017-05-30 14:39                                 ` Michal Hocko
       [not found]                                   ` <20170530143941.GK7969-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2017-05-30 15:43                                   ` Andrea Arcangeli
  2017-05-31 12:08                                     ` Michal Hocko
  2017-06-01  6:53                                   ` Mike Rapoport
  2 siblings, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 2017-05-30 15:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Rapoport, Vlastimil Babka, Kirill A. Shutemov,
	Andrew Morton, Arnd Bergmann, Kirill A. Shutemov,
	Pavel Emelyanov, linux-mm, lkml, Linux API

On Tue, May 30, 2017 at 04:39:41PM +0200, Michal Hocko wrote:
> I sysctl for the mapcount can be increased, right? I also assume that
> those vmas will get merged after the post copy is done.

Assuming you enlarge the sysctl to the worst possible case, with a 64bit
address space you can have billions of VMAs if you're migrating 4T of
RAM and you're unlucky and the address space gets fragmented. The
unswappable kernel memory overhead would be relatively large
(i.e. a dozen gigabytes of RAM in the vm_area_struct slab), and each
find_vma operation would need to walk ~40 steps across that large vma
rbtree. There's a reason the sysctl exists. Not to mention that all those
unnecessary vma mangling operations would be protected by the mmap_sem
held for writing.

Not creating a ton of vmas, and enabling vma-less pte mangling with a
single large vma while only taking mmap_sem for reading during all the
pte mangling, is one of the primary design motivations for
userfaultfd.

> I understand that part, but it sounds like an awfully single-purpose thing to me.
> Are we going to add other MADVISE_RESET_$FOO commands to clear other flags just
> because we can race in this specific use case?

Those already exist; see for example MADV_NORMAL, which clears
VM_RAND_READ and VM_SEQ_READ after calling MADV_SEQUENTIAL or
MADV_RANDOM.

Or MADV_DOFORK after MADV_DONTFORK, MADV_DODUMP after MADV_DONTDUMP, etc.

> But we already have MADV_HUGEPAGE, MADV_NOHUGEPAGE and prctl to
> enable/disable thp. Doesn't that sound little bit too much for a single
> feature to you?

MADV_NOHUGEPAGE doesn't mean clearing the flag set with
MADV_HUGEPAGE. MADV_NOHUGEPAGE disables THP on the region if the
global sysfs "enabled" tune is set to "always". MADV_HUGEPAGE enables
THP if the global "enabled" sysfs tune is set to "madvise". The two
flags MADV_NOHUGEPAGE and MADV_HUGEPAGE are needed to leverage the
three-way "never"/"madvise"/"always" setting of the global tune.

The "madvise" global tune exists for when you want to save RAM and don't
care much about performance, while still allowing apps like QEMU, where no
memory is lost by enabling THP, to use it.

There's no way to clear either of those two flags and bring back the
default behavior of the global sysfs tune, so it's not redundant, at
the very least.
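
Put differently, the eligibility check the two flags feed into is roughly
the following (a sketch only, with made-up helper names for the global
tune):

        static bool thp_allowed_for(struct vm_area_struct *vma)
        {
                if (vma->vm_flags & VM_NOHUGEPAGE)
                        return false;
                if (thp_enabled_always())       /* "always" */
                        return true;
                if (thp_enabled_madvise())      /* "madvise" */
                        return !!(vma->vm_flags & VM_HUGEPAGE);
                return false;                   /* "never" */
        }

Clearing both bits, as the proposed madvise does, simply puts the vma back
under the control of the global tune.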


* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
       [not found]                                       ` <20170530145632.GL7969-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2017-05-30 16:06                                         ` Andrea Arcangeli
  2017-05-31  6:30                                           ` Vlastimil Babka
  0 siblings, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 2017-05-30 16:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Rapoport, Vlastimil Babka, Kirill A. Shutemov,
	Andrew Morton, Arnd Bergmann, Kirill A. Shutemov,
	Pavel Emelyanov, linux-mm, lkml, Linux API

On Tue, May 30, 2017 at 04:56:33PM +0200, Michal Hocko wrote:
> On Tue 30-05-17 16:39:41, Michal Hocko wrote:
> > On Tue 30-05-17 16:04:56, Andrea Arcangeli wrote:
> [...]
> > > About the proposed madvise, it just clear bits, but it doesn't change
> > > at all how those bits are computed in THP code. So I don't see it as
> > > convoluted.
> > 
> > But we already have MADV_HUGEPAGE, MADV_NOHUGEPAGE and prctl to
> > enable/disable thp. Doesn't that sound little bit too much for a single
> > feature to you?
> 
> And also I would argue that the prctl should be usable for this specific
> usecase. The man page says
> "
> Setting this flag provides a method for disabling transparent huge pages
> for jobs where the code cannot be modified
> "
> 
> and that fits into the described case AFAIU. The thing that the current
> implementation doesn't work is a mere detail. I would even argue that
> it is non-intuitive if not buggy right away. Whoever calls this prctl
> later in the process life time will simply not stop THP from creating.
> 
> So again, why cannot we fix that? There was some handwaving about
> potential overhead but has anybody actually measured that?

I'm not sure if it should be considered a bug: the prctl is intended
to be used by wrappers, so it looks optimal as implemented this way,
affecting future vmas only, which will all be created after the
execve executed by the wrapper.

What's the point of messing with the prctl so it mangles the wrapper
process's own vmas before exec? Messing with those vmas is purely
wasted CPU for the wrapper use case, which is what the prctl was
created for.

Furthermore there would be the risk of a program that uses the prctl
not as a wrapper and then calls the prctl to clear VM_NOHUGEPAGE from
def_flags, assuming the current kABI. The program could assume that
the vmas instantiated before disabling the prctl still have
VM_NOHUGEPAGE set (they would not, after the change you propose).

Adding a scan of all vmas to PR_SET_THP_DISABLE to clear VM_NOHUGEPAGE
on existing vmas looks more complex too, and is less fine-grained, so
it is probably harder for userland to manage; but ignoring all the
above considerations it would be a functional alternative for CRIU's
needs. However, if you didn't like the complexity of the new madvise,
which is functionally a one-liner equivalent of MADV_NORMAL, I
wouldn't expect you to prefer making the prctl even more complex with
a loop over all vmas which, despite being fairly simple, would still
be more than a trivial one-liner.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
  2017-05-30 16:06                                         ` Andrea Arcangeli
@ 2017-05-31  6:30                                           ` Vlastimil Babka
  2017-05-31  8:24                                             ` Michal Hocko
  0 siblings, 1 reply; 40+ messages in thread
From: Vlastimil Babka @ 2017-05-31  6:30 UTC (permalink / raw)
  To: Andrea Arcangeli, Michal Hocko
  Cc: Mike Rapoport, Kirill A. Shutemov, Andrew Morton, Arnd Bergmann,
	Kirill A. Shutemov, Pavel Emelyanov, linux-mm, lkml, Linux API

On 05/30/2017 06:06 PM, Andrea Arcangeli wrote:
> 
> I'm not sure if it should be considered a bug, the prctl is intended
> to use normally by wrappers so it looks optimal as implemented this
> way: affecting future vmas only, which will all be created after
> execve executed by the wrapper.
> 
> What's the point of messing with the prctl so it mangles over the
> wrapper process own vmas before exec? Messing with those vmas is pure
> wasted CPUs for the wrapper use case which is what the prctl was
> created for.
> 
> Furthermore there would be the risk a program that uses the prctl not
> as a wrapper and then calls the prctl to clear VM_NOHUGEPAGE from
> def_flags assuming the current kABI. The program could assume those
> vmas that were instantiated before disabling the prctl are still with
> VM_NOHUGEPAGE set (they would not after the change you propose).
> 
> Adding a scan of all vmas to PR_SET_THP_DISABLE to clear VM_NOHUGEPAGE
> on existing vmas looks more complex too and less finegrined so
> probably more complex for userland to manage

I would expect the prctl wouldn't iterate all vmas, nor would it modify
def_flags anymore. It would just set a flag somewhere in the mm struct
that would be considered in addition to the per-vma flags when deciding
whether to use THP. We could consider whether MADV_HUGEPAGE should be
able to override the prctl or not.

> but ignoring all above
> considerations it would be a functional alternative for CRIU's
> needs. However if you didn't like the complexity of the new madvise
> which is functionally a one-liner equivalent to MADV_NORMAL, I
> wouldn't expect you to prefer to make the prctl even more complex with
> a loop over all vmas that despite being fairly simple it'll still be
> more than a trivial one liner.
> 
> Thanks,
> Andrea
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
  2017-05-31  6:30                                           ` Vlastimil Babka
@ 2017-05-31  8:24                                             ` Michal Hocko
  2017-05-31  9:27                                               ` Mike Rapoport
                                                                 ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Michal Hocko @ 2017-05-31  8:24 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrea Arcangeli, Mike Rapoport, Kirill A. Shutemov,
	Andrew Morton, Arnd Bergmann, Kirill A. Shutemov,
	Pavel Emelyanov, linux-mm, lkml, Linux API

On Wed 31-05-17 08:30:08, Vlastimil Babka wrote:
> On 05/30/2017 06:06 PM, Andrea Arcangeli wrote:
> > 
> > I'm not sure if it should be considered a bug, the prctl is intended
> > to use normally by wrappers so it looks optimal as implemented this
> > way: affecting future vmas only, which will all be created after
> > execve executed by the wrapper.
> > 
> > What's the point of messing with the prctl so it mangles over the
> > wrapper process own vmas before exec? Messing with those vmas is pure
> > wasted CPUs for the wrapper use case which is what the prctl was
> > created for.
> > 
> > Furthermore there would be the risk a program that uses the prctl not
> > as a wrapper and then calls the prctl to clear VM_NOHUGEPAGE from
> > def_flags assuming the current kABI. The program could assume those
> > vmas that were instantiated before disabling the prctl are still with
> > VM_NOHUGEPAGE set (they would not after the change you propose).
> > 
> > Adding a scan of all vmas to PR_SET_THP_DISABLE to clear VM_NOHUGEPAGE
> > on existing vmas looks more complex too and less finegrined so
> > probably more complex for userland to manage
> 
> I would expect the prctl wouldn't iterate all vma's, nor would it modify
> def_flags anymore. It would just set a flag somewhere in mm struct that
> would be considered in addition to the per-vma flags when deciding
> whether to use THP.

Exactly. Something like the below (not even compile tested).

> We could consider whether MADV_HUGEPAGE should be
> able to override the prctl or not.

This should be a master override to any per vma setting.

---
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a3762d49ba39..9da053ced864 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -92,6 +92,7 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
 	   (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) &&			\
 	   ((__vma)->vm_flags & VM_HUGEPAGE))) &&			\
 	 !((__vma)->vm_flags & VM_NOHUGEPAGE) &&			\
+	 !test_bit(MMF_DISABLE_THP, &(__vma)->vm_mm->flags) &&		\
 	 !is_vma_temporary_stack(__vma))
 #define transparent_hugepage_use_zero_page()				\
 	(transparent_hugepage_flags &					\
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index 5d9a400af509..f0d7335336cd 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -48,7 +48,8 @@ static inline int khugepaged_enter(struct vm_area_struct *vma,
 	if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags))
 		if ((khugepaged_always() ||
 		     (khugepaged_req_madv() && (vm_flags & VM_HUGEPAGE))) &&
-		    !(vm_flags & VM_NOHUGEPAGE))
+		    !(vm_flags & VM_NOHUGEPAGE) &&
+		    !test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
 			if (__khugepaged_enter(vma->vm_mm))
 				return -ENOMEM;
 	return 0;
diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index 69eedcef8f03..2c07b244090a 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -68,6 +68,7 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_OOM_SKIP		21	/* mm is of no interest for the OOM killer */
 #define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
 #define MMF_HUGE_ZERO_PAGE	23      /* mm has ever used the global huge zero page */
+#define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
diff --git a/kernel/sys.c b/kernel/sys.c
index 8a94b4eabcaa..e48f0636c7fd 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2266,7 +2266,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_THP_DISABLE:
 		if (arg2 || arg3 || arg4 || arg5)
 			return -EINVAL;
-		error = !!(me->mm->def_flags & VM_NOHUGEPAGE);
+		error = !!test_bit(MMF_DISABLE_THP, &me->mm->flags);
 		break;
 	case PR_SET_THP_DISABLE:
 		if (arg3 || arg4 || arg5)
@@ -2274,9 +2274,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		if (down_write_killable(&me->mm->mmap_sem))
 			return -EINTR;
 		if (arg2)
-			me->mm->def_flags |= VM_NOHUGEPAGE;
+			set_bit(MMF_DISABLE_THP, &me->mm->flags);
 		else
-			me->mm->def_flags &= ~VM_NOHUGEPAGE;
+			clear_bit(MMF_DISABLE_THP, &me->mm->flags);
 		up_write(&me->mm->mmap_sem);
 		break;
 	case PR_MPX_ENABLE_MANAGEMENT:
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ce29e5cc7809..57e31f4752b3 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -818,7 +818,8 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
 static bool hugepage_vma_check(struct vm_area_struct *vma)
 {
 	if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
-	    (vma->vm_flags & VM_NOHUGEPAGE))
+	    (vma->vm_flags & VM_NOHUGEPAGE) ||
+	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
 		return false;
 	if (shmem_file(vma->vm_file)) {
 		if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
diff --git a/mm/shmem.c b/mm/shmem.c
index e67d6ba4e98e..27fe1bbf813b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1977,10 +1977,11 @@ static int shmem_fault(struct vm_fault *vmf)
 	}
 
 	sgp = SGP_CACHE;
-	if (vma->vm_flags & VM_HUGEPAGE)
-		sgp = SGP_HUGE;
-	else if (vma->vm_flags & VM_NOHUGEPAGE)
+	
+	if ((vma->vm_flags & VM_NOHUGEPAGE) || test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
 		sgp = SGP_NOHUGE;
+	else if (vma->vm_flags & VM_HUGEPAGE)
+		sgp = SGP_HUGE;
 
 	error = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, sgp,
 				  gfp, vma, vmf, &ret);
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
       [not found]                             ` <20170530103930.GB7969-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  2017-05-30 14:04                               ` Andrea Arcangeli
@ 2017-05-31  9:08                               ` Mike Rapoport
  2017-05-31 12:05                                 ` Michal Hocko
  1 sibling, 1 reply; 40+ messages in thread
From: Mike Rapoport @ 2017-05-31  9:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Arnd Bergmann, Kirill A. Shutemov, Andrea Arcangeli,
	Pavel Emelyanov, linux-mm, lkml, Linux API

On Tue, May 30, 2017 at 12:39:30PM +0200, Michal Hocko wrote:
> On Tue 30-05-17 13:19:22, Mike Rapoport wrote:
> > > > But then we'll have to populate these regions with
> > > > UFFDIO_COPY which adds quite an overhead.
> > > 
> > > How big is the performance impact?
> > 
> > I don't have the numbers handy, but for each post-copy range it means that
> > instead of memcpy() we will use ioctl(UFFDIO_COPY).
> 
> It would be good to measure that though.

I will, but I don't expect a huge difference here. Anyway, memcpy() will
touch still-unpopulated pages, so we'll enter/exit the kernel anyway.

> You are proposing a new user 
> API and the THP api is quite convoluted already so there better be a
> very good reason to add a new API. So far I can only see that it would
> be more convinient to add another madvise command and that is rather
> insufficient justification IMHO.

Well, the most convenient thing for my use case would be to simply
disable THP before restore and re-enable it afterwards. And the need to
use ioctl(UFFDIO_COPY) is not that much less convenient than the proposed
madvise command.

I've proposed the new madvise command because I firmly believe it is
missing. All madvise() commands that set some flag in vma->vm_flags have
the counter-command that resets that flag. Except for THP. The THP-related
flags can define three states for a VMA, pretty much like VM_SEQ_READ and
VM_RAND_READ. And it requires three madvise commands to allow setting any
of the desired states, just like with MADV_RANDOM, MADV_SEQUENTIAL and
MADV_NORMAL.

> Also do you expect somebody else would use new madvise? What would be the
> usecase?

I can think of an application that wants to keep 4K pages to save
physical memory during a certain phase, e.g. until these pages are
populated with more than very little data. After the memory usage
increases, the application may wish to stop preventing khugepaged from
merging these pages, but it does not have a strong inclination to force
the use of huge pages.

> -- 
> Michal Hocko
> SUSE Labs

--
Sincerely yours,
Mike. 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
  2017-05-31  8:24                                             ` Michal Hocko
@ 2017-05-31  9:27                                               ` Mike Rapoport
  2017-05-31 10:24                                                 ` Michal Hocko
       [not found]                                               ` <20170531082414.GB27783-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  2017-06-01 11:00                                               ` Mike Rapoport
  2 siblings, 1 reply; 40+ messages in thread
From: Mike Rapoport @ 2017-05-31  9:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Andrea Arcangeli, Kirill A. Shutemov,
	Andrew Morton, Arnd Bergmann, Kirill A. Shutemov,
	Pavel Emelyanov, linux-mm, lkml, Linux API

On Wed, May 31, 2017 at 10:24:14AM +0200, Michal Hocko wrote:
> On Wed 31-05-17 08:30:08, Vlastimil Babka wrote:
> > On 05/30/2017 06:06 PM, Andrea Arcangeli wrote:
> > > 
> > > I'm not sure if it should be considered a bug, the prctl is intended
> > > to use normally by wrappers so it looks optimal as implemented this
> > > way: affecting future vmas only, which will all be created after
> > > execve executed by the wrapper.
> > > 
> > > What's the point of messing with the prctl so it mangles over the
> > > wrapper process own vmas before exec? Messing with those vmas is pure
> > > wasted CPUs for the wrapper use case which is what the prctl was
> > > created for.
> > > 
> > > Furthermore there would be the risk a program that uses the prctl not
> > > as a wrapper and then calls the prctl to clear VM_NOHUGEPAGE from
> > > def_flags assuming the current kABI. The program could assume those
> > > vmas that were instantiated before disabling the prctl are still with
> > > VM_NOHUGEPAGE set (they would not after the change you propose).
> > > 
> > > Adding a scan of all vmas to PR_SET_THP_DISABLE to clear VM_NOHUGEPAGE
> > > on existing vmas looks more complex too and less finegrined so
> > > probably more complex for userland to manage
> > 
> > I would expect the prctl wouldn't iterate all vma's, nor would it modify
> > def_flags anymore. It would just set a flag somewhere in mm struct that
> > would be considered in addition to the per-vma flags when deciding
> > whether to use THP.
> 
> Exactly. Something like the below (not even compile tested).

If we set aside the argument for keeping the kABI, this seems, hmm, a bit
more complex than the new madvise() :)

It seems that for the CRIU usecase such behaviour of the prctl will work,
and it will probably be even more convenient than madvise(). Nonetheless,
I think madvise() is the more elegant and correct solution.

> > We could consider whether MADV_HUGEPAGE should be
> > able to override the prctl or not.
> 
> This should be a master override to any per vma setting.

Currently, MADV_HUGEPAGE overrides the prctl(PR_SET_THP_DISABLE)...
AFAIU, the prctl was intended to work with applications unaware of THP and
for the cases where addition of MADV_*HUGEPAGE to the application was not
an option.
 
> ---
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index a3762d49ba39..9da053ced864 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -92,6 +92,7 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
>  	   (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) &&			\
>  	   ((__vma)->vm_flags & VM_HUGEPAGE))) &&			\
>  	 !((__vma)->vm_flags & VM_NOHUGEPAGE) &&			\
> +	 !test_bit(MMF_DISABLE_THP, &(__vma)->vm_mm->flags) &&		\
>  	 !is_vma_temporary_stack(__vma))
>  #define transparent_hugepage_use_zero_page()				\
>  	(transparent_hugepage_flags &					\
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index 5d9a400af509..f0d7335336cd 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -48,7 +48,8 @@ static inline int khugepaged_enter(struct vm_area_struct *vma,
>  	if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags))
>  		if ((khugepaged_always() ||
>  		     (khugepaged_req_madv() && (vm_flags & VM_HUGEPAGE))) &&
> -		    !(vm_flags & VM_NOHUGEPAGE))
> +		    !(vm_flags & VM_NOHUGEPAGE) &&
> +		    !test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>  			if (__khugepaged_enter(vma->vm_mm))
>  				return -ENOMEM;
>  	return 0;
> diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
> index 69eedcef8f03..2c07b244090a 100644
> --- a/include/linux/sched/coredump.h
> +++ b/include/linux/sched/coredump.h
> @@ -68,6 +68,7 @@ static inline int get_dumpable(struct mm_struct *mm)
>  #define MMF_OOM_SKIP		21	/* mm is of no interest for the OOM killer */
>  #define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
>  #define MMF_HUGE_ZERO_PAGE	23      /* mm has ever used the global huge zero page */
> +#define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
> 
>  #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
> 
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 8a94b4eabcaa..e48f0636c7fd 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2266,7 +2266,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  	case PR_GET_THP_DISABLE:
>  		if (arg2 || arg3 || arg4 || arg5)
>  			return -EINVAL;
> -		error = !!(me->mm->def_flags & VM_NOHUGEPAGE);
> +		error = !!test_bit(MMF_DISABLE_THP, &me->mm->flags);
>  		break;
>  	case PR_SET_THP_DISABLE:
>  		if (arg3 || arg4 || arg5)
> @@ -2274,9 +2274,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  		if (down_write_killable(&me->mm->mmap_sem))
>  			return -EINTR;
>  		if (arg2)
> -			me->mm->def_flags |= VM_NOHUGEPAGE;
> +			set_bit(MMF_DISABLE_THP, &me->mm->flags);
>  		else
> -			me->mm->def_flags &= ~VM_NOHUGEPAGE;
> +			clear_bit(MMF_DISABLE_THP, &me->mm->flags);
>  		up_write(&me->mm->mmap_sem);
>  		break;
>  	case PR_MPX_ENABLE_MANAGEMENT:
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index ce29e5cc7809..57e31f4752b3 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -818,7 +818,8 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
>  static bool hugepage_vma_check(struct vm_area_struct *vma)
>  {
>  	if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
> -	    (vma->vm_flags & VM_NOHUGEPAGE))
> +	    (vma->vm_flags & VM_NOHUGEPAGE) ||
> +	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>  		return false;
>  	if (shmem_file(vma->vm_file)) {
>  		if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
> diff --git a/mm/shmem.c b/mm/shmem.c
> index e67d6ba4e98e..27fe1bbf813b 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1977,10 +1977,11 @@ static int shmem_fault(struct vm_fault *vmf)
>  	}
> 
>  	sgp = SGP_CACHE;
> -	if (vma->vm_flags & VM_HUGEPAGE)
> -		sgp = SGP_HUGE;
> -	else if (vma->vm_flags & VM_NOHUGEPAGE)
> +	
> +	if ((vma->vm_flags & VM_NOHUGEPAGE) || test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>  		sgp = SGP_NOHUGE;
> +	else if (vma->vm_flags & VM_HUGEPAGE)
> +		sgp = SGP_HUGE;
> 
>  	error = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, sgp,
>  				  gfp, vma, vmf, &ret);
> -- 
> Michal Hocko
> SUSE Labs
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
       [not found]                                               ` <20170531082414.GB27783-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2017-05-31 10:22                                                 ` Michal Hocko
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2017-05-31 10:22 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrea Arcangeli, Mike Rapoport, Kirill A. Shutemov,
	Andrew Morton, Arnd Bergmann, Kirill A. Shutemov,
	Pavel Emelyanov, linux-mm, lkml, Linux API

On Wed 31-05-17 10:24:14, Michal Hocko wrote:
[...]

JFTR we also need to update MMF_INIT_MASK, with the bit mask rather than
the bit number:
+#define MMF_INIT_MASK          (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK | (1 << MMF_DISABLE_THP))

> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index a3762d49ba39..9da053ced864 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -92,6 +92,7 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
>  	   (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) &&			\
>  	   ((__vma)->vm_flags & VM_HUGEPAGE))) &&			\
>  	 !((__vma)->vm_flags & VM_NOHUGEPAGE) &&			\
> +	 !test_bit(MMF_DISABLE_THP, &(__vma)->vm_mm->flags) &&		\
>  	 !is_vma_temporary_stack(__vma))
>  #define transparent_hugepage_use_zero_page()				\
>  	(transparent_hugepage_flags &					\
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index 5d9a400af509..f0d7335336cd 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -48,7 +48,8 @@ static inline int khugepaged_enter(struct vm_area_struct *vma,
>  	if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags))
>  		if ((khugepaged_always() ||
>  		     (khugepaged_req_madv() && (vm_flags & VM_HUGEPAGE))) &&
> -		    !(vm_flags & VM_NOHUGEPAGE))
> +		    !(vm_flags & VM_NOHUGEPAGE) &&
> +		    !test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>  			if (__khugepaged_enter(vma->vm_mm))
>  				return -ENOMEM;
>  	return 0;
> diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
> index 69eedcef8f03..2c07b244090a 100644
> --- a/include/linux/sched/coredump.h
> +++ b/include/linux/sched/coredump.h
> @@ -68,6 +68,7 @@ static inline int get_dumpable(struct mm_struct *mm)
>  #define MMF_OOM_SKIP		21	/* mm is of no interest for the OOM killer */
>  #define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
>  #define MMF_HUGE_ZERO_PAGE	23      /* mm has ever used the global huge zero page */
> +#define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
>  
>  #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
>  
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 8a94b4eabcaa..e48f0636c7fd 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2266,7 +2266,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  	case PR_GET_THP_DISABLE:
>  		if (arg2 || arg3 || arg4 || arg5)
>  			return -EINVAL;
> -		error = !!(me->mm->def_flags & VM_NOHUGEPAGE);
> +		error = !!test_bit(MMF_DISABLE_THP, &me->mm->flags);
>  		break;
>  	case PR_SET_THP_DISABLE:
>  		if (arg3 || arg4 || arg5)
> @@ -2274,9 +2274,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  		if (down_write_killable(&me->mm->mmap_sem))
>  			return -EINTR;
>  		if (arg2)
> -			me->mm->def_flags |= VM_NOHUGEPAGE;
> +			set_bit(MMF_DISABLE_THP, &me->mm->flags);
>  		else
> -			me->mm->def_flags &= ~VM_NOHUGEPAGE;
> +			clear_bit(MMF_DISABLE_THP, &me->mm->flags);
>  		up_write(&me->mm->mmap_sem);
>  		break;
>  	case PR_MPX_ENABLE_MANAGEMENT:
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index ce29e5cc7809..57e31f4752b3 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -818,7 +818,8 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
>  static bool hugepage_vma_check(struct vm_area_struct *vma)
>  {
>  	if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
> -	    (vma->vm_flags & VM_NOHUGEPAGE))
> +	    (vma->vm_flags & VM_NOHUGEPAGE) ||
> +	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>  		return false;
>  	if (shmem_file(vma->vm_file)) {
>  		if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
> diff --git a/mm/shmem.c b/mm/shmem.c
> index e67d6ba4e98e..27fe1bbf813b 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1977,10 +1977,11 @@ static int shmem_fault(struct vm_fault *vmf)
>  	}
>  
>  	sgp = SGP_CACHE;
> -	if (vma->vm_flags & VM_HUGEPAGE)
> -		sgp = SGP_HUGE;
> -	else if (vma->vm_flags & VM_NOHUGEPAGE)
> +	
> +	if ((vma->vm_flags & VM_NOHUGEPAGE) || test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>  		sgp = SGP_NOHUGE;
> +	else if (vma->vm_flags & VM_HUGEPAGE)
> +		sgp = SGP_HUGE;
>  
>  	error = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, sgp,
>  				  gfp, vma, vmf, &ret);
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
  2017-05-31  9:27                                               ` Mike Rapoport
@ 2017-05-31 10:24                                                 ` Michal Hocko
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2017-05-31 10:24 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Vlastimil Babka, Andrea Arcangeli, Kirill A. Shutemov,
	Andrew Morton, Arnd Bergmann, Kirill A. Shutemov,
	Pavel Emelyanov, linux-mm, lkml, Linux API

On Wed 31-05-17 12:27:00, Mike Rapoport wrote:
> On Wed, May 31, 2017 at 10:24:14AM +0200, Michal Hocko wrote:
> > On Wed 31-05-17 08:30:08, Vlastimil Babka wrote:
> > > On 05/30/2017 06:06 PM, Andrea Arcangeli wrote:
> > > > 
> > > > I'm not sure if it should be considered a bug, the prctl is intended
> > > > to use normally by wrappers so it looks optimal as implemented this
> > > > way: affecting future vmas only, which will all be created after
> > > > execve executed by the wrapper.
> > > > 
> > > > What's the point of messing with the prctl so it mangles over the
> > > > wrapper process own vmas before exec? Messing with those vmas is pure
> > > > wasted CPUs for the wrapper use case which is what the prctl was
> > > > created for.
> > > > 
> > > > Furthermore there would be the risk a program that uses the prctl not
> > > > as a wrapper and then calls the prctl to clear VM_NOHUGEPAGE from
> > > > def_flags assuming the current kABI. The program could assume those
> > > > vmas that were instantiated before disabling the prctl are still with
> > > > VM_NOHUGEPAGE set (they would not after the change you propose).
> > > > 
> > > > Adding a scan of all vmas to PR_SET_THP_DISABLE to clear VM_NOHUGEPAGE
> > > > on existing vmas looks more complex too and less finegrined so
> > > > probably more complex for userland to manage
> > > 
> > > I would expect the prctl wouldn't iterate all vma's, nor would it modify
> > > def_flags anymore. It would just set a flag somewhere in mm struct that
> > > would be considered in addition to the per-vma flags when deciding
> > > whether to use THP.
> > 
> > Exactly. Something like the below (not even compile tested).
> 
> If we set aside the argument for keeping the kABI, this seems, hmm, a bit
> more complex than new madvise() :)

Yes, code-wise it is more LOC, which is not all that great, but
semantics-wise it makes much more sense than the current implementation
of PR_SET_THP_DISABLE.

> It seems that for CRIU usecase such behaviour of prctl will work and it
> probably will be even more convenient than madvise(). Nonetheless, I think
> madvise() is the more elegant and correct solution.
> 
> > > We could consider whether MADV_HUGEPAGE should be
> > > able to override the prctl or not.
> > 
> > This should be a master override to any per vma setting.
> 
> Currently, MADV_HUGEPAGE overrides the prctl(PR_SET_THP_DISABLE)...
> AFAIU, the prctl was intended to work with applications unaware of THP and
> for the cases where addition of MADV_*HUGEPAGE to the application was not
> an option.

which makes it an even weirder API IMHO.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
  2017-05-31  9:08                               ` Mike Rapoport
@ 2017-05-31 12:05                                 ` Michal Hocko
  2017-05-31 12:25                                   ` Mike Rapoprt
  0 siblings, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2017-05-31 12:05 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Arnd Bergmann, Kirill A. Shutemov, Andrea Arcangeli,
	Pavel Emelyanov, linux-mm, lkml, Linux API

On Wed 31-05-17 12:08:45, Mike Rapoport wrote:
> On Tue, May 30, 2017 at 12:39:30PM +0200, Michal Hocko wrote:
[...]
> > Also do you expect somebody else would use new madvise? What would be the
> > usecase?
> 
> I can think of an application that wants to keep 4K pages to save physical
> memory for certain phase, e.g. until these pages are populated with very
> few data. After the memory usage increases, the application may wish to
> stop preventing khugepged from merging these pages, but it does not have
> strong inclination to force use of huge pages.

Well, is anybody actually going to do that?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
  2017-05-30 15:43                                   ` Andrea Arcangeli
@ 2017-05-31 12:08                                     ` Michal Hocko
       [not found]                                       ` <20170531120822.GL27783-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2017-05-31 12:08 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mike Rapoport, Vlastimil Babka, Kirill A. Shutemov,
	Andrew Morton, Arnd Bergmann, Kirill A. Shutemov,
	Pavel Emelyanov, linux-mm, lkml, Linux API

On Tue 30-05-17 17:43:26, Andrea Arcangeli wrote:
> On Tue, May 30, 2017 at 04:39:41PM +0200, Michal Hocko wrote:
> > I sysctl for the mapcount can be increased, right? I also assume that
> > those vmas will get merged after the post copy is done.
> 
> Assuming you enlarge the sysctl to the worst possible case, with 64bit
> address space you can have billions of VMAs if you're migrating 4T of
> RAM and you're unlucky and the address space gets fragmented. The
> unswappable kernel memory overhead would be relatively large
> (i.e. dozen gigabytes of RAM in vm_area_struct slab), and each
> find_vma operation would need to walk ~40 steps across that large vma
> rbtree. There's a reason the sysctl exist. Not to tell all those
> unnecessary vma mangling operations would be protected by the mmap_sem
> for writing.
> 
> Not creating a ton of vmas and enabling vma-less pte mangling with a
> single large vma and only using mmap_sem for reading during all the
> pte mangling, is one of the primary design motivations for
> userfaultfd.

Yes, I am aware of the fallout of too many vmas. I was merely asking to
learn whether this will really happen under the specific usecase
Mike is after.

> > I understand that part but it sounds awfully one purpose thing to me.
> > Are we going to add other MADVISE_RESET_$FOO to clear other flags just
> > because we can race in this specific use case?
> 
> Those already exists, see for example MADV_NORMAL, clearing
> ~VM_RAND_READ & ~VM_SEQ_READ after calling MADV_SEQUENTIAL or
> MADV_RANDOM.

I would argue that MADV_NORMAL is anything but a pure "clear" madvise
command. Why doesn't it clear all the sticky MADV_* flags?

> Or MADV_DOFORK after MADV_DONTFORK. MADV_DONTDUMP after MADV_DODUMP. Etc..
>
> > But we already have MADV_HUGEPAGE, MADV_NOHUGEPAGE and prctl to
> > enable/disable thp. Doesn't that sound little bit too much for a single
> > feature to you?
> 
> MADV_NOHUGEPAGE doesn't mean clearing the flag set with
> MADV_HUGEPAGE. MADV_NOHUGEPAGE disables THP on the region if the
> global sysfs "enabled" tune is set to "always". MADV_HUGEPAGE enables
> THP if the global "enabled" sysfs tune is set to "madvise". The two
> MADV_NOHUGEPAGE and MADV_HUGEPAGE are needed to leverage the three-way
> setting of "never" "madvise" "always" of the global tune.
> 
> The "madvise" global tune exists if you want to save RAM and you don't
> care much about performance but still allowing apps like QEMU where no
> memory is lost by enabling THP, to use THP.
> 
> There's no way to clear either of those two flags and bring back the
> default behavior of the global sysfs tune, so it's not redundant at
> the very least.

Yes, I am not a huge fan of the current MADV_*HUGEPAGE semantics, but I
would really like to see a strong usecase for adding another command on
top. From what Mike said, globally disabling THP for the whole process
while the post-copy is in progress is a better solution anyway.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
  2017-05-31 12:05                                 ` Michal Hocko
@ 2017-05-31 12:25                                   ` Mike Rapoprt
  0 siblings, 0 replies; 40+ messages in thread
From: Mike Rapoprt @ 2017-05-31 12:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Arnd Bergmann, Kirill A. Shutemov, Andrea Arcangeli,
	Pavel Emelyanov, linux-mm, lkml, Linux API



On May 31, 2017 3:05:56 PM GMT+03:00, Michal Hocko <mhocko@kernel.org> wrote:
>On Wed 31-05-17 12:08:45, Mike Rapoport wrote:
>> On Tue, May 30, 2017 at 12:39:30PM +0200, Michal Hocko wrote:
>[...]
>> > Also do you expect somebody else would use new madvise? What would
>be the
>> > usecase?
>> 
>> I can think of an application that wants to keep 4K pages to save
>physical
>> memory for certain phase, e.g. until these pages are populated with
>very
>> few data. After the memory usage increases, the application may wish
>to
>> stop preventing khugepged from merging these pages, but it does not
>have
>> strong inclination to force use of huge pages.
>
>Well, is actually anybody going to do that?

Well, I don't know, it's pretty much fortune telling :)
For sure, without the new madvise nobody will even be able to do that.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
       [not found]                                       ` <20170531120822.GL27783-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2017-05-31 12:39                                         ` Mike Rapoprt
  2017-05-31 14:18                                           ` Andrea Arcangeli
       [not found]                                           ` <8FA5E4C2-D289-4AF5-AA09-6C199E58F9A5-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  0 siblings, 2 replies; 40+ messages in thread
From: Mike Rapoprt @ 2017-05-31 12:39 UTC (permalink / raw)
  To: Michal Hocko, Andrea Arcangeli
  Cc: Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Arnd Bergmann, Kirill A. Shutemov, Pavel Emelyanov, linux-mm,
	lkml, Linux API



On May 31, 2017 3:08:22 PM GMT+03:00, Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>On Tue 30-05-17 17:43:26, Andrea Arcangeli wrote:
>> On Tue, May 30, 2017 at 04:39:41PM +0200, Michal Hocko wrote:
>> > I sysctl for the mapcount can be increased, right? I also assume
>that
>> > those vmas will get merged after the post copy is done.
>> 
>> Assuming you enlarge the sysctl to the worst possible case, with
>64bit
>> address space you can have billions of VMAs if you're migrating 4T of
>> RAM and you're unlucky and the address space gets fragmented. The
>> unswappable kernel memory overhead would be relatively large
>> (i.e. dozen gigabytes of RAM in vm_area_struct slab), and each
>> find_vma operation would need to walk ~40 steps across that large vma
>> rbtree. There's a reason the sysctl exist. Not to tell all those
>> unnecessary vma mangling operations would be protected by the
>mmap_sem
>> for writing.
>> 
>> Not creating a ton of vmas and enabling vma-less pte mangling with a
>> single large vma and only using mmap_sem for reading during all the
>> pte mangling, is one of the primary design motivations for
>> userfaultfd.
>
>Yes, I am aware of fallouts of too many vmas. I was asking merely to
>learn whether this will really happen under the the specific usecase
>Mike is after.

That depends on the application's access pattern in the period between when the pre-dump finishes and when the application is frozen. If the accesses are random enough, the dirty pages that would be post-copied could get spread all over the address space.

>> > I understand that part but it sounds awfully one purpose thing to
>me.
>> > Are we going to add other MADVISE_RESET_$FOO to clear other flags
>just
>> > because we can race in this specific use case?
>> 
>> Those already exists, see for example MADV_NORMAL, clearing
>> ~VM_RAND_READ & ~VM_SEQ_READ after calling MADV_SEQUENTIAL or
>> MADV_RANDOM.
>
>I would argue that MADV_NORMAL is everything but a clear madvise
>command. Why doesn't it clear all the sticky MADV* flags?

That would be helpful :)
Still, the problem here is more with the naming than with the action. If it were called MADV_DEFAULT_READ or something, it would be fine, wouldn't it?

>> Or MADV_DOFORK after MADV_DONTFORK. MADV_DONTDUMP after MADV_DODUMP.
>Etc..
>>
>> > But we already have MADV_HUGEPAGE, MADV_NOHUGEPAGE and prctl to
>> > enable/disable thp. Doesn't that sound little bit too much for a
>single
>> > feature to you?
>> 
>> MADV_NOHUGEPAGE doesn't mean clearing the flag set with
>> MADV_HUGEPAGE. MADV_NOHUGEPAGE disables THP on the region if the
>> global sysfs "enabled" tune is set to "always". MADV_HUGEPAGE enables
>> THP if the global "enabled" sysfs tune is set to "madvise". The two
>> MADV_NOHUGEPAGE and MADV_HUGEPAGE are needed to leverage the
>three-way
>> setting of "never" "madvise" "always" of the global tune.
>> 
>> The "madvise" global tune exists if you want to save RAM and you
>don't
>> care much about performance but still allowing apps like QEMU where
>no
>> memory is lost by enabling THP, to use THP.
>> 
>> There's no way to clear either of those two flags and bring back the
>> default behavior of the global sysfs tune, so it's not redundant at
>> the very least.
>
>Yes I am not a huge fan of the current MADV*HUGEPAGE semantic but I
>would really like to see a strong usecase for adding another command on
>top. 

Well, another command makes the semantics a bit better, IMHO...

> From what Mike said a global disable THP for the whole process
>while the post-copy is in progress is a better solution anyway.

For the CRIU usecase, disabling THP for a while and re-enabling it afterwards will do the trick, provided the VMA flags are not affected, like in the patch you've sent.
Moreover, we may even get away with ioctl(UFFDIO_COPY) if its overhead turns out to be negligible.
Still, I believe that a MADV_RESET_HUGEPAGE (or some better-named) command has value on its own.
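
A rough sketch of that restore sequence, assuming the mm-level
PR_SET_THP_DISABLE semantics from your patch (per-VMA flags untouched);
the userfaultfd registration step is elided and the fill pattern is just
a stand-in for a partially populated region:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/prctl.h>

int main(void)
{
	size_t page = 4096, len = 256UL << 20;

	/* 1. Keep khugepaged and the fault path away from huge pages. */
	if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
		perror("prctl"), exit(1);

	char *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		perror("mmap"), exit(1);

	/* 2. Fill in the pages collected during pre-dump; every other
	 *    page is touched here just to leave holes behind. */
	for (size_t off = 0; off < len; off += 2 * page)
		memset(mem + off, 0xab, page);

	/* 3. Register the region with userfaultfd here (omitted). */

	/* 4. Re-enable THP for the rest of the process lifetime. */
	if (prctl(PR_SET_THP_DISABLE, 0, 0, 0, 0))
		perror("prctl"), exit(1);

	return 0;
}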
--
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
  2017-05-31 12:39                                         ` Mike Rapoprt
@ 2017-05-31 14:18                                           ` Andrea Arcangeli
       [not found]                                             ` <20170531141809.GB302-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
       [not found]                                           ` <8FA5E4C2-D289-4AF5-AA09-6C199E58F9A5-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  1 sibling, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 2017-05-31 14:18 UTC (permalink / raw)
  To: Mike Rapoprt
  Cc: Michal Hocko, Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Arnd Bergmann, Kirill A. Shutemov, Pavel Emelyanov, linux-mm,
	lkml, Linux API

On Wed, May 31, 2017 at 03:39:22PM +0300, Mike Rapoport wrote:
> For the CRIU usecase, disabling THP for a while and re-enabling it
> back will do the trick, provided VMAs flags are not affected, like
> in the patch you've sent.  Moreover, we may even get away with

Are you going to check uname -r to know when the kABI changed in your
favor (so CRIU cannot ever work with enterprise backports unless you
expand the uname -r coverage), or how do you know the patch is
applied?

Optimistically assuming people are going to run new CRIU code only on
new kernels looks very risky; it would lead to silent random memory
corruption, so I doubt you can get away without a uname -r check.

This is a fairly simple change too; its main con is that it adds a
branch to the page fault fast path, whereas the old behavior of the
prctl and the new madvise were both zero cost.

Still if the prctl is preferred despite the added branch, to avoid
uname -r clashes, to me it sounds better to add a new prctl ID and
keep the old one too. The old one could be implemented the same way as
the new one if you want to save a few bytes of .text. But the old one
should probably do a printk_once to print a deprecation warning so the
old ID with weaker (zero runtime cost) semantics can be removed later.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
       [not found]                                           ` <8FA5E4C2-D289-4AF5-AA09-6C199E58F9A5-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2017-05-31 14:19                                             ` Michal Hocko
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2017-05-31 14:19 UTC (permalink / raw)
  To: Mike Rapoprt
  Cc: Andrea Arcangeli, Vlastimil Babka, Kirill A. Shutemov,
	Andrew Morton, Arnd Bergmann, Kirill A. Shutemov,
	Pavel Emelyanov, linux-mm, lkml, Linux API

On Wed 31-05-17 15:39:22, Mike Rapoprt wrote:
> 
> 
> On May 31, 2017 3:08:22 PM GMT+03:00, Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
[...]
> > From what Mike said a global disable THP for the whole process
> >while the post-copy is in progress is a better solution anyway.
> 
> For the CRIU usecase, disabling THP for a while and re-enabling
> it back will do the trick, provided VMAs flags are not affected,
> like in the patch you've sent.  Moreover, we may even get away with
> ioctl(UFFDIO_COPY) if it's overhead shows to be negligible.  Still,
> I believe that MADV_RESET_HUGEPAGE (or some better named) command has
> the value on its own.

I would prefer if we could go the prctl if possible and add a new
MADV_RESET_HUGEPAGE if there is really a usecase for it.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
       [not found]                                             ` <20170531141809.GB302-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-05-31 14:32                                               ` Michal Hocko
  2017-05-31 15:46                                                 ` Andrea Arcangeli
  2017-06-01  6:58                                               ` Mike Rapoport
  1 sibling, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2017-05-31 14:32 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mike Rapoprt, Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Arnd Bergmann, Kirill A. Shutemov, Pavel Emelyanov, linux-mm,
	lkml, Linux API

On Wed 31-05-17 16:18:09, Andrea Arcangeli wrote:
> On Wed, May 31, 2017 at 03:39:22PM +0300, Mike Rapoport wrote:
> > For the CRIU usecase, disabling THP for a while and re-enabling it
> > back will do the trick, provided VMAs flags are not affected, like
> > in the patch you've sent.  Moreover, we may even get away with
> 
> Are you going to check uname -r to know when the kABI changed in your
> favor (so CRIU cannot ever work with enterprise backports unless you
> expand the uname -r coverage), or how do you know the patch is
> applied?

I would assume such a patch would be backported to stable trees because
to me it sounds like the current semantics are simply broken and need
fixing anyway, but it shouldn't be much different from any other bug.

This is far from ideal from the "guarantee POV" of course.

> Optimistically assuming people is going to run new CRIU code only on
> new kernels looks very risky, it would leads to silent random memory
> corruption, so I doubt you can get away without a uname -r check.
> 
> This is fairly simple change too, its main cons is that it adds a
> branch to the page fault fast path, the old behavior of the prctl and
> the new madvise were both zero cost.
> 
> Still if the prctl is preferred despite the added branch, to avoid
> uname -r clashes, to me it sounds better to add a new prctl ID and
> keep the old one too. The old one could be implemented the same way as
> the new one if you want to save a few bytes of .text. But the old one
> should probably do a printk_once to print a deprecation warning so the
> old ID with weaker (zero runtime cost) semantics can be removed later.

This would be an option as well, although it adds to the mess...

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
  2017-05-31 14:32                                               ` Michal Hocko
@ 2017-05-31 15:46                                                 ` Andrea Arcangeli
  0 siblings, 0 replies; 40+ messages in thread
From: Andrea Arcangeli @ 2017-05-31 15:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Rapoprt, Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Arnd Bergmann, Kirill A. Shutemov, Pavel Emelyanov, linux-mm,
	lkml, Linux API

On Wed, May 31, 2017 at 04:32:17PM +0200, Michal Hocko wrote:
> I would assume such a patch would be backported to stable trees because
> to me it sounds like the current semantic is simply broken and needs
> fixing anyway but it shouldn't be much different from any other bugs.

So the program would then also need to check the -stable minor
number where the patch was backported to, in addition to any enterprise
kernel backport versioning.

> This is far from ideal from the "guarantee POV" of course.

Agreed, it's far from ideal, and the lack of a guarantee, at least for
CRIU, means silent random memory corruption.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
  2017-05-30 14:39                                 ` Michal Hocko
       [not found]                                   ` <20170530143941.GK7969-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  2017-05-30 15:43                                   ` Andrea Arcangeli
@ 2017-06-01  6:53                                   ` Mike Rapoport
  2017-06-01  8:09                                     ` Michal Hocko
  2 siblings, 1 reply; 40+ messages in thread
From: Mike Rapoport @ 2017-06-01  6:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrea Arcangeli, Vlastimil Babka, Kirill A. Shutemov,
	Andrew Morton, Arnd Bergmann, Kirill A. Shutemov,
	Pavel Emelyanov, linux-mm, lkml, Linux API

On Tue, May 30, 2017 at 04:39:41PM +0200, Michal Hocko wrote:
> On Tue 30-05-17 16:04:56, Andrea Arcangeli wrote:
> > 
> > UFFDIO_COPY while not being a major slowdown for sure, it's likely
> > measurable at the microbenchmark level because it would add a
> > enter/exit kernel to every 4k memcpy. It's not hard to imagine that as
> > measurable. How that impacts the total precopy time I don't know, it
> > would need to be benchmarked to be sure.
> 
> Yes, please!

I've run a simple test (below) that fills 1G of memory either with memcpy()
or with ioctl(UFFDIO_COPY) in 4K chunks.
The machine I used has two "Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz" and
128G of RAM.
I've averaged the elapsed time reported by /usr/bin/time over 100 runs and
here is what I've got:

memcpy with THP on: 0.3278 sec
memcpy with THP off: 0.5295 sec
UFFDIO_COPY: 0.44 sec

That said, for the CRIU usecase UFFDIO_COPY seems faster than disabling THP
and then doing memcpy.

--
Sincerely yours,
Mike.

----------------------------------------------------------
{
	...

	src = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (src == MAP_FAILED)
		fprintf(stderr, "map src failed\n"), exit(1);
	*((unsigned long *)src) = 1;

 	if (disable_huge && prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
		fprintf(stderr, "ptctl failed\n"), exit(1);

	dst = mmap(NULL, page_size * nr_pages, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (dst == MAP_FAILED)
		fprintf(stderr, "map dst failed\n"), exit(1);

	if (use_uffd && userfaultfd_register(dst))
		fprintf(stderr, "userfault_register failed\n"), exit(1);

	for (i = 0; i < nr_pages; i++) {
		char *address = dst + i * page_size;

		if (use_uffd) {
			struct uffdio_copy uffdio_copy;

			uffdio_copy.dst = (unsigned long)address;
			uffdio_copy.src = (unsigned long)src;
			uffdio_copy.len = page_size;
			uffdio_copy.mode = 0;
			uffdio_copy.copy = 0;

			ret = ioctl(uffd, UFFDIO_COPY, &uffdio_copy);
			if (ret)
				fprintf(stderr, "copy: %d, %d\n", ret, errno),
					exit(1);
		} else {
			memcpy(address, src, page_size);
		}

	}

	return 0;
}


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
       [not found]                                             ` <20170531141809.GB302-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2017-05-31 14:32                                               ` Michal Hocko
@ 2017-06-01  6:58                                               ` Mike Rapoport
  1 sibling, 0 replies; 40+ messages in thread
From: Mike Rapoport @ 2017-06-01  6:58 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Michal Hocko, Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Arnd Bergmann, Kirill A. Shutemov, Pavel Emelyanov, linux-mm,
	lkml, Linux API

On Wed, May 31, 2017 at 04:18:09PM +0200, Andrea Arcangeli wrote:
> On Wed, May 31, 2017 at 03:39:22PM +0300, Mike Rapoport wrote:
> > For the CRIU usecase, disabling THP for a while and re-enabling it
> > back will do the trick, provided VMAs flags are not affected, like
> > in the patch you've sent.  Moreover, we may even get away with
> 
> Are you going to check uname -r to know when the kABI changed in your
> favor (so CRIU cannot ever work with enterprise backports unless you
> expand the uname -r coverage), or how do you know the patch is
> applied?

CRIU does not rely on uname -r. We have code that checks what kernel
features we can actually use. For instance, we use UFFDIO_API to see if we
can do post-copy at all.
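
For reference, a minimal sketch of that kind of run-time probing; this
is only an illustration of the approach, not CRIU's actual code, and it
uses nothing beyond the existing UFFDIO_API handshake:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

int main(void)
{
	/* No glibc wrapper for userfaultfd(2), so call it directly. */
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd < 0) {
		perror("userfaultfd");	/* kernel without userfaultfd */
		return 1;
	}

	/* Handshake: the kernel reports the features/ioctls it supports. */
	struct uffdio_api api = { .api = UFFD_API, .features = 0 };
	if (ioctl(uffd, UFFDIO_API, &api)) {
		perror("UFFDIO_API");	/* feature probing failed */
		return 1;
	}

	printf("userfaultfd available, ioctls bitmask 0x%llx\n",
	       (unsigned long long)api.ioctls);
	return 0;
}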
 
> Optimistically assuming people is going to run new CRIU code only on
> new kernels looks very risky, it would leads to silent random memory
> corruption, so I doubt you can get away without a uname -r check.
> 
> This is fairly simple change too, its main cons is that it adds a
> branch to the page fault fast path, the old behavior of the prctl and
> the new madvise were both zero cost.
> 
> Still if the prctl is preferred despite the added branch, to avoid
> uname -r clashes, to me it sounds better to add a new prctl ID and
> keep the old one too. The old one could be implemented the same way as
> the new one if you want to save a few bytes of .text. But the old one
> should probably do a printk_once to print a deprecation warning so the
> old ID with weaker (zero runtime cost) semantics can be removed later.
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
  2017-06-01  6:53                                   ` Mike Rapoport
@ 2017-06-01  8:09                                     ` Michal Hocko
       [not found]                                       ` <20170601080909.GD32677-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  2017-06-01 13:45                                       ` Andrea Arcangeli
  0 siblings, 2 replies; 40+ messages in thread
From: Michal Hocko @ 2017-06-01  8:09 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrea Arcangeli, Vlastimil Babka, Kirill A. Shutemov,
	Andrew Morton, Arnd Bergmann, Kirill A. Shutemov,
	Pavel Emelyanov, linux-mm, lkml, Linux API

On Thu 01-06-17 09:53:02, Mike Rapoport wrote:
> On Tue, May 30, 2017 at 04:39:41PM +0200, Michal Hocko wrote:
> > On Tue 30-05-17 16:04:56, Andrea Arcangeli wrote:
> > > 
> > > UFFDIO_COPY, while certainly not a major slowdown, is likely
> > > measurable at the microbenchmark level because it would add a
> > > kernel enter/exit to every 4k memcpy. It's not hard to imagine that
> > > being measurable. How that impacts the total precopy time I don't
> > > know; it would need to be benchmarked to be sure.
> > 
> > Yes, please!
> 
> I've run a simple test (below) that fills 1G of memory either with memcpy
> or with ioctl(UFFDIO_COPY) in 4K chunks.
> The machine I used has two "Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz" and
> 128G of RAM.
> I've averaged the elapsed time reported by /usr/bin/time over 100 runs and
> here is what I got:
> 
> memcpy with THP on: 0.3278 sec
> memcpy with THP off: 0.5295 sec
> UFFDIO_COPY: 0.44 sec

I assume that the standard deviation is small?
 
> That said, for the CRIU usecase UFFDIO_COPY seems faster than disabling THP
> and then doing memcpy.

That is a bit surprising. I didn't think that the userfault syscall
(ioctl) could be faster than a regular #PF, but considering that
__mcopy_atomic bypasses the page fault path and can be optimized for
the anon case, it seems we can save some cycles for each page, and so
the cumulative savings can be visible.
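
For reference, a minimal sketch of the UFFDIO_COPY side of such a fill loop
(illustrative only, not the test program quoted above; the helper name is
made up, and it assumes the destination range was already registered with
UFFDIO_REGISTER_MODE_MISSING):

#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

#define CHUNK	4096UL

/*
 * Fill 'len' bytes at 'dst' (a range registered with the userfaultfd
 * 'uffd') from the staging buffer 'src', one 4K UFFDIO_COPY per chunk.
 */
static int uffd_fill(int uffd, unsigned long dst, void *src, unsigned long len)
{
	unsigned long off;

	for (off = 0; off < len; off += CHUNK) {
		struct uffdio_copy copy = {
			.dst	= dst + off,
			.src	= (unsigned long)src + off,
			.len	= CHUNK,
			.mode	= 0,
		};

		/* Each call atomically copies and maps one page. */
		if (ioctl(uffd, UFFDIO_COPY, &copy) < 0)
			return -1;
	}
	return 0;
}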

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
       [not found]                                       ` <20170601080909.GD32677-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2017-06-01  8:35                                         ` Mike Rapoport
  0 siblings, 0 replies; 40+ messages in thread
From: Mike Rapoport @ 2017-06-01  8:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrea Arcangeli, Vlastimil Babka, Kirill A. Shutemov,
	Andrew Morton, Arnd Bergmann, Kirill A. Shutemov,
	Pavel Emelyanov, linux-mm, lkml, Linux API

On Thu, Jun 01, 2017 at 10:09:09AM +0200, Michal Hocko wrote:
> On Thu 01-06-17 09:53:02, Mike Rapoport wrote:
> > On Tue, May 30, 2017 at 04:39:41PM +0200, Michal Hocko wrote:
> > > On Tue 30-05-17 16:04:56, Andrea Arcangeli wrote:
> > > > 
> > > > UFFDIO_COPY, while certainly not a major slowdown, is likely
> > > > measurable at the microbenchmark level because it would add a
> > > > kernel enter/exit to every 4k memcpy. It's not hard to imagine that
> > > > being measurable. How that impacts the total precopy time I don't
> > > > know; it would need to be benchmarked to be sure.
> > > 
> > > Yes, please!
> > 
> > I've run a simple test (below) that fills 1G of memory either with memcpy
> > or with ioctl(UFFDIO_COPY) in 4K chunks.
> > The machine I used has two "Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz" and
> > 128G of RAM.
> > I've averaged the elapsed time reported by /usr/bin/time over 100 runs and
> > here is what I got:
> > 
> > memcpy with THP on: 0.3278 sec
> > memcpy with THP off: 0.5295 sec
> > UFFDIO_COPY: 0.44 sec
> 
> I assume that the standard deviation is small?

Yes.
 
> > That said, for the CRIU usecase UFFDIO_COPY seems faster than disabling THP
> > and then doing memcpy.
> 
> That is a bit surprising. I didn't think that the userfault syscall
> (ioctl) could be faster than a regular #PF, but considering that
> __mcopy_atomic bypasses the page fault path and can be optimized for
> the anon case, it seems we can save some cycles for each page, and so
> the cumulative savings can be visible.
> 
> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
  2017-05-31  8:24                                             ` Michal Hocko
  2017-05-31  9:27                                               ` Mike Rapoport
       [not found]                                               ` <20170531082414.GB27783-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2017-06-01 11:00                                               ` Mike Rapoport
  2017-06-01 12:27                                                 ` Michal Hocko
  2 siblings, 1 reply; 40+ messages in thread
From: Mike Rapoport @ 2017-06-01 11:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Andrea Arcangeli, Kirill A. Shutemov,
	Andrew Morton, Arnd Bergmann, Kirill A. Shutemov,
	Pavel Emelyanov, linux-mm, lkml, Linux API

On Wed, May 31, 2017 at 10:24:14AM +0200, Michal Hocko wrote:
> On Wed 31-05-17 08:30:08, Vlastimil Babka wrote:
> > On 05/30/2017 06:06 PM, Andrea Arcangeli wrote:
> > > 
> > > I'm not sure if it should be considered a bug; the prctl is intended
> > > to be used normally by wrappers, so it looks optimal as implemented
> > > this way: affecting future vmas only, which will all be created after
> > > the execve executed by the wrapper.
> > > 
> > > What's the point of messing with the prctl so it mangles the wrapper
> > > process's own vmas before exec? Messing with those vmas is pure
> > > wasted CPU for the wrapper use case, which is what the prctl was
> > > created for.
> > > 
> > > Furthermore there would be the risk that a program uses the prctl not
> > > as a wrapper and then calls the prctl to clear VM_NOHUGEPAGE from
> > > def_flags assuming the current kABI. The program could assume that the
> > > vmas instantiated before disabling the prctl still have VM_NOHUGEPAGE
> > > set (they would not after the change you propose).
> > > 
> > > Adding a scan of all vmas to PR_SET_THP_DISABLE to clear VM_NOHUGEPAGE
> > > on existing vmas looks more complex too, and less fine-grained, so
> > > probably more complex for userland to manage.
> > 
> > I would expect the prctl wouldn't iterate over all vmas, nor would it
> > modify def_flags anymore. It would just set a flag somewhere in the mm
> > struct that would be considered in addition to the per-vma flags when
> > deciding whether to use THP.
> 
> Exactly. Something like the below (not even compile tested).
 
I gave the patch a quick go; it compiles just fine :)
It worked for my simple examples: THP is enabled/disabled as expected
and the vma->vm_flags are indeed unaffected.

> > We could consider whether MADV_HUGEPAGE should be
> > able to override the prctl or not.
> 
> This should be a master override for any per-vma setting.

Here you've introduced a change to the current behaviour. Consider the
following sequence:

{
	prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
	address = mmap(...);
	madvise(address, len, MADV_HUGEPAGE);
}

Currently, transparent_hugepage_enabled(vma) will return true for the vma
that backs the address; after your patch it will return false.
The new behaviour may be more correct, I just wanted to bring the change to
your attention.
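
For completeness, a compilable sketch of the sequence above (sizes are
illustrative; PR_SET_THP_DISABLE takes the enable flag in arg2 and requires
the remaining arguments to be zero, and whether the madvise or the prctl
wins can be checked via AnonHugePages in /proc/self/smaps):

#include <string.h>
#include <sys/mman.h>
#include <sys/prctl.h>

#define LEN	(8UL << 20)	/* 8M, large enough for a 2M-aligned THP */

int main(void)
{
	char *address;

	/* Disable THP for the whole process... */
	prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);

	address = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (address == MAP_FAILED)
		return 1;

	/* ...then ask for THP on this vma only. */
	madvise(address, LEN, MADV_HUGEPAGE);

	/* Touch the pages; then inspect AnonHugePages for this mapping
	 * in /proc/self/smaps to see which setting took effect. */
	memset(address, 1, LEN);

	return 0;
}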
 
> ---
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index a3762d49ba39..9da053ced864 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -92,6 +92,7 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
>  	   (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) &&			\
>  	   ((__vma)->vm_flags & VM_HUGEPAGE))) &&			\
>  	 !((__vma)->vm_flags & VM_NOHUGEPAGE) &&			\
> +	 !test_bit(MMF_DISABLE_THP, &(__vma)->vm_mm->flags) &&		\
>  	 !is_vma_temporary_stack(__vma))
>  #define transparent_hugepage_use_zero_page()				\
>  	(transparent_hugepage_flags &					\
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index 5d9a400af509..f0d7335336cd 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -48,7 +48,8 @@ static inline int khugepaged_enter(struct vm_area_struct *vma,
>  	if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags))
>  		if ((khugepaged_always() ||
>  		     (khugepaged_req_madv() && (vm_flags & VM_HUGEPAGE))) &&
> -		    !(vm_flags & VM_NOHUGEPAGE))
> +		    !(vm_flags & VM_NOHUGEPAGE) &&
> +		    !test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>  			if (__khugepaged_enter(vma->vm_mm))
>  				return -ENOMEM;
>  	return 0;
> diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
> index 69eedcef8f03..2c07b244090a 100644
> --- a/include/linux/sched/coredump.h
> +++ b/include/linux/sched/coredump.h
> @@ -68,6 +68,7 @@ static inline int get_dumpable(struct mm_struct *mm)
>  #define MMF_OOM_SKIP		21	/* mm is of no interest for the OOM killer */
>  #define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
>  #define MMF_HUGE_ZERO_PAGE	23      /* mm has ever used the global huge zero page */
> +#define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
> 
>  #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
> 
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 8a94b4eabcaa..e48f0636c7fd 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2266,7 +2266,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  	case PR_GET_THP_DISABLE:
>  		if (arg2 || arg3 || arg4 || arg5)
>  			return -EINVAL;
> -		error = !!(me->mm->def_flags & VM_NOHUGEPAGE);
> +		error = !!test_bit(MMF_DISABLE_THP, &me->mm->flags);
>  		break;
>  	case PR_SET_THP_DISABLE:
>  		if (arg3 || arg4 || arg5)
> @@ -2274,9 +2274,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  		if (down_write_killable(&me->mm->mmap_sem))
>  			return -EINTR;
>  		if (arg2)
> -			me->mm->def_flags |= VM_NOHUGEPAGE;
> +			set_bit(MMF_DISABLE_THP, &me->mm->flags);
>  		else
> -			me->mm->def_flags &= ~VM_NOHUGEPAGE;
> +			clear_bit(MMF_DISABLE_THP, &me->mm->flags);
>  		up_write(&me->mm->mmap_sem);
>  		break;
>  	case PR_MPX_ENABLE_MANAGEMENT:
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index ce29e5cc7809..57e31f4752b3 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -818,7 +818,8 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
>  static bool hugepage_vma_check(struct vm_area_struct *vma)
>  {
>  	if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
> -	    (vma->vm_flags & VM_NOHUGEPAGE))
> +	    (vma->vm_flags & VM_NOHUGEPAGE) ||
> +	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>  		return false;
>  	if (shmem_file(vma->vm_file)) {
>  		if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
> diff --git a/mm/shmem.c b/mm/shmem.c
> index e67d6ba4e98e..27fe1bbf813b 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1977,10 +1977,11 @@ static int shmem_fault(struct vm_fault *vmf)
>  	}
> 
>  	sgp = SGP_CACHE;
> -	if (vma->vm_flags & VM_HUGEPAGE)
> -		sgp = SGP_HUGE;
> -	else if (vma->vm_flags & VM_NOHUGEPAGE)
> +	
> +	if ((vma->vm_flags & VM_NOHUGEPAGE) || test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>  		sgp = SGP_NOHUGE;
> +	else if (vma->vm_flags & VM_HUGEPAGE)
> +		sgp = SGP_HUGE;
> 
>  	error = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, sgp,
>  				  gfp, vma, vmf, &ret);
> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
  2017-06-01 11:00                                               ` Mike Rapoport
@ 2017-06-01 12:27                                                 ` Michal Hocko
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2017-06-01 12:27 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Vlastimil Babka, Andrea Arcangeli, Kirill A. Shutemov,
	Andrew Morton, Arnd Bergmann, Kirill A. Shutemov,
	Pavel Emelyanov, linux-mm, lkml, Linux API

On Thu 01-06-17 14:00:48, Mike Rapoport wrote:
> On Wed, May 31, 2017 at 10:24:14AM +0200, Michal Hocko wrote:
> > On Wed 31-05-17 08:30:08, Vlastimil Babka wrote:
> > > On 05/30/2017 06:06 PM, Andrea Arcangeli wrote:
> > > > 
> > > > I'm not sure if it should be considered a bug; the prctl is intended
> > > > to be used normally by wrappers, so it looks optimal as implemented
> > > > this way: affecting future vmas only, which will all be created after
> > > > the execve executed by the wrapper.
> > > > 
> > > > What's the point of messing with the prctl so it mangles the wrapper
> > > > process's own vmas before exec? Messing with those vmas is pure
> > > > wasted CPU for the wrapper use case, which is what the prctl was
> > > > created for.
> > > > 
> > > > Furthermore there would be the risk that a program uses the prctl not
> > > > as a wrapper and then calls the prctl to clear VM_NOHUGEPAGE from
> > > > def_flags assuming the current kABI. The program could assume that the
> > > > vmas instantiated before disabling the prctl still have VM_NOHUGEPAGE
> > > > set (they would not after the change you propose).
> > > > 
> > > > Adding a scan of all vmas to PR_SET_THP_DISABLE to clear VM_NOHUGEPAGE
> > > > on existing vmas looks more complex too, and less fine-grained, so
> > > > probably more complex for userland to manage.
> > > 
> > > I would expect the prctl wouldn't iterate over all vmas, nor would it
> > > modify def_flags anymore. It would just set a flag somewhere in the mm
> > > struct that would be considered in addition to the per-vma flags when
> > > deciding whether to use THP.
> > 
> > Exactly. Something like the below (not even compile tested).
>  
> I gave the patch a quick go; it compiles just fine :)
> It worked for my simple examples: THP is enabled/disabled as expected
> and the vma->vm_flags are indeed unaffected.
> 
> > > We could consider whether MADV_HUGEPAGE should be
> > > able to override the prctl or not.
> > 
> > This should be a master override for any per-vma setting.
> 
> Here you've introduced a change to the current behaviour. Consider the
> following sequence:
> 
> {
> 	prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
> 	address = mmap(...);
> 	madvise(address, len, MADV_HUGEPAGE);
> }
> 
> Currently, transparent_hugepage_enabled(vma) will return true for the vma
> that backs the address; after your patch it will return false.
> The new behaviour may be more correct, I just wanted to bring the change to
> your attention.

The system-wide disable should override any per-vma setting,
IMHO. Why would we disable THP for the whole process otherwise?
Anyway, this needs to be discussed on the linux-api mailing list. I will
try to turn my change into a proper patch and post it there.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
  2017-06-01  8:09                                     ` Michal Hocko
       [not found]                                       ` <20170601080909.GD32677-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2017-06-01 13:45                                       ` Andrea Arcangeli
  2017-06-02  9:11                                         ` Mike Rapoport
  1 sibling, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 2017-06-01 13:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Rapoport, Vlastimil Babka, Kirill A. Shutemov,
	Andrew Morton, Arnd Bergmann, Kirill A. Shutemov,
	Pavel Emelyanov, linux-mm, lkml, Linux API

On Thu, Jun 01, 2017 at 10:09:09AM +0200, Michal Hocko wrote:
> That is a bit surprising. I didn't think that the userfault syscall
> (ioctl) could be faster than a regular #PF, but considering that
> __mcopy_atomic bypasses the page fault path and can be optimized for
> the anon case, it seems we can save some cycles for each page, and so
> the cumulative savings can be visible.

__mcopy_atomic works not just for anonymous memory; hugetlbfs/shmem
are covered too, and there are branches to handle those.

If you were to run more than one precopy pass, UFFDIO_COPY would become
slower than the userland access starting from the second pass.

In light of this, if CRIU can only do a single pass of precopy,
CRIU is probably better off using UFFDIO_COPY than using prctl or
madvise to temporarily turn off THP.

With QEMU, by contrast, we set MADV_HUGEPAGE during precopy on the
destination to maximize THP utilization for all those 2M naturally
aligned guest regions that aren't re-dirtied on the source, so we're
better off not using UFFDIO_COPY in precopy even during the first
pass, to avoid the kernel enter/exit for subpages that are written to
the destination in an already instantiated THP. At least until we teach
QEMU to map 2M at once if possible (UFFDIO_COPY would then also
require an enhancement, because currently it won't map THP on the
fly).

Thanks,
Andrea


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
  2017-06-01 13:45                                       ` Andrea Arcangeli
@ 2017-06-02  9:11                                         ` Mike Rapoport
  0 siblings, 0 replies; 40+ messages in thread
From: Mike Rapoport @ 2017-06-02  9:11 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Michal Hocko, Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Arnd Bergmann, Kirill A. Shutemov, Pavel Emelyanov, linux-mm,
	lkml, Linux API

On Thu, Jun 01, 2017 at 03:45:22PM +0200, Andrea Arcangeli wrote:
> On Thu, Jun 01, 2017 at 10:09:09AM +0200, Michal Hocko wrote:
> > That is a bit surprising. I didn't think that the userfault syscall
> > (ioctl) could be faster than a regular #PF, but considering that
> > __mcopy_atomic bypasses the page fault path and can be optimized for
> > the anon case, it seems we can save some cycles for each page, and so
> > the cumulative savings can be visible.
> 
> __mcopy_atomic works not just for anonymous memory; hugetlbfs/shmem
> are covered too, and there are branches to handle those.
> 
> If you were to run more than one precopy pass, UFFDIO_COPY would become
> slower than the userland access starting from the second pass.
> 
> In light of this, if CRIU can only do a single pass of precopy,
> CRIU is probably better off using UFFDIO_COPY than using prctl or
> madvise to temporarily turn off THP.

CRIU does memory tracking differently from QEMU. Every round of pre-copy in
CRIU means we dump the dirty pages into an image file. The restore then
chooses which image files to use. Anyway, we fill the memory only once, at
restore time, hence UFFDIO_COPY would be better than disabling THP.
 
> With QEMU, by contrast, we set MADV_HUGEPAGE during precopy on the
> destination to maximize THP utilization for all those 2M naturally
> aligned guest regions that aren't re-dirtied on the source, so we're
> better off not using UFFDIO_COPY in precopy even during the first
> pass, to avoid the kernel enter/exit for subpages that are written to
> the destination in an already instantiated THP. At least until we teach
> QEMU to map 2M at once if possible (UFFDIO_COPY would then also
> require an enhancement, because currently it won't map THP on the
> fly).
> 
> Thanks,
> Andrea
> 

--
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2017-06-02  9:11 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1495433562-26625-1-git-send-email-rppt@linux.vnet.ibm.com>
     [not found] ` <20170522114243.2wrdbncilozygbpl@node.shutemov.name>
     [not found]   ` <20170522133559.GE27382@rapoport-lnx>
     [not found]     ` <20170522135548.GA8514@dhcp22.suse.cz>
     [not found]       ` <20170522142927.GG27382@rapoport-lnx>
     [not found]         ` <a9e74c22-1a07-f49a-42b5-497fee85e9c9@suse.cz>
     [not found]           ` <20170524075043.GB3063@rapoport-lnx>
2017-05-24  7:58             ` [PATCH] mm: introduce MADV_CLR_HUGEPAGE Vlastimil Babka
2017-05-24 10:39               ` Mike Rapoport
2017-05-24 11:18                 ` Michal Hocko
2017-05-24 14:25                   ` Pavel Emelyanov
2017-05-24 14:27                   ` Mike Rapoport
2017-05-24 15:22                     ` Andrea Arcangeli
2017-05-30  7:44                     ` Michal Hocko
     [not found]                       ` <20170530074408.GA7969-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2017-05-30 10:19                         ` Mike Rapoport
2017-05-30 10:39                           ` Michal Hocko
     [not found]                             ` <20170530103930.GB7969-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2017-05-30 14:04                               ` Andrea Arcangeli
2017-05-30 14:39                                 ` Michal Hocko
     [not found]                                   ` <20170530143941.GK7969-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2017-05-30 14:56                                     ` Michal Hocko
     [not found]                                       ` <20170530145632.GL7969-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2017-05-30 16:06                                         ` Andrea Arcangeli
2017-05-31  6:30                                           ` Vlastimil Babka
2017-05-31  8:24                                             ` Michal Hocko
2017-05-31  9:27                                               ` Mike Rapoport
2017-05-31 10:24                                                 ` Michal Hocko
     [not found]                                               ` <20170531082414.GB27783-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2017-05-31 10:22                                                 ` Michal Hocko
2017-06-01 11:00                                               ` Mike Rapoport
2017-06-01 12:27                                                 ` Michal Hocko
2017-05-30 15:43                                   ` Andrea Arcangeli
2017-05-31 12:08                                     ` Michal Hocko
     [not found]                                       ` <20170531120822.GL27783-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2017-05-31 12:39                                         ` Mike Rapoprt
2017-05-31 14:18                                           ` Andrea Arcangeli
     [not found]                                             ` <20170531141809.GB302-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-05-31 14:32                                               ` Michal Hocko
2017-05-31 15:46                                                 ` Andrea Arcangeli
2017-06-01  6:58                                               ` Mike Rapoport
     [not found]                                           ` <8FA5E4C2-D289-4AF5-AA09-6C199E58F9A5-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2017-05-31 14:19                                             ` Michal Hocko
2017-06-01  6:53                                   ` Mike Rapoport
2017-06-01  8:09                                     ` Michal Hocko
     [not found]                                       ` <20170601080909.GD32677-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2017-06-01  8:35                                         ` Mike Rapoport
2017-06-01 13:45                                       ` Andrea Arcangeli
2017-06-02  9:11                                         ` Mike Rapoport
2017-05-31  9:08                               ` Mike Rapoport
2017-05-31 12:05                                 ` Michal Hocko
2017-05-31 12:25                                   ` Mike Rapoprt
2017-05-24 11:31                 ` Vlastimil Babka
2017-05-24 14:28                   ` Pavel Emelyanov
2017-05-24 14:54                     ` Vlastimil Babka
2017-05-24 15:13                       ` Mike Rapoport

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).