From: Mike Rapoport <rppt@linux.vnet.ibm.com>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	Andrew Morton <akpm@linux-foundation.org>,
	Arnd Bergmann <arnd@arndb.de>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Pavel Emelyanov <xemul@virtuozzo.com>,
	linux-mm <linux-mm@kvack.org>,
	lkml <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
Date: Mon, 22 May 2017 20:51:07 +0300
Message-ID: <20170522175106.GA15644@rapoport-lnx>
In-Reply-To: <a9e74c22-1a07-f49a-42b5-497fee85e9c9@suse.cz>

On Mon, May 22, 2017 at 05:52:47PM +0200, Vlastimil Babka wrote:
> On 05/22/2017 04:29 PM, Mike Rapoport wrote:
> > On Mon, May 22, 2017 at 03:55:48PM +0200, Michal Hocko wrote:
> >> On Mon 22-05-17 16:36:00, Mike Rapoport wrote:
> >>> On Mon, May 22, 2017 at 02:42:43PM +0300, Kirill A. Shutemov wrote:
> >>>> On Mon, May 22, 2017 at 09:12:42AM +0300, Mike Rapoport wrote:
> >>>>> Currently applications can explicitly enable or disable THP for a memory
> >>>>> region using MADV_HUGEPAGE or MADV_NOHUGEPAGE. However, once either of
> >>>>> these advice values is used, the region will always have the
> >>>>> VM_HUGEPAGE or VM_NOHUGEPAGE flag set in vma->vm_flags.
> >>>>> MADV_CLR_HUGEPAGE resets both of these flags and allows managing THP in
> >>>>> the region according to system-wide settings.
> >>>>
> >>>> Seems reasonable. But could you describe a use case where it is useful in
> >>>> the real world?
> >>>
> >>> My use case was a combination of pre- and post-copy migration of containers
> >>> with CRIU.
> >>> In this case we populate a part of a memory region with data that was saved
> >>> during the pre-copy stage. Afterwards, the region is registered with
> >>> userfaultfd and we expect to get page faults for the parts of the region
> >>> that were not yet populated. However, khugepaged collapses the pages and
> >>> the page faults we would expect do not occur.
> >>
> >> I am not sure I understand the problem. Do I get it right that
> >> khugepaged will effectively corrupt the memory by collapsing a range
> >> which is not yet fully populated? If so, shouldn't that be fixed in
> >> khugepaged rather than by adding yet another madvise command? Also, how do
> >> you prevent races? (Say you set VM_NOHUGEPAGE while khugepaged is in the
> >> middle of the operation; it sees a collapsible vma and you get the same
> >> result.)
> > 
> > I probably didn't explain it well.
> > 
> > The range is intentionally not populated. When we combine pre- and
> > post-copy for process migration, we create memory pre-dump without stopping
> > the process, then we freeze the process without dumping the pages it has
> > dirtied between pre-dump and freeze, and then, during restore, we populate
> > the dirtied pages using userfaultfd.
> > 
> > When CRIU restores a process in such a scenario, it does something like:
> > 
> > * mmap() memory region
> > * fill in the pages that were collected during the pre-dump
> > * do some other stuff
> > * register memory region with userfaultfd
> > * populate the missing memory on demand
> > 
> > khugepaged collapses the pages in the partially populated regions before we
> > have a chance to register these regions with userfaultfd, which would
> > prevent the collapse.
> > 
> > We could have used MADV_NOHUGEPAGE right after the mmap() call, and then
> > there would be no race because there would be nothing for khugepaged to
> > collapse at that point. But the problem is that we have no way to reset
> > *HUGEPAGE flags after the memory restore is complete.
> 
> Hmm, I wouldn't be so sure that this is indeed race-free. Can you check that
> this scenario is indeed impossible?
> 
> - you do the mmap
> - khugepaged will choose the process' mm to scan
> - khugepaged will get to the vma in question, it doesn't have
> MADV_NOHUGEPAGE yet
> - you set MADV_NOHUGEPAGE on the vma
> - you start populating the vma
> - khugepaged sees the vma is non-empty, collapses
> 
> Unless I'm wrong, the racers will hold mmap_sem only for reading when
> setting/checking MADV_NOHUGEPAGE? That might actually be considered a bug.

madvise(MADV_*HUGEPAGE) takes mmap_sem for writing, so it is safe.
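
As a minimal sketch of the madvise() approach discussed above (this is not
CRIU's actual code): map an anonymous region and set MADV_NOHUGEPAGE before
any page is touched, so there is nothing for khugepaged to collapse yet. The
helper name map_without_thp() and its `advised` out-parameter are made up for
this example.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map `len` bytes of anonymous memory and opt the
 * region out of THP before it is populated.
 *
 * Returns the mapping, or MAP_FAILED if mmap() failed.
 * On success, *advised is 1 if madvise() succeeded and 0 otherwise
 * (for example, EINVAL on a kernel built without THP support).
 */
void *map_without_thp(size_t len, int *advised)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return MAP_FAILED;

	/* Set the advice while the range is still empty. */
	*advised = (madvise(p, len, MADV_NOHUGEPAGE) == 0);
	return p;
}
```

The limitation raised in the thread still applies: once the advice is set,
there is no madvise() value that returns the region to the system-wide
default.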
 
> However, can't you use prctl(PR_SET_THP_DISABLE) instead? "If arg2 has a
> nonzero value, the flag is set, otherwise it is cleared." says the
> manpage. Do it before the mmap and you avoid the race as well?

I missed that one, thanks Vlastimil!
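
For reference, a minimal sketch of the prctl() approach Vlastimil suggests,
assuming a kernel and libc that provide PR_SET_THP_DISABLE and
PR_GET_THP_DISABLE. The wrapper names thp_set_disabled() and
thp_get_disabled() are hypothetical, chosen only for this example.

```c
#include <sys/prctl.h>

/*
 * Hypothetical wrapper: set (disable != 0) or clear (disable == 0) the
 * per-process "THP disable" flag.
 * Returns 0 on success, -1 with errno set on failure.
 */
int thp_set_disabled(int disable)
{
	/* The remaining arguments are passed as zero. */
	return prctl(PR_SET_THP_DISABLE, disable ? 1UL : 0UL, 0UL, 0UL, 0UL);
}

/*
 * Hypothetical wrapper: query the flag.
 * Returns 1 if set, 0 if clear, -1 with errno set on failure.
 */
int thp_get_disabled(void)
{
	return prctl(PR_GET_THP_DISABLE, 0UL, 0UL, 0UL, 0UL);
}
```

In the restore flow described above, the idea would be to call
thp_set_disabled(1) before the mmap(), and thp_set_disabled(0) once the
memory restore is complete. How the flag affects mappings that already exist
may depend on the kernel version, so treat this as a sketch rather than a
guarantee.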
