linux-api.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Mike Rapoport <rppt@linux.vnet.ibm.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	Andrew Morton <akpm@linux-foundation.org>,
	Arnd Bergmann <arnd@arndb.de>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Pavel Emelyanov <xemul@virtuozzo.com>,
	linux-mm <linux-mm@kvack.org>,
	lkml <linux-kernel@vger.kernel.org>,
	Linux API <linux-api@vger.kernel.org>
Subject: Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
Date: Wed, 24 May 2017 17:27:36 +0300	[thread overview]
Message-ID: <20170524142735.GF3063@rapoport-lnx> (raw)
In-Reply-To: <20170524111800.GD14733@dhcp22.suse.cz>

On Wed, May 24, 2017 at 01:18:00PM +0200, Michal Hocko wrote:
> On Wed 24-05-17 13:39:48, Mike Rapoport wrote:
> > On Wed, May 24, 2017 at 09:58:06AM +0200, Vlastimil Babka wrote:
> > > On 05/24/2017 09:50 AM, Mike Rapoport wrote:
> > > > On Mon, May 22, 2017 at 05:52:47PM +0200, Vlastimil Babka wrote:
> > > >> On 05/22/2017 04:29 PM, Mike Rapoport wrote:
> > > >>>
> > > >>> Probably I didn't explained it too well.
> > > >>>
> > > >>> The range is intentionally not populated. When we combine pre- and
> > > >>> post-copy for process migration, we create memory pre-dump without stopping
> > > >>> the process, then we freeze the process without dumping the pages it has
> > > >>> dirtied between pre-dump and freeze, and then, during restore, we populate
> > > >>> the dirtied pages using userfaultfd.
> > > >>>
> > > >>> When CRIU restores a process in such scenario, it does something like:
> > > >>>
> > > >>> * mmap() memory region
> > > >>> * fill in the pages that were collected during the pre-dump
> > > >>> * do some other stuff
> > > >>> * register memory region with userfaultfd
> > > >>> * populate the missing memory on demand
> > > >>>
> > > >>> khugepaged collapses the pages in the partially populated regions before we
> > > >>> have a chance to register these regions with userfaultfd, which would
> > > >>> prevent the collapse.
> > > >>>
> > > >>> We could have used MADV_NOHUGEPAGE right after the mmap() call, and then
> > > >>> there would be no race because there would be nothing for khugepaged to
> > > >>> collapse at that point. But the problem is that we have no way to reset
> > > >>> *HUGEPAGE flags after the memory restore is complete.
> > > >>
> > > >> Hmm, I wouldn't be that sure if this is indeed race-free. Check that
> > > >> this scenario is indeed impossible?
> > > >>
> > > >> - you do the mmap
> > > >> - khugepaged will choose the process' mm to scan
> > > >> - khugepaged will get to the vma in question, it doesn't have
> > > >> MADV_NOHUGEPAGE yet
> > > >> - you set MADV_NOHUGEPAGE on the vma
> > > >> - you start populating the vma
> > > >> - khugepaged sees the vma is non-empty, collapses
> > > >>
> > > >> unless I'm wrong, the racers will have mmap_sem for reading only when
> > > >> setting/checking the MADV_NOHUGEPAGE? Might be actually considered a bug.
> > > >>
> > > >> However, can't you use prctl(PR_SET_THP_DISABLE) instead? "If arg2 has a
> > > >> nonzero value, the flag is set, otherwise it is cleared." says the
> > > >> manpage. Do it before the mmap and you avoid the race as well?
> > > > 
> > > > Unfortunately, prctl(PR_SET_THP_DISABLE) didn't help :(
> > > > When I've tried to use it, I've ended up with VM_NOHUGEPAGE set on all VMAs
> > > > created after prctl(). This returns me to the state when checkpoint-restore
> > > > alters the application vma->vm_flags although it shouldn't and I do not see
> > > > a way to fix it using existing interfaces.
> > > 
> > > [CC linux-api, should have been done in the initial posting already]
> > 
> > Sorry, missed that.
> >  
> > > Hm so the prctl does:
> > > 
> > >                 if (arg2)
> > >                         me->mm->def_flags |= VM_NOHUGEPAGE;
> > >                 else
> > >                         me->mm->def_flags &= ~VM_NOHUGEPAGE;
> > > 
> > > That's rather lazy implementation IMHO. Could we change it so the flag
> > > is stored elsewhere in the mm, and the code that decides to (not) use
> > > THP will check both the per-vma flag and the per-mm flag?
> > 
> > I afraid I don't understand how that can help.
> > What we need is an ability to temporarily disable collapse of the pages in
> > VMAs that do not have VM_*HUGEPAGE flags set and that after we re-enable
> > THP, the vma->vm_flags for those VMAs will remain intact.
> 
> Why cannot khugepaged simply skip over all VMAs which have userfault
> regions registered? This would sound like a less error prone approach to
> me.

khugepaged does skip over VMAs which have userfault. We could register the
regions with userfault before populating them to avoid collapses in the
transition period. But then we'll have to populate these regions with
UFFDIO_COPY which adds quite an overhead.

> -- 
> Michal Hocko
> SUSE Labs
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  parent reply	other threads:[~2017-05-24 14:27 UTC|newest]

Thread overview: 40+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <1495433562-26625-1-git-send-email-rppt@linux.vnet.ibm.com>
     [not found] ` <20170522114243.2wrdbncilozygbpl@node.shutemov.name>
     [not found]   ` <20170522133559.GE27382@rapoport-lnx>
     [not found]     ` <20170522135548.GA8514@dhcp22.suse.cz>
     [not found]       ` <20170522142927.GG27382@rapoport-lnx>
     [not found]         ` <a9e74c22-1a07-f49a-42b5-497fee85e9c9@suse.cz>
     [not found]           ` <20170524075043.GB3063@rapoport-lnx>
2017-05-24  7:58             ` [PATCH] mm: introduce MADV_CLR_HUGEPAGE Vlastimil Babka
2017-05-24 10:39               ` Mike Rapoport
2017-05-24 11:18                 ` Michal Hocko
2017-05-24 14:25                   ` Pavel Emelyanov
2017-05-24 14:27                   ` Mike Rapoport [this message]
2017-05-24 15:22                     ` Andrea Arcangeli
2017-05-30  7:44                     ` Michal Hocko
     [not found]                       ` <20170530074408.GA7969-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2017-05-30 10:19                         ` Mike Rapoport
2017-05-30 10:39                           ` Michal Hocko
     [not found]                             ` <20170530103930.GB7969-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2017-05-30 14:04                               ` Andrea Arcangeli
2017-05-30 14:39                                 ` Michal Hocko
     [not found]                                   ` <20170530143941.GK7969-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2017-05-30 14:56                                     ` Michal Hocko
     [not found]                                       ` <20170530145632.GL7969-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2017-05-30 16:06                                         ` Andrea Arcangeli
2017-05-31  6:30                                           ` Vlastimil Babka
2017-05-31  8:24                                             ` Michal Hocko
2017-05-31  9:27                                               ` Mike Rapoport
2017-05-31 10:24                                                 ` Michal Hocko
     [not found]                                               ` <20170531082414.GB27783-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2017-05-31 10:22                                                 ` Michal Hocko
2017-06-01 11:00                                               ` Mike Rapoport
2017-06-01 12:27                                                 ` Michal Hocko
2017-05-30 15:43                                   ` Andrea Arcangeli
2017-05-31 12:08                                     ` Michal Hocko
     [not found]                                       ` <20170531120822.GL27783-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2017-05-31 12:39                                         ` Mike Rapoprt
2017-05-31 14:18                                           ` Andrea Arcangeli
     [not found]                                             ` <20170531141809.GB302-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-05-31 14:32                                               ` Michal Hocko
2017-05-31 15:46                                                 ` Andrea Arcangeli
2017-06-01  6:58                                               ` Mike Rapoport
     [not found]                                           ` <8FA5E4C2-D289-4AF5-AA09-6C199E58F9A5-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2017-05-31 14:19                                             ` Michal Hocko
2017-06-01  6:53                                   ` Mike Rapoport
2017-06-01  8:09                                     ` Michal Hocko
     [not found]                                       ` <20170601080909.GD32677-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2017-06-01  8:35                                         ` Mike Rapoport
2017-06-01 13:45                                       ` Andrea Arcangeli
2017-06-02  9:11                                         ` Mike Rapoport
2017-05-31  9:08                               ` Mike Rapoport
2017-05-31 12:05                                 ` Michal Hocko
2017-05-31 12:25                                   ` Mike Rapoprt
2017-05-24 11:31                 ` Vlastimil Babka
2017-05-24 14:28                   ` Pavel Emelyanov
2017-05-24 14:54                     ` Vlastimil Babka
2017-05-24 15:13                       ` Mike Rapoport

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170524142735.GF3063@rapoport-lnx \
    --to=rppt@linux.vnet.ibm.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=arnd@arndb.de \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=kirill@shutemov.name \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=vbabka@suse.cz \
    --cc=xemul@virtuozzo.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).