From: Andrea Arcangeli <aarcange@redhat.com>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/4] thp: fix MADV_DONTNEED vs. numa balancing race
Date: Tue, 16 May 2017 22:29:19 +0200	[thread overview]
Message-ID: <20170516202919.GA2843@redhat.com> (raw)
In-Reply-To: <f105f6a5-bb5e-9480-6b2e-d2d15f631af9@suse.cz>

On Wed, Apr 12, 2017 at 03:33:35PM +0200, Vlastimil Babka wrote:
> On 03/02/2017 04:10 PM, Kirill A. Shutemov wrote:
> > In case prot_numa, we are under down_read(mmap_sem). It's critical
> > to not clear pmd intermittently to avoid race with MADV_DONTNEED
> > which is also under down_read(mmap_sem):
> > 
> > 	CPU0:				CPU1:
> > 				change_huge_pmd(prot_numa=1)
> > 				 pmdp_huge_get_and_clear_notify()
> > madvise_dontneed()
> >  zap_pmd_range()
> >   pmd_trans_huge(*pmd) == 0 (without ptl)
> >   // skip the pmd
> > 				 set_pmd_at();
> > 				 // pmd is re-established
> > 
> > The race makes MADV_DONTNEED miss the huge pmd and don't clear it
> > which may break userspace.
> > 
> > Found by code analysis, never saw triggered.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  mm/huge_memory.c | 34 +++++++++++++++++++++++++++++++++-
> >  1 file changed, 33 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index e7ce73b2b208..bb2b3646bd78 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1744,7 +1744,39 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> >  	if (prot_numa && pmd_protnone(*pmd))
> >  		goto unlock;
> >  
> > -	entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
> > +	/*
> > +	 * In case prot_numa, we are under down_read(mmap_sem). It's critical
> > +	 * to not clear pmd intermittently to avoid race with MADV_DONTNEED
> > +	 * which is also under down_read(mmap_sem):
> > +	 *
> > +	 *	CPU0:				CPU1:
> > +	 *				change_huge_pmd(prot_numa=1)
> > +	 *				 pmdp_huge_get_and_clear_notify()
> > +	 * madvise_dontneed()
> > +	 *  zap_pmd_range()
> > +	 *   pmd_trans_huge(*pmd) == 0 (without ptl)
> > +	 *   // skip the pmd
> > +	 *				 set_pmd_at();
> > +	 *				 // pmd is re-established
> > +	 *
> > +	 * The race makes MADV_DONTNEED miss the huge pmd and don't clear it
> > +	 * which may break userspace.
> > +	 *
> > +	 * pmdp_invalidate() is required to make sure we don't miss
> > +	 * dirty/young flags set by hardware.
> > +	 */
> > +	entry = *pmd;
> > +	pmdp_invalidate(vma, addr, pmd);
> > +
> > +	/*
> > +	 * Recover dirty/young flags.  It relies on pmdp_invalidate to not
> > +	 * corrupt them.
> > +	 */
> 
> pmdp_invalidate() does:
> 
>         pmd_t entry = *pmdp;
>         set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
> 
> so it's not atomic, and if the CPU sets dirty or accessed bits in the
> middle of this, they will be lost?

I agree it looks like the dirty bit can be lost. Furthermore, this also
loses an MMU notifier invalidate, which will lead to corruption at the
secondary MMU level: the secondary MMU will keep using the old
protection permissions and can keep writing to a wrprotected page.
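
To make the two problems concrete, here is the replacement sequence
annotated (the comments are mine, and the pmdp_invalidate() internals
are paraphrased from the generic mm/pgtable-generic.c version, not
quoted):

	entry = *pmd;			/* plain read, nothing atomic */
	pmdp_invalidate(vma, addr, pmd);
	/*
	 * Internally this is yet another plain read of *pmd followed by
	 * set_pmd_at(mm, addr, pmd, pmd_mknotpresent(...)): any dirty or
	 * young bit the hardware sets between a read and that write can
	 * be lost.  It also issues no mmu_notifier_invalidate_range(),
	 * unlike the pmdp_huge_get_and_clear_notify() it replaces, so a
	 * secondary MMU keeps the stale translation with the old
	 * permissions.
	 */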

> 
> But I don't see how the other invalidate caller
> __split_huge_pmd_locked() deals with this either. Andrea, any idea?

The original code I wrote did this in __split_huge_page_map to create
the "entry" to establish in the pte pagetables:

	entry = mk_pte(page + i, vma->vm_page_prot);
	entry = maybe_mkwrite(pte_mkdirty(entry), vma);

For anonymous memory the dirty bit is only meaningful for swapping, and
THP couldn't be swapped, so setting it unconditionally avoided any issue
with the non-atomic pmdp_invalidate; pmdp_establish sequence.

pmdp_invalidate is needed primarily to avoid aliasing of two different
TLB translations pointing from the same virtual address to the same
physical address (e.g. a huge and a non-huge one), which used to trigger
machine checks, while still keeping the pmd huge at all times. Back
then the pmd was also marked splitting: splitting is a software bit, so
userland could still access the data; the splitting bit only blocked
kernel code from manipulating the pmd, similar to what the migration
entry does right now upstream. The difference is that migration entries
prevent userland from accessing the page during the split, which is
less efficient than the splitting bit, but at least they are only used
for the physical split; back then there was no difference between
virtual and physical split, and the physical split is less frequent
than the virtual one right now.

It looks like this needs a pmdp_populate that atomically grabs the
value of the pmd and returns it, like pmdp_huge_get_and_clear_notify
does, plus a _notify variant to use when "freeze" is false (if freeze
is true the MMU notifier invalidate must already have happened when the
pmd was set to a migration entry). If pmdp_populate_notify
(freeze==false) / pmdp_populate (freeze==true) returned the old pmd
value atomically with xchg() (just, instead of setting it to 0, we
should set it to the mknotpresent one), then we could set the dirty bit
on the ptes (in __split_huge_pmd_locked) or in the pmd itself in
change_huge_pmd accordingly.
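
A rough sketch of that for the change_huge_pmd() side; the helper names
are the hypothetical ones above (nothing like this exists in the tree
today), and the xchg() form assumes an architecture like x86-64 where
the whole pmd can be exchanged in one instruction:

	/*
	 * Sketch only; assumes the usual huge_mm.h / mmu_notifier.h
	 * context.  Atomically replace the pmd with its not-present image
	 * and return the old value, so dirty/young bits set by the
	 * hardware end up in the return value instead of being lost.
	 * TLB flushing is left to the caller, as with
	 * pmdp_huge_get_and_clear().
	 */
	static inline pmd_t pmdp_populate(struct vm_area_struct *vma,
					  unsigned long addr, pmd_t *pmdp)
	{
		/*
		 * Other architectures would need a cmpxchg loop or their
		 * existing ptl-serialised helpers here.  The transient
		 * not-present value is overwritten by the caller right
		 * after, so its own flags don't matter.
		 */
		return xchg(pmdp, pmd_mknotpresent(*pmdp));
	}

	static inline pmd_t pmdp_populate_notify(struct vm_area_struct *vma,
						 unsigned long addr,
						 pmd_t *pmdp)
	{
		pmd_t old = pmdp_populate(vma, addr, pmdp);

		/* unlike plain pmdp_invalidate(), also tell secondary MMUs */
		mmu_notifier_invalidate_range(vma->vm_mm,
					      addr & HPAGE_PMD_MASK,
					      (addr & HPAGE_PMD_MASK) +
					      HPAGE_PMD_SIZE);
		return old;
	}

change_huge_pmd() would then not need any separate dirty/young recovery
step, because the returned entry already carries whatever the hardware
set:

	entry = pmdp_populate_notify(vma, addr, pmd);
	entry = pmd_modify(entry, newprot);
	set_pmd_at(mm, addr, pmd, entry);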

If the "dirty" flag information is obtained by the pmd read before
calling pmdp_invalidate is not ok (losing _notify also not ok).

Thanks!
Andrea


Thread overview: 19+ messages
2017-03-02 15:10 [PATCH 0/4] thp: fix few MADV_DONTNEED races Kirill A. Shutemov
2017-03-02 15:10 ` [PATCH 1/4] thp: reduce indentation level in change_huge_pmd() Kirill A. Shutemov
2017-04-12 11:37   ` Vlastimil Babka
2017-03-02 15:10 ` [PATCH 2/4] thp: fix MADV_DONTNEED vs. numa balancing race Kirill A. Shutemov
2017-03-03 17:17   ` Dave Hansen
2017-04-12 13:33   ` Vlastimil Babka
2017-05-16 14:54     ` Vlastimil Babka
2017-05-16 20:29     ` Andrea Arcangeli [this message]
2017-05-23 12:42       ` Vlastimil Babka
2017-06-09  8:21         ` Vlastimil Babka
2017-03-02 15:10 ` [PATCH 3/4] thp: fix MADV_DONTNEED vs. MADV_FREE race Kirill A. Shutemov
2017-03-03  5:35   ` Hillf Danton
2017-03-03 10:26     ` Kirill A. Shutemov
2017-03-06  1:44       ` Minchan Kim
2017-03-07 14:04         ` Kirill A. Shutemov
2017-03-06  2:49   ` Aneesh Kumar K.V
2017-03-07 13:52     ` Kirill A. Shutemov
2017-03-02 15:10 ` [PATCH 4/4] thp: fix MADV_DONTNEED vs clear soft dirty race Kirill A. Shutemov
2017-03-03 22:29   ` Andrew Morton
