linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* mm: fix BUG in __split_huge_page_pmd
@ 2013-10-15 11:08 Hugh Dickins
  2013-10-15 11:32 ` Kirill A. Shutemov
  2013-10-15 14:34 ` Andrea Arcangeli
  0 siblings, 2 replies; 12+ messages in thread
From: Hugh Dickins @ 2013-10-15 11:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, David Rientjes, Kirill A. Shutemov,
	Naoya Horiguchi, linux-kernel, linux-mm

Occasionally we hit the BUG_ON(pmd_trans_huge(*pmd)) at the end of
__split_huge_page_pmd(): seen when doing madvise(,,MADV_DONTNEED).

It's invalid: we don't always have down_write of mmap_sem there:
a racing do_huge_pmd_wp_page() might have copied-on-write to another
huge page before our split_huge_page() got the anon_vma lock.

Forget the BUG_ON, just go back and try again if this happens.
    
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org
---

 mm/huge_memory.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

--- 3.12-rc5/mm/huge_memory.c	2013-09-16 17:37:56.811072270 -0700
+++ linux/mm/huge_memory.c	2013-10-15 03:40:02.044138488 -0700
@@ -2697,6 +2697,7 @@ void __split_huge_page_pmd(struct vm_are
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
+again:
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_trans_huge(*pmd))) {
@@ -2719,7 +2720,14 @@ void __split_huge_page_pmd(struct vm_are
 	split_huge_page(page);
 
 	put_page(page);
-	BUG_ON(pmd_trans_huge(*pmd));
+
+	/*
+	 * We don't always have down_write of mmap_sem here: a racing
+	 * do_huge_pmd_wp_page() might have copied-on-write to another
+	 * huge page before our split_huge_page() got the anon_vma lock.
+	 */
+	if (unlikely(pmd_trans_huge(*pmd)))
+		goto again;
 }
 
 void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: mm: fix BUG in __split_huge_page_pmd
  2013-10-15 11:08 mm: fix BUG in __split_huge_page_pmd Hugh Dickins
@ 2013-10-15 11:32 ` Kirill A. Shutemov
  2013-10-15 14:41   ` Andrea Arcangeli
  2013-10-15 14:34 ` Andrea Arcangeli
  1 sibling, 1 reply; 12+ messages in thread
From: Kirill A. Shutemov @ 2013-10-15 11:32 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Andrea Arcangeli, David Rientjes,
	Kirill A. Shutemov, Naoya Horiguchi, linux-kernel, linux-mm

Hugh Dickins wrote:
> Occasionally we hit the BUG_ON(pmd_trans_huge(*pmd)) at the end of
> __split_huge_page_pmd(): seen when doing madvise(,,MADV_DONTNEED).
> 
> It's invalid: we don't always have down_write of mmap_sem there:
> a racing do_huge_pmd_wp_page() might have copied-on-write to another
> huge page before our split_huge_page() got the anon_vma lock.
> 
> Forget the BUG_ON, just go back and try again if this happens.
>     
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: stable@vger.kernel.org

Looks reasonable to me.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

madvise(MADV_DONTNEED) was aproblematic with THP before. Is a big win having
mmap_sem taken on read rather than on write for it?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: mm: fix BUG in __split_huge_page_pmd
  2013-10-15 11:08 mm: fix BUG in __split_huge_page_pmd Hugh Dickins
  2013-10-15 11:32 ` Kirill A. Shutemov
@ 2013-10-15 14:34 ` Andrea Arcangeli
  2013-10-15 14:48   ` Kirill A. Shutemov
  1 sibling, 1 reply; 12+ messages in thread
From: Andrea Arcangeli @ 2013-10-15 14:34 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, David Rientjes, Kirill A. Shutemov,
	Naoya Horiguchi, linux-kernel, linux-mm

Hi Hugh,

On Tue, Oct 15, 2013 at 04:08:28AM -0700, Hugh Dickins wrote:
> Occasionally we hit the BUG_ON(pmd_trans_huge(*pmd)) at the end of
> __split_huge_page_pmd(): seen when doing madvise(,,MADV_DONTNEED).
> 
> It's invalid: we don't always have down_write of mmap_sem there:
> a racing do_huge_pmd_wp_page() might have copied-on-write to another
> huge page before our split_huge_page() got the anon_vma lock.
> 

I don't get exactly the scenario with do_huge_pmd_wp_page(), could you
elaborate?

My scenario is that in the below line another madvise(MADV_DONTNEED)
runs:

	spin_unlock(&mm->page_table_lock);
	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

	------ madvise(MADV_DONTNEED) from another thread zapping the
               entire 2m and clearing the pmd

	split_huge_page(page);

then split_huge_page does nothing because the page has been freed. And
then just before the BUG_ON (after split_huge_page returns) a new page
fault fills in an anonymous page just before the BUG_ON.

And the crashing thread would always be a partial MADV_DONTNEED with a
misaligned end address (so requiring a split_huge_page to zap 4k
subpages).

So the testcase required would be: 2 concurrent MADV_DONTNEED, where
the first has a misaligned "end" address (the one that triggers the
BUG_ON), the second MADV_DONTNEED has a end address that covers the
whole hugepmd, and a trans huge page fault happening just before the
false positive triggers.

Maybe your scenario with do_huge_pmd_wp_page() is simpler?

> Forget the BUG_ON, just go back and try again if this happens.
>     
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: stable@vger.kernel.org
> ---
> 
>  mm/huge_memory.c |   10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> --- 3.12-rc5/mm/huge_memory.c	2013-09-16 17:37:56.811072270 -0700
> +++ linux/mm/huge_memory.c	2013-10-15 03:40:02.044138488 -0700
> @@ -2697,6 +2697,7 @@ void __split_huge_page_pmd(struct vm_are
>  
>  	mmun_start = haddr;
>  	mmun_end   = haddr + HPAGE_PMD_SIZE;
> +again:
>  	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
>  	spin_lock(&mm->page_table_lock);
>  	if (unlikely(!pmd_trans_huge(*pmd))) {
> @@ -2719,7 +2720,14 @@ void __split_huge_page_pmd(struct vm_are
>  	split_huge_page(page);
>  
>  	put_page(page);
> -	BUG_ON(pmd_trans_huge(*pmd));
> +
> +	/*
> +	 * We don't always have down_write of mmap_sem here: a racing
> +	 * do_huge_pmd_wp_page() might have copied-on-write to another
> +	 * huge page before our split_huge_page() got the anon_vma lock.
> +	 */
> +	if (unlikely(pmd_trans_huge(*pmd)))
> +		goto again;
>  }
>  
>  void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,

While it is correct, the looping is misleading. We should document
simply that any caller of split_huge_page_pmd with the mmap_sem for
reading, should use pmd_none_or_trans_huge_or_clear_bad or
pmd_trans_unstable before making any assumption on the pmd being
"stable" after split_huge_page returns.

This is what zap_pmd_range of course does already so it's perfectly
safe if split_huge_page_pmd returns with pmd_trans_huge(*pmd) == true.

Even with the loop, if the concurrent page faults that maps a
trans_huge_pmd (replacing the pmd_none) triggers just after the 'goto
again', but before the 'ret' instruction, the function still returns
with a trans huge page mapped in the *pmd.

In short I think either we try the more strict but more tricky
approach from Hillf https://patchwork.kernel.org/patch/2178311/ and we
also take into account the mapcount > 0 in the __split_huge_page loop
to make Hillf patch safe (I think currently it isn't), or we just nuke
the BUG_ON completely. And we should document in the source what
happens with mmap_sem hold for reading in MADV_DONTNEED.

Yet another approach would be to add something like down_write_trylock
in front of the check after converting it to a VM_BUG_ON.

The patch in patchwork, despite trying to be more strict, it doesn't
look safe to me because I tend to think we should also change
__split_huge_page to return the "mapcount", so that split_huge_page
will fail also if the mapcount was 0 in __split_huge_page.

I believe split_huge_page_to_list may obtain the anon_vma lock (so the
page was mapped) but then zap_huge_pmd obtains the page_table_lock
before __split_huge_page runs __split_huge_page_splitting and freezes
any attempt zap_huge_pmd or any other attempt of MADV_DONTNEED with
truncation of the pmd (waiting in another split_huge_page). So I
believe it is possible (and safe) for it to run with mapcount 0 (doing
nothing). But it doesn't return failure in that case, but if mapcount
is 0, the pmd may not have been converted to stable state. This is why
that change would be needed to use the patchwork patch.

The only place where we depend on split_huge_page retval so far is
add_to_swap. But regular anon pages can also be zapped there. So it
should be able to cope and the swapcache will free itself when all
refcounts are dropped. So we don't need to worry about the above I
think in add_to_swap.

Could you also post the stack trace so we compare to the one in
patchwork? Ideally this happened for you also in the context of a
MADV_DONTNEED, the other places walking pagetables with only the
mmap_sem for reading usually don't mangle pagetables.

Also note the workload I described with two MADV_DONTNEED running
concurrently and a page fault too, on the very same transhugepage
looks a non deterministic workload, but still we need to avoid false
positives in those non deterministic cases.

The two stack traces I have for this problem (patchwork above and
below bugzilla) all confirms it's happening only inside MADV_DONTNEED
confirming my theory above.

https://bugzilla.redhat.com/show_bug.cgi?id=949735

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: mm: fix BUG in __split_huge_page_pmd
  2013-10-15 11:32 ` Kirill A. Shutemov
@ 2013-10-15 14:41   ` Andrea Arcangeli
  0 siblings, 0 replies; 12+ messages in thread
From: Andrea Arcangeli @ 2013-10-15 14:41 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Andrew Morton, David Rientjes, Naoya Horiguchi,
	linux-kernel, linux-mm

On Tue, Oct 15, 2013 at 02:32:54PM +0300, Kirill A. Shutemov wrote:
> Hugh Dickins wrote:
> > Occasionally we hit the BUG_ON(pmd_trans_huge(*pmd)) at the end of
> > __split_huge_page_pmd(): seen when doing madvise(,,MADV_DONTNEED).
> > 
> > It's invalid: we don't always have down_write of mmap_sem there:
> > a racing do_huge_pmd_wp_page() might have copied-on-write to another
> > huge page before our split_huge_page() got the anon_vma lock.
> > 
> > Forget the BUG_ON, just go back and try again if this happens.
> >     
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > Cc: stable@vger.kernel.org
> 
> Looks reasonable to me.
> 
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> 
> madvise(MADV_DONTNEED) was aproblematic with THP before. Is a big win having
> mmap_sem taken on read rather than on write for it?

Yeah it caused all those pmd_trans_unstable and
pmd_none_or_trans_huge_or_clear_bad and pmd_read_atomic in common
code. But I didn't want to regress the scalability of
MADV_DONTNEED... I think various apps use MADV_DONTNEED to free memory
(including very KVM in the balloon driver and probably JVM and other JIT).

none or huge pmds are unstable without mmap_sem for writing and
without page_table_lock (or in general pmd_trans_huge_lock).

It's identical to the pte being unstable if mmap_sem is held for
reading and we don't hold the PT lock, except the pte can only have
two states and they're both unstable.

hugepmds have three states, and the only stable state of the tree is
when it points to a regular pte (the third state that 4k ptes cannot have).

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: mm: fix BUG in __split_huge_page_pmd
  2013-10-15 14:34 ` Andrea Arcangeli
@ 2013-10-15 14:48   ` Kirill A. Shutemov
  2013-10-15 15:58     ` Andrea Arcangeli
  2013-10-15 17:53     ` Hugh Dickins
  0 siblings, 2 replies; 12+ messages in thread
From: Kirill A. Shutemov @ 2013-10-15 14:48 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Andrew Morton, David Rientjes, Kirill A. Shutemov,
	Naoya Horiguchi, linux-kernel, linux-mm

Andrea Arcangeli wrote:
> Hi Hugh,
> 
> On Tue, Oct 15, 2013 at 04:08:28AM -0700, Hugh Dickins wrote:
> > Occasionally we hit the BUG_ON(pmd_trans_huge(*pmd)) at the end of
> > __split_huge_page_pmd(): seen when doing madvise(,,MADV_DONTNEED).
> > 
> > It's invalid: we don't always have down_write of mmap_sem there:
> > a racing do_huge_pmd_wp_page() might have copied-on-write to another
> > huge page before our split_huge_page() got the anon_vma lock.
> > 
> 
> I don't get exactly the scenario with do_huge_pmd_wp_page(), could you
> elaborate?

I think the scenario is follow:

	CPU0:					CPU1

__split_huge_page_pmd()
	page = pmd_page(*pmd);
					do_huge_pmd_wp_page() copy the
					page and changes pmd (the same as on CPU0)
					to point to newly copied page.
	split_huge_page(page)
	where page is original page,
	not allocated on COW.
	pmd still points on huge page.


Hugh, have I got it correctly?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: mm: fix BUG in __split_huge_page_pmd
  2013-10-15 14:48   ` Kirill A. Shutemov
@ 2013-10-15 15:58     ` Andrea Arcangeli
  2013-10-15 17:53     ` Hugh Dickins
  1 sibling, 0 replies; 12+ messages in thread
From: Andrea Arcangeli @ 2013-10-15 15:58 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Andrew Morton, David Rientjes, Naoya Horiguchi,
	linux-kernel, linux-mm

On Tue, Oct 15, 2013 at 05:48:27PM +0300, Kirill A. Shutemov wrote:
> Andrea Arcangeli wrote:
> > Hi Hugh,
> > 
> > On Tue, Oct 15, 2013 at 04:08:28AM -0700, Hugh Dickins wrote:
> > > Occasionally we hit the BUG_ON(pmd_trans_huge(*pmd)) at the end of
> > > __split_huge_page_pmd(): seen when doing madvise(,,MADV_DONTNEED).
> > > 
> > > It's invalid: we don't always have down_write of mmap_sem there:
> > > a racing do_huge_pmd_wp_page() might have copied-on-write to another
> > > huge page before our split_huge_page() got the anon_vma lock.
> > > 
> > 
> > I don't get exactly the scenario with do_huge_pmd_wp_page(), could you
> > elaborate?
> 
> I think the scenario is follow:
> 
> 	CPU0:					CPU1
> 
> __split_huge_page_pmd()
> 	page = pmd_page(*pmd);
> 					do_huge_pmd_wp_page() copy the
> 					page and changes pmd (the same as on CPU0)
> 					to point to newly copied page.
> 	split_huge_page(page)
> 	where page is original page,
> 	not allocated on COW.
> 	pmd still points on huge page.
> 
> 
> Hugh, have I got it correctly?

So MADV_DONTNEED runs with with a "end" not 2m aligned (requiring 4k
subpage zapping) on a wrprotected trans-huge page that is hitting a
COW. So this scenario would be deterministic (the thread may write
beyond the "end" of the MADV_DONTNEED) and it only requires two
specific events.

With my other scenario with two concurrent MADV_DONTNEED plus a page
fault, you could still lead to split_huge_page_pmd returning with
pmd_trans_huge(*pmd) == true, despite of the loop introduced.

But for the above case, the loop makes a meaningful difference. So I
see the good reason for looping now.

It wouldn't be ok to miss a partial MADV_DONTNEED zapping because of a
concurrent COW, while it would be ok in my other scenario (and the
loop in fact cannot do anything to prevent split_huge_page_pmd return
with the pmd still huge).

My other scenario with two concurrent MADV_DONTNEED and a page fault
is non deterministic so looping was meaningless.

In both scenario, the kernel wouldn't run into stability issues, even
if we only removed the BUG_ON. But the COW scenario, without the loop,
we'd silently miss a partial MADV_DONTNEED on the 4k subpages before
the "end" (or after the "start").

And we still need pmd_none_or_trans_huge_or_clear_bad in
zap_pmd_range, to deal with the non deterministic cases that the loop
won't help (the two MADV_DONTNEED + page fault), in addition to the
loop to deal with the deterministic COW scenario above.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: mm: fix BUG in __split_huge_page_pmd
  2013-10-15 14:48   ` Kirill A. Shutemov
  2013-10-15 15:58     ` Andrea Arcangeli
@ 2013-10-15 17:53     ` Hugh Dickins
  2013-10-15 18:55       ` Andrea Arcangeli
  1 sibling, 1 reply; 12+ messages in thread
From: Hugh Dickins @ 2013-10-15 17:53 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, David Rientjes, Naoya Horiguchi,
	linux-kernel, linux-mm

On Tue, 15 Oct 2013, Kirill A. Shutemov wrote:
> Andrea Arcangeli wrote:
> > Hi Hugh,
> > 
> > On Tue, Oct 15, 2013 at 04:08:28AM -0700, Hugh Dickins wrote:
> > > Occasionally we hit the BUG_ON(pmd_trans_huge(*pmd)) at the end of
> > > __split_huge_page_pmd(): seen when doing madvise(,,MADV_DONTNEED).
> > > 
> > > It's invalid: we don't always have down_write of mmap_sem there:
> > > a racing do_huge_pmd_wp_page() might have copied-on-write to another
> > > huge page before our split_huge_page() got the anon_vma lock.
> > > 
> > 
> > I don't get exactly the scenario with do_huge_pmd_wp_page(), could you
> > elaborate?
> 
> I think the scenario is follow:
> 
> 	CPU0:					CPU1
> 
> __split_huge_page_pmd()
> 	page = pmd_page(*pmd);
> 					do_huge_pmd_wp_page() copy the
> 					page and changes pmd (the same as on CPU0)
> 					to point to newly copied page.
> 	split_huge_page(page)
> 	where page is original page,
> 	not allocated on COW.
> 	pmd still points on huge page.
> 
> 
> Hugh, have I got it correctly?

Yes, that's correct, that's what I've been assuming the race is.
With CPU0 split_huge_page_pmd() being called from zap_pmd_range()
in the service of madvise(,,MADV_DONTNEED).

I don't have the stacktrace to hand: could perfectly well dig it out
in an hour or two, but honestly, it adds nothing more to the picture.
I have no trace of the CPU1 side of things, and have merely surmised
that it was doing a COW.

As to whether the MADV_DONTNEED down_read optimization is important:
I don't recall, might be able to discover justification in old mail,
0a27a14a6292 doesn't actually say; but in general we're much better
off using down_read than down_write where it's safe to do so.

But more importantly, MADV_DONTNEED down_read across zap_page_range
is building on the fact that file invalidation/truncation can already
call zap_page_range without touching mmap_sem at all: not a problem
for traditional anon-only THP, but something you'll have had to worry
about for THPageCache.

I'm afraid Andrea's mail about concurrent madvises gives me far more
to think about than I have time for: seems to get into problems he
knows a lot about but I'm unfamiliar with.  If this patch looks good
for now on its own, let's put it in; but no problem if you guys prefer
to wait for a fuller solution of more problems, we can ride with this
one internally for the moment.

And I should admit that the crash has occurred too rarely for us yet
to be able to judge whether this patch actually fixes it in practice.

Hugh

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: mm: fix BUG in __split_huge_page_pmd
  2013-10-15 17:53     ` Hugh Dickins
@ 2013-10-15 18:55       ` Andrea Arcangeli
  2013-10-15 19:28         ` Naoya Horiguchi
  0 siblings, 1 reply; 12+ messages in thread
From: Andrea Arcangeli @ 2013-10-15 18:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A. Shutemov, Andrew Morton, David Rientjes,
	Naoya Horiguchi, linux-kernel, linux-mm

On Tue, Oct 15, 2013 at 10:53:10AM -0700, Hugh Dickins wrote:
> I'm afraid Andrea's mail about concurrent madvises gives me far more
> to think about than I have time for: seems to get into problems he
> knows a lot about but I'm unfamiliar with.  If this patch looks good
> for now on its own, let's put it in; but no problem if you guys prefer
> to wait for a fuller solution of more problems, we can ride with this
> one internally for the moment.

I'm very happy with the patch and I think it's a correct fix for the
COW scenario which is deterministic so the looping makes a meaningful
difference for it. If we wouldn't loop, part of the copied page
wouldn't be zapped after the COW.

The patch also solves the false positive for the other non
deterministic scenario of two MADV_DONTNEED (one partial, one whole)
plus a concurrent page fault.

> And I should admit that the crash has occurred too rarely for us yet
> to be able to judge whether this patch actually fixes it in practice.

It is very rare indeed, and thanks to the BUG_ON it cannot lead to any
user or kernel memory corruption, but it's a nuisance we need to
fix. I only have the two stack traces in the two links I posted in the
previous email and I also don't have the traces of the other CPU.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: mm: fix BUG in __split_huge_page_pmd
  2013-10-15 18:55       ` Andrea Arcangeli
@ 2013-10-15 19:28         ` Naoya Horiguchi
  2013-10-15 19:44           ` Andrea Arcangeli
  0 siblings, 1 reply; 12+ messages in thread
From: Naoya Horiguchi @ 2013-10-15 19:28 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Kirill A. Shutemov, Andrew Morton, David Rientjes,
	linux-kernel, linux-mm

On Tue, Oct 15, 2013 at 08:55:10PM +0200, Andrea Arcangeli wrote:
> On Tue, Oct 15, 2013 at 10:53:10AM -0700, Hugh Dickins wrote:
> > I'm afraid Andrea's mail about concurrent madvises gives me far more
> > to think about than I have time for: seems to get into problems he
> > knows a lot about but I'm unfamiliar with.  If this patch looks good
> > for now on its own, let's put it in; but no problem if you guys prefer
> > to wait for a fuller solution of more problems, we can ride with this
> > one internally for the moment.
> 
> I'm very happy with the patch and I think it's a correct fix for the
> COW scenario which is deterministic so the looping makes a meaningful
> difference for it. If we wouldn't loop, part of the copied page
> wouldn't be zapped after the COW.

I like this patch, too.

If we have the loop in __split_huge_page_pmd as suggested in this patch,
can we assume that the pmd is stable after __split_huge_page_pmd returns?
If it's true, we can remove pmd_none_or_trans_huge_or_clear_bad check
in the callers side (zap_pmd_range and some other page table walking code.)

Naoya Horiguchi

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: mm: fix BUG in __split_huge_page_pmd
  2013-10-15 19:28         ` Naoya Horiguchi
@ 2013-10-15 19:44           ` Andrea Arcangeli
  2013-10-15 20:16             ` Naoya Horiguchi
  0 siblings, 1 reply; 12+ messages in thread
From: Andrea Arcangeli @ 2013-10-15 19:44 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Hugh Dickins, Kirill A. Shutemov, Andrew Morton, David Rientjes,
	linux-kernel, linux-mm

On Tue, Oct 15, 2013 at 03:28:50PM -0400, Naoya Horiguchi wrote:
> On Tue, Oct 15, 2013 at 08:55:10PM +0200, Andrea Arcangeli wrote:
> > On Tue, Oct 15, 2013 at 10:53:10AM -0700, Hugh Dickins wrote:
> > > I'm afraid Andrea's mail about concurrent madvises gives me far more
> > > to think about than I have time for: seems to get into problems he
> > > knows a lot about but I'm unfamiliar with.  If this patch looks good
> > > for now on its own, let's put it in; but no problem if you guys prefer
> > > to wait for a fuller solution of more problems, we can ride with this
> > > one internally for the moment.
> > 
> > I'm very happy with the patch and I think it's a correct fix for the
> > COW scenario which is deterministic so the looping makes a meaningful
> > difference for it. If we wouldn't loop, part of the copied page
> > wouldn't be zapped after the COW.
> 
> I like this patch, too.
> 
> If we have the loop in __split_huge_page_pmd as suggested in this patch,
> can we assume that the pmd is stable after __split_huge_page_pmd returns?
> If it's true, we can remove pmd_none_or_trans_huge_or_clear_bad check
> in the callers side (zap_pmd_range and some other page table walking code.)

We can assume it stable for the deterministic cases where the
looping is useful for and split_huge_page creates non-huge pmd that points to
a regular pte.

But we cannot remove pmd_none_or_trans_huge_or_clear_bad after if for
the other non deterministic cases that I described in previous
email. Looping still provides no guarantee that when the function
returns, the pmd in not huge. So for safety we still need to handle
the non deterministic case and just discard it through
pmd_none_or_trans_huge_or_clear_bad.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: mm: fix BUG in __split_huge_page_pmd
  2013-10-15 19:44           ` Andrea Arcangeli
@ 2013-10-15 20:16             ` Naoya Horiguchi
  2013-10-15 20:30               ` Andrea Arcangeli
  0 siblings, 1 reply; 12+ messages in thread
From: Naoya Horiguchi @ 2013-10-15 20:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Kirill A. Shutemov, Andrew Morton, David Rientjes,
	linux-kernel, linux-mm

On Tue, Oct 15, 2013 at 09:44:28PM +0200, Andrea Arcangeli wrote:
> On Tue, Oct 15, 2013 at 03:28:50PM -0400, Naoya Horiguchi wrote:
> > On Tue, Oct 15, 2013 at 08:55:10PM +0200, Andrea Arcangeli wrote:
> > > On Tue, Oct 15, 2013 at 10:53:10AM -0700, Hugh Dickins wrote:
> > > > I'm afraid Andrea's mail about concurrent madvises gives me far more
> > > > to think about than I have time for: seems to get into problems he
> > > > knows a lot about but I'm unfamiliar with.  If this patch looks good
> > > > for now on its own, let's put it in; but no problem if you guys prefer
> > > > to wait for a fuller solution of more problems, we can ride with this
> > > > one internally for the moment.
> > > 
> > > I'm very happy with the patch and I think it's a correct fix for the
> > > COW scenario which is deterministic so the looping makes a meaningful
> > > difference for it. If we wouldn't loop, part of the copied page
> > > wouldn't be zapped after the COW.
> > 
> > I like this patch, too.
> > 
> > If we have the loop in __split_huge_page_pmd as suggested in this patch,
> > can we assume that the pmd is stable after __split_huge_page_pmd returns?
> > If it's true, we can remove pmd_none_or_trans_huge_or_clear_bad check
> > in the callers side (zap_pmd_range and some other page table walking code.)
> 
> We can assume it stable for the deterministic cases where the
> looping is useful for and split_huge_page creates non-huge pmd that points to
> a regular pte.
> 
> But we cannot remove pmd_none_or_trans_huge_or_clear_bad after if for
> the other non deterministic cases that I described in previous
> email. Looping still provides no guarantee that when the function
> returns, the pmd in not huge. So for safety we still need to handle
> the non deterministic case and just discard it through
> pmd_none_or_trans_huge_or_clear_bad.

OK, this check is necessary. But pmd_none_or_trans_huge_or_clear_bad
doesn't clear the pmd when pmd_trans_huge is true. So zap_pmd_range
seems to do nothing on such irregular pmd_trans_huge. So it looks to
me better that zap_pmd_range retries the loop on the same address
instead of 'goto next'.
The reason why I had this kind of question is that I recently study on
page table walker and some related code do retry in the similar situation.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: mm: fix BUG in __split_huge_page_pmd
  2013-10-15 20:16             ` Naoya Horiguchi
@ 2013-10-15 20:30               ` Andrea Arcangeli
  0 siblings, 0 replies; 12+ messages in thread
From: Andrea Arcangeli @ 2013-10-15 20:30 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Hugh Dickins, Kirill A. Shutemov, Andrew Morton, David Rientjes,
	linux-kernel, linux-mm

On Tue, Oct 15, 2013 at 04:16:23PM -0400, Naoya Horiguchi wrote:
> On Tue, Oct 15, 2013 at 09:44:28PM +0200, Andrea Arcangeli wrote:
> > On Tue, Oct 15, 2013 at 03:28:50PM -0400, Naoya Horiguchi wrote:
> > > On Tue, Oct 15, 2013 at 08:55:10PM +0200, Andrea Arcangeli wrote:
> > > > On Tue, Oct 15, 2013 at 10:53:10AM -0700, Hugh Dickins wrote:
> > > > > I'm afraid Andrea's mail about concurrent madvises gives me far more
> > > > > to think about than I have time for: seems to get into problems he
> > > > > knows a lot about but I'm unfamiliar with.  If this patch looks good
> > > > > for now on its own, let's put it in; but no problem if you guys prefer
> > > > > to wait for a fuller solution of more problems, we can ride with this
> > > > > one internally for the moment.
> > > > 
> > > > I'm very happy with the patch and I think it's a correct fix for the
> > > > COW scenario which is deterministic so the looping makes a meaningful
> > > > difference for it. If we wouldn't loop, part of the copied page
> > > > wouldn't be zapped after the COW.
> > > 
> > > I like this patch, too.
> > > 
> > > If we have the loop in __split_huge_page_pmd as suggested in this patch,
> > > can we assume that the pmd is stable after __split_huge_page_pmd returns?
> > > If it's true, we can remove pmd_none_or_trans_huge_or_clear_bad check
> > > in the callers side (zap_pmd_range and some other page table walking code.)
> > 
> > We can assume it stable for the deterministic cases where the
> > looping is useful for and split_huge_page creates non-huge pmd that points to
> > a regular pte.
> > 
> > But we cannot remove pmd_none_or_trans_huge_or_clear_bad after if for
> > the other non deterministic cases that I described in previous
> > email. Looping still provides no guarantee that when the function
> > returns, the pmd in not huge. So for safety we still need to handle
> > the non deterministic case and just discard it through
> > pmd_none_or_trans_huge_or_clear_bad.
> 
> OK, this check is necessary. But pmd_none_or_trans_huge_or_clear_bad
> doesn't clear the pmd when pmd_trans_huge is true. So zap_pmd_range
> seems to do nothing on such irregular pmd_trans_huge. So it looks to
> me better that zap_pmd_range retries the loop on the same address
> instead of 'goto next'.

It may look like a bug to return with a huge pmd established, when we could
notice it after pmd_trans_huge_pmd returns.

However try to imagine to add a check there, and to keep adding checks
and loops. If you're just a bit more unlucky next time, the page fault
that converts the "unstable" pmd from "none" to "huge", may fire in
another thread running in another CPU during the iret, so while
MADV_DONTNEED completes and returns to userland.

There are not enough checks you can add even after zap_pmd_range
returns, to prevent the pmd to become established by the time
MADV_DONTNEED returns to userland.

The point is that if MADV_DONTNEED was zapping the entire pmd (not
part) userland is accessing the same memory at the same time from
another thread, the result is undefined.

> The reason why I had this kind of question is that I recently study on
> page table walker and some related code do retry in the similar situation.

There surely are other cases that give an undefined result in the
kernel. Another one in the pagetable walking code that would give
undefined result is still MADV_DONTNEED vs a 4k page fault. You don't
know who runs first, the 4k page may end up zapped or not. But in that
case there's no split_huge_page_pmd as variable of the equation.

The header definition documents it too:

/*
 * This function is meant to be used by sites walking pagetables with
 * the mmap_sem hold in read mode to protect against MADV_DONTNEED and
 * transhuge page faults. MADV_DONTNEED can convert a transhuge pmd
 * into a null pmd and the transhuge page fault can convert a null pmd
 * into an hugepmd or into a regular pmd (if the hugepage allocation
 * fails). While holding the mmap_sem in read mode the pmd becomes
 * stable and stops changing under us only if it's not null and not a
 * transhuge pmd. When those races occurs and this function makes a
 * difference vs the standard pmd_none_or_clear_bad, the result is
 * undefined so behaving like if the pmd was none is safe (because it
 * can return none anyway). The compiler level barrier() is critically
 * important to compute the two checks atomically on the same pmdval.
 *
 * For 32bit kernels with a 64bit large pmd_t this automatically takes
 * care of reading the pmd atomically to avoid SMP race conditions
 * against pmd_populate() when the mmap_sem is hold for reading by the
 * caller (a special atomic read not done by "gcc" as in the generic
 * version above, is also needed when THP is disabled because the page
 * fault can populate the pmd from under us).
 */
static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)

The COW case cannot be threated as undefined though, that has as well
defined result that for userland must be identical to the one that
would happen on 4k pages.

This is why looping there is required. But we still need to deal with
the other scenario with undefined result too.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2013-10-15 20:30 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-10-15 11:08 mm: fix BUG in __split_huge_page_pmd Hugh Dickins
2013-10-15 11:32 ` Kirill A. Shutemov
2013-10-15 14:41   ` Andrea Arcangeli
2013-10-15 14:34 ` Andrea Arcangeli
2013-10-15 14:48   ` Kirill A. Shutemov
2013-10-15 15:58     ` Andrea Arcangeli
2013-10-15 17:53     ` Hugh Dickins
2013-10-15 18:55       ` Andrea Arcangeli
2013-10-15 19:28         ` Naoya Horiguchi
2013-10-15 19:44           ` Andrea Arcangeli
2013-10-15 20:16             ` Naoya Horiguchi
2013-10-15 20:30               ` Andrea Arcangeli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).