All of lore.kernel.org
 help / color / mirror / Atom feed
From: Hugh Dickins <hughd@google.com>
To: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>,
	Sasha Levin <sasha.levin@oracle.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Dave Jones <davej@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Rik van Riel <riel@redhat.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Cyrill Gorcunov <gorcunov@gmail.com>
Subject: Re: mm: BUG in unmap_page_range
Date: Tue, 5 Aug 2014 17:42:03 -0700 (PDT)	[thread overview]
Message-ID: <alpine.LSU.2.11.1408051649330.6591@eggly.anvils> (raw)
In-Reply-To: <20140805144439.GW10819@suse.de>

On Tue, 5 Aug 2014, Mel Gorman wrote:
> On Mon, Aug 04, 2014 at 04:40:38AM -0700, Hugh Dickins wrote:
> > 
> > [INCOMPLETE PATCH] x86,mm: fix pte_special versus pte_numa
> > 
> > Sasha Levin has shown oopses on ffffea0003480048 and ffffea0003480008
> > at mm/memory.c:1132, running Trinity on different 3.16-rc-next kernels:
> > where zap_pte_range() checks page->mapping to see if PageAnon(page).
> > 
> > Those addresses fit struct pages for pfns d2001 and d2000, and in each
> > dump a register or a stack slot showed d2001730 or d2000730: pte flags
> > 0x730 are PCD ACCESSED PROTNONE SPECIAL IOMAP; and Sasha's e820 map has
> > a hole between cfffffff and 100000000, which would need special access.
> > 
> > Commit c46a7c817e66 ("x86: define _PAGE_NUMA by reusing software bits on
> > the PMD and PTE levels") has broken vm_normal_page(): a PROTNONE SPECIAL
> > pte no longer passes the pte_special() test, so zap_pte_range() goes on
> > to try to access a non-existent struct page.
> > 
> 
> :(
> 
> > Fix this by refining pte_special() (SPECIAL with PRESENT or PROTNONE)
> > to complement pte_numa() (SPECIAL with neither PRESENT nor PROTNONE).
> > 
> > It's unclear why c46a7c817e66 added pte_numa() test to vm_normal_page(),
> > and moved its is_zero_pfn() test from slow to fast path: I suspect both
> > were papering over PROT_NONE issues seen with inadequate pte_special().
> > Revert vm_normal_page() to how it was before, relying on pte_special().
> > 
> 
> Rather than answering directly I updated your changelog
> 
>     Fix this by refining pte_special() (SPECIAL with PRESENT or PROTNONE)
>     to complement pte_numa() (SPECIAL with neither PRESENT nor PROTNONE).
> 
>     A hint that this was a problem was that c46a7c817e66 added pte_numa()
>     test to vm_normal_page(), and moved its is_zero_pfn() test from slow to
>     fast path: This was papering over a pte_special() snag when the zero
>     page was encountered during zap. This patch reverts vm_normal_page()
>     to how it was before, relying on pte_special().

Thanks, that's fine.

> 
> > I find it confusing, that the only example of ARCH_USES_NUMA_PROT_NONE
> > no longer uses PROTNONE for NUMA, but SPECIAL instead: update the
> > asm-generic comment a little, but that config option remains unhelpful.
> > 
> 
> ARCH_USES_NUMA_PROT_NONE should have been sent to the farm at the same time
> as that patch and by rights unified with the powerpc helpers. With the new
> _PAGE_NUMA bit, there is no reason they should have different implementations
> of pte_numa and related functions. Unfortunately unifying them is a little
> problematic due to differences in fundamental types. It could be done with
> #defines but I'm attaching a preliminary prototype to illustrate the issue.
> 
> > But more seriously, I think this patch is incomplete: aren't there
> > other places which need to be handling PROTNONE along with PRESENT?
> > For example, pte_mknuma() clears _PAGE_PRESENT and sets _PAGE_NUMA,
> > but on a PROT_NONE area, I think that will now make it pte_special()?
> > So it ought to clear _PAGE_PROTNONE too.  Or maybe we can never
> > pte_mknuma() on a PROT_NONE area - there would be no point?
> > 
> 
> We are depending on the fact that inaccessible VMAs are skipped by the
> NUMA hinting scanner.

Ah, okay.  And the other way round (mprotecting to PROT_NONE an area
which already contains _PAGE_NUMA ptes) already looked safe to me.

> 
> > Around here I began to wonder if it was just a mistake to have deserted
> > the PROTNONE for NUMA model: I know Linus had a strong reaction against
> > it, and I've never delved into its drawbacks myself; but bringing yet
> > another (SPECIAL) flag into the game is not an obvious improvement.
> > Should we just revert c46a7c817e66, or would that be a mistake?
> > 
> 
> It's replacing one type of complexity with another. The downside is that
> _PAGE_NUMA == _PAGE_PROTNONE puts subtle traps all over the core for
> powerpc to fall foul of.

Okay.

> 
> I'm attaching a preliminary pair of patches. The first which deals with
> ARCH_USES_NUMA_PROT_NONE and the second which is yours with a revised
> changelog. I'm adding Aneesh to the cc to look at the powerpc portion of
> the first patch.

Thanks a lot, Mel.

I am surprised by the ordering, but perhaps you meant nothing by it.
Isn't the first one a welcome but optional cleanup, and the second one
a fix that we need in 3.16-stable?  Or does the fix actually depend in
some unstated way upon the cleanup, in powerpc-land perhaps?

Aside from that, for the first patch: yes, I heartily approve of the
disappearance of CONFIG_ARCH_WANTS_PROT_NUMA_PROT_NONE and
CONFIG_ARCH_USES_NUMA_PROT_NONE.  If you wish, add
Acked-by: Hugh Dickins <hughd@google.com>
but of course it's really Aneesh and powerpc who are the test of it.

One thing I did wonder, though: at first I was reassured by the
VM_BUG_ON(!pte_present(pte)) you add to pte_mknuma(); but then thought
it would be better as VM_BUG_ON(!(val & _PAGE_PRESENT)), being stronger
- asserting that indeed we do not put NUMA hints on PROT_NONE areas.
(But I have not tested, perhaps such a VM_BUG_ON would actually fire.)

And in the second patch, a few trivial edits:

> It still appears that this patch may be incomplete: aren't there other
> places which need to be handling PROTNONE along with PRESENT?  For example,
> pte_mknuma() clears _PAGE_PRESENT and sets _PAGE_NUMA, but on a PROT_NONE
> area, that would make it it pte_special(). This is side-stepped by the fact

s/it it/it/

> that NUMA hinting faults skiped PROT_NONE VMAs and there are no grounds

s/skiped/skip/

> where a NUMA hinting fault on a PROT_NONE VMA would be interesting.
> 
> Partially-Fixes: c46a7c817e66 ("x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels")

s/Partially-//

> Reported-by: Sasha Levin <sasha.levin@oracle.com>
> Not-yet-Signed-off-by: Hugh Dickins <hughd@google.com>

s/Not-yet-//

> Not-yet-Signed-off-by: Mel Gorman <mgorman@suse.de>

Ditto I must leave to you!

> Cc: stable@vger.kernel.org [3.16]
> ---
>  arch/x86/include/asm/pgtable.h | 9 +++++++--
>  mm/memory.c                    | 7 +++----
>  2 files changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 0ec0560..230b811 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -131,8 +131,13 @@ static inline int pte_exec(pte_t pte)
>  
>  static inline int pte_special(pte_t pte)
>  {
> -	return (pte_flags(pte) & (_PAGE_PRESENT|_PAGE_SPECIAL)) ==
> -				 (_PAGE_PRESENT|_PAGE_SPECIAL);
> +	/*
> +	 * See CONFIG_NUMA_BALANCING CONFIG_ARCH_USES_NUMA_PROT_NONE pte_numa()

s/CONFIG_ARCH_USES_NUMA_PROT_NONE //
even if you do end up reordering this patch before the other.

Thanks!
Hugh

WARNING: multiple messages have this Message-ID (diff)
From: Hugh Dickins <hughd@google.com>
To: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>,
	Sasha Levin <sasha.levin@oracle.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Dave Jones <davej@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Rik van Riel <riel@redhat.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Cyrill Gorcunov <gorcunov@gmail.com>
Subject: Re: mm: BUG in unmap_page_range
Date: Tue, 5 Aug 2014 17:42:03 -0700 (PDT)	[thread overview]
Message-ID: <alpine.LSU.2.11.1408051649330.6591@eggly.anvils> (raw)
In-Reply-To: <20140805144439.GW10819@suse.de>

On Tue, 5 Aug 2014, Mel Gorman wrote:
> On Mon, Aug 04, 2014 at 04:40:38AM -0700, Hugh Dickins wrote:
> > 
> > [INCOMPLETE PATCH] x86,mm: fix pte_special versus pte_numa
> > 
> > Sasha Levin has shown oopses on ffffea0003480048 and ffffea0003480008
> > at mm/memory.c:1132, running Trinity on different 3.16-rc-next kernels:
> > where zap_pte_range() checks page->mapping to see if PageAnon(page).
> > 
> > Those addresses fit struct pages for pfns d2001 and d2000, and in each
> > dump a register or a stack slot showed d2001730 or d2000730: pte flags
> > 0x730 are PCD ACCESSED PROTNONE SPECIAL IOMAP; and Sasha's e820 map has
> > a hole between cfffffff and 100000000, which would need special access.
> > 
> > Commit c46a7c817e66 ("x86: define _PAGE_NUMA by reusing software bits on
> > the PMD and PTE levels") has broken vm_normal_page(): a PROTNONE SPECIAL
> > pte no longer passes the pte_special() test, so zap_pte_range() goes on
> > to try to access a non-existent struct page.
> > 
> 
> :(
> 
> > Fix this by refining pte_special() (SPECIAL with PRESENT or PROTNONE)
> > to complement pte_numa() (SPECIAL with neither PRESENT nor PROTNONE).
> > 
> > It's unclear why c46a7c817e66 added pte_numa() test to vm_normal_page(),
> > and moved its is_zero_pfn() test from slow to fast path: I suspect both
> > were papering over PROT_NONE issues seen with inadequate pte_special().
> > Revert vm_normal_page() to how it was before, relying on pte_special().
> > 
> 
> Rather than answering directly I updated your changelog
> 
>     Fix this by refining pte_special() (SPECIAL with PRESENT or PROTNONE)
>     to complement pte_numa() (SPECIAL with neither PRESENT nor PROTNONE).
> 
>     A hint that this was a problem was that c46a7c817e66 added pte_numa()
>     test to vm_normal_page(), and moved its is_zero_pfn() test from slow to
>     fast path: This was papering over a pte_special() snag when the zero
>     page was encountered during zap. This patch reverts vm_normal_page()
>     to how it was before, relying on pte_special().

Thanks, that's fine.

> 
> > I find it confusing, that the only example of ARCH_USES_NUMA_PROT_NONE
> > no longer uses PROTNONE for NUMA, but SPECIAL instead: update the
> > asm-generic comment a little, but that config option remains unhelpful.
> > 
> 
> ARCH_USES_NUMA_PROT_NONE should have been sent to the farm at the same time
> as that patch and by rights unified with the powerpc helpers. With the new
> _PAGE_NUMA bit, there is no reason they should have different implementations
> of pte_numa and related functions. Unfortunately unifying them is a little
> problematic due to differences in fundamental types. It could be done with
> #defines but I'm attaching a preliminary prototype to illustrate the issue.
> 
> > But more seriously, I think this patch is incomplete: aren't there
> > other places which need to be handling PROTNONE along with PRESENT?
> > For example, pte_mknuma() clears _PAGE_PRESENT and sets _PAGE_NUMA,
> > but on a PROT_NONE area, I think that will now make it pte_special()?
> > So it ought to clear _PAGE_PROTNONE too.  Or maybe we can never
> > pte_mknuma() on a PROT_NONE area - there would be no point?
> > 
> 
> We are depending on the fact that inaccessible VMAs are skipped by the
> NUMA hinting scanner.

Ah, okay.  And the other way round (mprotecting to PROT_NONE an area
which already contains _PAGE_NUMA ptes) already looked safe to me.

> 
> > Around here I began to wonder if it was just a mistake to have deserted
> > the PROTNONE for NUMA model: I know Linus had a strong reaction against
> > it, and I've never delved into its drawbacks myself; but bringing yet
> > another (SPECIAL) flag into the game is not an obvious improvement.
> > Should we just revert c46a7c817e66, or would that be a mistake?
> > 
> 
> It's replacing one type of complexity with another. The downside is that
> _PAGE_NUMA == _PAGE_PROTNONE puts subtle traps all over the core for
> powerpc to fall foul of.

Okay.

> 
> I'm attaching a preliminary pair of patches. The first which deals with
> ARCH_USES_NUMA_PROT_NONE and the second which is yours with a revised
> changelog. I'm adding Aneesh to the cc to look at the powerpc portion of
> the first patch.

Thanks a lot, Mel.

I am surprised by the ordering, but perhaps you meant nothing by it.
Isn't the first one a welcome but optional cleanup, and the second one
a fix that we need in 3.16-stable?  Or does the fix actually depend in
some unstated way upon the cleanup, in powerpc-land perhaps?

Aside from that, for the first patch: yes, I heartily approve of the
disappearance of CONFIG_ARCH_WANTS_PROT_NUMA_PROT_NONE and
CONFIG_ARCH_USES_NUMA_PROT_NONE.  If you wish, add
Acked-by: Hugh Dickins <hughd@google.com>
but of course it's really Aneesh and powerpc who are the test of it.

One thing I did wonder, though: at first I was reassured by the
VM_BUG_ON(!pte_present(pte)) you add to pte_mknuma(); but then thought
it would be better as VM_BUG_ON(!(val & _PAGE_PRESENT)), being stronger
- asserting that indeed we do not put NUMA hints on PROT_NONE areas.
(But I have not tested, perhaps such a VM_BUG_ON would actually fire.)

And in the second patch, a few trivial edits:

> It still appears that this patch may be incomplete: aren't there other
> places which need to be handling PROTNONE along with PRESENT?  For example,
> pte_mknuma() clears _PAGE_PRESENT and sets _PAGE_NUMA, but on a PROT_NONE
> area, that would make it it pte_special(). This is side-stepped by the fact

s/it it/it/

> that NUMA hinting faults skiped PROT_NONE VMAs and there are no grounds

s/skiped/skip/

> where a NUMA hinting fault on a PROT_NONE VMA would be interesting.
> 
> Partially-Fixes: c46a7c817e66 ("x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels")

s/Partially-//

> Reported-by: Sasha Levin <sasha.levin@oracle.com>
> Not-yet-Signed-off-by: Hugh Dickins <hughd@google.com>

s/Not-yet-//

> Not-yet-Signed-off-by: Mel Gorman <mgorman@suse.de>

Ditto I must leave to you!

> Cc: stable@vger.kernel.org [3.16]
> ---
>  arch/x86/include/asm/pgtable.h | 9 +++++++--
>  mm/memory.c                    | 7 +++----
>  2 files changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 0ec0560..230b811 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -131,8 +131,13 @@ static inline int pte_exec(pte_t pte)
>  
>  static inline int pte_special(pte_t pte)
>  {
> -	return (pte_flags(pte) & (_PAGE_PRESENT|_PAGE_SPECIAL)) ==
> -				 (_PAGE_PRESENT|_PAGE_SPECIAL);
> +	/*
> +	 * See CONFIG_NUMA_BALANCING CONFIG_ARCH_USES_NUMA_PROT_NONE pte_numa()

s/CONFIG_ARCH_USES_NUMA_PROT_NONE //
even if you do end up reordering this patch before the other.

Thanks!
Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2014-08-06  0:43 UTC|newest]

Thread overview: 86+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-08-02 21:58 mm: BUG in unmap_page_range Sasha Levin
2014-08-02 21:58 ` Sasha Levin
2014-08-04 11:40 ` Hugh Dickins
2014-08-04 11:40   ` Hugh Dickins
2014-08-05 14:44   ` Mel Gorman
2014-08-05 14:44     ` Mel Gorman
2014-08-06  0:42     ` Hugh Dickins [this message]
2014-08-06  0:42       ` Hugh Dickins
2014-08-06  1:04       ` Sasha Levin
2014-08-06  1:04         ` Sasha Levin
2014-08-12  3:28         ` Sasha Levin
2014-08-12  3:28           ` Sasha Levin
2014-08-12 10:47           ` [PATCH] x86,mm: fix pte_special versus pte_numa Mel Gorman
2014-08-12 10:47             ` Mel Gorman
2014-08-12 11:08             ` [PATCH] mm: Remove misleading ARCH_USES_NUMA_PROT_NONE Mel Gorman
2014-08-12 11:08               ` Mel Gorman
2014-08-13 13:14               ` Aneesh Kumar K.V
2014-08-13 13:14                 ` Aneesh Kumar K.V
2014-08-27  3:16           ` mm: BUG in unmap_page_range Sasha Levin
2014-08-27  3:16             ` Sasha Levin
2014-08-27 15:26             ` Mel Gorman
2014-08-27 15:26               ` Mel Gorman
2014-08-27 18:21               ` Sasha Levin
2014-08-27 18:21                 ` Sasha Levin
2014-08-30  1:23               ` Sasha Levin
2014-08-30  1:23                 ` Sasha Levin
2014-09-04  9:04                 ` Sasha Levin
2014-09-04  9:04                   ` Sasha Levin
2014-09-08 17:18                   ` Mel Gorman
2014-09-08 17:18                     ` Mel Gorman
2014-09-08 17:23                     ` Sasha Levin
2014-09-08 17:56                     ` Sasha Levin
2014-09-08 17:56                       ` Sasha Levin
2014-09-09 21:33                       ` Mel Gorman
2014-09-09 21:33                         ` Mel Gorman
2014-09-09 22:20                         ` Sasha Levin
2014-09-09 22:20                           ` Sasha Levin
2014-09-10  2:45                           ` Hugh Dickins
2014-09-10  2:45                             ` Hugh Dickins
2014-09-10 12:47                             ` Mel Gorman
2014-09-10 12:47                               ` Mel Gorman
2014-09-10 14:24                               ` Trinity and mbind flags (WAS: Re: mm: BUG in unmap_page_range) Sasha Levin
2014-09-10 14:24                                 ` Sasha Levin
2014-09-10 14:33                                 ` Dave Jones
2014-09-10 14:33                                   ` Dave Jones
2014-09-10 19:06                               ` mm: BUG in unmap_page_range Sasha Levin
2014-09-10 19:06                                 ` Sasha Levin
2014-09-10 19:36                               ` Hugh Dickins
2014-09-10 19:36                                 ` Hugh Dickins
2014-09-11  2:43                                 ` Sasha Levin
2014-09-11  2:43                                   ` Sasha Levin
2014-09-11 11:39                                   ` Hugh Dickins
2014-09-11 11:39                                     ` Hugh Dickins
2014-09-11 14:22                                     ` Sasha Levin
2014-09-11 14:22                                       ` Sasha Levin
2014-09-11 14:33                                       ` Dave Jones
2014-09-11 14:33                                         ` Dave Jones
2014-09-11 16:28                                     ` Mel Gorman
2014-09-11 16:28                                       ` Mel Gorman
2014-09-11 22:38                                       ` Sasha Levin
2014-09-11 22:38                                         ` Sasha Levin
2014-09-17 21:37                                         ` Sasha Levin
2014-09-17 21:37                                           ` Sasha Levin
2014-09-10 13:12                             ` Sasha Levin
2014-09-10 13:12                               ` Sasha Levin
2014-09-10 13:40                               ` Mel Gorman
2014-09-10 13:40                                 ` Mel Gorman
2014-09-10 16:44                                 ` Sasha Levin
2014-09-10 16:44                                   ` Sasha Levin
2014-09-10 19:09                               ` Hugh Dickins
2014-09-10 19:09                                 ` Hugh Dickins
2014-09-10 20:36                                 ` Sasha Levin
2014-09-10 20:36                                   ` Sasha Levin
2014-09-10 23:00                                   ` Hugh Dickins
2014-09-10 23:00                                     ` Hugh Dickins
2014-08-06 10:35       ` Mel Gorman
2014-08-06 10:35         ` Mel Gorman
2014-08-06  7:14     ` Aneesh Kumar K.V
2014-08-06  7:14       ` Aneesh Kumar K.V
2014-08-06  7:14       ` Aneesh Kumar K.V
2014-08-06 10:23       ` Mel Gorman
2014-08-06 10:23         ` Mel Gorman
2014-08-06 10:23         ` Mel Gorman
2014-08-07  8:40         ` Aneesh Kumar K.V
2014-08-07  8:40           ` Aneesh Kumar K.V
2014-08-07  8:40           ` Aneesh Kumar K.V

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=alpine.LSU.2.11.1408051649330.6591@eggly.anvils \
    --to=hughd@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=davej@redhat.com \
    --cc=gorcunov@gmail.com \
    --cc=hannes@cmpxchg.org \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=sasha.levin@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.