* pipe/page fault oddness.
@ 2014-09-30  3:33 Dave Jones
  2014-09-30  4:27 ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Dave Jones @ 2014-09-30  3:33 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel

My fuzz tester ground to a halt, with many child processes blocked
on pipe_lock.  sysrq-t output: http://codemonkey.org.uk/junk/pipe-lock-wtf.txt

Looking at the dump, there's only one running trinity child,
with all the others blocking on it.

trinity-c49     R  running task    12856 19464   7633 0x00000004
ffff8800a09bf960 0000000000000002 ffff8800a09bf9f8 ffff880219650000
00000000001d4080 0000000000000000 ffff8800a09bffd8 00000000001d4080
ffff88023f755bc0 ffff880219650000 ffff8800a09bffd8 ffff88010b017e00
Call Trace:
[<ffffffff9181df46>] preempt_schedule+0x36/0x60
[<ffffffff9100e3d6>] ___preempt_schedule+0x56/0xb0
[<ffffffff911c3c67>] ? handle_mm_fault+0x3a7/0xcd0
[<ffffffff918239f1>] ? _raw_spin_unlock+0x31/0x50
[<ffffffff91823a05>] ? _raw_spin_unlock+0x45/0x50
[<ffffffff911c3c67>] handle_mm_fault+0x3a7/0xcd0
[<ffffffff910cb687>] ? __lock_is_held+0x57/0x80
[<ffffffff91042c84>] __do_page_fault+0x1a4/0x600
[<ffffffff910ce485>] ? mark_held_locks+0x75/0xa0
[<ffffffff910ce5bd>] ? trace_hardirqs_on_caller+0x10d/0x1d0
[<ffffffff910ce68d>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff9118eda7>] ? context_tracking_user_exit+0x67/0x1b0
[<ffffffff910430fe>] do_page_fault+0x1e/0x70
[<ffffffff918264b2>] page_fault+0x22/0x30
[<ffffffff911bd7e3>] ? copy_page_to_iter+0x3b3/0x500
[<ffffffff9120eddf>] pipe_read+0xdf/0x330
[<ffffffff9120ed00>] ? pipe_write+0x490/0x490
[<ffffffff912051a0>] ? do_sync_readv_writev+0xa0/0xa0
[<ffffffff912053b8>] do_iter_readv_writev+0x78/0xc0
[<ffffffff91206bbe>] do_readv_writev+0xce/0x280
[<ffffffff9120ed00>] ? pipe_write+0x490/0x490
[<ffffffff910cbbf6>] ? lock_release_holdtime.part.29+0xe6/0x160
[<ffffffff910ac74d>] ? get_parent_ip+0xd/0x50
[<ffffffff910ac74d>] ? get_parent_ip+0xd/0x50
[<ffffffff910ac8ab>] ? preempt_count_sub+0x6b/0xf0
[<ffffffff91206da9>] vfs_readv+0x39/0x50
[<ffffffff91206e6c>] SyS_readv+0x5c/0x100
[<ffffffff918249e4>] tracesys+0xdd/0xe2

Running the function tracer on that pid shows it spinning forever..
http://codemonkey.org.uk/junk/pipe-trace.txt

Kernel bug (missing EFAULT check somewhere perhaps?), or is this a
case where the fuzzer asked the kernel to do something stupid, and it obliged ?

Trinity's watchdog process has been repeatedly sending SIGKILL's to this
running pid, but we never seem to get out of this state long enough for
it to take effect.

This is 3.17-rc7 fwiw.

	Dave



* Re: pipe/page fault oddness.
  2014-09-30  3:33 pipe/page fault oddness Dave Jones
@ 2014-09-30  4:27 ` Linus Torvalds
  2014-09-30  4:33   ` Dave Jones
  2014-09-30  4:35   ` Al Viro
  0 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2014-09-30  4:27 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, Al Viro

On Mon, Sep 29, 2014 at 8:33 PM, Dave Jones <davej@redhat.com> wrote:
>
> Looking at the dump, there's only one running trinity child,
> with all the others blocking on it.
>
> trinity-c49     R  running task    12856 19464   7633 0x00000004
> ffff8800a09bf960 0000000000000002 ffff8800a09bf9f8 ffff880219650000
> 00000000001d4080 0000000000000000 ffff8800a09bffd8 00000000001d4080
> ffff88023f755bc0 ffff880219650000 ffff8800a09bffd8 ffff88010b017e00
> Call Trace:
> [<ffffffff911c3c67>] handle_mm_fault+0x3a7/0xcd0
> [<ffffffff91042c84>] __do_page_fault+0x1a4/0x600
> [<ffffffff910430fe>] do_page_fault+0x1e/0x70
> [<ffffffff918264b2>] page_fault+0x22/0x30
> [<ffffffff911bd7e3>] ? copy_page_to_iter+0x3b3/0x500
> [<ffffffff9120eddf>] pipe_read+0xdf/0x330
>
> Running the function tracer on that pid shows it spinning forever..
> http://codemonkey.org.uk/junk/pipe-trace.txt
>
> Kernel bug (missing EFAULT check somewhere perhaps?), or is this a
> case where the fuzzer asked the kernel to do something stupid, and it obliged ?

Hmm. It looks like copy_page_to_iter_iovec() is broken and keeps not
making any progress while just faulting.

I don't see how that could happen, though. All the loops there are
conditional on the user copies *not* failing (ie "!left"), and they
seem to properly update "iov".

Mind sending a disassembly of your "copy_page_to_iter" function, in
particular around that whole "0x3b3/0x500" area which is where the
page fault seems to happen?

Adding Al to the cc, since this code is from his commit 6e58e79db8a1
("introduce copy_page_to_iter, kill loop over iovec in
generic_file_aio_read()") but I don't see anything obviously wrong
there.

Al? Do you see something I don't? Dave's function trace does seem to
say that it doesn't even get back to pipe_read(), though, so the loop
really must be inside copy_page_to_iter().

             Linus


* Re: pipe/page fault oddness.
  2014-09-30  4:27 ` Linus Torvalds
@ 2014-09-30  4:33   ` Dave Jones
       [not found]     ` <CA+55aFwxdOBKHwwp7Zq1k19mHCyHYmYqigCVt59AtB-P7Zva1w@mail.gmail.com>
  2014-09-30  4:35   ` Al Viro
  1 sibling, 1 reply; 43+ messages in thread
From: Dave Jones @ 2014-09-30  4:33 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel, Al Viro

On Mon, Sep 29, 2014 at 09:27:09PM -0700, Linus Torvalds wrote:
 > On Mon, Sep 29, 2014 at 8:33 PM, Dave Jones <davej@redhat.com> wrote:
 > >
 > > Looking at the dump, there's only one running trinity child,
 > > with all the others blocking on it.
 > >
 > > trinity-c49     R  running task    12856 19464   7633 0x00000004
 > > ffff8800a09bf960 0000000000000002 ffff8800a09bf9f8 ffff880219650000
 > > 00000000001d4080 0000000000000000 ffff8800a09bffd8 00000000001d4080
 > > ffff88023f755bc0 ffff880219650000 ffff8800a09bffd8 ffff88010b017e00
 > > Call Trace:
 > > [<ffffffff911c3c67>] handle_mm_fault+0x3a7/0xcd0
 > > [<ffffffff91042c84>] __do_page_fault+0x1a4/0x600
 > > [<ffffffff910430fe>] do_page_fault+0x1e/0x70
 > > [<ffffffff918264b2>] page_fault+0x22/0x30
 > > [<ffffffff911bd7e3>] ? copy_page_to_iter+0x3b3/0x500
 > > [<ffffffff9120eddf>] pipe_read+0xdf/0x330
 > >
 > > Running the function tracer on that pid shows it spinning forever..
 > > http://codemonkey.org.uk/junk/pipe-trace.txt
 > >
 > > Kernel bug (missing EFAULT check somewhere perhaps?), or is this a
 > > case where the fuzzer asked the kernel to do something stupid, and it obliged ?
 > 
 > Hmm. It looks like copy_page_to_iter_iovec() is broken and keeps not
 > making any progress while just faulting.
 > 
 > I don't see how that could happen, though. All the loops there are
 > conditional on the user copies *not* failing (ie "!left"), and they
 > seem to properly update "iov".
 > 
 > Mind sending a disassembly of your "copy_page_to_iter" function, in
 > particular around that whole "0x3b3/0x500" area which is where the
 > page fault seems to happen?
 
Whole file is at http://codemonkey.org.uk/junk/iov_iter.txt

gcc is 4.8.3 btw

	Dave


* Re: pipe/page fault oddness.
  2014-09-30  4:27 ` Linus Torvalds
  2014-09-30  4:33   ` Dave Jones
@ 2014-09-30  4:35   ` Al Viro
  1 sibling, 0 replies; 43+ messages in thread
From: Al Viro @ 2014-09-30  4:35 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Linux Kernel

On Mon, Sep 29, 2014 at 09:27:09PM -0700, Linus Torvalds wrote:
> On Mon, Sep 29, 2014 at 8:33 PM, Dave Jones <davej@redhat.com> wrote:
> >
> > Looking at the dump, there's only one running trinity child,
> > with all the others blocking on it.
> >
> > trinity-c49     R  running task    12856 19464   7633 0x00000004
> > ffff8800a09bf960 0000000000000002 ffff8800a09bf9f8 ffff880219650000
> > 00000000001d4080 0000000000000000 ffff8800a09bffd8 00000000001d4080
> > ffff88023f755bc0 ffff880219650000 ffff8800a09bffd8 ffff88010b017e00
> > Call Trace:
> > [<ffffffff911c3c67>] handle_mm_fault+0x3a7/0xcd0
> > [<ffffffff91042c84>] __do_page_fault+0x1a4/0x600
> > [<ffffffff910430fe>] do_page_fault+0x1e/0x70
> > [<ffffffff918264b2>] page_fault+0x22/0x30
> > [<ffffffff911bd7e3>] ? copy_page_to_iter+0x3b3/0x500
> > [<ffffffff9120eddf>] pipe_read+0xdf/0x330
> >
> > Running the function tracer on that pid shows it spinning forever..
> > http://codemonkey.org.uk/junk/pipe-trace.txt
> >
> > Kernel bug (missing EFAULT check somewhere perhaps?), or is this a
> > case where the fuzzer asked the kernel to do something stupid, and it obliged ?
> 
> Hmm. It looks like copy_page_to_iter_iovec() is broken and keeps not
> making any progress while just faulting.
> 
> I don't see how that could happen, though. All the loops there are
> conditional on the user copies *not* failing (ie "!left"), and they
> seem to properly update "iov".
> 
> Mind sending a disassembly of your "copy_page_to_iter" function, in
> particular around that whole "0x3b3/0x500" area which is where the
> page fault seems to happen?
> 
> Adding Al to the cc, since this code is from his commit 6e58e79db8a1
> ("introduce copy_page_to_iter, kill loop over iovec in
> generic_file_aio_read()") but I don't see anything obviously wrong
> there.
> 
> Al? Do you see something I don't? Dave's function trace does seem to
> say that it doesn't even get back to pipe_read(), though, so the loop
> really must be inside copy_page_to_iter().

I'll take a look tomorrow morning after I get some sleep - 19 hours of uptime,
on top of 5 hours of sleep, on top of ~20 hours of uptime ;-/


* Re: pipe/page fault oddness.
       [not found]     ` <CA+55aFwxdOBKHwwp7Zq1k19mHCyHYmYqigCVt59AtB-P7Zva1w@mail.gmail.com>
@ 2014-09-30 15:52       ` Linus Torvalds
  2014-09-30 16:03         ` Rik van Riel
  2014-09-30 16:05         ` Dave Jones
  0 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2014-09-30 15:52 UTC (permalink / raw)
  To: Dave Jones, Al Viro, Linux Kernel, Rik van Riel
  Cc: Ingo Molnar, Michel Lespinasse

On Mon, Sep 29, 2014 at 9:54 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Odd. The 0x3b3 offset seems to be the single-byte write of zero, which is
> just the initial probe (aka "fault_in_pages_writeable()").
>
> How *that* could loop, I have no idea. Unless the exception table is broken.
> I'll take another look tomorrow.

Confirmed. It's the second write in fault_in_pages_writeable() (the
one that writes to the "end" pointer).
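
Just so we're all staring at the same thing, the helper is roughly this
(paraphrased from include/linux/pagemap.h -- details may differ slightly):

    static inline int fault_in_pages_writeable(char __user *uaddr, int size)
    {
            int ret;

            if (unlikely(size == 0))
                    return 0;

            /* Probe the first byte: writing a zero is fine, since the
               real copy will overwrite it anyway */
            ret = __put_user(0, uaddr);
            if (ret == 0) {
                    char __user *end = uaddr + size - 1;

                    /* If the range crosses a page boundary, probe the
                       last byte too -- this is the "second write" */
                    if (((unsigned long)uaddr & PAGE_MASK) !=
                                    ((unsigned long)end & PAGE_MASK))
                            ret = __put_user(0, end);
            }
            return ret;
    }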

And there's no loop in software. And in fact, the trace shows that
there is no exception case for the fault either, so the fault is
perfectly successful.

So if it's looping on that fault, what seems to happen is that the
page fault keeps happening.

Can you recreate this? Because if you can, please try to revert commit
e4a1cc56e4d7 ("x86: mm: drop TLB flush from ptep_set_access_flags").
Maybe the TLB has it read-only, and it doesn't get flushed, and the
page fault happens over and over again.
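
(For context -- and this is from memory, so treat it as a sketch rather
than the exact diff -- that commit dropped the per-address TLB flush
from the x86 helper, leaving roughly:

    int ptep_set_access_flags(struct vm_area_struct *vma,
                              unsigned long address, pte_t *ptep,
                              pte_t entry, int dirty)
    {
            int changed = !pte_same(*ptep, entry);

            if (changed && dirty) {
                    *ptep = entry;
                    pte_update_defer(vma->vm_mm, address, ptep);
                    /* flush_tlb_page(vma, address) used to be here */
            }

            return changed;
    }

so the revert just puts that flush_tlb_page() call back.)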

What kind of CPU is the problematic machine? There was some question
about just how architectural the whole "TLB entry causing a page fault
gets invalidated automatically" really is.

                    Linus


* Re: pipe/page fault oddness.
  2014-09-30 15:52       ` Linus Torvalds
@ 2014-09-30 16:03         ` Rik van Riel
  2014-09-30 16:07           ` Dave Jones
  2014-09-30 16:26           ` Linus Torvalds
  2014-09-30 16:05         ` Dave Jones
  1 sibling, 2 replies; 43+ messages in thread
From: Rik van Riel @ 2014-09-30 16:03 UTC (permalink / raw)
  To: Linus Torvalds, Dave Jones, Al Viro, Linux Kernel
  Cc: Ingo Molnar, Michel Lespinasse

On 09/30/2014 11:52 AM, Linus Torvalds wrote:
> On Mon, Sep 29, 2014 at 9:54 PM, Linus Torvalds 
> <torvalds@linux-foundation.org> wrote:
>> 
>> Odd. The 0x3b3 offset seems to be the single-byte write of zero,
>> which is just the initial probe (aka
>> "fault_in_pages_writeable()").
>> 
>> How *that* could loop, I have no idea. Unless the exception table
>> is broken. I'll take another look tomorrow.
> 
> Confirmed. It's the second write in fault_in_pages_writeable()
> (the one that writes to the "end" pointer).
> 
> And there's no loop in software. And in fact, the trace shows that 
> there is no exception case for the fault either, so the fault is 
> perfectly successful.
> 
> So if it's looping on that fault, what seems to happen is that the 
> page fault keeps happening.
> 
> Can you recreate this? Because if you can, please try to revert
> commit e4a1cc56e4d7 ("x86: mm: drop TLB flush from
> ptep_set_access_flags"). Maybe the TLB has it read-only, and it
> doesn't get flushed, and the page fault happens over and over
> again.
> 
> What kind of CPU is the problematic machine? There was some
> question about just how architectural the whole "TLB entry causing
> a page fault gets invalidated automatically" really is.

Intel people told me at the time that the guarantee was architectural.
I don't know whether other x86 manufacturers know this...

Doing a local tlb flush from ptep_set_access_flags seems appropriate,
if that is indeed the issue.

On the other hand, do_wp_page does not seem to do a tlb flush when
the old page is reused, so CPUs do get rid of inappropriate TLB
entries. We would have noticed do_wp_page not working right :)

-- 
All rights reversed


* Re: pipe/page fault oddness.
  2014-09-30 15:52       ` Linus Torvalds
  2014-09-30 16:03         ` Rik van Riel
@ 2014-09-30 16:05         ` Dave Jones
  2014-09-30 16:10           ` Linus Torvalds
  1 sibling, 1 reply; 43+ messages in thread
From: Dave Jones @ 2014-09-30 16:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Linux Kernel, Rik van Riel, Ingo Molnar, Michel Lespinasse

On Tue, Sep 30, 2014 at 08:52:08AM -0700, Linus Torvalds wrote:

 > So if it's looping on that fault, what seems to happen is that the
 > page fault keeps happening.
 > 
 > Can you recreate this? Because if you can, please try to revert commit
 > e4a1cc56e4d7 ("x86: mm: drop TLB flush from ptep_set_access_flags").
 > Maybe the TLB has it read-only, and it doesn't get flushed, and the
 > page fault happens over and over again.

I left it spinning overnight in case someone wanted me to probe it
further, so I haven't tried reproducing it yet.  It took ~12 hours
yesterday before it got in that state.  I'll restart it, and tell it
to only use pipe fd's, which might speed things up a little.

If I can reproduce it, I'll then try that revert.

 > What kind of CPU is the problematic machine? There was some question
 > about just how architectural the whole "TLB entry causing a page fault
 > gets invalidated automatically" really is.

model name	: Intel(R) Core(TM) i5-4670T CPU @ 2.30GHz

	Dave



* Re: pipe/page fault oddness.
  2014-09-30 16:03         ` Rik van Riel
@ 2014-09-30 16:07           ` Dave Jones
  2014-09-30 16:26           ` Linus Torvalds
  1 sibling, 0 replies; 43+ messages in thread
From: Dave Jones @ 2014-09-30 16:07 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Al Viro, Linux Kernel, Ingo Molnar, Michel Lespinasse

On Tue, Sep 30, 2014 at 12:03:06PM -0400, Rik van Riel wrote:

 > > What kind of CPU is the problematic machine? There was some
 > > question about just how architectural the whole "TLB entry causing
 > > a page fault gets invalidated automatically" really is.
 > 
 > Intel people told me at the time that the guarantee was architectural.
 > I don't know whether other x86 manufacturers know this...
 > 
 > Doing a local tlb flush from ptep_set_access_flags seems appropriate,
 > if that is indeed the issue.
 > 
 > On the other hand, do_wp_page does not seem to do a tlb flush when
 > the old page is reused, so CPUs do get rid of inappropriate TLB
 > entries. We would have noticed do_wp_page not working right :)
 
The puzzling thing is that we've had that code change for two years
without issue. This isn't a new machine, so why would bad effects
only be showing up now?

	Dave



* Re: pipe/page fault oddness.
  2014-09-30 16:05         ` Dave Jones
@ 2014-09-30 16:10           ` Linus Torvalds
  2014-09-30 16:22             ` Dave Jones
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2014-09-30 16:10 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Al Viro, Linux Kernel, Rik van Riel,
	Ingo Molnar, Michel Lespinasse

On Tue, Sep 30, 2014 at 9:05 AM, Dave Jones <davej@redhat.com> wrote:
>
> I left it spinning overnight in case someone wanted me to probe it
> further, so I haven't tried reproducing it yet.  It took ~12 hours
> yesterday before it got in that state.  I'll restart it, and tell it
> to only use pipe fd's, which might speed things up a little.

Actually, if you haven't restarted it yet, do a few more "Sysrq-T"'s
to see if the stack below the page fault ever changes. Is it *always*
that "second access by fault_in_pages_writeable()" or migth there be
some looping going on in copy_page_to_iter() after all? (Quite
frankly, I don't see how such looping could happen with a good
compiler, and 4.8.3 should be good, but just in case).

It's an unlikely scenario, so if you already rebooted, not a big deal.

> model name      : Intel(R) Core(TM) i5-4670T CPU @ 2.30GHz

Ok, that pretty much throws the "maybe AMD did something different"
theory out of the water. Something subtler. Still worth trying the
revert.

            Linus


* Re: pipe/page fault oddness.
  2014-09-30 16:10           ` Linus Torvalds
@ 2014-09-30 16:22             ` Dave Jones
  2014-09-30 16:40               ` Dave Jones
  0 siblings, 1 reply; 43+ messages in thread
From: Dave Jones @ 2014-09-30 16:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Linux Kernel, Rik van Riel, Ingo Molnar, Michel Lespinasse

On Tue, Sep 30, 2014 at 09:10:26AM -0700, Linus Torvalds wrote:
 > On Tue, Sep 30, 2014 at 9:05 AM, Dave Jones <davej@redhat.com> wrote:
 > >
 > > I left it spinning overnight in case someone wanted me to probe it
 > > further, so I haven't tried reproducing it yet.  It took ~12 hours
 > > yesterday before it got in that state.  I'll restart it, and tell it
 > > to only use pipe fd's, which might speed things up a little.
 > 
 > Actually, if you haven't restarted it yet, do a few more "Sysrq-T"'s
 > to see if the stack below the page fault ever changes. Is it *always*
 > that "second access by fault_in_pages_writeable()" or migth there be
 > some looping going on in copy_page_to_iter() after all? (Quite
 > frankly, I don't see how such looping could happen with a good
 > compiler, and 4.8.3 should be good, but just in case).

Here's another snap of the same process..

trinity-c49     R  running task    12544 19464   7633 0x00080004
ffffffff910ac790 ffffffff91042e54 ffff8800a09bf9e8 00000064c0206056
0000000000000001 0000000008dd0d40 ffff8800a09bfaa0 ffff8800a13427e8
ffff8800a13427d0 ffff88010b017e00 0000000000000002 ffffffff910ac795
Call Trace:
[<ffffffff910cd401>] ? lock_acquired+0x1d1/0x3c0
[<ffffffff910d4885>] ? do_raw_spin_unlock+0x5/0x90
[<ffffffff910d4835>] ? do_raw_spin_trylock+0x5/0x50
[<ffffffff910ac74d>] ? get_parent_ip+0xd/0x50
[<ffffffff918239f1>] ? _raw_spin_unlock+0x31/0x50
[<ffffffff911c3c67>] ? handle_mm_fault+0x3a7/0xcd0
[<ffffffff91042c84>] ? __do_page_fault+0x1a4/0x600
[<ffffffff9103ad23>] ? prepare_ftrace_return+0x73/0xe0
[<ffffffff91826b3a>] ? ftrace_graph_caller+0x5a/0x85
[<ffffffff910cb1ad>] ? trace_hardirqs_off+0xd/0x10
[<ffffffff918253e4>] ? retint_restore_args+0xe/0xe
[<ffffffff9118eda7>] ? context_tracking_user_exit+0x67/0x1b0
[<ffffffff910430fe>] ? do_page_fault+0x1e/0x70
[<ffffffff913a5ebe>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff918264b2>] ? page_fault+0x22/0x30
[<ffffffff911bd7e3>] ? copy_page_to_iter+0x3b3/0x500
[<ffffffff9120eddf>] ? pipe_read+0xdf/0x330
[<ffffffff9120ed00>] ? pipe_write+0x490/0x490
[<ffffffff912051a0>] ? do_sync_readv_writev+0xa0/0xa0
[<ffffffff912053b8>] ? do_iter_readv_writev+0x78/0xc0
[<ffffffff91206bbe>] ? do_readv_writev+0xce/0x280
[<ffffffff9120ed00>] ? pipe_write+0x490/0x490
[<ffffffff910cbbf6>] ? lock_release_holdtime.part.29+0xe6/0x160
[<ffffffff910ac74d>] ? get_parent_ip+0xd/0x50
[<ffffffff910ac74d>] ? get_parent_ip+0xd/0x50
[<ffffffff910ac8ab>] ? preempt_count_sub+0x6b/0xf0
[<ffffffff91206da9>] ? vfs_readv+0x39/0x50
[<ffffffff91206e6c>] ? SyS_readv+0x5c/0x100
[<ffffffff918249e4>] ? tracesys+0xdd/0xe2

and a few seconds later..

trinity-c49     R  running task    12544 19464   7633 0x00080004
ffff880235073840 ffff8800a09bf920 0000000000000046 ffff8800a09bf930
ffff8800a09bfa28 ffffffff910ac840 0000000000000000 ffff8800a09bf968
0000000000000000 00000000a8ef9693 ffff880244c203c0 00000000a8ef9693
Call Trace:
[<ffffffff910d4880>] ? do_raw_spin_trylock+0x50/0x50
[<ffffffff91167913>] ? trace_graph_entry+0x123/0x250
[<ffffffff9115b7ce>] ? trace_buffer_lock_reserve+0x1e/0x60
[<ffffffff911c3c67>] ? handle_mm_fault+0x3a7/0xcd0
[<ffffffff910ce68d>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff911678f8>] ? trace_graph_entry+0x108/0x250
[<ffffffff91042d14>] ? __do_page_fault+0x234/0x600
[<ffffffff9103ad23>] ? prepare_ftrace_return+0x73/0xe0
[<ffffffff910ca040>] ? down_write_nested+0xc0/0xc0
[<ffffffff910ac74d>] ? get_parent_ip+0xd/0x50
[<ffffffff91042d14>] ? __do_page_fault+0x234/0x600
[<ffffffff9103ad23>] ? prepare_ftrace_return+0x73/0xe0
[<ffffffff91826b3a>] ? ftrace_graph_caller+0x5a/0x85
[<ffffffff913a5ebe>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff9118eda7>] ? context_tracking_user_exit+0x67/0x1b0
[<ffffffff910430fe>] ? do_page_fault+0x1e/0x70
[<ffffffff913a5efd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[<ffffffff911bd7e3>] ? copy_page_to_iter+0x3b3/0x500
[<ffffffff9120eddf>] ? pipe_read+0xdf/0x330
[<ffffffff9120ed00>] ? pipe_write+0x490/0x490
[<ffffffff912051a0>] ? do_sync_readv_writev+0xa0/0xa0
[<ffffffff912053b8>] ? do_iter_readv_writev+0x78/0xc0
[<ffffffff91206bbe>] ? do_readv_writev+0xce/0x280
[<ffffffff9120ed00>] ? pipe_write+0x490/0x490
[<ffffffff910cbbf6>] ? lock_release_holdtime.part.29+0xe6/0x160
[<ffffffff910ac74d>] ? get_parent_ip+0xd/0x50
[<ffffffff910ac74d>] ? get_parent_ip+0xd/0x50
[<ffffffff910ac8ab>] ? preempt_count_sub+0x6b/0xf0
[<ffffffff91206da9>] ? vfs_readv+0x39/0x50
[<ffffffff91206e6c>] ? SyS_readv+0x5c/0x100
[<ffffffff918249e4>] ? tracesys+0xdd/0xe2

That second one is odd, because I had disabled all the function tracing
before doing the sysrq-t, so still seeing all the ftrace stuff on the
stack surprised me.  Also puzzling is how every symbol in both cases
is '?'.


I've got to run to the dentist for an hour or so, so if you still have
more things you want me to test on this before I reboot, I'll put that
off until I get back.

	Dave



* Re: pipe/page fault oddness.
  2014-09-30 16:03         ` Rik van Riel
  2014-09-30 16:07           ` Dave Jones
@ 2014-09-30 16:26           ` Linus Torvalds
  1 sibling, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2014-09-30 16:26 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Dave Jones, Al Viro, Linux Kernel, Ingo Molnar, Michel Lespinasse

On Tue, Sep 30, 2014 at 9:03 AM, Rik van Riel <riel@redhat.com> wrote:
>
> On the other hand, do_wp_page does not seem to do a tlb flush when
> the old page is reused, so CPUs do get rid of inappropriate TLB
> entries. We would have noticed do_wp_page not working right :)

Hmm? do_wp_page() uses the same ptep_set_access_flags(). So it too
used to do the TLB flush before that got removed there..

I do agree that clearly the CPU must *usually* do a TLB flush, or we'd
have noticed the lack immediately. I just worry that there are some
very specific circumstances when it might be missed.

That said, I don't actually believe it's that commit. I would just
like to remove it from the list of suspects.

I'm wondering what else could cause us to take effectively the same
page fault over and over again on what is a very simple instruction:

    movb   $0x0,(%rdx)

where handle_mm_fault() doesn't see anything wrong with the page
tables, but the CPU does.

Hmm. We also have that

        if (unlikely((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)))
                return;

thing. It assumes that the fatal signal will kill the process, but
that's only true if returning to user space. So that looks like a
potential mis-feature (although it should sort itself out *eventually*
when the IO has completed), but the trace Dave had didn't go through
the retry paths.

Anybody see anything else?

               Linus


* Re: pipe/page fault oddness.
  2014-09-30 16:22             ` Dave Jones
@ 2014-09-30 16:40               ` Dave Jones
  2014-09-30 16:46                 ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Dave Jones @ 2014-09-30 16:40 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro, Linux Kernel, Rik van Riel, Ingo Molnar,
	Michel Lespinasse

On Tue, Sep 30, 2014 at 12:22:01PM -0400, Dave Jones wrote:
 
 > [<ffffffff91167913>] ? trace_graph_entry+0x123/0x250
 > [<ffffffff9115b7ce>] ? trace_buffer_lock_reserve+0x1e/0x60
 > [<ffffffff911c3c67>] ? handle_mm_fault+0x3a7/0xcd0
 > [<ffffffff910ce68d>] ? trace_hardirqs_on+0xd/0x10
 > [<ffffffff911678f8>] ? trace_graph_entry+0x108/0x250
 > [<ffffffff91042d14>] ? __do_page_fault+0x234/0x600
 > [<ffffffff9103ad23>] ? prepare_ftrace_return+0x73/0xe0
 > [<ffffffff910ca040>] ? down_write_nested+0xc0/0xc0
 > [<ffffffff910ac74d>] ? get_parent_ip+0xd/0x50
 > [<ffffffff91042d14>] ? __do_page_fault+0x234/0x600
 > [<ffffffff9103ad23>] ? prepare_ftrace_return+0x73/0xe0
 > [<ffffffff91826b3a>] ? ftrace_graph_caller+0x5a/0x85
 > [<ffffffff913a5ebe>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 > [<ffffffff9118eda7>] ? context_tracking_user_exit+0x67/0x1b0
 > [<ffffffff910430fe>] ? do_page_fault+0x1e/0x70
 > [<ffffffff913a5efd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
 > [<ffffffff911bd7e3>] ? copy_page_to_iter+0x3b3/0x500
 > [<ffffffff9120eddf>] ? pipe_read+0xdf/0x330
 > [<ffffffff9120ed00>] ? pipe_write+0x490/0x490
 > [<ffffffff912051a0>] ? do_sync_readv_writev+0xa0/0xa0
 > [<ffffffff912053b8>] ? do_iter_readv_writev+0x78/0xc0
 > [<ffffffff91206bbe>] ? do_readv_writev+0xce/0x280
 > [<ffffffff9120ed00>] ? pipe_write+0x490/0x490
 > [<ffffffff910cbbf6>] ? lock_release_holdtime.part.29+0xe6/0x160
 > [<ffffffff910ac74d>] ? get_parent_ip+0xd/0x50
 > [<ffffffff910ac74d>] ? get_parent_ip+0xd/0x50
 > [<ffffffff910ac8ab>] ? preempt_count_sub+0x6b/0xf0
 > [<ffffffff91206da9>] ? vfs_readv+0x39/0x50
 > [<ffffffff91206e6c>] ? SyS_readv+0x5c/0x100
 > [<ffffffff918249e4>] ? tracesys+0xdd/0xe2
 > 
 > That second one is odd, because I had disabled all the function tracing
 > before doing the sysrq-t, so still seeing all the ftrace stuff on the
 > stack surprised me.

ah, echo 0 > tracing_on isn't enough, I had to 0 out set_ftrace_pid too.
Never mind.  Subsequent traces all looking the same, minus the tracing
junk though.

Is there some way I can figure out what address it's faulting on ?
I wonder if that might give some clues.

	Dave



* Re: pipe/page fault oddness.
  2014-09-30 16:40               ` Dave Jones
@ 2014-09-30 16:46                 ` Linus Torvalds
  2014-09-30 18:20                   ` Dave Jones
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2014-09-30 16:46 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Al Viro, Linux Kernel, Rik van Riel,
	Ingo Molnar, Michel Lespinasse

On Tue, Sep 30, 2014 at 9:40 AM, Dave Jones <davej@redhat.com> wrote:
>
> ah, echo 0 > tracing_on isn't enough, I had to 0 out set_ftrace_pid too.
> Never mind.  Subsequent traces all looking the same, minus the tracing
> junk though.

Yeah, looks like that copy_page_to_iter+0x3b3 is always there.

> Is there some way I can figure out what address it's faulting on ?
> I wonder if that might give some clues.

Could be useful. There's a trace_page_fault_kernel() entry that should
give it to you, along with error code.

            Linus


* Re: pipe/page fault oddness.
  2014-09-30 16:46                 ` Linus Torvalds
@ 2014-09-30 18:20                   ` Dave Jones
  2014-09-30 18:58                     ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Dave Jones @ 2014-09-30 18:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Linux Kernel, Rik van Riel, Ingo Molnar, Michel Lespinasse

On Tue, Sep 30, 2014 at 09:46:45AM -0700, Linus Torvalds wrote:
 > On Tue, Sep 30, 2014 at 9:40 AM, Dave Jones <davej@redhat.com> wrote:
 > >
 > > ah, echo 0 > tracing_on isn't enough, I had to 0 out set_ftrace_pid too.
 > > Never mind.  Subsequent traces all looking the same, minus the tracing
 > > junk though.
 > 
 > Yeah, looks like that copy_page_to_iter+0x3b3 is always there.
 > 
 > > Is there some way I can figure out what address it's faulting on ?
 > > I wonder if that might give some clues.
 > 
 > Could be useful. There's a trace_page_fault_kernel() entry that should
 > give it to you, along with error code.

Hm.

page_fault_kernel:    address=__per_cpu_end ip=copy_page_to_iter error_code=0x2

over and over..

	Dave




* Re: pipe/page fault oddness.
  2014-09-30 18:20                   ` Dave Jones
@ 2014-09-30 18:58                     ` Linus Torvalds
  2014-10-01  8:19                       ` Hugh Dickins
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2014-09-30 18:58 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Al Viro, Linux Kernel, Rik van Riel,
	Ingo Molnar, Michel Lespinasse, Kirill A. Shutemov, Hugh Dickins

On Tue, Sep 30, 2014 at 11:20 AM, Dave Jones <davej@redhat.com> wrote:
>
> page_fault_kernel:    address=__per_cpu_end ip=copy_page_to_iter error_code=0x2

Interesting. "error_code" in particular. The value "2" means that the
CPU thinks that the page is not present (bit zero is clear).
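
For reference, the low bits of the hardware error code are (they should
match the PF_* flags in arch/x86/mm/fault.c):

    /* bit 0: 0 = page not present,  1 = protection violation */
    /* bit 1: 0 = read access,       1 = write access         */
    /* bit 2: 0 = kernel-mode fault, 1 = user-mode fault      */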

(That "address" is useless - it's tried to turn a user address into a
kernel symbol, and the percpu symbols are zero-based, so it picks the
last of them. The "ip" is useless too, since it doesn't give the
offset)

So the CPU thinks it's a write to a not-present page, which means that
_PAGE_PRESENT bit is clear.

Now the *kernel* thinks a page is present not just if _PAGE_PRESENT is
set, but also if _PAGE_PROTNONE or _PAGE_NUMA are set. Sadly, your
trace is not very useful, because inlining has caused pretty much all
the cases to be in "handle_mm_fault()", so the trace doesn't really
tell which path this all takes.
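
(For reference, the current x86 pte_present() is simply

    static inline int pte_present(pte_t a)
    {
            return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
                                   _PAGE_NUMA);
    }

so any one of those three bits makes the kernel treat the pte as present.)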

But we can still do *some* analysis on the trace: do_wp_page()
shouldn't have been inlined, so it would have shown up in the trace if
it had been called. So I think we can be pretty confident that the
ptep_set_access_flags() we see is the one from handle_pte_fault().

And if that is the case, then we know that "pte_present()" is indeed
true as far as the kernel is concerned. So with _PAGE_PRESENT not being
set (based on the error code), we know that _PAGE_PROTNONE must be
set, otherwise we'd have triggered the pte_numa() check and exited
through do_numa_page().

So it smells like we have a PROT_NONE VM area (at least the page table
entries imply that). But "access_error()" should have flagged that (it
checks "vma->vm_flags & VM_WRITE"). How do we have a page table entry
marked _PAGE_PROTNONE, but VM_WRITE set in the vma?

Or, possibly, we have some confusion about the page tables themselves
(corruption, wrong %cr3 value, whatever), explaining why the CPU
thinks one thing, but our software page table walker thinks another.

I'm not seeing how this all happens. But I'm adding Kirill to the cc,
since he might see something I missed, and he touched some of this
code last ("tag, you're it").

Kirill: the thread is on lkml, but basically it boils down to the
second byte write in fault_in_pages_writeable() faulting forever,
despite handle_mm_fault() apparently thinking that everything is fine.

Also adding Hugh Dickins, just because the more people who know this
code that are involved, the better.

              Linus


* Re: pipe/page fault oddness.
  2014-09-30 18:58                     ` Linus Torvalds
@ 2014-10-01  8:19                       ` Hugh Dickins
  2014-10-01 16:01                         ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Hugh Dickins @ 2014-10-01  8:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Jones, Al Viro, Linux Kernel, Rik van Riel, Ingo Molnar,
	Michel Lespinasse, Kirill A. Shutemov, Hugh Dickins, Mel Gorman,
	Sasha Levin

On Tue, 30 Sep 2014, Linus Torvalds wrote:
> On Tue, Sep 30, 2014 at 11:20 AM, Dave Jones <davej@redhat.com> wrote:
> >
> > page_fault_kernel:    address=__per_cpu_end ip=copy_page_to_iter error_code=0x2
> 
> Interesting. "error_code" in particular. The value "2" means that the
> CPU thinks that the page is not present (bit zero is clear).
> 
> (That "address" is useless - it's tried to turn a user address into a
> kernel symbol, and the percpu symbols are zero-based, so it picks the
> last of them. The "ip" is useless too, since it doesn't give the
> offset)
> 
> So the CPU thinks it's a write to a not-present page, which means that
> _PAGE_PRESENT bit is clear.
> 
> Now the *kernel* thinks a page is present not just if _PAGE_PRESENT is
> set, but also if _PAGE_PROTNONE or _PAGE_NUMA are set. Sadly, your
> trace is not very useful, because inlining has caused pretty much all
> the cases to be in "handle_mm_fault()", so the trace doesn't really
> tell which path this all takes.
> 
> But we can still do *some* analysis on the trace: do_wp_page()
> shouldn't have been inlined, so it would have shown up in the trace if
> it had been called. So I think we can be pretty confident that the
> ptep_set_access_flags() we see is the one from handle_pte_fault().
> 
> And if that is the case, then we know that "pte_present()" is indeed
> true as far as the kernel is concerned. So with _PAGE_PRESENT not being
> set (based on the error code), we know that _PAGE_PROTNONE must be
> set, otherwise we'd have triggered the pte_numa() check and exited
> through do_numa_page().
> 
> So it smells like we have a PROT_NONE VM area (at least the page table
> entries imply that). But "access_error()" should have flagged that (it
> checks "vma->vm_flags & VM_WRITE"). How do we have a page table entry
> marked _PAGE_PROTNONE, but VM_WRITE set in the vma?
> 
> Or, possibly, we have some confusion about the page tables themselves
> (corruption, wrong %cr3 value, whatever), explaining why the CPU
> thinks one thing, but our software page table walker thinks another.
> 
> I'm not seeing how this all happens. But I'm adding Kirill to the cc,
> since he might see something I missed, and he touched some of this
> code last ("tag, you're it").
> 
> Kirill: the thread is on lkml, but basically it boils down to the
> second byte write in fault_in_pages_writeable() faulting forever,
> despite handle_mm_fault() apparently thinking that everything is fine.
> 
> Also adding Hugh Dickins, just because the more people who know this
> code that are involved, the better.

I've tried, but failed to explain it.

I think it's likely related to the VM_BUG_ON(!(val & _PAGE_PRESENT))
which linux-next has in pte_mknuma(), which Sasha Levin first reported
hitting in https://lkml.org/lkml/2014/8/26/869 (a resumption of the
"mm: BUG in unmap_page_range" thread, though its subject bug is fixed).

Mel and I gave it a lot of thought, but that too remains unexplained.
Sasha could reproduce it fairly easily on linux-next, but could not
reproduce it on 3.17-rc4 (plus the VM_BUG_ON); maybe Dave is doing
something different enough to get it on 3.17-rc7.

I say they're likely related because both could be explained if
there's some way in which a PROTNONE pte can get left behind after
the vma has been mprotected back from PROT_NONE to read-writable.
But we cannot see how (even when racing with page migration).

Irrelevance follows...

There *appears* to be a risk of hitting the VM_BUG_ON, or with no
VM_BUG_ON (as in 3.17-rc) pte_mknuma proceeding to add _PAGE_NUMA
to _PAGE_PROTNONE - making the pte then fail the pte_numa test,
but pass the pte_special test, hence fail the vm_normal_page test:
when coming from change_prot_numa serving MPOL_MF_LAZY for mbind.

However, that would still not explain Dave's endless refaulting;
though I was reminded to send you a patch to fix it, except that
when I came to test the fix, I could not produce the problem, and
eventually discovered a720094ded8c ("mm: mempolicy: Hide MPOL_NOOP
and MPOL_MF_LAZY from userspace for now") - that call to
change_prot_numa is still just dead code, so we're still safe from
its use on PROT_NONE areas (which task_numa_work carefully avoids).

Some time wasted on that, but I learnt a valuable debugging technique:
#undef EINVAL
#define EINVAL __LINE__

Hugh


* Re: pipe/page fault oddness.
  2014-10-01  8:19                       ` Hugh Dickins
@ 2014-10-01 16:01                         ` Linus Torvalds
  2014-10-01 16:18                           ` Linus Torvalds
  2014-10-02  8:47                           ` Hugh Dickins
  0 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2014-10-01 16:01 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Dave Jones, Al Viro, Linux Kernel, Rik van Riel, Ingo Molnar,
	Michel Lespinasse, Kirill A. Shutemov, Mel Gorman, Sasha Levin

On Wed, Oct 1, 2014 at 1:19 AM, Hugh Dickins <hughd@google.com> wrote:
>
> Irrelevance follows...

Maybe not irrelevant.

> There *appears* to be a risk of hitting the VM_BUG_ON, or with no
> VM_BUG_ON (as in 3.17-rc) pte_mknuma proceeding to add _PAGE_NUMA
> to _PAGE_PROTNONE - making the pte then fail the pte_numa test,
> but pass the pte_special test, hence fail the vm_normal_page test:
> when coming from change_prot_numa serving MPOL_MF_LAZY for mbind.

Ugh, yes. The whole _PAGE_NUMA is still a f*cking mess. I hate it.
Hate it, hate it, hate it.

We need to get rid of it, and just make it the same as pte_protnone().
And then the real protnone is in the vma flags, and if you actually
ever get to a pte that is marked protnone, you know it's a numa page.

Seriously.

I never understood what the objection to that was, but every time I
tell people to do it, they go crazy and think _PAGE_NUMA makes sense.
It doesn't. There's no excuse. Rik, what was your broken excuse again?
Something to do with Powerpc, but it is obviously not true, since
powerpc supports protnone just fine.

Even our own comments are confused, with include/asm-generic/pgtable.h saying:

 * _PAGE_NUMA works identical to _PAGE_PROTNONE (it's actually the
 * same bit too).

but no, it's not the same bit.

Can we please just get rid of _PAGE_NUMA. There is no excuse for it.

> However, that would still not explain Dave's endless refaulting;

Why not? You start out with a PROTNONE, trigger shrink_page_list() on
a hugepage, which calls add_to_swap(), which does
split_huge_page_to_list(), which in turn calls __split_huge_page(),
and that turns (_PAGE_PROTNONE) into (_PAGE_PROTNONE|_PAGE_NUMA),
which you will then fault on forever, because the kernel thinks the
page is present, but not a NUMA page.
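
Spelling that out with the current helpers: with _PAGE_PRESENT clear,
such a pte gives

    pte_present(): true,  because _PAGE_PROTNONE is set
    pte_numa():    false, because its check is
                   (flags & (_PAGE_NUMA|_PAGE_PROTNONE|_PAGE_PRESENT)) == _PAGE_NUMA

so handle_mm_fault() sees what looks like an ordinary present pte, while
the hardware keeps faulting on the clear present bit.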

IOW, it's *exactly* the same f*cking confusion between _PAGE_NUMA and
_PAGE_PROTNONE I've complained about before.

> Some time wasted on that, but I learnt a valuable debugging technique:
> #undef EINVAL
> #define EINVAL __LINE__

Wow. There's a certain beauty in the pure craziness.

However, be careful: our IS_ERR() handling knows that error numbers
are < 4096. So on a big file, and error pointers, that doesn't work
reliably.
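
(The relevant definitions, roughly, from include/linux/err.h:

    #define MAX_ERRNO       4095
    #define IS_ERR_VALUE(x) unlikely((x) >= (unsigned long)-MAX_ERRNO)

so an ERR_PTR built from a __LINE__ above 4095 no longer looks like an
error pointer to IS_ERR().)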

              Linus


* Re: pipe/page fault oddness.
  2014-10-01 16:01                         ` Linus Torvalds
@ 2014-10-01 16:18                           ` Linus Torvalds
  2014-10-01 17:29                             ` Rik van Riel
                                               ` (2 more replies)
  2014-10-02  8:47                           ` Hugh Dickins
  1 sibling, 3 replies; 43+ messages in thread
From: Linus Torvalds @ 2014-10-01 16:18 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Dave Jones, Al Viro, Linux Kernel, Rik van Riel, Ingo Molnar,
	Michel Lespinasse, Kirill A. Shutemov, Mel Gorman, Sasha Levin

On Wed, Oct 1, 2014 at 9:01 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> We need to get rid of it, and just make it the same as pte_protnone().
> And then the real protnone is in the vma flags, and if you actually
> ever get to a pte that is marked protnone, you know it's a numa page.

So I'd really suggest we do exactly that. Get rid of "pte_numa()"
entirely, get rid of "_PAGE_[BIT_]NUMA" entirely, and instead add a
"pte_protnone()" helper to check for the "protnone" case (which on x86
is testing the _PAGE_PROTNONE bit, and on most other architectures is
just testing that the page has no access rights).

Then we throw away "pte_mknuma()" and "pte_mknonnuma()" entirely,
because they are brainless sh*t, and we just use

    ptent = ptep_modify_prot_start(mm, addr, pte);
    ptent = pte_modify(ptent, newprot);
    ptep_modify_prot_commit(mm, addr, pte, ptent);

reliably instead (where for the mknuma case "newprot" is PROT_NONE,
and for mknonnuma() it is vma->vm_page_prot). Yes, that means that you
have to pass in the vma to those functions, but that just makes sense
anyway.

And if that means that we lose the numa flag on mprotect etc, nobody sane cares.

Seriously, why can't we just do this, and throw away all the crap that
is "numa special case". This would make all the random games in
change_pte_range() just go away entirely, because the whole NUMA thing
really wouldn't be a special case for the pte AT ALL any more. All it
would be is that a pte could be marked PROT_NONE even if the
vma->vm_flags aren't.

Please, please, please? The current _PAGE_NUMA really is a horrible
horrible thing, and may well be the source of this bug.

The fact that it took DaveJ a long time to trigger his lockup would be
entirely consistent with "you have to split a PROTNONE large page due
to memory pressure", so the problem with our current pte_mknuma() that
Hugh points out looks entirely possible to me.

Now, there may be some reason why it can't happen, but even in the
absence of this bug, I really think that _PAGE_NUMA has been a huge
mistake from day one.

                  Linus


* Re: pipe/page fault oddness.
  2014-10-01 16:18                           ` Linus Torvalds
@ 2014-10-01 17:29                             ` Rik van Riel
  2014-10-02  8:28                               ` Peter Zijlstra
  2014-10-01 20:20                             ` Linus Torvalds
  2014-10-02 12:45                             ` Mel Gorman
  2 siblings, 1 reply; 43+ messages in thread
From: Rik van Riel @ 2014-10-01 17:29 UTC (permalink / raw)
  To: Linus Torvalds, Hugh Dickins
  Cc: Dave Jones, Al Viro, Linux Kernel, Ingo Molnar,
	Michel Lespinasse, Kirill A. Shutemov, Mel Gorman, Sasha Levin,
	Peter Zijlstra

On 10/01/2014 12:18 PM, Linus Torvalds wrote:

> Seriously, why can't we just do this, and throw away all the crap that
> is "numa special case". This would make all the random games in
> change_pte_range() just go away entirely, because the whole NUMA thing
> really wouldn't be a special case for the pte AT ALL any more. All it
> would be is that a pte could be marked PROT_NONE even if the
> vma->vm_flags aren't.

That's what I suggested quite a while back, but IIRC either
Peter or Mel brought up a reason why this was not possible.

Unfortunately I do not remember what it was, just that I
was convinced by it at the time...


* Re: pipe/page fault oddness.
  2014-10-01 16:18                           ` Linus Torvalds
  2014-10-01 17:29                             ` Rik van Riel
@ 2014-10-01 20:20                             ` Linus Torvalds
  2014-10-01 21:09                               ` Rik van Riel
  2014-10-01 22:08                               ` Sasha Levin
  2014-10-02 12:45                             ` Mel Gorman
  2 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2014-10-01 20:20 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Dave Jones, Al Viro, Linux Kernel, Rik van Riel, Ingo Molnar,
	Michel Lespinasse, Kirill A. Shutemov, Mel Gorman, Sasha Levin

[-- Attachment #1: Type: text/plain, Size: 2629 bytes --]

On Wed, Oct 1, 2014 at 9:18 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So I'd really suggest we do exactly that. Get rid of "pte_numa()"
> entirely, get rid of "_PAGE_[BIT_]NUMA" entirely, and instead add a
> "pte_protnone()" helper to check for the "protnone" case (which on x86
> is testing the _PAGE_PROTNONE bit, and on most other architectures is
> just testing that the page has no access rights).
>
> Then we throw away "pte_mknuma()" and "pte_mknonnuma()" entirely,
> because they are brainless sh*t, and we just use
>
>     ptent = ptep_modify_prot_start(mm, addr, pte);
>     ptent = pte_modify(ptent, newprot);
>     ptep_modify_prot_commit(mm, addr, pte, ptent);
>
> reliably instead (where for the mknuma case "newprot" is PROT_NONE,
> and for mknonnuma() it is vma->vm_page_prot). Yes, that means that you
> have to pass in the vma to those functions, but that just makes sense
> anyway.
>
> And if that means that we lose the numa flag on mprotect etc, nobody sane cares.

So here is a *COMPLETELY UNTESTED* and probably seriously buggy first
version of such a patch. It doesn't do the powerpc conversion, so
somebody would need to check that eventually, but aside from that
obvious issue, can people fix this up? Or comment on why it doesn't
work.

Now, I like this because it gets rid of the horrible PAGE_NUMA special
cases, but it really seems to simplify things in general. Lookie here:

     13 files changed, 74 insertions(+), 268 deletions(-)

that's really mainly just the removal of odd and broken numa pte/pmd
helper functions from <asm-generic/pgtable.h> that aren't needed any
more because the normal "change protections" functions just DTRT
automatically. Although there are actually a few other cases that got
simpler too, so it's not *just* removal of those _PAGE_NUMA-specific
helpers.

One thing this does *not* remove is the special pte locking rule in
the "change_*_range()" functions: they still take that broken
"prot_numa" argument. HOWEVER, it isn't actually used for any page
table modifications, the only reason for it existing is the hacky
locking issue (see lock_pte_protection(), and the comment about races
with the transhuge accesses).

Now, I'll be honest: this patch *might* just work, but I expect it to
have some stupid problem. It compiles. I haven't even dared boot it,
much less try any numa benchmarks that wouldn't show anything sane on
my machine anyway.

So I'm really sending this patch out in the hope that it will get
comments, fixup and possibly even testing by people who actually know
the NUMA balancing code. Rik?  Anybody?

                Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 19594 bytes --]

 arch/x86/include/asm/pgtable.h       |  31 +++++--
 arch/x86/include/asm/pgtable_types.h |  27 +-----
 arch/x86/mm/gup.c                    |   4 +-
 include/asm-generic/pgtable.h        | 159 ++---------------------------------
 include/linux/huge_mm.h              |   3 +-
 init/Kconfig                         |  11 ---
 mm/gup.c                             |   4 +-
 mm/huge_memory.c                     |  42 +++------
 mm/memory.c                          |  16 ++--
 mm/mempolicy.c                       |   2 +-
 mm/migrate.c                         |   1 -
 mm/mprotect.c                        |  40 +++------
 mm/pgtable-generic.c                 |   2 -
 13 files changed, 74 insertions(+), 268 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index aa97a070f09f..0d4ec3a62fed 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -299,7 +299,7 @@ static inline pmd_t pmd_mkwrite(pmd_t pmd)
 
 static inline pmd_t pmd_mknotpresent(pmd_t pmd)
 {
-	return pmd_clear_flags(pmd, _PAGE_PRESENT);
+	return pmd_clear_flags(pmd, _PAGE_PRESENT | _PAGE_PROTNONE);
 }
 
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
@@ -457,8 +457,7 @@ static inline int pte_same(pte_t a, pte_t b)
 
 static inline int pte_present(pte_t a)
 {
-	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
-			       _PAGE_NUMA);
+	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
 }
 
 #define pte_present_nonuma pte_present_nonuma
@@ -473,7 +472,7 @@ static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
 	if (pte_flags(a) & _PAGE_PRESENT)
 		return true;
 
-	if ((pte_flags(a) & (_PAGE_PROTNONE | _PAGE_NUMA)) &&
+	if ((pte_flags(a) & _PAGE_PROTNONE) &&
 			mm_tlb_flush_pending(mm))
 		return true;
 
@@ -493,10 +492,26 @@ static inline int pmd_present(pmd_t pmd)
 	 * the _PAGE_PSE flag will remain set at all times while the
 	 * _PAGE_PRESENT bit is clear).
 	 */
-	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE |
-				 _PAGE_NUMA);
+	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE);
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+/*
+ * Technically these work even when not doing NUMA
+ * balancing, but we don't care and have simpler
+ * "always false" versions for that case
+ */
+static inline int pte_protnone(pte_t pte)
+{
+	return pte_flags(pte) & _PAGE_PROTNONE;
+}
+
+static inline int pmd_protnone(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_PROTNONE;
+}
+#endif
+
 static inline int pmd_none(pmd_t pmd)
 {
 	/* Only check low word on 32-bit platforms, since it might be
@@ -554,8 +569,8 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 static inline int pmd_bad(pmd_t pmd)
 {
 #ifdef CONFIG_NUMA_BALANCING
-	/* pmd_numa check */
-	if ((pmd_flags(pmd) & (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)
+	/* pmd_numa check. Really? */
+	if ((pmd_flags(pmd) & (_PAGE_PROTNONE|_PAGE_PRESENT)) == _PAGE_PROTNONE)
 		return 0;
 #endif
 	return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index f216963760e5..5438e96b2915 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -28,14 +28,6 @@
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
 #define _PAGE_BIT_NX           63       /* No execute: only valid after cpuid check */
 
-/*
- * Swap offsets on configurations that allow automatic NUMA balancing use the
- * bits after _PAGE_BIT_GLOBAL. To uniquely distinguish NUMA hinting PTEs from
- * swap entries, we use the first bit after _PAGE_BIT_GLOBAL and shrink the
- * maximum possible swap space from 16TB to 8TB.
- */
-#define _PAGE_BIT_NUMA		(_PAGE_BIT_GLOBAL+1)
-
 /* If _PAGE_BIT_PRESENT is clear, we use these: */
 /* - if the user mapped it with PROT_NONE; pte_present gives true */
 #define _PAGE_BIT_PROTNONE	_PAGE_BIT_GLOBAL
@@ -79,21 +71,6 @@
 #endif
 
 /*
- * _PAGE_NUMA distinguishes between a numa hinting minor fault and a page
- * that is not present. The hinting fault gathers numa placement statistics
- * (see pte_numa()). The bit is always zero when the PTE is not present.
- *
- * The bit picked must be always zero when the pmd is present and not
- * present, so that we don't lose information when we set it while
- * atomically clearing the present bit.
- */
-#ifdef CONFIG_NUMA_BALANCING
-#define _PAGE_NUMA	(_AT(pteval_t, 1) << _PAGE_BIT_NUMA)
-#else
-#define _PAGE_NUMA	(_AT(pteval_t, 0))
-#endif
-
-/*
  * Tracking soft dirty bit when a page goes to a swap is tricky.
  * We need a bit which can be stored in pte _and_ not conflict
  * with swap entry format. On x86 bits 6 and 7 are *not* involved
@@ -126,8 +103,8 @@
 /* Set of bits not changed in pte_modify */
 #define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
 			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY |	\
-			 _PAGE_SOFT_DIRTY | _PAGE_NUMA)
-#define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE | _PAGE_NUMA)
+			 _PAGE_SOFT_DIRTY)
+#define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
 
 #define _PAGE_CACHE_MASK	(_PAGE_PCD | _PAGE_PWT)
 #define _PAGE_CACHE_WB		(0)
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 207d9aef662d..f32e12c44b23 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -84,7 +84,7 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 		struct page *page;
 
 		/* Similar to the PMD case, NUMA hinting must take slow path */
-		if (pte_numa(pte)) {
+		if (pte_protnone(pte)) {
 			pte_unmap(ptep);
 			return 0;
 		}
@@ -178,7 +178,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 			 * slowpath for accounting purposes and so that they
 			 * can be serialised against THP migration.
 			 */
-			if (pmd_numa(pmd))
+			if (pmd_protnone(pmd))
 				return 0;
 			if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))
 				return 0;
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 53b2acc38213..a7c08d9ddeda 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -659,165 +659,24 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
 #endif
 }
 
-#ifdef CONFIG_NUMA_BALANCING
-#ifdef CONFIG_ARCH_USES_NUMA_PROT_NONE
+#ifndef CONFIG_NUMA_BALANCING
 /*
- * _PAGE_NUMA works identical to _PAGE_PROTNONE (it's actually the
- * same bit too). It's set only when _PAGE_PRESET is not set and it's
- * never set if _PAGE_PRESENT is set.
- *
- * pte/pmd_present() returns true if pte/pmd_numa returns true. Page
- * fault triggers on those regions if pte/pmd_numa returns true
- * (because _PAGE_PRESENT is not set).
- */
-#ifndef pte_numa
-static inline int pte_numa(pte_t pte)
-{
-	return (pte_flags(pte) &
-		(_PAGE_NUMA|_PAGE_PROTNONE|_PAGE_PRESENT)) == _PAGE_NUMA;
-}
-#endif
-
-#ifndef pmd_numa
-static inline int pmd_numa(pmd_t pmd)
-{
-	return (pmd_flags(pmd) &
-		(_PAGE_NUMA|_PAGE_PROTNONE|_PAGE_PRESENT)) == _PAGE_NUMA;
-}
-#endif
-
-/*
- * pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag automatically
- * because they're called by the NUMA hinting minor page fault. If we
- * wouldn't set the _PAGE_ACCESSED bitflag here, the TLB miss handler
- * would be forced to set it later while filling the TLB after we
- * return to userland. That would trigger a second write to memory
- * that we optimize away by setting _PAGE_ACCESSED here.
+ * Technically a pte can be PROTNONE even when not
+ * doing NUMA balancing, but the only case where the
+ * kernel actually _cares_ is for the numa balancing
+ * case, so by default we just implement the simpler
+ * "always no" helper function.
  */
-#ifndef pte_mknonnuma
-static inline pte_t pte_mknonnuma(pte_t pte)
-{
-	pteval_t val = pte_val(pte);
-
-	val &= ~_PAGE_NUMA;
-	val |= (_PAGE_PRESENT|_PAGE_ACCESSED);
-	return __pte(val);
-}
-#endif
-
-#ifndef pmd_mknonnuma
-static inline pmd_t pmd_mknonnuma(pmd_t pmd)
-{
-	pmdval_t val = pmd_val(pmd);
-
-	val &= ~_PAGE_NUMA;
-	val |= (_PAGE_PRESENT|_PAGE_ACCESSED);
-
-	return __pmd(val);
-}
-#endif
-
-#ifndef pte_mknuma
-static inline pte_t pte_mknuma(pte_t pte)
-{
-	pteval_t val = pte_val(pte);
-
-	val &= ~_PAGE_PRESENT;
-	val |= _PAGE_NUMA;
-
-	return __pte(val);
-}
-#endif
-
-#ifndef ptep_set_numa
-static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
-				 pte_t *ptep)
-{
-	pte_t ptent = *ptep;
-
-	ptent = pte_mknuma(ptent);
-	set_pte_at(mm, addr, ptep, ptent);
-	return;
-}
-#endif
-
-#ifndef pmd_mknuma
-static inline pmd_t pmd_mknuma(pmd_t pmd)
-{
-	pmdval_t val = pmd_val(pmd);
-
-	val &= ~_PAGE_PRESENT;
-	val |= _PAGE_NUMA;
-
-	return __pmd(val);
-}
-#endif
-
-#ifndef pmdp_set_numa
-static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr,
-				 pmd_t *pmdp)
-{
-	pmd_t pmd = *pmdp;
-
-	pmd = pmd_mknuma(pmd);
-	set_pmd_at(mm, addr, pmdp, pmd);
-	return;
-}
-#endif
-#else
-extern int pte_numa(pte_t pte);
-extern int pmd_numa(pmd_t pmd);
-extern pte_t pte_mknonnuma(pte_t pte);
-extern pmd_t pmd_mknonnuma(pmd_t pmd);
-extern pte_t pte_mknuma(pte_t pte);
-extern pmd_t pmd_mknuma(pmd_t pmd);
-extern void ptep_set_numa(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
-extern void pmdp_set_numa(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp);
-#endif /* CONFIG_ARCH_USES_NUMA_PROT_NONE */
-#else
-static inline int pmd_numa(pmd_t pmd)
+static inline int pte_protnone(pte_t pte)
 {
 	return 0;
 }
 
-static inline int pte_numa(pte_t pte)
+static inline int pmd_protnone(pmd_t pmd)
 {
 	return 0;
 }
-
-static inline pte_t pte_mknonnuma(pte_t pte)
-{
-	return pte;
-}
-
-static inline pmd_t pmd_mknonnuma(pmd_t pmd)
-{
-	return pmd;
-}
-
-static inline pte_t pte_mknuma(pte_t pte)
-{
-	return pte;
-}
-
-static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
-				 pte_t *ptep)
-{
-	return;
-}
-
-
-static inline pmd_t pmd_mknuma(pmd_t pmd)
-{
-	return pmd;
-}
-
-static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr,
-				 pmd_t *pmdp)
-{
-	return ;
-}
-#endif /* CONFIG_NUMA_BALANCING */
+#endif
 
 #endif /* CONFIG_MMU */
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 63579cb8d3dc..2a2ac91eafde 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -31,8 +31,7 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
 			 unsigned long new_addr, unsigned long old_end,
 			 pmd_t *old_pmd, pmd_t *new_pmd);
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-			unsigned long addr, pgprot_t newprot,
-			int prot_numa);
+			unsigned long addr, pgprot_t newprot);
 
 enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/init/Kconfig b/init/Kconfig
index e84c6423a2e5..f5c6bb56617a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -887,17 +887,6 @@ config ARCH_SUPPORTS_INT128
 config ARCH_WANT_NUMA_VARIABLE_LOCALITY
 	bool
 
-#
-# For architectures that are willing to define _PAGE_NUMA as _PAGE_PROTNONE
-config ARCH_WANTS_PROT_NUMA_PROT_NONE
-	bool
-
-config ARCH_USES_NUMA_PROT_NONE
-	bool
-	default y
-	depends on ARCH_WANTS_PROT_NUMA_PROT_NONE
-	depends on NUMA_BALANCING
-
 config NUMA_BALANCING_DEFAULT_ENABLED
 	bool "Automatically enable NUMA aware memory/task placement"
 	default y
diff --git a/mm/gup.c b/mm/gup.c
index 91d044b1600d..7a7419da2da9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -60,7 +60,7 @@ retry:
 		migration_entry_wait(mm, pmd, address);
 		goto retry;
 	}
-	if ((flags & FOLL_NUMA) && pte_numa(pte))
+	if ((flags & FOLL_NUMA) && pte_protnone(pte))
 		goto no_page;
 	if ((flags & FOLL_WRITE) && !pte_write(pte)) {
 		pte_unmap_unlock(ptep, ptl);
@@ -189,7 +189,7 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 		}
 		return page;
 	}
-	if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
+	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
 		return no_page_table(vma, flags);
 	if (pmd_trans_huge(*pmd)) {
 		if (flags & FOLL_SPLIT) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d9a21d06b862..14de54af6c38 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1223,7 +1223,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 		return ERR_PTR(-EFAULT);
 
 	/* Full NUMA hinting faults to serialise migration in fault paths */
-	if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
+	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
 		goto out;
 
 	page = pmd_page(*pmd);
@@ -1366,9 +1366,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	goto out;
 clear_pmdnuma:
 	BUG_ON(!PageLocked(page));
-	pmd = pmd_mknonnuma(pmd);
+	pmd = pmd_modify(pmd, vma->vm_page_prot);
 	set_pmd_at(mm, haddr, pmdp, pmd);
-	VM_BUG_ON(pmd_numa(*pmdp));
 	update_mmu_cache_pmd(vma, addr, pmdp);
 	unlock_page(page);
 out_unlock:
@@ -1502,7 +1501,7 @@ out:
  *  - HPAGE_PMD_NR is protections changed and TLB flush necessary
  */
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long addr, pgprot_t newprot, int prot_numa)
+		unsigned long addr, pgprot_t newprot)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	spinlock_t *ptl;
@@ -1511,29 +1510,12 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
 		pmd_t entry;
 		ret = 1;
-		if (!prot_numa) {
-			entry = pmdp_get_and_clear(mm, addr, pmd);
-			if (pmd_numa(entry))
-				entry = pmd_mknonnuma(entry);
-			entry = pmd_modify(entry, newprot);
-			ret = HPAGE_PMD_NR;
-			set_pmd_at(mm, addr, pmd, entry);
-			BUG_ON(pmd_write(entry));
-		} else {
-			struct page *page = pmd_page(*pmd);
 
-			/*
-			 * Do not trap faults against the zero page. The
-			 * read-only data is likely to be read-cached on the
-			 * local CPU cache and it is less useful to know about
-			 * local vs remote hits on the zero page.
-			 */
-			if (!is_huge_zero_page(page) &&
-			    !pmd_numa(*pmd)) {
-				pmdp_set_numa(mm, addr, pmd);
-				ret = HPAGE_PMD_NR;
-			}
-		}
+		entry = pmdp_get_and_clear(mm, addr, pmd);
+		entry = pmd_modify(entry, newprot);
+		ret = HPAGE_PMD_NR;
+		set_pmd_at(mm, addr, pmd, entry);
+		BUG_ON(pmd_write(entry));
 		spin_unlock(ptl);
 	}
 
@@ -1794,15 +1776,17 @@ static int __split_huge_page_map(struct page *page,
 		haddr = address;
 		for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
 			pte_t *pte, entry;
+			pgprot_t newprot = vma->vm_page_prot;
+
 			BUG_ON(PageCompound(page+i));
-			entry = mk_pte(page + i, vma->vm_page_prot);
+			if (pmd_protnone(*pmd))
+				newprot = PAGE_NONE;
+			entry = mk_pte(page + i, newprot);
 			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 			if (!pmd_write(*pmd))
 				entry = pte_wrprotect(entry);
 			if (!pmd_young(*pmd))
 				entry = pte_mkold(entry);
-			if (pmd_numa(*pmd))
-				entry = pte_mknuma(entry);
 			pte = pte_offset_map(&_pmd, haddr);
 			BUG_ON(!pte_none(*pte));
 			set_pte_at(mm, haddr, pte, entry);
diff --git a/mm/memory.c b/mm/memory.c
index e229970e4223..62f1b7b82aef 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3118,9 +3118,9 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	* validation through pte_unmap_same(). It's of NUMA type but
 	* the pfn may be screwed if the read is non atomic.
 	*
-	* ptep_modify_prot_start is not called as this is clearing
-	* the _PAGE_NUMA bit and it is not really expected that there
-	* would be concurrent hardware modifications to the PTE.
+	* We can safely just do a "set_pte_at()", because the old
+	* page table entry is not accessible, so there would be no
+	* concurrent hardware modifications to the PTE.
 	*/
 	ptl = pte_lockptr(mm, pmd);
 	spin_lock(ptl);
@@ -3129,7 +3129,11 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto out;
 	}
 
-	pte = pte_mknonnuma(pte);
+	/* Make it present again */
+	/* Should we make it writable?  We don't have the fault flags */
+	pte = pte_modify(pte, vma->vm_page_prot);
+	pte = pte_mkyoung(pte);
+
 	set_pte_at(mm, addr, ptep, pte);
 	update_mmu_cache(vma, addr, ptep);
 
@@ -3218,7 +3222,7 @@ static int handle_pte_fault(struct mm_struct *mm,
 					pte, pmd, flags, entry);
 	}
 
-	if (pte_numa(entry))
+	if (pte_protnone(entry))
 		return do_numa_page(mm, vma, address, entry, pte, pmd);
 
 	ptl = pte_lockptr(mm, pmd);
@@ -3296,7 +3300,7 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			if (pmd_trans_splitting(orig_pmd))
 				return 0;
 
-			if (pmd_numa(orig_pmd))
+			if (pmd_protnone(orig_pmd))
 				return do_huge_pmd_numa_page(mm, vma, address,
 							     orig_pmd, pmd);
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 8f5330d74f47..5e22f1ff4e33 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -635,7 +635,7 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
 {
 	int nr_updated;
 
-	nr_updated = change_protection(vma, addr, end, vma->vm_page_prot, 0, 1);
+	nr_updated = change_protection(vma, addr, end, PAGE_NONE, 0, 1);
 	if (nr_updated)
 		count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index f78ec9bd454d..2600f33416b1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1901,7 +1901,6 @@ out_fail:
 out_dropref:
 	ptl = pmd_lock(mm, pmd);
 	if (pmd_same(*pmd, entry)) {
-		entry = pmd_mknonnuma(entry);
 		set_pmd_at(mm, mmun_start, pmd, entry);
 		update_mmu_cache_pmd(vma, address, &entry);
 	}
diff --git a/mm/mprotect.c b/mm/mprotect.c
index c43d557941f8..aef6fe6aba36 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -82,34 +82,17 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		oldpte = *pte;
 		if (pte_present(oldpte)) {
 			pte_t ptent;
-			bool updated = false;
 
-			if (!prot_numa) {
-				ptent = ptep_modify_prot_start(mm, addr, pte);
-				if (pte_numa(ptent))
-					ptent = pte_mknonnuma(ptent);
-				ptent = pte_modify(ptent, newprot);
-				/*
-				 * Avoid taking write faults for pages we
-				 * know to be dirty.
-				 */
-				if (dirty_accountable && pte_dirty(ptent))
-					ptent = pte_mkwrite(ptent);
-				ptep_modify_prot_commit(mm, addr, pte, ptent);
-				updated = true;
-			} else {
-				struct page *page;
-
-				page = vm_normal_page(vma, addr, oldpte);
-				if (page && !PageKsm(page)) {
-					if (!pte_numa(oldpte)) {
-						ptep_set_numa(mm, addr, pte);
-						updated = true;
-					}
-				}
-			}
-			if (updated)
-				pages++;
+			ptent = ptep_modify_prot_start(mm, addr, pte);
+			ptent = pte_modify(ptent, newprot);
+			/*
+			 * Avoid taking write faults for pages we
+			 * know to be dirty.
+			 */
+			if (dirty_accountable && pte_dirty(ptent))
+				ptent = pte_mkwrite(ptent);
+			ptep_modify_prot_commit(mm, addr, pte, ptent);
+			pages++;
 		} else if (IS_ENABLED(CONFIG_MIGRATION) && !pte_file(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
 
@@ -164,8 +147,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 			if (next - addr != HPAGE_PMD_SIZE)
 				split_huge_page_pmd(vma, addr, pmd);
 			else {
-				int nr_ptes = change_huge_pmd(vma, pmd, addr,
-						newprot, prot_numa);
+				int nr_ptes = change_huge_pmd(vma, pmd, addr, newprot);
 
 				if (nr_ptes) {
 					if (nr_ptes == HPAGE_PMD_NR) {
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index dfb79e028ecb..c25f94b33811 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -193,8 +193,6 @@ void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 		     pmd_t *pmdp)
 {
 	pmd_t entry = *pmdp;
-	if (pmd_numa(entry))
-		entry = pmd_mknonnuma(entry);
 	set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
 	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
 }

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: pipe/page fault oddness.
  2014-10-01 20:20                             ` Linus Torvalds
@ 2014-10-01 21:09                               ` Rik van Riel
  2014-10-01 22:08                               ` Sasha Levin
  1 sibling, 0 replies; 43+ messages in thread
From: Rik van Riel @ 2014-10-01 21:09 UTC (permalink / raw)
  To: Linus Torvalds, Hugh Dickins
  Cc: Dave Jones, Al Viro, Linux Kernel, Ingo Molnar,
	Michel Lespinasse, Kirill A. Shutemov, Mel Gorman, Sasha Levin

On 10/01/2014 04:20 PM, Linus Torvalds wrote:

> Now, I'll be honest: this patch *might* just work, but I expect it to
> have some stupid problem. It compiles. I haven't even dared boot it,
> much less try any numa benchmarks that wouldn't show anything sane on
> my machine anyway.
> 
> So I'm really sending this patch out in the hope that it will get
> comments, fixup and possibly even testing by people who actually know
> the NUMA balancing code. Rik?  Anybody?

I'll add it to the list. No guarantees I'll be able to get
to it before Linux Con / KVM Forum though. I have a bit of
a backlog to get through this coming week...


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: pipe/page fault oddness.
  2014-10-01 20:20                             ` Linus Torvalds
  2014-10-01 21:09                               ` Rik van Riel
@ 2014-10-01 22:08                               ` Sasha Levin
  2014-10-01 22:28                                 ` Chuck Ebbert
  2014-10-01 22:42                                 ` Linus Torvalds
  1 sibling, 2 replies; 43+ messages in thread
From: Sasha Levin @ 2014-10-01 22:08 UTC (permalink / raw)
  To: Linus Torvalds, Hugh Dickins
  Cc: Dave Jones, Al Viro, Linux Kernel, Rik van Riel, Ingo Molnar,
	Michel Lespinasse, Kirill A. Shutemov, Mel Gorman

On 10/01/2014 04:20 PM, Linus Torvalds wrote:
> So I'm really sending this patch out in the hope that it will get
> comments, fixup and possibly even testing by people who actually know
> the NUMA balancing code. Rik?  Anybody?

Hi Linus,

I've tried this patch on the same configuration that was triggering
the VM_BUG_ON that Hugh mentioned previously. Surprisingly enough it
ran fine for ~20 minutes before exploding with:

[ 2781.566206] kernel BUG at mm/huge_memory.c:1293!
[ 2781.566953] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 2781.568054] Dumping ftrace buffer:
[ 2781.568826]    (ftrace buffer empty)
[ 2781.569392] Modules linked in:
[ 2781.569909] CPU: 61 PID: 13111 Comm: trinity-c61 Not tainted 3.17.0-rc7-sasha-00040-g65e1cb2 #1259
[ 2781.571077] task: ffff88050ba80000 ti: ffff880418ecc000 task.ti: ffff880418ecc000
[ 2781.571077] RIP: do_huge_pmd_numa_page (mm/huge_memory.c:1293 (discriminator 1))
[ 2781.571077] RSP: 0000:ffff880418ecfc60  EFLAGS: 00010246
[ 2781.571077] RAX: ffffea0074c60000 RBX: ffffea0074c60000 RCX: 0000001d318009e0
[ 2781.571077] RDX: ffffea0000000000 RSI: ffffffffb5706ef3 RDI: 0000001d318009e0
[ 2781.571077] RBP: ffff880418ecfcc8 R08: 0000000000000038 R09: 0000000000000001
[ 2781.571077] R10: 0000000000000038 R11: 0000000000000001 R12: ffff8804f9b52000
[ 2781.571077] R13: 0000001d318009e0 R14: ffff880508a1f840 R15: 0000000000000028
[ 2781.571077] FS:  00007f5502fc9700(0000) GS:ffff881d77e00000(0000) knlGS:0000000000000000
[ 2781.571077] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2781.571077] CR2: 0000000000000000 CR3: 00000004bfac4000 CR4: 00000000000006a0
[ 2781.571077] Stack:
[ 2781.571077]  ffff880418ecfc98 0000000000000282 ffff88050ba80000 000000000000000b
[ 2781.571077]  ffff88060d2ab000 ffff88060000001d 0000000000000000 ffff881d30b3ec00
[ 2781.571077]  0000000000000000 ffff881d30b3ec00 ffff88060d2ab000 0000000000000100
[ 2781.571077] Call Trace:
[ 2781.571077] handle_mm_fault (mm/memory.c:3304 mm/memory.c:3368)
[ 2781.571077] __do_page_fault (arch/x86/mm/fault.c:1231)
[ 2781.571077] ? kvm_clock_read (./arch/x86/include/asm/preempt.h:90 arch/x86/kernel/kvmclock.c:86)
[ 2781.571077] ? sched_clock (./arch/x86/include/asm/paravirt.h:192 arch/x86/kernel/tsc.c:304)
[ 2781.571077] ? sched_clock_local (kernel/sched/clock.c:214)
[ 2781.571077] ? context_tracking_user_exit (kernel/context_tracking.c:184)
[ 2781.571077] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 2781.571077] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2641 (discriminator 8))
[ 2781.571077] trace_do_page_fault (arch/x86/mm/fault.c:1314 include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1315)
[ 2781.571077] do_async_page_fault (arch/x86/kernel/kvm.c:279)
[ 2781.571077] async_page_fault (arch/x86/kernel/entry_64.S:1314)
[ 2781.571077] ? copy_user_generic_unrolled (arch/x86/lib/copy_user_64.S:166)
[ 2781.571077] ? sys32_mmap (arch/x86/ia32/sys_ia32.c:159)
[ 2781.571077] ia32_do_call (arch/x86/ia32/ia32entry.S:430)
[ 2781.571077] Code: b4 eb e0 0f 1f 84 00 00 00 00 00 4c 89 f7 e8 88 2f 0c 03 48 8b 45 d0 4c 89 e6 48 8b b8 88 00 00 00 e8 85 c7 ff ff e9 90 fe ff ff <0f> 0b 66 0f 1f 44 00 00 48 89 df e8 90 e9 f9 ff 84 c0 0f 85 be
All code
========
   0:	b4 eb                	mov    $0xeb,%ah
   2:	e0 0f                	loopne 0x13
   4:	1f                   	(bad)
   5:	84 00                	test   %al,(%rax)
   7:	00 00                	add    %al,(%rax)
   9:	00 00                	add    %al,(%rax)
   b:	4c 89 f7             	mov    %r14,%rdi
   e:	e8 88 2f 0c 03       	callq  0x30c2f9b
  13:	48 8b 45 d0          	mov    -0x30(%rbp),%rax
  17:	4c 89 e6             	mov    %r12,%rsi
  1a:	48 8b b8 88 00 00 00 	mov    0x88(%rax),%rdi
  21:	e8 85 c7 ff ff       	callq  0xffffffffffffc7ab
  26:	e9 90 fe ff ff       	jmpq   0xfffffffffffffebb
  2b:*	0f 0b                	ud2    		<-- trapping instruction
  2d:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
  33:	48 89 df             	mov    %rbx,%rdi
  36:	e8 90 e9 f9 ff       	callq  0xfffffffffff9e9cb
  3b:	84 c0                	test   %al,%al
  3d:	0f                   	.byte 0xf
  3e:	85                   	.byte 0x85
  3f:	be                   	.byte 0xbe
	...

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2
   2:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
   8:	48 89 df             	mov    %rbx,%rdi
   b:	e8 90 e9 f9 ff       	callq  0xfffffffffff9e9a0
  10:	84 c0                	test   %al,%al
  12:	0f                   	.byte 0xf
  13:	85                   	.byte 0x85
  14:	be                   	.byte 0xbe
	...
[ 2781.571077] RIP do_huge_pmd_numa_page (mm/huge_memory.c:1293 (discriminator 1))
[ 2781.571077]  RSP <ffff880418ecfc60>


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: pipe/page fault oddness.
  2014-10-01 22:08                               ` Sasha Levin
@ 2014-10-01 22:28                                 ` Chuck Ebbert
  2014-10-02  3:32                                   ` Sasha Levin
  2014-10-01 22:42                                 ` Linus Torvalds
  1 sibling, 1 reply; 43+ messages in thread
From: Chuck Ebbert @ 2014-10-01 22:28 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Linus Torvalds, Hugh Dickins, Dave Jones, Al Viro, Linux Kernel,
	Rik van Riel, Ingo Molnar, Michel Lespinasse, Kirill A. Shutemov,
	Mel Gorman

On Wed, 01 Oct 2014 18:08:30 -0400
Sasha Levin <sasha.levin@oracle.com> wrote:

> On 10/01/2014 04:20 PM, Linus Torvalds wrote:
> > So I'm really sending this patch out in the hope that it will get
> > comments, fixup and possibly even testing by people who actually know
> > the NUMA balancing code. Rik?  Anybody?
> 
> Hi Linus,
> 
> I've tried this patch on the same configuration that was triggering
> the VM_BUG_ON that Hugh mentioned previously. Surprisingly enough it
> ran fine for ~20 minutes before exploding with:
> 
> [ 2781.566206] kernel BUG at mm/huge_memory.c:1293!

That's:
	BUG_ON(is_huge_zero_page(page));

Can you change your scripts to show the source code line when
the error is a BUG_ON()? The machine code disassembly after the
oops message doesn't really help.

> [ 2781.566953] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [ 2781.568054] Dumping ftrace buffer:
> [ 2781.568826]    (ftrace buffer empty)
> [ 2781.569392] Modules linked in:
> [ 2781.569909] CPU: 61 PID: 13111 Comm: trinity-c61 Not tainted 3.17.0-rc7-sasha-00040-g65e1cb2 #1259
> [ 2781.571077] task: ffff88050ba80000 ti: ffff880418ecc000 task.ti: ffff880418ecc000
> [ 2781.571077] RIP: do_huge_pmd_numa_page (mm/huge_memory.c:1293 (discriminator 1))
> [ 2781.571077] RSP: 0000:ffff880418ecfc60  EFLAGS: 00010246
> [ 2781.571077] RAX: ffffea0074c60000 RBX: ffffea0074c60000 RCX: 0000001d318009e0
> [ 2781.571077] RDX: ffffea0000000000 RSI: ffffffffb5706ef3 RDI: 0000001d318009e0
> [ 2781.571077] RBP: ffff880418ecfcc8 R08: 0000000000000038 R09: 0000000000000001
> [ 2781.571077] R10: 0000000000000038 R11: 0000000000000001 R12: ffff8804f9b52000
> [ 2781.571077] R13: 0000001d318009e0 R14: ffff880508a1f840 R15: 0000000000000028
> [ 2781.571077] FS:  00007f5502fc9700(0000) GS:ffff881d77e00000(0000) knlGS:0000000000000000
> [ 2781.571077] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2781.571077] CR2: 0000000000000000 CR3: 00000004bfac4000 CR4: 00000000000006a0
> [ 2781.571077] Stack:
> [ 2781.571077]  ffff880418ecfc98 0000000000000282 ffff88050ba80000 000000000000000b
> [ 2781.571077]  ffff88060d2ab000 ffff88060000001d 0000000000000000 ffff881d30b3ec00
> [ 2781.571077]  0000000000000000 ffff881d30b3ec00 ffff88060d2ab000 0000000000000100
> [ 2781.571077] Call Trace:
> [ 2781.571077] handle_mm_fault (mm/memory.c:3304 mm/memory.c:3368)
> [ 2781.571077] __do_page_fault (arch/x86/mm/fault.c:1231)
> [ 2781.571077] ? kvm_clock_read (./arch/x86/include/asm/preempt.h:90 arch/x86/kernel/kvmclock.c:86)
> [ 2781.571077] ? sched_clock (./arch/x86/include/asm/paravirt.h:192 arch/x86/kernel/tsc.c:304)
> [ 2781.571077] ? sched_clock_local (kernel/sched/clock.c:214)
> [ 2781.571077] ? context_tracking_user_exit (kernel/context_tracking.c:184)
> [ 2781.571077] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
> [ 2781.571077] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2641 (discriminator 8))
> [ 2781.571077] trace_do_page_fault (arch/x86/mm/fault.c:1314 include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1315)
> [ 2781.571077] do_async_page_fault (arch/x86/kernel/kvm.c:279)
> [ 2781.571077] async_page_fault (arch/x86/kernel/entry_64.S:1314)
> [ 2781.571077] ? copy_user_generic_unrolled (arch/x86/lib/copy_user_64.S:166)
> [ 2781.571077] ? sys32_mmap (arch/x86/ia32/sys_ia32.c:159)
> [ 2781.571077] ia32_do_call (arch/x86/ia32/ia32entry.S:430)
> [ 2781.571077] Code: b4 eb e0 0f 1f 84 00 00 00 00 00 4c 89 f7 e8 88 2f 0c 03 48 8b 45 d0 4c 89 e6 48 8b b8 88 00 00 00 e8 85 c7 ff ff e9 90 fe ff ff <0f> 0b 66 0f 1f 44 00 00 48 89 df e8 90 e9 f9 ff 84 c0 0f 85 be
> All code
> ========
>    0:	b4 eb                	mov    $0xeb,%ah
>    2:	e0 0f                	loopne 0x13
>    4:	1f                   	(bad)
>    5:	84 00                	test   %al,(%rax)
>    7:	00 00                	add    %al,(%rax)
>    9:	00 00                	add    %al,(%rax)
>    b:	4c 89 f7             	mov    %r14,%rdi
>    e:	e8 88 2f 0c 03       	callq  0x30c2f9b
>   13:	48 8b 45 d0          	mov    -0x30(%rbp),%rax
>   17:	4c 89 e6             	mov    %r12,%rsi
>   1a:	48 8b b8 88 00 00 00 	mov    0x88(%rax),%rdi
>   21:	e8 85 c7 ff ff       	callq  0xffffffffffffc7ab
>   26:	e9 90 fe ff ff       	jmpq   0xfffffffffffffebb
>   2b:*	0f 0b                	ud2    		<-- trapping instruction
>   2d:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
>   33:	48 89 df             	mov    %rbx,%rdi
>   36:	e8 90 e9 f9 ff       	callq  0xfffffffffff9e9cb
>   3b:	84 c0                	test   %al,%al
>   3d:	0f                   	.byte 0xf
>   3e:	85                   	.byte 0x85
>   3f:	be                   	.byte 0xbe
> 	...
> 
> Code starting with the faulting instruction
> ===========================================
>    0:	0f 0b                	ud2
>    2:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
>    8:	48 89 df             	mov    %rbx,%rdi
>    b:	e8 90 e9 f9 ff       	callq  0xfffffffffff9e9a0
>   10:	84 c0                	test   %al,%al
>   12:	0f                   	.byte 0xf
>   13:	85                   	.byte 0x85
>   14:	be                   	.byte 0xbe
> 	...
> [ 2781.571077] RIP do_huge_pmd_numa_page (mm/huge_memory.c:1293 (discriminator 1))
> [ 2781.571077]  RSP <ffff880418ecfc60>
> 
> 
> Thanks,
> Sasha
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: pipe/page fault oddness.
  2014-10-01 22:08                               ` Sasha Levin
  2014-10-01 22:28                                 ` Chuck Ebbert
@ 2014-10-01 22:42                                 ` Linus Torvalds
  2014-10-02 14:25                                   ` Kirill A. Shutemov
  2014-10-02 15:04                                   ` Sasha Levin
  1 sibling, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2014-10-01 22:42 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Hugh Dickins, Dave Jones, Al Viro, Linux Kernel, Rik van Riel,
	Ingo Molnar, Michel Lespinasse, Kirill A. Shutemov, Mel Gorman

[-- Attachment #1: Type: text/plain, Size: 1175 bytes --]

On Wed, Oct 1, 2014 at 3:08 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
>
> I've tried this patch on the same configuration that was triggering
> the VM_BUG_ON that Hugh mentioned previously. Surprisingly enough it
> ran fine for ~20 minutes before exploding with:

Well, that's somewhat encouraging. I didn't expect it to be perfect.

That said, "ran fine" isn't necessarily the same thing as "worked".
Who knows how buggy it was without showing overt symptoms until the
BUG_ON() triggered. But hey, I'll be optimistic.

> [ 2781.566206] kernel BUG at mm/huge_memory.c:1293!

So that's

        BUG_ON(is_huge_zero_page(page));

and the reason is trivial: the old code used to have a magical special
case for the zero-page hugepage (see change_huge_pmd()) and I got rid
of that (because now it's just about setting protections, and the
zero-page hugepage is in no way special).

So I think the solution is equally trivial: just accept that the
zero-page can happen, and ignore it (just un-numa it).

Appended is an incremental diff on top of the previous one. Even less
tested than the last case, but I think you get the idea if it doesn't
work as-is.

             Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 806 bytes --]

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 14de54af6c38..fc33952d59c4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1290,7 +1290,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	page = pmd_page(pmd);
-	BUG_ON(is_huge_zero_page(page));
+	if (is_huge_zero_page(page))
+		goto huge_zero_page;
+
 	page_nid = page_to_nid(page);
 	last_cpupid = page_cpupid_last(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
@@ -1381,6 +1383,11 @@ out:
 		task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, flags);
 
 	return 0;
+huge_zero_page:
+	pmd = pmd_modify(pmd, vma->vm_page_prot);
+	set_pmd_at(mm, haddr, pmdp, pmd);
+	update_mmu_cache_pmd(vma, addr, pmdp);
+	goto out_unlock;
 }
 
 int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: pipe/page fault oddness.
  2014-10-01 22:28                                 ` Chuck Ebbert
@ 2014-10-02  3:32                                   ` Sasha Levin
  2014-10-02  8:03                                     ` Chuck Ebbert
  0 siblings, 1 reply; 43+ messages in thread
From: Sasha Levin @ 2014-10-02  3:32 UTC (permalink / raw)
  To: Chuck Ebbert
  Cc: Linus Torvalds, Hugh Dickins, Dave Jones, Al Viro, Linux Kernel,
	Rik van Riel, Ingo Molnar, Michel Lespinasse, Kirill A. Shutemov,
	Mel Gorman

On 10/01/2014 06:28 PM, Chuck Ebbert wrote:
> On Wed, 01 Oct 2014 18:08:30 -0400
> Sasha Levin <sasha.levin@oracle.com> wrote:
> 
>> > On 10/01/2014 04:20 PM, Linus Torvalds wrote:
>>> > > So I'm really sending this patch out in the hope that it will get
>>> > > comments, fixup and possibly even testing by people who actually know
>>> > > the NUMA balancing code. Rik?  Anybody?
>> > 
>> > Hi Linus,
>> > 
>> > I've tried this patch on the same configuration that was triggering
>> > the VM_BUG_ON that Hugh mentioned previously. Surprisingly enough it
>> > ran fine for ~20 minutes before exploding with:
>> > 
>> > [ 2781.566206] kernel BUG at mm/huge_memory.c:1293!
> That's:
> 	BUG_ON(is_huge_zero_page(page));
> 
> Can you change your scripts to show the source code line when
> the error is a BUG_ON()? The machine code disassembly after the
> oops message doesn't really help.
> 

Hum? The source code line is the first line in the trace:

	[ 2781.566206] kernel BUG at mm/huge_memory.c:1293!


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: pipe/page fault oddness.
  2014-10-02  3:32                                   ` Sasha Levin
@ 2014-10-02  8:03                                     ` Chuck Ebbert
  2014-10-02 14:49                                       ` Sasha Levin
  0 siblings, 1 reply; 43+ messages in thread
From: Chuck Ebbert @ 2014-10-02  8:03 UTC (permalink / raw)
  To: Sasha Levin; +Cc: Linux Kernel

On Wed, 01 Oct 2014 23:32:15 -0400
Sasha Levin <sasha.levin@oracle.com> wrote:

> On 10/01/2014 06:28 PM, Chuck Ebbert wrote:
> > On Wed, 01 Oct 2014 18:08:30 -0400
> > Sasha Levin <sasha.levin@oracle.com> wrote:
> > 
> >> > On 10/01/2014 04:20 PM, Linus Torvalds wrote:
> >>> > > So I'm really sending this patch out in the hope that it will get
> >>> > > comments, fixup and possibly even testing by people who actually know
> >>> > > the NUMA balancing code. Rik?  Anybody?
> >> > 
> >> > Hi Linus,
> >> > 
> >> > I've tried this patch on the same configuration that was triggering
> >> > the VM_BUG_ON that Hugh mentioned previously. Surprisingly enough it
> >> > ran fine for ~20 minutes before exploding with:
> >> > 
> >> > [ 2781.566206] kernel BUG at mm/huge_memory.c:1293!
> > That's:
> > 	BUG_ON(is_huge_zero_page(page));
> > 
> > Can you change your scripts to show the source code line when
> > the error is a BUG_ON()? The machine code disassembly after the
> > oops message doesn't really help.
> > 
> 
> Hum? The source code line is the first line in the trace:
> 
> 	[ 2781.566206] kernel BUG at mm/huge_memory.c:1293!
> 

I meant, display the contents of that line so we can see what the
BUG_ON() was triggered by. In some cases you might have a custom patch
applied or be running a version that some people don't have handy.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: pipe/page fault oddness.
  2014-10-01 17:29                             ` Rik van Riel
@ 2014-10-02  8:28                               ` Peter Zijlstra
  0 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2014-10-02  8:28 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Hugh Dickins, Dave Jones, Al Viro, Linux Kernel,
	Ingo Molnar, Michel Lespinasse, Kirill A. Shutemov, Mel Gorman,
	Sasha Levin

On Wed, Oct 01, 2014 at 01:29:04PM -0400, Rik van Riel wrote:
> On 10/01/2014 12:18 PM, Linus Torvalds wrote:
> 
> > Seriously, why can't we just do this, and throw away all the crap that
> > is "numa special case". This would make all the random games in
> > change_pte_range() just go away entirely, because the whole NUMA thing
> > really wouldn't be a special case for the pte AT ALL any more. All it
> > would be is that a pte could be marked PROT_NONE even if the
> > vma->vm_flags aren't.
> 
> That's what I suggested quite a while back, but IIRC either
> Peter or Mel brought up a reason why this was not possible.

Hey, I'm the one that did the numa faulting with PROTNONE to begin with.
Andrea and Mel (and possibly you) all didn't like that and did the
per-arch pagetable nonsense :-) I've never understood why.

https://lkml.org/lkml/2012/9/27/76

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: pipe/page fault oddness.
  2014-10-01 16:01                         ` Linus Torvalds
  2014-10-01 16:18                           ` Linus Torvalds
@ 2014-10-02  8:47                           ` Hugh Dickins
  2014-10-02 15:57                             ` Linus Torvalds
  1 sibling, 1 reply; 43+ messages in thread
From: Hugh Dickins @ 2014-10-02  8:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Dave Jones, Al Viro, Linux Kernel, Rik van Riel,
	Ingo Molnar, Michel Lespinasse, Kirill A. Shutemov, Mel Gorman,
	Sasha Levin

On Wed, 1 Oct 2014, Linus Torvalds wrote:
> On Wed, Oct 1, 2014 at 1:19 AM, Hugh Dickins <hughd@google.com> wrote:
> 
> Can we please just get rid of _PAGE_NUMA. There is no excuse for it.

I'm no lover of _PAGE_NUMA, and hope that it can be simplified away
as you outline.  What we have in 3.16+3.17 is already an attempt to
improve on what you hated before, but not obviously an improvement.

Mel is the one who knows these issues inside out: maybe he's been
blinkered, but I wouldn't dare to pull it apart without his input.
Myself, I'm not looking beyond fixing whatever is the bug for 3.17.

> > However, that would still not explain Dave's endless refaulting;
> 
> Why not? You start out with a PROTNONE, trigger shrink_page_list() on
> a hugepage,.which calls add_to_swap(), which does
> split_huge_page_to_list(), which in turn calls __split_huge_page(),
> and that turns (_PAGE_PROTNONE) into (_PAGE_PROTNONE|_PAGE_NUMA),
> which you will then fault on forever, because the kernel thinks the
> page is present, but not a NUMA page.

I hesitate to admit, I still don't see it: please illuminate further.

We're talking about the loop in __split_huge_page_map(), where it does

			entry = mk_pte(page + i, vma->vm_page_prot);
			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
			if (!pmd_write(*pmd))
				entry = pte_wrprotect(entry);
			if (!pmd_young(*pmd))
				entry = pte_mkold(entry);
			if (pmd_numa(*pmd))
				entry = pte_mknuma(entry);

, right?  I only see that adding _PAGE_NUMA to _PAGE_PROTNONE if
pmd_numa(*pmd): but that would mean we had already gone wrong, setting
pmd_numa in a PROT_NONE vma, which task_numa_work takes care not to do;
or have mprotected an area to PROT_NONE without doing the pmd_mknonnuma.

Ah, we won't have mmap_sem in the add_to_swap case; so we could be
racing with an mprotect which already updated vm_flags and vm_page_prot,
but has not yet reached this pmd: is that a part of how you see it?

Or are you noticing a deficiency in the pmd locking?  I have not
worked my way through that, so cannot guarantee it, but please
point me to the weakness where you see it.

But when you convince me on that, then I still don't see how we get to
doing the repeated write fault, instead of hitting access_error(), as
you pointed out originally.  That still seems to require a PROTNONE
pte to be left behind in a VM_WRITE vma, which I do not see happening
here.  pte_modify leaves _PAGE_NUMA alone, but updates _PAGE_PROTNONE.
The pte_numa<->pte_special confusion is messy, but I don't yet get
how it would manifest in the manner observed.
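
To spell out why: in the 3.17 x86 code, _PAGE_NUMA is part of
_PAGE_CHG_MASK while _PAGE_PROTNONE is not, so pte_modify() preserves the
former and takes the latter from newprot.  A simplified sketch of the
pattern (the real version also massages the PAT bits):

	static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
	{
		pteval_t val = pte_val(pte);

		/* preserved bits: pfn, accessed, dirty, ... and _PAGE_NUMA */
		val &= _PAGE_CHG_MASK;
		/* replaced bits: the protections, _PAGE_PROTNONE included */
		val |= pgprot_val(newprot) & ~_PAGE_CHG_MASK;

		return __pte(val);
	}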

But certainly, a bug in the THP splitting feels just right to
match the frequency of the sightings: I hope you've got it.

Hugh

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: pipe/page fault oddness.
  2014-10-01 16:18                           ` Linus Torvalds
  2014-10-01 17:29                             ` Rik van Riel
  2014-10-01 20:20                             ` Linus Torvalds
@ 2014-10-02 12:45                             ` Mel Gorman
  2014-10-06 19:18                               ` Aneesh Kumar K.V
  2 siblings, 1 reply; 43+ messages in thread
From: Mel Gorman @ 2014-10-02 12:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Dave Jones, Al Viro, Linux Kernel, Rik van Riel,
	Ingo Molnar, Michel Lespinasse, Kirill A. Shutemov,
	Aneesh Kumar K.V, Sasha Levin

On Wed, Oct 01, 2014 at 09:18:25AM -0700, Linus Torvalds wrote:
> On Wed, Oct 1, 2014 at 9:01 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > We need to get rid of it, and just make it the same as pte_protnone().
> > And then the real protnone is in the vma flags, and if you actually
> > ever get to a pte that is marked protnone, you know it's a numa page.
> 
> So I'd really suggest we do exactly that. Get rid of "pte_numa()"
> entirely, get rid of "_PAGE_[BIT_]NUMA" entirely, and instead add a
> "pte_protnone()" helper to check for the "protnone" case (which on x86
> is testing the _PAGE_PROTNONE bit, and on most other architectures is
> just testing that the page has no access rights).
> 

Do not interpret the following as being against the idea of taking the
pte_protnone approach. This is intended to give background.

At the time the changes were made to the _PAGE_NUMA bits it was acknowledged
that a full move to prot_none was an option, but it was not the preferred
solution at the time. It replaced one set of corner cases with another and
last time, like this time, there was considerable time pressure. The
VMA would be required to distinguish between a NUMA hinting fault and a
real prot_none bit. In most cases, we have the VMA now with the exception
of GUP. GUP would have to unconditionally go into the slow path to do the
VMA lookup. That is not likely to be that big of a problem but it was a concern.

In early implementations based on prot_none there were some VMA-based
protection checks that had higher overhead. At the time, there were severe
problems with overhead due to NUMA balancing and adding more was not
desirable. This has been addressed since with changes in multiple other
areas so it's much less of a concern now than it was. In the current shape,
this is probably not as much of a problem as long as any check on pte_numa
is first guarded by a VMA check. One way of handling the corner cases
would be to pass in the VMA where available and have a VM_BUG_ON that
fires if it's a PROT_NONE VMA. That would catch problems during debugging
without adding overhead in the !debug case.
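
As a rough sketch of that idea (the helper name and placement here are
hypothetical, not something in any posted patch), the VMA-aware check
could look something like the following, costing nothing extra on
!CONFIG_DEBUG_VM builds:

	/* hypothetical: NUMA-hint check that sanity-checks the vma first */
	static inline int pte_protnone_numa(struct vm_area_struct *vma, pte_t pte)
	{
		/* a genuine PROT_NONE vma has no access bits at all */
		VM_BUG_ON(!(vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)));
		return pte_protnone(pte);
	}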

Going back to the start, the PTE bit was used as the approach due to
concerns that a pte_protnone helper would not work on all architectures,
ppc64 in particular.  There is no PROT_NONE bit there; instead, prot_none
protections rely on _PAGE_USER not being set so the page is inaccessible
from userspace. There was discussion at the time that this could conceivably
be broken by some sub-architectures but I don't recall the details. Looking
at the current shape and your patch, it's conceivable that pte_protnone
could be implemented as a _PAGE_PRESENT && !_PAGE_USER check as long
as it was guarded by a VMA check, which x86 requires anyway. Not sure
if that would work for PMDs as I'm not familiar enough with ppc64 to tell
offhand. Alternatively, ppc64 could potentially use the bit currently used
for _PAGE_NUMA as a _PROT_NONE bit.
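
A minimal sketch of that check, assuming the _PAGE_PRESENT and _PAGE_USER
names map onto the ppc64 definitions; this only illustrates the idea and
is not a tested ppc64 implementation:

	static inline int pte_protnone(pte_t pte)
	{
		/* present to the kernel, but no userspace access: prot_none */
		return (pte_val(pte) & (_PAGE_PRESENT | _PAGE_USER)) ==
			_PAGE_PRESENT;
	}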

I finished a small series that cleans up a number of issues discovered
recently due to Sasha's testing. It passed light testing and NUMA balancing
works, but I cannot comment on whether it helps Dave's or Sasha's bugs as
I never managed to reproduce them. I'll post it shortly after I send this
mail.

Again, I'm not opposed to the pte_protnone approach as many of the concerns
I had before have been addressed since or rendered redundant. However,
I won't be able to digest and/or finish your patch in a reasonable time
frame. I'm only partially at work at the moment due to sickness (going back
to bed after this mail) and out for a good chunk of next week and the week
after. Minor comments on the patch though:

1. ppc64 needs work. Added Aneesh to the cc so he's at least aware
2. That check in pte_offset_kernel can probably go away
3. Ideally, VM_BUG_ON checks should be added to the pte_protnone check to
   ensure the VMA checks have already completed to avoid confusion between
   NUMA hints and real prot_none protections
4. GUP on prot_none now always hits the slow path but I can't think of a
   case where that really matters.
5. SWP_OFFSET_SHIFT should be readjusted back if _PAGE_BIT_NUMA is
   removed.
6. It probably should at least be rebased on top of "mm: remove
   misleading ARCH_USES_NUMA_PROT_NONE" simply on the grounds that
   it cleans up some cruft
7. At the risk of pissing you off, pte_protnone_numa might be clearer
   at indicating why protnone is being checked at all

> Then we throw away "pte_mknuma()" and "pte_mknonnuma()" entirely,
> because they are brainless sh*t, and we just use
> 
>     ptent = ptep_modify_prot_start(mm, addr, pte);
>     ptent = pte_modify(ptent, newprot);
>     ptep_modify_prot_commit(mm, addr, pte, ptent);
> 
> reliably instead (where for the mknuma case "newprot" is PROT_NONE,
> and for mknonnuma() it is vma->vm_page_prot. Yes, that means that you
> have to pass in the vma to those functions, but that just makes sense
> anyway.
> 
> And if that means that we lose the numa flag on mprotect etc, nobody sane cares.
> 

Losing a NUMA hinting fault due to operations like mprotect is not a concern.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: pipe/page fault oddness.
  2014-10-01 22:42                                 ` Linus Torvalds
@ 2014-10-02 14:25                                   ` Kirill A. Shutemov
  2014-10-02 16:01                                     ` Linus Torvalds
  2014-10-02 15:04                                   ` Sasha Levin
  1 sibling, 1 reply; 43+ messages in thread
From: Kirill A. Shutemov @ 2014-10-02 14:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Sasha Levin, Hugh Dickins, Dave Jones, Al Viro, Linux Kernel,
	Rik van Riel, Ingo Molnar, Michel Lespinasse, Kirill A. Shutemov,
	Mel Gorman

On Wed, Oct 01, 2014 at 03:42:53PM -0700, Linus Torvalds wrote:
> On Wed, Oct 1, 2014 at 3:08 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
> >
> > I've tried this patch on the same configuration that was triggering
> > the VM_BUG_ON that Hugh mentioned previously. Surprisingly enough it
> > ran fine for ~20 minutes before exploding with:
> 
> Well, that's somewhat encouraging. I didn't expect it to be perfect.
> 
> That said, "ran fine" isn't necessarily the same thing as "worked".
> Who knows how buggy it was without showing overt symptoms until the
> BUG_ON() triggered. But hey, I'll be optimistic.
> 
> > [ 2781.566206] kernel BUG at mm/huge_memory.c:1293!
> 
> So that's
> 
>         BUG_ON(is_huge_zero_page(page));
> 
> and the reason is trivial: the old code used to have a magical special
> case for the zero-page hugepage (see change_huge_pmd()) and I got rid
> of that (because now it's just about setting protections, and the
> zero-page hugepage is in no way special).
> 
> So I think the solution is equally trivial: just accept that the
> zero-page can happen, and ignore it (just un-numa it).
> 
> Appended is an incremental diff on top of the previous one. Even less
> tested than the last case, but I think you get the idea if it doesn't
> work as-is.
> 
>              Linus

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 14de54af6c38..fc33952d59c4 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1290,7 +1290,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	}
>  
>  	page = pmd_page(pmd);
> -	BUG_ON(is_huge_zero_page(page));
> +	if (is_huge_zero_page(page))
> +		goto huge_zero_page;
> +
>  	page_nid = page_to_nid(page);
>  	last_cpupid = page_cpupid_last(page);
>  	count_vm_numa_event(NUMA_HINT_FAULTS);
> @@ -1381,6 +1383,11 @@ out:
>  		task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, flags);
>  
>  	return 0;
> +huge_zero_page:
> +	pmd = pmd_modify(pmd, vma->vm_page_prot);
> +	set_pmd_at(mm, haddr, pmdp, pmd);
> +	update_mmu_cache_pmd(vma, addr, pmdp);
> +	goto out_unlock;

I don't see what prevents the code from making the zero page writable here.
We need at least pmd = pmd_wrprotect(pmd) before set_pmd_at();

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: pipe/page fault oddness.
  2014-10-02  8:03                                     ` Chuck Ebbert
@ 2014-10-02 14:49                                       ` Sasha Levin
  0 siblings, 0 replies; 43+ messages in thread
From: Sasha Levin @ 2014-10-02 14:49 UTC (permalink / raw)
  To: Chuck Ebbert; +Cc: Linux Kernel

On 10/02/2014 04:03 AM, Chuck Ebbert wrote:
> On Wed, 01 Oct 2014 23:32:15 -0400
> Sasha Levin <sasha.levin@oracle.com> wrote:
> 
>> On 10/01/2014 06:28 PM, Chuck Ebbert wrote:
>>> On Wed, 01 Oct 2014 18:08:30 -0400
>>> Sasha Levin <sasha.levin@oracle.com> wrote:
>>>
>>>>> On 10/01/2014 04:20 PM, Linus Torvalds wrote:
>>>>>>> So I'm really sending this patch out in the hope that it will get
>>>>>>> comments, fixup and possibly even testing by people who actually know
>>>>>>> the NUMA balancing code. Rik?  Anybody?
>>>>>
>>>>> Hi Linus,
>>>>>
>>>>> I've tried this patch on the same configuration that was triggering
>>>>> the VM_BUG_ON that Hugh mentioned previously. Surprisingly enough it
>>>>> ran fine for ~20 minutes before exploding with:
>>>>>
>>>>> [ 2781.566206] kernel BUG at mm/huge_memory.c:1293!
>>> That's:
>>> 	BUG_ON(is_huge_zero_page(page));
>>>
>>> Can you change your scripts to show the source code line when
>>> the error is a BUG_ON()? The machine code disassembly after the
>>> oops message doesn't really help.
>>>
>>
>> Hum? The source code line is the first line in the trace:
>>
>> 	[ 2781.566206] kernel BUG at mm/huge_memory.c:1293!
>>
> 
> I meant, display the contents of that line so we can see what the
> BUG_ON() was triggered by. In some cases you might have a custom patch
> applied or be running a version that some people don't have handy.

Ah, the actual content? That's a good point - let me look into that.


Thanks,
Sasha


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: pipe/page fault oddness.
  2014-10-01 22:42                                 ` Linus Torvalds
  2014-10-02 14:25                                   ` Kirill A. Shutemov
@ 2014-10-02 15:04                                   ` Sasha Levin
  2014-10-02 16:10                                     ` Linus Torvalds
  1 sibling, 1 reply; 43+ messages in thread
From: Sasha Levin @ 2014-10-02 15:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Dave Jones, Al Viro, Linux Kernel, Rik van Riel,
	Ingo Molnar, Michel Lespinasse, Kirill A. Shutemov, Mel Gorman

On 10/01/2014 06:42 PM, Linus Torvalds wrote:
> On Wed, Oct 1, 2014 at 3:08 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
>> >
>> > I've tried this patch on the same configuration that was triggering
>> > the VM_BUG_ON that Hugh mentioned previously. Surprisingly enough it
>> > ran fine for ~20 minutes before exploding with:
> Well, that's somewhat encouraging. I didn't expect it to be perfect.
> 
> That said, "ran fine" isn't necessarily the same thing as "worked".
> Who knows how buggy it was without showing overt symptoms until the
> BUG_ON() triggered. But hey, I'll be optimistic.
> 
>> > [ 2781.566206] kernel BUG at mm/huge_memory.c:1293!
> So that's
> 
>         BUG_ON(is_huge_zero_page(page));
> 
> and the reason is trivial: the old code used to have a magical special
> case for the zero-page hugepage (see change_huge_pmd()) and I got rid
> of that (because now it's just about setting protections, and the
> zero-page hugepage is in no way special).
> 
> So I think the solution is equally trivial: just accept that the
> zero-page can happen, and ignore it (just un-numa it).
> 
> Appended is an incremental diff on top of the previous one. Even less
> tested than the last case, but I think you get the idea if it doesn't
> work as-is.

I have a new one for you. I know it doesn't say "numa" anywhere, but I
haven't ever seen that trace before so I'll just go ahead and blame it
on your patch...

[ 2838.403382] BUG: unable to handle kernel paging request at 000000055d996e80
[ 2838.405740] IP: task_curr (kernel/sched/core.c:1010)
[ 2838.407076] PGD dba2c6067 PUD 0
[ 2838.407926] Thread overran stack, or stack corrupted
[ 2838.409093] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 2838.411454] Dumping ftrace buffer:
[ 2838.412602]    (ftrace buffer empty)
[ 2838.413187] Modules linked in:
[ 2838.413187] CPU: 38 PID: 9342 Comm: trinity-c38 Not tainted 3.17.0-rc7-sasha-00041-g6c9c81b #1260
[ 2838.413187] task: ffff880dba2f0000 ti: ffff880dba2ec000 task.ti: ffff880dba2ec000
[ 2838.413187] RIP: task_curr (kernel/sched/core.c:1010)
[ 2838.413187] RSP: 0018:ffff880dba2ebf48  EFLAGS: 00010046
[ 2838.413187] RAX: 000000000000f080 RBX: ffff880dba2f0000 RCX: 000000000000000a
[ 2838.413187] RDX: 00000000ba1a9560 RSI: ffff880dba2f0000 RDI: ffff880dba2f0000
[ 2838.413187] RBP: ffff880dba2ebf98 R08: 000000000004862a R09: 0000000000000000
[ 2838.413187] R10: 0000000000000038 R11: 000000000000001f R12: ffff880dba2f0000
[ 2838.413187] R13: ffff880dd5420740 R14: 000000000000000b R15: ffffffff8cc92000
[ 2838.413187] FS:  00007f05f3dbc700(0000) GS:ffff880701e00000(0000) knlGS:0000000000000000
[ 2838.413187] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2838.413187] CR2: 000000055d996e80 CR3: 0000000dba2c5000 CR4: 00000000000006a0
[ 2838.413187] DR0: 00000000006ee000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2838.413187] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000090602
[ 2838.413187] Stack:
[ 2838.413187]  ffffffff8816218b 0000000000000000 ffff880d0000000a 000000000000000b
[ 2838.413187]  0000000000000082 ffff880dba2f0000 000000000000000b ffff880dba2ec070
[ 2838.413187]  0000000000000000 ffffffff8cc92000 ffff880dba2ebff8 ffffffff88162a84
[ 2838.413187] Call Trace:
[ 2838.413187]  <UNK>
[ 2838.413187] Code: 87 60 09 00 00 01 e8 8d ee ff ff 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 48 8b 57 08 55 48 c7 c0 80 f0 00 00 48 89 e5 5d 8b 52 18 <48> 8b 14 d5 80 c3 c4 8c 48 39 bc 10 68 09 00 00 0f 94 c0 0f b6
All code
========
   0:	87 60 09             	xchg   %esp,0x9(%rax)
   3:	00 00                	add    %al,(%rax)
   5:	01 e8                	add    %ebp,%eax
   7:	8d                   	(bad)
   8:	ee                   	out    %al,(%dx)
   9:	ff                   	(bad)
   a:	ff 5d c3             	lcallq *-0x3d(%rbp)
   d:	66 66 2e 0f 1f 84 00 	data32 nopw %cs:0x0(%rax,%rax,1)
  14:	00 00 00 00
  18:	48 8b 57 08          	mov    0x8(%rdi),%rdx
  1c:	55                   	push   %rbp
  1d:	48 c7 c0 80 f0 00 00 	mov    $0xf080,%rax
  24:	48 89 e5             	mov    %rsp,%rbp
  27:	5d                   	pop    %rbp
  28:	8b 52 18             	mov    0x18(%rdx),%edx
  2b:*	48 8b 14 d5 80 c3 c4 	mov    -0x733b3c80(,%rdx,8),%rdx		<-- trapping instruction
  32:	8c
  33:	48 39 bc 10 68 09 00 	cmp    %rdi,0x968(%rax,%rdx,1)
  3a:	00
  3b:	0f 94 c0             	sete   %al
  3e:	0f b6 00             	movzbl (%rax),%eax

Code starting with the faulting instruction
===========================================
   0:	48 8b 14 d5 80 c3 c4 	mov    -0x733b3c80(,%rdx,8),%rdx
   7:	8c
   8:	48 39 bc 10 68 09 00 	cmp    %rdi,0x968(%rax,%rdx,1)
   f:	00
  10:	0f 94 c0             	sete   %al
  13:	0f b6 00             	movzbl (%rax),%eax
[ 2838.413187] RIP task_curr (kernel/sched/core.c:1010)
[ 2838.413187]  RSP <ffff880dba2ebf48>
[ 2838.413187] CR2: 000000055d996e80


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: pipe/page fault oddness.
  2014-10-02  8:47                           ` Hugh Dickins
@ 2014-10-02 15:57                             ` Linus Torvalds
  0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2014-10-02 15:57 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Dave Jones, Al Viro, Linux Kernel, Rik van Riel, Ingo Molnar,
	Michel Lespinasse, Kirill A. Shutemov, Mel Gorman, Sasha Levin

On Thu, Oct 2, 2014 at 1:47 AM, Hugh Dickins <hughd@google.com> wrote:
>
> I hesitate to admit, I still don't see it: please illuminate further.

No, you're looking at what I was looking at.

> We're talking about the loop in __split_huge_page_map(), where it does

Yes.

>                         entry = mk_pte(page + i, vma->vm_page_prot);
>                         entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>                         if (!pmd_write(*pmd))
>                                 entry = pte_wrprotect(entry);
>                         if (!pmd_young(*pmd))
>                                 entry = pte_mkold(entry);
>                         if (pmd_numa(*pmd))
>                                 entry = pte_mknuma(entry);
>
> , right?  I only see that adding _PAGE_NUMA to _PAGE_PROTNONE if
> pmd_numa(*pmd): but that would mean we had already gone wrong, setting
> pmd_numa in a PROT_NONE vma, which task_numa_work takes care not to do;
> or have mprotected an area to PROT_NONE without doing the pmd_mknonnuma.

Fair enough. Except this code has no locking that I see, so if we
*ever* see that numa entry in the pmd while walking the page tables in
vmscan, we're basically screwed.

> Or are you noticing a deficiency in the pmd locking?  I have not
> worked my way through that, so cannot guarantee it, but please
> point me to the weakness where you see it.

So I don't see any locking at all wrt mprotect (or new mmap). That's
kind of the whole point for page-out - it bypasses all the normal VM
locks, and only uses the last pte locking.

So the whole use of vma->vm_page_prot here is a bit scary. That gets
modified outside of the page table locks. So how do you know it's not
already PROT_NONE, but mprotect just hasn't gotten to actually take
the page table locks yet?

I dunno. It all makes me just very nervous. The whole "numa bit is
separate from the protections, has different locking, and is just
oddly and subtly different" is really what I fundamentally object to.
And it seems so _unnecessary_. All this odd complexity for no actual
gain - just extra code, and extra room for subtle bugs. Which is
exactly why I hate that magic NUMA bit so much.

                  Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: pipe/page fault oddness.
  2014-10-02 14:25                                   ` Kirill A. Shutemov
@ 2014-10-02 16:01                                     ` Linus Torvalds
  2014-10-02 16:35                                       ` Kirill A. Shutemov
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2014-10-02 16:01 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Sasha Levin, Hugh Dickins, Dave Jones, Al Viro, Linux Kernel,
	Rik van Riel, Ingo Molnar, Michel Lespinasse, Kirill A. Shutemov,
	Mel Gorman

On Thu, Oct 2, 2014 at 7:25 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> I don't see what prevents the code from making the zero page writable here.
> We need at least pmd = pmd_wrprotect(pmd) before set_pmd_at();

Do we? If it's the zero page, it had better be an anonymous mapping,
and vm_page_prot had better not be writable.

Anonymous pages don't _start_ out writable, we explicitly make them so
with code like

        if (vma->vm_flags & VM_WRITE)
                entry = pte_mkwrite(pte_mkdirty(entry));

so it should be fine to just use "pmd_modify(pmd, vma->vm_page_prot);" directly.

But hey, this is the kind of thing that maybe I'm missing something on..

              Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: pipe/page fault oddness.
  2014-10-02 15:04                                   ` Sasha Levin
@ 2014-10-02 16:10                                     ` Linus Torvalds
  2014-10-03  5:00                                       ` Sasha Levin
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2014-10-02 16:10 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Hugh Dickins, Dave Jones, Al Viro, Linux Kernel, Rik van Riel,
	Ingo Molnar, Michel Lespinasse, Kirill A. Shutemov, Mel Gorman

On Thu, Oct 2, 2014 at 8:04 AM, Sasha Levin <sasha.levin@oracle.com> wrote:
>
> I have a new one for you. I know it doesn't say "numa" anywhere, but I
> haven't ever seen that trace before so I'll just go ahead and blame it
> on your patch...

Fair enough, but the oops doesn't really give even a hint of what
could be wrong.

The stack is clearly too deep:

  Thread overran stack, or stack corrupted
  task.ti: ffff880dba2ec000
    RSP: ffff880dba2ebf48

but my patch shouldn't have added any deeper call-chains anywhere.

Anyway, unless you can get some other interesting oops with more hints
about where we go wrong, you might want to try Mel's four cleanup
patches instead. I love how you're testing my quick-and-dirty hack,
and I think we'll need to do this eventually, but in the short term
Mel's patches are probably worth applying.

In particular, his 4/4 patch removes the code case that I was
particularly nervous about, so it looks worth a try, because that may
just remove the bug with much smaller effort.

               Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: pipe/page fault oddness.
  2014-10-02 16:01                                     ` Linus Torvalds
@ 2014-10-02 16:35                                       ` Kirill A. Shutemov
  0 siblings, 0 replies; 43+ messages in thread
From: Kirill A. Shutemov @ 2014-10-02 16:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Sasha Levin, Hugh Dickins, Dave Jones, Al Viro, Linux Kernel,
	Rik van Riel, Ingo Molnar, Michel Lespinasse, Kirill A. Shutemov,
	Mel Gorman

On Thu, Oct 02, 2014 at 09:01:38AM -0700, Linus Torvalds wrote:
> On Thu, Oct 2, 2014 at 7:25 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >
> > I don't see what prevents the code from making the zero page writable here.
> > We need at least pmd = pmd_wrprotect(pmd) before set_pmd_at();
> 
> Do we? If it's the zero page, it had better be an anonymous mapping,
> and vm_page_prot had better not be writable.
> 
> Anonymous pages don't _start_ out writable, we explicitly make them so
> with code like
> 
>         if (vma->vm_flags & VM_WRITE)
>                 entry = pte_mkwrite(pte_mkdirty(entry));
> 
> so it should be fine to just use "pmd_modify(pmd, vma->vm_page_prot);" directly.
> 
> But hey, this is the kind of thing that maybe I'm missing something on..

You're right.
It means we have redundant pmd_wrprotect() in set_huge_zero_page().

==========================================================================

Subject: [PATCH] thp: do not mark zero-page pmd write-protected explicitly

Zero pages can be used only in anonymous mappings, which never have
writable vma->vm_page_prot: see protection_map in mm/mmap.c and __PX1X
definitions.

Let's drop redundant pmd_wrprotect() in set_huge_zero_page().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d9a21d06b862..2c17d184b56d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -784,7 +784,6 @@ static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
 	if (!pmd_none(*pmd))
 		return false;
 	entry = mk_pmd(zero_page, vma->vm_page_prot);
-	entry = pmd_wrprotect(entry);
 	entry = pmd_mkhuge(entry);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, haddr, pmd, entry);
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: pipe/page fault oddness.
  2014-10-02 16:10                                     ` Linus Torvalds
@ 2014-10-03  5:00                                       ` Sasha Levin
  2014-10-03 15:43                                         ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Sasha Levin @ 2014-10-03  5:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Dave Jones, Al Viro, Linux Kernel, Rik van Riel,
	Ingo Molnar, Michel Lespinasse, Kirill A. Shutemov, Mel Gorman

On 10/02/2014 12:10 PM, Linus Torvalds wrote:
> On Thu, Oct 2, 2014 at 8:04 AM, Sasha Levin <sasha.levin@oracle.com> wrote:
>> >
>> > I have a new one for you. I know it doesn't say "numa" anywhere, but I
>> > haven't ever seen that trace before so I'll just go ahead and blame it
>> > on your patch...
> Fair enough, but the oops doesn't really give even a hint of what
> could be wrong.
> 
> The stack is clearly too deep:
> 
>   Thread overran stack, or stack corrupted
>   task.ti: ffff880dba2ec000
>     RSP: ffff880dba2ebf48
> 
> but my patch shouldn't have added any deeper call-chains anywhere.

For the record, I tweaked the environment to put some more pressure on the
scheduler and found out what broke (which is not related to this thread at
all).

For the curious ones: https://lkml.org/lkml/2014/10/3/5


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: pipe/page fault oddness.
  2014-10-03  5:00                                       ` Sasha Levin
@ 2014-10-03 15:43                                         ` Linus Torvalds
  2014-10-03 15:58                                           ` Dave Jones
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2014-10-03 15:43 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Hugh Dickins, Dave Jones, Al Viro, Linux Kernel, Rik van Riel,
	Ingo Molnar, Michel Lespinasse, Kirill A. Shutemov, Mel Gorman

On Thu, Oct 2, 2014 at 10:00 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
>
> For the record, I tweaked the environment to put some more pressure on the
> scheduler and found out what broke (which is not related to this thread at
> all).

Ok. It's probably still worth testing Mel's patches, since that's what
goes into 3.17. But it would be interesting to eventually go back to
stability-testing the protnone set, since it *looked* like it was
working well aside from this issue. I might decide to just do it
during the 3.18 merge window..

          Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: pipe/page fault oddness.
  2014-10-03 15:43                                         ` Linus Torvalds
@ 2014-10-03 15:58                                           ` Dave Jones
  2014-10-03 16:02                                             ` Sasha Levin
  0 siblings, 1 reply; 43+ messages in thread
From: Dave Jones @ 2014-10-03 15:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Sasha Levin, Hugh Dickins, Al Viro, Linux Kernel, Rik van Riel,
	Ingo Molnar, Michel Lespinasse, Kirill A. Shutemov, Mel Gorman

On Fri, Oct 03, 2014 at 08:43:57AM -0700, Linus Torvalds wrote:
 > On Thu, Oct 2, 2014 at 10:00 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
 > >
 > > For the record, I tweaked the environment to put some more pressure on the
 > > scheduler and found out what broke (which is not related to this thread at
 > > all).
 > 
 > Ok. It's probably still worth testing Mel's patches, since that's what
 > goes into 3.17. But it would be interesting to eventually go back to
 > stability-testing the protnone set, since it *looked* like it was
 > working well aside from this issue. I might decide to just do it
 > during the 3.18 merge window..

FYI, I've been trying to reproduce that initial bug all week, and
haven't seen a single recurrence of it, just other bugs.
Kind of a pain in the ass for confirming this numa stuff was indeed
the cause, but that's been true of so many of the more obscure bugs
trinity has shaken out.

	Dave


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: pipe/page fault oddness.
  2014-10-03 15:58                                           ` Dave Jones
@ 2014-10-03 16:02                                             ` Sasha Levin
  0 siblings, 0 replies; 43+ messages in thread
From: Sasha Levin @ 2014-10-03 16:02 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Hugh Dickins, Al Viro, Linux Kernel,
	Rik van Riel, Ingo Molnar, Michel Lespinasse, Kirill A. Shutemov,
	Mel Gorman

On 10/03/2014 11:58 AM, Dave Jones wrote:
> On Fri, Oct 03, 2014 at 08:43:57AM -0700, Linus Torvalds wrote:
>  > On Thu, Oct 2, 2014 at 10:00 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
>  > >
>  > > For the record, I tweaked the environment to put some more pressure on the
>  > > scheduler and found out what broke (which is not related to this thread at
>  > > all).
>  > 
>  > Ok. It's probably still worth testing Mel's patches, since that's what
>  > goes into 3.17. But it would be interesting to eventually go back to
>  > stability-testing the protnone set, since it *looked* like it was
>  > working well aside from this issue. I might decide to just do it
>  > during the 3.18 merge window..

Linus, I've been running with Mel's patches since yesterday; nothing interesting
to report.

> FYI, I've been trying to reproduce that initial bug all week, and
> haven't seen a single recurrence of it, just other bugs.
> Kind of a pain in the ass for confirming this numa stuff was indeed
> the cause, but that's been true of so many of the more obscure bugs
> trinity has shaken out.

+1, it seems to have "disappeared", quite annoying :/


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: pipe/page fault oddness.
  2014-10-02 12:45                             ` Mel Gorman
@ 2014-10-06 19:18                               ` Aneesh Kumar K.V
  2014-10-07 12:45                                 ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Aneesh Kumar K.V @ 2014-10-06 19:18 UTC (permalink / raw)
  To: Mel Gorman, Linus Torvalds
  Cc: Hugh Dickins, Dave Jones, Al Viro, Linux Kernel, Rik van Riel,
	Ingo Molnar, Michel Lespinasse, Kirill A. Shutemov, Sasha Levin,
	Benjamin Herrenschmidt

Mel Gorman <mgorman@suse.de> writes:

> On Wed, Oct 01, 2014 at 09:18:25AM -0700, Linus Torvalds wrote:
>> On Wed, Oct 1, 2014 at 9:01 AM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>> >
>> > We need to get rid of it, and just make it the same as pte_protnone().
>> > And then the real protnone is in the vma flags, and if you actually
>> > ever get to a pte that is marked protnone, you know it's a numa page.
>> 
>> So I'd really suggest we do exactly that. Get rid of "pte_numa()"
>> entirely, get rid of "_PAGE_[BIT_]NUMA" entirely, and instead add a
>> "pte_protnone()" helper to check for the "protnone" case (which on x86
>> is testing the _PAGE_PROTNONE bit, and on most other architectures is
>> just testing that the page has no access rights).
>> 
>
> Do not interpret the following as being against the idea of taking the
> pte_protnone approach. This is intended to give background.
>
> At the time the changes were made to the _PAGE_NUMA bits it was acknowledged
> that a full move to prot_none was an option but it was not the preferred
> solution at the time. It replaced one set of corner cases with another, and
> last time, like this time, there was considerable time pressure. The
> VMA would be required to distinguish between a NUMA hinting fault and a
> real prot_none bit. In most cases, we have the VMA now with the exception
> of GUP. GUP would have to unconditionally go into the slow path to do the
> VMA lookup. That is not likely to be that big of a problem, but it was a concern.
>
> In early implementations based on prot_none there were some VMA-based
> protection checks that had higher overhead. At the time, there were severe
> problems with overhead due to NUMA balancing and adding more was not
> desirable. This has been addressed since with changes in multiple other
> areas so it's much less of a concern now than it was. In the current shape,
> this probably is not as much of a problem as long as any check on pte_numa
> is first guarded by a VMA check. One way of handling the corner cases
> would be to pass in the VMA where available and have a VM_BUG_ON that
> fires if it's a PROT_NONE VMA. That would catch problems during debugging
> without adding overhead in the !debug case.
>
> Going back to the start, the PTE bit was used as the approach due to
> concerns that a pte_protnone helper would not work on all architectures,
> ppc64 in particular.  There is no PROT_NONE bit there; instead, prot_none
> protections rely on _PAGE_USER not being set, so the page is inaccessible from
> userspace. There was discussion at the time that this could conceivably be
> broken on some sub-architectures but I don't recall the details. Looking
> at the current shape and your patch, it's conceivable that pte_protnone
> could be implemented as a _PAGE_PRESENT && !_PAGE_USER check as long
> as it is guarded by a VMA check, which x86 requires anyway. Not sure
> if that would work for PMDs as I'm not familiar enough with ppc64 to tell
> offhand. Alternatively, ppc64 could potentially use the bit currently used
> for _PAGE_NUMA as a _PROT_NONE bit.

Are we still looking at these options? I could look at implementing the
first option, which will also enable us to free up one pte bit.

Note: Freeing up one bit will enable us to implement soft dirty tracking
needed for CRIU.
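
For the VMA-guard part Mel describes above, a rough sketch of how the
debug check could look (pte_protnone_numa() is a made-up name here;
pte_protnone() is the helper proposed earlier in the thread):

static inline bool pte_protnone_numa(struct vm_area_struct *vma, pte_t pte)
{
	if (!pte_protnone(pte))
		return false;
	/*
	 * A genuine PROT_NONE vma has no access bits at all, so only a
	 * NUMA-hinting pte should get here; VM_BUG_ON catches misuse
	 * during debugging and compiles away without CONFIG_DEBUG_VM.
	 */
	VM_BUG_ON(!(vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)));
	return true;
}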

-aneesh


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: pipe/page fault oddness.
  2014-10-06 19:18                               ` Aneesh Kumar K.V
@ 2014-10-07 12:45                                 ` Linus Torvalds
  2014-10-08 10:37                                   ` Aneesh Kumar K.V
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2014-10-07 12:45 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Mel Gorman, Hugh Dickins, Dave Jones, Al Viro, Linux Kernel,
	Rik van Riel, Ingo Molnar, Michel Lespinasse, Kirill A. Shutemov,
	Sasha Levin, Benjamin Herrenschmidt

On Mon, Oct 6, 2014 at 3:18 PM, Aneesh Kumar K.V
<aneesh.kumar@linux.vnet.ibm.com> wrote:
>
> Are we still looking at these options ? I could look at implementing the
> first option which will also enable us to free up one pte bit.

We definitely are. If you can test my patch (with the small follow-up
fix), and do the necessary changes for ppc64, that would be good.

I looked quickly at the ppc64 side, and it didn't look too painful.
Using pte_protnone() instead of pte_numa() should remove more lines
than it adds there too..

            Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: pipe/page fault oddness.
  2014-10-07 12:45                                 ` Linus Torvalds
@ 2014-10-08 10:37                                   ` Aneesh Kumar K.V
  0 siblings, 0 replies; 43+ messages in thread
From: Aneesh Kumar K.V @ 2014-10-08 10:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, Hugh Dickins, Dave Jones, Al Viro, Linux Kernel,
	Rik van Riel, Ingo Molnar, Michel Lespinasse, Kirill A. Shutemov,
	Sasha Levin, Benjamin Herrenschmidt

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, Oct 6, 2014 at 3:18 PM, Aneesh Kumar K.V
> <aneesh.kumar@linux.vnet.ibm.com> wrote:
>>
>> Are we still looking at these options ? I could look at implementing the
>> first option which will also enable us to free up one pte bit.
>
> We definitely are. If you can test my patch (with the small follow-up
> fix), and do the necessary changes for ppc64, that would be good.
>
> I looked quickly at the ppc64 side, and it didn't look too painful.
> Using pte_protnone() instead of pte_numa() should remove more lines
> than it adds there too..
>

This is a quick hack that gets it running. With perf bench numa mem:

-bash-4.2# grep numa /proc/vmstat                                                                                                                                                                                                              
numa_hit 3310633
numa_miss 0
numa_foreign 0
numa_interleave 6451
numa_local 3162369
numa_other 148264
numa_pte_updates 27708982
numa_huge_pte_updates 76987
numa_hint_faults 268439275
numa_hint_faults_local 5359216
numa_pages_migrated 3349573
-bash-4.2# 

diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index d98c1ecc3266..2a9bbe4d2364 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -41,7 +41,7 @@ static inline pgprot_t pte_pgprot(pte_t pte)	{ return __pgprot(pte_val(pte) & PA
 
 static inline int pte_present(pte_t pte)
 {
-	return pte_val(pte) & (_PAGE_PRESENT | _PAGE_NUMA);
+	return pte_val(pte) & _PAGE_PRESENT;
 }
 
 #define pte_present_nonuma pte_present_nonuma
@@ -50,78 +50,20 @@ static inline int pte_present_nonuma(pte_t pte)
 	return pte_val(pte) & (_PAGE_PRESENT);
 }
 
-#define pte_numa pte_numa
-static inline int pte_numa(pte_t pte)
+#define pte_protnone pte_protnone
+static inline int pte_protnone(pte_t pte)
 {
 	return (pte_val(pte) &
-		(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
+		(_PAGE_PRESENT | _PAGE_USER)) == _PAGE_PRESENT;
 }
 
-#define pte_mknonnuma pte_mknonnuma
-static inline pte_t pte_mknonnuma(pte_t pte)
+#define pmd_protnone pmd_protnone
+static inline int pmd_protnone(pmd_t pmd)
 {
-	pte_val(pte) &= ~_PAGE_NUMA;
-	pte_val(pte) |=  _PAGE_PRESENT | _PAGE_ACCESSED;
-	return pte;
-}
-
-#define pte_mknuma pte_mknuma
-static inline pte_t pte_mknuma(pte_t pte)
-{
-	/*
-	 * We should not set _PAGE_NUMA on non present ptes. Also clear the
-	 * present bit so that hash_page will return 1 and we collect this
-	 * as numa fault.
-	 */
-	if (pte_present(pte)) {
-		pte_val(pte) |= _PAGE_NUMA;
-		pte_val(pte) &= ~_PAGE_PRESENT;
-	} else
-		VM_BUG_ON(1);
-	return pte;
+	return pte_protnone(pmd_pte(pmd));
 }
 
-#define ptep_set_numa ptep_set_numa
-static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
-				 pte_t *ptep)
-{
-	if ((pte_val(*ptep) & _PAGE_PRESENT) == 0)
-		VM_BUG_ON(1);
-
-	pte_update(mm, addr, ptep, _PAGE_PRESENT, _PAGE_NUMA, 0);
-	return;
-}
-
-#define pmd_numa pmd_numa
-static inline int pmd_numa(pmd_t pmd)
-{
-	return pte_numa(pmd_pte(pmd));
-}
-
-#define pmdp_set_numa pmdp_set_numa
-static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr,
-				 pmd_t *pmdp)
-{
-	if ((pmd_val(*pmdp) & _PAGE_PRESENT) == 0)
-		VM_BUG_ON(1);
-
-	pmd_hugepage_update(mm, addr, pmdp, _PAGE_PRESENT, _PAGE_NUMA);
-	return;
-}
-
-#define pmd_mknonnuma pmd_mknonnuma
-static inline pmd_t pmd_mknonnuma(pmd_t pmd)
-{
-	return pte_pmd(pte_mknonnuma(pmd_pte(pmd)));
-}
-
-#define pmd_mknuma pmd_mknuma
-static inline pmd_t pmd_mknuma(pmd_t pmd)
-{
-	return pte_pmd(pte_mknuma(pmd_pte(pmd)));
-}
-
-# else
+#else
 
 static inline int pte_present(pte_t pte)
 {
diff --git a/arch/powerpc/include/asm/pte-hash64.h b/arch/powerpc/include/asm/pte-hash64.h
index 2505d8eab15c..68d0a5d01dc3 100644
--- a/arch/powerpc/include/asm/pte-hash64.h
+++ b/arch/powerpc/include/asm/pte-hash64.h
@@ -30,7 +30,7 @@
 /*
  * Used for tracking numa faults
  */
-#define _PAGE_NUMA	0x00000010 /* Gather numa placement stats */
+//#define _PAGE_NUMA	0x00000010 /* Gather numa placement stats */
 
 
 /* No separate kernel read-only */
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 084ad54c73cd..27004431b576 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -235,7 +235,11 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 		pte_size = psize;
 		pte = lookup_linux_pte_and_update(pgdir, hva, writing,
 						  &pte_size);
-		if (pte_present(pte) && !pte_numa(pte)) {
+		/*
+		 * Skip the ptes marked for numa fault tracking in
+		 * host page table.
+		 */
+		if (pte_present(pte) && !pte_protnone(pte)) {
 			if (writing && !pte_write(pte))
 				/* make the actual HPTE be read-only */
 				ptel = hpte_make_readonly(ptel);
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 51ab9e7e6c39..b77ecac7e61f 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -393,8 +393,6 @@ good_area:
 		 * processors use the same I/D cache coherency mechanism
 		 * as embedded.
 		 */
-		if (error_code & DSISR_PROTFAULT)
-			goto bad_area;
 #endif /* CONFIG_PPC_STD_MMU */
 
 		/*
@@ -418,9 +416,6 @@ good_area:
 		flags |= FAULT_FLAG_WRITE;
 	/* a read */
 	} else {
-		/* protection fault */
-		if (error_code & 0x08000000)
-			goto bad_area;
 		if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
 			goto bad_area;
 	}
diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
index d8746684f606..89f8568bf0b5 100644
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -39,7 +39,7 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 		/*
 		 * Similar to the PMD case, NUMA hinting must take slow path
 		 */
-		if (pte_numa(pte))
+		if (pte_protnone(pte))
 			return 0;
 
 		if ((pte_val(pte) & mask) != result)
@@ -85,7 +85,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 			 * slowpath for accounting purposes and so that they
 			 * can be serialised against THP migration.
 			 */
-			if (pmd_numa(pmd))
+			if (pmd_protnone(pmd))
 				return 0;
 
 			if (!gup_hugepte((pte_t *)pmdp, PMD_SIZE, addr, next,


^ permalink raw reply related	[flat|nested] 43+ messages in thread

end of thread

Thread overview: 43+ messages
2014-09-30  3:33 pipe/page fault oddness Dave Jones
2014-09-30  4:27 ` Linus Torvalds
2014-09-30  4:33   ` Dave Jones
     [not found]     ` <CA+55aFwxdOBKHwwp7Zq1k19mHCyHYmYqigCVt59AtB-P7Zva1w@mail.gmail.com>
2014-09-30 15:52       ` Linus Torvalds
2014-09-30 16:03         ` Rik van Riel
2014-09-30 16:07           ` Dave Jones
2014-09-30 16:26           ` Linus Torvalds
2014-09-30 16:05         ` Dave Jones
2014-09-30 16:10           ` Linus Torvalds
2014-09-30 16:22             ` Dave Jones
2014-09-30 16:40               ` Dave Jones
2014-09-30 16:46                 ` Linus Torvalds
2014-09-30 18:20                   ` Dave Jones
2014-09-30 18:58                     ` Linus Torvalds
2014-10-01  8:19                       ` Hugh Dickins
2014-10-01 16:01                         ` Linus Torvalds
2014-10-01 16:18                           ` Linus Torvalds
2014-10-01 17:29                             ` Rik van Riel
2014-10-02  8:28                               ` Peter Zijlstra
2014-10-01 20:20                             ` Linus Torvalds
2014-10-01 21:09                               ` Rik van Riel
2014-10-01 22:08                               ` Sasha Levin
2014-10-01 22:28                                 ` Chuck Ebbert
2014-10-02  3:32                                   ` Sasha Levin
2014-10-02  8:03                                     ` Chuck Ebbert
2014-10-02 14:49                                       ` Sasha Levin
2014-10-01 22:42                                 ` Linus Torvalds
2014-10-02 14:25                                   ` Kirill A. Shutemov
2014-10-02 16:01                                     ` Linus Torvalds
2014-10-02 16:35                                       ` Kirill A. Shutemov
2014-10-02 15:04                                   ` Sasha Levin
2014-10-02 16:10                                     ` Linus Torvalds
2014-10-03  5:00                                       ` Sasha Levin
2014-10-03 15:43                                         ` Linus Torvalds
2014-10-03 15:58                                           ` Dave Jones
2014-10-03 16:02                                             ` Sasha Levin
2014-10-02 12:45                             ` Mel Gorman
2014-10-06 19:18                               ` Aneesh Kumar K.V
2014-10-07 12:45                                 ` Linus Torvalds
2014-10-08 10:37                                   ` Aneesh Kumar K.V
2014-10-02  8:47                           ` Hugh Dickins
2014-10-02 15:57                             ` Linus Torvalds
2014-09-30  4:35   ` Al Viro
