All of lore.kernel.org
 help / color / mirror / Atom feed
* page_waitqueue() considered harmful
@ 2016-09-26 20:58 Linus Torvalds
  2016-09-26 21:23 ` Rik van Riel
                   ` (3 more replies)
  0 siblings, 4 replies; 42+ messages in thread
From: Linus Torvalds @ 2016-09-26 20:58 UTC (permalink / raw)
  To: Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara,
	Mel Gorman, Peter Zijlstra, Rik van Riel
  Cc: linux-mm

So I've been doing some profiling of the git "make test" load, which
is interesting because it's just a lot of small scripts, and it shows
our fork/execve/exit costs.

Some of the top offenders are pretty understandable: we have long had
unmap_page_range() show up at the top on these loads, because the
zap_pte_range() function ends up touching each page as it unmaps
things (it does it to check whether it's an anonymous page, but then
also for the page map count update etc), and that's a disaster from a
cache standpoint. That single function is something between 3-4% of
CPU time, and the one instruction that accesses "struct page" the
first time is a large portion of that. Yes, a single instruction is
blamed for about 1% of all CPU time on a fork/exec/exit workload.

Anyway, there really isn't a ton to be done about it. Same goes for
the reverse side of the coin: filemap_map_pages() (to map in the new
executable pages) is #2, and copy_page() (COW after fork()) is #3.
Those are all kind of inevitable for the load.

#4 on my list is "native_irq_return_iret", which is just a sign of
serializing instructions being really expensive, and this being a
workload with a lot of exceptions (most of which are the page faults).
So I guess there are *two* instructions in the kernel that are really
really hot. Maybe Intel will fix the cost of "iret" some day, at least
partially, but that day has not yet come to pass.

Anyway, things get kind of interesting once you get past the very top
offenders, and the profile starts to be less about "yeah, tough, can't
fix that" and instead hit things that make you go "ehh, really?"

#5 and #6 on my profile are user space (_int_malloc in glibc, and
do_lookup_x in the loader - I think user space should probably start
thinking more about doing static libraries for the really core basic
things, but whatever. Not a kernel issue.

#7 is in the kernel again. And that one struck me as really odd. It's
"unlock_page()", while #9 is __wake_up_bit(). WTF? There's no IO in
this load, it's all cached, why do we use 3% of the time (1.7% and
1.4% respectively) on unlocking a page. And why can't I see the
locking part?

It turns out that I *can* see the locking part, but it's pretty cheap.
It's inside of filemap_map_pages(), which does a trylock, and it shows
up as about 1/6th of the cost of that function. Still, it's much less
than the unlocking side. Why is unlocking so expensive?

Yeah, unlocking is expensive because of the nasty __wake_up_bit()
code. In fact, even inside "unlock_page()" itself, most of the costs
aren't even the atomic bit clearing (like you'd expect), it's the
inlined part of wake_up_bit(). Which does some really nasty crud.

Why is the page_waitqueue() handling so expensive? Let me count the ways:

 (a) stupid code generation interacting really badly with microarchitecture

     We clear the bit in page->flags with a byte-sized "lock andb",
and then page_waitqueue() looks up the zone by reading the full value
of "page->flags" immediately afterwards, which causes bad bad behavior
with the whole read-after-partial-write thing. Nasty.

 (b) It's cache miss heaven. It takes a cache miss on three different
things:looking up the zone 'wait_table', then looking up the hash
queue there, and finally (inside __wake_up_bit) looking up the wait
queue itself (which will effectively always be NULL).

Now, (a) could be fairly easy to fix several ways (the byte-size
operation on an "unsigned long" field have caused problems before, and
may just be a mistake), but (a) wouldn't even be a problem if we
didn't have the complexity of (b) and having to look up the zone and
everything. So realistically, it's actually (b) that is the primary
problem, and indirectly causes (a) to happen too.

Is there really any reason for that incredible indirection? Do we
really want to make the page_waitqueue() be a per-zone thing at all?
Especially since all those wait-queues won't even be *used* unless
there is actual IO going on and people are really getting into
contention on the page lock.. Why isn't the page_waitqueue() just one
statically sized array?

Also, if those bitlock ops had a different bit that showed contention,
we could actually skip *all* of this, and just see that "oh, nobody is
waiting on this page anyway, so there's no point in looking up those
wait queues". We don't have that many "__wait_on_bit()" users, maybe
we could say that the bitlocks do have to haev *two* bits: one for the
lock bit itself, and one for "there is contention".

Looking at the history of this code, most of it actually predates the
git history, it goes back to ~2004. Time to revisit that thing?

                  Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-26 20:58 page_waitqueue() considered harmful Linus Torvalds
@ 2016-09-26 21:23 ` Rik van Riel
  2016-09-26 21:30   ` Linus Torvalds
  2016-09-26 23:11   ` Kirill A. Shutemov
  2016-09-27  7:30 ` Peter Zijlstra
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 42+ messages in thread
From: Rik van Riel @ 2016-09-26 21:23 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Kirill A. Shutemov,
	Johannes Weiner, Jan Kara, Mel Gorman, Peter Zijlstra
  Cc: linux-mm

[-- Attachment #1: Type: text/plain, Size: 1248 bytes --]

On Mon, 2016-09-26 at 13:58 -0700, Linus Torvalds wrote:

> Is there really any reason for that incredible indirection? Do we
> really want to make the page_waitqueue() be a per-zone thing at all?
> Especially since all those wait-queues won't even be *used* unless
> there is actual IO going on and people are really getting into
> contention on the page lock.. Why isn't the page_waitqueue() just one
> statically sized array?

Why are we touching file pages at all during fork()?

Could we get away with skipping copy_page_range on VMAs
that do not have any anonymous pages?

Could we teach copy_page_range to skip file PTEs during
fork?

The child process can just fault the needed file pages in
after it has been forked off.

Given how common fork + exec are, there is a real chance that:
- the child process did not need all of the parent's pages, and
- the parent process does not have some of the bits needed by
  the child (for exec) mapped into its page tables, anyway
  (as suggested by the child processes page faulting on file
   pages)

Having fewer pages mapped might also make zap_page_range
a little cheaper, both at exec and at exit time.

Am I overlooking something?

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-26 21:23 ` Rik van Riel
@ 2016-09-26 21:30   ` Linus Torvalds
  2016-09-26 23:11   ` Kirill A. Shutemov
  1 sibling, 0 replies; 42+ messages in thread
From: Linus Torvalds @ 2016-09-26 21:30 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara,
	Mel Gorman, Peter Zijlstra, linux-mm

On Mon, Sep 26, 2016 at 2:23 PM, Rik van Riel <riel@redhat.com> wrote:
>
> Why are we touching file pages at all during fork()?

This is *not* fork. This is fork/exec/exit. It's a real load, it's
just fairly concentrated by the scripts being many and small.

It's easy enough to try yourself:

   perf record -e cycles:pp make -j16 test

in the git tree.

> Could we get away with skipping copy_page_range on VMAs
> that do not have any anonymous pages?

You still have to duplicate those ranges, but that's fairly cheap.
>From what I can tell, the expensive part really is the "page in new
executable/library pages" and then tearing them down again (because it
was just a quick small git process or a very small shell script).

So forget about fork(). I'm not even sure how much of that there is,
it's possible that you end up having vfork() instead. It's exec/exit
that matters most in this load.

(You can see that in the profile with things like strnlen_user()
actually being fairly high on the profile too - mostly the environment
variables during exec).

            Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-26 21:23 ` Rik van Riel
  2016-09-26 21:30   ` Linus Torvalds
@ 2016-09-26 23:11   ` Kirill A. Shutemov
  2016-09-27  1:01     ` Rik van Riel
  1 sibling, 1 reply; 42+ messages in thread
From: Kirill A. Shutemov @ 2016-09-26 23:11 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Andrew Morton, Kirill A. Shutemov,
	Johannes Weiner, Jan Kara, Mel Gorman, Peter Zijlstra, linux-mm

On Mon, Sep 26, 2016 at 05:23:29PM -0400, Rik van Riel wrote:
> On Mon, 2016-09-26 at 13:58 -0700, Linus Torvalds wrote:
> 
> > Is there really any reason for that incredible indirection? Do we
> > really want to make the page_waitqueue() be a per-zone thing at all?
> > Especially since all those wait-queues won't even be *used* unless
> > there is actual IO going on and people are really getting into
> > contention on the page lock.. Why isn't the page_waitqueue() just one
> > statically sized array?
> 
> Why are we touching file pages at all during fork()?

We are not.
Unless the vma has private pages (vma->anon_vma is not NULL).

See first lines for copy_page_range().

We probably can go futher and skip non-private pages within file VMA.
But we would need to touch struct page in this case, so it doesn't make
sense.

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-26 23:11   ` Kirill A. Shutemov
@ 2016-09-27  1:01     ` Rik van Riel
  0 siblings, 0 replies; 42+ messages in thread
From: Rik van Riel @ 2016-09-27  1:01 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Andrew Morton, Kirill A. Shutemov,
	Johannes Weiner, Jan Kara, Mel Gorman, Peter Zijlstra, linux-mm

[-- Attachment #1: Type: text/plain, Size: 920 bytes --]

On Tue, 2016-09-27 at 02:11 +0300, Kirill A. Shutemov wrote:
> On Mon, Sep 26, 2016 at 05:23:29PM -0400, Rik van Riel wrote:
> > 
> > On Mon, 2016-09-26 at 13:58 -0700, Linus Torvalds wrote:
> > 
> > > 
> > > Is there really any reason for that incredible indirection? Do we
> > > really want to make the page_waitqueue() be a per-zone thing at
> > > all?
> > > Especially since all those wait-queues won't even be *used*
> > > unless
> > > there is actual IO going on and people are really getting into
> > > contention on the page lock.. Why isn't the page_waitqueue() just
> > > one
> > > statically sized array?
> > 
> > Why are we touching file pages at all during fork()?
> 
> We are not.
> Unless the vma has private pages (vma->anon_vma is not NULL).
> 
> See first lines for copy_page_range().

Ahhh, indeed. I thought I remembered an optimization like
that.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-26 20:58 page_waitqueue() considered harmful Linus Torvalds
  2016-09-26 21:23 ` Rik van Riel
@ 2016-09-27  7:30 ` Peter Zijlstra
  2016-09-27  8:54   ` Mel Gorman
  2016-09-27  8:03 ` Jan Kara
  2016-09-27  8:31 ` Mel Gorman
  3 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2016-09-27  7:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara,
	Mel Gorman, Rik van Riel, linux-mm

On Mon, Sep 26, 2016 at 01:58:00PM -0700, Linus Torvalds wrote:

> Why is the page_waitqueue() handling so expensive? Let me count the ways:

>  (b) It's cache miss heaven. It takes a cache miss on three different
> things:looking up the zone 'wait_table', then looking up the hash
> queue there, and finally (inside __wake_up_bit) looking up the wait
> queue itself (which will effectively always be NULL).

> Is there really any reason for that incredible indirection? Do we
> really want to make the page_waitqueue() be a per-zone thing at all?
> Especially since all those wait-queues won't even be *used* unless
> there is actual IO going on and people are really getting into
> contention on the page lock.. Why isn't the page_waitqueue() just one
> statically sized array?

I suspect the reason is to have per node hash tables, just like we get
per node page-frame arrays with sparsemem.

> Also, if those bitlock ops had a different bit that showed contention,
> we could actually skip *all* of this, and just see that "oh, nobody is
> waiting on this page anyway, so there's no point in looking up those
> wait queues". We don't have that many "__wait_on_bit()" users, maybe
> we could say that the bitlocks do have to haev *two* bits: one for the
> lock bit itself, and one for "there is contention".

That would be fairly simple to implement, the difficulty would be
actually getting a page-flag to use for this. We're running pretty low
in available bits :/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-26 20:58 page_waitqueue() considered harmful Linus Torvalds
  2016-09-26 21:23 ` Rik van Riel
  2016-09-27  7:30 ` Peter Zijlstra
@ 2016-09-27  8:03 ` Jan Kara
  2016-09-27  8:31 ` Mel Gorman
  3 siblings, 0 replies; 42+ messages in thread
From: Jan Kara @ 2016-09-27  8:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara,
	Mel Gorman, Peter Zijlstra, Rik van Riel, linux-mm

Hello,

On Mon 26-09-16 13:58:00, Linus Torvalds wrote:
> #7 is in the kernel again. And that one struck me as really odd. It's
> "unlock_page()", while #9 is __wake_up_bit(). WTF? There's no IO in
> this load, it's all cached, why do we use 3% of the time (1.7% and
> 1.4% respectively) on unlocking a page. And why can't I see the
> locking part?
> 
> It turns out that I *can* see the locking part, but it's pretty cheap.
> It's inside of filemap_map_pages(), which does a trylock, and it shows
> up as about 1/6th of the cost of that function. Still, it's much less
> than the unlocking side. Why is unlocking so expensive?
> 
> Yeah, unlocking is expensive because of the nasty __wake_up_bit()
> code. In fact, even inside "unlock_page()" itself, most of the costs
> aren't even the atomic bit clearing (like you'd expect), it's the
> inlined part of wake_up_bit(). Which does some really nasty crud.
> 
> Why is the page_waitqueue() handling so expensive? Let me count the ways:
> 
>  (a) stupid code generation interacting really badly with microarchitecture
> 
>      We clear the bit in page->flags with a byte-sized "lock andb",
> and then page_waitqueue() looks up the zone by reading the full value
> of "page->flags" immediately afterwards, which causes bad bad behavior
> with the whole read-after-partial-write thing. Nasty.
> 
>  (b) It's cache miss heaven. It takes a cache miss on three different
> things:looking up the zone 'wait_table', then looking up the hash
> queue there, and finally (inside __wake_up_bit) looking up the wait
> queue itself (which will effectively always be NULL).
> 
> Now, (a) could be fairly easy to fix several ways (the byte-size
> operation on an "unsigned long" field have caused problems before, and
> may just be a mistake), but (a) wouldn't even be a problem if we
> didn't have the complexity of (b) and having to look up the zone and
> everything. So realistically, it's actually (b) that is the primary
> problem, and indirectly causes (a) to happen too.
> 
> Is there really any reason for that incredible indirection? Do we
> really want to make the page_waitqueue() be a per-zone thing at all?
> Especially since all those wait-queues won't even be *used* unless
> there is actual IO going on and people are really getting into
> contention on the page lock.. Why isn't the page_waitqueue() just one
> statically sized array?
> 
> Also, if those bitlock ops had a different bit that showed contention,
> we could actually skip *all* of this, and just see that "oh, nobody is
> waiting on this page anyway, so there's no point in looking up those
> wait queues". We don't have that many "__wait_on_bit()" users, maybe
> we could say that the bitlocks do have to haev *two* bits: one for the
> lock bit itself, and one for "there is contention".

So we were carrying patches in SUSE kernels for a long time that did
something like this (since 2011 it seems) since the unlock_page() was
pretty visible on some crazy SGI workload on their supercomputer. But they
were never accepted upstream and eventually we mostly dropped them (some
less intrusive optimizations got merged) when rebasing on 3.12 - Mel did
the evaluation at that time so he should be able to fill in details but at
one place he writes:

"These patches lack a compelling use case for pushing to mainline.  They
show a measurable gain in profiles but the gain is in the noise for the
whole workload. It is known to improve boot times on very large machines
and help an artifical test case but that is not a compelling reason to
consume a page flag and push it to mainline. The patches are held in
reserve until a compelling case for them is found."

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-26 20:58 page_waitqueue() considered harmful Linus Torvalds
                   ` (2 preceding siblings ...)
  2016-09-27  8:03 ` Jan Kara
@ 2016-09-27  8:31 ` Mel Gorman
  2016-09-27 14:34   ` Peter Zijlstra
  2016-09-27 14:53   ` Nicholas Piggin
  3 siblings, 2 replies; 42+ messages in thread
From: Mel Gorman @ 2016-09-27  8:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara,
	Peter Zijlstra, Rik van Riel, linux-mm


Hi Linus,

On Mon, Sep 26, 2016 at 01:58:00PM -0700, Linus Torvalds wrote:
> So I've been doing some profiling of the git "make test" load, which
> is interesting because it's just a lot of small scripts, and it shows
> our fork/execve/exit costs.
> 
> Some of the top offenders are pretty understandable: we have long had
> unmap_page_range() show up at the top on these loads, because the
> zap_pte_range() function ends up touching each page as it unmaps
> things (it does it to check whether it's an anonymous page, but then
> also for the page map count update etc), and that's a disaster from a
> cache standpoint. That single function is something between 3-4% of
> CPU time, and the one instruction that accesses "struct page" the
> first time is a large portion of that. Yes, a single instruction is
> blamed for about 1% of all CPU time on a fork/exec/exit workload.
> 

It was found at one point that the fault-around made these costs worse as
there were simply more pages to tear down. However, this only applied to
fork/exit microbenchmarks.  Matt Fleming prototyped an unreleased patch
that tried to be clever about this but the cost was never worthwhile. A
plain revert helped a microbenchmark but hurt workloads like the git
testsuite which was shell intensive.

It got filed under "we're not fixing a fork/exit microbenchmark at the
expense of "real" workloads like git checkout and git testsuite".

> <SNIP>
>
> #5 and #6 on my profile are user space (_int_malloc in glibc, and
> do_lookup_x in the loader - I think user space should probably start
> thinking more about doing static libraries for the really core basic
> things, but whatever. Not a kernel issue.
> 

Recent problems have been fixed with _int_malloc in glibc, particularly as it
applies to threads but no fix springs to mind that might impact "make test".

> #7 is in the kernel again. And that one struck me as really odd. It's
> "unlock_page()", while #9 is __wake_up_bit(). WTF? There's no IO in
> this load, it's all cached, why do we use 3% of the time (1.7% and
> 1.4% respectively) on unlocking a page. And why can't I see the
> locking part?
> 
> It turns out that I *can* see the locking part, but it's pretty cheap.
> It's inside of filemap_map_pages(), which does a trylock, and it shows
> up as about 1/6th of the cost of that function. Still, it's much less
> than the unlocking side. Why is unlocking so expensive?
> 
> Yeah, unlocking is expensive because of the nasty __wake_up_bit()
> code. In fact, even inside "unlock_page()" itself, most of the costs
> aren't even the atomic bit clearing (like you'd expect), it's the
> inlined part of wake_up_bit(). Which does some really nasty crud.
> 
> Why is the page_waitqueue() handling so expensive? Let me count the ways:
> 

page_waitqueue() has been a hazard for years. I think the last attempt to
fix it was back in 2014 http://www.spinics.net/lists/linux-mm/msg73207.html

The patch is heavily derived from work by Nick Piggin who noticed the years
before that. I think that was the last version I posted and the changelog
includes profile data. I don't have an exact reference why it was rejected
but a consistent piece of feedback was that it was very complex for the
level of impact it had.

>  (a) stupid code generation interacting really badly with microarchitecture
> 
>      We clear the bit in page->flags with a byte-sized "lock andb",
> and then page_waitqueue() looks up the zone by reading the full value
> of "page->flags" immediately afterwards, which causes bad bad behavior
> with the whole read-after-partial-write thing. Nasty.
> 

Other than some code churn, it should be possible to lookup the waitqueue
before clearing the bit if the patch that side-steps the lookup entirely
is still distasteful.

>  (b) It's cache miss heaven. It takes a cache miss on three different
> things:looking up the zone 'wait_table', then looking up the hash
> queue there, and finally (inside __wake_up_bit) looking up the wait
> queue itself (which will effectively always be NULL).
> 
> Now, (a) could be fairly easy to fix several ways (the byte-size
> operation on an "unsigned long" field have caused problems before, and
> may just be a mistake), but (a) wouldn't even be a problem if we
> didn't have the complexity of (b) and having to look up the zone and
> everything. So realistically, it's actually (b) that is the primary
> problem, and indirectly causes (a) to happen too.
> 

The patch in question side-steps the (b) part.

> Is there really any reason for that incredible indirection? Do we
> really want to make the page_waitqueue() be a per-zone thing at all?
> Especially since all those wait-queues won't even be *used* unless
> there is actual IO going on and people are really getting into
> contention on the page lock.. Why isn't the page_waitqueue() just one
> statically sized array?
> 

About all I can think of is NUMA locality. Doing it per-zone at the time
would be convenient. It could be per-node but that's no better from a
complexity point of view. If it was a static array, the key could be related
to the node the page is on but that's still looking into the page flags.

> Also, if those bitlock ops had a different bit that showed contention,
> we could actually skip *all* of this, and just see that "oh, nobody is
> waiting on this page anyway, so there's no point in looking up those
> wait queues".

That's very close to what the 2014 patch does. However, it's 64-bit only
due to the use of page flags. Even though it's out of date, can you take
a look? If it doesn't trigger hate, I can forward port it unless you do
that yourself for the purposes of testing.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-27  7:30 ` Peter Zijlstra
@ 2016-09-27  8:54   ` Mel Gorman
  2016-09-27  9:11     ` Kirill A. Shutemov
  2016-09-29  8:01     ` Peter Zijlstra
  0 siblings, 2 replies; 42+ messages in thread
From: Mel Gorman @ 2016-09-27  8:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Kirill A. Shutemov,
	Johannes Weiner, Jan Kara, Rik van Riel, linux-mm

On Tue, Sep 27, 2016 at 09:30:55AM +0200, Peter Zijlstra wrote:
> > Also, if those bitlock ops had a different bit that showed contention,
> > we could actually skip *all* of this, and just see that "oh, nobody is
> > waiting on this page anyway, so there's no point in looking up those
> > wait queues". We don't have that many "__wait_on_bit()" users, maybe
> > we could say that the bitlocks do have to haev *two* bits: one for the
> > lock bit itself, and one for "there is contention".
> 
> That would be fairly simple to implement, the difficulty would be
> actually getting a page-flag to use for this. We're running pretty low
> in available bits :/

Simple is relative unless I drastically overcomplicated things and it
wouldn't be the first time. 64-bit only side-steps the page flag issue
as long as we can live with that.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-27  8:54   ` Mel Gorman
@ 2016-09-27  9:11     ` Kirill A. Shutemov
  2016-09-27  9:42       ` Mel Gorman
  2016-09-27  9:52       ` Minchan Kim
  2016-09-29  8:01     ` Peter Zijlstra
  1 sibling, 2 replies; 42+ messages in thread
From: Kirill A. Shutemov @ 2016-09-27  9:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton,
	Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel,
	linux-mm

On Tue, Sep 27, 2016 at 09:54:12AM +0100, Mel Gorman wrote:
> On Tue, Sep 27, 2016 at 09:30:55AM +0200, Peter Zijlstra wrote:
> > > Also, if those bitlock ops had a different bit that showed contention,
> > > we could actually skip *all* of this, and just see that "oh, nobody is
> > > waiting on this page anyway, so there's no point in looking up those
> > > wait queues". We don't have that many "__wait_on_bit()" users, maybe
> > > we could say that the bitlocks do have to haev *two* bits: one for the
> > > lock bit itself, and one for "there is contention".
> > 
> > That would be fairly simple to implement, the difficulty would be
> > actually getting a page-flag to use for this. We're running pretty low
> > in available bits :/
> 
> Simple is relative unless I drastically overcomplicated things and it
> wouldn't be the first time. 64-bit only side-steps the page flag issue
> as long as we can live with that.

Looks like we don't ever lock slab pages. Unless I miss something.

We can try to use PG_locked + PG_slab to indicate contation.

I tried to boot kernel with CONFIG_SLUB + BUG_ON(PageSlab()) in
trylock/unlock_page() codepath. Works fine, but more inspection is
required.

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-27  9:11     ` Kirill A. Shutemov
@ 2016-09-27  9:42       ` Mel Gorman
  2016-09-27  9:52       ` Minchan Kim
  1 sibling, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2016-09-27  9:42 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton,
	Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel,
	linux-mm

On Tue, Sep 27, 2016 at 12:11:17PM +0300, Kirill A. Shutemov wrote:
> On Tue, Sep 27, 2016 at 09:54:12AM +0100, Mel Gorman wrote:
> > On Tue, Sep 27, 2016 at 09:30:55AM +0200, Peter Zijlstra wrote:
> > > > Also, if those bitlock ops had a different bit that showed contention,
> > > > we could actually skip *all* of this, and just see that "oh, nobody is
> > > > waiting on this page anyway, so there's no point in looking up those
> > > > wait queues". We don't have that many "__wait_on_bit()" users, maybe
> > > > we could say that the bitlocks do have to haev *two* bits: one for the
> > > > lock bit itself, and one for "there is contention".
> > > 
> > > That would be fairly simple to implement, the difficulty would be
> > > actually getting a page-flag to use for this. We're running pretty low
> > > in available bits :/
> > 
> > Simple is relative unless I drastically overcomplicated things and it
> > wouldn't be the first time. 64-bit only side-steps the page flag issue
> > as long as we can live with that.
> 
> Looks like we don't ever lock slab pages. Unless I miss something.
> 

I don't think we do but direct PageSlab checks might be problematic if
it was a false-positive due to a locked page and we'd have to be very
careful about any races due to two bits being used.

While we shouldn't rule it out, I think it's important to first look at
that original patch and see if it's remotely acceptable and makes enough
difference to a real workload to matter. If so, then we could consider
additional complexity on top to make it work on 32-bit -- maybe separated
by one release as it took a long time to flush out subtle bugs with the
PG_waiters approach.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-27  9:11     ` Kirill A. Shutemov
  2016-09-27  9:42       ` Mel Gorman
@ 2016-09-27  9:52       ` Minchan Kim
  2016-09-27 12:11         ` Kirill A. Shutemov
  1 sibling, 1 reply; 42+ messages in thread
From: Minchan Kim @ 2016-09-27  9:52 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Mel Gorman, Peter Zijlstra, Linus Torvalds, Andrew Morton,
	Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel,
	linux-mm

On Tue, Sep 27, 2016 at 12:11:17PM +0300, Kirill A. Shutemov wrote:
> On Tue, Sep 27, 2016 at 09:54:12AM +0100, Mel Gorman wrote:
> > On Tue, Sep 27, 2016 at 09:30:55AM +0200, Peter Zijlstra wrote:
> > > > Also, if those bitlock ops had a different bit that showed contention,
> > > > we could actually skip *all* of this, and just see that "oh, nobody is
> > > > waiting on this page anyway, so there's no point in looking up those
> > > > wait queues". We don't have that many "__wait_on_bit()" users, maybe
> > > > we could say that the bitlocks do have to haev *two* bits: one for the
> > > > lock bit itself, and one for "there is contention".
> > > 
> > > That would be fairly simple to implement, the difficulty would be
> > > actually getting a page-flag to use for this. We're running pretty low
> > > in available bits :/
> > 
> > Simple is relative unless I drastically overcomplicated things and it
> > wouldn't be the first time. 64-bit only side-steps the page flag issue
> > as long as we can live with that.
> 
> Looks like we don't ever lock slab pages. Unless I miss something.
> 
> We can try to use PG_locked + PG_slab to indicate contation.
> 
> I tried to boot kernel with CONFIG_SLUB + BUG_ON(PageSlab()) in
> trylock/unlock_page() codepath. Works fine, but more inspection is
> required.

SLUB used bit_spin_lock via slab_lock instead of trylock/unlock.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-27  9:52       ` Minchan Kim
@ 2016-09-27 12:11         ` Kirill A. Shutemov
  0 siblings, 0 replies; 42+ messages in thread
From: Kirill A. Shutemov @ 2016-09-27 12:11 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Mel Gorman, Peter Zijlstra, Linus Torvalds, Andrew Morton,
	Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel,
	linux-mm

On Tue, Sep 27, 2016 at 06:52:06PM +0900, Minchan Kim wrote:
> On Tue, Sep 27, 2016 at 12:11:17PM +0300, Kirill A. Shutemov wrote:
> > On Tue, Sep 27, 2016 at 09:54:12AM +0100, Mel Gorman wrote:
> > > On Tue, Sep 27, 2016 at 09:30:55AM +0200, Peter Zijlstra wrote:
> > > > > Also, if those bitlock ops had a different bit that showed contention,
> > > > > we could actually skip *all* of this, and just see that "oh, nobody is
> > > > > waiting on this page anyway, so there's no point in looking up those
> > > > > wait queues". We don't have that many "__wait_on_bit()" users, maybe
> > > > > we could say that the bitlocks do have to haev *two* bits: one for the
> > > > > lock bit itself, and one for "there is contention".
> > > > 
> > > > That would be fairly simple to implement, the difficulty would be
> > > > actually getting a page-flag to use for this. We're running pretty low
> > > > in available bits :/
> > > 
> > > Simple is relative unless I drastically overcomplicated things and it
> > > wouldn't be the first time. 64-bit only side-steps the page flag issue
> > > as long as we can live with that.
> > 
> > Looks like we don't ever lock slab pages. Unless I miss something.
> > 
> > We can try to use PG_locked + PG_slab to indicate contation.
> > 
> > I tried to boot kernel with CONFIG_SLUB + BUG_ON(PageSlab()) in
> > trylock/unlock_page() codepath. Works fine, but more inspection is
> > required.
> 
> SLUB used bit_spin_lock via slab_lock instead of trylock/unlock.

Ahh. Missed that.

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-27  8:31 ` Mel Gorman
@ 2016-09-27 14:34   ` Peter Zijlstra
  2016-09-27 15:08     ` Nicholas Piggin
                       ` (2 more replies)
  2016-09-27 14:53   ` Nicholas Piggin
  1 sibling, 3 replies; 42+ messages in thread
From: Peter Zijlstra @ 2016-09-27 14:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linus Torvalds, Andrew Morton, Kirill A. Shutemov,
	Johannes Weiner, Jan Kara, Rik van Riel, linux-mm

On Tue, Sep 27, 2016 at 09:31:04AM +0100, Mel Gorman wrote:
> page_waitqueue() has been a hazard for years. I think the last attempt to
> fix it was back in 2014 http://www.spinics.net/lists/linux-mm/msg73207.html
> 
> The patch is heavily derived from work by Nick Piggin who noticed the years
> before that. I think that was the last version I posted and the changelog
> includes profile data. I don't have an exact reference why it was rejected
> but a consistent piece of feedback was that it was very complex for the
> level of impact it had.

Right, I never really liked that patch. In any case, the below seems to
boot, although the lock_page_wait() thing did get my brain in a bit of a
twist. Doing explicit loops with PG_contended inside wq->lock would be
more obvious but results in much more code.

We could muck about with PG_contended naming/placement if any of this
shows benefit.

It does boot on my x86_64 and builds a kernel, so it must be perfect ;-)

---
 include/linux/page-flags.h     |  2 ++
 include/linux/pagemap.h        | 21 +++++++++----
 include/linux/wait.h           |  2 +-
 include/trace/events/mmflags.h |  1 +
 kernel/sched/wait.c            | 17 ++++++----
 mm/filemap.c                   | 71 ++++++++++++++++++++++++++++++++++++------
 6 files changed, 92 insertions(+), 22 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 74e4dda..0ed3900 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -73,6 +73,7 @@
  */
 enum pageflags {
 	PG_locked,		/* Page is locked. Don't touch. */
+	PG_contended,		/* Page lock is contended. */
 	PG_error,
 	PG_referenced,
 	PG_uptodate,
@@ -253,6 +254,7 @@ static inline int TestClearPage##uname(struct page *page) { return 0; }
 	TESTSETFLAG_FALSE(uname) TESTCLEARFLAG_FALSE(uname)
 
 __PAGEFLAG(Locked, locked, PF_NO_TAIL)
+PAGEFLAG(Contended, contended, PF_NO_TAIL)
 PAGEFLAG(Error, error, PF_NO_COMPOUND) TESTCLEARFLAG(Error, error, PF_NO_COMPOUND)
 PAGEFLAG(Referenced, referenced, PF_HEAD)
 	TESTCLEARFLAG(Referenced, referenced, PF_HEAD)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 66a1260..3b38a96 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -417,7 +417,7 @@ extern void __lock_page(struct page *page);
 extern int __lock_page_killable(struct page *page);
 extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 				unsigned int flags);
-extern void unlock_page(struct page *page);
+extern void __unlock_page(struct page *page);
 
 static inline int trylock_page(struct page *page)
 {
@@ -448,6 +448,16 @@ static inline int lock_page_killable(struct page *page)
 	return 0;
 }
 
+static inline void unlock_page(struct page *page)
+{
+	page = compound_head(page);
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	clear_bit_unlock(PG_locked, &page->flags);
+	smp_mb__after_atomic();
+	if (PageContended(page))
+		__unlock_page(page);
+}
+
 /*
  * lock_page_or_retry - Lock the page, unless this would block and the
  * caller indicated that it can handle a retry.
@@ -472,11 +482,11 @@ extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
 extern int wait_on_page_bit_killable_timeout(struct page *page,
 					     int bit_nr, unsigned long timeout);
 
+extern int wait_on_page_lock(struct page *page, int mode);
+
 static inline int wait_on_page_locked_killable(struct page *page)
 {
-	if (!PageLocked(page))
-		return 0;
-	return wait_on_page_bit_killable(compound_head(page), PG_locked);
+	return wait_on_page_lock(page, TASK_KILLABLE);
 }
 
 extern wait_queue_head_t *page_waitqueue(struct page *page);
@@ -494,8 +504,7 @@ static inline void wake_up_page(struct page *page, int bit)
  */
 static inline void wait_on_page_locked(struct page *page)
 {
-	if (PageLocked(page))
-		wait_on_page_bit(compound_head(page), PG_locked);
+	wait_on_page_lock(page, TASK_UNINTERRUPTIBLE);
 }
 
 /* 
diff --git a/include/linux/wait.h b/include/linux/wait.h
index c3ff74d..524cd54 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -198,7 +198,7 @@ __remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old)
 
 typedef int wait_bit_action_f(struct wait_bit_key *, int mode);
 void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
+int __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
 void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
 void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
 void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 5a81ab4..18b8398 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -81,6 +81,7 @@
 
 #define __def_pageflag_names						\
 	{1UL << PG_locked,		"locked"	},		\
+	{1UL << PG_contended,		"contended"	},		\
 	{1UL << PG_error,		"error"		},		\
 	{1UL << PG_referenced,		"referenced"	},		\
 	{1UL << PG_uptodate,		"uptodate"	},		\
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index f15d6b6..46dcc42 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -62,18 +62,23 @@ EXPORT_SYMBOL(remove_wait_queue);
  * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
  * zero in this (rare) case, and we handle it by continuing to scan the queue.
  */
-static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
+static int __wake_up_common(wait_queue_head_t *q, unsigned int mode,
 			int nr_exclusive, int wake_flags, void *key)
 {
 	wait_queue_t *curr, *next;
+	int woken = 0;
 
 	list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
 		unsigned flags = curr->flags;
 
-		if (curr->func(curr, mode, wake_flags, key) &&
-				(flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
-			break;
+		if (curr->func(curr, mode, wake_flags, key)) {
+			woken++;
+			if ((flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
+				break;
+		}
 	}
+
+	return woken;
 }
 
 /**
@@ -106,9 +111,9 @@ void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr)
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked);
 
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key)
+int __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key)
 {
-	__wake_up_common(q, mode, 1, 0, key);
+	return __wake_up_common(q, mode, 1, 0, key);
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked_key);
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 8a287df..d3e3203 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -847,15 +847,18 @@ EXPORT_SYMBOL_GPL(add_page_wait_queue);
  * The mb is necessary to enforce ordering between the clear_bit and the read
  * of the waitqueue (to avoid SMP races with a parallel wait_on_page_locked()).
  */
-void unlock_page(struct page *page)
+void __unlock_page(struct page *page)
 {
-	page = compound_head(page);
-	VM_BUG_ON_PAGE(!PageLocked(page), page);
-	clear_bit_unlock(PG_locked, &page->flags);
-	smp_mb__after_atomic();
-	wake_up_page(page, PG_locked);
+	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(&page->flags, PG_locked);
+	wait_queue_head_t *wq = page_waitqueue(page);
+	unsigned long flags;
+
+	spin_lock_irqsave(&wq->lock, flags);
+	if (!__wake_up_locked_key(wq, TASK_NORMAL, &key))
+		ClearPageContended(page);
+	spin_unlock_irqrestore(&wq->lock, flags);
 }
-EXPORT_SYMBOL(unlock_page);
+EXPORT_SYMBOL(__unlock_page);
 
 /**
  * end_page_writeback - end writeback against a page
@@ -908,6 +911,44 @@ void page_endio(struct page *page, bool is_write, int err)
 }
 EXPORT_SYMBOL_GPL(page_endio);
 
+static int lock_page_wait(struct wait_bit_key *word, int mode)
+{
+	struct page *page = container_of(word->flags, struct page, flags);
+
+	/*
+	 * Set PG_contended now that we're enqueued on the waitqueue, this
+	 * way we cannot race against ClearPageContended in wakeup.
+	 */
+	if (!PageContended(page)) {
+		SetPageContended(page);
+
+		/*
+		 * If we set PG_contended, we must order against any later list
+		 * addition and/or a later PG_lock load.
+		 *
+		 *  [unlock]			[wait]
+		 *
+		 *  clear PG_locked		set PG_contended
+		 *  test  PG_contended		test-and-set PG_locked
+		 *
+		 * and
+		 *
+		 *  [unlock]			[wait]
+		 *
+		 *  test PG_contended		set PG_contended
+		 *  wakeup			add
+		 *  clear PG_contended		sleep
+		 *
+		 * In either case we must not reorder nor sleep before
+		 * PG_contended is visible.
+		 */
+		smp_mb__after_atomic();
+		return 0;
+	}
+
+	return bit_wait_io(word, mode);
+}
+
 /**
  * __lock_page - get a lock on the page, assuming we need to sleep to get it
  * @page: the page to lock
@@ -917,7 +958,7 @@ void __lock_page(struct page *page)
 	struct page *page_head = compound_head(page);
 	DEFINE_WAIT_BIT(wait, &page_head->flags, PG_locked);
 
-	__wait_on_bit_lock(page_waitqueue(page_head), &wait, bit_wait_io,
+	__wait_on_bit_lock(page_waitqueue(page_head), &wait, lock_page_wait,
 							TASK_UNINTERRUPTIBLE);
 }
 EXPORT_SYMBOL(__lock_page);
@@ -928,10 +969,22 @@ int __lock_page_killable(struct page *page)
 	DEFINE_WAIT_BIT(wait, &page_head->flags, PG_locked);
 
 	return __wait_on_bit_lock(page_waitqueue(page_head), &wait,
-					bit_wait_io, TASK_KILLABLE);
+					lock_page_wait, TASK_KILLABLE);
 }
 EXPORT_SYMBOL_GPL(__lock_page_killable);
 
+int wait_on_page_lock(struct page *page, int mode)
+{
+	struct page __always_unused *__page = (page = compound_head(page));
+	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
+
+	if (!PageLocked(page))
+		return 0;
+
+	return __wait_on_bit(page_waitqueue(page), &wait, lock_page_wait, mode);
+}
+EXPORT_SYMBOL(wait_on_page_lock);
+
 /*
  * Return values:
  * 1 - page is locked; mmap_sem is still held.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-27  8:31 ` Mel Gorman
  2016-09-27 14:34   ` Peter Zijlstra
@ 2016-09-27 14:53   ` Nicholas Piggin
  2016-09-27 15:17     ` Nicholas Piggin
                       ` (2 more replies)
  1 sibling, 3 replies; 42+ messages in thread
From: Nicholas Piggin @ 2016-09-27 14:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linus Torvalds, Andrew Morton, Kirill A. Shutemov,
	Johannes Weiner, Jan Kara, Peter Zijlstra, Rik van Riel,
	linux-mm

On Tue, 27 Sep 2016 09:31:04 +0100
Mel Gorman <mgorman@techsingularity.net> wrote:

> Hi Linus,
> 
> On Mon, Sep 26, 2016 at 01:58:00PM -0700, Linus Torvalds wrote:
> > So I've been doing some profiling of the git "make test" load, which
> > is interesting because it's just a lot of small scripts, and it shows
> > our fork/execve/exit costs.
> > 
> > Some of the top offenders are pretty understandable: we have long had
> > unmap_page_range() show up at the top on these loads, because the
> > zap_pte_range() function ends up touching each page as it unmaps
> > things (it does it to check whether it's an anonymous page, but then
> > also for the page map count update etc), and that's a disaster from a
> > cache standpoint. That single function is something between 3-4% of
> > CPU time, and the one instruction that accesses "struct page" the
> > first time is a large portion of that. Yes, a single instruction is
> > blamed for about 1% of all CPU time on a fork/exec/exit workload.
> >   
> 
> It was found at one point that the fault-around made these costs worse as
> there were simply more pages to tear down. However, this only applied to
> fork/exit microbenchmarks.  Matt Fleming prototyped an unreleased patch
> that tried to be clever about this but the cost was never worthwhile. A
> plain revert helped a microbenchmark but hurt workloads like the git
> testsuite which was shell intensive.
> 
> It got filed under "we're not fixing a fork/exit microbenchmark at the
> expense of "real" workloads like git checkout and git testsuite".
> 
> > <SNIP>
> >
> > #5 and #6 on my profile are user space (_int_malloc in glibc, and
> > do_lookup_x in the loader - I think user space should probably start
> > thinking more about doing static libraries for the really core basic
> > things, but whatever. Not a kernel issue.
> >   
> 
> Recent problems have been fixed with _int_malloc in glibc, particularly as it
> applies to threads but no fix springs to mind that might impact "make test".
> 
> > #7 is in the kernel again. And that one struck me as really odd. It's
> > "unlock_page()", while #9 is __wake_up_bit(). WTF? There's no IO in
> > this load, it's all cached, why do we use 3% of the time (1.7% and
> > 1.4% respectively) on unlocking a page. And why can't I see the
> > locking part?
> > 
> > It turns out that I *can* see the locking part, but it's pretty cheap.
> > It's inside of filemap_map_pages(), which does a trylock, and it shows
> > up as about 1/6th of the cost of that function. Still, it's much less
> > than the unlocking side. Why is unlocking so expensive?
> > 
> > Yeah, unlocking is expensive because of the nasty __wake_up_bit()
> > code. In fact, even inside "unlock_page()" itself, most of the costs
> > aren't even the atomic bit clearing (like you'd expect), it's the
> > inlined part of wake_up_bit(). Which does some really nasty crud.
> > 
> > Why is the page_waitqueue() handling so expensive? Let me count the ways:
> >   
> 
> page_waitqueue() has been a hazard for years. I think the last attempt to
> fix it was back in 2014 http://www.spinics.net/lists/linux-mm/msg73207.html
> 
> The patch is heavily derived from work by Nick Piggin who noticed the years
> before that. I think that was the last version I posted and the changelog
> includes profile data. I don't have an exact reference why it was rejected
> but a consistent piece of feedback was that it was very complex for the
> level of impact it had.

Huh, I was just wondering about this again the other day. Powerpc has
some interesting issues with atomic ops and barriers (not to mention
random cache misses that hurt everybody).

It actually wasn't for big Altix machines (at least not when I wrote it),
but it came from some effort to optimize page reclaim performance on an
opteron with a lot (back then) of cores per node.

And it's not only for scalability, it's a single threaded performance
optimisation as much as anything.

By the way I think that patch linked is taking the wrong approach. Better
to put all the complexity of maintaining the waiters bit into the sleep/wake
functions. The fastpath simply tests the bit in no less racy a manner than
the unlocked waitqueue_active() test. Really incomplete patch attached for
reference.

The part where we hack the wait code into maintaining the extra bit for us
is pretty mechanical and boring so long as it's under the waitqueue lock.

The more interesting is the ability to avoid the barrier between fastpath
clearing a bit and testing for waiters.

unlock():                        lock() (slowpath):
clear_bit(PG_locked)             set_bit(PG_waiter)
test_bit(PG_waiter)              test_bit(PG_locked)

If this was memory ops to different words, it would require smp_mb each
side.. Being the same word, can we avoid them? ISTR Linus you were worried
about stores being forwarded to loads before it is visible to the other CPU.
I think that should be okay because the stores will be ordered, and the load
can't move earlier than the store on the same CPU. Maybe I completely
misremember it.


Subject: [PATCH] blah

---
 include/linux/pagemap.h | 10 +++---
 mm/filemap.c            | 92 +++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 87 insertions(+), 15 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 66a1260..a536df9 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -479,10 +479,11 @@ static inline int wait_on_page_locked_killable(struct page *page)
 	return wait_on_page_bit_killable(compound_head(page), PG_locked);
 }
 
-extern wait_queue_head_t *page_waitqueue(struct page *page);
 static inline void wake_up_page(struct page *page, int bit)
 {
-	__wake_up_bit(page_waitqueue(page), &page->flags, bit);
+	if (!PageWaiters(page))
+		return;
+	wake_up_page_bit(page, bit);
 }
 
 /* 
@@ -494,8 +495,9 @@ static inline void wake_up_page(struct page *page, int bit)
  */
 static inline void wait_on_page_locked(struct page *page)
 {
-	if (PageLocked(page))
-		wait_on_page_bit(compound_head(page), PG_locked);
+	if (!PageLocked(page))
+		return 0;
+	wait_on_page_bit(compound_head(page), PG_locked);
 }
 
 /* 
diff --git a/mm/filemap.c b/mm/filemap.c
index 8a287df..09bca8a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -775,24 +775,98 @@ EXPORT_SYMBOL(__page_cache_alloc);
  * at a cost of "thundering herd" phenomena during rare hash
  * collisions.
  */
-wait_queue_head_t *page_waitqueue(struct page *page)
+static wait_queue_head_t *page_waitqueue(struct page *page)
 {
 	const struct zone *zone = page_zone(page);
 
 	return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
 }
-EXPORT_SYMBOL(page_waitqueue);
+
+struct wait_page_key {
+	struct page *page;
+	int bit_nr;
+	int page_match;
+};
+
+struct wait_page_queue {
+	struct wait_page_key key;
+	wait_queue_t wait;
+};
+
+static int wake_page_function(wait_queue_t *wait, unsigned mode, int sync, void *arg)
+{
+	struct wait_page_queue *wait = container_of(wait, struct wait_page_queue, wait);
+	struct wait_page_key *key = arg;
+	int ret;
+
+	if (wait->key.page != key->page)
+	       return 0;
+	key->page_match = 1;
+	if (wait->key.bit_nr != key->bit_nr ||
+			test_bit(key->bit_nr, &key->page->flags))
+		return 0;
+
+	ret = try_to_wake_up(wait->wait.private, mode, sync);
+	if (ret)
+		list_del_init(&wait->task_list);
+	return ret;
+}
 
 void wait_on_page_bit(struct page *page, int bit_nr)
 {
-	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
+	wait_queue_head_t *wq = page_waitqueue(page);
+	struct wait_page_queue wait;
 
-	if (test_bit(bit_nr, &page->flags))
-		__wait_on_bit(page_waitqueue(page), &wait, bit_wait_io,
-							TASK_UNINTERRUPTIBLE);
+	init_wait(&wait.wait);
+	wait.wait.func = wake_page_function;
+	wait.key.page = page;
+	wait.key.bit_nr = bit_nr;
+	/* wait.key.page_match unused */
+
+	wait.flags &= ~WQ_FLAG_EXCLUSIVE;
+
+again:
+	spin_lock_irq(&wq->lock);
+	if (unlikely(!test_bit(bit_nr, &page->flags))) {
+		spin_unlock_irq(&wq->lock);
+		return;
+	}
+
+	if (list_empty(&wait->task_list)) {
+		__add_wait_queue(wq, &wait);
+		if (!PageWaiters(page))
+			SetPageWaiters(page);
+	}
+	set_current_state(TASK_UNINTERRUPTIBLE);
+	spin_unlock_irq(&wq->lock);
+
+	io_schedule();
+
+	if (!test_bit(bit_nr, &page->flags))
+		goto again;
+
+	finish_wait(wq, &wait);
 }
 EXPORT_SYMBOL(wait_on_page_bit);
 
+void wake_up_page_bit(struct page *page, int bit_nr)
+{
+	wait_queue_head_t *wq;
+	struct wait_page_key key;
+	unsigned long flags;
+
+	key.page = page;
+	key.bit_nr = bit_nr;
+	key.page_match = 0;
+
+	wq = page_waitqueue(page);
+	spin_lock_irqsave(&wq->lock, flags);
+	__wake_up_locked_key(wq, TASK_NORMAL, &key);
+	if (!waitqueue_active(wq) || !key.page_match)
+		ClearPageWaiters(page);
+	spin_unlock_irqrestore(&wq->lock, flags);
+}
+
 int wait_on_page_bit_killable(struct page *page, int bit_nr)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
@@ -852,7 +926,6 @@ void unlock_page(struct page *page)
 	page = compound_head(page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	clear_bit_unlock(PG_locked, &page->flags);
-	smp_mb__after_atomic();
 	wake_up_page(page, PG_locked);
 }
 EXPORT_SYMBOL(unlock_page);
@@ -875,10 +948,7 @@ void end_page_writeback(struct page *page)
 		rotate_reclaimable_page(page);
 	}
 
-	if (!test_clear_page_writeback(page))
-		BUG();
-
-	smp_mb__after_atomic();
+	clear_bit_unlock(PG_writeback, &page->flags);
 	wake_up_page(page, PG_writeback);
 }
 EXPORT_SYMBOL(end_page_writeback);
-- 
2.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-27 14:34   ` Peter Zijlstra
@ 2016-09-27 15:08     ` Nicholas Piggin
  2016-09-27 16:31     ` Linus Torvalds
  2016-09-28 10:45     ` Mel Gorman
  2 siblings, 0 replies; 42+ messages in thread
From: Nicholas Piggin @ 2016-09-27 15:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov,
	Johannes Weiner, Jan Kara, Rik van Riel, linux-mm

On Tue, 27 Sep 2016 16:34:26 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Sep 27, 2016 at 09:31:04AM +0100, Mel Gorman wrote:
> > page_waitqueue() has been a hazard for years. I think the last attempt to
> > fix it was back in 2014 http://www.spinics.net/lists/linux-mm/msg73207.html
> > 
> > The patch is heavily derived from work by Nick Piggin who noticed the years
> > before that. I think that was the last version I posted and the changelog
> > includes profile data. I don't have an exact reference why it was rejected
> > but a consistent piece of feedback was that it was very complex for the
> > level of impact it had.  
> 
> Right, I never really liked that patch. In any case, the below seems to
> boot, although the lock_page_wait() thing did get my brain in a bit of a
> twist. Doing explicit loops with PG_contended inside wq->lock would be
> more obvious but results in much more code.
> 
> We could muck about with PG_contended naming/placement if any of this
> shows benefit.
> 
> It does boot on my x86_64 and builds a kernel, so it must be perfect ;-)
> 
> ---
>  include/linux/page-flags.h     |  2 ++
>  include/linux/pagemap.h        | 21 +++++++++----
>  include/linux/wait.h           |  2 +-
>  include/trace/events/mmflags.h |  1 +
>  kernel/sched/wait.c            | 17 ++++++----
>  mm/filemap.c                   | 71 ++++++++++++++++++++++++++++++++++++------
>  6 files changed, 92 insertions(+), 22 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 74e4dda..0ed3900 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -73,6 +73,7 @@
>   */
>  enum pageflags {
>  	PG_locked,		/* Page is locked. Don't touch. */
> +	PG_contended,		/* Page lock is contended. */
>  	PG_error,
>  	PG_referenced,
>  	PG_uptodate,
> @@ -253,6 +254,7 @@ static inline int TestClearPage##uname(struct page *page) { return 0; }
>  	TESTSETFLAG_FALSE(uname) TESTCLEARFLAG_FALSE(uname)
>  
>  __PAGEFLAG(Locked, locked, PF_NO_TAIL)
> +PAGEFLAG(Contended, contended, PF_NO_TAIL)


>  PAGEFLAG(Error, error, PF_NO_COMPOUND) TESTCLEARFLAG(Error, error, PF_NO_COMPOUND)
>  PAGEFLAG(Referenced, referenced, PF_HEAD)
>  	TESTCLEARFLAG(Referenced, referenced, PF_HEAD)
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 66a1260..3b38a96 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -417,7 +417,7 @@ extern void __lock_page(struct page *page);
>  extern int __lock_page_killable(struct page *page);
>  extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
>  				unsigned int flags);
> -extern void unlock_page(struct page *page);
> +extern void __unlock_page(struct page *page);
>  
>  static inline int trylock_page(struct page *page)
>  {
> @@ -448,6 +448,16 @@ static inline int lock_page_killable(struct page *page)
>  	return 0;
>  }
>  
> +static inline void unlock_page(struct page *page)
> +{
> +	page = compound_head(page);
> +	VM_BUG_ON_PAGE(!PageLocked(page), page);
> +	clear_bit_unlock(PG_locked, &page->flags);
> +	smp_mb__after_atomic();

If the wakeup does not do a reorderable waitqueue_active check outside
the lock, why is this barrier needed? 

> +	if (PageContended(page))
> +		__unlock_page(page);
> +}
> +


> @@ -62,18 +62,23 @@ EXPORT_SYMBOL(remove_wait_queue);
>   * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
>   * zero in this (rare) case, and we handle it by continuing to scan the queue.
>   */
> -static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
> +static int __wake_up_common(wait_queue_head_t *q, unsigned int mode,
>  			int nr_exclusive, int wake_flags, void *key)
>  {
>  	wait_queue_t *curr, *next;
> +	int woken = 0;
>  
>  	list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
>  		unsigned flags = curr->flags;
>  
> -		if (curr->func(curr, mode, wake_flags, key) &&
> -				(flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
> -			break;
> +		if (curr->func(curr, mode, wake_flags, key)) {
> +			woken++;
> +			if ((flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
> +				break;
> +		}
>  	}
> +
> +	return woken;
>  }
>  
>  /**
> @@ -106,9 +111,9 @@ void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr)
>  }
>  EXPORT_SYMBOL_GPL(__wake_up_locked);
>  
> -void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key)
> +int __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key)
>  {
> -	__wake_up_common(q, mode, 1, 0, key);
> +	return __wake_up_common(q, mode, 1, 0, key);
>  }
>  EXPORT_SYMBOL_GPL(__wake_up_locked_key);
>  
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 8a287df..d3e3203 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -847,15 +847,18 @@ EXPORT_SYMBOL_GPL(add_page_wait_queue);
>   * The mb is necessary to enforce ordering between the clear_bit and the read
>   * of the waitqueue (to avoid SMP races with a parallel wait_on_page_locked()).
>   */
> -void unlock_page(struct page *page)
> +void __unlock_page(struct page *page)
>  {
> -	page = compound_head(page);
> -	VM_BUG_ON_PAGE(!PageLocked(page), page);
> -	clear_bit_unlock(PG_locked, &page->flags);
> -	smp_mb__after_atomic();
> -	wake_up_page(page, PG_locked);
> +	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(&page->flags, PG_locked);
> +	wait_queue_head_t *wq = page_waitqueue(page);
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&wq->lock, flags);
> +	if (!__wake_up_locked_key(wq, TASK_NORMAL, &key))
> +		ClearPageContended(page);
> +	spin_unlock_irqrestore(&wq->lock, flags);

This approach of just counting wakeups doesn't work when you use the
waiter bit for other than just PG_locked (e.g., PG_writeback). I guess
that's why I didn't call it contented too: it really should be just
about whether it is on the waitqueue or not.

>  }
> -EXPORT_SYMBOL(unlock_page);
> +EXPORT_SYMBOL(__unlock_page);
>  
>  /**
>   * end_page_writeback - end writeback against a page
> @@ -908,6 +911,44 @@ void page_endio(struct page *page, bool is_write, int err)
>  }
>  EXPORT_SYMBOL_GPL(page_endio);
>  
> +static int lock_page_wait(struct wait_bit_key *word, int mode)
> +{
> +	struct page *page = container_of(word->flags, struct page, flags);
> +
> +	/*
> +	 * Set PG_contended now that we're enqueued on the waitqueue, this
> +	 * way we cannot race against ClearPageContended in wakeup.
> +	 */
> +	if (!PageContended(page)) {
> +		SetPageContended(page);
> +
> +		/*
> +		 * If we set PG_contended, we must order against any later list
> +		 * addition and/or a later PG_lock load.
> +		 *
> +		 *  [unlock]			[wait]
> +		 *
> +		 *  clear PG_locked		set PG_contended
> +		 *  test  PG_contended		test-and-set PG_locked
> +		 *
> +		 * and
> +		 *
> +		 *  [unlock]			[wait]
> +		 *
> +		 *  test PG_contended		set PG_contended
> +		 *  wakeup			add
> +		 *  clear PG_contended		sleep
> +		 *
> +		 * In either case we must not reorder nor sleep before
> +		 * PG_contended is visible.
> +		 */
> +		smp_mb__after_atomic();


> +		return 0;
> +	}
> +
> +	return bit_wait_io(word, mode);
> +}

I probably slightly prefer how my patch maintains the bit, but don't really
fussed how it works exactly as long as it can support waiting on other bits
as well.

The more interesting question is the memory barriers -- see my other post.

Thanks,
Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-27 14:53   ` Nicholas Piggin
@ 2016-09-27 15:17     ` Nicholas Piggin
  2016-09-27 16:52     ` Peter Zijlstra
  2016-09-28  7:40     ` Mel Gorman
  2 siblings, 0 replies; 42+ messages in thread
From: Nicholas Piggin @ 2016-09-27 15:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linus Torvalds, Andrew Morton, Kirill A. Shutemov,
	Johannes Weiner, Jan Kara, Peter Zijlstra, Rik van Riel,
	linux-mm

On Wed, 28 Sep 2016 00:53:18 +1000
Nicholas Piggin <npiggin@gmail.com> wrote:

> On Tue, 27 Sep 2016 09:31:04 +0100
> Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > Hi Linus,
> > 
> > On Mon, Sep 26, 2016 at 01:58:00PM -0700, Linus Torvalds wrote:  
> > > So I've been doing some profiling of the git "make test" load, which
> > > is interesting because it's just a lot of small scripts, and it shows
> > > our fork/execve/exit costs.
> > > 
> > > Some of the top offenders are pretty understandable: we have long had
> > > unmap_page_range() show up at the top on these loads, because the
> > > zap_pte_range() function ends up touching each page as it unmaps
> > > things (it does it to check whether it's an anonymous page, but then
> > > also for the page map count update etc), and that's a disaster from a
> > > cache standpoint. That single function is something between 3-4% of
> > > CPU time, and the one instruction that accesses "struct page" the
> > > first time is a large portion of that. Yes, a single instruction is
> > > blamed for about 1% of all CPU time on a fork/exec/exit workload.
> > >     
> > 
> > It was found at one point that the fault-around made these costs worse as
> > there were simply more pages to tear down. However, this only applied to
> > fork/exit microbenchmarks.  Matt Fleming prototyped an unreleased patch
> > that tried to be clever about this but the cost was never worthwhile. A
> > plain revert helped a microbenchmark but hurt workloads like the git
> > testsuite which was shell intensive.
> > 
> > It got filed under "we're not fixing a fork/exit microbenchmark at the
> > expense of "real" workloads like git checkout and git testsuite".
> >   
> > > <SNIP>
> > >
> > > #5 and #6 on my profile are user space (_int_malloc in glibc, and
> > > do_lookup_x in the loader - I think user space should probably start
> > > thinking more about doing static libraries for the really core basic
> > > things, but whatever. Not a kernel issue.
> > >     
> > 
> > Recent problems have been fixed with _int_malloc in glibc, particularly as it
> > applies to threads but no fix springs to mind that might impact "make test".
> >   
> > > #7 is in the kernel again. And that one struck me as really odd. It's
> > > "unlock_page()", while #9 is __wake_up_bit(). WTF? There's no IO in
> > > this load, it's all cached, why do we use 3% of the time (1.7% and
> > > 1.4% respectively) on unlocking a page. And why can't I see the
> > > locking part?
> > > 
> > > It turns out that I *can* see the locking part, but it's pretty cheap.
> > > It's inside of filemap_map_pages(), which does a trylock, and it shows
> > > up as about 1/6th of the cost of that function. Still, it's much less
> > > than the unlocking side. Why is unlocking so expensive?
> > > 
> > > Yeah, unlocking is expensive because of the nasty __wake_up_bit()
> > > code. In fact, even inside "unlock_page()" itself, most of the costs
> > > aren't even the atomic bit clearing (like you'd expect), it's the
> > > inlined part of wake_up_bit(). Which does some really nasty crud.
> > > 
> > > Why is the page_waitqueue() handling so expensive? Let me count the ways:
> > >     
> > 
> > page_waitqueue() has been a hazard for years. I think the last attempt to
> > fix it was back in 2014 http://www.spinics.net/lists/linux-mm/msg73207.html
> > 
> > The patch is heavily derived from work by Nick Piggin who noticed the years
> > before that. I think that was the last version I posted and the changelog
> > includes profile data. I don't have an exact reference why it was rejected
> > but a consistent piece of feedback was that it was very complex for the
> > level of impact it had.  
> 
> Huh, I was just wondering about this again the other day. Powerpc has
> some interesting issues with atomic ops and barriers (not to mention
> random cache misses that hurt everybody).
> 
> It actually wasn't for big Altix machines (at least not when I wrote it),
> but it came from some effort to optimize page reclaim performance on an
> opteron with a lot (back then) of cores per node.
> 
> And it's not only for scalability, it's a single threaded performance
> optimisation as much as anything.
> 
> By the way I think that patch linked is taking the wrong approach. Better
> to put all the complexity of maintaining the waiters bit into the sleep/wake
> functions. The fastpath simply tests the bit in no less racy a manner than
> the unlocked waitqueue_active() test. Really incomplete patch attached for
> reference.
> 
> The part where we hack the wait code into maintaining the extra bit for us
> is pretty mechanical and boring so long as it's under the waitqueue lock.
> 
> The more interesting is the ability to avoid the barrier between fastpath
> clearing a bit and testing for waiters.
> 
> unlock():                        lock() (slowpath):
> clear_bit(PG_locked)             set_bit(PG_waiter)
> test_bit(PG_waiter)              test_bit(PG_locked)
> 
> If this was memory ops to different words, it would require smp_mb each
> side.. Being the same word, can we avoid them? ISTR Linus you were worried
> about stores being forwarded to loads before it is visible to the other CPU.
> I think that should be okay because the stores will be ordered, and the load
> can't move earlier than the store on the same CPU. Maybe I completely
> misremember it.
> 
> 
> Subject: [PATCH] blah
> 
> ---
>  include/linux/pagemap.h | 10 +++---
>  mm/filemap.c            | 92 +++++++++++++++++++++++++++++++++++++++++++------
>  2 files changed, 87 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 66a1260..a536df9 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -479,10 +479,11 @@ static inline int wait_on_page_locked_killable(struct page *page)
>  	return wait_on_page_bit_killable(compound_head(page), PG_locked);
>  }
>  
> -extern wait_queue_head_t *page_waitqueue(struct page *page);
>  static inline void wake_up_page(struct page *page, int bit)
>  {
> -	__wake_up_bit(page_waitqueue(page), &page->flags, bit);
> +	if (!PageWaiters(page))
> +		return;
> +	wake_up_page_bit(page, bit);
>  }
>  
>  /* 
> @@ -494,8 +495,9 @@ static inline void wake_up_page(struct page *page, int bit)
>   */
>  static inline void wait_on_page_locked(struct page *page)
>  {
> -	if (PageLocked(page))
> -		wait_on_page_bit(compound_head(page), PG_locked);
> +	if (!PageLocked(page))
> +		return 0;
> +	wait_on_page_bit(compound_head(page), PG_locked);
>  }
>  
>  /* 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 8a287df..09bca8a 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -775,24 +775,98 @@ EXPORT_SYMBOL(__page_cache_alloc);
>   * at a cost of "thundering herd" phenomena during rare hash
>   * collisions.
>   */
> -wait_queue_head_t *page_waitqueue(struct page *page)
> +static wait_queue_head_t *page_waitqueue(struct page *page)
>  {
>  	const struct zone *zone = page_zone(page);
>  
>  	return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
>  }
> -EXPORT_SYMBOL(page_waitqueue);
> +
> +struct wait_page_key {
> +	struct page *page;
> +	int bit_nr;
> +	int page_match;
> +};
> +
> +struct wait_page_queue {
> +	struct wait_page_key key;
> +	wait_queue_t wait;
> +};
> +
> +static int wake_page_function(wait_queue_t *wait, unsigned mode, int sync, void *arg)
> +{
> +	struct wait_page_queue *wait = container_of(wait, struct wait_page_queue, wait);
> +	struct wait_page_key *key = arg;
> +	int ret;
> +
> +	if (wait->key.page != key->page)
> +	       return 0;
> +	key->page_match = 1;
> +	if (wait->key.bit_nr != key->bit_nr ||
> +			test_bit(key->bit_nr, &key->page->flags))
> +		return 0;
> +
> +	ret = try_to_wake_up(wait->wait.private, mode, sync);
> +	if (ret)
> +		list_del_init(&wait->task_list);
> +	return ret;
> +}
>  
>  void wait_on_page_bit(struct page *page, int bit_nr)
>  {
> -	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
> +	wait_queue_head_t *wq = page_waitqueue(page);
> +	struct wait_page_queue wait;
>  
> -	if (test_bit(bit_nr, &page->flags))
> -		__wait_on_bit(page_waitqueue(page), &wait, bit_wait_io,
> -							TASK_UNINTERRUPTIBLE);
> +	init_wait(&wait.wait);
> +	wait.wait.func = wake_page_function;
> +	wait.key.page = page;
> +	wait.key.bit_nr = bit_nr;
> +	/* wait.key.page_match unused */
> +
> +	wait.flags &= ~WQ_FLAG_EXCLUSIVE;
> +
> +again:
> +	spin_lock_irq(&wq->lock);
> +	if (unlikely(!test_bit(bit_nr, &page->flags))) {
> +		spin_unlock_irq(&wq->lock);
> +		return;
> +	}
> +
> +	if (list_empty(&wait->task_list)) {
> +		__add_wait_queue(wq, &wait);
> +		if (!PageWaiters(page))
> +			SetPageWaiters(page);
> +	}

Ugh, I even wrote the correct ordering in the email, but did it the wrong
way here. Anyway, you get the idea.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-27 14:34   ` Peter Zijlstra
  2016-09-27 15:08     ` Nicholas Piggin
@ 2016-09-27 16:31     ` Linus Torvalds
  2016-09-27 16:49       ` Peter Zijlstra
  2016-09-28 10:45     ` Mel Gorman
  2 siblings, 1 reply; 42+ messages in thread
From: Linus Torvalds @ 2016-09-27 16:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Andrew Morton, Kirill A. Shutemov, Johannes Weiner,
	Jan Kara, Rik van Riel, linux-mm

On Tue, Sep 27, 2016 at 7:34 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Right, I never really liked that patch. In any case, the below seems to
> boot, although the lock_page_wait() thing did get my brain in a bit of a
> twist. Doing explicit loops with PG_contended inside wq->lock would be
> more obvious but results in much more code.
>
> We could muck about with PG_contended naming/placement if any of this
> shows benefit.
>
> It does boot on my x86_64 and builds a kernel, so it must be perfect ;-)

This patch looks very much like what I was thinking of. Except you
made that bit clearing more complicated than I would have done.

I see why you did it (it's hard to clear the bit when the wait-queue
that is associated with it can be associated with multiple pages), but
I think it would be perfectly fine to just not even try to make the
"contended" bit be an exact bit. You can literally leave it set
(giving us the existing behavior), but then when you hit the
__unlock_page() case, and you look up the page_waitqueue(), and find
that the waitqueue is empty, *then* you clear it.

So you'd end up going through the slow path one too many times per
page, but considering that right now we *always* go through that
slow-path, and the "one too many times" is "two times per IO rather
than just once", it really is not a performance issue. I'd rather go
for simple and robust.

I get a bit nervous when I see you being so careful in counting the
number of waiters that match the page key - if any of that code ever
gets it wrong (because two different pages that shared a waitqueue
happen to race at just the right time), and the bit gets cleared too
early, you will get some *very* hard-to-debug problems.

So I actually think your patch is a bit too clever.

But maybe there's a reason for that that I just don't see. My gut feel
is that your patch is good.

.. and hey, it booted and compiled the kernel, so as you say, it must
be perfect.

                   Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-27 16:31     ` Linus Torvalds
@ 2016-09-27 16:49       ` Peter Zijlstra
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Zijlstra @ 2016-09-27 16:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, Andrew Morton, Kirill A. Shutemov, Johannes Weiner,
	Jan Kara, Rik van Riel, linux-mm

On Tue, Sep 27, 2016 at 09:31:54AM -0700, Linus Torvalds wrote:
> On Tue, Sep 27, 2016 at 7:34 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Right, I never really liked that patch. In any case, the below seems to
> > boot, although the lock_page_wait() thing did get my brain in a bit of a
> > twist. Doing explicit loops with PG_contended inside wq->lock would be
> > more obvious but results in much more code.
> >
> > We could muck about with PG_contended naming/placement if any of this
> > shows benefit.
> >
> > It does boot on my x86_64 and builds a kernel, so it must be perfect ;-)
> 
> This patch looks very much like what I was thinking of. Except you
> made that bit clearing more complicated than I would have done.
> 
> I see why you did it (it's hard to clear the bit when the wait-queue
> that is associated with it can be associated with multiple pages), but
> I think it would be perfectly fine to just not even try to make the
> "contended" bit be an exact bit. You can literally leave it set
> (giving us the existing behavior), but then when you hit the
> __unlock_page() case, and you look up the page_waitqueue(), and find
> that the waitqueue is empty, *then* you clear it.
> 
> So you'd end up going through the slow path one too many times per
> page, but considering that right now we *always* go through that
> slow-path, and the "one too many times" is "two times per IO rather
> than just once", it really is not a performance issue. I'd rather go
> for simple and robust.

My clear already does that same thing, once we find nobody to wake up we
clear, this means we did the waitqueue lookup once in vain. But yes it
allows the bit to be cleared while there are still waiters (for other
bits) on the waitqueue.

The other benefit of doing what you suggest (and Nick does) is that you
can then indeed use the bit for other waitqueue users like PG_writeback.
I never really got that part of the patch as it wasn't spelled out, but
it does make sense now that I understand the intent.

And assuming collisions are rare, that works just fine.

> I get a bit nervous when I see you being so careful in counting the
> number of waiters that match the page key - if any of that code ever
> gets it wrong (because two different pages that shared a waitqueue
> happen to race at just the right time), and the bit gets cleared too
> early, you will get some *very* hard-to-debug problems.

Right, so I don't care about the actual number, just it being 0 or not.
Maybe I should've returned bool.

But yes, its a tad more tricky than I'd liked, mostly because I was
lazy. Doing the SetPageContended under the wq->lock would've made things
more obvious.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-27 14:53   ` Nicholas Piggin
  2016-09-27 15:17     ` Nicholas Piggin
@ 2016-09-27 16:52     ` Peter Zijlstra
  2016-09-27 17:06       ` Nicholas Piggin
  2016-09-28  7:40     ` Mel Gorman
  2 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2016-09-27 16:52 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov,
	Johannes Weiner, Jan Kara, Rik van Riel, linux-mm, Will Deacon

On Wed, Sep 28, 2016 at 12:53:18AM +1000, Nicholas Piggin wrote:
> The more interesting is the ability to avoid the barrier between fastpath
> clearing a bit and testing for waiters.
> 
> unlock():                        lock() (slowpath):
> clear_bit(PG_locked)             set_bit(PG_waiter)
> test_bit(PG_waiter)              test_bit(PG_locked)
> 
> If this was memory ops to different words, it would require smp_mb each
> side.. Being the same word, can we avoid them? 

Ah, that is the reason I put that smp_mb__after_atomic() there. You have
a cute point on them being to the same word though. Need to think about
that.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-27 16:52     ` Peter Zijlstra
@ 2016-09-27 17:06       ` Nicholas Piggin
  2016-09-28  7:05         ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Nicholas Piggin @ 2016-09-27 17:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov,
	Johannes Weiner, Jan Kara, Rik van Riel, linux-mm, Will Deacon

On Tue, 27 Sep 2016 18:52:21 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Sep 28, 2016 at 12:53:18AM +1000, Nicholas Piggin wrote:
> > The more interesting is the ability to avoid the barrier between fastpath
> > clearing a bit and testing for waiters.
> > 
> > unlock():                        lock() (slowpath):
> > clear_bit(PG_locked)             set_bit(PG_waiter)
> > test_bit(PG_waiter)              test_bit(PG_locked)
> > 
> > If this was memory ops to different words, it would require smp_mb each
> > side.. Being the same word, can we avoid them?   
> 
> Ah, that is the reason I put that smp_mb__after_atomic() there. You have
> a cute point on them being to the same word though. Need to think about
> that.

This is all assuming the store accesses are ordered, which you should get
if the stores to the different bits operate on the same address and size.
That might not be the case for some architectures, but they might not
require barriers for other reasons. That would call for an smp_mb variant
that is used for bitops on different bits but same aligned long. 

Thanks,
Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-27 17:06       ` Nicholas Piggin
@ 2016-09-28  7:05         ` Peter Zijlstra
  2016-09-28 11:05           ` Paul E. McKenney
  2016-09-29  1:31           ` Nicholas Piggin
  0 siblings, 2 replies; 42+ messages in thread
From: Peter Zijlstra @ 2016-09-28  7:05 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov,
	Johannes Weiner, Jan Kara, Rik van Riel, linux-mm, Will Deacon,
	Paul McKenney, Alan Stern

On Wed, Sep 28, 2016 at 03:06:21AM +1000, Nicholas Piggin wrote:
> On Tue, 27 Sep 2016 18:52:21 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Wed, Sep 28, 2016 at 12:53:18AM +1000, Nicholas Piggin wrote:
> > > The more interesting is the ability to avoid the barrier between fastpath
> > > clearing a bit and testing for waiters.
> > > 
> > > unlock():                        lock() (slowpath):
> > > clear_bit(PG_locked)             set_bit(PG_waiter)
> > > test_bit(PG_waiter)              test_bit(PG_locked)
> > > 
> > > If this was memory ops to different words, it would require smp_mb each
> > > side.. Being the same word, can we avoid them?   
> > 
> > Ah, that is the reason I put that smp_mb__after_atomic() there. You have
> > a cute point on them being to the same word though. Need to think about
> > that.
> 
> This is all assuming the store accesses are ordered, which you should get
> if the stores to the different bits operate on the same address and size.
> That might not be the case for some architectures, but they might not
> require barriers for other reasons. That would call for an smp_mb variant
> that is used for bitops on different bits but same aligned long. 

Since the {set,clear}_bit operations are atomic, they must be ordered
against one another. The subsequent test_bit is a load, which, since its
to the same variable, and a CPU must appear to preserve Program-Order,
must come after the RmW.

So I think you're right and that we can forgo the memory barriers here.
I even think this must be true on all architectures.

Paul and Alan have a validation tool someplace, put them on Cc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-27 14:53   ` Nicholas Piggin
  2016-09-27 15:17     ` Nicholas Piggin
  2016-09-27 16:52     ` Peter Zijlstra
@ 2016-09-28  7:40     ` Mel Gorman
  2 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2016-09-28  7:40 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Linus Torvalds, Andrew Morton, Kirill A. Shutemov,
	Johannes Weiner, Jan Kara, Peter Zijlstra, Rik van Riel,
	linux-mm

On Wed, Sep 28, 2016 at 12:53:18AM +1000, Nicholas Piggin wrote:
> > The patch is heavily derived from work by Nick Piggin who noticed the years
> > before that. I think that was the last version I posted and the changelog
> > includes profile data. I don't have an exact reference why it was rejected
> > but a consistent piece of feedback was that it was very complex for the
> > level of impact it had.
> 
> Huh, I was just wondering about this again the other day. Powerpc has
> some interesting issues with atomic ops and barriers (not to mention
> random cache misses that hurt everybody).
> 
> It actually wasn't for big Altix machines (at least not when I wrote it),
> but it came from some effort to optimize page reclaim performance on an
> opteron with a lot (back then) of cores per node.
> 

The reason SUSE carried the patch for a while, albeit not right now, was
because of the large machines and them being the only people that could
provide concrete support. Over time, the benefit dropped.

It was always the case the benefit of the patch could be measured from
profiles but rarely visible in the context of the overall workload or
buried deep within the noise.

The final structure of the patch was partially driven by fixes for subtle
bugs discovered in the patch over a long period of time. Maybe the new
approaches will both avoid those bugs and be visible on "real"
workloads.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-27 14:34   ` Peter Zijlstra
  2016-09-27 15:08     ` Nicholas Piggin
  2016-09-27 16:31     ` Linus Torvalds
@ 2016-09-28 10:45     ` Mel Gorman
  2016-09-28 11:11       ` Peter Zijlstra
  2 siblings, 1 reply; 42+ messages in thread
From: Mel Gorman @ 2016-09-28 10:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Kirill A. Shutemov,
	Johannes Weiner, Jan Kara, Rik van Riel, linux-mm

On Tue, Sep 27, 2016 at 04:34:26PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 27, 2016 at 09:31:04AM +0100, Mel Gorman wrote:
> > page_waitqueue() has been a hazard for years. I think the last attempt to
> > fix it was back in 2014 http://www.spinics.net/lists/linux-mm/msg73207.html
> > 
> > The patch is heavily derived from work by Nick Piggin who noticed the years
> > before that. I think that was the last version I posted and the changelog
> > includes profile data. I don't have an exact reference why it was rejected
> > but a consistent piece of feedback was that it was very complex for the
> > level of impact it had.
> 
> Right, I never really liked that patch. In any case, the below seems to
> boot, although the lock_page_wait() thing did get my brain in a bit of a
> twist. Doing explicit loops with PG_contended inside wq->lock would be
> more obvious but results in much more code.
> 
> We could muck about with PG_contended naming/placement if any of this
> shows benefit.
> 
> It does boot on my x86_64 and builds a kernel, so it must be perfect ;-)
> 

Heh.

tldr: Other than 32-bit vs 64-bit, I could not find anything obviously wrong.

> ---
>  include/linux/page-flags.h     |  2 ++
>  include/linux/pagemap.h        | 21 +++++++++----
>  include/linux/wait.h           |  2 +-
>  include/trace/events/mmflags.h |  1 +
>  kernel/sched/wait.c            | 17 ++++++----
>  mm/filemap.c                   | 71 ++++++++++++++++++++++++++++++++++++------
>  6 files changed, 92 insertions(+), 22 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 74e4dda..0ed3900 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -73,6 +73,7 @@
>   */
>  enum pageflags {
>  	PG_locked,		/* Page is locked. Don't touch. */
> +	PG_contended,		/* Page lock is contended. */
>  	PG_error,
>  	PG_referenced,
>  	PG_uptodate,

Naming has been covered by Nick. You may run into the same problem with
32-bit and available page flags. I didn't work out the remaining number
of flags but did you check 32-bit is ok? If not, you may need to take a
similar approach that I did that says "there are always waiters and use
the slow path on 32-bit".

> @@ -253,6 +254,7 @@ static inline int TestClearPage##uname(struct page *page) { return 0; }
>  	TESTSETFLAG_FALSE(uname) TESTCLEARFLAG_FALSE(uname)
>  
>  __PAGEFLAG(Locked, locked, PF_NO_TAIL)
> +PAGEFLAG(Contended, contended, PF_NO_TAIL)
>  PAGEFLAG(Error, error, PF_NO_COMPOUND) TESTCLEARFLAG(Error, error, PF_NO_COMPOUND)
>  PAGEFLAG(Referenced, referenced, PF_HEAD)
>  	TESTCLEARFLAG(Referenced, referenced, PF_HEAD)
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 66a1260..3b38a96 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -417,7 +417,7 @@ extern void __lock_page(struct page *page);
>  extern int __lock_page_killable(struct page *page);
>  extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
>  				unsigned int flags);
> -extern void unlock_page(struct page *page);
> +extern void __unlock_page(struct page *page);
>  
>  static inline int trylock_page(struct page *page)
>  {
> @@ -448,6 +448,16 @@ static inline int lock_page_killable(struct page *page)
>  	return 0;
>  }
>  
> +static inline void unlock_page(struct page *page)
> +{
> +	page = compound_head(page);
> +	VM_BUG_ON_PAGE(!PageLocked(page), page);
> +	clear_bit_unlock(PG_locked, &page->flags);
> +	smp_mb__after_atomic();
> +	if (PageContended(page))
> +		__unlock_page(page);
> +}
> +

The main race to be concerned with is PageContended being set after the
page is unlocked and missing a wakeup here. While you explain the protocol
later, it's worth referencing it here.

>  /*
>   * lock_page_or_retry - Lock the page, unless this would block and the
>   * caller indicated that it can handle a retry.
> @@ -472,11 +482,11 @@ extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
>  extern int wait_on_page_bit_killable_timeout(struct page *page,
>  					     int bit_nr, unsigned long timeout);
>  
> +extern int wait_on_page_lock(struct page *page, int mode);
> +
>  static inline int wait_on_page_locked_killable(struct page *page)
>  {
> -	if (!PageLocked(page))
> -		return 0;
> -	return wait_on_page_bit_killable(compound_head(page), PG_locked);
> +	return wait_on_page_lock(page, TASK_KILLABLE);
>  }
>  

Ok, I raised an eyebrow at compound_head but it's covered in the helper.

>  extern wait_queue_head_t *page_waitqueue(struct page *page);
> @@ -494,8 +504,7 @@ static inline void wake_up_page(struct page *page, int bit)
>   */
>  static inline void wait_on_page_locked(struct page *page)
>  {
> -	if (PageLocked(page))
> -		wait_on_page_bit(compound_head(page), PG_locked);
> +	wait_on_page_lock(page, TASK_UNINTERRUPTIBLE);
>  }
>  
>  /* 
> diff --git a/include/linux/wait.h b/include/linux/wait.h
> index c3ff74d..524cd54 100644
> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -198,7 +198,7 @@ __remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old)
>  
>  typedef int wait_bit_action_f(struct wait_bit_key *, int mode);
>  void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
> -void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
> +int __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
>  void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
>  void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
>  void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index 5a81ab4..18b8398 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -81,6 +81,7 @@
>  
>  #define __def_pageflag_names						\
>  	{1UL << PG_locked,		"locked"	},		\
> +	{1UL << PG_contended,		"contended"	},		\
>  	{1UL << PG_error,		"error"		},		\
>  	{1UL << PG_referenced,		"referenced"	},		\
>  	{1UL << PG_uptodate,		"uptodate"	},		\
> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> index f15d6b6..46dcc42 100644
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -62,18 +62,23 @@ EXPORT_SYMBOL(remove_wait_queue);
>   * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
>   * zero in this (rare) case, and we handle it by continuing to scan the queue.
>   */
> -static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
> +static int __wake_up_common(wait_queue_head_t *q, unsigned int mode,
>  			int nr_exclusive, int wake_flags, void *key)
>  {
>  	wait_queue_t *curr, *next;
> +	int woken = 0;
>  
>  	list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
>  		unsigned flags = curr->flags;
>  
> -		if (curr->func(curr, mode, wake_flags, key) &&
> -				(flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
> -			break;
> +		if (curr->func(curr, mode, wake_flags, key)) {
> +			woken++;
> +			if ((flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
> +				break;
> +		}
>  	}
> +
> +	return woken;
>  }
>  

ok.

>  /**
> @@ -106,9 +111,9 @@ void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr)
>  }
>  EXPORT_SYMBOL_GPL(__wake_up_locked);
>  
> -void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key)
> +int __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key)
>  {
> -	__wake_up_common(q, mode, 1, 0, key);
> +	return __wake_up_common(q, mode, 1, 0, key);
>  }
>  EXPORT_SYMBOL_GPL(__wake_up_locked_key);
>  
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 8a287df..d3e3203 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -847,15 +847,18 @@ EXPORT_SYMBOL_GPL(add_page_wait_queue);
>   * The mb is necessary to enforce ordering between the clear_bit and the read
>   * of the waitqueue (to avoid SMP races with a parallel wait_on_page_locked()).
>   */
> -void unlock_page(struct page *page)
> +void __unlock_page(struct page *page)
>  {
> -	page = compound_head(page);
> -	VM_BUG_ON_PAGE(!PageLocked(page), page);
> -	clear_bit_unlock(PG_locked, &page->flags);
> -	smp_mb__after_atomic();
> -	wake_up_page(page, PG_locked);
> +	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(&page->flags, PG_locked);
> +	wait_queue_head_t *wq = page_waitqueue(page);
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&wq->lock, flags);
> +	if (!__wake_up_locked_key(wq, TASK_NORMAL, &key))
> +		ClearPageContended(page);
> +	spin_unlock_irqrestore(&wq->lock, flags);
>  }
> -EXPORT_SYMBOL(unlock_page);
> +EXPORT_SYMBOL(__unlock_page);
>  

The function name is questionable. It used to unlock_page but now it's
handling the wakeup of waiters.

That aside, the wq->lock in itself does not protect the PageContended
bit but I couldn't see a case where we lost a wakeup in the following
sequence either.

unlock_page
					lock_page
					prepare_wait
lock_wq
wake
ClearPageContended
unlock_wq
					SetPageContended

Otherwise the page_waitqueue deferrals to the slowpath look ok.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-28  7:05         ` Peter Zijlstra
@ 2016-09-28 11:05           ` Paul E. McKenney
  2016-09-28 11:16             ` Peter Zijlstra
  2016-09-29  1:31           ` Nicholas Piggin
  1 sibling, 1 reply; 42+ messages in thread
From: Paul E. McKenney @ 2016-09-28 11:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nicholas Piggin, Mel Gorman, Linus Torvalds, Andrew Morton,
	Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel,
	linux-mm, Will Deacon, Alan Stern

On Wed, Sep 28, 2016 at 09:05:46AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 28, 2016 at 03:06:21AM +1000, Nicholas Piggin wrote:
> > On Tue, 27 Sep 2016 18:52:21 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > On Wed, Sep 28, 2016 at 12:53:18AM +1000, Nicholas Piggin wrote:
> > > > The more interesting is the ability to avoid the barrier between fastpath
> > > > clearing a bit and testing for waiters.
> > > > 
> > > > unlock():                        lock() (slowpath):
> > > > clear_bit(PG_locked)             set_bit(PG_waiter)
> > > > test_bit(PG_waiter)              test_bit(PG_locked)

The point being that at least one of the test_bit() calls must return
true?

> > > > If this was memory ops to different words, it would require smp_mb each
> > > > side.. Being the same word, can we avoid them?   
> > > 
> > > Ah, that is the reason I put that smp_mb__after_atomic() there. You have
> > > a cute point on them being to the same word though. Need to think about
> > > that.
> > 
> > This is all assuming the store accesses are ordered, which you should get
> > if the stores to the different bits operate on the same address and size.
> > That might not be the case for some architectures, but they might not
> > require barriers for other reasons. That would call for an smp_mb variant
> > that is used for bitops on different bits but same aligned long. 

As far as I know, all architectures fully order aligned same-size
machine-sized accesses to the same location even without barriers.
In the example above, the PG_locked and PG_waiter are different bits in
the same location, correct?  (Looks that way, but the above also looks
a bit abbreviated.)

Not sure that Docuemntation/memory-barriers.txt is as clear as it
should be on this one, though.

> Since the {set,clear}_bit operations are atomic, they must be ordered
> against one another. The subsequent test_bit is a load, which, since its
> to the same variable, and a CPU must appear to preserve Program-Order,
> must come after the RmW.

But from Documentation/atomic_ops.h:

------------------------------------------------------------------------

	void set_bit(unsigned long nr, volatile unsigned long *addr);
	void clear_bit(unsigned long nr, volatile unsigned long *addr);
	void change_bit(unsigned long nr, volatile unsigned long *addr);

These routines set, clear, and change, respectively, the bit number
indicated by "nr" on the bit mask pointed to by "ADDR".

They must execute atomically, yet there are no implicit memory barrier
semantics required of these interfaces.

------------------------------------------------------------------------

So unless they operate on the same location or are accompanied by
something like the smp_mb__after_atomic() called out above, there
is no ordering.

> So I think you're right and that we can forgo the memory barriers here.
> I even think this must be true on all architectures.
> 
> Paul and Alan have a validation tool someplace, put them on Cc.

It does not yet fully handle atomics yet (but maybe Alan is ahead of
me here, in which case he won't be shy).  However, the point about
strong ordering of same-sized aligned accesses to a machine-sized
location can be made without atomics:

------------------------------------------------------------------------

C C-piggin-SB+samevar.litmus

{
	x=0;
}

P0(int *x)
{
	WRITE_ONCE(*x, 1);
	r1 = READ_ONCE(*x);
}

P1(int *x)
{
	WRITE_ONCE(*x, 2);
	r1 = READ_ONCE(*x);
}

exists
(0:r1=0 /\ 1:r1=0)

------------------------------------------------------------------------

This gets us the following, as expected:

Test C-piggin-SB+samevar Allowed
States 3
0:r1=1; 1:r1=1;
0:r1=1; 1:r1=2;
0:r1=2; 1:r1=2;
No
Witnesses
Positive: 0 Negative: 4
Condition exists (0:r1=0 /\ 1:r1=0)
Observation C-piggin-SB+samevar Never 0 4
Hash=31466d003c10c07836583f5879824502

							Thanx, Paul

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-28 10:45     ` Mel Gorman
@ 2016-09-28 11:11       ` Peter Zijlstra
  2016-09-28 16:10         ` Linus Torvalds
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2016-09-28 11:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linus Torvalds, Andrew Morton, Kirill A. Shutemov,
	Johannes Weiner, Jan Kara, Rik van Riel, linux-mm

On Wed, Sep 28, 2016 at 11:45:00AM +0100, Mel Gorman wrote:

> tldr: Other than 32-bit vs 64-bit, I could not find anything obviously wrong.

Thanks, meanwhile I redid the patch based on the feedback from Nick and
Linus.

I know I ignored the 32bit issue, I figured we'd first see if it helps
at all before mucking about with that.

In any case, I don't mind doing a PG_waiters and also using it for the
PG_writeback thing. Just wasn't what I was thinking of when doing that
patch.

Re the naming of __unlock_page(), its the slow path continuation of
unlock_page(). But I'm not too bothered if people want to change the
name.

So the below boots and builds a kernel and must be equally perfect, then
again x86 has no-op smp_mb__{before,after}_atomic() so the lack of
barriers will not affect anything much. PowerPC, ARM and MIPS (among
others) will want testing.

---
 include/linux/page-flags.h     |  2 ++
 include/linux/pagemap.h        | 25 +++++++++----
 include/trace/events/mmflags.h |  1 +
 mm/filemap.c                   | 80 +++++++++++++++++++++++++++++++++++++-----
 4 files changed, 93 insertions(+), 15 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 74e4dda..0ed3900 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -73,6 +73,7 @@
  */
 enum pageflags {
 	PG_locked,		/* Page is locked. Don't touch. */
+	PG_contended,		/* Page lock is contended. */
 	PG_error,
 	PG_referenced,
 	PG_uptodate,
@@ -253,6 +254,7 @@ static inline int TestClearPage##uname(struct page *page) { return 0; }
 	TESTSETFLAG_FALSE(uname) TESTCLEARFLAG_FALSE(uname)
 
 __PAGEFLAG(Locked, locked, PF_NO_TAIL)
+PAGEFLAG(Contended, contended, PF_NO_TAIL)
 PAGEFLAG(Error, error, PF_NO_COMPOUND) TESTCLEARFLAG(Error, error, PF_NO_COMPOUND)
 PAGEFLAG(Referenced, referenced, PF_HEAD)
 	TESTCLEARFLAG(Referenced, referenced, PF_HEAD)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 66a1260..c8b8651 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -417,7 +417,7 @@ extern void __lock_page(struct page *page);
 extern int __lock_page_killable(struct page *page);
 extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 				unsigned int flags);
-extern void unlock_page(struct page *page);
+extern void __unlock_page(struct page *page);
 
 static inline int trylock_page(struct page *page)
 {
@@ -448,6 +448,20 @@ static inline int lock_page_killable(struct page *page)
 	return 0;
 }
 
+static inline void unlock_page(struct page *page)
+{
+	page = compound_head(page);
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	clear_bit_unlock(PG_locked, &page->flags);
+	/*
+	 * Since PG_locked and PG_contended are in the same word, Program-Order
+	 * ensures the load of PG_contended must not observe a value earlier
+	 * than our clear_bit() store. See lock_page_wait().
+	 */
+	if (PageContended(page))
+		__unlock_page(page);
+}
+
 /*
  * lock_page_or_retry - Lock the page, unless this would block and the
  * caller indicated that it can handle a retry.
@@ -472,11 +486,11 @@ extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
 extern int wait_on_page_bit_killable_timeout(struct page *page,
 					     int bit_nr, unsigned long timeout);
 
+extern int wait_on_page_lock(struct page *page, int mode);
+
 static inline int wait_on_page_locked_killable(struct page *page)
 {
-	if (!PageLocked(page))
-		return 0;
-	return wait_on_page_bit_killable(compound_head(page), PG_locked);
+	return wait_on_page_lock(page, TASK_KILLABLE);
 }
 
 extern wait_queue_head_t *page_waitqueue(struct page *page);
@@ -494,8 +508,7 @@ static inline void wake_up_page(struct page *page, int bit)
  */
 static inline void wait_on_page_locked(struct page *page)
 {
-	if (PageLocked(page))
-		wait_on_page_bit(compound_head(page), PG_locked);
+	wait_on_page_lock(page, TASK_UNINTERRUPTIBLE);
 }
 
 /* 
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 5a81ab4..18b8398 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -81,6 +81,7 @@
 
 #define __def_pageflag_names						\
 	{1UL << PG_locked,		"locked"	},		\
+	{1UL << PG_contended,		"contended"	},		\
 	{1UL << PG_error,		"error"		},		\
 	{1UL << PG_referenced,		"referenced"	},		\
 	{1UL << PG_uptodate,		"uptodate"	},		\
diff --git a/mm/filemap.c b/mm/filemap.c
index 8a287df..79aab60 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -847,15 +847,17 @@ EXPORT_SYMBOL_GPL(add_page_wait_queue);
  * The mb is necessary to enforce ordering between the clear_bit and the read
  * of the waitqueue (to avoid SMP races with a parallel wait_on_page_locked()).
  */
-void unlock_page(struct page *page)
+void __unlock_page(struct page *page)
 {
-	page = compound_head(page);
-	VM_BUG_ON_PAGE(!PageLocked(page), page);
-	clear_bit_unlock(PG_locked, &page->flags);
-	smp_mb__after_atomic();
-	wake_up_page(page, PG_locked);
+	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(&page->flags, PG_locked);
+	wait_queue_head_t *wq = page_waitqueue(page);
+
+	if (waitqueue_active(wq))
+		__wake_up(wq, TASK_NORMAL, 1, &key);
+	else
+		ClearPageContended(page);
 }
-EXPORT_SYMBOL(unlock_page);
+EXPORT_SYMBOL(__unlock_page);
 
 /**
  * end_page_writeback - end writeback against a page
@@ -908,6 +910,54 @@ void page_endio(struct page *page, bool is_write, int err)
 }
 EXPORT_SYMBOL_GPL(page_endio);
 
+static int lock_page_wait(struct wait_bit_key *word, int mode)
+{
+	struct page *page = container_of(word->flags, struct page, flags);
+
+	/*
+	 * We cannot go sleep without having PG_contended set. This would mean
+	 * nobody would issue a wakeup and we'd be stuck.
+	 */
+	if (!PageContended(page)) {
+		SetPageContended(page);
+
+		/*
+		 * There are two orderings of importance:
+		 *
+		 * 1)
+		 *
+		 *  [unlock]			[wait]
+		 *
+		 *  clear PG_locked		set PG_contended
+		 *  test  PG_contended		test (and-set) PG_locked
+		 *
+		 * Since these are on the same word, and the clear/set
+		 * operation are atomic, they are ordered against one another.
+		 * Program-Order further constraints a CPU from speculating the
+		 * later load to not be earlier than the RmW. So this doesn't
+		 * need an explicit barrier. Also see unlock_page().
+		 *
+		 * 2)
+		 *
+		 *  [unlock]			[wait]
+		 *
+		 *  test PG_contended		set PG_contended
+		 *  wakeup/remove		add
+		 *  clear PG_contended		trylock
+		 *				sleep
+		 *
+		 * In this case however, we must ensure PG_contended is set
+		 * before we add ourselves to the list, such that unlock() must
+		 * either see PG_contended or we see its clear.
+		 */
+		smp_mb__after_atomic();
+
+		return 0;
+	}
+
+	return bit_wait_io(word, mode);
+}
+
 /**
  * __lock_page - get a lock on the page, assuming we need to sleep to get it
  * @page: the page to lock
@@ -917,7 +967,7 @@ void __lock_page(struct page *page)
 	struct page *page_head = compound_head(page);
 	DEFINE_WAIT_BIT(wait, &page_head->flags, PG_locked);
 
-	__wait_on_bit_lock(page_waitqueue(page_head), &wait, bit_wait_io,
+	__wait_on_bit_lock(page_waitqueue(page_head), &wait, lock_page_wait,
 							TASK_UNINTERRUPTIBLE);
 }
 EXPORT_SYMBOL(__lock_page);
@@ -928,10 +978,22 @@ int __lock_page_killable(struct page *page)
 	DEFINE_WAIT_BIT(wait, &page_head->flags, PG_locked);
 
 	return __wait_on_bit_lock(page_waitqueue(page_head), &wait,
-					bit_wait_io, TASK_KILLABLE);
+					lock_page_wait, TASK_KILLABLE);
 }
 EXPORT_SYMBOL_GPL(__lock_page_killable);
 
+int wait_on_page_lock(struct page *page, int mode)
+{
+	struct page __always_unused *__page = (page = compound_head(page));
+	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
+
+	if (!PageLocked(page))
+		return 0;
+
+	return __wait_on_bit(page_waitqueue(page), &wait, lock_page_wait, mode);
+}
+EXPORT_SYMBOL(wait_on_page_lock);
+
 /*
  * Return values:
  * 1 - page is locked; mmap_sem is still held.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-28 11:05           ` Paul E. McKenney
@ 2016-09-28 11:16             ` Peter Zijlstra
  2016-09-28 12:58               ` Paul E. McKenney
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2016-09-28 11:16 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Nicholas Piggin, Mel Gorman, Linus Torvalds, Andrew Morton,
	Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel,
	linux-mm, Will Deacon, Alan Stern

On Wed, Sep 28, 2016 at 04:05:30AM -0700, Paul E. McKenney wrote:
> On Wed, Sep 28, 2016 at 09:05:46AM +0200, Peter Zijlstra wrote:
> > On Wed, Sep 28, 2016 at 03:06:21AM +1000, Nicholas Piggin wrote:
> > > On Tue, 27 Sep 2016 18:52:21 +0200
> > > Peter Zijlstra <peterz@infradead.org> wrote:
> > > 
> > > > On Wed, Sep 28, 2016 at 12:53:18AM +1000, Nicholas Piggin wrote:
> > > > > The more interesting is the ability to avoid the barrier between fastpath
> > > > > clearing a bit and testing for waiters.
> > > > > 
> > > > > unlock():                        lock() (slowpath):
> > > > > clear_bit(PG_locked)             set_bit(PG_waiter)
> > > > > test_bit(PG_waiter)              test_bit(PG_locked)
> 
> The point being that at least one of the test_bit() calls must return
> true?

Yes, more or less. Either unlock() observes PG_waiters set, or lock()
observes PG_locked unset. (opposed to all our 'normal' examples the
initial state isn't all 0 and the stores aren't all 1 :-).

> As far as I know, all architectures fully order aligned same-size
> machine-sized accesses to the same location even without barriers.
> In the example above, the PG_locked and PG_waiter are different bits in
> the same location, correct?  (Looks that way, but the above also looks
> a bit abbreviated.)

Correct, PG_* all live in the same word.

> So unless they operate on the same location or are accompanied by
> something like the smp_mb__after_atomic() called out above, there
> is no ordering.

Same word..

> > So I think you're right and that we can forgo the memory barriers here.
> > I even think this must be true on all architectures.
> > 
> > Paul and Alan have a validation tool someplace, put them on Cc.
> 
> It does not yet fully handle atomics yet (but maybe Alan is ahead of
> me here, in which case he won't be shy).  However, the point about
> strong ordering of same-sized aligned accesses to a machine-sized
> location can be made without atomics:

Great. That's what I remember from reading that stuff.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-28 11:16             ` Peter Zijlstra
@ 2016-09-28 12:58               ` Paul E. McKenney
  0 siblings, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2016-09-28 12:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nicholas Piggin, Mel Gorman, Linus Torvalds, Andrew Morton,
	Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel,
	linux-mm, Will Deacon, Alan Stern

On Wed, Sep 28, 2016 at 01:16:45PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 28, 2016 at 04:05:30AM -0700, Paul E. McKenney wrote:
> > On Wed, Sep 28, 2016 at 09:05:46AM +0200, Peter Zijlstra wrote:
> > > On Wed, Sep 28, 2016 at 03:06:21AM +1000, Nicholas Piggin wrote:
> > > > On Tue, 27 Sep 2016 18:52:21 +0200
> > > > Peter Zijlstra <peterz@infradead.org> wrote:
> > > > 
> > > > > On Wed, Sep 28, 2016 at 12:53:18AM +1000, Nicholas Piggin wrote:
> > > > > > The more interesting is the ability to avoid the barrier between fastpath
> > > > > > clearing a bit and testing for waiters.
> > > > > > 
> > > > > > unlock():                        lock() (slowpath):
> > > > > > clear_bit(PG_locked)             set_bit(PG_waiter)
> > > > > > test_bit(PG_waiter)              test_bit(PG_locked)
> > 
> > The point being that at least one of the test_bit() calls must return
> > true?
> 
> Yes, more or less. Either unlock() observes PG_waiters set, or lock()
> observes PG_locked unset. (opposed to all our 'normal' examples the
> initial state isn't all 0 and the stores aren't all 1 :-).

You lost me on unlock() doing any observation, but yes, I transliterated
to standard form, unintentionally, as it turns out.  ;-)

So the goal is that either test_bit(PG_waiter) sees the set_bit()
or test_bit(PG_locked) sees the clear_bit(), correct?

> > As far as I know, all architectures fully order aligned same-size
> > machine-sized accesses to the same location even without barriers.
> > In the example above, the PG_locked and PG_waiter are different bits in
> > the same location, correct?  (Looks that way, but the above also looks
> > a bit abbreviated.)
> 
> Correct, PG_* all live in the same word.

That should make things somewhat more reliable.  ;-)

> > So unless they operate on the same location or are accompanied by
> > something like the smp_mb__after_atomic() called out above, there
> > is no ordering.
> 
> Same word..
> 
> > > So I think you're right and that we can forgo the memory barriers here.
> > > I even think this must be true on all architectures.
> > > 
> > > Paul and Alan have a validation tool someplace, put them on Cc.
> > 
> > It does not yet fully handle atomics yet (but maybe Alan is ahead of
> > me here, in which case he won't be shy).  However, the point about
> > strong ordering of same-sized aligned accesses to a machine-sized
> > location can be made without atomics:
> 
> Great. That's what I remember from reading that stuff.

;-)

							Thanx, Paul

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-28 11:11       ` Peter Zijlstra
@ 2016-09-28 16:10         ` Linus Torvalds
  2016-09-29 13:08           ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Linus Torvalds @ 2016-09-28 16:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Andrew Morton, Kirill A. Shutemov, Johannes Weiner,
	Jan Kara, Rik van Riel, linux-mm

On Wed, Sep 28, 2016 at 4:11 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> -void unlock_page(struct page *page)
> +void __unlock_page(struct page *page)
>  {
> +       struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(&page->flags, PG_locked);
> +       wait_queue_head_t *wq = page_waitqueue(page);
> +
> +       if (waitqueue_active(wq))
> +               __wake_up(wq, TASK_NORMAL, 1, &key);
> +       else
> +               ClearPageContended(page);
>  }
> +EXPORT_SYMBOL(__unlock_page);

I think the above needs to be protected. Something like

    spin_lock_irqsave(&q->lock, flags);
    if (waitqueue_active(wq))
          __wake_up_locked(wq, TASK_NORMAL, 1, &key);
    else
          ClearPageContended(page);
    spin_unlock_irqrestore(&q->lock, flags);

because otherwise a new waiter could come in and add itself to the
wait-queue, and then set the bit, and now we clear it (because we
didn't see the new waiter).

The *waiter* doesn't need any extra locking, because doing

    add_wait_queue(..);
    SetPageContended(page);

is not racy (the add_wait_queue() will now already guarantee that
nobody else clears the bit).

Hmm?

                  Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-28  7:05         ` Peter Zijlstra
  2016-09-28 11:05           ` Paul E. McKenney
@ 2016-09-29  1:31           ` Nicholas Piggin
  2016-09-29  2:12             ` Paul E. McKenney
  2016-09-29  6:21             ` Peter Zijlstra
  1 sibling, 2 replies; 42+ messages in thread
From: Nicholas Piggin @ 2016-09-29  1:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov,
	Johannes Weiner, Jan Kara, Rik van Riel, linux-mm, Will Deacon,
	Paul McKenney, Alan Stern

On Wed, 28 Sep 2016 09:05:46 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Sep 28, 2016 at 03:06:21AM +1000, Nicholas Piggin wrote:
> > On Tue, 27 Sep 2016 18:52:21 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> >   
> > > On Wed, Sep 28, 2016 at 12:53:18AM +1000, Nicholas Piggin wrote:  
> > > > The more interesting is the ability to avoid the barrier between fastpath
> > > > clearing a bit and testing for waiters.
> > > > 
> > > > unlock():                        lock() (slowpath):
> > > > clear_bit(PG_locked)             set_bit(PG_waiter)
> > > > test_bit(PG_waiter)              test_bit(PG_locked)
> > > > 
> > > > If this was memory ops to different words, it would require smp_mb each
> > > > side.. Being the same word, can we avoid them?     
> > > 
> > > Ah, that is the reason I put that smp_mb__after_atomic() there. You have
> > > a cute point on them being to the same word though. Need to think about
> > > that.  
> > 
> > This is all assuming the store accesses are ordered, which you should get
> > if the stores to the different bits operate on the same address and size.
> > That might not be the case for some architectures, but they might not
> > require barriers for other reasons. That would call for an smp_mb variant
> > that is used for bitops on different bits but same aligned long.   
> 
> Since the {set,clear}_bit operations are atomic, they must be ordered
> against one another. The subsequent test_bit is a load, which, since its
> to the same variable, and a CPU must appear to preserve Program-Order,
> must come after the RmW.
> 
> So I think you're right and that we can forgo the memory barriers here.
> I even think this must be true on all architectures.

In generic code, I don't think so. We'd need an
smp_mb__between_bitops_to_the_same_aligned_long, wouldn't we?

x86 implements set_bit as 'orb (addr),bit_nr', and compiler could
implement test_bit as a byte load as well. If those bits are in
different bytes, then they could be reordered, no?

ia64 does 32-bit ops. If you make PG_waiter 64-bit only and put it
in the different side of the long, then this could be a problem too.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-29  1:31           ` Nicholas Piggin
@ 2016-09-29  2:12             ` Paul E. McKenney
  2016-09-29  6:21             ` Peter Zijlstra
  1 sibling, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2016-09-29  2:12 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Peter Zijlstra, Mel Gorman, Linus Torvalds, Andrew Morton,
	Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel,
	linux-mm, Will Deacon, Alan Stern

On Thu, Sep 29, 2016 at 11:31:32AM +1000, Nicholas Piggin wrote:
> On Wed, 28 Sep 2016 09:05:46 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Wed, Sep 28, 2016 at 03:06:21AM +1000, Nicholas Piggin wrote:
> > > On Tue, 27 Sep 2016 18:52:21 +0200
> > > Peter Zijlstra <peterz@infradead.org> wrote:
> > >   
> > > > On Wed, Sep 28, 2016 at 12:53:18AM +1000, Nicholas Piggin wrote:  
> > > > > The more interesting is the ability to avoid the barrier between fastpath
> > > > > clearing a bit and testing for waiters.
> > > > > 
> > > > > unlock():                        lock() (slowpath):
> > > > > clear_bit(PG_locked)             set_bit(PG_waiter)
> > > > > test_bit(PG_waiter)              test_bit(PG_locked)
> > > > > 
> > > > > If this was memory ops to different words, it would require smp_mb each
> > > > > side.. Being the same word, can we avoid them?     
> > > > 
> > > > Ah, that is the reason I put that smp_mb__after_atomic() there. You have
> > > > a cute point on them being to the same word though. Need to think about
> > > > that.  
> > > 
> > > This is all assuming the store accesses are ordered, which you should get
> > > if the stores to the different bits operate on the same address and size.
> > > That might not be the case for some architectures, but they might not
> > > require barriers for other reasons. That would call for an smp_mb variant
> > > that is used for bitops on different bits but same aligned long.   
> > 
> > Since the {set,clear}_bit operations are atomic, they must be ordered
> > against one another. The subsequent test_bit is a load, which, since its
> > to the same variable, and a CPU must appear to preserve Program-Order,
> > must come after the RmW.
> > 
> > So I think you're right and that we can forgo the memory barriers here.
> > I even think this must be true on all architectures.
> 
> In generic code, I don't think so. We'd need an
> smp_mb__between_bitops_to_the_same_aligned_long, wouldn't we?
> 
> x86 implements set_bit as 'orb (addr),bit_nr', and compiler could
> implement test_bit as a byte load as well. If those bits are in
> different bytes, then they could be reordered, no?
> 
> ia64 does 32-bit ops. If you make PG_waiter 64-bit only and put it
> in the different side of the long, then this could be a problem too.

Fair point, that would defeat the same-location ordering...

							Thanx, Paul

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-29  1:31           ` Nicholas Piggin
  2016-09-29  2:12             ` Paul E. McKenney
@ 2016-09-29  6:21             ` Peter Zijlstra
  2016-09-29  6:42               ` Nicholas Piggin
  1 sibling, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2016-09-29  6:21 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov,
	Johannes Weiner, Jan Kara, Rik van Riel, linux-mm, Will Deacon,
	Paul McKenney, Alan Stern

On Thu, Sep 29, 2016 at 11:31:32AM +1000, Nicholas Piggin wrote:
> > Since the {set,clear}_bit operations are atomic, they must be ordered
> > against one another. The subsequent test_bit is a load, which, since its
> > to the same variable, and a CPU must appear to preserve Program-Order,
> > must come after the RmW.
> > 
> > So I think you're right and that we can forgo the memory barriers here.
> > I even think this must be true on all architectures.
> 
> In generic code, I don't think so. We'd need an
> smp_mb__between_bitops_to_the_same_aligned_long, wouldn't we?
> 
> x86 implements set_bit as 'orb (addr),bit_nr', and compiler could
> implement test_bit as a byte load as well. If those bits are in
> different bytes, then they could be reordered, no?
> 
> ia64 does 32-bit ops. If you make PG_waiter 64-bit only and put it
> in the different side of the long, then this could be a problem too.

Not on ia64, its atomics are full barriers too, just like x86 (even
though its docs imply otherwise). But I get the point.

I would however rather audit and attempt to fix affected archs before
introducing such a barrier if at all possible.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-29  6:21             ` Peter Zijlstra
@ 2016-09-29  6:42               ` Nicholas Piggin
  2016-09-29  7:14                 ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Nicholas Piggin @ 2016-09-29  6:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov,
	Johannes Weiner, Jan Kara, Rik van Riel, linux-mm, Will Deacon,
	Paul McKenney, Alan Stern

On Thu, 29 Sep 2016 08:21:32 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Sep 29, 2016 at 11:31:32AM +1000, Nicholas Piggin wrote:
> > > Since the {set,clear}_bit operations are atomic, they must be ordered
> > > against one another. The subsequent test_bit is a load, which, since its
> > > to the same variable, and a CPU must appear to preserve Program-Order,
> > > must come after the RmW.
> > > 
> > > So I think you're right and that we can forgo the memory barriers here.
> > > I even think this must be true on all architectures.  
> > 
> > In generic code, I don't think so. We'd need an
> > smp_mb__between_bitops_to_the_same_aligned_long, wouldn't we?
> > 
> > x86 implements set_bit as 'orb (addr),bit_nr', and compiler could
> > implement test_bit as a byte load as well. If those bits are in
> > different bytes, then they could be reordered, no?
> > 
> > ia64 does 32-bit ops. If you make PG_waiter 64-bit only and put it
> > in the different side of the long, then this could be a problem too.  
> 
> Not on ia64, its atomics are full barriers too, just like x86 (even
> though its docs imply otherwise). But I get the point.

Oh yes of course I knew x86 atomics were barriers :)

Take Alpha instead. It's using 32-bit ops.

> I would however rather audit and attempt to fix affected archs before
> introducing such a barrier if at all possible.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-29  6:42               ` Nicholas Piggin
@ 2016-09-29  7:14                 ` Peter Zijlstra
  2016-09-29  7:43                   ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2016-09-29  7:14 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov,
	Johannes Weiner, Jan Kara, Rik van Riel, linux-mm, Will Deacon,
	Paul McKenney, Alan Stern

On Thu, Sep 29, 2016 at 04:42:31PM +1000, Nicholas Piggin wrote:
> Take Alpha instead. It's using 32-bit ops.

Hmm, my Alpha docs are on the other machine, but I suppose the problem
is 64bit immediates (which would be a common problem I suppose, those
don't really work well on x86 either).

Yes, that does make it all more tricky than desired.

OTOH maybe this is a good excuse to (finally) sanitize the bitmap API to
use a fixed width. It using 'unsigned long' has been something of a pain
at times.

With that it would be 'obvious' bits need to be part of the same 32bit
word and we can use the normal smp_mb__{before,after}_atomic() again.

Given our bitops really are only about single bits, I can't see the
change making a real performance difference.

OTOH, the non atomic things like weight and ff[sz] do benefit from using
the longer words. Bother..

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-29  7:14                 ` Peter Zijlstra
@ 2016-09-29  7:43                   ` Peter Zijlstra
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Zijlstra @ 2016-09-29  7:43 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov,
	Johannes Weiner, Jan Kara, Rik van Riel, linux-mm, Will Deacon,
	Paul McKenney, Alan Stern

On Thu, Sep 29, 2016 at 09:14:51AM +0200, Peter Zijlstra wrote:
> On Thu, Sep 29, 2016 at 04:42:31PM +1000, Nicholas Piggin wrote:
> > Take Alpha instead. It's using 32-bit ops.
> 
> Hmm, my Alpha docs are on the other machine, but I suppose the problem
> is 64bit immediates (which would be a common problem I suppose, those
> don't really work well on x86 either).

OK, so from the architecture that have 64bit support, I think Alpha is
the only one that uses 32bit ops and cares.

alpha is weak and uses 32bit ops (fail)
arm64 is weak but uses 64bit ops
ia64 has full barriers and 64bit ops
mips is weak but uses 64bit ops
parisc is horrid but uses 64bit ops
powerpc is weak but uses 64bit ops
s390 has full barriers and uses 64bit ops
sparc has full barriers and uses 64bit ops (if I read the asm right)
tile is weak but uses 64bit ops
x86 has full barriers and uses byte ops

So we could just fix Alpha..

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-27  8:54   ` Mel Gorman
  2016-09-27  9:11     ` Kirill A. Shutemov
@ 2016-09-29  8:01     ` Peter Zijlstra
  2016-09-29 12:55       ` Nicholas Piggin
  1 sibling, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2016-09-29  8:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linus Torvalds, Andrew Morton, Kirill A. Shutemov,
	Johannes Weiner, Jan Kara, Rik van Riel, linux-mm

On Tue, Sep 27, 2016 at 09:54:12AM +0100, Mel Gorman wrote:
> On Tue, Sep 27, 2016 at 09:30:55AM +0200, Peter Zijlstra wrote:
> Simple is relative unless I drastically overcomplicated things and it
> wouldn't be the first time. 64-bit only side-steps the page flag issue
> as long as we can live with that.

So one problem with the 64bit only pageflags is that they do eat space
from page-flags-layout, we do try and fit a bunch of other crap in
there, and at some point that all will not fit anymore and we'll revert
to worse.

I've no idea how far away from that we are for distro kernels. I suppose
they have fairly large NR_NODES and NR_CPUS.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-29  8:01     ` Peter Zijlstra
@ 2016-09-29 12:55       ` Nicholas Piggin
  2016-09-29 13:16         ` Peter Zijlstra
  2016-09-29 15:05         ` Rik van Riel
  0 siblings, 2 replies; 42+ messages in thread
From: Nicholas Piggin @ 2016-09-29 12:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov,
	Johannes Weiner, Jan Kara, Rik van Riel, linux-mm

On Thu, 29 Sep 2016 10:01:30 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Sep 27, 2016 at 09:54:12AM +0100, Mel Gorman wrote:
> > On Tue, Sep 27, 2016 at 09:30:55AM +0200, Peter Zijlstra wrote:
> > Simple is relative unless I drastically overcomplicated things and it
> > wouldn't be the first time. 64-bit only side-steps the page flag issue
> > as long as we can live with that.  
> 
> So one problem with the 64bit only pageflags is that they do eat space
> from page-flags-layout, we do try and fit a bunch of other crap in
> there, and at some point that all will not fit anymore and we'll revert
> to worse.
> 
> I've no idea how far away from that we are for distro kernels. I suppose
> they have fairly large NR_NODES and NR_CPUS.

I know it's not fashionable to care about them anymore, but it's sad if
32-bit architectures miss out fundamental optimisations like this because
we're out of page flags. It would also be sad to increase the size of
struct page because we're too lazy to reduce flags. There's some that
might be able to be removed.

PG_reserved - We should have killed this years ago. More users have crept
back in.

PG_mappedtodisk - Rarely used, to slightly shortcut some mapping lookups.
Possible for filesystems to derive this some other way, e.g.,
PG_private == 1 && ->private == NULL, or another filesystem private bit.
We should really kill this before more users spring up and it gets stuck
forever. 

PG_swapcache - can this be replaced with ane of the private bits, I wonder?

PG_uncached - this is PG_arch_2 really, but unfortunately x86 uses both.
Still, that and PG_arch_1 could be removed from most architectures, so
many of the 32-bit ones could use the extra flags.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-28 16:10         ` Linus Torvalds
@ 2016-09-29 13:08           ` Peter Zijlstra
  2016-10-03 10:47             ` Mel Gorman
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2016-09-29 13:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, Andrew Morton, Kirill A. Shutemov, Johannes Weiner,
	Jan Kara, Rik van Riel, linux-mm

On Wed, Sep 28, 2016 at 09:10:29AM -0700, Linus Torvalds wrote:

> I think the above needs to be protected. Something like
> 
>     spin_lock_irqsave(&q->lock, flags);
>     if (waitqueue_active(wq))
>           __wake_up_locked(wq, TASK_NORMAL, 1, &key);
>     else
>           ClearPageContended(page);
>     spin_unlock_irqrestore(&q->lock, flags);
> 
> because otherwise a new waiter could come in and add itself to the
> wait-queue, and then set the bit, and now we clear it (because we
> didn't see the new waiter).
> 
> The *waiter* doesn't need any extra locking, because doing
> 
>     add_wait_queue(..);
>     SetPageContended(page);
> 
> is not racy (the add_wait_queue() will now already guarantee that
> nobody else clears the bit).
> 
> Hmm?

Yes. I got my brain in a complete twist, but you're right, that is
indeed required.

Here's a new version with hopefully clearer comments.

Same caveat about 32bit, naming etc..

---
 include/linux/page-flags.h     |  2 +
 include/linux/pagemap.h        | 25 ++++++++---
 include/trace/events/mmflags.h |  1 +
 mm/filemap.c                   | 94 ++++++++++++++++++++++++++++++++++++++----
 4 files changed, 107 insertions(+), 15 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 74e4dda..0ed3900 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -73,6 +73,7 @@
  */
 enum pageflags {
 	PG_locked,		/* Page is locked. Don't touch. */
+	PG_contended,		/* Page lock is contended. */
 	PG_error,
 	PG_referenced,
 	PG_uptodate,
@@ -253,6 +254,7 @@ static inline int TestClearPage##uname(struct page *page) { return 0; }
 	TESTSETFLAG_FALSE(uname) TESTCLEARFLAG_FALSE(uname)
 
 __PAGEFLAG(Locked, locked, PF_NO_TAIL)
+PAGEFLAG(Contended, contended, PF_NO_TAIL)
 PAGEFLAG(Error, error, PF_NO_COMPOUND) TESTCLEARFLAG(Error, error, PF_NO_COMPOUND)
 PAGEFLAG(Referenced, referenced, PF_HEAD)
 	TESTCLEARFLAG(Referenced, referenced, PF_HEAD)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 66a1260..c8b8651 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -417,7 +417,7 @@ extern void __lock_page(struct page *page);
 extern int __lock_page_killable(struct page *page);
 extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 				unsigned int flags);
-extern void unlock_page(struct page *page);
+extern void __unlock_page(struct page *page);
 
 static inline int trylock_page(struct page *page)
 {
@@ -448,6 +448,20 @@ static inline int lock_page_killable(struct page *page)
 	return 0;
 }
 
+static inline void unlock_page(struct page *page)
+{
+	page = compound_head(page);
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	clear_bit_unlock(PG_locked, &page->flags);
+	/*
+	 * Since PG_locked and PG_contended are in the same word, Program-Order
+	 * ensures the load of PG_contended must not observe a value earlier
+	 * than our clear_bit() store.
+	 */
+	if (PageContended(page))
+		__unlock_page(page);
+}
+
 /*
  * lock_page_or_retry - Lock the page, unless this would block and the
  * caller indicated that it can handle a retry.
@@ -472,11 +486,11 @@ extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
 extern int wait_on_page_bit_killable_timeout(struct page *page,
 					     int bit_nr, unsigned long timeout);
 
+extern int wait_on_page_lock(struct page *page, int mode);
+
 static inline int wait_on_page_locked_killable(struct page *page)
 {
-	if (!PageLocked(page))
-		return 0;
-	return wait_on_page_bit_killable(compound_head(page), PG_locked);
+	return wait_on_page_lock(page, TASK_KILLABLE);
 }
 
 extern wait_queue_head_t *page_waitqueue(struct page *page);
@@ -494,8 +508,7 @@ static inline void wake_up_page(struct page *page, int bit)
  */
 static inline void wait_on_page_locked(struct page *page)
 {
-	if (PageLocked(page))
-		wait_on_page_bit(compound_head(page), PG_locked);
+	wait_on_page_lock(page, TASK_UNINTERRUPTIBLE);
 }
 
 /* 
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 5a81ab4..18b8398 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -81,6 +81,7 @@
 
 #define __def_pageflag_names						\
 	{1UL << PG_locked,		"locked"	},		\
+	{1UL << PG_contended,		"contended"	},		\
 	{1UL << PG_error,		"error"		},		\
 	{1UL << PG_referenced,		"referenced"	},		\
 	{1UL << PG_uptodate,		"uptodate"	},		\
diff --git a/mm/filemap.c b/mm/filemap.c
index 8a287df..734082a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -847,15 +847,30 @@ EXPORT_SYMBOL_GPL(add_page_wait_queue);
  * The mb is necessary to enforce ordering between the clear_bit and the read
  * of the waitqueue (to avoid SMP races with a parallel wait_on_page_locked()).
  */
-void unlock_page(struct page *page)
+void __unlock_page(struct page *page)
 {
-	page = compound_head(page);
-	VM_BUG_ON_PAGE(!PageLocked(page), page);
-	clear_bit_unlock(PG_locked, &page->flags);
-	smp_mb__after_atomic();
-	wake_up_page(page, PG_locked);
+	wait_queue_head_t *wq = page_waitqueue(page);
+	unsigned long flags;
+
+	spin_lock_irqsave(&wq->lock, flags);
+	if (waitqueue_active(wq)) {
+		struct wait_bit_key key =
+			__WAIT_BIT_KEY_INITIALIZER(&page->flags, PG_locked);
+
+		__wake_up_locked_key(wq, TASK_NORMAL, &key);
+	} else {
+		/*
+		 * We need to do ClearPageContended() under wq->lock such that
+		 * we serialize against prepare_to_wait() adding waiters and
+		 * setting task_struct::state.
+		 *
+		 * See lock_page_wait().
+		 */
+		ClearPageContended(page);
+	}
+	spin_unlock_irqrestore(&wq->lock, flags);
 }
-EXPORT_SYMBOL(unlock_page);
+EXPORT_SYMBOL(__unlock_page);
 
 /**
  * end_page_writeback - end writeback against a page
@@ -908,6 +923,55 @@ void page_endio(struct page *page, bool is_write, int err)
 }
 EXPORT_SYMBOL_GPL(page_endio);
 
+static int lock_page_wait(struct wait_bit_key *word, int mode)
+{
+	struct page *page = container_of(word->flags, struct page, flags);
+
+	/*
+	 * We cannot go sleep without having PG_contended set. This would mean
+	 * nobody would issue a wakeup and we'd be stuck.
+	 */
+	if (!PageContended(page)) {
+
+		/*
+		 * There are two orderings of importance:
+		 *
+		 * 1)
+		 *
+		 *  [unlock]			[wait]
+		 *
+		 *  clear PG_locked		set PG_contended
+		 *  test  PG_contended		test (and-set) PG_locked
+		 *
+		 * Since these are on the same word, and the clear/set
+		 * operation are atomic, they are ordered against one another.
+		 * Program-Order further constraints a CPU from speculating the
+		 * later load to not be earlier than the RmW. So this doesn't
+		 * need an explicit barrier. Also see unlock_page().
+		 *
+		 * 2)
+		 *
+		 *  [unlock]			[wait]
+		 *
+		 *  LOCK wq->lock		LOCK wq->lock
+		 *    __wake_up_locked ||	  list-add
+		 *    clear PG_contended	  set_current_state()
+		 *  UNLOCK wq->lock		UNLOCK wq->lock
+		 *				set PG_contended
+		 *
+		 * Since we're added to the waitqueue, we cannot get
+		 * PG_contended cleared without also getting TASK_RUNNING set,
+		 * which will then void the schedule() call and we'll loop.
+		 * Here wq->lock is sufficient ordering. See __unlock_page().
+		 */
+		SetPageContended(page);
+
+		return 0;
+	}
+
+	return bit_wait_io(word, mode);
+}
+
 /**
  * __lock_page - get a lock on the page, assuming we need to sleep to get it
  * @page: the page to lock
@@ -917,7 +981,7 @@ void __lock_page(struct page *page)
 	struct page *page_head = compound_head(page);
 	DEFINE_WAIT_BIT(wait, &page_head->flags, PG_locked);
 
-	__wait_on_bit_lock(page_waitqueue(page_head), &wait, bit_wait_io,
+	__wait_on_bit_lock(page_waitqueue(page_head), &wait, lock_page_wait,
 							TASK_UNINTERRUPTIBLE);
 }
 EXPORT_SYMBOL(__lock_page);
@@ -928,10 +992,22 @@ int __lock_page_killable(struct page *page)
 	DEFINE_WAIT_BIT(wait, &page_head->flags, PG_locked);
 
 	return __wait_on_bit_lock(page_waitqueue(page_head), &wait,
-					bit_wait_io, TASK_KILLABLE);
+					lock_page_wait, TASK_KILLABLE);
 }
 EXPORT_SYMBOL_GPL(__lock_page_killable);
 
+int wait_on_page_lock(struct page *page, int mode)
+{
+	struct page __always_unused *__page = (page = compound_head(page));
+	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
+
+	if (!PageLocked(page))
+		return 0;
+
+	return __wait_on_bit(page_waitqueue(page), &wait, lock_page_wait, mode);
+}
+EXPORT_SYMBOL(wait_on_page_lock);
+
 /*
  * Return values:
  * 1 - page is locked; mmap_sem is still held.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-29 12:55       ` Nicholas Piggin
@ 2016-09-29 13:16         ` Peter Zijlstra
  2016-09-29 13:54           ` Nicholas Piggin
  2016-09-29 15:05         ` Rik van Riel
  1 sibling, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2016-09-29 13:16 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov,
	Johannes Weiner, Jan Kara, Rik van Riel, linux-mm

On Thu, Sep 29, 2016 at 10:55:44PM +1000, Nicholas Piggin wrote:
> On Thu, 29 Sep 2016 10:01:30 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Tue, Sep 27, 2016 at 09:54:12AM +0100, Mel Gorman wrote:
> > > On Tue, Sep 27, 2016 at 09:30:55AM +0200, Peter Zijlstra wrote:
> > > Simple is relative unless I drastically overcomplicated things and it
> > > wouldn't be the first time. 64-bit only side-steps the page flag issue
> > > as long as we can live with that.  
> > 
> > So one problem with the 64bit only pageflags is that they do eat space
> > from page-flags-layout, we do try and fit a bunch of other crap in
> > there, and at some point that all will not fit anymore and we'll revert
> > to worse.
> > 
> > I've no idea how far away from that we are for distro kernels. I suppose
> > they have fairly large NR_NODES and NR_CPUS.
> 
> I know it's not fashionable to care about them anymore, but it's sad if
> 32-bit architectures miss out fundamental optimisations like this because
> we're out of page flags. It would also be sad to increase the size of
> struct page because we're too lazy to reduce flags. There's some that
> might be able to be removed.

I'm all for cleaning some of that up, but its been a long while since I
poked in that general area.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-29 13:16         ` Peter Zijlstra
@ 2016-09-29 13:54           ` Nicholas Piggin
  0 siblings, 0 replies; 42+ messages in thread
From: Nicholas Piggin @ 2016-09-29 13:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov,
	Johannes Weiner, Jan Kara, Rik van Riel, linux-mm

On Thu, 29 Sep 2016 15:16:35 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Sep 29, 2016 at 10:55:44PM +1000, Nicholas Piggin wrote:
> > On Thu, 29 Sep 2016 10:01:30 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> >   
> > > On Tue, Sep 27, 2016 at 09:54:12AM +0100, Mel Gorman wrote:  
> > > > On Tue, Sep 27, 2016 at 09:30:55AM +0200, Peter Zijlstra wrote:
> > > > Simple is relative unless I drastically overcomplicated things and it
> > > > wouldn't be the first time. 64-bit only side-steps the page flag issue
> > > > as long as we can live with that.    
> > > 
> > > So one problem with the 64bit only pageflags is that they do eat space
> > > from page-flags-layout, we do try and fit a bunch of other crap in
> > > there, and at some point that all will not fit anymore and we'll revert
> > > to worse.
> > > 
> > > I've no idea how far away from that we are for distro kernels. I suppose
> > > they have fairly large NR_NODES and NR_CPUS.  
> > 
> > I know it's not fashionable to care about them anymore, but it's sad if
> > 32-bit architectures miss out fundamental optimisations like this because
> > we're out of page flags. It would also be sad to increase the size of
> > struct page because we're too lazy to reduce flags. There's some that
> > might be able to be removed.  
> 
> I'm all for cleaning some of that up, but its been a long while since I
> poked in that general area.

Forgive my rant! Cleaning page flags is a topic of its own...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-29 12:55       ` Nicholas Piggin
  2016-09-29 13:16         ` Peter Zijlstra
@ 2016-09-29 15:05         ` Rik van Riel
  1 sibling, 0 replies; 42+ messages in thread
From: Rik van Riel @ 2016-09-29 15:05 UTC (permalink / raw)
  To: Nicholas Piggin, Peter Zijlstra
  Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov,
	Johannes Weiner, Jan Kara, linux-mm

[-- Attachment #1: Type: text/plain, Size: 698 bytes --]

On Thu, 2016-09-29 at 22:55 +1000, Nicholas Piggin wrote:

> PG_swapcache - can this be replaced with ane of the private bits, I
> wonder?

Perhaps. page->mapping needs to be free to point
at the anon_vma, but from the mapping pointer we
can see that the page is swap backed.

Is there any use for page->private for swap
backed pages that is not the page cache index?

If so, (PageAnon(page) && page->private)
might work as a replacement for PG_swapcache.

That might catch some false positives with
the special swap types used for migration, but
maybe we do not need to care about those (much),
or we can filter them out with a more in-depth
check?

-- 
All rights reversed

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: page_waitqueue() considered harmful
  2016-09-29 13:08           ` Peter Zijlstra
@ 2016-10-03 10:47             ` Mel Gorman
  0 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2016-10-03 10:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Kirill A. Shutemov,
	Johannes Weiner, Jan Kara, Nicholas Piggin, Rik van Riel,
	linux-mm

On Thu, Sep 29, 2016 at 03:08:27PM +0200, Peter Zijlstra wrote:
> > is not racy (the add_wait_queue() will now already guarantee that
> > nobody else clears the bit).
> > 
> > Hmm?
> 
> Yes. I got my brain in a complete twist, but you're right, that is
> indeed required.
> 
> Here's a new version with hopefully clearer comments.
> 
> Same caveat about 32bit, naming etc..
> 

I was able to run this with basic workloads over the weekend on small
UMA machines. Both machines behaved similarly so I'm only reporting one
from a single socket Skylake machine. NUMA machines rarely show anything
much more interesting for these type of workloads but as always, the full
impact is machine and workload dependant. Generally, I expect this type
of patch to have marginal but detectable impact.

This is a workload doing parallel dd of files large enough to trigger
reclaim which locks/unlocks pages

paralleldd
                              4.8.0-rc8             4.8.0-rc8
                                vanilla        waitqueue-v1r2
Amean    Elapsd-1      215.05 (  0.00%)      214.53 (  0.24%)
Amean    Elapsd-3      214.72 (  0.00%)      214.42 (  0.14%)
Amean    Elapsd-5      215.29 (  0.00%)      214.88 (  0.19%)
Amean    Elapsd-7      215.75 (  0.00%)      214.79 (  0.44%)
Amean    Elapsd-8      214.96 (  0.00%)      215.21 ( -0.12%)

That's basically within the noise. CPU usage overall looks like

           4.8.0-rc8   4.8.0-rc8
             vanillawaitqueue-v1r2
User         3409.66     3421.72
System      18298.66    18251.99
Elapsed      7178.82     7181.14

Marginal decrease in system CPU usage. Profiles showed the vanilla
kernel spending less than 0.1% on unlock_page but it's eliminated by the
patch.

This is some microbenchmarks from the vm-scalability benchmark. It's
similar to dd in that it triggers reclaim from a single thread

vmscale
                                                           4.8.0-rc8                          4.8.0-rc8
                                                             vanilla                     waitqueue-v1r2
Ops lru-file-mmap-read-elapsed                       19.50 (  0.00%)                    19.43 (  0.36%)
Ops lru-file-readonce-elapsed                        12.44 (  0.00%)                    12.29 (  1.21%)
Ops lru-file-readtwice-elapsed                       22.27 (  0.00%)                    22.19 (  0.36%)
Ops lru-memcg-elapsed                                12.18 (  0.00%)                    12.00 (  1.48%)

           4.8.0-rc8   4.8.0-rc8
             vanillawaitqueue-v1r2
User           50.54       50.88
System        398.72      388.81
Elapsed        69.48       68.99

Again, differences are marginal but detectable. I accidentally did not
collect profile data but I have no reason to believe it's significantly
different to dd.

This is "gitsource" from mmtests but it's a checkout of the git source
tree and a run of make test which is where Linus first noticed the
problem. The metric here is time-based, I don't actually check the
results of the regression suite.

gitsource
                             4.8.0-rc8             4.8.0-rc8
                               vanilla        waitqueue-v1r2
User    min           192.28 (  0.00%)      192.49 ( -0.11%)
User    mean          193.55 (  0.00%)      194.88 ( -0.69%)
User    stddev          1.52 (  0.00%)        2.39 (-57.58%)
User    coeffvar        0.79 (  0.00%)        1.23 (-56.51%)
User    max           196.34 (  0.00%)      199.06 ( -1.39%)
System  min           122.70 (  0.00%)      118.69 (  3.27%)
System  mean          123.87 (  0.00%)      120.68 (  2.57%)
System  stddev          0.84 (  0.00%)        1.65 (-97.67%)
System  coeffvar        0.67 (  0.00%)        1.37 (-102.89%)
System  max           124.95 (  0.00%)      123.14 (  1.45%)
Elapsed min           718.09 (  0.00%)      711.48 (  0.92%)
Elapsed mean          724.23 (  0.00%)      716.52 (  1.07%)
Elapsed stddev          4.20 (  0.00%)        4.84 (-15.42%)
Elapsed coeffvar        0.58 (  0.00%)        0.68 (-16.66%)
Elapsed max           730.51 (  0.00%)      724.45 (  0.83%)

           4.8.0-rc8   4.8.0-rc8
             vanillawaitqueue-v1r2
User         2730.60     2808.48
System       2184.85     2108.68
Elapsed      9938.01     9929.56

Overall, it's showing a drop in system CPU usage as expected. The detailed
results show a drop of 2.57% in system CPU usage running the benchmark
itself and 3.48% overall which is measuring everything and not just "make
test". The drop in elapsed time is marginal but measurable.

It may raise an eyebrow that the overall elapsed time doesn't match the
detailed results. The detailed results report 5 iterations of "make test"
without profiling enabled which takes takes about an hour. The way I
configured it, the profiled run happened immediately after it and it's much
slower as well as having to compile git itself which takes a few minutes.

This is the top lock/unlock activity in the vanilla kernel

     0.80%  git              [kernel.vmlinux]              [k] unlock_page
     0.28%  sh               [kernel.vmlinux]              [k] unlock_page
     0.20%  git-rebase       [kernel.vmlinux]              [k] unlock_page
     0.13%  git              [kernel.vmlinux]              [k] lock_page_memcg
     0.10%  git              [kernel.vmlinux]              [k] unlock_page_memcg
     0.07%  git-submodule    [kernel.vmlinux]              [k] unlock_page
     0.04%  sh               [kernel.vmlinux]              [k] lock_page_memcg
     0.03%  git-rebase       [kernel.vmlinux]              [k] lock_page_memcg
     0.03%  sh               [kernel.vmlinux]              [k] unlock_page_memcg
     0.03%  sed              [kernel.vmlinux]              [k] unlock_page
     0.03%  perf             [kernel.vmlinux]              [k] unlock_page
     0.02%  git-rebase       [kernel.vmlinux]              [k] unlock_page_memcg
     0.02%  rm               [kernel.vmlinux]              [k] unlock_page
     0.02%  git-stash        [kernel.vmlinux]              [k] unlock_page
     0.02%  git-bisect       [kernel.vmlinux]              [k] unlock_page
     0.02%  diff             [kernel.vmlinux]              [k] unlock_page
     0.02%  cat              [kernel.vmlinux]              [k] unlock_page
     0.02%  wc               [kernel.vmlinux]              [k] unlock_page
     0.01%  mv               [kernel.vmlinux]              [k] unlock_page
     0.01%  git-submodule    [kernel.vmlinux]              [k] lock_page_memcg

This is with the patch applied

     0.49%  git              [kernel.vmlinux]             [k] unlock_page
     0.14%  sh               [kernel.vmlinux]             [k] unlock_page
     0.13%  git              [kernel.vmlinux]             [k] lock_page_memcg
     0.11%  git-rebase       [kernel.vmlinux]             [k] unlock_page
     0.10%  git              [kernel.vmlinux]             [k] unlock_page_memcg
     0.04%  sh               [kernel.vmlinux]             [k] lock_page_memcg
     0.04%  git-submodule    [kernel.vmlinux]             [k] unlock_page
     0.03%  sh               [kernel.vmlinux]             [k] unlock_page_memcg
     0.03%  git-rebase       [kernel.vmlinux]             [k] lock_page_memcg
     0.02%  git-rebase       [kernel.vmlinux]             [k] unlock_page_memcg
     0.02%  sed              [kernel.vmlinux]             [k] unlock_page
     0.01%  rm               [kernel.vmlinux]             [k] unlock_page
     0.01%  git-stash        [kernel.vmlinux]             [k] unlock_page
     0.01%  git-submodule    [kernel.vmlinux]             [k] lock_page_memcg
     0.01%  git-bisect       [kernel.vmlinux]             [k] unlock_page
     0.01%  diff             [kernel.vmlinux]             [k] unlock_page
     0.01%  cat              [kernel.vmlinux]             [k] unlock_page
     0.01%  wc               [kernel.vmlinux]             [k] unlock_page
     0.01%  git-submodule    [kernel.vmlinux]             [k] unlock_page_memcg
     0.01%  mv               [kernel.vmlinux]             [k] unlock_page

The drop in time spent by git in unlock_page is noticable. I did not
drill down into the annotated profile but this roughly matches what I
measured before when avoiding page_waitqueue lookups.

The full profile is not exactly great but I didn't see anything in there
I haven't seen before. Top entries with the patch applied looks like
this

     7.44%  swapper          [kernel.vmlinux]             [k] intel_idle
     1.25%  git              [kernel.vmlinux]             [k] filemap_map_pages
     1.08%  git              [kernel.vmlinux]             [k] native_irq_return_iret
     0.79%  git              [kernel.vmlinux]             [k] unmap_page_range
     0.56%  git              [kernel.vmlinux]             [k] release_pages
     0.51%  git              [kernel.vmlinux]             [k] handle_mm_fault
     0.49%  git              [kernel.vmlinux]             [k] unlock_page
     0.46%  git              [kernel.vmlinux]             [k] page_remove_rmap
     0.46%  git              [kernel.vmlinux]             [k] _raw_spin_lock
     0.42%  git              [kernel.vmlinux]             [k] clear_page_c_e

Lot of map/unmap activity like you'd expect and release_pages being a pig
as usual.

Overall, this patch shows similar behaviour to my own patch from 2014.
There is a definite benefit but it's marginal. The big difference is
that this patch is a lot similar than the 2014 version and may meet less
resistance as a result.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2016-10-03 10:47 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-26 20:58 page_waitqueue() considered harmful Linus Torvalds
2016-09-26 21:23 ` Rik van Riel
2016-09-26 21:30   ` Linus Torvalds
2016-09-26 23:11   ` Kirill A. Shutemov
2016-09-27  1:01     ` Rik van Riel
2016-09-27  7:30 ` Peter Zijlstra
2016-09-27  8:54   ` Mel Gorman
2016-09-27  9:11     ` Kirill A. Shutemov
2016-09-27  9:42       ` Mel Gorman
2016-09-27  9:52       ` Minchan Kim
2016-09-27 12:11         ` Kirill A. Shutemov
2016-09-29  8:01     ` Peter Zijlstra
2016-09-29 12:55       ` Nicholas Piggin
2016-09-29 13:16         ` Peter Zijlstra
2016-09-29 13:54           ` Nicholas Piggin
2016-09-29 15:05         ` Rik van Riel
2016-09-27  8:03 ` Jan Kara
2016-09-27  8:31 ` Mel Gorman
2016-09-27 14:34   ` Peter Zijlstra
2016-09-27 15:08     ` Nicholas Piggin
2016-09-27 16:31     ` Linus Torvalds
2016-09-27 16:49       ` Peter Zijlstra
2016-09-28 10:45     ` Mel Gorman
2016-09-28 11:11       ` Peter Zijlstra
2016-09-28 16:10         ` Linus Torvalds
2016-09-29 13:08           ` Peter Zijlstra
2016-10-03 10:47             ` Mel Gorman
2016-09-27 14:53   ` Nicholas Piggin
2016-09-27 15:17     ` Nicholas Piggin
2016-09-27 16:52     ` Peter Zijlstra
2016-09-27 17:06       ` Nicholas Piggin
2016-09-28  7:05         ` Peter Zijlstra
2016-09-28 11:05           ` Paul E. McKenney
2016-09-28 11:16             ` Peter Zijlstra
2016-09-28 12:58               ` Paul E. McKenney
2016-09-29  1:31           ` Nicholas Piggin
2016-09-29  2:12             ` Paul E. McKenney
2016-09-29  6:21             ` Peter Zijlstra
2016-09-29  6:42               ` Nicholas Piggin
2016-09-29  7:14                 ` Peter Zijlstra
2016-09-29  7:43                   ` Peter Zijlstra
2016-09-28  7:40     ` Mel Gorman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.