All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mel Gorman <mgorman@techsingularity.net>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Johannes Weiner <hannes@cmpxchg.org>, Jan Kara <jack@suse.cz>,
	Peter Zijlstra <peterz@infradead.org>,
	Rik van Riel <riel@redhat.com>, linux-mm <linux-mm@kvack.org>
Subject: Re: page_waitqueue() considered harmful
Date: Tue, 27 Sep 2016 09:31:04 +0100	[thread overview]
Message-ID: <20160927083104.GC2838@techsingularity.net> (raw)
In-Reply-To: <CA+55aFwVSXZPONk2OEyxcP-aAQU7-aJsF3OFXVi8Z5vA11v_-Q@mail.gmail.com>


Hi Linus,

On Mon, Sep 26, 2016 at 01:58:00PM -0700, Linus Torvalds wrote:
> So I've been doing some profiling of the git "make test" load, which
> is interesting because it's just a lot of small scripts, and it shows
> our fork/execve/exit costs.
> 
> Some of the top offenders are pretty understandable: we have long had
> unmap_page_range() show up at the top on these loads, because the
> zap_pte_range() function ends up touching each page as it unmaps
> things (it does it to check whether it's an anonymous page, but then
> also for the page map count update etc), and that's a disaster from a
> cache standpoint. That single function is something between 3-4% of
> CPU time, and the one instruction that accesses "struct page" the
> first time is a large portion of that. Yes, a single instruction is
> blamed for about 1% of all CPU time on a fork/exec/exit workload.
> 

It was found at one point that the fault-around made these costs worse as
there were simply more pages to tear down. However, this only applied to
fork/exit microbenchmarks.  Matt Fleming prototyped an unreleased patch
that tried to be clever about this but the cost was never worthwhile. A
plain revert helped a microbenchmark but hurt workloads like the git
testsuite which was shell intensive.

It got filed under "we're not fixing a fork/exit microbenchmark at the
expense of "real" workloads like git checkout and git testsuite".

> <SNIP>
>
> #5 and #6 on my profile are user space (_int_malloc in glibc, and
> do_lookup_x in the loader - I think user space should probably start
> thinking more about doing static libraries for the really core basic
> things, but whatever. Not a kernel issue.
> 

Recent problems have been fixed with _int_malloc in glibc, particularly as it
applies to threads but no fix springs to mind that might impact "make test".

> #7 is in the kernel again. And that one struck me as really odd. It's
> "unlock_page()", while #9 is __wake_up_bit(). WTF? There's no IO in
> this load, it's all cached, why do we use 3% of the time (1.7% and
> 1.4% respectively) on unlocking a page. And why can't I see the
> locking part?
> 
> It turns out that I *can* see the locking part, but it's pretty cheap.
> It's inside of filemap_map_pages(), which does a trylock, and it shows
> up as about 1/6th of the cost of that function. Still, it's much less
> than the unlocking side. Why is unlocking so expensive?
> 
> Yeah, unlocking is expensive because of the nasty __wake_up_bit()
> code. In fact, even inside "unlock_page()" itself, most of the costs
> aren't even the atomic bit clearing (like you'd expect), it's the
> inlined part of wake_up_bit(). Which does some really nasty crud.
> 
> Why is the page_waitqueue() handling so expensive? Let me count the ways:
> 

page_waitqueue() has been a hazard for years. I think the last attempt to
fix it was back in 2014 http://www.spinics.net/lists/linux-mm/msg73207.html

The patch is heavily derived from work by Nick Piggin who noticed the years
before that. I think that was the last version I posted and the changelog
includes profile data. I don't have an exact reference why it was rejected
but a consistent piece of feedback was that it was very complex for the
level of impact it had.

>  (a) stupid code generation interacting really badly with microarchitecture
> 
>      We clear the bit in page->flags with a byte-sized "lock andb",
> and then page_waitqueue() looks up the zone by reading the full value
> of "page->flags" immediately afterwards, which causes bad bad behavior
> with the whole read-after-partial-write thing. Nasty.
> 

Other than some code churn, it should be possible to lookup the waitqueue
before clearing the bit if the patch that side-steps the lookup entirely
is still distasteful.

>  (b) It's cache miss heaven. It takes a cache miss on three different
> things:looking up the zone 'wait_table', then looking up the hash
> queue there, and finally (inside __wake_up_bit) looking up the wait
> queue itself (which will effectively always be NULL).
> 
> Now, (a) could be fairly easy to fix several ways (the byte-size
> operation on an "unsigned long" field have caused problems before, and
> may just be a mistake), but (a) wouldn't even be a problem if we
> didn't have the complexity of (b) and having to look up the zone and
> everything. So realistically, it's actually (b) that is the primary
> problem, and indirectly causes (a) to happen too.
> 

The patch in question side-steps the (b) part.

> Is there really any reason for that incredible indirection? Do we
> really want to make the page_waitqueue() be a per-zone thing at all?
> Especially since all those wait-queues won't even be *used* unless
> there is actual IO going on and people are really getting into
> contention on the page lock.. Why isn't the page_waitqueue() just one
> statically sized array?
> 

About all I can think of is NUMA locality. Doing it per-zone at the time
would be convenient. It could be per-node but that's no better from a
complexity point of view. If it was a static array, the key could be related
to the node the page is on but that's still looking into the page flags.

> Also, if those bitlock ops had a different bit that showed contention,
> we could actually skip *all* of this, and just see that "oh, nobody is
> waiting on this page anyway, so there's no point in looking up those
> wait queues".

That's very close to what the 2014 patch does. However, it's 64-bit only
due to the use of page flags. Even though it's out of date, can you take
a look? If it doesn't trigger hate, I can forward port it unless you do
that yourself for the purposes of testing.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  parent reply	other threads:[~2016-09-27  8:31 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-09-26 20:58 page_waitqueue() considered harmful Linus Torvalds
2016-09-26 21:23 ` Rik van Riel
2016-09-26 21:30   ` Linus Torvalds
2016-09-26 23:11   ` Kirill A. Shutemov
2016-09-27  1:01     ` Rik van Riel
2016-09-27  7:30 ` Peter Zijlstra
2016-09-27  8:54   ` Mel Gorman
2016-09-27  9:11     ` Kirill A. Shutemov
2016-09-27  9:42       ` Mel Gorman
2016-09-27  9:52       ` Minchan Kim
2016-09-27 12:11         ` Kirill A. Shutemov
2016-09-29  8:01     ` Peter Zijlstra
2016-09-29 12:55       ` Nicholas Piggin
2016-09-29 13:16         ` Peter Zijlstra
2016-09-29 13:54           ` Nicholas Piggin
2016-09-29 15:05         ` Rik van Riel
2016-09-27  8:03 ` Jan Kara
2016-09-27  8:31 ` Mel Gorman [this message]
2016-09-27 14:34   ` Peter Zijlstra
2016-09-27 15:08     ` Nicholas Piggin
2016-09-27 16:31     ` Linus Torvalds
2016-09-27 16:49       ` Peter Zijlstra
2016-09-28 10:45     ` Mel Gorman
2016-09-28 11:11       ` Peter Zijlstra
2016-09-28 16:10         ` Linus Torvalds
2016-09-29 13:08           ` Peter Zijlstra
2016-10-03 10:47             ` Mel Gorman
2016-09-27 14:53   ` Nicholas Piggin
2016-09-27 15:17     ` Nicholas Piggin
2016-09-27 16:52     ` Peter Zijlstra
2016-09-27 17:06       ` Nicholas Piggin
2016-09-28  7:05         ` Peter Zijlstra
2016-09-28 11:05           ` Paul E. McKenney
2016-09-28 11:16             ` Peter Zijlstra
2016-09-28 12:58               ` Paul E. McKenney
2016-09-29  1:31           ` Nicholas Piggin
2016-09-29  2:12             ` Paul E. McKenney
2016-09-29  6:21             ` Peter Zijlstra
2016-09-29  6:42               ` Nicholas Piggin
2016-09-29  7:14                 ` Peter Zijlstra
2016-09-29  7:43                   ` Peter Zijlstra
2016-09-28  7:40     ` Mel Gorman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160927083104.GC2838@techsingularity.net \
    --to=mgorman@techsingularity.net \
    --cc=akpm@linux-foundation.org \
    --cc=hannes@cmpxchg.org \
    --cc=jack@suse.cz \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-mm@kvack.org \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.