From: "Liang, Kan" <kan.liang@intel.com>
To: Linus Torvalds <torvalds@linux-foundation.org>,
	Tim Chen <tim.c.chen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@elte.hu>, Andi Kleen <ak@linux.intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Johannes Weiner" <hannes@cmpxchg.org>, Jan Kara <jack@suse.cz>,
	Christopher Lameter <cl@linux.com>,
	"Eric W . Biederman" <ebiederm@xmission.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	linux-mm <linux-mm@kvack.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: RE: [PATCH 2/2 v2] sched/wait: Introduce lock breaker in wake_up_page_bit
Date: Mon, 28 Aug 2017 14:51:14 +0000	[thread overview]
Message-ID: <37D7C6CF3E00A74B8858931C1DB2F077537A07E9@SHSMSX103.ccr.corp.intel.com> (raw)
In-Reply-To: <CA+55aFy18WCqZGwkxH6dTZR9LD9M5nXWqEN8DBeZ4LvNo4Y0BQ@mail.gmail.com>



> On Fri, Aug 25, 2017 at 7:54 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> > Simplify, simplify, simplify.
> 
> I've now tried three different approaches (I stopped sending broken patches
> after deciding the first one was unfixable), and they all suck.
> 
> I was hoping for something lockless to avoid the whole issue with latency
> over the long list, but anything that has the wait queue entry allocated on the
> stack needs to synchronize the wakeup due to the stack usage, so once you
> have lists that are thousands of entries, either you hold the lock for long
> times (normal wait queues) or take and release them constantly (the swait
> approach), or you batch things up (Tim's wait-queue patches).
> 
> Of those three approaches, the batching does seem to be the best of the lot.
> Allocating non-stack wait entries while waiting for pages seems like a bad
> idea. We're probably waiting for IO in normal circumstances, possibly
> because we're low on memory.
> 
> So I *am* ok with the batching (your patch #1), but I'm *not* ok with the
> nasty horrible book-keeping to avoid new entries when you batch and
> release the lock and that allows new entries on the list (your patch #2).
> 
> That said, I have now stared *way* too much at the page wait code after
> having unsuccessfully tried to replace that wait-queue with other "clever"
> approaches (all of which ended up being crap).
> 
> And I'm starting to see a possible solution, or at least improvement.
> 
> Let's just assume I take the batching patch. It's not conceptually ugly, it's
> fairly simple, and there are lots of independent arguments for it (ie latency,
> but also possible performance from better parallelism).
> 
> That patch doesn't make any data structures bigger or more complicated, it's
> just that single "is this a bookmark entry" thing.
> So that patch just looks fundamentally fine to me, and the only real
> argument I ever had against it was that I would really really _really_ have
> wanted to root-cause the behavior.
> 
> So that leaves my fundamental dislike of your other patch.
> 
> And it turns out that all my looking at the page wait code wasn't entirely
> unproductive. Yes, I went through three crap approaches before I gave up on
> rewriting it, but in the meantime I did get way too intimate with looking at
> that pile of crud.
> 
> And I think I found the reason why you needed that patch #2 in the first place.
> 
> Here's what is going on:
> 
>  - we're going to assume that the problem is all with a single page, not due to
> hash collisions (as per your earlier reports)
> 
>  - we also know that the only bit that matters is the PG_locked bit
> 
>  - that means that the only way to get that "multiple concurrent waker"
> situation that your patch #2 tries to handle better is because you have
> multiple people unlocking the page - since that is what causes the wakeups
> 
>  - that in turn means that you obviously had multiple threads *locking* it too.
> 
>  - and that means that among those new entries there are lockers coming in
> in the middle and adding an exclusive entry.
> 
>  - the exclusive entry already stops the wakeup list walking
> 
>  - but we add non-exclusive entries TO THE BEGINNING of the page waiters
> list
> 
> And I really think that last thing is fundamentally wrong.
> 
> It's wrong for several reasons:
> 
>  - because it's unfair: threads that want to lock get put behind threads that
> just want to see the unlocked state.
> 
>  - because it's stupid: our non-locking waiters will end up waiting again if the
> page got locked, so waking up a locker *and* a lot of non-locking waiters will
> just cause them to go back to sleep anyway
> 
>  - because it causes us to walk longer lists: we stop walking when we wake up
> the exclusive waiter, but because we always put the non-exclusive waiters in
> there first, we always walk the long list of non-exclusive waiters even if we
> could just stop walking because we woke up an exclusive one.
> 
> Now the reason we do this seems to be entirely random: for a *normal* wait
> queue, you really want to always wake up all the non-exclusive waiters,
> because exclusive waiters are only exclusive wrt each other.
> 
> But for a page lock, an exclusive waiter really is getting the lock, and the
> other waiters are going to wait for the lock to clear anyway, so the page wait
> thing is actually almost exactly the reverse situation. We *could* put
> exclusive waiters at the beginning of the list instead, and actively try to avoid
> walking the list at all if we have pending lockers.
> 
> I'm not doing that, because I think the fair thing to do is to just do things in
> the order they came in. Plus the code is actually simpler if we just always add
> to the tail.
> 
> Now, the other thing to look at is how the wakeup function works. It checks
> the aliasing information (page and bit number), but then it
> *also* does:
> 
>         if (test_bit(key->bit_nr, &key->page->flags))
>                 return 0;
> 
> basically saying "continue walking if somebody else already got the bit".
> 
> That's *INSANE*. It's exactly the wrong thing to do. It's basically saying "even
> if we had an exclusive waiter, let's not wake it up, but do let us continue to
> walk the list even though the bit we're waiting for to clear is set already".
> 
> What would be sane is to say "stop walking the list now". So just do that - by
> making "negative return from wake function" mean "stop walking".
> 
> So how about just this fairly trivial patch?

I tried this patch together with https://lkml.org/lkml/2017/8/27/222,
but they don't fix the issue: I still see a similar call stack.

Thanks,
Kan

> 
> In fact, I think this may help even *without* Tim's patch #1. So I think this
> would be lovely to test on that problem load on its own, and seeing if it
> makes the wait queues behave better.
> 
> It might not cut down on the total length of the wait-queue, but it should
> hopefully cause it to walk it much less. We now hit the exclusive waiters
> earlier and stop waking things up when we have a new locker thread pending.
> And when the page ends up being locked again, we stop walking the list
> entirely, so no unnecessary traversal.
> 
> This patch is small and at least minimally works (I'm running it right now).
> 
>                                Linus

