From: "Liang, Kan"
To: Linus Torvalds, Tim Chen
CC: Mel Gorman, Peter Zijlstra, Ingo Molnar, Andi Kleen, Andrew Morton,
 Johannes Weiner, Jan Kara, Christopher Lameter, Eric W. Biederman,
 Davidlohr Bueso, linux-mm, Linux Kernel Mailing List
Subject: RE: [PATCH 2/2 v2] sched/wait: Introduce lock breaker in wake_up_page_bit
Date: Mon, 28 Aug 2017 14:51:14 +0000
Message-ID: <37D7C6CF3E00A74B8858931C1DB2F077537A07E9@SHSMSX103.ccr.corp.intel.com>
References: <83f675ad385d67760da4b99cd95ee912ca7c0b44.1503677178.git.tim.c.chen@linux.intel.com>

> On Fri, Aug 25, 2017 at 7:54 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Simplify, simplify, simplify.
>
> I've now tried three different approaches (I stopped sending broken
> patches after deciding the first one was unfixable), and they all suck.
>
> I was hoping for something lockless to avoid the whole issue with
> latency over the long list, but anything that has the wait queue entry
> allocated on the stack needs to synchronize the wakeup due to the
> stack usage, so once you have lists that are thousands of entries
> long, you either hold the lock for a long time (normal wait queues),
> take and release it constantly (the swait approach), or batch things
> up (Tim's wait-queue patches).
>
> Of those three approaches, the batching does seem to be the best of
> the lot. Allocating non-stack wait entries while waiting for pages
> seems like a bad idea: we're probably waiting for IO in normal
> circumstances, possibly because we're low on memory.
>
> So I *am* ok with the batching (your patch #1), but I'm *not* ok with
> the nasty horrible book-keeping that tries to cope with the new
> entries that can show up while the batching has released the lock and
> opened the list up again (your patch #2).
>
> That said, I have now stared *way* too much at the page wait code
> after having unsuccessfully tried to replace that wait-queue with
> other "clever" approaches (all of which ended up being crap).
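
(For context, the batching approach being discussed works roughly as
sketched below. This is a minimal illustration, not Tim's actual patch:
the WQ_FLAG_BOOKMARK flag, the bookmark argument to __wake_up_common(),
and the break count are assumptions drawn from the description above.
The idea: the waker walks a bounded number of entries, parks a dummy
"bookmark" entry at its current position, drops the queue lock so
lockers and interrupts can make progress, then resumes from the
bookmark.)

        /*
         * Sketch of a bookmarked wake-queue walk (kernel-style; names
         * and the __wake_up_common() bookmark parameter are assumed).
         */
        #define WAITQUEUE_WALK_BREAK_CNT 64     /* assumed batch size */

        static void wake_up_page_bit_batched(struct wait_queue_head *q,
                                             void *key)
        {
                wait_queue_entry_t bookmark;
                unsigned long flags;

                /* The bookmark is a placeholder: it is never woken itself. */
                bookmark.flags = 0;
                bookmark.private = NULL;
                bookmark.func = NULL;
                INIT_LIST_HEAD(&bookmark.entry);

                spin_lock_irqsave(&q->lock, flags);
                __wake_up_common(q, TASK_NORMAL, 1, 0, key, &bookmark);

                /*
                 * If the walk processed WAITQUEUE_WALK_BREAK_CNT entries,
                 * it parked the bookmark in the list and set
                 * WQ_FLAG_BOOKMARK.  Drop the lock to let new lockers and
                 * interrupts in, then resume from the bookmark rather
                 * than from the head of the list.
                 */
                while (bookmark.flags & WQ_FLAG_BOOKMARK) {
                        spin_unlock_irqrestore(&q->lock, flags);
                        spin_lock_irqsave(&q->lock, flags);
                        __wake_up_common(q, TASK_NORMAL, 1, 0, key,
                                         &bookmark);
                }
                spin_unlock_irqrestore(&q->lock, flags);
        }
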
> And I'm starting to see a possible solution, or at least improvement.
>
> Let's just assume I take the batching patch. It's not conceptually
> ugly, it's fairly simple, and there are lots of independent arguments
> for it (ie latency, but also possible performance gains from better
> parallelism).
>
> That patch doesn't make any data structures bigger or more
> complicated; it's just that single "is this a bookmark entry" thing.
> So that patch just looks fundamentally fine to me, and the only real
> argument I ever had against it was that I would really really
> _really_ have wanted to root-cause the behavior.
>
> So that leaves my fundamental dislike of your other patch.
>
> And it turns out that all my looking at the page wait code wasn't
> entirely unproductive. Yes, I went through three crap approaches
> before I gave up on rewriting it, but in the meantime I did get way
> too intimate with that pile of crud.
>
> And I think I found the reason why you needed that patch #2 in the
> first place.
>
> Here's what is going on:
>
> - we're going to assume that the problem is all with a single page,
>   not due to hash collisions (as per your earlier reports)
>
> - we also know that the only bit that matters is the PG_locked bit
>
> - that means that the only way to get the "multiple concurrent
>   wakers" situation that your patch #2 tries to handle better is
>   that you have multiple people unlocking the page - since that is
>   what causes the wakeups
>
> - that in turn means that you obviously had multiple threads
>   *locking* it too
>
> - and that means that among those new entries there are lockers
>   coming in in the middle, adding exclusive entries
>
> - an exclusive entry already stops the wakeup list walking
>
> - but we add non-exclusive entries TO THE BEGINNING of the page
>   waiters list
>
> And I really think that last thing is fundamentally wrong.
>
> It's wrong for several reasons:
>
> - it's unfair: threads that want to lock the page get put behind
>   threads that just want to see it unlocked.
>
> - it's stupid: the non-locking waiters will end up waiting again if
>   the page got locked, so waking up a locker *and* a lot of
>   non-locking waiters just sends the non-locking ones back to sleep
>   anyway
>
> - it causes us to walk longer lists: we stop walking when we wake up
>   the exclusive waiter, but because we always put the non-exclusive
>   waiters in there first, we always walk the long list of
>   non-exclusive waiters even when we could stop because we woke up
>   an exclusive one.
>
> Now, the reason we do this seems to be entirely random: for a
> *normal* wait queue, you really do want to always wake up all the
> non-exclusive waiters, because exclusive waiters are only exclusive
> wrt each other.
>
> But for a page lock, an exclusive waiter really is taking the lock,
> and the other waiters are going to wait for the lock to clear anyway,
> so the page wait case is actually almost exactly the reverse
> situation. We *could* put exclusive waiters at the beginning of the
> list instead, and actively try to avoid walking the list at all if
> we have pending lockers.
>
> I'm not doing that, because I think the fair thing to do is to handle
> waiters in the order they came in. Plus the code is actually simpler
> if we just always add to the tail (see the sketch below).
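
(Concretely, in wait_on_page_bit_common() that ordering change would
look roughly like this. A sketch against the 4.13-era code, assuming
the then-current waitqueue helper names; not the verbatim patch.)

        /* Old behavior: non-exclusive waiters jump to the head. */
        spin_lock_irq(&q->lock);
        if (likely(list_empty(&wait->entry))) {
                if (lock)
                        __add_wait_queue_entry_tail_exclusive(q, wait);
                else
                        __add_wait_queue(q, wait);  /* head insertion */
                SetPageWaiters(page);
        }
        spin_unlock_irq(&q->lock);

        /* Sketched change: everybody queues at the tail, in arrival
         * order, with the exclusive flag set up front. */
        wait->flags = lock ? WQ_FLAG_EXCLUSIVE : 0;
        spin_lock_irq(&q->lock);
        if (likely(list_empty(&wait->entry))) {
                __add_wait_queue_entry_tail(q, wait);
                SetPageWaiters(page);
        }
        spin_unlock_irq(&q->lock);
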
> Now, the other thing to look at is how the wakeup function works. It
> checks the aliasing information (page and bit number), but then it
> *also* does:
>
>         if (test_bit(key->bit_nr, &key->page->flags))
>                 return 0;
>
> basically saying "continue walking if somebody else already got the
> bit".
>
> That's *INSANE*. It's exactly the wrong thing to do. It's basically
> saying "even if we had an exclusive waiter, let's not wake it up, but
> do let us continue to walk the list, even though the bit we're
> waiting to see cleared is already set".
>
> What would be sane is to say "stop walking the list now". So just do
> that - by making a negative return from the wake function mean "stop
> walking".
>
> So how about just this fairly trivial patch?

I tried this patch and https://lkml.org/lkml/2017/8/27/222 together,
but they don't fix the issue. I can still get a similar call stack.

Thanks,
Kan

> In fact, I think this may help even *without* Tim's patch #1. So I
> think this would be lovely to test on that problem load on its own,
> to see if it makes the wait queues behave better.
>
> It might not cut down on the total length of the wait-queue, but it
> should hopefully cause it to be walked much less. We now hit the
> exclusive waiters earlier and stop waking things up when we have a
> new locker thread pending. And when the page ends up being locked
> again, we stop walking the list entirely, so there is no unnecessary
> traversal.
>
> This patch is small and at least minimally works (I'm running it
> right now).
>
>                  Linus
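
(For reference, the "negative return means stop walking" change
described above would look roughly like the sketch below. It assumes
the page wait entry embeds a struct wait_page_queue, as the 4.13-era
mm/filemap.c does; this is an illustration of the idea, not the patch
that was actually attached to the mail.)

        /* Page-bit wake function: stop the walk if the bit is taken. */
        static int wake_page_function(wait_queue_entry_t *wait,
                                      unsigned mode, int sync, void *arg)
        {
                struct wait_page_key *key = arg;
                struct wait_page_queue *wait_page
                        = container_of(wait, struct wait_page_queue, wait);

                if (wait_page->page != key->page)
                        return 0;
                key->page_match = 1;

                if (wait_page->bit_nr != key->bit_nr)
                        return 0;

                /*
                 * Instead of continuing the walk when the bit is
                 * already set again, return a negative value to stop
                 * it: whoever holds the page lock now will wake the
                 * remaining waiters when they unlock.
                 */
                if (test_bit(key->bit_nr, &key->page->flags))
                        return -1;

                return autoremove_wake_function(wait, mode, sync, key);
        }

        /* ... and in __wake_up_common(), honor the new convention: */
                ret = curr->func(curr, mode, wake_flags, key);
                if (ret < 0)
                        break;
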