From: Linus Torvalds <torvalds@linux-foundation.org>
To: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@elte.hu>, Andi Kleen <ak@linux.intel.com>,
	Kan Liang <kan.liang@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>, Jan Kara <jack@suse.cz>,
	Christopher Lameter <cl@linux.com>,
	"Eric W . Biederman" <ebiederm@xmission.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	linux-mm <linux-mm@kvack.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 2/2 v2] sched/wait: Introduce lock breaker in wake_up_page_bit
Date: Fri, 25 Aug 2017 12:58:05 -0700	[thread overview]
Message-ID: <CA+55aFyErsNw8bqTOCzcrarDZBdj+Ev=1N3sV-gxtLTH03bBFQ@mail.gmail.com> (raw)
In-Reply-To: <cd8ce7fbca9c126f7f928b8fa48d7a9197955b45.1503677178.git.tim.c.chen@linux.intel.com>

On Fri, Aug 25, 2017 at 9:13 AM, Tim Chen <tim.c.chen@linux.intel.com> wrote:
> Now that we have added breaks in the wait queue scan and allow bookmark
> on scan position, we put this logic in the wake_up_page_bit function.

Oh, _this_ is the other patch you were talking about. I thought it was
the NUMA counter threshold that was discussed around the same time,
and that's why I referred to Mel.

Gods, _this_ patch is ugly.  No, I'm not happy with it at all. It
makes that wait_queue_head much bigger, for this disgusting one use.

So no, this is no good.

Now, maybe the page_wait_table[] itself could be changed to be
something that is *not* just the wait-queue head.

But if we need to change the page_wait_table[] itself to have more
information, then we should make it not be a wait-queue at all, we
should make it be a list of much more specialized entries, indexed by
the {page,bit} tuple.

And once we do that, we probably *could* use something like two
specialized lists: one that is wake-all, and one that is wake-one.

So you'd have something like

    struct page_wait_struct {
        struct hlist_node list;         /* link in the hash bucket's chain */
        struct page *page;
        int bit;
        struct llist_head all;          /* non-exclusive waiters: wake them all */
        struct llist_head exclusive;    /* exclusive waiters: wake one at a time */
    };

and then the "page_wait_table[]" would be just an array of

    struct page_wait_head {
        spinlock_t lock;
        struct hlist_head list;
    };
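
For concreteness, a rough sketch of what that array and its lookup could
look like; the size and the hash_ptr() lookup just mirror the existing
hashed page_wait_table in mm/filemap.c, and the names are only
illustrative:

    #define PAGE_WAIT_TABLE_BITS 8
    #define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)

    /* one spinlock-protected bucket of page_wait_struct entries per slot */
    static struct page_wait_head page_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;

    static struct page_wait_head *page_waitqueue(struct page *page)
    {
        return &page_wait_table[hash_ptr(page, PAGE_WAIT_TABLE_BITS)];
    }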

and then the rule is:

 - each page/bit combination allocates one of these page_wait_struct
entries when somebody starts waiting for it for the first time (and we
use the spinlock in the page_wait_head to serialize that list).

 - an exclusive wait (i.e. somebody who waits to get the bit for
locking) adds itself to the 'exclusive' llist

 - non-locking waiters add themselves to the 'all' list

 - we can use llist_del_all() to remove the 'all' list and then walk
it and wake them up without any locks

 - we can use llist_del_first() to remove the first exclusive waiter
and wake _it_ up without any locks (see the sketch below).
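
Just to make the waiter and wake-up rules above concrete, a minimal
sketch (struct page_waiter and the helper names are assumptions for
illustration, not a worked-out interface; the bucket lookup and the
first-time allocation under the bucket spinlock are left out):

    /* per-waiter entry, assumed to live on the waiting task's stack */
    struct page_waiter {
        struct llist_node node;
        struct task_struct *task;
    };

    /* waiter side: queue ourselves on the right llist, lock-free */
    static void page_wait_add(struct page_wait_struct *w,
                              struct page_waiter *pw, bool exclusive)
    {
        pw->task = current;
        llist_add(&pw->node, exclusive ? &w->exclusive : &w->all);
    }

    /* wake-all side: detach the whole 'all' list, then walk it privately */
    static void wake_page_all(struct page_wait_struct *w)
    {
        struct llist_node *list = llist_del_all(&w->all);
        struct page_waiter *pw, *next;

        llist_for_each_entry_safe(pw, next, list, node)
            wake_up_process(pw->task);
    }

    /* wake-one side: pop a single exclusive waiter, if there is one */
    static void wake_page_one(struct page_wait_struct *w)
    {
        struct llist_node *first = llist_del_first(&w->exclusive);

        if (first)
            wake_up_process(llist_entry(first, struct page_waiter, node)->task);
    }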

Hmm? How does that sound? That means that we wouldn't use the normal
wait-queues at all for the page hash waiting. We'd use this two-level
list: one list to find the page/bit thing, and then two lists within
that for the wait-queues for just *that* page/bit.

So no need for the 'key' stuff at all, because the page/bit data would
be in the first data list, and the second list wouldn't have any of
these traversal issues where you have to be careful and do it one
entry at a time.

                 Linus

