From: Linus Torvalds <torvalds@linux-foundation.org>
To: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@elte.hu>, Andi Kleen <ak@linux.intel.com>,
	Kan Liang <kan.liang@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>, Jan Kara <jack@suse.cz>,
	Christopher Lameter <cl@linux.com>,
	"Eric W . Biederman" <ebiederm@xmission.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	linux-mm <linux-mm@kvack.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 2/2 v2] sched/wait: Introduce lock breaker in wake_up_page_bit
Date: Fri, 25 Aug 2017 12:58:05 -0700	[thread overview]
Message-ID: <CA+55aFyErsNw8bqTOCzcrarDZBdj+Ev=1N3sV-gxtLTH03bBFQ@mail.gmail.com> (raw)
In-Reply-To: <cd8ce7fbca9c126f7f928b8fa48d7a9197955b45.1503677178.git.tim.c.chen@linux.intel.com>

On Fri, Aug 25, 2017 at 9:13 AM, Tim Chen <tim.c.chen@linux.intel.com> wrote:
> Now that we have added breaks in the wait queue scan and allow bookmark
> on scan position, we put this logic in the wake_up_page_bit function.

Oh, _this_ is the other patch you were talking about. I thought it was
the NUMA counter threshold that was discussed around the same time,
and that's why I referred to Mel.

Gods, _this_ patch is ugly.  No, I'm not happy with it at all. It
makes that wait_queue_head much bigger, for this disgusting one use.

So no, this is no good.

Now, maybe the page_wait_table[] itself could be changed to be
something that is *not* just the wait-queue head.

But if we need to change the page_wait_table[] itself to have more
information, then we should make it not be a wait-queue at all, we
should make it be a list of much more specialized entries, indexed by
the {page,bit} tuple.

And once we do that, we probably *could* use something like two
specialized lists: one that is wake-all, and one that is wake-one.

So you'd have something like

    struct page_wait_struct {
        struct hlist_node list;         /* link in the hash bucket's chain */
        struct page *page;
        int bit;
        struct llist_head all;          /* non-exclusive waiters: wake them all */
        struct llist_head exclusive;    /* exclusive waiters: wake one at a time */
    };

and then the "page_wait_table[]" would be just an array of

    struct page_wait_head {
        spinlock_t lock;
        struct hlist_head list;
    };
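
For concreteness, a rough sketch of what that array and its lookup could
look like; the size and the hash_ptr() lookup just mirror the existing
hashed page_wait_table in mm/filemap.c, and the names are only
illustrative:

    #define PAGE_WAIT_TABLE_BITS 8
    #define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)

    /* one spinlock-protected bucket of page_wait_struct entries per slot */
    static struct page_wait_head page_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;

    static struct page_wait_head *page_waitqueue(struct page *page)
    {
        return &page_wait_table[hash_ptr(page, PAGE_WAIT_TABLE_BITS)];
    }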

and then the rule is:

 - each page/bit combination allocates one of these page_wait_struct
entries when somebody starts waiting for it for the first time (and we
use the spinlock in the page_wait_head to serialize that list).

 - an exclusive wait (i.e. somebody who waits to get the bit for
locking) adds itself to the 'exclusive' llist

 - non-locking waiters add themselves to the 'all' list

 - we can use llist_del_all() to remove the 'all' list and then walk
it and wake them up without any locks

 - we can use llist_del_first() to remove the first exclusive waiter
and wake _it_ up without any locks (see the sketch below).
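
Just to make the waiter and wake-up rules above concrete, a minimal
sketch (struct page_waiter and the helper names are assumptions for
illustration, not a worked-out interface; the bucket lookup and the
first-time allocation under the bucket spinlock are left out):

    /* per-waiter entry, assumed to live on the waiting task's stack */
    struct page_waiter {
        struct llist_node node;
        struct task_struct *task;
    };

    /* waiter side: queue ourselves on the right llist, lock-free */
    static void page_wait_add(struct page_wait_struct *w,
                              struct page_waiter *pw, bool exclusive)
    {
        pw->task = current;
        llist_add(&pw->node, exclusive ? &w->exclusive : &w->all);
    }

    /* wake-all side: detach the whole 'all' list, then walk it privately */
    static void wake_page_all(struct page_wait_struct *w)
    {
        struct llist_node *list = llist_del_all(&w->all);
        struct page_waiter *pw, *next;

        llist_for_each_entry_safe(pw, next, list, node)
            wake_up_process(pw->task);
    }

    /* wake-one side: pop a single exclusive waiter, if there is one */
    static void wake_page_one(struct page_wait_struct *w)
    {
        struct llist_node *first = llist_del_first(&w->exclusive);

        if (first)
            wake_up_process(llist_entry(first, struct page_waiter, node)->task);
    }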

Hmm? How does that sound? That means that we wouldn't use the normal
wait-queues at all for the page hash waiting. We'd use this two-level
list: one list to find the page/bit thing, and then two lists within
that for the wait-queues for just *that* page/bit.

So no need for the 'key' stuff at all, because the page/bit data would
be in the first data list, and the second list wouldn't have any of
these traversal issues where you have to be careful and do it one
entry at a time.

                 Linus

