All of lore.kernel.org
 help / color / mirror / Atom feed
From: Tim Chen <tim.c.chen@linux.intel.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@elte.hu>, Andi Kleen <ak@linux.intel.com>,
	Kan Liang <kan.liang@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>, Jan Kara <jack@suse.cz>,
	Christopher Lameter <cl@linux.com>,
	"Eric W . Biederman" <ebiederm@xmission.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	linux-mm <linux-mm@kvack.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 2/2 v2] sched/wait: Introduce lock breaker in wake_up_page_bit
Date: Fri, 25 Aug 2017 15:19:53 -0700	[thread overview]
Message-ID: <f10f4c25-49c0-7ef5-55c2-769c8fd9bf90@linux.intel.com> (raw)
In-Reply-To: <CA+55aFyErsNw8bqTOCzcrarDZBdj+Ev=1N3sV-gxtLTH03bBFQ@mail.gmail.com>

On 08/25/2017 12:58 PM, Linus Torvalds wrote:
> On Fri, Aug 25, 2017 at 9:13 AM, Tim Chen <tim.c.chen@linux.intel.com> wrote:
>> Now that we have added breaks in the wait queue scan and allow bookmark
>> on scan position, we put this logic in the wake_up_page_bit function.
> 
> Oh, _this_ is the other patch you were talking about. I thought it was
> the NUMA counter threashold that was discussed around the same time,
> and that's why I referred to Mel.
> 
> Gods, _this_ patch is ugly.  No, I'm not happy with it at all. It
> makes that wait_queue_head much bigger, for this disgusting one use.
> 
> So no, this is no good.
> 
> Now, maybe the page_wait_table[] itself could be changed to be
> something that is *not* just the wait-queue head.
> 
> But if we need to change the page_wait_table[] itself to have more
> information, then we should make it not be a wait-queue at all, we
> should make it be a list of much more specialized entries, indexed by
> the {page,bit} tuple.
> 
> And once we do that, we probably *could* use something like two
> specialized lists: one that is wake-all, and one that is wake-one.
> 
> So you'd have something like
> 
>     struct page_wait_struct {
>         struct list_node list;
>         struct page *page;
>         int bit;
>         struct llist_head all;
>         struct llist_head exclusive;
>     };
> 
> and then the "page_wait_table[]" would be just an array of
> 
>     struct page_wait_head {
>         spinlock_t lock;
>         struct hlist_head list;
>     };
> 
> and then the rule is:
> 
>  - each page/bit combination allocates one of these page_wait_struct
> entries when somebody starts waiting for it for the first time (and we
> use the spinlock in the page_wait_head to serialize that list).
> 
>  - an exclusive wait (ie somebody who waits to get the bit for
> locking) adds itself to the 'exclusive' llist
> 
>  - non-locking waiters add themselves to the 'all' list
> 
>  - we can use llist_del_all() to remove the 'all' list and then walk
> it and wake them up without any locks
> 
>  - we can use llist_del_first() to remove the first exclusive waiter
> and wait _it_ up without any locks.
> 
> Hmm? How does that sound? That means that we wouldn't use the normal
> wait-queues at all for the page hash waiting. We'd use this two-level
> list: one list to find the page/bit thing, and then two lists within
> that fdor the wait-queues for just *that* page/bit.
> 
> So no need for the 'key' stuff at all, because the page/bit data would
> be in the first data list, and the second list wouldn't have any of
> these traversal issues where you have to be careful and do it one
> entry at a time.

If I have the waker count and flags on the wait_page_queue structure
will that have been more acceptable?

Also I think patch 1 is still a good idea for a fail safe mechanism
in case there are other long wait list.

That said, I do think your suggested approach is cleaner.  However, it
is a much more substantial change.  Let me take a look and see if I
have any issues implementing it.

Tim



> 
>                  Linus
> 

WARNING: multiple messages have this Message-ID (diff)
From: Tim Chen <tim.c.chen@linux.intel.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@elte.hu>, Andi Kleen <ak@linux.intel.com>,
	Kan Liang <kan.liang@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>, Jan Kara <jack@suse.cz>,
	Christopher Lameter <cl@linux.com>,
	"Eric W . Biederman" <ebiederm@xmission.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	linux-mm <linux-mm@kvack.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 2/2 v2] sched/wait: Introduce lock breaker in wake_up_page_bit
Date: Fri, 25 Aug 2017 15:19:53 -0700	[thread overview]
Message-ID: <f10f4c25-49c0-7ef5-55c2-769c8fd9bf90@linux.intel.com> (raw)
In-Reply-To: <CA+55aFyErsNw8bqTOCzcrarDZBdj+Ev=1N3sV-gxtLTH03bBFQ@mail.gmail.com>

On 08/25/2017 12:58 PM, Linus Torvalds wrote:
> On Fri, Aug 25, 2017 at 9:13 AM, Tim Chen <tim.c.chen@linux.intel.com> wrote:
>> Now that we have added breaks in the wait queue scan and allow bookmark
>> on scan position, we put this logic in the wake_up_page_bit function.
> 
> Oh, _this_ is the other patch you were talking about. I thought it was
> the NUMA counter threashold that was discussed around the same time,
> and that's why I referred to Mel.
> 
> Gods, _this_ patch is ugly.  No, I'm not happy with it at all. It
> makes that wait_queue_head much bigger, for this disgusting one use.
> 
> So no, this is no good.
> 
> Now, maybe the page_wait_table[] itself could be changed to be
> something that is *not* just the wait-queue head.
> 
> But if we need to change the page_wait_table[] itself to have more
> information, then we should make it not be a wait-queue at all, we
> should make it be a list of much more specialized entries, indexed by
> the {page,bit} tuple.
> 
> And once we do that, we probably *could* use something like two
> specialized lists: one that is wake-all, and one that is wake-one.
> 
> So you'd have something like
> 
>     struct page_wait_struct {
>         struct list_node list;
>         struct page *page;
>         int bit;
>         struct llist_head all;
>         struct llist_head exclusive;
>     };
> 
> and then the "page_wait_table[]" would be just an array of
> 
>     struct page_wait_head {
>         spinlock_t lock;
>         struct hlist_head list;
>     };
> 
> and then the rule is:
> 
>  - each page/bit combination allocates one of these page_wait_struct
> entries when somebody starts waiting for it for the first time (and we
> use the spinlock in the page_wait_head to serialize that list).
> 
>  - an exclusive wait (ie somebody who waits to get the bit for
> locking) adds itself to the 'exclusive' llist
> 
>  - non-locking waiters add themselves to the 'all' list
> 
>  - we can use llist_del_all() to remove the 'all' list and then walk
> it and wake them up without any locks
> 
>  - we can use llist_del_first() to remove the first exclusive waiter
> and wait _it_ up without any locks.
> 
> Hmm? How does that sound? That means that we wouldn't use the normal
> wait-queues at all for the page hash waiting. We'd use this two-level
> list: one list to find the page/bit thing, and then two lists within
> that fdor the wait-queues for just *that* page/bit.
> 
> So no need for the 'key' stuff at all, because the page/bit data would
> be in the first data list, and the second list wouldn't have any of
> these traversal issues where you have to be careful and do it one
> entry at a time.

If I have the waker count and flags on the wait_page_queue structure
will that have been more acceptable?

Also I think patch 1 is still a good idea for a fail safe mechanism
in case there are other long wait list.

That said, I do think your suggested approach is cleaner.  However, it
is a much more substantial change.  Let me take a look and see if I
have any issues implementing it.

Tim



> 
>                  Linus
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2017-08-25 22:19 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-08-25 16:13 [PATCH 1/2 v2] sched/wait: Break up long wake list walk Tim Chen
2017-08-25 16:13 ` Tim Chen
2017-08-25 16:13 ` [PATCH 2/2 v2] sched/wait: Introduce lock breaker in wake_up_page_bit Tim Chen
2017-08-25 16:13   ` Tim Chen
2017-08-25 19:58   ` Linus Torvalds
2017-08-25 19:58     ` Linus Torvalds
2017-08-25 22:19     ` Tim Chen [this message]
2017-08-25 22:19       ` Tim Chen
2017-08-25 22:51       ` Linus Torvalds
2017-08-25 23:03         ` Linus Torvalds
2017-08-26  0:31           ` Linus Torvalds
2017-08-26  0:31             ` Linus Torvalds
2017-08-26  2:54             ` Linus Torvalds
2017-08-26  2:54               ` Linus Torvalds
2017-08-26 18:15               ` Linus Torvalds
2017-08-27 21:40                 ` Linus Torvalds
2017-08-27 21:40                   ` Linus Torvalds
2017-08-27 21:42                   ` Linus Torvalds
2017-08-27 21:42                     ` Linus Torvalds
2017-08-27 23:12                   ` Linus Torvalds
2017-08-27 23:12                     ` Linus Torvalds
2017-08-28  1:16                     ` Nicholas Piggin
2017-08-28  1:16                       ` Nicholas Piggin
2017-08-28  1:29                       ` Nicholas Piggin
2017-08-28  1:29                         ` Nicholas Piggin
2017-08-28  5:17                         ` Linus Torvalds
2017-08-28  5:17                           ` Linus Torvalds
2017-08-28  7:18                           ` Nicholas Piggin
2017-08-28  7:18                             ` Nicholas Piggin
2017-08-28 14:51                 ` Liang, Kan
2017-08-28 16:48                   ` Linus Torvalds
2017-08-28 20:01                     ` Tim Chen
2017-08-28 20:01                       ` Tim Chen
2017-08-29 12:57                     ` Liang, Kan
2017-08-29 16:01                       ` Linus Torvalds
2017-08-29 16:01                         ` Linus Torvalds
2017-08-29 16:13                         ` Tim Chen
2017-08-29 16:13                           ` Tim Chen
2017-08-29 16:24                           ` Linus Torvalds
2017-08-29 16:24                             ` Linus Torvalds
2017-08-29 16:57                             ` Tim Chen
2017-08-29 16:57                               ` Tim Chen
2017-09-14  2:12                             ` Tim Chen
2017-09-14  2:12                               ` Tim Chen
2017-09-14  2:27                               ` Linus Torvalds
2017-09-14  2:27                                 ` Linus Torvalds
2017-09-14 16:50                                 ` Tim Chen
2017-09-14 16:50                                   ` Tim Chen
2017-09-14 17:00                                   ` Linus Torvalds
2017-09-14 17:00                                     ` Linus Torvalds
2017-09-14 16:39                               ` Christopher Lameter
2017-08-29 16:17                     ` Tim Chen
2017-08-29 16:17                       ` Tim Chen
2017-08-29 16:22                       ` Linus Torvalds
2017-08-29 16:22                         ` Linus Torvalds
2017-08-25 17:46 ` [PATCH 1/2 v2] sched/wait: Break up long wake list walk Christopher Lameter
2017-08-25 17:46   ` Christopher Lameter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=f10f4c25-49c0-7ef5-55c2-769c8fd9bf90@linux.intel.com \
    --to=tim.c.chen@linux.intel.com \
    --cc=ak@linux.intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=cl@linux.com \
    --cc=dave@stgolabs.net \
    --cc=ebiederm@xmission.com \
    --cc=hannes@cmpxchg.org \
    --cc=jack@suse.cz \
    --cc=kan.liang@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@techsingularity.net \
    --cc=mingo@elte.hu \
    --cc=peterz@infradead.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.