From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934841AbdHYWTz (ORCPT ); Fri, 25 Aug 2017 18:19:55 -0400 Received: from mga05.intel.com ([192.55.52.43]:63995 "EHLO mga05.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S934049AbdHYWTy (ORCPT ); Fri, 25 Aug 2017 18:19:54 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.41,426,1498546800"; d="scan'208";a="1007742419" Subject: Re: [PATCH 2/2 v2] sched/wait: Introduce lock breaker in wake_up_page_bit To: Linus Torvalds Cc: Mel Gorman , Peter Zijlstra , Ingo Molnar , Andi Kleen , Kan Liang , Andrew Morton , Johannes Weiner , Jan Kara , Christopher Lameter , "Eric W . Biederman" , Davidlohr Bueso , linux-mm , Linux Kernel Mailing List References: <83f675ad385d67760da4b99cd95ee912ca7c0b44.1503677178.git.tim.c.chen@linux.intel.com> From: Tim Chen Message-ID: Date: Fri, 25 Aug 2017 15:19:53 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.1.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 08/25/2017 12:58 PM, Linus Torvalds wrote: > On Fri, Aug 25, 2017 at 9:13 AM, Tim Chen wrote: >> Now that we have added breaks in the wait queue scan and allow bookmark >> on scan position, we put this logic in the wake_up_page_bit function. > > Oh, _this_ is the other patch you were talking about. I thought it was > the NUMA counter threashold that was discussed around the same time, > and that's why I referred to Mel. > > Gods, _this_ patch is ugly. No, I'm not happy with it at all. It > makes that wait_queue_head much bigger, for this disgusting one use. > > So no, this is no good. > > Now, maybe the page_wait_table[] itself could be changed to be > something that is *not* just the wait-queue head. > > But if we need to change the page_wait_table[] itself to have more > information, then we should make it not be a wait-queue at all, we > should make it be a list of much more specialized entries, indexed by > the {page,bit} tuple. > > And once we do that, we probably *could* use something like two > specialized lists: one that is wake-all, and one that is wake-one. > > So you'd have something like > > struct page_wait_struct { > struct list_node list; > struct page *page; > int bit; > struct llist_head all; > struct llist_head exclusive; > }; > > and then the "page_wait_table[]" would be just an array of > > struct page_wait_head { > spinlock_t lock; > struct hlist_head list; > }; > > and then the rule is: > > - each page/bit combination allocates one of these page_wait_struct > entries when somebody starts waiting for it for the first time (and we > use the spinlock in the page_wait_head to serialize that list). > > - an exclusive wait (ie somebody who waits to get the bit for > locking) adds itself to the 'exclusive' llist > > - non-locking waiters add themselves to the 'all' list > > - we can use llist_del_all() to remove the 'all' list and then walk > it and wake them up without any locks > > - we can use llist_del_first() to remove the first exclusive waiter > and wait _it_ up without any locks. > > Hmm? How does that sound? That means that we wouldn't use the normal > wait-queues at all for the page hash waiting. We'd use this two-level > list: one list to find the page/bit thing, and then two lists within > that fdor the wait-queues for just *that* page/bit. > > So no need for the 'key' stuff at all, because the page/bit data would > be in the first data list, and the second list wouldn't have any of > these traversal issues where you have to be careful and do it one > entry at a time. If I have the waker count and flags on the wait_page_queue structure will that have been more acceptable? Also I think patch 1 is still a good idea for a fail safe mechanism in case there are other long wait list. That said, I do think your suggested approach is cleaner. However, it is a much more substantial change. Let me take a look and see if I have any issues implementing it. Tim > > Linus >