From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S943285AbcJSRfB (ORCPT ); Wed, 19 Oct 2016 13:35:01 -0400 Received: from merlin.infradead.org ([205.233.59.134]:36062 "EHLO merlin.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S936411AbcJSRe7 (ORCPT ); Wed, 19 Oct 2016 13:34:59 -0400 Date: Wed, 19 Oct 2016 19:34:03 +0200 From: Peter Zijlstra To: Will Deacon Cc: Linus Torvalds , Waiman Long , Jason Low , Ding Tianhong , Thomas Gleixner , Ingo Molnar , Imre Deak , Linux Kernel Mailing List , Davidlohr Bueso , Tim Chen , Terry Rudd , "Paul E. McKenney" , Jason Low , Chris Wilson , Daniel Vetter Subject: Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop Message-ID: <20161019173403.GB3142@twins.programming.kicks-ass.net> References: <20161007145243.361481786@infradead.org> <20161007150211.271490994@infradead.org> <20161013151720.GB13138@arm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20161013151720.GB13138@arm.com> User-Agent: Mutt/1.5.23.1 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Oct 13, 2016 at 04:17:21PM +0100, Will Deacon wrote: > > if (!first && __mutex_waiter_is_first(lock, &waiter)) { > > first = true; > > __mutex_set_flag(lock, MUTEX_FLAG_HANDOFF); > > } > > + > > + set_task_state(task, state); > > With this change, we no longer hold the lock wit_hen we set the task > state, and it's ordered strictly *after* setting the HANDOFF flag. > Doesn't that mean that the unlock code can see the HANDOFF flag, issue > the wakeup, but then we come in and overwrite the task state? > > I'm struggling to work out whether that's an issue, but it certainly > feels odd and is a change from the previous behaviour. OK, so after a discussion on IRC the problem appears to have been unfamiliarity with the basic sleep/wakeup scheme. Mutex used to be the odd duck out for being fully serialized by wait_lock. The below adds a few words on how the 'normal' sleep/wakeup scheme works. --- Subject: sched: Better explain sleep/wakeup From: Peter Zijlstra Date: Wed Oct 19 15:45:27 CEST 2016 There were a few questions wrt how sleep-wakeup works. Try and explain it more. Requested-by: Will Deacon Signed-off-by: Peter Zijlstra (Intel) --- include/linux/sched.h | 52 ++++++++++++++++++++++++++++++++++---------------- kernel/sched/core.c | 15 +++++++------- 2 files changed, 44 insertions(+), 23 deletions(-) --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -262,20 +262,9 @@ extern char ___assert_task_state[1 - 2*! #define set_task_state(tsk, state_value) \ do { \ (tsk)->task_state_change = _THIS_IP_; \ - smp_store_mb((tsk)->state, (state_value)); \ + smp_store_mb((tsk)->state, (state_value)); \ } while (0) -/* - * set_current_state() includes a barrier so that the write of current->state - * is correctly serialised wrt the caller's subsequent test of whether to - * actually sleep: - * - * set_current_state(TASK_UNINTERRUPTIBLE); - * if (do_i_need_to_sleep()) - * schedule(); - * - * If the caller does not need such serialisation then use __set_current_state() - */ #define __set_current_state(state_value) \ do { \ current->task_state_change = _THIS_IP_; \ @@ -284,11 +273,19 @@ extern char ___assert_task_state[1 - 2*! #define set_current_state(state_value) \ do { \ current->task_state_change = _THIS_IP_; \ - smp_store_mb(current->state, (state_value)); \ + smp_store_mb(current->state, (state_value)); \ } while (0) #else +/* + * @tsk had better be current, or you get to keep the pieces. + * + * The only reason is that computing current can be more expensive than + * using a pointer that's already available. + * + * Therefore, see set_current_state(). + */ #define __set_task_state(tsk, state_value) \ do { (tsk)->state = (state_value); } while (0) #define set_task_state(tsk, state_value) \ @@ -299,11 +296,34 @@ extern char ___assert_task_state[1 - 2*! * is correctly serialised wrt the caller's subsequent test of whether to * actually sleep: * + * for (;;) { * set_current_state(TASK_UNINTERRUPTIBLE); - * if (do_i_need_to_sleep()) - * schedule(); + * if (!need_sleep) + * break; + * + * schedule(); + * } + * __set_current_state(TASK_RUNNING); + * + * If the caller does not need such serialisation (because, for instance, the + * condition test and condition change and wakeup are under the same lock) then + * use __set_current_state(). + * + * The above is typically ordered against the wakeup, which does: + * + * need_sleep = false; + * wake_up_state(p, TASK_UNINTERRUPTIBLE); + * + * Where wake_up_state() (and all other wakeup primitives) imply enough + * barriers to order the store of the variable against wakeup. + * + * Wakeup will do: if (@state & p->state) p->state = TASK_RUNNING, that is, + * once it observes the TASK_UNINTERRUPTIBLE store the waking CPU can issue a + * TASK_RUNNING store which can collide with __set_current_state(TASK_RUNNING). + * + * This is obviously fine, since they both store the exact same value. * - * If the caller does not need such serialisation then use __set_current_state() + * Also see the comments of try_to_wake_up(). */ #define __set_current_state(state_value) \ do { current->state = (state_value); } while (0) --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2000,14 +2000,15 @@ static void ttwu_queue(struct task_struc * @state: the mask of task states that can be woken * @wake_flags: wake modifier flags (WF_*) * - * Put it on the run-queue if it's not already there. The "current" - * thread is always on the run-queue (except when the actual - * re-schedule is in progress), and as such you're allowed to do - * the simpler "current->state = TASK_RUNNING" to mark yourself - * runnable without the overhead of this. + * If (@state & @p->state) @p->state = TASK_RUNNING. * - * Return: %true if @p was woken up, %false if it was already running. - * or @state didn't match @p's state. + * If the task was not queued/runnable, also place it back on a runqueue. + * + * Atomic against schedule() which would dequeue a task, also see + * set_current_state(). + * + * Return: %true if @p->state changes (an actual wakeup was done), + * %false otherwise. */ static int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)