Date: Mon, 27 Jun 2016 14:23:35 +0200
From: Peter Zijlstra
To: Juri Lelli
Cc: mingo@kernel.org, tglx@linutronix.de, rostedt@goodmis.org,
	xlpang@redhat.com, linux-kernel@vger.kernel.org,
	mathieu.desnoyers@efficios.com, jdesfossez@efficios.com,
	bristot@redhat.com
Subject: Re: [RFC][PATCH 8/8] rtmutex: Fix PI chain order integrity
Message-ID: <20160627122335.GB30154@twins.programming.kicks-ass.net>
References: <20160607195635.710022345@infradead.org>
	<20160607200216.117270606@infradead.org>
	<20160614173908.GQ5981@e106622-lin>
	<20160614194401.GL30921@twins.programming.kicks-ass.net>
	<20160615072507.GS5981@e106622-lin>
In-Reply-To: <20160615072507.GS5981@e106622-lin>
User-Agent: Mutt/1.5.23.1 (2014-03-12)

On Wed, Jun 15, 2016 at 08:25:07AM +0100, Juri Lelli wrote:
> I guess it's not that likely, but yes it could potentially happen that a
> waiter is optimistically spinning, depletes its runtime, gets throttled
> and then replenished when still spinning. Maybe it doesn't really make
> sense continuing spinning in this situation, but I guess things get
> really complicated. :-/
> 
> Anyway, as said, I think this patch is OK. Maybe we want to add a
> comment just to remember what situation can cause an issue if we don't
> do this? Patch changelog would be OK as well for such a comment IMHO.

OK, so I went to write a simple comment and ended up with the below :/

While writing the comment I noticed two issues:

 - we update the waiter order fields while the entry is still enqueued
   on the pi_waiters tree, which is also sorted by these exact fields
   (see the sketch after the patch).

 - another one of these pure ->prio comparisons (see the toy right
   below).

Please double check, there be dragons here.
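To make the second point concrete, here is a stand-alone toy (plain
userspace C, invented names, not the kernel code): all SCHED_DEADLINE
tasks share one ->prio value, so a pure prio test cannot order two DL
waiters; only the deadline tie-break can.

#include <stdio.h>
#include <stdbool.h>

struct waiter {
	int prio;			/* lower means more important */
	unsigned long long deadline;	/* absolute deadline */
};

static bool dl_prio(int prio)
{
	return prio < 0;	/* DL tasks all sit below the RT prio range */
}

/* same shape as rt_mutex_waiter_less(): prio first, deadline tie-break */
static bool waiter_less(const struct waiter *l, const struct waiter *r)
{
	if (l->prio < r->prio)
		return true;
	if (l->prio == r->prio && dl_prio(l->prio))
		return (long long)(l->deadline - r->deadline) < 0;
	return false;
}

int main(void)
{
	struct waiter a = { .prio = -1, .deadline = 1000 };
	struct waiter b = { .prio = -1, .deadline = 2000 };

	/* old test: "not less" both ways, so order depends on who asks */
	printf("prio only: a>=b:%d b>=a:%d\n", a.prio >= b.prio, b.prio >= a.prio);

	/* deadline tie-break: a, the earlier deadline, wins consistently */
	printf("tie-break: a<b:%d b<a:%d\n",
	       waiter_less(&a, &b), waiter_less(&b, &a));
	return 0;
}

The first printf shows the old test calling them unordered both ways;
the second shows the earlier deadline winning consistently, which is
what the cmp_task() change below buys us.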
---

--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -158,6 +158,9 @@ static inline bool unlock_rt_mutex_safe(
 }
 #endif
 
+#define cmp_task(p) \
+	&(struct rt_mutex_waiter){ .prio = (p)->prio, .deadline = (p)->dl.deadline }
+
 static inline int
 rt_mutex_waiter_less(struct rt_mutex_waiter *left,
 		     struct rt_mutex_waiter *right)
@@ -583,8 +586,26 @@ static int rt_mutex_adjust_prio_chain(st
 
 	/* [7] Requeue the waiter in the lock waiter tree. */
 	rt_mutex_dequeue(lock, waiter);
+
+	/*
+	 * Update the waiter prio fields now that we're dequeued.
+	 *
+	 * These values can have changed through either:
+	 *
+	 *   sys_sched_set_scheduler() / sys_sched_setattr()
+	 *
+	 * or
+	 *
+	 *   DL CBS enforcement advancing the effective deadline.
+	 *
+	 * Even though pi_waiters also uses these fields, and that tree is only
+	 * updated in [11], we can do this here, since we hold [L], which
+	 * serializes all pi_waiters access and rb_erase() does not care about
+	 * the values of the node being removed.
+	 */
 	waiter->prio = task->prio;
 	waiter->deadline = task->dl.deadline;
+
 	rt_mutex_enqueue(lock, waiter);
 
 	/* [8] Release the task */
@@ -711,6 +732,8 @@ static int rt_mutex_adjust_prio_chain(st
 static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 				struct rt_mutex_waiter *waiter)
 {
+	lockdep_assert_held(&lock->wait_lock);
+
 	/*
 	 * Before testing whether we can acquire @lock, we set the
 	 * RT_MUTEX_HAS_WAITERS bit in @lock->owner. This forces all
@@ -770,7 +793,8 @@ static int try_to_take_rt_mutex(struct r
 			 * the top waiter priority (kernel view),
 			 * @task lost.
 			 */
-			if (task->prio >= rt_mutex_top_waiter(lock)->prio)
+			if (!rt_mutex_waiter_less(cmp_task(task),
+						  rt_mutex_top_waiter(lock)))
 				return 0;
 
 			/*
@@ -838,6 +862,8 @@ static int task_blocks_on_rt_mutex(struc
 	struct rt_mutex *next_lock;
 	int chain_walk = 0, res;
 
+	lockdep_assert_held(&lock->wait_lock);
+
 	/*
 	 * Early deadlock detection. We really don't want the task to
 	 * enqueue on itself just to untangle the mess later. It's not
@@ -973,6 +999,8 @@ static void remove_waiter(struct rt_mute
 	struct task_struct *owner = rt_mutex_owner(lock);
 	struct rt_mutex *next_lock;
 
+	lockdep_assert_held(&lock->wait_lock);
+
 	raw_spin_lock(&current->pi_lock);
 	rt_mutex_dequeue(lock, waiter);
 	current->pi_blocked_on = NULL;
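And the first point as a stand-alone sketch (again userspace C with
invented names; a sorted list stands in for the rbtree, the invariant is
the same): a node's sort key must not change while it is enqueued, or
the structure's ordering silently rots; between dequeue and enqueue the
update is harmless.

#include <stdio.h>
#include <stdbool.h>

struct waiter {
	int prio;
	struct waiter *next;
};

/* insert keeping ascending ->prio order (most important first) */
static void enqueue(struct waiter **head, struct waiter *w)
{
	struct waiter **p = head;

	while (*p && (*p)->prio <= w->prio)
		p = &(*p)->next;
	w->next = *p;
	*p = w;
}

static void dequeue(struct waiter **head, struct waiter *w)
{
	for (struct waiter **p = head; *p; p = &(*p)->next) {
		if (*p == w) {
			*p = w->next;
			w->next = NULL;
			return;
		}
	}
}

static bool sorted(const struct waiter *w)
{
	for (; w && w->next; w = w->next)
		if (w->prio > w->next->prio)
			return false;
	return true;
}

int main(void)
{
	struct waiter a = { .prio = 10 }, b = { .prio = 20 }, c = { .prio = 30 };
	struct waiter *head = NULL;

	enqueue(&head, &a);
	enqueue(&head, &b);
	enqueue(&head, &c);

	b.prio = 40;				/* key update while enqueued */
	printf("in-place: sorted=%d\n", sorted(head));	/* 0: order broken */

	dequeue(&head, &b);
	b.prio = 5;				/* key update while off the list */
	enqueue(&head, &b);
	printf("requeue:  sorted=%d\n", sorted(head));	/* 1: order holds */

	return 0;
}

Same discipline as hunk [7] above: rt_mutex_dequeue(), then update
->prio / ->deadline, then rt_mutex_enqueue(), all while holding [L].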