* [RFC][PATCH RT v2 0/3] RT: Fix trylock deadlock without msleep() hack
@ 2015-09-10 14:50 Steven Rostedt
From: Steven Rostedt @ 2015-09-10 14:50 UTC (permalink / raw)
  To: linux-kernel, linux-rt-users
  Cc: Thomas Gleixner, Carsten Emde, Sebastian Andrzej Siewior,
	John Kacur, Paul Gortmaker, Peter Zijlstra, Clark Williams,
	Arnaldo Carvalho de Melo, Ingo Molnar

(Frozen shark deflectors set on hold)

I won't describe the trylock livelock issue again; for that, please see
v1 of this patch set:

  http://lkml.kernel.org/r/20150904011900.730816481@goodmis.org

My solution there was to add a spin_try_or_boost_lock() call that would
temporarily boost the owner of the lock if the lock was held. The boost
would last until that task released any lock, at which point it was reset.

In the meantime, the caller of the trylock would call cpu_chill(), which
was turned into a sched_yield(), in the hope that if the caller was blocking
the owner, the owner (now boosted to the same priority) would get a chance
to run.

Thomas Gleixner had issues with this approach (and rightfully so).

 #1) If the owner was blocked by an even higher priority task on another
     CPU, the trylock spinner would constantly spin and prevent that
     CPU from running anything else.

 #2) There was a temporary priority leak (although this isn't so bad
     because the leak is bounded): if the trylock spinner hit a
     condition to stop spinning, the owner of the lock would still
     carry the unneeded boost until it released that lock.

 #3) There was no connection between the trylock spinner and the owner.
     If the spinner was deboosted or boosted further, the owner of the
     lock would not see that change.


With my new approach (this patch series), the above can be solved. Note,
#1 and #2 are solved, but #3 still needs some tweaks to the priority chain.
But first, let's describe the new approach.

Before describing the new approach, I need to describe the rt_mutex_waiter.
This is the descriptor used to hold most of the state of a blocked task
in the priority inheritance chain. When a task blocks on a rtmutex,
the rt_mutex_waiter (called waiter from now on) is allocated on the
stack and initialized. It points to the lock as well as the blocked task.
It has hooks to link into the owner's priority list as well as the
lock's wait list. The waiter is used extensively in the pi code.
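
For reference, here is a trimmed view of that descriptor as it ends up in
<linux/sched.h> after patch 1/3 (debug-only fields omitted; the comments
are mine, summarizing the text above):

	struct rt_mutex_waiter {
		struct rb_node		tree_entry;	/* node in the lock's wait tree */
		struct rb_node		pi_tree_entry;	/* node in the owner's pi waiters tree */
		struct task_struct	*task;		/* the blocked (or boosting) task */
		struct rt_mutex		*lock;		/* the lock being waited on */
		bool			savestate;
		bool			wake_up;	/* new: set once cpu_chill() wants a wakeup */
		int			prio;
	};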

The waiter can be allocated on the stack because a blocked task won't need
the pi code once it is woken and is finished with the rtmutex call.
But this won't work for spin_trylock_or_boost() (renamed from the
previous spin_try_or_boost_lock()), because the trylock_or_boost call
will return while the waiter is still in use. The scope of this
waiter needs to extend from the trylock_or_boost() call all the way to the
cpu_chill() call (which must be called after a failed spin_trylock_or_boost()).
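
To make the intended usage concrete, here is a minimal sketch of the
caller-side pattern (dev, lock_a and lock_b are made-up names for
illustration, not taken from the patches):

	/* Illustration only: dev, lock_a and lock_b are hypothetical */
	retry:
		spin_lock(&dev->lock_a);
		if (!spin_trylock_or_boost(&dev->lock_b)) {
			/* the owner of lock_b is now boosted to our priority */
			spin_unlock(&dev->lock_a);
			cpu_chill();	/* sleeps only while the boosted owner holds lock_b */
			goto retry;
		}
		/* both locks held; reverse order acquired safely */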

To solve the need of a waiter beyond the scope of a function call, a
waiter descriptor is added to the task_struct. This "static" waiter can
now be used to boost tasks after a failed spin_trylock_or_boost() and
keep the state between the trylock spinner and the owner of the lock.

Some caveats are needed though. A task can only boost one other task at a time.
If two spin_trylock_or_boost() calls are made and both fail, only the first
one will boost the owner of its lock. The second one will be ignored (and
for now triggers a warning, because I don't see why two failed attempts
would be made).

If a task's spin_trylock_or_boost() fails and boosts the owner, and the task
then calls a normal spin_lock() and has to block, the boost from the
trylock_or_boost will be released. That is, the waiter on the task_struct
will be removed from the lock's wait list and the owner's pi list before the
task blocks on the spin_lock. This prevents any strange anomaly of having one
task with two waiters. As stated before, a task may only boost one other task
at a time, never two or more. That is, it may only be in one pi chain at
a time. This keeps the complexity down.
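
A minimal sketch of that interaction (again, the names are hypothetical):

	if (!spin_trylock_or_boost(&dev->lock_b)) {
		/* current->rt_waiter now sits on lock_b and boosts its owner */

		/*
		 * If this spin_lock() contends and has to block,
		 * check_static_waiter() first unhooks current->rt_waiter
		 * from lock_b and deboosts lock_b's owner, so current
		 * sits in at most one pi chain.
		 */
		spin_lock(&dev->lock_c);
	}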

When the task gets around to calling cpu_chill(), the cpu_chill() will
check the "static" waiter, and see if there's a lock assigned to it.
If there is, it will set a new flag in the waiter stating that it wants
to be woken, and will sleep. If there's no lock assigned, that means either
the lock was released in the meantime, or the task blocked on another lock
and had to release that waiter. In either case, the cpu_chill() simply
calls cpu_relax() and continues.
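
Simplified, the new cpu_chill() path (patch 2/3 routes it through
rt_mutex_wait_for_lock(); the locking and re-check details are omitted
here) behaves roughly like this:

	/* Simplified sketch, not the actual rt_mutex_wait_for_lock() code */
	if (!current->rt_waiter.lock) {
		/* lock already released, or we blocked elsewhere and dropped the waiter */
		cpu_relax();
	} else {
		current->rt_waiter.wake_up = true;	/* the owner may wake us now */
		schedule();				/* sleep until the owner unlocks */
	}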

When the owner of the lock releases the lock, if the top_waiter is a static
waiter (waiter == &waiter->task->rt_waiter), then the waiter is removed not
only from the pi_list of the old owner, but also from the wait_list of the
lock itself. The waiter->lock is set to NULL too. This lets the waiter->task
know that the lock has been released. The old owner will only wake up the
task if the waiter flag is set to do so. Otherwise it may cause a spurious
wake up. If the flag is not set, then the trylock spinner has not made it to
the cpu_chill() yet, and with the waiter->lock cleared, the cpu_chill() will
not sleep.
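
On the unlock side, the check is roughly this (simplified from
rt_mutex_wake_trylock_waiter() in patch 1/3):

	/* Simplified sketch of the unlock path, not the actual code */
	if (waiter == &waiter->task->rt_waiter) {
		/* trylock-boost waiter: unhook it and clear its lock */
		rt_mutex_dequeue(waiter->lock, waiter);
		waiter->lock = NULL;
		wakeup = waiter->wake_up;	/* only wake the task if cpu_chill() sleeps */
	}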

The waiter->task->pi_lock is used to synchronize the modifications of the
static waiter.

Now, one might notice that we could just keep the task on the lock's wait_list
so that lower priority tasks could not steal that lock. This would
be nice, but there's a major problem, and this problem is also the reason
for patch 3 (a new patch in the series): because the trylock
spinner is not actually blocked on the lock, there's no guarantee that the
lock will still exist by the time the spinner calls cpu_chill().


    CPU0		   CPU1
    ----		   ----
  lock(dev->B)
			lock(A)
			trylock(dev->B)
			  [boost and link on lock]
			unlock(A)
  unlock(dev->B)
  lock(A)
  free(dev)
  unlock(A)

			cpu_chill()


See the problem?

If the owner kept the trylock spinner on the lock, there would be nothing in
the logic that prevents that lock from being freed. In fact, in this case it
is lock A that protects dev->B from being freed, but the trylock spinner
releases lock A before calling cpu_chill().

Because of the above, the unlock(dev->B) must remove all top waiters that
are there due to a spin_trylock_or_boost(). If there are normal waiters,
nothing needs to be done, because the normal waiter would have to be in
a path that guarantees the lock stays around (otherwise it would bug on
mainline too), and when the normal waiter gets the lock and releases it,
it will remove any waiters that are below it.

I boot tested this, and ran my simple module to see if it works, and
it does. But this code is more complex, and I'm sure there are bugs in it.
This is still an RFC, and there are some cleanups that can be done.
But I think this is a better solution than the one I supplied before,
albeit a bit more complex, and it handles the issues that Thomas brought
up. #3 isn't fixed yet, because that will require the pi chain walk to
check tasks that are not blocked but are boosting via their static waiter.
This would not be hard to fix, and probably can be fixed by changing
cpu_chill() to boost again before it sleeps.

Thoughts?

-- Steve

---

Steven Rostedt (Red Hat) (3):
      locking: Add spin_trylock_or_boost() infrastructure
      rt: Make cpu_chill() call rt_mutex_wait_for_lock() and add new cpu_rest() as msleep(1)
      rtmutex: Wake up all top trylock waiters on unlock


 block/blk-ioc.c                 |    4 
 fs/autofs4/expire.c             |    2 
 fs/dcache.c                     |    6 
 fs/namespace.c                  |    2 
 include/linux/delay.h           |   13 +
 include/linux/rtmutex.h         |    1 
 include/linux/sched.h           |   50 +++++
 include/linux/spinlock.h        |    5 
 include/linux/spinlock_rt.h     |   15 +
 kernel/locking/rtmutex.c        |  376 +++++++++++++++++++++++++++++++++++++++-
 kernel/locking/rtmutex_common.h |   22 --
 kernel/time/hrtimer.c           |   17 +
 kernel/workqueue.c              |    2 
 net/packet/af_packet.c          |    4 
 net/rds/ib_rdma.c               |    2 
 15 files changed, 478 insertions(+), 43 deletions(-)




* [RFC][PATCH RT v2 1/3] [PATCH 1/3] locking: Add spin_trylock_or_boost() infrastructure
From: Steven Rostedt @ 2015-09-10 14:50 UTC (permalink / raw)
  To: linux-kernel, linux-rt-users
  Cc: Thomas Gleixner, Carsten Emde, Sebastian Andrzej Siewior,
	John Kacur, Paul Gortmaker, Peter Zijlstra, Clark Williams,
	Arnaldo Carvalho de Melo, Ingo Molnar


In mainline Linux, there are cases where locks need to be taken in reverse
order. In non-PREEMPT_RT code, spinlocks can not be preempted, so if a task
taking locks in reverse order can't get the second lock, it simply releases
the first lock (the one that can cause a deadlock) and tries again. The owner
of the second lock must be running on another CPU, and as long as it's not
blocked on the lock held by the original task, progress will be made.
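
For reference, the non-RT pattern being described looks roughly like this
(a and b are made-up lock names):

	/* Illustration only: a and b are hypothetical locks */
	again:
		spin_lock(&a);			/* the lock that can cause the deadlock */
		if (!spin_trylock(&b)) {	/* reverse-order acquisition */
			spin_unlock(&a);
			goto again;		/* b's owner runs on another CPU and will release it */
		}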

But things change when these locks become preemptive locks.

Simply releasing the first lock when failing to get the second lock won't
always work if the locks can be preempted. That's because the task could
have preempted the owner, and it can prevent the owner from making forward
progress. Just letting go of the first lock and trying again won't change the
state of things, as the owner will still be blocked by being preempted by
the original task. This is especially true when the original task is a
real-time task running at a FIFO priority.

	<Task 1>
	Grab lock A

	<preempt to higher priority Task 2>

	Grab lock B
	try to grab lock A
	release lock B
	Grab lock B
	try to grab lock A
	<wash, rinse, repeat>  (sorry Fred)

As Task 1 will never run as long as Task 2 is spinning, no forward progress
is made, and this becomes a livelock (as opposed to a deadlock).

The solution here is to add a static rt_mutex_waiter to the
task_struct. The rt_mutex_waiter is a structure that is used to connect
locks with owners and those that are blocked on the lock in order to
propagate priority inheritance. The rt_mutex_waiter is normally allocated on
the stack when a task blocks on an rtmutex. This is fine, because the scope of
that waiter structure ends by the time the rtmutex function returns.
For the spin_trylock_or_boost() caller, the waiter's scope needs to go
beyond the function call, thus it must not be allocated on the stack. Adding
it to the task_struct works, but only allows one trylock_or_boost to be in
flight at a time; that should be fine anyway (a warning is added if it is not).

The spin_trylock_or_boost() will add this new task_struct waiter to the pi
lists of the lock, and if necessary (if it is the highest waiter on the
lock) to the owner's pi list too. This is exactly the way things work with
the current pi inheritance logic.

Some things need to be changed though. As the waiter is not actually waiting
on the lock, the owner can't wake it up without prejudice, as the waiter
could have set itself to another state before calling cpu_chill(). A
"wake_up" field is added to the waiter; it is set only when the task is
ready to receive the wakeup and has saved its current state.

When the owner of the lock releases the lock and picks this task to take the
lock next, it will remove it from the pi lists and clear the waiter->lock.
This is done under the task's pi_lock to make sure it synchronizes with the
cpu_chill() call.

When cpu_chill() is called, it will look to see if the task's static waiter
has a lock assigned. If it does not, cpu_chill() will act like a cpu_relax(),
as that means the owner released the lock. Otherwise, cpu_chill() will set the
"wake_up" flag, save the task state, and go to sleep in the TASK_INTERRUPTIBLE
state, as anything should be able to wake it up.

When it is woken, it clears the lock (if it hasn't already been done) and
removes itself from the pi lists.

Note, the reason that the owner of the lock clears the waiter->lock and does
not let the task stay on the pi lists is that nothing prevents the lock from
being freed once it is released, because the task is not really blocked on it.

Note 2, if the task blocks on an rtmutex, or calls trylock_or_boost again
and fails while already boosting the owner of another lock, then it must
remove itself from the previous lock before it can boost another task. Only
one boosting per task is allowed; a task can not boost more than one task.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/linux/rtmutex.h         |   1 +
 include/linux/sched.h           |  50 +++++++
 include/linux/spinlock.h        |   5 +
 include/linux/spinlock_rt.h     |  15 ++
 kernel/locking/rtmutex.c        | 323 +++++++++++++++++++++++++++++++++++++++-
 kernel/locking/rtmutex_common.h |  22 ---
 6 files changed, 386 insertions(+), 30 deletions(-)

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index d5a04ea47a13..4ab8d4bddfbb 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -111,6 +111,7 @@ extern int rt_mutex_timed_lock(struct rt_mutex *lock,
 			       struct hrtimer_sleeper *timeout);
 
 extern int rt_mutex_trylock(struct rt_mutex *lock);
+extern int rt_mutex_try_or_boost_lock(struct rt_mutex *lock);
 
 extern void rt_mutex_unlock(struct rt_mutex *lock);
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 956658437527..94ebad0ac112 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1336,6 +1336,29 @@ enum perf_event_task_context {
 	perf_nr_task_contexts,
 };
 
+/*
+ * This is the control structure for tasks blocked on a rt_mutex,
+ * which is allocated on the kernel stack of the blocked task.
+ *
+ * @tree_entry:		pi node to enqueue into the mutex waiters tree
+ * @pi_tree_entry:	pi node to enqueue into the mutex owner waiters tree
+ * @task:		task reference to the blocked task
+ */
+struct rt_mutex_waiter {
+	struct rb_node          tree_entry;
+	struct rb_node          pi_tree_entry;
+	struct task_struct	*task;
+	struct rt_mutex		*lock;
+	bool			savestate;
+	bool			wake_up;
+#ifdef CONFIG_DEBUG_RT_MUTEXES
+	unsigned long		ip;
+	struct pid		*deadlock_task_pid;
+	struct rt_mutex		*deadlock_lock;
+#endif
+	int prio;
+};
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	volatile long saved_state;	/* saved state for "spinlock sleepers" */
@@ -1357,6 +1380,33 @@ struct task_struct {
 
 	int prio, static_prio, normal_prio;
 	unsigned int rt_priority;
+#ifdef CONFIG_PREEMPT_RT_FULL
+	/*
+	 * There are cases where spinlocks are taken in reverse
+	 * order via a spin_trylock(). This is fine for true
+	 * spinlocks, but in PREEMPT_RT, they become mutexes.
+	 * If a real-time high priority task preempts the owner
+	 * of a lock it is trying to grab in reverse order with
+	 * a trylock, it can go into a livelock, as it spins
+	 * waiting for the owner to release the lock. But as it
+	 * has preempted that owner, and prevented it from continuing,
+	 * the lock is never released, and the high priority task
+	 * spins forever.
+	 *
+	 * For these cases, a spin_trylock_or_boost() must be used
+	 * on PREEMPT_RT. This will set the owner of the lock to the
+	 * priority of the caller of spin_trylock_or_boost()
+	 * via this rt_waiter field (if the owner has a lower
+	 * priority than the caller). This priority is always reset
+	 * whenever any spinlock converted rtmutex is released.
+	 * This guarantees forward progress of the trylock spin.
+	 *
+	 * Callers of spin_trylock_or_boost() must also call cpu_chill()
+	 * which on PREEMPT_RT is a sched_yield() that allows the
+	 * newly risen priority task to run ahead of the trylock spinner.
+	 */
+	struct rt_mutex_waiter	rt_waiter;
+#endif
 	const struct sched_class *sched_class;
 	struct sched_entity se;
 	struct sched_rt_entity rt;
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 28f4366fd495..1d15e8c1267f 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -330,6 +330,11 @@ static inline int spin_trylock(spinlock_t *lock)
 	return raw_spin_trylock(&lock->rlock);
 }
 
+static inline int spin_trylock_or_boost(spinlock_t *lock)
+{
+	return raw_spin_trylock(&lock->rlock);
+}
+
 #define spin_lock_nested(lock, subclass)			\
 do {								\
 	raw_spin_lock_nested(spinlock_check(lock), subclass);	\
diff --git a/include/linux/spinlock_rt.h b/include/linux/spinlock_rt.h
index f757096b230c..e65951230e3e 100644
--- a/include/linux/spinlock_rt.h
+++ b/include/linux/spinlock_rt.h
@@ -26,6 +26,7 @@ extern void __lockfunc rt_spin_unlock_wait(spinlock_t *lock);
 extern int __lockfunc rt_spin_trylock_irqsave(spinlock_t *lock, unsigned long *flags);
 extern int __lockfunc rt_spin_trylock_bh(spinlock_t *lock);
 extern int __lockfunc rt_spin_trylock(spinlock_t *lock);
+extern int __lockfunc rt_spin_trylock_or_boost(spinlock_t *lock);
 extern int atomic_dec_and_spin_lock(atomic_t *atomic, spinlock_t *lock);
 
 /*
@@ -53,6 +54,8 @@ extern int __lockfunc __rt_spin_trylock(struct rt_mutex *lock);
 
 #define spin_do_trylock(lock)		__cond_lock(lock, rt_spin_trylock(lock))
 
+#define spin_do_try_or_boost_lock(lock) __cond_lock(lock, rt_spin_trylock_or_boost(lock))
+
 #define spin_trylock(lock)			\
 ({						\
 	int __locked;				\
@@ -63,6 +66,16 @@ extern int __lockfunc __rt_spin_trylock(struct rt_mutex *lock);
 	__locked;				\
 })
 
+#define spin_trylock_or_boost(lock)			\
+({							\
+	int __locked;					\
+	migrate_disable();				\
+	__locked = spin_do_try_or_boost_lock(lock);	\
+	if (!__locked)					\
+		migrate_enable();			\
+	__locked;					\
+})
+
 #ifdef CONFIG_LOCKDEP
 # define spin_lock_nested(lock, subclass)		\
 	do {						\
@@ -153,6 +166,8 @@ static inline unsigned long spin_lock_trace_flags(spinlock_t *lock)
 # define spin_is_contended(lock)	(((void)(lock), 0))
 #endif
 
+void rt_mutex_wait_for_lock(struct task_struct *task);
+
 static inline int spin_can_lock(spinlock_t *lock)
 {
 	return !rt_mutex_is_locked(&lock->lock);
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 2822aceb8dfb..843b67f38e20 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -253,6 +253,258 @@ rt_mutex_dequeue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter)
 	RB_CLEAR_NODE(&waiter->pi_tree_entry);
 }
 
+#ifdef CONFIG_PREEMPT_RT_FULL
+/*
+ * Returns true if the task should be woken up, false otherwise.
+ */
+static inline bool rt_mutex_wake_trylock_waiter(struct rt_mutex_waiter *waiter)
+{
+	struct task_struct *task = waiter->task;
+	unsigned long flags;
+	bool wakeup;
+
+	if (likely(waiter != &task->rt_waiter))
+		return true;
+
+	/*
+	 * A task boosted current because it is within a trylock spin.
+	 * But we only want to wake up the task if it is in the
+	 * cpu_chill() sleep, and not before. When the task is ready, it
+	 * will set "wake_up" to true. Otherwise we do nothing. When the
+	 * task gets to the cpu_chill() it will not sleep because the
+	 * waiter->lock will be NULL.
+	 */
+	raw_spin_lock_irqsave(&task->pi_lock, flags);
+	rt_mutex_dequeue(waiter->lock, waiter);
+	waiter->lock = NULL;
+
+	wakeup = waiter->wake_up;
+	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
+
+	return wakeup;
+}
+
+static void __rt_mutex_adjust_prio(struct task_struct *task);
+
+/* 
+ * Called with lock->wait_lock held and must call
+ * rt_mutex_adjust_prio_chain() if an owner is returned
+ */
+static inline struct task_struct *trylock_boost_owner(struct rt_mutex *lock)
+{
+	struct rt_mutex_waiter *top_waiter = NULL;
+	struct rt_mutex_waiter *waiter;
+	struct task_struct *task = current;
+	struct task_struct *owner;
+
+	owner = rt_mutex_owner(lock);
+	if (!owner)
+		return NULL;
+
+	waiter = &task->rt_waiter;
+
+	raw_spin_lock_irq(&task->pi_lock);
+
+	/*
+	 * current is already waiting for another lock owner?
+	 * This should never happen. Don't add another.
+	 */
+	if (WARN_ON_ONCE(waiter->lock)) {
+		raw_spin_unlock_irq(&task->pi_lock);
+		return NULL;
+	}
+
+	/* Add waiter to lock */
+	waiter->task = task;
+	waiter->lock = lock;
+	waiter->prio = task->prio;
+	waiter->wake_up = false;
+	raw_spin_unlock_irq(&task->pi_lock);
+
+	if (rt_mutex_has_waiters(lock))
+		top_waiter = rt_mutex_top_waiter(lock);
+
+	rt_mutex_enqueue(lock, waiter);
+
+	if (waiter == rt_mutex_top_waiter(lock)) {
+		raw_spin_lock_irq(&owner->pi_lock);
+		if (top_waiter)
+			rt_mutex_dequeue_pi(owner, top_waiter);
+		rt_mutex_enqueue_pi(owner, waiter);
+		__rt_mutex_adjust_prio(owner);
+		raw_spin_unlock_irq(&owner->pi_lock);
+
+		/* Will be released by rt_mutex_adjust_prio_chain() */
+		get_task_struct(owner);
+
+		/* New top waiter, need to do the chain walk */
+		return owner;
+	}
+
+	return NULL;
+}
+
+/*
+ * Returns true if lock->wait_lock was released.
+ */
+static inline bool check_static_waiter(struct task_struct *task,
+				       struct rt_mutex *orig_lock, bool ok)
+{
+	struct rt_mutex_waiter *top_waiter;
+	struct rt_mutex_waiter *waiter = &task->rt_waiter;
+	struct task_struct *owner;
+	struct rt_mutex *lock;
+
+	if (likely(!waiter->lock))
+		return false;
+
+	/* Only spinlock converted to rtmutex may be called with a waiter */
+	BUG_ON(!ok);
+
+	/*
+	 * The task is about to block on a lock, and it boosted
+	 * the owner of another lock. The pi chain can not handle
+	 * the same task on the chain at once. This task is going
+	 * to go to sleep anyway, remove itself from the other lock.
+	 */
+	raw_spin_unlock(&orig_lock->wait_lock);
+
+	/*
+	 * Need to get the task->pi_lock first, as that will be
+	 * used to synchronize with the owner of the lock if it
+	 * releases the lock.
+	 */
+again:
+	raw_spin_lock_irq(&task->pi_lock);
+	/* The lock could have been released. */
+	lock = waiter->lock;
+	if (!lock) {
+		raw_spin_unlock_irq(&task->pi_lock);
+		goto out;
+	}
+
+	if (!raw_spin_trylock(&lock->wait_lock)) {
+		/* ironic isn't this. */
+		raw_spin_unlock_irq(&task->pi_lock);
+		goto again;
+	}
+
+	waiter->lock = NULL;
+
+	/*
+	 * Having the lock->wait_lock will prevent the owner from
+	 * doing anything else.
+	 */
+	raw_spin_unlock_irq(&task->pi_lock);
+
+	top_waiter = rt_mutex_top_waiter(lock);
+	rt_mutex_dequeue(lock, waiter);
+
+	owner = rt_mutex_owner(lock);
+	if (owner && waiter == top_waiter) {
+		raw_spin_lock_irq(&owner->pi_lock);
+		rt_mutex_dequeue_pi(owner, waiter);
+		if (rt_mutex_has_waiters(lock)) {
+			top_waiter = rt_mutex_top_waiter(lock);
+			rt_mutex_enqueue_pi(owner, top_waiter);
+			__rt_mutex_adjust_prio(owner);
+		}
+		raw_spin_unlock_irq(&owner->pi_lock);
+	}
+
+	raw_spin_unlock(&lock->wait_lock);
+out:
+	raw_spin_lock(&orig_lock->wait_lock);
+
+	return true;
+}
+/*
+ * Called by cpu_chill() that is after a spin_trylock_or_boost().
+ * This will sleep if the lock is still held by the task that was boosted.
+ */
+void rt_mutex_wait_for_lock(struct task_struct *task)
+{
+	struct rt_mutex_waiter *waiter = &task->rt_waiter;
+	struct rt_mutex *lock;
+
+	if (!waiter->lock) {
+		cpu_relax();
+		return;
+	}
+
+	/* The pi_lock will prevent the owner from doing anything */
+	raw_spin_lock_irq(&task->pi_lock);
+	lock = waiter->lock;
+	if (!lock) {
+		raw_spin_unlock_irq(&task->pi_lock);
+		return;
+	}
+
+	task->saved_state = task->state;
+	/* We let anything wake this up. */
+	__set_current_state_no_track(TASK_INTERRUPTIBLE);
+
+	waiter->wake_up = true;
+	raw_spin_unlock_irq(&task->pi_lock);
+
+	if (waiter->lock)
+		schedule();
+
+again:
+	raw_spin_lock_irq(&task->pi_lock);
+	waiter->wake_up = false;
+
+	/* Make sure it still does not have the lock */
+	if (waiter->lock) {
+		struct rt_mutex_waiter *top_waiter = NULL;
+		struct task_struct *owner;
+
+		lock = waiter->lock;
+		if (!raw_spin_trylock(&lock->wait_lock)) {
+			raw_spin_unlock_irq(&task->pi_lock);
+			goto again;
+		}
+		
+		top_waiter = rt_mutex_top_waiter(lock);
+		rt_mutex_dequeue(lock, waiter);
+		waiter->lock = NULL;
+
+		owner = rt_mutex_owner(lock);
+		if (owner && waiter == top_waiter) {
+			raw_spin_unlock(&task->pi_lock);
+			raw_spin_lock(&owner->pi_lock);
+			rt_mutex_dequeue_pi(owner, waiter);
+			if (rt_mutex_has_waiters(lock)) {
+				top_waiter = rt_mutex_top_waiter(lock);
+				rt_mutex_enqueue_pi(owner, top_waiter);
+			}
+			raw_spin_unlock(&owner->pi_lock);
+			raw_spin_lock(&task->pi_lock);
+		}
+		raw_spin_unlock(&lock->wait_lock);
+	}
+		
+	__set_current_state_no_track(task->saved_state);
+	task->saved_state = TASK_RUNNING;
+	raw_spin_unlock_irq(&task->pi_lock);
+}
+
+#else
+static inline struct task_struct *trylock_boost_owner(struct rt_mutex *lock)
+{
+	return NULL;
+}
+static inline bool rt_mutex_wake_trylock_waiter(struct rt_mutex_waiter *waiter)
+{
+	return true;
+}
+static inline bool check_static_waiter(struct task_struct *task,
+				       struct rt_mutex *lock, bool ok)
+{
+	return false;
+}
+#endif
+
 /*
  * Calculate task priority from the waiter tree priority
  *
@@ -1081,7 +1333,7 @@ static void  noinline __sched rt_spin_lock_slowunlock(struct rt_mutex *lock)
 
 	raw_spin_unlock(&lock->wait_lock);
 
-	/* Undo pi boosting.when necessary */
+
 	rt_mutex_adjust_prio(current);
 }
 
@@ -1148,6 +1400,16 @@ int __lockfunc rt_spin_trylock(spinlock_t *lock)
 }
 EXPORT_SYMBOL(rt_spin_trylock);
 
+int __lockfunc rt_spin_trylock_or_boost(spinlock_t *lock)
+{
+	int ret = rt_mutex_try_or_boost_lock(&lock->lock);
+
+	if (ret)
+		spin_acquire(&lock->dep_map, 0, 1, _RET_IP_);
+	return ret;
+}
+EXPORT_SYMBOL(rt_spin_trylock_or_boost);
+
 int __lockfunc rt_spin_trylock_bh(spinlock_t *lock)
 {
 	int ret;
@@ -1392,6 +1654,9 @@ static void wakeup_next_waiter(struct rt_mutex *lock)
 
 	raw_spin_unlock_irqrestore(&current->pi_lock, flags);
 
+	if (!rt_mutex_wake_trylock_waiter(waiter))
+		return;
+
 	/*
 	 * It's safe to dereference waiter as it cannot go away as
 	 * long as we hold lock->wait_lock. The waiter task needs to
@@ -1655,6 +1920,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
 
 	rt_mutex_init_waiter(&waiter, false);
 
+ try_again:
 	raw_spin_lock(&lock->wait_lock);
 
 	/* Try to acquire the lock again: */
@@ -1665,6 +1931,19 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
 		return 0;
 	}
 
+	/*
+	 * If the task boosted another task with trylock_or_boost
+	 * it can not block with that in use. The check_static_waiter()
+	 * will remove the boosting of the trylock if needed. But to
+	 * do so will require unlocking lock->wait_lock, and then
+	 * the state of the lock may change. If the wait_lock is
+	 * released, the lock is tried again.
+	 */
+	if (unlikely(check_static_waiter(current, lock, true))) {
+		raw_spin_unlock(&lock->wait_lock);
+		goto try_again;
+	}
+
 	set_current_state(state);
 
 	/* Setup the timer, when timeout != NULL */
@@ -1717,26 +1996,34 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
 /*
  * Slow path try-lock function:
  */
-static inline int rt_mutex_slowtrylock(struct rt_mutex *lock)
+static inline int rt_mutex_slowtrylock(struct rt_mutex *lock, bool boost)
 {
+	struct task_struct *owner;
 	int ret;
 
 	/*
 	 * If the lock already has an owner we fail to get the lock.
 	 * This can be done without taking the @lock->wait_lock as
 	 * it is only being read, and this is a trylock anyway.
+	 *
+	 * Only do the short cut if we do not need to boost the task
+	 * if we fail to get the lock.
 	 */
-	if (rt_mutex_owner(lock))
+	if (!boost && rt_mutex_owner(lock))
 		return 0;
 
 	/*
-	 * The mutex has currently no owner. Lock the wait lock and
-	 * try to acquire the lock.
+	 * The mutex has currently no owner or we need to boost the task
+	 * if we fail to grab the lock. Lock the wait lock and try to
+	 * acquire the lock.
 	 */
 	raw_spin_lock(&lock->wait_lock);
 
 	ret = try_to_take_rt_mutex(lock, current, NULL);
 
+	if (!ret && boost)
+		owner = trylock_boost_owner(lock);
+
 	/*
 	 * try_to_take_rt_mutex() sets the lock waiters bit
 	 * unconditionally. Clean this up.
@@ -1745,6 +2032,10 @@ static inline int rt_mutex_slowtrylock(struct rt_mutex *lock)
 
 	raw_spin_unlock(&lock->wait_lock);
 
+	if (!ret && boost && owner)
+		rt_mutex_adjust_prio_chain(owner, RT_MUTEX_MIN_CHAINWALK,
+					   lock, NULL, NULL, NULL);
+
 	return ret;
 }
 
@@ -1852,13 +2143,14 @@ rt_mutex_timed_fastlock(struct rt_mutex *lock, int state,
 
 static inline int
 rt_mutex_fasttrylock(struct rt_mutex *lock,
-		     int (*slowfn)(struct rt_mutex *lock))
+		     int (*slowfn)(struct rt_mutex *lock, bool boost),
+		     bool boost)
 {
 	if (likely(rt_mutex_cmpxchg(lock, NULL, current))) {
 		rt_mutex_deadlock_account_lock(lock, current);
 		return 1;
 	}
-	return slowfn(lock);
+	return slowfn(lock, boost);
 }
 
 static inline void
@@ -1969,11 +2261,24 @@ EXPORT_SYMBOL_GPL(rt_mutex_timed_lock);
  */
 int __sched rt_mutex_trylock(struct rt_mutex *lock)
 {
-	return rt_mutex_fasttrylock(lock, rt_mutex_slowtrylock);
+	return rt_mutex_fasttrylock(lock, rt_mutex_slowtrylock, false);
 }
 EXPORT_SYMBOL_GPL(rt_mutex_trylock);
 
 /**
+ * rt_mutex_try_or_boost_lock - try to lock a rt_mutex or boost the owner
+ *
+ * @lock:	the rt_mutex to be locked
+ *
+ * Returns 1 on success and 0 on contention
+ */
+int __sched rt_mutex_try_or_boost_lock(struct rt_mutex *lock)
+{
+	return rt_mutex_fasttrylock(lock, rt_mutex_slowtrylock, true);
+}
+EXPORT_SYMBOL_GPL(rt_mutex_try_or_boost_lock);
+
+/**
  * rt_mutex_unlock - unlock a rt_mutex
  *
  * @lock: the rt_mutex to be unlocked
@@ -2127,6 +2432,8 @@ int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 	raw_spin_unlock_irq(&task->pi_lock);
 #endif
 
+	check_static_waiter(task, lock, false);
+
 	/* We enforce deadlock detection for futexes */
 	ret = task_blocks_on_rt_mutex(lock, waiter, task,
 				      RT_MUTEX_FULL_CHAINWALK);
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 4d317e9a5d0f..338e7fa4bbfa 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -37,28 +37,6 @@ extern void schedule_rt_mutex_test(struct rt_mutex *lock);
 #endif
 
 /*
- * This is the control structure for tasks blocked on a rt_mutex,
- * which is allocated on the kernel stack on of the blocked task.
- *
- * @tree_entry:		pi node to enqueue into the mutex waiters tree
- * @pi_tree_entry:	pi node to enqueue into the mutex owner waiters tree
- * @task:		task reference to the blocked task
- */
-struct rt_mutex_waiter {
-	struct rb_node          tree_entry;
-	struct rb_node          pi_tree_entry;
-	struct task_struct	*task;
-	struct rt_mutex		*lock;
-	bool			savestate;
-#ifdef CONFIG_DEBUG_RT_MUTEXES
-	unsigned long		ip;
-	struct pid		*deadlock_task_pid;
-	struct rt_mutex		*deadlock_lock;
-#endif
-	int prio;
-};
-
-/*
  * Various helpers to access the waiters-tree:
  */
 static inline int rt_mutex_has_waiters(struct rt_mutex *lock)
-- 
2.5.1





* [RFC][PATCH RT v2 2/3] [PATCH 2/3] rt: Make cpu_chill() call rt_mutex_wait_for_lock() and add new cpu_rest() as msleep(1)
From: Steven Rostedt @ 2015-09-10 14:50 UTC (permalink / raw)
  To: linux-kernel, linux-rt-users
  Cc: Thomas Gleixner, Carsten Emde, Sebastian Andrzej Siewior,
	John Kacur, Paul Gortmaker, Peter Zijlstra, Clark Williams,
	Arnaldo Carvalho de Melo, Ingo Molnar


Make cpu_chill() call rt_mutex_wait_for_lock(), which will only sleep if
the owner of a lock boosted by spin_trylock_or_boost() still holds the lock.
The owner will then wake up the caller of cpu_chill().

As there are still locations that use cpu_chill() to spin on status events
such as bits (not locks), where the updater is not known and can not
be boosted, add a new cpu_rest() that takes over the msleep(1) behavior.
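
An illustrative sketch of where cpu_rest() fits (SOME_FLAG and obj are
made-up names; compare the __mnt_want_write() hunk below, which spins on
MNT_WRITE_HOLD):

	/* No lock owner to boost here, only a status bit set by an unknown updater */
	while (test_bit(SOME_FLAG, &obj->flags))
		cpu_rest();	/* still an msleep(1) on PREEMPT_RT */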

Hopefully, we can get rid of cpu_rest() and find a way to know which tasks
need priority boosting, and perhaps add another API that allows boosting
of the updater so the current task can continue to spin instead of sleeping.

Also convert trylock spinners over to spin_trylock_or_boost(); those call
sites then call cpu_chill() afterward.

When trying to take locks in reverse order, it is possible on PREEMPT_RT
that the running task has preempted the owner and never lets it run,
creating a livelock. This is because spinlocks in PREEMPT_RT can be
preempted.

Currently, this is solved by calling cpu_chill(), which on PREEMPT_RT is
converted into an msleep(1), and we just hope that the owner will have time
to release the lock, and that nobody else will take it when the task wakes up.

By converting these to spin_trylock_or_boost(), which boosts the owner,
cpu_chill() must now be called afterward; it sleeps only if the owner still
holds the lock, and the owner then does the wake up.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 block/blk-ioc.c        |  4 ++--
 fs/autofs4/expire.c    |  2 +-
 fs/dcache.c            |  6 +++---
 fs/namespace.c         |  2 +-
 include/linux/delay.h  | 13 +++++++++++++
 kernel/time/hrtimer.c  | 17 +++++++++++++++--
 kernel/workqueue.c     |  2 +-
 net/packet/af_packet.c |  4 ++--
 net/rds/ib_rdma.c      |  2 +-
 9 files changed, 39 insertions(+), 13 deletions(-)

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 28f467e636cc..93cf668ca314 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -105,7 +105,7 @@ static void ioc_release_fn(struct work_struct *work)
 						struct io_cq, ioc_node);
 		struct request_queue *q = icq->q;
 
-		if (spin_trylock(q->queue_lock)) {
+		if (spin_trylock_or_boost(q->queue_lock)) {
 			ioc_destroy_icq(icq);
 			spin_unlock(q->queue_lock);
 		} else {
@@ -183,7 +183,7 @@ retry:
 	hlist_for_each_entry(icq, &ioc->icq_list, ioc_node) {
 		if (icq->flags & ICQ_EXITED)
 			continue;
-		if (spin_trylock(icq->q->queue_lock)) {
+		if (spin_trylock_or_boost(icq->q->queue_lock)) {
 			ioc_exit_icq(icq);
 			spin_unlock(icq->q->queue_lock);
 		} else {
diff --git a/fs/autofs4/expire.c b/fs/autofs4/expire.c
index d487fa27add5..79eef1e3157e 100644
--- a/fs/autofs4/expire.c
+++ b/fs/autofs4/expire.c
@@ -148,7 +148,7 @@ again:
 			}
 
 			parent = p->d_parent;
-			if (!spin_trylock(&parent->d_lock)) {
+			if (!spin_trylock_or_boost(&parent->d_lock)) {
 				spin_unlock(&p->d_lock);
 				cpu_chill();
 				goto relock;
diff --git a/fs/dcache.c b/fs/dcache.c
index c1dad92434d5..151d2db0ded7 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -573,12 +573,12 @@ static struct dentry *dentry_kill(struct dentry *dentry)
 	struct inode *inode = dentry->d_inode;
 	struct dentry *parent = NULL;
 
-	if (inode && unlikely(!spin_trylock(&inode->i_lock)))
+	if (inode && unlikely(!spin_trylock_or_boost(&inode->i_lock)))
 		goto failed;
 
 	if (!IS_ROOT(dentry)) {
 		parent = dentry->d_parent;
-		if (unlikely(!spin_trylock(&parent->d_lock))) {
+		if (unlikely(!spin_trylock_or_boost(&parent->d_lock))) {
 			if (inode)
 				spin_unlock(&inode->i_lock);
 			goto failed;
@@ -2394,7 +2394,7 @@ again:
 	inode = dentry->d_inode;
 	isdir = S_ISDIR(inode->i_mode);
 	if (dentry->d_lockref.count == 1) {
-		if (!spin_trylock(&inode->i_lock)) {
+		if (!spin_trylock_or_boost(&inode->i_lock)) {
 			spin_unlock(&dentry->d_lock);
 			cpu_chill();
 			goto again;
diff --git a/fs/namespace.c b/fs/namespace.c
index 28937028f3a5..24769de44041 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -356,7 +356,7 @@ int __mnt_want_write(struct vfsmount *m)
 	smp_mb();
 	while (ACCESS_ONCE(mnt->mnt.mnt_flags) & MNT_WRITE_HOLD) {
 		preempt_enable();
-		cpu_chill();
+		cpu_rest();
 		preempt_disable();
 	}
 	/*
diff --git a/include/linux/delay.h b/include/linux/delay.h
index 37caab306336..b787f4b11243 100644
--- a/include/linux/delay.h
+++ b/include/linux/delay.h
@@ -53,9 +53,22 @@ static inline void ssleep(unsigned int seconds)
 }
 
 #ifdef CONFIG_PREEMPT_RT_FULL
+/*
+ * Use cpu_chill() after a spin_trylock_or_boost(), which will boost the owner
+ * of the lock to the caller's priority (if needed); cpu_chill() will then
+ * act like a sched_yield(), allowing the owner to proceed.
+ */
 extern void cpu_chill(void);
+/*
+ * Use cpu_rest() if there's no way to find out who the owner is that you are
+ * waiting for (e.g. spinning on a status variable or bit). This is equivalent
+ * to an msleep(1), and you can hope that the status will have changed by the
+ * time you wake up.
+ */
+extern void cpu_rest(void);
 #else
 # define cpu_chill()	cpu_relax()
+# define cpu_rest()	cpu_relax()
 #endif
 
 #endif /* defined(_LINUX_DELAY_H) */
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 2c6be169bdc7..6fd780ffa5d8 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1881,7 +1881,7 @@ SYSCALL_DEFINE2(nanosleep, struct timespec __user *, rqtp,
 /*
  * Sleep for 1 ms in hope whoever holds what we want will let it go.
  */
-void cpu_chill(void)
+void cpu_rest(void)
 {
 	struct timespec tu = {
 		.tv_nsec = NSEC_PER_MSEC,
@@ -1894,8 +1894,21 @@ void cpu_chill(void)
 	if (!freeze_flag)
 		current->flags &= ~PF_NOFREEZE;
 }
+EXPORT_SYMBOL(cpu_rest);
+
+/*
+ * Used after a spin_trylock_or_boost(), which should boost the owner
+ * of the lock to the priority of the current task (if needed), and
+ * this will yield the current task to the owner if the owner is on
+ * current's CPU.
+ */
+void cpu_chill(void)
+{
+	rt_mutex_wait_for_lock(current);
+}
 EXPORT_SYMBOL(cpu_chill);
-#endif
+
+#endif /* CONFIG_PREEMPT_RT_FULL */
 
 /*
  * Functions related to boot-time initialization:
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 21daecdfd86d..32b4d73349dd 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1260,7 +1260,7 @@ fail:
 	local_unlock_irqrestore(pendingb_lock, *flags);
 	if (work_is_canceling(work))
 		return -ENOENT;
-	cpu_chill();
+	cpu_rest();
 	return -EAGAIN;
 }
 
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index ef1eb20504a7..e906044802c8 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -699,7 +699,7 @@ static void prb_retire_rx_blk_timer_expired(unsigned long data)
 	if (BLOCK_NUM_PKTS(pbd)) {
 		while (atomic_read(&pkc->blk_fill_in_prog)) {
 			/* Waiting for skb_copy_bits to finish... */
-			cpu_chill();
+			cpu_rest();
 		}
 	}
 
@@ -961,7 +961,7 @@ static void prb_retire_current_block(struct tpacket_kbdq_core *pkc,
 		if (!(status & TP_STATUS_BLK_TMO)) {
 			while (atomic_read(&pkc->blk_fill_in_prog)) {
 				/* Waiting for skb_copy_bits to finish... */
-				cpu_chill();
+				cpu_rest();
 			}
 		}
 		prb_close_block(pkc, pbd, po, status);
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index c8faaf36423a..31fe3b8b4cde 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -287,7 +287,7 @@ static inline void wait_clean_list_grace(void)
 	for_each_online_cpu(cpu) {
 		flag = &per_cpu(clean_list_grace, cpu);
 		while (test_bit(CLEAN_LIST_BUSY_BIT, flag))
-			cpu_chill();
+			cpu_rest();
 	}
 }
 
-- 
2.5.1




* [RFC][PATCH RT v2 3/3] [PATCH 3/3] rtmutex: Wake up all top trylock waiters on unlock
  2015-09-10 14:50 [RFC][PATCH RT v2 0/3] RT: Fix trylock deadlock without msleep() hack Steven Rostedt
  2015-09-10 14:50   ` Steven Rostedt
  2015-09-10 14:50 ` [RFC][PATCH RT v2 2/3] [PATCH 2/3] rt: Make cpu_chill() call rt_mutex_wait_for_lock() and add new cpu_rest() as msleep(1) Steven Rostedt
@ 2015-09-10 14:50 ` Steven Rostedt
  2 siblings, 0 replies; 5+ messages in thread
From: Steven Rostedt @ 2015-09-10 14:50 UTC (permalink / raw)
  To: linux-kernel, linux-rt-users
  Cc: Thomas Gleixner, Carsten Emde, Sebastian Andrzej Siewior,
	John Kacur, Paul Gortmaker, Peter Zijlstra, Clark Williams,
	Arnaldo Carvalho de Melo, Ingo Molnar

[-- Attachment #1: 0003-rtmutex-Wake-up-all-top-trylock-waiters-on-unlock.patch --]
[-- Type: text/plain, Size: 4459 bytes --]

A task that boosts the owner of a lock via spin_trylock_or_boost() is not a
real waiter on the lock in non-PREEMPT_RT code; there, the task is just
spinning. But on PREEMPT_RT the subsequent call to cpu_chill() will touch the
lock, and there is nothing that keeps the lock itself from going away.

Since the owner was boosted via a trylock, the trylock must have been
attempted while another lock was held, out of the normal locking order. That
means that, if nothing else is using the boosted lock, there may be a code
path that can make the lock disappear. Here's a fictitious example:

    CPU0	     CPU1		CPU2
    ----	     ----		----
   [task0]	     [task1]		[task2]

  lock(dev->A)
		     lock(B)
		     trylock(dev->A)
		     unlock(B)
		     goto again
					lock(B)
					trylock(dev->A)
					unlock(B)
					goto again
  unlock(dev->A)
    wake(task1)
    remove_task1_links
  lock(B)
  free(dev)
  unlock(B)

At this moment, although task1 is running and ready to go, task2 is still on
dev->A->wait_list, and that will cause a panic when task2 does a cpu_chill().

Things are fine as long as there's a waiter that came from an rtmutex_lock().
So on unlock, wake all the top trylock waiters until a task is found that is
blocked on the rtmutex itself.
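
In rough pseudocode, the unlock side below does the following. Here
detach_and_wake() is just a stand-in for the dequeue/wakeup steps in the hunk,
and the pi_lock handling and reference counting are omitted:

        waiter = rt_mutex_top_waiter(lock);
        for (;;) {
                if (waiter != &waiter->task->rt_waiter)
                        break;          /* a real rtmutex_lock() waiter; the
                                           normal wakeup path handles it */
                /* trylock waiter: clear its lock pointer and wake it */
                detach_and_wake(waiter);
                if (!rt_mutex_has_waiters(lock))
                        break;          /* the wait list is empty */
                waiter = rt_mutex_top_waiter(lock);
        }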

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/locking/rtmutex.c | 65 +++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 59 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 843b67f38e20..f26eebe5de87 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -254,17 +254,25 @@ rt_mutex_dequeue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter)
 }
 
 #ifdef CONFIG_PREEMPT_RT_FULL
+static void rt_mutex_wake_waiter(struct rt_mutex_waiter *waiter);
 /*
- * Returns true if the task should be woken up, false otherwise.
+ * Returns true if this is a trylock waiter.
  */
 static inline bool rt_mutex_wake_trylock_waiter(struct rt_mutex_waiter *waiter)
 {
-	struct task_struct *task = waiter->task;
+	struct task_struct *task;
+	struct rt_mutex *lock;
 	unsigned long flags;
 	bool wakeup;
+	bool trylock_waiter = false;
+
+again:
+	task = waiter->task;
 
 	if (likely(waiter != &task->rt_waiter))
-		return true;
+		return trylock_waiter;
+
+	trylock_waiter = true;
 
 	/*
 	 * A task boosted current because it is within a trylock spin.
@@ -276,12 +284,57 @@ static inline bool rt_mutex_wake_trylock_waiter(struct rt_mutex_waiter *waiter)
 	 */
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
 	rt_mutex_dequeue(waiter->lock, waiter);
+	lock = waiter->lock;
 	waiter->lock = NULL;
 
 	wakeup = waiter->wake_up;
+	get_task_struct(task);
 	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 
-	return wakeup;
+	if (wakeup)
+		rt_mutex_wake_waiter(waiter);
+
+	put_task_struct(task);
+
+	/*
+	 * All tasks that are trylock waiters need to be woken up,
+	 * otherwise there's a chance that the lock may go away from
+	 * under them. Here's the scenario:
+	 *
+	 *    CPU0	     CPU1		CPU2
+	 *    ----	     ----		----
+	 *   [task0]	     [task1]		[task2]
+	 *
+	 *  lock(dev->A)
+	 *		     lock(B)
+	 *		     trylock(dev->A)
+	 *		     unlock(B)
+	 *		     goto again
+	 *					lock(B)
+	 *					trylock(dev->A)
+	 *					unlock(B)
+	 *					goto again
+	 *  unlock(dev->A)
+	 *    wake(task1)
+	 *    remove_task1_links
+	 *  lock(B)
+	 *  free(dev)
+	 *  unlock(B)
+	 *
+	 * At this moment, although task1 is running and ready
+	 * to go, task2 is still on dev->wait_list, and that will
+	 * cause a panic when task2 does a cpu_chill().
+	 *
+	 * Things are fine as long as there's a waiter that is
+	 * from a rtmutex_lock(). Keep waking tasks until we find
+	 * a rtmutex_lock() waiter.
+	 */
+
+	if (!rt_mutex_has_waiters(lock))
+		return true;
+
+	waiter = rt_mutex_top_waiter(lock);
+	goto again;
 }
 
 static void __rt_mutex_adjust_prio(struct task_struct *task);
@@ -496,7 +549,7 @@ static inline struct task_struct *trylock_boost_owner(struct rt_mutex *lock)
 }
 static inline bool rt_mutex_wake_trylock_waiter(struct rt_mutex_waiter *waiter)
 {
-	return true;
+	return false;
 }
 static inline bool check_static_waiter(struct task_struct *task,
 				       struct rt_mutex *lock, bool ok)
@@ -1654,7 +1707,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock)
 
 	raw_spin_unlock_irqrestore(&current->pi_lock, flags);
 
-	if (!rt_mutex_wake_trylock_waiter(waiter))
+	if (rt_mutex_wake_trylock_waiter(waiter))
 		return;
 
 	/*
-- 
2.5.1




