* [PATCH -tip 0/3] locking/rtmutex: Another crack at spin on owner
@ 2016-03-08 18:20 Davidlohr Bueso
  2016-03-08 18:20 ` [PATCH 1/3] rtmutex: Delete save_state member of struct rt_mutex Davidlohr Bueso
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Davidlohr Bueso @ 2016-03-08 18:20 UTC (permalink / raw)
  To: tglx, mingo; +Cc: peterz, bigeasy, umgwanakikbuti, paulmck, dave, linux-kernel

Hi,

This is a (painfully late) followup to a proposal from some time ago to add
spin-on-owner to rtmutexes. My first attempt was rather liberal in that
it tried to avoid the pi-dance and let the lock be stolen. However, due to
-rt constraints, this series only deals with the top waiter, based on what
we do in the preempt-rt patchset.

The first two patches are trivial, and the whole patchset has survived a
week of simultaneous locktorture+pi_stress pounding without anything
breaking. That said, I'm sure it needs more testing and eyeballs; these
paths make my head hurt.

Thanks!

Davidlohr Bueso (3):
  rtmutex: Delete save_state member of struct rt_mutex
  rtmutex: Add rt_mutex_init_waiter helper
  rtmutex: Reduce top-waiter blocking on a lock

 include/linux/rtmutex.h         |  1 -
 kernel/Kconfig.locks            |  4 ++
 kernel/futex.c                  |  5 +--
 kernel/locking/rtmutex.c        | 82 +++++++++++++++++++++++++++++++++++------
 kernel/locking/rtmutex_common.h | 17 ++++++++-
 5 files changed, 92 insertions(+), 17 deletions(-)

-- 
2.1.4


* [PATCH 1/3] rtmutex: Delete save_state member of struct rt_mutex
  2016-03-08 18:20 [PATCH -tip 0/3] locking/rtmutex: Another crack at spin on owner Davidlohr Bueso
@ 2016-03-08 18:20 ` Davidlohr Bueso
  2016-03-08 18:20 ` [PATCH 2/3] rtmutex: Add rt_mutex_init_waiter helper Davidlohr Bueso
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 18+ messages in thread
From: Davidlohr Bueso @ 2016-03-08 18:20 UTC (permalink / raw)
  To: tglx, mingo
  Cc: peterz, bigeasy, umgwanakikbuti, paulmck, dave, linux-kernel,
	Davidlohr Bueso

From: Davidlohr Bueso <dbueso@suse.de>

This (debug) field is unused. Furthermore, it looks like a leftover
from porting rtmutex from -rt into upstream, where it serves to
determine whether the wakeup is for a task blocked on a "sleeping
spinlock", iow whether this is a regular rt_mutex_lock() or
rt_spin_lock().

Of course, upstream we only have regular rt_mutexes, thus we can
safely assume nothing will be done with this field anyway. There is
also the fact that it only exists under CONFIG_DEBUG_RT_MUTEXES.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 include/linux/rtmutex.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index 1abba5c..9c50e2e 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -32,7 +32,6 @@ struct rt_mutex {
 	struct rb_node          *waiters_leftmost;
 	struct task_struct	*owner;
 #ifdef CONFIG_DEBUG_RT_MUTEXES
-	int			save_state;
 	const char 		*name, *file;
 	int			line;
 	void			*magic;
-- 
2.1.4


* [PATCH 2/3] rtmutex: Add rt_mutex_init_waiter helper
  2016-03-08 18:20 [PATCH -tip 0/3] locking/rtmutex: Another crack at spin on owner Davidlohr Bueso
  2016-03-08 18:20 ` [PATCH 1/3] rtmutex: Delete save_state member of struct rt_mutex Davidlohr Bueso
@ 2016-03-08 18:20 ` Davidlohr Bueso
  2016-03-14 13:16   ` Peter Zijlstra
  2016-03-08 18:20 ` [PATCH 3/3] rtmutex: Reduce top-waiter blocking on a lock Davidlohr Bueso
  2016-03-08 22:05 ` [PATCH 4/3] rtmutex: Avoid barrier in rt_mutex_handle_deadlock Davidlohr Bueso
  3 siblings, 1 reply; 18+ messages in thread
From: Davidlohr Bueso @ 2016-03-08 18:20 UTC (permalink / raw)
  To: tglx, mingo
  Cc: peterz, bigeasy, umgwanakikbuti, paulmck, dave, linux-kernel,
	Davidlohr Bueso

From: Davidlohr Bueso <dbueso@suse.de>

... encapsulates debug and regular waiter initialization. In the
rt_mutex_slowlock() case, we now also initialize the waiter's task to
NULL until it is later explicitly set to the blocking task. This is
safe as the waiter is on the stack and we are only doing very basic
initialization anyway.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 kernel/futex.c                  |  5 +----
 kernel/locking/rtmutex.c        |  4 +---
 kernel/locking/rtmutex_common.h | 15 +++++++++++++++
 3 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index bae542e..0ff94a7 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2778,10 +2778,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 	 * The waiter is allocated on our stack, manipulated by the requeue
 	 * code while we sleep on uaddr.
 	 */
-	debug_rt_mutex_init_waiter(&rt_waiter);
-	RB_CLEAR_NODE(&rt_waiter.pi_tree_entry);
-	RB_CLEAR_NODE(&rt_waiter.tree_entry);
-	rt_waiter.task = NULL;
+	rt_mutex_init_waiter(&rt_waiter);
 
 	ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE);
 	if (unlikely(ret != 0))
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 3e74660..60fa79a 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1171,9 +1171,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
 	unsigned long flags;
 	int ret = 0;
 
-	debug_rt_mutex_init_waiter(&waiter);
-	RB_CLEAR_NODE(&waiter.pi_tree_entry);
-	RB_CLEAR_NODE(&waiter.tree_entry);
+	rt_mutex_init_waiter(&waiter);
 
 	/*
 	 * Technically we could use raw_spin_[un]lock_irq() here, but this can
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 4f5f83c..c714737 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -119,4 +119,19 @@ extern void rt_mutex_adjust_prio(struct task_struct *task);
 # include "rtmutex.h"
 #endif
 
+/*
+ * Waiter structure basic initialization. A waiter is not considered
+ * actually usable until after calling task_blocks_on_rt_mutex(),
+ * which sets up the relevant entries.
+ */
+static inline void
+rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
+{
+       debug_rt_mutex_init_waiter(waiter);
+
+       RB_CLEAR_NODE(&waiter->pi_tree_entry);
+       RB_CLEAR_NODE(&waiter->tree_entry);
+       waiter->task = NULL;
+}
+
 #endif
-- 
2.1.4


* [PATCH 3/3] rtmutex: Reduce top-waiter blocking on a lock
  2016-03-08 18:20 [PATCH -tip 0/3] locking/rtmutex: Another crack at spin on owner Davidlohr Bueso
  2016-03-08 18:20 ` [PATCH 1/3] rtmutex: Delete save_state member of struct rt_mutex Davidlohr Bueso
  2016-03-08 18:20 ` [PATCH 2/3] rtmutex: Add rt_mutex_init_waiter helper Davidlohr Bueso
@ 2016-03-08 18:20 ` Davidlohr Bueso
  2016-03-14 13:23   ` Peter Zijlstra
  2016-03-08 22:05 ` [PATCH 4/3] rtmutex: Avoid barrier in rt_mutex_handle_deadlock Davidlohr Bueso
  3 siblings, 1 reply; 18+ messages in thread
From: Davidlohr Bueso @ 2016-03-08 18:20 UTC (permalink / raw)
  To: tglx, mingo
  Cc: peterz, bigeasy, umgwanakikbuti, paulmck, dave, linux-kernel,
	Davidlohr Bueso

From: Davidlohr Bueso <dbueso@suse.de>

By applying well-known spin-on-lock-owner techniques, we can avoid the
blocking overhead while a task is trying to take the rtmutex. The idea
is that as long as the owner is running, there is a fair chance it will
release the lock soon, and thus a task trying to acquire the rtmutex is
better off spinning instead of blocking immediately after the fastpath.
This is similar to what we use for other locks, borrowed from -rt. The
main difference (due to the obvious realtime constraints) is that
top-waiter spinning must account for any new higher-priority waiter, and
therefore cannot steal the lock and avoid the pi-dance. As such there
will be at most one spinning waiter on a contended lock.

Conditions to stop spinning and block are simple:

(1) Upon need_resched()
(2) Current lock owner blocks

The unlock side remains unchanged, as wake_up_process() can safely deal
with calls where the task is not actually blocked (TASK_NORMAL). As such,
the only cost is some unnecessary overhead dealing with the wake_q, but
this ensures we do not miss any wakeups between the spinning step and the
unlock side.
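
For illustration, the relevant part of the (unchanged) unlock slow path,
heavily condensed -- this is a sketch, not verbatim kernel code:

	raw_spin_lock_irqsave(&lock->wait_lock, flags);
	/*
	 * Queue the top waiter on the wake_q and leave RT_MUTEX_HAS_WAITERS
	 * as the owner value, forcing new acquirers into the slow path.
	 */
	mark_wakeup_next_waiter(&wake_q, lock);
	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);

	/*
	 * If the top waiter is blocked, this wakes it up; if it is instead
	 * spinning on ->owner (and thus TASK_RUNNING), this is merely a
	 * redundant wakeup -- harmless, and no wakeup can be lost.
	 */
	wake_up_q(&wake_q);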

Measuring the number of inversions performed by the pi_stress program,
there is a rather consistent improvement in throughput during a 30-second
window. On a 32-core box, with increasing thread-group counts:

pistress
                            4.4.3                 4.4.3
                          vanilla           rtmutex-spinv1
Hmean    1   2321586.73 (  0.00%)  2339847.23 (  0.79%)
Hmean    4   8209026.49 (  0.00%)  8597809.55 (  4.74%)
Hmean    7  12655322.45 (  0.00%) 13194896.45 (  4.26%)
Hmean    12  4210477.03 (  0.00%)  4348643.08 (  3.28%)
Hmean    21  2996823.05 (  0.00%)  3104513.47 (  3.59%)
Hmean    30  2463107.53 (  0.00%)  2584275.71 (  4.91%)
Hmean    48  2656668.46 (  0.00%)  2719324.53 (  2.36%)
Hmean    64  2397253.65 (  0.00%)  2471628.92 (  3.10%)
Stddev   1    653473.88 (  0.00%)   527076.59 (-19.34%)
Stddev   4    664995.50 (  0.00%)   359487.15 (-45.94%)
Stddev   7    248476.88 (  0.00%)   278307.31 ( 12.01%)
Stddev   12    74537.42 (  0.00%)    54305.86 (-27.14%)
Stddev   21    72143.80 (  0.00%)    40371.42 (-44.04%)
Stddev   30    31981.43 (  0.00%)    42306.07 ( 32.28%)
Stddev   48    21317.95 (  0.00%)    42608.50 ( 99.87%)
Stddev   64    23433.99 (  0.00%)    21502.56 ( -8.24%)

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 kernel/Kconfig.locks            |  4 +++
 kernel/locking/rtmutex.c        | 78 ++++++++++++++++++++++++++++++++++++-----
 kernel/locking/rtmutex_common.h |  2 +-
 3 files changed, 75 insertions(+), 9 deletions(-)

diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index ebdb004..8602871 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -227,6 +227,10 @@ config MUTEX_SPIN_ON_OWNER
 	def_bool y
 	depends on SMP && !DEBUG_MUTEXES && ARCH_SUPPORTS_ATOMIC_RMW
 
+config RT_MUTEX_SPIN_ON_OWNER
+	def_bool y
+	depends on SMP && RT_MUTEXES && !DEBUG_RT_MUTEXES && ARCH_SUPPORTS_ATOMIC_RMW
+
 config RWSEM_SPIN_ON_OWNER
        def_bool y
        depends on SMP && RWSEM_XCHGADD_ALGORITHM && ARCH_SUPPORTS_ATOMIC_RMW
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 60fa79a..8eb99b7 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -69,6 +69,48 @@ static void fixup_rt_mutex_waiters(struct rt_mutex *lock)
 		clear_rt_mutex_waiters(lock);
 }
 
+#ifdef CONFIG_RT_MUTEX_SPIN_ON_OWNER
+static bool rt_mutex_spin_on_owner(struct rt_mutex *lock,
+				   struct task_struct *owner)
+{
+	bool ret = true;
+
+	/*
+	 * The last owner could have just released the lock,
+	 * immediately try taking it again.
+	 */
+	if (!owner)
+		goto done;
+
+	rcu_read_lock();
+	while (rt_mutex_owner(lock) == owner) {
+		/*
+		 * Ensure we emit the owner->on_cpu, dereference _after_
+		 * checking lock->owner still matches owner. If that fails,
+		 * owner might point to freed memory. If it still matches,
+		 * the rcu_read_lock() ensures the memory stays valid.
+		 */
+		barrier();
+		if (!owner->on_cpu || need_resched()) {
+			ret = false;
+			break;
+		}
+
+		cpu_relax_lowlatency();
+	}
+	rcu_read_unlock();
+done:
+	return ret;
+}
+
+#else
+static bool rt_mutex_spin_on_owner(struct rt_mutex *lock,
+				   struct task_struct *owner)
+{
+	return false;
+}
+#endif
+
 /*
  * We can speed up the acquire/release, if there's no debugging state to be
  * set up.
@@ -130,6 +172,12 @@ static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock,
 	 * unlock(wait_lock);
 	 *					lock(wait_lock);
 	 *					acquire(lock);
+	 *
+	 * Care must be taken when performing spin-on-owner techniques for
+	 * the top waiter. Because this relies on lockless owner->on_cpu,
+	 * it can lead to a NULL owner pointer dereference. This condition is
+	 * taken care of in rt_mutex_spin_on_owner() and will cause the task
+	 * to immediately block.
 	 */
 	return rt_mutex_cmpxchg_release(lock, owner, NULL);
 }
@@ -989,12 +1037,15 @@ static void mark_wakeup_next_waiter(struct wake_q_head *wake_q,
 	rt_mutex_dequeue_pi(current, waiter);
 
 	/*
-	 * As we are waking up the top waiter, and the waiter stays
-	 * queued on the lock until it gets the lock, this lock
-	 * obviously has waiters. Just set the bit here and this has
-	 * the added benefit of forcing all new tasks into the
-	 * slow path making sure no task of lower priority than
-	 * the top waiter can steal this lock.
+	 * As we are potentially waking up the top waiter, and the waiter
+	 * stays queued on the lock until it gets the lock, this lock
+	 * obviously has waiters. Just set the bit here and this has the
+	 * added benefit of forcing all new tasks into the slow path
+	 * making sure no task of lower priority than the top waiter can
+	 * steal this lock.
+	 *
+	 * If the top waiter, otoh, is spinning on ->owner, this will also
+	 * serve to exit out of the loop and try to acquire the lock.
 	 */
 	lock->owner = (void *) RT_MUTEX_HAS_WAITERS;
 
@@ -1090,7 +1141,7 @@ void rt_mutex_adjust_pi(struct task_struct *task)
 }
 
 /**
- * __rt_mutex_slowlock() - Perform the wait-wake-try-to-take loop
+ * __rt_mutex_slowlock() - Perform the spin or wait-wake-try-to-take loop
  * @lock:		 the rt_mutex to take
  * @state:		 the state the task should block in (TASK_INTERRUPTIBLE
  *			 or TASK_UNINTERRUPTIBLE)
@@ -1105,6 +1156,7 @@ __rt_mutex_slowlock(struct rt_mutex *lock, int state,
 		    struct rt_mutex_waiter *waiter)
 {
 	int ret = 0;
+	struct rt_mutex_waiter *top_waiter;
 
 	for (;;) {
 		/* Try to acquire the lock: */
@@ -1125,11 +1177,21 @@ __rt_mutex_slowlock(struct rt_mutex *lock, int state,
 				break;
 		}
 
+		top_waiter = rt_mutex_top_waiter(lock);
+
 		raw_spin_unlock_irq(&lock->wait_lock);
 
 		debug_rt_mutex_print_deadlock(waiter);
 
-		schedule();
+		/*
+		 * At this point the PI-dance is done, and, as the top waiter,
+		 * we are next in line for the lock. Try to spin on the current
+		 * owner for a while, in the hope that the lock will be released
+		 * soon. Otherwise fall back and block.
+		 */
+		if (top_waiter != waiter ||
+		    !rt_mutex_spin_on_owner(lock, rt_mutex_owner(lock)))
+			schedule();
 
 		raw_spin_lock_irq(&lock->wait_lock);
 		set_current_state(state);
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index c714737..cba4b81 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -76,7 +76,7 @@ task_top_pi_waiter(struct task_struct *p)
 static inline struct task_struct *rt_mutex_owner(struct rt_mutex *lock)
 {
 	return (struct task_struct *)
-		((unsigned long)lock->owner & ~RT_MUTEX_OWNER_MASKALL);
+		((unsigned long)READ_ONCE(lock->owner) & ~RT_MUTEX_OWNER_MASKALL);
 }
 
 /*
-- 
2.1.4


* [PATCH 4/3] rtmutex: Avoid barrier in rt_mutex_handle_deadlock
  2016-03-08 18:20 [PATCH -tip 0/3] locking/rtmutex: Another crack at spin on owner Davidlohr Bueso
                   ` (2 preceding siblings ...)
  2016-03-08 18:20 ` [PATCH 3/3] rtmutex: Reduce top-waiter blocking on a lock Davidlohr Bueso
@ 2016-03-08 22:05 ` Davidlohr Bueso
  2016-03-14 13:40   ` Peter Zijlstra
  3 siblings, 1 reply; 18+ messages in thread
From: Davidlohr Bueso @ 2016-03-08 22:05 UTC (permalink / raw)
  To: tglx, mingo; +Cc: peterz, bigeasy, umgwanakikbuti, paulmck, dave, linux-kernel

The very nature of rt_mutex_handle_deadlock() implies that this
patch is merely a formality, as in practice the saved barrier
is of little use. That said, we can relax setting the task state
and be done with it; blocking unconditionally... this is a deadlock!

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
  kernel/locking/rtmutex.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 8eb99b7f1ac8..c3d3c8e8ea5c 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1216,7 +1216,7 @@ static void rt_mutex_handle_deadlock(int res, int detect_deadlock,
  	 */
  	rt_mutex_print_deadlock(w);
  	while (1) {
-		set_current_state(TASK_INTERRUPTIBLE);
+		__set_current_state(TASK_INTERRUPTIBLE);
  		schedule();
  	}
  }
-- 
2.1.4


* Re: [PATCH 2/3] rtmutex: Add rt_mutex_init_waiter helper
  2016-03-08 18:20 ` [PATCH 2/3] rtmutex: Add rt_mutex_init_waiter helper Davidlohr Bueso
@ 2016-03-14 13:16   ` Peter Zijlstra
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2016-03-14 13:16 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: tglx, mingo, bigeasy, umgwanakikbuti, paulmck, linux-kernel,
	Davidlohr Bueso

On Tue, Mar 08, 2016 at 10:20:22AM -0800, Davidlohr Bueso wrote:

> +++ b/kernel/futex.c
> @@ -2778,10 +2778,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
>  	 * The waiter is allocated on our stack, manipulated by the requeue
>  	 * code while we sleep on uaddr.
>  	 */
> -	debug_rt_mutex_init_waiter(&rt_waiter);
> -	RB_CLEAR_NODE(&rt_waiter.pi_tree_entry);
> -	RB_CLEAR_NODE(&rt_waiter.tree_entry);
> -	rt_waiter.task = NULL;
> +	rt_mutex_init_waiter(&rt_waiter);
>  
>  	ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE);
>  	if (unlikely(ret != 0))


> +static inline void
> +rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
> +{
> +       debug_rt_mutex_init_waiter(waiter);
> +
> +       RB_CLEAR_NODE(&waiter->pi_tree_entry);
> +       RB_CLEAR_NODE(&waiter->tree_entry);
> +       waiter->task = NULL;
> +}


Just thinking, would not something like:

#define DEFINE_RT_WAITER(name)					\
  struct rt_mutex_waiter name = {				\
	.tree_entry = __INIT_RB_NODE(name.tree_entry),		\
	.pi_tree_entry = __INIT_RB_NODE(name.pi_tree_entry),	\
	__INIT_RT_WAITER_DEBUG(name)				\
  }

Be nicer? That way we're sure the whole structure is initialized.
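
Neither __INIT_RB_NODE() nor __INIT_RT_WAITER_DEBUG() exists yet, so as a
rough sketch of how they could be spelled (mirroring what RB_CLEAR_NODE()
and debug_rt_mutex_init_waiter() establish today, and relying on the
designated initializer zeroing the remaining members such as ->task):

	/* hypothetical helpers, a sketch only */
	#define __INIT_RB_NODE(node)					\
		{ .__rb_parent_color = (unsigned long)&(node) }

	#ifdef CONFIG_DEBUG_RT_MUTEXES
	# define __INIT_RT_WAITER_DEBUG(name)	.deadlock_task_pid = NULL,
	#else
	# define __INIT_RT_WAITER_DEBUG(name)
	#endif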


* Re: [PATCH 3/3] rtmutex: Reduce top-waiter blocking on a lock
  2016-03-08 18:20 ` [PATCH 3/3] rtmutex: Reduce top-waiter blocking on a lock Davidlohr Bueso
@ 2016-03-14 13:23   ` Peter Zijlstra
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2016-03-14 13:23 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: tglx, mingo, bigeasy, umgwanakikbuti, paulmck, linux-kernel,
	Davidlohr Bueso

On Tue, Mar 08, 2016 at 10:20:23AM -0800, Davidlohr Bueso wrote:
>  	lock->owner = (void *) RT_MUTEX_HAS_WAITERS;

>  static inline struct task_struct *rt_mutex_owner(struct rt_mutex *lock)
>  {
>  	return (struct task_struct *)
> -		((unsigned long)lock->owner & ~RT_MUTEX_OWNER_MASKALL);
> +		((unsigned long)READ_ONCE(lock->owner) & ~RT_MUTEX_OWNER_MASKALL);
>  }

If you READ_ONCE(), you should also WRITE_ONCE(), because while the
write is under the proper locks, our friendly compiler might still
choose to emit the store in a random sequence of byte stores, rendering
our READ_ONCE() pointless.
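
Concretely, the pairing asked for would look something like this for the
store quoted above (just a sketch, not a posted follow-up patch):

	-	lock->owner = (void *) RT_MUTEX_HAS_WAITERS;
	+	WRITE_ONCE(lock->owner, (void *) RT_MUTEX_HAS_WAITERS);

and similarly for the other ->owner stores that the lockless READ_ONCE()
in rt_mutex_owner() can observe, so the compiler has no license to tear
them apart.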


* Re: [PATCH 4/3] rtmutex: Avoid barrier in rt_mutex_handle_deadlock
  2016-03-08 22:05 ` [PATCH 4/3] rtmutex: Avoid barrier in rt_mutex_handle_deadlock Davidlohr Bueso
@ 2016-03-14 13:40   ` Peter Zijlstra
  2016-03-21 18:16     ` Davidlohr Bueso
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2016-03-14 13:40 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: tglx, mingo, bigeasy, umgwanakikbuti, paulmck, linux-kernel

On Tue, Mar 08, 2016 at 02:05:39PM -0800, Davidlohr Bueso wrote:
> The very nature of rt_mutex_handle_deadlock() implies that this
> patch is merely a formality, as in practice the saved barrier
> is of little use. That said, we can relax setting the task state
> and be done with it; blocking unconditionally... this is a deadlock!
> 
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> ---
>  kernel/locking/rtmutex.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
> index 8eb99b7f1ac8..c3d3c8e8ea5c 100644
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -1216,7 +1216,7 @@ static void rt_mutex_handle_deadlock(int res, int detect_deadlock,
>  	 */
>  	rt_mutex_print_deadlock(w);
>  	while (1) {
> -		set_current_state(TASK_INTERRUPTIBLE);
> +		__set_current_state(TASK_INTERRUPTIBLE);
>  		schedule();

So you're right that it doesn't matter here, however for that very
reason I would suggest not using __set_current_state() before schedule()
unless there is a _really_ good reason, and then with an extensive
comment to go with.

Otherwise people will manage to pick this as an example to copy and who
all knows what kind of borkage will result from that.
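
(For anyone following along, this is the generic pattern that the full
barrier in set_current_state() keeps safe -- not code from this series:)

	/* waiter */
	set_current_state(TASK_UNINTERRUPTIBLE);	/* ->state store + smp_mb() */
	if (!event_indicated)				/* condition load */
		schedule();

	/* waker, on another CPU */
	event_indicated = 1;				/* condition store */
	wake_up_process(sleeping_task);			/* observes ->state */

With the weaker __set_current_state() the ->state store and the condition
load may be reordered, the waker can then see TASK_RUNNING and skip the
wakeup, and the waiter sleeps forever. That cannot happen in
rt_mutex_handle_deadlock() since nothing ever wakes the task up, which is
exactly why it makes a poor example to copy.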


* Re: [PATCH 4/3] rtmutex: Avoid barrier in rt_mutex_handle_deadlock
  2016-03-14 13:40   ` Peter Zijlstra
@ 2016-03-21 18:16     ` Davidlohr Bueso
  2016-03-22 10:21       ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Davidlohr Bueso @ 2016-03-21 18:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, bigeasy, umgwanakikbuti, paulmck, linux-kernel, dave

On Mon, 14 Mar 2016, Peter Zijlstra wrote:

>So you're right that it doesn't matter here, however for that very
>reason I would suggest not using __set_current_state() before schedule()
>unless there is a _really_ good reason, and then with an extensive
>comment to go with.

No problem.

>
>Otherwise people will manage to pick this as an example to copy and who
>all knows what kind of borkage will result from that.

Although I would expect 'people' to at least read the comments around the
code... and not blindly use rt-deadlock-related things :)

But yeah, let's drop this, I have no objection. While going through this,
I did find that we could do a better job of documenting the actual helpers.

What do you think of the following?

Thanks,
Davidlohr


----------8<----------------------------------------------------------
From: Davidlohr Bueso <dave@stgolabs.net>
Subject: [PATCH -tip] sched: Cleanup comments for tsk->state helpers

While there is nothing wrong with the current comments, we can
easily improve them with the changes proposed in this patch:

- Remove duplicate text for CONFIG_DEBUG_ATOMIC_SLEEP.

- Update the blocking example to handle spurious wakeups (for-loop).

- Point the reader to the infamous memory-barriers.txt doc, which
goes into plenty of detail in the 'SLEEP AND WAKE-UP FUNCTIONS'
section (the example above is also taken from there).

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
  include/linux/sched.h | 51 ++++++++++++++++++++++++++-------------------------
  1 file changed, 26 insertions(+), 25 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c617ea12c6b7..3a3ec2503897 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -249,6 +249,31 @@ extern char ___assert_task_state[1 - 2*!!(
				 (task->flags & PF_FROZEN) == 0 && \
				 (task->state & TASK_NOLOAD) == 0)

+/*
+ * Helpers for modifying the state of either the current task or a foreign
+ * task. Each of these calls comes in both full-barrier and weak flavors:
+ *
+ *                                           Weak
+ *     set_task_state()           __set_task_state()
+ *     set_current_state()        __set_current_state()
+ *
+ * Where set_current_state() and set_task_state() include a full smp barrier
+ * -after- the write of ->state, so that it is correctly serialized with the
+ * later test of whether to actually sleep:
+ *
+ *	for (;;) {
+ *		set_current_state(TASK_UNINTERRUPTIBLE);
+ *		if (event_indicated)
+ *			break;
+ *		schedule();
+ *	}
+ *
+ * This is commonly necessary for processes sleeping and waking through
+ * flag-based events. If the caller does not need such serialization, then
+ * use the weaker counterparts, which simply write the state.
+ *
+ * Refer to Documentation/memory-barriers.txt
+ */
  #ifdef CONFIG_DEBUG_ATOMIC_SLEEP

  #define __set_task_state(tsk, state_value)			\
@@ -261,18 +286,6 @@ extern char ___assert_task_state[1 - 2*!!(
		(tsk)->task_state_change = _THIS_IP_;		\
		smp_store_mb((tsk)->state, (state_value));		\
	} while (0)
-
-/*
- * set_current_state() includes a barrier so that the write of current->state
- * is correctly serialised wrt the caller's subsequent test of whether to
- * actually sleep:
- *
- *	set_current_state(TASK_UNINTERRUPTIBLE);
- *	if (do_i_need_to_sleep())
- *		schedule();
- *
- * If the caller does not need such serialisation then use __set_current_state()
- */
  #define __set_current_state(state_value)			\
	do {							\
		current->task_state_change = _THIS_IP_;		\
@@ -290,24 +303,12 @@ extern char ___assert_task_state[1 - 2*!!(
	do { (tsk)->state = (state_value); } while (0)
  #define set_task_state(tsk, state_value)		\
	smp_store_mb((tsk)->state, (state_value))
-
-/*
- * set_current_state() includes a barrier so that the write of current->state
- * is correctly serialised wrt the caller's subsequent test of whether to
- * actually sleep:
- *
- *	set_current_state(TASK_UNINTERRUPTIBLE);
- *	if (do_i_need_to_sleep())
- *		schedule();
- *
- * If the caller does not need such serialisation then use __set_current_state()
- */
  #define __set_current_state(state_value)		\
	do { current->state = (state_value); } while (0)
  #define set_current_state(state_value)			\
	smp_store_mb(current->state, (state_value))

-#endif
+#endif /* CONFIG_DEBUG_ATOMIC_SLEEP */

  /* Task command name length */
  #define TASK_COMM_LEN 16
--
2.1.4


* Re: [PATCH 4/3] rtmutex: Avoid barrier in rt_mutex_handle_deadlock
  2016-03-21 18:16     ` Davidlohr Bueso
@ 2016-03-22 10:21       ` Peter Zijlstra
  2016-03-22 11:32         ` Heiko Carstens
  2016-03-25  2:30         ` Davidlohr Bueso
  0 siblings, 2 replies; 18+ messages in thread
From: Peter Zijlstra @ 2016-03-22 10:21 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: tglx, mingo, bigeasy, umgwanakikbuti, paulmck, linux-kernel, kmo,
	heiko.carstens

On Mon, Mar 21, 2016 at 11:16:22AM -0700, Davidlohr Bueso wrote:

> +/*
> + * Helpers for modifying the state of either the current task, or a foreign
> + * task. Each of these calls come in both full barrier and weak flavors:
> + *
> + *                                           Weak
> + *     set_task_state()           __set_task_state()
> + *     set_current_state()        __set_current_state()
> + *
> + * Where set_current_state() and set_task_state() includes a full smp barrier
> + * -after- the write of ->state is correctly serialized with the later test
> + * of whether to actually sleep:
> + *
> + *	for (;;) {
> + *		set_current_state(TASK_UNINTERRUPTIBLE);
> + *		if (event_indicated)
> + *			break;
> + *		schedule();
> + *	}
> + *
> + * This is commonly necessary for processes sleeping and waking through flag
> + * based events. If the caller does not need such serialization, then use
> + * weaker counterparts, which simply writes the state.
> + *
> + * Refer to Documentation/memory-barriers.txt
> + */

I would prefer to pretend set_task_state() does not exist, using it on
anything other than task==current is very very tricky.

With the below patch; we're only left with:

arch/s390/mm/fault.c:			__set_task_state(tsk, TASK_UNINTERRUPTIBLE);
arch/s390/mm/fault.c:			__set_task_state(tsk, TASK_UNINTERRUPTIBLE);
drivers/md/bcache/btree.c:	set_task_state(c->gc_thread, TASK_INTERRUPTIBLE);
kernel/exit.c:			set_task_state(tsk, TASK_UNINTERRUPTIBLE);
kernel/exit.c:		__set_task_state(tsk, TASK_RUNNING);

exit most probably also has tsk==current, but I didn't check.

bcache seems to rely on the fact that the task is not running after
kthread_create() to change the state. But I've no idea why; the only
thing I can come up with is load accounting: a new thread blocks
in UNINTERRUPTIBLE which adds to load. But by setting it to
INTERRUPTIBLE before waking up it can actually mess that up. This really
should be fixed.

And s390 does something entirely vile, no idea what.
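
(To make the 'very very tricky' part concrete, a contrived sketch -- not
taken from any of the call sites listed above:)

	/* task p, running its own wait loop */
	set_current_state(TASK_UNINTERRUPTIBLE);
	if (!condition)
		schedule();

	/* meanwhile, from some other context */
	set_task_state(p, TASK_INTERRUPTIBLE);

The two p->state stores simply race, and p can end up sleeping in a state
neither side intended. This is only safe when the remote task is known not
to be running yet, or when it is in fact current.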

---
 arch/um/drivers/random.c                                 |  2 +-
 drivers/md/dm-bufio.c                                    |  2 +-
 drivers/md/persistent-data/dm-block-manager.c            |  4 ++--
 drivers/staging/lustre/lustre/libcfs/linux/linux-debug.c |  6 ++++--
 drivers/tty/tty_ldsem.c                                  | 10 +++++-----
 kernel/locking/mutex.c                                   |  4 ++--
 kernel/locking/rwsem-spinlock.c                          | 12 +++++-------
 kernel/locking/rwsem-xadd.c                              |  4 ++--
 kernel/locking/semaphore.c                               |  2 +-
 9 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/arch/um/drivers/random.c b/arch/um/drivers/random.c
index dd16c902ff70..19d41a583288 100644
--- a/arch/um/drivers/random.c
+++ b/arch/um/drivers/random.c
@@ -76,7 +76,7 @@ static ssize_t rng_dev_read (struct file *filp, char __user *buf, size_t size,
 			add_sigio_fd(random_fd);
 
 			add_wait_queue(&host_read_wait, &wait);
-			set_task_state(current, TASK_INTERRUPTIBLE);
+			set_current_state(TASK_INTERRUPTIBLE);
 
 			schedule();
 			remove_wait_queue(&host_read_wait, &wait);
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index cd77216beff1..c5e89f358d98 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -807,7 +807,7 @@ static void __wait_for_free_buffer(struct dm_bufio_client *c)
 	DECLARE_WAITQUEUE(wait, current);
 
 	add_wait_queue(&c->free_buffer_wait, &wait);
-	set_task_state(current, TASK_UNINTERRUPTIBLE);
+	set_current_state(TASK_UNINTERRUPTIBLE);
 	dm_bufio_unlock(c);
 
 	io_schedule();
diff --git a/drivers/md/persistent-data/dm-block-manager.c b/drivers/md/persistent-data/dm-block-manager.c
index 1e33dd51c21f..821a26b934c2 100644
--- a/drivers/md/persistent-data/dm-block-manager.c
+++ b/drivers/md/persistent-data/dm-block-manager.c
@@ -118,7 +118,7 @@ static int __check_holder(struct block_lock *lock)
 static void __wait(struct waiter *w)
 {
 	for (;;) {
-		set_task_state(current, TASK_UNINTERRUPTIBLE);
+		set_current_state(TASK_UNINTERRUPTIBLE);
 
 		if (!w->task)
 			break;
@@ -126,7 +126,7 @@ static void __wait(struct waiter *w)
 		schedule();
 	}
 
-	set_task_state(current, TASK_RUNNING);
+	set_current_state(TASK_RUNNING);
 }
 
 static void __wake_waiter(struct waiter *w)
diff --git a/drivers/staging/lustre/lustre/libcfs/linux/linux-debug.c b/drivers/staging/lustre/lustre/libcfs/linux/linux-debug.c
index 59c7bf3cbc1f..087d7e49cf3e 100644
--- a/drivers/staging/lustre/lustre/libcfs/linux/linux-debug.c
+++ b/drivers/staging/lustre/lustre/libcfs/linux/linux-debug.c
@@ -162,9 +162,11 @@ void __noreturn lbug_with_loc(struct libcfs_debug_msg_data *msgdata)
 	libcfs_run_lbug_upcall(msgdata);
 	if (libcfs_panic_on_lbug)
 		panic("LBUG");
-	set_task_state(current, TASK_UNINTERRUPTIBLE);
-	while (1)
+
+	while (1) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
 		schedule();
+	}
 }
 
 static int panic_notifier(struct notifier_block *self, unsigned long unused1,
diff --git a/drivers/tty/tty_ldsem.c b/drivers/tty/tty_ldsem.c
index 1bf8ed13f827..c94bc0eef85d 100644
--- a/drivers/tty/tty_ldsem.c
+++ b/drivers/tty/tty_ldsem.c
@@ -232,7 +232,7 @@ down_read_failed(struct ld_semaphore *sem, long count, long timeout)
 
 	/* wait to be given the lock */
 	for (;;) {
-		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+		set_current_state(TASK_UNINTERRUPTIBLE);
 
 		if (!waiter.task)
 			break;
@@ -241,7 +241,7 @@ down_read_failed(struct ld_semaphore *sem, long count, long timeout)
 		timeout = schedule_timeout(timeout);
 	}
 
-	__set_task_state(tsk, TASK_RUNNING);
+	__set_current_state(TASK_RUNNING);
 
 	if (!timeout) {
 		/* lock timed out but check if this task was just
@@ -291,14 +291,14 @@ down_write_failed(struct ld_semaphore *sem, long count, long timeout)
 
 	waiter.task = tsk;
 
-	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+	set_current_state(TASK_UNINTERRUPTIBLE);
 	for (;;) {
 		if (!timeout)
 			break;
 		raw_spin_unlock_irq(&sem->wait_lock);
 		timeout = schedule_timeout(timeout);
 		raw_spin_lock_irq(&sem->wait_lock);
-		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+		set_current_state(TASK_UNINTERRUPTIBLE);
 		locked = writer_trylock(sem);
 		if (locked)
 			break;
@@ -309,7 +309,7 @@ down_write_failed(struct ld_semaphore *sem, long count, long timeout)
 	list_del(&waiter.list);
 	raw_spin_unlock_irq(&sem->wait_lock);
 
-	__set_task_state(tsk, TASK_RUNNING);
+	__set_current_state(TASK_RUNNING);
 
 	/* lock wait may have timed out */
 	if (!locked)
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index e364b424b019..c10fe056c34a 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -572,14 +572,14 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 				goto err;
 		}
 
-		__set_task_state(task, state);
+		__set_current_state(state);
 
 		/* didn't get the lock, go to sleep: */
 		spin_unlock_mutex(&lock->wait_lock, flags);
 		schedule_preempt_disabled();
 		spin_lock_mutex(&lock->wait_lock, flags);
 	}
-	__set_task_state(task, TASK_RUNNING);
+	__set_current_state(TASK_RUNNING);
 
 	mutex_remove_waiter(lock, &waiter, current_thread_info());
 	/* set it to 0 if there are no waiters left: */
diff --git a/kernel/locking/rwsem-spinlock.c b/kernel/locking/rwsem-spinlock.c
index 3a5048572065..dfe5ea3736a8 100644
--- a/kernel/locking/rwsem-spinlock.c
+++ b/kernel/locking/rwsem-spinlock.c
@@ -141,7 +141,7 @@ void __sched __down_read(struct rw_semaphore *sem)
 	}
 
 	tsk = current;
-	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+	set_current_state(TASK_UNINTERRUPTIBLE);
 
 	/* set up my own style of waitqueue */
 	waiter.task = tsk;
@@ -158,10 +158,10 @@ void __sched __down_read(struct rw_semaphore *sem)
 		if (!waiter.task)
 			break;
 		schedule();
-		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+		set_current_state(TASK_UNINTERRUPTIBLE);
 	}
 
-	__set_task_state(tsk, TASK_RUNNING);
+	__set_current_state(TASK_RUNNING);
  out:
 	;
 }
@@ -194,14 +194,12 @@ int __down_read_trylock(struct rw_semaphore *sem)
 void __sched __down_write_nested(struct rw_semaphore *sem, int subclass)
 {
 	struct rwsem_waiter waiter;
-	struct task_struct *tsk;
 	unsigned long flags;
 
 	raw_spin_lock_irqsave(&sem->wait_lock, flags);
 
 	/* set up my own style of waitqueue */
-	tsk = current;
-	waiter.task = tsk;
+	waiter.task = current;
 	waiter.type = RWSEM_WAITING_FOR_WRITE;
 	list_add_tail(&waiter.list, &sem->wait_list);
 
@@ -215,7 +213,7 @@ void __sched __down_write_nested(struct rw_semaphore *sem, int subclass)
 		 */
 		if (sem->count == 0)
 			break;
-		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+		set_current_state(TASK_UNINTERRUPTIBLE);
 		raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
 		schedule();
 		raw_spin_lock_irqsave(&sem->wait_lock, flags);
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index a4d4de05b2d1..a33ffc2ee236 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -244,13 +244,13 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
 
 	/* wait to be given the lock */
 	while (true) {
-		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+		set_current_state(TASK_UNINTERRUPTIBLE);
 		if (!waiter.task)
 			break;
 		schedule();
 	}
 
-	__set_task_state(tsk, TASK_RUNNING);
+	__set_current_state(TASK_RUNNING);
 	return sem;
 }
 EXPORT_SYMBOL(rwsem_down_read_failed);
diff --git a/kernel/locking/semaphore.c b/kernel/locking/semaphore.c
index b8120abe594b..2f8cdb712b63 100644
--- a/kernel/locking/semaphore.c
+++ b/kernel/locking/semaphore.c
@@ -216,7 +216,7 @@ static inline int __sched __down_common(struct semaphore *sem, long state,
 			goto interrupted;
 		if (unlikely(timeout <= 0))
 			goto timed_out;
-		__set_task_state(task, state);
+		__set_current_state(state);
 		raw_spin_unlock_irq(&sem->lock);
 		timeout = schedule_timeout(timeout);
 		raw_spin_lock_irq(&sem->lock);


* Re: [PATCH 4/3] rtmutex: Avoid barrier in rt_mutex_handle_deadlock
  2016-03-22 10:21       ` Peter Zijlstra
@ 2016-03-22 11:32         ` Heiko Carstens
  2016-03-22 12:20           ` Peter Zijlstra
  2016-03-25  2:30         ` Davidlohr Bueso
  1 sibling, 1 reply; 18+ messages in thread
From: Heiko Carstens @ 2016-03-22 11:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Davidlohr Bueso, tglx, mingo, bigeasy, umgwanakikbuti, paulmck,
	linux-kernel, kmo

On Tue, Mar 22, 2016 at 11:21:53AM +0100, Peter Zijlstra wrote:
> On Mon, Mar 21, 2016 at 11:16:22AM -0700, Davidlohr Bueso wrote:
> 
> > +/*
> > + * Helpers for modifying the state of either the current task, or a foreign
> > + * task. Each of these calls come in both full barrier and weak flavors:
> > + *
> > + *                                           Weak
> > + *     set_task_state()           __set_task_state()
> > + *     set_current_state()        __set_current_state()
> > + *
> > + * Where set_current_state() and set_task_state() includes a full smp barrier
> > + * -after- the write of ->state is correctly serialized with the later test
> > + * of whether to actually sleep:
> > + *
> > + *	for (;;) {
> > + *		set_current_state(TASK_UNINTERRUPTIBLE);
> > + *		if (event_indicated)
> > + *			break;
> > + *		schedule();
> > + *	}
> > + *
> > + * This is commonly necessary for processes sleeping and waking through flag
> > + * based events. If the caller does not need such serialization, then use
> > + * weaker counterparts, which simply writes the state.
> > + *
> > + * Refer to Documentation/memory-barriers.txt
> > + */
> 
> I would prefer to pretend set_task_state() does not exist, using it on
> anything other than task==current is very very tricky.
> 
> With the below patch; we're only left with:
> 
> arch/s390/mm/fault.c:			__set_task_state(tsk, TASK_UNINTERRUPTIBLE);
> arch/s390/mm/fault.c:			__set_task_state(tsk, TASK_UNINTERRUPTIBLE);
> drivers/md/bcache/btree.c:	set_task_state(c->gc_thread, TASK_INTERRUPTIBLE);
> kernel/exit.c:			set_task_state(tsk, TASK_UNINTERRUPTIBLE);
> kernel/exit.c:		__set_task_state(tsk, TASK_RUNNING);
> 
> exit most probably also has tsk==current, but I didn't check.
> 
> bacache seems to rely on the fact that the task is not running after
> kthread_create() to change the state. But I've no idea why; the only
> think I can come up with is because load accounting, a new thread blocks
> in UNINTERRUPTIBLE which adds to load. But by setting it to
> INTERRUPTIBLE before waking up it can actually mess that up. This really
> should be fixed.
> 
> And s390 does something entirely vile, no idea what.

For the two s390 usages tsk equals current. So it could be easily replaced
with set_current_state().


* Re: [PATCH 4/3] rtmutex: Avoid barrier in rt_mutex_handle_deadlock
  2016-03-22 11:32         ` Heiko Carstens
@ 2016-03-22 12:20           ` Peter Zijlstra
  2016-03-22 13:26             ` Heiko Carstens
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2016-03-22 12:20 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Davidlohr Bueso, tglx, mingo, bigeasy, umgwanakikbuti, paulmck,
	linux-kernel, kmo

On Tue, Mar 22, 2016 at 12:32:21PM +0100, Heiko Carstens wrote:
> On Tue, Mar 22, 2016 at 11:21:53AM +0100, Peter Zijlstra wrote:

> > And s390 does something entirely vile, no idea what.
> 
> For the two s390 usages tsk equals current. So it could be easily replaced
> with set_current_state().

Hmm indeed, I only saw tsk = find_task_by_pid_ns() and didn't look
further, but you do indeed have an assertion later that ensures task ==
current.

I still don't get that code though; why would you set the current task
state to UNINTERRUPTIBLE, also set need_resched, but then not call
schedule() at all.

Clearly something magical is going on and it's not clear.


* Re: [PATCH 4/3] rtmutex: Avoid barrier in rt_mutex_handle_deadlock
  2016-03-22 12:20           ` Peter Zijlstra
@ 2016-03-22 13:26             ` Heiko Carstens
  2016-03-22 13:55               ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Heiko Carstens @ 2016-03-22 13:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Davidlohr Bueso, tglx, mingo, bigeasy, umgwanakikbuti, paulmck,
	linux-kernel, kmo

On Tue, Mar 22, 2016 at 01:20:50PM +0100, Peter Zijlstra wrote:
> On Tue, Mar 22, 2016 at 12:32:21PM +0100, Heiko Carstens wrote:
> > On Tue, Mar 22, 2016 at 11:21:53AM +0100, Peter Zijlstra wrote:
> 
> > > And s390 does something entirely vile, no idea what.
> > 
> > For the two s390 usages tsk equals current. So it could be easily replaced
> > with set_current_state().
> 
> Hmm indeed, I only saw tsk = find_task_by_pid_ns() and didn't look
> further, but you do indeed have an assertion later that ensures task ==
> current.
> 
> I still don't get that code though; why would you set the current task
> state to UNINTERRUPTIBLE, also set need_resched, but then not call
> schedule() at all.
> 
> Clearly something magical is going on and its not clear.

The mechanism of our pfault code: if Linux is running as guest, runs a user
space process and the user space process accesses a page that the host has
paged out we get a pfault interrupt.

This allows us, within the guest, to schedule a different process. Without
this mechanism the host would have to suspend the whole virtual CPU until
the page has been paged in.

So when we get such an interrupt then we set the state of the current task
to uninterruptible and also set the need_resched flag. Both happens within
interrupt context(!). If we later on want to return to user space we
recognize the need_resched flag and then call schedule().
It's not very obvious how this works...

Of course we have a lot of additional fun with the completion interrupt (->
host signals that a page of a process has been paged in and the process can
continue to run). This interrupt can arrive on any cpu and, since we have
virtual cpus, actually appear before the interrupt that signals that a page
is missing.


* Re: [PATCH 4/3] rtmutex: Avoid barrier in rt_mutex_handle_deadlock
  2016-03-22 13:26             ` Heiko Carstens
@ 2016-03-22 13:55               ` Peter Zijlstra
  2016-03-22 14:45                 ` Heiko Carstens
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2016-03-22 13:55 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Davidlohr Bueso, tglx, mingo, bigeasy, umgwanakikbuti, paulmck,
	linux-kernel, kmo

On Tue, Mar 22, 2016 at 02:26:00PM +0100, Heiko Carstens wrote:
> > Clearly something magical is going on and its not clear.
> 
> The mechanism of our pfault code: if Linux is running as guest, runs a user
> space process and the user space process accesses a page that the host has
> paged out we get a pfault interrupt.
> 
> This allows us, within the guest, to schedule a different process. Without
> this mechanism the host would have to suspend the whole virtual CPU until
> the page has been paged in.
> 
> So when we get such an interrupt then we set the state of the current task
> to uninterruptible and also set the need_resched flag. Both happens within
> interrupt context(!). If we later on want to return to user space we
> recognize the need_resched flag and then call schedule().
> It's not very obvious how this works...

A few lines like the above near that function would go a long way, I
think.

And, ah!, you rely on the return to user resched to not be a
preempt_schedule, how very icky :-)

Now, what happens if that task gets a spurious wakeup? Will it take the
fault again, raise the PF int again etc.. ?

> Of course we have a lot of additional fun with the completion interrupt (->
> host signals that a page of a process has been paged in and the process can
> continue to run). This interrupt can arrive on any cpu and, since we have
> virtual cpus, actually appear before the interrupt that signals that a page
> is missing.

Of course :-)

Something like the below perhaps?

---
 arch/s390/mm/fault.c | 44 ++++++++++++++++++++++++++++++++++++--------
 1 file changed, 36 insertions(+), 8 deletions(-)

diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index 791a4146052c..52cc8c99e62c 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -629,6 +629,29 @@ void pfault_fini(void)
 static DEFINE_SPINLOCK(pfault_lock);
 static LIST_HEAD(pfault_list);
 
+#define PF_COMPLETE	0x0080
+
+/*
+ * The mechanism of our pfault code: if Linux is running as guest, runs a user
+ * space process and the user space process accesses a page that the host has
+ * paged out we get a pfault interrupt.
+ *
+ * This allows us, within the guest, to schedule a different process. Without
+ * this mechanism the host would have to suspend the whole virtual CPU until
+ * the page has been paged in.
+ *
+ * So when we get such an interrupt then we set the state of the current task
+ * to uninterruptible and also set the need_resched flag. Both happens within
+ * interrupt context(!). If we later on want to return to user space we
+ * recognize the need_resched flag and then call schedule().  It's not very
+ * obvious how this works...
+ *
+ * Of course we have a lot of additional fun with the completion interrupt (->
+ * host signals that a page of a process has been paged in and the process can
+ * continue to run). This interrupt can arrive on any cpu and, since we have
+ * virtual cpus, actually appear before the interrupt that signals that a page
+ * is missing.
+ */
 static void pfault_interrupt(struct ext_code ext_code,
 			     unsigned int param32, unsigned long param64)
 {
@@ -637,14 +660,14 @@ static void pfault_interrupt(struct ext_code ext_code,
 	pid_t pid;
 
 	/*
-	 * Get the external interruption subcode & pfault
-	 * initial/completion signal bit. VM stores this 
-	 * in the 'cpu address' field associated with the
-         * external interrupt. 
+	 * Get the external interruption subcode & pfault initial/completion
+	 * signal bit. VM stores this in the 'cpu address' field associated
+	 * with the external interrupt.
 	 */
 	subcode = ext_code.subcode;
 	if ((subcode & 0xff00) != __SUBCODE_MASK)
 		return;
+
 	inc_irq_stat(IRQEXT_PFL);
 	/* Get the token (= pid of the affected task). */
 	pid = param64 & LPP_PFAULT_PID_MASK;
@@ -655,8 +678,9 @@ static void pfault_interrupt(struct ext_code ext_code,
 	rcu_read_unlock();
 	if (!tsk)
 		return;
+
 	spin_lock(&pfault_lock);
-	if (subcode & 0x0080) {
+	if (subcode & PF_COMPLETE) {
 		/* signal bit is set -> a page has been swapped in by VM */
 		if (tsk->thread.pfault_wait == 1) {
 			/* Initial interrupt was faster than the completion
@@ -683,10 +707,10 @@ static void pfault_interrupt(struct ext_code ext_code,
 		/* signal bit not set -> a real page is missing. */
 		if (WARN_ON_ONCE(tsk != current))
 			goto out;
+
 		if (tsk->thread.pfault_wait == 1) {
 			/* Already on the list with a reference: put to sleep */
-			__set_task_state(tsk, TASK_UNINTERRUPTIBLE);
-			set_tsk_need_resched(tsk);
+			goto block;
 		} else if (tsk->thread.pfault_wait == -1) {
 			/* Completion interrupt was faster than the initial
 			 * interrupt (pfault_wait == -1). Set pfault_wait
@@ -701,7 +725,11 @@ static void pfault_interrupt(struct ext_code ext_code,
 			get_task_struct(tsk);
 			tsk->thread.pfault_wait = 1;
 			list_add(&tsk->thread.list, &pfault_list);
-			__set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+block:
+			/* Since this must be a userspace fault, there
+			 * is no kernel task state to trample. Rely on the
+			 * return to userspace schedule() to block */
+			__set_current_state(TASK_UNINTERRUPTIBLE);
 			set_tsk_need_resched(tsk);
 		}
 	}


* Re: [PATCH 4/3] rtmutex: Avoid barrier in rt_mutex_handle_deadlock
  2016-03-22 13:55               ` Peter Zijlstra
@ 2016-03-22 14:45                 ` Heiko Carstens
  2016-03-22 16:41                   ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Heiko Carstens @ 2016-03-22 14:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Davidlohr Bueso, tglx, mingo, bigeasy, umgwanakikbuti, paulmck,
	linux-kernel, kmo

On Tue, Mar 22, 2016 at 02:55:30PM +0100, Peter Zijlstra wrote:
> On Tue, Mar 22, 2016 at 02:26:00PM +0100, Heiko Carstens wrote:
> > > Clearly something magical is going on and its not clear.
> > 
> > The mechanism of our pfault code: if Linux is running as guest, runs a user
> > space process and the user space process accesses a page that the host has
> > paged out we get a pfault interrupt.
> > 
> > This allows us, within the guest, to schedule a different process. Without
> > this mechanism the host would have to suspend the whole virtual CPU until
> > the page has been paged in.
> > 
> > So when we get such an interrupt then we set the state of the current task
> > to uninterruptible and also set the need_resched flag. Both happens within
> > interrupt context(!). If we later on want to return to user space we
> > recognize the need_resched flag and then call schedule().
> > It's not very obvious how this works...
> 
> A few lines like the above near that function would go a long while I
> think.
> 
> And, ah!, you rely on the return to user resched to not be a
> preempt_schedule, how very icky :-)
> 
> Now, what happens if that task gets a spurious wakeup? Will it take the
> fault again, raise the PF int again etc.. ?

Yes, it will fault again etc. We actually do the spurious wakeup thing on
cpu hotplug (down), since unfortunately the original protocol has a flaw:
all pending completion interrupts of the "downed" cpu get lost in the host
and we do not know which ones.

So we wake all tasks up and see what happens... see pfault_cpu_notify().
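
(For reference, the shape of that hotplug path, paraphrased from memory
rather than quoted verbatim:)

	/*
	 * CPU_DEAD: any completion interrupts pending for this cpu are
	 * gone, so drop every waiter from the list and wake it up; a task
	 * that is woken spuriously simply takes the pfault again.
	 */
	spin_lock_irq(&pfault_lock);
	list_for_each_entry_safe(thread, next, &pfault_list, list) {
		thread->pfault_wait = 0;
		list_del(&thread->list);
		tsk = container_of(thread, struct task_struct, thread);
		wake_up_process(tsk);
		put_task_struct(tsk);
	}
	spin_unlock_irq(&pfault_lock);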

> > Of course we have a lot of additional fun with the completion interrupt (->
> > host signals that a page of a process has been paged in and the process can
> > continue to run). This interrupt can arrive on any cpu and, since we have
> > virtual cpus, actually appear before the interrupt that signals that a page
> > is missing.
> 
> Of course :-)
> 
> Something like the below perhaps?
> 
> ---
>  arch/s390/mm/fault.c | 44 ++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 36 insertions(+), 8 deletions(-)

Sure, looks nice and makes a lot of sense. And the text looks a bit familiar
to me ;)

Could you provide From: and Signed-off-by: lines?


* Re: [PATCH 4/3] rtmutex: Avoid barrier in rt_mutex_handle_deadlock
  2016-03-22 14:45                 ` Heiko Carstens
@ 2016-03-22 16:41                   ` Peter Zijlstra
  2016-03-22 21:46                     ` Heiko Carstens
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2016-03-22 16:41 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Davidlohr Bueso, tglx, mingo, bigeasy, umgwanakikbuti, paulmck,
	linux-kernel, kmo

On Tue, Mar 22, 2016 at 03:45:37PM +0100, Heiko Carstens wrote:

> Sure, looks nice and makes a lot of sense. And the text looks a bit familiar
> to me ;)
> 
> Could you provide From: and Signed-off-by: lines?

Of course, find below.

---
Subject: s390: Clarify pagefault interrupt
From: Peter Zijlstra <peterz@infradead.org>

While looking at set_task_state() users I stumbled over the s390 pfault
interrupt code.  Since Heiko provided a great explanation on how it
worked, I figured we ought to preserve this.

Also make a few little tweaks to the code to aid in readability and
explicitly comment the unusual blocking scheme.

Based-on-text-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/s390/mm/fault.c | 44 ++++++++++++++++++++++++++++++++++++--------
 1 file changed, 36 insertions(+), 8 deletions(-)

diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index 791a4146052c..52cc8c99e62c 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -629,6 +629,29 @@ void pfault_fini(void)
 static DEFINE_SPINLOCK(pfault_lock);
 static LIST_HEAD(pfault_list);
 
+#define PF_COMPLETE	0x0080
+
+/*
+ * The mechanism of our pfault code: if Linux is running as guest, runs a user
+ * space process and the user space process accesses a page that the host has
+ * paged out we get a pfault interrupt.
+ *
+ * This allows us, within the guest, to schedule a different process. Without
+ * this mechanism the host would have to suspend the whole virtual CPU until
+ * the page has been paged in.
+ *
+ * So when we get such an interrupt then we set the state of the current task
+ * to uninterruptible and also set the need_resched flag. Both happens within
+ * interrupt context(!). If we later on want to return to user space we
+ * recognize the need_resched flag and then call schedule().  It's not very
+ * obvious how this works...
+ *
+ * Of course we have a lot of additional fun with the completion interrupt (->
+ * host signals that a page of a process has been paged in and the process can
+ * continue to run). This interrupt can arrive on any cpu and, since we have
+ * virtual cpus, actually appear before the interrupt that signals that a page
+ * is missing.
+ */
 static void pfault_interrupt(struct ext_code ext_code,
 			     unsigned int param32, unsigned long param64)
 {
@@ -637,14 +660,14 @@ static void pfault_interrupt(struct ext_code ext_code,
 	pid_t pid;
 
 	/*
-	 * Get the external interruption subcode & pfault
-	 * initial/completion signal bit. VM stores this 
-	 * in the 'cpu address' field associated with the
-         * external interrupt. 
+	 * Get the external interruption subcode & pfault initial/completion
+	 * signal bit. VM stores this in the 'cpu address' field associated
+	 * with the external interrupt.
 	 */
 	subcode = ext_code.subcode;
 	if ((subcode & 0xff00) != __SUBCODE_MASK)
 		return;
+
 	inc_irq_stat(IRQEXT_PFL);
 	/* Get the token (= pid of the affected task). */
 	pid = param64 & LPP_PFAULT_PID_MASK;
@@ -655,8 +678,9 @@ static void pfault_interrupt(struct ext_code ext_code,
 	rcu_read_unlock();
 	if (!tsk)
 		return;
+
 	spin_lock(&pfault_lock);
-	if (subcode & 0x0080) {
+	if (subcode & PF_COMPLETE) {
 		/* signal bit is set -> a page has been swapped in by VM */
 		if (tsk->thread.pfault_wait == 1) {
 			/* Initial interrupt was faster than the completion
@@ -683,10 +707,10 @@ static void pfault_interrupt(struct ext_code ext_code,
 		/* signal bit not set -> a real page is missing. */
 		if (WARN_ON_ONCE(tsk != current))
 			goto out;
+
 		if (tsk->thread.pfault_wait == 1) {
 			/* Already on the list with a reference: put to sleep */
-			__set_task_state(tsk, TASK_UNINTERRUPTIBLE);
-			set_tsk_need_resched(tsk);
+			goto block;
 		} else if (tsk->thread.pfault_wait == -1) {
 			/* Completion interrupt was faster than the initial
 			 * interrupt (pfault_wait == -1). Set pfault_wait
@@ -701,7 +725,11 @@ static void pfault_interrupt(struct ext_code ext_code,
 			get_task_struct(tsk);
 			tsk->thread.pfault_wait = 1;
 			list_add(&tsk->thread.list, &pfault_list);
-			__set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+block:
+			/* Since this must be a userspace fault, there
+			 * is no kernel task state to trample. Rely on the
+			 * return to userspace schedule() to block */
+			__set_current_state(TASK_UNINTERRUPTIBLE);
 			set_tsk_need_resched(tsk);
 		}
 	}


* Re: [PATCH 4/3] rtmutex: Avoid barrier in rt_mutex_handle_deadlock
  2016-03-22 16:41                   ` Peter Zijlstra
@ 2016-03-22 21:46                     ` Heiko Carstens
  0 siblings, 0 replies; 18+ messages in thread
From: Heiko Carstens @ 2016-03-22 21:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Davidlohr Bueso, tglx, mingo, bigeasy, umgwanakikbuti, paulmck,
	linux-kernel, kmo

On Tue, Mar 22, 2016 at 05:41:22PM +0100, Peter Zijlstra wrote:
> On Tue, Mar 22, 2016 at 03:45:37PM +0100, Heiko Carstens wrote:
> 
> > Sure, looks nice and makes a lot of sense. And the text looks a bit familiar
> > to me ;)
> > 
> > Could you provide From: and Signed-off-by: lines?
> 
> Of course, find below.
> 
> ---
> Subject: s390: Clarify pagefault interrupt
> From: Peter Zijlstra <peterz@infradead.org>
> 
> While looking at set_task_state() users I stumbled over the s390 pfault
> interrupt code.  Since Heiko provided a great explanation on how it
> worked, I figured we ought to preserve this.
> 
> Also make a few little tweaks to the code to aid in readability and
> explicitly comment the unusual blocking scheme.
> 
> Based-on-text-by: Heiko Carstens <heiko.carstens@de.ibm.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/s390/mm/fault.c | 44 ++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 36 insertions(+), 8 deletions(-)

Applied, with some whitespace changes from me. Thank you!

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/3] rtmutex: Avoid barrier in rt_mutex_handle_deadlock
  2016-03-22 10:21       ` Peter Zijlstra
  2016-03-22 11:32         ` Heiko Carstens
@ 2016-03-25  2:30         ` Davidlohr Bueso
  1 sibling, 0 replies; 18+ messages in thread
From: Davidlohr Bueso @ 2016-03-25  2:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, bigeasy, umgwanakikbuti, paulmck, linux-kernel, kmo,
	heiko.carstens, kent.overstreet, linux-bcache

Adding a few more Cc's for bcache.

On Tue, 22 Mar 2016, Peter Zijlstra wrote:

>On Mon, Mar 21, 2016 at 11:16:22AM -0700, Davidlohr Bueso wrote:
>
>> +/*
>> + * Helpers for modifying the state of either the current task, or a foreign
>> + * task. Each of these calls comes in both full barrier and weak flavors:
>> + *
>> + *       Full barrier                  Weak
>> + *     set_task_state()           __set_task_state()
>> + *     set_current_state()        __set_current_state()
>> + *
>> + * Where set_current_state() and set_task_state() include a full smp barrier
>> + * -after- the write of ->state, so that it is correctly serialized with the
>> + * later test of whether to actually sleep:
>> + *
>> + *	for (;;) {
>> + *		set_current_state(TASK_UNINTERRUPTIBLE);
>> + *		if (event_indicated)
>> + *			break;
>> + *		schedule();
>> + *	}
>> + *
>> + * This is commonly necessary for processes sleeping and waking through
>> + * flag-based events. If the caller does not need such serialization, then
>> + * use the weaker counterparts, which simply write the state.
>> + *
>> + * Refer to Documentation/memory-barriers.txt
>> + */
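
For reference, the full-barrier variants pair with a waker that publishes
the condition before issuing the wakeup. A minimal sketch, where
event_indicated and sleeping_task are illustrative names rather than real
kernel symbols:

	/* waker */
	event_indicated = 1;			/* publish the condition */
	wake_up_process(sleeping_task);		/* then wake the sleeper */

	/* sleeper, as in the comment above */
	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (event_indicated)
			break;
		schedule();
	}
	__set_current_state(TASK_RUNNING);

Without the barrier implied by set_current_state(), the sleeper's ->state
write could be reordered past its event_indicated test; a wakeup landing
in that window would find the task still runnable, do nothing lasting, and
the sleeper would then block with the event already signalled.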
>
>I would prefer to pretend set_task_state() does not exist; using it on
>anything other than task==current is very, very tricky.
>
>With the below patch; we're only left with:
>
>arch/s390/mm/fault.c:			__set_task_state(tsk, TASK_UNINTERRUPTIBLE);
>arch/s390/mm/fault.c:			__set_task_state(tsk, TASK_UNINTERRUPTIBLE);
>drivers/md/bcache/btree.c:	set_task_state(c->gc_thread, TASK_INTERRUPTIBLE);
>kernel/exit.c:			set_task_state(tsk, TASK_UNINTERRUPTIBLE);
>kernel/exit.c:		__set_task_state(tsk, TASK_RUNNING);
>
>exit most probably also has tsk==current, but I didn't check.

Right, and only user is do_exit -> exit_mm() which is always current.

>
>bcache seems to rely on the fact that the task is not running after
>kthread_create() to change the state. But I've no idea why; the only
>thing I can come up with is load accounting: a new thread blocks
>in UNINTERRUPTIBLE, which adds to load. But by setting it to
>INTERRUPTIBLE before waking it up, bcache can actually mess that up.
>This really should be fixed.

No idea why either.
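
The pattern at issue looks roughly like this (a sketch only; gc_thread_fn,
should_run(), c and the thread name are placeholders, not the actual
bcache symbols):

	/*
	 * Creator side: poke at the new thread's state from the outside.
	 * Only tolerable because a thread returned by kthread_create()
	 * has not started running yet.
	 */
	struct task_struct *k = kthread_create(gc_thread_fn, c, "bcache_gc");

	set_task_state(k, TASK_INTERRUPTIBLE);
	wake_up_process(k);

	/* Race-free alternative: the thread manages its own state. */
	static int gc_thread_fn(void *data)
	{
		while (!kthread_should_stop()) {
			set_current_state(TASK_INTERRUPTIBLE);
			if (!should_run(data))
				schedule();
			__set_current_state(TASK_RUNNING);
			/* ... do one pass of work ... */
		}
		return 0;
	}

The second form never touches another task's ->state, which is the whole
point of getting rid of set_task_state().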

>
>And s390 does something entirely vile, no idea what.

So this is solved.

I'll send an updated patch based on this one that removes set_task_state(),
provided we first get rid of the bcache situation, obviously.

Thanks,
Davidlohr

>
>---
> arch/um/drivers/random.c                                 |  2 +-
> drivers/md/dm-bufio.c                                    |  2 +-
> drivers/md/persistent-data/dm-block-manager.c            |  4 ++--
> drivers/staging/lustre/lustre/libcfs/linux/linux-debug.c |  6 ++++--
> drivers/tty/tty_ldsem.c                                  | 10 +++++-----
> kernel/locking/mutex.c                                   |  4 ++--
> kernel/locking/rwsem-spinlock.c                          | 12 +++++-------
> kernel/locking/rwsem-xadd.c                              |  4 ++--
> kernel/locking/semaphore.c                               |  2 +-
> 9 files changed, 23 insertions(+), 23 deletions(-)
>
>diff --git a/arch/um/drivers/random.c b/arch/um/drivers/random.c
>index dd16c902ff70..19d41a583288 100644
>--- a/arch/um/drivers/random.c
>+++ b/arch/um/drivers/random.c
>@@ -76,7 +76,7 @@ static ssize_t rng_dev_read (struct file *filp, char __user *buf, size_t size,
> 			add_sigio_fd(random_fd);
>
> 			add_wait_queue(&host_read_wait, &wait);
>-			set_task_state(current, TASK_INTERRUPTIBLE);
>+			set_current_state(TASK_INTERRUPTIBLE);
>
> 			schedule();
> 			remove_wait_queue(&host_read_wait, &wait);
>diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
>index cd77216beff1..c5e89f358d98 100644
>--- a/drivers/md/dm-bufio.c
>+++ b/drivers/md/dm-bufio.c
>@@ -807,7 +807,7 @@ static void __wait_for_free_buffer(struct dm_bufio_client *c)
> 	DECLARE_WAITQUEUE(wait, current);
>
> 	add_wait_queue(&c->free_buffer_wait, &wait);
>-	set_task_state(current, TASK_UNINTERRUPTIBLE);
>+	set_current_state(TASK_UNINTERRUPTIBLE);
> 	dm_bufio_unlock(c);
>
> 	io_schedule();
>diff --git a/drivers/md/persistent-data/dm-block-manager.c b/drivers/md/persistent-data/dm-block-manager.c
>index 1e33dd51c21f..821a26b934c2 100644
>--- a/drivers/md/persistent-data/dm-block-manager.c
>+++ b/drivers/md/persistent-data/dm-block-manager.c
>@@ -118,7 +118,7 @@ static int __check_holder(struct block_lock *lock)
> static void __wait(struct waiter *w)
> {
> 	for (;;) {
>-		set_task_state(current, TASK_UNINTERRUPTIBLE);
>+		set_current_state(TASK_UNINTERRUPTIBLE);
>
> 		if (!w->task)
> 			break;
>@@ -126,7 +126,7 @@ static void __wait(struct waiter *w)
> 		schedule();
> 	}
>
>-	set_task_state(current, TASK_RUNNING);
>+	set_current_state(TASK_RUNNING);
> }
>
> static void __wake_waiter(struct waiter *w)
>diff --git a/drivers/staging/lustre/lustre/libcfs/linux/linux-debug.c b/drivers/staging/lustre/lustre/libcfs/linux/linux-debug.c
>index 59c7bf3cbc1f..087d7e49cf3e 100644
>--- a/drivers/staging/lustre/lustre/libcfs/linux/linux-debug.c
>+++ b/drivers/staging/lustre/lustre/libcfs/linux/linux-debug.c
>@@ -162,9 +162,11 @@ void __noreturn lbug_with_loc(struct libcfs_debug_msg_data *msgdata)
> 	libcfs_run_lbug_upcall(msgdata);
> 	if (libcfs_panic_on_lbug)
> 		panic("LBUG");
>-	set_task_state(current, TASK_UNINTERRUPTIBLE);
>-	while (1)
>+
>+	while (1) {
>+		set_current_state(TASK_UNINTERRUPTIBLE);
> 		schedule();
>+	}
> }
>
> static int panic_notifier(struct notifier_block *self, unsigned long unused1,
>diff --git a/drivers/tty/tty_ldsem.c b/drivers/tty/tty_ldsem.c
>index 1bf8ed13f827..c94bc0eef85d 100644
>--- a/drivers/tty/tty_ldsem.c
>+++ b/drivers/tty/tty_ldsem.c
>@@ -232,7 +232,7 @@ down_read_failed(struct ld_semaphore *sem, long count, long timeout)
>
> 	/* wait to be given the lock */
> 	for (;;) {
>-		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
>+		set_current_state(TASK_UNINTERRUPTIBLE);
>
> 		if (!waiter.task)
> 			break;
>@@ -241,7 +241,7 @@ down_read_failed(struct ld_semaphore *sem, long count, long timeout)
> 		timeout = schedule_timeout(timeout);
> 	}
>
>-	__set_task_state(tsk, TASK_RUNNING);
>+	__set_current_state(TASK_RUNNING);
>
> 	if (!timeout) {
> 		/* lock timed out but check if this task was just
>@@ -291,14 +291,14 @@ down_write_failed(struct ld_semaphore *sem, long count, long timeout)
>
> 	waiter.task = tsk;
>
>-	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
>+	set_current_state(TASK_UNINTERRUPTIBLE);
> 	for (;;) {
> 		if (!timeout)
> 			break;
> 		raw_spin_unlock_irq(&sem->wait_lock);
> 		timeout = schedule_timeout(timeout);
> 		raw_spin_lock_irq(&sem->wait_lock);
>-		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
>+		set_current_state(TASK_UNINTERRUPTIBLE);
> 		locked = writer_trylock(sem);
> 		if (locked)
> 			break;
>@@ -309,7 +309,7 @@ down_write_failed(struct ld_semaphore *sem, long count, long timeout)
> 	list_del(&waiter.list);
> 	raw_spin_unlock_irq(&sem->wait_lock);
>
>-	__set_task_state(tsk, TASK_RUNNING);
>+	__set_current_state(TASK_RUNNING);
>
> 	/* lock wait may have timed out */
> 	if (!locked)
>diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
>index e364b424b019..c10fe056c34a 100644
>--- a/kernel/locking/mutex.c
>+++ b/kernel/locking/mutex.c
>@@ -572,14 +572,14 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> 				goto err;
> 		}
>
>-		__set_task_state(task, state);
>+		__set_current_state(state);
>
> 		/* didn't get the lock, go to sleep: */
> 		spin_unlock_mutex(&lock->wait_lock, flags);
> 		schedule_preempt_disabled();
> 		spin_lock_mutex(&lock->wait_lock, flags);
> 	}
>-	__set_task_state(task, TASK_RUNNING);
>+	__set_current_state(TASK_RUNNING);
>
> 	mutex_remove_waiter(lock, &waiter, current_thread_info());
> 	/* set it to 0 if there are no waiters left: */
>diff --git a/kernel/locking/rwsem-spinlock.c b/kernel/locking/rwsem-spinlock.c
>index 3a5048572065..dfe5ea3736a8 100644
>--- a/kernel/locking/rwsem-spinlock.c
>+++ b/kernel/locking/rwsem-spinlock.c
>@@ -141,7 +141,7 @@ void __sched __down_read(struct rw_semaphore *sem)
> 	}
>
> 	tsk = current;
>-	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
>+	set_current_state(TASK_UNINTERRUPTIBLE);
>
> 	/* set up my own style of waitqueue */
> 	waiter.task = tsk;
>@@ -158,10 +158,10 @@ void __sched __down_read(struct rw_semaphore *sem)
> 		if (!waiter.task)
> 			break;
> 		schedule();
>-		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
>+		set_current_state(TASK_UNINTERRUPTIBLE);
> 	}
>
>-	__set_task_state(tsk, TASK_RUNNING);
>+	__set_current_state(TASK_RUNNING);
>  out:
> 	;
> }
>@@ -194,14 +194,12 @@ int __down_read_trylock(struct rw_semaphore *sem)
> void __sched __down_write_nested(struct rw_semaphore *sem, int subclass)
> {
> 	struct rwsem_waiter waiter;
>-	struct task_struct *tsk;
> 	unsigned long flags;
>
> 	raw_spin_lock_irqsave(&sem->wait_lock, flags);
>
> 	/* set up my own style of waitqueue */
>-	tsk = current;
>-	waiter.task = tsk;
>+	waiter.task = current;
> 	waiter.type = RWSEM_WAITING_FOR_WRITE;
> 	list_add_tail(&waiter.list, &sem->wait_list);
>
>@@ -215,7 +213,7 @@ void __sched __down_write_nested(struct rw_semaphore *sem, int subclass)
> 		 */
> 		if (sem->count == 0)
> 			break;
>-		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
>+		set_current_state(TASK_UNINTERRUPTIBLE);
> 		raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
> 		schedule();
> 		raw_spin_lock_irqsave(&sem->wait_lock, flags);
>diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
>index a4d4de05b2d1..a33ffc2ee236 100644
>--- a/kernel/locking/rwsem-xadd.c
>+++ b/kernel/locking/rwsem-xadd.c
>@@ -244,13 +244,13 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
>
> 	/* wait to be given the lock */
> 	while (true) {
>-		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
>+		set_current_state(TASK_UNINTERRUPTIBLE);
> 		if (!waiter.task)
> 			break;
> 		schedule();
> 	}
>
>-	__set_task_state(tsk, TASK_RUNNING);
>+	__set_current_state(TASK_RUNNING);
> 	return sem;
> }
> EXPORT_SYMBOL(rwsem_down_read_failed);
>diff --git a/kernel/locking/semaphore.c b/kernel/locking/semaphore.c
>index b8120abe594b..2f8cdb712b63 100644
>--- a/kernel/locking/semaphore.c
>+++ b/kernel/locking/semaphore.c
>@@ -216,7 +216,7 @@ static inline int __sched __down_common(struct semaphore *sem, long state,
> 			goto interrupted;
> 		if (unlikely(timeout <= 0))
> 			goto timed_out;
>-		__set_task_state(task, state);
>+		__set_current_state(state);
> 		raw_spin_unlock_irq(&sem->lock);
> 		timeout = schedule_timeout(timeout);
> 		raw_spin_lock_irq(&sem->lock);

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2016-03-25  2:30 UTC | newest]

Thread overview: 18+ messages
2016-03-08 18:20 [PATCH -tip 0/3] locking/rtmutex: Another crack at spin on owner Davidlohr Bueso
2016-03-08 18:20 ` [PATCH 1/3] rtmutex: Delete save_state member of struct rt_mutex Davidlohr Bueso
2016-03-08 18:20 ` [PATCH 2/3] rtmutex: Add rt_mutex_init_waiter helper Davidlohr Bueso
2016-03-14 13:16   ` Peter Zijlstra
2016-03-08 18:20 ` [PATCH 3/3] rtmutex: Reduce top-waiter blocking on a lock Davidlohr Bueso
2016-03-14 13:23   ` Peter Zijlstra
2016-03-08 22:05 ` [PATCH 4/3] rtmutex: Avoid barrier in rt_mutex_handle_deadlock Davidlohr Bueso
2016-03-14 13:40   ` Peter Zijlstra
2016-03-21 18:16     ` Davidlohr Bueso
2016-03-22 10:21       ` Peter Zijlstra
2016-03-22 11:32         ` Heiko Carstens
2016-03-22 12:20           ` Peter Zijlstra
2016-03-22 13:26             ` Heiko Carstens
2016-03-22 13:55               ` Peter Zijlstra
2016-03-22 14:45                 ` Heiko Carstens
2016-03-22 16:41                   ` Peter Zijlstra
2016-03-22 21:46                     ` Heiko Carstens
2016-03-25  2:30         ` Davidlohr Bueso
