* rt: rtmutex experiment doubled tbench throughput
@ 2013-02-22 16:36 Mike Galbraith
  2013-02-23  8:25 ` Mike Galbraith
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Mike Galbraith @ 2013-02-22 16:36 UTC (permalink / raw)
  To: RT; +Cc: Thomas Gleixner, Steven Rostedt

Greetings,

A user reported surprise at seeing a low priority rt task, awakened in
the same semop as a high priority task, run before the high priority
task could finish what it was supposed to do and put itself to sleep.
The reason that happens is that it and a few of its brothers (highly
synchronized task schedulers) enter semop to diddle the same array at
the same time, meet at a spinlock (rtmutex), and block.
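
For illustration only, a bare-bones userspace sketch of the shape of the
thing (two processes with made-up SCHED_FIFO priorities pinned to one
CPU; not the user's code and not a reproducer, just the structure of the
surprise, run as root):

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>
#include <sched.h>
#include <unistd.h>
#include <stdlib.h>

static void set_fifo(int prio)
{
	struct sched_param sp = { .sched_priority = prio };

	sched_setscheduler(0, SCHED_FIFO, &sp);
}

int main(void)
{
	int semid = semget(IPC_PRIVATE, 2, 0600 | IPC_CREAT);
	struct sembuf wait0 = { 0, -1, 0 };	/* sem_num, sem_op, sem_flg */
	struct sembuf post0 = { 0,  1, 0 };
	struct sembuf wait1 = { 1, -1, 0 };
	struct sembuf post1 = { 1,  1, 0 };
	cpu_set_t cpu;

	CPU_ZERO(&cpu);
	CPU_SET(0, &cpu);
	sched_setaffinity(0, sizeof(cpu), &cpu);	/* everyone on CPU 0 */

	if (fork() == 0) {			/* low priority worker */
		set_fifo(1);
		semop(semid, &wait0, 1);	/* sleep until posted */
		/* user expects to get here only after the parent slept */
		semop(semid, &post1, 1);
		exit(0);
	}

	set_fifo(50);				/* high priority "scheduler" */
	sleep(1);				/* let the worker block */
	semop(semid, &post0, 1);		/* wake the worker ... */
	/* ... then diddle the array again before putting itself to sleep */
	semop(semid, &wait1, 1);

	wait(NULL);
	semctl(semid, 0, IPC_RMID);
	return 0;
}

The point is only where the blocking can happen: on -rt, semop() takes a
sleeping spinlock (rtmutex) on the array, so the high priority caller
can block in its second semop while the freshly woken low priority
worker gets the CPU.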

Now you could say "Waking task foo and depending on its priority being
lower than task bar's to keep it off the CPU is your bad if you do so
without a no-block guarantee in hand for everything task bar does",
which I did, that and "The semop syscall is not atomic per POSIX, it's
the array operations that are atomic.  Blocking on a contended lock is
fine".

Anyway, I looked at rtmutex.c and started pondering, wondering if I
could un-surprise the user without breaking the world, and maybe even
make non-rt stuff perform a little better.

Numerous deadlocks and cool explosions later...

This seems to work, no harm to the 60 core jitter testcase detected, and
it doubled tbench throughput.  But be advised, your breakfast may
emigrate, or a POSIX swat team may kick in your door if you even _look_
at this.

64 core DL980G2, 3.0.61-rt85

vogelweide:/:[0]# tbench.sh 128 10
waiting for connections
dbench version 3.04 - Copyright Andrew Tridgell 1999-2004

Running for 10 seconds with load '/usr/share/dbench/client.txt' and minimum warmup 2 secs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^128 clients started
 128      5881  6239.48 MB/sec  warmup   1 sec   
 128     17978  4169.33 MB/sec  execute   1 sec   
 128     24051  4173.54 MB/sec  execute   2 sec   
 128     30131  4185.35 MB/sec  execute   3 sec   
 128     36145  4173.59 MB/sec  execute   4 sec   
 128     42233  4182.69 MB/sec  execute   5 sec   
 128     48293  4181.18 MB/sec  execute   6 sec   
 128     54407  4185.10 MB/sec  execute   7 sec   
 128     60445  4183.09 MB/sec  execute   8 sec   
 128     66482  4179.41 MB/sec  execute   9 sec   
 128     72543  4179.37 MB/sec  cleanup  10 sec   
 128     72543  4174.07 MB/sec  cleanup  10 sec   

Throughput 4179.49 MB/sec 128 procs
924536 packets/sec
/root/bin/tbench.sh: line 33: 11292 Killed                  tbench_srv


vogelweide:/:[0]# echo 1 >  /proc/sys/kernel/sched_rt_spin_yield
vogelweide:/:[0]# tbench.sh 128 10
waiting for connections
dbench version 3.04 - Copyright Andrew Tridgell 1999-2004

Running for 10 seconds with load '/usr/share/dbench/client.txt' and minimum warmup 2 secs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^128 clients started
 128     14360  12090.05 MB/sec  warmup   1 sec   
 128     42573  9740.67 MB/sec  execute   1 sec   
 128     56661  9736.85 MB/sec  execute   2 sec   
 128     70752  9723.79 MB/sec  execute   3 sec   
 128     84850  9724.82 MB/sec  execute   4 sec   
 128     98936  9720.49 MB/sec  execute   5 sec   
 128    113021  9721.15 MB/sec  execute   6 sec   
 128    127111  9723.26 MB/sec  execute   7 sec   
 128    141203  9722.81 MB/sec  execute   8 sec   
 128    155300  9722.90 MB/sec  execute   9 sec   
 128    169392  9727.48 MB/sec  cleanup  10 sec   
 128    169392  9712.52 MB/sec  cleanup  10 sec   

Throughput 9727.56 MB/sec 128 procs
2150132 packets/sec
/root/bin/tbench.sh: line 34: 11568 Killed                  tbench_srv

---
 include/linux/sched.h   |    7 +++++--
 kernel/rtmutex.c        |   48 +++++++++++++++++++++++++++++++++++++++++-------
 kernel/rtmutex_common.h |   42 +++++++++++++++++++++++++++++++++++++++---
 kernel/sched.c          |   31 ++++++++++++++++++++++++++++---
 kernel/sched_fair.c     |    5 +++++
 kernel/sched_rt.c       |    3 +++
 kernel/sysctl.c         |   11 +++++++++++
 7 files changed, 132 insertions(+), 15 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2083,6 +2083,9 @@ extern unsigned int sysctl_sched_latency
 extern unsigned int sysctl_sched_min_granularity;
 extern unsigned int sysctl_sched_wakeup_granularity;
 extern unsigned int sysctl_sched_child_runs_first;
+#ifdef CONFIG_PREEMPT_RT_FULL
+extern unsigned int sysctl_sched_rt_spin_yield;
+#endif
 
 enum sched_tunable_scaling {
 	SCHED_TUNABLESCALING_NONE,
@@ -2146,12 +2149,12 @@ extern unsigned int sysctl_sched_cfs_ban
 #endif
 
 #ifdef CONFIG_RT_MUTEXES
-extern void task_setprio(struct task_struct *p, int prio);
+extern void task_setprio(struct task_struct *p, int prio, int yield);
 extern int rt_mutex_getprio(struct task_struct *p);
 extern int rt_mutex_check_prio(struct task_struct *task, int newprio);
 static inline void rt_mutex_setprio(struct task_struct *p, int prio)
 {
-	task_setprio(p, prio);
+	task_setprio(p, prio, 0);
 }
 extern void rt_mutex_adjust_pi(struct task_struct *p);
 static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -689,7 +689,7 @@ static inline void rt_spin_lock_fastunlo
  * on rcu_read_lock() and the check against the lock owner.
  */
 static int adaptive_wait(struct rt_mutex *lock,
-			 struct task_struct *owner)
+			 struct task_struct *owner, int spin)
 {
 	int res = 0;
 
@@ -702,8 +702,8 @@ static int adaptive_wait(struct rt_mutex
 		 * checking the above to be valid.
 		 */
 		barrier();
-		if (!owner->on_cpu) {
-			res = 1;
+		if (!owner || !owner->on_cpu) {
+			res = !spin;
 			break;
 		}
 		cpu_relax();
@@ -713,7 +713,7 @@ static int adaptive_wait(struct rt_mutex
 }
 #else
 static int adaptive_wait(struct rt_mutex *lock,
-			 struct task_struct *orig_owner)
+			 struct task_struct *orig_owner, int spin)
 {
 	return 1;
 }
@@ -733,7 +733,7 @@ static void  noinline __sched rt_spin_lo
 {
 	struct task_struct *lock_owner, *self = current;
 	struct rt_mutex_waiter waiter, *top_waiter;
-	int ret;
+	int ret, wait, rt_spin = 0, other_spin = 0, cpu;
 
 	rt_mutex_init_waiter(&waiter, true);
 
@@ -761,6 +761,17 @@ static void  noinline __sched rt_spin_lo
 	ret = task_blocks_on_rt_mutex(lock, &waiter, self, 0);
 	BUG_ON(ret);
 
+	/* basic spin/yield sanity checks */
+	if (rt_spin_yield_enabled()) {
+		rt_spin = !self->saved_state;
+		/* Here there be dragons */
+		rt_spin &= !(self->flags & PF_EXITING);
+		other_spin = rt_spin;
+		rt_spin &= rt_task(self);
+		other_spin &= !rt_spin;
+	}
+	cpu = raw_smp_processor_id();
+
 	for (;;) {
 		/* Try to acquire the lock again. */
 		if (__try_to_take_rt_mutex(lock, self, &waiter, STEAL_LATERAL))
@@ -769,12 +780,25 @@ static void  noinline __sched rt_spin_lo
 		top_waiter = rt_mutex_top_waiter(lock);
 		lock_owner = rt_mutex_owner(lock);
 
+		if (rt_spin)
+			wait = 1;
+		else
+			wait = top_waiter != &waiter;
+
+		/* SCHED_OTHER can laterally steal, let them try */
+		if (other_spin) {
+			wait &= task_cpu(top_waiter->task) == cpu;
+			wait |= top_waiter->task->prio < self->prio;
+			wait |= lock_owner && !lock_owner->on_cpu;
+			wait |= lock_owner && !lock_owner->prio < self->prio;
+		}
+
 		raw_spin_unlock(&lock->wait_lock);
 
 		debug_rt_mutex_print_deadlock(&waiter);
 
-		if (top_waiter != &waiter || adaptive_wait(lock, lock_owner))
-			schedule_rt_mutex(lock);
+		if (wait || adaptive_wait(lock, lock_owner, rt_spin))
+			schedule_rt_spinlock(lock, rt_spin);
 
 		raw_spin_lock(&lock->wait_lock);
 
@@ -826,6 +850,16 @@ static void  noinline __sched rt_spin_lo
 		return;
 	}
 
+	if (rt_spin_yield_enabled() && rt_task(current)) {
+		struct rt_mutex_waiter *top_waiter = rt_mutex_top_waiter(lock);
+		struct task_struct *next = top_waiter->task;
+
+		/* Move next in line to head of its queue */
+		pi_lock(&next->pi_lock);
+		task_setprio(next, next->prio, 1);
+		pi_unlock(&next->pi_lock);
+	}
+
 	wakeup_next_waiter(lock);
 
 	raw_spin_unlock(&lock->wait_lock);
--- a/kernel/rtmutex_common.h
+++ b/kernel/rtmutex_common.h
@@ -32,9 +32,45 @@ extern void schedule_rt_mutex_test(struc
 		schedule_rt_mutex_test(_lock);			\
   } while (0)
 
-#else
-# define schedule_rt_mutex(_lock)			schedule()
-#endif
+#ifdef CONFIG_PREEMPT_RT_FULL
+#define rt_spin_yield_enabled()					\
+	(sysctl_sched_rt_spin_yield && system_state == SYSTEM_RUNNING)
+
+#define schedule_rt_spinlock(_lock, _spin)			\
+  do {								\
+	if (!(current->flags & PF_MUTEX_TESTER)) {		\
+		if (!_spin)					\
+			schedule();				\
+		else						\
+			yield();				\
+	} else							\
+		schedule_rt_mutex_test(_lock);			\
+  } while (0)
+#else /* !CONFIG_PREEMPT_RT_FULL */
+#define rt_spin_yield_enabled()	(0)
+#define schedule_rt_spinlock(_lock, _spin) schedule_rt_mutex(_lock)
+#endif /* CONFIG_PREEMPT_RT_FULL */
+
+#else /* !CONFIG_RT_MUTEX_TESTER */
+
+#define schedule_rt_mutex(_lock)			schedule()
+
+#ifdef CONFIG_PREEMPT_RT_FULL
+#define rt_spin_yield_enabled()					\
+	(sysctl_sched_rt_spin_yield && system_state == SYSTEM_RUNNING)
+
+#define schedule_rt_spinlock(_lock, _spin)			\
+  do {								\
+	if (!_spin)						\
+		schedule();					\
+	else							\
+		yield();					\
+  } while (0)
+#else /* !CONFIG_PREEMPT_RT_FULL */
+#define rt_spin_yield_enabled()	(0)
+#define schedule_rt_spinlock(_lock, _spin) schedule_rt_mutex(_lock)
+#endif /* CONFIG_PREEMPT_RT_FULL */
+#endif /* CONFIG_RT_MUTEX_TESTER */
 
 /*
  * This is the control structure for tasks blocked on a rt_mutex,
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -943,6 +943,11 @@ late_initcall(sched_init_debug);
 const_debug unsigned int sysctl_sched_nr_migrate = 32;
 #else
 const_debug unsigned int sysctl_sched_nr_migrate = 8;
+
+/*
+ * rt spinlock waiters spin and yield() if necessary vs blocking
+ */
+unsigned int sysctl_sched_rt_spin_yield __read_mostly;
 #endif
 
 /*
@@ -5292,6 +5297,7 @@ EXPORT_SYMBOL(sleep_on_timeout);
  * task_setprio - set the current priority of a task
  * @p: task
  * @prio: prio value (kernel-internal form)
+ * @requeue: requeue an rt_spin_lock_slowlock() top waiter and preempt
  *
  * This function changes the 'effective' priority of a task. It does
  * not touch ->normal_prio like __setscheduler().
@@ -5299,7 +5305,7 @@ EXPORT_SYMBOL(sleep_on_timeout);
  * Used by the rt_mutex code to implement priority inheritance
  * logic. Call site only calls if the priority of the task changed.
  */
-void task_setprio(struct task_struct *p, int prio)
+void task_setprio(struct task_struct *p, int prio, int requeue)
 {
 	int oldprio, on_rq, running;
 	struct rq *rq;
@@ -5332,6 +5338,8 @@ void task_setprio(struct task_struct *p,
 	prev_class = p->sched_class;
 	on_rq = p->on_rq;
 	running = task_current(rq, p);
+	if (requeue && (running || !on_rq || !rt_prio(oldprio)))
+		goto out_unlock;
 	if (on_rq)
 		dequeue_task(rq, p, 0);
 	if (running)
@@ -5346,8 +5354,25 @@ void task_setprio(struct task_struct *p,
 
 	if (running)
 		p->sched_class->set_curr_task(rq);
-	if (on_rq)
-		enqueue_task(rq, p, oldprio < prio ? ENQUEUE_HEAD : 0);
+	if (on_rq) {
+		if (!sysctl_sched_rt_spin_yield) {
+			enqueue_task(rq, p, oldprio < prio ? ENQUEUE_HEAD : 0);
+		} else {
+			enqueue_task(rq, p, ENQUEUE_HEAD);
+
+			/*
+			 * If we're requeueing a spinlock waiter, preempt any
+			 * peer in the way, waiter involuntarily blocked, so
+			 * has the right to use this CPU before its peers.
+			 */
+			requeue &= p->prio <= rq->curr->prio;
+			requeue &= rq->curr->state == TASK_RUNNING;
+			requeue &= rq->curr != current;
+
+			if (requeue)
+				resched_task(rq->curr);
+		}
+	}
 
 	check_class_changed(rq, p, prev_class, oldprio);
 out_unlock:
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2510,6 +2510,11 @@ static void check_preempt_wakeup(struct
 	if (unlikely(se == pse))
 		return;
 
+#ifdef CONFIG_PREEMPT_RT_FULL
+	if (sysctl_sched_rt_spin_yield && curr->migrate_disable)
+		return;
+#endif
+
 	/*
 	 * This is possible from callers such as pull_task(), in which we
 	 * unconditionally check_prempt_curr() after an enqueue (which may have
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -991,6 +991,9 @@ enqueue_task_rt(struct rq *rq, struct ta
 	if (flags & ENQUEUE_WAKEUP)
 		rt_se->timeout = 0;
 
+	if (sysctl_sched_rt_spin_yield && (flags & WF_LOCK_SLEEPER))
+		flags |= ENQUEUE_HEAD;
+
 	enqueue_rt_entity(rt_se, flags & ENQUEUE_HEAD);
 
 	if (!task_current(rq, p) && p->rt.nr_cpus_allowed > 1)
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -368,6 +368,17 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= sched_rt_handler,
 	},
+#ifdef CONFIG_PREEMPT_RT_FULL
+	{
+		.procname	= "sched_rt_spin_yield",
+		.data		= &sysctl_sched_rt_spin_yield,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+#endif
 	{
 		.procname	= "sched_compat_yield",
 		.data		= &sysctl_sched_compat_yield,





* Re: rt: rtmutex experiment doubled tbench throughput
  2013-02-22 16:36 rt: rtmutex experiment doubled tbench throughput Mike Galbraith
@ 2013-02-23  8:25 ` Mike Galbraith
  2013-02-26 19:45 ` Steven Rostedt
  2013-02-26 21:25 ` Steven Rostedt
  2 siblings, 0 replies; 11+ messages in thread
From: Mike Galbraith @ 2013-02-23  8:25 UTC (permalink / raw)
  To: RT; +Cc: Thomas Gleixner, Steven Rostedt

Here it is again with some old refuse hauled away, if anybody wants to
play with it.

I can't find anything that it breaks, fwiw.

---
 include/linux/sched.h   |   27 +++++++++++++++++++++++++--
 kernel/rtmutex.c        |   44 +++++++++++++++++++++++++++++++++++++++-----
 kernel/rtmutex_common.h |   34 +++++++++++++++++++++++++++++++---
 kernel/sched.c          |   32 +++++++++++++++++++++++++++++---
 kernel/sched_fair.c     |    4 ++++
 kernel/sched_rt.c       |    3 +++
 kernel/sys.c            |    1 +
 kernel/sysctl.c         |   11 +++++++++++
 8 files changed, 143 insertions(+), 13 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2146,12 +2146,35 @@ extern unsigned int sysctl_sched_cfs_ban
 #endif
 
 #ifdef CONFIG_RT_MUTEXES
-extern void task_setprio(struct task_struct *p, int prio);
+#ifdef CONFIG_PREEMPT_RT_FULL
+extern unsigned int sysctl_sched_rt_spin_yield;
+
+static inline bool rt_spin_yield_enabled(void)
+{
+	return sysctl_sched_rt_spin_yield;
+}
+
+static inline void rt_spin_yield_disable(void)
+{
+	sysctl_sched_rt_spin_yield = 0;
+}
+#else
+static inline bool rt_spin_yield_enabled(void)
+{
+	return 0;
+}
+static inline void rt_spin_yield_disable(void) { }
+#endif
+extern void task_setprio(struct task_struct *p, int prio, int requeue);
 extern int rt_mutex_getprio(struct task_struct *p);
 extern int rt_mutex_check_prio(struct task_struct *task, int newprio);
 static inline void rt_mutex_setprio(struct task_struct *p, int prio)
 {
-	task_setprio(p, prio);
+	task_setprio(p, prio, 0);
+}
+static inline void rt_mutex_requeue(struct task_struct *p, int prio)
+{
+	task_setprio(p, prio, 1);
 }
 extern void rt_mutex_adjust_pi(struct task_struct *p);
 static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -702,8 +702,8 @@ static int adaptive_wait(struct rt_mutex
 		 * checking the above to be valid.
 		 */
 		barrier();
-		if (!owner->on_cpu) {
-			res = 1;
+		if (!owner || !owner->on_cpu) {
+			res = owner && !owner->on_cpu;
 			break;
 		}
 		cpu_relax();
@@ -733,7 +733,7 @@ static void  noinline __sched rt_spin_lo
 {
 	struct task_struct *lock_owner, *self = current;
 	struct rt_mutex_waiter waiter, *top_waiter;
-	int ret;
+	int ret, wait, rt_spin = 0, other_spin = 0, cpu;
 
 	rt_mutex_init_waiter(&waiter, true);
 
@@ -761,6 +761,17 @@ static void  noinline __sched rt_spin_lo
 	ret = task_blocks_on_rt_mutex(lock, &waiter, self, 0);
 	BUG_ON(ret);
 
+	/* basic spin/yield sanity checks */
+	if (rt_spin_yield_enabled()) {
+		rt_spin = !self->saved_state;
+		/* Here there be dragons */
+		rt_spin &= !(self->flags & PF_EXITING);
+		other_spin = rt_spin;
+		rt_spin &= rt_task(self);
+		other_spin &= !rt_spin;
+	}
+	cpu = raw_smp_processor_id();
+
 	for (;;) {
 		/* Try to acquire the lock again. */
 		if (__try_to_take_rt_mutex(lock, self, &waiter, STEAL_LATERAL))
@@ -769,12 +780,25 @@ static void  noinline __sched rt_spin_lo
 		top_waiter = rt_mutex_top_waiter(lock);
 		lock_owner = rt_mutex_owner(lock);
 
+		if (rt_spin)
+			wait = 1;
+		else
+			wait = top_waiter != &waiter;
+
+		/* SCHED_OTHER can laterally steal, let them try */
+		if (other_spin) {
+			wait &= task_cpu(top_waiter->task) == cpu;
+			wait |= top_waiter->task->prio < self->prio;
+			wait |= lock_owner && !lock_owner->on_cpu;
+			wait |= lock_owner && !lock_owner->prio < self->prio;
+		}
+
 		raw_spin_unlock(&lock->wait_lock);
 
 		debug_rt_mutex_print_deadlock(&waiter);
 
-		if (top_waiter != &waiter || adaptive_wait(lock, lock_owner))
-			schedule_rt_mutex(lock);
+		if (wait || adaptive_wait(lock, lock_owner))
+			schedule_rt_spinlock(lock, rt_spin);
 
 		raw_spin_lock(&lock->wait_lock);
 
@@ -826,6 +850,16 @@ static void  noinline __sched rt_spin_lo
 		return;
 	}
 
+	if (rt_spin_yield_enabled() && rt_task(current)) {
+		struct rt_mutex_waiter *top_waiter = rt_mutex_top_waiter(lock);
+		struct task_struct *next = top_waiter->task;
+
+		/* Move next in line to head of its queue */
+		pi_lock(&next->pi_lock);
+		rt_mutex_requeue(next, next->prio);
+		pi_unlock(&next->pi_lock);
+	}
+
 	wakeup_next_waiter(lock);
 
 	raw_spin_unlock(&lock->wait_lock);
--- a/kernel/rtmutex_common.h
+++ b/kernel/rtmutex_common.h
@@ -32,9 +32,37 @@ extern void schedule_rt_mutex_test(struc
 		schedule_rt_mutex_test(_lock);			\
   } while (0)
 
-#else
-# define schedule_rt_mutex(_lock)			schedule()
-#endif
+#ifdef CONFIG_PREEMPT_RT_FULL
+#define schedule_rt_spinlock(_lock, _spin)			\
+  do {								\
+	if (!(current->flags & PF_MUTEX_TESTER)) {		\
+		if (!_spin)					\
+			schedule();				\
+		else						\
+			yield();				\
+	} else							\
+		schedule_rt_mutex_test(_lock);			\
+  } while (0)
+#else /* !CONFIG_PREEMPT_RT_FULL */
+#define schedule_rt_spinlock(_lock, _spin) schedule_rt_mutex(_lock)
+#endif /* CONFIG_PREEMPT_RT_FULL */
+
+#else /* !CONFIG_RT_MUTEX_TESTER */
+
+#define schedule_rt_mutex(_lock)			schedule()
+
+#ifdef CONFIG_PREEMPT_RT_FULL
+#define schedule_rt_spinlock(_lock, _spin)			\
+  do {								\
+	if (!_spin)						\
+		schedule();					\
+	else							\
+		yield();					\
+  } while (0)
+#else /* !CONFIG_PREEMPT_RT_FULL */
+#define schedule_rt_spinlock(_lock, _spin) schedule_rt_mutex(_lock)
+#endif /* CONFIG_PREEMPT_RT_FULL */
+#endif /* CONFIG_RT_MUTEX_TESTER */
 
 /*
  * This is the control structure for tasks blocked on a rt_mutex,
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -943,6 +943,12 @@ late_initcall(sched_init_debug);
 const_debug unsigned int sysctl_sched_nr_migrate = 32;
 #else
 const_debug unsigned int sysctl_sched_nr_migrate = 8;
+
+/*
+ * rt spinlock waiters yield() if necessary vs blocking.
+ * SCHED_OTHER must block, but spin if they can do so.
+ */
+unsigned int sysctl_sched_rt_spin_yield __read_mostly;
 #endif
 
 /*
@@ -5292,6 +5298,7 @@ EXPORT_SYMBOL(sleep_on_timeout);
  * task_setprio - set the current priority of a task
  * @p: task
  * @prio: prio value (kernel-internal form)
+ * @requeue: requeue an rt_spin_lock_slowlock() top waiter and preempt
  *
  * This function changes the 'effective' priority of a task. It does
  * not touch ->normal_prio like __setscheduler().
@@ -5299,7 +5306,7 @@ EXPORT_SYMBOL(sleep_on_timeout);
  * Used by the rt_mutex code to implement priority inheritance
  * logic. Call site only calls if the priority of the task changed.
  */
-void task_setprio(struct task_struct *p, int prio)
+void task_setprio(struct task_struct *p, int prio, int requeue)
 {
 	int oldprio, on_rq, running;
 	struct rq *rq;
@@ -5332,6 +5339,8 @@ void task_setprio(struct task_struct *p,
 	prev_class = p->sched_class;
 	on_rq = p->on_rq;
 	running = task_current(rq, p);
+	if (requeue && (running || !on_rq || !rt_prio(oldprio)))
+		goto out_unlock;
 	if (on_rq)
 		dequeue_task(rq, p, 0);
 	if (running)
@@ -5346,8 +5355,25 @@ void task_setprio(struct task_struct *p,
 
 	if (running)
 		p->sched_class->set_curr_task(rq);
-	if (on_rq)
-		enqueue_task(rq, p, oldprio < prio ? ENQUEUE_HEAD : 0);
+	if (on_rq) {
+		if (!sysctl_sched_rt_spin_yield) {
+			enqueue_task(rq, p, oldprio < prio ? ENQUEUE_HEAD : 0);
+		} else {
+			enqueue_task(rq, p, ENQUEUE_HEAD);
+
+			/*
+			 * If we're requeueing a spinlock waiter, preempt any
+			 * peer in the way, waiter involuntarily blocked, so
+			 * has the right to use this CPU before its peers.
+			 */
+			requeue &= p->prio <= rq->curr->prio;
+			requeue &= rq->curr->state == TASK_RUNNING;
+			requeue &= rq->curr != current;
+
+			if (requeue)
+				resched_task(rq->curr);
+		}
+	}
 
 	check_class_changed(rq, p, prev_class, oldprio);
 out_unlock:
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2510,6 +2510,10 @@ static void check_preempt_wakeup(struct
 	if (unlikely(se == pse))
 		return;
 
+	/* FIXME: might be spinning on a lock, good enough to play with */
+	if (rt_spin_yield_enabled() && curr->migrate_disable)
+		return;
+
 	/*
 	 * This is possible from callers such as pull_task(), in which we
 	 * unconditionally check_prempt_curr() after an enqueue (which may have
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -991,6 +991,9 @@ enqueue_task_rt(struct rq *rq, struct ta
 	if (flags & ENQUEUE_WAKEUP)
 		rt_se->timeout = 0;
 
+	if (rt_spin_yield_enabled() && (flags & WF_LOCK_SLEEPER))
+		flags |= ENQUEUE_HEAD;
+
 	enqueue_rt_entity(rt_se, flags & ENQUEUE_HEAD);
 
 	if (!task_current(rq, p) && p->rt.nr_cpus_allowed > 1)
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -316,6 +316,7 @@ void kernel_restart_prepare(char *cmd)
 {
 	blocking_notifier_call_chain(&reboot_notifier_list, SYS_RESTART, cmd);
 	system_state = SYSTEM_RESTART;
+	rt_spin_yield_disable();
 	usermodehelper_disable();
 	device_shutdown();
 	syscore_shutdown();
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -368,6 +368,17 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= sched_rt_handler,
 	},
+#ifdef CONFIG_PREEMPT_RT_FULL
+	{
+		.procname	= "sched_rt_spin_yield",
+		.data		= &sysctl_sched_rt_spin_yield,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+#endif
 	{
 		.procname	= "sched_compat_yield",
 		.data		= &sysctl_sched_compat_yield,




* Re: rt: rtmutex experiment doubled tbench throughput
  2013-02-22 16:36 rt: rtmutex experiment doubled tbench throughput Mike Galbraith
  2013-02-23  8:25 ` Mike Galbraith
@ 2013-02-26 19:45 ` Steven Rostedt
  2013-02-27  4:19   ` Mike Galbraith
  2013-02-26 21:25 ` Steven Rostedt
  2 siblings, 1 reply; 11+ messages in thread
From: Steven Rostedt @ 2013-02-26 19:45 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: RT, Thomas Gleixner

On Fri, 2013-02-22 at 17:36 +0100, Mike Galbraith wrote:
> Greetings,
> 
> A user reported surprise at seeing a low priority rt task awakened in
> the same semop as a high priority task run before the high priority task
> could finish what it was supposed to do, and put itself to sleep.  The
> reason for that happening is that it, and a few of its brothers (highly
> synchronized task schedulers) enter semop to diddle the same array at
> the same time, meet at spinlock (rtmutex) and block.
> 
> Now you could say "Waking task foo and depending on it's lower than task
> bar priority to keep it off the CPU is your bad if you do so without a
> no-block guarantee in hand for everything task bar does." which I did,
> that and "The semop syscall is not atomic per POSIX, it's the array
> operations that are atomic.  Blocking on a contended lock is fine".
> 
> Anyway, I looked at rtmutex.c and started pondering, wondering if I
> could un-surprise the user without breaking the world, and maybe even
> make non-rt stuff perform a little better.
> 
> Numerous deadlocks and cool explosions later...
> 
> This seems to work, no harm to 60 core jitter testcase detected, and it
> doubled tbench throughput.  But be advised, your breakfast may emigrate,
> or a POSIX swat team may kick in your door if you even _look_ at this. 
> 
> 64 core DL980G2, 3.0.61-rt85
> 
> vogelweide:/:[0]# tbench.sh 128 10
> waiting for connections
> dbench version 3.04 - Copyright Andrew Tridgell 1999-2004
> 
> Running for 10 seconds with load '/usr/share/dbench/client.txt' and minimum warmup 2 secs
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^128 clients started
>  128      5881  6239.48 MB/sec  warmup   1 sec   
>  128     17978  4169.33 MB/sec  execute   1 sec   
>  128     24051  4173.54 MB/sec  execute   2 sec   
>  128     30131  4185.35 MB/sec  execute   3 sec   
>  128     36145  4173.59 MB/sec  execute   4 sec   
>  128     42233  4182.69 MB/sec  execute   5 sec   
>  128     48293  4181.18 MB/sec  execute   6 sec   
>  128     54407  4185.10 MB/sec  execute   7 sec   
>  128     60445  4183.09 MB/sec  execute   8 sec   
>  128     66482  4179.41 MB/sec  execute   9 sec   
>  128     72543  4179.37 MB/sec  cleanup  10 sec   
>  128     72543  4174.07 MB/sec  cleanup  10 sec   
> 
> Throughput 4179.49 MB/sec 128 procs
> 924536 packets/sec
> /root/bin/tbench.sh: line 33: 11292 Killed                  tbench_srv
> 
> 
> vogelweide:/:[0]# echo 1 >  /proc/sys/kernel/sched_rt_spin_yield
> vogelweide:/:[0]# tbench.sh 128 10
> waiting for connections
> dbench version 3.04 - Copyright Andrew Tridgell 1999-2004
> 
> Running for 10 seconds with load '/usr/share/dbench/client.txt' and minimum warmup 2 secs
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^128 clients started
>  128     14360  12090.05 MB/sec  warmup   1 sec   
>  128     42573  9740.67 MB/sec  execute   1 sec   
>  128     56661  9736.85 MB/sec  execute   2 sec   
>  128     70752  9723.79 MB/sec  execute   3 sec   
>  128     84850  9724.82 MB/sec  execute   4 sec   
>  128     98936  9720.49 MB/sec  execute   5 sec   
>  128    113021  9721.15 MB/sec  execute   6 sec   
>  128    127111  9723.26 MB/sec  execute   7 sec   
>  128    141203  9722.81 MB/sec  execute   8 sec   
>  128    155300  9722.90 MB/sec  execute   9 sec   
>  128    169392  9727.48 MB/sec  cleanup  10 sec   
>  128    169392  9712.52 MB/sec  cleanup  10 sec   
> 

Wow, that's some result!

> Throughput 9727.56 MB/sec 128 procs
> 2150132 packets/sec
> /root/bin/tbench.sh: line 34: 11568 Killed                  tbench_srv
> 
> ---
>  include/linux/sched.h   |    7 +++++--
>  kernel/rtmutex.c        |   48 +++++++++++++++++++++++++++++++++++++++++-------
>  kernel/rtmutex_common.h |   42 +++++++++++++++++++++++++++++++++++++++---
>  kernel/sched.c          |   31 ++++++++++++++++++++++++++++---
>  kernel/sched_fair.c     |    5 +++++
>  kernel/sched_rt.c       |    3 +++
>  kernel/sysctl.c         |   11 +++++++++++
>  7 files changed, 132 insertions(+), 15 deletions(-)
> 
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2083,6 +2083,9 @@ extern unsigned int sysctl_sched_latency
>  extern unsigned int sysctl_sched_min_granularity;
>  extern unsigned int sysctl_sched_wakeup_granularity;
>  extern unsigned int sysctl_sched_child_runs_first;
> +#ifdef CONFIG_PREEMPT_RT_FULL
> +extern unsigned int sysctl_sched_rt_spin_yield;
> +#endif

I'm not sure we want this. I think it makes sense to either always do
it, or never do it.

>  
>  enum sched_tunable_scaling {
>  	SCHED_TUNABLESCALING_NONE,
> @@ -2146,12 +2149,12 @@ extern unsigned int sysctl_sched_cfs_ban
>  #endif
>  
>  #ifdef CONFIG_RT_MUTEXES
> -extern void task_setprio(struct task_struct *p, int prio);
> +extern void task_setprio(struct task_struct *p, int prio, int yield);
>  extern int rt_mutex_getprio(struct task_struct *p);
>  extern int rt_mutex_check_prio(struct task_struct *task, int newprio);
>  static inline void rt_mutex_setprio(struct task_struct *p, int prio)
>  {
> -	task_setprio(p, prio);
> +	task_setprio(p, prio, 0);
>  }
>  extern void rt_mutex_adjust_pi(struct task_struct *p);
>  static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
> --- a/kernel/rtmutex.c
> +++ b/kernel/rtmutex.c
> @@ -689,7 +689,7 @@ static inline void rt_spin_lock_fastunlo
>   * on rcu_read_lock() and the check against the lock owner.
>   */
>  static int adaptive_wait(struct rt_mutex *lock,
> -			 struct task_struct *owner)
> +			 struct task_struct *owner, int spin)
>  {
>  	int res = 0;
>  
> @@ -702,8 +702,8 @@ static int adaptive_wait(struct rt_mutex
>  		 * checking the above to be valid.
>  		 */
>  		barrier();
> -		if (!owner->on_cpu) {
> -			res = 1;
> +		if (!owner || !owner->on_cpu) {
> +			res = !spin;
>  			break;

Hmm, if spin is set, this function *always* returns zero. Ug, don't do
this, a better solution below (where adaptive_wait() is called).

>  		}
>  		cpu_relax();
> @@ -713,7 +713,7 @@ static int adaptive_wait(struct rt_mutex
>  }
>  #else
>  static int adaptive_wait(struct rt_mutex *lock,
> -			 struct task_struct *orig_owner)
> +			 struct task_struct *orig_owner, int spin)
>  {
>  	return 1;
>  }
> @@ -733,7 +733,7 @@ static void  noinline __sched rt_spin_lo
>  {
>  	struct task_struct *lock_owner, *self = current;
>  	struct rt_mutex_waiter waiter, *top_waiter;
> -	int ret;
> +	int ret, wait, rt_spin = 0, other_spin = 0, cpu;
>  
>  	rt_mutex_init_waiter(&waiter, true);
>  
> @@ -761,6 +761,17 @@ static void  noinline __sched rt_spin_lo
>  	ret = task_blocks_on_rt_mutex(lock, &waiter, self, 0);
>  	BUG_ON(ret);
>  
> +	/* basic spin/yield sanity checks */
> +	if (rt_spin_yield_enabled()) {
> +		rt_spin = !self->saved_state;
> +		/* Here there be dragons */
> +		rt_spin &= !(self->flags & PF_EXITING);
> +		other_spin = rt_spin;
> +		rt_spin &= rt_task(self);
> +		other_spin &= !rt_spin;

The multiple assignments are very ugly.

	int self_state = !self->saved_state && !(self->flags & PF_EXITING);

	rt_spin = self_state && rt_task(self);
	other_spin = self_state && !rt_task(self);

Much easier to read.


> +	}
> +	cpu = raw_smp_processor_id();
> +
>  	for (;;) {
>  		/* Try to acquire the lock again. */
>  		if (__try_to_take_rt_mutex(lock, self, &waiter, STEAL_LATERAL))
> @@ -769,12 +780,25 @@ static void  noinline __sched rt_spin_lo
>  		top_waiter = rt_mutex_top_waiter(lock);
>  		lock_owner = rt_mutex_owner(lock);
>  
> +		if (rt_spin)
> +			wait = 1;
> +		else
> +			wait = top_waiter != &waiter;
> +
> +		/* SCHED_OTHER can laterally steal, let them try */
> +		if (other_spin) {
> +			wait &= task_cpu(top_waiter->task) == cpu;
> +			wait |= top_waiter->task->prio < self->prio;
> +			wait |= lock_owner && !lock_owner->on_cpu;
> +			wait |= lock_owner && !lock_owner->prio < self->prio;

Again very ugly and hard to read.

Let's see. If other_spin is set, this also means that rt_spin is not, so
wait is only set if top_waiter != &waiter (current is not top_waiter).
OK, if current is top waiter, we don't want to wait. That makes sense.

Now, if top_waiter->task cpu != current cpu, we don't wait. Ah, even if
we are not the top waiter, we can still spin if the top_waiter is on
another CPU. But if top_waiter is higher priority, then we still wait.
Or if the lock owner is not on a CPU, or if the lock owner is not higher
priority than current???

I'm confused as to why you did this. This requires a hell of a lot of
explanation.


> +		}
> +
>  		raw_spin_unlock(&lock->wait_lock);
>  
>  		debug_rt_mutex_print_deadlock(&waiter);
>  
> -		if (top_waiter != &waiter || adaptive_wait(lock, lock_owner))
> -			schedule_rt_mutex(lock);
> +		if (wait || adaptive_wait(lock, lock_owner, rt_spin))
> +			schedule_rt_spinlock(lock, rt_spin);

I'm also confused: if rt_spin is set, we have wait = 1. wait can only be
cleared if other_spin is set, but other_spin can't be set if rt_spin is
set.

But if I missed something, there's no reason to pass rt_spin to
adaptive_wait() as if it's set, adaptive_wait returns zero. You can just
do:

	if (wait || (adaptive_wait(lock, lock_owner) && !rt_spin))

But I'm not sure I ever see that case anyway.
>  
>  		raw_spin_lock(&lock->wait_lock);
>  
> @@ -826,6 +850,16 @@ static void  noinline __sched rt_spin_lo
>  		return;
>  	}
>  
> +	if (rt_spin_yield_enabled() && rt_task(current)) {
> +		struct rt_mutex_waiter *top_waiter = rt_mutex_top_waiter(lock);
> +		struct task_struct *next = top_waiter->task;
> +
> +		/* Move next in line to head of its queue */
> +		pi_lock(&next->pi_lock);
> +		task_setprio(next, next->prio, 1);
> +		pi_unlock(&next->pi_lock);

You need to explain why this is done.

> +	}
> +
>  	wakeup_next_waiter(lock);
>  
>  	raw_spin_unlock(&lock->wait_lock);
> --- a/kernel/rtmutex_common.h
> +++ b/kernel/rtmutex_common.h
> @@ -32,9 +32,45 @@ extern void schedule_rt_mutex_test(struc
>  		schedule_rt_mutex_test(_lock);			\
>    } while (0)
>  
> -#else
> -# define schedule_rt_mutex(_lock)			schedule()
> -#endif
> +#ifdef CONFIG_PREEMPT_RT_FULL
> +#define rt_spin_yield_enabled()					\
> +	(sysctl_sched_rt_spin_yield && system_state == SYSTEM_RUNNING)
> +
> +#define schedule_rt_spinlock(_lock, _spin)			\
> +  do {								\
> +	if (!(current->flags & PF_MUTEX_TESTER)) {		\
> +		if (!_spin)					\
> +			schedule();				\
> +		else						\
> +			yield();				\
> +	} else							\
> +		schedule_rt_mutex_test(_lock);			\

I'm wondering if this is the only thing you needed to get your
performance boost.

> +  } while (0)
> +#else /* !CONFIG_PREEMPT_RT_FULL */
> +#define rt_spin_yield_enabled()	(0)
> +#define schedule_rt_spinlock(_lock, _spin) schedule_rt_mutex(_lock)
> +#endif /* CONFIG_PREEMPT_RT_FULL */
> +
> +#else /* !CONFIG_RT_MUTEX_TESTER */
> +
> +#define schedule_rt_mutex(_lock)			schedule()
> +
> +#ifdef CONFIG_PREEMPT_RT_FULL
> +#define rt_spin_yield_enabled()					\
> +	(sysctl_sched_rt_spin_yield && system_state == SYSTEM_RUNNING)
> +
> +#define schedule_rt_spinlock(_lock, _spin)			\
> +  do {								\
> +	if (!_spin)						\
> +		schedule();					\
> +	else							\
> +		yield();					\
> +  } while (0)
> +#else /* !CONFIG_PREEMPT_RT_FULL */
> +#define rt_spin_yield_enabled()	(0)
> +#define schedule_rt_spinlock(_lock, _spin) schedule_rt_mutex(_lock)
> +#endif /* CONFIG_PREEMPT_RT_FULL */
> +#endif /* CONFIG_RT_MUTEX_TESTER */
>  
>  /*
>   * This is the control structure for tasks blocked on a rt_mutex,
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -943,6 +943,11 @@ late_initcall(sched_init_debug);
>  const_debug unsigned int sysctl_sched_nr_migrate = 32;
>  #else
>  const_debug unsigned int sysctl_sched_nr_migrate = 8;
> +
> +/*
> + * rt spinlock waiters spin and yield() if necessary vs blocking
> + */
> +unsigned int sysctl_sched_rt_spin_yield __read_mostly;
>  #endif
>  
>  /*
> @@ -5292,6 +5297,7 @@ EXPORT_SYMBOL(sleep_on_timeout);
>   * task_setprio - set the current priority of a task
>   * @p: task
>   * @prio: prio value (kernel-internal form)
> + * @requeue: requeue an rt_spin_lock_slowlock() top waiter and preempt
>   *
>   * This function changes the 'effective' priority of a task. It does
>   * not touch ->normal_prio like __setscheduler().
> @@ -5299,7 +5305,7 @@ EXPORT_SYMBOL(sleep_on_timeout);
>   * Used by the rt_mutex code to implement priority inheritance
>   * logic. Call site only calls if the priority of the task changed.
>   */
> -void task_setprio(struct task_struct *p, int prio)
> +void task_setprio(struct task_struct *p, int prio, int requeue)
>  {
>  	int oldprio, on_rq, running;
>  	struct rq *rq;
> @@ -5332,6 +5338,8 @@ void task_setprio(struct task_struct *p,
>  	prev_class = p->sched_class;
>  	on_rq = p->on_rq;
>  	running = task_current(rq, p);
> +	if (requeue && (running || !on_rq || !rt_prio(oldprio)))
> +		goto out_unlock;
>  	if (on_rq)
>  		dequeue_task(rq, p, 0);
>  	if (running)
> @@ -5346,8 +5354,25 @@ void task_setprio(struct task_struct *p,
>  
>  	if (running)
>  		p->sched_class->set_curr_task(rq);
> -	if (on_rq)
> -		enqueue_task(rq, p, oldprio < prio ? ENQUEUE_HEAD : 0);
> +	if (on_rq) {
> +		if (!sysctl_sched_rt_spin_yield) {
> +			enqueue_task(rq, p, oldprio < prio ? ENQUEUE_HEAD : 0);
> +		} else {
> +			enqueue_task(rq, p, ENQUEUE_HEAD);
> +
> +			/*
> +			 * If we're requeueing a spinlock waiter, preempt any
> +			 * peer in the way, waiter involuntarily blocked, so
> +			 * has the right to use this CPU before its peers.
> +			 */
> +			requeue &= p->prio <= rq->curr->prio;
> +			requeue &= rq->curr->state == TASK_RUNNING;
> +			requeue &= rq->curr != current;
> +
> +			if (requeue)
> +				resched_task(rq->curr);

This too is interesting.

> +		}
> +	}
>  
>  	check_class_changed(rq, p, prev_class, oldprio);
>  out_unlock:
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -2510,6 +2510,11 @@ static void check_preempt_wakeup(struct
>  	if (unlikely(se == pse))
>  		return;
>  
> +#ifdef CONFIG_PREEMPT_RT_FULL
> +	if (sysctl_sched_rt_spin_yield && curr->migrate_disable)
> +		return;
> +#endif
> +
>  	/*
>  	 * This is possible from callers such as pull_task(), in which we
>  	 * unconditionally check_prempt_curr() after an enqueue (which may have
> --- a/kernel/sched_rt.c
> +++ b/kernel/sched_rt.c
> @@ -991,6 +991,9 @@ enqueue_task_rt(struct rq *rq, struct ta
>  	if (flags & ENQUEUE_WAKEUP)
>  		rt_se->timeout = 0;
>  
> +	if (sysctl_sched_rt_spin_yield && (flags & WF_LOCK_SLEEPER))
> +		flags |= ENQUEUE_HEAD;
> +
>  	enqueue_rt_entity(rt_se, flags & ENQUEUE_HEAD);
>  
>  	if (!task_current(rq, p) && p->rt.nr_cpus_allowed > 1)
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -368,6 +368,17 @@ static struct ctl_table kern_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= sched_rt_handler,
>  	},
> +#ifdef CONFIG_PREEMPT_RT_FULL
> +	{
> +		.procname	= "sched_rt_spin_yield",
> +		.data		= &sysctl_sched_rt_spin_yield,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec_minmax,
> +		.extra1		= &zero,
> +		.extra2		= &one,
> +	},
> +#endif
>  	{
>  		.procname	= "sched_compat_yield",
>  		.data		= &sysctl_sched_compat_yield,
> 
> 

I would be interested in seeing what happens if we just made rt_tasks
spin instead of sleep (call yield). Actually, I wonder if making
real-time tasks always spin (never sleep), just yield, would give us
much better performance.

-- Steve




* Re: rt: rtmutex experiment doubled tbench throughput
  2013-02-22 16:36 rt: rtmutex experiment doubled tbench throughput Mike Galbraith
  2013-02-23  8:25 ` Mike Galbraith
  2013-02-26 19:45 ` Steven Rostedt
@ 2013-02-26 21:25 ` Steven Rostedt
  2013-02-27  2:22   ` Paul Gortmaker
  2 siblings, 1 reply; 11+ messages in thread
From: Steven Rostedt @ 2013-02-26 21:25 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: RT, Thomas Gleixner

On Fri, 2013-02-22 at 17:36 +0100, Mike Galbraith wrote:

> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2083,6 +2083,9 @@ extern unsigned int sysctl_sched_latency
>  extern unsigned int sysctl_sched_min_granularity;
>  extern unsigned int sysctl_sched_wakeup_granularity;
>  extern unsigned int sysctl_sched_child_runs_first;
> +#ifdef CONFIG_PREEMPT_RT_FULL
> +extern unsigned int sysctl_sched_rt_spin_yield;
> +#endif
>  
>  enum sched_tunable_scaling {
>  	SCHED_TUNABLESCALING_NONE,
> @@ -2146,12 +2149,12 @@ extern unsigned int sysctl_sched_cfs_ban
>  #endif
>  
>  #ifdef CONFIG_RT_MUTEXES
> -extern void task_setprio(struct task_struct *p, int prio);
> +extern void task_setprio(struct task_struct *p, int prio, int yield);

BTW, what kernel did you apply this to, as there's no task_setprio()
anymore in 3.0-rt, 3.2-rt, 3.4-rt or 3.6-rt?

-- Steve


>  extern int rt_mutex_getprio(struct task_struct *p);
>  extern int rt_mutex_check_prio(struct task_struct *task, int newprio);
>  static inline void rt_mutex_setprio(struct task_struct *p, int prio)
>  {
> -	task_setprio(p, prio);
> +	task_setprio(p, prio, 0);
>  }
>  extern void rt_mutex_adjust_pi(struct task_struct *p);
>  static inline bool tsk_is_pi_blocked(struct task_struct *tsk)




* Re: rt: rtmutex experiment doubled tbench throughput
  2013-02-26 21:25 ` Steven Rostedt
@ 2013-02-27  2:22   ` Paul Gortmaker
  2013-02-27  2:32     ` Steven Rostedt
  0 siblings, 1 reply; 11+ messages in thread
From: Paul Gortmaker @ 2013-02-27  2:22 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Mike Galbraith, RT, Thomas Gleixner

On Tue, Feb 26, 2013 at 4:25 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Fri, 2013-02-22 at 17:36 +0100, Mike Galbraith wrote:
>
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -2083,6 +2083,9 @@ extern unsigned int sysctl_sched_latency
>>  extern unsigned int sysctl_sched_min_granularity;
>>  extern unsigned int sysctl_sched_wakeup_granularity;
>>  extern unsigned int sysctl_sched_child_runs_first;
>> +#ifdef CONFIG_PREEMPT_RT_FULL
>> +extern unsigned int sysctl_sched_rt_spin_yield;
>> +#endif
>>
>>  enum sched_tunable_scaling {
>>       SCHED_TUNABLESCALING_NONE,
>> @@ -2146,12 +2149,12 @@ extern unsigned int sysctl_sched_cfs_ban
>>  #endif
>>
>>  #ifdef CONFIG_RT_MUTEXES
>> -extern void task_setprio(struct task_struct *p, int prio);
>> +extern void task_setprio(struct task_struct *p, int prio, int yield);
>
> BTW, what kernel did you apply this too, as there's no task_setprio()
> anymore in 3.0-rt, 3.2-rt, 3.4-rt or 3.6-rt?

Hi Steve -- are you sure on that?

------------------
paul@PhenomX6:~/git/linux-stable-rt$ git describe
v3.0.66-rt93
paul@PhenomX6:~/git/linux-stable-rt$ git grep -l task_setprio
include/linux/sched.h
kernel/sched.c
paul@PhenomX6:~/git/linux-stable-rt$
------------------

Paul.
--

>
> -- Steve
>
>
>>  extern int rt_mutex_getprio(struct task_struct *p);
>>  extern int rt_mutex_check_prio(struct task_struct *task, int newprio);
>>  static inline void rt_mutex_setprio(struct task_struct *p, int prio)
>>  {
>> -     task_setprio(p, prio);
>> +     task_setprio(p, prio, 0);
>>  }
>>  extern void rt_mutex_adjust_pi(struct task_struct *p);
>>  static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: rt: rtmutex experiment doubled tbench throughput
  2013-02-27  2:22   ` Paul Gortmaker
@ 2013-02-27  2:32     ` Steven Rostedt
  2013-02-27  2:35       ` Steven Rostedt
  0 siblings, 1 reply; 11+ messages in thread
From: Steven Rostedt @ 2013-02-27  2:32 UTC (permalink / raw)
  To: Paul Gortmaker; +Cc: Mike Galbraith, RT, Thomas Gleixner

On Tue, 2013-02-26 at 21:22 -0500, Paul Gortmaker wrote:

> > BTW, what kernel did you apply this too, as there's no task_setprio()
> > anymore in 3.0-rt, 3.2-rt, 3.4-rt or 3.6-rt?
> 
> Hi Steve -- are you sure on that?
> 
> ------------------
> paul@PhenomX6:~/git/linux-stable-rt$ git describe
> v3.0.66-rt93
> paul@PhenomX6:~/git/linux-stable-rt$ git grep -l task_setprio
> include/linux/sched.h
> kernel/sched.c
> paul@PhenomX6:~/git/linux-stable-rt$
> ------------------
> 

Bah! As I looked in my history I see I did the following:

  git show v3.0:include/linux/sched.h
  git show v3.2:include/linux/sched.h
  git show v3.4:include/linux/sched.h

I forgot the '-rt' on each one of them :-p

-- Steve




* Re: rt: rtmutex experiment doubled tbench throughput
  2013-02-27  2:32     ` Steven Rostedt
@ 2013-02-27  2:35       ` Steven Rostedt
  2013-02-27  2:52         ` Paul Gortmaker
  2013-02-27  4:55         ` Mike Galbraith
  0 siblings, 2 replies; 11+ messages in thread
From: Steven Rostedt @ 2013-02-27  2:35 UTC (permalink / raw)
  To: Paul Gortmaker; +Cc: Mike Galbraith, RT, Thomas Gleixner

On Tue, 2013-02-26 at 21:32 -0500, Steven Rostedt wrote:

> Bah! As I looked in my history I see I did the following:
> 
>   git show v3.0:include/linux/sched.h
>   git show v3.2:include/linux/sched.h
>   git show v3.4:include/linux/sched.h
> 
> I forgot the '-rt" on each one of them :-p

That said...

Only v3.0-rt has it. I'm barely supporting that now, and I'm only doing
updates that might make sense. Greg should be dropping it soon, and so
will I. There's issues in v3.0-rt that are not in v3.2-rt or above, that
I'm not going to bother debugging.

I'll test a patch against the others but not one against v3.0-rt.

-- Steve




* Re: rt: rtmutex experiment doubled tbench throughput
  2013-02-27  2:35       ` Steven Rostedt
@ 2013-02-27  2:52         ` Paul Gortmaker
  2013-02-27  4:55         ` Mike Galbraith
  1 sibling, 0 replies; 11+ messages in thread
From: Paul Gortmaker @ 2013-02-27  2:52 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Mike Galbraith, RT, Thomas Gleixner

On Tue, Feb 26, 2013 at 9:35 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Tue, 2013-02-26 at 21:32 -0500, Steven Rostedt wrote:
>
>> Bah! As I looked in my history I see I did the following:
>>
>>   git show v3.0:include/linux/sched.h
>>   git show v3.2:include/linux/sched.h
>>   git show v3.4:include/linux/sched.h
>>
>> I forgot the '-rt" on each one of them :-p
>
> That said...
>
> Only v3.0-rt has it. I'm barely supporting that now, and I'm only doing

Yeah -- I'm not surprised.  The only reason I'd checked myself is
that I'd recalled another thread where Mike said he'd been on v3.0,
and so I was reasonably sure it would be there in the 3.0 tree.

I'm all for deleting obsolete crap in mainline, so don't expect me
to argue against you and ask for supporting old releases on RT.   :)

Paul.
--

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: rt: rtmutex experiment doubled tbench throughput
  2013-02-26 19:45 ` Steven Rostedt
@ 2013-02-27  4:19   ` Mike Galbraith
  2013-02-28  5:59     ` Mike Galbraith
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Galbraith @ 2013-02-27  4:19 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: RT, Thomas Gleixner

On Tue, 2013-02-26 at 14:45 -0500, Steven Rostedt wrote: 
> On Fri, 2013-02-22 at 17:36 +0100, Mike Galbraith wrote:
>   
> >  128    169392  9712.52 MB/sec  cleanup  10 sec   
> > 
> 
> Wow, that's some result!

That's the interesting bit.  With the delayed preempt stuff applied, on
a small box I got ~30%, but when I tried it on a big box, nada.

> > Throughput 9727.56 MB/sec 128 procs
> > 2150132 packets/sec
> > /root/bin/tbench.sh: line 34: 11568 Killed                  tbench_srv
> > 
> > ---
> >  include/linux/sched.h   |    7 +++++--
> >  kernel/rtmutex.c        |   48 +++++++++++++++++++++++++++++++++++++++++-------
> >  kernel/rtmutex_common.h |   42 +++++++++++++++++++++++++++++++++++++++---
> >  kernel/sched.c          |   31 ++++++++++++++++++++++++++++---
> >  kernel/sched_fair.c     |    5 +++++
> >  kernel/sched_rt.c       |    3 +++
> >  kernel/sysctl.c         |   11 +++++++++++
> >  7 files changed, 132 insertions(+), 15 deletions(-)
> > 
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -2083,6 +2083,9 @@ extern unsigned int sysctl_sched_latency
> >  extern unsigned int sysctl_sched_min_granularity;
> >  extern unsigned int sysctl_sched_wakeup_granularity;
> >  extern unsigned int sysctl_sched_child_runs_first;
> > +#ifdef CONFIG_PREEMPT_RT_FULL
> > +extern unsigned int sysctl_sched_rt_spin_yield;
> > +#endif
> 
> I'm not sure we want this. I think it makes sense to either always do
> it, or never do it.

For SCHED_OTHER, I think it makes sense to always let tasks spin if they
can.  For rt, well, the box tells me _always_ doing it is a bad idea.
Pinning and spinning when a CPU is trying to go down doesn't go well,
nor does spinning while exiting, but I found no rt testcase breakage
within the constraints shown in the patchlet.

> > @@ -702,8 +702,8 @@ static int adaptive_wait(struct rt_mutex
> >  		 * checking the above to be valid.
> >  		 */
> >  		barrier();
> > -		if (!owner->on_cpu) {
> > -			res = 1;
> > +		if (!owner || !owner->on_cpu) {
> > +			res = !spin;
> >  			break;
> 
> Hmm, if spin is set, this function *always* returns zero. Ug, don't do
> this, a better solution below (where adaptive_wait() is called).

That's a remnant from when I was trying to use adaptive_wait() for rt
tasks, only yielding when it would be a very bad idea to not do so.
That tactic worked fine on a 4 core box no matter how hard I beat on it,
but lasted.. oh, about a microsecond or so on the big box when firing up
the 60 core, all isolated, pinned and synchronized jitter test.

> > @@ -761,6 +761,17 @@ static void  noinline __sched rt_spin_lo
> >  	ret = task_blocks_on_rt_mutex(lock, &waiter, self, 0);
> >  	BUG_ON(ret);
> >  
> > +	/* basic spin/yield sanity checks */
> > +	if (rt_spin_yield_enabled()) {
> > +		rt_spin = !self->saved_state;
> > +		/* Here there be dragons */
> > +		rt_spin &= !(self->flags & PF_EXITING);
> > +		other_spin = rt_spin;
> > +		rt_spin &= rt_task(self);
> > +		other_spin &= !rt_spin;
> 
> The multiple assigning is very ugly.
> 
> 	int self_state = !self->saved_state && !(self->flags & PF_EXITING);
> 
> 	rt_spin = self_state && rt_task(self);
> 	other_spin = self_state && !rt_task(self);
> 
> Much easier to read.

I like them compact like this, but yeah, non-compliant kernel style.

> > +	}
> > +	cpu = raw_smp_processor_id();
> > +
> >  	for (;;) {
> >  		/* Try to acquire the lock again. */
> >  		if (__try_to_take_rt_mutex(lock, self, &waiter, STEAL_LATERAL))
> > @@ -769,12 +780,25 @@ static void  noinline __sched rt_spin_lo
> >  		top_waiter = rt_mutex_top_waiter(lock);
> >  		lock_owner = rt_mutex_owner(lock);
> >  
> > +		if (rt_spin)
> > +			wait = 1;
> > +		else
> > +			wait = top_waiter != &waiter;
> > +
> > +		/* SCHED_OTHER can laterally steal, let them try */
> > +		if (other_spin) {
> > +			wait &= task_cpu(top_waiter->task) == cpu;
> > +			wait |= top_waiter->task->prio < self->prio;
> > +			wait |= lock_owner && !lock_owner->on_cpu;
> > +			wait |= lock_owner && !lock_owner->prio < self->prio;
> 
> Again very ugly and hard to read.
> 
> Let's see. If other_spin is set, this also means that rt_spin is not, so
> wait is only set if top_waiter != &waiter (current is not top_waiter).
> OK, if current is top waiter, we don't want to wait. That makes sense.
> 
> Now, if top_waiter->task cpu != current cpu, we don't wait. Ah, even if
> we are not the top waiter, we can still spin if the top_waiter is on
> another CPU. But if top_waiter is higher priority, then we still wait.
> Or if the lock owner is not on a CPU, or if the lock owner is not higher
> priority than current???
> 
> I'm confused to why you did this. This requires a hell of a lot of
> explanation.

That last is a brain-o.

> > +		}
> > +
> >  		raw_spin_unlock(&lock->wait_lock);
> >  
> >  		debug_rt_mutex_print_deadlock(&waiter);
> >  
> > -		if (top_waiter != &waiter || adaptive_wait(lock, lock_owner))
> > -			schedule_rt_mutex(lock);
> > +		if (wait || adaptive_wait(lock, lock_owner, rt_spin))
> > +			schedule_rt_spinlock(lock, rt_spin);
> 
> I'm also confused, if rt_spin is set, we have wait = 1. wait can only be
> cleared, if other_spin is set, but other_spin can't be set if rt_spin is
> set.

Yeah, passing rt_spin is leftover cruft not in the somewhat cleaner
version I later posted.

> > @@ -826,6 +850,16 @@ static void  noinline __sched rt_spin_lo
> >  		return;
> >  	}
> >  
> > +	if (rt_spin_yield_enabled() && rt_task(current)) {
> > +		struct rt_mutex_waiter *top_waiter = rt_mutex_top_waiter(lock);
> > +		struct task_struct *next = top_waiter->task;
> > +
> > +		/* Move next in line to head of its queue */
> > +		pi_lock(&next->pi_lock);
> > +		task_setprio(next, next->prio, 1);
> > +		pi_unlock(&next->pi_lock);
> 
> You need to explain why this is done.

Because nothing else worked :)  No rt spin+yield variant worked on the
big box: you interleave, you die.  Only after making it a global thing
(you miss the lock, you yield, acquisition puts you back at the head of
your queue) did the big box stop screeching to a halt.
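
The plain yield() behavior that makes the requeue necessary is easy to
see from userspace.  A tiny sketch (illustration only, made-up priority,
needs root for SCHED_FIFO): same-priority FIFO threads pinned to one CPU
that call sched_yield() just take turns, because yield drops the caller
to the tail of its priority queue.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static pthread_barrier_t gate;

/* Same-priority SCHED_FIFO spinners: each sched_yield() drops the
 * caller to the tail of its priority queue, so they round-robin. */
static void *spinner(void *arg)
{
	long id = (long)arg;
	int i;

	pthread_barrier_wait(&gate);		/* start together */
	for (i = 0; i < 5; i++) {
		printf("thread %ld runs\n", id);
		sched_yield();
	}
	return NULL;
}

int main(void)
{
	struct sched_param sp = { .sched_priority = 10 };
	pthread_attr_t attr;
	pthread_t tid[3];
	cpu_set_t cpu;
	long i;

	CPU_ZERO(&cpu);
	CPU_SET(0, &cpu);
	sched_setaffinity(0, sizeof(cpu), &cpu);	/* everyone on CPU 0 */

	pthread_barrier_init(&gate, NULL, 4);		/* 3 spinners + main */
	pthread_attr_init(&attr);
	pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
	pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
	pthread_attr_setschedparam(&attr, &sp);

	for (i = 0; i < 3; i++)
		pthread_create(&tid[i], &attr, spinner, (void *)i);
	pthread_barrier_wait(&gate);			/* release the pack */
	for (i = 0; i < 3; i++)
		pthread_join(tid[i], NULL);
	return 0;
}

That head -> tail round-robin is the interleaving that killed the big
box; putting the next lock owner back at the head of its queue on
acquisition is what breaks the pattern.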

> > @@ -32,9 +32,45 @@ extern void schedule_rt_mutex_test(struc
> >  		schedule_rt_mutex_test(_lock);			\
> >    } while (0)
> >  
> > -#else
> > -# define schedule_rt_mutex(_lock)			schedule()
> > -#endif
> > +#ifdef CONFIG_PREEMPT_RT_FULL
> > +#define rt_spin_yield_enabled()					\
> > +	(sysctl_sched_rt_spin_yield && system_state == SYSTEM_RUNNING)
> > +
> > +#define schedule_rt_spinlock(_lock, _spin)			\
> > +  do {								\
> > +	if (!(current->flags & PF_MUTEX_TESTER)) {		\
> > +		if (!_spin)					\
> > +			schedule();				\
> > +		else						\
> > +			yield();				\
> > +	} else							\
> > +		schedule_rt_mutex_test(_lock);			\
> 
> I'm wondering if this is the only thing you needed to get your
> performance boost.

I think it's more about SCHED_OTHER and ->migrate_disable; these are the
times when preempting just stacks tasks up, and lets more in to just
stack up.  If we leave these alone and let tasks spin, there's less of a
sleeping logjam, so more work gets done.

> > @@ -5346,8 +5354,25 @@ void task_setprio(struct task_struct *p,
> >  
> >  	if (running)
> >  		p->sched_class->set_curr_task(rq);
> > -	if (on_rq)
> > -		enqueue_task(rq, p, oldprio < prio ? ENQUEUE_HEAD : 0);
> > +	if (on_rq) {
> > +		if (!sysctl_sched_rt_spin_yield) {
> > +			enqueue_task(rq, p, oldprio < prio ? ENQUEUE_HEAD : 0);
> > +		} else {
> > +			enqueue_task(rq, p, ENQUEUE_HEAD);
> > +
> > +			/*
> > +			 * If we're requeueing a spinlock waiter, preempt any
> > +			 * peer in the way, waiter involuntarily blocked, so
> > +			 * has the right to use this CPU before its peers.
> > +			 */
> > +			requeue &= p->prio <= rq->curr->prio;
> > +			requeue &= rq->curr->state == TASK_RUNNING;
> > +			requeue &= rq->curr != current;
> > +
> > +			if (requeue)
> > +				resched_task(rq->curr);
> 
> This too is interesting.

Regardless of the patchlet, I don't see why we take a boosted task,
which was at the head of its queue when it took the lock, and requeue it
to the tail on de-boost when releasing that lock.

> I would be interested in seeing what happens if we just made rt_tasks
> spin instead of sleep (call yield). Actually, if we made real-time tasks
> always spin (never sleep), just yield, if this would give us much better
> performance.

Maybe you can get that working without the requeueing.  Without it
being a global head -> tail -> head thing, my big box died and died
and...

-Mike



* Re: rt: rtmutex experiment doubled tbench throughput
  2013-02-27  2:35       ` Steven Rostedt
  2013-02-27  2:52         ` Paul Gortmaker
@ 2013-02-27  4:55         ` Mike Galbraith
  1 sibling, 0 replies; 11+ messages in thread
From: Mike Galbraith @ 2013-02-27  4:55 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Paul Gortmaker, RT, Thomas Gleixner

On Tue, 2013-02-26 at 21:35 -0500, Steven Rostedt wrote: 
> On Tue, 2013-02-26 at 21:32 -0500, Steven Rostedt wrote:
> 
> > Bah! As I looked in my history I see I did the following:
> > 
> >   git show v3.0:include/linux/sched.h
> >   git show v3.2:include/linux/sched.h
> >   git show v3.4:include/linux/sched.h
> > 
> > I forgot the '-rt" on each one of them :-p
> 
> That said...
> 
> Only v3.0-rt has it. I'm barely supporting that now, and I'm only doing
> updates that might make sense. Greg should be dropping it soon, and so
> will I. There's issues in v3.0-rt that are not in v3.2-rt or above, that
> I'm not going to bother debugging.

Yeah, I'm stuck with 3.0, so my 3.0 is not as virgin as I would like it
to be.  It has threaded sirqs because I must have them, and can turn
cpupri and nohz on/off on the fly because I have to have that too to
make 80 core boxen with a tight jitter constraint be all they can be.

> I'll test a patch against the others but not one against v3.0-rt.

The 64 core test box has a microscopic root, and newer kernel drivers
break its userspace.  Putting a newer rt kernel on it to play is a pita,
so I do big box tinkering in 3.0.  I can make a 3.6 hacklet during the
weekend.

-Mike



* Re: rt: rtmutex experiment doubled tbench throughput
  2013-02-27  4:19   ` Mike Galbraith
@ 2013-02-28  5:59     ` Mike Galbraith
  0 siblings, 0 replies; 11+ messages in thread
From: Mike Galbraith @ 2013-02-28  5:59 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: RT, Thomas Gleixner

On Wed, 2013-02-27 at 05:19 +0100, Mike Galbraith wrote: 
> On Tue, 2013-02-26 at 14:45 -0500, Steven Rostedt wrote: 

> > I would be interested in seeing what happens if we just made rt_tasks
> > spin instead of sleep (call yield). Actually, if we made real-time tasks
> > always spin (never sleep), just yield, if this would give us much better
> > performance.
> 
> Maybe you can get that working without the requeueing  Without it being
> a global head -> tail -> head thing, my big box died and died and.. 

As noted, prior to global queue to head, I made dead box.. a lot.

However, yesterday I had to hunt down a localhost throughput regression
in 3.0.65 (794ed393) anyway, and I used the opportunity to do a little
concurrent tinkering...

With the SCHED_OTHER brain-o fixed up, as mentioned, -rt beat its
NOPREEMPT parent tree at hefty tbench throughput.  This morning, I tried
real rt spinning again, only yielding when mandatory.  With global queue
to head in place, no silent box syndrome happened; it worked fine.  That
should mean that across the box, contention induced jitter can shrink by
sched cost * players, no?
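
(Back of the envelope, with made-up numbers: if each blocked waiter
costs a sleep plus wakeup round trip of ~2 us, and 60 players serialize
on one lock, that's on the order of 120 us of avoidable worst case
delay; the numbers are hypothetical, the "times players" scaling is the
point.)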

Maybe with some refinement, the experiment will work out.

A longish jitter test just finished; though the worst case doesn't look
any better (darn), it doesn't look any worse either, so I'll poke send.

(hm, maybe if I avoid pounding raw lock through the floor...)

-Mike


