* [RFC][PATCH 0/5] Reduce runqueue lock contention -v2
@ 2010-12-16 14:56 Peter Zijlstra
  2010-12-16 14:56 ` [RFC][PATCH 1/5] sched: Always provide p->oncpu Peter Zijlstra
                   ` (5 more replies)
  0 siblings, 6 replies; 44+ messages in thread
From: Peter Zijlstra @ 2010-12-16 14:56 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe
  Cc: linux-kernel, Peter Zijlstra

Hi, here's a new posting of my scary patch(es) ;-)

These actually survive a sembench run (and everything else I threw at it).
The discussion between Mike and Frank over the task_running() check made me
realize what was wrong with the previous one.

As it turns out, what was needed (p->oncpu) was something Thomas wanted me
to do for an entirely different reason (see patch #2).

Frank's patch, while encouraging me to poke at it again, has a number of
very fundamental problems with it, the most serious one being that it
completely wrecks the wake-up load-balancing.

I'll try and come up with a way to unwreck the 32bit select_task_rq_fair()
problem, but for now inspiration in that department seems lacking; still,
I wanted to share these patches so that others can have a go at them.

If all you care about is the numbers, skip to patch #5.



* [RFC][PATCH 1/5] sched: Always provide p->oncpu
  2010-12-16 14:56 [RFC][PATCH 0/5] Reduce runqueue lock contention -v2 Peter Zijlstra
@ 2010-12-16 14:56 ` Peter Zijlstra
  2010-12-18  1:03   ` Frank Rowand
  2010-12-16 14:56 ` [RFC][PATCH 2/5] mutex: Use p->oncpu for the adaptive spin Peter Zijlstra
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2010-12-16 14:56 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-on_cpu.patch --]
[-- Type: text/plain, Size: 2886 bytes --]

Always provide p->oncpu so that we can determine if it's on a cpu
without having to lock the rq.
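
For example (an illustrative sketch only; patch 5 below relies on exactly
this pattern), a waker can wait for a task to actually get off its CPU
without ever taking the runqueue lock:

	/*
	 * Sketch: busy-wait until @p has completed its context switch,
	 * using only p->oncpu -- no rq->lock taken.
	 */
	while (p->oncpu)
		cpu_relax();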

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |    2 --
 kernel/sched.c        |   36 ++++++++++++++++++++++++------------
 2 files changed, 24 insertions(+), 14 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -850,18 +850,39 @@ static inline int task_current(struct rq
 	return rq->curr == p;
 }
 
-#ifndef __ARCH_WANT_UNLOCKED_CTXSW
 static inline int task_running(struct rq *rq, struct task_struct *p)
 {
+#ifdef CONFIG_SMP
+	return p->oncpu;
+#else
 	return task_current(rq, p);
+#endif
 }
 
+#ifndef __ARCH_WANT_UNLOCKED_CTXSW
 static inline void prepare_lock_switch(struct rq *rq, struct task_struct *next)
 {
+#ifdef CONFIG_SMP
+	/*
+	 * We can optimise this out completely for !SMP, because the
+	 * SMP rebalancing from interrupt is the only thing that cares
+	 * here.
+	 */
+	next->oncpu = 1;
+#endif
 }
 
 static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
 {
+#ifdef CONFIG_SMP
+	/*
+	 * After ->oncpu is cleared, the task can be moved to a different CPU.
+	 * We must ensure this doesn't happen until the switch is completely
+	 * finished.
+	 */
+	smp_wmb();
+	prev->oncpu = 0;
+#endif
 #ifdef CONFIG_DEBUG_SPINLOCK
 	/* this is a valid case when another task releases the spinlock */
 	rq->lock.owner = current;
@@ -877,15 +898,6 @@ static inline void finish_lock_switch(st
 }
 
 #else /* __ARCH_WANT_UNLOCKED_CTXSW */
-static inline int task_running(struct rq *rq, struct task_struct *p)
-{
-#ifdef CONFIG_SMP
-	return p->oncpu;
-#else
-	return task_current(rq, p);
-#endif
-}
-
 static inline void prepare_lock_switch(struct rq *rq, struct task_struct *next)
 {
 #ifdef CONFIG_SMP
@@ -2589,7 +2601,7 @@ void sched_fork(struct task_struct *p, i
 	if (likely(sched_info_on()))
 		memset(&p->sched_info, 0, sizeof(p->sched_info));
 #endif
-#if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW)
+#if defined(CONFIG_SMP)
 	p->oncpu = 0;
 #endif
 #ifdef CONFIG_PREEMPT
@@ -5502,7 +5514,7 @@ void __cpuinit init_idle(struct task_str
 	rcu_read_unlock();
 
 	rq->curr = rq->idle = idle;
-#if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW)
+#if defined(CONFIG_SMP)
 	idle->oncpu = 1;
 #endif
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1198,10 +1198,8 @@ struct task_struct {
 	int lock_depth;		/* BKL lock depth */
 
 #ifdef CONFIG_SMP
-#ifdef __ARCH_WANT_UNLOCKED_CTXSW
 	int oncpu;
 #endif
-#endif
 
 	int prio, static_prio, normal_prio;
 	unsigned int rt_priority;




* [RFC][PATCH 2/5] mutex: Use p->oncpu for the adaptive spin
  2010-12-16 14:56 [RFC][PATCH 0/5] Reduce runqueue lock contention -v2 Peter Zijlstra
  2010-12-16 14:56 ` [RFC][PATCH 1/5] sched: Always provide p->oncpu Peter Zijlstra
@ 2010-12-16 14:56 ` Peter Zijlstra
  2010-12-16 17:34   ` Oleg Nesterov
  2010-12-16 14:56 ` [RFC][PATCH 3/5] sched: Change the ttwu success details Peter Zijlstra
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2010-12-16 14:56 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-on_cpu-use.patch --]
[-- Type: text/plain, Size: 5171 bytes --]

Since we now have p->oncpu unconditionally available, use it to spin
on.
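
The heart of the change (a sketch of the hunk below): instead of resolving
owner->cpu to a runqueue and comparing against rq->curr, the spin loop
simply watches the owner's oncpu flag:

	while (lock->owner == owner && owner->oncpu) {
		if (need_resched())
			return 0;	/* we must reschedule, stop spinning */

		arch_mutex_cpu_relax();
	}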

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mutex.h |    2 -
 include/linux/sched.h |    2 -
 kernel/mutex-debug.c  |    2 -
 kernel/mutex-debug.h  |    2 -
 kernel/mutex.c        |    2 -
 kernel/mutex.h        |    2 -
 kernel/sched.c        |   52 +++++++++++++-------------------------------------
 7 files changed, 20 insertions(+), 44 deletions(-)

Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -51,7 +51,7 @@ struct mutex {
 	spinlock_t		wait_lock;
 	struct list_head	wait_list;
 #if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
-	struct thread_info	*owner;
+	struct task_struct	*owner;
 #endif
 #ifdef CONFIG_DEBUG_MUTEXES
 	const char 		*name;
Index: linux-2.6/kernel/mutex-debug.c
===================================================================
--- linux-2.6.orig/kernel/mutex-debug.c
+++ linux-2.6/kernel/mutex-debug.c
@@ -75,7 +75,7 @@ void debug_mutex_unlock(struct mutex *lo
 		return;
 
 	DEBUG_LOCKS_WARN_ON(lock->magic != lock);
-	DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
+	DEBUG_LOCKS_WARN_ON(lock->owner != current);
 	DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);
 	mutex_clear_owner(lock);
 }
Index: linux-2.6/kernel/mutex-debug.h
===================================================================
--- linux-2.6.orig/kernel/mutex-debug.h
+++ linux-2.6/kernel/mutex-debug.h
@@ -29,7 +29,7 @@ extern void debug_mutex_init(struct mute
 
 static inline void mutex_set_owner(struct mutex *lock)
 {
-	lock->owner = current_thread_info();
+	lock->owner = current;
 }
 
 static inline void mutex_clear_owner(struct mutex *lock)
Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -160,7 +160,7 @@ __mutex_lock_common(struct mutex *lock,
 	 */
 
 	for (;;) {
-		struct thread_info *owner;
+		struct task_struct *owner;
 
 		/*
 		 * If we own the BKL, then don't spin. The owner of
Index: linux-2.6/kernel/mutex.h
===================================================================
--- linux-2.6.orig/kernel/mutex.h
+++ linux-2.6/kernel/mutex.h
@@ -19,7 +19,7 @@
 #ifdef CONFIG_SMP
 static inline void mutex_set_owner(struct mutex *lock)
 {
-	lock->owner = current_thread_info();
+	lock->owner = current;
 }
 
 static inline void mutex_clear_owner(struct mutex *lock)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -3853,10 +3853,9 @@ EXPORT_SYMBOL(schedule);
  * Look out! "owner" is an entirely speculative pointer
  * access and not reliable.
  */
-int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
+int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
 {
-	unsigned int cpu;
-	struct rq *rq;
+	int oncpu;
 
 	if (!sched_feat(OWNER_SPIN))
 		return 0;
@@ -3867,52 +3866,29 @@ int mutex_spin_on_owner(struct mutex *lo
 	 * DEBUG_PAGEALLOC could have unmapped it if
 	 * the mutex owner just released it and exited.
 	 */
-	if (probe_kernel_address(&owner->cpu, cpu))
+	if (probe_kernel_address(&owner->oncpu, oncpu))
 		return 0;
 #else
-	cpu = owner->cpu;
+	oncpu = owner->oncpu;
 #endif
 
-	/*
-	 * Even if the access succeeded (likely case),
-	 * the cpu field may no longer be valid.
-	 */
-	if (cpu >= nr_cpumask_bits)
-		return 0;
-
-	/*
-	 * We need to validate that we can do a
-	 * get_cpu() and that we have the percpu area.
-	 */
-	if (!cpu_online(cpu))
+	if (!oncpu)
 		return 0;
 
-	rq = cpu_rq(cpu);
-
-	for (;;) {
-		/*
-		 * Owner changed, break to re-assess state.
-		 */
-		if (lock->owner != owner) {
-			/*
-			 * If the lock has switched to a different owner,
-			 * we likely have heavy contention. Return 0 to quit
-			 * optimistic spinning and not contend further:
-			 */
-			if (lock->owner)
-				return 0;
-			break;
-		}
-
-		/*
-		 * Is that owner really running on that cpu?
-		 */
-		if (task_thread_info(rq->curr) != owner || need_resched())
+	while (lock->owner == owner && owner->oncpu) {
+		if (need_resched())
 			return 0;
 
 		arch_mutex_cpu_relax();
 	}
 
+	/*
+	 * If the owner changed to another task there is likely
+	 * heavy contention, stop spinning.
+	 */
+	if (lock->owner)
+		return 0;
+
 	return 1;
 }
 #endif
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -360,7 +360,7 @@ extern signed long schedule_timeout_inte
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
 asmlinkage void schedule(void);
-extern int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner);
+extern int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner);
 
 struct nsproxy;
 struct user_namespace;




* [RFC][PATCH 3/5] sched: Change the ttwu success details
  2010-12-16 14:56 [RFC][PATCH 0/5] Reduce runqueue lock contention -v2 Peter Zijlstra
  2010-12-16 14:56 ` [RFC][PATCH 1/5] sched: Always provide p->oncpu Peter Zijlstra
  2010-12-16 14:56 ` [RFC][PATCH 2/5] mutex: Use p->oncpu for the adaptive spin Peter Zijlstra
@ 2010-12-16 14:56 ` Peter Zijlstra
  2010-12-16 15:23   ` Frederic Weisbecker
  2010-12-18  1:05   ` Frank Rowand
  2010-12-16 14:56 ` [RFC][PATCH 4/5] sched: Clean up ttwu stats Peter Zijlstra
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 44+ messages in thread
From: Peter Zijlstra @ 2010-12-16 14:56 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-change-ttwu-return.patch --]
[-- Type: text/plain, Size: 5219 bytes --]

try_to_wake_up() would only report success when it actually had to
place a task on a rq; change that to report success every time we
change p->state to TASK_RUNNING, because that's the real measure of a
wakeup.

As a result, success is always true at the tracepoint, so remove its
success argument.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/trace/events/sched.h      |   18 ++++++++----------
 kernel/sched.c                    |   18 ++++++++----------
 kernel/trace/trace_sched_wakeup.c |    3 +--
 3 files changed, 17 insertions(+), 22 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2327,10 +2327,10 @@ static inline void ttwu_activate(struct 
 	activate_task(rq, p, en_flags);
 }
 
-static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
-					int wake_flags, bool success)
+static void
+ttwu_post_activation(struct task_struct *p, struct rq *rq, int wake_flags)
 {
-	trace_sched_wakeup(p, success);
+	trace_sched_wakeup(p);
 	check_preempt_curr(rq, p, wake_flags);
 
 	p->state = TASK_RUNNING;
@@ -2350,7 +2350,7 @@ static inline void ttwu_post_activation(
 	}
 #endif
 	/* if a worker is waking up, notify workqueue */
-	if ((p->flags & PF_WQ_WORKER) && success)
+	if (p->flags & PF_WQ_WORKER)
 		wq_worker_waking_up(p, cpu_of(rq));
 }
 
@@ -2449,9 +2449,9 @@ static int try_to_wake_up(struct task_st
 #endif /* CONFIG_SMP */
 	ttwu_activate(p, rq, wake_flags & WF_SYNC, orig_cpu != cpu,
 		      cpu == this_cpu, en_flags);
-	success = 1;
 out_running:
-	ttwu_post_activation(p, rq, wake_flags, success);
+	ttwu_post_activation(p, rq, wake_flags);
+	success = 1;
 out:
 	task_rq_unlock(rq, &flags);
 	put_cpu();
@@ -2470,7 +2470,6 @@ static int try_to_wake_up(struct task_st
 static void try_to_wake_up_local(struct task_struct *p)
 {
 	struct rq *rq = task_rq(p);
-	bool success = false;
 
 	BUG_ON(rq != this_rq());
 	BUG_ON(p == current);
@@ -2485,9 +2484,8 @@ static void try_to_wake_up_local(struct 
 			schedstat_inc(rq, ttwu_local);
 		}
 		ttwu_activate(p, rq, false, false, true, ENQUEUE_WAKEUP);
-		success = true;
 	}
-	ttwu_post_activation(p, rq, 0, success);
+	ttwu_post_activation(p, rq, 0);
 }
 
 /**
@@ -2649,7 +2647,7 @@ void wake_up_new_task(struct task_struct
 
 	rq = task_rq_lock(p, &flags);
 	activate_task(rq, p, 0);
-	trace_sched_wakeup_new(p, 1);
+	trace_sched_wakeup_new(p);
 	check_preempt_curr(rq, p, WF_FORK);
 #ifdef CONFIG_SMP
 	if (p->sched_class->task_woken)
Index: linux-2.6/include/trace/events/sched.h
===================================================================
--- linux-2.6.orig/include/trace/events/sched.h
+++ linux-2.6/include/trace/events/sched.h
@@ -54,15 +54,14 @@ TRACE_EVENT(sched_kthread_stop_ret,
  */
 DECLARE_EVENT_CLASS(sched_wakeup_template,
 
-	TP_PROTO(struct task_struct *p, int success),
+	TP_PROTO(struct task_struct *p),
 
-	TP_ARGS(p, success),
+	TP_ARGS(p),
 
 	TP_STRUCT__entry(
 		__array(	char,	comm,	TASK_COMM_LEN	)
 		__field(	pid_t,	pid			)
 		__field(	int,	prio			)
-		__field(	int,	success			)
 		__field(	int,	target_cpu		)
 	),
 
@@ -70,25 +69,24 @@ DECLARE_EVENT_CLASS(sched_wakeup_templat
 		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
 		__entry->pid		= p->pid;
 		__entry->prio		= p->prio;
-		__entry->success	= success;
 		__entry->target_cpu	= task_cpu(p);
 	),
 
-	TP_printk("comm=%s pid=%d prio=%d success=%d target_cpu=%03d",
+	TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d",
 		  __entry->comm, __entry->pid, __entry->prio,
-		  __entry->success, __entry->target_cpu)
+		  __entry->target_cpu)
 );
 
 DEFINE_EVENT(sched_wakeup_template, sched_wakeup,
-	     TP_PROTO(struct task_struct *p, int success),
-	     TP_ARGS(p, success));
+	     TP_PROTO(struct task_struct *p),
+	     TP_ARGS(p));
 
 /*
  * Tracepoint for waking up a new task:
  */
 DEFINE_EVENT(sched_wakeup_template, sched_wakeup_new,
-	     TP_PROTO(struct task_struct *p, int success),
-	     TP_ARGS(p, success));
+	     TP_PROTO(struct task_struct *p),
+	     TP_ARGS(p));
 
 #ifdef CREATE_TRACE_POINTS
 static inline long __trace_sched_switch_state(struct task_struct *p)
Index: linux-2.6/kernel/trace/trace_sched_wakeup.c
===================================================================
--- linux-2.6.orig/kernel/trace/trace_sched_wakeup.c
+++ linux-2.6/kernel/trace/trace_sched_wakeup.c
@@ -399,8 +399,7 @@ static void wakeup_reset(struct trace_ar
 	local_irq_restore(flags);
 }
 
-static void
-probe_wakeup(void *ignore, struct task_struct *p, int success)
+static void probe_wakeup(void *ignore, struct task_struct *p)
 {
 	struct trace_array_cpu *data;
 	int cpu = smp_processor_id();
diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
index 8f758d0..bb1ba50 100644
--- a/kernel/trace/trace_sched_switch.c
+++ b/kernel/trace/trace_sched_switch.c
@@ -108,7 +108,7 @@ tracing_sched_wakeup_trace(struct trace_array *tr,
 }
 
 static void
-probe_sched_wakeup(void *ignore, struct task_struct *wakee, int success)
+probe_sched_wakeup(void *ignore, struct task_struct *wakee)
 {
 	struct trace_array_cpu *data;
 	unsigned long flags;




* [RFC][PATCH 4/5] sched: Clean up ttwu stats
  2010-12-16 14:56 [RFC][PATCH 0/5] Reduce runqueue lock contention -v2 Peter Zijlstra
                   ` (2 preceding siblings ...)
  2010-12-16 14:56 ` [RFC][PATCH 3/5] sched: Change the ttwu success details Peter Zijlstra
@ 2010-12-16 14:56 ` Peter Zijlstra
  2010-12-18  1:09   ` Frank Rowand
  2010-12-16 14:56 ` [RFC][PATCH 5/5] sched: Reduce ttwu rq->lock contention Peter Zijlstra
  2010-12-16 19:12 ` [RFC][PATCH 0/5] Reduce runqueue lock contention -v2 Frank Rowand
  5 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2010-12-16 14:56 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-ttwu-stat.patch --]
[-- Type: text/plain, Size: 3224 bytes --]

Collect all ttwu stat code into a single function and ensure it's
always called for an actual wakeup (changing p->state to
TASK_RUNNING).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |   66 ++++++++++++++++++++++++++-------------------------------
 1 file changed, 31 insertions(+), 35 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2310,21 +2310,35 @@ static void update_avg(u64 *avg, u64 sam
 }
 #endif
 
-static inline void ttwu_activate(struct task_struct *p, struct rq *rq,
-				 bool is_sync, bool is_migrate, bool is_local,
-				 unsigned long en_flags)
+static void
+ttwu_stat(struct rq *rq, struct task_struct *p, int cpu, int wake_flags)
 {
+#ifdef CONFIG_SCHEDSTATS
+
+	schedstat_inc(rq, ttwu_count);
 	schedstat_inc(p, se.statistics.nr_wakeups);
-	if (is_sync)
+
+	if (wake_flags & WF_SYNC)
 		schedstat_inc(p, se.statistics.nr_wakeups_sync);
-	if (is_migrate)
+
+	if (cpu != task_cpu(p))
 		schedstat_inc(p, se.statistics.nr_wakeups_migrate);
-	if (is_local)
+
+	if (cpu == smp_processor_id()) {
+		schedstat_inc(rq, ttwu_local);
 		schedstat_inc(p, se.statistics.nr_wakeups_local);
-	else
-		schedstat_inc(p, se.statistics.nr_wakeups_remote);
+	} else {
+		struct sched_domain *sd;
 
-	activate_task(rq, p, en_flags);
+		schedstat_inc(p, se.statistics.nr_wakeups_remote);
+		for_each_domain(smp_processor_id(), sd) {
+			if (cpumask_test_cpu(cpu, sched_domain_span(sd))) {
+				schedstat_inc(sd, ttwu_wake_remote);
+				break;
+			}
+		}
+	}
+#endif /* CONFIG_SCHEDSTATS */
 }
 
 static void
@@ -2384,12 +2398,12 @@ static int try_to_wake_up(struct task_st
 	if (!(p->state & state))
 		goto out;
 
+	cpu = task_cpu(p);
+
 	if (p->se.on_rq)
 		goto out_running;
 
-	cpu = task_cpu(p);
 	orig_cpu = cpu;
-
 #ifdef CONFIG_SMP
 	if (unlikely(task_running(rq, p)))
 		goto out_activate;
@@ -2430,27 +2444,12 @@ static int try_to_wake_up(struct task_st
 	WARN_ON(task_cpu(p) != cpu);
 	WARN_ON(p->state != TASK_WAKING);
 
-#ifdef CONFIG_SCHEDSTATS
-	schedstat_inc(rq, ttwu_count);
-	if (cpu == this_cpu)
-		schedstat_inc(rq, ttwu_local);
-	else {
-		struct sched_domain *sd;
-		for_each_domain(this_cpu, sd) {
-			if (cpumask_test_cpu(cpu, sched_domain_span(sd))) {
-				schedstat_inc(sd, ttwu_wake_remote);
-				break;
-			}
-		}
-	}
-#endif /* CONFIG_SCHEDSTATS */
-
 out_activate:
 #endif /* CONFIG_SMP */
-	ttwu_activate(p, rq, wake_flags & WF_SYNC, orig_cpu != cpu,
-		      cpu == this_cpu, en_flags);
+	activate_task(rq, p, en_flags);
 out_running:
 	ttwu_post_activation(p, rq, wake_flags);
+	ttwu_stat(rq, p, cpu, wake_flags);
 	success = 1;
 out:
 	task_rq_unlock(rq, &flags);
@@ -2478,14 +2477,11 @@ static void try_to_wake_up_local(struct 
 	if (!(p->state & TASK_NORMAL))
 		return;
 
-	if (!p->se.on_rq) {
-		if (likely(!task_running(rq, p))) {
-			schedstat_inc(rq, ttwu_count);
-			schedstat_inc(rq, ttwu_local);
-		}
-		ttwu_activate(p, rq, false, false, true, ENQUEUE_WAKEUP);
-	}
+	if (!p->se.on_rq)
+		activate_task(rq, p, ENQUEUE_WAKEUP);
+
 	ttwu_post_activation(p, rq, 0);
+	ttwu_stat(rq, p, smp_processor_id(), 0);
 }
 
 /**




* [RFC][PATCH 5/5] sched: Reduce ttwu rq->lock contention
  2010-12-16 14:56 [RFC][PATCH 0/5] Reduce runqueue lock contention -v2 Peter Zijlstra
                   ` (3 preceding siblings ...)
  2010-12-16 14:56 ` [RFC][PATCH 4/5] sched: Clean up ttwu stats Peter Zijlstra
@ 2010-12-16 14:56 ` Peter Zijlstra
  2010-12-16 15:31   ` Frederic Weisbecker
                     ` (2 more replies)
  2010-12-16 19:12 ` [RFC][PATCH 0/5] Reduce runqueue lock contention -v2 Frank Rowand
  5 siblings, 3 replies; 44+ messages in thread
From: Peter Zijlstra @ 2010-12-16 14:56 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: peter_zijlstra-re-_reduce_runqueue_lock_contention.patch --]
[-- Type: text/plain, Size: 14866 bytes --]

Reduce rq->lock contention on try_to_wake_up() by changing the task
state using a cmpxchg loop.

Once the task is set to TASK_WAKING we're guaranteed to be the only
one poking at it, and we then proceed to pick a new cpu without
holding the rq->lock (XXX this opens some races).
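
In outline (simplified from the try_to_wake_up() rewrite below),
claiming the wakeup looks like this:

	for (;;) {
		unsigned int task_state = p->state;

		if (!(task_state & state))
			return 0;	/* not in a state we may wake */

		/* winning this cmpxchg makes us the one and only waker */
		if (cmpxchg(&p->state, task_state, TASK_WAKING) == task_state)
			break;
	}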

Then, instead of locking the remote rq and activating the task, place
the task on a remote queue, again using cmpxchg, and if that queue was
empty notify the remote cpu via IPI to start processing its wakeups.
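
Roughly (simplified; see ttwu_queue_remote() and sched_ttwu_pending()
in the patch):

	struct task_struct *old, *list;
	struct rq *rq = cpu_rq(cpu);

	/* waker: lockless push of @p onto the remote cpu's wake_list */
	do {
		old = rq->wake_list;
		p->wake_entry = old;
	} while (cmpxchg(&rq->wake_list, old, p) != old);

	if (!old)			/* list was empty, kick the remote cpu */
		smp_send_reschedule(cpu);

	/* remote cpu, from the reschedule IPI (sched_ttwu_pending()): */
	list = xchg(&rq->wake_list, NULL);
	raw_spin_lock(&rq->lock);
	while (list) {
		struct task_struct *t = list;

		list = t->wake_entry;
		ttwu_do_activate(rq, t, 0);
	}
	raw_spin_unlock(&rq->lock);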

This avoids (in most cases) having to lock the remote runqueue (and
thus the exclusive cacheline transfer that implies), as well as having
to touch all the remote runqueue data structures needed for the actual
activation.

As measured using: http://oss.oracle.com/~mason/sembench.c

$ echo 4096 32000 64 128 > /proc/sys/kernel/sem
$ ./sembench -t 2048 -w 1900 -o 0

unpatched: run time 30 seconds 537953 worker burns per second
patched:   run time 30 seconds 657336 worker burns per second

Still need to sort out all the races marked XXX (non-trivial), and
it's x86 only for the moment.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/kernel/smp.c   |    1 
 include/linux/sched.h   |    7 -
 kernel/sched.c          |  241 ++++++++++++++++++++++++++++++++++--------------
 kernel/sched_fair.c     |    5 
 kernel/sched_features.h |    3 
 kernel/sched_idletask.c |    2 
 kernel/sched_rt.c       |    4 
 kernel/sched_stoptask.c |    3 
 8 files changed, 190 insertions(+), 76 deletions(-)

Index: linux-2.6/arch/x86/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smp.c
+++ linux-2.6/arch/x86/kernel/smp.c
@@ -205,6 +205,7 @@ void smp_reschedule_interrupt(struct pt_
 	/*
 	 * KVM uses this interrupt to force a cpu out of guest mode
 	 */
+	sched_ttwu_pending();
 }
 
 void smp_call_function_interrupt(struct pt_regs *regs)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1020,6 +1020,7 @@ partition_sched_domains(int ndoms_new, c
 }
 #endif	/* !CONFIG_SMP */
 
+void sched_ttwu_pending(void);
 
 struct io_context;			/* See blkdev.h */
 
@@ -1063,12 +1064,11 @@ struct sched_class {
 	void (*put_prev_task) (struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
-	int  (*select_task_rq)(struct rq *rq, struct task_struct *p,
-			       int sd_flag, int flags);
+	int  (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);
 
 	void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
 	void (*post_schedule) (struct rq *this_rq);
-	void (*task_waking) (struct rq *this_rq, struct task_struct *task);
+	void (*task_waking) (struct task_struct *task);
 	void (*task_woken) (struct rq *this_rq, struct task_struct *task);
 
 	void (*set_cpus_allowed)(struct task_struct *p,
@@ -1198,6 +1198,7 @@ struct task_struct {
 	int lock_depth;		/* BKL lock depth */
 
 #ifdef CONFIG_SMP
+	struct task_struct *wake_entry;
 	int oncpu;
 #endif
 
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -520,6 +520,8 @@ struct rq {
 	u64 age_stamp;
 	u64 idle_stamp;
 	u64 avg_idle;
+
+	struct task_struct *wake_list;
 #endif
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
@@ -953,7 +955,7 @@ static inline struct rq *__task_rq_lock(
 	for (;;) {
 		rq = task_rq(p);
 		raw_spin_lock(&rq->lock);
-		if (likely(rq == task_rq(p)))
+		if (likely(rq == task_rq(p)) && !task_is_waking(p))
 			return rq;
 		raw_spin_unlock(&rq->lock);
 	}
@@ -973,7 +975,7 @@ static struct rq *task_rq_lock(struct ta
 		local_irq_save(*flags);
 		rq = task_rq(p);
 		raw_spin_lock(&rq->lock);
-		if (likely(rq == task_rq(p)))
+		if (likely(rq == task_rq(p)) && !task_is_waking(p))
 			return rq;
 		raw_spin_unlock_irqrestore(&rq->lock, *flags);
 	}
@@ -2282,9 +2284,9 @@ static int select_fallback_rq(int cpu, s
  * The caller (fork, wakeup) owns TASK_WAKING, ->cpus_allowed is stable.
  */
 static inline
-int select_task_rq(struct rq *rq, struct task_struct *p, int sd_flags, int wake_flags)
+int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
 {
-	int cpu = p->sched_class->select_task_rq(rq, p, sd_flags, wake_flags);
+	int cpu = p->sched_class->select_task_rq(p, sd_flags, wake_flags);
 
 	/*
 	 * In order not to call set_task_cpu() on a blocking task we need
@@ -2310,10 +2312,10 @@ static void update_avg(u64 *avg, u64 sam
 }
 #endif
 
-static void
-ttwu_stat(struct rq *rq, struct task_struct *p, int cpu, int wake_flags)
+static void ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
 {
 #ifdef CONFIG_SCHEDSTATS
+	struct rq *rq = this_rq();
 
 	schedstat_inc(rq, ttwu_count);
 	schedstat_inc(p, se.statistics.nr_wakeups);
@@ -2341,8 +2343,11 @@ ttwu_stat(struct rq *rq, struct task_str
 #endif /* CONFIG_SCHEDSTATS */
 }
 
+/*
+ * Mark the task runnable and perform wakeup-preemption.
+ */
 static void
-ttwu_post_activation(struct task_struct *p, struct rq *rq, int wake_flags)
+ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
 {
 	trace_sched_wakeup(p);
 	check_preempt_curr(rq, p, wake_flags);
@@ -2368,6 +2373,109 @@ ttwu_post_activation(struct task_struct 
 		wq_worker_waking_up(p, cpu_of(rq));
 }
 
+static void ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags)
+{
+	if (task_cpu(p) != cpu_of(rq))
+		set_task_cpu(p, cpu_of(rq));
+
+	activate_task(rq, p, ENQUEUE_WAKEUP
+#ifdef CONFIG_SMP
+			     | ENQUEUE_WAKING
+#endif
+			);
+	ttwu_do_wakeup(rq, p, wake_flags);
+}
+
+/*
+ * Called in case the task @p isn't fully descheduled from its runqueue,
+ * in this case we must do a full remote wakeup.
+ */
+static int ttwu_force(struct task_struct *p, int wake_flags)
+{
+	struct rq *rq;
+	int ret = 0;
+
+	/*
+	 * Since we've already set TASK_WAKING this task's CPU cannot
+	 * change from under us.
+	 */
+	rq = task_rq(p);
+	raw_spin_lock(&rq->lock);
+	BUG_ON(rq != task_rq(p));
+
+	if (p->se.on_rq) {
+		ttwu_do_wakeup(rq, p, wake_flags);
+		ret = 1;
+	} else if (task_running(rq, p)) {
+		ttwu_do_activate(rq, p, wake_flags);
+		ret = 1;
+	}
+
+	if (ret)
+		ttwu_stat(p, task_cpu(p), wake_flags);
+
+	raw_spin_unlock(&rq->lock);
+
+	return ret;
+}
+
+void sched_ttwu_pending(void)
+{
+#ifdef CONFIG_SMP
+	struct rq *rq = this_rq();
+	struct task_struct *list = xchg(&rq->wake_list, NULL);
+
+	if (!list)
+		return;
+
+	raw_spin_lock(&rq->lock);
+
+	while (list) {
+		struct task_struct *p = list;
+		list = list->wake_entry;
+		ttwu_do_activate(rq, p, 0);
+	}
+
+	raw_spin_unlock(&rq->lock);
+#endif
+}
+
+#ifdef CONFIG_SMP
+static void ttwu_queue_remote(struct task_struct *p, int cpu)
+{
+	struct task_struct *next = NULL;
+	struct rq *rq = cpu_rq(cpu);
+
+	for (;;) {
+		struct task_struct *old = next;
+
+		p->wake_entry = next;
+		next = cmpxchg(&rq->wake_list, old, p);
+		if (next == old)
+			break;
+	}
+
+	if (!next)
+		smp_send_reschedule(cpu);
+}
+#endif
+
+static void ttwu_queue(struct task_struct *p, int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+#ifdef CONFIG_SMP
+	if (!sched_feat(TTWU_FORCE) && cpu != smp_processor_id()) {
+		ttwu_queue_remote(p, cpu);
+		return;
+	}
+#endif
+
+	raw_spin_lock(&rq->lock);
+	ttwu_do_activate(rq, p, 0);
+	raw_spin_unlock(&rq->lock);
+}
+
 /**
  * try_to_wake_up - wake up a thread
  * @p: the thread to be awakened
@@ -2383,77 +2491,77 @@ ttwu_post_activation(struct task_struct 
  * Returns %true if @p was woken up, %false if it was already running
  * or @state didn't match @p's state.
  */
-static int try_to_wake_up(struct task_struct *p, unsigned int state,
-			  int wake_flags)
+static int
+try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 {
-	int cpu, orig_cpu, this_cpu, success = 0;
+	int cpu = task_cpu(p);
 	unsigned long flags;
-	unsigned long en_flags = ENQUEUE_WAKEUP;
-	struct rq *rq;
+	int success = 0;
+	int load;
 
-	this_cpu = get_cpu();
-
-	smp_wmb();
-	rq = task_rq_lock(p, &flags);
-	if (!(p->state & state))
-		goto out;
+	local_irq_save(flags);
+	for (;;) {
+		unsigned int task_state = p->state;
 
-	cpu = task_cpu(p);
+		if (!(task_state & state))
+			goto out;
 
-	if (p->se.on_rq)
-		goto out_running;
+		/*
+		 * We've got to store the contributes_to_load state before
+		 * modifying the task state.
+		 */
+		load = task_contributes_to_load(p);
 
-	orig_cpu = cpu;
-#ifdef CONFIG_SMP
-	if (unlikely(task_running(rq, p)))
-		goto out_activate;
+		if (cmpxchg(&p->state, task_state, TASK_WAKING) == task_state)
+			break;
+	}
 
 	/*
-	 * In order to handle concurrent wakeups and release the rq->lock
-	 * we put the task in TASK_WAKING state.
-	 *
-	 * First fix up the nr_uninterruptible count:
+	 * Everything that changes p->state to TASK_RUNNING is a wakeup!
 	 */
-	if (task_contributes_to_load(p)) {
-		if (likely(cpu_online(orig_cpu)))
-			rq->nr_uninterruptible--;
-		else
-			this_rq()->nr_uninterruptible--;
-	}
-	p->state = TASK_WAKING;
-
-	if (p->sched_class->task_waking) {
-		p->sched_class->task_waking(rq, p);
-		en_flags |= ENQUEUE_WAKING;
-	}
+	success = 1;
 
-	cpu = select_task_rq(rq, p, SD_BALANCE_WAKE, wake_flags);
-	if (cpu != orig_cpu)
-		set_task_cpu(p, cpu);
-	__task_rq_unlock(rq);
+	/*
+	 * If p is prev in schedule() before deactivate_task() we
+	 * simply need to set p->state = TASK_RUNNING while
+	 * holding the rq->lock.
+	 */
+	if (p->se.on_rq && ttwu_force(p, wake_flags))
+		goto out;
 
-	rq = cpu_rq(cpu);
-	raw_spin_lock(&rq->lock);
+	/*
+	 * Now that we'll do a full wakeup, activate_task(), account the
+	 * contribute_to_load state.
+	 */
+	if (load) // XXX racy
+		this_rq()->nr_uninterruptible--;
 
+#ifdef CONFIG_SMP
 	/*
-	 * We migrated the task without holding either rq->lock, however
-	 * since the task is not on the task list itself, nobody else
-	 * will try and migrate the task, hence the rq should match the
-	 * cpu we just moved it to.
+	 * Catch the case where schedule() has done the dequeue but hasn't yet
+	 * scheduled to a new task, in that case p is still being referenced
+	 * by that cpu so we cannot wake it to any other cpu.
 	 */
-	WARN_ON(task_cpu(p) != cpu);
-	WARN_ON(p->state != TASK_WAKING);
+	if (sched_feat(TTWU_SPIN)) {
+		while (p->oncpu)
+			cpu_relax();
+	} else {
+		if (p->oncpu && ttwu_force(p, wake_flags))
+			goto out;
+	}
 
-out_activate:
-#endif /* CONFIG_SMP */
-	activate_task(rq, p, en_flags);
-out_running:
-	ttwu_post_activation(p, rq, wake_flags);
-	ttwu_stat(rq, p, cpu, wake_flags);
-	success = 1;
+	if (p->sched_class->task_waking)
+		p->sched_class->task_waking(p);
+	/*
+	 * XXX: by having set TASK_WAKING outside of rq->lock, there
+	 * could be an in-flight change to p->cpus_allowed..
+	 */
+	cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
+#endif
+	ttwu_stat(p, cpu, wake_flags);
+	ttwu_queue(p, cpu);
 out:
-	task_rq_unlock(rq, &flags);
-	put_cpu();
+	local_irq_restore(flags);
 
 	return success;
 }
@@ -2480,8 +2588,8 @@ static void try_to_wake_up_local(struct 
 	if (!p->se.on_rq)
 		activate_task(rq, p, ENQUEUE_WAKEUP);
 
-	ttwu_post_activation(p, rq, 0);
-	ttwu_stat(rq, p, smp_processor_id(), 0);
+	ttwu_stat(p, smp_processor_id(), 0);
+	ttwu_do_wakeup(rq, p, 0);
 }
 
 /**
@@ -2634,7 +2742,7 @@ void wake_up_new_task(struct task_struct
 	 * We set TASK_WAKING so that select_task_rq() can drop rq->lock
 	 * without people poking at ->cpus_allowed.
 	 */
-	cpu = select_task_rq(rq, p, SD_BALANCE_FORK, 0);
+	cpu = select_task_rq(p, SD_BALANCE_FORK, 0);
 	set_task_cpu(p, cpu);
 
 	p->state = TASK_RUNNING;
@@ -3360,7 +3468,7 @@ void sched_exec(void)
 	int dest_cpu;
 
 	rq = task_rq_lock(p, &flags);
-	dest_cpu = p->sched_class->select_task_rq(rq, p, SD_BALANCE_EXEC, 0);
+	dest_cpu = p->sched_class->select_task_rq(p, SD_BALANCE_EXEC, 0);
 	if (dest_cpu == smp_processor_id())
 		goto unlock;
 
@@ -3931,7 +4039,6 @@ asmlinkage void __sched schedule(void)
 	}
 
 	pre_schedule(rq, prev);
-
 	if (unlikely(!rq->nr_running))
 		idle_balance(cpu, rq);
 
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1321,11 +1321,12 @@ static void yield_task_fair(struct rq *r
 
 #ifdef CONFIG_SMP
 
-static void task_waking_fair(struct rq *rq, struct task_struct *p)
+static void task_waking_fair(struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
+	// XXX racy again,.. we need to read cfs_rq->min_vruntime atomically
 	se->vruntime -= cfs_rq->min_vruntime;
 }
 
@@ -1606,7 +1607,7 @@ static int select_idle_sibling(struct ta
  * preempt must be disabled.
  */
 static int
-select_task_rq_fair(struct rq *rq, struct task_struct *p, int sd_flag, int wake_flags)
+select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 {
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
 	int cpu = smp_processor_id();
Index: linux-2.6/kernel/sched_idletask.c
===================================================================
--- linux-2.6.orig/kernel/sched_idletask.c
+++ linux-2.6/kernel/sched_idletask.c
@@ -7,7 +7,7 @@
 
 #ifdef CONFIG_SMP
 static int
-select_task_rq_idle(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
+select_task_rq_idle(struct task_struct *p, int sd_flag, int flags)
 {
 	return task_cpu(p); /* IDLE tasks as never migrated */
 }
Index: linux-2.6/kernel/sched_rt.c
===================================================================
--- linux-2.6.orig/kernel/sched_rt.c
+++ linux-2.6/kernel/sched_rt.c
@@ -973,8 +973,10 @@ static void yield_task_rt(struct rq *rq)
 static int find_lowest_rq(struct task_struct *task);
 
 static int
-select_task_rq_rt(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
+select_task_rq_rt(struct task_struct *p, int sd_flag, int flags)
 {
+	struct rq *rq = task_rq(p); // XXX  racy
+
 	if (sd_flag != SD_BALANCE_WAKE)
 		return smp_processor_id();
 
Index: linux-2.6/kernel/sched_stoptask.c
===================================================================
--- linux-2.6.orig/kernel/sched_stoptask.c
+++ linux-2.6/kernel/sched_stoptask.c
@@ -9,8 +9,7 @@
 
 #ifdef CONFIG_SMP
 static int
-select_task_rq_stop(struct rq *rq, struct task_struct *p,
-		    int sd_flag, int flags)
+select_task_rq_stop(struct task_struct *p, int sd_flag, int flags)
 {
 	return task_cpu(p); /* stop tasks as never migrate */
 }
Index: linux-2.6/kernel/sched_features.h
===================================================================
--- linux-2.6.orig/kernel/sched_features.h
+++ linux-2.6/kernel/sched_features.h
@@ -64,3 +64,6 @@ SCHED_FEAT(OWNER_SPIN, 1)
  * Decrement CPU power based on irq activity
  */
 SCHED_FEAT(NONIRQ_POWER, 1)
+
+SCHED_FEAT(TTWU_FORCE, 0)
+SCHED_FEAT(TTWU_SPIN, 1)




* Re: [RFC][PATCH 3/5] sched: Change the ttwu success details
  2010-12-16 14:56 ` [RFC][PATCH 3/5] sched: Change the ttwu success details Peter Zijlstra
@ 2010-12-16 15:23   ` Frederic Weisbecker
  2010-12-16 15:27     ` Peter Zijlstra
  2010-12-18  1:05   ` Frank Rowand
  1 sibling, 1 reply; 44+ messages in thread
From: Frederic Weisbecker @ 2010-12-16 15:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	linux-kernel

On Thu, Dec 16, 2010 at 03:56:05PM +0100, Peter Zijlstra wrote:
> try_to_wake_up() would only return a success when it would have to
> place a task on a rq, change that to every time we change p->state to
> TASK_RUNNING, because that's the real measure of wakeups.
> 
> This results in that success is always true for the tracepoint, so
> remove its success argument.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  include/trace/events/sched.h      |   18 ++++++++----------
>  kernel/sched.c                    |   18 ++++++++----------
>  kernel/trace/trace_sched_wakeup.c |    3 +--
>  3 files changed, 17 insertions(+), 22 deletions(-)
> 
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -2327,10 +2327,10 @@ static inline void ttwu_activate(struct 
>  	activate_task(rq, p, en_flags);
>  }
>  
> -static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
> -					int wake_flags, bool success)
> +static void
> +ttwu_post_activation(struct task_struct *p, struct rq *rq, int wake_flags)
>  {
> -	trace_sched_wakeup(p, success);
> +	trace_sched_wakeup(p);
>  	check_preempt_curr(rq, p, wake_flags);
>  
>  	p->state = TASK_RUNNING;
> @@ -2350,7 +2350,7 @@ static inline void ttwu_post_activation(
>  	}
>  #endif
>  	/* if a worker is waking up, notify workqueue */
> -	if ((p->flags & PF_WQ_WORKER) && success)
> +	if (p->flags & PF_WQ_WORKER)
>  		wq_worker_waking_up(p, cpu_of(rq));
>  }
>  
> @@ -2449,9 +2449,9 @@ static int try_to_wake_up(struct task_st
>  #endif /* CONFIG_SMP */
>  	ttwu_activate(p, rq, wake_flags & WF_SYNC, orig_cpu != cpu,
>  		      cpu == this_cpu, en_flags);
> -	success = 1;
>  out_running:
> -	ttwu_post_activation(p, rq, wake_flags, success);
> +	ttwu_post_activation(p, rq, wake_flags);
> +	success = 1;
>  out:
>  	task_rq_unlock(rq, &flags);
>  	put_cpu();
> @@ -2470,7 +2470,6 @@ static int try_to_wake_up(struct task_st
>  static void try_to_wake_up_local(struct task_struct *p)
>  {
>  	struct rq *rq = task_rq(p);
> -	bool success = false;
>  
>  	BUG_ON(rq != this_rq());
>  	BUG_ON(p == current);
> @@ -2485,9 +2484,8 @@ static void try_to_wake_up_local(struct 
>  			schedstat_inc(rq, ttwu_local);
>  		}
>  		ttwu_activate(p, rq, false, false, true, ENQUEUE_WAKEUP);
> -		success = true;
>  	}
> -	ttwu_post_activation(p, rq, 0, success);
> +	ttwu_post_activation(p, rq, 0);
>  }
>  
>  /**
> @@ -2649,7 +2647,7 @@ void wake_up_new_task(struct task_struct
>  
>  	rq = task_rq_lock(p, &flags);
>  	activate_task(rq, p, 0);
> -	trace_sched_wakeup_new(p, 1);
> +	trace_sched_wakeup_new(p);
>  	check_preempt_curr(rq, p, WF_FORK);
>  #ifdef CONFIG_SMP
>  	if (p->sched_class->task_woken)
> Index: linux-2.6/include/trace/events/sched.h
> ===================================================================
> --- linux-2.6.orig/include/trace/events/sched.h
> +++ linux-2.6/include/trace/events/sched.h
> @@ -54,15 +54,14 @@ TRACE_EVENT(sched_kthread_stop_ret,
>   */
>  DECLARE_EVENT_CLASS(sched_wakeup_template,
>  
> -	TP_PROTO(struct task_struct *p, int success),
> +	TP_PROTO(struct task_struct *p),
>  
> -	TP_ARGS(p, success),
> +	TP_ARGS(p),
>  
>  	TP_STRUCT__entry(
>  		__array(	char,	comm,	TASK_COMM_LEN	)
>  		__field(	pid_t,	pid			)
>  		__field(	int,	prio			)
> -		__field(	int,	success			)
>  		__field(	int,	target_cpu		)
>  	),
>  
> @@ -70,25 +69,24 @@ DECLARE_EVENT_CLASS(sched_wakeup_templat
>  		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
>  		__entry->pid		= p->pid;
>  		__entry->prio		= p->prio;
> -		__entry->success	= success;
>  		__entry->target_cpu	= task_cpu(p);
>  	),
>  
> -	TP_printk("comm=%s pid=%d prio=%d success=%d target_cpu=%03d",
> +	TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d",
>  		  __entry->comm, __entry->pid, __entry->prio,
> -		  __entry->success, __entry->target_cpu)
> +		  __entry->target_cpu)

Note we'll need to fix some perf scripts after that. And also perf sched,
probably perf timechart and so on...


* Re: [RFC][PATCH 3/5] sched: Change the ttwu success details
  2010-12-16 15:23   ` Frederic Weisbecker
@ 2010-12-16 15:27     ` Peter Zijlstra
  2010-12-16 15:30       ` Peter Zijlstra
  2010-12-16 15:35       ` Frederic Weisbecker
  0 siblings, 2 replies; 44+ messages in thread
From: Peter Zijlstra @ 2010-12-16 15:27 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	linux-kernel

On Thu, 2010-12-16 at 16:23 +0100, Frederic Weisbecker wrote:

> > -	TP_printk("comm=%s pid=%d prio=%d success=%d target_cpu=%03d",
> > +	TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d",
> >  		  __entry->comm, __entry->pid, __entry->prio,
> > -		  __entry->success, __entry->target_cpu)
> > +		  __entry->target_cpu)
> 
> Note we'll need to fix some perf scripts after that. And also perf sched,
> probably perf timechart and so on...

Do any of those actually use the success parameter? If not, then me
removing it shouldn't break those tools since they're supposed to parse
the format stuff and not notice it missing ;-)


* Re: [RFC][PATCH 3/5] sched: Change the ttwu success details
  2010-12-16 15:27     ` Peter Zijlstra
@ 2010-12-16 15:30       ` Peter Zijlstra
  2010-12-16 15:45         ` Frederic Weisbecker
  2010-12-16 15:35       ` Frederic Weisbecker
  1 sibling, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2010-12-16 15:30 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	linux-kernel

On Thu, 2010-12-16 at 16:27 +0100, Peter Zijlstra wrote:
> On Thu, 2010-12-16 at 16:23 +0100, Frederic Weisbecker wrote:
> 
> > > -	TP_printk("comm=%s pid=%d prio=%d success=%d target_cpu=%03d",
> > > +	TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d",
> > >  		  __entry->comm, __entry->pid, __entry->prio,
> > > -		  __entry->success, __entry->target_cpu)
> > > +		  __entry->target_cpu)
> > 
> > Note we'll need to fix some perf scripts after that. And also perf sched,
> > probably perf timechart and so on...
> 
> Do any of those actually use the success parameter? If not, then me
> removing it shouldn't break those tools since they're supposed to parse
> the format stuff and not notice it missing ;-)

Alternatively we could always call trace_sched_wakeup() and make success
reflect actual wakeup success.


* Re: [RFC][PATCH 5/5] sched: Reduce ttwu rq->lock contention
  2010-12-16 14:56 ` [RFC][PATCH 5/5] sched: Reduce ttwu rq->lock contention Peter Zijlstra
@ 2010-12-16 15:31   ` Frederic Weisbecker
  2010-12-16 17:58   ` Oleg Nesterov
  2010-12-16 18:42   ` Oleg Nesterov
  2 siblings, 0 replies; 44+ messages in thread
From: Frederic Weisbecker @ 2010-12-16 15:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	linux-kernel

On Thu, Dec 16, 2010 at 03:56:07PM +0100, Peter Zijlstra wrote:
> Reduce rq->lock contention on try_to_wake_up() by changing the task
> state using a cmpxchg loop.
> 
> Once the task is set to TASK_WAKING we're guaranteed the only one
> poking at it, then proceed to pick a new cpu without holding the
> rq->lock (XXX this opens some races).
> 
> Then instead of locking the remote rq and activating the task, place
> the task on a remote queue, again using cmpxchg, and notify the remote
> cpu per IPI if this queue was empty to start processing its wakeups.
> 
> This avoids (in most cases) having to lock the remote runqueue (and
> therefore the exclusive cacheline transfer thereof) but also touching
> all the remote runqueue data structures needed for the actual
> activation.
> 
> As measured using: http://oss.oracle.com/~mason/sembench.c
> 
> $ echo 4096 32000 64 128 > /proc/sys/kernel/sem
> $ ./sembench -t 2048 -w 1900 -o 0
> 
> unpatched: run time 30 seconds 537953 worker burns per second
> patched:   run time 30 seconds 657336 worker burns per second
> 
> Still need to sort out all the races marked XXX (non-trivial), and its
> x86 only for the moment.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  arch/x86/kernel/smp.c   |    1 
>  include/linux/sched.h   |    7 -
>  kernel/sched.c          |  241 ++++++++++++++++++++++++++++++++++--------------
>  kernel/sched_fair.c     |    5 
>  kernel/sched_features.h |    3 
>  kernel/sched_idletask.c |    2 
>  kernel/sched_rt.c       |    4 
>  kernel/sched_stoptask.c |    3 
>  8 files changed, 190 insertions(+), 76 deletions(-)
> 
> Index: linux-2.6/arch/x86/kernel/smp.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/smp.c
> +++ linux-2.6/arch/x86/kernel/smp.c
> @@ -205,6 +205,7 @@ void smp_reschedule_interrupt(struct pt_
>  	/*
>  	 * KVM uses this interrupt to force a cpu out of guest mode
>  	 */
> +	sched_ttwu_pending();
>  }

Great, that's going to greatly simplify and lower the overhead of
the remote tick restart I'm doing on wake up for the nohz task thing.


* Re: [RFC][PATCH 3/5] sched: Change the ttwu success details
  2010-12-16 15:27     ` Peter Zijlstra
  2010-12-16 15:30       ` Peter Zijlstra
@ 2010-12-16 15:35       ` Frederic Weisbecker
  1 sibling, 0 replies; 44+ messages in thread
From: Frederic Weisbecker @ 2010-12-16 15:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	linux-kernel

On Thu, Dec 16, 2010 at 04:27:55PM +0100, Peter Zijlstra wrote:
> On Thu, 2010-12-16 at 16:23 +0100, Frederic Weisbecker wrote:
> 
> > > -	TP_printk("comm=%s pid=%d prio=%d success=%d target_cpu=%03d",
> > > +	TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d",
> > >  		  __entry->comm, __entry->pid, __entry->prio,
> > > -		  __entry->success, __entry->target_cpu)
> > > +		  __entry->target_cpu)
> > 
> > Note we'll need to fix some perf scripts after that. And also perf sched,
> > probably perf timechart and so on...
> 
> Do any of those actually use the success parameter? If not, then me
> removing it shouldn't break those tools since they're supposed to parse
> the format stuff and not notice it missing ;-)


Yep, builtin-sched.c and sched-migration.py do.


* Re: [RFC][PATCH 3/5] sched: Change the ttwu success details
  2010-12-16 15:30       ` Peter Zijlstra
@ 2010-12-16 15:45         ` Frederic Weisbecker
  0 siblings, 0 replies; 44+ messages in thread
From: Frederic Weisbecker @ 2010-12-16 15:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	linux-kernel

On Thu, Dec 16, 2010 at 04:30:15PM +0100, Peter Zijlstra wrote:
> On Thu, 2010-12-16 at 16:27 +0100, Peter Zijlstra wrote:
> > On Thu, 2010-12-16 at 16:23 +0100, Frederic Weisbecker wrote:
> > 
> > > > -	TP_printk("comm=%s pid=%d prio=%d success=%d target_cpu=%03d",
> > > > +	TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d",
> > > >  		  __entry->comm, __entry->pid, __entry->prio,
> > > > -		  __entry->success, __entry->target_cpu)
> > > > +		  __entry->target_cpu)
> > > 
> > > Note we'll need to fix some perf scripts after that. And also perf sched,
> > > probably perf timechart and so on...
> > 
> > Do any of those actually use the success parameter? If not, then me
> > removing it shouldn't break those tools since they're supposed to parse
> > the format stuff and not notice it missing ;-)
> 
> Alternatively we could always call trace_sched_wakeup() and make success
> reflect actual wakeup success.

So, that would work very well for sched-migration.py

builtin-sched.c would continue to work fine but would flag it as a bug
if a wakeup occurs on a task that is already running.

But that only triggers a warning; besides that it will just continue
to work normally.

It's ok, I think, if that old tool triggers a spurious warning. I suspect
very few people use it anyway.


* Re: [RFC][PATCH 2/5] mutex: Use p->oncpu for the adaptive spin
  2010-12-16 14:56 ` [RFC][PATCH 2/5] mutex: Use p->oncpu for the adaptive spin Peter Zijlstra
@ 2010-12-16 17:34   ` Oleg Nesterov
  2010-12-16 19:29     ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Oleg Nesterov @ 2010-12-16 17:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Paul Turner, Jens Axboe, linux-kernel

On 12/16, Peter Zijlstra wrote:
>
> @@ -3867,52 +3866,29 @@ int mutex_spin_on_owner(struct mutex *lo
>  	 * DEBUG_PAGEALLOC could have unmapped it if
>  	 * the mutex owner just released it and exited.
>  	 */
> -	if (probe_kernel_address(&owner->cpu, cpu))
> +	if (probe_kernel_address(&owner->oncpu, oncpu))
>  		return 0;
>  #else
> -	cpu = owner->cpu;
> +	oncpu = owner->oncpu;
>  #endif
>
> -	/*
> -	 * Even if the access succeeded (likely case),
> -	 * the cpu field may no longer be valid.
> -	 */
> -	if (cpu >= nr_cpumask_bits)
> -		return 0;
> -
> -	/*
> -	 * We need to validate that we can do a
> -	 * get_cpu() and that we have the percpu area.
> -	 */
> -	if (!cpu_online(cpu))
> +	if (!oncpu)
>  		return 0;
>
> -	rq = cpu_rq(cpu);
> -
> -	for (;;) {
> -		/*
> -		 * Owner changed, break to re-assess state.
> -		 */
> -		if (lock->owner != owner) {
> -			/*
> -			 * If the lock has switched to a different owner,
> -			 * we likely have heavy contention. Return 0 to quit
> -			 * optimistic spinning and not contend further:
> -			 */
> -			if (lock->owner)
> -				return 0;
> -			break;
> -		}
> -
> -		/*
> -		 * Is that owner really running on that cpu?
> -		 */
> -		if (task_thread_info(rq->curr) != owner || need_resched())
> +	while (lock->owner == owner && owner->oncpu) {

It seems, this owner->oncpu dereference needs probe_kernel_address() too.
Of course, only in theory.

OTOH, I do not understand why we need the first "if (!oncpu)" check
(and "int oncpu" at all).

Oleg.



* Re: [RFC][PATCH 5/5] sched: Reduce ttwu rq->lock contention
  2010-12-16 14:56 ` [RFC][PATCH 5/5] sched: Reduce ttwu rq->lock contention Peter Zijlstra
  2010-12-16 15:31   ` Frederic Weisbecker
@ 2010-12-16 17:58   ` Oleg Nesterov
  2010-12-16 18:42   ` Oleg Nesterov
  2 siblings, 0 replies; 44+ messages in thread
From: Oleg Nesterov @ 2010-12-16 17:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Paul Turner, Jens Axboe, linux-kernel

On 12/16, Peter Zijlstra wrote:
>
> Then instead of locking the remote rq and activating the task, place
> the task on a remote queue, again using cmpxchg, and notify the remote
> cpu per IPI if this queue was empty to start processing its wakeups.

Interesting... I haven't actually read this patch yet, just a very
minor nit.

> +#ifdef CONFIG_SMP
> +static void ttwu_queue_remote(struct task_struct *p, int cpu)
> +{
> +	struct task_struct *next = NULL;
> +	struct rq *rq = cpu_rq(cpu);
> +
> +	for (;;) {
> +		struct task_struct *old = next;
> +
> +		p->wake_entry = next;
> +		next = cmpxchg(&rq->wake_list, old, p);

Somehow I was confused by initial "next = NULL", perhaps

	struct rq *rq = cpu_rq(cpu);
	struct task_struct *next = rq->wake_list;

makes a bit more sense.

Oleg.



* Re: [RFC][PATCH 5/5] sched: Reduce ttwu rq->lock contention
  2010-12-16 14:56 ` [RFC][PATCH 5/5] sched: Reduce ttwu rq->lock contention Peter Zijlstra
  2010-12-16 15:31   ` Frederic Weisbecker
  2010-12-16 17:58   ` Oleg Nesterov
@ 2010-12-16 18:42   ` Oleg Nesterov
  2010-12-16 18:58     ` Peter Zijlstra
  2 siblings, 1 reply; 44+ messages in thread
From: Oleg Nesterov @ 2010-12-16 18:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Paul Turner, Jens Axboe, linux-kernel

On 12/16, Peter Zijlstra wrote:
>
> +static int ttwu_force(struct task_struct *p, int wake_flags)
> +{
> +	struct rq *rq;
> +	int ret = 0;
> +
> +	/*
> +	 * Since we've already set TASK_WAKING this task's CPU cannot
> +	 * change from under us.

I think it can. Yes, we've set TASK_WAKING. But, at least the task
itself can change its state back to TASK_RUNNING without calling
schedule. Say, __wait_event()-like code.
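
E.g., roughly (wq/wait/condition being the usual wait-queue boilerplate):

	for (;;) {
		prepare_to_wait(&wq, &wait, TASK_UNINTERRUPTIBLE);
		if (condition)
			break;		/* no schedule() on this path */
		schedule();
	}
	finish_wait(&wq, &wait);	/* -> __set_current_state(TASK_RUNNING) */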

> +static int
> +try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  {
> -	int cpu, orig_cpu, this_cpu, success = 0;
> +	int cpu = task_cpu(p);
>  	unsigned long flags;
> -	unsigned long en_flags = ENQUEUE_WAKEUP;
> -	struct rq *rq;
> +	int success = 0;
> +	int load;
>  
> -	this_cpu = get_cpu();
> -
> -	smp_wmb();
> -	rq = task_rq_lock(p, &flags);
> -	if (!(p->state & state))
> -		goto out;
> +	local_irq_save(flags);
> +	for (;;) {
> +		unsigned int task_state = p->state;
>  
> -	cpu = task_cpu(p);
> +		if (!(task_state & state))
> +			goto out;

Well, this surely breaks the code like

	CONDITION = true;
	wake_up_process(p);

At least we need mb() before we check task_state the first time.
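
I.e., the combined waker path must effectively be something like:

	CONDITION = true;
	smp_mb();	/* order the store above against the lockless ->state read */
	if (!(p->state & state))
		goto out;	/* without the barrier this load can pass the store */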

Oleg.



* Re: [RFC][PATCH 5/5] sched: Reduce ttwu rq->lock contention
  2010-12-16 18:42   ` Oleg Nesterov
@ 2010-12-16 18:58     ` Peter Zijlstra
  2010-12-16 19:03       ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2010-12-16 18:58 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Paul Turner, Jens Axboe, linux-kernel

On Thu, 2010-12-16 at 19:42 +0100, Oleg Nesterov wrote:
> On 12/16, Peter Zijlstra wrote:
> >
> > +static int ttwu_force(struct task_struct *p, int wake_flags)
> > +{
> > +	struct rq *rq;
> > +	int ret = 0;
> > +
> > +	/*
> > +	 * Since we've already set TASK_WAKING this task's CPU cannot
> > +	 * change from under us.
> 
> I think it can. Yes, we've set TASK_WAKING. But, at least the task
> itself can change its state back to TASK_RUNNING without calling
> schedule. Say, __wait_event()-like code.

Oh crud, you're right, that's going to make all this cmpxchg stuff lots
more interesting :/

> > +static int
> > +try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> >  {
> > -	int cpu, orig_cpu, this_cpu, success = 0;
> > +	int cpu = task_cpu(p);
> >  	unsigned long flags;
> > -	unsigned long en_flags = ENQUEUE_WAKEUP;
> > -	struct rq *rq;
> > +	int success = 0;
> > +	int load;
> >  
> > -	this_cpu = get_cpu();
> > -
> > -	smp_wmb();
> > -	rq = task_rq_lock(p, &flags);
> > -	if (!(p->state & state))
> > -		goto out;
> > +	local_irq_save(flags);
> > +	for (;;) {
> > +		unsigned int task_state = p->state;
> >  
> > -	cpu = task_cpu(p);
> > +		if (!(task_state & state))
> > +			goto out;
> 
> Well, this surely breaks the code like
> 
> 	CONDITION = true;
> 	wake_up_process(p);
> 
> At least we need mb() before we check task_state the first time.

You're right (a wmb, at least). I left that out because I had the cmpxchg
in there, which provides an mb, but didn't notice I read the state before
that.. /me goes put the smp_wmb() back.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 5/5] sched: Reduce ttwu rq->lock contention
  2010-12-16 18:58     ` Peter Zijlstra
@ 2010-12-16 19:03       ` Peter Zijlstra
  2010-12-16 19:47         ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2010-12-16 19:03 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Paul Turner, Jens Axboe, linux-kernel

On Thu, 2010-12-16 at 19:58 +0100, Peter Zijlstra wrote:
> > > +    * Since we've already set TASK_WAKING this task's CPU cannot
> > > +    * change from under us.
> > 
> > I think it can. Yes, we've set TASK_WAKING. But, at least the task
> > itself can change its state back to TASK_RUNNING without calling
> > schedule. Say, __wait_event()-like code.
> 
> Oh crud, you're right, that's going to make all this cmpxchg stuff lots
> more interesting :/ 

Hrmph, we can add a task_is_waking() test to the rq->lock in schedule(),
like we have for __task_rq_lock():

  local_irq_save(flags);
again:
  while (task_is_waking(current))
	cpu_relax();
  raw_spin_lock(&rq->lock);
  if (task_is_waking(current)) {
	raw_spin_unlock(&rq->lock);
	goto again;
  }


But that's not particularly pretty...

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 0/5] Reduce runqueue lock contention -v2
  2010-12-16 14:56 [RFC][PATCH 0/5] Reduce runqueue lock contention -v2 Peter Zijlstra
                   ` (4 preceding siblings ...)
  2010-12-16 14:56 ` [RFC][PATCH 5/5] sched: Reduce ttwu rq->lock contention Peter Zijlstra
@ 2010-12-16 19:12 ` Frank Rowand
  2010-12-16 19:36   ` Frank Rowand
  2010-12-16 19:36   ` Frank Rowand
  5 siblings, 2 replies; 44+ messages in thread
From: Frank Rowand @ 2010-12-16 19:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	linux-kernel

On 12/16/10 06:56, Peter Zijlstra wrote:
> Hi, here a new posting of my scary patch(es) ;-)
> 
> These actually survive a sembench run (and everything else I threw at it).
> The discussion between Mike and Frank over the task_running() check made me
> realize what was wrong with the previous one.
> 
> As it turns out, what was needed (p->oncpu) was something Thomas wanted me
> to do for an entirely different reason (see patch #2).
> 
> Frank's patch, while encouraging me to poke at it again, has a number of
> very fundamental problems with it, the most serious one being that it
> completely wrecks the wake-up load-balancing.

And also as Peter pointed out when I posted the patch (thank you Peter),
I did not properly handle the return value for try_to_wake_up() - a rather
fatal flaw.

By coincidence, I was about to post a new version of my scary patch when
this email arrived.  I'll post my patches as a reply to this email, then
read through Peter's.


Frank's patch, Version 2

Changes from Version 1:
  - Ensure return value of try_to_wake_up() is correct, even when queueing
    wake up on a different cpu.
  - rq->lock contention reduction not as good as first version

patch 1

  The core changes.  All the scary lock related stuff.

  select_task_rq() uses the smp_processor_id() of the old task_cpu(p) instead
  of the waking smp_processor_id().

patch 2

  select_task_rq() uses the smp_processor_id() of the waking cpu
  
Limitations
  x86 only

Tests
  - tested on 2 cpu x86_64
  - very simplistic workload
  - results:
     rq->lock contention count reduced by ~ 75%
     rq->lock contention wait time reduced by ~ 70%
     test duration reduction is in the noise
     rq->lock contention improvement is slightly better with just patch 1
       applied, but the difference is in the noise

Todo
  - handle cpu being offlined


-Frank

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 2/5] mutex: Use p->oncpu for the adaptive spin
  2010-12-16 17:34   ` Oleg Nesterov
@ 2010-12-16 19:29     ` Peter Zijlstra
  2010-12-17 19:17       ` Oleg Nesterov
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2010-12-16 19:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Paul Turner, Jens Axboe, linux-kernel

On Thu, 2010-12-16 at 18:34 +0100, Oleg Nesterov wrote:
> On 12/16, Peter Zijlstra wrote:
> >
> > @@ -3867,52 +3866,29 @@ int mutex_spin_on_owner(struct mutex *lo
> >  	 * DEBUG_PAGEALLOC could have unmapped it if
> >  	 * the mutex owner just released it and exited.
> >  	 */
> > +	if (probe_kernel_address(&owner->oncpu, oncpu))
> >  		return 0;
> >  #else
> > +	oncpu = owner->oncpu;
> >  #endif
> >
> > +	if (!oncpu)
> >  		return 0;
> >
> > +	while (lock->owner == owner && owner->oncpu) {
> 
> It seems, this owner->oncpu dereference needs probe_kernel_address() too.
> Of course, only in theory.
> 
> OTOH, I do not understand why we need the first "if (!oncpu)" check
> (and "int oncpu" at all).

Probably because I got my head in a twist.. I wanted to keep the
probe_kernel_address() stuff there so that we wouldn't dereference a
dead owner pointer, however..

I then later figured that the lock->owner == owner test is already
sufficient for that: if the owner died and went away, it would have released
the mutex already (unless something really bad happened, in which case
this wouldn't be the first explosion).

So yea, I think we can simplify all this even further, but we do need a
smp_rmb() in between those two checks, hmm, and RCU.

How about this?


---
Subject: mutex: Use p->oncpu for the adaptive spin
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Wed Dec 15 18:37:21 CET 2010

Since we now have p->oncpu unconditionally available, use it to spin
on.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 include/linux/mutex.h |    2 -
 include/linux/sched.h |    2 -
 kernel/mutex-debug.c  |    2 -
 kernel/mutex-debug.h  |    2 -
 kernel/mutex.c        |    2 -
 kernel/mutex.h        |    2 -
 kernel/sched.c        |   87 +++++++++++++++++++++-----------------------------
 7 files changed, 43 insertions(+), 56 deletions(-)

Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -51,7 +51,7 @@ struct mutex {
 	spinlock_t		wait_lock;
 	struct list_head	wait_list;
 #if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
-	struct thread_info	*owner;
+	struct task_struct	*owner;
 #endif
 #ifdef CONFIG_DEBUG_MUTEXES
 	const char 		*name;
Index: linux-2.6/kernel/mutex-debug.c
===================================================================
--- linux-2.6.orig/kernel/mutex-debug.c
+++ linux-2.6/kernel/mutex-debug.c
@@ -75,7 +75,7 @@ void debug_mutex_unlock(struct mutex *lo
 		return;
 
 	DEBUG_LOCKS_WARN_ON(lock->magic != lock);
-	DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
+	DEBUG_LOCKS_WARN_ON(lock->owner != current);
 	DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);
 	mutex_clear_owner(lock);
 }
Index: linux-2.6/kernel/mutex-debug.h
===================================================================
--- linux-2.6.orig/kernel/mutex-debug.h
+++ linux-2.6/kernel/mutex-debug.h
@@ -29,7 +29,7 @@ extern void debug_mutex_init(struct mute
 
 static inline void mutex_set_owner(struct mutex *lock)
 {
-	lock->owner = current_thread_info();
+	lock->owner = current;
 }
 
 static inline void mutex_clear_owner(struct mutex *lock)
Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -160,7 +160,7 @@ __mutex_lock_common(struct mutex *lock, 
 	 */
 
 	for (;;) {
-		struct thread_info *owner;
+		struct task_struct *owner;
 
 		/*
 		 * If we own the BKL, then don't spin. The owner of
Index: linux-2.6/kernel/mutex.h
===================================================================
--- linux-2.6.orig/kernel/mutex.h
+++ linux-2.6/kernel/mutex.h
@@ -19,7 +19,7 @@
 #ifdef CONFIG_SMP
 static inline void mutex_set_owner(struct mutex *lock)
 {
-	lock->owner = current_thread_info();
+	lock->owner = current;
 }
 
 static inline void mutex_clear_owner(struct mutex *lock)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -3979,70 +3979,57 @@ asmlinkage void __sched schedule(void)
 EXPORT_SYMBOL(schedule);
 
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
-/*
- * Look out! "owner" is an entirely speculative pointer
- * access and not reliable.
- */
-int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
-{
-	unsigned int cpu;
-	struct rq *rq;
 
-	if (!sched_feat(OWNER_SPIN))
-		return 0;
+static inline bool owner_running(struct mutex *lock, struct task_struct *owner)
+{
+	bool ret = false;
 
-#ifdef CONFIG_DEBUG_PAGEALLOC
-	/*
-	 * Need to access the cpu field knowing that
-	 * DEBUG_PAGEALLOC could have unmapped it if
-	 * the mutex owner just released it and exited.
-	 */
-	if (probe_kernel_address(&owner->cpu, cpu))
-		return 0;
-#else
-	cpu = owner->cpu;
-#endif
+	rcu_read_lock();
+	if (lock->owner != owner)
+		goto fail;
 
 	/*
-	 * Even if the access succeeded (likely case),
-	 * the cpu field may no longer be valid.
+	 * Ensure we've finished the lock->owner load before dereferencing
+	 * owner. Matches the smp_wmb() in finish_lock_switch().
+	 *
+	 * If the previous branch fails, dereferencing owner is not safe
+	 * anymore because it could have exited and freed the task_struct.
+	 *
+	 * If it was true, the task ref is still valid because of us holding
+	 * the rcu_read_lock().
 	 */
-	if (cpu >= nr_cpumask_bits)
-		return 0;
+	smp_rmb();
 
-	/*
-	 * We need to validate that we can do a
-	 * get_cpu() and that we have the percpu area.
-	 */
-	if (!cpu_online(cpu))
-		return 0;
+	ret = owner->oncpu;
+fail:
+	rcu_read_unlock();
 
-	rq = cpu_rq(cpu);
+	return ret;
+}
 
-	for (;;) {
-		/*
-		 * Owner changed, break to re-assess state.
-		 */
-		if (lock->owner != owner) {
-			/*
-			 * If the lock has switched to a different owner,
-			 * we likely have heavy contention. Return 0 to quit
-			 * optimistic spinning and not contend further:
-			 */
-			if (lock->owner)
-				return 0;
-			break;
-		}
+/*
+ * Look out! "owner" is an entirely speculative pointer
+ * access and not reliable.
+ */
+int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
+{
+	if (!sched_feat(OWNER_SPIN))
+		return 0;
 
-		/*
-		 * Is that owner really running on that cpu?
-		 */
-		if (task_thread_info(rq->curr) != owner || need_resched())
+	while (owner_running(lock, owner)) {
+		if (need_resched())
 			return 0;
 
 		arch_mutex_cpu_relax();
 	}
 
+	/*
+	 * If the owner changed to another task there is likely
+	 * heavy contention, stop spinning.
+	 */
+	if (lock->owner)
+		return 0;
+
 	return 1;
 }
 #endif
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -360,7 +360,7 @@ extern signed long schedule_timeout_inte
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
 asmlinkage void schedule(void);
-extern int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner);
+extern int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner);
 
 struct nsproxy;
 struct user_namespace;


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 0/5] Reduce runqueue lock contention -v2
  2010-12-16 19:12 ` [RFC][PATCH 0/5] Reduce runqueue lock contention -v2 Frank Rowand
@ 2010-12-16 19:36   ` Frank Rowand
  2010-12-16 19:39     ` Frank Rowand
  2010-12-16 19:36   ` Frank Rowand
  1 sibling, 1 reply; 44+ messages in thread
From: Frank Rowand @ 2010-12-16 19:36 UTC (permalink / raw)
  To: frank.rowand, frank.rowand
  Cc: Peter Zijlstra, Chris Mason, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	linux-kernel



patch 1 of 2

Signed-off-by: Frank Rowand <frank.rowand@am.sony.com>

---
 arch/x86/kernel/smp.c |    1 	1 +	0 -	0 !
 include/linux/sched.h |    5 	5 +	0 -	0 !
 kernel/sched.c        |  105 	99 +	6 -	0 !
 3 files changed, 105 insertions(+), 6 deletions(-)

Index: linux-2.6/arch/x86/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smp.c
+++ linux-2.6/arch/x86/kernel/smp.c
@@ -205,6 +205,7 @@ void smp_reschedule_interrupt(struct pt_
 	/*
 	 * KVM uses this interrupt to force a cpu out of guest mode
 	 */
+	sched_ttwu_pending();
 }
 
 void smp_call_function_interrupt(struct pt_regs *regs)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1038,6 +1038,7 @@ struct sched_domain;
  */
 #define WF_SYNC		0x01		/* waker goes to sleep after wakup */
 #define WF_FORK		0x02		/* child wakeup after fork */
+#define WF_LOAD		0x04		/* for queued try_to_wake_up() */
 
 #define ENQUEUE_WAKEUP		1
 #define ENQUEUE_WAKING		2
@@ -1193,6 +1194,8 @@ struct task_struct {
 	int lock_depth;		/* BKL lock depth */
 
 #ifdef CONFIG_SMP
+	struct task_struct *ttwu_queue_wake_entry;
+	int ttwu_queue_wake_flags;
 #ifdef __ARCH_WANT_UNLOCKED_CTXSW
 	int oncpu;
 #endif
@@ -2017,6 +2020,7 @@ extern void release_uids(struct user_nam
 
 extern void do_timer(unsigned long ticks);
 
+extern void sched_ttwu_pending(void);
 extern int wake_up_state(struct task_struct *tsk, unsigned int state);
 extern int wake_up_process(struct task_struct *tsk);
 extern void wake_up_new_task(struct task_struct *tsk,
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -515,6 +515,8 @@ struct rq {
 	u64 age_stamp;
 	u64 idle_stamp;
 	u64 avg_idle;
+
+	struct task_struct *wake_list;
 #endif
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
@@ -2332,6 +2334,28 @@ static inline void ttwu_post_activation(
 		wq_worker_waking_up(p, cpu_of(rq));
 }
 
+#ifdef CONFIG_SMP
+static void ttwu_queue_wake_up(struct task_struct *p, int cpu, int wake_flags)
+{
+	struct task_struct *next = NULL;
+	struct rq *rq = cpu_rq(cpu);
+
+	p->ttwu_queue_wake_flags = wake_flags;
+
+	for (;;) {
+		struct task_struct *old = next;
+
+		p->ttwu_queue_wake_entry = next;
+		next = cmpxchg(&rq->wake_list, old, p);
+		if (next == old)
+			break;
+	}
+
+	if (!next)
+		smp_send_reschedule(cpu);
+}
+#endif
+
 /**
  * try_to_wake_up - wake up a thread
  * @p: the thread to be awakened
@@ -2350,20 +2374,88 @@ static inline void ttwu_post_activation(
 static int try_to_wake_up(struct task_struct *p, unsigned int state,
 			  int wake_flags)
 {
+/*
+ * xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
+ * todo
+ *  - pass waking cpu with queued wake up, to be used in call to
+ *    select_task_rq().
+ *  - handle cpu being offlined
+ * xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
+ */
 	int cpu, orig_cpu, this_cpu, success = 0;
 	unsigned long flags;
 	unsigned long en_flags = ENQUEUE_WAKEUP;
 	struct rq *rq;
+#ifdef CONFIG_SMP
+	int load;
+#endif
 
 	this_cpu = get_cpu();
 
-	smp_wmb();
-	rq = task_rq_lock(p, &flags);
-	if (!(p->state & state))
-		goto out;
+	local_irq_save(flags);
 
-	if (p->se.on_rq)
-		goto out_running;
+	for (;;) {
+		unsigned int task_state = p->state;
+
+		if (!(task_state & state))
+			goto out_nolock;
+		/*
+		 * task_contributes_to_load() tests p->state
+		 */
+		load = task_contributes_to_load(p);
+
+		if (cmpxchg(&p->state, task_state, TASK_WAKING) == task_state) {
+			if (state == TASK_WAKING)
+				load = wake_flags & WF_LOAD;
+			break;
+		}
+	}
+
+	/*
+	 * Avoid a possible cross cpu rq lock attempt until we know that a
+	 * lock must be acquired.  rq lock is to protect interaction with
+	 * schedule().
+	 *
+	 * p->state == TASK_WAKING protects against any other try_to_wake_up()
+	 * setting p->se.on_rq true after this test.
+	 */
+	if (unlikely(p->se.on_rq)) {
+		smp_wmb();
+		rq = __task_rq_lock(p);
+		if (p->se.on_rq)
+			goto out_running;
+		__task_rq_unlock(rq);
+	}
+
+#ifdef CONFIG_SMP
+	/*
+	 * If task_cpu(p) != this_cpu then the attempt to lock the rq on the
+	 * other cpu can result in rq lock contention.  Queueing this wake up
+	 * on the other cpu may reduce rq lock contention.
+	 *
+	 * All tests that could have led to returning 0 have been completed
+	 * before this point, return value will be 1.  The return value of
+	 * the try_to_wake_up() executed after unqueueing the wake request
+	 * can not be returned to the current caller, so have to know what
+	 * the return value of the queued request will be.
+	 */
+	cpu = task_cpu(p);
+	if (cpu != this_cpu) {
+		if (load)
+			wake_flags |= WF_LOAD;
+		ttwu_queue_wake_up(p, cpu, wake_flags);
+		success = 1;
+		goto out_nolock;
+	}
+#endif
+
+	/*
+	 * task_cpu(p) may have changed since it was checked since rq->lock
+	 * is not held.  Thus may still end up with cross cpu rq lock
+	 * contention.  Encountering this race should be very rare.
+	 */
+	smp_wmb();
+	rq = __task_rq_lock(p);
 
 	cpu = task_cpu(p);
 	orig_cpu = cpu;
@@ -2378,13 +2470,12 @@ static int try_to_wake_up(struct task_st
 	 *
 	 * First fix up the nr_uninterruptible count:
 	 */
-	if (task_contributes_to_load(p)) {
+	if (load) {
 		if (likely(cpu_online(orig_cpu)))
 			rq->nr_uninterruptible--;
 		else
 			this_rq()->nr_uninterruptible--;
 	}
-	p->state = TASK_WAKING;
 
 	if (p->sched_class->task_waking) {
 		p->sched_class->task_waking(rq, p);
@@ -2394,6 +2485,10 @@ static int try_to_wake_up(struct task_st
 	cpu = select_task_rq(rq, p, SD_BALANCE_WAKE, wake_flags);
 	if (cpu != orig_cpu)
 		set_task_cpu(p, cpu);
+	/*
+	 * Protected against concurrent wakeups while rq->lock released because
+	 * p is in TASK_WAKING state.
+	 */
 	__task_rq_unlock(rq);
 
 	rq = cpu_rq(cpu);
@@ -2430,13 +2525,30 @@ out_activate:
 	success = 1;
 out_running:
 	ttwu_post_activation(p, rq, wake_flags, success);
-out:
-	task_rq_unlock(rq, &flags);
+	__task_rq_unlock(rq);
+out_nolock:
+	local_irq_restore(flags);
 	put_cpu();
 
 	return success;
 }
 
+#ifdef CONFIG_SMP
+void sched_ttwu_pending(void)
+{
+	struct rq *rq = this_rq();
+	struct task_struct *p = xchg(&rq->wake_list, NULL);
+
+	if (!p)
+		return;
+
+	while (p) {
+		try_to_wake_up(p, TASK_WAKING, p->ttwu_queue_wake_flags);
+		p = p->ttwu_queue_wake_entry;
+	}
+}
+#endif
+
 /**
  * try_to_wake_up_local - try to wake up a local task with rq lock held
  * @p: the thread to be awakened

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 0/5] Reduce runqueue lock contention -v2
  2010-12-16 19:12 ` [RFC][PATCH 0/5] Reduce runqueue lock contention -v2 Frank Rowand
  2010-12-16 19:36   ` Frank Rowand
@ 2010-12-16 19:36   ` Frank Rowand
  1 sibling, 0 replies; 44+ messages in thread
From: Frank Rowand @ 2010-12-16 19:36 UTC (permalink / raw)
  To: frank.rowand, frank.rowand
  Cc: Peter Zijlstra, Chris Mason, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	linux-kernel



patch 2 of 2

Signed-off-by: Frank Rowand <frank.rowand@am.sony.com>

---
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1060,7 +1060,7 @@ struct sched_class {
 
 #ifdef CONFIG_SMP
 	int  (*select_task_rq)(struct rq *rq, struct task_struct *p,
-			       int sd_flag, int flags);
+			       int sd_flag, int flags, int waking_cpu);
 
 	void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
 	void (*post_schedule) (struct rq *this_rq);
@@ -1196,6 +1196,7 @@ struct task_struct {
 #ifdef CONFIG_SMP
 	struct task_struct *ttwu_queue_wake_entry;
 	int ttwu_queue_wake_flags;
+	int ttwu_waking_cpu;
 #ifdef __ARCH_WANT_UNLOCKED_CTXSW
 	int oncpu;
 #endif
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2262,9 +2262,11 @@ static int select_fallback_rq(int cpu, s
  * The caller (fork, wakeup) owns TASK_WAKING, ->cpus_allowed is stable.
  */
 static inline
-int select_task_rq(struct rq *rq, struct task_struct *p, int sd_flags, int wake_flags)
+int select_task_rq(struct rq *rq, struct task_struct *p, int sd_flags,
+		   int wake_flags, int waking_cpu)
 {
-	int cpu = p->sched_class->select_task_rq(rq, p, sd_flags, wake_flags);
+	int cpu = p->sched_class->select_task_rq(rq, p, sd_flags, wake_flags,
+						 waking_cpu);
 
 	/*
 	 * In order not to call set_task_cpu() on a blocking task we need
@@ -2335,12 +2337,14 @@ static inline void ttwu_post_activation(
 }
 
 #ifdef CONFIG_SMP
-static void ttwu_queue_wake_up(struct task_struct *p, int cpu, int wake_flags)
+static void ttwu_queue_wake_up(struct task_struct *p, int this_cpu, int p_cpu,
+			       int wake_flags)
 {
 	struct task_struct *next = NULL;
-	struct rq *rq = cpu_rq(cpu);
+	struct rq *rq = cpu_rq(p_cpu);
 
 	p->ttwu_queue_wake_flags = wake_flags;
+	p->ttwu_waking_cpu = this_cpu;
 
 	for (;;) {
 		struct task_struct *old = next;
@@ -2352,7 +2356,7 @@ static void ttwu_queue_wake_up(struct ta
 	}
 
 	if (!next)
-		smp_send_reschedule(cpu);
+		smp_send_reschedule(p_cpu);
 }
 #endif
 
@@ -2377,8 +2381,6 @@ static int try_to_wake_up(struct task_st
 /*
  * xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  * todo
- *  - pass waking cpu with queued wake up, to be used in call to
- *    select_task_rq().
  *  - handle cpu being offlined
  * xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  */
@@ -2387,7 +2389,7 @@ static int try_to_wake_up(struct task_st
 	unsigned long en_flags = ENQUEUE_WAKEUP;
 	struct rq *rq;
 #ifdef CONFIG_SMP
-	int load;
+	int load, waking_cpu;
 #endif
 
 	this_cpu = get_cpu();
@@ -2405,8 +2407,12 @@ static int try_to_wake_up(struct task_st
 		load = task_contributes_to_load(p);
 
 		if (cmpxchg(&p->state, task_state, TASK_WAKING) == task_state) {
-			if (state == TASK_WAKING)
+			if (state == TASK_WAKING) {
 				load = wake_flags & WF_LOAD;
+				waking_cpu = p->ttwu_waking_cpu;
+			} else {
+				waking_cpu = this_cpu;
+			}
 			break;
 		}
 	}
@@ -2443,7 +2449,7 @@ static int try_to_wake_up(struct task_st
 	if (cpu != this_cpu) {
 		if (load)
 			wake_flags |= WF_LOAD;
-		ttwu_queue_wake_up(p, cpu, wake_flags);
+		ttwu_queue_wake_up(p, this_cpu, cpu, wake_flags);
 		success = 1;
 		goto out_nolock;
 	}
@@ -2482,7 +2488,7 @@ static int try_to_wake_up(struct task_st
 		en_flags |= ENQUEUE_WAKING;
 	}
 
-	cpu = select_task_rq(rq, p, SD_BALANCE_WAKE, wake_flags);
+	cpu = select_task_rq(rq, p, SD_BALANCE_WAKE, wake_flags, waking_cpu);
 	if (cpu != orig_cpu)
 		set_task_cpu(p, cpu);
 	/*
@@ -2728,7 +2734,7 @@ void wake_up_new_task(struct task_struct
 	 * We set TASK_WAKING so that select_task_rq() can drop rq->lock
 	 * without people poking at ->cpus_allowed.
 	 */
-	cpu = select_task_rq(rq, p, SD_BALANCE_FORK, 0);
+	cpu = select_task_rq(rq, p, SD_BALANCE_FORK, 0, cpu);
 	set_task_cpu(p, cpu);
 
 	p->state = TASK_RUNNING;
@@ -3327,7 +3333,8 @@ void sched_exec(void)
 	int dest_cpu;
 
 	rq = task_rq_lock(p, &flags);
-	dest_cpu = p->sched_class->select_task_rq(rq, p, SD_BALANCE_EXEC, 0);
+	dest_cpu = p->sched_class->select_task_rq(rq, p, SD_BALANCE_EXEC, 0,
+						  smp_processor_id());
 	if (dest_cpu == smp_processor_id())
 		goto unlock;
 
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1606,10 +1606,10 @@ static int select_idle_sibling(struct ta
  * preempt must be disabled.
  */
 static int
-select_task_rq_fair(struct rq *rq, struct task_struct *p, int sd_flag, int wake_flags)
+select_task_rq_fair(struct rq *rq, struct task_struct *p, int sd_flag,
+		    int wake_flags, int cpu)
 {
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
-	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	int new_cpu = cpu;
 	int want_affine = 0;
Index: linux-2.6/kernel/sched_idletask.c
===================================================================
--- linux-2.6.orig/kernel/sched_idletask.c
+++ linux-2.6/kernel/sched_idletask.c
@@ -7,7 +7,8 @@
 
 #ifdef CONFIG_SMP
 static int
-select_task_rq_idle(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
+select_task_rq_idle(struct rq *rq, struct task_struct *p, int sd_flag,
+		    int flags, int waking_cpu)
 {
 	return task_cpu(p); /* IDLE tasks as never migrated */
 }
Index: linux-2.6/kernel/sched_rt.c
===================================================================
--- linux-2.6.orig/kernel/sched_rt.c
+++ linux-2.6/kernel/sched_rt.c
@@ -973,7 +973,8 @@ static void yield_task_rt(struct rq *rq)
 static int find_lowest_rq(struct task_struct *task);
 
 static int
-select_task_rq_rt(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
+select_task_rq_rt(struct rq *rq, struct task_struct *p, int sd_flag, int flags,
+		  int waking_cpu)
 {
 	if (sd_flag != SD_BALANCE_WAKE)
 		return smp_processor_id();
Index: linux-2.6/kernel/sched_stoptask.c
===================================================================
--- linux-2.6.orig/kernel/sched_stoptask.c
+++ linux-2.6/kernel/sched_stoptask.c
@@ -10,7 +10,7 @@
 #ifdef CONFIG_SMP
 static int
 select_task_rq_stop(struct rq *rq, struct task_struct *p,
-		    int sd_flag, int flags)
+		    int sd_flag, int flags, int waking_cpu)
 {
 	return task_cpu(p); /* stop tasks as never migrate */
 }

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 0/5] Reduce runqueue lock contention -v2
  2010-12-16 19:36   ` Frank Rowand
@ 2010-12-16 19:39     ` Frank Rowand
  2010-12-16 19:42       ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Frank Rowand @ 2010-12-16 19:39 UTC (permalink / raw)
  To: frank.rowand
  Cc: frank.rowand, Peter Zijlstra, Chris Mason, Ingo Molnar,
	Thomas Gleixner, Mike Galbraith, Oleg Nesterov, Paul Turner,
	Jens Axboe, linux-kernel

On 12/16/10 11:36, Frank Rowand wrote:
> 
> 
> patch 1 of 2

The email that explains the context for this does not seem to have gotten
through to the list.  Here is the email that this patch should have been
a reply to:

On 12/16/10 06:56, Peter Zijlstra wrote:
> Hi, here a new posting of my scary patch(es) ;-)
> 
> These actually survive a sembench run (and everything else I threw at it).
> The discussion between Mike and Frank over the task_running() check made me
> realize what was wrong with the previous one.
> 
> As it turns out, what was needed (p->oncpu) was something Thomas wanted me
> to do for an entirely different reason (see patch #2).
> 
> Frank's patch, while encouraging me to poke at it again, has a number of
> very fundamental problems with it, the most serious one being that it
> completely wrecks the wake-up load-balancing.

And also as Peter pointed out when I posted the patch (thank you Peter),
I did not properly handle the return value for try_to_wake_up() - a rather
fatal flaw.

By coincidence, I was about to post a new version of my scary patch when
this email arrived.  I'll post my patches as a reply to this email, then
read through Peter's.


Frank's patch, Version 2

Changes from Version 1:
  - Ensure return value of try_to_wake_up() is correct, even when queueing
    wake up on a different cpu.
  - rq->lock contention reduction not as good as first version

patch 1

  The core changes.  All the scary lock related stuff.

  select_task_rq() uses the smp_processor_id() of the old task_cpu(p) instead
  of the waking smp_processor_id().

patch 2

  select_task_rq() uses the smp_processor_id() of the waking cpu
  
Limitations
  x86 only

Tests
  - tested on 2 cpu x86_64
  - very simplistic workload
  - results:
     rq->lock contention count reduced by ~ 75%
     rq->lock contention wait time reduced by ~ 70%
     test duration reduction is in the noise
     rq->lock contention improvement is slightly better with just patch 1
       applied, but the difference is in the noise

Todo
  - handle cpu being offlined


-Frank

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 0/5] Reduce runqueue lock contention -v2
  2010-12-16 19:39     ` Frank Rowand
@ 2010-12-16 19:42       ` Peter Zijlstra
  2010-12-16 20:45         ` Frank Rowand
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2010-12-16 19:42 UTC (permalink / raw)
  To: frank.rowand
  Cc: frank.rowand, Chris Mason, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	linux-kernel

On Thu, 2010-12-16 at 11:39 -0800, Frank Rowand wrote:
>   select_task_rq() uses the smp_processor_id() of the old task_cpu(p) instead
>   of the waking smp_processor_id().

So first you bounce the wakeup to the previous task cpu, when you do
select_task_rq() and then you bounce it to the new target?



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 5/5] sched: Reduce ttwu rq->lock contention
  2010-12-16 19:03       ` Peter Zijlstra
@ 2010-12-16 19:47         ` Peter Zijlstra
  2010-12-16 20:32           ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2010-12-16 19:47 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Paul Turner, Jens Axboe, linux-kernel

On Thu, 2010-12-16 at 20:03 +0100, Peter Zijlstra wrote:
> On Thu, 2010-12-16 at 19:58 +0100, Peter Zijlstra wrote:
> > > > +    * Since we've already set TASK_WAKING this task's CPU cannot
> > > > +    * change from under us.
> > > 
> > > I think it can. Yes, we've set TASK_WAKING. But, at least the task
> > > itself can change its state back to TASK_RUNNING without calling
> > > schedule. Say, __wait_event()-like code.
> > 
> > Oh crud, you're right, that's going to make all this cmpxchg stuff lots
> > more interesting :/ 
> 
> Hrmph, we can add a task_is_waking() test to the rq->lock in schedule(),
> like we have for __task_rq_lock():
> 
>   local_irq_save(flags);
> again:
>   while (task_is_waking(current))
> 	cpu_relax();
>   raw_spin_lock(&rq->lock);
>   if (task_is_waking(current)) {
> 	raw_spin_unlock(&rq->lock);
> 	goto again;
>   }
> 
> 
> But that's not particularly pretty...

OK, I'm just not thinking straight here, if we're not passing through
schedule() the above clearly won't help any...

Damn, that's a nasty case to solve...

  current->state = TASK_UNINTERRUPTIBLE;
  while (!true)
    schedule();
  current->state = TASK_RUNNING;

... /me goes make a strong pot of tea

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 5/5] sched: Reduce ttwu rq->lock contention
  2010-12-16 19:47         ` Peter Zijlstra
@ 2010-12-16 20:32           ` Peter Zijlstra
  2010-12-17  3:06             ` Yan, Zheng
  2010-12-17 16:54             ` Oleg Nesterov
  0 siblings, 2 replies; 44+ messages in thread
From: Peter Zijlstra @ 2010-12-16 20:32 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Paul Turner, Jens Axboe, linux-kernel

On Thu, 2010-12-16 at 20:47 +0100, Peter Zijlstra wrote:

> ... /me goes make a strong pot of tea


Just as the tea was steeping my brain sparked (could be insanity,
dunno), anyway, how about the below?

It does the state and on_rq checks first; if we find on_rq, we do a
forced remote wakeup (take rq->lock, recheck state, recheck on_rq, set
p->state to TASK_RUNNING).  Otherwise we do the cmpxchg thing to
TASK_WAKING (since !on_rq means @p passed through schedule()) and proceed
like before.
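
Condensed, with the load accounting, the p->oncpu spin, the task_waking()
callback and the stats left out, the resulting try_to_wake_up() is roughly
(the real thing is in the patch below):

	static int
	try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
	{
		unsigned long flags;
		int cpu, ret = 0;

		smp_mb();
		if (!(p->state & state))
			return 0;

		/* still on a runqueue: fall back to a locked remote wakeup */
		if (p->se.on_rq && ttwu_force(p, state, wake_flags))
			return 1;

		/*
		 * !on_rq means @p passed through schedule(); serialize
		 * concurrent wakeups by cmpxchg'ing p->state to TASK_WAKING.
		 */
		local_irq_save(flags);
		for (;;) {
			unsigned int task_state = p->state;

			if (!(task_state & state))
				goto out;

			if (cmpxchg(&p->state, task_state, TASK_WAKING) == task_state)
				break;
		}
		ret = 1;

		/* pick a cpu and queue the wakeup there (wake_list + IPI) */
		cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
		ttwu_queue(p, cpu);
	out:
		local_irq_restore(flags);
		return ret;
	}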



---
Subject: sched: Reduce ttwu rq->lock contention
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue, 22 Jun 2010 15:33:06 +0200

Reduce rq->lock contention on try_to_wake_up() by changing the task
state using a cmpxchg loop.

Once the task is set to TASK_WAKING we're guaranteed the only one
poking at it, then proceed to pick a new cpu without holding the
rq->lock (XXX this opens some races).

Then instead of locking the remote rq and activating the task, place
the task on a remote queue, again using cmpxchg, and notify the remote
cpu per IPI if this queue was empty to start processing its wakeups.

This avoids (in most cases) having to lock the remote runqueue (and
therefore the exclusive cacheline transfer thereof) but also touching
all the remote runqueue data structures needed for the actual
activation.

As measured using: http://oss.oracle.com/~mason/sembench.c

$ echo 4096 32000 64 128 > /proc/sys/kernel/sem
$ ./sembench -t 2048 -w 1900 -o 0

unpatched: run time 30 seconds 537953 worker burns per second
patched:   run time 30 seconds 657336 worker burns per second

Still need to sort out all the races marked XXX (non-trivial), and its
x86 only for the moment.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1277213586.1875.704.camel@laptop>
---
 arch/x86/kernel/smp.c   |    1 
 include/linux/sched.h   |    7 -
 kernel/sched.c          |  226 +++++++++++++++++++++++++++++++++---------------
 kernel/sched_fair.c     |    5 -
 kernel/sched_features.h |    2 
 kernel/sched_idletask.c |    2 
 kernel/sched_rt.c       |    4 
 kernel/sched_stoptask.c |    3 
 8 files changed, 175 insertions(+), 75 deletions(-)

Index: linux-2.6/arch/x86/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smp.c
+++ linux-2.6/arch/x86/kernel/smp.c
@@ -205,6 +205,7 @@ void smp_reschedule_interrupt(struct pt_
 	/*
 	 * KVM uses this interrupt to force a cpu out of guest mode
 	 */
+	sched_ttwu_pending();
 }
 
 void smp_call_function_interrupt(struct pt_regs *regs)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1020,6 +1020,7 @@ partition_sched_domains(int ndoms_new, c
 }
 #endif	/* !CONFIG_SMP */
 
+void sched_ttwu_pending(void);
 
 struct io_context;			/* See blkdev.h */
 
@@ -1063,12 +1064,11 @@ struct sched_class {
 	void (*put_prev_task) (struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
-	int  (*select_task_rq)(struct rq *rq, struct task_struct *p,
-			       int sd_flag, int flags);
+	int  (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);
 
 	void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
 	void (*post_schedule) (struct rq *this_rq);
-	void (*task_waking) (struct rq *this_rq, struct task_struct *task);
+	void (*task_waking) (struct task_struct *task);
 	void (*task_woken) (struct rq *this_rq, struct task_struct *task);
 
 	void (*set_cpus_allowed)(struct task_struct *p,
@@ -1198,6 +1198,7 @@ struct task_struct {
 	int lock_depth;		/* BKL lock depth */
 
 #ifdef CONFIG_SMP
+	struct task_struct *wake_entry;
 	int oncpu;
 #endif
 
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -520,6 +520,8 @@ struct rq {
 	u64 age_stamp;
 	u64 idle_stamp;
 	u64 avg_idle;
+
+	struct task_struct *wake_list;
 #endif
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
@@ -953,7 +955,7 @@ static inline struct rq *__task_rq_lock(
 	for (;;) {
 		rq = task_rq(p);
 		raw_spin_lock(&rq->lock);
-		if (likely(rq == task_rq(p)))
+		if (likely(rq == task_rq(p)) && !task_is_waking(p))
 			return rq;
 		raw_spin_unlock(&rq->lock);
 	}
@@ -973,7 +975,7 @@ static struct rq *task_rq_lock(struct ta
 		local_irq_save(*flags);
 		rq = task_rq(p);
 		raw_spin_lock(&rq->lock);
-		if (likely(rq == task_rq(p)))
+		if (likely(rq == task_rq(p)) && !task_is_waking(p))
 			return rq;
 		raw_spin_unlock_irqrestore(&rq->lock, *flags);
 	}
@@ -2282,9 +2284,9 @@ static int select_fallback_rq(int cpu, s
  * The caller (fork, wakeup) owns TASK_WAKING, ->cpus_allowed is stable.
  */
 static inline
-int select_task_rq(struct rq *rq, struct task_struct *p, int sd_flags, int wake_flags)
+int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
 {
-	int cpu = p->sched_class->select_task_rq(rq, p, sd_flags, wake_flags);
+	int cpu = p->sched_class->select_task_rq(p, sd_flags, wake_flags);
 
 	/*
 	 * In order not to call set_task_cpu() on a blocking task we need
@@ -2310,10 +2312,10 @@ static void update_avg(u64 *avg, u64 sam
 }
 #endif
 
-static void
-ttwu_stat(struct rq *rq, struct task_struct *p, int cpu, int wake_flags)
+static void ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
 {
 #ifdef CONFIG_SCHEDSTATS
+	struct rq *rq = this_rq();
 
 	schedstat_inc(rq, ttwu_count);
 	schedstat_inc(p, se.statistics.nr_wakeups);
@@ -2341,8 +2343,11 @@ ttwu_stat(struct rq *rq, struct task_str
 #endif /* CONFIG_SCHEDSTATS */
 }
 
+/*
+ * Mark the task runnable and perform wakeup-preemption.
+ */
 static void
-ttwu_post_activation(struct task_struct *p, struct rq *rq, int wake_flags)
+ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
 {
 	trace_sched_wakeup(p);
 	check_preempt_curr(rq, p, wake_flags);
@@ -2368,6 +2373,103 @@ ttwu_post_activation(struct task_struct 
 		wq_worker_waking_up(p, cpu_of(rq));
 }
 
+static void ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags)
+{
+	if (task_cpu(p) != cpu_of(rq))
+		set_task_cpu(p, cpu_of(rq));
+
+	activate_task(rq, p, ENQUEUE_WAKEUP
+#ifdef CONFIG_SMP
+			     | ENQUEUE_WAKING
+#endif
+			);
+	ttwu_do_wakeup(rq, p, wake_flags);
+}
+
+/*
+ * Called in case the task @p isn't fully descheduled from its runqueue,
+ * in this case we must do a full remote wakeup.
+ */
+static int ttwu_force(struct task_struct *p, int state, int wake_flags)
+{
+	unsigned long flags;
+	struct rq *rq;
+	int ret = 0;
+
+	rq = task_rq_lock(p, &flags);
+
+	if (!(p->state & state))
+		goto out;
+
+	if (p->se.on_rq) {
+		ttwu_do_wakeup(rq, p, wake_flags);
+		ttwu_stat(p, task_cpu(p), wake_flags);
+		ret = 1;
+	}
+
+out:
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
+
+	return ret;
+}
+
+void sched_ttwu_pending(void)
+{
+#ifdef CONFIG_SMP
+	struct rq *rq = this_rq();
+	struct task_struct *list = xchg(&rq->wake_list, NULL);
+
+	if (!list)
+		return;
+
+	raw_spin_lock(&rq->lock);
+
+	while (list) {
+		struct task_struct *p = list;
+		list = list->wake_entry;
+		ttwu_do_activate(rq, p, 0);
+	}
+
+	raw_spin_unlock(&rq->lock);
+#endif
+}
+
+#ifdef CONFIG_SMP
+static void ttwu_queue_remote(struct task_struct *p, int cpu)
+{
+	struct task_struct *next = NULL;
+	struct rq *rq = cpu_rq(cpu);
+
+	for (;;) {
+		struct task_struct *old = next;
+
+		p->wake_entry = next;
+		next = cmpxchg(&rq->wake_list, old, p);
+		if (next == old)
+			break;
+	}
+
+	if (!next)
+		smp_send_reschedule(cpu);
+}
+#endif
+
+static void ttwu_queue(struct task_struct *p, int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+#ifdef CONFIG_SMP
+	if (!sched_feat(TTWU_FORCE) && cpu != smp_processor_id()) {
+		ttwu_queue_remote(p, cpu);
+		return;
+	}
+#endif
+
+	raw_spin_lock(&rq->lock);
+	ttwu_do_activate(rq, p, 0);
+	raw_spin_unlock(&rq->lock);
+}
+
 /**
  * try_to_wake_up - wake up a thread
  * @p: the thread to be awakened
@@ -2383,79 +2485,69 @@ ttwu_post_activation(struct task_struct 
  * Returns %true if @p was woken up, %false if it was already running
  * or @state didn't match @p's state.
  */
-static int try_to_wake_up(struct task_struct *p, unsigned int state,
-			  int wake_flags)
+static int
+try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 {
-	int cpu, orig_cpu, this_cpu, success = 0;
+	int cpu, load, ret = 0;
 	unsigned long flags;
-	unsigned long en_flags = ENQUEUE_WAKEUP;
-	struct rq *rq;
 
-	this_cpu = get_cpu();
+	smp_mb();
 
-	smp_wmb();
-	rq = task_rq_lock(p, &flags);
 	if (!(p->state & state))
-		goto out;
-
-	cpu = task_cpu(p);
-
-	if (p->se.on_rq)
-		goto out_running;
+		return 0;
 
-	orig_cpu = cpu;
-#ifdef CONFIG_SMP
-	if (unlikely(task_running(rq, p)))
-		goto out_activate;
+	if (p->se.on_rq && ttwu_force(p, state, wake_flags))
+		return 1;
 
 	/*
-	 * In order to handle concurrent wakeups and release the rq->lock
-	 * we put the task in TASK_WAKING state.
-	 *
-	 * First fix up the nr_uninterruptible count:
+	 * Since !on_rq, the task @p must have passed through schedule,
+	 * so now cmpxchg the state to TASK_WAKING to serialize concurrent
+	 * wakeups.
 	 */
-	if (task_contributes_to_load(p)) {
-		if (likely(cpu_online(orig_cpu)))
-			rq->nr_uninterruptible--;
-		else
-			this_rq()->nr_uninterruptible--;
-	}
-	p->state = TASK_WAKING;
 
-	if (p->sched_class->task_waking) {
-		p->sched_class->task_waking(rq, p);
-		en_flags |= ENQUEUE_WAKING;
+	local_irq_save(flags);
+	for (;;) {
+		unsigned int task_state = p->state;
+
+		if (!(task_state & state))
+			goto out;
+
+		load = task_contributes_to_load(p);
+
+		if (cmpxchg(&p->state, task_state, TASK_WAKING) == task_state)
+			break;
 	}
 
-	cpu = select_task_rq(rq, p, SD_BALANCE_WAKE, wake_flags);
-	if (cpu != orig_cpu)
-		set_task_cpu(p, cpu);
-	__task_rq_unlock(rq);
+	ret = 1; /* we qualify as a proper wakeup now */
 
-	rq = cpu_rq(cpu);
-	raw_spin_lock(&rq->lock);
+	if (load) // XXX racy
+		this_rq()->nr_uninterruptible--;
+
+	cpu = task_cpu(p);
 
+#ifdef CONFIG_SMP
 	/*
-	 * We migrated the task without holding either rq->lock, however
-	 * since the task is not on the task list itself, nobody else
-	 * will try and migrate the task, hence the rq should match the
-	 * cpu we just moved it to.
+	 * Catch the case where schedule() has done the dequeue but hasn't yet
+	 * scheduled to a new task, in that case p is still being referenced
+	 * by that cpu so we cannot wake it to any other cpu.
 	 */
-	WARN_ON(task_cpu(p) != cpu);
-	WARN_ON(p->state != TASK_WAKING);
+	while (p->oncpu)
+		cpu_relax();
 
-out_activate:
-#endif /* CONFIG_SMP */
-	activate_task(rq, p, en_flags);
-out_running:
-	ttwu_post_activation(p, rq, wake_flags);
-	ttwu_stat(rq, p, cpu, wake_flags);
-	success = 1;
+	if (p->sched_class->task_waking)
+		p->sched_class->task_waking(p);
+	/*
+	 * XXX: by having set TASK_WAKING outside of rq->lock, there
+	 * could be an in-flight change to p->cpus_allowed..
+	 */
+	cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
+#endif
+	ttwu_queue(p, cpu);
+	ttwu_stat(p, cpu, wake_flags);
 out:
-	task_rq_unlock(rq, &flags);
-	put_cpu();
+	local_irq_restore(flags);
 
-	return success;
+	return ret;
 }
 
 /**
@@ -2480,8 +2572,8 @@ static void try_to_wake_up_local(struct 
 	if (!p->se.on_rq)
 		activate_task(rq, p, ENQUEUE_WAKEUP);
 
-	ttwu_post_activation(p, rq, 0);
-	ttwu_stat(rq, p, smp_processor_id(), 0);
+	ttwu_stat(p, smp_processor_id(), 0);
+	ttwu_do_wakeup(rq, p, 0);
 }
 
 /**
@@ -2634,7 +2726,7 @@ void wake_up_new_task(struct task_struct
 	 * We set TASK_WAKING so that select_task_rq() can drop rq->lock
 	 * without people poking at ->cpus_allowed.
 	 */
-	cpu = select_task_rq(rq, p, SD_BALANCE_FORK, 0);
+	cpu = select_task_rq(p, SD_BALANCE_FORK, 0);
 	set_task_cpu(p, cpu);
 
 	p->state = TASK_RUNNING;
@@ -3360,7 +3452,7 @@ void sched_exec(void)
 	int dest_cpu;
 
 	rq = task_rq_lock(p, &flags);
-	dest_cpu = p->sched_class->select_task_rq(rq, p, SD_BALANCE_EXEC, 0);
+	dest_cpu = p->sched_class->select_task_rq(p, SD_BALANCE_EXEC, 0);
 	if (dest_cpu == smp_processor_id())
 		goto unlock;
 
@@ -3930,8 +4022,10 @@ asmlinkage void __sched schedule(void)
 		switch_count = &prev->nvcsw;
 	}
 
+	/*
+	 * Both pre_schedule() and idle_balance() can release rq->lock.
+	 */
 	pre_schedule(rq, prev);
-
 	if (unlikely(!rq->nr_running))
 		idle_balance(cpu, rq);
 
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1321,11 +1321,12 @@ static void yield_task_fair(struct rq *r
 
 #ifdef CONFIG_SMP
 
-static void task_waking_fair(struct rq *rq, struct task_struct *p)
+static void task_waking_fair(struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
+	// XXX racy again,.. we need to read cfs_rq->min_vruntime atomically
 	se->vruntime -= cfs_rq->min_vruntime;
 }
 
@@ -1606,7 +1607,7 @@ static int select_idle_sibling(struct ta
  * preempt must be disabled.
  */
 static int
-select_task_rq_fair(struct rq *rq, struct task_struct *p, int sd_flag, int wake_flags)
+select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 {
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
 	int cpu = smp_processor_id();
Index: linux-2.6/kernel/sched_idletask.c
===================================================================
--- linux-2.6.orig/kernel/sched_idletask.c
+++ linux-2.6/kernel/sched_idletask.c
@@ -7,7 +7,7 @@
 
 #ifdef CONFIG_SMP
 static int
-select_task_rq_idle(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
+select_task_rq_idle(struct task_struct *p, int sd_flag, int flags)
 {
 	return task_cpu(p); /* IDLE tasks as never migrated */
 }
Index: linux-2.6/kernel/sched_rt.c
===================================================================
--- linux-2.6.orig/kernel/sched_rt.c
+++ linux-2.6/kernel/sched_rt.c
@@ -973,8 +973,10 @@ static void yield_task_rt(struct rq *rq)
 static int find_lowest_rq(struct task_struct *task);
 
 static int
-select_task_rq_rt(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
+select_task_rq_rt(struct task_struct *p, int sd_flag, int flags)
 {
+	struct rq *rq = task_rq(p);
+
 	if (sd_flag != SD_BALANCE_WAKE)
 		return smp_processor_id();
 
Index: linux-2.6/kernel/sched_stoptask.c
===================================================================
--- linux-2.6.orig/kernel/sched_stoptask.c
+++ linux-2.6/kernel/sched_stoptask.c
@@ -9,8 +9,7 @@
 
 #ifdef CONFIG_SMP
 static int
-select_task_rq_stop(struct rq *rq, struct task_struct *p,
-		    int sd_flag, int flags)
+select_task_rq_stop(struct task_struct *p, int sd_flag, int flags)
 {
 	return task_cpu(p); /* stop tasks as never migrate */
 }
Index: linux-2.6/kernel/sched_features.h
===================================================================
--- linux-2.6.orig/kernel/sched_features.h
+++ linux-2.6/kernel/sched_features.h
@@ -64,3 +64,5 @@ SCHED_FEAT(OWNER_SPIN, 1)
  * Decrement CPU power based on irq activity
  */
 SCHED_FEAT(NONIRQ_POWER, 1)
+
+SCHED_FEAT(TTWU_FORCE, 0)


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 0/5] Reduce runqueue lock contention -v2
  2010-12-16 19:42       ` Peter Zijlstra
@ 2010-12-16 20:45         ` Frank Rowand
  0 siblings, 0 replies; 44+ messages in thread
From: Frank Rowand @ 2010-12-16 20:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rowand, Frank, frank.rowand, Chris Mason, Ingo Molnar,
	Thomas Gleixner, Mike Galbraith, Oleg Nesterov, Paul Turner,
	Jens Axboe, linux-kernel

On 12/16/10 11:42, Peter Zijlstra wrote:
> On Thu, 2010-12-16 at 11:39 -0800, Frank Rowand wrote:
>>   select_task_rq() uses the smp_processor_id() of the old task_cpu(p) instead
>>   of the waking smp_processor_id().
> 
> So first you bounce the wakeup to the previous task cpu, when you do
> select_task_rq() and then you bounce it to the new target?

Yes.  (This is the answer when only the first patch is applied.)

The wakeup is bounced to the previous task cpu.  The bounce adds extra
overhead (including an IRQ on the previous task cpu) in return for a
reduction in the number of rq->lock contentions.

Then the try_to_wake_up() on the previous task cpu may wake the task
on a different cpu, but the bias should be the cpu that try_to_wake_up()
is now running on.

The second patch in my series instead feeds the waking cpu number to
select_task_rq(), so the bias will be to wake the task on the waking
cpu.

-Frank


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 5/5] sched: Reduce ttwu rq->lock contention
  2010-12-16 20:32           ` Peter Zijlstra
@ 2010-12-17  3:06             ` Yan, Zheng
  2010-12-17 13:23               ` Peter Zijlstra
  2010-12-17 16:54             ` Oleg Nesterov
  1 sibling, 1 reply; 44+ messages in thread
From: Yan, Zheng @ 2010-12-17  3:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Chris Mason, Frank Rowand, Ingo Molnar,
	Thomas Gleixner, Mike Galbraith, Paul Turner, Jens Axboe,
	linux-kernel

On Fri, Dec 17, 2010 at 4:32 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> @@ -953,7 +955,7 @@ static inline struct rq *__task_rq_lock(
>        for (;;) {
>                rq = task_rq(p);
>                raw_spin_lock(&rq->lock);
> -               if (likely(rq == task_rq(p)))
> +               if (likely(rq == task_rq(p)) && !task_is_waking(p))
>                        return rq;
>                raw_spin_unlock(&rq->lock);
>        }
> @@ -973,7 +975,7 @@ static struct rq *task_rq_lock(struct ta
>                local_irq_save(*flags);
>                rq = task_rq(p);
>                raw_spin_lock(&rq->lock);
> -               if (likely(rq == task_rq(p)))
> +               if (likely(rq == task_rq(p)) && !task_is_waking(p))
>                        return rq;
>                raw_spin_unlock_irqrestore(&rq->lock, *flags);
>        }

Looks like nothing prevents ttwu() from changing a task's CPU while
someone else is holding task_rq_lock(). Is this OK?

Thanks
Yan, Zheng

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 5/5] sched: Reduce ttwu rq->lock contention
  2010-12-17  3:06             ` Yan, Zheng
@ 2010-12-17 13:23               ` Peter Zijlstra
  0 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2010-12-17 13:23 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: Oleg Nesterov, Chris Mason, Frank Rowand, Ingo Molnar,
	Thomas Gleixner, Mike Galbraith, Paul Turner, Jens Axboe,
	linux-kernel

On Fri, 2010-12-17 at 11:06 +0800, Yan, Zheng wrote:
> On Fri, Dec 17, 2010 at 4:32 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > @@ -953,7 +955,7 @@ static inline struct rq *__task_rq_lock(
> >        for (;;) {
> >                rq = task_rq(p);
> >                raw_spin_lock(&rq->lock);
> > -               if (likely(rq == task_rq(p)))
> > +               if (likely(rq == task_rq(p)) && !task_is_waking(p))
> >                        return rq;
> >                raw_spin_unlock(&rq->lock);
> >        }
> > @@ -973,7 +975,7 @@ static struct rq *task_rq_lock(struct ta
> >                local_irq_save(*flags);
> >                rq = task_rq(p);
> >                raw_spin_lock(&rq->lock);
> > -               if (likely(rq == task_rq(p)))
> > +               if (likely(rq == task_rq(p)) && !task_is_waking(p))
> >                        return rq;
> >                raw_spin_unlock_irqrestore(&rq->lock, *flags);
> >        }
> 
> Looks like nothing prevents ttwu() from changing a task's CPU while
> someone else is holding task_rq_lock(). Is this OK?

Ah, crud, good catch. No, that is not quite OK ;-)

I'm starting to think adding a per-task scheduler lock isn't such a bad
idea after all :-)

How does something like the below look? It waits for the current
task_rq(p)->lock owner to go away after we flip p->state to TASK_WAKING.

It also optimizes the x86 spinlock code a bit: no need to wait for all
pending owners to go away, just the current one.

This also solves the p->cpus_allowed race..

---
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2518,6 +2518,8 @@ try_to_wake_up(struct task_struct *p, un
 			break;
 	}
 
+	raw_spin_unlock_wait(&task_rq(p)->lock);
+
 	ret = 1; /* we qualify as a proper wakeup now */
 
 	if (load) // XXX racy
@@ -2536,10 +2538,7 @@ try_to_wake_up(struct task_struct *p, un
 
 	if (p->sched_class->task_waking)
 		p->sched_class->task_waking(p);
-	/*
-	 * XXX: by having set TASK_WAKING outside of rq->lock, there
-	 * could be an in-flight change to p->cpus_allowed..
-	 */
+
 	cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
 #endif
 	ttwu_queue(p, cpu);
Index: linux-2.6/arch/x86/include/asm/spinlock.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/spinlock.h
+++ linux-2.6/arch/x86/include/asm/spinlock.h
@@ -158,18 +158,34 @@ static __always_inline void __ticket_spi
 }
 #endif
 
+#define TICKET_MASK ((1 << TICKET_SHIFT) - 1)
+
 static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
 {
 	int tmp = ACCESS_ONCE(lock->slock);
 
-	return !!(((tmp >> TICKET_SHIFT) ^ tmp) & ((1 << TICKET_SHIFT) - 1));
+	return !!(((tmp >> TICKET_SHIFT) ^ tmp) & TICKET_MASK);
 }
 
 static inline int __ticket_spin_is_contended(arch_spinlock_t *lock)
 {
 	int tmp = ACCESS_ONCE(lock->slock);
 
-	return (((tmp >> TICKET_SHIFT) - tmp) & ((1 << TICKET_SHIFT) - 1)) > 1;
+	return (((tmp >> TICKET_SHIFT) - tmp) & TICKET_MASK) > 1;
+}
+
+static inline void __ticket_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	int tmp = ACCESS_ONCE(lock->slock);
+
+	if (!(((tmp >> TICKET_SHIFT) ^ tmp) & TICKET_MASK))
+		return; /* not locked */
+
+	tmp &= TICKET_MASK;
+
+	/* wait until the current lock holder goes away */
+	while ((ACCESS_ONCE(lock->slock) & TICKET_MASK) == tmp)
+		cpu_relax();
 }
 
 #ifndef CONFIG_PARAVIRT_SPINLOCKS
@@ -206,7 +222,11 @@ static __always_inline void arch_spin_lo
 	arch_spin_lock(lock);
 }
 
-#endif	/* CONFIG_PARAVIRT_SPINLOCKS */
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	__ticket_spin_unlock_wait(lock);
+}
+#else
 
 static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
@@ -214,6 +234,8 @@ static inline void arch_spin_unlock_wait
 		cpu_relax();
 }
 
+#endif	/* CONFIG_PARAVIRT_SPINLOCKS */
+
 /*
  * Read-write spinlocks, allowing multiple readers
  * but only one writer.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 5/5] sched: Reduce ttwu rq->lock contention
  2010-12-16 20:32           ` Peter Zijlstra
  2010-12-17  3:06             ` Yan, Zheng
@ 2010-12-17 16:54             ` Oleg Nesterov
  2010-12-17 17:43               ` Peter Zijlstra
  2010-12-17 17:50               ` Oleg Nesterov
  1 sibling, 2 replies; 44+ messages in thread
From: Oleg Nesterov @ 2010-12-17 16:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Paul Turner, Jens Axboe, linux-kernel

On 12/16, Peter Zijlstra wrote:
>
> It does the state and on_rq checks first, if we find on_rq,

The problem is, somehow we should check both on_rq and state
at the same time,

> +try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  {
> -	int cpu, orig_cpu, this_cpu, success = 0;
> +	int cpu, load, ret = 0;
>  	unsigned long flags;
> -	unsigned long en_flags = ENQUEUE_WAKEUP;
> -	struct rq *rq;
>
> -	this_cpu = get_cpu();
> +	smp_mb();

Yes, we need the full mb(). Without a subsequent spin_lock(), wmb()
can't act as a smp_store_load_barrier() (which we don't have).

> +	if (p->se.on_rq && ttwu_force(p, state, wake_flags))
> +		return 1;

	----- WINDOW -----

> +	for (;;) {
> +		unsigned int task_state = p->state;
> +
> +		if (!(task_state & state))
> +			goto out;
> +
> +		load = task_contributes_to_load(p);
> +
> +		if (cmpxchg(&p->state, task_state, TASK_WAKING) == task_state)
> +			break;

Suppose that we have a task T sleeping in TASK_INTERRUPTIBLE state,
and this cpu does try_to_wake_up(TASK_INTERRUPTIBLE). on_rq == false.
try_to_wake_up() starts the "for (;;)" loop.

However, in the WINDOW above, it is possible that somebody else wakes
it up, and then this task changes its state to TASK_INTERRUPTIBLE again.

Then we set ->state = TASK_WAKING, but this (still running) T restores
TASK_RUNNING after us.
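
Spelled out as a timeline (a sketch, assuming T sits in the usual
prepare-to-sleep loop):

        waker                                   T (asleep, TASK_INTERRUPTIBLE)
        -----                                   ------------------------------
        sees p->se.on_rq == 0, so the
        ttwu_force() path is not taken
          <--- WINDOW --->                      woken by someone else, runs, and
                                                prepares to sleep again:
                                                set_current_state(TASK_INTERRUPTIBLE);
        cmpxchg(&p->state, TASK_INTERRUPTIBLE,
                TASK_WAKING) "succeeds"
                                                its condition turns out true, so:
                                                __set_current_state(TASK_RUNNING);
                                                /* silently overwrites TASK_WAKING */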

Oleg.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 5/5] sched: Reduce ttwu rq->lock contention
  2010-12-17 16:54             ` Oleg Nesterov
@ 2010-12-17 17:43               ` Peter Zijlstra
  2010-12-17 18:15                 ` Peter Zijlstra
  2010-12-17 18:21                 ` Oleg Nesterov
  2010-12-17 17:50               ` Oleg Nesterov
  1 sibling, 2 replies; 44+ messages in thread
From: Peter Zijlstra @ 2010-12-17 17:43 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Paul Turner, Jens Axboe, linux-kernel

On Fri, 2010-12-17 at 17:54 +0100, Oleg Nesterov wrote:
> On 12/16, Peter Zijlstra wrote:
> >
> > It does the state and on_rq checks first, if we find on_rq,
> 
> The problem is, somehow we should check both on_rq and state
> at the same time,
> 
> > +try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> >  {
> > -	int cpu, orig_cpu, this_cpu, success = 0;
> > +	int cpu, load, ret = 0;
> >  	unsigned long flags;
> > -	unsigned long en_flags = ENQUEUE_WAKEUP;
> > -	struct rq *rq;
> >
> > -	this_cpu = get_cpu();
> > +	smp_mb();
> 
> Yes, we need the full mb(). Without a subsequent spin_lock(), wmb()
> can't act as a smp_store_load_barrier() (which we don't have).
> 
> > +	if (p->se.on_rq && ttwu_force(p, state, wake_flags))
> > +		return 1;
> 
> 	----- WINDOW -----
> 
> > +	for (;;) {
> > +		unsigned int task_state = p->state;
> > +
> > +		if (!(task_state & state))
> > +			goto out;
> > +
> > +		load = task_contributes_to_load(p);
> > +
> > +		if (cmpxchg(&p->state, task_state, TASK_WAKING) == task_state)
> > +			break;
> 
> Suppose that we have a task T sleeping in TASK_INTERRUPTIBLE state,
> and this cpu does try_to_wake_up(TASK_INTERRUPTIBLE). on_rq == false.
> try_to_wake_up() starts the "for (;;)" loop.
> 
> However, in the WINDOW above, it is possible that somebody else wakes
> it up, and then this task changes its state to TASK_INTERRUPTIBLE again.
> 
> Then we set ->state = TASK_WAKING, but this (still running) T restores
> TASK_RUNNING after us.

See, there's a reason I CC'ed you ;-)

Hrmph, so is it only about serializing concurrent wakeups? If so, we
could possibly hold p->pi_lock over the wakeup.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 5/5] sched: Reduce ttwu rq->lock contention
  2010-12-17 16:54             ` Oleg Nesterov
  2010-12-17 17:43               ` Peter Zijlstra
@ 2010-12-17 17:50               ` Oleg Nesterov
  2010-12-17 18:24                 ` Peter Zijlstra
  1 sibling, 1 reply; 44+ messages in thread
From: Oleg Nesterov @ 2010-12-17 17:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Paul Turner, Jens Axboe, linux-kernel

On 12/17, Oleg Nesterov wrote:
>
> On 12/16, Peter Zijlstra wrote:
> >
> > +	if (p->se.on_rq && ttwu_force(p, state, wake_flags))
> > +		return 1;
>
> 	----- WINDOW -----
>
> > +	for (;;) {
> > +		unsigned int task_state = p->state;
> > +
> > +		if (!(task_state & state))
> > +			goto out;
> > +
> > +		load = task_contributes_to_load(p);
> > +
> > +		if (cmpxchg(&p->state, task_state, TASK_WAKING) == task_state)
> > +			break;
>
> Suppose that we have a task T sleeping in TASK_INTERRUPTIBLE state,
> and this cpu does try_to_wake_up(TASK_INTERRUPTIBLE). on_rq == false.
> try_to_wake_up() starts the "for (;;)" loop.
>
> However, in the WINDOW above, it is possible that somebody else wakes
> it up, and then this task changes its state to TASK_INTERRUPTIBLE again.
>
> Then we set ->state = TASK_WAKING, but this (still running) T restores
> TASK_RUNNING after us.

Even simpler. This can race with, say, __migrate_task() which does
deactivate_task + activate_task and temporarily clears on_rq. Although
this is simple to fix, I think.



Also. Afaics, without rq->lock, we can't trust "while (p->oncpu)", at
least we need rmb() after that.

Interestingly, I can't really understand the current meaning of smp_wmb()
in finish_lock_switch(). Do you know what exactly it buys? In any case,
task_running() (or its callers) do not have the corresponding rmb().
Say, currently try_to_wake_up()->task_waking() can miss all changes
starting from prepare_lock_switch(). Hopefully this is OK, but I am
confused ;)

Oleg.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 5/5] sched: Reduce ttwu rq->lock contention
  2010-12-17 17:43               ` Peter Zijlstra
@ 2010-12-17 18:15                 ` Peter Zijlstra
  2010-12-17 19:28                   ` Oleg Nesterov
  2010-12-18 14:49                   ` Yong Zhang
  2010-12-17 18:21                 ` Oleg Nesterov
  1 sibling, 2 replies; 44+ messages in thread
From: Peter Zijlstra @ 2010-12-17 18:15 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Paul Turner, Jens Axboe, linux-kernel

On Fri, 2010-12-17 at 18:43 +0100, Peter Zijlstra wrote:
> 
> Hrmph, so is it only about serializing concurrent wakeups? If so, we
> could possibly hold p->pi_lock over the wakeup. 

Something like the below.. except it still suffers from the
__migrate_task() hole you identified in your other email.

By fully serializing all wakeups using ->pi_lock it becomes a lot
simpler (although I just realized we might have a problem with
try_to_wake_up_local).

static int
try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
        unsigned long flags;
        int cpu, ret = 0;

        smp_wmb();
        raw_spin_lock_irqsave(&p->pi_lock, flags);

        if (!(p->state & state))
                goto unlock;

        ret = 1; /* we qualify as a proper wakeup now */

        if (p->se.on_rq && ttwu_force(p, state, wake_flags))
                goto unlock;

        p->sched_contributes_to_load = !!task_contributes_to_load(p);

        /*
         * In order to serialize against other tasks wanting to task_rq_lock()
         * we need to wait until the current task_rq(p)->lock holder goes away,
         * so that the next might observe TASK_WAKING.
         */
        p->state = TASK_WAKING;
        smp_wmb();
        raw_spin_unlock_wait(&task_rq(p)->lock);

        /*
         * Stable, now that TASK_WAKING is visible.
         */
        cpu = task_cpu(p);

#ifdef CONFIG_SMP
        /*
         * Catch the case where schedule() has done the dequeue but hasn't yet
         * scheduled to a new task, in that case p is still being referenced
         * by that cpu so we cannot wake it to any other cpu.
         *
         * Here we must either do a full remote enqueue, or simply wait for
         * the remote cpu to finish the schedule(), the latter was found to
         * be cheapest.
         */
        while (p->oncpu)
                cpu_relax();

        if (p->sched_class->task_waking)
                p->sched_class->task_waking(p);

        cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
#endif
        ttwu_queue(p, cpu);
        ttwu_stat(p, cpu, wake_flags);
unlock:
        raw_spin_unlock_irqrestore(&p->pi_lock, flags);

        return ret;
}


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 5/5] sched: Reduce ttwu rq->lock contention
  2010-12-17 17:43               ` Peter Zijlstra
  2010-12-17 18:15                 ` Peter Zijlstra
@ 2010-12-17 18:21                 ` Oleg Nesterov
  1 sibling, 0 replies; 44+ messages in thread
From: Oleg Nesterov @ 2010-12-17 18:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Paul Turner, Jens Axboe, linux-kernel

On 12/17, Peter Zijlstra wrote:
>
> Hrmph, so is it only about serializing concurrent wakeups?

I __think__ that it is possible to solve other problems, but
of course I am not sure ;)

> If so, we
> could possibly hold p->pi_lock over the wakeup.

Hmm, yes, this looks right.

Oleg.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 5/5] sched: Reduce ttwu rq->lock contention
  2010-12-17 17:50               ` Oleg Nesterov
@ 2010-12-17 18:24                 ` Peter Zijlstra
  2010-12-17 18:41                   ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2010-12-17 18:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Paul Turner, Jens Axboe, linux-kernel

On Fri, 2010-12-17 at 18:50 +0100, Oleg Nesterov wrote:
> On 12/17, Oleg Nesterov wrote:
> >
> > On 12/16, Peter Zijlstra wrote:
> > >
> > > +	if (p->se.on_rq && ttwu_force(p, state, wake_flags))
> > > +		return 1;
> >
> > 	----- WINDOW -----
> >
> > > +	for (;;) {
> > > +		unsigned int task_state = p->state;
> > > +
> > > +		if (!(task_state & state))
> > > +			goto out;
> > > +
> > > +		load = task_contributes_to_load(p);
> > > +
> > > +		if (cmpxchg(&p->state, task_state, TASK_WAKING) == task_state)
> > > +			break;
> >
> > Suppose that we have a task T sleeping in TASK_INTERRUPTIBLE state,
> > and this cpu does try_to_wake_up(TASK_INTERRUPTIBLE). on_rq == false.
> > try_to_wake_up() starts the "for (;;)" loop.
> >
> > However, in the WINDOW above, it is possible that somebody else wakes
> > it up, and then this task changes its state to TASK_INTERRUPTIBLE again.
> >
> > Then we set ->state = TASK_WAKING, but this (still running) T restores
> > TASK_RUNNING after us.
> 
> Even simpler. This can race with, say, __migrate_task() which does
> deactivate_task + activate_task and temporarily clears on_rq. Although
> this is simple to fix, I think.

Yes, another hole..

> Also. Afaics, without rq->lock, we can't trust "while (p->oncpu)", at
> least we need rmb() after that.

I think Linus once argued that loops like that should be fine without a
rmb(), at worst they'll have to spin a few more times to observe the
1->0 switch (we don't care about the 0->1 switch in this case because
that's ruled out by the ->state test).
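
I.e. (a sketch, with an explicit ACCESS_ONCE() added for clarity; the
posted code just reads p->oncpu directly):

        while (ACCESS_ONCE(p->oncpu))   /* a stale 1 only costs another pass;  */
                cpu_relax();            /* the 1->0 store shows up eventually  */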

> Interestingly, I can't really understand the current meaning of smp_wmb()
> in finish_lock_switch(). Do you know what exactly it buys? 

I _think_ it's meant to ensure the full context switch happened and we've
stored all changes to the rq structure (destroying all references to
prev); in particular, that we've finished writing the new value of current.

> In any case,
> task_running() (or its callers) do not have the corresponding rmb().
> Say, currently try_to_wake_up()->task_waking() can miss all changes
> starting from prepare_lock_switch(). Hopefully this is OK, but I am
> confused ;)

So I thought I saw how we are OK there, but then I got myself confused
too :-)

My argument was something along the lines of: there must be some
serialization between the task going to sleep and another task waking it
(the sleeper setting TASK_UNINTERRUPTIBLE and enqueueing itself on a
waitqueue, and the waker finding it on that waitqueue), and this should
be sufficient to make ->state visible to the waker.

If the waker observes a !TASK_RUNNING ->state, then by definition it
must see all the changes previous to it (including the ->oncpu 0->1
transition).
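
For reference, the classic pairing meant here (a sketch, not lifted from
any particular caller; wq and condition are placeholders):

        /* sleeper */
        DEFINE_WAIT(wait);

        prepare_to_wait(&wq, &wait, TASK_UNINTERRUPTIBLE);
        if (!condition)                 /* ->state was set under wq.lock */
                schedule();
        finish_wait(&wq, &wait);

        /* waker */
        condition = 1;
        wake_up(&wq);                   /* takes wq.lock; finding the sleeper
                                         * queued implies its ->state store is
                                         * visible here as well */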

But like I said, I've got my brain in a twist too.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 5/5] sched: Reduce ttwu rq->lock contention
  2010-12-17 18:24                 ` Peter Zijlstra
@ 2010-12-17 18:41                   ` Peter Zijlstra
  0 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2010-12-17 18:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Paul Turner, Jens Axboe, linux-kernel

On Fri, 2010-12-17 at 19:24 +0100, Peter Zijlstra wrote:
> > Even simpler. This can race with, say, __migrate_task() which does
> > deactivate_task + activate_task and temporarily clears on_rq. Although
> > this is simple to fix, I think.
> 
> Yes, another hole.. 

__migrate_task() is a tad special in that it should only migrate
runnable tasks, but of course the task can have been preempted with
->state != TASK_RUNNING, which makes life interesting; we can fix that
by including a PREEMPT_ACTIVE test in ttwu.

Other such callers are more straightforward though:

  rt_mutex_setprio()
  set_user_nice()
  sched_move_task()
  __sched_setscheduler()
  normalize_task()

none of those are fast-path operations, so it is quite doable to also
take the ->pi_lock there and fully serialize them.
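
E.g., a sketch of the shape those would take (using the existing helpers
in kernel/sched.c; not actual patch code):

        raw_spin_lock_irqsave(&p->pi_lock, flags);
        rq = __task_rq_lock(p);

        on_rq = p->se.on_rq;
        if (on_rq)
                dequeue_task(rq, p, 0);

        /* change prio/nice/policy/cgroup here */

        if (on_rq)
                enqueue_task(rq, p, 0);

        __task_rq_unlock(rq);
        raw_spin_unlock_irqrestore(&p->pi_lock, flags);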
  

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 2/5] mutex: Use p->oncpu for the adaptive spin
  2010-12-16 19:29     ` Peter Zijlstra
@ 2010-12-17 19:17       ` Oleg Nesterov
  0 siblings, 0 replies; 44+ messages in thread
From: Oleg Nesterov @ 2010-12-17 19:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Paul Turner, Jens Axboe, linux-kernel

On 12/16, Peter Zijlstra wrote:
>
> So yea, I think we can simplify all this even further, but we do need a
> smp_rmb() in between those two checks, hmm, and RCU.

I am not sure we need rmb() or even read_barrier_depends(); I think
a simple barrier() should be enough. Although rcu_dereference()
may be cleaner.

We don't really care about the "correct" result of *owner. All we need
is to ensure it is safe to dereference this address. If the CPU itself
does something like a "readahead" before we have verified "lock->owner == owner"
under the RCU lock, this shouldn't lead to a fault.
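
Something like (a sketch of the spin-on-owner loop being discussed,
assuming lock->owner points at the owning task as patch 2 arranges):

        struct task_struct *owner;

        rcu_read_lock();
        owner = ACCESS_ONCE(lock->owner);
        if (owner) {
                /*
                 * Per the above, the RCU read side only has to make the
                 * dereference safe; a stale ->owner or ->oncpu value just
                 * ends the spin early or costs an extra iteration.
                 */
                while (ACCESS_ONCE(lock->owner) == owner && owner->oncpu)
                        cpu_relax();
        }
        rcu_read_unlock();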

Oleg.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 5/5] sched: Reduce ttwu rq->lock contention
  2010-12-17 18:15                 ` Peter Zijlstra
@ 2010-12-17 19:28                   ` Oleg Nesterov
  2010-12-17 21:02                     ` Peter Zijlstra
  2010-12-18 14:49                   ` Yong Zhang
  1 sibling, 1 reply; 44+ messages in thread
From: Oleg Nesterov @ 2010-12-17 19:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Paul Turner, Jens Axboe, linux-kernel

On 12/17, Peter Zijlstra wrote:
>
> By fully serializing all wakeups using ->pi_lock it becomes a lot
> simpler

Hmm, yes. Contrary to my expectations ;)

> static int
> try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> {
>         unsigned long flags;
>         int cpu, ret = 0;
>
>         smp_wmb();
>         raw_spin_lock_irqsave(&p->pi_lock, flags);
>
>         if (!(p->state & state))
>                 goto unlock;
>
>         ret = 1; /* we qualify as a proper wakeup now */
>
>         if (p->se.on_rq && ttwu_force(p, state, wake_flags))
>                 goto unlock;

Well. All I can say is, I'll try to re-read this code with a fresh head ;)
We should ensure that on_rq == 0 cannot be racy.

>         p->state = TASK_WAKING;
>         smp_wmb();
>         raw_spin_unlock_wait(&task_rq(p)->lock);

This needs smp_mb(), unlock_wait() reads the memory.

Oleg.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 5/5] sched: Reduce ttwu rq->lock contention
  2010-12-17 19:28                   ` Oleg Nesterov
@ 2010-12-17 21:02                     ` Peter Zijlstra
  0 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2010-12-17 21:02 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Paul Turner, Jens Axboe, linux-kernel

On Fri, 2010-12-17 at 20:28 +0100, Oleg Nesterov wrote:
> Well. All I can say is, I'll try to re-read this code with a fresh head ;)

Right, good plan ;-)

I'll try and clean up my current mess and try to post a new series this
weekend (although it looks to be a busy weekend, with the holidays
coming up and all).


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 1/5] sched: Always provide p->oncpu
  2010-12-16 14:56 ` [RFC][PATCH 1/5] sched: Always provide p->oncpu Peter Zijlstra
@ 2010-12-18  1:03   ` Frank Rowand
  0 siblings, 0 replies; 44+ messages in thread
From: Frank Rowand @ 2010-12-18  1:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Rowand, Frank, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	linux-kernel

On 12/16/10 06:56, Peter Zijlstra wrote:
> Always provide p->oncpu so that we can determine if its on a cpu
> without having to lock the rq.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Looks good to me.

Acked-by: Frank Rowand <frank.rowand@am.sony.com>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 3/5] sched: Change the ttwu success details
  2010-12-16 14:56 ` [RFC][PATCH 3/5] sched: Change the ttwu success details Peter Zijlstra
  2010-12-16 15:23   ` Frederic Weisbecker
@ 2010-12-18  1:05   ` Frank Rowand
  1 sibling, 0 replies; 44+ messages in thread
From: Frank Rowand @ 2010-12-18  1:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Rowand, Frank, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	linux-kernel

On 12/16/10 06:56, Peter Zijlstra wrote:
> try_to_wake_up() would only return a success when it would have to
> place a task on a rq, change that to every time we change p->state to
> TASK_RUNNING, because that's the real measure of wakeups.
> 
> This results in that success is always true for the tracepoint, so
> remove its success argument.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Looks good to me, so far.  But I still have a few more places that call
try_to_wake_up() to validate.

Acked-by: Frank Rowand <frank.rowand@am.sony.com>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 4/5] sched: Clean up ttwu stats
  2010-12-16 14:56 ` [RFC][PATCH 4/5] sched: Clean up ttwu stats Peter Zijlstra
@ 2010-12-18  1:09   ` Frank Rowand
  0 siblings, 0 replies; 44+ messages in thread
From: Frank Rowand @ 2010-12-18  1:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Rowand, Frank, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	linux-kernel

On 12/16/10 06:56, Peter Zijlstra wrote:
> Collect all ttwu stat code into a single function and ensure its
> always called for an actual wakeup (changing p->state to
> TASK_RUNNING).
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  kernel/sched.c |   66 ++++++++++++++++++++++++++-------------------------------
>  1 file changed, 31 insertions(+), 35 deletions(-)

Looks good to me.  Nice cleanup...

A couple of nits inline if you want them.

Acked-by: Frank Rowand <frank.rowand@am.sony.com>

> 
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -2310,21 +2310,35 @@ static void update_avg(u64 *avg, u64 sam
>  }
>  #endif
>  
> -static inline void ttwu_activate(struct task_struct *p, struct rq *rq,
> -				 bool is_sync, bool is_migrate, bool is_local,
> -				 unsigned long en_flags)
> +static void
> +ttwu_stat(struct rq *rq, struct task_struct *p, int cpu, int wake_flags)
>  {
> +#ifdef CONFIG_SCHEDSTATS
> +

        int this_cpu = smp_processor_id();

> +	schedstat_inc(rq, ttwu_count);
>  	schedstat_inc(p, se.statistics.nr_wakeups);
> -	if (is_sync)
> +
> +	if (wake_flags & WF_SYNC)
>  		schedstat_inc(p, se.statistics.nr_wakeups_sync);
> -	if (is_migrate)
> +
> +	if (cpu != task_cpu(p))
>  		schedstat_inc(p, se.statistics.nr_wakeups_migrate);
> -	if (is_local)
> +
> +	if (cpu == smp_processor_id()) {

                    ^^^^^^  this_cpu

> +		schedstat_inc(rq, ttwu_local);
>  		schedstat_inc(p, se.statistics.nr_wakeups_local);
> -	else
> -		schedstat_inc(p, se.statistics.nr_wakeups_remote);
> +	} else {
> +		struct sched_domain *sd;
>  
> -	activate_task(rq, p, en_flags);
> +		schedstat_inc(p, se.statistics.nr_wakeups_remote);
> +		for_each_domain(smp_processor_id(), sd) {

                                ^^^^^ this_cpu

> +			if (cpumask_test_cpu(cpu, sched_domain_span(sd))) {
> +				schedstat_inc(sd, ttwu_wake_remote);
> +				break;
> +			}
> +		}
> +	}
> +#endif /* CONFIG_SCHEDSTATS */
>  }
>  
>  static void


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 5/5] sched: Reduce ttwu rq->lock contention
  2010-12-17 18:15                 ` Peter Zijlstra
  2010-12-17 19:28                   ` Oleg Nesterov
@ 2010-12-18 14:49                   ` Yong Zhang
  2010-12-18 20:08                     ` Oleg Nesterov
  1 sibling, 1 reply; 44+ messages in thread
From: Yong Zhang @ 2010-12-18 14:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Chris Mason, Frank Rowand, Ingo Molnar,
	Thomas Gleixner, Mike Galbraith, Paul Turner, Jens Axboe,
	linux-kernel

On Sat, Dec 18, 2010 at 2:15 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2010-12-17 at 18:43 +0100, Peter Zijlstra wrote:
>>
>> Hrmph, so is it only about serializing concurrent wakeups? If so, we
>> could possibly hold p->pi_lock over the wakeup.
>
> Something like the below.. except it still suffers from the
> __migrate_task() hole you identified in your other email.
>
> By fully serializing all wakeups using ->pi_lock it becomes a lot
> simpler (although I just realized we might have a problem with
> try_to_wake_up_local).
>
> static int
> try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> {
>        unsigned long flags;
>        int cpu, ret = 0;
>
>        smp_wmb();
>        raw_spin_lock_irqsave(&p->pi_lock, flags);
>
>        if (!(p->state & state))
>                goto unlock;
>
>        ret = 1; /* we qualify as a proper wakeup now */

Could below happen in this __window__?

p is going through wait_event() and first sets TASK_UNINTERRUPTIBLE,
then the waker sees that and the if (!(p->state & state)) test above passes.
But at this time the condition is already true for p, so p returns to
running and then goes to sleep again:
          p->state = XXX;
          sleep;

then we could wake up a process which is in the wrong state, no?

>
>        if (p->se.on_rq && ttwu_force(p, state, wake_flags))
>                goto unlock;
>
>        p->sched_contributes_to_load = !!task_contributes_to_load(p);
>
>        /*
>         * In order to serialize against other tasks wanting to task_rq_lock()
>         * we need to wait until the current task_rq(p)->lock holder goes away,
>         * so that the next might observe TASK_WAKING.
>         */
>        p->state = TASK_WAKING;
>        smp_wmb();
>        raw_spin_unlock_wait(&task_rq(p)->lock);
>
>        /*
>         * Stable, now that TASK_WAKING is visible.
>         */
>        cpu = task_cpu(p);
>
> #ifdef CONFIG_SMP
>        /*
>         * Catch the case where schedule() has done the dequeue but hasn't yet
>         * scheduled to a new task, in that case p is still being referenced
>         * by that cpu so we cannot wake it to any other cpu.
>         *
>         * Here we must either do a full remote enqueue, or simply wait for
>         * the remote cpu to finish the schedule(), the latter was found to
>         * be cheapest.
>         */
>        while (p->oncpu)
>                cpu_relax();
>
>        if (p->sched_class->task_waking)
>                p->sched_class->task_waking(p);
>
>        cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
> #endif
>        ttwu_queue(p, cpu);
>        ttwu_stat(p, cpu, wake_flags);
> unlock:
>        raw_spin_unlock_irqrestore(&p->pi_lock, flags);
>
>        return ret;
> }
>



-- 
Only stand for myself.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 5/5] sched: Reduce ttwu rq->lock contention
  2010-12-18 14:49                   ` Yong Zhang
@ 2010-12-18 20:08                     ` Oleg Nesterov
  2010-12-19 11:20                       ` Yong Zhang
  0 siblings, 1 reply; 44+ messages in thread
From: Oleg Nesterov @ 2010-12-18 20:08 UTC (permalink / raw)
  To: Yong Zhang
  Cc: Peter Zijlstra, Chris Mason, Frank Rowand, Ingo Molnar,
	Thomas Gleixner, Mike Galbraith, Paul Turner, Jens Axboe,
	linux-kernel

On 12/18, Yong Zhang wrote:
>
> On Sat, Dec 18, 2010 at 2:15 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > static int
> > try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> > {
> >        unsigned long flags;
> >        int cpu, ret = 0;
> >
> >        smp_wmb();
> >        raw_spin_lock_irqsave(&p->pi_lock, flags);
> >
> >        if (!(p->state & state))
> >                goto unlock;
> >
> >        ret = 1; /* we qualify as a proper wakeup now */
>
> Could below happen in this __window__?
>
> p is going through wait_event()

I don't think this can happen with wait_event/wake_up/etc,
wait_queue_head_t->lock adds the necessary synchronization.

But, in general,

> and first sets TASK_UNINTERRUPTIBLE,
> then the waker sees that and the if (!(p->state & state)) test above passes.
> But at this time the condition is already true for p, so p returns to
> running and then goes to sleep again:
>           p->state = XXX;
>           sleep;
>
> then we could wake up a process which is in the wrong state, no?

I think this is possible, and this is possible whatever we do.
Afaics, this patch changes nothing in this sense. Consider:

	set_current_state(TASK_INTERRUPTIBLE);

	set_current_state(TASK_UNINTERRUPTIBLE);
	schedule();

wake_up_state(TASK_INTERRUPTIBLE) in between can in fact wakeup
this task in TASK_UNINTERRUPTIBLE state.

I do not think this is the problem. The user of wake_up_process()
should take care and write the correct code ;) And in any case,
any wait-event-like code should handle the spurious wakeups
correctly.
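
I.e., the usual shape that copes with a spurious wakeup, re-checking the
condition before really sleeping ('condition' is a placeholder):

        for (;;) {
                set_current_state(TASK_UNINTERRUPTIBLE);
                if (condition)
                        break;
                schedule();
        }
        __set_current_state(TASK_RUNNING);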

Or I missed your point?

Oleg.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 5/5] sched: Reduce ttwu rq->lock contention
  2010-12-18 20:08                     ` Oleg Nesterov
@ 2010-12-19 11:20                       ` Yong Zhang
  0 siblings, 0 replies; 44+ messages in thread
From: Yong Zhang @ 2010-12-19 11:20 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Chris Mason, Frank Rowand, Ingo Molnar,
	Thomas Gleixner, Mike Galbraith, Paul Turner, Jens Axboe,
	linux-kernel

On Sat, Dec 18, 2010 at 09:08:50PM +0100, Oleg Nesterov wrote:
> On 12/18, Yong Zhang wrote:
> >
> > On Sat, Dec 18, 2010 at 2:15 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > static int
> > > try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> > > {
> > >        unsigned long flags;
> > >        int cpu, ret = 0;
> > >
> > >        smp_wmb();
> > >        raw_spin_lock_irqsave(&p->pi_lock, flags);
> > >
> > >        if (!(p->state & state))
> > >                goto unlock;
> > >
> > >        ret = 1; /* we qualify as a proper wakeup now */
> >
> > Could below happen in this __window__?
> >
> > p is going through wait_event()
> 
> I don't think this can happen with wait_event/wake_up/etc,
> wait_queue_head_t->lock adds the necessary synchronization.

Actually I don't see wait_event/wake_up and sleep/wake_up_process
as different cases, because nothing prevents the user from using
wake_up_process() on a sleeper that is already on a wait queue,
though we know that's not recommended.

And you're right, wait_queue_head_t->lock provides the necessary
synchronization for wait_event/wake_up.

> 
> But, in general,
> 
> > and first sets TASK_UNINTERRUPTIBLE,
> > then the waker sees that and the if (!(p->state & state)) test above passes.
> > But at this time the condition is already true for p, so p returns to
> > running and then goes to sleep again:
> >           p->state = XXX;
> >           sleep;
> >
> > then we could wake up a process which is in the wrong state, no?
> 
> I think this is possible, and this is possible whatever we do.
> Afaics, this patch changes nothing in this sense. Consider:
> 
> 	set_current_state(TASK_INTERRUPTIBLE);
> 
> 	set_current_state(TASK_UNINTERRUPTIBLE);
> 	schedule();
> 
> wake_up_state(TASK_INTERRUPTIBLE) in between can in fact wakeup
> this task in TASK_UNINTERRUPTIBLE state.

Hmmm, yeah. I missed that.

> 
> I do not think this is the problem. The user of wake_up_process()
> should take care and write the correct code ;) 

Fair enough ;)

> And in any case,
> any wait-event-like code should handle the spurious wakeups
> correctly.

Yup.

Thanks,
Yong

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2010-12-19 11:21 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-12-16 14:56 [RFC][PATCH 0/5] Reduce runqueue lock contention -v2 Peter Zijlstra
2010-12-16 14:56 ` [RFC][PATCH 1/5] sched: Always provide p->oncpu Peter Zijlstra
2010-12-18  1:03   ` Frank Rowand
2010-12-16 14:56 ` [RFC][PATCH 2/5] mutex: Use p->oncpu for the adaptive spin Peter Zijlstra
2010-12-16 17:34   ` Oleg Nesterov
2010-12-16 19:29     ` Peter Zijlstra
2010-12-17 19:17       ` Oleg Nesterov
2010-12-16 14:56 ` [RFC][PATCH 3/5] sched: Change the ttwu success details Peter Zijlstra
2010-12-16 15:23   ` Frederic Weisbecker
2010-12-16 15:27     ` Peter Zijlstra
2010-12-16 15:30       ` Peter Zijlstra
2010-12-16 15:45         ` Frederic Weisbecker
2010-12-16 15:35       ` Frederic Weisbecker
2010-12-18  1:05   ` Frank Rowand
2010-12-16 14:56 ` [RFC][PATCH 4/5] sched: Clean up ttwu stats Peter Zijlstra
2010-12-18  1:09   ` Frank Rowand
2010-12-16 14:56 ` [RFC][PATCH 5/5] sched: Reduce ttwu rq->lock contention Peter Zijlstra
2010-12-16 15:31   ` Frederic Weisbecker
2010-12-16 17:58   ` Oleg Nesterov
2010-12-16 18:42   ` Oleg Nesterov
2010-12-16 18:58     ` Peter Zijlstra
2010-12-16 19:03       ` Peter Zijlstra
2010-12-16 19:47         ` Peter Zijlstra
2010-12-16 20:32           ` Peter Zijlstra
2010-12-17  3:06             ` Yan, Zheng
2010-12-17 13:23               ` Peter Zijlstra
2010-12-17 16:54             ` Oleg Nesterov
2010-12-17 17:43               ` Peter Zijlstra
2010-12-17 18:15                 ` Peter Zijlstra
2010-12-17 19:28                   ` Oleg Nesterov
2010-12-17 21:02                     ` Peter Zijlstra
2010-12-18 14:49                   ` Yong Zhang
2010-12-18 20:08                     ` Oleg Nesterov
2010-12-19 11:20                       ` Yong Zhang
2010-12-17 18:21                 ` Oleg Nesterov
2010-12-17 17:50               ` Oleg Nesterov
2010-12-17 18:24                 ` Peter Zijlstra
2010-12-17 18:41                   ` Peter Zijlstra
2010-12-16 19:12 ` [RFC][PATCH 0/5] Reduce runqueue lock contention -v2 Frank Rowand
2010-12-16 19:36   ` Frank Rowand
2010-12-16 19:39     ` Frank Rowand
2010-12-16 19:42       ` Peter Zijlstra
2010-12-16 20:45         ` Frank Rowand
2010-12-16 19:36   ` Frank Rowand
