* [RFC][PATCH 00/17] sched: Reduce runqueue lock contention -v3
@ 2010-12-24 12:23 Peter Zijlstra
  2010-12-24 12:23 ` [RFC][PATCH 01/17] sched: Always provide p->on_cpu Peter Zijlstra
                   ` (17 more replies)
  0 siblings, 18 replies; 59+ messages in thread
From: Peter Zijlstra @ 2010-12-24 12:23 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

This is the latest incarnation of my scary ttwu patch; it's split up into
tiny little pieces because at some point while working on the feedback from
the last posting it all stopped working, and tiny changes were the only way
to find out wth I'd done wrong (I never did find out though; when I
completely rewrote everything in tiny chunks it all magically worked again
-- so I guess tglx is chuckling now).

Anyway, it still needs a few more loose ends fixed; esp. on 32-bit there is
the whole unlocked remote min_vruntime access problem, where we can read
partial updates and screw up the timeline. And I suppose there are a few
more XXXs in there where I know things need TLC.
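
For reference, the 32-bit issue is the usual torn 64-bit load; a rough
illustration only, not taken from the patches (the 'snap' name is made up):

    /* illustration only -- not from the patches below */
    u64 snap = cfs_rq->min_vruntime;    /* two 32-bit loads on 32-bit */
    /*
     * If a remote CPU updates min_vruntime between the low- and
     * high-word loads we observe a value that never existed, and a
     * newly woken task gets placed at a bogus spot on the timeline.
     */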

Also, I bet the eagle-eyed lot Cc'ed will spot more holes, as all this is
scary stuff :-)

Anyway, lots of thanks to Frank, Oleg, Ingo and others who commented on
the last posting.

I guess most everybody is out having x-mas celebrations (and I will be too,
not too long from now), so I expect replies to this won't happen until the
new year, but who knows, maybe in between eating and drinking too much we
all get an urge to stare at code.

Happy Holidays!



* [RFC][PATCH 01/17] sched: Always provide p->on_cpu
  2010-12-24 12:23 [RFC][PATCH 00/17] sched: Reduce runqueue lock contention -v3 Peter Zijlstra
@ 2010-12-24 12:23 ` Peter Zijlstra
  2010-12-24 12:23 ` [RFC][PATCH 02/17] mutex: Use p->on_cpu for the adaptive spin Peter Zijlstra
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2010-12-24 12:23 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-on_cpu.patch --]
[-- Type: text/plain, Size: 3698 bytes --]

Always provide p->on_cpu so that we can determine if it's on a CPU
without having to lock the rq.
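
Illustration only (the helper below is made up, not part of this patch):
a remote "is it executing right now?" test then reduces to a plain load:

    static inline bool task_is_executing(struct task_struct *p)
    {
            return p->on_cpu;       /* no rq->lock needed to read this */
    }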

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |    4 +---
 kernel/sched.c        |   46 +++++++++++++++++++++++++++++-----------------
 2 files changed, 30 insertions(+), 20 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -845,18 +845,39 @@ static inline int task_current(struct rq
 	return rq->curr == p;
 }
 
-#ifndef __ARCH_WANT_UNLOCKED_CTXSW
 static inline int task_running(struct rq *rq, struct task_struct *p)
 {
+#ifdef CONFIG_SMP
+	return p->on_cpu;
+#else
 	return task_current(rq, p);
+#endif
 }
 
+#ifndef __ARCH_WANT_UNLOCKED_CTXSW
 static inline void prepare_lock_switch(struct rq *rq, struct task_struct *next)
 {
+#ifdef CONFIG_SMP
+	/*
+	 * We can optimise this out completely for !SMP, because the
+	 * SMP rebalancing from interrupt is the only thing that cares
+	 * here.
+	 */
+	next->on_cpu = 1;
+#endif
 }
 
 static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
 {
+#ifdef CONFIG_SMP
+	/*
+	 * After ->on_cpu is cleared, the task can be moved to a different CPU.
+	 * We must ensure this doesn't happen until the switch is completely
+	 * finished.
+	 */
+	smp_wmb();
+	prev->on_cpu = 0;
+#endif
 #ifdef CONFIG_DEBUG_SPINLOCK
 	/* this is a valid case when another task releases the spinlock */
 	rq->lock.owner = current;
@@ -872,15 +893,6 @@ static inline void finish_lock_switch(st
 }
 
 #else /* __ARCH_WANT_UNLOCKED_CTXSW */
-static inline int task_running(struct rq *rq, struct task_struct *p)
-{
-#ifdef CONFIG_SMP
-	return p->oncpu;
-#else
-	return task_current(rq, p);
-#endif
-}
-
 static inline void prepare_lock_switch(struct rq *rq, struct task_struct *next)
 {
 #ifdef CONFIG_SMP
@@ -889,7 +901,7 @@ static inline void prepare_lock_switch(s
 	 * SMP rebalancing from interrupt is the only thing that cares
 	 * here.
 	 */
-	next->oncpu = 1;
+	next->on_cpu = 1;
 #endif
 #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
 	raw_spin_unlock_irq(&rq->lock);
@@ -902,12 +914,12 @@ static inline void finish_lock_switch(st
 {
 #ifdef CONFIG_SMP
 	/*
-	 * After ->oncpu is cleared, the task can be moved to a different CPU.
+	 * After ->on_cpu is cleared, the task can be moved to a different CPU.
 	 * We must ensure this doesn't happen until the switch is completely
 	 * finished.
 	 */
 	smp_wmb();
-	prev->oncpu = 0;
+	prev->on_cpu = 0;
 #endif
 #ifndef __ARCH_WANT_INTERRUPTS_ON_CTXSW
 	local_irq_enable();
@@ -2645,8 +2657,8 @@ void sched_fork(struct task_struct *p, i
 	if (likely(sched_info_on()))
 		memset(&p->sched_info, 0, sizeof(p->sched_info));
 #endif
-#if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW)
-	p->oncpu = 0;
+#if defined(CONFIG_SMP)
+	p->on_cpu = 0;
 #endif
 #ifdef CONFIG_PREEMPT
 	/* Want to start with kernel preemption disabled. */
@@ -5557,8 +5569,8 @@ void __cpuinit init_idle(struct task_str
 	rcu_read_unlock();
 
 	rq->curr = rq->idle = idle;
-#if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW)
-	idle->oncpu = 1;
+#if defined(CONFIG_SMP)
+	idle->on_cpu = 1;
 #endif
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
 
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1198,9 +1198,7 @@ struct task_struct {
 	int lock_depth;		/* BKL lock depth */
 
 #ifdef CONFIG_SMP
-#ifdef __ARCH_WANT_UNLOCKED_CTXSW
-	int oncpu;
-#endif
+	int on_cpu;
 #endif
 
 	int prio, static_prio, normal_prio;




* [RFC][PATCH 02/17] mutex: Use p->on_cpu for the adaptive spin
  2010-12-24 12:23 [RFC][PATCH 00/17] sched: Reduce runqueue lock contention -v3 Peter Zijlstra
  2010-12-24 12:23 ` [RFC][PATCH 01/17] sched: Always provide p->on_cpu Peter Zijlstra
@ 2010-12-24 12:23 ` Peter Zijlstra
  2010-12-24 12:23 ` [RFC][PATCH 03/17] sched: Change the ttwu success details Peter Zijlstra
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2010-12-24 12:23 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-on_cpu-use.patch --]
[-- Type: text/plain, Size: 5818 bytes --]

Since we now have p->on_cpu unconditionally available, use it to
re-implement mutex_spin_on_owner.
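
For context, the caller in __mutex_lock_common() then looks roughly like
this (sketch only, details elided):

    for (;;) {
            struct task_struct *owner;

            owner = ACCESS_ONCE(lock->owner);
            if (owner && !mutex_spin_on_owner(lock, owner))
                    break;                  /* owner scheduled out; go sleep */

            if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
                    mutex_set_owner(lock);  /* got it while spinning */
                    preempt_enable();
                    return 0;
            }

            arch_mutex_cpu_relax();
    }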

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mutex.h |    2 -
 include/linux/sched.h |    2 -
 kernel/mutex-debug.c  |    2 -
 kernel/mutex-debug.h  |    2 -
 kernel/mutex.c        |    2 -
 kernel/mutex.h        |    2 -
 kernel/sched.c        |   83 +++++++++++++++++++-------------------------------
 7 files changed, 39 insertions(+), 56 deletions(-)

Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -51,7 +51,7 @@ struct mutex {
 	spinlock_t		wait_lock;
 	struct list_head	wait_list;
 #if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
-	struct thread_info	*owner;
+	struct task_struct	*owner;
 #endif
 #ifdef CONFIG_DEBUG_MUTEXES
 	const char 		*name;
Index: linux-2.6/kernel/mutex-debug.c
===================================================================
--- linux-2.6.orig/kernel/mutex-debug.c
+++ linux-2.6/kernel/mutex-debug.c
@@ -75,7 +75,7 @@ void debug_mutex_unlock(struct mutex *lo
 		return;
 
 	DEBUG_LOCKS_WARN_ON(lock->magic != lock);
-	DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
+	DEBUG_LOCKS_WARN_ON(lock->owner != current);
 	DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);
 	mutex_clear_owner(lock);
 }
Index: linux-2.6/kernel/mutex-debug.h
===================================================================
--- linux-2.6.orig/kernel/mutex-debug.h
+++ linux-2.6/kernel/mutex-debug.h
@@ -29,7 +29,7 @@ extern void debug_mutex_init(struct mute
 
 static inline void mutex_set_owner(struct mutex *lock)
 {
-	lock->owner = current_thread_info();
+	lock->owner = current;
 }
 
 static inline void mutex_clear_owner(struct mutex *lock)
Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -160,7 +160,7 @@ __mutex_lock_common(struct mutex *lock, 
 	 */
 
 	for (;;) {
-		struct thread_info *owner;
+		struct task_struct *owner;
 
 		/*
 		 * If we own the BKL, then don't spin. The owner of
Index: linux-2.6/kernel/mutex.h
===================================================================
--- linux-2.6.orig/kernel/mutex.h
+++ linux-2.6/kernel/mutex.h
@@ -19,7 +19,7 @@
 #ifdef CONFIG_SMP
 static inline void mutex_set_owner(struct mutex *lock)
 {
-	lock->owner = current_thread_info();
+	lock->owner = current;
 }
 
 static inline void mutex_clear_owner(struct mutex *lock)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -4034,70 +4034,53 @@ asmlinkage void __sched schedule(void)
 EXPORT_SYMBOL(schedule);
 
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
-/*
- * Look out! "owner" is an entirely speculative pointer
- * access and not reliable.
- */
-int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
-{
-	unsigned int cpu;
-	struct rq *rq;
 
-	if (!sched_feat(OWNER_SPIN))
-		return 0;
+static inline bool owner_running(struct mutex *lock, struct task_struct *owner)
+{
+	bool ret = false;
 
-#ifdef CONFIG_DEBUG_PAGEALLOC
-	/*
-	 * Need to access the cpu field knowing that
-	 * DEBUG_PAGEALLOC could have unmapped it if
-	 * the mutex owner just released it and exited.
-	 */
-	if (probe_kernel_address(&owner->cpu, cpu))
-		return 0;
-#else
-	cpu = owner->cpu;
-#endif
+	rcu_read_lock();
+	if (lock->owner != owner)
+		goto fail;
 
 	/*
-	 * Even if the access succeeded (likely case),
-	 * the cpu field may no longer be valid.
+	 * Ensure we emit the owner->on_cpu, dereference _after_ checking
+	 * lock->owner still matches owner, if that fails, owner might
+	 * point to free()d memory, if it still matches, the rcu_read_lock()
+	 * ensures the memory stays valid.
 	 */
-	if (cpu >= nr_cpumask_bits)
-		return 0;
+	barrier();
 
-	/*
-	 * We need to validate that we can do a
-	 * get_cpu() and that we have the percpu area.
-	 */
-	if (!cpu_online(cpu))
-		return 0;
+	ret = owner->on_cpu;
+fail:
+	rcu_read_unlock();
 
-	rq = cpu_rq(cpu);
+	return ret;
+}
 
-	for (;;) {
-		/*
-		 * Owner changed, break to re-assess state.
-		 */
-		if (lock->owner != owner) {
-			/*
-			 * If the lock has switched to a different owner,
-			 * we likely have heavy contention. Return 0 to quit
-			 * optimistic spinning and not contend further:
-			 */
-			if (lock->owner)
-				return 0;
-			break;
-		}
+/*
+ * Look out! "owner" is an entirely speculative pointer
+ * access and not reliable.
+ */
+int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
+{
+	if (!sched_feat(OWNER_SPIN))
+		return 0;
 
-		/*
-		 * Is that owner really running on that cpu?
-		 */
-		if (task_thread_info(rq->curr) != owner || need_resched())
+	while (owner_running(lock, owner)) {
+		if (need_resched())
 			return 0;
 
 		arch_mutex_cpu_relax();
 	}
 
+	/*
+	 * If the owner changed to another task there is likely
+	 * heavy contention, stop spinning.
+	 */
+	if (lock->owner)
+		return 0;
+
 	return 1;
 }
 #endif
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -360,7 +360,7 @@ extern signed long schedule_timeout_inte
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
 asmlinkage void schedule(void);
-extern int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner);
+extern int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner);
 
 struct nsproxy;
 struct user_namespace;




* [RFC][PATCH 03/17] sched: Change the ttwu success details
  2010-12-24 12:23 [RFC][PATCH 00/17] sched: Reduce runqueue lock contention -v3 Peter Zijlstra
  2010-12-24 12:23 ` [RFC][PATCH 01/17] sched: Always provide p->on_cpu Peter Zijlstra
  2010-12-24 12:23 ` [RFC][PATCH 02/17] mutex: Use p->on_cpu for the adaptive spin Peter Zijlstra
@ 2010-12-24 12:23 ` Peter Zijlstra
  2010-12-24 12:23 ` [RFC][PATCH 04/17] sched: Clean up ttwu stats Peter Zijlstra
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2010-12-24 12:23 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-change-ttwu-return.patch --]
[-- Type: text/plain, Size: 2431 bytes --]

try_to_wake_up() used to return success only when it actually had to
place a task on a rq; change that to every time we change p->state to
TASK_RUNNING, because that's the real measure of a wakeup.

As a result, success is always true for the tracepoints.
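
In other words (illustration only, not part of the diff):

    if (!(p->state & state))
            return 0;               /* no wakeup happened */

    /* ... queue the task if it isn't queued ... */

    p->state = TASK_RUNNING;        /* this transition is the wakeup */
    return 1;                       /* so it always counts as success */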

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |   18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2383,10 +2383,10 @@ static inline void ttwu_activate(struct 
 	activate_task(rq, p, en_flags);
 }
 
-static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
-					int wake_flags, bool success)
+static void
+ttwu_post_activation(struct task_struct *p, struct rq *rq, int wake_flags)
 {
-	trace_sched_wakeup(p, success);
+	trace_sched_wakeup(p, true);
 	check_preempt_curr(rq, p, wake_flags);
 
 	p->state = TASK_RUNNING;
@@ -2406,7 +2406,7 @@ static inline void ttwu_post_activation(
 	}
 #endif
 	/* if a worker is waking up, notify workqueue */
-	if ((p->flags & PF_WQ_WORKER) && success)
+	if (p->flags & PF_WQ_WORKER)
 		wq_worker_waking_up(p, cpu_of(rq));
 }
 
@@ -2505,9 +2505,9 @@ static int try_to_wake_up(struct task_st
 #endif /* CONFIG_SMP */
 	ttwu_activate(p, rq, wake_flags & WF_SYNC, orig_cpu != cpu,
 		      cpu == this_cpu, en_flags);
-	success = 1;
 out_running:
-	ttwu_post_activation(p, rq, wake_flags, success);
+	ttwu_post_activation(p, rq, wake_flags);
+	success = 1;
 out:
 	task_rq_unlock(rq, &flags);
 	put_cpu();
@@ -2526,7 +2526,6 @@ static int try_to_wake_up(struct task_st
 static void try_to_wake_up_local(struct task_struct *p)
 {
 	struct rq *rq = task_rq(p);
-	bool success = false;
 
 	BUG_ON(rq != this_rq());
 	BUG_ON(p == current);
@@ -2541,9 +2540,8 @@ static void try_to_wake_up_local(struct 
 			schedstat_inc(rq, ttwu_local);
 		}
 		ttwu_activate(p, rq, false, false, true, ENQUEUE_WAKEUP);
-		success = true;
 	}
-	ttwu_post_activation(p, rq, 0, success);
+	ttwu_post_activation(p, rq, 0);
 }
 
 /**
@@ -2705,7 +2703,7 @@ void wake_up_new_task(struct task_struct
 
 	rq = task_rq_lock(p, &flags);
 	activate_task(rq, p, 0);
-	trace_sched_wakeup_new(p, 1);
+	trace_sched_wakeup_new(p, true);
 	check_preempt_curr(rq, p, WF_FORK);
 #ifdef CONFIG_SMP
 	if (p->sched_class->task_woken)




* [RFC][PATCH 04/17] sched: Clean up ttwu stats
  2010-12-24 12:23 [RFC][PATCH 00/17] sched: Reduce runqueue lock contention -v3 Peter Zijlstra
                   ` (2 preceding siblings ...)
  2010-12-24 12:23 ` [RFC][PATCH 03/17] sched: Change the ttwu success details Peter Zijlstra
@ 2010-12-24 12:23 ` Peter Zijlstra
  2010-12-24 12:23 ` [RFC][PATCH 05/17] x86: Optimize arch_spin_unlock_wait() Peter Zijlstra
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2010-12-24 12:23 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-ttwu-stat.patch --]
[-- Type: text/plain, Size: 3241 bytes --]

Collect all ttwu stat code into a single function and ensure it's
always called for an actual wakeup (changing p->state to
TASK_RUNNING).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |   67 +++++++++++++++++++++++++++------------------------------
 1 file changed, 32 insertions(+), 35 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2366,21 +2366,36 @@ static void update_avg(u64 *avg, u64 sam
 }
 #endif
 
-static inline void ttwu_activate(struct task_struct *p, struct rq *rq,
-				 bool is_sync, bool is_migrate, bool is_local,
-				 unsigned long en_flags)
+static void
+ttwu_stat(struct rq *rq, struct task_struct *p, int cpu, int wake_flags)
 {
+#ifdef CONFIG_SCHEDSTATS
+	int this_cpu = smp_processor_id();
+
+	schedstat_inc(rq, ttwu_count);
 	schedstat_inc(p, se.statistics.nr_wakeups);
-	if (is_sync)
+
+	if (wake_flags & WF_SYNC)
 		schedstat_inc(p, se.statistics.nr_wakeups_sync);
-	if (is_migrate)
+
+	if (cpu != task_cpu(p))
 		schedstat_inc(p, se.statistics.nr_wakeups_migrate);
-	if (is_local)
+
+	if (cpu == this_cpu) {
+		schedstat_inc(rq, ttwu_local);
 		schedstat_inc(p, se.statistics.nr_wakeups_local);
-	else
-		schedstat_inc(p, se.statistics.nr_wakeups_remote);
+	} else {
+		struct sched_domain *sd;
 
-	activate_task(rq, p, en_flags);
+		schedstat_inc(p, se.statistics.nr_wakeups_remote);
+		for_each_domain(this_cpu, sd) {
+			if (cpumask_test_cpu(cpu, sched_domain_span(sd))) {
+				schedstat_inc(sd, ttwu_wake_remote);
+				break;
+			}
+		}
+	}
+#endif /* CONFIG_SCHEDSTATS */
 }
 
 static void
@@ -2440,12 +2455,12 @@ static int try_to_wake_up(struct task_st
 	if (!(p->state & state))
 		goto out;
 
+	cpu = task_cpu(p);
+
 	if (p->se.on_rq)
 		goto out_running;
 
-	cpu = task_cpu(p);
 	orig_cpu = cpu;
-
 #ifdef CONFIG_SMP
 	if (unlikely(task_running(rq, p)))
 		goto out_activate;
@@ -2486,27 +2501,12 @@ static int try_to_wake_up(struct task_st
 	WARN_ON(task_cpu(p) != cpu);
 	WARN_ON(p->state != TASK_WAKING);
 
-#ifdef CONFIG_SCHEDSTATS
-	schedstat_inc(rq, ttwu_count);
-	if (cpu == this_cpu)
-		schedstat_inc(rq, ttwu_local);
-	else {
-		struct sched_domain *sd;
-		for_each_domain(this_cpu, sd) {
-			if (cpumask_test_cpu(cpu, sched_domain_span(sd))) {
-				schedstat_inc(sd, ttwu_wake_remote);
-				break;
-			}
-		}
-	}
-#endif /* CONFIG_SCHEDSTATS */
-
 out_activate:
 #endif /* CONFIG_SMP */
-	ttwu_activate(p, rq, wake_flags & WF_SYNC, orig_cpu != cpu,
-		      cpu == this_cpu, en_flags);
+	activate_task(rq, p, en_flags);
 out_running:
 	ttwu_post_activation(p, rq, wake_flags);
+	ttwu_stat(rq, p, cpu, wake_flags);
 	success = 1;
 out:
 	task_rq_unlock(rq, &flags);
@@ -2534,14 +2534,11 @@ static void try_to_wake_up_local(struct 
 	if (!(p->state & TASK_NORMAL))
 		return;
 
-	if (!p->se.on_rq) {
-		if (likely(!task_running(rq, p))) {
-			schedstat_inc(rq, ttwu_count);
-			schedstat_inc(rq, ttwu_local);
-		}
-		ttwu_activate(p, rq, false, false, true, ENQUEUE_WAKEUP);
-	}
+	if (!p->se.on_rq)
+		activate_task(rq, p, ENQUEUE_WAKEUP);
+
 	ttwu_post_activation(p, rq, 0);
+	ttwu_stat(rq, p, smp_processor_id(), 0);
 }
 
 /**




* [RFC][PATCH 05/17] x86: Optimize arch_spin_unlock_wait()
  2010-12-24 12:23 [RFC][PATCH 00/17] sched: Reduce runqueue lock contention -v3 Peter Zijlstra
                   ` (3 preceding siblings ...)
  2010-12-24 12:23 ` [RFC][PATCH 04/17] sched: Clean up ttwu stats Peter Zijlstra
@ 2010-12-24 12:23 ` Peter Zijlstra
  2010-12-24 18:26   ` Linus Torvalds
  2010-12-24 12:23 ` [RFC][PATCH 06/17] sched: Provide p->on_rq Peter Zijlstra
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2010-12-24 12:23 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra, Nick Piggin, Linus Torvalds,
	Jeremy Fitzhardinge

[-- Attachment #1: x86-spin-unlocked.patch --]
[-- Type: text/plain, Size: 2291 bytes --]

Only wait for the current holder to release the lock.

spin_unlock_wait() can only be about the current holder, since
completion of this function is inherently racy with new contenders.
Therefore, there is no reason to wait until the lock is completely
unlocked.
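
For reference, with the ticket layout assumed here (currently served
ticket in the low bits, next ticket above TICKET_SHIFT) a snapshot
decomposes as follows (illustration only):

    int tmp   = ACCESS_ONCE(lock->slock);
    int owner = tmp & TICKET_MASK;                  /* ticket being served */
    int next  = (tmp >> TICKET_SHIFT) & TICKET_MASK;/* next ticket handed out */

    /*
     * Locked iff owner != next; waiting until the served ticket moves
     * past the snapshot's owner value is enough: whoever held the lock
     * when we were called has certainly released it by then.
     */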

Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/include/asm/spinlock.h |   26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

Index: linux-2.6/arch/x86/include/asm/spinlock.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/spinlock.h
+++ linux-2.6/arch/x86/include/asm/spinlock.h
@@ -158,18 +158,32 @@ static __always_inline void __ticket_spi
 }
 #endif
 
+#define TICKET_MASK ((1 << TICKET_SHIFT) - 1)
+
 static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
 {
 	int tmp = ACCESS_ONCE(lock->slock);
 
-	return !!(((tmp >> TICKET_SHIFT) ^ tmp) & ((1 << TICKET_SHIFT) - 1));
+	return !!(((tmp >> TICKET_SHIFT) ^ tmp) & TICKET_MASK);
 }
 
 static inline int __ticket_spin_is_contended(arch_spinlock_t *lock)
 {
 	int tmp = ACCESS_ONCE(lock->slock);
 
-	return (((tmp >> TICKET_SHIFT) - tmp) & ((1 << TICKET_SHIFT) - 1)) > 1;
+	return (((tmp >> TICKET_SHIFT) - tmp) & TICKET_MASK) > 1;
+}
+
+static inline void __ticket_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	int tmp = ACCESS_ONCE(lock->slock);
+
+	if (!(((tmp >> TICKET_SHIFT) ^ tmp) & TICKET_MASK))
+		return; /* not locked */
+
+	/* wait until the current lock holder goes away */
+	while ((lock->slock & TICKET_MASK) == (tmp & TICKET_MASK))
+		cpu_relax();
 }
 
 #ifndef CONFIG_PARAVIRT_SPINLOCKS
@@ -206,7 +220,11 @@ static __always_inline void arch_spin_lo
 	arch_spin_lock(lock);
 }
 
-#endif	/* CONFIG_PARAVIRT_SPINLOCKS */
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	__ticket_spin_unlock_wait(lock);
+}
+#else
 
 static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
@@ -214,6 +232,8 @@ static inline void arch_spin_unlock_wait
 		cpu_relax();
 }
 
+#endif	/* CONFIG_PARAVIRT_SPINLOCKS */
+
 /*
  * Read-write spinlocks, allowing multiple readers
  * but only one writer.




* [RFC][PATCH 06/17] sched: Provide p->on_rq
  2010-12-24 12:23 [RFC][PATCH 00/17] sched: Reduce runqueue lock contention -v3 Peter Zijlstra
                   ` (4 preceding siblings ...)
  2010-12-24 12:23 ` [RFC][PATCH 05/17] x86: Optimize arch_spin_unlock_wait() Peter Zijlstra
@ 2010-12-24 12:23 ` Peter Zijlstra
  2010-12-29 14:14   ` Yong Zhang
  2010-12-24 12:23 ` [RFC][PATCH 07/17] sched: Serialize p->cpus_allowed and ttwu() using p->pi_lock Peter Zijlstra
                   ` (11 subsequent siblings)
  17 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2010-12-24 12:23 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-onrq.patch --]
[-- Type: text/plain, Size: 8343 bytes --]

Provide a generic p->on_rq because the p->se.on_rq semantics are
unfavourable for lockless wakeups but needed for sched_fair.

In particular, p->on_rq is only cleared when we actually dequeue the
task in schedule() and not on any random dequeue as done by things
like __migrate_task() and __sched_setscheduler().

This also allows us to remove p->se usage from !sched_fair code.
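
Schematically (illustration only, mirroring the hunks below):

    /*
     *   activate_task()                        -> p->on_rq = 1
     *   deactivate_task() (migrate, setprio)   -> p->on_rq left alone
     *   schedule() sleep path:
     *       deactivate_task(rq, prev, DEQUEUE_SLEEP);
     *       prev->on_rq = 0;                   <- only place it is cleared
     */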

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h   |    1 +
 kernel/sched.c          |   36 ++++++++++++++++++------------------
 kernel/sched_debug.c    |    2 +-
 kernel/sched_rt.c       |   10 +++++-----
 kernel/sched_stoptask.c |    2 +-
 5 files changed, 26 insertions(+), 25 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1200,6 +1200,7 @@ struct task_struct {
 #ifdef CONFIG_SMP
 	int on_cpu;
 #endif
+	int on_rq;
 
 	int prio, static_prio, normal_prio;
 	unsigned int rt_priority;
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -1759,7 +1759,6 @@ static void enqueue_task(struct rq *rq, 
 	update_rq_clock(rq);
 	sched_info_queued(p);
 	p->sched_class->enqueue_task(rq, p, flags);
-	p->se.on_rq = 1;
 }
 
 static void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
@@ -1767,7 +1766,6 @@ static void dequeue_task(struct rq *rq, 
 	update_rq_clock(rq);
 	sched_info_dequeued(p);
 	p->sched_class->dequeue_task(rq, p, flags);
-	p->se.on_rq = 0;
 }
 
 /*
@@ -1780,6 +1778,7 @@ static void activate_task(struct rq *rq,
 
 	enqueue_task(rq, p, flags);
 	inc_nr_running(rq);
+	p->on_rq = 1;
 }
 
 /*
@@ -2070,7 +2069,7 @@ static void check_preempt_curr(struct rq
 	 * A queue event has occurred, and we're going to schedule.  In
 	 * this case, we can save a useless back to back clock update.
 	 */
-	if (rq->curr->se.on_rq && test_tsk_need_resched(rq->curr))
+	if (rq->curr->on_rq && test_tsk_need_resched(rq->curr))
 		rq->skip_clock_update = 1;
 }
 
@@ -2145,7 +2144,7 @@ static bool migrate_task(struct task_str
 	 * If the task is not on a runqueue (and not running), then
 	 * the next wake-up will properly place the task.
 	 */
-	return p->se.on_rq || task_running(rq, p);
+	return p->on_rq || task_running(rq, p);
 }
 
 /*
@@ -2205,7 +2204,7 @@ unsigned long wait_task_inactive(struct 
 		rq = task_rq_lock(p, &flags);
 		trace_sched_wait_task(p);
 		running = task_running(rq, p);
-		on_rq = p->se.on_rq;
+		on_rq = p->on_rq;
 		ncsw = 0;
 		if (!match_state || p->state == match_state)
 			ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
@@ -2457,7 +2456,7 @@ static int try_to_wake_up(struct task_st
 
 	cpu = task_cpu(p);
 
-	if (p->se.on_rq)
+	if (p->on_rq)
 		goto out_running;
 
 	orig_cpu = cpu;
@@ -2534,7 +2533,7 @@ static void try_to_wake_up_local(struct 
 	if (!(p->state & TASK_NORMAL))
 		return;
 
-	if (!p->se.on_rq)
+	if (!p->on_rq)
 		activate_task(rq, p, ENQUEUE_WAKEUP);
 
 	ttwu_post_activation(p, rq, 0);
@@ -2571,18 +2570,20 @@ int wake_up_state(struct task_struct *p,
  */
 static void __sched_fork(struct task_struct *p)
 {
+	p->on_rq				= 0;
+
+	p->se.on_rq			= 0;
 	p->se.exec_start		= 0;
 	p->se.sum_exec_runtime		= 0;
 	p->se.prev_sum_exec_runtime	= 0;
 	p->se.nr_migrations		= 0;
+	INIT_LIST_HEAD(&p->se.group_node);
 
 #ifdef CONFIG_SCHEDSTATS
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
 #endif
 
 	INIT_LIST_HEAD(&p->rt.run_list);
-	p->se.on_rq = 0;
-	INIT_LIST_HEAD(&p->se.group_node);
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	INIT_HLIST_HEAD(&p->preempt_notifiers);
@@ -3904,7 +3905,7 @@ static inline void schedule_debug(struct
 
 static void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
-	if (prev->se.on_rq)
+	if (prev->on_rq)
 		update_rq_clock(rq);
 	prev->sched_class->put_prev_task(rq, prev);
 }
@@ -3983,6 +3984,7 @@ asmlinkage void __sched schedule(void)
 					try_to_wake_up_local(to_wakeup);
 			}
 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
+			prev->on_rq = 0;
 		}
 		switch_count = &prev->nvcsw;
 	}
@@ -4546,7 +4548,7 @@ void rt_mutex_setprio(struct task_struct
 	trace_sched_pi_setprio(p, prio);
 	oldprio = p->prio;
 	prev_class = p->sched_class;
-	on_rq = p->se.on_rq;
+	on_rq = p->on_rq;
 	running = task_current(rq, p);
 	if (on_rq)
 		dequeue_task(rq, p, 0);
@@ -4595,7 +4597,7 @@ void set_user_nice(struct task_struct *p
 		p->static_prio = NICE_TO_PRIO(nice);
 		goto out_unlock;
 	}
-	on_rq = p->se.on_rq;
+	on_rq = p->on_rq;
 	if (on_rq)
 		dequeue_task(rq, p, 0);
 
@@ -4729,8 +4731,6 @@ static struct task_struct *find_process_
 static void
 __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
 {
-	BUG_ON(p->se.on_rq);
-
 	p->policy = policy;
 	p->rt_priority = prio;
 	p->normal_prio = normal_prio(p);
@@ -4878,7 +4878,7 @@ static int __sched_setscheduler(struct t
 		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 		goto recheck;
 	}
-	on_rq = p->se.on_rq;
+	on_rq = p->on_rq;
 	running = task_current(rq, p);
 	if (on_rq)
 		deactivate_task(rq, p, 0);
@@ -5737,7 +5737,7 @@ static int __migrate_task(struct task_st
 	 * If we're not on a rq, the next wake-up will ensure we're
 	 * placed properly.
 	 */
-	if (p->se.on_rq) {
+	if (p->on_rq) {
 		deactivate_task(rq_src, p, 0);
 		set_task_cpu(p, dest_cpu);
 		activate_task(rq_dest, p, 0);
@@ -8106,7 +8106,7 @@ static void normalize_task(struct rq *rq
 {
 	int on_rq;
 
-	on_rq = p->se.on_rq;
+	on_rq = p->on_rq;
 	if (on_rq)
 		deactivate_task(rq, p, 0);
 	__setscheduler(rq, p, SCHED_NORMAL, 0);
@@ -8449,7 +8449,7 @@ void sched_move_task(struct task_struct 
 	rq = task_rq_lock(tsk, &flags);
 
 	running = task_current(rq, tsk);
-	on_rq = tsk->se.on_rq;
+	on_rq = tsk->on_rq;
 
 	if (on_rq)
 		dequeue_task(rq, tsk, 0);
Index: linux-2.6/kernel/sched_debug.c
===================================================================
--- linux-2.6.orig/kernel/sched_debug.c
+++ linux-2.6/kernel/sched_debug.c
@@ -127,7 +127,7 @@ static void print_rq(struct seq_file *m,
 	read_lock_irqsave(&tasklist_lock, flags);
 
 	do_each_thread(g, p) {
-		if (!p->se.on_rq || task_cpu(p) != rq_cpu)
+		if (!p->on_rq || task_cpu(p) != rq_cpu)
 			continue;
 
 		print_task(m, rq, p);
Index: linux-2.6/kernel/sched_rt.c
===================================================================
--- linux-2.6.orig/kernel/sched_rt.c
+++ linux-2.6/kernel/sched_rt.c
@@ -1132,7 +1132,7 @@ static void put_prev_task_rt(struct rq *
 	 * The previous task needs to be made eligible for pushing
 	 * if it is still active
 	 */
-	if (p->se.on_rq && p->rt.nr_cpus_allowed > 1)
+	if (p->on_rq && p->rt.nr_cpus_allowed > 1)
 		enqueue_pushable_task(rq, p);
 }
 
@@ -1283,7 +1283,7 @@ static struct rq *find_lock_lowest_rq(st
 				     !cpumask_test_cpu(lowest_rq->cpu,
 						       &task->cpus_allowed) ||
 				     task_running(rq, task) ||
-				     !task->se.on_rq)) {
+				     !task->on_rq)) {
 
 				raw_spin_unlock(&lowest_rq->lock);
 				lowest_rq = NULL;
@@ -1317,7 +1317,7 @@ static struct task_struct *pick_next_pus
 	BUG_ON(task_current(rq, p));
 	BUG_ON(p->rt.nr_cpus_allowed <= 1);
 
-	BUG_ON(!p->se.on_rq);
+	BUG_ON(!p->on_rq);
 	BUG_ON(!rt_task(p));
 
 	return p;
@@ -1463,7 +1463,7 @@ static int pull_rt_task(struct rq *this_
 		 */
 		if (p && (p->prio < this_rq->rt.highest_prio.curr)) {
 			WARN_ON(p == src_rq->curr);
-			WARN_ON(!p->se.on_rq);
+			WARN_ON(!p->on_rq);
 
 			/*
 			 * There's a chance that p is higher in priority
@@ -1534,7 +1534,7 @@ static void set_cpus_allowed_rt(struct t
 	 * Update the migration status of the RQ if we have an RT task
 	 * which is running AND changing its weight value.
 	 */
-	if (p->se.on_rq && (weight != p->rt.nr_cpus_allowed)) {
+	if (p->on_rq && (weight != p->rt.nr_cpus_allowed)) {
 		struct rq *rq = task_rq(p);
 
 		if (!task_current(rq, p)) {
Index: linux-2.6/kernel/sched_stoptask.c
===================================================================
--- linux-2.6.orig/kernel/sched_stoptask.c
+++ linux-2.6/kernel/sched_stoptask.c
@@ -26,7 +26,7 @@ static struct task_struct *pick_next_tas
 {
 	struct task_struct *stop = rq->stop;
 
-	if (stop && stop->se.on_rq)
+	if (stop && stop->on_rq)
 		return stop;
 
 	return NULL;




* [RFC][PATCH 07/17] sched: Serialize p->cpus_allowed and ttwu() using p->pi_lock
  2010-12-24 12:23 [RFC][PATCH 00/17] sched: Reduce runqueue lock contention -v3 Peter Zijlstra
                   ` (5 preceding siblings ...)
  2010-12-24 12:23 ` [RFC][PATCH 06/17] sched: Provide p->on_rq Peter Zijlstra
@ 2010-12-24 12:23 ` Peter Zijlstra
  2010-12-29 14:20   ` Yong Zhang
  2010-12-24 12:23 ` [RFC][PATCH 08/17] sched: Drop the rq argument to sched_class::select_task_rq() Peter Zijlstra
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2010-12-24 12:23 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-pi_lock-wakeup.patch --]
[-- Type: text/plain, Size: 3473 bytes --]

Currently p->pi_lock already serializes p->sched_class; also put
p->cpus_allowed and try_to_wake_up() under it. This prepares the way
to do the first part of ttwu() without holding rq->lock.
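
The resulting lock nesting, as used throughout the patch (illustration
only, taken from the hunks below):

    unsigned long flags;
    struct rq *rq;

    raw_spin_lock_irqsave(&p->pi_lock, flags);  /* outer: wakeups, ->sched_class,
                                                 * ->cpus_allowed */
    rq = __task_rq_lock(p);                     /* inner: rq->lock */
    /* ... p->state, ->cpus_allowed, ->sched_class are stable here ... */
    __task_rq_unlock(rq);
    raw_spin_unlock_irqrestore(&p->pi_lock, flags);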

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |   37 ++++++++++++++++---------------------
 1 file changed, 16 insertions(+), 21 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2301,7 +2301,7 @@ void task_oncpu_function_call(struct tas
 
 #ifdef CONFIG_SMP
 /*
- * ->cpus_allowed is protected by either TASK_WAKING or rq->lock held.
+ * ->cpus_allowed is protected by both rq->lock and p->pi_lock
  */
 static int select_fallback_rq(int cpu, struct task_struct *p)
 {
@@ -2334,7 +2334,7 @@ static int select_fallback_rq(int cpu, s
 }
 
 /*
- * The caller (fork, wakeup) owns TASK_WAKING, ->cpus_allowed is stable.
+ * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
  */
 static inline
 int select_task_rq(struct rq *rq, struct task_struct *p, int sd_flags, int wake_flags)
@@ -2450,7 +2450,8 @@ static int try_to_wake_up(struct task_st
 	this_cpu = get_cpu();
 
 	smp_wmb();
-	rq = task_rq_lock(p, &flags);
+	raw_spin_lock_irqsave(&p->pi_lock, flags);
+	rq = __task_rq_lock(p);
 	if (!(p->state & state))
 		goto out;
 
@@ -2508,7 +2509,8 @@ static int try_to_wake_up(struct task_st
 	ttwu_stat(rq, p, cpu, wake_flags);
 	success = 1;
 out:
-	task_rq_unlock(rq, &flags);
+	__task_rq_unlock(rq);
+	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 	put_cpu();
 
 	return success;
@@ -4543,6 +4545,8 @@ void rt_mutex_setprio(struct task_struct
 
 	BUG_ON(prio < 0 || prio > MAX_PRIO);
 
+	lockdep_assert_held(&p->pi_lock);
+
 	rq = task_rq_lock(p, &flags);
 
 	trace_sched_pi_setprio(p, prio);
@@ -5150,7 +5154,6 @@ long sched_getaffinity(pid_t pid, struct
 {
 	struct task_struct *p;
 	unsigned long flags;
-	struct rq *rq;
 	int retval;
 
 	get_online_cpus();
@@ -5165,9 +5168,9 @@ long sched_getaffinity(pid_t pid, struct
 	if (retval)
 		goto out_unlock;
 
-	rq = task_rq_lock(p, &flags);
+	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	cpumask_and(mask, &p->cpus_allowed, cpu_online_mask);
-	task_rq_unlock(rq, &flags);
+	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 
 out_unlock:
 	rcu_read_unlock();
@@ -5652,18 +5655,8 @@ int set_cpus_allowed_ptr(struct task_str
 	unsigned int dest_cpu;
 	int ret = 0;
 
-	/*
-	 * Serialize against TASK_WAKING so that ttwu() and wunt() can
-	 * drop the rq->lock and still rely on ->cpus_allowed.
-	 */
-again:
-	while (task_is_waking(p))
-		cpu_relax();
-	rq = task_rq_lock(p, &flags);
-	if (task_is_waking(p)) {
-		task_rq_unlock(rq, &flags);
-		goto again;
-	}
+	raw_spin_lock_irqsave(&p->pi_lock, flags);
+	rq = __task_rq_lock(p);
 
 	if (!cpumask_intersects(new_mask, cpu_active_mask)) {
 		ret = -EINVAL;
@@ -5691,13 +5684,15 @@ int set_cpus_allowed_ptr(struct task_str
 	if (migrate_task(p, rq)) {
 		struct migration_arg arg = { p, dest_cpu };
 		/* Need help from migration thread: drop lock and wait. */
-		task_rq_unlock(rq, &flags);
+		__task_rq_unlock(rq);
+		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 		stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
 		tlb_migrate_finish(p->mm);
 		return 0;
 	}
 out:
-	task_rq_unlock(rq, &flags);
+	__task_rq_unlock(rq);
+	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 
 	return ret;
 }




* [RFC][PATCH 08/17] sched: Drop the rq argument to sched_class::select_task_rq()
  2010-12-24 12:23 [RFC][PATCH 00/17] sched: Reduce runqueue lock contention -v3 Peter Zijlstra
                   ` (6 preceding siblings ...)
  2010-12-24 12:23 ` [RFC][PATCH 07/17] sched: Serialize p->cpus_allowed and ttwu() using p->pi_lock Peter Zijlstra
@ 2010-12-24 12:23 ` Peter Zijlstra
  2010-12-29 14:31   ` Yong Zhang
  2011-01-03 18:05   ` Oleg Nesterov
  2010-12-24 12:23 ` [RFC][PATCH 09/17] sched: Remove rq argument to sched_class::task_waking() Peter Zijlstra
                   ` (9 subsequent siblings)
  17 siblings, 2 replies; 59+ messages in thread
From: Peter Zijlstra @ 2010-12-24 12:23 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang, Steven Rostedt
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-select_task_rq.patch --]
[-- Type: text/plain, Size: 7383 bytes --]

In preparation for calling select_task_rq() without rq->lock held, drop
the dependency on the rq argument.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h   |    3 +--
 kernel/sched.c          |   40 ++++++++++++++--------------------------
 kernel/sched_fair.c     |    2 +-
 kernel/sched_idletask.c |    2 +-
 kernel/sched_rt.c       |   10 +++++++++-
 kernel/sched_stoptask.c |    3 +--
 6 files changed, 27 insertions(+), 33 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1063,8 +1063,7 @@ struct sched_class {
 	void (*put_prev_task) (struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
-	int  (*select_task_rq)(struct rq *rq, struct task_struct *p,
-			       int sd_flag, int flags);
+	int  (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);
 
 	void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
 	void (*post_schedule) (struct rq *this_rq);
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2138,13 +2138,14 @@ static int migration_cpu_stop(void *data
  * The task's runqueue lock must be held.
  * Returns true if you have to wait for migration thread.
  */
-static bool migrate_task(struct task_struct *p, struct rq *rq)
+static bool need_migrate_task(struct task_struct *p)
 {
 	/*
 	 * If the task is not on a runqueue (and not running), then
 	 * the next wake-up will properly place the task.
 	 */
-	return p->on_rq || task_running(rq, p);
+	smp_rmb(); /* finish_lock_switch() */
+	return p->on_rq || p->on_cpu;
 }
 
 /*
@@ -2337,9 +2338,9 @@ static int select_fallback_rq(int cpu, s
  * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
  */
 static inline
-int select_task_rq(struct rq *rq, struct task_struct *p, int sd_flags, int wake_flags)
+int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
 {
-	int cpu = p->sched_class->select_task_rq(rq, p, sd_flags, wake_flags);
+	int cpu = p->sched_class->select_task_rq(p, sd_flags, wake_flags);
 
 	/*
 	 * In order not to call set_task_cpu() on a blocking task we need
@@ -2484,7 +2485,7 @@ static int try_to_wake_up(struct task_st
 		en_flags |= ENQUEUE_WAKING;
 	}
 
-	cpu = select_task_rq(rq, p, SD_BALANCE_WAKE, wake_flags);
+	cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
 	if (cpu != orig_cpu)
 		set_task_cpu(p, cpu);
 	__task_rq_unlock(rq);
@@ -2680,24 +2681,17 @@ void wake_up_new_task(struct task_struct
 {
 	unsigned long flags;
 	struct rq *rq;
-	int cpu __maybe_unused = get_cpu();
 
 #ifdef CONFIG_SMP
 	rq = task_rq_lock(p, &flags);
-	p->state = TASK_WAKING;
 
 	/*
 	 * Fork balancing, do it here and not earlier because:
 	 *  - cpus_allowed can change in the fork path
 	 *  - any previously selected cpu might disappear through hotplug
-	 *
-	 * We set TASK_WAKING so that select_task_rq() can drop rq->lock
-	 * without people poking at ->cpus_allowed.
 	 */
-	cpu = select_task_rq(rq, p, SD_BALANCE_FORK, 0);
-	set_task_cpu(p, cpu);
+	set_task_cpu(p, select_task_rq(p, SD_BALANCE_FORK, 0));
 
-	p->state = TASK_RUNNING;
 	task_rq_unlock(rq, &flags);
 #endif
 
@@ -2710,7 +2704,6 @@ void wake_up_new_task(struct task_struct
 		p->sched_class->task_woken(rq, p);
 #endif
 	task_rq_unlock(rq, &flags);
-	put_cpu();
 }
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
@@ -3416,27 +3409,22 @@ void sched_exec(void)
 {
 	struct task_struct *p = current;
 	unsigned long flags;
-	struct rq *rq;
 	int dest_cpu;
 
-	rq = task_rq_lock(p, &flags);
-	dest_cpu = p->sched_class->select_task_rq(rq, p, SD_BALANCE_EXEC, 0);
+	raw_spin_lock_irqsave(&p->pi_lock, flags);
+	dest_cpu = p->sched_class->select_task_rq(p, SD_BALANCE_EXEC, 0);
 	if (dest_cpu == smp_processor_id())
 		goto unlock;
 
-	/*
-	 * select_task_rq() can race against ->cpus_allowed
-	 */
-	if (cpumask_test_cpu(dest_cpu, &p->cpus_allowed) &&
-	    likely(cpu_active(dest_cpu)) && migrate_task(p, rq)) {
+	if (likely(cpu_active(dest_cpu)) && need_migrate_task(p)) {
 		struct migration_arg arg = { p, dest_cpu };
 
-		task_rq_unlock(rq, &flags);
-		stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
+		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+		stop_one_cpu(task_cpu(p), migration_cpu_stop, &arg);
 		return;
 	}
 unlock:
-	task_rq_unlock(rq, &flags);
+	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 }
 
 #endif
@@ -5681,7 +5669,7 @@ int set_cpus_allowed_ptr(struct task_str
 		goto out;
 
 	dest_cpu = cpumask_any_and(cpu_active_mask, new_mask);
-	if (migrate_task(p, rq)) {
+	if (need_migrate_task(p)) {
 		struct migration_arg arg = { p, dest_cpu };
 		/* Need help from migration thread: drop lock and wait. */
 		__task_rq_unlock(rq);
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1623,7 +1623,7 @@ static int select_idle_sibling(struct ta
  * preempt must be disabled.
  */
 static int
-select_task_rq_fair(struct rq *rq, struct task_struct *p, int sd_flag, int wake_flags)
+select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 {
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
 	int cpu = smp_processor_id();
Index: linux-2.6/kernel/sched_idletask.c
===================================================================
--- linux-2.6.orig/kernel/sched_idletask.c
+++ linux-2.6/kernel/sched_idletask.c
@@ -7,7 +7,7 @@
 
 #ifdef CONFIG_SMP
 static int
-select_task_rq_idle(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
+select_task_rq_idle(struct task_struct *p, int sd_flag, int flags)
 {
 	return task_cpu(p); /* IDLE tasks as never migrated */
 }
Index: linux-2.6/kernel/sched_rt.c
===================================================================
--- linux-2.6.orig/kernel/sched_rt.c
+++ linux-2.6/kernel/sched_rt.c
@@ -973,11 +973,18 @@ static void yield_task_rt(struct rq *rq)
 static int find_lowest_rq(struct task_struct *task);
 
 static int
-select_task_rq_rt(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
+select_task_rq_rt(struct task_struct *p, int sd_flag, int flags)
 {
 	if (sd_flag != SD_BALANCE_WAKE)
 		return smp_processor_id();
 
+#if 0
+	/*
+	 * XXX without holding rq->lock the below is racy, need to
+	 * rewrite it in a racy but non-dangerous way so that we mostly
+	 * get the benefit of the heuristic but don't crash the kernel
+	 * if we get it wrong ;-)
+	 */
 	/*
 	 * If the current task is an RT task, then
 	 * try to see if we can wake this RT task up on another
@@ -1002,6 +1009,7 @@ select_task_rq_rt(struct rq *rq, struct 
 
 		return (cpu == -1) ? task_cpu(p) : cpu;
 	}
+#endif
 
 	/*
 	 * Otherwise, just let it ride on the affined RQ and the
Index: linux-2.6/kernel/sched_stoptask.c
===================================================================
--- linux-2.6.orig/kernel/sched_stoptask.c
+++ linux-2.6/kernel/sched_stoptask.c
@@ -9,8 +9,7 @@
 
 #ifdef CONFIG_SMP
 static int
-select_task_rq_stop(struct rq *rq, struct task_struct *p,
-		    int sd_flag, int flags)
+select_task_rq_stop(struct task_struct *p, int sd_flag, int flags)
 {
 	return task_cpu(p); /* stop tasks as never migrate */
 }




* [RFC][PATCH 09/17] sched: Remove rq argument to sched_class::task_waking()
  2010-12-24 12:23 [RFC][PATCH 00/17] sched: Reduce runqueue lock contention -v3 Peter Zijlstra
                   ` (7 preceding siblings ...)
  2010-12-24 12:23 ` [RFC][PATCH 08/17] sched: Drop the rq argument to sched_class::select_task_rq() Peter Zijlstra
@ 2010-12-24 12:23 ` Peter Zijlstra
  2010-12-24 12:23 ` [RFC][PATCH 10/17] sched: Add TASK_WAKING to task_rq_lock Peter Zijlstra
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2010-12-24 12:23 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-task_waking.patch --]
[-- Type: text/plain, Size: 2230 bytes --]

In preparation for calling this without rq->lock held, remove the
dependency on the rq argument.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |   10 +++++++---
 kernel/sched.c        |    2 +-
 kernel/sched_fair.c   |    4 +++-
 3 files changed, 11 insertions(+), 5 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1045,8 +1045,12 @@ struct sched_domain;
 #define WF_FORK		0x02		/* child wakeup after fork */
 
 #define ENQUEUE_WAKEUP		1
-#define ENQUEUE_WAKING		2
-#define ENQUEUE_HEAD		4
+#define ENQUEUE_HEAD		2
+#ifdef CONFIG_SMP
+#define ENQUEUE_WAKING		4	/* sched_class::task_waking was called */
+#else
+#define ENQUEUE_WAKING		0
+#endif
 
 #define DEQUEUE_SLEEP		1
 
@@ -1067,7 +1071,7 @@ struct sched_class {
 
 	void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
 	void (*post_schedule) (struct rq *this_rq);
-	void (*task_waking) (struct rq *this_rq, struct task_struct *task);
+	void (*task_waking) (struct task_struct *task);
 	void (*task_woken) (struct rq *this_rq, struct task_struct *task);
 
 	void (*set_cpus_allowed)(struct task_struct *p,
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2481,7 +2481,7 @@ static int try_to_wake_up(struct task_st
 	p->state = TASK_WAKING;
 
 	if (p->sched_class->task_waking) {
-		p->sched_class->task_waking(rq, p);
+		p->sched_class->task_waking(p);
 		en_flags |= ENQUEUE_WAKING;
 	}
 
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1338,11 +1338,13 @@ static void yield_task_fair(struct rq *r
 
 #ifdef CONFIG_SMP
 
-static void task_waking_fair(struct rq *rq, struct task_struct *p)
+static void task_waking_fair(struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
+	lockdep_assert_held(&task_rq(p)->lock);
+
 	se->vruntime -= cfs_rq->min_vruntime;
 }
 




* [RFC][PATCH 10/17] sched: Add TASK_WAKING to task_rq_lock
  2010-12-24 12:23 [RFC][PATCH 00/17] sched: Reduce runqueue lock contention -v3 Peter Zijlstra
                   ` (8 preceding siblings ...)
  2010-12-24 12:23 ` [RFC][PATCH 09/17] sched: Remove rq argument to sched_class::task_waking() Peter Zijlstra
@ 2010-12-24 12:23 ` Peter Zijlstra
  2010-12-24 12:23 ` [RFC][PATCH 11/17] sched: Delay task_contributes_to_load() Peter Zijlstra
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2010-12-24 12:23 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-task_rq_lock.patch --]
[-- Type: text/plain, Size: 1944 bytes --]

In order to be able to call set_task_cpu() without holding the
appropriate rq->lock during ttwu(), add a TASK_WAKING clause to the
task_rq_lock() primitive.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |   14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -928,12 +928,13 @@ static inline void finish_lock_switch(st
 #endif /* __ARCH_WANT_UNLOCKED_CTXSW */
 
 /*
- * Check whether the task is waking, we use this to synchronize ->cpus_allowed
- * against ttwu().
+ * In order to be able to call set_task_cpu() without holding the current
+ * task_rq(p)->lock during wake-ups we need to serialize on something else,
+ * use the wakeup task state.
  */
 static inline int task_is_waking(struct task_struct *p)
 {
-	return unlikely(p->state == TASK_WAKING);
+	return p->state == TASK_WAKING;
 }
 
 /*
@@ -948,7 +949,7 @@ static inline struct rq *__task_rq_lock(
 	for (;;) {
 		rq = task_rq(p);
 		raw_spin_lock(&rq->lock);
-		if (likely(rq == task_rq(p)))
+		if (likely(rq == task_rq(p) && !task_is_waking(p)))
 			return rq;
 		raw_spin_unlock(&rq->lock);
 	}
@@ -956,8 +957,7 @@ static inline struct rq *__task_rq_lock(
 
 /*
  * task_rq_lock - lock the runqueue a given task resides on and disable
- * interrupts. Note the ordering: we can safely lookup the task_rq without
- * explicitly disabling preemption.
+ * interrupts.
  */
 static struct rq *task_rq_lock(struct task_struct *p, unsigned long *flags)
 	__acquires(rq->lock)
@@ -968,7 +968,7 @@ static struct rq *task_rq_lock(struct ta
 		local_irq_save(*flags);
 		rq = task_rq(p);
 		raw_spin_lock(&rq->lock);
-		if (likely(rq == task_rq(p)))
+		if (likely(rq == task_rq(p) && !task_is_waking(p)))
 			return rq;
 		raw_spin_unlock_irqrestore(&rq->lock, *flags);
 	}




* [RFC][PATCH 11/17] sched: Delay task_contributes_to_load()
  2010-12-24 12:23 [RFC][PATCH 00/17] sched: Reduce runqueue lock contention -v3 Peter Zijlstra
                   ` (9 preceding siblings ...)
  2010-12-24 12:23 ` [RFC][PATCH 10/17] sched: Add TASK_WAKING to task_rq_lock Peter Zijlstra
@ 2010-12-24 12:23 ` Peter Zijlstra
  2010-12-24 12:23 ` [RFC][PATCH 12/17] sched: Also serialize ttwu_local() with p->pi_lock Peter Zijlstra
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2010-12-24 12:23 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-ttwu-contribute-to-load.patch --]
[-- Type: text/plain, Size: 1771 bytes --]

In preparation for having to call task_contributes_to_load() without
holding rq->lock, we need to store the result until we do hold it and
can update the rq accounting accordingly.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |    1 +
 kernel/sched.c        |   16 ++++------------
 2 files changed, 5 insertions(+), 12 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1264,6 +1264,7 @@ struct task_struct {
 
 	/* Revert to default priority/policy when forking */
 	unsigned sched_reset_on_fork:1;
+	unsigned sched_contributes_to_load:1;
 
 	pid_t pid;
 	pid_t tgid;
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2467,18 +2467,7 @@ static int try_to_wake_up(struct task_st
 	if (unlikely(task_running(rq, p)))
 		goto out_activate;
 
-	/*
-	 * In order to handle concurrent wakeups and release the rq->lock
-	 * we put the task in TASK_WAKING state.
-	 *
-	 * First fix up the nr_uninterruptible count:
-	 */
-	if (task_contributes_to_load(p)) {
-		if (likely(cpu_online(orig_cpu)))
-			rq->nr_uninterruptible--;
-		else
-			this_rq()->nr_uninterruptible--;
-	}
+	p->sched_contributes_to_load = !!task_contributes_to_load(p);
 	p->state = TASK_WAKING;
 
 	if (p->sched_class->task_waking) {
@@ -2503,6 +2492,9 @@ static int try_to_wake_up(struct task_st
 	WARN_ON(task_cpu(p) != cpu);
 	WARN_ON(p->state != TASK_WAKING);
 
+	if (p->sched_contributes_to_load)
+		rq->nr_uninterruptible--;
+
 out_activate:
 #endif /* CONFIG_SMP */
 	activate_task(rq, p, en_flags);




* [RFC][PATCH 12/17] sched: Also serialize ttwu_local() with p->pi_lock
  2010-12-24 12:23 [RFC][PATCH 00/17] sched: Reduce runqueue lock contention -v3 Peter Zijlstra
                   ` (10 preceding siblings ...)
  2010-12-24 12:23 ` [RFC][PATCH 11/17] sched: Delay task_contributes_to_load() Peter Zijlstra
@ 2010-12-24 12:23 ` Peter Zijlstra
  2011-01-03 17:32   ` Oleg Nesterov
  2010-12-24 12:23 ` [RFC][PATCH 13/17] sched: Remove rq->lock from the first half of ttwu() Peter Zijlstra
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2010-12-24 12:23 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-ttwu_local.patch --]
[-- Type: text/plain, Size: 2481 bytes --]

Since we now serialize ttwu() using p->pi_lock, we also need to
serialize ttwu_local() using it; otherwise, once we drop the
rq->lock in ttwu(), it can race with ttwu_local().
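
Because p->pi_lock nests outside rq->lock, try_to_wake_up_local() has to
drop and re-take the rq->lock to get the ordering right (illustration
only, see the hunk below):

    raw_spin_unlock(&rq->lock);     /* can't take pi_lock under rq->lock */
    raw_spin_lock(&p->pi_lock);
    raw_spin_lock(&rq->lock);       /* re-take in the correct order */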

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |   28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2513,9 +2513,9 @@ static int try_to_wake_up(struct task_st
  * try_to_wake_up_local - try to wake up a local task with rq lock held
  * @p: the thread to be awakened
  *
- * Put @p on the run-queue if it's not alredy there.  The caller must
+ * Put @p on the run-queue if it's not alredy there. The caller must
  * ensure that this_rq() is locked, @p is bound to this_rq() and not
- * the current task.  this_rq() stays locked over invocation.
+ * the current task.
  */
 static void try_to_wake_up_local(struct task_struct *p)
 {
@@ -2523,16 +2523,21 @@ static void try_to_wake_up_local(struct 
 
 	BUG_ON(rq != this_rq());
 	BUG_ON(p == current);
-	lockdep_assert_held(&rq->lock);
+
+	raw_spin_unlock(&rq->lock);
+	raw_spin_lock(&p->pi_lock);
+	raw_spin_lock(&rq->lock);
 
 	if (!(p->state & TASK_NORMAL))
-		return;
+		goto out;
 
 	if (!p->on_rq)
 		activate_task(rq, p, ENQUEUE_WAKEUP);
 
 	ttwu_post_activation(p, rq, 0);
 	ttwu_stat(rq, p, smp_processor_id(), 0);
+out:
+	raw_spin_unlock(&p->pi_lock);
 }
 
 /**
@@ -3925,6 +3930,7 @@ pick_next_task(struct rq *rq)
  */
 asmlinkage void __sched schedule(void)
 {
+	struct task_struct *to_wakeup = NULL;
 	struct task_struct *prev, *next;
 	unsigned long *switch_count;
 	struct rq *rq;
@@ -3958,21 +3964,21 @@ asmlinkage void __sched schedule(void)
 			 * task to maintain concurrency.  If so, wake
 			 * up the task.
 			 */
-			if (prev->flags & PF_WQ_WORKER) {
-				struct task_struct *to_wakeup;
-
+			if (prev->flags & PF_WQ_WORKER)
 				to_wakeup = wq_worker_sleeping(prev, cpu);
-				if (to_wakeup)
-					try_to_wake_up_local(to_wakeup);
-			}
 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
 			prev->on_rq = 0;
 		}
 		switch_count = &prev->nvcsw;
 	}
 
+	/*
+	 * All three: try_to_wake_up_local(), pre_schedule() and idle_balance()
+	 * can drop rq->lock.
+	 */
+	if (to_wakeup)
+		try_to_wake_up_local(to_wakeup);
 	pre_schedule(rq, prev);
-
 	if (unlikely(!rq->nr_running))
 		idle_balance(cpu, rq);
 



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [RFC][PATCH 13/17] sched: Remove rq->lock from the first half of ttwu()
  2010-12-24 12:23 [RFC][PATCH 00/17] sched: Reduce runqueue lock contention -v3 Peter Zijlstra
                   ` (11 preceding siblings ...)
  2010-12-24 12:23 ` [RFC][PATCH 12/17] sched: Also serialize ttwu_local() with p->pi_lock Peter Zijlstra
@ 2010-12-24 12:23 ` Peter Zijlstra
  2010-12-24 12:23 ` [RFC][PATCH 14/17] sched: Remove rq argument to ttwu_stat() Peter Zijlstra
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2010-12-24 12:23 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-ttwu-optimize.patch --]
[-- Type: text/plain, Size: 4151 bytes --]

Currently ttwu() does two rq->lock acquisitions: first on the task's
old rq, holding it over the p->state fiddling and the load-balance
pass, then it drops the old rq->lock to acquire the new rq->lock.

By having serialized ttwu(), p->sched_class, p->cpus_allowed with
p->pi_lock, we can now drop the whole first rq->lock acquisition.

The p->pi_lock serializing concurrent ttwu() calls protects p->state,
which we will set to TASK_WAKING to bridge possible p->pi_lock to
rq->lock gaps and serialize set_task_cpu() calls against
task_rq_lock().

The p->pi_lock serialization of p->sched_class allows us to call
scheduling class methods without holding the rq->lock, and the
serialization of p->cpus_allowed allows us to do the load-balancing
bits without races.
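
The publish-then-wait step can be illustrated in userspace with C11 atomics
standing in for p->state, smp_mb() and raw_spin_unlock_wait(); this is a
simplified sketch (it waits until the lock word reads free, and the names
are invented for the example), not part of the patch:

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int rq_lock;      /* 1 == held, 0 == free */
static atomic_int state;        /* stands in for p->state */

static void *waker(void *arg)
{
        atomic_store(&state, 2);                        /* "TASK_WAKING" */
        atomic_thread_fence(memory_order_seq_cst);      /* the smp_mb() */

        /* simplified unlock_wait(): spin until the lock word reads free */
        while (atomic_load(&rq_lock))
                ;

        printf("old lock holder is done, safe to act on the task\n");
        return NULL;
}

int main(void)
{
        pthread_t t;

        atomic_store(&rq_lock, 1);      /* someone still holds the old rq->lock */
        pthread_create(&t, NULL, waker, NULL);
        atomic_store(&rq_lock, 0);      /* ... and releases it */
        pthread_join(t, NULL);
        return 0;
}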

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c      |   57 ++++++++++++++++++++++++++--------------------------
 kernel/sched_fair.c |    3 --
 2 files changed, 30 insertions(+), 30 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2440,69 +2440,70 @@ ttwu_post_activation(struct task_struct 
  * Returns %true if @p was woken up, %false if it was already running
  * or @state didn't match @p's state.
  */
-static int try_to_wake_up(struct task_struct *p, unsigned int state,
-			  int wake_flags)
+static int
+try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 {
-	int cpu, orig_cpu, this_cpu, success = 0;
+	int cpu, this_cpu, success = 0;
 	unsigned long flags;
-	unsigned long en_flags = ENQUEUE_WAKEUP;
 	struct rq *rq;
 
 	this_cpu = get_cpu();
 
 	smp_wmb();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
-	rq = __task_rq_lock(p);
 	if (!(p->state & state))
 		goto out;
 
 	cpu = task_cpu(p);
 
-	if (p->on_rq)
-		goto out_running;
+	if (p->on_rq) {
+		rq = __task_rq_lock(p);
+		if (p->on_rq)
+			goto out_running;
+		__task_rq_unlock(rq);
+	}
 
-	orig_cpu = cpu;
 #ifdef CONFIG_SMP
-	if (unlikely(task_running(rq, p)))
-		goto out_activate;
+	while (p->on_cpu)
+		cpu_relax();
 
 	p->sched_contributes_to_load = !!task_contributes_to_load(p);
 	p->state = TASK_WAKING;
+	/*
+	 * Separate the TASK_WAKING write from the rq->lock unlock wait.
+	 *
+	 * We need to wait for the current rq->lock owner to finish because
+	 * existing task_rq_lock() holders will not have observed
+	 * TASK_WAKING yet and we don't want to move the task from under
+	 * their feet.
+	 */
+	smp_mb();
+	raw_spin_unlock_wait(&task_rq(p)->lock);
 
-	if (p->sched_class->task_waking) {
+	if (p->sched_class->task_waking)
 		p->sched_class->task_waking(p);
-		en_flags |= ENQUEUE_WAKING;
-	}
 
 	cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
-	if (cpu != orig_cpu)
-		set_task_cpu(p, cpu);
-	__task_rq_unlock(rq);
+#endif /* CONFIG_SMP */
 
 	rq = cpu_rq(cpu);
 	raw_spin_lock(&rq->lock);
 
-	/*
-	 * We migrated the task without holding either rq->lock, however
-	 * since the task is not on the task list itself, nobody else
-	 * will try and migrate the task, hence the rq should match the
-	 * cpu we just moved it to.
-	 */
-	WARN_ON(task_cpu(p) != cpu);
-	WARN_ON(p->state != TASK_WAKING);
+#ifdef CONFIG_SMP
+	if (cpu != task_cpu(p))
+		set_task_cpu(p, cpu);
 
 	if (p->sched_contributes_to_load)
 		rq->nr_uninterruptible--;
+#endif
 
-out_activate:
-#endif /* CONFIG_SMP */
-	activate_task(rq, p, en_flags);
+	activate_task(rq, p, ENQUEUE_WAKEUP | ENQUEUE_WAKING);
 out_running:
 	ttwu_post_activation(p, rq, wake_flags);
 	ttwu_stat(rq, p, cpu, wake_flags);
 	success = 1;
-out:
 	__task_rq_unlock(rq);
+out:
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 	put_cpu();
 
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1343,8 +1343,7 @@ static void task_waking_fair(struct task
 	struct sched_entity *se = &p->se;
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
-	lockdep_assert_held(&task_rq(p)->lock);
-
+	// XXX racy on 32bit
 	se->vruntime -= cfs_rq->min_vruntime;
 }
 



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [RFC][PATCH 14/17] sched: Remove rq argument to ttwu_stat()
  2010-12-24 12:23 [RFC][PATCH 00/17] sched: Reduce runqueue lock contention -v3 Peter Zijlstra
                   ` (12 preceding siblings ...)
  2010-12-24 12:23 ` [RFC][PATCH 13/17] sched: Remove rq->lock from the first half of ttwu() Peter Zijlstra
@ 2010-12-24 12:23 ` Peter Zijlstra
  2010-12-29 14:40   ` Yong Zhang
  2010-12-24 12:23 ` [RFC][PATCH 15/17] sched: Rename ttwu_post_activation Peter Zijlstra
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2010-12-24 12:23 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-ttwu_stat-rq.patch --]
[-- Type: text/plain, Size: 1325 bytes --]


Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2367,10 +2367,11 @@ static void update_avg(u64 *avg, u64 sam
 #endif
 
 static void
-ttwu_stat(struct rq *rq, struct task_struct *p, int cpu, int wake_flags)
+ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
 {
 #ifdef CONFIG_SCHEDSTATS
 	int this_cpu = smp_processor_id();
+	struct rq *rq = this_rq();
 
 	schedstat_inc(rq, ttwu_count);
 	schedstat_inc(p, se.statistics.nr_wakeups);
@@ -2491,9 +2492,10 @@ try_to_wake_up(struct task_struct *p, un
 	activate_task(rq, p, ENQUEUE_WAKEUP | ENQUEUE_WAKING);
 out_running:
 	ttwu_post_activation(p, rq, wake_flags);
-	ttwu_stat(rq, p, cpu, wake_flags);
 	success = 1;
 	__task_rq_unlock(rq);
+
+	ttwu_stat(p, cpu, wake_flags);
 out:
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 	put_cpu();
@@ -2527,7 +2529,7 @@ static void try_to_wake_up_local(struct 
 		activate_task(rq, p, ENQUEUE_WAKEUP);
 
 	ttwu_post_activation(p, rq, 0);
-	ttwu_stat(rq, p, smp_processor_id(), 0);
+	ttwu_stat(p, smp_processor_id(), 0);
 out:
 	raw_spin_unlock(&p->pi_lock);
 }



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [RFC][PATCH 15/17] sched: Rename ttwu_post_activation
  2010-12-24 12:23 [RFC][PATCH 00/17] sched: Reduce runqueue lock contention -v3 Peter Zijlstra
                   ` (13 preceding siblings ...)
  2010-12-24 12:23 ` [RFC][PATCH 14/17] sched: Remove rq argument to ttwu_stat() Peter Zijlstra
@ 2010-12-24 12:23 ` Peter Zijlstra
  2010-12-24 12:23 ` [RFC][PATCH 16/17] sched: Move the second half of ttwu() to the remote cpu Peter Zijlstra
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2010-12-24 12:23 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-ttwu_do_wakeup.patch --]
[-- Type: text/plain, Size: 1367 bytes --]

ttwu_post_activation() does the core wakeup: it sets TASK_RUNNING
and performs wakeup-preemption, so give it a more descriptive name.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2399,8 +2399,11 @@ ttwu_stat(struct task_struct *p, int cpu
 #endif /* CONFIG_SCHEDSTATS */
 }
 
+/*
+ * Mark the task runnable and perform wakeup-preemption.
+ */
 static void
-ttwu_post_activation(struct task_struct *p, struct rq *rq, int wake_flags)
+ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
 {
 	trace_sched_wakeup(p, true);
 	check_preempt_curr(rq, p, wake_flags);
@@ -2492,7 +2495,7 @@ try_to_wake_up(struct task_struct *p, un
 
 	activate_task(rq, p, ENQUEUE_WAKEUP | ENQUEUE_WAKING);
 out_running:
-	ttwu_post_activation(p, rq, wake_flags);
+	ttwu_do_wakeup(rq, p, wake_flags);
 	success = 1;
 	__task_rq_unlock(rq);
 
@@ -2529,7 +2532,7 @@ static void try_to_wake_up_local(struct 
 	if (!p->on_rq)
 		activate_task(rq, p, ENQUEUE_WAKEUP);
 
-	ttwu_post_activation(p, rq, 0);
+	ttwu_do_wakeup(rq, p, 0);
 	ttwu_stat(p, smp_processor_id(), 0);
 out:
 	raw_spin_unlock(&p->pi_lock);



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [RFC][PATCH 16/17] sched: Move the second half of ttwu() to the remote cpu
  2010-12-24 12:23 [RFC][PATCH 00/17] sched: Reduce runqueue lock contention -v3 Peter Zijlstra
                   ` (14 preceding siblings ...)
  2010-12-24 12:23 ` [RFC][PATCH 15/17] sched: Rename ttwu_post_activation Peter Zijlstra
@ 2010-12-24 12:23 ` Peter Zijlstra
  2011-01-03 14:36   ` [RFC][PATCH] sembench: add stddev to the burn stats Peter Zijlstra
  2011-01-04 14:28   ` [RFC][PATCH 16/17] sched: Move the second half of ttwu() to the remote cpu Oleg Nesterov
  2010-12-24 12:23 ` [RFC][PATCH 17/17] sched: Sort hotplug vs ttwu queueing Peter Zijlstra
  2010-12-24 13:15 ` [RFC][PATCH 00/17] sched: Reduce runqueue lock contention -v3 Peter Zijlstra
  17 siblings, 2 replies; 59+ messages in thread
From: Peter Zijlstra @ 2010-12-24 12:23 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-ttwu-queue-remote.patch --]
[-- Type: text/plain, Size: 6527 bytes --]

Now that we've removed the rq->lock requirement from the first part of
ttwu() and can compute placement without holding any rq->lock, ensure
we execute the second half of ttwu() on the actual cpu we want the
task to run on.

This avoids having to take the remote rq->lock and do the task enqueue
remotely, saving a lot of cacheline transfers.

As measured using: http://oss.oracle.com/~mason/sembench.c

$ echo 4096 32000 64 128 > /proc/sys/kernel/sem
$ ./sembench -t 2048 -w 1900 -o 0

unpatched: run time 30 seconds 537953 worker burns per second
patched:   run time 30 seconds 657336 worker burns per second

Still need to sort out all the races marked XXX (non-trivial), and it's
x86-only for the moment.
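
The wake_list queueing below boils down to a lock-free single-linked push
with cmpxchg on the waking cpu and an xchg(NULL) drain on the target cpu.
A minimal userspace sketch of just that pattern, using C11 atomics and a
made-up task type (illustration only, not the kernel code):

#include <stdatomic.h>
#include <stdio.h>

struct task {
        struct task *wake_entry;
        int id;
};

static _Atomic(struct task *) wake_list;

/* producer: push one task; returns 1 if the list was empty (send the IPI) */
static int queue_remote(struct task *p)
{
        struct task *old = NULL;

        do {
                p->wake_entry = old;    /* link to the head we expect */
        } while (!atomic_compare_exchange_weak(&wake_list, &old, p));
        /* on failure 'old' is refreshed to the current head and we retry */

        return old == NULL;
}

/* consumer: grab the whole list in one go and process it */
static void drain_pending(void)
{
        struct task *list = atomic_exchange(&wake_list, (struct task *)NULL);

        while (list) {
                struct task *p = list;

                list = list->wake_entry;
                printf("activating task %d\n", p->id);
        }
}

int main(void)
{
        struct task a = { .id = 1 }, b = { .id = 2 };

        if (queue_remote(&a))
                printf("list was empty -> would send the reschedule IPI\n");
        queue_remote(&b);
        drain_pending();
        return 0;
}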

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/kernel/smp.c   |    1 
 include/linux/sched.h   |    2 
 kernel/sched.c          |  143 ++++++++++++++++++++++++++++++++++++------------
 kernel/sched_features.h |    2 
 4 files changed, 114 insertions(+), 34 deletions(-)

Index: linux-2.6/arch/x86/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smp.c
+++ linux-2.6/arch/x86/kernel/smp.c
@@ -205,6 +205,7 @@ void smp_reschedule_interrupt(struct pt_
 	/*
 	 * KVM uses this interrupt to force a cpu out of guest mode
 	 */
+	sched_ttwu_pending();
 }
 
 void smp_call_function_interrupt(struct pt_regs *regs)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1020,6 +1020,7 @@ partition_sched_domains(int ndoms_new, c
 }
 #endif	/* !CONFIG_SMP */
 
+void sched_ttwu_pending(void);
 
 struct io_context;			/* See blkdev.h */
 
@@ -1201,6 +1202,7 @@ struct task_struct {
 	int lock_depth;		/* BKL lock depth */
 
 #ifdef CONFIG_SMP
+	struct task_struct *wake_entry;
 	int on_cpu;
 #endif
 	int on_rq;
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -559,6 +559,10 @@ struct rq {
 	/* BKL stats */
 	unsigned int bkl_count;
 #endif
+
+#ifdef CONFIG_SMP
+	struct task_struct *wake_list;
+#endif
 };
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
@@ -2429,6 +2433,100 @@ ttwu_do_wakeup(struct rq *rq, struct tas
 		wq_worker_waking_up(p, cpu_of(rq));
 }
 
+static void
+ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags)
+{
+#ifdef CONFIG_SMP
+	if (task_cpu(p) != cpu_of(rq))
+		set_task_cpu(p, cpu_of(rq));
+#endif
+
+	if (p->sched_contributes_to_load)
+		rq->nr_uninterruptible--;
+
+	activate_task(rq, p, ENQUEUE_WAKEUP | ENQUEUE_WAKING);
+	ttwu_do_wakeup(rq, p, wake_flags);
+}
+
+/*
+ * Called in case the task @p isn't fully descheduled from its runqueue,
+ * in this case we must do a remote wakeup. Its a 'light' wakeup though,
+ * since all we need to do is flip p->state to TASK_RUNNING, since
+ * the task is still ->on_rq.
+ */
+static int ttwu_remote(struct task_struct *p, int wake_flags)
+{
+	struct rq *rq;
+	int ret = 0;
+
+	rq = __task_rq_lock(p);
+	if (p->on_rq) {
+		ttwu_do_wakeup(rq, p, wake_flags);
+		ttwu_stat(p, task_cpu(p), wake_flags);
+		ret = 1;
+	}
+	__task_rq_unlock(rq);
+
+	return ret;
+}
+
+void sched_ttwu_pending(void)
+{
+#ifdef CONFIG_SMP
+	struct rq *rq = this_rq();
+	struct task_struct *list = xchg(&rq->wake_list, NULL);
+
+	if (!list)
+		return;
+
+	raw_spin_lock(&rq->lock);
+
+	while (list) {
+		struct task_struct *p = list;
+		list = list->wake_entry;
+		ttwu_do_activate(rq, p, 0);
+	}
+
+	raw_spin_unlock(&rq->lock);
+#endif
+}
+
+#ifdef CONFIG_SMP
+static void ttwu_queue_remote(struct task_struct *p, int cpu)
+{
+	struct task_struct *next = NULL;
+	struct rq *rq = cpu_rq(cpu);
+
+	for (;;) {
+		struct task_struct *old = next;
+
+		p->wake_entry = next;
+		next = cmpxchg(&rq->wake_list, old, p);
+		if (next == old)
+			break;
+	}
+
+	if (!next)
+		smp_send_reschedule(cpu);
+}
+#endif
+
+static void ttwu_queue(struct task_struct *p, int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+#ifdef CONFIG_SMP
+	if (!sched_feat(TTWU_FORCE_REMOTE) && cpu != smp_processor_id()) {
+		ttwu_queue_remote(p, cpu);
+		return;
+	}
+#endif
+
+	raw_spin_lock(&rq->lock);
+	ttwu_do_activate(rq, p, 0);
+	raw_spin_unlock(&rq->lock);
+}
+
 /**
  * try_to_wake_up - wake up a thread
  * @p: the thread to be awakened
@@ -2447,29 +2545,18 @@ ttwu_do_wakeup(struct rq *rq, struct tas
 static int
 try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 {
-	int cpu, this_cpu, success = 0;
 	unsigned long flags;
-	struct rq *rq;
-
-	this_cpu = get_cpu();
+	int cpu, success = 0;
 
 	smp_wmb();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	if (!(p->state & state))
 		goto out;
 
-	cpu = task_cpu(p);
-
-	if (p->on_rq) {
-		rq = __task_rq_lock(p);
-		if (p->on_rq)
-			goto out_running;
-		__task_rq_unlock(rq);
-	}
+	success = 1; /* we're going to change ->state */
 
-#ifdef CONFIG_SMP
-	while (p->on_cpu)
-		cpu_relax();
+	if (p->on_rq && ttwu_remote(p, wake_flags))
+		goto out;
 
 	p->sched_contributes_to_load = !!task_contributes_to_load(p);
 	p->state = TASK_WAKING;
@@ -2482,7 +2569,12 @@ try_to_wake_up(struct task_struct *p, un
 	 * their feet.
 	 */
 	smp_mb();
-	raw_spin_unlock_wait(&task_rq(p)->lock);
+	cpu = task_cpu(p);
+	raw_spin_unlock_wait(&cpu_rq(cpu)->lock);
+
+#ifdef CONFIG_SMP
+	while (p->on_cpu)
+		cpu_relax();
 
 	if (p->sched_class->task_waking)
 		p->sched_class->task_waking(p);
@@ -2490,27 +2582,10 @@ try_to_wake_up(struct task_struct *p, un
 	cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
 #endif /* CONFIG_SMP */
 
-	rq = cpu_rq(cpu);
-	raw_spin_lock(&rq->lock);
-
-#ifdef CONFIG_SMP
-	if (cpu != task_cpu(p))
-		set_task_cpu(p, cpu);
-
-	if (p->sched_contributes_to_load)
-		rq->nr_uninterruptible--;
-#endif
-
-	activate_task(rq, p, ENQUEUE_WAKEUP | ENQUEUE_WAKING);
-out_running:
-	ttwu_do_wakeup(rq, p, wake_flags);
-	success = 1;
-	__task_rq_unlock(rq);
-
+	ttwu_queue(p, cpu);
 	ttwu_stat(p, cpu, wake_flags);
 out:
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
-	put_cpu();
 
 	return success;
 }
Index: linux-2.6/kernel/sched_features.h
===================================================================
--- linux-2.6.orig/kernel/sched_features.h
+++ linux-2.6/kernel/sched_features.h
@@ -64,3 +64,5 @@ SCHED_FEAT(OWNER_SPIN, 1)
  * Decrement CPU power based on irq activity
  */
 SCHED_FEAT(NONIRQ_POWER, 1)
+
+SCHED_FEAT(TTWU_FORCE_REMOTE, 0)



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [RFC][PATCH 17/17] sched: Sort hotplug vs ttwu queueing
  2010-12-24 12:23 [RFC][PATCH 00/17] sched: Reduce runqueue lock contention -v3 Peter Zijlstra
                   ` (15 preceding siblings ...)
  2010-12-24 12:23 ` [RFC][PATCH 16/17] sched: Move the second half of ttwu() to the remote cpu Peter Zijlstra
@ 2010-12-24 12:23 ` Peter Zijlstra
  2010-12-29 14:51   ` Yong Zhang
  2010-12-24 13:15 ` [RFC][PATCH 00/17] sched: Reduce runqueue lock contention -v3 Peter Zijlstra
  17 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2010-12-24 12:23 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-ttwu-hotplug.patch --]
[-- Type: text/plain, Size: 1363 bytes --]


Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |   20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2470,15 +2470,15 @@ static int ttwu_remote(struct task_struc
 	return ret;
 }
 
-void sched_ttwu_pending(void)
+static void __sched_ttwu_pending(struct rq *rq)
 {
 #ifdef CONFIG_SMP
-	struct rq *rq = this_rq();
 	struct task_struct *list = xchg(&rq->wake_list, NULL);
 
 	if (!list)
 		return;
 
+	rq = this_rq(); /* always enqueue locally */
 	raw_spin_lock(&rq->lock);
 
 	while (list) {
@@ -2491,6 +2491,11 @@ void sched_ttwu_pending(void)
 #endif
 }
 
+void sched_ttwu_pending(void)
+{
+	__sched_ttwu_pending(this_rq());
+}
+
 #ifdef CONFIG_SMP
 static void ttwu_queue_remote(struct task_struct *p, int cpu)
 {
@@ -6162,6 +6167,17 @@ migration_call(struct notifier_block *nf
 		migrate_nr_uninterruptible(rq);
 		calc_global_load_remove(rq);
 		break;
+
+	case CPU_DEAD:
+		/*
+		 * Queue any possible remaining pending wakeups on this cpu.
+		 * Load-balancing will sort it out eventually.
+		 */
+		local_irq_save(flags);
+		__sched_ttwu_pending(cpu_rq(cpu));
+		local_irq_restore(flags);
+		break;
+
 #endif
 	}
 	return NOTIFY_OK;



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 00/17] sched: Reduce runqueue lock contention -v3
  2010-12-24 12:23 [RFC][PATCH 00/17] sched: Reduce runqueue lock contention -v3 Peter Zijlstra
                   ` (16 preceding siblings ...)
  2010-12-24 12:23 ` [RFC][PATCH 17/17] sched: Sort hotplug vs ttwu queueing Peter Zijlstra
@ 2010-12-24 13:15 ` Peter Zijlstra
  17 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2010-12-24 13:15 UTC (permalink / raw)
  To: Chris Mason
  Cc: Frank Rowand, Ingo Molnar, Thomas Gleixner, Mike Galbraith,
	Oleg Nesterov, Paul Turner, Jens Axboe, Yong Zhang, linux-kernel

On Fri, 2010-12-24 at 13:23 +0100, Peter Zijlstra wrote:
> This is the latest incarnation of my scary ttwu patch, its split up into
> tiny little pieces because at some point while working on the feedback from
> the last posting it all stopped working and tiny changes was the only way
> to find out wth I done wrong (I never did find out though, when I
> completely rewrote everything in tiny chunks it all magically worked again
> -- so I guess tglx is chuckling now).
> 
> Anyway, it still need a few more loose ends fixed, esp on 32bit there is
> the whole unlocked remote min_vruntime access where we can read partial
> updates and screw the timeline like problem. And I suppose there's a few
> more XXXs in there where I know things are needing TLC.
> 
> Also, I bet the eagle eyed lot Cc'ed will spot more holes, as all this is
> scary stuff :-)
> 
> Anyway, lots of thanks to Frank, Oleg, Ingo and others who commented on
> the last posting.
> 
> I guess most everybody is out having x-mas celebrations (and I will too not
> too long from now), so I expect replies to this won't happen until the new
> year, but who knows, maybe in between eating and drinking too much we all
> get an urge to stare at code.
> 
> Happy Holidays!

Also available from (once the mirror catches up):

git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-sched.git

---
 arch/x86/include/asm/spinlock.h |   26 ++-
 arch/x86/kernel/smp.c           |    1 +
 include/linux/mutex.h           |    2 +-
 include/linux/sched.h           |   23 +-
 kernel/mutex-debug.c            |    2 +-
 kernel/mutex-debug.h            |    2 +-
 kernel/mutex.c                  |    2 +-
 kernel/mutex.h                  |    2 +-
 kernel/sched.c                  |  542 ++++++++++++++++++++++-----------------
 kernel/sched_debug.c            |    2 +-
 kernel/sched_fair.c             |    5 +-
 kernel/sched_features.h         |    2 +
 kernel/sched_idletask.c         |    2 +-
 kernel/sched_rt.c               |   20 +-
 kernel/sched_stoptask.c         |    5 +-
 15 files changed, 371 insertions(+), 267 deletions(-)


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 05/17] x86: Optimize arch_spin_unlock_wait()
  2010-12-24 12:23 ` [RFC][PATCH 05/17] x86: Optimize arch_spin_unlock_wait() Peter Zijlstra
@ 2010-12-24 18:26   ` Linus Torvalds
  2011-01-03 11:32     ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Linus Torvalds @ 2010-12-24 18:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang, linux-kernel, Nick Piggin, Jeremy Fitzhardinge

On Fri, Dec 24, 2010 at 4:23 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> Only wait for the current holder to release the lock.
>
> spin_unlock_wait() can only be about the current holder, since
> completion of this function is inherently racy with new contenders.
> Therefore, there is no reason to wait until the lock is completely
> unlocked.

Is there really any reason for this patch? I'd rather keep the simpler
and more straightforward code unless you have actual numbers.

> +static inline void __ticket_spin_unlock_wait(arch_spinlock_t *lock)
> +{
> +       int tmp = ACCESS_ONCE(lock->slock);
> +
> +       if (!(((tmp >> TICKET_SHIFT) ^ tmp) & TICKET_MASK))
> +               return; /* not locked */
> +
> +       /* wait until the current lock holder goes away */
> +       while ((lock->slock & TICKET_MASK) == (tmp & TICKET_MASK))
> +               cpu_relax();
>  }

Also, the above is just ugly. You've lost the ACCESS_ONCE() on the
lock access, and it's using another model of masking than the regular
one. Both of which may be intentional (maybe you are _trying_ to get
the compiler to just load the low bytes and avoid the 'and'), but the
whole open-coding of the logic - twice, and with different looking
masking - just makes my skin itch.

                                  Linus

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 06/17] sched: Provide p->on_rq
  2010-12-24 12:23 ` [RFC][PATCH 06/17] sched: Provide p->on_rq Peter Zijlstra
@ 2010-12-29 14:14   ` Yong Zhang
  0 siblings, 0 replies; 59+ messages in thread
From: Yong Zhang @ 2010-12-29 14:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	linux-kernel

On Fri, Dec 24, 2010 at 01:23:44PM +0100, Peter Zijlstra wrote:
> Provide a generic p->on_rq because the p->se.on_rq semantics are
> unfavourable for lockless wakeups but needed for sched_fair.
> 
> In particular, p->on_rq is only cleared when we actually dequeue the
> task in schedule() and not on any random dequeue as done by things
> like __migrate_task() and __sched_setscheduler().
> 
> This also allows us to remove p->se usage from !sched_fair code.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  include/linux/sched.h   |    1 +
>  kernel/sched.c          |   36 ++++++++++++++++++------------------
>  kernel/sched_debug.c    |    2 +-
>  kernel/sched_rt.c       |   10 +++++-----
> Index: linux-2.6/kernel/sched_rt.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched_rt.c
> +++ linux-2.6/kernel/sched_rt.c
> @@ -1132,7 +1132,7 @@ static void put_prev_task_rt(struct rq *
>  	 * The previous task needs to be made eligible for pushing
>  	 * if it is still active
>  	 */
> -	if (p->se.on_rq && p->rt.nr_cpus_allowed > 1)
> +	if (p->on_rq && p->rt.nr_cpus_allowed > 1)
>  		enqueue_pushable_task(rq, p);

Seems we need on_rt_rq(&p->rt) here, otherwise we enqueue the
task on the pushable list when called from rt_mutex_setprio()/
__sched_setscheduler() etc., which adds a little overhead.
Though we do call dequeue_pushable_task() in set_curr_task_rt()
unconditionally.

Thanks,
Yong

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 07/17] sched: Serialize p->cpus_allowed and ttwu() using p->pi_lock
  2010-12-24 12:23 ` [RFC][PATCH 07/17] sched: Serialize p->cpus_allowed and ttwu() using p->pi_lock Peter Zijlstra
@ 2010-12-29 14:20   ` Yong Zhang
  2011-01-03 11:12     ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Yong Zhang @ 2010-12-29 14:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	linux-kernel

On Fri, Dec 24, 2010 at 01:23:45PM +0100, Peter Zijlstra wrote:
> Currently p->pi_lock already serializes p->sched_class, also put
> p->cpus_allowed and try_to_wake_up() under it, this prepares the way
> to do the first part of ttwu() without holding rq->lock.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  kernel/sched.c |   37 ++++++++++++++++---------------------
>  1 file changed, 16 insertions(+), 21 deletions(-)
> 
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -2301,7 +2301,7 @@ void task_oncpu_function_call(struct tas
>  
>  #ifdef CONFIG_SMP
>  /*
> - * ->cpus_allowed is protected by either TASK_WAKING or rq->lock held.
> + * ->cpus_allowed is protected by both rq->lock and p->pi_lock
>   */
>  static int select_fallback_rq(int cpu, struct task_struct *p)
>  {
> @@ -2334,7 +2334,7 @@ static int select_fallback_rq(int cpu, s
>  }
>  
>  /*
> - * The caller (fork, wakeup) owns TASK_WAKING, ->cpus_allowed is stable.
> + * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.

Yes for wakeup, but not true for fork.
I don't see protection in wake_up_new_task().
Or am I missing something?

Thanks,
Yong

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 08/17] sched: Drop the rq argument to sched_class::select_task_rq()
  2010-12-24 12:23 ` [RFC][PATCH 08/17] sched: Drop the rq argument to sched_class::select_task_rq() Peter Zijlstra
@ 2010-12-29 14:31   ` Yong Zhang
  2011-01-03 11:16     ` Peter Zijlstra
  2011-01-03 18:05   ` Oleg Nesterov
  1 sibling, 1 reply; 59+ messages in thread
From: Yong Zhang @ 2010-12-29 14:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Steven Rostedt, linux-kernel

On Fri, Dec 24, 2010 at 01:23:46PM +0100, Peter Zijlstra wrote:
> In preparation of calling select_task_rq() without rq->lock held, drop
> the dependency on the rq argument.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
> @@ -3416,27 +3409,22 @@ void sched_exec(void)
>  {
>  	struct task_struct *p = current;
>  	unsigned long flags;
> -	struct rq *rq;
>  	int dest_cpu;
>  
> -	rq = task_rq_lock(p, &flags);
> -	dest_cpu = p->sched_class->select_task_rq(rq, p, SD_BALANCE_EXEC, 0);
> +	raw_spin_lock_irqsave(&p->pi_lock, flags);

Seems this should go to patch 07/17 ;)

> +	dest_cpu = p->sched_class->select_task_rq(p, SD_BALANCE_EXEC, 0);
>  	if (dest_cpu == smp_processor_id())
>  		goto unlock;
>  
> -	/*
> -	 * select_task_rq() can race against ->cpus_allowed
> -	 */
> -	if (cpumask_test_cpu(dest_cpu, &p->cpus_allowed) &&
> -	    likely(cpu_active(dest_cpu)) && migrate_task(p, rq)) {
> +	if (likely(cpu_active(dest_cpu)) && need_migrate_task(p)) {

If we drop rq->lock, need_migrate_task() may return true while
p is already running on another cpu. Thus we do a wrong migration
call.

Thanks,
Yong
>  		struct migration_arg arg = { p, dest_cpu };
>  
> -		task_rq_unlock(rq, &flags);
> -		stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
> +		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
> +		stop_one_cpu(task_cpu(p), migration_cpu_stop, &arg);
>  		return;
>  	}
>  unlock:
> -	task_rq_unlock(rq, &flags);
> +	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
>  }
>  
>  #endif
> @@ -5681,7 +5669,7 @@ int set_cpus_allowed_ptr(struct task_str
>  		goto out;
>  
>  	dest_cpu = cpumask_any_and(cpu_active_mask, new_mask);
> -	if (migrate_task(p, rq)) {
> +	if (need_migrate_task(p)) {
>  		struct migration_arg arg = { p, dest_cpu };
>  		/* Need help from migration thread: drop lock and wait. */
>  		__task_rq_unlock(rq);
> Index: linux-2.6/kernel/sched_fair.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched_fair.c
> +++ linux-2.6/kernel/sched_fair.c
> @@ -1623,7 +1623,7 @@ static int select_idle_sibling(struct ta
>   * preempt must be disabled.
>   */
>  static int
> -select_task_rq_fair(struct rq *rq, struct task_struct *p, int sd_flag, int wake_flags)
> +select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>  {
>  	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
>  	int cpu = smp_processor_id();
> Index: linux-2.6/kernel/sched_idletask.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched_idletask.c
> +++ linux-2.6/kernel/sched_idletask.c
> @@ -7,7 +7,7 @@
>  
>  #ifdef CONFIG_SMP
>  static int
> -select_task_rq_idle(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
> +select_task_rq_idle(struct task_struct *p, int sd_flag, int flags)
>  {
>  	return task_cpu(p); /* IDLE tasks as never migrated */
>  }
> Index: linux-2.6/kernel/sched_rt.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched_rt.c
> +++ linux-2.6/kernel/sched_rt.c
> @@ -973,11 +973,18 @@ static void yield_task_rt(struct rq *rq)
>  static int find_lowest_rq(struct task_struct *task);
>  
>  static int
> -select_task_rq_rt(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
> +select_task_rq_rt(struct task_struct *p, int sd_flag, int flags)
>  {
>  	if (sd_flag != SD_BALANCE_WAKE)
>  		return smp_processor_id();
>  
> +#if 0
> +	/*
> +	 * XXX without holding rq->lock the below is racy, need to
> +	 * rewrite it in a racy but non-dangerous way so that we mostly
> +	 * get the benefit of the heuristic but don't crash the kernel
> +	 * if we get it wrong ;-)
> +	 */
>  	/*
>  	 * If the current task is an RT task, then
>  	 * try to see if we can wake this RT task up on another
> @@ -1002,6 +1009,7 @@ select_task_rq_rt(struct rq *rq, struct 
>  
>  		return (cpu == -1) ? task_cpu(p) : cpu;
>  	}
> +#endif
>  
>  	/*
>  	 * Otherwise, just let it ride on the affined RQ and the
> Index: linux-2.6/kernel/sched_stoptask.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched_stoptask.c
> +++ linux-2.6/kernel/sched_stoptask.c
> @@ -9,8 +9,7 @@
>  
>  #ifdef CONFIG_SMP
>  static int
> -select_task_rq_stop(struct rq *rq, struct task_struct *p,
> -		    int sd_flag, int flags)
> +select_task_rq_stop(struct task_struct *p, int sd_flag, int flags)
>  {
>  	return task_cpu(p); /* stop tasks as never migrate */
>  }

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 14/17] sched: Remove rq argument to ttwu_stat()
  2010-12-24 12:23 ` [RFC][PATCH 14/17] sched: Remove rq argument to ttwu_stat() Peter Zijlstra
@ 2010-12-29 14:40   ` Yong Zhang
  2011-01-03 11:20     ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Yong Zhang @ 2010-12-29 14:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	linux-kernel

On Fri, Dec 24, 2010 at 01:23:52PM +0100, Peter Zijlstra wrote:
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  kernel/sched.c |    8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -2367,10 +2367,11 @@ static void update_avg(u64 *avg, u64 sam
>  #endif
>  
>  static void
> -ttwu_stat(struct rq *rq, struct task_struct *p, int cpu, int wake_flags)
> +ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
>  {
>  #ifdef CONFIG_SCHEDSTATS
>  	int this_cpu = smp_processor_id();
> +	struct rq *rq = this_rq();

			task_rq(p)?

>  
>  	schedstat_inc(rq, ttwu_count);
>  	schedstat_inc(p, se.statistics.nr_wakeups);
> @@ -2491,9 +2492,10 @@ try_to_wake_up(struct task_struct *p, un
>  	activate_task(rq, p, ENQUEUE_WAKEUP | ENQUEUE_WAKING);
>  out_running:
>  	ttwu_post_activation(p, rq, wake_flags);
> -	ttwu_stat(rq, p, cpu, wake_flags);
>  	success = 1;
>  	__task_rq_unlock(rq);
> +
> +	ttwu_stat(p, cpu, wake_flags);

Typo? You just moved it outside of rq->lock.

Thanks,
Yong

>  out:
>  	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
>  	put_cpu();
> @@ -2527,7 +2529,7 @@ static void try_to_wake_up_local(struct 
>  		activate_task(rq, p, ENQUEUE_WAKEUP);
>  
>  	ttwu_post_activation(p, rq, 0);
> -	ttwu_stat(rq, p, smp_processor_id(), 0);
> +	ttwu_stat(p, smp_processor_id(), 0);
>  out:
>  	raw_spin_unlock(&p->pi_lock);
>  }

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 17/17] sched: Sort hotplug vs ttwu queueing
  2010-12-24 12:23 ` [RFC][PATCH 17/17] sched: Sort hotplug vs ttwu queueing Peter Zijlstra
@ 2010-12-29 14:51   ` Yong Zhang
  2011-01-03 11:21     ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Yong Zhang @ 2010-12-29 14:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	linux-kernel

On Fri, Dec 24, 2010 at 01:23:55PM +0100, Peter Zijlstra wrote:
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  kernel/sched.c |   20 ++++++++++++++++++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -2470,15 +2470,15 @@ static int ttwu_remote(struct task_struc
>  	return ret;
>  }
>  
> -void sched_ttwu_pending(void)
> +static void __sched_ttwu_pending(struct rq *rq)
>  {
>  #ifdef CONFIG_SMP
> -	struct rq *rq = this_rq();
>  	struct task_struct *list = xchg(&rq->wake_list, NULL);
>  
>  	if (!list)
>  		return;
>  
> +	rq = this_rq(); /* always enqueue locally */

But it's possible that p is not allowed to run on this cpu,
right?

Thanks,
Yong

>  	raw_spin_lock(&rq->lock);
>  
>  	while (list) {
> @@ -2491,6 +2491,11 @@ void sched_ttwu_pending(void)
>  #endif
>  }
>  
> +void sched_ttwu_pending(void)
> +{
> +	__sched_ttwu_pending(this_rq());
> +}
> +
>  #ifdef CONFIG_SMP
>  static void ttwu_queue_remote(struct task_struct *p, int cpu)
>  {
> @@ -6162,6 +6167,17 @@ migration_call(struct notifier_block *nf
>  		migrate_nr_uninterruptible(rq);
>  		calc_global_load_remove(rq);
>  		break;
> +
> +	case CPU_DEAD:
> +		/*
> +		 * Queue any possible remaining pending wakeups on this cpu.
> +		 * Load-balancing will sort it out eventually.
> +		 */
> +		local_irq_save(flags);
> +		__sched_ttwu_pending(cpu_rq(cpu));
> +		local_irq_restore(flags);
> +		break;
> +
>  #endif
>  	}
>  	return NOTIFY_OK;

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 07/17] sched: Serialize p->cpus_allowed and ttwu() using p->pi_lock
  2010-12-29 14:20   ` Yong Zhang
@ 2011-01-03 11:12     ` Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2011-01-03 11:12 UTC (permalink / raw)
  To: Yong Zhang
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	linux-kernel

On Wed, 2010-12-29 at 22:20 +0800, Yong Zhang wrote:
> > - * The caller (fork, wakeup) owns TASK_WAKING, ->cpus_allowed is stable.
> > + * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
> 
> Yes for wakeup, but not true for fork.
> I don't see protection in wake_up_new_task().
> Or am I missing something? 

Ah, true, wake_up_new_task() holds task_rq_lock() which is sufficient,
but yes, we could also make that pi_lock.



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 08/17] sched: Drop the rq argument to sched_class::select_task_rq()
  2010-12-29 14:31   ` Yong Zhang
@ 2011-01-03 11:16     ` Peter Zijlstra
  2011-01-03 14:59       ` Oleg Nesterov
  2011-01-04  5:59       ` Yong Zhang
  0 siblings, 2 replies; 59+ messages in thread
From: Peter Zijlstra @ 2011-01-03 11:16 UTC (permalink / raw)
  To: Yong Zhang
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Steven Rostedt, linux-kernel

On Wed, 2010-12-29 at 22:31 +0800, Yong Zhang wrote:
> On Fri, Dec 24, 2010 at 01:23:46PM +0100, Peter Zijlstra wrote:
> > In preparation of calling select_task_rq() without rq->lock held, drop
> > the dependency on the rq argument.
> > 
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > ---
> > @@ -3416,27 +3409,22 @@ void sched_exec(void)
> >  {
> >  	struct task_struct *p = current;
> >  	unsigned long flags;
> > -	struct rq *rq;
> >  	int dest_cpu;
> >  
> > -	rq = task_rq_lock(p, &flags);
> > -	dest_cpu = p->sched_class->select_task_rq(rq, p, SD_BALANCE_EXEC, 0);
> > +	raw_spin_lock_irqsave(&p->pi_lock, flags);
> 
> Seems this should go to patch 07/17 ;)

Ah, the reason it's here is that this patch removes the rq argument and
thus we no longer need rq->lock. So this part relies on the property
introduced by patch 7.

> > +	dest_cpu = p->sched_class->select_task_rq(p, SD_BALANCE_EXEC, 0);
> >  	if (dest_cpu == smp_processor_id())
> >  		goto unlock;
> >  
> > -	/*
> > -	 * select_task_rq() can race against ->cpus_allowed
> > -	 */
> > -	if (cpumask_test_cpu(dest_cpu, &p->cpus_allowed) &&
> > -	    likely(cpu_active(dest_cpu)) && migrate_task(p, rq)) {
> > +	if (likely(cpu_active(dest_cpu)) && need_migrate_task(p)) {
> 
> If we drop rq_lock, need_migrate_task() maybe return true but
> p is already running on other cpu. Thus we do a wrong migration
> call.

Yeah, too bad.. ;-) exec load balancing is more of an optimistic thing
anyway; if it got rebalanced under our feet we don't care.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 14/17] sched: Remove rq argument to ttwu_stat()
  2010-12-29 14:40   ` Yong Zhang
@ 2011-01-03 11:20     ` Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2011-01-03 11:20 UTC (permalink / raw)
  To: Yong Zhang
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	linux-kernel

On Wed, 2010-12-29 at 22:40 +0800, Yong Zhang wrote:
> On Fri, Dec 24, 2010 at 01:23:52PM +0100, Peter Zijlstra wrote:
> > 
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > ---
> >  kernel/sched.c |    8 +++++---
> >  1 file changed, 5 insertions(+), 3 deletions(-)
> > 
> > Index: linux-2.6/kernel/sched.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/sched.c
> > +++ linux-2.6/kernel/sched.c
> > @@ -2367,10 +2367,11 @@ static void update_avg(u64 *avg, u64 sam
> >  #endif
> >  
> >  static void
> > -ttwu_stat(struct rq *rq, struct task_struct *p, int cpu, int wake_flags)
> > +ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
> >  {
> >  #ifdef CONFIG_SCHEDSTATS
> >  	int this_cpu = smp_processor_id();
> > +	struct rq *rq = this_rq();
> 
> 			task_rq(p)?

No we cannot change stats on task_rq() since we don't hold any locks
there, but since we have IRQs disabled we can change stats on the local
rq.

It changes the accounting slightly (instead of accounting the wakeup on
the task's rq we account it on the waking cpu's rq) but that shouldn't
matter.

> >  
> >  	schedstat_inc(rq, ttwu_count);
> >  	schedstat_inc(p, se.statistics.nr_wakeups);
> > @@ -2491,9 +2492,10 @@ try_to_wake_up(struct task_struct *p, un
> >  	activate_task(rq, p, ENQUEUE_WAKEUP | ENQUEUE_WAKING);
> >  out_running:
> >  	ttwu_post_activation(p, rq, wake_flags);
> > -	ttwu_stat(rq, p, cpu, wake_flags);
> >  	success = 1;
> >  	__task_rq_unlock(rq);
> > +
> > +	ttwu_stat(p, cpu, wake_flags);
> 
> Typo? You just put it out of rq_lock.

That's the purpose of this patch: we need to be able to do ttwu_stat()
without holding rq->lock.

I should have written a changelog but it was either finish all fancy
changelogs on an rfc series and post after the holidays or post
before ;-)

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 17/17] sched: Sort hotplug vs ttwu queueing
  2010-12-29 14:51   ` Yong Zhang
@ 2011-01-03 11:21     ` Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2011-01-03 11:21 UTC (permalink / raw)
  To: Yong Zhang
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	linux-kernel

On Wed, 2010-12-29 at 22:51 +0800, Yong Zhang wrote:
> > +     rq = this_rq(); /* always enqueue locally */
> 
> But it's possible that p is not allowed to run on this cpu,
> right?
> 
> 
Very good point, yes I totally forgot about that.. stupid
->cpus_allowed.

OK, I guess we need a new select_task_rq() call in the offline case..

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 05/17] x86: Optimize arch_spin_unlock_wait()
  2010-12-24 18:26   ` Linus Torvalds
@ 2011-01-03 11:32     ` Peter Zijlstra
  2011-01-04  6:45       ` Nick Piggin
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2011-01-03 11:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang, linux-kernel, Nick Piggin, Jeremy Fitzhardinge

On Fri, 2010-12-24 at 10:26 -0800, Linus Torvalds wrote:
> On Fri, Dec 24, 2010 at 4:23 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > Only wait for the current holder to release the lock.
> >
> > spin_unlock_wait() can only be about the current holder, since
> > completion of this function is inherently racy with new contenders.
> > Therefore, there is no reason to wait until the lock is completely
> > unlocked.
> 
> Is there really any reason for this patch? I'd rather keep the simpler
> and more straightforward code unless you have actual numbers.

No numbers, the testcase I use for this series is too unstable to give
results that fine-grained. It's more a result of seeing the code and going:
"oohh, that can wait a long time when the lock is severely contended".

But I think I can get rid of the need for calling this primitive
alltogether, which is even better.

> > +static inline void __ticket_spin_unlock_wait(arch_spinlock_t *lock)
> > +{
> > +       int tmp = ACCESS_ONCE(lock->slock);
> > +
> > +       if (!(((tmp >> TICKET_SHIFT) ^ tmp) & TICKET_MASK))
> > +               return; /* not locked */
> > +
> > +       /* wait until the current lock holder goes away */
> > +       while ((lock->slock & TICKET_MASK) == (tmp & TICKET_MASK))
> > +               cpu_relax();
> >  }
> 
> Also, the above is just ugly. You've lost the ACCESS_ONCE() on the
> lock access, and it's using another model of masking than the regular
> one. Both of which may be intentional (maybe you are _trying_ to get
> the compiler to just load the low bytes and avoid the 'and'), but the
> whole open-coding of the logic - twice, and with different looking
> masking - just makes my skin itch.

I'm not sure I fully understand the complaint here. The ACCESS_ONCE is
for the tmp variable, which we use several times; it needs to contain a
single load of the lock word and must not be split into multiple
loads.

The first conditional: 

      if (!(((tmp >> TICKET_SHIFT) ^ tmp) & TICKET_MASK))

Is exactly like the regular __ticket_spin_is_contended, and while that
is a somewhat overly clever way of writing head != tail, I don't see a
problem with that.

The second conditional:

      while ((lock->slock & TICKET_MASK) == (tmp & TICKET_MASK))

Is indeed different: it waits for the lock tail (new load) to change
from the first observed (first load) tail. Once we observe the tail
index change we know the previous owner completed and we can drop out.
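
To make the bit-fiddling concrete, here is the decoding spelled out as a
standalone toy, assuming the small-NR_CPUS layout (TICKET_SHIFT == 8,
TICKET_MASK == 0xff, low byte == the ticket currently being served -- what
I called the tail above -- high byte == the next ticket to hand out). Not
the real spinlock code:

#include <stdio.h>

#define TICKET_SHIFT    8
#define TICKET_MASK     ((1 << TICKET_SHIFT) - 1)       /* 0xff */

static int is_locked(int slock)
{
        /* locked when the two ticket fields differ */
        return (((slock >> TICKET_SHIFT) ^ slock) & TICKET_MASK) != 0;
}

int main(void)
{
        int tmp = 0x0705;       /* next = 7, serving = 5: locked, waiters queued */
        int now = 0x0706;       /* ticket 5's holder unlocked, 6 now runs */

        printf("0x0705 locked: %d\n", is_locked(tmp));          /* 1 */
        printf("0x0505 locked: %d\n", is_locked(0x0505));       /* 0: fields equal */

        /* the unlock_wait loop only watches the low byte change: */
        printf("owner moved on: %d\n",
               (now & TICKET_MASK) != (tmp & TICKET_MASK));     /* 1 */
        return 0;
}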

Anyway, if I can indeed get rid of my unlock_wait usage its all moot
anyway, there aren't many users of this primitive.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [RFC][PATCH] sembench: add stddev to the burn stats
  2010-12-24 12:23 ` [RFC][PATCH 16/17] sched: Move the second half of ttwu() to the remote cpu Peter Zijlstra
@ 2011-01-03 14:36   ` Peter Zijlstra
  2011-01-04 14:28   ` [RFC][PATCH 16/17] sched: Move the second half of ttwu() to the remote cpu Oleg Nesterov
  1 sibling, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2011-01-03 14:36 UTC (permalink / raw)
  To: Chris Mason
  Cc: Frank Rowand, Ingo Molnar, Thomas Gleixner, Mike Galbraith,
	Oleg Nesterov, Paul Turner, Jens Axboe, Yong Zhang, linux-kernel

On Fri, 2010-12-24 at 13:23 +0100, Peter Zijlstra wrote:
> As measured using: http://oss.oracle.com/~mason/sembench.c
> 
> $ echo 4096 32000 64 128 > /proc/sys/kernel/sem
> $ ./sembench -t 2048 -w 1900 -o 0
> 
> unpatched: run time 30 seconds 537953 worker burns per second
> patched:   run time 30 seconds 657336 worker burns per second
> 
> Still need to sort out all the races marked XXX (non-trivial), and its
> x86 only for the moment.


---

Adds stats to see how stable the sembench results are. It does slow the
burn rate down, I guess because we're now touching the FPU and the
context switch overhead increases dramatically, but it does show the
results are relatively stable.

New output looks like:

# ./sembench -t 2048 -w 1900 -o 0 -r 30
main loop going
all done
2048 threads, waking 1900 at a time
using ipc sem operations
main thread burns: 2374
worker burn count total 4510600 min 1187 max 2374 avg 2202.441 +- 0.415%
run time 30 seconds 150353 worker burns per second
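
The incremental update is Welford's online algorithm; a throwaway userspace
harness to sanity-check it against the textbook formula (the burn values are
made up, and this is not part of the patch -- build with -lm):

#include <stdio.h>
#include <math.h>

struct stats { double n, mean, M2; };

static void update_stats(struct stats *s, long long val)
{
        double delta;

        s->n++;
        delta = val - s->mean;
        s->mean += delta / s->n;
        s->M2 += delta * (val - s->mean);
}

/* std dev of the mean, as sembench reports it */
static double stddev_mean(struct stats *s)
{
        double variance = s->M2 / (s->n - 1);

        return sqrt(variance / s->n);
}

int main(void)
{
        static const long long burns[] = { 2374, 1187, 2200, 2300, 2250 };
        int i, n = sizeof(burns) / sizeof(burns[0]);
        struct stats s = { 0, 0, 0 };
        double sum = 0, sumsq = 0, var, mean;

        for (i = 0; i < n; i++) {
                update_stats(&s, burns[i]);
                sum += burns[i];
                sumsq += (double)burns[i] * burns[i];
        }

        mean = sum / n;
        var = (sumsq - sum * sum / n) / (n - 1);        /* textbook variance */

        printf("welford:  mean %.3f +- %.3f%%\n", s.mean,
               100.0 * stddev_mean(&s) / s.mean);
        printf("textbook: mean %.3f +- %.3f%%\n", mean,
               100.0 * sqrt(var / n) / mean);
        return 0;
}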

---
--- sembench.c	2010-04-12 20:45:50.000000000 +0200
+++ sembench3.c	2011-01-03 15:29:42.000000000 +0100
@@ -1,6 +1,6 @@
 /*
  * copyright Oracle 2007.  Licensed under GPLv2
- * To compile: gcc -Wall -o sembench sembench.c -lpthread
+ * To compile: gcc -Wall -o sembench sembench.c -lpthread -lm
  *
  * usage: sembench -t thread count -w wakenum -r runtime -o op
  * op can be: 0 (ipc sem) 1 (nanosleep) 2 (futexes)
@@ -28,8 +28,9 @@
 #include <sys/time.h>
 #include <sys/syscall.h>
 #include <errno.h>
+#include <math.h>
 
-#define VERSION "0.2"
+#define VERSION "0.3"
 
 /* futexes have been around since 2.5.something, but it still seems I
  * need to make my own syscall.  Sigh.
@@ -78,10 +79,55 @@
 
 int *semid_lookup = NULL;
 
+struct stats
+{
+	double n, mean, M2;
+};
+
+static void update_stats(struct stats *stats, long long val)
+{
+	double delta;
+
+	stats->n++;
+	delta = val - stats->mean;
+	stats->mean += delta / stats->n;
+	stats->M2 += delta*(val - stats->mean);
+}
+
+static double avg_stats(struct stats *stats)
+{
+	return stats->mean;
+}
+
+/*
+ * http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
+ *
+ *       (\Sum n_i^2) - ((\Sum n_i)^2)/n
+ * s^2 = -------------------------------
+ *                  n - 1
+ *
+ * http://en.wikipedia.org/wiki/Stddev
+ *
+ * The std dev of the mean is related to the std dev by:
+ *
+ *             s
+ * s_mean = -------
+ *          sqrt(n)
+ *
+ */
+static double stddev_stats(struct stats *stats)
+{
+	double variance = stats->M2 / (stats->n - 1);
+	double variance_mean = variance / stats->n;
+
+	return sqrt(variance_mean);
+}
+
 pthread_mutex_t worklist_mutex = PTHREAD_MUTEX_INITIALIZER;
 static unsigned long total_burns = 0;
 static unsigned long min_burns = ~0UL;
 static unsigned long max_burns = 0;
+static struct stats burn_stats;
 static int thread_count = 0;
 struct lockinfo *worklist_head = NULL;
 struct lockinfo *worklist_tail = NULL;
@@ -385,6 +431,7 @@
 		min_burns = burn_count;
 	if (burn_count > max_burns)
 		max_burns = burn_count;
+	update_stats(&burn_stats, burn_count);
 	thread_count--;
 	pthread_mutex_unlock(&worklist_mutex);
 	return (void *)0;
@@ -508,8 +555,9 @@
 	printf("%d threads, waking %d at a time\n", num_threads, wake_num);
 	printf("using %s\n", ops->name);
 	printf("main thread burns: %d\n", burn_count);
-	printf("worker burn count total %lu min %lu max %lu avg %lu\n",
-	       total_burns, min_burns, max_burns, total_burns / num_threads);
+	printf("worker burn count total %lu min %lu max %lu avg %.3f +- %.3f%%\n",
+	       total_burns, min_burns, max_burns, avg_stats(&burn_stats),
+	       100 * stddev_stats(&burn_stats) / avg_stats(&burn_stats));
 	printf("run time %d seconds %lu worker burns per second\n",
 		(int)(now.tv_sec - start.tv_sec),
 		total_burns / (now.tv_sec - start.tv_sec));


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 08/17] sched: Drop the rq argument to sched_class::select_task_rq()
  2011-01-03 11:16     ` Peter Zijlstra
@ 2011-01-03 14:59       ` Oleg Nesterov
  2011-01-03 15:21         ` Peter Zijlstra
  2011-01-04  5:59       ` Yong Zhang
  1 sibling, 1 reply; 59+ messages in thread
From: Oleg Nesterov @ 2011-01-03 14:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yong Zhang, Chris Mason, Frank Rowand, Ingo Molnar,
	Thomas Gleixner, Mike Galbraith, Paul Turner, Jens Axboe,
	Steven Rostedt, linux-kernel

On 01/03, Peter Zijlstra wrote:
>
> On Wed, 2010-12-29 at 22:31 +0800, Yong Zhang wrote:
> > > -	/*
> > > -	 * select_task_rq() can race against ->cpus_allowed
> > > -	 */
> > > -	if (cpumask_test_cpu(dest_cpu, &p->cpus_allowed) &&
> > > -	    likely(cpu_active(dest_cpu)) && migrate_task(p, rq)) {
> > > +	if (likely(cpu_active(dest_cpu)) && need_migrate_task(p)) {
> >
> > If we drop rq_lock, need_migrate_task() maybe return true but
> > p is already running on other cpu. Thus we do a wrong migration
> > call.
>
> Yeah, too bad.. ;-) exec load balancing is more an optimistic thing
> anyway, if it got rebalanced under out feet we don't care.

I don't understand this need_migrate_task() at all (with or without
the patch). This task is current/running, so it should always return true.

I guess migrate_task() was needed before to initialize migration_req.

Oleg.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 08/17] sched: Drop the rq argument to sched_class::select_task_rq()
  2011-01-03 14:59       ` Oleg Nesterov
@ 2011-01-03 15:21         ` Peter Zijlstra
  2011-01-03 15:49           ` Oleg Nesterov
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2011-01-03 15:21 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Yong Zhang, Chris Mason, Frank Rowand, Ingo Molnar,
	Thomas Gleixner, Mike Galbraith, Paul Turner, Jens Axboe,
	Steven Rostedt, linux-kernel

On Mon, 2011-01-03 at 15:59 +0100, Oleg Nesterov wrote:
> On 01/03, Peter Zijlstra wrote:
> >
> > On Wed, 2010-12-29 at 22:31 +0800, Yong Zhang wrote:
> > > > -	/*
> > > > -	 * select_task_rq() can race against ->cpus_allowed
> > > > -	 */
> > > > -	if (cpumask_test_cpu(dest_cpu, &p->cpus_allowed) &&
> > > > -	    likely(cpu_active(dest_cpu)) && migrate_task(p, rq)) {
> > > > +	if (likely(cpu_active(dest_cpu)) && need_migrate_task(p)) {
> > >
> > > If we drop rq_lock, need_migrate_task() maybe return true but
> > > p is already running on other cpu. Thus we do a wrong migration
> > > call.
> >
> > Yeah, too bad.. ;-) exec load balancing is more an optimistic thing
> > anyway, if it got rebalanced under out feet we don't care.
> 
> I don't understand this need_migrate_task() at all (with or without
> the patch). This task is current/running, it should always return T.

This is true for the sched_exec() case, yes.

> I guess, migrate_task() was needed before to initialize migration_req.

But afaict you can call set_cpus_allowed_ptr() on !self.



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 08/17] sched: Drop the rq argument to sched_class::select_task_rq()
  2011-01-03 15:21         ` Peter Zijlstra
@ 2011-01-03 15:49           ` Oleg Nesterov
  2011-01-03 16:35             ` Peter Zijlstra
  2011-01-04  7:27             ` Yong Zhang
  0 siblings, 2 replies; 59+ messages in thread
From: Oleg Nesterov @ 2011-01-03 15:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yong Zhang, Chris Mason, Frank Rowand, Ingo Molnar,
	Thomas Gleixner, Mike Galbraith, Paul Turner, Jens Axboe,
	Steven Rostedt, linux-kernel

On 01/03, Peter Zijlstra wrote:
>
> On Mon, 2011-01-03 at 15:59 +0100, Oleg Nesterov wrote:
> > On 01/03, Peter Zijlstra wrote:
> > >
> > > On Wed, 2010-12-29 at 22:31 +0800, Yong Zhang wrote:
> > > > > -	/*
> > > > > -	 * select_task_rq() can race against ->cpus_allowed
> > > > > -	 */
> > > > > -	if (cpumask_test_cpu(dest_cpu, &p->cpus_allowed) &&
> > > > > -	    likely(cpu_active(dest_cpu)) && migrate_task(p, rq)) {
> > > > > +	if (likely(cpu_active(dest_cpu)) && need_migrate_task(p)) {
> > > >
> > > > If we drop rq_lock, need_migrate_task() maybe return true but
> > > > p is already running on other cpu. Thus we do a wrong migration
> > > > call.
> > >
> > > Yeah, too bad.. ;-) exec load balancing is more an optimistic thing
> > > anyway, if it got rebalanced under out feet we don't care.
> >
> > I don't understand this need_migrate_task() at all (with or without
> > the patch). This task is current/running, it should always return T.
>
> This is true for the sched_exec() case, yes.
>
> > I guess, migrate_task() was needed before to initialize migration_req.
>
> But afaict you can call set_cpus_allowed_ptr() on !self.

Ah, sorry for the confusion, I only meant sched_exec() case.
set_cpus_allowed_ptr() does need need_migrate_task(), of course.


As for set_cpus_allowed_ptr()->need_migrate_task() path, I have another
question,

	static bool need_migrate_task(struct task_struct *p)
	{
		/*
		 * If the task is not on a runqueue (and not running), then
		 * the next wake-up will properly place the task.
		 */
		smp_rmb(); /* finish_lock_switch() */
		return p->on_rq || p->on_cpu;
	}

I don't understand this smp_rmb(). Yes, finish_lock_switch() does
wmb() before it clears ->on_cpu, but how these 2 barriers can pair?

In fact, I am completely confused. I do not understand why we
check task_running() at all. If we see on_rq == 0 && on_cpu == 1,
then this task is going to clear its on_cpu soon, once it finishes
context_switch().

Probably, this check was needed before, try_to_wake_up() could
activate the task_running() task without migrating. But, at first
glance, this is no longer possible after this series?

Oleg.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 08/17] sched: Drop the rq argument to sched_class::select_task_rq()
  2011-01-03 15:49           ` Oleg Nesterov
@ 2011-01-03 16:35             ` Peter Zijlstra
  2011-01-03 16:41               ` Peter Zijlstra
  2011-01-04  7:27             ` Yong Zhang
  1 sibling, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2011-01-03 16:35 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Yong Zhang, Chris Mason, Frank Rowand, Ingo Molnar,
	Thomas Gleixner, Mike Galbraith, Paul Turner, Jens Axboe,
	Steven Rostedt, linux-kernel

On Mon, 2011-01-03 at 16:49 +0100, Oleg Nesterov wrote:

> Ah, sorry for the confusion, I only meant sched_exec() case.
> set_cpus_allowed_ptr() does need need_migrate_task(), of course.

OK, removed the sched_exec() test, we'll see if anything explodes ;-)

> As for set_cpus_allowed_ptr()->need_migrate_task() path, I have another
> question,
> 
> 	static bool need_migrate_task(struct task_struct *p)
> 	{
> 		/*
> 		 * If the task is not on a runqueue (and not running), then
> 		 * the next wake-up will properly place the task.
> 		 */
> 		smp_rmb(); /* finish_lock_switch() */
> 		return p->on_rq || p->on_cpu;
> 	}
> 
> I don't understand this smp_rmb(). Yes, finish_lock_switch() does
> wmb() before it clears ->on_cpu, but how these 2 barriers can pair?

You mean the fact that I fouled up and didn't cross them (both are
before)? I placed the rmb after reading on_cpu.

> In fact, I am completely confused. I do not understand why we
> check task_running() at all. If we see on_rq == 0 && on_cpu == 1,
> then this task is going to clear its on_cpu soon, once it finishes
> context_switch().

> Probably, this check was needed before, try_to_wake_up() could
> activate the task_running() task without migrating. But, at first
> glance, this is no longer possible after this series?

It can still do that. I think the problem is us dropping rq->lock in the
middle of schedule(): if the freshly woken migration thread comes in
right there and moves the task away, you can get into a situation where
two cpus reference the same task_struct at the same time, which
usually leads to 'interesting' situations.


---
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2144,8 +2144,9 @@ static bool need_migrate_task(struct tas
 	 * If the task is not on a runqueue (and not running), then
 	 * the next wake-up will properly place the task.
 	 */
+	bool running = p->on_rq || p->on_cpu;
 	smp_rmb(); /* finish_lock_switch() */
-	return p->on_rq || p->on_cpu;
+	return running;
 }
 
 /*
@@ -3416,7 +3417,7 @@ void sched_exec(void)
 	if (dest_cpu == smp_processor_id())
 		goto unlock;
 
-	if (likely(cpu_active(dest_cpu)) && need_migrate_task(p)) {
+	if (likely(cpu_active(dest_cpu))) {
 		struct migration_arg arg = { p, dest_cpu };
 
 		raw_spin_unlock_irqrestore(&p->pi_lock, flags);


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 08/17] sched: Drop the rq argument to sched_class::select_task_rq()
  2011-01-03 16:35             ` Peter Zijlstra
@ 2011-01-03 16:41               ` Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2011-01-03 16:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Yong Zhang, Chris Mason, Frank Rowand, Ingo Molnar,
	Thomas Gleixner, Mike Galbraith, Paul Turner, Jens Axboe,
	Steven Rostedt, linux-kernel

On Mon, 2011-01-03 at 17:35 +0100, Peter Zijlstra wrote:
> > In fact, I am completely confused. I do not understand why we
> > check task_running() at all. If we see on_rq == 0 && on_cpu == 1,
> > then this task is going to clear its on_cpu soon, once it finishes
> > context_switch().
> 
> > Probably, this check was needed before, try_to_wake_up() could
> > activate the task_running() task without migrating. But, at first
> > glance, this is no longer possible after this series?
> 
> It can still do that. I think the problem is us dropping rq->lock in the
> middle of schedule(): if the freshly woken migration thread comes in
> right there and moves the task away, you can get into a situation where
> two cpus reference the same task_struct at the same time, which
> usually leads to 'interesting' situations.

Frr, brainfart, you cannot schedule the migration thread without
completely finishing prev.

/me goes ponder more

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 12/17] sched: Also serialize ttwu_local() with p->pi_lock
  2010-12-24 12:23 ` [RFC][PATCH 12/17] sched: Also serialize ttwu_local() with p->pi_lock Peter Zijlstra
@ 2011-01-03 17:32   ` Oleg Nesterov
  2011-01-09 23:11     ` Tejun Heo
  0 siblings, 1 reply; 59+ messages in thread
From: Oleg Nesterov @ 2011-01-03 17:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Paul Turner, Jens Axboe, Yong Zhang,
	linux-kernel, Tejun Heo

(add Tejun)

On 12/24, Peter Zijlstra wrote:
>
> Since we now serialize ttwu() using p->pi_lock, we also need to
> serialize ttwu_local() using that, otherwise, once we drop the
> rq->lock from ttwu() it can race with ttwu_local().
>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  kernel/sched.c |   28 +++++++++++++++++-----------
>  1 file changed, 17 insertions(+), 11 deletions(-)
>
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -2513,9 +2513,9 @@ static int try_to_wake_up(struct task_st
>   * try_to_wake_up_local - try to wake up a local task with rq lock held
>   * @p: the thread to be awakened
>   *
> - * Put @p on the run-queue if it's not alredy there.  The caller must
> + * Put @p on the run-queue if it's not alredy there. The caller must
>   * ensure that this_rq() is locked, @p is bound to this_rq() and not
> - * the current task.  this_rq() stays locked over invocation.
> + * the current task.
>   */
>  static void try_to_wake_up_local(struct task_struct *p)
>  {
> @@ -2523,16 +2523,21 @@ static void try_to_wake_up_local(struct
>
>  	BUG_ON(rq != this_rq());
>  	BUG_ON(p == current);
> -	lockdep_assert_held(&rq->lock);
> +
> +	raw_spin_unlock(&rq->lock);
> +	raw_spin_lock(&p->pi_lock);
> +	raw_spin_lock(&rq->lock);

I _think_ this is safe, this worker can't change cpu afaics. But
probably Tejun can take a look, just in case.


>
>  	if (!(p->state & TASK_NORMAL))
> -		return;
> +		goto out;
>
>  	if (!p->on_rq)
>  		activate_task(rq, p, ENQUEUE_WAKEUP);
>
>  	ttwu_post_activation(p, rq, 0);
>  	ttwu_stat(rq, p, smp_processor_id(), 0);
> +out:
> +	raw_spin_unlock(&p->pi_lock);
>  }
>
>  /**
> @@ -3925,6 +3930,7 @@ pick_next_task(struct rq *rq)
>   */
>  asmlinkage void __sched schedule(void)
>  {
> +	struct task_struct *to_wakeup = NULL;
>  	struct task_struct *prev, *next;
>  	unsigned long *switch_count;
>  	struct rq *rq;
> @@ -3958,21 +3964,21 @@ asmlinkage void __sched schedule(void)
>  			 * task to maintain concurrency.  If so, wake
>  			 * up the task.
>  			 */
> -			if (prev->flags & PF_WQ_WORKER) {
> -				struct task_struct *to_wakeup;
> -
> +			if (prev->flags & PF_WQ_WORKER)
>  				to_wakeup = wq_worker_sleeping(prev, cpu);
> -				if (to_wakeup)
> -					try_to_wake_up_local(to_wakeup);
> -			}
>  			deactivate_task(rq, prev, DEQUEUE_SLEEP);
>  			prev->on_rq = 0;
>  		}
>  		switch_count = &prev->nvcsw;
>  	}
>
> +	/*
> +	 * All three: try_to_wake_up_local(), pre_schedule() and idle_balance()
> +	 * can drop rq->lock.
> +	 */
> +	if (to_wakeup)
> +		try_to_wake_up_local(to_wakeup);
>  	pre_schedule(rq, prev);
> -
>  	if (unlikely(!rq->nr_running))
>  		idle_balance(cpu, rq);
>
>
>


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 08/17] sched: Drop the rq argument to sched_class::select_task_rq()
  2010-12-24 12:23 ` [RFC][PATCH 08/17] sched: Drop the rq argument to sched_class::select_task_rq() Peter Zijlstra
  2010-12-29 14:31   ` Yong Zhang
@ 2011-01-03 18:05   ` Oleg Nesterov
  2011-01-04 13:01     ` Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: Oleg Nesterov @ 2011-01-03 18:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Paul Turner, Jens Axboe, Yong Zhang,
	Steven Rostedt, linux-kernel

On 12/24, Peter Zijlstra wrote:
>
> @@ -2680,24 +2681,17 @@ void wake_up_new_task(struct task_struct
>  {
>  	unsigned long flags;
>  	struct rq *rq;
> -	int cpu __maybe_unused = get_cpu();

Wait, I think this is not right.

>  #ifdef CONFIG_SMP
>  	rq = task_rq_lock(p, &flags);
> -	p->state = TASK_WAKING;
>
>  	/*
>  	 * Fork balancing, do it here and not earlier because:
>  	 *  - cpus_allowed can change in the fork path
>  	 *  - any previously selected cpu might disappear through hotplug
> -	 *
> -	 * We set TASK_WAKING so that select_task_rq() can drop rq->lock
> -	 * without people poking at ->cpus_allowed.
>  	 */
> -	cpu = select_task_rq(rq, p, SD_BALANCE_FORK, 0);
> -	set_task_cpu(p, cpu);
> +	set_task_cpu(p, select_task_rq(p, SD_BALANCE_FORK, 0));
>
> -	p->state = TASK_RUNNING;
>  	task_rq_unlock(rq, &flags);

We need preempt_disable() to protect us against CPU hotplug. This task
was never activated, so it won't be found/migrated if that CPU goes away
before we take task_rq_lock().

Oleg.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 08/17] sched: Drop the rq argument to sched_class::select_task_rq()
  2011-01-03 11:16     ` Peter Zijlstra
  2011-01-03 14:59       ` Oleg Nesterov
@ 2011-01-04  5:59       ` Yong Zhang
  2011-01-04 13:00         ` Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: Yong Zhang @ 2011-01-04  5:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Steven Rostedt, linux-kernel

On Mon, Jan 3, 2011 at 7:16 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Wed, 2010-12-29 at 22:31 +0800, Yong Zhang wrote:
>> On Fri, Dec 24, 2010 at 01:23:46PM +0100, Peter Zijlstra wrote:
>> > In preparation of calling select_task_rq() without rq->lock held, drop
>> > the dependency on the rq argument.
>> >
>> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
>> > ---
>> > @@ -3416,27 +3409,22 @@ void sched_exec(void)
>> >  {
>> >     struct task_struct *p = current;
>> >     unsigned long flags;
>> > -   struct rq *rq;
>> >     int dest_cpu;
>> >
>> > -   rq = task_rq_lock(p, &flags);
>> > -   dest_cpu = p->sched_class->select_task_rq(rq, p, SD_BALANCE_EXEC, 0);
>> > +   raw_spin_lock_irqsave(&p->pi_lock, flags);
>>
>> Seems this should go to patch 07/17 ;)
>
> Ah, the reason it's here is that this patch removes the rq argument and
> thus we no longer need rq->lock. So this part relies on the property
> introduced by patch 7.

What I mean is we could firstly add pi_lock in patch 7 and then remove
the rq argument in this patch. :)

Thanks,
Yong

-- 
Only stand for myself

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 05/17] x86: Optimize arch_spin_unlock_wait()
  2011-01-03 11:32     ` Peter Zijlstra
@ 2011-01-04  6:45       ` Nick Piggin
  2011-01-05 19:14         ` [RFC][PATCH] spinlock: Kill spin_unlock_wait() Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Nick Piggin @ 2011-01-04  6:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Chris Mason, Frank Rowand, Ingo Molnar,
	Thomas Gleixner, Mike Galbraith, Oleg Nesterov, Paul Turner,
	Jens Axboe, Yong Zhang, linux-kernel, Nick Piggin,
	Jeremy Fitzhardinge

On Mon, Jan 03, 2011 at 12:32:42PM +0100, Peter Zijlstra wrote:
> On Fri, 2010-12-24 at 10:26 -0800, Linus Torvalds wrote:
> > On Fri, Dec 24, 2010 at 4:23 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > > Only wait for the current holder to release the lock.
> > >
> > > spin_unlock_wait() can only be about the current holder, since
> > > completion of this function is inherently racy with new contenders.
> > > Therefore, there is no reason to wait until the lock is completely
> > > unlocked.
> > 
> > Is there really any reason for this patch? I'd rather keep the simpler
> > and more straightforward code unless you have actual numbers.
> 
> No numbers, the testcase I use for this series is too unstable to really
> give results that fine. It's more a result of seeing the code and going:
> "oohh that can wait a long time when the lock is severely contended".

It would be kind of nice to fix: with ticket locks, a dumb spin_unlock_wait
can infinitely starve if the lock queue is never empty, whereas with
simple spinlocks it would at least have a statistical chance of being given
the cacheline in the unlocked state.
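
For illustration, the holder-only wait being discussed is roughly the
following (a sketch only, not Peter's actual patch -- the tickets.head/tail
naming follows Jeremy's pending layout quoted later in this thread, it is
not the x86 code as it stands today):

static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
{
	__ticket_t cur = ACCESS_ONCE(lock->tickets.head);

	/* not held at all: nothing to wait for */
	if (cur == ACCESS_ONCE(lock->tickets.tail))
		return;

	/*
	 * Wait only for the owner that held the lock when we sampled it;
	 * later contenders are not our problem, so this cannot starve the
	 * way a "spin until completely unlocked" loop can.
	 */
	while (ACCESS_ONCE(lock->tickets.head) == cur)
		cpu_relax();
}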

 
> But I think I can get rid of the need for calling this primitive
> altogether, which is even better.

I always hated it because it seems hard to use right and to verify that
the result is correct, particularly because it has no memory ordering
guarantees.

assert(active == 1);
spin_lock(&blah);
if (should_die)
  active = 0;
counter++;
spin_unlock(&blah);

if (active) {
  spin_lock(&blah);
  /* do something */
  spin_unlock(&blah);
} else {
  /* wait for last to go away */
  spin_unlock_wait(&blah);
  counter++;
}

I don't know, stupid example but I can't really think of good ways to
use it off the top of my head.

Anyway this has a lost update problem even on x86 because counter can
be speculatively loaded out of order from the load of the lock word.
So the nice simple lock APIs which supposedly don't require any thought
of barriers have tricked us!

So I agree, taking it out the back and shooting it in the head would make
the world a better place.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 08/17] sched: Drop the rq argument to sched_class::select_task_rq()
  2011-01-03 15:49           ` Oleg Nesterov
  2011-01-03 16:35             ` Peter Zijlstra
@ 2011-01-04  7:27             ` Yong Zhang
  2011-01-04 12:34               ` Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: Yong Zhang @ 2011-01-04  7:27 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Chris Mason, Frank Rowand, Ingo Molnar,
	Thomas Gleixner, Mike Galbraith, Paul Turner, Jens Axboe,
	Steven Rostedt, linux-kernel

On Mon, Jan 3, 2011 at 11:49 PM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> Ah, sorry for the confusion, I only meant sched_exec() case.
> set_cpus_allowed_ptr() does need need_migrate_task(), of course.
>
>
> As for set_cpus_allowed_ptr()->need_migrate_task() path, I have another
> question,
>
>        static bool need_migrate_task(struct task_struct *p)
>        {
>                /*
>                 * If the task is not on a runqueue (and not running), then
>                 * the next wake-up will properly place the task.
>                 */
>                smp_rmb(); /* finish_lock_switch() */
>                return p->on_rq || p->on_cpu;
>        }
>
> I don't understand this smp_rmb(). Yes, finish_lock_switch() does
> wmb() before it clears ->on_cpu, but how these 2 barriers can pair?
>
> In fact, I am completely confused. I do not understand why we
> check task_running() at all. If we see on_rq == 0 && on_cpu == 1,
> then this task is going to clear its on_cpu soon, once it finishes
> context_switch().
>
> Probably, this check was needed before, try_to_wake_up() could
> activate the task_running() task without migrating. But, at first
> glance, this is no longer possible after this series?

Yeah, task_running() is not needed after patch 13 which
may be the suitable place to poke :)

Thanks,
Yong

-- 
Only stand for myself

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 08/17] sched: Drop the rq argument to sched_class::select_task_rq()
  2011-01-04  7:27             ` Yong Zhang
@ 2011-01-04 12:34               ` Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2011-01-04 12:34 UTC (permalink / raw)
  To: Yong Zhang
  Cc: Oleg Nesterov, Chris Mason, Frank Rowand, Ingo Molnar,
	Thomas Gleixner, Mike Galbraith, Paul Turner, Jens Axboe,
	Steven Rostedt, linux-kernel

On Tue, 2011-01-04 at 15:27 +0800, Yong Zhang wrote:
> On Mon, Jan 3, 2011 at 11:49 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Ah, sorry for the confusion, I only meant sched_exec() case.
> > set_cpus_allowed_ptr() does need need_migrate_task(), of course.
> >
> >
> > As for set_cpus_allowed_ptr()->need_migrate_task() path, I have another
> > question,
> >
> >        static bool need_migrate_task(struct task_struct *p)
> >        {
> >                /*
> >                 * If the task is not on a runqueue (and not running), then
> >                 * the next wake-up will properly place the task.
> >                 */
> >                smp_rmb(); /* finish_lock_switch() */
> >                return p->on_rq || p->on_cpu;
> >        }
> >
> > I don't understand this smp_rmb(). Yes, finish_lock_switch() does
> > wmb() before it clears ->on_cpu, but how these 2 barriers can pair?
> >
> > In fact, I am completely confused. I do not understand why we
> > check task_running() at all. If we see on_rq == 0 && on_cpu == 1,
> > then this task is going to clear its on_cpu soon, once it finishes
> > context_switch().
> >
> > Probably, this check was needed before, try_to_wake_up() could
> > activate the task_running() task without migrating. But, at first
> > glance, this is no longer possible after this series?
> 
> Yeah, task_running() is not needed after patch 13 which
> may be the suitable place to poke :)

That and patch 6, which removes the false negatives from ->on_rq.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 08/17] sched: Drop the rq argument to sched_class::select_task_rq()
  2011-01-04  5:59       ` Yong Zhang
@ 2011-01-04 13:00         ` Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2011-01-04 13:00 UTC (permalink / raw)
  To: Yong Zhang
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Steven Rostedt, linux-kernel

On Tue, 2011-01-04 at 13:59 +0800, Yong Zhang wrote:
> On Mon, Jan 3, 2011 at 7:16 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > On Wed, 2010-12-29 at 22:31 +0800, Yong Zhang wrote:
> >> On Fri, Dec 24, 2010 at 01:23:46PM +0100, Peter Zijlstra wrote:
> >> > In preparation of calling select_task_rq() without rq->lock held, drop
> >> > the dependency on the rq argument.
> >> >
> >> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> >> > ---
> >> > @@ -3416,27 +3409,22 @@ void sched_exec(void)
> >> >  {
> >> >     struct task_struct *p = current;
> >> >     unsigned long flags;
> >> > -   struct rq *rq;
> >> >     int dest_cpu;
> >> >
> >> > -   rq = task_rq_lock(p, &flags);
> >> > -   dest_cpu = p->sched_class->select_task_rq(rq, p, SD_BALANCE_EXEC, 0);
> >> > +   raw_spin_lock_irqsave(&p->pi_lock, flags);
> >>
> >> Seems this should go to patch 07/17 ;)
> >
> > Ah, the reason it's here is that this patch removes the rq argument and
> > thus we no longer need rq->lock. So this part relies on the property
> > introduced by patch 7.
> 
> What I mean is we could firstly add pi_lock in patch 7 and then remove
> the rq argument in this patch. :)

No, that's the wrong way around.. anyway, I've pulled this into its own
patch and the next posting should hopefully be clearer.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 08/17] sched: Drop the rq argument to sched_class::select_task_rq()
  2011-01-03 18:05   ` Oleg Nesterov
@ 2011-01-04 13:01     ` Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2011-01-04 13:01 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Paul Turner, Jens Axboe, Yong Zhang,
	Steven Rostedt, linux-kernel

On Mon, 2011-01-03 at 19:05 +0100, Oleg Nesterov wrote:
> On 12/24, Peter Zijlstra wrote:
> >
> > @@ -2680,24 +2681,17 @@ void wake_up_new_task(struct task_struct
> >  {
> >  	unsigned long flags;
> >  	struct rq *rq;
> > -	int cpu __maybe_unused = get_cpu();
> 
> Wait, I think this is not right.
> 
> >  #ifdef CONFIG_SMP
> >  	rq = task_rq_lock(p, &flags);
> > -	p->state = TASK_WAKING;
> >
> >  	/*
> >  	 * Fork balancing, do it here and not earlier because:
> >  	 *  - cpus_allowed can change in the fork path
> >  	 *  - any previously selected cpu might disappear through hotplug
> > -	 *
> > -	 * We set TASK_WAKING so that select_task_rq() can drop rq->lock
> > -	 * without people poking at ->cpus_allowed.
> >  	 */
> > -	cpu = select_task_rq(rq, p, SD_BALANCE_FORK, 0);
> > -	set_task_cpu(p, cpu);
> > +	set_task_cpu(p, select_task_rq(p, SD_BALANCE_FORK, 0));
> >
> > -	p->state = TASK_RUNNING;
> >  	task_rq_unlock(rq, &flags);
> 
> We need preempt_disable() to protect us against CPU hotplug. This task
> was never activated, so it won't be found/migrated if that CPU goes away
> before we take task_rq_lock().

Ah, indeed. I've replaced it by keeping IRQs disabled over the whole
function, no need to restore and save them in the middle.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 16/17] sched: Move the second half of ttwu() to the remote cpu
  2010-12-24 12:23 ` [RFC][PATCH 16/17] sched: Move the second half of ttwu() to the remote cpu Peter Zijlstra
  2011-01-03 14:36   ` [RFC][PATCH] sembench: add stddev to the burn stats Peter Zijlstra
@ 2011-01-04 14:28   ` Oleg Nesterov
  2011-01-04 14:47     ` Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: Oleg Nesterov @ 2011-01-04 14:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Paul Turner, Jens Axboe, Yong Zhang,
	linux-kernel

On 12/24, Peter Zijlstra wrote:
>
> +static void
> +ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags)
> +{
> +#ifdef CONFIG_SMP
> +	if (task_cpu(p) != cpu_of(rq))
> +		set_task_cpu(p, cpu_of(rq));
> +#endif

This looks a bit suspicious.

If this is called by sched_ttwu_pending() we are holding rq->lock,
not task_rq_lock(). It seems, we can race with, say, migration
thread running on task_cpu().



OK, p->state = TASK_WAKING protects us against, say, set_cpus_allowed_ptr()
which does task_rq_lock(p) and thus checks task_is_waking().

But, at the same time,

> +#ifdef CONFIG_SMP
> +static void ttwu_queue_remote(struct task_struct *p, int cpu)
> +{
> +	struct task_struct *next = NULL;
> +	struct rq *rq = cpu_rq(cpu);
> +
> +	for (;;) {
> +		struct task_struct *old = next;
> +
> +		p->wake_entry = next;
> +		next = cmpxchg(&rq->wake_list, old, p);
> +		if (next == old)
> +			break;
> +	}
> +
> +	if (!next)
> +		smp_send_reschedule(cpu);

what if that cpu does set_cpus_allowed_ptr(p) ?

It spins with irq disabled. Once the caller, try_to_wake_up(),
drops ->pi_lock it will wait for !task_is_waking() forever, no?

Oleg.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 16/17] sched: Move the second half of ttwu() to the remote cpu
  2011-01-04 14:28   ` [RFC][PATCH 16/17] sched: Move the second half of ttwu() to the remote cpu Oleg Nesterov
@ 2011-01-04 14:47     ` Peter Zijlstra
  2011-01-04 15:18       ` Oleg Nesterov
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2011-01-04 14:47 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Paul Turner, Jens Axboe, Yong Zhang,
	linux-kernel

On Tue, 2011-01-04 at 15:28 +0100, Oleg Nesterov wrote:
> On 12/24, Peter Zijlstra wrote:
> >
> > +static void
> > +ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags)
> > +{
> > +#ifdef CONFIG_SMP
> > +	if (task_cpu(p) != cpu_of(rq))
> > +		set_task_cpu(p, cpu_of(rq));
> > +#endif
> 
> This looks a bit suspicious.
> 
> If this is called by sched_ttwu_pending() we are holding rq->lock,
> not task_rq_lock(). It seems, we can race with, say, migration
> thread running on task_cpu().

I don't think so, nobody should be migrating a TASK_WAKING task.

> OK, p->state = TASK_WAKING protects us against, say, set_cpus_allowed_ptr()
> which does task_rq_lock(p) and thus checks task_is_waking().
> 
> But, at the same time,
> 
> > +#ifdef CONFIG_SMP
> > +static void ttwu_queue_remote(struct task_struct *p, int cpu)
> > +{
> > +	struct task_struct *next = NULL;
> > +	struct rq *rq = cpu_rq(cpu);
> > +
> > +	for (;;) {
> > +		struct task_struct *old = next;
> > +
> > +		p->wake_entry = next;
> > +		next = cmpxchg(&rq->wake_list, old, p);
> > +		if (next == old)
> > +			break;
> > +	}
> > +
> > +	if (!next)
> > +		smp_send_reschedule(cpu);
> 
> what if that cpu does set_cpus_allowed_ptr(p) ?
> 
> It spins with irq disabled. Once the caller, try_to_wake_up(),
> drops ->pi_lock it will wait for !task_is_waking() forever, no?

Ah, it appears I've already fixed that, let me clean up my current
series and repost.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 16/17] sched: Move the second half of ttwu() to the remote cpu
  2011-01-04 14:47     ` Peter Zijlstra
@ 2011-01-04 15:18       ` Oleg Nesterov
  2011-01-04 15:43         ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Oleg Nesterov @ 2011-01-04 15:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Paul Turner, Jens Axboe, Yong Zhang,
	linux-kernel

On 01/04, Peter Zijlstra wrote:
>
> On Tue, 2011-01-04 at 15:28 +0100, Oleg Nesterov wrote:
> > On 12/24, Peter Zijlstra wrote:
> > >
> > > +static void
> > > +ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags)
> > > +{
> > > +#ifdef CONFIG_SMP
> > > +	if (task_cpu(p) != cpu_of(rq))
> > > +		set_task_cpu(p, cpu_of(rq));
> > > +#endif
> >
> > This looks a bit suspicious.
> >
> > If this is called by sched_ttwu_pending() we are holding rq->lock,
> > not task_rq_lock(). It seems, we can race with, say, migration
> > thread running on task_cpu().
>
> I don't think so, nobody should be migrating a TASK_WAKING task.

I am not sure...

Suppose that p was TASK_INTERRUPTIBLE and p->on_rq == 1 before, when
set_cpus_allowed_ptr() was called. To simplify, suppose that
the caller is preempted right after it drops p->pi_lock and before
it does stop_one_cpu(migration_cpu_stop).

After that p can complete schedule() and deactivate itself.

Now, try_to_wake_up() can set TASK_WAKING, choose another CPU,
and do ttwu_queue_remote().

Finally, the caller of set_cpus_allowed_ptr() resumes and
schedules migration_cpu_stop.



It is very possible I missed something, but what are the new
locking rules for set_task_cpu() anyway? I mean, which rq->lock
does it need?

Oleg.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 16/17] sched: Move the second half of ttwu() to the remote cpu
  2011-01-04 15:18       ` Oleg Nesterov
@ 2011-01-04 15:43         ` Peter Zijlstra
  2011-01-04 16:06           ` Oleg Nesterov
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2011-01-04 15:43 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Paul Turner, Jens Axboe, Yong Zhang,
	linux-kernel

On Tue, 2011-01-04 at 16:18 +0100, Oleg Nesterov wrote:

> > I don't think so, nobody should be migrating a TASK_WAKING task.
> 
> I am not sure...
> 
> Suppose that p was TASK_INTERRUPTIBLE and p->on_rq == 1 before, when
> set_cpus_allowed_ptr() was called. To simplify, suppose that
> the caller is preempted right after it drops p->pi_lock and before
> it does stop_one_cpu(migration_cpu_stop).
> 
> After that p can complete schedule() and deactivate itself.
> 
> Now, try_to_wake_up() can set TASK_WAKING, choose another CPU,
> and do ttwu_queue_remote().
> 
> Finally, the caller of set_cpus_allowed_ptr() resumes and
> schedules migration_cpu_stop.

But __migrate_task() will then find !p->on_rq and not actually do
anything.

> > It is very possible I missed something, but what are the new
> > locking rules for set_task_cpu() anyway? I mean, which rq->lock
> > does it need?

In the patches I just posted: set_task_cpu(p, cpu) callers need to
either hold task_rq(p)->lock (the old rq) or p->pi_lock.
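
For what it's worth, that rule is easy to assert with lockdep; a sketch
(the check below is illustrative, not the posted patch -- it only assumes
the usual lockdep_is_held()/task_rq() helpers already used in
kernel/sched.c):

void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
{
#ifdef CONFIG_LOCKDEP
	/* the caller must hold either the old rq->lock or p->pi_lock */
	WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
				      lockdep_is_held(&task_rq(p)->lock)));
#endif
	/* ... unchanged body of set_task_cpu() ... */
}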


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 16/17] sched: Move the second half of ttwu() to the remote cpu
  2011-01-04 15:43         ` Peter Zijlstra
@ 2011-01-04 16:06           ` Oleg Nesterov
  0 siblings, 0 replies; 59+ messages in thread
From: Oleg Nesterov @ 2011-01-04 16:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Paul Turner, Jens Axboe, Yong Zhang,
	linux-kernel

On 01/04, Peter Zijlstra wrote:
>
> On Tue, 2011-01-04 at 16:18 +0100, Oleg Nesterov wrote:
>
> > > I don't think so, nobody should be migrating a TASK_WAKING task.
> >
> > I am not sure...
> >
> > Suppose that p was TASK_INTERRUPTIBLE and p->on_rq == 1 before, when
> > set_cpus_allowed_ptr() was called. To simplify, suppose that
> > the caller is preempted right after it drops p->pi_lock and before
> > it does stop_one_cpu(migration_cpu_stop).
> >
> > After that p can complete schedule() and deactivate itself.
> >
> > Now, try_to_wake_up() can set TASK_WAKING, choose another CPU,
> > and do ttwu_queue_remote().
> >
> > Finally, the caller of set_cpus_allowed_ptr() resumes and
> > schedules migration_cpu_stop.
>
> But __migrate_task() will then find !p->on_rq and not actually do
> anything.

But if it races with ttwu_do_activate() (which can hold another
rq->lock != rq_src/rq_dest), it can check p->on_rq after
activate_task() was already called.



But I think this no longer matters,

> > It is very possible I missed something, but what are the new
> > locking rules for set_task_cpu() anyway? I mean, which rq->lock
> > does it need?
>
> In the patches I just posted: set_task_cpu(p, cpu) callers need to
> either hold task_rq(p)->lock (the old rq) or p->pi_lock.

Yes, thanks, I already noticed v4, and at first glance 11/18 plus
set_task_cpu() in try_to_wake_up() should close the problems...

Oleg.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [RFC][PATCH] spinlock: Kill spin_unlock_wait()
  2011-01-04  6:45       ` Nick Piggin
@ 2011-01-05 19:14         ` Peter Zijlstra
  2011-01-05 19:26           ` Oleg Nesterov
  2011-01-05 19:43           ` Linus Torvalds
  0 siblings, 2 replies; 59+ messages in thread
From: Peter Zijlstra @ 2011-01-05 19:14 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Chris Mason, Frank Rowand, Ingo Molnar,
	Thomas Gleixner, Mike Galbraith, Oleg Nesterov, Paul Turner,
	Jens Axboe, Yong Zhang, linux-kernel, Jeremy Fitzhardinge,
	Linux-Arch, Jeff Garzik, Tejun Heo

On Tue, 2011-01-04 at 17:45 +1100, Nick Piggin wrote:
> So I agree, taking it out the back and shooting it in the head would make
> the world a better place. 

There appear to be only two callsites of said horror, one in the exit
path and one in ata-eh; neither appears to be performance critical, so I
replaced them with a simple lock-unlock sequence.

---
 arch/alpha/include/asm/spinlock.h          |    2 --
 arch/arm/include/asm/spinlock.h            |    2 --
 arch/blackfin/include/asm/spinlock.h       |    6 ------
 arch/cris/include/arch-v32/arch/spinlock.h |    6 ------
 arch/ia64/include/asm/spinlock.h           |   19 -------------------
 arch/m32r/include/asm/spinlock.h           |    2 --
 arch/mips/include/asm/spinlock.h           |    2 --
 arch/mn10300/include/asm/spinlock.h        |    1 -
 arch/parisc/include/asm/spinlock.h         |    2 --
 arch/powerpc/include/asm/spinlock.h        |    7 -------
 arch/powerpc/lib/locks.c                   |   11 -----------
 arch/s390/include/asm/spinlock.h           |    3 ---
 arch/sh/include/asm/spinlock.h             |    2 --
 arch/sparc/include/asm/spinlock_32.h       |    3 ---
 arch/sparc/include/asm/spinlock_64.h       |    4 ----
 arch/tile/include/asm/spinlock_32.h        |    2 --
 arch/tile/lib/spinlock_32.c                |    8 --------
 arch/x86/include/asm/spinlock.h            |    6 ------
 drivers/ata/libata-eh.c                    |    6 ++++--
 include/linux/spinlock.h                   |   11 -----------
 include/linux/spinlock_up.h                |    3 ---
 kernel/exit.c                              |    3 ++-
 22 files changed, 6 insertions(+), 105 deletions(-)

diff --git a/arch/alpha/include/asm/spinlock.h b/arch/alpha/include/asm/spinlock.h
index d0faca1..0545bef 100644
--- a/arch/alpha/include/asm/spinlock.h
+++ b/arch/alpha/include/asm/spinlock.h
@@ -14,8 +14,6 @@
 
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 #define arch_spin_is_locked(x)	((x)->lock != 0)
-#define arch_spin_unlock_wait(x) \
-		do { cpu_relax(); } while ((x)->lock)
 
 static inline void arch_spin_unlock(arch_spinlock_t * lock)
 {
diff --git a/arch/arm/include/asm/spinlock.h b/arch/arm/include/asm/spinlock.h
index 17eb355..91ede71 100644
--- a/arch/arm/include/asm/spinlock.h
+++ b/arch/arm/include/asm/spinlock.h
@@ -34,8 +34,6 @@ static inline void dsb_sev(void)
  */
 
 #define arch_spin_is_locked(x)		((x)->lock != 0)
-#define arch_spin_unlock_wait(lock) \
-	do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
 
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 
diff --git a/arch/blackfin/include/asm/spinlock.h b/arch/blackfin/include/asm/spinlock.h
index 1942ccf..d1630fc 100644
--- a/arch/blackfin/include/asm/spinlock.h
+++ b/arch/blackfin/include/asm/spinlock.h
@@ -46,12 +46,6 @@ static inline void arch_spin_unlock(arch_spinlock_t *lock)
 	__raw_spin_unlock_asm(&lock->lock);
 }
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-	while (arch_spin_is_locked(lock))
-		cpu_relax();
-}
-
 static inline int arch_read_can_lock(arch_rwlock_t *rw)
 {
 	return __raw_uncached_fetch_asm(&rw->lock) > 0;
diff --git a/arch/cris/include/arch-v32/arch/spinlock.h b/arch/cris/include/arch-v32/arch/spinlock.h
index f171a66..01da21c 100644
--- a/arch/cris/include/arch-v32/arch/spinlock.h
+++ b/arch/cris/include/arch-v32/arch/spinlock.h
@@ -22,12 +22,6 @@ static inline void arch_spin_unlock(arch_spinlock_t *lock)
 			  : "memory");
 }
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-	while (arch_spin_is_locked(lock))
-		cpu_relax();
-}
-
 static inline int arch_spin_trylock(arch_spinlock_t *lock)
 {
 	return cris_spin_trylock((void *)&lock->slock);
diff --git a/arch/ia64/include/asm/spinlock.h b/arch/ia64/include/asm/spinlock.h
index 1a91c91..7c4de8b 100644
--- a/arch/ia64/include/asm/spinlock.h
+++ b/arch/ia64/include/asm/spinlock.h
@@ -75,20 +75,6 @@ static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
 	ACCESS_ONCE(*p) = (tmp + 2) & ~1;
 }
 
-static __always_inline void __ticket_spin_unlock_wait(arch_spinlock_t *lock)
-{
-	int	*p = (int *)&lock->lock, ticket;
-
-	ia64_invala();
-
-	for (;;) {
-		asm volatile ("ld4.c.nc %0=[%1]" : "=r"(ticket) : "r"(p) : "memory");
-		if (!(((ticket >> TICKET_SHIFT) ^ ticket) & TICKET_MASK))
-			return;
-		cpu_relax();
-	}
-}
-
 static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
 {
 	long tmp = ACCESS_ONCE(lock->lock);
@@ -135,11 +121,6 @@ static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock,
 	arch_spin_lock(lock);
 }
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-	__ticket_spin_unlock_wait(lock);
-}
-
 #define arch_read_can_lock(rw)		(*(volatile int *)(rw) >= 0)
 #define arch_write_can_lock(rw)	(*(volatile int *)(rw) == 0)
 
diff --git a/arch/m32r/include/asm/spinlock.h b/arch/m32r/include/asm/spinlock.h
index 179a064..0e099cc 100644
--- a/arch/m32r/include/asm/spinlock.h
+++ b/arch/m32r/include/asm/spinlock.h
@@ -26,8 +26,6 @@
 
 #define arch_spin_is_locked(x)		(*(volatile int *)(&(x)->slock) <= 0)
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
-#define arch_spin_unlock_wait(x) \
-		do { cpu_relax(); } while (arch_spin_is_locked(x))
 
 /**
  * arch_spin_trylock - Try spin lock and return a result
diff --git a/arch/mips/include/asm/spinlock.h b/arch/mips/include/asm/spinlock.h
index 396e402..711425f 100644
--- a/arch/mips/include/asm/spinlock.h
+++ b/arch/mips/include/asm/spinlock.h
@@ -42,8 +42,6 @@ static inline int arch_spin_is_locked(arch_spinlock_t *lock)
 }
 
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
-#define arch_spin_unlock_wait(x) \
-	while (arch_spin_is_locked(x)) { cpu_relax(); }
 
 static inline int arch_spin_is_contended(arch_spinlock_t *lock)
 {
diff --git a/arch/mn10300/include/asm/spinlock.h b/arch/mn10300/include/asm/spinlock.h
index 9342915..3460e5d 100644
--- a/arch/mn10300/include/asm/spinlock.h
+++ b/arch/mn10300/include/asm/spinlock.h
@@ -23,7 +23,6 @@
  */
 
 #define arch_spin_is_locked(x)	(*(volatile signed char *)(&(x)->slock) != 0)
-#define arch_spin_unlock_wait(x) do { barrier(); } while (arch_spin_is_locked(x))
 
 static inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
diff --git a/arch/parisc/include/asm/spinlock.h b/arch/parisc/include/asm/spinlock.h
index 74036f4..b419128 100644
--- a/arch/parisc/include/asm/spinlock.h
+++ b/arch/parisc/include/asm/spinlock.h
@@ -12,8 +12,6 @@ static inline int arch_spin_is_locked(arch_spinlock_t *x)
 }
 
 #define arch_spin_lock(lock) arch_spin_lock_flags(lock, 0)
-#define arch_spin_unlock_wait(x) \
-		do { cpu_relax(); } while (arch_spin_is_locked(x))
 
 static inline void arch_spin_lock_flags(arch_spinlock_t *x,
 					 unsigned long flags)
diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index f9611bd..c9fd936 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -150,13 +150,6 @@ static inline void arch_spin_unlock(arch_spinlock_t *lock)
 	lock->slock = 0;
 }
 
-#ifdef CONFIG_PPC64
-extern void arch_spin_unlock_wait(arch_spinlock_t *lock);
-#else
-#define arch_spin_unlock_wait(lock) \
-	do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
-#endif
-
 /*
  * Read-write spinlocks, allowing multiple readers
  * but only one writer.
diff --git a/arch/powerpc/lib/locks.c b/arch/powerpc/lib/locks.c
index 9b8182e..e800d5f 100644
--- a/arch/powerpc/lib/locks.c
+++ b/arch/powerpc/lib/locks.c
@@ -82,14 +82,3 @@ void __rw_yield(arch_rwlock_t *rw)
 }
 #endif
 
-void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-	while (lock->slock) {
-		HMT_low();
-		if (SHARED_PROCESSOR)
-			__spin_yield(lock);
-	}
-	HMT_medium();
-}
-
-EXPORT_SYMBOL(arch_spin_unlock_wait);
diff --git a/arch/s390/include/asm/spinlock.h b/arch/s390/include/asm/spinlock.h
index 56612fc..cb98ce7 100644
--- a/arch/s390/include/asm/spinlock.h
+++ b/arch/s390/include/asm/spinlock.h
@@ -35,9 +35,6 @@ _raw_compare_and_swap(volatile unsigned int *lock,
  */
 
 #define arch_spin_is_locked(x) ((x)->owner_cpu != 0)
-#define arch_spin_unlock_wait(lock) \
-	do { while (arch_spin_is_locked(lock)) \
-		 arch_spin_relax(lock); } while (0)
 
 extern void arch_spin_lock_wait(arch_spinlock_t *);
 extern void arch_spin_lock_wait_flags(arch_spinlock_t *, unsigned long flags);
diff --git a/arch/sh/include/asm/spinlock.h b/arch/sh/include/asm/spinlock.h
index bdc0f3b..7ac8916 100644
--- a/arch/sh/include/asm/spinlock.h
+++ b/arch/sh/include/asm/spinlock.h
@@ -25,8 +25,6 @@
 
 #define arch_spin_is_locked(x)		((x)->lock <= 0)
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
-#define arch_spin_unlock_wait(x) \
-	do { while (arch_spin_is_locked(x)) cpu_relax(); } while (0)
 
 /*
  * Simple spin lock operations.  There are two variants, one clears IRQ's
diff --git a/arch/sparc/include/asm/spinlock_32.h b/arch/sparc/include/asm/spinlock_32.h
index 7f9b9db..f10958a 100644
--- a/arch/sparc/include/asm/spinlock_32.h
+++ b/arch/sparc/include/asm/spinlock_32.h
@@ -12,9 +12,6 @@
 
 #define arch_spin_is_locked(lock) (*((volatile unsigned char *)(lock)) != 0)
 
-#define arch_spin_unlock_wait(lock) \
-	do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
-
 static inline void arch_spin_lock(arch_spinlock_t *lock)
 {
 	__asm__ __volatile__(
diff --git a/arch/sparc/include/asm/spinlock_64.h b/arch/sparc/include/asm/spinlock_64.h
index 073936a..2845f6e 100644
--- a/arch/sparc/include/asm/spinlock_64.h
+++ b/arch/sparc/include/asm/spinlock_64.h
@@ -23,10 +23,6 @@
 
 #define arch_spin_is_locked(lp)	((lp)->lock != 0)
 
-#define arch_spin_unlock_wait(lp)	\
-	do {	rmb();			\
-	} while((lp)->lock)
-
 static inline void arch_spin_lock(arch_spinlock_t *lock)
 {
 	unsigned long tmp;
diff --git a/arch/tile/include/asm/spinlock_32.h b/arch/tile/include/asm/spinlock_32.h
index 88efdde..3ccb768 100644
--- a/arch/tile/include/asm/spinlock_32.h
+++ b/arch/tile/include/asm/spinlock_32.h
@@ -61,8 +61,6 @@ static inline void arch_spin_unlock(arch_spinlock_t *lock)
 	lock->current_ticket = old_ticket + TICKET_QUANTUM;
 }
 
-void arch_spin_unlock_wait(arch_spinlock_t *lock);
-
 /*
  * Read-write spinlocks, allowing multiple readers
  * but only one writer.
diff --git a/arch/tile/lib/spinlock_32.c b/arch/tile/lib/spinlock_32.c
index 5cd1c40..72a6507 100644
--- a/arch/tile/lib/spinlock_32.c
+++ b/arch/tile/lib/spinlock_32.c
@@ -61,14 +61,6 @@ int arch_spin_trylock(arch_spinlock_t *lock)
 }
 EXPORT_SYMBOL(arch_spin_trylock);
 
-void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-	u32 iterations = 0;
-	while (arch_spin_is_locked(lock))
-		delay_backoff(iterations++);
-}
-EXPORT_SYMBOL(arch_spin_unlock_wait);
-
 /*
  * The low byte is always reserved to be the marker for a "tns" operation
  * since the low bit is set to "1" by a tns.  The next seven bits are
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 3089f70..f23106f 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -208,12 +208,6 @@ static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock,
 
 #endif	/* CONFIG_PARAVIRT_SPINLOCKS */
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-	while (arch_spin_is_locked(lock))
-		cpu_relax();
-}
-
 /*
  * Read-write spinlocks, allowing multiple readers
  * but only one writer.
diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index 17a6378..f246965 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -668,8 +668,10 @@ void ata_scsi_error(struct Scsi_Host *host)
 
 		/* initialize eh_tries */
 		ap->eh_tries = ATA_EH_MAX_TRIES;
-	} else
-		spin_unlock_wait(ap->lock);
+	} else {
+		spin_lock_irqsave(ap->lock, flags);
+		spin_unlock_irqrestore(ap->lock, flags);
+	}
 
 	/* If we timed raced normal completion and there is nothing to
 	   recover nr_timedout == 0 why exactly are we doing error recovery ? */
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 80e5358..d7b89ef 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -122,12 +122,6 @@ do {								\
 static inline void smp_mb__after_lock(void) { smp_mb(); }
 #endif
 
-/**
- * raw_spin_unlock_wait - wait until the spinlock gets unlocked
- * @lock: the spinlock in question.
- */
-#define raw_spin_unlock_wait(lock)	arch_spin_unlock_wait(&(lock)->raw_lock)
-
 #ifdef CONFIG_DEBUG_SPINLOCK
  extern void do_raw_spin_lock(raw_spinlock_t *lock) __acquires(lock);
 #define do_raw_spin_lock_flags(lock, flags) do_raw_spin_lock(lock)
@@ -355,11 +349,6 @@ static inline int spin_trylock_irq(spinlock_t *lock)
 	raw_spin_trylock_irqsave(spinlock_check(lock), flags); \
 })
 
-static inline void spin_unlock_wait(spinlock_t *lock)
-{
-	raw_spin_unlock_wait(&lock->rlock);
-}
-
 static inline int spin_is_locked(spinlock_t *lock)
 {
 	return raw_spin_is_locked(&lock->rlock);
diff --git a/include/linux/spinlock_up.h b/include/linux/spinlock_up.h
index b14f6a9..53c7a86 100644
--- a/include/linux/spinlock_up.h
+++ b/include/linux/spinlock_up.h
@@ -70,7 +70,4 @@ static inline void arch_spin_unlock(arch_spinlock_t *lock)
 #define arch_read_can_lock(lock)	(((void)(lock), 1))
 #define arch_write_can_lock(lock)	(((void)(lock), 1))
 
-#define arch_spin_unlock_wait(lock) \
-		do { cpu_relax(); } while (arch_spin_is_locked(lock))
-
 #endif /* __LINUX_SPINLOCK_UP_H */
diff --git a/kernel/exit.c b/kernel/exit.c
index 676149a..2c6b725 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -956,7 +956,8 @@ NORET_TYPE void do_exit(long code)
 	 * an exiting task cleaning up the robust pi futexes.
 	 */
 	smp_mb();
-	raw_spin_unlock_wait(&tsk->pi_lock);
+	raw_spin_lock_irq(&tsk->pi_lock);
+	raw_spin_unlock_irq(&tsk->pi_lock);
 
 	if (unlikely(in_atomic()))
 		printk(KERN_INFO "note: %s[%d] exited with preempt_count %d\n",


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH] spinlock: Kill spin_unlock_wait()
  2011-01-05 19:14         ` [RFC][PATCH] spinlock: Kill spin_unlock_wait() Peter Zijlstra
@ 2011-01-05 19:26           ` Oleg Nesterov
  2011-01-05 19:43           ` Linus Torvalds
  1 sibling, 0 replies; 59+ messages in thread
From: Oleg Nesterov @ 2011-01-05 19:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nick Piggin, Linus Torvalds, Chris Mason, Frank Rowand,
	Ingo Molnar, Thomas Gleixner, Mike Galbraith, Paul Turner,
	Jens Axboe, Yong Zhang, linux-kernel, Jeremy Fitzhardinge,
	Linux-Arch, Jeff Garzik, Tejun Heo

On 01/05, Peter Zijlstra wrote:
>
> On Tue, 2011-01-04 at 17:45 +1100, Nick Piggin wrote:
> > So I agree, taking it out the back and shooting it in the head would make
> > the world a better place.
>
> There appear to be only two callsites of said horror,

Not sure I understand why spin_unlock_wait() is really awful, but OK.

> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -956,7 +956,8 @@ NORET_TYPE void do_exit(long code)
>  	 * an exiting task cleaning up the robust pi futexes.
>  	 */
>  	smp_mb();
> -	raw_spin_unlock_wait(&tsk->pi_lock);
> +	raw_spin_lock_irq(&tsk->pi_lock);
> +	raw_spin_unlock_irq(&tsk->pi_lock);

then you can kill smp_mb() above.

Oleg.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH] spinlock: Kill spin_unlock_wait()
  2011-01-05 19:14         ` [RFC][PATCH] spinlock: Kill spin_unlock_wait() Peter Zijlstra
  2011-01-05 19:26           ` Oleg Nesterov
@ 2011-01-05 19:43           ` Linus Torvalds
  2011-01-06  9:32             ` Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: Linus Torvalds @ 2011-01-05 19:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nick Piggin, Chris Mason, Frank Rowand, Ingo Molnar,
	Thomas Gleixner, Mike Galbraith, Oleg Nesterov, Paul Turner,
	Jens Axboe, Yong Zhang, linux-kernel, Jeremy Fitzhardinge,
	Linux-Arch, Jeff Garzik, Tejun Heo

On Wed, Jan 5, 2011 at 11:14 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> There appear to be only two callsites of said horror, one in the exit
> path and one in ata-eh; neither appears to be performance critical, so I
> replaced them with a simple lock-unlock sequence.

Again, WHY?

What's the problem with the current code? Instead of generating ugly
patches to change it, and instead of removing it, just say what the
PROBLEM is.

Stop this wanking already. The exit path is certainly not unimportant.

And if you want to change the thing to be more efficient, I'm ok with
that, as long as it's done PRETTILY instead of making the damn thing
an unreadable mess.

The current "spin_unlock_wait()" is obvious. I'm perfectly happy
improving on it, but I would want to retain the "obvious" part, which
your previous patch certainly didn't do.

Some simple helper functions to extract the tail/head part of the
ticket lock to make the comparisons understandable, together with
always accessing the lock with the proper ACCESS_ONCE() would have
made your previous patch acceptable. But you ignored that feedback,
and instead you now want to do a "let's just remove it entirely patch"
that is even worse.

And in NEITHER version did you actually give any actual *REASON* for
the change in the first place.

Why? WHY?

                      Linus

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH] spinlock: Kill spin_unlock_wait()
  2011-01-05 19:43           ` Linus Torvalds
@ 2011-01-06  9:32             ` Peter Zijlstra
  2011-01-06 10:38               ` Nick Piggin
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2011-01-06  9:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Chris Mason, Frank Rowand, Ingo Molnar,
	Thomas Gleixner, Mike Galbraith, Oleg Nesterov, Paul Turner,
	Jens Axboe, Yong Zhang, linux-kernel, Jeremy Fitzhardinge,
	Linux-Arch, Jeff Garzik, Tejun Heo

On Wed, 2011-01-05 at 11:43 -0800, Linus Torvalds wrote:
> On Wed, Jan 5, 2011 at 11:14 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >
> > There appear to be only two callsites of said horror, one in the exit
> > path and one in ata-eh; neither appears to be performance critical, so I
> > replaced them with a simple lock-unlock sequence.
> 
> Again, WHY?
> 
> What's the problem with the current code? Instead of generating ugly
> patches to change it, and instead of removing it, just say what the
> PROBLEM is.

Well, I don't care about the primitive anymore, and Nick had some
reasonable arguments on why it's not a good primitive to have. So in a
brief moment I decided to see what it would take to make it go away.

Apparently you don't like it, I'm fine with that, consider the patch
discarded.

> Some simple helper functions to extract the tail/head part of the
> ticket lock to make the comparisons understandable,

Jeremy has a number of pending patches making things more pretty. If you
wish I can revisit this once that work hits your tree.

  http://lkml.org/lkml/2010/11/16/479

He makes the thing look like:

+#if (CONFIG_NR_CPUS < 256)
+typedef u8  __ticket_t;
+#else
+typedef u16 __ticket_t;
+#endif
+
+#define TICKET_SHIFT   (sizeof(__ticket_t) * 8)
+#define TICKET_MASK    ((__ticket_t)((1 << TICKET_SHIFT) - 1))
+
 typedef struct arch_spinlock {
+       union {
+               unsigned int slock;
+               struct __raw_tickets {
+                       __ticket_t head, tail;
+               } tickets;
+       };
 } arch_spinlock_t;

>  together with
> always accessing the lock with the proper ACCESS_ONCE() would have
> made your previous patch acceptable.

I'm still not quite seeing where I was missing an ACCESS_ONCE(); the
second loop had a cpu_relax() in it, which is a compiler barrier, so it
forces a reload that way.
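
Illustrative only (the loop condition below is a stand-in, not the actual
unlock_wait test), the two idioms compare like this:

	/* reload guaranteed by the volatile cast, even without a barrier */
	while (ACCESS_ONCE(lock->slock))
		cpu_relax();

	/*
	 * reload guaranteed by cpu_relax() alone: on x86 it is "rep; nop"
	 * with a "memory" clobber, so the compiler cannot keep lock->slock
	 * cached in a register across iterations.
	 */
	while (lock->slock)
		cpu_relax();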

>  But you ignored that feedback,
> and instead you now want to do a "let's just remove it entirely patch"
> that is even worse.

My locking improved and became a lot more obvious by not using the
primitive, so for the work I was doing not using it seemed the better
solution.

And as said, this was inspired by Nick's comments and it was a quick
edit to see what it would take.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH] spinlock: Kill spin_unlock_wait()
  2011-01-06  9:32             ` Peter Zijlstra
@ 2011-01-06 10:38               ` Nick Piggin
  2011-01-06 18:26                 ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Nick Piggin @ 2011-01-06 10:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Nick Piggin, Chris Mason, Frank Rowand,
	Ingo Molnar, Thomas Gleixner, Mike Galbraith, Oleg Nesterov,
	Paul Turner, Jens Axboe, Yong Zhang, linux-kernel,
	Jeremy Fitzhardinge, Linux-Arch, Jeff Garzik, Tejun Heo

On Thu, Jan 06, 2011 at 10:32:33AM +0100, Peter Zijlstra wrote:
> On Wed, 2011-01-05 at 11:43 -0800, Linus Torvalds wrote:
> > On Wed, Jan 5, 2011 at 11:14 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > >
> > > There appear to be only two callsites of said horror, one in the exit
> > > path and one in ata-eh; neither appears to be performance critical, so I
> > > replaced them with a simple lock-unlock sequence.
> > 
> > Again, WHY?
> > 
> > What's the problem with the current code? Instead of generating ugly
> > patches to change it, and instead of removing it, just say what the
> > PROBLEM is.
> 
> Well, I don't care about the primitive anymore, and Nick had some
> reasonable arguments on why it's not a good primitive to have. So in a
> brief moment I decided to see what it would take to make it go away.
> 
> Apparently you don't like it, I'm fine with that, consider the patch
> discarded.
> 
> > Some simple helper functions to extract the tail/head part of the
> > ticket lock to make the comparisons understandable,
> 
> Jeremy has a number of pending patches making things more pretty. If you
> wish I can revisit this once that work hits your tree.
> 
>   http://lkml.org/lkml/2010/11/16/479
> 
> He makes the thing look like:
> 
> +#if (CONFIG_NR_CPUS < 256)
> +typedef u8  __ticket_t;
> +#else
> +typedef u16 __ticket_t;
> +#endif
> +
> +#define TICKET_SHIFT   (sizeof(__ticket_t) * 8)
> +#define TICKET_MASK    ((__ticket_t)((1 << TICKET_SHIFT) - 1))
> +
>  typedef struct arch_spinlock {
> +       union {
> +               unsigned int slock;
> +               struct __raw_tickets {
> +                       __ticket_t head, tail;
> +               } tickets;
> +       };
>  } arch_spinlock_t;
> 
> >  together with
> > always accessing the lock with the proper ACCESS_ONCE() would have
> > made your previous patch acceptable.
> 
> I'm still not quite seeing where I was missing an ACCESS_ONCE(); the
> second loop had a cpu_relax() in it, which is a compiler barrier, so it
> forces a reload that way.
> 
> >  But you ignored that feedback,
> > and instead you now want to do a "let's just remove it entirely patch"
> > that is even worse.
> 
> My locking improved and became a lot more obvious by not using the
> primitive, so for the work I was doing not using it seemed the better
> solution.
> 
> And as said, this was inspired by Nick's comments and it was a quick
> edit to see what it would take.

I was hoping more that it could be removed from the callers rather than
replaced with a lock/unlock sequence...

I can't see how the usage in libata could be right. If we're waiting
for some modifications inside the critical section, then it seems we
could load some shared data before the lock is actually released, on
an architecture which does load/load reordering. At best it needs some
good comments and perhaps a barrier or two.

The spin_unlock_wait in do_exit seems like it shouldn't be needed --
exit_pi_state_list takes the pi_lock after the task has died anyway,
so I can't see what the problem would be. Removing the call (and
perhaps putting a BUG_ON(!(curr->flags & PF_EXITING)) in exit_pi_state_list)
would do the same thing, avoid the barrier, and localize the futex exit
protocol to futex.c.

(If the spin_unlock_wait call *were* actually required, e.g. we subsequently
 checked current->pi_state_list without holding the pi_lock, then we
 would require a barrier after the spin_unlock_wait call too, to prevent
 a speculative load from being pulled up before the load of the lock, so I
 don't really agree, Linus, that it's an obvious API -- to you perhaps
 because barriers are second nature to you, but not for driver writers :)).
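
Something like this, purely to make the ordering concrete (the
pi_state_list read below is hypothetical, do_exit() does not actually
do it):

	smp_mb();	/* order the earlier PF_EXITING store */
	raw_spin_unlock_wait(&tsk->pi_lock);	/* wait out the current holder */
	smp_rmb();	/* the load below must not be satisfied before
			 * the lock-word loads that saw it released */
	WARN_ON(!list_empty(&tsk->pi_state_list));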



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH] spinlock: Kill spin_unlock_wait()
  2011-01-06 10:38               ` Nick Piggin
@ 2011-01-06 18:26                 ` Peter Zijlstra
  2011-01-07 21:01                   ` Tejun Heo
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2011-01-06 18:26 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Chris Mason, Frank Rowand, Ingo Molnar,
	Thomas Gleixner, Mike Galbraith, Oleg Nesterov, Paul Turner,
	Jens Axboe, Yong Zhang, linux-kernel, Jeremy Fitzhardinge,
	Linux-Arch, Jeff Garzik, Tejun Heo

On Thu, 2011-01-06 at 21:38 +1100, Nick Piggin wrote:
> I can't see how the usage in libata could be right. If we're waiting
> for some modifications inside the critical section, then it seems we
> could load some shared data before the lock is actually released, on
> an architecture which does load/load reordering. At the very least it
> needs some good comments and perhaps a barrier or two.
> 
> The spin_unlock_wait in do_exit seems like it shouldn't be needed --
> exit_pi_state_list takes the pi_lock after the task has died anyway,
> so I can't see what the problem would be. Removing the call (and
> perhaps putting a BUG_ON(!(curr->flags & PF_EXITING)) in
> exit_pi_state_list) would do the same thing, avoid the barrier, and
> localize the futex exit protocol to futex.c. 

Jeff, Tejun, could you look at the ata-eh thing, then I'll put sorting
through the futex thing on my todo list.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH] spinlock: Kill spin_unlock_wait()
  2011-01-06 18:26                 ` Peter Zijlstra
@ 2011-01-07 21:01                   ` Tejun Heo
  2011-01-07 21:13                     ` Jeff Garzik
  0 siblings, 1 reply; 59+ messages in thread
From: Tejun Heo @ 2011-01-07 21:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nick Piggin, Linus Torvalds, Chris Mason, Frank Rowand,
	Ingo Molnar, Thomas Gleixner, Mike Galbraith, Oleg Nesterov,
	Paul Turner, Jens Axboe, Yong Zhang, linux-kernel,
	Jeremy Fitzhardinge, Linux-Arch, Jeff Garzik

Hello,

On Thu, Jan 06, 2011 at 07:26:35PM +0100, Peter Zijlstra wrote:
> Jeff, Tejun, could you look at the ata-eh thing, then I'll put sorting
> through the futex thing on my todo list.

Hmm... I think the ->eng_timeout path is already dead.  We no longer
have any in-kernel implementation, so killing spin_unlock_wait()
should be fine.  I'll follow up with removal of the unused callback.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH] spinlock: Kill spin_unlock_wait()
  2011-01-07 21:01                   ` Tejun Heo
@ 2011-01-07 21:13                     ` Jeff Garzik
  2011-01-07 21:33                       ` Tejun Heo
  0 siblings, 1 reply; 59+ messages in thread
From: Jeff Garzik @ 2011-01-07 21:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Nick Piggin, Linus Torvalds, Chris Mason,
	Frank Rowand, Ingo Molnar, Thomas Gleixner, Mike Galbraith,
	Oleg Nesterov, Paul Turner, Jens Axboe, Yong Zhang, linux-kernel,
	Jeremy Fitzhardinge, Linux-Arch, linux-scsi,
	IDE/ATA development list

On Fri, Jan 7, 2011 at 4:01 PM, Tejun Heo <tj@kernel.org> wrote:
> On Thu, Jan 06, 2011 at 07:26:35PM +0100, Peter Zijlstra wrote:
>> Jeff, Tejun, could you look at the ata-eh thing, then I'll put sorting
>> through the futex thing on my todo list.
>
> Hmm... I think the ->eng_timeout path is already dead.  We no longer
> have any in-kernel implementation, so killing spin_unlock_wait()
> should be fine.  I'll follow up with removal of the unused callback.

Unfortunately...  libsas continues to avoid the new EH :(

It's a hairy mess to untangle, too.  libata does proper error handling
of ATA device errors, notably NCQ error handling, which libsas sorely
misses.  But libata new EH assumes a bit too much about "owning" the
entirety of the EH process.  These assumptions are proper for wholly
ATA drivers (drivers/ata/*) where new EH can drive the EH process, but
in the SAS situation, a phy in SATA mode is simply a subset of a
larger set of EH conditions that must be handled.

Thus libsas uses the ancient libata hook ->phy_reset and lacks ->error_handler.

I think libata's old-EH path is entirely SAS-specific at this point.

     Jeff


P.S.  Note that libsas and ipr are peers; thus the ipr driver also uses the old EH.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH] spinlock: Kill spin_unlock_wait()
  2011-01-07 21:13                     ` Jeff Garzik
@ 2011-01-07 21:33                       ` Tejun Heo
  0 siblings, 0 replies; 59+ messages in thread
From: Tejun Heo @ 2011-01-07 21:33 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Peter Zijlstra, Nick Piggin, Linus Torvalds, Chris Mason,
	Frank Rowand, Ingo Molnar, Thomas Gleixner, Mike Galbraith,
	Oleg Nesterov, Paul Turner, Jens Axboe, Yong Zhang, linux-kernel,
	Jeremy Fitzhardinge, Linux-Arch, linux-scsi,
	IDE/ATA development list

Hello, Jeff.

On Fri, Jan 07, 2011 at 04:13:53PM -0500, Jeff Garzik wrote:
> > Hmm... I think the ->eng_timeout path is already dead.  We no longer
> > have any in-kernel implementation, so killing spin_unlock_wait()
> > should be fine.  I'll follow up with removal of the unused callback.
> 
> Unfortunately...  libsas continues to avoid the new EH :(
> 
> It's a hairy mess to untangle, too.  libata does proper error handling
> of ATA device errors, notably NCQ error handling, which libsas sorely
> misses.  But libata new EH assumes a bit too much about "owning" the
> entirety of the EH process.  These assumptions are proper for wholly
> ATA drivers (drivers/ata/*) where new EH can drive the EH process, but
> in the SAS situation, a phy in SATA mode is simply a subset of a
> larger set of EH conditions that must be handled.
> 
> Thus libsas uses the ancient libata hook ->phy_reset and lacks ->error_handler.
> 
> I think libata's old-EH path is entirely SAS-specific at this point.

Hmm... but they don't use ata_scsi_error() and ->eng_timeout() at all,
no?  We can't remove phy_reset(), and we need to keep the silly
"if (->error_handler)" tests around, but we should be able to remove
those from ata_scsi_error() and the other EH routines, at least.  Am I
missing something?
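
For anyone not staring at the code, the tests in question are of the
form (illustration only):

	if (ap->ops->error_handler) {
		/* new EH */
	} else {
		/* old ->eng_timeout style EH, by now only reachable
		 * via libsas/ipr */
	}

scattered through the libata EH paths.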

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC][PATCH 12/17] sched: Also serialize ttwu_local() with p->pi_lock
  2011-01-03 17:32   ` Oleg Nesterov
@ 2011-01-09 23:11     ` Tejun Heo
  0 siblings, 0 replies; 59+ messages in thread
From: Tejun Heo @ 2011-01-09 23:11 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Chris Mason, Frank Rowand, Ingo Molnar,
	Thomas Gleixner, Mike Galbraith, Paul Turner, Jens Axboe,
	Yong Zhang, linux-kernel

Hello,

On Mon, Jan 03, 2011 at 06:32:54PM +0100, Oleg Nesterov wrote:
> > @@ -2523,16 +2523,21 @@ static void try_to_wake_up_local(struct
> >
> >  	BUG_ON(rq != this_rq());
> >  	BUG_ON(p == current);
> > -	lockdep_assert_held(&rq->lock);
> > +
> > +	raw_spin_unlock(&rq->lock);
> > +	raw_spin_lock(&p->pi_lock);
> > +	raw_spin_lock(&rq->lock);
> 
> I _think_ this is safe, this worker can't change cpu afaics. But
> probably Tejun can take a look, just in case.

Yeah, preemption is disabled, so it should be safe.  Only bound workers
can be woken up by ttwu_local.  They get migrated to another CPU only
during CPU offlining, and neither CPU offlining nor task exit can
happen while preemption is disabled.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2011-01-09 23:11 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-12-24 12:23 [RFC][PATCH 00/17] sched: Reduce runqueue lock contention -v3 Peter Zijlstra
2010-12-24 12:23 ` [RFC][PATCH 01/17] sched: Always provide p->on_cpu Peter Zijlstra
2010-12-24 12:23 ` [RFC][PATCH 02/17] mutex: Use p->on_cpu for the adaptive spin Peter Zijlstra
2010-12-24 12:23 ` [RFC][PATCH 03/17] sched: Change the ttwu success details Peter Zijlstra
2010-12-24 12:23 ` [RFC][PATCH 04/17] sched: Clean up ttwu stats Peter Zijlstra
2010-12-24 12:23 ` [RFC][PATCH 05/17] x86: Optimize arch_spin_unlock_wait() Peter Zijlstra
2010-12-24 18:26   ` Linus Torvalds
2011-01-03 11:32     ` Peter Zijlstra
2011-01-04  6:45       ` Nick Piggin
2011-01-05 19:14         ` [RFC][PATCH] spinlock: Kill spin_unlock_wait() Peter Zijlstra
2011-01-05 19:26           ` Oleg Nesterov
2011-01-05 19:43           ` Linus Torvalds
2011-01-06  9:32             ` Peter Zijlstra
2011-01-06 10:38               ` Nick Piggin
2011-01-06 18:26                 ` Peter Zijlstra
2011-01-07 21:01                   ` Tejun Heo
2011-01-07 21:13                     ` Jeff Garzik
2011-01-07 21:33                       ` Tejun Heo
2010-12-24 12:23 ` [RFC][PATCH 06/17] sched: Provide p->on_rq Peter Zijlstra
2010-12-29 14:14   ` Yong Zhang
2010-12-24 12:23 ` [RFC][PATCH 07/17] sched: Serialize p->cpus_allowed and ttwu() using p->pi_lock Peter Zijlstra
2010-12-29 14:20   ` Yong Zhang
2011-01-03 11:12     ` Peter Zijlstra
2010-12-24 12:23 ` [RFC][PATCH 08/17] sched: Drop the rq argument to sched_class::select_task_rq() Peter Zijlstra
2010-12-29 14:31   ` Yong Zhang
2011-01-03 11:16     ` Peter Zijlstra
2011-01-03 14:59       ` Oleg Nesterov
2011-01-03 15:21         ` Peter Zijlstra
2011-01-03 15:49           ` Oleg Nesterov
2011-01-03 16:35             ` Peter Zijlstra
2011-01-03 16:41               ` Peter Zijlstra
2011-01-04  7:27             ` Yong Zhang
2011-01-04 12:34               ` Peter Zijlstra
2011-01-04  5:59       ` Yong Zhang
2011-01-04 13:00         ` Peter Zijlstra
2011-01-03 18:05   ` Oleg Nesterov
2011-01-04 13:01     ` Peter Zijlstra
2010-12-24 12:23 ` [RFC][PATCH 09/17] sched: Remove rq argument to sched_class::task_waking() Peter Zijlstra
2010-12-24 12:23 ` [RFC][PATCH 10/17] sched: Add TASK_WAKING to task_rq_lock Peter Zijlstra
2010-12-24 12:23 ` [RFC][PATCH 11/17] sched: Delay task_contributes_to_load() Peter Zijlstra
2010-12-24 12:23 ` [RFC][PATCH 12/17] sched: Also serialize ttwu_local() with p->pi_lock Peter Zijlstra
2011-01-03 17:32   ` Oleg Nesterov
2011-01-09 23:11     ` Tejun Heo
2010-12-24 12:23 ` [RFC][PATCH 13/17] sched: Remove rq->lock from the first half of ttwu() Peter Zijlstra
2010-12-24 12:23 ` [RFC][PATCH 14/17] sched: Remove rq argument to ttwu_stat() Peter Zijlstra
2010-12-29 14:40   ` Yong Zhang
2011-01-03 11:20     ` Peter Zijlstra
2010-12-24 12:23 ` [RFC][PATCH 15/17] sched: Rename ttwu_post_activation Peter Zijlstra
2010-12-24 12:23 ` [RFC][PATCH 16/17] sched: Move the second half of ttwu() to the remote cpu Peter Zijlstra
2011-01-03 14:36   ` [RFC][PATCH] sembench: add stddev to the burn stats Peter Zijlstra
2011-01-04 14:28   ` [RFC][PATCH 16/17] sched: Move the second half of ttwu() to the remote cpu Oleg Nesterov
2011-01-04 14:47     ` Peter Zijlstra
2011-01-04 15:18       ` Oleg Nesterov
2011-01-04 15:43         ` Peter Zijlstra
2011-01-04 16:06           ` Oleg Nesterov
2010-12-24 12:23 ` [RFC][PATCH 17/17] sched: Sort hotplug vs ttwu queueing Peter Zijlstra
2010-12-29 14:51   ` Yong Zhang
2011-01-03 11:21     ` Peter Zijlstra
2010-12-24 13:15 ` [RFC][PATCH 00/17] sched: Reduce runqueue lock contention -v3 Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).