* [patch] sched: fix set_task_cpu() and provide an unlocked runqueue variant
From: Mike Galbraith @ 2009-11-22 12:09 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: LKML


sched: fix set_task_cpu() and provide an unlocked runqueue variant.

set_task_cpu() falsifies migration stats by updating them unconditionally,
whether a task's cpu actually changed or not.  As used in copy_process(),
the runqueue is unlocked, so provide a variant, set_task_cpu_unlocked(),
which takes the lock itself and thereby provides a write barrier.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>

---
 include/linux/sched.h |    5 +++++
 kernel/fork.c         |    2 +-
 kernel/sched.c        |   38 ++++++++++++++++++++++++--------------
 3 files changed, 30 insertions(+), 15 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2056,7 +2056,6 @@ task_hot(struct task_struct *p, u64 now,
 	return delta < (s64)sysctl_sched_migration_cost;
 }
 
-
 void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 {
 	int old_cpu = task_cpu(p);
@@ -2065,10 +2064,10 @@ void set_task_cpu(struct task_struct *p,
 		      *new_cfsrq = cpu_cfs_rq(old_cfsrq, new_cpu);
 	u64 clock_offset;
 
-	clock_offset = old_rq->clock - new_rq->clock;
-
-	trace_sched_migrate_task(p, new_cpu);
+	if (unlikely(old_cpu == new_cpu))
+		goto out;
 
+	clock_offset = old_rq->clock - new_rq->clock;
 #ifdef CONFIG_SCHEDSTATS
 	if (p->se.wait_start)
 		p->se.wait_start -= clock_offset;
@@ -2076,22 +2075,33 @@ void set_task_cpu(struct task_struct *p,
 		p->se.sleep_start -= clock_offset;
 	if (p->se.block_start)
 		p->se.block_start -= clock_offset;
+	if (task_hot(p, old_rq->clock, NULL))
+		schedstat_inc(p, se.nr_forced2_migrations);
 #endif
-	if (old_cpu != new_cpu) {
-		p->se.nr_migrations++;
-#ifdef CONFIG_SCHEDSTATS
-		if (task_hot(p, old_rq->clock, NULL))
-			schedstat_inc(p, se.nr_forced2_migrations);
-#endif
-		perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS,
-				     1, 1, NULL, 0);
-	}
 	p->se.vruntime -= old_cfsrq->min_vruntime -
-					 new_cfsrq->min_vruntime;
+				 new_cfsrq->min_vruntime;
+	p->se.nr_migrations++;
+	trace_sched_migrate_task(p, new_cpu);
+	perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 1, NULL, 0);
 
+out:
 	__set_task_cpu(p, new_cpu);
 }
 
+void set_task_cpu_unlocked(struct task_struct *p, unsigned int new_cpu)
+{
+	unsigned long flags;
+	struct rq *rq, *new_rq = cpu_rq(new_cpu);
+
+	smp_wmb();
+	rq = task_rq_lock(p, &flags);
+	update_rq_clock(rq);
+	if (rq != new_rq)
+		update_rq_clock(new_rq);
+	set_task_cpu(p, new_cpu);
+	task_rq_unlock(rq, &flags);
+}
+
 struct migration_req {
 	struct list_head list;
 
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -2457,6 +2457,7 @@ static inline unsigned int task_cpu(cons
 }
 
 extern void set_task_cpu(struct task_struct *p, unsigned int cpu);
+extern void set_task_cpu_unlocked(struct task_struct *p, unsigned int cpu);
 
 #else
 
@@ -2469,6 +2470,10 @@ static inline void set_task_cpu(struct t
 {
 }
 
+static inline void set_task_cpu_unlocked(struct task_struct *p, unsigned int cpu)
+{
+}
+
 #endif /* CONFIG_SMP */
 
 extern void arch_pick_mmap_layout(struct mm_struct *mm);
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -1242,7 +1242,7 @@ static struct task_struct *copy_process(
 	p->rt.nr_cpus_allowed = current->rt.nr_cpus_allowed;
 	if (unlikely(!cpu_isset(task_cpu(p), p->cpus_allowed) ||
 			!cpu_online(task_cpu(p))))
-		set_task_cpu(p, smp_processor_id());
+		set_task_cpu_unlocked(p, smp_processor_id());
 
 	/* CLONE_PARENT re-uses the old parent */
 	if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) {




* Re: [patch] sched: fix set_task_cpu() and provide an unlocked runqueue variant
From: Peter Zijlstra @ 2009-11-25 18:27 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Ingo Molnar, LKML

On Sun, 2009-11-22 at 13:09 +0100, Mike Galbraith wrote:
> sched: fix set_task_cpu() and provide an unlocked runqueue variant.
> 
> set_task_cpu() falsifies migration stats by updating them unconditionally,
> whether a task's cpu actually changed or not.  As used in copy_process(),
> the runqueue is unlocked, so provide a variant, set_task_cpu_unlocked(),
> which takes the lock itself and thereby provides a write barrier.
> 
> Signed-off-by: Mike Galbraith <efault@gmx.de>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> LKML-Reference: <new-submission>
> 
> ---

> +void set_task_cpu_unlocked(struct task_struct *p, unsigned int new_cpu)
> +{
> +	unsigned long flags;
> +	struct rq *rq, *new_rq = cpu_rq(new_cpu);
> +
> +	smp_wmb();
> +	rq = task_rq_lock(p, &flags);
> +	update_rq_clock(rq);
> +	if (rq != new_rq)
> +		update_rq_clock(new_rq);
> +	set_task_cpu(p, new_cpu);
> +	task_rq_unlock(rq, &flags);
> +}

I've got to ask, what's that barrier for?





* Re: [patch] sched: fix set_task_cpu() and provide an unlocked runqueue variant
From: Mike Galbraith @ 2009-11-26  1:01 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, LKML

On Wed, 2009-11-25 at 19:27 +0100, Peter Zijlstra wrote:
> On Sun, 2009-11-22 at 13:09 +0100, Mike Galbraith wrote:
> > sched: fix set_task_cpu() and provide an unlocked runqueue variant.
> > 
> > set_task_cpu() falsifies migration stats by updating them unconditionally,
> > whether a task's cpu actually changed or not.  As used in copy_process(),
> > the runqueue is unlocked, so provide a variant, set_task_cpu_unlocked(),
> > which takes the lock itself and thereby provides a write barrier.
> > 
> > Signed-off-by: Mike Galbraith <efault@gmx.de>
> > Cc: Ingo Molnar <mingo@elte.hu>
> > Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > LKML-Reference: <new-submission>
> > 
> > ---
> 
> > +void set_task_cpu_unlocked(struct task_struct *p, unsigned int new_cpu)
> > +{
> > +	unsigned long flags;
> > +	struct rq *rq, *new_rq = cpu_rq(new_cpu);
> > +
> > +	smp_wmb();
> > +	rq = task_rq_lock(p, &flags);
> > +	update_rq_clock(rq);
> > +	if (rq != new_rq)
> > +		update_rq_clock(new_rq);
> > +	set_task_cpu(p, new_cpu);
> > +	task_rq_unlock(rq, &flags);
> > +}
> 
> I've got to ask, what's that barrier for?

It's a leftover from frustrated bug hunting.

	-Mike 



* Re: [patch] sched: fix set_task_cpu() and provide an unlocked runqueue variant
From: Mike Galbraith @ 2009-11-26  1:31 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, LKML

On Thu, 2009-11-26 at 02:01 +0100, Mike Galbraith wrote:
> On Wed, 2009-11-25 at 19:27 +0100, Peter Zijlstra wrote:

> > I've got to ask, what's that barrier for?
> 
> It's a leftover from frustrated bug hunting.

To be more specific, I put it there to ensure that min_vruntimes are
stable.  I figured that if try_to_wake_up() needs a barrier to look at
task->state, I had better do the same for the runqueues.

	-Mike



* Re: [patch] sched: fix set_task_cpu() and provide an unlocked runqueue variant
From: Peter Zijlstra @ 2009-11-26  9:35 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Ingo Molnar, LKML

On Thu, 2009-11-26 at 02:31 +0100, Mike Galbraith wrote:
> On Thu, 2009-11-26 at 02:01 +0100, Mike Galbraith wrote:
> > On Wed, 2009-11-25 at 19:27 +0100, Peter Zijlstra wrote:
> 
> > > I've got to ask, what's that barrier for?
> > 
> > It's a leftover from frustrated bug hunting.
> 
> To be more specific, I put it there to ensure that min_vruntimes are
> stable. 

min_vruntime should only ever be poked at when holding the respective
rq->lock, even with a barrier a 64bit read on a 32bit machine can go all
funny.
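
Concretely (a hypothetical userspace sketch, not kernel code, assuming a
little-endian 32-bit target where a u64 load compiles to two 32-bit loads):

#include <stdint.h>

/* Notionally rq->cfs.min_vruntime; say it holds 0x00000000ffffffff. */
static volatile uint64_t min_vruntime;

uint64_t racy_read(void)
{
        uint32_t lo = ((volatile uint32_t *)&min_vruntime)[0]; /* 0xffffffff */
        /* a locked writer slips in here: min_vruntime += 1,
         * so it becomes 0x0000000100000000 */
        uint32_t hi = ((volatile uint32_t *)&min_vruntime)[1]; /* 0x00000001 */
        return ((uint64_t)hi << 32) | lo; /* 0x00000001ffffffff: never existed */
}

No barrier repairs that; only holding rq->lock across the read does.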

>  I figured that if try_to_wake_up() needs a barrier to look at
> task->state, I had better do the same for the runqueues.

Ah, ttwu() has that barrier for another reason. The wmb in ttwu() is to
ensure the wakee sees the state of the waker at the time of waking.

That is, it's about ordering things like:


   A				   B


my_cond = true;
wake_process(my_friend);

				while (!my_cond)
				  schedule();


So that you can actually observe my_cond being true once you wake up
(schedule acts as a mb() when it actually schedules).
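
Rendered as a kernel-idiom sketch (illustrative only: in the real code the
wmb lives inside try_to_wake_up() itself, and my_cond/my_friend are the
stand-ins from the diagram above):

/* Waker (A): publish the condition before waking. */
my_cond = true;
smp_wmb();                      /* order the store ahead of the wake */
wake_up_process(my_friend);     /* examines my_friend->state */

/* Wakee (B): a schedule() that really schedules implies mb(), so once
 * B resumes here, my_cond is guaranteed to read as true. */
for (;;) {
        set_current_state(TASK_INTERRUPTIBLE);
        if (my_cond)
                break;
        schedule();
}
__set_current_state(TASK_RUNNING);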



* Re: [patch] sched: fix set_task_cpu() and provide an unlocked runqueue variant
From: Mike Galbraith @ 2009-11-26 10:16 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, LKML

On Thu, 2009-11-26 at 10:35 +0100, Peter Zijlstra wrote:
> On Thu, 2009-11-26 at 02:31 +0100, Mike Galbraith wrote:
> > On Thu, 2009-11-26 at 02:01 +0100, Mike Galbraith wrote:
> > > On Wed, 2009-11-25 at 19:27 +0100, Peter Zijlstra wrote:
> > 
> > > > I've got to ask, what's that barrier for?
> > > 
> > > It's a leftover from frustrated bug hunting.
> > 
> > To be more specific, I put it there to ensure that min_vruntimes are
> > stable. 
> 
> min_vruntime should only ever be poked at when holding the respective
> rq->lock, even with a barrier a 64bit read on a 32bit machine can go all
> funny.

Yeah, but we're looking at an unlocked runqueue.  But never mind...

> >  I figured that if try_to_wake_up() needs a barrier to look at
> > task->state, I had better do the same for the runqueues.
> 
> Ah, ttwu() has that barrier for another reason. The wmb in ttwu() is to
> ensure the wakee sees the state of the waker at the time of waking.
> 
> That is, it's about ordering things like:
> 
> 
>    A				   B
> 
> 
> my_cond = true;
> wake_process(my_friend);
> 
> 				while (!my_cond)
> 				  schedule();
> 
> 
> So that you can actually observe my_cond being true once you wake up
> (schedule acts as a mb() when it actually schedules).

(Ah!  I think you actually made a wrinkle in grey-ware)

...ATM, I kinda wish I'd not gone off and read barriers.txt.  It hasn't
sunk in yet, but certainly has made me _paranoid as hell_ :-)

	-Mike



* Re: [patch] sched: fix set_task_cpu() and provide an unlocked runqueue variant
From: Peter Zijlstra @ 2009-11-26 14:09 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Ingo Molnar, LKML

On Thu, 2009-11-26 at 11:16 +0100, Mike Galbraith wrote:
> > min_vruntime should only ever be poked at when holding the respective
> > rq->lock, even with a barrier a 64bit read on a 32bit machine can go all
> > funny.
> 
> Yeah, but we're looking at an unlocked runqueue.  But never mind...

The patch is also poking at rq->clock without rq->lock held... not very
nice.

Gah, I hate that we're doing migration things without holding both rq's,
this is making life so very interesting ;-)



* Re: [patch] sched: fix set_task_cpu() and provide an unlocked runqueue variant
From: Peter Zijlstra @ 2009-11-26 14:21 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Ingo Molnar, LKML

On Thu, 2009-11-26 at 15:09 +0100, Peter Zijlstra wrote:
> On Thu, 2009-11-26 at 11:16 +0100, Mike Galbraith wrote:
> > > min_vruntime should only ever be poked at when holding the respective
> > > rq->lock, even with a barrier a 64bit read on a 32bit machine can go all
> > > funny.
> > 
> > Yeah, but we're looking at an unlocked runqueue.  But never mind...
> 
> The patch is also poking at rq->clock without rq->lock held... not very
> nice.
> 
> Gah, I hate that we're doing migration things without holding both rq's,
> this is making life so very interesting ;-)

So the problem is this bit in kernel/fork.c, which is run with
tasklist_lock held for writing:

	/*
	 * The task hasn't been attached yet, so its cpus_allowed mask will
	 * not be changed, nor will its assigned CPU.
	 *
	 * The cpus_allowed mask of the parent may have changed after it was
	 * copied first time - so re-copy it here, then check the child's CPU
	 * to ensure it is on a valid CPU (and if not, just force it back to
	 * parent's CPU). This avoids alot of nasty races.
	 */
	p->cpus_allowed = current->cpus_allowed;
	p->rt.nr_cpus_allowed = current->rt.nr_cpus_allowed;
	if (unlikely(!cpu_isset(task_cpu(p), p->cpus_allowed) ||
			!cpu_online(task_cpu(p))))
		set_task_cpu(p, smp_processor_id());


The problem is that that doesn't close any races at all since
tasklist_lock doesn't fully serialize changes to ->cpus_allowed.

In fact, there is nothing that protects that mask at all.
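
A sketch of the window (hypothetical interleaving; sched_setaffinity() is
just one writer, and it has dropped tasklist_lock again by the time it
touches the mask):

/*
 *  CPU0: copy_process()               CPU1: sched_setaffinity(parent)
 *  --------------------               -------------------------------
 *  write_lock_irq(&tasklist_lock);
 *  p->cpus_allowed =
 *          current->cpus_allowed;
 *                                     set_cpus_allowed_ptr(parent, new);
 *                                       (under rq->lock; tasklist_lock
 *                                        was dropped long before)
 *  cpu_isset(task_cpu(p),
 *            p->cpus_allowed);        <- checks an already-stale copy
 */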

The second problem is that set_task_cpu() is accessing data from both
the old and the new rq, which basically requires it be run with both
rq's locked, and the regular migration paths do so.
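
For reference, the double-lock those migration paths rely on looks roughly
like this (a from-memory sketch of kernel/sched.c's double_rq_lock(), so
details may differ; it orders by runqueue address to dodge ABBA deadlock):

static void double_rq_lock(struct rq *rq1, struct rq *rq2)
        __acquires(rq1->lock)
        __acquires(rq2->lock)
{
        BUG_ON(!irqs_disabled());
        if (rq1 == rq2) {
                spin_lock(&rq1->lock);
                __acquire(rq2->lock);   /* Fake it out ;) */
        } else {
                if (rq1 < rq2) {
                        spin_lock(&rq1->lock);
                        spin_lock_nested(&rq2->lock, SINGLE_DEPTH_NESTING);
                } else {
                        spin_lock(&rq2->lock);
                        spin_lock_nested(&rq1->lock, SINGLE_DEPTH_NESTING);
                }
        }
        update_rq_clock(rq1);
        update_rq_clock(rq2);
}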

However things like ttwu() try to be cute and do not, opening the doors
to all kinds of funny.

Clearly we don't really want to do double_rq_lock() in ttwu(), that's
one of the hotter paths around (and looking at it we ought to seriously
look at trimming some of it).

Which sort of leaves us in a predicament all right...

/me goes puzzle.



* Re: [patch] sched: fix set_task_cpu() and provide an unlocked runqueue variant
From: Mike Galbraith @ 2009-11-26 14:58 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, LKML

On Thu, 2009-11-26 at 15:09 +0100, Peter Zijlstra wrote:
> On Thu, 2009-11-26 at 11:16 +0100, Mike Galbraith wrote:
> > > min_vruntime should only ever be poked at when holding the respective
> > > rq->lock, even with a barrier a 64bit read on a 32bit machine can go all
> > > funny.
> > 
> > Yeah, but we're looking at an unlocked runqueue.  But never mind...
> 
> The patch is also poking at rq->clock without rq->lock held... not very
> nice.

To the users on the remote runqueue it doesn't matter; they always
update before using, so they will always have accurate time.
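
That is, every remote reader follows roughly this pattern (a sketch):

spin_lock(&rq->lock);           /* remote users take their own rq->lock */
update_rq_clock(rq);            /* resync rq->clock before reading it */
now = rq->clock;                /* only then consume the value */
spin_unlock(&rq->lock);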

What bothers me is that I don't see what prevents a SCHED_IDLE task from
making a huge min_vruntime update on the remote CPU right while you're
reading it, as you queue a sleeper here who used to live over there.

But they're all instantaneous values anyway, so I suppose memory speed
is fine.  It works... and I really doubt that yet another paranoid
barrier will make diddly spit difference, but I put an smp_rmb() there
anyway to see with my own eyes ;-)

> Gah, I hate that we're doing migration things without holding both rq's,
> this is making life so very interesting ;-)

Yes.

	-Mike



* Re: [patch] sched: fix set_task_cpu() and provide an unlocked runqueue variant
From: Mike Galbraith @ 2009-11-26 15:32 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, LKML

On Thu, 2009-11-26 at 15:21 +0100, Peter Zijlstra wrote:
> On Thu, 2009-11-26 at 15:09 +0100, Peter Zijlstra wrote:
> > On Thu, 2009-11-26 at 11:16 +0100, Mike Galbraith wrote:
> > > > min_vruntime should only ever be poked at when holding the respective
> > > > rq->lock, even with a barrier a 64bit read on a 32bit machine can go all
> > > > funny.
> > > 
> > > Yeah, but we're looking at an unlocked runqueue.  But never mind...
> > 
> > The patch is also poking at rq->clock without rq->lock held... not very
> > nice.
> > 
> > Gah, I hate that we're doing migration things without holding both rq's,
> > this is making life so very interesting ;-)
> 
> So the problem is this bit in kernel/fork.c, which is run with
> tasklist_lock held for writing:
> 
> 	/*
> 	 * The task hasn't been attached yet, so its cpus_allowed mask will
> 	 * not be changed, nor will its assigned CPU.
> 	 *
> 	 * The cpus_allowed mask of the parent may have changed after it was
> 	 * copied first time - so re-copy it here, then check the child's CPU
> 	 * to ensure it is on a valid CPU (and if not, just force it back to
> 	 * parent's CPU). This avoids alot of nasty races.
> 	 */
> 	p->cpus_allowed = current->cpus_allowed;
> 	p->rt.nr_cpus_allowed = current->rt.nr_cpus_allowed;
> 	if (unlikely(!cpu_isset(task_cpu(p), p->cpus_allowed) ||
> 			!cpu_online(task_cpu(p))))
> 		set_task_cpu(p, smp_processor_id());
> 
> 
> The problem is that that doesn't close any races at all since
> tasklist_lock doesn't fully serialize changes to ->cpus_allowed.

Well, some stuff can't get at you if you're there, but yes, I was
wondering how fixing it up there was going to guarantee a happy landing
when we get to... wake_up_new_task(). 

> In fact, there is nothing that protects that mask at all.
> 
> The second problem is that set_task_cpu() is accessing data from both
> the old and the new rq, which basically requires it be run with both
> rq's locked, and the regular migration paths do so.

Yes, and task_cpu() and task_rq() are racy as heck without the lock.  It
all goes fuzzy.

sched_class can change out from under you the instant you release the
runqueue lock afaict; nice level, affinity... etc?

> However things like ttwu() try to be cute and do not, opening the doors
> to all kinds of funny.

Yes, so all the raciness I've been imagining isn't _all_ imaginary.
Yoohoo.  Um, I mean damn.

> Clearly we don't really want to do double_rq_lock() in ttwu(), that's
> one of the hotter paths around (and looking at it we ought to seriously
> look at trimming some of it).

No, apparently not.  About an hour ago, paranoid little me merely did
lock handoff in ttwu and... wunt (wunt?), and was rewarded with a
deadlocked box a bit after X came up.

WRT lard, yes, it is getting fat.  The cache misses of the prefer
sibling thing are hurting very fast threads too.  Much reward if you
find a sibling, ~4% pain for TCP_RR with the cache misses and whatnot
you waste looking around for a spot for a pinned ultralight task.

Wish I could find an answer for the sibling thing.  Nearly doubles
throughput for some things.

	-Mike



Thread overview: 10 messages
2009-11-22 12:09 [patch] sched: fix set_task_cpu() and provide an unlocked runqueue variant Mike Galbraith
2009-11-25 18:27 ` Peter Zijlstra
2009-11-26  1:01   ` Mike Galbraith
2009-11-26  1:31     ` Mike Galbraith
2009-11-26  9:35       ` Peter Zijlstra
2009-11-26 10:16         ` Mike Galbraith
2009-11-26 14:09           ` Peter Zijlstra
2009-11-26 14:21             ` Peter Zijlstra
2009-11-26 15:32               ` Mike Galbraith
2009-11-26 14:58             ` Mike Galbraith
