* Scheduler bug related to rq->skip_clock_update?
@ 2010-11-21  4:22 Bjoern B. Brandenburg
  2010-11-21 17:14 ` Mike Galbraith
  2010-12-03 12:41 ` Peter Zijlstra
  0 siblings, 2 replies; 23+ messages in thread
From: Bjoern B. Brandenburg @ 2010-11-21  4:22 UTC (permalink / raw)
  To: Peter Zijlstra, Mike Galbraith, Ingo Molnar
  Cc: Andrea Bastoni, James H. Anderson, linux-kernel


Hi Peter, hi Mike, hi Ingo,

we have observed some strangeness regarding the runqueue clock updates when
porting the LITMUS^RT patches (http://www.cs.unc.edu/~anderson/litmus-rt)
from .34 to .36. Specifically, it seems that the flag rq->skip_clock_update
remains set for prolonged periods in some circumstances. Not updating the
runqueue clock while a task is scheduled obviously causes the scheduler to
make badly misguided decisions.

I was also able to reproduce this on the latest pre-.37 _without_ LITMUS^RT
patches (see below), so I don't think that the problem was introduced by us.
First, here's a debug trace that illustrates the problem. (The trace is from
LITMUS^RT applied to .36 on a four-CPU host with copious amounts of extra
debug info. The first number in the trace is a sequence number and can be
ignored.)

> 613953 P3: SCHED_STATE [P3] 0x1 (TASK_SCHEDULED) -> 0x4 (WILL_SCHEDULE)
> 613954 P3: (kworker/u:0/5) clear_tsk_need_resched

The workload consisted of tmux, a couple of top instances, and compiling a
library (liblitmus). On CPU 3, kworker invoked schedule() because it blocked.
Its TIF_NEED_RESCHED flag was cleared. The idle load balancer was activated
because kworker blocked and there was nothing else pending.
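
For reference, the relevant part of schedule() looks roughly like this in
.36 (a heavily simplified sketch; signal and workqueue handling elided).
Note that TIF_NEED_RESCHED is cleared *before* idle_balance() runs, so
anything the balancer does afterwards can mark prev again behind
schedule()'s back:

    raw_spin_lock_irq(&rq->lock);
    clear_tsk_need_resched(prev);

    if (prev->state && !(preempt_count() & PREEMPT_ACTIVE))
        deactivate_task(rq, prev, DEQUEUE_SLEEP);  /* prev is blocking */

    pre_schedule(rq, prev);

    if (unlikely(!rq->nr_running))
        idle_balance(cpu, rq);  /* may set_tsk_need_resched(prev) via a pull */

    put_prev_task(rq, prev);
    next = pick_next_task(rq);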

> 613955 P3: (kworker/u:0/5) XXXX calling resched_task() during schedule()
> from ret:0xc0045400 [rt:0 nr_running:1]
> 613957 P3: (kworker/u:0/5) set_tsk_need_resched() ret:c004495c task_cpu:3

The idle load balancer marked kworker with set_tsk_need_resched(), even
though kworker was on its way out anyway. CPU 3 continued to schedule and
picked 'ld'. So far so good.

> 613960 P3: SCHED_STATE [P3] 0x4 (WILL_SCHEDULE) -> 0x8 (TASK_PICKED)
> 613961 P3: SCHED_STATE [P3] 0x8 (TASK_PICKED) -> 0x1 (TASK_SCHEDULED)
> 613962 P3: SCHED_STATE [P3] 0x1 (TASK_SCHEDULED) -> 0x4 (WILL_SCHEDULE)
> 613963 P3: (ld/1107) set_tsk_need_resched() ret:c004495c task_cpu:3
> 613964 P3: SCHED_STATE [P3] 0x4 (WILL_SCHEDULE) -> 0x4 (WILL_SCHEDULE)
> 613965 P3: (ld/1107) clear_tsk_need_resched
> 613966 P3: SCHED_STATE [P3] 0x4 (WILL_SCHEDULE) -> 0x8 (TASK_PICKED)
> 613967 P3: SCHED_STATE [P3] 0x8 (TASK_PICKED) -> 0x1 (TASK_SCHEDULED)
> 613968 P3: SCHED_STATE [P3] 0x1 (TASK_SCHEDULED) -> 0x4 (WILL_SCHEDULE)
> 613969 P3: (kworker/3:1/411) clear_tsk_need_resched
> 613970 P3: SCHED_STATE [P3] 0x4 (WILL_SCHEDULE) -> 0x8 (TASK_PICKED)
> 613971 P3: SCHED_STATE [P3] 0x8 (TASK_PICKED) -> 0x1 (TASK_SCHEDULED)

Some time passed while CPU 3 was busy servicing ld and a different kworker.
Then something bad happened.

> 613972 P0: (kworker/u:0/5) already need_resched when resuming
> 613973 P0: (kworker/u:0/5) rq->skip_clock_update <- 1 (flags=0; cpu=3;
> is-cur?=0)

The kworker that was marked with set_tsk_need_resched() is now woken up
from CPU 0. Because the flag is already set, check_preempt_curr() sets
rq->skip_clock_update *on CPU 3's runqueue*.

CPU 3 never notices that anything happened and continues to schedule its
currently assigned task (ld). Because skip_clock_update is set, the reference
time for the scheduler does not advance anymore. Hence ld appears to consume
no execution time, and CFS keeps scheduling it until it blocks by itself
(or until a real-time (SCHED_FIFO/SCHED_RR) task becomes active).
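
To illustrate why the delta stays at zero: CFS accounts execution time
against the (now frozen) runqueue clock. Roughly, simplified from the
.36-era kernel/sched_fair.c:

    static void update_curr(struct cfs_rq *cfs_rq)
    {
        struct sched_entity *curr = cfs_rq->curr;
        u64 now = rq_of(cfs_rq)->clock;  /* frozen while skip_clock_update is set */
        unsigned long delta_exec;

        if (unlikely(!curr))
            return;

        delta_exec = (unsigned long)(now - curr->exec_start);
        if (!delta_exec)
            return;  /* clock did not advance => ld appears to run for 0 ns */

        /* __update_curr(): sum_exec_runtime += delta_exec, vruntime update, ... */
        __update_curr(cfs_rq, curr, delta_exec);
        curr->exec_start = now;
    }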

For some time after that, the scheduler tick on P3 reports that nothing
changes.

> 613974 P3: (ld/1107) rq->skip_clock_update [CFS] =>
> sum_exec_runtime=569265250  delta=0
> 613975 P3: (ld/1107) rq->skip_clock_update [CFS] =>
> sum_exec_runtime=569265250  delta=0
> 613976 P3: (ld/1107) rq->skip_clock_update [CFS] =>
> sum_exec_runtime=569265250  delta=0
> 613977 P3: (ld/1107) rq->skip_clock_update [CFS] =>
> sum_exec_runtime=569265250  delta=0
> 613992 P3: (ld/1107) rq->skip_clock_update [CFS] =>
> sum_exec_runtime=569265250  delta=0
> 613993 P3: (ld/1107) rq->skip_clock_update [CFS] =>
> sum_exec_runtime=569265250  delta=0
> 613994 P3: (ld/1107) rq->skip_clock_update [CFS] =>
> sum_exec_runtime=569265250  delta=0
> 614002 P3: (ld/1107) rq->skip_clock_update [CFS] =>
> sum_exec_runtime=569265250  delta=0
> 614005 P2: (ld/1107) rq->skip_clock_update [CFS] =>
> sum_exec_runtime=569265250  delta=0
> 614025 P3: (ld/1107) rq->skip_clock_update [CFS] =>
> sum_exec_runtime=569265250  delta=0
> 614080 P3: (ld/1107) rq->skip_clock_update [CFS] =>
> sum_exec_runtime=569265250  delta=0
> 614117 P3: (ld/1107) rq->skip_clock_update [CFS] =>
> sum_exec_runtime=569265250  delta=0
> 614146 P3: (ld/1107) rq->skip_clock_update [CFS] =>
> sum_exec_runtime=569265250  delta=0
> 614186 P3: (ld/1107) rq->skip_clock_update [CFS] =>
> sum_exec_runtime=569265250  delta=0
> 614233 P3: (ld/1107) rq->skip_clock_update [CFS] =>
> sum_exec_runtime=569265250  delta=0
> 614235 P3: (ld/1107) rq->skip_clock_update [CFS] =>
> sum_exec_runtime=569265250  delta=0

And so on...

This seems very wrong and it causes massive problems for LITMUS^RT. As a
workaround, never setting skip_clock_update in check_preempt_curr() gets rid
of all symptoms.

I was under the impression that, as an invariant, tasks should not have
TIF_NEED_RESCHED set after they've blocked. In this case, the idle load
balancer should not mark the task that's on its way out with
set_tsk_need_resched().

In any case, check_preempt_curr() seems to assume that a resuming task cannot
have TIF_NEED_RESCHED already set. Setting skip_clock_update on a remote CPU
that hasn't even been notified via IPI seems wrong.
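
For reference, this is the tail of check_preempt_curr() that encodes that
assumption (it also shows up as context in the patch below); the workaround
mentioned above is simply to drop the assignment:

    /*
     * A queue event has occurred, and we're going to schedule.  In
     * this case, we can save a useless back to back clock update.
     */
    if (test_tsk_need_resched(rq->curr))
        rq->skip_clock_update = 1;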

Since AFAIK printk() can't be called while holding a runqueue lock, I am not
sure what the best way is to reproduce a trace like the above on a vanilla
kernel. As a first approximation, the patch below simply tracks how much time
passes between setting and clearing skip_clock_update.

With the patch applied on top of b86db4744230c94e480de56f1b7f31117edbf193
(the current master), I got the following numbers on a four-CPU ARM11 MPCore.

> bbb@district10:/proc$ egrep 'cpu#|skip' /proc/sched_debug
> cpu#0
>   .skip_clock_max                : 139310500
> cpu#1
>   .skip_clock_max                : 1831250
> cpu#2
>   .skip_clock_max                : 1112750
> cpu#3
>   .skip_clock_max                : 1243375

On CPU 0, the flag was set at some point for over 139 milliseconds! Even on
CPU 2, the flag was set for over one millisecond.

I believe this shows that the problem persists in the latest kernel and that
it is not LITMUS^RT-related. It seems the load balancer should not mark the
currently-scheduled task with set_tsk_need_resched() when invoked from within
schedule().

Comments? Advice?

Thanks,
Bjoern

---
diff --git a/kernel/sched.c b/kernel/sched.c
index dc91a4d..ccc88cc 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -541,6 +541,9 @@ struct rq {
 	unsigned long long rq_cpu_time;
 	/* could above be rq->cfs_rq.exec_clock + rq->rt_rq.rt_runtime ? */

+	u64 skip_clock_max;
+	u64 skip_clock_set;
+
 	/* sys_sched_yield() stats */
 	unsigned int yld_count;

@@ -2129,8 +2132,11 @@ static void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
 	 * A queue event has occurred, and we're going to schedule.  In
 	 * this case, we can save a useless back to back clock update.
 	 */
-	if (test_tsk_need_resched(rq->curr))
+	if (test_tsk_need_resched(rq->curr)) {
+		schedstat_set(rq->skip_clock_set,
+			      sched_clock_cpu(cpu_of(rq)));
 		rq->skip_clock_update = 1;
+	}
 }

 #ifdef CONFIG_SMP
@@ -3845,7 +3851,12 @@ static void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
 	if (prev->se.on_rq)
 		update_rq_clock(rq);
-	rq->skip_clock_update = 0;
+	if (unlikely(rq->skip_clock_update)) {
+		schedstat_set(rq->skip_clock_max,
+			      max(rq->skip_clock_max,
+				  sched_clock_cpu(cpu_of(rq)) - rq->skip_clock_set));
+		rq->skip_clock_update = 0;
+	}
 	prev->sched_class->put_prev_task(rq, prev);
 }

diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 2e1b0d1..63ce99d 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -298,6 +298,8 @@ static void print_cpu(struct seq_file *m, int cpu)

 	P(bkl_count);

+	P64(skip_clock_max);
+
 #undef P
 #endif
 	print_cfs_stats(m, cpu);


* Re: Scheduler bug related to rq->skip_clock_update?
  2010-11-21  4:22 Scheduler bug related to rq->skip_clock_update? Bjoern B. Brandenburg
@ 2010-11-21 17:14 ` Mike Galbraith
  2010-11-22  4:29   ` Bjoern B. Brandenburg
  2010-12-03 12:41 ` Peter Zijlstra
  1 sibling, 1 reply; 23+ messages in thread
From: Mike Galbraith @ 2010-11-21 17:14 UTC (permalink / raw)
  To: Bjoern B. Brandenburg
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Bastoni, James H. Anderson,
	linux-kernel

On Sat, 2010-11-20 at 23:22 -0500, Bjoern B. Brandenburg wrote:

> I was under the impression that, as an invariant, tasks should not have
> TIF_NEED_RESCHED set after they've blocked. In this case, the idle load
> balancer should not mark the task that's on its way out with
> set_tsk_need_resched().

Nice find.

> In any case, check_preempt_curr() seems to assume that a resuming task cannot
> have TIF_NEED_RESCHED already set. Setting skip_clock_update on a remote CPU
> that hasn't even been notified via IPI seems wrong.

Yes. Does the below fix it up for you?

Sched: clear_tsk_need_resched() after pull_task() when NEWIDLE balancing

pull_task() may call set_tsk_need_resched() on a deactivated task,
leaving it vulnerable to an inappropriate preemption after wakeup.

This also confuses the skip_clock_update logic, which assumes that
schedule() will be called in very short order after the flag is set.
Make that logic more robust by clearing the flag in update_rq_clock()
itself, so at most one update can be skipped.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Bjoern B. Brandenburg <bbb.lst@gmail.com>
Reported-by: Bjoern B. Brandenburg <bbb.lst@gmail.com>

---
 kernel/sched.c      |    3 ++-
 kernel/sched_fair.c |   10 ++++++++--
 2 files changed, 10 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -657,6 +657,8 @@ inline void update_rq_clock(struct rq *r
 
 		sched_irq_time_avg_update(rq, irq_time);
 	}
+
+	rq->skip_clock_update = 0;
 }
 
 /*
@@ -3714,7 +3716,6 @@ static void put_prev_task(struct rq *rq,
 {
 	if (prev->se.on_rq)
 		update_rq_clock(rq);
-	rq->skip_clock_update = 0;
 	prev->sched_class->put_prev_task(rq, prev);
 }
 
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -2019,15 +2019,21 @@ balance_tasks(struct rq *this_rq, int th
 		pulled++;
 		rem_load_move -= p->se.load.weight;
 
-#ifdef CONFIG_PREEMPT
 		/*
+		 * pull_task() may have set_tsk_need_resched().  Clear it
+		 * lest a sleeper awaken and be inappropriately preempted
+		 * shortly thereafter.
+		 *
 		 * NEWIDLE balancing is a source of latency, so preemptible
 		 * kernels will stop after the first task is pulled to minimize
 		 * the critical section.
 		 */
-		if (idle == CPU_NEWLY_IDLE)
+		if (idle == CPU_NEWLY_IDLE) {
+			clear_tsk_need_resched(this_rq->curr);
+#ifdef CONFIG_PREEMPT
 			break;
 #endif
+		}
 
 		/*
 		 * We only want to steal up to the prescribed amount of




* Re: Scheduler bug related to rq->skip_clock_update?
  2010-11-21 17:14 ` Mike Galbraith
@ 2010-11-22  4:29   ` Bjoern B. Brandenburg
  2010-11-22 16:19     ` Mike Galbraith
  0 siblings, 1 reply; 23+ messages in thread
From: Bjoern B. Brandenburg @ 2010-11-22  4:29 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Bastoni, James H. Anderson,
	linux-kernel

On Sun, 21 Nov 2010, Mike Galbraith wrote:

> On Sat, 2010-11-20 at 23:22 -0500, Bjoern B. Brandenburg wrote:
>
> > I was under the impression that, as an invariant, tasks should not have
> > TIF_NEED_RESCHED set after they've blocked. In this case, the idle load
> > balancer should not mark the task that's on its way out with
> > set_tsk_need_resched().
>
> Nice find.
>
> > In any case, check_preempt_curr() seems to assume that a resuming task cannot
> > have TIF_NEED_RESCHED already set. Setting skip_clock_update on a remote CPU
> > that hasn't even been notified via IPI seems wrong.
>
> Yes. Does the below fix it up for you?

The patch definitely changes the behavior, but it doesn't seem to solve (all
of) the root cause(s). The failsafe kicks in and clears the flag the next
time that update_rq_clock() is called, but there can still be a significant
delay between setting and clearing the flag. Right after boot, I'm now seeing
values that go up to ~21ms.

> bbb@district10:~$ egrep 'cpu#|skip' /proc/sched_debug
> cpu#0
>   .skip_clock_max                : 18279250
> cpu#1
>   .skip_clock_max                : 17190250
> cpu#2
>   .skip_clock_max                : 9292875
> cpu#3
>   .skip_clock_max                : 21760250

Resetting the values in a (mostly) idle system yields delays up to ~8.4ms.

> rq CPU0 skip_clock_max=8403000 skip_clock_count=465
> rq CPU1 skip_clock_max=3605375 skip_clock_count=90
> rq CPU2 skip_clock_max=1007375 skip_clock_count=529
> rq CPU3 skip_clock_max=6248500 skip_clock_count=197

If I let it run for a while with a couple 'find /' and 'top' processes in the
background, I'm seeing delays close to 10ms.

> rq CPU0 skip_clock_max=9893500 skip_clock_count=56779
> rq CPU1 skip_clock_max=9916750 skip_clock_count=48852
> rq CPU2 skip_clock_max=9915500 skip_clock_count=49279
> rq CPU3 skip_clock_max=9920500 skip_clock_count=53311

It makes sense for the maximum skip time to converge to 10ms since
HZ=100 on the ARM test system, so after at most 10ms scheduler_tick() should
kick in and call update_rq_clock().
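
For reference, the tick path that enforces this bound looks roughly like
this (simplified sketch; cpu load tracking and sched_class->task_tick()
bookkeeping elided):

    void scheduler_tick(void)
    {
        int cpu = smp_processor_id();
        struct rq *rq = cpu_rq(cpu);

        sched_clock_tick();

        raw_spin_lock(&rq->lock);
        update_rq_clock(rq);  /* with Mike's patch, this also clears skip_clock_update */
        /* ... load tracking and task_tick() elided ... */
        raw_spin_unlock(&rq->lock);
    }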

I'm also seeing similar delays on an 8-core x86_64 system with HZ=1000. With
a workload of 'make clean && make -j8 bzImage' and 'top', I get stable delays
in the range of 1ms after boot, and a max delay of ~53ms recorded during boot
(the field skip_clock_recent_max resets to zero every 10000 samples;
skip_clock_max tracks the all-time max since the processor came online).

> bbb@koruna:~$ egrep 'cpu#|skip' /proc/sched_debug
> cpu#0, 2493.662 MHz
>   .skip_clock_count              : 32302
>   .skip_clock_recent_max         : 997011
>   .skip_clock_max                : 40643460
> cpu#1, 2493.662 MHz
>   .skip_clock_count              : 32798
>   .skip_clock_recent_max         : 1000658
>   .skip_clock_max                : 35074109
> cpu#2, 2493.662 MHz
>   .skip_clock_count              : 27984
>   .skip_clock_recent_max         : 1001603
>   .skip_clock_max                : 16167011
> cpu#3, 2493.662 MHz
>   .skip_clock_count              : 28153
>   .skip_clock_recent_max         : 1000482
>   .skip_clock_max                : 19163454
> cpu#4, 2493.662 MHz
>   .skip_clock_count              : 29903
>   .skip_clock_recent_max         : 995757
>   .skip_clock_max                : 14243617
> cpu#5, 2493.662 MHz
>   .skip_clock_count              : 28690
>   .skip_clock_recent_max         : 1001300
>   .skip_clock_max                : 46856516
> cpu#6, 2493.662 MHz
>   .skip_clock_count              : 28011
>   .skip_clock_recent_max         : 994878
>   .skip_clock_max                : 53438801
> cpu#7, 2493.662 MHz
>   .skip_clock_count              : 26531
>   .skip_clock_recent_max         : 997352
>   .skip_clock_max                : 19271672

The patch below applies on top of yours and should reproduce the behavior.

Please let me know if there is something else that I should test.

Thanks,
Bjoern

---
sched: show length of runqueue clock deactivation in /proc/sched_debug

The runqueue clock update should obviously not be skipped for
prolonged times (otherwise the consumed time is not correctly kept
track of). This patch measures the time between setting and clearing
the rq->skip_clock_update flag. The maximum observed value is exported
in /proc/sched_debug.
---
 kernel/sched.c       |   21 ++++++++++++++++++++-
 kernel/sched_debug.c |    4 ++++
 2 files changed, 24 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 996ed5c..21c1525 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -541,6 +541,11 @@ struct rq {
 	unsigned long long rq_cpu_time;
 	/* could above be rq->cfs_rq.exec_clock + rq->rt_rq.rt_runtime ? */

+	u64 skip_clock_max;
+	u64 skip_clock_recent_max;
+	u64 skip_clock_set;
+	unsigned int skip_clock_count;
+
 	/* sys_sched_yield() stats */
 	unsigned int yld_count;

@@ -653,6 +658,17 @@ inline void update_rq_clock(struct rq *rq)
 		sched_irq_time_avg_update(rq, irq_time);
 	}

+#ifdef CONFIG_SCHEDSTATS
+	if (unlikely(rq->skip_clock_update)) {
+		u64 skipped = sched_clock_cpu(cpu_of(rq)) - rq->skip_clock_set;
+		rq->skip_clock_max = max(rq->skip_clock_max, skipped);
+		/* reset infrequently to expose changes */
+		if (!(++rq->skip_clock_count % 10000))
+			rq->skip_clock_recent_max = 0;
+		rq->skip_clock_recent_max = max(rq->skip_clock_recent_max,
+						skipped);
+	}
+#endif
 	rq->skip_clock_update = 0;
 }

@@ -2131,8 +2147,11 @@ static void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
 	 * A queue event has occurred, and we're going to schedule.  In
 	 * this case, we can save a useless back to back clock update.
 	 */
-	if (test_tsk_need_resched(rq->curr))
+	if (test_tsk_need_resched(rq->curr)) {
+		schedstat_set(rq->skip_clock_set,
+			      sched_clock_cpu(cpu_of(rq)));
 		rq->skip_clock_update = 1;
+	}
 }

 #ifdef CONFIG_SMP
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 2e1b0d1..8924be2 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -298,6 +298,10 @@ static void print_cpu(struct seq_file *m, int cpu)

 	P(bkl_count);

+	P(skip_clock_count);
+	P64(skip_clock_recent_max);
+	P64(skip_clock_max);
+
 #undef P
 #endif
 	print_cfs_stats(m, cpu);
-- 
1.7.3.1



* Re: Scheduler bug related to rq->skip_clock_update?
  2010-11-22  4:29   ` Bjoern B. Brandenburg
@ 2010-11-22 16:19     ` Mike Galbraith
  2010-11-22 18:14       ` Bjoern B. Brandenburg
  0 siblings, 1 reply; 23+ messages in thread
From: Mike Galbraith @ 2010-11-22 16:19 UTC (permalink / raw)
  To: Bjoern B. Brandenburg
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Bastoni, James H. Anderson,
	linux-kernel

On Sun, 2010-11-21 at 23:29 -0500, Bjoern B. Brandenburg wrote:
> On Sun, 21 Nov 2010, Mike Galbraith wrote:
> 
> > On Sat, 2010-11-20 at 23:22 -0500, Bjoern B. Brandenburg wrote:
> >
> > > I was under the impression that, as an invariant, tasks should not have
> > > TIF_NEED_RESCHED set after they've blocked. In this case, the idle load
> > > balancer should not mark the task that's on its way out with
> > > set_tsk_need_resched().
> >
> > Nice find.
> >
> > > In any case, check_preempt_curr() seems to assume that a resuming task cannot
> > > have TIF_NEED_RESCHED already set. Setting skip_clock_update on a remote CPU
> > > that hasn't even been notified via IPI seems wrong.
> >
> > Yes. Does the below fix it up for you?
> 
> The patch definitely changes the behavior, but it doesn't seem to solve (all
> of) the root cause(s). The failsafe kicks in and clears the flag the next
> time that update_rq_clock() is called, but there can still be a significant
> delay between setting and clearing the flag. Right after boot, I'm now seeing
> values that go up to ~21ms.

A pull isn't the only vulnerability.  idle_balance() drops
rq->lock, so another cpu can wake a task onto this rq.

> Please let me know if there is something else that I should test.

Sched: clear_tsk_need_resched() after NEWIDLE balancing

idle_balance() drops/retakes rq->lock, leaving the previous task
vulnerable to set_tsk_need_resched() from another CPU.  Clear it
after NEWIDLE balancing to maintain the invariant that descheduled
tasks are NOT marked for resched.

This also confuses the skip_clock_update logic, which assumes that
the next call to update_rq_clock() will come nearly immediately after
the flag is set.  Make the optimization more robust by clearing it
after NEWIDLE balancing and in update_rq_clock().

Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Bjoern B. Brandenburg <bbb.lst@gmail.com>
Reported-by: Bjoern B. Brandenburg <bbb.lst@gmail.com>

---
 kernel/sched.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -657,6 +657,7 @@ inline void update_rq_clock(struct rq *r
 
 		sched_irq_time_avg_update(rq, irq_time);
 	}
+	rq->skip_clock_update = 0;
 }
 
 /*
@@ -3714,7 +3715,6 @@ static void put_prev_task(struct rq *rq,
 {
 	if (prev->se.on_rq)
 		update_rq_clock(rq);
-	rq->skip_clock_update = 0;
 	prev->sched_class->put_prev_task(rq, prev);
 }
 
@@ -3799,8 +3799,16 @@ need_resched_nonpreemptible:
 
 	pre_schedule(rq, prev);
 
-	if (unlikely(!rq->nr_running))
+	if (unlikely(!rq->nr_running)) {
 		idle_balance(cpu, rq);
+		/*
+		 * idle_balance() releases/retakes rq->lock, leaving prev
+		 * vulnerable to set_tsk_need_resched() from another cpu.
+		 * Clock updates should not be skipped while we're away.
+		 */
+		clear_tsk_need_resched(prev);
+		rq->skip_clock_update = 0;
+	}
 
 	put_prev_task(rq, prev);
 	next = pick_next_task(rq);




* Re: Scheduler bug related to rq->skip_clock_update?
  2010-11-22 16:19     ` Mike Galbraith
@ 2010-11-22 18:14       ` Bjoern B. Brandenburg
  2010-12-04  7:42         ` Yong Zhang
                           ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Bjoern B. Brandenburg @ 2010-11-22 18:14 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Bastoni, James H. Anderson,
	linux-kernel


On Mon, 22 Nov 2010, Mike Galbraith wrote:

> On Sun, 2010-11-21 at 23:29 -0500, Bjoern B. Brandenburg wrote:
> > On Sun, 21 Nov 2010, Mike Galbraith wrote:
> >
> > > On Sat, 2010-11-20 at 23:22 -0500, Bjoern B. Brandenburg wrote:
> > >
> > > > I was under the impression that, as an invariant, tasks should not have
> > > > TIF_NEED_RESCHED set after they've blocked. In this case, the idle load
> > > > balancer should not mark the task that's on its way out with
> > > > set_tsk_need_resched().
> > >
> > > Nice find.
> > >
> > > > In any case, check_preempt_curr() seems to assume that a resuming task cannot
> > > > have TIF_NEED_RESCHED already set. Setting skip_clock_update on a remote CPU
> > > > that hasn't even been notified via IPI seems wrong.
> > >
> > > Yes. Does the below fix it up for you?
> >
> > The patch definitely changes the behavior, but it doesn't seem to solve (all
> > of) the root cause(s). The failsafe kicks in and clears the flag the next
> > time that update_rq_clock() is called, but there can still be a significant
> > delay between setting and clearing the flag. Right after boot, I'm now seeing
> > values that go up to ~21ms.
>
> A pull isn't the only vulnerability.  Since idle_balance() drops
> rq->lock, so another cpu can wake to this rq.
>
> > Please let me know if there is something else that I should test.
>
> Sched: clear_tsk_need_resched() after NEWIDLE balancing
>
> idle_balance() drops/retakes rq->lock, leaving the previous task
> vulnerable to set_tsk_need_resched() from another CPU.  Clear it
> after NEWIDLE balancing to maintain the invariant that descheduled
> tasks are NOT marked for resched.
>
> This also confuses the skip_clock_update logic, which assumes that
> the next call to update_rq_clock() will come nearly immediately after
> being set.  Make the optimization more robust by clearing before we
> balance and in update_rq_clock().

Unfortunately that doesn't seem to do it yet.

After running five 'find /' instances to completion on the ARM platform,
I'm still seeing delays close to 10ms.

    bbb@district10:~$ egrep 'cpu#|skip' /proc/sched_debug
    cpu#0
      .skip_clock_count              : 89606
      .skip_clock_recent_max         : 9817250
      .skip_clock_max                : 21992375
    cpu#1
      .skip_clock_count              : 81978
      .skip_clock_recent_max         : 9582500
      .skip_clock_max                : 17201750
    cpu#2
      .skip_clock_count              : 74565
      .skip_clock_recent_max         : 9678000
      .skip_clock_max                : 9879250
    cpu#3
      .skip_clock_count              : 81685
      .skip_clock_recent_max         : 9300125
      .skip_clock_max                : 14115750

On the x86_64 host, I've changed to HZ=100 and am now also seeing delays
close to 10ms after 'make clean && make -j8 bzImage'.

    bbb@koruna:~$ egrep 'cpu#|skip' /proc/sched_debug
    cpu#0, 2493.476 MHz
      .skip_clock_count              : 29703
      .skip_clock_recent_max         : 9999858
      .skip_clock_max                : 40645942
    cpu#1, 2493.476 MHz
      .skip_clock_count              : 32696
      .skip_clock_recent_max         : 9959118
      .skip_clock_max                : 35074771
    cpu#2, 2493.476 MHz
      .skip_clock_count              : 31742
      .skip_clock_recent_max         : 9788654
      .skip_clock_max                : 24821765
    cpu#3, 2493.476 MHz
      .skip_clock_count              : 31123
      .skip_clock_recent_max         : 9858546
      .skip_clock_max                : 44276033
    cpu#4, 2493.476 MHz
      .skip_clock_count              : 28346
      .skip_clock_recent_max         : 10000775
      .skip_clock_max                : 18681753
    cpu#5, 2493.476 MHz
      .skip_clock_count              : 29421
      .skip_clock_recent_max         : 9997656
      .skip_clock_max                : 138473407
    cpu#6, 2493.476 MHz
      .skip_clock_count              : 27721
      .skip_clock_recent_max         : 9992074
      .skip_clock_max                : 53436918
    cpu#7, 2493.476 MHz
      .skip_clock_count              : 29637
      .skip_clock_recent_max         : 9994516
      .skip_clock_max                : 566793528

These numbers were recorded with the below patch.

Please let me know if I can help by testing or tracing something else.

Thanks,
Bjoern

---
sched: show length of runqueue clock deactivation in /proc/sched_debug

The runqueue clock update should obviously not be skipped for
prolonged times (otherwise the consumed time is not correctly kept
track of). This patch measures the time between setting and clearing
the rq->skip_clock_update flag. The maximum observed value is exported
in /proc/sched_debug.
---
 kernel/sched.c       |   30 +++++++++++++++++++++++++++---
 kernel/sched_debug.c |    4 ++++
 2 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 29509e1..08839ce 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -541,6 +541,11 @@ struct rq {
 	unsigned long long rq_cpu_time;
 	/* could above be rq->cfs_rq.exec_clock + rq->rt_rq.rt_runtime ? */

+	u64 skip_clock_max;
+	u64 skip_clock_recent_max;
+	u64 skip_clock_set;
+	unsigned int skip_clock_count;
+
 	/* sys_sched_yield() stats */
 	unsigned int yld_count;

@@ -639,6 +644,22 @@ static inline struct task_group *task_group(struct task_struct *p)
 static u64 irq_time_cpu(int cpu);
 static void sched_irq_time_avg_update(struct rq *rq, u64 irq_time);

+static void clear_skip_clock_update(struct rq *rq)
+{
+#ifdef CONFIG_SCHEDSTATS
+	if (unlikely(rq->skip_clock_update)) {
+		u64 skipped = sched_clock_cpu(cpu_of(rq)) - rq->skip_clock_set;
+		rq->skip_clock_max = max(rq->skip_clock_max, skipped);
+		/* reset infrequently to expose changes */
+		if (!(++rq->skip_clock_count % 10000))
+			rq->skip_clock_recent_max = 0;
+		rq->skip_clock_recent_max = max(rq->skip_clock_recent_max,
+						skipped);
+	}
+#endif
+	rq->skip_clock_update = 0;
+}
+
 inline void update_rq_clock(struct rq *rq)
 {
 	if (!rq->skip_clock_update) {
@@ -652,7 +673,7 @@ inline void update_rq_clock(struct rq *rq)

 		sched_irq_time_avg_update(rq, irq_time);
 	}
-	rq->skip_clock_update = 0;
+	clear_skip_clock_update(rq);
 }

 /*
@@ -2130,8 +2151,11 @@ static void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
 	 * A queue event has occurred, and we're going to schedule.  In
 	 * this case, we can save a useless back to back clock update.
 	 */
-	if (test_tsk_need_resched(rq->curr))
+	if (test_tsk_need_resched(rq->curr)) {
+		schedstat_set(rq->skip_clock_set,
+			      sched_clock_cpu(cpu_of(rq)));
 		rq->skip_clock_update = 1;
+	}
 }

 #ifdef CONFIG_SMP
@@ -3938,7 +3962,7 @@ need_resched_nonpreemptible:
 		 * Clock updates should not be skipped while we're away.
 		 */
 		clear_tsk_need_resched(prev);
-		rq->skip_clock_update = 0;
+		clear_skip_clock_update(rq);
 	}

 	put_prev_task(rq, prev);
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 2e1b0d1..8924be2 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -298,6 +298,10 @@ static void print_cpu(struct seq_file *m, int cpu)

 	P(bkl_count);

+	P(skip_clock_count);
+	P64(skip_clock_recent_max);
+	P64(skip_clock_max);
+
 #undef P
 #endif
 	print_cfs_stats(m, cpu);
-- 
1.7.3.1


* Re: Scheduler bug related to rq->skip_clock_update?
  2010-11-21  4:22 Scheduler bug related to rq->skip_clock_update? Bjoern B. Brandenburg
  2010-11-21 17:14 ` Mike Galbraith
@ 2010-12-03 12:41 ` Peter Zijlstra
  1 sibling, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2010-12-03 12:41 UTC (permalink / raw)
  To: Bjoern B. Brandenburg
  Cc: Mike Galbraith, Ingo Molnar, Andrea Bastoni, James H. Anderson,
	linux-kernel


Hi Bjoern,

sorry for the slow response, and sadly I'll have to ask for a little bit
more patience yet :/

It's a nice find and I'll look into it as soon as possible.


* Re: Scheduler bug related to rq->skip_clock_update?
  2010-11-22 18:14       ` Bjoern B. Brandenburg
@ 2010-12-04  7:42         ` Yong Zhang
  2010-12-04 14:05           ` Yong Zhang
  2010-12-04 14:08         ` Yong Zhang
  2010-12-05  5:28         ` Yong Zhang
  2 siblings, 1 reply; 23+ messages in thread
From: Yong Zhang @ 2010-12-04  7:42 UTC (permalink / raw)
  To: Bjoern B. Brandenburg
  Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Andrea Bastoni,
	James H. Anderson, linux-kernel

On Mon, Nov 22, 2010 at 01:14:47PM -0500, Bjoern B. Brandenburg wrote:
> On Mon, 22 Nov 2010, Mike Galbraith wrote:
> 
> > On Sun, 2010-11-21 at 23:29 -0500, Bjoern B. Brandenburg wrote:
> > > On Sun, 21 Nov 2010, Mike Galbraith wrote:
> > >
> > > > On Sat, 2010-11-20 at 23:22 -0500, Bjoern B. Brandenburg wrote:
> > > >
> > > > > I was under the impression that, as an invariant, tasks should not have
> > > > > TIF_NEED_RESCHED set after they've blocked. In this case, the idle load
> > > > > balancer should not mark the task that's on its way out with
> > > > > set_tsk_need_resched().
> > > >
> > > > Nice find.
> > > >
> > > > > In any case, check_preempt_curr() seems to assume that a resuming task cannot
> > > > > have TIF_NEED_RESCHED already set. Setting skip_clock_update on a remote CPU
> > > > > that hasn't even been notified via IPI seems wrong.
> > > >
> > > > Yes. Does the below fix it up for you?
> > >
> > > The patch definitely changes the behavior, but it doesn't seem to solve (all
> > > of) the root cause(s). The failsafe kicks in and clears the flag the next
> > > time that update_rq_clock() is called, but there can still be a significant
> > > delay between setting and clearing the flag. Right after boot, I'm now seeing
> > > values that go up to ~21ms.
> >
> > A pull isn't the only vulnerability.  Since idle_balance() drops
> > rq->lock, so another cpu can wake to this rq.
> >
> > > Please let me know if there is something else that I should test.
> >
> > Sched: clear_tsk_need_resched() after NEWIDLE balancing
> >
> > idle_balance() drops/retakes rq->lock, leaving the previous task
> > vulnerable to set_tsk_need_resched() from another CPU.  Clear it
> > after NEWIDLE balancing to maintain the invariant that descheduled
> > tasks are NOT marked for resched.
> >
> > This also confuses the skip_clock_update logic, which assumes that
> > the next call to update_rq_clock() will come nearly immediately after
> > being set.  Make the optimization more robust by clearing before we
> > balance and in update_rq_clock().
> 
> Unfortunately that doesn't seem to do it yet.
> 
> After running five 'find /' instances to completion on the ARM platform,
> I'm still seeing delays close to 10ms.
> 
>     bbb@district10:~$ egrep 'cpu#|skip' /proc/sched_debug
>     cpu#0
>       .skip_clock_count              : 89606
>       .skip_clock_recent_max         : 9817250
>       .skip_clock_max                : 21992375
>     cpu#1
>       .skip_clock_count              : 81978
>       .skip_clock_recent_max         : 9582500
>       .skip_clock_max                : 17201750
>     cpu#2
>       .skip_clock_count              : 74565
>       .skip_clock_recent_max         : 9678000
>       .skip_clock_max                : 9879250
>     cpu#3
>       .skip_clock_count              : 81685
>       .skip_clock_recent_max         : 9300125
>       .skip_clock_max                : 14115750
> 
> On the x86_64 host, I've changed to HZ=100 and am now also seeing delays
> close to 10ms after 'make clean && make -j8 bzImage'.
> 
>     bbb@koruna:~$ egrep 'cpu#|skip' /proc/sched_debug
>     cpu#0, 2493.476 MHz
>       .skip_clock_count              : 29703
>       .skip_clock_recent_max         : 9999858
>       .skip_clock_max                : 40645942
>     cpu#1, 2493.476 MHz
>       .skip_clock_count              : 32696
>       .skip_clock_recent_max         : 9959118
>       .skip_clock_max                : 35074771
>     cpu#2, 2493.476 MHz
>       .skip_clock_count              : 31742
>       .skip_clock_recent_max         : 9788654
>       .skip_clock_max                : 24821765
>     cpu#3, 2493.476 MHz
>       .skip_clock_count              : 31123
>       .skip_clock_recent_max         : 9858546
>       .skip_clock_max                : 44276033
>     cpu#4, 2493.476 MHz
>       .skip_clock_count              : 28346
>       .skip_clock_recent_max         : 10000775
>       .skip_clock_max                : 18681753
>     cpu#5, 2493.476 MHz
>       .skip_clock_count              : 29421
>       .skip_clock_recent_max         : 9997656
>       .skip_clock_max                : 138473407
>     cpu#6, 2493.476 MHz
>       .skip_clock_count              : 27721
>       .skip_clock_recent_max         : 9992074
>       .skip_clock_max                : 53436918
>     cpu#7, 2493.476 MHz
>       .skip_clock_count              : 29637
>       .skip_clock_recent_max         : 9994516
>       .skip_clock_max                : 566793528
> 
> These numbers were recorded with the below patch.
> 
> Please let me know if I can help by testing or tracing something else.

It seems the check for <if (prev->se.on_rq)> in put_prev_task()
is the culprit.

If we preempt a process that is about to go to sleep on another CPU,
we fail to update the rq clock for that CPU in schedule().
For example:

           CPUA                          CPUB
                                         process xxx == current
  check_preempt_curr() for CPUB          ... skip_clock_update==1
                                         going to sleep
                                           ->schedule()
                                             ->deactivate_task()  fails to update the rq clock
                                                                  because skip_clock_update==1
                                             ->put_prev_task()    fails to update the rq clock
                                                                  because prev->se.on_rq==false

The rq clock on CPUB will then not be updated until another scheduling
event comes along.

So Bjoern, does deleting the check for prev->se.on_rq in
put_prev_task() help?
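
Concretely, what I have in mind is roughly the following (untested):

    static void put_prev_task(struct rq *rq, struct task_struct *prev)
    {
        /* update the clock even if prev has already been dequeued */
        update_rq_clock(rq);
        prev->sched_class->put_prev_task(rq, prev);
    }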

Thanks,
Yong


* Re: Scheduler bug related to rq->skip_clock_update?
  2010-12-04  7:42         ` Yong Zhang
@ 2010-12-04 14:05           ` Yong Zhang
  0 siblings, 0 replies; 23+ messages in thread
From: Yong Zhang @ 2010-12-04 14:05 UTC (permalink / raw)
  To: Bjoern B. Brandenburg
  Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Andrea Bastoni,
	James H. Anderson, linux-kernel

On Sat, Dec 04, 2010 at 03:42:36PM +0800, Yong Zhang wrote:
> On Mon, Nov 22, 2010 at 01:14:47PM -0500, Bjoern B. Brandenburg wrote:
> > On Mon, 22 Nov 2010, Mike Galbraith wrote:
> > 
> > > On Sun, 2010-11-21 at 23:29 -0500, Bjoern B. Brandenburg wrote:
> > > > On Sun, 21 Nov 2010, Mike Galbraith wrote:
> > > >
> > > > > On Sat, 2010-11-20 at 23:22 -0500, Bjoern B. Brandenburg wrote:
> > > > >
> > > > > > I was under the impression that, as an invariant, tasks should not have
> > > > > > TIF_NEED_RESCHED set after they've blocked. In this case, the idle load
> > > > > > balancer should not mark the task that's on its way out with
> > > > > > set_tsk_need_resched().
> > > > >
> > > > > Nice find.
> > > > >
> > > > > > In any case, check_preempt_curr() seems to assume that a resuming task cannot
> > > > > > have TIF_NEED_RESCHED already set. Setting skip_clock_update on a remote CPU
> > > > > > that hasn't even been notified via IPI seems wrong.
> > > > >
> > > > > Yes. Does the below fix it up for you?
> > > >
> > > > The patch definitely changes the behavior, but it doesn't seem to solve (all
> > > > of) the root cause(s). The failsafe kicks in and clears the flag the next
> > > > time that update_rq_clock() is called, but there can still be a significant
> > > > delay between setting and clearing the flag. Right after boot, I'm now seeing
> > > > values that go up to ~21ms.
> > >
> > > A pull isn't the only vulnerability.  Since idle_balance() drops
> > > rq->lock, so another cpu can wake to this rq.
> > >
> > > > Please let me know if there is something else that I should test.
> > >
> > > Sched: clear_tsk_need_resched() after NEWIDLE balancing
> > >
> > > idle_balance() drops/retakes rq->lock, leaving the previous task
> > > vulnerable to set_tsk_need_resched() from another CPU.  Clear it
> > > after NEWIDLE balancing to maintain the invariant that descheduled
> > > tasks are NOT marked for resched.
> > >
> > > This also confuses the skip_clock_update logic, which assumes that
> > > the next call to update_rq_clock() will come nearly immediately after
> > > being set.  Make the optimization more robust by clearing before we
> > > balance and in update_rq_clock().
> > 
> > Unfortunately that doesn't seem to do it yet.
> > 
> > After running five 'find /' instances to completion on the ARM platform,
> > I'm still seeing delays close to 10ms.
> > 
> >     bbb@district10:~$ egrep 'cpu#|skip' /proc/sched_debug
> >     cpu#0
> >       .skip_clock_count              : 89606
> >       .skip_clock_recent_max         : 9817250
> >       .skip_clock_max                : 21992375
> >     cpu#1
> >       .skip_clock_count              : 81978
> >       .skip_clock_recent_max         : 9582500
> >       .skip_clock_max                : 17201750
> >     cpu#2
> >       .skip_clock_count              : 74565
> >       .skip_clock_recent_max         : 9678000
> >       .skip_clock_max                : 9879250
> >     cpu#3
> >       .skip_clock_count              : 81685
> >       .skip_clock_recent_max         : 9300125
> >       .skip_clock_max                : 14115750
> > 
> > On the x86_64 host, I've changed to HZ=100 and am now also seeing delays
> > close to 10ms after 'make clean && make -j8 bzImage'.
> > 
> >     bbb@koruna:~$ egrep 'cpu#|skip' /proc/sched_debug
> >     cpu#0, 2493.476 MHz
> >       .skip_clock_count              : 29703
> >       .skip_clock_recent_max         : 9999858
> >       .skip_clock_max                : 40645942
> >     cpu#1, 2493.476 MHz
> >       .skip_clock_count              : 32696
> >       .skip_clock_recent_max         : 9959118
> >       .skip_clock_max                : 35074771
> >     cpu#2, 2493.476 MHz
> >       .skip_clock_count              : 31742
> >       .skip_clock_recent_max         : 9788654
> >       .skip_clock_max                : 24821765
> >     cpu#3, 2493.476 MHz
> >       .skip_clock_count              : 31123
> >       .skip_clock_recent_max         : 9858546
> >       .skip_clock_max                : 44276033
> >     cpu#4, 2493.476 MHz
> >       .skip_clock_count              : 28346
> >       .skip_clock_recent_max         : 10000775
> >       .skip_clock_max                : 18681753
> >     cpu#5, 2493.476 MHz
> >       .skip_clock_count              : 29421
> >       .skip_clock_recent_max         : 9997656
> >       .skip_clock_max                : 138473407
> >     cpu#6, 2493.476 MHz
> >       .skip_clock_count              : 27721
> >       .skip_clock_recent_max         : 9992074
> >       .skip_clock_max                : 53436918
> >     cpu#7, 2493.476 MHz
> >       .skip_clock_count              : 29637
> >       .skip_clock_recent_max         : 9994516
> >       .skip_clock_max                : 566793528
> > 
> > These numbers were recorded with the below patch.
> > 
> > Please let me know if I can help by testing or tracing something else.
> 
> Seems the checking for <if (prev->se.on_rq)> in put_prev_task()
> is the culprit.
> 
> Because if we preempt a going sleep process on another CPU,
> we will fail to update the rq clock for that CPU in schedule.
> For example:
> 
>           CPUA                       CPUB
>                                   process xxx == current
>  check_preempt_curr() for CPUB		... skip_clock_update==1
> 				  going to sleep   
>                                     ->schedule()
> 				      ->deactivate_task() fail to update rq clock
> 				                       because skip_clock_update==1
> 				      ->put_prev_task() fail to update rq clock
> 				                       because prev->se.on_rq==false
> 
> Then rq clock on CPUB will updated until another schedule envent
> comes.
> 
> So Bjoern, is deleting the checking for prev->se.on_rq in
> put_prev_task() helpful?

My test shows there is indeed some improvement.
But I just noticed that skip_clock_recent_max/max is in
_nanoseconds_, so the 10ms delay mentioned by Bjoern
should be _10us_.

So I'm not sure if my suggestion is necessary.

> 
> Thanks,
> Yong


* Re: Scheduler bug related to rq->skip_clock_update?
  2010-11-22 18:14       ` Bjoern B. Brandenburg
  2010-12-04  7:42         ` Yong Zhang
@ 2010-12-04 14:08         ` Yong Zhang
  2010-12-04 14:33           ` Yong Zhang
  2010-12-05  5:28         ` Yong Zhang
  2 siblings, 1 reply; 23+ messages in thread
From: Yong Zhang @ 2010-12-04 14:08 UTC (permalink / raw)
  To: Bjoern B. Brandenburg
  Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Andrea Bastoni,
	James H. Anderson, linux-kernel

On Mon, Nov 22, 2010 at 01:14:47PM -0500, Bjoern B. Brandenburg wrote:
> On Mon, 22 Nov 2010, Mike Galbraith wrote:
> 
> > On Sun, 2010-11-21 at 23:29 -0500, Bjoern B. Brandenburg wrote:
> > > On Sun, 21 Nov 2010, Mike Galbraith wrote:
> > >
> > > > On Sat, 2010-11-20 at 23:22 -0500, Bjoern B. Brandenburg wrote:
> > > >
> > > > > I was under the impression that, as an invariant, tasks should not have
> > > > > TIF_NEED_RESCHED set after they've blocked. In this case, the idle load
> > > > > balancer should not mark the task that's on its way out with
> > > > > set_tsk_need_resched().
> > > >
> > > > Nice find.
> > > >
> > > > > In any case, check_preempt_curr() seems to assume that a resuming task cannot
> > > > > have TIF_NEED_RESCHED already set. Setting skip_clock_update on a remote CPU
> > > > > that hasn't even been notified via IPI seems wrong.
> > > >
> > > > Yes. Does the below fix it up for you?
> > >
> > > The patch definitely changes the behavior, but it doesn't seem to solve (all
> > > of) the root cause(s). The failsafe kicks in and clears the flag the next
> > > time that update_rq_clock() is called, but there can still be a significant
> > > delay between setting and clearing the flag. Right after boot, I'm now seeing
> > > values that go up to ~21ms.
> >
> > A pull isn't the only vulnerability.  Since idle_balance() drops
> > rq->lock, so another cpu can wake to this rq.
> >
> > > Please let me know if there is something else that I should test.
> >
> > Sched: clear_tsk_need_resched() after NEWIDLE balancing
> >
> > idle_balance() drops/retakes rq->lock, leaving the previous task
> > vulnerable to set_tsk_need_resched() from another CPU.  Clear it
> > after NEWIDLE balancing to maintain the invariant that descheduled
> > tasks are NOT marked for resched.
> >
> > This also confuses the skip_clock_update logic, which assumes that
> > the next call to update_rq_clock() will come nearly immediately after
> > being set.  Make the optimization more robust by clearing before we
> > balance and in update_rq_clock().
> 
> Unfortunately that doesn't seem to do it yet.
> 
> After running five 'find /' instances to completion on the ARM platform,
> I'm still seeing delays close to 10ms.

skip_clock_recent_max/skip_clock_max is in nanoseconds,
so the delay should be 10us, right?

> 
>     bbb@district10:~$ egrep 'cpu#|skip' /proc/sched_debug
>     cpu#0
>       .skip_clock_count              : 89606
>       .skip_clock_recent_max         : 9817250
>       .skip_clock_max                : 21992375
>     cpu#1
>       .skip_clock_count              : 81978
>       .skip_clock_recent_max         : 9582500
>       .skip_clock_max                : 17201750
>     cpu#2
>       .skip_clock_count              : 74565
>       .skip_clock_recent_max         : 9678000
>       .skip_clock_max                : 9879250
>     cpu#3
>       .skip_clock_count              : 81685
>       .skip_clock_recent_max         : 9300125
>       .skip_clock_max                : 14115750
> 
> On the x86_64 host, I've changed to HZ=100 and am now also seeing delays
> close to 10ms after 'make clean && make -j8 bzImage'.

ditto.

> 
>     bbb@koruna:~$ egrep 'cpu#|skip' /proc/sched_debug
>     cpu#0, 2493.476 MHz
>       .skip_clock_count              : 29703
>       .skip_clock_recent_max         : 9999858
>       .skip_clock_max                : 40645942
>     cpu#1, 2493.476 MHz
>       .skip_clock_count              : 32696
>       .skip_clock_recent_max         : 9959118
>       .skip_clock_max                : 35074771
>     cpu#2, 2493.476 MHz
>       .skip_clock_count              : 31742
>       .skip_clock_recent_max         : 9788654
>       .skip_clock_max                : 24821765
>     cpu#3, 2493.476 MHz
>       .skip_clock_count              : 31123
>       .skip_clock_recent_max         : 9858546
>       .skip_clock_max                : 44276033
>     cpu#4, 2493.476 MHz
>       .skip_clock_count              : 28346
>       .skip_clock_recent_max         : 10000775
>       .skip_clock_max                : 18681753
>     cpu#5, 2493.476 MHz
>       .skip_clock_count              : 29421
>       .skip_clock_recent_max         : 9997656
>       .skip_clock_max                : 138473407
>     cpu#6, 2493.476 MHz
>       .skip_clock_count              : 27721
>       .skip_clock_recent_max         : 9992074
>       .skip_clock_max                : 53436918
>     cpu#7, 2493.476 MHz
>       .skip_clock_count              : 29637
>       .skip_clock_recent_max         : 9994516
>       .skip_clock_max                : 566793528
> 
> These numbers were recorded with the below patch.
> 
> Please let me know if I can help by testing or tracing something else.
> 
> Thanks,
> Bjoern
> 
> ---
> sched: show length of runqueue clock deactivation in /proc/sched_debug
> 
> The runqueue clock update should obviously not be skipped for
> prolonged times (otherwise the consumed time is not correctly kept
> track of). This patch measures the time between setting and clearing
> the rq->skip_clock_update flag. The maximum observed value is exported
> in /proc/sched_debug.
> ---
>  kernel/sched.c       |   30 +++++++++++++++++++++++++++---
>  kernel/sched_debug.c |    4 ++++
>  2 files changed, 31 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 29509e1..08839ce 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -541,6 +541,11 @@ struct rq {
>  	unsigned long long rq_cpu_time;
>  	/* could above be rq->cfs_rq.exec_clock + rq->rt_rq.rt_runtime ? */
> 
> +	u64 skip_clock_max;
> +	u64 skip_clock_recent_max;
> +	u64 skip_clock_set;
> +	unsigned int skip_clock_count;
> +
>  	/* sys_sched_yield() stats */
>  	unsigned int yld_count;
> 
> @@ -639,6 +644,22 @@ static inline struct task_group *task_group(struct task_struct *p)
>  static u64 irq_time_cpu(int cpu);
>  static void sched_irq_time_avg_update(struct rq *rq, u64 irq_time);
> 
> +static void clear_skip_clock_update(struct rq *rq)
> +{
> +#ifdef CONFIG_SCHEDSTATS
> +	if (unlikely(rq->skip_clock_update)) {
> +		u64 skipped = sched_clock_cpu(cpu_of(rq)) - rq->skip_clock_set;
> +		rq->skip_clock_max = max(rq->skip_clock_max, skipped);
> +		/* reset infrequently to expose changes */
> +		if (!(++rq->skip_clock_count % 10000))
> +			rq->skip_clock_recent_max = 0;
> +		rq->skip_clock_recent_max = max(rq->skip_clock_recent_max,
> +						skipped);
> +	}
> +#endif
> +	rq->skip_clock_update = 0;
> +}
> +
>  inline void update_rq_clock(struct rq *rq)
>  {
>  	if (!rq->skip_clock_update) {
> @@ -652,7 +673,7 @@ inline void update_rq_clock(struct rq *rq)
> 
>  		sched_irq_time_avg_update(rq, irq_time);
>  	}
> -	rq->skip_clock_update = 0;
> +	clear_skip_clock_update(rq);
>  }
> 
>  /*
> @@ -2130,8 +2151,11 @@ static void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
>  	 * A queue event has occurred, and we're going to schedule.  In
>  	 * this case, we can save a useless back to back clock update.
>  	 */
> -	if (test_tsk_need_resched(rq->curr))
> +	if (test_tsk_need_resched(rq->curr)) {
> +		schedstat_set(rq->skip_clock_set,
> +			      sched_clock_cpu(cpu_of(rq)));
>  		rq->skip_clock_update = 1;
> +	}
>  }
> 
>  #ifdef CONFIG_SMP
> @@ -3938,7 +3962,7 @@ need_resched_nonpreemptible:
>  		 * Clock updates should not be skipped while we're away.
>  		 */
>  		clear_tsk_need_resched(prev);
> -		rq->skip_clock_update = 0;
> +		clear_skip_clock_update(rq);
>  	}
> 
>  	put_prev_task(rq, prev);
> diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
> index 2e1b0d1..8924be2 100644
> --- a/kernel/sched_debug.c
> +++ b/kernel/sched_debug.c
> @@ -298,6 +298,10 @@ static void print_cpu(struct seq_file *m, int cpu)
> 
>  	P(bkl_count);
> 
> +	P(skip_clock_count);
> +	P64(skip_clock_recent_max);
> +	P64(skip_clock_max);
> +
>  #undef P
>  #endif
>  	print_cfs_stats(m, cpu);
> -- 
> 1.7.3.1



* Re: Scheduler bug related to rq->skip_clock_update?
  2010-12-04 14:08         ` Yong Zhang
@ 2010-12-04 14:33           ` Yong Zhang
  0 siblings, 0 replies; 23+ messages in thread
From: Yong Zhang @ 2010-12-04 14:33 UTC (permalink / raw)
  To: Bjoern B. Brandenburg
  Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Andrea Bastoni,
	James H. Anderson, linux-kernel

On Sat, Dec 04, 2010 at 10:08:16PM +0800, Yong Zhang wrote:
> On Mon, Nov 22, 2010 at 01:14:47PM -0500, Bjoern B. Brandenburg wrote:
> > On Mon, 22 Nov 2010, Mike Galbraith wrote:
> > 
> > > On Sun, 2010-11-21 at 23:29 -0500, Bjoern B. Brandenburg wrote:
> > > > On Sun, 21 Nov 2010, Mike Galbraith wrote:
> > > >
> > > > > On Sat, 2010-11-20 at 23:22 -0500, Bjoern B. Brandenburg wrote:
> > > > >
> > > > > > I was under the impression that, as an invariant, tasks should not have
> > > > > > TIF_NEED_RESCHED set after they've blocked. In this case, the idle load
> > > > > > balancer should not mark the task that's on its way out with
> > > > > > set_tsk_need_resched().
> > > > >
> > > > > Nice find.
> > > > >
> > > > > > In any case, check_preempt_curr() seems to assume that a resuming task cannot
> > > > > > have TIF_NEED_RESCHED already set. Setting skip_clock_update on a remote CPU
> > > > > > that hasn't even been notified via IPI seems wrong.
> > > > >
> > > > > Yes. Does the below fix it up for you?
> > > >
> > > > The patch definitely changes the behavior, but it doesn't seem to solve (all
> > > > of) the root cause(s). The failsafe kicks in and clears the flag the next
> > > > time that update_rq_clock() is called, but there can still be a significant
> > > > delay between setting and clearing the flag. Right after boot, I'm now seeing
> > > > values that go up to ~21ms.
> > >
> > > A pull isn't the only vulnerability.  Since idle_balance() drops
> > > rq->lock, so another cpu can wake to this rq.
> > >
> > > > Please let me know if there is something else that I should test.
> > >
> > > Sched: clear_tsk_need_resched() after NEWIDLE balancing
> > >
> > > idle_balance() drops/retakes rq->lock, leaving the previous task
> > > vulnerable to set_tsk_need_resched() from another CPU.  Clear it
> > > after NEWIDLE balancing to maintain the invariant that descheduled
> > > tasks are NOT marked for resched.
> > >
> > > This also confuses the skip_clock_update logic, which assumes that
> > > the next call to update_rq_clock() will come nearly immediately after
> > > being set.  Make the optimization more robust by clearing before we
> > > balance and in update_rq_clock().
> > 
> > Unfortunately that doesn't seem to do it yet.
> > 
> > After running five 'find /' instances to completion on the ARM platform,
> > I'm still seeing delays close to 10ms.
> 
> Because skip_clock_recent_max/skip_clock_max is based on nanosecond,
> so the delay should be 10us, right?

Ohhhh. My head is twisted. Sorry for the noise.

I should go get some sleep now to clear my head. ;)

Thanks,
Yong
> 
> > 
> >     bbb@district10:~$ egrep 'cpu#|skip' /proc/sched_debug
> >     cpu#0
> >       .skip_clock_count              : 89606
> >       .skip_clock_recent_max         : 9817250
> >       .skip_clock_max                : 21992375
> >     cpu#1
> >       .skip_clock_count              : 81978
> >       .skip_clock_recent_max         : 9582500
> >       .skip_clock_max                : 17201750
> >     cpu#2
> >       .skip_clock_count              : 74565
> >       .skip_clock_recent_max         : 9678000
> >       .skip_clock_max                : 9879250
> >     cpu#3
> >       .skip_clock_count              : 81685
> >       .skip_clock_recent_max         : 9300125
> >       .skip_clock_max                : 14115750
> > 
> > On the x86_64 host, I've changed to HZ=100 and am now also seeing delays
> > close to 10ms after 'make clean && make -j8 bzImage'.
> 
> ditto.
> 
> > 
> >     bbb@koruna:~$ egrep 'cpu#|skip' /proc/sched_debug
> >     cpu#0, 2493.476 MHz
> >       .skip_clock_count              : 29703
> >       .skip_clock_recent_max         : 9999858
> >       .skip_clock_max                : 40645942
> >     cpu#1, 2493.476 MHz
> >       .skip_clock_count              : 32696
> >       .skip_clock_recent_max         : 9959118
> >       .skip_clock_max                : 35074771
> >     cpu#2, 2493.476 MHz
> >       .skip_clock_count              : 31742
> >       .skip_clock_recent_max         : 9788654
> >       .skip_clock_max                : 24821765
> >     cpu#3, 2493.476 MHz
> >       .skip_clock_count              : 31123
> >       .skip_clock_recent_max         : 9858546
> >       .skip_clock_max                : 44276033
> >     cpu#4, 2493.476 MHz
> >       .skip_clock_count              : 28346
> >       .skip_clock_recent_max         : 10000775
> >       .skip_clock_max                : 18681753
> >     cpu#5, 2493.476 MHz
> >       .skip_clock_count              : 29421
> >       .skip_clock_recent_max         : 9997656
> >       .skip_clock_max                : 138473407
> >     cpu#6, 2493.476 MHz
> >       .skip_clock_count              : 27721
> >       .skip_clock_recent_max         : 9992074
> >       .skip_clock_max                : 53436918
> >     cpu#7, 2493.476 MHz
> >       .skip_clock_count              : 29637
> >       .skip_clock_recent_max         : 9994516
> >       .skip_clock_max                : 566793528
> > 
> > These numbers were recorded with the below patch.
> > 
> > Please let me know if I can help by testing or tracing something else.
> > 
> > Thanks,
> > Bjoern
> > 
> > ---
> > sched: show length of runqueue clock deactivation in /proc/sched_debug
> > 
> > The runqueue clock update should obviously not be skipped for
> > prolonged times (otherwise the consumed time is not correctly kept
> > track of). This patch measures the time between setting and clearing
> > the rq->skip_clock_update flag. The maximum observed value is exported
> > in /proc/sched_debug.
> > ---
> >  kernel/sched.c       |   30 +++++++++++++++++++++++++++---
> >  kernel/sched_debug.c |    4 ++++
> >  2 files changed, 31 insertions(+), 3 deletions(-)
> > 
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 29509e1..08839ce 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -541,6 +541,11 @@ struct rq {
> >  	unsigned long long rq_cpu_time;
> >  	/* could above be rq->cfs_rq.exec_clock + rq->rt_rq.rt_runtime ? */
> > 
> > +	u64 skip_clock_max;
> > +	u64 skip_clock_recent_max;
> > +	u64 skip_clock_set;
> > +	unsigned int skip_clock_count;
> > +
> >  	/* sys_sched_yield() stats */
> >  	unsigned int yld_count;
> > 
> > @@ -639,6 +644,22 @@ static inline struct task_group *task_group(struct task_struct *p)
> >  static u64 irq_time_cpu(int cpu);
> >  static void sched_irq_time_avg_update(struct rq *rq, u64 irq_time);
> > 
> > +static void clear_skip_clock_update(struct rq *rq)
> > +{
> > +#ifdef CONFIG_SCHEDSTATS
> > +	if (unlikely(rq->skip_clock_update)) {
> > +		u64 skipped = sched_clock_cpu(cpu_of(rq)) - rq->skip_clock_set;
> > +		rq->skip_clock_max = max(rq->skip_clock_max, skipped);
> > +		/* reset infrequently to expose changes */
> > +		if (!(++rq->skip_clock_count % 10000))
> > +			rq->skip_clock_recent_max = 0;
> > +		rq->skip_clock_recent_max = max(rq->skip_clock_recent_max,
> > +						skipped);
> > +	}
> > +#endif
> > +	rq->skip_clock_update = 0;
> > +}
> > +
> >  inline void update_rq_clock(struct rq *rq)
> >  {
> >  	if (!rq->skip_clock_update) {
> > @@ -652,7 +673,7 @@ inline void update_rq_clock(struct rq *rq)
> > 
> >  		sched_irq_time_avg_update(rq, irq_time);
> >  	}
> > -	rq->skip_clock_update = 0;
> > +	clear_skip_clock_update(rq);
> >  }
> > 
> >  /*
> > @@ -2130,8 +2151,11 @@ static void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
> >  	 * A queue event has occurred, and we're going to schedule.  In
> >  	 * this case, we can save a useless back to back clock update.
> >  	 */
> > -	if (test_tsk_need_resched(rq->curr))
> > +	if (test_tsk_need_resched(rq->curr)) {
> > +		schedstat_set(rq->skip_clock_set,
> > +			      sched_clock_cpu(cpu_of(rq)));
> >  		rq->skip_clock_update = 1;
> > +	}
> >  }
> > 
> >  #ifdef CONFIG_SMP
> > @@ -3938,7 +3962,7 @@ need_resched_nonpreemptible:
> >  		 * Clock updates should not be skipped while we're away.
> >  		 */
> >  		clear_tsk_need_resched(prev);
> > -		rq->skip_clock_update = 0;
> > +		clear_skip_clock_update(rq);
> >  	}
> > 
> >  	put_prev_task(rq, prev);
> > diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
> > index 2e1b0d1..8924be2 100644
> > --- a/kernel/sched_debug.c
> > +++ b/kernel/sched_debug.c
> > @@ -298,6 +298,10 @@ static void print_cpu(struct seq_file *m, int cpu)
> > 
> >  	P(bkl_count);
> > 
> > +	P(skip_clock_count);
> > +	P64(skip_clock_recent_max);
> > +	P64(skip_clock_max);
> > +
> >  #undef P
> >  #endif
> >  	print_cfs_stats(m, cpu);
> > -- 
> > 1.7.3.1
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Scheduler bug related to rq->skip_clock_update?
  2010-11-22 18:14       ` Bjoern B. Brandenburg
  2010-12-04  7:42         ` Yong Zhang
  2010-12-04 14:08         ` Yong Zhang
@ 2010-12-05  5:28         ` Yong Zhang
  2010-12-06  5:33           ` Mike Galbraith
  2 siblings, 1 reply; 23+ messages in thread
From: Yong Zhang @ 2010-12-05  5:28 UTC (permalink / raw)
  To: Bjoern B. Brandenburg
  Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Andrea Bastoni,
	James H. Anderson, linux-kernel

On Mon, Nov 22, 2010 at 01:14:47PM -0500, Bjoern B. Brandenburg wrote:
> On Mon, 22 Nov 2010, Mike Galbraith wrote:
> 
> > On Sun, 2010-11-21 at 23:29 -0500, Bjoern B. Brandenburg wrote:
> > > On Sun, 21 Nov 2010, Mike Galbraith wrote:
> > >
> > > > On Sat, 2010-11-20 at 23:22 -0500, Bjoern B. Brandenburg wrote:
> > > >
> > > > > I was under the impression that, as an invariant, tasks should not have
> > > > > TIF_NEED_RESCHED set after they've blocked. In this case, the idle load
> > > > > balancer should not mark the task that's on its way out with
> > > > > set_tsk_need_resched().
> > > >
> > > > Nice find.
> > > >
> > > > > In any case, check_preempt_curr() seems to assume that a resuming task cannot
> > > > > have TIF_NEED_RESCHED already set. Setting skip_clock_update on a remote CPU
> > > > > that hasn't even been notified via IPI seems wrong.
> > > >
> > > > Yes. Does the below fix it up for you?
> > >
> > > The patch definitely changes the behavior, but it doesn't seem to solve (all
> > > of) the root cause(s). The failsafe kicks in and clears the flag the next
> > > time that update_rq_clock() is called, but there can still be a significant
> > > delay between setting and clearing the flag. Right after boot, I'm now seeing
> > > values that go up to ~21ms.
> >
> > A pull isn't the only vulnerability.  Since idle_balance() drops
> > rq->lock, another cpu can wake a task onto this rq.
> >
> > > Please let me know if there is something else that I should test.
> >
> > Sched: clear_tsk_need_resched() after NEWIDLE balancing
> >
> > idle_balance() drops/retakes rq->lock, leaving the previous task
> > vulnerable to set_tsk_need_resched() from another CPU.  Clear it
> > after NEWIDLE balancing to maintain the invariant that descheduled
> > tasks are NOT marked for resched.
> >
> > This also confuses the skip_clock_update logic, which assumes that
> > the next call to update_rq_clock() will come nearly immediately after
> > being set.  Make the optimization more robust by clearing before we
> > balance and in update_rq_clock().
> 
> Unfortunately that doesn't seem to do it yet.
> 
> After running five 'find /' instances to completion on the ARM platform,
> I'm still seeing delays close to 10ms.
> 
>     bbb@district10:~$ egrep 'cpu#|skip' /proc/sched_debug
>     cpu#0
>       .skip_clock_count              : 89606
>       .skip_clock_recent_max         : 9817250
>       .skip_clock_max                : 21992375
>     cpu#1
>       .skip_clock_count              : 81978
>       .skip_clock_recent_max         : 9582500
>       .skip_clock_max                : 17201750
>     cpu#2
>       .skip_clock_count              : 74565
>       .skip_clock_recent_max         : 9678000
>       .skip_clock_max                : 9879250
>     cpu#3
>       .skip_clock_count              : 81685
>       .skip_clock_recent_max         : 9300125
>       .skip_clock_max                : 14115750
> 
> On the x86_64 host, I've changed to HZ=100 and am now also seeing delays
> close to 10ms after 'make clean && make -j8 bzImage'.
> 
>     bbb@koruna:~$ egrep 'cpu#|skip' /proc/sched_debug
>     cpu#0, 2493.476 MHz
>       .skip_clock_count              : 29703
>       .skip_clock_recent_max         : 9999858
>       .skip_clock_max                : 40645942
>     cpu#1, 2493.476 MHz
>       .skip_clock_count              : 32696
>       .skip_clock_recent_max         : 9959118
>       .skip_clock_max                : 35074771
>     cpu#2, 2493.476 MHz
>       .skip_clock_count              : 31742
>       .skip_clock_recent_max         : 9788654
>       .skip_clock_max                : 24821765
>     cpu#3, 2493.476 MHz
>       .skip_clock_count              : 31123
>       .skip_clock_recent_max         : 9858546
>       .skip_clock_max                : 44276033
>     cpu#4, 2493.476 MHz
>       .skip_clock_count              : 28346
>       .skip_clock_recent_max         : 10000775
>       .skip_clock_max                : 18681753
>     cpu#5, 2493.476 MHz
>       .skip_clock_count              : 29421
>       .skip_clock_recent_max         : 9997656
>       .skip_clock_max                : 138473407
>     cpu#6, 2493.476 MHz
>       .skip_clock_count              : 27721
>       .skip_clock_recent_max         : 9992074
>       .skip_clock_max                : 53436918
>     cpu#7, 2493.476 MHz
>       .skip_clock_count              : 29637
>       .skip_clock_recent_max         : 9994516
>       .skip_clock_max                : 566793528
> 
> These numbers were recorded with the below patch.
> 
> Please let me know if I can help by testing or tracing something else.

Please just ignore my previous email, and sorry for the noise.

I think I found the root cause after a full night's sleep. :)

When we init the idle task, we don't mark it on_rq.
My test shows the problem is smoothed over by the patch below.

Thanks,
Yong

---
diff --git a/kernel/sched.c b/kernel/sched.c
index dc91a4d..21c76d9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5486,6 +5486,7 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
 	__sched_fork(idle);
 	idle->state = TASK_RUNNING;
 	idle->se.exec_start = sched_clock();
+	idle->se.on_rq = 1;
 
 	cpumask_copy(&idle->cpus_allowed, cpumask_of(cpu));
 	/*

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: Scheduler bug related to rq->skip_clock_update?
  2010-12-05  5:28         ` Yong Zhang
@ 2010-12-06  5:33           ` Mike Galbraith
  2010-12-06  7:59             ` Yong Zhang
  2010-12-06 15:40             ` Scheduler bug related to rq->skip_clock_update? Bjoern B. Brandenburg
  0 siblings, 2 replies; 23+ messages in thread
From: Mike Galbraith @ 2010-12-06  5:33 UTC (permalink / raw)
  To: Yong Zhang
  Cc: Bjoern B. Brandenburg, Peter Zijlstra, Ingo Molnar,
	Andrea Bastoni, James H. Anderson, linux-kernel

On Sun, 2010-12-05 at 13:28 +0800, Yong Zhang wrote:

> When we init the idle task, we don't mark it on_rq.
> My test shows the problem is smoothed over by the patch below.

Close :)

The skip_clock_update flag should only be set if rq->curr is on_rq,
because it is _that_ clock update during dequeue, and the subsequent
microscopic vruntime update it causes, that we're trying to avoid.
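
Schematically, the back-to-back update the flag is meant to short-circuit
looks something like this (rough sketch of the call sequence only, glossing
over details of the wakeup path):

	/* CPU X, wakeup path: the enqueue has just updated the rq clock */
	check_preempt_curr(rq, p, flags);
		/* rq->curr is still queued and now has TIF_NEED_RESCHED set,
		 * so the very next clock update would be back to back ... */
		rq->skip_clock_update = 1;

	/* CPU X, schedule(), microseconds later */
	put_prev_task(rq, prev);	/* (or the dequeue, if prev blocks) */
		update_rq_clock(rq);	/* ... and is the one that gets skipped */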

I think the below fixes it up properly.

Sched: fix skip_clock_update optimization

idle_balance() drops/retakes rq->lock, leaving the previous task
vulnerable to set_tsk_need_resched().  Clear it after we return
from balancing instead, and in setup_thread_stack() as well, so
no successfully descheduled or never scheduled task has it set.
 
Need resched confused the skip_clock_update logic, which assumes
that the next call to update_rq_clock() will come nearly immediately
after being set.  Make the optimization robust against the case where
a sleeper is woken before it successfully deschedules by checking that
the current task has not been dequeued before setting the flag,
since it is that useless clock update we're trying to save, and
clear in update_rq_clock() to ensure that ONE call may be skipped.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Bjoern B. Brandenburg <bbb.lst@gmail.com>
Reported-by: Bjoern B. Brandenburg <bbb.lst@gmail.com>

---
 kernel/fork.c  |    1 +
 kernel/sched.c |    6 +++---
 2 files changed, 4 insertions(+), 3 deletions(-)

Index: linux-2.6.37.git/kernel/sched.c
===================================================================
--- linux-2.6.37.git.orig/kernel/sched.c
+++ linux-2.6.37.git/kernel/sched.c
@@ -660,6 +660,7 @@ inline void update_rq_clock(struct rq *r
 
 		sched_irq_time_avg_update(rq, irq_time);
 	}
+	rq->skip_clock_update = 0;
 }
 
 /*
@@ -2138,7 +2139,7 @@ static void check_preempt_curr(struct rq
 	 * A queue event has occurred, and we're going to schedule.  In
 	 * this case, we can save a useless back to back clock update.
 	 */
-	if (test_tsk_need_resched(rq->curr))
+	if (rq->curr->se.on_rq && test_tsk_need_resched(rq->curr))
 		rq->skip_clock_update = 1;
 }
 
@@ -3854,7 +3855,6 @@ static void put_prev_task(struct rq *rq,
 {
 	if (prev->se.on_rq)
 		update_rq_clock(rq);
-	rq->skip_clock_update = 0;
 	prev->sched_class->put_prev_task(rq, prev);
 }
 
@@ -3912,7 +3912,6 @@ need_resched_nonpreemptible:
 		hrtick_clear(rq);
 
 	raw_spin_lock_irq(&rq->lock);
-	clear_tsk_need_resched(prev);
 
 	switch_count = &prev->nivcsw;
 	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
@@ -3942,6 +3941,7 @@ need_resched_nonpreemptible:
 	if (unlikely(!rq->nr_running))
 		idle_balance(cpu, rq);
 
+	clear_tsk_need_resched(prev);
 	put_prev_task(rq, prev);
 	next = pick_next_task(rq);
 
Index: linux-2.6.37.git/kernel/fork.c
===================================================================
--- linux-2.6.37.git.orig/kernel/fork.c
+++ linux-2.6.37.git/kernel/fork.c
@@ -275,6 +275,7 @@ static struct task_struct *dup_task_stru
 
 	setup_thread_stack(tsk, orig);
 	clear_user_return_notifier(tsk);
+	clear_tsk_need_resched(tsk);
 	stackend = end_of_stack(tsk);
 	*stackend = STACK_END_MAGIC;	/* for overflow detection */
 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Scheduler bug related to rq->skip_clock_update?
  2010-12-06  5:33           ` Mike Galbraith
@ 2010-12-06  7:59             ` Yong Zhang
  2010-12-06  8:32               ` [patchlet] " Mike Galbraith
  2010-12-06 15:40             ` Scheduler bug related to rq->skip_clock_update? Bjoern B. Brandenburg
  1 sibling, 1 reply; 23+ messages in thread
From: Yong Zhang @ 2010-12-06  7:59 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Bjoern B. Brandenburg, Peter Zijlstra, Ingo Molnar,
	Andrea Bastoni, James H. Anderson, linux-kernel

On Mon, Dec 6, 2010 at 1:33 PM, Mike Galbraith <efault@gmx.de> wrote:
> On Sun, 2010-12-05 at 13:28 +0800, Yong Zhang wrote:
>
>> When we init the idle task, we don't mark it on_rq.
>> My test shows the problem is smoothed over by the patch below.
>
> Close :)
>
> The skip_clock_update flag should only be set if rq->curr is on_rq,
> because it is _that_ clock update during dequeue, and the subsequent
> microscopic vruntime update it causes, that we're trying to avoid.
>
> I think the below fixes it up properly.

Yep. Now it's running well on my machine.

If you want, you can add my tested-by. :)

Thanks,
Yong

>
> Sched: fix skip_clock_update optimization
>
> idle_balance() drops/retakes rq->lock, leaving the previous task
> vulnerable to set_tsk_need_resched().  Clear it after we return
> from balancing instead, and in setup_thread_stack() as well, so
> no successfully descheduled or never scheduled task has it set.
>
> Need resched confused the skip_clock_update logic, which assumes
> that the next call to update_rq_clock() will come nearly immediately
> after being set.  Make the optimization robust against the case where
> a sleeper is woken before it successfully deschedules by checking that
> the current task has not been dequeued before setting the flag,
> since it is that useless clock update we're trying to save, and
> clear in update_rq_clock() to ensure that ONE call may be skipped.
>
> Signed-off-by: Mike Galbraith <efault@gmx.de>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Bjoern B. Brandenburg <bbb.lst@gmail.com>
> Reported-by: Bjoern B. Brandenburg <bbb.lst@gmail.com>
>
> ---
>  kernel/fork.c  |    1 +
>  kernel/sched.c |    6 +++---
>  2 files changed, 4 insertions(+), 3 deletions(-)
>
> Index: linux-2.6.37.git/kernel/sched.c
> ===================================================================
> --- linux-2.6.37.git.orig/kernel/sched.c
> +++ linux-2.6.37.git/kernel/sched.c
> @@ -660,6 +660,7 @@ inline void update_rq_clock(struct rq *r
>
>                sched_irq_time_avg_update(rq, irq_time);
>        }
> +       rq->skip_clock_update = 0;
>  }
>
>  /*
> @@ -2138,7 +2139,7 @@ static void check_preempt_curr(struct rq
>         * A queue event has occurred, and we're going to schedule.  In
>         * this case, we can save a useless back to back clock update.
>         */
> -       if (test_tsk_need_resched(rq->curr))
> +       if (rq->curr->se.on_rq && test_tsk_need_resched(rq->curr))
>                rq->skip_clock_update = 1;
>  }
>
> @@ -3854,7 +3855,6 @@ static void put_prev_task(struct rq *rq,
>  {
>        if (prev->se.on_rq)
>                update_rq_clock(rq);
> -       rq->skip_clock_update = 0;
>        prev->sched_class->put_prev_task(rq, prev);
>  }
>
> @@ -3912,7 +3912,6 @@ need_resched_nonpreemptible:
>                hrtick_clear(rq);
>
>        raw_spin_lock_irq(&rq->lock);
> -       clear_tsk_need_resched(prev);
>
>        switch_count = &prev->nivcsw;
>        if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
> @@ -3942,6 +3941,7 @@ need_resched_nonpreemptible:
>        if (unlikely(!rq->nr_running))
>                idle_balance(cpu, rq);
>
> +       clear_tsk_need_resched(prev);
>        put_prev_task(rq, prev);
>        next = pick_next_task(rq);
>
> Index: linux-2.6.37.git/kernel/fork.c
> ===================================================================
> --- linux-2.6.37.git.orig/kernel/fork.c
> +++ linux-2.6.37.git/kernel/fork.c
> @@ -275,6 +275,7 @@ static struct task_struct *dup_task_stru
>
>        setup_thread_stack(tsk, orig);
>        clear_user_return_notifier(tsk);
> +       clear_tsk_need_resched(tsk);
>        stackend = end_of_stack(tsk);
>        *stackend = STACK_END_MAGIC;    /* for overflow detection */
>
>
>
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [patchlet] Re: Scheduler bug related to rq->skip_clock_update?
  2010-12-06  7:59             ` Yong Zhang
@ 2010-12-06  8:32               ` Mike Galbraith
  2010-12-07 16:41                 ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Mike Galbraith @ 2010-12-06  8:32 UTC (permalink / raw)
  To: Yong Zhang
  Cc: Bjoern B. Brandenburg, Peter Zijlstra, Ingo Molnar,
	Andrea Bastoni, James H. Anderson, linux-kernel

On Mon, 2010-12-06 at 15:59 +0800, Yong Zhang wrote:
> On Mon, Dec 6, 2010 at 1:33 PM, Mike Galbraith <efault@gmx.de> wrote:
> > On Sun, 2010-12-05 at 13:28 +0800, Yong Zhang wrote:
> >
> >> When we init the idle task, we don't mark it on_rq.
> >> My test shows the problem is smoothed over by the patch below.
> >
> > Close :)
> >
> > The skip_clock_update flag should only be set if rq->curr is on_rq,
> > because it is _that_ clock update during dequeue, and the subsequent
> > microscopic vruntime update it causes, that we're trying to avoid.
> >
> > I think the below fixes it up properly.
> 
> Yep. Now it's running well on my machine.
> 
> If you want, you can add my tested-by. :)

Done.

Sched: fix skip_clock_update optimization

idle_balance() drops/retakes rq->lock, leaving the previous task
vulnerable to set_tsk_need_resched().  Clear it after we return
from balancing instead, and in setup_thread_stack() as well, so
no successfully descheduled or never scheduled task has it set.
 
Need resched confused the skip_clock_update logic, which assumes
that the next call to update_rq_clock() will come nearly immediately
after being set.  Make the optimization robust against the case where
a sleeper is woken before it successfully deschedules by checking that
the current task has not been dequeued before setting the flag,
since it is that useless clock update we're trying to save, and
clear in update_rq_clock() to ensure that ONE call may be skipped.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Bjoern B. Brandenburg <bbb.lst@gmail.com>
Reported-by: Bjoern B. Brandenburg <bbb.lst@gmail.com>
Tested-by: Yong Zhang <yong.zhang0@gmail.com>

---
 kernel/fork.c  |    1 +
 kernel/sched.c |    6 +++---
 2 files changed, 4 insertions(+), 3 deletions(-)

Index: linux-2.6.37.git/kernel/sched.c
===================================================================
--- linux-2.6.37.git.orig/kernel/sched.c
+++ linux-2.6.37.git/kernel/sched.c
@@ -660,6 +660,7 @@ inline void update_rq_clock(struct rq *r
 
 		sched_irq_time_avg_update(rq, irq_time);
 	}
+	rq->skip_clock_update = 0;
 }
 
 /*
@@ -2138,7 +2139,7 @@ static void check_preempt_curr(struct rq
 	 * A queue event has occurred, and we're going to schedule.  In
 	 * this case, we can save a useless back to back clock update.
 	 */
-	if (test_tsk_need_resched(rq->curr))
+	if (rq->curr->se.on_rq && test_tsk_need_resched(rq->curr))
 		rq->skip_clock_update = 1;
 }
 
@@ -3854,7 +3855,6 @@ static void put_prev_task(struct rq *rq,
 {
 	if (prev->se.on_rq)
 		update_rq_clock(rq);
-	rq->skip_clock_update = 0;
 	prev->sched_class->put_prev_task(rq, prev);
 }
 
@@ -3912,7 +3912,6 @@ need_resched_nonpreemptible:
 		hrtick_clear(rq);
 
 	raw_spin_lock_irq(&rq->lock);
-	clear_tsk_need_resched(prev);
 
 	switch_count = &prev->nivcsw;
 	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
@@ -3942,6 +3941,7 @@ need_resched_nonpreemptible:
 	if (unlikely(!rq->nr_running))
 		idle_balance(cpu, rq);
 
+	clear_tsk_need_resched(prev);
 	put_prev_task(rq, prev);
 	next = pick_next_task(rq);
 
Index: linux-2.6.37.git/kernel/fork.c
===================================================================
--- linux-2.6.37.git.orig/kernel/fork.c
+++ linux-2.6.37.git/kernel/fork.c
@@ -275,6 +275,7 @@ static struct task_struct *dup_task_stru
 
 	setup_thread_stack(tsk, orig);
 	clear_user_return_notifier(tsk);
+	clear_tsk_need_resched(tsk);
 	stackend = end_of_stack(tsk);
 	*stackend = STACK_END_MAGIC;	/* for overflow detection */
 




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Scheduler bug related to rq->skip_clock_update?
  2010-12-06  5:33           ` Mike Galbraith
  2010-12-06  7:59             ` Yong Zhang
@ 2010-12-06 15:40             ` Bjoern B. Brandenburg
  1 sibling, 0 replies; 23+ messages in thread
From: Bjoern B. Brandenburg @ 2010-12-06 15:40 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Yong Zhang, Peter Zijlstra, Ingo Molnar, Andrea Bastoni,
	James H. Anderson, linux-kernel



On Mon, 6 Dec 2010, Mike Galbraith wrote:

> On Sun, 2010-12-05 at 13:28 +0800, Yong Zhang wrote:
>
> > When we init the idle task, we don't mark it on_rq.
> > My test shows the problem is smoothed over by the patch below.
>
> Close :)
>
> The skip_clock_update flag should only be set if rq->curr is on_rq,
> because it is _that_ clock update during dequeue, and the subsequent
> microscopic vruntime update it causes, that we're trying to avoid.
>
> I think the below fixes it up properly.
>
> Sched: fix skip_clock_update optimization

Mike, Yong,

thanks for looking into this. On the x86_64 host, I'm now seeing
delays of less than 0.1ms, which seems fine. On the ARM host, I'm seeing
delays of up to ~2.5ms. However, this only happens on CPU0, which makes me
think that it is probably just a byproduct of interrupts.

bbb@district10:~$ egrep 'cpu#|skip' /proc/sched_debug
cpu#0
  .skip_clock_count              : 98494
  .skip_clock_recent_max         : 2543000
  .skip_clock_max                : 9876875
cpu#1
  .skip_clock_count              : 61084
  .skip_clock_recent_max         : 861750
  .skip_clock_max                : 1382875
cpu#2
  .skip_clock_count              : 60846
  .skip_clock_recent_max         : 193375
  .skip_clock_max                : 2236500
cpu#3
  .skip_clock_count              : 60270
  .skip_clock_recent_max         : 318750
  .skip_clock_max                : 1679000
bbb@district10:~$ cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
 33:         29          0          0          0         GIC  timer
 35:         49          0          0          0         GIC  isp1760-hcd:usb1
 36:        145          0          0          0         GIC  uart-pl011
 39:         10          0          0          0         GIC  kmi-pl050
 40:        116          0          0          0         GIC  kmi-pl050
 41:     215789          0          0          0         GIC  eth0
 46:          0          0          0          0         GIC  mmci-pl18x (cmd)
 47:          0          0          0          0         GIC  mmci-pl18x (pio)
IPI:     102721     103079      94532     111227
LOC:     115129     114840     114166     115122
Err:          0

So the latest patch seems to solve the issue.

Tested-by: Bjoern B. Brandenburg <bbb.lst@gmail.com>

Btw, I think this patch is a good candidate for the .36-stable tree since
it fixes a major regression.

Thanks,
Bjoern

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patchlet] Re: Scheduler bug related to rq->skip_clock_update?
  2010-12-06  8:32               ` [patchlet] " Mike Galbraith
@ 2010-12-07 16:41                 ` Peter Zijlstra
  2010-12-07 18:55                   ` Mike Galbraith
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2010-12-07 16:41 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Yong Zhang, Bjoern B. Brandenburg, Ingo Molnar, Andrea Bastoni,
	James H. Anderson, linux-kernel

On Mon, 2010-12-06 at 09:32 +0100, Mike Galbraith wrote:

>  kernel/fork.c  |    1 +
>  kernel/sched.c |    6 +++---
>  2 files changed, 4 insertions(+), 3 deletions(-)
> 
> Index: linux-2.6.37.git/kernel/sched.c
> ===================================================================
> --- linux-2.6.37.git.orig/kernel/sched.c
> +++ linux-2.6.37.git/kernel/sched.c
> @@ -660,6 +660,7 @@ inline void update_rq_clock(struct rq *r
>  
>  		sched_irq_time_avg_update(rq, irq_time);
>  	}
> +	rq->skip_clock_update = 0;
>  }
>  
>  /*

Shouldn't we do that at the end of schedule()? The purpose of
->skip_clock_update is to avoid multiple clock updates, in order to:
  - avoid overhead
  - ensure scheduling is accounted at a single point

[ for that latter purpose it might also make sense to put that point
somewhere around context_switch() but due to the fact that we need a
clock update early that's a bit impractical. ]

Hmm?
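
Concretely, something along these lines in schedule(), right after the
pick (sketch only, exact placement flexible):

	put_prev_task(rq, prev);
	next = pick_next_task(rq);
	clear_tsk_need_resched(prev);
	rq->skip_clock_update = 0;	/* at most one update skipped per schedule() */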

> @@ -2138,7 +2139,7 @@ static void check_preempt_curr(struct rq
>  	 * A queue event has occurred, and we're going to schedule.  In
>  	 * this case, we can save a useless back to back clock update.
>  	 */
> -	if (test_tsk_need_resched(rq->curr))
> +	if (rq->curr->se.on_rq && test_tsk_need_resched(rq->curr))
>  		rq->skip_clock_update = 1;
>  }

OK, I initially tried to replace the test with a return value of
->check_preempt_curr() and such, but that turns into a lot of code and
won't necessarily be any better.

> @@ -3854,7 +3855,6 @@ static void put_prev_task(struct rq *rq,
>  {
>  	if (prev->se.on_rq)
>  		update_rq_clock(rq);
> -	rq->skip_clock_update = 0;
>  	prev->sched_class->put_prev_task(rq, prev);
>  }

See the first note.

> @@ -3912,7 +3912,6 @@ need_resched_nonpreemptible:
>  		hrtick_clear(rq);
>  
>  	raw_spin_lock_irq(&rq->lock);
> -	clear_tsk_need_resched(prev);
>  
>  	switch_count = &prev->nivcsw;
>  	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
> @@ -3942,6 +3941,7 @@ need_resched_nonpreemptible:
>  	if (unlikely(!rq->nr_running))
>  		idle_balance(cpu, rq);
>  
> +	clear_tsk_need_resched(prev);
>  	put_prev_task(rq, prev);
>  	next = pick_next_task(rq);

Good find, this needs to be done after the idle balancing because that
can release the rq->lock and allow for TIF_NEED_RESCHED to be set again.

Maybe complement this with a WARN_ON_ONCE(test_tsk_need_resched(next))
somewhere after pick_next_task() so as to ensure that !current has !
TIF_NEED_RESCHED.
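
E.g. (sketch only, exact spot flexible as long as it is after the pick):

	next = pick_next_task(rq);
	...
	rq->curr = next;
	++*switch_count;
	/* nothing should have re-marked the incoming task by this point */
	WARN_ON_ONCE(test_tsk_need_resched(next));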

> Index: linux-2.6.37.git/kernel/fork.c
> ===================================================================
> --- linux-2.6.37.git.orig/kernel/fork.c
> +++ linux-2.6.37.git/kernel/fork.c
> @@ -275,6 +275,7 @@ static struct task_struct *dup_task_stru
>  
>  	setup_thread_stack(tsk, orig);
>  	clear_user_return_notifier(tsk);
> +	clear_tsk_need_resched(tsk);
>  	stackend = end_of_stack(tsk);
>  	*stackend = STACK_END_MAGIC;	/* for overflow detection */
>  

OK.. have we looked if there's more TIF flags that could do with a
reset?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patchlet] Re: Scheduler bug related to rq->skip_clock_update?
  2010-12-07 16:41                 ` Peter Zijlstra
@ 2010-12-07 18:55                   ` Mike Galbraith
  2010-12-08 10:05                     ` Mike Galbraith
  0 siblings, 1 reply; 23+ messages in thread
From: Mike Galbraith @ 2010-12-07 18:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yong Zhang, Bjoern B. Brandenburg, Ingo Molnar, Andrea Bastoni,
	James H. Anderson, linux-kernel

On Tue, 2010-12-07 at 17:41 +0100, Peter Zijlstra wrote:
> On Mon, 2010-12-06 at 09:32 +0100, Mike Galbraith wrote:
> 
> >  kernel/fork.c  |    1 +
> >  kernel/sched.c |    6 +++---
> >  2 files changed, 4 insertions(+), 3 deletions(-)
> > 
> > Index: linux-2.6.37.git/kernel/sched.c
> > ===================================================================
> > --- linux-2.6.37.git.orig/kernel/sched.c
> > +++ linux-2.6.37.git/kernel/sched.c
> > @@ -660,6 +660,7 @@ inline void update_rq_clock(struct rq *r
> >  
> >  		sched_irq_time_avg_update(rq, irq_time);
> >  	}
> > +	rq->skip_clock_update = 0;
> >  }
> >  
> >  /*
> 
> Shouldn't we do that at the end of schedule()? The purpose of
> ->skip_clock_update is to avoid multiple clock updates, in order to:
>   - avoid overhead
>   - ensure scheduling is accounted at a single point
> 
> [ for that latter purpose it might also make sense to put that point
> somewhere around context_switch() but due to the fact that we need a
> clock update early that's a bit impractical. ]
> 
> Hmm?

Yeah, could do that instead.  There's no gain in any call that may
happen in the interval between.  Think I'll measure though, this bug was
a surprise :)

> > @@ -2138,7 +2139,7 @@ static void check_preempt_curr(struct rq
> >  	 * A queue event has occurred, and we're going to schedule.  In
> >  	 * this case, we can save a useless back to back clock update.
> >  	 */
> > -	if (test_tsk_need_resched(rq->curr))
> > +	if (rq->curr->se.on_rq && test_tsk_need_resched(rq->curr))
> >  		rq->skip_clock_update = 1;
> >  }
> 
> OK, I initially tried to replace the test with a return value of
> ->check_preempt_curr() and such, but that turns into a lot of code and
> won't necessarily be any better.

(Yeah, I considered doing the same)

> > @@ -3854,7 +3855,6 @@ static void put_prev_task(struct rq *rq,
> >  {
> >  	if (prev->se.on_rq)
> >  		update_rq_clock(rq);
> > -	rq->skip_clock_update = 0;
> >  	prev->sched_class->put_prev_task(rq, prev);
> >  }
> 
> See the first note.
> 
> > @@ -3912,7 +3912,6 @@ need_resched_nonpreemptible:
> >  		hrtick_clear(rq);
> >  
> >  	raw_spin_lock_irq(&rq->lock);
> > -	clear_tsk_need_resched(prev);
> >  
> >  	switch_count = &prev->nivcsw;
> >  	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
> > @@ -3942,6 +3941,7 @@ need_resched_nonpreemptible:
> >  	if (unlikely(!rq->nr_running))
> >  		idle_balance(cpu, rq);
> >  
> > +	clear_tsk_need_resched(prev);
> >  	put_prev_task(rq, prev);
> >  	next = pick_next_task(rq);
> 
> Good find, this needs to be done after the idle balancing because that
> can release the rq->lock and allow for TIF_NEED_RESCHED to be set again.
> 
> Maybe complement this with a WARN_ON_ONCE(test_tsk_need_resched(next))
> somewhere after pick_next_task() so as to ensure that !current has !
> TIF_NEED_RESCHED.
> 
> > Index: linux-2.6.37.git/kernel/fork.c
> > ===================================================================
> > --- linux-2.6.37.git.orig/kernel/fork.c
> > +++ linux-2.6.37.git/kernel/fork.c
> > @@ -275,6 +275,7 @@ static struct task_struct *dup_task_stru
> >  
> >  	setup_thread_stack(tsk, orig);
> >  	clear_user_return_notifier(tsk);
> > +	clear_tsk_need_resched(tsk);
> >  	stackend = end_of_stack(tsk);
> >  	*stackend = STACK_END_MAGIC;	/* for overflow detection */
> >  
> 
> OK.. have we looked if there's more TIF flags that could do with a
> reset?

mmm, no.

	-Mike


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patchlet] Re: Scheduler bug related to rq->skip_clock_update?
  2010-12-07 18:55                   ` Mike Galbraith
@ 2010-12-08 10:05                     ` Mike Galbraith
  2010-12-08 11:12                       ` Peter Zijlstra
  2010-12-08 20:40                       ` [tip:sched/urgent] Sched: fix skip_clock_update optimization tip-bot for Mike Galbraith
  0 siblings, 2 replies; 23+ messages in thread
From: Mike Galbraith @ 2010-12-08 10:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yong Zhang, Bjoern B. Brandenburg, Ingo Molnar, Andrea Bastoni,
	James H. Anderson, linux-kernel

On Tue, 2010-12-07 at 19:55 +0100, Mike Galbraith wrote:
> On Tue, 2010-12-07 at 17:41 +0100, Peter Zijlstra wrote:
> > On Mon, 2010-12-06 at 09:32 +0100, Mike Galbraith wrote:
> > 
> > >  kernel/fork.c  |    1 +
> > >  kernel/sched.c |    6 +++---
> > >  2 files changed, 4 insertions(+), 3 deletions(-)
> > > 
> > > Index: linux-2.6.37.git/kernel/sched.c
> > > ===================================================================
> > > --- linux-2.6.37.git.orig/kernel/sched.c
> > > +++ linux-2.6.37.git/kernel/sched.c
> > > @@ -660,6 +660,7 @@ inline void update_rq_clock(struct rq *r
> > >  
> > >  		sched_irq_time_avg_update(rq, irq_time);
> > >  	}
> > > +	rq->skip_clock_update = 0;
> > >  }
> > >  
> > >  /*
> > 
> > Shouldn't we do that at the end of schedule()? The purpose of
> > ->skip_clock_update is to avoid multiple clock updates, in order to:
> >   - avoid overhead
> >   - ensure scheduling is accounted at a single point
> > 
> > [ for that latter purpose it might also make sense to put that point
> > somewhere around context_switch() but due to the fact that we need a
> > clock update early that's a bit impractical. ]
> > 
> > Hmm?
> 
> Yeah, could do that instead.  There's no gain in any call that may
> happen in the interval between.  Think I'll measure though, this bug was
> a surprise :)

Did that, all is well.

> > OK.. have we looked if there's more TIF flags that could do with a
> > reset?
> 
> mmm, no.

Crawling over flags had to be left for another day.

Sched: fix skip_clock_update optimization

idle_balance() drops/retakes rq->lock, leaving the previous task
vulnerable to set_tsk_need_resched().  Clear it after we return
from balancing instead, and in setup_thread_stack() as well, so
no successfully descheduled or never scheduled task has it set.
 
Need resched confused the skip_clock_update logic, which assumes
that the next call to update_rq_clock() will come nearly immediately
after being set.  Make the optimization robust against the case where
a sleeper is woken before it successfully deschedules by checking that
the current task has not been dequeued before setting the flag,
since it is that useless clock update we're trying to save, and
clear unconditionally in schedule() proper instead of conditionally
in put_prev_task().

Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Bjoern B. Brandenburg <bbb.lst@gmail.com>
Reported-by: Bjoern B. Brandenburg <bbb.lst@gmail.com>
Tested-by: Yong Zhang <yong.zhang0@gmail.com>

---
 kernel/fork.c  |    1 +
 kernel/sched.c |   28 +++++++++++++++-------------
 2 files changed, 16 insertions(+), 13 deletions(-)

Index: linux-2.6.37.git/kernel/sched.c
===================================================================
--- linux-2.6.37.git.orig/kernel/sched.c
+++ linux-2.6.37.git/kernel/sched.c
@@ -649,17 +649,18 @@ static void sched_irq_time_avg_update(st
 
 inline void update_rq_clock(struct rq *rq)
 {
-	if (!rq->skip_clock_update) {
-		int cpu = cpu_of(rq);
-		u64 irq_time;
-
-		rq->clock = sched_clock_cpu(cpu);
-		irq_time = irq_time_cpu(cpu);
-		if (rq->clock - irq_time > rq->clock_task)
-			rq->clock_task = rq->clock - irq_time;
+	int cpu = cpu_of(rq);
+	u64 irq_time;
 
-		sched_irq_time_avg_update(rq, irq_time);
-	}
+	if (rq->skip_clock_update)
+		return;
+
+	rq->clock = sched_clock_cpu(cpu);
+	irq_time = irq_time_cpu(cpu);
+	if (rq->clock - irq_time > rq->clock_task)
+		rq->clock_task = rq->clock - irq_time;
+
+	sched_irq_time_avg_update(rq, irq_time);
 }
 
 /*
@@ -2138,7 +2139,7 @@ static void check_preempt_curr(struct rq
 	 * A queue event has occurred, and we're going to schedule.  In
 	 * this case, we can save a useless back to back clock update.
 	 */
-	if (test_tsk_need_resched(rq->curr))
+	if (rq->curr->se.on_rq && test_tsk_need_resched(rq->curr))
 		rq->skip_clock_update = 1;
 }
 
@@ -3854,7 +3855,6 @@ static void put_prev_task(struct rq *rq,
 {
 	if (prev->se.on_rq)
 		update_rq_clock(rq);
-	rq->skip_clock_update = 0;
 	prev->sched_class->put_prev_task(rq, prev);
 }
 
@@ -3912,7 +3912,6 @@ need_resched_nonpreemptible:
 		hrtick_clear(rq);
 
 	raw_spin_lock_irq(&rq->lock);
-	clear_tsk_need_resched(prev);
 
 	switch_count = &prev->nivcsw;
 	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
@@ -3944,6 +3943,8 @@ need_resched_nonpreemptible:
 
 	put_prev_task(rq, prev);
 	next = pick_next_task(rq);
+	clear_tsk_need_resched(prev);
+	rq->skip_clock_update = 0;
 
 	if (likely(prev != next)) {
 		sched_info_switch(prev, next);
@@ -3952,6 +3953,7 @@ need_resched_nonpreemptible:
 		rq->nr_switches++;
 		rq->curr = next;
 		++*switch_count;
+		WARN_ON_ONCE(test_tsk_need_resched(next));
 
 		context_switch(rq, prev, next); /* unlocks the rq */
 		/*
Index: linux-2.6.37.git/kernel/fork.c
===================================================================
--- linux-2.6.37.git.orig/kernel/fork.c
+++ linux-2.6.37.git/kernel/fork.c
@@ -275,6 +275,7 @@ static struct task_struct *dup_task_stru
 
 	setup_thread_stack(tsk, orig);
 	clear_user_return_notifier(tsk);
+	clear_tsk_need_resched(tsk);
 	stackend = end_of_stack(tsk);
 	*stackend = STACK_END_MAGIC;	/* for overflow detection */
 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patchlet] Re: Scheduler bug related to rq->skip_clock_update?
  2010-12-08 10:05                     ` Mike Galbraith
@ 2010-12-08 11:12                       ` Peter Zijlstra
  2010-12-08 20:40                       ` [tip:sched/urgent] Sched: fix skip_clock_update optimization tip-bot for Mike Galbraith
  1 sibling, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2010-12-08 11:12 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Yong Zhang, Bjoern B. Brandenburg, Ingo Molnar, Andrea Bastoni,
	James H. Anderson, linux-kernel

On Wed, 2010-12-08 at 11:05 +0100, Mike Galbraith wrote:
> 
> Crawling over flags had to be left for another day.
> 
Sure, just something we need to keep on the todo list (as if that thing
isn't bottomless ;-)

Anyway, thanks. I've queued it up (added stable@kernel.org as well) and
will hopefully still get it in .37.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [tip:sched/urgent] Sched: fix skip_clock_update optimization
  2010-12-08 10:05                     ` Mike Galbraith
  2010-12-08 11:12                       ` Peter Zijlstra
@ 2010-12-08 20:40                       ` tip-bot for Mike Galbraith
  2010-12-09 15:32                         ` Peter Zijlstra
  1 sibling, 1 reply; 23+ messages in thread
From: tip-bot for Mike Galbraith @ 2010-12-08 20:40 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, bbb.lst, efault, tglx,
	yong.zhang0, mingo

Commit-ID:  f26f9aff6aaf67e9a430d16c266f91b13a5bff64
Gitweb:     http://git.kernel.org/tip/f26f9aff6aaf67e9a430d16c266f91b13a5bff64
Author:     Mike Galbraith <efault@gmx.de>
AuthorDate: Wed, 8 Dec 2010 11:05:42 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 8 Dec 2010 20:15:06 +0100

Sched: fix skip_clock_update optimization

idle_balance() drops/retakes rq->lock, leaving the previous task
vulnerable to set_tsk_need_resched().  Clear it after we return
from balancing instead, and in setup_thread_stack() as well, so
no successfully descheduled or never scheduled task has it set.

Need resched confused the skip_clock_update logic, which assumes
that the next call to update_rq_clock() will come nearly immediately
after being set.  Make the optimization robust against the case where
a sleeper is woken before it successfully deschedules by checking that
the current task has not been dequeued before setting the flag,
since it is that useless clock update we're trying to save, and
clear unconditionally in schedule() proper instead of conditionally
in put_prev_task().

Signed-off-by: Mike Galbraith <efault@gmx.de>
Reported-by: Bjoern B. Brandenburg <bbb.lst@gmail.com>
Tested-by: Yong Zhang <yong.zhang0@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@kernel.org
LKML-Reference: <1291802742.1417.9.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/fork.c  |    1 +
 kernel/sched.c |   26 ++++++++++++++------------
 2 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 3b159c5..5447dc7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -273,6 +273,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
 
 	setup_thread_stack(tsk, orig);
 	clear_user_return_notifier(tsk);
+	clear_tsk_need_resched(tsk);
 	stackend = end_of_stack(tsk);
 	*stackend = STACK_END_MAGIC;	/* for overflow detection */
 
diff --git a/kernel/sched.c b/kernel/sched.c
index 6b7c26a..da14302 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -641,17 +641,18 @@ static void sched_irq_time_avg_update(struct rq *rq, u64 irq_time);
 
 inline void update_rq_clock(struct rq *rq)
 {
-	if (!rq->skip_clock_update) {
-		int cpu = cpu_of(rq);
-		u64 irq_time;
+	int cpu = cpu_of(rq);
+	u64 irq_time;
 
-		rq->clock = sched_clock_cpu(cpu);
-		irq_time = irq_time_cpu(cpu);
-		if (rq->clock - irq_time > rq->clock_task)
-			rq->clock_task = rq->clock - irq_time;
+	if (rq->skip_clock_update)
+		return;
 
-		sched_irq_time_avg_update(rq, irq_time);
-	}
+	rq->clock = sched_clock_cpu(cpu);
+	irq_time = irq_time_cpu(cpu);
+	if (rq->clock - irq_time > rq->clock_task)
+		rq->clock_task = rq->clock - irq_time;
+
+	sched_irq_time_avg_update(rq, irq_time);
 }
 
 /*
@@ -2129,7 +2130,7 @@ static void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
 	 * A queue event has occurred, and we're going to schedule.  In
 	 * this case, we can save a useless back to back clock update.
 	 */
-	if (test_tsk_need_resched(rq->curr))
+	if (rq->curr->se.on_rq && test_tsk_need_resched(rq->curr))
 		rq->skip_clock_update = 1;
 }
 
@@ -3973,7 +3974,6 @@ static void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
 	if (prev->se.on_rq)
 		update_rq_clock(rq);
-	rq->skip_clock_update = 0;
 	prev->sched_class->put_prev_task(rq, prev);
 }
 
@@ -4031,7 +4031,6 @@ need_resched_nonpreemptible:
 		hrtick_clear(rq);
 
 	raw_spin_lock_irq(&rq->lock);
-	clear_tsk_need_resched(prev);
 
 	switch_count = &prev->nivcsw;
 	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
@@ -4063,6 +4062,8 @@ need_resched_nonpreemptible:
 
 	put_prev_task(rq, prev);
 	next = pick_next_task(rq);
+	clear_tsk_need_resched(prev);
+	rq->skip_clock_update = 0;
 
 	if (likely(prev != next)) {
 		sched_info_switch(prev, next);
@@ -4071,6 +4072,7 @@ need_resched_nonpreemptible:
 		rq->nr_switches++;
 		rq->curr = next;
 		++*switch_count;
+		WARN_ON_ONCE(test_tsk_need_resched(next));
 
 		context_switch(rq, prev, next); /* unlocks the rq */
 		/*

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [tip:sched/urgent] Sched: fix skip_clock_update optimization
  2010-12-08 20:40                       ` [tip:sched/urgent] Sched: fix skip_clock_update optimization tip-bot for Mike Galbraith
@ 2010-12-09 15:32                         ` Peter Zijlstra
  2010-12-10  2:33                           ` Yong Zhang
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2010-12-09 15:32 UTC (permalink / raw)
  To: mingo, hpa, linux-kernel, efault, bbb.lst, tglx, yong.zhang0, mingo
  Cc: linux-tip-commits

On Wed, 2010-12-08 at 20:40 +0000, tip-bot for Mike Galbraith wrote:
> @@ -4063,6 +4062,8 @@ need_resched_nonpreemptible:
>  
>         put_prev_task(rq, prev);
>         next = pick_next_task(rq);
> +       clear_tsk_need_resched(prev);
> +       rq->skip_clock_update = 0;
>  
>         if (likely(prev != next)) {
>                 sched_info_switch(prev, next);
> @@ -4071,6 +4072,7 @@ need_resched_nonpreemptible:
>                 rq->nr_switches++;
>                 rq->curr = next;
>                 ++*switch_count;
> +               WARN_ON_ONCE(test_tsk_need_resched(next)); 

I just managed to trigger this one.. will try and reproduce to find out
wth happened.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [tip:sched/urgent] Sched: fix skip_clock_update optimization
  2010-12-09 15:32                         ` Peter Zijlstra
@ 2010-12-10  2:33                           ` Yong Zhang
  2010-12-10 16:17                             ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Yong Zhang @ 2010-12-10  2:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, hpa, linux-kernel, efault, bbb.lst, tglx, mingo,
	linux-tip-commits

On Thu, Dec 9, 2010 at 11:32 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Wed, 2010-12-08 at 20:40 +0000, tip-bot for Mike Galbraith wrote:
>> @@ -4063,6 +4062,8 @@ need_resched_nonpreemptible:
>>
>>         put_prev_task(rq, prev);
>>         next = pick_next_task(rq);
>> +       clear_tsk_need_resched(prev);
>> +       rq->skip_clock_update = 0;
>>
>>         if (likely(prev != next)) {
>>                 sched_info_switch(prev, next);
>> @@ -4071,6 +4072,7 @@ need_resched_nonpreemptible:
>>                 rq->nr_switches++;
>>                 rq->curr = next;
>>                 ++*switch_count;
>> +               WARN_ON_ONCE(test_tsk_need_resched(next));
>
> I just managed to trigger this one.. will try and reproduce to find out
> wth happened.
>

I don't see it on my side.
Let me just share some info (don't know if it's helpful):
wake_up_idle_cpu() can set TIF_RESCHED locklessly.

So does
WARN_ON_ONCE(next != rq->idle && test_tsk_need_resched(next));
help?
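
I.e., against the hunk that fired, roughly (sketch):

	rq->curr = next;
	++*switch_count;
	/* idle can be marked need_resched locklessly via wake_up_idle_cpu() */
	WARN_ON_ONCE(next != rq->idle && test_tsk_need_resched(next));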

Thanks,
Yong


-- 
Only stand for myself.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [tip:sched/urgent] Sched: fix skip_clock_update optimization
  2010-12-10  2:33                           ` Yong Zhang
@ 2010-12-10 16:17                             ` Peter Zijlstra
  0 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2010-12-10 16:17 UTC (permalink / raw)
  To: Yong Zhang
  Cc: mingo, hpa, linux-kernel, efault, bbb.lst, tglx, mingo,
	linux-tip-commits

On Fri, 2010-12-10 at 10:33 +0800, Yong Zhang wrote:
> 
> I don't see it on my side.
> Let me just share some info (don't know if it's helpful):
> wake_up_idle_cpu() can set TIF_RESCHED locklessly.
> 
> So does
> WARN_ON_ONCE(next != rq->idle && test_tsk_need_resched(next));
> help?

It's a good possibility; sadly I haven't been able to reproduce, so I
cannot confirm. I'll keep my machine running, maybe it will trigger in a
few hours.

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2010-12-10 16:18 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-11-21  4:22 Scheduler bug related to rq->skip_clock_update? Bjoern B. Brandenburg
2010-11-21 17:14 ` Mike Galbraith
2010-11-22  4:29   ` Bjoern B. Brandenburg
2010-11-22 16:19     ` Mike Galbraith
2010-11-22 18:14       ` Bjoern B. Brandenburg
2010-12-04  7:42         ` Yong Zhang
2010-12-04 14:05           ` Yong Zhang
2010-12-04 14:08         ` Yong Zhang
2010-12-04 14:33           ` Yong Zhang
2010-12-05  5:28         ` Yong Zhang
2010-12-06  5:33           ` Mike Galbraith
2010-12-06  7:59             ` Yong Zhang
2010-12-06  8:32               ` [patchlet] " Mike Galbraith
2010-12-07 16:41                 ` Peter Zijlstra
2010-12-07 18:55                   ` Mike Galbraith
2010-12-08 10:05                     ` Mike Galbraith
2010-12-08 11:12                       ` Peter Zijlstra
2010-12-08 20:40                       ` [tip:sched/urgent] Sched: fix skip_clock_update optimization tip-bot for Mike Galbraith
2010-12-09 15:32                         ` Peter Zijlstra
2010-12-10  2:33                           ` Yong Zhang
2010-12-10 16:17                             ` Peter Zijlstra
2010-12-06 15:40             ` Scheduler bug related to rq->skip_clock_update? Bjoern B. Brandenburg
2010-12-03 12:41 ` Peter Zijlstra
