* [PATCH v1] sched: fix nohz idle load balancer issues
From: Srivatsa Vaddagiri @ 2011-09-26 11:50 UTC
  To: Peter Zijlstra, Paul Turner, Venki Pallipadi, Ingo Molnar,
	Vaidyanathan Srinivasan, Kamalesh Babulal
  Cc: linux-kernel

While trying to test the recently introduced cpu bandwidth control feature for
non-realtime tasks, we noticed "more than expected" idle time, which
reduced considerably when booted with nohz=off. This patch is an attempt
to fix that discrepancy so that we see little variance in idle time between
nohz=on and nohz=off.

Test setup:

Machine : 16-cpus (2 Quad-core w/ HT enabled)
Kernel  : Latest code (HEAD at 6e8d0472ea63969e2df77c7e84741495fedf7d9b) found 
	  at git://tesla.tglx.de/git/linux-2.6-tip

Cgroups :

5 in number (/L1/L2/C1 - /L1/L2/C5), each having {2, 2, 4, 8, 16} tasks
respectively. /L1 and /L2 were added to the hierarchy to mimic cgroup hierarchy 
created by libvirt and otherwise do not contain any tasks. Each cgroup has 
cpu.shares proportional to # of tasks in it. For ex: /L1/L2/C1's cpu.shares =
2 * 1024 = 2048, C3's cpu.shares = 4096 etc. Further, each task is placed in its
own (sub-)cgroup with default shares of 1024 and a capped usage of 50% CPU.
  
        /L1/L2/C1/C1_1/Task1  -> capped at 50% cpu usage
        /L1/L2/C1/C1_2/Task2  -> capped at 50% cpu usage
        /L1/L2/C2/C2_1/Task3  -> capped at 50% cpu usage
        /L1/L2/C2/C2_2/Task3  -> capped at 50% cpu usage
        /L1/L2/C3/C3_1/Task4  -> capped at 50% cpu usage
        /L1/L2/C3/C3_2/Task4  -> capped at 50% cpu usage
        /L1/L2/C3/C3_3/Task4  -> capped at 50% cpu usage
        /L1/L2/C3/C3_4/Task4  -> capped at 50% cpu usage
        ...
        /L1/L2/C5/C5_16/Task32 -> capped at 50% cpu usage

So we have 32 tasks, each capped at 50% CPU usage, run on a 16-CPU
system, which one may expect to consume all CPU resource leaving no idle
time. While that may be "insane" expectation, the goal is to minimize idle time
in this situation as much as possible.

I am using a slightly modified script provided at
https://lkml.org/lkml/2011/6/7/352 for generating this test scenario -
can make that available if required.
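
The per-task setup that script performs boils down to something like the
minimal C sketch below (the /cgroup mount point and group path are
illustrative assumptions on my part; a 50% cap corresponds to
cpu.cfs_quota_us being half of cpu.cfs_period_us):

        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/stat.h>
        #include <sys/types.h>

        /* Write a single value into a cgroup control file */
        static void cg_write(const char *cg, const char *file, long val)
        {
                char path[256];
                FILE *fp;

                snprintf(path, sizeof(path), "%s/%s", cg, file);
                fp = fopen(path, "w");
                if (!fp) {
                        perror(path);
                        exit(1);
                }
                fprintf(fp, "%ld\n", val);
                fclose(fp);
        }

        int main(void)
        {
                /* illustrative path; assumes the cpu controller at /cgroup */
                const char *cg = "/cgroup/L1/L2/C1/C1_1";

                mkdir(cg, 0755);                  /* parents assumed to exist */
                cg_write(cg, "cpu.shares", 1024); /* default shares */
                cg_write(cg, "cpu.cfs_period_us", 100000);
                cg_write(cg, "cpu.cfs_quota_us", 50000); /* 50% of one cpu */
                /* the task's pid would then be written to <cg>/tasks */
                return 0;
        }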

Idle time was sampled every second (using vmstat) over a window of 60 seconds
and was found as below:

Idle time          Average   Std-deviation   Min    Max
============================================================
nohz=off           4%        0.5%            3%     5%
nohz=on            10%       2.4%            5%     18%
nohz=on + patch    5.3%      1.3%            3%     9%

The patch cuts down idle time significantly when kernel is booted 
with 'nohz=on' (which is good for saving power when idle).

Description of the patch
========================

Reviewing idle load balancer code and testing it with some
trace_printk(), I observed the following:

1. I had put a trace_printk() in nohz_idle_balance() as below:

	nohz_idle_balance()
	{

	if (idle != CPU_IDLE || !this_rq->nohz_balance_kick)
		return;

	..

		trace_printk("Running rebalance for %d\n", balance_cpu);

		rebalance_domains(balance_cpu, CPU_IDLE);
	}

    I *never* got that printed during the test. Further investigation
    revealed that ilb_cpu was bailing out early because idle =
    CPU_NOT_IDLE i.e ilb_cpu was no longer idle_at_tick by the time it
    got around to handle the kick. As a result, no one was truly
    doing a load balance on behalf of sleeping idle cpus.


2. nohz.next_balance is supposed to reflect min(rq->next_balance)
   across all sleeping idle cpus. As new idle cpus go tickless, we
   don't update nohz.next_balance to reflect the fact that the
   idle cpu going tickless may have rq->next_balance earlier than
   nohz.next_balance. Putting in some trace_printk() I noticed that
   the difference (between nohz.next_balance and its desired new value)
   was between 30 and 1673 ticks (CONFIG_HZ=1000 in my case).

3. AFAICS CPUs designated as first_pick_cpu or second_pick_cpu can be
   idle for some time before going tickless. When that happens, they
   stop kicking ilb_cpu. More importantly, other busy cpus also
   don't kick ilb_cpu, as they are neither first nor second pick_cpu.
   In this situation, ilb_cpu is not woken up in a timely manner to
   do idle load balance.

This patch is an attempt to solve the above issues observed with the idle
load balancer.

- The patch causes an ilb_cpu (that was kicked) to kick another idle
  cpu in case it was found to be !idle_at_tick (so that another idle cpu
  can do load balance on behalf of idle cpus). This fixes issue #1.

- The patch introduces a 'nohz.next_balance_lock' spinlock which is used
  to update nohz.next_balance, so that it stays as min(rq->next_balance).
  This fixes issue #2. I don't like a global spinlock so much, but don't
  see easy way out. Besides, this lock is taken in context of idle cpu.

- It allows any busy cpu to kick ilb_cpu if it has more than one
  runnable task (rq->nr_running > 1). This addresses issue #3.

The patch is against latest tip code (HEAD at
6e8d0472ea63969e2df77c7e84741495fedf7d9b) found at
git://tesla.tglx.de/git/linux-2.6-tip

Another improvement that we may want to look at is to have second_pick_cpu
avoid kicking in case there is no power/performance benefit (i.e. it's not a
thread/core sibling of first_pick_cpu).

Comments/flames welcome!

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

---

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -8355,6 +8355,7 @@ void __init sched_init(void)
 	atomic_set(&nohz.load_balancer, nr_cpu_ids);
 	atomic_set(&nohz.first_pick_cpu, nr_cpu_ids);
 	atomic_set(&nohz.second_pick_cpu, nr_cpu_ids);
+	spin_lock_init(&nohz.next_balance_lock);
 #endif
 	/* May be allocated at isolcpus cmdline parse time */
 	if (cpu_isolated_map == NULL)
Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -4302,6 +4302,7 @@ static struct {
 	cpumask_var_t idle_cpus_mask;
 	cpumask_var_t grp_idle_mask;
 	unsigned long next_balance;     /* in jiffy units */
+	spinlock_t next_balance_lock;   /* Serialize update to 'next_balance' */
 } nohz ____cacheline_aligned;
 
 int get_nohz_load_balancer(void)
@@ -4443,8 +4444,12 @@ static void nohz_balancer_kick(int cpu)
 
 	ilb_cpu = get_nohz_load_balancer();
 
-	if (ilb_cpu >= nr_cpu_ids) {
-		ilb_cpu = cpumask_first(nohz.idle_cpus_mask);
+	/*
+	 * ilb_cpu itself can be attempting to kick another idle cpu. Pick
+	 * another idle cpu in that case.
+	 */
+	if (ilb_cpu >= nr_cpu_ids || ilb_cpu == cpu) {
+		ilb_cpu = cpumask_any_but(nohz.idle_cpus_mask, cpu);
 		if (ilb_cpu >= nr_cpu_ids)
 			return;
 	}
@@ -4459,6 +4464,18 @@ static void nohz_balancer_kick(int cpu)
 	return;
 }
 
+/* Update nohz.next_balance with a new minimum value */
+static inline void set_nohz_next_balance(unsigned long next_balance)
+{
+	if (time_after(next_balance, nohz.next_balance))
+		return;
+
+	spin_lock(&nohz.next_balance_lock);
+	if (time_before(next_balance, nohz.next_balance))
+		nohz.next_balance = next_balance;
+	spin_unlock(&nohz.next_balance_lock);
+}
+
 /*
  * This routine will try to nominate the ilb (idle load balancing)
  * owner among the cpus whose ticks are stopped. ilb owner will do the idle
@@ -4517,8 +4534,10 @@ void select_nohz_load_balancer(int stop_
 				resched_cpu(new_ilb);
 				return;
 			}
+			set_nohz_next_balance(cpu_rq(cpu)->next_balance);
 			return;
 		}
+		set_nohz_next_balance(cpu_rq(cpu)->next_balance);
 	} else {
 		if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
 			return;
@@ -4634,9 +4653,16 @@ static void nohz_idle_balance(int this_c
 	struct rq *rq;
 	int balance_cpu;
 
-	if (idle != CPU_IDLE || !this_rq->nohz_balance_kick)
+	if (!this_rq->nohz_balance_kick)
 		return;
 
+	/* Wakeup another idle cpu to do idle load balance if we got busy */
+	if (idle != CPU_IDLE) {
+		nohz_balancer_kick(this_cpu);
+		this_rq->nohz_balance_kick = 0;
+		return;
+	}
+
 	for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
 		if (balance_cpu == this_cpu)
 			continue;
@@ -4646,10 +4672,8 @@ static void nohz_idle_balance(int this_c
 		 * work being done for other cpus. Next load
 		 * balancing owner will pick it up.
 		 */
-		if (need_resched()) {
-			this_rq->nohz_balance_kick = 0;
+		if (need_resched())
 			break;
-		}
 
 		raw_spin_lock_irq(&this_rq->lock);
 		update_rq_clock(this_rq);
@@ -4663,7 +4687,7 @@ static void nohz_idle_balance(int this_c
 		if (time_after(this_rq->next_balance, rq->next_balance))
 			this_rq->next_balance = rq->next_balance;
 	}
-	nohz.next_balance = this_rq->next_balance;
+	set_nohz_next_balance(this_rq->next_balance);
 	this_rq->nohz_balance_kick = 0;
 }
 
@@ -4695,8 +4719,11 @@ static inline int nohz_kick_needed(struc
 	second_pick_cpu = atomic_read(&nohz.second_pick_cpu);
 
 	if (first_pick_cpu < nr_cpu_ids && first_pick_cpu != cpu &&
-	    second_pick_cpu < nr_cpu_ids && second_pick_cpu != cpu)
+	    second_pick_cpu < nr_cpu_ids && second_pick_cpu != cpu) {
+		if (rq->nr_running > 1)
+			return 1;
 		return 0;
+	}
 
 	ret = atomic_cmpxchg(&nohz.first_pick_cpu, nr_cpu_ids, cpu);
 	if (ret == nr_cpu_ids || ret == cpu) {
@@ -4752,7 +4779,7 @@ static inline void trigger_load_balance(
 	    likely(!on_null_domain(cpu)))
 		raise_softirq(SCHED_SOFTIRQ);
 #ifdef CONFIG_NO_HZ
-	else if (nohz_kick_needed(rq, cpu) && likely(!on_null_domain(cpu)))
+	if (nohz_kick_needed(rq, cpu) && likely(!on_null_domain(cpu)))
 		nohz_balancer_kick(cpu);
 #endif
 }


* Re: [PATCH v1] sched: fix nohz idle load balancer issues
From: Srivatsa Vaddagiri @ 2011-09-26 12:01 UTC
  To: Peter Zijlstra, Paul Turner, Venki Pallipadi, Ingo Molnar,
	Vaidyanathan Srinivasan, Kamalesh Babulal
  Cc: linux-kernel

* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> [2011-09-26 17:20:49]:

> @@ -4695,8 +4719,11 @@ static inline int nohz_kick_needed(struc
>  	second_pick_cpu = atomic_read(&nohz.second_pick_cpu);
>  
>  	if (first_pick_cpu < nr_cpu_ids && first_pick_cpu != cpu &&
> -	    second_pick_cpu < nr_cpu_ids && second_pick_cpu != cpu)
> +	    second_pick_cpu < nr_cpu_ids && second_pick_cpu != cpu) {
> +		if (rq->nr_running > 1)
> +			return 1;
>  		return 0;
> +	}

I should have added a comment to explain this change ..will do it in the
next version (along with any other changes suggested).

- vatsa



* Re: [PATCH v1] sched: fix nohz idle load balancer issues
From: Ingo Molnar @ 2011-09-27  6:32 UTC
  To: Srivatsa Vaddagiri
  Cc: Peter Zijlstra, Paul Turner, Venki Pallipadi,
	Vaidyanathan Srinivasan, Kamalesh Babulal, linux-kernel


* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote:

> While trying to test the recently introduced cpu bandwidth control feature for
> non-realtime tasks, we noticed "more than expected" idle time, which
> reduced considerably when booted with nohz=off. This patch is an attempt
> to fix that discrepancy so that we see little variance in idle time between
> nohz=on and nohz=off.
> 
> Test setup:
> 
> Machine : 16-cpus (2 Quad-core w/ HT enabled)
> Kernel  : Latest code (HEAD at 6e8d0472ea63969e2df77c7e84741495fedf7d9b) found 
> 	  at git://tesla.tglx.de/git/linux-2.6-tip
> 
> Cgroups :
> 
> 5 in number (/L1/L2/C1 - /L1/L2/C5), each having {2, 2, 4, 8, 16} tasks
> respectively. /L1 and /L2 were added to the hierarchy to mimic cgroup hierarchy 
> created by libvirt and otherwise do not contain any tasks. Each cgroup has 
> cpu.shares proportional to # of tasks in it. For ex: /L1/L2/C1's cpu.shares =
> 2 * 1024 = 2048, C3's cpu.shares = 4096 etc. Further, each task is placed in its
> own (sub-)cgroup with default shares of 1024 and a capped usage of 50% CPU.
>   
>         /L1/L2/C1/C1_1/Task1  -> capped at 50% cpu usage
>         /L1/L2/C1/C1_2/Task2  -> capped at 50% cpu usage
>         /L1/L2/C2/C2_1/Task3  -> capped at 50% cpu usage
>         /L1/L2/C2/C2_2/Task3  -> capped at 50% cpu usage
>         /L1/L2/C3/C3_1/Task4  -> capped at 50% cpu usage
>         /L1/L2/C3/C3_2/Task4  -> capped at 50% cpu usage
>         /L1/L2/C3/C3_3/Task4  -> capped at 50% cpu usage
>         /L1/L2/C3/C3_4/Task4  -> capped at 50% cpu usage
>         ...
>         /L1/L2/C5/C5_16/Task32 -> capped at 50% cpu usage
> 
> So we have 32 tasks, each capped at 50% CPU usage, run on a 16-CPU
> system, which one may expect to consume all CPU resource leaving no idle
> time. While that may be "insane" expectation, the goal is to minimize idle time
> in this situation as much as possible.
> 
> I am using a slightly modified script provided at
> https://lkml.org/lkml/2011/6/7/352 for generating this test scenario -
> can make that available if required.
> 
> Idle time was sampled every second (using vmstat) over a window of 60 seconds
> and was found as below:
> 
> Idle time          Average   Std-deviation   Min    Max
> ============================================================
> nohz=off           4%        0.5%            3%     5%
> nohz=on            10%       2.4%            5%     18%
> nohz=on + patch    5.3%      1.3%            3%     9%
> 
> The patch cuts down idle time significantly when kernel is booted 
> with 'nohz=on' (which is good for saving power when idle).

What are the tasks doing which are running - are they plain burning 
CPU time? If the tasks do something more complex, do you also have a 
measure of how much work gets done by the workload, per second?

Percentual changes in that metric would be nice to include in an 
additional column - that way we can see that it's not only idle
that has gone down, but workload performance has gone up too.

In fact even if there was only a CPU burning loop in the workload it 
would be nice to make that somewhat more sophisticated by letting it 
process some larger array that has a cache footprint. This mimics 
real workloads that don't just spin burning CPU time but do real data 
processing.

For any non-trivial workload it's possible to reduce idle time 
without much increase in work done and in fact it's possible to 
decrease idle time *and* work done - so we need to see more clearly 
here and make sure it's all an improvement.

Thanks,

	Ingo



* Re: [PATCH v1] sched: fix nohz idle load balancer issues
From: Srivatsa Vaddagiri @ 2011-09-27 10:31 UTC
  To: Ingo Molnar
  Cc: Peter Zijlstra, Paul Turner, Venki Pallipadi,
	Vaidyanathan Srinivasan, Kamalesh Babulal, linux-kernel

* Ingo Molnar <mingo@elte.hu> [2011-09-27 08:32:24]:

> What are the tasks doing which are running - are they plain burning 
> CPU time? If the tasks do something more complex, do you also have a 
> measure of how much work gets done by the workload, per second?

They are simple cpu hogs at this time.

> Percentual changes in that metric would be nice to include in an 
> additional column - that way we can see that it's not only idle
> that has gone down, but workload performance has gone up too.

Ok, good point.

> In fact even if there was only a CPU burning loop in the workload it 
> would be nice to make that somewhat more sophisticated by letting it 
> process some larger array that has a cache footprint. This mimics 
> real workloads that don't just spin burning CPU time but do real data 
> processing.
> 
> For any non-trivial workload it's possible to reduce idle time 
> without much increase in work done and in fact it's possible to 
> decrease idle time *and* work done - so we need to see more clearly 
> here and make sure it's all an improvement.

Ok - I will run a cpu intensive benchmark and get some numbers on
how benchmark score varies with the patch applied. I can pick a simple
matrix multiplication type benchmark, unless you have other suggestions!

- vatsa


* Re: [PATCH v1] sched: fix nohz idle load balancer issues
From: Ingo Molnar @ 2011-09-27 13:39 UTC
  To: Srivatsa Vaddagiri
  Cc: Peter Zijlstra, Paul Turner, Venki Pallipadi,
	Vaidyanathan Srinivasan, Kamalesh Babulal, linux-kernel


* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote:

> * Ingo Molnar <mingo@elte.hu> [2011-09-27 08:32:24]:
> 
> > What are the tasks doing which are running - are they plain burning 
> > CPU time? If the tasks do something more complex, do you also have a 
> > measure of how much work gets done by the workload, per second?
> 
> They are simple cpu hogs at this time.
> 
> > Percentual changes in that metric would be nice to include in an 
> > additional column - that way we can see that it's not only idle
> > that has gone down, but workload performance has gone up too.
> 
> Ok, good point.
> 
> > In fact even if there was only a CPU burning loop in the workload it 
> > would be nice to make that somewhat more sophisticated by letting it 
> > process some larger array that has a cache footprint. This mimics 
> > real workloads that don't just spin burning CPU time but do real data 
> > processing.
> > 
> > For any non-trivial workload it's possible to reduce idle time 
> > without much increase in work done and in fact it's possible to 
> > decrease idle time *and* work done - so we need to see more clearly 
> > here and make sure it's all an improvement.
> 
> Ok - I will run a cpu intensive benchmark and get some numbers on 
> how benchmark score varies with the patch applied. I can pick a 
> simple matrix multiplication type benchmark, unless you have other 
> suggestions!

Yeah, matrix multiplication would be fine I think.

You could create it yourself and add it into 
tools/perf/bench/cpu-matrix.c as a new 'perf bench cpu matrix' 
testcase - perhaps with a '-s 10m' parameter that defines the size of 
the matrices, and a '-i 1000' parameter to specify the number of 
iterations - or so.

We could use that for scheduler HPC benchmarking in the future as 
well.
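
Roughly, I am thinking of something like the below - an untested,
illustrative sketch only, with the matrix size and iteration count
hard-coded where the real testcase would parse '-s'/'-i':

        #include <stdlib.h>

        /*
         * Naive n x n matrix multiply - real data processing with a
         * cache footprint, not just a cpu-burning loop.
         */
        static void matrix_mult(const double *a, const double *b,
                                double *c, int n)
        {
                int i, j, k;

                for (i = 0; i < n; i++)
                        for (j = 0; j < n; j++) {
                                double sum = 0.0;

                                for (k = 0; k < n; k++)
                                        sum += a[i * n + k] * b[k * n + j];
                                c[i * n + j] = sum;
                        }
        }

        int main(void)
        {
                int n = 512;            /* would come from '-s' */
                int iterations = 100;   /* would come from '-i' */
                double *a = calloc(n * n, sizeof(double));
                double *b = calloc(n * n, sizeof(double));
                double *c = calloc(n * n, sizeof(double));
                int i;

                for (i = 0; i < iterations; i++)
                        matrix_mult(a, b, c, n);

                free(a); free(b); free(c);
                return 0;
        }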

Thanks,

	Ingo


* Re: [PATCH v1] sched: fix nohz idle load balancer issues
From: Venki Pallipadi @ 2011-09-27 19:53 UTC
  To: Srivatsa Vaddagiri
  Cc: Peter Zijlstra, Paul Turner, Ingo Molnar,
	Vaidyanathan Srinivasan, Kamalesh Babulal, linux-kernel

On Mon, Sep 26, 2011 at 4:50 AM, Srivatsa Vaddagiri
<vatsa@linux.vnet.ibm.com> wrote:
> While trying to test the recently introduced cpu bandwidth control feature for
> non-realtime tasks, we noticed "more than expected" idle time, which
> reduced considerably when booted with nohz=off. This patch is an attempt
> to fix that discrepancy so that we see little variance in idle time between
> nohz=on and nohz=off.
>
> Test setup:
>
> Machine : 16-cpus (2 Quad-core w/ HT enabled)
> Kernel  : Latest code (HEAD at 6e8d0472ea63969e2df77c7e84741495fedf7d9b) found
>          at git://tesla.tglx.de/git/linux-2.6-tip
>
> Cgroups :
>
> 5 in number (/L1/L2/C1 - /L1/L2/C5), each having {2, 2, 4, 8, 16} tasks
> respectively. /L1 and /L2 were added to the hierarchy to mimic cgroup hierarchy
> created by libvirt and otherwise do not contain any tasks. Each cgroup has
> cpu.shares proportional to # of tasks in it. For ex: /L1/L2/C1's cpu.shares =
> 2 * 1024 = 2048, C3's cpu.shares = 4096 etc. Further, each task is placed in its
> own (sub-)cgroup with default shares of 1024 and a capped usage of 50% CPU.
>
>        /L1/L2/C1/C1_1/Task1  -> capped at 50% cpu usage
>        /L1/L2/C1/C1_2/Task2  -> capped at 50% cpu usage
>        /L1/L2/C2/C2_1/Task3  -> capped at 50% cpu usage
>        /L1/L2/C2/C2_2/Task3  -> capped at 50% cpu usage
>        /L1/L2/C3/C3_1/Task4  -> capped at 50% cpu usage
>        /L1/L2/C3/C3_2/Task4  -> capped at 50% cpu usage
>        /L1/L2/C3/C3_3/Task4  -> capped at 50% cpu usage
>        /L1/L2/C3/C3_4/Task4  -> capped at 50% cpu usage
>        ...
>        /L1/L2/C5/C5_16/Task32 -> capped at 50% cpu usage
>
> So we have 32 tasks, each capped at 50% CPU usage, run on a 16-CPU
> system, which one may expect to consume all CPU resource leaving no idle
> time. While that may be "insane" expectation, the goal is to minimize idle time
> in this situation as much as possible.
>
> I am using a slightly modified script provided at
> https://lkml.org/lkml/2011/6/7/352 for generating this test scenario -
> can make that available if required.
>
> Idle time was sampled every second (using vmstat) over a window of 60 seconds
> and was found as below:
>
> Idle time               Average  Std-deviation   Min    Max
> ============================================================
>
> nohz=off                4%        0.5%           3%      5%
> nohz=on                 10%       2.4%           5%      18%
> nohz=on + patch         5.3%      1.3%           3%      9%
>
> The patch cuts down idle time significantly when kernel is booted
> with 'nohz=on' (which is good for saving power when idle).
>
> Description of the patch
> ========================
>
> Reviewing idle load balancer code and testing it with some
> trace_printk(), I observed the following:
>
> 1. I had put a trace_printk() in nohz_idle_balance() as below:
>
>        nohz_idle_balance()
>        {
>
>        if (idle != CPU_IDLE || !this_rq->nohz_balance_kick)
>                return;
>
>        ..
>
>                trace_printk("Running rebalance for %d\n", balance_cpu);
>
>                rebalance_domains(balance_cpu, CPU_IDLE);
>        }
>
>    I *never* got that printed during the test. Further investigation
>    revealed that ilb_cpu was bailing out early because idle =
>    CPU_NOT_IDLE i.e ilb_cpu was no longer idle_at_tick by the time it
>    got around to handle the kick. As a result, no one was truly
>    doing a load balance on behalf of sleeping idle cpus.
>
>
> 2. nohz.next_balance is supposed to reflect min(rq->next_balance)
>   across all sleeping idle cpus. As new idle cpus go tickless, we
>   don't update nohz.next_balance to reflect the fact that the
>   idle cpu going tickless may have rq->next_balance earlier than
>   nohz.next_balance. Putting in some trace_printk() I noticed that
>   the difference (between nohz.next_balance and its desired new value)
>   was between 30 and 1673 ticks (CONFIG_HZ=1000 in my case).
>
> 3. AFAICS CPUs designated as first_pick_cpu or second_pick_cpu can be
>   idle for some time before going tickless. When that happens, they
>   stop kicking ilb_cpu. More importantly, other busy cpus also
>   don't kick ilb_cpu, as they are neither first nor second pick_cpu.
>   In this situation, ilb_cpu is not woken up in a timely manner to
>   do idle load balance.
>
> This patch is an attempt to solve the above issues observed with the idle
> load balancer.
>
> - The patch causes an ilb_cpu (that was kicked) to kick another idle
>  cpu in case it was found to be !idle_at_tick (so that another idle cpu
>  can do load balance on behalf of idle cpus). This fixes issue #1.

Some comments:

Another potential change here is to
- either reverse the order of rebalance_domains() and
nohz_idle_balance() in run_rebalance_domains()
- or to kick another idle CPU in case of need_resched() in nohz_idle_balance.
This should help with idle balance of tickless CPUs when ilb CPU gets
a new task through load balance and hence aborts ilb.

>
> - The patch introduces a 'nohz.next_balance_lock' spinlock which is used
>  to update nohz.next_balance, so that it stays as min(rq->next_balance).
>  This fixes issue #2. I don't like a global spinlock so much, but don't
>  see easy way out. Besides, this lock is taken in context of idle cpu.
>

The problem I see with this is that there is no way to reset
next_balance when a previously idle CPU goes busy. This will probably
result in more frequent ilb than needed, with potential power and
performance (due to SMT or frequent timer interrupts) impact.

> - It allows any busy cpu to kick ilb_cpu if it has more than one
>  runnable task (rq->nr_running > 1). This addresses issue #3.


This again may have power impact with frequent kicking. Especially
with higher number of logical CPUs. Likely cleaner way is to clear
first_pick, second_pick on idle instead of clearing on tickless.
Would be interesting to see some tests of the power impact (or number of
interrupts, resched IPIs etc) of this change. Both with netbook kind
of systems and on servers with partially idle configuration.

Thanks,
Venki


* Re: [PATCH v1] sched: fix nohz idle load balancer issues
From: Suresh Siddha @ 2011-09-27 23:49 UTC
  To: Venki Pallipadi
  Cc: Srivatsa Vaddagiri, Peter Zijlstra, Paul Turner, Ingo Molnar,
	Vaidyanathan Srinivasan, Kamalesh Babulal, linux-kernel

On Tue, 2011-09-27 at 12:53 -0700, Venki Pallipadi wrote:
> On Mon, Sep 26, 2011 at 4:50 AM, Srivatsa Vaddagiri
> <vatsa@linux.vnet.ibm.com> wrote:
> >
> > Reviewing idle load balancer code and testing it with some
> > trace_printk(), I observed the following:
> >
> > 1. I had put a trace_printk() in nohz_idle_balance() as below:
> >
> >        nohz_idle_balance()
> >        {
> >
> >        if (idle != CPU_IDLE || !this_rq->nohz_balance_kick)
> >                return;
> >
> >        ..
> >
> >                trace_printk("Running rebalance for %d\n", balance_cpu);
> >
> >                rebalance_domains(balance_cpu, CPU_IDLE);
> >        }
> >
> >    I *never* got that printed during the test. Further investigation
> >    revealed that ilb_cpu was bailing out early because idle =
> >    CPU_NOT_IDLE i.e ilb_cpu was no longer idle_at_tick by the time it
> >    got around to handle the kick. As a result, no one was truly
> >    doing a load balance on behalf of sleeping idle cpus.

One of the reasons why we saw ilb_cpu not idle is probably because that
info was stale.

Consider this scenario.

a. got a tick when the cpu was busy, so idle_at_tick was not set
b. cpu went idle
c. same cpu got the kick IPI from other busy cpu
d. and as it has idle_at_tick not set, it couldn't proceed with the nohz
idle balance.

> > This patch is an attempt to solve the above issues observed with the idle
> > load balancer.
> >
> > - The patch causes an ilb_cpu (that was kicked) to kick another idle
> >  cpu in case it was found to be !idle_at_tick (so that another idle cpu
> >  can do load balance on behalf of idle cpus). This fixes issue #1.
> 
> Some comments:
> 
> Another potential change here is to
> - either reverse the order of rebalance_domains() and
> nohz_idle_balance() in run_rebalance_domains()
> - or to kick another idle CPU in case of need_resched() in nohz_idle_balance.
> This should help with idle balance of tickless CPUs when ilb CPU gets
> a new task through load balance and hence aborts ilb.

I think we are most likely seeing the above mentioned scenario.

Also Vatsa, there is a deadlock associated with using
__smp_call_function_single() in the nohz_balancer_kick(). So I am
planning to remove the IPI that is used to kick the nohz balancer and
instead use the resched_cpu logic to kick the nohz balancer.

I will post this patch mostly tomorrow. That patch will not use the
idle_at_tick check in the nohz_idle_balance(). So that should address
your issue in some cases if not most.

thanks,
suresh



* Re: [PATCH v1] sched: fix nohz idle load balancer issues
From: Srivatsa Vaddagiri @ 2011-09-28  4:15 UTC
  To: Suresh Siddha
  Cc: Venki Pallipadi, Peter Zijlstra, Paul Turner, Ingo Molnar,
	Vaidyanathan Srinivasan, Kamalesh Babulal, linux-kernel

* Suresh Siddha <suresh.b.siddha@intel.com> [2011-09-27 16:49:36]:

> One of the reasons why we saw ilb_cpu not idle is probably because that
> info was stale.
> 
> Consider this scenario.
> 
> a. got a tick when the cpu was busy, so idle_at_tick was not set
> b. cpu went idle
> c. same cpu got the kick IPI from other busy cpu
> d. and as it has idle_at_tick not set, it couldn't proceed with the nohz
> idle balance.

Good point ..we could use idle_cpu() instead there ..

> I think we are most likely seeing the above mentioned scenario.
> 
> Also Vatsa, there is a deadlock associated with using
> __smp_call_function_single() in the nohz_balancer_kick(). So I am
> planning to remove the IPI that is used to kick the nohz balancer and
> instead use the resched_cpu logic to kick the nohz balancer.
> 
> I will post this patch mostly tomorrow. That patch will not use the
> idle_at_tick check in the nohz_idle_balance(). So that should address
> your issue in some cases if not most.

Ok ..would be glad to test your change ..I am however doubtful if it
will eliminate the rest of the issues I pointed out with the nohz load
balancer.

- vatsa


* Re: [PATCH v1] sched: fix nohz idle load balancer issues
From: Srivatsa Vaddagiri @ 2011-09-28  4:34 UTC
  To: Venki Pallipadi
  Cc: Peter Zijlstra, Paul Turner, Ingo Molnar,
	Vaidyanathan Srinivasan, Kamalesh Babulal, linux-kernel

* Venki Pallipadi <venki@google.com> [2011-09-27 12:53:21]:

> Some comments:
> 
> Another potential change here is to
> - either reverse the order of rebalance_domains() and
> nohz_idle_balance() in run_rebalance_domains()

I thought of that, but then realized that it won't influence our
"idle_at_tick" check in nohz_idle_balance(). Did you have any other benefit in
mind behind that change?

> - or to kick another idle CPU in case of need_resched() in nohz_idle_balance.
> This should help with idle balance of tickless CPUs when ilb CPU gets
> a new task through load balance and hence aborts ilb.

Yes good point. I will add that in next version.

> > - The patch introduces a 'nohz.next_balance_lock' spinlock which is used
> >  to update nohz.next_balance, so that it stays as min(rq->next_balance).
> >  This fixes issue #2. I don't like a global spinlock so much, but don't
> >  see easy way out. Besides, this lock is taken in context of idle cpu.
> >
> 
> The problem I see with this is that there is no way to reset
> next_balance when a previously idle CPU goes busy. This will probably
> result in more frequent ilb than needed, with potential power and
> performance (due to SMT or frequent timer interrupts) impact.

That already seems to be an issue with the existing code. One possibility is
to rescan idle cpus looking for the next min rq->next_balance (unless we
want to go for something more sophisticated, like a sorted list of
rq->next_balance in a rb-tree).

> > - It allows any busy cpu to kick ilb_cpu if it has more than one
> >  runnable task (rq->nr_running > 1). This addresses issue #3.
> 
> 
> This again may have power impact with frequent kicking.

I don't know how many additional kicks it would add - I mean the system
is busy and ilb_cpu deserves a kick. With this, we are forcing it to
happen sooner rather than waiting for first/second_pick_cpu to do
"justice"?

> Especially with higher number of logical CPUs. Likely cleaner way is to clear
> first_pick, second_pick on idle instead of clearing on tickless.

I think I tried that (cleared first/second_pick_cpu in
nohz_kick_needed() upon idle) but didn't get the best results. Let me
try that again and post idle time numbers.

> Would be interesting to see some tests of the power impact (or number of
> interrupts, resched IPIs etc) of this change. Both with netbook kind
> of systems and on servers with partially idle configuration.

Ok - will get some numbers in that regard as well.

- vatsa


* Re: [PATCH v1] sched: fix nohz idle load balancer issues
From: Srivatsa Vaddagiri @ 2011-09-28  4:39 UTC
  To: Venki Pallipadi
  Cc: Peter Zijlstra, Paul Turner, Ingo Molnar,
	Vaidyanathan Srinivasan, Kamalesh Babulal, linux-kernel

* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> [2011-09-28 10:04:27]:

> unless we want to go for something more sophisticated, like a sorted
> list of rq->next_balance in a rb-tree

Given that this list will need to be global (and not per-cpu) and it will need
to be updated regularly (even when cpus are busy, as they update their
rq->next_balance), that will probably not fly!

- vatsa


* Re: [PATCH v1] sched: fix nohz idle load balancer issues
From: Srivatsa Vaddagiri @ 2011-09-28  5:06 UTC
  To: Venki Pallipadi
  Cc: Peter Zijlstra, Paul Turner, Ingo Molnar,
	Vaidyanathan Srinivasan, Kamalesh Babulal, linux-kernel

* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> [2011-09-28 10:09:21]:

> * Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> [2011-09-28 10:04:27]:
> 
> > unless we want to go for something more sophisticated, like a sorted
> > list of rq->next_balance in a rb-tree
> 
> Given that this list will need to be global (and not per-cpu) and it will need
> to be updated regularly (even when cpus are busy, as they update their
> rq->next_balance), that will probably not fly!

On third thoughts, this list needs to be updated only in idle cpu
context (idle cpus going tickless will be inserted in the list, indexed by
rq->next_balance, and cpus exiting idle will need to be removed). So the
global list may not hurt so much ..will see how that comes up in code.
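
Roughly, using the kernel's rbtree API, something along these lines (an
untested sketch; all names below are made up):

        #include <linux/rbtree.h>
        #include <linux/spinlock.h>
        #include <linux/jiffies.h>

        struct nohz_ent {
                struct rb_node  node;
                unsigned long   next_balance;   /* key, in jiffies */
                int             cpu;
        };

        static struct rb_root nohz_root = RB_ROOT;
        static DEFINE_SPINLOCK(nohz_tree_lock);

        /* Insert an idle cpu going tickless, keyed by its rq->next_balance */
        static void nohz_tree_insert(struct nohz_ent *ent)
        {
                struct rb_node **link = &nohz_root.rb_node, *parent = NULL;

                spin_lock(&nohz_tree_lock);
                while (*link) {
                        struct nohz_ent *e;

                        e = rb_entry(*link, struct nohz_ent, node);
                        parent = *link;
                        if (time_before(ent->next_balance, e->next_balance))
                                link = &parent->rb_left;
                        else
                                link = &parent->rb_right;
                }
                rb_link_node(&ent->node, parent, link);
                rb_insert_color(&ent->node, &nohz_root);
                spin_unlock(&nohz_tree_lock);
                /* a cpu exiting idle would rb_erase() its entry similarly */
        }

        /* min(rq->next_balance) over tickless idle cpus is just rb_first() */
        static unsigned long nohz_tree_min_next_balance(void)
        {
                struct rb_node *first = rb_first(&nohz_root);

                if (!first)
                        return jiffies + HZ;
                return rb_entry(first, struct nohz_ent, node)->next_balance;
        }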

- vatsa


* Re: [PATCH v1] sched: fix nohz idle load balancer issues
From: Srivatsa Vaddagiri @ 2011-09-28 14:17 UTC
  To: Venki Pallipadi
  Cc: Peter Zijlstra, Paul Turner, Ingo Molnar,
	Vaidyanathan Srinivasan, Kamalesh Babulal, linux-kernel

* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> [2011-09-28 10:04:27]:

> > Especially with higher number of logical CPUs. Likely cleaner way is to clear
> > first_pick, second_pick on idle instead of clearing on tickless.
> 
> I think I tried that (cleared first/second_pick_cpu in
> nohz_kick_needed() upon idle) but didn't get the best results. Let me
> try that again and post idle time numbers.

So I made some changes so that the first/second_pick_cpu identifiers are
cleared when the corresponding cpus become idle. They are cleared in
pick_next_task_idle() ..

With that change, I don't see any additional benefit of having a third busy
cpu kick ilb_cpu when it has > 1 task.


Idle time ->                    Avg.    Std.Dev   Min   Max
W/o third busy cpu kicking      4.57%   0.9%      3%    8%
With third busy cpu kicking     4.78%   1.1%      3%    8%

So will go with your suggestion in the next version!

- vatsa


* Re: [PATCH v1] sched: fix nohz idle load balancer issues
From: Srivatsa Vaddagiri @ 2011-09-28 14:27 UTC
  To: Venki Pallipadi
  Cc: Peter Zijlstra, Paul Turner, Ingo Molnar,
	Vaidyanathan Srinivasan, Kamalesh Babulal, linux-kernel

* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> [2011-09-28 19:47:01]:

> * Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> [2011-09-28 10:04:27]:
> 
> > > Especially with higher number of logical CPUs. Likely cleaner way is to clear
> > > first_pick, second_pick on idle instead of clearing on tickless.
> > 
> > I think I tried that (cleared first/second_pick_cpu in
> > nohz_kick_needed() upon idle) but didn't get the best results.

which perhaps was because I was relying on idle_at_tick (which can be
stale information, as Suresh pointed out).

- vatsa


* Re: [PATCH v1] sched: fix nohz idle load balancer issues
From: Suresh Siddha @ 2011-09-29  0:47 UTC
  To: Srivatsa Vaddagiri
  Cc: Venki Pallipadi, Peter Zijlstra, Paul Turner, Ingo Molnar,
	Vaidyanathan Srinivasan, Kamalesh Babulal, linux-kernel

On Tue, 2011-09-27 at 21:15 -0700, Srivatsa Vaddagiri wrote:
> * Suresh Siddha <suresh.b.siddha@intel.com> [2011-09-27 16:49:36]:
> 
> > I will post this patch mostly tomorrow. That patch will not use the
> > idle_at_tick check in the nohz_idle_balance(). So that should address
> > your issue in some cases if not most.
> 
> Ok ..would be glad to test your change ..

Vatsa, here is the promised patch. I haven't completely made up my mind
about this patch, especially the change that I am doing now in
scheduler_ipi().

Initially I was planning to use resched_cpu() to trigger the
nohz_idle_balance() as part of the idle_balance(). But preemption is
disabled  during idle_balance() and I don't want to do more work in
idle_balance. Thomas/PeterZ may not like that. And also the idle cpu
would have done tick_nohz_restart_sched_tick() unnecessarily etc.

So ended up with using kick_process() and the scheduler_ipi() context to
trigger the SCHED_SOFTIRQ instead of using smp call function vector
sequence (which has a deadlock scenario in the context of heavy
interrupts which I can explain in detail when I send the complete
changelog). And also I am explicitly requesting for idle balance to
address the stale idle_at_tick condition.

Anyways, this appended patch is just for your quick experiment. I will
think a bit more tomorrow morning before sending the patch with complete
changelog. Thanks.
---

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 kernel/sched.c      |   10 +++++++---
 kernel/sched_fair.c |   25 +++----------------------
 2 files changed, 10 insertions(+), 25 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index c5cf15e..482b9d2 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -644,7 +644,7 @@ struct rq {
 
 	unsigned long cpu_power;
 
-	unsigned char idle_at_tick;
+	unsigned char idle_balance;
 	/* For active balancing */
 	int post_schedule;
 	int active_balance;
@@ -2733,6 +2733,11 @@ void scheduler_ipi(void)
 	struct rq *rq = this_rq();
 	struct task_struct *list = xchg(&rq->wake_list, NULL);
 
+	if (unlikely((rq->idle == current) && rq->nohz_balance_kick)) {
+		rq->idle_balance = 1;
+		raise_softirq_irqoff(SCHED_SOFTIRQ);
+	}
+
 	if (!list)
 		return;
 
@@ -4257,7 +4262,7 @@ void scheduler_tick(void)
 	perf_event_task_tick();
 
 #ifdef CONFIG_SMP
-	rq->idle_at_tick = idle_cpu(cpu);
+	rq->idle_balance = idle_cpu(cpu);
 	trigger_load_balance(rq, cpu);
 #endif
 }
@@ -8303,7 +8308,6 @@ void __init sched_init(void)
 		rq_attach_root(rq, &def_root_domain);
 #ifdef CONFIG_NO_HZ
 		rq->nohz_balance_kick = 0;
-		init_sched_softirq_csd(&per_cpu(remote_sched_softirq_cb, i));
 #endif
 #endif
 		init_rq_hrtick(rq);
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index fef0bfd..bcefcb8 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -4269,22 +4269,6 @@ out_unlock:
 }
 
 #ifdef CONFIG_NO_HZ
-
-static DEFINE_PER_CPU(struct call_single_data, remote_sched_softirq_cb);
-
-static void trigger_sched_softirq(void *data)
-{
-	raise_softirq_irqoff(SCHED_SOFTIRQ);
-}
-
-static inline void init_sched_softirq_csd(struct call_single_data *csd)
-{
-	csd->func = trigger_sched_softirq;
-	csd->info = NULL;
-	csd->flags = 0;
-	csd->priv = 0;
-}
-
 /*
  * idle load balancing details
  * - One of the idle CPUs nominates itself as idle load_balancer, while
@@ -4450,11 +4434,8 @@ static void nohz_balancer_kick(int cpu)
 	}
 
 	if (!cpu_rq(ilb_cpu)->nohz_balance_kick) {
-		struct call_single_data *cp;
-
 		cpu_rq(ilb_cpu)->nohz_balance_kick = 1;
-		cp = &per_cpu(remote_sched_softirq_cb, cpu);
-		__smp_call_function_single(ilb_cpu, cp, 0);
+		kick_process(idle_task(ilb_cpu));
 	}
 	return;
 }
@@ -4687,7 +4668,7 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
 	if (time_before(now, nohz.next_balance))
 		return 0;
 
-	if (rq->idle_at_tick)
+	if (idle_cpu(cpu))
 		return 0;
 
 	first_pick_cpu = atomic_read(&nohz.first_pick_cpu);
@@ -4723,7 +4704,7 @@ static void run_rebalance_domains(struct softirq_action *h)
 {
 	int this_cpu = smp_processor_id();
 	struct rq *this_rq = cpu_rq(this_cpu);
-	enum cpu_idle_type idle = this_rq->idle_at_tick ?
+	enum cpu_idle_type idle = this_rq->idle_balance ?
 						CPU_IDLE : CPU_NOT_IDLE;
 
 	rebalance_domains(this_cpu, idle);




* Re: [PATCH v1] sched: fix nohz idle load balancer issues
From: Srivatsa Vaddagiri @ 2011-09-29  2:13 UTC
  To: Suresh Siddha
  Cc: Venki Pallipadi, Peter Zijlstra, Paul Turner, Ingo Molnar,
	Vaidyanathan Srinivasan, Kamalesh Babulal, linux-kernel

* Suresh Siddha <suresh.b.siddha@intel.com> [2011-09-28 17:47:43]:

> Anyways, this appended patch is just for your quick experiment. I will
> think a bit more tomorrow morning before sending the patch with complete
> changelog. Thanks.


Ran my usual test to check idle time and here are the results:


Idle time ->            Avg.    Std.Dev   Min   Max
Without your patch      8.9%    1.9%      5%    14%
With your patch         3.5%    1.2%      2%    7%
With my addon patch     2.8%    0.5%      2%    4%

So your patch does improve things, but there is scope to improve it
further with an addon patch like the one below. I will experiment with some
more changes (like a sorted list of the idle cpus' rq->next_balance) and
send my v2 to apply on top of your patch later.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

---

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -1404,6 +1404,12 @@ void wake_up_idle_cpu(int cpu)
 		smp_send_reschedule(cpu);
 }
 
+static void reset_first_second_pick_cpu(int cpu);
+
+#else	/* CONFIG_NO_HZ */
+
+static inline void reset_first_second_pick_cpu(int cpu) { }
+
 #endif /* CONFIG_NO_HZ */
 
 static u64 sched_avg_period(void)
@@ -8359,6 +8365,7 @@ void __init sched_init(void)
 	atomic_set(&nohz.load_balancer, nr_cpu_ids);
 	atomic_set(&nohz.first_pick_cpu, nr_cpu_ids);
 	atomic_set(&nohz.second_pick_cpu, nr_cpu_ids);
+	spin_lock_init(&nohz.next_balance_lock);
 #endif
 	/* May be allocated at isolcpus cmdline parse time */
 	if (cpu_isolated_map == NULL)
Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -4286,6 +4286,7 @@ static struct {
 	cpumask_var_t idle_cpus_mask;
 	cpumask_var_t grp_idle_mask;
 	unsigned long next_balance;     /* in jiffy units */
+	spinlock_t next_balance_lock;   /* Serialize update to 'next_balance' */
 } nohz ____cacheline_aligned;
 
 int get_nohz_load_balancer(void)
@@ -4427,8 +4428,12 @@ static void nohz_balancer_kick(int cpu)
 
 	ilb_cpu = get_nohz_load_balancer();
 
-	if (ilb_cpu >= nr_cpu_ids) {
-		ilb_cpu = cpumask_first(nohz.idle_cpus_mask);
+	/*
+	 * ilb_cpu itself can be attempting to kick another idle cpu. Pick
+	 * another idle cpu in that case.
+	 */
+	if (ilb_cpu >= nr_cpu_ids || ilb_cpu == cpu) {
+		ilb_cpu = cpumask_any_but(nohz.idle_cpus_mask, cpu);
 		if (ilb_cpu >= nr_cpu_ids)
 			return;
 	}
@@ -4440,6 +4445,18 @@ static void nohz_balancer_kick(int cpu)
 	return;
 }
 
+/* Update nohz.next_balance with a new minimum value */
+static inline void set_nohz_next_balance(unsigned long next_balance)
+{
+	if (time_after(next_balance, nohz.next_balance))
+		return;
+
+	spin_lock(&nohz.next_balance_lock);
+	if (time_before(next_balance, nohz.next_balance))
+		nohz.next_balance = next_balance;
+	spin_unlock(&nohz.next_balance_lock);
+}
+
 /*
  * This routine will try to nominate the ilb (idle load balancing)
  * owner among the cpus whose ticks are stopped. ilb owner will do the idle
@@ -4475,11 +4492,6 @@ void select_nohz_load_balancer(int stop_
 
 		cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
 
-		if (atomic_read(&nohz.first_pick_cpu) == cpu)
-			atomic_cmpxchg(&nohz.first_pick_cpu, cpu, nr_cpu_ids);
-		if (atomic_read(&nohz.second_pick_cpu) == cpu)
-			atomic_cmpxchg(&nohz.second_pick_cpu, cpu, nr_cpu_ids);
-
 		if (atomic_read(&nohz.load_balancer) >= nr_cpu_ids) {
 			int new_ilb;
 
@@ -4498,8 +4510,10 @@ void select_nohz_load_balancer(int stop_
 				resched_cpu(new_ilb);
 				return;
 			}
+			set_nohz_next_balance(cpu_rq(cpu)->next_balance);
 			return;
 		}
+		set_nohz_next_balance(cpu_rq(cpu)->next_balance);
 	} else {
 		if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
 			return;
@@ -4615,9 +4629,16 @@ static void nohz_idle_balance(int this_c
 	struct rq *rq;
 	int balance_cpu;
 
-	if (idle != CPU_IDLE || !this_rq->nohz_balance_kick)
+	if (!this_rq->nohz_balance_kick)
 		return;
 
+	/* Wakeup another idle cpu to do idle load balance if we got busy */
+	if (!idle_cpu(this_cpu)) {
+		nohz_balancer_kick(this_cpu);
+		this_rq->nohz_balance_kick = 0;
+		return;
+	}
+
 	for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
 		if (balance_cpu == this_cpu)
 			continue;
@@ -4628,7 +4649,7 @@ static void nohz_idle_balance(int this_c
 		 * balancing owner will pick it up.
 		 */
 		if (need_resched()) {
-			this_rq->nohz_balance_kick = 0;
+			nohz_balancer_kick(this_cpu);
 			break;
 		}
 
@@ -4643,7 +4664,7 @@ static void nohz_idle_balance(int this_c
 		if (time_after(this_rq->next_balance, rq->next_balance))
 			this_rq->next_balance = rq->next_balance;
 	}
-	nohz.next_balance = this_rq->next_balance;
+	set_nohz_next_balance(this_rq->next_balance);
 	this_rq->nohz_balance_kick = 0;
 }
 
@@ -4692,9 +4713,24 @@ static inline int nohz_kick_needed(struc
 	}
 	return 0;
 }
-#else
+
+/*
+ * Reset first_pick_cpu or second_pick_cpu identifier in case
+ * corresponding cpu is going idle.
+ */
+static void reset_first_second_pick_cpu(int cpu)
+{
+	if (atomic_read(&nohz.first_pick_cpu) == cpu)
+		atomic_cmpxchg(&nohz.first_pick_cpu, cpu, nr_cpu_ids);
+	if (atomic_read(&nohz.second_pick_cpu) == cpu)
+		atomic_cmpxchg(&nohz.second_pick_cpu, cpu, nr_cpu_ids);
+}
+
+#else	/* CONFIG_NO_HZ */
+
 static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) { }
-#endif
+
+#endif	/* CONFIG_NO_HZ */
 
 /*
  * run_rebalance_domains is triggered when needed from the scheduler tick.
@@ -4732,7 +4768,7 @@ static inline void trigger_load_balance(
 	    likely(!on_null_domain(cpu)))
 		raise_softirq(SCHED_SOFTIRQ);
 #ifdef CONFIG_NO_HZ
-	else if (nohz_kick_needed(rq, cpu) && likely(!on_null_domain(cpu)))
+	if (nohz_kick_needed(rq, cpu) && likely(!on_null_domain(cpu)))
 		nohz_balancer_kick(cpu);
 #endif
 }
Index: current/kernel/sched_idletask.c
===================================================================
--- current.orig/kernel/sched_idletask.c
+++ current/kernel/sched_idletask.c
@@ -24,6 +24,8 @@ static struct task_struct *pick_next_tas
 {
 	schedstat_inc(rq, sched_goidle);
 	calc_load_account_idle(rq);
+	reset_first_second_pick_cpu(cpu_of(rq));
+
 	return rq->idle;
 }
 







* Re: [PATCH v1] sched: fix nohz idle load balancer issues
From: Peter Zijlstra @ 2011-09-29  8:15 UTC
  To: Suresh Siddha
  Cc: Srivatsa Vaddagiri, Venki Pallipadi, Paul Turner, Ingo Molnar,
	Vaidyanathan Srinivasan, Kamalesh Babulal, linux-kernel

On Wed, 2011-09-28 at 17:47 -0700, Suresh Siddha wrote:
> So ended up with using kick_process() and the scheduler_ipi() context to
> trigger the SCHED_SOFTIRQ instead of using smp call function vector
> sequence (which has a deadlock scenario in the context of heavy
> interrupts which I can explain in detail when I send the complete
> changelog). And also I am explicitly requesting for idle balance to
> address the stale idle_at_tick condition.

I'd be interested in hearing more about that deadlock, because when
allocating your own csd and not waiting for the result
__smp_call_function_single() should be deadlock free.

> @@ -2733,6 +2733,11 @@ void scheduler_ipi(void)
>         struct rq *rq = this_rq();
>         struct task_struct *list = xchg(&rq->wake_list, NULL);
>  
> +       if (unlikely((rq->idle == current) && rq->nohz_balance_kick)) {
> +               rq->idle_balance = 1;
> +               raise_softirq_irqoff(SCHED_SOFTIRQ);
> +       } 

This can end up being outside of irq_enter()/irq_exit(), which is
probably not what you want. See the somewhat large comment right below
here.


* Re: [PATCH v1] sched: fix nohz idle load balancer issues
From: Suresh Siddha @ 2011-09-29 17:13 UTC
  To: Peter Zijlstra
  Cc: Srivatsa Vaddagiri, Venki Pallipadi, Paul Turner, Ingo Molnar,
	Vaidyanathan Srinivasan, Kamalesh Babulal, linux-kernel,
	Prarit Bhargava

On Thu, 2011-09-29 at 01:15 -0700, Peter Zijlstra wrote:
> I'd be interested in hearing more about that deadlock, because when
> allocating your own csd and not waiting for the result
> __smp_call_function_single() should be deadlock free.

Prarit Bhargava last week raised this issue with me. This is the
deadlock condition:

1. cpu-A did a generic_exec_single() to cpu-B and it took a timer
interrupt before sending the actual IPI.

        raw_spin_lock_irqsave(&dst->lock, flags);
        ipi = list_empty(&dst->list);
        list_add_tail(&data->list, &dst->list);
        raw_spin_unlock_irqrestore(&dst->lock, flags);

        /*
         * The list addition should be visible before sending the IPI
         * handler locks the list to pull the entry off it because of
         * normal cache coherency rules implied by spinlocks.
         *
         * If IPIs can go out of order to the cache coherency protocol
         * in an architecture, sufficient synchronisation should be added
         * to arch code to make it appear to obey cache coherency WRT
         * locking and barrier primitives. Generic code isn't really
         * equipped to do the right thing...
         */
====> got an interrupt here.
        if (ipi)
                arch_send_call_function_single_ipi(cpu);

2. As part of the timer interrupt handler, cpu-A decided to kick cpu-B
for the idle load balancing (sets cpu-B's rq->nohz_balance_kick to 1)
and __smp_call_function_single() with nowait will queue the csd to the
cpu-B's queue. But the generic_exec_single() won't send an IPI to cpu-B
as the queue was not empty (from the code snippet shown above in
generic_exec_single()).

3. cpu-A is busy with lot of interrupts

4. Meanwhile cpu-B is entering and exiting idle and notices that it has
its rq->nohz_balance_kick set to '1'. So it will go ahead and do the
idle load balancing and clear its rq->nohz_balance_kick.

5. At this point, the csd queued as part of step-2 above is still locked
and waiting to be serviced on cpu-B.

6. cpu-A is still busy with interrupt load and now it got another timer
interrupt and as part of it decided to kick cpu-B for another idle load
balancing (as it finds cpu-B's rq->nohz_balance_kick cleared in step-4
above) and does __smp_call_function_single() with the same csd that is
still locked.

7. And we get a deadlock waiting for the csd_lock() in the
__smp_call_function_single().

The main issue here is that cpu-B can service the idle load balancer kick
request from cpu-A even without receiving the IPI, and this leads to
doing multiple __smp_call_function_single() calls on the same csd, leading
to deadlock.

We can certainly fix this by making sure that cpu-B will service the
idle load balancer kick request only after receiving the IPI. But as we
have the resched-IPI already, I thought it is cleaner to use that
instead of using the generic smp call function IPI to do this kick.

> > @@ -2733,6 +2733,11 @@ void scheduler_ipi(void)
> >         struct rq *rq = this_rq();
> >         struct task_struct *list = xchg(&rq->wake_list, NULL);
> >  
> > +       if (unlikely((rq->idle == current) && rq->nohz_balance_kick)) {
> > +               rq->idle_balance = 1;
> > +               raise_softirq_irqoff(SCHED_SOFTIRQ);
> > +       } 
> 
> This can end up being outside of irq_enter()/irq_exit(), which is
> probably not what you want. See the somewhat large comment right below
> here.

Yep. I had that in mind already and hence mentioned to Vatsa that I
was not 100% sure if I was doing the right thing in the scheduler_ipi().
I will fix it and post the patch along with the complete changelog.

thanks,
suresh

