* [patch 0/7] sched: change nohz idle load balancing logic to push model
@ 2010-05-17 18:27 Suresh Siddha
  2010-05-17 18:27 ` [patch 1/7] softirq: Add a no local fallback option to send_remote_softirq Suresh Siddha
                   ` (9 more replies)
  0 siblings, 10 replies; 28+ messages in thread
From: Suresh Siddha @ 2010-05-17 18:27 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Arjan van de Ven
  Cc: Venkatesh Pallipadi, Vaidyanathan Srinivasan, ego, LKML,
	Dominik Brodowski, Nigel Cunningham

This is an updated version of the patchset posted earlier at
http://lkml.org/lkml/2009/12/10/470

Description:
The existing nohz idle load balance logic uses a pull model: one idle
load balancer CPU is nominated on any partially idle system, and that
balancer CPU does not go into nohz mode. Using its periodic tick, the
balancer does the idle balancing on behalf of all the CPUs in nohz mode.

This is not optimal and has a few issues:
* the balancer continues to take periodic ticks and wake up frequently
  (at HZ rate), even though it may not have any rebalancing to do on
  behalf of any of the idle CPUs.
* On x86, with CPUs whose APIC timer stops in idle, this periodic wakeup
  can result in an additional interrupt on the CPU doing the timer
  broadcast.

The alternative is to have a push model, where all idle CPUs can enter nohz
mode and any busy CPU kicks one of the idle CPUs to take care of idle
balancing on behalf of a group of idle CPUs.
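
For illustration, a rough sketch of the busy-CPU side of the kick (a
simplified version of nohz_balancer_kick() from patch 4/7; it assumes the
per-cpu csd has already been set up with init_remote_softirq_csd(...,
SCHED_SOFTIRQ) as done in patch 2/7):

	static DEFINE_PER_CPU(struct call_single_data, remote_sched_softirq_cb);

	/* called from the scheduler tick on a busy CPU */
	static void kick_idle_balancer(int cpu)
	{
		int ilb_cpu = cpumask_first(nohz.idle_cpus_mask);

		if (ilb_cpu >= nr_cpu_ids)
			return;		/* no idle CPU to kick */

		/* fallback == 0: run SCHED_SOFTIRQ on ilb_cpu or not at all */
		send_remote_softirq(&per_cpu(remote_sched_softirq_cb, cpu),
				    ilb_cpu, 0);
	}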

The following patches switch the idle load balancer to this push approach.

Updates from the previous version:

* A busy CPU uses send_remote_softirq() to invoke SCHED_SOFTIRQ on the
  idle load balancing cpu, which then does the load balancing on behalf
  of all the idle CPUs.

* Dropped the per-NUMA-node nohz load balancing, as it doesn't detect
  certain imbalance scenarios. This will be addressed later.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Venkatesh Pallipadi <venki@google.com>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 1/7] softirq: Add a no local fallback option to send_remote_softirq
  2010-05-17 18:27 [patch 0/7] sched: change nohz idle load balancing logic to push model Suresh Siddha
@ 2010-05-17 18:27 ` Suresh Siddha
  2010-05-17 18:27 ` [patch 2/7] softirq: add init_remote_softirq_csd() Suresh Siddha
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Suresh Siddha @ 2010-05-17 18:27 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Arjan van de Ven
  Cc: Venkatesh Pallipadi, Vaidyanathan Srinivasan, ego, LKML,
	Dominik Brodowski, Nigel Cunningham, Suresh Siddha

[-- Attachment #1: no_local_fallback_option_to_send_remote_sofitq.patch --]
[-- Type: text/plain, Size: 3085 bytes --]

From: Venkatesh Pallipadi <venki@google.com>
Subject: softirq: Add a no local fallback option to send_remote_softirq
send_remote_softirq() and __send_remote_softirq() fall back to triggering
the softirq on the local CPU when __try_remote_softirq() fails. In certain
circumstances (like the SCHED_SOFTIRQ usage in a following patch), the
softirq should run on the target CPU or not at all. This new option helps
such callers.

As there appear to be no in-tree users of send_remote_softirq() today,
this option is added to the current API instead of introducing a new
API.
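
For example, a caller that must not fall back to the local cpu would do
something like this (illustrative only, not part of this patch):

	/*
	 * Ask 'cpu' to run SCHED_SOFTIRQ; do nothing if the remote send
	 * is not possible (no local fallback).
	 */
	static void kick_cpu(struct call_single_data *cp, int cpu)
	{
		send_remote_softirq(cp, cpu, SCHED_SOFTIRQ, 0);
	}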

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 include/linux/interrupt.h |    5 +++--
 kernel/softirq.c          |   12 ++++++++----
 2 files changed, 11 insertions(+), 6 deletions(-)

Index: tip/include/linux/interrupt.h
===================================================================
--- tip.orig/include/linux/interrupt.h
+++ tip/include/linux/interrupt.h
@@ -419,13 +419,14 @@ DECLARE_PER_CPU(struct list_head [NR_SOF
 /* Try to send a softirq to a remote cpu.  If this cannot be done, the
  * work will be queued to the local cpu.
  */
-extern void send_remote_softirq(struct call_single_data *cp, int cpu, int softirq);
+extern void send_remote_softirq(struct call_single_data *cp, int cpu,
+				int softirq, int fallback);
 
 /* Like send_remote_softirq(), but the caller must disable local cpu interrupts
  * and compute the current cpu, passed in as 'this_cpu'.
  */
 extern void __send_remote_softirq(struct call_single_data *cp, int cpu,
-				  int this_cpu, int softirq);
+				  int this_cpu, int softirq, int fallback);
 
 /* Tasklets --- multithreaded analogue of BHs.
 
Index: tip/kernel/softirq.c
===================================================================
--- tip.orig/kernel/softirq.c
+++ tip/kernel/softirq.c
@@ -610,9 +610,12 @@ static int __try_remote_softirq(struct c
  *
  * Interrupts must be disabled.
  */
-void __send_remote_softirq(struct call_single_data *cp, int cpu, int this_cpu, int softirq)
+void __send_remote_softirq(struct call_single_data *cp, int cpu, int this_cpu,
+			   int softirq, int fallback)
 {
-	if (cpu == this_cpu || __try_remote_softirq(cp, cpu, softirq))
+	if (cpu == this_cpu)
+		__local_trigger(cp, softirq);
+	else if (__try_remote_softirq(cp, cpu, softirq) && fallback)
 		__local_trigger(cp, softirq);
 }
 EXPORT_SYMBOL(__send_remote_softirq);
@@ -626,14 +629,15 @@ EXPORT_SYMBOL(__send_remote_softirq);
  * Like __send_remote_softirq except that disabling interrupts and
  * computing the current cpu is done for the caller.
  */
-void send_remote_softirq(struct call_single_data *cp, int cpu, int softirq)
+void send_remote_softirq(struct call_single_data *cp, int cpu, int softirq,
+			 int fallback)
 {
 	unsigned long flags;
 	int this_cpu;
 
 	local_irq_save(flags);
 	this_cpu = smp_processor_id();
-	__send_remote_softirq(cp, cpu, this_cpu, softirq);
+	__send_remote_softirq(cp, cpu, this_cpu, softirq, fallback);
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL(send_remote_softirq);



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 2/7] softirq: add init_remote_softirq_csd()
  2010-05-17 18:27 [patch 0/7] sched: change nohz idle load balancing logic to push model Suresh Siddha
  2010-05-17 18:27 ` [patch 1/7] softirq: Add a no local fallback option to send_remote_softirq Suresh Siddha
@ 2010-05-17 18:27 ` Suresh Siddha
  2010-05-17 18:27 ` [patch 3/7] softirq: avoid softirq_work_list for SCHED_SOFTIRQ when sent remotely Suresh Siddha
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Suresh Siddha @ 2010-05-17 18:27 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Arjan van de Ven
  Cc: Venkatesh Pallipadi, Vaidyanathan Srinivasan, ego, LKML,
	Dominik Brodowski, Nigel Cunningham, Suresh Siddha,
	David S. Miller, Jens Axboe

[-- Attachment #1: add_init_remote_softirq_csd.patch --]
[-- Type: text/plain, Size: 4462 bytes --]

Currently the send_remote_softirq() APIs don't allow a single
call_single_data struct (like a per-cpu call_single_data) to be used with
multiple send_remote_softirq() calls in parallel, as each
send_remote_softirq() call reinitializes the call_single_data, overwriting
the in-flight usage.

Add a new init_remote_softirq_csd() which initializes the call_single_data.
A second send_remote_softirq() from the same cpu will now wait on csd_lock()
if the prior send_remote_softirq() has not completed.
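
Illustrative usage (a sketch, not part of this patch; sched_csd_init() and
kick() below are made-up names):

	static DEFINE_PER_CPU(struct call_single_data, sched_csd);

	/* call once at boot */
	static void __init sched_csd_init(void)
	{
		int i;

		for_each_possible_cpu(i)
			init_remote_softirq_csd(&per_cpu(sched_csd, i),
						SCHED_SOFTIRQ);
	}

	/* fast path, preemption disabled: the csd is not re-initialized;
	 * a second send from this cpu waits in csd_lock() until the
	 * first one completes */
	static void kick(int target_cpu)
	{
		send_remote_softirq(&__get_cpu_var(sched_csd), target_cpu, 0);
	}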

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jens Axboe <jens.axboe@oracle.com>
---
 include/linux/interrupt.h |    8 ++++++--
 kernel/softirq.c          |   41 +++++++++++++++++++++++++++--------------
 2 files changed, 33 insertions(+), 16 deletions(-)

Index: tip/kernel/softirq.c
===================================================================
--- tip.orig/kernel/softirq.c
+++ tip/kernel/softirq.c
@@ -578,24 +578,23 @@ static void remote_softirq_receive(void 
 	local_irq_restore(flags);
 }
 
-static int __try_remote_softirq(struct call_single_data *cp, int cpu, int softirq)
+static int __try_remote_softirq(struct call_single_data *cp, int cpu)
 {
 	if (cpu_online(cpu)) {
-		cp->func = remote_softirq_receive;
-		cp->info = cp;
-		cp->flags = 0;
-		cp->priv = softirq;
-
 		__smp_call_function_single(cpu, cp, 0);
 		return 0;
 	}
 	return 1;
 }
 #else /* CONFIG_USE_GENERIC_SMP_HELPERS */
-static int __try_remote_softirq(struct call_single_data *cp, int cpu, int softirq)
+static int __try_remote_softirq(struct call_single_data *cp, int cpu)
 {
 	return 1;
 }
+static void remote_softirq_receive(void *data)
+{
+	return;
+}
 #endif
 
 /**
@@ -611,12 +610,12 @@ static int __try_remote_softirq(struct c
  * Interrupts must be disabled.
  */
 void __send_remote_softirq(struct call_single_data *cp, int cpu, int this_cpu,
-			   int softirq, int fallback)
+			   int fallback)
 {
 	if (cpu == this_cpu)
-		__local_trigger(cp, softirq);
-	else if (__try_remote_softirq(cp, cpu, softirq) && fallback)
-		__local_trigger(cp, softirq);
+		__local_trigger(cp, cp->priv);
+	else if (__try_remote_softirq(cp, cpu) && fallback)
+		__local_trigger(cp, cp->priv);
 }
 EXPORT_SYMBOL(__send_remote_softirq);
 
@@ -629,19 +628,33 @@ EXPORT_SYMBOL(__send_remote_softirq);
  * Like __send_remote_softirq except that disabling interrupts and
  * computing the current cpu is done for the caller.
  */
-void send_remote_softirq(struct call_single_data *cp, int cpu, int softirq,
-			 int fallback)
+void send_remote_softirq(struct call_single_data *cp, int cpu, int fallback)
 {
 	unsigned long flags;
 	int this_cpu;
 
 	local_irq_save(flags);
 	this_cpu = smp_processor_id();
-	__send_remote_softirq(cp, cpu, this_cpu, softirq, fallback);
+	__send_remote_softirq(cp, cpu, this_cpu, fallback);
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL(send_remote_softirq);
 
+/***
+ * init_remote_softirq_csd - initialize the smp_call_function_single()'s
+ * call single data that will be later used by send_remote_softirq()
+ * @cp: private SMP call function data area to be initialized
+ * @softirq: the softirq to be used for the work
+ */
+void init_remote_softirq_csd(struct call_single_data *cp, int softirq)
+{
+	cp->func = remote_softirq_receive;
+	cp->info = cp;
+	cp->flags = 0;
+	cp->priv = softirq;
+}
+EXPORT_SYMBOL(init_remote_softirq_csd);
+
 static int __cpuinit remote_softirq_cpu_notify(struct notifier_block *self,
 					       unsigned long action, void *hcpu)
 {
Index: tip/include/linux/interrupt.h
===================================================================
--- tip.orig/include/linux/interrupt.h
+++ tip/include/linux/interrupt.h
@@ -420,13 +420,17 @@ DECLARE_PER_CPU(struct list_head [NR_SOF
  * work will be queued to the local cpu.
  */
 extern void send_remote_softirq(struct call_single_data *cp, int cpu,
-				int softirq, int fallback);
+				int fallback);
 
 /* Like send_remote_softirq(), but the caller must disable local cpu interrupts
  * and compute the current cpu, passed in as 'this_cpu'.
  */
 extern void __send_remote_softirq(struct call_single_data *cp, int cpu,
-				  int this_cpu, int softirq, int fallback);
+				  int this_cpu, int fallback);
+
+/* Initialize the call_single_data to be later used with send_remote_softirq().
+ */
+extern void init_remote_softirq_csd(struct call_single_data *cp, int softirq);
 
 /* Tasklets --- multithreaded analogue of BHs.
 



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 3/7] softirq: avoid softirq_work_list for SCHED_SOFTIRQ when sent remotely
  2010-05-17 18:27 [patch 0/7] sched: change nohz idle load balancing logic to push model Suresh Siddha
  2010-05-17 18:27 ` [patch 1/7] softirq: Add a no local fallback option to send_remote_softirq Suresh Siddha
  2010-05-17 18:27 ` [patch 2/7] softirq: add init_remote_softirq_csd() Suresh Siddha
@ 2010-05-17 18:27 ` Suresh Siddha
  2010-05-20  8:12   ` Peter Zijlstra
  2010-05-17 18:27 ` [patch 4/7] sched: Change nohz ilb logic from pull to push model Suresh Siddha
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Suresh Siddha @ 2010-05-17 18:27 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Arjan van de Ven
  Cc: Venkatesh Pallipadi, Vaidyanathan Srinivasan, ego, LKML,
	Dominik Brodowski, Nigel Cunningham, Suresh Siddha,
	David S. Miller, Jens Axboe

[-- Attachment #1: fix_local_trigger_remote_softirq_for_sched.patch --]
[-- Type: text/plain, Size: 1141 bytes --]

There is no need to add the SCHED_SOFTIRQ work to the softirq_work_list
when it is sent remotely, because any pending work associated with
SCHED_SOFTIRQ need not be migrated to a new cpu when the current cpu
goes down etc.

Also, I am not sure how this softirq_work_list works for other softirqs:
I don't see anyone removing the list entries from the softirq_work_list
after the work is handled on a remote cpu.

For now, just skip it for SCHED_SOFTIRQ.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jens Axboe <jens.axboe@oracle.com>
---
 kernel/softirq.c |    5 +++++
 1 file changed, 5 insertions(+)

Index: tip/kernel/softirq.c
===================================================================
--- tip.orig/kernel/softirq.c
+++ tip/kernel/softirq.c
@@ -557,6 +557,11 @@ static void __local_trigger(struct call_
 {
 	struct list_head *head = &__get_cpu_var(softirq_work_list[softirq]);
 
+	if (softirq == SCHED_SOFTIRQ) {
+		raise_softirq_irqoff(softirq);
+		return;
+	}
+
 	list_add_tail(&cp->list, head);
 
 	/* Trigger the softirq only if the list was previously empty.  */



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 4/7] sched: Change nohz ilb logic from pull to push model
  2010-05-17 18:27 [patch 0/7] sched: change nohz idle load balancing logic to push model Suresh Siddha
                   ` (2 preceding siblings ...)
  2010-05-17 18:27 ` [patch 3/7] softirq: avoid softirq_work_list for SCHED_SOFTIRQ when sent remotely Suresh Siddha
@ 2010-05-17 18:27 ` Suresh Siddha
  2010-06-01 23:47   ` Vaidyanathan Srinivasan
  2010-05-17 18:27 ` [patch 5/7] sched: Change select_nohz_load_balancer to return void Suresh Siddha
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Suresh Siddha @ 2010-05-17 18:27 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Arjan van de Ven
  Cc: Venkatesh Pallipadi, Vaidyanathan Srinivasan, ego, LKML,
	Dominik Brodowski, Nigel Cunningham, Suresh Siddha

[-- Attachment #1: sched-nohz-ilb-logic-from-pull-to-push.patch --]
[-- Type: text/plain, Size: 14962 bytes --]

From: Venkatesh Pallipadi <venki@google.com>
Subject: sched: Change nohz ilb logic from pull to push model

In the new push model, all idle CPUs go into nohz mode. There is
still the concept of an idle load balancer. A busy CPU kicks the nohz
balancer when any of the nohz CPUs need idle load balancing.
The kicked CPU does the idle load balancing on behalf of all idle CPUs
instead of only its normal idle balance.

This addresses the following two problems with the current nohz ilb logic:
* the balancer continues to take periodic ticks and wake up
  frequently, even though it may not have any rebalancing to do on
  behalf of any of the idle CPUs.
* On x86, with CPUs whose APIC timer stops in idle, this
  periodic wakeup can result in an additional interrupt on the CPU
  doing the timer broadcast.

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 kernel/sched.c      |   13 +-
 kernel/sched_fair.c |  283 ++++++++++++++++++++++++++++++----------------------
 2 files changed, 177 insertions(+), 119 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -3081,10 +3081,26 @@ out_unlock:
 }
 
 #ifdef CONFIG_NO_HZ
+
+static DEFINE_PER_CPU(struct call_single_data, remote_sched_softirq_cb);
+
+/*
+ * idle load balancing details
+ * - One of the idle CPUs nominates itself as idle load_balancer, while
+ *   entering idle.
+ * - This idle load balancer CPU will also go into tickless mode when
+ *   it is idle, just like all other idle CPUs
+ * - When one of the busy CPUs notice that there may be an idle rebalancing
+ *   needed, they will kick the idle load balancer, which then does idle
+ *   load balancing for all the idle CPUs.
+ */
 static struct {
 	atomic_t load_balancer;
-	cpumask_var_t cpu_mask;
-	cpumask_var_t ilb_grp_nohz_mask;
+	atomic_t first_pick_cpu;
+	atomic_t second_pick_cpu;
+	cpumask_var_t idle_cpus_mask;
+	cpumask_var_t grp_idle_mask;
+	unsigned long next_balance;     /* in jiffy units */
 } nohz ____cacheline_aligned = {
 	.load_balancer = ATOMIC_INIT(-1),
 };
@@ -3141,17 +3157,17 @@ static inline struct sched_domain *lowes
  */
 static inline int is_semi_idle_group(struct sched_group *ilb_group)
 {
-	cpumask_and(nohz.ilb_grp_nohz_mask, nohz.cpu_mask,
+	cpumask_and(nohz.grp_idle_mask, nohz.idle_cpus_mask,
 					sched_group_cpus(ilb_group));
 
 	/*
 	 * A sched_group is semi-idle when it has atleast one busy cpu
 	 * and atleast one idle cpu.
 	 */
-	if (cpumask_empty(nohz.ilb_grp_nohz_mask))
+	if (cpumask_empty(nohz.grp_idle_mask))
 		return 0;
 
-	if (cpumask_equal(nohz.ilb_grp_nohz_mask, sched_group_cpus(ilb_group)))
+	if (cpumask_equal(nohz.grp_idle_mask, sched_group_cpus(ilb_group)))
 		return 0;
 
 	return 1;
@@ -3184,7 +3200,7 @@ static int find_new_ilb(int cpu)
 	 * Optimize for the case when we have no idle CPUs or only one
 	 * idle CPU. Don't walk the sched_domain hierarchy in such cases
 	 */
-	if (cpumask_weight(nohz.cpu_mask) < 2)
+	if (cpumask_weight(nohz.idle_cpus_mask) < 2)
 		goto out_done;
 
 	for_each_flag_domain(cpu, sd, SD_POWERSAVINGS_BALANCE) {
@@ -3192,7 +3208,7 @@ static int find_new_ilb(int cpu)
 
 		do {
 			if (is_semi_idle_group(ilb_group))
-				return cpumask_first(nohz.ilb_grp_nohz_mask);
+				return cpumask_first(nohz.grp_idle_mask);
 
 			ilb_group = ilb_group->next;
 
@@ -3200,42 +3216,61 @@ static int find_new_ilb(int cpu)
 	}
 
 out_done:
-	return cpumask_first(nohz.cpu_mask);
+	return nr_cpu_ids;
 }
 #else /*  (CONFIG_SCHED_MC || CONFIG_SCHED_SMT) */
 static inline int find_new_ilb(int call_cpu)
 {
-	return cpumask_first(nohz.cpu_mask);
+	return nr_cpu_ids;
 }
 #endif
 
 /*
+ * Kick a CPU to do the nohz balancing, if it is time for it. We pick the
+ * nohz_load_balancer CPU (if there is one) otherwise fallback to any idle
+ * CPU (if there is one).
+ */
+static void nohz_balancer_kick(int cpu)
+{
+	int ilb_cpu;
+
+	nohz.next_balance++;
+
+	ilb_cpu = get_nohz_load_balancer();
+	if (ilb_cpu < 0) {
+		ilb_cpu = cpumask_first(nohz.idle_cpus_mask);
+		if (ilb_cpu >= nr_cpu_ids)
+			return;
+	}
+
+	if (!cpu_rq(ilb_cpu)->nohz_balance_kick) {
+		struct call_single_data *cp;
+
+		cpu_rq(ilb_cpu)->nohz_balance_kick = 1;
+		cp = &per_cpu(remote_sched_softirq_cb, cpu);
+		send_remote_softirq(cp, ilb_cpu, 0);
+	}
+	return;
+}
+
+/*
  * This routine will try to nominate the ilb (idle load balancing)
  * owner among the cpus whose ticks are stopped. ilb owner will do the idle
- * load balancing on behalf of all those cpus. If all the cpus in the system
- * go into this tickless mode, then there will be no ilb owner (as there is
- * no need for one) and all the cpus will sleep till the next wakeup event
- * arrives...
- *
- * For the ilb owner, tick is not stopped. And this tick will be used
- * for idle load balancing. ilb owner will still be part of
- * nohz.cpu_mask..
+ * load balancing on behalf of all those cpus.
  *
- * While stopping the tick, this cpu will become the ilb owner if there
- * is no other owner. And will be the owner till that cpu becomes busy
- * or if all cpus in the system stop their ticks at which point
- * there is no need for ilb owner.
- *
- * When the ilb owner becomes busy, it nominates another owner, during the
- * next busy scheduler_tick()
+ * When the ilb owner becomes busy, we will not have new ilb owner until some
+ * idle CPU wakes up and goes back to idle or some busy CPU tries to kick
+ * idle load balancing by kicking one of the idle CPUs.
+ *
+ * Ticks are stopped for the ilb owner as well, with busy CPU kicking this
+ * ilb owner CPU in future (when there is a need for idle load balancing on
+ * behalf of all idle CPUs).
  */
 int select_nohz_load_balancer(int stop_tick)
 {
 	int cpu = smp_processor_id();
 
 	if (stop_tick) {
-		cpu_rq(cpu)->in_nohz_recently = 1;
-
 		if (!cpu_active(cpu)) {
 			if (atomic_read(&nohz.load_balancer) != cpu)
 				return 0;
@@ -3250,25 +3285,20 @@ int select_nohz_load_balancer(int stop_t
 			return 0;
 		}
 
-		cpumask_set_cpu(cpu, nohz.cpu_mask);
+		cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
 
-		/* time for ilb owner also to sleep */
-		if (cpumask_weight(nohz.cpu_mask) == num_active_cpus()) {
-			if (atomic_read(&nohz.load_balancer) == cpu)
-				atomic_set(&nohz.load_balancer, -1);
-			return 0;
-		}
+		if (atomic_read(&nohz.first_pick_cpu) == cpu)
+			atomic_cmpxchg(&nohz.first_pick_cpu, cpu, nr_cpu_ids);
+		if (atomic_read(&nohz.second_pick_cpu) == cpu)
+			atomic_cmpxchg(&nohz.second_pick_cpu, cpu, nr_cpu_ids);
 
 		if (atomic_read(&nohz.load_balancer) == -1) {
-			/* make me the ilb owner */
-			if (atomic_cmpxchg(&nohz.load_balancer, -1, cpu) == -1)
-				return 1;
-		} else if (atomic_read(&nohz.load_balancer) == cpu) {
 			int new_ilb;
 
-			if (!(sched_smt_power_savings ||
-						sched_mc_power_savings))
-				return 1;
+			/* make me the ilb owner */
+			if (atomic_cmpxchg(&nohz.load_balancer, -1, cpu) != -1)
+				return 0;
+
 			/*
 			 * Check to see if there is a more power-efficient
 			 * ilb.
@@ -3279,13 +3309,13 @@ int select_nohz_load_balancer(int stop_t
 				resched_cpu(new_ilb);
 				return 0;
 			}
-			return 1;
+			return 0;
 		}
 	} else {
-		if (!cpumask_test_cpu(cpu, nohz.cpu_mask))
+		if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
 			return 0;
 
-		cpumask_clear_cpu(cpu, nohz.cpu_mask);
+		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
 
 		if (atomic_read(&nohz.load_balancer) == cpu)
 			if (atomic_cmpxchg(&nohz.load_balancer, cpu, -1) != cpu)
@@ -3373,11 +3403,97 @@ out:
 		rq->next_balance = next_balance;
 }
 
+#ifdef CONFIG_NO_HZ
 /*
- * run_rebalance_domains is triggered when needed from the scheduler tick.
- * In CONFIG_NO_HZ case, the idle load balance owner will do the
+ * In CONFIG_NO_HZ case, the idle balance kickee will do the
  * rebalancing for all the cpus for whom scheduler ticks are stopped.
  */
+static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle)
+{
+	struct rq *this_rq = cpu_rq(this_cpu);
+	struct rq *rq;
+	int balance_cpu;
+
+	if (idle != CPU_IDLE || !this_rq->nohz_balance_kick)
+		return;
+
+	for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
+		if (balance_cpu == this_cpu)
+			continue;
+
+		/*
+		 * If this cpu gets work to do, stop the load balancing
+		 * work being done for other cpus. Next load
+		 * balancing owner will pick it up.
+		 */
+		if (need_resched()) {
+			this_rq->nohz_balance_kick = 0;
+			break;
+		}
+
+		rebalance_domains(balance_cpu, CPU_IDLE);
+
+		rq = cpu_rq(balance_cpu);
+		if (time_after(this_rq->next_balance, rq->next_balance))
+			this_rq->next_balance = rq->next_balance;
+	}
+	nohz.next_balance = this_rq->next_balance;
+	this_rq->nohz_balance_kick = 0;
+}
+
+/*
+ * Current heuristic for kicking the idle load balancer
+ * - first_pick_cpu is the one of the busy CPUs. It will kick
+ *   idle load balancer when it has more than one process active. This
+ *   eliminates the need for idle load balancing altogether when we have
+ *   only one running process in the system (common case).
+ * - If there are more than one busy CPU, idle load balancer may have
+ *   to run for active_load_balance to happen (i.e., two busy CPUs are
+ *   SMT or core siblings and can run better if they move to different
+ *   physical CPUs). So, second_pick_cpu is the second of the busy CPUs
+ *   which will kick idle load balancer as soon as it has any load.
+ */
+static inline int nohz_kick_needed(struct rq *rq, int cpu)
+{
+	unsigned long now = jiffies;
+	int ret;
+	int first_pick_cpu, second_pick_cpu;
+
+	if (time_before(now, nohz.next_balance))
+		return 0;
+
+	if (!rq->nr_running)
+		return 0;
+
+	first_pick_cpu = atomic_read(&nohz.first_pick_cpu);
+	second_pick_cpu = atomic_read(&nohz.second_pick_cpu);
+
+	if (first_pick_cpu < nr_cpu_ids && first_pick_cpu != cpu &&
+	    second_pick_cpu < nr_cpu_ids && second_pick_cpu != cpu)
+		return 0;
+
+	ret = atomic_cmpxchg(&nohz.first_pick_cpu, nr_cpu_ids, cpu);
+	if (ret == nr_cpu_ids || ret == cpu) {
+		atomic_cmpxchg(&nohz.second_pick_cpu, cpu, nr_cpu_ids);
+		if (rq->nr_running > 1)
+			return 1;
+	} else {
+		ret = atomic_cmpxchg(&nohz.second_pick_cpu, nr_cpu_ids, cpu);
+		if (ret == nr_cpu_ids || ret == cpu) {
+			if (rq->nr_running)
+				return 1;
+		}
+	}
+	return 0;
+}
+#else
+static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) { }
+#endif
+
+/*
+ * run_rebalance_domains is triggered when needed from the scheduler tick.
+ * Also triggered for nohz idle balancing (with nohz_balancing_kick set).
+ */
 static void run_rebalance_domains(struct softirq_action *h)
 {
 	int this_cpu = smp_processor_id();
@@ -3387,37 +3503,12 @@ static void run_rebalance_domains(struct
 
 	rebalance_domains(this_cpu, idle);
 
-#ifdef CONFIG_NO_HZ
 	/*
-	 * If this cpu is the owner for idle load balancing, then do the
+	 * If this cpu has a pending nohz_balance_kick, then do the
 	 * balancing on behalf of the other idle cpus whose ticks are
 	 * stopped.
 	 */
-	if (this_rq->idle_at_tick &&
-	    atomic_read(&nohz.load_balancer) == this_cpu) {
-		struct rq *rq;
-		int balance_cpu;
-
-		for_each_cpu(balance_cpu, nohz.cpu_mask) {
-			if (balance_cpu == this_cpu)
-				continue;
-
-			/*
-			 * If this cpu gets work to do, stop the load balancing
-			 * work being done for other cpus. Next load
-			 * balancing owner will pick it up.
-			 */
-			if (need_resched())
-				break;
-
-			rebalance_domains(balance_cpu, CPU_IDLE);
-
-			rq = cpu_rq(balance_cpu);
-			if (time_after(this_rq->next_balance, rq->next_balance))
-				this_rq->next_balance = rq->next_balance;
-		}
-	}
-#endif
+	nohz_idle_balance(this_cpu, idle);
 }
 
 static inline int on_null_domain(int cpu)
@@ -3427,57 +3518,17 @@ static inline int on_null_domain(int cpu
 
 /*
  * Trigger the SCHED_SOFTIRQ if it is time to do periodic load balancing.
- *
- * In case of CONFIG_NO_HZ, this is the place where we nominate a new
- * idle load balancing owner or decide to stop the periodic load balancing,
- * if the whole system is idle.
  */
 static inline void trigger_load_balance(struct rq *rq, int cpu)
 {
-#ifdef CONFIG_NO_HZ
-	/*
-	 * If we were in the nohz mode recently and busy at the current
-	 * scheduler tick, then check if we need to nominate new idle
-	 * load balancer.
-	 */
-	if (rq->in_nohz_recently && !rq->idle_at_tick) {
-		rq->in_nohz_recently = 0;
-
-		if (atomic_read(&nohz.load_balancer) == cpu) {
-			cpumask_clear_cpu(cpu, nohz.cpu_mask);
-			atomic_set(&nohz.load_balancer, -1);
-		}
-
-		if (atomic_read(&nohz.load_balancer) == -1) {
-			int ilb = find_new_ilb(cpu);
-
-			if (ilb < nr_cpu_ids)
-				resched_cpu(ilb);
-		}
-	}
-
-	/*
-	 * If this cpu is idle and doing idle load balancing for all the
-	 * cpus with ticks stopped, is it time for that to stop?
-	 */
-	if (rq->idle_at_tick && atomic_read(&nohz.load_balancer) == cpu &&
-	    cpumask_weight(nohz.cpu_mask) == num_online_cpus()) {
-		resched_cpu(cpu);
-		return;
-	}
-
-	/*
-	 * If this cpu is idle and the idle load balancing is done by
-	 * someone else, then no need raise the SCHED_SOFTIRQ
-	 */
-	if (rq->idle_at_tick && atomic_read(&nohz.load_balancer) != cpu &&
-	    cpumask_test_cpu(cpu, nohz.cpu_mask))
-		return;
-#endif
 	/* Don't need to rebalance while attached to NULL domain */
 	if (time_after_eq(jiffies, rq->next_balance) &&
 	    likely(!on_null_domain(cpu)))
 		raise_softirq(SCHED_SOFTIRQ);
+#ifdef CONFIG_NO_HZ
+	else if (nohz_kick_needed(rq, cpu) && likely(!on_null_domain(cpu)))
+		nohz_balancer_kick(cpu);
+#endif
 }
 
 static void rq_online_fair(struct rq *rq)
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -504,7 +504,7 @@ struct rq {
 	unsigned long cpu_load[CPU_LOAD_IDX_MAX];
 #ifdef CONFIG_NO_HZ
 	u64 nohz_stamp;
-	unsigned char in_nohz_recently;
+	unsigned char nohz_balance_kick;
 #endif
 	unsigned int skip_clock_update;
 
@@ -7605,6 +7605,11 @@ void __init sched_init(void)
 		rq->idle_stamp = 0;
 		rq->avg_idle = 2*sysctl_sched_migration_cost;
 		rq_attach_root(rq, &def_root_domain);
+#ifdef CONFIG_NO_HZ
+		rq->nohz_balance_kick = 0;
+		init_remote_softirq_csd(&per_cpu(remote_sched_softirq_cb, i),
+					SCHED_SOFTIRQ);
+#endif
 #endif
 		init_rq_hrtick(rq);
 		atomic_set(&rq->nr_iowait, 0);
@@ -7649,8 +7654,10 @@ void __init sched_init(void)
 	zalloc_cpumask_var(&nohz_cpu_mask, GFP_NOWAIT);
 #ifdef CONFIG_SMP
 #ifdef CONFIG_NO_HZ
-	zalloc_cpumask_var(&nohz.cpu_mask, GFP_NOWAIT);
-	alloc_cpumask_var(&nohz.ilb_grp_nohz_mask, GFP_NOWAIT);
+	zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
+	alloc_cpumask_var(&nohz.grp_idle_mask, GFP_NOWAIT);
+	atomic_set(&nohz.first_pick_cpu, nr_cpu_ids);
+	atomic_set(&nohz.second_pick_cpu, nr_cpu_ids);
 #endif
 	/* May be allocated at isolcpus cmdline parse time */
 	if (cpu_isolated_map == NULL)



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 5/7] sched: Change select_nohz_load_balancer to return void
  2010-05-17 18:27 [patch 0/7] sched: change nohz idle load balancing logic to push model Suresh Siddha
                   ` (3 preceding siblings ...)
  2010-05-17 18:27 ` [patch 4/7] sched: Change nohz ilb logic from pull to push model Suresh Siddha
@ 2010-05-17 18:27 ` Suresh Siddha
  2010-05-17 18:27 ` [patch 6/7] sched: change nohz.load_balancer to be nr_cpu_ids based Suresh Siddha
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Suresh Siddha @ 2010-05-17 18:27 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Arjan van de Ven
  Cc: Venkatesh Pallipadi, Vaidyanathan Srinivasan, ego, LKML,
	Dominik Brodowski, Nigel Cunningham, Suresh Siddha

[-- Attachment #1: sched_select_nohz_load_balancer_to_return_void.patch --]
[-- Type: text/plain, Size: 3722 bytes --]

From: Venkatesh Pallipadi <venki@google.com>
Subject: sched: Change select_nohz_load_balancer to return void

select_nohz_load_balancer() used to return 1 when ticks were not to be
stopped (i.e. the cpu was the idle balancer), 0 otherwise. With the change
to the nohz idle-balancer logic, all idle cpus can go into tickless mode
and be woken up by a kick from one of the busy cpus, if and when idle
balancing is required. As a result, select_nohz_load_balancer() can
return void.

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 include/linux/sched.h    |    7 ++-----
 kernel/sched_fair.c      |   16 ++++++++--------
 kernel/time/tick-sched.c |    8 +-------
 3 files changed, 11 insertions(+), 20 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -3266,14 +3266,14 @@ static void nohz_balancer_kick(int cpu)
  * ilb owner CPU in future (when there is a need for idle load balancing on
  * behalf of all idle CPUs).
  */
-int select_nohz_load_balancer(int stop_tick)
+void select_nohz_load_balancer(int stop_tick)
 {
 	int cpu = smp_processor_id();
 
 	if (stop_tick) {
 		if (!cpu_active(cpu)) {
 			if (atomic_read(&nohz.load_balancer) != cpu)
-				return 0;
+				return;
 
 			/*
 			 * If we are going offline and still the leader,
@@ -3282,7 +3282,7 @@ int select_nohz_load_balancer(int stop_t
 			if (atomic_cmpxchg(&nohz.load_balancer, cpu, -1) != cpu)
 				BUG();
 
-			return 0;
+			return;
 		}
 
 		cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
@@ -3297,7 +3297,7 @@ int select_nohz_load_balancer(int stop_t
 
 			/* make me the ilb owner */
 			if (atomic_cmpxchg(&nohz.load_balancer, -1, cpu) != -1)
-				return 0;
+				return;
 
 			/*
 			 * Check to see if there is a more power-efficient
@@ -3307,13 +3307,13 @@ int select_nohz_load_balancer(int stop_t
 			if (new_ilb < nr_cpu_ids && new_ilb != cpu) {
 				atomic_set(&nohz.load_balancer, -1);
 				resched_cpu(new_ilb);
-				return 0;
+				return;
 			}
-			return 0;
+			return;
 		}
 	} else {
 		if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
-			return 0;
+			return;
 
 		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
 
@@ -3321,7 +3321,7 @@ int select_nohz_load_balancer(int stop_t
 			if (atomic_cmpxchg(&nohz.load_balancer, cpu, -1) != cpu)
 				BUG();
 	}
-	return 0;
+	return;
 }
 #endif
 
Index: tip/kernel/time/tick-sched.c
===================================================================
--- tip.orig/kernel/time/tick-sched.c
+++ tip/kernel/time/tick-sched.c
@@ -408,13 +408,7 @@ void tick_nohz_stop_sched_tick(int inidl
 		 * the scheduler tick in nohz_restart_sched_tick.
 		 */
 		if (!ts->tick_stopped) {
-			if (select_nohz_load_balancer(1)) {
-				/*
-				 * sched tick not stopped!
-				 */
-				cpumask_clear_cpu(cpu, nohz_cpu_mask);
-				goto out;
-			}
+			select_nohz_load_balancer(1);
 
 			ts->idle_tick = hrtimer_get_expires(&ts->sched_timer);
 			ts->tick_stopped = 1;
Index: tip/include/linux/sched.h
===================================================================
--- tip.orig/include/linux/sched.h
+++ tip/include/linux/sched.h
@@ -272,14 +272,11 @@ extern void task_rq_unlock_wait(struct t
 
 extern cpumask_var_t nohz_cpu_mask;
 #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ)
-extern int select_nohz_load_balancer(int cpu);
+extern void select_nohz_load_balancer(int stop_tick);
 extern int get_nohz_load_balancer(void);
 extern int nohz_ratelimit(int cpu);
 #else
-static inline int select_nohz_load_balancer(int cpu)
-{
-	return 0;
-}
+static inline void select_nohz_load_balancer(int stop_tick) { }
 
 static inline int nohz_ratelimit(int cpu)
 {



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 6/7] sched: change nohz.load_balancer to be nr_cpu_ids based
  2010-05-17 18:27 [patch 0/7] sched: change nohz idle load balancing logic to push model Suresh Siddha
                   ` (4 preceding siblings ...)
  2010-05-17 18:27 ` [patch 5/7] sched: Change select_nohz_load_balancer to return void Suresh Siddha
@ 2010-05-17 18:27 ` Suresh Siddha
  2010-05-20  9:49   ` Peter Zijlstra
  2010-05-17 18:27 ` [patch 7/7] timers: use nearest busy cpu for migrating timers from an idle cpu Suresh Siddha
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Suresh Siddha @ 2010-05-17 18:27 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Arjan van de Ven
  Cc: Venkatesh Pallipadi, Vaidyanathan Srinivasan, ego, LKML,
	Dominik Brodowski, Nigel Cunningham, Suresh Siddha

[-- Attachment #1: sched_change_nohz_load_balancer_to_nr_cpu_ids.patch --]
[-- Type: text/plain, Size: 4310 bytes --]

From: Venkatesh Pallipadi <venki@google.com>
Subject: sched: change nohz.load_balancer to be nr_cpu_ids based

nohz.load_balancer used -1 when it was not set to any cpu. Change it to
nr_cpu_ids, making it consistent with other such variables in this file.

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 kernel/hrtimer.c    |    2 +-
 kernel/sched.c      |    1 +
 kernel/sched_fair.c |   26 ++++++++++++++++----------
 kernel/timer.c      |    2 +-
 4 files changed, 19 insertions(+), 12 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -3101,13 +3101,15 @@ static struct {
 	cpumask_var_t idle_cpus_mask;
 	cpumask_var_t grp_idle_mask;
 	unsigned long next_balance;     /* in jiffy units */
-} nohz ____cacheline_aligned = {
-	.load_balancer = ATOMIC_INIT(-1),
-};
+} nohz ____cacheline_aligned;
 
 int get_nohz_load_balancer(void)
 {
-	return atomic_read(&nohz.load_balancer);
+	int load_balancer = atomic_read(&nohz.load_balancer);
+	if (load_balancer >= nr_cpu_ids)
+		return nr_cpu_ids;
+
+	return load_balancer;
 }
 
 #if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
@@ -3237,7 +3239,8 @@ static void nohz_balancer_kick(int cpu)
 	nohz.next_balance++;
 
 	ilb_cpu = get_nohz_load_balancer();
-	if (ilb_cpu < 0) {
+
+	if (ilb_cpu >= nr_cpu_ids) {
 		ilb_cpu = cpumask_first(nohz.idle_cpus_mask);
 		if (ilb_cpu >= nr_cpu_ids)
 			return;
@@ -3279,7 +3282,8 @@ void select_nohz_load_balancer(int stop_
 			 * If we are going offline and still the leader,
 			 * give up!
 			 */
-			if (atomic_cmpxchg(&nohz.load_balancer, cpu, -1) != cpu)
+			if (atomic_cmpxchg(&nohz.load_balancer, cpu,
+					   nr_cpu_ids) != cpu)
 				BUG();
 
 			return;
@@ -3292,11 +3296,12 @@ void select_nohz_load_balancer(int stop_
 		if (atomic_read(&nohz.second_pick_cpu) == cpu)
 			atomic_cmpxchg(&nohz.second_pick_cpu, cpu, nr_cpu_ids);
 
-		if (atomic_read(&nohz.load_balancer) == -1) {
+		if (atomic_read(&nohz.load_balancer) >= nr_cpu_ids) {
 			int new_ilb;
 
 			/* make me the ilb owner */
-			if (atomic_cmpxchg(&nohz.load_balancer, -1, cpu) != -1)
+			if (atomic_cmpxchg(&nohz.load_balancer, nr_cpu_ids,
+					   cpu) != nr_cpu_ids)
 				return;
 
 			/*
@@ -3305,7 +3310,7 @@ void select_nohz_load_balancer(int stop_
 			 */
 			new_ilb = find_new_ilb(cpu);
 			if (new_ilb < nr_cpu_ids && new_ilb != cpu) {
-				atomic_set(&nohz.load_balancer, -1);
+				atomic_set(&nohz.load_balancer, nr_cpu_ids);
 				resched_cpu(new_ilb);
 				return;
 			}
@@ -3318,7 +3323,8 @@ void select_nohz_load_balancer(int stop_
 		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
 
 		if (atomic_read(&nohz.load_balancer) == cpu)
-			if (atomic_cmpxchg(&nohz.load_balancer, cpu, -1) != cpu)
+			if (atomic_cmpxchg(&nohz.load_balancer, cpu,
+					   nr_cpu_ids) != cpu)
 				BUG();
 	}
 	return;
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -7656,6 +7656,7 @@ void __init sched_init(void)
 #ifdef CONFIG_NO_HZ
 	zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
 	alloc_cpumask_var(&nohz.grp_idle_mask, GFP_NOWAIT);
+	atomic_set(&nohz.load_balancer, nr_cpu_ids);
 	atomic_set(&nohz.first_pick_cpu, nr_cpu_ids);
 	atomic_set(&nohz.second_pick_cpu, nr_cpu_ids);
 #endif
Index: tip/kernel/hrtimer.c
===================================================================
--- tip.orig/kernel/hrtimer.c
+++ tip/kernel/hrtimer.c
@@ -147,7 +147,7 @@ static int hrtimer_get_target(int this_c
 	if (!pinned && get_sysctl_timer_migration() && idle_cpu(this_cpu)) {
 		int preferred_cpu = get_nohz_load_balancer();
 
-		if (preferred_cpu >= 0)
+		if (preferred_cpu < nr_cpu_ids)
 			return preferred_cpu;
 	}
 #endif
Index: tip/kernel/timer.c
===================================================================
--- tip.orig/kernel/timer.c
+++ tip/kernel/timer.c
@@ -682,7 +682,7 @@ __mod_timer(struct timer_list *timer, un
 	if (!pinned && get_sysctl_timer_migration() && idle_cpu(cpu)) {
 		int preferred_cpu = get_nohz_load_balancer();
 
-		if (preferred_cpu >= 0)
+		if (preferred_cpu < nr_cpu_ids)
 			cpu = preferred_cpu;
 	}
 #endif



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 7/7] timers: use nearest busy cpu for migrating timers from an idle cpu
  2010-05-17 18:27 [patch 0/7] sched: change nohz idle load balancing logic to push model Suresh Siddha
                   ` (5 preceding siblings ...)
  2010-05-17 18:27 ` [patch 6/7] sched: change nohz.load_balancer to be nr_cpu_ids based Suresh Siddha
@ 2010-05-17 18:27 ` Suresh Siddha
  2010-06-01 23:37   ` Vaidyanathan Srinivasan
  2010-05-17 22:39 ` [patch 0/7] sched: change nohz idle load balancing logic to push model Nigel Cunningham
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Suresh Siddha @ 2010-05-17 18:27 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Arjan van de Ven
  Cc: Venkatesh Pallipadi, Vaidyanathan Srinivasan, ego, LKML,
	Dominik Brodowski, Nigel Cunningham, Suresh Siddha

[-- Attachment #1: use_nearest_busy_cpu_for_migrating_timers_from_idle_cpu.patch --]
[-- Type: text/plain, Size: 3428 bytes --]

Currently we migrate unpinned timers from an idle cpu to the cpu doing
idle load balancing (when all the cpus in the system are idle, there is
no idle load balancing cpu and timers get added on the same idle cpu
where the request was made, so the current optimization works only on a
semi-idle system).

And in a semi-idle system, we no longer have periodic ticks on the idle
cpu doing the idle load balancing on behalf of all the cpus. Using that
cpu will add more delay to the timers than intended (as that cpu's timer
base may not be up to date wrt jiffies etc). This was causing mysterious
slowdowns during boot etc.

For now, in the semi-idle case, use the nearest busy cpu for migrating
timers from an idle cpu. This is good for power savings anyway.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 include/linux/sched.h |    2 +-
 kernel/hrtimer.c      |    8 ++------
 kernel/sched.c        |   13 +++++++++++++
 kernel/timer.c        |    8 ++------
 4 files changed, 18 insertions(+), 13 deletions(-)

Index: tip/kernel/hrtimer.c
===================================================================
--- tip.orig/kernel/hrtimer.c
+++ tip/kernel/hrtimer.c
@@ -144,12 +144,8 @@ struct hrtimer_clock_base *lock_hrtimer_
 static int hrtimer_get_target(int this_cpu, int pinned)
 {
 #ifdef CONFIG_NO_HZ
-	if (!pinned && get_sysctl_timer_migration() && idle_cpu(this_cpu)) {
-		int preferred_cpu = get_nohz_load_balancer();
-
-		if (preferred_cpu < nr_cpu_ids)
-			return preferred_cpu;
-	}
+	if (!pinned && get_sysctl_timer_migration() && idle_cpu(this_cpu))
+		return get_nohz_timer_target();
 #endif
 	return this_cpu;
 }
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -1201,6 +1201,19 @@ static void resched_cpu(int cpu)
 }
 
 #ifdef CONFIG_NO_HZ
+int get_nohz_timer_target(void)
+{
+	int cpu = smp_processor_id();
+	int i;
+	struct sched_domain *sd;
+
+	for_each_domain(cpu, sd) {
+		for_each_cpu(i, sched_domain_span(sd))
+			if (!idle_cpu(i))
+				return i;
+	}
+	return cpu;
+}
 /*
  * When add_timer_on() enqueues a timer into the timer wheel of an
  * idle CPU then this timer might expire before the next timer event
Index: tip/kernel/timer.c
===================================================================
--- tip.orig/kernel/timer.c
+++ tip/kernel/timer.c
@@ -679,12 +679,8 @@ __mod_timer(struct timer_list *timer, un
 	cpu = smp_processor_id();
 
 #if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
-	if (!pinned && get_sysctl_timer_migration() && idle_cpu(cpu)) {
-		int preferred_cpu = get_nohz_load_balancer();
-
-		if (preferred_cpu < nr_cpu_ids)
-			cpu = preferred_cpu;
-	}
+	if (!pinned && get_sysctl_timer_migration() && idle_cpu(cpu))
+		cpu = get_nohz_timer_target();
 #endif
 	new_base = per_cpu(tvec_bases, cpu);
 
Index: tip/include/linux/sched.h
===================================================================
--- tip.orig/include/linux/sched.h
+++ tip/include/linux/sched.h
@@ -273,7 +273,7 @@ extern void task_rq_unlock_wait(struct t
 extern cpumask_var_t nohz_cpu_mask;
 #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ)
 extern void select_nohz_load_balancer(int stop_tick);
-extern int get_nohz_load_balancer(void);
+extern int get_nohz_timer_target(void);
 extern int nohz_ratelimit(int cpu);
 #else
 static inline void select_nohz_load_balancer(int stop_tick) { }



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 0/7] sched: change nohz idle load balancing logic to push model
  2010-05-17 18:27 [patch 0/7] sched: change nohz idle load balancing logic to push model Suresh Siddha
                   ` (6 preceding siblings ...)
  2010-05-17 18:27 ` [patch 7/7] timers: use nearest busy cpu for migrating timers from an idle cpu Suresh Siddha
@ 2010-05-17 22:39 ` Nigel Cunningham
  2010-05-19  9:19   ` Dominik Brodowski
  2010-05-20 10:50 ` Peter Zijlstra
  2010-05-20 11:07 ` [patch 0/7] sched: change " Nigel Cunningham
  9 siblings, 1 reply; 28+ messages in thread
From: Nigel Cunningham @ 2010-05-17 22:39 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Arjan van de Ven,
	Venkatesh Pallipadi, Vaidyanathan Srinivasan, ego, LKML,
	Dominik Brodowski

Hi Suresh.

On 18/05/10 04:27, Suresh Siddha wrote:
> This is an updated version of patchset which is posted earlier at
> http://lkml.org/lkml/2009/12/10/470
>
> Description:
> Existing nohz idle load balance logic uses the pull model, with one
> idle load balancer CPU nominated on any partially idle system and that
> balancer CPU not going into nohz mode. With the periodic tick, the
> balancer does the idle balancing on behalf of all the CPUs in nohz mode.
>
> This is not very optimal and has few issues:
> * the balancer will continue to have periodic ticks and wakeup
>    frequently (HZ rate), even though it may not have any rebalancing to do on
>    behalf of any of the idle CPUs.
> * On x86 and CPUs that have APIC timer stoppage on idle CPUs, this periodic
>    wakeup can result in an additional interrupt on a CPU doing the timer
>    broadcast.
>
> The alternative is to have a push model, where all idle CPUs can enter nohz
> mode and any busy CPU kicks one of the idle CPUs to take care of idle
> balancing on behalf of a group of idle CPUs.
>
> Following patches switches idle load balancer to this push approach.
>
> Updates from the previous version:
>
> * Busy CPU uses send_remote_softirq() for invoking SCHED_SOFTIRQ on the
>    idle load balancing cpu, which does the load balancing on behalf of
>    all the idle CPUs.
>
> * Dropped the per NUMA node nohz load balancing as it doesn't detect
>    certain imbalance scenarios. This will be addressed later.
>
> Signed-off-by: Suresh Siddha<suresh.b.siddha@intel.com>
> Signed-off-by: Venkatesh Pallipadi<venki@google.com>

Sounds great.

I'm in the middle of packing at the moment, but will try to find time to 
test the patches.

It might help to show the difference that the patch series makes 
(powertop output?)

Regards,

Nigel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 0/7] sched: change nohz idle load balancing logic to push model
  2010-05-17 22:39 ` [patch 0/7] sched: change nohz idle load balancing logic to push model Nigel Cunningham
@ 2010-05-19  9:19   ` Dominik Brodowski
  0 siblings, 0 replies; 28+ messages in thread
From: Dominik Brodowski @ 2010-05-19  9:19 UTC (permalink / raw)
  To: Suresh Siddha, Nigel Cunningham
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Arjan van de Ven,
	Venkatesh Pallipadi, Vaidyanathan Srinivasan, ego, LKML

Hey,

On Mon, May 17, 2010 at 11:27:26AM -0700, Suresh Siddha wrote:
> This is an updated version of patchset which is posted earlier at
> http://lkml.org/lkml/2009/12/10/470
> 
> Description:
> Existing nohz idle load balance logic uses the pull model, with one
> idle load balancer CPU nominated on any partially idle system and that
> balancer CPU not going into nohz mode. With the periodic tick, the
> balancer does the idle balancing on behalf of all the CPUs in nohz mode.
> 
> This is not very optimal and has few issues:
> * the balancer will continue to have periodic ticks and wakeup
>   frequently (HZ rate), even though it may not have any rebalancing to do on
>   behalf of any of the idle CPUs.
> * On x86 and CPUs that have APIC timer stoppage on idle CPUs, this periodic
>   wakeup can result in an additional interrupt on a CPU doing the timer
>   broadcast.
> 
> The alternative is to have a push model, where all idle CPUs can enter nohz
> mode and any busy CPU kicks one of the idle CPUs to take care of idle
> balancing on behalf of a group of idle CPUs.
> 
> Following patches switches idle load balancer to this push approach.
> 
> Updates from the previous version:
> 
> * Busy CPU uses send_remote_softirq() for invoking SCHED_SOFTIRQ on the
>   idle load balancing cpu, which does the load balancing on behalf of
>   all the idle CPUs.
> 
> * Dropped the per NUMA node nohz load balancing as it doesn't detect
>   certain imbalance scenarios. This will be addressed later.
> 
> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> Signed-off-by: Venkatesh Pallipadi <venki@google.com>

This patchset works well on my notebook; the number of wakeups decreases
from ~2.5 to ~1.6 when booting with init=/bin/bash, disabling cursor
blinking and enabling USB autosuspend.

Thanks and best,
	Dominik

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 3/7] softirq: avoid softirq_work_list for SCHED_SOFTIRQ when sent remotely
  2010-05-17 18:27 ` [patch 3/7] softirq: avoid softirq_work_list for SCHED_SOFTIRQ when sent remotely Suresh Siddha
@ 2010-05-20  8:12   ` Peter Zijlstra
  2010-05-20  8:14     ` David Miller
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2010-05-20  8:12 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Ingo Molnar, Thomas Gleixner, Arjan van de Ven,
	Venkatesh Pallipadi, Vaidyanathan Srinivasan, ego, LKML,
	Dominik Brodowski, Nigel Cunningham, David S. Miller, Jens Axboe

On Mon, 2010-05-17 at 11:27 -0700, Suresh Siddha wrote:
> plain text document attachment
> (fix_local_trigger_remote_softirq_for_sched.patch)
> There is no need to add the SCHED_SOFTIRQ work to the softirq_work_list
> when sent remotely. This is because any pending work associated with
> SCHED_SOFTIRQ need not be migrated to a new cpu when the current cpu is
> going down etc.
> 
> Also I am not sure how this softirq_work_list works for other softirqs.
> I don't see anyone removing the list entries from the softirq_work_list
> after it is handled on a remote cpu.

Most odd all that, Dave, Jens, what happened to all that remote_softirq
stuff?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 3/7] softirq: avoid softirq_work_list for SCHED_SOFTIRQ when sent remotely
  2010-05-20  8:12   ` Peter Zijlstra
@ 2010-05-20  8:14     ` David Miller
  2010-05-20  8:23       ` Jens Axboe
  0 siblings, 1 reply; 28+ messages in thread
From: David Miller @ 2010-05-20  8:14 UTC (permalink / raw)
  To: peterz
  Cc: suresh.b.siddha, mingo, tglx, arjan, venki, svaidy, ego,
	linux-kernel, linux, ncunningham, jens.axboe

From: Peter Zijlstra <peterz@infradead.org>
Date: Thu, 20 May 2010 10:12:39 +0200

> Most odd all that, Dave, Jens, what happened to all that remote_softirq
> stuff?

Nobody ended up using this remote softirq infrastructure, it can be
completely deleted.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 3/7] softirq: avoid softirq_work_list for SCHED_SOFTIRQ when sent remotely
  2010-05-20  8:14     ` David Miller
@ 2010-05-20  8:23       ` Jens Axboe
  2010-05-20  8:29         ` Peter Zijlstra
  2010-05-20  9:18         ` David Miller
  0 siblings, 2 replies; 28+ messages in thread
From: Jens Axboe @ 2010-05-20  8:23 UTC (permalink / raw)
  To: David Miller
  Cc: peterz, suresh.b.siddha, mingo, tglx, arjan, venki, svaidy, ego,
	linux-kernel, linux, ncunningham

On Thu, May 20 2010, David Miller wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Thu, 20 May 2010 10:12:39 +0200
> 
> > Most odd all that, Dave, Jens, what happened to all that remote_softirq
> > stuff?
> 
> Nobody ended up using this remote softirq infrastructure, it can be
> completely deleted.

What parts of it? There have been reworks of parts of the code since I
added it; the block layer only uses __smp_call_function_single() to
trigger remote softirqs.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 3/7] softirq: avoid softirq_work_list for SCHED_SOFTIRQ when sent remotely
  2010-05-20  8:23       ` Jens Axboe
@ 2010-05-20  8:29         ` Peter Zijlstra
  2010-05-20  9:18         ` David Miller
  1 sibling, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2010-05-20  8:29 UTC (permalink / raw)
  To: Jens Axboe
  Cc: David Miller, suresh.b.siddha, mingo, tglx, arjan, venki, svaidy,
	ego, linux-kernel, linux, ncunningham

On Thu, 2010-05-20 at 10:23 +0200, Jens Axboe wrote:
> On Thu, May 20 2010, David Miller wrote:
> > From: Peter Zijlstra <peterz@infradead.org>
> > Date: Thu, 20 May 2010 10:12:39 +0200
> > 
> > > Most odd all that, Dave, Jens, what happened to all that remote_softirq
> > > stuff?
> > 
> > Nobody ended up using this remote softirq infrastructure, it can be
> > completely deleted.
> 
> What parts of it? There has been reworks of parts of the code since I
> added it, the block layer only uses __smp_call_function_single() to
> trigger remote softirqs.

Right, so __smp_call_function_single() is the normal remote ipi stuff
from kernel/smp.c, send_remote_softirq() is the pile of code in
kernel/softirq.c that uses that to tickle remote softirqs.

The 'problem' seems to be that the remote softirq code (not the ipi
bits) doesn't have any users and is quite complex and apparently
incomplete.

Suresh, would it make sense to do as Jens does and simply use
__smp_call_function_single() or do you think it still makes sense to
have a small wrapper and maybe share that with Jens?
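
That is, something along the lines of this untested sketch (made-up
names), instead of the softirq_work_list machinery:

	static void trigger_sched_softirq(void *info)
	{
		raise_softirq_irqoff(SCHED_SOFTIRQ);
	}

	static DEFINE_PER_CPU(struct call_single_data, sched_kick_csd) = {
		.func	= trigger_sched_softirq,
	};

	/* busy cpu, preemption disabled */
	static void kick(int ilb_cpu)
	{
		__smp_call_function_single(ilb_cpu,
					   &__get_cpu_var(sched_kick_csd), 0);
	}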

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 3/7] softirq: avoid softirq_work_list for SCHED_SOFTIRQ when sent remotely
  2010-05-20  8:23       ` Jens Axboe
  2010-05-20  8:29         ` Peter Zijlstra
@ 2010-05-20  9:18         ` David Miller
  1 sibling, 0 replies; 28+ messages in thread
From: David Miller @ 2010-05-20  9:18 UTC (permalink / raw)
  To: jens.axboe
  Cc: peterz, suresh.b.siddha, mingo, tglx, arjan, venki, svaidy, ego,
	linux-kernel, linux, ncunningham

From: Jens Axboe <jens.axboe@oracle.com>
Date: Thu, 20 May 2010 10:23:09 +0200

> On Thu, May 20 2010, David Miller wrote:
>> From: Peter Zijlstra <peterz@infradead.org>
>> Date: Thu, 20 May 2010 10:12:39 +0200
>> 
>> > Most odd all that, Dave, Jens, what happened to all that remote_softirq
>> > stuff?
>> 
>> Nobody ended up using this remote softirq infrastructure, it can be
>> completely deleted.
> 
> What parts of it? There has been reworks of parts of the code since I
> added it, the block layer only uses __smp_call_function_single() to
> trigger remote softirqs.

And that's what the networking is using for RPS/RFS too.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 6/7] sched: change nohz.load_balancer to be nr_cpu_ids based
  2010-05-17 18:27 ` [patch 6/7] sched: change nohz.load_balancer to be nr_cpu_ids based Suresh Siddha
@ 2010-05-20  9:49   ` Peter Zijlstra
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2010-05-20  9:49 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Ingo Molnar, Thomas Gleixner, Arjan van de Ven,
	Venkatesh Pallipadi, Vaidyanathan Srinivasan, ego, LKML,
	Dominik Brodowski, Nigel Cunningham

On Mon, 2010-05-17 at 11:27 -0700, Suresh Siddha wrote:
>  int get_nohz_load_balancer(void)
>  {
> -       return atomic_read(&nohz.load_balancer);
> +       int load_balancer = atomic_read(&nohz.load_balancer);
> +       if (load_balancer >= nr_cpu_ids)
> +               return nr_cpu_ids;
> +
> +       return load_balancer;
>  } 

How could it ever be > nr_cpu_ids, and thus, do we need that
conditional?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 0/7] sched: change nohz idle load balancing logic to push model
  2010-05-17 18:27 [patch 0/7] sched: change nohz idle load balancing logic to push model Suresh Siddha
                   ` (7 preceding siblings ...)
  2010-05-17 22:39 ` [patch 0/7] sched: change nohz idle load balancing logic to push model Nigel Cunningham
@ 2010-05-20 10:50 ` Peter Zijlstra
  2010-05-22  0:09   ` Suresh Siddha
  2010-05-20 11:07 ` [patch 0/7] sched: change " Nigel Cunningham
  9 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2010-05-20 10:50 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Ingo Molnar, Thomas Gleixner, Arjan van de Ven,
	Venkatesh Pallipadi, Vaidyanathan Srinivasan, ego, LKML,
	Dominik Brodowski, Nigel Cunningham, David Miller, Jens Axboe

On Mon, 2010-05-17 at 11:27 -0700, Suresh Siddha wrote:
> This is an updated version of patchset which is posted earlier at
> http://lkml.org/lkml/2009/12/10/470
> 
> Description:
> Existing nohz idle load balance logic uses the pull model, with one
> idle load balancer CPU nominated on any partially idle system and that
> balancer CPU not going into nohz mode. With the periodic tick, the
> balancer does the idle balancing on behalf of all the CPUs in nohz mode.
> 
> This is not very optimal and has few issues:
> * the balancer will continue to have periodic ticks and wakeup
>   frequently (HZ rate), even though it may not have any rebalancing to do on
>   behalf of any of the idle CPUs.
> * On x86 and CPUs that have APIC timer stoppage on idle CPUs, this periodic
>   wakeup can result in an additional interrupt on a CPU doing the timer
>   broadcast.
> 
> The alternative is to have a push model, where all idle CPUs can enter nohz
> mode and any busy CPU kicks one of the idle CPUs to take care of idle
> balancing on behalf of a group of idle CPUs.
> 
> Following patches switches idle load balancer to this push approach.
> 
> Updates from the previous version:
> 
> * Busy CPU uses send_remote_softirq() for invoking SCHED_SOFTIRQ on the
>   idle load balancing cpu, which does the load balancing on behalf of
>   all the idle CPUs.
> 
> * Dropped the per NUMA node nohz load balancing as it doesn't detect
>   certain imbalance scenarios. This will be addressed later.

Looks good.

I think we want to keep init_remote_softirq_csd() and a function that
directly triggers the relevant softirq and make the networking code and
the block layer use that if possible -- and axe the rest of the
send_remote_softirq() infrastructure.
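
A minimal shared helper along those lines could be as small as the below
(a sketch only, with illustrative names; the caller would own one
call_single_data per cpu and must not resubmit it while it is still in
flight):

static void __trigger_remote_softirq(void *info)
{
	/* runs on the target cpu from the IPI, with interrupts disabled */
	raise_softirq_irqoff((unsigned long)info);
}

static void trigger_remote_softirq(int cpu, struct call_single_data *csd,
				   unsigned long softirq_nr)
{
	csd->func  = __trigger_remote_softirq;
	csd->info  = (void *)softirq_nr;
	csd->flags = 0;
	__smp_call_function_single(cpu, csd, 0);
}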

Also, I think it makes sense to fold patches 4-6.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 0/7] sched: change nohz idle load balancing logic to push model
  2010-05-17 18:27 [patch 0/7] sched: change nohz idle load balancing logic to push model Suresh Siddha
                   ` (8 preceding siblings ...)
  2010-05-20 10:50 ` Peter Zijlstra
@ 2010-05-20 11:07 ` Nigel Cunningham
  2010-05-20 11:17   ` Dominik Brodowski
  9 siblings, 1 reply; 28+ messages in thread
From: Nigel Cunningham @ 2010-05-20 11:07 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Arjan van de Ven,
	Venkatesh Pallipadi, Vaidyanathan Srinivasan, ego, LKML,
	Dominik Brodowski, Nigel Cunningham

Hi.

I finally found the time to give this patch series a try. Minor updates 
were required, but I'm now running it against 2.6.34.

"Load balancing tick" is still number one in my powertop list of top 
causes of wakeups (sitting at ~60 to 80 per second as I type this, with 
~170 wakeups per second total). Comparing this to the numbers I posted 
earlier, we seem to have a win.

I do wonder, though, whether further work could still be done. If I take 
one core offline, for example, I'm still getting load balancing ticks. 
Intuitively, I'd expect there to be no need for them with only one core 
available. But maybe I'm just ignorant of what's going on.

Regards,

Nigel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 0/7] sched: change nohz idle load balancing logic to push model
  2010-05-20 11:07 ` [patch 0/7] sched: change " Nigel Cunningham
@ 2010-05-20 11:17   ` Dominik Brodowski
  2010-05-20 11:35     ` Nigel Cunningham
  0 siblings, 1 reply; 28+ messages in thread
From: Dominik Brodowski @ 2010-05-20 11:17 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: Suresh Siddha, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Arjan van de Ven, Venkatesh Pallipadi, Vaidyanathan Srinivasan,
	ego, LKML, Nigel Cunningham

Hey,

On Thu, May 20, 2010 at 09:07:42PM +1000, Nigel Cunningham wrote:
> "Load balancing tick" is still number one in my powertop list of top  
> causes of wakeups (sitting at ~60 to 80 per second as I type this, with  
> ~170 wakeups per second total). Comparing this to the numbers I posted  
> earlier, we seem to have a win.
>
> I do wonder, though, whether further work could still be done. If I take  
> one core offline, for example, I'm still getting load balancing ticks.  
> Intuitively, I'd expect there to be no need for them with only one core  
> available. But maybe I'm just ignorant of what's going on.

Are you using HZ=1000, and is the CPU active ~ 6-8 % ? If so, is it just the
regular timer tick while the CPU is active, and so not a real "wakeup"? (Or
possibly double the number if both CPUs are active)

Best,
	Dominik

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 0/7] sched: change nohz idle load balancing logic to push model
  2010-05-20 11:17   ` Dominik Brodowski
@ 2010-05-20 11:35     ` Nigel Cunningham
  2010-05-20 12:13       ` Dominik Brodowski
  0 siblings, 1 reply; 28+ messages in thread
From: Nigel Cunningham @ 2010-05-20 11:35 UTC (permalink / raw)
  To: Dominik Brodowski, Nigel Cunningham, Suresh Siddha,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Arjan van de Ven,
	Venkatesh Pallipadi, Vaidyanathan Srinivasan, ego, LKML

Hi.

On 20/05/10 21:17, Dominik Brodowski wrote:
> Hey,
>
> On Thu, May 20, 2010 at 09:07:42PM +1000, Nigel Cunningham wrote:
>> "Load balancing tick" is still number one in my powertop list of top
>> causes of wakeups (sitting at ~60 to 80 per second as I type this, with
>> ~170 wakeups per second total). Comparing this to the numbers I posted
>> earlier, we seem to have a win.
>>
>> I do wonder, though, whether further work could still be done. If I take
>> one core offline, for example, I'm still getting load balancing ticks.
>> Intuitively, I'd expect there to be no need for them with only one core
>> available. But maybe I'm just ignorant of what's going on.
>
> Are you using HZ=1000, and is the CPU active ~ 6-8 % ? If so, is it just the
> regular timer tick while the CPU is active, and so not a real "wakeup"? (Or
> possibly double the number if both CPUs are active)

HZ is 1000, and the CPU running percentage in Powertop is low (2% when 
I'm not typing).

Regards,

Nigel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 0/7] sched: change nohz idle load balancing logic to push model
  2010-05-20 11:35     ` Nigel Cunningham
@ 2010-05-20 12:13       ` Dominik Brodowski
  0 siblings, 0 replies; 28+ messages in thread
From: Dominik Brodowski @ 2010-05-20 12:13 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: Nigel Cunningham, Suresh Siddha, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Arjan van de Ven, Venkatesh Pallipadi,
	Vaidyanathan Srinivasan, ego, LKML

Hey

On Thu, May 20, 2010 at 09:35:47PM +1000, Nigel Cunningham wrote:
> On 20/05/10 21:17, Dominik Brodowski wrote:
>> Hey,
>>
>> On Thu, May 20, 2010 at 09:07:42PM +1000, Nigel Cunningham wrote:
>>> "Load balancing tick" is still number one in my powertop list of top
>>> causes of wakeups (sitting at ~60 to 80 per second as I type this, with
>>> ~170 wakeups per second total). Comparing this to the numbers I posted
>>> earlier, we seem to have a win.
>>>
>>> I do wonder, though, whether further work could still be done. If I take
>>> one core offline, for example, I'm still getting load balancing ticks.
>>> Intuitively, I'd expect there to be no need for them with only one core
>>> available. But maybe I'm just ignorant of what's going on.
>>
>> Are you using HZ=1000, and is the CPU active ~ 6-8 % ? If so, is it just the
>> regular timer tick while the CPU is active, and so not a real "wakeup"? (Or
>> possibly double the number if both CPUs are active)
>
> HZ is 1000, and the CPU running percentage in Powertop is low (2% when  
> I'm not typing).

Still, that'd count for up to 4% of the reported ticks. On my system, this
phenomenon accounts pretty much for the events powertop calls "load balancing
ticks". I wonder whether user- or kernelspace should be smarter and not
consider something a wakeup event if it wasn't a wakeup, i.e. ignore
timer events from timer_stat / powertop if the CPU wasn't actually idle...

Best,
	Dominik

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 0/7] sched: change nohz idle load balancing logic to push model
  2010-05-20 10:50 ` Peter Zijlstra
@ 2010-05-22  0:09   ` Suresh Siddha
  2010-05-31  9:17     ` Peter Zijlstra
  2010-06-09 10:13     ` [tip:sched/core] sched: Change " tip-bot for Venkatesh Pallipadi
  0 siblings, 2 replies; 28+ messages in thread
From: Suresh Siddha @ 2010-05-22  0:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Arjan van de Ven,
	Venkatesh Pallipadi, Vaidyanathan Srinivasan, ego, LKML,
	Dominik Brodowski, Nigel Cunningham, David Miller, Jens Axboe

On Thu, 2010-05-20 at 03:50 -0700, Peter Zijlstra wrote:
> Looks good.
> 
> I think we want to keep init_remote_softirq_csd() and a function that
> directly triggers the relevant softirq and make the networking code and
> the block layer use that if possible -- and axe the rest of the
> send_remote_softirq() infrastructure.

For now, I ended up using __smp_call_function_single(), which does away
with all the remote softirq changes. If needed, we can remove the remote
softirq infrastructure, but the block remote softirq code
looks similar to the remote softirq code (cpu notifiers for hotplug,
softirq triggering by adding the request to local queues, etc). Both of
these things are not required for SCHED_SOFTIRQ, so I just ended up
using a simple __smp_call_function_single(). Maybe in the future we can
modify the block and networking code to use the remote softirq infrastructure.

> Also, I think it makes sense to fold patches 4-6.

Ok. I consolidated everything into this single patch (appended).

thanks,
suresh
---

From: Venkatesh Pallipadi <venki@google.com>
Subject: sched: Change nohz ilb logic from pull to push model

In the new push model, all idle CPUs indeed go into nohz mode. There is
still the concept of idle load balancer (performing the load balancing
on behalf of all the idle cpu's in the system). Busy CPU kicks the nohz
balancer when any of the nohz CPUs need idle load balancing.
The kickee CPU does the idle load balancing on behalf of all idle CPUs
instead of the normal idle balance.

This addresses the below two problems with the current nohz ilb logic:
* the idle load balancer continued to have periodic ticks during idle and
  wokeup frequently, even though it did not have any rebalancing to do on
  behalf of any of the idle CPUs.
* On x86 and CPUs that have APIC timer stoppage on idle CPUs, this
  periodic wakeup can result in a periodic additional interrupt on a CPU
  doing the timer broadcast.

Also currently we are migrating the unpinned timers from an idle to the cpu
doing idle load balancing (when all the cpus in the system are idle,
there is no idle load balancing cpu and timers get added to the same idle cpu
where the request was made. So the existing optimization works only on semi idle
system).

And In semi idle system, we no longer have periodic ticks on the idle load
balancer CPU. Using that cpu will add more delays to the timers than intended
(as that cpu's timer base may not be uptodate wrt jiffies etc). This was
causing mysterious slowdowns during boot etc.

For now, in the semi idle case, use the nearest busy cpu for migrating timers
from an idle cpu.  This is good for power-savings anyway.

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 include/linux/sched.h    |    9 -
 kernel/hrtimer.c         |    8 -
 kernel/sched.c           |   34 ++++
 kernel/sched_fair.c      |  322 ++++++++++++++++++++++++++++-------------------
 kernel/time/tick-sched.c |    8 -
 kernel/timer.c           |    8 -
 6 files changed, 233 insertions(+), 156 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -3081,13 +3081,40 @@ out_unlock:
 }
 
 #ifdef CONFIG_NO_HZ
+
+static DEFINE_PER_CPU(struct call_single_data, remote_sched_softirq_cb);
+
+static void trigger_sched_softirq(void *data)
+{
+	raise_softirq_irqoff(SCHED_SOFTIRQ);
+}
+
+static inline void init_sched_softirq_csd(struct call_single_data *csd)
+{
+	csd->func = trigger_sched_softirq;
+	csd->info = NULL;
+	csd->flags = 0;
+	csd->priv = 0;
+}
+
+/*
+ * idle load balancing details
+ * - One of the idle CPUs nominates itself as idle load_balancer, while
+ *   entering idle.
+ * - This idle load balancer CPU will also go into tickless mode when
+ *   it is idle, just like all other idle CPUs
+ * - When one of the busy CPUs notice that there may be an idle rebalancing
+ *   needed, they will kick the idle load balancer, which then does idle
+ *   load balancing for all the idle CPUs.
+ */
 static struct {
 	atomic_t load_balancer;
-	cpumask_var_t cpu_mask;
-	cpumask_var_t ilb_grp_nohz_mask;
-} nohz ____cacheline_aligned = {
-	.load_balancer = ATOMIC_INIT(-1),
-};
+	atomic_t first_pick_cpu;
+	atomic_t second_pick_cpu;
+	cpumask_var_t idle_cpus_mask;
+	cpumask_var_t grp_idle_mask;
+	unsigned long next_balance;     /* in jiffy units */
+} nohz ____cacheline_aligned;
 
 int get_nohz_load_balancer(void)
 {
@@ -3141,17 +3168,17 @@ static inline struct sched_domain *lowes
  */
 static inline int is_semi_idle_group(struct sched_group *ilb_group)
 {
-	cpumask_and(nohz.ilb_grp_nohz_mask, nohz.cpu_mask,
+	cpumask_and(nohz.grp_idle_mask, nohz.idle_cpus_mask,
 					sched_group_cpus(ilb_group));
 
 	/*
 	 * A sched_group is semi-idle when it has atleast one busy cpu
 	 * and atleast one idle cpu.
 	 */
-	if (cpumask_empty(nohz.ilb_grp_nohz_mask))
+	if (cpumask_empty(nohz.grp_idle_mask))
 		return 0;
 
-	if (cpumask_equal(nohz.ilb_grp_nohz_mask, sched_group_cpus(ilb_group)))
+	if (cpumask_equal(nohz.grp_idle_mask, sched_group_cpus(ilb_group)))
 		return 0;
 
 	return 1;
@@ -3184,7 +3211,7 @@ static int find_new_ilb(int cpu)
 	 * Optimize for the case when we have no idle CPUs or only one
 	 * idle CPU. Don't walk the sched_domain hierarchy in such cases
 	 */
-	if (cpumask_weight(nohz.cpu_mask) < 2)
+	if (cpumask_weight(nohz.idle_cpus_mask) < 2)
 		goto out_done;
 
 	for_each_flag_domain(cpu, sd, SD_POWERSAVINGS_BALANCE) {
@@ -3192,7 +3219,7 @@ static int find_new_ilb(int cpu)
 
 		do {
 			if (is_semi_idle_group(ilb_group))
-				return cpumask_first(nohz.ilb_grp_nohz_mask);
+				return cpumask_first(nohz.grp_idle_mask);
 
 			ilb_group = ilb_group->next;
 
@@ -3200,98 +3227,116 @@ static int find_new_ilb(int cpu)
 	}
 
 out_done:
-	return cpumask_first(nohz.cpu_mask);
+	return nr_cpu_ids;
 }
 #else /*  (CONFIG_SCHED_MC || CONFIG_SCHED_SMT) */
 static inline int find_new_ilb(int call_cpu)
 {
-	return cpumask_first(nohz.cpu_mask);
+	return nr_cpu_ids;
 }
 #endif
 
 /*
+ * Kick a CPU to do the nohz balancing, if it is time for it. We pick the
+ * nohz_load_balancer CPU (if there is one) otherwise fallback to any idle
+ * CPU (if there is one).
+ */
+static void nohz_balancer_kick(int cpu)
+{
+	int ilb_cpu;
+
+	nohz.next_balance++;
+
+	ilb_cpu = get_nohz_load_balancer();
+
+	if (ilb_cpu >= nr_cpu_ids) {
+		ilb_cpu = cpumask_first(nohz.idle_cpus_mask);
+		if (ilb_cpu >= nr_cpu_ids)
+			return;
+	}
+
+	if (!cpu_rq(ilb_cpu)->nohz_balance_kick) {
+		struct call_single_data *cp;
+
+		cpu_rq(ilb_cpu)->nohz_balance_kick = 1;
+		cp = &per_cpu(remote_sched_softirq_cb, cpu);
+		__smp_call_function_single(ilb_cpu, cp, 0);
+	}
+	return;
+}
+
+/*
  * This routine will try to nominate the ilb (idle load balancing)
  * owner among the cpus whose ticks are stopped. ilb owner will do the idle
- * load balancing on behalf of all those cpus. If all the cpus in the system
- * go into this tickless mode, then there will be no ilb owner (as there is
- * no need for one) and all the cpus will sleep till the next wakeup event
- * arrives...
+ * load balancing on behalf of all those cpus.
  *
- * For the ilb owner, tick is not stopped. And this tick will be used
- * for idle load balancing. ilb owner will still be part of
- * nohz.cpu_mask..
+ * When the ilb owner becomes busy, we will not have new ilb owner until some
+ * idle CPU wakes up and goes back to idle or some busy CPU tries to kick
+ * idle load balancing by kicking one of the idle CPUs.
  *
- * While stopping the tick, this cpu will become the ilb owner if there
- * is no other owner. And will be the owner till that cpu becomes busy
- * or if all cpus in the system stop their ticks at which point
- * there is no need for ilb owner.
- *
- * When the ilb owner becomes busy, it nominates another owner, during the
- * next busy scheduler_tick()
+ * Ticks are stopped for the ilb owner as well, with busy CPU kicking this
+ * ilb owner CPU in future (when there is a need for idle load balancing on
+ * behalf of all idle CPUs).
  */
-int select_nohz_load_balancer(int stop_tick)
+void select_nohz_load_balancer(int stop_tick)
 {
 	int cpu = smp_processor_id();
 
 	if (stop_tick) {
-		cpu_rq(cpu)->in_nohz_recently = 1;
-
 		if (!cpu_active(cpu)) {
 			if (atomic_read(&nohz.load_balancer) != cpu)
-				return 0;
+				return;
 
 			/*
 			 * If we are going offline and still the leader,
 			 * give up!
 			 */
-			if (atomic_cmpxchg(&nohz.load_balancer, cpu, -1) != cpu)
+			if (atomic_cmpxchg(&nohz.load_balancer, cpu,
+					   nr_cpu_ids) != cpu)
 				BUG();
 
-			return 0;
+			return;
 		}
 
-		cpumask_set_cpu(cpu, nohz.cpu_mask);
+		cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
 
-		/* time for ilb owner also to sleep */
-		if (cpumask_weight(nohz.cpu_mask) == num_active_cpus()) {
-			if (atomic_read(&nohz.load_balancer) == cpu)
-				atomic_set(&nohz.load_balancer, -1);
-			return 0;
-		}
+		if (atomic_read(&nohz.first_pick_cpu) == cpu)
+			atomic_cmpxchg(&nohz.first_pick_cpu, cpu, nr_cpu_ids);
+		if (atomic_read(&nohz.second_pick_cpu) == cpu)
+			atomic_cmpxchg(&nohz.second_pick_cpu, cpu, nr_cpu_ids);
 
-		if (atomic_read(&nohz.load_balancer) == -1) {
-			/* make me the ilb owner */
-			if (atomic_cmpxchg(&nohz.load_balancer, -1, cpu) == -1)
-				return 1;
-		} else if (atomic_read(&nohz.load_balancer) == cpu) {
+		if (atomic_read(&nohz.load_balancer) >= nr_cpu_ids) {
 			int new_ilb;
 
-			if (!(sched_smt_power_savings ||
-						sched_mc_power_savings))
-				return 1;
+			/* make me the ilb owner */
+			if (atomic_cmpxchg(&nohz.load_balancer, nr_cpu_ids,
+					   cpu) != nr_cpu_ids)
+				return;
+
 			/*
 			 * Check to see if there is a more power-efficient
 			 * ilb.
 			 */
 			new_ilb = find_new_ilb(cpu);
 			if (new_ilb < nr_cpu_ids && new_ilb != cpu) {
-				atomic_set(&nohz.load_balancer, -1);
+				atomic_set(&nohz.load_balancer, nr_cpu_ids);
 				resched_cpu(new_ilb);
-				return 0;
+				return;
 			}
-			return 1;
+			return;
 		}
 	} else {
-		if (!cpumask_test_cpu(cpu, nohz.cpu_mask))
-			return 0;
+		if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
+			return;
 
-		cpumask_clear_cpu(cpu, nohz.cpu_mask);
+		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
 
 		if (atomic_read(&nohz.load_balancer) == cpu)
-			if (atomic_cmpxchg(&nohz.load_balancer, cpu, -1) != cpu)
+			if (atomic_cmpxchg(&nohz.load_balancer, cpu,
+					   nr_cpu_ids) != cpu)
 				BUG();
 	}
-	return 0;
+	return;
 }
 #endif
 
@@ -3373,11 +3418,97 @@ out:
 		rq->next_balance = next_balance;
 }
 
+#ifdef CONFIG_NO_HZ
 /*
- * run_rebalance_domains is triggered when needed from the scheduler tick.
- * In CONFIG_NO_HZ case, the idle load balance owner will do the
+ * In CONFIG_NO_HZ case, the idle balance kickee will do the
  * rebalancing for all the cpus for whom scheduler ticks are stopped.
  */
+static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle)
+{
+	struct rq *this_rq = cpu_rq(this_cpu);
+	struct rq *rq;
+	int balance_cpu;
+
+	if (idle != CPU_IDLE || !this_rq->nohz_balance_kick)
+		return;
+
+	for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
+		if (balance_cpu == this_cpu)
+			continue;
+
+		/*
+		 * If this cpu gets work to do, stop the load balancing
+		 * work being done for other cpus. Next load
+		 * balancing owner will pick it up.
+		 */
+		if (need_resched()) {
+			this_rq->nohz_balance_kick = 0;
+			break;
+		}
+
+		rebalance_domains(balance_cpu, CPU_IDLE);
+
+		rq = cpu_rq(balance_cpu);
+		if (time_after(this_rq->next_balance, rq->next_balance))
+			this_rq->next_balance = rq->next_balance;
+	}
+	nohz.next_balance = this_rq->next_balance;
+	this_rq->nohz_balance_kick = 0;
+}
+
+/*
+ * Current heuristic for kicking the idle load balancer
+ * - first_pick_cpu is the one of the busy CPUs. It will kick
+ *   idle load balancer when it has more than one process active. This
+ *   eliminates the need for idle load balancing altogether when we have
+ *   only one running process in the system (common case).
+ * - If there are more than one busy CPU, idle load balancer may have
+ *   to run for active_load_balance to happen (i.e., two busy CPUs are
+ *   SMT or core siblings and can run better if they move to different
+ *   physical CPUs). So, second_pick_cpu is the second of the busy CPUs
+ *   which will kick idle load balancer as soon as it has any load.
+ */
+static inline int nohz_kick_needed(struct rq *rq, int cpu)
+{
+	unsigned long now = jiffies;
+	int ret;
+	int first_pick_cpu, second_pick_cpu;
+
+	if (time_before(now, nohz.next_balance))
+		return 0;
+
+	if (!rq->nr_running)
+		return 0;
+
+	first_pick_cpu = atomic_read(&nohz.first_pick_cpu);
+	second_pick_cpu = atomic_read(&nohz.second_pick_cpu);
+
+	if (first_pick_cpu < nr_cpu_ids && first_pick_cpu != cpu &&
+	    second_pick_cpu < nr_cpu_ids && second_pick_cpu != cpu)
+		return 0;
+
+	ret = atomic_cmpxchg(&nohz.first_pick_cpu, nr_cpu_ids, cpu);
+	if (ret == nr_cpu_ids || ret == cpu) {
+		atomic_cmpxchg(&nohz.second_pick_cpu, cpu, nr_cpu_ids);
+		if (rq->nr_running > 1)
+			return 1;
+	} else {
+		ret = atomic_cmpxchg(&nohz.second_pick_cpu, nr_cpu_ids, cpu);
+		if (ret == nr_cpu_ids || ret == cpu) {
+			if (rq->nr_running)
+				return 1;
+		}
+	}
+	return 0;
+}
+#else
+static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) { }
+#endif
+
+/*
+ * run_rebalance_domains is triggered when needed from the scheduler tick.
+ * Also triggered for nohz idle balancing (with nohz_balancing_kick set).
+ */
 static void run_rebalance_domains(struct softirq_action *h)
 {
 	int this_cpu = smp_processor_id();
@@ -3387,37 +3518,12 @@ static void run_rebalance_domains(struct
 
 	rebalance_domains(this_cpu, idle);
 
-#ifdef CONFIG_NO_HZ
 	/*
-	 * If this cpu is the owner for idle load balancing, then do the
+	 * If this cpu has a pending nohz_balance_kick, then do the
 	 * balancing on behalf of the other idle cpus whose ticks are
 	 * stopped.
 	 */
-	if (this_rq->idle_at_tick &&
-	    atomic_read(&nohz.load_balancer) == this_cpu) {
-		struct rq *rq;
-		int balance_cpu;
-
-		for_each_cpu(balance_cpu, nohz.cpu_mask) {
-			if (balance_cpu == this_cpu)
-				continue;
-
-			/*
-			 * If this cpu gets work to do, stop the load balancing
-			 * work being done for other cpus. Next load
-			 * balancing owner will pick it up.
-			 */
-			if (need_resched())
-				break;
-
-			rebalance_domains(balance_cpu, CPU_IDLE);
-
-			rq = cpu_rq(balance_cpu);
-			if (time_after(this_rq->next_balance, rq->next_balance))
-				this_rq->next_balance = rq->next_balance;
-		}
-	}
-#endif
+	nohz_idle_balance(this_cpu, idle);
 }
 
 static inline int on_null_domain(int cpu)
@@ -3427,57 +3533,17 @@ static inline int on_null_domain(int cpu
 
 /*
  * Trigger the SCHED_SOFTIRQ if it is time to do periodic load balancing.
- *
- * In case of CONFIG_NO_HZ, this is the place where we nominate a new
- * idle load balancing owner or decide to stop the periodic load balancing,
- * if the whole system is idle.
  */
 static inline void trigger_load_balance(struct rq *rq, int cpu)
 {
-#ifdef CONFIG_NO_HZ
-	/*
-	 * If we were in the nohz mode recently and busy at the current
-	 * scheduler tick, then check if we need to nominate new idle
-	 * load balancer.
-	 */
-	if (rq->in_nohz_recently && !rq->idle_at_tick) {
-		rq->in_nohz_recently = 0;
-
-		if (atomic_read(&nohz.load_balancer) == cpu) {
-			cpumask_clear_cpu(cpu, nohz.cpu_mask);
-			atomic_set(&nohz.load_balancer, -1);
-		}
-
-		if (atomic_read(&nohz.load_balancer) == -1) {
-			int ilb = find_new_ilb(cpu);
-
-			if (ilb < nr_cpu_ids)
-				resched_cpu(ilb);
-		}
-	}
-
-	/*
-	 * If this cpu is idle and doing idle load balancing for all the
-	 * cpus with ticks stopped, is it time for that to stop?
-	 */
-	if (rq->idle_at_tick && atomic_read(&nohz.load_balancer) == cpu &&
-	    cpumask_weight(nohz.cpu_mask) == num_online_cpus()) {
-		resched_cpu(cpu);
-		return;
-	}
-
-	/*
-	 * If this cpu is idle and the idle load balancing is done by
-	 * someone else, then no need raise the SCHED_SOFTIRQ
-	 */
-	if (rq->idle_at_tick && atomic_read(&nohz.load_balancer) != cpu &&
-	    cpumask_test_cpu(cpu, nohz.cpu_mask))
-		return;
-#endif
 	/* Don't need to rebalance while attached to NULL domain */
 	if (time_after_eq(jiffies, rq->next_balance) &&
 	    likely(!on_null_domain(cpu)))
 		raise_softirq(SCHED_SOFTIRQ);
+#ifdef CONFIG_NO_HZ
+	else if (nohz_kick_needed(rq, cpu) && likely(!on_null_domain(cpu)))
+		nohz_balancer_kick(cpu);
+#endif
 }
 
 static void rq_online_fair(struct rq *rq)
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -504,7 +504,7 @@ struct rq {
 	unsigned long cpu_load[CPU_LOAD_IDX_MAX];
 #ifdef CONFIG_NO_HZ
 	u64 nohz_stamp;
-	unsigned char in_nohz_recently;
+	unsigned char nohz_balance_kick;
 #endif
 	unsigned int skip_clock_update;
 
@@ -1202,6 +1202,27 @@ static void resched_cpu(int cpu)
 
 #ifdef CONFIG_NO_HZ
 /*
+ * In the semi idle case, use the nearest busy cpu for migrating timers
+ * from an idle cpu.  This is good for power-savings.
+ *
+ * We don't do similar optimization for completely idle system, as
+ * selecting an idle cpu will add more delays to the timers than intended
+ * (as that cpu's timer base may not be uptodate wrt jiffies etc).
+ */
+int get_nohz_timer_target(void)
+{
+	int cpu = smp_processor_id();
+	int i;
+	struct sched_domain *sd;
+
+	for_each_domain(cpu, sd) {
+		for_each_cpu(i, sched_domain_span(sd))
+			if (!idle_cpu(i))
+				return i;
+	}
+	return cpu;
+}
+/*
  * When add_timer_on() enqueues a timer into the timer wheel of an
  * idle CPU then this timer might expire before the next timer event
  * which is scheduled to wake up that CPU. In case of a completely
@@ -7604,6 +7625,10 @@ void __init sched_init(void)
 		rq->idle_stamp = 0;
 		rq->avg_idle = 2*sysctl_sched_migration_cost;
 		rq_attach_root(rq, &def_root_domain);
+#ifdef CONFIG_NO_HZ
+		rq->nohz_balance_kick = 0;
+		init_sched_softirq_csd(&per_cpu(remote_sched_softirq_cb, i));
+#endif
 #endif
 		init_rq_hrtick(rq);
 		atomic_set(&rq->nr_iowait, 0);
@@ -7648,8 +7673,11 @@ void __init sched_init(void)
 	zalloc_cpumask_var(&nohz_cpu_mask, GFP_NOWAIT);
 #ifdef CONFIG_SMP
 #ifdef CONFIG_NO_HZ
-	zalloc_cpumask_var(&nohz.cpu_mask, GFP_NOWAIT);
-	alloc_cpumask_var(&nohz.ilb_grp_nohz_mask, GFP_NOWAIT);
+	zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
+	alloc_cpumask_var(&nohz.grp_idle_mask, GFP_NOWAIT);
+	atomic_set(&nohz.load_balancer, nr_cpu_ids);
+	atomic_set(&nohz.first_pick_cpu, nr_cpu_ids);
+	atomic_set(&nohz.second_pick_cpu, nr_cpu_ids);
 #endif
 	/* May be allocated at isolcpus cmdline parse time */
 	if (cpu_isolated_map == NULL)
Index: tip/include/linux/sched.h
===================================================================
--- tip.orig/include/linux/sched.h
+++ tip/include/linux/sched.h
@@ -272,14 +272,11 @@ extern void task_rq_unlock_wait(struct t
 
 extern cpumask_var_t nohz_cpu_mask;
 #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ)
-extern int select_nohz_load_balancer(int cpu);
-extern int get_nohz_load_balancer(void);
+extern void select_nohz_load_balancer(int stop_tick);
+extern int get_nohz_timer_target(void);
 extern int nohz_ratelimit(int cpu);
 #else
-static inline int select_nohz_load_balancer(int cpu)
-{
-	return 0;
-}
+static inline void select_nohz_load_balancer(int stop_tick) { }
 
 static inline int nohz_ratelimit(int cpu)
 {
Index: tip/kernel/hrtimer.c
===================================================================
--- tip.orig/kernel/hrtimer.c
+++ tip/kernel/hrtimer.c
@@ -144,12 +144,8 @@ struct hrtimer_clock_base *lock_hrtimer_
 static int hrtimer_get_target(int this_cpu, int pinned)
 {
 #ifdef CONFIG_NO_HZ
-	if (!pinned && get_sysctl_timer_migration() && idle_cpu(this_cpu)) {
-		int preferred_cpu = get_nohz_load_balancer();
-
-		if (preferred_cpu >= 0)
-			return preferred_cpu;
-	}
+	if (!pinned && get_sysctl_timer_migration() && idle_cpu(this_cpu))
+		return get_nohz_timer_target();
 #endif
 	return this_cpu;
 }
Index: tip/kernel/time/tick-sched.c
===================================================================
--- tip.orig/kernel/time/tick-sched.c
+++ tip/kernel/time/tick-sched.c
@@ -408,13 +408,7 @@ void tick_nohz_stop_sched_tick(int inidl
 		 * the scheduler tick in nohz_restart_sched_tick.
 		 */
 		if (!ts->tick_stopped) {
-			if (select_nohz_load_balancer(1)) {
-				/*
-				 * sched tick not stopped!
-				 */
-				cpumask_clear_cpu(cpu, nohz_cpu_mask);
-				goto out;
-			}
+			select_nohz_load_balancer(1);
 
 			ts->idle_tick = hrtimer_get_expires(&ts->sched_timer);
 			ts->tick_stopped = 1;
Index: tip/kernel/timer.c
===================================================================
--- tip.orig/kernel/timer.c
+++ tip/kernel/timer.c
@@ -679,12 +679,8 @@ __mod_timer(struct timer_list *timer, un
 	cpu = smp_processor_id();
 
 #if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
-	if (!pinned && get_sysctl_timer_migration() && idle_cpu(cpu)) {
-		int preferred_cpu = get_nohz_load_balancer();
-
-		if (preferred_cpu >= 0)
-			cpu = preferred_cpu;
-	}
+	if (!pinned && get_sysctl_timer_migration() && idle_cpu(cpu))
+		cpu = get_nohz_timer_target();
 #endif
 	new_base = per_cpu(tvec_bases, cpu);
 




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 0/7] sched: change nohz idle load balancing logic to push model
  2010-05-22  0:09   ` Suresh Siddha
@ 2010-05-31  9:17     ` Peter Zijlstra
  2010-06-09 10:13     ` [tip:sched/core] sched: Change " tip-bot for Venkatesh Pallipadi
  1 sibling, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2010-05-31  9:17 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Ingo Molnar, Thomas Gleixner, Arjan van de Ven,
	Venkatesh Pallipadi, Vaidyanathan Srinivasan, ego, LKML,
	Dominik Brodowski, Nigel Cunningham, David Miller, Jens Axboe

On Fri, 2010-05-21 at 17:09 -0700, Suresh Siddha wrote:

> From: Venkatesh Pallipadi <venki@google.com>
> Subject: sched: Change nohz ilb logic from pull to push model
> 
> In the new push model, all idle CPUs indeed go into nohz mode. There is
> still the concept of idle load balancer (performing the load balancing
> on behalf of all the idle cpu's in the system). Busy CPU kicks the nohz
> balancer when any of the nohz CPUs need idle load balancing.
> The kickee CPU does the idle load balancing on behalf of all idle CPUs
> instead of the normal idle balance.
> 
> This addresses the below two problems with the current nohz ilb logic:
> * the idle load balancer continued to have periodic ticks during idle and
>   wokeup frequently, even though it did not have any rebalancing to do on
>   behalf of any of the idle CPUs.
> * On x86 and CPUs that have APIC timer stoppage on idle CPUs, this
>   periodic wakeup can result in a periodic additional interrupt on a CPU
>   doing the timer broadcast.
> 
> Also currently we are migrating the unpinned timers from an idle to the cpu
> doing idle load balancing (when all the cpus in the system are idle,
> there is no idle load balancing cpu and timers get added to the same idle cpu
> where the request was made. So the existing optimization works only on semi idle
> system).
> 
> And In semi idle system, we no longer have periodic ticks on the idle load
> balancer CPU. Using that cpu will add more delays to the timers than intended
> (as that cpu's timer base may not be uptodate wrt jiffies etc). This was
> causing mysterious slowdowns during boot etc.
> 
> For now, in the semi idle case, use the nearest busy cpu for migrating timers
> from an idle cpu.  This is good for power-savings anyway.
> 
> Signed-off-by: Venkatesh Pallipadi <venki@google.com>
> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> ---

I sorted the conflict with Venki's update_cpu_load() patch as below.

Thanks!

---
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -3446,6 +3446,9 @@ static void nohz_idle_balance(int this_c
 			break;
 		}
 
+		raw_spin_lock_irq(&rq->lock);
+		update_cpu_load(rq);
+		raw_spin_unlock_irq(&rq->lock);
 		rebalance_domains(balance_cpu, CPU_IDLE);
 
 		rq = cpu_rq(balance_cpu);
@@ -3518,40 +3521,12 @@ static void run_rebalance_domains(struct
 
 	rebalance_domains(this_cpu, idle);
 
-#ifdef CONFIG_NO_HZ
 	/*
-	 * If this cpu is the owner for idle load balancing, then do the
+	 * If this cpu has a pending nohz_balance_kick, then do the
 	 * balancing on behalf of the other idle cpus whose ticks are
 	 * stopped.
 	 */
-	if (this_rq->idle_at_tick &&
-	    atomic_read(&nohz.load_balancer) == this_cpu) {
-		struct rq *rq;
-		int balance_cpu;
-
-		for_each_cpu(balance_cpu, nohz.cpu_mask) {
-			if (balance_cpu == this_cpu)
-				continue;
-
-			/*
-			 * If this cpu gets work to do, stop the load balancing
-			 * work being done for other cpus. Next load
-			 * balancing owner will pick it up.
-			 */
-			if (need_resched())
-				break;
-
-			rq = cpu_rq(balance_cpu);
-			raw_spin_lock_irq(&rq->lock);
-			update_cpu_load(rq);
-			raw_spin_unlock_irq(&rq->lock);
-			rebalance_domains(balance_cpu, CPU_IDLE);
-
-			if (time_after(this_rq->next_balance, rq->next_balance))
-				this_rq->next_balance = rq->next_balance;
-		}
-	}
-#endif
+	nohz_idle_balance(this_cpu, idle);
 }
 
 static inline int on_null_domain(int cpu)


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 7/7] timers: use nearest busy cpu for migrating timers from an idle cpu
  2010-05-17 18:27 ` [patch 7/7] timers: use nearest busy cpu for migrating timers from an idle cpu Suresh Siddha
@ 2010-06-01 23:37   ` Vaidyanathan Srinivasan
  2010-06-02 22:02     ` Suresh Siddha
  0 siblings, 1 reply; 28+ messages in thread
From: Vaidyanathan Srinivasan @ 2010-06-01 23:37 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Arjan van de Ven,
	Venkatesh Pallipadi, ego, LKML, Dominik Brodowski,
	Nigel Cunningham

* Suresh Siddha <suresh.b.siddha@intel.com> [2010-05-17 11:27:33]:

> Currently we are migrating the unpinned timers from an idle to the cpu
> doing idle load balancing (when all the cpus in the system are idle,
> there is no idle load balancing cpu and timers get added to the same idle cpu
> where the request was made. So the current optimization works only on semi idle
> system).
> 
> And In semi idle system, we no longer have periodic ticks on the idle cpu
> doing the idle load balancing on behalf of all the cpu's. Using that cpu
> will add more delays to the timers than intended (as that cpu's timer base
> may not be uptodate wrt jiffies etc). This was causing mysterious slowdowns
> during boot etc.

Hi Suresh,

Can you please give more info on why this caused delays in bootup or timer
events?  The jiffies should be updated even with the current push model,
right?  We will still have some pinned timers on the idle cpu and the
time base will have to be updated when the timer event happens.

> 
> For now, in the semi idle case, use the nearest busy cpu for migrating timers from an
> idle cpu.  This is good for power-savings anyway.

Yes.  This is a good solution.  But on a large system the only running
cpu may accumulate too many timers, which could affect the performance of
the task running.  We will need to test this out.

> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> ---
>  include/linux/sched.h |    2 +-
>  kernel/hrtimer.c      |    8 ++------
>  kernel/sched.c        |   13 +++++++++++++
>  kernel/timer.c        |    8 ++------
>  4 files changed, 18 insertions(+), 13 deletions(-)
> 
> Index: tip/kernel/hrtimer.c
> ===================================================================
> --- tip.orig/kernel/hrtimer.c
> +++ tip/kernel/hrtimer.c
> @@ -144,12 +144,8 @@ struct hrtimer_clock_base *lock_hrtimer_
>  static int hrtimer_get_target(int this_cpu, int pinned)
>  {
>  #ifdef CONFIG_NO_HZ
> -	if (!pinned && get_sysctl_timer_migration() && idle_cpu(this_cpu)) {
> -		int preferred_cpu = get_nohz_load_balancer();
> -
> -		if (preferred_cpu < nr_cpu_ids)
> -			return preferred_cpu;
> -	}
> +	if (!pinned && get_sysctl_timer_migration() && idle_cpu(this_cpu))
> +		return get_nohz_timer_target();
>  #endif
>  	return this_cpu;
>  }
> Index: tip/kernel/sched.c
> ===================================================================
> --- tip.orig/kernel/sched.c
> +++ tip/kernel/sched.c
> @@ -1201,6 +1201,19 @@ static void resched_cpu(int cpu)
>  }
> 
>  #ifdef CONFIG_NO_HZ
> +int get_nohz_timer_target(void)
> +{
> +	int cpu = smp_processor_id();
> +	int i;
> +	struct sched_domain *sd;
> +
> +	for_each_domain(cpu, sd) {
> +		for_each_cpu(i, sched_domain_span(sd))
> +			if (!idle_cpu(i))
> +				return i;
> +	}
> +	return cpu;
> +}

We will need a better way of finding the right CPU since this code
will take a long time on a larger system with only one or two busy cpus.

We should perhaps pick the cpu from the complement of the current
nohz.grp_idle_mask or something derived from these masks instead of
searching in the sched domain.  The only advantage I see in the sched
domain search is that we will get the nearest busy CPU (as in, in the same
node), which is better.
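
Something along the lines of the below, perhaps (a rough sketch, purely
illustrative: it picks from the complement of nohz.idle_cpus_mask and gives
up the nearest-node property of the sched domain walk):

static int get_nohz_timer_target_from_mask(void)
{
	int cpu;

	/* busy cpus are the online cpus not in the nohz idle mask */
	for_each_online_cpu(cpu)
		if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
			return cpu;

	/* everything is idle: keep the timer local */
	return smp_processor_id();
}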

--Vaidy

[snip]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 4/7] sched: Change nohz ilb logic from pull to push model
  2010-05-17 18:27 ` [patch 4/7] sched: Change nohz ilb logic from pull to push model Suresh Siddha
@ 2010-06-01 23:47   ` Vaidyanathan Srinivasan
  2010-06-02 22:27     ` Suresh Siddha
  0 siblings, 1 reply; 28+ messages in thread
From: Vaidyanathan Srinivasan @ 2010-06-01 23:47 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Arjan van de Ven,
	Venkatesh Pallipadi, ego, LKML, Dominik Brodowski,
	Nigel Cunningham

* Suresh Siddha <suresh.b.siddha@intel.com> [2010-05-17 11:27:30]:

> From: Venkatesh Pallipadi <venki@google.com>
> Subject: sched: Change nohz ilb logic from pull to push model
> 
> In the new push model, all idle CPUs indeed go into nohz mode. There is
> still the concept of idle load balancer. Busy CPU kicks the nohz
> balancer when any of the nohz CPUs need idle load balancing.
> The kickee CPU does the idle load balancing on behalf of all idle CPUs
> instead of the normal idle balance.
> 
> This addresses the below two problems with the current nohz ilb logic:
> * the balancer will continue to have periodic ticks and wakeup
>   frequently, even though it may not have any rebalancing to do on
>   behalf of any of the idle CPUs.
> * On x86 and CPUs that have APIC timer stoppage on idle CPUs, this
>   periodic wakeup can result in an additional interrupt on a CPU
>   doing the timer broadcast.

How do we select the timer broadcast cpu today?  Does it change at all
at run time?  Maybe that CPU would be a good target for the timer
migration so that additional CPUs are not woken up from idle states.

> Signed-off-by: Venkatesh Pallipadi <venki@google.com>
> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> ---
>  kernel/sched.c      |   13 +-
>  kernel/sched_fair.c |  283 ++++++++++++++++++++++++++++++----------------------
>  2 files changed, 177 insertions(+), 119 deletions(-)
> 
> Index: tip/kernel/sched_fair.c
> ===================================================================
> --- tip.orig/kernel/sched_fair.c
> +++ tip/kernel/sched_fair.c
> @@ -3081,10 +3081,26 @@ out_unlock:
>  }
> 
>  #ifdef CONFIG_NO_HZ
> +
> +static DEFINE_PER_CPU(struct call_single_data, remote_sched_softirq_cb);
> +
> +/*
> + * idle load balancing details
> + * - One of the idle CPUs nominates itself as idle load_balancer, while
> + *   entering idle.
> + * - This idle load balancer CPU will also go into tickless mode when
> + *   it is idle, just like all other idle CPUs
> + * - When one of the busy CPUs notice that there may be an idle rebalancing
> + *   needed, they will kick the idle load balancer, which then does idle
> + *   load balancing for all the idle CPUs.
> + */
>  static struct {
>  	atomic_t load_balancer;
> -	cpumask_var_t cpu_mask;
> -	cpumask_var_t ilb_grp_nohz_mask;
> +	atomic_t first_pick_cpu;
> +	atomic_t second_pick_cpu;
> +	cpumask_var_t idle_cpus_mask;
> +	cpumask_var_t grp_idle_mask;
> +	unsigned long next_balance;     /* in jiffy units */
>  } nohz ____cacheline_aligned = {
>  	.load_balancer = ATOMIC_INIT(-1),
>  };
> @@ -3141,17 +3157,17 @@ static inline struct sched_domain *lowes
>   */
>  static inline int is_semi_idle_group(struct sched_group *ilb_group)
>  {
> -	cpumask_and(nohz.ilb_grp_nohz_mask, nohz.cpu_mask,
> +	cpumask_and(nohz.grp_idle_mask, nohz.idle_cpus_mask,
>  					sched_group_cpus(ilb_group));
> 
>  	/*
>  	 * A sched_group is semi-idle when it has atleast one busy cpu
>  	 * and atleast one idle cpu.
>  	 */
> -	if (cpumask_empty(nohz.ilb_grp_nohz_mask))
> +	if (cpumask_empty(nohz.grp_idle_mask))
>  		return 0;
> 
> -	if (cpumask_equal(nohz.ilb_grp_nohz_mask, sched_group_cpus(ilb_group)))
> +	if (cpumask_equal(nohz.grp_idle_mask, sched_group_cpus(ilb_group)))
>  		return 0;
> 
>  	return 1;
> @@ -3184,7 +3200,7 @@ static int find_new_ilb(int cpu)
>  	 * Optimize for the case when we have no idle CPUs or only one
>  	 * idle CPU. Don't walk the sched_domain hierarchy in such cases
>  	 */
> -	if (cpumask_weight(nohz.cpu_mask) < 2)
> +	if (cpumask_weight(nohz.idle_cpus_mask) < 2)
>  		goto out_done;
> 
>  	for_each_flag_domain(cpu, sd, SD_POWERSAVINGS_BALANCE) {
> @@ -3192,7 +3208,7 @@ static int find_new_ilb(int cpu)
> 
>  		do {
>  			if (is_semi_idle_group(ilb_group))
> -				return cpumask_first(nohz.ilb_grp_nohz_mask);
> +				return cpumask_first(nohz.grp_idle_mask);
> 
>  			ilb_group = ilb_group->next;
> 
> @@ -3200,42 +3216,61 @@ static int find_new_ilb(int cpu)
>  	}
> 
>  out_done:
> -	return cpumask_first(nohz.cpu_mask);
> +	return nr_cpu_ids;
>  }
>  #else /*  (CONFIG_SCHED_MC || CONFIG_SCHED_SMT) */
>  static inline int find_new_ilb(int call_cpu)
>  {
> -	return cpumask_first(nohz.cpu_mask);
> +	return nr_cpu_ids;
>  }
>  #endif
> 
>  /*
> + * Kick a CPU to do the nohz balancing, if it is time for it. We pick the
> + * nohz_load_balancer CPU (if there is one) otherwise fallback to any idle
> + * CPU (if there is one).
> + */
> +static void nohz_balancer_kick(int cpu)
> +{
> +	int ilb_cpu;
> +
> +	nohz.next_balance++;
> +
> +	ilb_cpu = get_nohz_load_balancer();
> +	if (ilb_cpu < 0) {
> +		ilb_cpu = cpumask_first(nohz.idle_cpus_mask);
> +		if (ilb_cpu >= nr_cpu_ids)
> +			return;
> +	}
> +
> +	if (!cpu_rq(ilb_cpu)->nohz_balance_kick) {
> +		struct call_single_data *cp;
> +
> +		cpu_rq(ilb_cpu)->nohz_balance_kick = 1;
> +		cp = &per_cpu(remote_sched_softirq_cb, cpu);
> +		send_remote_softirq(cp, ilb_cpu, 0);
> +	}
> +	return;
> +}
> +
> +/*
>   * This routine will try to nominate the ilb (idle load balancing)
>   * owner among the cpus whose ticks are stopped. ilb owner will do the idle
> - * load balancing on behalf of all those cpus. If all the cpus in the system
> - * go into this tickless mode, then there will be no ilb owner (as there is
> - * no need for one) and all the cpus will sleep till the next wakeup event
> - * arrives...
> - *
> - * For the ilb owner, tick is not stopped. And this tick will be used
> - * for idle load balancing. ilb owner will still be part of
> - * nohz.cpu_mask..
> + * load balancing on behalf of all those cpus.
>   *
> - * While stopping the tick, this cpu will become the ilb owner if there
> - * is no other owner. And will be the owner till that cpu becomes busy
> - * or if all cpus in the system stop their ticks at which point
> - * there is no need for ilb owner.
> - *
> - * When the ilb owner becomes busy, it nominates another owner, during the
> - * next busy scheduler_tick()
> + * When the ilb owner becomes busy, we will not have new ilb owner until some
> + * idle CPU wakes up and goes back to idle or some busy CPU tries to kick
> + * idle load balancing by kicking one of the idle CPUs.
> + *
> + * Ticks are stopped for the ilb owner as well, with busy CPU kicking this
> + * ilb owner CPU in future (when there is a need for idle load balancing on
> + * behalf of all idle CPUs).
>   */
>  int select_nohz_load_balancer(int stop_tick)
>  {
>  	int cpu = smp_processor_id();
> 
>  	if (stop_tick) {
> -		cpu_rq(cpu)->in_nohz_recently = 1;
> -
>  		if (!cpu_active(cpu)) {
>  			if (atomic_read(&nohz.load_balancer) != cpu)
>  				return 0;
> @@ -3250,25 +3285,20 @@ int select_nohz_load_balancer(int stop_t
>  			return 0;
>  		}
> 
> -		cpumask_set_cpu(cpu, nohz.cpu_mask);
> +		cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
> 
> -		/* time for ilb owner also to sleep */
> -		if (cpumask_weight(nohz.cpu_mask) == num_active_cpus()) {
> -			if (atomic_read(&nohz.load_balancer) == cpu)
> -				atomic_set(&nohz.load_balancer, -1);
> -			return 0;
> -		}
> +		if (atomic_read(&nohz.first_pick_cpu) == cpu)
> +			atomic_cmpxchg(&nohz.first_pick_cpu, cpu, nr_cpu_ids);
> +		if (atomic_read(&nohz.second_pick_cpu) == cpu)
> +			atomic_cmpxchg(&nohz.second_pick_cpu, cpu, nr_cpu_ids);
> 
>  		if (atomic_read(&nohz.load_balancer) == -1) {
> -			/* make me the ilb owner */
> -			if (atomic_cmpxchg(&nohz.load_balancer, -1, cpu) == -1)
> -				return 1;
> -		} else if (atomic_read(&nohz.load_balancer) == cpu) {
>  			int new_ilb;
> 
> -			if (!(sched_smt_power_savings ||
> -						sched_mc_power_savings))
> -				return 1;
> +			/* make me the ilb owner */
> +			if (atomic_cmpxchg(&nohz.load_balancer, -1, cpu) != -1)
> +				return 0;
> +
>  			/*
>  			 * Check to see if there is a more power-efficient
>  			 * ilb.
> @@ -3279,13 +3309,13 @@ int select_nohz_load_balancer(int stop_t
>  				resched_cpu(new_ilb);
>  				return 0;
>  			}
> -			return 1;
> +			return 0;
>  		}
>  	} else {
> -		if (!cpumask_test_cpu(cpu, nohz.cpu_mask))
> +		if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
>  			return 0;
> 
> -		cpumask_clear_cpu(cpu, nohz.cpu_mask);
> +		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
> 
>  		if (atomic_read(&nohz.load_balancer) == cpu)
>  			if (atomic_cmpxchg(&nohz.load_balancer, cpu, -1) != cpu)
> @@ -3373,11 +3403,97 @@ out:
>  		rq->next_balance = next_balance;
>  }
> 
> +#ifdef CONFIG_NO_HZ
>  /*
> - * run_rebalance_domains is triggered when needed from the scheduler tick.
> - * In CONFIG_NO_HZ case, the idle load balance owner will do the
> + * In CONFIG_NO_HZ case, the idle balance kickee will do the
>   * rebalancing for all the cpus for whom scheduler ticks are stopped.
>   */
> +static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle)
> +{
> +	struct rq *this_rq = cpu_rq(this_cpu);
> +	struct rq *rq;
> +	int balance_cpu;
> +
> +	if (idle != CPU_IDLE || !this_rq->nohz_balance_kick)
> +		return;
> +
> +	for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
> +		if (balance_cpu == this_cpu)
> +			continue;
> +
> +		/*
> +		 * If this cpu gets work to do, stop the load balancing
> +		 * work being done for other cpus. Next load
> +		 * balancing owner will pick it up.
> +		 */
> +		if (need_resched()) {
> +			this_rq->nohz_balance_kick = 0;
> +			break;
> +		}
> +
> +		rebalance_domains(balance_cpu, CPU_IDLE);
> +
> +		rq = cpu_rq(balance_cpu);
> +		if (time_after(this_rq->next_balance, rq->next_balance))
> +			this_rq->next_balance = rq->next_balance;
> +	}
> +	nohz.next_balance = this_rq->next_balance;
> +	this_rq->nohz_balance_kick = 0;
> +}
> +
> +/*
> + * Current heuristic for kicking the idle load balancer
> + * - first_pick_cpu is the one of the busy CPUs. It will kick
> + *   idle load balancer when it has more than one process active. This
> + *   eliminates the need for idle load balancing altogether when we have
> + *   only one running process in the system (common case).
> + * - If there are more than one busy CPU, idle load balancer may have
> + *   to run for active_load_balance to happen (i.e., two busy CPUs are
> + *   SMT or core siblings and can run better if they move to different
> + *   physical CPUs). So, second_pick_cpu is the second of the busy CPUs
> + *   which will kick idle load balancer as soon as it has any load.
> + */
> +static inline int nohz_kick_needed(struct rq *rq, int cpu)
> +{
> +	unsigned long now = jiffies;
> +	int ret;
> +	int first_pick_cpu, second_pick_cpu;
> +
> +	if (time_before(now, nohz.next_balance))
> +		return 0;
> +
> +	if (!rq->nr_running)
> +		return 0;
> +
> +	first_pick_cpu = atomic_read(&nohz.first_pick_cpu);
> +	second_pick_cpu = atomic_read(&nohz.second_pick_cpu);
> +
> +	if (first_pick_cpu < nr_cpu_ids && first_pick_cpu != cpu &&
> +	    second_pick_cpu < nr_cpu_ids && second_pick_cpu != cpu)
> +		return 0;
> +
> +	ret = atomic_cmpxchg(&nohz.first_pick_cpu, nr_cpu_ids, cpu);
> +	if (ret == nr_cpu_ids || ret == cpu) {
> +		atomic_cmpxchg(&nohz.second_pick_cpu, cpu, nr_cpu_ids);
> +		if (rq->nr_running > 1)
> +			return 1;
> +	} else {
> +		ret = atomic_cmpxchg(&nohz.second_pick_cpu, nr_cpu_ids, cpu);
> +		if (ret == nr_cpu_ids || ret == cpu) {
> +			if (rq->nr_running)
> +				return 1;
> +		}
> +	}
> +	return 0;

Hi Suresh,

Can you please give more explanation on how the combination of
first_pick_cpu and second_pick_cpu works?  We need to kick an idle CPU
whenever our cpu or group becomes overloaded, right?  We will have to
prefer cores of other packages when more tasks become ready to run.
So a notion of group overload is needed to kick new cpus.  Also, new
idle CPUs that are kicked should come from other cores and packages
instead of the nearest sibling.

As per the current implementation a sibling thread may be kicked, but
it will not pull the task, as the load balancer will be run on behalf
of all idle cores in the system and then an appropriate idle core will
pull the new task... correct?

--Vaidy

[snip]


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 7/7] timers: use nearest busy cpu for migrating timers from an idle cpu
  2010-06-01 23:37   ` Vaidyanathan Srinivasan
@ 2010-06-02 22:02     ` Suresh Siddha
  0 siblings, 0 replies; 28+ messages in thread
From: Suresh Siddha @ 2010-06-02 22:02 UTC (permalink / raw)
  To: svaidy
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Arjan van de Ven,
	Venkatesh Pallipadi, ego, LKML, Dominik Brodowski,
	Nigel Cunningham

On Tue, 2010-06-01 at 16:37 -0700, Vaidyanathan Srinivasan wrote:
> * Suresh Siddha <suresh.b.siddha@intel.com> [2010-05-17 11:27:33]:
> 
> > Currently we are migrating the unpinned timers from an idle to the cpu
> > doing idle load balancing (when all the cpus in the system are idle,
> > there is no idle load balancing cpu and timers get added to the same idle cpu
> > where the request was made. So the current optimization works only on semi idle
> > system).
> > 
> > And In semi idle system, we no longer have periodic ticks on the idle cpu
> > doing the idle load balancing on behalf of all the cpu's. Using that cpu
> > will add more delays to the timers than intended (as that cpu's timer base
> > may not be uptodate wrt jiffies etc). This was causing mysterious slowdowns
> > during boot etc.
> 
> Hi Suresh,
> 
> Can you please give more info on why this caused delays in bootup or timer
> events?  The jiffies should be updated even with the current push model,
> right?  We will still have some pinned timers on the idle cpu and the
> time base will have to be updated when the timer event happens.

With these changes, the idle load balancer doesn't have periodic ticks in
idle. So for that cpu, timer_jiffies in the timer base won't be up to date
when another idle cpu adds a timer to this cpu. So we will introduce more
delay for the timers than expected.

> > For now, in the semi idle case, use the nearest busy cpu for migrating timers from an
> > idle cpu.  This is good for power-savings anyway.
> 
> Yes.  This is a good solution.  But on a large system the only running
> cpu may accumulate too many timers, which could affect the performance of
> the task running.  We will need to test this out.

Yes. It will be good to measure the impact of this on big systems. If we see
any performance issues, we can migrate the timers only when the
power-savings tunable is selected.

> 
> >  #ifdef CONFIG_NO_HZ
> > +int get_nohz_timer_target(void)
> > +{
> > +	int cpu = smp_processor_id();
> > +	int i;
> > +	struct sched_domain *sd;
> > +
> > +	for_each_domain(cpu, sd) {
> > +		for_each_cpu(i, sched_domain_span(sd))
> > +			if (!idle_cpu(i))
> > +				return i;
> > +	}
> > +	return cpu;
> > +}
> 
> We will need a better way of finding the right CPU since this code
> will take a long time on a larger system with only one or two busy cpus.
> 
> We should perhaps pick the cpu from the complement of the current
> nohz.grp_idle_mask or something derived from these masks instead of
> searching in the sched domain.  The only advantage I see in the sched
> domain search is that we will get the nearest busy CPU (as in, in the same
> node), which is better.

Yes, I wanted to migrate the timers to the nearest busy cpu as well as
distribute the load among all the busy cpus (if there are multiple cpus
busy).

thanks,
suresh


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 4/7] sched: Change nohz ilb logic from pull to push model
  2010-06-01 23:47   ` Vaidyanathan Srinivasan
@ 2010-06-02 22:27     ` Suresh Siddha
  0 siblings, 0 replies; 28+ messages in thread
From: Suresh Siddha @ 2010-06-02 22:27 UTC (permalink / raw)
  To: svaidy
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Arjan van de Ven,
	Venkatesh Pallipadi, ego, LKML, Dominik Brodowski,
	Nigel Cunningham

On Tue, 2010-06-01 at 16:47 -0700, Vaidyanathan Srinivasan wrote:
> * Suresh Siddha <suresh.b.siddha@intel.com> [2010-05-17 11:27:30]:
> 
> > From: Venkatesh Pallipadi <venki@google.com>
> > Subject: sched: Change nohz ilb logic from pull to push model
> >
> > In the new push model, all idle CPUs indeed go into nohz mode. There is
> > still the concept of idle load balancer. Busy CPU kicks the nohz
> > balancer when any of the nohz CPUs need idle load balancing.
> > The kickee CPU does the idle load balancing on behalf of all idle CPUs
> > instead of the normal idle balance.
> >
> > This addresses the below two problems with the current nohz ilb logic:
> > * the balancer will continue to have periodic ticks and wakeup
> >   frequently, even though it may not have any rebalancing to do on
> >   behalf of any of the idle CPUs.
> > * On x86 and CPUs that have APIC timer stoppage on idle CPUs, this
> >   periodic wakeup can result in an additional interrupt on a CPU
> >   doing the timer broadcast.
> 
> How do we select the timer broadcast cpu today?  Does it change at all
> at run time?  Maybe that CPU would be a good target for the timer
> migration so that additional CPUs are not woken up from idle states.

It is based on who handles irq 0. Anyway, newer generations of cpus
don't have the apic timer stoppage issue. If you want to do more
intelligent timer migrations, then it's better to address that with a
mechanism that works efficiently for all platforms.

> Can you please give more explanation on how the combination of
> first_pick_cpu and second_pick_cpu works?  We need to kick an idle CPU
> whenever our cpu or group becomes overloaded, right?

With this change, the need for idle load balancing (i.e., kicking an idle cpu
to do idle load balancing on behalf of all idle cpus) is determined when
there is only one busy cpu with more than one task, or when more than one
cpu is busy.
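
In other words, the check boils down to something like the below sketch
(nr_busy_cpus is a hypothetical count here; the actual code avoids
maintaining it and instead infers the same information from the
first_pick_cpu/second_pick_cpu atomics in nohz_kick_needed()):

static int nohz_kick_needed_simplified(struct rq *this_rq, int nr_busy_cpus)
{
	if (nr_busy_cpus == 1)
		/* a single busy cpu: kick only when it is overloaded */
		return this_rq->nr_running > 1;

	/* more than one busy cpu: kick as soon as this one has any load */
	return nr_busy_cpus > 1 && this_rq->nr_running;
}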

A sched group is relevant only if we are balancing at a particular domain.
Traversing all the groups and working out the group load vs capacity
from the busy cpu would add more overhead to the busy cpu.

We need more intelligence about when to do idle load balancing and when to
stop it. The current proposal is a simple fix so that the idle core in a
semi-idle laptop/netbook does not have periodic ticks in idle.

>  We will have to
> prefer cores of other packages when more tasks become ready to run.
> So a notion of group overload is needed to kick new cpus.  Also, new
> idle CPUs that are kicked should come from other cores and packages
> instead of the nearest sibling.

The kicked cpu can be the nearest idle core to the busy core. It can do idle
load balancing, and the actual load can move to a far away core (for the perf
policy) or the nearest core (for the power-savings policy). Any intelligent
heuristics to do this with minimal disturbance to the busy cpus are
welcome.

> 
> As per the current implementation a sibling thread may be kicked but
> it will not pull the task as the load balancer will be run on behalf
> of all idle cores in the system and then an appropriate idle core will
> pull the new task... correct?

Yes.

thanks,
suresh


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [tip:sched/core] sched: Change nohz idle load balancing logic to push model
  2010-05-22  0:09   ` Suresh Siddha
  2010-05-31  9:17     ` Peter Zijlstra
@ 2010-06-09 10:13     ` tip-bot for Venkatesh Pallipadi
  1 sibling, 0 replies; 28+ messages in thread
From: tip-bot for Venkatesh Pallipadi @ 2010-06-09 10:13 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, suresh.b.siddha, tglx,
	venki, mingo

Commit-ID:  83cd4fe27ad8446619b2e030b171b858501de87d
Gitweb:     http://git.kernel.org/tip/83cd4fe27ad8446619b2e030b171b858501de87d
Author:     Venkatesh Pallipadi <venki@google.com>
AuthorDate: Fri, 21 May 2010 17:09:41 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 9 Jun 2010 10:34:52 +0200

sched: Change nohz idle load balancing logic to push model

In the new push model, all idle CPUs indeed go into nohz mode. There is
still the concept of an idle load balancer (performing the load balancing
on behalf of all the idle cpus in the system). A busy CPU kicks the nohz
balancer when any of the nohz CPUs need idle load balancing.
The kickee CPU does the idle load balancing on behalf of all idle CPUs
instead of the normal idle balance.

This addresses the below two problems with the current nohz ilb logic:
* the idle load balancer continued to have periodic ticks during idle and
  woke up frequently, even though it did not have any rebalancing to do on
  behalf of any of the idle CPUs.
* On x86 and CPUs that have APIC timer stoppage on idle CPUs, this
  periodic wakeup can result in a periodic additional interrupt on a CPU
  doing the timer broadcast.

Also, currently we are migrating the unpinned timers from an idle cpu to the
cpu doing idle load balancing (when all the cpus in the system are idle,
there is no idle load balancing cpu and timers get added to the same idle cpu
where the request was made, so the existing optimization works only on a
semi idle system).

And in a semi idle system, we no longer have periodic ticks on the idle load
balancer CPU. Using that cpu would add more delays to the timers than intended
(as that cpu's timer base may not be up to date wrt jiffies etc). This was
causing mysterious slowdowns during boot etc.

For now, in the semi idle case, use the nearest busy cpu for migrating timers
from an idle cpu.  This is good for power-savings anyway.

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <1274486981.2840.46.camel@sbs-t61.sc.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h    |    9 +-
 kernel/hrtimer.c         |    8 +-
 kernel/sched.c           |   34 +++++-
 kernel/sched_fair.c      |  329 ++++++++++++++++++++++++++++------------------
 kernel/time/tick-sched.c |    8 +-
 kernel/timer.c           |    8 +-
 6 files changed, 237 insertions(+), 159 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c2d4316..a3e5b1c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -271,14 +271,11 @@ extern int runqueue_is_locked(int cpu);
 
 extern cpumask_var_t nohz_cpu_mask;
 #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ)
-extern int select_nohz_load_balancer(int cpu);
-extern int get_nohz_load_balancer(void);
+extern void select_nohz_load_balancer(int stop_tick);
+extern int get_nohz_timer_target(void);
 extern int nohz_ratelimit(int cpu);
 #else
-static inline int select_nohz_load_balancer(int cpu)
-{
-	return 0;
-}
+static inline void select_nohz_load_balancer(int stop_tick) { }
 
 static inline int nohz_ratelimit(int cpu)
 {
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 5c69e99..e934339 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -144,12 +144,8 @@ struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer,
 static int hrtimer_get_target(int this_cpu, int pinned)
 {
 #ifdef CONFIG_NO_HZ
-	if (!pinned && get_sysctl_timer_migration() && idle_cpu(this_cpu)) {
-		int preferred_cpu = get_nohz_load_balancer();
-
-		if (preferred_cpu >= 0)
-			return preferred_cpu;
-	}
+	if (!pinned && get_sysctl_timer_migration() && idle_cpu(this_cpu))
+		return get_nohz_timer_target();
 #endif
 	return this_cpu;
 }
diff --git a/kernel/sched.c b/kernel/sched.c
index a757f6b..132950b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -460,7 +460,7 @@ struct rq {
 	unsigned long last_load_update_tick;
 #ifdef CONFIG_NO_HZ
 	u64 nohz_stamp;
-	unsigned char in_nohz_recently;
+	unsigned char nohz_balance_kick;
 #endif
 	unsigned int skip_clock_update;
 
@@ -1195,6 +1195,27 @@ static void resched_cpu(int cpu)
 
 #ifdef CONFIG_NO_HZ
 /*
+ * In the semi idle case, use the nearest busy cpu for migrating timers
+ * from an idle cpu.  This is good for power-savings.
+ *
+ * We don't do similar optimization for completely idle system, as
+ * selecting an idle cpu will add more delays to the timers than intended
+ * (as that cpu's timer base may not be uptodate wrt jiffies etc).
+ */
+int get_nohz_timer_target(void)
+{
+	int cpu = smp_processor_id();
+	int i;
+	struct sched_domain *sd;
+
+	for_each_domain(cpu, sd) {
+		for_each_cpu(i, sched_domain_span(sd))
+			if (!idle_cpu(i))
+				return i;
+	}
+	return cpu;
+}
+/*
  * When add_timer_on() enqueues a timer into the timer wheel of an
  * idle CPU then this timer might expire before the next timer event
  * which is scheduled to wake up that CPU. In case of a completely
@@ -7791,6 +7812,10 @@ void __init sched_init(void)
 		rq->idle_stamp = 0;
 		rq->avg_idle = 2*sysctl_sched_migration_cost;
 		rq_attach_root(rq, &def_root_domain);
+#ifdef CONFIG_NO_HZ
+		rq->nohz_balance_kick = 0;
+		init_sched_softirq_csd(&per_cpu(remote_sched_softirq_cb, i));
+#endif
 #endif
 		init_rq_hrtick(rq);
 		atomic_set(&rq->nr_iowait, 0);
@@ -7835,8 +7860,11 @@ void __init sched_init(void)
 	zalloc_cpumask_var(&nohz_cpu_mask, GFP_NOWAIT);
 #ifdef CONFIG_SMP
 #ifdef CONFIG_NO_HZ
-	zalloc_cpumask_var(&nohz.cpu_mask, GFP_NOWAIT);
-	alloc_cpumask_var(&nohz.ilb_grp_nohz_mask, GFP_NOWAIT);
+	zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
+	alloc_cpumask_var(&nohz.grp_idle_mask, GFP_NOWAIT);
+	atomic_set(&nohz.load_balancer, nr_cpu_ids);
+	atomic_set(&nohz.first_pick_cpu, nr_cpu_ids);
+	atomic_set(&nohz.second_pick_cpu, nr_cpu_ids);
 #endif
 	/* May be allocated at isolcpus cmdline parse time */
 	if (cpu_isolated_map == NULL)
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 22b8b4f..6ee2e0a 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -3091,13 +3091,40 @@ out_unlock:
 }
 
 #ifdef CONFIG_NO_HZ
+
+static DEFINE_PER_CPU(struct call_single_data, remote_sched_softirq_cb);
+
+static void trigger_sched_softirq(void *data)
+{
+	raise_softirq_irqoff(SCHED_SOFTIRQ);
+}
+
+static inline void init_sched_softirq_csd(struct call_single_data *csd)
+{
+	csd->func = trigger_sched_softirq;
+	csd->info = NULL;
+	csd->flags = 0;
+	csd->priv = 0;
+}
+
+/*
+ * idle load balancing details
+ * - One of the idle CPUs nominates itself as idle load_balancer, while
+ *   entering idle.
+ * - This idle load balancer CPU will also go into tickless mode when
+ *   it is idle, just like all other idle CPUs
+ * - When one of the busy CPUs notice that there may be an idle rebalancing
+ *   needed, they will kick the idle load balancer, which then does idle
+ *   load balancing for all the idle CPUs.
+ */
 static struct {
 	atomic_t load_balancer;
-	cpumask_var_t cpu_mask;
-	cpumask_var_t ilb_grp_nohz_mask;
-} nohz ____cacheline_aligned = {
-	.load_balancer = ATOMIC_INIT(-1),
-};
+	atomic_t first_pick_cpu;
+	atomic_t second_pick_cpu;
+	cpumask_var_t idle_cpus_mask;
+	cpumask_var_t grp_idle_mask;
+	unsigned long next_balance;     /* in jiffy units */
+} nohz ____cacheline_aligned;
 
 int get_nohz_load_balancer(void)
 {
@@ -3151,17 +3178,17 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
  */
 static inline int is_semi_idle_group(struct sched_group *ilb_group)
 {
-	cpumask_and(nohz.ilb_grp_nohz_mask, nohz.cpu_mask,
+	cpumask_and(nohz.grp_idle_mask, nohz.idle_cpus_mask,
 					sched_group_cpus(ilb_group));
 
 	/*
 	 * A sched_group is semi-idle when it has atleast one busy cpu
 	 * and atleast one idle cpu.
 	 */
-	if (cpumask_empty(nohz.ilb_grp_nohz_mask))
+	if (cpumask_empty(nohz.grp_idle_mask))
 		return 0;
 
-	if (cpumask_equal(nohz.ilb_grp_nohz_mask, sched_group_cpus(ilb_group)))
+	if (cpumask_equal(nohz.grp_idle_mask, sched_group_cpus(ilb_group)))
 		return 0;
 
 	return 1;
@@ -3194,7 +3221,7 @@ static int find_new_ilb(int cpu)
 	 * Optimize for the case when we have no idle CPUs or only one
 	 * idle CPU. Don't walk the sched_domain hierarchy in such cases
 	 */
-	if (cpumask_weight(nohz.cpu_mask) < 2)
+	if (cpumask_weight(nohz.idle_cpus_mask) < 2)
 		goto out_done;
 
 	for_each_flag_domain(cpu, sd, SD_POWERSAVINGS_BALANCE) {
@@ -3202,7 +3229,7 @@ static int find_new_ilb(int cpu)
 
 		do {
 			if (is_semi_idle_group(ilb_group))
-				return cpumask_first(nohz.ilb_grp_nohz_mask);
+				return cpumask_first(nohz.grp_idle_mask);
 
 			ilb_group = ilb_group->next;
 
@@ -3210,98 +3237,116 @@ static int find_new_ilb(int cpu)
 	}
 
 out_done:
-	return cpumask_first(nohz.cpu_mask);
+	return nr_cpu_ids;
 }
 #else /*  (CONFIG_SCHED_MC || CONFIG_SCHED_SMT) */
 static inline int find_new_ilb(int call_cpu)
 {
-	return cpumask_first(nohz.cpu_mask);
+	return nr_cpu_ids;
 }
 #endif
 
 /*
+ * Kick a CPU to do the nohz balancing, if it is time for it. We pick the
+ * nohz_load_balancer CPU (if there is one) otherwise fallback to any idle
+ * CPU (if there is one).
+ */
+static void nohz_balancer_kick(int cpu)
+{
+	int ilb_cpu;
+
+	nohz.next_balance++;
+
+	ilb_cpu = get_nohz_load_balancer();
+
+	if (ilb_cpu >= nr_cpu_ids) {
+		ilb_cpu = cpumask_first(nohz.idle_cpus_mask);
+		if (ilb_cpu >= nr_cpu_ids)
+			return;
+	}
+
+	if (!cpu_rq(ilb_cpu)->nohz_balance_kick) {
+		struct call_single_data *cp;
+
+		cpu_rq(ilb_cpu)->nohz_balance_kick = 1;
+		cp = &per_cpu(remote_sched_softirq_cb, cpu);
+		__smp_call_function_single(ilb_cpu, cp, 0);
+	}
+	return;
+}
+
+/*
  * This routine will try to nominate the ilb (idle load balancing)
  * owner among the cpus whose ticks are stopped. ilb owner will do the idle
- * load balancing on behalf of all those cpus. If all the cpus in the system
- * go into this tickless mode, then there will be no ilb owner (as there is
- * no need for one) and all the cpus will sleep till the next wakeup event
- * arrives...
- *
- * For the ilb owner, tick is not stopped. And this tick will be used
- * for idle load balancing. ilb owner will still be part of
- * nohz.cpu_mask..
+ * load balancing on behalf of all those cpus.
  *
- * While stopping the tick, this cpu will become the ilb owner if there
- * is no other owner. And will be the owner till that cpu becomes busy
- * or if all cpus in the system stop their ticks at which point
- * there is no need for ilb owner.
+ * When the ilb owner becomes busy, we will not have new ilb owner until some
+ * idle CPU wakes up and goes back to idle or some busy CPU tries to kick
+ * idle load balancing by kicking one of the idle CPUs.
  *
- * When the ilb owner becomes busy, it nominates another owner, during the
- * next busy scheduler_tick()
+ * Ticks are stopped for the ilb owner as well, with busy CPU kicking this
+ * ilb owner CPU in future (when there is a need for idle load balancing on
+ * behalf of all idle CPUs).
  */
-int select_nohz_load_balancer(int stop_tick)
+void select_nohz_load_balancer(int stop_tick)
 {
 	int cpu = smp_processor_id();
 
 	if (stop_tick) {
-		cpu_rq(cpu)->in_nohz_recently = 1;
-
 		if (!cpu_active(cpu)) {
 			if (atomic_read(&nohz.load_balancer) != cpu)
-				return 0;
+				return;
 
 			/*
 			 * If we are going offline and still the leader,
 			 * give up!
 			 */
-			if (atomic_cmpxchg(&nohz.load_balancer, cpu, -1) != cpu)
+			if (atomic_cmpxchg(&nohz.load_balancer, cpu,
+					   nr_cpu_ids) != cpu)
 				BUG();
 
-			return 0;
+			return;
 		}
 
-		cpumask_set_cpu(cpu, nohz.cpu_mask);
+		cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
 
-		/* time for ilb owner also to sleep */
-		if (cpumask_weight(nohz.cpu_mask) == num_active_cpus()) {
-			if (atomic_read(&nohz.load_balancer) == cpu)
-				atomic_set(&nohz.load_balancer, -1);
-			return 0;
-		}
+		if (atomic_read(&nohz.first_pick_cpu) == cpu)
+			atomic_cmpxchg(&nohz.first_pick_cpu, cpu, nr_cpu_ids);
+		if (atomic_read(&nohz.second_pick_cpu) == cpu)
+			atomic_cmpxchg(&nohz.second_pick_cpu, cpu, nr_cpu_ids);
 
-		if (atomic_read(&nohz.load_balancer) == -1) {
-			/* make me the ilb owner */
-			if (atomic_cmpxchg(&nohz.load_balancer, -1, cpu) == -1)
-				return 1;
-		} else if (atomic_read(&nohz.load_balancer) == cpu) {
+		if (atomic_read(&nohz.load_balancer) >= nr_cpu_ids) {
 			int new_ilb;
 
-			if (!(sched_smt_power_savings ||
-						sched_mc_power_savings))
-				return 1;
+			/* make me the ilb owner */
+			if (atomic_cmpxchg(&nohz.load_balancer, nr_cpu_ids,
+					   cpu) != nr_cpu_ids)
+				return;
+
 			/*
 			 * Check to see if there is a more power-efficient
 			 * ilb.
 			 */
 			new_ilb = find_new_ilb(cpu);
 			if (new_ilb < nr_cpu_ids && new_ilb != cpu) {
-				atomic_set(&nohz.load_balancer, -1);
+				atomic_set(&nohz.load_balancer, nr_cpu_ids);
 				resched_cpu(new_ilb);
-				return 0;
+				return;
 			}
-			return 1;
+			return;
 		}
 	} else {
-		if (!cpumask_test_cpu(cpu, nohz.cpu_mask))
-			return 0;
+		if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
+			return;
 
-		cpumask_clear_cpu(cpu, nohz.cpu_mask);
+		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
 
 		if (atomic_read(&nohz.load_balancer) == cpu)
-			if (atomic_cmpxchg(&nohz.load_balancer, cpu, -1) != cpu)
+			if (atomic_cmpxchg(&nohz.load_balancer, cpu,
+					   nr_cpu_ids) != cpu)
 				BUG();
 	}
-	return 0;
+	return;
 }
 #endif
 
@@ -3383,11 +3428,101 @@ out:
 		rq->next_balance = next_balance;
 }
 
+#ifdef CONFIG_NO_HZ
 /*
- * run_rebalance_domains is triggered when needed from the scheduler tick.
- * In CONFIG_NO_HZ case, the idle load balance owner will do the
+ * In CONFIG_NO_HZ case, the idle balance kickee will do the
  * rebalancing for all the cpus for whom scheduler ticks are stopped.
  */
+static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle)
+{
+	struct rq *this_rq = cpu_rq(this_cpu);
+	struct rq *rq;
+	int balance_cpu;
+
+	if (idle != CPU_IDLE || !this_rq->nohz_balance_kick)
+		return;
+
+	for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
+		if (balance_cpu == this_cpu)
+			continue;
+
+		/*
+		 * If this cpu gets work to do, stop the load balancing
+		 * work being done for other cpus. Next load
+		 * balancing owner will pick it up.
+		 */
+		if (need_resched()) {
+			this_rq->nohz_balance_kick = 0;
+			break;
+		}
+
+		raw_spin_lock_irq(&this_rq->lock);
+		update_cpu_load(this_rq);
+		raw_spin_unlock_irq(&this_rq->lock);
+
+		rebalance_domains(balance_cpu, CPU_IDLE);
+
+		rq = cpu_rq(balance_cpu);
+		if (time_after(this_rq->next_balance, rq->next_balance))
+			this_rq->next_balance = rq->next_balance;
+	}
+	nohz.next_balance = this_rq->next_balance;
+	this_rq->nohz_balance_kick = 0;
+}
+
+/*
+ * Current heuristic for kicking the idle load balancer
+ * - first_pick_cpu is the one of the busy CPUs. It will kick
+ *   idle load balancer when it has more than one process active. This
+ *   eliminates the need for idle load balancing altogether when we have
+ *   only one running process in the system (common case).
+ * - If there are more than one busy CPU, idle load balancer may have
+ *   to run for active_load_balance to happen (i.e., two busy CPUs are
+ *   SMT or core siblings and can run better if they move to different
+ *   physical CPUs). So, second_pick_cpu is the second of the busy CPUs
+ *   which will kick idle load balancer as soon as it has any load.
+ */
+static inline int nohz_kick_needed(struct rq *rq, int cpu)
+{
+	unsigned long now = jiffies;
+	int ret;
+	int first_pick_cpu, second_pick_cpu;
+
+	if (time_before(now, nohz.next_balance))
+		return 0;
+
+	if (!rq->nr_running)
+		return 0;
+
+	first_pick_cpu = atomic_read(&nohz.first_pick_cpu);
+	second_pick_cpu = atomic_read(&nohz.second_pick_cpu);
+
+	if (first_pick_cpu < nr_cpu_ids && first_pick_cpu != cpu &&
+	    second_pick_cpu < nr_cpu_ids && second_pick_cpu != cpu)
+		return 0;
+
+	ret = atomic_cmpxchg(&nohz.first_pick_cpu, nr_cpu_ids, cpu);
+	if (ret == nr_cpu_ids || ret == cpu) {
+		atomic_cmpxchg(&nohz.second_pick_cpu, cpu, nr_cpu_ids);
+		if (rq->nr_running > 1)
+			return 1;
+	} else {
+		ret = atomic_cmpxchg(&nohz.second_pick_cpu, nr_cpu_ids, cpu);
+		if (ret == nr_cpu_ids || ret == cpu) {
+			if (rq->nr_running)
+				return 1;
+		}
+	}
+	return 0;
+}
+#else
+static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) { }
+#endif
+
+/*
+ * run_rebalance_domains is triggered when needed from the scheduler tick.
+ * Also triggered for nohz idle balancing (with nohz_balancing_kick set).
+ */
 static void run_rebalance_domains(struct softirq_action *h)
 {
 	int this_cpu = smp_processor_id();
@@ -3397,40 +3532,12 @@ static void run_rebalance_domains(struct softirq_action *h)
 
 	rebalance_domains(this_cpu, idle);
 
-#ifdef CONFIG_NO_HZ
 	/*
-	 * If this cpu is the owner for idle load balancing, then do the
+	 * If this cpu has a pending nohz_balance_kick, then do the
 	 * balancing on behalf of the other idle cpus whose ticks are
 	 * stopped.
 	 */
-	if (this_rq->idle_at_tick &&
-	    atomic_read(&nohz.load_balancer) == this_cpu) {
-		struct rq *rq;
-		int balance_cpu;
-
-		for_each_cpu(balance_cpu, nohz.cpu_mask) {
-			if (balance_cpu == this_cpu)
-				continue;
-
-			/*
-			 * If this cpu gets work to do, stop the load balancing
-			 * work being done for other cpus. Next load
-			 * balancing owner will pick it up.
-			 */
-			if (need_resched())
-				break;
-
-			rq = cpu_rq(balance_cpu);
-			raw_spin_lock_irq(&rq->lock);
-			update_cpu_load(rq);
-			raw_spin_unlock_irq(&rq->lock);
-			rebalance_domains(balance_cpu, CPU_IDLE);
-
-			if (time_after(this_rq->next_balance, rq->next_balance))
-				this_rq->next_balance = rq->next_balance;
-		}
-	}
-#endif
+	nohz_idle_balance(this_cpu, idle);
 }
 
 static inline int on_null_domain(int cpu)
@@ -3440,57 +3547,17 @@ static inline int on_null_domain(int cpu)
 
 /*
  * Trigger the SCHED_SOFTIRQ if it is time to do periodic load balancing.
- *
- * In case of CONFIG_NO_HZ, this is the place where we nominate a new
- * idle load balancing owner or decide to stop the periodic load balancing,
- * if the whole system is idle.
  */
 static inline void trigger_load_balance(struct rq *rq, int cpu)
 {
-#ifdef CONFIG_NO_HZ
-	/*
-	 * If we were in the nohz mode recently and busy at the current
-	 * scheduler tick, then check if we need to nominate new idle
-	 * load balancer.
-	 */
-	if (rq->in_nohz_recently && !rq->idle_at_tick) {
-		rq->in_nohz_recently = 0;
-
-		if (atomic_read(&nohz.load_balancer) == cpu) {
-			cpumask_clear_cpu(cpu, nohz.cpu_mask);
-			atomic_set(&nohz.load_balancer, -1);
-		}
-
-		if (atomic_read(&nohz.load_balancer) == -1) {
-			int ilb = find_new_ilb(cpu);
-
-			if (ilb < nr_cpu_ids)
-				resched_cpu(ilb);
-		}
-	}
-
-	/*
-	 * If this cpu is idle and doing idle load balancing for all the
-	 * cpus with ticks stopped, is it time for that to stop?
-	 */
-	if (rq->idle_at_tick && atomic_read(&nohz.load_balancer) == cpu &&
-	    cpumask_weight(nohz.cpu_mask) == num_online_cpus()) {
-		resched_cpu(cpu);
-		return;
-	}
-
-	/*
-	 * If this cpu is idle and the idle load balancing is done by
-	 * someone else, then no need raise the SCHED_SOFTIRQ
-	 */
-	if (rq->idle_at_tick && atomic_read(&nohz.load_balancer) != cpu &&
-	    cpumask_test_cpu(cpu, nohz.cpu_mask))
-		return;
-#endif
 	/* Don't need to rebalance while attached to NULL domain */
 	if (time_after_eq(jiffies, rq->next_balance) &&
 	    likely(!on_null_domain(cpu)))
 		raise_softirq(SCHED_SOFTIRQ);
+#ifdef CONFIG_NO_HZ
+	else if (nohz_kick_needed(rq, cpu) && likely(!on_null_domain(cpu)))
+		nohz_balancer_kick(cpu);
+#endif
 }
 
 static void rq_online_fair(struct rq *rq)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 1d7b9bc..5f171f0 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -408,13 +408,7 @@ void tick_nohz_stop_sched_tick(int inidle)
 		 * the scheduler tick in nohz_restart_sched_tick.
 		 */
 		if (!ts->tick_stopped) {
-			if (select_nohz_load_balancer(1)) {
-				/*
-				 * sched tick not stopped!
-				 */
-				cpumask_clear_cpu(cpu, nohz_cpu_mask);
-				goto out;
-			}
+			select_nohz_load_balancer(1);
 
 			ts->idle_tick = hrtimer_get_expires(&ts->sched_timer);
 			ts->tick_stopped = 1;
diff --git a/kernel/timer.c b/kernel/timer.c
index ee305c8..48d6aec 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -679,12 +679,8 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
 	cpu = smp_processor_id();
 
 #if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
-	if (!pinned && get_sysctl_timer_migration() && idle_cpu(cpu)) {
-		int preferred_cpu = get_nohz_load_balancer();
-
-		if (preferred_cpu >= 0)
-			cpu = preferred_cpu;
-	}
+	if (!pinned && get_sysctl_timer_migration() && idle_cpu(cpu))
+		cpu = get_nohz_timer_target();
 #endif
 	new_base = per_cpu(tvec_bases, cpu);
 

^ permalink raw reply related	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2010-06-09 10:14 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
2010-05-17 18:27 [patch 0/7] sched: change nohz idle load balancing logic to push model Suresh Siddha
2010-05-17 18:27 ` [patch 1/7] softirq: Add a no local fallback option to send_remote_softirq Suresh Siddha
2010-05-17 18:27 ` [patch 2/7] softirq: add init_remote_softirq_csd() Suresh Siddha
2010-05-17 18:27 ` [patch 3/7] softirq: avoid softirq_work_list for SCHED_SOFTIRQ when sent remotely Suresh Siddha
2010-05-20  8:12   ` Peter Zijlstra
2010-05-20  8:14     ` David Miller
2010-05-20  8:23       ` Jens Axboe
2010-05-20  8:29         ` Peter Zijlstra
2010-05-20  9:18         ` David Miller
2010-05-17 18:27 ` [patch 4/7] sched: Change nohz ilb logic from pull to push model Suresh Siddha
2010-06-01 23:47   ` Vaidyanathan Srinivasan
2010-06-02 22:27     ` Suresh Siddha
2010-05-17 18:27 ` [patch 5/7] sched: Change select_nohz_load_balancer to return void Suresh Siddha
2010-05-17 18:27 ` [patch 6/7] sched: change nohz.load_balancer to be nr_cpu_ids based Suresh Siddha
2010-05-20  9:49   ` Peter Zijlstra
2010-05-17 18:27 ` [patch 7/7] timers: use nearest busy cpu for migrating timers from an idle cpu Suresh Siddha
2010-06-01 23:37   ` Vaidyanathan Srinivasan
2010-06-02 22:02     ` Suresh Siddha
2010-05-17 22:39 ` [patch 0/7] sched: change nohz idle load balancing logic to push model Nigel Cunningham
2010-05-19  9:19   ` Dominik Brodowski
2010-05-20 10:50 ` Peter Zijlstra
2010-05-22  0:09   ` Suresh Siddha
2010-05-31  9:17     ` Peter Zijlstra
2010-06-09 10:13     ` [tip:sched/core] sched: Change " tip-bot for Venkatesh Pallipadi
2010-05-20 11:07 ` [patch 0/7] sched: change " Nigel Cunningham
2010-05-20 11:17   ` Dominik Brodowski
2010-05-20 11:35     ` Nigel Cunningham
2010-05-20 12:13       ` Dominik Brodowski
