* [patch 0/2] RFC sched: Change nohz ilb logic from poll to push model
@ 2009-06-17 18:26 venkatesh.pallipadi
  2009-06-17 18:26 ` [patch 1/2] RFC sched: Change the nohz ilb logic from pull " venkatesh.pallipadi
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: venkatesh.pallipadi @ 2009-06-17 18:26 UTC (permalink / raw)
  To: Peter Zijlstra, Gautham R Shenoy, Vaidyanathan Srinivasan
  Cc: Ingo Molnar, Thomas Gleixner, Arjan van de Ven, linux-kernel,
	Venkatesh Pallipadi, Suresh Siddha

The existing nohz idle load balance (ilb) logic uses a pull model: one idle
load balancer CPU is nominated on any partially idle system, and that
balancer CPU does not go into nohz mode. Using its periodic tick, the
balancer does the idle balancing on behalf of all the CPUs in nohz mode.

This is not optimal and has a few issues:
* The balancer continues to have periodic ticks and wakes up frequently
  (at HZ rate), even though it may not have any rebalancing to do on
  behalf of any of the idle CPUs.
* On x86, with CPUs whose APIC timer stops when idle, this periodic wakeup
  can result in an additional interrupt on the CPU doing the timer
  broadcast.
* The balancer may end up spending a lot of time doing the balancing on
  behalf of nohz CPUs, especially with an increasing number of sockets
  and cores in the platform.

The alternative is a push model, where all idle CPUs can enter nohz mode
and a busy CPU kicks one of the idle CPUs to take care of idle balancing
on behalf of a group of idle CPUs.
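
To make the intent concrete, here is a minimal sketch of that flow. It is
illustrative only; the real implementation (with the nohz bookkeeping and
locking) is in the patches below, and the helper names busy_cpu_tick() and
kicked_idle_cpu() are made up for this sketch.

/*
 * Illustrative sketch of the push model; see the patches for the real code.
 * busy_cpu_tick() and kicked_idle_cpu() are made-up names.
 */

/* Busy CPU, from the scheduler tick path: */
static void busy_cpu_tick(int cpu)
{
	/* ... normal periodic rebalance of this CPU ... */

	/* If the idle CPUs are due for a rebalance, kick one of them. */
	if (time_after_eq(jiffies, nohz.next_balance)) {
		int ilb_cpu = get_nohz_load_balancer();

		if (ilb_cpu >= 0) {
			cpu_rq(ilb_cpu)->nohz_balance_kick = 1;
			resched_cpu(ilb_cpu);	/* wake it out of nohz */
		}
	}
}

/* Kicked idle CPU, when it comes out of nohz with nothing to run: */
static void kicked_idle_cpu(int this_cpu, struct rq *this_rq)
{
	if (this_rq->nohz_balance_kick) {
		/* Balance on behalf of all CPUs in nohz.cpu_mask. */
		nohz_idle_balance(this_cpu, this_rq);
		this_rq->nohz_balance_kick = 0;
	}
}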

The following patches try that approach. There are still some rough edges
in the patches related to the use of #defines around the code, but we
wanted to get opinions on this approach as an RFC (not for inclusion into
the tree yet).

Thanks,
Venki

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>


* [patch 1/2] RFC sched: Change the nohz ilb logic from pull to push model
  2009-06-17 18:26 [patch 0/2] RFC sched: Change nohz ilb logic from poll to push model venkatesh.pallipadi
@ 2009-06-17 18:26 ` venkatesh.pallipadi
  2009-06-17 18:26 ` [patch 2/2] RFC sched: Scale the nohz_tracker logic by making it per NUMA node venkatesh.pallipadi
  2009-06-17 19:16 ` [patch 0/2] RFC sched: Change nohz ilb logic from poll to push model Vaidyanathan Srinivasan
  2 siblings, 0 replies; 6+ messages in thread
From: venkatesh.pallipadi @ 2009-06-17 18:26 UTC (permalink / raw)
  To: Peter Zijlstra, Gautham R Shenoy, Vaidyanathan Srinivasan
  Cc: Ingo Molnar, Thomas Gleixner, Arjan van de Ven, linux-kernel,
	Venkatesh Pallipadi, Suresh Siddha

[-- Attachment #1: 0001-sched-Change-the-nohz-ilb-logic-from-pull-to-push-m.patch --]
[-- Type: text/plain, Size: 9818 bytes --]

In the new push model, all idle CPUs indeed go into nohz mode. There is
still the concept of an idle load balancer: a busy CPU kicks the nohz
balancer when any of the nohz CPUs need idle load balancing. The kickee
CPU then does the idle load balancing on behalf of all idle CPUs, instead
of the normal idle balance.

This addresses two problems with the current nohz ilb logic:
* The balancer continues to have periodic ticks and wakes up frequently,
  even though it may not have any rebalancing to do on behalf of any of
  the idle CPUs.
* On x86, with CPUs whose APIC timer stops when idle, this periodic wakeup
  can result in an additional interrupt on the CPU doing the timer
  broadcast.

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 kernel/sched.c |  168 +++++++++++++++++++++++++++-----------------------------
 1 files changed, 80 insertions(+), 88 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 80636ed..22fe762 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -627,6 +627,10 @@ struct rq {
 	struct list_head migration_queue;
 #endif
 
+#ifdef CONFIG_NO_HZ
+	unsigned char nohz_balance_kick:1;
+#endif
+
 	/* calc_load related fields */
 	unsigned long calc_load_update;
 	long calc_load_active;
@@ -4409,6 +4413,7 @@ static struct {
 	atomic_t load_balancer;
 	cpumask_var_t cpu_mask;
 	cpumask_var_t ilb_grp_nohz_mask;
+	unsigned long next_balance; /* units in jiffies */
 } nohz ____cacheline_aligned = {
 	.load_balancer = ATOMIC_INIT(-1),
 };
@@ -4476,7 +4481,7 @@ static inline int is_semi_idle_group(struct sched_group *ilb_group)
 	return 1;
 }
 /**
- * find_new_ilb - Finds the optimum idle load balancer for nomination.
+ * find_new_power_opt_ilb - Finds the optimum idle load balancer for nomination.
  * @cpu:	The cpu which is nominating a new idle_load_balancer.
  *
  * Returns:	Returns the id of the idle load balancer if it exists,
@@ -4487,7 +4492,7 @@ static inline int is_semi_idle_group(struct sched_group *ilb_group)
  * completely idle packages/cores just for the purpose of idle load balancing
  * when there are other idle cpu's which are better suited for that job.
  */
-static int find_new_ilb(int cpu)
+static int find_new_power_opt_ilb(int cpu)
 {
 	struct sched_domain *sd;
 	struct sched_group *ilb_group;
@@ -4519,12 +4524,12 @@ static int find_new_ilb(int cpu)
 	}
 
 out_done:
-	return cpumask_first(nohz.cpu_mask);
+	return nr_cpu_ids;
 }
 #else /*  (CONFIG_SCHED_MC || CONFIG_SCHED_SMT) */
-static inline int find_new_ilb(int call_cpu)
+static inline int find_new_power_opt_ilb(int call_cpu)
 {
-	return cpumask_first(nohz.cpu_mask);
+	return nr_cpu_ids;
 }
 #endif
 
@@ -4534,24 +4539,42 @@ int get_nohz_load_balancer(void)
 }
 
 /*
+ * Kick a CPU to do the nohz balancing, if it is time for it. We pick the
+ * nohz_load_balancer CPU (if there is one) otherwise fallback to any idle
+ * CPU (if there is one).
+ */
+static void nohz_balancer_kick(int cpu)
+{
+	int ilb_cpu;
+	unsigned long now = jiffies;
+
+	if (time_before(now, nohz.next_balance))
+		return;
+
+	ilb_cpu = get_nohz_load_balancer();
+	if (ilb_cpu < 0) {
+		ilb_cpu = cpumask_first(nohz.cpu_mask);
+		if (ilb_cpu >= nr_cpu_ids)
+			return;
+	}
+
+	cpu_rq(ilb_cpu)->nohz_balance_kick = 1;
+	resched_cpu(ilb_cpu);
+	return;
+}
+
+/*
  * This routine will try to nominate the ilb (idle load balancing)
  * owner among the cpus whose ticks are stopped. ilb owner will do the idle
- * load balancing on behalf of all those cpus. If all the cpus in the system
- * go into this tickless mode, then there will be no ilb owner (as there is
- * no need for one) and all the cpus will sleep till the next wakeup event
- * arrives...
+ * load balancing on behalf of all those cpus.
  *
- * For the ilb owner, tick is not stopped. And this tick will be used
- * for idle load balancing. ilb owner will still be part of
- * nohz.cpu_mask..
+ * When the ilb owner becomes busy, we will not have new ilb owner until some
+ * idle CPU wakes up and goes back to idle or some busy CPU tries to kick
+ * idle load balancing by kicking one of the idle CPUs.
  *
- * While stopping the tick, this cpu will become the ilb owner if there
- * is no other owner. And will be the owner till that cpu becomes busy
- * or if all cpus in the system stop their ticks at which point
- * there is no need for ilb owner.
- *
- * When the ilb owner becomes busy, it nominates another owner, during the
- * next busy scheduler_tick()
+ * Ticks are stopped for the ilb owner as well, with busy CPU kicking this
+ * ilb CPU in future, when there is a need for idle load balancing on
+ * behalf of all idle CPUs.
  */
 int select_nohz_load_balancer(int stop_tick)
 {
@@ -4576,34 +4599,26 @@ int select_nohz_load_balancer(int stop_tick)
 
 		cpumask_set_cpu(cpu, nohz.cpu_mask);
 
-		/* time for ilb owner also to sleep */
-		if (cpumask_weight(nohz.cpu_mask) == num_online_cpus()) {
-			if (atomic_read(&nohz.load_balancer) == cpu)
-				atomic_set(&nohz.load_balancer, -1);
-			return 0;
-		}
-
 		if (atomic_read(&nohz.load_balancer) == -1) {
-			/* make me the ilb owner */
-			if (atomic_cmpxchg(&nohz.load_balancer, -1, cpu) == -1)
-				return 1;
-		} else if (atomic_read(&nohz.load_balancer) == cpu) {
 			int new_ilb;
 
-			if (!(sched_smt_power_savings ||
-						sched_mc_power_savings))
-				return 1;
+			/* make me the ilb owner */
+			if (atomic_cmpxchg(&nohz.load_balancer, -1, cpu) != -1) 
+				return 0;
+
 			/*
-			 * Check to see if there is a more power-efficient
-			 * ilb.
+			 * Now that, I am the load_balancer, check whether we
+			 * need to find a power optimal load balancer instead
 			 */
-			new_ilb = find_new_ilb(cpu);
+			if (!(sched_smt_power_savings ||
+			      sched_mc_power_savings))
+				return 0;
+
+			new_ilb = find_new_power_opt_ilb(cpu);
 			if (new_ilb < nr_cpu_ids && new_ilb != cpu) {
 				atomic_set(&nohz.load_balancer, -1);
 				resched_cpu(new_ilb);
-				return 0;
 			}
-			return 1;
 		}
 	} else {
 		if (!cpumask_test_cpu(cpu, nohz.cpu_mask))
@@ -4699,8 +4714,6 @@ out:
 
 /*
  * run_rebalance_domains is triggered when needed from the scheduler tick.
- * In CONFIG_NO_HZ case, the idle load balance owner will do the
- * rebalancing for all the cpus for whom scheduler ticks are stopped.
  */
 static void run_rebalance_domains(struct softirq_action *h)
 {
@@ -4710,15 +4723,23 @@ static void run_rebalance_domains(struct softirq_action *h)
 						CPU_IDLE : CPU_NOT_IDLE;
 
 	rebalance_domains(this_cpu, idle);
+}
 
 #ifdef CONFIG_NO_HZ
+/*
+ * In CONFIG_NO_HZ case, the idle balance kickee will do the
+ * rebalancing for all the cpus for whom scheduler ticks are stopped.
+ */
+static void nohz_idle_balance(int this_cpu, struct rq *this_rq)
+{
+	rebalance_domains(this_cpu, CPU_IDLE);
+
 	/*
 	 * If this cpu is the owner for idle load balancing, then do the
 	 * balancing on behalf of the other idle cpus whose ticks are
 	 * stopped.
 	 */
-	if (this_rq->idle_at_tick &&
-	    atomic_read(&nohz.load_balancer) == this_cpu) {
+	if (this_rq->nohz_balance_kick) {
 		struct rq *rq;
 		int balance_cpu;
 
@@ -4740,9 +4761,11 @@ static void run_rebalance_domains(struct softirq_action *h)
 			if (time_after(this_rq->next_balance, rq->next_balance))
 				this_rq->next_balance = rq->next_balance;
 		}
+		nohz.next_balance = this_rq->next_balance;
+		this_rq->nohz_balance_kick = 0;
 	}
-#endif
 }
+#endif
 
 static inline int on_null_domain(int cpu)
 {
@@ -4751,57 +4774,17 @@ static inline int on_null_domain(int cpu)
 
 /*
  * Trigger the SCHED_SOFTIRQ if it is time to do periodic load balancing.
- *
- * In case of CONFIG_NO_HZ, this is the place where we nominate a new
- * idle load balancing owner or decide to stop the periodic load balancing,
- * if the whole system is idle.
  */
 static inline void trigger_load_balance(struct rq *rq, int cpu)
 {
-#ifdef CONFIG_NO_HZ
-	/*
-	 * If we were in the nohz mode recently and busy at the current
-	 * scheduler tick, then check if we need to nominate new idle
-	 * load balancer.
-	 */
-	if (rq->in_nohz_recently && !rq->idle_at_tick) {
-		rq->in_nohz_recently = 0;
-
-		if (atomic_read(&nohz.load_balancer) == cpu) {
-			cpumask_clear_cpu(cpu, nohz.cpu_mask);
-			atomic_set(&nohz.load_balancer, -1);
-		}
-
-		if (atomic_read(&nohz.load_balancer) == -1) {
-			int ilb = find_new_ilb(cpu);
-
-			if (ilb < nr_cpu_ids)
-				resched_cpu(ilb);
-		}
-	}
-
-	/*
-	 * If this cpu is idle and doing idle load balancing for all the
-	 * cpus with ticks stopped, is it time for that to stop?
-	 */
-	if (rq->idle_at_tick && atomic_read(&nohz.load_balancer) == cpu &&
-	    cpumask_weight(nohz.cpu_mask) == num_online_cpus()) {
-		resched_cpu(cpu);
-		return;
-	}
-
-	/*
-	 * If this cpu is idle and the idle load balancing is done by
-	 * someone else, then no need raise the SCHED_SOFTIRQ
-	 */
-	if (rq->idle_at_tick && atomic_read(&nohz.load_balancer) != cpu &&
-	    cpumask_test_cpu(cpu, nohz.cpu_mask))
-		return;
-#endif
 	/* Don't need to rebalance while attached to NULL domain */
 	if (time_after_eq(jiffies, rq->next_balance) &&
 	    likely(!on_null_domain(cpu)))
 		raise_softirq(SCHED_SOFTIRQ);
+#ifdef CONFIG_NO_HZ
+	else if (rq->nr_running && likely(!on_null_domain(cpu)))
+		nohz_balancer_kick(cpu);
+#endif
 }
 
 #else	/* CONFIG_SMP */
@@ -5346,8 +5329,17 @@ need_resched_nonpreemptible:
 		prev->sched_class->pre_schedule(rq, prev);
 #endif
 
-	if (unlikely(!rq->nr_running))
+	if (unlikely(!rq->nr_running)) {
+#ifdef CONFIG_NO_HZ
+		if (rq->nohz_balance_kick) {
+			spin_unlock_irq(&rq->lock);
+			nohz_idle_balance(cpu, rq);
+			rq->nohz_balance_kick = 0;
+			spin_lock_irq(&rq->lock);
+		} else
+#endif
 		idle_balance(cpu, rq);
+	}
 
 	put_prev_task(rq, prev);
 	next = pick_next_task(rq);
-- 
1.6.0.6


* [patch 2/2] RFC sched: Scale the nohz_tracker logic by making it per NUMA node
  2009-06-17 18:26 [patch 0/2] RFC sched: Change nohz ilb logic from poll to push model venkatesh.pallipadi
  2009-06-17 18:26 ` [patch 1/2] RFC sched: Change the nohz ilb logic from pull " venkatesh.pallipadi
@ 2009-06-17 18:26 ` venkatesh.pallipadi
  2009-06-17 19:21   ` Vaidyanathan Srinivasan
  2009-06-17 19:16 ` [patch 0/2] RFC sched: Change nohz ilb logic from poll to push model Vaidyanathan Srinivasan
  2 siblings, 1 reply; 6+ messages in thread
From: venkatesh.pallipadi @ 2009-06-17 18:26 UTC (permalink / raw)
  To: Peter Zijlstra, Gautham R Shenoy, Vaidyanathan Srinivasan
  Cc: Ingo Molnar, Thomas Gleixner, Arjan van de Ven, linux-kernel,
	Venkatesh Pallipadi, Suresh Siddha

[-- Attachment #1: 0002-sched-Scale-the-nohz_tracker-logic-by-making-it-per.patch --]
[-- Type: text/plain, Size: 11203 bytes --]

Having one idle CPU do the rebalancing for all the idle CPUs in nohz mode
does not scale well with an increasing number of cores and sockets. Make
the nohz_tracker per NUMA node. This results in multiple idle load
balancers, one per NUMA node, with each balancer doing the idle
rebalancing only on behalf of the other nohz CPUs in its NUMA node.
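
Schematically, the change boils down to the following (condensed from the
patch below; the per-CPU pointer is set up when the sched domains are
built):

struct nohz_tracker {
	atomic_t	load_balancer;	/* ilb owner for this node, or -1 */
	cpumask_var_t	cpu_mask;	/* nohz CPUs in this node */
	cpumask_var_t	tmp_nohz_mask;	/* scratch mask for power-savings ilb */
	unsigned long	next_balance;	/* units in jiffies */
};

static DEFINE_PER_CPU(struct nohz_tracker *, cpu_node_nohz_ptr);

/* at sched domain build time: */
/*   per_cpu(cpu_node_nohz_ptr, cpu) = nohz_tracker_ptrs[cpu_to_node(cpu)]; */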

This addresses the following problem with the current nohz ilb logic:
* The lone balancer may end up spending a lot of time doing the balancing
  on behalf of nohz CPUs, especially with an increasing number of sockets
  and cores in the platform.

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 kernel/sched.c |  162 +++++++++++++++++++++++++++++++++++++++++++-------------
 1 files changed, 124 insertions(+), 38 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 22fe762..49d3bb7 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4408,16 +4408,74 @@ static void active_load_balance(struct rq *busiest_rq, int busiest_cpu)
 	double_unlock_balance(busiest_rq, target_rq);
 }
 
-#ifdef CONFIG_NO_HZ
-static struct {
+struct nohz_tracker {
 	atomic_t load_balancer;
 	cpumask_var_t cpu_mask;
-	cpumask_var_t ilb_grp_nohz_mask;
+	cpumask_var_t tmp_nohz_mask;
 	unsigned long next_balance; /* units in jiffies */
-} nohz ____cacheline_aligned = {
-	.load_balancer = ATOMIC_INIT(-1),
 };
 
+#ifdef CONFIG_NO_HZ
+static DEFINE_PER_CPU(struct nohz_tracker *, cpu_node_nohz_ptr);
+static struct nohz_tracker **nohz_tracker_ptrs;
+
+int alloc_node_nohz_tracker(void)
+{
+	int i, j;
+
+	/* Do all the allocations only once per boot */
+	if (nohz_tracker_ptrs)
+		return 0;
+
+	nohz_tracker_ptrs = kcalloc(nr_node_ids, sizeof(struct nohz_tracker *),
+				    GFP_KERNEL);
+	if (!nohz_tracker_ptrs) {
+		printk(KERN_WARNING "Can not alloc nohz trackers\n");
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < nr_node_ids; i++) {
+		nohz_tracker_ptrs[i] = kmalloc_node(sizeof(struct nohz_tracker),
+						    GFP_KERNEL, i);
+		if (!nohz_tracker_ptrs[i]) {
+			printk(KERN_WARNING "Can not alloc domain group for "
+				"node %d\n", i);
+			goto free_ret;
+		}
+
+		if (!alloc_cpumask_var_node(&nohz_tracker_ptrs[i]->cpu_mask,
+					    GFP_KERNEL, i)) {
+			kfree(nohz_tracker_ptrs[i]);
+			goto free_ret;
+		}
+
+		if (!alloc_cpumask_var_node(&nohz_tracker_ptrs[i]->tmp_nohz_mask,
+					    GFP_KERNEL, i)) {
+			free_cpumask_var(nohz_tracker_ptrs[i]->cpu_mask);
+			kfree(nohz_tracker_ptrs[i]);
+			goto free_ret;
+		}
+		atomic_set(&nohz_tracker_ptrs[i]->load_balancer, -1);
+	}
+
+	return 0;
+
+free_ret:
+	for (j = 0; j < i; j++) {
+		free_cpumask_var(nohz_tracker_ptrs[j]->tmp_nohz_mask);
+		free_cpumask_var(nohz_tracker_ptrs[j]->cpu_mask);
+		kfree(nohz_tracker_ptrs[j]);
+	}
+
+	kfree(nohz_tracker_ptrs);
+
+	for_each_online_cpu(i)
+		per_cpu(cpu_node_nohz_ptr, i) = NULL;
+
+	nohz_tracker_ptrs = NULL;
+	return -ENOMEM;
+}
+
 #if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
 /**
  * lowest_flag_domain - Return lowest sched_domain containing flag.
@@ -4456,6 +4514,7 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
 /**
  * is_semi_idle_group - Checks if the given sched_group is semi-idle.
  * @ilb_group:	group to be checked for semi-idleness
+ * @node_nohz: nohz_tracker for the node
  *
  * Returns:	1 if the group is semi-idle. 0 otherwise.
  *
@@ -4463,19 +4522,20 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
  * and atleast one non-idle CPU. This helper function checks if the given
  * sched_group is semi-idle or not.
  */
-static inline int is_semi_idle_group(struct sched_group *ilb_group)
+static inline int is_semi_idle_group(struct sched_group *ilb_group,
+				struct nohz_tracker *node_nohz)
 {
-	cpumask_and(nohz.ilb_grp_nohz_mask, nohz.cpu_mask,
+	cpumask_and(node_nohz->tmp_nohz_mask, node_nohz->cpu_mask,
 					sched_group_cpus(ilb_group));
 
 	/*
 	 * A sched_group is semi-idle when it has atleast one busy cpu
 	 * and atleast one idle cpu.
 	 */
-	if (cpumask_empty(nohz.ilb_grp_nohz_mask))
+	if (cpumask_empty(node_nohz->tmp_nohz_mask))
 		return 0;
 
-	if (cpumask_equal(nohz.ilb_grp_nohz_mask, sched_group_cpus(ilb_group)))
+	if (cpumask_equal(node_nohz->tmp_nohz_mask, sched_group_cpus(ilb_group)))
 		return 0;
 
 	return 1;
@@ -4483,6 +4543,7 @@ static inline int is_semi_idle_group(struct sched_group *ilb_group)
 /**
  * find_new_power_opt_ilb - Finds the optimum idle load balancer for nomination.
  * @cpu:	The cpu which is nominating a new idle_load_balancer.
+ * @node_nohz:	nohz_tracker for the node
  *
  * Returns:	Returns the id of the idle load balancer if it exists,
  *		Else, returns >= nr_cpu_ids.
@@ -4492,7 +4553,7 @@ static inline int is_semi_idle_group(struct sched_group *ilb_group)
  * completely idle packages/cores just for the purpose of idle load balancing
  * when there are other idle cpu's which are better suited for that job.
  */
-static int find_new_power_opt_ilb(int cpu)
+static int find_new_power_opt_ilb(int cpu, struct nohz_tracker *node_nohz)
 {
 	struct sched_domain *sd;
 	struct sched_group *ilb_group;
@@ -4508,15 +4569,15 @@ static int find_new_power_opt_ilb(int cpu)
 	 * Optimize for the case when we have no idle CPUs or only one
 	 * idle CPU. Don't walk the sched_domain hierarchy in such cases
 	 */
-	if (cpumask_weight(nohz.cpu_mask) < 2)
+	if (cpumask_weight(node_nohz->cpu_mask) < 2)
 		goto out_done;
 
 	for_each_flag_domain(cpu, sd, SD_POWERSAVINGS_BALANCE) {
 		ilb_group = sd->groups;
 
 		do {
-			if (is_semi_idle_group(ilb_group))
-				return cpumask_first(nohz.ilb_grp_nohz_mask);
+			if (is_semi_idle_group(ilb_group, node_nohz))
+				return cpumask_first(node_nohz->tmp_nohz_mask);
 
 			ilb_group = ilb_group->next;
 
@@ -4527,15 +4588,26 @@ out_done:
 	return nr_cpu_ids;
 }
 #else /*  (CONFIG_SCHED_MC || CONFIG_SCHED_SMT) */
-static inline int find_new_power_opt_ilb(int call_cpu)
+static inline int find_new_power_opt_ilb(int call_cpu,
+						struct nohz_tracker *node_nohz)
 {
 	return nr_cpu_ids;
 }
 #endif
 
+static int get_nohz_load_balancer_node(struct nohz_tracker *node_nohz)
+{
+	if (!node_nohz)
+		return -1;
+
+	return atomic_read(&node_nohz->load_balancer);
+}
+
 int get_nohz_load_balancer(void)
 {
-	return atomic_read(&nohz.load_balancer);
+	int cpu = smp_processor_id();
+	struct nohz_tracker *node_nohz = per_cpu(cpu_node_nohz_ptr, cpu);
+	return get_nohz_load_balancer_node(node_nohz);
 }
 
 /*
@@ -4547,13 +4619,17 @@ static void nohz_balancer_kick(int cpu)
 {
 	int ilb_cpu;
 	unsigned long now = jiffies;
+	struct nohz_tracker *node_nohz = per_cpu(cpu_node_nohz_ptr, cpu);
+
+	if (!node_nohz)
+		return;
 
-	if (time_before(now, nohz.next_balance))
+	if (time_before(now, node_nohz->next_balance))
 		return;
 
-	ilb_cpu = get_nohz_load_balancer();
+	ilb_cpu = get_nohz_load_balancer_node(node_nohz);
 	if (ilb_cpu < 0) {
-		ilb_cpu = cpumask_first(nohz.cpu_mask);
+		ilb_cpu = cpumask_first(node_nohz->cpu_mask);
 		if (ilb_cpu >= nr_cpu_ids)
 			return;
 	}
@@ -4579,31 +4655,35 @@ static void nohz_balancer_kick(int cpu)
 int select_nohz_load_balancer(int stop_tick)
 {
 	int cpu = smp_processor_id();
+	struct nohz_tracker *node_nohz = per_cpu(cpu_node_nohz_ptr, cpu);
+
+	if (!node_nohz)
+		return 0;
 
 	if (stop_tick) {
 		cpu_rq(cpu)->in_nohz_recently = 1;
 
 		if (!cpu_active(cpu)) {
-			if (atomic_read(&nohz.load_balancer) != cpu)
+			if (atomic_read(&node_nohz->load_balancer) != cpu)
 				return 0;
 
 			/*
 			 * If we are going offline and still the leader,
 			 * give up!
 			 */
-			if (atomic_cmpxchg(&nohz.load_balancer, cpu, -1) != cpu)
+			if (atomic_cmpxchg(&node_nohz->load_balancer, cpu, -1) != cpu)
 				BUG();
 
 			return 0;
 		}
 
-		cpumask_set_cpu(cpu, nohz.cpu_mask);
+		cpumask_set_cpu(cpu, node_nohz->cpu_mask);
 
-		if (atomic_read(&nohz.load_balancer) == -1) {
+		if (atomic_read(&node_nohz->load_balancer) == -1) {
 			int new_ilb;
 
 			/* make me the ilb owner */
-			if (atomic_cmpxchg(&nohz.load_balancer, -1, cpu) != -1) 
+			if (atomic_cmpxchg(&node_nohz->load_balancer, -1, cpu) != -1) 
 				return 0;
 
 			/*
@@ -4614,20 +4694,20 @@ int select_nohz_load_balancer(int stop_tick)
 			      sched_mc_power_savings))
 				return 0;
 
-			new_ilb = find_new_power_opt_ilb(cpu);
+			new_ilb = find_new_power_opt_ilb(cpu, node_nohz);
 			if (new_ilb < nr_cpu_ids && new_ilb != cpu) {
-				atomic_set(&nohz.load_balancer, -1);
+				atomic_set(&node_nohz->load_balancer, -1);
 				resched_cpu(new_ilb);
 			}
 		}
 	} else {
-		if (!cpumask_test_cpu(cpu, nohz.cpu_mask))
+		if (!cpumask_test_cpu(cpu, node_nohz->cpu_mask))
 			return 0;
 
-		cpumask_clear_cpu(cpu, nohz.cpu_mask);
+		cpumask_clear_cpu(cpu, node_nohz->cpu_mask);
 
-		if (atomic_read(&nohz.load_balancer) == cpu)
-			if (atomic_cmpxchg(&nohz.load_balancer, cpu, -1) != cpu)
+		if (atomic_read(&node_nohz->load_balancer) == cpu)
+			if (atomic_cmpxchg(&node_nohz->load_balancer, cpu, -1) != cpu)
 				BUG();
 	}
 	return 0;
@@ -4732,6 +4812,8 @@ static void run_rebalance_domains(struct softirq_action *h)
  */
 static void nohz_idle_balance(int this_cpu, struct rq *this_rq)
 {
+	struct nohz_tracker *node_nohz = per_cpu(cpu_node_nohz_ptr, this_cpu);
+
 	rebalance_domains(this_cpu, CPU_IDLE);
 
 	/*
@@ -4739,11 +4821,11 @@ static void nohz_idle_balance(int this_cpu, struct rq *this_rq)
 	 * balancing on behalf of the other idle cpus whose ticks are
 	 * stopped.
 	 */
-	if (this_rq->nohz_balance_kick) {
+	if (this_rq->nohz_balance_kick && node_nohz) {
 		struct rq *rq;
 		int balance_cpu;
 
-		for_each_cpu(balance_cpu, nohz.cpu_mask) {
+		for_each_cpu(balance_cpu, node_nohz->cpu_mask) {
 			if (balance_cpu == this_cpu)
 				continue;
 
@@ -4761,7 +4843,7 @@ static void nohz_idle_balance(int this_cpu, struct rq *this_rq)
 			if (time_after(this_rq->next_balance, rq->next_balance))
 				this_rq->next_balance = rq->next_balance;
 		}
-		nohz.next_balance = this_rq->next_balance;
+		node_nohz->next_balance = this_rq->next_balance;
 		this_rq->nohz_balance_kick = 0;
 	}
 }
@@ -8615,6 +8697,14 @@ static int __build_sched_domains(const struct cpumask *cpu_map,
 	}
 #endif
 
+	if (alloc_node_nohz_tracker())
+		goto error;
+
+	for_each_cpu(i, cpu_map) {
+		per_cpu(cpu_node_nohz_ptr, i) =
+					nohz_tracker_ptrs[cpu_to_node(i)];
+	}
+
 	/* Calculate CPU power for physical packages and nodes */
 #ifdef CONFIG_SCHED_SMT
 	for_each_cpu(i, cpu_map) {
@@ -8692,12 +8782,12 @@ free_sched_groups:
 #endif
 	goto free_tmpmask;
 
-#ifdef CONFIG_NUMA
 error:
+#ifdef CONFIG_NUMA
 	free_sched_groups(cpu_map, tmpmask);
 	free_rootdomain(rd);
-	goto free_tmpmask;
 #endif
+	goto free_tmpmask;
 }
 
 static int build_sched_domains(const struct cpumask *cpu_map)
@@ -9386,10 +9476,6 @@ void __init sched_init(void)
 	/* Allocate the nohz_cpu_mask if CONFIG_CPUMASK_OFFSTACK */
 	alloc_cpumask_var(&nohz_cpu_mask, GFP_NOWAIT);
 #ifdef CONFIG_SMP
-#ifdef CONFIG_NO_HZ
-	alloc_cpumask_var(&nohz.cpu_mask, GFP_NOWAIT);
-	alloc_cpumask_var(&nohz.ilb_grp_nohz_mask, GFP_NOWAIT);
-#endif
 	alloc_cpumask_var(&cpu_isolated_map, GFP_NOWAIT);
 #endif /* SMP */
 
-- 
1.6.0.6


* Re: [patch 0/2] RFC sched: Change nohz ilb logic from poll to push model
  2009-06-17 18:26 [patch 0/2] RFC sched: Change nohz ilb logic from poll to push model venkatesh.pallipadi
  2009-06-17 18:26 ` [patch 1/2] RFC sched: Change the nohz ilb logic from pull " venkatesh.pallipadi
  2009-06-17 18:26 ` [patch 2/2] RFC sched: Scale the nohz_tracker logic by making it per NUMA node venkatesh.pallipadi
@ 2009-06-17 19:16 ` Vaidyanathan Srinivasan
  2009-06-18 23:41   ` Pallipadi, Venkatesh
  2 siblings, 1 reply; 6+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-06-17 19:16 UTC (permalink / raw)
  To: venkatesh.pallipadi
  Cc: Peter Zijlstra, Gautham R Shenoy, Ingo Molnar, Thomas Gleixner,
	Arjan van de Ven, linux-kernel, Suresh Siddha

* venkatesh.pallipadi@intel.com <venkatesh.pallipadi@intel.com> [2009-06-17 11:26:49]:

> Existing nohz idle load balance (ilb) logic uses the pull model, with one
> idle load balancer CPU nominated on any partially idle system and that
> balancer CPU not going into nohz mode. With the periodic tick, the
> balancer does the idle balancing on behalf of all the CPUs in nohz mode.
> 
> This is not very optimal and has few issues:
> * the balancer will continue to have periodic ticks and wakeup
>   frequently (HZ rate), even though it may not have any rebalancing to do on
>   behalf of any of the idle CPUs.
> * On x86 and CPUs that have APIC timer stoppage on idle CPUs, this periodic
>   wakeup can result in an additional interrupt on a CPU doing the timer
>   broadcast.
> * The balancer may end up spending a lot of time doing the balancing on
>   behalf of nohz CPUs, especially with increasing number of sockets and
>   cores in the platform.
> 
> The alternative is to have a push model, where all idle CPUs can enter nohz
> mode and busy CPU kicks one of the idle CPUs to take care of idle balancing
> on behalf of a group of idle CPUs.

Hi Venki,

The idea is very useful and further extends the power savings on an idle
system.  However, the kick method from the busy CPU should not add to
scheduling latency during a sudden burst of work.

Does adding nohz_balancer_kick() to the trigger_load_balance() path on
a busy CPU add to its overhead?

 
> Following patches tries that approach. There are still some rough edges
> in the patches related to use of #defines around the code. But, wanted
> to get opinion on this approach as an RFC (not for inclusion into the
> tree yet).

I like the idea, but my only concern is the performance impact on busy
CPUs with this push model.

--Vaidy


* Re: [patch 2/2] RFC sched: Scale the nohz_tracker logic by making it per NUMA node
  2009-06-17 18:26 ` [patch 2/2] RFC sched: Scale the nohz_tracker logic by making it per NUMA node venkatesh.pallipadi
@ 2009-06-17 19:21   ` Vaidyanathan Srinivasan
  0 siblings, 0 replies; 6+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-06-17 19:21 UTC (permalink / raw)
  To: venkatesh.pallipadi
  Cc: Peter Zijlstra, Gautham R Shenoy, Ingo Molnar, Thomas Gleixner,
	Arjan van de Ven, linux-kernel, Suresh Siddha

* venkatesh.pallipadi@intel.com <venkatesh.pallipadi@intel.com> [2009-06-17 11:26:51]:

> Having one idle CPU doing the rebalancing for all the idle CPUs in
> nohz mode does not scale well with increasing number of cores and
> sockets. Make the nohz_tracker per NUMA node. This results in multiple
> idle load balancing happening at NUMA node level and idle load balancer
> only does the rebalance domain among all the other nohz CPUs in that
> NUMA node.

This is a good optimisation but maybe overkill for single-chip NUMA node
machines like two-socket Nehalems or Opterons.

Some method to create load balancer groups based on the scale of the
machine would be good.

--Vaidy


* Re: [patch 0/2] RFC sched: Change nohz ilb logic from poll to push model
  2009-06-17 19:16 ` [patch 0/2] RFC sched: Change nohz ilb logic from poll to push model Vaidyanathan Srinivasan
@ 2009-06-18 23:41   ` Pallipadi, Venkatesh
  0 siblings, 0 replies; 6+ messages in thread
From: Pallipadi, Venkatesh @ 2009-06-18 23:41 UTC (permalink / raw)
  To: svaidy
  Cc: Peter Zijlstra, Gautham R Shenoy, Ingo Molnar, Thomas Gleixner,
	Arjan van de Ven, linux-kernel, Siddha, Suresh B

On Wed, 2009-06-17 at 12:16 -0700, Vaidyanathan Srinivasan wrote:
> * venkatesh.pallipadi@intel.com <venkatesh.pallipadi@intel.com> [2009-06-17 11:26:49]:
> 
> > Existing nohz idle load balance (ilb) logic uses the pull model, with one
> > idle load balancer CPU nominated on any partially idle system and that
> > balancer CPU not going into nohz mode. With the periodic tick, the
> > balancer does the idle balancing on behalf of all the CPUs in nohz mode.
> > 
> > This is not very optimal and has few issues:
> > * the balancer will continue to have periodic ticks and wakeup
> >   frequently (HZ rate), even though it may not have any rebalancing to do on
> >   behalf of any of the idle CPUs.
> > * On x86 and CPUs that have APIC timer stoppage on idle CPUs, this periodic
> >   wakeup can result in an additional interrupt on a CPU doing the timer
> >   broadcast.
> > * The balancer may end up spending a lot of time doing the balancing on
> >   behalf of nohz CPUs, especially with increasing number of sockets and
> >   cores in the platform.
> > 
> > The alternative is to have a push model, where all idle CPUs can enter nohz
> > mode and busy CPU kicks one of the idle CPUs to take care of idle balancing
> > on behalf of a group of idle CPUs.
> 
> Hi Venki,
> 
> The idea is very useful and further extends the power savings in idle
> system.  However the kick method from busy CPU should not add to
> scheduling latency during a sudden burst of work.
> 
> Does adding nohz_balancer_kick() in trigger_load_balance() path in
> a busy CPU add to its overhead?
> 
>  
> > Following patches tries that approach. There are still some rough edges
> > in the patches related to use of #defines around the code. But, wanted
> > to get opinion on this approach as an RFC (not for inclusion into the
> > tree yet).
> 
> I like the idea but my only concern is the performance impact on busy
> cpus with this push model. 

Vaidy,

I tried to keep the overhead on the busy CPU low in this RFC. There is a
check for the next_balance time, and if there is a nominated load balance
CPU we just send a resched to it. We only look at cpu_mask to find the
first set bit when there is no assigned load_balance CPU (that is, when,
say, the load balance CPU has started running and no other CPU has
nominated itself yet). But that's the only overhead there. All the other
complexities are handled on the idle CPU side.
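
For reference, that busy-CPU path (nohz_balancer_kick() from patch 1/2)
boils down to roughly the following; this is condensed from the diff
earlier in the thread, and patch 2/2 changes it to use the per-node
nohz_tracker instead of the global nohz struct:

static void nohz_balancer_kick(int cpu)
{
	int ilb_cpu;

	/* Cheap early-out until the next balance interval. */
	if (time_before(jiffies, nohz.next_balance))
		return;

	/* Use the nominated balancer; else fall back to any nohz CPU. */
	ilb_cpu = get_nohz_load_balancer();
	if (ilb_cpu < 0) {
		ilb_cpu = cpumask_first(nohz.cpu_mask);
		if (ilb_cpu >= nr_cpu_ids)
			return;
	}

	/* Hand the work to the idle CPU and wake it up. */
	cpu_rq(ilb_cpu)->nohz_balance_kick = 1;
	resched_cpu(ilb_cpu);
}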

Thanks,
Venki



