* [RFC][PATCH 0/2] reworking load_balance_monitor
@ 2008-02-14 15:57 Peter Zijlstra
  2008-02-14 15:57 ` [RFC][PATCH 1/2] sched: fair-group: rework load_balance_monitor Peter Zijlstra
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Peter Zijlstra @ 2008-02-14 15:57 UTC (permalink / raw)
  To: Srivatsa Vaddagiri, Dhaval Giani, Paul Jackson, Arjan van de Ven,
	Ingo Molnar, Thomas Gleixner, Gregory Haskins
  Cc: LKML, Peter Zijlstra

Hi,

Here are the current patches that rework load_balance_monitor.

The main reason for doing this is to eliminate the wakeups the thing generates,
esp. on an idle system. The bonus is that it removes a kernel thread.

Paul, Gregory - the thing that bothers me most atm is the lack of
rd->load_balance. Should I introduce that (-rt ought to make use of that as
well) by way of copying from the top sched_domain when it gets created?
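
For concreteness, "copying from the top sched_domain" could look roughly like
the sketch below. This is purely illustrative: rd->load_balance does not exist
yet and the helper name is made up. lb_monitor_wake() could then test
rd->load_balance (plus the weight of rd->span) instead of nr_cpu_ids, which is
what the XXX in patch 2 hints at.

static void sketch_init_rd_load_balance(int cpu, struct root_domain *rd)
{
	struct sched_domain *sd;
	int load_balance = 0;

	/* inherit the flag from the highest domain that still balances */
	for_each_domain(cpu, sd) {
		if (sd->flags & SD_LOAD_BALANCE)
			load_balance = 1;
	}

	rd->load_balance = load_balance;
}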

 - peterz



* [RFC][PATCH 1/2] sched: fair-group: rework load_balance_monitor
  2008-02-14 15:57 [RFC][PATCH 0/2] reworking load_balance_monitor Peter Zijlstra
@ 2008-02-14 15:57 ` Peter Zijlstra
  2008-02-14 15:57 ` [RFC][PATCH 2/2] sched: fair-group: per root-domain load balancing Peter Zijlstra
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2008-02-14 15:57 UTC (permalink / raw)
  To: Srivatsa Vaddagiri, Dhaval Giani, Paul Jackson, Arjan van de Ven,
	Ingo Molnar, Thomas Gleixner, Gregory Haskins
  Cc: LKML, Peter Zijlstra

[-- Attachment #1: sched-fair-dyn-balance.patch --]
[-- Type: text/plain, Size: 11042 bytes --]

The biggest issue with load_balance_monitor atm is that it never goes
to sleep, and generates a huge number of wakeups, even on idle systems.

I tried doing a simple interruptible sleep on idle, but wake_up_process
can't be used while holding the rq lock, so that didn't quite work out.

The next option was turning it into a timer. This does work; however, it
now runs from hardirq context and I worry that it might be too heavy,
especially on larger boxen.
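
For reference, here is a minimal, self-contained sketch of the self-rearming
hrtimer pattern the patch below uses, wrapped as a toy module so it can be
tried in isolation. The demo_* names are mine, demo_do_work() is a stand-in
for load_balance_shares(), and the cb_mode detail (running the callback
straight from the interrupt) is omitted:

#include <linux/module.h>
#include <linux/hrtimer.h>
#include <linux/ktime.h>

static struct hrtimer demo_timer;
static ktime_t demo_interval;

/* stand-in for load_balance_shares(); 0 means "nothing left to do" */
static int demo_do_work(void)
{
	return 0;
}

static enum hrtimer_restart demo_fn(struct hrtimer *timer)
{
	int busy = 0;

	/* catch up on all missed periods, like lb_monitor_timer() below */
	for (;;) {
		ktime_t now = hrtimer_cb_get_time(timer);

		if (!hrtimer_forward(timer, now, demo_interval))
			break;
		busy |= demo_do_work();
	}

	/*
	 * Stop rearming when idle; in the patch, lb_monitor_wake() restarts
	 * the timer when work shows up again, so an idle system sees no
	 * further timer interrupts from this.
	 */
	return busy ? HRTIMER_RESTART : HRTIMER_NORESTART;
}

static int __init demo_init(void)
{
	hrtimer_init(&demo_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	demo_timer.function = demo_fn;
	demo_interval = ktime_set(0, 8 * NSEC_PER_MSEC);
	hrtimer_start(&demo_timer, demo_interval, HRTIMER_MODE_REL);
	return 0;
}

static void __exit demo_exit(void)
{
	hrtimer_cancel(&demo_timer);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");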

The next patch will split this single instance into per root-domain balancers.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c      |  266 ++++++++++++++++++++++++++++++----------------------
 kernel/sched_fair.c |    3 
 2 files changed, 161 insertions(+), 108 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -338,11 +338,112 @@ static DEFINE_SPINLOCK(task_group_lock);
 /* doms_cur_mutex serializes access to doms_cur[] array */
 static DEFINE_MUTEX(doms_cur_mutex);
 
+/*
+ * Tunables that become constants when CONFIG_SCHED_DEBUG is off:
+ */
+#ifdef CONFIG_SCHED_DEBUG
+# define const_debug __read_mostly
+#else
+# define const_debug static const
+#endif
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 #ifdef CONFIG_SMP
 /* kernel thread that runs rebalance_shares() periodically */
-static struct task_struct *lb_monitor_task;
-static int load_balance_monitor(void *unused);
+
+struct lb_monitor {
+	struct hrtimer timer;
+	ktime_t timeout;
+	spinlock_t lock;
+};
+
+static struct lb_monitor lb_monitor;
+
+/*
+ * How frequently should we rebalance_shares() across cpus?
+ *
+ * The more frequently we rebalance shares, the more accurate is the fairness
+ * of cpu bandwidth distribution between task groups. However higher frequency
+ * also implies increased scheduling overhead.
+ *
+ * sysctl_sched_min_bal_int_shares represents the minimum interval between
+ * consecutive calls to rebalance_shares() in the same sched domain.
+ *
+ * sysctl_sched_max_bal_int_shares represents the maximum interval between
+ * consecutive calls to rebalance_shares() in the same sched domain.
+ *
+ * These settings allows for the appropriate trade-off between accuracy of
+ * fairness and the associated overhead.
+ *
+ */
+
+/* default: 8ms, units: milliseconds */
+const_debug unsigned int sysctl_sched_min_bal_int_shares = 8;
+
+/* default: 128ms, units: milliseconds */
+const_debug unsigned int sysctl_sched_max_bal_int_shares = 128;
+
+enum {
+	shares_idle,		/* nothing to balance */
+	shares_balanced,	/* it was in balance */
+	shares_unbalanced,	/* it needed balancing */
+};
+
+static int load_balance_shares(struct lb_monitor *lb_monitor);
+
+static enum hrtimer_restart lb_monitor_timer(struct hrtimer *timer)
+{
+	int state = shares_idle;
+	struct lb_monitor *lb_monitor =
+		container_of(timer, struct lb_monitor, timer);
+
+	for (;;) {
+		ktime_t now = hrtimer_cb_get_time(timer);
+		int overrun = hrtimer_forward(timer, now, lb_monitor->timeout);
+
+		if (!overrun)
+			break;
+
+		state = load_balance_shares(lb_monitor);
+	}
+
+	return (state == shares_idle) ? HRTIMER_NORESTART : HRTIMER_RESTART;
+}
+
+static void lb_monitor_wake(struct lb_monitor *lb_monitor)
+{
+	ktime_t now;
+
+	if (hrtimer_active(&lb_monitor->timer))
+		return;
+
+	if (nr_cpu_ids == 1)
+		return;
+
+	spin_lock(&lb_monitor->lock);
+	for (;;) {
+		if (hrtimer_active(&lb_monitor->timer))
+			break;
+
+		now = hrtimer_cb_get_time(&lb_monitor->timer);
+		hrtimer_forward(&lb_monitor->timer, now, lb_monitor->timeout);
+		hrtimer_start(&lb_monitor->timer, lb_monitor->timer.expires,
+				HRTIMER_MODE_ABS);
+	}
+	spin_unlock(&lb_monitor->lock);
+}
+
+static void lb_monitor_init(struct lb_monitor *lb_monitor)
+{
+	hrtimer_init(&lb_monitor->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	lb_monitor->timer.function = lb_monitor_timer;
+	lb_monitor->timer.cb_mode = HRTIMER_CB_IRQSAFE_NO_SOFTIRQ;
+
+	lb_monitor->timeout = ns_to_ktime((u64)sysctl_sched_min_bal_int_shares *
+			NSEC_PER_MSEC);
+
+	spin_lock_init(&lb_monitor->lock);
+}
 #endif
 
 static void set_se_shares(struct sched_entity *se, unsigned long shares);
@@ -699,15 +800,6 @@ static void update_rq_clock(struct rq *r
 #define task_rq(p)		cpu_rq(task_cpu(p))
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 
-/*
- * Tunables that become constants when CONFIG_SCHED_DEBUG is off:
- */
-#ifdef CONFIG_SCHED_DEBUG
-# define const_debug __read_mostly
-#else
-# define const_debug static const
-#endif
-
 /**
  * runqueue_is_locked
  *
@@ -7200,21 +7292,6 @@ void __init sched_init_smp(void)
 	if (set_cpus_allowed(current, non_isolated_cpus) < 0)
 		BUG();
 	sched_init_granularity();
-
-#ifdef CONFIG_FAIR_GROUP_SCHED
-	if (nr_cpu_ids == 1)
-		return;
-
-	lb_monitor_task = kthread_create(load_balance_monitor, NULL,
-					 "group_balance");
-	if (!IS_ERR(lb_monitor_task)) {
-		lb_monitor_task->flags |= PF_NOFREEZE;
-		wake_up_process(lb_monitor_task);
-	} else {
-		printk(KERN_ERR "Could not create load balance monitor thread"
-			"(error = %ld) \n", PTR_ERR(lb_monitor_task));
-	}
-#endif
 }
 #else
 void __init sched_init_smp(void)
@@ -7321,8 +7398,11 @@ void __init sched_init(void)
 
 #ifdef CONFIG_SMP
 	init_defrootdomain();
-#endif
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	lb_monitor_init(&lb_monitor);
+#endif
+#endif
 	init_rt_bandwidth(&def_rt_bandwidth,
 			global_rt_period(), global_rt_runtime());
 
@@ -7556,7 +7636,7 @@ static int rebalance_shares(struct sched
 	struct cfs_rq *cfs_rq;
 	struct rq *rq = cpu_rq(this_cpu);
 	cpumask_t sdspan = sd->span;
-	int balanced = 1;
+	int state = shares_idle;
 
 	/* Walk thr' all the task groups that we have */
 	for_each_leaf_cfs_rq(rq, cfs_rq) {
@@ -7572,6 +7652,8 @@ static int rebalance_shares(struct sched
 		if (!total_load)
 			continue;
 
+		state = max(state, shares_balanced);
+
 		/*
 		 * tg->shares represents the number of cpu shares the task group
 		 * is eligible to hold on a single cpu. On N cpus, it is
@@ -7593,108 +7675,76 @@ static int rebalance_shares(struct sched
 			if (local_shares == tg->se[i]->load.weight)
 				continue;
 
-			spin_lock_irq(&cpu_rq(i)->lock);
+			spin_lock(&cpu_rq(i)->lock);
 			set_se_shares(tg->se[i], local_shares);
-			spin_unlock_irq(&cpu_rq(i)->lock);
-			balanced = 0;
+			spin_unlock(&cpu_rq(i)->lock);
+			state = shares_unbalanced;
 		}
 	}
 
-	return balanced;
+	return state;
 }
 
-/*
- * How frequently should we rebalance_shares() across cpus?
- *
- * The more frequently we rebalance shares, the more accurate is the fairness
- * of cpu bandwidth distribution between task groups. However higher frequency
- * also implies increased scheduling overhead.
- *
- * sysctl_sched_min_bal_int_shares represents the minimum interval between
- * consecutive calls to rebalance_shares() in the same sched domain.
- *
- * sysctl_sched_max_bal_int_shares represents the maximum interval between
- * consecutive calls to rebalance_shares() in the same sched domain.
- *
- * These settings allows for the appropriate trade-off between accuracy of
- * fairness and the associated overhead.
- *
- */
-
-/* default: 8ms, units: milliseconds */
-const_debug unsigned int sysctl_sched_min_bal_int_shares = 8;
-
-/* default: 128ms, units: milliseconds */
-const_debug unsigned int sysctl_sched_max_bal_int_shares = 128;
-
-/* kernel thread that runs rebalance_shares() periodically */
-static int load_balance_monitor(void *unused)
+static int load_balance_shares(struct lb_monitor *lb_monitor)
 {
-	unsigned int timeout = sysctl_sched_min_bal_int_shares;
-	struct sched_param schedparm;
-	int ret;
+	int i, cpu, state = shares_idle;
+	u64 max_timeout = (u64)sysctl_sched_max_bal_int_shares * NSEC_PER_MSEC;
+	u64 min_timeout = (u64)sysctl_sched_min_bal_int_shares * NSEC_PER_MSEC;
+	u64 timeout;
 
+	/* Prevent cpus going down or coming up */
+	/* get_online_cpus(); */
+	/* lockout changes to doms_cur[] array */
+	/* lock_doms_cur(); */
 	/*
-	 * We don't want this thread's execution to be limited by the shares
-	 * assigned to default group (init_task_group). Hence make it run
-	 * as a SCHED_RR RT task at the lowest priority.
-	 */
-	schedparm.sched_priority = 1;
-	ret = sched_setscheduler(current, SCHED_RR, &schedparm);
-	if (ret)
-		printk(KERN_ERR "Couldn't set SCHED_RR policy for load balance"
-				" monitor thread (error = %d) \n", ret);
-
-	while (!kthread_should_stop()) {
-		int i, cpu, balanced = 1;
+	 * Enter a rcu read-side critical section to safely walk rq->sd
+	 * chain on various cpus and to walk task group list
+	 * (rq->leaf_cfs_rq_list) in rebalance_shares().
+	 */
+	rcu_read_lock();
 
-		/* Prevent cpus going down or coming up */
-		get_online_cpus();
-		/* lockout changes to doms_cur[] array */
-		lock_doms_cur();
-		/*
-		 * Enter a rcu read-side critical section to safely walk rq->sd
-		 * chain on various cpus and to walk task group list
-		 * (rq->leaf_cfs_rq_list) in rebalance_shares().
-		 */
-		rcu_read_lock();
+	for (i = 0; i < ndoms_cur; i++) {
+		cpumask_t cpumap = doms_cur[i];
+		struct sched_domain *sd = NULL, *sd_prev = NULL;
 
-		for (i = 0; i < ndoms_cur; i++) {
-			cpumask_t cpumap = doms_cur[i];
-			struct sched_domain *sd = NULL, *sd_prev = NULL;
-
-			cpu = first_cpu(cpumap);
-
-			/* Find the highest domain at which to balance shares */
-			for_each_domain(cpu, sd) {
-				if (!(sd->flags & SD_LOAD_BALANCE))
-					continue;
-				sd_prev = sd;
-			}
+		cpu = first_cpu(cpumap);
 
-			sd = sd_prev;
-			/* sd == NULL? No load balance reqd in this domain */
-			if (!sd)
+		/* Find the highest domain at which to balance shares */
+		for_each_domain(cpu, sd) {
+			if (!(sd->flags & SD_LOAD_BALANCE))
 				continue;
-
-			balanced &= rebalance_shares(sd, cpu);
+			sd_prev = sd;
 		}
 
-		rcu_read_unlock();
+		sd = sd_prev;
+		/* sd == NULL? No load balance reqd in this domain */
+		if (!sd)
+			continue;
 
-		unlock_doms_cur();
-		put_online_cpus();
+		state = max(state, rebalance_shares(sd, cpu));
+	}
 
-		if (!balanced)
-			timeout = sysctl_sched_min_bal_int_shares;
-		else if (timeout < sysctl_sched_max_bal_int_shares)
+	rcu_read_unlock();
+
+	/* unlock_doms_cur(); */
+	/* put_online_cpus(); */
+
+	timeout = ktime_to_ns(lb_monitor->timeout);
+	switch (state) {
+	case shares_balanced:
+		if (timeout < max_timeout)
 			timeout *= 2;
+		break;
 
-		msleep_interruptible(timeout);
+	default:
+		timeout = min_timeout;
+		break;
 	}
+	lb_monitor->timeout = ns_to_ktime(timeout);
 
-	return 0;
+	return state;
 }
+
 #endif	/* CONFIG_SMP */
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -548,6 +548,9 @@ account_entity_enqueue(struct cfs_rq *cf
 	cfs_rq->nr_running++;
 	se->on_rq = 1;
 	list_add(&se->group_node, &cfs_rq->tasks);
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+	lb_monitor_wake(&lb_monitor);
+#endif
 }
 
 static void

--



* [RFC][PATCH 2/2] sched: fair-group: per root-domain load balancing
  2008-02-14 15:57 [RFC][PATCH 0/2] reworking load_balance_monitor Peter Zijlstra
  2008-02-14 15:57 ` [RFC][PATCH 1/2] sched: fair-group: rework load_balance_monitor Peter Zijlstra
@ 2008-02-14 15:57 ` Peter Zijlstra
  2008-02-15 16:46   ` Gregory Haskins
  2008-02-14 16:09 ` [RFC][PATCH 0/2] reworking load_balance_monitor Gregory Haskins
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2008-02-14 15:57 UTC (permalink / raw)
  To: Srivatsa Vaddagiri, Dhaval Giani, Paul Jackson, Arjan van de Ven,
	Ingo Molnar, Thomas Gleixner, Gregory Haskins
  Cc: LKML, Peter Zijlstra, Gregory Haskins

[-- Attachment #1: sched-fair-dyn-balance-more.patch --]
[-- Type: text/plain, Size: 6030 bytes --]

Currently the lb_monitor will walk all the domains/cpus from a single
cpu's timer interrupt. This will cause massive cache-thrashing and cache-line
bouncing on larger machines.

Split the lb_monitor into one instance per root_domain (disjoint sched-domains).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Gregory Haskins <gregory.haskins.ml@gmail.com>
---
 kernel/sched.c      |  106 ++++++++++++++++++++++++++++------------------------
 kernel/sched_fair.c |    2 
 2 files changed, 59 insertions(+), 49 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -357,8 +357,6 @@ struct lb_monitor {
 	spinlock_t lock;
 };
 
-static struct lb_monitor lb_monitor;
-
 /*
  * How frequently should we rebalance_shares() across cpus?
  *
@@ -417,6 +415,9 @@ static void lb_monitor_wake(struct lb_mo
 	if (hrtimer_active(&lb_monitor->timer))
 		return;
 
+	/*
+	 * XXX: rd->load_balance && weight(rd->span) > 1
+	 */
 	if (nr_cpu_ids == 1)
 		return;
 
@@ -444,6 +445,11 @@ static void lb_monitor_init(struct lb_mo
 
 	spin_lock_init(&lb_monitor->lock);
 }
+
+static int lb_monitor_destroy(struct lb_monitor *lb_monitor)
+{
+	return hrtimer_cancel(&lb_monitor->timer);
+}
 #endif
 
 static void set_se_shares(struct sched_entity *se, unsigned long shares);
@@ -607,6 +613,8 @@ struct root_domain {
 	 */
 	cpumask_t rto_mask;
 	atomic_t rto_count;
+
+	struct lb_monitor lb_monitor;
 };
 
 /*
@@ -6328,6 +6336,7 @@ static void rq_attach_root(struct rq *rq
 {
 	unsigned long flags;
 	const struct sched_class *class;
+	int active = 0;
 
 	spin_lock_irqsave(&rq->lock, flags);
 
@@ -6342,8 +6351,14 @@ static void rq_attach_root(struct rq *rq
 		cpu_clear(rq->cpu, old_rd->span);
 		cpu_clear(rq->cpu, old_rd->online);
 
-		if (atomic_dec_and_test(&old_rd->refcount))
+		if (atomic_dec_and_test(&old_rd->refcount)) {
+			/*
+			 * sync with active timers.
+			 */
+			active = lb_monitor_destroy(&old_rd->lb_monitor);
+
 			kfree(old_rd);
+		}
 	}
 
 	atomic_inc(&rd->refcount);
@@ -6358,6 +6373,9 @@ static void rq_attach_root(struct rq *rq
 			class->join_domain(rq);
 	}
 
+	if (active)
+		lb_monitor_wake(&rd->lb_monitor);
+
 	spin_unlock_irqrestore(&rq->lock, flags);
 }
 
@@ -6367,6 +6385,8 @@ static void init_rootdomain(struct root_
 
 	cpus_clear(rd->span);
 	cpus_clear(rd->online);
+
+	lb_monitor_init(&rd->lb_monitor);
 }
 
 static void init_defrootdomain(void)
@@ -7398,10 +7418,6 @@ void __init sched_init(void)
 
 #ifdef CONFIG_SMP
 	init_defrootdomain();
-
-#ifdef CONFIG_FAIR_GROUP_SCHED
-	lb_monitor_init(&lb_monitor);
-#endif
 #endif
 	init_rt_bandwidth(&def_rt_bandwidth,
 			global_rt_period(), global_rt_runtime());
@@ -7631,11 +7647,11 @@ void set_curr_task(int cpu, struct task_
  * distribute shares of all task groups among their schedulable entities,
  * to reflect load distribution across cpus.
  */
-static int rebalance_shares(struct sched_domain *sd, int this_cpu)
+static int rebalance_shares(struct root_domain *rd, int this_cpu)
 {
 	struct cfs_rq *cfs_rq;
 	struct rq *rq = cpu_rq(this_cpu);
-	cpumask_t sdspan = sd->span;
+	cpumask_t sdspan = rd->span;
 	int state = shares_idle;
 
 	/* Walk thr' all the task groups that we have */
@@ -7685,50 +7701,12 @@ static int rebalance_shares(struct sched
 	return state;
 }
 
-static int load_balance_shares(struct lb_monitor *lb_monitor)
+static void set_lb_monitor_timeout(struct lb_monitor *lb_monitor, int state)
 {
-	int i, cpu, state = shares_idle;
 	u64 max_timeout = (u64)sysctl_sched_max_bal_int_shares * NSEC_PER_MSEC;
 	u64 min_timeout = (u64)sysctl_sched_min_bal_int_shares * NSEC_PER_MSEC;
 	u64 timeout;
 
-	/* Prevent cpus going down or coming up */
-	/* get_online_cpus(); */
-	/* lockout changes to doms_cur[] array */
-	/* lock_doms_cur(); */
-	/*
-	 * Enter a rcu read-side critical section to safely walk rq->sd
-	 * chain on various cpus and to walk task group list
-	 * (rq->leaf_cfs_rq_list) in rebalance_shares().
-	 */
-	rcu_read_lock();
-
-	for (i = 0; i < ndoms_cur; i++) {
-		cpumask_t cpumap = doms_cur[i];
-		struct sched_domain *sd = NULL, *sd_prev = NULL;
-
-		cpu = first_cpu(cpumap);
-
-		/* Find the highest domain at which to balance shares */
-		for_each_domain(cpu, sd) {
-			if (!(sd->flags & SD_LOAD_BALANCE))
-				continue;
-			sd_prev = sd;
-		}
-
-		sd = sd_prev;
-		/* sd == NULL? No load balance reqd in this domain */
-		if (!sd)
-			continue;
-
-		state = max(state, rebalance_shares(sd, cpu));
-	}
-
-	rcu_read_unlock();
-
-	/* unlock_doms_cur(); */
-	/* put_online_cpus(); */
-
 	timeout = ktime_to_ns(lb_monitor->timeout);
 	switch (state) {
 	case shares_balanced:
@@ -7741,6 +7719,38 @@ static int load_balance_shares(struct lb
 		break;
 	}
 	lb_monitor->timeout = ns_to_ktime(timeout);
+}
+
+static int load_balance_shares(struct lb_monitor *lb_monitor)
+{
+	int cpu = smp_processor_id();
+	struct rq *rq = cpu_rq(cpu);
+	struct root_domain *rd;
+	struct sched_domain *sd, *sd_prev = NULL;
+	int state = shares_idle;
+
+	spin_lock(&rq->lock);
+	/*
+	 * root_domain will stay valid until timer exits - synchronized by
+	 * hrtimer_cancel().
+	 */
+	rd = rq->rd;
+	spin_unlock(&rq->lock);
+
+	/*
+	 * complicated way to find rd->load_balance
+	 */
+	rcu_read_lock();
+	for_each_domain(cpu, sd) {
+		if (!(sd->flags & SD_LOAD_BALANCE))
+			continue;
+		sd_prev = sd;
+	}
+	if (sd_prev)
+		state = max(state, rebalance_shares(rd, cpu));
+	rcu_read_unlock();
+
+	set_lb_monitor_timeout(lb_monitor, state);
 
 	return state;
 }
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -553,7 +553,7 @@ account_entity_enqueue(struct cfs_rq *cf
 	se->on_rq = 1;
 	list_add(&se->group_node, &cfs_rq->tasks);
 #if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
-	lb_monitor_wake(&lb_monitor);
+	lb_monitor_wake(&rq_of(cfs_rq)->rd->lb_monitor);
 #endif
 }
 

--



* Re: [RFC][PATCH 0/2] reworking load_balance_monitor
  2008-02-14 15:57 [RFC][PATCH 0/2] reworking load_balance_monitor Peter Zijlstra
  2008-02-14 15:57 ` [RFC][PATCH 1/2] sched: fair-group: rework load_balance_monitor Peter Zijlstra
  2008-02-14 15:57 ` [RFC][PATCH 2/2] sched: fair-group: per root-domain load balancing Peter Zijlstra
@ 2008-02-14 16:09 ` Gregory Haskins
  2008-02-14 18:15 ` Paul Jackson
  2008-02-18  8:24 ` Dhaval Giani
  4 siblings, 0 replies; 10+ messages in thread
From: Gregory Haskins @ 2008-02-14 16:09 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arjan van de Ven, Thomas Gleixner,
	Dhaval Giani, Srivatsa Vaddagiri, Paul Jackson
  Cc: LKML

>>> On Thu, Feb 14, 2008 at 10:57 AM, in message
<20080214155724.772744000@chello.nl>, Peter Zijlstra <a.p.zijlstra@chello.nl>
wrote: 
> Hi,
> 
> Here are the current patches that rework load_balance_monitor.
> 
> The main reason for doing this is to eliminate the wakeups the thing
> generates, esp. on an idle system. The bonus is that it removes a kernel
> thread.
> 
> Paul, Gregory - the thing that bothers me most atm is the lack of
> rd->load_balance. Should I introduce that (-rt ought to make use of that as
> well) by way of copying from the top sched_domain when it gets created?

With the caveat that I currently have not digested your patch series, this
sounds like a reasonable approach.  The root-domain effectively represents the
top sched-domain anyway (with the additional attribute that it's a shared
structure with all constituent cpus).

I'll try to take a look at the series later today and get back to you with
feedback.

-Greg



* Re: [RFC][PATCH 0/2] reworking load_balance_monitor
  2008-02-14 15:57 [RFC][PATCH 0/2] reworking load_balance_monitor Peter Zijlstra
                   ` (2 preceding siblings ...)
  2008-02-14 16:09 ` [RFC][PATCH 0/2] reworking load_balance_monitor Gregory Haskins
@ 2008-02-14 18:15 ` Paul Jackson
  2008-02-14 19:16   ` Gregory Haskins
  2008-02-18  8:24 ` Dhaval Giani
  4 siblings, 1 reply; 10+ messages in thread
From: Paul Jackson @ 2008-02-14 18:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: vatsa, dhaval, arjan, mingo, tglx, ghaskins, linux-kernel, a.p.zijlstra

Peter wrote of:
> the lack of rd->load_balance.

Could you explain to me a bit what that means?

Does this mean that the existing code would, by default (default being
a single sched domain, covering the entire system's CPUs) load balance
across the entire system, but with your rework, no longer load balance
there?  That seems unlikely.

In any event, from my rather cpuset-centric perspective, there are only
two common cases to consider.

 1. In the default case, build_sched_domains() gets called once,
    at init, with a cpu_map of all non-isolated CPUs, and we should
    forever after load balance across all those non-isolated CPUs.

 2. In some carefully managed systems using the per-cpuset
    'sched_load_balance' flags, we tear down that first default
    sched domain, by calling detach_destroy_domains() on it, and we
    then setup some number of sched_domains (typically in the range
    of two to ten, though I suppose we should design to scale to
    hundreds of sched domains, on systems with thousands of CPUs)
    by additional calls to build_sched_domains(), such that their
    CPUs don't overlap (pairwise disjoint) and such that the union
    of all their CPUs may, or may not, include all non-isolated CPUs
    (some CPUs might be left 'out in the cold', intentionally, as
    essentially additional isolated CPUs.)  We would then expect load
    balancing within each of these pair-wise disjoint sched domains,
    but not between one of them and another.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214


* Re: [RFC][PATCH 0/2] reworking load_balance_monitor
  2008-02-14 18:15 ` Paul Jackson
@ 2008-02-14 19:16   ` Gregory Haskins
  0 siblings, 0 replies; 10+ messages in thread
From: Gregory Haskins @ 2008-02-14 19:16 UTC (permalink / raw)
  To: Paul Jackson
  Cc: a.p.zijlstra, mingo, arjan, tglx, dhaval, vatsa, linux-kernel

>>> On Thu, Feb 14, 2008 at  1:15 PM, in message
<20080214121544.941d91f1.pj@sgi.com>, Paul Jackson <pj@sgi.com> wrote: 
> Peter wrote of:
>> the lack of rd->load_balance.
> 
> Could you explain to me a bit what that means?
> 
> Does this mean that the existing code would, by default (default being
> a single sched domain, covering the entire system's CPUs) load balance
> across the entire system, but with your rework, not so load balance
> there?  That seems unlikely.
> 
> In any event, from my rather cpuset-centric perspective, there are only
> two common cases to consider.
> 
>  1. In the default case, build_sched_domains() gets called once,
>     at init, with a cpu_map of all non-isolated CPUs, and we should
>     forever after load balance across all those non-isolated CPUs.
> 
>  2. In some carefully managed systems using the per-cpuset
>     'sched_load_balance' flags, we tear down that first default
>     sched domain, by calling detach_destroy_domains() on it, and we
>     then setup some number of sched_domains (typically in the range
>     of two to ten, though I suppose we should design to scale to
>     hundreds of sched domains, on systems with thousands of CPUs)
>     by additional calls to build_sched_domains(), such that their
>     CPUs don't overlap (pairwise disjoint) and such that the union
>     of all their CPUs may, or may not, include all non-isolated CPUs
>     (some CPUs might be left 'out in the cold', intentionally, as
>     essentially additional isolated CPUs.)  We would then expect load
>     balancing within each of these pair-wise disjoint sched domains,
>     but not between one of them and another.


Hi Paul,
  I think it will still work as you describe.  We create a new root-domain
object for each pair-wise disjoint sched-domain.  In your case (1) above, we
would only have one instance of a root-domain which contains (of course) a
single instance of the rd->load_balance object.  This would, in fact, operate
like the global variable that Peter is suggesting it replace (IIUC).  However,
for case (2), we would instantiate a root-domain object per pairwise-disjoint
sched-domain, and therefore each one would have its own instance of
rd->load_balance.

HTH
-Greg


* Re: [RFC][PATCH 2/2] sched: fair-group: per root-domain load balancing
  2008-02-14 15:57 ` [RFC][PATCH 2/2] sched: fair-group: per root-domain load balancing Peter Zijlstra
@ 2008-02-15 16:46   ` Gregory Haskins
  2008-02-15 19:46     ` Peter Zijlstra
  0 siblings, 1 reply; 10+ messages in thread
From: Gregory Haskins @ 2008-02-15 16:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srivatsa Vaddagiri, Dhaval Giani, Paul Jackson, Arjan van de Ven,
	Ingo Molnar, Thomas Gleixner, Gregory Haskins, LKML


Peter Zijlstra wrote:
> Currently the lb_monitor will walk all the domains/cpus from a single
> cpu's timer interrupt. This will cause massive cache-thrashing and cache-line
> bouncing on larger machines.
> 
> Split the lb_monitor into one instance per root_domain (disjoint sched-domains).
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> CC: Gregory Haskins <gregory.haskins.ml@gmail.com>
> ---
>  kernel/sched.c      |  106 ++++++++++++++++++++++++++++------------------------
>  kernel/sched_fair.c |    2 
>  2 files changed, 59 insertions(+), 49 deletions(-)
> 
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -357,8 +357,6 @@ struct lb_monitor {
>  	spinlock_t lock;
>  };
>  
> -static struct lb_monitor lb_monitor;
> -
>  /*
>   * How frequently should we rebalance_shares() across cpus?
>   *
> @@ -417,6 +415,9 @@ static void lb_monitor_wake(struct lb_mo
>  	if (hrtimer_active(&lb_monitor->timer))
>  		return;
>  
> +	/*
> +	 * XXX: rd->load_balance && weight(rd->span) > 1
> +	 */
>  	if (nr_cpu_ids == 1)
>  		return;
>  
> @@ -444,6 +445,11 @@ static void lb_monitor_init(struct lb_mo
>  
>  	spin_lock_init(&lb_monitor->lock);
>  }
> +
> +static int lb_monitor_destroy(struct lb_monitor *lb_monitor)
> +{
> +	return hrtimer_cancel(&lb_monitor->timer);
> +}
>  #endif
>  
>  static void set_se_shares(struct sched_entity *se, unsigned long shares);
> @@ -607,6 +613,8 @@ struct root_domain {
>  	 */
>  	cpumask_t rto_mask;
>  	atomic_t rto_count;
> +
> +	struct lb_monitor lb_monitor;
>  };
>  
>  /*
> @@ -6328,6 +6336,7 @@ static void rq_attach_root(struct rq *rq
>  {
>  	unsigned long flags;
>  	const struct sched_class *class;
> +	int active = 0;
>  
>  	spin_lock_irqsave(&rq->lock, flags);
>  
> @@ -6342,8 +6351,14 @@ static void rq_attach_root(struct rq *rq
>  		cpu_clear(rq->cpu, old_rd->span);
>  		cpu_clear(rq->cpu, old_rd->online);
>  
> -		if (atomic_dec_and_test(&old_rd->refcount))
> +		if (atomic_dec_and_test(&old_rd->refcount)) {
> +			/*
> +			 * sync with active timers.
> +			 */
> +			active = lb_monitor_destroy(&old_rd->lb_monitor);
> +
>  			kfree(old_rd);

Note that this works out to be a bug in my code on -rt since you cannot 
kfree() while the raw rq->lock is held.  This isn't your problem, per 
se, but just a heads up that I might need to patch this area ASAP.
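
For what it is worth, one way around that (purely a sketch, not part of this
series) is to drop the reference under rq->lock but defer the actual teardown
until the lock has been released, roughly:

	struct root_domain *old_rd = NULL;

	spin_lock_irqsave(&rq->lock, flags);

	if (rq->rd) {
		/* ... span/online updates as above ... */
		if (atomic_dec_and_test(&rq->rd->refcount))
			old_rd = rq->rd;	/* remember it, free it later */
	}

	atomic_inc(&rd->refcount);
	rq->rd = rd;

	/* ... class->join_domain() etc. as above ... */

	spin_unlock_irqrestore(&rq->lock, flags);

	if (old_rd) {
		int active;

		/* sync with any timer still running on the old domain */
		active = lb_monitor_destroy(&old_rd->lb_monitor);
		if (active)
			lb_monitor_wake(&rd->lb_monitor);

		kfree(old_rd);		/* no raw rq->lock held here */
	}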

> +		}
>  	}
>  
>  	atomic_inc(&rd->refcount);
> @@ -6358,6 +6373,9 @@ static void rq_attach_root(struct rq *rq
>  			class->join_domain(rq);
>  	}
>  
> +	if (active)
> +		lb_monitor_wake(&rd->lb_monitor);
> +
>  	spin_unlock_irqrestore(&rq->lock, flags);
>  }
>  
> @@ -6367,6 +6385,8 @@ static void init_rootdomain(struct root_
>  
>  	cpus_clear(rd->span);
>  	cpus_clear(rd->online);
> +
> +	lb_monitor_init(&rd->lb_monitor);
>  }
>  
>  static void init_defrootdomain(void)
> @@ -7398,10 +7418,6 @@ void __init sched_init(void)
>  
>  #ifdef CONFIG_SMP
>  	init_defrootdomain();
> -
> -#ifdef CONFIG_FAIR_GROUP_SCHED
> -	lb_monitor_init(&lb_monitor);
> -#endif
>  #endif
>  	init_rt_bandwidth(&def_rt_bandwidth,
>  			global_rt_period(), global_rt_runtime());
> @@ -7631,11 +7647,11 @@ void set_curr_task(int cpu, struct task_
>   * distribute shares of all task groups among their schedulable entities,
>   * to reflect load distribution across cpus.
>   */
> -static int rebalance_shares(struct sched_domain *sd, int this_cpu)
> +static int rebalance_shares(struct root_domain *rd, int this_cpu)
>  {
>  	struct cfs_rq *cfs_rq;
>  	struct rq *rq = cpu_rq(this_cpu);
> -	cpumask_t sdspan = sd->span;
> +	cpumask_t sdspan = rd->span;
>  	int state = shares_idle;
>  
>  	/* Walk thr' all the task groups that we have */
> @@ -7685,50 +7701,12 @@ static int rebalance_shares(struct sched
>  	return state;
>  }
>  
> -static int load_balance_shares(struct lb_monitor *lb_monitor)
> +static void set_lb_monitor_timeout(struct lb_monitor *lb_monitor, int state)
>  {
> -	int i, cpu, state = shares_idle;
>  	u64 max_timeout = (u64)sysctl_sched_max_bal_int_shares * NSEC_PER_MSEC;
>  	u64 min_timeout = (u64)sysctl_sched_min_bal_int_shares * NSEC_PER_MSEC;
>  	u64 timeout;
>  
> -	/* Prevent cpus going down or coming up */
> -	/* get_online_cpus(); */
> -	/* lockout changes to doms_cur[] array */
> -	/* lock_doms_cur(); */
> -	/*
> -	 * Enter a rcu read-side critical section to safely walk rq->sd
> -	 * chain on various cpus and to walk task group list
> -	 * (rq->leaf_cfs_rq_list) in rebalance_shares().
> -	 */
> -	rcu_read_lock();
> -
> -	for (i = 0; i < ndoms_cur; i++) {
> -		cpumask_t cpumap = doms_cur[i];
> -		struct sched_domain *sd = NULL, *sd_prev = NULL;
> -
> -		cpu = first_cpu(cpumap);
> -
> -		/* Find the highest domain at which to balance shares */
> -		for_each_domain(cpu, sd) {
> -			if (!(sd->flags & SD_LOAD_BALANCE))
> -				continue;
> -			sd_prev = sd;
> -		}
> -
> -		sd = sd_prev;
> -		/* sd == NULL? No load balance reqd in this domain */
> -		if (!sd)
> -			continue;
> -
> -		state = max(state, rebalance_shares(sd, cpu));
> -	}
> -
> -	rcu_read_unlock();
> -
> -	/* unlock_doms_cur(); */
> -	/* put_online_cpus(); */
> -
>  	timeout = ktime_to_ns(lb_monitor->timeout);
>  	switch (state) {
>  	case shares_balanced:
> @@ -7741,6 +7719,38 @@ static int load_balance_shares(struct lb
>  		break;
>  	}
>  	lb_monitor->timeout = ns_to_ktime(timeout);
> +}
> +
> +static int load_balance_shares(struct lb_monitor *lb_monitor)
> +{
> +	int cpu = smp_processor_id();
> +	struct rq *rq = cpu_rq(cpu);
> +	struct root_domain *rd;
> +	struct sched_domain *sd, *sd_prev = NULL;
> +	int state = shares_idle;
> +
> +	spin_lock(&rq->lock);
> +	/*
> +	 * root_domain will stay valid until timer exits - synchronized by
> +	 * hrtimer_cancel().
> +	 */
> +	rd = rq->rd;
> +	spin_unlock(&rq->lock);

I know we talked about this on IRC, but I am still skeptical about this 
part of the code.  Normally we only guarantee the stability of the 
rq->rd pointer while the rq->lock is held or a rd->refcount is added. 
It would be "safer" to bracket this code with an up/down sequence on the 
rd->refcount, but perhaps you can convince me that it is not needed? 
(i.e. I am still not understanding how the timer guarantees the stability).

[up-ref]

> +
> +	/*
> +	 * complicated way to find rd->load_balance
> +	 */
> +	rcu_read_lock();
> +	for_each_domain(cpu, sd) {
> +		if (!(sd->flags & SD_LOAD_BALANCE))
> +			continue;
> +		sd_prev = sd;
> +	}
> +	if (sd_prev)
> +		state = max(state, rebalance_shares(rd, cpu));
> +	rcu_read_unlock();
> +

[down-ref]

We would of course need to re-work the drop-ref code so it could be 
freed independent of the rq_attach_root() function, but that should be 
trivial.
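
To make the suggestion concrete, the bracketing could look roughly like the
sketch below. The sketch_* names are invented and none of this is in the
posted patches; the drop-ref rework itself is deliberately left out.

static struct root_domain *sketch_get_rd(struct rq *rq)
{
	struct root_domain *rd;

	spin_lock(&rq->lock);
	rd = rq->rd;
	atomic_inc(&rd->refcount);	/* pin rd past the locked section */
	spin_unlock(&rq->lock);

	return rd;
}

static void sketch_put_rd(struct root_domain *rd)
{
	/*
	 * When this drops the last reference, the root_domain would have to
	 * be freed here rather than in rq_attach_root(); doing that safely
	 * from the timer callback (rd embeds the running hrtimer) is the
	 * "re-work the drop-ref code" part and is not shown.
	 */
	atomic_dec(&rd->refcount);
}

load_balance_shares() would then wrap the rebalance in
sketch_get_rd()/sketch_put_rd() rather than relying on hrtimer_cancel() alone.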

> +	set_lb_monitor_timeout(lb_monitor, state);
>  
>  	return state;
>  }
> Index: linux-2.6/kernel/sched_fair.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched_fair.c
> +++ linux-2.6/kernel/sched_fair.c
> @@ -553,7 +553,7 @@ account_entity_enqueue(struct cfs_rq *cf
>  	se->on_rq = 1;
>  	list_add(&se->group_node, &cfs_rq->tasks);
>  #if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
> -	lb_monitor_wake(&lb_monitor);
> +	lb_monitor_wake(&rq_of(cfs_rq)->rd->lb_monitor);
>  #endif
>  }
>  
> 
> --
> 
> 





* Re: [RFC][PATCH 2/2] sched: fair-group: per root-domain load balancing
  2008-02-15 16:46   ` Gregory Haskins
@ 2008-02-15 19:46     ` Peter Zijlstra
  2008-02-19 12:42       ` Gregory Haskins
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2008-02-15 19:46 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Srivatsa Vaddagiri, Dhaval Giani, Paul Jackson, Arjan van de Ven,
	Ingo Molnar, Thomas Gleixner, Gregory Haskins, LKML


On Fri, 2008-02-15 at 11:46 -0500, Gregory Haskins wrote:
> Peter Zijlstra wrote:
>  
> > @@ -6342,8 +6351,14 @@ static void rq_attach_root(struct rq *rq
> >  		cpu_clear(rq->cpu, old_rd->span);
> >  		cpu_clear(rq->cpu, old_rd->online);
> >  
> > -		if (atomic_dec_and_test(&old_rd->refcount))
> > +		if (atomic_dec_and_test(&old_rd->refcount)) {
> > +			/*
> > +			 * sync with active timers.
> > +			 */
> > +			active = lb_monitor_destroy(&old_rd->lb_monitor);
> > +
> >  			kfree(old_rd);
> 
> Note that this works out to be a bug in my code on -rt since you cannot 
> kfree() while the raw rq->lock is held.  This isn't your problem, per 
> se, but just a heads up that I might need to patch this area ASAP.

Yeah, I saw that patch, should be simple enough to fix.

> > +static int load_balance_shares(struct lb_monitor *lb_monitor)
> > +{
> > +	int cpu = smp_processor_id();
> > +	struct rq *rq = cpu_rq(cpu);
> > +	struct root_domain *rd;
> > +	struct sched_domain *sd, *sd_prev = NULL;
> > +	int state = shares_idle;
> > +
> > +	spin_lock(&rq->lock);
> > +	/*
> > +	 * root_domain will stay valid until timer exits - synchronized by
> > +	 * hrtimer_cancel().
> > +	 */
> > +	rd = rq->rd;
> > +	spin_unlock(&rq->lock);
> 
> I know we talked about this on IRC, but I am still skeptical about this 
> part of the code.  Normally we only guarantee the stability of the 
> rq->rd pointer while the rq->lock is held or a rd->refcount is added. 

> It would be "safer" to bracket this code with an up/down sequence on the 
> rd->refcount, 

I initially did break out the up/down reference stuff. It has a few
other complications but all quite tractable.

> but perhaps you can convince me that it is not needed? 
> (i.e. I am still not understanding how the timer guarantees the stability).

ok, let me try again.

So we take rq->lock; at this point we know rd is valid.
We also know the timer is active.

So when we release it, the last reference can be dropped and we end up
in the hrtimer_cancel(), right before the kfree().

hrtimer_cancel() will wait for the timer to end, therefore delaying the
kfree() until the running timer has finished.
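
Spelled out as code, condensed from the two patches above (nothing new here,
just the ordering argument made explicit):

	/* timer callback: load_balance_shares() */
	spin_lock(&rq->lock);
	rd = rq->rd;			/* (A) rd picked up while the callback runs */
	spin_unlock(&rq->lock);
	/* ... rebalance_shares(rd, cpu) ... */
	/* (B) callback returns */

	/* teardown: rq_attach_root(), last reference dropped */
	if (atomic_dec_and_test(&old_rd->refcount)) {
		/* waits until (B), i.e. until any running callback has returned */
		lb_monitor_destroy(&old_rd->lb_monitor);	/* hrtimer_cancel() */
		kfree(old_rd);		/* hence cannot race with the use at (A) */
	}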

> [up-ref]
> 
> > +
> > +	/*
> > +	 * complicated way to find rd->load_balance
> > +	 */
> > +	rcu_read_lock();
> > +	for_each_domain(cpu, sd) {
> > +		if (!(sd->flags & SD_LOAD_BALANCE))
> > +			continue;
> > +		sd_prev = sd;
> > +	}
> > +	if (sd_prev)
> > +		state = max(state, rebalance_shares(rd, cpu));
> > +	rcu_read_unlock();
> > +
> 
> [down-ref]
> 
> We would of course need to re-work the drop-ref code so it could be 
> freed independent of the rq_attach_root() function, but that should be 
> trivial.
> 
> > +	set_lb_monitor_timeout(lb_monitor, state);
> >  
> >  	return state;
> >  }





* Re: [RFC][PATCH 0/2] reworking load_balance_monitor
  2008-02-14 15:57 [RFC][PATCH 0/2] reworking load_balance_monitor Peter Zijlstra
                   ` (3 preceding siblings ...)
  2008-02-14 18:15 ` Paul Jackson
@ 2008-02-18  8:24 ` Dhaval Giani
  4 siblings, 0 replies; 10+ messages in thread
From: Dhaval Giani @ 2008-02-18  8:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srivatsa Vaddagiri, Paul Jackson, Arjan van de Ven, Ingo Molnar,
	Thomas Gleixner, Gregory Haskins, LKML

On Thu, Feb 14, 2008 at 04:57:24PM +0100, Peter Zijlstra wrote:
> Hi,
> 
> Here are the current patches that rework load_balance_monitor.
> 
> The main reason for doing this is to eliminate the wakeups the thing generates,
> esp. on an idle system. The bonus is that it removes a kernel thread.
> 

Hi Peter,

The changes look really good to me. I will give it a run sometime and
give some more feedback.

-- 
regards,
Dhaval


* Re: [RFC][PATCH 2/2] sched: fair-group: per root-domain load balancing
  2008-02-15 19:46     ` Peter Zijlstra
@ 2008-02-19 12:42       ` Gregory Haskins
  0 siblings, 0 replies; 10+ messages in thread
From: Gregory Haskins @ 2008-02-19 12:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srivatsa Vaddagiri, Dhaval Giani, Paul Jackson, Arjan van de Ven,
	Ingo Molnar, Thomas Gleixner, Gregory Haskins, LKML


Peter Zijlstra wrote:
> On Fri, 2008-02-15 at 11:46 -0500, Gregory Haskins wrote:
>   
>
>> but perhaps you can convince me that it is not needed? 
>> (i.e. I am still not understanding how the timer guarantees the stability).
>>     
>
> ok, let me try again.
>
> So we take rq->lock; at this point we know rd is valid.
> We also know the timer is active.
>
> So when we release it, the last reference can be dropped and we end up
> in the hrtimer_cancel(), right before the kfree().
>
> hrtimer_cancel() will wait for the timer to end, therefore delaying the
> kfree() until the running timer has finished.
>
>   

Ok, I see it now.  I agree; I think it is safe.  Thanks!

-Greg




