* [PATCH v1 00/19] Increase resolution of load weights
@ 2011-05-02  1:18 Nikhil Rao
  2011-05-02  1:18 ` [PATCH v1 01/19] sched: introduce SCHED_POWER_SCALE to scale cpu_power calculations Nikhil Rao
                   ` (19 more replies)
  0 siblings, 20 replies; 34+ messages in thread
From: Nikhil Rao @ 2011-05-02  1:18 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
  Cc: linux-kernel, Nikunj A. Dadhania, Srivatsa Vaddagiri,
	Stephan Barwolf, Nikhil Rao

Hi All,

Please find attached v1 of the patchset to increase the resolution of load
weights. The motivation and requirements for this patchset were described in
the first RFC sent to LKML (see http://thread.gmane.org/gmane.linux.kernel/1129232
for more info).

This version of the patchset is more stable than the previous RFC and more
suitable for testing. I have included some test results below that show the
impact/improvements of the patchset on 32-bit and 64-bit kernels.

These patches apply cleanly on top of v2.6.39-rc5. Please note that there is
a merge conflict when they are applied to -tip; I could send out another
patchset that applies to -tip (not sure what the standard protocol is here).

Changes since v0:
- Scale down reference load weight by SCHED_LOAD_RESOLUTION in
  calc_delta_mine() (thanks to Nikunj Dadhania)
- Detect overflow in update_cfs_load() and cap avg_load update to ~0ULL
- Fix all power calculations to use SCHED_POWER_SHIFT instead of
  SCHED_LOAD_SHIFT (also thanks to Stephan Barwolf for identifying this)
- Convert atomic ops to use atomic64_t instead of atomic_t

Experiments:

1. Performance costs

Ran 50 iterations of Ingo's pipe-test-100k program (100k pipe ping-pongs). See
http://thread.gmane.org/gmane.linux.kernel/1129232/focus=1129389 for more info.
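
For reference, here is a minimal sketch of the kind of pipe ping-pong loop that
pipe-test-100k exercises (this is not the actual benchmark source; the structure
and iteration count are assumptions):

    #include <unistd.h>
    #include <sys/wait.h>

    /*
     * Parent and child bounce one byte back and forth over two pipes; each
     * round trip forces two context switches, so the per-switch scheduler
     * fast path cost dominates the instruction and cycle counts.
     */
    int main(void)
    {
            int ping[2], pong[2], i;
            char c = 0;

            pipe(ping);
            pipe(pong);

            if (fork() == 0) {
                    for (i = 0; i < 100000; i++) {
                            read(ping[0], &c, 1);
                            write(pong[1], &c, 1);
                    }
                    _exit(0);
            }

            for (i = 0; i < 100000; i++) {
                    write(ping[1], &c, 1);
                    read(pong[0], &c, 1);
            }
            wait(NULL);
            return 0;
    }

The counters below are what something like 'perf stat -r 50 ./pipe-test-100k'
reports.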

64-bit build.

  2.6.39-rc5 (baseline):

    Performance counter stats for './pipe-test-100k' (50 runs):

       905,034,914 instructions             #      0.345 IPC     ( +-   0.016% )
     2,623,924,516 cycles                     ( +-   0.759% )

        1.518543478  seconds time elapsed   ( +-   0.513% )

  2.6.39-rc5 + patchset:

    Performance counter stats for './pipe-test-100k' (50 runs):

       905,351,545 instructions             #      0.343 IPC     ( +-   0.018% )
     2,638,939,777 cycles                     ( +-   0.761% )

        1.509101452  seconds time elapsed   ( +-   0.537% )

There is a marginal increase in instructions retired, about 0.034%, and a
marginal increase in cycles counted, about 0.57%.

32-bit build.

  2.6.39-rc5 (baseline):

    Performance counter stats for './pipe-test-100k' (50 runs):

     1,025,151,722 instructions             #      0.238 IPC     ( +-   0.018% )
     4,303,226,625 cycles                     ( +-   0.524% )

        2.133056844  seconds time elapsed   ( +-   0.619% )

  2.6.39-rc5 + patchset:

    Performance counter stats for './pipe-test-100k' (50 runs):

     1,070,610,068 instructions             #      0.239 IPC     ( +-   1.369% )
     4,478,912,974 cycles                     ( +-   1.011% )

        2.293382242  seconds time elapsed   ( +-   0.144% )

On 32-bit kernels, instructions retired increase by about 4.4% with this
patchset. CPU cycles also increase by about 4%.

2. Fairness tests

Test setup: run 5 soaker threads bound to a single cpu. Measure usage over 10s
for each thread and calculate mean, stdev and coeff of variation (stdev/mean)
for each set of readings. The coeff of variation is averaged over 10 such sets.
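
For clarity, a minimal sketch of the coefficient-of-variation calculation over
one set of per-thread usage readings (the sampling and thread setup are omitted;
this is illustrative, not the harness used for the numbers below):

    #include <math.h>
    #include <stdio.h>

    /*
     * usage[] holds the CPU time consumed by each soaker thread over the
     * 10s window; cv = stdev / mean, and a lower cv means fairer sharing.
     */
    static double coeff_of_variation(const double *usage, int n)
    {
            double mean = 0.0, var = 0.0;
            int i;

            for (i = 0; i < n; i++)
                    mean += usage[i];
            mean /= n;

            for (i = 0; i < n; i++)
                    var += (usage[i] - mean) * (usage[i] - mean);
            var /= n;

            return sqrt(var) / mean;
    }

    int main(void)
    {
            /* made-up readings: seconds of CPU per thread over one window */
            double usage[] = { 2.01, 1.99, 2.00, 2.02, 1.98 };

            printf("cv=%f\n", coeff_of_variation(usage, 5));
            return 0;
    }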

As you can see in the data below, there is no significant difference in coeff
of variation between the two kernels on 64-bit or 32-bit builds.

64-bit build.

  2.6.39-rc5 (baseline):
    cv=0.007374042

  2.6.39-rc5 + patchset:
    cv=0.006942042

32-bit build.

  2.6.39-rc5 (baseline)
    cv=0.002547

  2.6.39-rc5 + patchset:
    cv=0.002426

3. Load balancing low-weight task groups

Test setup: run 50 tasks with random sleep/busy times (biased around 100ms) in
a low weight container (with cpu.shares = 2). Measure %idle as reported by
mpstat over a 10s window.
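
A rough sketch of the load-generator side of this test is below (the cgroup
mount point, the exact busy/sleep distribution and the use of forked tasks are
assumptions, not the actual test harness):

    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    /*
     * Each task alternates between a busy period and a sleep period, both
     * biased around 100ms. The parent is assumed to have been moved into a
     * low weight container beforehand, e.g.:
     *
     *   mkdir /dev/cgroup/lowweight
     *   echo 2  > /dev/cgroup/lowweight/cpu.shares
     *   echo $$ > /dev/cgroup/lowweight/tasks
     */
    static void spin_for(long ns)
    {
            struct timespec start, now;

            clock_gettime(CLOCK_MONOTONIC, &start);
            do {
                    clock_gettime(CLOCK_MONOTONIC, &now);
            } while ((now.tv_sec - start.tv_sec) * 1000000000L +
                     (now.tv_nsec - start.tv_nsec) < ns);
    }

    int main(void)
    {
            int i;

            for (i = 0; i < 50; i++) {
                    if (fork() == 0) {
                            srand(getpid());
                            for (;;) {
                                    spin_for((50 + rand() % 100) * 1000000L);
                                    usleep((50 + rand() % 100) * 1000);
                            }
                    }
            }
            for (;;)
                    pause();
    }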

From the data below, the patchset applied to v2.6.39-rc5 keeps the cpus fully
utilized with tasks in the low weight container. These measurements are for a
64-bit kernel.

2.6.39-rc5 (baseline):

04:08:27 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle    intr/s
04:08:28 PM  all   98.75    0.00    0.00    0.00    0.00    0.00    0.00    0.00    1.25  16475.00
04:08:29 PM  all   99.31    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.69  16447.00
04:08:30 PM  all   99.44    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.56  16445.00
04:08:31 PM  all   99.19    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.81  16447.00
04:08:32 PM  all   99.50    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.50  16523.00
04:08:33 PM  all   99.81    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.19  16516.00
04:08:34 PM  all   99.81    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.19  16517.00
04:08:35 PM  all   99.13    0.00    0.44    0.00    0.00    0.00    0.00    0.00    0.44  17624.00
04:08:36 PM  all   97.00    0.00    0.31    0.00    0.00    0.12    0.00    0.00    2.56  17608.00
04:08:37 PM  all   99.31    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.69  16517.00
Average:     all   99.13    0.00    0.07    0.00    0.00    0.01    0.00    0.00    0.79  16711.90

2.6.39-rc5 + patchset:

04:06:26 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle    intr/s
04:06:27 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16573.00
04:06:28 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00    0.00  16554.00
04:06:29 PM  all   99.69    0.00    0.25    0.00    0.00    0.06    0.00    0.00    0.00  17496.00
04:06:30 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16542.00
04:06:31 PM  all   99.94    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.06  16624.00
04:06:32 PM  all   99.88    0.00    0.06    0.00    0.00    0.00    0.00    0.00    0.06  16671.00
04:06:33 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16605.00
04:06:34 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16580.00
04:06:35 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16646.00
04:06:36 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16533.00
Average:     all   99.94    0.00    0.04    0.00    0.00    0.01    0.00    0.00    0.01  16682.40

4. Sizes of vmlinux (32-bit builds)

Sizes of vmlinux compiled with 'make defconfig ARCH=i386' below.

2.6.39-rc5 (baseline):
   text	   data	    bss	    dec	    hex	filename
8144777	1077556	1085440	10307773	 9d48bd	vmlinux-v2.6.39-rc5

2.6.39-rc5 + patchset:
   text	   data	    bss	    dec	    hex	filename
8144846	1077620	1085440	10307906	 9d4942	vmlinux

Negligible increase in text and data sizes (less than 0.01%).

-Thanks,
Nikhil

Nikhil Rao (19):
  sched: introduce SCHED_POWER_SCALE to scale cpu_power calculations
  sched: increase SCHED_LOAD_SCALE resolution
  sched: use u64 for load_weight fields
  sched: update cpu_load to be u64
  sched: update this_cpu_load() to return u64 value
  sched: update source_load(), target_load() and weighted_cpuload() to
    use u64
  sched: update find_idlest_cpu() to use u64 for load
  sched: update find_idlest_group() to use u64
  sched: update division in cpu_avg_load_per_task to use div_u64
  sched: update wake_affine path to use u64, s64 for weights
  sched: update update_sg_lb_stats() to use u64
  sched: Update update_sd_lb_stats() to use u64
  sched: update f_b_g() to use u64 for weights
  sched: change type of imbalance to be u64
  sched: update h_load to use u64
  sched: update move_task() and helper functions to use u64 for weights
  sched: update f_b_q() to use u64 for weighted cpuload
  sched: update shares distribution to use u64
  sched: convert atomic ops in shares update to use atomic64_t ops

 drivers/cpuidle/governors/menu.c |    5 +-
 include/linux/sched.h            |   22 ++--
 kernel/sched.c                   |   70 ++++++------
 kernel/sched_debug.c             |   14 +-
 kernel/sched_fair.c              |  234 ++++++++++++++++++++------------------
 kernel/sched_stats.h             |    2 +-
 6 files changed, 182 insertions(+), 165 deletions(-)

-- 
1.7.3.1



* [PATCH v1 01/19] sched: introduce SCHED_POWER_SCALE to scale cpu_power calculations
  2011-05-02  1:18 [PATCH v1 00/19] Increase resolution of load weights Nikhil Rao
@ 2011-05-02  1:18 ` Nikhil Rao
  2011-05-02  1:19 ` [PATCH v1 02/19] sched: increase SCHED_LOAD_SCALE resolution Nikhil Rao
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Nikhil Rao @ 2011-05-02  1:18 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
  Cc: linux-kernel, Nikunj A. Dadhania, Srivatsa Vaddagiri,
	Stephan Barwolf, Nikhil Rao

SCHED_LOAD_SCALE is used to increase nice resolution and to scale cpu_power
calculations in the scheduler. This patch introduces SCHED_POWER_SCALE and
converts all uses of SCHED_LOAD_SCALE for scaling cpu_power to use
SCHED_POWER_SCALE instead.

This is a preparatory patch for increasing the resolution of SCHED_LOAD_SCALE;
cpu_power calculations do not need the increased resolution, so they keep their
own scale, SCHED_POWER_SCALE.
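
As a worked note (not part of the patch text): at this point both scales still
have the same value, so the conversion is functionally a rename for the
cpu_power users; only the later resolution increase changes SCHED_LOAD_SCALE.

    SCHED_LOAD_SCALE  = 1L << SCHED_LOAD_SHIFT  = 1L << 10 = 1024
    SCHED_POWER_SCALE = 1L << SCHED_POWER_SHIFT = 1L << 10 = 1024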

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 include/linux/sched.h |   13 +++++++----
 kernel/sched.c        |   11 ++++-----
 kernel/sched_fair.c   |   52 +++++++++++++++++++++++++-----------------------
 3 files changed, 40 insertions(+), 36 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 18d63ce..8d1ff2b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -792,17 +792,20 @@ enum cpu_idle_type {
 };
 
 /*
- * sched-domains (multiprocessor balancing) declarations:
- */
-
-/*
  * Increase resolution of nice-level calculations:
  */
 #define SCHED_LOAD_SHIFT	10
 #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
 
-#define SCHED_LOAD_SCALE_FUZZ	SCHED_LOAD_SCALE
+/*
+ * Increase resolution of cpu_power calculations
+ */
+#define SCHED_POWER_SHIFT	10
+#define SCHED_POWER_SCALE	(1L << SCHED_POWER_SHIFT)
 
+/*
+ * sched-domains (multiprocessor balancing) declarations:
+ */
 #ifdef CONFIG_SMP
 #define SD_LOAD_BALANCE		0x0001	/* Do load balancing on this domain. */
 #define SD_BALANCE_NEWIDLE	0x0002	/* Balance when about to become idle */
diff --git a/kernel/sched.c b/kernel/sched.c
index 312f8b9..f4b4679 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1302,8 +1302,7 @@ static void sched_avg_update(struct rq *rq)
  * delta *= weight / lw
  */
 static unsigned long
-calc_delta_mine(unsigned long delta_exec, unsigned long weight,
-		struct load_weight *lw)
+calc_delta_mine(unsigned long delta_exec, u64 weight, struct load_weight *lw)
 {
 	u64 tmp;
 
@@ -6468,7 +6467,7 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
 		cpulist_scnprintf(str, sizeof(str), sched_group_cpus(group));
 
 		printk(KERN_CONT " %s", str);
-		if (group->cpu_power != SCHED_LOAD_SCALE) {
+		if (group->cpu_power != SCHED_POWER_SCALE) {
 			printk(KERN_CONT " (cpu_power = %d)",
 				group->cpu_power);
 		}
@@ -7176,7 +7175,7 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 	sd->groups->cpu_power = 0;
 
 	if (!child) {
-		power = SCHED_LOAD_SCALE;
+		power = SCHED_POWER_SCALE;
 		weight = cpumask_weight(sched_domain_span(sd));
 		/*
 		 * SMT siblings share the power of a single core.
@@ -7187,7 +7186,7 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 		if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
 			power *= sd->smt_gain;
 			power /= weight;
-			power >>= SCHED_LOAD_SHIFT;
+			power >>= SCHED_POWER_SHIFT;
 		}
 		sd->groups->cpu_power += power;
 		return;
@@ -8224,7 +8223,7 @@ void __init sched_init(void)
 #ifdef CONFIG_SMP
 		rq->sd = NULL;
 		rq->rd = NULL;
-		rq->cpu_power = SCHED_LOAD_SCALE;
+		rq->cpu_power = SCHED_POWER_SCALE;
 		rq->post_schedule = 0;
 		rq->active_balance = 0;
 		rq->next_balance = jiffies;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 6fa833a..1a9340c 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1557,7 +1557,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		}
 
 		/* Adjust by relative CPU power of the group */
-		avg_load = (avg_load * SCHED_LOAD_SCALE) / group->cpu_power;
+		avg_load = (avg_load * SCHED_POWER_SCALE) / group->cpu_power;
 
 		if (local_group) {
 			this_load = avg_load;
@@ -1692,7 +1692,7 @@ select_task_rq_fair(struct rq *rq, struct task_struct *p, int sd_flag, int wake_
 				nr_running += cpu_rq(i)->cfs.nr_running;
 			}
 
-			capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
+			capacity = DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE);
 
 			if (tmp->flags & SD_POWERSAVINGS_BALANCE)
 				nr_running /= 2;
@@ -2534,7 +2534,7 @@ static inline int check_power_save_busiest_group(struct sd_lb_stats *sds,
 
 unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu)
 {
-	return SCHED_LOAD_SCALE;
+	return SCHED_POWER_SCALE;
 }
 
 unsigned long __weak arch_scale_freq_power(struct sched_domain *sd, int cpu)
@@ -2571,10 +2571,10 @@ unsigned long scale_rt_power(int cpu)
 		available = total - rq->rt_avg;
 	}
 
-	if (unlikely((s64)total < SCHED_LOAD_SCALE))
-		total = SCHED_LOAD_SCALE;
+	if (unlikely((s64)total < SCHED_POWER_SCALE))
+		total = SCHED_POWER_SCALE;
 
-	total >>= SCHED_LOAD_SHIFT;
+	total >>= SCHED_POWER_SHIFT;
 
 	return div_u64(available, total);
 }
@@ -2582,7 +2582,7 @@ unsigned long scale_rt_power(int cpu)
 static void update_cpu_power(struct sched_domain *sd, int cpu)
 {
 	unsigned long weight = sd->span_weight;
-	unsigned long power = SCHED_LOAD_SCALE;
+	unsigned long power = SCHED_POWER_SCALE;
 	struct sched_group *sdg = sd->groups;
 
 	if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
@@ -2591,7 +2591,7 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
 		else
 			power *= default_scale_smt_power(sd, cpu);
 
-		power >>= SCHED_LOAD_SHIFT;
+		power >>= SCHED_POWER_SHIFT;
 	}
 
 	sdg->cpu_power_orig = power;
@@ -2601,10 +2601,10 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
 	else
 		power *= default_scale_freq_power(sd, cpu);
 
-	power >>= SCHED_LOAD_SHIFT;
+	power >>= SCHED_POWER_SHIFT;
 
 	power *= scale_rt_power(cpu);
-	power >>= SCHED_LOAD_SHIFT;
+	power >>= SCHED_POWER_SHIFT;
 
 	if (!power)
 		power = 1;
@@ -2646,7 +2646,7 @@ static inline int
 fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
 {
 	/*
-	 * Only siblings can have significantly less than SCHED_LOAD_SCALE
+	 * Only siblings can have significantly less than SCHED_POWER_SCALE
 	 */
 	if (sd->level != SD_LV_SIBLING)
 		return 0;
@@ -2734,7 +2734,7 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 	}
 
 	/* Adjust by relative CPU power of the group */
-	sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;
+	sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->cpu_power;
 
 	/*
 	 * Consider the group unbalanced when the imbalance is larger
@@ -2751,7 +2751,8 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 	if ((max_cpu_load - min_cpu_load) >= avg_load_per_task && max_nr_running > 1)
 		sgs->group_imb = 1;
 
-	sgs->group_capacity = DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
+	sgs->group_capacity = DIV_ROUND_CLOSEST(group->cpu_power,
+						SCHED_POWER_SCALE);
 	if (!sgs->group_capacity)
 		sgs->group_capacity = fix_small_capacity(sd, group);
 	sgs->group_weight = group->group_weight;
@@ -2925,7 +2926,7 @@ static int check_asym_packing(struct sched_domain *sd,
 		return 0;
 
 	*imbalance = DIV_ROUND_CLOSEST(sds->max_load * sds->busiest->cpu_power,
-				       SCHED_LOAD_SCALE);
+				       SCHED_POWER_SCALE);
 	return 1;
 }
 
@@ -2954,7 +2955,7 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
 			cpu_avg_load_per_task(this_cpu);
 
 	scaled_busy_load_per_task = sds->busiest_load_per_task
-						 * SCHED_LOAD_SCALE;
+					 * SCHED_POWER_SCALE;
 	scaled_busy_load_per_task /= sds->busiest->cpu_power;
 
 	if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
@@ -2973,10 +2974,10 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
 			min(sds->busiest_load_per_task, sds->max_load);
 	pwr_now += sds->this->cpu_power *
 			min(sds->this_load_per_task, sds->this_load);
-	pwr_now /= SCHED_LOAD_SCALE;
+	pwr_now /= SCHED_POWER_SCALE;
 
 	/* Amount of load we'd subtract */
-	tmp = (sds->busiest_load_per_task * SCHED_LOAD_SCALE) /
+	tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
 		sds->busiest->cpu_power;
 	if (sds->max_load > tmp)
 		pwr_move += sds->busiest->cpu_power *
@@ -2984,15 +2985,15 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
 
 	/* Amount of load we'd add */
 	if (sds->max_load * sds->busiest->cpu_power <
-		sds->busiest_load_per_task * SCHED_LOAD_SCALE)
+		sds->busiest_load_per_task * SCHED_POWER_SCALE)
 		tmp = (sds->max_load * sds->busiest->cpu_power) /
 			sds->this->cpu_power;
 	else
-		tmp = (sds->busiest_load_per_task * SCHED_LOAD_SCALE) /
+		tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
 			sds->this->cpu_power;
 	pwr_move += sds->this->cpu_power *
 			min(sds->this_load_per_task, sds->this_load + tmp);
-	pwr_move /= SCHED_LOAD_SCALE;
+	pwr_move /= SCHED_POWER_SCALE;
 
 	/* Move if we gain throughput */
 	if (pwr_move > pwr_now)
@@ -3034,7 +3035,7 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
 		load_above_capacity = (sds->busiest_nr_running -
 						sds->busiest_group_capacity);
 
-		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_LOAD_SCALE);
+		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE);
 
 		load_above_capacity /= sds->busiest->cpu_power;
 	}
@@ -3054,7 +3055,7 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
 	/* How much load to actually move to equalise the imbalance */
 	*imbalance = min(max_pull * sds->busiest->cpu_power,
 		(sds->avg_load - sds->this_load) * sds->this->cpu_power)
-			/ SCHED_LOAD_SCALE;
+			/ SCHED_POWER_SCALE;
 
 	/*
 	 * if *imbalance is less than the average load per runnable task
@@ -3123,7 +3124,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
 	if (!sds.busiest || sds.busiest_nr_running == 0)
 		goto out_balanced;
 
-	sds.avg_load = (SCHED_LOAD_SCALE * sds.total_load) / sds.total_pwr;
+	sds.avg_load = (SCHED_POWER_SCALE * sds.total_load) / sds.total_pwr;
 
 	/*
 	 * If the busiest group is imbalanced the below checks don't
@@ -3202,7 +3203,8 @@ find_busiest_queue(struct sched_domain *sd, struct sched_group *group,
 
 	for_each_cpu(i, sched_group_cpus(group)) {
 		unsigned long power = power_of(i);
-		unsigned long capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
+		unsigned long capacity = DIV_ROUND_CLOSEST(power,
+							   SCHED_POWER_SCALE);
 		unsigned long wl;
 
 		if (!capacity)
@@ -3227,7 +3229,7 @@ find_busiest_queue(struct sched_domain *sd, struct sched_group *group,
 		 * the load can be moved away from the cpu that is potentially
 		 * running at a lower capacity.
 		 */
-		wl = (wl * SCHED_LOAD_SCALE) / power;
+		wl = (wl * SCHED_POWER_SCALE) / power;
 
 		if (wl > max_load) {
 			max_load = wl;
-- 
1.7.3.1



* [PATCH v1 02/19] sched: increase SCHED_LOAD_SCALE resolution
  2011-05-02  1:18 [PATCH v1 00/19] Increase resolution of load weights Nikhil Rao
  2011-05-02  1:18 ` [PATCH v1 01/19] sched: introduce SCHED_POWER_SCALE to scale cpu_power calculations Nikhil Rao
@ 2011-05-02  1:19 ` Nikhil Rao
  2011-05-02  1:19 ` [PATCH v1 03/19] sched: use u64 for load_weight fields Nikhil Rao
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Nikhil Rao @ 2011-05-02  1:19 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
  Cc: linux-kernel, Nikunj A. Dadhania, Srivatsa Vaddagiri,
	Stephan Barwolf, Nikhil Rao

Introduce SCHED_LOAD_RESOLUTION, which is added to SCHED_LOAD_SHIFT and
increases the resolution of SCHED_LOAD_SCALE. This patch sets the value of
SCHED_LOAD_RESOLUTION to 10, scaling up the weights for all sched entities by
a factor of 1024. With this extra resolution, we can handle deeper cgroup
hierarchies and the scheduler can do better shares distribution and load
balancing on larger systems (especially for low weight task groups).

This does not change the existing user interface; the scaled weights are only
used internally. We do not modify prio_to_weight values or inverses, but use
the original weights when calculating the inverse, which is used to scale
execution time deltas in calc_delta_mine(). This ensures we do not lose
accuracy when accounting time to the sched entities. Thanks to Nikunj Dadhania
for fixing a bug in c_d_m() that broke fairness.
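
As a worked example of the scaling (illustrative arithmetic, not new code in
the patch):

    SCHED_LOAD_SHIFT = 10 + SCHED_LOAD_RESOLUTION = 20
    SCHED_LOAD_SCALE = 1L << 20                   = 1048576

    nice-0 task:  prio_to_weight[20] = 1024
                  se.load.weight     = 1024 << 10 = 1048576  (2^20)
    MAX_SHARES:   1UL << (18 + 10)                = 2^28

calc_delta_mine() shifts the weight back down by SCHED_LOAD_RESOLUTION before
computing inv_weight, so the inverse is still derived from the original 10-bit
weights and the delta_exec scaling keeps its accuracy.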

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 include/linux/sched.h |    3 ++-
 kernel/sched.c        |   20 +++++++++++---------
 2 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8d1ff2b..d2c3bab 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -794,7 +794,8 @@ enum cpu_idle_type {
 /*
  * Increase resolution of nice-level calculations:
  */
-#define SCHED_LOAD_SHIFT	10
+#define SCHED_LOAD_RESOLUTION	10
+#define SCHED_LOAD_SHIFT	(10 + SCHED_LOAD_RESOLUTION)
 #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
 
 /*
diff --git a/kernel/sched.c b/kernel/sched.c
index f4b4679..05e3fe2 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -293,7 +293,7 @@ static DEFINE_SPINLOCK(task_group_lock);
  *  limitation from this.)
  */
 #define MIN_SHARES	2
-#define MAX_SHARES	(1UL << 18)
+#define MAX_SHARES	(1UL << 18 + SCHED_LOAD_RESOLUTION)
 
 static int root_task_group_load = ROOT_TASK_GROUP_LOAD;
 #endif
@@ -1307,14 +1307,14 @@ calc_delta_mine(unsigned long delta_exec, u64 weight, struct load_weight *lw)
 	u64 tmp;
 
 	if (!lw->inv_weight) {
-		if (BITS_PER_LONG > 32 && unlikely(lw->weight >= WMULT_CONST))
+		unsigned long w = lw->weight >> SCHED_LOAD_RESOLUTION;
+		if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
 			lw->inv_weight = 1;
 		else
-			lw->inv_weight = 1 + (WMULT_CONST-lw->weight/2)
-				/ (lw->weight+1);
+			lw->inv_weight = 1 + (WMULT_CONST - w/2) / (w + 1);
 	}
 
-	tmp = (u64)delta_exec * weight;
+	tmp = (u64)delta_exec * ((weight >> SCHED_LOAD_RESOLUTION) + 1);
 	/*
 	 * Check whether we'd overflow the 64-bit multiplication:
 	 */
@@ -1758,12 +1758,13 @@ static void set_load_weight(struct task_struct *p)
 	 * SCHED_IDLE tasks get minimal weight:
 	 */
 	if (p->policy == SCHED_IDLE) {
-		p->se.load.weight = WEIGHT_IDLEPRIO;
+		p->se.load.weight = WEIGHT_IDLEPRIO << SCHED_LOAD_RESOLUTION;
 		p->se.load.inv_weight = WMULT_IDLEPRIO;
 		return;
 	}
 
-	p->se.load.weight = prio_to_weight[p->static_prio - MAX_RT_PRIO];
+	p->se.load.weight = prio_to_weight[p->static_prio - MAX_RT_PRIO]
+				<< SCHED_LOAD_RESOLUTION;
 	p->se.load.inv_weight = prio_to_wmult[p->static_prio - MAX_RT_PRIO];
 }
 
@@ -9129,14 +9130,15 @@ cpu_cgroup_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
 static int cpu_shares_write_u64(struct cgroup *cgrp, struct cftype *cftype,
 				u64 shareval)
 {
-	return sched_group_set_shares(cgroup_tg(cgrp), shareval);
+	return sched_group_set_shares(cgroup_tg(cgrp),
+				      shareval << SCHED_LOAD_RESOLUTION);
 }
 
 static u64 cpu_shares_read_u64(struct cgroup *cgrp, struct cftype *cft)
 {
 	struct task_group *tg = cgroup_tg(cgrp);
 
-	return (u64) tg->shares;
+	return (u64) tg->shares >> SCHED_LOAD_RESOLUTION;
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-- 
1.7.3.1



* [PATCH v1 03/19] sched: use u64 for load_weight fields
  2011-05-02  1:18 [PATCH v1 00/19] Increase resolution of load weights Nikhil Rao
  2011-05-02  1:18 ` [PATCH v1 01/19] sched: introduce SCHED_POWER_SCALE to scale cpu_power calculations Nikhil Rao
  2011-05-02  1:19 ` [PATCH v1 02/19] sched: increase SCHED_LOAD_SCALE resolution Nikhil Rao
@ 2011-05-02  1:19 ` Nikhil Rao
  2011-05-02  1:19 ` [PATCH v1 04/19] sched: update cpu_load to be u64 Nikhil Rao
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Nikhil Rao @ 2011-05-02  1:19 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
  Cc: linux-kernel, Nikunj A. Dadhania, Srivatsa Vaddagiri,
	Stephan Barwolf, Nikhil Rao

This patch converts the load_weight fields to use u64 instead of unsigned long.
This is effectively a no-op on 64-bit, where unsigned long is already 64 bits
wide. On 32-bit architectures, the conversion is required to ensure the rq load
weight does not overflow in the presence of multiple large weight entities. Also
increase MAX_SHARES to 2^28 (from 2^18).
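
To make the 32-bit overflow concern concrete (a back-of-the-envelope note, not
from the patch itself):

    scaled nice-0 weight               = 1024 << 10  = 2^20
    32-bit unsigned long maximum       ~ 2^32
    nice-0 entities needed to overflow = 2^32 / 2^20 = 4096
    MAX_SHARES entities to overflow    = 2^32 / 2^28 = 16

so an rq aggregating many scaled weights needs the full 64 bits.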

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 include/linux/sched.h |    2 +-
 kernel/sched_debug.c  |    4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d2c3bab..6d88be1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1104,7 +1104,7 @@ struct sched_class {
 };
 
 struct load_weight {
-	unsigned long weight, inv_weight;
+	u64 weight, inv_weight;
 };
 
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 7bacd83..d22b666 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -201,7 +201,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	SEQ_printf(m, "  .%-30s: %d\n", "nr_spread_over",
 			cfs_rq->nr_spread_over);
 	SEQ_printf(m, "  .%-30s: %ld\n", "nr_running", cfs_rq->nr_running);
-	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
+	SEQ_printf(m, "  .%-30s: %lld\n", "load", cfs_rq->load.weight);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 #ifdef CONFIG_SMP
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "load_avg",
@@ -264,7 +264,7 @@ static void print_cpu(struct seq_file *m, int cpu)
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", #x, SPLIT_NS(rq->x))
 
 	P(nr_running);
-	SEQ_printf(m, "  .%-30s: %lu\n", "load",
+	SEQ_printf(m, "  .%-30s: %llu\n", "load",
 		   rq->load.weight);
 	P(nr_switches);
 	P(nr_load_updates);
-- 
1.7.3.1



* [PATCH v1 04/19] sched: update cpu_load to be u64
  2011-05-02  1:18 [PATCH v1 00/19] Increase resolution of load weights Nikhil Rao
                   ` (2 preceding siblings ...)
  2011-05-02  1:19 ` [PATCH v1 03/19] sched: use u64 for load_weight fields Nikhil Rao
@ 2011-05-02  1:19 ` Nikhil Rao
  2011-05-02  1:19 ` [PATCH v1 05/19] sched: update this_cpu_load() to return u64 value Nikhil Rao
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Nikhil Rao @ 2011-05-02  1:19 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
  Cc: linux-kernel, Nikunj A. Dadhania, Srivatsa Vaddagiri,
	Stephan Barwolf, Nikhil Rao

cpu_load derives from rq->load.weight and needs to be updated to u64 as it can
now overflow on 32-bit machines. This patch updates cpu_load in the rq struct
and all functions that use this field.

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 kernel/sched.c |    9 ++++-----
 1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 05e3fe2..7badde6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -454,7 +454,7 @@ struct rq {
 	 */
 	unsigned long nr_running;
 	#define CPU_LOAD_IDX_MAX 5
-	unsigned long cpu_load[CPU_LOAD_IDX_MAX];
+	u64 cpu_load[CPU_LOAD_IDX_MAX];
 	unsigned long last_load_update_tick;
 #ifdef CONFIG_NO_HZ
 	u64 nohz_stamp;
@@ -3364,8 +3364,7 @@ static const unsigned char
  * would be when CPU is idle and so we just decay the old load without
  * adding any new load.
  */
-static unsigned long
-decay_load_missed(unsigned long load, unsigned long missed_updates, int idx)
+static u64 decay_load_missed(u64 load, unsigned long missed_updates, int idx)
 {
 	int j = 0;
 
@@ -3395,7 +3394,7 @@ decay_load_missed(unsigned long load, unsigned long missed_updates, int idx)
  */
 static void update_cpu_load(struct rq *this_rq)
 {
-	unsigned long this_load = this_rq->load.weight;
+	u64 this_load = this_rq->load.weight;
 	unsigned long curr_jiffies = jiffies;
 	unsigned long pending_updates;
 	int i, scale;
@@ -3412,7 +3411,7 @@ static void update_cpu_load(struct rq *this_rq)
 	/* Update our load: */
 	this_rq->cpu_load[0] = this_load; /* Fasttrack for idx 0 */
 	for (i = 1, scale = 2; i < CPU_LOAD_IDX_MAX; i++, scale += scale) {
-		unsigned long old_load, new_load;
+		u64 old_load, new_load;
 
 		/* scale is effectively 1 << i now, and >> i divides by scale */
 
-- 
1.7.3.1



* [PATCH v1 05/19] sched: update this_cpu_load() to return u64 value
  2011-05-02  1:18 [PATCH v1 00/19] Increase resolution of load weights Nikhil Rao
                   ` (3 preceding siblings ...)
  2011-05-02  1:19 ` [PATCH v1 04/19] sched: update cpu_load to be u64 Nikhil Rao
@ 2011-05-02  1:19 ` Nikhil Rao
  2011-05-02  1:19 ` [PATCH v1 06/19] sched: update source_load(), target_load() and weighted_cpuload() to use u64 Nikhil Rao
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Nikhil Rao @ 2011-05-02  1:19 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
  Cc: linux-kernel, Nikunj A. Dadhania, Srivatsa Vaddagiri,
	Stephan Barwolf, Nikhil Rao

The cpuidle menu governor uses this_cpu_load() for its calculations, and
this_cpu_load() now returns u64. Update the governor to use u64 as well.

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 drivers/cpuidle/governors/menu.c |    5 ++---
 include/linux/sched.h            |    2 +-
 kernel/sched.c                   |    2 +-
 3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index f508690..2051134 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -127,10 +127,9 @@ struct menu_device {
 
 static int get_loadavg(void)
 {
-	unsigned long this = this_cpu_load();
+	u64 this = this_cpu_load();
 
-
-	return LOAD_INT(this) * 10 + LOAD_FRAC(this) / 10;
+	return div_u64(LOAD_INT(this) * 10 + LOAD_FRAC(this), 10);
 }
 
 static inline int which_bucket(unsigned int duration)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d88be1..546a418 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -140,7 +140,7 @@ extern unsigned long nr_running(void);
 extern unsigned long nr_uninterruptible(void);
 extern unsigned long nr_iowait(void);
 extern unsigned long nr_iowait_cpu(int cpu);
-extern unsigned long this_cpu_load(void);
+extern u64 this_cpu_load(void);
 
 
 extern void calc_global_load(unsigned long ticks);
diff --git a/kernel/sched.c b/kernel/sched.c
index 7badde6..f2eb816 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3061,7 +3061,7 @@ unsigned long nr_iowait_cpu(int cpu)
 	return atomic_read(&this->nr_iowait);
 }
 
-unsigned long this_cpu_load(void)
+u64 this_cpu_load(void)
 {
 	struct rq *this = this_rq();
 	return this->cpu_load[0];
-- 
1.7.3.1



* [PATCH v1 06/19] sched: update source_load(), target_load() and weighted_cpuload() to use u64
  2011-05-02  1:18 [PATCH v1 00/19] Increase resolution of load weights Nikhil Rao
                   ` (4 preceding siblings ...)
  2011-05-02  1:19 ` [PATCH v1 05/19] sched: update this_cpu_load() to return u64 value Nikhil Rao
@ 2011-05-02  1:19 ` Nikhil Rao
  2011-05-02  1:19 ` [PATCH v1 07/19] sched: update find_idlest_cpu() to use u64 for load Nikhil Rao
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Nikhil Rao @ 2011-05-02  1:19 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
  Cc: linux-kernel, Nikunj A. Dadhania, Srivatsa Vaddagiri,
	Stephan Barwolf, Nikhil Rao

source_load(), target_load() and weighted_cpuload() refer to values in
rq->cpu_load, which is now u64. Update these functions to return u64 as well.

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 kernel/sched.c |   10 +++++-----
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index f2eb816..a49ef0e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1473,7 +1473,7 @@ static int tg_nop(struct task_group *tg, void *data)
 
 #ifdef CONFIG_SMP
 /* Used instead of source_load when we know the type == 0 */
-static unsigned long weighted_cpuload(const int cpu)
+static u64 weighted_cpuload(const int cpu)
 {
 	return cpu_rq(cpu)->load.weight;
 }
@@ -1485,10 +1485,10 @@ static unsigned long weighted_cpuload(const int cpu)
  * We want to under-estimate the load of migration sources, to
  * balance conservatively.
  */
-static unsigned long source_load(int cpu, int type)
+static u64 source_load(int cpu, int type)
 {
 	struct rq *rq = cpu_rq(cpu);
-	unsigned long total = weighted_cpuload(cpu);
+	u64 total = weighted_cpuload(cpu);
 
 	if (type == 0 || !sched_feat(LB_BIAS))
 		return total;
@@ -1500,10 +1500,10 @@ static unsigned long source_load(int cpu, int type)
  * Return a high guess at the load of a migration-target cpu weighted
  * according to the scheduling class and "nice" value.
  */
-static unsigned long target_load(int cpu, int type)
+static u64 target_load(int cpu, int type)
 {
 	struct rq *rq = cpu_rq(cpu);
-	unsigned long total = weighted_cpuload(cpu);
+	u64 total = weighted_cpuload(cpu);
 
 	if (type == 0 || !sched_feat(LB_BIAS))
 		return total;
-- 
1.7.3.1



* [PATCH v1 07/19] sched: update find_idlest_cpu() to use u64 for load
  2011-05-02  1:18 [PATCH v1 00/19] Increase resolution of load weights Nikhil Rao
                   ` (5 preceding siblings ...)
  2011-05-02  1:19 ` [PATCH v1 06/19] sched: update source_load(), target_load() and weighted_cpuload() to use u64 Nikhil Rao
@ 2011-05-02  1:19 ` Nikhil Rao
  2011-05-02  1:19 ` [PATCH v1 08/19] sched: update find_idlest_group() to use u64 Nikhil Rao
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Nikhil Rao @ 2011-05-02  1:19 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
  Cc: linux-kernel, Nikunj A. Dadhania, Srivatsa Vaddagiri,
	Stephan Barwolf, Nikhil Rao

find_idlest_cpu() compares load across runqueues, and that load is now u64.
Update the comparison to use u64 as well.

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 kernel/sched_fair.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 1a9340c..c08324b 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1578,7 +1578,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 static int
 find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 {
-	unsigned long load, min_load = ULONG_MAX;
+	u64 load, min_load = ULLONG_MAX;
 	int idlest = -1;
 	int i;
 
-- 
1.7.3.1



* [PATCH v1 08/19] sched: update find_idlest_group() to use u64
  2011-05-02  1:18 [PATCH v1 00/19] Increase resolution of load weights Nikhil Rao
                   ` (6 preceding siblings ...)
  2011-05-02  1:19 ` [PATCH v1 07/19] sched: update find_idlest_cpu() to use u64 for load Nikhil Rao
@ 2011-05-02  1:19 ` Nikhil Rao
  2011-05-02  1:19 ` [PATCH v1 09/19] sched: update division in cpu_avg_load_per_task to use div_u64 Nikhil Rao
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Nikhil Rao @ 2011-05-02  1:19 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
  Cc: linux-kernel, Nikunj A. Dadhania, Srivatsa Vaddagiri,
	Stephan Barwolf, Nikhil Rao

Update find_idlest_group() to use u64 to accumulate and maintain cpu_load
weights.

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 kernel/sched_fair.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c08324b..49e1eeb 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1527,11 +1527,11 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		  int this_cpu, int load_idx)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
-	unsigned long min_load = ULONG_MAX, this_load = 0;
+	u64 min_load = ULLONG_MAX, this_load = 0;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;
 
 	do {
-		unsigned long load, avg_load;
+		u64 load, avg_load;
 		int local_group;
 		int i;
 
@@ -1557,7 +1557,8 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		}
 
 		/* Adjust by relative CPU power of the group */
-		avg_load = (avg_load * SCHED_POWER_SCALE) / group->cpu_power;
+		avg_load *= SCHED_POWER_SCALE;
+		avg_load = div_u64(avg_load, group->cpu_power);
 
 		if (local_group) {
 			this_load = avg_load;
-- 
1.7.3.1



* [PATCH v1 09/19] sched: update division in cpu_avg_load_per_task to use div_u64
  2011-05-02  1:18 [PATCH v1 00/19] Increase resolution of load weights Nikhil Rao
                   ` (7 preceding siblings ...)
  2011-05-02  1:19 ` [PATCH v1 08/19] sched: update find_idlest_group() to use u64 Nikhil Rao
@ 2011-05-02  1:19 ` Nikhil Rao
  2011-05-02  1:19 ` [PATCH v1 10/19] sched: update wake_affine path to use u64, s64 for weights Nikhil Rao
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Nikhil Rao @ 2011-05-02  1:19 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
  Cc: linux-kernel, Nikunj A. Dadhania, Srivatsa Vaddagiri,
	Stephan Barwolf, Nikhil Rao

This patch updates the division in cpu_avg_load_per_task() to use div_u64, so
that it works on 32-bit. We do not convert avg_load_per_task to u64 since it
can be at most 2^28, which fits into an unsigned long on 32-bit.

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 kernel/sched.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index a49ef0e..08dcd24 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1524,7 +1524,7 @@ static unsigned long cpu_avg_load_per_task(int cpu)
 	unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
 
 	if (nr_running)
-		rq->avg_load_per_task = rq->load.weight / nr_running;
+		rq->avg_load_per_task = div_u64(rq->load.weight, nr_running);
 	else
 		rq->avg_load_per_task = 0;
 
-- 
1.7.3.1



* [PATCH v1 10/19] sched: update wake_affine path to use u64, s64 for weights
  2011-05-02  1:18 [PATCH v1 00/19] Increase resolution of load weights Nikhil Rao
                   ` (8 preceding siblings ...)
  2011-05-02  1:19 ` [PATCH v1 09/19] sched: update division in cpu_avg_load_per_task to use div_u64 Nikhil Rao
@ 2011-05-02  1:19 ` Nikhil Rao
  2011-05-02  1:19 ` [PATCH v1 11/19] sched: update update_sg_lb_stats() to use u64 Nikhil Rao
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Nikhil Rao @ 2011-05-02  1:19 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
  Cc: linux-kernel, Nikunj A. Dadhania, Srivatsa Vaddagiri,
	Stephan Barwolf, Nikhil Rao

Update the s64/u64 math in wake_affine() and effective_load() to handle
increased resolution.

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 kernel/sched_fair.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 49e1eeb..1e011b1 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1388,7 +1388,7 @@ static void task_waking_fair(struct rq *rq, struct task_struct *p)
  * of group shares between cpus. Assuming the shares were perfectly aligned one
  * can calculate the shift in shares.
  */
-static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
+static s64 effective_load(struct task_group *tg, int cpu, s64 wl, s64 wg)
 {
 	struct sched_entity *se = tg->se[cpu];
 
@@ -1396,7 +1396,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 		return wl;
 
 	for_each_sched_entity(se) {
-		long lw, w;
+		s64 lw, w;
 
 		tg = se->my_q->tg;
 		w = se->my_q->load.weight;
@@ -1409,7 +1409,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 		wl += w;
 
 		if (lw > 0 && wl < lw)
-			wl = (wl * tg->shares) / lw;
+			wl = div64_s64(wl * tg->shares, lw);
 		else
 			wl = tg->shares;
 
@@ -1504,7 +1504,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 
 	if (balanced ||
 	    (this_load <= load &&
-	     this_load + target_load(prev_cpu, idx) <= tl_per_task)) {
+	     this_load + target_load(prev_cpu, idx) <= (u64)tl_per_task)) {
 		/*
 		 * This domain has SD_WAKE_AFFINE and
 		 * p is cache cold in this domain, and
-- 
1.7.3.1



* [PATCH v1 11/19] sched: update update_sg_lb_stats() to use u64
  2011-05-02  1:18 [PATCH v1 00/19] Increase resolution of load weights Nikhil Rao
                   ` (9 preceding siblings ...)
  2011-05-02  1:19 ` [PATCH v1 10/19] sched: update wake_affine path to use u64, s64 for weights Nikhil Rao
@ 2011-05-02  1:19 ` Nikhil Rao
  2011-05-02  1:19 ` [PATCH v1 12/19] sched: Update update_sd_lb_stats() " Nikhil Rao
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Nikhil Rao @ 2011-05-02  1:19 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
  Cc: linux-kernel, Nikunj A. Dadhania, Srivatsa Vaddagiri,
	Stephan Barwolf, Nikhil Rao

Update variable types and 64-bit math in update_sg_lb_stats() to handle u64
weights.

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 kernel/sched_fair.c |   22 +++++++++++++---------
 1 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 1e011b1..992b9f4 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2345,14 +2345,14 @@ struct sd_lb_stats {
  * sg_lb_stats - stats of a sched_group required for load_balancing
  */
 struct sg_lb_stats {
-	unsigned long avg_load; /*Avg load across the CPUs of the group */
-	unsigned long group_load; /* Total load over the CPUs of the group */
+	u64 avg_load;		/* Avg load across the CPUs of the group */
+	u64 group_load;		/* Total load over the CPUs of the group */
 	unsigned long sum_nr_running; /* Nr tasks running in the group */
-	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
+	u64 sum_weighted_load;	/* Weighted load of group's tasks */
 	unsigned long group_capacity;
 	unsigned long idle_cpus;
 	unsigned long group_weight;
-	int group_imb; /* Is there an imbalance in the group ? */
+	int group_imb;		/* Is there an imbalance in the group ? */
 	int group_has_capacity; /* Is there extra capacity in the group? */
 };
 
@@ -2679,7 +2679,8 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 			int local_group, const struct cpumask *cpus,
 			int *balance, struct sg_lb_stats *sgs)
 {
-	unsigned long load, max_cpu_load, min_cpu_load, max_nr_running;
+	u64 load, max_cpu_load, min_cpu_load;
+	unsigned long max_nr_running;
 	int i;
 	unsigned int balance_cpu = -1, first_idle_cpu = 0;
 	unsigned long avg_load_per_task = 0;
@@ -2689,7 +2690,7 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 
 	/* Tally up the load of all CPUs in the group */
 	max_cpu_load = 0;
-	min_cpu_load = ~0UL;
+	min_cpu_load = ~0ULL;
 	max_nr_running = 0;
 
 	for_each_cpu_and(i, sched_group_cpus(group), cpus) {
@@ -2735,7 +2736,8 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 	}
 
 	/* Adjust by relative CPU power of the group */
-	sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->cpu_power;
+	sgs->avg_load = div_u64(sgs->group_load * SCHED_POWER_SCALE,
+				group->cpu_power);
 
 	/*
 	 * Consider the group unbalanced when the imbalance is larger
@@ -2747,9 +2749,11 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 	 *      the hierarchy?
 	 */
 	if (sgs->sum_nr_running)
-		avg_load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
+		avg_load_per_task = div_u64(sgs->sum_weighted_load,
+					    sgs->sum_nr_running);
 
-	if ((max_cpu_load - min_cpu_load) >= avg_load_per_task && max_nr_running > 1)
+	if ((max_cpu_load - min_cpu_load) >= avg_load_per_task &&
+			max_nr_running > 1)
 		sgs->group_imb = 1;
 
 	sgs->group_capacity = DIV_ROUND_CLOSEST(group->cpu_power,
-- 
1.7.3.1



* [PATCH v1 12/19] sched: Update update_sd_lb_stats() to use u64
  2011-05-02  1:18 [PATCH v1 00/19] Increase resolution of load weights Nikhil Rao
                   ` (10 preceding siblings ...)
  2011-05-02  1:19 ` [PATCH v1 11/19] sched: update update_sg_lb_stats() to use u64 Nikhil Rao
@ 2011-05-02  1:19 ` Nikhil Rao
  2011-05-02  1:19 ` [PATCH v1 13/19] sched: update f_b_g() to use u64 for weights Nikhil Rao
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Nikhil Rao @ 2011-05-02  1:19 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
  Cc: linux-kernel, Nikunj A. Dadhania, Srivatsa Vaddagiri,
	Stephan Barwolf, Nikhil Rao

Update update_sd_lb_stats() and helper functions to use u64/s64 for weight
calculations.

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 kernel/sched_fair.c |   22 +++++++++++-----------
 1 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 992b9f4..152b472 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2308,23 +2308,23 @@ static int move_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
  * 		during load balancing.
  */
 struct sd_lb_stats {
-	struct sched_group *busiest; /* Busiest group in this sd */
-	struct sched_group *this;  /* Local group in this sd */
-	unsigned long total_load;  /* Total load of all groups in sd */
-	unsigned long total_pwr;   /*	Total power of all groups in sd */
-	unsigned long avg_load;	   /* Average load across all groups in sd */
+	struct sched_group *busiest;	/* Busiest group in this sd */
+	struct sched_group *this;	/* Local group in this sd */
+	u64 total_load;			/* Total load of all groups in sd */
+	unsigned long total_pwr;	/* Total power of all groups in sd */
+	u64 avg_load;			/* Avg load across all groups in sd */
 
 	/** Statistics of this group */
-	unsigned long this_load;
-	unsigned long this_load_per_task;
+	u64 this_load;
+	u64 this_load_per_task;
 	unsigned long this_nr_running;
 	unsigned long this_has_capacity;
 	unsigned int  this_idle_cpus;
 
 	/* Statistics of the busiest group */
 	unsigned int  busiest_idle_cpus;
-	unsigned long max_load;
-	unsigned long busiest_load_per_task;
+	u64 max_load;
+	u64 busiest_load_per_task;
 	unsigned long busiest_nr_running;
 	unsigned long busiest_group_capacity;
 	unsigned long busiest_has_capacity;
@@ -2461,8 +2461,8 @@ static inline void update_sd_power_savings_stats(struct sched_group *group,
 	     group_first_cpu(group) > group_first_cpu(sds->group_min))) {
 		sds->group_min = group;
 		sds->min_nr_running = sgs->sum_nr_running;
-		sds->min_load_per_task = sgs->sum_weighted_load /
-						sgs->sum_nr_running;
+		sds->min_load_per_task = div_u64(sgs->sum_weighted_load,
+						 sgs->sum_nr_running);
 	}
 
 	/*
-- 
1.7.3.1



* [PATCH v1 13/19] sched: update f_b_g() to use u64 for weights
  2011-05-02  1:18 [PATCH v1 00/19] Increase resolution of load weights Nikhil Rao
                   ` (11 preceding siblings ...)
  2011-05-02  1:19 ` [PATCH v1 12/19] sched: Update update_sd_lb_stats() " Nikhil Rao
@ 2011-05-02  1:19 ` Nikhil Rao
  2011-05-02  1:19 ` [PATCH v1 14/19] sched: change type of imbalance to be u64 Nikhil Rao
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Nikhil Rao @ 2011-05-02  1:19 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
  Cc: linux-kernel, Nikunj A. Dadhania, Srivatsa Vaddagiri,
	Stephan Barwolf, Nikhil Rao

This patch updates f_b_g() and helper functions to use u64 to handle the
increased sched load resolution.

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 kernel/sched_fair.c |   51 +++++++++++++++++++++++++++------------------------
 1 files changed, 27 insertions(+), 24 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 152b472..3e01c8d 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2946,12 +2946,13 @@ static int check_asym_packing(struct sched_domain *sd,
 static inline void fix_small_imbalance(struct sd_lb_stats *sds,
 				int this_cpu, unsigned long *imbalance)
 {
-	unsigned long tmp, pwr_now = 0, pwr_move = 0;
+	u64 tmp, pwr_now = 0, pwr_move = 0;
 	unsigned int imbn = 2;
 	unsigned long scaled_busy_load_per_task;
 
 	if (sds->this_nr_running) {
-		sds->this_load_per_task /= sds->this_nr_running;
+		sds->this_load_per_task = div_u64(sds->this_load_per_task,
+						  sds->this_nr_running);
 		if (sds->busiest_load_per_task >
 				sds->this_load_per_task)
 			imbn = 1;
@@ -2959,9 +2960,9 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
 		sds->this_load_per_task =
 			cpu_avg_load_per_task(this_cpu);
 
-	scaled_busy_load_per_task = sds->busiest_load_per_task
-					 * SCHED_POWER_SCALE;
-	scaled_busy_load_per_task /= sds->busiest->cpu_power;
+	scaled_busy_load_per_task =
+		div_u64(sds->busiest_load_per_task * SCHED_POWER_SCALE,
+			sds->busiest->cpu_power);
 
 	if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
 			(scaled_busy_load_per_task * imbn)) {
@@ -2979,11 +2980,11 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
 			min(sds->busiest_load_per_task, sds->max_load);
 	pwr_now += sds->this->cpu_power *
 			min(sds->this_load_per_task, sds->this_load);
-	pwr_now /= SCHED_POWER_SCALE;
+	pwr_now = div_u64(pwr_now, SCHED_POWER_SCALE);
 
 	/* Amount of load we'd subtract */
-	tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
-		sds->busiest->cpu_power;
+	tmp = div_u64(sds->busiest_load_per_task * SCHED_POWER_SCALE,
+		      sds->busiest->cpu_power);
 	if (sds->max_load > tmp)
 		pwr_move += sds->busiest->cpu_power *
 			min(sds->busiest_load_per_task, sds->max_load - tmp);
@@ -2991,14 +2992,15 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
 	/* Amount of load we'd add */
 	if (sds->max_load * sds->busiest->cpu_power <
 		sds->busiest_load_per_task * SCHED_POWER_SCALE)
-		tmp = (sds->max_load * sds->busiest->cpu_power) /
-			sds->this->cpu_power;
+		tmp = div_u64(sds->max_load * sds->busiest->cpu_power,
+			      sds->this->cpu_power);
 	else
-		tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
-			sds->this->cpu_power;
+		tmp = div_u64(sds->busiest_load_per_task * SCHED_POWER_SCALE,
+			      sds->this->cpu_power);
+
 	pwr_move += sds->this->cpu_power *
 			min(sds->this_load_per_task, sds->this_load + tmp);
-	pwr_move /= SCHED_POWER_SCALE;
+	pwr_move = div_u64(pwr_move, SCHED_POWER_SCALE);
 
 	/* Move if we gain throughput */
 	if (pwr_move > pwr_now)
@@ -3015,9 +3017,10 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
 static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
 		unsigned long *imbalance)
 {
-	unsigned long max_pull, load_above_capacity = ~0UL;
+	u64 max_pull, load_above_capacity = ~0ULL;
 
-	sds->busiest_load_per_task /= sds->busiest_nr_running;
+	sds->busiest_load_per_task = div_u64(sds->busiest_load_per_task,
+					     sds->busiest_nr_running);
 	if (sds->group_imb) {
 		sds->busiest_load_per_task =
 			min(sds->busiest_load_per_task, sds->avg_load);
@@ -3034,15 +3037,15 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
 	}
 
 	if (!sds->group_imb) {
+		unsigned long imb_capacity = (sds->busiest_nr_running -
+					      sds->busiest_group_capacity);
 		/*
 		 * Don't want to pull so many tasks that a group would go idle.
 		 */
-		load_above_capacity = (sds->busiest_nr_running -
-						sds->busiest_group_capacity);
-
-		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE);
-
-		load_above_capacity /= sds->busiest->cpu_power;
+		load_above_capacity = NICE_0_LOAD * imb_capacity;
+		load_above_capacity =
+			div_u64(load_above_capacity * SCHED_POWER_SCALE,
+				sds->busiest->cpu_power);
 	}
 
 	/*
@@ -3059,8 +3062,8 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
 
 	/* How much load to actually move to equalise the imbalance */
 	*imbalance = min(max_pull * sds->busiest->cpu_power,
-		(sds->avg_load - sds->this_load) * sds->this->cpu_power)
-			/ SCHED_POWER_SCALE;
+			(sds->avg_load - sds->this_load)*sds->this->cpu_power);
+	*imbalance = div_u64(*imbalance, SCHED_POWER_SCALE);
 
 	/*
 	 * if *imbalance is less than the average load per runnable task
@@ -3129,7 +3132,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
 	if (!sds.busiest || sds.busiest_nr_running == 0)
 		goto out_balanced;
 
-	sds.avg_load = (SCHED_POWER_SCALE * sds.total_load) / sds.total_pwr;
+	sds.avg_load = div_u64(sds.total_load*SCHED_POWER_SCALE, sds.total_pwr);
 
 	/*
 	 * If the busiest group is imbalanced the below checks don't
-- 
1.7.3.1



* [PATCH v1 14/19] sched: change type of imbalance to be u64
  2011-05-02  1:18 [PATCH v1 00/19] Increase resolution of load weights Nikhil Rao
                   ` (12 preceding siblings ...)
  2011-05-02  1:19 ` [PATCH v1 13/19] sched: update f_b_g() to use u64 for weights Nikhil Rao
@ 2011-05-02  1:19 ` Nikhil Rao
  2011-05-02  1:19 ` [PATCH v1 15/19] sched: update h_load to use u64 Nikhil Rao
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Nikhil Rao @ 2011-05-02  1:19 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
  Cc: linux-kernel, Nikunj A. Dadhania, Srivatsa Vaddagiri,
	Stephan Barwolf, Nikhil Rao

This patch changes the type of imbalance to be u64. With the increased sched
load resolution, it is possible for a runqueue to have a sched weight of 2^32
or more, and imbalance needs to be updated to u64 to handle this case.

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 include/linux/sched.h |    2 +-
 kernel/sched_fair.c   |   24 ++++++++++++------------
 kernel/sched_stats.h  |    2 +-
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 546a418..2d9689a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -945,10 +945,10 @@ struct sched_domain {
 
 #ifdef CONFIG_SCHEDSTATS
 	/* load_balance() stats */
+	u64 lb_imbalance[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_count[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_failed[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_balanced[CPU_MAX_IDLE_TYPES];
-	unsigned int lb_imbalance[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_gained[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_hot_gained[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_nobusyg[CPU_MAX_IDLE_TYPES];
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 3e01c8d..850e41b 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2497,7 +2497,7 @@ static inline void update_sd_power_savings_stats(struct sched_group *group,
  * Else returns 0.
  */
 static inline int check_power_save_busiest_group(struct sd_lb_stats *sds,
-					int this_cpu, unsigned long *imbalance)
+					int this_cpu, u64 *imbalance)
 {
 	if (!sds->power_savings_balance)
 		return 0;
@@ -2526,7 +2526,7 @@ static inline void update_sd_power_savings_stats(struct sched_group *group,
 }
 
 static inline int check_power_save_busiest_group(struct sd_lb_stats *sds,
-					int this_cpu, unsigned long *imbalance)
+					int this_cpu, u64 *imbalance)
 {
 	return 0;
 }
@@ -2916,7 +2916,7 @@ int __weak arch_sd_sibling_asym_packing(void)
  */
 static int check_asym_packing(struct sched_domain *sd,
 			      struct sd_lb_stats *sds,
-			      int this_cpu, unsigned long *imbalance)
+			      int this_cpu, u64 *imbalance)
 {
 	int busiest_cpu;
 
@@ -2943,8 +2943,8 @@ static int check_asym_packing(struct sched_domain *sd,
  * @this_cpu: The cpu at whose sched_domain we're performing load-balance.
  * @imbalance: Variable to store the imbalance.
  */
-static inline void fix_small_imbalance(struct sd_lb_stats *sds,
-				int this_cpu, unsigned long *imbalance)
+static inline
+void fix_small_imbalance(struct sd_lb_stats *sds, int this_cpu, u64 *imbalance)
 {
 	u64 tmp, pwr_now = 0, pwr_move = 0;
 	unsigned int imbn = 2;
@@ -3014,8 +3014,8 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
  * @this_cpu: Cpu for which currently load balance is being performed.
  * @imbalance: The variable to store the imbalance.
  */
-static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
-		unsigned long *imbalance)
+static inline
+void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu, u64 *imbalance)
 {
 	u64 max_pull, load_above_capacity = ~0ULL;
 
@@ -3103,9 +3103,9 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
  *		   put to idle by rebalancing its tasks onto our group.
  */
 static struct sched_group *
-find_busiest_group(struct sched_domain *sd, int this_cpu,
-		   unsigned long *imbalance, enum cpu_idle_type idle,
-		   const struct cpumask *cpus, int *balance)
+find_busiest_group(struct sched_domain *sd, int this_cpu, u64 *imbalance,
+		   enum cpu_idle_type idle, const struct cpumask *cpus,
+		   int *balance)
 {
 	struct sd_lb_stats sds;
 
@@ -3202,7 +3202,7 @@ ret:
  */
 static struct rq *
 find_busiest_queue(struct sched_domain *sd, struct sched_group *group,
-		   enum cpu_idle_type idle, unsigned long imbalance,
+		   enum cpu_idle_type idle, u64 imbalance,
 		   const struct cpumask *cpus)
 {
 	struct rq *busiest = NULL, *rq;
@@ -3308,7 +3308,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 {
 	int ld_moved, all_pinned = 0, active_balance = 0;
 	struct sched_group *group;
-	unsigned long imbalance;
+	u64 imbalance;
 	struct rq *busiest;
 	unsigned long flags;
 	struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask);
diff --git a/kernel/sched_stats.h b/kernel/sched_stats.h
index 48ddf43..f44676c 100644
--- a/kernel/sched_stats.h
+++ b/kernel/sched_stats.h
@@ -46,7 +46,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
 			seq_printf(seq, "domain%d %s", dcount++, mask_str);
 			for (itype = CPU_IDLE; itype < CPU_MAX_IDLE_TYPES;
 					itype++) {
-				seq_printf(seq, " %u %u %u %u %u %u %u %u",
+				seq_printf(seq, " %u %u %u %llu %u %u %u %u",
 				    sd->lb_count[itype],
 				    sd->lb_balanced[itype],
 				    sd->lb_failed[itype],
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v1 15/19] sched: update h_load to use u64
  2011-05-02  1:18 [PATCH v1 00/19] Increase resolution of load weights Nikhil Rao
                   ` (13 preceding siblings ...)
  2011-05-02  1:19 ` [PATCH v1 14/19] sched: change type of imbalance to be u64 Nikhil Rao
@ 2011-05-02  1:19 ` Nikhil Rao
  2011-05-02  1:19 ` [PATCH v1 16/19] sched: update move_task() and helper functions to use u64 for weights Nikhil Rao
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Nikhil Rao @ 2011-05-02  1:19 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
  Cc: linux-kernel, Nikunj A. Dadhania, Srivatsa Vaddagiri,
	Stephan Barwolf, Nikhil Rao

Calculate tg->h_load using u64 to handle u64 load weights.
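
For illustration only (hypothetical weights, with plain 64-bit division
standing in for div64_u64()), the per-level fold this performs and why the
intermediate product needs 64 bits:

#include <stdio.h>
#include <stdint.h>

/* Mirrors: load = parent_h_load * se_weight / (parent_weight + 1) */
static uint64_t fold_h_load(uint64_t parent_h_load, uint64_t se_weight,
			    uint64_t parent_weight)
{
	return parent_h_load * se_weight / (parent_weight + 1);
}

int main(void)
{
	/* assumed weights, already scaled up by 2^10 */
	uint64_t parent_h_load = 8192ULL << 10;		/* 2^23 */
	uint64_t se_weight     = 2048ULL << 10;		/* 2^21 */
	uint64_t parent_weight = 4096ULL << 10;		/* 2^22 */

	/* the product is ~2^44, well past what 32-bit math can hold */
	printf("h_load = %llu\n", (unsigned long long)
	       fold_h_load(parent_h_load, se_weight, parent_weight));
	return 0;
}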

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 kernel/sched.c |   12 +++++++-----
 1 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 08dcd24..6b9b02a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -354,7 +354,7 @@ struct cfs_rq {
 	 * Where f(tg) is the recursive weight fraction assigned to
 	 * this group.
 	 */
-	unsigned long h_load;
+	u64 h_load;
 
 	/*
 	 * Maintaining per-cpu shares distribution for group scheduling
@@ -1540,15 +1540,17 @@ static unsigned long cpu_avg_load_per_task(int cpu)
  */
 static int tg_load_down(struct task_group *tg, void *data)
 {
-	unsigned long load;
+	u64 load;
 	long cpu = (long)data;
 
 	if (!tg->parent) {
 		load = cpu_rq(cpu)->load.weight;
 	} else {
-		load = tg->parent->cfs_rq[cpu]->h_load;
-		load *= tg->se[cpu]->load.weight;
-		load /= tg->parent->cfs_rq[cpu]->load.weight + 1;
+		u64 parent_h_load = tg->parent->cfs_rq[cpu]->h_load;
+		u64 parent_weight = tg->parent->cfs_rq[cpu]->load.weight;
+
+		load = div64_u64(parent_h_load * tg->se[cpu]->load.weight,
+				 parent_weight + 1);
 	}
 
 	tg->cfs_rq[cpu]->h_load = load;
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v1 16/19] sched: update move_task() and helper functions to use u64 for weights
  2011-05-02  1:18 [PATCH v1 00/19] Increase resolution of load weights Nikhil Rao
                   ` (14 preceding siblings ...)
  2011-05-02  1:19 ` [PATCH v1 15/19] sched: update h_load to use u64 Nikhil Rao
@ 2011-05-02  1:19 ` Nikhil Rao
  2011-05-02  1:19 ` [PATCH v1 17/19] sched: update f_b_q() to use u64 for weighted cpuload Nikhil Rao
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Nikhil Rao @ 2011-05-02  1:19 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
  Cc: linux-kernel, Nikunj A. Dadhania, Srivatsa Vaddagiri,
	Stephan Barwolf, Nikhil Rao

This patch updates move_task() and helper functions to use u64 to handle load
weight related calculations.

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 kernel/sched_fair.c |   41 +++++++++++++++++++----------------------
 1 files changed, 19 insertions(+), 22 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 850e41b..813bcf0 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2099,14 +2099,14 @@ move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest,
 	return 0;
 }
 
-static unsigned long
+static u64
 balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
-	      unsigned long max_load_move, struct sched_domain *sd,
+	      u64 max_load_move, struct sched_domain *sd,
 	      enum cpu_idle_type idle, int *all_pinned,
 	      int *this_best_prio, struct cfs_rq *busiest_cfs_rq)
 {
 	int loops = 0, pulled = 0;
-	long rem_load_move = max_load_move;
+	s64 rem_load_move = max_load_move;
 	struct task_struct *p, *n;
 
 	if (max_load_move == 0)
@@ -2199,13 +2199,12 @@ static void update_shares(int cpu)
 	rcu_read_unlock();
 }
 
-static unsigned long
+static u64
 load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
-		  unsigned long max_load_move,
-		  struct sched_domain *sd, enum cpu_idle_type idle,
-		  int *all_pinned, int *this_best_prio)
+		  u64 max_load_move, struct sched_domain *sd,
+		  enum cpu_idle_type idle, int *all_pinned, int *this_best_prio)
 {
-	long rem_load_move = max_load_move;
+	s64 rem_load_move = max_load_move;
 	int busiest_cpu = cpu_of(busiest);
 	struct task_group *tg;
 
@@ -2214,8 +2213,8 @@ load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
 
 	list_for_each_entry_rcu(tg, &task_groups, list) {
 		struct cfs_rq *busiest_cfs_rq = tg->cfs_rq[busiest_cpu];
-		unsigned long busiest_h_load = busiest_cfs_rq->h_load;
-		unsigned long busiest_weight = busiest_cfs_rq->load.weight;
+		u64 busiest_h_load = busiest_cfs_rq->h_load;
+		u64 busiest_weight = busiest_cfs_rq->load.weight;
 		u64 rem_load, moved_load;
 
 		/*
@@ -2224,8 +2223,8 @@ load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
 		if (!busiest_cfs_rq->task_weight)
 			continue;
 
-		rem_load = (u64)rem_load_move * busiest_weight;
-		rem_load = div_u64(rem_load, busiest_h_load + 1);
+		rem_load = div64_u64(busiest_weight * rem_load_move,
+				     busiest_h_load + 1);
 
 		moved_load = balance_tasks(this_rq, this_cpu, busiest,
 				rem_load, sd, idle, all_pinned, this_best_prio,
@@ -2234,8 +2233,8 @@ load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
 		if (!moved_load)
 			continue;
 
-		moved_load *= busiest_h_load;
-		moved_load = div_u64(moved_load, busiest_weight + 1);
+		moved_load = div64_u64(moved_load * busiest_h_load,
+				       busiest_weight + 1);
 
 		rem_load_move -= moved_load;
 		if (rem_load_move < 0)
@@ -2250,11 +2249,10 @@ static inline void update_shares(int cpu)
 {
 }
 
-static unsigned long
+static u64
 load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
-		  unsigned long max_load_move,
-		  struct sched_domain *sd, enum cpu_idle_type idle,
-		  int *all_pinned, int *this_best_prio)
+		  u64 max_load_move, struct sched_domain *sd,
+		  enum cpu_idle_type idle, int *all_pinned, int *this_best_prio)
 {
 	return balance_tasks(this_rq, this_cpu, busiest,
 			max_load_move, sd, idle, all_pinned,
@@ -2270,11 +2268,10 @@ load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
  * Called with both runqueues locked.
  */
 static int move_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
-		      unsigned long max_load_move,
-		      struct sched_domain *sd, enum cpu_idle_type idle,
-		      int *all_pinned)
+		      u64 max_load_move, struct sched_domain *sd,
+		      enum cpu_idle_type idle, int *all_pinned)
 {
-	unsigned long total_load_moved = 0, load_moved;
+	u64 total_load_moved = 0, load_moved;
 	int this_best_prio = this_rq->curr->prio;
 
 	do {
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v1 17/19] sched: update f_b_q() to use u64 for weighted cpuload
  2011-05-02  1:18 [PATCH v1 00/19] Increase resolution of load weights Nikhil Rao
                   ` (15 preceding siblings ...)
  2011-05-02  1:19 ` [PATCH v1 16/19] sched: update move_task() and helper functions to use u64 for weights Nikhil Rao
@ 2011-05-02  1:19 ` Nikhil Rao
  2011-05-02  1:19 ` [PATCH v1 18/19] sched: update shares distribution to use u64 Nikhil Rao
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Nikhil Rao @ 2011-05-02  1:19 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
  Cc: linux-kernel, Nikunj A. Dadhania, Srivatsa Vaddagiri,
	Stephan Barwolf, Nikhil Rao

Update f_b_q() to use u64 when comparing loads.

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 kernel/sched_fair.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 813bcf0..bf9bbaa 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -3203,14 +3203,14 @@ find_busiest_queue(struct sched_domain *sd, struct sched_group *group,
 		   const struct cpumask *cpus)
 {
 	struct rq *busiest = NULL, *rq;
-	unsigned long max_load = 0;
+	u64 max_load = 0;
 	int i;
 
 	for_each_cpu(i, sched_group_cpus(group)) {
 		unsigned long power = power_of(i);
 		unsigned long capacity = DIV_ROUND_CLOSEST(power,
 							   SCHED_POWER_SCALE);
-		unsigned long wl;
+		u64 wl;
 
 		if (!capacity)
 			capacity = fix_small_capacity(sd, group);
@@ -3234,7 +3234,7 @@ find_busiest_queue(struct sched_domain *sd, struct sched_group *group,
 		 * the load can be moved away from the cpu that is potentially
 		 * running at a lower capacity.
 		 */
-		wl = (wl * SCHED_POWER_SCALE) / power;
+		wl = div_u64(wl * SCHED_POWER_SCALE, power);
 
 		if (wl > max_load) {
 			max_load = wl;
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v1 18/19] sched: update shares distribution to use u64
  2011-05-02  1:18 [PATCH v1 00/19] Increase resolution of load weights Nikhil Rao
                   ` (16 preceding siblings ...)
  2011-05-02  1:19 ` [PATCH v1 17/19] sched: update f_b_q() to use u64 for weighted cpuload Nikhil Rao
@ 2011-05-02  1:19 ` Nikhil Rao
  2011-05-02  1:19 ` [PATCH v1 19/19] sched: convert atomic ops in shares update to use atomic64_t ops Nikhil Rao
  2011-05-02  6:14 ` [PATCH v1 00/19] Increase resolution of load weights Ingo Molnar
  19 siblings, 0 replies; 34+ messages in thread
From: Nikhil Rao @ 2011-05-02  1:19 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
  Cc: linux-kernel, Nikunj A. Dadhania, Srivatsa Vaddagiri,
	Stephan Barwolf, Nikhil Rao

Update the shares distribution code to use u64. We still maintain tg->shares as
an unsigned long since sched entity weights can't exceed MAX_SHARES (2^28). This
patch updates all the calculations required to estimate shares to use u64.
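
As a small stand-alone sketch of the overflow handling added to
update_cfs_load() below (the numbers are made up; the check mirrors the hunk):

#include <stdio.h>
#include <stdint.h>

/* Accumulate delta*load into load_avg and pin it at ~0ULL if the addition
 * wraps, as the patch does. */
static void update_load_avg(uint64_t *load_avg, uint64_t delta, uint64_t load)
{
	uint64_t tmp = *load_avg;

	*load_avg += delta * load;
	if (*load_avg < tmp)		/* overflow detected */
		*load_avg = ~0ULL;
}

int main(void)
{
	uint64_t load_avg = ~0ULL - 1000;	/* hypothetical near-max value */

	update_load_avg(&load_avg, 50, 100);	/* would wrap without the check */
	printf("load_avg pinned at max: %d\n", load_avg == ~0ULL);
	return 0;
}

Note this only guards the accumulation step; the delta * load product itself
is assumed to fit in 64 bits.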

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 kernel/sched.c       |    2 +-
 kernel/sched_debug.c |    6 +++---
 kernel/sched_fair.c  |   17 ++++++++++++-----
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 6b9b02a..e131225 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -367,7 +367,7 @@ struct cfs_rq {
 	u64 load_period;
 	u64 load_stamp, load_last, load_unacc_exec_time;
 
-	unsigned long load_contribution;
+	u64 load_contribution;
 #endif
 #endif
 };
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index d22b666..b809651 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -204,11 +204,11 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	SEQ_printf(m, "  .%-30s: %lld\n", "load", cfs_rq->load.weight);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 #ifdef CONFIG_SMP
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "load_avg",
+	SEQ_printf(m, "  .%-30s: %lld.%06ld\n", "load_avg",
 			SPLIT_NS(cfs_rq->load_avg));
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "load_period",
+	SEQ_printf(m, "  .%-30s: %lld.%06ld\n", "load_period",
 			SPLIT_NS(cfs_rq->load_period));
-	SEQ_printf(m, "  .%-30s: %ld\n", "load_contrib",
+	SEQ_printf(m, "  .%-30s: %lld\n", "load_contrib",
 			cfs_rq->load_contribution);
 	SEQ_printf(m, "  .%-30s: %d\n", "load_tg",
 			atomic_read(&cfs_rq->tg->load_weight));
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index bf9bbaa..3f56410 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -708,12 +708,13 @@ static void update_cfs_rq_load_contribution(struct cfs_rq *cfs_rq,
 					    int global_update)
 {
 	struct task_group *tg = cfs_rq->tg;
-	long load_avg;
+	s64 load_avg;
 
 	load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
 	load_avg -= cfs_rq->load_contribution;
 
 	if (global_update || abs(load_avg) > cfs_rq->load_contribution / 8) {
+		/* TODO: fix atomics for 64-bit additions */
 		atomic_add(load_avg, &tg->load_weight);
 		cfs_rq->load_contribution += load_avg;
 	}
@@ -723,7 +724,7 @@ static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
 {
 	u64 period = sysctl_sched_shares_window;
 	u64 now, delta;
-	unsigned long load = cfs_rq->load.weight;
+	u64 load = cfs_rq->load.weight;
 
 	if (cfs_rq->tg == &root_task_group)
 		return;
@@ -743,8 +744,13 @@ static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
 	cfs_rq->load_unacc_exec_time = 0;
 	cfs_rq->load_period += delta;
 	if (load) {
+		u64 tmp = cfs_rq->load_avg;
 		cfs_rq->load_last = now;
 		cfs_rq->load_avg += delta * load;
+
+		/* Detect overflow and set load_avg to max */
+		if (unlikely(cfs_rq->load_avg < tmp))
+			cfs_rq->load_avg = ~0ULL;
 	}
 
 	/* consider updating load contribution on each fold or truncate */
@@ -769,24 +775,25 @@ static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
 
 static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
 {
-	long load_weight, load, shares;
+	s64 load_weight, load, shares;
 
 	load = cfs_rq->load.weight;
 
+	/* TODO: fixup atomics to handle u64 in 32-bit */
 	load_weight = atomic_read(&tg->load_weight);
 	load_weight += load;
 	load_weight -= cfs_rq->load_contribution;
 
 	shares = (tg->shares * load);
 	if (load_weight)
-		shares /= load_weight;
+		shares = div64_u64(shares, load_weight);
 
 	if (shares < MIN_SHARES)
 		shares = MIN_SHARES;
 	if (shares > tg->shares)
 		shares = tg->shares;
 
-	return shares;
+	return (long)shares;
 }
 
 static void update_entity_shares_tick(struct cfs_rq *cfs_rq)
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v1 19/19] sched: convert atomic ops in shares update to use atomic64_t ops
  2011-05-02  1:18 [PATCH v1 00/19] Increase resolution of load weights Nikhil Rao
                   ` (17 preceding siblings ...)
  2011-05-02  1:19 ` [PATCH v1 18/19] sched: update shares distribution to use u64 Nikhil Rao
@ 2011-05-02  1:19 ` Nikhil Rao
  2011-05-02  6:14 ` [PATCH v1 00/19] Increase resolution of load weights Ingo Molnar
  19 siblings, 0 replies; 34+ messages in thread
From: Nikhil Rao @ 2011-05-02  1:19 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
  Cc: linux-kernel, Nikunj A. Dadhania, Srivatsa Vaddagiri,
	Stephan Barwolf, Nikhil Rao

Convert uses of atomic_t to atomic64_t in shares update calculations. Total
task weight in a tg can overflow the atomic type on 32-bit systems.
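
A user-space analogy using C11 atomics (not the kernel's atomic API; the
counts are hypothetical) of why a 32-bit atomic total is no longer enough:

#include <stdio.h>
#include <stdatomic.h>
#include <stdint.h>

int main(void)
{
	atomic_uint   total32 = 0;		/* stand-in for atomic_t */
	atomic_ullong total64 = 0;		/* stand-in for atomic64_t */
	uint64_t per_cfs_rq = 1ULL << 20;	/* scaled nice-0 weight */
	int i;

	/* 5000 per-cpu contributions folded into the group total */
	for (i = 0; i < 5000; i++) {
		atomic_fetch_add(&total32, (unsigned int)per_cfs_rq);
		atomic_fetch_add(&total64, per_cfs_rq);
	}

	printf("32-bit total: %u (wrapped)\n", (unsigned int)total32);
	printf("64-bit total: %llu\n", (unsigned long long)total64);
	return 0;
}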

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 kernel/sched.c       |    2 +-
 kernel/sched_debug.c |    4 ++--
 kernel/sched_fair.c  |    8 +++-----
 3 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index e131225..af26b3e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -255,7 +255,7 @@ struct task_group {
 	struct cfs_rq **cfs_rq;
 	unsigned long shares;
 
-	atomic_t load_weight;
+	atomic64_t load_weight;
 #endif
 
 #ifdef CONFIG_RT_GROUP_SCHED
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index b809651..2d0fff9 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -210,8 +210,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			SPLIT_NS(cfs_rq->load_period));
 	SEQ_printf(m, "  .%-30s: %lld\n", "load_contrib",
 			cfs_rq->load_contribution);
-	SEQ_printf(m, "  .%-30s: %d\n", "load_tg",
-			atomic_read(&cfs_rq->tg->load_weight));
+	SEQ_printf(m, "  .%-30s: %ld\n", "load_tg",
+			atomic64_read(&cfs_rq->tg->load_weight));
 #endif
 
 	print_cfs_group_stats(m, cpu, cfs_rq->tg);
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 3f56410..0152410 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -714,8 +714,7 @@ static void update_cfs_rq_load_contribution(struct cfs_rq *cfs_rq,
 	load_avg -= cfs_rq->load_contribution;
 
 	if (global_update || abs(load_avg) > cfs_rq->load_contribution / 8) {
-		/* TODO: fix atomics for 64-bit additions */
-		atomic_add(load_avg, &tg->load_weight);
+		atomic64_add(load_avg, &tg->load_weight);
 		cfs_rq->load_contribution += load_avg;
 	}
 }
@@ -779,8 +778,7 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
 
 	load = cfs_rq->load.weight;
 
-	/* TODO: fixup atomics to handle u64 in 32-bit */
-	load_weight = atomic_read(&tg->load_weight);
+	load_weight = atomic64_read(&tg->load_weight);
 	load_weight += load;
 	load_weight -= cfs_rq->load_contribution;
 
@@ -1409,7 +1407,7 @@ static s64 effective_load(struct task_group *tg, int cpu, s64 wl, s64 wg)
 		w = se->my_q->load.weight;
 
 		/* use this cpu's instantaneous contribution */
-		lw = atomic_read(&tg->load_weight);
+		lw = atomic64_read(&tg->load_weight);
 		lw -= se->my_q->load_contribution;
 		lw += w + wg;
 
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 00/19] Increase resolution of load weights
  2011-05-02  1:18 [PATCH v1 00/19] Increase resolution of load weights Nikhil Rao
                   ` (18 preceding siblings ...)
  2011-05-02  1:19 ` [PATCH v1 19/19] sched: convert atomic ops in shares update to use atomic64_t ops Nikhil Rao
@ 2011-05-02  6:14 ` Ingo Molnar
  2011-05-04  0:58   ` Nikhil Rao
  19 siblings, 1 reply; 34+ messages in thread
From: Ingo Molnar @ 2011-05-02  6:14 UTC (permalink / raw)
  To: Nikhil Rao
  Cc: Peter Zijlstra, Mike Galbraith, linux-kernel, Nikunj A. Dadhania,
	Srivatsa Vaddagiri, Stephan Barwolf


* Nikhil Rao <ncrao@google.com> wrote:

> 1. Performance costs
> 
> Ran 50 iterations of Ingo's pipe-test-100k program (100k pipe ping-pongs). 
> See http://thread.gmane.org/gmane.linux.kernel/1129232/focus=1129389 for more 
> info.
> 
> 64-bit build.
> 
>   2.6.39-rc5 (baseline):
> 
>     Performance counter stats for './pipe-test-100k' (50 runs):
> 
>        905,034,914 instructions             #      0.345 IPC     ( +-   0.016% )
>      2,623,924,516 cycles                     ( +-   0.759% )
> 
>         1.518543478  seconds time elapsed   ( +-   0.513% )
> 
>   2.6.39-rc5 + patchset:
> 
>     Performance counter stats for './pipe-test-100k' (50 runs):
> 
>        905,351,545 instructions             #      0.343 IPC     ( +-   0.018% )
>      2,638,939,777 cycles                     ( +-   0.761% )
> 
>         1.509101452  seconds time elapsed   ( +-   0.537% )
> 
> There is a marginal increase in instruction retired, about 0.034%; and marginal
> increase in cycles counted, about 0.57%.

Not sure this increase is statistically significant: both effects are within 
the noise, and if you look at elapsed time, it actually went down.

Btw., to best measure context-switching costs you should do something like:

  taskset 1 perf stat --repeat 50 ./pipe-test-100k

to pin both tasks to the same CPU. This reduces noise and makes the numbers 
more relevant: SMP costs do not increase due to your patchset.

So it would be nice to re-run the 64-bit tests with the pipe test bound to a 
single CPU.

> 32-bit build.
> 
>   2.6.39-rc5 (baseline):
> 
>     Performance counter stats for './pipe-test-100k' (50 runs):
> 
>      1,025,151,722 instructions             #      0.238 IPC     ( +-   0.018% )
>      4,303,226,625 cycles                     ( +-   0.524% )
> 
>         2.133056844  seconds time elapsed   ( +-   0.619% )
> 
>   2.6.39-rc5 + patchset:
> 
>     Performance counter stats for './pipe-test-100k' (50 runs):
> 
>      1,070,610,068 instructions             #      0.239 IPC     ( +-   1.369% )
>      4,478,912,974 cycles                     ( +-   1.011% )
> 
>         2.293382242  seconds time elapsed   ( +-   0.144% )
> 
> On 32-bit kernels, instructions retired increases by about 4.4% with this
> patchset. CPU cycles also increases by about 4%.
>
> There is a marginal increase in instruction retired, about 0.034%; and 
> marginal increase in cycles counted, about 0.57%.

These results look more bothersome, a clear increase in both cycles, elapsed 
time, and instructions retired, well beyond measurement noise.

Given that scheduling costs are roughly 30% of that pipe test-case, the cost 
increase to the scheduler is probably around:

	instructions:	+14.5%
	cycles: 	+13.3%

That is rather significant.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 00/19] Increase resolution of load weights
  2011-05-02  6:14 ` [PATCH v1 00/19] Increase resolution of load weights Ingo Molnar
@ 2011-05-04  0:58   ` Nikhil Rao
  2011-05-04  1:07     ` Nikhil Rao
  2011-05-04 11:13     ` Ingo Molnar
  0 siblings, 2 replies; 34+ messages in thread
From: Nikhil Rao @ 2011-05-04  0:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Mike Galbraith, linux-kernel, Nikunj A. Dadhania,
	Srivatsa Vaddagiri, Stephan Barwolf

On Sun, May 1, 2011 at 11:14 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Nikhil Rao <ncrao@google.com> wrote:
>
>> 1. Performance costs
>>
>> Ran 50 iterations of Ingo's pipe-test-100k program (100k pipe ping-pongs).
>> See http://thread.gmane.org/gmane.linux.kernel/1129232/focus=1129389 for more
>> info.
>>
>> 64-bit build.
>>
>>   2.6.39-rc5 (baseline):
>>
>>     Performance counter stats for './pipe-test-100k' (50 runs):
>>
>>        905,034,914 instructions             #      0.345 IPC     ( +-   0.016% )
>>      2,623,924,516 cycles                     ( +-   0.759% )
>>
>>         1.518543478  seconds time elapsed   ( +-   0.513% )
>>
>>   2.6.39-rc5 + patchset:
>>
>>     Performance counter stats for './pipe-test-100k' (50 runs):
>>
>>        905,351,545 instructions             #      0.343 IPC     ( +-   0.018% )
>>      2,638,939,777 cycles                     ( +-   0.761% )
>>
>>         1.509101452  seconds time elapsed   ( +-   0.537% )
>>
>> There is a marginal increase in instruction retired, about 0.034%; and marginal
>> increase in cycles counted, about 0.57%.
>
> Not sure this increase is statistically significant: both effects are within
> noise and look at elapsed time, it actually went down.
>
> Btw., to best measure context-switching costs you should do something like:
>
>  taskset 1 perf stat --repeat 50 ./pipe-test-100k
>
> to pin both tasks to the same CPU. This reduces noise and makes the numbers
> more relevant: SMP costs do not increase due to your patchset.
>
> So it would be nice to re-run the 64-bit tests with the pipe test bound to a
> single CPU.

I re-ran the 64-bit tests with the pipe test bound to a single CPU.
Data attached below.

2.6.39-rc5:

 Performance counter stats for './pipe-test-100k' (100 runs):

       855,571,900 instructions             #      0.869 IPC     ( +-   0.637% )
       984,213,635 cycles                     ( +-   0.254% )

        0.796129773  seconds time elapsed   ( +-   0.152% )

2.6.39-rc5  + patchset:

 Performance counter stats for './pipe-test-100k' (100 runs):

       905,553,828 instructions             #      0.934 IPC     ( +-   0.059% )
       969,792,787 cycles                     ( +-   0.168% )

        0.788676004  seconds time elapsed   ( +-   0.122% )


There is a 5.8% increase in instructions which is statistically
significant and well over the error margins. Cycles dropped by about
1.17% and elapsed time also dropped about ~1%. I'm looking into
profiles for this test to understand why instr has increased.

>
>> 32-bit build.
>>
>>   2.6.39-rc5 (baseline):
>>
>>     Performance counter stats for './pipe-test-100k' (50 runs):
>>
>>      1,025,151,722 instructions             #      0.238 IPC     ( +-   0.018% )
>>      4,303,226,625 cycles                     ( +-   0.524% )
>>
>>         2.133056844  seconds time elapsed   ( +-   0.619% )
>>
>>   2.6.39-rc5 + patchset:
>>
>>     Performance counter stats for './pipe-test-100k' (50 runs):
>>
>>      1,070,610,068 instructions             #      0.239 IPC     ( +-   1.369% )
>>      4,478,912,974 cycles                     ( +-   1.011% )
>>
>>         2.293382242  seconds time elapsed   ( +-   0.144% )
>>
>> On 32-bit kernels, instructions retired increases by about 4.4% with this
>> patchset. CPU cycles also increases by about 4%.
>>
>> There is a marginal increase in instruction retired, about 0.034%; and
>> marginal increase in cycles counted, about 0.57%.
>
> These results look more bothersome, a clear increase in both cycles, elapsed
> time, and instructions retired, well beyond measurement noise.
>
> Given that scheduling costs are roughly 30% of that pipe test-case, the cost
> increase to the scheduler is probably around:
>
>        instructions:   +14.5%
>        cycles:         +13.3%
>
> That is rather significant.
>

I'll take a closer look at the performance of this patchset this week.
I'm a little confused about how you calculated the cost to the
scheduler. How did you come up with 14.5 % and 13.3%? Also, out of
curiosity, what's an acceptable tolerance level for a performance hit
on 32-bit?

-Thanks
Nikhil

> Thanks,
>
>        Ingo
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 00/19] Increase resolution of load weights
  2011-05-04  0:58   ` Nikhil Rao
@ 2011-05-04  1:07     ` Nikhil Rao
  2011-05-04 11:13     ` Ingo Molnar
  1 sibling, 0 replies; 34+ messages in thread
From: Nikhil Rao @ 2011-05-04  1:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Mike Galbraith, linux-kernel, Nikunj A. Dadhania,
	Srivatsa Vaddagiri, Stephan Barwolf

On Tue, May 3, 2011 at 5:58 PM, Nikhil Rao <ncrao@google.com> wrote:
> On Sun, May 1, 2011 at 11:14 PM, Ingo Molnar <mingo@elte.hu> wrote:
>>
>> * Nikhil Rao <ncrao@google.com> wrote:
>>
>>> 1. Performance costs
>>>
>>> Ran 50 iterations of Ingo's pipe-test-100k program (100k pipe ping-pongs).
>>> See http://thread.gmane.org/gmane.linux.kernel/1129232/focus=1129389 for more
>>> info.
>>>
>>> 64-bit build.
>>>
>>>   2.6.39-rc5 (baseline):
>>>
>>>     Performance counter stats for './pipe-test-100k' (50 runs):
>>>
>>>        905,034,914 instructions             #      0.345 IPC     ( +-   0.016% )
>>>      2,623,924,516 cycles                     ( +-   0.759% )
>>>
>>>         1.518543478  seconds time elapsed   ( +-   0.513% )
>>>
>>>   2.6.39-rc5 + patchset:
>>>
>>>     Performance counter stats for './pipe-test-100k' (50 runs):
>>>
>>>        905,351,545 instructions             #      0.343 IPC     ( +-   0.018% )
>>>      2,638,939,777 cycles                     ( +-   0.761% )
>>>
>>>         1.509101452  seconds time elapsed   ( +-   0.537% )
>>>
>>> There is a marginal increase in instruction retired, about 0.034%; and marginal
>>> increase in cycles counted, about 0.57%.
>>
>> Not sure this increase is statistically significant: both effects are within
>> noise and look at elapsed time, it actually went down.
>>
>> Btw., to best measure context-switching costs you should do something like:
>>
>>  taskset 1 perf stat --repeat 50 ./pipe-test-100k
>>
>> to pin both tasks to the same CPU. This reduces noise and makes the numbers
>> more relevant: SMP costs do not increase due to your patchset.
>>
>> So it would be nice to re-run the 64-bit tests with the pipe test bound to a
>> single CPU.
>
> I re-ran the 64-bit tests with the pipe test bound to a single CPU.
> Data attached below.
>
> 2.6.39-rc5:
>
>  Performance counter stats for './pipe-test-100k' (100 runs):
>
>       855,571,900 instructions             #      0.869 IPC     ( +-   0.637% )
>       984,213,635 cycles                     ( +-   0.254% )
>
>        0.796129773  seconds time elapsed   ( +-   0.152% )
>
> 2.6.39-rc5  + patchset:
>
>  Performance counter stats for './pipe-test-100k' (100 runs):
>
>       905,553,828 instructions             #      0.934 IPC     ( +-   0.059% )
>       969,792,787 cycles                     ( +-   0.168% )
>
>        0.788676004  seconds time elapsed   ( +-   0.122% )
>
>
> There is a 5.8% increase in instructions which is statistically
> significant and well over the error margins. Cycles dropped by about
> 1.17% and elapsed time also dropped about ~1%. I'm looking into
> profiles for this test to understand why instr has increased.
>
>>
>>> 32-bit build.
>>>
>>>   2.6.39-rc5 (baseline):
>>>
>>>     Performance counter stats for './pipe-test-100k' (50 runs):
>>>
>>>      1,025,151,722 instructions             #      0.238 IPC     ( +-   0.018% )
>>>      4,303,226,625 cycles                     ( +-   0.524% )
>>>
>>>         2.133056844  seconds time elapsed   ( +-   0.619% )
>>>
>>>   2.6.39-rc5 + patchset:
>>>
>>>     Performance counter stats for './pipe-test-100k' (50 runs):
>>>
>>>      1,070,610,068 instructions             #      0.239 IPC     ( +-   1.369% )
>>>      4,478,912,974 cycles                     ( +-   1.011% )
>>>
>>>         2.293382242  seconds time elapsed   ( +-   0.144% )
>>>
>>> On 32-bit kernels, instructions retired increases by about 4.4% with this
>>> patchset. CPU cycles also increases by about 4%.
>>>
>>> There is a marginal increase in instruction retired, about 0.034%; and
>>> marginal increase in cycles counted, about 0.57%.
>>
>> These results look more bothersome, a clear increase in both cycles, elapsed
>> time, and instructions retired, well beyond measurement noise.
>>
>> Given that scheduling costs are roughly 30% of that pipe test-case, the cost
>> increase to the scheduler is probably around:
>>
>>        instructions:   +14.5%
>>        cycles:         +13.3%
>>
>> That is rather significant.
>>
>
> I'll take a closer look at the performance of this patchset this week.
> I'm a little confused about how you calculated the cost to the
> scheduler. How did you come up with 14.5 % and 13.3%?

Ah, never mind that. After reading your mail again, I see how this is
calculated now.

> Also, out of
> curiosity, what's an acceptable tolerance level for a performance hit
> on 32-bit?
>
> -Thanks
> Nikhil
>
>> Thanks,
>>
>>        Ingo
>>
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 00/19] Increase resolution of load weights
  2011-05-04  0:58   ` Nikhil Rao
  2011-05-04  1:07     ` Nikhil Rao
@ 2011-05-04 11:13     ` Ingo Molnar
  2011-05-06  1:29       ` Nikhil Rao
  1 sibling, 1 reply; 34+ messages in thread
From: Ingo Molnar @ 2011-05-04 11:13 UTC (permalink / raw)
  To: Nikhil Rao
  Cc: Peter Zijlstra, Mike Galbraith, linux-kernel, Nikunj A. Dadhania,
	Srivatsa Vaddagiri, Stephan Barwolf


* Nikhil Rao <ncrao@google.com> wrote:

> Also, out of curiosity, what's an acceptable tolerance level for a 
> performance hit on 32-bit?

It's a cost/benefit analysis and for 32-bit systems the benefits seem to be 
rather small, right?

Can we make the increase in resolution dependent on max CPU count or such and 
use cheaper divides on 32-bit in that case, while still keeping the code clean? 

We'd expect only relatively large and new (and 64-bit) systems to run into 
resolution problems, right?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 00/19] Increase resolution of load weights
  2011-05-04 11:13     ` Ingo Molnar
@ 2011-05-06  1:29       ` Nikhil Rao
  2011-05-06  6:59         ` Ingo Molnar
  2011-05-12  9:08         ` Peter Zijlstra
  0 siblings, 2 replies; 34+ messages in thread
From: Nikhil Rao @ 2011-05-06  1:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Mike Galbraith, linux-kernel, Nikunj A. Dadhania,
	Srivatsa Vaddagiri, Stephan Barwolf

On Wed, May 4, 2011 at 4:13 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Nikhil Rao <ncrao@google.com> wrote:
>
>> Also, out of curiosity, what's an acceptable tolerance level for a
>> performance hit on 32-bit?
>
> It's a cost/benefit analysis and for 32-bit systems the benefits seem to be
> rather small, right?
>

Yes, that's right. The benefits for 32-bit systems do seem to be limited.

When I initially posted this patchset, I expected much larger benefits
for 32-bit systems. I ran some experiments yesterday and found
negligible gains for 32-bit systems. I think two aspects of this
patchset are relevant for 32-bit:

1. Better distribution of weight for low-weight task groups. For
example, when an autogroup gets niced to +19, the task group is
assigned a weight of 15. Since 32-bit systems are only restricted to 8
cpus at most, I think we can manage to handle this case without the
need for more resolution. The results for this experiment were not
statistically significant.

2. You could also run out of resolution with deep hierarchies on
32-bit systems, but you need pretty complex cgroup hierarchies. Let's
assume you have a hierarchy with n levels and a branching factor of b
at each level. Let's also assume each leaf node has at least one
running task and users don't change any of the weights. You will need
approx log_b(1024/NR_CPUS) levels to run out of resolution in this
setup... so, b=2 needs 7 levels, b=3 needs 5 levels, b=4 needs 4
levels, ... and so on. These are a pretty elaborate hierarchy and I'm
not sure if there are use cases for these (yet!).
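
For what it's worth, a quick stand-alone check of that estimate (assuming a
SCHED_LOAD_SCALE of 1024 and an 8-cpu 32-bit box; the loop just finds the
smallest depth at which b^levels reaches 1024/NR_CPUS):

#include <stdio.h>

int main(void)
{
	unsigned int nr_cpus = 8;		/* assumed 32-bit box */
	unsigned int ratio = 1024 / nr_cpus;	/* SCHED_LOAD_SCALE / NR_CPUS */
	unsigned int b;

	for (b = 2; b <= 4; b++) {
		unsigned int levels = 0, span = 1;

		while (span < ratio) {
			span *= b;
			levels++;
		}
		printf("branching factor %u: ~%u levels\n", b, levels);
	}
	return 0;
}

This prints 7, 5 and 4 levels for branching factors 2, 3 and 4, matching the
estimate above.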

> Can we make the increase in resolution dependent on max CPU count or such and
> use cheaper divides on 32-bit in that case, while still keeping the code clean?
>

Sure. Is this what you had in mind?

commit 860030069190e3d6e1983cc77c936f7ccdaf7cff
Author: Nikhil Rao <ncrao@google.com>
Date:   Mon Apr 11 15:16:08 2011 -0700

    sched: increase SCHED_LOAD_SCALE resolution

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8d1ff2b..f92353c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -794,7 +794,21 @@ enum cpu_idle_type {
 /*
  * Increase resolution of nice-level calculations:
  */
-#define SCHED_LOAD_SHIFT	10
+#if CONFIG_NR_CPUS > 32
+#define SCHED_LOAD_RESOLUTION	10
+
+#define scale_up_load_resolution(w)	w << SCHED_LOAD_RESOLUTION
+#define scale_down_load_resolution(w)	w >> SCHED_LOAD_RESOLUTION;
+
+#else
+#define SCHED_LOAD_RESOLUTION	0
+
+#define scale_up_load_resolution(w)	w
+#define scale_down_load_resolution(w)	w
+
+#endif
+
+#define SCHED_LOAD_SHIFT	(10 + (SCHED_LOAD_RESOLUTION))
 #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)

 /*
diff --git a/kernel/sched.c b/kernel/sched.c
index f4b4679..3dae6c5 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -293,7 +293,7 @@ static DEFINE_SPINLOCK(task_group_lock);
  *  limitation from this.)
  */
 #define MIN_SHARES	2
-#define MAX_SHARES	(1UL << 18)
+#define MAX_SHARES	(1UL << (18 + SCHED_LOAD_RESOLUTION))

 static int root_task_group_load = ROOT_TASK_GROUP_LOAD;
 #endif
@@ -1307,14 +1307,18 @@ calc_delta_mine(unsigned long delta_exec, u64 weight, struct load_weight *lw)
 	u64 tmp;

 	if (!lw->inv_weight) {
-		if (BITS_PER_LONG > 32 && unlikely(lw->weight >= WMULT_CONST))
+		unsigned long w = scale_down_load_resolution(lw->weight);
+		if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
 			lw->inv_weight = 1;
 		else
-			lw->inv_weight = 1 + (WMULT_CONST-lw->weight/2)
-				/ (lw->weight+1);
+			lw->inv_weight = 1 + (WMULT_CONST - w/2) / (w + 1);
 	}

-	tmp = (u64)delta_exec * weight;
+	if (likely(weight > (1UL << SCHED_LOAD_RESOLUTION)))
+		tmp = (u64)delta_exec * scale_down_load_resolution(weight);
+	else
+		tmp = (u64)delta_exec;
+
 	/*
 	 * Check whether we'd overflow the 64-bit multiplication:
 	 */
@@ -1758,12 +1762,13 @@ static void set_load_weight(struct task_struct *p)
 	 * SCHED_IDLE tasks get minimal weight:
 	 */
 	if (p->policy == SCHED_IDLE) {
-		p->se.load.weight = WEIGHT_IDLEPRIO;
+		p->se.load.weight = scale_up_load_resolution(WEIGHT_IDLEPRIO);
 		p->se.load.inv_weight = WMULT_IDLEPRIO;
 		return;
 	}

-	p->se.load.weight = prio_to_weight[p->static_prio - MAX_RT_PRIO];
+	p->se.load.weight = scale_up_load_resolution(
+			prio_to_weight[p->static_prio - MAX_RT_PRIO]);
 	p->se.load.inv_weight = prio_to_wmult[p->static_prio - MAX_RT_PRIO];
 }

@@ -9129,14 +9134,15 @@ cpu_cgroup_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
 static int cpu_shares_write_u64(struct cgroup *cgrp, struct cftype *cftype,
 				u64 shareval)
 {
-	return sched_group_set_shares(cgroup_tg(cgrp), shareval);
+	return sched_group_set_shares(cgroup_tg(cgrp),
+				      scale_up_load_resolution(shareval));
 }

 static u64 cpu_shares_read_u64(struct cgroup *cgrp, struct cftype *cft)
 {
 	struct task_group *tg = cgroup_tg(cgrp);

-	return (u64) tg->shares;
+	return (u64) scale_down_load_resolution(tg->shares);
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */


I think we still need the scaling in calc_delta_mine() -- it helps
with accuracy, and (tmp * inv_weight) would be less likely to overflow
the 64-bit multiplication, so we only do one SRR instead of two SRRs.
I think the SCHED_POWER_SCALE is also a good change since it makes
cpu_power calculations independent of LOAD_SCALE. We can drop all the
other patches that convert unsigned long to u64 and only carry the
overflow detection changes (eg. shares update).

> We'd expect only relatively large and new (and 64-bit) systems to run into
> resolution problems, right?
>

Yes, from the experiments I've run so far, it looks like larger
systems seem to be more affected by resolution problems.

-Thanks,
Nikhil

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 00/19] Increase resolution of load weights
  2011-05-06  1:29       ` Nikhil Rao
@ 2011-05-06  6:59         ` Ingo Molnar
  2011-05-11  0:14           ` Nikhil Rao
  2011-05-12  9:08         ` Peter Zijlstra
  1 sibling, 1 reply; 34+ messages in thread
From: Ingo Molnar @ 2011-05-06  6:59 UTC (permalink / raw)
  To: Nikhil Rao
  Cc: Peter Zijlstra, Mike Galbraith, linux-kernel, Nikunj A. Dadhania,
	Srivatsa Vaddagiri, Stephan Barwolf


* Nikhil Rao <ncrao@google.com> wrote:

> On Wed, May 4, 2011 at 4:13 AM, Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * Nikhil Rao <ncrao@google.com> wrote:
> >
> >> Also, out of curiosity, what's an acceptable tolerance level for a 
> >> performance hit on 32-bit?
> >
> > It's a cost/benefit analysis and for 32-bit systems the benefits seem to be 
> > rather small, right?
> 
> Yes, that's right. The benefits for 32-bit systems do seem to be limited.
> 
> When I initially posted this patchset, I expected much larger benefits for 
> 32-bit systems. I ran some experiments yesterday and found negligible gains 
> for 32-bit systems. I think two aspects of this patchset are relevant for 
> 32-bit:
> 
> 1. Better distribution of weight for low-weight task groups. For example, 
> when an autogroup gets niced to +19, the task group is assigned a weight of 
> 15. Since 32-bit systems are only restricted to 8 cpus at most, I think we 
> can manage to handle this case without the need for more resolution. The 
> results for this experiment were not statistically significant.
> 
> 2. You could also run out of resolution with deep hierarchies on 32-bit 
> systems, but you need pretty complex cgroup hierarchies. Let's assume you 
> have a hierarchy with n levels and a branching factor of b at each level. 
> Let's also assume each leaf node has at least one running task and users 
> don't change any of the weights. You will need approx log_b(1024/NR_CPUS) 
> levels to run out of resolution in this setup... so, b=2 needs 7 levels, b=3 
> needs 5 levels, b=4 needs 4 levels, ... and so on. These are a pretty 
> elaborate hierarchy and I'm not sure if there are use cases for these (yet!).

Btw., the "take your patch" and "do not take your patch" choice is a false 
dichotomy, there's a third option we should consider seriously: we do not *have 
to* act for 32-bit systems, if we decide that the benefits are not clear yet.

I.e. we can act on 64-bit systems (there increasing resolution should be near 
zero cost as we have 64-bit ops), but delay any decision for 32-bit systems.

If 32-bit systems evolve in a direction (lots of cores, lots of cgroups 
complexity) that makes the increase in resolution inevitable, we can 
reconsider.

If on the other hand they are replaced more and more by 64-bit systems in all 
segments and become a niche then not acting will be the right decision. There's 
so many other reasons why 64-bit is better, better resolution and more 
scheduling precision in highly parallel systems/setups will just be another 
reason.

Personally i think this latter scenario is a lot more likely.

> > Can we make the increase in resolution dependent on max CPU count or such 
> > and use cheaper divides on 32-bit in that case, while still keeping the 
> > code clean?
> 
> Sure. Is this what you had in mind?

Yes, almost:

> commit 860030069190e3d6e1983cc77c936f7ccdaf7cff
> Author: Nikhil Rao <ncrao@google.com>
> Date:   Mon Apr 11 15:16:08 2011 -0700
> 
>     sched: increase SCHED_LOAD_SCALE resolution
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8d1ff2b..f92353c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -794,7 +794,21 @@ enum cpu_idle_type {
>  /*
>   * Increase resolution of nice-level calculations:
>   */
> -#define SCHED_LOAD_SHIFT	10
> +#if CONFIG_NR_CPUS > 32
> +#define SCHED_LOAD_RESOLUTION	10
> +
> +#define scale_up_load_resolution(w)	w << SCHED_LOAD_RESOLUTION
> +#define scale_down_load_resolution(w)	w >> SCHED_LOAD_RESOLUTION;
> +
> +#else
> +#define SCHED_LOAD_RESOLUTION	0
> +
> +#define scale_up_load_resolution(w)	w
> +#define scale_down_load_resolution(w)	w
> +
> +#endif
> +
> +#define SCHED_LOAD_SHIFT	(10 + (SCHED_LOAD_RESOLUTION))
>  #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)

I'd suggest to make the resolution dependent on BITS_PER_LONG. That way we have 
just two basic variants of resolution (CONFIG_NR_CPUS can vary a lot). This 
will be a *lot* more testable, and on 32-bit we will maintain the status quo.
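
Something along these lines, presumably (keeping the draft's macro names;
BITS_PER_LONG comes from the kernel headers, and this is only a sketch of the
suggestion, not final code):

#if BITS_PER_LONG > 32
# define SCHED_LOAD_RESOLUTION	10
# define scale_up_load_resolution(w)	((w) << SCHED_LOAD_RESOLUTION)
# define scale_down_load_resolution(w)	((w) >> SCHED_LOAD_RESOLUTION)
#else
# define SCHED_LOAD_RESOLUTION	0
# define scale_up_load_resolution(w)	(w)
# define scale_down_load_resolution(w)	(w)
#endif

#define SCHED_LOAD_SHIFT	(10 + SCHED_LOAD_RESOLUTION)
#define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)

That way 64-bit kernels get the extra 10 bits of resolution while 32-bit
kernels compile the scaling helpers away entirely.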

But yes, look at how much nicer the patch looks now.

A few small comments:

> diff --git a/kernel/sched.c b/kernel/sched.c
> index f4b4679..3dae6c5 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -293,7 +293,7 @@ static DEFINE_SPINLOCK(task_group_lock);
>   *  limitation from this.)
>   */
>  #define MIN_SHARES	2
> -#define MAX_SHARES	(1UL << 18)
> +#define MAX_SHARES	(1UL << (18 + SCHED_LOAD_RESOLUTION))
> 
>  static int root_task_group_load = ROOT_TASK_GROUP_LOAD;
>  #endif
> @@ -1307,14 +1307,18 @@ calc_delta_mine(unsigned long delta_exec, u64
> weight, struct load_weight *lw)
>  	u64 tmp;
> 
>  	if (!lw->inv_weight) {
> -		if (BITS_PER_LONG > 32 && unlikely(lw->weight >= WMULT_CONST))
> +		unsigned long w = scale_down_load_resolution(lw->weight);
> +		if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
>  			lw->inv_weight = 1;
>  		else
> -			lw->inv_weight = 1 + (WMULT_CONST-lw->weight/2)
> -				/ (lw->weight+1);
> +			lw->inv_weight = 1 + (WMULT_CONST - w/2) / (w + 1);
>  	}
> 
> -	tmp = (u64)delta_exec * weight;
> +	if (likely(weight > (1UL << SCHED_LOAD_RESOLUTION)))
> +		tmp = (u64)delta_exec * scale_down_load_resolution(weight);
> +	else
> +		tmp = (u64)delta_exec;

Couldn't the compiler figure out that on 32-bit, this:

> +		tmp = (u64)delta_exec * scale_down_load_resolution(weight);

is equivalent to:

> +		tmp = (u64)delta_exec;

?

I.e. it would be nice to check whether a reasonably recent version of GCC 
figures out this optimization by itself - in that case we can avoid the 
branching ugliness, right?

Also, the above (and the other scale-adjustment changes) probably explains why 
the instruction count went up on 64-bit.

> @@ -1758,12 +1762,13 @@ static void set_load_weight(struct task_struct *p)
>  	 * SCHED_IDLE tasks get minimal weight:
>  	 */
>  	if (p->policy == SCHED_IDLE) {
> -		p->se.load.weight = WEIGHT_IDLEPRIO;
> +		p->se.load.weight = scale_up_load_resolution(WEIGHT_IDLEPRIO);
>  		p->se.load.inv_weight = WMULT_IDLEPRIO;
>  		return;
>  	}
> 
> -	p->se.load.weight = prio_to_weight[p->static_prio - MAX_RT_PRIO];
> +	p->se.load.weight = scale_up_load_resolution(
> +			prio_to_weight[p->static_prio - MAX_RT_PRIO]);
>  	p->se.load.inv_weight = prio_to_wmult[p->static_prio - MAX_RT_PRIO];

Please create a local 'load' variable that is equal to &p->se.load, and also
create a 'prio = p->static_prio - MAX_RT_PRIO' variable.

Furthermore, please rename 'scale_up_load_resolution' to something shorter: 
scale_load() is not used within the scheduler yet so it's free for taking.

Then a lot of the above repetitious code could be written as a much nicer:

	load->weight = scale_load(prio_to_weight[prio]);
  	load->inv_weight = prio_to_wmult[prio];

... and the logic becomes a *lot* more readable and the ugly linebreak is gone 
as well.
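
Putting those suggestions together, set_load_weight() might end up looking
roughly like this (a sketch assuming the surrounding kernel definitions and
the suggested scale_load() name; not code from the patchset):

static void set_load_weight(struct task_struct *p)
{
	int prio = p->static_prio - MAX_RT_PRIO;
	struct load_weight *load = &p->se.load;

	/*
	 * SCHED_IDLE tasks get minimal weight:
	 */
	if (p->policy == SCHED_IDLE) {
		load->weight = scale_load(WEIGHT_IDLEPRIO);
		load->inv_weight = WMULT_IDLEPRIO;
		return;
	}

	load->weight = scale_load(prio_to_weight[prio]);
	load->inv_weight = prio_to_wmult[prio];
}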

Please make such a set_load_weight() cleanup patch separate from the main 
patch, so that your main patch can still be reviewed in separation.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 00/19] Increase resolution of load weights
  2011-05-06  6:59         ` Ingo Molnar
@ 2011-05-11  0:14           ` Nikhil Rao
  2011-05-11  6:59             ` Ingo Molnar
  0 siblings, 1 reply; 34+ messages in thread
From: Nikhil Rao @ 2011-05-11  0:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Mike Galbraith, linux-kernel, Nikunj A. Dadhania,
	Srivatsa Vaddagiri, Stephan Barwolf

On Thu, May 5, 2011 at 11:59 PM, Ingo Molnar <mingo@elte.hu> wrote:
> * Nikhil Rao <ncrao@google.com> wrote:
>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index f4b4679..3dae6c5 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -293,7 +293,7 @@ static DEFINE_SPINLOCK(task_group_lock);
>>   *  limitation from this.)
>>   */
>>  #define MIN_SHARES   2
>> -#define MAX_SHARES   (1UL << 18)
>> +#define MAX_SHARES   (1UL << (18 + SCHED_LOAD_RESOLUTION))
>>
>>  static int root_task_group_load = ROOT_TASK_GROUP_LOAD;
>>  #endif
>> @@ -1307,14 +1307,18 @@ calc_delta_mine(unsigned long delta_exec, u64
>> weight, struct load_weight *lw)
>>       u64 tmp;
>>
>>       if (!lw->inv_weight) {
>> -             if (BITS_PER_LONG > 32 && unlikely(lw->weight >= WMULT_CONST))
>> +             unsigned long w = scale_down_load_resolution(lw->weight);
>> +             if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
>>                       lw->inv_weight = 1;
>>               else
>> -                     lw->inv_weight = 1 + (WMULT_CONST-lw->weight/2)
>> -                             / (lw->weight+1);
>> +                     lw->inv_weight = 1 + (WMULT_CONST - w/2) / (w + 1);
>>       }
>>
>> -     tmp = (u64)delta_exec * weight;
>> +     if (likely(weight > (1UL << SCHED_LOAD_RESOLUTION)))
>> +             tmp = (u64)delta_exec * scale_down_load_resolution(weight);
>> +     else
>> +             tmp = (u64)delta_exec;
>
> Couldn't the compiler figure out that on 32-bit, this:
>
>> +             tmp = (u64)delta_exec * scale_down_load_resolution(weight);
>
> is equivalent to:
>
>> +             tmp = (u64)delta_exec;
>
> ?
>
> I.e. it would be nice to check whether a reasonably recent version of GCC
> figures out this optimization by itself - in that case we can avoid the
> branching ugliness, right?
>

We added the branch to take care of the case when weight < 1024 (i.e.
2^SCHED_LOAD_RESOLUTION). We downshift the weight by 10 bits so that
we do not lose accuracy/performance in calc_delta_mine(), and try to
keep the mult within 64-bits. Task groups with low shares values can
have sched entities with weight less than 1024 since MIN_SHARES is
still 2 (we don't scale that up). To prevent scaling down weight to 0,
we add this check and force a lower bound of 1.
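
A stand-alone illustration of that guard (constants assumed; plain C rather
than the kernel code):

#include <stdio.h>
#include <stdint.h>

#define SCHED_LOAD_RESOLUTION	10

/* Scale the weight back down for the delta calculation, but never let a
 * small (< 1024) weight collapse to zero. */
static uint64_t scaled_delta(uint64_t delta_exec, uint64_t weight)
{
	if (weight > (1ULL << SCHED_LOAD_RESOLUTION))
		return delta_exec * (weight >> SCHED_LOAD_RESOLUTION);
	return delta_exec;	/* effective weight clamped to 1 */
}

int main(void)
{
	printf("weight 2:    %llu\n",
	       (unsigned long long)scaled_delta(1000000, 2));
	printf("weight 2048: %llu\n",
	       (unsigned long long)scaled_delta(1000000, 2048));
	return 0;
}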

I think we need the branch for 64-bit kernels. I don't like the branch
but I can't think of a better way to avoid it. Do you have any
suggestion?

For 32-bit systems, the compiler should ideally optimize this branch
away. Unfortunately gcc-4.4.3 doesn't do that (and I'm not sure if a
later version does it either). We could add a macro around this check
to avoid the branch for 32-bit and do the check for 64-bit kernels?

> Also, the above (and the other scale-adjustment changes) probably explains why
> the instruction count went up on 64-bit.

Yes, that makes sense. We see an increase in instruction count of
about 2% with the new version of the patchset, down from 5.8% (will
post the new patchset soon). Assuming 30% of the cost of pipe test is
scheduling, that is an effective increase of approx. 6.7%. I'll post
the data and some analysis along with the new version.

>
>> @@ -1758,12 +1762,13 @@ static void set_load_weight(struct task_struct *p)
>>        * SCHED_IDLE tasks get minimal weight:
>>        */
>>       if (p->policy == SCHED_IDLE) {
>> -             p->se.load.weight = WEIGHT_IDLEPRIO;
>> +             p->se.load.weight = scale_up_load_resolution(WEIGHT_IDLEPRIO);
>>               p->se.load.inv_weight = WMULT_IDLEPRIO;
>>               return;
>>       }
>>
>> -     p->se.load.weight = prio_to_weight[p->static_prio - MAX_RT_PRIO];
>> +     p->se.load.weight = scale_up_load_resolution(
>> +                     prio_to_weight[p->static_prio - MAX_RT_PRIO]);
>>       p->se.load.inv_weight = prio_to_wmult[p->static_prio - MAX_RT_PRIO];
>
> Please create a local 'load' variable that is equal to &p->se.load, and also
> create a 'prio = p->static_prio - MAX_RT_PRIO' variable.
>
> Furthermore, please rename 'scale_up_load_resolution' to something shorter:
> scale_load() is not used within the scheduler yet so it's free for taking.
>
> Then a lot of the above repetitious code could be written as a much nicer:
>
>        load->weight = scale_load(prio_to_weight[prio]);
>        load->inv_weight = prio_to_wmult[prio];
>
> ... and the logic becomes a *lot* more readable and the ugly linebreak is gone
> as well.
>
> Please make such a set_load_weight() cleanup patch separate from the main
> patch, so that your main patch can still be reviewed in separation.
>

Sure, will do.

-Thanks,
Nikhil

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 00/19] Increase resolution of load weights
  2011-05-11  0:14           ` Nikhil Rao
@ 2011-05-11  6:59             ` Ingo Molnar
  2011-05-12  8:56               ` Nikhil Rao
  0 siblings, 1 reply; 34+ messages in thread
From: Ingo Molnar @ 2011-05-11  6:59 UTC (permalink / raw)
  To: Nikhil Rao
  Cc: Peter Zijlstra, Mike Galbraith, linux-kernel, Nikunj A. Dadhania,
	Srivatsa Vaddagiri, Stephan Barwolf


* Nikhil Rao <ncrao@google.com> wrote:

> I think we need the branch for 64-bit kernels. I don't like the branch but I 
> can't think of a better way to avoid it. Do you have any suggestion?

It was just a quick stab into the dark, i was fishing for more 
micro-optimizations on 64-bit (32-bit, as long as we leave its resolution 
alone, should not matter much): clearly there is *some* new overhead on 64-bit 
kernels too, so it would be nice to reduce that to the absolute minimum.

> For 32-bit systems, the compiler should ideally optimize this branch away. 
> Unfortunately gcc-4.4.3 doesn't do that (and I'm not sure if a later version 
> does it either). We could add a macro around this check to avoid the branch 
> for 32-bit and do the check for 64-bit kernels?

I'd rather keep it easy to read. If we keep the 32-bit unit of load a 32-bit 
word then 32-bit will see basically no extra overhead, right? (modulo the 
compiler not noticing such optimizations.)

Also, it's a good idea to do performance measurements with newest gcc (4.6) if 
possible: by the time such a change hits distros it will be the established 
stock distro compiler that kernels get built with. Maybe your figures will get 
better and maybe it can optimize this branch as well.

> > Also, the above (and the other scale-adjustment changes) probably explains 
> > why the instruction count went up on 64-bit.
> 
> Yes, that makes sense. We see an increase in instruction count of about 2% 
> with the new version of the patchset, down from 5.8% (will post the new 
> patchset soon). Assuming 30% of the cost of pipe test is scheduling, that is 
> an effective increase of approx. 6.7%. I'll post the data and some analysis 
> along with the new version.

An instruction count increase does not necessarily mean a linear slowdown: if 
those instructions are cheaper or scheduled better by the CPU then often the 
slowdown will be less.

Sometimes a 1% increase in the instruction count can slow down a workload by 
5%, if the 1% increase does divisions, has complex data path dependencies or is 
missing the branch-cache a lot.

So you should keep an eye on the cycle count as well. Latest -tip's perf stat 
can also measure 'stalled cycles':

aldebaran:~/sched-tests> taskset 1 perf stat --repeat 3 ./pipe-test-1m

 Performance counter stats for './pipe-test-1m' (3 runs):

       6499.787926 task-clock               #    0.437 CPUs utilized            ( +-  0.41% )
         2,000,108 context-switches         #    0.308 M/sec                    ( +-  0.00% )
                 0 CPU-migrations           #    0.000 M/sec                    ( +-100.00% )
               147 page-faults              #    0.000 M/sec                    ( +-  0.00% )
    14,226,565,939 cycles                   #    2.189 GHz                      ( +-  0.49% )
     6,897,331,129 stalled-cycles-frontend  #   48.48% frontend cycles idle     ( +-  0.90% )
     4,230,895,459 stalled-cycles-backend   #   29.74% backend  cycles idle     ( +-  1.31% )
    14,002,256,109 instructions             #    0.98  insns per cycle        
                                            #    0.49  stalled cycles per insn  ( +-  0.02% )
     2,703,891,945 branches                 #  415.997 M/sec                    ( +-  0.02% )
        44,994,805 branch-misses            #    1.66% of all branches          ( +-  0.27% )

       14.859234036  seconds time elapsed  ( +-  0.19% )

The stalled-cycles frontend/backend metrics indicate whether a workload utilizes 
the CPU's resources optimally. Looking at a 'perf record -e 
stalled-cycles-frontend' and 'perf report' will show you the problem areas.
 
Most of the 'problem areas' will be unrelated to your code.

A 'near perfectly utilized' CPU looks like this:

aldebaran:~/opt> taskset 1 perf stat --repeat 10 ./fill_1b

 Performance counter stats for './fill_1b' (10 runs):

       1880.489837 task-clock               #    0.998 CPUs utilized            ( +-  0.15% )
                36 context-switches         #    0.000 M/sec                    ( +- 19.87% )
                 1 CPU-migrations           #    0.000 M/sec                    ( +- 59.63% )
                99 page-faults              #    0.000 M/sec                    ( +-  0.10% )
     6,027,432,226 cycles                   #    3.205 GHz                      ( +-  0.15% )
        22,138,455 stalled-cycles-frontend  #    0.37% frontend cycles idle     ( +- 36.56% )
        16,400,224 stalled-cycles-backend   #    0.27% backend  cycles idle     ( +- 38.12% )
    18,008,803,113 instructions             #    2.99  insns per cycle        
                                            #    0.00  stalled cycles per insn  ( +-  0.00% )
     1,001,802,536 branches                 #  532.735 M/sec                    ( +-  0.01% )
            22,842 branch-misses            #    0.00% of all branches          ( +-  9.07% )

        1.884595529  seconds time elapsed  ( +-  0.15% )

Both stall counts are very low. This is pretty hard to achieve in general, so 
before/after comparisons are used. For that there's 'perf diff' which you can 
use to compare before/after profiles:

 aldebaran:~/sched-tests> taskset 1 perf record -e instructions ./pipe-test-1m
 [ perf record: Woken up 2 times to write data ]
 [ perf record: Captured and wrote 0.427 MB perf.data (~18677 samples) ]
 aldebaran:~/sched-tests> taskset 1 perf record -e instructions ./pipe-test-1m
 [ perf record: Woken up 2 times to write data ]
 [ perf record: Captured and wrote 0.428 MB perf.data (~18685 samples) ]
 aldebaran:~/sched-tests> perf diff | head -10
 # Baseline  Delta          Shared Object                         Symbol
 # ........ ..........  .................  .............................
 #
     2.68%     +0.84%  [kernel.kallsyms]  [k] select_task_rq_fair
     3.28%     -0.17%  [kernel.kallsyms]  [k] fsnotify
     2.67%     +0.13%  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     2.46%     +0.11%  [kernel.kallsyms]  [k] pipe_read
     2.42%             [kernel.kallsyms]  [k] schedule
     2.11%     +0.28%  [kernel.kallsyms]  [k] copy_user_generic_string
     2.13%     +0.18%  [kernel.kallsyms]  [k] mutex_lock

 ( Note: these were two short runs on the same kernel so the diff shows the 
   natural noise of the profile of this workload. Longer runs are needed to 
   measure effects smaller than 1%. )

So there's a wide range of tools you can use to understand the precise 
performance impact of your patch and in turn you can present to us what you 
learned about it.

Such analysis saves quite a bit of time on the side of us scheduler maintainers 
and makes performance impacting patches a lot more easy to apply :-)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 00/19] Increase resolution of load weights
  2011-05-11  6:59             ` Ingo Molnar
@ 2011-05-12  8:56               ` Nikhil Rao
  2011-05-12 10:55                 ` Ingo Molnar
  0 siblings, 1 reply; 34+ messages in thread
From: Nikhil Rao @ 2011-05-12  8:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Mike Galbraith, linux-kernel, Nikunj A. Dadhania,
	Srivatsa Vaddagiri, Stephan Barwolf

On Tue, May 10, 2011 at 11:59 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Nikhil Rao <ncrao@google.com> wrote:
>
>> > Also, the above (and the other scale-adjustment changes) probably explains
>> > why the instruction count went up on 64-bit.
>>
>> Yes, that makes sense. We see an increase in instruction count of about 2%
>> with the new version of the patchset, down from 5.8% (will post the new
>> patchset soon). Assuming 30% of the cost of pipe test is scheduling, that is
>> an effective increase of approx. 6.7%. I'll post the data and some analysis
>> along with the new version.
>
> An instruction count increase does not necessarily mean a linear slowdown: if
> those instructions are cheaper or scheduled better by the CPU then often the
> slowdown will be less.
>
> Sometimes a 1% increase in the instruction count can slow down a workload by
> 5%, if the 1% increase does divisions, has complex data path dependencies or is
> missing the branch-cache a lot.
>
> So you should keep an eye on the cycle count as well. Latest -tip's perf stat
> can also measure 'stalled cycles':
>
> aldebaran:~/sched-tests> taskset 1 perf stat --repeat 3 ./pipe-test-1m
>
>  Performance counter stats for './pipe-test-1m' (3 runs):
>
>       6499.787926 task-clock               #    0.437 CPUs utilized            ( +-  0.41% )
>         2,000,108 context-switches         #    0.308 M/sec                    ( +-  0.00% )
>                 0 CPU-migrations           #    0.000 M/sec                    ( +-100.00% )
>               147 page-faults              #    0.000 M/sec                    ( +-  0.00% )
>    14,226,565,939 cycles                   #    2.189 GHz                      ( +-  0.49% )
>     6,897,331,129 stalled-cycles-frontend  #   48.48% frontend cycles idle     ( +-  0.90% )
>     4,230,895,459 stalled-cycles-backend   #   29.74% backend  cycles idle     ( +-  1.31% )
>    14,002,256,109 instructions             #    0.98  insns per cycle
>                                            #    0.49  stalled cycles per insn  ( +-  0.02% )
>     2,703,891,945 branches                 #  415.997 M/sec                    ( +-  0.02% )
>        44,994,805 branch-misses            #    1.66% of all branches          ( +-  0.27% )
>
>       14.859234036  seconds time elapsed  ( +-  0.19% )
>
> The stalled-cycles frontend/backend metrics indicate whether a workload utilizes
> the CPU's resources optimally. Looking at a 'perf record -e
> stalled-cycles-frontend' and 'perf report' will show you the problem areas.
>
> Most of the 'problem areas' will be unrelated to your code.
>
> A 'near perfectly utilized' CPU looks like this:
>
> aldebaran:~/opt> taskset 1 perf stat --repeat 10 ./fill_1b
>
>  Performance counter stats for './fill_1b' (10 runs):
>
>       1880.489837 task-clock               #    0.998 CPUs utilized            ( +-  0.15% )
>                36 context-switches         #    0.000 M/sec                    ( +- 19.87% )
>                 1 CPU-migrations           #    0.000 M/sec                    ( +- 59.63% )
>                99 page-faults              #    0.000 M/sec                    ( +-  0.10% )
>     6,027,432,226 cycles                   #    3.205 GHz                      ( +-  0.15% )
>        22,138,455 stalled-cycles-frontend  #    0.37% frontend cycles idle     ( +- 36.56% )
>        16,400,224 stalled-cycles-backend   #    0.27% backend  cycles idle     ( +- 38.12% )
>    18,008,803,113 instructions             #    2.99  insns per cycle
>                                            #    0.00  stalled cycles per insn  ( +-  0.00% )
>     1,001,802,536 branches                 #  532.735 M/sec                    ( +-  0.01% )
>            22,842 branch-misses            #    0.00% of all branches          ( +-  9.07% )
>
>        1.884595529  seconds time elapsed  ( +-  0.15% )
>
> Both stall counts are very low. This is pretty hard to achieve in general, so
> before/after comparisons are used. For that there's 'perf diff' which you can
> use to compare before/after profiles:
>
>  aldebaran:~/sched-tests> taskset 1 perf record -e instructions ./pipe-test-1m
>  [ perf record: Woken up 2 times to write data ]
>  [ perf record: Captured and wrote 0.427 MB perf.data (~18677 samples) ]
>  aldebaran:~/sched-tests> taskset 1 perf record -e instructions ./pipe-test-1m
>  [ perf record: Woken up 2 times to write data ]
>  [ perf record: Captured and wrote 0.428 MB perf.data (~18685 samples) ]
>  aldebaran:~/sched-tests> perf diff | head -10
>  # Baseline  Delta          Shared Object                         Symbol
>  # ........ ..........  .................  .............................
>  #
>     2.68%     +0.84%  [kernel.kallsyms]  [k] select_task_rq_fair
>     3.28%     -0.17%  [kernel.kallsyms]  [k] fsnotify
>     2.67%     +0.13%  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>     2.46%     +0.11%  [kernel.kallsyms]  [k] pipe_read
>     2.42%             [kernel.kallsyms]  [k] schedule
>     2.11%     +0.28%  [kernel.kallsyms]  [k] copy_user_generic_string
>     2.13%     +0.18%  [kernel.kallsyms]  [k] mutex_lock
>
>  ( Note: these were two short runs on the same kernel so the diff shows the
>   natural noise of the profile of this workload. Longer runs are needed to
>   measure effects smaller than 1%. )
>
> So there's a wide range of tools you can use to understand the precise
> performance impact of your patch and in turn you can present to us what you
> learned about it.
>
> Such analysis saves quite a bit of time on the side of us scheduler maintainers
> and makes performance impacting patches a lot more easy to apply :-)
>

Thanks for the info! I rebased the patchset against -tip and built
perf from -tip. Here are the results from running pipe-test-100k bound
to a single cpu with 100 repetitions.

-tip (baseline):

 Performance counter stats for '/root/data/pipe-test-100k' (100 runs):

       907,981,999 instructions             #    0.85  insns per cycle
                                            #    0.34  stalled cycles per insn  ( +-  0.07% )
     1,072,650,809 cycles                   #    0.000 GHz                      ( +-  0.13% )
       305,678,413 stalled-cycles-backend   #   28.50% backend  cycles idle     ( +-  0.51% )
       245,846,208 stalled-cycles-frontend  #   22.92% frontend cycles idle     ( +-  0.70% )

        1.060303165  seconds time elapsed  ( +-  0.09% )


-tip+patches:

 Performance counter stats for '/root/data/pipe-test-100k' (100 runs):

       910,501,358 instructions             #    0.82  insns per cycle
                                            #    0.36  stalled cycles per insn  ( +-  0.06% )
     1,108,981,763 cycles                   #    0.000 GHz                      ( +-  0.17% )
       328,816,295 stalled-cycles-backend   #   29.65% backend  cycles idle     ( +-  0.63% )
       247,412,614 stalled-cycles-frontend  #   22.31% frontend cycles idle     ( +-  0.71% )

        1.075497493  seconds time elapsed  ( +-  0.10% )


From this latest run on -tip, the instruction count is about 0.28%
more and cycles are approx 3.38% more. From the stalled cycles counts,
it looks like most of this increase is coming from backend stalled
cycles. It's not clear what type of stalls these are, but if I were to
guess, I think it means stalls post-decode (i.e. functional units,
load/store, etc.). Is that right?

I collected profiles from long runs of pipe-test (about 3m iterations)
and tried running "perf diff" on the profiles. I cached the buildid
from the two kernel images and associated test binary & libraries. The
individual reports make sense, but I suspect something is wrong with
the diff output.

# perf buildid-cache -v -a boot.tip-patches/vmlinux-2.6.39-tip-smp-DEV
Adding 17b6f2c42deb3725ad35e3dcba2d9fdb92ad47c1
boot.tip-patches/vmlinux-2.6.39-tip-smp-DEV: Ok
# perf buildid-cache -v -a boot.tip/vmlinux-2.6.39-tip-smp-DEV
Adding 47737eb3efdd6cb789872311c354b106ec8e7477
p/boot.tip/vmlinux-2.6.39-tip-smp-DEV: Ok

# perf buildid-list -i perf.data | grep kernel
17b6f2c42deb3725ad35e3dcba2d9fdb92ad47c1 [kernel.kallsyms]

# perf buildid-list -i perf.data.old | grep kernel
47737eb3efdd6cb789872311c354b106ec8e7477 [kernel.kallsyms]

# perf report -i perf.data.old -d [kernel.kallsyms] | head -n 10
# dso: [kernel.kallsyms]
# Events: 30K instructions
#
# Overhead       Command                       Symbol
# ........  ............  ...........................
#
     5.55%  pipe-test-3m  [k] pipe_read
     4.78%  pipe-test-3m  [k] schedule
     3.68%  pipe-test-3m  [k] update_curr
     3.52%  pipe-test-3m  [k] pipe_write


# perf report -i perf.data -d [kernel.kallsyms] | head -n 10
# dso: [kernel.kallsyms]
# Events: 31K instructions
#
# Overhead       Command                                 Symbol
# ........  ............  .....................................
#
     6.09%  pipe-test-3m  [k] pipe_read
     4.86%  pipe-test-3m  [k] schedule
     4.24%  pipe-test-3m  [k] update_curr
     3.87%  pipe-test-3m  [k] find_next_bit


# perf diff -v -d [kernel.kallsyms]
build id event received for [kernel.kallsyms]:
47737eb3efdd6cb789872311c354b106ec8e7477
...
build id event received for [kernel.kallsyms]:
17b6f2c42deb3725ad35e3dcba2d9fdb92ad47c1
...
Looking at the vmlinux_path (6 entries long)
Using /tmp/.debug/.build-id/47/737eb3efdd6cb789872311c354b106ec8e7477
for symbols
Looking at the vmlinux_path (6 entries long)
Using /tmp/.debug/.build-id/17/b6f2c42deb3725ad35e3dcba2d9fdb92ad47c1
for symbols
# Baseline  Delta                                     Symbol
# ........ ..........  .....................................
#
     0.00%     +6.09%  0xffffffff8112a258 ! [k] pipe_read
     0.00%     +4.86%  0xffffffff8141a206 ! [k] schedule
     0.00%     +4.24%  0xffffffff810634d8 ! [k] update_curr
     0.00%     +3.87%  0xffffffff8121f569 ! [k] find_next_bit
     0.00%     +3.33%  0xffffffff81065cbf ! [k] enqueue_task_fair
     0.00%     +3.25%  0xffffffff81065824 ! [k] dequeue_task_fair
     0.00%     +2.77%  0xffffffff81129d10 ! [k] pipe_write
     0.00%     +2.71%  0xffffffff8114ed97 ! [k] fsnotify

The baseline numbers are showing up as zero and the deltas match the
fractions from the -tip+patches report. Am I missing something here?

Another thing I noticed while running this on -tip is that low-weight
task groups are poorly balanced on -tip (much worse than v2.6.39-rc7).
I started bisecting between v2.6.39-rc7 and -tip to identify the
source of this regression.

[ experiment: create low-weight task group and run ~50 threads with random sleep/busy pattern ]
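
The soaker threads were along these lines -- a rough reconstruction from the
description above with illustrative parameters, not the exact harness; the
process is moved into the low-weight group via the cpu cgroup's shares:

/* ~50 threads alternating random busy and sleep phases (sketch only) */
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static void *soaker(void *arg)
{
        unsigned int seed = (unsigned long)arg;

        for (;;) {
                unsigned long spin = rand_r(&seed) % 1000000;

                while (spin--)                          /* busy phase */
                        asm volatile("" ::: "memory");
                usleep(rand_r(&seed) % 10000);          /* sleep phase */
        }
        return NULL;
}

int main(void)
{
        pthread_t tid;
        long i;

        for (i = 0; i < 50; i++)
                pthread_create(&tid, NULL, soaker, (void *)(i + 1));
        pause();
        return 0;
}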

-tip:

01:30:03 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle    intr/s
01:30:04 PM  all   90.67    0.00    0.00    0.00    0.00    0.00    0.00    0.00    9.33  15368.00
01:30:05 PM  all   93.08    0.00    0.00    0.00    0.00    0.00    0.00    0.00    6.92  15690.00
01:30:06 PM  all   94.56    0.00    0.00    0.00    0.00    0.00    0.00    0.00    5.44  15844.00
01:30:07 PM  all   94.88    0.00    0.06    0.00    0.00    0.00    0.00    0.00    5.06  15989.00
01:30:08 PM  all   94.31    0.00    0.00    0.00    0.00    0.00    0.00    0.00    5.69  15791.00
01:30:09 PM  all   95.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    5.00  15953.00
01:30:10 PM  all   94.19    0.00    0.06    0.00    0.00    0.00    0.00    0.00    5.75  15810.00
01:30:11 PM  all   93.75    0.00    0.00    0.00    0.00    0.00    0.00    0.00    6.25  15748.00
01:30:12 PM  all   94.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00    5.00  15943.00
01:30:13 PM  all   94.31    0.00    0.00    0.00    0.00    0.00    0.00    0.00    5.69  15865.00
Average:     all   93.97    0.00    0.02    0.00    0.00    0.00    0.00    0.00    6.01  15800.10

-tip+patches:

01:29:59 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle    intr/s
01:30:00 PM  all   99.31    0.00    0.56    0.00    0.00    0.00    0.00    0.00    0.12  16908.00
01:30:01 PM  all   99.44    0.00    0.50    0.00    0.00    0.06    0.00    0.00    0.00  18128.00
01:30:02 PM  all   99.88    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.12  16582.00
01:30:03 PM  all   99.06    0.00    0.75    0.00    0.00    0.00    0.00    0.00    0.19  17108.00
01:30:04 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00    0.00  17113.00
01:30:05 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16568.00
01:30:06 PM  all   99.81    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.19  16408.91
01:30:07 PM  all   99.87    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.13  16576.00
01:30:08 PM  all   99.94    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.06  16617.00
01:30:09 PM  all   99.94    0.00    0.00    0.00    0.00    0.06    0.00    0.00    0.00  16702.00
Average:     all   99.72    0.00    0.19    0.00    0.00    0.01    0.00    0.00    0.08  16870.63

-Thanks,
Nikhil

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 00/19] Increase resolution of load weights
  2011-05-06  1:29       ` Nikhil Rao
  2011-05-06  6:59         ` Ingo Molnar
@ 2011-05-12  9:08         ` Peter Zijlstra
  2011-05-12 17:30           ` Nikhil Rao
  1 sibling, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2011-05-12  9:08 UTC (permalink / raw)
  To: Nikhil Rao
  Cc: Ingo Molnar, Mike Galbraith, linux-kernel, Nikunj A. Dadhania,
	Srivatsa Vaddagiri, Stephan Barwolf

On Thu, 2011-05-05 at 18:29 -0700, Nikhil Rao wrote:
> > It's a cost/benefit analysis and for 32-bit systems the benefits seem to be
> > rather small, right?
> >
> 
> Yes, that's right. The benefits for 32-bit systems do seem to be limited.

deep(er) hierarchies on 32 bits still require this, it would be good to
verify that the cgroup mess created by the insanity called libvirt will
indeed work as expected.




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 00/19] Increase resolution of load weights
  2011-05-12  8:56               ` Nikhil Rao
@ 2011-05-12 10:55                 ` Ingo Molnar
  2011-05-12 18:44                   ` Nikhil Rao
  0 siblings, 1 reply; 34+ messages in thread
From: Ingo Molnar @ 2011-05-12 10:55 UTC (permalink / raw)
  To: Nikhil Rao, Arnaldo Carvalho de Melo, Frédéric Weisbecker
  Cc: Peter Zijlstra, Mike Galbraith, linux-kernel, Nikunj A. Dadhania,
	Srivatsa Vaddagiri, Stephan Barwolf


* Nikhil Rao <ncrao@google.com> wrote:

> On Tue, May 10, 2011 at 11:59 PM, Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * Nikhil Rao <ncrao@google.com> wrote:
> >
> >> > Also, the above (and the other scale-adjustment changes) probably explains
> >> > why the instruction count went up on 64-bit.
> >>
> >> Yes, that makes sense. We see an increase in instruction count of about 2%
> >> with the new version of the patchset, down from 5.8% (will post the new
> >> patchset soon). Assuming 30% of the cost of pipe test is scheduling, that is
> >> an effective increase of approx. 6.7%. I'll post the data and some analysis
> >> along with the new version.
> >
> > An instruction count increase does not necessarily mean a linear slowdown: if
> > those instructions are cheaper or scheduled better by the CPU then often the
> > slowdown will be less.
> >
> > Sometimes a 1% increase in the instruction count can slow down a workload by
> > 5%, if the 1% increase does divisions, has complex data path dependencies or is
> > missing the branch-cache a lot.
> >
> > So you should keep an eye on the cycle count as well. Latest -tip's perf stat
> > can also measure 'stalled cycles':
> >
> > aldebaran:~/sched-tests> taskset 1 perf stat --repeat 3 ./pipe-test-1m
> >
> >  Performance counter stats for './pipe-test-1m' (3 runs):
> >
> >       6499.787926 task-clock               #    0.437 CPUs utilized            ( +-  0.41% )
> >         2,000,108 context-switches         #    0.308 M/sec                    ( +-  0.00% )
> >                 0 CPU-migrations           #    0.000 M/sec                    ( +-100.00% )
> >               147 page-faults              #    0.000 M/sec                    ( +-  0.00% )
> >    14,226,565,939 cycles                   #    2.189 GHz                      ( +-  0.49% )
> >     6,897,331,129 stalled-cycles-frontend  #   48.48% frontend cycles idle     ( +-  0.90% )
> >     4,230,895,459 stalled-cycles-backend   #   29.74% backend  cycles idle     ( +-  1.31% )
> >    14,002,256,109 instructions             #    0.98  insns per cycle
> >                                            #    0.49  stalled cycles per insn  ( +-  0.02% )
> >     2,703,891,945 branches                 #  415.997 M/sec                    ( +-  0.02% )
> >        44,994,805 branch-misses            #    1.66% of all branches          ( +-  0.27% )
> >
> >       14.859234036  seconds time elapsed  ( +-  0.19% )
> >
> > The stalled-cycles frontend/backend metrics indicate whether a workload utilizes
> > the CPU's resources optimally. Looking at a 'perf record -e
> > stalled-cycles-frontend' and 'perf report' will show you the problem areas.
> >
> > Most of the 'problem areas' will be unrelated to your code.
> >
> > A 'near perfectly utilized' CPU looks like this:
> >
> > aldebaran:~/opt> taskset 1 perf stat --repeat 10 ./fill_1b
> >
> >  Performance counter stats for './fill_1b' (10 runs):
> >
> >       1880.489837 task-clock               #    0.998 CPUs utilized            ( +-  0.15% )
> >                36 context-switches         #    0.000 M/sec                    ( +- 19.87% )
> >                 1 CPU-migrations           #    0.000 M/sec                    ( +- 59.63% )
> >                99 page-faults              #    0.000 M/sec                    ( +-  0.10% )
> >     6,027,432,226 cycles                   #    3.205 GHz                      ( +-  0.15% )
> >        22,138,455 stalled-cycles-frontend  #    0.37% frontend cycles idle     ( +- 36.56% )
> >        16,400,224 stalled-cycles-backend   #    0.27% backend  cycles idle     ( +- 38.12% )
> >    18,008,803,113 instructions             #    2.99  insns per cycle
> >                                            #    0.00  stalled cycles per insn  ( +-  0.00% )
> >     1,001,802,536 branches                 #  532.735 M/sec                    ( +-  0.01% )
> >            22,842 branch-misses            #    0.00% of all branches          ( +-  9.07% )
> >
> >        1.884595529  seconds time elapsed  ( +-  0.15% )
> >
> > Both stall counts are very low. This is pretty hard to achieve in general, so
> > before/after comparisons are used. For that there's 'perf diff' which you can
> > use to compare before/after profiles:
> >
> >  aldebaran:~/sched-tests> taskset 1 perf record -e instructions ./pipe-test-1m
> >  [ perf record: Woken up 2 times to write data ]
> >  [ perf record: Captured and wrote 0.427 MB perf.data (~18677 samples) ]
> >  aldebaran:~/sched-tests> taskset 1 perf record -e instructions ./pipe-test-1m
> >  [ perf record: Woken up 2 times to write data ]
> >  [ perf record: Captured and wrote 0.428 MB perf.data (~18685 samples) ]
> >  aldebaran:~/sched-tests> perf diff | head -10
> >  # Baseline  Delta          Shared Object                         Symbol
> >  # ........ ..........  .................  .............................
> >  #
> >     2.68%     +0.84%  [kernel.kallsyms]  [k] select_task_rq_fair
> >     3.28%     -0.17%  [kernel.kallsyms]  [k] fsnotify
> >     2.67%     +0.13%  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
> >     2.46%     +0.11%  [kernel.kallsyms]  [k] pipe_read
> >     2.42%             [kernel.kallsyms]  [k] schedule
> >     2.11%     +0.28%  [kernel.kallsyms]  [k] copy_user_generic_string
> >     2.13%     +0.18%  [kernel.kallsyms]  [k] mutex_lock
> >
> >  ( Note: these were two short runs on the same kernel so the diff shows the
> >   natural noise of the profile of this workload. Longer runs are needed to
> >   measure effects smaller than 1%. )
> >
> > So there's a wide range of tools you can use to understand the precise
> > performance impact of your patch and in turn you can present to us what you
> > learned about it.
> >
> > Such analysis saves quite a bit of time on the side of us scheduler maintainers
> > and makes performance impacting patches a lot more easy to apply :-)
> >
> 
> Thanks for the info! I rebased the patchset against -tip and built
> perf from -tip. Here are the results from running pipe-test-100k bound
> to a single cpu with 100 repetitions.
> 
> -tip (baseline):
> 
>  Performance counter stats for '/root/data/pipe-test-100k' (100 runs):
> 
>        907,981,999 instructions             #    0.85  insns per cycle
>                                             #    0.34  stalled cycles per insn  ( +-  0.07% )
>      1,072,650,809 cycles                   #    0.000 GHz                      ( +-  0.13% )
>        305,678,413 stalled-cycles-backend   #   28.50% backend  cycles idle     ( +-  0.51% )
>        245,846,208 stalled-cycles-frontend  #   22.92% frontend cycles idle     ( +-  0.70% )
> 
>         1.060303165  seconds time elapsed  ( +-  0.09% )
> 
> 
> -tip+patches:
> 
>  Performance counter stats for '/root/data/pipe-test-100k' (100 runs):
> 
>        910,501,358 instructions             #    0.82  insns per cycle
>                                             #    0.36  stalled cycles per insn  ( +-  0.06% )
>      1,108,981,763 cycles                   #    0.000 GHz                      ( +-  0.17% )
>        328,816,295 stalled-cycles-backend   #   29.65% backend  cycles idle     ( +-  0.63% )
>        247,412,614 stalled-cycles-frontend  #   22.31% frontend cycles idle     ( +-  0.71% )
> 
>         1.075497493  seconds time elapsed  ( +-  0.10% )
> 
> 
> From this latest run on -tip, the instruction count is about 0.28%
> more and cycles are approx 3.38% more. From the stalled cycles counts,
> it looks like most of this increase is coming from backend stalled
> cycles. It's not clear what type of stalls these are, but if I were to
> guess, I think it means stalls post-decode (i.e. functional units,
> load/store, etc.). Is that right?

Yeah, more functional work to be done, and probably a tad more expensive per 
extra instruction executed.

How did branches and branch misses change?

> Another thing I noticed while running this on -tip is that low-weight
> task groups are poorly balanced on -tip (much worse than v2.6.39-rc7).
> I started bisecting between v2.6.39-rc7 and -tip to identify the
> source of this regression.

Ok, would be nice to figure out which commit did this.

> I collected profiles from long runs of pipe-test (about 3m iterations)
> and tried running "perf diff" on the profiles. I cached the buildid
> from the two kernel images and associated test binary & libraries. The
> individual reports make sense, but I suspect something is wrong with
> the diff output.

Ok, i've Cc:-ed Arnaldo and Frederic, the perf diff output indeed looks 
strange. (the perf diff output is repeated below.)

Thanks,

	Ingo

> # perf buildid-cache -v -a boot.tip-patches/vmlinux-2.6.39-tip-smp-DEV
> Adding 17b6f2c42deb3725ad35e3dcba2d9fdb92ad47c1
> boot.tip-patches/vmlinux-2.6.39-tip-smp-DEV: Ok
> # perf buildid-cache -v -a boot.tip/vmlinux-2.6.39-tip-smp-DEV
> Adding 47737eb3efdd6cb789872311c354b106ec8e7477
> p/boot.tip/vmlinux-2.6.39-tip-smp-DEV: Ok
> 
> # perf buildid-list -i perf.data | grep kernel
> 17b6f2c42deb3725ad35e3dcba2d9fdb92ad47c1 [kernel.kallsyms]
> 
> # perf buildid-list -i perf.data.old | grep kernel
> 47737eb3efdd6cb789872311c354b106ec8e7477 [kernel.kallsyms]
> 
> # perf report -i perf.data.old -d [kernel.kallsyms] | head -n 10
> # dso: [kernel.kallsyms]
> # Events: 30K instructions
> #
> # Overhead       Command                       Symbol
> # ........  ............  ...........................
> #
>      5.55%  pipe-test-3m  [k] pipe_read
>      4.78%  pipe-test-3m  [k] schedule
>      3.68%  pipe-test-3m  [k] update_curr
>      3.52%  pipe-test-3m  [k] pipe_write
> 
> 
> # perf report -i perf.data -d [kernel.kallsyms] | head -n 10
> # dso: [kernel.kallsyms]
> # Events: 31K instructions
> #
> # Overhead       Command                                 Symbol
> # ........  ............  .....................................
> #
>      6.09%  pipe-test-3m  [k] pipe_read
>      4.86%  pipe-test-3m  [k] schedule
>      4.24%  pipe-test-3m  [k] update_curr
>      3.87%  pipe-test-3m  [k] find_next_bit
> 
> 
> # perf diff -v -d [kernel.kallsyms]
> build id event received for [kernel.kallsyms]:
> 47737eb3efdd6cb789872311c354b106ec8e7477
> ...
> build id event received for [kernel.kallsyms]:
> 17b6f2c42deb3725ad35e3dcba2d9fdb92ad47c1
> ...
> Looking at the vmlinux_path (6 entries long)
> Using /tmp/.debug/.build-id/47/737eb3efdd6cb789872311c354b106ec8e7477
> for symbols
> Looking at the vmlinux_path (6 entries long)
> Using /tmp/.debug/.build-id/17/b6f2c42deb3725ad35e3dcba2d9fdb92ad47c1
> for symbols
> # Baseline  Delta                                     Symbol
> # ........ ..........  .....................................
> #
>      0.00%     +6.09%  0xffffffff8112a258 ! [k] pipe_read
>      0.00%     +4.86%  0xffffffff8141a206 ! [k] schedule
>      0.00%     +4.24%  0xffffffff810634d8 ! [k] update_curr
>      0.00%     +3.87%  0xffffffff8121f569 ! [k] find_next_bit
>      0.00%     +3.33%  0xffffffff81065cbf ! [k] enqueue_task_fair
>      0.00%     +3.25%  0xffffffff81065824 ! [k] dequeue_task_fair
>      0.00%     +2.77%  0xffffffff81129d10 ! [k] pipe_write
>      0.00%     +2.71%  0xffffffff8114ed97 ! [k] fsnotify
> 
> The baseline numbers are showing up as zero and the deltas match the
> fractions from the -tip+patches report. Am I missing something here?
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 00/19] Increase resolution of load weights
  2011-05-12  9:08         ` Peter Zijlstra
@ 2011-05-12 17:30           ` Nikhil Rao
  2011-05-13  7:19             ` Peter Zijlstra
  0 siblings, 1 reply; 34+ messages in thread
From: Nikhil Rao @ 2011-05-12 17:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mike Galbraith, linux-kernel, Nikunj A. Dadhania,
	Srivatsa Vaddagiri, Stephan Barwolf

On Thu, May 12, 2011 at 2:08 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2011-05-05 at 18:29 -0700, Nikhil Rao wrote:
>> > It's a cost/benefit analysis and for 32-bit systems the benefits seem to be
>> > rather small, right?
>> >
>>
>> Yes, that's right. The benefits for 32-bit systems do seem to be limited.
>
> deep(er) hierarchies on 32 bits still require this, it would be good to
> verify that the cgroup mess created by the insanity called libvirt will
> indeed work as expected.
>

I went through the libvirt docs and from what I understand, it creates
a hierarchy which is about 3 levels deep and has as many leaf nodes as
guest VMs.

Taking this graphic from
http://berrange.com/posts/2009/12/03/using-cgroups-with-libvirt-and-lxckvm-guests-in-fedora-12/

$ROOT
 |
 +-  libvirt    (all virtual machines/containers run by libvirtd)
       |
       +- lxc   (all LXC containers run by libvirtd)
       |   |
       |   +-  guest1    (LXC container called 'guest1')
       |   +-  guest2    (LXC container called 'guest2')
       |   +-  guest3    (LXC container called 'guest3')
       |   +-  ...       (LXC container called ...)
       |
       +- qemu  (all QEMU/KVM containers run by libvirtd)
           |
            +-  guest1    (QEMU machine called 'guest1')
           +-  guest2    (QEMU machine called 'guest2')
           +-  guest3    (QEMU machine called 'guest3')
           +-  ...       (QEMU machine called ...)

Assuming the tg shares given to libvirt, lxc and qemu containers are
the defaults, the load balancer should be able to deal with the
current resolution on 32-bit. Back-of-the-envelope calculations using
the approach I mentioned earlier (i.e. log_b(1024/NR_CPU)) say you
need > 64 VMs before you run out of resolution. I think that might be
too much to expect from an 8-cpu 32-bit machine ;-)
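
As a toy illustration of that estimate (my own sketch, assuming default 1024
shares at every level and an even split among siblings -- not the exact
formula above):

/* Effective per-cpu weight of one guest in the hierarchy above; the old
 * 10-bit load scale runs out of resolution once this drops below 1. */
#include <stdio.h>

int main(void)
{
        const double nr_cpus = 8.0;
        const double containers = 2.0;          /* lxc + qemu under libvirt */
        unsigned int guests;

        for (guests = 16; guests <= 128; guests *= 2) {
                double w = 1024.0 / containers / guests / nr_cpus;
                printf("%3u guests -> per-cpu weight ~%.2f%s\n",
                       guests, w, w < 1.0 ? "  (below resolution)" : "");
        }
        return 0;
}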

-Thanks,
Nikhil

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 00/19] Increase resolution of load weights
  2011-05-12 10:55                 ` Ingo Molnar
@ 2011-05-12 18:44                   ` Nikhil Rao
  0 siblings, 0 replies; 34+ messages in thread
From: Nikhil Rao @ 2011-05-12 18:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arnaldo Carvalho de Melo, Frédéric Weisbecker,
	Peter Zijlstra, Mike Galbraith, linux-kernel, Nikunj A. Dadhania,
	Srivatsa Vaddagiri, Stephan Barwolf

On Thu, May 12, 2011 at 3:55 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Nikhil Rao <ncrao@google.com> wrote:
>> On Tue, May 10, 2011 at 11:59 PM, Ingo Molnar <mingo@elte.hu> wrote:
>> >
>> From this latest run on -tip, the instruction count is about 0.28%
>> more and cycles are approx 3.38% more. From the stalled cycles counts,
>> it looks like most of this increase is coming from backend stalled
>> cycles. It's not clear what type of stalls these are, but if I were to
>> guess, I think it means stalls post-decode (i.e. functional units,
>> load/store, etc.). Is that right?
>
> Yeah, more functional work to be done, and probably a tad more expensive per
> extra instruction executed.
>

OK, this might be the shifts we do in c_d_m() (calc_delta_mine()). To confirm
this, let me remove the shifts and see if the number of stalled cycles
decreases.
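
For reference, the extra work is roughly of this shape -- a sketch of the idea
with made-up names, not the actual patch:

#include <linux/math64.h>       /* div64_u64() */

#define MY_LOAD_RESOLUTION      10              /* illustrative value */
#define my_scale_down(w)        ((w) >> MY_LOAD_RESOLUTION)

/* delta_exec * weight / lw->weight, with the weights shifted back down
 * to the old 10-bit scale -- the extra shift work referred to above. */
static inline u64 my_calc_delta(u64 delta_exec, u64 weight, u64 lw_weight)
{
        return div64_u64(delta_exec * my_scale_down(weight),
                         my_scale_down(lw_weight));
}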

> How did branches and branch misses change?
>

It looks like we take slightly more branches and miss more often.
About 0.2% more branches, and miss about 25% more often (i.e. 2.957%
vs. 2.376%).

-tip:
# taskset 8 perf stat --repeat 100 -e instructions -e cycles -e
branches -e branch-misses /root/data/pipe-test-100k

 Performance counter stats for '/root/data/pipe-test-100k' (100 runs):

       906,385,082 instructions             #      0.835 IPC     ( +-   0.077% )
     1,085,517,988 cycles                     ( +-   0.139% )
       165,921,546 branches                   ( +-   0.071% )
         3,941,788 branch-misses            #      2.376 %       ( +-   0.952% )

        1.061813201  seconds time elapsed   ( +-   0.096% )


-tip+patches:
# taskset 8 perf stat --repeat 100 -e instructions -e cycles -e
branches -e branch-misses /root/data/pipe-test-100k

 Performance counter stats for '/root/data/pipe-test-100k' (100 runs):

       908,150,127 instructions             #      0.829 IPC     ( +-   0.073% )
     1,095,344,326 cycles                     ( +-   0.140% )
       166,266,732 branches                   ( +-   0.071% )
         4,917,179 branch-misses            #      2.957 %       ( +-   0.746% )

        1.065221478  seconds time elapsed   ( +-   0.099% )


Comparing two perf records of branch-misses by hand, we see about the
same number of branch-miss events but the distribution looks less
top-heavy compared to -tip, so we might have a longer tail of branch
misses with the patches. None of the scheduler functions really stand
out.

-tip:
# taskset 8 perf record -e branch-misses /root/pipe-test-30m

# perf report | head -n 20
# Events: 310K cycles
#
# Overhead        Command      Shared Object                              Symbol
# ........  .............  .................  .....................................
#
    11.15%  pipe-test-30m  [kernel.kallsyms]  [k] system_call
     7.70%  pipe-test-30m  [kernel.kallsyms]  [k] x86_pmu_disable_all
     6.63%  pipe-test-30m  libc-2.11.1.so     [.] __GI_read
     6.11%  pipe-test-30m  [kernel.kallsyms]  [k] pipe_read
     5.74%  pipe-test-30m  [kernel.kallsyms]  [k] system_call_after_swapgs
     5.60%  pipe-test-30m  pipe-test-30m      [.] main
     5.55%  pipe-test-30m  [kernel.kallsyms]  [k] find_next_bit
     5.55%  pipe-test-30m  [kernel.kallsyms]  [k] __might_sleep
     5.46%  pipe-test-30m  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     4.55%  pipe-test-30m  [kernel.kallsyms]  [k] sched_clock
     3.82%  pipe-test-30m  [kernel.kallsyms]  [k] pipe_wait
     3.73%  pipe-test-30m  [kernel.kallsyms]  [k] sys_write
     3.65%  pipe-test-30m  [kernel.kallsyms]  [k] anon_pipe_buf_release
     3.61%  pipe-test-30m  [kernel.kallsyms]  [k] update_curr
     2.75%  pipe-test-30m  [kernel.kallsyms]  [k] select_task_rq_fair

-tip+patches:
# taskset 8 perf record -e branch-misses /root/pipe-test-30m

# perf report | head -n 20
# Events: 314K branch-misses
#
# Overhead        Command      Shared Object                              Symbol
# ........  .............  .................  .....................................
#
     7.66%  pipe-test-30m  [kernel.kallsyms]  [k] __might_sleep
     7.59%  pipe-test-30m  [kernel.kallsyms]  [k] system_call_after_swapgs
     5.88%  pipe-test-30m  [kernel.kallsyms]  [k] kill_fasync
     4.42%  pipe-test-30m  [kernel.kallsyms]  [k] fsnotify
     3.96%  pipe-test-30m  [kernel.kallsyms]  [k] update_curr
     3.93%  pipe-test-30m  [kernel.kallsyms]  [k] system_call
     3.91%  pipe-test-30m  [kernel.kallsyms]  [k] update_stats_wait_end
     3.90%  pipe-test-30m  [kernel.kallsyms]  [k] sys_read
     3.88%  pipe-test-30m  pipe-test-30m      [.] main
     3.86%  pipe-test-30m  [kernel.kallsyms]  [k] select_task_rq_fair
     3.81%  pipe-test-30m  libc-2.11.1.so     [.] __GI_read
     3.73%  pipe-test-30m  [kernel.kallsyms]  [k] sysret_check
     3.70%  pipe-test-30m  [kernel.kallsyms]  [k] sys_write
     3.66%  pipe-test-30m  [kernel.kallsyms]  [k] ret_from_sys_call
     3.56%  pipe-test-30m  [kernel.kallsyms]  [k] fsnotify_access

-Thanks,
Nikhil

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 00/19] Increase resolution of load weights
  2011-05-12 17:30           ` Nikhil Rao
@ 2011-05-13  7:19             ` Peter Zijlstra
  0 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-05-13  7:19 UTC (permalink / raw)
  To: Nikhil Rao
  Cc: Ingo Molnar, Mike Galbraith, linux-kernel, Nikunj A. Dadhania,
	Srivatsa Vaddagiri, Stephan Barwolf

On Thu, 2011-05-12 at 10:30 -0700, Nikhil Rao wrote:
> On Thu, May 12, 2011 at 2:08 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Thu, 2011-05-05 at 18:29 -0700, Nikhil Rao wrote:
> >> > It's a cost/benefit analysis and for 32-bit systems the benefits seem to be
> >> > rather small, right?
> >> >
> >>
> >> Yes, that's right. The benefits for 32-bit systems do seem to be limited.
> >
> > deep(er) hierarchies on 32 bits still require this, it would be good to
> > verify that the cgroup mess created by the insanity called libvirt will
> > indeed work as expected.
> >
> 
> I went through the libvirt docs and from what I understand, it creates
> a hierarchy which is about 3 levels deep and has as many leaf nodes as
> guest VMs.

That sounds about right with what I remember people telling me
earlier ;-)

> Taking this graphic from
> http://berrange.com/posts/2009/12/03/using-cgroups-with-libvirt-and-lxckvm-guests-in-fedora-12/
> 
> $ROOT
>  |
>  +-  libvirt    (all virtual machines/containers run by libvirtd)
>        |
>        +- lxc   (all LXC containers run by libvirtd)
>        |   |
>        |   +-  guest1    (LXC container called 'guest1')
>        |   +-  guest2    (LXC container called 'guest2')
>        |   +-  guest3    (LXC container called 'guest3')
>        |   +-  ...       (LXC container called ...)
>        |
>        +- qemu  (all QEMU/KVM containers run by libvirtd)
>            |
>            +-  guest1    (QEMU machine called 'guest1')
>            +-  guest2    (QEMU machine called 'guest2')
>            +-  guest3    (QEMU machine called 'guest3')
>            +-  ...       (QEMU machine called ...)
> 
> Assuming the tg shares given to libvirt, lxc and qemu containers are
> the defaults, the load balancer should be able to deal with the
> current resolution on 32-bit. Back of the envelope calculations using
> that approach I mentioned earlier (i.e. log_b(1024/NR_CPU)) says you
> need > 64 VMs before you run out of resolution. I think that might be
> too much to expect from a 8-cpu 32-bit machine ;-)

Quite so, get a real machine etc. ;-) Then again, there are always some weird
people out there, but I think we can tell them to run a 64-bit kernel.

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2011-05-13  7:20 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
2011-05-02  1:18 [PATCH v1 00/19] Increase resolution of load weights Nikhil Rao
2011-05-02  1:18 ` [PATCH v1 01/19] sched: introduce SCHED_POWER_SCALE to scale cpu_power calculations Nikhil Rao
2011-05-02  1:19 ` [PATCH v1 02/19] sched: increase SCHED_LOAD_SCALE resolution Nikhil Rao
2011-05-02  1:19 ` [PATCH v1 03/19] sched: use u64 for load_weight fields Nikhil Rao
2011-05-02  1:19 ` [PATCH v1 04/19] sched: update cpu_load to be u64 Nikhil Rao
2011-05-02  1:19 ` [PATCH v1 05/19] sched: update this_cpu_load() to return u64 value Nikhil Rao
2011-05-02  1:19 ` [PATCH v1 06/19] sched: update source_load(), target_load() and weighted_cpuload() to use u64 Nikhil Rao
2011-05-02  1:19 ` [PATCH v1 07/19] sched: update find_idlest_cpu() to use u64 for load Nikhil Rao
2011-05-02  1:19 ` [PATCH v1 08/19] sched: update find_idlest_group() to use u64 Nikhil Rao
2011-05-02  1:19 ` [PATCH v1 09/19] sched: update division in cpu_avg_load_per_task to use div_u64 Nikhil Rao
2011-05-02  1:19 ` [PATCH v1 10/19] sched: update wake_affine path to use u64, s64 for weights Nikhil Rao
2011-05-02  1:19 ` [PATCH v1 11/19] sched: update update_sg_lb_stats() to use u64 Nikhil Rao
2011-05-02  1:19 ` [PATCH v1 12/19] sched: Update update_sd_lb_stats() " Nikhil Rao
2011-05-02  1:19 ` [PATCH v1 13/19] sched: update f_b_g() to use u64 for weights Nikhil Rao
2011-05-02  1:19 ` [PATCH v1 14/19] sched: change type of imbalance to be u64 Nikhil Rao
2011-05-02  1:19 ` [PATCH v1 15/19] sched: update h_load to use u64 Nikhil Rao
2011-05-02  1:19 ` [PATCH v1 16/19] sched: update move_task() and helper functions to use u64 for weights Nikhil Rao
2011-05-02  1:19 ` [PATCH v1 17/19] sched: update f_b_q() to use u64 for weighted cpuload Nikhil Rao
2011-05-02  1:19 ` [PATCH v1 18/19] sched: update shares distribution to use u64 Nikhil Rao
2011-05-02  1:19 ` [PATCH v1 19/19] sched: convert atomic ops in shares update to use atomic64_t ops Nikhil Rao
2011-05-02  6:14 ` [PATCH v1 00/19] Increase resolution of load weights Ingo Molnar
2011-05-04  0:58   ` Nikhil Rao
2011-05-04  1:07     ` Nikhil Rao
2011-05-04 11:13     ` Ingo Molnar
2011-05-06  1:29       ` Nikhil Rao
2011-05-06  6:59         ` Ingo Molnar
2011-05-11  0:14           ` Nikhil Rao
2011-05-11  6:59             ` Ingo Molnar
2011-05-12  8:56               ` Nikhil Rao
2011-05-12 10:55                 ` Ingo Molnar
2011-05-12 18:44                   ` Nikhil Rao
2011-05-12  9:08         ` Peter Zijlstra
2011-05-12 17:30           ` Nikhil Rao
2011-05-13  7:19             ` Peter Zijlstra
