* [PATCH v2 0/3] Increase resolution of load weights
@ 2011-05-18 17:09 Nikhil Rao
  2011-05-18 17:09 ` [PATCH v2 1/3] sched: cleanup set_load_weight() Nikhil Rao
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Nikhil Rao @ 2011-05-18 17:09 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
  Cc: linux-kernel, Nikunj A. Dadhania, Srivatsa Vaddagiri,
	Stephan Barwolf, Nikhil Rao

Hi All,

Please find attached v2 of the patchset to increase load resolution. Based on
discussions with Ingo and Peter, this version drops all the
unsigned long -> u64 conversions, which greatly simplifies the patchset. We also
scale load only when BITS_PER_LONG > 32. Based on experiments with the previous
patchset, the benefits for 32-bit systems were limited and did not justify the
added performance cost.
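
For reference, patch 3/3 makes the load scaling conditional on the word size:

#if BITS_PER_LONG > 32
#define SCHED_LOAD_RESOLUTION	10
#define scale_load(w)		(w << SCHED_LOAD_RESOLUTION)
#define scale_load_down(w)	(w >> SCHED_LOAD_RESOLUTION)
#else
#define SCHED_LOAD_RESOLUTION	0
#define scale_load(w)		w
#define scale_load_down(w)	w
#endif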

Major changes from v1:
- Dropped unsigned long -> u64 conversions
- Cleaned up set_load_weight()
- Scale load only when BITS_PER_LONG > 32
- Rebased patchset to -tip instead of v2.6.39-rcX

Previous versions:
- v1: http://thread.gmane.org/gmane.linux.kernel/1133978
- v0: http://thread.gmane.org/gmane.linux.kernel/1129232

The patchset applies cleanly to -tip. Please note that this does not apply
cleanly to v2.6.39-rc8. I can post a patchset against v2.6.39-rc8 if needed.

Below is some analysis of the performance costs/improvements of this patchset.

1. Micro-arch performance costs:

The experiment was to run Ingo's pipe_test_100k about 200 times and measure
instructions, cycles, stalled cycles, branches, branch misses and other
micro-arch costs as needed.

-tip (baseline):

# taskset 4 perf stat -e instructions -e cycles -e stalled-cycles-backend -e stalled-cycles-frontend --repeat=200 ./pipe-test-100k

 Performance counter stats for '/root/load-scale/pipe-test-100k' (200 runs):

       964,991,769 instructions             #    0.82  insns per cycle        
                                            #    0.33  stalled cycles per insn
                                            #    ( +-  0.05% )
     1,171,186,635 cycles                   #    0.000 GHz                      ( +-  0.08% )
       306,373,664 stalled-cycles-backend   #   26.16% backend  cycles idle     ( +-  0.28% )
       314,933,621 stalled-cycles-frontend  #   26.89% frontend cycles idle     ( +-  0.34% )

        1.122405684  seconds time elapsed  ( +-  0.05% )


-tip+patches:

# taskset 4 perf stat -e instructions -e cycles -e stalled-cycles-backend -e stalled-cycles-frontend --repeat=200 ./pipe-test-100k

 Performance counter stats for './load-scale/pipe-test-100k' (200 runs):

       963,624,821 instructions             #    0.82  insns per cycle        
                                            #    0.33  stalled cycles per insn
                                            #    ( +-  0.04% )
     1,175,215,649 cycles                   #    0.000 GHz                      ( +-  0.08% )
       315,321,126 stalled-cycles-backend   #   26.83% backend  cycles idle     ( +-  0.28% )
       316,835,873 stalled-cycles-frontend  #   26.96% frontend cycles idle     ( +-  0.29% )

        1.122238659  seconds time elapsed  ( +-  0.06% )


With this version of the patchset, instructions decrease by ~0.10% and cycles
increase by 0.27%. This doesn't look statistically significant. As expected,
the fraction of stalled cycles in the backend increased from 26.16% to
26.83%. This can be attributed to the shifts we do in c_d_m() and other places.
The fraction of stalled cycles in the frontend remains about the same, at 26.96%
compared to 26.89% in -tip.

For comparison's sake, I modified the patch to remove the shifts in c_d_m() and
to scale down the inverse weight for tasks instead.

 Performance counter stats for './data/pipe-test-100k' (200 runs):

       956,993,064 instructions             #    0.81  insns per cycle        
                                            #    0.34  stalled cycles per insn
                                            #    ( +-  0.05% )
     1,181,506,294 cycles                   #    0.000 GHz                      ( +-  0.10% )
       304,271,160 stalled-cycles-backend   #   25.75% backend  cycles idle     ( +-  0.36% )
       325,099,601 stalled-cycles-frontend  #   27.52% frontend cycles idle     ( +-  0.37% )

        1.129208596  seconds time elapsed  ( +-  0.06% )

The number of instructions decreases by about 0.8% and cycles increase by about
0.81%. The fraction of cycles stalled in the backend drops to 25.75% and the
fraction of cycles stalled in the frontend increases to 27.52% (an increase of
2.61%). This is probably because we take the path marked unlikely in c_d_m()
more often, since we overflow the 64-bit mult.
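
For reference, the branch in question is the overflow check at the tail end of
c_d_m(); roughly (quoting from memory, not the exact -tip source), it looks
like:

	/*
	 * Check whether we'd overflow the 64-bit multiplication:
	 */
	if (unlikely(tmp > WMULT_CONST))
		tmp = SRR(SRR(tmp, WMULT_SHIFT/2) * lw->inv_weight,
			WMULT_SHIFT/2);
	else
		tmp = SRR(tmp * lw->inv_weight, WMULT_SHIFT);

With the shifts removed, tmp is computed from the full 1024x-scaled weight, so
this branch is taken far more often.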

I tried a few variations of this alternative, i.e. do not scale weights in
c_d_m() and replace the unlikely() compiler hint with (i) no compiler hint,
and (ii) a likely() compiler hint. Neither of these variations resulted in
better performance compared to scaling down weights (as currently done).

(i). -tip+patches(alt) + no compiler hint in c_d_m()

# taskset 4 perf stat -e instructions -e cycles -e stalled-cycles-backend -e stalled-cycles-frontend --repeat=200 ./pipe-test-100k

 Performance counter stats for './load-scale/pipe-test-100k' (200 runs):

       958,280,690 instructions             #    0.80  insns per cycle        
                                            #    0.34  stalled cycles per insn
                                            #    ( +-  0.07% )
     1,191,992,203 cycles                   #    0.000 GHz                      ( +-  0.10% )
       314,905,306 stalled-cycles-backend   #   26.42% backend  cycles idle     ( +-  0.36% )
       324,940,653 stalled-cycles-frontend  #   27.26% frontend cycles idle     ( +-  0.38% )

        1.136591328  seconds time elapsed  ( +-  0.08% )


(ii). -tip+patches(alt) + likely() compiler hint

# taskset 4 perf stat -e instructions -e cycles -e stalled-cycles-backend -e stalled-cycles-frontend --repeat=200 ./pipe-test-100k

 Performance counter stats for './load-scale/pipe-test-100k' (200 runs):

       956,711,525 instructions             #    0.81  insns per cycle        
                                            #    0.34  stalled cycles per insn
                                            #    ( +-  0.07% )
     1,184,777,377 cycles                   #    0.000 GHz                      ( +-  0.09% )
       308,259,625 stalled-cycles-backend   #   26.02% backend  cycles idle     ( +-  0.32% )
       325,105,968 stalled-cycles-frontend  #   27.44% frontend cycles idle     ( +-  0.38% )

        1.131130981  seconds time elapsed  ( +-  0.06% )


2. Balancing low-weight task groups

Test setup: run 50 tasks with random sleep/busy times (biased around 100ms) in
a low weight container (with cpu.shares = 2). Measure %idle as reported by
mpstat over a 10s window.

-tip (baseline):

06:47:48 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle    intr/s
06:47:49 PM  all   94.32    0.00    0.06    0.00    0.00    0.00    0.00    0.00    5.62  15888.00
06:47:50 PM  all   94.57    0.00    0.62    0.00    0.00    0.00    0.00    0.00    4.81  16180.00
06:47:51 PM  all   94.69    0.00    0.06    0.00    0.00    0.00    0.00    0.00    5.25  15966.00
06:47:52 PM  all   95.81    0.00    0.00    0.00    0.00    0.00    0.00    0.00    4.19  16053.00
06:47:53 PM  all   94.88    0.06    0.00    0.00    0.00    0.00    0.00    0.00    5.06  15984.00
06:47:54 PM  all   93.31    0.00    0.00    0.00    0.00    0.00    0.00    0.00    6.69  15806.00
06:47:55 PM  all   94.19    0.00    0.06    0.00    0.00    0.00    0.00    0.00    5.75  15896.00
06:47:56 PM  all   92.87    0.00    0.00    0.00    0.00    0.00    0.00    0.00    7.13  15716.00
06:47:57 PM  all   94.88    0.00    0.00    0.00    0.00    0.00    0.00    0.00    5.12  15982.00
06:47:58 PM  all   95.44    0.00    0.00    0.00    0.00    0.00    0.00    0.00    4.56  16075.00
Average:     all   94.49    0.01    0.08    0.00    0.00    0.00    0.00    0.00    5.42  15954.60

-tip+patches:

06:47:03 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle    intr/s
06:47:04 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16630.00
06:47:05 PM  all   99.69    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.31  16580.20
06:47:06 PM  all   99.69    0.00    0.06    0.00    0.00    0.00    0.00    0.00    0.25  16596.00
06:47:07 PM  all   99.20    0.00    0.74    0.00    0.00    0.06    0.00    0.00    0.00  17838.61
06:47:08 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16540.00
06:47:09 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16575.00
06:47:10 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16614.00
06:47:11 PM  all   99.94    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.06  16588.00
06:47:12 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00    0.00  16593.00
06:47:13 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00    0.00  16551.00
Average:     all   99.84    0.00    0.09    0.00    0.00    0.01    0.00    0.00    0.06  16711.58

On a related note, compared to v2.6.39-rc7, -tip seems to have regressed in
this test. A bisection points to this merge commit:

$ git show 493bd3180190511770876ed251d75dab8595d322
commit 493bd3180190511770876ed251d75dab8595d322
Merge: 0cc5bc3 d906f0e
Author: Ingo Molnar <mingo@elte.hu>
Date:   Fri Jan 7 14:09:55 2011 +0100

    Merge branch 'x86/numa'

Is there a way to bisect into this merge commit? I don't have much experience
with git bisect and I've been manually inspecting commits :-( Full bisection
log below.

git bisect start
# good: [693d92a1bbc9e42681c42ed190bd42b636ca876f] Linux 2.6.39-rc7
git bisect good 693d92a1bbc9e42681c42ed190bd42b636ca876f
# bad: [6639653af57eb8f48971cc8662318dfacd192929] Merge branch 'perf/urgent'
git bisect bad 6639653af57eb8f48971cc8662318dfacd192929
# bad: [1ebfeec0895bf5afc59d135deab2664ad4d71e82] Merge branch 'x86/urgent'
git bisect bad 1ebfeec0895bf5afc59d135deab2664ad4d71e82
# good: [0d51ef74badd470c0f84ee4eff502e06bdb7e06b] Merge branch 'perf/core'
git bisect good 0d51ef74badd470c0f84ee4eff502e06bdb7e06b
# good: [4c637fc4944b153dc98735cc79dcd776c1a47303] Merge branch 'perf/core'
git bisect good 4c637fc4944b153dc98735cc79dcd776c1a47303
# good: [959620395dcef68a93288ac6055f6010fbefa243] Merge branch 'perf/core'
git bisect good 959620395dcef68a93288ac6055f6010fbefa243
# good: [6a8aeb83d3d7e5682df4542e6080cd4037a1f6d7] Merge branch 'out-of-tree'
git bisect good 6a8aeb83d3d7e5682df4542e6080cd4037a1f6d7
# good: [106f3c7499db2d4094e2d3a669c2f2c6f08ab093] Merge branch 'perf/core'
git bisect good 106f3c7499db2d4094e2d3a669c2f2c6f08ab093
# good: [0de7032fa49c88b06a492a88da7ea0c5118cedad] Merge branch 'linus'
git bisect good 0de7032fa49c88b06a492a88da7ea0c5118cedad
# good: [5e02063f418c58fb756b1ac972a7d1805127f299] Merge branch 'perf/core'
git bisect good 5e02063f418c58fb756b1ac972a7d1805127f299
# bad: [f9a3e42e48b14d6cb698cf8a5380ea32c316c949] Merge branch 'sched/urgent'
git bisect bad f9a3e42e48b14d6cb698cf8a5380ea32c316c949
# good: [0cc5bc39098229cf4192c3639f0f6108afe4932b] Merge branch 'x86/urgent'
git bisect good 0cc5bc39098229cf4192c3639f0f6108afe4932b
# bad: [1f01b669d560f33ba388360aa6070fbdef9f8f44] Merge branch 'perf/core'
git bisect bad 1f01b669d560f33ba388360aa6070fbdef9f8f44
# bad: [493bd3180190511770876ed251d75dab8595d322] Merge branch 'x86/numa'
git bisect bad 493bd3180190511770876ed251d75dab8595d322

-Thanks,
Nikhil

Nikhil Rao (3):
  sched: cleanup set_load_weight()
  sched: introduce SCHED_POWER_SCALE to scale cpu_power calculations
  sched: increase SCHED_LOAD_SCALE resolution

 include/linux/sched.h |   25 +++++++++++++++++-----
 kernel/sched.c        |   38 ++++++++++++++++++++++++-----------
 kernel/sched_fair.c   |   52 +++++++++++++++++++++++++-----------------------
 3 files changed, 72 insertions(+), 43 deletions(-)

-- 
1.7.3.1



* [PATCH v2 1/3] sched: cleanup set_load_weight()
  2011-05-18 17:09 [PATCH v2 0/3] Increase resolution of load weights Nikhil Rao
@ 2011-05-18 17:09 ` Nikhil Rao
  2011-05-20 13:43   ` [tip:sched/core] sched: Cleanup set_load_weight() tip-bot for Nikhil Rao
  2011-05-18 17:09 ` [PATCH v2 2/3] sched: introduce SCHED_POWER_SCALE to scale cpu_power calculations Nikhil Rao
  2011-05-18 17:09 ` [PATCH v2 3/3] sched: increase SCHED_LOAD_SCALE resolution Nikhil Rao
  2 siblings, 1 reply; 15+ messages in thread
From: Nikhil Rao @ 2011-05-18 17:09 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
  Cc: linux-kernel, Nikunj A. Dadhania, Srivatsa Vaddagiri,
	Stephan Barwolf, Nikhil Rao

Avoid using long repetitious names; make this simpler and nicer to read. No
functional change introduced in this patch.

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 kernel/sched.c |   11 +++++++----
 1 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index c62acf4..d036048 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1778,17 +1778,20 @@ static void dec_nr_running(struct rq *rq)
 
 static void set_load_weight(struct task_struct *p)
 {
+	int prio = p->static_prio - MAX_RT_PRIO;
+	struct load_weight *load = &p->se.load;
+
 	/*
 	 * SCHED_IDLE tasks get minimal weight:
 	 */
 	if (p->policy == SCHED_IDLE) {
-		p->se.load.weight = WEIGHT_IDLEPRIO;
-		p->se.load.inv_weight = WMULT_IDLEPRIO;
+		load->weight = WEIGHT_IDLEPRIO;
+		load->inv_weight = WMULT_IDLEPRIO;
 		return;
 	}
 
-	p->se.load.weight = prio_to_weight[p->static_prio - MAX_RT_PRIO];
-	p->se.load.inv_weight = prio_to_wmult[p->static_prio - MAX_RT_PRIO];
+	load->weight = prio_to_weight[prio];
+	load->inv_weight = prio_to_wmult[prio];
 }
 
 static void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
-- 
1.7.3.1



* [PATCH v2 2/3] sched: introduce SCHED_POWER_SCALE to scale cpu_power calculations
  2011-05-18 17:09 [PATCH v2 0/3] Increase resolution of load weights Nikhil Rao
  2011-05-18 17:09 ` [PATCH v2 1/3] sched: cleanup set_load_weight() Nikhil Rao
@ 2011-05-18 17:09 ` Nikhil Rao
  2011-05-20 13:43   ` [tip:sched/core] sched: Introduce " tip-bot for Nikhil Rao
  2011-05-18 17:09 ` [PATCH v2 3/3] sched: increase SCHED_LOAD_SCALE resolution Nikhil Rao
  2 siblings, 1 reply; 15+ messages in thread
From: Nikhil Rao @ 2011-05-18 17:09 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
  Cc: linux-kernel, Nikunj A. Dadhania, Srivatsa Vaddagiri,
	Stephan Barwolf, Nikhil Rao

SCHED_LOAD_SCALE is used to increase nice resolution and to scale cpu_power
calculations in the scheduler. This patch introduces SCHED_POWER_SCALE and
converts all uses of SCHED_LOAD_SCALE for scaling cpu_power to use
SCHED_POWER_SCALE instead.

This is a preparatory patch for increasing the resolution of SCHED_LOAD_SCALE;
there is no need to increase the resolution of cpu_power calculations, so they
get their own scale.
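
Concretely, the patch adds a separate scale for cpu_power (taken from the hunk
below):

#define SCHED_POWER_SHIFT	10
#define SCHED_POWER_SCALE	(1L << SCHED_POWER_SHIFT)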

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 include/linux/sched.h |   13 +++++++----
 kernel/sched.c        |    4 +-
 kernel/sched_fair.c   |   52 +++++++++++++++++++++++++-----------------------
 3 files changed, 37 insertions(+), 32 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 12211e1..f2f4402 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -788,17 +788,20 @@ enum cpu_idle_type {
 };
 
 /*
- * sched-domains (multiprocessor balancing) declarations:
- */
-
-/*
  * Increase resolution of nice-level calculations:
  */
 #define SCHED_LOAD_SHIFT	10
 #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
 
-#define SCHED_LOAD_SCALE_FUZZ	SCHED_LOAD_SCALE
+/*
+ * Increase resolution of cpu_power calculations
+ */
+#define SCHED_POWER_SHIFT	10
+#define SCHED_POWER_SCALE	(1L << SCHED_POWER_SHIFT)
 
+/*
+ * sched-domains (multiprocessor balancing) declarations:
+ */
 #ifdef CONFIG_SMP
 #define SD_LOAD_BALANCE		0x0001	/* Do load balancing on this domain. */
 #define SD_BALANCE_NEWIDLE	0x0002	/* Balance when about to become idle */
diff --git a/kernel/sched.c b/kernel/sched.c
index d036048..375e9c6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6530,7 +6530,7 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
 		cpulist_scnprintf(str, sizeof(str), sched_group_cpus(group));
 
 		printk(KERN_CONT " %s", str);
-		if (group->cpu_power != SCHED_LOAD_SCALE) {
+		if (group->cpu_power != SCHED_POWER_SCALE) {
 			printk(KERN_CONT " (cpu_power = %d)",
 				group->cpu_power);
 		}
@@ -7905,7 +7905,7 @@ void __init sched_init(void)
 #ifdef CONFIG_SMP
 		rq->sd = NULL;
 		rq->rd = NULL;
-		rq->cpu_power = SCHED_LOAD_SCALE;
+		rq->cpu_power = SCHED_POWER_SCALE;
 		rq->post_schedule = 0;
 		rq->active_balance = 0;
 		rq->next_balance = jiffies;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 37f2262..e32a9b7 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1584,7 +1584,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		}
 
 		/* Adjust by relative CPU power of the group */
-		avg_load = (avg_load * SCHED_LOAD_SCALE) / group->cpu_power;
+		avg_load = (avg_load * SCHED_POWER_SCALE) / group->cpu_power;
 
 		if (local_group) {
 			this_load = avg_load;
@@ -1722,7 +1722,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 				nr_running += cpu_rq(i)->cfs.nr_running;
 			}
 
-			capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
+			capacity = DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE);
 
 			if (tmp->flags & SD_POWERSAVINGS_BALANCE)
 				nr_running /= 2;
@@ -2570,7 +2570,7 @@ static inline int check_power_save_busiest_group(struct sd_lb_stats *sds,
 
 unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu)
 {
-	return SCHED_LOAD_SCALE;
+	return SCHED_POWER_SCALE;
 }
 
 unsigned long __weak arch_scale_freq_power(struct sched_domain *sd, int cpu)
@@ -2607,10 +2607,10 @@ unsigned long scale_rt_power(int cpu)
 		available = total - rq->rt_avg;
 	}
 
-	if (unlikely((s64)total < SCHED_LOAD_SCALE))
-		total = SCHED_LOAD_SCALE;
+	if (unlikely((s64)total < SCHED_POWER_SCALE))
+		total = SCHED_POWER_SCALE;
 
-	total >>= SCHED_LOAD_SHIFT;
+	total >>= SCHED_POWER_SHIFT;
 
 	return div_u64(available, total);
 }
@@ -2618,7 +2618,7 @@ unsigned long scale_rt_power(int cpu)
 static void update_cpu_power(struct sched_domain *sd, int cpu)
 {
 	unsigned long weight = sd->span_weight;
-	unsigned long power = SCHED_LOAD_SCALE;
+	unsigned long power = SCHED_POWER_SCALE;
 	struct sched_group *sdg = sd->groups;
 
 	if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
@@ -2627,7 +2627,7 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
 		else
 			power *= default_scale_smt_power(sd, cpu);
 
-		power >>= SCHED_LOAD_SHIFT;
+		power >>= SCHED_POWER_SHIFT;
 	}
 
 	sdg->cpu_power_orig = power;
@@ -2637,10 +2637,10 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
 	else
 		power *= default_scale_freq_power(sd, cpu);
 
-	power >>= SCHED_LOAD_SHIFT;
+	power >>= SCHED_POWER_SHIFT;
 
 	power *= scale_rt_power(cpu);
-	power >>= SCHED_LOAD_SHIFT;
+	power >>= SCHED_POWER_SHIFT;
 
 	if (!power)
 		power = 1;
@@ -2682,7 +2682,7 @@ static inline int
 fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
 {
 	/*
-	 * Only siblings can have significantly less than SCHED_LOAD_SCALE
+	 * Only siblings can have significantly less than SCHED_POWER_SCALE
 	 */
 	if (!(sd->flags & SD_SHARE_CPUPOWER))
 		return 0;
@@ -2770,7 +2770,7 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 	}
 
 	/* Adjust by relative CPU power of the group */
-	sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;
+	sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->cpu_power;
 
 	/*
 	 * Consider the group unbalanced when the imbalance is larger
@@ -2787,7 +2787,8 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 	if ((max_cpu_load - min_cpu_load) >= avg_load_per_task && max_nr_running > 1)
 		sgs->group_imb = 1;
 
-	sgs->group_capacity = DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
+	sgs->group_capacity = DIV_ROUND_CLOSEST(group->cpu_power,
+						SCHED_POWER_SCALE);
 	if (!sgs->group_capacity)
 		sgs->group_capacity = fix_small_capacity(sd, group);
 	sgs->group_weight = group->group_weight;
@@ -2961,7 +2962,7 @@ static int check_asym_packing(struct sched_domain *sd,
 		return 0;
 
 	*imbalance = DIV_ROUND_CLOSEST(sds->max_load * sds->busiest->cpu_power,
-				       SCHED_LOAD_SCALE);
+				       SCHED_POWER_SCALE);
 	return 1;
 }
 
@@ -2990,7 +2991,7 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
 			cpu_avg_load_per_task(this_cpu);
 
 	scaled_busy_load_per_task = sds->busiest_load_per_task
-						 * SCHED_LOAD_SCALE;
+					 * SCHED_POWER_SCALE;
 	scaled_busy_load_per_task /= sds->busiest->cpu_power;
 
 	if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
@@ -3009,10 +3010,10 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
 			min(sds->busiest_load_per_task, sds->max_load);
 	pwr_now += sds->this->cpu_power *
 			min(sds->this_load_per_task, sds->this_load);
-	pwr_now /= SCHED_LOAD_SCALE;
+	pwr_now /= SCHED_POWER_SCALE;
 
 	/* Amount of load we'd subtract */
-	tmp = (sds->busiest_load_per_task * SCHED_LOAD_SCALE) /
+	tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
 		sds->busiest->cpu_power;
 	if (sds->max_load > tmp)
 		pwr_move += sds->busiest->cpu_power *
@@ -3020,15 +3021,15 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
 
 	/* Amount of load we'd add */
 	if (sds->max_load * sds->busiest->cpu_power <
-		sds->busiest_load_per_task * SCHED_LOAD_SCALE)
+		sds->busiest_load_per_task * SCHED_POWER_SCALE)
 		tmp = (sds->max_load * sds->busiest->cpu_power) /
 			sds->this->cpu_power;
 	else
-		tmp = (sds->busiest_load_per_task * SCHED_LOAD_SCALE) /
+		tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
 			sds->this->cpu_power;
 	pwr_move += sds->this->cpu_power *
 			min(sds->this_load_per_task, sds->this_load + tmp);
-	pwr_move /= SCHED_LOAD_SCALE;
+	pwr_move /= SCHED_POWER_SCALE;
 
 	/* Move if we gain throughput */
 	if (pwr_move > pwr_now)
@@ -3070,7 +3071,7 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
 		load_above_capacity = (sds->busiest_nr_running -
 						sds->busiest_group_capacity);
 
-		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_LOAD_SCALE);
+		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE);
 
 		load_above_capacity /= sds->busiest->cpu_power;
 	}
@@ -3090,7 +3091,7 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
 	/* How much load to actually move to equalise the imbalance */
 	*imbalance = min(max_pull * sds->busiest->cpu_power,
 		(sds->avg_load - sds->this_load) * sds->this->cpu_power)
-			/ SCHED_LOAD_SCALE;
+			/ SCHED_POWER_SCALE;
 
 	/*
 	 * if *imbalance is less than the average load per runnable task
@@ -3159,7 +3160,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
 	if (!sds.busiest || sds.busiest_nr_running == 0)
 		goto out_balanced;
 
-	sds.avg_load = (SCHED_LOAD_SCALE * sds.total_load) / sds.total_pwr;
+	sds.avg_load = (SCHED_POWER_SCALE * sds.total_load) / sds.total_pwr;
 
 	/*
 	 * If the busiest group is imbalanced the below checks don't
@@ -3238,7 +3239,8 @@ find_busiest_queue(struct sched_domain *sd, struct sched_group *group,
 
 	for_each_cpu(i, sched_group_cpus(group)) {
 		unsigned long power = power_of(i);
-		unsigned long capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
+		unsigned long capacity = DIV_ROUND_CLOSEST(power,
+							   SCHED_POWER_SCALE);
 		unsigned long wl;
 
 		if (!capacity)
@@ -3263,7 +3265,7 @@ find_busiest_queue(struct sched_domain *sd, struct sched_group *group,
 		 * the load can be moved away from the cpu that is potentially
 		 * running at a lower capacity.
 		 */
-		wl = (wl * SCHED_LOAD_SCALE) / power;
+		wl = (wl * SCHED_POWER_SCALE) / power;
 
 		if (wl > max_load) {
 			max_load = wl;
-- 
1.7.3.1



* [PATCH v2 3/3] sched: increase SCHED_LOAD_SCALE resolution
  2011-05-18 17:09 [PATCH v2 0/3] Increase resolution of load weights Nikhil Rao
  2011-05-18 17:09 ` [PATCH v2 1/3] sched: cleanup set_load_weight() Nikhil Rao
  2011-05-18 17:09 ` [PATCH v2 2/3] sched: introduce SCHED_POWER_SCALE to scale cpu_power calculations Nikhil Rao
@ 2011-05-18 17:09 ` Nikhil Rao
  2011-05-18 18:25   ` Ingo Molnar
  2011-05-18 18:26   ` [PATCH v2 3/3] sched: increase " Ingo Molnar
  2 siblings, 2 replies; 15+ messages in thread
From: Nikhil Rao @ 2011-05-18 17:09 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
  Cc: linux-kernel, Nikunj A. Dadhania, Srivatsa Vaddagiri,
	Stephan Barwolf, Nikhil Rao

Introduce SCHED_LOAD_RESOLUTION, which is added to SCHED_LOAD_SHIFT and
increases the resolution of SCHED_LOAD_SCALE. This patch sets the value of
SCHED_LOAD_RESOLUTION to 10, scaling up the weights for all sched entities by
a factor of 1024. With this extra resolution, we can handle deeper cgroup
hierarchies and the scheduler can do better shares distribution and load
balancing on larger systems (especially for low weight task groups).

This does not change the existing user interface; the scaled weights are only
used internally. We do not modify prio_to_weight values or inverses, but use
the original weights when calculating the inverse, which is used to scale
the execution time delta in calc_delta_mine(). This ensures we do not lose
accuracy when accounting time to the sched entities. Thanks to Nikunj Dadhania
for fixing a bug in c_d_m() that broke fairness.

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 include/linux/sched.h |   14 ++++++++++++--
 kernel/sched.c        |   27 +++++++++++++++++++--------
 2 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f2f4402..b7ecd61 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -788,9 +788,19 @@ enum cpu_idle_type {
 };
 
 /*
- * Increase resolution of nice-level calculations:
+ * Increase resolution of nice-level calculations for 64-bit architectures.
  */
-#define SCHED_LOAD_SHIFT	10
+#if BITS_PER_LONG > 32
+#define SCHED_LOAD_RESOLUTION	10
+#define scale_load(w)		(w << SCHED_LOAD_RESOLUTION)
+#define scale_load_down(w)	(w >> SCHED_LOAD_RESOLUTION)
+#else
+#define SCHED_LOAD_RESOLUTION	0
+#define scale_load(w)		w
+#define scale_load_down(w)	w
+#endif
+
+#define SCHED_LOAD_SHIFT	(10 + (SCHED_LOAD_RESOLUTION))
 #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
 
 /*
diff --git a/kernel/sched.c b/kernel/sched.c
index 375e9c6..554d767 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -293,7 +293,7 @@ static DEFINE_SPINLOCK(task_group_lock);
  *  limitation from this.)
  */
 #define MIN_SHARES	2
-#define MAX_SHARES	(1UL << 18)
+#define MAX_SHARES	(1UL << (18 + SCHED_LOAD_RESOLUTION))
 
 static int root_task_group_load = ROOT_TASK_GROUP_LOAD;
 #endif
@@ -1330,13 +1330,24 @@ calc_delta_mine(unsigned long delta_exec, unsigned long weight,
 {
 	u64 tmp;
 
-	tmp = (u64)delta_exec * weight;
+	/*
+	 * weight can be less than 2^SCHED_LOAD_RESOLUTION for task group sched
+	 * entities since MIN_SHARES = 2. Treat weight as 1 if less than
+	 * 2^SCHED_LOAD_RESOLUTION.
+	 */
+	if (likely(weight > (1UL << SCHED_LOAD_RESOLUTION)))
+		tmp = (u64)delta_exec * scale_load_down(weight);
+	else
+		tmp = (u64)delta_exec;
 
 	if (!lw->inv_weight) {
-		if (BITS_PER_LONG > 32 && unlikely(lw->weight >= WMULT_CONST))
+		unsigned long w = scale_load_down(lw->weight);
+		if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
 			lw->inv_weight = 1;
+		else if (unlikely(!w))
+			lw->inv_weight = WMULT_CONST;
 		else
-			lw->inv_weight = WMULT_CONST / lw->weight;
+			lw->inv_weight = WMULT_CONST / w;
 	}
 
 	/*
@@ -1785,12 +1796,12 @@ static void set_load_weight(struct task_struct *p)
 	 * SCHED_IDLE tasks get minimal weight:
 	 */
 	if (p->policy == SCHED_IDLE) {
-		load->weight = WEIGHT_IDLEPRIO;
+		load->weight = scale_load(WEIGHT_IDLEPRIO);
 		load->inv_weight = WMULT_IDLEPRIO;
 		return;
 	}
 
-	load->weight = prio_to_weight[prio];
+	load->weight = scale_load(prio_to_weight[prio]);
 	load->inv_weight = prio_to_wmult[prio];
 }
 
@@ -8809,14 +8820,14 @@ cpu_cgroup_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
 static int cpu_shares_write_u64(struct cgroup *cgrp, struct cftype *cftype,
 				u64 shareval)
 {
-	return sched_group_set_shares(cgroup_tg(cgrp), shareval);
+	return sched_group_set_shares(cgroup_tg(cgrp), scale_load(shareval));
 }
 
 static u64 cpu_shares_read_u64(struct cgroup *cgrp, struct cftype *cft)
 {
 	struct task_group *tg = cgroup_tg(cgrp);
 
-	return (u64) tg->shares;
+	return (u64) scale_load_down(tg->shares);
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-- 
1.7.3.1



* Re: [PATCH v2 3/3] sched: increase SCHED_LOAD_SCALE resolution
  2011-05-18 17:09 ` [PATCH v2 3/3] sched: increase SCHED_LOAD_SCALE resolution Nikhil Rao
@ 2011-05-18 18:25   ` Ingo Molnar
  2011-05-18 19:39     ` Nikhil Rao
  2011-05-18 18:26   ` [PATCH v2 3/3] sched: increase " Ingo Molnar
  1 sibling, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2011-05-18 18:25 UTC (permalink / raw)
  To: Nikhil Rao
  Cc: Peter Zijlstra, Mike Galbraith, linux-kernel, Nikunj A. Dadhania,
	Srivatsa Vaddagiri, Stephan Barwolf


* Nikhil Rao <ncrao@google.com> wrote:

> -#define SCHED_LOAD_SHIFT	10
> +#if BITS_PER_LONG > 32
> +#define SCHED_LOAD_RESOLUTION	10
> +#define scale_load(w)		(w << SCHED_LOAD_RESOLUTION)
> +#define scale_load_down(w)	(w >> SCHED_LOAD_RESOLUTION)
> +#else
> +#define SCHED_LOAD_RESOLUTION	0
> +#define scale_load(w)		w
> +#define scale_load_down(w)	w
> +#endif

Please use the:

#if X
# define Y A
#else
# define Y B
#endif

type of nested CPP branching style.

Also, the 'w' probably wants to be '(w)', just in case.

>  	if (!lw->inv_weight) {
> -		if (BITS_PER_LONG > 32 && unlikely(lw->weight >= WMULT_CONST))
> +		unsigned long w = scale_load_down(lw->weight);
> +		if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
>  			lw->inv_weight = 1;

Please separate local variable declarations and the first statement following 
it by an extra empty line.

Could you also put all the performance measurement description into the 
changelog of this third commit - so that people can see (and enjoy) the 
extensive testing you have done on this topic?

Thanks,

	Ingo


* Re: [PATCH v2 3/3] sched: increase SCHED_LOAD_SCALE resolution
  2011-05-18 17:09 ` [PATCH v2 3/3] sched: increase SCHED_LOAD_SCALE resolution Nikhil Rao
  2011-05-18 18:25   ` Ingo Molnar
@ 2011-05-18 18:26   ` Ingo Molnar
  1 sibling, 0 replies; 15+ messages in thread
From: Ingo Molnar @ 2011-05-18 18:26 UTC (permalink / raw)
  To: Nikhil Rao
  Cc: Peter Zijlstra, Mike Galbraith, linux-kernel, Nikunj A. Dadhania,
	Srivatsa Vaddagiri, Stephan Barwolf


* Nikhil Rao <ncrao@google.com> wrote:

>  /*
> - * Increase resolution of nice-level calculations:
> + * Increase resolution of nice-level calculations for 64-bit architectures.
>   */
> -#define SCHED_LOAD_SHIFT	10
> +#if BITS_PER_LONG > 32
> +#define SCHED_LOAD_RESOLUTION	10
> +#define scale_load(w)		(w << SCHED_LOAD_RESOLUTION)
> +#define scale_load_down(w)	(w >> SCHED_LOAD_RESOLUTION)
> +#else
> +#define SCHED_LOAD_RESOLUTION	0
> +#define scale_load(w)		w
> +#define scale_load_down(w)	w
> +#endif

Please also be a bit more verbose in the comment above about why we treat 64-bit 
architectures differently - it's not obvious.

Thanks,

	Ingo


* Re: [PATCH v2 3/3] sched: increase SCHED_LOAD_SCALE resolution
  2011-05-18 18:25   ` Ingo Molnar
@ 2011-05-18 19:39     ` Nikhil Rao
  2011-05-18 19:41       ` Nikhil Rao
  0 siblings, 1 reply; 15+ messages in thread
From: Nikhil Rao @ 2011-05-18 19:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Mike Galbraith, linux-kernel, Nikunj A. Dadhania,
	Srivatsa Vaddagiri, Stephan Barwolf

On Wed, May 18, 2011 at 11:25 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Nikhil Rao <ncrao@google.com> wrote:
>
>> -#define SCHED_LOAD_SHIFT     10
>> +#if BITS_PER_LONG > 32
>> +#define SCHED_LOAD_RESOLUTION        10
>> +#define scale_load(w)                (w << SCHED_LOAD_RESOLUTION)
>> +#define scale_load_down(w)   (w >> SCHED_LOAD_RESOLUTION)
>> +#else
>> +#define SCHED_LOAD_RESOLUTION        0
>> +#define scale_load(w)                w
>> +#define scale_load_down(w)   w
>> +#endif
>
> Please use the:
>
> #if X
> # define Y A
> #else
> # define Y B
> #endif
>
> type of nested CPP branching style.
>
> Also, the 'w' probably wants to be '(w)', just in case.

Thanks for the review. Will fix the macros.

>
>>       if (!lw->inv_weight) {
>> -             if (BITS_PER_LONG > 32 && unlikely(lw->weight >= WMULT_CONST))
>> +             unsigned long w = scale_load_down(lw->weight);
>> +             if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
>>                       lw->inv_weight = 1;
>
> Please separate local variable declarations and the first statement following
> it by an extra empty line.
>

Yes, will do.

> Could you also put all the performance measurement description into the
> changelog of this third commit - so that people can see (and enjoy) the
> extensive testing you have done on this topic?
>

Yes, will add the performance changes to the commit description.

>>  /*
>> - * Increase resolution of nice-level calculations:
>> + * Increase resolution of nice-level calculations for 64-bit architectures.
>>   */
>> -#define SCHED_LOAD_SHIFT     10
>> +#if BITS_PER_LONG > 32
>> +#define SCHED_LOAD_RESOLUTION        10
>> +#define scale_load(w)                (w << SCHED_LOAD_RESOLUTION)
>> +#define scale_load_down(w)   (w >> SCHED_LOAD_RESOLUTION)
>> +#else
>> +#define SCHED_LOAD_RESOLUTION        0
>> +#define scale_load(w)                w
>> +#define scale_load_down(w)   w
>> +#endif
>
> Please also be a bit more verbose in the comment above why we treat 64-bit
> architectures differently - it's not obvious.

Added a more descriptive comment.


Thanks for the review. I'll refresh the patch and send it with the fixes.

-Thanks,
Nikhil


* [PATCH v2 3/3] sched: increase SCHED_LOAD_SCALE resolution
  2011-05-18 19:39     ` Nikhil Rao
@ 2011-05-18 19:41       ` Nikhil Rao
  2011-05-18 20:26         ` Ingo Molnar
  0 siblings, 1 reply; 15+ messages in thread
From: Nikhil Rao @ 2011-05-18 19:41 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
  Cc: linux-kernel, Nikunj A. Dadhania, Srivatsa Vaddagiri,
	Stephan Barwolf, Nikhil Rao

Introduce SCHED_LOAD_RESOLUTION, which is added to SCHED_LOAD_SHIFT and
increases the resolution of SCHED_LOAD_SCALE. This patch sets the value of
SCHED_LOAD_RESOLUTION to 10, scaling up the weights for all sched entities by
a factor of 1024. With this extra resolution, we can handle deeper cgroup
hierarchies and the scheduler can do better shares distribution and load
balancing on larger systems (especially for low weight task groups).

This does not change the existing user interface; the scaled weights are only
used internally. We do not modify prio_to_weight values or inverses, but use
the original weights when calculating the inverse, which is used to scale
the execution time delta in calc_delta_mine(). This ensures we do not lose
accuracy when accounting time to the sched entities. Thanks to Nikunj Dadhania
for fixing a bug in c_d_m() that broke fairness.

Below is some analysis of the performance costs/improvements of this patch.

1. Micro-arch performance costs:

The experiment was to run Ingo's pipe_test_100k 200 times with the task pinned
to one cpu. I measured instructions, cycles and stalled cycles for the runs. See
http://thread.gmane.org/gmane.linux.kernel/1129232/focus=1129389 for more info.

-tip (baseline):

 Performance counter stats for '/root/load-scale/pipe-test-100k' (200 runs):

       964,991,769 instructions             #    0.82  insns per cycle
                                            #    0.33  stalled cycles per insn
                                            #    ( +-  0.05% )
     1,171,186,635 cycles                   #    0.000 GHz                      ( +-  0.08% )
       306,373,664 stalled-cycles-backend   #   26.16% backend  cycles idle     ( +-  0.28% )
       314,933,621 stalled-cycles-frontend  #   26.89% frontend cycles idle     ( +-  0.34% )

        1.122405684  seconds time elapsed  ( +-  0.05% )

-tip+patches:

 Performance counter stats for './load-scale/pipe-test-100k' (200 runs):

       963,624,821 instructions             #    0.82  insns per cycle
                                            #    0.33  stalled cycles per insn
                                            #    ( +-  0.04% )
     1,175,215,649 cycles                   #    0.000 GHz                      ( +-  0.08% )
       315,321,126 stalled-cycles-backend   #   26.83% backend  cycles idle     ( +-  0.28% )
       316,835,873 stalled-cycles-frontend  #   26.96% frontend cycles idle     ( +-  0.29% )

        1.122238659  seconds time elapsed  ( +-  0.06% )

With this patch, instructions decrease by ~0.10% and cycles increase by 0.27%.
This doesn't look statistically significant. The fraction of stalled cycles in
the backend increased from 26.16% to 26.83%. This can be attributed to the
shifts we do in c_d_m() and other places. The fraction of stalled cycles in the
frontend remains about the same, at 26.96% compared to 26.89% in -tip.

2. Balancing low-weight task groups

Test setup: run 50 tasks with random sleep/busy times (biased around 100ms) in
a low weight container (with cpu.shares = 2). Measure %idle as reported by
mpstat over a 10s window.

-tip (baseline):

06:47:48 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle    intr/s
06:47:49 PM  all   94.32    0.00    0.06    0.00    0.00    0.00    0.00    0.00    5.62  15888.00
06:47:50 PM  all   94.57    0.00    0.62    0.00    0.00    0.00    0.00    0.00    4.81  16180.00
06:47:51 PM  all   94.69    0.00    0.06    0.00    0.00    0.00    0.00    0.00    5.25  15966.00
06:47:52 PM  all   95.81    0.00    0.00    0.00    0.00    0.00    0.00    0.00    4.19  16053.00
06:47:53 PM  all   94.88    0.06    0.00    0.00    0.00    0.00    0.00    0.00    5.06  15984.00
06:47:54 PM  all   93.31    0.00    0.00    0.00    0.00    0.00    0.00    0.00    6.69  15806.00
06:47:55 PM  all   94.19    0.00    0.06    0.00    0.00    0.00    0.00    0.00    5.75  15896.00
06:47:56 PM  all   92.87    0.00    0.00    0.00    0.00    0.00    0.00    0.00    7.13  15716.00
06:47:57 PM  all   94.88    0.00    0.00    0.00    0.00    0.00    0.00    0.00    5.12  15982.00
06:47:58 PM  all   95.44    0.00    0.00    0.00    0.00    0.00    0.00    0.00    4.56  16075.00
Average:     all   94.49    0.01    0.08    0.00    0.00    0.00    0.00    0.00    5.42  15954.60

-tip+patches:

06:47:03 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle    intr/s
06:47:04 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16630.00
06:47:05 PM  all   99.69    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.31  16580.20
06:47:06 PM  all   99.69    0.00    0.06    0.00    0.00    0.00    0.00    0.00    0.25  16596.00
06:47:07 PM  all   99.20    0.00    0.74    0.00    0.00    0.06    0.00    0.00    0.00  17838.61
06:47:08 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16540.00
06:47:09 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16575.00
06:47:10 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16614.00
06:47:11 PM  all   99.94    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.06  16588.00
06:47:12 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00    0.00  16593.00
06:47:13 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00    0.00  16551.00
Average:     all   99.84    0.00    0.09    0.00    0.00    0.01    0.00    0.00    0.06  16711.58

We see an improvement in system utilization: idle% drops from 5.42% on -tip to
0.06% with the patches.

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 include/linux/sched.h |   23 +++++++++++++++++++++--
 kernel/sched.c        |   28 ++++++++++++++++++++--------
 2 files changed, 41 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f2f4402..76245d2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -788,9 +788,28 @@ enum cpu_idle_type {
 };
 
 /*
- * Increase resolution of nice-level calculations:
+ * Increase resolution of nice-level calculations for 64-bit architectures.
+ * The extra resolution improves shares distribution and load balancing of
+ * low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup
+ * hierarchies, especially on larger systems. This is not a user-visible change
+ * and does not change the user-interface for setting shares/weights.
+ *
+ * We increase resolution only if we have enough bits to allow this increased
+ * resolution (i.e. BITS_PER_LONG > 32). The costs for increasing resolution
+ * when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
+ * increased costs.
  */
-#define SCHED_LOAD_SHIFT	10
+#if BITS_PER_LONG > 32
+# define SCHED_LOAD_RESOLUTION	10
+# define scale_load(w)		(w << SCHED_LOAD_RESOLUTION)
+# define scale_load_down(w)	(w >> SCHED_LOAD_RESOLUTION)
+#else
+# define SCHED_LOAD_RESOLUTION	0
+# define scale_load(w)		(w)
+# define scale_load_down(w)	(w)
+#endif
+
+#define SCHED_LOAD_SHIFT	(10 + (SCHED_LOAD_RESOLUTION))
 #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
 
 /*
diff --git a/kernel/sched.c b/kernel/sched.c
index 375e9c6..bb504e1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -293,7 +293,7 @@ static DEFINE_SPINLOCK(task_group_lock);
  *  limitation from this.)
  */
 #define MIN_SHARES	2
-#define MAX_SHARES	(1UL << 18)
+#define MAX_SHARES	(1UL << (18 + SCHED_LOAD_RESOLUTION))
 
 static int root_task_group_load = ROOT_TASK_GROUP_LOAD;
 #endif
@@ -1330,13 +1330,25 @@ calc_delta_mine(unsigned long delta_exec, unsigned long weight,
 {
 	u64 tmp;
 
-	tmp = (u64)delta_exec * weight;
+	/*
+	 * weight can be less than 2^SCHED_LOAD_RESOLUTION for task group sched
+	 * entities since MIN_SHARES = 2. Treat weight as 1 if less than
+	 * 2^SCHED_LOAD_RESOLUTION.
+	 */
+	if (likely(weight > (1UL << SCHED_LOAD_RESOLUTION)))
+		tmp = (u64)delta_exec * scale_load_down(weight);
+	else
+		tmp = (u64)delta_exec;
 
 	if (!lw->inv_weight) {
-		if (BITS_PER_LONG > 32 && unlikely(lw->weight >= WMULT_CONST))
+		unsigned long w = scale_load_down(lw->weight);
+
+		if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
 			lw->inv_weight = 1;
+		else if (unlikely(!w))
+			lw->inv_weight = WMULT_CONST;
 		else
-			lw->inv_weight = WMULT_CONST / lw->weight;
+			lw->inv_weight = WMULT_CONST / w;
 	}
 
 	/*
@@ -1785,12 +1797,12 @@ static void set_load_weight(struct task_struct *p)
 	 * SCHED_IDLE tasks get minimal weight:
 	 */
 	if (p->policy == SCHED_IDLE) {
-		load->weight = WEIGHT_IDLEPRIO;
+		load->weight = scale_load(WEIGHT_IDLEPRIO);
 		load->inv_weight = WMULT_IDLEPRIO;
 		return;
 	}
 
-	load->weight = prio_to_weight[prio];
+	load->weight = scale_load(prio_to_weight[prio]);
 	load->inv_weight = prio_to_wmult[prio];
 }
 
@@ -8809,14 +8821,14 @@ cpu_cgroup_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
 static int cpu_shares_write_u64(struct cgroup *cgrp, struct cftype *cftype,
 				u64 shareval)
 {
-	return sched_group_set_shares(cgroup_tg(cgrp), shareval);
+	return sched_group_set_shares(cgroup_tg(cgrp), scale_load(shareval));
 }
 
 static u64 cpu_shares_read_u64(struct cgroup *cgrp, struct cftype *cft)
 {
 	struct task_group *tg = cgroup_tg(cgrp);
 
-	return (u64) tg->shares;
+	return (u64) scale_load_down(tg->shares);
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-- 
1.7.3.1



* Re: [PATCH v2 3/3] sched: increase SCHED_LOAD_SCALE resolution
  2011-05-18 19:41       ` Nikhil Rao
@ 2011-05-18 20:26         ` Ingo Molnar
  2011-05-18 21:18           ` Nikhil Rao
  0 siblings, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2011-05-18 20:26 UTC (permalink / raw)
  To: Nikhil Rao
  Cc: Peter Zijlstra, Mike Galbraith, linux-kernel, Nikunj A. Dadhania,
	Srivatsa Vaddagiri, Stephan Barwolf


* Nikhil Rao <ncrao@google.com> wrote:

> +#if BITS_PER_LONG > 32
> +# define SCHED_LOAD_RESOLUTION	10
> +# define scale_load(w)		(w << SCHED_LOAD_RESOLUTION)
> +# define scale_load_down(w)	(w >> SCHED_LOAD_RESOLUTION)
> +#else
> +# define SCHED_LOAD_RESOLUTION	0
> +# define scale_load(w)		(w)
> +# define scale_load_down(w)	(w)
> +#endif

We want (w) in the other definitions as well, to protect potential operators 
with lower precedence than <<. (Roughly half of the C operators are such so 
it's a real issue, should anyone use these macros with operators.)
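
For instance, with a hypothetical argument that uses a lower-precedence
operator:

	scale_load(w & MASK)	/* hypothetical usage, for illustration */

expands to (w & MASK << SCHED_LOAD_RESOLUTION), which C parses as
w & (MASK << SCHED_LOAD_RESOLUTION) rather than the intended
(w & MASK) << SCHED_LOAD_RESOLUTION.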

> +#define SCHED_LOAD_SHIFT	(10 + (SCHED_LOAD_RESOLUTION))

that () is not needed actually, if you look at the definition of 
SCHED_LOAD_RESOLUTION.

So you could move the superfluous () from here up to the two definitions above 
and thus no parentheses would be hurt during the making of this patch.

> +		if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
>  			lw->inv_weight = 1;
> +		else if (unlikely(!w))
> +			lw->inv_weight = WMULT_CONST;
>  		else
> +			lw->inv_weight = WMULT_CONST / w;

Ok, i just noticed that you made use of BITS_PER_LONG here too.

It's better to put that into a helper define, something like 
SCHED_LOAD_HIGHRES, which could thus be used like this:

		if (SCHED_LOAD_HIGHRES && unlikely(w >= WMULT_CONST))

then, should anyone want to tweak the condition for SCHED_LOAD_HIGHRES, it 
could be done in a single place. It would also self-document.
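
A minimal sketch of such a helper (illustrative only; this define does not
exist in the tree):

	/* illustrative only, not an existing symbol: */
	#define SCHED_LOAD_HIGHRES	(BITS_PER_LONG > 32)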

Thanks,

	Ingo


* Re: [PATCH v2 3/3] sched: increase SCHED_LOAD_SCALE resolution
  2011-05-18 20:26         ` Ingo Molnar
@ 2011-05-18 21:18           ` Nikhil Rao
  2011-05-18 21:22             ` Ingo Molnar
  0 siblings, 1 reply; 15+ messages in thread
From: Nikhil Rao @ 2011-05-18 21:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Mike Galbraith, linux-kernel, Nikunj A. Dadhania,
	Srivatsa Vaddagiri, Stephan Barwolf

On Wed, May 18, 2011 at 1:26 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Nikhil Rao <ncrao@google.com> wrote:
>
>> +#if BITS_PER_LONG > 32
>> +# define SCHED_LOAD_RESOLUTION       10
>> +# define scale_load(w)               (w << SCHED_LOAD_RESOLUTION)
>> +# define scale_load_down(w)  (w >> SCHED_LOAD_RESOLUTION)
>> +#else
>> +# define SCHED_LOAD_RESOLUTION       0
>> +# define scale_load(w)               (w)
>> +# define scale_load_down(w)  (w)
>> +#endif
>
> We want (w) in the other definitions as well, to protect potential operators
> with lower precedence than <<. (Roughly half of the C operators are such so
> it's a real issue, should anyone use these macros with operators.)
>
>> +#define SCHED_LOAD_SHIFT     (10 + (SCHED_LOAD_RESOLUTION))
>
> that () is not needed actually, if you look at the definition of
> SCHED_LOAD_RESOLUTION.
>
> So you could move the superfluous () from here up to the two definitions above
> and thus no parentheses would be hurt during the making of this patch.
>

Ah, thanks for the explanation. Does this look OK?

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f2f4402..c34a718 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -788,9 +788,28 @@ enum cpu_idle_type {
 };

 /*
- * Increase resolution of nice-level calculations:
+ * Increase resolution of nice-level calculations for 64-bit architectures.
+ * The extra resolution improves shares distribution and load balancing of
+ * low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup
+ * hierarchies, especially on larger systems. This is not a user-visible change
+ * and does not change the user-interface for setting shares/weights.
+ *
+ * We increase resolution only if we have enough bits to allow this increased
+ * resolution (i.e. BITS_PER_LONG > 32). The costs for increasing resolution
+ * when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
+ * increased costs.
  */
-#define SCHED_LOAD_SHIFT       10
+#if BITS_PER_LONG > 32
+# define SCHED_LOAD_RESOLUTION 10
+# define scale_load(w)         ((w) << SCHED_LOAD_RESOLUTION)
+# define scale_load_down(w)    ((w) >> SCHED_LOAD_RESOLUTION)
+#else
+# define SCHED_LOAD_RESOLUTION 0
+# define scale_load(w)         (w)
+# define scale_load_down(w)    (w)
+#endif
+
+#define SCHED_LOAD_SHIFT       (10 + SCHED_LOAD_RESOLUTION)
 #define SCHED_LOAD_SCALE       (1L << SCHED_LOAD_SHIFT)

 /*


>> +             if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
>>                       lw->inv_weight = 1;
>> +             else if (unlikely(!w))
>> +                     lw->inv_weight = WMULT_CONST;
>>               else
>> +                     lw->inv_weight = WMULT_CONST / w;
>
> Ok, i just noticed that you made use of BITS_PER_LONG here too.
>
> It's better to put that into a helper define, something like
> SCHED_LOAD_HIGHRES, which could thus be used like this:
>
>                if (SCHED_LOAD_HIGHRES && unlikely(w >= WMULT_CONST))
>
> then, should anyone want to tweak the condition for SCHED_LOAD_HIGHRES, it
> could be done in a single place. It would also self-document.
>

Hmm, that particular use of BITS_PER_LONG was not touched by this
patch; the patch only switches the check over to the local variable w instead
of lw->weight. The (BITS_PER_LONG > 32 && w >= WMULT_CONST) check is required
on 64-bit systems irrespective of the load-resolution changes.


* Re: [PATCH v2 3/3] sched: increase SCHED_LOAD_SCALE resolution
  2011-05-18 21:18           ` Nikhil Rao
@ 2011-05-18 21:22             ` Ingo Molnar
  2011-05-18 21:37               ` [PATCH v3 " Nikhil Rao
  0 siblings, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2011-05-18 21:22 UTC (permalink / raw)
  To: Nikhil Rao
  Cc: Peter Zijlstra, Mike Galbraith, linux-kernel, Nikunj A. Dadhania,
	Srivatsa Vaddagiri, Stephan Barwolf


* Nikhil Rao <ncrao@google.com> wrote:

> On Wed, May 18, 2011 at 1:26 PM, Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * Nikhil Rao <ncrao@google.com> wrote:
> >
> >> +#if BITS_PER_LONG > 32
> >> +# define SCHED_LOAD_RESOLUTION       10
> >> +# define scale_load(w)               (w << SCHED_LOAD_RESOLUTION)
> >> +# define scale_load_down(w)  (w >> SCHED_LOAD_RESOLUTION)
> >> +#else
> >> +# define SCHED_LOAD_RESOLUTION       0
> >> +# define scale_load(w)               (w)
> >> +# define scale_load_down(w)  (w)
> >> +#endif
> >
> > We want (w) in the other definitions as well, to protect potential operators
> > with lower precedence than <<. (Roughly half of the C operators are such so
> > it's a real issue, should anyone use these macros with operators.)
> >
> >> +#define SCHED_LOAD_SHIFT     (10 + (SCHED_LOAD_RESOLUTION))
> >
> > that () is not needed actually, if you look at the definition of
> > SCHED_LOAD_RESOLUTION.
> >
> > So you could move the superfluous () from here up to the two definitions above
> > and thus no parentheses would be hurt during the making of this patch.
> >
> 
> Ah, thanks for the explanation. Does this look OK?
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f2f4402..c34a718 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -788,9 +788,28 @@ enum cpu_idle_type {
>  };
> 
>  /*
> - * Increase resolution of nice-level calculations:
> + * Increase resolution of nice-level calculations for 64-bit architectures.
> + * The extra resolution improves shares distribution and load balancing of
> + * low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup
> + * hierarchies, especially on larger systems. This is not a user-visible change
> + * and does not change the user-interface for setting shares/weights.
> + *
> + * We increase resolution only if we have enough bits to allow this increased
> + * resolution (i.e. BITS_PER_LONG > 32). The costs for increasing resolution
> + * when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
> + * increased costs.
>   */
> -#define SCHED_LOAD_SHIFT       10
> +#if BITS_PER_LONG > 32
> +# define SCHED_LOAD_RESOLUTION 10
> +# define scale_load(w)         ((w) << SCHED_LOAD_RESOLUTION)
> +# define scale_load_down(w)    ((w) >> SCHED_LOAD_RESOLUTION)
> +#else
> +# define SCHED_LOAD_RESOLUTION 0
> +# define scale_load(w)         (w)
> +# define scale_load_down(w)    (w)
> +#endif
> +
> +#define SCHED_LOAD_SHIFT       (10 + SCHED_LOAD_RESOLUTION)
>  #define SCHED_LOAD_SCALE       (1L << SCHED_LOAD_SHIFT)
> 
>  /*
> 
> 
> >> +             if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
> >>                       lw->inv_weight = 1;
> >> +             else if (unlikely(!w))
> >> +                     lw->inv_weight = WMULT_CONST;
> >>               else
> >> +                     lw->inv_weight = WMULT_CONST / w;
> >
> > Ok, i just noticed that you made use of BITS_PER_LONG here too.
> >
> > It's better to put that into a helper define, something like
> > SCHED_LOAD_HIGHRES, which could thus be used like this:
> >
> >                if (SCHED_LOAD_HIGHRES && unlikely(w >= WMULT_CONST))
> >
> > then, should anyone want to tweak the condition for SCHED_LOAD_HIGHRES, it
> > could be done in a single place. It would also self-document.
> >
> 
> Hmm, that particular use of BITS_PER_LONG was not touched by this
> patch. This patch only changes lw->weight to use the local variable w.
> The (BITS_PER_LONG & > WMULT_CONST) check is required on 64-bit
> systems irrespective of the load-resolution changes.

my bad, i confused it with the resolution changes.

It's fine as-is then i guess. Mind reposting the full patch again, with all 
updates included and the subject line changed to make it easy to find this 
patch in the discussion?

Thanks,

	Ingo


* [PATCH v3 3/3] sched: increase SCHED_LOAD_SCALE resolution
  2011-05-18 21:22             ` Ingo Molnar
@ 2011-05-18 21:37               ` Nikhil Rao
  2011-05-20 13:43                 ` [tip:sched/core] sched: Increase " tip-bot for Nikhil Rao
  0 siblings, 1 reply; 15+ messages in thread
From: Nikhil Rao @ 2011-05-18 21:37 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
  Cc: linux-kernel, Nikunj A. Dadhania, Srivatsa Vaddagiri,
	Stephan Barwolf, Nikhil Rao

Introduce SCHED_LOAD_RESOLUTION, which is added to SCHED_LOAD_SHIFT and
increases the resolution of SCHED_LOAD_SCALE. This patch sets the value of
SCHED_LOAD_RESOLUTION to 10, scaling up the weights for all sched entities by
a factor of 1024. With this extra resolution, we can handle deeper cgroup
hierarchies and the scheduler can do better shares distribution and load
balancing on larger systems (especially for low weight task groups).

This does not change the existing user interface; the scaled weights are only
used internally. We do not modify prio_to_weight values or inverses, but use
the original weights when calculating the inverse, which is used to scale
the execution time delta in calc_delta_mine(). This ensures we do not lose
accuracy when accounting time to the sched entities. Thanks to Nikunj Dadhania
for fixing a bug in c_d_m() that broke fairness.

Below is some analysis of the performance costs/improvements of this patch.

1. Micro-arch performance costs:

The experiment was to run Ingo's pipe_test_100k 200 times with the task pinned
to one cpu. I measured instructions, cycles and stalled cycles for the runs. See
http://thread.gmane.org/gmane.linux.kernel/1129232/focus=1129389 for more info.

-tip (baseline):

 Performance counter stats for '/root/load-scale/pipe-test-100k' (200 runs):

       964,991,769 instructions             #    0.82  insns per cycle
                                            #    0.33  stalled cycles per insn
                                            #    ( +-  0.05% )
     1,171,186,635 cycles                   #    0.000 GHz                      ( +-  0.08% )
       306,373,664 stalled-cycles-backend   #   26.16% backend  cycles idle     ( +-  0.28% )
       314,933,621 stalled-cycles-frontend  #   26.89% frontend cycles idle     ( +-  0.34% )

        1.122405684  seconds time elapsed  ( +-  0.05% )

-tip+patches:

 Performance counter stats for './load-scale/pipe-test-100k' (200 runs):

       963,624,821 instructions             #    0.82  insns per cycle
                                            #    0.33  stalled cycles per insn
                                            #    ( +-  0.04% )
     1,175,215,649 cycles                   #    0.000 GHz                      ( +-  0.08% )
       315,321,126 stalled-cycles-backend   #   26.83% backend  cycles idle     ( +-  0.28% )
       316,835,873 stalled-cycles-frontend  #   26.96% frontend cycles idle     ( +-  0.29% )

        1.122238659  seconds time elapsed  ( +-  0.06% )

With this patch, instructions decrease by ~0.10% and cycles increase by 0.27%.
This doesn't look statistically significant. The number of stalled cycles in
the backend increased from 26.16% to 26.83%. This can be attributed to the
shifts we do in c_d_m() and other places. The fraction of stalled cycles in the
frontend remains about the same, at 26.96% compared to 26.89% in -tip.

2. Balancing low-weight task groups

Test setup: run 50 tasks with random sleep/busy times (biased around 100ms) in
a low weight container (with cpu.shares = 2). Measure %idle as reported by
mpstat over a 10s window.
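
A rough sketch of what one of the 50 test tasks might look like; the thread
does not include the actual test program, so the busy/sleep durations below
are only an assumption around the stated ~100ms bias:

#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Busy-spin for roughly 'ms' milliseconds. */
static void busy_for_ms(long ms)
{
	struct timespec start, now;

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while ((now.tv_sec - start.tv_sec) * 1000 +
		 (now.tv_nsec - start.tv_nsec) / 1000000 < ms);
}

int main(void)
{
	srand(getpid());
	for (;;) {
		busy_for_ms(50 + rand() % 100);		/* busy, ~100ms on average */
		usleep((50 + rand() % 100) * 1000);	/* sleep, ~100ms on average */
	}
	return 0;
}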

-tip (baseline):

06:47:48 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle    intr/s
06:47:49 PM  all   94.32    0.00    0.06    0.00    0.00    0.00    0.00    0.00    5.62  15888.00
06:47:50 PM  all   94.57    0.00    0.62    0.00    0.00    0.00    0.00    0.00    4.81  16180.00
06:47:51 PM  all   94.69    0.00    0.06    0.00    0.00    0.00    0.00    0.00    5.25  15966.00
06:47:52 PM  all   95.81    0.00    0.00    0.00    0.00    0.00    0.00    0.00    4.19  16053.00
06:47:53 PM  all   94.88    0.06    0.00    0.00    0.00    0.00    0.00    0.00    5.06  15984.00
06:47:54 PM  all   93.31    0.00    0.00    0.00    0.00    0.00    0.00    0.00    6.69  15806.00
06:47:55 PM  all   94.19    0.00    0.06    0.00    0.00    0.00    0.00    0.00    5.75  15896.00
06:47:56 PM  all   92.87    0.00    0.00    0.00    0.00    0.00    0.00    0.00    7.13  15716.00
06:47:57 PM  all   94.88    0.00    0.00    0.00    0.00    0.00    0.00    0.00    5.12  15982.00
06:47:58 PM  all   95.44    0.00    0.00    0.00    0.00    0.00    0.00    0.00    4.56  16075.00
Average:     all   94.49    0.01    0.08    0.00    0.00    0.00    0.00    0.00    5.42  15954.60

-tip+patches:

06:47:03 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle    intr/s
06:47:04 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16630.00
06:47:05 PM  all   99.69    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.31  16580.20
06:47:06 PM  all   99.69    0.00    0.06    0.00    0.00    0.00    0.00    0.00    0.25  16596.00
06:47:07 PM  all   99.20    0.00    0.74    0.00    0.00    0.06    0.00    0.00    0.00  17838.61
06:47:08 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16540.00
06:47:09 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16575.00
06:47:10 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16614.00
06:47:11 PM  all   99.94    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.06  16588.00
06:47:12 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00    0.00  16593.00
06:47:13 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00    0.00  16551.00
Average:     all   99.84    0.00    0.09    0.00    0.00    0.01    0.00    0.00    0.06  16711.58

We see an improvement in system utilization: idle% drops from 5.42% on -tip
to 0.06% with the patches.

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 include/linux/sched.h |   23 +++++++++++++++++++++--
 kernel/sched.c        |   28 ++++++++++++++++++++--------
 2 files changed, 41 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f2f4402..c34a718 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -788,9 +788,28 @@ enum cpu_idle_type {
 };
 
 /*
- * Increase resolution of nice-level calculations:
+ * Increase resolution of nice-level calculations for 64-bit architectures.
+ * The extra resolution improves shares distribution and load balancing of
+ * low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup
+ * hierarchies, especially on larger systems. This is not a user-visible change
+ * and does not change the user-interface for setting shares/weights.
+ *
+ * We increase resolution only if we have enough bits to allow this increased
+ * resolution (i.e. BITS_PER_LONG > 32). The costs for increasing resolution
+ * when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
+ * increased costs.
  */
-#define SCHED_LOAD_SHIFT	10
+#if BITS_PER_LONG > 32
+# define SCHED_LOAD_RESOLUTION	10
+# define scale_load(w)		((w) << SCHED_LOAD_RESOLUTION)
+# define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)
+#else
+# define SCHED_LOAD_RESOLUTION	0
+# define scale_load(w)		(w)
+# define scale_load_down(w)	(w)
+#endif
+
+#define SCHED_LOAD_SHIFT	(10 + SCHED_LOAD_RESOLUTION)
 #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
 
 /*
diff --git a/kernel/sched.c b/kernel/sched.c
index 375e9c6..bb504e1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -293,7 +293,7 @@ static DEFINE_SPINLOCK(task_group_lock);
  *  limitation from this.)
  */
 #define MIN_SHARES	2
-#define MAX_SHARES	(1UL << 18)
+#define MAX_SHARES	(1UL << (18 + SCHED_LOAD_RESOLUTION))
 
 static int root_task_group_load = ROOT_TASK_GROUP_LOAD;
 #endif
@@ -1330,13 +1330,25 @@ calc_delta_mine(unsigned long delta_exec, unsigned long weight,
 {
 	u64 tmp;
 
-	tmp = (u64)delta_exec * weight;
+	/*
+	 * weight can be less than 2^SCHED_LOAD_RESOLUTION for task group sched
+	 * entities since MIN_SHARES = 2. Treat weight as 1 if less than
+	 * 2^SCHED_LOAD_RESOLUTION.
+	 */
+	if (likely(weight > (1UL << SCHED_LOAD_RESOLUTION)))
+		tmp = (u64)delta_exec * scale_load_down(weight);
+	else
+		tmp = (u64)delta_exec;
 
 	if (!lw->inv_weight) {
-		if (BITS_PER_LONG > 32 && unlikely(lw->weight >= WMULT_CONST))
+		unsigned long w = scale_load_down(lw->weight);
+
+		if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
 			lw->inv_weight = 1;
+		else if (unlikely(!w))
+			lw->inv_weight = WMULT_CONST;
 		else
-			lw->inv_weight = WMULT_CONST / lw->weight;
+			lw->inv_weight = WMULT_CONST / w;
 	}
 
 	/*
@@ -1785,12 +1797,12 @@ static void set_load_weight(struct task_struct *p)
 	 * SCHED_IDLE tasks get minimal weight:
 	 */
 	if (p->policy == SCHED_IDLE) {
-		load->weight = WEIGHT_IDLEPRIO;
+		load->weight = scale_load(WEIGHT_IDLEPRIO);
 		load->inv_weight = WMULT_IDLEPRIO;
 		return;
 	}
 
-	load->weight = prio_to_weight[prio];
+	load->weight = scale_load(prio_to_weight[prio]);
 	load->inv_weight = prio_to_wmult[prio];
 }
 
@@ -8809,14 +8821,14 @@ cpu_cgroup_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
 static int cpu_shares_write_u64(struct cgroup *cgrp, struct cftype *cftype,
 				u64 shareval)
 {
-	return sched_group_set_shares(cgroup_tg(cgrp), shareval);
+	return sched_group_set_shares(cgroup_tg(cgrp), scale_load(shareval));
 }
 
 static u64 cpu_shares_read_u64(struct cgroup *cgrp, struct cftype *cft)
 {
 	struct task_group *tg = cgroup_tg(cgrp);
 
-	return (u64) tg->shares;
+	return (u64) scale_load_down(tg->shares);
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [tip:sched/core] sched: Cleanup set_load_weight()
  2011-05-18 17:09 ` [PATCH v2 1/3] sched: cleanup set_load_weight() Nikhil Rao
@ 2011-05-20 13:43   ` tip-bot for Nikhil Rao
  0 siblings, 0 replies; 15+ messages in thread
From: tip-bot for Nikhil Rao @ 2011-05-20 13:43 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, vatsa, hpa, mingo, ncrao, efault, peterz, nikunj,
	stephan.baerwolf, tglx, mingo

Commit-ID:  f05998d4b80632f2cc00f108da503066ef5d38d5
Gitweb:     http://git.kernel.org/tip/f05998d4b80632f2cc00f108da503066ef5d38d5
Author:     Nikhil Rao <ncrao@google.com>
AuthorDate: Wed, 18 May 2011 10:09:38 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 20 May 2011 14:16:49 +0200

sched: Cleanup set_load_weight()

Avoid using long repetitious names; make this simpler and nicer
to read. No functional change introduced in this patch.

Signed-off-by: Nikhil Rao <ncrao@google.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Stephan Barwolf <stephan.baerwolf@tu-ilmenau.de>
Cc: Mike Galbraith <efault@gmx.de>
Link: http://lkml.kernel.org/r/1305738580-9924-2-git-send-email-ncrao@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |   11 +++++++----
 1 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index c62acf4..d036048 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1778,17 +1778,20 @@ static void dec_nr_running(struct rq *rq)
 
 static void set_load_weight(struct task_struct *p)
 {
+	int prio = p->static_prio - MAX_RT_PRIO;
+	struct load_weight *load = &p->se.load;
+
 	/*
 	 * SCHED_IDLE tasks get minimal weight:
 	 */
 	if (p->policy == SCHED_IDLE) {
-		p->se.load.weight = WEIGHT_IDLEPRIO;
-		p->se.load.inv_weight = WMULT_IDLEPRIO;
+		load->weight = WEIGHT_IDLEPRIO;
+		load->inv_weight = WMULT_IDLEPRIO;
 		return;
 	}
 
-	p->se.load.weight = prio_to_weight[p->static_prio - MAX_RT_PRIO];
-	p->se.load.inv_weight = prio_to_wmult[p->static_prio - MAX_RT_PRIO];
+	load->weight = prio_to_weight[prio];
+	load->inv_weight = prio_to_wmult[prio];
 }
 
 static void enqueue_task(struct rq *rq, struct task_struct *p, int flags)

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [tip:sched/core] sched: Introduce SCHED_POWER_SCALE to scale cpu_power calculations
  2011-05-18 17:09 ` [PATCH v2 2/3] sched: introduce SCHED_POWER_SCALE to scale cpu_power calculations Nikhil Rao
@ 2011-05-20 13:43   ` tip-bot for Nikhil Rao
  0 siblings, 0 replies; 15+ messages in thread
From: tip-bot for Nikhil Rao @ 2011-05-20 13:43 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, vatsa, hpa, mingo, ncrao, efault, peterz, nikunj,
	stephan.baerwolf, tglx, mingo

Commit-ID:  1399fa7807a1a5998bbf147e80668e9950661dfa
Gitweb:     http://git.kernel.org/tip/1399fa7807a1a5998bbf147e80668e9950661dfa
Author:     Nikhil Rao <ncrao@google.com>
AuthorDate: Wed, 18 May 2011 10:09:39 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 20 May 2011 14:16:50 +0200

sched: Introduce SCHED_POWER_SCALE to scale cpu_power calculations

SCHED_LOAD_SCALE is used to increase nice resolution and to
scale cpu_power calculations in the scheduler. This patch
introduces SCHED_POWER_SCALE and converts all uses of
SCHED_LOAD_SCALE for scaling cpu_power to use SCHED_POWER_SCALE
instead.

This is a preparatory patch for increasing the resolution of
SCHED_LOAD_SCALE, since there is no need to increase the
resolution of cpu_power calculations.
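
As a rough illustration of the split (the numbers below are hypothetical and
the DIV_ROUND_CLOSEST definition is a simplified stand-in for the kernel
macro), cpu_power keeps its own 1024-based unit, so a later change to
SCHED_LOAD_SHIFT leaves capacity calculations untouched:

#include <stdio.h>

#define SCHED_POWER_SHIFT	10
#define SCHED_POWER_SCALE	(1L << SCHED_POWER_SHIFT)
#define DIV_ROUND_CLOSEST(x, d)	(((x) + (d) / 2) / (d))

int main(void)
{
	/* Hypothetical sched group made up of two full-power CPUs. */
	long cpu_power = 2 * SCHED_POWER_SCALE;
	long capacity = DIV_ROUND_CLOSEST(cpu_power, SCHED_POWER_SCALE);

	/* capacity == 2, independent of whatever SCHED_LOAD_SCALE becomes. */
	printf("cpu_power=%ld capacity=%ld\n", cpu_power, capacity);
	return 0;
}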

Signed-off-by: Nikhil Rao <ncrao@google.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Stephan Barwolf <stephan.baerwolf@tu-ilmenau.de>
Cc: Mike Galbraith <efault@gmx.de>
Link: http://lkml.kernel.org/r/1305738580-9924-3-git-send-email-ncrao@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h |   13 +++++++----
 kernel/sched.c        |    4 +-
 kernel/sched_fair.c   |   52 +++++++++++++++++++++++++-----------------------
 3 files changed, 37 insertions(+), 32 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 12211e1..f2f4402 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -788,17 +788,20 @@ enum cpu_idle_type {
 };
 
 /*
- * sched-domains (multiprocessor balancing) declarations:
- */
-
-/*
  * Increase resolution of nice-level calculations:
  */
 #define SCHED_LOAD_SHIFT	10
 #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
 
-#define SCHED_LOAD_SCALE_FUZZ	SCHED_LOAD_SCALE
+/*
+ * Increase resolution of cpu_power calculations
+ */
+#define SCHED_POWER_SHIFT	10
+#define SCHED_POWER_SCALE	(1L << SCHED_POWER_SHIFT)
 
+/*
+ * sched-domains (multiprocessor balancing) declarations:
+ */
 #ifdef CONFIG_SMP
 #define SD_LOAD_BALANCE		0x0001	/* Do load balancing on this domain. */
 #define SD_BALANCE_NEWIDLE	0x0002	/* Balance when about to become idle */
diff --git a/kernel/sched.c b/kernel/sched.c
index d036048..375e9c6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6530,7 +6530,7 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
 		cpulist_scnprintf(str, sizeof(str), sched_group_cpus(group));
 
 		printk(KERN_CONT " %s", str);
-		if (group->cpu_power != SCHED_LOAD_SCALE) {
+		if (group->cpu_power != SCHED_POWER_SCALE) {
 			printk(KERN_CONT " (cpu_power = %d)",
 				group->cpu_power);
 		}
@@ -7905,7 +7905,7 @@ void __init sched_init(void)
 #ifdef CONFIG_SMP
 		rq->sd = NULL;
 		rq->rd = NULL;
-		rq->cpu_power = SCHED_LOAD_SCALE;
+		rq->cpu_power = SCHED_POWER_SCALE;
 		rq->post_schedule = 0;
 		rq->active_balance = 0;
 		rq->next_balance = jiffies;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 37f2262..e32a9b7 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1584,7 +1584,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		}
 
 		/* Adjust by relative CPU power of the group */
-		avg_load = (avg_load * SCHED_LOAD_SCALE) / group->cpu_power;
+		avg_load = (avg_load * SCHED_POWER_SCALE) / group->cpu_power;
 
 		if (local_group) {
 			this_load = avg_load;
@@ -1722,7 +1722,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 				nr_running += cpu_rq(i)->cfs.nr_running;
 			}
 
-			capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
+			capacity = DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE);
 
 			if (tmp->flags & SD_POWERSAVINGS_BALANCE)
 				nr_running /= 2;
@@ -2570,7 +2570,7 @@ static inline int check_power_save_busiest_group(struct sd_lb_stats *sds,
 
 unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu)
 {
-	return SCHED_LOAD_SCALE;
+	return SCHED_POWER_SCALE;
 }
 
 unsigned long __weak arch_scale_freq_power(struct sched_domain *sd, int cpu)
@@ -2607,10 +2607,10 @@ unsigned long scale_rt_power(int cpu)
 		available = total - rq->rt_avg;
 	}
 
-	if (unlikely((s64)total < SCHED_LOAD_SCALE))
-		total = SCHED_LOAD_SCALE;
+	if (unlikely((s64)total < SCHED_POWER_SCALE))
+		total = SCHED_POWER_SCALE;
 
-	total >>= SCHED_LOAD_SHIFT;
+	total >>= SCHED_POWER_SHIFT;
 
 	return div_u64(available, total);
 }
@@ -2618,7 +2618,7 @@ unsigned long scale_rt_power(int cpu)
 static void update_cpu_power(struct sched_domain *sd, int cpu)
 {
 	unsigned long weight = sd->span_weight;
-	unsigned long power = SCHED_LOAD_SCALE;
+	unsigned long power = SCHED_POWER_SCALE;
 	struct sched_group *sdg = sd->groups;
 
 	if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
@@ -2627,7 +2627,7 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
 		else
 			power *= default_scale_smt_power(sd, cpu);
 
-		power >>= SCHED_LOAD_SHIFT;
+		power >>= SCHED_POWER_SHIFT;
 	}
 
 	sdg->cpu_power_orig = power;
@@ -2637,10 +2637,10 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
 	else
 		power *= default_scale_freq_power(sd, cpu);
 
-	power >>= SCHED_LOAD_SHIFT;
+	power >>= SCHED_POWER_SHIFT;
 
 	power *= scale_rt_power(cpu);
-	power >>= SCHED_LOAD_SHIFT;
+	power >>= SCHED_POWER_SHIFT;
 
 	if (!power)
 		power = 1;
@@ -2682,7 +2682,7 @@ static inline int
 fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
 {
 	/*
-	 * Only siblings can have significantly less than SCHED_LOAD_SCALE
+	 * Only siblings can have significantly less than SCHED_POWER_SCALE
 	 */
 	if (!(sd->flags & SD_SHARE_CPUPOWER))
 		return 0;
@@ -2770,7 +2770,7 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 	}
 
 	/* Adjust by relative CPU power of the group */
-	sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;
+	sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->cpu_power;
 
 	/*
 	 * Consider the group unbalanced when the imbalance is larger
@@ -2787,7 +2787,8 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 	if ((max_cpu_load - min_cpu_load) >= avg_load_per_task && max_nr_running > 1)
 		sgs->group_imb = 1;
 
-	sgs->group_capacity = DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
+	sgs->group_capacity = DIV_ROUND_CLOSEST(group->cpu_power,
+						SCHED_POWER_SCALE);
 	if (!sgs->group_capacity)
 		sgs->group_capacity = fix_small_capacity(sd, group);
 	sgs->group_weight = group->group_weight;
@@ -2961,7 +2962,7 @@ static int check_asym_packing(struct sched_domain *sd,
 		return 0;
 
 	*imbalance = DIV_ROUND_CLOSEST(sds->max_load * sds->busiest->cpu_power,
-				       SCHED_LOAD_SCALE);
+				       SCHED_POWER_SCALE);
 	return 1;
 }
 
@@ -2990,7 +2991,7 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
 			cpu_avg_load_per_task(this_cpu);
 
 	scaled_busy_load_per_task = sds->busiest_load_per_task
-						 * SCHED_LOAD_SCALE;
+					 * SCHED_POWER_SCALE;
 	scaled_busy_load_per_task /= sds->busiest->cpu_power;
 
 	if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
@@ -3009,10 +3010,10 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
 			min(sds->busiest_load_per_task, sds->max_load);
 	pwr_now += sds->this->cpu_power *
 			min(sds->this_load_per_task, sds->this_load);
-	pwr_now /= SCHED_LOAD_SCALE;
+	pwr_now /= SCHED_POWER_SCALE;
 
 	/* Amount of load we'd subtract */
-	tmp = (sds->busiest_load_per_task * SCHED_LOAD_SCALE) /
+	tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
 		sds->busiest->cpu_power;
 	if (sds->max_load > tmp)
 		pwr_move += sds->busiest->cpu_power *
@@ -3020,15 +3021,15 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
 
 	/* Amount of load we'd add */
 	if (sds->max_load * sds->busiest->cpu_power <
-		sds->busiest_load_per_task * SCHED_LOAD_SCALE)
+		sds->busiest_load_per_task * SCHED_POWER_SCALE)
 		tmp = (sds->max_load * sds->busiest->cpu_power) /
 			sds->this->cpu_power;
 	else
-		tmp = (sds->busiest_load_per_task * SCHED_LOAD_SCALE) /
+		tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
 			sds->this->cpu_power;
 	pwr_move += sds->this->cpu_power *
 			min(sds->this_load_per_task, sds->this_load + tmp);
-	pwr_move /= SCHED_LOAD_SCALE;
+	pwr_move /= SCHED_POWER_SCALE;
 
 	/* Move if we gain throughput */
 	if (pwr_move > pwr_now)
@@ -3070,7 +3071,7 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
 		load_above_capacity = (sds->busiest_nr_running -
 						sds->busiest_group_capacity);
 
-		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_LOAD_SCALE);
+		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE);
 
 		load_above_capacity /= sds->busiest->cpu_power;
 	}
@@ -3090,7 +3091,7 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
 	/* How much load to actually move to equalise the imbalance */
 	*imbalance = min(max_pull * sds->busiest->cpu_power,
 		(sds->avg_load - sds->this_load) * sds->this->cpu_power)
-			/ SCHED_LOAD_SCALE;
+			/ SCHED_POWER_SCALE;
 
 	/*
 	 * if *imbalance is less than the average load per runnable task
@@ -3159,7 +3160,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
 	if (!sds.busiest || sds.busiest_nr_running == 0)
 		goto out_balanced;
 
-	sds.avg_load = (SCHED_LOAD_SCALE * sds.total_load) / sds.total_pwr;
+	sds.avg_load = (SCHED_POWER_SCALE * sds.total_load) / sds.total_pwr;
 
 	/*
 	 * If the busiest group is imbalanced the below checks don't
@@ -3238,7 +3239,8 @@ find_busiest_queue(struct sched_domain *sd, struct sched_group *group,
 
 	for_each_cpu(i, sched_group_cpus(group)) {
 		unsigned long power = power_of(i);
-		unsigned long capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
+		unsigned long capacity = DIV_ROUND_CLOSEST(power,
+							   SCHED_POWER_SCALE);
 		unsigned long wl;
 
 		if (!capacity)
@@ -3263,7 +3265,7 @@ find_busiest_queue(struct sched_domain *sd, struct sched_group *group,
 		 * the load can be moved away from the cpu that is potentially
 		 * running at a lower capacity.
 		 */
-		wl = (wl * SCHED_LOAD_SCALE) / power;
+		wl = (wl * SCHED_POWER_SCALE) / power;
 
 		if (wl > max_load) {
 			max_load = wl;

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [tip:sched/core] sched: Increase SCHED_LOAD_SCALE resolution
  2011-05-18 21:37               ` [PATCH v3 " Nikhil Rao
@ 2011-05-20 13:43                 ` tip-bot for Nikhil Rao
  0 siblings, 0 replies; 15+ messages in thread
From: tip-bot for Nikhil Rao @ 2011-05-20 13:43 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, vatsa, hpa, mingo, torvalds, ncrao, efault, peterz,
	nikunj, akpm, stephan.baerwolf, tglx, mingo

Commit-ID:  c8b281161dfa4bb5d5be63fb036ce19347b88c63
Gitweb:     http://git.kernel.org/tip/c8b281161dfa4bb5d5be63fb036ce19347b88c63
Author:     Nikhil Rao <ncrao@google.com>
AuthorDate: Wed, 18 May 2011 14:37:48 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 20 May 2011 14:16:50 +0200

sched: Increase SCHED_LOAD_SCALE resolution

Introduce SCHED_LOAD_RESOLUTION, which is added to
SCHED_LOAD_SHIFT and increases the resolution of
SCHED_LOAD_SCALE. This patch sets the value of
SCHED_LOAD_RESOLUTION to 10, scaling up the weights of all
sched entities by a factor of 1024. With this extra resolution,
we can handle deeper cgroup hierarchies, and the scheduler can
do better shares distribution and load balancing on larger
systems (especially for low-weight task groups).

This does not change the existing user interface; the scaled
weights are only used internally. We do not modify the
prio_to_weight values or their inverses, but use the original
weights when calculating the inverse that is used to scale the
execution time delta in calc_delta_mine(). This ensures we do
not lose accuracy when accounting time to the sched entities.
Thanks to Nikunj Dadhania for fixing a bug in c_d_m() that
broke fairness.

Below is some analysis of the performance costs/improvements of
this patch.

1. Micro-arch performance costs:

The experiment was to run Ingo's pipe_test_100k 200 times with
the task pinned to one cpu, measuring instructions, cycles and
stalled cycles for each run. See:

   http://thread.gmane.org/gmane.linux.kernel/1129232/focus=1129389

for more info.

-tip (baseline):

 Performance counter stats for '/root/load-scale/pipe-test-100k' (200 runs):

       964,991,769 instructions             #    0.82  insns per cycle
                                            #    0.33  stalled cycles per insn
                                            #    ( +-  0.05% )
     1,171,186,635 cycles                   #    0.000 GHz                      ( +-  0.08% )
       306,373,664 stalled-cycles-backend   #   26.16% backend  cycles idle     ( +-  0.28% )
       314,933,621 stalled-cycles-frontend  #   26.89% frontend cycles idle     ( +-  0.34% )

        1.122405684  seconds time elapsed  ( +-  0.05% )

-tip+patches:

 Performance counter stats for './load-scale/pipe-test-100k' (200 runs):

       963,624,821 instructions             #    0.82  insns per cycle
                                            #    0.33  stalled cycles per insn
                                            #    ( +-  0.04% )
     1,175,215,649 cycles                   #    0.000 GHz                      ( +-  0.08% )
       315,321,126 stalled-cycles-backend   #   26.83% backend  cycles idle     ( +-  0.28% )
       316,835,873 stalled-cycles-frontend  #   26.96% frontend cycles idle     ( +-  0.29% )

        1.122238659  seconds time elapsed  ( +-  0.06% )

With this patch, instructions decrease by ~0.10% and cycles
increase by 0.27%. This doesn't look statistically significant.
The number of stalled cycles in the backend increased from
26.16% to 26.83%. This can be attributed to the shifts we do in
c_d_m() and other places. The fraction of stalled cycles in the
frontend remains about the same, at 26.96% compared to 26.89% in -tip.

2. Balancing low-weight task groups

Test setup: run 50 tasks with random sleep/busy times (biased
around 100ms) in a low weight container (with cpu.shares = 2).
Measure %idle as reported by mpstat over a 10s window.

-tip (baseline):

06:47:48 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle    intr/s
06:47:49 PM  all   94.32    0.00    0.06    0.00    0.00    0.00    0.00    0.00    5.62  15888.00
06:47:50 PM  all   94.57    0.00    0.62    0.00    0.00    0.00    0.00    0.00    4.81  16180.00
06:47:51 PM  all   94.69    0.00    0.06    0.00    0.00    0.00    0.00    0.00    5.25  15966.00
06:47:52 PM  all   95.81    0.00    0.00    0.00    0.00    0.00    0.00    0.00    4.19  16053.00
06:47:53 PM  all   94.88    0.06    0.00    0.00    0.00    0.00    0.00    0.00    5.06  15984.00
06:47:54 PM  all   93.31    0.00    0.00    0.00    0.00    0.00    0.00    0.00    6.69  15806.00
06:47:55 PM  all   94.19    0.00    0.06    0.00    0.00    0.00    0.00    0.00    5.75  15896.00
06:47:56 PM  all   92.87    0.00    0.00    0.00    0.00    0.00    0.00    0.00    7.13  15716.00
06:47:57 PM  all   94.88    0.00    0.00    0.00    0.00    0.00    0.00    0.00    5.12  15982.00
06:47:58 PM  all   95.44    0.00    0.00    0.00    0.00    0.00    0.00    0.00    4.56  16075.00
Average:     all   94.49    0.01    0.08    0.00    0.00    0.00    0.00    0.00    5.42  15954.60

-tip+patches:

06:47:03 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle    intr/s
06:47:04 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16630.00
06:47:05 PM  all   99.69    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.31  16580.20
06:47:06 PM  all   99.69    0.00    0.06    0.00    0.00    0.00    0.00    0.00    0.25  16596.00
06:47:07 PM  all   99.20    0.00    0.74    0.00    0.00    0.06    0.00    0.00    0.00  17838.61
06:47:08 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16540.00
06:47:09 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16575.00
06:47:10 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16614.00
06:47:11 PM  all   99.94    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.06  16588.00
06:47:12 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00    0.00  16593.00
06:47:13 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00    0.00  16551.00
Average:     all   99.84    0.00    0.09    0.00    0.00    0.01    0.00    0.00    0.06  16711.58

We see an improvement in system utilization: idle% drops from
5.42% on -tip to 0.06% with the patches.

Signed-off-by: Nikhil Rao <ncrao@google.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Stephan Barwolf <stephan.baerwolf@tu-ilmenau.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1305754668-18792-1-git-send-email-ncrao@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h |   23 +++++++++++++++++++++--
 kernel/sched.c        |   28 ++++++++++++++++++++--------
 2 files changed, 41 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f2f4402..c34a718 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -788,9 +788,28 @@ enum cpu_idle_type {
 };
 
 /*
- * Increase resolution of nice-level calculations:
+ * Increase resolution of nice-level calculations for 64-bit architectures.
+ * The extra resolution improves shares distribution and load balancing of
+ * low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup
+ * hierarchies, especially on larger systems. This is not a user-visible change
+ * and does not change the user-interface for setting shares/weights.
+ *
+ * We increase resolution only if we have enough bits to allow this increased
+ * resolution (i.e. BITS_PER_LONG > 32). The costs for increasing resolution
+ * when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
+ * increased costs.
  */
-#define SCHED_LOAD_SHIFT	10
+#if BITS_PER_LONG > 32
+# define SCHED_LOAD_RESOLUTION	10
+# define scale_load(w)		((w) << SCHED_LOAD_RESOLUTION)
+# define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)
+#else
+# define SCHED_LOAD_RESOLUTION	0
+# define scale_load(w)		(w)
+# define scale_load_down(w)	(w)
+#endif
+
+#define SCHED_LOAD_SHIFT	(10 + SCHED_LOAD_RESOLUTION)
 #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
 
 /*
diff --git a/kernel/sched.c b/kernel/sched.c
index 375e9c6..bb504e1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -293,7 +293,7 @@ static DEFINE_SPINLOCK(task_group_lock);
  *  limitation from this.)
  */
 #define MIN_SHARES	2
-#define MAX_SHARES	(1UL << 18)
+#define MAX_SHARES	(1UL << (18 + SCHED_LOAD_RESOLUTION))
 
 static int root_task_group_load = ROOT_TASK_GROUP_LOAD;
 #endif
@@ -1330,13 +1330,25 @@ calc_delta_mine(unsigned long delta_exec, unsigned long weight,
 {
 	u64 tmp;
 
-	tmp = (u64)delta_exec * weight;
+	/*
+	 * weight can be less than 2^SCHED_LOAD_RESOLUTION for task group sched
+	 * entities since MIN_SHARES = 2. Treat weight as 1 if less than
+	 * 2^SCHED_LOAD_RESOLUTION.
+	 */
+	if (likely(weight > (1UL << SCHED_LOAD_RESOLUTION)))
+		tmp = (u64)delta_exec * scale_load_down(weight);
+	else
+		tmp = (u64)delta_exec;
 
 	if (!lw->inv_weight) {
-		if (BITS_PER_LONG > 32 && unlikely(lw->weight >= WMULT_CONST))
+		unsigned long w = scale_load_down(lw->weight);
+
+		if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
 			lw->inv_weight = 1;
+		else if (unlikely(!w))
+			lw->inv_weight = WMULT_CONST;
 		else
-			lw->inv_weight = WMULT_CONST / lw->weight;
+			lw->inv_weight = WMULT_CONST / w;
 	}
 
 	/*
@@ -1785,12 +1797,12 @@ static void set_load_weight(struct task_struct *p)
 	 * SCHED_IDLE tasks get minimal weight:
 	 */
 	if (p->policy == SCHED_IDLE) {
-		load->weight = WEIGHT_IDLEPRIO;
+		load->weight = scale_load(WEIGHT_IDLEPRIO);
 		load->inv_weight = WMULT_IDLEPRIO;
 		return;
 	}
 
-	load->weight = prio_to_weight[prio];
+	load->weight = scale_load(prio_to_weight[prio]);
 	load->inv_weight = prio_to_wmult[prio];
 }
 
@@ -8809,14 +8821,14 @@ cpu_cgroup_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
 static int cpu_shares_write_u64(struct cgroup *cgrp, struct cftype *cftype,
 				u64 shareval)
 {
-	return sched_group_set_shares(cgroup_tg(cgrp), shareval);
+	return sched_group_set_shares(cgroup_tg(cgrp), scale_load(shareval));
 }
 
 static u64 cpu_shares_read_u64(struct cgroup *cgrp, struct cftype *cft)
 {
 	struct task_group *tg = cgroup_tg(cgrp);
 
-	return (u64) tg->shares;
+	return (u64) scale_load_down(tg->shares);
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 

^ permalink raw reply related	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2011-05-20 13:44 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-18 17:09 [PATCH v2 0/3] Increase resolution of load weights Nikhil Rao
2011-05-18 17:09 ` [PATCH v2 1/3] sched: cleanup set_load_weight() Nikhil Rao
2011-05-20 13:43   ` [tip:sched/core] sched: Cleanup set_load_weight() tip-bot for Nikhil Rao
2011-05-18 17:09 ` [PATCH v2 2/3] sched: introduce SCHED_POWER_SCALE to scale cpu_power calculations Nikhil Rao
2011-05-20 13:43   ` [tip:sched/core] sched: Introduce " tip-bot for Nikhil Rao
2011-05-18 17:09 ` [PATCH v2 3/3] sched: increase SCHED_LOAD_SCALE resolution Nikhil Rao
2011-05-18 18:25   ` Ingo Molnar
2011-05-18 19:39     ` Nikhil Rao
2011-05-18 19:41       ` Nikhil Rao
2011-05-18 20:26         ` Ingo Molnar
2011-05-18 21:18           ` Nikhil Rao
2011-05-18 21:22             ` Ingo Molnar
2011-05-18 21:37               ` [PATCH v3 " Nikhil Rao
2011-05-20 13:43                 ` [tip:sched/core] sched: Increase " tip-bot for Nikhil Rao
2011-05-18 18:26   ` [PATCH v2 3/3] sched: increase " Ingo Molnar
