* [patch 00/15] CFS Bandwidth Control V6
@ 2011-05-03  9:28 Paul Turner
  2011-05-03  9:28 ` [patch 01/15] sched: (fixlet) dont update shares twice on on_rq parent Paul Turner
                   ` (16 more replies)
  0 siblings, 17 replies; 129+ messages in thread
From: Paul Turner @ 2011-05-03  9:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

[ Apologies if you're receiving this twice; the previous mailing did not seem
to make it to the list for some reason. ]

Hi all,

Please find attached the latest iteration of bandwidth control (v6).

Where the previous release cleaned up many of the semantics surrounding the
update_curr() path and throttling, this release focuses on cleaning up the
patchset itself.  Elements such as the notion of expiring bandwidth from
previous quota periods, as well as some of the core accounting changes, have
been pushed up (and re-written for clarity) within the patchset, reducing the
patch-to-patch churn significantly.

While this restructuring was fairly extensive in terms of the code touched,
there are no major behavioral changes beyond bug fixes.

Thanks to Hidetoshi Seto for identifying the throttle list corruption.

Notable changes:
- Runtime is now actively expired, taking advantage of the bounds placed on
  sched_clock synchronization.
- distribute_cfs_runtime() no longer races with throttles around the period
  boundary.
- Major code cleanup

Bug fixes:
- several interactions with active load-balance have been corrected.  These
  previously manifested as throttle_list corruption and crashes.

Interface:
----------
Three new cgroupfs files are exported by the cpu subsystem:
  cpu.cfs_period_us : period over which bandwidth is to be regulated
  cpu.cfs_quota_us  : bandwidth available for consumption per period
  cpu.stat          : statistics (such as number of throttled periods and
                      total throttled time)
One important interface change that this introduces (versus the rate-limits
proposal) is that the defined bandwidth becomes an absolute quantity.
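
For illustration only (this snippet is not part of the patchset): the sketch
below caps a group at half a cpu by pairing a 250ms quota with a 500ms period.
The cgroup mount point and group name are assumptions made up for the example;
substitute wherever the cpu subsystem is mounted.

#include <stdio.h>

/* hypothetical path; adjust to wherever the cpu cgroup subsystem is mounted */
#define GROUP "/dev/cgroup/cpu/limited"

static int write_val(const char *file, long val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "%s/%s", GROUP, file);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%ld\n", val);
	return fclose(f);
}

int main(void)
{
	/* 250ms of quota per 500ms period => at most half a cpu of runtime */
	write_val("cpu.cfs_period_us", 500000);
	write_val("cpu.cfs_quota_us", 250000);
	return 0;
}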

Previous postings:
-----------------
v5:
https://lkml.org/lkml/2011/3/22/477
v4:
https://lkml.org/lkml/2011/2/23/44
v3:
https://lkml.org/lkml/2010/10/12/44
v2:
http://lkml.org/lkml/2010/4/28/88
Original posting:
http://lkml.org/lkml/2010/2/12/393

Prior approaches:
http://lkml.org/lkml/2010/1/5/44 ["CFS Hard limits v5"]

Thanks,

- Paul



^ permalink raw reply	[flat|nested] 129+ messages in thread

* [patch 01/15] sched: (fixlet) dont update shares twice on on_rq parent
  2011-05-03  9:28 [patch 00/15] CFS Bandwidth Control V6 Paul Turner
@ 2011-05-03  9:28 ` Paul Turner
  2011-05-10  7:14   ` Hidetoshi Seto
  2011-05-03  9:28 ` [patch 02/15] sched: hierarchical task accounting for SCHED_OTHER Paul Turner
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 129+ messages in thread
From: Paul Turner @ 2011-05-03  9:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-fix_dequeue_task_buglet.patch --]
[-- Type: text/plain, Size: 947 bytes --]

In dequeue_task_fair() we bail out of the dequeue when we encounter a parent
entity with additional weight.  However, we then perform a double shares update
on this entity, since we continue the shares update traversal from that point
despite dequeue_entity() having already updated its queuing cfs_rq.

Avoid this by starting from the parent when we resume.

Signed-off-by: Paul Turner <pjt@google.com>
---
 kernel/sched_fair.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1355,8 +1355,10 @@ static void dequeue_task_fair(struct rq 
 		dequeue_entity(cfs_rq, se, flags);
 
 		/* Don't dequeue parent if it has other entities besides us */
-		if (cfs_rq->load.weight)
+		if (cfs_rq->load.weight) {
+			se = parent_entity(se);
 			break;
+		}
 		flags |= DEQUEUE_SLEEP;
 	}
 



^ permalink raw reply	[flat|nested] 129+ messages in thread

* [patch 02/15] sched: hierarchical task accounting for SCHED_OTHER
  2011-05-03  9:28 [patch 00/15] CFS Bandwidth Control V6 Paul Turner
  2011-05-03  9:28 ` [patch 01/15] sched: (fixlet) dont update shares twice on on_rq parent Paul Turner
@ 2011-05-03  9:28 ` Paul Turner
  2011-05-10  7:17   ` Hidetoshi Seto
  2011-05-03  9:28 ` [patch 03/15] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 129+ messages in thread
From: Paul Turner @ 2011-05-03  9:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-account_nr_running.patch --]
[-- Type: text/plain, Size: 4928 bytes --]

Introduce hierarchical task accounting for the group scheduling case in CFS, as
well as promoting the responsibility for maintaining rq->nr_running to the
scheduling classes.

The primary motivation for this is that with scheduling classes supporting
bandwidth throttling it is possible for entities participating in throttled
sub-trees to not have root-visible changes in rq->nr_running across activate
and de-activate operations.  This in turn leads to incorrect idle and
weight-per-task load balance decisions.

This also allows us to make a small fixlet to the fastpath in pick_next_task()
under group scheduling.

Note: this issue also exists with the existing sched_rt throttling mechanism.
This patch does not address that.

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched.c          |    6 ++----
 kernel/sched_fair.c     |   14 ++++++++++----
 kernel/sched_rt.c       |    5 ++++-
 kernel/sched_stoptask.c |    2 ++
 4 files changed, 18 insertions(+), 9 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -308,7 +308,7 @@ struct task_group root_task_group;
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
 	struct load_weight load;
-	unsigned long nr_running;
+	unsigned long nr_running, h_nr_running;
 
 	u64 exec_clock;
 	u64 min_vruntime;
@@ -1793,7 +1793,6 @@ static void activate_task(struct rq *rq,
 		rq->nr_uninterruptible--;
 
 	enqueue_task(rq, p, flags);
-	inc_nr_running(rq);
 }
 
 /*
@@ -1805,7 +1804,6 @@ static void deactivate_task(struct rq *r
 		rq->nr_uninterruptible++;
 
 	dequeue_task(rq, p, flags);
-	dec_nr_running(rq);
 }
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
@@ -4053,7 +4051,7 @@ pick_next_task(struct rq *rq)
 	 * Optimization: we know that if all tasks are in
 	 * the fair class we can call that function directly:
 	 */
-	if (likely(rq->nr_running == rq->cfs.nr_running)) {
+	if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
 		p = fair_sched_class.pick_next_task(rq);
 		if (likely(p))
 			return p;
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1318,7 +1318,7 @@ static inline void hrtick_update(struct 
 static void
 enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
-	struct cfs_rq *cfs_rq;
+	struct cfs_rq *cfs_rq = NULL;
 	struct sched_entity *se = &p->se;
 
 	for_each_sched_entity(se) {
@@ -1326,16 +1326,19 @@ enqueue_task_fair(struct rq *rq, struct 
 			break;
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
+		cfs_rq->h_nr_running++;
 		flags = ENQUEUE_WAKEUP;
 	}
 
 	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		cfs_rq = cfs_rq_of(se);
+		cfs_rq->h_nr_running++;
 
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
+	inc_nr_running(rq);
 	hrtick_update(rq);
 }
 
@@ -1346,12 +1349,13 @@ enqueue_task_fair(struct rq *rq, struct 
  */
 static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
-	struct cfs_rq *cfs_rq;
+	struct cfs_rq *cfs_rq = NULL;
 	struct sched_entity *se = &p->se;
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);
+		cfs_rq->h_nr_running--;
 
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
@@ -1362,12 +1366,14 @@ static void dequeue_task_fair(struct rq 
 	}
 
 	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		cfs_rq = cfs_rq_of(se);
+		cfs_rq->h_nr_running--;
 
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
+	dec_nr_running(rq);
 	hrtick_update(rq);
 }
 
Index: tip/kernel/sched_rt.c
===================================================================
--- tip.orig/kernel/sched_rt.c
+++ tip/kernel/sched_rt.c
@@ -927,6 +927,8 @@ enqueue_task_rt(struct rq *rq, struct ta
 
 	if (!task_current(rq, p) && p->rt.nr_cpus_allowed > 1)
 		enqueue_pushable_task(rq, p);
+
+	inc_nr_running(rq);
 }
 
 static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
@@ -937,6 +939,8 @@ static void dequeue_task_rt(struct rq *r
 	dequeue_rt_entity(rt_se);
 
 	dequeue_pushable_task(rq, p);
+
+	dec_nr_running(rq);
 }
 
 /*
@@ -1804,4 +1808,3 @@ static void print_rt_stats(struct seq_fi
 	rcu_read_unlock();
 }
 #endif /* CONFIG_SCHED_DEBUG */
-
Index: tip/kernel/sched_stoptask.c
===================================================================
--- tip.orig/kernel/sched_stoptask.c
+++ tip/kernel/sched_stoptask.c
@@ -35,11 +35,13 @@ static struct task_struct *pick_next_tas
 static void
 enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
 {
+	inc_nr_running(rq);
 }
 
 static void
 dequeue_task_stop(struct rq *rq, struct task_struct *p, int flags)
 {
+	dec_nr_running(rq);
 }
 
 static void yield_task_stop(struct rq *rq)



^ permalink raw reply	[flat|nested] 129+ messages in thread

* [patch 03/15] sched: introduce primitives to account for CFS bandwidth tracking
  2011-05-03  9:28 [patch 00/15] CFS Bandwidth Control V6 Paul Turner
  2011-05-03  9:28 ` [patch 01/15] sched: (fixlet) dont update shares twice on on_rq parent Paul Turner
  2011-05-03  9:28 ` [patch 02/15] sched: hierarchical task accounting for SCHED_OTHER Paul Turner
@ 2011-05-03  9:28 ` Paul Turner
  2011-05-10  7:18   ` Hidetoshi Seto
  2011-05-03  9:28 ` [patch 04/15] sched: validate CFS quota hierarchies Paul Turner
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 129+ messages in thread
From: Paul Turner @ 2011-05-03  9:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

[-- Attachment #1: sched-bwc-add_cfs_tg_bandwidth.patch --]
[-- Type: text/plain, Size: 9956 bytes --]

In this patch we introduce the notion of CFS bandwidth, partitioned into
globally unassigned bandwidth and locally claimed bandwidth.

- The global bandwidth is per task_group; it represents a pool of unclaimed
  bandwidth that cfs_rqs can allocate from.
- The local bandwidth is tracked per cfs_rq; it represents allotments from
  the global pool assigned to a specific cpu.

Bandwidth is managed via cgroupfs, adding two new interfaces to the cpu subsystem:
- cpu.cfs_period_us : the bandwidth period in usecs
- cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be allowed
  to consume over the period above.
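
Since every cpu's cfs_rq draws from the same per-task_group pool, a quota
larger than the period expresses an allowance of more than one cpu.  The
arithmetic below is an illustrative note rather than something taken from this
patch:

#include <stdio.h>

int main(void)
{
	unsigned long period_us = 500000;	/* cpu.cfs_period_us */
	unsigned long ncpus = 2;		/* desired concurrent cpus */

	/* quota sized for "ncpus" worth of runtime each period */
	printf("cpu.cfs_quota_us = %lu\n", ncpus * period_us);
	return 0;
}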

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 init/Kconfig        |   12 +++
 kernel/sched.c      |  193 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched_fair.c |   16 ++++
 3 files changed, 217 insertions(+), 4 deletions(-)

Index: tip/init/Kconfig
===================================================================
--- tip.orig/init/Kconfig
+++ tip/init/Kconfig
@@ -715,6 +715,18 @@ config FAIR_GROUP_SCHED
 	depends on CGROUP_SCHED
 	default CGROUP_SCHED
 
+config CFS_BANDWIDTH
+	bool "CPU bandwidth provisioning for FAIR_GROUP_SCHED"
+	depends on EXPERIMENTAL
+	depends on FAIR_GROUP_SCHED
+	default n
+	help
+	  This option allows users to define CPU bandwidth rates (limits) for
+	  tasks running within the fair group scheduler.  Groups with no limit
+	  set are considered to be unconstrained and will run with no
+	  restriction.
+	  See tip/Documentation/scheduler/sched-bwc.txt for more information.
+
 config RT_GROUP_SCHED
 	bool "Group scheduling for SCHED_RR/FIFO"
 	depends on EXPERIMENTAL
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -244,6 +244,14 @@ struct cfs_rq;
 
 static LIST_HEAD(task_groups);
 
+struct cfs_bandwidth {
+#ifdef CONFIG_CFS_BANDWIDTH
+	raw_spinlock_t lock;
+	ktime_t period;
+	u64 quota;
+#endif
+};
+
 /* task group related information */
 struct task_group {
 	struct cgroup_subsys_state css;
@@ -275,6 +283,8 @@ struct task_group {
 #ifdef CONFIG_SCHED_AUTOGROUP
 	struct autogroup *autogroup;
 #endif
+
+	struct cfs_bandwidth cfs_bandwidth;
 };
 
 /* task_group_lock serializes the addition/removal of task groups */
@@ -369,9 +379,45 @@ struct cfs_rq {
 
 	unsigned long load_contribution;
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	int runtime_enabled;
+	s64 runtime_remaining;
+#endif
 #endif
 };
 
+#ifdef CONFIG_CFS_BANDWIDTH
+static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
+{
+	return &tg->cfs_bandwidth;
+}
+
+static inline u64 default_cfs_period(void);
+
+static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+	raw_spin_lock_init(&cfs_b->lock);
+	cfs_b->quota = RUNTIME_INF;
+	cfs_b->period = ns_to_ktime(default_cfs_period());
+}
+
+static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	cfs_rq->runtime_remaining = 0;
+	cfs_rq->runtime_enabled = 0;
+}
+
+static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{}
+#else
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
+void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
+static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+static void start_cfs_bandwidth(struct cfs_rq *cfs_rq) {}
+#endif /* CONFIG_CFS_BANDWIDTH */
+
 /* Real-Time classes' related field in a runqueue: */
 struct rt_rq {
 	struct rt_prio_array active;
@@ -8056,6 +8102,7 @@ static void init_tg_cfs_entry(struct tas
 	tg->cfs_rq[cpu] = cfs_rq;
 	init_cfs_rq(cfs_rq, rq);
 	cfs_rq->tg = tg;
+	init_cfs_rq_runtime(cfs_rq);
 
 	tg->se[cpu] = se;
 	/* se could be NULL for root_task_group */
@@ -8191,6 +8238,7 @@ void __init sched_init(void)
 		 * We achieve this by letting root_task_group's tasks sit
 		 * directly in rq->cfs (i.e root_task_group->se[] = NULL).
 		 */
+		init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
 		init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -8433,6 +8481,8 @@ static void free_fair_sched_group(struct
 {
 	int i;
 
+	destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
+
 	for_each_possible_cpu(i) {
 		if (tg->cfs_rq)
 			kfree(tg->cfs_rq[i]);
@@ -8460,6 +8510,8 @@ int alloc_fair_sched_group(struct task_g
 
 	tg->shares = NICE_0_LOAD;
 
+	init_cfs_bandwidth(tg_cfs_bandwidth(tg));
+
 	for_each_possible_cpu(i) {
 		cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
 				      GFP_KERNEL, cpu_to_node(i));
@@ -8837,7 +8889,7 @@ static int __rt_schedulable(struct task_
 	return walk_tg_tree(tg_schedulable, tg_nop, &data);
 }
 
-static int tg_set_bandwidth(struct task_group *tg,
+static int tg_set_rt_bandwidth(struct task_group *tg,
 		u64 rt_period, u64 rt_runtime)
 {
 	int i, err = 0;
@@ -8876,7 +8928,7 @@ int sched_group_set_rt_runtime(struct ta
 	if (rt_runtime_us < 0)
 		rt_runtime = RUNTIME_INF;
 
-	return tg_set_bandwidth(tg, rt_period, rt_runtime);
+	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
 
 long sched_group_rt_runtime(struct task_group *tg)
@@ -8901,7 +8953,7 @@ int sched_group_set_rt_period(struct tas
 	if (rt_period == 0)
 		return -EINVAL;
 
-	return tg_set_bandwidth(tg, rt_period, rt_runtime);
+	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
 
 long sched_group_rt_period(struct task_group *tg)
@@ -9123,6 +9175,128 @@ static u64 cpu_shares_read_u64(struct cg
 
 	return (u64) tg->shares;
 }
+
+#ifdef CONFIG_CFS_BANDWIDTH
+const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
+const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
+
+static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
+{
+	int i;
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+	static DEFINE_MUTEX(mutex);
+
+	if (tg == &root_task_group)
+		return -EINVAL;
+
+	/*
+	 * Ensure we have some amount of bandwidth every period.  This is
+	 * to prevent reaching a state of large arrears when throttled via
+	 * entity_tick() resulting in prolonged exit starvation.
+	 */
+	if (quota < min_cfs_quota_period || period < min_cfs_quota_period)
+		return -EINVAL;
+
+	/*
+	 * Likewise, bound things on the other side by preventing insane quota
+	 * periods.  This also allows us to normalize in computing quota
+	 * feasibility.
+	 */
+	if (period > max_cfs_quota_period)
+		return -EINVAL;
+
+	mutex_lock(&mutex);
+	raw_spin_lock_irq(&cfs_b->lock);
+	cfs_b->period = ns_to_ktime(period);
+	cfs_b->quota = quota;
+	raw_spin_unlock_irq(&cfs_b->lock);
+
+	for_each_possible_cpu(i) {
+		struct cfs_rq *cfs_rq = tg->cfs_rq[i];
+		struct rq *rq = rq_of(cfs_rq);
+
+		raw_spin_lock_irq(&rq->lock);
+		cfs_rq->runtime_enabled = quota != RUNTIME_INF;
+		cfs_rq->runtime_remaining = 0;
+		raw_spin_unlock_irq(&rq->lock);
+	}
+	mutex_unlock(&mutex);
+
+	return 0;
+}
+
+int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
+{
+	u64 quota, period;
+
+	period = ktime_to_ns(tg_cfs_bandwidth(tg)->period);
+	if (cfs_quota_us < 0)
+		quota = RUNTIME_INF;
+	else
+		quota = (u64)cfs_quota_us * NSEC_PER_USEC;
+
+	return tg_set_cfs_bandwidth(tg, period, quota);
+}
+
+long tg_get_cfs_quota(struct task_group *tg)
+{
+	u64 quota_us;
+
+	if (tg_cfs_bandwidth(tg)->quota == RUNTIME_INF)
+		return -1;
+
+	quota_us = tg_cfs_bandwidth(tg)->quota;
+	do_div(quota_us, NSEC_PER_USEC);
+
+	return quota_us;
+}
+
+int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
+{
+	u64 quota, period;
+
+	period = (u64)cfs_period_us * NSEC_PER_USEC;
+	quota = tg_cfs_bandwidth(tg)->quota;
+
+	if (period <= 0)
+		return -EINVAL;
+
+	return tg_set_cfs_bandwidth(tg, period, quota);
+}
+
+long tg_get_cfs_period(struct task_group *tg)
+{
+	u64 cfs_period_us;
+
+	cfs_period_us = ktime_to_ns(tg_cfs_bandwidth(tg)->period);
+	do_div(cfs_period_us, NSEC_PER_USEC);
+
+	return cfs_period_us;
+}
+
+static s64 cpu_cfs_quota_read_s64(struct cgroup *cgrp, struct cftype *cft)
+{
+	return tg_get_cfs_quota(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_quota_write_s64(struct cgroup *cgrp, struct cftype *cftype,
+				s64 cfs_quota_us)
+{
+	return tg_set_cfs_quota(cgroup_tg(cgrp), cfs_quota_us);
+}
+
+static u64 cpu_cfs_period_read_u64(struct cgroup *cgrp, struct cftype *cft)
+{
+	return tg_get_cfs_period(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_period_write_u64(struct cgroup *cgrp, struct cftype *cftype,
+				u64 cfs_period_us)
+{
+	return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
+}
+
+#endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -9157,6 +9331,18 @@ static struct cftype cpu_files[] = {
 		.write_u64 = cpu_shares_write_u64,
 	},
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		.name = "cfs_quota_us",
+		.read_s64 = cpu_cfs_quota_read_s64,
+		.write_s64 = cpu_cfs_quota_write_s64,
+	},
+	{
+		.name = "cfs_period_us",
+		.read_u64 = cpu_cfs_period_read_u64,
+		.write_u64 = cpu_cfs_period_write_u64,
+	},
+#endif
 #ifdef CONFIG_RT_GROUP_SCHED
 	{
 		.name = "rt_runtime_us",
@@ -9466,4 +9652,3 @@ struct cgroup_subsys cpuacct_subsys = {
 	.subsys_id = cpuacct_subsys_id,
 };
 #endif	/* CONFIG_CGROUP_CPUACCT */
-
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1250,6 +1250,22 @@ entity_tick(struct cfs_rq *cfs_rq, struc
 		check_preempt_tick(cfs_rq, curr);
 }
 
+
+/**************************************************
+ * CFS bandwidth control machinery
+ */
+
+#ifdef CONFIG_CFS_BANDWIDTH
+/*
+ * default period for cfs group bandwidth.
+ * default: 0.5s, units: nanoseconds
+ */
+static inline u64 default_cfs_period(void)
+{
+	return 500000000ULL;
+}
+#endif
+
 /**************************************************
  * CFS operations on tasks:
  */



^ permalink raw reply	[flat|nested] 129+ messages in thread

* [patch 04/15] sched: validate CFS quota hierarchies
  2011-05-03  9:28 [patch 00/15] CFS Bandwidth Control V6 Paul Turner
                   ` (2 preceding siblings ...)
  2011-05-03  9:28 ` [patch 03/15] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
@ 2011-05-03  9:28 ` Paul Turner
  2011-05-10  7:20   ` Hidetoshi Seto
                     ` (2 more replies)
  2011-05-03  9:28 ` [patch 05/15] sched: add a timer to handle CFS bandwidth refresh Paul Turner
                   ` (12 subsequent siblings)
  16 siblings, 3 replies; 129+ messages in thread
From: Paul Turner @ 2011-05-03  9:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-consistent_quota.patch --]
[-- Type: text/plain, Size: 8005 bytes --]

Add constraint validation for CFS bandwidth hierarchies.

Validate that:
   sum(child bandwidth) <= parent_bandwidth

In a quota-limited hierarchy, an unconstrained entity
(e.g. bandwidth==RUNTIME_INF) inherits the bandwidth of its parent.

Since bandwidth periods may be non-uniform we normalize to the maximum allowed
period, 1 second.

This behavior may be disabled (allowing child bandwidth to exceed parent) via
kernel.sched_cfs_bandwidth_consistent=0
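
The userspace sketch below (illustration only, not part of the patch) applies
the same fixed-point normalization as the patch's to_ratio() to show why
normalization matters when periods differ: a child at 200ms/250ms (0.8 cpu)
exceeds a parent at 500ms/1s (0.5 cpu) and would be rejected.

#include <stdio.h>

/* mirrors the kernel's to_ratio(): fixed-point runtime/period (20-bit shift) */
static unsigned long long to_ratio(unsigned long long period_us,
				   unsigned long long quota_us)
{
	return (quota_us << 20) / period_us;
}

int main(void)
{
	unsigned long long parent = to_ratio(1000000, 500000);	/* 0.5 cpu */
	unsigned long long child  = to_ratio(250000, 200000);	/* 0.8 cpu */

	printf("child %s parent bandwidth\n",
	       child > parent ? "exceeds (write rejected with -EINVAL)"
			      : "fits within");
	return 0;
}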

Signed-off-by: Paul Turner <pjt@google.com>

---
 include/linux/sched.h |    8 ++
 kernel/sched.c        |  137 +++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sched_fair.c   |    8 ++
 kernel/sysctl.c       |   11 ++++
 4 files changed, 151 insertions(+), 13 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -249,6 +249,7 @@ struct cfs_bandwidth {
 	raw_spinlock_t lock;
 	ktime_t period;
 	u64 quota;
+	s64 hierarchal_quota;
 #endif
 };
 
@@ -8789,12 +8790,7 @@ unsigned long sched_group_shares(struct 
 }
 #endif
 
-#ifdef CONFIG_RT_GROUP_SCHED
-/*
- * Ensure that the real time constraints are schedulable.
- */
-static DEFINE_MUTEX(rt_constraints_mutex);
-
+#if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_CFS_BANDWIDTH)
 static unsigned long to_ratio(u64 period, u64 runtime)
 {
 	if (runtime == RUNTIME_INF)
@@ -8802,6 +8798,13 @@ static unsigned long to_ratio(u64 period
 
 	return div64_u64(runtime << 20, period);
 }
+#endif
+
+#ifdef CONFIG_RT_GROUP_SCHED
+/*
+ * Ensure that the real time constraints are schedulable.
+ */
+static DEFINE_MUTEX(rt_constraints_mutex);
 
 /* Must be called with tasklist_lock held */
 static inline int tg_has_rt_tasks(struct task_group *tg)
@@ -8822,7 +8825,7 @@ struct rt_schedulable_data {
 	u64 rt_runtime;
 };
 
-static int tg_schedulable(struct task_group *tg, void *data)
+static int tg_rt_schedulable(struct task_group *tg, void *data)
 {
 	struct rt_schedulable_data *d = data;
 	struct task_group *child;
@@ -8886,7 +8889,7 @@ static int __rt_schedulable(struct task_
 		.rt_runtime = runtime,
 	};
 
-	return walk_tg_tree(tg_schedulable, tg_nop, &data);
+	return walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
 }
 
 static int tg_set_rt_bandwidth(struct task_group *tg,
@@ -9177,14 +9180,17 @@ static u64 cpu_shares_read_u64(struct cg
 }
 
 #ifdef CONFIG_CFS_BANDWIDTH
+static DEFINE_MUTEX(cfs_constraints_mutex);
+
 const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
 const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
 
+static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
+
 static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 {
-	int i;
+	int i, ret = 0;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	static DEFINE_MUTEX(mutex);
 
 	if (tg == &root_task_group)
 		return -EINVAL;
@@ -9205,7 +9211,13 @@ static int tg_set_cfs_bandwidth(struct t
 	if (period > max_cfs_quota_period)
 		return -EINVAL;
 
-	mutex_lock(&mutex);
+	mutex_lock(&cfs_constraints_mutex);
+	if (sysctl_sched_cfs_bandwidth_consistent) {
+		ret = __cfs_schedulable(tg, period, quota);
+		if (ret)
+			goto out_unlock;
+	}
+
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
@@ -9220,9 +9232,10 @@ static int tg_set_cfs_bandwidth(struct t
 		cfs_rq->runtime_remaining = 0;
 		raw_spin_unlock_irq(&rq->lock);
 	}
-	mutex_unlock(&mutex);
+out_unlock:
+	mutex_unlock(&cfs_constraints_mutex);
 
-	return 0;
+	return ret;
 }
 
 int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
@@ -9296,6 +9309,104 @@ static int cpu_cfs_period_write_u64(stru
 	return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
 }
 
+
+struct cfs_schedulable_data {
+	struct task_group *tg;
+	u64 period, quota;
+};
+
+/*
+ * normalize group quota/period to be quota/max_period
+ * note: units are usecs
+ */
+static u64 normalize_cfs_quota(struct task_group *tg,
+			       struct cfs_schedulable_data *d)
+{
+	u64 quota, period;
+
+	if (tg == d->tg) {
+		period = d->period;
+		quota = d->quota;
+	} else {
+		period = tg_get_cfs_period(tg);
+		quota = tg_get_cfs_quota(tg);
+	}
+
+	if (quota == RUNTIME_INF)
+		return RUNTIME_INF;
+
+	return to_ratio(period, quota);
+}
+
+static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
+{
+	struct cfs_schedulable_data *d = data;
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+	s64 quota = 0, parent_quota = -1;
+
+	quota = normalize_cfs_quota(tg, d);
+	if (!tg->parent) {
+		quota = RUNTIME_INF;
+	} else {
+		struct cfs_bandwidth *parent_b = tg_cfs_bandwidth(tg->parent);
+
+		parent_quota = parent_b->hierarchal_quota;
+		if (parent_quota != RUNTIME_INF) {
+			parent_quota -= quota;
+			/* invalid hierarchy, child bandwidth exceeds parent */
+			if (parent_quota < 0)
+				return -EINVAL;
+		}
+
+		/* if no inherent limit then inherit parent quota */
+		if (quota == RUNTIME_INF)
+			quota = parent_quota;
+		parent_b->hierarchal_quota = parent_quota;
+	}
+	cfs_b->hierarchal_quota = quota;
+
+	return 0;
+}
+
+static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
+{
+	struct cfs_schedulable_data data = {
+		.tg = tg,
+		.period = period,
+		.quota = quota,
+	};
+
+	if (!sysctl_sched_cfs_bandwidth_consistent)
+		return 0;
+
+	if (quota != RUNTIME_INF) {
+		do_div(data.period, NSEC_PER_USEC);
+		do_div(data.quota, NSEC_PER_USEC);
+	}
+
+	return walk_tg_tree(tg_cfs_schedulable_down, tg_nop, &data);
+}
+
+int sched_cfs_consistent_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos)
+{
+	int ret;
+
+	mutex_lock(&cfs_constraints_mutex);
+	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+
+	if (!ret && write && sysctl_sched_cfs_bandwidth_consistent) {
+		ret = __cfs_schedulable(NULL, 0, 0);
+
+		/* must be consistent to enable */
+		if (ret)
+			sysctl_sched_cfs_bandwidth_consistent = 0;
+	}
+	mutex_unlock(&cfs_constraints_mutex);
+
+	return ret;
+}
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
Index: tip/kernel/sysctl.c
===================================================================
--- tip.orig/kernel/sysctl.c
+++ tip/kernel/sysctl.c
@@ -367,6 +367,17 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= sched_rt_handler,
 	},
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		.procname	= "sched_cfs_bandwidth_consistent",
+		.data		= &sysctl_sched_cfs_bandwidth_consistent,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sched_cfs_consistent_handler,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+#endif
 #ifdef CONFIG_SCHED_AUTOGROUP
 	{
 		.procname	= "sched_autogroup_enabled",
Index: tip/include/linux/sched.h
===================================================================
--- tip.orig/include/linux/sched.h
+++ tip/include/linux/sched.h
@@ -1950,6 +1950,14 @@ int sched_rt_handler(struct ctl_table *t
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos);
 
+#ifdef CONFIG_CFS_BANDWIDTH
+extern unsigned int sysctl_sched_cfs_bandwidth_consistent;
+
+int sched_cfs_consistent_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos);
+#endif
+
 #ifdef CONFIG_SCHED_AUTOGROUP
 extern unsigned int sysctl_sched_autogroup_enabled;
 
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -88,6 +88,14 @@ const_debug unsigned int sysctl_sched_mi
  */
 unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
 
+#ifdef CONFIG_CFS_BANDWIDTH
+/*
+ * Whether a CFS bandwidth hierarchy is required to be consistent, that is:
+ *   sum(child_bandwidth) <= parent_bandwidth
+ */
+unsigned int sysctl_sched_cfs_bandwidth_consistent = 1;
+#endif
+
 static const struct sched_class fair_sched_class;
 
 /**************************************************************



^ permalink raw reply	[flat|nested] 129+ messages in thread

* [patch 05/15] sched: add a timer to handle CFS bandwidth refresh
  2011-05-03  9:28 [patch 00/15] CFS Bandwidth Control V6 Paul Turner
                   ` (3 preceding siblings ...)
  2011-05-03  9:28 ` [patch 04/15] sched: validate CFS quota hierarchies Paul Turner
@ 2011-05-03  9:28 ` Paul Turner
  2011-05-10  7:21   ` Hidetoshi Seto
  2011-05-16 10:18   ` Peter Zijlstra
  2011-05-03  9:28 ` [patch 06/15] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth Paul Turner
                   ` (11 subsequent siblings)
  16 siblings, 2 replies; 129+ messages in thread
From: Paul Turner @ 2011-05-03  9:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-bandwidth_timers.patch --]
[-- Type: text/plain, Size: 4945 bytes --]

This patch adds a per-task_group timer which handles the refresh of the global
CFS bandwidth pool.

Since the RT pool is using a similar timer there's some small refactoring to
share this support.

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched.c      |   87 ++++++++++++++++++++++++++++++++++++++++------------
 kernel/sched_fair.c |    9 +++++
 2 files changed, 77 insertions(+), 19 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -193,10 +193,28 @@ static inline int rt_bandwidth_enabled(v
 	return sysctl_sched_rt_runtime >= 0;
 }
 
-static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
+static void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
 {
-	ktime_t now;
+	unsigned long delta;
+	ktime_t soft, hard, now;
+
+	for (;;) {
+		if (hrtimer_active(period_timer))
+			break;
+
+		now = hrtimer_cb_get_time(period_timer);
+		hrtimer_forward(period_timer, now, period);
 
+		soft = hrtimer_get_softexpires(period_timer);
+		hard = hrtimer_get_expires(period_timer);
+		delta = ktime_to_ns(ktime_sub(hard, soft));
+		__hrtimer_start_range_ns(period_timer, soft, delta,
+					 HRTIMER_MODE_ABS_PINNED, 0);
+	}
+}
+
+static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
+{
 	if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
 		return;
 
@@ -204,22 +222,7 @@ static void start_rt_bandwidth(struct rt
 		return;
 
 	raw_spin_lock(&rt_b->rt_runtime_lock);
-	for (;;) {
-		unsigned long delta;
-		ktime_t soft, hard;
-
-		if (hrtimer_active(&rt_b->rt_period_timer))
-			break;
-
-		now = hrtimer_cb_get_time(&rt_b->rt_period_timer);
-		hrtimer_forward(&rt_b->rt_period_timer, now, rt_b->rt_period);
-
-		soft = hrtimer_get_softexpires(&rt_b->rt_period_timer);
-		hard = hrtimer_get_expires(&rt_b->rt_period_timer);
-		delta = ktime_to_ns(ktime_sub(hard, soft));
-		__hrtimer_start_range_ns(&rt_b->rt_period_timer, soft, delta,
-				HRTIMER_MODE_ABS_PINNED, 0);
-	}
+	start_bandwidth_timer(&rt_b->rt_period_timer, rt_b->rt_period);
 	raw_spin_unlock(&rt_b->rt_runtime_lock);
 }
 
@@ -250,6 +253,9 @@ struct cfs_bandwidth {
 	ktime_t period;
 	u64 quota;
 	s64 hierarchal_quota;
+
+	int idle;
+	struct hrtimer period_timer;
 #endif
 };
 
@@ -394,12 +400,38 @@ static inline struct cfs_bandwidth *tg_c
 
 #ifdef CONFIG_CFS_BANDWIDTH
 static inline u64 default_cfs_period(void);
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
+
+static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
+{
+	struct cfs_bandwidth *cfs_b =
+		container_of(timer, struct cfs_bandwidth, period_timer);
+	ktime_t now;
+	int overrun;
+	int idle = 0;
+
+	for (;;) {
+		now = hrtimer_cb_get_time(timer);
+		overrun = hrtimer_forward(timer, now, cfs_b->period);
+
+		if (!overrun)
+			break;
+
+		idle = do_sched_cfs_period_timer(cfs_b, overrun);
+	}
+
+	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
+}
 
 static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
 	raw_spin_lock_init(&cfs_b->lock);
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
+
+	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	cfs_b->period_timer.function = sched_cfs_period_timer;
+
 }
 
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
@@ -411,8 +443,25 @@ static void init_cfs_rq_runtime(struct c
 		cfs_rq->runtime_enabled = 1;
 }
 
+static void start_cfs_bandwidth(struct cfs_rq *cfs_rq)
+{
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+
+	if (cfs_b->quota == RUNTIME_INF)
+		return;
+
+	if (hrtimer_active(&cfs_b->period_timer))
+		return;
+
+	raw_spin_lock(&cfs_b->lock);
+	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
+	raw_spin_unlock(&cfs_b->lock);
+}
+
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
-{}
+{
+	hrtimer_cancel(&cfs_b->period_timer);
+}
 #else
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1003,6 +1003,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 
 	if (cfs_rq->nr_running == 1)
 		list_add_leaf_cfs_rq(cfs_rq);
+
+	start_cfs_bandwidth(cfs_rq);
 }
 
 static void __clear_buddies_last(struct sched_entity *se)
@@ -1220,6 +1222,8 @@ static void put_prev_entity(struct cfs_r
 		update_stats_wait_start(cfs_rq, prev);
 		/* Put 'current' back into the tree. */
 		__enqueue_entity(cfs_rq, prev);
+
+		start_cfs_bandwidth(cfs_rq);
 	}
 	cfs_rq->curr = NULL;
 }
@@ -1272,6 +1276,11 @@ static inline u64 default_cfs_period(voi
 {
 	return 500000000ULL;
 }
+
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
+{
+	return 1;
+}
 #endif
 
 /**************************************************



^ permalink raw reply	[flat|nested] 129+ messages in thread

* [patch 06/15] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth
  2011-05-03  9:28 [patch 00/15] CFS Bandwidth Control V6 Paul Turner
                   ` (4 preceding siblings ...)
  2011-05-03  9:28 ` [patch 05/15] sched: add a timer to handle CFS bandwidth refresh Paul Turner
@ 2011-05-03  9:28 ` Paul Turner
  2011-05-10  7:22   ` Hidetoshi Seto
                     ` (2 more replies)
  2011-05-03  9:28 ` [patch 07/15] sched: expire invalid runtime Paul Turner
                   ` (10 subsequent siblings)
  16 siblings, 3 replies; 129+ messages in thread
From: Paul Turner @ 2011-05-03  9:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

[-- Attachment #1: sched-bwc-account_cfs_rq_runtime.patch --]
[-- Type: text/plain, Size: 5873 bytes --]

Account bandwidth usage on the cfs_rq level versus the task_groups to which
they belong.  Whether we are tracking bandwidth on a given cfs_rq is maintained
under cfs_rq->runtime_enabled.

cfs_rq's which belong to a bandwidth constrained task_group have their runtime
accounted via the update_curr() path, which withdraws bandwidth from the global
pool as desired.  Updates involving the global pool are currently protected
under cfs_bandwidth->lock, local runtime is protected by rq->lock.

This patch only attempts to assign and track quota; no action is taken in the
case that cfs_rq->runtime_used exceeds cfs_rq->runtime_assigned.
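
As a rough, invented-numbers illustration of how the global pool drains: local
pools refill in chunks of sysctl_sched_cfs_bandwidth_slice (introduced below,
default 5ms), so a 20ms quota can satisfy at most four full refills per period.

#include <stdio.h>

int main(void)
{
	unsigned int quota_us = 20000;	/* cpu.cfs_quota_us */
	unsigned int slice_us = 5000;	/* sched_cfs_bandwidth_slice_us default */

	/* full slice-sized refills the global pool can hand out per period */
	printf("refills per period: %u\n", quota_us / slice_us);
	return 0;
}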

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 include/linux/sched.h |    4 ++
 kernel/sched.c        |    2 +
 kernel/sched_fair.c   |   85 ++++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sysctl.c       |    8 ++++
 4 files changed, 96 insertions(+), 3 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -96,6 +96,15 @@ unsigned int __read_mostly sysctl_sched_
 unsigned int sysctl_sched_cfs_bandwidth_consistent = 1;
 #endif
 
+#ifdef CONFIG_CFS_BANDWIDTH
+/*
+ * amount of quota to allocate from global tg to local cfs_rq pool on each
+ * refresh
+ * default: 5ms, units: microseconds
+ */
+unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
+#endif
+
 static const struct sched_class fair_sched_class;
 
 /**************************************************************
@@ -312,6 +321,8 @@ find_matching_se(struct sched_entity **s
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+				   unsigned long delta_exec);
 
 /**************************************************************
  * Scheduling class tree data structure manipulation methods:
@@ -605,6 +616,8 @@ static void update_curr(struct cfs_rq *c
 		cpuacct_charge(curtask, delta_exec);
 		account_group_exec_runtime(curtask, delta_exec);
 	}
+
+	account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
 
 static inline void
@@ -1277,10 +1290,68 @@ static inline u64 default_cfs_period(voi
 	return 500000000ULL;
 }
 
+static inline u64 sched_cfs_bandwidth_slice(void)
+{
+	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
+}
+
+static void assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	struct task_group *tg = cfs_rq->tg;
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+	u64 amount = 0, min_amount;
+
+	/* note: this is a positive sum, runtime_remaining <= 0 */
+	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
+
+	raw_spin_lock(&cfs_b->lock);
+	if (cfs_b->quota == RUNTIME_INF)
+		amount = min_amount;
+	else if (cfs_b->runtime > 0) {
+		amount = min(cfs_b->runtime, min_amount);
+		cfs_b->runtime -= amount;
+	}
+	cfs_b->idle = 0;
+	raw_spin_unlock(&cfs_b->lock);
+
+	cfs_rq->runtime_remaining += amount;
+}
+
+static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+		unsigned long delta_exec)
+{
+	if (!cfs_rq->runtime_enabled)
+		return;
+
+	cfs_rq->runtime_remaining -= delta_exec;
+	if (cfs_rq->runtime_remaining > 0)
+		return;
+
+	assign_cfs_rq_runtime(cfs_rq);
+}
+
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 {
-	return 1;
+	u64 quota, runtime = 0;
+	int idle = 0;
+
+	raw_spin_lock(&cfs_b->lock);
+	quota = cfs_b->quota;
+
+	if (quota != RUNTIME_INF) {
+		runtime = quota;
+		cfs_b->runtime = runtime;
+
+		idle = cfs_b->idle;
+		cfs_b->idle = 1;
+	}
+	raw_spin_unlock(&cfs_b->lock);
+
+	return idle;
 }
+#else
+static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+		unsigned long delta_exec) {}
 #endif
 
 /**************************************************
@@ -4222,8 +4293,16 @@ static void set_curr_task_fair(struct rq
 {
 	struct sched_entity *se = &rq->curr->se;
 
-	for_each_sched_entity(se)
-		set_next_entity(cfs_rq_of(se), se);
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		set_next_entity(cfs_rq, se);
+		/*
+		 * if bandwidth is enabled, make sure it is up-to-date or
+		 * reschedule for the case of a move into a throttled cpu.
+		 */
+		account_cfs_rq_runtime(cfs_rq, 0);
+	}
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
Index: tip/kernel/sysctl.c
===================================================================
--- tip.orig/kernel/sysctl.c
+++ tip/kernel/sysctl.c
@@ -377,6 +377,14 @@ static struct ctl_table kern_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one,
 	},
+	{
+		.procname	= "sched_cfs_bandwidth_slice_us",
+		.data		= &sysctl_sched_cfs_bandwidth_slice,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &one,
+	},
 #endif
 #ifdef CONFIG_SCHED_AUTOGROUP
 	{
Index: tip/include/linux/sched.h
===================================================================
--- tip.orig/include/linux/sched.h
+++ tip/include/linux/sched.h
@@ -1958,6 +1958,10 @@ int sched_cfs_consistent_handler(struct 
 		loff_t *ppos);
 #endif
 
+#ifdef CONFIG_CFS_BANDWIDTH
+extern unsigned int sysctl_sched_cfs_bandwidth_slice;
+#endif
+
 #ifdef CONFIG_SCHED_AUTOGROUP
 extern unsigned int sysctl_sched_autogroup_enabled;
 
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -252,6 +252,7 @@ struct cfs_bandwidth {
 	raw_spinlock_t lock;
 	ktime_t period;
 	u64 quota;
+	u64 runtime;
 	s64 hierarchal_quota;
 
 	int idle;
@@ -426,6 +427,7 @@ static enum hrtimer_restart sched_cfs_pe
 static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
 	raw_spin_lock_init(&cfs_b->lock);
+	cfs_b->runtime = 0;
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 



^ permalink raw reply	[flat|nested] 129+ messages in thread

* [patch 07/15] sched: expire invalid runtime
  2011-05-03  9:28 [patch 00/15] CFS Bandwidth Control V6 Paul Turner
                   ` (5 preceding siblings ...)
  2011-05-03  9:28 ` [patch 06/15] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth Paul Turner
@ 2011-05-03  9:28 ` Paul Turner
  2011-05-10  7:22   ` Hidetoshi Seto
                     ` (2 more replies)
  2011-05-03  9:28 ` [patch 08/15] sched: throttle cfs_rq entities which exceed their local runtime Paul Turner
                   ` (9 subsequent siblings)
  16 siblings, 3 replies; 129+ messages in thread
From: Paul Turner @ 2011-05-03  9:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-expire_cfs_rq_runtime.patch --]
[-- Type: text/plain, Size: 4773 bytes --]

With the global quota pool, one challenge is determining when the runtime we
have received from it is still valid.  Fortunately we can take advantage of
sched_clock synchronization around the jiffy to do this cheaply.

The one catch is that we don't know whether our local clock is behind or ahead
of the cpu setting the expiration time (relative to its own clock).

Fortunately we can detect which of these is the case by determining whether the
global deadline has advanced.  If it has not, then we assume we are behind, and
advance our local expiration; otherwise, we know the deadline has truly passed
and we expire our local runtime.
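
Restating that decision rule with invented numbers (a userspace illustration,
not kernel code):

#include <stdio.h>

#define TICK_NSEC 1000000ULL	/* assume HZ=1000 for the example */

int main(void)
{
	unsigned long long rq_clock       = 1000050000ULL; /* this cpu's clock */
	unsigned long long local_expires  = 1000000000ULL; /* stamped on our runtime */
	unsigned long long global_expires = 1000000000ULL; /* cfs_b->runtime_expires */

	if (rq_clock < local_expires) {
		printf("runtime still valid\n");
	} else if (local_expires >= global_expires) {
		/* global pool not yet refreshed: our clock is merely ahead */
		printf("extend local deadline by a tick\n");
		local_expires += TICK_NSEC;
	} else {
		/* global deadline advanced: the runtime truly expired */
		printf("discard remaining local runtime\n");
	}
	return 0;
}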

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched.c      |    8 +++++++-
 kernel/sched_fair.c |   42 +++++++++++++++++++++++++++++++++++++++---
 2 files changed, 46 insertions(+), 4 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1299,7 +1299,7 @@ static void assign_cfs_rq_runtime(struct
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	u64 amount = 0, min_amount;
+	u64 amount = 0, min_amount, expires;
 
 	/* note: this is a positive sum, runtime_remaining <= 0 */
 	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -1312,9 +1312,38 @@ static void assign_cfs_rq_runtime(struct
 		cfs_b->runtime -= amount;
 	}
 	cfs_b->idle = 0;
+	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
 
 	cfs_rq->runtime_remaining += amount;
+	cfs_rq->runtime_expires = max(cfs_rq->runtime_expires, expires);
+}
+
+static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	struct rq *rq = rq_of(cfs_rq);
+
+	if (rq->clock < cfs_rq->runtime_expires)
+		return;
+
+	/*
+	 * If the local deadline has passed we have to cover for the
+	 * possibility that our sched_clock is ahead and the global deadline
+	 * has not truly expired.
+	 *
+	 * Fortunately we can check which of these is the case by determining
+	 * whether the global deadline has advanced.
+	 */
+
+	if (cfs_rq->runtime_expires >= cfs_b->runtime_expires) {
+		/* extend local deadline, drift is bounded above by 2 ticks */
+		cfs_rq->runtime_expires += TICK_NSEC;
+	} else {
+		/* global deadline is ahead, deadline must have passed */
+		if (cfs_rq->runtime_remaining > 0)
+			cfs_rq->runtime_remaining = 0;
+	}
 }
 
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
@@ -1324,6 +1353,9 @@ static void account_cfs_rq_runtime(struc
 		return;
 
 	cfs_rq->runtime_remaining -= delta_exec;
+	/* dock delta_exec before expiring quota (as it could span periods) */
+	expire_cfs_rq_runtime(cfs_rq);
+
 	if (cfs_rq->runtime_remaining > 0)
 		return;
 
@@ -1332,16 +1364,20 @@ static void account_cfs_rq_runtime(struc
 
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 {
-	u64 quota, runtime = 0;
+	u64 quota, runtime = 0, runtime_expires;
 	int idle = 0;
 
+	runtime_expires = sched_clock_cpu(smp_processor_id());
+
 	raw_spin_lock(&cfs_b->lock);
 	quota = cfs_b->quota;
 
 	if (quota != RUNTIME_INF) {
 		runtime = quota;
-		cfs_b->runtime = runtime;
+		runtime_expires += ktime_to_ns(cfs_b->period);
 
+		cfs_b->runtime = runtime;
+		cfs_b->runtime_expires = runtime_expires;
 		idle = cfs_b->idle;
 		cfs_b->idle = 1;
 	}
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -253,6 +253,7 @@ struct cfs_bandwidth {
 	ktime_t period;
 	u64 quota;
 	u64 runtime;
+	u64 runtime_expires;
 	s64 hierarchal_quota;
 
 	int idle;
@@ -389,6 +390,7 @@ struct cfs_rq {
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	int runtime_enabled;
+	u64 runtime_expires;
 	s64 runtime_remaining;
 #endif
 #endif
@@ -9242,6 +9244,7 @@ static int tg_set_cfs_bandwidth(struct t
 {
 	int i, ret = 0;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+	u64 runtime_expires;
 
 	if (tg == &root_task_group)
 		return -EINVAL;
@@ -9271,7 +9274,9 @@ static int tg_set_cfs_bandwidth(struct t
 
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
-	cfs_b->quota = quota;
+	cfs_b->quota = cfs_b->runtime = quota;
+	runtime_expires = sched_clock_cpu(smp_processor_id()) + period;
+	cfs_b->runtime_expires = runtime_expires;
 	raw_spin_unlock_irq(&cfs_b->lock);
 
 	for_each_possible_cpu(i) {
@@ -9281,6 +9286,7 @@ static int tg_set_cfs_bandwidth(struct t
 		raw_spin_lock_irq(&rq->lock);
 		cfs_rq->runtime_enabled = quota != RUNTIME_INF;
 		cfs_rq->runtime_remaining = 0;
+		cfs_rq->runtime_expires = runtime_expires;
 		raw_spin_unlock_irq(&rq->lock);
 	}
 out_unlock:



^ permalink raw reply	[flat|nested] 129+ messages in thread

* [patch 08/15] sched: throttle cfs_rq entities which exceed their local runtime
  2011-05-03  9:28 [patch 00/15] CFS Bandwidth Control V6 Paul Turner
                   ` (6 preceding siblings ...)
  2011-05-03  9:28 ` [patch 07/15] sched: expire invalid runtime Paul Turner
@ 2011-05-03  9:28 ` Paul Turner
  2011-05-10  7:23   ` Hidetoshi Seto
                     ` (2 more replies)
  2011-05-03  9:28 ` [patch 09/15] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh Paul Turner
                   ` (8 subsequent siblings)
  16 siblings, 3 replies; 129+ messages in thread
From: Paul Turner @ 2011-05-03  9:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

[-- Attachment #1: sched-bwc-throttle_entities.patch --]
[-- Type: text/plain, Size: 8090 bytes --]

In account_cfs_rq_runtime() (via update_curr()) we track consumption versus a
cfs_rq's locally assigned runtime and whether there is global runtime available
to provide a refill when it runs out.

In the case that there is no runtime remaining it's necessary to throttle so
that execution ceases until the subsequent period.  While it is at this
boundary that we detect (and signal for, via resched_task) that a throttle is
required, the actual operation is deferred until put_prev_entity().

At this point the cfs_rq is marked as throttled and not re-enqueued; this
avoids potential interactions with throttled runqueues in the event that we
are not immediately able to evict the running task.

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 kernel/sched.c      |    7 ++
 kernel/sched_fair.c |  131 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 133 insertions(+), 5 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -985,6 +985,8 @@ place_entity(struct cfs_rq *cfs_rq, stru
 	se->vruntime = vruntime;
 }
 
+static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
+
 static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
@@ -1014,8 +1016,10 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
 
-	if (cfs_rq->nr_running == 1)
+	if (cfs_rq->nr_running == 1) {
 		list_add_leaf_cfs_rq(cfs_rq);
+		check_enqueue_throttle(cfs_rq);
+	}
 
 	start_cfs_bandwidth(cfs_rq);
 }
@@ -1221,6 +1225,8 @@ static struct sched_entity *pick_next_en
 	return se;
 }
 
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
+
 static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 {
 	/*
@@ -1230,6 +1236,9 @@ static void put_prev_entity(struct cfs_r
 	if (prev->on_rq)
 		update_curr(cfs_rq);
 
+	/* throttle cfs_rqs exceeding runtime */
+	check_cfs_rq_runtime(cfs_rq);
+
 	check_spread(cfs_rq, prev);
 	if (prev->on_rq) {
 		update_stats_wait_start(cfs_rq, prev);
@@ -1295,7 +1304,7 @@ static inline u64 sched_cfs_bandwidth_sl
 	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
 }
 
-static void assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
@@ -1317,6 +1326,8 @@ static void assign_cfs_rq_runtime(struct
 
 	cfs_rq->runtime_remaining += amount;
 	cfs_rq->runtime_expires = max(cfs_rq->runtime_expires, expires);
+
+	return cfs_rq->runtime_remaining > 0;
 }
 
 static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
@@ -1359,7 +1370,90 @@ static void account_cfs_rq_runtime(struc
 	if (cfs_rq->runtime_remaining > 0)
 		return;
 
-	assign_cfs_rq_runtime(cfs_rq);
+	/*
+	 * if we're unable to extend our runtime we resched so that the active
+	 * hierarchy can be throttled
+	 */
+	if (!assign_cfs_rq_runtime(cfs_rq))
+		resched_task(rq_of(cfs_rq)->curr);
+}
+
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->throttled;
+}
+
+static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	struct rq *rq = rq_of(cfs_rq);
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	struct sched_entity *se;
+	long task_delta, dequeue = 1;
+
+	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+
+	/* account load preceding throttle */
+	update_cfs_load(cfs_rq, 0);
+
+	task_delta = -cfs_rq->h_nr_running;
+	for_each_sched_entity(se) {
+		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
+		/* throttled entity or throttle-on-deactivate */
+		if (!se->on_rq)
+			break;
+
+		if (dequeue)
+			dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
+		qcfs_rq->h_nr_running += task_delta;
+
+		if (qcfs_rq->load.weight)
+			dequeue = 0;
+	}
+
+	if (!se)
+		rq->nr_running += task_delta;
+
+	cfs_rq->throttled = 1;
+	raw_spin_lock(&cfs_b->lock);
+	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
+	raw_spin_unlock(&cfs_b->lock);
+}
+
+/* conditionally throttle active cfs_rq's from put_prev_entity() */
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	if (!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0)
+		return;
+
+	/*
+	 * it's possible active load balance has forced a throttled cfs_rq to
+	 * run again; we don't want to re-throttle in this case.
+	 */
+	if (cfs_rq_throttled(cfs_rq))
+		return;
+
+	throttle_cfs_rq(cfs_rq);
+}
+
+/*
+ * When a group wakes up we want to make sure that its quota is not already
+ * expired, otherwise it may be allowed to steal additional ticks of runtime
+ * since update_curr() throttling cannot trigger until it's on-rq.
+ */
+static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
+{
+	/* an active group must be handled by the update_curr()->put() path */
+	if (cfs_rq->curr || !cfs_rq->runtime_enabled)
+		return;
+
+	/* ensure the group is not already throttled */
+	if (cfs_rq_throttled(cfs_rq))
+		return;
+
+	/* update runtime allocation */
+	account_cfs_rq_runtime(cfs_rq, 0);
+	if (cfs_rq->runtime_remaining <= 0)
+		throttle_cfs_rq(cfs_rq);
 }
 
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
@@ -1389,6 +1483,14 @@ static int do_sched_cfs_period_timer(str
 #else
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 		unsigned long delta_exec) {}
+
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+	return 0;
+}
+
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
+static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
 #endif
 
 /**************************************************
@@ -1468,6 +1570,12 @@ enqueue_task_fair(struct rq *rq, struct 
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
 		cfs_rq->h_nr_running++;
+
+		/* end evaluation on throttled cfs_rq */
+		if (cfs_rq_throttled(cfs_rq)) {
+			se = NULL;
+			break;
+		}
 		flags = ENQUEUE_WAKEUP;
 	}
 
@@ -1475,11 +1583,15 @@ enqueue_task_fair(struct rq *rq, struct 
 		cfs_rq = cfs_rq_of(se);
 		cfs_rq->h_nr_running++;
 
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
-	inc_nr_running(rq);
+	if (!se)
+		inc_nr_running(rq);
 	hrtick_update(rq);
 }
 
@@ -1498,6 +1610,11 @@ static void dequeue_task_fair(struct rq 
 		dequeue_entity(cfs_rq, se, flags);
 		cfs_rq->h_nr_running--;
 
+		/* end evaluation on throttled cfs_rq */
+		if (cfs_rq_throttled(cfs_rq)) {
+			se = NULL;
+			break;
+		}
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
 			se = parent_entity(se);
@@ -1510,11 +1627,15 @@ static void dequeue_task_fair(struct rq 
 		cfs_rq = cfs_rq_of(se);
 		cfs_rq->h_nr_running--;
 
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
-	dec_nr_running(rq);
+	if (!se)
+		dec_nr_running(rq);
 	hrtick_update(rq);
 }
 
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -258,6 +258,8 @@ struct cfs_bandwidth {
 
 	int idle;
 	struct hrtimer period_timer;
+	struct list_head throttled_cfs_rq;
+
 #endif
 };
 
@@ -392,6 +394,9 @@ struct cfs_rq {
 	int runtime_enabled;
 	u64 runtime_expires;
 	s64 runtime_remaining;
+
+	int throttled;
+	struct list_head throttled_list;
 #endif
 #endif
 };
@@ -433,6 +438,7 @@ static void init_cfs_bandwidth(struct cf
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 
+	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
 
@@ -442,6 +448,7 @@ static void init_cfs_rq_runtime(struct c
 {
 	cfs_rq->runtime_remaining = 0;
 	cfs_rq->runtime_enabled = 0;
+	INIT_LIST_HEAD(&cfs_rq->throttled_list);
 }
 
 static void start_cfs_bandwidth(struct cfs_rq *cfs_rq)



^ permalink raw reply	[flat|nested] 129+ messages in thread

* [patch 09/15] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
  2011-05-03  9:28 [patch 00/15] CFS Bandwidth Control V6 Paul Turner
                   ` (7 preceding siblings ...)
  2011-05-03  9:28 ` [patch 08/15] sched: throttle cfs_rq entities which exceed their local runtime Paul Turner
@ 2011-05-03  9:28 ` Paul Turner
  2011-05-10  7:24   ` Hidetoshi Seto
  2011-05-03  9:28 ` [patch 10/15] sched: allow for positional tg_tree walks Paul Turner
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 129+ messages in thread
From: Paul Turner @ 2011-05-03  9:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

[-- Attachment #1: sched-bwc-unthrottle_entities.patch --]
[-- Type: text/plain, Size: 4346 bytes --]

At the start of a new period there are several actions we must take: refresh
the global bandwidth pool, as well as unthrottle any cfs_rq entities which
previously ran out of bandwidth (as quota permits).

Unthrottled entities have the cfs_rq->throttled flag cleared and are re-enqueued
into the cfs entity hierarchy.
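
To illustrate the distribution step (numbers purely illustrative): a cfs_rq
that overran its local pool by 0.5ms is sitting at a runtime_remaining of
-500000ns; distribute_cfs_runtime() grants it 500001ns, just enough to bring
it back above zero so that it can be unthrottled, and whatever is left in the
global pool is then offered to the next throttled cfs_rq on the list.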

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 kernel/sched.c      |    3 +
 kernel/sched_fair.c |  105 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 107 insertions(+), 1 deletion(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -9294,6 +9294,9 @@ static int tg_set_cfs_bandwidth(struct t
 		cfs_rq->runtime_enabled = quota != RUNTIME_INF;
 		cfs_rq->runtime_remaining = 0;
 		cfs_rq->runtime_expires = runtime_expires;
+
+		if (cfs_rq_throttled(cfs_rq))
+			unthrottle_cfs_rq(cfs_rq);
 		raw_spin_unlock_irq(&rq->lock);
 	}
 out_unlock:
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1456,10 +1456,88 @@ static void check_enqueue_throttle(struc
 		throttle_cfs_rq(cfs_rq);
 }
 
+static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	struct rq *rq = rq_of(cfs_rq);
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	struct sched_entity *se;
+	int enqueue = 1;
+	long task_delta;
+
+	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+
+	cfs_rq->throttled = 0;
+	raw_spin_lock(&cfs_b->lock);
+	list_del_rcu(&cfs_rq->throttled_list);
+	raw_spin_unlock(&cfs_b->lock);
+
+	if (!cfs_rq->load.weight)
+		return;
+
+	task_delta = cfs_rq->h_nr_running;
+	for_each_sched_entity(se) {
+		if (se->on_rq)
+			enqueue = 0;
+
+		cfs_rq = cfs_rq_of(se);
+		if (enqueue)
+			enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
+		cfs_rq->h_nr_running += task_delta;
+
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+	}
+
+	if (!se)
+		rq->nr_running += task_delta;
+
+	/* determine whether we need to wake up potentially idle cpu */
+	if (rq->curr == rq->idle && rq->cfs.nr_running)
+		resched_task(rq->curr);
+}
+
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
+		u64 remaining, u64 expires)
+{
+	struct cfs_rq *cfs_rq;
+	u64 runtime = remaining;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(cfs_rq, &cfs_b->throttled_cfs_rq,
+				throttled_list) {
+		struct rq *rq = rq_of(cfs_rq);
+
+		raw_spin_lock(&rq->lock);
+		if (!cfs_rq_throttled(cfs_rq))
+			goto next;
+
+		runtime = -cfs_rq->runtime_remaining + 1;
+		if (runtime > remaining)
+			runtime = remaining;
+		remaining -= runtime;
+
+		cfs_rq->runtime_remaining += runtime;
+		cfs_rq->runtime_expires = expires;
+
+		/* we check whether we're throttled above */
+		if (cfs_rq->runtime_remaining > 0)
+			unthrottle_cfs_rq(cfs_rq);
+
+next:
+		raw_spin_unlock(&rq->lock);
+
+		if (!remaining)
+			break;
+	}
+	rcu_read_unlock();
+
+	return remaining;
+}
+
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 {
 	u64 quota, runtime = 0, runtime_expires;
-	int idle = 0;
+	int idle = 0, throttled = 0;
 
 	runtime_expires = sched_clock_cpu(smp_processor_id());
 
@@ -1469,6 +1547,7 @@ static int do_sched_cfs_period_timer(str
 	if (quota != RUNTIME_INF) {
 		runtime = quota;
 		runtime_expires += ktime_to_ns(cfs_b->period);
+		throttled = !list_empty(&cfs_b->throttled_cfs_rq);
 
 		cfs_b->runtime = runtime;
 		cfs_b->runtime_expires = runtime_expires;
@@ -1477,6 +1556,30 @@ static int do_sched_cfs_period_timer(str
 	}
 	raw_spin_unlock(&cfs_b->lock);
 
+	if (!throttled || quota == RUNTIME_INF)
+		goto out;
+	idle = 0;
+
+retry:
+	runtime = distribute_cfs_runtime(cfs_b, runtime, runtime_expires);
+
+	raw_spin_lock(&cfs_b->lock);
+	/* new new bandwidth may have been set */
+	if (unlikely(runtime_expires != cfs_b->runtime_expires))
+		goto out_unlock;
+	/*
+	 * make sure no-one was throttled while we were handing out the new
+	 * runtime.
+	 */
+	if (runtime > 0 && !list_empty(&cfs_b->throttled_cfs_rq)) {
+		raw_spin_unlock(&cfs_b->lock);
+		goto retry;
+	}
+	cfs_b->runtime = runtime;
+	cfs_b->idle = idle;
+out_unlock:
+	raw_spin_unlock(&cfs_b->lock);
+out:
 	return idle;
 }
 #else



^ permalink raw reply	[flat|nested] 129+ messages in thread

* [patch 10/15] sched: allow for positional tg_tree walks
  2011-05-03  9:28 [patch 00/15] CFS Bandwidth Control V6 Paul Turner
                   ` (8 preceding siblings ...)
  2011-05-03  9:28 ` [patch 09/15] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh Paul Turner
@ 2011-05-03  9:28 ` Paul Turner
  2011-05-10  7:24   ` Hidetoshi Seto
  2011-05-17 13:31   ` Peter Zijlstra
  2011-05-03  9:28 ` [patch 11/15] sched: prevent interactions between throttled entities and load-balance Paul Turner
                   ` (6 subsequent siblings)
  16 siblings, 2 replies; 129+ messages in thread
From: Paul Turner @ 2011-05-03  9:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-refactor-walk_tg_tree.patch --]
[-- Type: text/plain, Size: 2015 bytes --]

Extend walk_tg_tree to accept a positional argument

static int walk_tg_tree_from(struct task_group *from,
			     tg_visitor down, tg_visitor up, void *data)

Existing semantics are preserved, caller must hold rcu_lock() or sufficient
analogue.
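
A caller restricted to the sub-tree rooted at some tg could then look roughly
like the sketch below; my_visit_down is purely illustrative, tg_nop is the
existing no-op visitor:

static int my_visit_down(struct task_group *tg, void *data)
{
	/* per-group work on the way down; a non-zero return aborts the walk */
	return 0;
}

	rcu_read_lock();
	walk_tg_tree_from(tg, my_visit_down, tg_nop, NULL);
	rcu_read_unlock();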

Signed-off-by: Paul Turner <pjt@google.com>
---
 kernel/sched.c |   34 +++++++++++++++++++++++-----------
 1 file changed, 23 insertions(+), 11 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -1430,21 +1430,19 @@ static inline void dec_cpu_load(struct r
 #if (defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)) || defined(CONFIG_RT_GROUP_SCHED)
 typedef int (*tg_visitor)(struct task_group *, void *);
 
-/*
- * Iterate the full tree, calling @down when first entering a node and @up when
- * leaving it for the final time.
- */
-static int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)
+/* Iterate task_group tree rooted at *from */
+static int walk_tg_tree_from(struct task_group *from,
+			     tg_visitor down, tg_visitor up, void *data)
 {
 	struct task_group *parent, *child;
 	int ret;
 
-	rcu_read_lock();
-	parent = &root_task_group;
+	parent = from;
+
 down:
 	ret = (*down)(parent, data);
 	if (ret)
-		goto out_unlock;
+		goto out;
 	list_for_each_entry_rcu(child, &parent->children, siblings) {
 		parent = child;
 		goto down;
@@ -1453,14 +1451,28 @@ up:
 		continue;
 	}
 	ret = (*up)(parent, data);
-	if (ret)
-		goto out_unlock;
+	if (ret || parent == from)
+		goto out;
 
 	child = parent;
 	parent = parent->parent;
 	if (parent)
 		goto up;
-out_unlock:
+out:
+	return ret;
+}
+
+/*
+ * Iterate the full tree, calling @down when first entering a node and @up when
+ * leaving it for the final time.
+ */
+
+static inline int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)
+{
+	int ret;
+
+	rcu_read_lock();
+	ret = walk_tg_tree_from(&root_task_group, down, up, data);
 	rcu_read_unlock();
 
 	return ret;



^ permalink raw reply	[flat|nested] 129+ messages in thread

* [patch 11/15] sched: prevent interactions between throttled entities and load-balance
  2011-05-03  9:28 [patch 00/15] CFS Bandwidth Control V6 Paul Turner
                   ` (9 preceding siblings ...)
  2011-05-03  9:28 ` [patch 10/15] sched: allow for positional tg_tree walks Paul Turner
@ 2011-05-03  9:28 ` Paul Turner
  2011-05-10  7:26   ` Hidetoshi Seto
  2011-05-03  9:28 ` [patch 12/15] sched: migrate throttled tasks on HOTPLUG Paul Turner
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 129+ messages in thread
From: Paul Turner @ 2011-05-03  9:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-throttled_shares.patch --]
[-- Type: text/plain, Size: 5368 bytes --]

From the perspective of load-balance and shares distribution, throttled
entities should be invisible.

However, both of these operations work on 'active' lists and are not
inherently aware of what group hierarchies may be present.  In some cases this
may be side-stepped (e.g. we could sideload via tg_load_down in load balance) 
while in others (e.g. update_shares()) it is more difficult to compute without
incurring some O(n^2) costs.

Instead, track hierarchal throttled state at time of transition.  This allows
us to easily identify whether an entity belongs to a throttled hierarchy and
avoid incorrect interactions with it.

Also, when an entity leaves a throttled hierarchy we need to advance its
time averaging for shares averaging so that the elapsed throttled time is not
considered as part of the cfs_rq's operation.
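
As an illustration of the intended behaviour: throttling a group 'a' bumps
throttle_count on 'a' and on each of its descendants (say a/b and a/b/c) via
the tg_tree walk, so that load_balance_fair(), move_one_task() and
update_shares_cpu() can cheaply skip any cfs_rq for which
throttled_hierarchy() is non-zero instead of re-walking the hierarchy
themselves.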

Signed-off-by: Paul Turner <pjt@google.com>
---
 kernel/sched.c      |    2 -
 kernel/sched_fair.c |   76 +++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 71 insertions(+), 7 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -739,13 +739,15 @@ static void update_cfs_rq_load_contribut
 	}
 }
 
+static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
+
 static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
 {
 	u64 period = sysctl_sched_shares_window;
 	u64 now, delta;
 	unsigned long load = cfs_rq->load.weight;
 
-	if (cfs_rq->tg == &root_task_group)
+	if (cfs_rq->tg == &root_task_group || throttled_hierarchy(cfs_rq))
 		return;
 
 	now = rq_of(cfs_rq)->clock_task;
@@ -1383,6 +1385,46 @@ static inline int cfs_rq_throttled(struc
 	return cfs_rq->throttled;
 }
 
+static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->throttle_count;
+}
+
+struct tg_unthrottle_down_data {
+	int cpu;
+	u64 now;
+};
+
+static int tg_unthrottle_down(struct task_group *tg, void *data)
+{
+	struct tg_unthrottle_down_data *udd = data;
+	struct cfs_rq *cfs_rq = tg->cfs_rq[udd->cpu];
+	u64 delta;
+
+	cfs_rq->throttle_count--;
+	if (!cfs_rq->throttle_count) {
+		/* leaving throttled state, move up windows */
+		delta = udd->now - cfs_rq->load_stamp;
+		cfs_rq->load_stamp += delta;
+		cfs_rq->load_last += delta;
+	}
+
+	return 0;
+}
+
+static int tg_throttle_down(struct task_group *tg, void *data)
+{
+	long cpu = (long)data;
+	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu];
+
+	/* group is entering throttled state, record last load */
+	if (!cfs_rq->throttle_count)
+		update_cfs_load(cfs_rq, 0);
+	cfs_rq->throttle_count++;
+
+	return 0;
+}
+
 static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
@@ -1393,7 +1435,10 @@ static void throttle_cfs_rq(struct cfs_r
 	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
 
 	/* account load preceding throttle */
-	update_cfs_load(cfs_rq, 0);
+	rcu_read_lock();
+	walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop,
+			  (void *)(long)rq_of(cfs_rq)->cpu);
+	rcu_read_unlock();
 
 	task_delta = -cfs_rq->h_nr_running;
 	for_each_sched_entity(se) {
@@ -1463,6 +1508,7 @@ static void unthrottle_cfs_rq(struct cfs
 	struct sched_entity *se;
 	int enqueue = 1;
 	long task_delta;
+	struct tg_unthrottle_down_data udd;
 
 	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
 
@@ -1471,6 +1517,13 @@ static void unthrottle_cfs_rq(struct cfs
 	list_del_rcu(&cfs_rq->throttled_list);
 	raw_spin_unlock(&cfs_b->lock);
 
+	update_rq_clock(rq);
+	/* don't include throttled window for load statistics */
+	udd.cpu = rq->cpu;
+	udd.now = rq->clock_task;
+	walk_tg_tree_from(cfs_rq->tg, tg_unthrottle_down, tg_nop,
+			  (void *)&udd);
+
 	if (!cfs_rq->load.weight)
 		return;
 
@@ -1591,6 +1644,11 @@ static inline int cfs_rq_throttled(struc
 	return 0;
 }
 
+static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
+{
+	return 0;
+}
+
 static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
 #endif
@@ -2449,6 +2507,9 @@ move_one_task(struct rq *this_rq, int th
 	int pinned = 0;
 
 	for_each_leaf_cfs_rq(busiest, cfs_rq) {
+		if (throttled_hierarchy(cfs_rq))
+			continue;
+
 		list_for_each_entry_safe(p, n, &cfs_rq->tasks, se.group_node) {
 
 			if (!can_migrate_task(p, busiest, this_cpu,
@@ -2548,8 +2609,10 @@ static int update_shares_cpu(struct task
 
 	raw_spin_lock_irqsave(&rq->lock, flags);
 
-	update_rq_clock(rq);
-	update_cfs_load(cfs_rq, 1);
+	if (!throttled_hierarchy(cfs_rq)) {
+		update_rq_clock(rq);
+		update_cfs_load(cfs_rq, 1);
+	}
 
 	/*
 	 * We need to update shares after updating tg->load_weight in
@@ -2593,9 +2656,10 @@ load_balance_fair(struct rq *this_rq, in
 		u64 rem_load, moved_load;
 
 		/*
-		 * empty group
+		 * empty group or part of a throttled hierarchy
 		 */
-		if (!busiest_cfs_rq->task_weight)
+		if (!busiest_cfs_rq->task_weight ||
+		    throttled_hierarchy(busiest_cfs_rq))
 			continue;
 
 		rem_load = (u64)rem_load_move * busiest_weight;
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -395,7 +395,7 @@ struct cfs_rq {
 	u64 runtime_expires;
 	s64 runtime_remaining;
 
-	int throttled;
+	int throttled, throttle_count;
 	struct list_head throttled_list;
 #endif
 #endif



^ permalink raw reply	[flat|nested] 129+ messages in thread

* [patch 12/15] sched: migrate throttled tasks on HOTPLUG
  2011-05-03  9:28 [patch 00/15] CFS Bandwidth Control V6 Paul Turner
                   ` (10 preceding siblings ...)
  2011-05-03  9:28 ` [patch 11/15] sched: prevent interactions between throttled entities and load-balance Paul Turner
@ 2011-05-03  9:28 ` Paul Turner
  2011-05-10  7:27   ` Hidetoshi Seto
  2011-05-03  9:28 ` [patch 13/15] sched: add exports tracking cfs bandwidth control statistics Paul Turner
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 129+ messages in thread
From: Paul Turner @ 2011-05-03  9:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-migrate_dead.patch --]
[-- Type: text/plain, Size: 1735 bytes --]

Throttled tasks are invisible to cpu-offline since they are not eligible for
selection by pick_next_task().  The regular 'escape' path for a thread that is
blocked at offline is via ttwu->select_task_rq, however this will not handle a
throttled group since there are no individual thread wakeups on an unthrottle.

Resolve this by unthrottling offline cpus so that threads can be migrated.

Signed-off-by: Paul Turner <pjt@google.com>
---
 kernel/sched.c |   29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -6145,6 +6145,32 @@ static void calc_global_load_remove(stru
 	rq->calc_load_active = 0;
 }
 
+#ifdef CONFIG_CFS_BANDWIDTH
+static void unthrottle_offline_cfs_rqs(struct rq *rq)
+{
+	struct cfs_rq *cfs_rq;
+
+	for_each_leaf_cfs_rq(rq, cfs_rq) {
+		struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+
+		if (!cfs_rq->runtime_enabled)
+			continue;
+
+		/*
+		 * clock_task is not advancing so we just need to make sure
+		 * there's some valid quota amount
+		 */
+		cfs_rq->runtime_remaining = cfs_b->quota;
+		if (cfs_rq_throttled(cfs_rq))
+			unthrottle_cfs_rq(cfs_rq);
+	}
+}
+#else
+static void unthrottle_offline_cfs_rqs(struct rq *rq)
+{
+}
+#endif
+
 /*
  * Migrate all tasks from the rq, sleeping tasks will be migrated by
  * try_to_wake_up()->select_task_rq().
@@ -6170,6 +6196,9 @@ static void migrate_tasks(unsigned int d
 	 */
 	rq->stop = NULL;
 
+	/* Ensure any throttled groups are reachable by pick_next_task */
+	unthrottle_offline_cfs_rqs(rq);
+
 	for ( ; ; ) {
 		/*
 		 * There's this thread running, bail when that's the only



^ permalink raw reply	[flat|nested] 129+ messages in thread

* [patch 13/15] sched: add exports tracking cfs bandwidth control statistics
  2011-05-03  9:28 [patch 00/15] CFS Bandwidth Control V6 Paul Turner
                   ` (11 preceding siblings ...)
  2011-05-03  9:28 ` [patch 12/15] sched: migrate throttled tasks on HOTPLUG Paul Turner
@ 2011-05-03  9:28 ` Paul Turner
  2011-05-10  7:27   ` Hidetoshi Seto
  2011-05-11  7:56   ` Hidetoshi Seto
  2011-05-03  9:29 ` [patch 14/15] sched: return unused runtime on voluntary sleep Paul Turner
                   ` (3 subsequent siblings)
  16 siblings, 2 replies; 129+ messages in thread
From: Paul Turner @ 2011-05-03  9:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

[-- Attachment #1: sched-bwc-throttle_stats.patch --]
[-- Type: text/plain, Size: 3161 bytes --]

From: Nikhil Rao <ncrao@google.com>

This change introduces statistics exports for the cpu sub-system; these are
added through the use of a stat file similar to that exported by other
subsystems.

The following exports are included:

nr_periods:	number of periods in which execution occurred
nr_throttled:	the number of periods above in which execution was throttled
throttled_time:	cumulative wall-time that any cpus have been throttled for
this group
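
When read, cpu.stat emits one "name value" pair per line (in the style of
other cgroup stat files); the values below are purely illustrative:

  nr_periods 200
  nr_throttled 12
  throttled_time 1520000000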

Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 kernel/sched.c      |   22 ++++++++++++++++++++++
 kernel/sched_fair.c |    9 +++++++++
 2 files changed, 31 insertions(+)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -260,6 +260,10 @@ struct cfs_bandwidth {
 	struct hrtimer period_timer;
 	struct list_head throttled_cfs_rq;
 
+	/* statistics */
+	int nr_periods, nr_throttled;
+	u64 throttled_time;
+
 #endif
 };
 
@@ -395,6 +399,7 @@ struct cfs_rq {
 	u64 runtime_expires;
 	s64 runtime_remaining;
 
+	u64 throttled_timestamp;
 	int throttled, throttle_count;
 	struct list_head throttled_list;
 #endif
@@ -9517,6 +9522,19 @@ int sched_cfs_consistent_handler(struct 
 
 	return ret;
 }
+
+static int cpu_stats_show(struct cgroup *cgrp, struct cftype *cft,
+		struct cgroup_map_cb *cb)
+{
+	struct task_group *tg = cgroup_tg(cgrp);
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+
+	cb->fill(cb, "nr_periods", cfs_b->nr_periods);
+	cb->fill(cb, "nr_throttled", cfs_b->nr_throttled);
+	cb->fill(cb, "throttled_time", cfs_b->throttled_time);
+
+	return 0;
+}
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -9563,6 +9581,10 @@ static struct cftype cpu_files[] = {
 		.read_u64 = cpu_cfs_period_read_u64,
 		.write_u64 = cpu_cfs_period_write_u64,
 	},
+	{
+		.name = "stat",
+		.read_map = cpu_stats_show,
+	},
 #endif
 #ifdef CONFIG_RT_GROUP_SCHED
 	{
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1459,6 +1459,7 @@ static void throttle_cfs_rq(struct cfs_r
 		rq->nr_running += task_delta;
 
 	cfs_rq->throttled = 1;
+	cfs_rq->throttled_timestamp = rq->clock;
 	raw_spin_lock(&cfs_b->lock);
 	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
 	raw_spin_unlock(&cfs_b->lock);
@@ -1514,8 +1515,10 @@ static void unthrottle_cfs_rq(struct cfs
 
 	cfs_rq->throttled = 0;
 	raw_spin_lock(&cfs_b->lock);
+	cfs_b->throttled_time += rq->clock - cfs_rq->throttled_timestamp;
 	list_del_rcu(&cfs_rq->throttled_list);
 	raw_spin_unlock(&cfs_b->lock);
+	cfs_rq->throttled_timestamp = 0;
 
 	update_rq_clock(rq);
 	/* don't include throttled window for load statistics */
@@ -1628,6 +1631,12 @@ retry:
 		raw_spin_unlock(&cfs_b->lock);
 		goto retry;
 	}
+
+	/* update throttled stats */
+	cfs_b->nr_periods += overrun;
+	if (throttled)
+		cfs_b->nr_throttled += overrun;
+
 	cfs_b->runtime = runtime;
 	cfs_b->idle = idle;
 out_unlock:



^ permalink raw reply	[flat|nested] 129+ messages in thread

* [patch 14/15] sched: return unused runtime on voluntary sleep
  2011-05-03  9:28 [patch 00/15] CFS Bandwidth Control V6 Paul Turner
                   ` (12 preceding siblings ...)
  2011-05-03  9:28 ` [patch 13/15] sched: add exports tracking cfs bandwidth control statistics Paul Turner
@ 2011-05-03  9:29 ` Paul Turner
  2011-05-10  7:28   ` Hidetoshi Seto
  2011-05-03  9:29 ` [patch 15/15] sched: add documentation for bandwidth control Paul Turner
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 129+ messages in thread
From: Paul Turner @ 2011-05-03  9:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-simple_return_quota.patch --]
[-- Type: text/plain, Size: 7424 bytes --]

When a local cfs_rq blocks we return the majority of its remaining quota to the
global bandwidth pool for use by other runqueues.

We do this only when the quota is current and there is more than 
min_cfs_rq_quota [1ms by default] of runtime remaining on the rq.

In the case where there are throttled runqueues and we have sufficient
bandwidth to meter out a slice, a second timer is kicked off to handle this
delivery, unthrottling where appropriate.
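
As a concrete (illustrative) example: a cfs_rq that dequeues its last task
while holding 3ms of still-valid runtime keeps min_cfs_rq_quota (1ms) locally
and returns the other 2ms to the global pool; if there are throttled
runqueues and the pool now holds more than a bandwidth slice, the slack timer
is armed (5ms by default) to redistribute it, unless a regular period refresh
is already imminent.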

Using a 'worst case' antagonist which executes on each cpu
for 1ms before moving onto the next on a fairly large machine:

no quota generations:
 197.47 ms       /cgroup/a/cpuacct.usage
 199.46 ms       /cgroup/a/cpuacct.usage
 205.46 ms       /cgroup/a/cpuacct.usage
 198.46 ms       /cgroup/a/cpuacct.usage
 208.39 ms       /cgroup/a/cpuacct.usage
Since we are allowed to use "stale" quota our usage is effectively bounded by
the rate of input into the global pool and performance is relatively stable.

with quota generations [1s increments]:
 119.58 ms       /cgroup/a/cpuacct.usage
 119.65 ms       /cgroup/a/cpuacct.usage
 119.64 ms       /cgroup/a/cpuacct.usage
 119.63 ms       /cgroup/a/cpuacct.usage
 119.60 ms       /cgroup/a/cpuacct.usage
The large deficit here is due to quota generations (/intentionally/) preventing
us from using previously stranded slack quota.  The cost is that this quota
becomes unavailable.

with quota generations and quota return:
 200.09 ms       /cgroup/a/cpuacct.usage
 200.09 ms       /cgroup/a/cpuacct.usage
 198.09 ms       /cgroup/a/cpuacct.usage
 200.09 ms       /cgroup/a/cpuacct.usage
 200.06 ms       /cgroup/a/cpuacct.usage
By returning unused quota we're able to both stably consume our desired quota
and prevent unintentional overages due to the abuse of slack quota from 
previous quota periods (especially on a large machine).

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched.c      |   15 +++++++
 kernel/sched_fair.c |  102 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 113 insertions(+), 4 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -257,7 +257,7 @@ struct cfs_bandwidth {
 	s64 hierarchal_quota;
 
 	int idle;
-	struct hrtimer period_timer;
+	struct hrtimer period_timer, slack_timer;
 	struct list_head throttled_cfs_rq;
 
 	/* statistics */
@@ -414,6 +414,16 @@ static inline struct cfs_bandwidth *tg_c
 
 static inline u64 default_cfs_period(void);
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
+static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b);
+
+static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
+{
+	struct cfs_bandwidth *cfs_b =
+		container_of(timer, struct cfs_bandwidth, slack_timer);
+	do_sched_cfs_slack_timer(cfs_b);
+
+	return HRTIMER_NORESTART;
+}
 
 static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
@@ -446,6 +456,8 @@ static void init_cfs_bandwidth(struct cf
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
+	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	cfs_b->slack_timer.function = sched_cfs_slack_timer;
 
 }
 
@@ -474,6 +486,7 @@ static void start_cfs_bandwidth(struct c
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
 	hrtimer_cancel(&cfs_b->period_timer);
+	hrtimer_cancel(&cfs_b->slack_timer);
 }
 #else
 #ifdef CONFIG_FAIR_GROUP_SCHED
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1465,20 +1465,25 @@ static void throttle_cfs_rq(struct cfs_r
 	raw_spin_unlock(&cfs_b->lock);
 }
 
+static void return_cfs_rq_quota(struct cfs_rq *cfs_rq);
+
 /* conditionally throttle active cfs_rq's from put_prev_entity() */
 static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
-	if (!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0)
+	if (!cfs_rq->runtime_enabled)
 		return;
 
 	/*
 	 * it's possible active load balance has forced a throttled cfs_rq to
-	 * run again, we don't want to re-throttled in this case.
+	 * run again, we don't want to re-throttle in this case.
 	 */
 	if (cfs_rq_throttled(cfs_rq))
 		return;
 
-	throttle_cfs_rq(cfs_rq);
+	if (cfs_rq->runtime_remaining <= 0)
+		throttle_cfs_rq(cfs_rq);
+	else if (!cfs_rq->load.weight)
+		return_cfs_rq_quota(cfs_rq);
 }
 
 /*
@@ -1644,6 +1649,97 @@ out_unlock:
 out:
 	return idle;
 }
+
+/* a cfs_rq won't donate quota below this amount */
+static const u64 min_cfs_rq_quota = 1 * NSEC_PER_MSEC;
+/* minimum remaining period time to redistribute slack quota */
+static const u64 min_bandwidth_expiration = 2 * NSEC_PER_MSEC;
+/* how long we wait to gather additional slack before distributing */
+static const u64 cfs_bandwidth_slack_period = 5 * NSEC_PER_MSEC;
+
+/* are we near the end of the current quota period? */
+static int runtime_refresh_within(struct cfs_bandwidth *cfs_b, u64 min_expire)
+{
+	struct hrtimer *refresh_timer = &cfs_b->period_timer;
+	u64 remaining;
+
+	/* if the call back is running a quota refresh is occurring */
+	if (hrtimer_callback_running(refresh_timer))
+		return 1;
+
+	/* is a quota refresh about to occur? */
+	remaining = ktime_to_ns(hrtimer_expires_remaining(refresh_timer));
+	if (remaining < min_expire)
+		return 1;
+
+	return 0;
+}
+
+static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+	u64 min_left = cfs_bandwidth_slack_period + min_bandwidth_expiration;
+
+	/* if there's a quota refresh soon don't bother with slack */
+	if (runtime_refresh_within(cfs_b, min_left))
+		return;
+
+	start_bandwidth_timer(&cfs_b->slack_timer,
+				ns_to_ktime(cfs_bandwidth_slack_period));
+}
+
+static void return_cfs_rq_quota(struct cfs_rq *cfs_rq)
+{
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	s64 slack_runtime = cfs_rq->runtime_remaining - min_cfs_rq_quota;
+
+	if (!cfs_rq->runtime_enabled || cfs_rq->load.weight)
+		return;
+
+	if (slack_runtime <= 0)
+		return;
+
+	raw_spin_lock(&cfs_b->lock);
+	if (cfs_b->quota != RUNTIME_INF &&
+	    cfs_b->runtime_expires == cfs_rq->runtime_expires) {
+		cfs_b->runtime += slack_runtime;
+
+		if (cfs_b->runtime > sched_cfs_bandwidth_slice() &&
+		    !list_empty(&cfs_b->throttled_cfs_rq))
+			start_cfs_slack_bandwidth(cfs_b);
+	}
+	raw_spin_unlock(&cfs_b->lock);
+
+	cfs_rq->runtime_remaining -= slack_runtime;
+}
+
+static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
+{
+	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
+	u64 expires;
+
+	/* confirm we're still not at a refresh boundary */
+	if (runtime_refresh_within(cfs_b, min_bandwidth_expiration))
+		return;
+
+	raw_spin_lock(&cfs_b->lock);
+	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice) {
+		runtime = cfs_b->runtime;
+		cfs_b->runtime = 0;
+	}
+	expires = cfs_b->runtime_expires;
+	raw_spin_unlock(&cfs_b->lock);
+
+	if (!runtime)
+		return;
+
+	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+
+	raw_spin_lock(&cfs_b->lock);
+	if (expires == cfs_b->runtime_expires)
+		cfs_b->runtime = runtime;
+	raw_spin_unlock(&cfs_b->lock);
+}
+
 #else
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 		unsigned long delta_exec) {}



^ permalink raw reply	[flat|nested] 129+ messages in thread

* [patch 15/15] sched: add documentation for bandwidth control
  2011-05-03  9:28 [patch 00/15] CFS Bandwidth Control V6 Paul Turner
                   ` (13 preceding siblings ...)
  2011-05-03  9:29 ` [patch 14/15] sched: return unused runtime on voluntary sleep Paul Turner
@ 2011-05-03  9:29 ` Paul Turner
  2011-05-10  7:29   ` Hidetoshi Seto
  2011-06-07 15:45 ` CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned Kamalesh Babulal
  2011-06-14  6:58 ` [patch 00/15] CFS Bandwidth Control V6 Hu Tao
  16 siblings, 1 reply; 129+ messages in thread
From: Paul Turner @ 2011-05-03  9:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-documentation.patch --]
[-- Type: text/plain, Size: 4888 bytes --]

From: Bharata B Rao <bharata@linux.vnet.ibm.com>

Basic description of usage and effect for CFS Bandwidth Control.

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: Paul Turner <pjt@google.com>
---
 Documentation/scheduler/sched-bwc.txt |  104 ++++++++++++++++++++++++++++++++++
 1 file changed, 104 insertions(+)

Index: tip/Documentation/scheduler/sched-bwc.txt
===================================================================
--- /dev/null
+++ tip/Documentation/scheduler/sched-bwc.txt
@@ -0,0 +1,104 @@
+CFS Bandwidth Control (aka CPU hard limits)
+===========================================
+
+[ This document talks about CPU bandwidth control of CFS groups only.
+  The bandwidth control of RT groups is explained in
+  Documentation/scheduler/sched-rt-group.txt ]
+
+CFS bandwidth control is a group scheduler extension that can be used to
+control the maximum CPU bandwidth obtained by a CPU cgroup.
+
+Bandwidth allowed for a group is specified using quota and period. Within
+a given "period" (microseconds), a group is allowed to consume up to "quota"
+microseconds of CPU time, which is the upper limit or the hard limit. When the
+CPU bandwidth consumption of a group exceeds the hard limit, the tasks in the
+group are throttled and are not allowed to run until the end of the period at
+which time the group's quota is replenished.
+
+Runtime available to the group is tracked globally. At the beginning of
+every period, group's global runtime pool is replenished with "quota"
+microseconds worth of runtime. The runtime consumption happens locally at each
+CPU by fetching runtimes in "slices" from the global pool.
+
+Interface
+---------
+Quota and period can be set via cgroup files.
+
+cpu.cfs_quota_us: the maximum allowed bandwidth (microseconds)
+cpu.cfs_period_us: the enforcement interval (microseconds)
+
+Within a period of cpu.cfs_period_us, the group as a whole will not be allowed
+to consume more than cpu.cfs_quota_us worth of runtime.
+
+The default value of cpu.cfs_period_us is 500ms and the default value
+for cpu.cfs_quota_us is -1.
+
+A group with cpu.cfs_quota_us as -1 indicates that the group has infinite
+bandwidth, which means that it is not bandwidth controlled.
+
+Writing any negative value to cpu.cfs_quota_us will turn the group into
+an infinite bandwidth group. Reading cpu.cfs_quota_us for an infinite
+bandwidth group will always return -1.
+
+System wide settings
+--------------------
+The amount of runtime obtained from the global pool every time a CPU wants
+group quota locally is controlled by a sysctl parameter,
+sched_cfs_bandwidth_slice_us. The current default is 5ms. This can be changed
+by writing to /proc/sys/kernel/sched_cfs_bandwidth_slice_us.
+
+A quota hierarchy is defined to be consistent if the sum of child reservations
+does not exceed the bandwidth allocated to its parent.  An entity with no
+explicit bandwidth reservation (e.g. no limit) is considered to inherit its
+parent's limits.  This behavior may be managed using
+/proc/sys/kernel/sched_cfs_bandwidth_consistent
+
+Statistics
+----------
+cpu.stat file lists three different stats related to CPU bandwidth control.
+
+nr_periods: Number of enforcement intervals that have elapsed.
+nr_throttled: Number of times the group has been throttled/limited.
+throttled_time: The total time duration (in nanoseconds) for which the group
+remained throttled.
+
+These files are read-only.
+
+Hierarchy considerations
+------------------------
+Each group's bandwidth (quota and period) can be set independent of its
+parent or child groups. There are two ways in which a group can get
+throttled:
+
+- it consumed its quota within the period
+- it has quota left but the parent's quota is exhausted.
+
+In the 2nd case, even though the child has quota left, it will not be
+able to run since the parent itself is throttled. Similarly groups that are
+not bandwidth constrained might end up being throttled if any parent
+in their hierarchy is throttled.
+
+Examples
+--------
+1. Limit a group to 1 CPU worth of runtime.
+
+	If period is 500ms and quota is also 500ms, the group will get
+	1 CPU worth of runtime every 500ms.
+
+	# echo 500000 > cpu.cfs_quota_us /* quota = 500ms */
+	# echo 500000 > cpu.cfs_period_us /* period = 500ms */
+
+2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine.
+
+	With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
+	runtime every 500ms.
+
+	# echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
+	# echo 500000 > cpu.cfs_period_us /* period = 500ms */
+
+3. Limit a group to 20% of 1 CPU.
+
+	With 500ms period, 100ms quota will be equivalent to 20% of 1 CPU.
+
+	# echo 100000 > cpu.cfs_quota_us /* quota = 100ms */
+	# echo 500000 > cpu.cfs_period_us /* period = 500ms */



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 01/15] sched: (fixlet) dont update shares twice on on_rq parent
  2011-05-03  9:28 ` [patch 01/15] sched: (fixlet) dont update shares twice on on_rq parent Paul Turner
@ 2011-05-10  7:14   ` Hidetoshi Seto
  2011-05-10  8:32     ` Mike Galbraith
  0 siblings, 1 reply; 129+ messages in thread
From: Hidetoshi Seto @ 2011-05-10  7:14 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

(2011/05/03 18:28), Paul Turner wrote:
> In dequeue_task_fair() we bail on dequeue when we encounter a parenting entity
> with additional weight.  However, we perform a double shares update on this
> entity since we continue the shares update traversal from that point, despite
> dequeue_entity() having already updated its queuing cfs_rq.
> 
> Avoid this by starting from the parent when we resume.
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> ---
>  kernel/sched_fair.c |    4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> Index: tip/kernel/sched_fair.c
> ===================================================================
> --- tip.orig/kernel/sched_fair.c
> +++ tip/kernel/sched_fair.c
> @@ -1355,8 +1355,10 @@ static void dequeue_task_fair(struct rq 
>  		dequeue_entity(cfs_rq, se, flags);
>  
>  		/* Don't dequeue parent if it has other entities besides us */
> -		if (cfs_rq->load.weight)
> +		if (cfs_rq->load.weight) {
> +			se = parent_entity(se);
>  			break;
> +		}
>  		flags |= DEQUEUE_SLEEP;
>  	}
>  

Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

This small fixlet can stand alone.
Peter, how about getting this into the git tree first?


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 02/15] sched: hierarchical task accounting for SCHED_OTHER
  2011-05-03  9:28 ` [patch 02/15] sched: hierarchical task accounting for SCHED_OTHER Paul Turner
@ 2011-05-10  7:17   ` Hidetoshi Seto
  0 siblings, 0 replies; 129+ messages in thread
From: Hidetoshi Seto @ 2011-05-10  7:17 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

Some typos in the description.

(2011/05/03 18:28), Paul Turner wrote:
> Introduce hierarchal task accounting for the group scheduling case in CFS, as

            hierarchical 

> well as promoting the responsibility for maintaining rq->nr_running to the
> scheduling classes.
> 
> The primary motivation for this is that with scheduling classes supporting
> bandwidht throttling it is possible for entities participating in trottled

  bandwidth                                                         throttled

> sub-trees to not have root visible changes in rq->nr_running across activate
> and de-activate operations.  This in turn leads to incorrect idle and 
> weight-per-task load balance decisions.
> 
> This also allows us to make a small fixlet to the fastpath in pick_next_task()
> under group scheduling.
> 
> Note: this issue also exists with the existing sched_rt throttling mechanism.
> This patch does not address that.
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> 
> ---

The patch is good.

Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 03/15] sched: introduce primitives to account for CFS bandwidth tracking
  2011-05-03  9:28 ` [patch 03/15] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
@ 2011-05-10  7:18   ` Hidetoshi Seto
  0 siblings, 0 replies; 129+ messages in thread
From: Hidetoshi Seto @ 2011-05-10  7:18 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Nikhil Rao

One nitpicking...

(2011/05/03 18:28), Paul Turner wrote:
> In this patch we introduce the notion of CFS bandwidth, partitioned into 
> globally unassigned bandwidth, and locally claimed bandwidth.
> 
> - The global bandwidth is per task_group, it represents a pool of unclaimed
>   bandwidth that cfs_rqs can allocate from.  
> - The local bandwidth is tracked per-cfs_rq, this represents allotments from
>   the global pool bandwidth assigned to a specific cpu.
> 
> Bandwidth is managed via cgroupfs, adding two new interfaces to the cpu subsystem:
> - cpu.cfs_period_us : the bandwidth period in usecs
> - cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be allowed
>   to consume over period above.
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> Signed-off-by: Nikhil Rao <ncrao@google.com>
> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> ---

(snip)

> @@ -369,9 +379,45 @@ struct cfs_rq {
>  
>  	unsigned long load_contribution;
>  #endif
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	int runtime_enabled;
> +	s64 runtime_remaining;
> +#endif
>  #endif
>  };
>  
> +#ifdef CONFIG_CFS_BANDWIDTH
> +static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> +{
> +	return &tg->cfs_bandwidth;
> +}
> +
> +static inline u64 default_cfs_period(void);
> +
> +static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> +{
> +	raw_spin_lock_init(&cfs_b->lock);
> +	cfs_b->quota = RUNTIME_INF;
> +	cfs_b->period = ns_to_ktime(default_cfs_period());
> +}
> +
> +static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> +{
> +	cfs_rq->runtime_remaining = 0;
> +	cfs_rq->runtime_enabled = 0;
> +}
> +
> +static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> +{}
> +#else
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
> +void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}

Nit: why not static?

> +static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
> +#endif /* CONFIG_FAIR_GROUP_SCHED */
> +static void start_cfs_bandwidth(struct cfs_rq *cfs_rq) {}
> +#endif /* CONFIG_CFS_BANDWIDTH */
> +
>  /* Real-Time classes' related field in a runqueue: */
>  struct rt_rq {
>  	struct rt_prio_array active;

The rest looks good to me.

Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 04/15] sched: validate CFS quota hierarchies
  2011-05-03  9:28 ` [patch 04/15] sched: validate CFS quota hierarchies Paul Turner
@ 2011-05-10  7:20   ` Hidetoshi Seto
  2011-05-11  9:37     ` Paul Turner
  2011-05-16  9:30   ` Peter Zijlstra
  2011-05-16  9:43   ` Peter Zijlstra
  2 siblings, 1 reply; 129+ messages in thread
From: Hidetoshi Seto @ 2011-05-10  7:20 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

Description typos + one bug.

(2011/05/03 18:28), Paul Turner wrote:
> Add constraints validation for CFS bandwidth hierachies.

                                               hierarchies

> 
> Validate that:
>    sum(child bandwidth) <= parent_bandwidth
> 
> In a quota limited hierarchy, an unconstrainted entity

                                   unconstrained

> (e.g. bandwidth==RUNTIME_INF) inherits the bandwidth of its parent.
> 
> Since bandwidth periods may be non-uniform we normalize to the maximum allowed
> period, 1 second.
> 
> This behavior may be disabled (allowing child bandwidth to exceed parent) via
> kernel.sched_cfs_bandwidth_consistent=0
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> 
> ---
(snip)
> +/*
> + * normalize group quota/period to be quota/max_period
> + * note: units are usecs
> + */
> +static u64 normalize_cfs_quota(struct task_group *tg,
> +			       struct cfs_schedulable_data *d)
> +{
> +	u64 quota, period;
> +
> +	if (tg == d->tg) {
> +		period = d->period;
> +		quota = d->quota;
> +	} else {
> +		period = tg_get_cfs_period(tg);
> +		quota = tg_get_cfs_quota(tg);
> +	}
> +
> +	if (quota == RUNTIME_INF)
> +		return RUNTIME_INF;
> +
> +	return to_ratio(period, quota);
> +}

Since tg_get_cfs_quota() doesn't return RUNTIME_INF but -1,
this function needs a fix like the following.

For fixed version, feel free to add:

Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

Thanks,
H.Seto

---
 kernel/sched.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index d2562aa..f171ba5 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9465,16 +9465,17 @@ static u64 normalize_cfs_quota(struct task_group *tg,
 	u64 quota, period;
 
 	if (tg == d->tg) {
+		if (d->quota == RUNTIME_INF)
+			return RUNTIME_INF;
 		period = d->period;
 		quota = d->quota;
 	} else {
+		if (tg_cfs_bandwidth(tg)->quota == RUNTIME_INF)
+			return RUNTIME_INF;
 		period = tg_get_cfs_period(tg);
 		quota = tg_get_cfs_quota(tg);
 	}
 
-	if (quota == RUNTIME_INF)
-		return RUNTIME_INF;
-
 	return to_ratio(period, quota);
 }
 


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: [patch 05/15] sched: add a timer to handle CFS bandwidth refresh
  2011-05-03  9:28 ` [patch 05/15] sched: add a timer to handle CFS bandwidth refresh Paul Turner
@ 2011-05-10  7:21   ` Hidetoshi Seto
  2011-05-11  9:27     ` Paul Turner
  2011-05-16 10:18   ` Peter Zijlstra
  1 sibling, 1 reply; 129+ messages in thread
From: Hidetoshi Seto @ 2011-05-10  7:21 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

(2011/05/03 18:28), Paul Turner wrote:
> @@ -250,6 +253,9 @@ struct cfs_bandwidth {
>  	ktime_t period;
>  	u64 quota;
>  	s64 hierarchal_quota;
> +
> +	int idle;
> +	struct hrtimer period_timer;
>  #endif
>  };
>  

"idle" is not used yet.  How about adding it in later patch?
Plus, comment explaining how it is used would be appreciated.

>  static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>  {
>  	raw_spin_lock_init(&cfs_b->lock);
>  	cfs_b->quota = RUNTIME_INF;
>  	cfs_b->period = ns_to_ktime(default_cfs_period());
> +
> +	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +	cfs_b->period_timer.function = sched_cfs_period_timer;
> +
>  }

Nit: blank line?

Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 06/15] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth
  2011-05-03  9:28 ` [patch 06/15] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth Paul Turner
@ 2011-05-10  7:22   ` Hidetoshi Seto
  2011-05-11  9:25     ` Paul Turner
  2011-05-16 10:27   ` Peter Zijlstra
  2011-05-16 10:32   ` Peter Zijlstra
  2 siblings, 1 reply; 129+ messages in thread
From: Hidetoshi Seto @ 2011-05-10  7:22 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Nikhil Rao

(2011/05/03 18:28), Paul Turner wrote:
> Index: tip/include/linux/sched.h
> ===================================================================
> --- tip.orig/include/linux/sched.h
> +++ tip/include/linux/sched.h
> @@ -1958,6 +1958,10 @@ int sched_cfs_consistent_handler(struct 
>  		loff_t *ppos);
>  #endif
>  
> +#ifdef CONFIG_CFS_BANDWIDTH
> +extern unsigned int sysctl_sched_cfs_bandwidth_slice;
> +#endif
> +
>  #ifdef CONFIG_SCHED_AUTOGROUP
>  extern unsigned int sysctl_sched_autogroup_enabled;
>  

Nit: you can reuse the ifdef just above here.

+#ifdef CONFIG_CFS_BANDWIDTH
+extern unsigned int sysctl_sched_cfs_bandwidth_consistent;
+
+int sched_cfs_consistent_handler(struct ctl_table *table, int write,
+               void __user *buffer, size_t *lenp,
+               loff_t *ppos);
+#endif
+
+#ifdef CONFIG_CFS_BANDWIDTH
+extern unsigned int sysctl_sched_cfs_bandwidth_slice;
+#endif
+

Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 07/15] sched: expire invalid runtime
  2011-05-03  9:28 ` [patch 07/15] sched: expire invalid runtime Paul Turner
@ 2011-05-10  7:22   ` Hidetoshi Seto
  2011-05-16 11:05   ` Peter Zijlstra
  2011-05-16 11:07   ` Peter Zijlstra
  2 siblings, 0 replies; 129+ messages in thread
From: Hidetoshi Seto @ 2011-05-10  7:22 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

(2011/05/03 18:28), Paul Turner wrote:
> With the global quota pool, one challenge is determining when the runtime we
> have received from it is still valid.  Fortunately we can take advantage of
> sched_clock synchronization around the jiffy to do this cheaply.
> 
> The one catch is that we don't know whether our local clock is behind or ahead
> of the cpu setting the expiration time (relative to its own clock).
> 
> Fortunately we can detect which of these is the case by determining whether the
> global deadline has advanced.  If it has not, then we assume we are behind, and
> advance our local expiration; otherwise, we know the deadline has truly passed
> and we expire our local runtime.
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> 
> ---

Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 08/15] sched: throttle cfs_rq entities which exceed their local runtime
  2011-05-03  9:28 ` [patch 08/15] sched: throttle cfs_rq entities which exceed their local runtime Paul Turner
@ 2011-05-10  7:23   ` Hidetoshi Seto
  2011-05-16 15:58   ` Peter Zijlstra
  2011-05-16 16:05   ` Peter Zijlstra
  2 siblings, 0 replies; 129+ messages in thread
From: Hidetoshi Seto @ 2011-05-10  7:23 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Nikhil Rao

(2011/05/03 18:28), Paul Turner wrote:
> Index: tip/kernel/sched.c
> ===================================================================
> --- tip.orig/kernel/sched.c
> +++ tip/kernel/sched.c
> @@ -258,6 +258,8 @@ struct cfs_bandwidth {
>  
>  	int idle;
>  	struct hrtimer period_timer;
> +	struct list_head throttled_cfs_rq;
> +
>  #endif
>  };
>  

Nit: blank line?

Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 09/15] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
  2011-05-03  9:28 ` [patch 09/15] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh Paul Turner
@ 2011-05-10  7:24   ` Hidetoshi Seto
  2011-05-11  9:24     ` Paul Turner
  0 siblings, 1 reply; 129+ messages in thread
From: Hidetoshi Seto @ 2011-05-10  7:24 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Nikhil Rao

Some comments...

(2011/05/03 18:28), Paul Turner wrote:
> At the start of a new period there are several actions we must refresh the
> global bandwidth pool as well as unthrottle any cfs_rq entities who previously
> ran out of bandwidth (as quota permits).
> 
> Unthrottled entities have the cfs_rq->throttled flag cleared and are re-enqueued
> into the cfs entity hierarchy.
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> Signed-off-by: Nikhil Rao <ncrao@google.com>
> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> ---
>  kernel/sched.c      |    3 +
>  kernel/sched_fair.c |  105 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 107 insertions(+), 1 deletion(-)
> 
> Index: tip/kernel/sched.c
> ===================================================================
> --- tip.orig/kernel/sched.c
> +++ tip/kernel/sched.c
> @@ -9294,6 +9294,9 @@ static int tg_set_cfs_bandwidth(struct t
>  		cfs_rq->runtime_enabled = quota != RUNTIME_INF;
>  		cfs_rq->runtime_remaining = 0;
>  		cfs_rq->runtime_expires = runtime_expires;
> +
> +		if (cfs_rq_throttled(cfs_rq))
> +			unthrottle_cfs_rq(cfs_rq);
>  		raw_spin_unlock_irq(&rq->lock);
>  	}
>  out_unlock:
> Index: tip/kernel/sched_fair.c
> ===================================================================
> --- tip.orig/kernel/sched_fair.c
> +++ tip/kernel/sched_fair.c
> @@ -1456,10 +1456,88 @@ static void check_enqueue_throttle(struc
>  		throttle_cfs_rq(cfs_rq);
>  }
>  
> +static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> +{
> +	struct rq *rq = rq_of(cfs_rq);
> +	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> +	struct sched_entity *se;
> +	int enqueue = 1;
> +	long task_delta;
> +
> +	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
> +
> +	cfs_rq->throttled = 0;
> +	raw_spin_lock(&cfs_b->lock);
> +	list_del_rcu(&cfs_rq->throttled_list);
> +	raw_spin_unlock(&cfs_b->lock);
> +
> +	if (!cfs_rq->load.weight)
> +		return;
> +
> +	task_delta = cfs_rq->h_nr_running;
> +	for_each_sched_entity(se) {
> +		if (se->on_rq)
> +			enqueue = 0;
> +
> +		cfs_rq = cfs_rq_of(se);
> +		if (enqueue)
> +			enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
> +		cfs_rq->h_nr_running += task_delta;
> +
> +		if (cfs_rq_throttled(cfs_rq))
> +			break;
> +	}
> +
> +	if (!se)
> +		rq->nr_running += task_delta;
> +
> +	/* determine whether we need to wake up potentially idle cpu */
> +	if (rq->curr == rq->idle && rq->cfs.nr_running)
> +		resched_task(rq->curr);
> +}
> +
> +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> +		u64 remaining, u64 expires)
> +{
> +	struct cfs_rq *cfs_rq;
> +	u64 runtime = remaining;
> +
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(cfs_rq, &cfs_b->throttled_cfs_rq,
> +				throttled_list) {
> +		struct rq *rq = rq_of(cfs_rq);
> +
> +		raw_spin_lock(&rq->lock);
> +		if (!cfs_rq_throttled(cfs_rq))
> +			goto next;
> +
> +		runtime = -cfs_rq->runtime_remaining + 1;

It would be helpful if a comment explained the negation and the +1 here.

> +		if (runtime > remaining)
> +			runtime = remaining;
> +		remaining -= runtime;
> +
> +		cfs_rq->runtime_remaining += runtime;
> +		cfs_rq->runtime_expires = expires;
> +
> +		/* we check whether we're throttled above */
> +		if (cfs_rq->runtime_remaining > 0)
> +			unthrottle_cfs_rq(cfs_rq);
> +
> +next:
> +		raw_spin_unlock(&rq->lock);
> +
> +		if (!remaining)
> +			break;
> +	}
> +	rcu_read_unlock();
> +
> +	return remaining;
> +}
> +
>  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
>  {
>  	u64 quota, runtime = 0, runtime_expires;
> -	int idle = 0;
> +	int idle = 0, throttled = 0;
>  
>  	runtime_expires = sched_clock_cpu(smp_processor_id());
>  
> @@ -1469,6 +1547,7 @@ static int do_sched_cfs_period_timer(str
>  	if (quota != RUNTIME_INF) {
>  		runtime = quota;
>  		runtime_expires += ktime_to_ns(cfs_b->period);
> +		throttled = !list_empty(&cfs_b->throttled_cfs_rq);
>  
>  		cfs_b->runtime = runtime;
>  		cfs_b->runtime_expires = runtime_expires;
> @@ -1477,6 +1556,30 @@ static int do_sched_cfs_period_timer(str
>  	}
>  	raw_spin_unlock(&cfs_b->lock);
>  
> +	if (!throttled || quota == RUNTIME_INF)
> +		goto out;
> +	idle = 0;
> +
> +retry:
> +	runtime = distribute_cfs_runtime(cfs_b, runtime, runtime_expires);
> +
> +	raw_spin_lock(&cfs_b->lock);
> +	/* new new bandwidth may have been set */

Typo? new, newer, newest...?

> +	if (unlikely(runtime_expires != cfs_b->runtime_expires))
> +		goto out_unlock;
> +	/*
> +	 * make sure no-one was throttled while we were handing out the new
> +	 * runtime.
> +	 */
> +	if (runtime > 0 && !list_empty(&cfs_b->throttled_cfs_rq)) {
> +		raw_spin_unlock(&cfs_b->lock);
> +		goto retry;
> +	}
> +	cfs_b->runtime = runtime;
> +	cfs_b->idle = idle;
> +out_unlock:
> +	raw_spin_unlock(&cfs_b->lock);
> +out:
>  	return idle;
>  }
>  #else

Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

It would be better if this unthrottle patch (09/15) came before the
throttle patch (08/15) in this series, so as not to leave a small window
in the history where a throttled entity can never get back to the run
queue.  But I'm just being paranoid...
 

Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 10/15] sched: allow for positional tg_tree walks
  2011-05-03  9:28 ` [patch 10/15] sched: allow for positional tg_tree walks Paul Turner
@ 2011-05-10  7:24   ` Hidetoshi Seto
  2011-05-17 13:31   ` Peter Zijlstra
  1 sibling, 0 replies; 129+ messages in thread
From: Hidetoshi Seto @ 2011-05-10  7:24 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

(2011/05/03 18:28), Paul Turner wrote:
> Extend walk_tg_tree to accept a positional argument
> 
> static int walk_tg_tree_from(struct task_group *from,
> 			     tg_visitor down, tg_visitor up, void *data)
> 
> Existing semantics are preserved, caller must hold rcu_lock() or sufficient
> analogue.
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> ---

Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

Yeah, it's nice to have.

Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 11/15] sched: prevent interactions between throttled entities and load-balance
  2011-05-03  9:28 ` [patch 11/15] sched: prevent interactions between throttled entities and load-balance Paul Turner
@ 2011-05-10  7:26   ` Hidetoshi Seto
  2011-05-11  9:11     ` Paul Turner
  0 siblings, 1 reply; 129+ messages in thread
From: Hidetoshi Seto @ 2011-05-10  7:26 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

(2011/05/03 18:28), Paul Turner wrote:
> From the perspective of load-balance and shares distribution, throttled
> entities should be invisible.
> 
> However, both of these operations work on 'active' lists and are not
> inherently aware of what group hierarchies may be present.  In some cases this
> may be side-stepped (e.g. we could sideload via tg_load_down in load balance) 
> while in others (e.g. update_shares()) it is more difficult to compute without
> incurring some O(n^2) costs.
> 
> Instead, track hierarchal throttled state at time of transition.  This allows

                 hierarchical

> us to easily identify whether an entity belongs to a throttled hierarchy and
> avoid incorrect interactions with it.
> 
> Also, when an entity leaves a throttled hierarchy we need to advance its
> time averaging for shares averaging so that the elapsed throttled time is not
> considered as part of the cfs_rq's operation.
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> ---

Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 12/15] sched: migrate throttled tasks on HOTPLUG
  2011-05-03  9:28 ` [patch 12/15] sched: migrate throttled tasks on HOTPLUG Paul Turner
@ 2011-05-10  7:27   ` Hidetoshi Seto
  2011-05-11  9:10     ` Paul Turner
  0 siblings, 1 reply; 129+ messages in thread
From: Hidetoshi Seto @ 2011-05-10  7:27 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

(2011/05/03 18:28), Paul Turner wrote:
> +#else
> +static void unthrottle_offline_cfs_rqs(struct rq *rq)
> +{
> +}
> +#endif
> +

Nit: To follow the others, an alternative style is to put it on one line:

+static void unthrottle_offline_cfs_rqs(struct rq *rq) {}

Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 13/15] sched: add exports tracking cfs bandwidth control statistics
  2011-05-03  9:28 ` [patch 13/15] sched: add exports tracking cfs bandwidth control statistics Paul Turner
@ 2011-05-10  7:27   ` Hidetoshi Seto
  2011-05-11  7:56   ` Hidetoshi Seto
  1 sibling, 0 replies; 129+ messages in thread
From: Hidetoshi Seto @ 2011-05-10  7:27 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Nikhil Rao

(2011/05/03 18:28), Paul Turner wrote:
> From: Nikhil Rao <ncrao@google.com>
> 
> This change introduces statistics exports for the cpu sub-system, these are
> added through the use of a stat file similar to that exported by other
> subsystems.
> 
> The following exports are included:
> 
> nr_periods:	number of periods in which execution occurred
> nr_throttled:	the number of periods above in which execution was throttled
> throttled_time:	cumulative wall-time that any cpus have been throttled for
> this group
> 
> Signed-off-by: Nikhil Rao <ncrao@google.com>
> Signed-off-by: Paul Turner <pjt@google.com>
> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> ---
>  kernel/sched.c      |   22 ++++++++++++++++++++++
>  kernel/sched_fair.c |    9 +++++++++
>  2 files changed, 31 insertions(+)
> 
> Index: tip/kernel/sched.c
> ===================================================================
> --- tip.orig/kernel/sched.c
> +++ tip/kernel/sched.c
> @@ -260,6 +260,10 @@ struct cfs_bandwidth {
>  	struct hrtimer period_timer;
>  	struct list_head throttled_cfs_rq;
>  
> +	/* statistics */
> +	int nr_periods, nr_throttled;
> +	u64 throttled_time;
> +
>  #endif
>  };
>  

Nit: blank line?

Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 14/15] sched: return unused runtime on voluntary sleep
  2011-05-03  9:29 ` [patch 14/15] sched: return unused runtime on voluntary sleep Paul Turner
@ 2011-05-10  7:28   ` Hidetoshi Seto
  0 siblings, 0 replies; 129+ messages in thread
From: Hidetoshi Seto @ 2011-05-10  7:28 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

(2011/05/03 18:29), Paul Turner wrote:
> When a local cfs_rq blocks we return the majority of its remaining quota to the
> global bandwidth pool for use by other runqueues.
> 
> We do this only when the quota is current and there is more than 
> min_cfs_rq_quota [1ms by default] of runtime remaining on the rq.
> 
> In the case where there are throttled runqueues and we have sufficient
> bandwidth to meter out a slice, a second timer is kicked off to handle this
> delivery, unthrottling where appropriate.
> 
> Using a 'worst case' antagonist which executes on each cpu
> for 1ms before moving onto the next on a fairly large machine:
> 
> no quota generations:
>  197.47 ms       /cgroup/a/cpuacct.usage
>  199.46 ms       /cgroup/a/cpuacct.usage
>  205.46 ms       /cgroup/a/cpuacct.usage
>  198.46 ms       /cgroup/a/cpuacct.usage
>  208.39 ms       /cgroup/a/cpuacct.usage
> Since we are allowed to use "stale" quota our usage is effectively bounded by
> the rate of input into the global pool and performance is relatively stable.
> 
> with quota generations [1s increments]:
>  119.58 ms       /cgroup/a/cpuacct.usage
>  119.65 ms       /cgroup/a/cpuacct.usage
>  119.64 ms       /cgroup/a/cpuacct.usage
>  119.63 ms       /cgroup/a/cpuacct.usage
>  119.60 ms       /cgroup/a/cpuacct.usage
> The large deficit here is due to quota generations (/intentionally/) preventing
> us from now using previously stranded slack quota.  The cost is that this quota
> becomes unavailable.
> 
> with quota generations and quota return:
>  200.09 ms       /cgroup/a/cpuacct.usage
>  200.09 ms       /cgroup/a/cpuacct.usage
>  198.09 ms       /cgroup/a/cpuacct.usage
>  200.09 ms       /cgroup/a/cpuacct.usage
>  200.06 ms       /cgroup/a/cpuacct.usage
> By returning unused quota we're able to both stably consume our desired quota
> and prevent unintentional overages due to the abuse of slack quota from 
> previous quota periods (especially on a large machine).
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> 
> ---

Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 15/15] sched: add documentation for bandwidth control
  2011-05-03  9:29 ` [patch 15/15] sched: add documentation for bandwidth control Paul Turner
@ 2011-05-10  7:29   ` Hidetoshi Seto
  2011-05-11  9:09     ` Paul Turner
  0 siblings, 1 reply; 129+ messages in thread
From: Hidetoshi Seto @ 2011-05-10  7:29 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

(2011/05/03 18:29), Paul Turner wrote:
> From: Bharata B Rao <bharata@linux.vnet.ibm.com>
> 
> Basic description of usage and effect for CFS Bandwidth Control.
> 
> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> Signed-off-by: Paul Turner <pjt@google.com>
> ---

Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

Thank you very much for your great work, Paul!

I've run some tests on this version and hit no problems so far
(other than the minor bug pointed out in 04/15).
Things are definitely getting better.

I'll continue tests and let you know if there is something.


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 01/15] sched: (fixlet) dont update shares twice on on_rq parent
  2011-05-10  7:14   ` Hidetoshi Seto
@ 2011-05-10  8:32     ` Mike Galbraith
  2011-05-11  7:55       ` Hidetoshi Seto
  0 siblings, 1 reply; 129+ messages in thread
From: Mike Galbraith @ 2011-05-10  8:32 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: Paul Turner, linux-kernel, Peter Zijlstra, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar,
	Pavel Emelyanov

On Tue, 2011-05-10 at 16:14 +0900, Hidetoshi Seto wrote:
> (2011/05/03 18:28), Paul Turner wrote:
> > In dequeue_task_fair() we bail on dequeue when we encounter a parenting entity
> > with additional weight.  However, we perform a double shares update on this
> > entity since we continue the shares update traversal from that point, despite
> > dequeue_entity() having already updated its queuing cfs_rq.
> > 
> > Avoid this by starting from the parent when we resume.
> > 
> > Signed-off-by: Paul Turner <pjt@google.com>
> > ---
> >  kernel/sched_fair.c |    4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > Index: tip/kernel/sched_fair.c
> > ===================================================================
> > --- tip.orig/kernel/sched_fair.c
> > +++ tip/kernel/sched_fair.c
> > @@ -1355,8 +1355,10 @@ static void dequeue_task_fair(struct rq 
> >  		dequeue_entity(cfs_rq, se, flags);
> >  
> >  		/* Don't dequeue parent if it has other entities besides us */
> > -		if (cfs_rq->load.weight)
> > +		if (cfs_rq->load.weight) {
> > +			se = parent_entity(se);
> >  			break;
> > +		}
> >  		flags |= DEQUEUE_SLEEP;
> >  	}
> >  
> 
> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
> 
> This small fixlet can stand alone.
> Peter, how about getting this into git tree first?

tip 2f36825b176f67e5c5228aa33d828bc39718811f contains the below.

                /* Don't dequeue parent if it has other entities besides us */
-               if (cfs_rq->load.weight)
+               if (cfs_rq->load.weight) {
+                       /*
+                        * Bias pick_next to pick a task from this cfs_rq, as
+                        * p is sleeping when it is within its sched_slice.
+                        */
+                       if (task_sleep && parent_entity(se))
+                               set_next_buddy(parent_entity(se));
                        break;
+               }
                flags |= DEQUEUE_SLEEP;
        }



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 01/15] sched: (fixlet) dont update shares twice on on_rq parent
  2011-05-10  8:32     ` Mike Galbraith
@ 2011-05-11  7:55       ` Hidetoshi Seto
  2011-05-11  8:13         ` Paul Turner
  0 siblings, 1 reply; 129+ messages in thread
From: Hidetoshi Seto @ 2011-05-11  7:55 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Paul Turner, linux-kernel, Peter Zijlstra, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar,
	Pavel Emelyanov

(2011/05/10 17:32), Mike Galbraith wrote:
> On Tue, 2011-05-10 at 16:14 +0900, Hidetoshi Seto wrote:
>> This small fixlet can stand alone.
>> Peter, how about getting this into git tree first?
> 
> tip 2f36825b176f67e5c5228aa33d828bc39718811f contains the below.
> 
>                 /* Don't dequeue parent if it has other entities besides us */
> -               if (cfs_rq->load.weight)
> +               if (cfs_rq->load.weight) {
> +                       /*
> +                        * Bias pick_next to pick a task from this cfs_rq, as
> +                        * p is sleeping when it is within its sched_slice.
> +                        */
> +                       if (task_sleep && parent_entity(se))
> +                               set_next_buddy(parent_entity(se));
>                         break;
> +               }
>                 flags |= DEQUEUE_SLEEP;
>         }

Oh, thanks Mike!
It seems that this change in tip is the better one.

Paul, would you mind rebasing your patches onto tip/sched/core next time?
(...or is there a better branch to rebase onto?)


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 13/15] sched: add exports tracking cfs bandwidth control statistics
  2011-05-03  9:28 ` [patch 13/15] sched: add exports tracking cfs bandwidth control statistics Paul Turner
  2011-05-10  7:27   ` Hidetoshi Seto
@ 2011-05-11  7:56   ` Hidetoshi Seto
  2011-05-11  9:09     ` Paul Turner
  1 sibling, 1 reply; 129+ messages in thread
From: Hidetoshi Seto @ 2011-05-11  7:56 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Nikhil Rao

Oops, I found an issue here.

(2011/05/03 18:28), Paul Turner wrote:
> @@ -1628,6 +1631,12 @@ retry:
>  		raw_spin_unlock(&cfs_b->lock);
>  		goto retry;
>  	}
> +
> +	/* update throttled stats */
> +	cfs_b->nr_periods += overrun;
> +	if (throttled)
> +		cfs_b->nr_throttled += overrun;
> +
>  	cfs_b->runtime = runtime;
>  	cfs_b->idle = idle;
>  out_unlock:

Quoting from patch 09/15:

+	if (!throttled || quota == RUNTIME_INF)
+		goto out;
+	idle = 0;
+
+retry:
+	runtime = distribute_cfs_runtime(cfs_b, runtime, runtime_expires);
+
+	raw_spin_lock(&cfs_b->lock);
+	/* new new bandwidth may have been set */
+	if (unlikely(runtime_expires != cfs_b->runtime_expires))
+		goto out_unlock;
+	/*
+	 * make sure no-one was throttled while we were handing out the new
+	 * runtime.
+	 */
+	if (runtime > 0 && !list_empty(&cfs_b->throttled_cfs_rq)) {
+		raw_spin_unlock(&cfs_b->lock);
+		goto retry;
+	}
+	cfs_b->runtime = runtime;
+	cfs_b->idle = idle;
+out_unlock:
+	raw_spin_unlock(&cfs_b->lock);
+out:

Since we skip distributing runtime (by "goto out") when !throttled,
the new block inserted by this patch is reached only when throttled.
So I see that nr_periods and nr_throttled will always look the same.

Maybe we should move this block up, like the following.

Thanks,
H.Seto

---
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1620,6 +1620,12 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 		idle = cfs_b->idle;
 		cfs_b->idle = 1;
 	}
+
+	/* update throttled stats */
+	cfs_b->nr_periods += overrun;
+	if (throttled)
+		cfs_b->nr_throttled += overrun;
+
 	raw_spin_unlock(&cfs_b->lock);
 
 	if (!throttled || quota == RUNTIME_INF)
@@ -1642,11 +1648,6 @@ retry:
 		goto retry;
 	}
 
-	/* update throttled stats */
-	cfs_b->nr_periods += overrun;
-	if (throttled)
-		cfs_b->nr_throttled += overrun;
-
 	cfs_b->runtime = runtime;
 	cfs_b->idle = idle;
 out_unlock:



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 01/15] sched: (fixlet) dont update shares twice on on_rq parent
  2011-05-11  7:55       ` Hidetoshi Seto
@ 2011-05-11  8:13         ` Paul Turner
  2011-05-11  8:45           ` Mike Galbraith
  0 siblings, 1 reply; 129+ messages in thread
From: Paul Turner @ 2011-05-11  8:13 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: Mike Galbraith, linux-kernel, Peter Zijlstra, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar,
	Pavel Emelyanov

On Wed, May 11, 2011 at 12:55 AM, Hidetoshi Seto
<seto.hidetoshi@jp.fujitsu.com> wrote:
> (2011/05/10 17:32), Mike Galbraith wrote:
>> On Tue, 2011-05-10 at 16:14 +0900, Hidetoshi Seto wrote:
>>> This small fixlet can stand alone.
>>> Peter, how about getting this into git tree first?
>>
>> tip 2f36825b176f67e5c5228aa33d828bc39718811f contains the below.
>>
>>                 /* Don't dequeue parent if it has other entities besides us */
>> -               if (cfs_rq->load.weight)
>> +               if (cfs_rq->load.weight) {
>> +                       /*
>> +                        * Bias pick_next to pick a task from this cfs_rq, as
>> +                        * p is sleeping when it is within its sched_slice.
>> +                        */
>> +                       if (task_sleep && parent_entity(se))
>> +                               set_next_buddy(parent_entity(se));
>>                         break;
>> +               }
>>                 flags |= DEQUEUE_SLEEP;
>>         }
>
> Oh, thanks Mike!
> It seems that this change in tip is the better one.
>
> Paul, would you mind rebasing your patches onto tip/sched/core next time?
> (...or is there a better branch to rebase onto?)
>

I thought I had but apparently I missed this.

We still need to set se = parent_entity(se) to avoid the pointless
double update below.

Will definitely rebase.

Thanks!

>
> Thanks,
> H.Seto
>
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 01/15] sched: (fixlet) dont update shares twice on on_rq parent
  2011-05-11  8:13         ` Paul Turner
@ 2011-05-11  8:45           ` Mike Galbraith
  2011-05-11  8:59             ` Hidetoshi Seto
  0 siblings, 1 reply; 129+ messages in thread
From: Mike Galbraith @ 2011-05-11  8:45 UTC (permalink / raw)
  To: Paul Turner
  Cc: Hidetoshi Seto, linux-kernel, Peter Zijlstra, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar,
	Pavel Emelyanov

On Wed, 2011-05-11 at 01:13 -0700, Paul Turner wrote:
> On Wed, May 11, 2011 at 12:55 AM, Hidetoshi Seto
> <seto.hidetoshi@jp.fujitsu.com> wrote:
> > (2011/05/10 17:32), Mike Galbraith wrote:
> >> On Tue, 2011-05-10 at 16:14 +0900, Hidetoshi Seto wrote:
> >>> This small fixlet can stand alone.
> >>> Peter, how about getting this into git tree first?
> >>
> >> tip 2f36825b176f67e5c5228aa33d828bc39718811f contains the below.
> >>
> >>                 /* Don't dequeue parent if it has other entities besides us */
> >> -               if (cfs_rq->load.weight)
> >> +               if (cfs_rq->load.weight) {
> >> +                       /*
> >> +                        * Bias pick_next to pick a task from this cfs_rq, as
> >> +                        * p is sleeping when it is within its sched_slice.
> >> +                        */
> >> +                       if (task_sleep && parent_entity(se))
> >> +                               set_next_buddy(parent_entity(se));
> >>                         break;
> >> +               }
> >>                 flags |= DEQUEUE_SLEEP;
> >>         }
> >
> > Oh, thanks Mike!
> > It seems that this change in tip is the better one.
> >
> > Paul, would you mind rebasing your patches onto tip/sched/core next time?
> > (...or is there a better branch to rebase onto?)
> >
> 
> I thought I had but apparently I missed this.
> 
> We still need to set se = parent_entity(se) to avoid the pointless
> double update below.
> 
> Will definitely rebase.

Wish I could, wouldn't have 114 other patches just to get evaluation
tree up to speed :)

Index: linux-2.6.32/kernel/sched_fair.c
===================================================================
--- linux-2.6.32.orig/kernel/sched_fair.c
+++ linux-2.6.32/kernel/sched_fair.c
@@ -1308,12 +1308,15 @@ static void dequeue_task_fair(struct rq

                /* Don't dequeue parent if it has other entities besides us */
                if (cfs_rq->load.weight) {
+                       /* Avoid double update below. */
+                       se = parent_entity(se);
+
                        /*
                         * Bias pick_next to pick a task from this cfs_rq, as
                         * p is sleeping when it is within its sched_slice.
                         */
-                       if (task_sleep && parent_entity(se))
-                               set_next_buddy(parent_entity(se));
+                       if (task_sleep && se)
+                               set_next_buddy(se);
                        break;
                }
                flags |= DEQUEUE_SLEEP;



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 01/15] sched: (fixlet) dont update shares twice on on_rq parent
  2011-05-11  8:45           ` Mike Galbraith
@ 2011-05-11  8:59             ` Hidetoshi Seto
  0 siblings, 0 replies; 129+ messages in thread
From: Hidetoshi Seto @ 2011-05-11  8:59 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Paul Turner, linux-kernel, Peter Zijlstra, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar,
	Pavel Emelyanov

(2011/05/11 17:45), Mike Galbraith wrote:
> On Wed, 2011-05-11 at 01:13 -0700, Paul Turner wrote:
>> On Wed, May 11, 2011 at 12:55 AM, Hidetoshi Seto
>> <seto.hidetoshi@jp.fujitsu.com> wrote:
>>> (2011/05/10 17:32), Mike Galbraith wrote:
>>>> On Tue, 2011-05-10 at 16:14 +0900, Hidetoshi Seto wrote:
>>>>> This small fixlet can stand alone.
>>>>> Peter, how about getting this into git tree first?
>>>>
>>>> tip 2f36825b176f67e5c5228aa33d828bc39718811f contains the below.
>>>>
>>>>                 /* Don't dequeue parent if it has other entities besides us */
>>>> -               if (cfs_rq->load.weight)
>>>> +               if (cfs_rq->load.weight) {
>>>> +                       /*
>>>> +                        * Bias pick_next to pick a task from this cfs_rq, as
>>>> +                        * p is sleeping when it is within its sched_slice.
>>>> +                        */
>>>> +                       if (task_sleep && parent_entity(se))
>>>> +                               set_next_buddy(parent_entity(se));
>>>>                         break;
>>>> +               }
>>>>                 flags |= DEQUEUE_SLEEP;
>>>>         }
>>>
>>> Oh, thanks Mike!
>>> It seems that this change in tip is the better one.
>>>
>>> Paul, would you mind rebasing your patches onto tip/sched/core next time?
>>> (...or is there a better branch to rebase onto?)
>>>
>>
>> I thought I had but apparently I missed this.
>>
>> We still need to set se = parent_entity(se) to avoid the pointless
>> double update below.
>>
>> Will definitely rebase.
> 
> Wish I could, wouldn't have 114 other patches just to get evaluation
> tree up to speed :)
> 
> Index: linux-2.6.32/kernel/sched_fair.c
> ===================================================================
> --- linux-2.6.32.orig/kernel/sched_fair.c
> +++ linux-2.6.32/kernel/sched_fair.c
> @@ -1308,12 +1308,15 @@ static void dequeue_task_fair(struct rq
> 
>                 /* Don't dequeue parent if it has other entities besides us */
>                 if (cfs_rq->load.weight) {
> +                       /* Avoid double update below. */
> +                       se = parent_entity(se);
> +
>                         /*
>                          * Bias pick_next to pick a task from this cfs_rq, as
>                          * p is sleeping when it is within its sched_slice.
>                          */
> -                       if (task_sleep && parent_entity(se))
> -                               set_next_buddy(parent_entity(se));
> +                       if (task_sleep && se)
> +                               set_next_buddy(se);
>                         break;
>                 }
>                 flags |= DEQUEUE_SLEEP;

Nice!

It would be better to pull this fixlet out of the cfs-bandwidth series
and post it as a standalone patch.


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 13/15] sched: add exports tracking cfs bandwidth control statistics
  2011-05-11  7:56   ` Hidetoshi Seto
@ 2011-05-11  9:09     ` Paul Turner
  0 siblings, 0 replies; 129+ messages in thread
From: Paul Turner @ 2011-05-11  9:09 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Wed, May 11, 2011 at 12:56 AM, Hidetoshi Seto
<seto.hidetoshi@jp.fujitsu.com> wrote:
> Oops, I found an issue here.
>
> (2011/05/03 18:28), Paul Turner wrote:
>> @@ -1628,6 +1631,12 @@ retry:
>>               raw_spin_unlock(&cfs_b->lock);
>>               goto retry;
>>       }
>> +
>> +     /* update throttled stats */
>> +     cfs_b->nr_periods += overrun;
>> +     if (throttled)
>> +             cfs_b->nr_throttled += overrun;
>> +
>>       cfs_b->runtime = runtime;
>>       cfs_b->idle = idle;
>>  out_unlock:
>
> Quoting from patch 09/15:
>
> +       if (!throttled || quota == RUNTIME_INF)
> +               goto out;
> +       idle = 0;
> +
> +retry:
> +       runtime = distribute_cfs_runtime(cfs_b, runtime, runtime_expires);
> +
> +       raw_spin_lock(&cfs_b->lock);
> +       /* new new bandwidth may have been set */
> +       if (unlikely(runtime_expires != cfs_b->runtime_expires))
> +               goto out_unlock;
> +       /*
> +        * make sure no-one was throttled while we were handing out the new
> +        * runtime.
> +        */
> +       if (runtime > 0 && !list_empty(&cfs_b->throttled_cfs_rq)) {
> +               raw_spin_unlock(&cfs_b->lock);
> +               goto retry;
> +       }
> +       cfs_b->runtime = runtime;
> +       cfs_b->idle = idle;
> +out_unlock:
> +       raw_spin_unlock(&cfs_b->lock);
> +out:
>
> Since we skip distributing runtime (by "goto out") when !throttled,
> the new block inserted by this patch is reached only when throttled.
> So I see that nr_periods and nr_throttled will always look the same.
>
> Maybe we should move this block up, like the following.
>

Yes, makes sense, incorporated -- thanks!

> Thanks,
> H.Seto
>
> ---
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1620,6 +1620,12 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
>                idle = cfs_b->idle;
>                cfs_b->idle = 1;
>        }
> +
> +       /* update throttled stats */
> +       cfs_b->nr_periods += overrun;
> +       if (throttled)
> +               cfs_b->nr_throttled += overrun;
> +
>        raw_spin_unlock(&cfs_b->lock);
>
>        if (!throttled || quota == RUNTIME_INF)
> @@ -1642,11 +1648,6 @@ retry:
>                goto retry;
>        }
>
> -       /* update throttled stats */
> -       cfs_b->nr_periods += overrun;
> -       if (throttled)
> -               cfs_b->nr_throttled += overrun;
> -
>        cfs_b->runtime = runtime;
>        cfs_b->idle = idle;
>  out_unlock:
>
>
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 15/15] sched: add documentation for bandwidth control
  2011-05-10  7:29   ` Hidetoshi Seto
@ 2011-05-11  9:09     ` Paul Turner
  0 siblings, 0 replies; 129+ messages in thread
From: Paul Turner @ 2011-05-11  9:09 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

On Tue, May 10, 2011 at 12:29 AM, Hidetoshi Seto
<seto.hidetoshi@jp.fujitsu.com> wrote:
> (2011/05/03 18:29), Paul Turner wrote:
>> From: Bharata B Rao <bharata@linux.vnet.ibm.com>
>>
>> Basic description of usage and effect for CFS Bandwidth Control.
>>
>> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
>> Signed-off-by: Paul Turner <pjt@google.com>
>> ---
>
> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
>
> Thank you very much for your great work, Paul!
>
> I've run some tests on this version and hit no problems so far
> (other than the minor bug pointed out in 04/15).
> Things are definitely getting better.
>
> I'll continue tests and let you know if there is something.
>

Thank you for taking the time to review and test!

Very much appreciated!

>
> Thanks,
> H.Seto
>
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 12/15] sched: migrate throttled tasks on HOTPLUG
  2011-05-10  7:27   ` Hidetoshi Seto
@ 2011-05-11  9:10     ` Paul Turner
  0 siblings, 0 replies; 129+ messages in thread
From: Paul Turner @ 2011-05-11  9:10 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

On Tue, May 10, 2011 at 12:27 AM, Hidetoshi Seto
<seto.hidetoshi@jp.fujitsu.com> wrote:
> (2011/05/03 18:28), Paul Turner wrote:
>> +#else
>> +static void unthrottle_offline_cfs_rqs(struct rq *rq)
>> +{
>> +}
>> +#endif
>> +
>
> Nit: To follow the others, an alternative style is to put it on one line:
>
> +static void unthrottle_offline_cfs_rqs(struct rq *rq) {}
>

Agree, updated.  Thanks

> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
>
>
> Thanks,
> H.Seto
>
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 11/15] sched: prevent interactions between throttled entities and load-balance
  2011-05-10  7:26   ` Hidetoshi Seto
@ 2011-05-11  9:11     ` Paul Turner
  0 siblings, 0 replies; 129+ messages in thread
From: Paul Turner @ 2011-05-11  9:11 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

On Tue, May 10, 2011 at 12:26 AM, Hidetoshi Seto
<seto.hidetoshi@jp.fujitsu.com> wrote:
> (2011/05/03 18:28), Paul Turner wrote:
>> From the perspective of load-balance and shares distribution, throttled
>> entities should be invisible.
>>
>> However, both of these operations work on 'active' lists and are not
>> inherently aware of what group hierarchies may be present.  In some cases this
>> may be side-stepped (e.g. we could sideload via tg_load_down in load balance)
>> while in others (e.g. update_shares()) it is more difficult to compute without
>> incurring some O(n^2) costs.
>>
>> Instead, track hierarchal throttled state at time of transition.  This allows
>
>                 hierarchical

Fixed, Thanks

>
>> us to easily identify whether an entity belongs to a throttled hierarchy and
>> avoid incorrect interactions with it.
>>
>> Also, when an entity leaves a throttled hierarchy we need to advance its
>> time averaging for shares averaging so that the elapsed throttled time is not
>> considered as part of the cfs_rq's operation.
>>
>> Signed-off-by: Paul Turner <pjt@google.com>
>> ---
>
> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
>
> Thanks,
> H.Seto
>
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 09/15] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
  2011-05-10  7:24   ` Hidetoshi Seto
@ 2011-05-11  9:24     ` Paul Turner
  0 siblings, 0 replies; 129+ messages in thread
From: Paul Turner @ 2011-05-11  9:24 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Tue, May 10, 2011 at 12:24 AM, Hidetoshi Seto
<seto.hidetoshi@jp.fujitsu.com> wrote:
> Some comments...
>
> (2011/05/03 18:28), Paul Turner wrote:
>> At the start of a new period there are several actions: we must refresh the
>> global bandwidth pool as well as unthrottle any cfs_rq entities who previously
>> ran out of bandwidth (as quota permits).
>>
>> Unthrottled entities have the cfs_rq->throttled flag cleared and are re-enqueued
>> into the cfs entity hierarchy.
>>
>> Signed-off-by: Paul Turner <pjt@google.com>
>> Signed-off-by: Nikhil Rao <ncrao@google.com>
>> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
>> ---
>>  kernel/sched.c      |    3 +
>>  kernel/sched_fair.c |  105 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>>  2 files changed, 107 insertions(+), 1 deletion(-)
>>
>> Index: tip/kernel/sched.c
>> ===================================================================
>> --- tip.orig/kernel/sched.c
>> +++ tip/kernel/sched.c
>> @@ -9294,6 +9294,9 @@ static int tg_set_cfs_bandwidth(struct t
>>               cfs_rq->runtime_enabled = quota != RUNTIME_INF;
>>               cfs_rq->runtime_remaining = 0;
>>               cfs_rq->runtime_expires = runtime_expires;
>> +
>> +             if (cfs_rq_throttled(cfs_rq))
>> +                     unthrottle_cfs_rq(cfs_rq);
>>               raw_spin_unlock_irq(&rq->lock);
>>       }
>>  out_unlock:
>> Index: tip/kernel/sched_fair.c
>> ===================================================================
>> --- tip.orig/kernel/sched_fair.c
>> +++ tip/kernel/sched_fair.c
>> @@ -1456,10 +1456,88 @@ static void check_enqueue_throttle(struc
>>               throttle_cfs_rq(cfs_rq);
>>  }
>>
>> +static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>> +{
>> +     struct rq *rq = rq_of(cfs_rq);
>> +     struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
>> +     struct sched_entity *se;
>> +     int enqueue = 1;
>> +     long task_delta;
>> +
>> +     se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
>> +
>> +     cfs_rq->throttled = 0;
>> +     raw_spin_lock(&cfs_b->lock);
>> +     list_del_rcu(&cfs_rq->throttled_list);
>> +     raw_spin_unlock(&cfs_b->lock);
>> +
>> +     if (!cfs_rq->load.weight)
>> +             return;
>> +
>> +     task_delta = cfs_rq->h_nr_running;
>> +     for_each_sched_entity(se) {
>> +             if (se->on_rq)
>> +                     enqueue = 0;
>> +
>> +             cfs_rq = cfs_rq_of(se);
>> +             if (enqueue)
>> +                     enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
>> +             cfs_rq->h_nr_running += task_delta;
>> +
>> +             if (cfs_rq_throttled(cfs_rq))
>> +                     break;
>> +     }
>> +
>> +     if (!se)
>> +             rq->nr_running += task_delta;
>> +
>> +     /* determine whether we need to wake up potentially idle cpu */
>> +     if (rq->curr == rq->idle && rq->cfs.nr_running)
>> +             resched_task(rq->curr);
>> +}
>> +
>> +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
>> +             u64 remaining, u64 expires)
>> +{
>> +     struct cfs_rq *cfs_rq;
>> +     u64 runtime = remaining;
>> +
>> +     rcu_read_lock();
>> +     list_for_each_entry_rcu(cfs_rq, &cfs_b->throttled_cfs_rq,
>> +                             throttled_list) {
>> +             struct rq *rq = rq_of(cfs_rq);
>> +
>> +             raw_spin_lock(&rq->lock);
>> +             if (!cfs_rq_throttled(cfs_rq))
>> +                     goto next;
>> +
>> +             runtime = -cfs_rq->runtime_remaining + 1;
>
> It would be helpful if a comment could explain the negation and the +1.

Remaining runtime of <= 0 implies that there was no bandwidth
available.  See checks below et al. in check_... functions.

We choose the minimum amount here to return to a positive quota state.

Originally I had elected to take a full slice here.  The limitation
became that this then effectively duplicated the assign_cfs_rq_runtime
path, and would require the quota handed out in each to be in
lockstep.  Another trade-off is that when we're in a large state of
arrears, handing out this extra bandwidth (in excess of the minimum
+1) up-front may prevent us from unthrottling another cfs_rq.

Will add a comment explaining that the minimum amount needed to leave
arrears is chosen above.
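
For illustration, here is a minimal userspace sketch of the arithmetic
described above; the pool size and arrears values are invented, and this
is an analogue rather than the kernel code itself.  It also shows how
handing out only the minimum +1 keeps more of the pool available for
unthrottling the remaining runqueues:

#include <stdio.h>

int main(void)
{
	long long remaining = 6000;			/* global pool, in ns */
	long long arrears[] = { -3000, -1500, -4000 };	/* runtime_remaining */
	int i;

	for (i = 0; i < 3 && remaining; i++) {
		/* minimum top-up that leaves exactly 1 ns of positive runtime */
		long long runtime = -arrears[i] + 1;

		if (runtime > remaining)
			runtime = remaining;
		remaining -= runtime;
		arrears[i] += runtime;

		printf("rq%d: runtime_remaining=%lld -> %s\n", i, arrears[i],
		       arrears[i] > 0 ? "unthrottle" : "still throttled");
	}
	printf("pool left: %lld\n", remaining);
	return 0;
}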

>
>> +             if (runtime > remaining)
>> +                     runtime = remaining;
>> +             remaining -= runtime;
>> +
>> +             cfs_rq->runtime_remaining += runtime;
>> +             cfs_rq->runtime_expires = expires;
>> +
>> +             /* we check whether we're throttled above */
>> +             if (cfs_rq->runtime_remaining > 0)
>> +                     unthrottle_cfs_rq(cfs_rq);
>> +
>> +next:
>> +             raw_spin_unlock(&rq->lock);
>> +
>> +             if (!remaining)
>> +                     break;
>> +     }
>> +     rcu_read_unlock();
>> +
>> +     return remaining;
>> +}
>> +
>>  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
>>  {
>>       u64 quota, runtime = 0, runtime_expires;
>> -     int idle = 0;
>> +     int idle = 0, throttled = 0;
>>
>>       runtime_expires = sched_clock_cpu(smp_processor_id());
>>
>> @@ -1469,6 +1547,7 @@ static int do_sched_cfs_period_timer(str
>>       if (quota != RUNTIME_INF) {
>>               runtime = quota;
>>               runtime_expires += ktime_to_ns(cfs_b->period);
>> +             throttled = !list_empty(&cfs_b->throttled_cfs_rq);
>>
>>               cfs_b->runtime = runtime;
>>               cfs_b->runtime_expires = runtime_expires;
>> @@ -1477,6 +1556,30 @@ static int do_sched_cfs_period_timer(str
>>       }
>>       raw_spin_unlock(&cfs_b->lock);
>>
>> +     if (!throttled || quota == RUNTIME_INF)
>> +             goto out;
>> +     idle = 0;
>> +
>> +retry:
>> +     runtime = distribute_cfs_runtime(cfs_b, runtime, runtime_expires);
>> +
>> +     raw_spin_lock(&cfs_b->lock);
>> +     /* new new bandwidth may have been set */
>
> Typo? new, newer, newest...?
>

s/new new/new/ :)

>> +     if (unlikely(runtime_expires != cfs_b->runtime_expires))
>> +             goto out_unlock;
>> +     /*
>> +      * make sure no-one was throttled while we were handing out the new
>> +      * runtime.
>> +      */
>> +     if (runtime > 0 && !list_empty(&cfs_b->throttled_cfs_rq)) {
>> +             raw_spin_unlock(&cfs_b->lock);
>> +             goto retry;
>> +     }
>> +     cfs_b->runtime = runtime;
>> +     cfs_b->idle = idle;
>> +out_unlock:
>> +     raw_spin_unlock(&cfs_b->lock);
>> +out:
>>       return idle;
>>  }
>>  #else
>
> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
>
> It would be better if this unthrottle patch (09/15) came before the
> throttle patch (08/15) in this series, so as not to leave a small window
> in the history where a throttled entity can never get back to the run
> queue.  But I'm just being paranoid...
>

The feature is inert unless bandwidth is set so this should be safe.

The trade-off with reversing the order is that a patch undoing state
that doesn't yet exist looks very strange :).  If the above is a
concern I'd probably prefer to separate it into 3 parts:
1. add throttle
2. add unthrottle
3. enable throttle

Where (3) would consist only of the enqueue/put checks to trigger throttling.


>
> Thanks,
> H.Seto
>
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 06/15] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth
  2011-05-10  7:22   ` Hidetoshi Seto
@ 2011-05-11  9:25     ` Paul Turner
  0 siblings, 0 replies; 129+ messages in thread
From: Paul Turner @ 2011-05-11  9:25 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Tue, May 10, 2011 at 12:22 AM, Hidetoshi Seto
<seto.hidetoshi@jp.fujitsu.com> wrote:
> (2011/05/03 18:28), Paul Turner wrote:
>> Index: tip/include/linux/sched.h
>> ===================================================================
>> --- tip.orig/include/linux/sched.h
>> +++ tip/include/linux/sched.h
>> @@ -1958,6 +1958,10 @@ int sched_cfs_consistent_handler(struct
>>               loff_t *ppos);
>>  #endif
>>
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +extern unsigned int sysctl_sched_cfs_bandwidth_slice;
>> +#endif
>> +
>>  #ifdef CONFIG_SCHED_AUTOGROUP
>>  extern unsigned int sysctl_sched_autogroup_enabled;
>>
>
> Nit: you can reuse ifdef just above here.

Thanks!  I think this was actually a quilt-mis-merge when I was
shuffling the order of things around.  Definitely makes sense to
combine them.

>
> +#ifdef CONFIG_CFS_BANDWIDTH
> +extern unsigned int sysctl_sched_cfs_bandwidth_consistent;
> +
> +int sched_cfs_consistent_handler(struct ctl_table *table, int write,
> +               void __user *buffer, size_t *lenp,
> +               loff_t *ppos);
> +#endif
> +
> +#ifdef CONFIG_CFS_BANDWIDTH
> +extern unsigned int sysctl_sched_cfs_bandwidth_slice;
> +#endif
> +
>
> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
>
>
> Thanks,
> H.Seto
>
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 05/15] sched: add a timer to handle CFS bandwidth refresh
  2011-05-10  7:21   ` Hidetoshi Seto
@ 2011-05-11  9:27     ` Paul Turner
  0 siblings, 0 replies; 129+ messages in thread
From: Paul Turner @ 2011-05-11  9:27 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

On Tue, May 10, 2011 at 12:21 AM, Hidetoshi Seto
<seto.hidetoshi@jp.fujitsu.com> wrote:
> (2011/05/03 18:28), Paul Turner wrote:
>> @@ -250,6 +253,9 @@ struct cfs_bandwidth {
>>       ktime_t period;
>>       u64 quota;
>>       s64 hierarchal_quota;
>> +
>> +     int idle;
>> +     struct hrtimer period_timer;
>>  #endif
>>  };
>>
>
> "idle" is not used yet.  How about adding it in later patch?
> Plus, comment explaining how it is used would be appreciated.

Fixed both.  (idle belongs to the accumulate patch)

>
>>  static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>>  {
>>       raw_spin_lock_init(&cfs_b->lock);
>>       cfs_b->quota = RUNTIME_INF;
>>       cfs_b->period = ns_to_ktime(default_cfs_period());
>> +
>> +     hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>> +     cfs_b->period_timer.function = sched_cfs_period_timer;
>> +
>>  }
>
> Nit: blank line?
>
> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
>
>
> Thanks,
> H.Seto
>
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 04/15] sched: validate CFS quota hierarchies
  2011-05-10  7:20   ` Hidetoshi Seto
@ 2011-05-11  9:37     ` Paul Turner
  0 siblings, 0 replies; 129+ messages in thread
From: Paul Turner @ 2011-05-11  9:37 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

On Tue, May 10, 2011 at 12:20 AM, Hidetoshi Seto
<seto.hidetoshi@jp.fujitsu.com> wrote:
> Description typos + one bug.
>
> (2011/05/03 18:28), Paul Turner wrote:
>> Add constraints validation for CFS bandwidth hierachies.
>
>                                               hierarchies
>
>>
>> Validate that:
>>    sum(child bandwidth) <= parent_bandwidth
>>
>> In a quota limited hierarchy, an unconstrainted entity
>
>                                   unconstrained
>
>> (e.g. bandwidth==RUNTIME_INF) inherits the bandwidth of its parent.
>>
>> Since bandwidth periods may be non-uniform we normalize to the maximum allowed
>> period, 1 second.
>>
>> This behavior may be disabled (allowing child bandwidth to exceed parent) via
>> kernel.sched_cfs_bandwidth_consistent=0
>>
>> Signed-off-by: Paul Turner <pjt@google.com>
>>
>> ---
> (snip)
>> +/*
>> + * normalize group quota/period to be quota/max_period
>> + * note: units are usecs
>> + */
>> +static u64 normalize_cfs_quota(struct task_group *tg,
>> +                            struct cfs_schedulable_data *d)
>> +{
>> +     u64 quota, period;
>> +
>> +     if (tg == d->tg) {
>> +             period = d->period;
>> +             quota = d->quota;
>> +     } else {
>> +             period = tg_get_cfs_period(tg);
>> +             quota = tg_get_cfs_quota(tg);
>> +     }
>> +
>> +     if (quota == RUNTIME_INF)
>> +             return RUNTIME_INF;
>> +
>> +     return to_ratio(period, quota);
>> +}
>
> Since tg_get_cfs_quota() doesn't return RUNTIME_INF but -1,
> this function needs a fix like the following.
>
> For fixed version, feel free to add:
>
> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
>
> Thanks,
> H.Seto
>
> ---
>  kernel/sched.c |    7 ++++---
>  1 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index d2562aa..f171ba5 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -9465,16 +9465,17 @@ static u64 normalize_cfs_quota(struct task_group *tg,
>        u64 quota, period;
>
>        if (tg == d->tg) {
> +               if (d->quota == RUNTIME_INF)
> +                       return RUNTIME_INF;
>                period = d->period;
>                quota = d->quota;
>        } else {
> +               if (tg_cfs_bandwidth(tg)->quota == RUNTIME_INF)
> +                       return RUNTIME_INF;
>                period = tg_get_cfs_period(tg);
>                quota = tg_get_cfs_quota(tg);
>        }
>

Good catch!

Just modifying:
+if (quota == RUNTIME_INF || quota == -1)
+                       return RUNTIME_INF;

Seems simpler.

Although really there's no reason for tg_get_cfs_runtime (and
sched_group_rt_runtime from which it's cloned) not to be returning
RUNTIME_INF and then doing the conversion within the cgroupfs handler.

Fixing both is probably a better clean-up.

> -       if (quota == RUNTIME_INF)
> -               return RUNTIME_INF;
> -
>        return to_ratio(period, quota);
>  }
>
>
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 04/15] sched: validate CFS quota hierarchies
  2011-05-03  9:28 ` [patch 04/15] sched: validate CFS quota hierarchies Paul Turner
  2011-05-10  7:20   ` Hidetoshi Seto
@ 2011-05-16  9:30   ` Peter Zijlstra
  2011-05-16  9:43   ` Peter Zijlstra
  2 siblings, 0 replies; 129+ messages in thread
From: Peter Zijlstra @ 2011-05-16  9:30 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

On Tue, 2011-05-03 at 02:28 -0700, Paul Turner wrote:
> Since bandwidth periods may be non-uniform we normalize to the maximum allowed
> period, 1 second. 

I'm still somewhat confused on this point: what does it mean to have a
(parent) group with a 0.1s period whose child-groups have 1s periods?



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 04/15] sched: validate CFS quota hierarchies
  2011-05-03  9:28 ` [patch 04/15] sched: validate CFS quota hierarchies Paul Turner
  2011-05-10  7:20   ` Hidetoshi Seto
  2011-05-16  9:30   ` Peter Zijlstra
@ 2011-05-16  9:43   ` Peter Zijlstra
  2011-05-16 12:32     ` Paul Turner
  2 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2011-05-16  9:43 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

On Tue, 2011-05-03 at 02:28 -0700, Paul Turner wrote:
> This behavior may be disabled (allowing child bandwidth to exceed parent) via
> kernel.sched_cfs_bandwidth_consistent=0

why? this needs very good justification.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 05/15] sched: add a timer to handle CFS bandwidth refresh
  2011-05-03  9:28 ` [patch 05/15] sched: add a timer to handle CFS bandwidth refresh Paul Turner
  2011-05-10  7:21   ` Hidetoshi Seto
@ 2011-05-16 10:18   ` Peter Zijlstra
  2011-05-16 12:56     ` Paul Turner
  1 sibling, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2011-05-16 10:18 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

On Tue, 2011-05-03 at 02:28 -0700, Paul Turner wrote:
> @@ -1003,6 +1003,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
>  
>         if (cfs_rq->nr_running == 1)
>                 list_add_leaf_cfs_rq(cfs_rq);
> +
> +       start_cfs_bandwidth(cfs_rq);
>  }
>  
>  static void __clear_buddies_last(struct sched_entity *se)
> @@ -1220,6 +1222,8 @@ static void put_prev_entity(struct cfs_r
>                 update_stats_wait_start(cfs_rq, prev);
>                 /* Put 'current' back into the tree. */
>                 __enqueue_entity(cfs_rq, prev);
> +
> +               start_cfs_bandwidth(cfs_rq);
>         }
>         cfs_rq->curr = NULL;
>  } 

OK, so while the first made sense the second had me go wtf?!, now I
_think_ you do that because do_sched_cfs_period_timer() can return idle
and stop the timer when no bandwidth consumption is seen for a while,
and thus we need to re-start the timer when we put the entity to sleep,
since that could have been a throttle.

If that's so then neither really do make sense and a big fat comment is
missing.

So why not start on the same (but inverse) condition that makes it stop?

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 06/15] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth
  2011-05-03  9:28 ` [patch 06/15] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth Paul Turner
  2011-05-10  7:22   ` Hidetoshi Seto
@ 2011-05-16 10:27   ` Peter Zijlstra
  2011-05-16 12:59     ` Paul Turner
  2011-05-16 10:32   ` Peter Zijlstra
  2 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2011-05-16 10:27 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Tue, 2011-05-03 at 02:28 -0700, Paul Turner wrote:
> +unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;

What happens when the period is smaller than the slice?

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 06/15] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth
  2011-05-03  9:28 ` [patch 06/15] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth Paul Turner
  2011-05-10  7:22   ` Hidetoshi Seto
  2011-05-16 10:27   ` Peter Zijlstra
@ 2011-05-16 10:32   ` Peter Zijlstra
  2 siblings, 0 replies; 129+ messages in thread
From: Peter Zijlstra @ 2011-05-16 10:32 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Tue, 2011-05-03 at 02:28 -0700, Paul Turner wrote:
>  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int
> overrun)
>  {
> -       return 1;
> +       u64 quota, runtime = 0;
> +       int idle = 0;
> +
> +       raw_spin_lock(&cfs_b->lock);
> +       quota = cfs_b->quota;
> +
> +       if (quota != RUNTIME_INF) {
> +               runtime = quota;
> +               cfs_b->runtime = runtime;
> +
> +               idle = cfs_b->idle;
> +               cfs_b->idle = 1;
> +       }
> +       raw_spin_unlock(&cfs_b->lock);
> +
> +       return idle;
>  } 

Shouldn't that also return 'idle' when quota is INF? No point in keeping
that timer ticking when there's no actual accounting anymore.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 07/15] sched: expire invalid runtime
  2011-05-03  9:28 ` [patch 07/15] sched: expire invalid runtime Paul Turner
  2011-05-10  7:22   ` Hidetoshi Seto
@ 2011-05-16 11:05   ` Peter Zijlstra
  2011-05-16 11:07   ` Peter Zijlstra
  2 siblings, 0 replies; 129+ messages in thread
From: Peter Zijlstra @ 2011-05-16 11:05 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

On Tue, 2011-05-03 at 02:28 -0700, Paul Turner wrote:
> +       cfs_rq->runtime_expires = max(cfs_rq->runtime_expires, expires);

That doesn't work well when the clock wraps.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 07/15] sched: expire invalid runtime
  2011-05-03  9:28 ` [patch 07/15] sched: expire invalid runtime Paul Turner
  2011-05-10  7:22   ` Hidetoshi Seto
  2011-05-16 11:05   ` Peter Zijlstra
@ 2011-05-16 11:07   ` Peter Zijlstra
  2 siblings, 0 replies; 129+ messages in thread
From: Peter Zijlstra @ 2011-05-16 11:07 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

On Tue, 2011-05-03 at 02:28 -0700, Paul Turner wrote:
> With the global quota pool, one challenge is determining when the runtime we
> have received from it is still valid.  Fortunately we can take advantage of
> sched_clock synchronization around the jiffy to do this cheaply.
> 
> The one catch is that we don't know whether our local clock is behind or ahead
> of the cpu setting the expiration time (relative to its own clock).
> 
> Fortunately we can detect which of these is the case by determining whether the
> global deadline has advanced.  If it has not, then we assume we are behind, and
> advance our local expiration; otherwise, we know the deadline has truly passed
> and we expire our local runtime.

This needs a few words explaining why we need to do all this. It only
explains the how of it.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 04/15] sched: validate CFS quota hierarchies
  2011-05-16  9:43   ` Peter Zijlstra
@ 2011-05-16 12:32     ` Paul Turner
  2011-05-17 15:26       ` Peter Zijlstra
  0 siblings, 1 reply; 129+ messages in thread
From: Paul Turner @ 2011-05-16 12:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

On Mon, May 16, 2011 at 2:43 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> On Tue, 2011-05-03 at 02:28 -0700, Paul Turner wrote:
> > This behavior may be disabled (allowing child bandwidth to exceed parent) via
> > kernel.sched_cfs_bandwidth_consistent=0
>
> why? this needs very good justification.

I think this got lost in earlier discussion, but there are two useful
use-cases for it:

Posting the (condensed) relevant snippet:
-----------------------------------------------------------
Consider:

- I have some application that I want to limit to 3 cpus.
I have 2 workers in that application; across a period I would like
those workers to use a maximum of, say, 2.5 cpus each (suppose they
serve some sort of co-processor request per user and we want to
prevent a single user eating our entire limit and starving out
everything else).

The goal in this case is not preventing increasing availability within a
given limit, while not destroying the (relatively) work-conserving aspect of
its performance in general.

(...)

- There's also the case of managing an abusive user; use cases such
as the above mean that users can usefully be given write permission
to their relevant sub-hierarchy.

If the system size changes, or a user becomes newly abusive, then being
able to set a non-conformant constraint avoids the adversarial problem of
having to find all of their (possibly maliciously large) limits and bring
them within the global limit.
-----------------------------------------------------------
(Previously: https://lkml.org/lkml/2011/2/24/477)

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 05/15] sched: add a timer to handle CFS bandwidth refresh
  2011-05-16 10:18   ` Peter Zijlstra
@ 2011-05-16 12:56     ` Paul Turner
  0 siblings, 0 replies; 129+ messages in thread
From: Paul Turner @ 2011-05-16 12:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

On Mon, May 16, 2011 at 3:18 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2011-05-03 at 02:28 -0700, Paul Turner wrote:
>> @@ -1003,6 +1003,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
>>
>>         if (cfs_rq->nr_running == 1)
>>                 list_add_leaf_cfs_rq(cfs_rq);
>> +
>> +       start_cfs_bandwidth(cfs_rq);
>>  }
>>
>>  static void __clear_buddies_last(struct sched_entity *se)
>> @@ -1220,6 +1222,8 @@ static void put_prev_entity(struct cfs_r
>>                 update_stats_wait_start(cfs_rq, prev);
>>                 /* Put 'current' back into the tree. */
>>                 __enqueue_entity(cfs_rq, prev);
>> +
>> +               start_cfs_bandwidth(cfs_rq);
>>         }
>>         cfs_rq->curr = NULL;
>>  }
>
> OK, so while the first made sense the second had me go wtf?!, now I
> _think_ you do that because do_sched_cfs_period_timer() can return idle
> and stop the timer when no bandwidth consumption is seen for a while,
> and thus we need to re-start the timer when we put the entity to sleep,
> since that could have been a throttle.
>
> If that's so then neither really makes sense and a big fat comment is
> missing.
>
> So why not start on the same (but inverse) condition that makes it stop?
>

This was originally to guard against the case where an entity was running
on stale quota (from a previous period), resulting in cfs_bandwidth->idle
being set and the timer not being restarted.

Now that expiration is properly integrated I think the two cases are
analogous and that this can be dropped (and the start moved into the
(nr_running == 1) entity case on enqueue).
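
Concretely, the placement described would look something like (sketch
only, using the names from the quoted hunk):

	if (cfs_rq->nr_running == 1) {
		list_add_leaf_cfs_rq(cfs_rq);
		start_cfs_bandwidth(cfs_rq);
	}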

I think this is correct but my brain's a little fuzzy right now, will
confirm in the morning.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 06/15] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth
  2011-05-16 10:27   ` Peter Zijlstra
@ 2011-05-16 12:59     ` Paul Turner
  2011-05-17 15:28       ` Peter Zijlstra
  0 siblings, 1 reply; 129+ messages in thread
From: Paul Turner @ 2011-05-16 12:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Mon, May 16, 2011 at 3:27 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2011-05-03 at 02:28 -0700, Paul Turner wrote:
>> +unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
>
> What happens when the period is smaller than the slice?
>

We'll always take at most whatever's left in this case.
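
Roughly, the assignment path looks like this (a sketch, field names
approximately as in the series):

	u64 slice = sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
	u64 amount = min(slice, cfs_b->runtime);	/* never more than remains this period */

	cfs_b->runtime -= amount;
	cfs_rq->runtime_remaining += amount;

so when the period is smaller than the slice, the request is simply
clamped to whatever quota remains in that period.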

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 08/15] sched: throttle cfs_rq entities which exceed their local runtime
  2011-05-03  9:28 ` [patch 08/15] sched: throttle cfs_rq entities which exceed their local runtime Paul Turner
  2011-05-10  7:23   ` Hidetoshi Seto
@ 2011-05-16 15:58   ` Peter Zijlstra
  2011-05-16 16:05   ` Peter Zijlstra
  2 siblings, 0 replies; 129+ messages in thread
From: Peter Zijlstra @ 2011-05-16 15:58 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Tue, 2011-05-03 at 02:28 -0700, Paul Turner wrote:
> +       /*
> +        * it's possible active load balance has forced a throttled cfs_rq to
> +        * run again, we don't want to re-throttle in this case.
> +        */
> +       if (cfs_rq_throttled(cfs_rq))
> +               return; 

expand a little on this, why would load-balancing interact with a
throttled group? load-balancing should fully ignore these things,
they're not runnable after all.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 08/15] sched: throttle cfs_rq entities which exceed their local runtime
  2011-05-03  9:28 ` [patch 08/15] sched: throttle cfs_rq entities which exceed their local runtime Paul Turner
  2011-05-10  7:23   ` Hidetoshi Seto
  2011-05-16 15:58   ` Peter Zijlstra
@ 2011-05-16 16:05   ` Peter Zijlstra
  2 siblings, 0 replies; 129+ messages in thread
From: Peter Zijlstra @ 2011-05-16 16:05 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Tue, 2011-05-03 at 02:28 -0700, Paul Turner wrote:
> +       task_delta = -cfs_rq->h_nr_running;
> +       for_each_sched_entity(se) {
> +               struct cfs_rq *qcfs_rq = cfs_rq_of(se);
> +               /* throttled entity or throttle-on-deactivate */
> +               if (!se->on_rq)
> +                       break;
> +
> +               if (dequeue)
> +                       dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
> +               qcfs_rq->h_nr_running += task_delta;
> +
> +               if (qcfs_rq->load.weight)
> +                       dequeue = 0;
> +       }
> +
> +       if (!se)
> +               rq->nr_running += task_delta; 

So throttle is like dequeue, it removes tasks, so why then insist on
writing it like it's adding tasks? (I see you're adding a negative
number, but it's all just weird).
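
The same hunk written as the subtraction it is -- just the sign of
task_delta flipped, nothing else changed:

	task_delta = cfs_rq->h_nr_running;
	for_each_sched_entity(se) {
		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
		/* throttled entity or throttle-on-deactivate */
		if (!se->on_rq)
			break;

		if (dequeue)
			dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
		qcfs_rq->h_nr_running -= task_delta;

		if (qcfs_rq->load.weight)
			dequeue = 0;
	}

	if (!se)
		rq->nr_running -= task_delta;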

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 10/15] sched: allow for positional tg_tree walks
  2011-05-03  9:28 ` [patch 10/15] sched: allow for positional tg_tree walks Paul Turner
  2011-05-10  7:24   ` Hidetoshi Seto
@ 2011-05-17 13:31   ` Peter Zijlstra
  2011-05-18  7:18     ` Paul Turner
  1 sibling, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2011-05-17 13:31 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

On Tue, 2011-05-03 at 02:28 -0700, Paul Turner wrote:
> plain text document attachment (sched-bwc-refactor-walk_tg_tree.patch)
> Extend walk_tg_tree to accept a positional argument
> 
> static int walk_tg_tree_from(struct task_group *from,
> 			     tg_visitor down, tg_visitor up, void *data)
> 
> Existing semantics are preserved, caller must hold rcu_lock() or sufficient
> analogue.
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> ---
>  kernel/sched.c |   34 +++++++++++++++++++++++-----------
>  1 file changed, 23 insertions(+), 11 deletions(-)
> 
> Index: tip/kernel/sched.c
> ===================================================================
> --- tip.orig/kernel/sched.c
> +++ tip/kernel/sched.c
> @@ -1430,21 +1430,19 @@ static inline void dec_cpu_load(struct r
>  #if (defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)) || defined(CONFIG_RT_GROUP_SCHED)
>  typedef int (*tg_visitor)(struct task_group *, void *);
>  
> -/*
> - * Iterate the full tree, calling @down when first entering a node and @up when
> - * leaving it for the final time.
> - */
> -static int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)
> +/* Iterate task_group tree rooted at *from */
> +static int walk_tg_tree_from(struct task_group *from,
> +			     tg_visitor down, tg_visitor up, void *data)
>  {
>  	struct task_group *parent, *child;
>  	int ret;
>  
> -	rcu_read_lock();
> -	parent = &root_task_group;
> +	parent = from;
> +
>  down:
>  	ret = (*down)(parent, data);
>  	if (ret)
> -		goto out_unlock;
> +		goto out;
>  	list_for_each_entry_rcu(child, &parent->children, siblings) {
>  		parent = child;
>  		goto down;
> @@ -1453,14 +1451,28 @@ up:
>  		continue;
>  	}
>  	ret = (*up)(parent, data);
> -	if (ret)
> -		goto out_unlock;
> +	if (ret || parent == from)
> +		goto out;
>  
>  	child = parent;
>  	parent = parent->parent;
>  	if (parent)
>  		goto up;
> -out_unlock:
> +out:
> +	return ret;
> +}
> +
> +/*
> + * Iterate the full tree, calling @down when first entering a node and @up when
> + * leaving it for the final time.
> + */
> +
> +static inline int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)
> +{
> +	int ret;
> +
> +	rcu_read_lock();
> +	ret = walk_tg_tree_from(&root_task_group, down, up, data);
>  	rcu_read_unlock();
>  
>  	return ret;

I don't much like the different locking rules for these two functions. I
don't much care which you pick, but please make them consistent.


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 04/15] sched: validate CFS quota hierarchies
  2011-05-16 12:32     ` Paul Turner
@ 2011-05-17 15:26       ` Peter Zijlstra
  2011-05-18  7:16         ` Paul Turner
  0 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2011-05-17 15:26 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

On Mon, 2011-05-16 at 05:32 -0700, Paul Turner wrote:
> On Mon, May 16, 2011 at 2:43 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >
> > On Tue, 2011-05-03 at 02:28 -0700, Paul Turner wrote:
> > > This behavior may be disabled (allowing child bandwidth to exceed parent) via
> > > kernel.sched_cfs_bandwidth_consistent=0
> >
> > why? this needs very good justification.
> 
> I think it was lost in other discussion before, but I think there are
> two useful use-cases for it:
> 
> Posting (condensed) relevant snippet:

Such stuff should really live in the changelog

> -----------------------------------------------------------
> Consider:
> 
> - I have some application that I want to limit to 3 cpus
> I have a 2 workers in that application, across a period I would like
> those workers to use a maximum of say 2.5 cpus each (suppose they
> serve some sort of co-processor request per user and we want to
> prevent a single user eating our entire limit and starving out
> everything else).
> 
> The goal in this case is not preventing increasing availability within a
> given limit, while not destroying the (relatively) work-conserving aspect of
> its performance in general.
> 
> (...)
> 
> - There's also the case of managing an abusive user, use cases such
> as the above means that users can usefully be given write permission
> to their relevant sub-hierarchy.
> 
> If the system size changes, or a user becomes newly abusive then being
> able to set non-conformant constraint avoids the adversarial problem of having
> to find and bring all of their set (possibly maliciously large) limits
> within the global limit.
> -----------------------------------------------------------


But what about those where they want both behaviours on the same machine
but for different sub-trees?

Also, without the constraints, what does the hierarchy mean?

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 06/15] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth
  2011-05-16 12:59     ` Paul Turner
@ 2011-05-17 15:28       ` Peter Zijlstra
  2011-05-18  7:02         ` Paul Turner
  0 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2011-05-17 15:28 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Mon, 2011-05-16 at 05:59 -0700, Paul Turner wrote:
> On Mon, May 16, 2011 at 3:27 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > On Tue, 2011-05-03 at 02:28 -0700, Paul Turner wrote:
> >> +unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
> >
> > What happens when the period is smaller than the slice?
> >
> 
> We'll always take at most whatever's left in this case.

Right, saw that, but it might be good to have a little comment
explaining the interaction between the slice and the period things.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 06/15] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth
  2011-05-17 15:28       ` Peter Zijlstra
@ 2011-05-18  7:02         ` Paul Turner
  0 siblings, 0 replies; 129+ messages in thread
From: Paul Turner @ 2011-05-18  7:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Tue, May 17, 2011 at 8:28 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Mon, 2011-05-16 at 05:59 -0700, Paul Turner wrote:
>> On Mon, May 16, 2011 at 3:27 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>> > On Tue, 2011-05-03 at 02:28 -0700, Paul Turner wrote:
>> >> +unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
>> >
>> > What happens when the period is smaller than the slice?
>> >
>>
>> We'll always take at most whatever's left in this case.
>
> Right, saw that, but it might be good to have a little comment
> explaining the interaction between the slice and the period things.
>

Oh, sure -- easy enough :)

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 04/15] sched: validate CFS quota hierarchies
  2011-05-17 15:26       ` Peter Zijlstra
@ 2011-05-18  7:16         ` Paul Turner
  2011-05-18 11:57           ` Peter Zijlstra
  0 siblings, 1 reply; 129+ messages in thread
From: Paul Turner @ 2011-05-18  7:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

On Tue, May 17, 2011 at 8:26 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Mon, 2011-05-16 at 05:32 -0700, Paul Turner wrote:
>> On Mon, May 16, 2011 at 2:43 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>> >
>> > On Tue, 2011-05-03 at 02:28 -0700, Paul Turner wrote:
>> > > This behavior may be disabled (allowing child bandwidth to exceed parent) via
>> > > kernel.sched_cfs_bandwidth_consistent=0
>> >
>> > why? this needs very good justification.
>>
>> I think it was lost in other discussion before, but I think there are
>> two useful use-cases for it:
>>
>> Posting (condensed) relevant snippet:
>
> Such stuff should really live in the changelog
>

Given the discussion below it would seem to make sense to split the CL
into one part that adds the consistency checking, and (potentially,
depending on the discussion below) another that provides these state
semantics.  This would also give us a chance to clearly call these
details out in the commit description.

>> -----------------------------------------------------------
>> Consider:
>>
>> - I have some application that I want to limit to 3 cpus
>> I have a 2 workers in that application, across a period I would like
>> those workers to use a maximum of say 2.5 cpus each (suppose they
>> serve some sort of co-processor request per user and we want to
>> prevent a single user eating our entire limit and starving out
>> everything else).
>>
>> The goal in this case is not preventing increasing availability within a
>> given limit, while not destroying the (relatively) work-conserving aspect of
>> its performance in general.
>>
>> (...)
>>
>> - There's also the case of managing an abusive user, use cases such
>> as the above means that users can usefully be given write permission
>> to their relevant sub-hierarchy.
>>
>> If the system size changes, or a user becomes newly abusive then being
>> able to set non-conformant constraint avoids the adversarial problem of having
>> to find and bring all of their set (possibly maliciously large) limits
>> within the global limit.
>> -----------------------------------------------------------
>
>
> But what about those where they want both behaviours on the same machine
> but for different sub-trees?

I originally considered a per-tg tunable.  I made the assumption that
users would either handle this themselves (=0) or rely on the kernel
to do it (=1).  There are some additional complexities that led me to
withdraw from the per-cg approach in this pass, given the known
resistance to it.

One concern was the potential ambiguity in the nesting of these values.

When an inconsistent entity is nested under a consistent one:

A) Do we allow this?
B) How do we treat it?

I think if this was the case that it would make sense to allow it and
that each inconsistent entity should effectively be treated as
terminal from the parent's point of view, and as the new root from the
child's point of view.

Does this make sense?  While this is the most intuitive definition for
me there are certainly several other interpretations that could be
argued for.

Would you prefer this approach be taken to consistency vs at a global
level?  Do the use-cases above have sufficient merit that we even make
this an option in the first place?  Should we just always force
hierarchies to be consistent instead?  I'm open on this.

>
> Also, without the constraints, what does the hierarchy mean?
>

It's still an upper-bound for usage, however it may not be achievable
in an inconsistent hierarchy.  Whereas in a consistent one it should
always be achievable.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 10/15] sched: allow for positional tg_tree walks
  2011-05-17 13:31   ` Peter Zijlstra
@ 2011-05-18  7:18     ` Paul Turner
  0 siblings, 0 replies; 129+ messages in thread
From: Paul Turner @ 2011-05-18  7:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

On Tue, May 17, 2011 at 6:31 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2011-05-03 at 02:28 -0700, Paul Turner wrote:
>> plain text document attachment (sched-bwc-refactor-walk_tg_tree.patch)
>> Extend walk_tg_tree to accept a positional argument
>>
>> static int walk_tg_tree_from(struct task_group *from,
>>                            tg_visitor down, tg_visitor up, void *data)
>>
>> Existing semantics are preserved, caller must hold rcu_lock() or sufficient
>> analogue.
>>
>> Signed-off-by: Paul Turner <pjt@google.com>
>> ---
>>  kernel/sched.c |   34 +++++++++++++++++++++++-----------
>>  1 file changed, 23 insertions(+), 11 deletions(-)
>>
>> Index: tip/kernel/sched.c
>> ===================================================================
>> --- tip.orig/kernel/sched.c
>> +++ tip/kernel/sched.c
>> @@ -1430,21 +1430,19 @@ static inline void dec_cpu_load(struct r
>>  #if (defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)) || defined(CONFIG_RT_GROUP_SCHED)
>>  typedef int (*tg_visitor)(struct task_group *, void *);
>>
>> -/*
>> - * Iterate the full tree, calling @down when first entering a node and @up when
>> - * leaving it for the final time.
>> - */
>> -static int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)
>> +/* Iterate task_group tree rooted at *from */
>> +static int walk_tg_tree_from(struct task_group *from,
>> +                          tg_visitor down, tg_visitor up, void *data)
>>  {
>>       struct task_group *parent, *child;
>>       int ret;
>>
>> -     rcu_read_lock();
>> -     parent = &root_task_group;
>> +     parent = from;
>> +
>>  down:
>>       ret = (*down)(parent, data);
>>       if (ret)
>> -             goto out_unlock;
>> +             goto out;
>>       list_for_each_entry_rcu(child, &parent->children, siblings) {
>>               parent = child;
>>               goto down;
>> @@ -1453,14 +1451,28 @@ up:
>>               continue;
>>       }
>>       ret = (*up)(parent, data);
>> -     if (ret)
>> -             goto out_unlock;
>> +     if (ret || parent == from)
>> +             goto out;
>>
>>       child = parent;
>>       parent = parent->parent;
>>       if (parent)
>>               goto up;
>> -out_unlock:
>> +out:
>> +     return ret;
>> +}
>> +
>> +/*
>> + * Iterate the full tree, calling @down when first entering a node and @up when
>> + * leaving it for the final time.
>> + */
>> +
>> +static inline int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)
>> +{
>> +     int ret;
>> +
>> +     rcu_read_lock();
>> +     ret = walk_tg_tree_from(&root_task_group, down, up, data);
>>       rcu_read_unlock();
>>
>>       return ret;
>
> I don't much like the different locking rules for these two functions. I
> don't much care which you pick, but please make them consistent.
>

Reasonable, given the call sites it would seem to make more sense to
make things consistent in the direction of depending on having the
caller do the locking.  Will update.

>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 04/15] sched: validate CFS quota hierarchies
  2011-05-18  7:16         ` Paul Turner
@ 2011-05-18 11:57           ` Peter Zijlstra
  0 siblings, 0 replies; 129+ messages in thread
From: Peter Zijlstra @ 2011-05-18 11:57 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

On Wed, 2011-05-18 at 00:16 -0700, Paul Turner wrote:

> >
> > But what about those where they want both behaviours on the same machine
> > but for different sub-trees?
> 
> I originally considered a per-tg tunable.  I made the assumption that
> users would either handle this themselves (=0) or rely on the kernel
> to do it (=1).  There are some additional complexities that lead me to
> withdraw from the per-cg approach in this pass given the known
> resistance to it.

Yeah, that's quite horrid too, you chose wisely by not going there ;-)

> One concern was the potential ambiguity in the nesting of these values.
> 
> When an inconsistent entity is nested under a consistent one:
> 
> A) Do we allow this?
> B) How do we treat it?
> 
> I think if this was the case that it would make sense to allow it and
> that each inconsistent entity should effectively be treated as
> terminal from the parent's point of view, and as the new root from the
> child's point of view.
> 
> Does this make sense?  While this is the most intuitive definition for
> me there are certainly several other interpretations that could be
> argued for.

I'm not quite sure I get it, so what you're saying is: where the
semantics are violated we draw a border and we only look at local
consistency, thereby side-stepping the whole problem.

Doesn't fly for me, also, see below, by not having any invariants you
don't have clear semantics at all.

> Would you prefer this approach be taken to consistency vs at a global
> level?  Do the use-cases above have sufficient merit that we even make
> this an option in the first place?  Should we just always force
> hierarchies to be consistent instead?  I'm open on this.

Yeah, I think the use cases do make sense, it's just that I don't like
the two different semantics and the confusion that goes with it.

> >
> > Also, without the constraints, what does the hierarchy mean?
> >
> 
> It's still an upper-bound for usage, however it may not be achievable
> in an inconsistent hierarchy.  Whereas in a consistent one it should
> always be achievable. 

See, that doesn't quite make sense to me; if it's not achievable it's
simply not, and the meaning is no more.


So lets consider these cases again:

> - I have some application that I want to limit to 3 cpus
> I have a 2 workers in that application, across a period I would like
> those workers to use a maximum of say 2.5 cpus each (suppose they
> serve some sort of co-processor request per user and we want to
> prevent a single user eating our entire limit and starving out
> everything else).
>
> The goal in this case is not preventing increasing availability within a
> given limit, while not destroying the (relatively) work-conserving aspect of
> its performance in general.

So the problem here is that 2.5+2.5 > 3, right? So maybe our constraint
isn't quite right, since clearly the whole SCHED_OTHER bandwidth crap
has the purpose of allowing overload.

What about instead of using \Sum u_i <= U, we use max(u_i) <= U? That
would allow the above case, and mean that the bandwidth limit placed on
the parent is the maximum allowed limit in that subtree. In overload
situations things go back to proportional parts of the subtree limit.
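
(Worked on the example above, with u_i = quota_i / period_i in cpus: the
sum form rejects the setup, since 2.5 + 2.5 = 5 > 3, while the max form
accepts it, since max(2.5, 2.5) = 2.5 <= 3; under overload the two
workers then fall back to proportional parts of the parent's 3 cpus.)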

> >> - There's also the case of managing an abusive user, use cases such
> >> as the above means that users can usefully be given write permission
> >> to their relevant sub-hierarchy.
> >>
> >> If the system size changes, or a user becomes newly abusive then being
> >> able to set non-conformant constraint avoids the adversarial problem of having
> >> to find and bring all of their set (possibly maliciously large) limits
> >> within the global limit.

Right, so this example is a little more contrived in that if you had
managed it from the get-go the problem wouldn't be that big (you'd have
had sane limits to begin with).

So one solution is to co-mount the freezer cgroup with your cpu cgroup
and simply freeze the whole subtree while you sort out the settings :-)

Another possibility would be to allow something like:

 $ echo force:50000 > cfs_quota_us

Where the "force:" thing requires CAP_SYS_ADMIN and updates the entire
sub-tree such that the above invariant is kept.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned
  2011-05-03  9:28 [patch 00/15] CFS Bandwidth Control V6 Paul Turner
                   ` (14 preceding siblings ...)
  2011-05-03  9:29 ` [patch 15/15] sched: add documentation for bandwidth control Paul Turner
@ 2011-06-07 15:45 ` Kamalesh Babulal
  2011-06-08  3:09   ` Paul Turner
                     ` (2 more replies)
  2011-06-14  6:58 ` [patch 00/15] CFS Bandwidth Control V6 Hu Tao
  16 siblings, 3 replies; 129+ messages in thread
From: Kamalesh Babulal @ 2011-06-07 15:45 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Ingo Molnar, Pavel Emelyanov

Hi All,

    In our test environment, while testing the CFS Bandwidth V6 patch set
on top of 55922c9d1b84, we observed CPU idle time of between 30% and 40%
while running a CPU-bound test with the cgroup tasks not pinned to the
CPUs.  In the inverse case, where the cgroup tasks are pinned to the CPUs,
the idle time seen is nearly zero.

Test Scenario
--------------
- 5 cgroups are created, with each group assigned 2, 2, 4, 8, 16 tasks respectively.
- Each cgroup has N sub-cgroups created, where N is the NR_TASKS the cgroup
  is assigned.  E.g., cgroup1 will create two sub-cgroups under it and assign
  one task per sub-group.
				------------
				| cgroup 1 |
				------------
				 /        \
				/          \
			  --------------  --------------
			  |sub-cgroup 1|  |sub-cgroup 2|
			  | (task 1)   |  | (task 2)   |
			  --------------  --------------

- The top cgroups are given unlimited quota (cpu.cfs_quota_us = -1) and a period of 500ms
  (cpu.cfs_period_us = 500000), whereas the sub-cgroups are given 250ms of quota
  (cpu.cfs_quota_us = 250000) and a period of 500ms.  I.e. the top cgroups have
  unlimited bandwidth, whereas the sub-groups are throttled every 250ms.

- Additionally, if required, proportional CPU shares can be assigned to cpu.shares
  as NR_TASKS * 1024, i.e. cgroup1 with 2 tasks gets 2 * 1024 = 2048 worth of
  cpu.shares.  (In the test results published below, all cgroups and sub-cgroups
  are given an equal share of 1024.)

- One CPU bound while(1) task is attached to each sub-cgroup.

- sum-exec time for each cgroup/sub-cgroup is captured from /proc/sched_debug after
  60 seconds and analyzed for the run time of the tasks, i.e. per sub-cgroup.

How is the idle CPU time measured ?
------------------------------------
- vmstat statistics are logged every 2 seconds, from the point the last while1 task
  is attached to the 16th sub-cgroup of cgroup 5 until the 60 sec run is over.  After
  the run, the CPU idle% is calculated by summing the idle column from the vmstat log
  and dividing it by the number of samples collected, after discarding the first
  record from the log.

How are the tasks pinned to the CPU ?
-------------------------------------
- The cgroup hierarchy is mounted with the cpuset,cpu controllers, and one physical
  CPU is allocated for every 2 sub-cgroups.  E.g. CPU 1 is shared between 1/1 and 1/2
  (group 1, sub-cgroup 1 and sub-cgroup 2).  Similarly, CPUs 7 to 15 are allocated to
  5/1 through 5/16 (group 5, sub-groups 1 to 16).  Note that the test machine used
  has 16 CPUs.

Result for the non-pinning case
-------------------------------
Only the hierarchy is created as stated above and cpusets are not assigned per cgroup.

Average CPU Idle percentage 34.8% (measured as explained above)
Bandwidth shared with remaining non-Idle 65.2%

* Note: to preserve precision against round-off, the values are multiplied by 100.

In the result below, for cgroup1, 9.2500 corresponds to the sum-exec time captured
from /proc/sched_debug for cgroup 1 tasks (including sub-cgroups 1 and 2), which is
in turn 6.03% of the non-Idle CPU time (derived as 9.2500 * 65.2 / 100).

Bandwidth of Group 1 = 9.2500 i.e = 6.0300% of non-Idle CPU time 65.2%
|...... subgroup 1/1	= 48.7800	i.e = 2.9400% of 6.0300% Groups non-Idle CPU time
|...... subgroup 1/2	= 51.2100	i.e = 3.0800% of 6.0300% Groups non-Idle CPU time
 
 
Bandwidth of Group 2 = 9.0400 i.e = 5.8900% of non-Idle CPU time 65.2%
|...... subgroup 2/1	= 51.0200	i.e = 3.0000% of 5.8900% Groups non-Idle CPU time
|...... subgroup 2/2	= 48.9700	i.e = 2.8800% of 5.8900% Groups non-Idle CPU time
 
 
Bandwidth of Group 3 = 16.9300 i.e = 11.0300% of non-Idle CPU time 65.2%
|...... subgroup 3/1	= 26.0300	i.e = 2.8700% of 11.0300% Groups non-Idle CPU time
|...... subgroup 3/2	= 25.8800	i.e = 2.8500% of 11.0300% Groups non-Idle CPU time
|...... subgroup 3/3	= 22.7800	i.e = 2.5100% of 11.0300% Groups non-Idle CPU time
|...... subgroup 3/4	= 25.2900	i.e = 2.7800% of 11.0300% Groups non-Idle CPU time
 
 
Bandwidth of Group 4 = 27.9300 i.e = 18.2100% of non-Idle CPU time 65.2%
|...... subgroup 4/1	= 16.6000	i.e = 3.0200% of 18.2100% Groups non-Idle CPU time
|...... subgroup 4/2	= 8.0000	i.e = 1.4500% of 18.2100% Groups non-Idle CPU time
|...... subgroup 4/3	= 9.0000	i.e = 1.6300% of 18.2100% Groups non-Idle CPU time
|...... subgroup 4/4	= 7.9600	i.e = 1.4400% of 18.2100% Groups non-Idle CPU time
|...... subgroup 4/5	= 12.3500	i.e = 2.2400% of 18.2100% Groups non-Idle CPU time
|...... subgroup 4/6	= 16.2500	i.e = 2.9500% of 18.2100% Groups non-Idle CPU time
|...... subgroup 4/7	= 12.6100	i.e = 2.2900% of 18.2100% Groups non-Idle CPU time
|...... subgroup 4/8	= 17.1900	i.e = 3.1300% of 18.2100% Groups non-Idle CPU time
 
 
Bandwidth of Group 5 = 36.8300 i.e = 24.0100% of non-Idle CPU time 65.2%
|...... subgroup 5/1	= 56.6900	i.e = 13.6100%	of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/2	= 8.8600	i.e = 2.1200% 	of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/3	= 5.5100	i.e = 1.3200% 	of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/4	= 4.5700	i.e = 1.0900%	of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/5	= 7.9500	i.e = 1.9000%	of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/6	= 2.1600	i.e = .5100%	of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/7	= 2.3400	i.e = .5600%	of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/8	= 2.1500	i.e = .5100%	of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/9	= 9.7200	i.e = 2.3300% 	of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/10	= 5.0600	i.e = 1.2100% 	of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/11	= 4.6900	i.e = 1.1200% 	of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/12	= 8.9700	i.e = 2.1500% 	of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/13	= 8.4600	i.e = 2.0300% 	of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/14	= 11.8400	i.e = 2.8400% 	of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/15	= 6.3400	i.e = 1.5200% 	of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/16	= 5.1500	i.e = 1.2300% 	of 24.0100% Groups non-Idle CPU time

Pinned case
--------------
CPU hierarchy is created and cpusets are allocated.

Average CPU Idle percentage 0%
Bandwidth shared with remaining non-Idle 100%

Bandwidth of Group 1 = 6.3400 i.e = 6.3400% of non-Idle CPU time 100%
|...... subgroup 1/1	= 50.0400	i.e = 3.1700% of 6.3400% Groups non-Idle CPU time
|...... subgroup 1/2	= 49.9500	i.e = 3.1600% of 6.3400% Groups non-Idle CPU time
 
 
Bandwidth of Group 2 = 6.3200 i.e = 6.3200% of non-Idle CPU time 100%
|...... subgroup 2/1	= 50.0400	i.e = 3.1600% of 6.3200% Groups non-Idle CPU time
|...... subgroup 2/2	= 49.9500	i.e = 3.1500% of 6.3200% Groups non-Idle CPU time
 
 
Bandwidth of Group 3 = 12.6300 i.e = 12.6300% of non-Idle CPU time 100%
|...... subgroup 3/1	= 25.0300	i.e = 3.1600% of 12.6300% Groups non-Idle CPU time
|...... subgroup 3/2	= 25.0100	i.e = 3.1500% of 12.6300% Groups non-Idle CPU time
|...... subgroup 3/3	= 25.0000	i.e = 3.1500% of 12.6300% Groups non-Idle CPU time
|...... subgroup 3/4	= 24.9400	i.e = 3.1400% of 12.6300% Groups non-Idle CPU time
 
 
Bandwidth of Group 4 = 25.1000 i.e = 25.1000% of non-Idle CPU time 100%
|...... subgroup 4/1	= 12.5400	i.e = 3.1400% of 25.1000% Groups non-Idle CPU time
|...... subgroup 4/2	= 12.5100	i.e = 3.1400% of 25.1000% Groups non-Idle CPU time
|...... subgroup 4/3	= 12.5300	i.e = 3.1400% of 25.1000% Groups non-Idle CPU time
|...... subgroup 4/4	= 12.5000	i.e = 3.1300% of 25.1000% Groups non-Idle CPU time
|...... subgroup 4/5	= 12.4900	i.e = 3.1300% of 25.1000% Groups non-Idle CPU time
|...... subgroup 4/6	= 12.4700	i.e = 3.1200% of 25.1000% Groups non-Idle CPU time
|...... subgroup 4/7	= 12.4700	i.e = 3.1200% of 25.1000% Groups non-Idle CPU time
|...... subgroup 4/8	= 12.4500	i.e = 3.1200% of 25.1000% Groups non-Idle CPU time
 
 
Bandwidth of Group 5 = 49.5700 i.e = 49.5700% of non-Idle CPU time 100%
|...... subgroup 5/1	= 49.8500	i.e = 24.7100% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/2	= 6.2900	i.e = 3.1100% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/3	= 6.2800	i.e = 3.1100% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/4	= 6.2700	i.e = 3.1000% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/5	= 6.2700	i.e = 3.1000% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/6	= 6.2600	i.e = 3.1000% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/7	= 6.2500	i.e = 3.0900% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/8	= 6.2400	i.e = 3.0900% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/9	= 6.2400	i.e = 3.0900% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/10	= 6.2300	i.e = 3.0800% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/11	= 6.2300	i.e = 3.0800% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/12	= 6.2200	i.e = 3.0800% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/13	= 6.2100	i.e = 3.0700% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/14	= 6.2100	i.e = 3.0700% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/15	= 6.2100	i.e = 3.0700% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/16	= 6.2100	i.e = 3.0700% of 49.5700% Groups non-Idle CPU time
 
With equal cpu shares allocated to all the groups/sub-cgroups, and CFS bandwidth configured
to allow 100% CPU utilization, we still see CPU idle time in the un-pinned case.

The benchmark used to reproduce the issue is attached.  Just executing the script should
report similar numbers.

#!/bin/bash

NR_TASKS1=2
NR_TASKS2=2
NR_TASKS3=4
NR_TASKS4=8
NR_TASKS5=16

BANDWIDTH=1
SUBGROUP=1
PRO_SHARES=0
MOUNT=/cgroup/
LOAD=/root/while1

usage()
{
	echo "Usage $0: [-b 0|1] [-s 0|1] [-p 0|1]"
	echo "-b 1|0 set/unset  Cgroups bandwidth control (default set)"
	echo "-s Create sub-groups for every task (default creates sub-group)"
	echo "-p create propotional shares based on cpus"
	exit
}
while getopts ":b:s:p:" arg
do
	case $arg in
	b)
		BANDWIDTH=$OPTARG
		shift
		if [ $BANDWIDTH -gt 1 ] && [ $BANDWIDTH -lt  0 ]
		then
			usage
		fi
		;;
	s)
		SUBGROUP=$OPTARG
		shift
		if [ $SUBGROUP -gt 1 ] && [ $SUBGROUP -lt 0 ]
		then
			usage
		fi
		;;
	p)
		PRO_SHARES=$OPTARG
		shift
		if [ $PRO_SHARES -gt 1 ] && [ $PRO_SHARES -lt 0 ]
		then
			usage
		fi
		;;

	*)

	esac
done
if [ ! -d $MOUNT ]
then
	mkdir -p $MOUNT
fi
test()
{
	echo -n "[ "
	if [ $1 -eq 0 ]
	then
		echo -ne '\E[42;40mOk'
	else
		echo -ne '\E[31;40mFailed'
		tput sgr0
		echo " ]"
		exit
	fi
	tput sgr0
	echo " ]"
}
mount_cgrp()
{
	echo -n "Mounting root cgroup "
	mount -t cgroup -ocpu,cpuset,cpuacct none $MOUNT &> /dev/null
	test $?
}

umount_cgrp()
{
	echo -n "Unmounting root cgroup "
	cd /root/
	umount $MOUNT
	test $?
}

create_hierarchy()
{
	mount_cgrp
	cpuset_mem=`cat $MOUNT/cpuset.mems`
	cpuset_cpu=`cat $MOUNT/cpuset.cpus`
	echo -n "creating groups/sub-groups ..."
	for (( i=1; i<=5; i++ ))
	do
		mkdir $MOUNT/$i
		echo $cpuset_mem > $MOUNT/$i/cpuset.mems
		echo $cpuset_cpu > $MOUNT/$i/cpuset.cpus
		echo -n ".."
		if [ $SUBGROUP -eq 1 ]
		then
			jj=$(eval echo "\$NR_TASKS$i")
			for (( j=1; j<=$jj; j++ ))
			do
				mkdir -p $MOUNT/$i/$j
				echo $cpuset_mem > $MOUNT/$i/$j/cpuset.mems
				echo $cpuset_cpu > $MOUNT/$i/$j/cpuset.cpus
				echo -n ".."
			done
		fi
	done
	echo "."
}

cleanup()
{
	pkill -9 while1 &> /dev/null
	sleep 10
	echo -n "Umount groups/sub-groups .."
	for (( i=1; i<=5; i++ ))
	do
		if [ $SUBGROUP -eq 1 ]
		then
			jj=$(eval echo "\$NR_TASKS$i")
			for (( j=1; j<=$jj; j++ ))
			do
				rmdir $MOUNT/$i/$j
				echo -n ".."
			done
		fi
		rmdir $MOUNT/$i
		echo -n ".."
	done
	echo " "
	umount_cgrp
}

load_tasks()
{
	for (( i=1; i<=5; i++ ))
	do
		jj=$(eval echo "\$NR_TASKS$i")
		shares="1024"
		if [ $PRO_SHARES -eq 1 ]
		then
			eval shares=$(echo "$jj * 1024" | bc)
		fi
		echo $shares > $MOUNT/$i/cpu.shares
		for (( j=1; j<=$jj; j++ ))
		do
			echo "-1" > $MOUNT/$i/cpu.cfs_quota_us
			echo "500000" > $MOUNT/$i/cpu.cfs_period_us
			if [ $SUBGROUP -eq 1 ]
			then

				$LOAD &
				echo $! > $MOUNT/$i/$j/tasks
				echo "1024" > $MOUNT/$i/$j/cpu.shares

				if [ $BANDWIDTH -eq 1 ]
				then
					echo "500000" > $MOUNT/$i/$j/cpu.cfs_period_us
					echo "250000" > $MOUNT/$i/$j/cpu.cfs_quota_us
				fi
			else
				$LOAD & 
				echo $! > $MOUNT/$i/tasks
				echo $shares > $MOUNT/$i/cpu.shares

				if [ $BANDWIDTH -eq 1 ]
				then
					echo "500000" > $MOUNT/$i/cpu.cfs_period_us
					echo "250000" > $MOUNT/$i/cpu.cfs_quota_us
				fi
			fi
		done
	done
	echo "Captuing idle cpu time with vmstat...."
	vmstat 2 100 &> vmstat_log &
}

pin_tasks()
{
	cpu=0
	count=1
	for (( i=1; i<=5; i++ ))
	do
		if [ $SUBGROUP -eq 1 ]
		then
			jj=$(eval echo "\$NR_TASKS$i")
			for (( j=1; j<=$jj; j++ ))
			do
				if [ $count -gt 2 ]
				then
					cpu=$((cpu+1))
					count=1
				fi
				echo $cpu > $MOUNT/$i/$j/cpuset.cpus
				count=$((count+1))
			done
		else
			case $i in
			1)
				echo 0 > $MOUNT/$i/cpuset.cpus;;
			2)
				echo 1 > $MOUNT/$i/cpuset.cpus;;
			3)
				echo "2-3" > $MOUNT/$i/cpuset.cpus;;
			4)
				echo "4-6" > $MOUNT/$i/cpuset.cpus;;
			5)
				echo "7-15" > $MOUNT/$i/cpuset.cpus;;
			esac
		fi
	done
	
}

print_results()
{
	eval gtot=$(cat sched_log|grep -i while|sed 's/R//g'|awk '{gtot+=$7};END{printf "%f", gtot}')
	for (( i=1; i<=5; i++ ))	
	do
		eval temp=$(cat sched_log_$i|sed 's/R//g'| awk '{gtot+=$7};END{printf "%f",gtot}')
		eval tavg=$(echo "scale=4;(($temp / $gtot) * $1)/100 " | bc)
		eval avg=$(echo  "scale=4;($temp / $gtot) * 100" | bc)
		eval pretty_tavg=$( echo "scale=4; $tavg * 100"| bc) # F0r pretty format
		echo "Bandwidth of Group $i = $avg i.e = $pretty_tavg% of non-Idle CPU time $1%"
		if [ $SUBGROUP -eq 1 ]
		then
			jj=$(eval echo "\$NR_TASKS$i")
			for (( j=1; j<=$jj; j++ ))
			do
				eval tmp=$(cat sched_log_$i-$j|sed 's/R//g'| awk '{gtot+=$7};END{printf "%f",gtot}')
				eval stavg=$(echo "scale=4;($tmp / $temp) * 100" | bc)
				eval pretty_stavg=$(echo "scale=4;(($tmp / $temp) * $tavg) * 100" | bc)
				echo -n "|"
				echo -e "...... subgroup $i/$j\t= $stavg\ti.e = $pretty_stavg% of $pretty_tavg% Groups non-Idle CPU time"
			done
		fi
		echo " "
		echo " "
	done
}
capture_results()
{
	cat /proc/sched_debug > sched_log
	pkill -9 vmstat -c
	avg=$(cat vmstat_log |grep -iv "system"|grep -iv "swpd"|awk ' { if ( NR != 1) {id+=$15 }}END{print (id/NR)}')
	
	rem=$(echo "scale=2; 100 - $avg" |bc)
	echo "Average CPU Idle percentage $avg%"	
	echo "Bandwidth shared with remaining non-Idle $rem%" 
	for (( i=1; i<=5; i++ ))
	do
		cat sched_log |grep -i while1|grep -i " \/$i" > sched_log_$i
		if [ $SUBGROUP -eq 1 ]
		then
			jj=$(eval echo "\$NR_TASKS$i")
			for (( j=1; j<=$jj; j++ ))
			do
				cat sched_log |grep -i while1|grep -i " \/$i\/$j" > sched_log_$i-$j
			done
		fi
	done
	print_results $rem
}
create_hierarchy
pin_tasks

load_tasks
sleep 60
capture_results
cleanup
exit

Thanks,
Kamalesh.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned
  2011-06-07 15:45 ` CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned Kamalesh Babulal
@ 2011-06-08  3:09   ` Paul Turner
  2011-06-08 10:46   ` Vladimir Davydov
  2011-06-14 10:16   ` CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned Hidetoshi Seto
  2 siblings, 0 replies; 129+ messages in thread
From: Paul Turner @ 2011-06-08  3:09 UTC (permalink / raw)
  To: Kamalesh Babulal
  Cc: LKML, Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Ingo Molnar,
	Pavel Emelyanov

[ Sorry for the delayed response, I was out on vacation for the second
half of May until last week -- I've now caught up on email and am
preparing the next posting ]

Thanks for the test-case Kamalesh -- my immediate suspicion is that quota
return may not be fine-grained enough (although the numbers provided are
large enough that it's possible there's also just a bug).

I have some tools from my own testing I can use to pull this apart,
let me run your work-load and get back to you.
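
(To be concrete about what I mean by quota return: handing locally
assigned but unused runtime back to the global pool when a cfs_rq goes
idle.  A rough sketch only, with simplified locking and field names
approximately as in the series:

	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
	s64 slack = cfs_rq->runtime_remaining;

	if (slack > 0 && cfs_b->quota != RUNTIME_INF) {
		raw_spin_lock(&cfs_b->lock);
		cfs_b->runtime += slack;	/* make the slack usable elsewhere */
		raw_spin_unlock(&cfs_b->lock);
		cfs_rq->runtime_remaining = 0;
	}

If that only happens at coarse boundaries, unpinned groups could strand
quota on cpus that have gone idle, which would show up as this kind of
idle time.)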

On Tue, Jun 7, 2011 at 8:45 AM, Kamalesh Babulal
<kamalesh@linux.vnet.ibm.com> wrote:
> Hi All,
>
>    In our test environment, while testing the CFS Bandwidth V6 patch set
> on top of 55922c9d1b84. We observed that the CPU's idle time is seen
> between 30% to 40% while running CPU bound test, with the cgroups tasks
> not pinned to the CPU's. Whereas in the inverse case, where the cgroups
> tasks are pinned to the CPU's, the idle time seen is nearly zero.
>
> Test Scenario
> --------------
> - 5 cgroups are created with each groups assigned 2, 2, 4, 8, 16 tasks respectively.
> - Each of the cgroup, has N sub-cgroups created. Where N is the NR_TASKS the cgroup
>  is assigned with. i.e., cgroup1, will create two sub-cgroups under it and assigned
>  one tasks per sub-group.
>                                ------------
>                                | cgroup 1 |
>                                ------------
>                                 /        \
>                                /          \
>                          --------------  --------------
>                          |sub-cgroup 1|  |sub-cgroup 2|
>                          | (task 1)   |  | (task 2)   |
>                          --------------  --------------
>
> - Top cgroup is given unlimited quota (cpu.cfs_quota_us = -1) and period of 500ms
>  (cpu.cfs_period_us = 500000). Whereas the sub-cgroups are given 250ms of quota
>  (cpu.cfs_quota_us = 250000) and period of 500ms. i.e. the top cgroups are given
>  unlimited bandwidth, whereas the sub-group are throttled every 250ms.
>
> - Additional if required the proportional CPU shares can be assigned to cpu.shares
>  as NR_TASKS * 1024. i.e. cgroup1 has 2 tasks * 1024 = 2048 worth cpu.shares
>  for cgroup1. (In the below test results published all cgroups and sub-cgroups
>  are given the equal share of 1024).
>
> - One CPU bound while(1) task is attached to each sub-cgroup.
>
> - sum-exec time for each cgroup/sub-cgroup is captured from /proc/sched_debug after
>  60 seconds and analyzed for the run time of the tasks a.k.a sub-cgroup.
>
> How is the idle CPU time measured ?
> ------------------------------------
> - vmstat stats are logged every 2 seconds, after attaching the last while1 task
>  to 16th sub-cgroup of cgroup 5 till the 60 sec run is over. After the run idle%
>  of a CPU is calculated by summing idle column from the vmstat log and dividing it
>  by number of samples collected, of-course after neglecting the first record
>  from the log.
>
> How are the tasks pinned to the CPU ?
> -------------------------------------
> - cgroup is mounted with cpuset,cpu controller and for every 2 sub-cgroups one
>  physical CPU is allocated. i.e. CPU 1 is allocated between 1/1 and 1/2 (Group 1,
>  sub-cgroup 1 and sub-cgroup 2). Similarly CPUs 7 to 15 are allocated to 15/1 to
>  15/16 (Group 15, subgroup 1 to 16). Note that test machine used to test has
>  16 CPUs.
>
> Result for non-pining case
> ---------------------------
> Only the hierarchy is created as stated above and cpusets are not assigned per cgroup.
>
> Average CPU Idle percentage 34.8% (as explained above in the Idle time measured)
> Bandwidth shared with remaining non-Idle 65.2%
>
> * Note: For the sake of roundoff value the numbers are multiplied by 100.
>
> In the below result for cgroup1 9.2500 corresponds to sum-exec time captured
> from /proc/sched_debug for cgroup 1 tasks (including sub-cgroup 1 and 2).
> Which is in-turn 6% of the non-Idle CPU time (which is derived by 9.2500 * 65.2 / 100 )
>
> Bandwidth of Group 1 = 9.2500 i.e = 6.0300% of non-Idle CPU time 65.2%
> |...... subgroup 1/1    = 48.7800       i.e = 2.9400% of 6.0300% Groups non-Idle CPU time
> |...... subgroup 1/2    = 51.2100       i.e = 3.0800% of 6.0300% Groups non-Idle CPU time
>
>
> Bandwidth of Group 2 = 9.0400 i.e = 5.8900% of non-Idle CPU time 65.2%
> |...... subgroup 2/1    = 51.0200       i.e = 3.0000% of 5.8900% Groups non-Idle CPU time
> |...... subgroup 2/2    = 48.9700       i.e = 2.8800% of 5.8900% Groups non-Idle CPU time
>
>
> Bandwidth of Group 3 = 16.9300 i.e = 11.0300% of non-Idle CPU time 65.2%
> |...... subgroup 3/1    = 26.0300       i.e = 2.8700% of 11.0300% Groups non-Idle CPU time
> |...... subgroup 3/2    = 25.8800       i.e = 2.8500% of 11.0300% Groups non-Idle CPU time
> |...... subgroup 3/3    = 22.7800       i.e = 2.5100% of 11.0300% Groups non-Idle CPU time
> |...... subgroup 3/4    = 25.2900       i.e = 2.7800% of 11.0300% Groups non-Idle CPU time
>
>
> Bandwidth of Group 4 = 27.9300 i.e = 18.2100% of non-Idle CPU time 65.2%
> |...... subgroup 4/1    = 16.6000       i.e = 3.0200% of 18.2100% Groups non-Idle CPU time
> |...... subgroup 4/2    = 8.0000        i.e = 1.4500% of 18.2100% Groups non-Idle CPU time
> |...... subgroup 4/3    = 9.0000        i.e = 1.6300% of 18.2100% Groups non-Idle CPU time
> |...... subgroup 4/4    = 7.9600        i.e = 1.4400% of 18.2100% Groups non-Idle CPU time
> |...... subgroup 4/5    = 12.3500       i.e = 2.2400% of 18.2100% Groups non-Idle CPU time
> |...... subgroup 4/6    = 16.2500       i.e = 2.9500% of 18.2100% Groups non-Idle CPU time
> |...... subgroup 4/7    = 12.6100       i.e = 2.2900% of 18.2100% Groups non-Idle CPU time
> |...... subgroup 4/8    = 17.1900       i.e = 3.1300% of 18.2100% Groups non-Idle CPU time
>
>
> Bandwidth of Group 5 = 36.8300 i.e = 24.0100% of non-Idle CPU time 65.2%
> |...... subgroup 5/1    = 56.6900       i.e = 13.6100%  of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/2    = 8.8600        i.e = 2.1200%   of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/3    = 5.5100        i.e = 1.3200%   of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/4    = 4.5700        i.e = 1.0900%   of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/5    = 7.9500        i.e = 1.9000%   of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/6    = 2.1600        i.e = .5100%    of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/7    = 2.3400        i.e = .5600%    of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/8    = 2.1500        i.e = .5100%    of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/9    = 9.7200        i.e = 2.3300%   of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/10   = 5.0600        i.e = 1.2100%   of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/11   = 4.6900        i.e = 1.1200%   of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/12   = 8.9700        i.e = 2.1500%   of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/13   = 8.4600        i.e = 2.0300%   of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/14   = 11.8400       i.e = 2.8400%   of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/15   = 6.3400        i.e = 1.5200%   of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/16   = 5.1500        i.e = 1.2300%   of 24.0100% Groups non-Idle CPU time
>
> Pinned case
> --------------
> CPU hierarchy is created and cpusets are allocated.
>
> Average CPU Idle percentage 0%
> Bandwidth shared with remaining non-Idle 100%
>
> Bandwidth of Group 1 = 6.3400 i.e = 6.3400% of non-Idle CPU time 100%
> |...... subgroup 1/1    = 50.0400       i.e = 3.1700% of 6.3400% Groups non-Idle CPU time
> |...... subgroup 1/2    = 49.9500       i.e = 3.1600% of 6.3400% Groups non-Idle CPU time
>
>
> Bandwidth of Group 2 = 6.3200 i.e = 6.3200% of non-Idle CPU time 100%
> |...... subgroup 2/1    = 50.0400       i.e = 3.1600% of 6.3200% Groups non-Idle CPU time
> |...... subgroup 2/2    = 49.9500       i.e = 3.1500% of 6.3200% Groups non-Idle CPU time
>
>
> Bandwidth of Group 3 = 12.6300 i.e = 12.6300% of non-Idle CPU time 100%
> |...... subgroup 3/1    = 25.0300       i.e = 3.1600% of 12.6300% Groups non-Idle CPU time
> |...... subgroup 3/2    = 25.0100       i.e = 3.1500% of 12.6300% Groups non-Idle CPU time
> |...... subgroup 3/3    = 25.0000       i.e = 3.1500% of 12.6300% Groups non-Idle CPU time
> |...... subgroup 3/4    = 24.9400       i.e = 3.1400% of 12.6300% Groups non-Idle CPU time
>
>
> Bandwidth of Group 4 = 25.1000 i.e = 25.1000% of non-Idle CPU time 100%
> |...... subgroup 4/1    = 12.5400       i.e = 3.1400% of 25.1000% Groups non-Idle CPU time
> |...... subgroup 4/2    = 12.5100       i.e = 3.1400% of 25.1000% Groups non-Idle CPU time
> |...... subgroup 4/3    = 12.5300       i.e = 3.1400% of 25.1000% Groups non-Idle CPU time
> |...... subgroup 4/4    = 12.5000       i.e = 3.1300% of 25.1000% Groups non-Idle CPU time
> |...... subgroup 4/5    = 12.4900       i.e = 3.1300% of 25.1000% Groups non-Idle CPU time
> |...... subgroup 4/6    = 12.4700       i.e = 3.1200% of 25.1000% Groups non-Idle CPU time
> |...... subgroup 4/7    = 12.4700       i.e = 3.1200% of 25.1000% Groups non-Idle CPU time
> |...... subgroup 4/8    = 12.4500       i.e = 3.1200% of 25.1000% Groups non-Idle CPU time
>
>
> Bandwidth of Group 5 = 49.5700 i.e = 49.5700% of non-Idle CPU time 100%
> |...... subgroup 5/1    = 49.8500       i.e = 24.7100% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/2    = 6.2900        i.e = 3.1100% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/3    = 6.2800        i.e = 3.1100% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/4    = 6.2700        i.e = 3.1000% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/5    = 6.2700        i.e = 3.1000% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/6    = 6.2600        i.e = 3.1000% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/7    = 6.2500        i.e = 3.0900% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/8    = 6.2400        i.e = 3.0900% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/9    = 6.2400        i.e = 3.0900% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/10   = 6.2300        i.e = 3.0800% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/11   = 6.2300        i.e = 3.0800% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/12   = 6.2200        i.e = 3.0800% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/13   = 6.2100        i.e = 3.0700% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/14   = 6.2100        i.e = 3.0700% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/15   = 6.2100        i.e = 3.0700% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/16   = 6.2100        i.e = 3.0700% of 49.5700% Groups non-Idle CPU time
>
> with equal cpu shares allocated to all the groups/sub-cgroups and CFS bandwidth configured
> to allow 100% CPU utilization. We see the CPU idle time in the un-pinned case.
>
> Benchmark used to reproduce the issue, is attached. Justing executing the script should
> report similar numbers.
>
> #!/bin/bash
>
> NR_TASKS1=2
> NR_TASKS2=2
> NR_TASKS3=4
> NR_TASKS4=8
> NR_TASKS5=16
>
> BANDWIDTH=1
> SUBGROUP=1
> PRO_SHARES=0
> MOUNT=/cgroup/
> LOAD=/root/while1
>
> usage()
> {
>        echo "Usage $0: [-b 0|1] [-s 0|1] [-p 0|1]"
>        echo "-b 1|0 set/unset  Cgroups bandwidth control (default set)"
>        echo "-s Create sub-groups for every task (default creates sub-group)"
>        echo "-p create propotional shares based on cpus"
>        exit
> }
> while getopts ":b:s:p:" arg
> do
>        case $arg in
>        b)
>                BANDWIDTH=$OPTARG
>                shift
>                if [ $BANDWIDTH -gt 1 ] && [ $BANDWIDTH -lt  0 ]
>                then
>                        usage
>                fi
>                ;;
>        s)
>                SUBGROUP=$OPTARG
>                shift
>                if [ $SUBGROUP -gt 1 ] && [ $SUBGROUP -lt 0 ]
>                then
>                        usage
>                fi
>                ;;
>        p)
>                PRO_SHARES=$OPTARG
>                shift
>                if [ $PRO_SHARES -gt 1 ] && [ $PRO_SHARES -lt 0 ]
>                then
>                        usage
>                fi
>                ;;
>
>        *)
>
>        esac
> done
> if [ ! -d $MOUNT ]
> then
>        mkdir -p $MOUNT
> fi
> test()
> {
>        echo -n "[ "
>        if [ $1 -eq 0 ]
>        then
>                echo -ne '\E[42;40mOk'
>        else
>                echo -ne '\E[31;40mFailed'
>                tput sgr0
>                echo " ]"
>                exit
>        fi
>        tput sgr0
>        echo " ]"
> }
> mount_cgrp()
> {
>        echo -n "Mounting root cgroup "
>        mount -t cgroup -ocpu,cpuset,cpuacct none $MOUNT &> /dev/null
>        test $?
> }
>
> umount_cgrp()
> {
>        echo -n "Unmounting root cgroup "
>        cd /root/
>        umount $MOUNT
>        test $?
> }
>
> create_hierarchy()
> {
>        mount_cgrp
>        cpuset_mem=`cat $MOUNT/cpuset.mems`
>        cpuset_cpu=`cat $MOUNT/cpuset.cpus`
>        echo -n "creating groups/sub-groups ..."
>        for (( i=1; i<=5; i++ ))
>        do
>                mkdir $MOUNT/$i
>                echo $cpuset_mem > $MOUNT/$i/cpuset.mems
>                echo $cpuset_cpu > $MOUNT/$i/cpuset.cpus
>                echo -n ".."
>                if [ $SUBGROUP -eq 1 ]
>                then
>                        jj=$(eval echo "\$NR_TASKS$i")
>                        for (( j=1; j<=$jj; j++ ))
>                        do
>                                mkdir -p $MOUNT/$i/$j
>                                echo $cpuset_mem > $MOUNT/$i/$j/cpuset.mems
>                                echo $cpuset_cpu > $MOUNT/$i/$j/cpuset.cpus
>                                echo -n ".."
>                        done
>                fi
>        done
>        echo "."
> }
>
> cleanup()
> {
>        pkill -9 while1 &> /dev/null
>        sleep 10
>        echo -n "Umount groups/sub-groups .."
>        for (( i=1; i<=5; i++ ))
>        do
>                if [ $SUBGROUP -eq 1 ]
>                then
>                        jj=$(eval echo "\$NR_TASKS$i")
>                        for (( j=1; j<=$jj; j++ ))
>                        do
>                                rmdir $MOUNT/$i/$j
>                                echo -n ".."
>                        done
>                fi
>                rmdir $MOUNT/$i
>                echo -n ".."
>        done
>        echo " "
>        umount_cgrp
> }
>
> load_tasks()
> {
>        for (( i=1; i<=5; i++ ))
>        do
>                jj=$(eval echo "\$NR_TASKS$i")
>                shares="1024"
>                if [ $PRO_SHARES -eq 1 ]
>                then
>                        eval shares=$(echo "$jj * 1024" | bc)
>                fi
>                echo $hares > $MOUNT/$i/cpu.shares
>                for (( j=1; j<=$jj; j++ ))
>                do
>                        echo "-1" > $MOUNT/$i/cpu.cfs_quota_us
>                        echo "500000" > $MOUNT/$i/cpu.cfs_period_us
>                        if [ $SUBGROUP -eq 1 ]
>                        then
>
>                                $LOAD &
>                                echo $! > $MOUNT/$i/$j/tasks
>                                echo "1024" > $MOUNT/$i/$j/cpu.shares
>
>                                if [ $BANDWIDTH -eq 1 ]
>                                then
>                                        echo "500000" > $MOUNT/$i/$j/cpu.cfs_period_us
>                                        echo "250000" > $MOUNT/$i/$j/cpu.cfs_quota_us
>                                fi
>                        else
>                                $LOAD &
>                                echo $! > $MOUNT/$i/tasks
>                                echo $shares > $MOUNT/$i/cpu.shares
>
>                                if [ $BANDWIDTH -eq 1 ]
>                                then
>                                        echo "500000" > $MOUNT/$i/cpu.cfs_period_us
>                                        echo "250000" > $MOUNT/$i/cpu.cfs_quota_us
>                                fi
>                        fi
>                done
>        done
>        echo "Captuing idle cpu time with vmstat...."
>        vmstat 2 100 &> vmstat_log &
> }
>
> pin_tasks()
> {
>        cpu=0
>        count=1
>        for (( i=1; i<=5; i++ ))
>        do
>                if [ $SUBGROUP -eq 1 ]
>                then
>                        jj=$(eval echo "\$NR_TASKS$i")
>                        for (( j=1; j<=$jj; j++ ))
>                        do
>                                if [ $count -gt 2 ]
>                                then
>                                        cpu=$((cpu+1))
>                                        count=1
>                                fi
>                                echo $cpu > $MOUNT/$i/$j/cpuset.cpus
>                                count=$((count+1))
>                        done
>                else
>                        case $i in
>                        1)
>                                echo 0 > $MOUNT/$i/cpuset.cpus;;
>                        2)
>                                echo 1 > $MOUNT/$i/cpuset.cpus;;
>                        3)
>                                echo "2-3" > $MOUNT/$i/cpuset.cpus;;
>                        4)
>                                echo "4-6" > $MOUNT/$i/cpuset.cpus;;
>                        5)
>                                echo "7-15" > $MOUNT/$i/cpuset.cpus;;
>                        esac
>                fi
>        done
>
> }
>
> print_results()
> {
>        eval gtot=$(cat sched_log|grep -i while|sed 's/R//g'|awk '{gtot+=$7};END{printf "%f", gtot}')
>        for (( i=1; i<=5; i++ ))
>        do
>                eval temp=$(cat sched_log_$i|sed 's/R//g'| awk '{gtot+=$7};END{printf "%f",gtot}')
>                eval tavg=$(echo "scale=4;(($temp / $gtot) * $1)/100 " | bc)
>                eval avg=$(echo  "scale=4;($temp / $gtot) * 100" | bc)
>                eval pretty_tavg=$( echo "scale=4; $tavg * 100"| bc) # For pretty format
>                echo "Bandwidth of Group $i = $avg i.e = $pretty_tavg% of non-Idle CPU time $1%"
>                if [ $SUBGROUP -eq 1 ]
>                then
>                        jj=$(eval echo "\$NR_TASKS$i")
>                        for (( j=1; j<=$jj; j++ ))
>                        do
>                                eval tmp=$(cat sched_log_$i-$j|sed 's/R//g'| awk '{gtot+=$7};END{printf "%f",gtot}')
>                                eval stavg=$(echo "scale=4;($tmp / $temp) * 100" | bc)
>                                eval pretty_stavg=$(echo "scale=4;(($tmp / $temp) * $tavg) * 100" | bc)
>                                echo -n "|"
>                                echo -e "...... subgroup $i/$j\t= $stavg\ti.e = $pretty_stavg% of $pretty_tavg% Groups non-Idle CPU time"
>                        done
>                fi
>                echo " "
>                echo " "
>        done
> }
> capture_results()
> {
>        cat /proc/sched_debug > sched_log
>        pkill -9 vmstat
>        avg=$(cat vmstat_log |grep -iv "system"|grep -iv "swpd"|awk ' { if ( NR != 1) {id+=$15 }}END{print (id/NR)}')
>
>        rem=$(echo "scale=2; 100 - $avg" |bc)
>        echo "Average CPU Idle percentage $avg%"
>        echo "Bandwidth shared with remaining non-Idle $rem%"
>        for (( i=1; i<=5; i++ ))
>        do
>                cat sched_log |grep -i while1|grep -i " \/$i" > sched_log_$i
>                if [ $SUBGROUP -eq 1 ]
>                then
>                        jj=$(eval echo "\$NR_TASKS$i")
>                        for (( j=1; j<=$jj; j++ ))
>                        do
>                                cat sched_log |grep -i while1|grep -i " \/$i\/$j" > sched_log_$i-$j
>                        done
>                fi
>        done
>        print_results $rem
> }
> create_hierarchy
> pin_tasks
>
> load_tasks
> sleep 60
> capture_results
> cleanup
> exit
>
> Thanks,
> Kamalesh.
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned
  2011-06-07 15:45 ` CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned Kamalesh Babulal
  2011-06-08  3:09   ` Paul Turner
@ 2011-06-08 10:46   ` Vladimir Davydov
  2011-06-08 16:32     ` Kamalesh Babulal
  2011-06-14 10:16   ` CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned Hidetoshi Seto
  2 siblings, 1 reply; 129+ messages in thread
From: Vladimir Davydov @ 2011-06-08 10:46 UTC (permalink / raw)
  To: Kamalesh Babulal
  Cc: Paul Turner, linux-kernel, Peter Zijlstra, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Ingo Molnar, Pavel Emelianov

On Tue, 2011-06-07 at 19:45 +0400, Kamalesh Babulal wrote:
> Hi All,
> 
>     In our test environment, while testing the CFS Bandwidth V6 patch set
> on top of 55922c9d1b84. We observed that the CPU's idle time is seen
> between 30% to 40% while running CPU bound test, with the cgroups tasks
> not pinned to the CPU's. Whereas in the inverse case, where the cgroups
> tasks are pinned to the CPU's, the idle time seen is nearly zero.

(snip)

> load_tasks()
> {
>         for (( i=1; i<=5; i++ ))
>         do
>                 jj=$(eval echo "\$NR_TASKS$i")
>                 shares="1024"
>                 if [ $PRO_SHARES -eq 1 ]
>                 then
>                         eval shares=$(echo "$jj * 1024" | bc)
>                 fi
>                 echo $hares > $MOUNT/$i/cpu.shares
                        ^^^^^
                        a fatal misprint? must be shares, I guess

(Setting cpu.shares to "", i.e. to the minimal possible value, will
definitely confuse the load balancer)

>                 for (( j=1; j<=$jj; j++ ))
>                 do
>                         echo "-1" > $MOUNT/$i/cpu.cfs_quota_us
>                         echo "500000" > $MOUNT/$i/cpu.cfs_period_us
>                         if [ $SUBGROUP -eq 1 ]
>                         then
> 
>                                 $LOAD &
>                                 echo $! > $MOUNT/$i/$j/tasks
>                                 echo "1024" > $MOUNT/$i/$j/cpu.shares
> 
>                                 if [ $BANDWIDTH -eq 1 ]
>                                 then
>                                         echo "500000" > $MOUNT/$i/$j/cpu.cfs_period_us
>                                         echo "250000" > $MOUNT/$i/$j/cpu.cfs_quota_us
>                                 fi
>                         else
>                                 $LOAD &
>                                 echo $! > $MOUNT/$i/tasks
>                                 echo $shares > $MOUNT/$i/cpu.shares
> 
>                                 if [ $BANDWIDTH -eq 1 ]
>                                 then
>                                         echo "500000" > $MOUNT/$i/cpu.cfs_period_us
>                                         echo "250000" > $MOUNT/$i/cpu.cfs_quota_us
>                                 fi
>                         fi
>                 done
>         done
>         echo "Captuing idle cpu time with vmstat...."
>         vmstat 2 100 &> vmstat_log &
> }



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned
  2011-06-08 10:46   ` Vladimir Davydov
@ 2011-06-08 16:32     ` Kamalesh Babulal
  2011-06-09  3:25       ` Paul Turner
  0 siblings, 1 reply; 129+ messages in thread
From: Kamalesh Babulal @ 2011-06-08 16:32 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Paul Turner, linux-kernel, Peter Zijlstra, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Ingo Molnar, Pavel Emelianov

* Vladimir Davydov <vdavydov@parallels.com> [2011-06-08 14:46:06]:

> On Tue, 2011-06-07 at 19:45 +0400, Kamalesh Babulal wrote:
> > Hi All,
> > 
> >     In our test environment, while testing the CFS Bandwidth V6 patch set
> > on top of 55922c9d1b84. We observed that the CPU's idle time is seen
> > between 30% to 40% while running CPU bound test, with the cgroups tasks
> > not pinned to the CPU's. Whereas in the inverse case, where the cgroups
> > tasks are pinned to the CPU's, the idle time seen is nearly zero.
> 
> (snip)
> 
> > load_tasks()
> > {
> >         for (( i=1; i<=5; i++ ))
> >         do
> >                 jj=$(eval echo "\$NR_TASKS$i")
> >                 shares="1024"
> >                 if [ $PRO_SHARES -eq 1 ]
> >                 then
> >                         eval shares=$(echo "$jj * 1024" | bc)
> >                 fi
> >                 echo $hares > $MOUNT/$i/cpu.shares
>                         ^^^^^
>                         a fatal misprint? must be shares, I guess
> 
> (Setting cpu.shares to "", i.e. to the minimal possible value, will
> definitely confuse the load balancer)

My bad. It was a fatal typo, thanks for pointing it out. It made a big difference
in the idle time reported. After correcting it to $shares, the CPU idle time
reported is now 20% to 22%, which is about 10% less than the previously reported number.

(snip)

There have been questions on how to interpret the results. Consider the
following test run without pinning the cgroup tasks:

Average CPU Idle percentage 20%
Bandwidth shared with remaining non-Idle 80%

Bandwidth of Group 1 = 7.9700% i.e = 6.3700% of non-Idle CPU time 80%
|...... subgroup 1/1	= 50.0200	i.e = 3.1800% of 6.3700% Groups non-Idle CPU time
|...... subgroup 1/2	= 49.9700	i.e = 3.1800% of 6.3700% Groups non-Idle CPU time
 
For example, let us consider cgroup1; the sum_exec time is the 7th field
captured from /proc/sched_debug:

while1 27273     30665.912793      1988   120     30665.912793	30909.566767         0.021951 /1/2
while1 27272     30511.105690      1995   120     30511.105690	30942.998099         0.017369 /1/1
							      -----------------

								61852.564866
							      -----------------
 - The bandwidth for sub-cgroup1 of cgroup1 is calculated  = (30942.998099 * 100) / 61852.564866
							   = ~50%

   and sub-cgroup2 of cgroup1 is calculated		   = (30909.566767 * 100) / 61852.564866
							   = ~50%

In a similar way, if we add up the sum_exec of all the groups:
------------------------------------------------------------------------------------------------
Group1		Group2		Group3		Group4		Group5		sum_exec 
------------------------------------------------------------------------------------------------
61852.564866 + 61686.604930 + 122840.294858 + 232576.303937 + 296166.889155 =	775122.657746

again taking the example of cgroup1
Total percentage of bandwidth allocated to cgroup1 = (61852.564866 * 100) / 775122.657746
						   = ~ 7.9% of total bandwidth of all the cgroups


Calculating the non-idle time is done with
	(total execution time * 100) / (no. of cpus * 60000 ms) [the script is run for 60 seconds]
	i.e. = (775122.657746 * 100) / (16 * 60000)
	     = ~80% of non-idle time

The percentage of non-idle bandwidth allocated to a cgroup is derived as
	= (cgroup bandwidth percentage * non-idle time) / 100
	i.e. for cgroup1	= (7.9700 * 80) / 100
			= 6.376% of non-Idle CPU time allocated.
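
The same arithmetic can be scripted; here is a minimal sketch (not part of the
test script above) that assumes a saved copy of /proc/sched_debug in sched_log
and the non-idle percentage from vmstat in $rem:

	# total sum_exec (7th field) over every while1 task in the snapshot;
	# the sed strips the "R" marker of the currently running task
	gtot=$(grep -i while1 sched_log | sed 's/R//g' | awk '{s+=$7} END {printf "%f", s}')
	for i in 1 2 3 4 5
	do
		# this group's share of the total, scaled by the non-idle time
		ctot=$(grep -i while1 sched_log | grep " \/$i" | sed 's/R//g' | awk '{s+=$7} END {printf "%f", s}')
		pct=$(echo "scale=4; ($ctot / $gtot) * 100" | bc)
		share=$(echo "scale=4; ($pct * $rem) / 100" | bc)
		echo "Group $i = $pct% of all groups = $share% of non-Idle CPU time"
	done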
	
 
Bandwidth of Group 2 = 7.9500% i.e = 6.3600% of non-Idle CPU time 80%
|...... subgroup 2/1	= 49.9900	i.e = 3.1700% of 6.3600% Groups non-Idle CPU time
|...... subgroup 2/2	= 50.0000	i.e = 3.1800% of 6.3600% Groups non-Idle CPU time
 
 
Bandwidth of Group 3 = 15.8400% i.e = 12.6700% of non-Idle CPU time 80%
|...... subgroup 3/1	= 24.9900	i.e = 3.1600% of 12.6700% Groups non-Idle CPU time
|...... subgroup 3/2	= 24.9900	i.e = 3.1600% of 12.6700% Groups non-Idle CPU time
|...... subgroup 3/3	= 25.0600	i.e = 3.1700% of 12.6700% Groups non-Idle CPU time
|...... subgroup 3/4	= 24.9400	i.e = 3.1500% of 12.6700% Groups non-Idle CPU time
 
 
Bandwidth of Group 4 = 30.0000% i.e = 24.0000% of non-Idle CPU time 80%
|...... subgroup 4/1	= 13.1600	i.e = 3.1500% of 24.0000% Groups non-Idle CPU time
|...... subgroup 4/2	= 11.3800	i.e = 2.7300% of 24.0000% Groups non-Idle CPU time
|...... subgroup 4/3	= 13.1100	i.e = 3.1400% of 24.0000% Groups non-Idle CPU time
|...... subgroup 4/4	= 12.3100	i.e = 2.9500% of 24.0000% Groups non-Idle CPU time
|...... subgroup 4/5	= 12.8200	i.e = 3.0700% of 24.0000% Groups non-Idle CPU time
|...... subgroup 4/6	= 11.0600	i.e = 2.6500% of 24.0000% Groups non-Idle CPU time
|...... subgroup 4/7	= 13.0600	i.e = 3.1300% of 24.0000% Groups non-Idle CPU time
|...... subgroup 4/8	= 13.0600	i.e = 3.1300% of 24.0000% Groups non-Idle CPU time
 
 
Bandwidth of Group 5 = 38.2000% i.e = 30.5600% of non-Idle CPU time 80%
|...... subgroup 5/1	= 48.1000	i.e = 14.6900%	of 30.5600% Groups non-Idle CPU time
|...... subgroup 5/2	= 6.7900	i.e = 2.0700%	of 30.5600% Groups non-Idle CPU time
|...... subgroup 5/3	= 6.3700	i.e = 1.9400%	of 30.5600% Groups non-Idle CPU time
|...... subgroup 5/4	= 5.1800	i.e = 1.5800%	of 30.5600% Groups non-Idle CPU time
|...... subgroup 5/5	= 5.0400	i.e = 1.5400%	of 30.5600% Groups non-Idle CPU time
|...... subgroup 5/6	= 10.1400	i.e = 3.0900%	of 30.5600% Groups non-Idle CPU time
|...... subgroup 5/7	= 5.0700	i.e = 1.5400%	of 30.5600% Groups non-Idle CPU time
|...... subgroup 5/8	= 6.3900	i.e = 1.9500%	of 30.5600% Groups non-Idle CPU time
|...... subgroup 5/9	= 6.8800	i.e = 2.1000%	of 30.5600% Groups non-Idle CPU time
|...... subgroup 5/10	= 6.4700	i.e = 1.9700%	of 30.5600% Groups non-Idle CPU time
|...... subgroup 5/11	= 6.5600	i.e = 2.0000%	of 30.5600% Groups non-Idle CPU time
|...... subgroup 5/12	= 4.6400	i.e = 1.4100%	of 30.5600% Groups non-Idle CPU time
|...... subgroup 5/13	= 7.4900	i.e = 2.2800%	of 30.5600% Groups non-Idle CPU time
|...... subgroup 5/14	= 5.8200	i.e = 1.7700%	of 30.5600% Groups non-Idle CPU time
|...... subgroup 5/15	= 6.5500	i.e = 2.0000%	of 30.5600% Groups non-Idle CPU time
|...... subgroup 5/16	= 5.2700	i.e = 1.6100%	of 30.5600% Groups non-Idle CPU time

Thanks,
Kamalesh.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned
  2011-06-08 16:32     ` Kamalesh Babulal
@ 2011-06-09  3:25       ` Paul Turner
  2011-06-10 18:17         ` Kamalesh Babulal
  0 siblings, 1 reply; 129+ messages in thread
From: Paul Turner @ 2011-06-09  3:25 UTC (permalink / raw)
  To: Kamalesh Babulal
  Cc: Vladimir Davydov, linux-kernel, Peter Zijlstra, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Ingo Molnar, Pavel Emelianov

Hi Kamalesh,

I'm unable to reproduce the results you describe.  One possibility is
load-balancer interaction -- can you describe the topology of the
platform you are running this on?

On both a straight NUMA topology and a hyper-threaded platform I
observe a ~4% delta between the pinned and un-pinned cases.

Thanks -- results below,

- Paul


16 cores -- pinned:
Average CPU Idle percentage 4.77419%
Bandwidth shared with remaining non-Idle 95.22581%
Bandwidth of Group 1 = 6.6300 i.e = 6.3100% of non-Idle CPU time 95.22581%
|...... subgroup 1/1    = 50.0400       i.e = 3.1500% of 6.3100%
Groups non-Idle CPU time
|...... subgroup 1/2    = 49.9500       i.e = 3.1500% of 6.3100%
Groups non-Idle CPU time


Bandwidth of Group 2 = 6.6300 i.e = 6.3100% of non-Idle CPU time 95.22581%
|...... subgroup 2/1    = 50.0300       i.e = 3.1500% of 6.3100%
Groups non-Idle CPU time
|...... subgroup 2/2    = 49.9600       i.e = 3.1500% of 6.3100%
Groups non-Idle CPU time


Bandwidth of Group 3 = 13.2000 i.e = 12.5600% of non-Idle CPU time 95.22581%
|...... subgroup 3/1    = 25.0200       i.e = 3.1400% of 12.5600%
Groups non-Idle CPU time
|...... subgroup 3/2    = 24.9500       i.e = 3.1300% of 12.5600%
Groups non-Idle CPU time
|...... subgroup 3/3    = 25.0400       i.e = 3.1400% of 12.5600%
Groups non-Idle CPU time
|...... subgroup 3/4    = 24.9700       i.e = 3.1300% of 12.5600%
Groups non-Idle CPU time


Bandwidth of Group 4 = 26.1500 i.e = 24.9000% of non-Idle CPU time 95.22581%
|...... subgroup 4/1    = 12.4700       i.e = 3.1000% of 24.9000%
Groups non-Idle CPU time
|...... subgroup 4/2    = 12.5500       i.e = 3.1200% of 24.9000%
Groups non-Idle CPU time
|...... subgroup 4/3    = 12.4600       i.e = 3.1000% of 24.9000%
Groups non-Idle CPU time
|...... subgroup 4/4    = 12.5000       i.e = 3.1100% of 24.9000%
Groups non-Idle CPU time
|...... subgroup 4/5    = 12.5400       i.e = 3.1200% of 24.9000%
Groups non-Idle CPU time
|...... subgroup 4/6    = 12.4700       i.e = 3.1000% of 24.9000%
Groups non-Idle CPU time
|...... subgroup 4/7    = 12.5200       i.e = 3.1100% of 24.9000%
Groups non-Idle CPU time
|...... subgroup 4/8    = 12.4600       i.e = 3.1000% of 24.9000%
Groups non-Idle CPU time


Bandwidth of Group 5 = 47.3600 i.e = 45.0900% of non-Idle CPU time 95.22581%
|...... subgroup 5/1    = 49.9600       i.e = 22.5200% of 45.0900%
Groups non-Idle CPU time
|...... subgroup 5/2    = 6.3600        i.e = 2.8600% of 45.0900%
Groups non-Idle CPU time
|...... subgroup 5/3    = 6.2400        i.e = 2.8100% of 45.0900%
Groups non-Idle CPU time
|...... subgroup 5/4    = 6.1900        i.e = 2.7900% of 45.0900%
Groups non-Idle CPU time
|...... subgroup 5/5    = 6.2700        i.e = 2.8200% of 45.0900%
Groups non-Idle CPU time
|...... subgroup 5/6    = 6.3400        i.e = 2.8500% of 45.0900%
Groups non-Idle CPU time
|...... subgroup 5/7    = 6.1900        i.e = 2.7900% of 45.0900%
Groups non-Idle CPU time
|...... subgroup 5/8    = 6.1500        i.e = 2.7700% of 45.0900%
Groups non-Idle CPU time
|...... subgroup 5/9    = 6.2600        i.e = 2.8200% of 45.0900%
Groups non-Idle CPU time
|...... subgroup 5/10   = 6.2800        i.e = 2.8300% of 45.0900%
Groups non-Idle CPU time
|...... subgroup 5/11   = 6.2800        i.e = 2.8300% of 45.0900%
Groups non-Idle CPU time
|...... subgroup 5/12   = 6.1400        i.e = 2.7600% of 45.0900%
Groups non-Idle CPU time
|...... subgroup 5/13   = 6.0900        i.e = 2.7400% of 45.0900%
Groups non-Idle CPU time
|...... subgroup 5/14   = 6.3000        i.e = 2.8400% of 45.0900%
Groups non-Idle CPU time
|...... subgroup 5/15   = 6.1600        i.e = 2.7700% of 45.0900%
Groups non-Idle CPU time
|...... subgroup 5/16   = 6.3400        i.e = 2.8500% of 45.0900%
Groups non-Idle CPU time

AMD 16 core -- pinned:
Average CPU Idle percentage 0%
Bandwidth shared with remaining non-Idle 100%
Bandwidth of Group 1 = 6.2800 i.e = 6.2800% of non-Idle CPU time 100%
|...... subgroup 1/1    = 50.0000       i.e = 3.1400% of 6.2800%
Groups non-Idle CPU time
|...... subgroup 1/2    = 49.9900       i.e = 3.1300% of 6.2800%
Groups non-Idle CPU time


Bandwidth of Group 2 = 6.2800 i.e = 6.2800% of non-Idle CPU time 100%
|...... subgroup 2/1    = 50.0000       i.e = 3.1400% of 6.2800%
Groups non-Idle CPU time
|...... subgroup 2/2    = 49.9900       i.e = 3.1300% of 6.2800%
Groups non-Idle CPU time


Bandwidth of Group 3 = 12.5500 i.e = 12.5500% of non-Idle CPU time 100%
|...... subgroup 3/1    = 25.0100       i.e = 3.1300% of 12.5500%
Groups non-Idle CPU time
|...... subgroup 3/2    = 25.0000       i.e = 3.1300% of 12.5500%
Groups non-Idle CPU time
|...... subgroup 3/3    = 24.9900       i.e = 3.1300% of 12.5500%
Groups non-Idle CPU time
|...... subgroup 3/4    = 24.9700       i.e = 3.1300% of 12.5500%
Groups non-Idle CPU time


Bandwidth of Group 4 = 25.0400 i.e = 25.0400% of non-Idle CPU time 100%
|...... subgroup 4/1    = 12.5000       i.e = 3.1300% of 25.0400%
Groups non-Idle CPU time
|...... subgroup 4/2    = 12.5000       i.e = 3.1300% of 25.0400%
Groups non-Idle CPU time
|...... subgroup 4/3    = 12.5000       i.e = 3.1300% of 25.0400%
Groups non-Idle CPU time
|...... subgroup 4/4    = 12.5000       i.e = 3.1300% of 25.0400%
Groups non-Idle CPU time
|...... subgroup 4/5    = 12.5000       i.e = 3.1300% of 25.0400%
Groups non-Idle CPU time
|...... subgroup 4/6    = 12.4900       i.e = 3.1200% of 25.0400%
Groups non-Idle CPU time
|...... subgroup 4/7    = 12.4900       i.e = 3.1200% of 25.0400%
Groups non-Idle CPU time
|...... subgroup 4/8    = 12.4800       i.e = 3.1200% of 25.0400%
Groups non-Idle CPU time


Bandwidth of Group 5 = 49.8200 i.e = 49.8200% of non-Idle CPU time 100%
|...... subgroup 5/1    = 49.9400       i.e = 24.8800% of 49.8200%
Groups non-Idle CPU time
|...... subgroup 5/2    = 6.2700        i.e = 3.1200% of 49.8200%
Groups non-Idle CPU time
|...... subgroup 5/3    = 6.2400        i.e = 3.1000% of 49.8200%
Groups non-Idle CPU time
|...... subgroup 5/4    = 6.2400        i.e = 3.1000% of 49.8200%
Groups non-Idle CPU time
|...... subgroup 5/5    = 6.2500        i.e = 3.1100% of 49.8200%
Groups non-Idle CPU time
|...... subgroup 5/6    = 6.2500        i.e = 3.1100% of 49.8200%
Groups non-Idle CPU time
|...... subgroup 5/7    = 6.2600        i.e = 3.1100% of 49.8200%
Groups non-Idle CPU time
|...... subgroup 5/8    = 6.2600        i.e = 3.1100% of 49.8200%
Groups non-Idle CPU time
|...... subgroup 5/9    = 6.2400        i.e = 3.1000% of 49.8200%
Groups non-Idle CPU time
|...... subgroup 5/10   = 6.2400        i.e = 3.1000% of 49.8200%
Groups non-Idle CPU time
|...... subgroup 5/11   = 6.2400        i.e = 3.1000% of 49.8200%
Groups non-Idle CPU time
|...... subgroup 5/12   = 6.2400        i.e = 3.1000% of 49.8200%
Groups non-Idle CPU time
|...... subgroup 5/13   = 6.2200        i.e = 3.0900% of 49.8200%
Groups non-Idle CPU time
|...... subgroup 5/14   = 6.2200        i.e = 3.0900% of 49.8200%
Groups non-Idle CPU time
|...... subgroup 5/15   = 6.2400        i.e = 3.1000% of 49.8200%
Groups non-Idle CPU time
|...... subgroup 5/16   = 6.2400        i.e = 3.1000% of 49.8200%
Groups non-Idle CPU time


16 core hyper-threaded subset of 24 core machine (threads not pinned
individually):

Average CPU Idle percentage 35.0645%
Bandwidth shared with remaining non-Idle 64.9355%
Bandwidth of Group 1 = 6.6000 i.e = 4.2800% of non-Idle CPU time 64.9355%
|...... subgroup 1/1    = 50.0600       i.e = 2.1400% of 4.2800%
Groups non-Idle CPU time
|...... subgroup 1/2    = 49.9300       i.e = 2.1300% of 4.2800%
Groups non-Idle CPU time


Bandwidth of Group 2 = 6.6000 i.e = 4.2800% of non-Idle CPU time 64.9355%
|...... subgroup 2/1    = 50.0100       i.e = 2.1400% of 4.2800%
Groups non-Idle CPU time
|...... subgroup 2/2    = 49.9800       i.e = 2.1300% of 4.2800%
Groups non-Idle CPU time


Bandwidth of Group 3 = 13.1600 i.e = 8.5400% of non-Idle CPU time 64.9355%
|...... subgroup 3/1    = 25.0200       i.e = 2.1300% of 8.5400%
Groups non-Idle CPU time
|...... subgroup 3/2    = 24.9900       i.e = 2.1300% of 8.5400%
Groups non-Idle CPU time
|...... subgroup 3/3    = 24.9900       i.e = 2.1300% of 8.5400%
Groups non-Idle CPU time
|...... subgroup 3/4    = 24.9900       i.e = 2.1300% of 8.5400%
Groups non-Idle CPU time


Bandwidth of Group 4 = 25.9700 i.e = 16.8600% of non-Idle CPU time 64.9355%
|...... subgroup 4/1    = 12.5000       i.e = 2.1000% of 16.8600%
Groups non-Idle CPU time
|...... subgroup 4/2    = 12.5100       i.e = 2.1000% of 16.8600%
Groups non-Idle CPU time
|...... subgroup 4/3    = 12.6000       i.e = 2.1200% of 16.8600%
Groups non-Idle CPU time
|...... subgroup 4/4    = 12.3800       i.e = 2.0800% of 16.8600%
Groups non-Idle CPU time
|...... subgroup 4/5    = 12.4700       i.e = 2.1000% of 16.8600%
Groups non-Idle CPU time
|...... subgroup 4/6    = 12.4900       i.e = 2.1000% of 16.8600%
Groups non-Idle CPU time
|...... subgroup 4/7    = 12.5700       i.e = 2.1100% of 16.8600%
Groups non-Idle CPU time
|...... subgroup 4/8    = 12.4400       i.e = 2.0900% of 16.8600%
Groups non-Idle CPU time


Bandwidth of Group 5 = 47.6500 i.e = 30.9400% of non-Idle CPU time 64.9355%
|...... subgroup 5/1    = 50.5400       i.e = 15.6300% of 30.9400%
Groups non-Idle CPU time
|...... subgroup 5/2    = 6.0400        i.e = 1.8600% of 30.9400%
Groups non-Idle CPU time
|...... subgroup 5/3    = 6.0600        i.e = 1.8700% of 30.9400%
Groups non-Idle CPU time
|...... subgroup 5/4    = 6.4300        i.e = 1.9800% of 30.9400%
Groups non-Idle CPU time
|...... subgroup 5/5    = 6.3100        i.e = 1.9500% of 30.9400%
Groups non-Idle CPU time
|...... subgroup 5/6    = 6.0000        i.e = 1.8500% of 30.9400%
Groups non-Idle CPU time
|...... subgroup 5/7    = 6.3100        i.e = 1.9500% of 30.9400%
Groups non-Idle CPU time
|...... subgroup 5/8    = 5.9800        i.e = 1.8500% of 30.9400%
Groups non-Idle CPU time
|...... subgroup 5/9    = 6.2900        i.e = 1.9400% of 30.9400%
Groups non-Idle CPU time
|...... subgroup 5/10   = 6.3300        i.e = 1.9500% of 30.9400%
Groups non-Idle CPU time
|...... subgroup 5/11   = 6.5200        i.e = 2.0100% of 30.9400%
Groups non-Idle CPU time
|...... subgroup 5/12   = 6.0500        i.e = 1.8700% of 30.9400%
Groups non-Idle CPU time
|...... subgroup 5/13   = 6.3500        i.e = 1.9600% of 30.9400%
Groups non-Idle CPU time
|...... subgroup 5/14   = 6.3500        i.e = 1.9600% of 30.9400%
Groups non-Idle CPU time
|...... subgroup 5/15   = 6.3400        i.e = 1.9600% of 30.9400%
Groups non-Idle CPU time
|...... subgroup 5/16   = 6.4200        i.e = 1.9800% of 30.9400%
Groups non-Idle CPU time


16 core hyper-threaded subset of 24 core machine (threads individually):

Average CPU Idle percentage 31.7419%
Bandwidth shared with remaining non-Idle 68.2581%
Bandwidth of Group 1 = 6.2700 i.e = 4.2700% of non-Idle CPU time 68.2581%
|...... subgroup 1/1    = 50.0100       i.e = 2.1300% of 4.2700%
Groups non-Idle CPU time
|...... subgroup 1/2    = 49.9800       i.e = 2.1300% of 4.2700%
Groups non-Idle CPU time


Bandwidth of Group 2 = 6.2700 i.e = 4.2700% of non-Idle CPU time 68.2581%
|...... subgroup 2/1    = 50.0100       i.e = 2.1300% of 4.2700%
Groups non-Idle CPU time
|...... subgroup 2/2    = 49.9800       i.e = 2.1300% of 4.2700%
Groups non-Idle CPU time


Bandwidth of Group 3 = 12.5300 i.e = 8.5500% of non-Idle CPU time 68.2581%
|...... subgroup 3/1    = 25.0100       i.e = 2.1300% of 8.5500%
Groups non-Idle CPU time
|...... subgroup 3/2    = 25.0000       i.e = 2.1300% of 8.5500%
Groups non-Idle CPU time
|...... subgroup 3/3    = 24.9900       i.e = 2.1300% of 8.5500%
Groups non-Idle CPU time
|...... subgroup 3/4    = 24.9800       i.e = 2.1300% of 8.5500%
Groups non-Idle CPU time


Bandwidth of Group 4 = 25.0200 i.e = 17.0700% of non-Idle CPU time 68.2581%
|...... subgroup 4/1    = 12.5100       i.e = 2.1300% of 17.0700%
Groups non-Idle CPU time
|...... subgroup 4/2    = 12.5000       i.e = 2.1300% of 17.0700%
Groups non-Idle CPU time
|...... subgroup 4/3    = 12.5000       i.e = 2.1300% of 17.0700%
Groups non-Idle CPU time
|...... subgroup 4/4    = 12.5000       i.e = 2.1300% of 17.0700%
Groups non-Idle CPU time
|...... subgroup 4/5    = 12.5000       i.e = 2.1300% of 17.0700%
Groups non-Idle CPU time
|...... subgroup 4/6    = 12.4900       i.e = 2.1300% of 17.0700%
Groups non-Idle CPU time
|...... subgroup 4/7    = 12.4900       i.e = 2.1300% of 17.0700%
Groups non-Idle CPU time
|...... subgroup 4/8    = 12.4800       i.e = 2.1300% of 17.0700%
Groups non-Idle CPU time


Bandwidth of Group 5 = 49.8900 i.e = 34.0500% of non-Idle CPU time 68.2581%
|...... subgroup 5/1    = 49.9600       i.e = 17.0100% of 34.0500%
Groups non-Idle CPU time
|...... subgroup 5/2    = 6.2600        i.e = 2.1300% of 34.0500%
Groups non-Idle CPU time
|...... subgroup 5/3    = 6.2600        i.e = 2.1300% of 34.0500%
Groups non-Idle CPU time
|...... subgroup 5/4    = 6.2500        i.e = 2.1200% of 34.0500%
Groups non-Idle CPU time
|...... subgroup 5/5    = 6.2500        i.e = 2.1200% of 34.0500%
Groups non-Idle CPU time
|...... subgroup 5/6    = 6.2500        i.e = 2.1200% of 34.0500%
Groups non-Idle CPU time
|...... subgroup 5/7    = 6.2500        i.e = 2.1200% of 34.0500%
Groups non-Idle CPU time
|...... subgroup 5/8    = 6.2500        i.e = 2.1200% of 34.0500%
Groups non-Idle CPU time
|...... subgroup 5/9    = 6.2400        i.e = 2.1200% of 34.0500%
Groups non-Idle CPU time
|...... subgroup 5/10   = 6.2400        i.e = 2.1200% of 34.0500%
Groups non-Idle CPU time
|...... subgroup 5/11   = 6.2400        i.e = 2.1200% of 34.0500%
Groups non-Idle CPU time
|...... subgroup 5/12   = 6.2400        i.e = 2.1200% of 34.0500%
Groups non-Idle CPU time
|...... subgroup 5/13   = 6.2400        i.e = 2.1200% of 34.0500%
Groups non-Idle CPU time
|...... subgroup 5/14   = 6.2400        i.e = 2.1200% of 34.0500%
Groups non-Idle CPU time
|...... subgroup 5/15   = 6.2300        i.e = 2.1200% of 34.0500%
Groups non-Idle CPU time
|...... subgroup 5/16   = 6.2300        i.e = 2.1200% of 34.0500%
Groups non-Idle CPU time

On Wed, Jun 8, 2011 at 9:32 AM, Kamalesh Babulal
<kamalesh@linux.vnet.ibm.com> wrote:
> * Vladimir Davydov <vdavydov@parallels.com> [2011-06-08 14:46:06]:
>
>> On Tue, 2011-06-07 at 19:45 +0400, Kamalesh Babulal wrote:
>> > Hi All,
>> >
>> >     In our test environment, while testing the CFS Bandwidth V6 patch set
>> > on top of 55922c9d1b84. We observed that the CPU's idle time is seen
>> > between 30% to 40% while running CPU bound test, with the cgroups tasks
>> > not pinned to the CPU's. Whereas in the inverse case, where the cgroups
>> > tasks are pinned to the CPU's, the idle time seen is nearly zero.
>>
>> (snip)
>>
>> > load_tasks()
>> > {
>> >         for (( i=1; i<=5; i++ ))
>> >         do
>> >                 jj=$(eval echo "\$NR_TASKS$i")
>> >                 shares="1024"
>> >                 if [ $PRO_SHARES -eq 1 ]
>> >                 then
>> >                         eval shares=$(echo "$jj * 1024" | bc)
>> >                 fi
>> >                 echo $hares > $MOUNT/$i/cpu.shares
>>                         ^^^^^
>>                         a fatal misprint? must be shares, I guess
>>
>> (Setting cpu.shares to "", i.e. to the minimal possible value, will
>> definitely confuse the load balancer)
>
> My bad. It was fatal typo, thanks for pointing it out. It made a big difference
> in the idle time reported. After correcting to $shares, now the CPU idle time
> reported is 20% to 22%. Which is 10% less from the previous reported number.
>
> (snip)
>
> There have been questions on how to interpret the results. Consider the
> following test run without pinning of the cgroups tasks
>
> Average CPU Idle percentage 20%
> Bandwidth shared with remaining non-Idle 80%
>
> Bandwidth of Group 1 = 7.9700% i.e = 6.3700% of non-Idle CPU time 80%
> |...... subgroup 1/1    = 50.0200       i.e = 3.1800% of 6.3700% Groups non-Idle CPU time
> |...... subgroup 1/2    = 49.9700       i.e = 3.1800% of 6.3700% Groups non-Idle CPU time
>
> For example let consider the cgroup1 and sum_exec time is the 7 field
> captured from the /proc/sched_debug
>
> while1 27273     30665.912793      1988   120     30665.912793  30909.566767         0.021951 /1/2
> while1 27272     30511.105690      1995   120     30511.105690  30942.998099         0.017369 /1/1
>                                                              -----------------
>
>                                                                61852.564866
>                                                              -----------------
>  - The bandwidth for sub-cgroup1 of cgroup1 is calculated  = (30909.566767 * 100) / 61852.564866
>                                                           = ~50%
>
>   and sub-cgroup2 of cgroup1 is calculated                = (30942.998099 * 100) / 61852.564866
>                                                           = ~50%
>
> In the similar way If we add up the sum_exec of all the groups its
> ------------------------------------------------------------------------------------------------
> Group1          Group2          Group3          Group4          Group5          sum_exec
> ------------------------------------------------------------------------------------------------
> 61852.564866 + 61686.604930 + 122840.294858 + 232576.303937 +296166.889155 =    775122.657746
>
> again taking the example of cgroup1
> Total percentage of bandwidth allocated to cgroup1 = (61852.564866 * 100) / 775122.657746
>                                                   = ~ 7.9% of total bandwidth of all the cgroups
>
>
> Calculating the non-idle time is done with
>        Total (execution time * 100) / (no of cpus * 60000 ms) [script is run for a 60 seconds]
>        i.e. = (775122.657746 * 100) / (16 * 60000)
>             = ~80% of non-idle time
>
> Percentage of bandwidth allocated to cgroup1 of the non-idle is derived as
>        = (cgroup bandwith percentage * non-idle time) / 100
>        = for cgroup1   = (7.9700 * 80) / 100
>                        = 6.376% bandwidth allocated of non-Idle CPU time.
>
>
> Bandwidth of Group 2 = 7.9500% i.e = 6.3600% of non-Idle CPU time 80%
> |...... subgroup 2/1    = 49.9900       i.e = 3.1700% of 6.3600% Groups non-Idle CPU time
> |...... subgroup 2/2    = 50.0000       i.e = 3.1800% of 6.3600% Groups non-Idle CPU time
>
>
> Bandwidth of Group 3 = 15.8400% i.e = 12.6700% of non-Idle CPU time 80%
> |...... subgroup 3/1    = 24.9900       i.e = 3.1600% of 12.6700% Groups non-Idle CPU time
> |...... subgroup 3/2    = 24.9900       i.e = 3.1600% of 12.6700% Groups non-Idle CPU time
> |...... subgroup 3/3    = 25.0600       i.e = 3.1700% of 12.6700% Groups non-Idle CPU time
> |...... subgroup 3/4    = 24.9400       i.e = 3.1500% of 12.6700% Groups non-Idle CPU time
>
>
> Bandwidth of Group 4 = 30.0000% i.e = 24.0000% of non-Idle CPU time 80%
> |...... subgroup 4/1    = 13.1600       i.e = 3.1500% of 24.0000% Groups non-Idle CPU time
> |...... subgroup 4/2    = 11.3800       i.e = 2.7300% of 24.0000% Groups non-Idle CPU time
> |...... subgroup 4/3    = 13.1100       i.e = 3.1400% of 24.0000% Groups non-Idle CPU time
> |...... subgroup 4/4    = 12.3100       i.e = 2.9500% of 24.0000% Groups non-Idle CPU time
> |...... subgroup 4/5    = 12.8200       i.e = 3.0700% of 24.0000% Groups non-Idle CPU time
> |...... subgroup 4/6    = 11.0600       i.e = 2.6500% of 24.0000% Groups non-Idle CPU time
> |...... subgroup 4/7    = 13.0600       i.e = 3.1300% of 24.0000% Groups non-Idle CPU time
> |...... subgroup 4/8    = 13.0600       i.e = 3.1300% of 24.0000% Groups non-Idle CPU time
>
>
> Bandwidth of Group 5 = 38.2000% i.e = 30.5600% of non-Idle CPU time 80%
> |...... subgroup 5/1    = 48.1000       i.e = 14.6900%  of 30.5600% Groups non-Idle CPU time
> |...... subgroup 5/2    = 6.7900        i.e = 2.0700%   of 30.5600% Groups non-Idle CPU time
> |...... subgroup 5/3    = 6.3700        i.e = 1.9400%   of 30.5600% Groups non-Idle CPU time
> |...... subgroup 5/4    = 5.1800        i.e = 1.5800%   of 30.5600% Groups non-Idle CPU time
> |...... subgroup 5/5    = 5.0400        i.e = 1.5400%   of 30.5600% Groups non-Idle CPU time
> |...... subgroup 5/6    = 10.1400       i.e = 3.0900%   of 30.5600% Groups non-Idle CPU time
> |...... subgroup 5/7    = 5.0700        i.e = 1.5400%   of 30.5600% Groups non-Idle CPU time
> |...... subgroup 5/8    = 6.3900        i.e = 1.9500%   of 30.5600% Groups non-Idle CPU time
> |...... subgroup 5/9    = 6.8800        i.e = 2.1000%   of 30.5600% Groups non-Idle CPU time
> |...... subgroup 5/10   = 6.4700        i.e = 1.9700%   of 30.5600% Groups non-Idle CPU time
> |...... subgroup 5/11   = 6.5600        i.e = 2.0000%   of 30.5600% Groups non-Idle CPU time
> |...... subgroup 5/12   = 4.6400        i.e = 1.4100%   of 30.5600% Groups non-Idle CPU time
> |...... subgroup 5/13   = 7.4900        i.e = 2.2800%   of 30.5600% Groups non-Idle CPU time
> |...... subgroup 5/14   = 5.8200        i.e = 1.7700%   of 30.5600% Groups non-Idle CPU time
> |...... subgroup 5/15   = 6.5500        i.e = 2.0000%   of 30.5600% Groups non-Idle CPU time
> |...... subgroup 5/16   = 5.2700        i.e = 1.6100%   of 30.5600% Groups non-Idle CPU time
>
> Thanks,
> Kamalesh.
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned
  2011-06-09  3:25       ` Paul Turner
@ 2011-06-10 18:17         ` Kamalesh Babulal
  2011-06-14  0:00           ` Paul Turner
  0 siblings, 1 reply; 129+ messages in thread
From: Kamalesh Babulal @ 2011-06-10 18:17 UTC (permalink / raw)
  To: Paul Turner
  Cc: Vladimir Davydov, linux-kernel, Peter Zijlstra, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Ingo Molnar, Pavel Emelianov

* Paul Turner <pjt@google.com> [2011-06-08 20:25:00]:

> Hi Kamalesh,
> 
> I'm unable to reproduce the results you describe.  One possibility is
> load-balancer interaction -- can you describe the topology of the
> platform you are running this on?
> 
> On both a straight NUMA topology and a hyper-threaded platform I
> observe a ~4% delta between the pinned and un-pinned cases.
> 
> Thanks -- results below,
> 
> - Paul
> 
> 
(snip)

Hi Paul,

That box is down. I tried running the test on the 2-socket quad-core with
HT and was not able to reproduce the issue: the CPU idle time reported in
both the pinned and un-pinned cases was ~0. But if we create a cgroup
hierarchy of 3 levels above the 5 cgroups, instead of the current hierarchy
where all 5 cgroups are created directly under /cgroup, the idle time is
again seen on the 2-socket quad-core (HT) box.

				-----------
				| cgroups |
				-----------
				     |
				-----------
				| level 1 |
				-----------
				     |
				-----------
				| level 2 |
				-----------
				     |
				-----------
				| level 3 |
				-----------
			      /   /   |   \     \
			     /	 /    |    \     \
			cgrp1  cgrp2 cgrp3 cgrp4 cgrp5


Un-pinned run
--------------

Average CPU Idle percentage 24.8333%
Bandwidth shared with remaining non-Idle 75.1667%
Bandwidth of Group 1 = 8.3700 i.e = 6.2900% of non-Idle CPU time 75.1667%
|...... subgroup 1/1	= 49.9900	i.e = 3.1400% of 6.2900% Groups non-Idle CPU time
|...... subgroup 1/2	= 50.0000	i.e = 3.1400% of 6.2900% Groups non-Idle CPU time
 
 
Bandwidth of Group 2 = 8.3700 i.e = 6.2900% of non-Idle CPU time 75.1667%
|...... subgroup 2/1	= 49.9900	i.e = 3.1400% of 6.2900% Groups non-Idle CPU time
|...... subgroup 2/2	= 50.0000	i.e = 3.1400% of 6.2900% Groups non-Idle CPU time
 
 
Bandwidth of Group 3 = 16.6500 i.e = 12.5100% of non-Idle CPU time 75.1667%
|...... subgroup 3/1	= 25.0000	i.e = 3.1200% of 12.5100% Groups non-Idle CPU time
|...... subgroup 3/2	= 24.9100	i.e = 3.1100% of 12.5100% Groups non-Idle CPU time
|...... subgroup 3/3	= 25.0800	i.e = 3.1300% of 12.5100% Groups non-Idle CPU time
|...... subgroup 3/4	= 24.9900	i.e = 3.1200% of 12.5100% Groups non-Idle CPU time
 
 
Bandwidth of Group 4 = 29.3600 i.e = 22.0600% of non-Idle CPU time 75.1667%
|...... subgroup 4/1	= 12.0200	i.e = 2.6500% of 22.0600% Groups non-Idle CPU time
|...... subgroup 4/2	= 12.3800	i.e = 2.7300% of 22.0600% Groups non-Idle CPU time
|...... subgroup 4/3	= 13.6300	i.e = 3.0000% of 22.0600% Groups non-Idle CPU time
|...... subgroup 4/4	= 12.7000	i.e = 2.8000% of 22.0600% Groups non-Idle CPU time
|...... subgroup 4/5	= 12.8000	i.e = 2.8200% of 22.0600% Groups non-Idle CPU time
|...... subgroup 4/6	= 11.9600	i.e = 2.6300% of 22.0600% Groups non-Idle CPU time
|...... subgroup 4/7	= 12.7400	i.e = 2.8100% of 22.0600% Groups non-Idle CPU time
|...... subgroup 4/8	= 11.7300	i.e = 2.5800% of 22.0600% Groups non-Idle CPU time
 
 
Bandwidth of Group 5 = 37.2300 i.e = 27.9800% of non-Idle CPU time 75.1667%
|...... subgroup 5/1	= 47.7200	i.e = 13.3500%	of 27.9800% Groups non-Idle CPU time
|...... subgroup 5/2	= 5.2000	i.e = 1.4500% 	of 27.9800% Groups non-Idle CPU time
|...... subgroup 5/3	= 6.3600	i.e = 1.7700% 	of 27.9800% Groups non-Idle CPU time
|...... subgroup 5/4	= 6.3600	i.e = 1.7700%	of 27.9800% Groups non-Idle CPU time
|...... subgroup 5/5	= 7.9800	i.e = 2.2300%	of 27.9800% Groups non-Idle CPU time
|...... subgroup 5/6	= 5.1800	i.e = 1.4400%	of 27.9800% Groups non-Idle CPU time
|...... subgroup 5/7	= 7.4900	i.e = 2.0900%	of 27.9800% Groups non-Idle CPU time
|...... subgroup 5/8	= 5.9200	i.e = 1.6500%	of 27.9800% Groups non-Idle CPU time
|...... subgroup 5/9	= 7.7500	i.e = 2.1600%	of 27.9800% Groups non-Idle CPU time
|...... subgroup 5/10	= 4.8100	i.e = 1.3400%	of 27.9800% Groups non-Idle CPU time
|...... subgroup 5/11	= 4.9300	i.e = 1.3700%	of 27.9800% Groups non-Idle CPU time
|...... subgroup 5/12	= 6.8900	i.e = 1.9200%	of 27.9800% Groups non-Idle CPU time
|...... subgroup 5/13	= 6.0700	i.e = 1.6900%	of 27.9800% Groups non-Idle CPU time
|...... subgroup 5/14	= 6.5200	i.e = 1.8200%	of 27.9800% Groups non-Idle CPU time
|...... subgroup 5/15	= 5.9200	i.e = 1.6500%	of 27.9800% Groups non-Idle CPU time
|...... subgroup 5/16	= 6.6400	i.e = 1.8500% 	of 27.9800% Groups non-Idle CPU time

Pinned Run
----------

Average CPU Idle percentage 0%
Bandwidth shared with remaining non-Idle 100%
Bandwidth of Group 1 = 6.2700 i.e = 6.2700% of non-Idle CPU time 100%
|...... subgroup 1/1	= 50.0100	i.e = 3.1300% of 6.2700% Groups non-Idle CPU time
|...... subgroup 1/2	= 49.9800	i.e = 3.1300% of 6.2700% Groups non-Idle CPU time
 
 
Bandwidth of Group 2 = 6.2700 i.e = 6.2700% of non-Idle CPU time 100%
|...... subgroup 2/1	= 50.0000	i.e = 3.1300% of 6.2700% Groups non-Idle CPU time
|...... subgroup 2/2	= 49.9900	i.e = 3.1300% of 6.2700% Groups non-Idle CPU time
 
 
Bandwidth of Group 3 = 12.5300 i.e = 12.5300% of non-Idle CPU time 100%
|...... subgroup 3/1	= 25.0100	i.e = 3.1300% of 12.5300% Groups non-Idle CPU time
|...... subgroup 3/2	= 25.0000	i.e = 3.1300% of 12.5300% Groups non-Idle CPU time
|...... subgroup 3/3	= 24.9900	i.e = 3.1300% of 12.5300% Groups non-Idle CPU time
|...... subgroup 3/4	= 24.9900	i.e = 3.1300% of 12.5300% Groups non-Idle CPU time
 
 
Bandwidth of Group 4 = 25.0200 i.e = 25.0200% of non-Idle CPU time 100%
|...... subgroup 4/1	= 12.5100	i.e = 3.1300% of 25.0200% Groups non-Idle CPU time
|...... subgroup 4/2	= 12.5000	i.e = 3.1200% of 25.0200% Groups non-Idle CPU time
|...... subgroup 4/3	= 12.5000	i.e = 3.1200% of 25.0200% Groups non-Idle CPU time
|...... subgroup 4/4	= 12.5000	i.e = 3.1200% of 25.0200% Groups non-Idle CPU time
|...... subgroup 4/5	= 12.4900	i.e = 3.1200% of 25.0200% Groups non-Idle CPU time
|...... subgroup 4/6	= 12.4900	i.e = 3.1200% of 25.0200% Groups non-Idle CPU time
|...... subgroup 4/7	= 12.4900	i.e = 3.1200% of 25.0200% Groups non-Idle CPU time
|...... subgroup 4/8	= 12.4800	i.e = 3.1200% of 25.0200% Groups non-Idle CPU time
 
 
Bandwidth of Group 5 = 49.8800 i.e = 49.8800% of non-Idle CPU time 100%
|...... subgroup 5/1	= 49.9600	i.e = 24.9200% of 49.8800% Groups non-Idle CPU time
|...... subgroup 5/2	= 6.2500	i.e = 3.1100% of 49.8800% Groups non-Idle CPU time
|...... subgroup 5/3	= 6.2500	i.e = 3.1100% of 49.8800% Groups non-Idle CPU time
|...... subgroup 5/4	= 6.2500	i.e = 3.1100% of 49.8800% Groups non-Idle CPU time
|...... subgroup 5/5	= 6.2500	i.e = 3.1100% of 49.8800% Groups non-Idle CPU time
|...... subgroup 5/6	= 6.2500	i.e = 3.1100% of 49.8800% Groups non-Idle CPU time
|...... subgroup 5/7	= 6.2500	i.e = 3.1100% of 49.8800% Groups non-Idle CPU time
|...... subgroup 5/8	= 6.2500	i.e = 3.1100% of 49.8800% Groups non-Idle CPU time
|...... subgroup 5/9	= 6.2400	i.e = 3.1100% of 49.8800% Groups non-Idle CPU time
|...... subgroup 5/10	= 6.2500	i.e = 3.1100% of 49.8800% Groups non-Idle CPU time
|...... subgroup 5/11	= 6.2400	i.e = 3.1100% of 49.8800% Groups non-Idle CPU time
|...... subgroup 5/12	= 6.2400	i.e = 3.1100% of 49.8800% Groups non-Idle CPU time
|...... subgroup 5/13	= 6.2400	i.e = 3.1100% of 49.8800% Groups non-Idle CPU time
|...... subgroup 5/14	= 6.2400	i.e = 3.1100% of 49.8800% Groups non-Idle CPU time
|...... subgroup 5/15	= 6.2300	i.e = 3.1000% of 49.8800% Groups non-Idle CPU time
|...... subgroup 5/16	= 6.2400	i.e = 3.1100% of 49.8800% Groups non-Idle CPU time

Modified script
---------------

#!/bin/bash

NR_TASKS1=2
NR_TASKS2=2
NR_TASKS3=4
NR_TASKS4=8
NR_TASKS5=16

BANDWIDTH=1
SUBGROUP=1
PRO_SHARES=0
MOUNT_POINT=/cgroups/
MOUNT=/cgroups/
LOAD=./while1
LEVELS=3

usage()
{
	echo "Usage $0: [-b 0|1] [-s 0|1] [-p 0|1]"
	echo "-b 1|0 set/unset  Cgroups bandwidth control (default set)"
	echo "-s Create sub-groups for every task (default creates sub-group)"
	echo "-p create proportional shares based on cpus"
	exit
}
while getopts ":b:s:p:" arg
do
	case $arg in
	b)
		BANDWIDTH=$OPTARG
		shift
		if [ $BANDWIDTH -gt 1 ] || [ $BANDWIDTH -lt 0 ]
		then
			usage
		fi
		;;
	s)
		SUBGROUP=$OPTARG
		shift
		if [ $SUBGROUP -gt 1 ] || [ $SUBGROUP -lt 0 ]
		then
			usage
		fi
		;;
	p)
		PRO_SHARES=$OPTARG
		shift
		if [ $PRO_SHARES -gt 1 ] || [ $PRO_SHARES -lt 0 ]
		then
			usage
		fi
		;;

	*)

	esac
done
if [ ! -d $MOUNT ]
then
	mkdir -p $MOUNT
fi
test()
{
	echo -n "[ "
	if [ $1 -eq 0 ]
	then
		echo -ne '\E[42;40mOk'
	else
		echo -ne '\E[31;40mFailed'
		tput sgr0
		echo " ]"
		exit
	fi
	tput sgr0
	echo " ]"
}
mount_cgrp()
{
	echo -n "Mounting root cgroup "
	mount -t cgroup -ocpu,cpuset,cpuacct none $MOUNT_POINT &> /dev/null
	test $?
}

umount_cgrp()
{
	echo -n "Unmounting root cgroup "
	cd /root/
	umount $MOUNT_POINT
	test $?
}

create_hierarchy()
{
	mount_cgrp
	cpuset_mem=`cat $MOUNT/cpuset.mems`
	cpuset_cpu=`cat $MOUNT/cpuset.cpus`
	echo -n "creating hierarchy of levels $LEVELS "
	for (( i=1; i<=$LEVELS; i++ ))
	do
		MOUNT="${MOUNT}/level${i}"
		mkdir $MOUNT
		echo $cpuset_mem > $MOUNT/cpuset.mems
		echo $cpuset_cpu > $MOUNT/cpuset.cpus
		echo "-1" > $MOUNT/cpu.cfs_quota_us
		echo "500000" > $MOUNT/cpu.cfs_period_us
		echo -n " .."
	done
	echo " "
	echo $MOUNT
	echo -n "creating groups/sub-groups ..."
	for (( i=1; i<=5; i++ ))
	do
		mkdir $MOUNT/$i
		echo $cpuset_mem > $MOUNT/$i/cpuset.mems
		echo $cpuset_cpu > $MOUNT/$i/cpuset.cpus
		echo -n ".."
		if [ $SUBGROUP -eq 1 ]
		then
			jj=$(eval echo "\$NR_TASKS$i")
			for (( j=1; j<=$jj; j++ ))
			do
				mkdir -p $MOUNT/$i/$j
				echo $cpuset_mem > $MOUNT/$i/$j/cpuset.mems
				echo $cpuset_cpu > $MOUNT/$i/$j/cpuset.cpus
				echo -n ".."
			done
		fi
	done
	echo "."
}

cleanup()
{
	pkill -9 while1 &> /dev/null
	sleep 10
	echo -n "Umount groups/sub-groups .."
	for (( i=1; i<=5; i++ ))
	do
		if [ $SUBGROUP -eq 1 ]
		then
			jj=$(eval echo "\$NR_TASKS$i")
			for (( j=1; j<=$jj; j++ ))
			do
				rmdir $MOUNT/$i/$j
				echo -n ".."
			done
		fi
		rmdir $MOUNT/$i
		echo -n ".."
	done
	cd $MOUNT
	cd ../
	for (( i=$LEVELS; i>=1; i-- ))
	do
		rmdir level$i
		cd ../	
	done
	echo " "
	umount_cgrp
}

load_tasks()
{
	for (( i=1; i<=5; i++ ))
	do
		jj=$(eval echo "\$NR_TASKS$i")
		shares="1024"
		if [ $PRO_SHARES -eq 1 ]
		then
			eval shares=$(echo "$jj * 1024" | bc)
		fi
		echo $shares > $MOUNT/$i/cpu.shares
		for (( j=1; j<=$jj; j++ ))
		do
			echo "-1" > $MOUNT/$i/cpu.cfs_quota_us
			echo "500000" > $MOUNT/$i/cpu.cfs_period_us
			if [ $SUBGROUP -eq 1 ]
			then

				$LOAD &
				echo $! > $MOUNT/$i/$j/tasks
				echo "1024" > $MOUNT/$i/$j/cpu.shares

				if [ $BANDWIDTH -eq 1 ]
				then
					echo "500000" > $MOUNT/$i/$j/cpu.cfs_period_us
					echo "250000" > $MOUNT/$i/$j/cpu.cfs_quota_us
				fi
			else
				$LOAD & 
				echo $! > $MOUNT/$i/tasks
				echo $shares > $MOUNT/$i/cpu.shares

				if [ $BANDWIDTH -eq 1 ]
				then
					echo "500000" > $MOUNT/$i/cpu.cfs_period_us
					echo "250000" > $MOUNT/$i/cpu.cfs_quota_us
				fi
			fi
		done
	done
	echo "Capturing idle cpu time with vmstat...."
	vmstat 2 100 &> vmstat_log &
}

pin_tasks()
{
	cpu=0
	count=1
	for (( i=1; i<=5; i++ ))
	do
		if [ $SUBGROUP -eq 1 ]
		then
			jj=$(eval echo "\$NR_TASKS$i")
			for (( j=1; j<=$jj; j++ ))
			do
				if [ $count -gt 2 ]
				then
					cpu=$((cpu+1))
					count=1
				fi
				echo $cpu > $MOUNT/$i/$j/cpuset.cpus
				count=$((count+1))
			done
		else
			case $i in
			1)
				echo 0 > $MOUNT/$i/cpuset.cpus;;
			2)
				echo 1 > $MOUNT/$i/cpuset.cpus;;
			3)
				echo "2-3" > $MOUNT/$i/cpuset.cpus;;
			4)
				echo "4-6" > $MOUNT/$i/cpuset.cpus;;
			5)
				echo "7-15" > $MOUNT/$i/cpuset.cpus;;
			esac
		fi
	done
	
}

print_results()
{
	eval gtot=$(cat sched_log|grep -i while|sed 's/R//g'|awk '{gtot+=$7};END{printf "%f", gtot}')
	for (( i=1; i<=5; i++ ))	
	do
		eval temp=$(cat sched_log_$i|sed 's/R//g'| awk '{gtot+=$7};END{printf "%f",gtot}')
		eval tavg=$(echo "scale=4;(($temp / $gtot) * $1)/100 " | bc)
		eval avg=$(echo  "scale=4;($temp / $gtot) * 100" | bc)
		eval pretty_tavg=$( echo "scale=4; $tavg * 100"| bc) # For pretty format
		echo "Bandwidth of Group $i = $avg i.e = $pretty_tavg% of non-Idle CPU time $1%"
		if [ $SUBGROUP -eq 1 ]
		then
			jj=$(eval echo "\$NR_TASKS$i")
			for (( j=1; j<=$jj; j++ ))
			do
				eval tmp=$(cat sched_log_$i-$j|sed 's/R//g'| awk '{gtot+=$7};END{printf "%f",gtot}')
				eval stavg=$(echo "scale=4;($tmp / $temp) * 100" | bc)
				eval pretty_stavg=$(echo "scale=4;(($tmp / $temp) * $tavg) * 100" | bc)
				echo -n "|"
				echo -e "...... subgroup $i/$j\t= $stavg\ti.e = $pretty_stavg% of $pretty_tavg% Groups non-Idle CPU time"
			done
		fi
		echo " "
		echo " "
	done
}

capture_results()
{
	cat /proc/sched_debug > sched_log
	lev=""
	for (( i=1; i<=$LEVELS; i++ ))
	do
		lev="$lev\/level${i}"	
	done
	pkill -9 vmstat 
	avg=$(cat vmstat_log |grep -iv "system"|grep -iv "swpd"|awk ' { if ( NR != 1) {id+=$15 }}END{print (id/(NR-1))}')

	rem=$(echo "scale=2; 100 - $avg" |bc)
	echo "Average CPU Idle percentage $avg%"	
	echo "Bandwidth shared with remaining non-Idle $rem%" 
	for (( i=1; i<=5; i++ ))
	do
		cat sched_log |grep -i while1|grep -i "$lev\/$i" > sched_log_$i
		if [ $SUBGROUP -eq 1 ]
		then
			jj=$(eval echo "\$NR_TASKS$i")
			for (( j=1; j<=$jj; j++ ))
			do
				cat sched_log |grep -i while1|grep -i "$lev\/$i\/$j" > sched_log_$i-$j
			done
		fi
	done
	print_results $rem
}

create_hierarchy
pin_tasks

load_tasks
sleep 60
capture_results
cleanup
exit


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned
  2011-06-10 18:17         ` Kamalesh Babulal
@ 2011-06-14  0:00           ` Paul Turner
  2011-06-15  5:37             ` Kamalesh Babulal
  0 siblings, 1 reply; 129+ messages in thread
From: Paul Turner @ 2011-06-14  0:00 UTC (permalink / raw)
  To: Kamalesh Babulal
  Cc: Vladimir Davydov, linux-kernel, Peter Zijlstra, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Ingo Molnar, Pavel Emelianov

Hi Kamalesh.

I tried on both Friday and again today to reproduce your results
without success.  Results are attached below.  The margin of error is
the same as in the previous (2-level deep) case, ~4%.  One minor nit: in
your script's input parsing you're calling shift; you don't need to do
this with getopts, and it will actually lead to arguments being
dropped.
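
Something like the following is sufficient for the option parsing (a minimal
sketch only, with the 0/1 value validation omitted and not tested against
your setup):

	while getopts ":b:s:p:" arg
	do
		case $arg in
		b) BANDWIDTH=$OPTARG ;;		# getopts consumes the option and its
		s) SUBGROUP=$OPTARG ;;		# argument itself, no shift needed here
		p) PRO_SHARES=$OPTARG ;;
		*) usage ;;
		esac
	done
	shift $((OPTIND - 1))	# only needed if positional arguments follow the options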

Are you testing on top of a clean -tip?  Do you have any custom
load-balancer or scheduler settings?

Thanks,

- Paul


Hyper-threaded topology:
unpinned:
Average CPU Idle percentage 38.6333%
Bandwidth shared with remaining non-Idle 61.3667%

pinned:
Average CPU Idle percentage 35.2766%
Bandwidth shared with remaining non-Idle 64.7234%
(The mask in the "unpinned" case is 0-3,6-9,12-15,18-21 which should
mirror your 2 socket 8x2 configuration.)

4-way NUMA topology:
unpinned:
Average CPU Idle percentage 5.26667%
Bandwidth shared with remaining non-Idle 94.73333%

pinned:
Average CPU Idle percentage 0.242424%
Bandwidth shared with remaining non-Idle 99.757576%




On Fri, Jun 10, 2011 at 11:17 AM, Kamalesh Babulal
<kamalesh@linux.vnet.ibm.com> wrote:
> * Paul Turner <pjt@google.com> [2011-06-08 20:25:00]:
>
>> Hi Kamalesh,
>>
>> I'm unable to reproduce the results you describe.  One possibility is
>> load-balancer interaction -- can you describe the topology of the
>> platform you are running this on?
>>
>> On both a straight NUMA topology and a hyper-threaded platform I
>> observe a ~4% delta between the pinned and un-pinned cases.
>>
>> Thanks -- results below,
>>
>> - Paul
>>
>>
> (snip)
>
> Hi Paul,
>
> That box is down. I tried running the test on the 2-socket quad-core with
> HT and I was not able to reproduce the issue. CPU idle time reported with
> both pinned and un-pinned case was ~0. But if we create a cgroup hirerachy
> of 3 levels above the 5 cgroups, instead of the current hirerachy where all
> the 5 cgroups created under /cgroup. The Idle time is seen on 2-socket
> quad-core (HT) box.
>
>                                -----------
>                                | cgroups |
>                                -----------
>                                     |
>                                -----------
>                                | level 1 |
>                                -----------
>                                     |
>                                -----------
>                                | level 2 |
>                                -----------
>                                     |
>                                -----------
>                                | level 3 |
>                                -----------
>                              /   /   |   \     \
>                             /   /    |    \     \
>                        cgrp1  cgrp2 cgrp3 cgrp4 cgrp5
>
>
> Un-pinned run
> --------------
>
> Average CPU Idle percentage 24.8333%
> Bandwidth shared with remaining non-Idle 75.1667%
> Bandwidth of Group 1 = 8.3700 i.e = 6.2900% of non-Idle CPU time 75.1667%
> |...... subgroup 1/1    = 49.9900       i.e = 3.1400% of 6.2900% Groups non-Idle CPU time
> |...... subgroup 1/2    = 50.0000       i.e = 3.1400% of 6.2900% Groups non-Idle CPU time
>
>
> Bandwidth of Group 2 = 8.3700 i.e = 6.2900% of non-Idle CPU time 75.1667%
> |...... subgroup 2/1    = 49.9900       i.e = 3.1400% of 6.2900% Groups non-Idle CPU time
> |...... subgroup 2/2    = 50.0000       i.e = 3.1400% of 6.2900% Groups non-Idle CPU time
>
>
> Bandwidth of Group 3 = 16.6500 i.e = 12.5100% of non-Idle CPU time 75.1667%
> |...... subgroup 3/1    = 25.0000       i.e = 3.1200% of 12.5100% Groups non-Idle CPU time
> |...... subgroup 3/2    = 24.9100       i.e = 3.1100% of 12.5100% Groups non-Idle CPU time
> |...... subgroup 3/3    = 25.0800       i.e = 3.1300% of 12.5100% Groups non-Idle CPU time
> |...... subgroup 3/4    = 24.9900       i.e = 3.1200% of 12.5100% Groups non-Idle CPU time
>
>
> Bandwidth of Group 4 = 29.3600 i.e = 22.0600% of non-Idle CPU time 75.1667%
> |...... subgroup 4/1    = 12.0200       i.e = 2.6500% of 22.0600% Groups non-Idle CPU time
> |...... subgroup 4/2    = 12.3800       i.e = 2.7300% of 22.0600% Groups non-Idle CPU time
> |...... subgroup 4/3    = 13.6300       i.e = 3.0000% of 22.0600% Groups non-Idle CPU time
> |...... subgroup 4/4    = 12.7000       i.e = 2.8000% of 22.0600% Groups non-Idle CPU time
> |...... subgroup 4/5    = 12.8000       i.e = 2.8200% of 22.0600% Groups non-Idle CPU time
> |...... subgroup 4/6    = 11.9600       i.e = 2.6300% of 22.0600% Groups non-Idle CPU time
> |...... subgroup 4/7    = 12.7400       i.e = 2.8100% of 22.0600% Groups non-Idle CPU time
> |...... subgroup 4/8    = 11.7300       i.e = 2.5800% of 22.0600% Groupsnon-Idle CPU time
>
>
> Bandwidth of Group 5 = 37.2300 i.e = 27.9800% of non-Idle CPU time 75.1667%
> |...... subgroup 5/1    = 47.7200       i.e = 13.3500%  of 27.9800% Groups non-Idle CPU time
> |...... subgroup 5/2    = 5.2000        i.e = 1.4500%   of 27.9800% Groups non-Idle CPU time
> |...... subgroup 5/3    = 6.3600        i.e = 1.7700%   of 27.9800% Groups non-Idle CPU time
> |...... subgroup 5/4    = 6.3600        i.e = 1.7700%   of 27.9800% Groups non-Idle CPU time
> |...... subgroup 5/5    = 7.9800        i.e = 2.2300%   of 27.9800% Groups non-Idle CPU time
> |...... subgroup 5/6    = 5.1800        i.e = 1.4400%   of 27.9800% Groups non-Idle CPU time
> |...... subgroup 5/7    = 7.4900        i.e = 2.0900%   of 27.9800% Groups non-Idle CPU time
> |...... subgroup 5/8    = 5.9200        i.e = 1.6500%   of 27.9800% Groups non-Idle CPU time
> |...... subgroup 5/9    = 7.7500        i.e = 2.1600%   of 27.9800% Groups non-Idle CPU time
> |...... subgroup 5/10   = 4.8100        i.e = 1.3400%   of 27.9800% Groups non-Idle CPU time
> |...... subgroup 5/11   = 4.9300        i.e = 1.3700%   of 27.9800% Groups non-Idle CPU time
> |...... subgroup 5/12   = 6.8900        i.e = 1.9200%   of 27.9800% Groups non-Idle CPU time
> |...... subgroup 5/13   = 6.0700        i.e = 1.6900%   of 27.9800% Groups non-Idle CPU time
> |...... subgroup 5/14   = 6.5200        i.e = 1.8200%   of 27.9800% Groups non-Idle CPU time
> |...... subgroup 5/15   = 5.9200        i.e = 1.6500%   of 27.9800% Groups non-Idle CPU time
> |...... subgroup 5/16   = 6.6400        i.e = 1.8500%   of 27.9800% Groups non-Idle CPU time
>
> Pinned Run
> ----------
>
> Average CPU Idle percentage 0%
> Bandwidth shared with remaining non-Idle 100%
> Bandwidth of Group 1 = 6.2700 i.e = 6.2700% of non-Idle CPU time 100%
> |...... subgroup 1/1    = 50.0100       i.e = 3.1300% of 6.2700% Groups non-Idle CPU time
> |...... subgroup 1/2    = 49.9800       i.e = 3.1300% of 6.2700% Groups non-Idle CPU time
>
>
> Bandwidth of Group 2 = 6.2700 i.e = 6.2700% of non-Idle CPU time 100%
> |...... subgroup 2/1    = 50.0000       i.e = 3.1300% of 6.2700% Groups non-Idle CPU time
> |...... subgroup 2/2    = 49.9900       i.e = 3.1300% of 6.2700% Groups non-Idle CPU time
>
>
> Bandwidth of Group 3 = 12.5300 i.e = 12.5300% of non-Idle CPU time 100%
> |...... subgroup 3/1    = 25.0100       i.e = 3.1300% of 12.5300% Groups non-Idle CPU time
> |...... subgroup 3/2    = 25.0000       i.e = 3.1300% of 12.5300% Groups non-Idle CPU time
> |...... subgroup 3/3    = 24.9900       i.e = 3.1300% of 12.5300% Groups non-Idle CPU time
> |...... subgroup 3/4    = 24.9900       i.e = 3.1300% of 12.5300% Groups non-Idle CPU time
>
>
> Bandwidth of Group 4 = 25.0200 i.e = 25.0200% of non-Idle CPU time 100%
> |...... subgroup 4/1    = 12.5100       i.e = 3.1300% of 25.0200% Groups non-Idle CPU time
> |...... subgroup 4/2    = 12.5000       i.e = 3.1200% of 25.0200% Groups non-Idle CPU time
> |...... subgroup 4/3    = 12.5000       i.e = 3.1200% of 25.0200% Groups non-Idle CPU time
> |...... subgroup 4/4    = 12.5000       i.e = 3.1200% of 25.0200% Groups non-Idle CPU time
> |...... subgroup 4/5    = 12.4900       i.e = 3.1200% of 25.0200% Groups non-Idle CPU time
> |...... subgroup 4/6    = 12.4900       i.e = 3.1200% of 25.0200% Groups non-Idle CPU time
> |...... subgroup 4/7    = 12.4900       i.e = 3.1200% of 25.0200% Groups non-Idle CPU time
> |...... subgroup 4/8    = 12.4800       i.e = 3.1200% of 25.0200% Groups non-Idle CPU time
>
>
> Bandwidth of Group 5 = 49.8800 i.e = 49.8800% of non-Idle CPU time 100%
> |...... subgroup 5/1    = 49.9600       i.e = 24.9200% of 49.8800% Groups non-Idle CPU time
> |...... subgroup 5/2    = 6.2500        i.e = 3.1100% of 49.8800% Groups non-Idle CPU time
> |...... subgroup 5/3    = 6.2500        i.e = 3.1100% of 49.8800% Groups non-Idle CPU time
> |...... subgroup 5/4    = 6.2500        i.e = 3.1100% of 49.8800% Groups non-Idle CPU time
> |...... subgroup 5/5    = 6.2500        i.e = 3.1100% of 49.8800% Groups non-Idle CPU time
> |...... subgroup 5/6    = 6.2500        i.e = 3.1100% of 49.8800% Groups non-Idle CPU time
> |...... subgroup 5/7    = 6.2500        i.e = 3.1100% of 49.8800% Groups non-Idle CPU time
> |...... subgroup 5/8    = 6.2500        i.e = 3.1100% of 49.8800% Groups non-Idle CPU time
> |...... subgroup 5/9    = 6.2400        i.e = 3.1100% of 49.8800% Groups non-Idle CPU time
> |...... subgroup 5/10   = 6.2500        i.e = 3.1100% of 49.8800% Groups non-Idle CPU time
> |...... subgroup 5/11   = 6.2400        i.e = 3.1100% of 49.8800% Groups non-Idle CPU time
> |...... subgroup 5/12   = 6.2400        i.e = 3.1100% of 49.8800% Groups non-Idle CPU time
> |...... subgroup 5/13   = 6.2400        i.e = 3.1100% of 49.8800% Groups non-Idle CPU time
> |...... subgroup 5/14   = 6.2400        i.e = 3.1100% of 49.8800% Groups non-Idle CPU time
> |...... subgroup 5/15   = 6.2300        i.e = 3.1000% of 49.8800% Groups non-Idle CPU time
> |...... subgroup 5/16   = 6.2400        i.e = 3.1100% of 49.8800% Groups non-Idle CPU time
>
> Modified script
> ---------------
>
> #!/bin/bash
>
> NR_TASKS1=2
> NR_TASKS2=2
> NR_TASKS3=4
> NR_TASKS4=8
> NR_TASKS5=16
>
> BANDWIDTH=1
> SUBGROUP=1
> PRO_SHARES=0
> MOUNT_POINT=/cgroups/
> MOUNT=/cgroups/
> LOAD=./while1
> LEVELS=3
>
> usage()
> {
>        echo "Usage $0: [-b 0|1] [-s 0|1] [-p 0|1]"
>        echo "-b 1|0 set/unset  Cgroups bandwidth control (default set)"
>        echo "-s Create sub-groups for every task (default creates sub-group)"
>        echo "-p create proportional shares based on cpus"
>        exit
> }
> while getopts ":b:s:p:" arg
> do
>        case $arg in
>        b)
>                BANDWIDTH=$OPTARG
>                shift
>                if [ $BANDWIDTH -gt 1 ] || [ $BANDWIDTH -lt 0 ]
>                then
>                        usage
>                fi
>                ;;
>        s)
>                SUBGROUP=$OPTARG
>                shift
>                if [ $SUBGROUP -gt 1 ] || [ $SUBGROUP -lt 0 ]
>                then
>                        usage
>                fi
>                ;;
>        p)
>                PRO_SHARES=$OPTARG
>                shift
>                if [ $PRO_SHARES -gt 1 ] || [ $PRO_SHARES -lt 0 ]
>                then
>                        usage
>                fi
>                ;;
>
>        *)
>
>        esac
> done
> if [ ! -d $MOUNT ]
> then
>        mkdir -p $MOUNT
> fi
> test()
> {
>        echo -n "[ "
>        if [ $1 -eq 0 ]
>        then
>                echo -ne '\E[42;40mOk'
>        else
>                echo -ne '\E[31;40mFailed'
>                tput sgr0
>                echo " ]"
>                exit
>        fi
>        tput sgr0
>        echo " ]"
> }
> mount_cgrp()
> {
>        echo -n "Mounting root cgroup "
>        mount -t cgroup -ocpu,cpuset,cpuacct none $MOUNT_POINT &> /dev/null
>        test $?
> }
>
> umount_cgrp()
> {
>        echo -n "Unmounting root cgroup "
>        cd /root/
>        umount $MOUNT_POINT
>        test $?
> }
>
> create_hierarchy()
> {
>        mount_cgrp
>        cpuset_mem=`cat $MOUNT/cpuset.mems`
>        cpuset_cpu=`cat $MOUNT/cpuset.cpus`
>        echo -n "creating hierarchy of levels $LEVELS "
>        for (( i=1; i<=$LEVELS; i++ ))
>        do
>                MOUNT="${MOUNT}/level${i}"
>                mkdir $MOUNT
>                echo $cpuset_mem > $MOUNT/cpuset.mems
>                echo $cpuset_cpu > $MOUNT/cpuset.cpus
>                echo "-1" > $MOUNT/cpu.cfs_quota_us
>                echo "500000" > $MOUNT/cpu.cfs_period_us
>                echo -n " .."
>        done
>        echo " "
>        echo $MOUNT
>        echo -n "creating groups/sub-groups ..."
>        for (( i=1; i<=5; i++ ))
>        do
>                mkdir $MOUNT/$i
>                echo $cpuset_mem > $MOUNT/$i/cpuset.mems
>                echo $cpuset_cpu > $MOUNT/$i/cpuset.cpus
>                echo -n ".."
>                if [ $SUBGROUP -eq 1 ]
>                then
>                        jj=$(eval echo "\$NR_TASKS$i")
>                        for (( j=1; j<=$jj; j++ ))
>                        do
>                                mkdir -p $MOUNT/$i/$j
>                                echo $cpuset_mem > $MOUNT/$i/$j/cpuset.mems
>                                echo $cpuset_cpu > $MOUNT/$i/$j/cpuset.cpus
>                                echo -n ".."
>                        done
>                fi
>        done
>        echo "."
> }
>
> cleanup()
> {
>        pkill -9 while1 &> /dev/null
>        sleep 10
>        echo -n "Umount groups/sub-groups .."
>        for (( i=1; i<=5; i++ ))
>        do
>                if [ $SUBGROUP -eq 1 ]
>                then
>                        jj=$(eval echo "\$NR_TASKS$i")
>                        for (( j=1; j<=$jj; j++ ))
>                        do
>                                rmdir $MOUNT/$i/$j
>                                echo -n ".."
>                        done
>                fi
>                rmdir $MOUNT/$i
>                echo -n ".."
>        done
>        cd $MOUNT
>        cd ../
>        for (( i=$LEVELS; i>=1; i-- ))
>        do
>                rmdir level$i
>                cd ../
>        done
>        echo " "
>        umount_cgrp
> }
>
> load_tasks()
> {
>        for (( i=1; i<=5; i++ ))
>        do
>                jj=$(eval echo "\$NR_TASKS$i")
>                shares="1024"
>                if [ $PRO_SHARES -eq 1 ]
>                then
>                        eval shares=$(echo "$jj * 1024" | bc)
>                fi
>                echo $shares > $MOUNT/$i/cpu.shares
>                for (( j=1; j<=$jj; j++ ))
>                do
>                        echo "-1" > $MOUNT/$i/cpu.cfs_quota_us
>                        echo "500000" > $MOUNT/$i/cpu.cfs_period_us
>                        if [ $SUBGROUP -eq 1 ]
>                        then
>
>                                $LOAD &
>                                echo $! > $MOUNT/$i/$j/tasks
>                                echo "1024" > $MOUNT/$i/$j/cpu.shares
>
>                                if [ $BANDWIDTH -eq 1 ]
>                                then
>                                        echo "500000" > $MOUNT/$i/$j/cpu.cfs_period_us
>                                        echo "250000" > $MOUNT/$i/$j/cpu.cfs_quota_us
>                                fi
>                        else
>                                $LOAD &
>                                echo $! > $MOUNT/$i/tasks
>                                echo $shares > $MOUNT/$i/cpu.shares
>
>                                if [ $BANDWIDTH -eq 1 ]
>                                then
>                                        echo "500000" > $MOUNT/$i/cpu.cfs_period_us
>                                        echo "250000" > $MOUNT/$i/cpu.cfs_quota_us
>                                fi
>                        fi
>                done
>        done
>        echo "Capturing idle cpu time with vmstat...."
>        vmstat 2 100 &> vmstat_log &
> }
>
> pin_tasks()
> {
>        cpu=0
>        count=1
>        for (( i=1; i<=5; i++ ))
>        do
>                if [ $SUBGROUP -eq 1 ]
>                then
>                        jj=$(eval echo "\$NR_TASKS$i")
>                        for (( j=1; j<=$jj; j++ ))
>                        do
>                                if [ $count -gt 2 ]
>                                then
>                                        cpu=$((cpu+1))
>                                        count=1
>                                fi
>                                echo $cpu > $MOUNT/$i/$j/cpuset.cpus
>                                count=$((count+1))
>                        done
>                else
>                        case $i in
>                        1)
>                                echo 0 > $MOUNT/$i/cpuset.cpus;;
>                        2)
>                                echo 1 > $MOUNT/$i/cpuset.cpus;;
>                        3)
>                                echo "2-3" > $MOUNT/$i/cpuset.cpus;;
>                        4)
>                                echo "4-6" > $MOUNT/$i/cpuset.cpus;;
>                        5)
>                                echo "7-15" > $MOUNT/$i/cpuset.cpus;;
>                        esac
>                fi
>        done
>
> }
>
> print_results()
> {
>        eval gtot=$(cat sched_log|grep -i while|sed 's/R//g'|awk '{gtot+=$7};END{printf "%f", gtot}')
>        for (( i=1; i<=5; i++ ))
>        do
>                eval temp=$(cat sched_log_$i|sed 's/R//g'| awk '{gtot+=$7};END{printf "%f",gtot}')
>                eval tavg=$(echo "scale=4;(($temp / $gtot) * $1)/100 " | bc)
>                eval avg=$(echo  "scale=4;($temp / $gtot) * 100" | bc)
>                eval pretty_tavg=$( echo "scale=4; $tavg * 100"| bc) # For pretty format
>                echo "Bandwidth of Group $i = $avg i.e = $pretty_tavg% of non-Idle CPU time $1%"
>                if [ $SUBGROUP -eq 1 ]
>                then
>                        jj=$(eval echo "\$NR_TASKS$i")
>                        for (( j=1; j<=$jj; j++ ))
>                        do
>                                eval tmp=$(cat sched_log_$i-$j|sed 's/R//g'| awk '{gtot+=$7};END{printf "%f",gtot}')
>                                eval stavg=$(echo "scale=4;($tmp / $temp) * 100" | bc)
>                                eval pretty_stavg=$(echo "scale=4;(($tmp / $temp) * $tavg) * 100" | bc)
>                                echo -n "|"
>                                echo -e "...... subgroup $i/$j\t= $stavg\ti.e = $pretty_stavg% of $pretty_tavg% Groups non-Idle CPU time"
>                        done
>                fi
>                echo " "
>                echo " "
>        done
> }
>
> capture_results()
> {
>        cat /proc/sched_debug > sched_log
>        lev=""
>        for (( i=1; i<=$LEVELS; i++ ))
>        do
>                lev="$lev\/level${i}"
>        done
>        pkill -9 vmstat
>        avg=$(cat vmstat_log |grep -iv "system"|grep -iv "swpd"|awk ' { if ( NR != 1) {id+=$15 }}END{print (id/(NR-1))}')
>
>        rem=$(echo "scale=2; 100 - $avg" |bc)
>        echo "Average CPU Idle percentage $avg%"
>        echo "Bandwidth shared with remaining non-Idle $rem%"
>        for (( i=1; i<=5; i++ ))
>        do
>                cat sched_log |grep -i while1|grep -i "$lev\/$i" > sched_log_$i
>                if [ $SUBGROUP -eq 1 ]
>                then
>                        jj=$(eval echo "\$NR_TASKS$i")
>                        for (( j=1; j<=$jj; j++ ))
>                        do
>                                cat sched_log |grep -i while1|grep -i "$lev\/$i\/$j" > sched_log_$i-$j
>                        done
>                fi
>        done
>        print_results $rem
> }
>
> create_hierarchy
> pin_tasks
>
> load_tasks
> sleep 60
> capture_results
> cleanup
> exit
>
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 00/15] CFS Bandwidth Control V6
  2011-05-03  9:28 [patch 00/15] CFS Bandwidth Control V6 Paul Turner
                   ` (15 preceding siblings ...)
  2011-06-07 15:45 ` CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned Kamalesh Babulal
@ 2011-06-14  6:58 ` Hu Tao
  2011-06-14  7:29   ` Hidetoshi Seto
  16 siblings, 1 reply; 129+ messages in thread
From: Hu Tao @ 2011-06-14  6:58 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri

[-- Attachment #1: Type: text/plain, Size: 496 bytes --]

Hi,

I've run several tests including hackbench, unixbench, massive-intr
and kernel building. CPU is Intel(R) Xeon(R) CPU X3430  @ 2.40GHz,
4 cores, and 4G memory.

Most of the time the results differ only slightly, but there are problems:

1. unixbench: execl throughput has about a 5% drop.
2. unixbench: process creation has about a 5% drop.
3. massive-intr: when running 200 processes for 5 mins, the number
   of loops each process runs varies more than before cfs-bandwidth-v6.

The results are attached.
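(For reference, a rough sketch of how such a run can be launched and its
spread summarized -- assuming the usual massive_intr invocation of
<nprocs> <runtime-in-seconds> and the two-column pid/loop-count output shown
in the attachments:)

./massive_intr 200 300 > massive-intr-200-300.txt
# summarize the per-process loop counts to gauge the spread
awk '{ if (min == "" || $2 < min) min = $2; if ($2 > max) max = $2; sum += $2 }
     END { printf "min=%d max=%d avg=%.1f spread=%d\n", min, max, sum/NR, max - min }' \
    massive-intr-200-300.txt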

[-- Attachment #2: massive-intr-200-300-without-patch.txt --]
[-- Type: text/plain, Size: 2784 bytes --]

004726	00000761
004723	00000763
004793	00000763
004776	00000736
004746	00000735
004731	00000754
004685	00000735
004835	00000754
004782	00000751
004747	00000736
004766	00000754
004663	00000735
004696	00000752
004737	00000760
004679	00000735
004727	00000751
004840	00000754
004720	00000767
004718	00000764
004788	00000761
004716	00000770
004791	00000758
004655	00000755
004838	00000757
004811	00000753
004659	00000768
004686	00000735
004740	00000759
004676	00000739
004849	00000748
004825	00000763
004808	00000748
004844	00000747
004702	00000755
004828	00000758
004829	00000758
004822	00000750
004820	00000753
004805	00000751
004764	00000748
004717	00000765
004794	00000761
004701	00000750
004792	00000766
004818	00000753
004842	00000752
004837	00000751
004697	00000750
004654	00000739
004763	00000754
004851	00000761
004671	00000738
004807	00000753
004734	00000760
004661	00000740
004743	00000737
004664	00000740
004682	00000737
004741	00000750
004817	00000750
004694	00000754
004779	00000753
004833	00000754
004758	00000757
004809	00000756
004815	00000752
004666	00000758
004770	00000750
004704	00000737
004709	00000753
004841	00000754
004732	00000753
004706	00000753
004675	00000739
004745	00000737
004719	00000765
004691	00000764
004777	00000756
004778	00000750
004780	00000759
004754	00000737
004799	00000755
004848	00000755
004752	00000737
004742	00000734
004773	00000752
004774	00000747
004673	00000736
004787	00000763
004781	00000756
004693	00000753
004692	00000751
004769	00000750
004728	00000763
004756	00000758
004749	00000737
004762	00000753
004687	00000739
004827	00000766
004683	00000734
004761	00000757
004678	00000739
004830	00000763
004803	00000763
004798	00000765
004850	00000760
004771	00000749
004674	00000737
004832	00000753
004821	00000757
004753	00000734
004843	00000752
004724	00000763
004759	00000752
004800	00000753
004700	00000753
004824	00000763
004767	00000755
004823	00000751
004789	00000768
004757	00000755
004852	00000765
004836	00000756
004839	00000757
004760	00000748
004834	00000758
004739	00000759
004786	00000768
004846	00000754
004711	00000761
004826	00000765
004695	00000755
004710	00000758
004783	00000761
004765	00000755
004684	00000731
004698	00000752
004785	00000768
004755	00000736
004813	00000754
004775	00000753
004795	00000765
004712	00000755
004768	00000755
004713	00000767
004816	00000752
004790	00000765
004744	00000731
004736	00000756
004672	00000741
004715	00000766
004667	00000754
004705	00000755
004810	00000755
004708	00000755
004707	00000752
004750	00000736
004688	00000736
004772	00000741
004703	00000736
004681	00000736
004748	00000737
004668	00000736
004690	00000739
004669	00000739
004733	00000743
004656	00000767
004812	00000749
004714	00000771
004677	00000741
004806	00000755
004665	00000736
004680	00000739
004670	00000739

[-- Attachment #3: massive-intr-200-300-with-patch.txt --]
[-- Type: text/plain, Size: 3200 bytes --]

004663	00000754
004634	00000694
004723	00000800
004746	00000751
004734	00000768
004633	00000689
004755	00000754
004722	00000797
004626	00000797
004689	00000765
004767	00000695
004813	00000765
004724	00000800
004621	00000769
004725	00000796
004714	00000799
004789	00000793
004631	00000758
004712	00000796
004744	00000748
004655	00000796
004783	00000751
004785	00000800
004790	00000796
004758	00000748
004816	00000772
004683	00000765
004636	00000694
004771	00000691
004619	00000695
004669	00000753
004623	00000696
004775	00000753
004752	00000748
004778	00000754
004784	00000751
004739	00000767
004807	00000762
004693	00000765
004691	00000770
004736	00000763
004709	00000768
004720	00000796
004628	00000695
004772	00000695
004696	00000695
004682	00000692
004675	00000748
004643	00000689
004637	00000695
004715	00000793
004787	00000796
004792	00000793
004797	00000796
004708	00000768
004651	00000796
004806	00000766
004679	00000766
004811	00000763
004699	00000695
004624	00000769
004638	00000695
004645	00000695
004635	00000692
004704	00000692
004742	00000764
004680	00000761
004800	00000796
004796	00000801
004802	00000798
004731	00000793
004677	00000770
004640	00000692
004657	00000692
004656	00000793
004730	00000790
004786	00000795
004817	00000766
004627	00000694
004727	00000793
004814	00000773
004658	00000798
004695	00000689
004791	00000792
004653	00000795
004798	00000792
004673	00000745
004666	00000753
004753	00000751
004664	00000753
004788	00000798
004801	00000753
004685	00000766
004810	00000770
004750	00000753
004754	00000755
004652	00000795
004668	00000753
004654	00000795
004648	00000695
004777	00000747
004765	00000694
004672	00000753
004665	00000750
004737	00000770
004757	00000747
004620	00000796
004780	00000750
004717	00000792
004773	00000751
004756	00000767
004760	00000746
004808	00000770
004776	00000753
004662	00000756
004670	00000750
004625	00000694
004647	00000694
004794	00000795
004738	00000767
004641	00000698
004735	00000767
004759	00000694
004799	00000790
004762	00000697
004629	00000694
004769	00000694
004705	00000694
004743	00000767
004781	00000750
004701	00000697
004661	00000749
004702	00000694
004710	00000770
004681	00000767
004700	00000691
004686	00000767
004642	00000694
004747	00000753
004644	00000694
004812	00000767
004748	00000750
004733	00000764
004721	00000797
004687	00000771
004690	00000771
004751	00000749
004632	00000694
004732	00000764
004728	00000798
004766	00000694
004706	00000764
004630	00000694
004688	00000764
004711	00000694
004622	00000753
004795	00000798
004815	00000770
004729	00000791
004763	00000747
004818	00000766
004674	00000749
004761	00000694
004749	00000752
004770	00000692
004718	00000795
004694	00000694
004782	00000755
004809	00000766
004740	00000770
004671	00000752
004716	00000762
004707	00000766
004692	00000801
004719	00000795
004713	00000800
004659	00000797
004764	00000749
004774	00000747
004698	00000688
004649	00000696
004779	00000752
004768	00000694
004676	00000752
004646	00000693
004805	00000755
004697	00000691
004703	00000692
004639	00000694
004804	00000693
004803	00000754
004678	00000769
004741	00000768
004684	00000761
004660	00000693
004793	00000797
004667	00000753
004726	00000795
004745	00000755
004650	00000691

[-- Attachment #4: massive-intr-200-60-without-patch.txt --]
[-- Type: text/plain, Size: 2608 bytes --]

004544	00000138
004411	00000152
004435	00000154
004408	00000152
004553	00000138
004450	00000138
004540	00000138
004534	00000138
004557	00000138
004545	00000138
004469	00000152
004467	00000152
004521	00000138
004396	00000152
004484	00000152
004556	00000138
004474	00000152
004537	00000138
004489	00000152
004481	00000152
004547	00000138
004587	00000138
004555	00000138
004393	00000152
004480	00000152
004405	00000152
004392	00000152
004475	00000152
004402	00000152
004563	00000135
004524	00000154
004427	00000140
004517	00000154
004431	00000154
004584	00000154
004432	00000154
004418	00000140
004442	00000154
004420	00000140
004443	00000154
004428	00000140
004549	00000154
004466	00000140
004525	00000154
004516	00000154
004423	00000140
004468	00000140
004532	00000154
004444	00000154
004531	00000154
004441	00000154
004577	00000154
004438	00000151
004518	00000151
004574	00000151
004513	00000155
004398	00000156
004588	00000153
004413	00000154
004403	00000154
004520	00000151
004512	00000140
004409	00000154
004430	00000151
004465	00000137
004482	00000154
004390	00000156
004546	00000140
004501	00000155
004404	00000154
004538	00000140
004487	00000154
004554	00000140
004471	00000154
004571	00000152
004406	00000154
004564	00000155
004499	00000155
004492	00000154
004558	00000140
004485	00000154
004536	00000140
004470	00000154
004541	00000140
004514	00000140
004551	00000140
004508	00000155
004559	00000140
004394	00000154
004542	00000140
004483	00000154
004479	00000154
004510	00000155
004410	00000154
004550	00000140
004490	00000154
004389	00000154
004502	00000155
004445	00000155
004562	00000155
004399	00000154
004494	00000154
004414	00000154
004533	00000140
004496	00000140
004395	00000151
004495	00000140
004462	00000155
004412	00000154
004407	00000151
004523	00000137
004535	00000137
004543	00000137
004575	00000153
004457	00000157
004528	00000153
004529	00000153
004515	00000153
004519	00000153
004455	00000157
004522	00000153
004472	00000151
004569	00000157
004433	00000153
004401	00000151
004417	00000139
004583	00000153
004526	00000153
004488	00000151
004434	00000153
004530	00000153
004552	00000153
004421	00000139
004425	00000154
004585	00000153
004580	00000153
004448	00000157
004452	00000157
004446	00000157
004565	00000157
004451	00000157
004436	00000153
004505	00000157
004461	00000157
004449	00000157
004497	00000157
004400	00000160
004566	00000157
004568	00000154
004570	00000157
004498	00000154
004573	00000157
004509	00000157
004453	00000157
004456	00000157
004504	00000157
004500	00000157
004511	00000157
004391	00000160
004454	00000154
004506	00000154
004572	00000157
004459	00000157

[-- Attachment #5: massive-intr-200-60-with-patch.txt --]
[-- Type: text/plain, Size: 3120 bytes --]

004434	00000156
004547	00000156
004543	00000156
004473	00000156
004399	00000156
004537	00000156
004477	00000138
004400	00000156
004444	00000152
004465	00000156
004496	00000147
004548	00000156
004372	00000159
004437	00000152
004566	00000152
004495	00000147
004489	00000141
004545	00000156
004552	00000156
004421	00000141
004461	00000141
004490	00000141
004525	00000147
004472	00000156
004412	00000141
004397	00000141
004450	00000147
004522	00000148
004425	00000147
004455	00000147
004459	00000147
004523	00000147
004530	00000147
004551	00000155
004475	00000156
004484	00000138
004439	00000147
004557	00000154
004387	00000141
004515	00000147
004494	00000147
004535	00000147
004558	00000147
004519	00000147
004449	00000147
004385	00000155
004454	00000147
004534	00000147
004395	00000141
004524	00000147
004417	00000141
004542	00000153
004423	00000141
004509	00000157
004415	00000141
004536	00000155
004532	00000147
004446	00000147
004497	00000147
004468	00000153
004393	00000141
004554	00000153
004485	00000141
004521	00000147
004375	00000141
004448	00000147
004392	00000141
004452	00000141
004493	00000147
004403	00000141
004411	00000141
004424	00000141
004481	00000141
004538	00000157
004483	00000141
004418	00000141
004384	00000146
004420	00000140
004469	00000155
004491	00000139
004391	00000140
004419	00000138
004456	00000144
004502	00000155
004386	00000146
004451	00000144
004526	00000146
004429	00000156
004371	00000146
004471	00000157
004427	00000146
004549	00000146
004369	00000141
004487	00000142
004402	00000155
004373	00000141
004533	00000146
004416	00000142
004520	00000146
004414	00000140
004381	00000148
004479	00000155
004544	00000157
004466	00000155
004370	00000156
004470	00000155
004406	00000155
004546	00000156
004376	00000140
004383	00000148
004458	00000146
004527	00000146
004453	00000146
004432	00000154
004528	00000146
004435	00000154
004447	00000147
004499	00000154
004476	00000157
004498	00000146
004504	00000154
004467	00000155
004561	00000154
004531	00000148
004463	00000155
004565	00000154
004541	00000155
004405	00000155
004492	00000146
004410	00000140
004457	00000146
004374	00000140
004430	00000154
004442	00000154
004445	00000156
004539	00000155
004486	00000140
004382	00000140
004505	00000156
004482	00000142
004390	00000140
004409	00000140
004553	00000158
004379	00000140
004540	00000158
004431	00000154
004413	00000140
004550	00000155
004380	00000155
004388	00000140
004408	00000140
004474	00000157
004433	00000156
004507	00000153
004426	00000143
004518	00000156
004460	00000149
004559	00000156
004480	00000146
004478	00000158
004529	00000146
004464	00000157
004462	00000157
004440	00000155
004398	00000139
004443	00000156
004438	00000156
004562	00000156
004404	00000157
004377	00000140
004513	00000156
004422	00000142
004394	00000142
004401	00000155
004378	00000142
004389	00000142
004501	00000156
004407	00000157
004508	00000156
004488	00000142
004396	00000142
004560	00000156
004516	00000156
004514	00000156
004556	00000153
004564	00000157
004567	00000156
004555	00000156
004568	00000155
004500	00000156
004517	00000156
004512	00000156
004436	00000156
004506	00000156
004510	00000153

[-- Attachment #6: massive-intr.png --]
[-- Type: image/png, Size: 79913 bytes --]

[-- Attachment #7: unixbench-cfs-bandwidth-v6 --]
[-- Type: text/plain, Size: 5624 bytes --]

   BYTE UNIX Benchmarks (Version 5.1.3)

   System: KERNEL-128: GNU/Linux
   OS: GNU/Linux -- 2.6.39-rc3ht-cpu-bandwidth-test+ -- #1 SMP PREEMPT Fri May 27 11:20:19 CST 2011
   Machine: x86_64 (x86_64)
   Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
   CPU 0: Intel(R) Xeon(R) CPU X3430 @ 2.40GHz (4787.8 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
   CPU 1: Intel(R) Xeon(R) CPU X3430 @ 2.40GHz (4786.6 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
   CPU 2: Intel(R) Xeon(R) CPU X3430 @ 2.40GHz (4786.6 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
   CPU 3: Intel(R) Xeon(R) CPU X3430 @ 2.40GHz (4786.6 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
   14:27:45 up 42 min,  1 user,  load average: 0.00, 0.04, 0.20; runlevel 3

------------------------------------------------------------------------
Benchmark Run: Mon May 30 2011 14:27:45 - 14:56:15
4 CPUs in system; running 1 parallel copy of tests

Dhrystone 2 using register variables       23560040.2 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     2854.8 MWIPS (10.0 s, 7 samples)
Execl Throughput                                240.6 lps   (29.9 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks         32539.3 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks            8147.0 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks        124312.2 KBps  (30.0 s, 2 samples)
Pipe Throughput                              235002.6 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                  21412.1 lps   (10.0 s, 7 samples)
Process Creation                                416.0 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                    895.9 lpm   (60.1 s, 2 samples)
Shell Scripts (8 concurrent)                    352.1 lpm   (60.2 s, 2 samples)
System Call Overhead                         322619.9 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   23560040.2   2018.9
Double-Precision Whetstone                       55.0       2854.8    519.1
Execl Throughput                                 43.0        240.6     56.0
File Copy 1024 bufsize 2000 maxblocks          3960.0      32539.3     82.2
File Copy 256 bufsize 500 maxblocks            1655.0       8147.0     49.2
File Copy 4096 bufsize 8000 maxblocks          5800.0     124312.2    214.3
Pipe Throughput                               12440.0     235002.6    188.9
Pipe-based Context Switching                   4000.0      21412.1     53.5
Process Creation                                126.0        416.0     33.0
Shell Scripts (1 concurrent)                     42.4        895.9    211.3
Shell Scripts (8 concurrent)                      6.0        352.1    586.8
System Call Overhead                          15000.0     322619.9    215.1
                                                                   ========
System Benchmarks Index Score                                         166.5

------------------------------------------------------------------------
Benchmark Run: Mon May 30 2011 14:56:15 - 15:24:49
4 CPUs in system; running 4 parallel copies of tests

Dhrystone 2 using register variables       94432034.3 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                    11421.1 MWIPS (10.0 s, 7 samples)
Execl Throughput                               1787.6 lps   (29.6 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks         25070.7 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks            6025.9 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks         70719.0 KBps  (30.0 s, 2 samples)
Pipe Throughput                              912685.3 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                 168909.9 lps   (10.0 s, 7 samples)
Process Creation                               2796.7 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                   3201.4 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                    431.3 lpm   (60.3 s, 2 samples)
System Call Overhead                        1233097.5 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   94432034.3   8091.9
Double-Precision Whetstone                       55.0      11421.1   2076.6
Execl Throughput                                 43.0       1787.6    415.7
File Copy 1024 bufsize 2000 maxblocks          3960.0      25070.7     63.3
File Copy 256 bufsize 500 maxblocks            1655.0       6025.9     36.4
File Copy 4096 bufsize 8000 maxblocks          5800.0      70719.0    121.9
Pipe Throughput                               12440.0     912685.3    733.7
Pipe-based Context Switching                   4000.0     168909.9    422.3
Process Creation                                126.0       2796.7    222.0
Shell Scripts (1 concurrent)                     42.4       3201.4    755.0
Shell Scripts (8 concurrent)                      6.0        431.3    718.9
System Call Overhead                          15000.0    1233097.5    822.1
                                                                   ========
System Benchmarks Index Score                                         445.0


[-- Attachment #8: unixbench-without-cfs-bandwidth-v6 --]
[-- Type: text/plain, Size: 5637 bytes --]

   BYTE UNIX Benchmarks (Version 5.1.3)

   System: KERNEL-128: GNU/Linux
   OS: GNU/Linux -- 2.6.39-rc3ht-cpu-bandwidth-test-without-patch+ -- #2 SMP PREEMPT Fri May 27 16:00:22 CST 2011
   Machine: x86_64 (x86_64)
   Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
   CPU 0: Intel(R) Xeon(R) CPU X3430 @ 2.40GHz (4788.0 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
   CPU 1: Intel(R) Xeon(R) CPU X3430 @ 2.40GHz (4786.6 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
   CPU 2: Intel(R) Xeon(R) CPU X3430 @ 2.40GHz (4786.6 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
   CPU 3: Intel(R) Xeon(R) CPU X3430 @ 2.40GHz (4786.6 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
   15:34:21 up 1 min,  1 user,  load average: 0.90, 0.33, 0.12; runlevel 3

------------------------------------------------------------------------
Benchmark Run: Mon May 30 2011 15:34:21 - 16:02:43
4 CPUs in system; running 1 parallel copy of tests

Dhrystone 2 using register variables       23570449.8 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     2856.0 MWIPS (10.0 s, 7 samples)
Execl Throughput                                245.3 lps   (29.9 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks         32605.9 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks            8211.5 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks        126138.2 KBps  (30.0 s, 2 samples)
Pipe Throughput                              231883.3 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                  22245.2 lps   (10.0 s, 7 samples)
Process Creation                                421.0 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                    714.7 lpm   (60.1 s, 2 samples)
Shell Scripts (8 concurrent)                    355.1 lpm   (60.1 s, 2 samples)
System Call Overhead                         316964.5 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   23570449.8   2019.7
Double-Precision Whetstone                       55.0       2856.0    519.3
Execl Throughput                                 43.0        245.3     57.0
File Copy 1024 bufsize 2000 maxblocks          3960.0      32605.9     82.3
File Copy 256 bufsize 500 maxblocks            1655.0       8211.5     49.6
File Copy 4096 bufsize 8000 maxblocks          5800.0     126138.2    217.5
Pipe Throughput                               12440.0     231883.3    186.4
Pipe-based Context Switching                   4000.0      22245.2     55.6
Process Creation                                126.0        421.0     33.4
Shell Scripts (1 concurrent)                     42.4        714.7    168.6
Shell Scripts (8 concurrent)                      6.0        355.1    591.9
System Call Overhead                          15000.0     316964.5    211.3
                                                                   ========
System Benchmarks Index Score                                         164.3

------------------------------------------------------------------------
Benchmark Run: Mon May 30 2011 16:02:43 - 16:31:14
4 CPUs in system; running 4 parallel copies of tests

Dhrystone 2 using register variables       94372189.3 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                    11430.4 MWIPS (10.0 s, 7 samples)
Execl Throughput                               1875.1 lps   (29.9 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks         22718.2 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks            6067.2 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks         62203.8 KBps  (30.0 s, 2 samples)
Pipe Throughput                              884763.1 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                 172161.4 lps   (10.0 s, 7 samples)
Process Creation                               2920.9 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                   3230.3 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                    430.6 lpm   (60.3 s, 2 samples)
System Call Overhead                        1199897.3 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   94372189.3   8086.7
Double-Precision Whetstone                       55.0      11430.4   2078.3
Execl Throughput                                 43.0       1875.1    436.1
File Copy 1024 bufsize 2000 maxblocks          3960.0      22718.2     57.4
File Copy 256 bufsize 500 maxblocks            1655.0       6067.2     36.7
File Copy 4096 bufsize 8000 maxblocks          5800.0      62203.8    107.2
Pipe Throughput                               12440.0     884763.1    711.2
Pipe-based Context Switching                   4000.0     172161.4    430.4
Process Creation                                126.0       2920.9    231.8
Shell Scripts (1 concurrent)                     42.4       3230.3    761.9
Shell Scripts (8 concurrent)                      6.0        430.6    717.6
System Call Overhead                          15000.0    1199897.3    799.9
                                                                   ========
System Benchmarks Index Score                                         439.0


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 00/15] CFS Bandwidth Control V6
  2011-06-14  6:58 ` [patch 00/15] CFS Bandwidth Control V6 Hu Tao
@ 2011-06-14  7:29   ` Hidetoshi Seto
  2011-06-14  7:44     ` Hu Tao
  2011-06-15  8:37     ` Hu Tao
  0 siblings, 2 replies; 129+ messages in thread
From: Hidetoshi Seto @ 2011-06-14  7:29 UTC (permalink / raw)
  To: Hu Tao
  Cc: Paul Turner, linux-kernel, Peter Zijlstra, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri

(2011/06/14 15:58), Hu Tao wrote:
> Hi,
> 
> I've run several tests including hackbench, unixbench, massive-intr
> and kernel building. CPU is Intel(R) Xeon(R) CPU X3430  @ 2.40GHz,
> 4 cores, and 4G memory.
> 
> Most of the time the results differ only slightly, but there are problems:
> 
> 1. unixbench: execl throughput has about a 5% drop.
> 2. unixbench: process creation has about a 5% drop.
> 3. massive-intr: when running 200 processes for 5 mins, the number
>    of loops each process runs varies more than before cfs-bandwidth-v6.
> 
> The results are attached.

I know the unixbench scores are not so stable, so the problem might just
be noise ... but the massive-intr result is interesting.
Could you try to find which patch (xx/15) in the series causes
the problems?

Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 00/15] CFS Bandwidth Control V6
  2011-06-14  7:29   ` Hidetoshi Seto
@ 2011-06-14  7:44     ` Hu Tao
  2011-06-15  8:37     ` Hu Tao
  1 sibling, 0 replies; 129+ messages in thread
From: Hu Tao @ 2011-06-14  7:44 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: Paul Turner, linux-kernel, Peter Zijlstra, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri

On Tue, Jun 14, 2011 at 04:29:49PM +0900, Hidetoshi Seto wrote:
> (2011/06/14 15:58), Hu Tao wrote:
> > Hi,
> > 
> > I've run several tests including hackbench, unixbench, massive-intr
> > and kernel building. CPU is Intel(R) Xeon(R) CPU X3430  @ 2.40GHz,
> > 4 cores, and 4G memory.
> > 
> > Most of the time the results differ only slightly, but there are problems:
> > 
> > 1. unixbench: execl throughput has about a 5% drop.
> > 2. unixbench: process creation has about a 5% drop.
> > 3. massive-intr: when running 200 processes for 5 mins, the number
> >    of loops each process runs varies more than before cfs-bandwidth-v6.
> > 
> > The results are attached.
> 
> I know the unixbench scores are not so stable, so the problem might just
> be noise ... but the massive-intr result is interesting.
> Could you try to find which patch (xx/15) in the series causes
> the problems?

OK. I'll do it.

> 
> Thanks,
> H.Seto

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned
  2011-06-07 15:45 ` CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned Kamalesh Babulal
  2011-06-08  3:09   ` Paul Turner
  2011-06-08 10:46   ` Vladimir Davydov
@ 2011-06-14 10:16   ` Hidetoshi Seto
  2 siblings, 0 replies; 129+ messages in thread
From: Hidetoshi Seto @ 2011-06-14 10:16 UTC (permalink / raw)
  To: Kamalesh Babulal
  Cc: Paul Turner, linux-kernel, Peter Zijlstra, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Ingo Molnar, Pavel Emelyanov

(2011/06/08 0:45), Kamalesh Babulal wrote:
> Hi All,
> 
> 	In our test environment, while testing the CFS Bandwidth V6 patch set
> on top of 55922c9d1b84, we observed that CPU idle time is between 30% and
> 40% while running a CPU-bound test with the cgroup tasks not pinned to the
> CPUs, whereas in the inverse case, where the cgroup tasks are pinned to the
> CPUs, the idle time seen is nearly zero.

I've run some tests with your test script, but I'm not sure whether this is
really a significant problem. Am I missing the point?

I added a -c option to your script to toggle pinning (1:pinned, 0:not pinned);
a rough sketch of the change is below the table. In short, the results in my
environment (16 cpus, 4 quad-cores) are:

				# group's usage
 -b 0 -p 0 -c 0 : Idle = 0%	(12,12,25,25,25)
 -b 0 -p 0 -c 1 : Idle = 0%	(6,6,12,25,50)
 -b 0 -p 1 -c * : Idle = 0%	(6,6,12,25,50)
 -b 1 -p 0 -c 0 : Idle = ~25%	(6,6,12,25,25)
 -b 1 -p 0 -c 1 : Idle = 0%	(6,6,12,25,50)
 -b 1 -p 1 -c * : Idle = 0%	(6,6,12,25,50)	
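(A minimal sketch of how such a -c toggle might be wired into the posted
script -- for illustration only, the actual modification is not shown here:)

PIN_TASKS=1
# extra case in the existing getopts loop (option string becomes ":b:s:p:c:"):
#        c)
#                PIN_TASKS=$OPTARG
#                ;;
# and the tail of the script calls pin_tasks only when requested:
create_hierarchy
if [ $PIN_TASKS -eq 1 ]
then
        pin_tasks
fi
load_tasks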

If my understanding is correct, with -p0 there are 5 groups (each with
share=1024) and the groups have 2,2,4,8,16 subgroups respectively, so a
subgroup in /1 is weighted 8 times higher than one in /5.  And with -p1, the
shares of the 5 parent groups are scaled up, so all subgroups end up evenly
weighted.
With -p0 the cpu usage of the 5 groups would be 20,20,20,20,20%, but groups /1
and /2 have only 2 subgroups each, so even if /1 and /2 fully use 2 cpus each
the usage will be 12,12,25,25,25%.

OTOH the bandwidth of a subgroup is 250000/500000 (=0.5 cpu), so in the
Idle=0% case the cpu usage of the groups is likely to be 6,6,12,25,50%.
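(A quick back-of-the-envelope check, assuming 16 cpus and the 250000/500000
quota from the script:)

CPUS=16
for n in 2 2 4 8 16
do
        # each subgroup is capped at 250000/500000 = 0.5 cpu, so a group with
        # n subgroups can use at most n*0.5 of the 16 cpus
        echo "scale=2; $n * 0.5 * 100 / $CPUS" | bc
done
# prints 6.25, 6.25, 12.50, 25.00, 50.00 -- i.e. the 6,6,12,25,50 split above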

The question is what happens when both limits are mixed.

For example, in the case of your unpinned Idle=34.8% run:

> Average CPU Idle percentage 34.8% (as explained above in the Idle time measured)
> Bandwidth shared with remaining non-Idle 65.2%

> Bandwidth of Group 1 = 9.2500 i.e = 6.0300% of non-Idle CPU time 65.2%
> Bandwidth of Group 2 = 9.0400 i.e = 5.8900% of non-Idle CPU time 65.2%
> Bandwidth of Group 3 = 16.9300 i.e = 11.0300% of non-Idle CPU time 65.2%
> Bandwidth of Group 4 = 27.9300 i.e = 18.2100% of non-Idle CPU time 65.2%
> Bandwidth of Group 5 = 36.8300 i.e = 24.0100% of non-Idle CPU time 65.2%

The usage is 6,6,11,18,24.
It looks like groups /1 to /3 are limited by bandwidth, while group /5 is
limited by share. (I have no idea about the noise on /4 here.)

BTW, since pinning in your script always pins a couple of subgroups from the
same group to a cpu, subgroups are weighted evenly everywhere, so as a result
shares have no effect in these cases.


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned
  2011-06-14  0:00           ` Paul Turner
@ 2011-06-15  5:37             ` Kamalesh Babulal
  2011-06-21 19:48               ` Paul Turner
  0 siblings, 1 reply; 129+ messages in thread
From: Kamalesh Babulal @ 2011-06-15  5:37 UTC (permalink / raw)
  To: Paul Turner
  Cc: Vladimir Davydov, linux-kernel, Peter Zijlstra, Bharata B Rao,
	Dhaval Giani, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Ingo Molnar, Pavel Emelianov

* Paul Turner <pjt@google.com> [2011-06-13 17:00:08]:

> Hi Kamalesh.
> 
> I tried both on Friday and again today to reproduce your results
> without success.  Results are attached below.  The margin of error is
> the same as the previous (2-level deep case), ~4%.  One minor nit, in
> your script's input parsing you're calling shift; you don't need to do
> this with getopts and it will actually lead to arguments being
> dropped.
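(For reference, a minimal sketch of the usual getopts pattern -- no shift
inside the loop, at most a single shift after it -- using the same flags as
the posted script:)

while getopts ":b:s:p:" arg
do
        case $arg in
        b) BANDWIDTH=$OPTARG ;;
        s) SUBGROUP=$OPTARG ;;
        p) PRO_SHARES=$OPTARG ;;
        *) usage ;;
        esac
done
shift $((OPTIND - 1))   # drop the parsed options; remaining arguments stay intact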
> 
> Are you testing on top of a clean -tip?  Do you have any custom
> load-balancer or scheduler settings?
> 
> Thanks,
> 
> - Paul
> 
> 
> Hyper-threaded topology:
> unpinned:
> Average CPU Idle percentage 38.6333%
> Bandwidth shared with remaining non-Idle 61.3667%
> 
> pinned:
> Average CPU Idle percentage 35.2766%
> Bandwidth shared with remaining non-Idle 64.7234%
> (The mask in the "unpinned" case is 0-3,6-9,12-15,18-21 which should
> mirror your 2 socket 8x2 configuration.)
> 
> 4-way NUMA topology:
> unpinned:
> Average CPU Idle percentage 5.26667%
> Bandwidth shared with remaining non-Idle 94.73333%
> 
> pinned:
> Average CPU Idle percentage 0.242424%
> Bandwidth shared with remaining non-Idle 99.757576%
> 
Hi Paul,

I tried tip 919c9baa9 + the V6 patchset on a 2-socket, quad-core box with HT,
and the idle time seen is ~22% to ~23%. The kernel is not tuned with any
custom load-balancer/scheduler settings.

unpinned:
Average CPU Idle percentage 23.5333%
Bandwidth shared with remaining non-Idle 76.4667%

pinned:
Average CPU Idle percentage 0%
Bandwidth shared with remaining non-Idle 100%

Thanks,

 Kamalesh
> 
> 
> 
> On Fri, Jun 10, 2011 at 11:17 AM, Kamalesh Babulal
> <kamalesh@linux.vnet.ibm.com> wrote:
> > * Paul Turner <pjt@google.com> [2011-06-08 20:25:00]:
> >
> >> Hi Kamalesh,
> >>
> >> I'm unable to reproduce the results you describe.  One possibility is
> >> load-balancer interaction -- can you describe the topology of the
> >> platform you are running this on?
> >>
> >> On both a straight NUMA topology and a hyper-threaded platform I
> >> observe a ~4% delta between the pinned and un-pinned cases.
> >>
> >> Thanks -- results below,
> >>
> >> - Paul
> >>
> >>
(snip)

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 00/15] CFS Bandwidth Control V6
  2011-06-14  7:29   ` Hidetoshi Seto
  2011-06-14  7:44     ` Hu Tao
@ 2011-06-15  8:37     ` Hu Tao
  2011-06-16  0:57       ` Hidetoshi Seto
  1 sibling, 1 reply; 129+ messages in thread
From: Hu Tao @ 2011-06-15  8:37 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: Paul Turner, linux-kernel, Peter Zijlstra, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri

[-- Attachment #1: Type: text/plain, Size: 1092 bytes --]

On Tue, Jun 14, 2011 at 04:29:49PM +0900, Hidetoshi Seto wrote:
> (2011/06/14 15:58), Hu Tao wrote:
> > Hi,
> > 
> > I've run several tests including hackbench, unixbench, massive-intr
> > and kernel building. CPU is Intel(R) Xeon(R) CPU X3430  @ 2.40GHz,
> > 4 cores, and 4G memory.
> > 
> > Most of the time the results differ only slightly, but there are problems:
> > 
> > 1. unixbench: execl throughput has about a 5% drop.
> > 2. unixbench: process creation has about a 5% drop.
> > 3. massive-intr: when running 200 processes for 5 mins, the number
> >    of loops each process runs varies more than before cfs-bandwidth-v6.
> > 
> > The results are attached.
> 
> I know the unixbench scores are not so stable, so the problem might just
> be noise ... but the massive-intr result is interesting.
> Could you try to find which patch (xx/15) in the series causes
> the problems?

After more tests, I found that the massive-intr data is not stable either.
Results are attached. The third number in the file name indicates which
patches are applied; 0 means no patch applied. plot.sh makes it easy to
generate the png files.
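(plot.sh itself is not attached; a rough sketch of such a script, assuming
gnuplot and the two-column pid/loop-count files below:)

#!/bin/bash
# plot the per-process loop counts of each massive-intr result file as a png
for f in massive-intr-200-*.txt
do
gnuplot <<EOF
set terminal png
set output "${f%.txt}.png"
set xlabel "process"
set ylabel "loops"
plot "$f" using 2 with points title "$f"
EOF
done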


[-- Attachment #2: massive-intr-200-300-0-1.txt --]
[-- Type: text/plain, Size: 3200 bytes --]

004516	00000782
004522	00000778
004400	00000689
004420	00000699
004442	00000781
004459	00000729
004539	00000734
004413	00000689
004489	00000700
004499	00000699
004519	00000781
004543	00000734
004389	00000689
004561	00000737
004473	00000731
004457	00000736
004467	00000725
004557	00000794
004566	00000797
004440	00000778
004415	00000696
004531	00000794
004401	00000693
004552	00000743
004416	00000694
004422	00000695
004550	00000734
004497	00000701
004485	00000792
004451	00000789
004502	00000698
004507	00000780
004517	00000777
004536	00000792
004430	00000781
004505	00000780
004529	00000800
004534	00000789
004408	00000683
004456	00000734
004488	00000685
004527	00000803
004544	00000735
004546	00000737
004474	00000734
004564	00000789
004551	00000734
004392	00000793
004581	00000747
004445	00000785
004511	00000777
004395	00000691
004411	00000690
004576	00000694
004496	00000695
004409	00000691
004470	00000735
004426	00000780
004393	00000781
004460	00000737
004390	00000731
004483	00000796
004458	00000741
004465	00000735
004478	00000800
004433	00000778
004503	00000694
004514	00000784
004436	00000780
004435	00000783
004520	00000777
004386	00000783
004513	00000777
004521	00000782
004508	00000780
004427	00000776
004569	00000792
004573	00000794
004405	00000691
004476	00000789
004481	00000784
004548	00000731
004438	00000779
004472	00000731
004487	00000694
004549	00000727
004583	00000732
004575	00000693
004579	00000731
004397	00000784
004495	00000694
004542	00000738
004524	00000785
004580	00000741
004492	00000688
004463	00000739
004434	00000774
004449	00000797
004424	00000776
004504	00000784
004399	00000689
004437	00000784
004572	00000794
004452	00000790
004453	00000794
004563	00000796
004559	00000728
004446	00000794
004535	00000795
004444	00000779
004454	00000794
004560	00000734
004541	00000728
004494	00000695
004554	00000735
004419	00000690
004469	00000736
004447	00000796
004570	00000696
004471	00000733
004565	00000796
004403	00000688
004558	00000739
004532	00000797
004429	00000786
004475	00000793
004498	00000694
004417	00000698
004562	00000737
004506	00000781
004491	00000699
004448	00000795
004428	00000782
004404	00000692
004512	00000780
004509	00000781
004486	00000698
004479	00000802
004406	00000695
004398	00000775
004441	00000782
004423	00000696
004464	00000736
004510	00000782
004477	00000791
004462	00000796
004493	00000697
004410	00000702
004555	00000738
004384	00000696
004518	00000779
004425	00000742
004394	00000696
004443	00000780
004414	00000697
004388	00000690
004455	00000738
004482	00000791
004432	00000777
004582	00000734
004577	00000693
004439	00000779
004533	00000791
004578	00000692
004466	00000739
004418	00000690
004402	00000697
004391	00000798
004545	00000737
004500	00000696
004526	00000779
004568	00000799
004567	00000792
004450	00000795
004528	00000796
004480	00000794
004530	00000803
004387	00000739
004540	00000738
004538	00000793
004556	00000733
004490	00000693
004525	00000780
004547	00000743
004431	00000779
004484	00000794
004421	00000693
004412	00000699
004407	00000691
004385	00000800
004501	00000695
004537	00000796
004468	00000732
004515	00000736
004396	00000796
004571	00000799
004574	00000693
004461	00000744
004523	00000779
004553	00000746

[-- Attachment #3: massive-intr-200-300-10.txt --]
[-- Type: text/plain, Size: 2848 bytes --]

004687	00000706
004613	00000709
004591	00000702
004579	00000709
004685	00000709
004588	00000709
004669	00000811
004598	00000814
004699	00000758
004763	00000753
004750	00000735
004684	00000709
004756	00000753
004573	00000709
004577	00000754
004609	00000706
004657	00000820
004666	00000815
004633	00000753
004697	00000753
004608	00000703
004590	00000707
004681	00000706
004568	00000758
004736	00000819
004643	00000734
004566	00000813
004704	00000753
004595	00000813
004759	00000753
004709	00000753
004606	00000708
004661	00000814
004622	00000739
004675	00000816
004725	00000735
004663	00000818
004731	00000815
004596	00000818
004753	00000750
004713	00000750
004655	00000710
004627	00000750
004594	00000813
004667	00000816
004716	00000739
004722	00000740
004715	00000739
004700	00000750
004735	00000814
004674	00000815
004728	00000741
004762	00000751
004740	00000818
004576	00000713
004578	00000738
004723	00000740
004653	00000743
004647	00000737
004572	00000709
004584	00000706
004620	00000709
004619	00000710
004592	00000702
004597	00000739
004648	00000737
004733	00000814
004758	00000754
004659	00000820
004664	00000818
004747	00000820
004717	00000739
004701	00000752
004696	00000747
004760	00000740
004710	00000755
004712	00000752
004695	00000755
004623	00000734
004683	00000707
004587	00000711
004618	00000712
004605	00000710
004631	00000755
004603	00000710
004586	00000711
004706	00000734
004702	00000755
004644	00000739
004634	00000752
004635	00000752
004617	00000713
004738	00000815
004610	00000707
004732	00000818
004641	00000740
004691	00000734
004746	00000813
004601	00000822
004670	00000815
004628	00000752
004615	00000714
004703	00000762
004612	00000708
004698	00000752
004636	00000752
004632	00000752
004682	00000709
004629	00000752
004734	00000818
004714	00000750
004742	00000815
004708	00000754
004585	00000711
004743	00000814
004751	00000749
004574	00000711
004599	00000818
004639	00000756
004737	00000814
004651	00000739
004672	00000812
004671	00000815
004680	00000711
004668	00000817
004720	00000735
004761	00000754
004752	00000818
004678	00000714
004565	00000711
004638	00000757
004569	00000709
004665	00000814
004583	00000738
004688	00000739
004727	00000812
004575	00000712
004582	00000823
004581	00000740
004744	00000817
004614	00000709
004660	00000812
004580	00000735
004624	00000754
004642	00000733
004571	00000710
004705	00000758
004686	00000710
004741	00000820
004721	00000736
004593	00000817
004616	00000709
004677	00000713
004693	00000733
004650	00000736
004640	00000752
004719	00000739
004730	00000731
004745	00000736
004621	00000736
004645	00000736
004656	00000739
004689	00000741
004646	00000736
004748	00000817
004739	00000819
004676	00000708
004652	00000736
004694	00000739
004654	00000736
004649	00000739
004749	00000733
004726	00000731
004729	00000739
004724	00000737
004692	00000736
004718	00000737
004626	00000751

[-- Attachment #4: massive-intr-200-300-11.txt --]
[-- Type: text/plain, Size: 3072 bytes --]

004680	00000765
004812	00000759
004681	00000765
004705	00000762
004807	00000762
004786	00000775
004721	00000778
004783	00000762
004805	00000764
004648	00000775
004710	00000775
004809	00000762
004733	00000767
004724	00000778
004794	00000778
004796	00000779
004791	00000780
004718	00000777
004795	00000780
004687	00000708
004651	00000708
004633	00000708
004694	00000708
004701	00000708
004654	00000708
004673	00000767
004693	00000705
004620	00000700
004637	00000705
004702	00000708
004765	00000702
004621	00000708
004697	00000708
004649	00000778
004709	00000778
004641	00000708
004643	00000708
004614	00000774
004797	00000773
004698	00000705
004696	00000705
004692	00000705
004624	00000705
004776	00000705
004704	00000705
004808	00000767
004623	00000702
004639	00000705
004685	00000764
004663	00000750
004640	00000705
004766	00000702
004619	00000702
004629	00000702
004714	00000775
004772	00000750
004788	00000778
004792	00000780
004803	00000777
004646	00000775
004810	00000767
004745	00000750
004734	00000764
004798	00000767
004744	00000747
004631	00000751
004778	00000750
004769	00000750
004662	00000750
004764	00000747
004747	00000747
004755	00000744
004666	00000750
004804	00000767
004730	00000767
004753	00000750
004652	00000780
004667	00000750
004669	00000747
004773	00000747
004758	00000750
004751	00000750
004780	00000750
004754	00000747
004726	00000780
004689	00000767
004802	00000781
004739	00000761
004655	00000744
004671	00000747
004672	00000767
004779	00000741
004743	00000767
004708	00000767
004731	00000767
004749	00000747
004738	00000767
004771	00000744
004616	00000766
004750	00000741
004679	00000761
004789	00000772
004729	00000767
004727	00000780
004613	00000705
004728	00000777
004622	00000710
004799	00000777
004699	00000710
004759	00000744
004675	00000764
004691	00000707
004741	00000764
004627	00000710
004638	00000710
004642	00000710
004767	00000707
004723	00000777
004686	00000764
004768	00000707
004636	00000707
004740	00000764
004801	00000780
004625	00000707
004735	00000764
004793	00000777
004618	00000704
004716	00000772
004777	00000707
004690	00000707
004703	00000707
004628	00000707
004700	00000710
004678	00000758
004760	00000704
004761	00000707
004632	00000704
004683	00000769
004634	00000707
004806	00000704
004645	00000778
004684	00000761
004644	00000704
004676	00000766
004630	00000779
004647	00000777
004664	00000749
004617	00000702
004695	00000707
004650	00000780
004635	00000704
004785	00000782
004712	00000782
004717	00000779
004790	00000775
004811	00000766
004660	00000749
004763	00000749
004674	00000763
004656	00000749
004774	00000752
004787	00000779
004615	00000754
004770	00000749
004706	00000758
004707	00000766
004742	00000766
004713	00000774
004719	00000777
004715	00000776
004653	00000780
004746	00000752
004665	00000743
004658	00000743
004775	00000746
004782	00000746
004659	00000749
004722	00000776
004668	00000740
004736	00000760
004732	00000766
004677	00000763
004800	00000766
004737	00000763
004756	00000752
004657	00000752
004688	00000766
004626	00000770
004670	00000746
004781	00000749
004748	00000746
004752	00000746

[-- Attachment #5: massive-intr-200-300-12.txt --]
[-- Type: text/plain, Size: 3200 bytes --]

004457	00000758
004552	00000749
004581	00000776
004417	00000775
004556	00000749
004548	00000752
004555	00000749
004470	00000749
004465	00000761
004481	00000749
004418	00000749
004538	00000749
004549	00000749
004540	00000749
004546	00000749
004482	00000746
004592	00000749
004541	00000749
004409	00000749
004562	00000761
004584	00000779
004564	00000755
004438	00000776
004475	00000746
004578	00000776
004536	00000766
004408	00000761
004462	00000766
004463	00000763
004533	00000763
004489	00000707
004530	00000763
004566	00000760
004531	00000766
004529	00000710
004431	00000710
004430	00000710
004488	00000710
004429	00000710
004469	00000751
004413	00000710
004427	00000707
004412	00000707
004582	00000781
004447	00000707
004575	00000784
004423	00000707
004477	00000707
004449	00000763
004558	00000704
004450	00000763
004434	00000707
004411	00000767
004494	00000707
004495	00000707
004464	00000763
004497	00000707
004597	00000746
004452	00000766
004500	00000766
004573	00000763
004587	00000781
004419	00000707
004407	00000777
004441	00000784
004506	00000781
004560	00000763
004580	00000778
004588	00000781
004516	00000784
004598	00000754
004589	00000784
004520	00000781
004513	00000781
004583	00000781
004515	00000781
004433	00000704
004501	00000781
004504	00000781
004432	00000704
004539	00000751
004577	00000781
004428	00000704
004486	00000707
004547	00000751
004485	00000704
004551	00000754
004490	00000707
004557	00000751
004424	00000707
004491	00000707
004576	00000778
004511	00000781
004594	00000778
004445	00000778
004483	00000704
004436	00000701
004595	00000751
004554	00000751
004479	00000751
004519	00000778
004543	00000751
004572	00000763
004446	00000755
004518	00000763
004602	00000748
004454	00000760
004480	00000704
004459	00000763
004460	00000763
004535	00000775
004448	00000704
004605	00000745
004599	00000751
004600	00000751
004550	00000751
004528	00000760
004451	00000760
004456	00000763
004522	00000763
004596	00000748
004527	00000763
004473	00000748
004478	00000748
004505	00000775
004524	00000763
004601	00000751
004476	00000748
004537	00000757
004544	00000748
004545	00000751
004461	00000765
004571	00000757
004458	00000757
004455	00000762
004534	00000765
004507	00000783
004567	00000762
004503	00000783
004561	00000765
004406	00000707
004579	00000783
004569	00000763
004439	00000780
004532	00000765
004421	00000709
004512	00000783
004523	00000762
004499	00000709
004440	00000783
004474	00000753
004443	00000780
004508	00000783
004410	00000709
004563	00000762
004467	00000750
004509	00000783
004415	00000706
004468	00000753
004510	00000783
004604	00000753
004585	00000780
004487	00000706
004471	00000753
004492	00000706
004472	00000750
004444	00000777
004442	00000777
004425	00000706
004603	00000753
004586	00000783
004574	00000753
004542	00000750
004498	00000709
004414	00000703
004422	00000709
004565	00000757
004570	00000706
004502	00000780
004426	00000706
004591	00000780
004514	00000780
004517	00000780
004435	00000706
004420	00000703
004593	00000706
004437	00000706
004553	00000706
004493	00000706
004484	00000703
004525	00000762
004568	00000762
004521	00000765
004526	00000765
004416	00000706
004496	00000709
004590	00000780
004466	00000753
004453	00000759
004559	00000709

[-- Attachment #6: massive-intr-200-300-13.txt --]
[-- Type: text/plain, Size: 2624 bytes --]

004525	00000797
004491	00000733
004477	00000733
004527	00000799
004446	00000706
004469	00000765
004556	00000730
004595	00000802
004529	00000731
004512	00000717
004423	00000713
004475	00000771
004438	00000710
004501	00000714
004460	00000739
004435	00000713
004603	00000794
004465	00000736
004463	00000721
004588	00000797
004519	00000801
004597	00000797
004480	00000738
004476	00000770
004542	00000765
004503	00000768
004471	00000764
004443	00000715
004472	00000762
004434	00000737
004581	00000767
004496	00000767
004602	00000799
004579	00000765
004577	00000765
004508	00000736
004447	00000714
004584	00000791
004592	00000799
004553	00000801
004545	00000765
004598	00000792
004599	00000799
004445	00000713
004544	00000763
004451	00000711
004573	00000767
004607	00000738
004502	00000762
004567	00000738
004616	00000735
004490	00000730
004574	00000767
004429	00000769
004547	00000767
004449	00000714
004483	00000738
004522	00000797
004531	00000797
004593	00000796
004478	00000738
004561	00000738
004507	00000716
004419	00000799
004557	00000738
004485	00000738
004587	00000799
004497	00000767
004570	00000770
004576	00000770
004612	00000735
004601	00000735
004563	00000738
004430	00000736
004509	00000738
004608	00000738
004457	00000802
004426	00000716
004433	00000712
004461	00000709
004511	00000720
004528	00000794
004613	00000738
004546	00000767
004466	00000712
004617	00000737
004494	00000735
004489	00000738
004459	00000714
004464	00000713
004458	00000788
004454	00000805
004440	00000711
004427	00000723
004585	00000795
004565	00000732
004524	00000800
004572	00000762
004474	00000769
004594	00000798
004452	00000794
004504	00000709
004590	00000790
004568	00000770
004515	00000799
004455	00000804
004540	00000763
004425	00000766
004582	00000764
004575	00000764
004539	00000761
004523	00000785
004420	00000761
004418	00000735
004530	00000799
004456	00000801
004487	00000708
004462	00000717
004578	00000769
004583	00000715
004536	00000761
004609	00000729
004500	00000767
004473	00000764
004450	00000716
004467	00000769
004437	00000711
004498	00000769
004424	00000732
004482	00000740
004589	00000797
004564	00000769
004606	00000734
004432	00000708
004521	00000792
004615	00000740
004481	00000732
004562	00000737
004484	00000737
004566	00000737
004520	00000796
004486	00000737
004442	00000721
004510	00000717
004533	00000712
004534	00000714
004554	00000737
004468	00000769
004513	00000739
004560	00000731
004591	00000801
004552	00000737
004499	00000766
004586	00000800
004448	00000735
004444	00000713
004505	00000734
004436	00000716
004431	00000736
004551	00000737
004558	00000734
004495	00000769
004488	00000734
004605	00000796

[-- Attachment #7: massive-intr-200-300-14.txt --]
[-- Type: text/plain, Size: 3008 bytes --]

004446	00000734
004525	00000769
004604	00000749
004583	00000750
004578	00000753
004597	00000769
004534	00000753
004418	00000734
004444	00000734
004519	00000766
004507	00000731
004599	00000769
004547	00000749
004573	00000750
004522	00000766
004503	00000736
004496	00000732
004595	00000769
004448	00000729
004556	00000751
004491	00000754
004520	00000769
004479	00000751
004453	00000766
004559	00000751
004456	00000766
004437	00000734
004575	00000745
004579	00000750
004574	00000752
004473	00000753
004590	00000766
004424	00000753
004452	00000769
004602	00000766
004587	00000766
004471	00000750
004526	00000763
004543	00000750
004476	00000750
004572	00000750
004598	00000766
004580	00000755
004549	00000750
004472	00000751
004445	00000732
004529	00000766
004501	00000734
004550	00000754
004487	00000748
004486	00000750
004436	00000734
004617	00000753
004527	00000766
004511	00000729
004603	00000766
004565	00000756
004454	00000766
004523	00000763
004447	00000734
004541	00000755
004475	00000746
004569	00000751
004570	00000733
004451	00000769
004439	00000737
004428	00000733
004478	00000754
004542	00000750
004555	00000750
004477	00000756
004443	00000730
004566	00000751
004510	00000736
004465	00000747
004489	00000750
004508	00000734
004455	00000763
004467	00000753
004468	00000750
004546	00000753
004584	00000753
004535	00000751
004586	00000751
004540	00000750
004433	00000734
004464	00000747
004427	00000734
004560	00000751
004493	00000748
004571	00000737
004558	00000755
004552	00000754
004607	00000734
004608	00000751
004481	00000753
004585	00000752
004462	00000752
004528	00000768
004576	00000755
004517	00000771
004512	00000768
004614	00000752
004591	00000771
004563	00000753
004606	00000768
004431	00000757
004488	00000748
004485	00000756
004494	00000751
004440	00000733
004531	00000771
004589	00000768
004551	00000753
004423	00000733
004601	00000768
004509	00000736
004594	00000765
004422	00000735
004505	00000734
004600	00000768
004470	00000750
004548	00000755
004513	00000768
004421	00000757
004514	00000768
004480	00000755
004438	00000734
004611	00000756
004460	00000746
004524	00000768
004616	00000752
004483	00000749
004495	00000736
004554	00000750
004515	00000765
004530	00000747
004504	00000733
004545	00000752
004482	00000753
004457	00000753
004588	00000768
004613	00000750
004492	00000730
004434	00000735
004420	00000755
004426	00000733
004474	00000752
004516	00000765
004553	00000753
004593	00000768
004466	00000754
004568	00000765
004499	00000736
004449	00000768
004461	00000752
004562	00000755
004429	00000767
004425	00000743
004533	00000749
004612	00000754
004419	00000761
004502	00000733
004544	00000756
004500	00000735
004537	00000749
004430	00000733
004582	00000733
004596	00000765
004610	00000750
004458	00000737
004490	00000752
004484	00000755
004450	00000738
004521	00000770
004577	00000751
004615	00000757
004469	00000753
004592	00000765
004432	00000735
004459	00000732
004567	00000754
004463	00000751
004497	00000743
004506	00000732
004561	00000734
004605	00000768
004536	00000752

[-- Attachment #8: massive-intr-200-300-15-1.txt --]
[-- Type: text/plain, Size: 3088 bytes --]

004437	00000779
004583	00000782
004543	00000766
004568	00000753
004503	00000780
004526	00000742
004461	00000720
004455	00000718
004421	00000722
004563	00000744
004525	00000726
004591	00000781
004600	00000768
004428	00000720
004499	00000727
004490	00000745
004596	00000768
004593	00000777
004539	00000767
004482	00000748
004471	00000768
004555	00000762
004553	00000769
004491	00000751
004458	00000750
004431	00000720
004457	00000752
004497	00000725
004411	00000720
004575	00000781
004588	00000779
004452	00000720
004465	00000766
004467	00000768
004567	00000746
004562	00000744
004536	00000754
004435	00000720
004566	00000739
004470	00000767
004580	00000782
004560	00000742
004542	00000765
004406	00000783
004595	00000775
004576	00000776
004540	00000768
004493	00000723
004450	00000722
004514	00000784
004422	00000724
004448	00000724
004418	00000774
004544	00000771
004415	00000744
004534	00000742
004535	00000750
004507	00000778
004584	00000779
004498	00000717
004509	00000782
004459	00000751
004587	00000781
004532	00000746
004594	00000769
004541	00000776
004433	00000723
004599	00000768
004598	00000781
004554	00000765
004501	00000779
004424	00000725
004603	00000771
004440	00000782
004413	00000719
004572	00000746
004537	00000765
004559	00000744
004502	00000785
004570	00000752
004552	00000745
004495	00000722
004538	00000771
004474	00000765
004558	00000746
004472	00000747
004550	00000770
004478	00000770
004590	00000763
004530	00000742
004429	00000722
004442	00000785
004447	00000722
004488	00000755
004423	00000725
004466	00000769
004408	00000771
004592	00000783
004469	00000771
004517	00000778
004604	00000768
004585	00000781
004524	00000719
004577	00000780
004416	00000781
004516	00000781
004546	00000765
004519	00000781
004522	00000720
004480	00000765
004579	00000781
004405	00000722
004486	00000741
004523	00000724
004569	00000740
004589	00000781
004492	00000750
004515	00000781
004571	00000753
004586	00000780
004438	00000777
004414	00000722
004481	00000769
004487	00000743
004521	00000719
004496	00000725
004462	00000740
004531	00000746
004443	00000781
004561	00000747
004527	00000723
004439	00000783
004449	00000725
004483	00000744
004419	00000724
004601	00000777
004410	00000721
004425	00000722
004445	00000730
004453	00000722
004451	00000721
004551	00000764
004565	00000747
004417	00000748
004548	00000768
004484	00000744
004549	00000770
004434	00000719
004456	00000721
004473	00000768
004412	00000741
004510	00000778
004489	00000751
004533	00000747
004547	00000768
004581	00000785
004504	00000780
004520	00000727
004441	00000716
004573	00000742
004494	00000724
004528	00000747
004564	00000747
004582	00000780
004477	00000767
004500	00000778
004427	00000726
004430	00000721
004468	00000767
004506	00000783
004545	00000770
004556	00000769
004407	00000750
004505	00000782
004432	00000721
004426	00000723
004511	00000780
004460	00000743
004508	00000782
004574	00000777
004597	00000772
004476	00000748
004578	00000778
004512	00000775
004529	00000750
004513	00000781
004463	00000747
004454	00000725
004602	00000770
004475	00000771
004446	00000724
004485	00000749
004464	00000765

[-- Attachment #9: massive-intr-200-300-15-2.txt --]
[-- Type: text/plain, Size: 3184 bytes --]

004922	00000731
004819	00000751
004979	00000735
004931	00000766
004929	00000763
004810	00000746
004901	00000748
004797	00000768
004849	00000748
004832	00000736
004914	00000733
004806	00000729
004883	00000733
004866	00000762
004955	00000748
004816	00000748
004801	00000745
004812	00000777
004800	00000745
004820	00000748
004939	00000759
004993	00000764
004813	00000748
004956	00000743
004885	00000752
004918	00000736
004947	00000734
004920	00000732
004864	00000762
004974	00000754
004973	00000751
004966	00000748
004890	00000752
004894	00000751
004928	00000759
004836	00000734
004843	00000764
004902	00000751
004927	00000737
004821	00000752
004907	00000764
004975	00000750
004824	00000745
004829	00000752
004909	00000744
004908	00000750
004940	00000764
004906	00000750
004865	00000750
004912	00000749
004896	00000749
004987	00000775
004938	00000772
004848	00000759
004880	00000732
004949	00000735
004921	00000736
004911	00000752
004842	00000734
004982	00000748
004835	00000729
004830	00000752
004839	00000736
004867	00000764
004841	00000749
004854	00000752
004913	00000750
004874	00000735
004934	00000751
004983	00000762
004870	00000767
004991	00000763
004893	00000747
004850	00000747
004808	00000739
004834	00000745
004915	00000734
004855	00000767
004980	00000756
004936	00000764
004838	00000733
004961	00000728
004952	00000734
004951	00000731
004953	00000731
004989	00000761
004990	00000767
004860	00000764
004853	00000767
004978	00000767
004887	00000745
004817	00000750
004873	00000737
004844	00000767
004840	00000726
004958	00000744
004852	00000735
004857	00000764
004837	00000734
004957	00000764
004969	00000765
004803	00000744
004884	00000734
004892	00000754
004924	00000747
004971	00000751
004847	00000747
004891	00000751
004827	00000744
004882	00000731
004815	00000747
004859	00000746
004798	00000747
004954	00000750
004962	00000756
004845	00000764
004799	00000750
004846	00000760
004862	00000764
004986	00000764
004900	00000747
004946	00000734
004868	00000770
004903	00000752
004933	00000764
004981	00000761
004875	00000731
004881	00000731
004858	00000761
004795	00000753
004937	00000747
004988	00000728
004916	00000750
004897	00000747
004917	00000771
004863	00000763
004888	00000752
004926	00000731
004968	00000752
004895	00000752
004919	00000752
004930	00000767
004960	00000735
004941	00000764
004910	00000752
004963	00000748
004886	00000736
004945	00000747
004898	00000750
004967	00000753
004923	00000751
004856	00000768
004904	00000753
004818	00000751
004878	00000764
004802	00000750
004805	00000747
004876	00000746
004899	00000748
004977	00000744
004972	00000749
004889	00000749
004833	00000736
004944	00000758
004905	00000752
004985	00000764
004942	00000766
004823	00000752
004869	00000769
004822	00000752
004811	00000737
004807	00000735
004948	00000737
004796	00000738
004932	00000738
004877	00000736
004950	00000733
004970	00000749
004851	00000746
004825	00000764
004976	00000753
004828	00000762
004935	00000753
004804	00000752
004814	00000747
004964	00000746
004861	00000764
004809	00000748
004826	00000746
004984	00000755
004879	00000749
004794	00000750
004831	00000747
004872	00000765
004959	00000747
004992	00000772
004965	00000745
004925	00000741
004943	00000773

[-- Attachment #10: massive-intr-200-300-15.txt --]
[-- Type: text/plain, Size: 3056 bytes --]

004580	00000811
004556	00000696
004478	00000698
004638	00000773
004614	00000739
004509	00000815
004464	00000813
004521	00000764
004604	00000763
004596	00000770
004623	00000738
004595	00000770
004603	00000769
004574	00000813
004597	00000761
004620	00000738
004618	00000742
004547	00000733
004488	00000698
004568	00000696
004587	00000774
004454	00000813
004460	00000695
004499	00000815
004626	00000728
004465	00000808
004523	00000769
004554	00000694
004560	00000695
004551	00000736
004590	00000767
004516	00000766
004469	00000813
004619	00000739
004463	00000698
004562	00000692
004518	00000774
004458	00000700
004592	00000738
004624	00000739
004607	00000735
004533	00000733
004487	00000690
004448	00000695
004484	00000696
004514	00000766
004559	00000691
004450	00000767
004591	00000762
004635	00000768
004584	00000812
004525	00000765
004513	00000767
004629	00000730
004540	00000733
004459	00000695
004446	00000697
004549	00000808
004538	00000736
004519	00000770
004589	00000767
004625	00000693
004517	00000760
004616	00000736
004475	00000736
004531	00000813
004565	00000697
004534	00000736
004526	00000813
004471	00000738
004493	00000695
004467	00000812
004482	00000695
004451	00000694
004442	00000692
004447	00000697
004599	00000767
004532	00000736
004520	00000768
004577	00000810
004503	00000812
004606	00000812
004506	00000805
004561	00000696
004457	00000699
004641	00000769
004485	00000700
004572	00000815
004640	00000770
004570	00000813
004445	00000766
004468	00000810
004639	00000763
004530	00000814
004528	00000762
004588	00000761
004566	00000696
004529	00000768
004491	00000694
004449	00000743
004452	00000697
004573	00000811
004500	00000816
004598	00000768
004473	00000738
004495	00000696
004504	00000814
004502	00000810
004552	00000813
004455	00000694
004576	00000815
004555	00000696
004542	00000735
004522	00000767
004581	00000812
004615	00000737
004632	00000737
004550	00000736
004541	00000733
004497	00000700
004630	00000733
004558	00000699
004636	00000769
004476	00000732
004477	00000737
004461	00000697
004613	00000739
004609	00000737
004536	00000731
004479	00000693
004601	00000769
004444	00000811
004453	00000768
004443	00000738
004575	00000810
004611	00000737
004593	00000765
004571	00000809
004610	00000734
004557	00000694
004474	00000695
004545	00000740
004466	00000813
004633	00000777
004535	00000737
004456	00000695
004578	00000811
004510	00000767
004481	00000697
004627	00000811
004586	00000771
004617	00000740
004631	00000735
004524	00000768
004553	00000699
004486	00000694
004608	00000738
004505	00000810
004498	00000812
004483	00000692
004544	00000738
004621	00000735
004512	00000771
004634	00000762
004612	00000745
004472	00000738
004515	00000772
004602	00000767
004511	00000765
004582	00000813
004470	00000738
004507	00000815
004605	00000815
004579	00000812
004564	00000698
004489	00000695
004569	00000695
004628	00000695
004492	00000698
004480	00000696
004637	00000765
004527	00000766
004490	00000698
004583	00000815
004600	00000815
004594	00000766
004496	00000694
004548	00000817
004585	00000765
004539	00000735
004622	00000737

[-- Attachment #11: massive-intr-200-300-16-1.txt --]
[-- Type: text/plain, Size: 2624 bytes --]

004446	00000771
004390	00000725
004414	00000770
004456	00000724
004415	00000760
004465	00000727
004523	00000728
004514	00000725
004500	00000757
004434	00000773
004427	00000757
004393	00000769
004498	00000757
004401	00000770
004551	00000776
004391	00000728
004458	00000725
004515	00000722
004376	00000769
004473	00000727
004386	00000728
004528	00000727
004472	00000722
004410	00000767
004505	00000767
004365	00000776
004546	00000753
004421	00000754
004384	00000775
004371	00000755
004467	00000727
004471	00000724
004444	00000773
004538	00000757
004403	00000765
004448	00000771
004557	00000758
004422	00000758
004394	00000771
004517	00000722
004383	00000725
004539	00000775
004396	00000768
004389	00000725
004381	00000771
004400	00000766
004395	00000771
004367	00000770
004397	00000774
004439	00000771
004453	00000771
004423	00000759
004480	00000767
004413	00000774
004470	00000727
004468	00000727
004531	00000727
004463	00000727
004506	00000775
004489	00000727
004435	00000775
004385	00000728
004521	00000724
004486	00000762
004543	00000775
004481	00000772
004534	00000774
004424	00000759
004532	00000724
004508	00000775
004369	00000772
004379	00000768
004442	00000771
004450	00000773
004368	00000771
004554	00000756
004511	00000769
004503	00000772
004436	00000775
004510	00000775
004408	00000772
004420	00000756
004540	00000775
004443	00000768
004441	00000768
004457	00000727
004502	00000759
004513	00000727
004547	00000756
004485	00000756
004529	00000719
004553	00000753
004372	00000768
004419	00000759
004440	00000772
004559	00000753
004454	00000768
004494	00000759
004373	00000771
004488	00000758
004461	00000724
004544	00000775
004431	00000755
004399	00000771
004405	00000772
004452	00000774
004462	00000724
004438	00000772
004507	00000772
004482	00000772
004366	00000773
004407	00000775
004455	00000765
004550	00000756
004451	00000767
004428	00000758
004542	00000774
004492	00000756
004504	00000769
004404	00000770
004370	00000774
004361	00000776
004362	00000759
004363	00000778
004499	00000777
004520	00000726
004425	00000758
004409	00000774
004449	00000768
004530	00000726
004548	00000753
004479	00000774
004522	00000726
004495	00000758
004509	00000726
004484	00000759
004378	00000770
004527	00000720
004360	00000767
004411	00000769
004460	00000770
004545	00000761
004518	00000726
004387	00000726
004516	00000726
004459	00000721
004476	00000723
004549	00000762
004558	00000757
004426	00000760
004417	00000758
004437	00000777
004491	00000758
004364	00000771
004519	00000726
004555	00000757
004525	00000728
004466	00000726
004552	00000758
004497	00000761
004490	00000756
004535	00000772
004478	00000773
004533	00000729

[-- Attachment #12: massive-intr-200-300-1.txt --]
[-- Type: text/plain, Size: 3152 bytes --]

004404	00000754
004390	00000750
004453	00000740
004449	00000728
004526	00000742
004528	00000735
004508	00000780
004512	00000774
004507	00000777
004430	00000749
004393	00000750
004425	00000730
004552	00000777
004400	00000745
004579	00000774
004502	00000771
004569	00000734
004500	00000734
004450	00000733
004456	00000730
004548	00000774
004534	00000728
004398	00000749
004497	00000737
004490	00000777
004511	00000774
004551	00000777
004576	00000778
004553	00000774
004448	00000755
004542	00000777
004407	00000751
004544	00000774
004410	00000742
004514	00000776
004495	00000737
004506	00000779
004493	00000735
004538	00000777
004573	00000731
004478	00000731
004580	00000779
004519	00000768
004578	00000774
004395	00000734
004457	00000733
004532	00000731
004520	00000774
004541	00000772
004385	00000747
004444	00000749
004546	00000776
004521	00000771
004394	00000738
004411	00000747
004487	00000733
004555	00000778
004563	00000733
004535	00000730
004530	00000733
004559	00000736
004423	00000728
004489	00000734
004441	00000750
004414	00000750
004409	00000751
004464	00000739
004451	00000735
004505	00000776
004474	00000733
004518	00000773
004583	00000733
004477	00000737
004517	00000776
004504	00000773
004408	00000747
004527	00000736
004575	00000776
004445	00000747
004476	00000731
004499	00000735
004427	00000747
004424	00000733
004554	00000776
004564	00000734
004510	00000774
004460	00000735
004513	00000773
004388	00000750
004468	00000727
004389	00000747
004452	00000732
004557	00000736
004570	00000735
004523	00000733
004522	00000730
004492	00000733
004403	00000744
004515	00000776
004397	00000744
004584	00000732
004574	00000776
004571	00000776
004547	00000776
004387	00000737
004545	00000775
004484	00000735
004402	00000760
004565	00000725
004488	00000733
004558	00000733
004434	00000744
004412	00000747
004431	00000750
004562	00000730
004396	00000735
004419	00000749
004439	00000747
004429	00000733
004433	00000761
004422	00000731
004496	00000733
004466	00000737
004567	00000733
004440	00000755
004482	00000733
004524	00000731
004432	00000744
004549	00000770
004391	00000746
004543	00000776
004525	00000730
004509	00000778
004531	00000729
004418	00000749
004406	00000749
004480	00000776
004516	00000773
004529	00000730
004417	00000753
004413	00000748
004399	00000749
004392	00000749
004435	00000753
004415	00000756
004469	00000735
004503	00000776
004416	00000749
004485	00000733
004462	00000737
004566	00000735
004438	00000748
004481	00000778
004533	00000776
004442	00000746
004447	00000749
004436	00000752
004556	00000733
004428	00000736
004560	00000730
004401	00000749
004483	00000734
004581	00000735
004550	00000773
004467	00000730
004446	00000756
004437	00000746
004577	00000774
004443	00000748
004501	00000723
004473	00000734
004540	00000735
004461	00000735
004421	00000737
004582	00000735
004471	00000737
004498	00000730
004458	00000735
004465	00000732
004568	00000732
004539	00000738
004386	00000735
004459	00000732
004472	00000732
004470	00000732
004491	00000735
004420	00000732
004536	00000734
004454	00000732
004463	00000732
004475	00000734
004426	00000727
004455	00000732
004572	00000737
004537	00000734
004479	00000731
004486	00000729

[-- Attachment #13: massive-intr-200-300-2.txt --]
[-- Type: text/plain, Size: 3120 bytes --]

004409	00000752
004398	00000752
004379	00000752
004445	00000709
004531	00000796
004405	00000752
004433	00000755
004401	00000752
004538	00000791
004418	00000706
004491	00000711
004546	00000756
004383	00000752
004392	00000799
004389	00000752
004478	00000706
004386	00000754
004520	00000709
004547	00000747
004391	00000799
004535	00000802
004427	00000755
004364	00000752
004439	00000755
004441	00000708
004482	00000708
004534	00000796
004542	00000796
004499	00000755
004508	00000755
004545	00000751
004507	00000751
004488	00000749
004484	00000749
004371	00000757
004359	00000758
004440	00000756
004450	00000708
004407	00000757
004457	00000788
004395	00000799
004449	00000705
004469	00000711
004501	00000757
004458	00000796
004500	00000754
004510	00000708
004480	00000710
004415	00000711
004539	00000801
004360	00000749
004528	00000801
004453	00000801
004496	00000752
004426	00000751
004487	00000757
004373	00000751
004404	00000754
004494	00000745
004413	00000754
004410	00000757
004384	00000751
004411	00000754
004525	00000799
004382	00000754
004443	00000710
004431	00000756
004504	00000751
004489	00000705
004522	00000754
004476	00000709
004554	00000751
004365	00000754
004357	00000754
004378	00000754
004524	00000801
004385	00000754
004375	00000754
004461	00000796
004490	00000705
004477	00000714
004533	00000801
004493	00000711
004455	00000798
004470	00000801
004399	00000801
004544	00000801
004481	00000711
004475	00000708
004416	00000711
004376	00000754
004459	00000710
004549	00000752
004422	00000756
004479	00000710
004466	00000707
004370	00000751
004381	00000751
004483	00000751
004463	00000798
004434	00000751
004397	00000801
004541	00000801
004448	00000708
004437	00000757
004425	00000750
004514	00000708
004442	00000708
004551	00000753
004394	00000801
004367	00000799
004492	00000714
004390	00000798
004540	00000798
004511	00000757
004527	00000798
004548	00000753
004402	00000756
004498	00000750
004430	00000749
004428	00000748
004361	00000756
004497	00000757
004502	00000754
004368	00000755
004436	00000754
004523	00000798
004471	00000798
004366	00000752
004555	00000753
004447	00000710
004505	00000754
004516	00000710
004460	00000798
004495	00000758
004518	00000714
004521	00000711
004424	00000751
004519	00000711
004464	00000800
004435	00000757
004515	00000712
004474	00000709
004512	00000708
004396	00000795
004420	00000712
004526	00000798
004408	00000751
004553	00000754
004372	00000756
004454	00000705
004552	00000750
004456	00000803
004423	00000713
004509	00000756
004406	00000756
004486	00000756
004465	00000795
004421	00000710
004403	00000756
004388	00000756
004532	00000705
004374	00000756
004432	00000758
004446	00000712
004363	00000756
004550	00000748
004414	00000712
004362	00000742
004452	00000709
004537	00000800
004380	00000756
004530	00000800
004468	00000800
004462	00000803
004485	00000756
004393	00000797
004543	00000800
004444	00000709
004472	00000713
004451	00000710
004513	00000713
004417	00000715
004419	00000712
004369	00000750
004377	00000753
004412	00000753
004387	00000756
004503	00000755
004556	00000750
004506	00000758
004467	00000795
004438	00000758
004429	00000758
004536	00000790

[-- Attachment #14: massive-intr-200-300-3.txt --]
[-- Type: text/plain, Size: 3056 bytes --]

004463	00000781
004475	00000779
004435	00000765
004398	00000775
004399	00000778
004409	00000746
004462	00000777
004438	00000783
004401	00000781
004377	00000785
004383	00000725
004443	00000723
004372	00000723
004415	00000734
004375	00000716
004449	00000722
004419	00000736
004513	00000723
004515	00000736
004494	00000765
004431	00000768
004426	00000772
004367	00000783
004488	00000769
004429	00000767
004483	00000769
004425	00000741
004514	00000734
004434	00000762
004456	00000719
004481	00000765
004499	00000736
004470	00000785
004412	00000733
004466	00000778
004369	00000740
004402	00000780
004444	00000723
004384	00000718
004397	00000723
004416	00000737
004489	00000761
004491	00000770
004508	00000732
004468	00000775
004531	00000740
004410	00000739
004505	00000733
004496	00000738
004427	00000768
004479	00000775
004436	00000766
004480	00000763
004501	00000742
004528	00000733
004403	00000781
004534	00000737
004564	00000766
004440	00000762
004467	00000777
004535	00000783
004476	00000792
004464	00000780
004391	00000721
004517	00000724
004374	00000720
004472	00000784
004417	00000730
004421	00000739
004465	00000785
004452	00000725
004541	00000777
004389	00000722
004446	00000721
004448	00000721
004458	00000769
004486	00000762
004540	00000783
004411	00000739
004551	00000766
004445	00000723
004428	00000765
004390	00000724
004441	00000767
004457	00000721
004559	00000764
004553	00000780
004371	00000723
004537	00000778
004368	00000764
004512	00000718
004536	00000780
004455	00000718
004450	00000721
004473	00000768
004561	00000767
004527	00000733
004538	00000788
004385	00000722
004554	00000764
004442	00000720
004510	00000738
004482	00000769
004504	00000739
004523	00000738
004524	00000734
004407	00000719
004422	00000738
004437	00000764
004387	00000722
004474	00000783
004413	00000734
004492	00000761
004516	00000736
004497	00000738
004484	00000764
004509	00000737
004471	00000783
004529	00000720
004558	00000771
004533	00000745
004423	00000736
004552	00000780
004370	00000723
004485	00000768
004563	00000773
004525	00000738
004521	00000740
004556	00000767
004518	00000721
004439	00000766
004522	00000733
004487	00000770
004424	00000735
004543	00000777
004430	00000767
004503	00000736
004547	00000779
004530	00000735
004405	00000779
004565	00000764
004542	00000784
004549	00000786
004557	00000767
004550	00000788
004490	00000770
004539	00000785
004502	00000743
004469	00000785
004376	00000740
004459	00000767
004546	00000780
004414	00000745
004555	00000768
004477	00000785
004396	00000726
004451	00000726
004506	00000738
004460	00000773
004507	00000741
004366	00000727
004461	00000782
004447	00000724
004378	00000769
004432	00000764
004408	00000726
004454	00000722
004520	00000726
004386	00000720
004392	00000722
004393	00000720
004380	00000727
004519	00000727
004406	00000720
004394	00000721
004532	00000743
004373	00000726
004545	00000782
004500	00000736
004544	00000786
004495	00000765
004379	00000722
004433	00000761
004382	00000722
004511	00000738
004420	00000741
004418	00000736
004478	00000783
004388	00000720
004453	00000723
004562	00000766

[-- Attachment #15: massive-intr-200-300-4.txt --]
[-- Type: text/plain, Size: 3120 bytes --]

004478	00000736
004572	00000807
004414	00000713
004577	00000721
004513	00000736
004419	00000721
004411	00000718
004509	00000718
004505	00000721
004420	00000718
004413	00000721
004468	00000735
004449	00000733
004412	00000718
004410	00000721
004415	00000721
004488	00000807
004494	00000808
004499	00000721
004554	00000730
004593	00000734
004463	00000735
004460	00000732
004464	00000727
004404	00000735
004524	00000733
004395	00000806
004454	00000735
004584	00000735
004534	00000735
004537	00000737
004512	00000718
004440	00000718
004565	00000809
004500	00000718
004515	00000733
004447	00000730
004465	00000735
004455	00000738
004406	00000737
004564	00000812
004425	00000812
004558	00000807
004498	00000804
004470	00000735
004492	00000812
004518	00000735
004467	00000735
004433	00000721
004547	00000730
004439	00000717
004423	00000807
004568	00000812
004544	00000726
004549	00000735
004421	00000723
004591	00000732
004487	00000812
004490	00000807
004562	00000806
004427	00000812
004567	00000809
004497	00000809
004561	00000811
004485	00000807
004483	00000809
004581	00000729
004451	00000738
004553	00000727
004585	00000739
004424	00000813
004502	00000809
004442	00000722
004532	00000737
004403	00000720
004530	00000737
004570	00000809
004541	00000735
004566	00000811
004481	00000720
004430	00000717
004443	00000717
004405	00000814
004429	00000806
004579	00000735
004416	00000717
004473	00000719
004438	00000720
004432	00000714
004573	00000809
004408	00000720
004575	00000732
004511	00000720
004444	00000738
004450	00000735
004472	00000738
004475	00000738
004409	00000720
004521	00000738
004555	00000732
004545	00000735
004550	00000735
004535	00000734
004520	00000735
004417	00000720
004556	00000735
004539	00000729
004428	00000806
004587	00000734
004476	00000735
004495	00000809
004542	00000737
004480	00000720
004525	00000734
004469	00000737
004527	00000734
004533	00000737
004552	00000732
004400	00000720
004578	00000738
004538	00000737
004441	00000720
004486	00000810
004548	00000734
004504	00000720
004514	00000732
004590	00000732
004446	00000732
004569	00000812
004557	00000809
004397	00000737
004531	00000736
004436	00000723
004457	00000737
004491	00000809
004435	00000713
004522	00000732
004479	00000732
004546	00000732
004508	00000720
004422	00000720
004459	00000734
004418	00000723
004462	00000734
004543	00000737
004496	00000808
004437	00000720
004471	00000735
004507	00000717
004458	00000734
004431	00000720
004588	00000734
004401	00000714
004506	00000716
004407	00000719
004563	00000814
004461	00000737
004540	00000734
004399	00000722
004582	00000737
004592	00000734
004576	00000809
004560	00000814
004559	00000811
004426	00000811
004489	00000809
004571	00000814
004493	00000811
004477	00000737
004452	00000734
004448	00000737
004394	00000718
004474	00000734
004396	00000737
004482	00000720
004398	00000720
004456	00000734
004503	00000720
004586	00000734
004526	00000734
004529	00000734
004536	00000731
004434	00000738
004528	00000736
004445	00000734
004517	00000736
004510	00000720
004551	00000732
004523	00000734
004589	00000731
004580	00000736
004466	00000736
004453	00000737
004516	00000737
004501	00000806

[-- Attachment #16: massive-intr-200-300-5.txt --]
[-- Type: text/plain, Size: 2768 bytes --]

004452	00000689
004559	00000766
004437	00000769
004488	00000776
004466	00000766
004478	00000766
004399	00000763
004394	00000694
004511	00000763
004391	00000694
004440	00000689
004421	00000780
004374	00000689
004392	00000694
004505	00000766
004456	00000694
004566	00000763
004459	00000691
004509	00000766
004409	00000776
004411	00000776
004530	00000779
004471	00000776
004410	00000779
004550	00000766
004562	00000769
004544	00000768
004564	00000766
004474	00000766
004489	00000776
004425	00000776
004524	00000778
004460	00000694
004462	00000763
004567	00000760
004386	00000694
004565	00000766
004522	00000776
004516	00000766
004415	00000778
004507	00000763
004463	00000766
004454	00000694
004393	00000689
004563	00000692
004424	00000788
004404	00000766
004508	00000765
004396	00000691
004554	00000765
004499	00000775
004406	00000691
004480	00000776
004523	00000778
004432	00000760
004558	00000768
004502	00000765
004401	00000763
004503	00000763
004557	00000762
004395	00000691
004520	00000781
004539	00000766
004416	00000766
004389	00000691
004379	00000694
004419	00000784
004388	00000688
004427	00000765
004383	00000690
004513	00000694
004479	00000760
004420	00000784
004412	00000779
004418	00000783
004376	00000781
004533	00000763
004540	00000757
004413	00000778
004458	00000693
004492	00000778
004497	00000779
004447	00000691
004521	00000781
004435	00000768
004481	00000777
004375	00000694
004448	00000691
004519	00000691
004414	00000781
004553	00000777
004560	00000691
004387	00000694
004486	00000781
004453	00000691
004556	00000768
004484	00000777
004525	00000775
004494	00000781
004381	00000772
004493	00000782
004491	00000775
004426	00000765
004483	00000778
004547	00000770
004517	00000764
004498	00000766
004433	00000765
004443	00000765
004532	00000768
004510	00000765
004527	00000778
004441	00000765
004555	00000765
004434	00000767
004428	00000768
004506	00000770
004529	00000783
004417	00000690
004385	00000693
004473	00000765
004543	00000772
004370	00000787
004423	00000782
004495	00000783
004430	00000767
004528	00000780
004445	00000696
004465	00000768
004496	00000782
004398	00000768
004551	00000780
004515	00000762
004470	00000765
004490	00000778
004541	00000765
004487	00000777
004504	00000770
004482	00000783
004475	00000765
004561	00000762
004526	00000784
004476	00000765
004422	00000780
004469	00000765
004438	00000770
004477	00000768
004436	00000764
004439	00000770
004545	00000773
004397	00000693
004372	00000692
004431	00000767
004371	00000765
004500	00000764
004442	00000767
004407	00000690
004400	00000769
004512	00000770
004461	00000765
004467	00000769
004429	00000759
004451	00000689
004368	00000693
004536	00000762
004405	00000690
004444	00000693
004472	00000767
004468	00000762
004537	00000762
004518	00000693
004403	00000767
004552	00000765

[-- Attachment #17: massive-intr-200-300-6.txt --]
[-- Type: text/plain, Size: 3072 bytes --]

004513	00000777
004429	00000690
004447	00000781
004462	00000776
004568	00000776
004541	00000748
004428	00000691
004496	00000693
004510	00000779
004566	00000774
004602	00000752
004432	00000690
004555	00000750
004523	00000780
004570	00000783
004427	00000693
004459	00000693
004583	00000778
004585	00000777
004433	00000690
004599	00000747
004449	00000777
004503	00000773
004508	00000780
004442	00000779
004580	00000777
004469	00000747
004595	00000781
004527	00000695
004489	00000783
004484	00000749
004479	00000753
004551	00000749
004605	00000747
004515	00000777
004482	00000744
004436	00000696
004512	00000776
004458	00000692
004483	00000747
004540	00000751
004478	00000746
004560	00000785
004535	00000781
004464	00000783
004559	00000779
004514	00000776
004603	00000754
004423	00000692
004410	00000783
004529	00000785
004434	00000693
004556	00000743
004472	00000750
004604	00000754
004465	00000780
004412	00000696
004524	00000693
004439	00000693
004413	00000693
004505	00000779
004416	00000692
004533	00000783
004509	00000776
004491	00000783
004558	00000747
004486	00000782
004488	00000784
004457	00000693
004537	00000781
004606	00000744
004588	00000776
004577	00000783
004480	00000755
004563	00000781
004461	00000781
004437	00000693
004463	00000778
004452	00000693
004545	00000741
004485	00000779
004573	00000782
004530	00000775
004471	00000748
004554	00000778
004466	00000783
004593	00000779
004504	00000780
004532	00000781
004506	00000773
004444	00000771
004574	00000778
004431	00000690
004494	00000782
004450	00000692
004499	00000690
004448	00000785
004578	00000780
004481	00000778
004544	00000752
004487	00000778
004557	00000753
004534	00000784
004507	00000779
004582	00000779
004567	00000775
004538	00000746
004548	00000749
004543	00000746
004587	00000778
004549	00000749
004408	00000692
004575	00000749
004598	00000749
004474	00000749
004495	00000692
004542	00000744
004445	00000772
004525	00000696
004607	00000751
004454	00000692
004498	00000695
004497	00000700
004440	00000694
004470	00000749
004561	00000782
004572	00000787
004477	00000750
004443	00000776
004467	00000782
004553	00000747
004547	00000752
004565	00000783
004594	00000775
004571	00000783
004591	00000776
004420	00000752
004579	00000779
004511	00000780
004539	00000746
004417	00000784
004446	00000779
004522	00000772
004419	00000782
004584	00000778
004473	00000750
004550	00000695
004418	00000694
004519	00000695
004441	00000694
004460	00000692
004414	00000779
004492	00000783
004409	00000779
004528	00000783
004501	00000695
004518	00000766
004426	00000695
004562	00000783
004502	00000694
004493	00000783
004476	00000751
004590	00000784
004536	00000783
004592	00000778
004475	00000746
004552	00000747
004597	00000777
004600	00000750
004546	00000750
004581	00000773
004589	00000778
004435	00000699
004424	00000695
004421	00000695
004425	00000689
004521	00000694
004430	00000695
004415	00000694
004526	00000692
004531	00000781
004422	00000694
004456	00000692
004455	00000692
004500	00000692
004438	00000689
004451	00000692
004520	00000777
004564	00000777
004516	00000778
004576	00000751
004490	00000694

[-- Attachment #18: massive-intr-200-300-7.txt --]
[-- Type: text/plain, Size: 2992 bytes --]

004676	00000739
004762	00000727
004830	00000726
004845	00000721
004856	00000747
004771	00000736
004815	00000757
004736	00000755
004756	00000730
004732	00000754
004777	00000738
004789	00000795
004808	00000753
004722	00000797
004781	00000737
004718	00000797
004854	00000751
004818	00000751
004724	00000800
004705	00000739
004802	00000760
004727	00000754
004778	00000741
004734	00000750
004779	00000740
004677	00000740
004796	00000797
004693	00000740
004784	00000800
004659	00000752
004831	00000727
004813	00000756
004742	00000797
004787	00000800
004673	00000737
004788	00000799
004768	00000740
004672	00000737
004749	00000724
004829	00000724
004716	00000794
004753	00000724
004851	00000753
004760	00000724
004723	00000796
004805	00000724
004700	00000737
004770	00000737
004832	00000721
004729	00000756
004811	00000753
004816	00000753
004697	00000737
004685	00000726
004750	00000726
004689	00000737
004688	00000801
004767	00000737
004794	00000790
004800	00000753
004719	00000798
004844	00000796
004737	00000753
004745	00000796
004694	00000742
004776	00000734
004759	00000738
004793	00000799
004783	00000794
004810	00000750
004761	00000728
004782	00000734
004797	00000801
004667	00000737
004774	00000731
004825	00000723
004715	00000796
004804	00000751
004812	00000750
004780	00000741
004703	00000739
004772	00000739
004675	00000739
004839	00000801
004690	00000796
004795	00000796
004850	00000757
004728	00000759
004747	00000799
004738	00000758
004757	00000726
004848	00000722
004684	00000721
004769	00000734
004801	00000752
004708	00000736
004819	00000755
004680	00000720
004704	00000733
004661	00000740
004711	00000739
004683	00000724
004726	00000797
004786	00000797
004775	00000736
004754	00000729
004674	00000736
004799	00000757
004809	00000751
004664	00000754
004666	00000760
004702	00000739
004752	00000723
004806	00000752
004765	00000725
004682	00000726
004828	00000728
004741	00000757
004740	00000755
004834	00000723
004746	00000801
004678	00000722
004733	00000747
004791	00000799
004748	00000725
004692	00000790
004730	00000757
004713	00000794
004763	00000728
004696	00000736
004840	00000725
004836	00000723
004824	00000719
004841	00000723
004855	00000752
004766	00000725
004701	00000740
004706	00000739
004739	00000749
004833	00000729
004709	00000741
004847	00000720
004744	00000752
004662	00000739
004707	00000740
004731	00000752
004820	00000798
004758	00000727
004686	00000802
004792	00000799
004823	00000723
004773	00000734
004814	00000748
004735	00000752
004846	00000726
004695	00000733
004835	00000724
004665	00000743
004712	00000800
004679	00000726
004837	00000725
004687	00000796
004699	00000743
004691	00000752
004842	00000796
004785	00000796
004743	00000753
004658	00000795
004663	00000744
004670	00000729
004822	00000726
004838	00000717
004698	00000736
004720	00000720
004671	00000733
004660	00000728
004798	00000793
004826	00000728
004717	00000799
004849	00000755
004803	00000752
004710	00000739
004807	00000752
004817	00000757
004790	00000796
004668	00000800
004853	00000751

[-- Attachment #19: massive-intr-200-300-8.txt --]
[-- Type: text/plain, Size: 3200 bytes --]

004546	00000739
004492	00000802
004514	00000732
004585	00000738
004588	00000732
004583	00000735
004454	00000735
004506	00000732
004518	00000735
004547	00000734
004428	00000797
004413	00000730
004536	00000732
004393	00000738
004405	00000732
004575	00000796
004471	00000735
004532	00000737
004455	00000737
004549	00000732
004578	00000733
004414	00000734
004488	00000796
004556	00000791
004589	00000737
004425	00000791
004561	00000796
004449	00000734
004502	00000795
004458	00000735
004444	00000734
004542	00000737
004571	00000793
004460	00000734
004463	00000737
004450	00000737
004436	00000736
004581	00000737
004550	00000737
004493	00000735
004394	00000746
004410	00000736
004531	00000734
004537	00000734
004431	00000735
004451	00000734
004519	00000737
004487	00000737
004501	00000796
004426	00000793
004478	00000732
004418	00000729
004480	00000732
004565	00000794
004424	00000796
004503	00000797
004403	00000740
004586	00000736
004500	00000790
004534	00000736
004411	00000735
004445	00000734
004416	00000734
004407	00000735
004484	00000735
004457	00000731
004448	00000745
004453	00000737
004553	00000734
004476	00000732
004486	00000735
004479	00000736
004419	00000735
004572	00000746
004562	00000793
004580	00000731
004548	00000734
004466	00000737
004545	00000737
004392	00000792
004590	00000735
004446	00000737
004438	00000737
004544	00000734
004432	00000734
004512	00000734
004475	00000732
004441	00000739
004423	00000796
004398	00000732
004409	00000732
004540	00000734
004552	00000734
004481	00000732
004396	00000729
004470	00000732
004467	00000734
004443	00000734
004461	00000748
004417	00000745
004559	00000795
004490	00000794
004577	00000795
004564	00000798
004427	00000794
004497	00000794
004496	00000793
004400	00000734
004538	00000737
004402	00000740
004472	00000734
004399	00000798
004491	00000796
004498	00000789
004558	00000792
004391	00000737
004429	00000794
004468	00000729
004555	00000730
004494	00000798
004440	00000732
004404	00000731
004524	00000735
004520	00000738
004554	00000747
004485	00000732
004421	00000729
004563	00000795
004530	00000736
004430	00000801
004401	00000732
004513	00000731
004515	00000733
004525	00000736
004504	00000796
004516	00000738
004489	00000790
004406	00000737
004509	00000736
004573	00000795
004442	00000736
004582	00000739
004543	00000736
004522	00000738
004462	00000734
004517	00000728
004527	00000736
004464	00000736
004483	00000737
004434	00000739
004422	00000737
004539	00000737
004447	00000733
004412	00000734
004551	00000735
004510	00000741
004474	00000731
004505	00000739
004535	00000733
004533	00000734
004569	00000792
004507	00000735
004568	00000795
004459	00000738
004541	00000748
004465	00000733
004521	00000736
004576	00000792
004557	00000792
004584	00000737
004511	00000736
004435	00000740
004567	00000793
004587	00000736
004574	00000795
004526	00000736
004452	00000736
004529	00000733
004523	00000747
004477	00000735
004395	00000733
004408	00000734
004469	00000734
004560	00000797
004437	00000736
004508	00000732
004482	00000734
004415	00000730
004420	00000734
004473	00000734
004397	00000734
004499	00000792
004433	00000736
004439	00000733
004528	00000741
004456	00000736
004579	00000733
004570	00000799
004495	00000795
004566	00000789

[-- Attachment #20: massive-intr-200-300-9.txt --]
[-- Type: text/plain, Size: 3184 bytes --]

004528	00000799
004424	00000769
004561	00000769
004484	00000726
004377	00000728
004413	00000721
004430	00000769
004461	00000799
004531	00000799
004546	00000791
004456	00000799
004374	00000798
004523	00000723
004396	00000796
004537	00000799
004412	00000724
004420	00000771
004463	00000796
004395	00000798
004466	00000796
004477	00000721
004500	00000768
004368	00000726
004470	00000724
004522	00000724
004369	00000724
004390	00000724
004418	00000726
004479	00000724
004547	00000726
004468	00000796
004471	00000796
004551	00000725
004489	00000723
004378	00000721
004542	00000796
004474	00000718
004446	00000727
004365	00000795
004545	00000724
004441	00000726
004415	00000725
004447	00000712
004502	00000766
004520	00000720
004458	00000801
004431	00000801
004367	00000770
004556	00000768
004550	00000766
004411	00000721
004491	00000726
004549	00000770
004403	00000721
004534	00000796
004407	00000725
004405	00000724
004451	00000724
004517	00000728
004472	00000801
004383	00000727
004487	00000720
004394	00000801
004526	00000726
004437	00000765
004521	00000728
004501	00000768
004455	00000803
004497	00000775
004385	00000718
004438	00000768
004507	00000770
004428	00000773
004465	00000796
004423	00000728
004496	00000768
004543	00000801
004527	00000795
004533	00000798
004419	00000727
004436	00000772
004493	00000725
004372	00000773
004555	00000767
004544	00000796
004467	00000801
004434	00000772
004444	00000720
004529	00000801
004399	00000796
004460	00000801
004504	00000771
004464	00000798
004554	00000767
004559	00000722
004515	00000725
004435	00000769
004499	00000770
004558	00000761
004505	00000767
004429	00000770
004370	00000721
004492	00000725
004454	00000718
004421	00000722
004400	00000798
004425	00000773
004532	00000798
004459	00000798
004518	00000724
004427	00000773
004535	00000801
004553	00000768
004414	00000720
004563	00000770
004416	00000722
004541	00000798
004516	00000725
004519	00000725
004552	00000770
004457	00000795
004482	00000729
004375	00000725
004562	00000770
004490	00000725
004530	00000798
004539	00000800
004439	00000767
004371	00000725
004373	00000726
004443	00000725
004486	00000728
004557	00000771
004432	00000769
004388	00000726
004524	00000725
004409	00000723
004560	00000770
004380	00000726
004393	00000723
004408	00000723
004406	00000723
004475	00000723
004445	00000723
004364	00000723
004476	00000723
004410	00000725
004389	00000726
004478	00000723
004397	00000723
004450	00000720
004387	00000720
004483	00000723
004398	00000802
004509	00000776
004495	00000770
004485	00000721
004453	00000723
004506	00000771
004366	00000726
004473	00000803
004422	00000726
004462	00000800
004536	00000797
004449	00000726
004503	00000770
004391	00000723
004426	00000774
004386	00000721
004525	00000729
004480	00000723
004510	00000767
004494	00000771
004382	00000720
004452	00000724
004417	00000727
004448	00000722
004488	00000725
004376	00000725
004381	00000725
004404	00000721
004469	00000795
004548	00000728
004513	00000724
004511	00000727
004514	00000727
004498	00000771
004508	00000773
004540	00000803
004440	00000722
004433	00000772
004384	00000723
004481	00000720
004512	00000775
004442	00000727
004392	00000722
004402	00000720
004538	00000800
004401	00000797

[-- Attachment #21: massive-intr-200-300-without-patch.txt --]
[-- Type: text/plain, Size: 2784 bytes --]

004726	00000761
004723	00000763
004793	00000763
004776	00000736
004746	00000735
004731	00000754
004685	00000735
004835	00000754
004782	00000751
004747	00000736
004766	00000754
004663	00000735
004696	00000752
004737	00000760
004679	00000735
004727	00000751
004840	00000754
004720	00000767
004718	00000764
004788	00000761
004716	00000770
004791	00000758
004655	00000755
004838	00000757
004811	00000753
004659	00000768
004686	00000735
004740	00000759
004676	00000739
004849	00000748
004825	00000763
004808	00000748
004844	00000747
004702	00000755
004828	00000758
004829	00000758
004822	00000750
004820	00000753
004805	00000751
004764	00000748
004717	00000765
004794	00000761
004701	00000750
004792	00000766
004818	00000753
004842	00000752
004837	00000751
004697	00000750
004654	00000739
004763	00000754
004851	00000761
004671	00000738
004807	00000753
004734	00000760
004661	00000740
004743	00000737
004664	00000740
004682	00000737
004741	00000750
004817	00000750
004694	00000754
004779	00000753
004833	00000754
004758	00000757
004809	00000756
004815	00000752
004666	00000758
004770	00000750
004704	00000737
004709	00000753
004841	00000754
004732	00000753
004706	00000753
004675	00000739
004745	00000737
004719	00000765
004691	00000764
004777	00000756
004778	00000750
004780	00000759
004754	00000737
004799	00000755
004848	00000755
004752	00000737
004742	00000734
004773	00000752
004774	00000747
004673	00000736
004787	00000763
004781	00000756
004693	00000753
004692	00000751
004769	00000750
004728	00000763
004756	00000758
004749	00000737
004762	00000753
004687	00000739
004827	00000766
004683	00000734
004761	00000757
004678	00000739
004830	00000763
004803	00000763
004798	00000765
004850	00000760
004771	00000749
004674	00000737
004832	00000753
004821	00000757
004753	00000734
004843	00000752
004724	00000763
004759	00000752
004800	00000753
004700	00000753
004824	00000763
004767	00000755
004823	00000751
004789	00000768
004757	00000755
004852	00000765
004836	00000756
004839	00000757
004760	00000748
004834	00000758
004739	00000759
004786	00000768
004846	00000754
004711	00000761
004826	00000765
004695	00000755
004710	00000758
004783	00000761
004765	00000755
004684	00000731
004698	00000752
004785	00000768
004755	00000736
004813	00000754
004775	00000753
004795	00000765
004712	00000755
004768	00000755
004713	00000767
004816	00000752
004790	00000765
004744	00000731
004736	00000756
004672	00000741
004715	00000766
004667	00000754
004705	00000755
004810	00000755
004708	00000755
004707	00000752
004750	00000736
004688	00000736
004772	00000741
004703	00000736
004681	00000736
004748	00000737
004668	00000736
004690	00000739
004669	00000739
004733	00000743
004656	00000767
004812	00000749
004714	00000771
004677	00000741
004806	00000755
004665	00000736
004680	00000739
004670	00000739

[-- Attachment #22: massive-intr-200-300-with-patch.txt --]
[-- Type: text/plain, Size: 3200 bytes --]

004663	00000754
004634	00000694
004723	00000800
004746	00000751
004734	00000768
004633	00000689
004755	00000754
004722	00000797
004626	00000797
004689	00000765
004767	00000695
004813	00000765
004724	00000800
004621	00000769
004725	00000796
004714	00000799
004789	00000793
004631	00000758
004712	00000796
004744	00000748
004655	00000796
004783	00000751
004785	00000800
004790	00000796
004758	00000748
004816	00000772
004683	00000765
004636	00000694
004771	00000691
004619	00000695
004669	00000753
004623	00000696
004775	00000753
004752	00000748
004778	00000754
004784	00000751
004739	00000767
004807	00000762
004693	00000765
004691	00000770
004736	00000763
004709	00000768
004720	00000796
004628	00000695
004772	00000695
004696	00000695
004682	00000692
004675	00000748
004643	00000689
004637	00000695
004715	00000793
004787	00000796
004792	00000793
004797	00000796
004708	00000768
004651	00000796
004806	00000766
004679	00000766
004811	00000763
004699	00000695
004624	00000769
004638	00000695
004645	00000695
004635	00000692
004704	00000692
004742	00000764
004680	00000761
004800	00000796
004796	00000801
004802	00000798
004731	00000793
004677	00000770
004640	00000692
004657	00000692
004656	00000793
004730	00000790
004786	00000795
004817	00000766
004627	00000694
004727	00000793
004814	00000773
004658	00000798
004695	00000689
004791	00000792
004653	00000795
004798	00000792
004673	00000745
004666	00000753
004753	00000751
004664	00000753
004788	00000798
004801	00000753
004685	00000766
004810	00000770
004750	00000753
004754	00000755
004652	00000795
004668	00000753
004654	00000795
004648	00000695
004777	00000747
004765	00000694
004672	00000753
004665	00000750
004737	00000770
004757	00000747
004620	00000796
004780	00000750
004717	00000792
004773	00000751
004756	00000767
004760	00000746
004808	00000770
004776	00000753
004662	00000756
004670	00000750
004625	00000694
004647	00000694
004794	00000795
004738	00000767
004641	00000698
004735	00000767
004759	00000694
004799	00000790
004762	00000697
004629	00000694
004769	00000694
004705	00000694
004743	00000767
004781	00000750
004701	00000697
004661	00000749
004702	00000694
004710	00000770
004681	00000767
004700	00000691
004686	00000767
004642	00000694
004747	00000753
004644	00000694
004812	00000767
004748	00000750
004733	00000764
004721	00000797
004687	00000771
004690	00000771
004751	00000749
004632	00000694
004732	00000764
004728	00000798
004766	00000694
004706	00000764
004630	00000694
004688	00000764
004711	00000694
004622	00000753
004795	00000798
004815	00000770
004729	00000791
004763	00000747
004818	00000766
004674	00000749
004761	00000694
004749	00000752
004770	00000692
004718	00000795
004694	00000694
004782	00000755
004809	00000766
004740	00000770
004671	00000752
004716	00000762
004707	00000766
004692	00000801
004719	00000795
004713	00000800
004659	00000797
004764	00000749
004774	00000747
004698	00000688
004649	00000696
004779	00000752
004768	00000694
004676	00000752
004646	00000693
004805	00000755
004697	00000691
004703	00000692
004639	00000694
004804	00000693
004803	00000754
004678	00000769
004741	00000768
004684	00000761
004660	00000693
004793	00000797
004667	00000753
004726	00000795
004745	00000755
004650	00000691

[-- Attachment #23: plot.sh --]
[-- Type: application/x-sh, Size: 1242 bytes --]

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 00/15] CFS Bandwidth Control V6
  2011-06-15  8:37     ` Hu Tao
@ 2011-06-16  0:57       ` Hidetoshi Seto
  2011-06-16  9:45         ` Hu Tao
  0 siblings, 1 reply; 129+ messages in thread
From: Hidetoshi Seto @ 2011-06-16  0:57 UTC (permalink / raw)
  To: Hu Tao
  Cc: Paul Turner, linux-kernel, Peter Zijlstra, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri

(2011/06/15 17:37), Hu Tao wrote:
> On Tue, Jun 14, 2011 at 04:29:49PM +0900, Hidetoshi Seto wrote:
>> (2011/06/14 15:58), Hu Tao wrote:
>>> Hi,
>>>
>>> I've run several tests including hackbench, unixbench, massive-intr
>>> and kernel building. CPU is Intel(R) Xeon(R) CPU X3430  @ 2.40GHz,
>>> 4 cores, and 4G memory.
>>>
>>> Most of the time the results differ little, but there are problems:
>>>
>>> 1. unixbench: execl throughput has about a 5% drop.
>>> 2. unixbench: process creation has about a 5% drop.
>>> 3. massive-intr: when running 200 processes for 5mins, the number
>>>    of loops each process runs varies more than before cfs-bandwidth-v6.
>>>
>>> The results are attached.
>>
>> I know the unixbench score is not very stable, so the problem might
>> just be noise ... but the massive-intr result is interesting.
>> Could you try to find which piece (xx/15) in the series causes
>> the problems?
> 
> After more tests, I found the massive-intr data is not stable, either. Results
> are attached. The third number in the file name indicates how many patches are
> applied; 0 means no patch applied. plot.sh makes it easy to generate png
> files.

(Though I don't know what the 16th patch of this series is, anyway)
I see that the results of 15, 15-1 and 15-2 are very different, and that
15-2 is similar to the without-patch case.

One concern is whether this instability of the data is really caused by the
nature of your test (hardware, massive-intr itself, something running
in the background, etc.) or by a hidden piece in the bandwidth patch set.
Did you see "not stable" data when none of the patches was applied?
If not, which patch makes it unstable?


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 129+ messages in thread
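
The "not stable" question above lends itself to a simple measurement: each
massive-intr result file in the attachments is two columns, PID and the number
of loops that worker completed during the 300-second run, so the relative
spread of the second column is exactly the fairness property being discussed.
A minimal sketch of such a summary, assuming Python 3 and the two-column
format of the attached files (the script and its names are illustrative only,
not part of the patch series or the test suite):

#!/usr/bin/env python3
# Summarize the spread of massive-intr loop counts in one or more result
# files.  Each input line is "<pid> <loops completed>"; a larger relative
# spread means the workers received less equal CPU time during the run.
import statistics
import sys

def summarize(path):
    loops = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 2:
                loops.append(int(fields[1]))
    mean = statistics.mean(loops)
    stdev = statistics.pstdev(loops)
    return {
        "tasks": len(loops),
        "min": min(loops),
        "max": max(loops),
        "mean": mean,
        # coefficient of variation: spread normalized by the mean, so runs
        # with different absolute loop counts can be compared directly
        "cv": 100.0 * stdev / mean,
    }

if __name__ == "__main__":
    for path in sys.argv[1:]:
        s = summarize(path)
        print("%s: tasks=%d min=%d max=%d mean=%.1f cv=%.2f%%"
              % (path, s["tasks"], s["min"], s["max"], s["mean"], s["cv"]))

Comparing the coefficient of variation of a no-patch run against runs with
successive patches applied would answer "which patch makes it unstable?" more
directly than eyeballing the raw columns.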

* Re: [patch 00/15] CFS Bandwidth Control V6
  2011-06-16  0:57       ` Hidetoshi Seto
@ 2011-06-16  9:45         ` Hu Tao
  2011-06-17  1:22           ` Hidetoshi Seto
  0 siblings, 1 reply; 129+ messages in thread
From: Hu Tao @ 2011-06-16  9:45 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: Paul Turner, linux-kernel, Peter Zijlstra, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri

[-- Attachment #1: Type: text/plain, Size: 1937 bytes --]

On Thu, Jun 16, 2011 at 09:57:09AM +0900, Hidetoshi Seto wrote:
> (2011/06/15 17:37), Hu Tao wrote:
> > On Tue, Jun 14, 2011 at 04:29:49PM +0900, Hidetoshi Seto wrote:
> >> (2011/06/14 15:58), Hu Tao wrote:
> >>> Hi,
> >>>
> >>> I've run several tests including hackbench, unixbench, massive-intr
> >>> and kernel building. CPU is Intel(R) Xeon(R) CPU X3430  @ 2.40GHz,
> >>> 4 cores, and 4G memory.
> >>>
> >>> Most of the time the results differ little, but there are problems:
> >>>
> >>> 1. unixbench: execl throughput has about a 5% drop.
> >>> 2. unixbench: process creation has about a 5% drop.
> >>> 3. massive-intr: when running 200 processes for 5mins, the number
> >>>    of loops each process runs varies more than before cfs-bandwidth-v6.
> >>>
> >>> The results are attached.
> >>
> >> I know the unixbench score is not very stable, so the problem might
> >> just be noise ... but the massive-intr result is interesting.
> >> Could you try to find which piece (xx/15) in the series causes
> >> the problems?
> > 
> > After more tests, I found the massive-intr data is not stable, either. Results
> > are attached. The third number in the file name indicates how many patches are
> > applied; 0 means no patch applied. plot.sh makes it easy to generate png
> > files.
> 
> (Though I don't know what the 16th patch of this series is, anyway)

The 16th patch is this: https://lkml.org/lkml/2011/5/23/503

> I see that the results of 15, 15-1 and 15-2 are very different, and that
> 15-2 is similar to the without-patch case.
> 
> One concern is whether this instability of the data is really caused by the
> nature of your test (hardware, massive-intr itself, something running
> in the background, etc.) or by a hidden piece in the bandwidth patch set.
> Did you see "not stable" data when none of the patches was applied?

Yes.

But over five runs the results seem 'stable' (both before and after the
patches). I've also run the tests in single mode; results are attached.
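
Given the file-name convention described above
(massive-intr-200-300[-single]-<patches applied>-<run>.txt), the five runs per
configuration that follow can each be reduced to one spread figure and then
compared across configurations. A rough sketch under those assumptions, again
in Python 3 (the pattern and helper names are illustrative, not from the
thread):

#!/usr/bin/env python3
# Group massive-intr result files by configuration (the "<patches applied>"
# field in the file name, optionally prefixed by "single") and report the
# per-run coefficient of variation of the loop counts.  A configuration whose
# spread jumps relative to the 0-patch baseline is the one worth bisecting.
import glob
import re
import statistics
from collections import defaultdict

def run_cv(path):
    with open(path) as f:
        loops = [int(parts[1]) for parts in (line.split() for line in f)
                 if len(parts) >= 2]
    return 100.0 * statistics.pstdev(loops) / statistics.mean(loops)

# e.g. massive-intr-200-300-16-3.txt        -> config "16",       run 3
#      massive-intr-200-300-single-0-2.txt  -> config "single-0", run 2
pattern = re.compile(r"massive-intr-200-300-(.*)-(\d+)\.txt$")
by_config = defaultdict(list)
for path in sorted(glob.glob("massive-intr-200-300-*.txt")):
    m = pattern.search(path)
    if m:
        by_config[m.group(1)].append(run_cv(path))

for config, cvs in sorted(by_config.items()):
    print("%-10s: runs=%d cv%%=[%s] mean_cv=%.2f"
          % (config, len(cvs), " ".join("%.2f" % c for c in cvs),
             statistics.mean(cvs)))

If the no-patch configurations show the same run-to-run variation as the
16-patch ones, the spread is a property of the workload or the machine rather
than of the bandwidth patches.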

[-- Attachment #2: massive-intr-200-300-0-1.txt --]
[-- Type: text/plain, Size: 3200 bytes --]

005989	00000751
006025	00000750
006060	00000750
006044	00000742
006009	00000750
005893	00000744
006035	00000746
005936	00000754
005929	00000750
006040	00000743
006042	00000749
006050	00000750
005997	00000749
005966	00000752
005982	00000739
006034	00000751
006067	00000748
005939	00000751
005950	00000754
005986	00000747
006058	00000755
005963	00000750
005920	00000748
005925	00000754
006001	00000751
005980	00000745
006011	00000754
005993	00000747
005992	00000748
005994	00000745
006008	00000748
005984	00000745
006054	00000741
005912	00000749
005965	00000756
005918	00000752
006078	00000755
006074	00000763
006023	00000751
006038	00000745
005969	00000744
005897	00000749
005938	00000749
006069	00000748
005959	00000750
005945	00000759
005998	00000747
006016	00000752
006022	00000746
005887	00000753
005898	00000754
005949	00000748
006049	00000743
006020	00000754
006046	00000744
006018	00000748
005922	00000746
005942	00000752
005944	00000748
006026	00000752
005931	00000750
005928	00000752
006047	00000743
006029	00000744
005977	00000744
006024	00000754
005985	00000743
005915	00000747
005905	00000751
005902	00000750
005894	00000753
006005	00000750
005971	00000745
006007	00000749
005884	00000752
005991	00000744
006056	00000749
006052	00000742
005919	00000748
006015	00000753
006076	00000755
006077	00000750
006045	00000745
005955	00000752
006075	00000759
006036	00000741
005933	00000748
005907	00000749
005935	00000749
006064	00000752
005957	00000746
005990	00000746
006033	00000753
006006	00000752
006066	00000750
005910	00000753
006057	00000753
005909	00000749
005903	00000753
005927	00000746
006061	00000746
006028	00000751
006013	00000750
005988	00000745
006012	00000754
005899	00000749
005981	00000749
006065	00000753
005924	00000750
006004	00000757
005953	00000749
005934	00000749
005926	00000749
005932	00000753
006037	00000748
005975	00000745
006000	00000742
005914	00000753
005947	00000752
005906	00000746
006079	00000757
006030	00000742
006073	00000752
006068	00000754
005892	00000745
006072	00000747
005970	00000746
005908	00000749
005946	00000755
006048	00000742
006021	00000746
006017	00000752
006043	00000745
005886	00000749
005923	00000748
005890	00000753
006019	00000752
006059	00000749
006003	00000748
005983	00000745
005956	00000746
005999	00000755
006039	00000749
006032	00000754
006055	00000749
005940	00000755
005962	00000749
005901	00000747
005976	00000747
005895	00000751
005937	00000749
005972	00000743
005921	00000751
005995	00000743
005941	00000757
005960	00000757
005987	00000751
005952	00000746
006063	00000750
005904	00000755
006071	00000754
005979	00000745
006002	00000751
005964	00000749
006051	00000746
006031	00000743
005978	00000745
005951	00000752
006027	00000751
005954	00000749
005917	00000748
005891	00000752
005889	00000751
006062	00000751
005948	00000751
005896	00000748
006010	00000753
005943	00000753
006053	00000742
005958	00000752
005961	00000751
005888	00000749
005968	00000751
005883	00000749
005881	00000751
005913	00000752
005967	00000746
005900	00000749
005911	00000748
005916	00000751
005930	00000749
005880	00000750
005973	00000752
006041	00000743
005996	00000747
006014	00000753
005974	00000737
005885	00000751
006070	00000757
005882	00000756

[-- Attachment #3: massive-intr-200-300-0-2.txt --]
[-- Type: text/plain, Size: 3008 bytes --]

006143	00000704
006259	00000750
006275	00000708
006203	00000704
006204	00000704
006138	00000708
006267	00000749
006276	00000704
006205	00000704
006139	00000751
006272	00000698
006142	00000705
006211	00000701
006168	00000793
006140	00000791
006223	00000751
006303	00000796
006190	00000753
006155	00000709
006310	00000788
006148	00000710
006307	00000794
006187	00000751
006314	00000790
006302	00000792
006240	00000793
006304	00000793
006141	00000753
006293	00000753
006270	00000751
006208	00000704
006199	00000741
006238	00000787
006196	00000703
006219	00000748
006251	00000751
006173	00000751
006184	00000751
006201	00000748
006135	00000706
006189	00000751
006130	00000794
006181	00000751
006278	00000703
006172	00000787
006289	00000751
006264	00000751
006200	00000748
006300	00000795
006287	00000751
006323	00000749
006318	00000748
006186	00000751
006160	00000748
006247	00000748
006222	00000751
006210	00000708
006217	00000753
006241	00000796
006224	00000748
006209	00000706
006144	00000706
006321	00000751
006319	00000751
006180	00000748
006249	00000748
006171	00000793
006198	00000749
006266	00000748
006295	00000745
006188	00000748
006291	00000748
006202	00000709
006297	00000792
006134	00000705
006316	00000752
006167	00000793
006218	00000746
006157	00000707
006299	00000790
006129	00000709
006283	00000745
006290	00000750
006268	00000706
006131	00000750
006147	00000703
006277	00000708
006263	00000752
006177	00000753
006252	00000748
006213	00000708
006166	00000792
006151	00000705
006235	00000788
006296	00000751
006298	00000789
006175	00000794
006260	00000751
006237	00000792
006322	00000750
006301	00000795
006179	00000750
006236	00000792
006317	00000750
006245	00000790
006282	00000752
006169	00000795
006214	00000705
006228	00000790
006162	00000750
006239	00000791
006232	00000792
006248	00000751
006279	00000705
006227	00000790
006281	00000707
006244	00000792
006269	00000742
006243	00000795
006250	00000753
006132	00000752
006225	00000750
006154	00000706
006193	00000752
006215	00000705
006182	00000750
006191	00000749
006185	00000750
006292	00000750
006150	00000705
006158	00000708
006161	00000753
006195	00000749
006246	00000753
006163	00000750
006261	00000750
006262	00000750
006285	00000750
006311	00000750
006288	00000750
006306	00000795
006256	00000751
006156	00000705
006231	00000796
006284	00000750
006221	00000749
006165	00000795
006265	00000750
006324	00000750
006230	00000795
006212	00000711
006229	00000796
006226	00000750
006164	00000744
006149	00000707
006159	00000750
006136	00000708
006192	00000747
006280	00000705
006176	00000752
006145	00000708
006178	00000750
006271	00000751
006320	00000749
006206	00000705
006328	00000746
006233	00000790
006255	00000751
006273	00000701
006325	00000750
006137	00000705
006294	00000754
006153	00000708
006258	00000748
006254	00000753
006174	00000795
006234	00000792
006315	00000791
006313	00000795
006309	00000789
006286	00000745
006170	00000794
006194	00000751
006220	00000753
006312	00000740
006253	00000745
006305	00000794
006326	00000705

[-- Attachment #4: massive-intr-200-300-0-3.txt --]
[-- Type: text/plain, Size: 3056 bytes --]

006450	00000740
006481	00000812
006490	00000807
006557	00000809
006392	00000723
006485	00000810
006442	00000734
006497	00000720
006537	00000735
006503	00000740
006391	00000719
006533	00000732
006426	00000717
006462	00000717
006407	00000717
006454	00000734
006394	00000724
006522	00000737
006536	00000737
006554	00000818
006577	00000739
006515	00000731
006435	00000730
006397	00000733
006584	00000736
006546	00000734
006457	00000737
006478	00000807
006575	00000737
006390	00000717
006402	00000720
006464	00000717
006580	00000742
006491	00000810
006502	00000735
006403	00000718
006439	00000735
006433	00000735
006507	00000741
006562	00000812
006540	00000735
006551	00000810
006559	00000814
006567	00000817
006447	00000734
006550	00000731
006535	00000737
006419	00000810
006429	00000737
006393	00000817
006440	00000735
006430	00000735
006477	00000815
006532	00000735
006547	00000737
006460	00000734
006387	00000738
006505	00000741
006572	00000737
006552	00000807
006563	00000812
006415	00000806
006512	00000734
006421	00000812
006431	00000728
006410	00000721
006467	00000716
006495	00000815
006422	00000809
006494	00000812
006560	00000812
006486	00000812
006506	00000739
006404	00000722
006474	00000724
006465	00000722
006466	00000719
006488	00000812
006508	00000737
006568	00000812
006555	00000811
006458	00000723
006443	00000741
006418	00000812
006428	00000737
006526	00000729
006498	00000719
006475	00000719
006564	00000803
006493	00000722
006420	00000722
006399	00000722
006416	00000815
006473	00000719
006479	00000818
006409	00000719
006500	00000737
006401	00000719
006438	00000737
006579	00000739
006523	00000739
006484	00000809
006414	00000719
006518	00000739
006545	00000728
006514	00000739
006456	00000740
006553	00000810
006400	00000719
006501	00000734
006492	00000722
006470	00000719
006558	00000811
006385	00000717
006469	00000722
006517	00000736
006427	00000719
006424	00000719
006483	00000814
006412	00000723
006556	00000811
006411	00000719
006437	00000737
006398	00000719
006388	00000737
006408	00000719
006417	00000816
006472	00000719
006423	00000719
006499	00000722
006453	00000735
006489	00000812
006524	00000734
006459	00000733
006487	00000817
006449	00000736
006455	00000736
006529	00000737
006425	00000724
006519	00000734
006576	00000736
006413	00000721
006510	00000722
006511	00000731
006565	00000814
006468	00000722
006541	00000742
006406	00000720
006538	00000739
006405	00000724
006531	00000737
006436	00000734
006471	00000718
006583	00000733
006396	00000718
006395	00000736
006452	00000738
006581	00000733
006578	00000736
006441	00000731
006569	00000815
006480	00000811
006386	00000813
006520	00000734
006570	00000735
006451	00000738
006543	00000736
006574	00000733
006509	00000738
006539	00000736
006530	00000734
006448	00000738
006463	00000724
006389	00000724
006571	00000733
006461	00000724
006504	00000738
006432	00000733
006534	00000739
006525	00000733
006513	00000736
006549	00000742
006434	00000736
006444	00000733
006446	00000738
006561	00000740
006496	00000811
006544	00000736
006521	00000733
006527	00000734
006516	00000733

[-- Attachment #5: massive-intr-200-300-0-4.txt --]
[-- Type: text/plain, Size: 3168 bytes --]

006839	00000762
006730	00000738
006799	00000748
006812	00000747
006870	00000733
006797	00000733
006903	00000745
006911	00000733
006880	00000750
006879	00000747
006766	00000762
006743	00000745
006759	00000759
006782	00000745
006859	00000736
006857	00000736
006908	00000733
006793	00000733
006830	00000759
006784	00000736
006794	00000733
006774	00000742
006750	00000748
006719	00000747
006721	00000738
006728	00000735
006787	00000730
006878	00000744
006831	00000759
006862	00000733
006738	00000756
006866	00000730
006828	00000748
006845	00000747
006856	00000733
006758	00000742
006795	00000730
006824	00000752
006736	00000756
006726	00000756
006744	00000751
006821	00000749
006729	00000749
006805	00000753
006901	00000752
006808	00000753
006846	00000746
006756	00000753
006896	00000753
006900	00000753
006884	00000750
006775	00000753
006883	00000753
006757	00000750
006804	00000750
006722	00000758
006780	00000750
006840	00000761
006898	00000750
006798	00000735
006740	00000759
006731	00000758
006848	00000752
006770	00000764
006843	00000761
006822	00000752
006841	00000761
006837	00000764
006889	00000750
006792	00000738
006915	00000738
006763	00000761
006790	00000738
006734	00000761
006783	00000735
006727	00000764
006833	00000761
006760	00000758
006868	00000738
006767	00000761
006852	00000748
006842	00000756
006850	00000752
006733	00000764
006720	00000746
006718	00000759
006748	00000749
006772	00000761
006855	00000749
006761	00000761
006809	00000750
006864	00000738
006789	00000735
006814	00000752
006865	00000735
006827	00000749
006869	00000746
006853	00000749
006771	00000758
006905	00000735
006820	00000749
006762	00000761
006815	00000749
006745	00000756
006777	00000747
006811	00000749
006810	00000747
006765	00000761
006807	00000747
006818	00000746
006891	00000750
006902	00000753
006800	00000735
006897	00000750
006904	00000750
006788	00000729
006803	00000747
006725	00000751
006802	00000732
006823	00000749
006894	00000747
006737	00000761
006890	00000747
006739	00000761
006895	00000744
006838	00000761
006886	00000747
006909	00000735
006834	00000761
006888	00000747
006791	00000735
006785	00000735
006916	00000735
006912	00000735
006796	00000735
006768	00000761
006723	00000758
006732	00000761
006844	00000753
006801	00000729
006906	00000732
006860	00000732
006914	00000732
006836	00000761
006832	00000761
006826	00000758
006849	00000748
006874	00000751
006769	00000761
006875	00000746
006835	00000755
006899	00000743
006917	00000731
006786	00000737
006742	00000748
006872	00000734
006881	00000748
006861	00000737
006825	00000751
006907	00000734
006819	00000751
006764	00000760
006892	00000752
006913	00000737
006741	00000763
006778	00000752
006873	00000729
006749	00000740
006751	00000749
006863	00000737
006773	00000761
006851	00000748
006781	00000752
006858	00000731
006854	00000748
006776	00000749
006724	00000763
006747	00000751
006885	00000752
006887	00000749
006847	00000740
006779	00000749
006754	00000752
006806	00000749
006893	00000749
006882	00000749
006813	00000744
006829	00000751
006867	00000734
006871	00000737
006816	00000752
006746	00000748
006817	00000748
006755	00000752
006877	00000751
006876	00000750
006910	00000749
006752	00000749

[-- Attachment #6: massive-intr-200-300-0-5.txt --]
[-- Type: text/plain, Size: 2848 bytes --]

007029	00000814
007028	00000708
007017	00000814
007143	00000752
007118	00000739
007060	00000755
006982	00000815
007055	00000734
007010	00000703
007110	00000739
007080	00000751
007099	00000708
007138	00000814
007116	00000736
007021	00000711
007034	00000814
007001	00000731
006998	00000707
006978	00000755
007106	00000710
007097	00000708
007172	00000750
006996	00000705
006993	00000708
006983	00000710
007119	00000741
007149	00000747
006985	00000713
006981	00000705
007153	00000750
007030	00000814
007134	00000816
007088	00000701
007127	00000814
007160	00000751
007013	00000821
007108	00000736
007105	00000708
007038	00000811
007166	00000709
007144	00000755
007014	00000811
007114	00000739
007022	00000696
006975	00000769
007039	00000811
007090	00000733
007025	00000709
007076	00000752
007173	00000751
007047	00000736
007072	00000752
006995	00000705
007007	00000739
007168	00000736
007044	00000820
007077	00000752
007065	00000755
006997	00000709
007102	00000711
007016	00000814
007032	00000811
007063	00000755
007082	00000749
006994	00000710
006990	00000712
007067	00000751
007120	00000742
007056	00000739
006984	00000707
007045	00000816
007093	00000712
007130	00000810
007048	00000736
007058	00000736
006987	00000712
007152	00000744
007112	00000733
007064	00000749
007086	00000754
007094	00000710
007154	00000755
006989	00000714
007174	00000748
007027	00000710
007089	00000733
007037	00000813
007142	00000819
006977	00000736
007000	00000745
007113	00000736
007003	00000735
006976	00000815
007040	00000738
007041	00000816
007167	00000733
007035	00000816
006979	00000707
007046	00000816
007020	00000707
006988	00000710
007006	00000738
006986	00000713
007075	00000735
006999	00000732
007155	00000749
007070	00000752
007117	00000730
007100	00000704
007124	00000739
007066	00000755
007078	00000752
007019	00000710
007074	00000752
007111	00000738
007151	00000758
007146	00000758
007015	00000816
006980	00000739
007126	00000733
007018	00000703
007071	00000756
007115	00000733
007095	00000704
007051	00000733
007147	00000748
007085	00000753
007079	00000746
007135	00000819
007024	00000709
007156	00000754
007098	00000706
007062	00000750
007125	00000743
007083	00000754
007092	00000704
007026	00000709
007091	00000712
007073	00000756
007031	00000711
007033	00000820
007104	00000712
007009	00000818
007011	00000813
007122	00000742
007164	00000706
007121	00000740
007023	00000713
007109	00000735
007171	00000752
007103	00000712
007169	00000735
007081	00000748
007136	00000813
007096	00000709
007061	00000754
007069	00000754
007140	00000815
007132	00000815
007043	00000818
007101	00000712
007005	00000735
007054	00000735
007057	00000735
007002	00000737
007148	00000754
007131	00000815
007087	00000754
007150	00000748
007159	00000751
007137	00000810
007128	00000812
007158	00000751
007050	00000735
007141	00000815
007129	00000819
007161	00000810
007036	00000816

[-- Attachment #7: massive-intr-200-300-16-1.txt --]
[-- Type: text/plain, Size: 3024 bytes --]

004379	00000798
004329	00000708
004434	00000753
004338	00000751
004421	00000770
004420	00000749
004354	00000754
004286	00000723
004454	00000798
004464	00000794
004341	00000753
004468	00000754
004357	00000755
004411	00000754
004459	00000797
004316	00000804
004386	00000797
004346	00000758
004452	00000794
004406	00000752
004335	00000752
004407	00000754
004426	00000770
004443	00000753
004437	00000752
004334	00000752
004424	00000752
004369	00000713
004282	00000711
004449	00000793
004430	00000753
004427	00000767
004377	00000795
004340	00000749
004344	00000754
004300	00000704
004293	00000709
004326	00000707
004450	00000796
004289	00000753
004399	00000753
004442	00000749
004383	00000796
004458	00000794
004455	00000798
004376	00000796
004279	00000797
004387	00000797
004422	00000763
004429	00000748
004396	00000707
004290	00000755
004299	00000709
004472	00000755
004327	00000702
004397	00000706
004315	00000793
004298	00000705
004444	00000754
004287	00000767
004339	00000751
004414	00000751
004456	00000797
004301	00000709
004417	00000753
004332	00000744
004419	00000751
004475	00000757
004410	00000753
004362	00000752
004336	00000755
004368	00000713
004391	00000793
004477	00000746
004381	00000796
004404	00000747
004378	00000793
004317	00000795
004370	00000718
004382	00000800
004313	00000710
004296	00000707
004433	00000751
004403	00000751
004408	00000751
004392	00000711
004371	00000708
004375	00000797
004325	00000715
004363	00000755
004305	00000707
004401	00000751
004440	00000707
004359	00000756
004409	00000751
004355	00000749
004312	00000794
004365	00000710
004413	00000759
004295	00000704
004337	00000749
004415	00000755
004453	00000796
004342	00000756
004441	00000758
004402	00000758
004310	00000795
004445	00000751
004314	00000704
004311	00000794
004431	00000750
004457	00000797
004331	00000754
004374	00000800
004330	00000756
004350	00000756
004473	00000753
004319	00000711
004466	00000752
004373	00000796
004285	00000715
004398	00000751
004320	00000712
004447	00000799
004476	00000751
004356	00000750
004423	00000757
004470	00000756
004451	00000795
004345	00000753
004460	00000795
004418	00000774
004307	00000711
004343	00000758
004303	00000713
004435	00000751
004352	00000748
004360	00000751
004353	00000754
004448	00000795
004280	00000757
004388	00000798
004281	00000760
004412	00000754
004302	00000710
004308	00000795
004361	00000709
004394	00000727
004385	00000797
004284	00000764
004367	00000706
004384	00000794
004291	00000713
004436	00000753
004372	00000709
004297	00000716
004333	00000750
004446	00000795
004439	00000757
004416	00000749
004463	00000753
004425	00000764
004471	00000756
004432	00000753
004461	00000795
004351	00000758
004465	00000799
004309	00000796
004324	00000710
004288	00000799
004322	00000710
004366	00000707
004306	00000710
004400	00000753
004283	00000714
004405	00000755
004364	00000752
004348	00000748
004278	00000706
004328	00000709
004318	00000709
004292	00000709
004323	00000709
004389	00000703
004321	00000709
004438	00000753
004469	00000754
004390	00000795
004358	00000751

[-- Attachment #8: massive-intr-200-300-16-2.txt --]
[-- Type: text/plain, Size: 3184 bytes --]

004756	00000741
004639	00000765
004710	00000762
004611	00000729
004678	00000732
004619	00000732
004612	00000729
004624	00000754
004708	00000765
004786	00000755
004692	00000742
004590	00000732
004739	00000767
004755	00000745
004763	00000767
004665	00000759
004610	00000734
004717	00000765
004629	00000750
004625	00000759
004637	00000762
004709	00000759
004603	00000729
004750	00000729
004615	00000729
004788	00000749
004702	00000762
004698	00000752
004744	00000754
004690	00000730
004627	00000756
004634	00000761
004764	00000761
004767	00000761
004604	00000736
004598	00000734
004594	00000736
004666	00000750
004601	00000751
004630	00000747
004772	00000762
004773	00000754
004753	00000750
004685	00000737
004683	00000737
004607	00000736
004614	00000735
004672	00000751
004771	00000766
004769	00000762
004770	00000767
004784	00000754
004759	00000748
004667	00000748
004617	00000731
004748	00000754
004689	00000734
004596	00000737
004677	00000734
004648	00000751
004694	00000737
004663	00000753
004715	00000761
004732	00000754
004602	00000755
004649	00000751
004727	00000751
004591	00000756
004654	00000754
004628	00000751
004735	00000754
004712	00000762
004691	00000734
004600	00000756
004593	00000751
004782	00000751
004731	00000751
004653	00000748
004778	00000751
004655	00000751
004675	00000757
004724	00000751
004696	00000756
004658	00000762
004674	00000751
004644	00000767
004729	00000748
004754	00000748
004734	00000753
004751	00000741
004668	00000751
004679	00000734
004609	00000730
004747	00000767
004618	00000734
004620	00000734
004657	00000748
004726	00000751
004722	00000751
004787	00000757
004718	00000758
004660	00000754
004697	00000754
004659	00000752
004743	00000752
004669	00000755
004643	00000759
004613	00000735
004703	00000767
004776	00000761
004682	00000734
004775	00000764
004662	00000744
004626	00000751
004642	00000764
004687	00000731
004746	00000734
004765	00000763
004707	00000761
004661	00000764
004664	00000764
004705	00000761
004592	00000767
004645	00000748
004621	00000736
004768	00000755
004681	00000734
004632	00000757
004741	00000749
004631	00000753
004758	00000750
004684	00000735
004595	00000735
004706	00000761
004789	00000750
004766	00000768
004740	00000756
004736	00000755
004597	00000769
004606	00000736
004761	00000756
004774	00000763
004701	00000753
004599	00000736
004737	00000753
004749	00000751
004623	00000738
004738	00000761
004760	00000753
004762	00000725
004633	00000767
004695	00000750
004616	00000736
004635	00000766
004783	00000750
004622	00000736
004714	00000768
004781	00000750
004686	00000737
004670	00000750
004733	00000750
004728	00000753
004608	00000733
004671	00000755
004716	00000760
004757	00000753
004713	00000766
004700	00000747
004742	00000756
004641	00000761
004636	00000765
004673	00000738
004640	00000760
004711	00000760
004785	00000751
004721	00000752
004720	00000754
004730	00000747
004725	00000750
004777	00000750
004719	00000748
004745	00000752
004651	00000753
004656	00000753
004650	00000753
004779	00000750
004647	00000750
004723	00000750
004780	00000750
004652	00000747
004646	00000750
004752	00000753
004688	00000737
004676	00000733
004699	00000747
004638	00000766
004693	00000732
004704	00000766
004680	00000736

[-- Attachment #9: massive-intr-200-300-16-3.txt --]
[-- Type: text/plain, Size: 2992 bytes --]

005000	00000758
004992	00000756
005004	00000762
005068	00000758
005045	00000757
004916	00000754
005035	00000752
004905	00000754
005026	00000754
004919	00000755
005028	00000745
004914	00000755
005062	00000759
005002	00000751
004877	00000755
004903	00000754
005034	00000756
004979	00000754
004909	00000755
005027	00000754
004891	00000755
004971	00000755
005060	00000754
005005	00000751
005013	00000756
005032	00000751
004973	00000755
004890	00000749
004946	00000755
004958	00000755
004933	00000761
004935	00000753
004953	00000755
004975	00000754
005020	00000748
004929	00000755
004901	00000755
004915	00000760
004981	00000749
004889	00000755
004949	00000758
004911	00000752
005017	00000753
004888	00000752
004881	00000757
005067	00000754
004908	00000752
004939	00000756
004904	00000749
004924	00000757
004964	00000756
004947	00000768
004986	00000757
005037	00000761
004988	00000753
004934	00000756
004932	00000753
004931	00000751
004876	00000759
004882	00000756
004956	00000763
005070	00000754
004999	00000754
005031	00000758
005051	00000748
004897	00000760
004900	00000749
005014	00000753
004945	00000754
004874	00000755
004893	00000746
005001	00000754
005029	00000757
005046	00000756
005054	00000753
005018	00000748
005059	00000756
004997	00000750
005052	00000753
004926	00000756
005003	00000754
005022	00000760
005024	00000748
005019	00000756
004948	00000758
004966	00000756
004991	00000749
005007	00000751
005015	00000750
004972	00000750
004907	00000760
005009	00000763
004880	00000754
004892	00000757
005047	00000754
004990	00000755
005033	00000753
004959	00000757
004963	00000746
005041	00000758
005023	00000753
005049	00000756
005006	00000754
004998	00000755
004980	00000759
005016	00000749
004894	00000757
004917	00000751
004965	00000750
004912	00000754
005053	00000754
004957	00000754
004921	00000756
004884	00000754
004936	00000757
004950	00000757
005066	00000755
004920	00000759
004902	00000753
004879	00000767
005021	00000750
004910	00000754
004922	00000761
005040	00000767
005042	00000755
005043	00000759
005058	00000755
004887	00000759
004873	00000755
004895	00000754
004906	00000754
004967	00000755
004878	00000751
004977	00000752
004974	00000748
004896	00000757
005008	00000755
004983	00000756
004962	00000758
004955	00000748
005064	00000752
004885	00000755
004995	00000758
004927	00000753
004913	00000757
004954	00000754
004952	00000758
004875	00000753
004925	00000758
004978	00000756
005069	00000756
005030	00000752
004984	00000752
005038	00000756
004996	00000749
004883	00000758
004942	00000752
004923	00000758
004993	00000760
004938	00000761
005025	00000750
004930	00000752
004944	00000757
005011	00000750
004994	00000755
004989	00000752
004968	00000752
004886	00000754
004937	00000755
005012	00000751
004928	00000750
005065	00000752
005048	00000756
004940	00000755
004970	00000750
004976	00000752
004969	00000755
004960	00000756
004951	00000756
005039	00000760
005010	00000755
004872	00000756
005036	00000753
005063	00000752
004987	00000760
004941	00000749
004898	00000761

[-- Attachment #10: massive-intr-200-300-16-4.txt --]
[-- Type: text/plain, Size: 3024 bytes --]

005154	00000794
005194	00000755
005166	00000800
005270	00000760
005125	00000798
005156	00000797
005283	00000800
005158	00000800
005221	00000802
005184	00000755
005281	00000800
005132	00000675
005287	00000800
005278	00000800
005144	00000673
005267	00000752
005135	00000672
005120	00000667
005137	00000671
005251	00000673
005274	00000667
005209	00000673
005118	00000756
005262	00000755
005186	00000751
005249	00000667
005208	00000671
005196	00000752
005276	00000810
005254	00000755
005164	00000805
005195	00000757
005177	00000673
005139	00000672
005189	00000751
005233	00000799
005121	00000670
005204	00000670
005150	00000670
005201	00000670
005222	00000799
005250	00000673
005225	00000797
005178	00000670
005155	00000800
005286	00000796
005246	00000799
005242	00000802
005141	00000666
005292	00000799
005200	00000673
005129	00000676
005294	00000799
005240	00000799
005231	00000802
005172	00000799
005167	00000804
005310	00000754
005252	00000752
005187	00000754
005145	00000675
005289	00000804
005295	00000799
005235	00000799
005191	00000754
005169	00000793
005298	00000802
005170	00000799
005304	00000798
005264	00000757
005152	00000801
005216	00000797
005232	00000793
005243	00000798
005237	00000802
005279	00000802
005117	00000798
005159	00000796
005290	00000803
005285	00000801
005226	00000802
005313	00000754
005280	00000804
005192	00000757
005284	00000796
005149	00000670
005253	00000758
005185	00000753
005130	00000669
005272	00000672
005199	00000759
005151	00000805
005179	00000675
005140	00000669
005198	00000756
005180	00000671
005220	00000802
005269	00000749
005239	00000675
005307	00000754
005183	00000754
005259	00000757
005247	00000805
005275	00000757
005126	00000806
005215	00000802
005122	00000672
005248	00000669
005148	00000675
005277	00000802
005165	00000806
005223	00000799
005124	00000762
005217	00000801
005311	00000754
005142	00000669
005265	00000754
005263	00000750
005160	00000799
005300	00000801
005147	00000672
005175	00000801
005210	00000671
005181	00000672
005266	00000676
005271	00000669
005188	00000751
005203	00000671
005256	00000751
005206	00000672
005197	00000754
005127	00000764
005205	00000675
005161	00000803
005134	00000675
005182	00000754
005245	00000796
005190	00000756
005173	00000789
005282	00000798
005207	00000669
005255	00000745
005301	00000798
005309	00000756
005123	00000758
005143	00000674
005163	00000798
005306	00000752
005229	00000801
005273	00000672
005116	00000805
005261	00000751
005168	00000801
005303	00000796
005162	00000802
005115	00000675
005224	00000804
005308	00000750
005238	00000790
005302	00000798
005213	00000801
005171	00000801
005234	00000798
005257	00000754
005244	00000794
005297	00000798
005314	00000758
005202	00000669
005153	00000801
005291	00000804
005312	00000674
005219	00000799
005299	00000753
005212	00000794
005136	00000669
005214	00000799
005293	00000798
005176	00000796
005236	00000801
005288	00000801
005228	00000806
005174	00000798
005131	00000674
005211	00000801
005157	00000804
005218	00000798
005119	00000674
005138	00000675
005146	00000672

[-- Attachment #11: massive-intr-200-300-16-5.txt --]
[-- Type: text/plain, Size: 3200 bytes --]

005431	00000751
005528	00000717
005461	00000751
005400	00000723
005408	00000751
005527	00000720
005496	00000751
005474	00000723
005546	00000751
005526	00000720
005370	00000752
005536	00000717
005545	00000751
005535	00000720
005453	00000751
005478	00000720
005498	00000751
005468	00000720
005467	00000720
005397	00000720
005413	00000752
005501	00000717
005421	00000749
005497	00000751
005484	00000749
005520	00000720
005499	00000751
005521	00000720
005430	00000749
005477	00000720
005388	00000749
005500	00000717
005389	00000743
005471	00000720
005479	00000746
005490	00000749
005393	00000749
005373	00000749
005542	00000751
005409	00000748
005424	00000746
005403	00000751
005493	00000749
005455	00000748
005494	00000746
005405	00000748
005375	00000749
005417	00000749
005487	00000746
005433	00000748
005418	00000743
005390	00000752
005541	00000745
005416	00000746
005404	00000748
005426	00000752
005551	00000745
005407	00000748
005544	00000748
005491	00000749
005427	00000746
005382	00000746
005486	00000746
005518	00000777
005406	00000751
005387	00000751
005415	00000751
005516	00000780
005419	00000748
005552	00000750
005384	00000751
005549	00000753
005495	00000751
005464	00000750
005383	00000751
005386	00000748
005391	00000751
005379	00000754
005429	00000751
005449	00000753
005456	00000750
005414	00000748
005448	00000780
005457	00000750
005537	00000747
005432	00000750
005476	00000719
005376	00000781
005556	00000725
005462	00000750
005369	00000726
005559	00000777
005434	00000753
005460	00000750
005502	00000719
005561	00000777
005538	00000750
005514	00000777
005452	00000780
005463	00000753
005505	00000777
005392	00000748
005488	00000748
005492	00000751
005368	00000748
005374	00000751
005420	00000751
005543	00000753
005425	00000748
005435	00000722
005458	00000755
005402	00000722
005540	00000752
005399	00000725
005398	00000722
005454	00000750
005459	00000750
005548	00000750
005428	00000748
005534	00000722
005436	00000780
005395	00000722
005506	00000780
005439	00000780
005554	00000747
005446	00000783
005412	00000744
005553	00000722
005539	00000749
005547	00000755
005475	00000722
005550	00000755
005411	00000755
005445	00000782
005444	00000722
005532	00000725
005558	00000783
005372	00000748
005563	00000780
005485	00000751
005385	00000748
005381	00000742
005489	00000748
005423	00000745
005422	00000745
005410	00000755
005530	00000724
005380	00000773
005470	00000721
005465	00000755
005481	00000721
005562	00000782
005524	00000724
005529	00000721
005480	00000724
005557	00000776
005483	00000724
005564	00000779
005394	00000724
005482	00000724
005437	00000779
005466	00000721
005451	00000776
005531	00000724
005438	00000773
005401	00000721
005473	00000724
005469	00000721
005522	00000783
005523	00000782
005443	00000782
005533	00000721
005560	00000779
005513	00000779
005555	00000721
005519	00000782
005511	00000779
005510	00000779
005525	00000721
005517	00000782
005504	00000782
005515	00000779
005441	00000776
005509	00000776
005450	00000782
005567	00000781
005371	00000780
005508	00000782
005377	00000781
005566	00000779
005440	00000779
005507	00000779
005447	00000779
005512	00000776
005503	00000721
005472	00000721
005396	00000721
005378	00000719
005565	00000779
005442	00000782

[-- Attachment #12: massive-intr-200-300-single-0-1.txt --]
[-- Type: text/plain, Size: 3104 bytes --]

003648	00000745
003807	00000743
003802	00000751
003718	00000751
003738	00000747
003790	00000758
003737	00000747
003725	00000749
003653	00000749
003774	00000749
003704	00000747
003674	00000742
003655	00000745
003680	00000751
003741	00000745
003688	00000751
003818	00000751
003679	00000742
003684	00000752
003633	00000750
003787	00000753
003736	00000750
003755	00000747
003687	00000739
003804	00000741
003715	00000757
003627	00000748
003682	00000745
003637	00000763
003641	00000745
003733	00000753
003647	00000744
003623	00000745
003720	00000749
003659	00000745
003667	00000751
003771	00000752
003759	00000748
003817	00000745
003628	00000750
003677	00000742
003803	00000752
003678	00000741
003791	00000749
003779	00000746
003689	00000754
003761	00000750
003739	00000748
003768	00000745
003640	00000759
003652	00000748
003726	00000748
003728	00000754
003806	00000748
003793	00000748
003716	00000745
003696	00000743
003729	00000751
003675	00000750
003756	00000745
003809	00000748
003631	00000749
003723	00000761
003658	00000751
003767	00000745
003781	00000748
003766	00000749
003815	00000749
003676	00000754
003664	00000745
003673	00000744
003681	00000752
003748	00000740
003722	00000753
003644	00000752
003763	00000748
003669	00000750
003789	00000759
003777	00000756
003712	00000749
003649	00000751
003643	00000750
003724	00000753
003780	00000745
003660	00000745
003821	00000746
003770	00000742
003626	00000746
003693	00000748
003782	00000749
003776	00000751
003735	00000748
003812	00000748
003775	00000748
003686	00000750
003683	00000750
003703	00000748
003749	00000749
003638	00000747
003745	00000745
003711	00000741
003706	00000750
003629	00000749
003753	00000746
003765	00000742
003710	00000745
003813	00000749
003799	00000751
003820	00000754
003708	00000745
003690	00000747
003800	00000748
003634	00000749
003646	00000747
003747	00000747
003672	00000747
003639	00000741
003707	00000753
003642	00000746
003656	00000744
003814	00000742
003702	00000759
003746	00000748
003685	00000755
003760	00000749
003751	00000756
003666	00000756
003645	00000750
003750	00000749
003719	00000751
003783	00000747
003792	00000753
003727	00000743
003731	00000747
003754	00000756
003808	00000759
003805	00000750
003671	00000754
003786	00000743
003757	00000747
003811	00000749
003625	00000745
003795	00000757
003661	00000754
003801	00000747
003709	00000752
003798	00000759
003650	00000750
003740	00000752
003794	00000747
003670	00000747
003694	00000752
003714	00000744
003721	00000747
003668	00000743
003784	00000748
003822	00000750
003698	00000747
003636	00000745
003691	00000750
003624	00000746
003772	00000750
003654	00000747
003743	00000749
003788	00000744
003732	00000747
003713	00000747
003717	00000749
003810	00000754
003744	00000747
003730	00000747
003758	00000748
003797	00000758
003632	00000753
003700	00000744
003816	00000750
003764	00000752
003778	00000747
003742	00000748
003695	00000747
003762	00000744
003769	00000748
003630	00000749
003662	00000747
003705	00000750
003697	00000747
003773	00000747
003665	00000745
003752	00000747
003635	00000748
003734	00000747
003663	00000747
003785	00000752
003701	00000750

[-- Attachment #13: massive-intr-200-300-single-0-2.txt --]
[-- Type: text/plain, Size: 3184 bytes --]

003856	00000762
004019	00000759
003943	00000764
003833	00000765
003860	00000763
003941	00000761
003987	00000755
004003	00000761
004011	00000760
003848	00000763
003927	00000752
004016	00000762
004004	00000762
003890	00000764
003959	00000762
003958	00000762
003896	00000759
003824	00000760
003925	00000760
003838	00000763
003924	00000760
003832	00000760
003930	00000760
003994	00000761
003981	00000761
003919	00000760
003909	00000763
003827	00000763
003949	00000759
003857	00000759
004002	00000759
003898	00000759
003918	00000758
003954	00000762
004021	00000759
004008	00000761
004009	00000759
003861	00000759
003946	00000758
003892	00000762
003906	00000759
003953	00000762
003923	00000761
004012	00000759
003895	00000759
003836	00000760
004005	00000758
004018	00000761
003929	00000760
003831	00000760
003865	00000756
003843	00000760
003837	00000760
003899	00000761
003905	00000758
003876	00000759
003957	00000759
003967	00000756
004022	00000761
003887	00000759
003951	00000750
003884	00000759
003932	00000758
003863	00000756
003931	00000760
003992	00000760
003877	00000761
003976	00000765
003886	00000759
003983	00000761
003882	00000756
003835	00000768
003948	00000758
003942	00000760
003917	00000759
003850	00000757
003864	00000761
004006	00000753
003933	00000758
003997	00000758
003989	00000760
003995	00000758
004010	00000758
003950	00000764
003883	00000750
003980	00000755
003937	00000754
003955	00000761
003936	00000755
003866	00000763
003859	00000761
003826	00000766
003970	00000761
003849	00000761
003922	00000757
003879	00000753
003889	00000761
003984	00000755
003870	00000757
003973	00000764
003851	00000761
003858	00000765
004000	00000762
003862	00000764
003974	00000765
003972	00000758
003962	00000762
003999	00000761
003964	00000753
003966	00000764
004007	00000765
003963	00000761
003916	00000762
003934	00000757
004001	00000760
003910	00000762
003915	00000759
003947	00000761
003894	00000761
003971	00000758
003891	00000761
003829	00000761
003846	00000763
003908	00000758
003834	00000765
003828	00000759
003998	00000763
003847	00000762
003912	00000762
004014	00000766
003977	00000757
003901	00000762
003839	00000758
003975	00000760
003872	00000761
003871	00000760
003996	00000755
003986	00000754
003979	00000759
003928	00000760
003920	00000762
003993	00000760
003926	00000762
003852	00000759
003854	00000755
004017	00000763
003902	00000758
003842	00000761
003888	00000760
003991	00000759
003978	00000763
003893	00000761
003990	00000757
003944	00000760
003880	00000760
003969	00000760
003874	00000758
003869	00000757
004013	00000758
003878	00000758
003830	00000760
003961	00000754
004020	00000763
003881	00000764
003825	00000760
003900	00000760
003885	00000759
003903	00000759
004023	00000762
003907	00000756
003867	00000755
003938	00000758
003853	00000756
003940	00000757
003873	00000752
003913	00000758
003855	00000759
003845	00000759
003921	00000758
003982	00000757
003868	00000758
003875	00000760
003935	00000760
003945	00000767
003914	00000759
003841	00000759
003911	00000764
003960	00000763
003952	00000760
003985	00000756
003968	00000763
003988	00000762
003939	00000757
003956	00000763
004015	00000758
003897	00000758
003840	00000756
003904	00000759
003844	00000761

[-- Attachment #14: massive-intr-200-300-single-0-3.txt --]
[-- Type: text/plain, Size: 2928 bytes --]

004151	00000773
004064	00000747
004113	00000770
004125	00000744
004120	00000744
004202	00000744
004204	00000744
004169	00000767
004130	00000744
004167	00000762
004220	00000762
004122	00000744
004217	00000760
004110	00000757
004075	00000763
004079	00000760
004175	00000770
004211	00000744
004198	00000736
004040	00000763
004026	00000745
004029	00000760
004194	00000744
004105	00000763
004207	00000744
004158	00000762
004031	00000763
004189	00000744
004098	00000762
004088	00000762
004170	00000762
004085	00000762
004036	00000762
004191	00000744
004121	00000744
004176	00000767
004197	00000748
004068	00000760
004063	00000760
004101	00000760
004135	00000740
004083	00000759
004223	00000759
004190	00000749
004089	00000761
004043	00000760
004082	00000759
004112	00000760
004070	00000760
004081	00000760
004116	00000775
004033	00000760
004032	00000760
004117	00000775
004078	00000760
004106	00000759
004038	00000762
004077	00000757
004187	00000775
004212	00000759
004146	00000775
004114	00000775
004172	00000775
004086	00000763
004216	00000756
004076	00000757
004180	00000775
004100	00000757
004177	00000775
004094	00000756
004108	00000757
004134	00000738
004118	00000749
004124	00000743
004042	00000762
004200	00000749
004062	00000751
004168	00000770
004148	00000774
004143	00000772
004179	00000775
004053	00000774
004054	00000774
004025	00000757
004129	00000775
004140	00000775
004139	00000772
004066	00000754
004145	00000772
004192	00000743
004133	00000746
004141	00000769
004174	00000772
004149	00000772
004132	00000749
004060	00000747
004071	00000762
004195	00000746
004080	00000754
004206	00000746
004059	00000746
004196	00000746
004142	00000769
004057	00000746
004065	00000746
004193	00000746
004127	00000746
004096	00000764
004205	00000746
004035	00000747
004052	00000777
004182	00000772
004051	00000777
004049	00000777
004157	00000753
004115	00000772
004048	00000771
004107	00000762
004061	00000746
004137	00000776
004153	00000763
004161	00000764
004119	00000746
004090	00000756
004046	00000762
004201	00000743
004163	00000764
004044	00000762
004213	00000761
004047	00000762
004099	00000761
004109	00000762
004178	00000769
004181	00000777
004039	00000756
004041	00000761
004030	00000762
004209	00000743
004072	00000762
004111	00000751
004104	00000763
004152	00000758
004186	00000761
004092	00000758
004102	00000756
004128	00000747
004155	00000763
004073	00000755
004159	00000752
004103	00000759
004027	00000767
004160	00000764
004069	00000762
004097	00000756
004074	00000756
004171	00000761
004067	00000774
004203	00000748
004136	00000748
004199	00000749
004164	00000759
004131	00000748
004222	00000758
004221	00000762
004056	00000771
004224	00000760
004208	00000745
004188	00000771
004156	00000762
004154	00000758
004185	00000774
004058	00000745
004138	00000774
004184	00000774
004162	00000755
004123	00000744
004037	00000756
004214	00000761
004050	00000771
004126	00000748
004215	00000759
004147	00000774
004183	00000772

[-- Attachment #15: massive-intr-200-300-single-0-4.txt --]
[-- Type: text/plain, Size: 3184 bytes --]

004318	00000754
004230	00000754
004306	00000780
004390	00000782
004329	00000754
004342	00000731
004315	00000751
004255	00000754
004229	00000751
004309	00000751
004247	00000751
004250	00000751
004235	00000748
004232	00000751
004314	00000751
004327	00000748
004319	00000751
004242	00000748
004316	00000751
004322	00000748
004331	00000748
004243	00000748
004251	00000751
004234	00000748
004302	00000782
004325	00000745
004324	00000753
004281	00000779
004298	00000779
004391	00000779
004340	00000782
004394	00000779
004337	00000782
004383	00000782
004274	00000776
004305	00000779
004303	00000776
004273	00000776
004332	00000782
004419	00000737
004381	00000779
004411	00000737
004352	00000740
004263	00000740
004382	00000779
004343	00000740
004252	00000742
004276	00000784
004344	00000740
004351	00000740
004339	00000779
004393	00000779
004311	00000750
004313	00000753
004246	00000753
004326	00000753
004249	00000753
004335	00000779
004256	00000750
004240	00000750
004328	00000750
004254	00000750
004375	00000734
004330	00000753
004345	00000737
004286	00000740
004282	00000781
004237	00000739
004409	00000737
004403	00000731
004421	00000734
004377	00000737
004376	00000737
004346	00000734
004257	00000737
004262	00000734
004410	00000737
004404	00000737
004401	00000734
004308	00000734
004283	00000734
004350	00000737
004307	00000734
004386	00000783
004424	00000740
004236	00000739
004361	00000737
004360	00000740
004388	00000778
004413	00000737
004293	00000734
004416	00000737
004422	00000737
004300	00000734
004301	00000734
004294	00000734
004299	00000734
004414	00000737
004296	00000734
004423	00000734
004364	00000734
004291	00000731
004363	00000734
004295	00000737
004341	00000737
004399	00000731
004267	00000728
004259	00000734
004420	00000740
004359	00000737
004356	00000737
004287	00000778
004285	00000737
004397	00000784
004317	00000750
004275	00000781
004244	00000753
004277	00000781
004226	00000750
004270	00000781
004323	00000750
004389	00000781
004253	00000746
004312	00000750
004320	00000750
004333	00000781
004380	00000778
004238	00000750
004387	00000784
004310	00000750
004378	00000784
004241	00000747
004239	00000750
004248	00000747
004396	00000775
004336	00000778
004395	00000781
004279	00000775
004321	00000755
004392	00000781
004228	00000739
004338	00000781
004358	00000739
004245	00000752
004271	00000783
004297	00000739
004231	00000755
004304	00000752
004264	00000736
004384	00000778
004353	00000739
004412	00000739
004385	00000778
004261	00000736
004292	00000739
004407	00000739
004369	00000739
004357	00000736
004367	00000739
004379	00000739
004289	00000736
004425	00000735
004373	00000736
004374	00000739
004355	00000736
004362	00000733
004284	00000739
004227	00000741
004405	00000733
004268	00000739
004258	00000736
004280	00000777
004266	00000736
004272	00000783
004334	00000778
004278	00000783
004368	00000739
004269	00000777
004400	00000739
004372	00000736
004354	00000736
004290	00000733
004288	00000736
004406	00000736
004408	00000736
004348	00000736
004349	00000736
004402	00000739
004347	00000739
004260	00000736
004265	00000733
004370	00000739
004371	00000739
004415	00000739
004365	00000739
004418	00000736
004233	00000741
004417	00000738
004398	00000739

[-- Attachment #16: massive-intr-200-300-single-0-5.txt --]
[-- Type: text/plain, Size: 3152 bytes --]

004532	00000752
004557	00000747
004592	00000731
004545	00000734
004604	00000748
004575	00000768
004540	00000752
004499	00000765
004451	00000749
004556	00000745
004444	00000752
004457	00000749
004493	00000748
004531	00000749
004490	00000751
004538	00000749
004491	00000751
004446	00000749
004577	00000768
004485	00000751
004460	00000752
004626	00000764
004615	00000751
004458	00000749
004561	00000751
004455	00000749
004562	00000748
004442	00000749
004553	00000748
004486	00000751
004605	00000751
004488	00000748
004464	00000749
004606	00000748
004529	00000749
004463	00000749
004550	00000751
004459	00000749
004609	00000748
004536	00000749
004492	00000745
004456	00000746
004618	00000765
004612	00000748
004487	00000745
004469	00000731
004427	00000746
004614	00000748
004452	00000746
004489	00000748
004520	00000749
004500	00000762
004434	00000743
004439	00000743
004449	00000749
004533	00000746
004528	00000743
004576	00000765
004454	00000746
004616	00000762
004438	00000763
004513	00000731
004580	00000765
004595	00000734
004515	00000734
004525	00000734
004586	00000731
004624	00000765
004587	00000731
004509	00000765
004506	00000731
004497	00000765
004521	00000731
004623	00000765
004522	00000731
004569	00000765
004571	00000765
004582	00000765
004503	00000765
004597	00000731
004526	00000733
004620	00000762
004570	00000765
004568	00000765
004473	00000731
004574	00000762
004428	00000766
004566	00000765
004504	00000765
004502	00000762
004567	00000762
004470	00000731
004518	00000734
004590	00000734
004517	00000734
004477	00000734
004583	00000765
004437	00000766
004578	00000762
004496	00000762
004494	00000765
004523	00000728
004598	00000731
004519	00000731
004622	00000762
004471	00000736
004621	00000762
004572	00000762
004498	00000762
004514	00000733
004596	00000725
004565	00000762
004534	00000751
004453	00000748
004481	00000753
004465	00000751
004613	00000750
004527	00000751
004603	00000753
004430	00000751
004482	00000750
004440	00000751
004552	00000753
004448	00000751
004429	00000751
004467	00000748
004484	00000750
004431	00000751
004512	00000750
004539	00000751
004511	00000767
004524	00000733
004450	00000748
004558	00000750
004584	00000753
004555	00000750
004472	00000733
004610	00000750
004611	00000750
004559	00000750
004433	00000748
004495	00000767
004436	00000763
004619	00000767
004560	00000750
004432	00000751
004625	00000764
004443	00000745
004483	00000750
004551	00000747
004549	00000744
004535	00000736
004480	00000747
004601	00000733
004510	00000764
004508	00000764
004594	00000724
004579	00000764
004501	00000766
004507	00000764
004505	00000730
004461	00000748
004530	00000751
004589	00000730
004462	00000748
004542	00000730
004537	00000748
004468	00000748
004573	00000764
004554	00000750
004516	00000736
004474	00000733
004479	00000733
004564	00000750
004548	00000747
004607	00000744
004581	00000733
004476	00000730
004478	00000733
004543	00000733
004602	00000733
004600	00000730
004546	00000736
004593	00000730
004466	00000745
004563	00000747
004608	00000747
004435	00000751
004585	00000733
004541	00000733
004475	00000736
004599	00000733
004547	00000733
004544	00000748
004591	00000733
004617	00000764
004588	00000730

[-- Attachment #17: massive-intr-200-300-single-16-1.txt --]
[-- Type: text/plain, Size: 3088 bytes --]

003655	00000743
003642	00000746
003740	00000743
003734	00000743
003635	00000746
003760	00000743
003772	00000746
003659	00000743
003761	00000743
003670	00000742
003671	00000747
003713	00000745
003708	00000747
003594	00000745
003747	00000742
003751	00000748
003581	00000742
003691	00000745
003583	00000748
003632	00000748
003626	00000742
003623	00000748
003631	00000748
003728	00000751
003620	00000745
003692	00000745
003615	00000742
003698	00000748
003622	00000748
003628	00000747
003683	00000748
003775	00000742
003627	00000745
003582	00000745
003595	00000745
003598	00000745
003593	00000745
003614	00000742
003589	00000745
003630	00000742
003619	00000742
003675	00000751
003610	00000745
003748	00000751
003755	00000748
003710	00000751
003693	00000746
003608	00000751
003705	00000747
003668	00000748
003678	00000748
003765	00000748
003607	00000748
003681	00000748
003649	00000751
003707	00000748
003767	00000748
003639	00000751
003664	00000745
003648	00000748
003717	00000748
003732	00000745
003742	00000745
003611	00000742
003718	00000748
003599	00000751
003720	00000748
003745	00000748
003735	00000748
003744	00000745
003646	00000751
003666	00000748
003730	00000751
003729	00000748
003677	00000742
003731	00000745
003665	00000748
003716	00000742
003584	00000749
003758	00000748
003762	00000748
003684	00000748
003577	00000746
003644	00000745
003737	00000748
003725	00000745
003601	00000748
003637	00000748
003752	00000742
003602	00000748
003634	00000745
003722	00000748
003689	00000747
003600	00000748
003690	00000750
003768	00000748
003597	00000747
003660	00000745
003652	00000745
003596	00000747
003654	00000745
003687	00000747
003771	00000745
003591	00000747
003661	00000742
003629	00000747
003769	00000750
003688	00000750
003694	00000750
003700	00000747
003585	00000749
003727	00000750
003695	00000744
003612	00000747
003699	00000744
003590	00000747
003697	00000747
003576	00000747
003617	00000747
003773	00000750
003616	00000747
003580	00000744
003592	00000747
003696	00000747
003618	00000744
003667	00000747
003621	00000745
003625	00000746
003603	00000750
003754	00000750
003757	00000750
003741	00000747
003636	00000747
003759	00000750
003679	00000742
003749	00000750
003674	00000744
003609	00000750
003633	00000750
003624	00000747
003686	00000749
003604	00000747
003706	00000750
003743	00000750
003714	00000750
003641	00000750
003709	00000749
003766	00000747
003613	00000747
003672	00000747
003680	00000744
003676	00000744
003719	00000747
003669	00000747
003721	00000744
003763	00000747
003588	00000748
003770	00000744
003579	00000751
003738	00000747
003704	00000752
003712	00000747
003653	00000747
003651	00000747
003662	00000747
003774	00000747
003723	00000750
003657	00000747
003587	00000749
003656	00000750
003733	00000750
003650	00000750
003658	00000747
003578	00000750
003753	00000747
003711	00000750
003673	00000750
003682	00000750
003647	00000748
003638	00000750
003643	00000750
003640	00000750
003586	00000751
003746	00000747
003750	00000747
003702	00000750
003703	00000747
003739	00000747
003736	00000750
003724	00000750
003645	00000747
003701	00000750
003605	00000742

[-- Attachment #18: massive-intr-200-300-single-16-2.txt --]
[-- Type: text/plain, Size: 3072 bytes --]

003938	00000750
003899	00000748
003894	00000748
003960	00000748
003949	00000751
003788	00000768
003942	00000745
003895	00000752
003803	00000750
003848	00000747
003874	00000750
003837	00000748
003838	00000752
003906	00000753
003780	00000752
003777	00000750
003903	00000752
003856	00000753
003854	00000756
003846	00000751
003925	00000758
003897	00000752
003941	00000763
003873	00000752
003945	00000753
003883	00000749
003816	00000750
003811	00000779
003863	00000748
003905	00000751
003809	00000752
003794	00000754
003952	00000761
003957	00000755
003861	00000744
003958	00000754
003931	00000763
003881	00000749
003829	00000758
003847	00000749
003922	00000748
003807	00000749
003896	00000742
003973	00000751
003872	00000751
003909	00000751
003951	00000758
003783	00000762
003831	00000758
003857	00000763
003976	00000760
003806	00000762
003923	00000751
003959	00000750
003950	00000749
003804	00000743
003865	00000751
003928	00000757
003904	00000748
003933	00000748
003915	00000753
003805	00000754
003802	00000750
003790	00000749
003962	00000754
003849	00000756
003936	00000756
003778	00000751
003878	00000743
003880	00000752
003830	00000751
003963	00000753
003886	00000740
003851	00000761
003835	00000745
003912	00000751
003914	00000745
003785	00000749
003934	00000750
003796	00000751
003824	00000773
003926	00000776
003901	00000750
003789	00000755
003953	00000751
003882	00000758
003916	00000744
003907	00000752
003970	00000750
003891	00000750
003795	00000754
003823	00000751
003917	00000753
003797	00000754
003969	00000751
003910	00000757
003862	00000762
003964	00000751
003875	00000754
003813	00000768
003828	00000751
003841	00000760
003834	00000752
003843	00000752
003844	00000762
003937	00000750
003911	00000751
003839	00000759
003975	00000748
003940	00000754
003859	00000746
003946	00000755
003850	00000750
003887	00000765
003827	00000753
003939	00000745
003929	00000769
003822	00000747
003799	00000747
003852	00000755
003871	00000751
003893	00000756
003888	00000755
003877	00000752
003930	00000762
003892	00000750
003927	00000760
003932	00000753
003812	00000756
003974	00000750
003858	00000750
003924	00000762
003868	00000750
003918	00000763
003870	00000753
003972	00000761
003819	00000753
003944	00000770
003820	00000754
003792	00000752
003921	00000763
003801	00000754
003853	00000754
003866	00000744
003947	00000751
003955	00000748
003876	00000751
003967	00000760
003814	00000753
003965	00000745
003860	00000753
003908	00000763
003889	00000753
003825	00000767
003913	00000763
003968	00000750
003956	00000753
003793	00000749
003840	00000747
003845	00000752
003919	00000749
003898	00000750
003855	00000743
003885	00000756
003782	00000746
003821	00000751
003784	00000752
003786	00000750
003818	00000757
003900	00000756
003884	00000748
003842	00000766
003833	00000753
003798	00000759
003961	00000747
003948	00000749
003902	00000747
003815	00000751
003966	00000755
003920	00000766
003879	00000752
003943	00000751
003808	00000764
003787	00000769
003954	00000750
003781	00000749
003890	00000765
003810	00000749
003869	00000752
003817	00000747
003800	00000748
003832	00000760

[-- Attachment #19: massive-intr-200-300-single-16-3.txt --]
[-- Type: text/plain, Size: 3200 bytes --]

004003	00000756
003992	00000756
004041	00000753
004070	00000756
004079	00000774
004140	00000774
004062	00000756
003993	00000756
004045	00000745
004016	00000769
003978	00000753
004101	00000771
004026	00000769
004015	00000760
004085	00000752
004046	00000745
004012	00000760
004124	00000771
004074	00000771
004102	00000768
004072	00000761
004021	00000769
004004	00000761
004025	00000769
004151	00000754
004130	00000771
004023	00000774
004137	00000771
004076	00000771
004013	00000760
004156	00000757
004036	00000761
004068	00000758
004134	00000770
004090	00000751
004105	00000768
004014	00000760
004161	00000757
004125	00000768
004153	00000760
004127	00000742
003980	00000746
004115	00000745
004107	00000768
004030	00000757
004031	00000760
003986	00000745
004044	00000745
004118	00000745
004171	00000745
004162	00000742
004096	00000757
004122	00000742
004020	00000774
004114	00000745
004129	00000768
004083	00000760
004136	00000768
004027	00000754
004018	00000766
004069	00000758
003981	00000758
004029	00000754
004042	00000745
004058	00000742
004111	00000745
004177	00000744
004106	00000768
004168	00000757
004024	00000774
003994	00000758
004098	00000768
004048	00000742
004109	00000742
004172	00000742
004011	00000760
004160	00000757
004147	00000760
004164	00000757
003979	00000762
004053	00000742
004110	00000742
004176	00000742
004032	00000755
004055	00000742
003991	00000758
004173	00000742
004049	00000739
004169	00000757
004082	00000757
004040	00000758
004175	00000739
004159	00000757
004035	00000758
004089	00000754
004052	00000750
004097	00000757
003989	00000743
004145	00000739
004067	00000758
004064	00000758
004028	00000776
004001	00000758
004084	00000754
003998	00000755
004009	00000759
004077	00000773
004019	00000771
003982	00000760
004132	00000773
003996	00000755
004131	00000773
003997	00000755
004141	00000770
004022	00000771
004144	00000770
004139	00000773
004108	00000773
004143	00000773
004081	00000773
003995	00000761
004133	00000770
004135	00000767
004104	00000770
004075	00000770
004128	00000770
004080	00000770
004138	00000770
004142	00000770
004099	00000759
004017	00000776
004150	00000747
004103	00000766
004157	00000756
004112	00000747
004120	00000741
004117	00000747
004121	00000741
004073	00000767
004093	00000759
004047	00000741
004061	00000760
004119	00000744
004078	00000775
004149	00000744
004091	00000759
004059	00000760
004113	00000744
004116	00000744
004152	00000759
004043	00000741
004155	00000762
003984	00000748
004154	00000762
004051	00000744
004087	00000759
004054	00000744
004100	00000759
004163	00000759
004166	00000756
004006	00000757
004170	00000741
004146	00000759
004005	00000757
004056	00000744
004008	00000759
004039	00000760
004086	00000759
004065	00000757
004095	00000756
004148	00000741
004038	00000760
004057	00000744
004050	00000741
004126	00000741
003987	00000761
004007	00000754
004174	00000741
004165	00000761
004034	00000757
004092	00000759
004010	00000757
004158	00000759
004088	00000759
003990	00000760
004060	00000754
004000	00000760
003985	00000760
004066	00000754
004071	00000757
004063	00000760
003999	00000757
004033	00000760
004094	00000756
004002	00000757
003988	00000757
004167	00000738
004037	00000757
004123	00000744
003983	00000757

[-- Attachment #20: massive-intr-200-300-single-16-4.txt --]
[-- Type: text/plain, Size: 3104 bytes --]

004354	00000738
004280	00000766
004332	00000752
004269	00000756
004361	00000752
004339	00000765
004211	00000749
004357	00000758
004304	00000754
004296	00000768
004348	00000765
004205	00000756
004229	00000766
004373	00000739
004228	00000771
004208	00000752
004300	00000753
004193	00000757
004214	00000759
004377	00000739
004337	00000770
004376	00000742
004267	00000752
004197	00000755
004284	00000768
004305	00000753
004333	00000765
004237	00000733
004203	00000757
004320	00000742
004224	00000739
004180	00000759
004290	00000764
004331	00000750
004306	00000736
004349	00000768
004326	00000754
004329	00000756
004352	00000755
004371	00000739
004286	00000767
004242	00000739
004252	00000739
004324	00000739
004251	00000741
004272	00000755
004261	00000757
004262	00000752
004369	00000740
004274	00000751
004281	00000754
004276	00000757
004227	00000766
004181	00000739
004360	00000758
004338	00000771
004260	00000754
004212	00000758
004253	00000754
004240	00000763
004303	00000748
004219	00000756
004256	00000750
004359	00000754
004218	00000755
004273	00000757
004192	00000753
004299	00000754
004275	00000751
004279	00000752
004264	00000755
004344	00000769
004340	00000764
004232	00000776
004257	00000754
004367	00000739
004358	00000754
004287	00000770
004322	00000754
004246	00000748
004307	00000745
004255	00000753
004185	00000752
004345	00000764
004202	00000749
004291	00000767
004301	00000751
004351	00000769
004278	00000752
004179	00000758
004325	00000760
004282	00000755
004199	00000752
004263	00000756
004198	00000755
004334	00000769
004363	00000751
004190	00000744
004236	00000766
004213	00000754
004207	00000749
004362	00000759
004230	00000768
004294	00000774
004321	00000741
004239	00000770
004346	00000764
004335	00000764
004217	00000748
004250	00000741
004308	00000735
004378	00000740
004342	00000766
004238	00000774
004313	00000743
004298	00000755
004366	00000741
004341	00000766
004312	00000738
004245	00000743
004249	00000735
004210	00000757
004318	00000740
004323	00000739
004310	00000741
004288	00000767
004368	00000741
004221	00000741
004268	00000751
004225	00000771
004235	00000765
004293	00000766
004372	00000741
004231	00000762
004319	00000741
004233	00000765
004347	00000768
004374	00000738
004295	00000768
004223	00000743
004311	00000734
004222	00000738
004182	00000753
004302	00000753
004343	00000770
004196	00000754
004215	00000756
004350	00000763
004364	00000735
004201	00000754
004365	00000755
004314	00000738
004266	00000757
004189	00000760
004183	00000754
004206	00000754
004336	00000771
004270	00000751
004327	00000753
004285	00000765
004277	00000754
004226	00000767
004243	00000737
004194	00000756
004265	00000754
004316	00000742
004241	00000740
004234	00000768
004258	00000754
004244	00000738
004309	00000738
004220	00000759
004328	00000763
004283	00000766
004187	00000754
004195	00000753
004209	00000754
004204	00000751
004191	00000757
004356	00000762
004188	00000758
004216	00000756
004259	00000752
004184	00000751
004370	00000737
004271	00000754
004292	00000772
004317	00000735
004186	00000754
004289	00000765
004248	00000735
004353	00000753
004297	00000756
004355	00000753

[-- Attachment #21: massive-intr-200-300-single-16-5.txt --]
[-- Type: text/plain, Size: 3120 bytes --]

004424	00000770
004423	00000775
004473	00000764
004462	00000713
004562	00000774
004485	00000713
004553	00000777
004542	00000711
004386	00000757
004463	00000764
004500	00000774
004561	00000781
004407	00000757
004474	00000757
004541	00000709
004435	00000774
004563	00000769
004430	00000775
004413	00000713
004552	00000772
004457	00000708
004390	00000775
004494	00000779
004556	00000781
004447	00000713
004525	00000719
004441	00000787
004395	00000756
004434	00000776
004502	00000774
004467	00000758
004404	00000754
004514	00000788
004478	00000758
004489	00000774
004540	00000717
004381	00000781
004570	00000785
004567	00000786
004577	00000786
004574	00000793
004531	00000714
004548	00000722
004400	00000754
004469	00000754
004505	00000770
004516	00000785
004565	00000785
004488	00000713
004438	00000786
004436	00000770
004448	00000796
004568	00000790
004439	00000787
004517	00000794
004575	00000789
004385	00000755
004415	00000711
004547	00000707
004483	00000750
004389	00000755
004458	00000756
004398	00000756
004456	00000711
004408	00000758
004406	00000761
004551	00000769
004452	00000788
004549	00000776
004498	00000772
004445	00000793
004519	00000787
004506	00000787
004495	00000769
004440	00000793
004428	00000772
004437	00000792
004464	00000748
004518	00000786
004477	00000755
004431	00000769
004421	00000773
004571	00000786
004544	00000702
004412	00000718
004545	00000710
004444	00000787
004401	00000756
004454	00000710
004529	00000715
004486	00000715
004501	00000774
004383	00000756
004394	00000756
004453	00000717
004476	00000760
004387	00000763
004546	00000708
004554	00000767
004528	00000711
004512	00000795
004403	00000760
004481	00000761
004564	00000770
004550	00000773
004558	00000773
004559	00000772
004579	00000792
004515	00000791
004482	00000758
004555	00000774
004402	00000755
004530	00000722
004578	00000784
004572	00000788
004397	00000758
004510	00000785
004418	00000710
004484	00000756
004455	00000716
004470	00000755
004429	00000768
004391	00000787
004533	00000718
004543	00000717
004432	00000775
004446	00000793
004576	00000790
004384	00000756
004521	00000792
004479	00000753
004425	00000772
004427	00000780
004522	00000794
004566	00000786
004393	00000755
004449	00000791
004409	00000762
004450	00000787
004539	00000713
004491	00000774
004443	00000797
004557	00000713
004499	00000776
004382	00000801
004388	00000762
004380	00000756
004560	00000771
004535	00000709
004392	00000753
004573	00000785
004414	00000710
004475	00000755
004422	00000798
004426	00000772
004468	00000760
004493	00000775
004471	00000766
004520	00000797
004480	00000758
004410	00000713
004399	00000755
004507	00000793
004465	00000757
004433	00000774
004569	00000796
004396	00000752
004508	00000789
004497	00000766
004523	00000774
004504	00000771
004451	00000710
004492	00000771
004534	00000710
004526	00000710
004509	00000787
004513	00000789
004460	00000712
004405	00000756
004487	00000752
004442	00000789
004503	00000773
004537	00000712
004511	00000709
004420	00000713
004416	00000709
004417	00000710
004411	00000712
004538	00000723
004536	00000713
004532	00000709
004472	00000761
004466	00000769
004461	00000715
004527	00000712

[-- Attachment #22: 0-1.png --]
[-- Type: image/png, Size: 12388 bytes --]

[-- Attachment #23: 16-1.png --]
[-- Type: image/png, Size: 12546 bytes --]

[-- Attachment #24: single-0-1.png --]
[-- Type: image/png, Size: 10008 bytes --]

[-- Attachment #25: single-16-1.png --]
[-- Type: image/png, Size: 11114 bytes --]

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 00/15] CFS Bandwidth Control V6
  2011-06-16  9:45         ` Hu Tao
@ 2011-06-17  1:22           ` Hidetoshi Seto
  2011-06-17  6:05             ` Hu Tao
  2011-06-17  6:25             ` Paul Turner
  0 siblings, 2 replies; 129+ messages in thread
From: Hidetoshi Seto @ 2011-06-17  1:22 UTC (permalink / raw)
  To: Hu Tao
  Cc: Paul Turner, linux-kernel, Peter Zijlstra, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri

(2011/06/16 18:45), Hu Tao wrote:
> On Thu, Jun 16, 2011 at 09:57:09AM +0900, Hidetoshi Seto wrote:
>> (2011/06/15 17:37), Hu Tao wrote:
>>> On Tue, Jun 14, 2011 at 04:29:49PM +0900, Hidetoshi Seto wrote:
>>>> (2011/06/14 15:58), Hu Tao wrote:
>>>>> Hi,
>>>>>
>>>>> I've run several tests including hackbench, unixbench, massive-intr
>>>>> and kernel building. CPU is Intel(R) Xeon(R) CPU X3430  @ 2.40GHz,
>>>>> 4 cores, and 4G memory.
>>>>>
>>>>> Most of the time the results differ few, but there are problems:
>>>>>
>>>>> 1. unixbench: execl throughput has about 5% drop.
>>>>> 2. unixbench: process creation has about 5% drop.
>>>>> 3. massive-intr: when running 200 processes for 5mins, the number
>>>>>    of loops each process runs differ more than before cfs-bandwidth-v6.
>>>>>
>>>>> The results are attached.
>>>>
>>>> I know the score of unixbench is not so stable that the problem might
>>>> be noises ... but the result of massive-intr is interesting.
>>>> Could you give a try to find which piece (xx/15) in the series cause
>>>> the problems?
>>>
>>> After more tests, I found massive-intr data is not stable, too. Results
>>> are attached. The third number in file name means which patches are
>>> applied, 0 means no patch applied. plot.sh is easy to generate png
>>> files.
>>
>> (Though I don't know what the 16th patch of this series is, anyway)

I see.  It will be replaced by Paul's update.

> the 16th patch is this: https://lkml.org/lkml/2011/5/23/503
> 
>> I see that the results of 15, 15-1 and 15-2 are very different and that
>> 15-2 is similar to without-patch.
>>
>> One concern is whether this unstable of data is really caused by the
>> nature of your test (hardware, massive-intr itself and something running
>> in background etc.) or by a hidden piece in the bandwidth patch set.
>> Did you see "not stable" data when none of patches is applied?
> 
> Yes. 
> 
> But for a five-runs the result seems 'stable'(before patches and after
> patches). I've also run the tests in single mode. results are attached.

(It will be appreciated greatly if you could provide not only raw results
but also your current observation/speculation.)

Well, (to wrap it up,) do you still see the following problem?

>>>>> 3. massive-intr: when running 200 processes for 5mins, the number
>>>>>    of loops each process runs differ more than before cfs-bandwidth-v6.

I think that 5 samples are not enough to draw a conclusion, and that for the
moment it is not worth worrying about.  What do you think?

Even if the problems you pointed out turn out to be gone, I have to say thank
you for taking your time to test this CFS bandwidth patch set.
I'd appreciate it if you could continue your testing, possibly against V7.
(I'm waiting, Paul?)


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 00/15] CFS Bandwidth Control V6
  2011-06-17  1:22           ` Hidetoshi Seto
@ 2011-06-17  6:05             ` Hu Tao
  2011-06-17  6:25             ` Paul Turner
  1 sibling, 0 replies; 129+ messages in thread
From: Hu Tao @ 2011-06-17  6:05 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: Paul Turner, linux-kernel, Peter Zijlstra, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri

On Fri, Jun 17, 2011 at 10:22:51AM +0900, Hidetoshi Seto wrote:
> (2011/06/16 18:45), Hu Tao wrote:
> > On Thu, Jun 16, 2011 at 09:57:09AM +0900, Hidetoshi Seto wrote:
> >> (2011/06/15 17:37), Hu Tao wrote:
> >>> On Tue, Jun 14, 2011 at 04:29:49PM +0900, Hidetoshi Seto wrote:
> >>>> (2011/06/14 15:58), Hu Tao wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I've run several tests including hackbench, unixbench, massive-intr
> >>>>> and kernel building. CPU is Intel(R) Xeon(R) CPU X3430  @ 2.40GHz,
> >>>>> 4 cores, and 4G memory.
> >>>>>
> >>>>> Most of the time the results differ few, but there are problems:
> >>>>>
> >>>>> 1. unixbench: execl throughput has about 5% drop.
> >>>>> 2. unixbench: process creation has about 5% drop.
> >>>>> 3. massive-intr: when running 200 processes for 5mins, the number
> >>>>>    of loops each process runs differ more than before cfs-bandwidth-v6.
> >>>>>
> >>>>> The results are attached.
> >>>>
> >>>> I know the score of unixbench is not so stable that the problem might
> >>>> be noises ... but the result of massive-intr is interesting.
> >>>> Could you give a try to find which piece (xx/15) in the series cause
> >>>> the problems?
> >>>
> >>> After more tests, I found massive-intr data is not stable, too. Results
> >>> are attached. The third number in file name means which patches are
> >>> applied, 0 means no patch applied. plot.sh is easy to generate png
> >>> files.
> >>
> >> (Though I don't know what the 16th patch of this series is, anyway)
> 
> I see.  It will be replaced by Paul's update.
> 
> > the 16th patch is this: https://lkml.org/lkml/2011/5/23/503
> > 
> >> I see that the results of 15, 15-1 and 15-2 are very different and that
> >> 15-2 is similar to without-patch.
> >>
> >> One concern is whether this unstable of data is really caused by the
> >> nature of your test (hardware, massive-intr itself and something running
> >> in background etc.) or by a hidden piece in the bandwidth patch set.
> >> Did you see "not stable" data when none of patches is applied?
> > 
> > Yes. 
> > 
> > But for a five-runs the result seems 'stable'(before patches and after
> > patches). I've also run the tests in single mode. results are attached.
> 
> (It will be appreciated greatly if you could provide not only raw results
> but also your current observation/speculation.)

Sorry I didn't make myself clear.

> 
> Well, (to wrap it up,) do you still see the following problem?
> 
> >>>>> 3. massive-intr: when running 200 processes for 5mins, the number
> >>>>>    of loops each process runs differ more than before cfs-bandwidth-v6.

Even before applying the patches, the numbers differ a lot between
several runs of massive_intr; this is the reason I say the data is not
stable. But treating the results of five runs as a whole, they show some
stability. The results after the patches are similar, and the average
loop counts differ little compared to the results before the patches (compare
0-1.png and 16-1.png in my last mail), so I would say the patches don't
have much impact on interactive processes.
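
For anyone who wants to check those averages against the attached files, a
minimal sketch like the one below would do; it assumes the two-column
"pid<tab>loops" format of the attachments and is not the script actually
used to produce the numbers above.

/* avg_loops.c: summarize massive_intr output (one "pid loops" pair per line).
 * Build: gcc -o avg_loops avg_loops.c -lm
 * Usage: ./avg_loops massive-intr-200-300-single-16-1.txt
 */
#include <stdio.h>
#include <math.h>

int main(int argc, char **argv)
{
        unsigned long pid, loops, n = 0, min = ~0UL, max = 0;
        double sum = 0.0, sumsq = 0.0;
        FILE *f;

        if (argc < 2 || !(f = fopen(argv[1], "r"))) {
                fprintf(stderr, "usage: %s <result file>\n", argv[0]);
                return 1;
        }
        while (fscanf(f, "%lu %lu", &pid, &loops) == 2) {
                sum += loops;
                sumsq += (double)loops * loops;
                if (loops < min)
                        min = loops;
                if (loops > max)
                        max = loops;
                n++;
        }
        fclose(f);
        if (n) {
                double mean = sum / n, var = sumsq / n - mean * mean;
                printf("%lu tasks: mean %.1f, stddev %.1f, min %lu, max %lu\n",
                       n, mean, sqrt(var > 0.0 ? var : 0.0), min, max);
        }
        return 0;
}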

> 
> I think that 5 samples are not enough to draw a conclusion, and that at the
> moment it is inconsiderable.  How do you think?

At least 5 samples reveal something, but if you'd like I can take more
samples.

> 
> Even though pointed problems are gone, I have to say thank you for taking
> your time to test this CFS bandwidth patch set.
> I'd appreciate it if you could continue your test, possibly against V7.
> (I'm waiting, Paul?)
> 
> 
> Thanks,
> H.Seto

Thanks,
-- 
Hu Tao

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 00/15] CFS Bandwidth Control V6
  2011-06-17  1:22           ` Hidetoshi Seto
  2011-06-17  6:05             ` Hu Tao
@ 2011-06-17  6:25             ` Paul Turner
  2011-06-17  9:13               ` Hidetoshi Seto
  1 sibling, 1 reply; 129+ messages in thread
From: Paul Turner @ 2011-06-17  6:25 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: Hu Tao, linux-kernel, Peter Zijlstra, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri

On Thu, Jun 16, 2011 at 6:22 PM, Hidetoshi Seto
<seto.hidetoshi@jp.fujitsu.com> wrote:
> (2011/06/16 18:45), Hu Tao wrote:
>> On Thu, Jun 16, 2011 at 09:57:09AM +0900, Hidetoshi Seto wrote:
>>> (2011/06/15 17:37), Hu Tao wrote:
>>>> On Tue, Jun 14, 2011 at 04:29:49PM +0900, Hidetoshi Seto wrote:
>>>>> (2011/06/14 15:58), Hu Tao wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I've run several tests including hackbench, unixbench, massive-intr
>>>>>> and kernel building. CPU is Intel(R) Xeon(R) CPU X3430  @ 2.40GHz,
>>>>>> 4 cores, and 4G memory.
>>>>>>
>>>>>> Most of the time the results differ few, but there are problems:
>>>>>>
>>>>>> 1. unixbench: execl throughput has about 5% drop.
>>>>>> 2. unixbench: process creation has about 5% drop.
>>>>>> 3. massive-intr: when running 200 processes for 5mins, the number
>>>>>>    of loops each process runs differ more than before cfs-bandwidth-v6.
>>>>>>
>>>>>> The results are attached.
>>>>>
>>>>> I know the score of unixbench is not so stable that the problem might
>>>>> be noises ... but the result of massive-intr is interesting.
>>>>> Could you give a try to find which piece (xx/15) in the series cause
>>>>> the problems?
>>>>
>>>> After more tests, I found massive-intr data is not stable, too. Results
>>>> are attached. The third number in file name means which patches are
>>>> applied, 0 means no patch applied. plot.sh is easy to generate png
>>>> files.
>>>
>>> (Though I don't know what the 16th patch of this series is, anyway)
>
> I see.  It will be replaced by Paul's update.
>
>> the 16th patch is this: https://lkml.org/lkml/2011/5/23/503
>>
>>> I see that the results of 15, 15-1 and 15-2 are very different and that
>>> 15-2 is similar to without-patch.
>>>
>>> One concern is whether this unstable of data is really caused by the
>>> nature of your test (hardware, massive-intr itself and something running
>>> in background etc.) or by a hidden piece in the bandwidth patch set.
>>> Did you see "not stable" data when none of patches is applied?
>>
>> Yes.
>>
>> But for a five-runs the result seems 'stable'(before patches and after
>> patches). I've also run the tests in single mode. results are attached.
>
> (It will be appreciated greatly if you could provide not only raw results
> but also your current observation/speculation.)
>
> Well, (to wrap it up,) do you still see the following problem?
>
>>>>>> 3. massive-intr: when running 200 processes for 5mins, the number
>>>>>>    of loops each process runs differ more than before cfs-bandwidth-v6.
>
> I think that 5 samples are not enough to draw a conclusion, and that at the
> moment it is inconsiderable.  How do you think?
>
> Even though pointed problems are gone, I have to say thank you for taking
> your time to test this CFS bandwidth patch set.
> I'd appreciate it if you could continue your test, possibly against V7.
> (I'm waiting, Paul?)

It should be out in a few hours.  As I was preparing everything today I
realized a latent error existed in the quota expiration path;
specifically that on a wake-up from a sufficiently long sleep we will
see expired quota and have to wait for the timer to recharge bandwidth
before we're actually allowed to run.  Currently munging the results
of fixing that and making sure everything else is correct in the wake
of those changes.
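
To make that failure mode concrete, a rough sketch of the kind of refresh
that avoids the stall might look like the following.  This is purely
illustrative and not the actual fix being prepared; the field names (quota,
runtime, runtime_expires, period) follow the bandwidth structures quoted
elsewhere in this thread.

/* Hypothetical sketch only; the caller is assumed to hold cfs_b->lock. */
static void refresh_runtime_if_stale(struct cfs_bandwidth *cfs_b)
{
        u64 now;

        if (cfs_b->quota == RUNTIME_INF)
                return;

        now = sched_clock_cpu(smp_processor_id());
        /*
         * One or more whole periods elapsed while every task in the group
         * slept: recharge from quota now instead of waiting for the period
         * timer, which is what produces the stall described above.
         */
        if ((s64)(now - cfs_b->runtime_expires) >= 0) {
                cfs_b->runtime = cfs_b->quota;
                cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
        }
}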

>
>
> Thanks,
> H.Seto
>
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 00/15] CFS Bandwidth Control V6
  2011-06-17  6:25             ` Paul Turner
@ 2011-06-17  9:13               ` Hidetoshi Seto
  2011-06-18  0:28                 ` Paul Turner
  0 siblings, 1 reply; 129+ messages in thread
From: Hidetoshi Seto @ 2011-06-17  9:13 UTC (permalink / raw)
  To: Paul Turner
  Cc: Hu Tao, linux-kernel, Peter Zijlstra, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri

(2011/06/17 15:25), Paul Turner wrote:
> It should be out in a few hours.  As I was preparing everything today I
> realized a latent error existed in the quota expiration path;
> specifically that on a wake-up from a sufficiently long sleep we will
> see expired quota and have to wait for the timer to recharge bandwidth
> before we're actually allowed to run.  Currently munging the results
> of fixing that and making sure everything else is correct in the wake
> of those changes.

Thanks!
I'll check it some time early next week.


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [patch 00/15] CFS Bandwidth Control V6
  2011-06-17  9:13               ` Hidetoshi Seto
@ 2011-06-18  0:28                 ` Paul Turner
  0 siblings, 0 replies; 129+ messages in thread
From: Paul Turner @ 2011-06-18  0:28 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: Hu Tao, linux-kernel, Peter Zijlstra, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri

On Fri, Jun 17, 2011 at 2:13 AM, Hidetoshi Seto
<seto.hidetoshi@jp.fujitsu.com> wrote:
> (2011/06/17 15:25), Paul Turner wrote:
>> It should be out in a few hours.  As I was preparing everything today I
>> realized a latent error existed in the quota expiration path;
>> specifically that on a wake-up from a sufficiently long sleep we will
>> see expired quota and have to wait for the timer to recharge bandwidth
>> before we're actually allowed to run.  Currently munging the results
>> of fixing that and making sure everything else is correct in the wake
>> of those changes.
>
> Thanks!
> I'll check it some time early next week.

So it's been a long session of hunting races and implementing the
cleanups above.

Unfortunately as my finger hovered over the send button I realized one
hurdle remains  -- there's a narrow race in the period timer shutdown
path:

- Our period timer can decide that we're going idle as a result of no activity
- Right after it makes this decision a task sneaks in and runs on
another cpu.  We can see the timer has chosen to go idle (it's
possible to  synchronize on that state around the bandwidth lock) but
there's no good way to kick the period timer into an about-face since
it's already active.
- The timing is sufficiently rare and short that we could do something
awful like spin until the timer is complete, but I think it's probably
better to put a kick in one of our already existing recurring paths
such as update_shares (a rough sketch of this follows below).

I'll fix this after some sleep, I'm out of steam for now.
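
For that last point, a purely illustrative sketch of such a kick, checked
from an already recurring path like update_shares(), could look like the
following; the idle/timer_active/period_timer fields and the
start_bandwidth_timer() helper are assumptions here, not code from the
series:

/* Hypothetical sketch: called from a periodic path such as update_shares(). */
static void maybe_kick_period_timer(struct cfs_bandwidth *cfs_b)
{
        raw_spin_lock(&cfs_b->lock);
        /*
         * The period timer decided to go idle, but a task has consumed
         * runtime since that decision; restart the timer instead of
         * spinning until it completes.
         */
        if (cfs_b->idle && !cfs_b->timer_active &&
            cfs_b->runtime < cfs_b->quota) {
                cfs_b->timer_active = 1;
                start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
        }
        raw_spin_unlock(&cfs_b->lock);
}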


>
>
> Thanks,
> H.Seto
>
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned
  2011-06-15  5:37             ` Kamalesh Babulal
@ 2011-06-21 19:48               ` Paul Turner
  2011-06-24 15:05                 ` Kamalesh Babulal
                                   ` (3 more replies)
  0 siblings, 4 replies; 129+ messages in thread
From: Paul Turner @ 2011-06-21 19:48 UTC (permalink / raw)
  To: Kamalesh Babulal
  Cc: Vladimir Davydov, linux-kernel, Peter Zijlstra, Bharata B Rao,
	Dhaval Giani, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Ingo Molnar, Pavel Emelianov

Hi Kamalesh,

Can you see what things look like under v7?

There's been a few improvements to quota re-distribution that should
hopefully help your test case.

The remaining idle% I see on my machines appear to be a product of
load-balancer inefficiency.

Thanks!

- Paul

On Tue, Jun 14, 2011 at 10:37 PM, Kamalesh Babulal
<kamalesh@linux.vnet.ibm.com> wrote:
> * Paul Turner <pjt@google.com> [2011-06-13 17:00:08]:
>
>> Hi Kamalesh.
>>
>> I tried on both friday and again today to reproduce your results
>> without success.  Results are attached below.  The margin of error is
>> the same as the previous (2-level deep case), ~4%.  One minor nit, in
>> your script's input parsing you're calling shift; you don't need to do
>> this with getopts and it will actually lead to arguments being
>> dropped.
>>
>> Are you testing on top of a clean -tip?  Do you have any custom
>> load-balancer or scheduler settings?
>>
>> Thanks,
>>
>> - Paul
>>
>>
>> Hyper-threaded topology:
>> unpinned:
>> Average CPU Idle percentage 38.6333%
>> Bandwidth shared with remaining non-Idle 61.3667%
>>
>> pinned:
>> Average CPU Idle percentage 35.2766%
>> Bandwidth shared with remaining non-Idle 64.7234%
>> (The mask in the "unpinned" case is 0-3,6-9,12-15,18-21 which should
>> mirror your 2 socket 8x2 configuration.)
>>
>> 4-way NUMA topology:
>> unpinned:
>> Average CPU Idle percentage 5.26667%
>> Bandwidth shared with remaining non-Idle 94.73333%
>>
>> pinned:
>> Average CPU Idle percentage 0.242424%
>> Bandwidth shared with remaining non-Idle 99.757576%
>>
> Hi Paul,
>
> I tried tip 919c9baa9 + V6 patchset on 2 socket,quadcore with HT and
> the Idle time seen is ~22% to ~23%. Kernel is not tuned to any custom
> load-balancer/scheduler settings.
>
> unpinned:
> Average CPU Idle percentage 23.5333%
> Bandwidth shared with remaining non-Idle 76.4667%
>
> pinned:
> Average CPU Idle percentage 0%
> Bandwidth shared with remaining non-Idle 100%
>
> Thanks,
>
>  Kamalesh
>>
>>
>>
>> On Fri, Jun 10, 2011 at 11:17 AM, Kamalesh Babulal
>> <kamalesh@linux.vnet.ibm.com> wrote:
>> > * Paul Turner <pjt@google.com> [2011-06-08 20:25:00]:
>> >
>> >> Hi Kamalesh,
>> >>
>> >> I'm unable to reproduce the results you describe.  One possibility is
>> >> load-balancer interaction -- can you describe the topology of the
>> >> platform you are running this on?
>> >>
>> >> On both a straight NUMA topology and a hyper-threaded platform I
>> >> observe a ~4% delta between the pinned and un-pinned cases.
>> >>
>> >> Thanks -- results below,
>> >>
>> >> - Paul
>> >>
>> >>
> (snip)
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned
  2011-06-21 19:48               ` Paul Turner
@ 2011-06-24 15:05                 ` Kamalesh Babulal
  2011-09-07 11:00                 ` Srivatsa Vaddagiri
                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 129+ messages in thread
From: Kamalesh Babulal @ 2011-06-24 15:05 UTC (permalink / raw)
  To: Paul Turner
  Cc: Vladimir Davydov, linux-kernel, Peter Zijlstra, Bharata B Rao,
	Dhaval Giani, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Ingo Molnar, Pavel Emelianov

* Paul Turner <pjt@google.com> [2011-06-21 12:48:17]:

> Hi Kamalesh,
> 
> Can you see what things look like under v7?
> 
> There's been a few improvements to quota re-distribution that should
> hopefully help your test case.
> 
> The remaining idle% I see on my machines appear to be a product of
> load-balancer inefficiency.
> 
> Thanks!
> 
> - Paul
(snip)

Hi Paul,

Sorry for the delay in the response. I tried the V7 patchset on
top of tip. The patchset passed build and boot tests in different
combinations.

I have re-run the tests with a couple of combinations on the same
2 socket, 4 core, HT box. The test data was collected over a 60-second
run.

un-pinned and cpu shares of 1024
-------------------------------------------------
The top five cgroups and their sub-cgroups were assigned the default
cpu shares of 1024.

Average CPU Idle percentage 21.8333%
Bandwidth shared with remaining non-Idle 78.1667%


un-pinned and cpu shares are proportional 
--------------------------------------------------
The top five cgroups were assigned cpu shares proportional to the
number of sub-cgroups under their hierarchy.
For example, cgroup1's share is (1024*2) = 2048 and each of its sub-cgroups
has a share of 1024 (a rough sketch of this setup follows the results below).

Average CPU Idle percentage 14.2%
Bandwidth shared with remaining non-Idle 85.8%


pinned and cpu shares of 1024
--------------------------------------------------
Average CPU Idle percentage 0.0666667%
Bandwidth shared with remaining non-Idle 99.9333333%


pinned and cpu shares are proportional
--------------------------------------------------
Average CPU Idle percentage 0%
Bandwidth shared with remaining non-Idle 100%
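
For reference, here is a rough sketch (not the script actually used for
these runs) of the proportional-shares setup through the cgroup v1 cpu
controller; the /cgroup/cpu mount point and the group names are assumptions:

#include <stdio.h>

/* Write a cpu.shares value for one cgroup under an assumed /cgroup/cpu mount. */
static int set_shares(const char *group, int shares)
{
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), "/cgroup/cpu/%s/cpu.shares", group);
        f = fopen(path, "w");
        if (!f) {
                perror(path);
                return -1;
        }
        fprintf(f, "%d\n", shares);
        return fclose(f);
}

int main(void)
{
        /* Top-level cgroup: 1024 * (number of sub-cgroups), two in this case. */
        set_shares("cgroup1", 1024 * 2);
        /* Each sub-cgroup keeps the default share of 1024. */
        set_shares("cgroup1/sub1", 1024);
        set_shares("cgroup1/sub2", 1024);
        return 0;
}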


I have captured the perf sched stats for every run. Let me
know if that will help. I can mail them to you privately.

Thanks,
Kamalesh.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned
  2011-06-21 19:48               ` Paul Turner
  2011-06-24 15:05                 ` Kamalesh Babulal
@ 2011-09-07 11:00                 ` Srivatsa Vaddagiri
  2011-09-07 14:54                 ` Srivatsa Vaddagiri
  2011-09-07 15:20                 ` CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede Srivatsa Vaddagiri
  3 siblings, 0 replies; 129+ messages in thread
From: Srivatsa Vaddagiri @ 2011-09-07 11:00 UTC (permalink / raw)
  To: Paul Turner
  Cc: Kamalesh Babulal, Vladimir Davydov, linux-kernel, Peter Zijlstra,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

On Tue, Jun 21, 2011 at 12:48:17PM -0700, Paul Turner wrote:
> Hi Kamalesh,
> 
> Can you see what things look like under v7?
> 
> There's been a few improvements to quota re-distribution that should
> hopefully help your test case.
> 
> The remaining idle% I see on my machines appear to be a product of
> load-balancer inefficiency.

which is quite a complex problem to solve! I am still surprised that
we can't handle 32 cpuhogs on a 16-cpu system very easily. The tasks seem to
hop around madly rather than settle down at 2 tasks/cpu. Kamalesh, can you post
the exact count of migrations we saw on the latest tip over a 20-sec window?

Anyway, here's a "hack" to minimize the idle time induced by load-balance
issues. It brings down idle time from 7+% to ~0%. I am not too happy about
this, but I don't see any simpler solution to solving the idle time issue
completely (other than making the load-balancer completely fair!).

--

Fix excessive idle time reported when cgroups are capped.  The patch
introduces the notion of "steal" (or "grace") time which is the surplus
time/bandwidth each cgroup is allowed to consume, subject to a maximum
steal time (sched_cfs_max_steal_time_us). Cgroups are allowed this "steal"
or "grace" time when the lone task running on a cpu is about to be throttled.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

Index: linux-3.1-rc4/include/linux/sched.h
===================================================================
--- linux-3.1-rc4.orig/include/linux/sched.h	2011-09-07 14:57:49.529602231 +0800
+++ linux-3.1-rc4/include/linux/sched.h	2011-09-07 14:58:49.952418107 +0800
@@ -2042,6 +2042,7 @@
 
 #ifdef CONFIG_CFS_BANDWIDTH
 extern unsigned int sysctl_sched_cfs_bandwidth_slice;
+extern unsigned int sysctl_sched_cfs_max_steal_time;
 #endif
 
 #ifdef CONFIG_RT_MUTEXES
Index: linux-3.1-rc4/kernel/sched.c
===================================================================
--- linux-3.1-rc4.orig/kernel/sched.c	2011-09-07 14:57:49.532854588 +0800
+++ linux-3.1-rc4/kernel/sched.c	2011-09-07 14:58:49.955453578 +0800
@@ -254,7 +254,7 @@
 #ifdef CONFIG_CFS_BANDWIDTH
 	raw_spinlock_t lock;
 	ktime_t period;
-	u64 quota, runtime;
+	u64 quota, runtime, steal_time;
 	s64 hierarchal_quota;
 	u64 runtime_expires;
 
Index: linux-3.1-rc4/kernel/sched_fair.c
===================================================================
--- linux-3.1-rc4.orig/kernel/sched_fair.c	2011-09-07 14:57:49.533644483 +0800
+++ linux-3.1-rc4/kernel/sched_fair.c	2011-09-07 15:16:09.338824132 +0800
@@ -101,6 +101,18 @@
  * default: 5 msec, units: microseconds
   */
 unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
+
+/*
+ * "Surplus" quota given to a cgroup to prevent a CPU from becoming idle.
+ *
+ * This would have been unnecessary had the load-balancer been "ideal" in
+ * loading tasks uniformly across all CPUs, which would have allowed
+ * all cgroups to claim their "quota" completely. In the absence of an
+ * "ideal" load-balancer, cgroups are unable to utilize their quota, leading
+ * to unexpected idle time. This knob allows a CPU to keep running a
+ * task beyond its throttled point before becoming idle.
+ */
+unsigned int sysctl_sched_cfs_max_steal_time = 100000UL;
 #endif
 
 static const struct sched_class fair_sched_class;
@@ -1288,6 +1300,11 @@
 	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
 }
 
+static inline u64 sched_cfs_max_steal_time(void)
+{
+	return (u64)sysctl_sched_cfs_max_steal_time * NSEC_PER_USEC;
+}
+
 /*
  * Replenish runtime according to assigned quota and update expiration time.
  * We use sched_clock_cpu directly instead of rq->clock to avoid adding
@@ -1303,6 +1320,7 @@
 		return;
 
 	now = sched_clock_cpu(smp_processor_id());
+	cfs_b->steal_time = 0;
 	cfs_b->runtime = cfs_b->quota;
 	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
 }
@@ -1337,6 +1355,12 @@
 			cfs_b->runtime -= amount;
 			cfs_b->idle = 0;
 		}
+
+		if (!amount && rq_of(cfs_rq)->nr_running == 1 &&
+				cfs_b->steal_time < sched_cfs_max_steal_time()) {
+			amount = min_amount;
+			cfs_b->steal_time += amount;
+		}
 	}
 	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
@@ -1378,7 +1402,8 @@
 	 * whether the global deadline has advanced.
 	 */
 
-	if ((s64)(cfs_rq->runtime_expires - cfs_b->runtime_expires) >= 0) {
+	if ((s64)(cfs_rq->runtime_expires - cfs_b->runtime_expires) >= 0 ||
+		(rq_of(cfs_rq)->nr_running == 1 && cfs_b->steal_time < sched_cfs_max_steal_time())) {
 		/* extend local deadline, drift is bounded above by 2 ticks */
 		cfs_rq->runtime_expires += TICK_NSEC;
 	} else {
Index: linux-3.1-rc4/kernel/sysctl.c
===================================================================
--- linux-3.1-rc4.orig/kernel/sysctl.c	2011-09-07 14:57:49.534454409 +0800
+++ linux-3.1-rc4/kernel/sysctl.c	2011-09-07 14:58:49.958452846 +0800
@@ -388,6 +388,14 @@
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &one,
 	},
+	{
+		.procname	= "sched_cfs_max_steal_time_us",
+		.data		= &sysctl_sched_cfs_max_steal_time,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &one,
+	},
 #endif
 #ifdef CONFIG_PROVE_LOCKING
 	{


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned
  2011-06-21 19:48               ` Paul Turner
  2011-06-24 15:05                 ` Kamalesh Babulal
  2011-09-07 11:00                 ` Srivatsa Vaddagiri
@ 2011-09-07 14:54                 ` Srivatsa Vaddagiri
  2011-09-07 15:20                 ` CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede Srivatsa Vaddagiri
  3 siblings, 0 replies; 129+ messages in thread
From: Srivatsa Vaddagiri @ 2011-09-07 14:54 UTC (permalink / raw)
  To: Paul Turner
  Cc: Kamalesh Babulal, Vladimir Davydov, linux-kernel, Peter Zijlstra,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

* Paul Turner <pjt@google.com> [2011-06-21 12:48:17]:

> Hi Kamalesh,
> 
> Can you see what things look like under v7?
> 
> There's been a few improvements to quota re-distribution that should
> hopefully help your test case.
> 
> The remaining idle% I see on my machines appear to be a product of
> load-balancer inefficiency.

which is quite a complex problem to solve! I am still surprised that
we can't handle 32 cpuhogs on a 16-cpu system very easily. The tasks seem to
hop around madly rather than settle down at 2 tasks/cpu. Kamalesh, can you post
the exact count of migrations we saw on the latest tip over a 20-sec window?

Anyway, here's a "hack" to minimize the idle time induced by load-balance
issues. It brings down idle time from 7+% to ~0%. I am not too happy about
this, but I don't see any simpler solution to solving the idle time issue
completely (other than making the load-balancer completely fair!).

--

Fix excessive idle time reported when cgroups are capped.  The patch
introduces the notion of "steal" (or "grace") time which is the surplus
time/bandwidth each cgroup is allowed to consume, subject to a maximum
steal time (sched_cfs_max_steal_time_us). Cgroups are allowed this "steal"
or "grace" time when the lone task running on a cpu is about to be throttled.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

Index: linux-3.1-rc4/include/linux/sched.h
===================================================================
--- linux-3.1-rc4.orig/include/linux/sched.h	2011-09-07 14:57:49.529602231 +0800
+++ linux-3.1-rc4/include/linux/sched.h	2011-09-07 14:58:49.952418107 +0800
@@ -2042,6 +2042,7 @@ static inline void sched_autogroup_exit(
 
 #ifdef CONFIG_CFS_BANDWIDTH
 extern unsigned int sysctl_sched_cfs_bandwidth_slice;
+extern unsigned int sysctl_sched_cfs_max_steal_time;
 #endif
 
 #ifdef CONFIG_RT_MUTEXES
Index: linux-3.1-rc4/kernel/sched.c
===================================================================
--- linux-3.1-rc4.orig/kernel/sched.c	2011-09-07 14:57:49.532854588 +0800
+++ linux-3.1-rc4/kernel/sched.c	2011-09-07 14:58:49.955453578 +0800
@@ -254,7 +254,7 @@ struct cfs_bandwidth {
 #ifdef CONFIG_CFS_BANDWIDTH
 	raw_spinlock_t lock;
 	ktime_t period;
-	u64 quota, runtime;
+	u64 quota, runtime, steal_time;
 	s64 hierarchal_quota;
 	u64 runtime_expires;
 
Index: linux-3.1-rc4/kernel/sched_fair.c
===================================================================
--- linux-3.1-rc4.orig/kernel/sched_fair.c	2011-09-07 14:57:49.533644483 +0800
+++ linux-3.1-rc4/kernel/sched_fair.c	2011-09-07 15:16:09.338824132 +0800
@@ -101,6 +101,18 @@ unsigned int __read_mostly sysctl_sched_
  * default: 5 msec, units: microseconds
   */
 unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
+
+/*
+ * "Surplus" quota given to a cgroup to prevent a CPU from becoming idle.
+ *
+ * This would have been unnecessary had the load-balancer been "ideal" in
+ * loading tasks uniformly across all CPUs, which would have allowed
+ * all cgroups to claim their "quota" completely. In the absence of an
+ * "ideal" load-balancer, cgroups are unable to utilize their quota, leading
+ * to unexpected idle time. This knob allows a CPU to keep running a
+ * task beyond its throttled point before becoming idle.
+ */
+unsigned int sysctl_sched_cfs_max_steal_time = 100000UL;
 #endif
 
 static const struct sched_class fair_sched_class;
@@ -1288,6 +1300,11 @@ static inline u64 sched_cfs_bandwidth_sl
 	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
 }
 
+static inline u64 sched_cfs_max_steal_time(void)
+{
+	return (u64)sysctl_sched_cfs_max_steal_time * NSEC_PER_USEC;
+}
+
 /*
  * Replenish runtime according to assigned quota and update expiration time.
  * We use sched_clock_cpu directly instead of rq->clock to avoid adding
@@ -1303,6 +1320,7 @@ static void __refill_cfs_bandwidth_runti
 		return;
 
 	now = sched_clock_cpu(smp_processor_id());
+	cfs_b->steal_time = 0;
 	cfs_b->runtime = cfs_b->quota;
 	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
 }
@@ -1337,6 +1355,12 @@ static int assign_cfs_rq_runtime(struct 
 			cfs_b->runtime -= amount;
 			cfs_b->idle = 0;
 		}
+
+		if (!amount && rq_of(cfs_rq)->nr_running == 1 &&
+				cfs_b->steal_time < sched_cfs_max_steal_time()) {
+			amount = min_amount;
+			cfs_b->steal_time += amount;
+		}
 	}
 	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
@@ -1378,7 +1402,8 @@ static void expire_cfs_rq_runtime(struct
 	 * whether the global deadline has advanced.
 	 */
 
-	if ((s64)(cfs_rq->runtime_expires - cfs_b->runtime_expires) >= 0) {
+	if ((s64)(cfs_rq->runtime_expires - cfs_b->runtime_expires) >= 0 ||
+		(rq_of(cfs_rq)->nr_running == 1 && cfs_b->steal_time < sched_cfs_max_steal_time())) {
 		/* extend local deadline, drift is bounded above by 2 ticks */
 		cfs_rq->runtime_expires += TICK_NSEC;
 	} else {
Index: linux-3.1-rc4/kernel/sysctl.c
===================================================================
--- linux-3.1-rc4.orig/kernel/sysctl.c	2011-09-07 14:57:49.534454409 +0800
+++ linux-3.1-rc4/kernel/sysctl.c	2011-09-07 14:58:49.958452846 +0800
@@ -388,6 +388,14 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &one,
 	},
+	{
+		.procname	= "sched_cfs_max_steal_time_us",
+		.data		= &sysctl_sched_cfs_max_steal_time,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &one,
+	},
 #endif
 #ifdef CONFIG_PROVE_LOCKING
 	{

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-06-21 19:48               ` Paul Turner
                                   ` (2 preceding siblings ...)
  2011-09-07 14:54                 ` Srivatsa Vaddagiri
@ 2011-09-07 15:20                 ` Srivatsa Vaddagiri
  2011-09-07 19:22                   ` Peter Zijlstra
  2011-09-16  8:22                   ` Paul Turner
  3 siblings, 2 replies; 129+ messages in thread
From: Srivatsa Vaddagiri @ 2011-09-07 15:20 UTC (permalink / raw)
  To: Paul Turner
  Cc: Kamalesh Babulal, Vladimir Davydov, linux-kernel, Peter Zijlstra,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

[Apologies if you get this email multiple times - there is some email
client config issue that I am fixing up]

* Paul Turner <pjt@google.com> [2011-06-21 12:48:17]:

> Hi Kamalesh,
> 
> Can you see what things look like under v7?
> 
> There's been a few improvements to quota re-distribution that should
> hopefully help your test case.
> 
> The remaining idle% I see on my machines appear to be a product of
> load-balancer inefficiency.

which is quite a complex problem to solve! I am still surprised that
we can't handle 32 cpuhogs on a 16-cpu system very easily. The tasks seem to
hop around madly rather than settle down at 2 tasks/cpu. Kamalesh, can you post
the exact count of migrations we saw on the latest tip over a 20-sec window?

Anyway, here's a "hack" to minimize the idle time induced by load-balance
issues. It brings down idle time from 7+% to ~0%. I am not too happy about
this, but I don't see any simpler solution to solving the idle time issue
completely (other than making the load-balancer completely fair!).

--

Fix excessive idle time reported when cgroups are capped.  The patch
introduces the notion of "steal" (or "grace") time which is the surplus
time/bandwidth each cgroup is allowed to consume, subject to a maximum
steal time (sched_cfs_max_steal_time_us). Cgroups are allowed this "steal"
or "grace" time when the lone task running on a cpu is about to be throttled.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

Index: linux-3.1-rc4/include/linux/sched.h
===================================================================
--- linux-3.1-rc4.orig/include/linux/sched.h	2011-09-07 14:57:49.529602231 +0800
+++ linux-3.1-rc4/include/linux/sched.h	2011-09-07 14:58:49.952418107 +0800
@@ -2042,6 +2042,7 @@ static inline void sched_autogroup_exit(
 
 #ifdef CONFIG_CFS_BANDWIDTH
 extern unsigned int sysctl_sched_cfs_bandwidth_slice;
+extern unsigned int sysctl_sched_cfs_max_steal_time;
 #endif
 
 #ifdef CONFIG_RT_MUTEXES
Index: linux-3.1-rc4/kernel/sched.c
===================================================================
--- linux-3.1-rc4.orig/kernel/sched.c	2011-09-07 14:57:49.532854588 +0800
+++ linux-3.1-rc4/kernel/sched.c	2011-09-07 14:58:49.955453578 +0800
@@ -254,7 +254,7 @@ struct cfs_bandwidth {
 #ifdef CONFIG_CFS_BANDWIDTH
 	raw_spinlock_t lock;
 	ktime_t period;
-	u64 quota, runtime;
+	u64 quota, runtime, steal_time;
 	s64 hierarchal_quota;
 	u64 runtime_expires;
 
Index: linux-3.1-rc4/kernel/sched_fair.c
===================================================================
--- linux-3.1-rc4.orig/kernel/sched_fair.c	2011-09-07 14:57:49.533644483 +0800
+++ linux-3.1-rc4/kernel/sched_fair.c	2011-09-07 15:16:09.338824132 +0800
@@ -101,6 +101,18 @@ unsigned int __read_mostly sysctl_sched_
  * default: 5 msec, units: microseconds
   */
 unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
+
+/*
+ * "Surplus" quota given to a cgroup to prevent a CPU from becoming idle.
+ *
+ * This would have been unnecessary had the load-balancer been "ideal" in
+ * loading tasks uniformly across all CPUs, which would have allowed
+ * all cgroups to claim their "quota" completely. In the absence of an
+ * "ideal" load-balancer, cgroups are unable to utilize their quota, leading
+ * to unexpected idle time. This knob allows a CPU to keep running a
+ * task beyond its throttled point before becoming idle.
+ */
+unsigned int sysctl_sched_cfs_max_steal_time = 100000UL;
 #endif
 
 static const struct sched_class fair_sched_class;
@@ -1288,6 +1300,11 @@ static inline u64 sched_cfs_bandwidth_sl
 	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
 }
 
+static inline u64 sched_cfs_max_steal_time(void)
+{
+	return (u64)sysctl_sched_cfs_max_steal_time * NSEC_PER_USEC;
+}
+
 /*
  * Replenish runtime according to assigned quota and update expiration time.
  * We use sched_clock_cpu directly instead of rq->clock to avoid adding
@@ -1303,6 +1320,7 @@ static void __refill_cfs_bandwidth_runti
 		return;
 
 	now = sched_clock_cpu(smp_processor_id());
+	cfs_b->steal_time = 0;
 	cfs_b->runtime = cfs_b->quota;
 	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
 }
@@ -1337,6 +1355,12 @@ static int assign_cfs_rq_runtime(struct 
 			cfs_b->runtime -= amount;
 			cfs_b->idle = 0;
 		}
+
+		if (!amount && rq_of(cfs_rq)->nr_running == 1 &&
+				cfs_b->steal_time < sched_cfs_max_steal_time()) {
+			amount = min_amount;
+			cfs_b->steal_time += amount;
+		}
 	}
 	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
@@ -1378,7 +1402,8 @@ static void expire_cfs_rq_runtime(struct
 	 * whether the global deadline has advanced.
 	 */
 
-	if ((s64)(cfs_rq->runtime_expires - cfs_b->runtime_expires) >= 0) {
+	if ((s64)(cfs_rq->runtime_expires - cfs_b->runtime_expires) >= 0 ||
+		(rq_of(cfs_rq)->nr_running == 1 && cfs_b->steal_time < sched_cfs_max_steal_time())) {
 		/* extend local deadline, drift is bounded above by 2 ticks */
 		cfs_rq->runtime_expires += TICK_NSEC;
 	} else {
Index: linux-3.1-rc4/kernel/sysctl.c
===================================================================
--- linux-3.1-rc4.orig/kernel/sysctl.c	2011-09-07 14:57:49.534454409 +0800
+++ linux-3.1-rc4/kernel/sysctl.c	2011-09-07 14:58:49.958452846 +0800
@@ -388,6 +388,14 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &one,
 	},
+	{
+		.procname	= "sched_cfs_max_steal_time_us",
+		.data		= &sysctl_sched_cfs_max_steal_time,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &one,
+	},
 #endif
 #ifdef CONFIG_PROVE_LOCKING
 	{

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-07 15:20                 ` CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede Srivatsa Vaddagiri
@ 2011-09-07 19:22                   ` Peter Zijlstra
  2011-09-08 15:15                     ` Srivatsa Vaddagiri
  2011-09-16  8:22                   ` Paul Turner
  1 sibling, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2011-09-07 19:22 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Paul Turner, Kamalesh Babulal, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

On Wed, 2011-09-07 at 20:50 +0530, Srivatsa Vaddagiri wrote:
> 
> Fix excessive idle time reported when cgroups are capped. 

Where from? The whole idea of bandwidth caps is to introduce idle time,
so what's excessive and where does it come from?

>  The patch introduces the notion of "steal" 

The virt folks already claimed steal-time and have it mean something
entirely different. You get to pick a new name.

> (or "grace") time which is the surplus
> time/bandwidth each cgroup is allowed to consume, subject to a maximum
> steal time (sched_cfs_max_steal_time_us). Cgroups are allowed this "steal"
> or "grace" time when the lone task running on a cpu is about to be throttled.

Ok, so this is a solution to an unstated problem. Why is it a good
solution?

Also, another tunable, yay!

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-07 19:22                   ` Peter Zijlstra
@ 2011-09-08 15:15                     ` Srivatsa Vaddagiri
  2011-09-09 12:31                       ` Peter Zijlstra
  0 siblings, 1 reply; 129+ messages in thread
From: Srivatsa Vaddagiri @ 2011-09-08 15:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, Kamalesh Babulal, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-07 21:22:22]:

> On Wed, 2011-09-07 at 20:50 +0530, Srivatsa Vaddagiri wrote:
> > 
> > Fix excessive idle time reported when cgroups are capped. 
> 
> Where from? The whole idea of bandwidth caps is to introduce idle time,
> so what's excessive and where does it come from?

We have set up cgroups and their hard limits so that in theory they should
consume the entire capacity available on the machine, leading to 0% idle time.
That's not what we see. A more detailed description of the setup and the problem
is here:

https://lkml.org/lkml/2011/6/7/352

but to quickly summarize it, the machine and the test-case are as below:

Machine : 16-cpus (2 Quad-core w/ HT enabled)
Cgroups : 5 in number (C1-C5), each having {2, 2, 4, 8, 16} tasks respectively.
	  Further, each task is placed in its own (sub-)cgroup with 
	  a capped usage of 50% CPU.

	/C1/C1_1/Task1	-> capped at 50% cpu usage
	/C1/C1_2/Task2	-> capped at 50% cpu usage
	/C2/C2_1/Task3	-> capped at 50% cpu usage
	/C2/C2_2/Task3	-> capped at 50% cpu usage
	/C3/C3_1/Task4	-> capped at 50% cpu usage
	/C3/C3_2/Task4	-> capped at 50% cpu usage
	/C3/C3_3/Task4	-> capped at 50% cpu usage
	/C3/C3_4/Task4	-> capped at 50% cpu usage
	...
	/C5/C5_16/Task32 -> capped at 50% cpu usage

So we have 32 tasks, each capped at 50% CPU usage, run on a 16-CPU
system. One can expect 0% idle time in this scenario, which was found
not to be the case. With early versions of cfs hardlimits, up to ~20%
idle time was seen, though with the current version in tip, we see up to
~10% idle time (when cfs.period = 100ms), which goes down to ~5% when
cfs.period is set to 500ms.
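
As a concrete example of those caps (illustrative only, not the actual setup
script), each task's sub-cgroup is limited to 50% of a CPU by giving it a
quota of half its period through the cpu.cfs_period_us and cpu.cfs_quota_us
files; the /cgroup/cpu mount point below is an assumption:

#include <stdio.h>

/* Cap one sub-cgroup (e.g. C1/C1_1) at 50% of a CPU: quota = period / 2. */
static int cap_half_cpu(const char *group)
{
        const char *file[]  = { "cpu.cfs_period_us", "cpu.cfs_quota_us" };
        const char *value[] = { "100000",            "50000" };
        char path[256];
        FILE *f;
        int i;

        for (i = 0; i < 2; i++) {
                snprintf(path, sizeof(path), "/cgroup/cpu/%s/%s", group, file[i]);
                f = fopen(path, "w");
                if (!f) {
                        perror(path);
                        return -1;
                }
                fprintf(f, "%s\n", value[i]);
                fclose(f);
        }
        return 0;
}

int main(void)
{
        return cap_half_cpu("C1/C1_1");
}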

From what I could find out, the "excess" idle time crops up because the
load-balancer is not perfect. For example, there are instances when a
CPU has just 1 task on its runqueue (rather than the ideal number of 2
tasks/cpu). When that lone task exceeds its 50% limit, the cpu is forced to
become idle.

> >  The patch introduces the notion of "steal" 
> 
> The virt folks already claimed steal-time and have it mean something
> entirely different. You get to pick a new name.

grace time?

> > (or "grace") time which is the surplus
> > time/bandwidth each cgroup is allowed to consume, subject to a maximum
> > steal time (sched_cfs_max_steal_time_us). Cgroups are allowed this "steal"
> > or "grace" time when the lone task running on a cpu is about to be throttled.
> 
> Ok, so this is a solution to an unstated problem. Why is it a good
> solution?

I am not sure if there are any "good" solutions to this problem! One
possibility is to make the idle load balancer more aggressive in
pulling tasks across sched-domain boundaries, i.e. when a CPU becomes idle
(after a task got throttled) and invokes the idle load balancer, it
should try "harder" at pulling a task from far-off cpus (across
package/node boundaries)?

> Also, another tunable, yay!

- vatsa

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-08 15:15                     ` Srivatsa Vaddagiri
@ 2011-09-09 12:31                       ` Peter Zijlstra
  2011-09-09 13:26                         ` Srivatsa Vaddagiri
  2011-09-12 10:17                         ` Srivatsa Vaddagiri
  0 siblings, 2 replies; 129+ messages in thread
From: Peter Zijlstra @ 2011-09-09 12:31 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Paul Turner, Kamalesh Babulal, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

On Thu, 2011-09-08 at 20:45 +0530, Srivatsa Vaddagiri wrote:
> * Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-07 21:22:22]:
> 
> > On Wed, 2011-09-07 at 20:50 +0530, Srivatsa Vaddagiri wrote:
> > > 
> > > Fix excessive idle time reported when cgroups are capped. 
> > 
> > Where from? The whole idea of bandwidth caps is to introduce idle time,
> > so what's excessive and where does it come from?
> 
> We have setup cgroups and their hard limits so that in theory they should
> consume the entire capacity available on machine, leading to 0% idle time.
> That's not what we see. A more detailed description of the setup and the problem
> is here:
> 
> https://lkml.org/lkml/2011/6/7/352

That's frigging irrelevant isn't it? A patch should contain its own
justification.

> Machine : 16-cpus (2 Quad-core w/ HT enabled)
> Cgroups : 5 in number (C1-C5), each having {2, 2, 4, 8, 16} tasks respectively.
>           Further, each task is placed in its own (sub-)cgroup with 
>           a capped usage of 50% CPU.

So that's loads: {512,512}, {512,512}, {256,256,256,256}, {128,..} and {64,..}

And you expect that to be balanced perfectly when a bandwidth cap is
introduced, I think you need some expectation adjustments.


> From what I could find out, the "excess" idle time crops up because
> load-balancer is not perfect. For example, there are instances when a
> CPU has just 1 task on its runqueue (rather than the ideal number of 2
> tasks/cpu). When that lone task exceeds its 50% limit, cpu is forced to
> become idle.

So try and cure that instead of frobbing crap like this.

> > >  The patch introduces the notion of "steal" 
> > 
> > The virt folks already claimed steal-time and have it mean something
> > entirely different. You get to pick a new name.
> 
> grace time?

Well, ideally this frobbing of symptoms instead of fixing of causes
isn't going to happen at all, its just retarded. And it most certainly
shouldn't be the first approach to any problem. 


> > Ok, so this is a solution to an unstated problem. Why is it a good
> > solution?
> 
> I am not sure if there are any "good" solutions to this problem! 

Good, so then we're not going to do it, full stop.

> One
> possibility is to make the idle load balancer become aggressive in
> pulling tasks across sched-domain boundaries i.e when a CPU becomes idle
> (after a task got throttled) and invokes the idle load balancer, it
> should try "harder" at pulling a task from far-off cpus (across
> package/node boundaries)?

How about we just live with it? You set up a nearly impossible
(non-scalable) problem and then complain we don't do well. Tough fscking
luck, don't do that.

I mean, I'm all for improving things, but your frobbing here is just not
going to happen, most certainly not without very _very_ good
justification, and your patch frankly didn't have any.

Furthermore your patch frobs the bandwidth accounting but doesn't spend
a single word explaining how, if at all, it keeps the accounting a 0-sum
game.

Seriously, you suck, you patch sucks and your method sucks. Go away.


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-09 12:31                       ` Peter Zijlstra
@ 2011-09-09 13:26                         ` Srivatsa Vaddagiri
  2011-09-12 10:17                         ` Srivatsa Vaddagiri
  1 sibling, 0 replies; 129+ messages in thread
From: Srivatsa Vaddagiri @ 2011-09-09 13:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, Kamalesh Babulal, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-09 14:31:02]:

> > We have setup cgroups and their hard limits so that in theory they should
> > consume the entire capacity available on machine, leading to 0% idle time.
> > That's not what we see. A more detailed description of the setup and the problem
> > is here:
> > 
> > https://lkml.org/lkml/2011/6/7/352
> 
> That's frigging irrelevant isn't it? A patch should contain its own
> justification.

Agreed my bad. I was (wrongly) setting the problem context by posting
this in response to Paul's email where the problem was discussed.

> > One
> > possibility is to make the idle load balancer become aggressive in
> > pulling tasks across sched-domain boundaries i.e when a CPU becomes idle
> > (after a task got throttled) and invokes the idle load balancer, it
> > should try "harder" at pulling a task from far-off cpus (across
> > package/node boundaries)?
> 
> How about we just live with it?

I think we will, unless the load balancer can be improved (which seems unlikely 
to me :-()

- vatsa


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-09 12:31                       ` Peter Zijlstra
  2011-09-09 13:26                         ` Srivatsa Vaddagiri
@ 2011-09-12 10:17                         ` Srivatsa Vaddagiri
  2011-09-12 12:35                           ` Peter Zijlstra
  1 sibling, 1 reply; 129+ messages in thread
From: Srivatsa Vaddagiri @ 2011-09-12 10:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, Kamalesh Babulal, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-09 14:31:02]:

> > Machine : 16-cpus (2 Quad-core w/ HT enabled)
> > Cgroups : 5 in number (C1-C5), each having {2, 2, 4, 8, 16} tasks respectively.
> >           Further, each task is placed in its own (sub-)cgroup with 
> >           a capped usage of 50% CPU.
> 
> So that's loads: {512,512}, {512,512}, {256,256,256,256}, {128,..} and {64,..}

Yes, with the default shares of 1024 for each cgroup.

FWIW we did also try setting shares for each cgroup proportional to number of 
tasks it has. For ex: C1's shares = 1024 * 2 = 2048, C2 = 1024 * 2 = 2048, 
C3 = 4 * 1024 = 4096 etc. while /C1/C1_1, /C1/C1_2, .../C5/C5_16/ shares were 
left at default of 1024 (as those sub-cgroups contain only one task). 
 
That does help reduce idle time by almost 50% (from 15-20% -> 6-9%)
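
As a sketch, that proportional-shares variant amounts to the following
(cpu controller assumed mounted at /cgroup/cpu; the sub-cgroups keep the
default cpu.shares of 1024):

	# echo 2048 > /cgroup/cpu/C1/cpu.shares		# 2 tasks x 1024
	# echo 2048 > /cgroup/cpu/C2/cpu.shares		# 2 tasks x 1024
	# echo 4096 > /cgroup/cpu/C3/cpu.shares		# 4 tasks x 1024
	# echo 8192 > /cgroup/cpu/C4/cpu.shares		# 8 tasks x 1024
	# echo 16384 > /cgroup/cpu/C5/cpu.shares	# 16 tasks x 1024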

- vatsa


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-12 10:17                         ` Srivatsa Vaddagiri
@ 2011-09-12 12:35                           ` Peter Zijlstra
  2011-09-13  4:15                             ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2011-09-12 12:35 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Paul Turner, Kamalesh Babulal, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

On Mon, 2011-09-12 at 15:47 +0530, Srivatsa Vaddagiri wrote:
> * Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-09 14:31:02]:
> 
> > > Machine : 16-cpus (2 Quad-core w/ HT enabled)
> > > Cgroups : 5 in number (C1-C5), each having {2, 2, 4, 8, 16} tasks respectively.
> > >           Further, each task is placed in its own (sub-)cgroup with 
> > >           a capped usage of 50% CPU.
> > 
> > So that's loads: {512,512}, {512,512}, {256,256,256,256}, {128,..} and {64,..}
> 
> Yes, with the default shares of 1024 for each cgroup.
> 
> FWIW we did also try setting shares for each cgroup proportional to number of 
> tasks it has. For ex: C1's shares = 1024 * 2 = 2048, C2 = 1024 * 2 = 2048, 
> C3 = 4 * 1024 = 4096 etc. while /C1/C1_1, /C1/C1_2, .../C5/C5_16/ shares were 
> left at default of 1024 (as those sub-cgroups contain only one task). 
>  
> That does help reduce idle time by almost 50% (from 15-20% -> 6-9%)

Of course it does.. and I bet you can improve that slightly if you
manage to fix some of the numerical nightmares that live in the cgroup
load-balancer (Paul, care to share your WIP?)

But the initial scenario is a complete and utter fail, its impossible to
schedule that sanely. Its an infeasible weight scenario with more tasks
than cpus, and the added bandwidth constraints just keep changing the
set requiring endless migrations to try and keep utilization from
tanking.

Really, classic fail.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-12 12:35                           ` Peter Zijlstra
@ 2011-09-13  4:15                             ` Srivatsa Vaddagiri
  2011-09-13  5:03                               ` Srivatsa Vaddagiri
  2011-09-13 14:19                               ` Peter Zijlstra
  0 siblings, 2 replies; 129+ messages in thread
From: Srivatsa Vaddagiri @ 2011-09-13  4:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, Kamalesh Babulal, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-12 14:35:43]:

> Of course it does.. and I bet you can improve that slightly if you
> manage to fix some of the numerical nightmares that live in the cgroup
> load-balancer (Paul, care to share your WIP?)

Booting with "nohz=off" also helps significantly.

With nohz=on, average idle time (over 1 min) is 10.3%
With nohz=off, average idle time (over 1 min) is 3.9%
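
For anyone reproducing this, nohz is a kernel boot parameter; a quick sketch
of how to confirm what the running kernel booted with:

	# grep -o 'nohz=[a-z]*' /proc/cmdline || echo "nohz not set (kernel default)"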

- vatsa

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-13  4:15                             ` Srivatsa Vaddagiri
@ 2011-09-13  5:03                               ` Srivatsa Vaddagiri
  2011-09-13  5:05                                 ` Srivatsa Vaddagiri
  2011-09-13  9:39                                 ` Peter Zijlstra
  2011-09-13 14:19                               ` Peter Zijlstra
  1 sibling, 2 replies; 129+ messages in thread
From: Srivatsa Vaddagiri @ 2011-09-13  5:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, Kamalesh Babulal, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> [2011-09-13 09:45:45]:

> * Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-12 14:35:43]:
> 
> > Of course it does.. and I bet you can improve that slightly if you
> > manage to fix some of the numerical nightmares that live in the cgroup
> > load-balancer (Paul, care to share your WIP?)
> 
> Booting with "nohz=off" also helps significantly.
> 
> With nohz=on, average idle time (over 1 min) is 10.3%
> With nohz=off, average idle time (over 1 min) is 3.9%

Tuning min_interval and max_interval of various sched_domains to 1 [a]
and also setting sched_cfs_bandwidth_slice_us to 500 does cut down idle
time further to 2.7% ..

This is perhaps not optimal (as it may lead to more lock contentions), but 
something to note for those who care for both capping and utilization in
equal measure!

- vatsa


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-13  5:03                               ` Srivatsa Vaddagiri
@ 2011-09-13  5:05                                 ` Srivatsa Vaddagiri
  2011-09-13  9:39                                 ` Peter Zijlstra
  1 sibling, 0 replies; 129+ messages in thread
From: Srivatsa Vaddagiri @ 2011-09-13  5:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, Kamalesh Babulal, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> [2011-09-13 10:33:06]:

> * Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> [2011-09-13 09:45:45]:
> 
> > * Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-12 14:35:43]:
> > 
> > > Of course it does.. and I bet you can improve that slightly if you
> > > manage to fix some of the numerical nightmares that live in the cgroup
> > > load-balancer (Paul, care to share your WIP?)
> > 
> > Booting with "nohz=off" also helps significantly.
> > 
> > With nohz=on, average idle time (over 1 min) is 10.3%
> > With nohz=off, average idle time (over 1 min) is 3.9%
> 
> Tuning min_interval and max_interval of various sched_domains to 1 [a]

Forgot to add footnote (a) earlier. min and max_interval tuned as
below:

	# cd /proc/sys/kernel/sched_domain
	# for i in `find . -name min_interval`; do echo 1 > $i; done
	# for i in `find . -name max_interval`; do echo 1 > $i; done
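
The bandwidth slice mentioned earlier was reduced via its sysctl; as a
sketch (value in microseconds, default 5000):

	# echo 500 > /proc/sys/kernel/sched_cfs_bandwidth_slice_us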

- vatsa

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-13  5:03                               ` Srivatsa Vaddagiri
  2011-09-13  5:05                                 ` Srivatsa Vaddagiri
@ 2011-09-13  9:39                                 ` Peter Zijlstra
  2011-09-13 11:28                                   ` Srivatsa Vaddagiri
  1 sibling, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2011-09-13  9:39 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Paul Turner, Kamalesh Babulal, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

On Tue, 2011-09-13 at 10:33 +0530, Srivatsa Vaddagiri wrote:
> 
> This is perhaps not optimal (as it may lead to more lock contentions), but 
> something to note for those who care for both capping and utilization in
> equal measure!

You meant lock inversion, which leads to more idle time :-)

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-13  9:39                                 ` Peter Zijlstra
@ 2011-09-13 11:28                                   ` Srivatsa Vaddagiri
  2011-09-13 14:07                                     ` Peter Zijlstra
  0 siblings, 1 reply; 129+ messages in thread
From: Srivatsa Vaddagiri @ 2011-09-13 11:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, Kamalesh Babulal, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-13 11:39:48]:

> On Tue, 2011-09-13 at 10:33 +0530, Srivatsa Vaddagiri wrote:
> > 
> > This is perhaps not optimal (as it may lead to more lock contentions), but 
> > something to note for those who care for both capping and utilization in
> > equal measure!
> 
> You meant lock inversion, which leads to more idle time :-)

I think 'cfs_b->lock' contention would go up significantly when reducing
sysctl_sched_cfs_bandwidth_slice, while for something like 'balancing' lock 
(taken with SD_SERIALIZE set and more frequently when tuning down
max_interval?), yes it may increase idle time! Did you have any other
lock in mind when speaking of inversion?

- vatsa

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-13 11:28                                   ` Srivatsa Vaddagiri
@ 2011-09-13 14:07                                     ` Peter Zijlstra
  2011-09-13 16:21                                       ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2011-09-13 14:07 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Paul Turner, Kamalesh Babulal, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

On Tue, 2011-09-13 at 16:58 +0530, Srivatsa Vaddagiri wrote:
> * Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-13 11:39:48]:
> 
> > On Tue, 2011-09-13 at 10:33 +0530, Srivatsa Vaddagiri wrote:
> > > 
> > > This is perhaps not optimal (as it may lead to more lock contentions), but 
> > > something to note for those who care for both capping and utilization in
> > > equal measure!
> > 
> > You meant lock inversion, which leads to more idle time :-)
> 
> I think 'cfs_b->lock' contention would go up significantly when reducing
> sysctl_sched_cfs_bandwidth_slice, while for something like 'balancing' lock 
> (taken with SD_SERIALIZE set and more frequently when tuning down
> max_interval?), yes it may increase idle time! Did you have any other
> lock in mind when speaking of inversion?

I can't read it seems.. I thought you were talking about increasing the
period, which increases the time you force a task to sleep that's
holding locks etc..

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-13  4:15                             ` Srivatsa Vaddagiri
  2011-09-13  5:03                               ` Srivatsa Vaddagiri
@ 2011-09-13 14:19                               ` Peter Zijlstra
  2011-09-13 18:01                                 ` Srivatsa Vaddagiri
  1 sibling, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2011-09-13 14:19 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Paul Turner, Kamalesh Babulal, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov, Thomas Gleixner

On Tue, 2011-09-13 at 09:45 +0530, Srivatsa Vaddagiri wrote:
> * Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-12 14:35:43]:
> 
> > Of course it does.. and I bet you can improve that slightly if you
> > manage to fix some of the numerical nightmares that live in the cgroup
> > load-balancer (Paul, care to share your WIP?)
> 
> Booting with "nohz=off" also helps significantly.
> 
> With nohz=on, average idle time (over 1 min) is 10.3%
> With nohz=off, average idle time (over 1 min) is 3.9%

So we should put the cpufreq/idle governor into the nohz/idle path, it
already tries to predict the idle duration in order to pick a C state,
that same prediction should be used to determine if stopping the tick is
worth it.

This has come up previously, but I can't quite recollect in what
context.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-13 14:07                                     ` Peter Zijlstra
@ 2011-09-13 16:21                                       ` Srivatsa Vaddagiri
  2011-09-13 16:33                                         ` Peter Zijlstra
  2011-09-13 16:36                                         ` Peter Zijlstra
  0 siblings, 2 replies; 129+ messages in thread
From: Srivatsa Vaddagiri @ 2011-09-13 16:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, Kamalesh Babulal, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-13 16:07:28]:

> > > > This is perhaps not optimal (as it may lead to more lock contentions), but 
> > > > something to note for those who care for both capping and utilization in
> > > > equal measure!
> > > 
> > > You meant lock inversion, which leads to more idle time :-)
> > 
> > I think 'cfs_b->lock' contention would go up significantly when reducing
> > sysctl_sched_cfs_bandwidth_slice, while for something like 'balancing' lock 
> > (taken with SD_SERIALIZE set and more frequently when tuning down
> > max_interval?), yes it may increase idle time! Did you have any other
> > lock in mind when speaking of inversion?
> 
> I can't read it seems.. I thought you were talking about increasing the
> period,

Mm ..I brought up the increased lock contention with reference to this
experimental result that I posted earlier:

  > Tuning min_interval and max_interval of various sched_domains to 1
  > and also setting sched_cfs_bandwidth_slice_us to 500 does cut down idle
  > time further to 2.7%

Value of sched_cfs_bandwidth_slice_us was reduced from default of 5000us
to 500us, which (along with reduction of min/max interval) helped cut down
idle time further (3.9% -> 2.7%). I was commenting that this may not necessarily
be optimal (as for example low 'sched_cfs_bandwidth_slice_us' could result
in all cpus contending for cfs_b->lock very frequently).
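
As a rough sketch of why the lock gets hotter (pure arithmetic, using this
test's quota/period; the variable names are only illustrative):

	# quota_us=50000; period_us=100000
	# for slice_us in 5000 500; do echo "slice ${slice_us}us -> up to $((quota_us / slice_us)) cfs_b->lock acquisitions per cgroup per ${period_us}us period"; done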

> which increases the time you force a task to sleep that's holding locks etc..

Ideally all tasks should get capped at the same time, given that there is
a global pool from which everyone pulls bandwidth? So while one vcpu/task
(holding a lock) gets capped, other vcpus/tasks (that may want the same lock)
should ideally not be running for long after that, avoiding lock inversion
related problems you point out.

I guess that we may still run into that with current implementation ..
Basically global pool may have zero runtime left for current period,
forcing a vcpu/task to be throttled, while there is surplus runtime in
per-cpu pools, allowing some sibling vcpus/tasks to run for wee bit
more, leading to lock-inversion related problems (more idling). That
makes me think we can improve directed yield->capping interaction.
Essentially when the target task of directed yield is capped, can the
"yielding" task donate some of its bandwidth?

- vatsa

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-13 16:21                                       ` Srivatsa Vaddagiri
@ 2011-09-13 16:33                                         ` Peter Zijlstra
  2011-09-13 17:41                                           ` Srivatsa Vaddagiri
  2011-09-13 16:36                                         ` Peter Zijlstra
  1 sibling, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2011-09-13 16:33 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Paul Turner, Kamalesh Babulal, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

On Tue, 2011-09-13 at 21:51 +0530, Srivatsa Vaddagiri wrote:
> > which increases the time you force a task to sleep that's holding locks etc..
> 
> Ideally all tasks should get capped at the same time, given that there is
> a global pool from which everyone pulls bandwidth? So while one vcpu/task
> (holding a lock) gets capped, other vcpus/tasks (that may want the same lock)
> should ideally not be running for long after that, avoiding lock inversion
> related problems you point out.

No this simply cannot be true.. You force groups to sleep so that other
groups can run, right? Therefore shared kernel locks will cause
inversion.

You cannot put both groups to sleep and still expect a utilization of
100%.

Simple example, some task in group A owns the i_mutex of a file, group A
runs out of time and gets dequeued. Some other task in group B needs
that same i_mutex.

> I guess that we may still run into that with current implementation ..
> Basically global pool may have zero runtime left for current period,
> forcing a vcpu/task to be throttled, while there is surplus runtime in
> per-cpu pools, allowing some sibling vcpus/tasks to run for wee bit
> more, leading to lock-inversion related problems (more idling). That
> makes me think we can improve directed yield->capping interaction.
> Essentially when the target task of directed yield is capped, can the
> "yielding" task donate some of its bandwidth? 

What moron ever calls yield anyway? If you use yield you're doing it
wrong!

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-13 16:21                                       ` Srivatsa Vaddagiri
  2011-09-13 16:33                                         ` Peter Zijlstra
@ 2011-09-13 16:36                                         ` Peter Zijlstra
  2011-09-13 17:54                                           ` Srivatsa Vaddagiri
  1 sibling, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2011-09-13 16:36 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Paul Turner, Kamalesh Babulal, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

On Tue, 2011-09-13 at 21:51 +0530, Srivatsa Vaddagiri wrote:
> > I can't read it seems.. I thought you were talking about increasing the
> > period,
> 
> Mm ..I brought up the increased lock contention with reference to this
> experimental result that I posted earlier:
> 
>   > Tuning min_interval and max_interval of various sched_domains to 1
>   > and also setting sched_cfs_bandwidth_slice_us to 500 does cut down idle
>   > time further to 2.7%

Yeah, that's the not being able to read part..

> Value of sched_cfs_bandwidth_slice_us was reduced from default of 5000us
> to 500us, which (along with reduction of min/max interval) helped cut down
> idle time further (3.9% -> 2.7%). I was commenting that this may not necessarily
> be optimal (as for example low 'sched_cfs_bandwidth_slice_us' could result
> in all cpus contending for cfs_b->lock very frequently). 

Right.. so this seems to suggest you're migrating a lot.

Also what workload are we talking about? the insane one with 5 groups of
weight 1024?

Ramping up the frequency of the load-balancer and giving out smaller
slices is really anti-scalability.. I bet a lot of that 'reclaimed' idle
time is spend in system time. 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-13 16:33                                         ` Peter Zijlstra
@ 2011-09-13 17:41                                           ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 129+ messages in thread
From: Srivatsa Vaddagiri @ 2011-09-13 17:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, Kamalesh Babulal, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-13 18:33:09]:

> On Tue, 2011-09-13 at 21:51 +0530, Srivatsa Vaddagiri wrote:
> > > which increases the time you force a task to sleep that's holding locks etc..
> > 
> > Ideally all tasks should get capped at the same time, given that there is
> > a global pool from which everyone pulls bandwidth? So while one vcpu/task
> > (holding a lock) gets capped, other vcpus/tasks (that may want the same lock)
> > should ideally not be running for long after that, avoiding lock inversion
> > related problems you point out.
> 
> No this simply cannot be true.. You force groups to sleep so that other
> groups can run, right? Therefore shared kernel locks will cause
> inversion.

Ah ..shared locks of "host" kernel ..true ..that can still cause
lock-inversion yes.

I had in mind user-space (or "guest" kernel) locks - which can't get inverted 
that easily (one of cgroup's tasks wanting a "userspace" lock which is held by 
another "throttled" task of same cgroup - causing a inversion problem of sorts).
My point was that once a task gets throttled, other sibling tasks should get 
throttled almost immediately after that (given that bandwidth for a cgroup is 
maintained in a global pool from which everyone draws in "small" increments) - 
so a task that gets capped while holding a user-space lock should not
result in other sibling tasks going too much hungry on held locks within the
same period?

> You cannot put both groups to sleep and still expect a utilization of
> 100%.
> 
> Simple example, some task in group A owns the i_mutex of a file, group A
> runs out of time and gets dequeued. Some other task in group B needs
> that same i_mutex.
> 
> > I guess that we may still run into that with current implementation ..
> > Basically global pool may have zero runtime left for current period,
> > forcing a vcpu/task to be throttled, while there is surplus runtime in
> > per-cpu pools, allowing some sibling vcpus/tasks to run for wee bit
> > more, leading to lock-inversion related problems (more idling). That
> > makes me think we can improve directed yield->capping interaction.
> > Essentially when the target task of directed yield is capped, can the
> > "yielding" task donate some of its bandwidth? 
> 
> What moron ever calls yield anyway?

I meant directed yield (yield_to) ..which is used by KVM when it detects 
pause-loops. Essentially, a vcpu spinning in guest-kernel context for too long 
leading to PLE (Pause-Loop-Exit), which leads to the KVM driver doing a directed 
yield to another sibling vcpu ..so the target of directed yield may be a
capped vcpu task, in which case I was wondering if directed yield can donate a
bit of bandwidth to the throttled task. Again going by what I said earlier about
tasks getting capped more or less at the same time, this should occur very 
infrequently ...something for me to test and find out nevertheless!

> If you use yield you're doing it wrong!

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-13 16:36                                         ` Peter Zijlstra
@ 2011-09-13 17:54                                           ` Srivatsa Vaddagiri
  2011-09-13 18:03                                             ` Peter Zijlstra
                                                               ` (2 more replies)
  0 siblings, 3 replies; 129+ messages in thread
From: Srivatsa Vaddagiri @ 2011-09-13 17:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, Kamalesh Babulal, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

[-- Attachment #1: Type: text/plain, Size: 2037 bytes --]

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-13 18:36:15]:
> > Value of sched_cfs_bandwidth_slice_us was reduced from default of 5000us
> > to 500us, which (along with reduction of min/max interval) helped cut down
> > idle time further (3.9% -> 2.7%). I was commenting that this may not necessarily
> > be optimal (as for example low 'sched_cfs_bandwidth_slice_us' could result
> > in all cpus contending for cfs_b->lock very frequently). 
> 
> Right.. so this seems to suggest you're migrating a lot.

We did do some experiments (outside of capping) to see how badly tasks
migrate on latest tip (compared to previous kernels). The test was to
spawn 32 cpuhogs on a 16-cpu system (place them in default cgroup -
without any capping in place) and measure how much they bounce around.
System had little load besides these cpu hogs.

We saw considerably high migration count on latest tip compared to
previous kernels. Kamalesh, can you please post the migration count
data?

> Also what workload are we talking about? the insane one with 5 groups of
> weight 1024?

We were never running the "insane" one ..we always run with proportional
shares, the "sane" one! I forgot to mention that bit in my first email
(about the shares setup). I am attaching the test script we are using
for your reference. Fyi, we have added additional levels to cgroup setup
(/Level1/Level2/C1/C1_1 etc) to mimic cgroup hierarchy for VMS as
created by libvirt.

> Ramping up the frequency of the load-balancer and giving out smaller
> slices is really anti-scalability.. I bet a lot of that 'reclaimed' idle
> time is spend in system time. 

System time (in top and vmstat) does remain unchanged at 0% when
cranking up load-balance frequency and slicing down
sched_cfs_bandwidth_slice_us ..I guess the additional "system" time
can't be accounted for easily by the tick-based accounting system we
have. I agree there could be other un-observed side-effects of increased
load-balance frequency (like workload performance) that I haven't noticed. 

- vatsa

[-- Attachment #2: hard_limit_test.sh --]
[-- Type: application/x-sh, Size: 7035 bytes --]

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-13 14:19                               ` Peter Zijlstra
@ 2011-09-13 18:01                                 ` Srivatsa Vaddagiri
  2011-09-13 18:23                                   ` Peter Zijlstra
  0 siblings, 1 reply; 129+ messages in thread
From: Srivatsa Vaddagiri @ 2011-09-13 18:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, Kamalesh Babulal, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov, Thomas Gleixner

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-13 16:19:39]:

> > Booting with "nohz=off" also helps significantly.
> > 
> > With nohz=on, average idle time (over 1 min) is 10.3%
> > With nohz=off, average idle time (over 1 min) is 3.9%
>
> So we should put the cpufreq/idle governor into the nohz/idle path, it
> already tries to predict the idle duration in order to pick a C state,
> that same prediction should be used to determine if stopping the tick is
> worth it.

Hmm ..I tried performance governor and found that it slightly increases
idle time.

  With nohz=off && ondemand governor, idle time = 4%
  With nohz=off && performance governor on all cpus, idle time = 6%
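
For reference, switching the governor is the usual cpufreq sysfs write; a
sketch, assuming the standard sysfs paths:

	# for g in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do echo performance > $g; done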

I can't see obvious reasons for that ..afaict bandwidth capping should
be independent of frequency (i.e task gets capped by "used" time,
irrespective of frequency at which it was "using" the cpu)?

- vatsa

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-13 17:54                                           ` Srivatsa Vaddagiri
@ 2011-09-13 18:03                                             ` Peter Zijlstra
  2011-09-13 18:12                                               ` Srivatsa Vaddagiri
  2011-09-13 18:07                                             ` Peter Zijlstra
  2011-09-13 18:19                                             ` Peter Zijlstra
  2 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2011-09-13 18:03 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Paul Turner, Kamalesh Babulal, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

On Tue, 2011-09-13 at 23:24 +0530, Srivatsa Vaddagiri wrote:
> Fyi, we have added additional levels to cgroup setup
> (/Level1/Level2/C1/C1_1 etc) to mimic cgroup hierarchy for VMS as
> created by libvirt. 

The deeper you nest the bigger the numerical problems get.. 

Also, can you please stop using virt crap and focus on useful
things? :-) Start with simple cases of single depth groups.



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-13 17:54                                           ` Srivatsa Vaddagiri
  2011-09-13 18:03                                             ` Peter Zijlstra
@ 2011-09-13 18:07                                             ` Peter Zijlstra
  2011-09-13 18:19                                             ` Peter Zijlstra
  2 siblings, 0 replies; 129+ messages in thread
From: Peter Zijlstra @ 2011-09-13 18:07 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Paul Turner, Kamalesh Babulal, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

On Tue, 2011-09-13 at 23:24 +0530, Srivatsa Vaddagiri wrote:
> I guess the additional "system" time
> can't be accounted for easily by the tick-based accounting system we
> have. I agree there could be other un-observed side-effects of increased
> load-balance frequency (like workload performance) that I haven't noticed. 

Yeah, very hard, its the tick that starts the balancer, so it would have
to last longer than a tick to be noticed, very unlikely.

We should implement full blown CONFIG_VIRT_CPU_ACCOUNTING,.. except I
bet that once we do that people will want it enabled and I'm pretty sure
people also don't want to pay the price for it... :-)

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-13 18:03                                             ` Peter Zijlstra
@ 2011-09-13 18:12                                               ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 129+ messages in thread
From: Srivatsa Vaddagiri @ 2011-09-13 18:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, Kamalesh Babulal, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-13 20:03:04]:

> On Tue, 2011-09-13 at 23:24 +0530, Srivatsa Vaddagiri wrote:
> > Fyi, we have added additional levels to cgroup setup
> > (/Level1/Level2/C1/C1_1 etc) to mimic cgroup hierarchy for VMS as
> > created by libvirt. 
> 
> The deeper you nest the bigger the numerical problems get.. 
>
> Also, can you please stop using virt crap and focus on useful
> things? :-) 

That unfortunately is the target environment where we want this working
(want to cap VMs under KVM) :-) For simplicity, we have been playing
with a non-VM-based testcase ..

> Start with simple cases of single depth groups.

We did try with single level and "extra" large proportional
shares (10k * NR_TASKS if I recall)..I don't think they made any 
difference ..will re-check though.

- vatsa


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-13 17:54                                           ` Srivatsa Vaddagiri
  2011-09-13 18:03                                             ` Peter Zijlstra
  2011-09-13 18:07                                             ` Peter Zijlstra
@ 2011-09-13 18:19                                             ` Peter Zijlstra
  2011-09-13 18:28                                               ` Srivatsa Vaddagiri
  2 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2011-09-13 18:19 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Paul Turner, Kamalesh Babulal, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

On Tue, 2011-09-13 at 23:24 +0530, Srivatsa Vaddagiri wrote:
> We saw considerably high migration count on latest tip compared to
> previous kernels. Kamalesh, can you please post the migration count
> data? 

Hrmm, yes this looks horrid.. even without cgroup crap, something's
funny.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-13 18:01                                 ` Srivatsa Vaddagiri
@ 2011-09-13 18:23                                   ` Peter Zijlstra
  2011-09-16  8:14                                     ` Paul Turner
  0 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2011-09-13 18:23 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Paul Turner, Kamalesh Babulal, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov, Thomas Gleixner

On Tue, 2011-09-13 at 23:31 +0530, Srivatsa Vaddagiri wrote:
> * Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-13 16:19:39]:
> 
> > > Booting with "nohz=off" also helps significantly.
> > > 
> > > With nohz=on, average idle time (over 1 min) is 10.3%
> > > With nohz=off, average idle time (over 1 min) is 3.9%
> >
> > So we should put the cpufreq/idle governor into the nohz/idle path, it
> > already tries to predict the idle duration in order to pick a C state,
> > that same prediction should be used to determine if stopping the tick is
> > worth it.
> 
> Hmm ..I tried performance governor and found that it slightly increases
> idle time.
> 
>   With nohz=off && ondemand governor, idle time = 4%
>   With nohz=off && performance governor on all cpus, idle time = 6%
> 
> I can't see obvious reasons for that ..afaict bandwidth capping should
> be independent of frequency (i.e task gets capped by "used" time,
> irrespective of frequency at which it was "using" the cpu)?

That's not what I said.. what I said is that the nohz code should also
use the idle time prognosis.. disabling the tick is a costly operation,
doing it only to have to undo it costs time, and will be accounted to
idle time, hence your improvement with nohz=off.



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-13 18:19                                             ` Peter Zijlstra
@ 2011-09-13 18:28                                               ` Srivatsa Vaddagiri
  2011-09-13 18:30                                                 ` Peter Zijlstra
  0 siblings, 1 reply; 129+ messages in thread
From: Srivatsa Vaddagiri @ 2011-09-13 18:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, Kamalesh Babulal, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-13 20:19:55]:

> On Tue, 2011-09-13 at 23:24 +0530, Srivatsa Vaddagiri wrote:
> > We saw considerably high migration count on latest tip compared to
> > previous kernels. Kamalesh, can you please post the migration count
> > data? 
> 
> Hrmm, yes this looks horrid.. even without cgroup crap, something's funny.

Yes ..we could visualize that very much in top o/p .. A task's cpu would keep 
changing *every* screen refresh (refreshed every 0.5 sec that too!).

We didn't see that with older kernels ..Kamalesh is planning to do a
git bisect and see which commit lead to this "mad" hopping ..

- vatsa

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-13 18:28                                               ` Srivatsa Vaddagiri
@ 2011-09-13 18:30                                                 ` Peter Zijlstra
  2011-09-13 18:35                                                   ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2011-09-13 18:30 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Paul Turner, Kamalesh Babulal, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

On Tue, 2011-09-13 at 23:58 +0530, Srivatsa Vaddagiri wrote:
> * Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-13 20:19:55]:
> 
> > On Tue, 2011-09-13 at 23:24 +0530, Srivatsa Vaddagiri wrote:
> > > We saw considerably high migration count on latest tip compared to
> > > previous kernels. Kamalesh, can you please post the migration count
> > > data? 
> > 
> > Hrmm, yes this looks horrid.. even without cgroup crap, something's funny.
> 
> Yes ..we could visualize that very much in top o/p .. A task's cpu would keep 
> changing *every* screen refresh (refreshed every 0.5 sec that too!).
> 
> We didn't see that with older kernels ..Kamalesh is planning to do a
> git bisect and see which commit lead to this "mad" hopping ..

Awesome, thanks! Btw, what is 'older'? 3.0?

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-13 18:30                                                 ` Peter Zijlstra
@ 2011-09-13 18:35                                                   ` Srivatsa Vaddagiri
  2011-09-15 17:55                                                     ` Kamalesh Babulal
  0 siblings, 1 reply; 129+ messages in thread
From: Srivatsa Vaddagiri @ 2011-09-13 18:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, Kamalesh Babulal, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-13 20:30:46]:

> On Tue, 2011-09-13 at 23:58 +0530, Srivatsa Vaddagiri wrote:
> > * Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-13 20:19:55]:
> > 
> > > On Tue, 2011-09-13 at 23:24 +0530, Srivatsa Vaddagiri wrote:
> > > > We saw considerably high migration count on latest tip compared to
> > > > previous kernels. Kamalesh, can you please post the migration count
> > > > data? 
> > > 
> > > Hrmm, yes this looks horrid.. even without cgroup crap, something's funny.
> > 
> > Yes ..we could visualize that very much in top o/p .. A task's cpu would keep 
> > changing *every* screen refresh (refreshed every 0.5 sec that too!).
> > 
> > We didn't see that with older kernels ..Kamalesh is planning to do a
> > git bisect and see which commit lead to this "mad" hopping ..
> 
> Awesome, thanks! Btw, what is 'older'? 3.0?

We went back all the way upto 2.6.32! I think 2.6.38 and 2.6.39 were
pretty stable ..I don't have the migration count data with me readily. I
will let Kamalesh post that info soon.

- vatsa


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-13 18:35                                                   ` Srivatsa Vaddagiri
@ 2011-09-15 17:55                                                     ` Kamalesh Babulal
  2011-09-15 21:48                                                       ` Peter Zijlstra
  2011-09-20 12:55                                                       ` Peter Zijlstra
  0 siblings, 2 replies; 129+ messages in thread
From: Kamalesh Babulal @ 2011-09-15 17:55 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Peter Zijlstra, Paul Turner, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> [2011-09-14 00:05:02]:

> * Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-13 20:30:46]:
> 
> > On Tue, 2011-09-13 at 23:58 +0530, Srivatsa Vaddagiri wrote:
> > > * Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-13 20:19:55]:
> > > 
> > > > On Tue, 2011-09-13 at 23:24 +0530, Srivatsa Vaddagiri wrote:
> > > > > We saw considerably high migration count on latest tip compared to
> > > > > previous kernels. Kamalesh, can you please post the migration count
> > > > > data? 
> > > > 
> > > > Hrmm, yes this looks horrid.. even without cgroup crap, something's funny.
> > > 
> > > Yes ..we could visualize that very much in top o/p .. A task's cpu would keep 
> > > changing *every* screen refresh (refreshed every 0.5 sec that too!).
> > > 
> > > We didn't see that with older kernels ..Kamalesh is planning to do a
> > > git bisect and see which commit lead to this "mad" hopping ..
> > 
> > Awesome, thanks! Btw, what is 'older'? 3.0?
> 
> We went back all the way upto 2.6.32! I think 2.6.38 and 2.6.39 were
> pretty stable ..I don't have the migration count data with me readily. I
> will let Kamalesh post that info soon.

Test Setup :
-----------
Machine is a 2-socket Quad Core Intel (x5570) box. The lb.sh
script was run in a loop, executing 5 times, after the box
was brought up with each kernel.

The lb.sh script spawns 2x CPU hogs, where x is the
number of CPUs on the system. The script records each hog's
se.nr_migrations, sleeps 60 seconds, and then reports the
difference (after - before) summed over all the spawned hogs.
----------------+-------+-------+-------+-------+-------+
Kernel		| Run 1	| Run 2	| Run 3	| Run 4	| Run 5	|
----------------+-------+-------+-------+-------+-------+
2.6.33		| 9604	| 101	| 66	| 2543	| 3488	|
----------------+-------+-------+-------+-------+-------+
2.6.34		| 28469	| 1514	| 1602	| 185	| 139	|	
----------------+-------+-------+-------+-------+-------+
2.6.35		| 1052	| 12	| 4	| 11	| 6	|
----------------+-------+-------+-------+-------+-------+
2.6.36		| 1253	| 53	| 78	| 76	| 50	|
----------------+-------+-------+-------+-------+-------+
2.6.37		| 262	| 36	| 48	| 61	| 43	|
----------------+-------+-------+-------+-------+-------+
2.6.38		| 1551	| 48	| 62	| 47	| 50	|
----------------+-------+-------+-------+-------+-------+
2.6.39		| 3784	| 457	| 722	| 3209	| 1037	|
----------------+-------+-------+-------+-------+-------+
3.0		| 933	| 608	| 658	| 1424	| 1415	|
----------------+-------+-------+-------+-------+-------+
3.1.0-rc4-tip	|	|	|	|	|	|
(e467f18f945)	| 1672	| 1643	| 1316	| 1577	| 61	|
----------------+-------+-------+-------+-------+-------+

lb.sh
------
#!/bin/bash

rm -rf test*
rm -rf t*

ITERATIONS=60			# No of Iterations to capture the details
NUM_CPUS=$(cat /proc/cpuinfo |grep -i proces|wc -l)

NUM_HOGS=$((NUM_CPUS * 2))	# No of hogs threads to invoke

echo "System has $NUM_CPUS cpus..... Spawing $NUM_HOGS cpu hogs ... for $ITERATIONS seconds.."
if [ ! -e while1.c ]
then
	cat >> while1.c << EOF
	int
	main (int argc, char **argv)
	{
		while(1);
		return (0);
	}
EOF
fi

for i in $(seq 1 $NUM_HOGS)
do
	gcc -o while$i while1.c
	if [ $? -ne 0 ]
	then
		echo "Looks like gcc is not present ... aborting"
		exit
	fi
done

for i in $(seq 1 $NUM_HOGS)
do
	./while$i &		# start hog in background
	pids[$i]=$!
	# record the hog's starting se.nr_migrations from /proc/<pid>/sched
	pids_old[$i]=`cat /proc/$!/sched |grep -i nr_migr|grep -iv cold|cut -d ":" -f2|sed 's/  //g'`
done

sleep $ITERATIONS

j=1
old_nr_migrations=0
new_nr_migrations=0
echo  -e  "       \t New \t Old"
for i in $(seq 1 $NUM_HOGS)
do
	a=`echo ${pids[i]}`	
	new=`cat /proc/$a/sched |grep -i nr_migr|grep -iv cold|cut -d ":" -f2|sed 's/  //g'`
	old=`echo ${pids_old[i]}`
	old_nr_migrations=$((old_nr_migrations + old))
	c=$(($new - $old))
	new_nr_migrations=$((new_nr_migrations + c))
	echo -e "while$i\t[$new]\t[$old]\t"
done
echo "*******************************************"
echo -e "      $new_nr_migrations\t$old_nr_migrations"
echo "*******************************************"
pkill -9 while

exit


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-15 17:55                                                     ` Kamalesh Babulal
@ 2011-09-15 21:48                                                       ` Peter Zijlstra
  2011-09-19 17:51                                                         ` Kamalesh Babulal
  2011-09-20 12:55                                                       ` Peter Zijlstra
  1 sibling, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2011-09-15 21:48 UTC (permalink / raw)
  To: Kamalesh Babulal
  Cc: Srivatsa Vaddagiri, Paul Turner, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

On Thu, 2011-09-15 at 23:25 +0530, Kamalesh Babulal wrote:
> 2.6.38          | 1551  | 48    | 62    | 47    | 50    |
> ----------------+-------+-------+-------+-------+-------+
> 2.6.39          | 3784  | 457   | 722   | 3209  | 1037  | 

I'd say we wrecked it going from .38 to .39 and only made it worse after
that.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-13 18:23                                   ` Peter Zijlstra
@ 2011-09-16  8:14                                     ` Paul Turner
  2011-09-16  8:28                                       ` Peter Zijlstra
  0 siblings, 1 reply; 129+ messages in thread
From: Paul Turner @ 2011-09-16  8:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srivatsa Vaddagiri, Kamalesh Babulal, Vladimir Davydov,
	linux-kernel, Bharata B Rao, Dhaval Giani,
	Vaidyanathan Srinivasan, Ingo Molnar, Pavel Emelianov,
	Thomas Gleixner

On 09/13/11 11:23, Peter Zijlstra wrote:
> On Tue, 2011-09-13 at 23:31 +0530, Srivatsa Vaddagiri wrote:
>> * Peter Zijlstra<a.p.zijlstra@chello.nl>  [2011-09-13 16:19:39]:
>>
>>>> Booting with "nohz=off" also helps significantly.
>>>>
>>>> With nohz=on, average idle time (over 1 min) is 10.3%
>>>> With nohz=off, average idle time (over 1 min) is 3.9%

I think more compelling here is that it looks like nohz load-balance 
needs more love.

>>>
>>> So we should put the cpufreq/idle governor into the nohz/idle path, it
>>> already tries to predict the idle duration in order to pick a C state,
>>> that same prediction should be used to determine if stopping the tick is
>>> worth it.
>>
>> Hmm ..I tried performance governor and found that it slightly increases
>> idle time.
>>
>>    With nohz=off&&  ondemand governor, idle time = 4%
>>    With nohz=off&&  performance governor on all cpus, idle time = 6%
>>
>> I can't see obvious reasons for that ..afaict bandwidth capping should
>> be independent of frequency (i.e task gets capped by "used" time,
>> irrespective of frequency at which it was "using" the cpu)?
>
> That's not what I said.. what I said is that the nohz code should also
> use the idle time prognosis.. disabling the tick is a costly operation,
> doing it only to have to undo it costs time, and will be accounted to
> idle time, hence your improvement with nohz=off.
>

Enabling Venki's CONFIG_IRQ_TIME_ACCOUNTING=y would discount that time and 
provide a definitive answer here, yes?
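
A quick sketch of checking for it on a running kernel (the config file
location is distro-dependent, so this path is only an assumption):

	# grep IRQ_TIME_ACCOUNTING /boot/config-$(uname -r)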

- Paul


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede
  2011-09-07 15:20                 ` CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede Srivatsa Vaddagiri
  2011-09-07 19:22                   ` Peter Zijlstra
@ 2011-09-16  8:22                   ` Paul Turner
  1 sibling, 0 replies; 129+ messages in thread
From: Paul Turner @ 2011-09-16  8:22 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Kamalesh Babulal, Vladimir Davydov, linux-kernel, Peter Zijlstra,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

On 09/07/11 08:20, Srivatsa Vaddagiri wrote:
> [Apologies if you get this email multiple times - there is some email
> client config issue that I am fixing up]
>
> * Paul Turner<pjt@google.com>  [2011-06-21 12:48:17]:
>
>> Hi Kamalesh,
>>
>> Can you see what things look like under v7?
>>
>> There's been a few improvements to quota re-distribution that should
>> hopefully help your test case.
>>
>> The remaining idle% I see on my machines appear to be a product of
>> load-balancer inefficiency.
>

Hey Srivatsa,

Thanks for taking another look at this -- sorry for the delayed reply!

> which is quite a complex problem to solve! I am still surprised that
> we can't handle 32 cpuhogs on a 16-cpu system very easily. The tasks seem to
> hop around madly rather than settle down as 2 tasks/cpu. Kamalesh, can you post
> the exact count of migrations we saw on latest tip over a 20-sec window?
>
> Anyway, here's a "hack" to minimize the idle time induced due to load-balance
> issues. It brings down idle time from 7+% to ~0% ..I am not too happy about
> this, but I don't see any other simpler solutions to solve the idle time issue
> completely (other than making load-balancer completely fair!).

Hum,

So BWC returns bandwidth on voluntary sleep to the parent, so the most 
we can really lose is NR_CPUS * 1ms (how much a cpu keeps in case the 
entity re-wakes up quickly).  Technically we could lose another few ms 
if there's not enough BW left to bother distributing and we're near the 
end of the period; but I think that works out to another 6ms or so at 
worst.
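
Putting rough numbers on that for the 16-cpu box in this thread (the 1ms
retained-on-sleep and ~6ms end-of-period figures are the estimates above;
this is just arithmetic, not a measurement):

	# ncpus=16; retained_ms=1; end_of_period_ms=6; period_ms=100
	# echo "worst case ~$(( ncpus * retained_ms + end_of_period_ms ))ms of quota lost per ${period_ms}ms period"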

As discussed in the long thread dangling off this; it's load-balance 
that's at fault -- allowing steal time is just hiding this by instead 
letting cpus run over quota within a period.

If you, for example, set up a deadline-oriented test that tried to 
accomplish the same amount of work (without bandwidth limits) and threw 
away the rest of the work at period expiration (a benchmark I've been 
meaning to write and publish as a more general load-balance test, 
actually), then I suspect we'd see similar problems; and sadly, that 
case is both more representative of real-world performance and not 
fixable by something like steal time.
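
Roughly the kind of test I mean, as a quick sketch only -- the period,
the per-period work target and the busy-loop below are placeholder
numbers, not a real benchmark:

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define PERIOD_NS	 100000000L	/* assumed 100ms period */
#define UNITS_PER_PERIOD 20		/* assumed per-period work target */
#define NR_PERIODS	 100

static void burn_one_unit(void)
{
	for (volatile long i = 0; i < 2000000; i++)
		;			/* one "unit" of pure cpu work */
}

static int past(const struct timespec *now, const struct timespec *deadline)
{
	return now->tv_sec > deadline->tv_sec ||
	       (now->tv_sec == deadline->tv_sec &&
		now->tv_nsec >= deadline->tv_nsec);
}

static void *worker(void *arg)
{
	long done = 0;
	struct timespec next, now;

	clock_gettime(CLOCK_MONOTONIC, &next);
	for (int p = 0; p < NR_PERIODS; p++) {
		/* absolute deadline for this period */
		next.tv_nsec += PERIOD_NS;
		next.tv_sec  += next.tv_nsec / 1000000000L;
		next.tv_nsec %= 1000000000L;

		for (int u = 0; u < UNITS_PER_PERIOD; u++) {
			clock_gettime(CLOCK_MONOTONIC, &now);
			if (past(&now, &next))
				break;	/* deadline hit: discard the rest */
			burn_one_unit();
			done++;
		}
		/* sleep out whatever is left of the period */
		clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
	}
	return (void *)done;
}

int main(int argc, char **argv)
{
	int nr = argc > 1 ? atoi(argv[1]) : 2 * sysconf(_SC_NPROCESSORS_ONLN);
	pthread_t tid[nr];
	long total = 0;

	for (int i = 0; i < nr; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (int i = 0; i < nr; i++) {
		void *ret;
		pthread_join(tid[i], &ret);
		total += (long)ret;
	}
	printf("%d threads: %ld of %ld work units completed\n",
	       nr, total, (long)nr * NR_PERIODS * UNITS_PER_PERIOD);
	return 0;
}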

So... we're probably better off trying to improve LB; I raised it in 
another reply on the chain but the NOHZ vs ticks ilb numbers look pretty 
compelling as an area for improvement in this regard.

Thanks!

- Paul
>
> --
>
> Fix excessive idle time reported when cgroups are capped.  The patch
> introduces the notion of "steal" (or "grace") time which is the surplus
> time/bandwidth each cgroup is allowed to consume, subject to a maximum
> steal time (sched_cfs_max_steal_time_us). Cgroups are allowed this "steal"
> or "grace" time when the lone task running on a cpu is about to be throttled.
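
For concreteness, a minimal sketch of the idea as described above (the
actual patch isn't reproduced in this thread; the structure, names and
default below are illustrative only, not the real code):

#include <stdbool.h>
#include <stdint.h>

struct cfs_rq_sketch {
	int64_t  runtime_remaining_ns;	/* local quota left this period */
	uint64_t steal_time_ns;		/* "grace" time consumed so far */
};

/* assumed default, standing in for the sched_cfs_max_steal_time_us knob */
static uint64_t max_steal_time_ns = 10 * 1000 * 1000;

/*
 * May this cfs_rq keep running for another delta_ns past its quota?
 * Per the description: only to avoid idling the cpu (the task about to
 * be throttled is the lone runnable task there), and only until the
 * grace budget is used up.
 */
static bool allow_steal_time(struct cfs_rq_sketch *cfs_rq,
			     bool lone_task_on_cpu, uint64_t delta_ns)
{
	if (cfs_rq->runtime_remaining_ns > 0)
		return true;			/* still within quota */
	if (!lone_task_on_cpu)
		return false;			/* cpu would not go idle */
	if (cfs_rq->steal_time_ns + delta_ns > max_steal_time_ns)
		return false;			/* grace budget exhausted */
	cfs_rq->steal_time_ns += delta_ns;
	return true;
}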


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned
  2011-09-16  8:14                                     ` Paul Turner
@ 2011-09-16  8:28                                       ` Peter Zijlstra
  2011-09-19 16:35                                         ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2011-09-16  8:28 UTC (permalink / raw)
  To: Paul Turner
  Cc: Srivatsa Vaddagiri, Kamalesh Babulal, Vladimir Davydov,
	linux-kernel, Bharata B Rao, Dhaval Giani,
	Vaidyanathan Srinivasan, Ingo Molnar, Pavel Emelianov,
	Thomas Gleixner

On Fri, 2011-09-16 at 01:14 -0700, Paul Turner wrote:
> On 09/13/11 11:23, Peter Zijlstra wrote:
> > On Tue, 2011-09-13 at 23:31 +0530, Srivatsa Vaddagiri wrote:
> >> * Peter Zijlstra<a.p.zijlstra@chello.nl>  [2011-09-13 16:19:39]:
> >>
> >>>> Booting with "nohz=off" also helps significantly.
> >>>>
> >>>> With nohz=on, average idle time (over 1 min) is 10.3%
> >>>> With nohz=off, average idle time (over 1 min) is 3.9%
> 
> I think more compelling here is that it looks like nohz load-balance 
> needs more love.

Quite probable, although I do know we tend to go overboard in going into
nohz state too.

> > That's not what I said.. what I said is that the nohz code should also
> > use the idle time prognosis.. disabling the tick is a costly operation,
> > doing it only to have to undo it costs time, and will be accounted to
> > idle time, hence your improvement with nohz=off.
> >
> 
> Enabling Venki's CONFIG_IRQ_TIME_ACCOUNTING=y would discount that time 
> and provide a definitive answer here, yes?

Ah, yes, it's all (soft)irq context anyway, no need to also account
system calls.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned
  2011-09-16  8:28                                       ` Peter Zijlstra
@ 2011-09-19 16:35                                         ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 129+ messages in thread
From: Srivatsa Vaddagiri @ 2011-09-19 16:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, Kamalesh Babulal, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov, Thomas Gleixner

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-16 10:28:40]:

> > I think more compelling here is that it looks like nohz load-balance 
> > needs more love.
> 
> Quite probable,

Staring at the nohz load-balancer for some time, I see a potential issue:

'first_pick_cpu' and 'second_pick_cpu' can be idle without stopping
ticks for quite a while. When that happens, they stop bothering to
kick the ilb cpu because of this snippet in nohz_kick_needed():

   static inline int nohz_kick_needed(struct rq *rq, int cpu)
   {

	..
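        /* an idle cpu bails out here even if its tick is still running,
           so an idle first_pick_cpu/second_pick_cpu never kicks the ilb */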

        if (rq->idle_at_tick)
                return 0;

	..

   }


?

- vatsa

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned
  2011-09-15 21:48                                                       ` Peter Zijlstra
@ 2011-09-19 17:51                                                         ` Kamalesh Babulal
  2011-09-20  0:38                                                           ` Venki Pallipadi
                                                                             ` (2 more replies)
  0 siblings, 3 replies; 129+ messages in thread
From: Kamalesh Babulal @ 2011-09-19 17:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srivatsa Vaddagiri, Paul Turner, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-15 23:48:43]:

> On Thu, 2011-09-15 at 23:25 +0530, Kamalesh Babulal wrote:
> > 2.6.38          | 1551  | 48    | 62    | 47    | 50    |
> > ----------------+-------+-------+-------+-------+-------+
> > 2.6.39          | 3784  | 457   | 722   | 3209  | 1037  | 
> 
> I'd say we wrecked it going from .38 to .39 and only made it worse after
> that.

Reverting commit 866ab43efd325fae8889ea, one of the patches that went in 
between .38 and .39, reduces the ping-ponging of the tasks.

------------------------+-------+-------+-------+-------+-------+
Kernel			| Run 1	| Run 2	| Run 3	| Run 4	| Run 5	|
------------------------+-------+-------+-------+-------+-------+
2.6.39	        	| 1542  | 2172  | 2727  | 120   | 3681  |
------------------------+-------+-------+-------+-------+-------+
2.6.39 (with    	|       |       |       |       |       |
866ab43efd reverted)	| 65	| 78	| 58	| 99 	| 62	|
------------------------+-------+-------+-------+-------+-------+
3.1-rc4+tip		|	|	|	|	|	|
(e467f18f945c)		| 1219	| 2037	| 1943	| 772	| 1701	|
------------------------+-------+-------+-------+-------+-------+
3.1-rc4+tip (e467f18f9)	|	|	|	|	|	|
(866ab43efd reverted)	| 64	| 45	| 59	| 59	| 69	|
------------------------+-------+-------+-------+-------+-------+

Thanks,
Kamalesh.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned
  2011-09-19 17:51                                                         ` Kamalesh Babulal
@ 2011-09-20  0:38                                                           ` Venki Pallipadi
  2011-09-20 11:09                                                             ` Kamalesh Babulal
  2011-09-20 13:56                                                           ` Peter Zijlstra
  2011-09-20 14:04                                                           ` Peter Zijlstra
  2 siblings, 1 reply; 129+ messages in thread
From: Venki Pallipadi @ 2011-09-20  0:38 UTC (permalink / raw)
  To: Kamalesh Babulal
  Cc: Peter Zijlstra, Srivatsa Vaddagiri, Paul Turner,
	Vladimir Davydov, linux-kernel, Bharata B Rao, Dhaval Giani,
	Vaidyanathan Srinivasan, Ingo Molnar, Pavel Emelianov, Ken Chen

On Mon, Sep 19, 2011 at 10:51 AM, Kamalesh Babulal
<kamalesh@linux.vnet.ibm.com> wrote:
> * Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-15 23:48:43]:
>
>> On Thu, 2011-09-15 at 23:25 +0530, Kamalesh Babulal wrote:
>> > 2.6.38          | 1551  | 48    | 62    | 47    | 50    |
>> > ----------------+-------+-------+-------+-------+-------+
>> > 2.6.39          | 3784  | 457   | 722   | 3209  | 1037  |
>>
>> I'd say we wrecked it going from .38 to .39 and only made it worse after
>> that.
>
> Reverting commit 866ab43efd325fae8889ea, one of the patches that went in
> between .38 and .39, reduces the ping-ponging of the tasks.

There was a side-effect from 866ab43efd325fae8889ea that Ken
identified and fixed later in commit
b0432d8f162c7d5d9537b4cb749d44076b76a783. I guess you are seeing the
same problem...

Thanks,
Venki
>
> ------------------------+-------+-------+-------+-------+-------+
> Kernel                  | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
> ------------------------+-------+-------+-------+-------+-------+
> 2.6.39                  | 1542  | 2172  | 2727  | 120   | 3681  |
> ------------------------+-------+-------+-------+-------+-------+
> 2.6.39 (with            |       |       |       |       |       |
> 866ab43efd reverted)    | 65    | 78    | 58    | 99    | 62    |
> ------------------------+-------+-------+-------+-------+-------+
> 3.1-rc4+tip             |       |       |       |       |       |
> (e467f18f945c)          | 1219  | 2037  | 1943  | 772   | 1701  |
> ------------------------+-------+-------+-------+-------+-------+
> 3.1-rc4+tip (e467f18f9) |       |       |       |       |       |
> (866ab43efd reverted)   | 64    | 45    | 59    | 59    | 69    |
> ------------------------+-------+-------+-------+-------+-------+
>
> Thanks,
> Kamalesh.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned
  2011-09-20  0:38                                                           ` Venki Pallipadi
@ 2011-09-20 11:09                                                             ` Kamalesh Babulal
  0 siblings, 0 replies; 129+ messages in thread
From: Kamalesh Babulal @ 2011-09-20 11:09 UTC (permalink / raw)
  To: Venki Pallipadi
  Cc: Peter Zijlstra, Srivatsa Vaddagiri, Paul Turner,
	Vladimir Davydov, linux-kernel, Bharata B Rao, Dhaval Giani,
	Vaidyanathan Srinivasan, Ingo Molnar, Pavel Emelianov, Ken Chen

* Venki Pallipadi <venki@google.com> [2011-09-19 17:38:26]:

> On Mon, Sep 19, 2011 at 10:51 AM, Kamalesh Babulal
> <kamalesh@linux.vnet.ibm.com> wrote:
> > * Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-15 23:48:43]:
> >
> >> On Thu, 2011-09-15 at 23:25 +0530, Kamalesh Babulal wrote:
> >> > 2.6.38          | 1551  | 48    | 62    | 47    | 50    |
> >> > ----------------+-------+-------+-------+-------+-------+
> >> > 2.6.39          | 3784  | 457   | 722   | 3209  | 1037  |
> >>
> >> I'd say we wrecked it going from .38 to .39 and only made it worse after
> >> that.
> >
> > Reverting commit 866ab43efd325fae8889ea, one of the patches that went in
> > between .38 and .39, reduces the ping-ponging of the tasks.
> 
> There was a side-effect from 866ab43efd325fae8889ea that Ken
> identified and fixed later in commit
> b0432d8f162c7d5d9537b4cb749d44076b76a783. I guess you are seeing the
> same problem...
(snip)

3.1-rc4+tip includes commit b0432d8f162c7d5d. The number of task
ping-pongs is reduced with 3.1-rc4+tip in comparison to 2.6.39, as 
seen in the table below. Reverting commit 866ab43efd325fa on top of 
3.1-rc4+tip reduces the task bouncing to a good extent.

> > ------------------------+-------+-------+-------+-------+-------+
> > Kernel                  | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
> > ------------------------+-------+-------+-------+-------+-------+
> > 2.6.39                  | 1542  | 2172  | 2727  | 120   | 3681  |
> > ------------------------+-------+-------+-------+-------+-------+
> > 2.6.39 (with            |       |       |       |       |       |
> > 866ab43efd reverted)    | 65    | 78    | 58    | 99    | 62    |
> > ------------------------+-------+-------+-------+-------+-------+
> > 3.1-rc4+tip             |       |       |       |       |       |
> > (e467f18f945c)          | 1219  | 2037  | 1943  | 772   | 1701  |
> > ------------------------+-------+-------+-------+-------+-------+
> > 3.1-rc4+tip (e467f18f9) |       |       |       |       |       |
> > (866ab43efd reverted)   | 64    | 45    | 59    | 59    | 69    |
> > ------------------------+-------+-------+-------+-------+-------+
> >


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned
  2011-09-15 17:55                                                     ` Kamalesh Babulal
  2011-09-15 21:48                                                       ` Peter Zijlstra
@ 2011-09-20 12:55                                                       ` Peter Zijlstra
  2011-09-21 17:34                                                         ` Kamalesh Babulal
  1 sibling, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2011-09-20 12:55 UTC (permalink / raw)
  To: Kamalesh Babulal
  Cc: Srivatsa Vaddagiri, Paul Turner, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

On Thu, 2011-09-15 at 23:25 +0530, Kamalesh Babulal wrote:

> lb.sh
> ------
> #!/bin/bash
> 
> rm -rf test*
> rm -rf t*

You're insane, right?

> ITERATIONS=60			# No of Iterations to capture the details
> NUM_CPUS=$(cat /proc/cpuinfo |grep -i proces|wc -l)
> 
> NUM_HOGS=$((NUM_CPUS * 2))	# No of hogs threads to invoke
> 
> echo "System has $NUM_CPUS cpus..... Spawning $NUM_HOGS cpu hogs ... for $ITERATIONS seconds.."
> if [ ! -e while1.c ]
> then
> 	cat >> while1.c << EOF
> 	int
> 	main (int argc, char **argv)
> 	{
> 		while(1);
> 		return (0);
> 	}
> EOF
> fi
> 
> for i in $(seq 1 $NUM_HOGS)
> do
> 	gcc -o while$i while1.c
> 	if [ $? -ne 0 ]
> 	then
> 		echo "Looks like gcc is not present ... aborting"
> 		exit
> 	fi
> done
> 
> for i in $(seq 1 $NUM_HOGS)
> do
> 	./while$i &

You can kill the above two blocks by doing:

	while :; do :; done &

> 	pids[$i]=$!
> 	pids_old[$i]=`cat /proc/$!/sched |grep -i nr_migr|grep -iv cold|cut -d ":" -f2|sed 's/  //g'`
> done


and a fixup of the pkill muck.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned
  2011-09-19 17:51                                                         ` Kamalesh Babulal
  2011-09-20  0:38                                                           ` Venki Pallipadi
@ 2011-09-20 13:56                                                           ` Peter Zijlstra
  2011-09-20 14:04                                                           ` Peter Zijlstra
  2 siblings, 0 replies; 129+ messages in thread
From: Peter Zijlstra @ 2011-09-20 13:56 UTC (permalink / raw)
  To: Kamalesh Babulal
  Cc: Srivatsa Vaddagiri, Paul Turner, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

On Mon, 2011-09-19 at 23:21 +0530, Kamalesh Babulal wrote:
> * Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-15 23:48:43]:
> 
> > On Thu, 2011-09-15 at 23:25 +0530, Kamalesh Babulal wrote:
> > > 2.6.38          | 1551  | 48    | 62    | 47    | 50    |
> > > ----------------+-------+-------+-------+-------+-------+
> > > 2.6.39          | 3784  | 457   | 722   | 3209  | 1037  | 
> > 
> > I'd say we wrecked it going from .38 to .39 and only made it worse after
> > that.
> 
> Reverting commit 866ab43efd325fae8889ea, one of the patches that went in 
> between .38 and .39, reduces the ping-ponging of the tasks.
> 
> ------------------------+-------+-------+-------+-------+-------+
> Kernel			| Run 1	| Run 2	| Run 3	| Run 4	| Run 5	|
> ------------------------+-------+-------+-------+-------+-------+
> 2.6.39	        	| 1542  | 2172  | 2727  | 120   | 3681  |
> ------------------------+-------+-------+-------+-------+-------+
> 2.6.39 (with    	|       |       |       |       |       |
> 866ab43efd reverted)	| 65	| 78	| 58	| 99 	| 62	|
> ------------------------+-------+-------+-------+-------+-------+
> 3.1-rc4+tip		|	|	|	|	|	|
> (e467f18f945c)		| 1219	| 2037	| 1943	| 772	| 1701	|
> ------------------------+-------+-------+-------+-------+-------+
> 3.1-rc4+tip (e467f18f9)	|	|	|	|	|	|
> (866ab43efd reverted)	| 64	| 45	| 59	| 59	| 69	|
> ------------------------+-------+-------+-------+-------+-------+

Right, so reverting that breaks the cpuset/cpuaffinity thing again :-(

Now I'm not quite sure why group_imb gets toggled in this use-case at
all; having put a trace_printk() in, we get:

           <...>-1894  [006]   704.056250: find_busiest_group: max: 2048, min: 0, avg: 1024, nr: 2
     kworker/1:1-101   [001]   706.305523: find_busiest_group: max: 3072, min: 0, avg: 1024, nr: 3

Which is of course a bad state to be in, but we also get:

    migration/17-73    [017]   706.284191: find_busiest_group: max: 1024, min: 0, avg: 512, nr: 2
          <idle>-0     [003]   706.325435: find_busiest_group: max: 1250, min: 440, avg: 1024, nr: 2

on a CGROUP=n kernel.. which I think we can attribute to races.

When I enable tracing I also get some good runs, so it smells like the
lb does one bad thing and, instead of correcting it, makes it worse.

It looks like it's set off by a mass-wakeup of random crap that really
shouldn't be waking at all; I mean, who needs automount to wake up, or
whatever the fuck rtkit-daemon is. I'm pretty sure my bash loops don't
do anything remotely related to those.

Anyway, once enough random crap wakes up, the load-balancer starts
shifting stuff around, and once we hit the group_imb conditions we seem
to get stuck in a bad state instead of getting out of it.

Bah!







^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned
  2011-09-19 17:51                                                         ` Kamalesh Babulal
  2011-09-20  0:38                                                           ` Venki Pallipadi
  2011-09-20 13:56                                                           ` Peter Zijlstra
@ 2011-09-20 14:04                                                           ` Peter Zijlstra
  2 siblings, 0 replies; 129+ messages in thread
From: Peter Zijlstra @ 2011-09-20 14:04 UTC (permalink / raw)
  To: Kamalesh Babulal
  Cc: Srivatsa Vaddagiri, Paul Turner, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

On Tue, 2011-09-20 at 15:56 +0200, Peter Zijlstra wrote:
> 
> Anyway, once enough random crap wakes up, the load-balancer starts
> shifting stuff around, and once we hit the group_imb conditions we seem
> to get stuck in a bad state instead of getting out of it.

I bet all that crap wakes on the same tick that sets off the
load-balancer, because none of those things runs long enough to register
otherwise.

Looks like we need proper time weighted load averages for the regular lb
too.. pjt mentioned doing something like that as well, if only to reduce
the number of different load calculations we have.
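
Something along those lines, as a rough sketch (the decay period and
names are made up; this is just the shape of the idea, not the eventual
kernel code):

#include <stdint.h>

#define DECAY_PERIOD_MS 32	/* assumed half-life of the average */

struct tw_load {
	uint64_t avg;		/* decayed runnable contribution */
	uint64_t last_update_ms;
};

/*
 * Fold the time since the last update into the average: halve the
 * historic contribution once per elapsed decay period, then credit the
 * time the entity actually spent runnable.  Brief wakeups contribute
 * almost nothing while long-running hogs dominate, which is what the
 * load balancer should be looking at.  (The fresh contribution is
 * credited undecayed here for simplicity.)
 */
static void tw_load_update(struct tw_load *tl, uint64_t now_ms,
			   int was_runnable, uint64_t weight)
{
	uint64_t delta = now_ms - tl->last_update_ms;
	uint64_t periods = delta / DECAY_PERIOD_MS;

	tl->avg = periods < 64 ? tl->avg >> periods : 0;
	if (was_runnable)
		tl->avg += delta * weight;
	tl->last_update_ms = now_ms;
}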



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned
  2011-09-20 12:55                                                       ` Peter Zijlstra
@ 2011-09-21 17:34                                                         ` Kamalesh Babulal
  0 siblings, 0 replies; 129+ messages in thread
From: Kamalesh Babulal @ 2011-09-21 17:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srivatsa Vaddagiri, Paul Turner, Vladimir Davydov, linux-kernel,
	Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
	Ingo Molnar, Pavel Emelianov

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-20 14:55:20]:

> On Thu, 2011-09-15 at 23:25 +0530, Kamalesh Babulal wrote:
> 
(snip)
> > rm -rf test*
> > rm -rf t*
> 
> You're insane, right?

Of course not :-). It's a typo; it should have been 
rm -rf r* to delete the temporary files created by 
the original script (only the part which does the
se.nr_migrations calculation was posted). 

> > ITERATIONS=60			# No of Iterations to capture the details
> > NUM_CPUS=$(cat /proc/cpuinfo |grep -i proces|wc -l)
> > 
> > NUM_HOGS=$((NUM_CPUS * 2))	# No of hogs threads to invoke
> > 
(snip)
> > for i in $(seq 1 $NUM_HOGS)
> > do
> > 	./while$i &
> 
> You can kill the above two blocks by doing:
> 
> 	while :; do :; done &

Thanks. I got to know this from your commit 866ab43efd325fae88 previously.

Thanks,
Kamalesh.

^ permalink raw reply	[flat|nested] 129+ messages in thread

end of thread, other threads:[~2011-09-21 17:34 UTC | newest]

Thread overview: 129+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-03  9:28 [patch 00/15] CFS Bandwidth Control V6 Paul Turner
2011-05-03  9:28 ` [patch 01/15] sched: (fixlet) dont update shares twice on on_rq parent Paul Turner
2011-05-10  7:14   ` Hidetoshi Seto
2011-05-10  8:32     ` Mike Galbraith
2011-05-11  7:55       ` Hidetoshi Seto
2011-05-11  8:13         ` Paul Turner
2011-05-11  8:45           ` Mike Galbraith
2011-05-11  8:59             ` Hidetoshi Seto
2011-05-03  9:28 ` [patch 02/15] sched: hierarchical task accounting for SCHED_OTHER Paul Turner
2011-05-10  7:17   ` Hidetoshi Seto
2011-05-03  9:28 ` [patch 03/15] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
2011-05-10  7:18   ` Hidetoshi Seto
2011-05-03  9:28 ` [patch 04/15] sched: validate CFS quota hierarchies Paul Turner
2011-05-10  7:20   ` Hidetoshi Seto
2011-05-11  9:37     ` Paul Turner
2011-05-16  9:30   ` Peter Zijlstra
2011-05-16  9:43   ` Peter Zijlstra
2011-05-16 12:32     ` Paul Turner
2011-05-17 15:26       ` Peter Zijlstra
2011-05-18  7:16         ` Paul Turner
2011-05-18 11:57           ` Peter Zijlstra
2011-05-03  9:28 ` [patch 05/15] sched: add a timer to handle CFS bandwidth refresh Paul Turner
2011-05-10  7:21   ` Hidetoshi Seto
2011-05-11  9:27     ` Paul Turner
2011-05-16 10:18   ` Peter Zijlstra
2011-05-16 12:56     ` Paul Turner
2011-05-03  9:28 ` [patch 06/15] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth Paul Turner
2011-05-10  7:22   ` Hidetoshi Seto
2011-05-11  9:25     ` Paul Turner
2011-05-16 10:27   ` Peter Zijlstra
2011-05-16 12:59     ` Paul Turner
2011-05-17 15:28       ` Peter Zijlstra
2011-05-18  7:02         ` Paul Turner
2011-05-16 10:32   ` Peter Zijlstra
2011-05-03  9:28 ` [patch 07/15] sched: expire invalid runtime Paul Turner
2011-05-10  7:22   ` Hidetoshi Seto
2011-05-16 11:05   ` Peter Zijlstra
2011-05-16 11:07   ` Peter Zijlstra
2011-05-03  9:28 ` [patch 08/15] sched: throttle cfs_rq entities which exceed their local runtime Paul Turner
2011-05-10  7:23   ` Hidetoshi Seto
2011-05-16 15:58   ` Peter Zijlstra
2011-05-16 16:05   ` Peter Zijlstra
2011-05-03  9:28 ` [patch 09/15] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh Paul Turner
2011-05-10  7:24   ` Hidetoshi Seto
2011-05-11  9:24     ` Paul Turner
2011-05-03  9:28 ` [patch 10/15] sched: allow for positional tg_tree walks Paul Turner
2011-05-10  7:24   ` Hidetoshi Seto
2011-05-17 13:31   ` Peter Zijlstra
2011-05-18  7:18     ` Paul Turner
2011-05-03  9:28 ` [patch 11/15] sched: prevent interactions between throttled entities and load-balance Paul Turner
2011-05-10  7:26   ` Hidetoshi Seto
2011-05-11  9:11     ` Paul Turner
2011-05-03  9:28 ` [patch 12/15] sched: migrate throttled tasks on HOTPLUG Paul Turner
2011-05-10  7:27   ` Hidetoshi Seto
2011-05-11  9:10     ` Paul Turner
2011-05-03  9:28 ` [patch 13/15] sched: add exports tracking cfs bandwidth control statistics Paul Turner
2011-05-10  7:27   ` Hidetoshi Seto
2011-05-11  7:56   ` Hidetoshi Seto
2011-05-11  9:09     ` Paul Turner
2011-05-03  9:29 ` [patch 14/15] sched: return unused runtime on voluntary sleep Paul Turner
2011-05-10  7:28   ` Hidetoshi Seto
2011-05-03  9:29 ` [patch 15/15] sched: add documentation for bandwidth control Paul Turner
2011-05-10  7:29   ` Hidetoshi Seto
2011-05-11  9:09     ` Paul Turner
2011-06-07 15:45 ` CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned Kamalesh Babulal
2011-06-08  3:09   ` Paul Turner
2011-06-08 10:46   ` Vladimir Davydov
2011-06-08 16:32     ` Kamalesh Babulal
2011-06-09  3:25       ` Paul Turner
2011-06-10 18:17         ` Kamalesh Babulal
2011-06-14  0:00           ` Paul Turner
2011-06-15  5:37             ` Kamalesh Babulal
2011-06-21 19:48               ` Paul Turner
2011-06-24 15:05                 ` Kamalesh Babulal
2011-09-07 11:00                 ` Srivatsa Vaddagiri
2011-09-07 14:54                 ` Srivatsa Vaddagiri
2011-09-07 15:20                 ` CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned Srivatsa Vaddagiri
2011-09-07 19:22                   ` Peter Zijlstra
2011-09-08 15:15                     ` Srivatsa Vaddagiri
2011-09-09 12:31                       ` Peter Zijlstra
2011-09-09 13:26                         ` Srivatsa Vaddagiri
2011-09-12 10:17                         ` Srivatsa Vaddagiri
2011-09-12 12:35                           ` Peter Zijlstra
2011-09-13  4:15                             ` Srivatsa Vaddagiri
2011-09-13  5:03                               ` Srivatsa Vaddagiri
2011-09-13  5:05                                 ` Srivatsa Vaddagiri
2011-09-13  9:39                                 ` Peter Zijlstra
2011-09-13 11:28                                   ` Srivatsa Vaddagiri
2011-09-13 14:07                                     ` Peter Zijlstra
2011-09-13 16:21                                       ` Srivatsa Vaddagiri
2011-09-13 16:33                                         ` Peter Zijlstra
2011-09-13 17:41                                           ` Srivatsa Vaddagiri
2011-09-13 16:36                                         ` Peter Zijlstra
2011-09-13 17:54                                           ` Srivatsa Vaddagiri
2011-09-13 18:03                                             ` Peter Zijlstra
2011-09-13 18:12                                               ` Srivatsa Vaddagiri
2011-09-13 18:07                                             ` Peter Zijlstra
2011-09-13 18:19                                             ` Peter Zijlstra
2011-09-13 18:28                                               ` Srivatsa Vaddagiri
2011-09-13 18:30                                                 ` Peter Zijlstra
2011-09-13 18:35                                                   ` Srivatsa Vaddagiri
2011-09-15 17:55                                                     ` Kamalesh Babulal
2011-09-15 21:48                                                       ` Peter Zijlstra
2011-09-19 17:51                                                         ` Kamalesh Babulal
2011-09-20  0:38                                                           ` Venki Pallipadi
2011-09-20 11:09                                                             ` Kamalesh Babulal
2011-09-20 13:56                                                           ` Peter Zijlstra
2011-09-20 14:04                                                           ` Peter Zijlstra
2011-09-20 12:55                                                       ` Peter Zijlstra
2011-09-21 17:34                                                         ` Kamalesh Babulal
2011-09-13 14:19                               ` Peter Zijlstra
2011-09-13 18:01                                 ` Srivatsa Vaddagiri
2011-09-13 18:23                                   ` Peter Zijlstra
2011-09-16  8:14                                     ` Paul Turner
2011-09-16  8:28                                       ` Peter Zijlstra
2011-09-19 16:35                                         ` Srivatsa Vaddagiri
2011-09-16  8:22                   ` Paul Turner
2011-06-14 10:16   ` CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned Hidetoshi Seto
2011-06-14  6:58 ` [patch 00/15] CFS Bandwidth Control V6 Hu Tao
2011-06-14  7:29   ` Hidetoshi Seto
2011-06-14  7:44     ` Hu Tao
2011-06-15  8:37     ` Hu Tao
2011-06-16  0:57       ` Hidetoshi Seto
2011-06-16  9:45         ` Hu Tao
2011-06-17  1:22           ` Hidetoshi Seto
2011-06-17  6:05             ` Hu Tao
2011-06-17  6:25             ` Paul Turner
2011-06-17  9:13               ` Hidetoshi Seto
2011-06-18  0:28                 ` Paul Turner
