* [patch 00/16] CFS Bandwidth Control v7
@ 2011-06-21  7:16 Paul Turner
  2011-06-21  7:16 ` [patch 01/16] sched: (fixlet) dont update shares twice on on_rq parent Paul Turner
                   ` (16 more replies)
  0 siblings, 17 replies; 59+ messages in thread
From: Paul Turner @ 2011-06-21  7:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

Hi all,

Please find attached the latest iteration of bandwidth control (v7).

This release continues the scouring started in v5 and v6, this time with 
attention paid to timers, quota expiration, and the reclaim path.

Thanks to Hidetoshi Seto for taking the time to review the previous series.

v7
------------
optimizations/tweaks:
- no need to reschedule on an enqueue_throttle
- bandwidth is now reclaimed at time of dequeue rather than put_prev_entity;
  this prevents us losing small slices of bandwidth to load-balance movement.

quota/period handling:
- runtime expiration now handles sched_clock wrap
- logic for handling the bandwidth timer is now better unified with idle-state
  accounting; races with period expiration during hrtimer tear-down resolved
- fixed wake-ups into a new quota period having to wait for the timer to
  replenish bandwidth.

misc:
- fixed stats not being accumulated for unthrottled periods [thanks H. Seto]
- fixed nr_running corruption in enqueue/dequeue_task_fair [thanks H. Seto]
- consistency requirement changed to max(child bandwidth) <= parent bandwidth;
  the sysctl controlling this behavior was nuked
- throttling not enabled until both throttle and unthrottle mechanisms are in
  place.
- bunch of minor cleanups per list discussion

Hidetoshi, the following patches changed enough, or are new, and should be
looked over again before I can re-add your Reviewed-by.

[patch 04/16] sched: validate CFS quota hierarchies
[patch 06/16] sched: add a timer to handle CFS bandwidth refresh
[patch 07/16] sched: expire invalid runtime
[patch 10/16] sched: throttle entities exceeding their allowed bandwidth
[patch 15/16] sched: return unused runtime on voluntary sleep

Previous postings:
-----------------
v6: http://lkml.org/lkml/2011/5/7/37
v5: http://lkml.org/lkml/2011/3/22/477
v4: http://lkml.org/lkml/2011/2/23/44
v3: http://lkml.org/lkml/2010/10/12/44
v2: http://lkml.org/lkml/2010/4/28/88
Original posting: http://lkml.org/lkml/2010/2/12/393
Prior approaches: http://lkml.org/lkml/2010/1/5/44 ["CFS Hard limits v5"]

Let me know if anything's busted :)

- Paul





* [patch 01/16] sched: (fixlet) dont update shares twice on on_rq parent
  2011-06-21  7:16 [patch 00/16] CFS Bandwidth Control v7 Paul Turner
@ 2011-06-21  7:16 ` Paul Turner
  2011-06-21  7:16 ` [patch 02/16] sched: hierarchical task accounting for SCHED_OTHER Paul Turner
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 59+ messages in thread
From: Paul Turner @ 2011-06-21  7:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-fix_dequeue_task_buglet.patch --]
[-- Type: text/plain, Size: 898 bytes --]

In dequeue_task_fair() we stop the dequeue early when we encounter a parent
entity with additional weight.  However, we then perform a double shares update
on this entity as we continue the shares-update traversal from this point,
despite dequeue_entity() having already updated its queuing cfs_rq.
Avoid this by resuming the traversal from the parent entity.

Signed-off-by: Paul Turner <pjt@google.com>
---
 kernel/sched_fair.c |    3 +++
 1 file changed, 3 insertions(+)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1370,6 +1370,9 @@ static void dequeue_task_fair(struct rq 
 			 */
 			if (task_sleep && parent_entity(se))
 				set_next_buddy(parent_entity(se));
+
+			/* avoid re-evaluating load for this entity */
+			se = parent_entity(se);
 			break;
 		}
 		flags |= DEQUEUE_SLEEP;




* [patch 02/16] sched: hierarchical task accounting for SCHED_OTHER
  2011-06-21  7:16 [patch 00/16] CFS Bandwidth Control v7 Paul Turner
  2011-06-21  7:16 ` [patch 01/16] sched: (fixlet) dont update shares twice on on_rq parent Paul Turner
@ 2011-06-21  7:16 ` Paul Turner
  2011-06-21  7:16 ` [patch 03/16] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 59+ messages in thread
From: Paul Turner @ 2011-06-21  7:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-account_nr_running.patch --]
[-- Type: text/plain, Size: 4530 bytes --]

Introduce hierarchical task accounting for the group scheduling case in CFS, as
well as promote the responsibility for maintaining rq->nr_running to the
scheduling classes.

The primary motivation for this is that with scheduling classes supporting
bandwidth throttling it is possible for entities participating in throttled
sub-trees to not have root-visible changes in rq->nr_running across activate
and de-activate operations.  This in turn leads to incorrect idle and
weight-per-task load-balance decisions.

This also allows us to make a small fixlet to the fastpath in pick_next_task()
under group scheduling.

Note: this issue also exists with the existing sched_rt throttling mechanism.
This patch does not address that.
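
To make the distinction concrete, here is a toy standalone sketch (the group
layout and counts are invented, not taken from the patch): one task directly in
the root plus a child group holding three tasks gives the root cfs_rq two
directly enqueued entities but four hierarchical tasks.

/* standalone sketch (not kernel code) contrasting the two counts */
#include <stdio.h>

struct toy_cfs_rq {
	int nr_running;		/* directly enqueued entities: tasks + groups */
	int h_nr_running;	/* tasks only, across the whole subtree */
};

int main(void)
{
	/* a child group with 3 runnable tasks */
	struct toy_cfs_rq group = { .nr_running = 3, .h_nr_running = 3 };

	/* root cfs_rq: 1 task plus 1 group entity are directly enqueued,
	 * but the hierarchy contains 4 runnable tasks in total */
	struct toy_cfs_rq root = {
		.nr_running   = 1 + 1,
		.h_nr_running = 1 + group.h_nr_running,
	};

	/* rq->nr_running should reflect the hierarchical task count; the
	 * pick_next_task() fast path now compares against h_nr_running */
	printf("root: nr_running=%d h_nr_running=%d\n",
	       root.nr_running, root.h_nr_running);
	return 0;
}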

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c          |    6 ++----
 kernel/sched_fair.c     |   10 ++++++++--
 kernel/sched_rt.c       |    5 ++++-
 kernel/sched_stoptask.c |    2 ++
 4 files changed, 16 insertions(+), 7 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -308,7 +308,7 @@ struct task_group root_task_group;
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
 	struct load_weight load;
-	unsigned long nr_running;
+	unsigned long nr_running, h_nr_running;
 
 	u64 exec_clock;
 	u64 min_vruntime;
@@ -1830,7 +1830,6 @@ static void activate_task(struct rq *rq,
 		rq->nr_uninterruptible--;
 
 	enqueue_task(rq, p, flags);
-	inc_nr_running(rq);
 }
 
 /*
@@ -1842,7 +1841,6 @@ static void deactivate_task(struct rq *r
 		rq->nr_uninterruptible++;
 
 	dequeue_task(rq, p, flags);
-	dec_nr_running(rq);
 }
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
@@ -4194,7 +4192,7 @@ pick_next_task(struct rq *rq)
 	 * Optimization: we know that if all tasks are in
 	 * the fair class we can call that function directly:
 	 */
-	if (likely(rq->nr_running == rq->cfs.nr_running)) {
+	if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
 		p = fair_sched_class.pick_next_task(rq);
 		if (likely(p))
 			return p;
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1332,16 +1332,19 @@ enqueue_task_fair(struct rq *rq, struct 
 			break;
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
+		cfs_rq->h_nr_running++;
 		flags = ENQUEUE_WAKEUP;
 	}
 
 	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		cfs_rq = cfs_rq_of(se);
+		cfs_rq->h_nr_running++;
 
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
+	inc_nr_running(rq);
 	hrtick_update(rq);
 }
 
@@ -1361,6 +1364,7 @@ static void dequeue_task_fair(struct rq 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);
+		cfs_rq->h_nr_running--;
 
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
@@ -1379,12 +1383,14 @@ static void dequeue_task_fair(struct rq 
 	}
 
 	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		cfs_rq = cfs_rq_of(se);
+		cfs_rq->h_nr_running--;
 
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
+	dec_nr_running(rq);
 	hrtick_update(rq);
 }
 
Index: tip/kernel/sched_rt.c
===================================================================
--- tip.orig/kernel/sched_rt.c
+++ tip/kernel/sched_rt.c
@@ -949,6 +949,8 @@ enqueue_task_rt(struct rq *rq, struct ta
 
 	if (!task_current(rq, p) && p->rt.nr_cpus_allowed > 1)
 		enqueue_pushable_task(rq, p);
+
+	inc_nr_running(rq);
 }
 
 static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
@@ -959,6 +961,8 @@ static void dequeue_task_rt(struct rq *r
 	dequeue_rt_entity(rt_se);
 
 	dequeue_pushable_task(rq, p);
+
+	dec_nr_running(rq);
 }
 
 /*
@@ -1851,4 +1855,3 @@ static void print_rt_stats(struct seq_fi
 	rcu_read_unlock();
 }
 #endif /* CONFIG_SCHED_DEBUG */
-
Index: tip/kernel/sched_stoptask.c
===================================================================
--- tip.orig/kernel/sched_stoptask.c
+++ tip/kernel/sched_stoptask.c
@@ -34,11 +34,13 @@ static struct task_struct *pick_next_tas
 static void
 enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
 {
+	inc_nr_running(rq);
 }
 
 static void
 dequeue_task_stop(struct rq *rq, struct task_struct *p, int flags)
 {
+	dec_nr_running(rq);
 }
 
 static void yield_task_stop(struct rq *rq)




* [patch 03/16] sched: introduce primitives to account for CFS bandwidth tracking
  2011-06-21  7:16 [patch 00/16] CFS Bandwidth Control v7 Paul Turner
  2011-06-21  7:16 ` [patch 01/16] sched: (fixlet) dont update shares twice on on_rq parent Paul Turner
  2011-06-21  7:16 ` [patch 02/16] sched: hierarchical task accounting for SCHED_OTHER Paul Turner
@ 2011-06-21  7:16 ` Paul Turner
  2011-06-22 10:52   ` Peter Zijlstra
  2011-06-21  7:16 ` [patch 04/16] sched: validate CFS quota hierarchies Paul Turner
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 59+ messages in thread
From: Paul Turner @ 2011-06-21  7:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Nikhil Rao

[-- Attachment #1: sched-bwc-add_cfs_tg_bandwidth.patch --]
[-- Type: text/plain, Size: 9996 bytes --]

In this patch we introduce the notion of CFS bandwidth, partitioned into
globally unassigned bandwidth and locally claimed bandwidth.

- The global bandwidth is per task_group; it represents a pool of unclaimed
  bandwidth that cfs_rqs can allocate from.
- The local bandwidth is tracked per cfs_rq; it represents allotments from the
  global pool that have been assigned to a specific cpu.

Bandwidth is managed via cgroupfs, adding two new interfaces to the cpu subsystem:
- cpu.cfs_period_us : the bandwidth period in usecs
- cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be allowed
  to consume over the period above.
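
For illustration, a minimal userspace sketch of driving these two files; the
cgroup mount point and group name are assumptions, and the values simply grant
the group 50ms of runtime per 100ms period (writing -1 to cpu.cfs_quota_us
would restore the unconstrained state):

/* hypothetical usage sketch: mount point and group name are assumptions */
#include <stdio.h>

static int write_val(const char *path, long long val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%lld\n", val);
	return fclose(f);
}

int main(void)
{
	const char *grp = "/sys/fs/cgroup/cpu/limited";
	char path[256];

	/* 100ms period (matches the in-kernel default) ... */
	snprintf(path, sizeof(path), "%s/cpu.cfs_period_us", grp);
	if (write_val(path, 100000))
		return 1;

	/* ... and 50ms of quota per period: half of one cpu's worth of
	 * runtime; -1 here would remove the limit (RUNTIME_INF) */
	snprintf(path, sizeof(path), "%s/cpu.cfs_quota_us", grp);
	if (write_val(path, 50000))
		return 1;

	return 0;
}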

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 init/Kconfig        |   13 +++
 kernel/sched.c      |  194 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched_fair.c |   16 ++++
 3 files changed, 219 insertions(+), 4 deletions(-)

Index: tip/init/Kconfig
===================================================================
--- tip.orig/init/Kconfig
+++ tip/init/Kconfig
@@ -715,6 +715,19 @@ config FAIR_GROUP_SCHED
 	depends on CGROUP_SCHED
 	default CGROUP_SCHED
 
+config CFS_BANDWIDTH
+	bool "CPU bandwidth provisioning for FAIR_GROUP_SCHED"
+	depends on EXPERIMENTAL
+	depends on FAIR_GROUP_SCHED
+	depends on SMP
+	default n
+	help
+	  This option allows users to define CPU bandwidth rates (limits) for
+	  tasks running within the fair group scheduler.  Groups with no limit
+	  set are considered to be unconstrained and will run with no
+	  restriction.
+	  See tip/Documentation/scheduler/sched-bwc.txt for more information.
+
 config RT_GROUP_SCHED
 	bool "Group scheduling for SCHED_RR/FIFO"
 	depends on EXPERIMENTAL
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -244,6 +244,14 @@ struct cfs_rq;
 
 static LIST_HEAD(task_groups);
 
+struct cfs_bandwidth {
+#ifdef CONFIG_CFS_BANDWIDTH
+	raw_spinlock_t lock;
+	ktime_t period;
+	u64 quota;
+#endif
+};
+
 /* task group related information */
 struct task_group {
 	struct cgroup_subsys_state css;
@@ -275,6 +283,8 @@ struct task_group {
 #ifdef CONFIG_SCHED_AUTOGROUP
 	struct autogroup *autogroup;
 #endif
+
+	struct cfs_bandwidth cfs_bandwidth;
 };
 
 /* task_group_lock serializes the addition/removal of task groups */
@@ -374,9 +384,46 @@ struct cfs_rq {
 
 	unsigned long load_contribution;
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	int runtime_enabled;
+	s64 runtime_remaining;
+#endif
 #endif
 };
 
+#ifdef CONFIG_CFS_BANDWIDTH
+static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
+{
+	return &tg->cfs_bandwidth;
+}
+
+static inline u64 default_cfs_period(void);
+
+static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+	raw_spin_lock_init(&cfs_b->lock);
+	cfs_b->quota = RUNTIME_INF;
+	cfs_b->period = ns_to_ktime(default_cfs_period());
+}
+
+static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	cfs_rq->runtime_enabled = 0;
+}
+
+static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{}
+#else
+static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
+static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
+static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
+
+static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
+{
+	return NULL;
+}
+#endif /* CONFIG_CFS_BANDWIDTH */
+
 /* Real-Time classes' related field in a runqueue: */
 struct rt_rq {
 	struct rt_prio_array active;
@@ -7802,6 +7849,7 @@ static void init_tg_cfs_entry(struct tas
 	tg->cfs_rq[cpu] = cfs_rq;
 	init_cfs_rq(cfs_rq, rq);
 	cfs_rq->tg = tg;
+	init_cfs_rq_runtime(cfs_rq);
 
 	tg->se[cpu] = se;
 	/* se could be NULL for root_task_group */
@@ -7937,6 +7985,7 @@ void __init sched_init(void)
 		 * We achieve this by letting root_task_group's tasks sit
 		 * directly in rq->cfs (i.e root_task_group->se[] = NULL).
 		 */
+		init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
 		init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -8180,6 +8229,8 @@ static void free_fair_sched_group(struct
 {
 	int i;
 
+	destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
+
 	for_each_possible_cpu(i) {
 		if (tg->cfs_rq)
 			kfree(tg->cfs_rq[i]);
@@ -8207,6 +8258,8 @@ int alloc_fair_sched_group(struct task_g
 
 	tg->shares = NICE_0_LOAD;
 
+	init_cfs_bandwidth(tg_cfs_bandwidth(tg));
+
 	for_each_possible_cpu(i) {
 		cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
 				      GFP_KERNEL, cpu_to_node(i));
@@ -8581,7 +8634,7 @@ static int __rt_schedulable(struct task_
 	return walk_tg_tree(tg_schedulable, tg_nop, &data);
 }
 
-static int tg_set_bandwidth(struct task_group *tg,
+static int tg_set_rt_bandwidth(struct task_group *tg,
 		u64 rt_period, u64 rt_runtime)
 {
 	int i, err = 0;
@@ -8620,7 +8673,7 @@ int sched_group_set_rt_runtime(struct ta
 	if (rt_runtime_us < 0)
 		rt_runtime = RUNTIME_INF;
 
-	return tg_set_bandwidth(tg, rt_period, rt_runtime);
+	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
 
 long sched_group_rt_runtime(struct task_group *tg)
@@ -8645,7 +8698,7 @@ int sched_group_set_rt_period(struct tas
 	if (rt_period == 0)
 		return -EINVAL;
 
-	return tg_set_bandwidth(tg, rt_period, rt_runtime);
+	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
 
 long sched_group_rt_period(struct task_group *tg)
@@ -8835,6 +8888,128 @@ static u64 cpu_shares_read_u64(struct cg
 
 	return (u64) scale_load_down(tg->shares);
 }
+
+#ifdef CONFIG_CFS_BANDWIDTH
+const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
+const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
+
+static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
+{
+	int i;
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+	static DEFINE_MUTEX(mutex);
+
+	if (tg == &root_task_group)
+		return -EINVAL;
+
+	/*
+	 * Ensure we have at least some amount of bandwidth every period.  This is
+	 * to prevent reaching a state of large arrears when throttled via
+	 * entity_tick() resulting in prolonged exit starvation.
+	 */
+	if (quota < min_cfs_quota_period || period < min_cfs_quota_period)
+		return -EINVAL;
+
+	/*
+	 * Likewise, bound things on the otherside by preventing insane quota
+	 * periods.  This also allows us to normalize in computing quota
+	 * feasibility.
+	 */
+	if (period > max_cfs_quota_period)
+		return -EINVAL;
+
+	mutex_lock(&mutex);
+	raw_spin_lock_irq(&cfs_b->lock);
+	cfs_b->period = ns_to_ktime(period);
+	cfs_b->quota = quota;
+	raw_spin_unlock_irq(&cfs_b->lock);
+
+	for_each_possible_cpu(i) {
+		struct cfs_rq *cfs_rq = tg->cfs_rq[i];
+		struct rq *rq = rq_of(cfs_rq);
+
+		raw_spin_lock_irq(&rq->lock);
+		cfs_rq->runtime_enabled = quota != RUNTIME_INF;
+		cfs_rq->runtime_remaining = 0;
+		raw_spin_unlock_irq(&rq->lock);
+	}
+	mutex_unlock(&mutex);
+
+	return 0;
+}
+
+int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
+{
+	u64 quota, period;
+
+	period = ktime_to_ns(tg_cfs_bandwidth(tg)->period);
+	if (cfs_quota_us < 0)
+		quota = RUNTIME_INF;
+	else
+		quota = (u64)cfs_quota_us * NSEC_PER_USEC;
+
+	return tg_set_cfs_bandwidth(tg, period, quota);
+}
+
+long tg_get_cfs_quota(struct task_group *tg)
+{
+	u64 quota_us;
+
+	if (tg_cfs_bandwidth(tg)->quota == RUNTIME_INF)
+		return -1;
+
+	quota_us = tg_cfs_bandwidth(tg)->quota;
+	do_div(quota_us, NSEC_PER_USEC);
+
+	return quota_us;
+}
+
+int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
+{
+	u64 quota, period;
+
+	period = (u64)cfs_period_us * NSEC_PER_USEC;
+	quota = tg_cfs_bandwidth(tg)->quota;
+
+	if (period <= 0)
+		return -EINVAL;
+
+	return tg_set_cfs_bandwidth(tg, period, quota);
+}
+
+long tg_get_cfs_period(struct task_group *tg)
+{
+	u64 cfs_period_us;
+
+	cfs_period_us = ktime_to_ns(tg_cfs_bandwidth(tg)->period);
+	do_div(cfs_period_us, NSEC_PER_USEC);
+
+	return cfs_period_us;
+}
+
+static s64 cpu_cfs_quota_read_s64(struct cgroup *cgrp, struct cftype *cft)
+{
+	return tg_get_cfs_quota(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_quota_write_s64(struct cgroup *cgrp, struct cftype *cftype,
+				s64 cfs_quota_us)
+{
+	return tg_set_cfs_quota(cgroup_tg(cgrp), cfs_quota_us);
+}
+
+static u64 cpu_cfs_period_read_u64(struct cgroup *cgrp, struct cftype *cft)
+{
+	return tg_get_cfs_period(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_period_write_u64(struct cgroup *cgrp, struct cftype *cftype,
+				u64 cfs_period_us)
+{
+	return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
+}
+
+#endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -8869,6 +9044,18 @@ static struct cftype cpu_files[] = {
 		.write_u64 = cpu_shares_write_u64,
 	},
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		.name = "cfs_quota_us",
+		.read_s64 = cpu_cfs_quota_read_s64,
+		.write_s64 = cpu_cfs_quota_write_s64,
+	},
+	{
+		.name = "cfs_period_us",
+		.read_u64 = cpu_cfs_period_read_u64,
+		.write_u64 = cpu_cfs_period_write_u64,
+	},
+#endif
 #ifdef CONFIG_RT_GROUP_SCHED
 	{
 		.name = "rt_runtime_us",
@@ -9178,4 +9365,3 @@ struct cgroup_subsys cpuacct_subsys = {
 	.subsys_id = cpuacct_subsys_id,
 };
 #endif	/* CONFIG_CGROUP_CPUACCT */
-
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1256,6 +1256,22 @@ entity_tick(struct cfs_rq *cfs_rq, struc
 		check_preempt_tick(cfs_rq, curr);
 }
 
+
+/**************************************************
+ * CFS bandwidth control machinery
+ */
+
+#ifdef CONFIG_CFS_BANDWIDTH
+/*
+ * default period for cfs group bandwidth.
+ * default: 0.1s, units: nanoseconds
+ */
+static inline u64 default_cfs_period(void)
+{
+	return 100000000ULL;
+}
+#endif
+
 /**************************************************
  * CFS operations on tasks:
  */




* [patch 04/16] sched: validate CFS quota hierarchies
  2011-06-21  7:16 [patch 00/16] CFS Bandwidth Control v7 Paul Turner
                   ` (2 preceding siblings ...)
  2011-06-21  7:16 ` [patch 03/16] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
@ 2011-06-21  7:16 ` Paul Turner
  2011-06-22  5:43   ` Bharata B Rao
  2011-06-22  9:38   ` Hidetoshi Seto
  2011-06-21  7:16 ` [patch 05/16] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth Paul Turner
                   ` (12 subsequent siblings)
  16 siblings, 2 replies; 59+ messages in thread
From: Paul Turner @ 2011-06-21  7:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-consistent_quota.patch --]
[-- Type: text/plain, Size: 5310 bytes --]

Add constraints validation for CFS bandwidth hierarchies.

Validate that:
   max(child bandwidth) <= parent_bandwidth

In a quota-limited hierarchy, an unconstrained entity
(e.g. bandwidth==RUNTIME_INF) inherits the bandwidth of its parent.

This constraint is chosen over sum(child_bandwidth) as the notion of
over-commit is valuable within SCHED_OTHER.  Some basic code from the RT case
is re-factored for reuse.
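
To make the check concrete, a small standalone sketch with invented values (the
RUNTIME_INF case is omitted), using the same fixed-point encoding as the
to_ratio() helper below, i.e. bandwidth expressed as runtime/period scaled by
2^20:

/* standalone sketch of the constraint check; values are invented */
#include <stdio.h>
#include <stdint.h>

static uint64_t to_ratio(uint64_t period, uint64_t runtime)
{
	/* same encoding as the kernel helper: runtime/period in 2^-20 units */
	return (runtime << 20) / period;
}

int main(void)
{
	uint64_t parent = to_ratio(100000, 50000);	/* 50ms / 100ms -> 0.50 */
	uint64_t child  = to_ratio(250000, 100000);	/* 100ms / 250ms -> 0.40 */

	/* the hierarchy is accepted only if max(child) <= parent */
	printf("parent=%llu child=%llu -> %s\n",
	       (unsigned long long)parent, (unsigned long long)child,
	       child <= parent ? "schedulable" : "-EINVAL");
	return 0;
}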

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched.c |  109 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 96 insertions(+), 13 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -249,6 +249,7 @@ struct cfs_bandwidth {
 	raw_spinlock_t lock;
 	ktime_t period;
 	u64 quota;
+	s64 hierarchal_quota;
 #endif
 };
 
@@ -8534,12 +8535,7 @@ unsigned long sched_group_shares(struct 
 }
 #endif
 
-#ifdef CONFIG_RT_GROUP_SCHED
-/*
- * Ensure that the real time constraints are schedulable.
- */
-static DEFINE_MUTEX(rt_constraints_mutex);
-
+#if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_CFS_BANDWIDTH)
 static unsigned long to_ratio(u64 period, u64 runtime)
 {
 	if (runtime == RUNTIME_INF)
@@ -8547,6 +8543,13 @@ static unsigned long to_ratio(u64 period
 
 	return div64_u64(runtime << 20, period);
 }
+#endif
+
+#ifdef CONFIG_RT_GROUP_SCHED
+/*
+ * Ensure that the real time constraints are schedulable.
+ */
+static DEFINE_MUTEX(rt_constraints_mutex);
 
 /* Must be called with tasklist_lock held */
 static inline int tg_has_rt_tasks(struct task_group *tg)
@@ -8567,7 +8570,7 @@ struct rt_schedulable_data {
 	u64 rt_runtime;
 };
 
-static int tg_schedulable(struct task_group *tg, void *data)
+static int tg_rt_schedulable(struct task_group *tg, void *data)
 {
 	struct rt_schedulable_data *d = data;
 	struct task_group *child;
@@ -8631,7 +8634,7 @@ static int __rt_schedulable(struct task_
 		.rt_runtime = runtime,
 	};
 
-	return walk_tg_tree(tg_schedulable, tg_nop, &data);
+	return walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
 }
 
 static int tg_set_rt_bandwidth(struct task_group *tg,
@@ -8890,14 +8893,17 @@ static u64 cpu_shares_read_u64(struct cg
 }
 
 #ifdef CONFIG_CFS_BANDWIDTH
+static DEFINE_MUTEX(cfs_constraints_mutex);
+
 const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
 const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
 
+static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
+
 static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 {
-	int i;
+	int i, ret = 0;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	static DEFINE_MUTEX(mutex);
 
 	if (tg == &root_task_group)
 		return -EINVAL;
@@ -8918,7 +8924,11 @@ static int tg_set_cfs_bandwidth(struct t
 	if (period > max_cfs_quota_period)
 		return -EINVAL;
 
-	mutex_lock(&mutex);
+	mutex_lock(&cfs_constraints_mutex);
+	ret = __cfs_schedulable(tg, period, quota);
+	if (ret)
+		goto out_unlock;
+
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
@@ -8933,9 +8943,10 @@ static int tg_set_cfs_bandwidth(struct t
 		cfs_rq->runtime_remaining = 0;
 		raw_spin_unlock_irq(&rq->lock);
 	}
-	mutex_unlock(&mutex);
+out_unlock:
+	mutex_unlock(&cfs_constraints_mutex);
 
-	return 0;
+	return ret;
 }
 
 int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
@@ -9009,6 +9020,78 @@ static int cpu_cfs_period_write_u64(stru
 	return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
 }
 
+struct cfs_schedulable_data {
+	struct task_group *tg;
+	u64 period, quota;
+};
+
+/*
+ * normalize group quota/period to be quota/max_period
+ * note: units are usecs
+ */
+static u64 normalize_cfs_quota(struct task_group *tg,
+			       struct cfs_schedulable_data *d)
+{
+	u64 quota, period;
+
+	if (tg == d->tg) {
+		period = d->period;
+		quota = d->quota;
+	} else {
+		period = tg_get_cfs_period(tg);
+		quota = tg_get_cfs_quota(tg);
+	}
+
+	/* note: these should typically be equivalent */
+	if (quota == RUNTIME_INF || quota == -1)
+		return RUNTIME_INF;
+
+	return to_ratio(period, quota);
+}
+
+static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
+{
+	struct cfs_schedulable_data *d = data;
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+	s64 quota = 0, parent_quota = -1;
+
+	if (!tg->parent) {
+		quota = RUNTIME_INF;
+	} else {
+		struct cfs_bandwidth *parent_b = tg_cfs_bandwidth(tg->parent);
+
+		quota = normalize_cfs_quota(tg, d);
+		parent_quota = parent_b->hierarchal_quota;
+
+		/*
+		 * ensure max(child_quota) <= parent_quota, inherit when no
+		 * limit is set
+		 */
+		if (quota == RUNTIME_INF)
+			quota = parent_quota;
+		else if (parent_quota != RUNTIME_INF && quota > parent_quota)
+			return -EINVAL;
+	}
+	cfs_b->hierarchal_quota = quota;
+
+	return 0;
+}
+
+static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
+{
+	struct cfs_schedulable_data data = {
+		.tg = tg,
+		.period = period,
+		.quota = quota,
+	};
+
+	if (quota != RUNTIME_INF) {
+		do_div(data.period, NSEC_PER_USEC);
+		do_div(data.quota, NSEC_PER_USEC);
+	}
+
+	return walk_tg_tree(tg_cfs_schedulable_down, tg_nop, &data);
+}
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 




* [patch 05/16] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth
  2011-06-21  7:16 [patch 00/16] CFS Bandwidth Control v7 Paul Turner
                   ` (3 preceding siblings ...)
  2011-06-21  7:16 ` [patch 04/16] sched: validate CFS quota hierarchies Paul Turner
@ 2011-06-21  7:16 ` Paul Turner
  2011-06-21  7:16 ` [patch 06/16] sched: add a timer to handle CFS bandwidth refresh Paul Turner
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 59+ messages in thread
From: Paul Turner @ 2011-06-21  7:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Nikhil Rao

[-- Attachment #1: sched-bwc-account_cfs_rq_runtime.patch --]
[-- Type: text/plain, Size: 5990 bytes --]

Account bandwidth usage at the cfs_rq level versus the task_groups to which
they belong.  Whether we are tracking bandwidth on a given cfs_rq is maintained
under cfs_rq->runtime_enabled.

cfs_rqs which belong to a bandwidth-constrained task_group have their runtime
accounted via the update_curr() path, which withdraws bandwidth from the global
pool as desired.  Updates involving the global pool are currently protected
under cfs_bandwidth->lock; local runtime is protected by rq->lock.

This patch only assigns and tracks quota; no action is taken in the case that
cfs_rq->runtime_used exceeds cfs_rq->runtime_assigned.
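
As a rough userspace model of the slice mechanism (invented numbers; locking,
expiration and the unconstrained case are ignored), a cfs_rq tops its local
pool back up to one sysctl_sched_cfs_bandwidth_slice worth of runtime whenever
it runs dry:

/* minimal model of the local/global transfer in assign_cfs_rq_runtime() */
#include <stdio.h>
#include <stdint.h>

#define SLICE_NS	(5000LL * 1000)	/* 5ms default slice, in nanoseconds */

struct global_pool { int64_t runtime; };		/* cfs_b->runtime */
struct local_pool  { int64_t runtime_remaining; };	/* cfs_rq->runtime_remaining */

static void assign_runtime(struct global_pool *g, struct local_pool *l)
{
	/* top the local pool back up to one slice, bounded by what is left */
	int64_t want = SLICE_NS - l->runtime_remaining;
	int64_t amount = want < g->runtime ? want : g->runtime;

	if (amount > 0) {
		g->runtime -= amount;
		l->runtime_remaining += amount;
	}
}

int main(void)
{
	struct global_pool g = { .runtime = 20000000 };	/* 20ms quota this period */
	struct local_pool l = { .runtime_remaining = 0 };

	assign_runtime(&g, &l);		/* draws an initial 5ms slice */
	l.runtime_remaining -= 7000000;	/* this cpu then ran for 7ms: now -2ms */
	assign_runtime(&g, &l);		/* draws 7ms to get back to a full slice */

	printf("local=%lld global=%lld\n",
	       (long long)l.runtime_remaining, (long long)g.runtime);
	return 0;
}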

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>


---
 include/linux/sched.h |    4 ++
 kernel/sched.c        |    3 +-
 kernel/sched_fair.c   |   72 ++++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sysctl.c       |   10 ++++++
 4 files changed, 86 insertions(+), 3 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -89,6 +89,20 @@ const_debug unsigned int sysctl_sched_mi
  */
 unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
 
+#ifdef CONFIG_CFS_BANDWIDTH
+/*
+ * Amount of runtime to allocate from global (tg) to local (per-cfs_rq) pool
+ * each time a cfs_rq requests quota.
+ *
+ * Note: in the case that the slice exceeds the runtime remaining (either due
+ * to consumption or the quota being specified to be smaller than the slice)
+ * we will always only issue the remaining available time.
+ *
+ * default: 5 msec, units: microseconds
+  */
+unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
+#endif
+
 static const struct sched_class fair_sched_class;
 
 /**************************************************************
@@ -305,6 +319,8 @@ find_matching_se(struct sched_entity **s
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+				   unsigned long delta_exec);
 
 /**************************************************************
  * Scheduling class tree data structure manipulation methods:
@@ -602,6 +618,8 @@ static void update_curr(struct cfs_rq *c
 		cpuacct_charge(curtask, delta_exec);
 		account_group_exec_runtime(curtask, delta_exec);
 	}
+
+	account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
 
 static inline void
@@ -1270,6 +1288,48 @@ static inline u64 default_cfs_period(voi
 {
 	return 100000000ULL;
 }
+
+static inline u64 sched_cfs_bandwidth_slice(void)
+{
+	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
+}
+
+static void assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	struct task_group *tg = cfs_rq->tg;
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+	u64 amount = 0, min_amount;
+
+	/* note: this is a positive sum, runtime_remaining <= 0 */
+	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
+
+	raw_spin_lock(&cfs_b->lock);
+	if (cfs_b->quota == RUNTIME_INF)
+		amount = min_amount;
+	else if (cfs_b->runtime > 0) {
+		amount = min(cfs_b->runtime, min_amount);
+		cfs_b->runtime -= amount;
+	}
+	raw_spin_unlock(&cfs_b->lock);
+
+	cfs_rq->runtime_remaining += amount;
+}
+
+static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+		unsigned long delta_exec)
+{
+	if (!cfs_rq->runtime_enabled)
+		return;
+
+	cfs_rq->runtime_remaining -= delta_exec;
+	if (cfs_rq->runtime_remaining > 0)
+		return;
+
+	assign_cfs_rq_runtime(cfs_rq);
+}
+#else
+static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+		unsigned long delta_exec) {}
 #endif
 
 /**************************************************
@@ -4264,8 +4324,16 @@ static void set_curr_task_fair(struct rq
 {
 	struct sched_entity *se = &rq->curr->se;
 
-	for_each_sched_entity(se)
-		set_next_entity(cfs_rq_of(se), se);
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		set_next_entity(cfs_rq, se);
+		/*
+		 * if bandwidth is enabled, make sure it is up-to-date or
+		 * reschedule for the case of a move into a throttled cpu.
+		 */
+		account_cfs_rq_runtime(cfs_rq, 0);
+	}
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
Index: tip/kernel/sysctl.c
===================================================================
--- tip.orig/kernel/sysctl.c
+++ tip/kernel/sysctl.c
@@ -379,6 +379,16 @@ static struct ctl_table kern_table[] = {
 		.extra2		= &one,
 	},
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		.procname	= "sched_cfs_bandwidth_slice_us",
+		.data		= &sysctl_sched_cfs_bandwidth_slice,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &one,
+	},
+#endif
 #ifdef CONFIG_PROVE_LOCKING
 	{
 		.procname	= "prove_locking",
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -248,7 +248,7 @@ struct cfs_bandwidth {
 #ifdef CONFIG_CFS_BANDWIDTH
 	raw_spinlock_t lock;
 	ktime_t period;
-	u64 quota;
+	u64 quota, runtime;
 	s64 hierarchal_quota;
 #endif
 };
@@ -403,6 +403,7 @@ static inline u64 default_cfs_period(voi
 static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
 	raw_spin_lock_init(&cfs_b->lock);
+	cfs_b->runtime = 0;
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 }
Index: tip/include/linux/sched.h
===================================================================
--- tip.orig/include/linux/sched.h
+++ tip/include/linux/sched.h
@@ -2012,6 +2012,10 @@ static inline void sched_autogroup_fork(
 static inline void sched_autogroup_exit(struct signal_struct *sig) { }
 #endif
 
+#ifdef CONFIG_CFS_BANDWIDTH
+extern unsigned int sysctl_sched_cfs_bandwidth_slice;
+#endif
+
 #ifdef CONFIG_RT_MUTEXES
 extern int rt_mutex_getprio(struct task_struct *p);
 extern void rt_mutex_setprio(struct task_struct *p, int prio);




* [patch 06/16] sched: add a timer to handle CFS bandwidth refresh
  2011-06-21  7:16 [patch 00/16] CFS Bandwidth Control v7 Paul Turner
                   ` (4 preceding siblings ...)
  2011-06-21  7:16 ` [patch 05/16] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth Paul Turner
@ 2011-06-21  7:16 ` Paul Turner
  2011-06-22  9:38   ` Hidetoshi Seto
  2011-06-21  7:16 ` [patch 07/16] sched: expire invalid runtime Paul Turner
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 59+ messages in thread
From: Paul Turner @ 2011-06-21  7:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-bandwidth_timers.patch --]
[-- Type: text/plain, Size: 5755 bytes --]

This patch adds a per-task_group timer which handles the refresh of the global
CFS bandwidth pool.

Since the RT pool is using a similar timer there's some small refactoring to
share this support.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c      |   87 ++++++++++++++++++++++++++++++++++++++++------------
 kernel/sched_fair.c |   36 +++++++++++++++++++--
 2 files changed, 101 insertions(+), 22 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -193,10 +193,28 @@ static inline int rt_bandwidth_enabled(v
 	return sysctl_sched_rt_runtime >= 0;
 }
 
-static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
+static void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
 {
-	ktime_t now;
+	unsigned long delta;
+	ktime_t soft, hard, now;
+
+	for (;;) {
+		if (hrtimer_active(period_timer))
+			break;
+
+		now = hrtimer_cb_get_time(period_timer);
+		hrtimer_forward(period_timer, now, period);
+
+		soft = hrtimer_get_softexpires(period_timer);
+		hard = hrtimer_get_expires(period_timer);
+		delta = ktime_to_ns(ktime_sub(hard, soft));
+		__hrtimer_start_range_ns(period_timer, soft, delta,
+					 HRTIMER_MODE_ABS_PINNED, 0);
+	}
+}
 
+static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
+{
 	if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
 		return;
 
@@ -204,22 +222,7 @@ static void start_rt_bandwidth(struct rt
 		return;
 
 	raw_spin_lock(&rt_b->rt_runtime_lock);
-	for (;;) {
-		unsigned long delta;
-		ktime_t soft, hard;
-
-		if (hrtimer_active(&rt_b->rt_period_timer))
-			break;
-
-		now = hrtimer_cb_get_time(&rt_b->rt_period_timer);
-		hrtimer_forward(&rt_b->rt_period_timer, now, rt_b->rt_period);
-
-		soft = hrtimer_get_softexpires(&rt_b->rt_period_timer);
-		hard = hrtimer_get_expires(&rt_b->rt_period_timer);
-		delta = ktime_to_ns(ktime_sub(hard, soft));
-		__hrtimer_start_range_ns(&rt_b->rt_period_timer, soft, delta,
-				HRTIMER_MODE_ABS_PINNED, 0);
-	}
+	start_bandwidth_timer(&rt_b->rt_period_timer, rt_b->rt_period);
 	raw_spin_unlock(&rt_b->rt_runtime_lock);
 }
 
@@ -250,6 +253,9 @@ struct cfs_bandwidth {
 	ktime_t period;
 	u64 quota, runtime;
 	s64 hierarchal_quota;
+
+	int idle, timer_active;
+	struct hrtimer period_timer;
 #endif
 };
 
@@ -399,6 +405,28 @@ static inline struct cfs_bandwidth *tg_c
 }
 
 static inline u64 default_cfs_period(void);
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
+
+static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
+{
+	struct cfs_bandwidth *cfs_b =
+		container_of(timer, struct cfs_bandwidth, period_timer);
+	ktime_t now;
+	int overrun;
+	int idle = 0;
+
+	for (;;) {
+		now = hrtimer_cb_get_time(timer);
+		overrun = hrtimer_forward(timer, now, cfs_b->period);
+
+		if (!overrun)
+			break;
+
+		idle = do_sched_cfs_period_timer(cfs_b, overrun);
+	}
+
+	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
+}
 
 static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
@@ -406,6 +434,9 @@ static void init_cfs_bandwidth(struct cf
 	cfs_b->runtime = 0;
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
+
+	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	cfs_b->period_timer.function = sched_cfs_period_timer;
 }
 
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
@@ -413,8 +444,26 @@ static void init_cfs_rq_runtime(struct c
 	cfs_rq->runtime_enabled = 0;
 }
 
+/* requires cfs_b->lock */
+static void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+	/*
+	 * Handle the extremely unlikely case of trying to start the period
+	 * timer, while that timer is in the tear-down path from having
+	 * decided to no longer run.  In this case we must wait for the
+	 * (tail of the) callback to terminate so that we can re-enqueue it.
+	 */
+	if (unlikely(hrtimer_active(&cfs_b->period_timer)))
+		hrtimer_cancel(&cfs_b->period_timer);
+
+	cfs_b->timer_active = 1;
+	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
+}
+
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
-{}
+{
+	hrtimer_cancel(&cfs_b->period_timer);
+}
 #else
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1306,9 +1306,16 @@ static void assign_cfs_rq_runtime(struct
 	raw_spin_lock(&cfs_b->lock);
 	if (cfs_b->quota == RUNTIME_INF)
 		amount = min_amount;
-	else if (cfs_b->runtime > 0) {
-		amount = min(cfs_b->runtime, min_amount);
-		cfs_b->runtime -= amount;
+	else {
+		/* ensure bandwidth timer remains active under consumption */
+		if (!cfs_b->timer_active)
+			__start_cfs_bandwidth(cfs_b);
+
+		if (cfs_b->runtime > 0) {
+			amount = min(cfs_b->runtime, min_amount);
+			cfs_b->runtime -= amount;
+			cfs_b->idle = 0;
+		}
 	}
 	raw_spin_unlock(&cfs_b->lock);
 
@@ -1327,6 +1334,29 @@ static void account_cfs_rq_runtime(struc
 
 	assign_cfs_rq_runtime(cfs_rq);
 }
+
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
+{
+	u64 quota, runtime = 0;
+	int idle = 1;
+
+	raw_spin_lock(&cfs_b->lock);
+	quota = cfs_b->quota;
+
+	if (quota != RUNTIME_INF) {
+		runtime = quota;
+		cfs_b->runtime = runtime;
+
+		idle = cfs_b->idle;
+		cfs_b->idle = 1;
+	}
+
+	if (idle)
+		cfs_b->timer_active = 0;
+	raw_spin_unlock(&cfs_b->lock);
+
+	return idle;
+}
 #else
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 		unsigned long delta_exec) {}




* [patch 07/16] sched: expire invalid runtime
  2011-06-21  7:16 [patch 00/16] CFS Bandwidth Control v7 Paul Turner
                   ` (5 preceding siblings ...)
  2011-06-21  7:16 ` [patch 06/16] sched: add a timer to handle CFS bandwidth refresh Paul Turner
@ 2011-06-21  7:16 ` Paul Turner
  2011-06-22  9:38   ` Hidetoshi Seto
  2011-06-22 15:47   ` Peter Zijlstra
  2011-06-21  7:16 ` [patch 08/16] sched: throttle cfs_rq entities which exceed their local runtime Paul Turner
                   ` (9 subsequent siblings)
  16 siblings, 2 replies; 59+ messages in thread
From: Paul Turner @ 2011-06-21  7:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-expire_cfs_rq_runtime.patch --]
[-- Type: text/plain, Size: 6072 bytes --]

Since quota is managed using global state but consumed on a per-cpu basis
we need to ensure that our per-cpu state is appropriately synchronized.
Most importantly, runtime that is stale (from a previous period) should not be
locally consumable.

We take advantage of existing sched_clock synchronization about the jiffy to
efficiently detect whether we have (globally) crossed a quota boundary.

One catch is that the direction of spread on sched_clock is undefined;
specifically, we don't know whether our local clock is behind or ahead
of the one responsible for the current expiration time.

Fortunately we can differentiate these cases by considering whether the
global deadline has advanced.  If it has not, then we assume our clock to be
"fast" and advance our local expiration; otherwise, we know the deadline has
truly passed and we expire our local runtime.
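
The decision itself is compact; the following is a standalone userspace model
of it with invented times (HZ=1000 assumed), mirroring the names used in the
patch:

/* userspace model of the expiry decision, not the kernel code itself */
#include <stdint.h>
#include <stdio.h>

#define TICK_NSEC 1000000ULL	/* assume HZ=1000 for the example */

struct model {
	int64_t  runtime_remaining;	/* cfs_rq->runtime_remaining */
	uint64_t local_expires;		/* cfs_rq->runtime_expires */
	uint64_t global_expires;	/* cfs_b->runtime_expires */
};

static void expire_runtime(struct model *m, uint64_t rq_clock)
{
	if (m->runtime_remaining < 0)
		return;

	/* local deadline still ahead of our clock: nothing to do */
	if ((int64_t)(rq_clock - m->local_expires) < 0)
		return;

	if ((int64_t)(m->local_expires - m->global_expires) >= 0) {
		/* global deadline hasn't advanced: our clock is merely
		 * "fast", so nudge the local deadline forward */
		m->local_expires += TICK_NSEC;
	} else {
		/* global deadline advanced: the period truly ended */
		m->runtime_remaining = 0;
	}
}

int main(void)
{
	/* clock is 2ms past the local deadline and the global pool has
	 * already been refreshed (its deadline moved on): runtime expires */
	struct model m = {
		.runtime_remaining = 3000000,
		.local_expires     = 100000000,
		.global_expires    = 200000000,
	};

	expire_runtime(&m, 102000000);
	printf("runtime_remaining=%lld\n", (long long)m.runtime_remaining);
	return 0;
}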

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c      |    6 ++-
 kernel/sched_fair.c |   87 ++++++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 82 insertions(+), 11 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1294,11 +1294,28 @@ static inline u64 sched_cfs_bandwidth_sl
 	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
 }
 
+/*
+ * replenish runtime according to assigned quota and update expiration time
+ *
+ * requires cfs_b->lock
+ */
+static void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
+{
+	u64 now;
+
+	if (cfs_b->quota == RUNTIME_INF)
+		return;
+
+	cfs_b->runtime = cfs_b->quota;
+	now = sched_clock_cpu(smp_processor_id());
+	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
+}
+
 static void assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	u64 amount = 0, min_amount;
+	u64 amount = 0, min_amount, expires;
 
 	/* note: this is a positive sum, runtime_remaining <= 0 */
 	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -1307,9 +1324,16 @@ static void assign_cfs_rq_runtime(struct
 	if (cfs_b->quota == RUNTIME_INF)
 		amount = min_amount;
 	else {
-		/* ensure bandwidth timer remains active under consumption */
-		if (!cfs_b->timer_active)
+		/*
+		 * If the bandwidth pool has become inactive, then at least one
+		 * period must have elapsed since the last consumption.
+		 * Refresh the global state and ensure bandwidth timer becomes
+		 * active.
+		 */
+		if (!cfs_b->timer_active) {
+			__refill_cfs_bandwidth_runtime(cfs_b);
 			__start_cfs_bandwidth(cfs_b);
+		}
 
 		if (cfs_b->runtime > 0) {
 			amount = min(cfs_b->runtime, min_amount);
@@ -1317,9 +1341,47 @@ static void assign_cfs_rq_runtime(struct
 			cfs_b->idle = 0;
 		}
 	}
+	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
 
 	cfs_rq->runtime_remaining += amount;
+	/*
+	 * we may have advanced our local expiration to account for allowed
+	 * spread between our sched_clock and the one on which runtime was
+	 * issued.
+	 */
+	if ((s64)(expires - cfs_rq->runtime_expires) > 0)
+		cfs_rq->runtime_expires = expires;
+}
+
+static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	struct rq *rq = rq_of(cfs_rq);
+
+	if (cfs_rq->runtime_remaining < 0)
+		return;
+
+	/* if the deadline is ahead of our clock, nothing to do */
+	if ((s64)(rq->clock - cfs_rq->runtime_expires) < 0)
+		return;
+
+	/*
+	 * If the local deadline has passed we have to consider the
+	 * possibility that our sched_clock is 'fast' and the global deadline
+	 * has not truly expired.
+	 *
+	 * Fortunately we can determine whether this is the case by checking
+	 * whether the global deadline has advanced.
+	 */
+
+	if ((s64)(cfs_rq->runtime_expires - cfs_b->runtime_expires) >= 0) {
+		/* extend local deadline, drift is bounded above by 2 ticks */
+		cfs_rq->runtime_expires += TICK_NSEC;
+	} else {
+		/* global deadline is ahead, expiration has passed */
+		cfs_rq->runtime_remaining = 0;
+	}
 }
 
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
@@ -1328,7 +1390,10 @@ static void account_cfs_rq_runtime(struc
 	if (!cfs_rq->runtime_enabled)
 		return;
 
+	/* dock delta_exec before expiring quota (as it could span periods) */
 	cfs_rq->runtime_remaining -= delta_exec;
+	expire_cfs_rq_runtime(cfs_rq);
+
 	if (cfs_rq->runtime_remaining > 0)
 		return;
 
@@ -1337,17 +1402,19 @@ static void account_cfs_rq_runtime(struc
 
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 {
-	u64 quota, runtime = 0;
 	int idle = 1;
 
 	raw_spin_lock(&cfs_b->lock);
-	quota = cfs_b->quota;
-
-	if (quota != RUNTIME_INF) {
-		runtime = quota;
-		cfs_b->runtime = runtime;
-
+	if (cfs_b->quota != RUNTIME_INF) {
 		idle = cfs_b->idle;
+		/* If we're going idle then defer handle the refill */
+		if (!idle)
+			__refill_cfs_bandwidth_runtime(cfs_b);
+
+		/*
+		 * mark this bandwidth pool as idle so that we may deactivate
+		 * the timer at the next expiration if there is no usage.
+		 */
 		cfs_b->idle = 1;
 	}
 
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -253,6 +253,7 @@ struct cfs_bandwidth {
 	ktime_t period;
 	u64 quota, runtime;
 	s64 hierarchal_quota;
+	u64 runtime_expires;
 
 	int idle, timer_active;
 	struct hrtimer period_timer;
@@ -393,6 +394,7 @@ struct cfs_rq {
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	int runtime_enabled;
+	u64 runtime_expires;
 	s64 runtime_remaining;
 #endif
 #endif
@@ -8981,7 +8983,9 @@ static int tg_set_cfs_bandwidth(struct t
 
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
-	cfs_b->quota = quota;
+	cfs_b->quota = cfs_b->runtime = quota;
+
+	__refill_cfs_bandwidth_runtime(cfs_b);
 	raw_spin_unlock_irq(&cfs_b->lock);
 
 	for_each_possible_cpu(i) {




* [patch 08/16] sched: throttle cfs_rq entities which exceed their local runtime
  2011-06-21  7:16 [patch 00/16] CFS Bandwidth Control v7 Paul Turner
                   ` (6 preceding siblings ...)
  2011-06-21  7:16 ` [patch 07/16] sched: expire invalid runtime Paul Turner
@ 2011-06-21  7:16 ` Paul Turner
  2011-06-22  7:11   ` Bharata B Rao
  2011-06-22 16:07   ` Peter Zijlstra
  2011-06-21  7:16 ` [patch 09/16] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh Paul Turner
                   ` (8 subsequent siblings)
  16 siblings, 2 replies; 59+ messages in thread
From: Paul Turner @ 2011-06-21  7:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Nikhil Rao

[-- Attachment #1: sched-bwc-throttle_entities.patch --]
[-- Type: text/plain, Size: 6199 bytes --]

In account_cfs_rq_runtime() (via update_curr()) we track consumption versus a
cfs_rq's locally assigned runtime and whether there is global runtime available
to provide a refill when it runs out.

In the case that there is no runtime remaining it's necessary to throttle so
that execution ceases until the subsequent period.  While it is at this
boundary that we detect (and signal for, via resched_task) that a throttle is
required, the actual operation is deferred until put_prev_entity().

At this point the cfs_rq is marked as throttled and not re-enqueued; this
avoids potential interactions with throttled runqueues in the event that we
are not immediately able to evict the running task.
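
A toy userspace model of that ordering (invented numbers; the refill below
simply grabs whatever global runtime remains, ignoring the slice and expiration
machinery from the earlier patches):

/* toy model: detect at account time, throttle at put_prev time */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct toy {
	int64_t local_runtime;		/* cfs_rq->runtime_remaining */
	int64_t global_runtime;		/* cfs_b->runtime */
	bool resched_pending;		/* stands in for resched_task() */
	bool throttled;
};

/* update_curr() path: detect exhaustion and only *request* a reschedule */
static void account(struct toy *t, int64_t delta_exec)
{
	t->local_runtime -= delta_exec;
	if (t->local_runtime > 0)
		return;

	/* simplified refill: grab whatever is left globally */
	if (t->global_runtime > 0) {
		t->local_runtime += t->global_runtime;
		t->global_runtime = 0;
	}

	if (t->local_runtime <= 0)
		t->resched_pending = true;
}

/* put_prev_entity() path: the deferred throttle actually happens here */
static void put_prev(struct toy *t)
{
	if (t->resched_pending && t->local_runtime <= 0)
		t->throttled = true;	/* marked throttled, not re-enqueued */
}

int main(void)
{
	struct toy t = { .local_runtime = 1000000, .global_runtime = 0 };

	account(&t, 3000000);	/* ran 3ms with 1ms local and an empty pool */
	put_prev(&t);
	printf("resched=%d throttled=%d\n", t.resched_pending, t.throttled);
	return 0;
}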

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c      |    7 ++++
 kernel/sched_fair.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 92 insertions(+), 4 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1311,7 +1311,8 @@ static void __refill_cfs_bandwidth_runti
 	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
 }
 
-static void assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+/* returns 0 on failure to allocate runtime */
+static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
@@ -1352,6 +1353,8 @@ static void assign_cfs_rq_runtime(struct
 	 */
 	if ((s64)(expires - cfs_rq->runtime_expires) > 0)
 		cfs_rq->runtime_expires = expires;
+
+	return cfs_rq->runtime_remaining > 0;
 }
 
 static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
@@ -1397,7 +1400,53 @@ static void account_cfs_rq_runtime(struc
 	if (cfs_rq->runtime_remaining > 0)
 		return;
 
-	assign_cfs_rq_runtime(cfs_rq);
+	/*
+	 * if we're unable to extend our runtime we resched so that the active
+	 * hierarchy can be throttled
+	 */
+	if (!assign_cfs_rq_runtime(cfs_rq))
+		resched_task(rq_of(cfs_rq)->curr);
+}
+
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->throttled;
+}
+
+static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	struct rq *rq = rq_of(cfs_rq);
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	struct sched_entity *se;
+	long task_delta, dequeue = 1;
+
+	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+
+	/* account load preceding throttle */
+	update_cfs_load(cfs_rq, 0);
+
+	task_delta = cfs_rq->h_nr_running;
+	for_each_sched_entity(se) {
+		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
+		/* throttled entity or throttle-on-deactivate */
+		if (!se->on_rq)
+			break;
+
+		if (dequeue)
+			dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
+		qcfs_rq->h_nr_running -= task_delta;
+
+		if (qcfs_rq->load.weight)
+			dequeue = 0;
+	}
+
+	if (!se)
+		rq->nr_running -= task_delta;
+
+	cfs_rq->throttled = 1;
+	raw_spin_lock(&cfs_b->lock);
+	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
+	raw_spin_unlock(&cfs_b->lock);
 }
 
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
@@ -1427,6 +1476,11 @@ static int do_sched_cfs_period_timer(str
 #else
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 		unsigned long delta_exec) {}
+
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+	return 0;
+}
 #endif
 
 /**************************************************
@@ -1505,7 +1559,17 @@ enqueue_task_fair(struct rq *rq, struct 
 			break;
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
+
+		/*
+		 * end evaluation on encountering a throttled cfs_rq
+		 *
+		 * note: in the case of encountering a throttled cfs_rq we will
+		 * post the final h_nr_running decrement below.
+		*/
+		if (cfs_rq_throttled(cfs_rq))
+			break;
 		cfs_rq->h_nr_running++;
+
 		flags = ENQUEUE_WAKEUP;
 	}
 
@@ -1513,11 +1577,15 @@ enqueue_task_fair(struct rq *rq, struct 
 		cfs_rq = cfs_rq_of(se);
 		cfs_rq->h_nr_running++;
 
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
-	inc_nr_running(rq);
+	if (!se)
+		inc_nr_running(rq);
 	hrtick_update(rq);
 }
 
@@ -1537,6 +1605,15 @@ static void dequeue_task_fair(struct rq 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);
+
+		/*
+		 * end evaluation on encountering a throttled cfs_rq
+		 *
+		 * note: in the case of encountering a throttled cfs_rq we will
+		 * post the final h_nr_running decrement below.
+		*/
+		if (cfs_rq_throttled(cfs_rq))
+			break;
 		cfs_rq->h_nr_running--;
 
 		/* Don't dequeue parent if it has other entities besides us */
@@ -1559,11 +1636,15 @@ static void dequeue_task_fair(struct rq 
 		cfs_rq = cfs_rq_of(se);
 		cfs_rq->h_nr_running--;
 
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
-	dec_nr_running(rq);
+	if (!se)
+		dec_nr_running(rq);
 	hrtick_update(rq);
 }
 
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -257,6 +257,8 @@ struct cfs_bandwidth {
 
 	int idle, timer_active;
 	struct hrtimer period_timer;
+	struct list_head throttled_cfs_rq;
+
 #endif
 };
 
@@ -396,6 +398,9 @@ struct cfs_rq {
 	int runtime_enabled;
 	u64 runtime_expires;
 	s64 runtime_remaining;
+
+	int throttled;
+	struct list_head throttled_list;
 #endif
 #endif
 };
@@ -437,6 +442,7 @@ static void init_cfs_bandwidth(struct cf
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 
+	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
 }
@@ -444,6 +450,7 @@ static void init_cfs_bandwidth(struct cf
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	cfs_rq->runtime_enabled = 0;
+	INIT_LIST_HEAD(&cfs_rq->throttled_list);
 }
 
 /* requires cfs_b->lock */




* [patch 09/16] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
  2011-06-21  7:16 [patch 00/16] CFS Bandwidth Control v7 Paul Turner
                   ` (7 preceding siblings ...)
  2011-06-21  7:16 ` [patch 08/16] sched: throttle cfs_rq entities which exceed their local runtime Paul Turner
@ 2011-06-21  7:16 ` Paul Turner
  2011-06-22 17:29   ` Peter Zijlstra
  2011-06-21  7:16 ` [patch 10/16] sched: throttle entities exceeding their allowed bandwidth Paul Turner
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 59+ messages in thread
From: Paul Turner @ 2011-06-21  7:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Nikhil Rao

[-- Attachment #1: sched-bwc-unthrottle_entities.patch --]
[-- Type: text/plain, Size: 4868 bytes --]

At the start of a new period we must refresh the global bandwidth pool as well
as unthrottle any cfs_rq entities that previously ran out of bandwidth (as
quota permits).

Unthrottled entities have the cfs_rq->throttled flag cleared and are re-enqueued
into the entity hierarchy.
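
A toy model of the distribution step with invented numbers: each throttled
queue is topped up to just past zero, in list order, until the refreshed pool
is exhausted:

/* standalone model of distribute_cfs_runtime(); values invented, units ns */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	int64_t deficits[] = { -3000000, -500000, -2000000 };	/* throttled cfs_rqs */
	int64_t remaining = 4000000;	/* cfs_b->runtime after the refresh */

	for (int i = 0; i < 3 && remaining; i++) {
		/* just enough to push runtime_remaining to +1ns ... */
		int64_t need = -deficits[i] + 1;
		/* ... unless the pool runs out first */
		int64_t grant = need < remaining ? need : remaining;

		deficits[i] += grant;
		remaining -= grant;

		printf("cfs_rq%d: runtime_remaining=%lld -> %s\n", i,
		       (long long)deficits[i],
		       deficits[i] > 0 ? "unthrottle" : "still throttled");
	}
	return 0;
}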

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c      |    3 +
 kernel/sched_fair.c |  125 +++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 121 insertions(+), 7 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -9002,6 +9002,9 @@ static int tg_set_cfs_bandwidth(struct t
 		raw_spin_lock_irq(&rq->lock);
 		cfs_rq->runtime_enabled = quota != RUNTIME_INF;
 		cfs_rq->runtime_remaining = 0;
+
+		if (cfs_rq_throttled(cfs_rq))
+			unthrottle_cfs_rq(cfs_rq);
 		raw_spin_unlock_irq(&rq->lock);
 	}
 out_unlock:
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1448,26 +1448,137 @@ static void throttle_cfs_rq(struct cfs_r
 	raw_spin_unlock(&cfs_b->lock);
 }
 
+static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	struct rq *rq = rq_of(cfs_rq);
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	struct sched_entity *se;
+	int enqueue = 1;
+	long task_delta;
+
+	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+
+	cfs_rq->throttled = 0;
+	raw_spin_lock(&cfs_b->lock);
+	list_del_rcu(&cfs_rq->throttled_list);
+	raw_spin_unlock(&cfs_b->lock);
+
+	if (!cfs_rq->load.weight)
+		return;
+
+	task_delta = cfs_rq->h_nr_running;
+	for_each_sched_entity(se) {
+		if (se->on_rq)
+			enqueue = 0;
+
+		cfs_rq = cfs_rq_of(se);
+		if (enqueue)
+			enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
+		cfs_rq->h_nr_running += task_delta;
+
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+	}
+
+	if (!se)
+		rq->nr_running += task_delta;
+
+	/* determine whether we need to wake up potentially idle cpu */
+	if (rq->curr == rq->idle && rq->cfs.nr_running)
+		resched_task(rq->curr);
+}
+
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
+		u64 remaining, u64 expires)
+{
+	struct cfs_rq *cfs_rq;
+	u64 runtime = remaining;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(cfs_rq, &cfs_b->throttled_cfs_rq,
+				throttled_list) {
+		struct rq *rq = rq_of(cfs_rq);
+
+		raw_spin_lock(&rq->lock);
+		if (!cfs_rq_throttled(cfs_rq))
+			goto next;
+
+		runtime = -cfs_rq->runtime_remaining + 1;
+		if (runtime > remaining)
+			runtime = remaining;
+		remaining -= runtime;
+
+		cfs_rq->runtime_remaining += runtime;
+		cfs_rq->runtime_expires = expires;
+
+		/* we check whether we're throttled above */
+		if (cfs_rq->runtime_remaining > 0)
+			unthrottle_cfs_rq(cfs_rq);
+
+next:
+		raw_spin_unlock(&rq->lock);
+
+		if (!remaining)
+			break;
+	}
+	rcu_read_unlock();
+
+	return remaining;
+}
+
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 {
-	int idle = 1;
+	int idle = 1, throttled = 0;
+	u64 runtime, runtime_expires;
+
 
 	raw_spin_lock(&cfs_b->lock);
 	if (cfs_b->quota != RUNTIME_INF) {
-		idle = cfs_b->idle;
-		/* If we're going idle then defer handle the refill */
+		/* idle depends on !throttled in the case of a large deficit */
+		throttled = !list_empty(&cfs_b->throttled_cfs_rq);
+		idle = cfs_b->idle && !throttled;
+
+		/* If we're going idle then defer the refill */
 		if (!idle)
 			__refill_cfs_bandwidth_runtime(cfs_b);
+		if (throttled) {
+			runtime = cfs_b->runtime;
+			runtime_expires = cfs_b->runtime_expires;
+
+			/* we must first distribute to throttled entities */
+			cfs_b->runtime = 0;
+		}
 
 		/*
-		 * mark this bandwidth pool as idle so that we may deactivate
-		 * the timer at the next expiration if there is no usage.
+		 * conditionally mark this bandwidth pool as idle so that we may
+		 * deactivate the timer at the next expiration if there is no
+		 * usage.
 		 */
-		cfs_b->idle = 1;
+		cfs_b->idle = !throttled;
 	}
 
-	if (idle)
+	if (idle) {
 		cfs_b->timer_active = 0;
+		goto out_unlock;
+	}
+	raw_spin_unlock(&cfs_b->lock);
+
+retry:
+	runtime = distribute_cfs_runtime(cfs_b, runtime, runtime_expires);
+
+	raw_spin_lock(&cfs_b->lock);
+	/* new bandwidth specification may exist */
+	if (unlikely(runtime_expires != cfs_b->runtime_expires))
+		goto out_unlock;
+	/* ensure no-one was throttled while we were unthrottling */
+	if (unlikely(!list_empty(&cfs_b->throttled_cfs_rq)) && runtime > 0) {
+		raw_spin_unlock(&cfs_b->lock);
+		goto retry;
+	}
+
+	/* return remaining runtime */
+	cfs_b->runtime = runtime;
+out_unlock:
 	raw_spin_unlock(&cfs_b->lock);
 
 	return idle;



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch 10/16] sched: throttle entities exceeding their allowed bandwidth
  2011-06-21  7:16 [patch 00/16] CFS Bandwidth Control v7 Paul Turner
                   ` (8 preceding siblings ...)
  2011-06-21  7:16 ` [patch 09/16] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh Paul Turner
@ 2011-06-21  7:16 ` Paul Turner
  2011-06-22  9:39   ` Hidetoshi Seto
  2011-06-21  7:17 ` [patch 11/16] sched: allow for positional tg_tree walks Paul Turner
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 59+ messages in thread
From: Paul Turner @ 2011-06-21  7:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-enable-throttling.patch --]
[-- Type: text/plain, Size: 3670 bytes --]

Add conditional checks at the time of put_prev_entity() and enqueue_entity() to
detect when an active entity has exceeded its allowed bandwidth and requires
throttling.

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched_fair.c |   55 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 53 insertions(+), 2 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -987,6 +987,8 @@ place_entity(struct cfs_rq *cfs_rq, stru
 	se->vruntime = vruntime;
 }
 
+static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
+
 static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
@@ -1016,8 +1018,10 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
 
-	if (cfs_rq->nr_running == 1)
+	if (cfs_rq->nr_running == 1) {
 		list_add_leaf_cfs_rq(cfs_rq);
+		check_enqueue_throttle(cfs_rq);
+	}
 }
 
 static void __clear_buddies_last(struct sched_entity *se)
@@ -1222,6 +1226,8 @@ static struct sched_entity *pick_next_en
 	return se;
 }
 
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
+
 static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 {
 	/*
@@ -1231,6 +1237,9 @@ static void put_prev_entity(struct cfs_r
 	if (prev->on_rq)
 		update_curr(cfs_rq);
 
+	/* throttle cfs_rqs exceeding runtime */
+	check_cfs_rq_runtime(cfs_rq);
+
 	check_spread(cfs_rq, prev);
 	if (prev->on_rq) {
 		update_stats_wait_start(cfs_rq, prev);
@@ -1403,7 +1412,7 @@ static void account_cfs_rq_runtime(struc
 	 * if we're unable to extend our runtime we resched so that the active
 	 * hierarchy can be throttled
 	 */
-	if (!assign_cfs_rq_runtime(cfs_rq))
+	if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
 		resched_task(rq_of(cfs_rq)->curr);
 }
 
@@ -1448,6 +1457,46 @@ static void throttle_cfs_rq(struct cfs_r
 	raw_spin_unlock(&cfs_b->lock);
 }
 
+/*
+ * When a group wakes up we want to make sure that its quota is not already
+ * expired/exceeded, otherwise it may be allowed to steal additional ticks of
+ * runtime as update_curr() throttling cannot trigger until it's on-rq.
+ */
+static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
+{
+	/* an active group must be handled by the update_curr()->put() path */
+	if (!cfs_rq->runtime_enabled || cfs_rq->curr)
+		return;
+
+	/* ensure the group is not already throttled */
+	if (cfs_rq_throttled(cfs_rq))
+		return;
+
+	/* update runtime allocation */
+	account_cfs_rq_runtime(cfs_rq, 0);
+	if (cfs_rq->runtime_remaining <= 0)
+		throttle_cfs_rq(cfs_rq);
+}
+
+/* conditionally throttle active cfs_rq's from put_prev_entity() */
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	if (!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0)
+		return;
+
+	/*
+	 * as the alignment of the last vestiges of per-cpu quota is not
+	 * controllable it's possible that active load-balance will force a
+	 * thread belonging to an unthrottled cfs_rq on cpu A into a running
+	 * state on a throttled cfs_rq on cpu B.  In this case we're already
+	 * throttled.
+	 */
+	if (cfs_rq_throttled(cfs_rq))
+		return;
+
+	throttle_cfs_rq(cfs_rq);
+}
+
 static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
@@ -1586,6 +1635,8 @@ out_unlock:
 #else
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 		unsigned long delta_exec) {}
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
+static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch 11/16] sched: allow for positional tg_tree walks
  2011-06-21  7:16 [patch 00/16] CFS Bandwidth Control v7 Paul Turner
                   ` (9 preceding siblings ...)
  2011-06-21  7:16 ` [patch 10/16] sched: throttle entities exceeding their allowed bandwidth Paul Turner
@ 2011-06-21  7:17 ` Paul Turner
  2011-06-21  7:17 ` [patch 12/16] sched: prevent interactions with throttled entities Paul Turner
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 59+ messages in thread
From: Paul Turner @ 2011-06-21  7:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-refactor-walk_tg_tree.patch --]
[-- Type: text/plain, Size: 3532 bytes --]

Extend walk_tg_tree to accept a positional argument

static int walk_tg_tree_from(struct task_group *from,
			     tg_visitor down, tg_visitor up, void *data)

Existing semantics are preserved; the caller must hold rcu_lock() or a
sufficient analogue.
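
As a usage sketch (these callers are hypothetical and do not appear in the
series), counting the groups in a subtree from kernel/sched.c might look like:

static int tg_count_visit(struct task_group *tg, void *data)
{
	/* visitor matching the tg_visitor signature */
	(*(int *)data)++;
	return 0;
}

static int count_subtree(struct task_group *tg)
{
	int count = 0;

	rcu_read_lock();
	walk_tg_tree_from(tg, tg_count_visit, tg_nop, &count);
	rcu_read_unlock();

	return count;
}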

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c |   52 +++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 39 insertions(+), 13 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -1574,20 +1574,23 @@ static inline void dec_cpu_load(struct r
 typedef int (*tg_visitor)(struct task_group *, void *);
 
 /*
- * Iterate the full tree, calling @down when first entering a node and @up when
- * leaving it for the final time.
+ * Iterate task_group tree rooted at *from, calling @down when first entering a
+ * node and @up when leaving it for the final time.
+ *
+ * Caller must hold rcu_lock or sufficient equivalent.
  */
-static int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)
+static int walk_tg_tree_from(struct task_group *from,
+			     tg_visitor down, tg_visitor up, void *data)
 {
 	struct task_group *parent, *child;
 	int ret;
 
-	rcu_read_lock();
-	parent = &root_task_group;
+	parent = from;
+
 down:
 	ret = (*down)(parent, data);
 	if (ret)
-		goto out_unlock;
+		goto out;
 	list_for_each_entry_rcu(child, &parent->children, siblings) {
 		parent = child;
 		goto down;
@@ -1596,19 +1599,29 @@ up:
 		continue;
 	}
 	ret = (*up)(parent, data);
-	if (ret)
-		goto out_unlock;
+	if (ret || parent == from)
+		goto out;
 
 	child = parent;
 	parent = parent->parent;
 	if (parent)
 		goto up;
-out_unlock:
-	rcu_read_unlock();
-
+out:
 	return ret;
 }
 
+/*
+ * Iterate the full tree, calling @down when first entering a node and @up when
+ * leaving it for the final time.
+ *
+ * Caller must hold rcu_lock or sufficient equivalent.
+ */
+
+static inline int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)
+{
+	return walk_tg_tree_from(&root_task_group, down, up, data);
+}
+
 static int tg_nop(struct task_group *tg, void *data)
 {
 	return 0;
@@ -1702,7 +1715,9 @@ static int tg_load_down(struct task_grou
 
 static void update_h_load(long cpu)
 {
+	rcu_read_lock();
 	walk_tg_tree(tg_load_down, tg_nop, (void *)cpu);
+	rcu_read_unlock();
 }
 
 #endif
@@ -8687,13 +8702,19 @@ static int tg_rt_schedulable(struct task
 
 static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
 {
+	int ret;
+
 	struct rt_schedulable_data data = {
 		.tg = tg,
 		.rt_period = period,
 		.rt_runtime = runtime,
 	};
 
-	return walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
+	rcu_read_lock();
+	ret = walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
+	rcu_read_unlock();
+
+	return ret;
 }
 
 static int tg_set_rt_bandwidth(struct task_group *tg,
@@ -9143,6 +9164,7 @@ static int tg_cfs_schedulable_down(struc
 
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
 {
+	int ret;
 	struct cfs_schedulable_data data = {
 		.tg = tg,
 		.period = period,
@@ -9154,7 +9176,11 @@ static int __cfs_schedulable(struct task
 		do_div(data.quota, NSEC_PER_USEC);
 	}
 
-	return walk_tg_tree(tg_cfs_schedulable_down, tg_nop, &data);
+	rcu_read_lock();
+	ret = walk_tg_tree(tg_cfs_schedulable_down, tg_nop, &data);
+	rcu_read_unlock();
+
+	return ret;
 }
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch 12/16] sched: prevent interactions with throttled entities
  2011-06-21  7:16 [patch 00/16] CFS Bandwidth Control v7 Paul Turner
                   ` (10 preceding siblings ...)
  2011-06-21  7:17 ` [patch 11/16] sched: allow for positional tg_tree walks Paul Turner
@ 2011-06-21  7:17 ` Paul Turner
  2011-06-22 21:34   ` Peter Zijlstra
  2011-06-23 11:49   ` Peter Zijlstra
  2011-06-21  7:17 ` [patch 13/16] sched: migrate throttled tasks on HOTPLUG Paul Turner
                   ` (4 subsequent siblings)
  16 siblings, 2 replies; 59+ messages in thread
From: Paul Turner @ 2011-06-21  7:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-throttled_shares.patch --]
[-- Type: text/plain, Size: 6239 bytes --]

From the perspective of load-balance and shares distribution, throttled
entities should be invisible.

However, both of these operations work on 'active' lists and are not
inherently aware of what group hierarchies may be present.  In some cases this
may be side-stepped (e.g. we could sideload via tg_load_down in load balance) 
while in others (e.g. update_shares()) it is more difficult to compute without
incurring some O(n^2) costs.

Instead, track hierarchical throttled state at the time of transition.  This
allows us to easily identify whether an entity belongs to a throttled hierarchy
and avoid incorrect interactions with it.

Also, when an entity leaves a throttled hierarchy we need to advance its
shares-averaging window so that the elapsed throttled time is not counted
as part of the cfs_rq's operation.

We also use this information to prevent buddy interactions in the wakeup and
yield_to() paths.
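
The window adjustment itself is simple arithmetic; a standalone illustration
(not kernel code, values invented) of what tg_unthrottle_down() below does to
load_stamp/load_last so that update_cfs_load() sees no elapsed time for the
throttled span:

#include <stdio.h>

int main(void)
{
	unsigned long long load_stamp = 1000000, load_last = 990000; /* ns */
	unsigned long long now = 1250000;	/* clock_task at unthrottle */
	unsigned long long delta = now - load_stamp;

	/* advance both stamps by the throttled interval */
	load_stamp += delta;			/* == now */
	load_last += delta;

	printf("skipped %llu ns: load_stamp=%llu load_last=%llu\n",
	       delta, load_stamp, load_last);
	return 0;
}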

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c      |    2 -
 kernel/sched_fair.c |   87 +++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 81 insertions(+), 8 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -741,13 +741,15 @@ static void update_cfs_rq_load_contribut
 	}
 }
 
+static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
+
 static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
 {
 	u64 period = sysctl_sched_shares_window;
 	u64 now, delta;
 	unsigned long load = cfs_rq->load.weight;
 
-	if (cfs_rq->tg == &root_task_group)
+	if (cfs_rq->tg == &root_task_group || throttled_hierarchy(cfs_rq))
 		return;
 
 	now = rq_of(cfs_rq)->clock_task;
@@ -1421,6 +1423,46 @@ static inline int cfs_rq_throttled(struc
 	return cfs_rq->throttled;
 }
 
+static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->throttle_count;
+}
+
+struct tg_unthrottle_down_data {
+	int cpu;
+	u64 now;
+};
+
+static int tg_unthrottle_down(struct task_group *tg, void *data)
+{
+	struct tg_unthrottle_down_data *udd = data;
+	struct cfs_rq *cfs_rq = tg->cfs_rq[udd->cpu];
+	u64 delta;
+
+	cfs_rq->throttle_count--;
+	if (!cfs_rq->throttle_count) {
+		/* leaving throttled state, move up windows */
+		delta = udd->now - cfs_rq->load_stamp;
+		cfs_rq->load_stamp += delta;
+		cfs_rq->load_last += delta;
+	}
+
+	return 0;
+}
+
+static int tg_throttle_down(struct task_group *tg, void *data)
+{
+	long cpu = (long)data;
+	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu];
+
+	/* group is entering throttled state, record last load */
+	if (!cfs_rq->throttle_count)
+		update_cfs_load(cfs_rq, 0);
+	cfs_rq->throttle_count++;
+
+	return 0;
+}
+
 static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
@@ -1431,7 +1473,10 @@ static void throttle_cfs_rq(struct cfs_r
 	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
 
 	/* account load preceding throttle */
-	update_cfs_load(cfs_rq, 0);
+	rcu_read_lock();
+	walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop,
+			  (void *)(long)rq_of(cfs_rq)->cpu);
+	rcu_read_unlock();
 
 	task_delta = cfs_rq->h_nr_running;
 	for_each_sched_entity(se) {
@@ -1504,6 +1549,7 @@ static void unthrottle_cfs_rq(struct cfs
 	struct sched_entity *se;
 	int enqueue = 1;
 	long task_delta;
+	struct tg_unthrottle_down_data udd;
 
 	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
 
@@ -1512,6 +1558,13 @@ static void unthrottle_cfs_rq(struct cfs
 	list_del_rcu(&cfs_rq->throttled_list);
 	raw_spin_unlock(&cfs_b->lock);
 
+	update_rq_clock(rq);
+	/* don't include throttled window for load statistics */
+	udd.cpu = rq->cpu;
+	udd.now = rq->clock_task;
+	walk_tg_tree_from(cfs_rq->tg, tg_unthrottle_down, tg_nop,
+			  (void *)&udd);
+
 	if (!cfs_rq->load.weight)
 		return;
 
@@ -1642,6 +1695,11 @@ static inline int cfs_rq_throttled(struc
 {
 	return 0;
 }
+
+static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
+{
+	return 0;
+}
 #endif
 
 /**************************************************
@@ -2317,6 +2375,14 @@ static void check_preempt_wakeup(struct 
 	if (unlikely(se == pse))
 		return;
 
+	/*
+	 * this is possible from callers such as pull_task(), where we
+	 * unconditionally check_preempt_curr() after an enqueue (which may have
+	 * led to a throttle)
+	 */
+	if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
+		return;
+
 	if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
 		set_next_buddy(pse);
 		next_buddy_marked = 1;
@@ -2447,7 +2513,7 @@ static bool yield_to_task_fair(struct rq
 {
 	struct sched_entity *se = &p->se;
 
-	if (!se->on_rq)
+	if (!se->on_rq || throttled_hierarchy(cfs_rq_of(se)))
 		return false;
 
 	/* Tell the scheduler that we'd really like pse to run next. */
@@ -2543,6 +2609,9 @@ move_one_task(struct rq *this_rq, int th
 	int pinned = 0;
 
 	for_each_leaf_cfs_rq(busiest, cfs_rq) {
+		if (throttled_hierarchy(cfs_rq))
+			continue;
+
 		list_for_each_entry_safe(p, n, &cfs_rq->tasks, se.group_node) {
 
 			if (!can_migrate_task(p, busiest, this_cpu,
@@ -2635,8 +2704,10 @@ static int update_shares_cpu(struct task
 
 	raw_spin_lock_irqsave(&rq->lock, flags);
 
-	update_rq_clock(rq);
-	update_cfs_load(cfs_rq, 1);
+	if (!throttled_hierarchy(cfs_rq)) {
+		update_rq_clock(rq);
+		update_cfs_load(cfs_rq, 1);
+	}
 
 	/*
 	 * We need to update shares after updating tg->load_weight in
@@ -2680,9 +2751,11 @@ load_balance_fair(struct rq *this_rq, in
 		u64 rem_load, moved_load;
 
 		/*
-		 * empty group
+		 * empty group or part of a throttled hierarchy
 		 */
-		if (!busiest_cfs_rq->task_weight)
+		if (!busiest_cfs_rq->task_weight ||
+		    throttled_hierarchy(busiest_cfs_rq) ||
+		    throttled_hierarchy(tg->cfs_rq[this_cpu]))
 			continue;
 
 		rem_load = (u64)rem_load_move * busiest_weight;
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -399,7 +399,7 @@ struct cfs_rq {
 	u64 runtime_expires;
 	s64 runtime_remaining;
 
-	int throttled;
+	int throttled, throttle_count;
 	struct list_head throttled_list;
 #endif
 #endif



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch 13/16] sched: migrate throttled tasks on HOTPLUG
  2011-06-21  7:16 [patch 00/16] CFS Bandwidth Control v7 Paul Turner
                   ` (11 preceding siblings ...)
  2011-06-21  7:17 ` [patch 12/16] sched: prevent interactions with throttled entities Paul Turner
@ 2011-06-21  7:17 ` Paul Turner
  2011-06-21  7:17 ` [patch 14/16] sched: add exports tracking cfs bandwidth control statistics Paul Turner
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 59+ messages in thread
From: Paul Turner @ 2011-06-21  7:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-migrate_dead.patch --]
[-- Type: text/plain, Size: 1792 bytes --]

Throttled tasks are invisible to cpu-offline since they are not eligible for
selection by pick_next_task().  The regular 'escape' path for a thread that is
blocked at offline is via ttwu->select_task_rq; however, this will not handle a
throttled group since there are no individual thread wakeups on an unthrottle.

Resolve this by unthrottling offline cpus so that threads can be migrated.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c |   27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -6269,6 +6269,30 @@ static void calc_global_load_remove(stru
 	rq->calc_load_active = 0;
 }
 
+#ifdef CONFIG_CFS_BANDWIDTH
+static void unthrottle_offline_cfs_rqs(struct rq *rq)
+{
+	struct cfs_rq *cfs_rq;
+
+	for_each_leaf_cfs_rq(rq, cfs_rq) {
+		struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+
+		if (!cfs_rq->runtime_enabled)
+			continue;
+
+		/*
+		 * clock_task is not advancing so we just need to make sure
+		 * there's some valid quota amount
+		 */
+		cfs_rq->runtime_remaining = cfs_b->quota;
+		if (cfs_rq_throttled(cfs_rq))
+			unthrottle_cfs_rq(cfs_rq);
+	}
+}
+#else
+static void unthrottle_offline_cfs_rqs(struct rq *rq) {}
+#endif
+
 /*
  * Migrate all tasks from the rq, sleeping tasks will be migrated by
  * try_to_wake_up()->select_task_rq().
@@ -6294,6 +6318,9 @@ static void migrate_tasks(unsigned int d
 	 */
 	rq->stop = NULL;
 
+	/* Ensure any throttled groups are reachable by pick_next_task */
+	unthrottle_offline_cfs_rqs(rq);
+
 	for ( ; ; ) {
 		/*
 		 * There's this thread running, bail when that's the only



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch 14/16] sched: add exports tracking cfs bandwidth control statistics
  2011-06-21  7:16 [patch 00/16] CFS Bandwidth Control v7 Paul Turner
                   ` (12 preceding siblings ...)
  2011-06-21  7:17 ` [patch 13/16] sched: migrate throttled tasks on HOTPLUG Paul Turner
@ 2011-06-21  7:17 ` Paul Turner
  2011-06-21  7:17 ` [patch 15/16] sched: return unused runtime on voluntary sleep Paul Turner
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 59+ messages in thread
From: Paul Turner @ 2011-06-21  7:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Nikhil Rao

[-- Attachment #1: sched-bwc-throttle_stats.patch --]
[-- Type: text/plain, Size: 3230 bytes --]

From: Nikhil Rao <ncrao@google.com>

This change introduces statistics exports for the cpu sub-system; these are
added through the use of a stat file similar to that exported by other
subsystems.

The following exports are included:

nr_periods:	number of periods in which execution occurred
nr_throttled:	the number of periods above in which execution was throttled
throttled_time:	cumulative wall-time that any cpus have been throttled for
		this group
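
A hypothetical consumer of these three values (the numbers below are invented)
shows how they relate:

#include <stdio.h>

int main(void)
{
	/* example contents of cpu.stat for some group */
	unsigned long long nr_periods = 1200, nr_throttled = 300;
	unsigned long long throttled_time = 4500000000ULL;	/* ns */

	printf("throttled in %.1f%% of periods\n",
	       100.0 * nr_throttled / nr_periods);
	printf("throttled cpu-time per throttled period: %.2f ms\n",
	       (double)throttled_time / 1e6 / nr_throttled);
	return 0;
}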

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c      |   21 +++++++++++++++++++++
 kernel/sched_fair.c |    8 ++++++++
 2 files changed, 29 insertions(+)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -259,6 +259,9 @@ struct cfs_bandwidth {
 	struct hrtimer period_timer;
 	struct list_head throttled_cfs_rq;
 
+	/* statistics */
+	int nr_periods, nr_throttled;
+	u64 throttled_time;
 #endif
 };
 
@@ -399,6 +402,7 @@ struct cfs_rq {
 	u64 runtime_expires;
 	s64 runtime_remaining;
 
+	u64 throttled_timestamp;
 	int throttled, throttle_count;
 	struct list_head throttled_list;
 #endif
@@ -9209,6 +9213,19 @@ static int __cfs_schedulable(struct task
 
 	return ret;
 }
+
+static int cpu_stats_show(struct cgroup *cgrp, struct cftype *cft,
+		struct cgroup_map_cb *cb)
+{
+	struct task_group *tg = cgroup_tg(cgrp);
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+
+	cb->fill(cb, "nr_periods", cfs_b->nr_periods);
+	cb->fill(cb, "nr_throttled", cfs_b->nr_throttled);
+	cb->fill(cb, "throttled_time", cfs_b->throttled_time);
+
+	return 0;
+}
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -9255,6 +9272,10 @@ static struct cftype cpu_files[] = {
 		.read_u64 = cpu_cfs_period_read_u64,
 		.write_u64 = cpu_cfs_period_write_u64,
 	},
+	{
+		.name = "stat",
+		.read_map = cpu_stats_show,
+	},
 #endif
 #ifdef CONFIG_RT_GROUP_SCHED
 	{
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1497,6 +1497,7 @@ static void throttle_cfs_rq(struct cfs_r
 		rq->nr_running -= task_delta;
 
 	cfs_rq->throttled = 1;
+	cfs_rq->throttled_timestamp = rq->clock;
 	raw_spin_lock(&cfs_b->lock);
 	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
 	raw_spin_unlock(&cfs_b->lock);
@@ -1555,8 +1556,10 @@ static void unthrottle_cfs_rq(struct cfs
 
 	cfs_rq->throttled = 0;
 	raw_spin_lock(&cfs_b->lock);
+	cfs_b->throttled_time += rq->clock - cfs_rq->throttled_timestamp;
 	list_del_rcu(&cfs_rq->throttled_list);
 	raw_spin_unlock(&cfs_b->lock);
+	cfs_rq->throttled_timestamp = 0;
 
 	update_rq_clock(rq);
 	/* don't include throttled window for load statistics */
@@ -1659,6 +1662,11 @@ static int do_sched_cfs_period_timer(str
 		cfs_b->idle = !throttled;
 	}
 
+	/* update throttled stats */
+	cfs_b->nr_periods += overrun;
+	if (throttled)
+		cfs_b->nr_throttled += overrun;
+
 	if (idle) {
 		cfs_b->timer_active = 0;
 		goto out_unlock;



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch 15/16] sched: return unused runtime on voluntary sleep
  2011-06-21  7:16 [patch 00/16] CFS Bandwidth Control v7 Paul Turner
                   ` (13 preceding siblings ...)
  2011-06-21  7:17 ` [patch 14/16] sched: add exports tracking cfs bandwidth control statistics Paul Turner
@ 2011-06-21  7:17 ` Paul Turner
  2011-06-21  7:33   ` Paul Turner
                     ` (2 more replies)
  2011-06-21  7:17 ` [patch 16/16] sched: add documentation for bandwidth control Paul Turner
  2011-06-22 10:05 ` [patch 00/16] CFS Bandwidth Control v7 Hidetoshi Seto
  16 siblings, 3 replies; 59+ messages in thread
From: Paul Turner @ 2011-06-21  7:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-simple_return_quota.patch --]
[-- Type: text/plain, Size: 7588 bytes --]

When a local cfs_rq blocks we return the majority of its remaining quota to the
global bandwidth pool for use by other runqueues.

We do this only when the quota is current and there is more than 
min_cfs_rq_quota [1ms by default] of runtime remaining on the rq.

In the case where there are throttled runqueues and we have sufficient
bandwidth to meter out a slice, a second timer is kicked off to handle this
delivery, unthrottling where appropriate.
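
A standalone sketch (not kernel code; the state values are invented and the
slice assumes the 5ms default) of the return and redistribute decision made on
dequeue:

#include <stdbool.h>
#include <stdio.h>

#define NSEC_PER_MSEC		1000000LL
#define MIN_CFS_RQ_RUNTIME	(1 * NSEC_PER_MSEC)	/* kept locally */
#define SLICE			(5 * NSEC_PER_MSEC)	/* default bandwidth slice */
#define SLACK_PERIOD		(5 * NSEC_PER_MSEC)
#define MIN_EXPIRE		(2 * NSEC_PER_MSEC)

int main(void)
{
	long long runtime_remaining = 3 * NSEC_PER_MSEC; /* unused local quota */
	long long pool = 4 * NSEC_PER_MSEC;		 /* global runtime */
	long long to_refresh = 40 * NSEC_PER_MSEC;	 /* until period refresh */
	bool have_throttled = true;

	/* return everything above the 1ms we keep locally */
	long long slack = runtime_remaining - MIN_CFS_RQ_RUNTIME;
	if (slack > 0) {
		pool += slack;
		runtime_remaining -= slack;
	}

	/* only arm the slack timer if it can do useful work in time */
	bool arm = have_throttled && pool > SLICE &&
		   to_refresh >= SLACK_PERIOD + MIN_EXPIRE;

	printf("returned %lld ns, pool %lld ns, slack timer %s\n",
	       slack, pool, arm ? "armed" : "not armed");
	return 0;
}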

Using a 'worst case' antagonist which executes on each cpu
for 1ms before moving onto the next on a fairly large machine:

no quota generations:
 197.47 ms       /cgroup/a/cpuacct.usage
 199.46 ms       /cgroup/a/cpuacct.usage
 205.46 ms       /cgroup/a/cpuacct.usage
 198.46 ms       /cgroup/a/cpuacct.usage
 208.39 ms       /cgroup/a/cpuacct.usage
Since we are allowed to use "stale" quota our usage is effectively bounded by
the rate of input into the global pool and performance is relatively stable.

with quota generations [1s increments]:
 119.58 ms       /cgroup/a/cpuacct.usage
 119.65 ms       /cgroup/a/cpuacct.usage
 119.64 ms       /cgroup/a/cpuacct.usage
 119.63 ms       /cgroup/a/cpuacct.usage
 119.60 ms       /cgroup/a/cpuacct.usage
The large deficit here is due to quota generations (/intentionally/) preventing
us from now using previously stranded slack quota.  The cost is that this quota
becomes unavailable.

with quota generations and quota return:
 200.09 ms       /cgroup/a/cpuacct.usage
 200.09 ms       /cgroup/a/cpuacct.usage
 198.09 ms       /cgroup/a/cpuacct.usage
 200.09 ms       /cgroup/a/cpuacct.usage
 200.06 ms       /cgroup/a/cpuacct.usage
By returning unused quota we're able to both stably consume our desired quota
and prevent unintentional overages due to the abuse of slack quota from 
previous quota periods (especially on a large machine).

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched.c      |   15 +++++++
 kernel/sched_fair.c |   99 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 113 insertions(+), 1 deletion(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -256,7 +256,7 @@ struct cfs_bandwidth {
 	u64 runtime_expires;
 
 	int idle, timer_active;
-	struct hrtimer period_timer;
+	struct hrtimer period_timer, slack_timer;
 	struct list_head throttled_cfs_rq;
 
 	/* statistics */
@@ -417,6 +417,16 @@ static inline struct cfs_bandwidth *tg_c
 
 static inline u64 default_cfs_period(void);
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
+static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b);
+
+static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
+{
+	struct cfs_bandwidth *cfs_b =
+		container_of(timer, struct cfs_bandwidth, slack_timer);
+	do_sched_cfs_slack_timer(cfs_b);
+
+	return HRTIMER_NORESTART;
+}
 
 static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
@@ -449,6 +459,8 @@ static void init_cfs_bandwidth(struct cf
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
+	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	cfs_b->slack_timer.function = sched_cfs_slack_timer;
 }
 
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
@@ -476,6 +488,7 @@ static void __start_cfs_bandwidth(struct
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
 	hrtimer_cancel(&cfs_b->period_timer);
+	hrtimer_cancel(&cfs_b->slack_timer);
 }
 #else
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1071,6 +1071,8 @@ static void clear_buddies(struct cfs_rq 
 		__clear_buddies_skip(se);
 }
 
+static void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
+
 static void
 dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
@@ -1109,6 +1111,10 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 	if (!(flags & DEQUEUE_SLEEP))
 		se->vruntime -= cfs_rq->min_vruntime;
 
+	/* return excess runtime on last deuque */
+	if (!cfs_rq->nr_running)
+		return_cfs_rq_runtime(cfs_rq);
+
 	update_min_vruntime(cfs_rq);
 	update_cfs_shares(cfs_rq);
 }
@@ -1694,11 +1700,104 @@ out_unlock:
 
 	return idle;
 }
+
+/* a cfs_rq won't donate quota below this amount */
+static const u64 min_cfs_rq_runtime = 1 * NSEC_PER_MSEC;
+/* minimum remaining period time to redistribute slack quota */
+static const u64 min_bandwidth_expiration = 2 * NSEC_PER_MSEC;
+/* how long we wait to gather additional slack before distributing */
+static const u64 cfs_bandwidth_slack_period = 5 * NSEC_PER_MSEC;
+
+/* are we near the end of the current quota period? */
+static int runtime_refresh_within(struct cfs_bandwidth *cfs_b, u64 min_expire)
+{
+	struct hrtimer *refresh_timer = &cfs_b->period_timer;
+	u64 remaining;
+
+	/* if the call-back is running a quota refresh is already occurring */
+	if (hrtimer_callback_running(refresh_timer))
+		return 1;
+
+	/* is a quota refresh about to occur? */
+	remaining = ktime_to_ns(hrtimer_expires_remaining(refresh_timer));
+	if (remaining < min_expire)
+		return 1;
+
+	return 0;
+}
+
+static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+	u64 min_left = cfs_bandwidth_slack_period + min_bandwidth_expiration;
+
+	/* if there's a quota refresh soon don't bother with slack */
+	if (runtime_refresh_within(cfs_b, min_left))
+		return;
+
+	start_bandwidth_timer(&cfs_b->slack_timer,
+				ns_to_ktime(cfs_bandwidth_slack_period));
+}
+
+/* we know any runtime found here is valid as update_curr() precedes return */
+static void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	s64 slack_runtime = cfs_rq->runtime_remaining - min_cfs_rq_runtime;
+
+	if (!cfs_rq->runtime_enabled)
+		return;
+
+	if (slack_runtime <= 0)
+		return;
+
+	raw_spin_lock(&cfs_b->lock);
+	if (cfs_b->quota != RUNTIME_INF &&
+	    (s64)(cfs_rq->runtime_expires - cfs_b->runtime_expires) >= 0) {
+		cfs_b->runtime += slack_runtime;
+
+		if (cfs_b->runtime > sched_cfs_bandwidth_slice() &&
+		    !list_empty(&cfs_b->throttled_cfs_rq))
+			start_cfs_slack_bandwidth(cfs_b);
+	}
+	raw_spin_unlock(&cfs_b->lock);
+
+	cfs_rq->runtime_remaining -= slack_runtime;
+}
+
+static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
+{
+	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
+	u64 expires;
+
+	/* confirm we're still not at a refresh boundary */
+	if (runtime_refresh_within(cfs_b, min_bandwidth_expiration))
+		return;
+
+	raw_spin_lock(&cfs_b->lock);
+	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice) {
+		runtime = cfs_b->runtime;
+		cfs_b->runtime = 0;
+	}
+	expires = cfs_b->runtime_expires;
+	raw_spin_unlock(&cfs_b->lock);
+
+	if (!runtime)
+		return;
+
+	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+
+	raw_spin_lock(&cfs_b->lock);
+	if (expires == cfs_b->runtime_expires)
+		cfs_b->runtime = runtime;
+	raw_spin_unlock(&cfs_b->lock);
+}
+
 #else
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 		unsigned long delta_exec) {}
 static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
+static void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch 16/16] sched: add documentation for bandwidth control
  2011-06-21  7:16 [patch 00/16] CFS Bandwidth Control v7 Paul Turner
                   ` (14 preceding siblings ...)
  2011-06-21  7:17 ` [patch 15/16] sched: return unused runtime on voluntary sleep Paul Turner
@ 2011-06-21  7:17 ` Paul Turner
  2011-06-21 10:30   ` Hidetoshi Seto
  2011-06-22 10:05 ` [patch 00/16] CFS Bandwidth Control v7 Hidetoshi Seto
  16 siblings, 1 reply; 59+ messages in thread
From: Paul Turner @ 2011-06-21  7:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-documentation.patch --]
[-- Type: text/plain, Size: 4820 bytes --]

From: Bharata B Rao <bharata@linux.vnet.ibm.com>

Basic description of usage and effect for CFS Bandwidth Control.

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: Paul Turner <pjt@google.com>
---
 Documentation/scheduler/sched-bwc.txt |  110 ++++++++++++++++++++++++++++++++++
 1 file changed, 110 insertions(+)

Index: tip/Documentation/scheduler/sched-bwc.txt
===================================================================
--- /dev/null
+++ tip/Documentation/scheduler/sched-bwc.txt
@@ -0,0 +1,110 @@
+CFS Bandwidth Control
+=====================
+
+[ This document talks about CPU bandwidth control for CFS groups only.
+  Bandwidth control for RT groups covered in:
+  Documentation/scheduler/sched-rt-group.txt ]
+
+CFS bandwidth control is a group scheduler extension that can be used to
+control the maximum CPU bandwidth obtained by a CPU cgroup.
+
+Bandwidth allowed for a group is specified using quota and period. Within
+a given "period" (microseconds), a group is allowed to consume up to "quota"
+microseconds of CPU time, which is the upper limit or the hard limit. When the
+CPU bandwidth consumption of a group exceeds the hard limit, the tasks in the
+group are throttled and are not allowed to run until the end of the period at
+which time the group's quota is replenished.
+
+Runtime available to the group is tracked globally. At the beginning of
+each period, the group's global runtime pool is replenished with "quota"
+microseconds worth of runtime.  This bandwidth is then transferred to cpu local
+"accounts" on a demand basis.  Thie size of this transfer is described as a
+"slice".
+
+Interface
+---------
+Quota and period can be set via cgroup files.
+
+cpu.cfs_quota_us: the maximum allowed bandwidth (microseconds)
+cpu.cfs_period_us: the enforcement interval (microseconds)
+
+Within a period of cpu.cfs_period_us, the group as a whole will not be allowed
+to consume more than cpu.cfs_quota_us worth of runtime.
+
+The default value of cpu.cfs_period_us is 100ms and the default value
+for cpu.cfs_quota_us is -1.
+
+A group with cpu.cfs_quota_us as -1 indicates that the group has infinite
+bandwidth, which means that it is not bandwidth controlled.
+
+Writing any negative value to cpu.cfs_quota_us will turn the group into
+an infinite bandwidth group. Reading cpu.cfs_quota_us for an unconstrained
+bandwidth group will always return -1.
+
+System wide settings
+--------------------
+The amount of runtime obtained from global pool every time a CPU wants the
+group quota locally is controlled by a sysctl parameter
+sched_cfs_bandwidth_slice_us. The current default is 5ms. This can be changed
+by writing to /proc/sys/kernel/sched_cfs_bandwidth_slice_us.
+
+Statistics
+----------
+cpu.stat file lists three different stats related to bandwidth control's
+activity.
+
+- nr_periods: Number of enforcement intervals that have elapsed.
+- nr_throttled: Number of times the group has been throttled/limited.
+- throttled_time: The total time duration (in nanoseconds) for which entities
+  of the group have been throttled.
+
+These files are read-only.
+
+Hierarchy considerations
+------------------------
+The interface enforces that an individual entity's bandwidth is always
+attainable, that is: max(c_i) <= C. However, over-subscription in the
+aggregate case is explicitly allowed:
+  e.g. \Sum (c_i) may exceed C
+[ Where C is the parent's bandwidth, and c_i its children ]
+
+There are two ways in which a group may become throttled:
+
+a. it fully consumes its own quota within a period
+b. a parent's quota is fully consumed within its period
+
+In case b above, even though the child may have runtime remaining it will not
+be allowed to un until the parent's runtime is refreshed.
+
+Examples
+--------
+1. Limit a group to 1 CPU worth of runtime.
+
+	If period is 250ms and quota is also 250ms, the group will get
+	1 CPU worth of runtime every 250ms.
+
+	# echo 500000 > cpu.cfs_quota_us /* quota = 250ms */
+	# echo 250000 > cpu.cfs_period_us /* period = 250ms */
+
+2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine.
+
+	With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
+	runtime every 500ms.
+
+	# echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
+	# echo 500000 > cpu.cfs_period_us /* period = 500ms */
+
+	The larger period here allows for increased burst capacity.
+
+3. Limit a group to 20% of 1 CPU.
+
+	With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU.
+
+	# echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
+	# echo 50000 > cpu.cfs_period_us /* period = 50ms */
+
+	By using a small period her we are ensuring a consistent latency
+	response at the expense of burst capacity.
+
+
+



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 15/16] sched: return unused runtime on voluntary sleep
  2011-06-21  7:17 ` [patch 15/16] sched: return unused runtime on voluntary sleep Paul Turner
@ 2011-06-21  7:33   ` Paul Turner
  2011-06-22  9:39   ` Hidetoshi Seto
  2011-06-23 15:26   ` Peter Zijlstra
  2 siblings, 0 replies; 59+ messages in thread
From: Paul Turner @ 2011-06-21  7:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

I just realized the title of this patch is stale; as mentioned in the
changelog, we return on all dequeues to avoid stranding bandwidth.

On Tue, Jun 21, 2011 at 12:17 AM, Paul Turner <pjt@google.com> wrote:
> When a local cfs_rq blocks we return the majority of its remaining quota to the
> global bandwidth pool for use by other runqueues.
>
> We do this only when the quota is current and there is more than
> min_cfs_rq_quota [1ms by default] of runtime remaining on the rq.
>
> In the case where there are throttled runqueues and we have sufficient
> bandwidth to meter out a slice, a second timer is kicked off to handle this
> delivery, unthrottling where appropriate.
>
> Using a 'worst case' antagonist which executes on each cpu
> for 1ms before moving onto the next on a fairly large machine:
>
> no quota generations:
>  197.47 ms       /cgroup/a/cpuacct.usage
>  199.46 ms       /cgroup/a/cpuacct.usage
>  205.46 ms       /cgroup/a/cpuacct.usage
>  198.46 ms       /cgroup/a/cpuacct.usage
>  208.39 ms       /cgroup/a/cpuacct.usage
> Since we are allowed to use "stale" quota our usage is effectively bounded by
> the rate of input into the global pool and performance is relatively stable.
>
> with quota generations [1s increments]:
>  119.58 ms       /cgroup/a/cpuacct.usage
>  119.65 ms       /cgroup/a/cpuacct.usage
>  119.64 ms       /cgroup/a/cpuacct.usage
>  119.63 ms       /cgroup/a/cpuacct.usage
>  119.60 ms       /cgroup/a/cpuacct.usage
> The large deficit here is due to quota generations (/intentionally/) preventing
> us from now using previously stranded slack quota.  The cost is that this quota
> becomes unavailable.
>
> with quota generations and quota return:
>  200.09 ms       /cgroup/a/cpuacct.usage
>  200.09 ms       /cgroup/a/cpuacct.usage
>  198.09 ms       /cgroup/a/cpuacct.usage
>  200.09 ms       /cgroup/a/cpuacct.usage
>  200.06 ms       /cgroup/a/cpuacct.usage
> By returning unused quota we're able to both stably consume our desired quota
> and prevent unintentional overages due to the abuse of slack quota from
> previous quota periods (especially on a large machine).
>
> Signed-off-by: Paul Turner <pjt@google.com>
>
> ---
>  kernel/sched.c      |   15 +++++++
>  kernel/sched_fair.c |   99 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 113 insertions(+), 1 deletion(-)
>
> Index: tip/kernel/sched.c
> ===================================================================
> --- tip.orig/kernel/sched.c
> +++ tip/kernel/sched.c
> @@ -256,7 +256,7 @@ struct cfs_bandwidth {
>        u64 runtime_expires;
>
>        int idle, timer_active;
> -       struct hrtimer period_timer;
> +       struct hrtimer period_timer, slack_timer;
>        struct list_head throttled_cfs_rq;
>
>        /* statistics */
> @@ -417,6 +417,16 @@ static inline struct cfs_bandwidth *tg_c
>
>  static inline u64 default_cfs_period(void);
>  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
> +static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b);
> +
> +static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
> +{
> +       struct cfs_bandwidth *cfs_b =
> +               container_of(timer, struct cfs_bandwidth, slack_timer);
> +       do_sched_cfs_slack_timer(cfs_b);
> +
> +       return HRTIMER_NORESTART;
> +}
>
>  static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>  {
> @@ -449,6 +459,8 @@ static void init_cfs_bandwidth(struct cf
>        INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
>        hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>        cfs_b->period_timer.function = sched_cfs_period_timer;
> +       hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +       cfs_b->slack_timer.function = sched_cfs_slack_timer;
>  }
>
>  static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> @@ -476,6 +488,7 @@ static void __start_cfs_bandwidth(struct
>  static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>  {
>        hrtimer_cancel(&cfs_b->period_timer);
> +       hrtimer_cancel(&cfs_b->slack_timer);
>  }
>  #else
>  static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
> Index: tip/kernel/sched_fair.c
> ===================================================================
> --- tip.orig/kernel/sched_fair.c
> +++ tip/kernel/sched_fair.c
> @@ -1071,6 +1071,8 @@ static void clear_buddies(struct cfs_rq
>                __clear_buddies_skip(se);
>  }
>
> +static void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
> +
>  static void
>  dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  {
> @@ -1109,6 +1111,10 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
>        if (!(flags & DEQUEUE_SLEEP))
>                se->vruntime -= cfs_rq->min_vruntime;
>
> +       /* return excess runtime on last deuque */

typo here also fixed

> +       if (!cfs_rq->nr_running)
> +               return_cfs_rq_runtime(cfs_rq);
> +
>        update_min_vruntime(cfs_rq);
>        update_cfs_shares(cfs_rq);
>  }
> @@ -1694,11 +1700,104 @@ out_unlock:
>
>        return idle;
>  }
> +
> +/* a cfs_rq won't donate quota below this amount */
> +static const u64 min_cfs_rq_runtime = 1 * NSEC_PER_MSEC;
> +/* minimum remaining period time to redistribute slack quota */
> +static const u64 min_bandwidth_expiration = 2 * NSEC_PER_MSEC;
> +/* how long we wait to gather additional slack before distributing */
> +static const u64 cfs_bandwidth_slack_period = 5 * NSEC_PER_MSEC;
> +
> +/* are we near the end of the current quota period? */
> +static int runtime_refresh_within(struct cfs_bandwidth *cfs_b, u64 min_expire)
> +{
> +       struct hrtimer *refresh_timer = &cfs_b->period_timer;
> +       u64 remaining;
> +
> +       /* if the call-back is running a quota refresh is already occurring */
> +       if (hrtimer_callback_running(refresh_timer))
> +               return 1;
> +
> +       /* is a quota refresh about to occur? */
> +       remaining = ktime_to_ns(hrtimer_expires_remaining(refresh_timer));
> +       if (remaining < min_expire)
> +               return 1;
> +
> +       return 0;
> +}
> +
> +static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
> +{
> +       u64 min_left = cfs_bandwidth_slack_period + min_bandwidth_expiration;
> +
> +       /* if there's a quota refresh soon don't bother with slack */
> +       if (runtime_refresh_within(cfs_b, min_left))
> +               return;
> +
> +       start_bandwidth_timer(&cfs_b->slack_timer,
> +                               ns_to_ktime(cfs_bandwidth_slack_period));
> +}
> +
> +/* we know any runtime found here is valid as update_curr() precedes return */
> +static void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> +{
> +       struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> +       s64 slack_runtime = cfs_rq->runtime_remaining - min_cfs_rq_runtime;
> +
> +       if (!cfs_rq->runtime_enabled)
> +               return;
> +
> +       if (slack_runtime <= 0)
> +               return;
> +
> +       raw_spin_lock(&cfs_b->lock);
> +       if (cfs_b->quota != RUNTIME_INF &&
> +           (s64)(cfs_rq->runtime_expires - cfs_b->runtime_expires) >= 0) {
> +               cfs_b->runtime += slack_runtime;
> +
> +               if (cfs_b->runtime > sched_cfs_bandwidth_slice() &&
> +                   !list_empty(&cfs_b->throttled_cfs_rq))
> +                       start_cfs_slack_bandwidth(cfs_b);
> +       }
> +       raw_spin_unlock(&cfs_b->lock);
> +
> +       cfs_rq->runtime_remaining -= slack_runtime;
> +}
> +
> +static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> +{
> +       u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
> +       u64 expires;
> +
> +       /* confirm we're still not at a refresh boundary */
> +       if (runtime_refresh_within(cfs_b, min_bandwidth_expiration))
> +               return;
> +
> +       raw_spin_lock(&cfs_b->lock);
> +       if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice) {
> +               runtime = cfs_b->runtime;
> +               cfs_b->runtime = 0;
> +       }
> +       expires = cfs_b->runtime_expires;
> +       raw_spin_unlock(&cfs_b->lock);
> +
> +       if (!runtime)
> +               return;
> +
> +       runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
> +
> +       raw_spin_lock(&cfs_b->lock);
> +       if (expires == cfs_b->runtime_expires)
> +               cfs_b->runtime = runtime;
> +       raw_spin_unlock(&cfs_b->lock);
> +}
> +
>  #else
>  static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
>                unsigned long delta_exec) {}
>  static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
>  static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
> +static void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
>
>  static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
>  {
>
>
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 16/16] sched: add documentation for bandwidth control
  2011-06-21  7:17 ` [patch 16/16] sched: add documentation for bandwidth control Paul Turner
@ 2011-06-21 10:30   ` Hidetoshi Seto
  2011-06-21 19:46     ` Paul Turner
  0 siblings, 1 reply; 59+ messages in thread
From: Hidetoshi Seto @ 2011-06-21 10:30 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

Minor typos/nitpicks:

(2011/06/21 16:17), Paul Turner wrote:
> From: Bharata B Rao <bharata@linux.vnet.ibm.com>
> 
> Basic description of usage and effect for CFS Bandwidth Control.
> 
> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> Signed-off-by: Paul Turner <pjt@google.com>
> ---
>  Documentation/scheduler/sched-bwc.txt |   98
>  ++++++++++++++++++++++++++++++++++
>  Documentation/scheduler/sched-bwc.txt |  110 ++++++++++++++++++++++++++++++++++
>  1 file changed, 110 insertions(+)
> 
> Index: tip/Documentation/scheduler/sched-bwc.txt
> ===================================================================
> --- /dev/null
> +++ tip/Documentation/scheduler/sched-bwc.txt
> @@ -0,0 +1,110 @@
> +CFS Bandwidth Control
> +=====================
> +
> +[ This document talks about CPU bandwidth control for CFS groups only.
> +  Bandwidth control for RT groups covered in:
> +  Documentation/scheduler/sched-rt-group.txt ]
> +
> +CFS bandwidth control is a group scheduler extension that can be used to
> +control the maximum CPU bandwidth obtained by a CPU cgroup.
> +
> +Bandwidth allowed for a group is specified using quota and period. Within
> +a given "period" (microseconds), a group is allowed to consume up to "quota"
> +microseconds of CPU time, which is the upper limit or the hard limit. When the
> +CPU bandwidth consumption of a group exceeds the hard limit, the tasks in the
> +group are throttled and are not allowed to run until the end of the period at
> +which time the group's quota is replenished.
> +
> +Runtime available to the group is tracked globally. At the beginning of
> +each period, the group's global runtime pool is replenished with "quota"
> +microseconds worth of runtime.  This bandwidth is then transferred to cpu local
> +"accounts" on a demand basis.  Thie size of this transfer is described as a

                                  The ?

> +"slice".
> +
> +Interface
> +---------
> +Quota and period can be set via cgroup files.
> +
> +cpu.cfs_quota_us: the enforcement interval (microseconds)
> +cpu.cfs_period_us: the maximum allowed bandwidth (microseconds)
> +
> +Within a period of cpu.cfs_period_us, the group as a whole will not be allowed
> +to consume more than cpu_cfs_quota_us worth of runtime.
> +
> +The default value of cpu.cfs_period_us is 100ms and the default value
> +for cpu.cfs_quota_us is -1.
> +
> +A group with cpu.cfs_quota_us as -1 indicates that the group has infinite
> +bandwidth, which means that it is not bandwidth controlled.

(I think it's better to use "unconstrained (bandwidth) group" as the
 standardized expression instead of "infinite bandwidth group", so ...)

                                               ... controlled. Such group is
described as an unconstrained bandwidth group.

> +
> +Writing any negative value to cpu.cfs_quota_us will turn the group into
> +an infinite bandwidth group. Reading cpu.cfs_quota_us for an unconstrained
      ^^^^^^^^
      unconstrained

> +bandwidth group will always return -1.
> +
> +System wide settings
> +--------------------
> +The amount of runtime obtained from global pool every time a CPU wants the
> +group quota locally is controlled by a sysctl parameter
> +sched_cfs_bandwidth_slice_us. The current default is 5ms. This can be changed
> +by writing to /proc/sys/kernel/sched_cfs_bandwidth_slice_us.
> +
> +Statistics
> +----------
> +cpu.stat file lists three different stats related to bandwidth control's
> +activity.
> +
> +- nr_periods: Number of enforcement intervals that have elapsed.
> +- nr_throttled: Number of times the group has been throttled/limited.
> +- throttled_time: The total time duration (in nanoseconds) for which entities
> +  of the group have been throttled.
> +
> +These files are read-only.
> +
> +Hierarchy considerations
> +------------------------
> +The interface enforces that an individual entity's bandwidth is always
> +attainable, that is: max(c_i) <= C. However, over-subscription in the
> +aggregate case is explicitly allowed:
> +  e.g. \Sum (c_i) may exceed C
> +[ Where C is the parent's bandwidth, and c_i its children ]
> +
> +There are two ways in which a group may become throttled:
> +
> +a. it fully consumes its own quota within a period
> +b. a parent's quota is fully consumed within its period
> +
> +In case b above, even though the child may have runtime remaining it will not
> +be allowed to un until the parent's runtime is refreshed.

                 run ?

> +
> +Examples
> +--------
> +1. Limit a group to 1 CPU worth of runtime.
> +
> +	If period is 250ms and quota is also 250ms, the group will get
> +	1 CPU worth of runtime every 250ms.
> +
> +	# echo 500000 > cpu.cfs_quota_us /* quota = 250ms */
               ~~~~~~
               250000 ?

> +	# echo 250000 > cpu.cfs_period_us /* period = 250ms */
> +
> +2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine.
> +
> +	With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
> +	runtime every 500ms.
> +
> +	# echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
> +	# echo 500000 > cpu.cfs_period_us /* period = 500ms */
> +
> +	The larger period here allows for increased burst capacity.
> +
> +3. Limit a group to 20% of 1 CPU.
> +
> +	With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU.
> +
> +	# echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
> +	# echo 50000 > cpu.cfs_period_us /* period = 50ms */
> +
> +	By using a small period her we are ensuring a consistent latency

                                here ?

> +	response at the expense of burst capacity.
> +
> +
> +

Blank lines at EOF ?


Thank you for improving the document!  Especially I think it is pretty
good that it now provides examples of how "period" is used.

For the rest of the V7 set (overall it looks very good), give me some
time to review & test. ;-)


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 16/16] sched: add documentation for bandwidth control
  2011-06-21 10:30   ` Hidetoshi Seto
@ 2011-06-21 19:46     ` Paul Turner
  0 siblings, 0 replies; 59+ messages in thread
From: Paul Turner @ 2011-06-21 19:46 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

On Tue, Jun 21, 2011 at 3:30 AM, Hidetoshi Seto
<seto.hidetoshi@jp.fujitsu.com> wrote:
> Minor typos/nitpicks:
>
> (2011/06/21 16:17), Paul Turner wrote:
>> From: Bharata B Rao <bharata@linux.vnet.ibm.com>
>>
>> Basic description of usage and effect for CFS Bandwidth Control.
>>
>> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
>> Signed-off-by: Paul Turner <pjt@google.com>
>> ---
>>  Documentation/scheduler/sched-bwc.txt |  110 ++++++++++++++++++++++++++++++++++
>>  1 file changed, 110 insertions(+)
>>
>> Index: tip/Documentation/scheduler/sched-bwc.txt
>> ===================================================================
>> --- /dev/null
>> +++ tip/Documentation/scheduler/sched-bwc.txt
>> @@ -0,0 +1,110 @@
>> +CFS Bandwidth Control
>> +=====================
>> +
>> +[ This document talks about CPU bandwidth control for CFS groups only.
>> +  Bandwidth control for RT groups covered in:
>> +  Documentation/scheduler/sched-rt-group.txt ]
>> +
>> +CFS bandwidth control is a group scheduler extension that can be used to
>> +control the maximum CPU bandwidth obtained by a CPU cgroup.
>> +
>> +Bandwidth allowed for a group is specified using quota and period. Within
>> +a given "period" (microseconds), a group is allowed to consume up to "quota"
>> +microseconds of CPU time, which is the upper limit or the hard limit. When the
>> +CPU bandwidth consumption of a group exceeds the hard limit, the tasks in the
>> +group are throttled and are not allowed to run until the end of the period at
>> +which time the group's quota is replenished.
>> +
>> +Runtime available to the group is tracked globally. At the beginning of
>> +each period, the group's global runtime pool is replenished with "quota"
>> +microseconds worth of runtime.  This bandwidth is then transferred to cpu local
>> +"accounts" on a demand basis.  Thie size of this transfer is described as a
>
>                                  The ?
>
>> +"slice".
>> +
>> +Interface
>> +---------
>> +Quota and period can be set via cgroup files.
>> +
>> +cpu.cfs_quota_us: the maximum allowed bandwidth (microseconds)
>> +cpu.cfs_period_us: the enforcement interval (microseconds)
>> +
>> +Within a period of cpu.cfs_period_us, the group as a whole will not be allowed
>> +to consume more than cpu_cfs_quota_us worth of runtime.
>> +
>> +The default value of cpu.cfs_period_us is 100ms and the default value
>> +for cpu.cfs_quota_us is -1.
>> +
>> +A group with cpu.cfs_quota_us as -1 indicates that the group has infinite
>> +bandwidth, which means that it is not bandwidth controlled.
>
> (I think it's better to use "unconstrained (bandwidth) group" as the
>  standardized expression instead of "infinite bandwidth group", so ...)
>
>                                               ... controlled. Such group is
> described as an unconstrained bandwidth group.
>
>> +
>> +Writing any negative value to cpu.cfs_quota_us will turn the group into
>> +an infinite bandwidth group. Reading cpu.cfs_quota_us for an unconstrained
>      ^^^^^^^^
>      unconstrained
>
>> +bandwidth group will always return -1.
>> +
>> +System wide settings
>> +--------------------
>> +The amount of runtime obtained from global pool every time a CPU wants the
>> +group quota locally is controlled by a sysctl parameter
>> +sched_cfs_bandwidth_slice_us. The current default is 5ms. This can be changed
>> +by writing to /proc/sys/kernel/sched_cfs_bandwidth_slice_us.
>> +
>> +Statistics
>> +----------
>> +cpu.stat file lists three different stats related to bandwidth control's
>> +activity.
>> +
>> +- nr_periods: Number of enforcement intervals that have elapsed.
>> +- nr_throttled: Number of times the group has been throttled/limited.
>> +- throttled_time: The total time duration (in nanoseconds) for which entities
>> +  of the group have been throttled.
>> +
>> +These files are read-only.
>> +
>> +Hierarchy considerations
>> +------------------------
>> +The interface enforces that an individual entity's bandwidth is always
>> +attainable, that is: max(c_i) <= C. However, over-subscription in the
>> +aggregate case is explicitly allowed:
>> +  e.g. \Sum (c_i) may exceed C
>> +[ Where C is the parent's bandwidth, and c_i its children ]
>> +
>> +There are two ways in which a group may become throttled:
>> +
>> +a. it fully consumes its own quota within a period
>> +b. a parent's quota is fully consumed within its period
>> +
>> +In case b above, even though the child may have runtime remaining it will not
>> +be allowed to un until the parent's runtime is refreshed.
>
>                 run ?
>
>> +
>> +Examples
>> +--------
>> +1. Limit a group to 1 CPU worth of runtime.
>> +
>> +     If period is 250ms and quota is also 250ms, the group will get
>> +     1 CPU worth of runtime every 250ms.
>> +
>> +     # echo 500000 > cpu.cfs_quota_us /* quota = 250ms */
>               ~~~~~~
>               250000 ?
>
>> +     # echo 250000 > cpu.cfs_period_us /* period = 250ms */
>> +
>> +2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine.
>> +
>> +     With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
>> +     runtime every 500ms.
>> +
>> +     # echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
>> +     # echo 500000 > cpu.cfs_period_us /* period = 500ms */
>> +
>> +     The larger period here allows for increased burst capacity.
>> +
>> +3. Limit a group to 20% of 1 CPU.
>> +
>> +     With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU.
>> +
>> +     # echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
>> +     # echo 50000 > cpu.cfs_period_us /* period = 50ms */
>> +
>> +     By using a small period her we are ensuring a consistent latency
>
>                                here ?
>
>> +     response at the expense of burst capacity.
>> +
>> +
>> +
>
> Blank lines at EOF ?
>
>
> Thank you for improving the document!  Especially I think it is pretty
> good that it now provides examples of how "period" is used.

Yeah, I gave Bharata's documentation a once-over; the errors above are
a mix of original and introduced.  :(

I'll clean up the nits above as well as do a proper general editing
pass for language and flow.

>
> For the rest of the V7 set (overall it looks very good), give me some
> time to review & test. ;-)
>

Sounds good!  Thanks!

>
> Thanks,
> H.Seto
>
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 04/16] sched: validate CFS quota hierarchies
  2011-06-21  7:16 ` [patch 04/16] sched: validate CFS quota hierarchies Paul Turner
@ 2011-06-22  5:43   ` Bharata B Rao
  2011-06-22  6:57     ` Paul Turner
  2011-06-22  9:38   ` Hidetoshi Seto
  1 sibling, 1 reply; 59+ messages in thread
From: Bharata B Rao @ 2011-06-22  5:43 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

On Tue, Jun 21, 2011 at 12:16:53AM -0700, Paul Turner wrote:
> Add constraints validation for CFS bandwidth hierarchies.
> 
> Validate that:
>    max(child bandwidth) <= parent_bandwidth
> 
> In a quota limited hierarchy, an unconstrained entity
> (e.g. bandwidth==RUNTIME_INF) inherits the bandwidth of its parent.
> 
> This constraint is chosen over sum(child_bandwidth) as notion of over-commit is
> valuable within SCHED_OTHER.  Some basic code from the RT case is re-factored
> for reuse.
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> 
> ---
>  kernel/sched.c |  109 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 96 insertions(+), 13 deletions(-)
> 
> Index: tip/kernel/sched.c
> ===================================================================
> --- tip.orig/kernel/sched.c
> +++ tip/kernel/sched.c
> @@ -249,6 +249,7 @@ struct cfs_bandwidth {
>  	raw_spinlock_t lock;
>  	ktime_t period;
>  	u64 quota;
> +	s64 hierarchal_quota;

You mean hierarchical I suppose.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 04/16] sched: validate CFS quota hierarchies
  2011-06-22  5:43   ` Bharata B Rao
@ 2011-06-22  6:57     ` Paul Turner
  0 siblings, 0 replies; 59+ messages in thread
From: Paul Turner @ 2011-06-22  6:57 UTC (permalink / raw)
  To: bharata
  Cc: linux-kernel, Peter Zijlstra, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

On Tue, Jun 21, 2011 at 10:43 PM, Bharata B Rao
<bharata@linux.vnet.ibm.com> wrote:
> On Tue, Jun 21, 2011 at 12:16:53AM -0700, Paul Turner wrote:
>> Add constraints validation for CFS bandwidth hierarchies.
>>
>> Validate that:
>>    max(child bandwidth) <= parent_bandwidth
>>
>> In a quota limited hierarchy, an unconstrained entity
>> (e.g. bandwidth==RUNTIME_INF) inherits the bandwidth of its parent.
>>
>> This constraint is chosen over sum(child_bandwidth) as notion of over-commit is
>> valuable within SCHED_OTHER.  Some basic code from the RT case is re-factored
>> for reuse.
>>
>> Signed-off-by: Paul Turner <pjt@google.com>
>>
>> ---
>>  kernel/sched.c |  109 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
>>  1 file changed, 96 insertions(+), 13 deletions(-)
>>
>> Index: tip/kernel/sched.c
>> ===================================================================
>> --- tip.orig/kernel/sched.c
>> +++ tip/kernel/sched.c
>> @@ -249,6 +249,7 @@ struct cfs_bandwidth {
>>       raw_spinlock_t lock;
>>       ktime_t period;
>>       u64 quota;
>> +     s64 hierarchal_quota;
>
> You mean hierarchical I suppose.
>

Yup!  Thanks

> Regards,
> Bharata.
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 08/16] sched: throttle cfs_rq entities which exceed their local runtime
  2011-06-21  7:16 ` [patch 08/16] sched: throttle cfs_rq entities which exceed their local runtime Paul Turner
@ 2011-06-22  7:11   ` Bharata B Rao
  2011-06-22 16:07   ` Peter Zijlstra
  1 sibling, 0 replies; 59+ messages in thread
From: Bharata B Rao @ 2011-06-22  7:11 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Tue, Jun 21, 2011 at 12:16:57AM -0700, Paul Turner wrote:
> @@ -1505,7 +1559,17 @@ enqueue_task_fair(struct rq *rq, struct 
>  			break;
>  		cfs_rq = cfs_rq_of(se);
>  		enqueue_entity(cfs_rq, se, flags);
> +
> +		/*
> +		 * end evaluation on encountering a throttled cfs_rq
> +		 *
> +		 * note: in the case of encountering a throttled cfs_rq we will
> +		 * post the final h_nr_running decrement below.

You mean 'final h_nr_running increment' I suppose ?

> +		*/
> +		if (cfs_rq_throttled(cfs_rq))
> +			break;
>  		cfs_rq->h_nr_running++;
> +
>  		flags = ENQUEUE_WAKEUP;
>  	}
> 
> @@ -1513,11 +1577,15 @@ enqueue_task_fair(struct rq *rq, struct 
>  		cfs_rq = cfs_rq_of(se);
>  		cfs_rq->h_nr_running++;

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 04/16] sched: validate CFS quota hierarchies
  2011-06-21  7:16 ` [patch 04/16] sched: validate CFS quota hierarchies Paul Turner
  2011-06-22  5:43   ` Bharata B Rao
@ 2011-06-22  9:38   ` Hidetoshi Seto
  1 sibling, 0 replies; 59+ messages in thread
From: Hidetoshi Seto @ 2011-06-22  9:38 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

(2011/06/21 16:16), Paul Turner wrote:
> Add constraints validation for CFS bandwidth hierarchies.
> 
> Validate that:
>    max(child bandwidth) <= parent_bandwidth
> 
> In a quota limited hierarchy, an unconstrained entity
> (e.g. bandwidth==RUNTIME_INF) inherits the bandwidth of its parent.
> 
> This constraint is chosen over sum(child_bandwidth) as notion of over-commit is
> valuable within SCHED_OTHER.  Some basic code from the RT case is re-factored
> for reuse.
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> 
> ---

The sysctl_sched_cfs_bandwidth_consistent is gone for good.
Now it is no longer complicated.

(Though there is a trivial typo pointed out by Bharata, it will soon be fixed.)
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 06/16] sched: add a timer to handle CFS bandwidth refresh
  2011-06-21  7:16 ` [patch 06/16] sched: add a timer to handle CFS bandwidth refresh Paul Turner
@ 2011-06-22  9:38   ` Hidetoshi Seto
  0 siblings, 0 replies; 59+ messages in thread
From: Hidetoshi Seto @ 2011-06-22  9:38 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

(2011/06/21 16:16), Paul Turner wrote:
> This patch adds a per-task_group timer which handles the refresh of the global
> CFS bandwidth pool.
> 
> Since the RT pool is using a similar timer there's some small refactoring to
> share this support.
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
> 
> ---
(snip)
> @@ -413,8 +444,26 @@ static void init_cfs_rq_runtime(struct c
>  	cfs_rq->runtime_enabled = 0;
>  }
>  
> +/* requires cfs_b->lock */
> +static void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> +{
> +	/*
> +	 * Handle the extremely unlikely case of trying to start the period
> +	 * timer, while that timer is in the tear-down path from having
> +	 * decided to no longer run.  In this case we must wait for the
> +	 * (tail of the) callback to terminate so that we can re-enqueue it.
> +	 */
> +	if (unlikely(hrtimer_active(&cfs_b->period_timer)))
> +		hrtimer_cancel(&cfs_b->period_timer);
> +
> +	cfs_b->timer_active = 1;
> +	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
> +}
> +

Nice trick :-)

(Again,)
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 07/16] sched: expire invalid runtime
  2011-06-21  7:16 ` [patch 07/16] sched: expire invalid runtime Paul Turner
@ 2011-06-22  9:38   ` Hidetoshi Seto
  2011-06-22 15:47   ` Peter Zijlstra
  1 sibling, 0 replies; 59+ messages in thread
From: Hidetoshi Seto @ 2011-06-22  9:38 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

(2011/06/21 16:16), Paul Turner wrote:
> Since quota is managed using a global state but consumed on a per-cpu basis
> we need to ensure that our per-cpu state is appropriately synchronized.  
> Most importantly, runtime that is state (from a previous period) should not be
> locally consumable.
> 
> We take advantage of existing sched_clock synchronization about the jiffy to
> efficiently detect whether we have (globally) crossed a quota boundary above.
> 
> One catch is that the direction of spread on sched_clock is undefined, 
> specifically, we don't know whether our local clock is behind or ahead
> of the one responsible for the current expiration time.
> 
> Fortunately we can differentiate these by considering whether the
> global deadline has advanced.  If it has not, then we assume our clock to be 
> "fast" and advance our local expiration; otherwise, we know the deadline has
> truly passed and we expire our local runtime.
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
> 
> ---

Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 10/16] sched: throttle entities exceeding their allowed bandwidth
  2011-06-21  7:16 ` [patch 10/16] sched: throttle entities exceeding their allowed bandwidth Paul Turner
@ 2011-06-22  9:39   ` Hidetoshi Seto
  0 siblings, 0 replies; 59+ messages in thread
From: Hidetoshi Seto @ 2011-06-22  9:39 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

(2011/06/21 16:16), Paul Turner wrote:
> Add conditional checks at the time of put_prev_entity() and enqueue_entity() to detect
> when an active entity has exceeded its allowed bandwidth and requires
> throttling.
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> 
> ---
(snip)
> @@ -1403,7 +1412,7 @@ static void account_cfs_rq_runtime(struc
>  	 * if we're unable to extend our runtime we resched so that the active
>  	 * hierarchy can be throttled
>  	 */
> -	if (!assign_cfs_rq_runtime(cfs_rq))
> +	if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
>  		resched_task(rq_of(cfs_rq)->curr);
>  }
>  

Nit: I think this hunk could be included in patch 08/16.

The rest is good.

Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 15/16] sched: return unused runtime on voluntary sleep
  2011-06-21  7:17 ` [patch 15/16] sched: return unused runtime on voluntary sleep Paul Turner
  2011-06-21  7:33   ` Paul Turner
@ 2011-06-22  9:39   ` Hidetoshi Seto
  2011-06-23 15:26   ` Peter Zijlstra
  2 siblings, 0 replies; 59+ messages in thread
From: Hidetoshi Seto @ 2011-06-22  9:39 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

(2011/06/21 16:17), Paul Turner wrote:
> When a local cfs_rq blocks we return the majority of its remaining quota to the
> global bandwidth pool for use by other runqueues.
> 
> We do this only when the quota is current and there is more than 
> min_cfs_rq_quota [1ms by default] of runtime remaining on the rq.
> 
> In the case where there are throttled runqueues and we have sufficient
> bandwidth to meter out a slice, a second timer is kicked off to handle this
> delivery, unthrottling where appropriate.
> 
> Using a 'worst case' antagonist which executes on each cpu
> for 1ms before moving onto the next on a fairly large machine:
> 
> no quota generations:
>  197.47 ms       /cgroup/a/cpuacct.usage
>  199.46 ms       /cgroup/a/cpuacct.usage
>  205.46 ms       /cgroup/a/cpuacct.usage
>  198.46 ms       /cgroup/a/cpuacct.usage
>  208.39 ms       /cgroup/a/cpuacct.usage
> Since we are allowed to use "stale" quota our usage is effectively bounded by
> the rate of input into the global pool and performance is relatively stable.
> 
> with quota generations [1s increments]:
>  119.58 ms       /cgroup/a/cpuacct.usage
>  119.65 ms       /cgroup/a/cpuacct.usage
>  119.64 ms       /cgroup/a/cpuacct.usage
>  119.63 ms       /cgroup/a/cpuacct.usage
>  119.60 ms       /cgroup/a/cpuacct.usage
> The large deficit here is due to quota generations (/intentionally/) preventing
> us from now using previously stranded slack quota.  The cost is that this quota
> becomes unavailable.
> 
> with quota generations and quota return:
>  200.09 ms       /cgroup/a/cpuacct.usage
>  200.09 ms       /cgroup/a/cpuacct.usage
>  198.09 ms       /cgroup/a/cpuacct.usage
>  200.09 ms       /cgroup/a/cpuacct.usage
>  200.06 ms       /cgroup/a/cpuacct.usage
> By returning unused quota we're able to both stably consume our desired quota
> and prevent unintentional overages due to the abuse of slack quota from 
> previous quota periods (especially on a large machine).
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> 
> ---

(For all but the patch title:)
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 00/16] CFS Bandwidth Control v7
  2011-06-21  7:16 [patch 00/16] CFS Bandwidth Control v7 Paul Turner
                   ` (15 preceding siblings ...)
  2011-06-21  7:17 ` [patch 16/16] sched: add documentation for bandwidth control Paul Turner
@ 2011-06-22 10:05 ` Hidetoshi Seto
  2011-06-23 12:06   ` Peter Zijlstra
  16 siblings, 1 reply; 59+ messages in thread
From: Hidetoshi Seto @ 2011-06-22 10:05 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

(2011/06/21 16:16), Paul Turner wrote:
> Hideotoshi, the following patches changed enough, or are new, and should be
> looked over again before I can re-add your Reviewed-by.
> 
> [patch 04/16] sched: validate CFS quota hierarchies
> [patch 06/16] sched: add a timer to handle CFS bandwidth refresh
> [patch 07/16] sched: expire invalid runtime
> [patch 10/16] sched: throttle entities exceeding their allowed bandwidth
> [patch 15/16] sched: return unused runtime on voluntary sleep

Done.

Thank you very much again for your great work! 

I'll continue my test/benchmark on this v7 for a while.
Though I believe no more bug is there, I'll let you know if there is something.

I think it's about time to have this set in an upstream branch (at first, tip:sched/???).
(How about posting an update with the corrected wording (say, v7.1?) within a few days?)


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 03/16] sched: introduce primitives to account for CFS bandwidth tracking
  2011-06-21  7:16 ` [patch 03/16] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
@ 2011-06-22 10:52   ` Peter Zijlstra
  2011-07-06 21:38     ` Paul Turner
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2011-06-22 10:52 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Tue, 2011-06-21 at 00:16 -0700, Paul Turner wrote:
> +#ifdef CONFIG_CFS_BANDWIDTH
> +       {
> +               .name = "cfs_quota_us",
> +               .read_s64 = cpu_cfs_quota_read_s64,
> +               .write_s64 = cpu_cfs_quota_write_s64,
> +       },
> +       {
> +               .name = "cfs_period_us",
> +               .read_u64 = cpu_cfs_period_read_u64,
> +               .write_u64 = cpu_cfs_period_write_u64,
> +       },
> +#endif 

Did I miss a reply to:
lkml.kernel.org/r/1305538202.2466.4047.camel@twins ? why does it make
sense to have different periods per cgroup? what does it mean?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 07/16] sched: expire invalid runtime
  2011-06-21  7:16 ` [patch 07/16] sched: expire invalid runtime Paul Turner
  2011-06-22  9:38   ` Hidetoshi Seto
@ 2011-06-22 15:47   ` Peter Zijlstra
  2011-06-28  4:42     ` Paul Turner
  1 sibling, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2011-06-22 15:47 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

On Tue, 2011-06-21 at 00:16 -0700, Paul Turner wrote:

> +	now = sched_clock_cpu(smp_processor_id());
> +	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);

> +	if ((s64)(rq->clock - cfs_rq->runtime_expires) < 0)

Is there a good reason to mix these two (related) time sources?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 08/16] sched: throttle cfs_rq entities which exceed their local runtime
  2011-06-21  7:16 ` [patch 08/16] sched: throttle cfs_rq entities which exceed their local runtime Paul Turner
  2011-06-22  7:11   ` Bharata B Rao
@ 2011-06-22 16:07   ` Peter Zijlstra
  2011-06-22 16:54     ` Paul Turner
  1 sibling, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2011-06-22 16:07 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Tue, 2011-06-21 at 00:16 -0700, Paul Turner wrote:
> +static void throttle_cfs_rq(struct cfs_rq *cfs_rq)

And yet this function isn't actually used, so the $subject is at best
somewhat misleading.

/me continues reading with the next patch :-)

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 08/16] sched: throttle cfs_rq entities which exceed their local runtime
  2011-06-22 16:07   ` Peter Zijlstra
@ 2011-06-22 16:54     ` Paul Turner
  0 siblings, 0 replies; 59+ messages in thread
From: Paul Turner @ 2011-06-22 16:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Wed, Jun 22, 2011 at 9:07 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2011-06-21 at 00:16 -0700, Paul Turner wrote:
>> +static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
>
> And yet this function isn't actually used, so the $subject is at best
> somewhat misleading.
>
> /me continues reading with the next patch :-)
>

Sorry, I should fix up the title.  The throttling is now turned on
after unthrottle support is also added so that bisections through this
series are sane.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 09/16] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
  2011-06-21  7:16 ` [patch 09/16] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh Paul Turner
@ 2011-06-22 17:29   ` Peter Zijlstra
  2011-06-28  4:40     ` Paul Turner
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2011-06-22 17:29 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Tue, 2011-06-21 at 00:16 -0700, Paul Turner wrote:
>  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
>  {
> -       int idle = 1;
> +       int idle = 1, throttled = 0;
> +       u64 runtime, runtime_expires;
> +
>  
>         raw_spin_lock(&cfs_b->lock);
>         if (cfs_b->quota != RUNTIME_INF) {
> -               idle = cfs_b->idle;
> -               /* If we're going idle then defer handle the refill */
> +               /* idle depends on !throttled in the case of a large deficit */
> +               throttled = !list_empty(&cfs_b->throttled_cfs_rq);
> +               idle = cfs_b->idle && !throttled;
> +
> +               /* If we're going idle then defer the refill */
>                 if (!idle)
>                         __refill_cfs_bandwidth_runtime(cfs_b);
> +               if (throttled) {
> +                       runtime = cfs_b->runtime;
> +                       runtime_expires = cfs_b->runtime_expires;
> +
> +                       /* we must first distribute to throttled entities */
> +                       cfs_b->runtime = 0;
> +               }

Why, whats so bad about letting someone take some concurrently and not
getting throttled meanwhile? Starvation considerations? If so, that
wants mentioning.

>  
>                 /*
> -                * mark this bandwidth pool as idle so that we may deactivate
> -                * the timer at the next expiration if there is no usage.
> +                * conditionally mark this bandwidth pool as idle so that we may
> +                * deactivate the timer at the next expiration if there is no
> +                * usage.
>                  */
> -               cfs_b->idle = 1;
> +               cfs_b->idle = !throttled;
>         }
>  
> -       if (idle)
> +       if (idle) {
>                 cfs_b->timer_active = 0;
> +               goto out_unlock;
> +       }
> +       raw_spin_unlock(&cfs_b->lock);
> +
> +retry:
> +       runtime = distribute_cfs_runtime(cfs_b, runtime, runtime_expires);
> +
> +       raw_spin_lock(&cfs_b->lock);
> +       /* new bandwidth specification may exist */
> +       if (unlikely(runtime_expires != cfs_b->runtime_expires))
> +               goto out_unlock;

it might help to explain how, runtime_expires is taken from cfs_b after
calling __refill_cfs_bandwidth_runtime, and we're in the replenishment
timer, so nobody is going to be adding new runtime.

> +       /* ensure no-one was throttled while we unthrottling */
> +       if (unlikely(!list_empty(&cfs_b->throttled_cfs_rq)) && runtime > 0) {
> +               raw_spin_unlock(&cfs_b->lock);
> +               goto retry;
> +       }

OK, I can see that.

> +
> +       /* return remaining runtime */
> +       cfs_b->runtime = runtime;
> +out_unlock:
>         raw_spin_unlock(&cfs_b->lock);
>  
>         return idle; 

This function hurts my brain, code flow is horrid.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 12/16] sched: prevent interactions with throttled entities
  2011-06-21  7:17 ` [patch 12/16] sched: prevent interactions with throttled entities Paul Turner
@ 2011-06-22 21:34   ` Peter Zijlstra
  2011-06-28  4:43     ` Paul Turner
  2011-06-23 11:49   ` Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2011-06-22 21:34 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

On Tue, 2011-06-21 at 00:17 -0700, Paul Turner wrote:
> +       udd.cpu = rq->cpu;
> +       udd.now = rq->clock_task;
> +       walk_tg_tree_from(cfs_rq->tg, tg_unthrottle_down, tg_nop,
> +                         (void *)&udd); 

How about passing rq along and not using udd? :-)

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 12/16] sched: prevent interactions with throttled entities
  2011-06-21  7:17 ` [patch 12/16] sched: prevent interactions with throttled entities Paul Turner
  2011-06-22 21:34   ` Peter Zijlstra
@ 2011-06-23 11:49   ` Peter Zijlstra
  2011-06-28  4:38     ` Paul Turner
  1 sibling, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2011-06-23 11:49 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

On Tue, 2011-06-21 at 00:17 -0700, Paul Turner wrote:
> @@ -2635,8 +2704,10 @@ static int update_shares_cpu(struct task
>  
>         raw_spin_lock_irqsave(&rq->lock, flags);
>  
> -       update_rq_clock(rq);
> -       update_cfs_load(cfs_rq, 1);
> +       if (!throttled_hierarchy(cfs_rq)) {
> +               update_rq_clock(rq);
> +               update_cfs_load(cfs_rq, 1);
> +       }
>  
>         /* 

OK, so we can't contribute to load since we're throttled, but
tg->load_weight might have changed meanwhile?

Also, update_cfs_shares()->reweight_entity() can dequeue/enqueue the
entity, doesn't that require an up-to-date rq->clock?



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 00/16] CFS Bandwidth Control v7
  2011-06-22 10:05 ` [patch 00/16] CFS Bandwidth Control v7 Hidetoshi Seto
@ 2011-06-23 12:06   ` Peter Zijlstra
  2011-06-23 12:43     ` Ingo Molnar
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2011-06-23 12:06 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

On Wed, 2011-06-22 at 19:05 +0900, Hidetoshi Seto wrote:
> I'll continue my test/benchmark on this v7 for a while.
> Though I believe no more bug is there, I'll let you know if there is
> something.

Would that testing include performance of a kernel without these patches
vs one with these patches in a configuration where the new feature is
compiled in but not used?

It does add a number of if (!cfs_rq->runtime_enabled) return branches
all over the place, some possibly inside a function call (depending on
what the auto-inliner does). So while the impact should be minimal, it
would be very good to test it is indeed so.
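
(To illustrate the kind of branch in question -- the function name and
exact placement below are only an assumed sketch, not code lifted from
the series:)

	static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
	{
		/* early out when bandwidth control is built in but unused */
		if (!cfs_rq->runtime_enabled)
			return;

		/* ... quota accounting only runs for bandwidth-enabled groups ... */
	}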



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 00/16] CFS Bandwidth Control v7
  2011-06-23 12:06   ` Peter Zijlstra
@ 2011-06-23 12:43     ` Ingo Molnar
  2011-06-24  5:11       ` Hidetoshi Seto
  0 siblings, 1 reply; 59+ messages in thread
From: Ingo Molnar @ 2011-06-23 12:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hidetoshi Seto, Paul Turner, linux-kernel, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Pavel Emelyanov


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Wed, 2011-06-22 at 19:05 +0900, Hidetoshi Seto wrote:
>
> > I'll continue my test/benchmark on this v7 for a while. Though I 
> > believe no more bug is there, I'll let you know if there is 
> > something.
> 
> Would that testing include performance of a kernel without these 
> patches vs one with these patches in a configuration where the new 
> feature is compiled in but not used?
> 
> It does add a number of if (!cfs_rq->runtime_enabled) return 
> branches all over the place, some possibly inside a function call 
> (depending on what the auto-inliner does). So while the impact 
> should be minimal, it would be very good to test it is indeed so.

Yeah, doing such performance tests is absolutely required. Branches 
and instructions impact should be measured as well, beyond the cycles 
impact.

The changelog of this recent commit:

  c8b281161dfa: sched: Increase SCHED_LOAD_SCALE resolution

gives an example of how to do such measurements.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 15/16] sched: return unused runtime on voluntary sleep
  2011-06-21  7:17 ` [patch 15/16] sched: return unused runtime on voluntary sleep Paul Turner
  2011-06-21  7:33   ` Paul Turner
  2011-06-22  9:39   ` Hidetoshi Seto
@ 2011-06-23 15:26   ` Peter Zijlstra
  2011-06-28  1:42     ` Paul Turner
  2 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2011-06-23 15:26 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

On Tue, 2011-06-21 at 00:17 -0700, Paul Turner wrote:
> plain text document attachment (sched-bwc-simple_return_quota.patch)
> When a local cfs_rq blocks we return the majority of its remaining quota to the
> global bandwidth pool for use by other runqueues.

OK, I saw return_cfs_rq_runtime() do that.

> We do this only when the quota is current and there is more than 
> min_cfs_rq_quota [1ms by default] of runtime remaining on the rq.

sure..

> In the case where there are throttled runqueues and we have sufficient
> bandwidth to meter out a slice, a second timer is kicked off to handle this
> delivery, unthrottling where appropriate.

I'm having trouble there, what's the purpose of the timer, you could
redistribute immediately. None of this is well explained.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 00/16] CFS Bandwidth Control v7
  2011-06-23 12:43     ` Ingo Molnar
@ 2011-06-24  5:11       ` Hidetoshi Seto
  2011-06-26 10:35         ` Ingo Molnar
  0 siblings, 1 reply; 59+ messages in thread
From: Hidetoshi Seto @ 2011-06-24  5:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Paul Turner, linux-kernel, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Pavel Emelyanov

[-- Attachment #1: Type: text/plain, Size: 7253 bytes --]

(2011/06/23 21:43), Ingo Molnar wrote:
> 
> * Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
>> On Wed, 2011-06-22 at 19:05 +0900, Hidetoshi Seto wrote:
>>
>>> I'll continue my test/benchmark on this v7 for a while. Though I 
>>> believe no more bug is there, I'll let you know if there is 
>>> something.
>>
>> Would that testing include performance of a kernel without these 
>> patches vs one with these patches in a configuration where the new 
>> feature is compiled in but not used?
>>
>> It does add a number of if (!cfs_rq->runtime_enabled) return 
>> branches all over the place, some possibly inside a function call 
>> (depending on what the auto-inliner does). So while the impact 
>> should be minimal, it would be very good to test it is indeed so.
> 
> Yeah, doing such performance tests is absolutely required. Branches 
> and instructions impact should be measured as well, beyond the cycles 
> impact.
> 
> The changelog of this recent commit:
> 
>   c8b281161dfa: sched: Increase SCHED_LOAD_SCALE resolution
> 
> gives an example of how to do such measurements.

Thank you for the useful guidance!

I've run pipe-test-100k on both a kernel without patches (3.0-rc4)
and one with patches (3.0-rc4+), in a similar way to that described in
the change log you pointed to (but I added "-d" for more details).

I sampled 4 results for each: repeat 10 times * 3 + repeat 200 times * 1.
Cgroups are not used in either case, so of course CFS bandwidth control
is not used in the patched one.  Results are archived and attached.

Here is a comparison in diff style:

=====
--- /home/seto/bwc-pipe-test/bwc-rc4-orig.txt   2011-06-24 11:52:16.000000000 +0900
+++ /home/seto/bwc-pipe-test/bwc-rc4-patched.txt        2011-06-24 12:08:32.000000000 +0900
 [seto@SIRIUS-F14 perf]$ taskset 1 ./perf stat -d -d -d --repeat 200 ../../../pipe-test-100k

  Performance counter stats for '../../../pipe-test-100k' (200 runs):

-        865.139070 task-clock                #    0.468 CPUs utilized            ( +-  0.22% )
-           200,167 context-switches          #    0.231 M/sec                    ( +-  0.00% )
-                 0 CPU-migrations            #    0.000 M/sec                    ( +- 49.62% )
-               142 page-faults               #    0.000 M/sec                    ( +-  0.07% )
-     1,671,107,623 cycles                    #    1.932 GHz                      ( +-  0.16% ) [28.23%]
-       838,554,329 stalled-cycles-frontend   #   50.18% frontend cycles idle     ( +-  0.27% ) [28.21%]
-       453,526,560 stalled-cycles-backend    #   27.14% backend  cycles idle     ( +-  0.43% ) [28.33%]
-     1,434,140,915 instructions              #    0.86  insns per cycle
-                                             #    0.58  stalled cycles per insn  ( +-  0.06% ) [34.01%]
-       279,485,621 branches                  #  323.053 M/sec                    ( +-  0.06% ) [33.98%]
-         6,653,998 branch-misses             #    2.38% of all branches          ( +-  0.16% ) [33.93%]
-       495,463,378 L1-dcache-loads           #  572.698 M/sec                    ( +-  0.05% ) [28.12%]
-        27,903,270 L1-dcache-load-misses     #    5.63% of all L1-dcache hits    ( +-  0.28% ) [27.84%]
-           885,210 LLC-loads                 #    1.023 M/sec                    ( +-  3.21% ) [21.80%]
-             9,479 LLC-load-misses           #    1.07% of all LL-cache hits     ( +-  0.63% ) [ 5.61%]
-       830,096,007 L1-icache-loads           #  959.494 M/sec                    ( +-  0.08% ) [11.18%]
-       123,728,370 L1-icache-load-misses     #   14.91% of all L1-icache hits    ( +-  0.06% ) [16.78%]
-       504,932,490 dTLB-loads                #  583.643 M/sec                    ( +-  0.06% ) [22.30%]
-         2,056,069 dTLB-load-misses          #    0.41% of all dTLB cache hits   ( +-  2.23% ) [22.20%]
-     1,579,410,083 iTLB-loads                # 1825.614 M/sec                    ( +-  0.06% ) [22.30%]
-           394,739 iTLB-load-misses          #    0.02% of all iTLB cache hits   ( +-  0.03% ) [22.27%]
-         2,286,363 L1-dcache-prefetches      #    2.643 M/sec                    ( +-  0.72% ) [22.40%]
-           776,096 L1-dcache-prefetch-misses #    0.897 M/sec                    ( +-  1.45% ) [22.54%]
+        859.259725 task-clock                #    0.472 CPUs utilized            ( +-  0.24% )
+           200,165 context-switches          #    0.233 M/sec                    ( +-  0.00% )
+                 0 CPU-migrations            #    0.000 M/sec                    ( +-100.00% )
+               142 page-faults               #    0.000 M/sec                    ( +-  0.06% )
+     1,659,371,974 cycles                    #    1.931 GHz                      ( +-  0.18% ) [28.23%]
+       829,806,955 stalled-cycles-frontend   #   50.01% frontend cycles idle     ( +-  0.32% ) [28.32%]
+       490,316,435 stalled-cycles-backend    #   29.55% backend  cycles idle     ( +-  0.46% ) [28.34%]
+     1,445,166,061 instructions              #    0.87  insns per cycle
+                                             #    0.57  stalled cycles per insn  ( +-  0.06% ) [34.01%]
+       282,370,988 branches                  #  328.621 M/sec                    ( +-  0.06% ) [33.93%]
+         5,056,568 branch-misses             #    1.79% of all branches          ( +-  0.19% ) [33.94%]
+       500,660,789 L1-dcache-loads           #  582.665 M/sec                    ( +-  0.06% ) [28.05%]
+        26,802,313 L1-dcache-load-misses     #    5.35% of all L1-dcache hits    ( +-  0.26% ) [27.83%]
+           872,571 LLC-loads                 #    1.015 M/sec                    ( +-  3.73% ) [21.82%]
+             9,050 LLC-load-misses           #    1.04% of all LL-cache hits     ( +-  0.55% ) [ 5.70%]
+       794,396,111 L1-icache-loads           #  924.512 M/sec                    ( +-  0.06% ) [11.30%]
+       130,179,414 L1-icache-load-misses     #   16.39% of all L1-icache hits    ( +-  0.09% ) [16.85%]
+       511,119,889 dTLB-loads                #  594.837 M/sec                    ( +-  0.06% ) [22.37%]
+         2,452,378 dTLB-load-misses          #    0.48% of all dTLB cache hits   ( +-  2.31% ) [22.14%]
+     1,597,897,243 iTLB-loads                # 1859.621 M/sec                    ( +-  0.06% ) [22.17%]
+           394,366 iTLB-load-misses          #    0.02% of all iTLB cache hits   ( +-  0.03% ) [22.24%]
+         1,897,401 L1-dcache-prefetches      #    2.208 M/sec                    ( +-  0.64% ) [22.38%]
+           879,391 L1-dcache-prefetch-misses #    1.023 M/sec                    ( +-  0.90% ) [22.54%]

-       1.847093132 seconds time elapsed                                          ( +-  0.19% )
+       1.822131534 seconds time elapsed                                          ( +-  0.21% )
=====

As Peter expected, the number of branches is slightly increased.

-       279,485,621 branches                  #  323.053 M/sec                    ( +-  0.06% ) [33.98%]
+       282,370,988 branches                  #  328.621 M/sec                    ( +-  0.06% ) [33.93%]
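
(For scale, that is roughly a 1% increase in branches; note that
branch-misses actually dropped, from 6,653,998 to 5,056,568.)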

However, looking at it overall, I think there is no significant problem with
the scores for this patch set.  I'd love to hear from the maintainers.


Thanks,
H.Seto

[-- Attachment #2: bwc-pipe-test.tar.bz2 --]
[-- Type: application/octet-stream, Size: 5124 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 00/16] CFS Bandwidth Control v7
  2011-06-24  5:11       ` Hidetoshi Seto
@ 2011-06-26 10:35         ` Ingo Molnar
  2011-06-29  4:05           ` Hu Tao
  0 siblings, 1 reply; 59+ messages in thread
From: Ingo Molnar @ 2011-06-26 10:35 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: Peter Zijlstra, Paul Turner, linux-kernel, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Pavel Emelyanov


* Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> wrote:

> -        865.139070 task-clock                #    0.468 CPUs utilized            ( +-  0.22% )
> -           200,167 context-switches          #    0.231 M/sec                    ( +-  0.00% )
> -                 0 CPU-migrations            #    0.000 M/sec                    ( +- 49.62% )
> -               142 page-faults               #    0.000 M/sec                    ( +-  0.07% )
> -     1,671,107,623 cycles                    #    1.932 GHz                      ( +-  0.16% ) [28.23%]
> -       838,554,329 stalled-cycles-frontend   #   50.18% frontend cycles idle     ( +-  0.27% ) [28.21%]
> -       453,526,560 stalled-cycles-backend    #   27.14% backend  cycles idle     ( +-  0.43% ) [28.33%]
> -     1,434,140,915 instructions              #    0.86  insns per cycle
> -                                             #    0.58  stalled cycles per insn  ( +-  0.06% ) [34.01%]
> -       279,485,621 branches                  #  323.053 M/sec                    ( +-  0.06% ) [33.98%]
> -         6,653,998 branch-misses             #    2.38% of all branches          ( +-  0.16% ) [33.93%]
> -       495,463,378 L1-dcache-loads           #  572.698 M/sec                    ( +-  0.05% ) [28.12%]
> -        27,903,270 L1-dcache-load-misses     #    5.63% of all L1-dcache hits    ( +-  0.28% ) [27.84%]
> -           885,210 LLC-loads                 #    1.023 M/sec                    ( +-  3.21% ) [21.80%]
> -             9,479 LLC-load-misses           #    1.07% of all LL-cache hits     ( +-  0.63% ) [ 5.61%]
> -       830,096,007 L1-icache-loads           #  959.494 M/sec                    ( +-  0.08% ) [11.18%]
> -       123,728,370 L1-icache-load-misses     #   14.91% of all L1-icache hits    ( +-  0.06% ) [16.78%]
> -       504,932,490 dTLB-loads                #  583.643 M/sec                    ( +-  0.06% ) [22.30%]
> -         2,056,069 dTLB-load-misses          #    0.41% of all dTLB cache hits   ( +-  2.23% ) [22.20%]
> -     1,579,410,083 iTLB-loads                # 1825.614 M/sec                    ( +-  0.06% ) [22.30%]
> -           394,739 iTLB-load-misses          #    0.02% of all iTLB cache hits   ( +-  0.03% ) [22.27%]
> -         2,286,363 L1-dcache-prefetches      #    2.643 M/sec                    ( +-  0.72% ) [22.40%]
> -           776,096 L1-dcache-prefetch-misses #    0.897 M/sec                    ( +-  1.45% ) [22.54%]
> +        859.259725 task-clock                #    0.472 CPUs utilized            ( +-  0.24% )
> +           200,165 context-switches          #    0.233 M/sec                    ( +-  0.00% )
> +                 0 CPU-migrations            #    0.000 M/sec                    ( +-100.00% )
> +               142 page-faults               #    0.000 M/sec                    ( +-  0.06% )
> +     1,659,371,974 cycles                    #    1.931 GHz                      ( +-  0.18% ) [28.23%]
> +       829,806,955 stalled-cycles-frontend   #   50.01% frontend cycles idle     ( +-  0.32% ) [28.32%]
> +       490,316,435 stalled-cycles-backend    #   29.55% backend  cycles idle     ( +-  0.46% ) [28.34%]
> +     1,445,166,061 instructions              #    0.87  insns per cycle
> +                                             #    0.57  stalled cycles per insn  ( +-  0.06% ) [34.01%]
> +       282,370,988 branches                  #  328.621 M/sec                    ( +-  0.06% ) [33.93%]
> +         5,056,568 branch-misses             #    1.79% of all branches          ( +-  0.19% ) [33.94%]
> +       500,660,789 L1-dcache-loads           #  582.665 M/sec                    ( +-  0.06% ) [28.05%]
> +        26,802,313 L1-dcache-load-misses     #    5.35% of all L1-dcache hits    ( +-  0.26% ) [27.83%]
> +           872,571 LLC-loads                 #    1.015 M/sec                    ( +-  3.73% ) [21.82%]
> +             9,050 LLC-load-misses           #    1.04% of all LL-cache hits     ( +-  0.55% ) [ 5.70%]
> +       794,396,111 L1-icache-loads           #  924.512 M/sec                    ( +-  0.06% ) [11.30%]
> +       130,179,414 L1-icache-load-misses     #   16.39% of all L1-icache hits    ( +-  0.09% ) [16.85%]
> +       511,119,889 dTLB-loads                #  594.837 M/sec                    ( +-  0.06% ) [22.37%]
> +         2,452,378 dTLB-load-misses          #    0.48% of all dTLB cache hits   ( +-  2.31% ) [22.14%]
> +     1,597,897,243 iTLB-loads                # 1859.621 M/sec                    ( +-  0.06% ) [22.17%]
> +           394,366 iTLB-load-misses          #    0.02% of all iTLB cache hits   ( +-  0.03% ) [22.24%]
> +         1,897,401 L1-dcache-prefetches      #    2.208 M/sec                    ( +-  0.64% ) [22.38%]
> +           879,391 L1-dcache-prefetch-misses #    1.023 M/sec                    ( +-  0.90% ) [22.54%]
> 
> -       1.847093132 seconds time elapsed                                          ( +-  0.19% )
> +       1.822131534 seconds time elapsed                                          ( +-  0.21% )
> =====
> 
> As Peter have expected, the number of branches is slightly increased.
> 
> -       279,485,621 branches                  #  323.053 M/sec                    ( +-  0.06% ) [33.98%]
> +       282,370,988 branches                  #  328.621 M/sec                    ( +-  0.06% ) [33.93%]
> 
> However, looking overall, I think there is no significant problem on
> the score with this patch set.  I'd love to hear from maintainers.

Yeah, these numbers look pretty good. Note that the percentages in 
the third column (the amount of time that particular event was 
measured) are pretty low, and it would be nice to eliminate that: i.e. 
now that we know the ballpark figures, do very precise measurements 
that do not over-commit the PMU.

One such measurement would be:

	-e cycles -e instructions -e branches

This should also bring the stddev percentages down, I think, to below 
0.1%.
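
Concretely, reusing the invocation from the numbers above, that would 
be something like:

	taskset 1 ./perf stat -e cycles -e instructions -e branches \
		--repeat 200 ../../../pipe-test-100k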

Another measurement would be to test not just the feature-enabled but 
also the feature-disabled cost - so that we document the rough 
overhead that users of this new scheduler feature should expect.

Organizing it into neat before/after numbers and percentages, 
comparing it with noise (stddev) [i.e. determining that the effect we 
measure is above noise] and putting it all into the changelog would 
be the other goal of these measurements.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 15/16] sched: return unused runtime on voluntary sleep
  2011-06-23 15:26   ` Peter Zijlstra
@ 2011-06-28  1:42     ` Paul Turner
  2011-06-28 10:01       ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Paul Turner @ 2011-06-28  1:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

On Thu, Jun 23, 2011 at 8:26 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2011-06-21 at 00:17 -0700, Paul Turner wrote:
>> plain text document attachment (sched-bwc-simple_return_quota.patch)
>> When a local cfs_rq blocks we return the majority of its remaining quota to the
>> global bandwidth pool for use by other runqueues.
>
> OK, I saw return_cfs_rq_runtime() do that.
>
>> We do this only when the quota is current and there is more than
>> min_cfs_rq_quota [1ms by default] of runtime remaining on the rq.
>
> sure..
>
>> In the case where there are throttled runqueues and we have sufficient
>> bandwidth to meter out a slice, a second timer is kicked off to handle this
>> delivery, unthrottling where appropriate.
>
> I'm having trouble there, what's the purpose of the timer, you could
> redistribute immediately. None of this is well explained.
>

Current reasons:
- There was concern regarding thrashing the unthrottle path on a task
that is rapidly oscillating between runnable states; using a timer,
this operation is inherently limited both in frequency and to a single
cpu.  I think the move to using a throttled list (as opposed to having
to poll all cpus) as well as the fact that we only return quota in
excess of min_cfs_rq_quota probably mitigates this to the point where
we could just do away with this and do it directly in the put path.

- The aesthetics of releasing rq->lock in the put path.  Quick
inspection suggests it should actually be safe to do at that point,
and we do similar for idle_balance().

Given that, on consideration, the above two factors are not requirements,
this could be moved out of a timer and into the put path directly (with
the fact that we drop rq->lock strongly commented).  I have no strong
preference between the two choices.

Uninteresting additional historical reason:
The /original/ requirement for a timer here was that previous versions
placed some of the bandwidth distribution under cfs_b->lock.  This
meant that we couldn't take rq->lock under cfs_b->lock (as the nesting
is the other way around).  This is no longer a requirement
(advancement of expiration now provides what cfs_b->lock used to
provide here).
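
Roughly, the old ordering problem looked like this (illustration only):

	/*
	 * established nesting:           distribution under cfs_b->lock:
	 *   rq->lock                       cfs_b->lock
	 *     -> cfs_b->lock                 -> rq->lock   (inverted ordering)
	 */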




(In short: with the timer we don't have to release rq->lock within the put path.)

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 12/16] sched: prevent interactions with throttled entities
  2011-06-23 11:49   ` Peter Zijlstra
@ 2011-06-28  4:38     ` Paul Turner
  0 siblings, 0 replies; 59+ messages in thread
From: Paul Turner @ 2011-06-28  4:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

On Thu, Jun 23, 2011 at 4:49 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2011-06-21 at 00:17 -0700, Paul Turner wrote:
>> @@ -2635,8 +2704,10 @@ static int update_shares_cpu(struct task
>>
>>         raw_spin_lock_irqsave(&rq->lock, flags);
>>
>> -       update_rq_clock(rq);
>> -       update_cfs_load(cfs_rq, 1);
>> +       if (!throttled_hierarchy(cfs_rq)) {
>> +               update_rq_clock(rq);
>> +               update_cfs_load(cfs_rq, 1);
>> +       }
>>
>>         /*
>
> OK, so we can't contribute to load since we're throttled, but
> tg->load_weight might have changed meanwhile?
>

That's why we continue to update their shares (also at
enqueue/dequeue) but not their load, so that the weight will be
correct when unthrottling*

> Also, update_cfs_shares()->reweight_entity() can dequeue/enqueue the
> entity, doesn't that require an up-to-date rq->clock?
>

It shouldn't, we're only doing an account_enqueue/dequeue, and there
shouldn't be a cfs_rq->curr to lead to updates (on a throttled
entity).  I suppose we might do something interesting in the case of a
race with alb forcing something throttled to run intersecting with
update_shares()**.

The original concern here was [*] above, keeping shares current for
the unthrottle.  *However*, with the hierarchical throttle accounting in
the current version, I think this can be improved.

Instead, we should skip update_shares/update_cfs_shares for all
throttled entities and simply do a final shares update when
throttle_count goes to 0 in tg_throttle_down (which also avoids **).
I thought of doing this at the end of preparing the last patchset but
by that time it was tested and I didn't want to change things around
here at the last minute.

Will fix for this week.


>
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 09/16] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
  2011-06-22 17:29   ` Peter Zijlstra
@ 2011-06-28  4:40     ` Paul Turner
  2011-06-28  9:11       ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Paul Turner @ 2011-06-28  4:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Wed, Jun 22, 2011 at 10:29 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2011-06-21 at 00:16 -0700, Paul Turner wrote:
>>  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
>>  {
>> -       int idle = 1;
>> +       int idle = 1, throttled = 0;
>> +       u64 runtime, runtime_expires;
>> +
>>
>>         raw_spin_lock(&cfs_b->lock);
>>         if (cfs_b->quota != RUNTIME_INF) {
>> -               idle = cfs_b->idle;
>> -               /* If we're going idle then defer handle the refill */
>> +               /* idle depends on !throttled in the case of a large deficit */
>> +               throttled = !list_empty(&cfs_b->throttled_cfs_rq);
>> +               idle = cfs_b->idle && !throttled;
>> +
>> +               /* If we're going idle then defer the refill */
>>                 if (!idle)
>>                         __refill_cfs_bandwidth_runtime(cfs_b);
>> +               if (throttled) {
>> +                       runtime = cfs_b->runtime;
>> +                       runtime_expires = cfs_b->runtime_expires;
>> +
>> +                       /* we must first distribute to throttled entities */
>> +                       cfs_b->runtime = 0;
>> +               }
>
> Why, whats so bad about letting someone take some concurrently and not
> getting throttled meanwhile? Starvation considerations? If so, that
> wants mentioning.

Yes -- we also particularly want to pay down all deficits first, in
case someone has accumulated *large* arrears (e.g. under
!CONFIG_PREEMPT).

Will expand the comment here.
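
Something along these lines, say (the code is the hunk quoted above;
the comment wording here is mine, not the posted patch):

        if (throttled) {
                runtime = cfs_b->runtime;
                runtime_expires = cfs_b->runtime_expires;

                /*
                 * Hand the entire pool to the distribution pass:
                 * throttled cfs_rqs must be paid down first so that a
                 * large accumulated deficit (e.g. under !CONFIG_PREEMPT)
                 * can't be starved by unthrottled consumers grabbing the
                 * fresh runtime ahead of them.
                 */
                cfs_b->runtime = 0;
        }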

>
>>
>>                 /*
>> -                * mark this bandwidth pool as idle so that we may deactivate
>> -                * the timer at the next expiration if there is no usage.
>> +                * conditionally mark this bandwidth pool as idle so that we may
>> +                * deactivate the timer at the next expiration if there is no
>> +                * usage.
>>                  */
>> -               cfs_b->idle = 1;
>> +               cfs_b->idle = !throttled;
>>         }
>>
>> -       if (idle)
>> +       if (idle) {
>>                 cfs_b->timer_active = 0;
>> +               goto out_unlock;
>> +       }
>> +       raw_spin_unlock(&cfs_b->lock);
>> +
>> +retry:
>> +       runtime = distribute_cfs_runtime(cfs_b, runtime, runtime_expires);
>> +
>> +       raw_spin_lock(&cfs_b->lock);
>> +       /* new bandwidth specification may exist */
>> +       if (unlikely(runtime_expires != cfs_b->runtime_expires))
>> +               goto out_unlock;
>
> it might help to explain how, runtime_expires is taken from cfs_b after
> calling __refill_cfs_bandwidth_runtime, and we're in the replenishment
> timer, so nobody is going to be adding new runtime.
>

Good idea -- thanks

>> +       /* ensure no-one was throttled while we unthrottling */
>> +       if (unlikely(!list_empty(&cfs_b->throttled_cfs_rq)) && runtime > 0) {
>> +               raw_spin_unlock(&cfs_b->lock);
>> +               goto retry;
>> +       }
>
> OK, I can see that.
>
>> +
>> +       /* return remaining runtime */
>> +       cfs_b->runtime = runtime;
>> +out_unlock:
>>         raw_spin_unlock(&cfs_b->lock);
>>
>>         return idle;
>
> This function hurts my brain, code flow is horrid.

Yeah.. I don't know why I didn't just make it a while loop, will fix.
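
Roughly, a sketch of the reworked flow (not the posted patch; it just
restructures the retry above as a loop):

        while (throttled && runtime > 0) {
                raw_spin_unlock(&cfs_b->lock);
                /* we can't hold cfs_b->lock while distributing */
                runtime = distribute_cfs_runtime(cfs_b, runtime,
                                                 runtime_expires);
                raw_spin_lock(&cfs_b->lock);

                /* new bandwidth specification may exist */
                if (unlikely(runtime_expires != cfs_b->runtime_expires))
                        goto out_unlock;

                throttled = !list_empty(&cfs_b->throttled_cfs_rq);
        }

        /* return remaining runtime */
        cfs_b->runtime = runtime;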

>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 07/16] sched: expire invalid runtime
  2011-06-22 15:47   ` Peter Zijlstra
@ 2011-06-28  4:42     ` Paul Turner
  2011-06-29  2:29       ` Paul Turner
  0 siblings, 1 reply; 59+ messages in thread
From: Paul Turner @ 2011-06-28  4:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

On Wed, Jun 22, 2011 at 8:47 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2011-06-21 at 00:16 -0700, Paul Turner wrote:
>
>> +     now = sched_clock_cpu(smp_processor_id());
>> +     cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
>
>> +     if ((s64)(rq->clock - cfs_rq->runtime_expires) < 0)
>
> Is there a good reason to mix these two (related) time sources?
>

It does make sense to remove the (current) aliasing dependency, will
use rq->clock for setting expiration.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 12/16] sched: prevent interactions with throttled entities
  2011-06-22 21:34   ` Peter Zijlstra
@ 2011-06-28  4:43     ` Paul Turner
  0 siblings, 0 replies; 59+ messages in thread
From: Paul Turner @ 2011-06-28  4:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

On Wed, Jun 22, 2011 at 2:34 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2011-06-21 at 00:17 -0700, Paul Turner wrote:
>> +       udd.cpu = rq->cpu;
>> +       udd.now = rq->clock_task;
>> +       walk_tg_tree_from(cfs_rq->tg, tg_unthrottle_down, tg_nop,
>> +                         (void *)&udd);
>
> How about passing rq along and not using udd? :-)
>

egads.. yes!

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 09/16] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
  2011-06-28  4:40     ` Paul Turner
@ 2011-06-28  9:11       ` Peter Zijlstra
  2011-06-29  3:37         ` Paul Turner
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2011-06-28  9:11 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Mon, 2011-06-27 at 21:40 -0700, Paul Turner wrote:
> >> +       if (unlikely(runtime_expires != cfs_b->runtime_expires))
> >> +               goto out_unlock;
> >
> > it might help to explain how, runtime_expires is taken from cfs_b after
> > calling __refill_cfs_bandwidth_runtime, and we're in the replenishment
> > timer, so nobody is going to be adding new runtime.
> >
> 
> Good idea -- thanks 

Aside from being a good idea, I'm genuinely puzzled by that part and
would love having it explained :-)

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 15/16] sched: return unused runtime on voluntary sleep
  2011-06-28  1:42     ` Paul Turner
@ 2011-06-28 10:01       ` Peter Zijlstra
  2011-06-28 18:45         ` Paul Turner
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2011-06-28 10:01 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

On Mon, 2011-06-27 at 18:42 -0700, Paul Turner wrote:

> - The aesthetics of releasing rq->lock in the put path.  Quick
> inspection suggests it should actually be safe to do at that point,
> and we do similar for idle_balance().
> 
> Given consideration the above two factors are not requirements, this
> could be moved out of a timer and into the put_path directly (with the
> fact that we drop rq->lock strongly commented).  I have no strong
> preference between either choice.

Argh, ok I see, distribute_cfs_runtime() wants that. Dropping rq->lock
is very fragile esp from the put path, you can only do that _before_ the
put path updates rq->curr etc.. So I'd rather you didn't, just keep the
timer crap and add some comments there.

And we need that distribute_cfs_runtime() muck because that's what
unthrottles rqs when more runtime is available.. bah.



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 15/16] sched: return unused runtime on voluntary sleep
  2011-06-28 10:01       ` Peter Zijlstra
@ 2011-06-28 18:45         ` Paul Turner
  0 siblings, 0 replies; 59+ messages in thread
From: Paul Turner @ 2011-06-28 18:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

On Tue, Jun 28, 2011 at 3:01 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Mon, 2011-06-27 at 18:42 -0700, Paul Turner wrote:
>
>> - The aesthetics of releasing rq->lock in the put path.  Quick
>> inspection suggests it should actually be safe to do at that point,
>> and we do similar for idle_balance().
>>
>> Given consideration the above two factors are not requirements, this
>> could be moved out of a timer and into the put_path directly (with the
>> fact that we drop rq->lock strongly commented).  I have no strong
>> preference between either choice.
>
> Argh, ok I see, distribute_cfs_runtime() wants that. Dropping rq->lock
> is very fragile esp from the put path, you can only do that _before_ the
> put path updates rq->curr etc.. So I'd rather you didn't, just keep the
> timer crap and add some comments there.
>

Done.

An alternative that does come to mind is something like:

- cfs_b->lock may be sufficient synchronization to twiddle
cfs_rq->runtime_assigned (once it has been throttled, modulo alb)
- we could use this to synchronize a per-rq "to-be-unthrottled" list
which could be checked under something like task_tick_fair

We'd have to be careful about making sure we wake up a sleeping cpu,
but this hides the rq->lock requirements and would let us kill the
timer (a rough sketch of the idea follows).

Perhaps this could be a future incremental improvement.
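
Purely hypothetical sketch -- nothing below exists in this series; it
assumes rq grows a 'deferred_unthrottle' list_head and cfs_rq a
'deferred_entry' list node, both protected by cfs_b->lock, and (for
simplicity) a single bandwidth pool:

/* caller holds cfs_b->lock only; no rq->lock needed */
static void defer_unthrottle(struct cfs_rq *cfs_rq)
{
        struct rq *rq = rq_of(cfs_rq);

        list_add_tail(&cfs_rq->deferred_entry, &rq->deferred_unthrottle);

        /*
         * Still need to make sure a sleeping/idle cpu notices the new
         * work -- that's the part that needs care, as noted above.
         */
}

/* called from e.g. task_tick_fair() with rq->lock held */
static void run_deferred_unthrottles(struct rq *rq,
                                     struct cfs_bandwidth *cfs_b)
{
        struct cfs_rq *cfs_rq, *tmp;
        LIST_HEAD(todo);

        raw_spin_lock(&cfs_b->lock);
        list_splice_init(&rq->deferred_unthrottle, &todo);
        raw_spin_unlock(&cfs_b->lock);

        list_for_each_entry_safe(cfs_rq, tmp, &todo, deferred_entry) {
                list_del_init(&cfs_rq->deferred_entry);
                unthrottle_cfs_rq(cfs_rq);
        }
}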

> And we need that distribute_cfs_runtime() muck because that's what
> unthrottles rqs when more runtime is available.. bah.
>
>
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 07/16] sched: expire invalid runtime
  2011-06-28  4:42     ` Paul Turner
@ 2011-06-29  2:29       ` Paul Turner
  0 siblings, 0 replies; 59+ messages in thread
From: Paul Turner @ 2011-06-29  2:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

On Mon, Jun 27, 2011 at 9:42 PM, Paul Turner <pjt@google.com> wrote:
> On Wed, Jun 22, 2011 at 8:47 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>> On Tue, 2011-06-21 at 00:16 -0700, Paul Turner wrote:
>>
>>> +     now = sched_clock_cpu(smp_processor_id());
>>> +     cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
>>
>>> +     if ((s64)(rq->clock - cfs_rq->runtime_expires) < 0)
>>
>> Is there a good reason to mix these two (related) time sources?
>>
>
> It does make sense to remove the (current) aliasing dependency, will
> use rq->clock for setting expiration.
>

So looking more closely at this, I think I prefer the "mix" after all.

Using rq->clock within __refill_cfs_bandwidth_runtime adds the
requirement of taking rq->lock on the current cpu within the period
timer, just so that we can update rq->clock (which then simply gets
set to sched_clock anyway).

Expiration logic is already dependent on the fact that rq->clock
snapshots sched_clock (the 2ms bound on clock-to-clock drift).  Given
that this is an infrequent (once per period) operation, I think it's
better to leave it as an explicit sched_clock_cpu call, with an
explanatory comment.
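
That is, something like the following -- the two clock lines are from
the hunk quoted above; the rest of the body and the comment wording
are mine:

void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
{
        u64 now;

        if (cfs_b->quota == RUNTIME_INF)
                return;

        /*
         * Use sched_clock_cpu() directly rather than rq->clock so that
         * the once-per-period refresh doesn't have to take rq->lock
         * just to update the clock; expiry tests against rq->clock stay
         * valid because rq->clock is itself a snapshot of sched_clock
         * (with bounded drift).
         */
        now = sched_clock_cpu(smp_processor_id());
        cfs_b->runtime = cfs_b->quota;
        cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
}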

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 09/16] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
  2011-06-28  9:11       ` Peter Zijlstra
@ 2011-06-29  3:37         ` Paul Turner
  0 siblings, 0 replies; 59+ messages in thread
From: Paul Turner @ 2011-06-29  3:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Tue, Jun 28, 2011 at 2:11 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Mon, 2011-06-27 at 21:40 -0700, Paul Turner wrote:
>> >> +       if (unlikely(runtime_expires != cfs_b->runtime_expires))
>> >> +               goto out_unlock;
>> >
>> > it might help to explain how, runtime_expires is taken from cfs_b after
>> > calling __refill_cfs_bandwidth_runtime, and we're in the replenishment
>> > timer, so nobody is going to be adding new runtime.
>> >
>>
>> Good idea -- thanks
>
> Aside from being a good idea, I'm genuinely puzzled by that part and
> would love having it explained :-)
>

It's all the user's fault!

While we were busy doing this they might have set a new bandwidth
limit via cgroupfs (since we drop cfs_b->lock to distribute), in which
case we need to make sure we:

a) stop, since setting the new runtime will have unthrottled everyone
   anyway
b) don't put our runtime back in cfs_b->runtime, as at this point we'd
   be overwriting the new runtime just set by tg_set_cfs_bandwidth

We can catch this happening, however, since tg_set_cfs_bandwidth sets
a new expiration, which is what the check above enforces.
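
In other words (annotating the check quoted above; the comment wording
is mine, not the patch):

        raw_spin_lock(&cfs_b->lock);
        /*
         * tg_set_cfs_bandwidth() may have run while cfs_b->lock was
         * dropped for distribution.  It refills runtime, advances
         * runtime_expires, and unthrottles everyone itself -- so if
         * the expiration moved, stop here and don't write our stale
         * leftover back over the freshly set cfs_b->runtime.
         */
        if (unlikely(runtime_expires != cfs_b->runtime_expires))
                goto out_unlock;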

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 00/16] CFS Bandwidth Control v7
  2011-06-26 10:35         ` Ingo Molnar
@ 2011-06-29  4:05           ` Hu Tao
  2011-07-01 12:28             ` Ingo Molnar
  0 siblings, 1 reply; 59+ messages in thread
From: Hu Tao @ 2011-06-29  4:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Hidetoshi Seto, Peter Zijlstra, Paul Turner, linux-kernel,
	Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Pavel Emelyanov

On Sun, Jun 26, 2011 at 12:35:26PM +0200, Ingo Molnar wrote:
> 
> * Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> wrote:
> 
> > -        865.139070 task-clock                #    0.468 CPUs utilized            ( +-  0.22% )
> > -           200,167 context-switches          #    0.231 M/sec                    ( +-  0.00% )
> > -                 0 CPU-migrations            #    0.000 M/sec                    ( +- 49.62% )
> > -               142 page-faults               #    0.000 M/sec                    ( +-  0.07% )
> > -     1,671,107,623 cycles                    #    1.932 GHz                      ( +-  0.16% ) [28.23%]
> > -       838,554,329 stalled-cycles-frontend   #   50.18% frontend cycles idle     ( +-  0.27% ) [28.21%]
> > -       453,526,560 stalled-cycles-backend    #   27.14% backend  cycles idle     ( +-  0.43% ) [28.33%]
> > -     1,434,140,915 instructions              #    0.86  insns per cycle
> > -                                             #    0.58  stalled cycles per insn  ( +-  0.06% ) [34.01%]
> > -       279,485,621 branches                  #  323.053 M/sec                    ( +-  0.06% ) [33.98%]
> > -         6,653,998 branch-misses             #    2.38% of all branches          ( +-  0.16% ) [33.93%]
> > -       495,463,378 L1-dcache-loads           #  572.698 M/sec                    ( +-  0.05% ) [28.12%]
> > -        27,903,270 L1-dcache-load-misses     #    5.63% of all L1-dcache hits    ( +-  0.28% ) [27.84%]
> > -           885,210 LLC-loads                 #    1.023 M/sec                    ( +-  3.21% ) [21.80%]
> > -             9,479 LLC-load-misses           #    1.07% of all LL-cache hits     ( +-  0.63% ) [ 5.61%]
> > -       830,096,007 L1-icache-loads           #  959.494 M/sec                    ( +-  0.08% ) [11.18%]
> > -       123,728,370 L1-icache-load-misses     #   14.91% of all L1-icache hits    ( +-  0.06% ) [16.78%]
> > -       504,932,490 dTLB-loads                #  583.643 M/sec                    ( +-  0.06% ) [22.30%]
> > -         2,056,069 dTLB-load-misses          #    0.41% of all dTLB cache hits   ( +-  2.23% ) [22.20%]
> > -     1,579,410,083 iTLB-loads                # 1825.614 M/sec                    ( +-  0.06% ) [22.30%]
> > -           394,739 iTLB-load-misses          #    0.02% of all iTLB cache hits   ( +-  0.03% ) [22.27%]
> > -         2,286,363 L1-dcache-prefetches      #    2.643 M/sec                    ( +-  0.72% ) [22.40%]
> > -           776,096 L1-dcache-prefetch-misses #    0.897 M/sec                    ( +-  1.45% ) [22.54%]
> > +        859.259725 task-clock                #    0.472 CPUs utilized            ( +-  0.24% )
> > +           200,165 context-switches          #    0.233 M/sec                    ( +-  0.00% )
> > +                 0 CPU-migrations            #    0.000 M/sec                    ( +-100.00% )
> > +               142 page-faults               #    0.000 M/sec                    ( +-  0.06% )
> > +     1,659,371,974 cycles                    #    1.931 GHz                      ( +-  0.18% ) [28.23%]
> > +       829,806,955 stalled-cycles-frontend   #   50.01% frontend cycles idle     ( +-  0.32% ) [28.32%]
> > +       490,316,435 stalled-cycles-backend    #   29.55% backend  cycles idle     ( +-  0.46% ) [28.34%]
> > +     1,445,166,061 instructions              #    0.87  insns per cycle
> > +                                             #    0.57  stalled cycles per insn  ( +-  0.06% ) [34.01%]
> > +       282,370,988 branches                  #  328.621 M/sec                    ( +-  0.06% ) [33.93%]
> > +         5,056,568 branch-misses             #    1.79% of all branches          ( +-  0.19% ) [33.94%]
> > +       500,660,789 L1-dcache-loads           #  582.665 M/sec                    ( +-  0.06% ) [28.05%]
> > +        26,802,313 L1-dcache-load-misses     #    5.35% of all L1-dcache hits    ( +-  0.26% ) [27.83%]
> > +           872,571 LLC-loads                 #    1.015 M/sec                    ( +-  3.73% ) [21.82%]
> > +             9,050 LLC-load-misses           #    1.04% of all LL-cache hits     ( +-  0.55% ) [ 5.70%]
> > +       794,396,111 L1-icache-loads           #  924.512 M/sec                    ( +-  0.06% ) [11.30%]
> > +       130,179,414 L1-icache-load-misses     #   16.39% of all L1-icache hits    ( +-  0.09% ) [16.85%]
> > +       511,119,889 dTLB-loads                #  594.837 M/sec                    ( +-  0.06% ) [22.37%]
> > +         2,452,378 dTLB-load-misses          #    0.48% of all dTLB cache hits   ( +-  2.31% ) [22.14%]
> > +     1,597,897,243 iTLB-loads                # 1859.621 M/sec                    ( +-  0.06% ) [22.17%]
> > +           394,366 iTLB-load-misses          #    0.02% of all iTLB cache hits   ( +-  0.03% ) [22.24%]
> > +         1,897,401 L1-dcache-prefetches      #    2.208 M/sec                    ( +-  0.64% ) [22.38%]
> > +           879,391 L1-dcache-prefetch-misses #    1.023 M/sec                    ( +-  0.90% ) [22.54%]
> > 
> > -       1.847093132 seconds time elapsed                                          ( +-  0.19% )
> > +       1.822131534 seconds time elapsed                                          ( +-  0.21% )
> > =====
> > 
> > As Peter had expected, the number of branches is slightly increased.
> > 
> > -       279,485,621 branches                  #  323.053 M/sec                    ( +-  0.06% ) [33.98%]
> > +       282,370,988 branches                  #  328.621 M/sec                    ( +-  0.06% ) [33.93%]
> > 
> > However, looking overall, I think there is no significant problem on
> > the score with this patch set.  I'd love to hear from maintainers.
> 
> Yeah, these numbers look pretty good. Note that the percentages in 
> the third column (the amount of time that particular event was 
> measured) is pretty low, and it would be nice to eliminate it: i.e. 
> now that we know the ballpark figures do very precise measurements 
> that do not over-commit the PMU.
> 
> One such measurement would be:
> 
> 	-e cycles -e instructions -e branches
> 
> This should also bring the stddev percentages down i think, to below 
> 0.1%.
> 
> Another measurement would be to test not just the feature-enabled but 
> also the feature-disabled cost - so that we document the rough 
> overhead that users of this new scheduler feature should expect.
> 
> Organizing it into neat before/after numbers and percentages, 
> comparing it with noise (stddev) [i.e. determining that the effect we 
> measure is above noise] and putting it all into the changelog would 
> be the other goal of these measurements.

Hi Ingo,

I've tested pipe-test-100k in the following cases: base (no patch),
with the patch but the feature disabled, and with the patch and
several periods (quota set to a large value to avoid processes being
throttled).  The results are:


                                            cycles                   instructions            branches
-------------------------------------------------------------------------------------------------------------------
base                                        7,526,317,497           8,666,579,347            1,771,078,445
+patch, cgroup not enabled                  7,610,354,447 (1.12%)   8,569,448,982 (-1.12%)   1,751,675,193 (-0.11%)
+patch, 10000000000/1000(quota/period)      7,856,873,327 (4.39%)   8,822,227,540 (1.80%)    1,801,766,182 (1.73%)
+patch, 10000000000/10000(quota/period)     7,797,711,600 (3.61%)   8,754,747,746 (1.02%)    1,788,316,969 (0.97%)
+patch, 10000000000/100000(quota/period)    7,777,784,384 (3.34%)   8,744,979,688 (0.90%)    1,786,319,566 (0.86%)
+patch, 10000000000/1000000(quota/period)   7,802,382,802 (3.67%)   8,755,638,235 (1.03%)    1,788,601,070 (0.99%)
-------------------------------------------------------------------------------------------------------------------



These are the original outputs from perf.

base
--------------
 Performance counter stats for './pipe-test-100k' (50 runs):

       3834.623919 task-clock                #    0.576 CPUs utilized            ( +-  0.04% )
           200,009 context-switches          #    0.052 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 48.45% )
               135 page-faults               #    0.000 M/sec                    ( +-  0.12% )
     7,526,317,497 cycles                    #    1.963 GHz                      ( +-  0.07% )
     2,672,526,467 stalled-cycles-frontend   #   35.51% frontend cycles idle     ( +-  0.14% )
     1,157,897,108 stalled-cycles-backend    #   15.38% backend  cycles idle     ( +-  0.29% )
     8,666,579,347 instructions              #    1.15  insns per cycle        
                                             #    0.31  stalled cycles per insn  ( +-  0.04% )
     1,771,078,445 branches                  #  461.865 M/sec                    ( +-  0.04% )
        35,159,140 branch-misses             #    1.99% of all branches          ( +-  0.11% )

       6.654770337 seconds time elapsed                                          ( +-  0.02% )



+patch, cpu cgroup not enabled
------------------------------
 Performance counter stats for './pipe-test-100k' (50 runs):

       3872.071268 task-clock                #    0.577 CPUs utilized            ( +-  0.10% )
           200,009 context-switches          #    0.052 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 69.99% )
               135 page-faults               #    0.000 M/sec                    ( +-  0.17% )
     7,610,354,447 cycles                    #    1.965 GHz                      ( +-  0.11% )
     2,792,310,881 stalled-cycles-frontend   #   36.69% frontend cycles idle     ( +-  0.17% )
     1,268,428,999 stalled-cycles-backend    #   16.67% backend  cycles idle     ( +-  0.33% )
     8,569,448,982 instructions              #    1.13  insns per cycle        
                                             #    0.33  stalled cycles per insn  ( +-  0.10% )
     1,751,675,193 branches                  #  452.387 M/sec                    ( +-  0.09% )
        36,605,163 branch-misses             #    2.09% of all branches          ( +-  0.12% )

       6.707220617 seconds time elapsed                                          ( +-  0.05% )



+patch, 10000000000/1000(quota/period)
--------------------------------------
 Performance counter stats for './pipe-test-100k' (50 runs):

       3973.982673 task-clock                #    0.583 CPUs utilized            ( +-  0.09% )
           200,010 context-switches          #    0.050 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +-100.00% )
               135 page-faults               #    0.000 M/sec                    ( +-  0.14% )
     7,856,873,327 cycles                    #    1.977 GHz                      ( +-  0.10% )
     2,903,700,355 stalled-cycles-frontend   #   36.96% frontend cycles idle     ( +-  0.14% )
     1,310,151,837 stalled-cycles-backend    #   16.68% backend  cycles idle     ( +-  0.33% )
     8,822,227,540 instructions              #    1.12  insns per cycle        
                                             #    0.33  stalled cycles per insn  ( +-  0.08% )
     1,801,766,182 branches                  #  453.391 M/sec                    ( +-  0.08% )
        37,784,995 branch-misses             #    2.10% of all branches          ( +-  0.14% )

       6.821678535 seconds time elapsed                                          ( +-  0.05% )



+patch, 10000000000/10000(quota/period)
---------------------------------------
 Performance counter stats for './pipe-test-100k' (50 runs):

       3948.074074 task-clock                #    0.581 CPUs utilized            ( +-  0.11% )
           200,009 context-switches          #    0.051 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 69.99% )
               135 page-faults               #    0.000 M/sec                    ( +-  0.20% )
     7,797,711,600 cycles                    #    1.975 GHz                      ( +-  0.12% )
     2,881,224,123 stalled-cycles-frontend   #   36.95% frontend cycles idle     ( +-  0.18% )
     1,294,534,443 stalled-cycles-backend    #   16.60% backend  cycles idle     ( +-  0.40% )
     8,754,747,746 instructions              #    1.12  insns per cycle        
                                             #    0.33  stalled cycles per insn  ( +-  0.10% )
     1,788,316,969 branches                  #  452.959 M/sec                    ( +-  0.09% )
        37,619,798 branch-misses             #    2.10% of all branches          ( +-  0.17% )

       6.792410565 seconds time elapsed                                          ( +-  0.05% )



+patch, 10000000000/100000(quota/period)
----------------------------------------
 Performance counter stats for './pipe-test-100k' (50 runs):

       3943.323261 task-clock                #    0.581 CPUs utilized            ( +-  0.10% )
           200,009 context-switches          #    0.051 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 56.54% )
               135 page-faults               #    0.000 M/sec                    ( +-  0.24% )
     7,777,784,384 cycles                    #    1.972 GHz                      ( +-  0.12% )
     2,869,653,004 stalled-cycles-frontend   #   36.90% frontend cycles idle     ( +-  0.19% )
     1,278,100,561 stalled-cycles-backend    #   16.43% backend  cycles idle     ( +-  0.37% )
     8,744,979,688 instructions              #    1.12  insns per cycle        
                                             #    0.33  stalled cycles per insn  ( +-  0.10% )
     1,786,319,566 branches                  #  452.999 M/sec                    ( +-  0.09% )
        37,514,727 branch-misses             #    2.10% of all branches          ( +-  0.14% )

       6.790280499 seconds time elapsed                                          ( +-  0.06% )



+patch, 10000000000/1000000(quota/period)
----------------------------------------
 Performance counter stats for './pipe-test-100k' (50 runs):

       3951.215042 task-clock                #    0.582 CPUs utilized            ( +-  0.09% )
           200,009 context-switches          #    0.051 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +-  0.00% )
               135 page-faults               #    0.000 M/sec                    ( +-  0.20% )
     7,802,382,802 cycles                    #    1.975 GHz                      ( +-  0.12% )
     2,884,487,463 stalled-cycles-frontend   #   36.97% frontend cycles idle     ( +-  0.17% )
     1,297,073,308 stalled-cycles-backend    #   16.62% backend  cycles idle     ( +-  0.35% )
     8,755,638,235 instructions              #    1.12  insns per cycle        
                                             #    0.33  stalled cycles per insn  ( +-  0.11% )
     1,788,601,070 branches                  #  452.671 M/sec                    ( +-  0.11% )
        37,649,606 branch-misses             #    2.10% of all branches          ( +-  0.15% )

       6.794033052 seconds time elapsed                                          ( +-  0.06% )


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 00/16] CFS Bandwidth Control v7
  2011-06-29  4:05           ` Hu Tao
@ 2011-07-01 12:28             ` Ingo Molnar
  2011-07-05  3:58               ` Hu Tao
  0 siblings, 1 reply; 59+ messages in thread
From: Ingo Molnar @ 2011-07-01 12:28 UTC (permalink / raw)
  To: Hu Tao
  Cc: Hidetoshi Seto, Peter Zijlstra, Paul Turner, linux-kernel,
	Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Pavel Emelyanov


* Hu Tao <hutao@cn.fujitsu.com> wrote:

> > Yeah, these numbers look pretty good. Note that the percentages 
> > in the third column (the amount of time that particular event was 
> > measured) is pretty low, and it would be nice to eliminate it: 
> > i.e. now that we know the ballpark figures do very precise 
> > measurements that do not over-commit the PMU.
> > 
> > One such measurement would be:
> > 
> > 	-e cycles -e instructions -e branches
> > 
> > This should also bring the stddev percentages down i think, to 
> > below 0.1%.
> > 
> > Another measurement would be to test not just the feature-enabled 
> > but also the feature-disabled cost - so that we document the 
> > rough overhead that users of this new scheduler feature should 
> > expect.
> > 
> > Organizing it into neat before/after numbers and percentages, 
> > comparing it with noise (stddev) [i.e. determining that the 
> > effect we measure is above noise] and putting it all into the 
> > changelog would be the other goal of these measurements.
> 
> Hi Ingo,
> 
> I've tested pipe-test-100k in the following cases: base(no patch), 
> with patch but feature-disabled, with patch and several 
> periods(quota set to be a large value to avoid processes 
> throttled), the result is:
> 
> 
>                                             cycles                   instructions            branches
> -------------------------------------------------------------------------------------------------------------------
> base                                        7,526,317,497           8,666,579,347            1,771,078,445
> +patch, cgroup not enabled                  7,610,354,447 (1.12%)   8,569,448,982 (-1.12%)   1,751,675,193 (-0.11%)
> +patch, 10000000000/1000(quota/period)      7,856,873,327 (4.39%)   8,822,227,540 (1.80%)    1,801,766,182 (1.73%)
> +patch, 10000000000/10000(quota/period)     7,797,711,600 (3.61%)   8,754,747,746 (1.02%)    1,788,316,969 (0.97%)
> +patch, 10000000000/100000(quota/period)    7,777,784,384 (3.34%)   8,744,979,688 (0.90%)    1,786,319,566 (0.86%)
> +patch, 10000000000/1000000(quota/period)   7,802,382,802 (3.67%)   8,755,638,235 (1.03%)    1,788,601,070 (0.99%)
> -------------------------------------------------------------------------------------------------------------------

ok, I had a quick look at the stddev numbers as well and most seem
below the 0.1 range, well below the effects you managed to measure.
So I think this table is pretty accurate and we can rely on it for
analysis.

So we've got a +1.1% increase in overhead (cycles) with cgroups
disabled, while the instruction count went down by 1.1%. Is this
expected? If you profile stalled cycles and use perf diff between base
and patched kernels, does it show you some new hotspot that causes the
overhead?

To better understand the reasons behind that result, could you try to
see whether the cycles count is stable across reboots as well, or
whether it varies beyond the ~1% value that you measure?

One thing that can help validate the measurements is to do:

  echo 1 > /proc/sys/vm/drop_caches

before testing. This helps re-establish the whole pagecache layout
(which accounts for a lot of the across-boot variability of such
measurements).

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 00/16] CFS Bandwidth Control v7
  2011-07-01 12:28             ` Ingo Molnar
@ 2011-07-05  3:58               ` Hu Tao
  2011-07-05  8:50                 ` Ingo Molnar
  0 siblings, 1 reply; 59+ messages in thread
From: Hu Tao @ 2011-07-05  3:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Hidetoshi Seto, Peter Zijlstra, Paul Turner, linux-kernel,
	Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Pavel Emelyanov

On Fri, Jul 01, 2011 at 02:28:24PM +0200, Ingo Molnar wrote:
> 
> * Hu Tao <hutao@cn.fujitsu.com> wrote:
> 
> > > Yeah, these numbers look pretty good. Note that the percentages 
> > > in the third column (the amount of time that particular event was 
> > > measured) is pretty low, and it would be nice to eliminate it: 
> > > i.e. now that we know the ballpark figures do very precise 
> > > measurements that do not over-commit the PMU.
> > > 
> > > One such measurement would be:
> > > 
> > > 	-e cycles -e instructions -e branches
> > > 
> > > This should also bring the stddev percentages down i think, to 
> > > below 0.1%.
> > > 
> > > Another measurement would be to test not just the feature-enabled 
> > > but also the feature-disabled cost - so that we document the 
> > > rough overhead that users of this new scheduler feature should 
> > > expect.
> > > 
> > > Organizing it into neat before/after numbers and percentages, 
> > > comparing it with noise (stddev) [i.e. determining that the 
> > > effect we measure is above noise] and putting it all into the 
> > > changelog would be the other goal of these measurements.
> > 
> > Hi Ingo,
> > 
> > I've tested pipe-test-100k in the following cases: base(no patch), 
> > with patch but feature-disabled, with patch and several 
> > periods(quota set to be a large value to avoid processes 
> > throttled), the result is:
> > 
> > 
> >                                             cycles                   instructions            branches
> > -------------------------------------------------------------------------------------------------------------------
> > base                                        7,526,317,497           8,666,579,347            1,771,078,445
> > +patch, cgroup not enabled                  7,610,354,447 (1.12%)   8,569,448,982 (-1.12%)   1,751,675,193 (-0.11%)
> > +patch, 10000000000/1000(quota/period)      7,856,873,327 (4.39%)   8,822,227,540 (1.80%)    1,801,766,182 (1.73%)
> > +patch, 10000000000/10000(quota/period)     7,797,711,600 (3.61%)   8,754,747,746 (1.02%)    1,788,316,969 (0.97%)
> > +patch, 10000000000/100000(quota/period)    7,777,784,384 (3.34%)   8,744,979,688 (0.90%)    1,786,319,566 (0.86%)
> > +patch, 10000000000/1000000(quota/period)   7,802,382,802 (3.67%)   8,755,638,235 (1.03%)    1,788,601,070 (0.99%)
> > -------------------------------------------------------------------------------------------------------------------
> 
> ok, i had a quick look at the stddev numbers as well and most seem 
> below the 0.1 range, well below the effects you managed to measure. 
> So i think this table is pretty accurate and we can rely on it for 
> analysis.
> 
> So we've got a +1.1% incrase in overhead with cgroups disabled, while 
> the instruction count went down by 1.1%. Is this expected? If you 
> profile stalled cycles and use perf diff between base and patched 
> kernels, does it show you some new hotspot that causes the overhead?

perf diff shows a 0.43% increase in sched_clock and a 0.98% decrease
in pipe_unlock.  The complete output is included below.

> 
> To better understand the reasons behind that result, could you try to 
> see whether the cycles count is stable across reboots as well, or 
> does it vary beyond the ~1% value that you measure?
> 
> One thing that can help validating the measurements is to do:
> 
>   echo 1 > /proc/sys/vm/drop_caches
> 
> Before testing. This helps re-establish the whole pagecache layout 
> (which gives a lot of the across-boot variability of such 
> measurements).

I have tested three times each for base and for +patch, cgroup not
enabled (each time: reboot, drop_caches, then perf).  The data seems
stable compared to the numbers in the table above, see below:


                    cycles                   instructions
------------------------------------------------------------------
base                7,526,317,497            8,666,579,347
base, drop_caches   7,518,958,711 (-0.10%)   8,634,136,901(-0.37%)
base, drop_caches   7,526,419,287 (+0.00%)   8,641,162,766(-0.29%)
base, drop_caches   7,491,864,402 (-0.46%)   8,624,760,925(-0.48%)


                                       cycles                   instructions
--------------------------------------------------------------------------------------
+patch, cgroup disabled                7,610,354,447            8,569,448,982
+patch, cgroup disabled, drop_caches   7,574,623,093 (-0.47%)   8,572,061,001 (+0.03%)
+patch, cgroup disabled, drop_caches   7,594,083,776 (-0.21%)   8,574,447,382 (+0.06%)
+patch, cgroup disabled, drop_caches   7,584,913,316 (-0.33%)   8,574,734,269 (+0.06%)






perf diff output:

# Baseline  Delta          Shared Object                       Symbol
# ........ ..........  .................  ...........................
#
     0.00%    +10.07%  [kernel.kallsyms]  [k] __lock_acquire
     0.00%     +5.90%  [kernel.kallsyms]  [k] lock_release
     0.00%     +4.86%  [kernel.kallsyms]  [k] trace_hardirqs_off_caller
     0.00%     +4.06%  [kernel.kallsyms]  [k] debug_smp_processor_id
     0.00%     +4.00%  [kernel.kallsyms]  [k] lock_acquire
     0.00%     +3.81%  [kernel.kallsyms]  [k] lock_acquired
     0.00%     +3.71%  [kernel.kallsyms]  [k] lock_is_held
     0.00%     +3.04%  [kernel.kallsyms]  [k] validate_chain
     0.00%     +2.68%  [kernel.kallsyms]  [k] check_chain_key
     0.00%     +2.41%  [kernel.kallsyms]  [k] trace_hardirqs_off
     0.00%     +2.01%  [kernel.kallsyms]  [k] trace_hardirqs_on_caller
     2.04%     -0.09%  pipe-test-100k     [.] main
     0.00%     +1.79%  [kernel.kallsyms]  [k] add_preempt_count
     0.00%     +1.67%  [kernel.kallsyms]  [k] lock_release_holdtime
     0.00%     +1.67%  [kernel.kallsyms]  [k] mutex_lock_nested
     0.00%     +1.61%  [kernel.kallsyms]  [k] pipe_read
     0.00%     +1.58%  [kernel.kallsyms]  [k] local_clock
     1.13%     +0.43%  [kernel.kallsyms]  [k] sched_clock
     0.00%     +1.52%  [kernel.kallsyms]  [k] sub_preempt_count
     0.00%     +1.39%  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     1.14%     +0.15%  libc-2.12.so       [.] __GI___libc_read
     0.00%     +1.21%  [kernel.kallsyms]  [k] mark_lock
     0.00%     +1.06%  [kernel.kallsyms]  [k] __mutex_unlock_slowpath
     0.00%     +1.03%  [kernel.kallsyms]  [k] match_held_lock
     0.00%     +0.96%  [kernel.kallsyms]  [k] copy_user_generic_string
     0.00%     +0.93%  [kernel.kallsyms]  [k] schedule
     0.00%     +0.76%  [kernel.kallsyms]  [k] __list_del_entry
     0.00%     +0.73%  [kernel.kallsyms]  [k] enqueue_entity
     0.00%     +0.68%  [kernel.kallsyms]  [k] cpuacct_charge
     0.00%     +0.62%  [kernel.kallsyms]  [k] trace_preempt_off
     0.00%     +0.59%  [kernel.kallsyms]  [k] vfs_write
     0.00%     +0.56%  [kernel.kallsyms]  [k] trace_preempt_on
     0.00%     +0.56%  [kernel.kallsyms]  [k] system_call
     0.00%     +0.55%  [kernel.kallsyms]  [k] sys_read
     0.00%     +0.54%  [kernel.kallsyms]  [k] pipe_write
     0.00%     +0.53%  [kernel.kallsyms]  [k] get_parent_ip
     0.00%     +0.53%  [kernel.kallsyms]  [k] vfs_read
     0.00%     +0.53%  [kernel.kallsyms]  [k] put_lock_stats
     0.56%     -0.03%  [kernel.kallsyms]  [k] intel_pmu_enable_all
     0.00%     +0.51%  [kernel.kallsyms]  [k] fsnotify
     0.72%     -0.23%  libc-2.12.so       [.] __GI___libc_write
     0.00%     +0.49%  [kernel.kallsyms]  [k] do_sync_write
     0.00%     +0.48%  [kernel.kallsyms]  [k] trace_hardirqs_on
     0.00%     +0.48%  [kernel.kallsyms]  [k] do_sync_read
     0.00%     +0.45%  [kernel.kallsyms]  [k] dequeue_entity
     0.00%     +0.44%  [kernel.kallsyms]  [k] select_task_rq_fair
     0.00%     +0.44%  [kernel.kallsyms]  [k] update_curr
     0.00%     +0.43%  [kernel.kallsyms]  [k] fget_light
     0.00%     +0.42%  [kernel.kallsyms]  [k] do_raw_spin_trylock
     0.00%     +0.42%  [kernel.kallsyms]  [k] in_lock_functions
     0.00%     +0.40%  [kernel.kallsyms]  [k] find_next_bit
     0.50%     -0.11%  [kernel.kallsyms]  [k] intel_pmu_disable_all
     0.00%     +0.39%  [kernel.kallsyms]  [k] __list_add
     0.00%     +0.38%  [kernel.kallsyms]  [k] enqueue_task
     0.00%     +0.38%  [kernel.kallsyms]  [k] __might_sleep
     0.00%     +0.38%  [kernel.kallsyms]  [k] kill_fasync
     0.00%     +0.36%  [kernel.kallsyms]  [k] check_flags
     0.00%     +0.36%  [kernel.kallsyms]  [k] _raw_spin_unlock
     0.00%     +0.34%  [kernel.kallsyms]  [k] pipe_iov_copy_from_user
     0.00%     +0.33%  [kernel.kallsyms]  [k] check_preempt_curr
     0.00%     +0.32%  [kernel.kallsyms]  [k] system_call_after_swapgs
     0.00%     +0.32%  [kernel.kallsyms]  [k] mark_held_locks
     0.00%     +0.31%  [kernel.kallsyms]  [k] touch_atime
     0.00%     +0.30%  [kernel.kallsyms]  [k] account_entity_enqueue
     0.00%     +0.30%  [kernel.kallsyms]  [k] set_next_entity
     0.00%     +0.30%  [kernel.kallsyms]  [k] place_entity
     0.00%     +0.29%  [kernel.kallsyms]  [k] try_to_wake_up
     0.00%     +0.29%  [kernel.kallsyms]  [k] check_preempt_wakeup
     0.00%     +0.28%  [kernel.kallsyms]  [k] debug_lockdep_rcu_enabled
     0.00%     +0.28%  [kernel.kallsyms]  [k] cpumask_next_and
     0.00%     +0.28%  [kernel.kallsyms]  [k] __wake_up_common
     0.00%     +0.27%  [kernel.kallsyms]  [k] rb_erase
     0.00%     +0.26%  [kernel.kallsyms]  [k] ttwu_stat
     0.00%     +0.25%  [kernel.kallsyms]  [k] _raw_spin_unlock_irq
     0.00%     +0.25%  [kernel.kallsyms]  [k] pick_next_task_fair
     0.00%     +0.25%  [kernel.kallsyms]  [k] update_cfs_shares
     0.00%     +0.25%  [kernel.kallsyms]  [k] sysret_check
     0.00%     +0.25%  [kernel.kallsyms]  [k] lockdep_sys_exit_thunk
     0.00%     +0.25%  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     0.00%     +0.24%  [kernel.kallsyms]  [k] get_lock_stats
     0.00%     +0.24%  [kernel.kallsyms]  [k] put_prev_task_fair
     0.00%     +0.24%  [kernel.kallsyms]  [k] trace_hardirqs_on_thunk
     0.00%     +0.24%  [kernel.kallsyms]  [k] __perf_event_task_sched_out
     0.00%     +0.24%  [kernel.kallsyms]  [k] ret_from_sys_call
     0.00%     +0.23%  [kernel.kallsyms]  [k] rcu_note_context_switch
     0.00%     +0.23%  [kernel.kallsyms]  [k] update_stats_wait_end
     0.00%     +0.23%  [kernel.kallsyms]  [k] file_update_time
     0.35%     -0.12%  libc-2.12.so       [.] __write_nocancel
     0.00%     +0.22%  [kernel.kallsyms]  [k] rw_verify_area
     0.00%     +0.21%  [kernel.kallsyms]  [k] mutex_unlock
     0.00%     +0.20%  [kernel.kallsyms]  [k] system_call_fastpath
     0.00%     +0.20%  [kernel.kallsyms]  [k] sys_write
     0.09%     +0.11%  [kernel.kallsyms]  [k] update_cfs_load
     0.00%     +0.20%  [kernel.kallsyms]  [k] time_hardirqs_off
     0.10%     +0.10%  [kernel.kallsyms]  [k] x86_pmu_disable
     0.00%     +0.19%  [kernel.kallsyms]  [k] clear_buddies
     0.00%     +0.19%  [kernel.kallsyms]  [k] activate_task
     0.00%     +0.18%  [kernel.kallsyms]  [k] enqueue_task_fair
     0.00%     +0.18%  [kernel.kallsyms]  [k] _raw_spin_lock
     0.00%     +0.18%  [kernel.kallsyms]  [k] ttwu_do_wakeup
     0.00%     +0.17%  [kernel.kallsyms]  [k] __srcu_read_lock
     0.00%     +0.17%  [kernel.kallsyms]  [k] prepare_to_wait
     0.00%     +0.16%  [kernel.kallsyms]  [k] debug_mutex_lock_common
     0.00%     +0.16%  [kernel.kallsyms]  [k] ttwu_activate
     0.00%     +0.16%  [kernel.kallsyms]  [k] time_hardirqs_on
     0.00%     +0.16%  [kernel.kallsyms]  [k] pipe_wait
     0.00%     +0.16%  [kernel.kallsyms]  [k] preempt_schedule
     0.00%     +0.16%  [kernel.kallsyms]  [k] debug_mutex_free_waiter
     0.00%     +0.15%  [kernel.kallsyms]  [k] __rcu_read_unlock
     0.00%     +0.14%  [kernel.kallsyms]  [k] account_cfs_rq_runtime
     0.00%     +0.14%  [kernel.kallsyms]  [k] perf_pmu_rotate_start
     0.00%     +0.14%  [kernel.kallsyms]  [k] pipe_lock
     0.00%     +0.14%  [kernel.kallsyms]  [k] __perf_event_task_sched_in
     0.00%     +0.14%  [kernel.kallsyms]  [k] __srcu_read_unlock
     0.00%     +0.13%  [kernel.kallsyms]  [k] perf_ctx_unlock
     0.00%     +0.13%  [kernel.kallsyms]  [k] __rcu_read_lock
     0.00%     +0.13%  [kernel.kallsyms]  [k] account_entity_dequeue
     0.00%     +0.12%  [kernel.kallsyms]  [k] __fsnotify_parent
     0.00%     +0.12%  [kernel.kallsyms]  [k] sched_clock_cpu
     0.00%     +0.12%  [kernel.kallsyms]  [k] current_fs_time
     0.00%     +0.11%  [kernel.kallsyms]  [k] _raw_spin_lock_irq
     0.00%     +0.11%  [kernel.kallsyms]  [k] mutex_remove_waiter
     0.00%     +0.11%  [kernel.kallsyms]  [k] autoremove_wake_function
     0.00%     +0.10%  [kernel.kallsyms]  [k] hrtick_start_fair
     0.08%     +0.03%  pipe-test-100k     [.] read@plt
     0.00%     +0.10%  [kernel.kallsyms]  [k] __bfs
     0.00%     +0.10%  [kernel.kallsyms]  [k] mnt_want_write
     0.00%     +0.09%  [kernel.kallsyms]  [k] __dequeue_entity
     0.00%     +0.09%  [kernel.kallsyms]  [k] do_raw_spin_unlock
     0.00%     +0.08%  [kernel.kallsyms]  [k] lockdep_sys_exit
     0.00%     +0.08%  [kernel.kallsyms]  [k] rb_next
     0.00%     +0.08%  [kernel.kallsyms]  [k] debug_mutex_unlock
     0.00%     +0.08%  [kernel.kallsyms]  [k] rb_insert_color
     0.00%     +0.08%  [kernel.kallsyms]  [k] update_rq_clock
     0.00%     +0.08%  [kernel.kallsyms]  [k] dequeue_task_fair
     0.00%     +0.07%  [kernel.kallsyms]  [k] finish_wait
     0.00%     +0.07%  [kernel.kallsyms]  [k] wakeup_preempt_entity
     0.00%     +0.07%  [kernel.kallsyms]  [k] debug_mutex_add_waiter
     0.00%     +0.07%  [kernel.kallsyms]  [k] ttwu_do_activate.clone.3
     0.00%     +0.07%  [kernel.kallsyms]  [k] generic_pipe_buf_map
     0.00%     +0.06%  [kernel.kallsyms]  [k] __wake_up_sync_key
     0.00%     +0.06%  [kernel.kallsyms]  [k] __mark_inode_dirty
     0.04%     +0.02%  [kernel.kallsyms]  [k] intel_pmu_nhm_enable_all
     0.00%     +0.05%  [kernel.kallsyms]  [k] timespec_trunc
     0.00%     +0.05%  [kernel.kallsyms]  [k] dequeue_task
     0.00%     +0.05%  [kernel.kallsyms]  [k] perf_pmu_disable
     0.00%     +0.05%  [kernel.kallsyms]  [k] apic_timer_interrupt
     0.00%     +0.05%  [kernel.kallsyms]  [k] current_kernel_time
     0.05%             pipe-test-100k     [.] write@plt
     0.00%     +0.05%  [kernel.kallsyms]  [k] generic_pipe_buf_confirm
     0.00%     +0.04%  [kernel.kallsyms]  [k] __rcu_pending
     0.00%     +0.04%  [kernel.kallsyms]  [k] generic_pipe_buf_unmap
     0.00%     +0.04%  [kernel.kallsyms]  [k] anon_pipe_buf_release
     0.00%     +0.04%  [kernel.kallsyms]  [k] finish_task_switch
     0.00%     +0.04%  [kernel.kallsyms]  [k] perf_event_context_sched_in
     0.00%     +0.04%  [kernel.kallsyms]  [k] update_process_times
     0.00%     +0.04%  [kernel.kallsyms]  [k] do_timer
     0.00%     +0.04%  [kernel.kallsyms]  [k] trace_hardirqs_off_thunk
     0.00%     +0.03%  [kernel.kallsyms]  [k] run_timer_softirq
     0.00%     +0.02%  [kernel.kallsyms]  [k] default_wake_function
     0.00%     +0.02%  [kernel.kallsyms]  [k] hrtimer_interrupt
     0.00%     +0.02%  [kernel.kallsyms]  [k] timerqueue_add
     0.00%     +0.02%  [kernel.kallsyms]  [k] __do_softirq
     0.00%     +0.02%  [kernel.kallsyms]  [k] set_next_buddy
     0.00%     +0.02%  [kernel.kallsyms]  [k] resched_task
     0.00%     +0.02%  [kernel.kallsyms]  [k] task_tick_fair
     0.00%     +0.02%  [kernel.kallsyms]  [k] restore
     0.00%     +0.02%  [kernel.kallsyms]  [k] irq_exit
     0.00%     +0.02%  [e1000e]           [k] e1000_watchdog
     0.00%     +0.01%  [kernel.kallsyms]  [k] account_process_tick
     0.00%     +0.01%  [kernel.kallsyms]  [k] update_vsyscall
     0.00%     +0.01%  [kernel.kallsyms]  [k] rcu_enter_nohz
     0.00%     +0.01%  [kernel.kallsyms]  [k] hrtimer_run_pending
     0.00%     +0.01%  [kernel.kallsyms]  [k] calc_global_load
     0.00%     +0.01%  [kernel.kallsyms]  [k] account_system_time
     0.00%     +0.01%  [kernel.kallsyms]  [k] __run_hrtimer
     0.99%     -0.98%  [kernel.kallsyms]  [k] pipe_unlock
     0.00%     +0.01%  [kernel.kallsyms]  [k] irq_enter
     0.00%     +0.01%  [kernel.kallsyms]  [k] scheduler_tick
     0.00%     +0.01%  [kernel.kallsyms]  [k] mnt_want_write_file
     0.00%     +0.01%  [kernel.kallsyms]  [k] hrtimer_run_queues
     0.01%             [kernel.kallsyms]  [k] sched_avg_update
     0.00%             [kernel.kallsyms]  [k] rcu_check_callbacks
     0.00%             [kernel.kallsyms]  [k] task_waking_fair
     0.00%             [kernel.kallsyms]  [k] trace_softirqs_off
     0.00%             [kernel.kallsyms]  [k] call_softirq
     0.00%             [kernel.kallsyms]  [k] find_busiest_group
     0.00%             [kernel.kallsyms]  [k] exit_idle
     0.00%             [kernel.kallsyms]  [k] enqueue_hrtimer
     0.00%             [kernel.kallsyms]  [k] hrtimer_forward
     0.02%     -0.02%  [kernel.kallsyms]  [k] x86_pmu_enable
     0.01%             [kernel.kallsyms]  [k] do_softirq
     0.00%             [kernel.kallsyms]  [k] calc_delta_mine
     0.00%             [kernel.kallsyms]  [k] sched_slice
     0.00%             [kernel.kallsyms]  [k] tick_sched_timer
     0.00%             [kernel.kallsyms]  [k] irq_work_run
     0.00%             [kernel.kallsyms]  [k] ktime_get
     0.00%             [kernel.kallsyms]  [k] update_cpu_load
     0.00%             [kernel.kallsyms]  [k] __remove_hrtimer
     0.00%             [kernel.kallsyms]  [k] rcu_exit_nohz
     0.00%             [kernel.kallsyms]  [k] clockevents_program_event





and perf stat outputs:



base, drop_caches:

 Performance counter stats for './pipe-test-100k' (50 runs):

       3841.033842 task-clock                #    0.576 CPUs utilized            ( +-  0.06% )
           200,008 context-switches          #    0.052 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 56.54% )
               135 page-faults               #    0.000 M/sec                    ( +-  0.16% )
     7,518,958,711 cycles                    #    1.958 GHz                      ( +-  0.09% )
     2,676,161,995 stalled-cycles-frontend   #   35.59% frontend cycles idle     ( +-  0.17% )
     1,152,912,513 stalled-cycles-backend    #   15.33% backend  cycles idle     ( +-  0.31% )
     8,634,136,901 instructions              #    1.15  insns per cycle        
                                             #    0.31  stalled cycles per insn  ( +-  0.08% )
     1,764,912,243 branches                  #  459.489 M/sec                    ( +-  0.08% )
        35,531,303 branch-misses             #    2.01% of all branches          ( +-  0.12% )

       6.669821483 seconds time elapsed                                          ( +-  0.03% )



base, drop_caches:

 Performance counter stats for './pipe-test-100k' (50 runs):

       3840.203514 task-clock                #    0.576 CPUs utilized            ( +-  0.06% )
           200,009 context-switches          #    0.052 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 60.19% )
               135 page-faults               #    0.000 M/sec                    ( +-  0.18% )
     7,526,419,287 cycles                    #    1.960 GHz                      ( +-  0.08% )
     2,681,342,567 stalled-cycles-frontend   #   35.63% frontend cycles idle     ( +-  0.15% )
     1,159,603,323 stalled-cycles-backend    #   15.41% backend  cycles idle     ( +-  0.36% )
     8,641,162,766 instructions              #    1.15  insns per cycle        
                                             #    0.31  stalled cycles per insn  ( +-  0.07% )
     1,766,192,649 branches                  #  459.922 M/sec                    ( +-  0.07% )
        35,520,560 branch-misses             #    2.01% of all branches          ( +-  0.11% )

       6.667852851 seconds time elapsed                                          ( +-  0.03% )



base, drop_caches:

 Performance counter stats for './pipe-test-100k' (50 runs):

       3827.952520 task-clock                #    0.575 CPUs utilized            ( +-  0.06% )
           200,009 context-switches          #    0.052 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 56.54% )
               135 page-faults               #    0.000 M/sec                    ( +-  0.17% )
     7,491,864,402 cycles                    #    1.957 GHz                      ( +-  0.08% )
     2,664,949,808 stalled-cycles-frontend   #   35.57% frontend cycles idle     ( +-  0.16% )
     1,140,326,742 stalled-cycles-backend    #   15.22% backend  cycles idle     ( +-  0.31% )
     8,624,760,925 instructions              #    1.15  insns per cycle        
                                             #    0.31  stalled cycles per insn  ( +-  0.07% )
     1,761,666,011 branches                  #  460.211 M/sec                    ( +-  0.07% )
        34,655,390 branch-misses             #    1.97% of all branches          ( +-  0.12% )

       6.657224884 seconds time elapsed                                          ( +-  0.03% )




+patch, cgroup disabled, drop_caches:

 Performance counter stats for './pipe-test-100k' (50 runs):

       3857.191852 task-clock                #    0.576 CPUs utilized            ( +-  0.09% )
           200,008 context-switches          #    0.052 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 42.86% )
               135 page-faults               #    0.000 M/sec                    ( +-  0.19% )
     7,574,623,093 cycles                    #    1.964 GHz                      ( +-  0.10% )
     2,758,696,094 stalled-cycles-frontend   #   36.42% frontend cycles idle     ( +-  0.15% )
     1,239,909,382 stalled-cycles-backend    #   16.37% backend  cycles idle     ( +-  0.38% )
     8,572,061,001 instructions              #    1.13  insns per cycle        
                                             #    0.32  stalled cycles per insn  ( +-  0.08% )
     1,750,572,714 branches                  #  453.846 M/sec                    ( +-  0.08% )
        36,051,335 branch-misses             #    2.06% of all branches          ( +-  0.13% )

       6.691634724 seconds time elapsed                                          ( +-  0.04% )



+patch, cgroup disabled, drop_caches:

 Performance counter stats for './pipe-test-100k' (50 runs):

       3867.143019 task-clock                #    0.577 CPUs utilized            ( +-  0.10% )
           200,008 context-switches          #    0.052 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 56.54% )
               135 page-faults               #    0.000 M/sec                    ( +-  0.17% )
     7,594,083,776 cycles                    #    1.964 GHz                      ( +-  0.12% )
     2,775,221,867 stalled-cycles-frontend   #   36.54% frontend cycles idle     ( +-  0.19% )
     1,251,931,725 stalled-cycles-backend    #   16.49% backend  cycles idle     ( +-  0.36% )
     8,574,447,382 instructions              #    1.13  insns per cycle        
                                             #    0.32  stalled cycles per insn  ( +-  0.09% )
     1,751,600,855 branches                  #  452.944 M/sec                    ( +-  0.09% )
        36,098,438 branch-misses             #    2.06% of all branches          ( +-  0.16% )

       6.698065282 seconds time elapsed                                          ( +-  0.05% )



+patch, cgroup disabled, drop_caches:

 Performance counter stats for './pipe-test-100k' (50 runs):

       3857.654582 task-clock                #    0.577 CPUs utilized            ( +-  0.10% )
           200,009 context-switches          #    0.052 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 78.57% )
               135 page-faults               #    0.000 M/sec                    ( +-  0.23% )
     7,584,913,316 cycles                    #    1.966 GHz                      ( +-  0.11% )
     2,771,130,327 stalled-cycles-frontend   #   36.53% frontend cycles idle     ( +-  0.17% )
     1,263,203,011 stalled-cycles-backend    #   16.65% backend  cycles idle     ( +-  0.40% )
     8,574,734,269 instructions              #    1.13  insns per cycle        
                                             #    0.32  stalled cycles per insn  ( +-  0.09% )
     1,751,597,037 branches                  #  454.058 M/sec                    ( +-  0.09% )
        36,113,467 branch-misses             #    2.06% of all branches          ( +-  0.14% )

       6.688379749 seconds time elapsed                                          ( +-  0.04% )



* Re: [patch 00/16] CFS Bandwidth Control v7
  2011-07-05  3:58               ` Hu Tao
@ 2011-07-05  8:50                 ` Ingo Molnar
  2011-07-05  8:52                   ` Ingo Molnar
  0 siblings, 1 reply; 59+ messages in thread
From: Ingo Molnar @ 2011-07-05  8:50 UTC (permalink / raw)
  To: Hu Tao
  Cc: Hidetoshi Seto, Peter Zijlstra, Paul Turner, linux-kernel,
	Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Pavel Emelyanov


* Hu Tao <hutao@cn.fujitsu.com> wrote:

> perf diff output:
> 
> # Baseline  Delta          Shared Object                       Symbol
> # ........ ..........  .................  ...........................
> #
>      0.00%    +10.07%  [kernel.kallsyms]  [k] __lock_acquire
>      0.00%     +5.90%  [kernel.kallsyms]  [k] lock_release
>      0.00%     +4.86%  [kernel.kallsyms]  [k] trace_hardirqs_off_caller
>      0.00%     +4.06%  [kernel.kallsyms]  [k] debug_smp_processor_id
>      0.00%     +4.00%  [kernel.kallsyms]  [k] lock_acquire
>      0.00%     +3.81%  [kernel.kallsyms]  [k] lock_acquired
>      0.00%     +3.71%  [kernel.kallsyms]  [k] lock_is_held
>      0.00%     +3.04%  [kernel.kallsyms]  [k] validate_chain
>      0.00%     +2.68%  [kernel.kallsyms]  [k] check_chain_key
>      0.00%     +2.41%  [kernel.kallsyms]  [k] trace_hardirqs_off
>      0.00%     +2.01%  [kernel.kallsyms]  [k] trace_hardirqs_on_caller

Oh, please measure with lockdep (CONFIG_PROVE_LOCKING) turned off. No 
production kernel has it enabled and it has quite some overhead (as 
visible in the profile), skewing results.

>      2.04%     -0.09%  pipe-test-100k     [.] main
>      0.00%     +1.79%  [kernel.kallsyms]  [k] add_preempt_count

I'd also suggest to turn off CONFIG_PREEMPT_DEBUG.
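
A minimal sketch of doing that from the kernel source tree (the
scripts/config helper and the exact DEBUG_PREEMPT symbol name may
differ on your tree, so double-check them):

	# disable lockdep and the preempt debug checks, then refresh the config
	scripts/config --disable PROVE_LOCKING
	scripts/config --disable DEBUG_PREEMPT
	make oldconfig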

Thanks,

	Ingo


* Re: [patch 00/16] CFS Bandwidth Control v7
  2011-07-05  8:50                 ` Ingo Molnar
@ 2011-07-05  8:52                   ` Ingo Molnar
  2011-07-07  3:53                     ` Hu Tao
  0 siblings, 1 reply; 59+ messages in thread
From: Ingo Molnar @ 2011-07-05  8:52 UTC (permalink / raw)
  To: Hu Tao
  Cc: Hidetoshi Seto, Peter Zijlstra, Paul Turner, linux-kernel,
	Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Pavel Emelyanov


* Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Hu Tao <hutao@cn.fujitsu.com> wrote:
> 
> > perf diff output:
> > 
> > # Baseline  Delta          Shared Object                       Symbol
> > # ........ ..........  .................  ...........................
> > #
> >      0.00%    +10.07%  [kernel.kallsyms]  [k] __lock_acquire
> >      0.00%     +5.90%  [kernel.kallsyms]  [k] lock_release
> >      0.00%     +4.86%  [kernel.kallsyms]  [k] trace_hardirqs_off_caller
> >      0.00%     +4.06%  [kernel.kallsyms]  [k] debug_smp_processor_id
> >      0.00%     +4.00%  [kernel.kallsyms]  [k] lock_acquire
> >      0.00%     +3.81%  [kernel.kallsyms]  [k] lock_acquired
> >      0.00%     +3.71%  [kernel.kallsyms]  [k] lock_is_held
> >      0.00%     +3.04%  [kernel.kallsyms]  [k] validate_chain
> >      0.00%     +2.68%  [kernel.kallsyms]  [k] check_chain_key
> >      0.00%     +2.41%  [kernel.kallsyms]  [k] trace_hardirqs_off
> >      0.00%     +2.01%  [kernel.kallsyms]  [k] trace_hardirqs_on_caller
> 
> Oh, please measure with lockdep (CONFIG_PROVE_LOCKING) turned off. No 
> production kernel has it enabled and it has quite some overhead (as 
> visible in the profile), skewing results.
> 
> >      2.04%     -0.09%  pipe-test-100k     [.] main
> >      0.00%     +1.79%  [kernel.kallsyms]  [k] add_preempt_count
> 
> I'd also suggest to turn off CONFIG_PREEMPT_DEBUG.

The best way to get a good 'reference config' for measuring scheduler 
overhead is to do something like:

	make defconfig
	make localyesconfig

The first step will configure a sane default kernel, and the second 
will enable all drivers that are needed on that box. You should be 
able to boot the resulting bzImage; all drivers will be built in 
and easily profilable.
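
A quick sanity check on the generated .config before profiling (the
symbol names here are assumptions and may differ on your tree):

	# both should read "is not set", or not appear at all
	grep -E 'PROVE_LOCKING|DEBUG_PREEMPT' .config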

Thanks,

	Ingo


* Re: [patch 03/16] sched: introduce primitives to account for CFS bandwidth tracking
  2011-06-22 10:52   ` Peter Zijlstra
@ 2011-07-06 21:38     ` Paul Turner
  2011-07-07 11:32       ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Paul Turner @ 2011-07-06 21:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Wed, Jun 22, 2011 at 3:52 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2011-06-21 at 00:16 -0700, Paul Turner wrote:
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +       {
>> +               .name = "cfs_quota_us",
>> +               .read_s64 = cpu_cfs_quota_read_s64,
>> +               .write_s64 = cpu_cfs_quota_write_s64,
>> +       },
>> +       {
>> +               .name = "cfs_period_us",
>> +               .read_u64 = cpu_cfs_period_read_u64,
>> +               .write_u64 = cpu_cfs_period_write_u64,
>> +       },
>> +#endif
>
> Did I miss a reply to:
> lkml.kernel.org/r/1305538202.2466.4047.camel@twins ? why does it make
> sense to have different periods per cgroup? what does it mean?
>

Sorry for the delayed reply -- I never hit send on this one.

The reason asymmetric periods are beneficial is that a trade-off exists
between latency and throughput.  The 3 major "classes" I see are:

Latency sensitive applications with a very continuous distribution of
work may look to use a very tight bandwidth period (e.g. 10ms).  This
provides very consistent/predictable/repeatable performance as well as
limiting their bandwidth-imposed tail latencies.

Latency sensitive applications that experience "bursty", or
inconsistent, work distributions.  In this case expanding the period
slightly to improve burst capacity yields a large performance benefit,
while still protecting the rest of the system's applications should
they burst beyond their provisioning.

Latency insensitive applications in which we care only about
throughput.  For this type of application we only need to limit usage
over a prolonged period of time, with no tail-latency concern.  For
applications in this class we can use large periods to minimize
overheads / maximize throughput.

These classes are somewhat orthogonal and as such they pack together
fairly well on machines; but supporting this requires period
granularity to be at the hierarchy -- and not machine -- level.

(This is also briefly covered in the updated documentation.)
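
As a concrete sketch of those three classes (the cgroup mount point and
the numbers below are made up for illustration; only the
cpu.cfs_period_us / cpu.cfs_quota_us interfaces and their microsecond
units come from the series):

	# three example groups under an assumed cpu-controller mount at /cgroup/cpu
	mkdir -p /cgroup/cpu/tight /cgroup/cpu/bursty /cgroup/cpu/batch

	# smooth, latency-sensitive: tight 10ms period, half a CPU
	echo 10000   > /cgroup/cpu/tight/cpu.cfs_period_us
	echo 5000    > /cgroup/cpu/tight/cpu.cfs_quota_us

	# bursty, latency-sensitive: wider 100ms period, up to 2 CPUs of burst
	echo 100000  > /cgroup/cpu/bursty/cpu.cfs_period_us
	echo 200000  > /cgroup/cpu/bursty/cpu.cfs_quota_us

	# throughput-only batch: large 1s period, 4 CPUs on average
	echo 1000000 > /cgroup/cpu/batch/cpu.cfs_period_us
	echo 4000000 > /cgroup/cpu/batch/cpu.cfs_quota_us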


* Re: [patch 00/16] CFS Bandwidth Control v7
  2011-07-05  8:52                   ` Ingo Molnar
@ 2011-07-07  3:53                     ` Hu Tao
  0 siblings, 0 replies; 59+ messages in thread
From: Hu Tao @ 2011-07-07  3:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Hidetoshi Seto, Peter Zijlstra, Paul Turner, linux-kernel,
	Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Pavel Emelyanov

> > 
> > Oh, please measure with lockdep (CONFIG_PROVE_LOCKING) turned off. No 
> > production kernel has it enabled and it has quite some overhead (as 
> > visible in the profile), skewing results.
> > 
> > >      2.04%     -0.09%  pipe-test-100k     [.] main
> > >      0.00%     +1.79%  [kernel.kallsyms]  [k] add_preempt_count
> > 
> > I'd also suggest to turn off CONFIG_PREEMPT_DEBUG.
> 
> The best way to get a good 'reference config' for measuring scheduler 
> overhead is to do something like:
> 
> 	make defconfig
> 	make localyesconfig
> 
> The first step will configure a sane default kernel, and the second 
> will enable all drivers that are needed on that box. You should be 
> able to boot the resulting bzImage; all drivers will be built in 
> and easily profilable.

Thanks for the information. I've re-tested the patches using a config
obtained the way you suggested; these are the results:

Table 1 shows the differences in cycles, instructions and branches
         between the drop-caches and no-drop-caches runs. Each
         drop-caches case is run as: reboot, drop caches, then perf.
         The patch cases are run with the cpu cgroup disabled.

                          cycles                   instructions             branches
-----------------------------------------------------------------------------------------------
base                      1,146,384,132            1,151,216,688            212,431,532
base, drop caches         1,150,931,998 ( 0.39%)   1,150,099,127 (-0.10%)   212,216,507 (-0.10%)
base, drop caches         1,144,685,532 (-0.15%)   1,151,115,796 (-0.01%)   212,412,336 (-0.01%)
base, drop caches         1,148,922,524 ( 0.22%)   1,150,636,042 (-0.05%)   212,322,280 (-0.05%)
-----------------------------------------------------------------------------------------------
patch                     1,163,717,547            1,165,238,015            215,092,327
patch, drop caches        1,161,301,415 (-0.21%)   1,165,905,415 (0.06%)    215,220,114 (0.06%)
patch, drop caches        1,161,388,127 (-0.20%)   1,166,315,396 (0.09%)    215,300,854 (0.10%)
patch, drop caches        1,167,839,222 ( 0.35%)   1,166,287,755 (0.09%)    215,294,118 (0.09%)
-----------------------------------------------------------------------------------------------


Table 2 shows the differences between patch and no-patch. Quota is set
         to a large value to avoid processes being throttled.

        quota/period          cycles                   instructions             branches
--------------------------------------------------------------------------------------------------
base                          1,146,384,132           1,151,216,688            212,431,532
patch   cgroup disabled       1,163,717,547 (1.51%)   1,165,238,015 ( 1.22%)   215,092,327 ( 1.25%)
patch   10000000000/1000      1,244,889,136 (8.59%)   1,299,128,502 (12.85%)   243,162,542 (14.47%)
patch   10000000000/10000     1,253,305,706 (9.33%)   1,299,167,897 (12.85%)   243,175,027 (14.47%)
patch   10000000000/100000    1,252,374,134 (9.25%)   1,299,314,357 (12.86%)   243,203,923 (14.49%)
patch   10000000000/1000000   1,254,165,824 (9.40%)   1,299,751,347 (12.90%)   243,288,600 (14.53%)
--------------------------------------------------------------------------------------------------


(Any questions, please let me know.)
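
For reference, each block below was gathered with an invocation along
these lines (the exact command is an assumption; the counters shown are
perf stat's defaults):

	perf stat -r 500 ./pipe-test-100k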




outputs from perf:



base
--------------
 Performance counter stats for './pipe-test-100k' (500 runs):

        741.615458 task-clock                #    0.432 CPUs utilized            ( +-  0.05% )
           200,001 context-switches          #    0.270 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 57.62% )
               135 page-faults               #    0.000 M/sec                    ( +-  0.06% )
     1,146,384,132 cycles                    #    1.546 GHz                      ( +-  0.06% )
       528,191,000 stalled-cycles-frontend   #   46.07% frontend cycles idle     ( +-  0.11% )
       245,053,477 stalled-cycles-backend    #   21.38% backend  cycles idle     ( +-  0.14% )
     1,151,216,688 instructions              #    1.00  insns per cycle        
                                             #    0.46  stalled cycles per insn  ( +-  0.04% )
       212,431,532 branches                  #  286.444 M/sec                    ( +-  0.04% )
         3,192,969 branch-misses             #    1.50% of all branches          ( +-  0.26% )

       1.717638863 seconds time elapsed                                          ( +-  0.02% )



base, drop caches
------------------
 Performance counter stats for './pipe-test-100k' (500 runs):

        743.991156 task-clock                #    0.432 CPUs utilized            ( +-  0.05% )
           200,001 context-switches          #    0.269 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 57.62% )
               135 page-faults               #    0.000 M/sec                    ( +-  0.06% )
     1,150,931,998 cycles                    #    1.547 GHz                      ( +-  0.06% )
       532,150,859 stalled-cycles-frontend   #   46.24% frontend cycles idle     ( +-  0.11% )
       248,132,791 stalled-cycles-backend    #   21.56% backend  cycles idle     ( +-  0.14% )
     1,150,099,127 instructions              #    1.00  insns per cycle        
                                             #    0.46  stalled cycles per insn  ( +-  0.04% )
       212,216,507 branches                  #  285.241 M/sec                    ( +-  0.05% )
         3,234,741 branch-misses             #    1.52% of all branches          ( +-  0.24% )

       1.720283100 seconds time elapsed                                          ( +-  0.02% )



base, drop caches
------------------
 Performance counter stats for './pipe-test-100k' (500 runs):

        741.228159 task-clock                #    0.432 CPUs utilized            ( +-  0.05% )
           200,001 context-switches          #    0.270 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 49.85% )
               135 page-faults               #    0.000 M/sec                    ( +-  0.06% )
     1,144,685,532 cycles                    #    1.544 GHz                      ( +-  0.06% )
       528,095,499 stalled-cycles-frontend   #   46.13% frontend cycles idle     ( +-  0.10% )
       245,336,551 stalled-cycles-backend    #   21.43% backend  cycles idle     ( +-  0.14% )
     1,151,115,796 instructions              #    1.01  insns per cycle        
                                             #    0.46  stalled cycles per insn  ( +-  0.04% )
       212,412,336 branches                  #  286.568 M/sec                    ( +-  0.04% )
         3,128,390 branch-misses             #    1.47% of all branches          ( +-  0.25% )

       1.717165952 seconds time elapsed                                          ( +-  0.02% )



base, drop caches
------------------
 Performance counter stats for './pipe-test-100k' (500 runs):

        743.564054 task-clock                #    0.433 CPUs utilized            ( +-  0.04% )
           200,001 context-switches          #    0.269 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 74.48% )
               135 page-faults               #    0.000 M/sec                    ( +-  0.06% )
     1,148,922,524 cycles                    #    1.545 GHz                      ( +-  0.07% )
       532,489,993 stalled-cycles-frontend   #   46.35% frontend cycles idle     ( +-  0.11% )
       248,064,979 stalled-cycles-backend    #   21.59% backend  cycles idle     ( +-  0.15% )
     1,150,636,042 instructions              #    1.00  insns per cycle        
                                             #    0.46  stalled cycles per insn  ( +-  0.04% )
       212,322,280 branches                  #  285.547 M/sec                    ( +-  0.04% )
         3,123,001 branch-misses             #    1.47% of all branches          ( +-  0.25% )

       1.718876342 seconds time elapsed                                          ( +-  0.02% )








patch, cgroup disabled
-----------------------
 Performance counter stats for './pipe-test-100k' (500 runs):

        739.608960 task-clock                #    0.426 CPUs utilized            ( +-  0.04% )
           200,001 context-switches          #    0.270 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +-100.00% )
               135 page-faults               #    0.000 M/sec                    ( +-  0.06% )
     1,163,717,547 cycles                    #    1.573 GHz                      ( +-  0.06% )
       541,274,832 stalled-cycles-frontend   #   46.51% frontend cycles idle     ( +-  0.11% )
       248,207,739 stalled-cycles-backend    #   21.33% backend  cycles idle     ( +-  0.14% )
     1,165,238,015 instructions              #    1.00  insns per cycle        
                                             #    0.46  stalled cycles per insn  ( +-  0.04% )
       215,092,327 branches                  #  290.819 M/sec                    ( +-  0.04% )
         3,355,695 branch-misses             #    1.56% of all branches          ( +-  0.15% )

       1.734269082 seconds time elapsed                                          ( +-  0.02% )



patch, cgroup disabled, drop caches
------------------------------------
 Performance counter stats for './pipe-test-100k' (500 runs):

        737.995897 task-clock                #    0.426 CPUs utilized            ( +-  0.04% )
           200,001 context-switches          #    0.271 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 57.62% )
               135 page-faults               #    0.000 M/sec                    ( +-  0.06% )
     1,161,301,415 cycles                    #    1.574 GHz                      ( +-  0.06% )
       538,706,207 stalled-cycles-frontend   #   46.39% frontend cycles idle     ( +-  0.10% )
       247,842,667 stalled-cycles-backend    #   21.34% backend  cycles idle     ( +-  0.15% )
     1,165,905,415 instructions              #    1.00  insns per cycle        
                                             #    0.46  stalled cycles per insn  ( +-  0.04% )
       215,220,114 branches                  #  291.628 M/sec                    ( +-  0.04% )
         3,344,324 branch-misses             #    1.55% of all branches          ( +-  0.15% )

       1.731173126 seconds time elapsed                                          ( +-  0.02% )



patch, cgroup disabled, drop caches
------------------------------------
 Performance counter stats for './pipe-test-100k' (500 runs):

        737.789383 task-clock                #    0.427 CPUs utilized            ( +-  0.04% )
           200,001 context-switches          #    0.271 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 70.64% )
               135 page-faults               #    0.000 M/sec                    ( +-  0.05% )
     1,161,388,127 cycles                    #    1.574 GHz                      ( +-  0.06% )
       538,324,103 stalled-cycles-frontend   #   46.35% frontend cycles idle     ( +-  0.10% )
       248,382,647 stalled-cycles-backend    #   21.39% backend  cycles idle     ( +-  0.14% )
     1,166,315,396 instructions              #    1.00  insns per cycle        
                                             #    0.46  stalled cycles per insn  ( +-  0.03% )
       215,300,854 branches                  #  291.819 M/sec                    ( +-  0.04% )
         3,337,456 branch-misses             #    1.55% of all branches          ( +-  0.15% )

       1.729696593 seconds time elapsed                                          ( +-  0.02% )



patch, cgroup disabled, drop caches
------------------------------------
 Performance counter stats for './pipe-test-100k' (500 runs):

        740.796454 task-clock                #    0.427 CPUs utilized            ( +-  0.04% )
           200,001 context-switches          #    0.270 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 52.78% )
               135 page-faults               #    0.000 M/sec                    ( +-  0.05% )
     1,167,839,222 cycles                    #    1.576 GHz                      ( +-  0.06% )
       543,240,067 stalled-cycles-frontend   #   46.52% frontend cycles idle     ( +-  0.10% )
       250,219,423 stalled-cycles-backend    #   21.43% backend  cycles idle     ( +-  0.15% )
     1,166,287,755 instructions              #    1.00  insns per cycle        
                                             #    0.47  stalled cycles per insn  ( +-  0.03% )
       215,294,118 branches                  #  290.625 M/sec                    ( +-  0.03% )
         3,435,316 branch-misses             #    1.60% of all branches          ( +-  0.15% )

       1.735473959 seconds time elapsed                                          ( +-  0.02% )








patch, period/quota 1000/10000000000
------------------------------------
 Performance counter stats for './pipe-test-100k' (500 runs):
        773.180003 task-clock                #    0.437 CPUs utilized            ( +-  0.04% )
           200,001 context-switches          #    0.259 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 57.62% )
               135 page-faults               #    0.000 M/sec                    ( +-  0.06% )
     1,244,889,136 cycles                    #    1.610 GHz                      ( +-  0.06% )
       557,331,396 stalled-cycles-frontend   #   44.77% frontend cycles idle     ( +-  0.10% )
       244,081,415 stalled-cycles-backend    #   19.61% backend  cycles idle     ( +-  0.14% )
     1,299,128,502 instructions              #    1.04  insns per cycle        
                                             #    0.43  stalled cycles per insn  ( +-  0.04% )
       243,162,542 branches                  #  314.497 M/sec                    ( +-  0.04% )
         3,630,994 branch-misses             #    1.49% of all branches          ( +-  0.16% )

       1.769489922 seconds time elapsed                                          ( +-  0.02% )



patch, period/quota 10000/10000000000
------------------------------------
 Performance counter stats for './pipe-test-100k' (500 runs):
        776.884689 task-clock                #    0.438 CPUs utilized            ( +-  0.04% )
           200,001 context-switches          #    0.257 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 57.62% )
               135 page-faults               #    0.000 M/sec                    ( +-  0.06% )
     1,253,305,706 cycles                    #    1.613 GHz                      ( +-  0.06% )
       566,262,435 stalled-cycles-frontend   #   45.18% frontend cycles idle     ( +-  0.10% )
       249,193,264 stalled-cycles-backend    #   19.88% backend  cycles idle     ( +-  0.13% )
     1,299,167,897 instructions              #    1.04  insns per cycle        
                                             #    0.44  stalled cycles per insn  ( +-  0.04% )
       243,175,027 branches                  #  313.013 M/sec                    ( +-  0.04% )
         3,774,613 branch-misses             #    1.55% of all branches          ( +-  0.13% )

       1.773111308 seconds time elapsed                                          ( +-  0.02% )



patch, period/quota 100000/10000000000
------------------------------------
 Performance counter stats for './pipe-test-100k' (500 runs):
        776.756709 task-clock                #    0.439 CPUs utilized            ( +-  0.04% )
           200,001 context-switches          #    0.257 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 52.78% )
               135 page-faults               #    0.000 M/sec                    ( +-  0.06% )
     1,252,374,134 cycles                    #    1.612 GHz                      ( +-  0.05% )
       565,520,222 stalled-cycles-frontend   #   45.16% frontend cycles idle     ( +-  0.09% )
       249,412,383 stalled-cycles-backend    #   19.92% backend  cycles idle     ( +-  0.12% )
     1,299,314,357 instructions              #    1.04  insns per cycle        
                                             #    0.44  stalled cycles per insn  ( +-  0.04% )
       243,203,923 branches                  #  313.102 M/sec                    ( +-  0.04% )
         3,793,064 branch-misses             #    1.56% of all branches          ( +-  0.13% )

       1.771283272 seconds time elapsed                                          ( +-  0.01% )



patch, period/quota 1000000/10000000000
------------------------------------
 Performance counter stats for './pipe-test-100k' (500 runs):
        778.091675 task-clock                #    0.439 CPUs utilized            ( +-  0.04% )
           200,001 context-switches          #    0.257 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 61.13% )
               135 page-faults               #    0.000 M/sec                    ( +-  0.06% )
     1,254,165,824 cycles                    #    1.612 GHz                      ( +-  0.05% )
       567,280,955 stalled-cycles-frontend   #   45.23% frontend cycles idle     ( +-  0.09% )
       249,428,011 stalled-cycles-backend    #   19.89% backend  cycles idle     ( +-  0.12% )
     1,299,751,347 instructions              #    1.04  insns per cycle        
                                             #    0.44  stalled cycles per insn  ( +-  0.04% )
       243,288,600 branches                  #  312.673 M/sec                    ( +-  0.04% )
         3,811,879 branch-misses             #    1.57% of all branches          ( +-  0.13% )

       1.773436668 seconds time elapsed                                          ( +-  0.02% )


* Re: [patch 03/16] sched: introduce primitives to account for CFS bandwidth tracking
  2011-07-06 21:38     ` Paul Turner
@ 2011-07-07 11:32       ` Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2011-07-07 11:32 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Wed, 2011-07-06 at 14:38 -0700, Paul Turner wrote:
> On Wed, Jun 22, 2011 at 3:52 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > On Tue, 2011-06-21 at 00:16 -0700, Paul Turner wrote:
> >> +#ifdef CONFIG_CFS_BANDWIDTH
> >> +       {
> >> +               .name = "cfs_quota_us",
> >> +               .read_s64 = cpu_cfs_quota_read_s64,
> >> +               .write_s64 = cpu_cfs_quota_write_s64,
> >> +       },
> >> +       {
> >> +               .name = "cfs_period_us",
> >> +               .read_u64 = cpu_cfs_period_read_u64,
> >> +               .write_u64 = cpu_cfs_period_write_u64,
> >> +       },
> >> +#endif
> >
> > Did I miss a reply to:
> > lkml.kernel.org/r/1305538202.2466.4047.camel@twins ? why does it make
> > sense to have different periods per cgroup? what does it mean?
> >
> 
> Sorry for the delayed reply -- I never hit send on this one.
> 
> The reason asymmetric periods are beneficial is that a trade-off exists
> between latency and throughput.  The 3 major "classes" I see are:
> 
> Latency sensitive applications with a very continuous distribution of
> work may look to use a very tight bandwidth period (e.g. 10ms).  This
> provides very consistent/predictable/repeatable performance as well as
> limiting their bandwidth-imposed tail latencies.
> 
> Latency sensitive applications that experience "bursty", or
> inconsistent, work distributions.  In this case expanding the period
> slightly to improve burst capacity yields a large performance benefit,
> while still protecting the rest of the system's applications should
> they burst beyond their provisioning.
> 
> Latency insensitive applications in which we care only about
> throughput.  For this type of application we only need to limit usage
> over a prolonged period of time, with no tail-latency concern.  For
> applications in this class we can use large periods to minimize
> overheads / maximize throughput.
> 
> These classes are somewhat orthogonal and as such they pack together
> fairly well on machines; but supporting this requires period
> granularity to be at the hierarchy -- and not machine -- level.
> 
> (This is also briefly covered in the updated documentation.)


Right, but what I'm getting at is the hierarchical nature of things.
Having a parent/child both with bandwidth constraints but different
periods simply doesn't work well. The child will always be subjected to
the parent's sleep time and thus affected by its choice of period.

One way out of this is to not allow child cgroups to set their period
when they have a parent that is also bandwidth constrained, and to have
a period update on a cgroup without a constrained parent update its
entire sub-tree.

That way you can still have siblings with different periods as children
of an unconstrained parent (say root), but avoid all the weirdness of
parent/child with different constraint periods.
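
A small user-space model of that rule, just to make it concrete (all of
the names and types below are invented for illustration; this is not
code from the posted series):

#include <stdbool.h>
#include <stdio.h>

struct group {
	struct group *parent;
	long long quota_us;	/* < 0 means unconstrained */
	long long period_us;
};

/* a group may change its period only if no ancestor is constrained */
static bool may_set_period(const struct group *g)
{
	const struct group *p;

	for (p = g->parent; p; p = p->parent)
		if (p->quota_us >= 0)
			return false;
	return true;
}

int main(void)
{
	struct group root   = { NULL,    -1,     100000 };
	struct group parent = { &root,   500000, 100000 };
	struct group child  = { &parent, 100000, 100000 };

	/* parent sits under an unconstrained root: change allowed */
	printf("parent may set period: %d\n", may_set_period(&parent)); /* 1 */
	/* child has a constrained parent: change refused */
	printf("child may set period:  %d\n", may_set_period(&child));  /* 0 */
	return 0;
}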


Thread overview: 59+ messages
2011-06-21  7:16 [patch 00/16] CFS Bandwidth Control v7 Paul Turner
2011-06-21  7:16 ` [patch 01/16] sched: (fixlet) dont update shares twice on on_rq parent Paul Turner
2011-06-21  7:16 ` [patch 02/16] sched: hierarchical task accounting for SCHED_OTHER Paul Turner
2011-06-21  7:16 ` [patch 03/16] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
2011-06-22 10:52   ` Peter Zijlstra
2011-07-06 21:38     ` Paul Turner
2011-07-07 11:32       ` Peter Zijlstra
2011-06-21  7:16 ` [patch 04/16] sched: validate CFS quota hierarchies Paul Turner
2011-06-22  5:43   ` Bharata B Rao
2011-06-22  6:57     ` Paul Turner
2011-06-22  9:38   ` Hidetoshi Seto
2011-06-21  7:16 ` [patch 05/16] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth Paul Turner
2011-06-21  7:16 ` [patch 06/16] sched: add a timer to handle CFS bandwidth refresh Paul Turner
2011-06-22  9:38   ` Hidetoshi Seto
2011-06-21  7:16 ` [patch 07/16] sched: expire invalid runtime Paul Turner
2011-06-22  9:38   ` Hidetoshi Seto
2011-06-22 15:47   ` Peter Zijlstra
2011-06-28  4:42     ` Paul Turner
2011-06-29  2:29       ` Paul Turner
2011-06-21  7:16 ` [patch 08/16] sched: throttle cfs_rq entities which exceed their local runtime Paul Turner
2011-06-22  7:11   ` Bharata B Rao
2011-06-22 16:07   ` Peter Zijlstra
2011-06-22 16:54     ` Paul Turner
2011-06-21  7:16 ` [patch 09/16] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh Paul Turner
2011-06-22 17:29   ` Peter Zijlstra
2011-06-28  4:40     ` Paul Turner
2011-06-28  9:11       ` Peter Zijlstra
2011-06-29  3:37         ` Paul Turner
2011-06-21  7:16 ` [patch 10/16] sched: throttle entities exceeding their allowed bandwidth Paul Turner
2011-06-22  9:39   ` Hidetoshi Seto
2011-06-21  7:17 ` [patch 11/16] sched: allow for positional tg_tree walks Paul Turner
2011-06-21  7:17 ` [patch 12/16] sched: prevent interactions with throttled entities Paul Turner
2011-06-22 21:34   ` Peter Zijlstra
2011-06-28  4:43     ` Paul Turner
2011-06-23 11:49   ` Peter Zijlstra
2011-06-28  4:38     ` Paul Turner
2011-06-21  7:17 ` [patch 13/16] sched: migrate throttled tasks on HOTPLUG Paul Turner
2011-06-21  7:17 ` [patch 14/16] sched: add exports tracking cfs bandwidth control statistics Paul Turner
2011-06-21  7:17 ` [patch 15/16] sched: return unused runtime on voluntary sleep Paul Turner
2011-06-21  7:33   ` Paul Turner
2011-06-22  9:39   ` Hidetoshi Seto
2011-06-23 15:26   ` Peter Zijlstra
2011-06-28  1:42     ` Paul Turner
2011-06-28 10:01       ` Peter Zijlstra
2011-06-28 18:45         ` Paul Turner
2011-06-21  7:17 ` [patch 16/16] sched: add documentation for bandwidth control Paul Turner
2011-06-21 10:30   ` Hidetoshi Seto
2011-06-21 19:46     ` Paul Turner
2011-06-22 10:05 ` [patch 00/16] CFS Bandwidth Control v7 Hidetoshi Seto
2011-06-23 12:06   ` Peter Zijlstra
2011-06-23 12:43     ` Ingo Molnar
2011-06-24  5:11       ` Hidetoshi Seto
2011-06-26 10:35         ` Ingo Molnar
2011-06-29  4:05           ` Hu Tao
2011-07-01 12:28             ` Ingo Molnar
2011-07-05  3:58               ` Hu Tao
2011-07-05  8:50                 ` Ingo Molnar
2011-07-05  8:52                   ` Ingo Molnar
2011-07-07  3:53                     ` Hu Tao
