* [patch 00/17] CFS Bandwidth Control v7.1
@ 2011-07-07  5:30 Paul Turner
  2011-07-07  5:30 ` [patch 01/17] sched: (fixlet) dont update shares twice on on_rq parent Paul Turner
                   ` (20 more replies)
  0 siblings, 21 replies; 47+ messages in thread
From: Paul Turner @ 2011-07-07  5:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Hu Tao

Hi all,

Please find attached an incremental revision of v7 of bandwidth control.

The only real functional change is an improvement to update shares only as we
leave a throttled state.  The remainder is largely refactoring, expansion of
comments, and code clean-up.

Hidetoshi Seto and Hu Tao have been kind enough to run performance benchmarks
against v7, measuring scheduling-path overhead with pipe-test-100k.
Results can be found at:

https://lkml.org/lkml/2011/6/24/10
https://lkml.org/lkml/2011/7/4/347

The summary results (from Hu Tao's most recent run) are:
                                            cycles                   instructions            branches
-------------------------------------------------------------------------------------------------------------------
base                                        7,526,317,497           8,666,579,347            1,771,078,445
+patch, cgroup not enabled                  7,610,354,447 (1.12%)   8,569,448,982 (-1.12%)   1,751,675,193 (-0.11%)
+patch, 10000000000/1000(quota/period)      7,856,873,327 (4.39%)   8,822,227,540 (1.80%)    1,801,766,182 (1.73%)
+patch, 10000000000/10000(quota/period)     7,797,711,600 (3.61%)   8,754,747,746 (1.02%)    1,788,316,969 (0.97%)
+patch, 10000000000/100000(quota/period)    7,777,784,384 (3.34%)   8,744,979,688 (0.90%)    1,786,319,566 (0.86%)
+patch, 10000000000/1000000(quota/period)   7,802,382,802 (3.67%)   8,755,638,235 (1.03%)    1,788,601,070 (0.99%)
------------------------------------------------------------------------------------------------------------------

Thanks again for running these benchmarks!

Changes:

v7.1
-----------
- We now only explicitly update entity shares as their hierarchy leaves a 
  throttled state.  This simplifies shares interactions as all tg->shares
  logic can now be omitted within a throttled hierarchy.  This should also
  improve the quality of balance observed within Kamalesh's nested cgroup test
  as we are able to do a bottom-up shares update on unthrottle.
- do_sched_cfs_period_timer() refactored to be linear/readable.
- We now force a period timer restart in tg_set_cfs_bandwidth() (previously, a
  dramatic decrease in the period length could have induced a period of
  starvation while we waited for the previous timer to expire).  Also avoid a
  spurious start/restart that existed in the quota == RUNTIME_INF case.
- The above removes the case of bandwidth changing while the period timer is
  running, which helps with the do_sched_cfs_period_timer() clean-up.
- Fixed potential cfs_b lock nesting in __start_cfs_bandwidth() (in the case
  that we are racing with call-back startup rather than tear-down).
- The load-balancer checks to ensure that we are not moving tasks between 
  throttled hierarchies have been refactored and now check both the src and
  dest cfs_rqs.
- Buddy isolation cleaned up and moved to its own patch.
- Enabling of throttling is deferred until later in the series so that the
  load-balancer and buddy protections already exist (for stability in bisection).
- Documentation given a once-over for clarity and content.
- General code clean-up and improved comments.

Hidetoshi, the following patches have changed enough to necessitate tweaking
of your Reviewed-by:
[patch 09/17] sched: add support for unthrottling group entities (extensive)
[patch 11/17] sched: prevent interactions with throttled entities (update_cfs_shares)
[patch 12/17] sched: prevent buddy interactions with throttled entities (new)

Previous postings:
-----------------
v7: http://lkml.org/lkml/2011/6/21/43
v6: http://lkml.org/lkml/2011/5/7/37
v5: http://lkml.org/lkml/2011/3/22/477
v4: http://lkml.org/lkml/2011/2/23/44
v3: http://lkml.org/lkml/2010/10/12/44
v2: http://lkml.org/lkml/2010/4/28/88
Original posting: http://lkml.org/lkml/2010/2/12/393
Prior approaches: http://lkml.org/lkml/2010/1/5/44 ["CFS Hard limits v5"]


Thanks,

- Paul



* [patch 01/17] sched: (fixlet) dont update shares twice on on_rq parent
  2011-07-07  5:30 [patch 00/17] CFS Bandwidth Control v7.1 Paul Turner
@ 2011-07-07  5:30 ` Paul Turner
  2011-07-21 18:28   ` [tip:sched/core] sched: Don't " tip-bot for Paul Turner
  2011-07-07  5:30 ` [patch 02/17] sched: hierarchical task accounting for SCHED_OTHER Paul Turner
                   ` (19 subsequent siblings)
  20 siblings, 1 reply; 47+ messages in thread
From: Paul Turner @ 2011-07-07  5:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Hu Tao

[-- Attachment #1: sched-bwc-fix_dequeue_task_buglet.patch --]
[-- Type: text/plain, Size: 898 bytes --]

In dequeue_task_fair() we stop dequeueing when we encounter a parenting entity
with additional weight.  However, we then perform a double shares update on
this entity, because the shares-update traversal continues from this point
even though dequeue_entity() has already updated its queuing cfs_rq.
Avoid this by resuming from the parent entity instead.

Signed-off-by: Paul Turner <pjt@google.com>
---
 kernel/sched_fair.c |    3 +++
 1 file changed, 3 insertions(+)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1370,6 +1370,9 @@ static void dequeue_task_fair(struct rq 
 			 */
 			if (task_sleep && parent_entity(se))
 				set_next_buddy(parent_entity(se));
+
+			/* avoid re-evaluating load for this entity */
+			se = parent_entity(se);
 			break;
 		}
 		flags |= DEQUEUE_SLEEP;




* [patch 02/17] sched: hierarchical task accounting for SCHED_OTHER
  2011-07-07  5:30 [patch 00/17] CFS Bandwidth Control v7.1 Paul Turner
  2011-07-07  5:30 ` [patch 01/17] sched: (fixlet) dont update shares twice on on_rq parent Paul Turner
@ 2011-07-07  5:30 ` Paul Turner
  2011-07-07  5:30 ` [patch 03/17] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Paul Turner @ 2011-07-07  5:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Hu Tao

[-- Attachment #1: sched-bwc-account_nr_running.patch --]
[-- Type: text/plain, Size: 4530 bytes --]

Introduce hierarchical task accounting for the group scheduling case in CFS,
and promote the responsibility for maintaining rq->nr_running to the
scheduling classes.

The primary motivation for this is that with scheduling classes supporting
bandwidth throttling it is possible for entities participating in throttled
sub-trees to have no root-visible change in rq->nr_running across activate
and de-activate operations.  This in turn leads to incorrect idle and
weight-per-task load-balance decisions.

This also allows us to make a small fixlet to the fastpath in pick_next_task()
under group scheduling.

Note: this issue also exists with the existing sched_rt throttling mechanism.
This patch does not address that.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c          |    6 ++----
 kernel/sched_fair.c     |   10 ++++++++--
 kernel/sched_rt.c       |    5 ++++-
 kernel/sched_stoptask.c |    2 ++
 4 files changed, 16 insertions(+), 7 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -308,7 +308,7 @@ struct task_group root_task_group;
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
 	struct load_weight load;
-	unsigned long nr_running;
+	unsigned long nr_running, h_nr_running;
 
 	u64 exec_clock;
 	u64 min_vruntime;
@@ -1830,7 +1830,6 @@ static void activate_task(struct rq *rq,
 		rq->nr_uninterruptible--;
 
 	enqueue_task(rq, p, flags);
-	inc_nr_running(rq);
 }
 
 /*
@@ -1842,7 +1841,6 @@ static void deactivate_task(struct rq *r
 		rq->nr_uninterruptible++;
 
 	dequeue_task(rq, p, flags);
-	dec_nr_running(rq);
 }
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
@@ -4194,7 +4192,7 @@ pick_next_task(struct rq *rq)
 	 * Optimization: we know that if all tasks are in
 	 * the fair class we can call that function directly:
 	 */
-	if (likely(rq->nr_running == rq->cfs.nr_running)) {
+	if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
 		p = fair_sched_class.pick_next_task(rq);
 		if (likely(p))
 			return p;
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1332,16 +1332,19 @@ enqueue_task_fair(struct rq *rq, struct 
 			break;
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
+		cfs_rq->h_nr_running++;
 		flags = ENQUEUE_WAKEUP;
 	}
 
 	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		cfs_rq = cfs_rq_of(se);
+		cfs_rq->h_nr_running++;
 
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
+	inc_nr_running(rq);
 	hrtick_update(rq);
 }
 
@@ -1361,6 +1364,7 @@ static void dequeue_task_fair(struct rq 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);
+		cfs_rq->h_nr_running--;
 
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
@@ -1379,12 +1383,14 @@ static void dequeue_task_fair(struct rq 
 	}
 
 	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		cfs_rq = cfs_rq_of(se);
+		cfs_rq->h_nr_running--;
 
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
+	dec_nr_running(rq);
 	hrtick_update(rq);
 }
 
Index: tip/kernel/sched_rt.c
===================================================================
--- tip.orig/kernel/sched_rt.c
+++ tip/kernel/sched_rt.c
@@ -961,6 +961,8 @@ enqueue_task_rt(struct rq *rq, struct ta
 
 	if (!task_current(rq, p) && p->rt.nr_cpus_allowed > 1)
 		enqueue_pushable_task(rq, p);
+
+	inc_nr_running(rq);
 }
 
 static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
@@ -971,6 +973,8 @@ static void dequeue_task_rt(struct rq *r
 	dequeue_rt_entity(rt_se);
 
 	dequeue_pushable_task(rq, p);
+
+	dec_nr_running(rq);
 }
 
 /*
@@ -1863,4 +1867,3 @@ static void print_rt_stats(struct seq_fi
 	rcu_read_unlock();
 }
 #endif /* CONFIG_SCHED_DEBUG */
-
Index: tip/kernel/sched_stoptask.c
===================================================================
--- tip.orig/kernel/sched_stoptask.c
+++ tip/kernel/sched_stoptask.c
@@ -34,11 +34,13 @@ static struct task_struct *pick_next_tas
 static void
 enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
 {
+	inc_nr_running(rq);
 }
 
 static void
 dequeue_task_stop(struct rq *rq, struct task_struct *p, int flags)
 {
+	dec_nr_running(rq);
 }
 
 static void yield_task_stop(struct rq *rq)




* [patch 03/17] sched: introduce primitives to account for CFS bandwidth tracking
  2011-07-07  5:30 [patch 00/17] CFS Bandwidth Control v7.1 Paul Turner
  2011-07-07  5:30 ` [patch 01/17] sched: (fixlet) dont update shares twice on on_rq parent Paul Turner
  2011-07-07  5:30 ` [patch 02/17] sched: hierarchical task accounting for SCHED_OTHER Paul Turner
@ 2011-07-07  5:30 ` Paul Turner
  2011-07-07 13:48   ` Peter Zijlstra
  2011-07-07  5:30 ` [patch 04/17] sched: validate CFS quota hierarchies Paul Turner
                   ` (17 subsequent siblings)
  20 siblings, 1 reply; 47+ messages in thread
From: Paul Turner @ 2011-07-07  5:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Hu Tao, Nikhil Rao

[-- Attachment #1: sched-bwc-add_cfs_tg_bandwidth.patch --]
[-- Type: text/plain, Size: 9996 bytes --]

In this patch we introduce the notion of CFS bandwidth, partitioned into
globally unassigned bandwidth and locally claimed bandwidth:

- The global bandwidth is per task_group; it represents a pool of unclaimed
  bandwidth that cfs_rqs can allocate from.
- The local bandwidth is tracked per cfs_rq; it represents allotments from
  the global pool that have been assigned to a specific cpu.

Bandwidth is managed via cgroupfs through two new interfaces in the cpu
subsystem (a usage sketch follows below):
- cpu.cfs_period_us : the bandwidth period in usecs
- cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be allowed
  to consume within each such period.
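
For illustration only, here is a minimal userspace sketch of driving these two
files.  The mount point (/sys/fs/cgroup/cpu), the group name ("demo") and the
values are assumptions for the example; the group directory must already
exist, and writing -1 to cpu.cfs_quota_us returns the group to an
unconstrained state.

/* Example only: cap group "demo" to 250ms of cpu time per 500ms period. */
#include <stdio.h>
#include <stdlib.h>

static int write_val(const char *path, long val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%ld\n", val);
	return fclose(f);
}

int main(void)
{
	if (write_val("/sys/fs/cgroup/cpu/demo/cpu.cfs_period_us", 500000) ||
	    write_val("/sys/fs/cgroup/cpu/demo/cpu.cfs_quota_us", 250000)) {
		perror("cfs bandwidth setup");
		return EXIT_FAILURE;
	}
	return 0;
}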

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 init/Kconfig        |   13 +++
 kernel/sched.c      |  194 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched_fair.c |   16 ++++
 3 files changed, 219 insertions(+), 4 deletions(-)

Index: tip/init/Kconfig
===================================================================
--- tip.orig/init/Kconfig
+++ tip/init/Kconfig
@@ -715,6 +715,19 @@ config FAIR_GROUP_SCHED
 	depends on CGROUP_SCHED
 	default CGROUP_SCHED
 
+config CFS_BANDWIDTH
+	bool "CPU bandwidth provisioning for FAIR_GROUP_SCHED"
+	depends on EXPERIMENTAL
+	depends on FAIR_GROUP_SCHED
+	depends on SMP
+	default n
+	help
+	  This option allows users to define CPU bandwidth rates (limits) for
+	  tasks running within the fair group scheduler.  Groups with no limit
+	  set are considered to be unconstrained and will run with no
+	  restriction.
+	  See tip/Documentation/scheduler/sched-bwc.txt for more information.
+
 config RT_GROUP_SCHED
 	bool "Group scheduling for SCHED_RR/FIFO"
 	depends on EXPERIMENTAL
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -244,6 +244,14 @@ struct cfs_rq;
 
 static LIST_HEAD(task_groups);
 
+struct cfs_bandwidth {
+#ifdef CONFIG_CFS_BANDWIDTH
+	raw_spinlock_t lock;
+	ktime_t period;
+	u64 quota;
+#endif
+};
+
 /* task group related information */
 struct task_group {
 	struct cgroup_subsys_state css;
@@ -275,6 +283,8 @@ struct task_group {
 #ifdef CONFIG_SCHED_AUTOGROUP
 	struct autogroup *autogroup;
 #endif
+
+	struct cfs_bandwidth cfs_bandwidth;
 };
 
 /* task_group_lock serializes the addition/removal of task groups */
@@ -374,9 +384,46 @@ struct cfs_rq {
 
 	unsigned long load_contribution;
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	int runtime_enabled;
+	s64 runtime_remaining;
+#endif
 #endif
 };
 
+#ifdef CONFIG_CFS_BANDWIDTH
+static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
+{
+	return &tg->cfs_bandwidth;
+}
+
+static inline u64 default_cfs_period(void);
+
+static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+	raw_spin_lock_init(&cfs_b->lock);
+	cfs_b->quota = RUNTIME_INF;
+	cfs_b->period = ns_to_ktime(default_cfs_period());
+}
+
+static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	cfs_rq->runtime_enabled = 0;
+}
+
+static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{}
+#else
+static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
+static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
+static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
+
+static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
+{
+	return NULL;
+}
+#endif /* CONFIG_CFS_BANDWIDTH */
+
 /* Real-Time classes' related field in a runqueue: */
 struct rt_rq {
 	struct rt_prio_array active;
@@ -7795,6 +7842,7 @@ static void init_tg_cfs_entry(struct tas
 	tg->cfs_rq[cpu] = cfs_rq;
 	init_cfs_rq(cfs_rq, rq);
 	cfs_rq->tg = tg;
+	init_cfs_rq_runtime(cfs_rq);
 
 	tg->se[cpu] = se;
 	/* se could be NULL for root_task_group */
@@ -7930,6 +7978,7 @@ void __init sched_init(void)
 		 * We achieve this by letting root_task_group's tasks sit
 		 * directly in rq->cfs (i.e root_task_group->se[] = NULL).
 		 */
+		init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
 		init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -8171,6 +8220,8 @@ static void free_fair_sched_group(struct
 {
 	int i;
 
+	destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
+
 	for_each_possible_cpu(i) {
 		if (tg->cfs_rq)
 			kfree(tg->cfs_rq[i]);
@@ -8198,6 +8249,8 @@ int alloc_fair_sched_group(struct task_g
 
 	tg->shares = NICE_0_LOAD;
 
+	init_cfs_bandwidth(tg_cfs_bandwidth(tg));
+
 	for_each_possible_cpu(i) {
 		cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
 				      GFP_KERNEL, cpu_to_node(i));
@@ -8569,7 +8622,7 @@ static int __rt_schedulable(struct task_
 	return walk_tg_tree(tg_schedulable, tg_nop, &data);
 }
 
-static int tg_set_bandwidth(struct task_group *tg,
+static int tg_set_rt_bandwidth(struct task_group *tg,
 		u64 rt_period, u64 rt_runtime)
 {
 	int i, err = 0;
@@ -8608,7 +8661,7 @@ int sched_group_set_rt_runtime(struct ta
 	if (rt_runtime_us < 0)
 		rt_runtime = RUNTIME_INF;
 
-	return tg_set_bandwidth(tg, rt_period, rt_runtime);
+	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
 
 long sched_group_rt_runtime(struct task_group *tg)
@@ -8633,7 +8686,7 @@ int sched_group_set_rt_period(struct tas
 	if (rt_period == 0)
 		return -EINVAL;
 
-	return tg_set_bandwidth(tg, rt_period, rt_runtime);
+	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
 
 long sched_group_rt_period(struct task_group *tg)
@@ -8823,6 +8876,128 @@ static u64 cpu_shares_read_u64(struct cg
 
 	return (u64) scale_load_down(tg->shares);
 }
+
+#ifdef CONFIG_CFS_BANDWIDTH
+const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
+const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
+
+static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
+{
+	int i;
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+	static DEFINE_MUTEX(mutex);
+
+	if (tg == &root_task_group)
+		return -EINVAL;
+
+	/*
+	 * Ensure we have at some amount of bandwidth every period.  This is
+	 * to prevent reaching a state of large arrears when throttled via
+	 * entity_tick() resulting in prolonged exit starvation.
+	 */
+	if (quota < min_cfs_quota_period || period < min_cfs_quota_period)
+		return -EINVAL;
+
+	/*
+	 * Likewise, bound things on the other side by preventing insane quota
+	 * periods.  This also allows us to normalize in computing quota
+	 * feasibility.
+	 */
+	if (period > max_cfs_quota_period)
+		return -EINVAL;
+
+	mutex_lock(&mutex);
+	raw_spin_lock_irq(&cfs_b->lock);
+	cfs_b->period = ns_to_ktime(period);
+	cfs_b->quota = quota;
+	raw_spin_unlock_irq(&cfs_b->lock);
+
+	for_each_possible_cpu(i) {
+		struct cfs_rq *cfs_rq = tg->cfs_rq[i];
+		struct rq *rq = rq_of(cfs_rq);
+
+		raw_spin_lock_irq(&rq->lock);
+		cfs_rq->runtime_enabled = quota != RUNTIME_INF;
+		cfs_rq->runtime_remaining = 0;
+		raw_spin_unlock_irq(&rq->lock);
+	}
+	mutex_unlock(&mutex);
+
+	return 0;
+}
+
+int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
+{
+	u64 quota, period;
+
+	period = ktime_to_ns(tg_cfs_bandwidth(tg)->period);
+	if (cfs_quota_us < 0)
+		quota = RUNTIME_INF;
+	else
+		quota = (u64)cfs_quota_us * NSEC_PER_USEC;
+
+	return tg_set_cfs_bandwidth(tg, period, quota);
+}
+
+long tg_get_cfs_quota(struct task_group *tg)
+{
+	u64 quota_us;
+
+	if (tg_cfs_bandwidth(tg)->quota == RUNTIME_INF)
+		return -1;
+
+	quota_us = tg_cfs_bandwidth(tg)->quota;
+	do_div(quota_us, NSEC_PER_USEC);
+
+	return quota_us;
+}
+
+int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
+{
+	u64 quota, period;
+
+	period = (u64)cfs_period_us * NSEC_PER_USEC;
+	quota = tg_cfs_bandwidth(tg)->quota;
+
+	if (period <= 0)
+		return -EINVAL;
+
+	return tg_set_cfs_bandwidth(tg, period, quota);
+}
+
+long tg_get_cfs_period(struct task_group *tg)
+{
+	u64 cfs_period_us;
+
+	cfs_period_us = ktime_to_ns(tg_cfs_bandwidth(tg)->period);
+	do_div(cfs_period_us, NSEC_PER_USEC);
+
+	return cfs_period_us;
+}
+
+static s64 cpu_cfs_quota_read_s64(struct cgroup *cgrp, struct cftype *cft)
+{
+	return tg_get_cfs_quota(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_quota_write_s64(struct cgroup *cgrp, struct cftype *cftype,
+				s64 cfs_quota_us)
+{
+	return tg_set_cfs_quota(cgroup_tg(cgrp), cfs_quota_us);
+}
+
+static u64 cpu_cfs_period_read_u64(struct cgroup *cgrp, struct cftype *cft)
+{
+	return tg_get_cfs_period(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_period_write_u64(struct cgroup *cgrp, struct cftype *cftype,
+				u64 cfs_period_us)
+{
+	return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
+}
+
+#endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -8857,6 +9032,18 @@ static struct cftype cpu_files[] = {
 		.write_u64 = cpu_shares_write_u64,
 	},
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		.name = "cfs_quota_us",
+		.read_s64 = cpu_cfs_quota_read_s64,
+		.write_s64 = cpu_cfs_quota_write_s64,
+	},
+	{
+		.name = "cfs_period_us",
+		.read_u64 = cpu_cfs_period_read_u64,
+		.write_u64 = cpu_cfs_period_write_u64,
+	},
+#endif
 #ifdef CONFIG_RT_GROUP_SCHED
 	{
 		.name = "rt_runtime_us",
@@ -9166,4 +9353,3 @@ struct cgroup_subsys cpuacct_subsys = {
 	.subsys_id = cpuacct_subsys_id,
 };
 #endif	/* CONFIG_CGROUP_CPUACCT */
-
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1256,6 +1256,22 @@ entity_tick(struct cfs_rq *cfs_rq, struc
 		check_preempt_tick(cfs_rq, curr);
 }
 
+
+/**************************************************
+ * CFS bandwidth control machinery
+ */
+
+#ifdef CONFIG_CFS_BANDWIDTH
+/*
+ * default period for cfs group bandwidth.
+ * default: 0.1s, units: nanoseconds
+ */
+static inline u64 default_cfs_period(void)
+{
+	return 100000000ULL;
+}
+#endif
+
 /**************************************************
  * CFS operations on tasks:
  */




* [patch 04/17] sched: validate CFS quota hierarchies
  2011-07-07  5:30 [patch 00/17] CFS Bandwidth Control v7.1 Paul Turner
                   ` (2 preceding siblings ...)
  2011-07-07  5:30 ` [patch 03/17] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
@ 2011-07-07  5:30 ` Paul Turner
  2011-07-07  5:30 ` [patch 05/17] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth Paul Turner
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Paul Turner @ 2011-07-07  5:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Hu Tao

[-- Attachment #1: sched-bwc-consistent_quota.patch --]
[-- Type: text/plain, Size: 5310 bytes --]

Add constraints validation for CFS bandwidth hierarchies.

Validate that:
   max(child bandwidth) <= parent_bandwidth

In a quota-limited hierarchy, an unconstrained entity
(e.g. bandwidth == RUNTIME_INF) inherits the bandwidth of its parent.

This constraint is chosen over sum(child_bandwidth) as the notion of
over-commit is valuable within SCHED_OTHER; for example, two children may
each be granted the full parent bandwidth so long as neither individually
exceeds it.  Some basic code from the RT case is refactored for reuse.

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched.c |  109 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 96 insertions(+), 13 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -249,6 +249,7 @@ struct cfs_bandwidth {
 	raw_spinlock_t lock;
 	ktime_t period;
 	u64 quota;
+	s64 hierarchal_quota;
 #endif
 };
 
@@ -8522,12 +8523,7 @@ unsigned long sched_group_shares(struct 
 }
 #endif
 
-#ifdef CONFIG_RT_GROUP_SCHED
-/*
- * Ensure that the real time constraints are schedulable.
- */
-static DEFINE_MUTEX(rt_constraints_mutex);
-
+#if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_CFS_BANDWIDTH)
 static unsigned long to_ratio(u64 period, u64 runtime)
 {
 	if (runtime == RUNTIME_INF)
@@ -8535,6 +8531,13 @@ static unsigned long to_ratio(u64 period
 
 	return div64_u64(runtime << 20, period);
 }
+#endif
+
+#ifdef CONFIG_RT_GROUP_SCHED
+/*
+ * Ensure that the real time constraints are schedulable.
+ */
+static DEFINE_MUTEX(rt_constraints_mutex);
 
 /* Must be called with tasklist_lock held */
 static inline int tg_has_rt_tasks(struct task_group *tg)
@@ -8555,7 +8558,7 @@ struct rt_schedulable_data {
 	u64 rt_runtime;
 };
 
-static int tg_schedulable(struct task_group *tg, void *data)
+static int tg_rt_schedulable(struct task_group *tg, void *data)
 {
 	struct rt_schedulable_data *d = data;
 	struct task_group *child;
@@ -8619,7 +8622,7 @@ static int __rt_schedulable(struct task_
 		.rt_runtime = runtime,
 	};
 
-	return walk_tg_tree(tg_schedulable, tg_nop, &data);
+	return walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
 }
 
 static int tg_set_rt_bandwidth(struct task_group *tg,
@@ -8878,14 +8881,17 @@ static u64 cpu_shares_read_u64(struct cg
 }
 
 #ifdef CONFIG_CFS_BANDWIDTH
+static DEFINE_MUTEX(cfs_constraints_mutex);
+
 const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
 const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
 
+static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
+
 static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 {
-	int i;
+	int i, ret = 0;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	static DEFINE_MUTEX(mutex);
 
 	if (tg == &root_task_group)
 		return -EINVAL;
@@ -8906,7 +8912,11 @@ static int tg_set_cfs_bandwidth(struct t
 	if (period > max_cfs_quota_period)
 		return -EINVAL;
 
-	mutex_lock(&mutex);
+	mutex_lock(&cfs_constraints_mutex);
+	ret = __cfs_schedulable(tg, period, quota);
+	if (ret)
+		goto out_unlock;
+
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
@@ -8921,9 +8931,10 @@ static int tg_set_cfs_bandwidth(struct t
 		cfs_rq->runtime_remaining = 0;
 		raw_spin_unlock_irq(&rq->lock);
 	}
-	mutex_unlock(&mutex);
+out_unlock:
+	mutex_unlock(&cfs_constraints_mutex);
 
-	return 0;
+	return ret;
 }
 
 int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
@@ -8997,6 +9008,78 @@ static int cpu_cfs_period_write_u64(stru
 	return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
 }
 
+struct cfs_schedulable_data {
+	struct task_group *tg;
+	u64 period, quota;
+};
+
+/*
+ * normalize group quota/period to be quota/max_period
+ * note: units are usecs
+ */
+static u64 normalize_cfs_quota(struct task_group *tg,
+			       struct cfs_schedulable_data *d)
+{
+	u64 quota, period;
+
+	if (tg == d->tg) {
+		period = d->period;
+		quota = d->quota;
+	} else {
+		period = tg_get_cfs_period(tg);
+		quota = tg_get_cfs_quota(tg);
+	}
+
+	/* note: these should typically be equivalent */
+	if (quota == RUNTIME_INF || quota == -1)
+		return RUNTIME_INF;
+
+	return to_ratio(period, quota);
+}
+
+static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
+{
+	struct cfs_schedulable_data *d = data;
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+	s64 quota = 0, parent_quota = -1;
+
+	if (!tg->parent) {
+		quota = RUNTIME_INF;
+	} else {
+		struct cfs_bandwidth *parent_b = tg_cfs_bandwidth(tg->parent);
+
+		quota = normalize_cfs_quota(tg, d);
+		parent_quota = parent_b->hierarchal_quota;
+
+		/*
+		 * ensure max(child_quota) <= parent_quota, inherit when no
+		 * limit is set
+		 */
+		if (quota == RUNTIME_INF)
+			quota = parent_quota;
+		else if (parent_quota != RUNTIME_INF && quota > parent_quota)
+			return -EINVAL;
+	}
+	cfs_b->hierarchal_quota = quota;
+
+	return 0;
+}
+
+static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
+{
+	struct cfs_schedulable_data data = {
+		.tg = tg,
+		.period = period,
+		.quota = quota,
+	};
+
+	if (quota != RUNTIME_INF) {
+		do_div(data.period, NSEC_PER_USEC);
+		do_div(data.quota, NSEC_PER_USEC);
+	}
+
+	return walk_tg_tree(tg_cfs_schedulable_down, tg_nop, &data);
+}
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 




* [patch 05/17] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth
  2011-07-07  5:30 [patch 00/17] CFS Bandwidth Control v7.1 Paul Turner
                   ` (3 preceding siblings ...)
  2011-07-07  5:30 ` [patch 04/17] sched: validate CFS quota hierarchies Paul Turner
@ 2011-07-07  5:30 ` Paul Turner
  2011-07-07  5:30 ` [patch 06/17] sched: add a timer to handle CFS bandwidth refresh Paul Turner
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Paul Turner @ 2011-07-07  5:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Hu Tao, Nikhil Rao

[-- Attachment #1: sched-bwc-account_cfs_rq_runtime.patch --]
[-- Type: text/plain, Size: 6175 bytes --]

Account bandwidth usage at the cfs_rq level, charging it against the
task_group to which each cfs_rq belongs.  Whether we are tracking bandwidth
on a given cfs_rq is maintained under cfs_rq->runtime_enabled.

cfs_rqs which belong to a bandwidth-constrained task_group have their runtime
accounted via the update_curr() path, which withdraws bandwidth from the
global pool as needed.  Updates involving the global pool are currently
protected under cfs_bandwidth->lock; local runtime is protected by rq->lock.

This patch only assigns and tracks quota; no action is yet taken when a
cfs_rq consumes more runtime than it has been assigned.
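
To make the two-level accounting concrete, here is a hedged, userspace-only
toy model.  The 5ms slice matches the default sysctl introduced in this patch
and the 50ms/100ms quota/period is just an example; the names and structure
are illustrative and deliberately ignore expiry, locking and multiple cpus.

/* Toy model: a per-task_group global pool refilled to quota each period,
 * and a local (per-cfs_rq) pool that borrows at most one slice at a time.
 */
#include <stdio.h>

#define NSEC_PER_USEC	1000ULL

static unsigned long long global_runtime;			/* ns left in the tg pool */
static const unsigned long long quota = 50000 * NSEC_PER_USEC;	/* 50ms per period */
static const unsigned long long slice = 5000 * NSEC_PER_USEC;	/* 5ms default slice */

/* period-boundary refresh (the real refill is driven by a timer added
 * later in the series) */
static void refill(void)
{
	global_runtime = quota;
}

/* borrow up to one slice from the global pool for a local pool */
static unsigned long long assign_local(void)
{
	unsigned long long amount = global_runtime < slice ? global_runtime : slice;

	global_runtime -= amount;
	return amount;
}

int main(void)
{
	unsigned long long local = 0, ran = 0;

	refill();
	/* one busy cfs_rq charges 1ms of execution at a time (cf. update_curr) */
	while (1) {
		if (local == 0)
			local = assign_local();
		if (local == 0)
			break;	/* out of quota: later patches throttle here */
		local -= 1000 * NSEC_PER_USEC;
		ran += 1000 * NSEC_PER_USEC;
	}
	printf("ran %llu us of a %llu us quota\n",
	       ran / NSEC_PER_USEC, quota / NSEC_PER_USEC);
	return 0;
}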

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>


---
 include/linux/sched.h |    4 ++
 kernel/sched.c        |    4 ++
 kernel/sched_fair.c   |   69 ++++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sysctl.c       |   10 +++++++
 4 files changed, 84 insertions(+), 3 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -89,6 +89,20 @@ const_debug unsigned int sysctl_sched_mi
  */
 unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
 
+#ifdef CONFIG_CFS_BANDWIDTH
+/*
+ * Amount of runtime to allocate from global (tg) to local (per-cfs_rq) pool
+ * each time a cfs_rq requests quota.
+ *
+ * Note: in the case that the slice exceeds the runtime remaining (either due
+ * to consumption or the quota being specified to be smaller than the slice)
+ * we will always only issue the remaining available time.
+ *
+ * default: 5 msec, units: microseconds
+  */
+unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
+#endif
+
 static const struct sched_class fair_sched_class;
 
 /**************************************************************
@@ -305,6 +319,8 @@ find_matching_se(struct sched_entity **s
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+				   unsigned long delta_exec);
 
 /**************************************************************
  * Scheduling class tree data structure manipulation methods:
@@ -602,6 +618,8 @@ static void update_curr(struct cfs_rq *c
 		cpuacct_charge(curtask, delta_exec);
 		account_group_exec_runtime(curtask, delta_exec);
 	}
+
+	account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
 
 static inline void
@@ -1270,6 +1288,48 @@ static inline u64 default_cfs_period(voi
 {
 	return 100000000ULL;
 }
+
+static inline u64 sched_cfs_bandwidth_slice(void)
+{
+	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
+}
+
+static void assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	struct task_group *tg = cfs_rq->tg;
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+	u64 amount = 0, min_amount;
+
+	/* note: this is a positive sum as runtime_remaining <= 0 */
+	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
+
+	raw_spin_lock(&cfs_b->lock);
+	if (cfs_b->quota == RUNTIME_INF)
+		amount = min_amount;
+	else if (cfs_b->runtime > 0) {
+		amount = min(cfs_b->runtime, min_amount);
+		cfs_b->runtime -= amount;
+	}
+	raw_spin_unlock(&cfs_b->lock);
+
+	cfs_rq->runtime_remaining += amount;
+}
+
+static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+		unsigned long delta_exec)
+{
+	if (!cfs_rq->runtime_enabled)
+		return;
+
+	cfs_rq->runtime_remaining -= delta_exec;
+	if (cfs_rq->runtime_remaining > 0)
+		return;
+
+	assign_cfs_rq_runtime(cfs_rq);
+}
+#else
+static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+		unsigned long delta_exec) {}
 #endif
 
 /**************************************************
@@ -4262,8 +4322,13 @@ static void set_curr_task_fair(struct rq
 {
 	struct sched_entity *se = &rq->curr->se;
 
-	for_each_sched_entity(se)
-		set_next_entity(cfs_rq_of(se), se);
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		set_next_entity(cfs_rq, se);
+		/* ensure bandwidth has been allocated on our new cfs_rq */
+		account_cfs_rq_runtime(cfs_rq, 0);
+	}
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
Index: tip/kernel/sysctl.c
===================================================================
--- tip.orig/kernel/sysctl.c
+++ tip/kernel/sysctl.c
@@ -379,6 +379,16 @@ static struct ctl_table kern_table[] = {
 		.extra2		= &one,
 	},
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		.procname	= "sched_cfs_bandwidth_slice_us",
+		.data		= &sysctl_sched_cfs_bandwidth_slice,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &one,
+	},
+#endif
 #ifdef CONFIG_PROVE_LOCKING
 	{
 		.procname	= "prove_locking",
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -248,7 +248,7 @@ struct cfs_bandwidth {
 #ifdef CONFIG_CFS_BANDWIDTH
 	raw_spinlock_t lock;
 	ktime_t period;
-	u64 quota;
+	u64 quota, runtime;
 	s64 hierarchal_quota;
 #endif
 };
@@ -403,6 +403,7 @@ static inline u64 default_cfs_period(voi
 static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
 	raw_spin_lock_init(&cfs_b->lock);
+	cfs_b->runtime = 0;
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 }
@@ -8920,6 +8921,7 @@ static int tg_set_cfs_bandwidth(struct t
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
+	cfs_b->runtime = quota;
 	raw_spin_unlock_irq(&cfs_b->lock);
 
 	for_each_possible_cpu(i) {
Index: tip/include/linux/sched.h
===================================================================
--- tip.orig/include/linux/sched.h
+++ tip/include/linux/sched.h
@@ -2012,6 +2012,10 @@ static inline void sched_autogroup_fork(
 static inline void sched_autogroup_exit(struct signal_struct *sig) { }
 #endif
 
+#ifdef CONFIG_CFS_BANDWIDTH
+extern unsigned int sysctl_sched_cfs_bandwidth_slice;
+#endif
+
 #ifdef CONFIG_RT_MUTEXES
 extern int rt_mutex_getprio(struct task_struct *p);
 extern void rt_mutex_setprio(struct task_struct *p, int prio);




* [patch 06/17] sched: add a timer to handle CFS bandwidth refresh
  2011-07-07  5:30 [patch 00/17] CFS Bandwidth Control v7.1 Paul Turner
                   ` (4 preceding siblings ...)
  2011-07-07  5:30 ` [patch 05/17] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth Paul Turner
@ 2011-07-07  5:30 ` Paul Turner
  2011-07-07  5:30 ` [patch 07/17] sched: expire invalid runtime Paul Turner
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Paul Turner @ 2011-07-07  5:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Hu Tao

[-- Attachment #1: sched-bwc-bandwidth_timers.patch --]
[-- Type: text/plain, Size: 7502 bytes --]

This patch adds a per-task_group timer which handles the refresh of the global
CFS bandwidth pool.

Since the RT pool is using a similar timer there's some small refactoring to
share this support.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c      |  107 +++++++++++++++++++++++++++++++++++++++++-----------
 kernel/sched_fair.c |   41 ++++++++++++++++++-
 2 files changed, 124 insertions(+), 24 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -193,10 +193,28 @@ static inline int rt_bandwidth_enabled(v
 	return sysctl_sched_rt_runtime >= 0;
 }
 
-static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
+static void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
 {
-	ktime_t now;
+	unsigned long delta;
+	ktime_t soft, hard, now;
+
+	for (;;) {
+		if (hrtimer_active(period_timer))
+			break;
+
+		now = hrtimer_cb_get_time(period_timer);
+		hrtimer_forward(period_timer, now, period);
 
+		soft = hrtimer_get_softexpires(period_timer);
+		hard = hrtimer_get_expires(period_timer);
+		delta = ktime_to_ns(ktime_sub(hard, soft));
+		__hrtimer_start_range_ns(period_timer, soft, delta,
+					 HRTIMER_MODE_ABS_PINNED, 0);
+	}
+}
+
+static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
+{
 	if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
 		return;
 
@@ -204,22 +222,7 @@ static void start_rt_bandwidth(struct rt
 		return;
 
 	raw_spin_lock(&rt_b->rt_runtime_lock);
-	for (;;) {
-		unsigned long delta;
-		ktime_t soft, hard;
-
-		if (hrtimer_active(&rt_b->rt_period_timer))
-			break;
-
-		now = hrtimer_cb_get_time(&rt_b->rt_period_timer);
-		hrtimer_forward(&rt_b->rt_period_timer, now, rt_b->rt_period);
-
-		soft = hrtimer_get_softexpires(&rt_b->rt_period_timer);
-		hard = hrtimer_get_expires(&rt_b->rt_period_timer);
-		delta = ktime_to_ns(ktime_sub(hard, soft));
-		__hrtimer_start_range_ns(&rt_b->rt_period_timer, soft, delta,
-				HRTIMER_MODE_ABS_PINNED, 0);
-	}
+	start_bandwidth_timer(&rt_b->rt_period_timer, rt_b->rt_period);
 	raw_spin_unlock(&rt_b->rt_runtime_lock);
 }
 
@@ -250,6 +253,9 @@ struct cfs_bandwidth {
 	ktime_t period;
 	u64 quota, runtime;
 	s64 hierarchal_quota;
+
+	int idle, timer_active;
+	struct hrtimer period_timer;
 #endif
 };
 
@@ -399,6 +405,28 @@ static inline struct cfs_bandwidth *tg_c
 }
 
 static inline u64 default_cfs_period(void);
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
+
+static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
+{
+	struct cfs_bandwidth *cfs_b =
+		container_of(timer, struct cfs_bandwidth, period_timer);
+	ktime_t now;
+	int overrun;
+	int idle = 0;
+
+	for (;;) {
+		now = hrtimer_cb_get_time(timer);
+		overrun = hrtimer_forward(timer, now, cfs_b->period);
+
+		if (!overrun)
+			break;
+
+		idle = do_sched_cfs_period_timer(cfs_b, overrun);
+	}
+
+	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
+}
 
 static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
@@ -406,6 +434,9 @@ static void init_cfs_bandwidth(struct cf
 	cfs_b->runtime = 0;
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
+
+	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	cfs_b->period_timer.function = sched_cfs_period_timer;
 }
 
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
@@ -413,8 +444,34 @@ static void init_cfs_rq_runtime(struct c
 	cfs_rq->runtime_enabled = 0;
 }
 
+/* requires cfs_b->lock, may release to reprogram timer */
+static void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+	/*
+	 * The timer may be active because we're trying to set a new bandwidth
+	 * period or because we're racing with the tear-down path
+	 * (timer_active==0 becomes visible before the hrtimer call-back
+	 * terminates).  In either case we ensure that it's re-programmed
+	 */
+	while (unlikely(hrtimer_active(&cfs_b->period_timer))) {
+		raw_spin_unlock(&cfs_b->lock);
+		/* ensure cfs_b->lock is available while we wait */
+		hrtimer_cancel(&cfs_b->period_timer);
+
+		raw_spin_lock(&cfs_b->lock);
+		/* if someone else restarted the timer then we're done */
+		if (cfs_b->timer_active)
+			return;
+	}
+
+	cfs_b->timer_active = 1;
+	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
+}
+
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
-{}
+{
+	hrtimer_cancel(&cfs_b->period_timer);
+}
 #else
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
@@ -8891,7 +8948,7 @@ static int __cfs_schedulable(struct task
 
 static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 {
-	int i, ret = 0;
+	int i, ret = 0, runtime_enabled;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
 
 	if (tg == &root_task_group)
@@ -8918,10 +8975,18 @@ static int tg_set_cfs_bandwidth(struct t
 	if (ret)
 		goto out_unlock;
 
+	runtime_enabled = quota != RUNTIME_INF;
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
 	cfs_b->runtime = quota;
+
+	/* restart the period timer (if active) to handle new period expiry */
+	if (runtime_enabled && cfs_b->timer_active) {
+		/* force a reprogram */
+		cfs_b->timer_active = 0;
+		__start_cfs_bandwidth(cfs_b);
+	}
 	raw_spin_unlock_irq(&cfs_b->lock);
 
 	for_each_possible_cpu(i) {
@@ -8929,7 +8994,7 @@ static int tg_set_cfs_bandwidth(struct t
 		struct rq *rq = rq_of(cfs_rq);
 
 		raw_spin_lock_irq(&rq->lock);
-		cfs_rq->runtime_enabled = quota != RUNTIME_INF;
+		cfs_rq->runtime_enabled = runtime_enabled;
 		cfs_rq->runtime_remaining = 0;
 		raw_spin_unlock_irq(&rq->lock);
 	}
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1306,9 +1306,16 @@ static void assign_cfs_rq_runtime(struct
 	raw_spin_lock(&cfs_b->lock);
 	if (cfs_b->quota == RUNTIME_INF)
 		amount = min_amount;
-	else if (cfs_b->runtime > 0) {
-		amount = min(cfs_b->runtime, min_amount);
-		cfs_b->runtime -= amount;
+	else {
+		/* ensure bandwidth timer remains active under consumption */
+		if (!cfs_b->timer_active)
+			__start_cfs_bandwidth(cfs_b);
+
+		if (cfs_b->runtime > 0) {
+			amount = min(cfs_b->runtime, min_amount);
+			cfs_b->runtime -= amount;
+			cfs_b->idle = 0;
+		}
 	}
 	raw_spin_unlock(&cfs_b->lock);
 
@@ -1327,6 +1334,34 @@ static void account_cfs_rq_runtime(struc
 
 	assign_cfs_rq_runtime(cfs_rq);
 }
+
+/*
+ * Responsible for refilling a task_group's bandwidth and unthrottling its
+ * cfs_rqs as appropriate. If there has been no activity within the last
+ * period the timer is deactivated until scheduling resumes; cfs_b->idle is
+ * used to track this state.
+ */
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
+{
+	int idle = 1;
+
+	raw_spin_lock(&cfs_b->lock);
+	/* no need to continue the timer with no bandwidth constraint */
+	if (cfs_b->quota == RUNTIME_INF)
+		goto out_unlock;
+
+	idle = cfs_b->idle;
+	cfs_b->runtime = cfs_b->quota;
+
+	/* mark as potentially idle for the upcoming period */
+	cfs_b->idle = 1;
+out_unlock:
+	if (idle)
+		cfs_b->timer_active = 0;
+	raw_spin_unlock(&cfs_b->lock);
+
+	return idle;
+}
 #else
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 		unsigned long delta_exec) {}




* [patch 07/17] sched: expire invalid runtime
  2011-07-07  5:30 [patch 00/17] CFS Bandwidth Control v7.1 Paul Turner
                   ` (5 preceding siblings ...)
  2011-07-07  5:30 ` [patch 06/17] sched: add a timer to handle CFS bandwidth refresh Paul Turner
@ 2011-07-07  5:30 ` Paul Turner
  2011-07-07  5:30 ` [patch 08/17] sched: add support for throttling group entities Paul Turner
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Paul Turner @ 2011-07-07  5:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Hu Tao

[-- Attachment #1: sched-bwc-expire_cfs_rq_runtime.patch --]
[-- Type: text/plain, Size: 6079 bytes --]

Since quota is managed using global state but consumed on a per-cpu basis,
we need to ensure that our per-cpu state is appropriately synchronized.
Most importantly, runtime that is stale (from a previous period) should not be
locally consumable.

We take advantage of the existing sched_clock synchronization about the jiffy
to efficiently detect whether we have (globally) crossed a quota boundary
above.

One catch is that the direction of spread on sched_clock is undefined;
specifically, we don't know whether our local clock is behind or ahead
of the one responsible for the current expiration time.

Fortunately we can differentiate these cases by considering whether the
global deadline has advanced.  If it has not, then we assume our clock to be
"fast" and advance our local expiration; otherwise, we know the deadline has
truly passed and we expire our local runtime.
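
The decision above can be condensed into a small standalone sketch (the names
and sample timestamps are assumptions for illustration; the real logic is in
expire_cfs_rq_runtime() in the patch below):

/* All values are nanosecond timestamps; tick_nsec bounds sched_clock
 * drift between cpus.  Not the kernel implementation.
 */
#include <stdio.h>

struct local_pool {
	long long runtime_remaining;
	long long runtime_expires;
};

static void expire_local_runtime(struct local_pool *p, long long local_clock,
				 long long global_expires, long long tick_nsec)
{
	/* local deadline still ahead of our clock: nothing to do */
	if (local_clock - p->runtime_expires < 0)
		return;

	if (p->runtime_expires - global_expires >= 0) {
		/* global deadline has not advanced: our clock is just "fast" */
		p->runtime_expires += tick_nsec;
	} else {
		/* the period truly rolled over: stale runtime is discarded */
		p->runtime_remaining = 0;
	}
}

int main(void)
{
	struct local_pool p = { .runtime_remaining = 3000000,
				.runtime_expires = 100000000 };

	/* local clock past our deadline, but global deadline unchanged */
	expire_local_runtime(&p, 100500000, 100000000, 1000000);
	printf("fast clock:  remaining=%lld\n", p.runtime_remaining);

	/* global deadline has moved on: runtime from the old period expires */
	expire_local_runtime(&p, 101500000, 200000000, 1000000);
	printf("true expiry: remaining=%lld\n", p.runtime_remaining);
	return 0;
}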

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c      |    4 +-
 kernel/sched_fair.c |   83 +++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 82 insertions(+), 5 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1294,11 +1294,30 @@ static inline u64 sched_cfs_bandwidth_sl
 	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
 }
 
+/*
+ * Replenish runtime according to assigned quota and update expiration time.
+ * We use sched_clock_cpu directly instead of rq->clock to avoid adding
+ * additional synchronization around rq->lock.
+ *
+ * requires cfs_b->lock
+ */
+static void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
+{
+	u64 now;
+
+	if (cfs_b->quota == RUNTIME_INF)
+		return;
+
+	now = sched_clock_cpu(smp_processor_id());
+	cfs_b->runtime = cfs_b->quota;
+	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
+}
+
 static void assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	u64 amount = 0, min_amount;
+	u64 amount = 0, min_amount, expires;
 
 	/* note: this is a positive sum as runtime_remaining <= 0 */
 	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -1307,9 +1326,16 @@ static void assign_cfs_rq_runtime(struct
 	if (cfs_b->quota == RUNTIME_INF)
 		amount = min_amount;
 	else {
-		/* ensure bandwidth timer remains active under consumption */
-		if (!cfs_b->timer_active)
+		/*
+		 * If the bandwidth pool has become inactive, then at least one
+		 * period must have elapsed since the last consumption.
+		 * Refresh the global state and ensure bandwidth timer becomes
+		 * active.
+		 */
+		if (!cfs_b->timer_active) {
+			__refill_cfs_bandwidth_runtime(cfs_b);
 			__start_cfs_bandwidth(cfs_b);
+		}
 
 		if (cfs_b->runtime > 0) {
 			amount = min(cfs_b->runtime, min_amount);
@@ -1317,9 +1343,51 @@ static void assign_cfs_rq_runtime(struct
 			cfs_b->idle = 0;
 		}
 	}
+	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
 
 	cfs_rq->runtime_remaining += amount;
+	/*
+	 * we may have advanced our local expiration to account for allowed
+	 * spread between our sched_clock and the one on which runtime was
+	 * issued.
+	 */
+	if ((s64)(expires - cfs_rq->runtime_expires) > 0)
+		cfs_rq->runtime_expires = expires;
+}
+
+/*
+ * Note: This depends on the synchronization provided by sched_clock and the
+ * fact that rq->clock snapshots this value.
+ */
+static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	struct rq *rq = rq_of(cfs_rq);
+
+	if (cfs_rq->runtime_remaining < 0)
+		return;
+
+	/* if the deadline is ahead of our clock, nothing to do */
+	if ((s64)(rq->clock - cfs_rq->runtime_expires) < 0)
+		return;
+
+	/*
+	 * If the local deadline has passed we have to consider the
+	 * possibility that our sched_clock is 'fast' and the global deadline
+	 * has not truly expired.
+	 *
+	 * Fortunately we can determine whether this is the case by checking
+	 * whether the global deadline has advanced.
+	 */
+
+	if ((s64)(cfs_rq->runtime_expires - cfs_b->runtime_expires) >= 0) {
+		/* extend local deadline, drift is bounded above by 2 ticks */
+		cfs_rq->runtime_expires += TICK_NSEC;
+	} else {
+		/* global deadline is ahead, expiration has passed */
+		cfs_rq->runtime_remaining = 0;
+	}
 }
 
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
@@ -1328,7 +1396,10 @@ static void account_cfs_rq_runtime(struc
 	if (!cfs_rq->runtime_enabled)
 		return;
 
+	/* dock delta_exec before expiring quota (as it could span periods) */
 	cfs_rq->runtime_remaining -= delta_exec;
+	expire_cfs_rq_runtime(cfs_rq);
+
 	if (cfs_rq->runtime_remaining > 0)
 		return;
 
@@ -1351,7 +1422,11 @@ static int do_sched_cfs_period_timer(str
 		goto out_unlock;
 
 	idle = cfs_b->idle;
-	cfs_b->runtime = cfs_b->quota;
+	/* if we're going inactive then everything else can be deferred */
+	if (idle)
+		goto out_unlock;
+
+	__refill_cfs_bandwidth_runtime(cfs_b);
 
 	/* mark as potentially idle for the upcoming period */
 	cfs_b->idle = 1;
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -253,6 +253,7 @@ struct cfs_bandwidth {
 	ktime_t period;
 	u64 quota, runtime;
 	s64 hierarchal_quota;
+	u64 runtime_expires;
 
 	int idle, timer_active;
 	struct hrtimer period_timer;
@@ -393,6 +394,7 @@ struct cfs_rq {
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	int runtime_enabled;
+	u64 runtime_expires;
 	s64 runtime_remaining;
 #endif
 #endif
@@ -8979,8 +8981,8 @@ static int tg_set_cfs_bandwidth(struct t
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
-	cfs_b->runtime = quota;
 
+	__refill_cfs_bandwidth_runtime(cfs_b);
 	/* restart the period timer (if active) to handle new period expiry */
 	if (runtime_enabled && cfs_b->timer_active) {
 		/* force a reprogram */




* [patch 08/17] sched: add support for throttling group entities
  2011-07-07  5:30 [patch 00/17] CFS Bandwidth Control v7.1 Paul Turner
                   ` (6 preceding siblings ...)
  2011-07-07  5:30 ` [patch 07/17] sched: expire invalid runtime Paul Turner
@ 2011-07-07  5:30 ` Paul Turner
  2011-07-07  5:30 ` [patch 09/17] sched: add support for unthrottling " Paul Turner
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Paul Turner @ 2011-07-07  5:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Hu Tao

[-- Attachment #1: sched-bwc-throttle_entities.patch --]
[-- Type: text/plain, Size: 5863 bytes --]

Now that consumption is tracked (via update_curr()) we add support to throttle
group entities (and their corresponding cfs_rqs) in the case where there is no
run-time remaining.

Throttled entities are dequeued to prevent scheduling; additionally we mark
them as throttled (using cfs_rq->throttled) to prevent them from becoming
re-enqueued until they are unthrottled.  A list of a task_group's throttled
entities is maintained on the cfs_bandwidth structure.

Note: While the machinery for throttling is added in this patch the act of
throttling an entity exceeding its bandwidth is deferred until later within
the series.

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched.c      |    7 ++++
 kernel/sched_fair.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 92 insertions(+), 4 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1313,7 +1313,8 @@ static void __refill_cfs_bandwidth_runti
 	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
 }
 
-static void assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+/* returns 0 on failure to allocate runtime */
+static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
@@ -1354,6 +1355,8 @@ static void assign_cfs_rq_runtime(struct
 	 */
 	if ((s64)(expires - cfs_rq->runtime_expires) > 0)
 		cfs_rq->runtime_expires = expires;
+
+	return cfs_rq->runtime_remaining > 0;
 }
 
 /*
@@ -1403,7 +1406,53 @@ static void account_cfs_rq_runtime(struc
 	if (cfs_rq->runtime_remaining > 0)
 		return;
 
-	assign_cfs_rq_runtime(cfs_rq);
+	/*
+	 * if we're unable to extend our runtime we resched so that the active
+	 * hierarchy can be throttled
+	 */
+	if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
+		resched_task(rq_of(cfs_rq)->curr);
+}
+
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->throttled;
+}
+
+static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	struct rq *rq = rq_of(cfs_rq);
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	struct sched_entity *se;
+	long task_delta, dequeue = 1;
+
+	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+
+	/* account load preceding throttle */
+	update_cfs_load(cfs_rq, 0);
+
+	task_delta = cfs_rq->h_nr_running;
+	for_each_sched_entity(se) {
+		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
+		/* throttled entity or throttle-on-deactivate */
+		if (!se->on_rq)
+			break;
+
+		if (dequeue)
+			dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
+		qcfs_rq->h_nr_running -= task_delta;
+
+		if (qcfs_rq->load.weight)
+			dequeue = 0;
+	}
+
+	if (!se)
+		rq->nr_running -= task_delta;
+
+	cfs_rq->throttled = 1;
+	raw_spin_lock(&cfs_b->lock);
+	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
+	raw_spin_unlock(&cfs_b->lock);
 }
 
 /*
@@ -1440,6 +1489,11 @@ out_unlock:
 #else
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 		unsigned long delta_exec) {}
+
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+	return 0;
+}
 #endif
 
 /**************************************************
@@ -1518,7 +1572,17 @@ enqueue_task_fair(struct rq *rq, struct 
 			break;
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
+
+		/*
+		 * end evaluation on encountering a throttled cfs_rq
+		 *
+		 * note: in the case of encountering a throttled cfs_rq we will
+		 * post the final h_nr_running increment below.
+		*/
+		if (cfs_rq_throttled(cfs_rq))
+			break;
 		cfs_rq->h_nr_running++;
+
 		flags = ENQUEUE_WAKEUP;
 	}
 
@@ -1526,11 +1590,15 @@ enqueue_task_fair(struct rq *rq, struct 
 		cfs_rq = cfs_rq_of(se);
 		cfs_rq->h_nr_running++;
 
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
-	inc_nr_running(rq);
+	if (!se)
+		inc_nr_running(rq);
 	hrtick_update(rq);
 }
 
@@ -1550,6 +1618,15 @@ static void dequeue_task_fair(struct rq 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);
+
+		/*
+		 * end evaluation on encountering a throttled cfs_rq
+		 *
+		 * note: in the case of encountering a throttled cfs_rq we will
+		 * post the final h_nr_running decrement below.
+		*/
+		if (cfs_rq_throttled(cfs_rq))
+			break;
 		cfs_rq->h_nr_running--;
 
 		/* Don't dequeue parent if it has other entities besides us */
@@ -1572,11 +1649,15 @@ static void dequeue_task_fair(struct rq 
 		cfs_rq = cfs_rq_of(se);
 		cfs_rq->h_nr_running--;
 
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
-	dec_nr_running(rq);
+	if (!se)
+		dec_nr_running(rq);
 	hrtick_update(rq);
 }
 
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -257,6 +257,8 @@ struct cfs_bandwidth {
 
 	int idle, timer_active;
 	struct hrtimer period_timer;
+	struct list_head throttled_cfs_rq;
+
 #endif
 };
 
@@ -396,6 +398,9 @@ struct cfs_rq {
 	int runtime_enabled;
 	u64 runtime_expires;
 	s64 runtime_remaining;
+
+	int throttled;
+	struct list_head throttled_list;
 #endif
 #endif
 };
@@ -437,6 +442,7 @@ static void init_cfs_bandwidth(struct cf
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 
+	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
 }
@@ -444,6 +450,7 @@ static void init_cfs_bandwidth(struct cf
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	cfs_rq->runtime_enabled = 0;
+	INIT_LIST_HEAD(&cfs_rq->throttled_list);
 }
 
 /* requires cfs_b->lock, may release to reprogram timer */



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [patch 09/17] sched: add support for unthrottling group entities
  2011-07-07  5:30 [patch 00/17] CFS Bandwidth Control v7.1 Paul Turner
                   ` (7 preceding siblings ...)
  2011-07-07  5:30 ` [patch 08/17] sched: add support for throttling group entities Paul Turner
@ 2011-07-07  5:30 ` Paul Turner
  2011-07-07  5:30 ` [patch 10/17] sched: allow for positional tg_tree walks Paul Turner
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Paul Turner @ 2011-07-07  5:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Hu Tao

[-- Attachment #1: sched-bwc-unthrottle_entities.patch --]
[-- Type: text/plain, Size: 5369 bytes --]

At the start of each period we refresh the global bandwidth pool.  At this time
we must also unthrottle any cfs_rq entities that are once again within their
bandwidth (as quota permits).

Unthrottled entities have their corresponding cfs_rq->throttled flag cleared
and their entities re-enqueued.
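
For reference, a minimal standalone C sketch of the distribution step described
above (the toy_* names are invented and this is plain userspace C, not the
kernel code in the patch): each throttled queue receives just enough runtime to
clear its local deficit, bounded by whatever remains in the global pool.

#include <stdio.h>

struct toy_cfs_rq {
    long long runtime_remaining;    /* <= 0 while throttled */
    int throttled;
};

/* hand out 'remaining' units of quota; return whatever is left over */
static long long toy_distribute(struct toy_cfs_rq *rqs, int nr,
                                long long remaining)
{
    int i;

    for (i = 0; i < nr && remaining > 0; i++) {
        long long amount;

        if (!rqs[i].throttled)
            continue;

        /* exactly enough to push runtime_remaining above zero */
        amount = -rqs[i].runtime_remaining + 1;
        if (amount > remaining)
            amount = remaining;
        remaining -= amount;

        rqs[i].runtime_remaining += amount;
        if (rqs[i].runtime_remaining > 0)
            rqs[i].throttled = 0;    /* unthrottle */
    }
    return remaining;
}

int main(void)
{
    struct toy_cfs_rq rqs[3] = { { -3, 1 }, { -10, 1 }, { -1, 1 } };
    long long left = toy_distribute(rqs, 3, 12);

    printf("pool left %lld, middle rq still throttled: %d\n",
           left, rqs[1].throttled);
    return 0;
}

The patch below performs the same arithmetic per throttled cfs_rq under
rq->lock, dropping cfs_b->lock around the walk.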

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c      |    3 +
 kernel/sched_fair.c |  128 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 127 insertions(+), 4 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -9005,6 +9005,9 @@ static int tg_set_cfs_bandwidth(struct t
 		raw_spin_lock_irq(&rq->lock);
 		cfs_rq->runtime_enabled = runtime_enabled;
 		cfs_rq->runtime_remaining = 0;
+
+		if (cfs_rq_throttled(cfs_rq))
+			unthrottle_cfs_rq(cfs_rq);
 		raw_spin_unlock_irq(&rq->lock);
 	}
 out_unlock:
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1455,6 +1455,84 @@ static void throttle_cfs_rq(struct cfs_r
 	raw_spin_unlock(&cfs_b->lock);
 }
 
+static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	struct rq *rq = rq_of(cfs_rq);
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	struct sched_entity *se;
+	int enqueue = 1;
+	long task_delta;
+
+	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+
+	cfs_rq->throttled = 0;
+	raw_spin_lock(&cfs_b->lock);
+	list_del_rcu(&cfs_rq->throttled_list);
+	raw_spin_unlock(&cfs_b->lock);
+
+	if (!cfs_rq->load.weight)
+		return;
+
+	task_delta = cfs_rq->h_nr_running;
+	for_each_sched_entity(se) {
+		if (se->on_rq)
+			enqueue = 0;
+
+		cfs_rq = cfs_rq_of(se);
+		if (enqueue)
+			enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
+		cfs_rq->h_nr_running += task_delta;
+
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+	}
+
+	if (!se)
+		rq->nr_running += task_delta;
+
+	/* determine whether we need to wake up potentially idle cpu */
+	if (rq->curr == rq->idle && rq->cfs.nr_running)
+		resched_task(rq->curr);
+}
+
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
+		u64 remaining, u64 expires)
+{
+	struct cfs_rq *cfs_rq;
+	u64 runtime = remaining;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(cfs_rq, &cfs_b->throttled_cfs_rq,
+				throttled_list) {
+		struct rq *rq = rq_of(cfs_rq);
+
+		raw_spin_lock(&rq->lock);
+		if (!cfs_rq_throttled(cfs_rq))
+			goto next;
+
+		runtime = -cfs_rq->runtime_remaining + 1;
+		if (runtime > remaining)
+			runtime = remaining;
+		remaining -= runtime;
+
+		cfs_rq->runtime_remaining += runtime;
+		cfs_rq->runtime_expires = expires;
+
+		/* we check whether we're throttled above */
+		if (cfs_rq->runtime_remaining > 0)
+			unthrottle_cfs_rq(cfs_rq);
+
+next:
+		raw_spin_unlock(&rq->lock);
+
+		if (!remaining)
+			break;
+	}
+	rcu_read_unlock();
+
+	return remaining;
+}
+
 /*
  * Responsible for refilling a task_group's bandwidth and unthrottling its
  * cfs_rqs as appropriate. If there has been no activity within the last
@@ -1463,22 +1541,64 @@ static void throttle_cfs_rq(struct cfs_r
  */
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 {
-	int idle = 1;
+	u64 runtime, runtime_expires;
+	int idle = 1, throttled;
 
 	raw_spin_lock(&cfs_b->lock);
 	/* no need to continue the timer with no bandwidth constraint */
 	if (cfs_b->quota == RUNTIME_INF)
 		goto out_unlock;
 
-	idle = cfs_b->idle;
+	throttled = !list_empty(&cfs_b->throttled_cfs_rq);
+	/* idle depends on !throttled (for the case of a large deficit) */
+	idle = cfs_b->idle && !throttled;
+
 	/* if we're going inactive then everything else can be deferred */
 	if (idle)
 		goto out_unlock;
 
 	__refill_cfs_bandwidth_runtime(cfs_b);
 
-	/* mark as potentially idle for the upcoming period */
-	cfs_b->idle = 1;
+	if (!throttled) {
+		/* mark as potentially idle for the upcoming period */
+		cfs_b->idle = 1;
+		goto out_unlock;
+	}
+
+	/*
+	 * There are throttled entities so we must first use the new bandwidth
+	 * to unthrottle them before making it generally available.  This
+	 * ensures that all existing debts will be paid before a new cfs_rq is
+	 * allowed to run.
+	 */
+	runtime = cfs_b->runtime;
+	runtime_expires = cfs_b->runtime_expires;
+	cfs_b->runtime = 0;
+
+	/*
+	 * This check is repeated as we are holding onto the new bandwidth
+	 * while we unthrottle.  This can potentially race with an unthrottled
+	 * group trying to acquire new bandwidth from the global pool.
+	 */
+	while (throttled && runtime > 0) {
+		raw_spin_unlock(&cfs_b->lock);
+		/* we can't nest cfs_b->lock while distributing bandwidth */
+		runtime = distribute_cfs_runtime(cfs_b, runtime,
+						 runtime_expires);
+		raw_spin_lock(&cfs_b->lock);
+
+		throttled = !list_empty(&cfs_b->throttled_cfs_rq);
+	}
+
+	/* return (any) remaining runtime */
+	cfs_b->runtime = runtime;
+	/*
+	 * While we are ensured activity in the period following an
+	 * unthrottle, this also covers the case in which the new bandwidth is
+	 * insufficient to cover the existing bandwidth deficit.  (Forcing the
+	 * timer to remain active while there are any throttled entities.)
+	 */
+	cfs_b->idle = 0;
 out_unlock:
 	if (idle)
 		cfs_b->timer_active = 0;



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [patch 10/17] sched: allow for positional tg_tree walks
  2011-07-07  5:30 [patch 00/17] CFS Bandwidth Control v7.1 Paul Turner
                   ` (8 preceding siblings ...)
  2011-07-07  5:30 ` [patch 09/17] sched: add support for unthrottling " Paul Turner
@ 2011-07-07  5:30 ` Paul Turner
  2011-07-07  5:30 ` [patch 11/17] sched: prevent interactions with throttled entities Paul Turner
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Paul Turner @ 2011-07-07  5:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Hu Tao

[-- Attachment #1: sched-bwc-refactor-walk_tg_tree.patch --]
[-- Type: text/plain, Size: 3532 bytes --]

Extend walk_tg_tree to accept a positional argument

static int walk_tg_tree_from(struct task_group *from,
			     tg_visitor down, tg_visitor up, void *data)

Existing semantics are preserved; the caller must hold rcu_read_lock() or a
sufficient analogue.
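
As a rough model of the intended visiting order (a toy child/sibling tree in
plain userspace C rather than the kernel's RCU lists; none of this is the patch
itself): down() fires when a node is first entered, up() when it is left for
the final time, and the walk never climbs above 'from'.

#include <stdio.h>

struct toy_tg {
    const char *name;
    struct toy_tg *parent, *child, *sibling;
};

typedef int (*toy_visitor)(struct toy_tg *);

static int toy_walk_from(struct toy_tg *from, toy_visitor down, toy_visitor up)
{
    struct toy_tg *tg = from;
    int ret;

    while (1) {
        if ((ret = down(tg)))
            return ret;
        if (tg->child) {            /* descend first */
            tg = tg->child;
            continue;
        }
        /* no children left to visit: emit up() and climb */
        while (1) {
            if ((ret = up(tg)))
                return ret;
            if (tg == from)         /* never walk above 'from' */
                return 0;
            if (tg->sibling) {
                tg = tg->sibling;
                break;
            }
            tg = tg->parent;
        }
    }
}

static int show_down(struct toy_tg *tg) { printf("down %s\n", tg->name); return 0; }
static int show_up(struct toy_tg *tg)   { printf("up   %s\n", tg->name); return 0; }

int main(void)
{
    struct toy_tg root = { "root" }, a = { "a" }, b = { "b" };

    root.child = &a;  a.parent = &root;  a.sibling = &b;  b.parent = &root;
    return toy_walk_from(&root, show_down, show_up);
}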

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c |   52 +++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 39 insertions(+), 13 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -1582,20 +1582,23 @@ static inline void dec_cpu_load(struct r
 typedef int (*tg_visitor)(struct task_group *, void *);
 
 /*
- * Iterate the full tree, calling @down when first entering a node and @up when
- * leaving it for the final time.
+ * Iterate task_group tree rooted at *from, calling @down when first entering a
+ * node and @up when leaving it for the final time.
+ *
+ * Caller must hold rcu_lock or sufficient equivalent.
  */
-static int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)
+static int walk_tg_tree_from(struct task_group *from,
+			     tg_visitor down, tg_visitor up, void *data)
 {
 	struct task_group *parent, *child;
 	int ret;
 
-	rcu_read_lock();
-	parent = &root_task_group;
+	parent = from;
+
 down:
 	ret = (*down)(parent, data);
 	if (ret)
-		goto out_unlock;
+		goto out;
 	list_for_each_entry_rcu(child, &parent->children, siblings) {
 		parent = child;
 		goto down;
@@ -1604,19 +1607,29 @@ up:
 		continue;
 	}
 	ret = (*up)(parent, data);
-	if (ret)
-		goto out_unlock;
+	if (ret || parent == from)
+		goto out;
 
 	child = parent;
 	parent = parent->parent;
 	if (parent)
 		goto up;
-out_unlock:
-	rcu_read_unlock();
-
+out:
 	return ret;
 }
 
+/*
+ * Iterate the full tree, calling @down when first entering a node and @up when
+ * leaving it for the final time.
+ *
+ * Caller must hold rcu_lock or sufficient equivalent.
+ */
+
+static inline int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)
+{
+	return walk_tg_tree_from(&root_task_group, down, up, data);
+}
+
 static int tg_nop(struct task_group *tg, void *data)
 {
 	return 0;
@@ -1710,7 +1723,9 @@ static int tg_load_down(struct task_grou
 
 static void update_h_load(long cpu)
 {
+	rcu_read_lock();
 	walk_tg_tree(tg_load_down, tg_nop, (void *)cpu);
+	rcu_read_unlock();
 }
 
 #endif
@@ -8683,13 +8698,19 @@ static int tg_rt_schedulable(struct task
 
 static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
 {
+	int ret;
+
 	struct rt_schedulable_data data = {
 		.tg = tg,
 		.rt_period = period,
 		.rt_runtime = runtime,
 	};
 
-	return walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
+	rcu_read_lock();
+	ret = walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
+	rcu_read_unlock();
+
+	return ret;
 }
 
 static int tg_set_rt_bandwidth(struct task_group *tg,
@@ -9146,6 +9167,7 @@ static int tg_cfs_schedulable_down(struc
 
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
 {
+	int ret;
 	struct cfs_schedulable_data data = {
 		.tg = tg,
 		.period = period,
@@ -9157,7 +9179,11 @@ static int __cfs_schedulable(struct task
 		do_div(data.quota, NSEC_PER_USEC);
 	}
 
-	return walk_tg_tree(tg_cfs_schedulable_down, tg_nop, &data);
+	rcu_read_lock();
+	ret = walk_tg_tree(tg_cfs_schedulable_down, tg_nop, &data);
+	rcu_read_unlock();
+
+	return ret;
 }
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [patch 11/17] sched: prevent interactions with throttled entities
  2011-07-07  5:30 [patch 00/17] CFS Bandwidth Control v7.1 Paul Turner
                   ` (9 preceding siblings ...)
  2011-07-07  5:30 ` [patch 10/17] sched: allow for positional tg_tree walks Paul Turner
@ 2011-07-07  5:30 ` Paul Turner
  2011-07-07  5:30 ` [patch 12/17] sched: prevent buddy " Paul Turner
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Paul Turner @ 2011-07-07  5:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Hu Tao

[-- Attachment #1: sched-bwc-throttled_shares.patch --]
[-- Type: text/plain, Size: 6254 bytes --]

From the perspective of load-balance and shares distribution, throttled
entities should be invisible.

However, both of these operations work on 'active' lists and are not
inherently aware of what group hierarchies may be present.  In some cases this
may be side-stepped (e.g. we could sideload via tg_load_down in load balance) 
while in others (e.g. update_shares()) it is more difficult to compute without
incurring some O(n^2) costs.

Instead, track hierarchical throttled state at time of transition.  This
allows us to easily identify whether an entity belongs to a throttled hierarchy
and avoid incorrect interactions with it.

Also, when an entity leaves a throttled hierarchy we need to advance its
shares-averaging windows so that the elapsed throttled time is not considered
part of the cfs_rq's operation.

We also use this information to prevent buddy interactions in the wakeup and
yield_to() paths.
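
A minimal sketch of the bookkeeping this implies, in standalone C (the toy_*
names are invented): throttling a group increments throttle_count on its own
cfs_rq and on every descendant's, unthrottling decrements them again, so a
non-zero count answers "is this entity inside a throttled hierarchy?" without
walking upwards.

#include <assert.h>

struct toy_cfs_rq {
    int throttle_count;
};

static int toy_throttled_hierarchy(struct toy_cfs_rq *rq)
{
    return rq->throttle_count != 0;
}

/* applied to the throttled group's own rq and to every descendant rq */
static void toy_throttle_down(struct toy_cfs_rq *rq)   { rq->throttle_count++; }
static void toy_unthrottle_up(struct toy_cfs_rq *rq)   { rq->throttle_count--; }

int main(void)
{
    struct toy_cfs_rq parent = { 0 }, child = { 0 };

    /* parent group throttled: the walk covers parent and child */
    toy_throttle_down(&parent); toy_throttle_down(&child);
    /* the child group is throttled as well (nested) */
    toy_throttle_down(&child);

    assert(toy_throttled_hierarchy(&parent) && toy_throttled_hierarchy(&child));

    /* parent unthrottles; the child stays throttled via its own count */
    toy_unthrottle_up(&parent); toy_unthrottle_up(&child);
    assert(!toy_throttled_hierarchy(&parent) && toy_throttled_hierarchy(&child));
    return 0;
}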

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c      |    2 -
 kernel/sched_fair.c |   99 ++++++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 94 insertions(+), 7 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -741,13 +741,15 @@ static void update_cfs_rq_load_contribut
 	}
 }
 
+static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
+
 static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
 {
 	u64 period = sysctl_sched_shares_window;
 	u64 now, delta;
 	unsigned long load = cfs_rq->load.weight;
 
-	if (cfs_rq->tg == &root_task_group)
+	if (cfs_rq->tg == &root_task_group || throttled_hierarchy(cfs_rq))
 		return;
 
 	now = rq_of(cfs_rq)->clock_task;
@@ -856,7 +858,7 @@ static void update_cfs_shares(struct cfs
 
 	tg = cfs_rq->tg;
 	se = tg->se[cpu_of(rq_of(cfs_rq))];
-	if (!se)
+	if (!se || throttled_hierarchy(cfs_rq))
 		return;
 #ifndef CONFIG_SMP
 	if (likely(se->load.weight == tg->shares))
@@ -1419,6 +1421,64 @@ static inline int cfs_rq_throttled(struc
 	return cfs_rq->throttled;
 }
 
+/* check whether cfs_rq, or any parent, is throttled */
+static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->throttle_count;
+}
+
+/*
+ * Ensure that neither of the group entities corresponding to src_cpu or
+ * dest_cpu are members of a throttled hierarchy when performing group
+ * load-balance operations.
+ */
+static inline int throttled_lb_pair(struct task_group *tg,
+				    int src_cpu, int dest_cpu)
+{
+	struct cfs_rq *src_cfs_rq, *dest_cfs_rq;
+
+	src_cfs_rq = tg->cfs_rq[src_cpu];
+	dest_cfs_rq = tg->cfs_rq[dest_cpu];
+
+	return throttled_hierarchy(src_cfs_rq) ||
+	       throttled_hierarchy(dest_cfs_rq);
+}
+
+/* updated child weight may affect parent so we have to do this bottom up */
+static int tg_unthrottle_up(struct task_group *tg, void *data)
+{
+	struct rq *rq = data;
+	struct cfs_rq *cfs_rq = tg->cfs_rq[rq->cpu];
+	u64 delta;
+
+	cfs_rq->throttle_count--;
+	if (!cfs_rq->throttle_count) {
+		/* leaving throttled state, advance shares averaging windows */
+		delta = rq->clock_task - cfs_rq->load_stamp;
+
+		cfs_rq->load_stamp += delta;
+		cfs_rq->load_last += delta;
+
+		/* update entity weight now that we are on_rq again */
+		update_cfs_shares(cfs_rq);
+	}
+
+	return 0;
+}
+
+static int tg_throttle_down(struct task_group *tg, void *data)
+{
+	long cpu = (long)data;
+	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu];
+
+	/* group is entering throttled state, record last load */
+	if (!cfs_rq->throttle_count)
+		update_cfs_load(cfs_rq, 0);
+	cfs_rq->throttle_count++;
+
+	return 0;
+}
+
 static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
@@ -1429,7 +1489,10 @@ static void throttle_cfs_rq(struct cfs_r
 	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
 
 	/* account load preceding throttle */
-	update_cfs_load(cfs_rq, 0);
+	rcu_read_lock();
+	walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop,
+			  (void *)(long)rq_of(cfs_rq)->cpu);
+	rcu_read_unlock();
 
 	task_delta = cfs_rq->h_nr_running;
 	for_each_sched_entity(se) {
@@ -1470,6 +1533,10 @@ static void unthrottle_cfs_rq(struct cfs
 	list_del_rcu(&cfs_rq->throttled_list);
 	raw_spin_unlock(&cfs_b->lock);
 
+	update_rq_clock(rq);
+	/* update hierarchical throttle state */
+	walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
+
 	if (!cfs_rq->load.weight)
 		return;
 
@@ -1614,6 +1681,17 @@ static inline int cfs_rq_throttled(struc
 {
 	return 0;
 }
+
+static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
+{
+	return 0;
+}
+
+static inline int throttled_lb_pair(struct task_group *tg,
+				    int src_cpu, int dest_cpu)
+{
+	return 0;
+}
 #endif
 
 /**************************************************
@@ -2513,6 +2591,9 @@ move_one_task(struct rq *this_rq, int th
 	int pinned = 0;
 
 	for_each_leaf_cfs_rq(busiest, cfs_rq) {
+		if (throttled_lb_pair(cfs_rq->tg, busiest->cpu, this_cpu))
+			continue;
+
 		list_for_each_entry_safe(p, n, &cfs_rq->tasks, se.group_node) {
 
 			if (!can_migrate_task(p, busiest, this_cpu,
@@ -2625,8 +2706,13 @@ static void update_shares(int cpu)
 	struct rq *rq = cpu_rq(cpu);
 
 	rcu_read_lock();
-	for_each_leaf_cfs_rq(rq, cfs_rq)
+	for_each_leaf_cfs_rq(rq, cfs_rq) {
+		/* throttled entities do not contribute to load */
+		if (throttled_hierarchy(cfs_rq))
+			continue;
+
 		update_shares_cpu(cfs_rq->tg, cpu);
+	}
 	rcu_read_unlock();
 }
 
@@ -2650,9 +2736,10 @@ load_balance_fair(struct rq *this_rq, in
 		u64 rem_load, moved_load;
 
 		/*
-		 * empty group
+		 * empty group or part of a throttled hierarchy
 		 */
-		if (!busiest_cfs_rq->task_weight)
+		if (!busiest_cfs_rq->task_weight ||
+		    throttled_lb_pair(tg, busiest_cpu, this_cpu))
 			continue;
 
 		rem_load = (u64)rem_load_move * busiest_weight;
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -399,7 +399,7 @@ struct cfs_rq {
 	u64 runtime_expires;
 	s64 runtime_remaining;
 
-	int throttled;
+	int throttled, throttle_count;
 	struct list_head throttled_list;
 #endif
 #endif



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [patch 12/17] sched: prevent buddy interactions with throttled entities
  2011-07-07  5:30 [patch 00/17] CFS Bandwidth Control v7.1 Paul Turner
                   ` (10 preceding siblings ...)
  2011-07-07  5:30 ` [patch 11/17] sched: prevent interactions with throttled entities Paul Turner
@ 2011-07-07  5:30 ` Paul Turner
  2011-07-07  5:30 ` [patch 13/17] sched: migrate throttled tasks on HOTPLUG Paul Turner
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Paul Turner @ 2011-07-07  5:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Hu Tao

[-- Attachment #1: sched-bwc-throttled_buddies.patch --]
[-- Type: text/plain, Size: 2005 bytes --]

Buddies allow us to select "on-rq" entities without actually selecting them
from a cfs_rq's rb_tree.  As a result we must ensure that throttled entities
are not falsely nominated as buddies.  The fact that entities are dequeued
within throttle_cfs_rq() is not sufficient for clearing buddy status, as the
nomination may occur after throttling.

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched_fair.c |   18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -2367,6 +2367,15 @@ static void check_preempt_wakeup(struct 
 	if (unlikely(se == pse))
 		return;
 
+	/*
+	 * This is possible from callers such as pull_task(), in which we
+	 * unconditionally check_preempt_curr() after an enqueue (which may have
+	 * led to a throttle).  This both saves work and prevents false
+	 * next-buddy nomination below.
+	 */
+	if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
+		return;
+
 	if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
 		set_next_buddy(pse);
 		next_buddy_marked = 1;
@@ -2375,6 +2384,12 @@ static void check_preempt_wakeup(struct 
 	/*
 	 * We can come here with TIF_NEED_RESCHED already set from new task
 	 * wake up path.
+	 *
+	 * Note: this also catches the edge-case of curr being in a throttled
+	 * group (e.g. via set_curr_task), since update_curr() (in the
+	 * enqueue of curr) will have resulted in resched being set.  This
+	 * prevents us from potentially nominating it as a false LAST_BUDDY
+	 * below.
 	 */
 	if (test_tsk_need_resched(curr))
 		return;
@@ -2497,7 +2512,8 @@ static bool yield_to_task_fair(struct rq
 {
 	struct sched_entity *se = &p->se;
 
-	if (!se->on_rq)
+	/* throttled hierarchies are not runnable */
+	if (!se->on_rq || throttled_hierarchy(cfs_rq_of(se)))
 		return false;
 
 	/* Tell the scheduler that we'd really like pse to run next. */



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [patch 13/17] sched: migrate throttled tasks on HOTPLUG
  2011-07-07  5:30 [patch 00/17] CFS Bandwidth Control v7.1 Paul Turner
                   ` (11 preceding siblings ...)
  2011-07-07  5:30 ` [patch 12/17] sched: prevent buddy " Paul Turner
@ 2011-07-07  5:30 ` Paul Turner
  2011-07-07  5:30 ` [patch 14/17] sched: throttle entities exceeding their allowed bandwidth Paul Turner
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Paul Turner @ 2011-07-07  5:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Hu Tao

[-- Attachment #1: sched-bwc-migrate_dead.patch --]
[-- Type: text/plain, Size: 1792 bytes --]

Throttled tasks are invisible to cpu-offline since they are not eligible for
selection by pick_next_task().  The regular 'escape' path for a thread that is
blocked at offline is via ttwu->select_task_rq; however, this will not handle a
throttled group since there are no individual thread wakeups on an unthrottle.

Resolve this by unthrottling offline cpus so that threads can be migrated.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c |   27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -6269,6 +6269,30 @@ static void calc_global_load_remove(stru
 	rq->calc_load_active = 0;
 }
 
+#ifdef CONFIG_CFS_BANDWIDTH
+static void unthrottle_offline_cfs_rqs(struct rq *rq)
+{
+	struct cfs_rq *cfs_rq;
+
+	for_each_leaf_cfs_rq(rq, cfs_rq) {
+		struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+
+		if (!cfs_rq->runtime_enabled)
+			continue;
+
+		/*
+		 * clock_task is not advancing so we just need to make sure
+		 * there's some valid quota amount
+		 */
+		cfs_rq->runtime_remaining = cfs_b->quota;
+		if (cfs_rq_throttled(cfs_rq))
+			unthrottle_cfs_rq(cfs_rq);
+	}
+}
+#else
+static void unthrottle_offline_cfs_rqs(struct rq *rq) {}
+#endif
+
 /*
  * Migrate all tasks from the rq, sleeping tasks will be migrated by
  * try_to_wake_up()->select_task_rq().
@@ -6294,6 +6318,9 @@ static void migrate_tasks(unsigned int d
 	 */
 	rq->stop = NULL;
 
+	/* Ensure any throttled groups are reachable by pick_next_task */
+	unthrottle_offline_cfs_rqs(rq);
+
 	for ( ; ; ) {
 		/*
 		 * There's this thread running, bail when that's the only



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [patch 14/17] sched: throttle entities exceeding their allowed bandwidth
  2011-07-07  5:30 [patch 00/17] CFS Bandwidth Control v7.1 Paul Turner
                   ` (12 preceding siblings ...)
  2011-07-07  5:30 ` [patch 13/17] sched: migrate throttled tasks on HOTPLUG Paul Turner
@ 2011-07-07  5:30 ` Paul Turner
  2011-07-07  5:30 ` [patch 15/17] sched: add exports tracking cfs bandwidth control statistics Paul Turner
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Paul Turner @ 2011-07-07  5:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Hu Tao

[-- Attachment #1: sched-bwc-enable-throttling.patch --]
[-- Type: text/plain, Size: 3749 bytes --]

With the machinery in place to throttle and unthrottle entities, as well as to
handle their participation (or lack thereof), we can now enable throttling.

There are two points at which we must check whether it's time to enter the
throttled state: put_prev_entity() and enqueue_entity().

- put_prev_entity() is the typical throttle path, we reach it by exceeding our
  allocated run-time within update_curr()->account_cfs_rq_runtime() and going
  through a reschedule.

- enqueue_entity() covers the case of a wake-up into an already throttled
  group.  In this case we know the group cannot be on_rq and can throttle
  immediately.  Checks are added in both put_prev_entity() and
  enqueue_entity().
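
A compact sketch of the two conditions in standalone C (the field names are
borrowed from the patch, the toy_* types are invented, and the quota top-up
attempt the real code makes first is omitted):

struct toy_cfs_rq {
    int runtime_enabled;
    long long runtime_remaining;
    int has_curr;        /* stands in for cfs_rq->curr != NULL */
    int throttled;
};

static void toy_throttle(struct toy_cfs_rq *rq) { rq->throttled = 1; }

/* put_prev_entity() path: the group ran out of runtime while executing */
static void toy_check_cfs_rq_runtime(struct toy_cfs_rq *rq)
{
    if (!rq->runtime_enabled || rq->runtime_remaining > 0)
        return;
    if (rq->throttled)    /* e.g. forced to run via set_curr_task */
        return;
    toy_throttle(rq);
}

/* enqueue_entity() path: a wake-up into a group with no quota left */
static void toy_check_enqueue_throttle(struct toy_cfs_rq *rq)
{
    if (!rq->runtime_enabled || rq->has_curr)    /* active: put_prev handles it */
        return;
    if (rq->throttled)
        return;
    /* the real code first tries to top up the local pool here */
    if (rq->runtime_remaining <= 0)
        toy_throttle(rq);
}

int main(void)
{
    struct toy_cfs_rq rq = { 1, 0, 0, 0 };    /* enabled, quota spent, idle */

    toy_check_enqueue_throttle(&rq);    /* the wake-up case */
    toy_check_cfs_rq_runtime(&rq);      /* already throttled: no-op */
    return !rq.throttled;               /* exits 0 */
}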

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched_fair.c |   50 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 49 insertions(+), 1 deletion(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -989,6 +989,8 @@ place_entity(struct cfs_rq *cfs_rq, stru
 	se->vruntime = vruntime;
 }
 
+static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
+
 static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
@@ -1018,8 +1020,10 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
 
-	if (cfs_rq->nr_running == 1)
+	if (cfs_rq->nr_running == 1) {
 		list_add_leaf_cfs_rq(cfs_rq);
+		check_enqueue_throttle(cfs_rq);
+	}
 }
 
 static void __clear_buddies_last(struct sched_entity *se)
@@ -1224,6 +1228,8 @@ static struct sched_entity *pick_next_en
 	return se;
 }
 
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
+
 static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 {
 	/*
@@ -1233,6 +1239,9 @@ static void put_prev_entity(struct cfs_r
 	if (prev->on_rq)
 		update_curr(cfs_rq);
 
+	/* throttle cfs_rqs exceeding runtime */
+	check_cfs_rq_runtime(cfs_rq);
+
 	check_spread(cfs_rq, prev);
 	if (prev->on_rq) {
 		update_stats_wait_start(cfs_rq, prev);
@@ -1518,6 +1527,43 @@ static void throttle_cfs_rq(struct cfs_r
 	raw_spin_unlock(&cfs_b->lock);
 }
 
+/*
+ * When a group wakes up we want to make sure that its quota is not already
+ * expired/exceeded, otherwise it may be allowed to steal additional ticks of
+ * runtime as update_curr() throttling cannot trigger until it's on-rq.
+ */
+static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
+{
+	/* an active group must be handled by the update_curr()->put() path */
+	if (!cfs_rq->runtime_enabled || cfs_rq->curr)
+		return;
+
+	/* ensure the group is not already throttled */
+	if (cfs_rq_throttled(cfs_rq))
+		return;
+
+	/* update runtime allocation */
+	account_cfs_rq_runtime(cfs_rq, 0);
+	if (cfs_rq->runtime_remaining <= 0)
+		throttle_cfs_rq(cfs_rq);
+}
+
+/* conditionally throttle active cfs_rq's from put_prev_entity() */
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
+		return;
+
+	/*
+	 * it's possible for a throttled entity to be forced into a running
+	 * state (e.g. set_curr_task), in this case we're finished.
+	 */
+	if (cfs_rq_throttled(cfs_rq))
+		return;
+
+	throttle_cfs_rq(cfs_rq);
+}
+
 static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
@@ -1656,6 +1702,8 @@ static int do_sched_cfs_period_timer(str
 #else
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 		unsigned long delta_exec) {}
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
+static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [patch 15/17] sched: add exports tracking cfs bandwidth control statistics
  2011-07-07  5:30 [patch 00/17] CFS Bandwidth Control v7.1 Paul Turner
                   ` (13 preceding siblings ...)
  2011-07-07  5:30 ` [patch 14/17] sched: throttle entities exceeding their allowed bandwidth Paul Turner
@ 2011-07-07  5:30 ` Paul Turner
  2011-07-07  5:30 ` [patch 16/17] sched: return unused runtime on group dequeue Paul Turner
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Paul Turner @ 2011-07-07  5:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Hu Tao, Nikhil Rao

[-- Attachment #1: sched-bwc-throttle_stats.patch --]
[-- Type: text/plain, Size: 3605 bytes --]

From: Nikhil Rao <ncrao@google.com>

This change introduces statistics exports for the cpu sub-system; these are
added through the use of a stat file similar to that exported by other
subsystems.

The following exports are included:

nr_periods:	number of periods in which execution occurred
nr_throttled:	the number of periods above in which execution was throttled
throttled_time:	cumulative wall-time that any cpus have been throttled for
this group
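
For example, a small standalone reader for the new file (the default cgroup
directory used below is only an assumption; pass the directory where the cpu
subsystem is mounted for the group of interest):

#include <stdio.h>

int main(int argc, char **argv)
{
    const char *dir = argc > 1 ? argv[1] : "/cgroup/a";    /* assumed mount */
    char path[256], key[64];
    unsigned long long val;
    FILE *f;

    snprintf(path, sizeof(path), "%s/cpu.stat", dir);
    f = fopen(path, "r");
    if (!f) {
        perror(path);
        return 1;
    }
    /* expects "nr_periods N", "nr_throttled N", "throttled_time N" lines */
    while (fscanf(f, "%63s %llu", key, &val) == 2)
        printf("%-16s %llu\n", key, val);
    fclose(f);
    return 0;
}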

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c      |   21 +++++++++++++++++++++
 kernel/sched_fair.c |    7 +++++++
 2 files changed, 28 insertions(+)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -259,6 +259,9 @@ struct cfs_bandwidth {
 	struct hrtimer period_timer;
 	struct list_head throttled_cfs_rq;
 
+	/* statistics */
+	int nr_periods, nr_throttled;
+	u64 throttled_time;
 #endif
 };
 
@@ -399,6 +402,7 @@ struct cfs_rq {
 	u64 runtime_expires;
 	s64 runtime_remaining;
 
+	u64 throttled_timestamp;
 	int throttled, throttle_count;
 	struct list_head throttled_list;
 #endif
@@ -9212,6 +9216,19 @@ static int __cfs_schedulable(struct task
 
 	return ret;
 }
+
+static int cpu_stats_show(struct cgroup *cgrp, struct cftype *cft,
+		struct cgroup_map_cb *cb)
+{
+	struct task_group *tg = cgroup_tg(cgrp);
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+
+	cb->fill(cb, "nr_periods", cfs_b->nr_periods);
+	cb->fill(cb, "nr_throttled", cfs_b->nr_throttled);
+	cb->fill(cb, "throttled_time", cfs_b->throttled_time);
+
+	return 0;
+}
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -9258,6 +9275,10 @@ static struct cftype cpu_files[] = {
 		.read_u64 = cpu_cfs_period_read_u64,
 		.write_u64 = cpu_cfs_period_write_u64,
 	},
+	{
+		.name = "stat",
+		.read_map = cpu_stats_show,
+	},
 #endif
 #ifdef CONFIG_RT_GROUP_SCHED
 	{
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1522,6 +1522,7 @@ static void throttle_cfs_rq(struct cfs_r
 		rq->nr_running -= task_delta;
 
 	cfs_rq->throttled = 1;
+	cfs_rq->throttled_timestamp = rq->clock;
 	raw_spin_lock(&cfs_b->lock);
 	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
 	raw_spin_unlock(&cfs_b->lock);
@@ -1576,8 +1577,10 @@ static void unthrottle_cfs_rq(struct cfs
 
 	cfs_rq->throttled = 0;
 	raw_spin_lock(&cfs_b->lock);
+	cfs_b->throttled_time += rq->clock - cfs_rq->throttled_timestamp;
 	list_del_rcu(&cfs_rq->throttled_list);
 	raw_spin_unlock(&cfs_b->lock);
+	cfs_rq->throttled_timestamp = 0;
 
 	update_rq_clock(rq);
 	/* update hierarchical throttle state */
@@ -1665,6 +1668,7 @@ static int do_sched_cfs_period_timer(str
 	throttled = !list_empty(&cfs_b->throttled_cfs_rq);
 	/* idle depends on !throttled (for the case of a large deficit) */
 	idle = cfs_b->idle && !throttled;
+	cfs_b->nr_periods += overrun;
 
 	/* if we're going inactive then everything else can be deferred */
 	if (idle)
@@ -1678,6 +1682,9 @@ static int do_sched_cfs_period_timer(str
 		goto out_unlock;
 	}
 
+	/* account preceding periods in which throttling occurred */
+	cfs_b->nr_throttled += overrun;
+
 	/*
 	 * There are throttled entities so we must first use the new bandwidth
 	 * to unthrottle them before making it generally available.  This



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [patch 16/17] sched: return unused runtime on group dequeue
  2011-07-07  5:30 [patch 00/17] CFS Bandwidth Control v7.1 Paul Turner
                   ` (14 preceding siblings ...)
  2011-07-07  5:30 ` [patch 15/17] sched: add exports tracking cfs bandwidth control statistics Paul Turner
@ 2011-07-07  5:30 ` Paul Turner
  2011-07-07  5:30 ` [patch 17/17] sched: add documentation for bandwidth control Paul Turner
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Paul Turner @ 2011-07-07  5:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Hu Tao

[-- Attachment #1: sched-bwc-simple_return_quota.patch --]
[-- Type: text/plain, Size: 7878 bytes --]

When a local cfs_rq blocks we return the majority of its remaining quota to the
global bandwidth pool for use by other runqueues.

We do this only when the quota is current and there is more than
min_cfs_rq_runtime [1ms by default] of runtime remaining on the rq.

In the case where there are throttled runqueues and we have sufficient
bandwidth to meter out a slice, a second timer is kicked off to handle this
delivery, unthrottling where appropriate.
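
A standalone sketch of just the return-side arithmetic (invented toy_* types,
no locking and no slack timer): anything above the 1ms floor goes back to the
global pool, but only if the local runtime still belongs to the current global
period.

#include <stdio.h>

#define NSEC_PER_MSEC       1000000LL
#define MIN_CFS_RQ_RUNTIME  (1 * NSEC_PER_MSEC)

struct toy_cfs_b  { long long runtime; long long runtime_expires; };
struct toy_cfs_rq { long long runtime_remaining; long long runtime_expires; };

static void toy_return_runtime(struct toy_cfs_rq *rq, struct toy_cfs_b *b)
{
    long long slack = rq->runtime_remaining - MIN_CFS_RQ_RUNTIME;

    if (slack <= 0)
        return;

    /* only quota from the current generation may be returned */
    if (rq->runtime_expires == b->runtime_expires)
        b->runtime += slack;

    /* either way, don't try to return it again */
    rq->runtime_remaining -= slack;
}

int main(void)
{
    struct toy_cfs_b  b  = { 0, 100 };
    struct toy_cfs_rq rq = { 4 * NSEC_PER_MSEC, 100 };

    toy_return_runtime(&rq, &b);
    printf("local keeps %lld ns, pool gains %lld ns\n",
           rq.runtime_remaining, b.runtime);
    return 0;
}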

Using a 'worst case' antagonist which executes on each cpu
for 1ms before moving onto the next on a fairly large machine:

no quota generations:
 197.47 ms       /cgroup/a/cpuacct.usage
 199.46 ms       /cgroup/a/cpuacct.usage
 205.46 ms       /cgroup/a/cpuacct.usage
 198.46 ms       /cgroup/a/cpuacct.usage
 208.39 ms       /cgroup/a/cpuacct.usage
Since we are allowed to use "stale" quota our usage is effectively bounded by
the rate of input into the global pool and performance is relatively stable.

with quota generations [1s increments]:
 119.58 ms       /cgroup/a/cpuacct.usage
 119.65 ms       /cgroup/a/cpuacct.usage
 119.64 ms       /cgroup/a/cpuacct.usage
 119.63 ms       /cgroup/a/cpuacct.usage
 119.60 ms       /cgroup/a/cpuacct.usage
The large deficit here is due to quota generations (/intentionally/) preventing
us from using previously stranded slack quota.  The cost is that this quota
becomes unavailable.

with quota generations and quota return:
 200.09 ms       /cgroup/a/cpuacct.usage
 200.09 ms       /cgroup/a/cpuacct.usage
 198.09 ms       /cgroup/a/cpuacct.usage
 200.09 ms       /cgroup/a/cpuacct.usage
 200.06 ms       /cgroup/a/cpuacct.usage
By returning unused quota we're able to both stably consume our desired quota
and prevent unintentional overages due to the abuse of slack quota from 
previous quota periods (especially on a large machine).

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched.c      |   15 ++++++-
 kernel/sched_fair.c |  105 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 119 insertions(+), 1 deletion(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -256,7 +256,7 @@ struct cfs_bandwidth {
 	u64 runtime_expires;
 
 	int idle, timer_active;
-	struct hrtimer period_timer;
+	struct hrtimer period_timer, slack_timer;
 	struct list_head throttled_cfs_rq;
 
 	/* statistics */
@@ -417,6 +417,16 @@ static inline struct cfs_bandwidth *tg_c
 
 static inline u64 default_cfs_period(void);
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
+static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b);
+
+static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
+{
+	struct cfs_bandwidth *cfs_b =
+		container_of(timer, struct cfs_bandwidth, slack_timer);
+	do_sched_cfs_slack_timer(cfs_b);
+
+	return HRTIMER_NORESTART;
+}
 
 static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
@@ -449,6 +459,8 @@ static void init_cfs_bandwidth(struct cf
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
+	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	cfs_b->slack_timer.function = sched_cfs_slack_timer;
 }
 
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
@@ -484,6 +496,7 @@ static void __start_cfs_bandwidth(struct
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
 	hrtimer_cancel(&cfs_b->period_timer);
+	hrtimer_cancel(&cfs_b->slack_timer);
 }
 #else
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1071,6 +1071,8 @@ static void clear_buddies(struct cfs_rq 
 		__clear_buddies_skip(se);
 }
 
+static void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
+
 static void
 dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
@@ -1109,6 +1111,10 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 	if (!(flags & DEQUEUE_SLEEP))
 		se->vruntime -= cfs_rq->min_vruntime;
 
+	/* return excess runtime on last dequeue */
+	if (!cfs_rq->nr_running)
+		return_cfs_rq_runtime(cfs_rq);
+
 	update_min_vruntime(cfs_rq);
 	update_cfs_shares(cfs_rq);
 }
@@ -1724,11 +1730,110 @@ out_unlock:
 
 	return idle;
 }
+
+/* a cfs_rq won't donate quota below this amount */
+static const u64 min_cfs_rq_runtime = 1 * NSEC_PER_MSEC;
+/* minimum remaining period time to redistribute slack quota */
+static const u64 min_bandwidth_expiration = 2 * NSEC_PER_MSEC;
+/* how long we wait to gather additional slack before distributing */
+static const u64 cfs_bandwidth_slack_period = 5 * NSEC_PER_MSEC;
+
+/* are we near the end of the current quota period? */
+static int runtime_refresh_within(struct cfs_bandwidth *cfs_b, u64 min_expire)
+{
+	struct hrtimer *refresh_timer = &cfs_b->period_timer;
+	u64 remaining;
+
+	/* if the call-back is running a quota refresh is already occurring */
+	if (hrtimer_callback_running(refresh_timer))
+		return 1;
+
+	/* is a quota refresh about to occur? */
+	remaining = ktime_to_ns(hrtimer_expires_remaining(refresh_timer));
+	if (remaining < min_expire)
+		return 1;
+
+	return 0;
+}
+
+static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+	u64 min_left = cfs_bandwidth_slack_period + min_bandwidth_expiration;
+
+	/* if there's a quota refresh soon don't bother with slack */
+	if (runtime_refresh_within(cfs_b, min_left))
+		return;
+
+	start_bandwidth_timer(&cfs_b->slack_timer,
+				ns_to_ktime(cfs_bandwidth_slack_period));
+}
+
+/* we know any runtime found here is valid as update_curr() precedes return */
+static void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	s64 slack_runtime = cfs_rq->runtime_remaining - min_cfs_rq_runtime;
+
+	if (!cfs_rq->runtime_enabled)
+		return;
+
+	if (slack_runtime <= 0)
+		return;
+
+	raw_spin_lock(&cfs_b->lock);
+	if (cfs_b->quota != RUNTIME_INF &&
+	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+		cfs_b->runtime += slack_runtime;
+
+		/* we are under rq->lock, defer unthrottling using a timer */
+		if (cfs_b->runtime > sched_cfs_bandwidth_slice() &&
+		    !list_empty(&cfs_b->throttled_cfs_rq))
+			start_cfs_slack_bandwidth(cfs_b);
+	}
+	raw_spin_unlock(&cfs_b->lock);
+
+	/* even if it's not valid for return we don't want to try again */
+	cfs_rq->runtime_remaining -= slack_runtime;
+}
+
+/*
+ * This is done with a timer (instead of inline with bandwidth return) since
+ * it's necessary to juggle rq->locks to unthrottle their respective cfs_rqs.
+ */
+static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
+{
+	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
+	u64 expires;
+
+	/* confirm we're still not at a refresh boundary */
+	if (runtime_refresh_within(cfs_b, min_bandwidth_expiration))
+		return;
+
+	raw_spin_lock(&cfs_b->lock);
+	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice) {
+		runtime = cfs_b->runtime;
+		cfs_b->runtime = 0;
+	}
+	expires = cfs_b->runtime_expires;
+	raw_spin_unlock(&cfs_b->lock);
+
+	if (!runtime)
+		return;
+
+	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+
+	raw_spin_lock(&cfs_b->lock);
+	if (expires == cfs_b->runtime_expires)
+		cfs_b->runtime = runtime;
+	raw_spin_unlock(&cfs_b->lock);
+}
+
 #else
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 		unsigned long delta_exec) {}
 static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
+static void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [patch 17/17] sched: add documentation for bandwidth control
  2011-07-07  5:30 [patch 00/17] CFS Bandwidth Control v7.1 Paul Turner
                   ` (15 preceding siblings ...)
  2011-07-07  5:30 ` [patch 16/17] sched: return unused runtime on group dequeue Paul Turner
@ 2011-07-07  5:30 ` Paul Turner
  2011-07-07 11:13 ` [patch 00/17] CFS Bandwidth Control v7.1 Peter Zijlstra
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Paul Turner @ 2011-07-07  5:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Hu Tao

[-- Attachment #1: sched-bwc-documentation.patch --]
[-- Type: text/plain, Size: 5489 bytes --]

From: Bharata B Rao <bharata@linux.vnet.ibm.com>

Basic description of usage and effect for CFS Bandwidth Control.

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: Paul Turner <pjt@google.com>
---
 Documentation/scheduler/sched-bwc.txt |  122 ++++++++++++++++++++++++++++++++++
 1 file changed, 122 insertions(+)

Index: tip/Documentation/scheduler/sched-bwc.txt
===================================================================
--- /dev/null
+++ tip/Documentation/scheduler/sched-bwc.txt
@@ -0,0 +1,122 @@
+CFS Bandwidth Control
+=====================
+
+[ This document only discusses CPU bandwidth control for SCHED_NORMAL.
+  The SCHED_RT case is covered in Documentation/scheduler/sched-rt-group.txt ]
+
+CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
+specification of the maximum CPU bandwidth available to a group or hierarchy.
+
+The bandwidth allowed for a group is specified using a quota and period. Within
+each given "period" (microseconds), a group is allowed to consume only up to
+"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
+group exceeds this limit (for that period), the tasks belonging to its
+hierarchy will be throttled and are not allowed to run again until the next
+period.
+
+A group's unused runtime is globally tracked, being refreshed with quota units
+above at each period boundary.  As threads consume this bandwidth it is
+transferred to cpu-local "silos" on a demand basis.  The amount transferred
+within each of these updates is tunable and described as the "slice".
+
+Management
+----------
+Quota and period are managed within the cpu subsystem via cgroupfs.
+
+cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
+cpu.cfs_period_us: the length of a period (in microseconds)
+cpu.stat: exports throttling statistics [explained further below]
+
+The default values are:
+	cpu.cfs_period_us=100ms
+	cpu.cfs_quota_us=-1
+
+A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
+bandwidth restriction in place, such a group is described as an unconstrained
+bandwidth group.  This represents the traditional work-conserving behavior for
+CFS.
+
+Writing any (valid) positive value(s) will enact the specified bandwidth limit.
+The minimum value allowed for either the quota or period is 1ms.  There is
+upper bound on the period length of 1s.  Additional restrictions exist when
+bandwidth limits are used in a hierarchical fashion, these are explained in
+more detail below.
+
+Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
+and return the group to an unconstrained state once more.
+
+Any updates to a group's bandwidth specification will result in it becoming
+unthrottled if it is in a constrained state.
+
+System wide settings
+--------------------
+For efficiency run-time is transferred between the global pool and CPU local
+"silos" in a batch fashion.  This greatly reduces global accounting pressure
+on large systems.  The amount transferred each time such an update is required
+is described as the "slice".
+
+This is tunable via procfs:
+	/proc/sys/kernel/sched_cfs_bandwidth_slice_us (default=5ms)
+
+Larger slice values will reduce transfer overheads, while smaller values allow
+for more fine-grained consumption.
+
+Statistics
+----------
+A group's bandwidth statistics are exported via 3 fields in cpu.stat.
+
+cpu.stat:
+- nr_periods: Number of enforcement intervals that have elapsed.
+- nr_throttled: Number of times the group has been throttled/limited.
+- throttled_time: The total time duration (in nanoseconds) for which entities
+  of the group have been throttled.
+
+This interface is read-only.
+
+Hierarchical considerations
+---------------------------
+The interface enforces that an individual entity's bandwidth is always
+attainable, that is: max(c_i) <= C. However, over-subscription in the
+aggregate case is explicitly allowed to enable work-conserving semantics
+within a hierarchy.
+  e.g. \Sum (c_i) may exceed C
+[ Where C is the parent's bandwidth, and c_i its children ]
+
+
+There are two ways in which a group may become throttled:
+	a. it fully consumes its own quota within a period
+	b. a parent's quota is fully consumed within its period
+
+In case b) above, even though the child may have runtime remaining it will not
+be allowed to run until the parent's runtime is refreshed.
+
+Examples
+--------
+1. Limit a group to 1 CPU worth of runtime.
+
+	If period is 250ms and quota is also 250ms, the group will get
+	1 CPU worth of runtime every 250ms.
+
+	# echo 250000 > cpu.cfs_quota_us /* quota = 250ms */
+	# echo 250000 > cpu.cfs_period_us /* period = 250ms */
+
+2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine.
+
+	With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
+	runtime every 500ms.
+
+	# echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
+	# echo 500000 > cpu.cfs_period_us /* period = 500ms */
+
+	The larger period here allows for increased burst capacity.
+
+3. Limit a group to 20% of 1 CPU.
+
+	With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU.
+
+	# echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
+	# echo 50000 > cpu.cfs_period_us /* period = 50ms */
+
+	By using a small period here we are ensuring a consistent latency
+	response at the expense of burst capacity.
+



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch 00/17] CFS Bandwidth Control v7.1
  2011-07-07  5:30 [patch 00/17] CFS Bandwidth Control v7.1 Paul Turner
                   ` (16 preceding siblings ...)
  2011-07-07  5:30 ` [patch 17/17] sched: add documentation for bandwidth control Paul Turner
@ 2011-07-07 11:13 ` Peter Zijlstra
  2011-07-11  1:22   ` Hu Tao
  2011-07-07 11:23 ` Ingo Molnar
                   ` (2 subsequent siblings)
  20 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2011-07-07 11:13 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Hu Tao

On Wed, 2011-07-06 at 22:30 -0700, Paul Turner wrote:
> base                                        7,526,317,497           8,666,579,347            1,771,078,445
> +patch, cgroup not enabled                  7,610,354,447 (1.12%)   8,569,448,982 (-1.12%)   1,751,675,193 (-0.11%) 

"cgroup not enabled" means compiled in but not used at runtime, right?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch 00/17] CFS Bandwidth Control v7.1
  2011-07-07  5:30 [patch 00/17] CFS Bandwidth Control v7.1 Paul Turner
                   ` (17 preceding siblings ...)
  2011-07-07 11:13 ` [patch 00/17] CFS Bandwidth Control v7.1 Peter Zijlstra
@ 2011-07-07 11:23 ` Ingo Molnar
  2011-07-07 11:28   ` Peter Zijlstra
  2011-07-07 14:38   ` Peter Zijlstra
  2011-07-07 14:06 ` Peter Zijlstra
  2011-07-11  1:22 ` Hu Tao
  20 siblings, 2 replies; 47+ messages in thread
From: Ingo Molnar @ 2011-07-07 11:23 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Pavel Emelyanov, Hu Tao


* Paul Turner <pjt@google.com> wrote:

> The summary results (from Hu Tao's most recent run) are:
>                                             cycles                   instructions            branches
> -------------------------------------------------------------------------------------------------------------------
> base                                        7,526,317,497           8,666,579,347            1,771,078,445
> +patch, cgroup not enabled                  7,610,354,447 (1.12%)   8,569,448,982 (-1.12%)   1,751,675,193 (-0.11%)
> +patch, 10000000000/1000(quota/period)      7,856,873,327 (4.39%)   8,822,227,540 (1.80%)    1,801,766,182 (1.73%)
> +patch, 10000000000/10000(quota/period)     7,797,711,600 (3.61%)   8,754,747,746 (1.02%)    1,788,316,969 (0.97%)
> +patch, 10000000000/100000(quota/period)    7,777,784,384 (3.34%)   8,744,979,688 (0.90%)    1,786,319,566 (0.86%)
> +patch, 10000000000/1000000(quota/period)   7,802,382,802 (3.67%)   8,755,638,235 (1.03%)    1,788,601,070 (0.99%)
> ------------------------------------------------------------------------------------------------------------------

Well, the most recent run Hu Tao sent (with lockdep disabled) is different:

 table 2. shows the differences between patch and no-patch. quota is set
          to a large value to avoid processes being throttled.

        quota/period          cycles                   instructions             branches
--------------------------------------------------------------------------------------------------
base                          1,146,384,132           1,151,216,688            212,431,532
patch   cgroup disabled       1,163,717,547 (1.51%)   1,165,238,015 ( 1.22%)   215,092,327 ( 1.25%)
patch   10000000000/1000      1,244,889,136 (8.59%)   1,299,128,502 (12.85%)   243,162,542 (14.47%)
patch   10000000000/10000     1,253,305,706 (9.33%)   1,299,167,897 (12.85%)   243,175,027 (14.47%)
patch   10000000000/100000    1,252,374,134 (9.25%)   1,299,314,357 (12.86%)   243,203,923 (14.49%)
patch   10000000000/1000000   1,254,165,824 (9.40%)   1,299,751,347 (12.90%)   243,288,600 (14.53%)
--------------------------------------------------------------------------------------------------


The +1.5% increase in vanilla kernel context switching performance is 
unfortunate - where does that overhead come from?

The +9% increase in cgroups context-switching overhead looks rather 
brutal.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch 00/17] CFS Bandwidth Control v7.1
  2011-07-07 11:23 ` Ingo Molnar
@ 2011-07-07 11:28   ` Peter Zijlstra
  2011-07-07 14:38   ` Peter Zijlstra
  1 sibling, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2011-07-07 11:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Pavel Emelyanov, Hu Tao

On Thu, 2011-07-07 at 13:23 +0200, Ingo Molnar wrote:

> Well, the most recent run Hu Tao sent (with lockdep disabled) are 
> different:
> 
>  table 2. shows the differences between patch and no-patch. quota is set
>           to a large value to avoid processes being throttled.
> 
>         quota/period          cycles                   instructions             branches
> --------------------------------------------------------------------------------------------------
> base                          1,146,384,132           1,151,216,688            212,431,532
> patch   cgroup disabled       1,163,717,547 (1.51%)   1,165,238,015 ( 1.22%)   215,092,327 ( 1.25%)
> patch   10000000000/1000      1,244,889,136 (8.59%)   1,299,128,502 (12.85%)   243,162,542 (14.47%)
> patch   10000000000/10000     1,253,305,706 (9.33%)   1,299,167,897 (12.85%)   243,175,027 (14.47%)
> patch   10000000000/100000    1,252,374,134 (9.25%)   1,299,314,357 (12.86%)   243,203,923 (14.49%)
> patch   10000000000/1000000   1,254,165,824 (9.40%)   1,299,751,347 (12.90%)   243,288,600 (14.53%)
> --------------------------------------------------------------------------------------------------
> 
> 
> The +1.5% increase in vanilla kernel context switching performance is 
> unfortunate - where does that overhead come from?
> 
> The +9% increase in cgroups context-switching overhead looks rather 
> brutal.

As to those, do they run pipe-test in a cgroup or are you always using
the root cgroup?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch 03/17] sched: introduce primitives to account for CFS bandwidth tracking
  2011-07-07  5:30 ` [patch 03/17] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
@ 2011-07-07 13:48   ` Peter Zijlstra
  2011-07-07 21:30     ` Paul Turner
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2011-07-07 13:48 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Hu Tao, Nikhil Rao

On Wed, 2011-07-06 at 22:30 -0700, Paul Turner wrote:
> +config CFS_BANDWIDTH
> +       bool "CPU bandwidth provisioning for FAIR_GROUP_SCHED"
> +       depends on EXPERIMENTAL
> +       depends on FAIR_GROUP_SCHED
> +       depends on SMP 

I know that UP is quickly becoming extinct, but isn't that a tad
radical? :-)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch 00/17] CFS Bandwidth Control v7.1
  2011-07-07  5:30 [patch 00/17] CFS Bandwidth Control v7.1 Paul Turner
                   ` (18 preceding siblings ...)
  2011-07-07 11:23 ` Ingo Molnar
@ 2011-07-07 14:06 ` Peter Zijlstra
  2011-07-08  7:35   ` Paul Turner
  2011-07-11  1:22 ` Hu Tao
  20 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2011-07-07 14:06 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Hu Tao


The very first .config I build gets me:

  CC      kernel/sched.o
/usr/src/linux-2.6/kernel/sched.c:503: warning: ‘struct cfs_bandwidth’ declared inside parameter list
/usr/src/linux-2.6/kernel/sched.c:503: warning: its scope is only this definition or declaration, which is probably not what you want
/usr/src/linux-2.6/kernel/sched.c:504: warning: ‘struct cfs_bandwidth’ declared inside parameter list
In file included from /usr/src/linux-2.6/kernel/sched.c:2175:
/usr/src/linux-2.6/kernel/sched_fair.c: In function ‘move_one_task’:
/usr/src/linux-2.6/kernel/sched_fair.c:2770: error: ‘struct cfs_rq’ has no member named ‘tg’




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch 00/17] CFS Bandwidth Control v7.1
  2011-07-07 11:23 ` Ingo Molnar
  2011-07-07 11:28   ` Peter Zijlstra
@ 2011-07-07 14:38   ` Peter Zijlstra
  2011-07-07 14:51     ` Ingo Molnar
                       ` (3 more replies)
  1 sibling, 4 replies; 47+ messages in thread
From: Peter Zijlstra @ 2011-07-07 14:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Pavel Emelyanov, Hu Tao

On Thu, 2011-07-07 at 13:23 +0200, Ingo Molnar wrote:
> 
> The +1.5% increase in vanilla kernel context switching performance is 
> unfortunate - where does that overhead come from?

Looking at the asm output, I think its partly because things like:

@@ -602,6 +618,8 @@ static void update_curr(struct cfs_rq *c
                cpuacct_charge(curtask, delta_exec);
                account_group_exec_runtime(curtask, delta_exec);
        }
+
+       account_cfs_rq_runtime(cfs_rq, delta_exec);
 }


+static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+               unsigned long delta_exec)
+{
+       if (!cfs_rq->runtime_enabled)
+               return;
+
+       cfs_rq->runtime_remaining -= delta_exec;
+       if (cfs_rq->runtime_remaining > 0)
+               return;
+
+       assign_cfs_rq_runtime(cfs_rq);
+}

generate a call, only to then take the first branch out, marking that
function __always_inline would cure the call problem. Going beyond that
would be using static_branch() to track if there is any bandwidth
tracking required at all.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch 00/17] CFS Bandwidth Control v7.1
  2011-07-07 14:38   ` Peter Zijlstra
@ 2011-07-07 14:51     ` Ingo Molnar
  2011-07-07 14:54       ` Peter Zijlstra
  2011-07-07 16:52     ` [patch 00/17] CFS Bandwidth Control v7.1 Andi Kleen
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 47+ messages in thread
From: Ingo Molnar @ 2011-07-07 14:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Pavel Emelyanov, Hu Tao


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Thu, 2011-07-07 at 13:23 +0200, Ingo Molnar wrote:
> > 
> > The +1.5% increase in vanilla kernel context switching performance is 
> > unfortunate - where does that overhead come from?
> 
> Looking at the asm output, I think its partly because things like:
> 
> @@ -602,6 +618,8 @@ static void update_curr(struct cfs_rq *c
>                 cpuacct_charge(curtask, delta_exec);
>                 account_group_exec_runtime(curtask, delta_exec);
>         }
> +
> +       account_cfs_rq_runtime(cfs_rq, delta_exec);
>  }
> 
> 
> +static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
> +               unsigned long delta_exec)
> +{
> +       if (!cfs_rq->runtime_enabled)
> +               return;
> +
> +       cfs_rq->runtime_remaining -= delta_exec;
> +       if (cfs_rq->runtime_remaining > 0)
> +               return;
> +
> +       assign_cfs_rq_runtime(cfs_rq);
> +}
> 
> generate a call, only to then take the first branch out, marking that
> function __always_inline would cure the call problem. Going beyond that
> would be using static_branch() to track if there is any bandwidth
> tracking required at all.

Could jump labels be utilized perhaps?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch 00/17] CFS Bandwidth Control v7.1
  2011-07-07 14:51     ` Ingo Molnar
@ 2011-07-07 14:54       ` Peter Zijlstra
  2011-07-07 14:56         ` Ingo Molnar
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2011-07-07 14:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Pavel Emelyanov, Hu Tao

On Thu, 2011-07-07 at 16:51 +0200, Ingo Molnar wrote:
> > Going beyond that
> > would be using static_branch() to track if there is any bandwidth
> > tracking required at all.
> 
> Could jump labels be utilized perhaps? 

# git grep static_branch include/linux/
include/linux/jump_label.h:static __always_inline bool static_branch(struct jump_label_key *key)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch 00/17] CFS Bandwidth Control v7.1
  2011-07-07 14:54       ` Peter Zijlstra
@ 2011-07-07 14:56         ` Ingo Molnar
  2011-07-07 16:23           ` Jason Baron
  0 siblings, 1 reply; 47+ messages in thread
From: Ingo Molnar @ 2011-07-07 14:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Pavel Emelyanov, Hu Tao


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Thu, 2011-07-07 at 16:51 +0200, Ingo Molnar wrote:
> > > Going beyond that
> > > would be using static_branch() to track if there is any bandwidth
> > > tracking required at all.
> > 
> > Could jump labels be utilized perhaps? 
> 
> # git grep static_branch include/linux/
> include/linux/jump_label.h:static __always_inline bool static_branch(struct jump_label_key *key)

Right - i only read up to your first, __always_inline suggestion :)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch 00/17] CFS Bandwidth Control v7.1
  2011-07-07 14:56         ` Ingo Molnar
@ 2011-07-07 16:23           ` Jason Baron
  2011-07-07 17:20             ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Jason Baron @ 2011-07-07 16:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Paul Turner, linux-kernel, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Hidetoshi Seto,
	Pavel Emelyanov, Hu Tao

On Thu, Jul 07, 2011 at 04:56:20PM +0200, Ingo Molnar wrote:
> * Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > On Thu, 2011-07-07 at 16:51 +0200, Ingo Molnar wrote:
> > > > Going beyond that
> > > > would be using static_branch() to track if there is any bandwidth
> > > > tracking required at all.
> > > 
> > > Could jump labels be utilized perhaps? 
> > 
> > # git grep static_branch include/linux/
> > include/linux/jump_label.h:static __always_inline bool static_branch(struct jump_label_key *key)
> 
> Right - i only read up to your first, __always_inline suggestion :)
> 
> Thanks,
> 
> 	Ingo

yes, this is the type of case jump labels is designed for. Let me know
if there are any usage questions/concerns.

thanks,

-Jason

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch 00/17] CFS Bandwidth Control v7.1
  2011-07-07 14:38   ` Peter Zijlstra
  2011-07-07 14:51     ` Ingo Molnar
@ 2011-07-07 16:52     ` Andi Kleen
  2011-07-07 17:08       ` Peter Zijlstra
  2011-07-07 17:59     ` Peter Zijlstra
  2011-07-08  7:39     ` Paul Turner
  3 siblings, 1 reply; 47+ messages in thread
From: Andi Kleen @ 2011-07-07 16:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Paul Turner, linux-kernel, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Hidetoshi Seto,
	Pavel Emelyanov, Hu Tao

Peter Zijlstra <a.p.zijlstra@chello.nl> writes:
>
> +static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
> +               unsigned long delta_exec)
> +{
> +       if (!cfs_rq->runtime_enabled)
> +               return;
> +
> +       cfs_rq->runtime_remaining -= delta_exec;
> +       if (cfs_rq->runtime_remaining > 0)
> +               return;
> +
> +       assign_cfs_rq_runtime(cfs_rq);
> +}
>
> generate a call, only to then take the first branch out, marking that

You would need a *LOT* of calls to make up for 9%.

Maybe it's something else? Some profiling first before optimization
is probably a good idea.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch 00/17] CFS Bandwidth Control v7.1
  2011-07-07 16:52     ` [patch 00/17] CFS Bandwidth Control v7.1 Andi Kleen
@ 2011-07-07 17:08       ` Peter Zijlstra
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2011-07-07 17:08 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Paul Turner, linux-kernel, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Hidetoshi Seto,
	Pavel Emelyanov, Hu Tao

On Thu, 2011-07-07 at 09:52 -0700, Andi Kleen wrote:
> Peter Zijlstra <a.p.zijlstra@chello.nl> writes:
> >
> > +static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
> > +               unsigned long delta_exec)
> > +{
> > +       if (!cfs_rq->runtime_enabled)
> > +               return;
> > +
> > +       cfs_rq->runtime_remaining -= delta_exec;
> > +       if (cfs_rq->runtime_remaining > 0)
> > +               return;
> > +
> > +       assign_cfs_rq_runtime(cfs_rq);
> > +}
> >
> > generate a call, only to then take the first branch out, marking that
> 
> You would need a *LOT* of calls to make up for 9%.
> 
> Maybe it's something else? Some profiling first before optimization
> is probably a good idea.

This is the 1.5% case where the feature is compiled in but not used.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch 00/17] CFS Bandwidth Control v7.1
  2011-07-07 16:23           ` Jason Baron
@ 2011-07-07 17:20             ` Peter Zijlstra
  2011-07-07 18:15               ` Jason Baron
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2011-07-07 17:20 UTC (permalink / raw)
  To: Jason Baron
  Cc: Ingo Molnar, Paul Turner, linux-kernel, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Hidetoshi Seto,
	Pavel Emelyanov, Hu Tao

On Thu, 2011-07-07 at 12:23 -0400, Jason Baron wrote:

> yes, this is the type of case jump labels is designed for. Let me know
> if there are any usage questions/concerns.

Unrelated to this change, I was looking at a case where it would be nice
to have a way to specify the initial state of jump labels.

That is, currently:

jump_label_key foo;

  if (static_branch(&foo))

will default to false, until jump_label_inc(&foo) etc..

I was looking to add something like:

jump_label_key foo = JUMP_LABEL_TRUE;

Which would initialize the thing to be true, the problem with that is
that it takes until the loop in jump_label_init() before that takes
effect, which was too late in my particular case.

Anyway, not the current problem and probably not your problem anyway,
but I wanted to throw the issue out there.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch 00/17] CFS Bandwidth Control v7.1
  2011-07-07 14:38   ` Peter Zijlstra
  2011-07-07 14:51     ` Ingo Molnar
  2011-07-07 16:52     ` [patch 00/17] CFS Bandwidth Control v7.1 Andi Kleen
@ 2011-07-07 17:59     ` Peter Zijlstra
  2011-07-07 19:36       ` Jason Baron
  2011-07-08  7:45       ` Paul Turner
  2011-07-08  7:39     ` Paul Turner
  3 siblings, 2 replies; 47+ messages in thread
From: Peter Zijlstra @ 2011-07-07 17:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Pavel Emelyanov, Hu Tao

On Thu, 2011-07-07 at 16:38 +0200, Peter Zijlstra wrote:
> On Thu, 2011-07-07 at 13:23 +0200, Ingo Molnar wrote:
> > 
> > The +1.5% increase in vanilla kernel context switching performance is 
> > unfortunate - where does that overhead come from?
> 
> Looking at the asm output, I think its partly because things like:
> 
> @@ -602,6 +618,8 @@ static void update_curr(struct cfs_rq *c
>                 cpuacct_charge(curtask, delta_exec);
>                 account_group_exec_runtime(curtask, delta_exec);
>         }
> +
> +       account_cfs_rq_runtime(cfs_rq, delta_exec);
>  }
> 
> 
> +static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
> +               unsigned long delta_exec)
> +{
> +       if (!cfs_rq->runtime_enabled)
> +               return;
> +
> +       cfs_rq->runtime_remaining -= delta_exec;
> +       if (cfs_rq->runtime_remaining > 0)
> +               return;
> +
> +       assign_cfs_rq_runtime(cfs_rq);
> +}
> 
> generate a call, only to then take the first branch out, marking that
> function __always_inline would cure the call problem. Going beyond that
> would be using static_branch() to track if there is any bandwidth
> tracking required at all.

Right, so that cfs_rq->runtime_enabled is almost a guaranteed cacheline
miss as well, its at the tail end of cfs_rq, then again, the smp-load
update will want to touch that same cacheline so its not a complete
waste of time.

The other big addition to all the fast paths are the various throttled
checks, those do miss a complete new cacheline.. adding a
static_branch() to that might make sense.

compile tested only..

---
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -71,6 +71,7 @@
 #include <linux/ctype.h>
 #include <linux/ftrace.h>
 #include <linux/slab.h>
+#include <linux/jump_label.h>
 
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
@@ -297,6 +298,7 @@ struct task_group {
 	struct autogroup *autogroup;
 #endif
 
+	int runtime_enabled;
 	struct cfs_bandwidth cfs_bandwidth;
 };
 
@@ -410,6 +412,8 @@ struct cfs_rq {
 };
 
 #ifdef CONFIG_CFS_BANDWIDTH
+static struct jump_label_key cfs_bandwidth_enabled;
+
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
 {
 	return &tg->cfs_bandwidth;
@@ -9075,6 +9079,15 @@ static int tg_set_cfs_bandwidth(struct t
 			unthrottle_cfs_rq(cfs_rq);
 		raw_spin_unlock_irq(&rq->lock);
 	}
+
+	if (runtime_enabled && !tg->runtime_enabled)
+		jump_label_inc(&cfs_bandwidth_enabled);
+
+	if (!runtime_enabled && tg->runtime_enabled)
+		jump_label_dec(&cfs_bandwidth_enabled);
+
+	tg->runtime_enabled = runtime_enabled;
+
 out_unlock:
 	mutex_unlock(&cfs_constraints_mutex);
 
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1410,10 +1410,10 @@ static void expire_cfs_rq_runtime(struct
 	}
 }
 
-static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+static __always_inline void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 		unsigned long delta_exec)
 {
-	if (!cfs_rq->runtime_enabled)
+	if (!static_branch(&cfs_bandwidth_enabled) || !cfs_rq->runtime_enabled)
 		return;
 
 	/* dock delta_exec before expiring quota (as it could span periods) */
@@ -1433,13 +1433,13 @@ static void account_cfs_rq_runtime(struc
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {
-	return cfs_rq->throttled;
+	return static_branch(&cfs_bandwidth_enabled) && cfs_rq->throttled;
 }
 
 /* check whether cfs_rq, or any parent, is throttled */
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 {
-	return cfs_rq->throttle_count;
+	return static_branch(&cfs_bandwidth_enabled) && cfs_rq->throttle_count;
 }
 
 /*


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch 00/17] CFS Bandwidth Control v7.1
  2011-07-07 17:20             ` Peter Zijlstra
@ 2011-07-07 18:15               ` Jason Baron
  2011-07-07 20:36                 ` jump_label defaults (was Re: [patch 00/17] CFS Bandwidth Control v7.1) Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Jason Baron @ 2011-07-07 18:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Paul Turner, linux-kernel, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Hidetoshi Seto,
	Pavel Emelyanov, Hu Tao

On Thu, Jul 07, 2011 at 07:20:20PM +0200, Peter Zijlstra wrote:
> On Thu, 2011-07-07 at 12:23 -0400, Jason Baron wrote:
> 
> > yes, this is the type of case jump labels is designed for. Let me know
> > if there are any usage questions/concerns.
> 
> Unrelated to this change, I was looking at a case where it would be nice
> to have a way to specify the initial state of jump labels.
> 
> That is, currently:
> 
> jump_label_key foo;
> 
>   if (static_branch(&foo))
> 
> will default to false, until jump_label_inc(&foo) etc..
> 
> I was looking to add something like:
> 
> jump_label_key foo = JUMP_LABEL_TRUE;
> 
> Which would initialize the thing to be true, the problem with that is
> that it takes until the loop in jump_label_init() before that takes
> effect, which was too late in my particular case.
> 

We don't have to wait until jump_label_init() to make it take effect.

We could introduce something like: static_branch_default_false(&foo),
and static_branch_default_true(&foo), which are set at compile time. I
was waiting for a real world example before introducing it, but if this
would solve your issue,  we can look at it.

-Jason


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch 00/17] CFS Bandwidth Control v7.1
  2011-07-07 17:59     ` Peter Zijlstra
@ 2011-07-07 19:36       ` Jason Baron
  2011-07-08  7:45       ` Paul Turner
  1 sibling, 0 replies; 47+ messages in thread
From: Jason Baron @ 2011-07-07 19:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Paul Turner, linux-kernel, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Hidetoshi Seto,
	Pavel Emelyanov, Hu Tao

On Thu, Jul 07, 2011 at 07:59:48PM +0200, Peter Zijlstra wrote:
> On Thu, 2011-07-07 at 16:38 +0200, Peter Zijlstra wrote:
> > On Thu, 2011-07-07 at 13:23 +0200, Ingo Molnar wrote:
> > > 
> > > The +1.5% increase in vanilla kernel context switching performance is 
> > > unfortunate - where does that overhead come from?
> > 
> > Looking at the asm output, I think its partly because things like:
> > 
> > @@ -602,6 +618,8 @@ static void update_curr(struct cfs_rq *c
> >                 cpuacct_charge(curtask, delta_exec);
> >                 account_group_exec_runtime(curtask, delta_exec);
> >         }
> > +
> > +       account_cfs_rq_runtime(cfs_rq, delta_exec);
> >  }
> > 
> > 
> > +static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
> > +               unsigned long delta_exec)
> > +{
> > +       if (!cfs_rq->runtime_enabled)
> > +               return;
> > +
> > +       cfs_rq->runtime_remaining -= delta_exec;
> > +       if (cfs_rq->runtime_remaining > 0)
> > +               return;
> > +
> > +       assign_cfs_rq_runtime(cfs_rq);
> > +}
> > 
> > generate a call, only to then take the first branch out, marking that
> > function __always_inline would cure the call problem. Going beyond that
> > would be using static_branch() to track if there is any bandwidth
> > tracking required at all.
> 
> Right, so that cfs_rq->runtime_enabled is almost a guaranteed cacheline
> miss as well, its at the tail end of cfs_rq, then again, the smp-load
> update will want to touch that same cacheline so its not a complete
> waste of time.
> 
> The other big addition to all the fast paths are the various throttled
> checks, those do miss a complete new cacheline.. adding a
> static_branch() to that might make sense.
> 
> compile tested only..
> 

I'm curious to see how the asm look like here for the static
branches in this case, can you post it?

thanks,

-Jason


> ---
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -71,6 +71,7 @@
>  #include <linux/ctype.h>
>  #include <linux/ftrace.h>
>  #include <linux/slab.h>
> +#include <linux/jump_label.h>
>  
>  #include <asm/tlb.h>
>  #include <asm/irq_regs.h>
> @@ -297,6 +298,7 @@ struct task_group {
>  	struct autogroup *autogroup;
>  #endif
>  
> +	int runtime_enabled;
>  	struct cfs_bandwidth cfs_bandwidth;
>  };
>  
> @@ -410,6 +412,8 @@ struct cfs_rq {
>  };
>  
>  #ifdef CONFIG_CFS_BANDWIDTH
> +static struct jump_label_key cfs_bandwidth_enabled;
> +
>  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
>  {
>  	return &tg->cfs_bandwidth;
> @@ -9075,6 +9079,15 @@ static int tg_set_cfs_bandwidth(struct t
>  			unthrottle_cfs_rq(cfs_rq);
>  		raw_spin_unlock_irq(&rq->lock);
>  	}
> +
> +	if (runtime_enabled && !tg->runtime_enabled)
> +		jump_label_inc(&cfs_bandwidth_enabled);
> +
> +	if (!runtime_enabled && tg->runtime_enabled)
> +		jump_label_dec(&cfs_bandwidth_enabled);
> +
> +	tg->runtime_enabled = runtime_enabled;
> +
>  out_unlock:
>  	mutex_unlock(&cfs_constraints_mutex);
>  
> Index: linux-2.6/kernel/sched_fair.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched_fair.c
> +++ linux-2.6/kernel/sched_fair.c
> @@ -1410,10 +1410,10 @@ static void expire_cfs_rq_runtime(struct
>  	}
>  }
>  
> -static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
> +static __always_inline void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
>  		unsigned long delta_exec)
>  {
> -	if (!cfs_rq->runtime_enabled)
> +	if (!static_branch(&cfs_bandwidth_enabled) || !cfs_rq->runtime_enabled)
>  		return;
>  
>  	/* dock delta_exec before expiring quota (as it could span periods) */
> @@ -1433,13 +1433,13 @@ static void account_cfs_rq_runtime(struc
>  
>  static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
>  {
> -	return cfs_rq->throttled;
> +	return static_branch(&cfs_bandwidth_enabled) && cfs_rq->throttled;
>  }
>  
>  /* check whether cfs_rq, or any parent, is throttled */
>  static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
>  {
> -	return cfs_rq->throttle_count;
> +	return static_branch(&cfs_bandwidth_enabled) && cfs_rq->throttle_count;
>  }
>  
>  /*
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 47+ messages in thread

* jump_label defaults (was Re: [patch 00/17] CFS Bandwidth Control v7.1)
  2011-07-07 18:15               ` Jason Baron
@ 2011-07-07 20:36                 ` Peter Zijlstra
  2011-07-08  9:20                   ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2011-07-07 20:36 UTC (permalink / raw)
  To: Jason Baron
  Cc: Ingo Molnar, Paul Turner, linux-kernel, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Hidetoshi Seto,
	Pavel Emelyanov, Hu Tao, Mike Galbraith

[resend because I somehow managed to wreck the lkml address]

On Thu, 2011-07-07 at 14:15 -0400, Jason Baron wrote:
> We don't have to wait until jump_label_init() to make it take effect.
> 
> We could introduce something like: static_branch_default_false(&foo),
> and static_branch_default_true(&foo), which are set at compile time. I
> was waiting for a real world example before introducing it, but if this
> would solve your issue,  we can look at it. 

Hrm,. I can't seem to make that work, damn CPP for not being recursive.

The thing in question is the below patch, I'd need something like:

sed -ie 's/1)/true)' -e 's/0)/false)/' kernel/sched_features.h

#define SCHED_FEAT(name, enabled)	\
	#define static_branch_##name static_branch_default_##enabled
#include "sched_features.h"
#undef SCHED_FEAT

so that I can then do:

#define sched_feat(x)	\
	static_branch_##x(&sched_feat_keys[__SCHED_FEAT_##x]))

Otherwise there's no way to get the default thing related to x.

Also, it still needs an initializer for jump_label_key to get in the
correct state.

---
Subject: sched: Use jump_labels for sched_feat
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Wed Jul 06 14:20:14 CEST 2011

static_branch() is disabled by default, but more sched_feat are enabled
by default, so invert the logic. Fixup the few stragglers on
late_initcall().

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-10afjk8n3eu30jytrhdpaluc@git.kernel.org
---
 kernel/sched.c |   33 ++++++++++++++++++++++++++-------
 1 file changed, 26 insertions(+), 7 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -71,6 +71,7 @@
 #include <linux/ctype.h>
 #include <linux/ftrace.h>
 #include <linux/slab.h>
+#include <linux/jump_label.h>
 
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
@@ -691,6 +692,7 @@ int runqueue_is_locked(int cpu)
 
 enum {
 #include "sched_features.h"
+	__SCHED_FEAT_NR
 };
 
 #undef SCHED_FEAT
@@ -710,16 +712,17 @@ const_debug unsigned int sysctl_sched_fe
 
 static __read_mostly char *sched_feat_names[] = {
 #include "sched_features.h"
-	NULL
 };
 
 #undef SCHED_FEAT
 
+static struct jump_label_key sched_feat_keys[__SCHED_FEAT_NR];
+
 static int sched_feat_show(struct seq_file *m, void *v)
 {
 	int i;
 
-	for (i = 0; sched_feat_names[i]; i++) {
+	for (i = 0; i < __SCHED_FEAT_NR; i++) {
 		if (!(sysctl_sched_features & (1UL << i)))
 			seq_puts(m, "NO_");
 		seq_printf(m, "%s ", sched_feat_names[i]);
@@ -752,17 +755,22 @@ sched_feat_write(struct file *filp, cons
 		cmp += 3;
 	}
 
-	for (i = 0; sched_feat_names[i]; i++) {
+	for (i = 0; i < __SCHED_FEAT_NR; i++) {
 		if (strcmp(cmp, sched_feat_names[i]) == 0) {
-			if (neg)
+			if (neg) {
 				sysctl_sched_features &= ~(1UL << i);
-			else
+				if (!jump_label_enabled(&sched_feat_keys[i]))
+					jump_label_inc(&sched_feat_keys[i]);
+			} else {
 				sysctl_sched_features |= (1UL << i);
+				if (jump_label_enabled(&sched_feat_keys[i]))
+					jump_label_dec(&sched_feat_keys[i]);
+			}
 			break;
 		}
 	}
 
-	if (!sched_feat_names[i])
+	if (i == __SCHED_FEAT_NR)
 		return -EINVAL;
 
 	*ppos += cnt;
@@ -785,6 +793,13 @@ static const struct file_operations sche
 
 static __init int sched_init_debug(void)
 {
+	int i;
+
+	for (i = 0; i < __SCHED_FEAT_NR; i++) {
+		if (!(sysctl_sched_features & (1UL << i)))
+			jump_label_inc(&sched_feat_keys[i]);
+	}
+
 	debugfs_create_file("sched_features", 0644, NULL, NULL,
 			&sched_feat_fops);
 
@@ -792,10 +807,14 @@ static __init int sched_init_debug(void)
 }
 late_initcall(sched_init_debug);
 
-#endif
+#define sched_feat(x) (!static_branch(&sched_feat_keys[__SCHED_FEAT_##x]))
+
+#else /* CONFIG_SCHED_DEBUG */
 
 #define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
 
+#endif /* CONFIG_SCHED_DEBUG */
+
 /*
  * Number of tasks to iterate in a single balance run.
  * Limited because this is done with IRQs disabled.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch 03/17] sched: introduce primitives to account for CFS bandwidth tracking
  2011-07-07 13:48   ` Peter Zijlstra
@ 2011-07-07 21:30     ` Paul Turner
  0 siblings, 0 replies; 47+ messages in thread
From: Paul Turner @ 2011-07-07 21:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Hu Tao, Nikhil Rao

On Thu, Jul 7, 2011 at 6:48 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Wed, 2011-07-06 at 22:30 -0700, Paul Turner wrote:
>> +config CFS_BANDWIDTH
>> +       bool "CPU bandwidth provisioning for FAIR_GROUP_SCHED"
>> +       depends on EXPERIMENTAL
>> +       depends on FAIR_GROUP_SCHED
>> +       depends on SMP
>
> I know that UP is quickly becoming extinct, but isn't that a tad
> radical? :-)
>

Right, this was because walk_tg_tree (and some friends) were not
available in the FAIR_GROUP && !SMP case.  The wreckage used to be a
little more extensive, but I just tried it again and the last rounds of
clean-up seem to have fixed most of it; the only other problem was
updating the SMP group load stuff in the unthrottle path.

I've rolled the required #ifdef-erry to fix the above up and it's not
horrifying so... dependency removed!

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch 00/17] CFS Bandwidth Control v7.1
  2011-07-07 14:06 ` Peter Zijlstra
@ 2011-07-08  7:35   ` Paul Turner
  0 siblings, 0 replies; 47+ messages in thread
From: Paul Turner @ 2011-07-08  7:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Hu Tao

On Thu, Jul 7, 2011 at 7:06 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> The very first .config I build gets me:
>
>  CC      kernel/sched.o
> /usr/src/linux-2.6/kernel/sched.c:503: warning: ‘struct cfs_bandwidth’ declared inside parameter list
> /usr/src/linux-2.6/kernel/sched.c:503: warning: its scope is only this definition or declaration, which is probably not what you want
> /usr/src/linux-2.6/kernel/sched.c:504: warning: ‘struct cfs_bandwidth’ declared inside parameter list
> In file included from /usr/src/linux-2.6/kernel/sched.c:2175:
> /usr/src/linux-2.6/kernel/sched_fair.c: In function ‘move_one_task’:
> /usr/src/linux-2.6/kernel/sched_fair.c:2770: error: ‘struct cfs_rq’ has no member named ‘tg’
>

Fixed -- I missed !CGROUP in my testing.

I'll roll these nits with the !SMP support and some performance
shavings from this thread.
>
>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch 00/17] CFS Bandwidth Control v7.1
  2011-07-07 14:38   ` Peter Zijlstra
                       ` (2 preceding siblings ...)
  2011-07-07 17:59     ` Peter Zijlstra
@ 2011-07-08  7:39     ` Paul Turner
  2011-07-08 10:32       ` Peter Zijlstra
  3 siblings, 1 reply; 47+ messages in thread
From: Paul Turner @ 2011-07-08  7:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Pavel Emelyanov, Hu Tao

On Thu, Jul 7, 2011 at 7:38 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Thu, 2011-07-07 at 13:23 +0200, Ingo Molnar wrote:
>>
>> The +1.5% increase in vanilla kernel context switching performance is
>> unfortunate - where does that overhead come from?
>
> Looking at the asm output, I think its partly because things like:
>
> @@ -602,6 +618,8 @@ static void update_curr(struct cfs_rq *c
>                cpuacct_charge(curtask, delta_exec);
>                account_group_exec_runtime(curtask, delta_exec);
>        }
> +
> +       account_cfs_rq_runtime(cfs_rq, delta_exec);
>  }
>
>
> +static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
> +               unsigned long delta_exec)
> +{
> +       if (!cfs_rq->runtime_enabled)
> +               return;
> +
> +       cfs_rq->runtime_remaining -= delta_exec;
> +       if (cfs_rq->runtime_remaining > 0)
> +               return;
> +
> +       assign_cfs_rq_runtime(cfs_rq);
> +}
>
> generate a call, only to then take the first branch out, marking that
> function __always_inline would cure the call problem.

Indeed!  I looked at this today; fixing the inlining recovers ~50% of
the cost.  However, my numbers are not directly comparable to Hu's (~2%
originally, improving to ~1%).

>  Going beyond that
> would be using static_branch() to track if there is any bandwidth
> tracking required at all.
>

I spent some time examining this option as well.  Our toolchain
apparently is stuck on gcc-4.4 which left me scratching my head at the
supposed jump label assembly being omitted until I realized
CC_HAS_ASM_GOTO was missing.  I will roll this up also and benchmark
tomorrow.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch 00/17] CFS Bandwidth Control v7.1
  2011-07-07 17:59     ` Peter Zijlstra
  2011-07-07 19:36       ` Jason Baron
@ 2011-07-08  7:45       ` Paul Turner
  1 sibling, 0 replies; 47+ messages in thread
From: Paul Turner @ 2011-07-08  7:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Pavel Emelyanov, Hu Tao

On Thu, Jul 7, 2011 at 10:59 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2011-07-07 at 16:38 +0200, Peter Zijlstra wrote:
>> On Thu, 2011-07-07 at 13:23 +0200, Ingo Molnar wrote:
>> >
>> > The +1.5% increase in vanilla kernel context switching performance is
>> > unfortunate - where does that overhead come from?
>>
>> Looking at the asm output, I think its partly because things like:
>>
>> @@ -602,6 +618,8 @@ static void update_curr(struct cfs_rq *c
>>                 cpuacct_charge(curtask, delta_exec);
>>                 account_group_exec_runtime(curtask, delta_exec);
>>         }
>> +
>> +       account_cfs_rq_runtime(cfs_rq, delta_exec);
>>  }
>>
>>
>> +static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
>> +               unsigned long delta_exec)
>> +{
>> +       if (!cfs_rq->runtime_enabled)
>> +               return;
>> +
>> +       cfs_rq->runtime_remaining -= delta_exec;
>> +       if (cfs_rq->runtime_remaining > 0)
>> +               return;
>> +
>> +       assign_cfs_rq_runtime(cfs_rq);
>> +}
>>
>> generate a call, only to then take the first branch out, marking that
>> function __always_inline would cure the call problem. Going beyond that
>> would be using static_branch() to track if there is any bandwidth
>> tracking required at all.
>
> Right, so that cfs_rq->runtime_enabled is almost a guaranteed cacheline
> miss as well, its at the tail end of cfs_rq, then again, the smp-load
> update will want to touch that same cacheline so its not a complete
> waste of time.

Right -- this is already being pulled in by the touch to
load_unacc_exec_time in __update_curr().

I looked at moving to share a line with exec_clock/vruntime, however
this was not beneficial in the supported but not enabled case due to
the above.  Moreover, it was detrimental to the enabled performance as
the dependent runtime state was no longer shared on the line when
cfs_rq->runtime_enabled == 1.

>
> The other big addition to all the fast paths are the various throttled
> checks, those do miss a complete new cacheline.. adding a
> static_branch() to that might make sense.
>
> compile tested only..
>
> ---
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -71,6 +71,7 @@
>  #include <linux/ctype.h>
>  #include <linux/ftrace.h>
>  #include <linux/slab.h>
> +#include <linux/jump_label.h>
>
>  #include <asm/tlb.h>
>  #include <asm/irq_regs.h>
> @@ -297,6 +298,7 @@ struct task_group {
>        struct autogroup *autogroup;
>  #endif
>
> +       int runtime_enabled;
>        struct cfs_bandwidth cfs_bandwidth;
>  };
>
> @@ -410,6 +412,8 @@ struct cfs_rq {
>  };
>
>  #ifdef CONFIG_CFS_BANDWIDTH
> +static struct jump_label_key cfs_bandwidth_enabled;
> +
>  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
>  {
>        return &tg->cfs_bandwidth;
> @@ -9075,6 +9079,15 @@ static int tg_set_cfs_bandwidth(struct t
>                        unthrottle_cfs_rq(cfs_rq);
>                raw_spin_unlock_irq(&rq->lock);
>        }
> +
> +       if (runtime_enabled && !tg->runtime_enabled)
> +               jump_label_inc(&cfs_bandwidth_enabled);
> +
> +       if (!runtime_enabled && tg->runtime_enabled)
> +               jump_label_dec(&cfs_bandwidth_enabled);
> +
> +       tg->runtime_enabled = runtime_enabled;
> +
>  out_unlock:
>        mutex_unlock(&cfs_constraints_mutex);
>
> Index: linux-2.6/kernel/sched_fair.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched_fair.c
> +++ linux-2.6/kernel/sched_fair.c
> @@ -1410,10 +1410,10 @@ static void expire_cfs_rq_runtime(struct
>        }
>  }
>
> -static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
> +static __always_inline void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
>                unsigned long delta_exec)
>  {
> -       if (!cfs_rq->runtime_enabled)
> +       if (!static_branch(&cfs_bandwidth_enabled) || !cfs_rq->runtime_enabled)
>                return;
>
>        /* dock delta_exec before expiring quota (as it could span periods) */
> @@ -1433,13 +1433,13 @@ static void account_cfs_rq_runtime(struc
>
>  static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
>  {
> -       return cfs_rq->throttled;
> +       return static_branch(&cfs_bandwidth_enabled) && cfs_rq->throttled;
>  }
>
>  /* check whether cfs_rq, or any parent, is throttled */
>  static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
>  {
> -       return cfs_rq->throttle_count;
> +       return static_branch(&cfs_bandwidth_enabled) && cfs_rq->throttle_count;
>  }
>
>  /*
>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: jump_label defaults (was Re: [patch 00/17] CFS Bandwidth Control v7.1)
  2011-07-07 20:36                 ` jump_label defaults (was Re: [patch 00/17] CFS Bandwidth Control v7.1) Peter Zijlstra
@ 2011-07-08  9:20                   ` Peter Zijlstra
  2011-07-08 15:47                     ` Jason Baron
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2011-07-08  9:20 UTC (permalink / raw)
  To: Jason Baron
  Cc: Ingo Molnar, Paul Turner, linux-kernel, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Hidetoshi Seto,
	Pavel Emelyanov, Hu Tao, Mike Galbraith

On Thu, 2011-07-07 at 22:36 +0200, Peter Zijlstra wrote:
> Hrm,. I can't seem to make that work, damn CPP for not being
> recursive.

Ha! the wonders of sleep, ok I can make this part work.

So how do we write this static_branch_true() thing?

But then I realized that if we do something like:

static __always_inline bool arch_static_branch(struct jump_label_key *key)
{
        asm goto("1:"
                JUMP_LABEL_INITIAL_NOP
                ".pushsection __jump_table,  \"aw\" \n\t"
                _ASM_ALIGN "\n\t"
                _ASM_PTR "1b, %l[l_yes], %c0 \n\t"
                ".popsection \n\t"
                : :  "i" (key) : : l_yes);
        return false;
l_yes:
        return true;
}

Simply flipping the true and false in there isn't going to work, because
then its similar to !static_branch() and jump_label_inc() is going to
disable it.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch 00/17] CFS Bandwidth Control v7.1
  2011-07-08  7:39     ` Paul Turner
@ 2011-07-08 10:32       ` Peter Zijlstra
  2011-07-09  7:34         ` Paul Turner
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2011-07-08 10:32 UTC (permalink / raw)
  To: Paul Turner
  Cc: Ingo Molnar, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Pavel Emelyanov, Hu Tao

On Fri, 2011-07-08 at 00:39 -0700, Paul Turner wrote:
> 
> >  Going beyond that
> > would be using static_branch() to track if there is any bandwidth
> > tracking required at all.
> >
> 
> I spent some time examining this option as well.  Our toolchain
> apparently is stuck on gcc-4.4 which left me scratching my head at the
> supposed jump label assembly being omitted until I realized
> CC_HAS_ASM_GOTO was missing.  I will roll this up also and benchmark
> tomorrow. 

Ah, does it actually make things worse if it uses the static_branch
fallbacks? If so we should probably use some HAVE_JUMP_LABEL foo.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: jump_label defaults (was Re: [patch 00/17] CFS Bandwidth Control v7.1)
  2011-07-08  9:20                   ` Peter Zijlstra
@ 2011-07-08 15:47                     ` Jason Baron
  0 siblings, 0 replies; 47+ messages in thread
From: Jason Baron @ 2011-07-08 15:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Paul Turner, linux-kernel, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Hidetoshi Seto,
	Pavel Emelyanov, Hu Tao, Mike Galbraith

On Fri, Jul 08, 2011 at 11:20:32AM +0200, Peter Zijlstra wrote:
> On Thu, 2011-07-07 at 22:36 +0200, Peter Zijlstra wrote:
> > Hrm,. I can't seem to make that work, damn CPP for not being
> > recursive.
> 
> Ha! the wonders of sleep, ok I can make this part work.
> 
> So how do we write this static_branch_true() thing?
> 
> But then I realized that if we do something like:
> 
> static __always_inline bool arch_static_branch(struct jump_label_key *key)
> {
>         asm goto("1:"
>                 JUMP_LABEL_INITIAL_NOP
>                 ".pushsection __jump_table,  \"aw\" \n\t"
>                 _ASM_ALIGN "\n\t"
>                 _ASM_PTR "1b, %l[l_yes], %c0 \n\t"
>                 ".popsection \n\t"
>                 : :  "i" (key) : : l_yes);
>         return false;
> l_yes:
>         return true;
> }
> 
> Simply flipping the true and false in there isn't going to work, because
> then its similar to !static_branch() and jump_label_inc() is going to
> disable it.

right, you'd also have to take into account the original state of branch
in deciding whether or not to call jump_label_inc()/dec()


thanks,

-Jason

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch 00/17] CFS Bandwidth Control v7.1
  2011-07-08 10:32       ` Peter Zijlstra
@ 2011-07-09  7:34         ` Paul Turner
  2011-07-10 18:12           ` Ingo Molnar
  0 siblings, 1 reply; 47+ messages in thread
From: Paul Turner @ 2011-07-09  7:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Pavel Emelyanov, Hu Tao

On Fri, Jul 8, 2011 at 3:32 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Fri, 2011-07-08 at 00:39 -0700, Paul Turner wrote:
>>
>> >  Going beyond that
>> > would be using static_branch() to track if there is any bandwidth
>> > tracking required at all.
>> >
>>
>> I spent some time examining this option as well.  Our toolchain
>> apparently is stuck on gcc-4.4 which left me scratching my head at the
>> supposed jump label assembly being omitted until I realized
>> CC_HAS_ASM_GOTO was missing.  I will roll this up also and benchmark
>> tomorrow.
>
> Ah, does it actually make things worse if it uses the static_branch
> fallbacks? If so we should probably use some HAVE_JUMP_LABEL foo.
>

I started whittling at this today; the numbers so far on my hardware (i7
12-thread) are as follows.

Base performance with !CONFIG_CFS_BW:

Performance counter stats for './pipe-test-100k' (50 runs):

       893,486,206 instructions             #      1.063 IPC     ( +-   0.296% )
       840,904,951 cycles                     ( +-   0.359% )
       160,076,980 branches                   ( +-   0.305% )

        0.735022174  seconds time elapsed   ( +-   0.143% )



Original performance (v7.2):
                         cycles                   instructions             branches
----------------------------------------------------------------------------------------------------
base                     893,486,206              840,904,951              160,076,980
+unconstrained           929,244,021 (+4.00)      883,923,194 (+5.12)      167,131,228 (+4.41)
+10000000000/1000:       934,424,430 (+4.58)      875,605,677 (+4.13)      168,466,469 (+5.24)
+10000000000/10000:      940,048,385 (+5.21)      883,922,489 (+5.12)      169,512,329 (+5.89)
+10000000000/100000:     934,351,875 (+4.57)      888,878,742 (+5.71)      168,457,809 (+5.24)
+10000000000/1000000:    931,127,353 (+4.21)      874,830,745 (+4.03)      167,861,492 (+4.86)

The first step was fixing the missing inlining on update_curr().  This was a
major improvement.

Fix inlining on update_curr:
                         cycles                   instructions             branches
----------------------------------------------------------------------------------------------------
base                     893,486,206              840,904,951              160,076,980
+unconstrained           909,771,488 (+1.82)      850,091,039 (+1.09)      164,385,813 (+2.69)
+10000000000/1000:       915,384,142 (+2.45)      859,591,791 (+2.22)      165,616,386 (+3.46)
+10000000000/10000:      922,657,403 (+3.26)      865,701,436 (+2.95)      166,996,717 (+4.32)
+10000000000/100000:     928,636,540 (+3.93)      866,234,685 (+3.01)      168,111,517 (+5.02)
+10000000000/1000000:    922,311,143 (+3.23)      859,445,796 (+2.20)      166,922,517 (+4.28)

I also realized on the dequeue path we can shave a branch by reversing the
order of some of the conditionals.

In particular reordering (!runnable || !enabled) ---> (!enabled || !runnable).
The latter choice saves us a branch in the !enabled case when !runnable, and
has the same cost in the enabled case.
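
A minimal, self-contained sketch of the short-circuit effect behind that kind of reordering (the struct and helpers here are illustrative only, not the actual kernel condition):

#include <stdbool.h>
#include <stdio.h>

struct fake_cfs_rq {
	int runtime_enabled;	/* 0 for most groups */
	long nr_running;
};

/* original order: (!runnable || !enabled) */
static bool bail_old(const struct fake_cfs_rq *rq)
{
	return !rq->nr_running || !rq->runtime_enabled;
}

/* reordered: (!enabled || !runnable) decides on the first test for an
 * unconstrained group, without ever looking at nr_running */
static bool bail_new(const struct fake_cfs_rq *rq)
{
	return !rq->runtime_enabled || !rq->nr_running;
}

int main(void)
{
	struct fake_cfs_rq rq = { .runtime_enabled = 0, .nr_running = 1 };

	/* both return true here, but bail_new() evaluates only one operand */
	printf("%d %d\n", bail_old(&rq), bail_new(&rq));
	return 0;
}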

Speed up return_cfs_rq_runtime:
                         cycles                   instructions             branches
----------------------------------------------------------------------------------------------------
base                     893,486,206              840,904,951              160,076,980
+unconstrained           906,151,427 (+1.42)      877,497,749 (+4.35)      163,738,499 (+2.29)
+10000000000/1000:       910,284,839 (+1.88)      885,136,630 (+5.26)      164,804,085 (+2.95)
+10000000000/10000:      911,860,656 (+2.06)      891,433,792 (+6.01)      165,098,115 (+3.14)
+10000000000/100000:     913,062,037 (+2.19)      890,918,139 (+5.95)      165,327,113 (+3.28)
+10000000000/1000000:    920,966,554 (+3.08)      899,250,040 (+6.94)      166,813,750 (+4.21)

Finally introducing jump labels when there are no constrained groups claws back
a good portion of the remaining time.

Add jump labels:
                         cycles                   instructions             branches
----------------------------------------------------------------------------------------------------
base                     893,486,206              840,904,951              160,076,980
+unconstrained           900,477,543 (+0.78)      890,310,950 (+5.88)      161,037,844 (+0.60)
+10000000000/1000:       921,436,697 (+3.13)      919,362,792 (+9.33)      168,491,279 (+5.26)
+10000000000/10000:      907,214,638 (+1.54)      894,406,875 (+6.36)      165,743,207 (+3.54)
+10000000000/100000:     918,094,542 (+2.75)      910,211,234 (+8.24)      167,841,828 (+4.85)
+10000000000/1000000:    910,698,725 (+1.93)      885,385,460 (+5.29)      166,406,742 (+3.95)

There are some permutations on where we use jump labels that I have to finish
evaluating (including whether we want to skip the jump labels in the
!CC_HAS_ASM_GOTO case), as well as one or two other shavings that I am
looking at.  Will post v7.2 incorporating these speed-ups as well as some build
fixes for the !CONFIG_CGROUP case on Monday/Tuesday.

Thanks,

- Paul

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch 00/17] CFS Bandwidth Control v7.1
  2011-07-09  7:34         ` Paul Turner
@ 2011-07-10 18:12           ` Ingo Molnar
  0 siblings, 0 replies; 47+ messages in thread
From: Ingo Molnar @ 2011-07-10 18:12 UTC (permalink / raw)
  To: Paul Turner
  Cc: Peter Zijlstra, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Pavel Emelyanov, Hu Tao


* Paul Turner <pjt@google.com> wrote:

> Finally introducing jump labels when there are no constrained 
> groups claws back a good portion of the remaining time.
> 
> Add jump labels:
>                          cycles                   instructions             branches
> ----------------------------------------------------------------------------------------------------
> base                     893,486,206              840,904,951              160,076,980
> +unconstrained           900,477,543 (+0.78)      890,310,950 (+5.88)      161,037,844 (+0.60)
> +10000000000/1000:       921,436,697 (+3.13)      919,362,792 (+9.33)      168,491,279 (+5.26)
> +10000000000/10000:      907,214,638 (+1.54)      894,406,875 (+6.36)      165,743,207 (+3.54)
> +10000000000/100000:     918,094,542 (+2.75)      910,211,234 (+8.24)      167,841,828 (+4.85)
> +10000000000/1000000:    910,698,725 (+1.93)      885,385,460 (+5.29)      166,406,742 (+3.95)

That looks pretty promising!

The +5% instruction count still looks a tad high to me: if there are 
about 1000 instructions in this particular context-switch critical 
path then 5% means +50 instructions - a 'disabled' feature sure 
should not use that many instructions, right?

Also, i have a testing suggestion, i'd suggest to run:

	taskset 1 perf stat ...

to only measure while pinned on a single CPU. This will remove a lot 
of cross-CPU noise from the context switching overhead.

This is a valid way to progress because we are interested in the 
typical context-switch overhead on a single CPU - we know that 
there's no SMP cost when constraining is disabled.

Doing that should bring your measurement noise below the 0.1% range i 
suspect. As you are shaving off cycle after cycle i think you'll need 
that kind of measurement precision ...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch 00/17] CFS Bandwidth Control v7.1
  2011-07-07  5:30 [patch 00/17] CFS Bandwidth Control v7.1 Paul Turner
                   ` (19 preceding siblings ...)
  2011-07-07 14:06 ` Peter Zijlstra
@ 2011-07-11  1:22 ` Hu Tao
  20 siblings, 0 replies; 47+ messages in thread
From: Hu Tao @ 2011-07-11  1:22 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

On Wed, Jul 06, 2011 at 10:30:36PM -0700, Paul Turner wrote:
> Hi all,
> 
> Please find attached an incremental revision on v7 of bandwidth control.
> 
> The only real functional change is an improvement to update shares only as we
> leave a throttled state.  The remainder is largely refactoring, expansion of
> comments, and code clean-up.
> 
> Hidetoshi Seto and Hu Tao have been kind enough to run performance benchmarks
> against v7 measuring the scheduling path overheads versus pipe-test-100k.
> Results can be found at:
> 
> https://lkml.org/lkml/2011/6/24/10
> https://lkml.org/lkml/2011/7/4/347
> 
> The summary results (from Hu Tao's most recent run) are:
>                                             cycles                   instructions            branches
> -------------------------------------------------------------------------------------------------------------------
> base                                        7,526,317,497           8,666,579,347            1,771,078,445
> +patch, cgroup not enabled                  7,610,354,447 (1.12%)   8,569,448,982 (-1.12%)   1,751,675,193 (-0.11%)
> +patch, 10000000000/1000(quota/period)      7,856,873,327 (4.39%)   8,822,227,540 (1.80%)    1,801,766,182 (1.73%)
> +patch, 10000000000/10000(quota/period)     7,797,711,600 (3.61%)   8,754,747,746 (1.02%)    1,788,316,969 (0.97%)
> +patch, 10000000000/100000(quota/period)    7,777,784,384 (3.34%)   8,744,979,688 (0.90%)    1,786,319,566 (0.86%)
> +patch, 10000000000/1000000(quota/period)   7,802,382,802 (3.67%)   8,755,638,235 (1.03%)    1,788,601,070 (0.99%)
> ------------------------------------------------------------------------------------------------------------------

Hi Paul,

I'm sorry, these data were collected with a config that had some debug
options enabled.  I've re-tested with a proper config; see

https://lkml.org/lkml/2011/7/6/516

-- 
Thanks,
Hu Tao

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch 00/17] CFS Bandwidth Control v7.1
  2011-07-07 11:13 ` [patch 00/17] CFS Bandwidth Control v7.1 Peter Zijlstra
@ 2011-07-11  1:22   ` Hu Tao
  0 siblings, 0 replies; 47+ messages in thread
From: Hu Tao @ 2011-07-11  1:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

On Thu, Jul 07, 2011 at 01:13:45PM +0200, Peter Zijlstra wrote:
> On Wed, 2011-07-06 at 22:30 -0700, Paul Turner wrote:
> > base                                        7,526,317,497           8,666,579,347            1,771,078,445
> > +patch, cgroup not enabled                  7,610,354,447 (1.12%)   8,569,448,982 (-1.12%)   1,751,675,193 (-0.11%) 
> 
> "cgroup not enabled" means compiled in but not used at runtime, right?

Yes.

-- 
Thanks,
Hu Tao

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [tip:sched/core] sched: Don't update shares twice on on_rq parent
  2011-07-07  5:30 ` [patch 01/17] sched: (fixlet) dont update shares twice on on_rq parent Paul Turner
@ 2011-07-21 18:28   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 47+ messages in thread
From: tip-bot for Paul Turner @ 2011-07-21 18:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, pjt, tglx, mingo

Commit-ID:  9598c82dcacadc3b9daa8170613fd054c6124d30
Gitweb:     http://git.kernel.org/tip/9598c82dcacadc3b9daa8170613fd054c6124d30
Author:     Paul Turner <pjt@google.com>
AuthorDate: Wed, 6 Jul 2011 22:30:37 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Thu, 21 Jul 2011 18:01:44 +0200

sched: Don't update shares twice on on_rq parent

In dequeue_task_fair() we bail on dequeue when we encounter a parenting entity
with additional weight.  However, we perform a double shares update on this
entity as we continue the shares update traversal from this point, despite
dequeue_entity() having already updated its queuing cfs_rq.
Avoid this by starting from the parent when we resume.

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110707053059.797714697@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched_fair.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index f88720b..6cdff84 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1370,6 +1370,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 			 */
 			if (task_sleep && parent_entity(se))
 				set_next_buddy(parent_entity(se));
+
+			/* avoid re-evaluating load for this entity */
+			se = parent_entity(se);
 			break;
 		}
 		flags |= DEQUEUE_SLEEP;

^ permalink raw reply related	[flat|nested] 47+ messages in thread

end of thread

Thread overview: 47+ messages
2011-07-07  5:30 [patch 00/17] CFS Bandwidth Control v7.1 Paul Turner
2011-07-07  5:30 ` [patch 01/17] sched: (fixlet) dont update shares twice on on_rq parent Paul Turner
2011-07-21 18:28   ` [tip:sched/core] sched: Don't " tip-bot for Paul Turner
2011-07-07  5:30 ` [patch 02/17] sched: hierarchical task accounting for SCHED_OTHER Paul Turner
2011-07-07  5:30 ` [patch 03/17] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
2011-07-07 13:48   ` Peter Zijlstra
2011-07-07 21:30     ` Paul Turner
2011-07-07  5:30 ` [patch 04/17] sched: validate CFS quota hierarchies Paul Turner
2011-07-07  5:30 ` [patch 05/17] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth Paul Turner
2011-07-07  5:30 ` [patch 06/17] sched: add a timer to handle CFS bandwidth refresh Paul Turner
2011-07-07  5:30 ` [patch 07/17] sched: expire invalid runtime Paul Turner
2011-07-07  5:30 ` [patch 08/17] sched: add support for throttling group entities Paul Turner
2011-07-07  5:30 ` [patch 09/17] sched: add support for unthrottling " Paul Turner
2011-07-07  5:30 ` [patch 10/17] sched: allow for positional tg_tree walks Paul Turner
2011-07-07  5:30 ` [patch 11/17] sched: prevent interactions with throttled entities Paul Turner
2011-07-07  5:30 ` [patch 12/17] sched: prevent buddy " Paul Turner
2011-07-07  5:30 ` [patch 13/17] sched: migrate throttled tasks on HOTPLUG Paul Turner
2011-07-07  5:30 ` [patch 14/17] sched: throttle entities exceeding their allowed bandwidth Paul Turner
2011-07-07  5:30 ` [patch 15/17] sched: add exports tracking cfs bandwidth control statistics Paul Turner
2011-07-07  5:30 ` [patch 16/17] sched: return unused runtime on group dequeue Paul Turner
2011-07-07  5:30 ` [patch 17/17] sched: add documentation for bandwidth control Paul Turner
2011-07-07 11:13 ` [patch 00/17] CFS Bandwidth Control v7.1 Peter Zijlstra
2011-07-11  1:22   ` Hu Tao
2011-07-07 11:23 ` Ingo Molnar
2011-07-07 11:28   ` Peter Zijlstra
2011-07-07 14:38   ` Peter Zijlstra
2011-07-07 14:51     ` Ingo Molnar
2011-07-07 14:54       ` Peter Zijlstra
2011-07-07 14:56         ` Ingo Molnar
2011-07-07 16:23           ` Jason Baron
2011-07-07 17:20             ` Peter Zijlstra
2011-07-07 18:15               ` Jason Baron
2011-07-07 20:36                 ` jump_label defaults (was Re: [patch 00/17] CFS Bandwidth Control v7.1) Peter Zijlstra
2011-07-08  9:20                   ` Peter Zijlstra
2011-07-08 15:47                     ` Jason Baron
2011-07-07 16:52     ` [patch 00/17] CFS Bandwidth Control v7.1 Andi Kleen
2011-07-07 17:08       ` Peter Zijlstra
2011-07-07 17:59     ` Peter Zijlstra
2011-07-07 19:36       ` Jason Baron
2011-07-08  7:45       ` Paul Turner
2011-07-08  7:39     ` Paul Turner
2011-07-08 10:32       ` Peter Zijlstra
2011-07-09  7:34         ` Paul Turner
2011-07-10 18:12           ` Ingo Molnar
2011-07-07 14:06 ` Peter Zijlstra
2011-07-08  7:35   ` Paul Turner
2011-07-11  1:22 ` Hu Tao
