* [patch 00/18] CFS Bandwidth Control v7.2
@ 2011-07-21 16:43 Paul Turner
  2011-07-21 16:43 ` [patch 01/18] sched: (fixlet) dont update shares twice on on_rq parent Paul Turner
                   ` (20 more replies)
  0 siblings, 21 replies; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

Hi all,

Please find attached the incremental v7.2 for bandwidth control.

This release follows a fairly intensive period of scraping cycles across
various configurations.  Unfortunately we currently seem to be taking an IPC
hit for jump_labels (despite a savings in branches/instructions retired) for
which, despite fairly extensive digging, I don't have a good explanation.  The
emitted assembly /looks/ ok, but cycles/wall time is consistently higher across
several platforms.

As such I've demoted the jump-label patch to [RFT] while these details are
worked out, but there's no point in holding up the rest of the series any
longer.

[ Please find the specific discussion related to the above attached to patch 
17/18. ]

So -- without jump labels -- the current performance looks like:

                            instructions            cycles                  branches         
---------------------------------------------------------------------------------------------
clovertown [!BWC]           843695716               965744453               151224759        
+unconstrained              845934117 (+0.27)       974222228 (+0.88)       152715407 (+0.99)
+10000000000/1000:          855102086 (+1.35)       978728348 (+1.34)       154495984 (+2.16)
+10000000000/1000000:       853981660 (+1.22)       976344561 (+1.10)       154287243 (+2.03)

barcelona [!BWC]            810514902               761071312               145351489        
+unconstrained              820573353 (+1.24)       748178486 (-1.69)       148161233 (+1.93)
+10000000000/1000:          827963132 (+2.15)       757829815 (-0.43)       149611950 (+2.93)
+10000000000/1000000:       827701516 (+2.12)       753575001 (-0.98)       149568284 (+2.90)

westmere [!BWC]             792513879               702882443               143267136        
+unconstrained              802533191 (+1.26)       694415157 (-1.20)       146071233 (+1.96)
+10000000000/1000:          809861594 (+2.19)       701781996 (-0.16)       147520953 (+2.97)
+10000000000/1000000:       809752541 (+2.18)       705278419 (+0.34)       147502154 (+2.96)

Under the workload:
  mkdir -p /cgroup/cpu/test
  echo $$ > /cgroup/cpu/test/tasks    (only cpu,cpuacct mounted)
  (W1) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "for ((i=0;i<5;i++)); do $(dirname $0)/pipe-test 20000; done"

This may seem a strange workload but it works around some bizarro overheads
currently introduced by perf.  Comparing, for example, with:
  (W2) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;true"
  (W3) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;"


We see (instructions, cycles, branches, elapsed time):
 (W1)  westmere [!BWC]             792513879               702882443               143267136             0.197246943  
 (W2)  westmere [!BWC]             912241728               772576786               165734252             0.214923134  
 (W3)  westmere [!BWC]             904349725               882084726               162577399             0.748506065  

vs an 'ideal' total exec time of (approximately):
$ time taskset -c 0 ./pipe-test 100000
 real    0m0.198s   user    0m0.007s   sys     0m0.095s

The overhead in W2 is explained by the fact that, when pipe-test is invoked
directly, one of the siblings becomes the perf_ctx parent, which incurs a lot
of overhead every time we switch.  I do not have a reasonable explanation as to
why (W1) is so much cheaper than (W2); I stumbled across it by accident while
trying combinations to reduce the <perf stat>-to-<perf stat> variance.

v7.2
-----------
- Build errors in !CGROUP_SCHED case fixed
- !CONFIG_SMP now 'supported' (#ifdef munging)
- gcc was failing to inline account_cfs_rq_runtime, affecting performance
- checks in expire_cfs_rq_runtime() and check_enqueue_throttle() re-organized
  to save branches.
- jump labels introduced to reduce inert overhead in the case where BWC is not
  being used system-wide
- branch saved in expiring runtime (reorganized conditionals)

Hidetoshi, the following patches have changed enough to necessitate a fresh
look before your Reviewed-by can be carried forward:
[patch 09/18] sched: add support for unthrottling group entities (extensive)
[patch 11/18] sched: prevent interactions with throttled entities (update_cfs_shares)
[patch 12/18] sched: prevent buddy interactions with throttled entities (new)


Previous postings:
-----------------
v7.1: https://lkml.org/lkml/2011/7/7/24
v7: http://lkml.org/lkml/2011/6/21/43
v6: http://lkml.org/lkml/2011/5/7/37
v5: http://lkml.org/lkml/2011/3/22/477
v4: http://lkml.org/lkml/2011/2/23/44
v3: http://lkml.org/lkml/2010/10/12/44
v2: http://lkml.org/lkml/2010/4/28/88
Original posting: http://lkml.org/lkml/2010/2/12/393

Prior approaches: http://lkml.org/lkml/2010/1/5/44 ["CFS Hard limits v5"]

Thanks,

- Paul



* [patch 01/18] sched: (fixlet) dont update shares twice on on_rq parent
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-07-22 11:06   ` Kamalesh Babulal
  2011-07-21 16:43 ` [patch 02/18] sched: hierarchical task accounting for SCHED_OTHER Paul Turner
                   ` (19 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-fix_dequeue_task_buglet.patch --]
[-- Type: text/plain, Size: 898 bytes --]

In dequeue_task_fair() we stop the dequeue when we encounter a parenting entity
with additional weight.  However, we then perform a double shares update on
this entity, since the shares-update traversal continues from this point even
though dequeue_entity() has already updated its queuing cfs_rq.
Avoid this by resuming the traversal from the parent.

Signed-off-by: Paul Turner <pjt@google.com>
---
 kernel/sched_fair.c |    3 +++
 1 file changed, 3 insertions(+)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1370,6 +1370,9 @@ static void dequeue_task_fair(struct rq 
 			 */
 			if (task_sleep && parent_entity(se))
 				set_next_buddy(parent_entity(se));
+
+			/* avoid re-evaluating load for this entity */
+			se = parent_entity(se);
 			break;
 		}
 		flags |= DEQUEUE_SLEEP;




* [patch 02/18] sched: hierarchical task accounting for SCHED_OTHER
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
  2011-07-21 16:43 ` [patch 01/18] sched: (fixlet) dont update shares twice on on_rq parent Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-08-14 16:15   ` [tip:sched/core] sched: Implement " tip-bot for Paul Turner
  2011-07-21 16:43 ` [patch 03/18] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
                   ` (18 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-account_nr_running.patch --]
[-- Type: text/plain, Size: 4530 bytes --]

Introduce hierarchical task accounting for the group scheduling case in CFS,
and promote the responsibility for maintaining rq->nr_running into the
scheduling classes.

The primary motivation is that, with scheduling classes supporting bandwidth
throttling, it is possible for entities participating in throttled sub-trees to
produce no root-visible change in rq->nr_running across activate and
de-activate operations.  This in turn leads to incorrect idle and
weight-per-task load-balance decisions.

This also allows us to make a small fixlet to the fastpath in pick_next_task()
under group scheduling.

Note: this issue also exists with the existing sched_rt throttling mechanism.
This patch does not address that.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c          |    6 ++----
 kernel/sched_fair.c     |   10 ++++++++--
 kernel/sched_rt.c       |    5 ++++-
 kernel/sched_stoptask.c |    2 ++
 4 files changed, 16 insertions(+), 7 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -308,7 +308,7 @@ struct task_group root_task_group;
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
 	struct load_weight load;
-	unsigned long nr_running;
+	unsigned long nr_running, h_nr_running;
 
 	u64 exec_clock;
 	u64 min_vruntime;
@@ -1830,7 +1830,6 @@ static void activate_task(struct rq *rq,
 		rq->nr_uninterruptible--;
 
 	enqueue_task(rq, p, flags);
-	inc_nr_running(rq);
 }
 
 /*
@@ -1842,7 +1841,6 @@ static void deactivate_task(struct rq *r
 		rq->nr_uninterruptible++;
 
 	dequeue_task(rq, p, flags);
-	dec_nr_running(rq);
 }
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
@@ -4194,7 +4192,7 @@ pick_next_task(struct rq *rq)
 	 * Optimization: we know that if all tasks are in
 	 * the fair class we can call that function directly:
 	 */
-	if (likely(rq->nr_running == rq->cfs.nr_running)) {
+	if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
 		p = fair_sched_class.pick_next_task(rq);
 		if (likely(p))
 			return p;
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1332,16 +1332,19 @@ enqueue_task_fair(struct rq *rq, struct 
 			break;
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
+		cfs_rq->h_nr_running++;
 		flags = ENQUEUE_WAKEUP;
 	}
 
 	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		cfs_rq = cfs_rq_of(se);
+		cfs_rq->h_nr_running++;
 
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
+	inc_nr_running(rq);
 	hrtick_update(rq);
 }
 
@@ -1361,6 +1364,7 @@ static void dequeue_task_fair(struct rq 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);
+		cfs_rq->h_nr_running--;
 
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
@@ -1379,12 +1383,14 @@ static void dequeue_task_fair(struct rq 
 	}
 
 	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		cfs_rq = cfs_rq_of(se);
+		cfs_rq->h_nr_running--;
 
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
+	dec_nr_running(rq);
 	hrtick_update(rq);
 }
 
Index: tip/kernel/sched_rt.c
===================================================================
--- tip.orig/kernel/sched_rt.c
+++ tip/kernel/sched_rt.c
@@ -961,6 +961,8 @@ enqueue_task_rt(struct rq *rq, struct ta
 
 	if (!task_current(rq, p) && p->rt.nr_cpus_allowed > 1)
 		enqueue_pushable_task(rq, p);
+
+	inc_nr_running(rq);
 }
 
 static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
@@ -971,6 +973,8 @@ static void dequeue_task_rt(struct rq *r
 	dequeue_rt_entity(rt_se);
 
 	dequeue_pushable_task(rq, p);
+
+	dec_nr_running(rq);
 }
 
 /*
@@ -1863,4 +1867,3 @@ static void print_rt_stats(struct seq_fi
 	rcu_read_unlock();
 }
 #endif /* CONFIG_SCHED_DEBUG */
-
Index: tip/kernel/sched_stoptask.c
===================================================================
--- tip.orig/kernel/sched_stoptask.c
+++ tip/kernel/sched_stoptask.c
@@ -34,11 +34,13 @@ static struct task_struct *pick_next_tas
 static void
 enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
 {
+	inc_nr_running(rq);
 }
 
 static void
 dequeue_task_stop(struct rq *rq, struct task_struct *p, int flags)
 {
+	dec_nr_running(rq);
 }
 
 static void yield_task_stop(struct rq *rq)




* [patch 03/18] sched: introduce primitives to account for CFS bandwidth tracking
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
  2011-07-21 16:43 ` [patch 01/18] sched: (fixlet) dont update shares twice on on_rq parent Paul Turner
  2011-07-21 16:43 ` [patch 02/18] sched: hierarchical task accounting for SCHED_OTHER Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-07-22 11:14   ` Kamalesh Babulal
  2011-08-14 16:17   ` [tip:sched/core] sched: Introduce " tip-bot for Paul Turner
  2011-07-21 16:43 ` [patch 04/18] sched: validate CFS quota hierarchies Paul Turner
                   ` (17 subsequent siblings)
  20 siblings, 2 replies; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron,
	Nikhil Rao

[-- Attachment #1: sched-bwc-add_cfs_tg_bandwidth.patch --]
[-- Type: text/plain, Size: 10048 bytes --]

In this patch we introduce the notion of CFS bandwidth, partitioned into
globally unassigned bandwidth and locally claimed bandwidth.

- The global bandwidth is per task_group; it represents a pool of unclaimed
  bandwidth that cfs_rqs can allocate from.
- The local bandwidth is tracked per-cfs_rq; it represents allotments from the
  global pool assigned to a specific cpu.

Bandwidth is managed via cgroupfs by adding two new interfaces to the cpu
subsystem:
- cpu.cfs_period_us : the bandwidth period in usecs
- cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be allowed
  to consume over the period above.
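
As a usage sketch (the /cgroup/cpu mount point and the group name 'test' are
arbitrary), giving a group roughly half a cpu of bandwidth looks like:

  mkdir -p /cgroup/cpu/test
  # 50ms of quota every 100ms period => ~0.5 cpu
  echo 100000 > /cgroup/cpu/test/cpu.cfs_period_us
  echo 50000  > /cgroup/cpu/test/cpu.cfs_quota_us
  # a quota of -1 (the default) leaves the group unconstrained
  echo -1     > /cgroup/cpu/test/cpu.cfs_quota_us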

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 init/Kconfig        |   12 +++
 kernel/sched.c      |  196 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched_fair.c |   16 ++++
 3 files changed, 220 insertions(+), 4 deletions(-)

Index: tip/init/Kconfig
===================================================================
--- tip.orig/init/Kconfig
+++ tip/init/Kconfig
@@ -715,6 +715,18 @@ config FAIR_GROUP_SCHED
 	depends on CGROUP_SCHED
 	default CGROUP_SCHED
 
+config CFS_BANDWIDTH
+	bool "CPU bandwidth provisioning for FAIR_GROUP_SCHED"
+	depends on EXPERIMENTAL
+	depends on FAIR_GROUP_SCHED
+	default n
+	help
+	  This option allows users to define CPU bandwidth rates (limits) for
+	  tasks running within the fair group scheduler.  Groups with no limit
+	  set are considered to be unconstrained and will run with no
+	  restriction.
+	  See tip/Documentation/scheduler/sched-bwc.txt for more information.
+
 config RT_GROUP_SCHED
 	bool "Group scheduling for SCHED_RR/FIFO"
 	depends on EXPERIMENTAL
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -244,6 +244,14 @@ struct cfs_rq;
 
 static LIST_HEAD(task_groups);
 
+struct cfs_bandwidth {
+#ifdef CONFIG_CFS_BANDWIDTH
+	raw_spinlock_t lock;
+	ktime_t period;
+	u64 quota;
+#endif
+};
+
 /* task group related information */
 struct task_group {
 	struct cgroup_subsys_state css;
@@ -275,6 +283,8 @@ struct task_group {
 #ifdef CONFIG_SCHED_AUTOGROUP
 	struct autogroup *autogroup;
 #endif
+
+	struct cfs_bandwidth cfs_bandwidth;
 };
 
 /* task_group_lock serializes the addition/removal of task groups */
@@ -374,9 +384,48 @@ struct cfs_rq {
 
 	unsigned long load_contribution;
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	int runtime_enabled;
+	s64 runtime_remaining;
+#endif
 #endif
 };
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+#ifdef CONFIG_CFS_BANDWIDTH
+static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
+{
+	return &tg->cfs_bandwidth;
+}
+
+static inline u64 default_cfs_period(void);
+
+static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+	raw_spin_lock_init(&cfs_b->lock);
+	cfs_b->quota = RUNTIME_INF;
+	cfs_b->period = ns_to_ktime(default_cfs_period());
+}
+
+static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	cfs_rq->runtime_enabled = 0;
+}
+
+static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{}
+#else
+static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
+static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
+static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
+
+static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
+{
+	return NULL;
+}
+#endif /* CONFIG_CFS_BANDWIDTH */
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
 /* Real-Time classes' related field in a runqueue: */
 struct rt_rq {
 	struct rt_prio_array active;
@@ -7795,6 +7844,7 @@ static void init_tg_cfs_entry(struct tas
 	tg->cfs_rq[cpu] = cfs_rq;
 	init_cfs_rq(cfs_rq, rq);
 	cfs_rq->tg = tg;
+	init_cfs_rq_runtime(cfs_rq);
 
 	tg->se[cpu] = se;
 	/* se could be NULL for root_task_group */
@@ -7930,6 +7980,7 @@ void __init sched_init(void)
 		 * We achieve this by letting root_task_group's tasks sit
 		 * directly in rq->cfs (i.e root_task_group->se[] = NULL).
 		 */
+		init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
 		init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -8171,6 +8222,8 @@ static void free_fair_sched_group(struct
 {
 	int i;
 
+	destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
+
 	for_each_possible_cpu(i) {
 		if (tg->cfs_rq)
 			kfree(tg->cfs_rq[i]);
@@ -8198,6 +8251,8 @@ int alloc_fair_sched_group(struct task_g
 
 	tg->shares = NICE_0_LOAD;
 
+	init_cfs_bandwidth(tg_cfs_bandwidth(tg));
+
 	for_each_possible_cpu(i) {
 		cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
 				      GFP_KERNEL, cpu_to_node(i));
@@ -8569,7 +8624,7 @@ static int __rt_schedulable(struct task_
 	return walk_tg_tree(tg_schedulable, tg_nop, &data);
 }
 
-static int tg_set_bandwidth(struct task_group *tg,
+static int tg_set_rt_bandwidth(struct task_group *tg,
 		u64 rt_period, u64 rt_runtime)
 {
 	int i, err = 0;
@@ -8608,7 +8663,7 @@ int sched_group_set_rt_runtime(struct ta
 	if (rt_runtime_us < 0)
 		rt_runtime = RUNTIME_INF;
 
-	return tg_set_bandwidth(tg, rt_period, rt_runtime);
+	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
 
 long sched_group_rt_runtime(struct task_group *tg)
@@ -8633,7 +8688,7 @@ int sched_group_set_rt_period(struct tas
 	if (rt_period == 0)
 		return -EINVAL;
 
-	return tg_set_bandwidth(tg, rt_period, rt_runtime);
+	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
 
 long sched_group_rt_period(struct task_group *tg)
@@ -8823,6 +8878,128 @@ static u64 cpu_shares_read_u64(struct cg
 
 	return (u64) scale_load_down(tg->shares);
 }
+
+#ifdef CONFIG_CFS_BANDWIDTH
+const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
+const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
+
+static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
+{
+	int i;
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+	static DEFINE_MUTEX(mutex);
+
+	if (tg == &root_task_group)
+		return -EINVAL;
+
+	/*
+	 * Ensure we have at some amount of bandwidth every period.  This is
+	 * to prevent reaching a state of large arrears when throttled via
+	 * entity_tick() resulting in prolonged exit starvation.
+	 */
+	if (quota < min_cfs_quota_period || period < min_cfs_quota_period)
+		return -EINVAL;
+
+	/*
+	 * Likewise, bound things on the otherside by preventing insane quota
+	 * periods.  This also allows us to normalize in computing quota
+	 * feasibility.
+	 */
+	if (period > max_cfs_quota_period)
+		return -EINVAL;
+
+	mutex_lock(&mutex);
+	raw_spin_lock_irq(&cfs_b->lock);
+	cfs_b->period = ns_to_ktime(period);
+	cfs_b->quota = quota;
+	raw_spin_unlock_irq(&cfs_b->lock);
+
+	for_each_possible_cpu(i) {
+		struct cfs_rq *cfs_rq = tg->cfs_rq[i];
+		struct rq *rq = rq_of(cfs_rq);
+
+		raw_spin_lock_irq(&rq->lock);
+		cfs_rq->runtime_enabled = quota != RUNTIME_INF;
+		cfs_rq->runtime_remaining = 0;
+		raw_spin_unlock_irq(&rq->lock);
+	}
+	mutex_unlock(&mutex);
+
+	return 0;
+}
+
+int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
+{
+	u64 quota, period;
+
+	period = ktime_to_ns(tg_cfs_bandwidth(tg)->period);
+	if (cfs_quota_us < 0)
+		quota = RUNTIME_INF;
+	else
+		quota = (u64)cfs_quota_us * NSEC_PER_USEC;
+
+	return tg_set_cfs_bandwidth(tg, period, quota);
+}
+
+long tg_get_cfs_quota(struct task_group *tg)
+{
+	u64 quota_us;
+
+	if (tg_cfs_bandwidth(tg)->quota == RUNTIME_INF)
+		return -1;
+
+	quota_us = tg_cfs_bandwidth(tg)->quota;
+	do_div(quota_us, NSEC_PER_USEC);
+
+	return quota_us;
+}
+
+int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
+{
+	u64 quota, period;
+
+	period = (u64)cfs_period_us * NSEC_PER_USEC;
+	quota = tg_cfs_bandwidth(tg)->quota;
+
+	if (period <= 0)
+		return -EINVAL;
+
+	return tg_set_cfs_bandwidth(tg, period, quota);
+}
+
+long tg_get_cfs_period(struct task_group *tg)
+{
+	u64 cfs_period_us;
+
+	cfs_period_us = ktime_to_ns(tg_cfs_bandwidth(tg)->period);
+	do_div(cfs_period_us, NSEC_PER_USEC);
+
+	return cfs_period_us;
+}
+
+static s64 cpu_cfs_quota_read_s64(struct cgroup *cgrp, struct cftype *cft)
+{
+	return tg_get_cfs_quota(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_quota_write_s64(struct cgroup *cgrp, struct cftype *cftype,
+				s64 cfs_quota_us)
+{
+	return tg_set_cfs_quota(cgroup_tg(cgrp), cfs_quota_us);
+}
+
+static u64 cpu_cfs_period_read_u64(struct cgroup *cgrp, struct cftype *cft)
+{
+	return tg_get_cfs_period(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_period_write_u64(struct cgroup *cgrp, struct cftype *cftype,
+				u64 cfs_period_us)
+{
+	return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
+}
+
+#endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -8857,6 +9034,18 @@ static struct cftype cpu_files[] = {
 		.write_u64 = cpu_shares_write_u64,
 	},
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		.name = "cfs_quota_us",
+		.read_s64 = cpu_cfs_quota_read_s64,
+		.write_s64 = cpu_cfs_quota_write_s64,
+	},
+	{
+		.name = "cfs_period_us",
+		.read_u64 = cpu_cfs_period_read_u64,
+		.write_u64 = cpu_cfs_period_write_u64,
+	},
+#endif
 #ifdef CONFIG_RT_GROUP_SCHED
 	{
 		.name = "rt_runtime_us",
@@ -9166,4 +9355,3 @@ struct cgroup_subsys cpuacct_subsys = {
 	.subsys_id = cpuacct_subsys_id,
 };
 #endif	/* CONFIG_CGROUP_CPUACCT */
-
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1256,6 +1256,22 @@ entity_tick(struct cfs_rq *cfs_rq, struc
 		check_preempt_tick(cfs_rq, curr);
 }
 
+
+/**************************************************
+ * CFS bandwidth control machinery
+ */
+
+#ifdef CONFIG_CFS_BANDWIDTH
+/*
+ * default period for cfs group bandwidth.
+ * default: 0.1s, units: nanoseconds
+ */
+static inline u64 default_cfs_period(void)
+{
+	return 100000000ULL;
+}
+#endif
+
 /**************************************************
  * CFS operations on tasks:
  */




* [patch 04/18] sched: validate CFS quota hierarchies
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (2 preceding siblings ...)
  2011-07-21 16:43 ` [patch 03/18] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-08-14 16:19   ` [tip:sched/core] sched: Validate " tip-bot for Paul Turner
  2011-07-21 16:43 ` [patch 05/18] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth Paul Turner
                   ` (16 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-consistent_quota.patch --]
[-- Type: text/plain, Size: 5712 bytes --]

Add constraint validation for CFS bandwidth hierarchies.

Validate that:
   max(child bandwidth) <= parent_bandwidth

In a quota limited hierarchy, an unconstrained entity
(e.g. bandwidth==RUNTIME_INF) inherits the bandwidth of its parent.

This constraint is chosen over sum(child_bandwidth) because the notion of
over-commit is valuable within SCHED_OTHER.  Some basic code from the RT case
is re-factored for reuse.
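
As a hypothetical illustration (group names and the /cgroup/cpu mount point are
arbitrary), with bandwidth compared as the normalized ratio quota/period:

  # parent: 50000us quota per 100000us period => 0.5 cpu
  echo 100000 > /cgroup/cpu/parent/cpu.cfs_period_us
  echo 50000  > /cgroup/cpu/parent/cpu.cfs_quota_us
  mkdir /cgroup/cpu/parent/child
  # child ratio 60000/100000 = 0.6 > 0.5 => rejected with -EINVAL
  echo 60000  > /cgroup/cpu/parent/child/cpu.cfs_quota_us
  # an unconstrained child (-1) simply inherits the parent's 0.5 limit
  echo -1     > /cgroup/cpu/parent/child/cpu.cfs_quota_us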

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched.c |  112 +++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 98 insertions(+), 14 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -249,6 +249,7 @@ struct cfs_bandwidth {
 	raw_spinlock_t lock;
 	ktime_t period;
 	u64 quota;
+	s64 hierarchal_quota;
 #endif
 };
 
@@ -1510,7 +1511,8 @@ static inline void dec_cpu_load(struct r
 	update_load_sub(&rq->load, load);
 }
 
-#if (defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)) || defined(CONFIG_RT_GROUP_SCHED)
+#if defined(CONFIG_RT_GROUP_SCHED) || (defined(CONFIG_FAIR_GROUP_SCHED) && \
+			(defined(CONFIG_SMP) || defined(CONFIG_CFS_BANDWIDTH)))
 typedef int (*tg_visitor)(struct task_group *, void *);
 
 /*
@@ -8522,12 +8524,7 @@ unsigned long sched_group_shares(struct 
 }
 #endif
 
-#ifdef CONFIG_RT_GROUP_SCHED
-/*
- * Ensure that the real time constraints are schedulable.
- */
-static DEFINE_MUTEX(rt_constraints_mutex);
-
+#if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_CFS_BANDWIDTH)
 static unsigned long to_ratio(u64 period, u64 runtime)
 {
 	if (runtime == RUNTIME_INF)
@@ -8535,6 +8532,13 @@ static unsigned long to_ratio(u64 period
 
 	return div64_u64(runtime << 20, period);
 }
+#endif
+
+#ifdef CONFIG_RT_GROUP_SCHED
+/*
+ * Ensure that the real time constraints are schedulable.
+ */
+static DEFINE_MUTEX(rt_constraints_mutex);
 
 /* Must be called with tasklist_lock held */
 static inline int tg_has_rt_tasks(struct task_group *tg)
@@ -8555,7 +8559,7 @@ struct rt_schedulable_data {
 	u64 rt_runtime;
 };
 
-static int tg_schedulable(struct task_group *tg, void *data)
+static int tg_rt_schedulable(struct task_group *tg, void *data)
 {
 	struct rt_schedulable_data *d = data;
 	struct task_group *child;
@@ -8619,7 +8623,7 @@ static int __rt_schedulable(struct task_
 		.rt_runtime = runtime,
 	};
 
-	return walk_tg_tree(tg_schedulable, tg_nop, &data);
+	return walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
 }
 
 static int tg_set_rt_bandwidth(struct task_group *tg,
@@ -8878,14 +8882,17 @@ static u64 cpu_shares_read_u64(struct cg
 }
 
 #ifdef CONFIG_CFS_BANDWIDTH
+static DEFINE_MUTEX(cfs_constraints_mutex);
+
 const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
 const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
 
+static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
+
 static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 {
-	int i;
+	int i, ret = 0;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	static DEFINE_MUTEX(mutex);
 
 	if (tg == &root_task_group)
 		return -EINVAL;
@@ -8906,7 +8913,11 @@ static int tg_set_cfs_bandwidth(struct t
 	if (period > max_cfs_quota_period)
 		return -EINVAL;
 
-	mutex_lock(&mutex);
+	mutex_lock(&cfs_constraints_mutex);
+	ret = __cfs_schedulable(tg, period, quota);
+	if (ret)
+		goto out_unlock;
+
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
@@ -8921,9 +8932,10 @@ static int tg_set_cfs_bandwidth(struct t
 		cfs_rq->runtime_remaining = 0;
 		raw_spin_unlock_irq(&rq->lock);
 	}
-	mutex_unlock(&mutex);
+out_unlock:
+	mutex_unlock(&cfs_constraints_mutex);
 
-	return 0;
+	return ret;
 }
 
 int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
@@ -8997,6 +9009,78 @@ static int cpu_cfs_period_write_u64(stru
 	return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
 }
 
+struct cfs_schedulable_data {
+	struct task_group *tg;
+	u64 period, quota;
+};
+
+/*
+ * normalize group quota/period to be quota/max_period
+ * note: units are usecs
+ */
+static u64 normalize_cfs_quota(struct task_group *tg,
+			       struct cfs_schedulable_data *d)
+{
+	u64 quota, period;
+
+	if (tg == d->tg) {
+		period = d->period;
+		quota = d->quota;
+	} else {
+		period = tg_get_cfs_period(tg);
+		quota = tg_get_cfs_quota(tg);
+	}
+
+	/* note: these should typically be equivalent */
+	if (quota == RUNTIME_INF || quota == -1)
+		return RUNTIME_INF;
+
+	return to_ratio(period, quota);
+}
+
+static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
+{
+	struct cfs_schedulable_data *d = data;
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+	s64 quota = 0, parent_quota = -1;
+
+	if (!tg->parent) {
+		quota = RUNTIME_INF;
+	} else {
+		struct cfs_bandwidth *parent_b = tg_cfs_bandwidth(tg->parent);
+
+		quota = normalize_cfs_quota(tg, d);
+		parent_quota = parent_b->hierarchal_quota;
+
+		/*
+		 * ensure max(child_quota) <= parent_quota, inherit when no
+		 * limit is set
+		 */
+		if (quota == RUNTIME_INF)
+			quota = parent_quota;
+		else if (parent_quota != RUNTIME_INF && quota > parent_quota)
+			return -EINVAL;
+	}
+	cfs_b->hierarchal_quota = quota;
+
+	return 0;
+}
+
+static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
+{
+	struct cfs_schedulable_data data = {
+		.tg = tg,
+		.period = period,
+		.quota = quota,
+	};
+
+	if (quota != RUNTIME_INF) {
+		do_div(data.period, NSEC_PER_USEC);
+		do_div(data.quota, NSEC_PER_USEC);
+	}
+
+	return walk_tg_tree(tg_cfs_schedulable_down, tg_nop, &data);
+}
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 




* [patch 05/18] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (3 preceding siblings ...)
  2011-07-21 16:43 ` [patch 04/18] sched: validate CFS quota hierarchies Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-08-14 16:21   ` [tip:sched/core] sched: Accumulate " tip-bot for Paul Turner
  2011-07-21 16:43 ` [patch 06/18] sched: add a timer to handle CFS bandwidth refresh Paul Turner
                   ` (15 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron,
	Nikhil Rao

[-- Attachment #1: sched-bwc-account_cfs_rq_runtime.patch --]
[-- Type: text/plain, Size: 6404 bytes --]

Account bandwidth usage at the cfs_rq level rather than against the task_groups
to which they belong.  Whether we are tracking bandwidth on a given cfs_rq is
maintained under cfs_rq->runtime_enabled.

cfs_rqs which belong to a bandwidth-constrained task_group have their runtime
accounted via the update_curr() path, which withdraws bandwidth from the global
pool as needed.  Updates involving the global pool are currently protected
under cfs_bandwidth->lock; local runtime is protected by rq->lock.

This patch only assigns and tracks quota; no action is taken in the case that
cfs_rq->runtime_used exceeds cfs_rq->runtime_assigned.
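
The amount transferred from the global pool to a cfs_rq on each local refill is
governed by the new sysctl added below (default: 5ms).  For example, assuming
the usual procfs layout (values in usecs):

  cat /proc/sys/kernel/sched_cfs_bandwidth_slice_us          # default: 5000
  echo 10000 > /proc/sys/kernel/sched_cfs_bandwidth_slice_us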

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>


---
 include/linux/sched.h |    4 ++
 kernel/sched.c        |    4 +-
 kernel/sched_fair.c   |   79 ++++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sysctl.c       |   10 ++++++
 4 files changed, 94 insertions(+), 3 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -89,6 +89,20 @@ const_debug unsigned int sysctl_sched_mi
  */
 unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
 
+#ifdef CONFIG_CFS_BANDWIDTH
+/*
+ * Amount of runtime to allocate from global (tg) to local (per-cfs_rq) pool
+ * each time a cfs_rq requests quota.
+ *
+ * Note: in the case that the slice exceeds the runtime remaining (either due
+ * to consumption or the quota being specified to be smaller than the slice)
+ * we will always only issue the remaining available time.
+ *
+ * default: 5 msec, units: microseconds
+  */
+unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
+#endif
+
 static const struct sched_class fair_sched_class;
 
 /**************************************************************
@@ -305,6 +319,8 @@ find_matching_se(struct sched_entity **s
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+				   unsigned long delta_exec);
 
 /**************************************************************
  * Scheduling class tree data structure manipulation methods:
@@ -602,6 +618,8 @@ static void update_curr(struct cfs_rq *c
 		cpuacct_charge(curtask, delta_exec);
 		account_group_exec_runtime(curtask, delta_exec);
 	}
+
+	account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
 
 static inline void
@@ -1270,6 +1288,58 @@ static inline u64 default_cfs_period(voi
 {
 	return 100000000ULL;
 }
+
+static inline u64 sched_cfs_bandwidth_slice(void)
+{
+	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
+}
+
+static void assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	struct task_group *tg = cfs_rq->tg;
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+	u64 amount = 0, min_amount;
+
+	/* note: this is a positive sum as runtime_remaining <= 0 */
+	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
+
+	raw_spin_lock(&cfs_b->lock);
+	if (cfs_b->quota == RUNTIME_INF)
+		amount = min_amount;
+	else if (cfs_b->runtime > 0) {
+		amount = min(cfs_b->runtime, min_amount);
+		cfs_b->runtime -= amount;
+	}
+	raw_spin_unlock(&cfs_b->lock);
+
+	cfs_rq->runtime_remaining += amount;
+}
+
+static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+				     unsigned long delta_exec)
+{
+	if (!cfs_rq->runtime_enabled)
+		return;
+
+	cfs_rq->runtime_remaining -= delta_exec;
+	if (cfs_rq->runtime_remaining > 0)
+		return;
+
+	assign_cfs_rq_runtime(cfs_rq);
+}
+
+static __always_inline void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+						   unsigned long delta_exec)
+{
+	if (!cfs_rq->runtime_enabled)
+		return;
+
+	__account_cfs_rq_runtime(cfs_rq, delta_exec);
+}
+
+#else
+static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+				     unsigned long delta_exec) {}
 #endif
 
 /**************************************************
@@ -4262,8 +4332,13 @@ static void set_curr_task_fair(struct rq
 {
 	struct sched_entity *se = &rq->curr->se;
 
-	for_each_sched_entity(se)
-		set_next_entity(cfs_rq_of(se), se);
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		set_next_entity(cfs_rq, se);
+		/* ensure bandwidth has been allocated on our new cfs_rq */
+		account_cfs_rq_runtime(cfs_rq, 0);
+	}
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
Index: tip/kernel/sysctl.c
===================================================================
--- tip.orig/kernel/sysctl.c
+++ tip/kernel/sysctl.c
@@ -379,6 +379,16 @@ static struct ctl_table kern_table[] = {
 		.extra2		= &one,
 	},
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		.procname	= "sched_cfs_bandwidth_slice_us",
+		.data		= &sysctl_sched_cfs_bandwidth_slice,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &one,
+	},
+#endif
 #ifdef CONFIG_PROVE_LOCKING
 	{
 		.procname	= "prove_locking",
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -248,7 +248,7 @@ struct cfs_bandwidth {
 #ifdef CONFIG_CFS_BANDWIDTH
 	raw_spinlock_t lock;
 	ktime_t period;
-	u64 quota;
+	u64 quota, runtime;
 	s64 hierarchal_quota;
 #endif
 };
@@ -403,6 +403,7 @@ static inline u64 default_cfs_period(voi
 static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
 	raw_spin_lock_init(&cfs_b->lock);
+	cfs_b->runtime = 0;
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 }
@@ -8921,6 +8922,7 @@ static int tg_set_cfs_bandwidth(struct t
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
+	cfs_b->runtime = quota;
 	raw_spin_unlock_irq(&cfs_b->lock);
 
 	for_each_possible_cpu(i) {
Index: tip/include/linux/sched.h
===================================================================
--- tip.orig/include/linux/sched.h
+++ tip/include/linux/sched.h
@@ -2012,6 +2012,10 @@ static inline void sched_autogroup_fork(
 static inline void sched_autogroup_exit(struct signal_struct *sig) { }
 #endif
 
+#ifdef CONFIG_CFS_BANDWIDTH
+extern unsigned int sysctl_sched_cfs_bandwidth_slice;
+#endif
+
 #ifdef CONFIG_RT_MUTEXES
 extern int rt_mutex_getprio(struct task_struct *p);
 extern void rt_mutex_setprio(struct task_struct *p, int prio);




* [patch 06/18] sched: add a timer to handle CFS bandwidth refresh
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (4 preceding siblings ...)
  2011-07-21 16:43 ` [patch 05/18] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-08-14 16:23   ` [tip:sched/core] sched: Add " tip-bot for Paul Turner
  2011-07-21 16:43 ` [patch 07/18] sched: expire invalid runtime Paul Turner
                   ` (14 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-bandwidth_timers.patch --]
[-- Type: text/plain, Size: 7522 bytes --]

This patch adds a per-task_group timer which handles the refresh of the global
CFS bandwidth pool.

Since the RT pool is using a similar timer there's some small refactoring to
share this support.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c      |  107 +++++++++++++++++++++++++++++++++++++++++-----------
 kernel/sched_fair.c |   40 +++++++++++++++++--
 2 files changed, 123 insertions(+), 24 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -193,10 +193,28 @@ static inline int rt_bandwidth_enabled(v
 	return sysctl_sched_rt_runtime >= 0;
 }
 
-static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
+static void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
 {
-	ktime_t now;
+	unsigned long delta;
+	ktime_t soft, hard, now;
+
+	for (;;) {
+		if (hrtimer_active(period_timer))
+			break;
+
+		now = hrtimer_cb_get_time(period_timer);
+		hrtimer_forward(period_timer, now, period);
 
+		soft = hrtimer_get_softexpires(period_timer);
+		hard = hrtimer_get_expires(period_timer);
+		delta = ktime_to_ns(ktime_sub(hard, soft));
+		__hrtimer_start_range_ns(period_timer, soft, delta,
+					 HRTIMER_MODE_ABS_PINNED, 0);
+	}
+}
+
+static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
+{
 	if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
 		return;
 
@@ -204,22 +222,7 @@ static void start_rt_bandwidth(struct rt
 		return;
 
 	raw_spin_lock(&rt_b->rt_runtime_lock);
-	for (;;) {
-		unsigned long delta;
-		ktime_t soft, hard;
-
-		if (hrtimer_active(&rt_b->rt_period_timer))
-			break;
-
-		now = hrtimer_cb_get_time(&rt_b->rt_period_timer);
-		hrtimer_forward(&rt_b->rt_period_timer, now, rt_b->rt_period);
-
-		soft = hrtimer_get_softexpires(&rt_b->rt_period_timer);
-		hard = hrtimer_get_expires(&rt_b->rt_period_timer);
-		delta = ktime_to_ns(ktime_sub(hard, soft));
-		__hrtimer_start_range_ns(&rt_b->rt_period_timer, soft, delta,
-				HRTIMER_MODE_ABS_PINNED, 0);
-	}
+	start_bandwidth_timer(&rt_b->rt_period_timer, rt_b->rt_period);
 	raw_spin_unlock(&rt_b->rt_runtime_lock);
 }
 
@@ -250,6 +253,9 @@ struct cfs_bandwidth {
 	ktime_t period;
 	u64 quota, runtime;
 	s64 hierarchal_quota;
+
+	int idle, timer_active;
+	struct hrtimer period_timer;
 #endif
 };
 
@@ -399,6 +405,28 @@ static inline struct cfs_bandwidth *tg_c
 }
 
 static inline u64 default_cfs_period(void);
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
+
+static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
+{
+	struct cfs_bandwidth *cfs_b =
+		container_of(timer, struct cfs_bandwidth, period_timer);
+	ktime_t now;
+	int overrun;
+	int idle = 0;
+
+	for (;;) {
+		now = hrtimer_cb_get_time(timer);
+		overrun = hrtimer_forward(timer, now, cfs_b->period);
+
+		if (!overrun)
+			break;
+
+		idle = do_sched_cfs_period_timer(cfs_b, overrun);
+	}
+
+	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
+}
 
 static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
@@ -406,6 +434,9 @@ static void init_cfs_bandwidth(struct cf
 	cfs_b->runtime = 0;
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
+
+	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	cfs_b->period_timer.function = sched_cfs_period_timer;
 }
 
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
@@ -413,8 +444,34 @@ static void init_cfs_rq_runtime(struct c
 	cfs_rq->runtime_enabled = 0;
 }
 
+/* requires cfs_b->lock, may release to reprogram timer */
+static void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+	/*
+	 * The timer may be active because we're trying to set a new bandwidth
+	 * period or because we're racing with the tear-down path
+	 * (timer_active==0 becomes visible before the hrtimer call-back
+	 * terminates).  In either case we ensure that it's re-programmed
+	 */
+	while (unlikely(hrtimer_active(&cfs_b->period_timer))) {
+		raw_spin_unlock(&cfs_b->lock);
+		/* ensure cfs_b->lock is available while we wait */
+		hrtimer_cancel(&cfs_b->period_timer);
+
+		raw_spin_lock(&cfs_b->lock);
+		/* if someone else restarted the timer then we're done */
+		if (cfs_b->timer_active)
+			return;
+	}
+
+	cfs_b->timer_active = 1;
+	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
+}
+
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
-{}
+{
+	hrtimer_cancel(&cfs_b->period_timer);
+}
 #else
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
@@ -8892,7 +8949,7 @@ static int __cfs_schedulable(struct task
 
 static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 {
-	int i, ret = 0;
+	int i, ret = 0, runtime_enabled;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
 
 	if (tg == &root_task_group)
@@ -8919,10 +8976,18 @@ static int tg_set_cfs_bandwidth(struct t
 	if (ret)
 		goto out_unlock;
 
+	runtime_enabled = quota != RUNTIME_INF;
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
 	cfs_b->runtime = quota;
+
+	/* restart the period timer (if active) to handle new period expiry */
+	if (runtime_enabled && cfs_b->timer_active) {
+		/* force a reprogram */
+		cfs_b->timer_active = 0;
+		__start_cfs_bandwidth(cfs_b);
+	}
 	raw_spin_unlock_irq(&cfs_b->lock);
 
 	for_each_possible_cpu(i) {
@@ -8930,7 +8995,7 @@ static int tg_set_cfs_bandwidth(struct t
 		struct rq *rq = rq_of(cfs_rq);
 
 		raw_spin_lock_irq(&rq->lock);
-		cfs_rq->runtime_enabled = quota != RUNTIME_INF;
+		cfs_rq->runtime_enabled = runtime_enabled;
 		cfs_rq->runtime_remaining = 0;
 		raw_spin_unlock_irq(&rq->lock);
 	}
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1306,9 +1306,16 @@ static void assign_cfs_rq_runtime(struct
 	raw_spin_lock(&cfs_b->lock);
 	if (cfs_b->quota == RUNTIME_INF)
 		amount = min_amount;
-	else if (cfs_b->runtime > 0) {
-		amount = min(cfs_b->runtime, min_amount);
-		cfs_b->runtime -= amount;
+	else {
+		/* ensure bandwidth timer remains active under consumption */
+		if (!cfs_b->timer_active)
+			__start_cfs_bandwidth(cfs_b);
+
+		if (cfs_b->runtime > 0) {
+			amount = min(cfs_b->runtime, min_amount);
+			cfs_b->runtime -= amount;
+			cfs_b->idle = 0;
+		}
 	}
 	raw_spin_unlock(&cfs_b->lock);
 
@@ -1337,6 +1344,33 @@ static __always_inline void account_cfs_
 	__account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
 
+/*
+ * Responsible for refilling a task_group's bandwidth and unthrottling its
+ * cfs_rqs as appropriate. If there has been no activity within the last
+ * period the timer is deactivated until scheduling resumes; cfs_b->idle is
+ * used to track this state.
+ */
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
+{
+	int idle = 1;
+
+	raw_spin_lock(&cfs_b->lock);
+	/* no need to continue the timer with no bandwidth constraint */
+	if (cfs_b->quota == RUNTIME_INF)
+		goto out_unlock;
+
+	idle = cfs_b->idle;
+	cfs_b->runtime = cfs_b->quota;
+
+	/* mark as potentially idle for the upcoming period */
+	cfs_b->idle = 1;
+out_unlock:
+	if (idle)
+		cfs_b->timer_active = 0;
+	raw_spin_unlock(&cfs_b->lock);
+
+	return idle;
+}
 #else
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 				     unsigned long delta_exec) {}




* [patch 07/18] sched: expire invalid runtime
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (5 preceding siblings ...)
  2011-07-21 16:43 ` [patch 06/18] sched: add a timer to handle CFS bandwidth refresh Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-08-14 16:24   ` [tip:sched/core] sched: Expire " tip-bot for Paul Turner
  2011-07-21 16:43 ` [patch 08/18] sched: add support for throttling group entities Paul Turner
                   ` (13 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-expire_cfs_rq_runtime.patch --]
[-- Type: text/plain, Size: 6398 bytes --]

Since quota is managed using global state but consumed on a per-cpu basis we
need to ensure that our per-cpu state is appropriately synchronized.
Most importantly, runtime that is stale (from a previous period) should not be
locally consumable.

We take advantage of existing sched_clock synchronization about the jiffy to
efficiently detect whether we have (globally) crossed a quota boundary.

One catch is that the direction of spread on sched_clock is undefined;
specifically, we don't know whether our local clock is behind or ahead
of the one responsible for the current expiration time.

Fortunately we can differentiate these cases by considering whether the
global deadline has advanced.  If it has not, then we assume our clock to be
"fast" and advance our local expiration; otherwise, we know the deadline has
truly passed and we expire our local runtime.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c      |    4 +-
 kernel/sched_fair.c |   90 ++++++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 84 insertions(+), 10 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1294,11 +1294,30 @@ static inline u64 sched_cfs_bandwidth_sl
 	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
 }
 
+/*
+ * Replenish runtime according to assigned quota and update expiration time.
+ * We use sched_clock_cpu directly instead of rq->clock to avoid adding
+ * additional synchronization around rq->lock.
+ *
+ * requires cfs_b->lock
+ */
+static void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
+{
+	u64 now;
+
+	if (cfs_b->quota == RUNTIME_INF)
+		return;
+
+	now = sched_clock_cpu(smp_processor_id());
+	cfs_b->runtime = cfs_b->quota;
+	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
+}
+
 static void assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	u64 amount = 0, min_amount;
+	u64 amount = 0, min_amount, expires;
 
 	/* note: this is a positive sum as runtime_remaining <= 0 */
 	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -1307,9 +1326,16 @@ static void assign_cfs_rq_runtime(struct
 	if (cfs_b->quota == RUNTIME_INF)
 		amount = min_amount;
 	else {
-		/* ensure bandwidth timer remains active under consumption */
-		if (!cfs_b->timer_active)
+		/*
+		 * If the bandwidth pool has become inactive, then at least one
+		 * period must have elapsed since the last consumption.
+		 * Refresh the global state and ensure bandwidth timer becomes
+		 * active.
+		 */
+		if (!cfs_b->timer_active) {
+			__refill_cfs_bandwidth_runtime(cfs_b);
 			__start_cfs_bandwidth(cfs_b);
+		}
 
 		if (cfs_b->runtime > 0) {
 			amount = min(cfs_b->runtime, min_amount);
@@ -1317,19 +1343,61 @@ static void assign_cfs_rq_runtime(struct
 			cfs_b->idle = 0;
 		}
 	}
+	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
 
 	cfs_rq->runtime_remaining += amount;
+	/*
+	 * we may have advanced our local expiration to account for allowed
+	 * spread between our sched_clock and the one on which runtime was
+	 * issued.
+	 */
+	if ((s64)(expires - cfs_rq->runtime_expires) > 0)
+		cfs_rq->runtime_expires = expires;
 }
 
-static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
-				     unsigned long delta_exec)
+/*
+ * Note: This depends on the synchronization provided by sched_clock and the
+ * fact that rq->clock snapshots this value.
+ */
+static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
-	if (!cfs_rq->runtime_enabled)
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	struct rq *rq = rq_of(cfs_rq);
+
+	/* if the deadline is ahead of our clock, nothing to do */
+	if (likely((s64)(rq->clock - cfs_rq->runtime_expires) < 0))
+		return;
+
+	if (cfs_rq->runtime_remaining < 0)
 		return;
 
+	/*
+	 * If the local deadline has passed we have to consider the
+	 * possibility that our sched_clock is 'fast' and the global deadline
+	 * has not truly expired.
+	 *
+	 * Fortunately we can determine whether this is the case by checking
+	 * whether the global deadline has advanced.
+	 */
+
+	if ((s64)(cfs_rq->runtime_expires - cfs_b->runtime_expires) >= 0) {
+		/* extend local deadline, drift is bounded above by 2 ticks */
+		cfs_rq->runtime_expires += TICK_NSEC;
+	} else {
+		/* global deadline is ahead, expiration has passed */
+		cfs_rq->runtime_remaining = 0;
+	}
+}
+
+static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+				     unsigned long delta_exec)
+{
+	/* dock delta_exec before expiring quota (as it could span periods) */
 	cfs_rq->runtime_remaining -= delta_exec;
-	if (cfs_rq->runtime_remaining > 0)
+	expire_cfs_rq_runtime(cfs_rq);
+
+	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
 
 	assign_cfs_rq_runtime(cfs_rq);
@@ -1360,7 +1428,12 @@ static int do_sched_cfs_period_timer(str
 		goto out_unlock;
 
 	idle = cfs_b->idle;
-	cfs_b->runtime = cfs_b->quota;
+	/* if we're going inactive then everything else can be deferred */
+	if (idle)
+		goto out_unlock;
+
+	__refill_cfs_bandwidth_runtime(cfs_b);
+
 
 	/* mark as potentially idle for the upcoming period */
 	cfs_b->idle = 1;
@@ -1579,7 +1652,6 @@ static long effective_load(struct task_g
 
 	return wl;
 }
-
 #else
 
 static inline unsigned long effective_load(struct task_group *tg, int cpu,
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -253,6 +253,7 @@ struct cfs_bandwidth {
 	ktime_t period;
 	u64 quota, runtime;
 	s64 hierarchal_quota;
+	u64 runtime_expires;
 
 	int idle, timer_active;
 	struct hrtimer period_timer;
@@ -393,6 +394,7 @@ struct cfs_rq {
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	int runtime_enabled;
+	u64 runtime_expires;
 	s64 runtime_remaining;
 #endif
 #endif
@@ -8980,8 +8982,8 @@ static int tg_set_cfs_bandwidth(struct t
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
-	cfs_b->runtime = quota;
 
+	__refill_cfs_bandwidth_runtime(cfs_b);
 	/* restart the period timer (if active) to handle new period expiry */
 	if (runtime_enabled && cfs_b->timer_active) {
 		/* force a reprogram */




* [patch 08/18] sched: add support for throttling group entities
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (6 preceding siblings ...)
  2011-07-21 16:43 ` [patch 07/18] sched: expire invalid runtime Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-08-08 15:46   ` Lin Ming
  2011-08-14 16:26   ` [tip:sched/core] sched: Add " tip-bot for Paul Turner
  2011-07-21 16:43 ` [patch 09/18] sched: add support for unthrottling " Paul Turner
                   ` (12 subsequent siblings)
  20 siblings, 2 replies; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-throttle_entities.patch --]
[-- Type: text/plain, Size: 6227 bytes --]

Now that consumption is tracked (via update_curr()) we add support to throttle
group entities (and their corresponding cfs_rqs) in the case where there is no
runtime remaining.

Throttled entities are dequeued to prevent scheduling; additionally we mark
them as throttled (using cfs_rq->throttled) to prevent them from being
re-enqueued until they are unthrottled.  A list of a task_group's throttled
entities is maintained on the cfs_bandwidth structure.

Note: While the machinery for throttling is added in this patch the act of
throttling an entity exceeding its bandwidth is deferred until later within
the series.

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched.c      |    7 ++++
 kernel/sched_fair.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 92 insertions(+), 4 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1313,7 +1313,8 @@ static void __refill_cfs_bandwidth_runti
 	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
 }
 
-static void assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+/* returns 0 on failure to allocate runtime */
+static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
@@ -1354,6 +1355,8 @@ static void assign_cfs_rq_runtime(struct
 	 */
 	if ((s64)(expires - cfs_rq->runtime_expires) > 0)
 		cfs_rq->runtime_expires = expires;
+
+	return cfs_rq->runtime_remaining > 0;
 }
 
 /*
@@ -1400,7 +1403,12 @@ static void __account_cfs_rq_runtime(str
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
 
-	assign_cfs_rq_runtime(cfs_rq);
+	/*
+	 * if we're unable to extend our runtime we resched so that the active
+	 * hierarchy can be throttled
+	 */
+	if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
+		resched_task(rq_of(cfs_rq)->curr);
 }
 
 static __always_inline void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
@@ -1412,6 +1420,47 @@ static __always_inline void account_cfs_
 	__account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
 
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->throttled;
+}
+
+static __used void throttle_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	struct rq *rq = rq_of(cfs_rq);
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	struct sched_entity *se;
+	long task_delta, dequeue = 1;
+
+	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+
+	/* account load preceding throttle */
+	update_cfs_load(cfs_rq, 0);
+
+	task_delta = cfs_rq->h_nr_running;
+	for_each_sched_entity(se) {
+		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
+		/* throttled entity or throttle-on-deactivate */
+		if (!se->on_rq)
+			break;
+
+		if (dequeue)
+			dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
+		qcfs_rq->h_nr_running -= task_delta;
+
+		if (qcfs_rq->load.weight)
+			dequeue = 0;
+	}
+
+	if (!se)
+		rq->nr_running -= task_delta;
+
+	cfs_rq->throttled = 1;
+	raw_spin_lock(&cfs_b->lock);
+	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
+	raw_spin_unlock(&cfs_b->lock);
+}
+
 /*
  * Responsible for refilling a task_group's bandwidth and unthrottling its
  * cfs_rqs as appropriate. If there has been no activity within the last
@@ -1447,6 +1496,11 @@ out_unlock:
 #else
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 				     unsigned long delta_exec) {}
+
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+	return 0;
+}
 #endif
 
 /**************************************************
@@ -1525,7 +1579,17 @@ enqueue_task_fair(struct rq *rq, struct 
 			break;
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
+
+		/*
+		 * end evaluation on encountering a throttled cfs_rq
+		 *
+		 * note: in the case of encountering a throttled cfs_rq we will
+		 * post the final h_nr_running increment below.
+		*/
+		if (cfs_rq_throttled(cfs_rq))
+			break;
 		cfs_rq->h_nr_running++;
+
 		flags = ENQUEUE_WAKEUP;
 	}
 
@@ -1533,11 +1597,15 @@ enqueue_task_fair(struct rq *rq, struct 
 		cfs_rq = cfs_rq_of(se);
 		cfs_rq->h_nr_running++;
 
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
-	inc_nr_running(rq);
+	if (!se)
+		inc_nr_running(rq);
 	hrtick_update(rq);
 }
 
@@ -1557,6 +1625,15 @@ static void dequeue_task_fair(struct rq 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);
+
+		/*
+		 * end evaluation on encountering a throttled cfs_rq
+		 *
+		 * note: in the case of encountering a throttled cfs_rq we will
+		 * post the final h_nr_running decrement below.
+		*/
+		if (cfs_rq_throttled(cfs_rq))
+			break;
 		cfs_rq->h_nr_running--;
 
 		/* Don't dequeue parent if it has other entities besides us */
@@ -1579,11 +1656,15 @@ static void dequeue_task_fair(struct rq 
 		cfs_rq = cfs_rq_of(se);
 		cfs_rq->h_nr_running--;
 
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
-	dec_nr_running(rq);
+	if (!se)
+		dec_nr_running(rq);
 	hrtick_update(rq);
 }
 
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -257,6 +257,8 @@ struct cfs_bandwidth {
 
 	int idle, timer_active;
 	struct hrtimer period_timer;
+	struct list_head throttled_cfs_rq;
+
 #endif
 };
 
@@ -396,6 +398,9 @@ struct cfs_rq {
 	int runtime_enabled;
 	u64 runtime_expires;
 	s64 runtime_remaining;
+
+	int throttled;
+	struct list_head throttled_list;
 #endif
 #endif
 };
@@ -438,6 +443,7 @@ static void init_cfs_bandwidth(struct cf
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 
+	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
 }
@@ -445,6 +451,7 @@ static void init_cfs_bandwidth(struct cf
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	cfs_rq->runtime_enabled = 0;
+	INIT_LIST_HEAD(&cfs_rq->throttled_list);
 }
 
 /* requires cfs_b->lock, may release to reprogram timer */



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [patch 09/18] sched: add support for unthrottling group entities
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (7 preceding siblings ...)
  2011-07-21 16:43 ` [patch 08/18] sched: add support for throttling group entities Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-08-14 16:27   ` [tip:sched/core] sched: Add " tip-bot for Paul Turner
  2011-07-21 16:43 ` [patch 10/18] sched: allow for positional tg_tree walks Paul Turner
                   ` (11 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-unthrottle_entities.patch --]
[-- Type: text/plain, Size: 5369 bytes --]

At the start of each period we refresh the global bandwidth pool.  At this time
we must also unthrottle any cfs_rq entities that are now within bandwidth once
more (as quota permits).

Unthrottled entities have their corresponding cfs_rq->throttled flag cleared
and their entities re-enqueued.
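
The distribution step performed at refresh can be summarized with a small
userspace sketch (toy types; illustrative only): each throttled runqueue
receives just enough runtime to bring it one unit above its deficit, until
the refreshed pool runs dry.

  /* Toy model of distribute_cfs_runtime() -- userspace sketch, not kernel code. */
  #include <stdio.h>

  #define NR_THROTTLED 3

  static long long distribute(long long remaining, long long deficit[], int n)
  {
          for (int i = 0; i < n && remaining > 0; i++) {
                  /* amount needed to bring this runqueue just above zero */
                  long long want = deficit[i] + 1;

                  if (want > remaining)
                          want = remaining;
                  remaining -= want;
                  deficit[i] -= want;     /* deficit < 0: it may run again */
          }
          return remaining;
  }

  int main(void)
  {
          long long deficit[NR_THROTTLED] = { 200, 500, 300 };   /* ns owed */
          long long left = distribute(1000, deficit, NR_THROTTLED);

          for (int i = 0; i < NR_THROTTLED; i++)
                  printf("rq%d deficit after: %lld\n", i, deficit[i]);
          printf("returned to pool: %lld\n", left);
          return 0;
  }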

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c      |    3 +
 kernel/sched_fair.c |  127 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 126 insertions(+), 4 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -9008,6 +9008,9 @@ static int tg_set_cfs_bandwidth(struct t
 		raw_spin_lock_irq(&rq->lock);
 		cfs_rq->runtime_enabled = runtime_enabled;
 		cfs_rq->runtime_remaining = 0;
+
+		if (cfs_rq_throttled(cfs_rq))
+			unthrottle_cfs_rq(cfs_rq);
 		raw_spin_unlock_irq(&rq->lock);
 	}
 out_unlock:
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1461,6 +1461,84 @@ static __used void throttle_cfs_rq(struc
 	raw_spin_unlock(&cfs_b->lock);
 }
 
+static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	struct rq *rq = rq_of(cfs_rq);
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	struct sched_entity *se;
+	int enqueue = 1;
+	long task_delta;
+
+	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+
+	cfs_rq->throttled = 0;
+	raw_spin_lock(&cfs_b->lock);
+	list_del_rcu(&cfs_rq->throttled_list);
+	raw_spin_unlock(&cfs_b->lock);
+
+	if (!cfs_rq->load.weight)
+		return;
+
+	task_delta = cfs_rq->h_nr_running;
+	for_each_sched_entity(se) {
+		if (se->on_rq)
+			enqueue = 0;
+
+		cfs_rq = cfs_rq_of(se);
+		if (enqueue)
+			enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
+		cfs_rq->h_nr_running += task_delta;
+
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+	}
+
+	if (!se)
+		rq->nr_running += task_delta;
+
+	/* determine whether we need to wake up potentially idle cpu */
+	if (rq->curr == rq->idle && rq->cfs.nr_running)
+		resched_task(rq->curr);
+}
+
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
+		u64 remaining, u64 expires)
+{
+	struct cfs_rq *cfs_rq;
+	u64 runtime = remaining;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(cfs_rq, &cfs_b->throttled_cfs_rq,
+				throttled_list) {
+		struct rq *rq = rq_of(cfs_rq);
+
+		raw_spin_lock(&rq->lock);
+		if (!cfs_rq_throttled(cfs_rq))
+			goto next;
+
+		runtime = -cfs_rq->runtime_remaining + 1;
+		if (runtime > remaining)
+			runtime = remaining;
+		remaining -= runtime;
+
+		cfs_rq->runtime_remaining += runtime;
+		cfs_rq->runtime_expires = expires;
+
+		/* we check whether we're throttled above */
+		if (cfs_rq->runtime_remaining > 0)
+			unthrottle_cfs_rq(cfs_rq);
+
+next:
+		raw_spin_unlock(&rq->lock);
+
+		if (!remaining)
+			break;
+	}
+	rcu_read_unlock();
+
+	return remaining;
+}
+
 /*
  * Responsible for refilling a task_group's bandwidth and unthrottling its
  * cfs_rqs as appropriate. If there has been no activity within the last
@@ -1469,23 +1547,64 @@ static __used void throttle_cfs_rq(struc
  */
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 {
-	int idle = 1;
+	u64 runtime, runtime_expires;
+	int idle = 1, throttled;
 
 	raw_spin_lock(&cfs_b->lock);
 	/* no need to continue the timer with no bandwidth constraint */
 	if (cfs_b->quota == RUNTIME_INF)
 		goto out_unlock;
 
-	idle = cfs_b->idle;
+	throttled = !list_empty(&cfs_b->throttled_cfs_rq);
+	/* idle depends on !throttled (for the case of a large deficit) */
+	idle = cfs_b->idle && !throttled;
+
 	/* if we're going inactive then everything else can be deferred */
 	if (idle)
 		goto out_unlock;
 
 	__refill_cfs_bandwidth_runtime(cfs_b);
 
+	if (!throttled) {
+		/* mark as potentially idle for the upcoming period */
+		cfs_b->idle = 1;
+		goto out_unlock;
+	}
+
+	/*
+	 * There are throttled entities so we must first use the new bandwidth
+	 * to unthrottle them before making it generally available.  This
+	 * ensures that all existing debts will be paid before a new cfs_rq is
+	 * allowed to run.
+	 */
+	runtime = cfs_b->runtime;
+	runtime_expires = cfs_b->runtime_expires;
+	cfs_b->runtime = 0;
+
+	/*
+	 * This check is repeated as we are holding onto the new bandwidth
+	 * while we unthrottle.  This can potentially race with an unthrottled
+	 * group trying to acquire new bandwidth from the global pool.
+	 */
+	while (throttled && runtime > 0) {
+		raw_spin_unlock(&cfs_b->lock);
+		/* we can't nest cfs_b->lock while distributing bandwidth */
+		runtime = distribute_cfs_runtime(cfs_b, runtime,
+						 runtime_expires);
+		raw_spin_lock(&cfs_b->lock);
+
+		throttled = !list_empty(&cfs_b->throttled_cfs_rq);
+	}
 
-	/* mark as potentially idle for the upcoming period */
-	cfs_b->idle = 1;
+	/* return (any) remaining runtime */
+	cfs_b->runtime = runtime;
+	/*
+	 * While we are ensured activity in the period following an
+	 * unthrottle, this also covers the case in which the new bandwidth is
+	 * insufficient to cover the existing bandwidth deficit.  (Forcing the
+	 * timer to remain active while there are any throttled entities.)
+	 */
+	cfs_b->idle = 0;
 out_unlock:
 	if (idle)
 		cfs_b->timer_active = 0;



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [patch 10/18] sched: allow for positional tg_tree walks
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (8 preceding siblings ...)
  2011-07-21 16:43 ` [patch 09/18] sched: add support for unthrottling " Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-08-14 16:29   ` [tip:sched/core] sched: Allow " tip-bot for Paul Turner
  2011-07-21 16:43 ` [patch 11/18] sched: prevent interactions with throttled entities Paul Turner
                   ` (10 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-refactor-walk_tg_tree.patch --]
[-- Type: text/plain, Size: 3532 bytes --]

Extend walk_tg_tree to accept a positional argument:

static int walk_tg_tree_from(struct task_group *from,
			     tg_visitor down, tg_visitor up, void *data)

Existing semantics are preserved; the caller must hold rcu_lock() or a
sufficient analogue.
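
For illustration, a small userspace sketch of the visitation order (simplified
types, recursion in place of the kernel's iterative goto form, and no RCU):
@down fires when a node is first entered, @up when it is left for the final
time, and the walk never climbs above 'from'.

  /* Toy model of walk_tg_tree_from() visitation order -- sketch only. */
  #include <stdio.h>

  struct node {
          const char *name;
          struct node *children[4];
          int nr_children;
  };

  typedef int (*visitor)(struct node *, void *);

  static int walk_from(struct node *from, visitor down, visitor up, void *data)
  {
          int ret = down(from, data);

          if (ret)
                  return ret;
          for (int i = 0; i < from->nr_children; i++) {
                  ret = walk_from(from->children[i], down, up, data);
                  if (ret)
                          return ret;
          }
          return up(from, data);
  }

  static int pr_down(struct node *n, void *d) { (void)d; printf("down %s\n", n->name); return 0; }
  static int pr_up(struct node *n, void *d) { (void)d; printf("up   %s\n", n->name); return 0; }

  int main(void)
  {
          struct node a = { "a" }, b = { "b" };
          struct node from = { "from", { &a, &b }, 2 };

          /* prints: down from, down a, up a, down b, up b, up from */
          walk_from(&from, pr_down, pr_up, NULL);
          return 0;
  }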

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c |   52 +++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 39 insertions(+), 13 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -1583,20 +1583,23 @@ static inline void dec_cpu_load(struct r
 typedef int (*tg_visitor)(struct task_group *, void *);
 
 /*
- * Iterate the full tree, calling @down when first entering a node and @up when
- * leaving it for the final time.
+ * Iterate task_group tree rooted at *from, calling @down when first entering a
+ * node and @up when leaving it for the final time.
+ *
+ * Caller must hold rcu_lock or sufficient equivalent.
  */
-static int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)
+static int walk_tg_tree_from(struct task_group *from,
+			     tg_visitor down, tg_visitor up, void *data)
 {
 	struct task_group *parent, *child;
 	int ret;
 
-	rcu_read_lock();
-	parent = &root_task_group;
+	parent = from;
+
 down:
 	ret = (*down)(parent, data);
 	if (ret)
-		goto out_unlock;
+		goto out;
 	list_for_each_entry_rcu(child, &parent->children, siblings) {
 		parent = child;
 		goto down;
@@ -1605,19 +1608,29 @@ up:
 		continue;
 	}
 	ret = (*up)(parent, data);
-	if (ret)
-		goto out_unlock;
+	if (ret || parent == from)
+		goto out;
 
 	child = parent;
 	parent = parent->parent;
 	if (parent)
 		goto up;
-out_unlock:
-	rcu_read_unlock();
-
+out:
 	return ret;
 }
 
+/*
+ * Iterate the full tree, calling @down when first entering a node and @up when
+ * leaving it for the final time.
+ *
+ * Caller must hold rcu_lock or sufficient equivalent.
+ */
+
+static inline int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)
+{
+	return walk_tg_tree_from(&root_task_group, down, up, data);
+}
+
 static int tg_nop(struct task_group *tg, void *data)
 {
 	return 0;
@@ -1711,7 +1724,9 @@ static int tg_load_down(struct task_grou
 
 static void update_h_load(long cpu)
 {
+	rcu_read_lock();
 	walk_tg_tree(tg_load_down, tg_nop, (void *)cpu);
+	rcu_read_unlock();
 }
 
 #endif
@@ -8684,13 +8699,19 @@ static int tg_rt_schedulable(struct task
 
 static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
 {
+	int ret;
+
 	struct rt_schedulable_data data = {
 		.tg = tg,
 		.rt_period = period,
 		.rt_runtime = runtime,
 	};
 
-	return walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
+	rcu_read_lock();
+	ret = walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
+	rcu_read_unlock();
+
+	return ret;
 }
 
 static int tg_set_rt_bandwidth(struct task_group *tg,
@@ -9147,6 +9168,7 @@ static int tg_cfs_schedulable_down(struc
 
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
 {
+	int ret;
 	struct cfs_schedulable_data data = {
 		.tg = tg,
 		.period = period,
@@ -9158,7 +9180,11 @@ static int __cfs_schedulable(struct task
 		do_div(data.quota, NSEC_PER_USEC);
 	}
 
-	return walk_tg_tree(tg_cfs_schedulable_down, tg_nop, &data);
+	rcu_read_lock();
+	ret = walk_tg_tree(tg_cfs_schedulable_down, tg_nop, &data);
+	rcu_read_unlock();
+
+	return ret;
 }
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [patch 11/18] sched: prevent interactions with throttled entities
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (9 preceding siblings ...)
  2011-07-21 16:43 ` [patch 10/18] sched: allow for positional tg_tree walks Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-07-22 11:26   ` Kamalesh Babulal
                     ` (2 more replies)
  2011-07-21 16:43 ` [patch 12/18] sched: prevent buddy " Paul Turner
                   ` (9 subsequent siblings)
  20 siblings, 3 replies; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-throttled_shares.patch --]
[-- Type: text/plain, Size: 6437 bytes --]

From the perspective of load-balance and shares distribution, throttled
entities should be invisible.

However, both of these operations work on 'active' lists and are not
inherently aware of what group hierarchies may be present.  In some cases this
may be side-stepped (e.g. we could sideload via tg_load_down in load balance) 
while in others (e.g. update_shares()) it is more difficult to compute without
incurring some O(n^2) costs.

Instead, track hierarchical throttled state at the time of transition.  This
allows us to easily identify whether an entity belongs to a throttled hierarchy
and avoid incorrect interactions with it.

Also, when an entity leaves a throttled hierarchy we need to advance the
windows used for its shares averaging so that the elapsed throttled time is
not counted as part of the cfs_rq's operation.

We also use this information to prevent buddy interactions in the wakeup and
yield_to() paths.
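
A minimal userspace sketch (toy types; illustrative only) of the state being
tracked: throttling a group bumps a counter on every runqueue in its subtree,
so "is this entity inside a throttled hierarchy?" becomes a single counter
test instead of a walk up the tree.

  /* Toy model of throttle_count bookkeeping -- not kernel code. */
  #include <stdio.h>

  struct toy_rq {
          const char *name;
          struct toy_rq *children[4];
          int nr_children;
          int throttle_count;     /* >0: self or some ancestor is throttled */
  };

  static void adjust_subtree(struct toy_rq *rq, int delta)
  {
          rq->throttle_count += delta;
          for (int i = 0; i < rq->nr_children; i++)
                  adjust_subtree(rq->children[i], delta);
  }

  static int throttled_hierarchy(struct toy_rq *rq)
  {
          return rq->throttle_count;      /* O(1) check for load-balance/shares */
  }

  int main(void)
  {
          struct toy_rq leaf = { "leaf" };
          struct toy_rq mid  = { "mid", { &leaf }, 1 };

          adjust_subtree(&mid, +1);       /* throttle "mid" */
          printf("leaf throttled? %d\n", throttled_hierarchy(&leaf));     /* 1 */
          adjust_subtree(&mid, -1);       /* unthrottle */
          printf("leaf throttled? %d\n", throttled_hierarchy(&leaf));     /* 0 */
          return 0;
  }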

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c      |    2 -
 kernel/sched_fair.c |   99 ++++++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 94 insertions(+), 7 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -725,6 +725,8 @@ account_entity_dequeue(struct cfs_rq *cf
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+/* we need this in update_cfs_load and load-balance functions below */
+static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 # ifdef CONFIG_SMP
 static void update_cfs_rq_load_contribution(struct cfs_rq *cfs_rq,
 					    int global_update)
@@ -747,7 +749,7 @@ static void update_cfs_load(struct cfs_r
 	u64 now, delta;
 	unsigned long load = cfs_rq->load.weight;
 
-	if (cfs_rq->tg == &root_task_group)
+	if (cfs_rq->tg == &root_task_group || throttled_hierarchy(cfs_rq))
 		return;
 
 	now = rq_of(cfs_rq)->clock_task;
@@ -856,7 +858,7 @@ static void update_cfs_shares(struct cfs
 
 	tg = cfs_rq->tg;
 	se = tg->se[cpu_of(rq_of(cfs_rq))];
-	if (!se)
+	if (!se || throttled_hierarchy(cfs_rq))
 		return;
 #ifndef CONFIG_SMP
 	if (likely(se->load.weight == tg->shares))
@@ -1425,6 +1427,65 @@ static inline int cfs_rq_throttled(struc
 	return cfs_rq->throttled;
 }
 
+/* check whether cfs_rq, or any parent, is throttled */
+static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->throttle_count;
+}
+
+/*
+ * Ensure that neither of the group entities corresponding to src_cpu or
+ * dest_cpu are members of a throttled hierarchy when performing group
+ * load-balance operations.
+ */
+static inline int throttled_lb_pair(struct task_group *tg,
+				    int src_cpu, int dest_cpu)
+{
+	struct cfs_rq *src_cfs_rq, *dest_cfs_rq;
+
+	src_cfs_rq = tg->cfs_rq[src_cpu];
+	dest_cfs_rq = tg->cfs_rq[dest_cpu];
+
+	return throttled_hierarchy(src_cfs_rq) ||
+	       throttled_hierarchy(dest_cfs_rq);
+}
+
+/* updated child weight may affect parent so we have to do this bottom up */
+static int tg_unthrottle_up(struct task_group *tg, void *data)
+{
+	struct rq *rq = data;
+	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+
+	cfs_rq->throttle_count--;
+#ifdef CONFIG_SMP
+	if (!cfs_rq->throttle_count) {
+		u64 delta = rq->clock_task - cfs_rq->load_stamp;
+
+		/* leaving throttled state, advance shares averaging windows */
+		cfs_rq->load_stamp += delta;
+		cfs_rq->load_last += delta;
+
+		/* update entity weight now that we are on_rq again */
+		update_cfs_shares(cfs_rq);
+	}
+#endif
+
+	return 0;
+}
+
+static int tg_throttle_down(struct task_group *tg, void *data)
+{
+	struct rq *rq = data;
+	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+
+	/* group is entering throttled state, record last load */
+	if (!cfs_rq->throttle_count)
+		update_cfs_load(cfs_rq, 0);
+	cfs_rq->throttle_count++;
+
+	return 0;
+}
+
 static __used void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
@@ -1435,7 +1496,9 @@ static __used void throttle_cfs_rq(struc
 	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
 
 	/* account load preceding throttle */
-	update_cfs_load(cfs_rq, 0);
+	rcu_read_lock();
+	walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
+	rcu_read_unlock();
 
 	task_delta = cfs_rq->h_nr_running;
 	for_each_sched_entity(se) {
@@ -1476,6 +1539,10 @@ static void unthrottle_cfs_rq(struct cfs
 	list_del_rcu(&cfs_rq->throttled_list);
 	raw_spin_unlock(&cfs_b->lock);
 
+	update_rq_clock(rq);
+	/* update hierarchical throttle state */
+	walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
+
 	if (!cfs_rq->load.weight)
 		return;
 
@@ -1620,6 +1687,17 @@ static inline int cfs_rq_throttled(struc
 {
 	return 0;
 }
+
+static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
+{
+	return 0;
+}
+
+static inline int throttled_lb_pair(struct task_group *tg,
+				    int src_cpu, int dest_cpu)
+{
+	return 0;
+}
 #endif
 
 /**************************************************
@@ -2519,6 +2597,9 @@ move_one_task(struct rq *this_rq, int th
 
 	for_each_leaf_cfs_rq(busiest, cfs_rq) {
 		list_for_each_entry_safe(p, n, &cfs_rq->tasks, se.group_node) {
+			if (throttled_lb_pair(task_group(p),
+					      busiest->cpu, this_cpu))
+				break;
 
 			if (!can_migrate_task(p, busiest, this_cpu,
 						sd, idle, &pinned))
@@ -2630,8 +2711,13 @@ static void update_shares(int cpu)
 	struct rq *rq = cpu_rq(cpu);
 
 	rcu_read_lock();
-	for_each_leaf_cfs_rq(rq, cfs_rq)
+	for_each_leaf_cfs_rq(rq, cfs_rq) {
+		/* throttled entities do not contribute to load */
+		if (throttled_hierarchy(cfs_rq))
+			continue;
+
 		update_shares_cpu(cfs_rq->tg, cpu);
+	}
 	rcu_read_unlock();
 }
 
@@ -2655,9 +2741,10 @@ load_balance_fair(struct rq *this_rq, in
 		u64 rem_load, moved_load;
 
 		/*
-		 * empty group
+		 * empty group or part of a throttled hierarchy
 		 */
-		if (!busiest_cfs_rq->task_weight)
+		if (!busiest_cfs_rq->task_weight ||
+		    throttled_lb_pair(tg, busiest_cpu, this_cpu))
 			continue;
 
 		rem_load = (u64)rem_load_move * busiest_weight;
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -399,7 +399,7 @@ struct cfs_rq {
 	u64 runtime_expires;
 	s64 runtime_remaining;
 
-	int throttled;
+	int throttled, throttle_count;
 	struct list_head throttled_list;
 #endif
 #endif



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [patch 12/18] sched: prevent buddy interactions with throttled entities
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (10 preceding siblings ...)
  2011-07-21 16:43 ` [patch 11/18] sched: prevent interactions with throttled entities Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-08-14 16:32   ` [tip:sched/core] sched: Prevent " tip-bot for Paul Turner
  2011-07-21 16:43 ` [patch 13/18] sched: migrate throttled tasks on HOTPLUG Paul Turner
                   ` (8 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-throttled_buddies.patch --]
[-- Type: text/plain, Size: 2005 bytes --]

Buddies allow us to select "on-rq" entities without actually selecting them
from a cfs_rq's rb_tree.  As a result we must ensure that throttled entities
are not falsely nominated as buddies.  The fact that entities are dequeued
within throttle_cfs_rq() is not sufficient for clearing buddy status, as the
nomination may occur after throttling.
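
As a rough illustration (userspace sketch with toy types; in the patch itself
the guards live in the wakeup-preemption and yield_to() paths rather than
inside set_next_buddy()): a nomination is simply refused for an entity inside
a throttled hierarchy, and yield_to() treats such an entity as not runnable.

  /* Toy sketch of the buddy guards -- illustrative only. */
  #include <stdio.h>

  struct toy_se {
          const char *name;
          int on_rq;
          int throttle_count;     /* hierarchical throttled state (patch 11) */
  };

  static struct toy_se *next_buddy;

  static void nominate_next_buddy(struct toy_se *se)
  {
          /* never nominate an entity inside a throttled hierarchy */
          if (se->throttle_count)
                  return;
          next_buddy = se;
  }

  static int yieldable(struct toy_se *se)
  {
          /* throttled hierarchies are not runnable */
          return se->on_rq && !se->throttle_count;
  }

  int main(void)
  {
          struct toy_se se = { "task", 1, 1 };

          nominate_next_buddy(&se);
          printf("buddy: %s, yieldable: %d\n",
                 next_buddy ? next_buddy->name : "(none)", yieldable(&se));
          return 0;
  }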

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched_fair.c |   18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -2370,6 +2370,15 @@ static void check_preempt_wakeup(struct 
 	if (unlikely(se == pse))
 		return;
 
+	/*
+	 * This is possible from callers such as pull_task(), in which we
+	 * unconditionally check_prempt_curr() after an enqueue (which may have
+	 * lead to a throttle).  This both saves work and prevents false
+	 * next-buddy nomination below.
+	 */
+	if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
+		return;
+
 	if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
 		set_next_buddy(pse);
 		next_buddy_marked = 1;
@@ -2378,6 +2387,12 @@ static void check_preempt_wakeup(struct 
 	/*
 	 * We can come here with TIF_NEED_RESCHED already set from new task
 	 * wake up path.
+	 *
+	 * Note: this also catches the edge-case of curr being in a throttled
+	 * group (e.g. via set_curr_task), since update_curr() (in the
+	 * enqueue of curr) will have resulted in resched being set.  This
+	 * prevents us from potentially nominating it as a false LAST_BUDDY
+	 * below.
 	 */
 	if (test_tsk_need_resched(curr))
 		return;
@@ -2500,7 +2515,8 @@ static bool yield_to_task_fair(struct rq
 {
 	struct sched_entity *se = &p->se;
 
-	if (!se->on_rq)
+	/* throttled hierarchies are not runnable */
+	if (!se->on_rq || throttled_hierarchy(cfs_rq_of(se)))
 		return false;
 
 	/* Tell the scheduler that we'd really like pse to run next. */



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [patch 13/18] sched: migrate throttled tasks on HOTPLUG
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (11 preceding siblings ...)
  2011-07-21 16:43 ` [patch 12/18] sched: prevent buddy " Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-08-14 16:34   ` [tip:sched/core] sched: Migrate " tip-bot for Paul Turner
  2011-07-21 16:43 ` [patch 14/18] sched: throttle entities exceeding their allowed bandwidth Paul Turner
                   ` (7 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-migrate_dead.patch --]
[-- Type: text/plain, Size: 1792 bytes --]

Throttled tasks are invisible to cpu-offline since they are not eligible for
selection by pick_next_task().  The regular 'escape' path for a thread that is
blocked at offline is via ttwu->select_task_rq; however, this will not handle a
throttled group since there are no individual thread wakeups on an unthrottle.

Resolve this by unthrottling offline cpus so that threads can be migrated.
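
A stripped-down userspace sketch (toy types; illustrative only) of the escape
hatch: before tasks are migrated off a dying cpu, every bandwidth-enabled
runqueue is topped back up with valid quota and unthrottled so that its tasks
become visible to pick_next_task() again.

  /* Toy sketch of unthrottling on cpu offline -- not kernel code. */
  #include <stdio.h>

  struct toy_cfs_rq {
          const char *name;
          int runtime_enabled;
          long long runtime_remaining;    /* <= 0 while throttled */
          int throttled;
  };

  static void toy_unthrottle_offline(struct toy_cfs_rq *rqs, int n, long long quota)
  {
          for (int i = 0; i < n; i++) {
                  if (!rqs[i].runtime_enabled)
                          continue;
                  /* clock_task is frozen on a dead cpu: any valid quota will do */
                  rqs[i].runtime_remaining = quota;
                  if (rqs[i].throttled) {
                          rqs[i].throttled = 0;   /* visible to pick_next_task() */
                          printf("unthrottled %s\n", rqs[i].name);
                  }
          }
  }

  int main(void)
  {
          struct toy_cfs_rq rqs[] = {
                  { "grp-a", 1, -5000, 1 },       /* throttled, must be released */
                  { "grp-b", 0, 0, 0 },           /* no bandwidth limits */
          };

          toy_unthrottle_offline(rqs, 2, 100000);
          return 0;
  }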

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c |   27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -6271,6 +6271,30 @@ static void calc_global_load_remove(stru
 	rq->calc_load_active = 0;
 }
 
+#ifdef CONFIG_CFS_BANDWIDTH
+static void unthrottle_offline_cfs_rqs(struct rq *rq)
+{
+	struct cfs_rq *cfs_rq;
+
+	for_each_leaf_cfs_rq(rq, cfs_rq) {
+		struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+
+		if (!cfs_rq->runtime_enabled)
+			continue;
+
+		/*
+		 * clock_task is not advancing so we just need to make sure
+		 * there's some valid quota amount
+		 */
+		cfs_rq->runtime_remaining = cfs_b->quota;
+		if (cfs_rq_throttled(cfs_rq))
+			unthrottle_cfs_rq(cfs_rq);
+	}
+}
+#else
+static void unthrottle_offline_cfs_rqs(struct rq *rq) {}
+#endif
+
 /*
  * Migrate all tasks from the rq, sleeping tasks will be migrated by
  * try_to_wake_up()->select_task_rq().
@@ -6296,6 +6320,9 @@ static void migrate_tasks(unsigned int d
 	 */
 	rq->stop = NULL;
 
+	/* Ensure any throttled groups are reachable by pick_next_task */
+	unthrottle_offline_cfs_rqs(rq);
+
 	for ( ; ; ) {
 		/*
 		 * There's this thread running, bail when that's the only



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [patch 14/18] sched: throttle entities exceeding their allowed bandwidth
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (12 preceding siblings ...)
  2011-07-21 16:43 ` [patch 13/18] sched: migrate throttled tasks on HOTPLUG Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-08-14 16:35   ` [tip:sched/core] sched: Throttle " tip-bot for Paul Turner
  2011-07-21 16:43 ` [patch 15/18] sched: add exports tracking cfs bandwidth control statistics Paul Turner
                   ` (6 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-enable-throttling.patch --]
[-- Type: text/plain, Size: 3848 bytes --]

With the machinery in place to throttle and unthrottle entities, as well as to
handle their participation (or lack thereof), we can now enable throttling.

There are two points at which we must check whether it is time to enter the
throttled state (both are sketched below): put_prev_entity() and
enqueue_entity().

- put_prev_entity() is the typical throttle path; we reach it by exceeding our
  allocated run-time within update_curr()->account_cfs_rq_runtime() and going
  through a reschedule.

- enqueue_entity() covers the case of a wake-up into an already throttled
  group.  In this case we know the group cannot be on_rq and can throttle
  immediately.  Checks are added at the time of put_prev_entity() and
  enqueue_entity().
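
Both checks reduce to small predicates; the following userspace sketch (toy
fields; illustrative only) captures the decision made at each point.

  /* Toy sketch of the two throttle decision points -- not kernel code. */
  #include <stdio.h>

  struct toy_cfs_rq {
          int runtime_enabled;
          long long runtime_remaining;
          int throttled;
          int has_curr;           /* stands in for cfs_rq->curr != NULL */
  };

  static void toy_throttle(struct toy_cfs_rq *rq)
  {
          rq->throttled = 1;
          printf("throttled\n");
  }

  /* enqueue_entity() path: a wake-up into a group whose quota is already gone */
  static void check_enqueue_throttle(struct toy_cfs_rq *rq)
  {
          if (!rq->runtime_enabled || rq->has_curr)       /* active: use put() path */
                  return;
          if (rq->throttled)
                  return;
          if (rq->runtime_remaining <= 0)
                  toy_throttle(rq);
  }

  /* put_prev_entity() path: we ran out of runtime and went through a resched */
  static void check_cfs_rq_runtime(struct toy_cfs_rq *rq)
  {
          if (!rq->runtime_enabled || rq->runtime_remaining > 0)
                  return;
          if (rq->throttled)      /* e.g. forced running via set_curr_task */
                  return;
          toy_throttle(rq);
  }

  int main(void)
  {
          struct toy_cfs_rq rq = { 1, -100, 0, 0 };

          check_enqueue_throttle(&rq);    /* wake-up into an exhausted group */
          check_cfs_rq_runtime(&rq);      /* already throttled: nothing to do */
          return 0;
  }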

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched_fair.c |   52 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 50 insertions(+), 2 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -989,6 +989,8 @@ place_entity(struct cfs_rq *cfs_rq, stru
 	se->vruntime = vruntime;
 }
 
+static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
+
 static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
@@ -1018,8 +1020,10 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
 
-	if (cfs_rq->nr_running == 1)
+	if (cfs_rq->nr_running == 1) {
 		list_add_leaf_cfs_rq(cfs_rq);
+		check_enqueue_throttle(cfs_rq);
+	}
 }
 
 static void __clear_buddies_last(struct sched_entity *se)
@@ -1224,6 +1228,8 @@ static struct sched_entity *pick_next_en
 	return se;
 }
 
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
+
 static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 {
 	/*
@@ -1233,6 +1239,9 @@ static void put_prev_entity(struct cfs_r
 	if (prev->on_rq)
 		update_curr(cfs_rq);
 
+	/* throttle cfs_rqs exceeding runtime */
+	check_cfs_rq_runtime(cfs_rq);
+
 	check_spread(cfs_rq, prev);
 	if (prev->on_rq) {
 		update_stats_wait_start(cfs_rq, prev);
@@ -1486,7 +1495,7 @@ static int tg_throttle_down(struct task_
 	return 0;
 }
 
-static __used void throttle_cfs_rq(struct cfs_rq *cfs_rq)
+static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
@@ -1679,9 +1688,48 @@ out_unlock:
 
 	return idle;
 }
+
+/*
+ * When a group wakes up we want to make sure that its quota is not already
+ * expired/exceeded, otherwise it may be allowed to steal additional ticks of
+ * runtime as update_curr() throttling can not not trigger until it's on-rq.
+ */
+static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
+{
+	/* an active group must be handled by the update_curr()->put() path */
+	if (!cfs_rq->runtime_enabled || cfs_rq->curr)
+		return;
+
+	/* ensure the group is not already throttled */
+	if (cfs_rq_throttled(cfs_rq))
+		return;
+
+	/* update runtime allocation */
+	account_cfs_rq_runtime(cfs_rq, 0);
+	if (cfs_rq->runtime_remaining <= 0)
+		throttle_cfs_rq(cfs_rq);
+}
+
+/* conditionally throttle active cfs_rq's from put_prev_entity() */
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
+		return;
+
+	/*
+	 * it's possible for a throttled entity to be forced into a running
+	 * state (e.g. set_curr_task), in this case we're finished.
+	 */
+	if (cfs_rq_throttled(cfs_rq))
+		return;
+
+	throttle_cfs_rq(cfs_rq);
+}
 #else
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 				     unsigned long delta_exec) {}
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
+static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [patch 15/18] sched: add exports tracking cfs bandwidth control statistics
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (13 preceding siblings ...)
  2011-07-21 16:43 ` [patch 14/18] sched: throttle entities exceeding their allowed bandwidth Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-08-14 16:37   ` [tip:sched/core] sched: Add " tip-bot for Nikhil Rao
  2011-07-21 16:43 ` [patch 16/18] sched: return unused runtime on group dequeue Paul Turner
                   ` (5 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron,
	Nikhil Rao

[-- Attachment #1: sched-bwc-throttle_stats.patch --]
[-- Type: text/plain, Size: 3605 bytes --]

From: Nikhil Rao <ncrao@google.com>

This change introduces statistics exports for the cpu sub-system; these are
added through the use of a stat file similar to that exported by other
subsystems.

The following exports are included (a minimal reader is sketched after the
list):

nr_periods:	number of periods in which execution occurred
nr_throttled:	the number of periods above in which execution was throttled
throttled_time:	cumulative wall-time for which any of this group's cpus have
		been throttled
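
For reference, a minimal userspace reader of the new file; the mount point
/cgroup/cpu and group name "test" are assumptions matching the workload setup
used elsewhere in this series, so adjust the path to your hierarchy.

  /* Dump the bandwidth statistics exported via cpu.stat -- sketch only. */
  #include <stdio.h>

  int main(void)
  {
          /* path assumes the cpu cgroup mounted at /cgroup/cpu with group "test" */
          FILE *f = fopen("/cgroup/cpu/test/cpu.stat", "r");
          char key[64];
          unsigned long long val;

          if (!f) {
                  perror("cpu.stat");
                  return 1;
          }
          while (fscanf(f, "%63s %llu", key, &val) == 2)
                  printf("%-16s %llu\n", key, val);
          fclose(f);
          return 0;
  }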

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c      |   21 +++++++++++++++++++++
 kernel/sched_fair.c |    7 +++++++
 2 files changed, 28 insertions(+)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -259,6 +259,9 @@ struct cfs_bandwidth {
 	struct hrtimer period_timer;
 	struct list_head throttled_cfs_rq;
 
+	/* statistics */
+	int nr_periods, nr_throttled;
+	u64 throttled_time;
 #endif
 };
 
@@ -399,6 +402,7 @@ struct cfs_rq {
 	u64 runtime_expires;
 	s64 runtime_remaining;
 
+	u64 throttled_timestamp;
 	int throttled, throttle_count;
 	struct list_head throttled_list;
 #endif
@@ -9213,6 +9217,19 @@ static int __cfs_schedulable(struct task
 
 	return ret;
 }
+
+static int cpu_stats_show(struct cgroup *cgrp, struct cftype *cft,
+		struct cgroup_map_cb *cb)
+{
+	struct task_group *tg = cgroup_tg(cgrp);
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+
+	cb->fill(cb, "nr_periods", cfs_b->nr_periods);
+	cb->fill(cb, "nr_throttled", cfs_b->nr_throttled);
+	cb->fill(cb, "throttled_time", cfs_b->throttled_time);
+
+	return 0;
+}
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -9259,6 +9276,10 @@ static struct cftype cpu_files[] = {
 		.read_u64 = cpu_cfs_period_read_u64,
 		.write_u64 = cpu_cfs_period_write_u64,
 	},
+	{
+		.name = "stat",
+		.read_map = cpu_stats_show,
+	},
 #endif
 #ifdef CONFIG_RT_GROUP_SCHED
 	{
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1528,6 +1528,7 @@ static void throttle_cfs_rq(struct cfs_r
 		rq->nr_running -= task_delta;
 
 	cfs_rq->throttled = 1;
+	cfs_rq->throttled_timestamp = rq->clock;
 	raw_spin_lock(&cfs_b->lock);
 	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
 	raw_spin_unlock(&cfs_b->lock);
@@ -1545,8 +1546,10 @@ static void unthrottle_cfs_rq(struct cfs
 
 	cfs_rq->throttled = 0;
 	raw_spin_lock(&cfs_b->lock);
+	cfs_b->throttled_time += rq->clock - cfs_rq->throttled_timestamp;
 	list_del_rcu(&cfs_rq->throttled_list);
 	raw_spin_unlock(&cfs_b->lock);
+	cfs_rq->throttled_timestamp = 0;
 
 	update_rq_clock(rq);
 	/* update hierarchical throttle state */
@@ -1634,6 +1637,7 @@ static int do_sched_cfs_period_timer(str
 	throttled = !list_empty(&cfs_b->throttled_cfs_rq);
 	/* idle depends on !throttled (for the case of a large deficit) */
 	idle = cfs_b->idle && !throttled;
+	cfs_b->nr_periods += overrun;
 
 	/* if we're going inactive then everything else can be deferred */
 	if (idle)
@@ -1647,6 +1651,9 @@ static int do_sched_cfs_period_timer(str
 		goto out_unlock;
 	}
 
+	/* account preceding periods in which throttling occurred */
+	cfs_b->nr_throttled += overrun;
+
 	/*
 	 * There are throttled entities so we must first use the new bandwidth
 	 * to unthrottle them before making it generally available.  This



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [patch 16/18] sched: return unused runtime on group dequeue
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (14 preceding siblings ...)
  2011-07-21 16:43 ` [patch 15/18] sched: add exports tracking cfs bandwidth control statistics Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-08-14 16:39   ` [tip:sched/core] sched: Return " tip-bot for Paul Turner
  2011-07-21 16:43 ` [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive Paul Turner
                   ` (4 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-simple_return_quota.patch --]
[-- Type: text/plain, Size: 8152 bytes --]

When a local cfs_rq blocks we return the majority of its remaining quota to the
global bandwidth pool for use by other runqueues.

We do this only when the quota is current and there is more than
min_cfs_rq_runtime [1ms by default] of runtime remaining on the rq.

In the case where there are throttled runqueues and we have sufficient
bandwidth to meter out a slice, a second timer is kicked off to handle this
delivery, unthrottling where appropriate.
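
The return decision itself is small; here is a userspace sketch (toy fields;
illustrative only): anything above the 1ms floor flows back to the global
pool, provided the local runtime still belongs to the current quota
generation.

  /* Toy sketch of returning slack runtime to the global pool -- not kernel code. */
  #include <stdio.h>

  #define NSEC_PER_MSEC           1000000LL
  #define MIN_CFS_RQ_RUNTIME      (1 * NSEC_PER_MSEC)     /* keep 1ms locally */

  struct toy_pool   { long long runtime; long long expires; };
  struct toy_cfs_rq { long long runtime_remaining; long long expires; };

  static void return_slack(struct toy_pool *pool, struct toy_cfs_rq *rq)
  {
          long long slack = rq->runtime_remaining - MIN_CFS_RQ_RUNTIME;

          if (slack <= 0)
                  return;
          /* only return runtime belonging to the current quota generation */
          if (rq->expires == pool->expires)
                  pool->runtime += slack;
          rq->runtime_remaining -= slack; /* never try to return it twice */
  }

  int main(void)
  {
          struct toy_pool pool = { 0, 42 };
          struct toy_cfs_rq rq = { 3 * NSEC_PER_MSEC, 42 };

          return_slack(&pool, &rq);
          printf("pool=%lld local=%lld\n", pool.runtime, rq.runtime_remaining);
          return 0;
  }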

Using a 'worst case' antagonist which executes on each cpu
for 1ms before moving onto the next on a fairly large machine:

no quota generations:
 197.47 ms       /cgroup/a/cpuacct.usage
 199.46 ms       /cgroup/a/cpuacct.usage
 205.46 ms       /cgroup/a/cpuacct.usage
 198.46 ms       /cgroup/a/cpuacct.usage
 208.39 ms       /cgroup/a/cpuacct.usage
Since we are allowed to use "stale" quota our usage is effectively bounded by
the rate of input into the global pool and performance is relatively stable.

with quota generations [1s increments]:
 119.58 ms       /cgroup/a/cpuacct.usage
 119.65 ms       /cgroup/a/cpuacct.usage
 119.64 ms       /cgroup/a/cpuacct.usage
 119.63 ms       /cgroup/a/cpuacct.usage
 119.60 ms       /cgroup/a/cpuacct.usage
The large deficit here is due to quota generations (/intentionally/) preventing
us from now using previously stranded slack quota.  The cost is that this quota
becomes unavailable.

with quota generations and quota return:
 200.09 ms       /cgroup/a/cpuacct.usage
 200.09 ms       /cgroup/a/cpuacct.usage
 198.09 ms       /cgroup/a/cpuacct.usage
 200.09 ms       /cgroup/a/cpuacct.usage
 200.06 ms       /cgroup/a/cpuacct.usage
By returning unused quota we're able to both stably consume our desired quota
and prevent unintentional overages due to the abuse of slack quota from 
previous quota periods (especially on a large machine).

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched.c      |   15 ++++++-
 kernel/sched_fair.c |  108 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 122 insertions(+), 1 deletion(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -256,7 +256,7 @@ struct cfs_bandwidth {
 	u64 runtime_expires;
 
 	int idle, timer_active;
-	struct hrtimer period_timer;
+	struct hrtimer period_timer, slack_timer;
 	struct list_head throttled_cfs_rq;
 
 	/* statistics */
@@ -417,6 +417,16 @@ static inline struct cfs_bandwidth *tg_c
 
 static inline u64 default_cfs_period(void);
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
+static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b);
+
+static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
+{
+	struct cfs_bandwidth *cfs_b =
+		container_of(timer, struct cfs_bandwidth, slack_timer);
+	do_sched_cfs_slack_timer(cfs_b);
+
+	return HRTIMER_NORESTART;
+}
 
 static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
@@ -449,6 +459,8 @@ static void init_cfs_bandwidth(struct cf
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
+	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	cfs_b->slack_timer.function = sched_cfs_slack_timer;
 }
 
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
@@ -484,6 +496,7 @@ static void __start_cfs_bandwidth(struct
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
 	hrtimer_cancel(&cfs_b->period_timer);
+	hrtimer_cancel(&cfs_b->slack_timer);
 }
 #else
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1071,6 +1071,8 @@ static void clear_buddies(struct cfs_rq 
 		__clear_buddies_skip(se);
 }
 
+static void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
+
 static void
 dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
@@ -1109,6 +1111,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 	if (!(flags & DEQUEUE_SLEEP))
 		se->vruntime -= cfs_rq->min_vruntime;
 
+	/* return excess runtime on last dequeue */
+	return_cfs_rq_runtime(cfs_rq);
+
 	update_min_vruntime(cfs_rq);
 	update_cfs_shares(cfs_rq);
 }
@@ -1696,6 +1701,108 @@ out_unlock:
 	return idle;
 }
 
+/* a cfs_rq won't donate quota below this amount */
+static const u64 min_cfs_rq_runtime = 1 * NSEC_PER_MSEC;
+/* minimum remaining period time to redistribute slack quota */
+static const u64 min_bandwidth_expiration = 2 * NSEC_PER_MSEC;
+/* how long we wait to gather additional slack before distributing */
+static const u64 cfs_bandwidth_slack_period = 5 * NSEC_PER_MSEC;
+
+/* are we near the end of the current quota period? */
+static int runtime_refresh_within(struct cfs_bandwidth *cfs_b, u64 min_expire)
+{
+	struct hrtimer *refresh_timer = &cfs_b->period_timer;
+	u64 remaining;
+
+	/* if the call-back is running a quota refresh is already occurring */
+	if (hrtimer_callback_running(refresh_timer))
+		return 1;
+
+	/* is a quota refresh about to occur? */
+	remaining = ktime_to_ns(hrtimer_expires_remaining(refresh_timer));
+	if (remaining < min_expire)
+		return 1;
+
+	return 0;
+}
+
+static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+	u64 min_left = cfs_bandwidth_slack_period + min_bandwidth_expiration;
+
+	/* if there's a quota refresh soon don't bother with slack */
+	if (runtime_refresh_within(cfs_b, min_left))
+		return;
+
+	start_bandwidth_timer(&cfs_b->slack_timer,
+				ns_to_ktime(cfs_bandwidth_slack_period));
+}
+
+/* we know any runtime found here is valid as update_curr() precedes return */
+static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	s64 slack_runtime = cfs_rq->runtime_remaining - min_cfs_rq_runtime;
+
+	if (slack_runtime <= 0)
+		return;
+
+	raw_spin_lock(&cfs_b->lock);
+	if (cfs_b->quota != RUNTIME_INF &&
+	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+		cfs_b->runtime += slack_runtime;
+
+		/* we are under rq->lock, defer unthrottling using a timer */
+		if (cfs_b->runtime > sched_cfs_bandwidth_slice() &&
+		    !list_empty(&cfs_b->throttled_cfs_rq))
+			start_cfs_slack_bandwidth(cfs_b);
+	}
+	raw_spin_unlock(&cfs_b->lock);
+
+	/* even if it's not valid for return we don't want to try again */
+	cfs_rq->runtime_remaining -= slack_runtime;
+}
+
+static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	if (!cfs_rq->runtime_enabled || !cfs_rq->nr_running)
+		return;
+
+	__return_cfs_rq_runtime(cfs_rq);
+}
+
+/*
+ * This is done with a timer (instead of inline with bandwidth return) since
+ * it's necessary to juggle rq->locks to unthrottle their respective cfs_rqs.
+ */
+static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
+{
+	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
+	u64 expires;
+
+	/* confirm we're still not at a refresh boundary */
+	if (runtime_refresh_within(cfs_b, min_bandwidth_expiration))
+		return;
+
+	raw_spin_lock(&cfs_b->lock);
+	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice) {
+		runtime = cfs_b->runtime;
+		cfs_b->runtime = 0;
+	}
+	expires = cfs_b->runtime_expires;
+	raw_spin_unlock(&cfs_b->lock);
+
+	if (!runtime)
+		return;
+
+	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+
+	raw_spin_lock(&cfs_b->lock);
+	if (expires == cfs_b->runtime_expires)
+		cfs_b->runtime = runtime;
+	raw_spin_unlock(&cfs_b->lock);
+}
+
 /*
  * When a group wakes up we want to make sure that its quota is not already
  * expired/exceeded, otherwise it may be allowed to steal additional ticks of
@@ -1737,6 +1844,7 @@ static void account_cfs_rq_runtime(struc
 				     unsigned long delta_exec) {}
 static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
+static void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (15 preceding siblings ...)
  2011-07-21 16:43 ` [patch 16/18] sched: return unused runtime on group dequeue Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-07-21 16:43 ` [patch 18/18] sched: add documentation for bandwidth control Paul Turner
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-add_jump_labels.patch --]
[-- Type: text/plain, Size: 16125 bytes --]

So I'm seeing some strange costs associated with jump_labels; while on paper
the branches and instructions retired improve (as expected), we're taking an
unexpected hit in IPC.

[From the initial mail we have workloads:
  mkdir -p /cgroup/cpu/test
  echo $$ > /dev/cgroup/cpu/test (only cpu,cpuacct mounted)
  (W1) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "for ((i=0;i<5;i++)); do $(dirname $0)/pipe-test 20000; done"
  (W2)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;true"
  (W3)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;"
]

To make some of the figures clearer:

Legend:
!BWC = tip + bwc, BWC compiled out
BWC = tip + bwc
BWC_JL = tip + bwc + jump label (this patch)


Now, comparing under W1 we see:
W1: BWC vs BWC_JL
                            instructions            cycles                  branches              elapsed                
---------------------------------------------------------------------------------------------------------------------
clovertown [BWC]            845934117               974222228               152715407             0.419014188 [baseline]
+unconstrained              857963815 (+1.42)      1007152750 (+3.38)       153140328 (+0.28)     0.433186926 (+3.38)  [rel]
+10000000000/1000:          876937753 (+2.55)      1033978705 (+5.65)       160038434 (+3.59)     0.443638365 (+5.66)  [rel]
+10000000000/1000000:       880276838 (+3.08)      1036176245 (+6.13)       160683878 (+4.15)     0.444577244 (+6.14)  [rel]

barcelona [BWC]             820573353               748178486               148161233             0.342122850 [baseline] 
+unconstrained              817011602 (-0.43)       759838181 (+1.56)       145951513 (-1.49)     0.347462571 (+1.56)  [rel]
+10000000000/1000:          830109086 (+0.26)       770451537 (+1.67)       151228902 (+1.08)     0.350824677 (+1.65)  [rel]
+10000000000/1000000:       830196206 (+0.30)       770704213 (+2.27)       151250413 (+1.12)     0.350962182 (+2.28)  [rel]

westmere [BWC]              802533191               694415157               146071233             0.194428018 [baseline]
+unconstrained              799057936 (-0.43)       751384496 (+8.20)       143875513 (-1.50)     0.211182620 (+8.62)  [rel]
+10000000000/1000:          812033785 (+0.27)       761469084 (+8.51)       149134146 (+1.09)     0.212149229 (+8.28)  [rel]
+10000000000/1000000:       811912834 (+0.27)       757842988 (+7.45)       149113291 (+1.09)     0.211364804 (+7.30)  [rel]
e.g. Barcelona issues ~0.43% fewer instructions, for a total of 817011602, in
the unconstrained case relative to BWC.


Where "unconstrained, 10000000000/1000, 10000000000/10000" are the on
measurements for BWC_JL, with (%d) being the relative difference to their
BWC counterparts.

W2: BWC vs BWC_JL is very similar.
	BWC vs BWC_JL
clovertown [BWC]            985732031              1283113452               175621212             1.375905653  
+unconstrained              979242938 (-0.66)      1288971141 (+0.46)       172122546 (-1.99)     1.389795165 (+1.01)  [rel]
+10000000000/1000:          999886468 (+0.33)      1296597143 (+1.13)       180554004 (+1.62)     1.392576770 (+1.18)  [rel]
+10000000000/1000000:       999034223 (+0.11)      1293925500 (+0.57)       180413829 (+1.39)     1.391041338 (+0.94)  [rel]

barcelona [BWC]             982139920              1078757792               175417574             1.069537049  
+unconstrained              965443672 (-1.70)      1075377223 (-0.31)       170215844 (-2.97)     1.045595065 (-2.24)  [rel]
+10000000000/1000:          989104943 (+0.05)      1100836668 (+0.52)       178837754 (+1.22)     1.058730316 (-1.77)  [rel]
+10000000000/1000000:       987627489 (-0.32)      1095843758 (-0.17)       178567411 (+0.84)     1.056100899 (-2.28)  [rel]

westmere [BWC]              918633403               896047900               166496917             0.754629182  
+unconstrained              914740541 (-0.42)       903906801 (+0.88)       163652848 (-1.71)     0.758050332 (+0.45)  [rel]
+10000000000/1000:          927517377 (-0.41)       952579771 (+5.67)       170173060 (+0.75)     0.771193786 (+2.43)  [rel]
+10000000000/1000000:       914676985 (-0.89)       936106277 (+3.81)       167683288 (+0.22)     0.764973632 (+1.38)  [rel]

Now this is rather odd: almost across the board we're seeing the expected
drops in instructions and branches, yet we appear to be paying a heavy IPC
price.  The fact that wall-time has scaled equivalently with cycles roughly
rules out the cycles counter being off.

We are seeing the expected behavior in the bandwidth enabled case;
specifically the <jl=jmp><ret><cond><ret> blocks are taking an extra branch
and instruction which shows up on all the numbers above.

With respect to compiler mangling the text is essentially unchanged in size.
One lurking suspicion is whether the inserted nops have perturbed some of the
jmp/branch alignments?

    text    data     bss     dec     hex filename
 7277206 2827256 2125824 12230286         ba9e8e vmlinux.jump_label
 7276886 2826744 2125824 12229454         ba9b4e vmlinux.no_jump_label
 
I have checked to make sure that the right instructions are being patched in
at run-time.  I've also pulled a fully patched jump_label out of the kernel
into a userspace test (and benchmarked it directly under perf).  The results
here are also exactly as expected.

e.g.
 Performance counter stats for './jump_test':
     1,500,839,002 instructions, 300,147,081 branches 702,468,404 cycles
Performance counter stats for './jump_test 1':
     2,001,014,609 instructions, 400,177,192 branches 901,758,219 cycles

Overall, if we can fix the IPC, the benefit in the globally unconstrained case
looks really good.

Any thoughts Jason?

-----
Some more raw data:

perf-stat_to_perf-stat variance in performance for W1:

	BWC_JL vs BWC_JL (sample run-to-run variance on JL measurements)
                            instructions            cycles                  branches              elapsed                
---------------------------------------------------------------------------------------------------------------------
clovertown [BWC_JL]         857963815              1007152750               153140328             0.433186926  
+unconstrained              856457537 (-0.18)       986820040 (-2.02)       152871983 (-0.18)     0.424187340 (-2.08)  [rel]
+10000000000/1000:          880281114 (+0.38)      1009349419 (-2.38)       160668480 (+0.39)     0.433031825 (-2.39)  [rel]
+10000000000/1000000:       881001883 (+0.08)      1008445782 (-2.68)       160811824 (+0.08)     0.432629132 (-2.69)  [rel]

barcelona [BWC_JL]          817011602               759838181               145951513             0.347462571  
+unconstrained              817076246 (+0.01)       758404044 (-0.19)       145958670 (+0.00)     0.346313238 (-0.33)  [rel]
+10000000000/1000:          830087089 (-0.00)       773100724 (+0.34)       151218674 (-0.01)     0.352047450 (+0.35)  [rel]
+10000000000/1000000:       830002149 (-0.02)       773209942 (+0.33)       151208657 (-0.03)     0.352090862 (+0.32)  [rel]

westmere [BWC_JL]           799057936               751384496               143875513             0.211182620  
+unconstrained              799067664 (+0.00)       751165910 (-0.03)       143877385 (+0.00)     0.210928554 (-0.12)  [rel]
+10000000000/1000:          812040497 (+0.00)       748711039 (-1.68)       149135568 (+0.00)     0.208868390 (-1.55)  [rel]
+10000000000/1000000:       811911208 (-0.00)       746860347 (-1.45)       149113194 (-0.00)     0.208663627 (-1.28)  [rel]

	BWC vs BWC (sample run-to-run variance on BWC measurements)

ilium [BWC]                845934117               974222228               152715407             0.419014188  
+unconstrained              849061624 (+0.37)       965568244 (-0.89)       153288606 (+0.38)     0.415287406 (-0.89)  [rel]
+10000000000/1000:          861138018 (+0.71)       975979688 (-0.28)       155594606 (+0.71)     0.418710227 (-0.28)  [rel]
+10000000000/1000000:       858768659 (+0.56)       972288157 (-0.42)       155163198 (+0.57)     0.417130144 (-0.42)  [rel]

barcelona [BWC]                820573353               748178486               148161233             0.342122850  
+unconstrained              820494225 (-0.01)       748302946 (+0.02)       148147559 (-0.01)     0.341349438 (-0.23)  [rel]
+10000000000/1000:          827929735 (-0.00)       756163375 (-0.22)       149609111 (-0.00)     0.344356113 (-0.22)  [rel]
+10000000000/1000000:       827682550 (-0.00)       759867539 (+0.84)       149565408 (-0.00)     0.346039855 (+0.84)  [rel]

westmere [BWC]                802533191               694415157               146071233             0.194428018  
+unconstrained              802648805 (+0.01)       698052899 (+0.52)       146099982 (+0.02)     0.195632318 (+0.62)  [rel]
+10000000000/1000:          809855427 (-0.00)       703633926 (+0.26)       147519800 (-0.00)     0.196545542 (+0.32)  [rel]
+10000000000/1000000:       809646717 (-0.01)       704895639 (-0.05)       147476169 (-0.02)     0.197022787 (+0.01)  [rel]

Raw Westmere measurements:

BWC:
Case: Unconstrained -1

 Performance counter stats for 'bash -c for ((i=0;i<5;i++)); do ./pipe-test 20000; done' (50 runs):

         802533191 instructions             #      1.156 IPC     ( +-   0.004% )
         694415157 cycles                     ( +-   0.165% )
         146071233 branches                   ( +-   0.003% )

        0.194428018  seconds time elapsed   ( +-   0.437% )

Case: 10000000000/1000:

 Performance counter stats for 'bash -c for ((i=0;i<5;i++)); do ./pipe-test 20000; done' (50 runs):

         809861594 instructions             #      1.154 IPC     ( +-   0.016% )
         701781996 cycles                     ( +-   0.184% )
         147520953 branches                   ( +-   0.022% )

        0.195928354  seconds time elapsed   ( +-   0.262% )


Case: 10000000000/1000000:

 Performance counter stats for 'bash -c for ((i=0;i<5;i++)); do ./pipe-test 20000; done' (50 runs):

         809752541 instructions             #      1.148 IPC     ( +-   0.016% )
         705278419 cycles                     ( +-   0.593% )
         147502154 branches                   ( +-   0.022% )

        0.196993502  seconds time elapsed   ( +-   0.698% )

BWC_JL:
Case: Unconstrained -1

 Performance counter stats for 'bash -c for ((i=0;i<5;i++)); do ./pipe-test 20000; done' (50 runs):

         799057936 instructions             #      1.063 IPC     ( +-   0.001% )
         751384496 cycles                     ( +-   0.584% )
         143875513 branches                   ( +-   0.001% )

        0.211182620  seconds time elapsed   ( +-   0.771% )

Case: 10000000000/1000:

 Performance counter stats for 'bash -c for ((i=0;i<5;i++)); do ./pipe-test 20000; done' (50 runs):

         812033785 instructions             #      1.066 IPC     ( +-   0.017% )
         761469084 cycles                     ( +-   0.125% )
         149134146 branches                   ( +-   0.022% )

        0.212149229  seconds time elapsed   ( +-   0.171% )


Case: 10000000000/1000000:

 Performance counter stats for 'bash -c for ((i=0;i<5;i++)); do ./pipe-test 20000; done' (50 runs):

         811912834 instructions             #      1.071 IPC     ( +-   0.017% )
         757842988 cycles                     ( +-   0.158% )
         149113291 branches                   ( +-   0.022% )

        0.211364804  seconds time elapsed   ( +-   0.225% )


Let me know if there's any particular raw data you want; westmere seems the
most interesting because it's taking the biggest hit.

-------


From: Paul Turner <pjt@google.com>

When no groups within the system are constrained we can use jump labels to
reduce overheads -- skipping the per-cfs_rq runtime_enabled checks.

Signed-off-by: Paul Turner <pjt@google.com>
---
 kernel/sched.c      |   33 +++++++++++++++++++++++++++++++--
 kernel/sched_fair.c |   15 ++++++++++++---
 2 files changed, 43 insertions(+), 5 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -71,6 +71,7 @@
 #include <linux/ctype.h>
 #include <linux/ftrace.h>
 #include <linux/slab.h>
+#include <linux/jump_label.h>
 
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
@@ -499,7 +500,32 @@ static void destroy_cfs_bandwidth(struct
 	hrtimer_cancel(&cfs_b->period_timer);
 	hrtimer_cancel(&cfs_b->slack_timer);
 }
-#else
+
+#ifdef HAVE_JUMP_LABEL
+static struct jump_label_key __cfs_bandwidth_enabled;
+
+static inline bool cfs_bandwidth_enabled(void)
+{
+	return static_branch(&__cfs_bandwidth_enabled);
+}
+
+static void account_cfs_bandwidth_enabled(int enabled, int was_enabled)
+{
+	/* only need to count groups transitioning between enabled/!enabled */
+	if (enabled && !was_enabled)
+		jump_label_inc(&__cfs_bandwidth_enabled);
+	else if (!enabled && was_enabled)
+		jump_label_dec(&__cfs_bandwidth_enabled);
+}
+#else /* !HAVE_JUMP_LABEL */
+/* static_branch doesn't help unless supported */
+static int cfs_bandwidth_enabled(void)
+{
+	return 1;
+}
+static void account_cfs_bandwidth_enabled(int enabled, int was_enabled) {}
+#endif /* HAVE_JUMP_LABEL */
+#else /* !CONFIG_CFS_BANDWIDTH */
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
@@ -9025,7 +9051,7 @@ static int __cfs_schedulable(struct task
 
 static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 {
-	int i, ret = 0, runtime_enabled;
+	int i, ret = 0, runtime_enabled, runtime_was_enabled;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
 
 	if (tg == &root_task_group)
@@ -9053,6 +9079,9 @@ static int tg_set_cfs_bandwidth(struct t
 		goto out_unlock;
 
 	runtime_enabled = quota != RUNTIME_INF;
+	runtime_was_enabled = cfs_b->quota != RUNTIME_INF;
+	account_cfs_bandwidth_enabled(runtime_enabled, runtime_was_enabled);
+
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1430,7 +1430,7 @@ static void __account_cfs_rq_runtime(str
 static __always_inline void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 						   unsigned long delta_exec)
 {
-	if (!cfs_rq->runtime_enabled)
+	if (!cfs_bandwidth_enabled() || !cfs_rq->runtime_enabled)
 		return;
 
 	__account_cfs_rq_runtime(cfs_rq, delta_exec);
@@ -1438,13 +1438,13 @@ static __always_inline void account_cfs_
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {
-	return cfs_rq->throttled;
+	return cfs_bandwidth_enabled() && cfs_rq->throttled;
 }
 
 /* check whether cfs_rq, or any parent, is throttled */
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 {
-	return cfs_rq->throttle_count;
+	return cfs_bandwidth_enabled() && cfs_rq->throttle_count;
 }
 
 /*
@@ -1765,6 +1765,9 @@ static void __return_cfs_rq_runtime(stru
 
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
+	if (!cfs_bandwidth_enabled())
+		return;
+
 	if (!cfs_rq->runtime_enabled || !cfs_rq->nr_running)
 		return;
 
@@ -1810,6 +1813,9 @@ static void do_sched_cfs_slack_timer(str
  */
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
 {
+	if (!cfs_bandwidth_enabled())
+		return;
+
 	/* an active group must be handled by the update_curr()->put() path */
 	if (!cfs_rq->runtime_enabled || cfs_rq->curr)
 		return;
@@ -1827,6 +1833,9 @@ static void check_enqueue_throttle(struc
 /* conditionally throttle active cfs_rq's from put_prev_entity() */
 static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
+	if (!cfs_bandwidth_enabled())
+		return;
+
 	if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
 		return;
 



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [patch 18/18] sched: add documentation for bandwidth control
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (16 preceding siblings ...)
  2011-07-21 16:43 ` [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-08-14 16:41   ` [tip:sched/core] sched: Add " tip-bot for Bharata B Rao
  2011-07-21 23:01 ` [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (2 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-documentation.patch --]
[-- Type: text/plain, Size: 5488 bytes --]

From: Bharata B Rao <bharata@linux.vnet.ibm.com>

Basic description of usage and effect for CFS Bandwidth Control.

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: Paul Turner <pjt@google.com>
---
 Documentation/scheduler/sched-bwc.txt |  122 ++++++++++++++++++++++++++++++++++
 1 file changed, 122 insertions(+)

Index: tip/Documentation/scheduler/sched-bwc.txt
===================================================================
--- /dev/null
+++ tip/Documentation/scheduler/sched-bwc.txt
@@ -0,0 +1,122 @@
+CFS Bandwidth Control
+=====================
+
+[ This document only discusses CPU bandwidth control for SCHED_NORMAL.
+  The SCHED_RT case is covered in Documentation/scheduler/sched-rt-group.txt ]
+
+CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
+specification of the maximum CPU bandwidth available to a group or hierarchy.
+
+The bandwidth allowed for a group is specified using a quota and period. Within
+each given "period" (microseconds), a group is allowed to consume only up to
+"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
+group exceeds this limit (for that period), the tasks belonging to its
+hierarchy will be throttled and are not allowed to run again until the next
+period.
+
+A group's unused runtime is globally tracked, being refreshed with quota units
+above at each period boundary.  As threads consume this bandwidth it is
+transferred to cpu-local "silos" on a demand basis.  The amount transferred
+within each of these updates is tunable and described as the "slice".
+
+Management
+----------
+Quota and period are managed within the cpu subsystem via cgroupfs.
+
+cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
+cpu.cfs_period_us: the length of a period (in microseconds)
+cpu.stat: exports throttling statistics [explained further below]
+
+The default values are:
+	cpu.cfs_period_us=100ms
+	cpu.cfs_quota_us=-1
+
+A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
+bandwidth restriction in place; such a group is described as an unconstrained
+bandwidth group.  This represents the traditional work-conserving behavior for
+CFS.
+
+Writing any (valid) positive value(s) will enact the specified bandwidth limit.
+The minimum value allowed for either quota or period is 1ms.  There is also an
+upper bound on the period length of 1s.  Additional restrictions exist when
+bandwidth limits are used in a hierarchical fashion, these are explained in
+more detail below.
+
+Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
+and return the group to an unconstrained state once more.
+
+Any updates to a group's bandwidth specification will result in it becoming
+unthrottled if it is in a constrained state.
+
+System wide settings
+--------------------
+For efficiency run-time is transferred between the global pool and CPU local
+"silos" in a batch fashion.  This greatly reduces global accounting pressure
+on large systems.  The amount transferred each time such an update is required
+is described as the "slice".
+
+This is tunable via procfs:
+	/proc/sys/kernel/sched_cfs_bandwidth_slice_us (default=5ms)
+
+Larger slice values will reduce transfer overheads, while smaller values allow
+for more fine-grained consumption.
+
+Statistics
+----------
+A group's bandwidth statistics are exported via 3 fields in cpu.stat.
+
+cpu.stat:
+- nr_periods: Number of enforcement intervals that have elapsed.
+- nr_throttled: Number of times the group has been throttled/limited.
+- throttled_time: The total time duration (in nanoseconds) for which entities
+  of the group have been throttled.
+
+This interface is read-only.
+
+Hierarchical considerations
+---------------------------
+The interface enforces that an individual entity's bandwidth is always
+attainable, that is: max(c_i) <= C. However, over-subscription in the
+aggregate case is explicitly allowed to enable work-conserving semantics
+within a hierarchy.
+  e.g. \Sum (c_i) may exceed C
+[ Where C is the parent's bandwidth, and c_i its children ]
+
+
+There are two ways in which a group may become throttled:
+	a. it fully consumes its own quota within a period
+	b. a parent's quota is fully consumed within its period
+
+In case b) above, even though the child may have runtime remaining it will not
+be allowed to run until the parent's runtime is refreshed.
+
+Examples
+--------
+1. Limit a group to 1 CPU worth of runtime.
+
+	If period is 250ms and quota is also 250ms, the group will get
+	1 CPU worth of runtime every 250ms.
+
+	# echo 250000 > cpu.cfs_quota_us /* quota = 250ms */
+	# echo 250000 > cpu.cfs_period_us /* period = 250ms */
+
+2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine.
+
+	With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
+	runtime every 500ms.
+
+	# echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
+	# echo 500000 > cpu.cfs_period_us /* period = 500ms */
+
+	The larger period here allows for increased burst capacity.
+
+3. Limit a group to 20% of 1 CPU.
+
+	With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU.
+
+	# echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
+	# echo 50000 > cpu.cfs_period_us /* period = 50ms */
+
+	By using a small period here we are ensuring a consistent latency
+	response at the expense of burst capacity.
+
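
[ Not part of the patch above: a minimal, hypothetical consumer of the
  read-only cpu.stat interface documented here.  The cgroup path
  (/cgroup/cpu/test) is an assumption carried over from the test setup used
  earlier in this thread; the three field names are those listed in the
  Statistics section. ]

	#include <stdio.h>

	int main(void)
	{
		FILE *f = fopen("/cgroup/cpu/test/cpu.stat", "r");
		char name[32];
		unsigned long long val;

		if (!f) {
			perror("cpu.stat");
			return 1;
		}

		/* expected fields: nr_periods, nr_throttled, throttled_time (ns) */
		while (fscanf(f, "%31s %llu", name, &val) == 2)
			printf("%s = %llu\n", name, val);

		fclose(f);
		return 0;
	}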


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (17 preceding siblings ...)
  2011-07-21 16:43 ` [patch 18/18] sched: add documentation for bandwidth control Paul Turner
@ 2011-07-21 23:01 ` Paul Turner
  2011-07-25 14:58 ` Peter Zijlstra
  2011-09-13 12:10 ` Vladimir Davydov
  20 siblings, 0 replies; 60+ messages in thread
From: Paul Turner @ 2011-07-21 23:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

On Thu, Jul 21, 2011 at 9:43 AM, Paul Turner <pjt@google.com> wrote:
> Hi all,
>
> Please find attached the incremental v7.2 for bandwidth control.
>
> This release follows a fairly intensive period of scraping cycles across
> various configurations.  Unfortunately we seem to be currently taking an IPC
> hit for jump_labels (despite a savings in branches/instr. ret) which despite
> fairly extensive digging I don't have a good explanation for.  The emitted
> assembly /looks/ ok, but cycles/wall time is consistently higher across several
> platforms.
>
> As such I've demoted the jumppatch to [RFT] while these details are worked
> out.  But there's no point in holding up the rest of the series any more.
>
> [ Please find the specific discussion related to the above attached to patch
> 17/18. ]
>
> So -- without jump labels -- the current performance looks like:
>
>                            instructions            cycles                  branches
> ---------------------------------------------------------------------------------------------
> clovertown [!BWC]           843695716               965744453               151224759
> +unconstrained              845934117 (+0.27)       974222228 (+0.88)       152715407 (+0.99)
> +10000000000/1000:          855102086 (+1.35)       978728348 (+1.34)       154495984 (+2.16)
> +10000000000/1000000:       853981660 (+1.22)       976344561 (+1.10)       154287243 (+2.03)
>
> barcelona [!BWC]            810514902               761071312               145351489
> +unconstrained              820573353 (+1.24)       748178486 (-1.69)       148161233 (+1.93)
> +10000000000/1000:          827963132 (+2.15)       757829815 (-0.43)       149611950 (+2.93)
> +10000000000/1000000:       827701516 (+2.12)       753575001 (-0.98)       149568284 (+2.90)
>
> westmere [!BWC]             792513879               702882443               143267136
> +unconstrained              802533191 (+1.26)       694415157 (-1.20)       146071233 (+1.96)
> +10000000000/1000:          809861594 (+2.19)       701781996 (-0.16)       147520953 (+2.97)
> +10000000000/1000000:       809752541 (+2.18)       705278419 (+0.34)       147502154 (+2.96)
>
> Under the workload:
>  mkdir -p /cgroup/cpu/test
>  echo $$ > /dev/cgroup/cpu/test (only cpu,cpuacct mounted)
>  (W1) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "for ((i=0;i<5;i++)); do $(dirname $0)/pipe-test 20000; done"
>
> This may seem a strange work-load but it works around some bizarro overheads
> currently introduced by perf.  Comparing for example with::w
>  (W2)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;true"
>  (W3)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;"
>
>
> We see:

(Sorry this is missing an "instructions,cycles,branches,elapsed time" header.)

>  (W1)  westmere [!BWC]             792513879               702882443               143267136             0.197246943
>  (W2)  westmere [!BWC]             912241728               772576786               165734252             0.214923134
>  (W3)  westmere [!BWC]             904349725               882084726               162577399             0.748506065
>
> vs an 'ideal' total exec time of (approximately):
> $ time taskset -c 0 ./pipe-test 100000
>  real    0m0.198s user    0m0.007s sys    0m0.095s
>
> The overhead in W2 is explained by the fact that, when invoking pipe-test
> directly, one of the siblings becomes the perf_ctx parent, incurring lots of
> pain every time we switch.  I do not have a reasonable explanation as to why
> (W1) is so much cheaper than (W2); I stumbled across it by accident when I was
> trying some combinations to reduce the <perf stat>-to-<perf stat> variance.
>
> v7.2
> -----------
> - Build errors in !CGROUP_SCHED case fixed
> - !CONFIG_SMP now 'supported' (#ifdef munging)
> - gcc was failing to inline account_cfs_rq_runtime, affecting performance
> - checks in expire_cfs_rq_runtime() and check_enqueue_throttle() re-organized
>  to save branches.
> - jump labels introduced in the case BWC is not being used system-wide to
>  reduce inert overhead.
> - branch saved in expiring runtime (reorganized conditionals)
>
> Hidetoshi, the following patchsets have changed enough to necessitate tweaking
> of your Reviewed-by:
> [patch 09/18] sched: add support for unthrottling group entities (extensive)
> [patch 11/18] sched: prevent interactions with throttled entities (update_cfs_shares)
> [patch 12/18] sched: prevent buddy interactions with throttled entities (new)
>
>
> Previous postings:
> -----------------
> v7.1: https://lkml.org/lkml/2011/7/7/24
> v7: http://lkml.org/lkml/2011/6/21/43
> v6: http://lkml.org/lkml/2011/5/7/37
> v5: http://lkml.org/lkml/2011/3/22/477
> v4: http://lkml.org/lkml/2011/2/23/44
> v3: http://lkml.org/lkml/2010/10/12/44
> v2: http://lkml.org/lkml/2010/4/28/88
> Original posting: http://lkml.org/lkml/2010/2/12/393
>
> Prior approaches: http://lkml.org/lkml/2010/1/5/44 ["CFS Hard limits v5"]
>
> Thanks,
>
> - Paul
>
>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 01/18] sched: (fixlet) dont update shares twice on on_rq parent
  2011-07-21 16:43 ` [patch 01/18] sched: (fixlet) dont update shares twice on on_rq parent Paul Turner
@ 2011-07-22 11:06   ` Kamalesh Babulal
  0 siblings, 0 replies; 60+ messages in thread
From: Kamalesh Babulal @ 2011-07-22 11:06 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

* Paul Turner <pjt@google.com> [2011-07-21 09:43:26]:

> In dequeue_task_fair() we bail on dequeue when we encounter a parenting entity
> with additional weight.  However, we perform a double shares update on this
> entity as we continue the shares update traversal from this point, despite
> dequeue_entity() having already updated its queuing cfs_rq.
> Avoid this by starting from the parent when we resume.
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> ---
>  kernel/sched_fair.c |    3 +++
>  1 file changed, 3 insertions(+)
> 
> Index: tip/kernel/sched_fair.c
> ===================================================================
> --- tip.orig/kernel/sched_fair.c
> +++ tip/kernel/sched_fair.c
> @@ -1370,6 +1370,9 @@ static void dequeue_task_fair(struct rq 
>  			 */
>  			if (task_sleep && parent_entity(se))
>  				set_next_buddy(parent_entity(se));
> +
> +			/* avoid re-evaluating load for this entity */
> +			se = parent_entity(se);
>  			break;
>  		}
>  		flags |= DEQUEUE_SLEEP;
> 
this patch has been merged into tip

commit 9598c82dcacadc3b9daa8170613fd054c6124d30
Author: Paul Turner <pjt@google.com>
Date:   Wed Jul 6 22:30:37 2011 -0700

    sched: Don't update shares twice on on_rq parent


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 03/18] sched: introduce primitives to account for CFS bandwidth tracking
  2011-07-21 16:43 ` [patch 03/18] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
@ 2011-07-22 11:14   ` Kamalesh Babulal
  2011-08-14 16:17   ` [tip:sched/core] sched: Introduce " tip-bot for Paul Turner
  1 sibling, 0 replies; 60+ messages in thread
From: Kamalesh Babulal @ 2011-07-22 11:14 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron,
	Nikhil Rao

* Paul Turner <pjt@google.com> [2011-07-21 09:43:28]:

> In this patch we introduce the notion of CFS bandwidth, partitioned into
> globally unassigned bandwidth, and locally claimed bandwidth.
> 
> - The global bandwidth is per task_group, it represents a pool of unclaimed
>   bandwidth that cfs_rqs can allocate from.  
> - The local bandwidth is tracked per-cfs_rq, this represents allotments from
>   the global pool bandwidth assigned to a specific cpu.
> 
> Bandwidth is managed via cgroupfs, adding two new interfaces to the cpu subsystem:
> - cpu.cfs_period_us : the bandwidth period in usecs
> - cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be allowed
>   to consume over period above.
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> Signed-off-by: Nikhil Rao <ncrao@google.com>
> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
> 
> ---
>  init/Kconfig        |   12 +++
>  kernel/sched.c      |  196 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>  kernel/sched_fair.c |   16 ++++
>  3 files changed, 220 insertions(+), 4 deletions(-)
> 
> Index: tip/init/Kconfig
> ===================================================================
> --- tip.orig/init/Kconfig
> +++ tip/init/Kconfig
> @@ -715,6 +715,18 @@ config FAIR_GROUP_SCHED
>  	depends on CGROUP_SCHED
>  	default CGROUP_SCHED
> 
> +config CFS_BANDWIDTH
> +	bool "CPU bandwidth provisioning for FAIR_GROUP_SCHED"
> +	depends on EXPERIMENTAL
> +	depends on FAIR_GROUP_SCHED
> +	default n
> +	help
> +	  This option allows users to define CPU bandwidth rates (limits) for
> +	  tasks running within the fair group scheduler.  Groups with no limit
> +	  set are considered to be unconstrained and will run with no
> +	  restriction.
> +	  See tip/Documentation/scheduler/sched-bwc.txt for more information.
> +
>  config RT_GROUP_SCHED
>  	bool "Group scheduling for SCHED_RR/FIFO"
>  	depends on EXPERIMENTAL
> Index: tip/kernel/sched.c
> ===================================================================
> --- tip.orig/kernel/sched.c
> +++ tip/kernel/sched.c
> @@ -244,6 +244,14 @@ struct cfs_rq;
> 
>  static LIST_HEAD(task_groups);
> 
> +struct cfs_bandwidth {
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	raw_spinlock_t lock;
> +	ktime_t period;
> +	u64 quota;
> +#endif
> +};
> +
>  /* task group related information */
>  struct task_group {
>  	struct cgroup_subsys_state css;
> @@ -275,6 +283,8 @@ struct task_group {
>  #ifdef CONFIG_SCHED_AUTOGROUP
>  	struct autogroup *autogroup;
>  #endif
> +
> +	struct cfs_bandwidth cfs_bandwidth;
>  };
> 
>  /* task_group_lock serializes the addition/removal of task groups */
> @@ -374,9 +384,48 @@ struct cfs_rq {
> 
>  	unsigned long load_contribution;
>  #endif
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	int runtime_enabled;
> +	s64 runtime_remaining;
> +#endif
>  #endif
>  };
> 
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +#ifdef CONFIG_CFS_BANDWIDTH
> +static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> +{
> +	return &tg->cfs_bandwidth;
> +}
> +
> +static inline u64 default_cfs_period(void);
> +
> +static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> +{
> +	raw_spin_lock_init(&cfs_b->lock);
> +	cfs_b->quota = RUNTIME_INF;
> +	cfs_b->period = ns_to_ktime(default_cfs_period());
> +}
> +
> +static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> +{
> +	cfs_rq->runtime_enabled = 0;
> +}
> +
> +static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> +{}
> +#else
> +static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
> +static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
> +static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
> +
> +static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> +{
> +	return NULL;
> +}
> +#endif /* CONFIG_CFS_BANDWIDTH */
> +#endif /* CONFIG_FAIR_GROUP_SCHED */
> +
>  /* Real-Time classes' related field in a runqueue: */
>  struct rt_rq {
>  	struct rt_prio_array active;
> @@ -7795,6 +7844,7 @@ static void init_tg_cfs_entry(struct tas
>  	tg->cfs_rq[cpu] = cfs_rq;
>  	init_cfs_rq(cfs_rq, rq);
>  	cfs_rq->tg = tg;
> +	init_cfs_rq_runtime(cfs_rq);

this hunk fails to apply, due to the changes introduced by
acb5a9ba3bd7 in the tip tree.
> 
>  	tg->se[cpu] = se;
>  	/* se could be NULL for root_task_group */
> @@ -7930,6 +7980,7 @@ void __init sched_init(void)
>  		 * We achieve this by letting root_task_group's tasks sit
>  		 * directly in rq->cfs (i.e root_task_group->se[] = NULL).
>  		 */
> +		init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
>  		init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
> 
> @@ -8171,6 +8222,8 @@ static void free_fair_sched_group(struct
>  {
>  	int i;
> 
> +	destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
> +
>  	for_each_possible_cpu(i) {
>  		if (tg->cfs_rq)
>  			kfree(tg->cfs_rq[i]);
> @@ -8198,6 +8251,8 @@ int alloc_fair_sched_group(struct task_g
> 
>  	tg->shares = NICE_0_LOAD;
> 
> +	init_cfs_bandwidth(tg_cfs_bandwidth(tg));
> +
>  	for_each_possible_cpu(i) {
>  		cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
>  				      GFP_KERNEL, cpu_to_node(i));
> @@ -8569,7 +8624,7 @@ static int __rt_schedulable(struct task_
>  	return walk_tg_tree(tg_schedulable, tg_nop, &data);
>  }
> 
> -static int tg_set_bandwidth(struct task_group *tg,
> +static int tg_set_rt_bandwidth(struct task_group *tg,
>  		u64 rt_period, u64 rt_runtime)
>  {
>  	int i, err = 0;
> @@ -8608,7 +8663,7 @@ int sched_group_set_rt_runtime(struct ta
>  	if (rt_runtime_us < 0)
>  		rt_runtime = RUNTIME_INF;
> 
> -	return tg_set_bandwidth(tg, rt_period, rt_runtime);
> +	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
>  }
> 
>  long sched_group_rt_runtime(struct task_group *tg)
> @@ -8633,7 +8688,7 @@ int sched_group_set_rt_period(struct tas
>  	if (rt_period == 0)
>  		return -EINVAL;
> 
> -	return tg_set_bandwidth(tg, rt_period, rt_runtime);
> +	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
>  }
> 
>  long sched_group_rt_period(struct task_group *tg)
> @@ -8823,6 +8878,128 @@ static u64 cpu_shares_read_u64(struct cg
> 
>  	return (u64) scale_load_down(tg->shares);
>  }
> +
> +#ifdef CONFIG_CFS_BANDWIDTH
> +const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
> +const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
> +
> +static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
> +{
> +	int i;
> +	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> +	static DEFINE_MUTEX(mutex);
> +
> +	if (tg == &root_task_group)
> +		return -EINVAL;
> +
> +	/*
> +	 * Ensure we have at some amount of bandwidth every period.  This is
> +	 * to prevent reaching a state of large arrears when throttled via
> +	 * entity_tick() resulting in prolonged exit starvation.
> +	 */
> +	if (quota < min_cfs_quota_period || period < min_cfs_quota_period)
> +		return -EINVAL;
> +
> +	/*
> +	 * Likewise, bound things on the otherside by preventing insane quota
> +	 * periods.  This also allows us to normalize in computing quota
> +	 * feasibility.
> +	 */
> +	if (period > max_cfs_quota_period)
> +		return -EINVAL;
> +
> +	mutex_lock(&mutex);
> +	raw_spin_lock_irq(&cfs_b->lock);
> +	cfs_b->period = ns_to_ktime(period);
> +	cfs_b->quota = quota;
> +	raw_spin_unlock_irq(&cfs_b->lock);
> +
> +	for_each_possible_cpu(i) {
> +		struct cfs_rq *cfs_rq = tg->cfs_rq[i];
> +		struct rq *rq = rq_of(cfs_rq);
> +
> +		raw_spin_lock_irq(&rq->lock);
> +		cfs_rq->runtime_enabled = quota != RUNTIME_INF;
> +		cfs_rq->runtime_remaining = 0;
> +		raw_spin_unlock_irq(&rq->lock);
> +	}
> +	mutex_unlock(&mutex);
> +
> +	return 0;
> +}
> +
> +int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
> +{
> +	u64 quota, period;
> +
> +	period = ktime_to_ns(tg_cfs_bandwidth(tg)->period);
> +	if (cfs_quota_us < 0)
> +		quota = RUNTIME_INF;
> +	else
> +		quota = (u64)cfs_quota_us * NSEC_PER_USEC;
> +
> +	return tg_set_cfs_bandwidth(tg, period, quota);
> +}
> +
> +long tg_get_cfs_quota(struct task_group *tg)
> +{
> +	u64 quota_us;
> +
> +	if (tg_cfs_bandwidth(tg)->quota == RUNTIME_INF)
> +		return -1;
> +
> +	quota_us = tg_cfs_bandwidth(tg)->quota;
> +	do_div(quota_us, NSEC_PER_USEC);
> +
> +	return quota_us;
> +}
> +
> +int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
> +{
> +	u64 quota, period;
> +
> +	period = (u64)cfs_period_us * NSEC_PER_USEC;
> +	quota = tg_cfs_bandwidth(tg)->quota;
> +
> +	if (period <= 0)
> +		return -EINVAL;
> +
> +	return tg_set_cfs_bandwidth(tg, period, quota);
> +}
> +
> +long tg_get_cfs_period(struct task_group *tg)
> +{
> +	u64 cfs_period_us;
> +
> +	cfs_period_us = ktime_to_ns(tg_cfs_bandwidth(tg)->period);
> +	do_div(cfs_period_us, NSEC_PER_USEC);
> +
> +	return cfs_period_us;
> +}
> +
> +static s64 cpu_cfs_quota_read_s64(struct cgroup *cgrp, struct cftype *cft)
> +{
> +	return tg_get_cfs_quota(cgroup_tg(cgrp));
> +}
> +
> +static int cpu_cfs_quota_write_s64(struct cgroup *cgrp, struct cftype *cftype,
> +				s64 cfs_quota_us)
> +{
> +	return tg_set_cfs_quota(cgroup_tg(cgrp), cfs_quota_us);
> +}
> +
> +static u64 cpu_cfs_period_read_u64(struct cgroup *cgrp, struct cftype *cft)
> +{
> +	return tg_get_cfs_period(cgroup_tg(cgrp));
> +}
> +
> +static int cpu_cfs_period_write_u64(struct cgroup *cgrp, struct cftype *cftype,
> +				u64 cfs_period_us)
> +{
> +	return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
> +}
> +
> +#endif /* CONFIG_CFS_BANDWIDTH */
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
> 
>  #ifdef CONFIG_RT_GROUP_SCHED
> @@ -8857,6 +9034,18 @@ static struct cftype cpu_files[] = {
>  		.write_u64 = cpu_shares_write_u64,
>  	},
>  #endif
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	{
> +		.name = "cfs_quota_us",
> +		.read_s64 = cpu_cfs_quota_read_s64,
> +		.write_s64 = cpu_cfs_quota_write_s64,
> +	},
> +	{
> +		.name = "cfs_period_us",
> +		.read_u64 = cpu_cfs_period_read_u64,
> +		.write_u64 = cpu_cfs_period_write_u64,
> +	},
> +#endif
>  #ifdef CONFIG_RT_GROUP_SCHED
>  	{
>  		.name = "rt_runtime_us",
> @@ -9166,4 +9355,3 @@ struct cgroup_subsys cpuacct_subsys = {
>  	.subsys_id = cpuacct_subsys_id,
>  };
>  #endif	/* CONFIG_CGROUP_CPUACCT */
> -
> Index: tip/kernel/sched_fair.c
> ===================================================================
> --- tip.orig/kernel/sched_fair.c
> +++ tip/kernel/sched_fair.c
> @@ -1256,6 +1256,22 @@ entity_tick(struct cfs_rq *cfs_rq, struc
>  		check_preempt_tick(cfs_rq, curr);
>  }
> 
> +
> +/**************************************************
> + * CFS bandwidth control machinery
> + */
> +
> +#ifdef CONFIG_CFS_BANDWIDTH
> +/*
> + * default period for cfs group bandwidth.
> + * default: 0.1s, units: nanoseconds
> + */
> +static inline u64 default_cfs_period(void)
> +{
> +	return 100000000ULL;
> +}
> +#endif
> +
>  /**************************************************
>   * CFS operations on tasks:
>   */
> 
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 11/18] sched: prevent interactions with throttled entities
  2011-07-21 16:43 ` [patch 11/18] sched: prevent interactions with throttled entities Paul Turner
@ 2011-07-22 11:26   ` Kamalesh Babulal
  2011-07-22 11:37     ` Peter Zijlstra
  2011-07-22 11:41   ` Kamalesh Babulal
  2011-08-14 16:30   ` [tip:sched/core] sched: Prevent " tip-bot for Paul Turner
  2 siblings, 1 reply; 60+ messages in thread
From: Kamalesh Babulal @ 2011-07-22 11:26 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

* Paul Turner <pjt@google.com> [2011-07-21 09:43:36]:

> From the perspective of load-balance and shares distribution, throttled
> entities should be invisible.
> 
> However, both of these operations work on 'active' lists and are not
> inherently aware of what group hierarchies may be present.  In some cases this
> may be side-stepped (e.g. we could sideload via tg_load_down in load balance) 
> while in others (e.g. update_shares()) it is more difficult to compute without
> incurring some O(n^2) costs.
> 
> Instead, track hierarchical throttled state at time of transition.  This
> allows us to easily identify whether an entity belongs to a throttled hierarchy
> and avoid incorrect interactions with it.
> 
> Also, when an entity leaves a throttled hierarchy we need to advance its
> time averaging for shares averaging so that the elapsed throttled time is not
> considered as part of the cfs_rq's operation.
> 
> We also use this information to prevent buddy interactions in the wakeup and
> yield_to() paths.
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
> 
> ---
>  kernel/sched.c      |    2 -
>  kernel/sched_fair.c |   99 ++++++++++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 94 insertions(+), 7 deletions(-)
> 
> Index: tip/kernel/sched_fair.c
> ===================================================================
> --- tip.orig/kernel/sched_fair.c
> +++ tip/kernel/sched_fair.c
> @@ -725,6 +725,8 @@ account_entity_dequeue(struct cfs_rq *cf
>  }
> 
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> +/* we need this in update_cfs_load and load-balance functions below */
> +static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
>  # ifdef CONFIG_SMP
>  static void update_cfs_rq_load_contribution(struct cfs_rq *cfs_rq,
>  					    int global_update)
> @@ -747,7 +749,7 @@ static void update_cfs_load(struct cfs_r
>  	u64 now, delta;
>  	unsigned long load = cfs_rq->load.weight;
> 
> -	if (cfs_rq->tg == &root_task_group)
> +	if (cfs_rq->tg == &root_task_group || throttled_hierarchy(cfs_rq))
>  		return;
> 
>  	now = rq_of(cfs_rq)->clock_task;
> @@ -856,7 +858,7 @@ static void update_cfs_shares(struct cfs
> 
>  	tg = cfs_rq->tg;
>  	se = tg->se[cpu_of(rq_of(cfs_rq))];
> -	if (!se)
> +	if (!se || throttled_hierarchy(cfs_rq))
>  		return;
>  #ifndef CONFIG_SMP
>  	if (likely(se->load.weight == tg->shares))
> @@ -1425,6 +1427,65 @@ static inline int cfs_rq_throttled(struc
>  	return cfs_rq->throttled;
>  }
> 
> +/* check whether cfs_rq, or any parent, is throttled */
> +static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
> +{
> +	return cfs_rq->throttle_count;
> +}
> +
> +/*
> + * Ensure that neither of the group entities corresponding to src_cpu or
> + * dest_cpu are members of a throttled hierarchy when performing group
> + * load-balance operations.
> + */
> +static inline int throttled_lb_pair(struct task_group *tg,
> +				    int src_cpu, int dest_cpu)
> +{
> +	struct cfs_rq *src_cfs_rq, *dest_cfs_rq;
> +
> +	src_cfs_rq = tg->cfs_rq[src_cpu];
> +	dest_cfs_rq = tg->cfs_rq[dest_cpu];
> +
> +	return throttled_hierarchy(src_cfs_rq) ||
> +	       throttled_hierarchy(dest_cfs_rq);
> +}
> +
> +/* updated child weight may affect parent so we have to do this bottom up */
> +static int tg_unthrottle_up(struct task_group *tg, void *data)
> +{
> +	struct rq *rq = data;
> +	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> +
> +	cfs_rq->throttle_count--;
> +#ifdef CONFIG_SMP
> +	if (!cfs_rq->throttle_count) {
> +		u64 delta = rq->clock_task - cfs_rq->load_stamp;
> +
> +		/* leaving throttled state, advance shares averaging windows */
> +		cfs_rq->load_stamp += delta;
> +		cfs_rq->load_last += delta;
> +
> +		/* update entity weight now that we are on_rq again */
> +		update_cfs_shares(cfs_rq);
> +	}
> +#endif
> +
> +	return 0;
> +}
> +
> +static int tg_throttle_down(struct task_group *tg, void *data)
> +{
> +	struct rq *rq = data;
> +	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> +
> +	/* group is entering throttled state, record last load */
> +	if (!cfs_rq->throttle_count)
> +		update_cfs_load(cfs_rq, 0);
> +	cfs_rq->throttle_count++;
> +
> +	return 0;
> +}
> +
>  static __used void throttle_cfs_rq(struct cfs_rq *cfs_rq)
>  {
>  	struct rq *rq = rq_of(cfs_rq);
> @@ -1435,7 +1496,9 @@ static __used void throttle_cfs_rq(struc
>  	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
> 
>  	/* account load preceding throttle */
> -	update_cfs_load(cfs_rq, 0);
> +	rcu_read_lock();
> +	walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
> +	rcu_read_unlock();
> 
>  	task_delta = cfs_rq->h_nr_running;
>  	for_each_sched_entity(se) {
> @@ -1476,6 +1539,10 @@ static void unthrottle_cfs_rq(struct cfs
>  	list_del_rcu(&cfs_rq->throttled_list);
>  	raw_spin_unlock(&cfs_b->lock);
> 
> +	update_rq_clock(rq);
> +	/* update hierarchical throttle state */
> +	walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
> +
>  	if (!cfs_rq->load.weight)
>  		return;
> 
> @@ -1620,6 +1687,17 @@ static inline int cfs_rq_throttled(struc
>  {
>  	return 0;
>  }
> +
> +static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
> +{
> +	return 0;
> +}
> +
> +static inline int throttled_lb_pair(struct task_group *tg,
> +				    int src_cpu, int dest_cpu)
> +{
> +	return 0;
> +}
>  #endif
> 
>  /**************************************************
> @@ -2519,6 +2597,9 @@ move_one_task(struct rq *this_rq, int th
> 
>  	for_each_leaf_cfs_rq(busiest, cfs_rq) {
>  		list_for_each_entry_safe(p, n, &cfs_rq->tasks, se.group_node) {
> +			if (throttled_lb_pair(task_group(p),
> +					      busiest->cpu, this_cpu))
> +				break;
> 
>  			if (!can_migrate_task(p, busiest, this_cpu,
>  						sd, idle, &pinned))
> @@ -2630,8 +2711,13 @@ static void update_shares(int cpu)
>  	struct rq *rq = cpu_rq(cpu);
> 
>  	rcu_read_lock();
> -	for_each_leaf_cfs_rq(rq, cfs_rq)
> +	for_each_leaf_cfs_rq(rq, cfs_rq) {

this hunk fails to apply over the latest tip, due to the comments
introduced by 9763b67fb9f3050.

> +		/* throttled entities do not contribute to load */
> +		if (throttled_hierarchy(cfs_rq))
> +			continue;
> +
>  		update_shares_cpu(cfs_rq->tg, cpu);
> +	}
>  	rcu_read_unlock();
>  }
> 
> @@ -2655,9 +2741,10 @@ load_balance_fair(struct rq *this_rq, in
>  		u64 rem_load, moved_load;
> 
>  		/*
> -		 * empty group
> +		 * empty group or part of a throttled hierarchy
>  		 */
> -		if (!busiest_cfs_rq->task_weight)
> +		if (!busiest_cfs_rq->task_weight ||
> +		    throttled_lb_pair(tg, busiest_cpu, this_cpu))
>  			continue;
> 
>  		rem_load = (u64)rem_load_move * busiest_weight;
> Index: tip/kernel/sched.c
> ===================================================================
> --- tip.orig/kernel/sched.c
> +++ tip/kernel/sched.c
> @@ -399,7 +399,7 @@ struct cfs_rq {
>  	u64 runtime_expires;
>  	s64 runtime_remaining;
> 
> -	int throttled;
> +	int throttled, throttle_count;
>  	struct list_head throttled_list;
>  #endif
>  #endif
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 11/18] sched: prevent interactions with throttled entities
  2011-07-22 11:26   ` Kamalesh Babulal
@ 2011-07-22 11:37     ` Peter Zijlstra
  0 siblings, 0 replies; 60+ messages in thread
From: Peter Zijlstra @ 2011-07-22 11:37 UTC (permalink / raw)
  To: Kamalesh Babulal
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

On Fri, 2011-07-22 at 16:56 +0530, Kamalesh Babulal wrote:
> > @@ -2630,8 +2711,13 @@ static void update_shares(int cpu)
> >       struct rq *rq = cpu_rq(cpu);
> > 
> >       rcu_read_lock();
> > -     for_each_leaf_cfs_rq(rq, cfs_rq)
> > +     for_each_leaf_cfs_rq(rq, cfs_rq) {
> 
> this hunk fails to apply over the latest tip, due to the comments
> introduced by 9763b67fb9f3050.

please trim your replies.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 11/18] sched: prevent interactions with throttled entities
  2011-07-21 16:43 ` [patch 11/18] sched: prevent interactions with throttled entities Paul Turner
  2011-07-22 11:26   ` Kamalesh Babulal
@ 2011-07-22 11:41   ` Kamalesh Babulal
  2011-07-22 11:43     ` Peter Zijlstra
  2011-08-14 16:30   ` [tip:sched/core] sched: Prevent " tip-bot for Paul Turner
  2 siblings, 1 reply; 60+ messages in thread
From: Kamalesh Babulal @ 2011-07-22 11:41 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

* Paul Turner <pjt@google.com> [2011-07-21 09:43:36]:

> From the perspective of load-balance and shares distribution, throttled
> entities should be invisible.
> 
> However, both of these operations work on 'active' lists and are not
> inherently aware of what group hierarchies may be present.  In some cases this
> may be side-stepped (e.g. we could sideload via tg_load_down in load balance) 
> while in others (e.g. update_shares()) it is more difficult to compute without
> incurring some O(n^2) costs.
> 
> Instead, track hierarchical throttled state at time of transition.  This
> allows us to easily identify whether an entity belongs to a throttled hierarchy
> and avoid incorrect interactions with it.
> 
> Also, when an entity leaves a throttled hierarchy we need to advance its
> time averaging for shares averaging so that the elapsed throttled time is not
> considered as part of the cfs_rq's operation.
> 
> We also use this information to prevent buddy interactions in the wakeup and
> yield_to() paths.
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
> 
> ---
>  kernel/sched.c      |    2 -
>  kernel/sched_fair.c |   99 ++++++++++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 94 insertions(+), 7 deletions(-)
> 
> Index: tip/kernel/sched_fair.c
> ===================================================================
> --- tip.orig/kernel/sched_fair.c
> +++ tip/kernel/sched_fair.c
> @@ -725,6 +725,8 @@ account_entity_dequeue(struct cfs_rq *cf
>  }
> 
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> +/* we need this in update_cfs_load and load-balance functions below */
> +static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
>  # ifdef CONFIG_SMP
>  static void update_cfs_rq_load_contribution(struct cfs_rq *cfs_rq,
>  					    int global_update)
> @@ -747,7 +749,7 @@ static void update_cfs_load(struct cfs_r
>  	u64 now, delta;
>  	unsigned long load = cfs_rq->load.weight;
> 
> -	if (cfs_rq->tg == &root_task_group)
> +	if (cfs_rq->tg == &root_task_group || throttled_hierarchy(cfs_rq))
>  		return;
> 
>  	now = rq_of(cfs_rq)->clock_task;
> @@ -856,7 +858,7 @@ static void update_cfs_shares(struct cfs
> 
>  	tg = cfs_rq->tg;
>  	se = tg->se[cpu_of(rq_of(cfs_rq))];
> -	if (!se)
> +	if (!se || throttled_hierarchy(cfs_rq))
>  		return;
>  #ifndef CONFIG_SMP
>  	if (likely(se->load.weight == tg->shares))
> @@ -1425,6 +1427,65 @@ static inline int cfs_rq_throttled(struc
>  	return cfs_rq->throttled;
>  }
> 
> +/* check whether cfs_rq, or any parent, is throttled */
> +static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
> +{
> +	return cfs_rq->throttle_count;
> +}
> +
> +/*
> + * Ensure that neither of the group entities corresponding to src_cpu or
> + * dest_cpu are members of a throttled hierarchy when performing group
> + * load-balance operations.
> + */
> +static inline int throttled_lb_pair(struct task_group *tg,
> +				    int src_cpu, int dest_cpu)
> +{
> +	struct cfs_rq *src_cfs_rq, *dest_cfs_rq;
> +
> +	src_cfs_rq = tg->cfs_rq[src_cpu];
> +	dest_cfs_rq = tg->cfs_rq[dest_cpu];
> +
> +	return throttled_hierarchy(src_cfs_rq) ||
> +	       throttled_hierarchy(dest_cfs_rq);
> +}
> +
> +/* updated child weight may affect parent so we have to do this bottom up */
> +static int tg_unthrottle_up(struct task_group *tg, void *data)
> +{
> +	struct rq *rq = data;
> +	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> +
> +	cfs_rq->throttle_count--;
> +#ifdef CONFIG_SMP
> +	if (!cfs_rq->throttle_count) {
> +		u64 delta = rq->clock_task - cfs_rq->load_stamp;
> +
> +		/* leaving throttled state, advance shares averaging windows */
> +		cfs_rq->load_stamp += delta;
> +		cfs_rq->load_last += delta;
> +
> +		/* update entity weight now that we are on_rq again */
> +		update_cfs_shares(cfs_rq);
> +	}
> +#endif
> +
> +	return 0;
> +}
> +
> +static int tg_throttle_down(struct task_group *tg, void *data)
> +{
> +	struct rq *rq = data;
> +	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> +
> +	/* group is entering throttled state, record last load */
> +	if (!cfs_rq->throttle_count)
> +		update_cfs_load(cfs_rq, 0);
> +	cfs_rq->throttle_count++;
> +
> +	return 0;
> +}
> +
>  static __used void throttle_cfs_rq(struct cfs_rq *cfs_rq)
>  {
>  	struct rq *rq = rq_of(cfs_rq);
> @@ -1435,7 +1496,9 @@ static __used void throttle_cfs_rq(struc
>  	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
> 
>  	/* account load preceding throttle */
> -	update_cfs_load(cfs_rq, 0);
> +	rcu_read_lock();
> +	walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
> +	rcu_read_unlock();
> 
>  	task_delta = cfs_rq->h_nr_running;
>  	for_each_sched_entity(se) {
> @@ -1476,6 +1539,10 @@ static void unthrottle_cfs_rq(struct cfs
>  	list_del_rcu(&cfs_rq->throttled_list);
>  	raw_spin_unlock(&cfs_b->lock);
> 
> +	update_rq_clock(rq);
> +	/* update hierarchical throttle state */
> +	walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
> +
>  	if (!cfs_rq->load.weight)
>  		return;
> 
> @@ -1620,6 +1687,17 @@ static inline int cfs_rq_throttled(struc
>  {
>  	return 0;
>  }
> +
> +static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
> +{
> +	return 0;
> +}
> +
> +static inline int throttled_lb_pair(struct task_group *tg,
> +				    int src_cpu, int dest_cpu)
> +{
> +	return 0;
> +}
>  #endif
> 
>  /**************************************************
> @@ -2519,6 +2597,9 @@ move_one_task(struct rq *this_rq, int th
> 
>  	for_each_leaf_cfs_rq(busiest, cfs_rq) {
>  		list_for_each_entry_safe(p, n, &cfs_rq->tasks, se.group_node) {
> +			if (throttled_lb_pair(task_group(p),
> +					      busiest->cpu, this_cpu))
> +				break;
> 
>  			if (!can_migrate_task(p, busiest, this_cpu,
>  						sd, idle, &pinned))
> @@ -2630,8 +2711,13 @@ static void update_shares(int cpu)
>  	struct rq *rq = cpu_rq(cpu);
> 
>  	rcu_read_lock();
> -	for_each_leaf_cfs_rq(rq, cfs_rq)
> +	for_each_leaf_cfs_rq(rq, cfs_rq) {
> +		/* throttled entities do not contribute to load */
> +		if (throttled_hierarchy(cfs_rq))
> +			continue;
> +
>  		update_shares_cpu(cfs_rq->tg, cpu);
> +	}
>  	rcu_read_unlock();
>  }
> 
> @@ -2655,9 +2741,10 @@ load_balance_fair(struct rq *this_rq, in
>  		u64 rem_load, moved_load;
> 
>  		/*
> -		 * empty group
> +		 * empty group or part of a throttled hierarchy
>  		 */
> -		if (!busiest_cfs_rq->task_weight)
> +		if (!busiest_cfs_rq->task_weight ||
> +		    throttled_lb_pair(tg, busiest_cpu, this_cpu))

tip commit 9763b67fb9f30 removes both tg and busiest_cpu from
load_balance_fair.

>  			continue;
> 
>  		rem_load = (u64)rem_load_move * busiest_weight;
> Index: tip/kernel/sched.c
> ===================================================================
> --- tip.orig/kernel/sched.c
> +++ tip/kernel/sched.c
> @@ -399,7 +399,7 @@ struct cfs_rq {
>  	u64 runtime_expires;
>  	s64 runtime_remaining;
> 
> -	int throttled;
> +	int throttled, throttle_count;
>  	struct list_head throttled_list;
>  #endif
>  #endif
> 
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 11/18] sched: prevent interactions with throttled entities
  2011-07-22 11:41   ` Kamalesh Babulal
@ 2011-07-22 11:43     ` Peter Zijlstra
  2011-07-22 18:16       ` Kamalesh Babulal
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2011-07-22 11:43 UTC (permalink / raw)
  To: Kamalesh Babulal
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

On Fri, 2011-07-22 at 17:11 +0530, Kamalesh Babulal wrote:
> >                */
> > -             if (!busiest_cfs_rq->task_weight)
> > +             if (!busiest_cfs_rq->task_weight ||
> > +                 throttled_lb_pair(tg, busiest_cpu, this_cpu))
> 
> tip commit 9763b67fb9f30 removes both tg and busiest_cpu from
> load_balance_fair. 

Grrr, one more untrimmed email and you're in the /dev/null filter.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 11/18] sched: prevent interactions with throttled entities
  2011-07-22 11:43     ` Peter Zijlstra
@ 2011-07-22 18:16       ` Kamalesh Babulal
  0 siblings, 0 replies; 60+ messages in thread
From: Kamalesh Babulal @ 2011-07-22 18:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-07-22 13:43:39]:

> On Fri, 2011-07-22 at 17:11 +0530, Kamalesh Babulal wrote:
> > >                */
> > > -             if (!busiest_cfs_rq->task_weight)
> > > +             if (!busiest_cfs_rq->task_weight ||
> > > +                 throttled_lb_pair(tg, busiest_cpu, this_cpu))
> > 
> > tip commit 9763b67fb9f30 removes both tg and busiest_cpu from
> > load_balance_fair. 
> 
> Grrr, one more untrimmed email and you're in the /dev/null filter.

Sorry, will take care of it next time.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (18 preceding siblings ...)
  2011-07-21 23:01 ` [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
@ 2011-07-25 14:58 ` Peter Zijlstra
  2011-07-25 15:00   ` Peter Zijlstra
  2011-09-13 12:10 ` Vladimir Davydov
  20 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2011-07-25 14:58 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

How about something like the below on top?


---
 kernel/sched.c |   33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -9228,9 +9228,19 @@ long tg_get_cfs_quota(struct task_group 
 	return quota_us;
 }
 
+int tg_set_cfs_period_down(struct task_group *tg, void *data)
+{
+	u64 period = *(u64 *)data;
+	if (ktime_to_ns(tg->cfs_bandwidth.period) == period)
+		return 0;
+
+	return tg_set_cfs_bandwidth(tg, period, tg->cfs_bandwidth.quota);
+}
+
 int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
 {
 	u64 quota, period;
+	int ret;
 
 	period = (u64)cfs_period_us * NSEC_PER_USEC;
 	quota = tg_cfs_bandwidth(tg)->quota;
@@ -9238,7 +9248,28 @@ int tg_set_cfs_period(struct task_group 
 	if (period <= 0)
 		return -EINVAL;
 
-	return tg_set_cfs_bandwidth(tg, period, quota);
+	/*
+	 * If the parent is bandwidth constrained all its children will
+	 * have to have the same period.
+	 */
+	if (tg->parent && tg->parent->cfs_bandwidth.quota != RUNTIME_INF)
+		return -EINVAL;
+
+	ret = tg_set_cfs_bandwidth(tg, period, quota);
+	if (!ret) {
+		rcu_read_lock();
+		ret = walk_tg_tree_from(tg, tg_set_cfs_period_down, NULL, &period);
+		rcu_read_unlock();
+
+		/*
+		 * If we could change the period on the parent we should be
+		 * able to change the period on all its children since their
+		 * quota is constrained to be equal or less than ours.
+		 */
+		WARN_ON_ONCE(ret);
+	}
+
+	return ret;
 }
 
 long tg_get_cfs_period(struct task_group *tg)


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-07-25 14:58 ` Peter Zijlstra
@ 2011-07-25 15:00   ` Peter Zijlstra
  2011-07-25 16:21     ` Paul E. McKenney
  2011-07-28  2:59     ` Paul Turner
  0 siblings, 2 replies; 60+ messages in thread
From: Peter Zijlstra @ 2011-07-25 15:00 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

On Mon, 2011-07-25 at 16:58 +0200, Peter Zijlstra wrote:
> +               rcu_read_lock();
> +               ret = walk_tg_tree_from(tg, tg_set_cfs_period_down, NULL, &period);
> +               rcu_read_unlock(); 

rcu over a mutex doesn't really work in mainline, bah.. 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-07-25 15:00   ` Peter Zijlstra
@ 2011-07-25 16:21     ` Paul E. McKenney
  2011-07-25 16:28       ` Peter Zijlstra
  2011-07-28  2:59     ` Paul Turner
  1 sibling, 1 reply; 60+ messages in thread
From: Paul E. McKenney @ 2011-07-25 16:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	Jason Baron

On Mon, Jul 25, 2011 at 05:00:41PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-07-25 at 16:58 +0200, Peter Zijlstra wrote:
> > +               rcu_read_lock();
> > +               ret = walk_tg_tree_from(tg, tg_set_cfs_period_down, NULL, &period);
> > +               rcu_read_unlock(); 
> 
> rcu over a mutex doesn't really work in mainline, bah.. 

SRCU can handle that situation, FWIW.  But yes, blocking in an RCU
read-side critical section is a no-no.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-07-25 16:21     ` Paul E. McKenney
@ 2011-07-25 16:28       ` Peter Zijlstra
  2011-07-25 16:46         ` Paul E. McKenney
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2011-07-25 16:28 UTC (permalink / raw)
  To: paulmck
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	Jason Baron

On Mon, 2011-07-25 at 09:21 -0700, Paul E. McKenney wrote:
> On Mon, Jul 25, 2011 at 05:00:41PM +0200, Peter Zijlstra wrote:
> > On Mon, 2011-07-25 at 16:58 +0200, Peter Zijlstra wrote:
> > > +               rcu_read_lock();
> > > +               ret = walk_tg_tree_from(tg, tg_set_cfs_period_down, NULL, &period);
> > > +               rcu_read_unlock(); 
> > 
> > rcu over a mutex doesn't really work in mainline, bah.. 
> 
> SRCU can handle that situation, FWIW.  But yes, blocking in an RCU
> read-side critical section is a no-no.

Yeah, I know, but didn't notice until after I sent.. SRCU isn't useful
though, way too slow due to lacking call_srcu().

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-07-25 16:28       ` Peter Zijlstra
@ 2011-07-25 16:46         ` Paul E. McKenney
  2011-07-25 17:08           ` Peter Zijlstra
  0 siblings, 1 reply; 60+ messages in thread
From: Paul E. McKenney @ 2011-07-25 16:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	Jason Baron

On Mon, Jul 25, 2011 at 06:28:31PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-07-25 at 09:21 -0700, Paul E. McKenney wrote:
> > On Mon, Jul 25, 2011 at 05:00:41PM +0200, Peter Zijlstra wrote:
> > > On Mon, 2011-07-25 at 16:58 +0200, Peter Zijlstra wrote:
> > > > +               rcu_read_lock();
> > > > +               ret = walk_tg_tree_from(tg, tg_set_cfs_period_down, NULL, &period);
> > > > +               rcu_read_unlock(); 
> > > 
> > > rcu over a mutex doesn't really work in mainline, bah.. 
> > 
> > SRCU can handle that situation, FWIW.  But yes, blocking in an RCU
> > read-side critical section is a no-no.
> 
> Yeah, I know, but didn't notice until after I sent.. SRCU isn't useful
> though, way too slow due to lacking call_srcu().

Good point.  How frequently would a call_srcu() be invoked?

In other words, would a really crude hack involving a globally locked
per-srcu_struct callback list and a per-srcu_struct kernel thread be
helpful, or would a slightly less-crude hack involving a per-CPU callback
list be required?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-07-25 16:46         ` Paul E. McKenney
@ 2011-07-25 17:08           ` Peter Zijlstra
  2011-07-25 17:11             ` Dhaval Giani
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2011-07-25 17:08 UTC (permalink / raw)
  To: paulmck
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	Jason Baron

On Mon, 2011-07-25 at 09:46 -0700, Paul E. McKenney wrote:
> On Mon, Jul 25, 2011 at 06:28:31PM +0200, Peter Zijlstra wrote:
> > On Mon, 2011-07-25 at 09:21 -0700, Paul E. McKenney wrote:
> > > On Mon, Jul 25, 2011 at 05:00:41PM +0200, Peter Zijlstra wrote:
> > > > On Mon, 2011-07-25 at 16:58 +0200, Peter Zijlstra wrote:
> > > > > +               rcu_read_lock();
> > > > > +               ret = walk_tg_tree_from(tg, tg_set_cfs_period_down, NULL, &period);
> > > > > +               rcu_read_unlock(); 
> > > > 
> > > > rcu over a mutex doesn't really work in mainline, bah.. 
> > > 
> > > SRCU can handle that situation, FWIW.  But yes, blocking in an RCU
> > > read-side critical section is a no-no.
> > 
> > Yeah, I know, but didn't notice until after I sent.. SRCU isn't useful
> > though, way too slow due to lacking call_srcu().
> 
> Good point.  How frequently would a call_srcu() be invoked?
> 
> In other words, would a really crude hack involving a globally locked
> per-srcu_struct callback list and a per-srcu_struct kernel thread be
> helpful, or would a slightly less-crude hack involving a per-CPU callback
> list be required?

it would be invoked every time someone kills a cgroup, which I would
consider a slow path, but some folks out there seem to think otherwise
and create/destroy these things like they're free (there was a discussion
some time ago about optimizing the cgroup destroy path, etc.).

Anyway, I think I can sort this particular problem by simply wrapping
the whole crap in cgroup_lock()/cgroup_unlock(), if we want to go this
way anyway.

I consider setting the cgroup parameters an utter slow path, and if
people complain I'll simply tell them to sod off ;-)

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-07-25 17:08           ` Peter Zijlstra
@ 2011-07-25 17:11             ` Dhaval Giani
  2011-07-25 17:35               ` Peter Zijlstra
  0 siblings, 1 reply; 60+ messages in thread
From: Dhaval Giani @ 2011-07-25 17:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: paulmck, Paul Turner, linux-kernel, Bharata B Rao, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

>
> I consider setting the cgroup parameters an utter slow path, and if
> people complain I'll simply tell them to sod off ;-)
>

cgroups parameters? I can think of some funky uses if setting cgroup
parameters was not in the slow path. I think you meant create/destroy
cgroups here :-).

Dhaval

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-07-25 17:11             ` Dhaval Giani
@ 2011-07-25 17:35               ` Peter Zijlstra
  0 siblings, 0 replies; 60+ messages in thread
From: Peter Zijlstra @ 2011-07-25 17:35 UTC (permalink / raw)
  To: Dhaval Giani
  Cc: paulmck, Paul Turner, linux-kernel, Bharata B Rao, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

Dhaval Giani <dhaval.giani@gmail.com> wrote:

>>
>> I consider setting the cgroup parameters an utter slow path, and if
>> people complain I'll simply tell them to sod off ;-)
>>
>
>cgroups parameters? I can think of some funky uses if setting cgroup
>parameters was not in the slow path. I think you meant create/destroy
>cgroups here :-).
>
>Dhaval

No, I meant writing into cgroup files, like the cfs bandwidth period thing.
-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-07-25 15:00   ` Peter Zijlstra
  2011-07-25 16:21     ` Paul E. McKenney
@ 2011-07-28  2:59     ` Paul Turner
  1 sibling, 0 replies; 60+ messages in thread
From: Paul Turner @ 2011-07-28  2:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

On Mon, Jul 25, 2011 at 8:00 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, 2011-07-25 at 16:58 +0200, Peter Zijlstra wrote:
>> +               rcu_read_lock();
>> +               ret = walk_tg_tree_from(tg, tg_set_cfs_period_down, NULL, &period);
>> +               rcu_read_unlock();
>
> rcu over a mutex doesn't really work in mainline, bah..
>

Isn't this the other way around though?  We already hold the mutex so
we shouldn't be blocking within the RCU section.

rcu_read_lock() here is only to stop nodes in the tree from disappearing
under us during the walk.

(FWIW this is the same as the rt_schedulable case)

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 08/18] sched: add support for throttling group entities
  2011-07-21 16:43 ` [patch 08/18] sched: add support for throttling group entities Paul Turner
@ 2011-08-08 15:46   ` Lin Ming
  2011-08-08 16:00     ` Peter Zijlstra
  2011-08-14 16:26   ` [tip:sched/core] sched: Add " tip-bot for Paul Turner
  1 sibling, 1 reply; 60+ messages in thread
From: Lin Ming @ 2011-08-08 15:46 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	Jason Baron

On Fri, Jul 22, 2011 at 12:43 AM, Paul Turner <pjt@google.com> wrote:

> +static __used void throttle_cfs_rq(struct cfs_rq *cfs_rq)
> +{
> +       struct rq *rq = rq_of(cfs_rq);
> +       struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> +       struct sched_entity *se;
> +       long task_delta, dequeue = 1;
> +
> +       se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
> +
> +       /* account load preceding throttle */
> +       update_cfs_load(cfs_rq, 0);
> +
> +       task_delta = cfs_rq->h_nr_running;
> +       for_each_sched_entity(se) {
> +               struct cfs_rq *qcfs_rq = cfs_rq_of(se);
> +               /* throttled entity or throttle-on-deactivate */
> +               if (!se->on_rq)
> +                       break;

Does it mean it's possible that child se is unthrottled but parent se
is throttled?

I thought if parent group was throttled then its children should be
throttled too.
I may have misunderstood the code; please correct me if so.

Thanks,
Lin Ming

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 08/18] sched: add support for throttling group entities
  2011-08-08 15:46   ` Lin Ming
@ 2011-08-08 16:00     ` Peter Zijlstra
  2011-08-08 16:16       ` Paul Turner
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2011-08-08 16:00 UTC (permalink / raw)
  To: Lin Ming
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	Jason Baron

On Mon, 2011-08-08 at 23:46 +0800, Lin Ming wrote:
> On Fri, Jul 22, 2011 at 12:43 AM, Paul Turner <pjt@google.com> wrote:
> 
> > +static __used void throttle_cfs_rq(struct cfs_rq *cfs_rq)
> > +{
> > +       struct rq *rq = rq_of(cfs_rq);
> > +       struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> > +       struct sched_entity *se;
> > +       long task_delta, dequeue = 1;
> > +
> > +       se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
> > +
> > +       /* account load preceding throttle */
> > +       update_cfs_load(cfs_rq, 0);
> > +
> > +       task_delta = cfs_rq->h_nr_running;
> > +       for_each_sched_entity(se) {
> > +               struct cfs_rq *qcfs_rq = cfs_rq_of(se);
> > +               /* throttled entity or throttle-on-deactivate */
> > +               if (!se->on_rq)
> > +                       break;
> 
> Does it mean it's possible that child se is unthrottled but parent se
> is throttled?

Yep..

> I thought if parent group was throttled then its children should be
> throttled too.
> I may have misunderstood the code; please correct me if so.

That would be costly, as throttling a parent would require throttling
all its children (of which there can be arbitrarily many).


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 08/18] sched: add support for throttling group entities
  2011-08-08 16:00     ` Peter Zijlstra
@ 2011-08-08 16:16       ` Paul Turner
  0 siblings, 0 replies; 60+ messages in thread
From: Paul Turner @ 2011-08-08 16:16 UTC (permalink / raw)
  To: Lin Ming
  Cc: Peter Zijlstra, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	Jason Baron

On Mon, Aug 8, 2011 at 9:00 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Mon, 2011-08-08 at 23:46 +0800, Lin Ming wrote:
>> On Fri, Jul 22, 2011 at 12:43 AM, Paul Turner <pjt@google.com> wrote:
>>
>> > +static __used void throttle_cfs_rq(struct cfs_rq *cfs_rq)
>> > +{
>> > +       struct rq *rq = rq_of(cfs_rq);
>> > +       struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
>> > +       struct sched_entity *se;
>> > +       long task_delta, dequeue = 1;
>> > +
>> > +       se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
>> > +
>> > +       /* account load preceding throttle */
>> > +       update_cfs_load(cfs_rq, 0);
>> > +
>> > +       task_delta = cfs_rq->h_nr_running;
>> > +       for_each_sched_entity(se) {
>> > +               struct cfs_rq *qcfs_rq = cfs_rq_of(se);
>> > +               /* throttled entity or throttle-on-deactivate */
>> > +               if (!se->on_rq)
>> > +                       break;
>>
>> Does it mean it's possible that child se is unthrottled but parent se
>> is throttled?
>
> Yep..
>
>> I thought if parent group was throttled then its children should be
>> throttled too.
>> I may have misunderstood the code; please correct me if so.
>
> That would be costly, as throttling a parent would require throttling
> all its children (of which there can be arbitrarily many).
>

In case it is not clear: the children of a throttled entity cannot be
scheduled; they are implicitly throttled by virtue of their
parent/ancestor having reached its bandwidth limit (and being
throttled).

Consider the hierarchy below:
    A
   / \
  D   B
       \
        C

If A and B both have bandwidth limits then B being on_rq depends on:

1. A being within its bandwidth limit, otherwise the entire hierarchy
would be dequeued
2. B being within its bandwidth limit, otherwise the hierarchy B-C
would be dequeued

B's throttle state is independent of whether A has reached its limit;
however it will not be runnable while A is throttled.

The per-cfs_rq throttle_count may be more directly in line with
your interpretation of "throttled"; it maintains explicit tracking of
whether or not an entity is throttled (including via its parent).
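
To make that concrete, here is a toy userspace model of the same idea
(illustrative only -- the struct and helper names below are made up, not the
kernel's; only the walk-the-ancestors logic mirrors what throttle_count is
tracking):

#include <stdio.h>
#include <stddef.h>

/* Toy model of the hierarchy A -> {D, B}, B -> C discussed above.
 * "throttled" means the group itself ran out of quota; an entity is
 * only runnable if no ancestor (itself included) is throttled. */
struct group {
	const char *name;
	int throttled;			/* this group hit its own limit */
	struct group *parent;
};

/* rough analogue of "throttled directly or via some ancestor" */
static int throttled_hierarchy(const struct group *g)
{
	for (; g; g = g->parent)
		if (g->throttled)
			return 1;
	return 0;
}

int main(void)
{
	struct group A = { "A", 0, NULL };
	struct group D = { "D", 0, &A };
	struct group B = { "B", 0, &A };
	struct group C = { "C", 0, &B };
	struct group *all[] = { &A, &D, &B, &C };

	A.throttled = 1;	/* A exhausts its bandwidth */

	for (size_t i = 0; i < 4; i++)
		printf("%s: throttled=%d runnable=%d\n", all[i]->name,
		       all[i]->throttled, !throttled_hierarchy(all[i]));
	return 0;
}

B and C report throttled=0 but runnable=0, which is exactly the distinction
between a group's own throttle state and being throttled via an ancestor.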

- Paul

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [tip:sched/core] sched: Implement hierarchical task accounting for SCHED_OTHER
  2011-07-21 16:43 ` [patch 02/18] sched: hierarchical task accounting for SCHED_OTHER Paul Turner
@ 2011-08-14 16:15   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Paul Turner @ 2011-08-14 16:15 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, seto.hidetoshi, pjt, tglx, mingo

Commit-ID:  953bfcd10e6f3697233e8e5128c611d275da39c1
Gitweb:     http://git.kernel.org/tip/953bfcd10e6f3697233e8e5128c611d275da39c1
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:27 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:01:13 +0200

sched: Implement hierarchical task accounting for SCHED_OTHER

Introduce hierarchical task accounting for the group scheduling case in CFS, as
well as promoting the responsibility for maintaining rq->nr_running to the
scheduling classes.

The primary motivation for this is that with scheduling classes supporting
bandwidth throttling it is possible for entities participating in throttled
sub-trees to not have root visible changes in rq->nr_running across activate
and de-activate operations.  This in turn leads to incorrect idle and
weight-per-task load balance decisions.

This also allows us to make a small fixlet to the fastpath in pick_next_task()
under group scheduling.

Note: this issue also exists with the existing sched_rt throttling mechanism.
This patch does not address that.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184756.878333391@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c          |    6 ++----
 kernel/sched_fair.c     |    6 ++++++
 kernel/sched_rt.c       |    5 ++++-
 kernel/sched_stoptask.c |    2 ++
 4 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index cf427bb..cd1a531 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -311,7 +311,7 @@ struct task_group root_task_group;
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
 	struct load_weight load;
-	unsigned long nr_running;
+	unsigned long nr_running, h_nr_running;
 
 	u64 exec_clock;
 	u64 min_vruntime;
@@ -1802,7 +1802,6 @@ static void activate_task(struct rq *rq, struct task_struct *p, int flags)
 		rq->nr_uninterruptible--;
 
 	enqueue_task(rq, p, flags);
-	inc_nr_running(rq);
 }
 
 /*
@@ -1814,7 +1813,6 @@ static void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
 		rq->nr_uninterruptible++;
 
 	dequeue_task(rq, p, flags);
-	dec_nr_running(rq);
 }
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
@@ -4258,7 +4256,7 @@ pick_next_task(struct rq *rq)
 	 * Optimization: we know that if all tasks are in
 	 * the fair class we can call that function directly:
 	 */
-	if (likely(rq->nr_running == rq->cfs.nr_running)) {
+	if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
 		p = fair_sched_class.pick_next_task(rq);
 		if (likely(p))
 			return p;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index f4b732a..f86b0cb 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1310,16 +1310,19 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 			break;
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
+		cfs_rq->h_nr_running++;
 		flags = ENQUEUE_WAKEUP;
 	}
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
+		cfs_rq->h_nr_running++;
 
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
+	inc_nr_running(rq);
 	hrtick_update(rq);
 }
 
@@ -1339,6 +1342,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);
+		cfs_rq->h_nr_running--;
 
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
@@ -1358,11 +1362,13 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
+		cfs_rq->h_nr_running--;
 
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
+	dec_nr_running(rq);
 	hrtick_update(rq);
 }
 
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index a8c207f..a9d3c6b 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -936,6 +936,8 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 
 	if (!task_current(rq, p) && p->rt.nr_cpus_allowed > 1)
 		enqueue_pushable_task(rq, p);
+
+	inc_nr_running(rq);
 }
 
 static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
@@ -946,6 +948,8 @@ static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 	dequeue_rt_entity(rt_se);
 
 	dequeue_pushable_task(rq, p);
+
+	dec_nr_running(rq);
 }
 
 /*
@@ -1841,4 +1845,3 @@ static void print_rt_stats(struct seq_file *m, int cpu)
 	rcu_read_unlock();
 }
 #endif /* CONFIG_SCHED_DEBUG */
-
diff --git a/kernel/sched_stoptask.c b/kernel/sched_stoptask.c
index 6f43763..8b44e7f 100644
--- a/kernel/sched_stoptask.c
+++ b/kernel/sched_stoptask.c
@@ -34,11 +34,13 @@ static struct task_struct *pick_next_task_stop(struct rq *rq)
 static void
 enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
 {
+	inc_nr_running(rq);
 }
 
 static void
 dequeue_task_stop(struct rq *rq, struct task_struct *p, int flags)
 {
+	dec_nr_running(rq);
 }
 
 static void yield_task_stop(struct rq *rq)

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [tip:sched/core] sched: Introduce primitives to account for CFS bandwidth tracking
  2011-07-21 16:43 ` [patch 03/18] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
  2011-07-22 11:14   ` Kamalesh Babulal
@ 2011-08-14 16:17   ` tip-bot for Paul Turner
  1 sibling, 0 replies; 60+ messages in thread
From: tip-bot for Paul Turner @ 2011-08-14 16:17 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, seto.hidetoshi, ncrao,
	pjt, bharata, tglx, mingo

Commit-ID:  ab84d31e15502fb626169ba2663381e34bf965b2
Gitweb:     http://git.kernel.org/tip/ab84d31e15502fb626169ba2663381e34bf965b2
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:28 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:20 +0200

sched: Introduce primitives to account for CFS bandwidth tracking

In this patch we introduce the notion of CFS bandwidth, partitioned into
globally unassigned bandwidth, and locally claimed bandwidth.

 - The global bandwidth is per task_group; it represents a pool of unclaimed
   bandwidth that cfs_rqs can allocate from.
 - The local bandwidth is tracked per-cfs_rq; this represents allotments from
   the global pool, i.e. bandwidth assigned to a specific cpu.

Bandwidth is managed via cgroupfs, adding two new interfaces to the cpu subsystem:
 - cpu.cfs_period_us : the bandwidth period in usecs
 - cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be allowed
   to consume over the period above.

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184756.972636699@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 init/Kconfig        |   12 +++
 kernel/sched.c      |  196 +++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched_fair.c |   16 ++++
 3 files changed, 220 insertions(+), 4 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index d627783..d19b3a7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -715,6 +715,18 @@ config FAIR_GROUP_SCHED
 	depends on CGROUP_SCHED
 	default CGROUP_SCHED
 
+config CFS_BANDWIDTH
+	bool "CPU bandwidth provisioning for FAIR_GROUP_SCHED"
+	depends on EXPERIMENTAL
+	depends on FAIR_GROUP_SCHED
+	default n
+	help
+	  This option allows users to define CPU bandwidth rates (limits) for
+	  tasks running within the fair group scheduler.  Groups with no limit
+	  set are considered to be unconstrained and will run with no
+	  restriction.
+	  See tip/Documentation/scheduler/sched-bwc.txt for more information.
+
 config RT_GROUP_SCHED
 	bool "Group scheduling for SCHED_RR/FIFO"
 	depends on EXPERIMENTAL
diff --git a/kernel/sched.c b/kernel/sched.c
index cd1a531..f08cb23 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -247,6 +247,14 @@ struct cfs_rq;
 
 static LIST_HEAD(task_groups);
 
+struct cfs_bandwidth {
+#ifdef CONFIG_CFS_BANDWIDTH
+	raw_spinlock_t lock;
+	ktime_t period;
+	u64 quota;
+#endif
+};
+
 /* task group related information */
 struct task_group {
 	struct cgroup_subsys_state css;
@@ -278,6 +286,8 @@ struct task_group {
 #ifdef CONFIG_SCHED_AUTOGROUP
 	struct autogroup *autogroup;
 #endif
+
+	struct cfs_bandwidth cfs_bandwidth;
 };
 
 /* task_group_lock serializes the addition/removal of task groups */
@@ -377,9 +387,48 @@ struct cfs_rq {
 
 	unsigned long load_contribution;
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	int runtime_enabled;
+	s64 runtime_remaining;
+#endif
 #endif
 };
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+#ifdef CONFIG_CFS_BANDWIDTH
+static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
+{
+	return &tg->cfs_bandwidth;
+}
+
+static inline u64 default_cfs_period(void);
+
+static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+	raw_spin_lock_init(&cfs_b->lock);
+	cfs_b->quota = RUNTIME_INF;
+	cfs_b->period = ns_to_ktime(default_cfs_period());
+}
+
+static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	cfs_rq->runtime_enabled = 0;
+}
+
+static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{}
+#else
+static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
+static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
+static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
+
+static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
+{
+	return NULL;
+}
+#endif /* CONFIG_CFS_BANDWIDTH */
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
 /* Real-Time classes' related field in a runqueue: */
 struct rt_rq {
 	struct rt_prio_array active;
@@ -7971,6 +8020,7 @@ static void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 	/* allow initial update_cfs_load() to truncate */
 	cfs_rq->load_stamp = 1;
 #endif
+	init_cfs_rq_runtime(cfs_rq);
 
 	tg->cfs_rq[cpu] = cfs_rq;
 	tg->se[cpu] = se;
@@ -8110,6 +8160,7 @@ void __init sched_init(void)
 		 * We achieve this by letting root_task_group's tasks sit
 		 * directly in rq->cfs (i.e root_task_group->se[] = NULL).
 		 */
+		init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
 		init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -8351,6 +8402,8 @@ static void free_fair_sched_group(struct task_group *tg)
 {
 	int i;
 
+	destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
+
 	for_each_possible_cpu(i) {
 		if (tg->cfs_rq)
 			kfree(tg->cfs_rq[i]);
@@ -8378,6 +8431,8 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 
 	tg->shares = NICE_0_LOAD;
 
+	init_cfs_bandwidth(tg_cfs_bandwidth(tg));
+
 	for_each_possible_cpu(i) {
 		cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
 				      GFP_KERNEL, cpu_to_node(i));
@@ -8753,7 +8808,7 @@ static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
 	return walk_tg_tree(tg_schedulable, tg_nop, &data);
 }
 
-static int tg_set_bandwidth(struct task_group *tg,
+static int tg_set_rt_bandwidth(struct task_group *tg,
 		u64 rt_period, u64 rt_runtime)
 {
 	int i, err = 0;
@@ -8792,7 +8847,7 @@ int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us)
 	if (rt_runtime_us < 0)
 		rt_runtime = RUNTIME_INF;
 
-	return tg_set_bandwidth(tg, rt_period, rt_runtime);
+	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
 
 long sched_group_rt_runtime(struct task_group *tg)
@@ -8817,7 +8872,7 @@ int sched_group_set_rt_period(struct task_group *tg, long rt_period_us)
 	if (rt_period == 0)
 		return -EINVAL;
 
-	return tg_set_bandwidth(tg, rt_period, rt_runtime);
+	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
 
 long sched_group_rt_period(struct task_group *tg)
@@ -9007,6 +9062,128 @@ static u64 cpu_shares_read_u64(struct cgroup *cgrp, struct cftype *cft)
 
 	return (u64) scale_load_down(tg->shares);
 }
+
+#ifdef CONFIG_CFS_BANDWIDTH
+const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
+const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
+
+static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
+{
+	int i;
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+	static DEFINE_MUTEX(mutex);
+
+	if (tg == &root_task_group)
+		return -EINVAL;
+
+	/*
+	 * Ensure we have at least some amount of bandwidth every period.  This is
+	 * to prevent reaching a state of large arrears when throttled via
+	 * entity_tick() resulting in prolonged exit starvation.
+	 */
+	if (quota < min_cfs_quota_period || period < min_cfs_quota_period)
+		return -EINVAL;
+
+	/*
+	 * Likewise, bound things on the other side by preventing insane quota
+	 * periods.  This also allows us to normalize in computing quota
+	 * feasibility.
+	 */
+	if (period > max_cfs_quota_period)
+		return -EINVAL;
+
+	mutex_lock(&mutex);
+	raw_spin_lock_irq(&cfs_b->lock);
+	cfs_b->period = ns_to_ktime(period);
+	cfs_b->quota = quota;
+	raw_spin_unlock_irq(&cfs_b->lock);
+
+	for_each_possible_cpu(i) {
+		struct cfs_rq *cfs_rq = tg->cfs_rq[i];
+		struct rq *rq = rq_of(cfs_rq);
+
+		raw_spin_lock_irq(&rq->lock);
+		cfs_rq->runtime_enabled = quota != RUNTIME_INF;
+		cfs_rq->runtime_remaining = 0;
+		raw_spin_unlock_irq(&rq->lock);
+	}
+	mutex_unlock(&mutex);
+
+	return 0;
+}
+
+int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
+{
+	u64 quota, period;
+
+	period = ktime_to_ns(tg_cfs_bandwidth(tg)->period);
+	if (cfs_quota_us < 0)
+		quota = RUNTIME_INF;
+	else
+		quota = (u64)cfs_quota_us * NSEC_PER_USEC;
+
+	return tg_set_cfs_bandwidth(tg, period, quota);
+}
+
+long tg_get_cfs_quota(struct task_group *tg)
+{
+	u64 quota_us;
+
+	if (tg_cfs_bandwidth(tg)->quota == RUNTIME_INF)
+		return -1;
+
+	quota_us = tg_cfs_bandwidth(tg)->quota;
+	do_div(quota_us, NSEC_PER_USEC);
+
+	return quota_us;
+}
+
+int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
+{
+	u64 quota, period;
+
+	period = (u64)cfs_period_us * NSEC_PER_USEC;
+	quota = tg_cfs_bandwidth(tg)->quota;
+
+	if (period <= 0)
+		return -EINVAL;
+
+	return tg_set_cfs_bandwidth(tg, period, quota);
+}
+
+long tg_get_cfs_period(struct task_group *tg)
+{
+	u64 cfs_period_us;
+
+	cfs_period_us = ktime_to_ns(tg_cfs_bandwidth(tg)->period);
+	do_div(cfs_period_us, NSEC_PER_USEC);
+
+	return cfs_period_us;
+}
+
+static s64 cpu_cfs_quota_read_s64(struct cgroup *cgrp, struct cftype *cft)
+{
+	return tg_get_cfs_quota(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_quota_write_s64(struct cgroup *cgrp, struct cftype *cftype,
+				s64 cfs_quota_us)
+{
+	return tg_set_cfs_quota(cgroup_tg(cgrp), cfs_quota_us);
+}
+
+static u64 cpu_cfs_period_read_u64(struct cgroup *cgrp, struct cftype *cft)
+{
+	return tg_get_cfs_period(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_period_write_u64(struct cgroup *cgrp, struct cftype *cftype,
+				u64 cfs_period_us)
+{
+	return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
+}
+
+#endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -9041,6 +9218,18 @@ static struct cftype cpu_files[] = {
 		.write_u64 = cpu_shares_write_u64,
 	},
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		.name = "cfs_quota_us",
+		.read_s64 = cpu_cfs_quota_read_s64,
+		.write_s64 = cpu_cfs_quota_write_s64,
+	},
+	{
+		.name = "cfs_period_us",
+		.read_u64 = cpu_cfs_period_read_u64,
+		.write_u64 = cpu_cfs_period_write_u64,
+	},
+#endif
 #ifdef CONFIG_RT_GROUP_SCHED
 	{
 		.name = "rt_runtime_us",
@@ -9350,4 +9539,3 @@ struct cgroup_subsys cpuacct_subsys = {
 	.subsys_id = cpuacct_subsys_id,
 };
 #endif	/* CONFIG_CGROUP_CPUACCT */
-
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index f86b0cb..f24f417 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1234,6 +1234,22 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 		check_preempt_tick(cfs_rq, curr);
 }
 
+
+/**************************************************
+ * CFS bandwidth control machinery
+ */
+
+#ifdef CONFIG_CFS_BANDWIDTH
+/*
+ * default period for cfs group bandwidth.
+ * default: 0.1s, units: nanoseconds
+ */
+static inline u64 default_cfs_period(void)
+{
+	return 100000000ULL;
+}
+#endif
+
 /**************************************************
  * CFS operations on tasks:
  */
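
As a quick illustration of driving the two files added above from userspace
(a sketch only -- the cgroup mount point and group name below are assumptions
and vary per setup; values are in microseconds, as described in the
changelog):

#include <stdio.h>

/* Write one value into a cgroup control file such as cpu.cfs_quota_us. */
static int write_cgroup_val(const char *path, long long val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%lld\n", val);
	return fclose(f);
}

int main(void)
{
	/* assumed mount point/group; allow "grp" to consume 250ms of CPU
	 * time per 500ms period, i.e. roughly half a CPU */
	write_cgroup_val("/cgroup/cpu/grp/cpu.cfs_period_us", 500000);
	write_cgroup_val("/cgroup/cpu/grp/cpu.cfs_quota_us",  250000);
	return 0;
}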

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [tip:sched/core] sched: Validate CFS quota hierarchies
  2011-07-21 16:43 ` [patch 04/18] sched: validate CFS quota hierarchies Paul Turner
@ 2011-08-14 16:19   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Paul Turner @ 2011-08-14 16:19 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, pjt, tglx, mingo

Commit-ID:  a790de99599a29ad3f18667530cf4b9f4b7e3234
Gitweb:     http://git.kernel.org/tip/a790de99599a29ad3f18667530cf4b9f4b7e3234
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:29 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:22 +0200

sched: Validate CFS quota hierarchies

Add constraints validation for CFS bandwidth hierarchies.

Validate that:
   max(child bandwidth) <= parent_bandwidth

In a quota limited hierarchy, an unconstrained entity
(e.g. bandwidth==RUNTIME_INF) inherits the bandwidth of its parent.

This constraint is chosen over sum(child_bandwidth) as the notion of over-commit is
valuable within SCHED_OTHER.  Some basic code from the RT case is re-factored
for reuse.

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.083774572@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |  112 +++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 98 insertions(+), 14 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index f08cb23..ea6850d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -252,6 +252,7 @@ struct cfs_bandwidth {
 	raw_spinlock_t lock;
 	ktime_t period;
 	u64 quota;
+	s64 hierarchal_quota;
 #endif
 };
 
@@ -1518,7 +1519,8 @@ static inline void dec_cpu_load(struct rq *rq, unsigned long load)
 	update_load_sub(&rq->load, load);
 }
 
-#if (defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)) || defined(CONFIG_RT_GROUP_SCHED)
+#if defined(CONFIG_RT_GROUP_SCHED) || (defined(CONFIG_FAIR_GROUP_SCHED) && \
+			(defined(CONFIG_SMP) || defined(CONFIG_CFS_BANDWIDTH)))
 typedef int (*tg_visitor)(struct task_group *, void *);
 
 /*
@@ -8708,12 +8710,7 @@ unsigned long sched_group_shares(struct task_group *tg)
 }
 #endif
 
-#ifdef CONFIG_RT_GROUP_SCHED
-/*
- * Ensure that the real time constraints are schedulable.
- */
-static DEFINE_MUTEX(rt_constraints_mutex);
-
+#if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_CFS_BANDWIDTH)
 static unsigned long to_ratio(u64 period, u64 runtime)
 {
 	if (runtime == RUNTIME_INF)
@@ -8721,6 +8718,13 @@ static unsigned long to_ratio(u64 period, u64 runtime)
 
 	return div64_u64(runtime << 20, period);
 }
+#endif
+
+#ifdef CONFIG_RT_GROUP_SCHED
+/*
+ * Ensure that the real time constraints are schedulable.
+ */
+static DEFINE_MUTEX(rt_constraints_mutex);
 
 /* Must be called with tasklist_lock held */
 static inline int tg_has_rt_tasks(struct task_group *tg)
@@ -8741,7 +8745,7 @@ struct rt_schedulable_data {
 	u64 rt_runtime;
 };
 
-static int tg_schedulable(struct task_group *tg, void *data)
+static int tg_rt_schedulable(struct task_group *tg, void *data)
 {
 	struct rt_schedulable_data *d = data;
 	struct task_group *child;
@@ -8805,7 +8809,7 @@ static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
 		.rt_runtime = runtime,
 	};
 
-	return walk_tg_tree(tg_schedulable, tg_nop, &data);
+	return walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
 }
 
 static int tg_set_rt_bandwidth(struct task_group *tg,
@@ -9064,14 +9068,17 @@ static u64 cpu_shares_read_u64(struct cgroup *cgrp, struct cftype *cft)
 }
 
 #ifdef CONFIG_CFS_BANDWIDTH
+static DEFINE_MUTEX(cfs_constraints_mutex);
+
 const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
 const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
 
+static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
+
 static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 {
-	int i;
+	int i, ret = 0;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	static DEFINE_MUTEX(mutex);
 
 	if (tg == &root_task_group)
 		return -EINVAL;
@@ -9092,7 +9099,11 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	if (period > max_cfs_quota_period)
 		return -EINVAL;
 
-	mutex_lock(&mutex);
+	mutex_lock(&cfs_constraints_mutex);
+	ret = __cfs_schedulable(tg, period, quota);
+	if (ret)
+		goto out_unlock;
+
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
@@ -9107,9 +9118,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 		cfs_rq->runtime_remaining = 0;
 		raw_spin_unlock_irq(&rq->lock);
 	}
-	mutex_unlock(&mutex);
+out_unlock:
+	mutex_unlock(&cfs_constraints_mutex);
 
-	return 0;
+	return ret;
 }
 
 int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
@@ -9183,6 +9195,78 @@ static int cpu_cfs_period_write_u64(struct cgroup *cgrp, struct cftype *cftype,
 	return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
 }
 
+struct cfs_schedulable_data {
+	struct task_group *tg;
+	u64 period, quota;
+};
+
+/*
+ * normalize group quota/period to be quota/max_period
+ * note: units are usecs
+ */
+static u64 normalize_cfs_quota(struct task_group *tg,
+			       struct cfs_schedulable_data *d)
+{
+	u64 quota, period;
+
+	if (tg == d->tg) {
+		period = d->period;
+		quota = d->quota;
+	} else {
+		period = tg_get_cfs_period(tg);
+		quota = tg_get_cfs_quota(tg);
+	}
+
+	/* note: these should typically be equivalent */
+	if (quota == RUNTIME_INF || quota == -1)
+		return RUNTIME_INF;
+
+	return to_ratio(period, quota);
+}
+
+static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
+{
+	struct cfs_schedulable_data *d = data;
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+	s64 quota = 0, parent_quota = -1;
+
+	if (!tg->parent) {
+		quota = RUNTIME_INF;
+	} else {
+		struct cfs_bandwidth *parent_b = tg_cfs_bandwidth(tg->parent);
+
+		quota = normalize_cfs_quota(tg, d);
+		parent_quota = parent_b->hierarchal_quota;
+
+		/*
+		 * ensure max(child_quota) <= parent_quota, inherit when no
+		 * limit is set
+		 */
+		if (quota == RUNTIME_INF)
+			quota = parent_quota;
+		else if (parent_quota != RUNTIME_INF && quota > parent_quota)
+			return -EINVAL;
+	}
+	cfs_b->hierarchal_quota = quota;
+
+	return 0;
+}
+
+static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
+{
+	struct cfs_schedulable_data data = {
+		.tg = tg,
+		.period = period,
+		.quota = quota,
+	};
+
+	if (quota != RUNTIME_INF) {
+		do_div(data.period, NSEC_PER_USEC);
+		do_div(data.quota, NSEC_PER_USEC);
+	}
+
+	return walk_tg_tree(tg_cfs_schedulable_down, tg_nop, &data);
+}
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [tip:sched/core] sched: Accumulate per-cfs_rq cpu usage and charge against bandwidth
  2011-07-21 16:43 ` [patch 05/18] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth Paul Turner
@ 2011-08-14 16:21   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Paul Turner @ 2011-08-14 16:21 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, seto.hidetoshi, ncrao,
	pjt, bharata, tglx, mingo

Commit-ID:  ec12cb7f31e28854efae7dd6f9544e0a66379040
Gitweb:     http://git.kernel.org/tip/ec12cb7f31e28854efae7dd6f9544e0a66379040
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:30 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:26 +0200

sched: Accumulate per-cfs_rq cpu usage and charge against bandwidth

Account bandwidth usage on the cfs_rq level versus the task_groups to which
they belong.  Whether we are tracking bandwidth on a given cfs_rq is maintained
under cfs_rq->runtime_enabled.

cfs_rq's which belong to a bandwidth constrained task_group have their runtime
accounted via the update_curr() path, which withdraws bandwidth from the global
pool as desired.  Updates involving the global pool are currently protected
under cfs_bandwidth->lock, local runtime is protected by rq->lock.

This patch only assigns and tracks quota; no action is taken in the case that
cfs_rq->runtime_used exceeds cfs_rq->runtime_assigned.

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.179386821@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h |    4 ++
 kernel/sched.c        |    4 ++-
 kernel/sched_fair.c   |   79 +++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sysctl.c       |   10 ++++++
 4 files changed, 94 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4ac2c05..bc6f5f2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2040,6 +2040,10 @@ static inline void sched_autogroup_fork(struct signal_struct *sig) { }
 static inline void sched_autogroup_exit(struct signal_struct *sig) { }
 #endif
 
+#ifdef CONFIG_CFS_BANDWIDTH
+extern unsigned int sysctl_sched_cfs_bandwidth_slice;
+#endif
+
 #ifdef CONFIG_RT_MUTEXES
 extern int rt_mutex_getprio(struct task_struct *p);
 extern void rt_mutex_setprio(struct task_struct *p, int prio);
diff --git a/kernel/sched.c b/kernel/sched.c
index ea6850d..35561c6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -251,7 +251,7 @@ struct cfs_bandwidth {
 #ifdef CONFIG_CFS_BANDWIDTH
 	raw_spinlock_t lock;
 	ktime_t period;
-	u64 quota;
+	u64 quota, runtime;
 	s64 hierarchal_quota;
 #endif
 };
@@ -407,6 +407,7 @@ static inline u64 default_cfs_period(void);
 static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
 	raw_spin_lock_init(&cfs_b->lock);
+	cfs_b->runtime = 0;
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 }
@@ -9107,6 +9108,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
+	cfs_b->runtime = quota;
 	raw_spin_unlock_irq(&cfs_b->lock);
 
 	for_each_possible_cpu(i) {
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index f24f417..9502aa8 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -89,6 +89,20 @@ const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
  */
 unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
 
+#ifdef CONFIG_CFS_BANDWIDTH
+/*
+ * Amount of runtime to allocate from global (tg) to local (per-cfs_rq) pool
+ * each time a cfs_rq requests quota.
+ *
+ * Note: in the case that the slice exceeds the runtime remaining (either due
+ * to consumption or the quota being specified to be smaller than the slice)
+ * we will always only issue the remaining available time.
+ *
+ * default: 5 msec, units: microseconds
+  */
+unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
+#endif
+
 static const struct sched_class fair_sched_class;
 
 /**************************************************************
@@ -292,6 +306,8 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+				   unsigned long delta_exec);
 
 /**************************************************************
  * Scheduling class tree data structure manipulation methods:
@@ -583,6 +599,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
 		cpuacct_charge(curtask, delta_exec);
 		account_group_exec_runtime(curtask, delta_exec);
 	}
+
+	account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
 
 static inline void
@@ -1248,6 +1266,58 @@ static inline u64 default_cfs_period(void)
 {
 	return 100000000ULL;
 }
+
+static inline u64 sched_cfs_bandwidth_slice(void)
+{
+	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
+}
+
+static void assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	struct task_group *tg = cfs_rq->tg;
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+	u64 amount = 0, min_amount;
+
+	/* note: this is a positive sum as runtime_remaining <= 0 */
+	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
+
+	raw_spin_lock(&cfs_b->lock);
+	if (cfs_b->quota == RUNTIME_INF)
+		amount = min_amount;
+	else if (cfs_b->runtime > 0) {
+		amount = min(cfs_b->runtime, min_amount);
+		cfs_b->runtime -= amount;
+	}
+	raw_spin_unlock(&cfs_b->lock);
+
+	cfs_rq->runtime_remaining += amount;
+}
+
+static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+				     unsigned long delta_exec)
+{
+	if (!cfs_rq->runtime_enabled)
+		return;
+
+	cfs_rq->runtime_remaining -= delta_exec;
+	if (cfs_rq->runtime_remaining > 0)
+		return;
+
+	assign_cfs_rq_runtime(cfs_rq);
+}
+
+static __always_inline void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+						   unsigned long delta_exec)
+{
+	if (!cfs_rq->runtime_enabled)
+		return;
+
+	__account_cfs_rq_runtime(cfs_rq, delta_exec);
+}
+
+#else
+static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+				     unsigned long delta_exec) {}
 #endif
 
 /**************************************************
@@ -4266,8 +4336,13 @@ static void set_curr_task_fair(struct rq *rq)
 {
 	struct sched_entity *se = &rq->curr->se;
 
-	for_each_sched_entity(se)
-		set_next_entity(cfs_rq_of(se), se);
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		set_next_entity(cfs_rq, se);
+		/* ensure bandwidth has been allocated on our new cfs_rq */
+		account_cfs_rq_runtime(cfs_rq, 0);
+	}
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 11d65b5..2d2ecdc 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -379,6 +379,16 @@ static struct ctl_table kern_table[] = {
 		.extra2		= &one,
 	},
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		.procname	= "sched_cfs_bandwidth_slice_us",
+		.data		= &sysctl_sched_cfs_bandwidth_slice,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &one,
+	},
+#endif
 #ifdef CONFIG_PROVE_LOCKING
 	{
 		.procname	= "prove_locking",
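
To put concrete numbers on the slice handoff implemented above (a standalone
sketch -- locking and the RUNTIME_INF case are left out; only the
min()-against-the-global-pool arithmetic of assign_cfs_rq_runtime() is
mirrored, and the variable names are local to this example):

#include <stdio.h>
#include <stdint.h>

#define SLICE_NS	5000000LL	/* 5ms default bandwidth slice */

/* Top the local pool back up to one slice, bounded by whatever is left
 * in the global (per-task_group) pool for this period. */
static int64_t assign_runtime(int64_t *global, int64_t *local)
{
	/* in the kernel this runs with *local <= 0, so this is positive */
	int64_t min_amount = SLICE_NS - *local;
	int64_t amount = 0;

	if (*global > 0) {
		amount = min_amount < *global ? min_amount : *global;
		*global -= amount;
	}
	*local += amount;
	return amount;
}

int main(void)
{
	int64_t global = 20000000;	/* 20ms of quota left this period */
	int64_t local  = -300000;	/* overran the last slice by 0.3ms */
	int64_t got = assign_runtime(&global, &local);

	printf("pulled %lld ns, local now %lld, global now %lld\n",
	       (long long)got, (long long)local, (long long)global);
	return 0;
}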

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [tip:sched/core] sched: Add a timer to handle CFS bandwidth refresh
  2011-07-21 16:43 ` [patch 06/18] sched: add a timer to handle CFS bandwidth refresh Paul Turner
@ 2011-08-14 16:23   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Paul Turner @ 2011-08-14 16:23 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, seto.hidetoshi, pjt, tglx, mingo

Commit-ID:  58088ad0152ba4b7997388c93d0ca208ec1ece75
Gitweb:     http://git.kernel.org/tip/58088ad0152ba4b7997388c93d0ca208ec1ece75
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:31 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:28 +0200

sched: Add a timer to handle CFS bandwidth refresh

This patch adds a per-task_group timer which handles the refresh of the global
CFS bandwidth pool.

Since the RT pool is using a similar timer there's some small refactoring to
share this support.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.277271273@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c      |  107 +++++++++++++++++++++++++++++++++++++++++----------
 kernel/sched_fair.c |   40 +++++++++++++++++-
 2 files changed, 123 insertions(+), 24 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 35561c6..34bf8e6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -196,10 +196,28 @@ static inline int rt_bandwidth_enabled(void)
 	return sysctl_sched_rt_runtime >= 0;
 }
 
-static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
+static void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
 {
-	ktime_t now;
+	unsigned long delta;
+	ktime_t soft, hard, now;
 
+	for (;;) {
+		if (hrtimer_active(period_timer))
+			break;
+
+		now = hrtimer_cb_get_time(period_timer);
+		hrtimer_forward(period_timer, now, period);
+
+		soft = hrtimer_get_softexpires(period_timer);
+		hard = hrtimer_get_expires(period_timer);
+		delta = ktime_to_ns(ktime_sub(hard, soft));
+		__hrtimer_start_range_ns(period_timer, soft, delta,
+					 HRTIMER_MODE_ABS_PINNED, 0);
+	}
+}
+
+static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
+{
 	if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
 		return;
 
@@ -207,22 +225,7 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
 		return;
 
 	raw_spin_lock(&rt_b->rt_runtime_lock);
-	for (;;) {
-		unsigned long delta;
-		ktime_t soft, hard;
-
-		if (hrtimer_active(&rt_b->rt_period_timer))
-			break;
-
-		now = hrtimer_cb_get_time(&rt_b->rt_period_timer);
-		hrtimer_forward(&rt_b->rt_period_timer, now, rt_b->rt_period);
-
-		soft = hrtimer_get_softexpires(&rt_b->rt_period_timer);
-		hard = hrtimer_get_expires(&rt_b->rt_period_timer);
-		delta = ktime_to_ns(ktime_sub(hard, soft));
-		__hrtimer_start_range_ns(&rt_b->rt_period_timer, soft, delta,
-				HRTIMER_MODE_ABS_PINNED, 0);
-	}
+	start_bandwidth_timer(&rt_b->rt_period_timer, rt_b->rt_period);
 	raw_spin_unlock(&rt_b->rt_runtime_lock);
 }
 
@@ -253,6 +256,9 @@ struct cfs_bandwidth {
 	ktime_t period;
 	u64 quota, runtime;
 	s64 hierarchal_quota;
+
+	int idle, timer_active;
+	struct hrtimer period_timer;
 #endif
 };
 
@@ -403,6 +409,28 @@ static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
 }
 
 static inline u64 default_cfs_period(void);
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
+
+static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
+{
+	struct cfs_bandwidth *cfs_b =
+		container_of(timer, struct cfs_bandwidth, period_timer);
+	ktime_t now;
+	int overrun;
+	int idle = 0;
+
+	for (;;) {
+		now = hrtimer_cb_get_time(timer);
+		overrun = hrtimer_forward(timer, now, cfs_b->period);
+
+		if (!overrun)
+			break;
+
+		idle = do_sched_cfs_period_timer(cfs_b, overrun);
+	}
+
+	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
+}
 
 static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
@@ -410,6 +438,9 @@ static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	cfs_b->runtime = 0;
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
+
+	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	cfs_b->period_timer.function = sched_cfs_period_timer;
 }
 
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
@@ -417,8 +448,34 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	cfs_rq->runtime_enabled = 0;
 }
 
+/* requires cfs_b->lock, may release to reprogram timer */
+static void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+	/*
+	 * The timer may be active because we're trying to set a new bandwidth
+	 * period or because we're racing with the tear-down path
+	 * (timer_active==0 becomes visible before the hrtimer call-back
+	 * terminates).  In either case we ensure that it's re-programmed
+	 */
+	while (unlikely(hrtimer_active(&cfs_b->period_timer))) {
+		raw_spin_unlock(&cfs_b->lock);
+		/* ensure cfs_b->lock is available while we wait */
+		hrtimer_cancel(&cfs_b->period_timer);
+
+		raw_spin_lock(&cfs_b->lock);
+		/* if someone else restarted the timer then we're done */
+		if (cfs_b->timer_active)
+			return;
+	}
+
+	cfs_b->timer_active = 1;
+	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
+}
+
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
-{}
+{
+	hrtimer_cancel(&cfs_b->period_timer);
+}
 #else
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
@@ -9078,7 +9135,7 @@ static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
 
 static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 {
-	int i, ret = 0;
+	int i, ret = 0, runtime_enabled;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
 
 	if (tg == &root_task_group)
@@ -9105,10 +9162,18 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	if (ret)
 		goto out_unlock;
 
+	runtime_enabled = quota != RUNTIME_INF;
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
 	cfs_b->runtime = quota;
+
+	/* restart the period timer (if active) to handle new period expiry */
+	if (runtime_enabled && cfs_b->timer_active) {
+		/* force a reprogram */
+		cfs_b->timer_active = 0;
+		__start_cfs_bandwidth(cfs_b);
+	}
 	raw_spin_unlock_irq(&cfs_b->lock);
 
 	for_each_possible_cpu(i) {
@@ -9116,7 +9181,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 		struct rq *rq = rq_of(cfs_rq);
 
 		raw_spin_lock_irq(&rq->lock);
-		cfs_rq->runtime_enabled = quota != RUNTIME_INF;
+		cfs_rq->runtime_enabled = runtime_enabled;
 		cfs_rq->runtime_remaining = 0;
 		raw_spin_unlock_irq(&rq->lock);
 	}
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 9502aa8..af73a8a 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1284,9 +1284,16 @@ static void assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	raw_spin_lock(&cfs_b->lock);
 	if (cfs_b->quota == RUNTIME_INF)
 		amount = min_amount;
-	else if (cfs_b->runtime > 0) {
-		amount = min(cfs_b->runtime, min_amount);
-		cfs_b->runtime -= amount;
+	else {
+		/* ensure bandwidth timer remains active under consumption */
+		if (!cfs_b->timer_active)
+			__start_cfs_bandwidth(cfs_b);
+
+		if (cfs_b->runtime > 0) {
+			amount = min(cfs_b->runtime, min_amount);
+			cfs_b->runtime -= amount;
+			cfs_b->idle = 0;
+		}
 	}
 	raw_spin_unlock(&cfs_b->lock);
 
@@ -1315,6 +1322,33 @@ static __always_inline void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 	__account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
 
+/*
+ * Responsible for refilling a task_group's bandwidth and unthrottling its
+ * cfs_rqs as appropriate. If there has been no activity within the last
+ * period the timer is deactivated until scheduling resumes; cfs_b->idle is
+ * used to track this state.
+ */
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
+{
+	int idle = 1;
+
+	raw_spin_lock(&cfs_b->lock);
+	/* no need to continue the timer with no bandwidth constraint */
+	if (cfs_b->quota == RUNTIME_INF)
+		goto out_unlock;
+
+	idle = cfs_b->idle;
+	cfs_b->runtime = cfs_b->quota;
+
+	/* mark as potentially idle for the upcoming period */
+	cfs_b->idle = 1;
+out_unlock:
+	if (idle)
+		cfs_b->timer_active = 0;
+	raw_spin_unlock(&cfs_b->lock);
+
+	return idle;
+}
 #else
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 				     unsigned long delta_exec) {}
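
The period timer above leans on hrtimer_forward() reporting how many whole
periods were missed; here is a tiny standalone model of just that counting
(plain integers instead of ktime_t, made-up timestamps):

#include <stdio.h>
#include <stdint.h>

/* Model of hrtimer_forward(): push the expiry forward in whole periods
 * until it lies in the future, returning how many periods were skipped. */
static uint64_t timer_forward(uint64_t *expires, uint64_t now, uint64_t period)
{
	uint64_t overrun = 0;

	while (*expires <= now) {
		*expires += period;
		overrun++;
	}
	return overrun;
}

int main(void)
{
	uint64_t period  = 100000000;	/* 100ms default period */
	uint64_t expires = 1000000000;	/* was due at t=1s */
	uint64_t now     = 1250000000;	/* callback runs late, at t=1.25s */
	uint64_t overrun;

	/* like sched_cfs_period_timer(): keep forwarding until no overrun
	 * remains; each non-zero overrun corresponds to one quota refresh */
	while ((overrun = timer_forward(&expires, now, period)) != 0)
		printf("refresh: %llu period(s) elapsed, next expiry %llu\n",
		       (unsigned long long)overrun,
		       (unsigned long long)expires);
	return 0;
}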

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [tip:sched/core] sched: Expire invalid runtime
  2011-07-21 16:43 ` [patch 07/18] sched: expire invalid runtime Paul Turner
@ 2011-08-14 16:24   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Paul Turner @ 2011-08-14 16:24 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, seto.hidetoshi, pjt, tglx, mingo

Commit-ID:  a9cf55b2861057a213e610da2fec52125439a11d
Gitweb:     http://git.kernel.org/tip/a9cf55b2861057a213e610da2fec52125439a11d
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:32 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:31 +0200

sched: Expire invalid runtime

Since quota is managed using a global state but consumed on a per-cpu basis
we need to ensure that our per-cpu state is appropriately synchronized.
Most importantly, runtime that is stale (from a previous period) should not be
locally consumable.

We take advantage of the existing sched_clock synchronization around the
jiffy to efficiently detect whether we have (globally) crossed a quota
boundary.

One catch is that the direction of spread on sched_clock is undefined;
specifically, we don't know whether our local clock is behind or ahead
of the one responsible for the current expiration time.

Fortunately we can differentiate these by considering whether the
global deadline has advanced.  If it has not, then we assume our clock to be
"fast" and advance our local expiration; otherwise, we know the deadline has
truly passed and we expire our local runtime.
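
The deadline checks in the patch below compare clock values via a signed
difference rather than directly; a tiny standalone demo of that idiom
(made-up values, including a wrapped case the idiom tolerates):

#include <stdio.h>
#include <stdint.h>

/* the (s64)(a - b) comparison idiom used for the expiry checks below */
static int deadline_passed(uint64_t now, uint64_t expires)
{
	return (int64_t)(now - expires) >= 0;
}

int main(void)
{
	uint64_t expires = 1000;

	printf("now=900  passed=%d\n", deadline_passed(900, expires));
	printf("now=1500 passed=%d\n", deadline_passed(1500, expires));

	/* deadline just after a u64 wrap, "now" just before it: still
	 * correctly reported as not yet passed */
	printf("near wrap passed=%d\n",
	       deadline_passed(UINT64_MAX - 10, 5));
	return 0;
}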

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.379275352@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c      |    4 ++-
 kernel/sched_fair.c |   90 +++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 84 insertions(+), 10 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 34bf8e6..a2d5514 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -256,6 +256,7 @@ struct cfs_bandwidth {
 	ktime_t period;
 	u64 quota, runtime;
 	s64 hierarchal_quota;
+	u64 runtime_expires;
 
 	int idle, timer_active;
 	struct hrtimer period_timer;
@@ -396,6 +397,7 @@ struct cfs_rq {
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	int runtime_enabled;
+	u64 runtime_expires;
 	s64 runtime_remaining;
 #endif
 #endif
@@ -9166,8 +9168,8 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
-	cfs_b->runtime = quota;
 
+	__refill_cfs_bandwidth_runtime(cfs_b);
 	/* restart the period timer (if active) to handle new period expiry */
 	if (runtime_enabled && cfs_b->timer_active) {
 		/* force a reprogram */
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index af73a8a..9d1adbd 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1272,11 +1272,30 @@ static inline u64 sched_cfs_bandwidth_slice(void)
 	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
 }
 
+/*
+ * Replenish runtime according to assigned quota and update expiration time.
+ * We use sched_clock_cpu directly instead of rq->clock to avoid adding
+ * additional synchronization around rq->lock.
+ *
+ * requires cfs_b->lock
+ */
+static void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
+{
+	u64 now;
+
+	if (cfs_b->quota == RUNTIME_INF)
+		return;
+
+	now = sched_clock_cpu(smp_processor_id());
+	cfs_b->runtime = cfs_b->quota;
+	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
+}
+
 static void assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	u64 amount = 0, min_amount;
+	u64 amount = 0, min_amount, expires;
 
 	/* note: this is a positive sum as runtime_remaining <= 0 */
 	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -1285,9 +1304,16 @@ static void assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	if (cfs_b->quota == RUNTIME_INF)
 		amount = min_amount;
 	else {
-		/* ensure bandwidth timer remains active under consumption */
-		if (!cfs_b->timer_active)
+		/*
+		 * If the bandwidth pool has become inactive, then at least one
+		 * period must have elapsed since the last consumption.
+		 * Refresh the global state and ensure bandwidth timer becomes
+		 * active.
+		 */
+		if (!cfs_b->timer_active) {
+			__refill_cfs_bandwidth_runtime(cfs_b);
 			__start_cfs_bandwidth(cfs_b);
+		}
 
 		if (cfs_b->runtime > 0) {
 			amount = min(cfs_b->runtime, min_amount);
@@ -1295,19 +1321,61 @@ static void assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 			cfs_b->idle = 0;
 		}
 	}
+	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
 
 	cfs_rq->runtime_remaining += amount;
+	/*
+	 * we may have advanced our local expiration to account for allowed
+	 * spread between our sched_clock and the one on which runtime was
+	 * issued.
+	 */
+	if ((s64)(expires - cfs_rq->runtime_expires) > 0)
+		cfs_rq->runtime_expires = expires;
 }
 
-static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
-				     unsigned long delta_exec)
+/*
+ * Note: This depends on the synchronization provided by sched_clock and the
+ * fact that rq->clock snapshots this value.
+ */
+static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
-	if (!cfs_rq->runtime_enabled)
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	struct rq *rq = rq_of(cfs_rq);
+
+	/* if the deadline is ahead of our clock, nothing to do */
+	if (likely((s64)(rq->clock - cfs_rq->runtime_expires) < 0))
+		return;
+
+	if (cfs_rq->runtime_remaining < 0)
 		return;
 
+	/*
+	 * If the local deadline has passed we have to consider the
+	 * possibility that our sched_clock is 'fast' and the global deadline
+	 * has not truly expired.
+	 *
+	 * Fortunately we can determine whether this is the case by checking
+	 * whether the global deadline has advanced.
+	 */
+
+	if ((s64)(cfs_rq->runtime_expires - cfs_b->runtime_expires) >= 0) {
+		/* extend local deadline, drift is bounded above by 2 ticks */
+		cfs_rq->runtime_expires += TICK_NSEC;
+	} else {
+		/* global deadline is ahead, expiration has passed */
+		cfs_rq->runtime_remaining = 0;
+	}
+}
+
+static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+				     unsigned long delta_exec)
+{
+	/* dock delta_exec before expiring quota (as it could span periods) */
 	cfs_rq->runtime_remaining -= delta_exec;
-	if (cfs_rq->runtime_remaining > 0)
+	expire_cfs_rq_runtime(cfs_rq);
+
+	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
 
 	assign_cfs_rq_runtime(cfs_rq);
@@ -1338,7 +1406,12 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 		goto out_unlock;
 
 	idle = cfs_b->idle;
-	cfs_b->runtime = cfs_b->quota;
+	/* if we're going inactive then everything else can be deferred */
+	if (idle)
+		goto out_unlock;
+
+	__refill_cfs_bandwidth_runtime(cfs_b);
+
 
 	/* mark as potentially idle for the upcoming period */
 	cfs_b->idle = 1;
@@ -1557,7 +1630,6 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 
 	return wl;
 }
-
 #else
 
 static inline unsigned long effective_load(struct task_group *tg, int cpu,


* [tip:sched/core] sched: Add support for throttling group entities
  2011-07-21 16:43 ` [patch 08/18] sched: add support for throttling group entities Paul Turner
  2011-08-08 15:46   ` Lin Ming
@ 2011-08-14 16:26   ` tip-bot for Paul Turner
  1 sibling, 0 replies; 60+ messages in thread
From: tip-bot for Paul Turner @ 2011-08-14 16:26 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, pjt, tglx, mingo

Commit-ID:  85dac906bec3bb41bfaa7ccaa65c4706de5cfdf8
Gitweb:     http://git.kernel.org/tip/85dac906bec3bb41bfaa7ccaa65c4706de5cfdf8
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:33 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:34 +0200

sched: Add support for throttling group entities

Now that consumption is tracked (via update_curr()) we add support to throttle
group entities (and their corresponding cfs_rqs) in the case where there is no
run-time remaining.

Throttled entities are dequeued to prevent scheduling; additionally we mark
them as throttled (using cfs_rq->throttled) to prevent them from becoming
re-enqueued until they are unthrottled.  A list of a task_group's throttled
entities is maintained on the cfs_bandwidth structure.

Note: While the machinery for throttling is added in this patch, the act of
throttling an entity exceeding its bandwidth is deferred until later within
the series.
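
For orientation, the hierarchical dequeue performed at throttle time can be
sketched in standalone C (the struct and helper below are hypothetical
simplifications, not kernel types): walking up from the throttled group, each
entity is dequeued until an ancestor that still has other runnable load is
reached, while the hierarchical task counts are adjusted the whole way up.

#include <stdio.h>

/* hypothetical, flattened stand-in for a task_group's per-cpu state */
struct grp {
	struct grp *parent;
	int on_rq;		/* is this group's entity enqueued in its parent? */
	long other_weight;	/* runnable load besides the throttled child */
	long h_nr_running;	/* hierarchical count of runnable tasks */
};

static void throttle_grp(struct grp *leaf, long task_delta)
{
	struct grp *g;
	int dequeue = 1;

	for (g = leaf; g->parent && g->on_rq; g = g->parent) {
		if (dequeue)
			g->on_rq = 0;		/* dequeue_entity() analogue */
		g->parent->h_nr_running -= task_delta;

		/* an ancestor with other runnable entities stays enqueued */
		if (g->parent->other_weight)
			dequeue = 0;
	}
}

int main(void)
{
	struct grp root = { NULL,  1, 1, 7 };	/* has other runnable load */
	struct grp mid  = { &root, 1, 0, 3 };
	struct grp leaf = { &mid,  1, 0, 3 };

	throttle_grp(&leaf, leaf.h_nr_running);
	printf("leaf on_rq=%d, mid on_rq=%d, root h_nr_running=%ld\n",
	       leaf.on_rq, mid.on_rq, root.h_nr_running);
	return 0;
}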

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.480608533@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c      |    7 ++++
 kernel/sched_fair.c |   89 ++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 92 insertions(+), 4 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index a2d5514..044260a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -260,6 +260,8 @@ struct cfs_bandwidth {
 
 	int idle, timer_active;
 	struct hrtimer period_timer;
+	struct list_head throttled_cfs_rq;
+
 #endif
 };
 
@@ -399,6 +401,9 @@ struct cfs_rq {
 	int runtime_enabled;
 	u64 runtime_expires;
 	s64 runtime_remaining;
+
+	int throttled;
+	struct list_head throttled_list;
 #endif
 #endif
 };
@@ -441,6 +446,7 @@ static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 
+	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
 }
@@ -448,6 +454,7 @@ static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	cfs_rq->runtime_enabled = 0;
+	INIT_LIST_HEAD(&cfs_rq->throttled_list);
 }
 
 /* requires cfs_b->lock, may release to reprogram timer */
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 9d1adbd..72c9d4e 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1291,7 +1291,8 @@ static void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
 }
 
-static void assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+/* returns 0 on failure to allocate runtime */
+static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
@@ -1332,6 +1333,8 @@ static void assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	 */
 	if ((s64)(expires - cfs_rq->runtime_expires) > 0)
 		cfs_rq->runtime_expires = expires;
+
+	return cfs_rq->runtime_remaining > 0;
 }
 
 /*
@@ -1378,7 +1381,12 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
 
-	assign_cfs_rq_runtime(cfs_rq);
+	/*
+	 * if we're unable to extend our runtime we resched so that the active
+	 * hierarchy can be throttled
+	 */
+	if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
+		resched_task(rq_of(cfs_rq)->curr);
 }
 
 static __always_inline void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
@@ -1390,6 +1398,47 @@ static __always_inline void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 	__account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
 
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->throttled;
+}
+
+static __used void throttle_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	struct rq *rq = rq_of(cfs_rq);
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	struct sched_entity *se;
+	long task_delta, dequeue = 1;
+
+	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+
+	/* account load preceding throttle */
+	update_cfs_load(cfs_rq, 0);
+
+	task_delta = cfs_rq->h_nr_running;
+	for_each_sched_entity(se) {
+		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
+		/* throttled entity or throttle-on-deactivate */
+		if (!se->on_rq)
+			break;
+
+		if (dequeue)
+			dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
+		qcfs_rq->h_nr_running -= task_delta;
+
+		if (qcfs_rq->load.weight)
+			dequeue = 0;
+	}
+
+	if (!se)
+		rq->nr_running -= task_delta;
+
+	cfs_rq->throttled = 1;
+	raw_spin_lock(&cfs_b->lock);
+	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
+	raw_spin_unlock(&cfs_b->lock);
+}
+
 /*
  * Responsible for refilling a task_group's bandwidth and unthrottling its
  * cfs_rqs as appropriate. If there has been no activity within the last
@@ -1425,6 +1474,11 @@ out_unlock:
 #else
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 				     unsigned long delta_exec) {}
+
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+	return 0;
+}
 #endif
 
 /**************************************************
@@ -1503,7 +1557,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 			break;
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
+
+		/*
+		 * end evaluation on encountering a throttled cfs_rq
+		 *
+		 * note: in the case of encountering a throttled cfs_rq we will
+		 * post the final h_nr_running increment below.
+		*/
+		if (cfs_rq_throttled(cfs_rq))
+			break;
 		cfs_rq->h_nr_running++;
+
 		flags = ENQUEUE_WAKEUP;
 	}
 
@@ -1511,11 +1575,15 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq = cfs_rq_of(se);
 		cfs_rq->h_nr_running++;
 
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
-	inc_nr_running(rq);
+	if (!se)
+		inc_nr_running(rq);
 	hrtick_update(rq);
 }
 
@@ -1535,6 +1603,15 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);
+
+		/*
+		 * end evaluation on encountering a throttled cfs_rq
+		 *
+		 * note: in the case of encountering a throttled cfs_rq we will
+		 * post the final h_nr_running decrement below.
+		*/
+		if (cfs_rq_throttled(cfs_rq))
+			break;
 		cfs_rq->h_nr_running--;
 
 		/* Don't dequeue parent if it has other entities besides us */
@@ -1557,11 +1634,15 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq = cfs_rq_of(se);
 		cfs_rq->h_nr_running--;
 
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
-	dec_nr_running(rq);
+	if (!se)
+		dec_nr_running(rq);
 	hrtick_update(rq);
 }
 


* [tip:sched/core] sched: Add support for unthrottling group entities
  2011-07-21 16:43 ` [patch 09/18] sched: add support for unthrottling " Paul Turner
@ 2011-08-14 16:27   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Paul Turner @ 2011-08-14 16:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, seto.hidetoshi, pjt, tglx, mingo

Commit-ID:  671fd9dabe5239ad218c7eb48b2b9edee50250e6
Gitweb:     http://git.kernel.org/tip/671fd9dabe5239ad218c7eb48b2b9edee50250e6
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:34 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:36 +0200

sched: Add support for unthrottling group entities

At the start of each period we refresh the global bandwidth pool.  At this time
we must also unthrottle any cfs_rq entities that are now within bandwidth once
more (as quota permits).

Unthrottled cfs_rqs have their corresponding throttled flag cleared and their
entities re-enqueued.
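
The distribution step that follows the refresh can be pictured with a
standalone sketch (the structures below are hypothetical; the kernel version
additionally drops cfs_b->lock and takes each rq->lock while doing this):
every throttled runqueue is topped up to just above zero until either all
debtors are served or the freshly refilled runtime runs out.

#include <stdio.h>

/* hypothetical debtor: a throttled runqueue with a non-positive local balance */
struct debtor {
	long long runtime_remaining;
	int throttled;
};

/* hand out 'remaining' runtime; returns whatever is left over */
static long long distribute(struct debtor *d, int n, long long remaining)
{
	int i;

	for (i = 0; i < n && remaining > 0; i++) {
		long long want;

		if (!d[i].throttled)
			continue;

		/* just enough to bring the local balance above zero */
		want = -d[i].runtime_remaining + 1;
		if (want > remaining)
			want = remaining;
		remaining -= want;

		d[i].runtime_remaining += want;
		if (d[i].runtime_remaining > 0)
			d[i].throttled = 0;	/* unthrottle analogue */
	}
	return remaining;
}

int main(void)
{
	struct debtor d[] = { { -300, 1 }, { -50, 1 }, { -1000, 1 } };
	long long left = distribute(d, 3, 500);

	printf("left=%lld, d[2] remaining=%lld throttled=%d\n",
	       left, d[2].runtime_remaining, d[2].throttled);
	return 0;
}

Any remainder is handed back to the global pool, as the final hunk of
do_sched_cfs_period_timer() below does.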

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.574628950@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c      |    3 +
 kernel/sched_fair.c |  127 +++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 126 insertions(+), 4 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 044260a..4bbabc2 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9192,6 +9192,9 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 		raw_spin_lock_irq(&rq->lock);
 		cfs_rq->runtime_enabled = runtime_enabled;
 		cfs_rq->runtime_remaining = 0;
+
+		if (cfs_rq_throttled(cfs_rq))
+			unthrottle_cfs_rq(cfs_rq);
 		raw_spin_unlock_irq(&rq->lock);
 	}
 out_unlock:
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 72c9d4e..7641195 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1439,6 +1439,84 @@ static __used void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	raw_spin_unlock(&cfs_b->lock);
 }
 
+static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	struct rq *rq = rq_of(cfs_rq);
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	struct sched_entity *se;
+	int enqueue = 1;
+	long task_delta;
+
+	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+
+	cfs_rq->throttled = 0;
+	raw_spin_lock(&cfs_b->lock);
+	list_del_rcu(&cfs_rq->throttled_list);
+	raw_spin_unlock(&cfs_b->lock);
+
+	if (!cfs_rq->load.weight)
+		return;
+
+	task_delta = cfs_rq->h_nr_running;
+	for_each_sched_entity(se) {
+		if (se->on_rq)
+			enqueue = 0;
+
+		cfs_rq = cfs_rq_of(se);
+		if (enqueue)
+			enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
+		cfs_rq->h_nr_running += task_delta;
+
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+	}
+
+	if (!se)
+		rq->nr_running += task_delta;
+
+	/* determine whether we need to wake up potentially idle cpu */
+	if (rq->curr == rq->idle && rq->cfs.nr_running)
+		resched_task(rq->curr);
+}
+
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
+		u64 remaining, u64 expires)
+{
+	struct cfs_rq *cfs_rq;
+	u64 runtime = remaining;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(cfs_rq, &cfs_b->throttled_cfs_rq,
+				throttled_list) {
+		struct rq *rq = rq_of(cfs_rq);
+
+		raw_spin_lock(&rq->lock);
+		if (!cfs_rq_throttled(cfs_rq))
+			goto next;
+
+		runtime = -cfs_rq->runtime_remaining + 1;
+		if (runtime > remaining)
+			runtime = remaining;
+		remaining -= runtime;
+
+		cfs_rq->runtime_remaining += runtime;
+		cfs_rq->runtime_expires = expires;
+
+		/* we check whether we're throttled above */
+		if (cfs_rq->runtime_remaining > 0)
+			unthrottle_cfs_rq(cfs_rq);
+
+next:
+		raw_spin_unlock(&rq->lock);
+
+		if (!remaining)
+			break;
+	}
+	rcu_read_unlock();
+
+	return remaining;
+}
+
 /*
  * Responsible for refilling a task_group's bandwidth and unthrottling its
  * cfs_rqs as appropriate. If there has been no activity within the last
@@ -1447,23 +1525,64 @@ static __used void throttle_cfs_rq(struct cfs_rq *cfs_rq)
  */
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 {
-	int idle = 1;
+	u64 runtime, runtime_expires;
+	int idle = 1, throttled;
 
 	raw_spin_lock(&cfs_b->lock);
 	/* no need to continue the timer with no bandwidth constraint */
 	if (cfs_b->quota == RUNTIME_INF)
 		goto out_unlock;
 
-	idle = cfs_b->idle;
+	throttled = !list_empty(&cfs_b->throttled_cfs_rq);
+	/* idle depends on !throttled (for the case of a large deficit) */
+	idle = cfs_b->idle && !throttled;
+
 	/* if we're going inactive then everything else can be deferred */
 	if (idle)
 		goto out_unlock;
 
 	__refill_cfs_bandwidth_runtime(cfs_b);
 
+	if (!throttled) {
+		/* mark as potentially idle for the upcoming period */
+		cfs_b->idle = 1;
+		goto out_unlock;
+	}
+
+	/*
+	 * There are throttled entities so we must first use the new bandwidth
+	 * to unthrottle them before making it generally available.  This
+	 * ensures that all existing debts will be paid before a new cfs_rq is
+	 * allowed to run.
+	 */
+	runtime = cfs_b->runtime;
+	runtime_expires = cfs_b->runtime_expires;
+	cfs_b->runtime = 0;
+
+	/*
+	 * This check is repeated as we are holding onto the new bandwidth
+	 * while we unthrottle.  This can potentially race with an unthrottled
+	 * group trying to acquire new bandwidth from the global pool.
+	 */
+	while (throttled && runtime > 0) {
+		raw_spin_unlock(&cfs_b->lock);
+		/* we can't nest cfs_b->lock while distributing bandwidth */
+		runtime = distribute_cfs_runtime(cfs_b, runtime,
+						 runtime_expires);
+		raw_spin_lock(&cfs_b->lock);
+
+		throttled = !list_empty(&cfs_b->throttled_cfs_rq);
+	}
 
-	/* mark as potentially idle for the upcoming period */
-	cfs_b->idle = 1;
+	/* return (any) remaining runtime */
+	cfs_b->runtime = runtime;
+	/*
+	 * While we are ensured activity in the period following an
+	 * unthrottle, this also covers the case in which the new bandwidth is
+	 * insufficient to cover the existing bandwidth deficit.  (Forcing the
+	 * timer to remain active while there are any throttled entities.)
+	 */
+	cfs_b->idle = 0;
 out_unlock:
 	if (idle)
 		cfs_b->timer_active = 0;


* [tip:sched/core] sched: Allow for positional tg_tree walks
  2011-07-21 16:43 ` [patch 10/18] sched: allow for positional tg_tree walks Paul Turner
@ 2011-08-14 16:29   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Paul Turner @ 2011-08-14 16:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, seto.hidetoshi, pjt, tglx, mingo

Commit-ID:  8277434ef1202ce30315f8edb3fc760aa6e74493
Gitweb:     http://git.kernel.org/tip/8277434ef1202ce30315f8edb3fc760aa6e74493
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:35 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:38 +0200

sched: Allow for positional tg_tree walks

Extend walk_tg_tree to accept a positional argument

static int walk_tg_tree_from(struct task_group *from,
			     tg_visitor down, tg_visitor up, void *data)

Existing semantics are preserved; the caller must hold rcu_lock() or a
sufficient analogue.
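
For reference, a minimal self-contained version of such a positional down/up
walk might look as follows (a recursive sketch for brevity, whereas the kernel
implementation is iterative and traverses RCU-protected child lists): @down is
invoked when a node is first entered, @up when it is left for the final time,
and the walk is rooted at an arbitrary node rather than at the tree's root.

#include <stdio.h>

/* a minimal n-ary node; the kernel walks task_groups via their child lists */
struct node {
	const char *name;
	struct node **children;		/* NULL-terminated array, or NULL */
};

typedef int (*visitor)(struct node *, void *);

/* visit 'from' and its descendants: down on entry, up on the final exit */
static int walk_from(struct node *from, visitor down, visitor up, void *data)
{
	int ret = down(from, data);

	if (ret)
		return ret;

	if (from->children) {
		struct node **c;

		for (c = from->children; *c; c++) {
			ret = walk_from(*c, down, up, data);
			if (ret)
				return ret;
		}
	}
	return up(from, data);
}

static int print_down(struct node *n, void *data)
{
	(void)data;
	printf("down %s\n", n->name);
	return 0;
}

static int print_up(struct node *n, void *data)
{
	(void)data;
	printf("up   %s\n", n->name);
	return 0;
}

int main(void)
{
	struct node b = { "B", NULL }, c = { "C", NULL };
	struct node *kids[] = { &b, &c, NULL };
	struct node a = { "A", kids };

	return walk_from(&a, print_down, print_up, NULL);
}

A non-zero return from either visitor aborts the walk, matching the error
handling above.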

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.677889157@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |   50 +++++++++++++++++++++++++++++++++++++-------------
 1 files changed, 37 insertions(+), 13 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 4bbabc2..8ec1e7a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1591,20 +1591,23 @@ static inline void dec_cpu_load(struct rq *rq, unsigned long load)
 typedef int (*tg_visitor)(struct task_group *, void *);
 
 /*
- * Iterate the full tree, calling @down when first entering a node and @up when
- * leaving it for the final time.
+ * Iterate task_group tree rooted at *from, calling @down when first entering a
+ * node and @up when leaving it for the final time.
+ *
+ * Caller must hold rcu_lock or sufficient equivalent.
  */
-static int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)
+static int walk_tg_tree_from(struct task_group *from,
+			     tg_visitor down, tg_visitor up, void *data)
 {
 	struct task_group *parent, *child;
 	int ret;
 
-	rcu_read_lock();
-	parent = &root_task_group;
+	parent = from;
+
 down:
 	ret = (*down)(parent, data);
 	if (ret)
-		goto out_unlock;
+		goto out;
 	list_for_each_entry_rcu(child, &parent->children, siblings) {
 		parent = child;
 		goto down;
@@ -1613,19 +1616,29 @@ up:
 		continue;
 	}
 	ret = (*up)(parent, data);
-	if (ret)
-		goto out_unlock;
+	if (ret || parent == from)
+		goto out;
 
 	child = parent;
 	parent = parent->parent;
 	if (parent)
 		goto up;
-out_unlock:
-	rcu_read_unlock();
-
+out:
 	return ret;
 }
 
+/*
+ * Iterate the full tree, calling @down when first entering a node and @up when
+ * leaving it for the final time.
+ *
+ * Caller must hold rcu_lock or sufficient equivalent.
+ */
+
+static inline int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)
+{
+	return walk_tg_tree_from(&root_task_group, down, up, data);
+}
+
 static int tg_nop(struct task_group *tg, void *data)
 {
 	return 0;
@@ -8870,13 +8883,19 @@ static int tg_rt_schedulable(struct task_group *tg, void *data)
 
 static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
 {
+	int ret;
+
 	struct rt_schedulable_data data = {
 		.tg = tg,
 		.rt_period = period,
 		.rt_runtime = runtime,
 	};
 
-	return walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
+	rcu_read_lock();
+	ret = walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
+	rcu_read_unlock();
+
+	return ret;
 }
 
 static int tg_set_rt_bandwidth(struct task_group *tg,
@@ -9333,6 +9352,7 @@ static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
 
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
 {
+	int ret;
 	struct cfs_schedulable_data data = {
 		.tg = tg,
 		.period = period,
@@ -9344,7 +9364,11 @@ static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
 		do_div(data.quota, NSEC_PER_USEC);
 	}
 
-	return walk_tg_tree(tg_cfs_schedulable_down, tg_nop, &data);
+	rcu_read_lock();
+	ret = walk_tg_tree(tg_cfs_schedulable_down, tg_nop, &data);
+	rcu_read_unlock();
+
+	return ret;
 }
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */


* [tip:sched/core] sched: Prevent interactions with throttled entities
  2011-07-21 16:43 ` [patch 11/18] sched: prevent interactions with throttled entities Paul Turner
  2011-07-22 11:26   ` Kamalesh Babulal
  2011-07-22 11:41   ` Kamalesh Babulal
@ 2011-08-14 16:30   ` tip-bot for Paul Turner
  2 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Paul Turner @ 2011-08-14 16:30 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, seto.hidetoshi, pjt, tglx, mingo

Commit-ID:  64660c864f46202b932b911a69deb09805bdbaf8
Gitweb:     http://git.kernel.org/tip/64660c864f46202b932b911a69deb09805bdbaf8
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:36 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:40 +0200

sched: Prevent interactions with throttled entities

From the perspective of load-balance and shares distribution, throttled
entities should be invisible.

However, both of these operations work on 'active' lists and are not
inherently aware of what group hierarchies may be present.  In some cases this
may be side-stepped (e.g. we could sideload via tg_load_down in load balance)
while in others (e.g. update_shares()) it is more difficult to compute without
incurring some O(n^2) costs.

Instead, track hierarchical throttled state at the time of transition.  This
allows us to easily identify whether an entity belongs to a throttled hierarchy
and avoid incorrect interactions with it.

Also, when an entity leaves a throttled hierarchy we need to advance its
shares averaging windows so that the elapsed throttled time is not
considered as part of the cfs_rq's operation.

We also use this information to prevent buddy interactions in the wakeup and
yield_to() paths.
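
The hierarchical state itself is just a per-cfs_rq counter maintained by the
tree walks at (un)throttle time; a toy standalone sketch of the idea follows
(all names are hypothetical):

#include <stdio.h>

/*
 * Every runqueue below a throttled group carries a non-zero count, so the
 * "inside a throttled hierarchy?" test is a single load.
 */
struct rq_state {
	int throttle_count;
};

static int throttled_hierarchy(const struct rq_state *s)
{
	return s->throttle_count != 0;
}

/* skip a migration when either end sits inside a throttled hierarchy */
static int throttled_pair(const struct rq_state *src, const struct rq_state *dst)
{
	return throttled_hierarchy(src) || throttled_hierarchy(dst);
}

int main(void)
{
	struct rq_state src = { .throttle_count = 2 };	/* two throttled ancestors */
	struct rq_state dst = { .throttle_count = 0 };

	src.throttle_count--;		/* one ancestor was unthrottled */
	printf("migration still blocked: %d\n", throttled_pair(&src, &dst));
	return 0;
}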

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.777916795@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c      |    2 +-
 kernel/sched_fair.c |   99 +++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 94 insertions(+), 7 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 8ec1e7a..5db05f6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -402,7 +402,7 @@ struct cfs_rq {
 	u64 runtime_expires;
 	s64 runtime_remaining;
 
-	int throttled;
+	int throttled, throttle_count;
 	struct list_head throttled_list;
 #endif
 #endif
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 7641195..5a20894 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -706,6 +706,8 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+/* we need this in update_cfs_load and load-balance functions below */
+static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 # ifdef CONFIG_SMP
 static void update_cfs_rq_load_contribution(struct cfs_rq *cfs_rq,
 					    int global_update)
@@ -728,7 +730,7 @@ static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
 	u64 now, delta;
 	unsigned long load = cfs_rq->load.weight;
 
-	if (cfs_rq->tg == &root_task_group)
+	if (cfs_rq->tg == &root_task_group || throttled_hierarchy(cfs_rq))
 		return;
 
 	now = rq_of(cfs_rq)->clock_task;
@@ -837,7 +839,7 @@ static void update_cfs_shares(struct cfs_rq *cfs_rq)
 
 	tg = cfs_rq->tg;
 	se = tg->se[cpu_of(rq_of(cfs_rq))];
-	if (!se)
+	if (!se || throttled_hierarchy(cfs_rq))
 		return;
 #ifndef CONFIG_SMP
 	if (likely(se->load.weight == tg->shares))
@@ -1403,6 +1405,65 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 	return cfs_rq->throttled;
 }
 
+/* check whether cfs_rq, or any parent, is throttled */
+static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->throttle_count;
+}
+
+/*
+ * Ensure that neither of the group entities corresponding to src_cpu or
+ * dest_cpu are members of a throttled hierarchy when performing group
+ * load-balance operations.
+ */
+static inline int throttled_lb_pair(struct task_group *tg,
+				    int src_cpu, int dest_cpu)
+{
+	struct cfs_rq *src_cfs_rq, *dest_cfs_rq;
+
+	src_cfs_rq = tg->cfs_rq[src_cpu];
+	dest_cfs_rq = tg->cfs_rq[dest_cpu];
+
+	return throttled_hierarchy(src_cfs_rq) ||
+	       throttled_hierarchy(dest_cfs_rq);
+}
+
+/* updated child weight may affect parent so we have to do this bottom up */
+static int tg_unthrottle_up(struct task_group *tg, void *data)
+{
+	struct rq *rq = data;
+	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+
+	cfs_rq->throttle_count--;
+#ifdef CONFIG_SMP
+	if (!cfs_rq->throttle_count) {
+		u64 delta = rq->clock_task - cfs_rq->load_stamp;
+
+		/* leaving throttled state, advance shares averaging windows */
+		cfs_rq->load_stamp += delta;
+		cfs_rq->load_last += delta;
+
+		/* update entity weight now that we are on_rq again */
+		update_cfs_shares(cfs_rq);
+	}
+#endif
+
+	return 0;
+}
+
+static int tg_throttle_down(struct task_group *tg, void *data)
+{
+	struct rq *rq = data;
+	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+
+	/* group is entering throttled state, record last load */
+	if (!cfs_rq->throttle_count)
+		update_cfs_load(cfs_rq, 0);
+	cfs_rq->throttle_count++;
+
+	return 0;
+}
+
 static __used void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
@@ -1413,7 +1474,9 @@ static __used void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
 
 	/* account load preceding throttle */
-	update_cfs_load(cfs_rq, 0);
+	rcu_read_lock();
+	walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
+	rcu_read_unlock();
 
 	task_delta = cfs_rq->h_nr_running;
 	for_each_sched_entity(se) {
@@ -1454,6 +1517,10 @@ static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	list_del_rcu(&cfs_rq->throttled_list);
 	raw_spin_unlock(&cfs_b->lock);
 
+	update_rq_clock(rq);
+	/* update hierarchical throttle state */
+	walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
+
 	if (!cfs_rq->load.weight)
 		return;
 
@@ -1598,6 +1665,17 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {
 	return 0;
 }
+
+static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
+{
+	return 0;
+}
+
+static inline int throttled_lb_pair(struct task_group *tg,
+				    int src_cpu, int dest_cpu)
+{
+	return 0;
+}
 #endif
 
 /**************************************************
@@ -2493,6 +2571,9 @@ move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest,
 
 	for_each_leaf_cfs_rq(busiest, cfs_rq) {
 		list_for_each_entry_safe(p, n, &cfs_rq->tasks, se.group_node) {
+			if (throttled_lb_pair(task_group(p),
+					      busiest->cpu, this_cpu))
+				break;
 
 			if (!can_migrate_task(p, busiest, this_cpu,
 						sd, idle, &pinned))
@@ -2608,8 +2689,13 @@ static void update_shares(int cpu)
 	 * Iterates the task_group tree in a bottom up fashion, see
 	 * list_add_leaf_cfs_rq() for details.
 	 */
-	for_each_leaf_cfs_rq(rq, cfs_rq)
+	for_each_leaf_cfs_rq(rq, cfs_rq) {
+		/* throttled entities do not contribute to load */
+		if (throttled_hierarchy(cfs_rq))
+			continue;
+
 		update_shares_cpu(cfs_rq->tg, cpu);
+	}
 	rcu_read_unlock();
 }
 
@@ -2659,9 +2745,10 @@ load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
 		u64 rem_load, moved_load;
 
 		/*
-		 * empty group
+		 * empty group or part of a throttled hierarchy
 		 */
-		if (!busiest_cfs_rq->task_weight)
+		if (!busiest_cfs_rq->task_weight ||
+		    throttled_lb_pair(busiest_cfs_rq->tg, cpu_of(busiest), this_cpu))
 			continue;
 
 		rem_load = (u64)rem_load_move * busiest_weight;


* [tip:sched/core] sched: Prevent buddy interactions with throttled entities
  2011-07-21 16:43 ` [patch 12/18] sched: prevent buddy " Paul Turner
@ 2011-08-14 16:32   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Paul Turner @ 2011-08-14 16:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, pjt, tglx, mingo

Commit-ID:  5238cdd3873e67a98b28c1161d65d2a615c320a3
Gitweb:     http://git.kernel.org/tip/5238cdd3873e67a98b28c1161d65d2a615c320a3
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:37 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:42 +0200

sched: Prevent buddy interactions with throttled entities

Buddies allow us to select "on-rq" entities without actually selecting them
from a cfs_rq's rb_tree.  As a result we must ensure that throttled entities
are not falsely nominated as buddies.  The fact that entities are dequeued
within throttle_entity is not sufficient for clearing buddy status as the
nomination may occur after throttling.

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.886850167@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched_fair.c |   18 +++++++++++++++++-
 1 files changed, 17 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5a20894..1d4acbe 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2348,6 +2348,15 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	if (unlikely(se == pse))
 		return;
 
+	/*
+	 * This is possible from callers such as pull_task(), in which we
+	 * unconditionally check_preempt_curr() after an enqueue (which may have
+	 * led to a throttle).  This both saves work and prevents false
+	 * next-buddy nomination below.
+	 */
+	if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
+		return;
+
 	if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
 		set_next_buddy(pse);
 		next_buddy_marked = 1;
@@ -2356,6 +2365,12 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	/*
 	 * We can come here with TIF_NEED_RESCHED already set from new task
 	 * wake up path.
+	 *
+	 * Note: this also catches the edge-case of curr being in a throttled
+	 * group (e.g. via set_curr_task), since update_curr() (in the
+	 * enqueue of curr) will have resulted in resched being set.  This
+	 * prevents us from potentially nominating it as a false LAST_BUDDY
+	 * below.
 	 */
 	if (test_tsk_need_resched(curr))
 		return;
@@ -2474,7 +2489,8 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp
 {
 	struct sched_entity *se = &p->se;
 
-	if (!se->on_rq)
+	/* throttled hierarchies are not runnable */
+	if (!se->on_rq || throttled_hierarchy(cfs_rq_of(se)))
 		return false;
 
 	/* Tell the scheduler that we'd really like pse to run next. */


* [tip:sched/core] sched: Migrate throttled tasks on HOTPLUG
  2011-07-21 16:43 ` [patch 13/18] sched: migrate throttled tasks on HOTPLUG Paul Turner
@ 2011-08-14 16:34   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Paul Turner @ 2011-08-14 16:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, seto.hidetoshi, pjt, tglx, mingo

Commit-ID:  8cb120d3e41a0464a559d639d519cef563717a4e
Gitweb:     http://git.kernel.org/tip/8cb120d3e41a0464a559d639d519cef563717a4e
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:38 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:44 +0200

sched: Migrate throttled tasks on HOTPLUG

Throttled tasks are invisible to cpu-offline since they are not eligible for
selection by pick_next_task().  The regular 'escape' path for a thread that is
blocked at offline is via ttwu->select_task_rq; however, this will not handle a
throttled group since there are no individual thread wakeups on an unthrottle.

Resolve this by unthrottling offline cpus so that threads can be migrated.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.989000590@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |   27 +++++++++++++++++++++++++++
 1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 5db05f6..3973172 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6335,6 +6335,30 @@ static void calc_global_load_remove(struct rq *rq)
 	rq->calc_load_active = 0;
 }
 
+#ifdef CONFIG_CFS_BANDWIDTH
+static void unthrottle_offline_cfs_rqs(struct rq *rq)
+{
+	struct cfs_rq *cfs_rq;
+
+	for_each_leaf_cfs_rq(rq, cfs_rq) {
+		struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+
+		if (!cfs_rq->runtime_enabled)
+			continue;
+
+		/*
+		 * clock_task is not advancing so we just need to make sure
+		 * there's some valid quota amount
+		 */
+		cfs_rq->runtime_remaining = cfs_b->quota;
+		if (cfs_rq_throttled(cfs_rq))
+			unthrottle_cfs_rq(cfs_rq);
+	}
+}
+#else
+static void unthrottle_offline_cfs_rqs(struct rq *rq) {}
+#endif
+
 /*
  * Migrate all tasks from the rq, sleeping tasks will be migrated by
  * try_to_wake_up()->select_task_rq().
@@ -6360,6 +6384,9 @@ static void migrate_tasks(unsigned int dead_cpu)
 	 */
 	rq->stop = NULL;
 
+	/* Ensure any throttled groups are reachable by pick_next_task */
+	unthrottle_offline_cfs_rqs(rq);
+
 	for ( ; ; ) {
 		/*
 		 * There's this thread running, bail when that's the only


* [tip:sched/core] sched: Throttle entities exceeding their allowed bandwidth
  2011-07-21 16:43 ` [patch 14/18] sched: throttle entities exceeding their allowed bandwidth Paul Turner
@ 2011-08-14 16:35   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Paul Turner @ 2011-08-14 16:35 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, pjt, tglx, mingo

Commit-ID:  d3d9dc3302368269acf94b7381663b93000fe2fe
Gitweb:     http://git.kernel.org/tip/d3d9dc3302368269acf94b7381663b93000fe2fe
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:39 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:47 +0200

sched: Throttle entities exceeding their allowed bandwidth

With the machinery in place to throttle and unthrottle entities, as well as
handle their participation (or lack thereof), we can now enable throttling.

There are two points at which we must check whether it's time to set the
throttled state:
 put_prev_entity() and enqueue_entity().

- put_prev_entity() is the typical throttle path, we reach it by exceeding our
  allocated run-time within update_curr()->account_cfs_rq_runtime() and going
  through a reschedule.

- enqueue_entity() covers the case of a wake-up into an already throttled
  group.  In this case we know the group cannot be on_rq and can throttle
  immediately.  Checks are added at the time of put_prev_entity() and
  enqueue_entity().

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184758.091415417@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched_fair.c |   52 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 1d4acbe..f9f671a 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -970,6 +970,8 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 	se->vruntime = vruntime;
 }
 
+static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
+
 static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
@@ -999,8 +1001,10 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
 
-	if (cfs_rq->nr_running == 1)
+	if (cfs_rq->nr_running == 1) {
 		list_add_leaf_cfs_rq(cfs_rq);
+		check_enqueue_throttle(cfs_rq);
+	}
 }
 
 static void __clear_buddies_last(struct sched_entity *se)
@@ -1202,6 +1206,8 @@ static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
 	return se;
 }
 
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
+
 static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 {
 	/*
@@ -1211,6 +1217,9 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 	if (prev->on_rq)
 		update_curr(cfs_rq);
 
+	/* throttle cfs_rqs exceeding runtime */
+	check_cfs_rq_runtime(cfs_rq);
+
 	check_spread(cfs_rq, prev);
 	if (prev->on_rq) {
 		update_stats_wait_start(cfs_rq, prev);
@@ -1464,7 +1473,7 @@ static int tg_throttle_down(struct task_group *tg, void *data)
 	return 0;
 }
 
-static __used void throttle_cfs_rq(struct cfs_rq *cfs_rq)
+static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
@@ -1657,9 +1666,48 @@ out_unlock:
 
 	return idle;
 }
+
+/*
+ * When a group wakes up we want to make sure that its quota is not already
+ * expired/exceeded, otherwise it may be allowed to steal additional ticks of
+ * runtime as update_curr() throttling cannot trigger until it's on-rq.
+ */
+static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
+{
+	/* an active group must be handled by the update_curr()->put() path */
+	if (!cfs_rq->runtime_enabled || cfs_rq->curr)
+		return;
+
+	/* ensure the group is not already throttled */
+	if (cfs_rq_throttled(cfs_rq))
+		return;
+
+	/* update runtime allocation */
+	account_cfs_rq_runtime(cfs_rq, 0);
+	if (cfs_rq->runtime_remaining <= 0)
+		throttle_cfs_rq(cfs_rq);
+}
+
+/* conditionally throttle active cfs_rq's from put_prev_entity() */
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
+		return;
+
+	/*
+	 * it's possible for a throttled entity to be forced into a running
+	 * state (e.g. set_curr_task), in this case we're finished.
+	 */
+	if (cfs_rq_throttled(cfs_rq))
+		return;
+
+	throttle_cfs_rq(cfs_rq);
+}
 #else
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 				     unsigned long delta_exec) {}
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
+static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {


* [tip:sched/core] sched: Add exports tracking cfs bandwidth control statistics
  2011-07-21 16:43 ` [patch 15/18] sched: add exports tracking cfs bandwidth control statistics Paul Turner
@ 2011-08-14 16:37   ` tip-bot for Nikhil Rao
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Nikhil Rao @ 2011-08-14 16:37 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, seto.hidetoshi, ncrao,
	pjt, bharata, tglx, mingo

Commit-ID:  e8da1b18b32064c43881bceef0f051c2110c9ab9
Gitweb:     http://git.kernel.org/tip/e8da1b18b32064c43881bceef0f051c2110c9ab9
Author:     Nikhil Rao <ncrao@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:40 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:49 +0200

sched: Add exports tracking cfs bandwidth control statistics

This change introduces statistics exports for the cpu sub-system; these are
added through the use of a stat file similar to that exported by other
subsystems.

The following exports are included:

nr_periods:	number of periods in which execution occurred
nr_throttled:	the number of periods above in which execution was throttled
throttled_time:	cumulative wall-time that any cpus have been throttled for
		this group
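
A small userspace consumer of this file might look like the sketch below (the
cgroup mount point and group name are assumptions and need adjusting per
system; the three keys are the ones exported above):

#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *path = "/cgroup/cpu/test/cpu.stat";	/* assumed mount/group */
	unsigned long long nr_periods = 0, nr_throttled = 0, throttled_time = 0;
	unsigned long long val;
	char key[64];
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}

	while (fscanf(f, "%63s %llu", key, &val) == 2) {
		if (!strcmp(key, "nr_periods"))
			nr_periods = val;
		else if (!strcmp(key, "nr_throttled"))
			nr_throttled = val;
		else if (!strcmp(key, "throttled_time"))
			throttled_time = val;
	}
	fclose(f);

	printf("throttled in %llu of %llu periods, %.3f ms total\n",
	       nr_throttled, nr_periods, throttled_time / 1e6);
	return 0;
}

The nr_throttled/nr_periods ratio gives the fraction of enforcement intervals
in which the group actually hit its limit.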

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184758.198901931@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c      |   21 +++++++++++++++++++++
 kernel/sched_fair.c |    7 +++++++
 2 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 3973172..35c9185 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -262,6 +262,9 @@ struct cfs_bandwidth {
 	struct hrtimer period_timer;
 	struct list_head throttled_cfs_rq;
 
+	/* statistics */
+	int nr_periods, nr_throttled;
+	u64 throttled_time;
 #endif
 };
 
@@ -402,6 +405,7 @@ struct cfs_rq {
 	u64 runtime_expires;
 	s64 runtime_remaining;
 
+	u64 throttled_timestamp;
 	int throttled, throttle_count;
 	struct list_head throttled_list;
 #endif
@@ -9397,6 +9401,19 @@ static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
 
 	return ret;
 }
+
+static int cpu_stats_show(struct cgroup *cgrp, struct cftype *cft,
+		struct cgroup_map_cb *cb)
+{
+	struct task_group *tg = cgroup_tg(cgrp);
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+
+	cb->fill(cb, "nr_periods", cfs_b->nr_periods);
+	cb->fill(cb, "nr_throttled", cfs_b->nr_throttled);
+	cb->fill(cb, "throttled_time", cfs_b->throttled_time);
+
+	return 0;
+}
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -9443,6 +9460,10 @@ static struct cftype cpu_files[] = {
 		.read_u64 = cpu_cfs_period_read_u64,
 		.write_u64 = cpu_cfs_period_write_u64,
 	},
+	{
+		.name = "stat",
+		.read_map = cpu_stats_show,
+	},
 #endif
 #ifdef CONFIG_RT_GROUP_SCHED
 	{
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index f9f671a..d201f28 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1506,6 +1506,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 		rq->nr_running -= task_delta;
 
 	cfs_rq->throttled = 1;
+	cfs_rq->throttled_timestamp = rq->clock;
 	raw_spin_lock(&cfs_b->lock);
 	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
 	raw_spin_unlock(&cfs_b->lock);
@@ -1523,8 +1524,10 @@ static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 
 	cfs_rq->throttled = 0;
 	raw_spin_lock(&cfs_b->lock);
+	cfs_b->throttled_time += rq->clock - cfs_rq->throttled_timestamp;
 	list_del_rcu(&cfs_rq->throttled_list);
 	raw_spin_unlock(&cfs_b->lock);
+	cfs_rq->throttled_timestamp = 0;
 
 	update_rq_clock(rq);
 	/* update hierarchical throttle state */
@@ -1612,6 +1615,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 	throttled = !list_empty(&cfs_b->throttled_cfs_rq);
 	/* idle depends on !throttled (for the case of a large deficit) */
 	idle = cfs_b->idle && !throttled;
+	cfs_b->nr_periods += overrun;
 
 	/* if we're going inactive then everything else can be deferred */
 	if (idle)
@@ -1625,6 +1629,9 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 		goto out_unlock;
 	}
 
+	/* account preceding periods in which throttling occurred */
+	cfs_b->nr_throttled += overrun;
+
 	/*
 	 * There are throttled entities so we must first use the new bandwidth
 	 * to unthrottle them before making it generally available.  This


* [tip:sched/core] sched: Return unused runtime on group dequeue
  2011-07-21 16:43 ` [patch 16/18] sched: return unused runtime on group dequeue Paul Turner
@ 2011-08-14 16:39   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Paul Turner @ 2011-08-14 16:39 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, pjt, tglx, mingo

Commit-ID:  d8b4986d3dbc4fabc2054d63f1d31d6ed2fb1ca8
Gitweb:     http://git.kernel.org/tip/d8b4986d3dbc4fabc2054d63f1d31d6ed2fb1ca8
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:41 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:54 +0200

sched: Return unused runtime on group dequeue

When a local cfs_rq blocks we return the majority of its remaining quota to the
global bandwidth pool for use by other runqueues.

We do this only when the quota is current and there is more than
min_cfs_rq_runtime [1ms by default] of runtime remaining on the rq.

In the case where there are throttled runqueues and we have sufficient
bandwidth to meter out a slice, a second timer is kicked off to handle this
delivery, unthrottling where appropriate.

Using a 'worst case' antagonist which executes on each cpu
for 1ms before moving on to the next, on a fairly large machine:

no quota generations:

 197.47 ms       /cgroup/a/cpuacct.usage
 199.46 ms       /cgroup/a/cpuacct.usage
 205.46 ms       /cgroup/a/cpuacct.usage
 198.46 ms       /cgroup/a/cpuacct.usage
 208.39 ms       /cgroup/a/cpuacct.usage

Since we are allowed to use "stale" quota our usage is effectively bounded by
the rate of input into the global pool and performance is relatively stable.

with quota generations [1s increments]:

 119.58 ms       /cgroup/a/cpuacct.usage
 119.65 ms       /cgroup/a/cpuacct.usage
 119.64 ms       /cgroup/a/cpuacct.usage
 119.63 ms       /cgroup/a/cpuacct.usage
 119.60 ms       /cgroup/a/cpuacct.usage

The large deficit here is due to quota generations (/intentionally/) preventing
us from using previously stranded slack quota.  The cost is that this quota
becomes unavailable.

with quota generations and quota return:

 200.09 ms       /cgroup/a/cpuacct.usage
 200.09 ms       /cgroup/a/cpuacct.usage
 198.09 ms       /cgroup/a/cpuacct.usage
 200.09 ms       /cgroup/a/cpuacct.usage
 200.06 ms       /cgroup/a/cpuacct.usage

By returning unused quota we're able to both stably consume our desired quota
and prevent unintentional overages due to the abuse of slack quota from
previous quota periods (especially on a large machine).
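
A compressed standalone sketch of the return-side arithmetic (names and
constants are illustrative, mirroring the 1ms default added by this patch):
anything beyond min_cfs_rq_runtime is handed back to the global pool, but only
while the local runtime still belongs to the current quota generation.

#include <stdio.h>

#define NSEC_PER_MSEC	1000000LL

/* a cfs_rq won't donate quota below this amount (1ms by default) */
static const long long min_cfs_rq_runtime = 1 * NSEC_PER_MSEC;

/*
 * 'local' is the blocking runqueue's remaining runtime, 'pool' the global
 * bandwidth pool.  The expiration (generation) stamps must match for the
 * returned runtime to be valid.
 */
static void return_runtime(long long *local, unsigned long long local_expires,
			   long long *pool, unsigned long long pool_expires)
{
	long long slack = *local - min_cfs_rq_runtime;

	if (slack <= 0)
		return;

	if (local_expires == pool_expires)
		*pool += slack;

	/* even if it was stale and not returned, don't try again */
	*local -= slack;
}

int main(void)
{
	long long local = 4 * NSEC_PER_MSEC, pool = 10 * NSEC_PER_MSEC;

	return_runtime(&local, 100, &pool, 100);
	printf("local=%lld ns, pool=%lld ns\n", local, pool);
	return 0;
}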

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184758.306848658@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c      |   15 +++++++-
 kernel/sched_fair.c |  108 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 122 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 35c9185..6baade0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -259,7 +259,7 @@ struct cfs_bandwidth {
 	u64 runtime_expires;
 
 	int idle, timer_active;
-	struct hrtimer period_timer;
+	struct hrtimer period_timer, slack_timer;
 	struct list_head throttled_cfs_rq;
 
 	/* statistics */
@@ -421,6 +421,16 @@ static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
 
 static inline u64 default_cfs_period(void);
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
+static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b);
+
+static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
+{
+	struct cfs_bandwidth *cfs_b =
+		container_of(timer, struct cfs_bandwidth, slack_timer);
+	do_sched_cfs_slack_timer(cfs_b);
+
+	return HRTIMER_NORESTART;
+}
 
 static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
@@ -453,6 +463,8 @@ static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
+	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	cfs_b->slack_timer.function = sched_cfs_slack_timer;
 }
 
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
@@ -488,6 +500,7 @@ static void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
 	hrtimer_cancel(&cfs_b->period_timer);
+	hrtimer_cancel(&cfs_b->slack_timer);
 }
 #else
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index d201f28..1ca2cd4 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1052,6 +1052,8 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		__clear_buddies_skip(se);
 }
 
+static void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
+
 static void
 dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
@@ -1090,6 +1092,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (!(flags & DEQUEUE_SLEEP))
 		se->vruntime -= cfs_rq->min_vruntime;
 
+	/* return excess runtime on last dequeue */
+	return_cfs_rq_runtime(cfs_rq);
+
 	update_min_vruntime(cfs_rq);
 	update_cfs_shares(cfs_rq);
 }
@@ -1674,6 +1679,108 @@ out_unlock:
 	return idle;
 }
 
+/* a cfs_rq won't donate quota below this amount */
+static const u64 min_cfs_rq_runtime = 1 * NSEC_PER_MSEC;
+/* minimum remaining period time to redistribute slack quota */
+static const u64 min_bandwidth_expiration = 2 * NSEC_PER_MSEC;
+/* how long we wait to gather additional slack before distributing */
+static const u64 cfs_bandwidth_slack_period = 5 * NSEC_PER_MSEC;
+
+/* are we near the end of the current quota period? */
+static int runtime_refresh_within(struct cfs_bandwidth *cfs_b, u64 min_expire)
+{
+	struct hrtimer *refresh_timer = &cfs_b->period_timer;
+	u64 remaining;
+
+	/* if the call-back is running a quota refresh is already occurring */
+	if (hrtimer_callback_running(refresh_timer))
+		return 1;
+
+	/* is a quota refresh about to occur? */
+	remaining = ktime_to_ns(hrtimer_expires_remaining(refresh_timer));
+	if (remaining < min_expire)
+		return 1;
+
+	return 0;
+}
+
+static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+	u64 min_left = cfs_bandwidth_slack_period + min_bandwidth_expiration;
+
+	/* if there's a quota refresh soon don't bother with slack */
+	if (runtime_refresh_within(cfs_b, min_left))
+		return;
+
+	start_bandwidth_timer(&cfs_b->slack_timer,
+				ns_to_ktime(cfs_bandwidth_slack_period));
+}
+
+/* we know any runtime found here is valid as update_curr() precedes return */
+static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	s64 slack_runtime = cfs_rq->runtime_remaining - min_cfs_rq_runtime;
+
+	if (slack_runtime <= 0)
+		return;
+
+	raw_spin_lock(&cfs_b->lock);
+	if (cfs_b->quota != RUNTIME_INF &&
+	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+		cfs_b->runtime += slack_runtime;
+
+		/* we are under rq->lock, defer unthrottling using a timer */
+		if (cfs_b->runtime > sched_cfs_bandwidth_slice() &&
+		    !list_empty(&cfs_b->throttled_cfs_rq))
+			start_cfs_slack_bandwidth(cfs_b);
+	}
+	raw_spin_unlock(&cfs_b->lock);
+
+	/* even if it's not valid for return we don't want to try again */
+	cfs_rq->runtime_remaining -= slack_runtime;
+}
+
+static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	if (!cfs_rq->runtime_enabled || !cfs_rq->nr_running)
+		return;
+
+	__return_cfs_rq_runtime(cfs_rq);
+}
+
+/*
+ * This is done with a timer (instead of inline with bandwidth return) since
+ * it's necessary to juggle rq->locks to unthrottle their respective cfs_rqs.
+ */
+static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
+{
+	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
+	u64 expires;
+
+	/* confirm we're still not at a refresh boundary */
+	if (runtime_refresh_within(cfs_b, min_bandwidth_expiration))
+		return;
+
+	raw_spin_lock(&cfs_b->lock);
+	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice) {
+		runtime = cfs_b->runtime;
+		cfs_b->runtime = 0;
+	}
+	expires = cfs_b->runtime_expires;
+	raw_spin_unlock(&cfs_b->lock);
+
+	if (!runtime)
+		return;
+
+	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+
+	raw_spin_lock(&cfs_b->lock);
+	if (expires == cfs_b->runtime_expires)
+		cfs_b->runtime = runtime;
+	raw_spin_unlock(&cfs_b->lock);
+}
+
 /*
  * When a group wakes up we want to make sure that its quota is not already
  * expired/exceeded, otherwise it may be allowed to steal additional ticks of
@@ -1715,6 +1822,7 @@ static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 				     unsigned long delta_exec) {}
 static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
+static void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {


* [tip:sched/core] sched: Add documentation for bandwidth control
  2011-07-21 16:43 ` [patch 18/18] sched: add documentation for bandwidth control Paul Turner
@ 2011-08-14 16:41   ` tip-bot for Bharata B Rao
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Bharata B Rao @ 2011-08-14 16:41 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, bharata, a.p.zijlstra, pjt, tglx, mingo

Commit-ID:  88ebc08ea9f721d1345d5414288a308ea42ac458
Gitweb:     http://git.kernel.org/tip/88ebc08ea9f721d1345d5414288a308ea42ac458
Author:     Bharata B Rao <bharata@linux.vnet.ibm.com>
AuthorDate: Thu, 21 Jul 2011 09:43:43 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:58 +0200

sched: Add documentation for bandwidth control

Basic description of usage and effect for CFS Bandwidth Control.

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184758.498036116@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 Documentation/scheduler/sched-bwc.txt |  122 +++++++++++++++++++++++++++++++++
 1 files changed, 122 insertions(+), 0 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
new file mode 100644
index 0000000..f6b1873
--- /dev/null
+++ b/Documentation/scheduler/sched-bwc.txt
@@ -0,0 +1,122 @@
+CFS Bandwidth Control
+=====================
+
+[ This document only discusses CPU bandwidth control for SCHED_NORMAL.
+  The SCHED_RT case is covered in Documentation/scheduler/sched-rt-group.txt ]
+
+CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
+specification of the maximum CPU bandwidth available to a group or hierarchy.
+
+The bandwidth allowed for a group is specified using a quota and period. Within
+each given "period" (microseconds), a group is allowed to consume only up to
+"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
+group exceeds this limit (for that period), the tasks belonging to its
+hierarchy will be throttled and are not allowed to run again until the next
+period.
+
+A group's unused runtime is globally tracked and refreshed back to the "quota"
+specified above at each period boundary.  As threads consume this bandwidth it
+is transferred to cpu-local "silos" on demand.  The amount transferred within
+each of these updates is tunable and described as the "slice".
+
+Management
+----------
+Quota and period are managed within the cpu subsystem via cgroupfs.
+
+cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
+cpu.cfs_period_us: the length of a period (in microseconds)
+cpu.stat: exports throttling statistics [explained further below]
+
+The default values are:
+	cpu.cfs_period_us=100ms
+	cpu.cfs_quota_us=-1
+
+A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
+bandwidth restriction in place; such a group is described as an unconstrained
+bandwidth group.  This represents the traditional work-conserving behavior for
+CFS.
+
+Writing any (valid) positive value(s) will enact the specified bandwidth limit.
+The minimum allowed for either the quota or the period is 1ms.  There is also
+an upper bound on the period length of 1s.  Additional restrictions exist when
+bandwidth limits are used in a hierarchical fashion; these are explained in
+more detail below.
+
+Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
+and return the group to an unconstrained state once more.
+
+Any updates to a group's bandwidth specification will result in it becoming
+unthrottled if it is in a constrained state.
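+
+For example, a minimal sketch of enacting a limit of half a CPU and then
+removing it again (run from the group's directory under the cpu controller
+mount; values illustrative):
+
+	# echo 100000 > cpu.cfs_period_us /* period = 100ms */
+	# echo 50000 > cpu.cfs_quota_us   /* quota = 50ms, i.e. half a CPU */
+	# echo -1 > cpu.cfs_quota_us      /* remove the limit again */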
+
+System wide settings
+--------------------
+For efficiency, run-time is transferred between the global pool and CPU-local
+"silos" in a batch fashion.  This greatly reduces global accounting pressure
+on large systems.  The amount transferred each time such an update is required
+is described as the "slice".
+
+This is tunable via procfs:
+	/proc/sys/kernel/sched_cfs_bandwidth_slice_us (default=5ms)
+
+Larger slice values will reduce transfer overheads, while smaller values allow
+for more fine-grained consumption.
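+
+For example, to raise the slice from the default 5ms to 10ms (an illustrative
+value, not a recommendation):
+
+	# echo 10000 > /proc/sys/kernel/sched_cfs_bandwidth_slice_us /* slice = 10ms */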
+
+Statistics
+----------
+A group's bandwidth statistics are exported via 3 fields in cpu.stat.
+
+cpu.stat:
+- nr_periods: Number of enforcement intervals that have elapsed.
+- nr_throttled: Number of times the group has been throttled/limited.
+- throttled_time: The total time duration (in nanoseconds) for which entities
+  of the group have been throttled.
+
+This interface is read-only.
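+
+For example (hypothetical mount point and group name, illustrative values):
+
+	# cat /cgroup/cpu/test/cpu.stat
+	nr_periods 40
+	nr_throttled 3
+	throttled_time 15000000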
+
+Hierarchical considerations
+---------------------------
+The interface enforces that an individual entity's bandwidth is always
+attainable, that is: max(c_i) <= C. However, over-subscription in the
+aggregate case is explicitly allowed to enable work-conserving semantics
+within a hierarchy.
+  e.g. \Sum (c_i) may exceed C
+[ Where C is the parent's bandwidth, and c_i the bandwidth of its children ]
+
+
+There are two ways in which a group may become throttled:
+	a. it fully consumes its own quota within a period
+	b. a parent's quota is fully consumed within its period
+
+In case b) above, even though the child may have runtime remaining, it will
+not be allowed to run until the parent's runtime is refreshed.
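+
+For example, the following sketch (group names and values illustrative)
+over-subscribes a parent capped at 1 CPU with two children that may each also
+use up to 1 CPU; either child alone can consume the parent's full bandwidth,
+and once the parent's quota is exhausted both children are throttled as in
+case b) above:
+
+	# echo 250000 > parent/cpu.cfs_period_us        /* period = 250ms */
+	# echo 250000 > parent/cpu.cfs_quota_us         /* quota = 250ms, 1 CPU */
+	# echo 250000 > parent/child1/cpu.cfs_period_us
+	# echo 250000 > parent/child1/cpu.cfs_quota_us  /* c_1 = C */
+	# echo 250000 > parent/child2/cpu.cfs_period_us
+	# echo 250000 > parent/child2/cpu.cfs_quota_us  /* c_2 = C, so \Sum c_i > C */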
+
+Examples
+--------
+1. Limit a group to 1 CPU worth of runtime.
+
+	If period is 250ms and quota is also 250ms, the group will get
+	1 CPU worth of runtime every 250ms.
+
+	# echo 250000 > cpu.cfs_quota_us /* quota = 250ms */
+	# echo 250000 > cpu.cfs_period_us /* period = 250ms */
+
+2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine.
+
+	With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
+	runtime every 500ms.
+
+	# echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
+	# echo 500000 > cpu.cfs_period_us /* period = 500ms */
+
+	The larger period here allows for increased burst capacity.
+
+3. Limit a group to 20% of 1 CPU.
+
+	With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU.
+
+	# echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
+	# echo 50000 > cpu.cfs_period_us /* period = 50ms */
+
+	By using a small period here we are ensuring a consistent latency
+	response at the expense of burst capacity.
+


* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (19 preceding siblings ...)
  2011-07-25 14:58 ` Peter Zijlstra
@ 2011-09-13 12:10 ` Vladimir Davydov
  2011-09-13 14:00   ` Peter Zijlstra
  2011-09-16  8:06   ` Paul Turner
  20 siblings, 2 replies; 60+ messages in thread
From: Vladimir Davydov @ 2011-09-13 12:10 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelianov,
	Jason Baron

Hello, Paul

I have a question about CFS bandwidth control.

Let's consider a cgroup with several (>1) tasks running on a two CPU
host. Let the limit of the cgroup be 50% (e.g. period=1s, quota=0.5s).
How will tasks of the cgroup be distributed between the two CPUs? Will
they all run on one of the CPUs, or will one half of them run on one CPU
and others run on the other?

Although in both cases the tasks will consume not more than one half of
overall CPU time, the first case (all tasks of the cgroup run on the
same CPU) is obviously better if the tasks are likely to communicate
with each other (e.g. through pipe) which is often the case when cgroups
are used for container virtualization.

In other words, I'd like to know if your code (or the scheduler code)
tries to gather all tasks of the same cgroup on such a subset of all
CPUs so that the tasks can't execute less CPUs without losing quota
during each period. And if not, are you going to address the issue?

On Thu, 2011-07-21 at 20:43 +0400, Paul Turner wrote:
> Hi all,
> 
> Please find attached the incremental v7.2 for bandwidth control.
> 
> This release follows a fairly intensive period of scraping cycles across
> various configurations.  Unfortunately we seem to be currently taking an IPC
> hit for jump_labels (despite a savings in branches/instr. ret) which despite
> fairly extensive digging I don't have a good explanation for.  The emitted
> assembly /looks/ ok, but cycles/wall time is consistently higher across several
> platforms.
> 
> As such I've demoted the jumppatch to [RFT] while these details are worked
> out.  But there's no point in holding up the rest of the series any more.
> 
> [ Please find the specific discussion related to the above attached to patch 
> 17/18. ]
> 
> So -- without jump labels -- the current performance looks like:
> 
>                             instructions            cycles                  branches         
> ---------------------------------------------------------------------------------------------
> clovertown [!BWC]           843695716               965744453               151224759        
> +unconstrained              845934117 (+0.27)       974222228 (+0.88)       152715407 (+0.99)
> +10000000000/1000:          855102086 (+1.35)       978728348 (+1.34)       154495984 (+2.16)
> +10000000000/1000000:       853981660 (+1.22)       976344561 (+1.10)       154287243 (+2.03)
> 
> barcelona [!BWC]            810514902               761071312               145351489        
> +unconstrained              820573353 (+1.24)       748178486 (-1.69)       148161233 (+1.93)
> +10000000000/1000:          827963132 (+2.15)       757829815 (-0.43)       149611950 (+2.93)
> +10000000000/1000000:       827701516 (+2.12)       753575001 (-0.98)       149568284 (+2.90)
> 
> westmere [!BWC]             792513879               702882443               143267136        
> +unconstrained              802533191 (+1.26)       694415157 (-1.20)       146071233 (+1.96)
> +10000000000/1000:          809861594 (+2.19)       701781996 (-0.16)       147520953 (+2.97)
> +10000000000/1000000:       809752541 (+2.18)       705278419 (+0.34)       147502154 (+2.96)
> 
> Under the workload:
>   mkdir -p /cgroup/cpu/test
>   echo $$ > /dev/cgroup/cpu/test (only cpu,cpuacct mounted)
>   (W1) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "for ((i=0;i<5;i++)); do $(dirname $0)/pipe-test 20000; done"
> 
> This may seem a strange work-load but it works around some bizarro overheads
> currently introduced by perf.  Comparing for example with::w
>   (W2)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;true"
>   (W3)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;"
> 
> 
> We see: 
>  (W1)  westmere [!BWC]             792513879               702882443               143267136             0.197246943  
>  (W2)  westmere [!BWC]             912241728               772576786               165734252             0.214923134  
>  (W3)  westmere [!BWC]             904349725               882084726               162577399             0.748506065  
> 
> vs an 'ideal' total exec time of (approximately):
> $ time taskset -c 0 ./pipe-test 100000
>  real    0m0.198 user    0m0.007s ys     0m0.095s
> 
> The overhead in W2 is explained by that invoking pipe-test directly, one of
> the siblings is becoming the perf_ctx parent, invoking lots of pain every time
> we switch.  I do not have a reasonable explantion as to why (W1) is so much
> cheaper than (W2), I stumbled across it by accident when I was trying some
> combinations to reduce the <perf stat>-to-<perf stat> variance.
> 
> v7.2
> -----------
> - Build errors in !CGROUP_SCHED case fixed
> - !CONFIG_SMP now 'supported' (#ifdef munging)
> - gcc was failing to inline account_cfs_rq_runtime, affecting performance
> - checks in expire_cfs_rq_runtime() and check_enqueue_throttle() re-organized
>   to save branches.
> - jump labels introduced in the case BWC is not being used system-wide to
>   reduce inert overhead.
> - branch saved in expiring runtime (reorganize conditonals)
> 
> Hidetoshi, the following patchsets have changed enough to necessitate tweaking
> of your Reviewed-by:
> [patch 09/18] sched: add support for unthrottling group entities (extensive)
> [patch 11/18] sched: prevent interactions with throttled entities (update_cfs_shares)
> [patch 12/18] sched: prevent buddy interactions with throttled entities (new)
> 
> 
> Previous postings:
> -----------------
> v7.1: https://lkml.org/lkml/2011/7/7/24
> v7: http://lkml.org/lkml/2011/6/21/43
> v6: http://lkml.org/lkml/2011/5/7/37
> v5: http://lkml.org/lkml/2011/3 /22/477
> v4: http://lkml.org/lkml/2011/2/23/44
> v3: http://lkml.org/lkml/2010/10/12/44
> v2: http://lkml.org/lkml/2010/4/28/88
> Original posting: http://lkml.org/lkml/2010/2/12/393
> 
> Prior approaches: http://lkml.org/lkml/2010/1/5/44 ["CFS Hard limits v5"]
> 
> Thanks,
> 
> - Paul
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/



* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-09-13 12:10 ` Vladimir Davydov
@ 2011-09-13 14:00   ` Peter Zijlstra
  2011-09-16  8:06   ` Paul Turner
  1 sibling, 0 replies; 60+ messages in thread
From: Peter Zijlstra @ 2011-09-13 14:00 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelianov,
	Jason Baron

On Tue, 2011-09-13 at 16:10 +0400, Vladimir Davydov wrote:

> In other words, I'd like to know if your code (or the scheduler code)
> tries to gather all tasks of the same cgroup on such a subset of all
> CPUs 

No

> so that the tasks can't execute less CPUs without losing quota
> during each period. 

what?!

> And if not, are you going to address the issue? 

and no.

There is nothing special about being part of a cgroup that warrants
that; as for pipes, the scheduler already tries to pull the waking task
to the cpu of the waker if possible (or an idle cache sibling).




* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-09-13 12:10 ` Vladimir Davydov
  2011-09-13 14:00   ` Peter Zijlstra
@ 2011-09-16  8:06   ` Paul Turner
  2011-09-19  8:22     ` Vladimir Davydov
  1 sibling, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-09-16  8:06 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelianov,
	Jason Baron

Hi Vladimir,

I had a fairly good conversation with Pavel at LPC regarding these
questions; it's probably worth syncing up with him and then following up
if you still have questions.


On 09/13/11 05:10, Vladimir Davydov wrote:
> Hello, Paul
>
> I have a question about CFS bandwidth control.
>
> Let's consider a cgroup with several (>1) tasks running on a two CPU
> host. Let the limit of the cgroup be 50% (e.g. period=1s, quota=0.5s).
> How will tasks of the cgroup be distributed between the two CPUs? Will
> they all run on one of the CPUs, or will one half of them run on one CPU
> and others run on the other?
>

Parallelism is unconstrained until the bandwidth limit is reached, at
which point the whole group is throttled (effectively CONFIG_NR_CPUS=0)
until the next period refresh.

> Although in both cases the tasks will consume not more than one half of
> overall CPU time, the first case (all tasks of the cgroup run on the
> same CPU) is obviously better if the tasks are likely to communicate
> with each other (e.g. through pipe) which is often the case when cgroups
> are used for container virtualization.
>

This case is handled already by the affine wake-up path.

> In other words, I'd like to know if your code (or the scheduler code)
> tries to gather all tasks of the same cgroup on such a subset of all
> CPUs so that the tasks can't execute less CPUs without losing quota
> during each period. And if not, are you going to address the issue?
>

Parallelism != Bandwidth; no plans at this time.

Thanks!

- Paul


* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-09-16  8:06   ` Paul Turner
@ 2011-09-19  8:22     ` Vladimir Davydov
  2011-09-19  8:33       ` Peter Zijlstra
  0 siblings, 1 reply; 60+ messages in thread
From: Vladimir Davydov @ 2011-09-19  8:22 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelianov,
	Jason Baron

On Sep 16, 2011, at 12:06 PM, Paul Turner wrote:

>> Although in both cases the tasks will consume not more than one half of
>> overall CPU time, the first case (all tasks of the cgroup run on the
>> same CPU) is obviously better if the tasks are likely to communicate
>> with each other (e.g. through pipe) which is often the case when cgroups
>> are used for container virtualization.
>> 
> 
> This case is handled already by the affine wake-up path.

But communicating tasks do not necessarily wake each other even if they exchange data through the pipe. And of course, if they use shared memory (e.g. threads), it is not obligatory at all. Also, the wake-affine path is cpu-load aware, i.e. it tries not to overload a cpu it is going to wake a task on. For instance, if we run a context switch test on an idle host, the two tasks will be executing on different cpus although it is better to execute them together on the same cpu.

What I want to say is that sometimes it can be beneficial to constrain the parallelism of a container. And if we are going to limit a container's cpu usage to, for example, one cpu, I guess it is better to make its tasks run on the same cpu instead of spreading them across the system because:

1) It can improve cpu cache utilization.

2) It can reduce the overhead of CFS bandwidth control: contention on the cgroup's quota pool will obviously be lower and the number of throttlings/unthrottlings will be diminished (in the example above, with the limit equal to one cpu, we can forget about it altogether).

3) It can improve the latency of a cgroup whose cpu usage is limited. Consider a cgroup with one interactive and several cpu-bound tasks. Let the limit of the cgroup be 1 cpu (the cgroup should not consume more cpu power than it would if it ran on a UP host). If the tasks run on all the cpus of the host (provided the host is SMP), the cpu-hogs will soon consume all the quota, and the cgroup will be throttled till the end of the period, when the quota is recharged. The interactive task will be throttled too, so the cgroup's latency falls dramatically. However, if all the tasks run on the same cpu, the cgroup is never throttled, and the interactive task easily preempts the cpu-hogs whenever it wants.
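
A rough way to observe the throttling side of this with the current interface
(a sketch only; the mount point, group name and values are illustrative) is to
cap a group at 1 CPU on an SMP host, start several cpu-hogs in it, and watch
nr_throttled grow in cpu.stat:

	# echo 100000 > /cgroup/cpu/test/cpu.cfs_period_us   # 100ms period
	# echo 100000 > /cgroup/cpu/test/cpu.cfs_quota_us    # 1 CPU worth of quota
	# echo $$ > /cgroup/cpu/test/tasks
	# for i in 1 2 3 4; do (while :; do :; done) & done   # cpu-bound children
	# grep nr_throttled /cgroup/cpu/test/cpu.stat         # grows while over limit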

> 
>> In other words, I'd like to know if your code (or the scheduler code)
>> tries to gather all tasks of the same cgroup on such a subset of all
>> CPUs so that the tasks can't execute less CPUs without losing quota
>> during each period. And if not, are you going to address the issue?
>> 
> 
> Parallelism != Bandwidth
> 


I agree.

Nevertheless, theoretically the former can be implemented on top of the latter. This is exactly what we've done in the latest OpenVZ kernel where limiting the number of cpus a container can run on to N is equivalent to setting its limit to N*max-per-cpu-limit.
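
In cfs terms that equivalence is roughly the following (a sketch only, values
illustrative): for a container that should get N CPUs worth of time,

	# period_us=100000                                   # 100ms period
	# n_cpus=2                                           # desired N
	# echo $period_us > cpu.cfs_period_us
	# echo $((n_cpus * period_us)) > cpu.cfs_quota_us    # quota = N * per-cpu limit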

> no plans at this time.


It's a pity :(



* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-09-19  8:22     ` Vladimir Davydov
@ 2011-09-19  8:33       ` Peter Zijlstra
  0 siblings, 0 replies; 60+ messages in thread
From: Peter Zijlstra @ 2011-09-19  8:33 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelianov,
	Jason Baron

On Mon, 2011-09-19 at 12:22 +0400, Vladimir Davydov wrote:
> But communicating tasks do not necessarily wake each other even if
> they exchange data through the pipe. And of course, if they use shared
> memory (e.g. threads), it is not obligatory at all. Also, the
> wake-affine path is cpu-load aware, i.e. it tries not to overload a
> cpu it is going to wake a task on. For instance, if we run a context
> switch test on an idle host, the two tasks will be executing on
> different cpus although it is better to execute them together on the
> same cpu.

This is not a problem specific to cgroups, and thus the solution
shouldn't ever live as something related to cgroups.





Thread overview: 60+ messages
2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
2011-07-21 16:43 ` [patch 01/18] sched: (fixlet) dont update shares twice on on_rq parent Paul Turner
2011-07-22 11:06   ` Kamalesh Babulal
2011-07-21 16:43 ` [patch 02/18] sched: hierarchical task accounting for SCHED_OTHER Paul Turner
2011-08-14 16:15   ` [tip:sched/core] sched: Implement " tip-bot for Paul Turner
2011-07-21 16:43 ` [patch 03/18] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
2011-07-22 11:14   ` Kamalesh Babulal
2011-08-14 16:17   ` [tip:sched/core] sched: Introduce " tip-bot for Paul Turner
2011-07-21 16:43 ` [patch 04/18] sched: validate CFS quota hierarchies Paul Turner
2011-08-14 16:19   ` [tip:sched/core] sched: Validate " tip-bot for Paul Turner
2011-07-21 16:43 ` [patch 05/18] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth Paul Turner
2011-08-14 16:21   ` [tip:sched/core] sched: Accumulate " tip-bot for Paul Turner
2011-07-21 16:43 ` [patch 06/18] sched: add a timer to handle CFS bandwidth refresh Paul Turner
2011-08-14 16:23   ` [tip:sched/core] sched: Add " tip-bot for Paul Turner
2011-07-21 16:43 ` [patch 07/18] sched: expire invalid runtime Paul Turner
2011-08-14 16:24   ` [tip:sched/core] sched: Expire " tip-bot for Paul Turner
2011-07-21 16:43 ` [patch 08/18] sched: add support for throttling group entities Paul Turner
2011-08-08 15:46   ` Lin Ming
2011-08-08 16:00     ` Peter Zijlstra
2011-08-08 16:16       ` Paul Turner
2011-08-14 16:26   ` [tip:sched/core] sched: Add " tip-bot for Paul Turner
2011-07-21 16:43 ` [patch 09/18] sched: add support for unthrottling " Paul Turner
2011-08-14 16:27   ` [tip:sched/core] sched: Add " tip-bot for Paul Turner
2011-07-21 16:43 ` [patch 10/18] sched: allow for positional tg_tree walks Paul Turner
2011-08-14 16:29   ` [tip:sched/core] sched: Allow " tip-bot for Paul Turner
2011-07-21 16:43 ` [patch 11/18] sched: prevent interactions with throttled entities Paul Turner
2011-07-22 11:26   ` Kamalesh Babulal
2011-07-22 11:37     ` Peter Zijlstra
2011-07-22 11:41   ` Kamalesh Babulal
2011-07-22 11:43     ` Peter Zijlstra
2011-07-22 18:16       ` Kamalesh Babulal
2011-08-14 16:30   ` [tip:sched/core] sched: Prevent " tip-bot for Paul Turner
2011-07-21 16:43 ` [patch 12/18] sched: prevent buddy " Paul Turner
2011-08-14 16:32   ` [tip:sched/core] sched: Prevent " tip-bot for Paul Turner
2011-07-21 16:43 ` [patch 13/18] sched: migrate throttled tasks on HOTPLUG Paul Turner
2011-08-14 16:34   ` [tip:sched/core] sched: Migrate " tip-bot for Paul Turner
2011-07-21 16:43 ` [patch 14/18] sched: throttle entities exceeding their allowed bandwidth Paul Turner
2011-08-14 16:35   ` [tip:sched/core] sched: Throttle " tip-bot for Paul Turner
2011-07-21 16:43 ` [patch 15/18] sched: add exports tracking cfs bandwidth control statistics Paul Turner
2011-08-14 16:37   ` [tip:sched/core] sched: Add " tip-bot for Nikhil Rao
2011-07-21 16:43 ` [patch 16/18] sched: return unused runtime on group dequeue Paul Turner
2011-08-14 16:39   ` [tip:sched/core] sched: Return " tip-bot for Paul Turner
2011-07-21 16:43 ` [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive Paul Turner
2011-07-21 16:43 ` [patch 18/18] sched: add documentation for bandwidth control Paul Turner
2011-08-14 16:41   ` [tip:sched/core] sched: Add " tip-bot for Bharata B Rao
2011-07-21 23:01 ` [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
2011-07-25 14:58 ` Peter Zijlstra
2011-07-25 15:00   ` Peter Zijlstra
2011-07-25 16:21     ` Paul E. McKenney
2011-07-25 16:28       ` Peter Zijlstra
2011-07-25 16:46         ` Paul E. McKenney
2011-07-25 17:08           ` Peter Zijlstra
2011-07-25 17:11             ` Dhaval Giani
2011-07-25 17:35               ` Peter Zijlstra
2011-07-28  2:59     ` Paul Turner
2011-09-13 12:10 ` Vladimir Davydov
2011-09-13 14:00   ` Peter Zijlstra
2011-09-16  8:06   ` Paul Turner
2011-09-19  8:22     ` Vladimir Davydov
2011-09-19  8:33       ` Peter Zijlstra
