* [patch 00/15] CFS Bandwidth Control V5
@ 2011-03-23  3:03 Paul Turner
  2011-03-23  3:03 ` [patch 01/15] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
                   ` (18 more replies)
  0 siblings, 19 replies; 63+ messages in thread
From: Paul Turner @ 2011-03-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

Hi all,

Please find attached the latest version of bandwidth control for the normal
scheduling class.  This revision has undergone fairly extensive changes since
the previous version based largely on the observation that many of the edge
conditions requiring special casing around update_curr() were a result of
introducing side-effects into that operation.  By introducing an interstitial
state, in which we recognize that the runqueue is over its bandwidth but do
not mark it throttled until we can actually remove it from the CPU, we avoid
the previously possible interactions with throttled entities and eliminate
some head-scratching corner cases.

In particular I'd like to thank Peter Zijlstra who provided extensive comments 
and review for the last series.

Changes since v4:

New features:
- Bandwidth control now properly works with hotplug; throttled tasks are
  returned to the rq on cpu-offline so that they can be migrated.
- It is now validated that hierarchies are consistent with their resource
  reservations.  That is, the sum of a sub-hierarchy's bandwidth requirements
  will not exceed the bandwidth provisioned to the parent.  (This enforcement
  is optional and controlled by a sysctl.)
- Whether quota is 'current' is now tracked; this allows for the expiration
  of slack quota from prior scheduling periods as well as the return of quota
  by idling cpus.

Major:
- The atomicity of update_curr() is restored; it now only performs the
  accounting required for bandwidth control.  The act of checking whether
  quota has been exceeded is made explicit.  This avoids the corner cases
  previously required in enqueue/dequeue-entity.
- The act of throttling is now deferred until we reach put_prev_entity().
  This means that the transition to throttled is atomic and the special-case
  interactions with a running-but-throttled entity (in the case where we
  couldn't previously handle a resched immediately) are no longer needed.
- The correction for shares accounting during a throttled period has been
  extended to work for the children of a throttled run-queue.
- Throttled cfs_rqs are now explicitly tracked using a list; this avoids the
  need to revisit every cfs_rq on period expiration on large systems.


Minor:
- Hierarchical task accounting is no longer a separate hierarchy evaluation.
- (Buglet) nr_running accounting added to sched::stoptask
- (Buglet) Will no longer load balance the child hierarchies of a throttled
  entity.
- (Fixlet) don't process dequeued entities twice in dequeue_task_fair()
- walk_tg_tree refactored to allow for partial sub-tree evaluations.
- Dropped some #ifdefs
- Fixed some compile warnings with various CONFIG permutations
- Local bandwidth is now consumed "negatively"
- Quota slices now 5ms

Probably some others that I missed; there was a lot of refactoring and cleanup.

Interface:
----------
Three new cgroupfs files are exported by the cpu subsystem:
  cpu.cfs_period_us : period over which bandwidth is to be regulated
  cpu.cfs_quota_us  : bandwidth available for consumption per period
  cpu.stat          : statistics (such as number of throttled periods and
                      total throttled time)
One important interface change that this introduces (versus the rate limits
proposal) is that the defined bandwidth becomes an absolute quantifier.
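
As an illustrative example (numbers hypothetical): setting cpu.cfs_quota_us to
100000 with cpu.cfs_period_us at 250000 entitles a group to 100ms of cpu time
every 250ms, i.e. 40% of a single cpu, aggregated across all cpus.  Because
the quota is a single global pool, a quota larger than the period lets a group
consume more than one cpu's worth of time per period, and writing a negative
value to cpu.cfs_quota_us returns the group to the unconstrained
(RUNTIME_INF) state.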

Previous postings:
-----------------
v4:
https://lkml.org/lkml/2011/2/23/44
v3:
https://lkml.org/lkml/2010/10/12/44
v2:
http://lkml.org/lkml/2010/4/28/88
Original posting:
http://lkml.org/lkml/2010/2/12/393

Prior approaches:
http://lkml.org/lkml/2010/1/5/44 ["CFS Hard limits v5"]

Thanks,

- Paul





* [patch 01/15] sched: introduce primitives to account for CFS bandwidth tracking
  2011-03-23  3:03 [patch 00/15] CFS Bandwidth Control V5 Paul Turner
@ 2011-03-23  3:03 ` Paul Turner
  2011-03-24 12:38   ` Kamalesh Babulal
  2011-04-05 13:28   ` Peter Zijlstra
  2011-03-23  3:03 ` [patch 02/15] sched: validate CFS quota hierarchies Paul Turner
                   ` (17 subsequent siblings)
  18 siblings, 2 replies; 63+ messages in thread
From: Paul Turner @ 2011-03-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

[-- Attachment #1: sched-bwc-add_cfs_tg_bandwidth.patch --]
[-- Type: text/plain, Size: 13342 bytes --]

In this patch we introduce the notion of CFS bandwidth, partitioned into
globally unassigned bandwidth and locally claimed bandwidth.

- The global bandwidth is per task_group; it represents a pool of unclaimed
  bandwidth that cfs_rqs can allocate from.
- The local bandwidth is tracked per cfs_rq; it represents allotments from
  the global pool, assigned to a specific cpu.

Bandwidth is managed via cgroupfs, adding two new interfaces to the cpu subsystem:
- cpu.cfs_period_us : the bandwidth period in usecs
- cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be allowed
  to consume over the period above.

A per-cfs_bandwidth timer is also introduced to handle future refresh at
period expiration.  There's some minor refactoring here so that
start_bandwidth_timer() functionality can be shared.
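
As a rough illustration of driving this interface from userspace, the sketch
below caps a group at half a cpu (250ms of quota per 500ms period).  The
mount point and group name ("/sys/fs/cgroup/cpu" and "limited") are
assumptions about the local cgroup setup, not something this patch creates:

	/* sketch: cap the hypothetical group "limited" at half a cpu */
	#include <stdio.h>
	#include <stdlib.h>

	static void write_val(const char *path, long long val)
	{
		FILE *f = fopen(path, "w");

		if (!f) {
			perror(path);
			exit(1);
		}
		fprintf(f, "%lld\n", val);
		fclose(f);
	}

	int main(void)
	{
		/* period first, then quota; paths are assumed, adjust as needed */
		write_val("/sys/fs/cgroup/cpu/limited/cpu.cfs_period_us", 500000);
		write_val("/sys/fs/cgroup/cpu/limited/cpu.cfs_quota_us", 250000);
		return 0;
	}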

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 init/Kconfig        |    9 +
 kernel/sched.c      |  277 +++++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sched_fair.c |   30 +++++
 3 files changed, 294 insertions(+), 22 deletions(-)

Index: tip/init/Kconfig
===================================================================
--- tip.orig/init/Kconfig
+++ tip/init/Kconfig
@@ -720,6 +720,15 @@ config FAIR_GROUP_SCHED
 	depends on CGROUP_SCHED
 	default CGROUP_SCHED
 
+config CFS_BANDWIDTH
+	bool "CPU bandwidth provisioning for FAIR_GROUP_SCHED"
+	depends on EXPERIMENTAL
+	depends on FAIR_GROUP_SCHED
+	default n
+	help
+	  This option allows users to define quota and period for cpu
+	  bandwidth provisioning on a per-cgroup basis.
+
 config RT_GROUP_SCHED
 	bool "Group scheduling for SCHED_RR/FIFO"
 	depends on EXPERIMENTAL
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -194,10 +194,28 @@ static inline int rt_bandwidth_enabled(v
 	return sysctl_sched_rt_runtime >= 0;
 }
 
-static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
+static void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
 {
-	ktime_t now;
+	unsigned long delta;
+	ktime_t soft, hard, now;
+
+	for (;;) {
+		if (hrtimer_active(period_timer))
+			break;
 
+		now = hrtimer_cb_get_time(period_timer);
+		hrtimer_forward(period_timer, now, period);
+
+		soft = hrtimer_get_softexpires(period_timer);
+		hard = hrtimer_get_expires(period_timer);
+		delta = ktime_to_ns(ktime_sub(hard, soft));
+		__hrtimer_start_range_ns(period_timer, soft, delta,
+					 HRTIMER_MODE_ABS_PINNED, 0);
+	}
+}
+
+static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
+{
 	if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
 		return;
 
@@ -205,22 +223,7 @@ static void start_rt_bandwidth(struct rt
 		return;
 
 	raw_spin_lock(&rt_b->rt_runtime_lock);
-	for (;;) {
-		unsigned long delta;
-		ktime_t soft, hard;
-
-		if (hrtimer_active(&rt_b->rt_period_timer))
-			break;
-
-		now = hrtimer_cb_get_time(&rt_b->rt_period_timer);
-		hrtimer_forward(&rt_b->rt_period_timer, now, rt_b->rt_period);
-
-		soft = hrtimer_get_softexpires(&rt_b->rt_period_timer);
-		hard = hrtimer_get_expires(&rt_b->rt_period_timer);
-		delta = ktime_to_ns(ktime_sub(hard, soft));
-		__hrtimer_start_range_ns(&rt_b->rt_period_timer, soft, delta,
-				HRTIMER_MODE_ABS_PINNED, 0);
-	}
+	start_bandwidth_timer(&rt_b->rt_period_timer, rt_b->rt_period);
 	raw_spin_unlock(&rt_b->rt_runtime_lock);
 }
 
@@ -245,6 +248,15 @@ struct cfs_rq;
 
 static LIST_HEAD(task_groups);
 
+struct cfs_bandwidth {
+#ifdef CONFIG_CFS_BANDWIDTH
+	raw_spinlock_t lock;
+	ktime_t period;
+	u64 runtime, quota;
+	struct hrtimer period_timer;
+#endif
+};
+
 /* task group related information */
 struct task_group {
 	struct cgroup_subsys_state css;
@@ -276,6 +288,8 @@ struct task_group {
 #ifdef CONFIG_SCHED_AUTOGROUP
 	struct autogroup *autogroup;
 #endif
+
+	struct cfs_bandwidth cfs_bandwidth;
 };
 
 /* task_group_lock serializes the addition/removal of task groups */
@@ -370,9 +384,90 @@ struct cfs_rq {
 
 	unsigned long load_contribution;
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	int quota_enabled;
+	s64 quota_remaining;
+#endif
 #endif
 };
 
+static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
+{
+	return &tg->cfs_bandwidth;
+}
+
+#ifdef CONFIG_CFS_BANDWIDTH
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
+
+static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
+{
+	struct cfs_bandwidth *cfs_b =
+		container_of(timer, struct cfs_bandwidth, period_timer);
+	ktime_t now;
+	int overrun;
+	int idle = 0;
+
+	for (;;) {
+		now = hrtimer_cb_get_time(timer);
+		overrun = hrtimer_forward(timer, now, cfs_b->period);
+
+		if (!overrun)
+			break;
+
+		idle = do_sched_cfs_period_timer(cfs_b, overrun);
+	}
+
+	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
+}
+
+static
+void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, u64 quota, u64 period)
+{
+	raw_spin_lock_init(&cfs_b->lock);
+	cfs_b->quota = cfs_b->runtime = quota;
+	cfs_b->period = ns_to_ktime(period);
+
+	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	cfs_b->period_timer.function = sched_cfs_period_timer;
+}
+
+static void init_cfs_rq_quota(struct cfs_rq *cfs_rq)
+{
+	cfs_rq->quota_remaining = 0;
+	if (tg_cfs_bandwidth(cfs_rq->tg)->quota == RUNTIME_INF)
+		cfs_rq->quota_enabled = 0;
+	else
+		cfs_rq->quota_enabled = 1;
+}
+
+static void start_cfs_bandwidth(struct cfs_rq *cfs_rq)
+{
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+
+	if (cfs_b->quota == RUNTIME_INF)
+		return;
+
+	if (hrtimer_active(&cfs_b->period_timer))
+		return;
+
+	raw_spin_lock(&cfs_b->lock);
+	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
+	raw_spin_unlock(&cfs_b->lock);
+}
+
+static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+	hrtimer_cancel(&cfs_b->period_timer);
+}
+#else
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static void init_cfs_rq_quota(struct cfs_rq *cfs_rq) {}
+void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, u64 quota, u64 period) {}
+static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+static void start_cfs_bandwidth(struct cfs_rq *cfs_rq) {}
+#endif /* CONFIG_CFS_BANDWIDTH */
+
 /* Real-Time classes' related field in a runqueue: */
 struct rt_rq {
 	struct rt_prio_array active;
@@ -8048,6 +8143,7 @@ static void init_tg_cfs_entry(struct tas
 	tg->cfs_rq[cpu] = cfs_rq;
 	init_cfs_rq(cfs_rq, rq);
 	cfs_rq->tg = tg;
+	init_cfs_rq_quota(cfs_rq);
 
 	tg->se[cpu] = se;
 	/* se could be NULL for root_task_group */
@@ -8183,6 +8279,8 @@ void __init sched_init(void)
 		 * We achieve this by letting root_task_group's tasks sit
 		 * directly in rq->cfs (i.e root_task_group->se[] = NULL).
 		 */
+		init_cfs_bandwidth(&root_task_group.cfs_bandwidth,
+				RUNTIME_INF, default_cfs_period());
 		init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -8425,6 +8523,8 @@ static void free_fair_sched_group(struct
 {
 	int i;
 
+	destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
+
 	for_each_possible_cpu(i) {
 		if (tg->cfs_rq)
 			kfree(tg->cfs_rq[i]);
@@ -8453,6 +8553,9 @@ int alloc_fair_sched_group(struct task_g
 
 	tg->shares = NICE_0_LOAD;
 
+	init_cfs_bandwidth(tg_cfs_bandwidth(tg), RUNTIME_INF,
+			   default_cfs_period());
+
 	for_each_possible_cpu(i) {
 		rq = cpu_rq(i);
 
@@ -8832,7 +8935,7 @@ static int __rt_schedulable(struct task_
 	return walk_tg_tree(tg_schedulable, tg_nop, &data);
 }
 
-static int tg_set_bandwidth(struct task_group *tg,
+static int tg_set_rt_bandwidth(struct task_group *tg,
 		u64 rt_period, u64 rt_runtime)
 {
 	int i, err = 0;
@@ -8871,7 +8974,7 @@ int sched_group_set_rt_runtime(struct ta
 	if (rt_runtime_us < 0)
 		rt_runtime = RUNTIME_INF;
 
-	return tg_set_bandwidth(tg, rt_period, rt_runtime);
+	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
 
 long sched_group_rt_runtime(struct task_group *tg)
@@ -8896,7 +8999,7 @@ int sched_group_set_rt_period(struct tas
 	if (rt_period == 0)
 		return -EINVAL;
 
-	return tg_set_bandwidth(tg, rt_period, rt_runtime);
+	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
 
 long sched_group_rt_period(struct task_group *tg)
@@ -9118,6 +9221,125 @@ static u64 cpu_shares_read_u64(struct cg
 
 	return (u64) tg->shares;
 }
+
+#ifdef CONFIG_CFS_BANDWIDTH
+const u64 max_cfs_quota_period = 5 * NSEC_PER_SEC; /* 5s */
+const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
+
+static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
+{
+	int i;
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+	static DEFINE_MUTEX(mutex);
+
+	if (tg == &root_task_group)
+		return -EINVAL;
+
+	/*
+	 * Ensure we have some amount of bandwidth every period.  This is
+	 * to prevent reaching a state of large arrears when throttled via
+	 * entity_tick() resulting in prolonged exit starvation.
+	 */
+	if (quota < min_cfs_quota_period || period < min_cfs_quota_period)
+		return -EINVAL;
+
+	/*
+	 * Likewise, bound things on the other side by preventing insane quota
+	 * periods.  This also allows us to normalize in computing quota
+	 * feasibility.
+	 */
+	if (period > max_cfs_quota_period)
+		return -EINVAL;
+
+	mutex_lock(&mutex);
+	raw_spin_lock_irq(&cfs_b->lock);
+	cfs_b->period = ns_to_ktime(period);
+	cfs_b->runtime = cfs_b->quota = quota;
+	raw_spin_unlock_irq(&cfs_b->lock);
+
+	for_each_possible_cpu(i) {
+		struct cfs_rq *cfs_rq = tg->cfs_rq[i];
+		struct rq *rq = rq_of(cfs_rq);
+
+		raw_spin_lock_irq(&rq->lock);
+		init_cfs_rq_quota(cfs_rq);
+		raw_spin_unlock_irq(&rq->lock);
+	}
+	mutex_unlock(&mutex);
+
+	return 0;
+}
+
+int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
+{
+	u64 quota, period;
+
+	period = ktime_to_ns(tg_cfs_bandwidth(tg)->period);
+	if (cfs_quota_us < 0)
+		quota = RUNTIME_INF;
+	else
+		quota = (u64)cfs_quota_us * NSEC_PER_USEC;
+
+	return tg_set_cfs_bandwidth(tg, period, quota);
+}
+
+long tg_get_cfs_quota(struct task_group *tg)
+{
+	u64 quota_us;
+
+	if (tg_cfs_bandwidth(tg)->quota == RUNTIME_INF)
+		return -1;
+
+	quota_us = tg_cfs_bandwidth(tg)->quota;
+	do_div(quota_us, NSEC_PER_USEC);
+	return quota_us;
+}
+
+int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
+{
+	u64 quota, period;
+
+	period = (u64)cfs_period_us * NSEC_PER_USEC;
+	quota = tg_cfs_bandwidth(tg)->quota;
+
+	if (period <= 0)
+		return -EINVAL;
+
+	return tg_set_cfs_bandwidth(tg, period, quota);
+}
+
+long tg_get_cfs_period(struct task_group *tg)
+{
+	u64 cfs_period_us;
+
+	cfs_period_us = ktime_to_ns(tg_cfs_bandwidth(tg)->period);
+	do_div(cfs_period_us, NSEC_PER_USEC);
+	return cfs_period_us;
+}
+
+static s64 cpu_cfs_quota_read_s64(struct cgroup *cgrp, struct cftype *cft)
+{
+	return tg_get_cfs_quota(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_quota_write_s64(struct cgroup *cgrp, struct cftype *cftype,
+				s64 cfs_quota_us)
+{
+	return tg_set_cfs_quota(cgroup_tg(cgrp), cfs_quota_us);
+}
+
+static u64 cpu_cfs_period_read_u64(struct cgroup *cgrp, struct cftype *cft)
+{
+	return tg_get_cfs_period(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_period_write_u64(struct cgroup *cgrp, struct cftype *cftype,
+				u64 cfs_period_us)
+{
+	return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
+}
+
+#endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -9152,6 +9374,18 @@ static struct cftype cpu_files[] = {
 		.write_u64 = cpu_shares_write_u64,
 	},
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		.name = "cfs_quota_us",
+		.read_s64 = cpu_cfs_quota_read_s64,
+		.write_s64 = cpu_cfs_quota_write_s64,
+	},
+	{
+		.name = "cfs_period_us",
+		.read_u64 = cpu_cfs_period_read_u64,
+		.write_u64 = cpu_cfs_period_write_u64,
+	},
+#endif
 #ifdef CONFIG_RT_GROUP_SCHED
 	{
 		.name = "rt_runtime_us",
@@ -9461,4 +9695,3 @@ struct cgroup_subsys cpuacct_subsys = {
 	.subsys_id = cpuacct_subsys_id,
 };
 #endif	/* CONFIG_CGROUP_CPUACCT */
-
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -995,6 +995,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 
 	if (cfs_rq->nr_running == 1)
 		list_add_leaf_cfs_rq(cfs_rq);
+
+	start_cfs_bandwidth(cfs_rq);
 }
 
 static void __clear_buddies_last(struct sched_entity *se)
@@ -1212,6 +1214,8 @@ static void put_prev_entity(struct cfs_r
 		update_stats_wait_start(cfs_rq, prev);
 		/* Put 'current' back into the tree. */
 		__enqueue_entity(cfs_rq, prev);
+
+		start_cfs_bandwidth(cfs_rq);
 	}
 	cfs_rq->curr = NULL;
 }
@@ -1251,6 +1255,32 @@ entity_tick(struct cfs_rq *cfs_rq, struc
 }
 
 /**************************************************
+ * CFS bandwidth control machinery
+ */
+
+#ifdef CONFIG_CFS_BANDWIDTH
+/*
+ * default period for cfs group bandwidth.
+ * default: 0.5s, units: nanoseconds
+ */
+static inline u64 default_cfs_period(void)
+{
+	return 500000000ULL;
+}
+
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
+{
+	return 1;
+}
+#else
+static inline u64 default_cfs_period(void)
+{
+	return 0;
+}
+#endif
+
+
+/**************************************************
  * CFS operations on tasks:
  */
 




* [patch 02/15] sched: validate CFS quota hierarchies
  2011-03-23  3:03 [patch 00/15] CFS Bandwidth Control V5 Paul Turner
  2011-03-23  3:03 ` [patch 01/15] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
@ 2011-03-23  3:03 ` Paul Turner
  2011-03-23 10:39   ` torbenh
                     ` (3 more replies)
  2011-03-23  3:03 ` [patch 03/15] sched: accumulate per-cfs_rq cpu usage Paul Turner
                   ` (16 subsequent siblings)
  18 siblings, 4 replies; 63+ messages in thread
From: Paul Turner @ 2011-03-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-consistent_quota.patch --]
[-- Type: text/plain, Size: 7450 bytes --]

Add constraint validation for CFS bandwidth hierarchies.

It is checked that:
   sum(child bandwidth) <= parent_bandwidth

In a quota-limited hierarchy, an unconstrained entity
(e.g. bandwidth==RUNTIME_INF) inherits the bandwidth of its parent.

Since bandwidth periods may be non-uniform we normalize to the maximum allowed
period, 5 seconds.

This behavior may be disabled (allowing child bandwidth to exceed parent) via
kernel.sched_cfs_bandwidth_consistent=0
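
As an illustrative example (numbers hypothetical): with the consistency check
enabled, a parent configured at quota=200ms/period=500ms normalizes to 2s per
5s (the maximum period).  Two children at 150ms/500ms each would sum to 3s per
5s and the second child's configuration would be rejected with -EINVAL, while
a child left at quota=-1 simply inherits the parent's bandwidth for the
purposes of the check.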

Signed-off-by: Paul Turner <pjt@google.com>

---
 include/linux/sched.h |    8 +++
 kernel/sched.c        |  127 +++++++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched_fair.c   |    8 +++
 kernel/sysctl.c       |   11 ++++
 4 files changed, 147 insertions(+), 7 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -253,6 +253,7 @@ struct cfs_bandwidth {
 	raw_spinlock_t lock;
 	ktime_t period;
 	u64 runtime, quota;
+	s64 hierarchal_quota; /* used for validating consistency */
 	struct hrtimer period_timer;
 #endif
 };
@@ -8868,7 +8869,7 @@ struct rt_schedulable_data {
 	u64 rt_runtime;
 };
 
-static int tg_schedulable(struct task_group *tg, void *data)
+static int tg_rt_schedulable(struct task_group *tg, void *data)
 {
 	struct rt_schedulable_data *d = data;
 	struct task_group *child;
@@ -8932,7 +8933,7 @@ static int __rt_schedulable(struct task_
 		.rt_runtime = runtime,
 	};
 
-	return walk_tg_tree(tg_schedulable, tg_nop, &data);
+	return walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
 }
 
 static int tg_set_rt_bandwidth(struct task_group *tg,
@@ -9223,14 +9224,17 @@ static u64 cpu_shares_read_u64(struct cg
 }
 
 #ifdef CONFIG_CFS_BANDWIDTH
+static DEFINE_MUTEX(cfs_constraints_mutex);
+
 const u64 max_cfs_quota_period = 5 * NSEC_PER_SEC; /* 5s */
 const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
 
+static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
+
 static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 {
-	int i;
+	int i, ret = 0;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	static DEFINE_MUTEX(mutex);
 
 	if (tg == &root_task_group)
 		return -EINVAL;
@@ -9251,7 +9255,13 @@ static int tg_set_cfs_bandwidth(struct t
 	if (period > max_cfs_quota_period)
 		return -EINVAL;
 
-	mutex_lock(&mutex);
+	mutex_lock(&cfs_constraints_mutex);
+	if (sysctl_sched_cfs_bandwidth_consistent) {
+		ret = __cfs_schedulable(tg, period, quota);
+		if (ret)
+			goto out_unlock;
+	}
+
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->runtime = cfs_b->quota = quota;
@@ -9265,9 +9275,10 @@ static int tg_set_cfs_bandwidth(struct t
 		init_cfs_rq_quota(cfs_rq);
 		raw_spin_unlock_irq(&rq->lock);
 	}
-	mutex_unlock(&mutex);
+out_unlock:
+	mutex_unlock(&cfs_constraints_mutex);
 
-	return 0;
+	return ret;
 }
 
 int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
@@ -9339,6 +9350,108 @@ static int cpu_cfs_period_write_u64(stru
 	return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
 }
 
+
+struct cfs_schedulable_data {
+	struct task_group *tg;
+	u64 period, quota;
+};
+
+/*
+ * normalize group quota/period to be quota/max_period
+ * note: units are usecs
+ */
+static u64 normalize_cfs_quota(struct task_group *tg,
+		               struct cfs_schedulable_data *d)
+{
+	u64 quota, period;
+	struct load_weight lw;
+
+	if (tg == d->tg) {
+		period = d->period;
+		quota = d->quota;
+	} else {
+		period = tg_get_cfs_period(tg);
+		quota = tg_get_cfs_quota(tg);
+	}
+
+	if (quota == RUNTIME_INF)
+		return RUNTIME_INF;
+
+	lw.weight = period;
+	lw.inv_weight = 0;
+
+	return calc_delta_mine(quota, max_cfs_quota_period, &lw) - 1;
+}
+
+static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
+{
+	struct cfs_schedulable_data *d = data;
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+	s64 quota = 0, parent_quota = -1;
+
+	quota = normalize_cfs_quota(tg, d);
+	if (!tg->parent) {
+		quota = RUNTIME_INF;
+	} else {
+		struct cfs_bandwidth *parent_b = tg_cfs_bandwidth(tg->parent);
+
+		parent_quota = parent_b->hierarchal_quota;
+		if (parent_quota != RUNTIME_INF) {
+			parent_quota -= quota;
+			/* invalid hierarchy, child bandwidth exceeds parent */
+			if (parent_quota < 0)
+				return -EINVAL;
+		}
+
+		/* if no inherent limit then inherit parent quota */
+		if (quota == RUNTIME_INF)
+			quota = parent_quota;
+		parent_b->hierarchal_quota = parent_quota;
+	}
+	cfs_b->hierarchal_quota = quota;
+
+	return 0;
+}
+
+static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
+{
+	int ret;
+	struct cfs_schedulable_data data = {
+		.tg = tg,
+		.period = period / NSEC_PER_USEC,
+		.quota = quota / NSEC_PER_USEC,
+	};
+
+	if (!sysctl_sched_cfs_bandwidth_consistent)
+		return 0;
+
+	rcu_read_lock();
+	ret = walk_tg_tree(tg_cfs_schedulable_down, tg_nop,
+			    &data);
+	rcu_read_unlock();
+
+	return ret;
+}
+
+int sched_cfs_consistent_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos)
+{
+	int ret;
+
+	mutex_lock(&cfs_constraints_mutex);
+	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+
+	if (!ret && write && sysctl_sched_cfs_bandwidth_consistent) {
+		ret = __cfs_schedulable(NULL, 0, 0);
+
+		/* must be consistent to enable */
+		if (ret)
+			sysctl_sched_cfs_bandwidth_consistent = 0;
+	}
+	mutex_unlock(&cfs_constraints_mutex);
+	return ret;
+}
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
Index: tip/kernel/sysctl.c
===================================================================
--- tip.orig/kernel/sysctl.c
+++ tip/kernel/sysctl.c
@@ -361,6 +361,17 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= sched_rt_handler,
 	},
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		.procname	= "sched_cfs_bandwidth_consistent",
+		.data		= &sysctl_sched_cfs_bandwidth_consistent,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sched_cfs_consistent_handler,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+#endif
 #ifdef CONFIG_SCHED_AUTOGROUP
 	{
 		.procname	= "sched_autogroup_enabled",
Index: tip/include/linux/sched.h
===================================================================
--- tip.orig/include/linux/sched.h
+++ tip/include/linux/sched.h
@@ -1943,6 +1943,14 @@ int sched_rt_handler(struct ctl_table *t
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos);
 
+#ifdef CONFIG_CFS_BANDWIDTH
+extern unsigned int sysctl_sched_cfs_bandwidth_consistent;
+
+int sched_cfs_consistent_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos);
+#endif
+
 #ifdef CONFIG_SCHED_AUTOGROUP
 extern unsigned int sysctl_sched_autogroup_enabled;
 
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -88,6 +88,14 @@ const_debug unsigned int sysctl_sched_mi
  */
 unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
 
+#ifdef CONFIG_CFS_BANDWIDTH
+/*
+ * Whether a CFS bandwidth hierarchy is required to be consistent, that is:
+ *   sum(child_bandwidth) <= parent_bandwidth
+ */
+unsigned int sysctl_sched_cfs_bandwidth_consistent = 1;
+#endif
+
 static const struct sched_class fair_sched_class;
 
 /**************************************************************




* [patch 03/15] sched: accumulate per-cfs_rq cpu usage
  2011-03-23  3:03 [patch 00/15] CFS Bandwidth Control V5 Paul Turner
  2011-03-23  3:03 ` [patch 01/15] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
  2011-03-23  3:03 ` [patch 02/15] sched: validate CFS quota hierarchies Paul Turner
@ 2011-03-23  3:03 ` Paul Turner
  2011-04-05 13:28   ` Peter Zijlstra
  2011-04-05 13:28   ` Peter Zijlstra
  2011-03-23  3:03 ` [patch 04/15] sched: throttle cfs_rq entities which exceed their local quota Paul Turner
                   ` (15 subsequent siblings)
  18 siblings, 2 replies; 63+ messages in thread
From: Paul Turner @ 2011-03-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

[-- Attachment #1: sched-bwc-accumulate_cfs_rq_usage.patch --]
[-- Type: text/plain, Size: 4520 bytes --]

Introduce account_cfs_rq_quota() to account bandwidth usage at the cfs_rq
level against the task_group to which bandwidth has been assigned.  Whether a
cfs_rq participates in this accounting is tracked by cfs_rq->quota_enabled,
which is set when the owning task_group's quota is finite (not RUNTIME_INF).

For cfs_rqs that belong to a bandwidth-constrained task_group we introduce
request_cfs_rq_quota(), which attempts to allocate quota from the global pool
for local use.  Updates involving the global pool are currently protected
under cfs_bandwidth->lock; local pools are protected by rq->lock.

This patch only attempts to assign and track quota; no action is taken when
a cfs_rq exhausts its locally assigned quota.
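
For example (illustrative numbers): with the default bandwidth slice of 5ms,
a cfs_rq that exhausts its local pool pulls roughly one slice (plus any
arrears, since quota_remaining is allowed to go negative) from the global
pool under cfs_bandwidth->lock, so a group with a 20ms quota supports about
four such local refills per period before the global pool runs dry.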

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 include/linux/sched.h |    4 +++
 kernel/sched_fair.c   |   57 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sysctl.c       |    8 +++++++
 3 files changed, 69 insertions(+)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -96,6 +96,15 @@ unsigned int __read_mostly sysctl_sched_
 unsigned int sysctl_sched_cfs_bandwidth_consistent = 1;
 #endif
 
+#ifdef CONFIG_CFS_BANDWIDTH
+/*
+ * amount of quota to allocate from global tg to local cfs_rq pool on each
+ * refresh
+ * default: 5ms, units: microseconds
+  */
+unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
+#endif
+
 static const struct sched_class fair_sched_class;
 
 /**************************************************************
@@ -312,6 +321,8 @@ find_matching_se(struct sched_entity **s
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
+		                 unsigned long delta_exec);
 
 /**************************************************************
  * Scheduling class tree data structure manipulation methods:
@@ -605,6 +616,8 @@ static void update_curr(struct cfs_rq *c
 		cpuacct_charge(curtask, delta_exec);
 		account_group_exec_runtime(curtask, delta_exec);
 	}
+
+	account_cfs_rq_quota(cfs_rq, delta_exec);
 }
 
 static inline void
@@ -1276,6 +1289,47 @@ static inline u64 default_cfs_period(voi
 	return 500000000ULL;
 }
 
+static inline u64 sched_cfs_bandwidth_slice(void)
+{
+       return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
+}
+
+static void request_cfs_rq_quota(struct cfs_rq *cfs_rq)
+{
+	struct task_group *tg = cfs_rq->tg;
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+	u64 amount = 0, min_amount;
+
+	min_amount = sched_cfs_bandwidth_slice() + (-cfs_rq->quota_remaining);
+
+	if (cfs_b->runtime > 0 || cfs_b->quota == RUNTIME_INF) {
+		raw_spin_lock(&cfs_b->lock);
+		if (cfs_b->quota != RUNTIME_INF) {
+			amount = min(cfs_b->runtime, min_amount);
+			cfs_b->runtime -= amount;
+		} else {
+			amount = min_amount;
+		}
+		raw_spin_unlock(&cfs_b->lock);
+	}
+
+	cfs_rq->quota_remaining += amount;
+}
+
+static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
+		unsigned long delta_exec)
+{
+	if (!cfs_rq->quota_enabled)
+		return;
+
+	cfs_rq->quota_remaining -= delta_exec;
+
+	if (cfs_rq->quota_remaining > 0)
+		return;
+
+	request_cfs_rq_quota(cfs_rq);
+}
+
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 {
 	return 1;
@@ -1285,6 +1339,9 @@ static inline u64 default_cfs_period(voi
 {
 	return 0;
 }
+
+static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
+		unsigned long delta_exec) {}
 #endif
 
 
Index: tip/kernel/sysctl.c
===================================================================
--- tip.orig/kernel/sysctl.c
+++ tip/kernel/sysctl.c
@@ -371,6 +371,14 @@ static struct ctl_table kern_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one,
 	},
+	{
+		.procname	= "sched_cfs_bandwidth_slice_us",
+		.data		= &sysctl_sched_cfs_bandwidth_slice,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &one,
+	},
 #endif
 #ifdef CONFIG_SCHED_AUTOGROUP
 	{
Index: tip/include/linux/sched.h
===================================================================
--- tip.orig/include/linux/sched.h
+++ tip/include/linux/sched.h
@@ -1951,6 +1951,10 @@ int sched_cfs_consistent_handler(struct 
 		loff_t *ppos);
 #endif
 
+#ifdef CONFIG_CFS_BANDWIDTH
+extern unsigned int sysctl_sched_cfs_bandwidth_slice;
+#endif
+
 #ifdef CONFIG_SCHED_AUTOGROUP
 extern unsigned int sysctl_sched_autogroup_enabled;
 




* [patch 04/15] sched: throttle cfs_rq entities which exceed their local quota
  2011-03-23  3:03 [patch 00/15] CFS Bandwidth Control V5 Paul Turner
                   ` (2 preceding siblings ...)
  2011-03-23  3:03 ` [patch 03/15] sched: accumulate per-cfs_rq cpu usage Paul Turner
@ 2011-03-23  3:03 ` Paul Turner
  2011-03-23  5:09   ` Mike Galbraith
                     ` (2 more replies)
  2011-03-23  3:03 ` [patch 05/15] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh Paul Turner
                   ` (14 subsequent siblings)
  18 siblings, 3 replies; 63+ messages in thread
From: Paul Turner @ 2011-03-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

[-- Attachment #1: sched-bwc-throttle_entities.patch --]
[-- Type: text/plain, Size: 6744 bytes --]

In account_cfs_rq_quota() (via update_curr()) we track consumption versus a
cfs_rq's locally assigned quota and whether there is global quota available
to provide a refill when it runs out.

In the case that there is no quota remaining it's necessary to throttle so
that execution ceases until the subsequent period.  While it is at this
boundary that we detect (and signal for, via resched_task) that a throttle is
required, the actual operation is deferred until put_prev_entity().

At this point the cfs_rq is marked as throttled and not re-enqueued; this
avoids potential interactions with throttled runqueues in the event that we
are not immediately able to evict the running task.

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 kernel/sched.c      |    2 
 kernel/sched_fair.c |  117 +++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 113 insertions(+), 6 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -386,7 +386,7 @@ struct cfs_rq {
 	unsigned long load_contribution;
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
-	int quota_enabled;
+	int quota_enabled, throttled;
 	s64 quota_remaining;
 #endif
 #endif
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -321,9 +321,6 @@ find_matching_se(struct sched_entity **s
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
-static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
-		                 unsigned long delta_exec);
-
 /**************************************************************
  * Scheduling class tree data structure manipulation methods:
  */
@@ -588,6 +585,9 @@ __update_curr(struct cfs_rq *cfs_rq, str
 #endif
 }
 
+static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
+		unsigned long delta_exec);
+
 static void update_curr(struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *curr = cfs_rq->curr;
@@ -1221,6 +1221,9 @@ static struct sched_entity *pick_next_en
 	return se;
 }
 
+static void throttle_cfs_rq(struct cfs_rq *cfs_rq);
+static inline int within_bandwidth(struct cfs_rq *cfs_rq);
+
 static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 {
 	/*
@@ -1230,6 +1233,9 @@ static void put_prev_entity(struct cfs_r
 	if (prev->on_rq)
 		update_curr(cfs_rq);
 
+	if (!within_bandwidth(cfs_rq))
+		throttle_cfs_rq(cfs_rq);
+
 	check_spread(cfs_rq, prev);
 	if (prev->on_rq) {
 		update_stats_wait_start(cfs_rq, prev);
@@ -1241,6 +1247,8 @@ static void put_prev_entity(struct cfs_r
 	cfs_rq->curr = NULL;
 }
 
+static void check_cfs_rq_quota(struct cfs_rq *cfs_rq);
+
 static void
 entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 {
@@ -1249,6 +1257,9 @@ entity_tick(struct cfs_rq *cfs_rq, struc
 	 */
 	update_curr(cfs_rq);
 
+	/* check that entity's usage is still within quota (if enabled) */
+	check_cfs_rq_quota(cfs_rq);
+
 	/*
 	 * Update share accounting for long-running entities.
 	 */
@@ -1294,6 +1305,46 @@ static inline u64 sched_cfs_bandwidth_sl
        return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
 }
 
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->throttled;
+}
+
+static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
+{
+	struct task_group *tg;
+	struct sched_entity *se;
+
+	if (cfs_rq_throttled(cfs_rq))
+		return 1;
+
+	tg = cfs_rq->tg;
+	se = tg->se[cpu_of(rq_of(cfs_rq))];
+	if (!se)
+		return 0;
+
+	for_each_sched_entity(se) {
+		if (cfs_rq_throttled(cfs_rq_of(se)))
+			return 1;
+	}
+
+	return 0;
+}
+
+static inline int within_bandwidth(struct cfs_rq *cfs_rq)
+{
+	return !cfs_rq->quota_enabled || cfs_rq->quota_remaining > 0;
+}
+
+static void check_cfs_rq_quota(struct cfs_rq *cfs_rq)
+{
+	if (within_bandwidth(cfs_rq))
+		return;
+
+
+	resched_task(rq_of(cfs_rq)->curr);
+}
+
 static void request_cfs_rq_quota(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
@@ -1330,6 +1381,29 @@ static void account_cfs_rq_quota(struct 
 	request_cfs_rq_quota(cfs_rq);
 }
 
+static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	struct sched_entity *se;
+
+	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+
+	/* account load preceding throttle */
+	update_cfs_load(cfs_rq, 0);
+
+	for_each_sched_entity(se) {
+		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
+		/* throttled entity or throttle-on-deactivate */
+		if (!se->on_rq)
+			break;
+
+		dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
+		if (qcfs_rq->load.weight)
+			break;
+	}
+
+	cfs_rq->throttled = 1;
+}
+
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 {
 	return 1;
@@ -1340,6 +1414,23 @@ static inline u64 default_cfs_period(voi
 	return 0;
 }
 
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+	return 0;
+}
+
+static inline int within_bandwidth(struct cfs_rq *cfs_rq)
+{
+	return 1;
+}
+
+static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
+{
+	return 0;
+}
+
+static void check_cfs_rq_quota(struct cfs_rq *cfs_rq) {}
+static void throttle_cfs_rq(struct cfs_rq *cfs_rq) {}
 static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
 		unsigned long delta_exec) {}
 #endif
@@ -1421,6 +1512,12 @@ enqueue_task_fair(struct rq *rq, struct 
 			break;
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
+		/* end evaluation on throttled cfs_rq */
+		if (cfs_rq_throttled(cfs_rq)) {
+			se = NULL;
+			break;
+		}
+		check_cfs_rq_quota(cfs_rq);
 		flags = ENQUEUE_WAKEUP;
 	}
 
@@ -1447,10 +1544,15 @@ static void dequeue_task_fair(struct rq 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);
-
+		/* end evaluation on throttled cfs_rq */
+		if (cfs_rq_throttled(cfs_rq)) {
+			se = NULL;
+			break;
+		}
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight)
 			break;
+		check_cfs_rq_quota(cfs_rq);
 		flags |= DEQUEUE_SLEEP;
 	}
 
@@ -1955,6 +2057,10 @@ static void check_preempt_wakeup(struct 
 	if (unlikely(se == pse))
 		return;
 
+	/* avoid preemption check/buddy nomination for throttled tasks */
+	if (throttled_hierarchy(cfs_rq_of(pse)))
+		return;
+
 	if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK))
 		set_next_buddy(pse);
 
@@ -2076,7 +2182,8 @@ static bool yield_to_task_fair(struct rq
 {
 	struct sched_entity *se = &p->se;
 
-	if (!se->on_rq)
+	/* ensure entire hierarchy is on rq (e.g. running & not throttled) */
+	if (!se->on_rq || throttled_hierarchy(cfs_rq_of(se)))
 		return false;
 
 	/* Tell the scheduler that we'd really like pse to run next. */




* [patch 05/15] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
  2011-03-23  3:03 [patch 00/15] CFS Bandwidth Control V5 Paul Turner
                   ` (3 preceding siblings ...)
  2011-03-23  3:03 ` [patch 04/15] sched: throttle cfs_rq entities which exceed their local quota Paul Turner
@ 2011-03-23  3:03 ` Paul Turner
  2011-04-05 13:28   ` Peter Zijlstra
                     ` (2 more replies)
  2011-03-23  3:03 ` [patch 06/15] sched: allow for positional tg_tree walks Paul Turner
                   ` (13 subsequent siblings)
  18 siblings, 3 replies; 63+ messages in thread
From: Paul Turner @ 2011-03-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

[-- Attachment #1: sched-bwc-unthrottle_entities.patch --]
[-- Type: text/plain, Size: 6051 bytes --]

At the start of a new period there are several actions we must take: refresh
the global bandwidth pool and unthrottle any cfs_rq entities that previously
ran out of bandwidth (as quota permits).

Unthrottled entities have the cfs_rq->throttled flag cleared and are re-enqueued
into the cfs entity hierarchy.

sched_rt_period_mask() is refactored slightly into sched_bw_period_mask()
since it is now shared by both cfs and rt bandwidth period timers.

The !CONFIG_RT_GROUP_SCHED && CONFIG_SMP case has been collapsed to use
rd->span instead of cpu_online_mask since I think that was incorrect before
[we don't actually want to hit cpus outside of our root_domain for RT
bandwidth].

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 kernel/sched.c      |   18 +++++++++-
 kernel/sched_fair.c |   92 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched_rt.c   |   19 ----------
 3 files changed, 109 insertions(+), 20 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -252,7 +252,7 @@ struct cfs_bandwidth {
 #ifdef CONFIG_CFS_BANDWIDTH
 	raw_spinlock_t lock;
 	ktime_t period;
-	u64 runtime, quota;
+	u64 runtime, runtime_assigned, quota;
 	s64 hierarchal_quota; /* used for validating consistency */
 	struct hrtimer period_timer;
 #endif
@@ -1564,6 +1564,8 @@ static int tg_nop(struct task_group *tg,
 }
 #endif
 
+static inline const struct cpumask *sched_bw_period_mask(void);
+
 #ifdef CONFIG_SMP
 /* Used instead of source_load when we know the type == 0 */
 static unsigned long weighted_cpuload(const int cpu)
@@ -8514,6 +8516,18 @@ void set_curr_task(int cpu, struct task_
 
 #endif
 
+#ifdef CONFIG_SMP
+static inline const struct cpumask *sched_bw_period_mask(void)
+{
+	return cpu_rq(smp_processor_id())->rd->span;
+}
+#else
+static inline const struct cpumask *sched_bw_period_mask(void)
+{
+	return cpu_online_mask;
+}
+#endif
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static void free_fair_sched_group(struct task_group *tg)
 {
@@ -9268,6 +9282,8 @@ static int tg_set_cfs_bandwidth(struct t
 
 		raw_spin_lock_irq(&rq->lock);
 		init_cfs_rq_quota(cfs_rq);
+		if (cfs_rq_throttled(cfs_rq))
+			unthrottle_cfs_rq(cfs_rq);
 		raw_spin_unlock_irq(&rq->lock);
 	}
 out_unlock:
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1394,9 +1394,99 @@ static void throttle_cfs_rq(struct cfs_r
 	cfs_rq->throttled = 1;
 }
 
+static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	struct rq *rq = rq_of(cfs_rq);
+	struct sched_entity *se;
+
+	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+
+	update_rq_clock(rq);
+
+	cfs_rq->throttled = 0;
+	if (!cfs_rq->load.weight)
+		return;
+
+	for_each_sched_entity(se) {
+		if (se->on_rq)
+			break;
+
+		cfs_rq = cfs_rq_of(se);
+		enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+	}
+
+	/* determine whether we need to wake up potentially idle cpu */
+	if (rq->curr == rq->idle && rq->cfs.nr_running)
+		resched_task(rq->curr);
+}
+
+static inline struct task_group *cfs_bandwidth_tg(struct cfs_bandwidth *cfs_b)
+{
+	return container_of(cfs_b, struct task_group, cfs_bandwidth);
+}
+
+static u64 distribute_cfs_bandwidth(struct cfs_bandwidth *cfs_b, u64 runtime)
+{
+	int i;
+	u64 quota, remaining = runtime;
+	const struct cpumask *span;
+
+	rcu_read_lock();
+	span = sched_bw_period_mask();
+	for_each_cpu(i, span) {
+		struct rq *rq = cpu_rq(i);
+		struct cfs_rq *cfs_rq = cfs_bandwidth_tg(cfs_b)->cfs_rq[i];
+
+		raw_spin_lock(&rq->lock);
+		if (within_bandwidth(cfs_rq))
+			goto next;
+
+		quota = -cfs_rq->quota_remaining;
+		quota += sched_cfs_bandwidth_slice();
+		quota = min(quota, remaining);
+		remaining -= quota;
+
+		cfs_rq->quota_remaining += quota;
+		if (cfs_rq_throttled(cfs_rq) && cfs_rq->quota_remaining > 0)
+			unthrottle_cfs_rq(cfs_rq);
+
+next:
+		raw_spin_unlock(&rq->lock);
+
+		if (!remaining)
+			break;
+	}
+	rcu_read_unlock();
+
+	return remaining;
+}
+
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 {
-	return 1;
+	u64 runtime, runtime_assigned;
+	int idle;
+
+	raw_spin_lock(&cfs_b->lock);
+	runtime = cfs_b->quota;
+	idle = cfs_b->runtime == cfs_b->runtime_assigned;
+	raw_spin_unlock(&cfs_b->lock);
+
+	if (runtime == RUNTIME_INF)
+		return 1;
+
+	runtime *= overrun;
+	runtime_assigned = runtime;
+
+	runtime = distribute_cfs_bandwidth(cfs_b, runtime);
+
+	raw_spin_lock(&cfs_b->lock);
+	cfs_b->runtime = runtime;
+	cfs_b->runtime_assigned = runtime_assigned;
+	raw_spin_unlock(&cfs_b->lock);
+
+	return idle;
 }
 #else
 static inline u64 default_cfs_period(void)
Index: tip/kernel/sched_rt.c
===================================================================
--- tip.orig/kernel/sched_rt.c
+++ tip/kernel/sched_rt.c
@@ -253,18 +253,6 @@ static int rt_se_boosted(struct sched_rt
 	return p->prio != p->normal_prio;
 }
 
-#ifdef CONFIG_SMP
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
-	return cpu_rq(smp_processor_id())->rd->span;
-}
-#else
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
-	return cpu_online_mask;
-}
-#endif
-
 static inline
 struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
 {
@@ -322,11 +310,6 @@ static inline int rt_rq_throttled(struct
 	return rt_rq->rt_throttled;
 }
 
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
-	return cpu_online_mask;
-}
-
 static inline
 struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
 {
@@ -544,7 +527,7 @@ static int do_sched_rt_period_timer(stru
 	if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
 		return 1;
 
-	span = sched_rt_period_mask();
+	span = sched_bw_period_mask();
 	for_each_cpu(i, span) {
 		int enqueue = 0;
 		struct rt_rq *rt_rq = sched_rt_period_rt_rq(rt_b, i);




* [patch 06/15] sched: allow for positional tg_tree walks
  2011-03-23  3:03 [patch 00/15] CFS Bandwidth Control V5 Paul Turner
                   ` (4 preceding siblings ...)
  2011-03-23  3:03 ` [patch 05/15] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh Paul Turner
@ 2011-03-23  3:03 ` Paul Turner
  2011-03-23  3:03 ` [patch 07/15] sched: prevent interactions between throttled entities and load-balance Paul Turner
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 63+ messages in thread
From: Paul Turner @ 2011-03-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-refactor-walk_tg_tree.patch --]
[-- Type: text/plain, Size: 2015 bytes --]

Extend walk_tg_tree to accept a positional argument

static int walk_tg_tree_from(struct task_group *from,
			     tg_visitor down, tg_visitor up, void *data)

Existing semantics are preserved; the caller must hold rcu_read_lock() or a
sufficient analogue.
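
A minimal sketch of the intended calling pattern follows (tg_visit_down and
my_data are placeholder names for a caller-supplied visitor and its argument;
tg_nop is the existing no-op visitor):

	rcu_read_lock();
	ret = walk_tg_tree_from(tg, tg_visit_down, tg_nop, &my_data);
	rcu_read_unlock();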

Signed-off-by: Paul Turner <pjt@google.com>
---
 kernel/sched.c |   34 +++++++++++++++++++++++-----------
 1 file changed, 23 insertions(+), 11 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -1518,21 +1518,19 @@ static inline void dec_cpu_load(struct r
 #if (defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)) || defined(CONFIG_RT_GROUP_SCHED)
 typedef int (*tg_visitor)(struct task_group *, void *);
 
-/*
- * Iterate the full tree, calling @down when first entering a node and @up when
- * leaving it for the final time.
- */
-static int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)
+/* Iterate task_group tree rooted at *from */
+static int walk_tg_tree_from(struct task_group *from,
+			     tg_visitor down, tg_visitor up, void *data)
 {
 	struct task_group *parent, *child;
 	int ret;
 
-	rcu_read_lock();
-	parent = &root_task_group;
+	parent = from;
+
 down:
 	ret = (*down)(parent, data);
 	if (ret)
-		goto out_unlock;
+		goto out;
 	list_for_each_entry_rcu(child, &parent->children, siblings) {
 		parent = child;
 		goto down;
@@ -1541,14 +1539,28 @@ up:
 		continue;
 	}
 	ret = (*up)(parent, data);
-	if (ret)
-		goto out_unlock;
+	if (ret || parent == from)
+		goto out;
 
 	child = parent;
 	parent = parent->parent;
 	if (parent)
 		goto up;
-out_unlock:
+out:
+	return ret;
+}
+
+/*
+ * Iterate the full tree, calling @down when first entering a node and @up when
+ * leaving it for the final time.
+ */
+
+static inline int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)
+{
+	int ret;
+
+	rcu_read_lock();
+	ret = walk_tg_tree_from(&root_task_group, down, up, data);
 	rcu_read_unlock();
 
 	return ret;




* [patch 07/15] sched: prevent interactions between throttled entities and load-balance
  2011-03-23  3:03 [patch 00/15] CFS Bandwidth Control V5 Paul Turner
                   ` (5 preceding siblings ...)
  2011-03-23  3:03 ` [patch 06/15] sched: allow for positional tg_tree walks Paul Turner
@ 2011-03-23  3:03 ` Paul Turner
  2011-04-05 13:28   ` Peter Zijlstra
  2011-03-23  3:03 ` [patch 08/15] sched: migrate throttled tasks on HOTPLUG Paul Turner
                   ` (11 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Paul Turner @ 2011-03-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-throttled_shares.patch --]
[-- Type: text/plain, Size: 5032 bytes --]

From the perspective of load-balance and shares distribution, throttled
entities should be invisible.

However, both of these operations work on 'active' lists and are not
inherently aware of what group hierarchies may be present.  In some cases this
may be side-stepped (e.g. we could sideload via tg_load_down in load balance) 
while in others (e.g. update_shares()) it is more difficult to compute without
incurring some O(n**2) costs.

Instead, track hierarchical throttled state at the time of transition.  This
allows us to easily identify whether an entity belongs to a throttled
hierarchy and avoid incorrect interactions with it.

Also, when an entity leaves a throttled hierarchy we need to advance its
shares-averaging window so that the elapsed throttled time is not considered
part of the cfs_rq's operation.

Signed-off-by: Paul Turner <pjt@google.com>
---
 kernel/sched.c      |    2 -
 kernel/sched_fair.c |   78 +++++++++++++++++++++++++++++++++++++---------------
 2 files changed, 57 insertions(+), 23 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -739,13 +739,15 @@ static void update_cfs_rq_load_contribut
 	}
 }
 
+static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
+
 static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
 {
 	u64 period = sysctl_sched_shares_window;
 	u64 now, delta;
 	unsigned long load = cfs_rq->load.weight;
 
-	if (cfs_rq->tg == &root_task_group)
+	if (cfs_rq->tg == &root_task_group || throttled_hierarchy(cfs_rq))
 		return;
 
 	now = rq_of(cfs_rq)->clock_task;
@@ -1312,23 +1314,7 @@ static inline int cfs_rq_throttled(struc
 
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 {
-	struct task_group *tg;
-	struct sched_entity *se;
-
-	if (cfs_rq_throttled(cfs_rq))
-		return 1;
-
-	tg = cfs_rq->tg;
-	se = tg->se[cpu_of(rq_of(cfs_rq))];
-	if (!se)
-		return 0;
-
-	for_each_sched_entity(se) {
-		if (cfs_rq_throttled(cfs_rq_of(se)))
-			return 1;
-	}
-
-	return 0;
+	return cfs_rq->throttle_count > 0;
 }
 
 static inline int within_bandwidth(struct cfs_rq *cfs_rq)
@@ -1381,6 +1367,41 @@ static void account_cfs_rq_quota(struct 
 	request_cfs_rq_quota(cfs_rq);
 }
 
+struct tg_unthrottle_down_data {
+	int cpu;
+	u64 now;
+};
+
+static int tg_unthrottle_down(struct task_group *tg, void *data)
+{
+	struct tg_unthrottle_down_data *udd = data;
+	struct cfs_rq *cfs_rq = tg->cfs_rq[udd->cpu];
+	u64 delta;
+
+	cfs_rq->throttle_count--;
+	if (!cfs_rq->throttle_count) {
+		/* leaving throttled state, move up windows */
+		delta = udd->now - cfs_rq->load_stamp;
+		cfs_rq->load_stamp += delta;
+		cfs_rq->load_last += delta;
+	}
+
+	return 0;
+}
+
+static int tg_throttle_down(struct task_group *tg, void *data)
+{
+	long cpu = (long)data;
+	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu];
+
+	/* group is entering throttled state, record last load */
+	if (!cfs_rq->throttle_count)
+		update_cfs_load(cfs_rq, 0);
+	cfs_rq->throttle_count++;
+
+	return 0;
+}
+
 static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *se;
@@ -1388,7 +1409,10 @@ static void throttle_cfs_rq(struct cfs_r
 	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
 
 	/* account load preceding throttle */
-	update_cfs_load(cfs_rq, 0);
+	rcu_read_lock();
+	walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop,
+			(void*)(long)rq_of(cfs_rq)->cpu);
+	rcu_read_unlock();
 
 	for_each_sched_entity(se) {
 		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
@@ -1408,11 +1432,18 @@ static void unthrottle_cfs_rq(struct cfs
 {
 	struct rq *rq = rq_of(cfs_rq);
 	struct sched_entity *se;
+	struct tg_unthrottle_down_data udd;
 
 	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
 
 	update_rq_clock(rq);
 
+	/* don't include throttled window for load statistics */
+	udd.cpu = rq->cpu;
+	udd.now = rq->clock_task;
+	walk_tg_tree_from(cfs_rq->tg, tg_unthrottle_down, tg_nop,
+			(void*)&udd);
+
 	cfs_rq->throttled = 0;
 	if (!cfs_rq->load.weight)
 		return;
@@ -2488,8 +2519,10 @@ static void update_shares(int cpu)
 	struct rq *rq = cpu_rq(cpu);
 
 	rcu_read_lock();
-	for_each_leaf_cfs_rq(rq, cfs_rq)
-		update_shares_cpu(cfs_rq->tg, cpu);
+	for_each_leaf_cfs_rq(rq, cfs_rq) {
+		if (!throttled_hierarchy(cfs_rq))
+			update_shares_cpu(cfs_rq->tg, cpu);
+	}
 	rcu_read_unlock();
 }
 
@@ -2515,7 +2548,8 @@ load_balance_fair(struct rq *this_rq, in
 		/*
 		 * empty group
 		 */
-		if (!busiest_cfs_rq->task_weight)
+		if (!busiest_cfs_rq->task_weight ||
+				throttled_hierarchy(busiest_cfs_rq))
 			continue;
 
 		rem_load = (u64)rem_load_move * busiest_weight;
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -386,7 +386,7 @@ struct cfs_rq {
 	unsigned long load_contribution;
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
-	int quota_enabled, throttled;
+	int quota_enabled, throttled, throttle_count;
 	s64 quota_remaining;
 #endif
 #endif




* [patch 08/15] sched: migrate throttled tasks on HOTPLUG
  2011-03-23  3:03 [patch 00/15] CFS Bandwidth Control V5 Paul Turner
                   ` (6 preceding siblings ...)
  2011-03-23  3:03 ` [patch 07/15] sched: prevent interactions between throttled entities and load-balance Paul Turner
@ 2011-03-23  3:03 ` Paul Turner
  2011-04-05 13:28   ` Peter Zijlstra
  2011-03-23  3:03 ` [patch 09/15] sched: add exports tracking cfs bandwidth control statistics Paul Turner
                   ` (10 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Paul Turner @ 2011-03-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-migrate_dead.patch --]
[-- Type: text/plain, Size: 1688 bytes --]

Throttled tasks are invisible to cpu-offline since they are not eligible for
selection by pick_next_task().  The regular 'escape' path for a thread that is
blocked at offline is via ttwu->select_task_rq; however, this will not handle a
throttled group since there are no individual thread wakeups on an unthrottle.

Resolve this by unthrottling offline cpus so that threads can be migrated.

Signed-off-by: Paul Turner <pjt@google.com>
---
 kernel/sched.c |   27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -6119,6 +6119,30 @@ static void calc_global_load_remove(stru
 	rq->calc_load_active = 0;
 }
 
+#ifdef CONFIG_CFS_BANDWIDTH
+static void unthrottle_offline_cfs_rqs(struct rq *rq)
+{
+	struct cfs_rq *cfs_rq;
+
+	for_each_leaf_cfs_rq(rq, cfs_rq) {
+		if (!cfs_rq->quota_enabled)
+			continue;
+
+		/*
+		 * clock_task is not advancing so we just need to make sure
+		 * there's some valid quota amount
+		 */
+		cfs_rq->quota_remaining = tg_cfs_bandwidth(cfs_rq->tg)->quota;
+		if (cfs_rq_throttled(cfs_rq))
+			unthrottle_cfs_rq(cfs_rq);
+	}
+}
+#else
+static void unthrottle_offline_cfs_rqs(struct rq *rq)
+{
+}
+#endif
+
 /*
  * Migrate all tasks from the rq, sleeping tasks will be migrated by
  * try_to_wake_up()->select_task_rq().
@@ -6144,6 +6168,9 @@ static void migrate_tasks(unsigned int d
 	 */
 	rq->stop = NULL;
 
+	/* Ensure any throttled groups are reachable by pick_next_task */
+	unthrottle_offline_cfs_rqs(rq);
+
 	for ( ; ; ) {
 		/*
 		 * There's this thread running, bail when that's the only



^ permalink raw reply	[flat|nested] 63+ messages in thread

* [patch 09/15] sched: add exports tracking cfs bandwidth control statistics
  2011-03-23  3:03 [patch 00/15] CFS Bandwidth Control V5 Paul Turner
                   ` (7 preceding siblings ...)
  2011-03-23  3:03 ` [patch 08/15] sched: migrate throttled tasks on HOTPLUG Paul Turner
@ 2011-03-23  3:03 ` Paul Turner
  2011-04-05 13:28   ` Peter Zijlstra
  2011-03-23  3:03 ` [patch 10/15] sched: (fixlet) dont update shares twice on on_rq parent Paul Turner
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Paul Turner @ 2011-03-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

[-- Attachment #1: sched-bwc-throttle_stats.patch --]
[-- Type: text/plain, Size: 4154 bytes --]

From: Nikhil Rao <ncrao@google.com>

This change introduces statistics exports for the cpu sub-system; these are
added through the use of a stat file similar to that exported by other
subsystems.

The following exports are included:

nr_periods:	number of periods in which execution occurred
nr_throttled:	the number of periods above in which execution was throttled
throttled_time:	cumulative wall-time that any cpus have been throttled for
this group
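
For illustration only, a minimal userspace reader for the new file could look
like the sketch below (assuming the cpu cgroup hierarchy is mounted at /cgroup
and a group "a" exists; the path is hypothetical):

#include <stdio.h>

int main(void)
{
	char name[64];
	unsigned long long val;
	FILE *f = fopen("/cgroup/a/cpu.stat", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}

	/* each line of cpu.stat is "<key> <value>", e.g. "nr_periods 42" */
	while (fscanf(f, "%63s %llu", name, &val) == 2)
		printf("%s = %llu\n", name, val);

	fclose(f);
	return 0;
}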

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 kernel/sched.c      |   27 +++++++++++++++++++++++++++
 kernel/sched_fair.c |   14 +++++++++++++-
 2 files changed, 40 insertions(+), 1 deletion(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -255,6 +255,11 @@ struct cfs_bandwidth {
 	u64 runtime, runtime_assigned, quota;
 	s64 hierarchal_quota; /* used for validating consistency */
 	struct hrtimer period_timer;
+
+	/* throttle statistics */
+	u64			nr_periods;
+	u64			nr_throttled;
+	u64			throttled_time;
 #endif
 };
 
@@ -388,6 +393,7 @@ struct cfs_rq {
 #ifdef CONFIG_CFS_BANDWIDTH
 	int quota_enabled, throttled, throttle_count;
 	s64 quota_remaining;
+	u64 throttled_timestamp;
 #endif
 #endif
 };
@@ -429,6 +435,10 @@ void init_cfs_bandwidth(struct cfs_bandw
 
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
+
+	cfs_b->nr_periods = 0;
+	cfs_b->nr_throttled = 0;
+	cfs_b->throttled_time = 0;
 }
 
 static void init_cfs_rq_quota(struct cfs_rq *cfs_rq)
@@ -9502,6 +9512,19 @@ int sched_cfs_consistent_handler(struct 
 	mutex_unlock(&cfs_constraints_mutex);
 	return ret;
 }
+
+static int cpu_stats_show(struct cgroup *cgrp, struct cftype *cft,
+		struct cgroup_map_cb *cb)
+{
+	struct task_group *tg = cgroup_tg(cgrp);
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+
+	cb->fill(cb, "nr_periods", cfs_b->nr_periods);
+	cb->fill(cb, "nr_throttled", cfs_b->nr_throttled);
+	cb->fill(cb, "throttled_time", cfs_b->throttled_time);
+
+	return 0;
+}
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -9548,6 +9571,10 @@ static struct cftype cpu_files[] = {
 		.read_u64 = cpu_cfs_period_read_u64,
 		.write_u64 = cpu_cfs_period_write_u64,
 	},
+	{
+		.name = "stat",
+		.read_map = cpu_stats_show,
+	},
 #endif
 #ifdef CONFIG_RT_GROUP_SCHED
 	{
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1424,6 +1424,7 @@ static void throttle_cfs_rq(struct cfs_r
 	}
 
 	cfs_rq->throttled = 1;
+	cfs_rq->throttled_timestamp = rq_of(cfs_rq)->clock;
 }
 
 static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
@@ -1431,10 +1432,15 @@ static void unthrottle_cfs_rq(struct cfs
 	struct rq *rq = rq_of(cfs_rq);
 	struct sched_entity *se;
 	struct tg_unthrottle_down_data udd;
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 
 	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
 
 	update_rq_clock(rq);
+	raw_spin_lock(&cfs_b->lock);
+	cfs_b->throttled_time += (rq->clock - cfs_rq->throttled_timestamp);
+	raw_spin_unlock(&cfs_b->lock);
+	cfs_rq->throttled_timestamp = 0;
 
 	/* don't include throttled window for load statistics */
 	udd.cpu = rq->cpu;
@@ -1505,11 +1511,12 @@ next:
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 {
 	u64 runtime, runtime_assigned;
-	int idle;
+	int idle, throttled;
 
 	raw_spin_lock(&cfs_b->lock);
 	runtime = cfs_b->quota;
 	idle = cfs_b->runtime == cfs_b->runtime_assigned;
+	throttled = cfs_b->runtime == 0;
 	raw_spin_unlock(&cfs_b->lock);
 
 	if (runtime == RUNTIME_INF)
@@ -1523,6 +1530,11 @@ static int do_sched_cfs_period_timer(str
 	raw_spin_lock(&cfs_b->lock);
 	cfs_b->runtime = runtime;
 	cfs_b->runtime_assigned = runtime_assigned;
+
+	/* update throttled stats */
+	cfs_b->nr_periods++;
+	if (throttled)
+		cfs_b->nr_throttled++;
 	raw_spin_unlock(&cfs_b->lock);
 
 	return idle;



^ permalink raw reply	[flat|nested] 63+ messages in thread

* [patch 10/15] sched: (fixlet) dont update shares twice on on_rq parent
  2011-03-23  3:03 [patch 00/15] CFS Bandwidth Control V5 Paul Turner
                   ` (8 preceding siblings ...)
  2011-03-23  3:03 ` [patch 09/15] sched: add exports tracking cfs bandwidth control statistics Paul Turner
@ 2011-03-23  3:03 ` Paul Turner
  2011-04-05 13:28   ` Peter Zijlstra
  2011-03-23  3:03 ` [patch 11/15] sched: hierarchical task accounting for SCHED_OTHER Paul Turner
                   ` (8 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Paul Turner @ 2011-03-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-fix_dequeue_task_buglet.patch --]
[-- Type: text/plain, Size: 952 bytes --]

In dequeue_task_fair() we bail on dequeue when we encounter a parenting entity
with additional weight.  However, we perform a double shares update on this
entity since we continue the shares update traversal from that point, despite
dequeue_entity() having already updated its queuing cfs_rq.

Avoid this by starting from the parent when we resume.

Signed-off-by: Paul Turner <pjt@google.com>
---
 kernel/sched_fair.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1683,8 +1683,10 @@ static void dequeue_task_fair(struct rq 
 			break;
 		}
 		/* Don't dequeue parent if it has other entities besides us */
-		if (cfs_rq->load.weight)
+		if (cfs_rq->load.weight) {
+			se = parent_entity(se);
 			break;
+		}
 		check_cfs_rq_quota(cfs_rq);
 		flags |= DEQUEUE_SLEEP;
 	}



^ permalink raw reply	[flat|nested] 63+ messages in thread

* [patch 11/15] sched: hierarchical task accounting for SCHED_OTHER
  2011-03-23  3:03 [patch 00/15] CFS Bandwidth Control V5 Paul Turner
                   ` (9 preceding siblings ...)
  2011-03-23  3:03 ` [patch 10/15] sched: (fixlet) dont update shares twice on on_rq parent Paul Turner
@ 2011-03-23  3:03 ` Paul Turner
  2011-04-05 13:28   ` Peter Zijlstra
  2011-03-23  3:03 ` [patch 12/15] sched: maintain throttled rqs as a list Paul Turner
                   ` (7 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Paul Turner @ 2011-03-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-account_nr_running.patch --]
[-- Type: text/plain, Size: 7249 bytes --]

With task entities participating in throttled sub-trees it is possible for
task activation/de-activation to not lead to root visible changes to
rq->nr_running.  This in turn leads to incorrect idle and weight-per-task load
balance decisions.

To allow correct accounting we move responsibility for updating rq->nr_running
to the respective sched::classes.  In the fair-group case this update is
hierarchical, tracking the number of active tasks rooted at each group entity.

This also allows us to fix a small buglet in pick_next_task() when group
scheduling is enabled.

Note: technically this issue also exists with the existing sched_rt
throttling; however due to the nearly complete provisioning of system
resources for rt scheduling this is much less common by default.

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched.c          |    6 +----
 kernel/sched_fair.c     |   51 ++++++++++++++++++++++++++++++++++++++----------
 kernel/sched_rt.c       |    5 +++-
 kernel/sched_stoptask.c |    2 +
 4 files changed, 49 insertions(+), 15 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -329,7 +329,7 @@ struct task_group root_task_group;
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
 	struct load_weight load;
-	unsigned long nr_running;
+	unsigned long nr_running, h_nr_running;
 
 	u64 exec_clock;
 	u64 min_vruntime;
@@ -1914,7 +1914,6 @@ static void activate_task(struct rq *rq,
 		rq->nr_uninterruptible--;
 
 	enqueue_task(rq, p, flags);
-	inc_nr_running(rq);
 }
 
 /*
@@ -1926,7 +1925,6 @@ static void deactivate_task(struct rq *r
 		rq->nr_uninterruptible++;
 
 	dequeue_task(rq, p, flags);
-	dec_nr_running(rq);
 }
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
@@ -4174,7 +4172,7 @@ pick_next_task(struct rq *rq)
 	 * Optimization: we know that if all tasks are in
 	 * the fair class we can call that function directly:
 	 */
-	if (likely(rq->nr_running == rq->cfs.nr_running)) {
+	if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
 		p = fair_sched_class.pick_next_task(rq);
 		if (likely(p))
 			return p;
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1404,9 +1404,11 @@ static int tg_throttle_down(struct task_
 
 static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 {
+	struct rq *rq = rq_of(cfs_rq);
 	struct sched_entity *se;
+	long task_delta, dequeue = 1;
 
-	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+	se = cfs_rq->tg->se[cpu_of(rq)];
 
 	/* account load preceding throttle */
 	rcu_read_lock();
@@ -1414,17 +1416,24 @@ static void throttle_cfs_rq(struct cfs_r
 			(void*)(long)rq_of(cfs_rq)->cpu);
 	rcu_read_unlock();
 
+	task_delta = -cfs_rq->h_nr_running;
 	for_each_sched_entity(se) {
 		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
 		/* throttled entity or throttle-on-deactivate */
 		if (!se->on_rq)
 			break;
 
-		dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
+		if (dequeue)
+			dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
+		qcfs_rq->h_nr_running += task_delta;
+
 		if (qcfs_rq->load.weight)
-			break;
+			dequeue = 0;
 	}
 
+	if (!se)
+		rq->nr_running += task_delta;
+
 	cfs_rq->throttled = 1;
 	cfs_rq->throttled_timestamp = rq_of(cfs_rq)->clock;
 }
@@ -1435,7 +1444,7 @@ static void unthrottle_cfs_rq(struct cfs
 	struct sched_entity *se;
 	struct tg_unthrottle_down_data udd;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-
+	int task_delta, enqueue = 1;
 	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
 
 	update_rq_clock(rq);
@@ -1454,16 +1463,22 @@ static void unthrottle_cfs_rq(struct cfs
 	if (!cfs_rq->load.weight)
 		return;
 
+	task_delta = cfs_rq->h_nr_running;
 	for_each_sched_entity(se) {
 		if (se->on_rq)
-			break;
+			enqueue = 0;
 
 		cfs_rq = cfs_rq_of(se);
-		enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
+		if (enqueue)
+			enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
+		cfs_rq->h_nr_running += task_delta;
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 	}
 
+	if (!se)
+		rq->nr_running += task_delta;
+
 	/* determine whether we need to wake up potentially idle cpu */
 	if (rq->curr == rq->idle && rq->cfs.nr_running)
 		resched_task(rq->curr);
@@ -1637,7 +1652,7 @@ static inline void hrtick_update(struct 
 static void
 enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
-	struct cfs_rq *cfs_rq;
+	struct cfs_rq *cfs_rq = NULL;
 	struct sched_entity *se = &p->se;
 
 	for_each_sched_entity(se) {
@@ -1645,6 +1660,7 @@ enqueue_task_fair(struct rq *rq, struct 
 			break;
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
+		cfs_rq->h_nr_running++;
 		/* end evaluation on throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq)) {
 			se = NULL;
@@ -1655,12 +1671,19 @@ enqueue_task_fair(struct rq *rq, struct 
 	}
 
 	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		cfs_rq = cfs_rq_of(se);
+		cfs_rq->h_nr_running++;
+
+		if (cfs_rq_throttled(cfs_rq))
+			break;
 
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
+	if (!cfs_rq_throttled(cfs_rq))
+		inc_nr_running(rq);
+
 	hrtick_update(rq);
 }
 
@@ -1671,12 +1694,13 @@ enqueue_task_fair(struct rq *rq, struct 
  */
 static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
-	struct cfs_rq *cfs_rq;
+	struct cfs_rq *cfs_rq = NULL;
 	struct sched_entity *se = &p->se;
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);
+		cfs_rq->h_nr_running--;
 		/* end evaluation on throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq)) {
 			se = NULL;
@@ -1692,12 +1716,19 @@ static void dequeue_task_fair(struct rq 
 	}
 
 	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		cfs_rq = cfs_rq_of(se);
+		cfs_rq->h_nr_running--;
+
+		if (cfs_rq_throttled(cfs_rq))
+			break;
 
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
+	if (!cfs_rq_throttled(cfs_rq))
+		dec_nr_running(rq);
+
 	hrtick_update(rq);
 }
 
Index: tip/kernel/sched_rt.c
===================================================================
--- tip.orig/kernel/sched_rt.c
+++ tip/kernel/sched_rt.c
@@ -910,6 +910,8 @@ enqueue_task_rt(struct rq *rq, struct ta
 
 	if (!task_current(rq, p) && p->rt.nr_cpus_allowed > 1)
 		enqueue_pushable_task(rq, p);
+
+	inc_nr_running(rq);
 }
 
 static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
@@ -920,6 +922,8 @@ static void dequeue_task_rt(struct rq *r
 	dequeue_rt_entity(rt_se);
 
 	dequeue_pushable_task(rq, p);
+
+	dec_nr_running(rq);
 }
 
 /*
@@ -1787,4 +1791,3 @@ static void print_rt_stats(struct seq_fi
 	rcu_read_unlock();
 }
 #endif /* CONFIG_SCHED_DEBUG */
-
Index: tip/kernel/sched_stoptask.c
===================================================================
--- tip.orig/kernel/sched_stoptask.c
+++ tip/kernel/sched_stoptask.c
@@ -35,11 +35,13 @@ static struct task_struct *pick_next_tas
 static void
 enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
 {
+	inc_nr_running(rq);
 }
 
 static void
 dequeue_task_stop(struct rq *rq, struct task_struct *p, int flags)
 {
+	dec_nr_running(rq);
 }
 
 static void yield_task_stop(struct rq *rq)



^ permalink raw reply	[flat|nested] 63+ messages in thread

* [patch 12/15] sched: maintain throttled rqs as a list
  2011-03-23  3:03 [patch 00/15] CFS Bandwidth Control V5 Paul Turner
                   ` (10 preceding siblings ...)
  2011-03-23  3:03 ` [patch 11/15] sched: hierarchical task accounting for SCHED_OTHER Paul Turner
@ 2011-03-23  3:03 ` Paul Turner
  2011-04-22  2:50   ` Hidetoshi Seto
  2011-03-23  3:03 ` [patch 13/15] sched: expire slack quota using generation counters Paul Turner
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Paul Turner @ 2011-03-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-throttle-list.patch --]
[-- Type: text/plain, Size: 3637 bytes --]

Maintain a list of throttled runqueues instead of having to test every runqueue
to determine whether we need to unthrottle.

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched.c      |    7 +++++++
 kernel/sched_fair.c |   25 +++++++++++++------------
 2 files changed, 20 insertions(+), 12 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -256,6 +256,7 @@ struct cfs_bandwidth {
 	s64 hierarchal_quota; /* used for validating consistency */
 	struct hrtimer period_timer;
 
+	struct list_head throttled_cfs_rq;
 	/* throttle statistics */
 	u64			nr_periods;
 	u64			nr_throttled;
@@ -394,6 +395,8 @@ struct cfs_rq {
 	int quota_enabled, throttled, throttle_count;
 	s64 quota_remaining;
 	u64 throttled_timestamp;
+
+	struct list_head throttled_list;
 #endif
 #endif
 };
@@ -433,6 +436,7 @@ void init_cfs_bandwidth(struct cfs_bandw
 	raw_spin_lock_init(&cfs_b->lock);
 	cfs_b->quota = cfs_b->runtime = quota;
 	cfs_b->period = ns_to_ktime(period);
+	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
@@ -8141,6 +8145,9 @@ static void init_cfs_rq(struct cfs_rq *c
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	cfs_rq->rq = rq;
 	/* allow initial update_cfs_load() to truncate */
+#ifdef CONFIG_CFS_BANDWIDTH
+	INIT_LIST_HEAD(&cfs_rq->throttled_list);
+#endif
 #ifdef CONFIG_SMP
 	cfs_rq->load_stamp = 1;
 #endif
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1405,6 +1405,7 @@ static int tg_throttle_down(struct task_
 static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 	struct sched_entity *se;
 	long task_delta, dequeue = 1;
 
@@ -1434,7 +1435,11 @@ static void throttle_cfs_rq(struct cfs_r
 	if (!se)
 		rq->nr_running += task_delta;
 
+	raw_spin_lock(&cfs_b->lock);
 	cfs_rq->throttled = 1;
+	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
+	raw_spin_unlock(&cfs_b->lock);
+
 	cfs_rq->throttled_timestamp = rq_of(cfs_rq)->clock;
 }
 
@@ -1450,6 +1455,10 @@ static void unthrottle_cfs_rq(struct cfs
 	update_rq_clock(rq);
 	raw_spin_lock(&cfs_b->lock);
 	cfs_b->throttled_time += (rq->clock - cfs_rq->throttled_timestamp);
+
+	cfs_rq->throttled = 0;
+	list_del_rcu(&cfs_rq->throttled_list);
+
 	raw_spin_unlock(&cfs_b->lock);
 	cfs_rq->throttled_timestamp = 0;
 
@@ -1459,7 +1468,6 @@ static void unthrottle_cfs_rq(struct cfs
 	walk_tg_tree_from(cfs_rq->tg, tg_unthrottle_down, tg_nop,
 			(void*)&udd);
 
-	cfs_rq->throttled = 0;
 	if (!cfs_rq->load.weight)
 		return;
 
@@ -1484,22 +1492,15 @@ static void unthrottle_cfs_rq(struct cfs
 		resched_task(rq->curr);
 }
 
-static inline struct task_group *cfs_bandwidth_tg(struct cfs_bandwidth *cfs_b)
-{
-	return container_of(cfs_b, struct task_group, cfs_bandwidth);
-}
-
 static u64 distribute_cfs_bandwidth(struct cfs_bandwidth *cfs_b, u64 runtime)
 {
-	int i;
+	struct cfs_rq *cfs_rq;
 	u64 quota, remaining = runtime;
-	const struct cpumask *span;
 
 	rcu_read_lock();
-	span = sched_bw_period_mask();
-	for_each_cpu(i, span) {
-		struct rq *rq = cpu_rq(i);
-		struct cfs_rq *cfs_rq = cfs_bandwidth_tg(cfs_b)->cfs_rq[i];
+	list_for_each_entry_rcu(cfs_rq, &cfs_b->throttled_cfs_rq,
+				throttled_list) {
+		struct rq *rq = rq_of(cfs_rq);
 
 		raw_spin_lock(&rq->lock);
 		if (within_bandwidth(cfs_rq))



^ permalink raw reply	[flat|nested] 63+ messages in thread

* [patch 13/15] sched: expire slack quota using generation counters
  2011-03-23  3:03 [patch 00/15] CFS Bandwidth Control V5 Paul Turner
                   ` (11 preceding siblings ...)
  2011-03-23  3:03 ` [patch 12/15] sched: maintain throttled rqs as a list Paul Turner
@ 2011-03-23  3:03 ` Paul Turner
  2011-04-05 13:28   ` Peter Zijlstra
  2011-03-23  3:03 ` [patch 14/15] sched: return unused quota on voluntary sleep Paul Turner
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Paul Turner @ 2011-03-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-add_quota_generations.patch --]
[-- Type: text/plain, Size: 4941 bytes --]

Introduce a generational counter which is incremented each quota period.  This
allows us to determine on a per-cpu basis whether the currently operating quota
is "current" or not, without requiring us to visit every cpu and explicitly
expire quota on every new period.
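
To illustrate the expiry rule (a standalone model only, not the kernel code;
the helper below is made up for the example), stale surplus is dropped while a
deficit is carried into the new generation:

#include <stdio.h>

/* model: expire a cfs_rq's local runtime against the global generation */
static long long expire_local_quota(long long remaining_ns, int local_gen,
				    int global_gen)
{
	if (local_gen != global_gen && remaining_ns > 0)
		return 0;		/* drop stale surplus */
	return remaining_ns;		/* deficits carry forward */
}

int main(void)
{
	/* 3ms of slack issued in generation 4 is unusable once we reach 5 */
	printf("%lld\n", expire_local_quota(3000000LL, 4, 5));
	/* a 2ms deficit from generation 4 is still charged */
	printf("%lld\n", expire_local_quota(-2000000LL, 4, 5));
	return 0;
}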

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>

---
 kernel/sched.c      |    6 ++++++
 kernel/sched_fair.c |   42 +++++++++++++++++++++++++++++++++++++-----
 2 files changed, 43 insertions(+), 5 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -256,6 +256,7 @@ struct cfs_bandwidth {
 	s64 hierarchal_quota; /* used for validating consistency */
 	struct hrtimer period_timer;
 
+	int quota_generation;
 	struct list_head throttled_cfs_rq;
 	/* throttle statistics */
 	u64			nr_periods;
@@ -396,6 +397,7 @@ struct cfs_rq {
 	s64 quota_remaining;
 	u64 throttled_timestamp;
 
+	int quota_generation;
 	struct list_head throttled_list;
 #endif
 #endif
@@ -436,8 +438,10 @@ void init_cfs_bandwidth(struct cfs_bandw
 	raw_spin_lock_init(&cfs_b->lock);
 	cfs_b->quota = cfs_b->runtime = quota;
 	cfs_b->period = ns_to_ktime(period);
+	cfs_b->quota_generation = 0;
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 
+
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
 
@@ -9333,6 +9337,8 @@ static int tg_set_cfs_bandwidth(struct t
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->runtime = cfs_b->quota = quota;
+
+	cfs_bump_quota_generation(cfs_b);
 	raw_spin_unlock_irq(&cfs_b->lock);
 
 	for_each_possible_cpu(i) {
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1331,11 +1331,25 @@ static void check_cfs_rq_quota(struct cf
 	resched_task(rq_of(cfs_rq)->curr);
 }
 
+static void cfs_bump_quota_generation(struct cfs_bandwidth *cfs_b)
+{
+	cfs_b->quota_generation++;
+	smp_mb();
+}
+
+static inline int cfs_rq_quota_current(struct cfs_rq *cfs_rq)
+{
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+
+	return cfs_rq->quota_generation == cfs_b->quota_generation;
+}
+
 static void request_cfs_rq_quota(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
 	u64 amount = 0, min_amount;
+	int generation;
 
 	min_amount = sched_cfs_bandwidth_slice() + (-cfs_rq->quota_remaining);
 
@@ -1347,10 +1361,18 @@ static void request_cfs_rq_quota(struct 
 		} else {
 			amount = min_amount;
 		}
+		generation = cfs_b->quota_generation;
 		raw_spin_unlock(&cfs_b->lock);
 	}
 
+	/* a deficit should be carried forwards, surplus should be dropped */
+
+	if (generation != cfs_rq->quota_generation &&
+	    cfs_rq->quota_remaining > 0)
+		cfs_rq->quota_remaining = 0;
+
 	cfs_rq->quota_remaining += amount;
+	cfs_rq->quota_generation = generation;
 }
 
 static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
@@ -1361,8 +1383,13 @@ static void account_cfs_rq_quota(struct 
 
 	cfs_rq->quota_remaining -= delta_exec;
 
-	if (cfs_rq->quota_remaining > 0)
-		return;
+	/* we only want to charge deficits against the next generation */
+	if (likely(cfs_rq->quota_remaining > 0)) {
+		if (unlikely(!cfs_rq_quota_current(cfs_rq)))
+			cfs_rq->quota_remaining = 0;
+		else
+			return;
+	}
 
 	request_cfs_rq_quota(cfs_rq);
 }
@@ -1492,7 +1519,8 @@ static void unthrottle_cfs_rq(struct cfs
 		resched_task(rq->curr);
 }
 
-static u64 distribute_cfs_bandwidth(struct cfs_bandwidth *cfs_b, u64 runtime)
+static u64 distribute_cfs_bandwidth(struct cfs_bandwidth *cfs_b, u64 runtime,
+		int generation)
 {
 	struct cfs_rq *cfs_rq;
 	u64 quota, remaining = runtime;
@@ -1512,6 +1540,7 @@ static u64 distribute_cfs_bandwidth(stru
 		remaining -= quota;
 
 		cfs_rq->quota_remaining += quota;
+		cfs_rq->quota_generation = generation;
 		if (cfs_rq_throttled(cfs_rq) && cfs_rq->quota_remaining > 0)
 			unthrottle_cfs_rq(cfs_rq);
 
@@ -1529,12 +1558,15 @@ next:
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 {
 	u64 runtime, runtime_assigned;
-	int idle, throttled;
+	int idle, throttled, generation;
 
 	raw_spin_lock(&cfs_b->lock);
 	runtime = cfs_b->quota;
 	idle = cfs_b->runtime == cfs_b->runtime_assigned;
 	throttled = cfs_b->runtime == 0;
+
+	cfs_bump_quota_generation(cfs_b);
+	generation = cfs_b->quota_generation;
 	raw_spin_unlock(&cfs_b->lock);
 
 	if (runtime == RUNTIME_INF)
@@ -1543,7 +1575,7 @@ static int do_sched_cfs_period_timer(str
 	runtime *= overrun;
 	runtime_assigned = runtime;
 
-	runtime = distribute_cfs_bandwidth(cfs_b, runtime);
+	runtime = distribute_cfs_bandwidth(cfs_b, runtime, generation);
 
 	raw_spin_lock(&cfs_b->lock);
 	cfs_b->runtime = runtime;



^ permalink raw reply	[flat|nested] 63+ messages in thread

* [patch 14/15] sched: return unused quota on voluntary sleep
  2011-03-23  3:03 [patch 00/15] CFS Bandwidth Control V5 Paul Turner
                   ` (12 preceding siblings ...)
  2011-03-23  3:03 ` [patch 13/15] sched: expire slack quota using generation counters Paul Turner
@ 2011-03-23  3:03 ` Paul Turner
  2011-04-05 13:28   ` Peter Zijlstra
  2011-03-23  3:03 ` [patch 15/15] sched: add documentation for bandwidth control Paul Turner
                   ` (4 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Paul Turner @ 2011-03-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-simple_return_quota.patch --]
[-- Type: text/plain, Size: 7540 bytes --]

When a local cfs_rq blocks we return the majority of its remaining quota to the
global bandwidth pool for use by other runqueues.

We do this only when the quota is current and there is more than 
min_cfs_rq_quota [1ms by default] of runtime remaining on the rq.

In the case where there are throttled runqueues and we have sufficient
bandwidth to meter out a slice, a second timer is kicked off to handle this
delivery, unthrottling where appropriate.
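
As a worked example (an illustrative model only, not the kernel
implementation; the helper name is made up), the amount handed back when a
cfs_rq blocks is just the local remainder above the 1ms floor:

#include <stdio.h>

/* model: slack returned to the global pool when a cfs_rq blocks */
static long long slack_to_return(long long quota_remaining_ns)
{
	const long long min_cfs_rq_quota_ns = 1000000LL;	/* 1ms floor */
	long long slack = quota_remaining_ns - min_cfs_rq_quota_ns;

	return slack > 0 ? slack : 0;
}

int main(void)
{
	printf("%lld\n", slack_to_return(4000000LL));	/* 4ms left -> return 3ms */
	printf("%lld\n", slack_to_return(500000LL));	/* 0.5ms left -> return 0 */
	return 0;
}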

Using a 'worst case' antagonist which executes on each cpu
for 1ms before moving onto the next on a fairly large machine:

no quota generations:
 197.47 ms       /cgroup/a/cpuacct.usage
 199.46 ms       /cgroup/a/cpuacct.usage
 205.46 ms       /cgroup/a/cpuacct.usage
 198.46 ms       /cgroup/a/cpuacct.usage
 208.39 ms       /cgroup/a/cpuacct.usage
Since we are allowed to use "stale" quota our usage is effectively bounded by
the rate of input into the global pool and performance is relatively stable.

with quota generations [1s increments]:
 119.58 ms       /cgroup/a/cpuacct.usage
 119.65 ms       /cgroup/a/cpuacct.usage
 119.64 ms       /cgroup/a/cpuacct.usage
 119.63 ms       /cgroup/a/cpuacct.usage
 119.60 ms       /cgroup/a/cpuacct.usage
The large deficit here is due to quota generations (/intentionally/) preventing
us from using previously stranded slack quota.  The cost is that this quota
becomes unavailable.

with quota generations and quota return:
 200.09 ms       /cgroup/a/cpuacct.usage
 200.09 ms       /cgroup/a/cpuacct.usage
 198.09 ms       /cgroup/a/cpuacct.usage
 200.09 ms       /cgroup/a/cpuacct.usage
 200.06 ms       /cgroup/a/cpuacct.usage
By returning unused quota we're able to both stably consume our desired quota
and prevent unintentional overages due to the abuse of slack quota from 
previous quota periods (especially on a large machine).

Bharata's idea to use a slack timer to handle the return helped make this patch
cleaner.

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>

---
 kernel/sched.c      |   16 ++++++++-
 kernel/sched_fair.c |   92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 107 insertions(+), 1 deletion(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -254,7 +254,7 @@ struct cfs_bandwidth {
 	ktime_t period;
 	u64 runtime, runtime_assigned, quota;
 	s64 hierarchal_quota; /* used for validating consistency */
-	struct hrtimer period_timer;
+	struct hrtimer period_timer, slack_timer;
 
 	int quota_generation;
 	struct list_head throttled_cfs_rq;
@@ -432,6 +432,17 @@ static enum hrtimer_restart sched_cfs_pe
 	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
 }
 
+static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b);
+
+static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
+{
+	struct cfs_bandwidth *cfs_b =
+		container_of(timer, struct cfs_bandwidth, slack_timer);
+	do_sched_cfs_slack_timer(cfs_b);
+
+	return HRTIMER_NORESTART;
+}
+
 static
 void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, u64 quota, u64 period)
 {
@@ -444,6 +455,8 @@ void init_cfs_bandwidth(struct cfs_bandw
 
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
+	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	cfs_b->slack_timer.function = sched_cfs_slack_timer;
 
 	cfs_b->nr_periods = 0;
 	cfs_b->nr_throttled = 0;
@@ -477,6 +490,7 @@ static void start_cfs_bandwidth(struct c
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
 	hrtimer_cancel(&cfs_b->period_timer);
+	hrtimer_cancel(&cfs_b->slack_timer);
 }
 #else
 #ifdef CONFIG_FAIR_GROUP_SCHED
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1225,6 +1225,7 @@ static struct sched_entity *pick_next_en
 
 static void throttle_cfs_rq(struct cfs_rq *cfs_rq);
 static inline int within_bandwidth(struct cfs_rq *cfs_rq);
+static void return_cfs_rq_quota(struct cfs_rq *cfs_rq);
 
 static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 {
@@ -1237,6 +1238,8 @@ static void put_prev_entity(struct cfs_r
 
 	if (!within_bandwidth(cfs_rq))
 		throttle_cfs_rq(cfs_rq);
+	else
+		return_cfs_rq_quota(cfs_rq);
 
 	check_spread(cfs_rq, prev);
 	if (prev->on_rq) {
@@ -1589,6 +1592,94 @@ static int do_sched_cfs_period_timer(str
 
 	return idle;
 }
+
+/* a cfs_rq won't donate quota below this amount */
+static const u64 min_cfs_rq_quota = 1 * NSEC_PER_MSEC;
+/* minimum remaining period time to redistribute slack quota */
+static const u64 min_bandwidth_expiration = 2 * NSEC_PER_MSEC;
+/* how long we wait to gather additional slack before distributing */
+static const u64 cfs_bandwidth_slack_period = 5 * NSEC_PER_MSEC;
+
+/* are we near the end of the current quota period? */
+static int quota_refresh_within(struct cfs_bandwidth *cfs_b, u64 min_expire)
+{
+	struct hrtimer *refresh_timer = &cfs_b->period_timer;
+	u64 remaining;
+
+	/* if the call back is running a quota refresh is occurring */
+	if (hrtimer_callback_running(refresh_timer))
+		return 1;
+
+	/* is a quota refresh about to occur? */
+	remaining = ktime_to_ns(hrtimer_expires_remaining(refresh_timer));
+	if (remaining < min_expire)
+		return 1;
+
+	return 0;
+}
+
+static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
+{
+	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
+	int generation;
+
+	/* confirm we're still not at a refresh boundary */
+	if (quota_refresh_within(cfs_b, min_bandwidth_expiration))
+		return;
+
+	raw_spin_lock(&cfs_b->lock);
+	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice) {
+		runtime = cfs_b->runtime;
+		cfs_b->runtime = 0;
+	}
+	generation = cfs_b->quota_generation;
+	raw_spin_unlock(&cfs_b->lock);
+
+	if (!runtime)
+		return;
+
+	runtime = distribute_cfs_bandwidth(cfs_b, runtime, generation);
+
+	raw_spin_lock(&cfs_b->lock);
+	cfs_b->runtime = runtime;
+	raw_spin_unlock(&cfs_b->lock);
+}
+
+static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+	u64 min_left = cfs_bandwidth_slack_period + min_bandwidth_expiration;
+
+	/* if there's a quota refresh soon don't bother with slack */
+	if (quota_refresh_within(cfs_b, min_left))
+		return;
+
+	start_bandwidth_timer(&cfs_b->slack_timer,
+				ns_to_ktime(cfs_bandwidth_slack_period));
+}
+
+static void return_cfs_rq_quota(struct cfs_rq *cfs_rq)
+{
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	s64 slack_quota = cfs_rq->quota_remaining - min_cfs_rq_quota;
+
+	if (!cfs_rq->quota_enabled || cfs_rq->load.weight)
+		return;
+
+	if (slack_quota <= 0 || !cfs_rq_quota_current(cfs_rq))
+		return;
+
+	raw_spin_lock(&cfs_b->lock);
+	if (cfs_b->quota != RUNTIME_INF && cfs_rq_quota_current(cfs_rq)) {
+		cfs_b->runtime += slack_quota;
+
+		if (cfs_b->runtime > sched_cfs_bandwidth_slice() &&
+		    !list_empty(&cfs_b->throttled_cfs_rq))
+			start_cfs_slack_bandwidth(cfs_b);
+	}
+	raw_spin_unlock(&cfs_b->lock);
+	cfs_rq->quota_remaining -= slack_quota;
+}
+
 #else
 static inline u64 default_cfs_period(void)
 {
@@ -1614,6 +1705,7 @@ static void check_cfs_rq_quota(struct cf
 static void throttle_cfs_rq(struct cfs_rq *cfs_rq) {}
 static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
 		unsigned long delta_exec) {}
+static void return_cfs_rq_quota(struct cfs_rq *cfs_rq) {}
 #endif
 
 



^ permalink raw reply	[flat|nested] 63+ messages in thread

* [patch 15/15] sched: add documentation for bandwidth control
  2011-03-23  3:03 [patch 00/15] CFS Bandwidth Control V5 Paul Turner
                   ` (13 preceding siblings ...)
  2011-03-23  3:03 ` [patch 14/15] sched: return unused quota on voluntary sleep Paul Turner
@ 2011-03-23  3:03 ` Paul Turner
  2011-03-24  6:38   ` Bharata B Rao
  2011-03-24 16:12 ` [patch 00/15] CFS Bandwidth Control V5 Bharata B Rao
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Paul Turner @ 2011-03-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: sched-bwc-documentation.patch --]
[-- Type: text/plain, Size: 4889 bytes --]

From: Bharata B Rao <bharata@linux.vnet.ibm.com>

Basic description for usage and effect of CFS Bandwidth Control.

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: Paul Turner <pjt@google.com>
---
 Documentation/scheduler/sched-bwc.txt |  104 ++++++++++++++++++++++++++++++++++
 1 file changed, 104 insertions(+)

Index: tip/Documentation/scheduler/sched-bwc.txt
===================================================================
--- /dev/null
+++ tip/Documentation/scheduler/sched-bwc.txt
@@ -0,0 +1,104 @@
+CFS Bandwidth Control (aka CPU hard limits)
+===========================================
+
+[ This document talks about CPU bandwidth control of CFS groups only.
+  The bandwidth control of RT groups is explained in
+  Documentation/scheduler/sched-rt-group.txt ]
+
+CFS bandwidth control is a group scheduler extension that can be used to
+control the maximum CPU bandwidth obtained by a CPU cgroup.
+
+Bandwidth allowed for a group is specified using quota and period. Within
+a given "period" (microseconds), a group is allowed to consume up to "quota"
+microseconds of CPU time, which is the upper limit or the hard limit. When the
+CPU bandwidth consumption of a group exceeds the hard limit, the tasks in the
+group are throttled and are not allowed to run until the end of the period at
+which time the group's quota is replenished.
+
+Runtime available to the group is tracked globally. At the beginning of
+every period, the group's global runtime pool is replenished with "quota"
+microseconds worth of runtime. The runtime consumption happens locally at each
+CPU by fetching runtimes in "slices" from the global pool.
+
+Interface
+---------
+Quota and period can be set via cgroup files.
+
+cpu.cfs_quota_us: the maximum allowed bandwidth (microseconds)
+cpu.cfs_period_us: the enforcement interval (microseconds)
+
+Within a period of cpu.cfs_period_us, the group as a whole will not be allowed
+to consume more than cpu.cfs_quota_us worth of runtime.
+
+The default value of cpu.cfs_period_us is 500ms and the default value
+for cpu.cfs_quota_us is -1.
+
+A group with cpu.cfs_quota_us as -1 indicates that the group has infinite
+bandwidth, which means that it is not bandwidth controlled.
+
+Writing any negative value to cpu.cfs_quota_us will turn the group into
+an infinite bandwidth group. Reading cpu.cfs_quota_us for an infinite
+bandwidth group will always return -1.
+
+System wide settings
+--------------------
+The amount of runtime obtained from the global pool every time a CPU wants the
+group quota locally is controlled by a sysctl parameter
+sched_cfs_bandwidth_slice_us. The current default is 5ms. This can be changed
+by writing to /proc/sys/kernel/sched_cfs_bandwidth_slice_us.
+
+A quota hierarchy is defined to be consistent if the sum of child reservations
+does not exceed the bandwidth allocated to its parent.  An entity with no
+explicit bandwidth reservation (e.g. no limit) is considered to inherit its
+parent's limits.  This behavior may be managed using
+/proc/sys/kernel/sched_cfs_bandwidth_consistent
+
+Statistics
+----------
+cpu.stat file lists three different stats related to CPU bandwidth control.
+
+nr_periods: Number of enforcement intervals that have elapsed.
+nr_throttled: Number of times the group has been throttled/limited.
+throttled_time: The total time duration (in nanoseconds) for which the group
+remained throttled.
+
+These files are read-only.
+
+Hierarchy considerations
+------------------------
+Each group's bandwidth (quota and period) can be set independent of its
+parent or child groups. There are two ways in which a group can get
+throttled:
+
+- it consumed its quota within the period
+- it has quota left but the parent's quota is exhausted.
+
+In the 2nd case, even though the child has quota left, it will not be
+able to run since the parent itself is throttled. Similarly groups that are
+not bandwidth constrained might end up being throttled if any parent
+in their hierarchy is throttled.
+
+Examples
+--------
+1. Limit a group to 1 CPU worth of runtime.
+
+	If period is 500ms and quota is also 500ms, the group will get
+	1 CPU worth of runtime every 500ms.
+
+	# echo 500000 > cpu.cfs_quota_us /* quota = 500ms */
+	# echo 500000 > cpu.cfs_period_us /* period = 500ms */
+
+2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine.
+
+	With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
+	runtime every 500ms.
+
+	# echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
+	# echo 500000 > cpu.cfs_period_us /* period = 500ms */
+
+3. Limit a group to 20% of 1 CPU.
+
+	With 500ms period, 100ms quota will be equivalent to 20% of 1 CPU.
+
+	# echo 100000 > cpu.cfs_quota_us /* quota = 100ms */
+	# echo 500000 > cpu.cfs_period_us /* period = 500ms */



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 04/15] sched: throttle cfs_rq entities which exceed their local quota
  2011-03-23  3:03 ` [patch 04/15] sched: throttle cfs_rq entities which exceed their local quota Paul Turner
@ 2011-03-23  5:09   ` Mike Galbraith
  2011-03-23 20:53     ` Paul Turner
  2011-03-24  6:36   ` Bharata B Rao
  2011-04-05 13:28   ` Peter Zijlstra
  2 siblings, 1 reply; 63+ messages in thread
From: Mike Galbraith @ 2011-03-23  5:09 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Tue, 2011-03-22 at 20:03 -0700, Paul Turner wrote:

> +static void check_cfs_rq_quota(struct cfs_rq *cfs_rq)
> +{
> +	if (within_bandwidth(cfs_rq))
> +		return;
> +
> +

Nit:  It'd be nice if classes agreed on naming convention to ease
rummaging.  In rt, it's bandwidth for bean counting parameters, but the
beans are runtime.  within_bandwidth() vs sched_rt_runtime_exceeded()
kinda pokes me in the eye when I look at the total.  Seems to me it
should be uniformly either quota or bandwidth, and uniformly runtime.

	-Mike



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 02/15] sched: validate CFS quota hierarchies
  2011-03-23  3:03 ` [patch 02/15] sched: validate CFS quota hierarchies Paul Turner
@ 2011-03-23 10:39   ` torbenh
  2011-03-23 20:49     ` Paul Turner
  2011-03-24  6:31   ` Bharata B Rao
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 63+ messages in thread
From: torbenh @ 2011-03-23 10:39 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

On Tue, Mar 22, 2011 at 08:03:28PM -0700, Paul Turner wrote:
> Add constraints validation for CFS bandwidth hierachies.
> 
> It is checked that:
>    sum(child bandwidth) <= parent_bandwidth
> 
> In a quota limited hierarchy, an unconstrainted entity
> (e.g. bandwidth==RUNTIME_INF) inherits the bandwidth of its parent.
> 
> Since bandwidth periods may be non-uniform we normalize to the maximum allowed
> period, 5 seconds.
> 
> This behavior may be disabled (allowing child bandwidth to exceed parent) via
> kernel.sched_cfs_bandwidth_consistent=0
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> 
> ---
>  include/linux/sched.h |    8 +++
>  kernel/sched.c        |  127 +++++++++++++++++++++++++++++++++++++++++++++++---
>  kernel/sched_fair.c   |    8 +++
>  kernel/sysctl.c       |   11 ++++
>  4 files changed, 147 insertions(+), 7 deletions(-)
> 
> Index: tip/kernel/sched.c
> ===================================================================
> --- tip.orig/kernel/sched.c
> +++ tip/kernel/sched.c
> +static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
> +{
> +	struct cfs_schedulable_data *d = data;
> +	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> +	s64 quota = 0, parent_quota = -1;
> +
> +	quota = normalize_cfs_quota(tg, d);
> +	if (!tg->parent) {
> +		quota = RUNTIME_INF;
> +	} else {
> +		struct cfs_bandwidth *parent_b = tg_cfs_bandwidth(tg->parent);
> +
> +		parent_quota = parent_b->hierarchal_quota;
> +		if (parent_quota != RUNTIME_INF) {
> +			parent_quota -= quota;
> +			/* invalid hierarchy, child bandwidth exceeds parent */
> +			if (parent_quota < 0)
> +				return -EINVAL;
> +		}
> +
> +		/* if no inherent limit then inherit parent quota */
> +		if (quota == RUNTIME_INF)
> +			quota = parent_quota;
> +		parent_b->hierarchal_quota = parent_quota;
> +	}
> +	cfs_b->hierarchal_quota = quota;
> +
> +	return 0;
> +}

I find this logic pretty weird.
As long as quota == INF i can overcommit, but as soon as there is some
quota, i can not ?

Its clear, that one needs to be able to overcommit runtime, or the
default runtime for a new cgroup would need to be 0.
The root problem imo is that runtime and shares should not be in the
same cgroup subsystem. The semantics are too different.


> +
> +static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
> +{
> +	int ret;
> +	struct cfs_schedulable_data data = {
> +		.tg = tg,
> +		.period = period / NSEC_PER_USEC,
> +		.quota = quota / NSEC_PER_USEC,
> +	};
> +
> +	if (!sysctl_sched_cfs_bandwidth_consistent)
> +		return 0;
> +
> +	rcu_read_lock();
> +	ret = walk_tg_tree(tg_cfs_schedulable_down, tg_nop,
> +			    &data);
> +	rcu_read_unlock();

walk_tg_tree does the rcu_read_lock itself.
not necessary to do that here.
look at __rt_schedulable there is no rcu.


> +
> +	return ret;
> +}

-- 
torben Hohn

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 02/15] sched: validate CFS quota hierarchies
  2011-03-23 10:39   ` torbenh
@ 2011-03-23 20:49     ` Paul Turner
  0 siblings, 0 replies; 63+ messages in thread
From: Paul Turner @ 2011-03-23 20:49 UTC (permalink / raw)
  To: Paul Turner, linux-kernel, Peter Zijlstra, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar,
	Pavel Emelyanov
  Cc: torbenh

On Wed, Mar 23, 2011 at 3:39 AM, torbenh <torbenh@gmx.de> wrote:
> On Tue, Mar 22, 2011 at 08:03:28PM -0700, Paul Turner wrote:
>> Add constraints validation for CFS bandwidth hierachies.
>>
>> It is checked that:
>>    sum(child bandwidth) <= parent_bandwidth
>>
>> In a quota limited hierarchy, an unconstrainted entity
>> (e.g. bandwidth==RUNTIME_INF) inherits the bandwidth of its parent.
>>
>> Since bandwidth periods may be non-uniform we normalize to the maximum allowed
>> period, 5 seconds.
>>
>> This behavior may be disabled (allowing child bandwidth to exceed parent) via
>> kernel.sched_cfs_bandwidth_consistent=0
>>
>> Signed-off-by: Paul Turner <pjt@google.com>
>>
>> ---
>>  include/linux/sched.h |    8 +++
>>  kernel/sched.c        |  127 +++++++++++++++++++++++++++++++++++++++++++++++---
>>  kernel/sched_fair.c   |    8 +++
>>  kernel/sysctl.c       |   11 ++++
>>  4 files changed, 147 insertions(+), 7 deletions(-)
>>
>> Index: tip/kernel/sched.c
>> ===================================================================
>> --- tip.orig/kernel/sched.c
>> +++ tip/kernel/sched.c
>> +static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
>> +{
>> +     struct cfs_schedulable_data *d = data;
>> +     struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
>> +     s64 quota = 0, parent_quota = -1;
>> +
>> +     quota = normalize_cfs_quota(tg, d);
>> +     if (!tg->parent) {
>> +             quota = RUNTIME_INF;
>> +     } else {
>> +             struct cfs_bandwidth *parent_b = tg_cfs_bandwidth(tg->parent);
>> +
>> +             parent_quota = parent_b->hierarchal_quota;
>> +             if (parent_quota != RUNTIME_INF) {
>> +                     parent_quota -= quota;
>> +                     /* invalid hierarchy, child bandwidth exceeds parent */
>> +                     if (parent_quota < 0)
>> +                             return -EINVAL;
>> +             }
>> +
>> +             /* if no inherent limit then inherit parent quota */
>> +             if (quota == RUNTIME_INF)
>> +                     quota = parent_quota;
>> +             parent_b->hierarchal_quota = parent_quota;
>> +     }
>> +     cfs_b->hierarchal_quota = quota;
>> +
>> +     return 0;
>> +}
>
> I find this logic pretty weird.
> As long as quota == INF i can overcommit, but as soon as there is some
> quota, i can not ?
>

I don't think I understand what you mean by being able to overcommit
when quota == INF.  For a state of overcommit to exist, an upper limit
must exist to exceed.

> Its clear, that one needs to be able to overcommit runtime, or the
> default runtime for a new cgroup would need to be 0.

Actually this is one of the primary reasons that the semantic of
inheritance was chosen.  By inheriting our parent's quota, all existing
hierarchies remain valid.

There are also many cases which are not expressible without such a semantic:

Consider,

      X
    / | \
   A  B  C

Suppose we have the following constraints:
X   - is the top-level application group; we wish to provision X with 4 cpus
C   - is a threadpool performing some background work; we wish it to consume
      no more than 2 cpus
A/B - split the remaining time available to the hierarchy

If we require absolute limits on A/B there is no way to allow this
usage; we would have to establish a priori hard usage ratios.  If a
usage ratio is desired, using cpu.shares to specify it is a much better
solution, as it gives a soft ratio while remaining work-conserving with
respect to X's limit.


> The root problem imo is that runtime and shares should not be in the
> same cgroup subsystem. The semantics are too different.
>

Shares are relative and represent a lower limit for cgroup bandwidth.
Bandwidth is absolute and represents an upper limit for cgroup bandwidth.

Given that one is absolute and the other relative, the semantics will
be necessarily different.  It does not work to extend bandwidth from
the definition of shares since there are many use-cases which require
their specification to be independent.

They are however intrinsically related since they effectively form
upper/lower bounds on the same parameter and should be controlled from
the same group.

>> +
>> +static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
>> +{
>> +     int ret;
>> +     struct cfs_schedulable_data data = {
>> +             .tg = tg,
>> +             .period = period / NSEC_PER_USEC,
>> +             .quota = quota / NSEC_PER_USEC,
>> +     };
>> +
>> +     if (!sysctl_sched_cfs_bandwidth_consistent)
>> +             return 0;
>> +
>> +     rcu_read_lock();
>> +     ret = walk_tg_tree(tg_cfs_schedulable_down, tg_nop,
>> +                         &data);
>> +     rcu_read_unlock();
>
> walk_tg_tree does the rcu_read_lock itself.
> not necessary to do that here.
> look at __rt_schedulable there is no rcu.
>

Oops yeah -- I wrote this one after fiddling with the new _from
variant (which does require rcu_lock to be external); you're right,
it's not needed here.  Thanks!

>
>> +
>> +     return ret;
>> +}
>
> --
> torben Hohn
>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 04/15] sched: throttle cfs_rq entities which exceed their local quota
  2011-03-23  5:09   ` Mike Galbraith
@ 2011-03-23 20:53     ` Paul Turner
  0 siblings, 0 replies; 63+ messages in thread
From: Paul Turner @ 2011-03-23 20:53 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Tue, Mar 22, 2011 at 10:09 PM, Mike Galbraith <efault@gmx.de> wrote:
> On Tue, 2011-03-22 at 20:03 -0700, Paul Turner wrote:
>
>> +static void check_cfs_rq_quota(struct cfs_rq *cfs_rq)
>> +{
>> +     if (within_bandwidth(cfs_rq))
>> +             return;
>> +
>> +
>
> Nit:  It'd be nice if classes agreed on naming convention to ease
> rummaging.  In rt, it's bandwidth for bean counting parameters, but the
> beans are runtime.  within_bandwidth() vs sched_rt_runtime_exceeded()
> kinda pokes me in the eye when I look at the total.  Seems to me it
> should be uniformly either quota or bandwidth, and uniformly runtime.
>

True enough, I'll rename to bring these more in line with their RT equivalents.

>        -Mike
>
>
>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 02/15] sched: validate CFS quota hierarchies
  2011-03-23  3:03 ` [patch 02/15] sched: validate CFS quota hierarchies Paul Turner
  2011-03-23 10:39   ` torbenh
@ 2011-03-24  6:31   ` Bharata B Rao
  2011-04-08 17:01     ` Peter Zijlstra
  2011-03-29  6:57   ` Hidetoshi Seto
  2011-04-05 13:28   ` Peter Zijlstra
  3 siblings, 1 reply; 63+ messages in thread
From: Bharata B Rao @ 2011-03-24  6:31 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

On Tue, Mar 22, 2011 at 08:03:28PM -0700, Paul Turner wrote:
> Add constraints validation for CFS bandwidth hierachies.
> 
> +static u64 normalize_cfs_quota(struct task_group *tg,
> +		               struct cfs_schedulable_data *d)
> +{
> +	u64 quota, period;
> +	struct load_weight lw;
> +
> +	if (tg == d->tg) {
> +		period = d->period;
> +		quota = d->quota;
> +	} else {
> +		period = tg_get_cfs_period(tg);
> +		quota = tg_get_cfs_quota(tg);
> +	}
> +
> +	if (quota == RUNTIME_INF)
> +		return RUNTIME_INF;
> +
> +	lw.weight = period;
> +	lw.inv_weight = 0;
> +
> +	return calc_delta_mine(quota, max_cfs_quota_period, &lw) - 1;

Time to rename calc_delta_mine to something more meaningful ?

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 04/15] sched: throttle cfs_rq entities which exceed their local quota
  2011-03-23  3:03 ` [patch 04/15] sched: throttle cfs_rq entities which exceed their local quota Paul Turner
  2011-03-23  5:09   ` Mike Galbraith
@ 2011-03-24  6:36   ` Bharata B Rao
  2011-03-24  7:40     ` Paul Turner
  2011-04-05 13:28   ` Peter Zijlstra
  2 siblings, 1 reply; 63+ messages in thread
From: Bharata B Rao @ 2011-03-24  6:36 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Tue, Mar 22, 2011 at 08:03:30PM -0700, Paul Turner wrote:
> In account_cfs_rq_quota() (via update_curr()) we track consumption versus a
> cfs_rqs locally assigned quota and whether there is global quota available 
> to provide a refill when it runs out.
> 
> In the case that there is no quota remaining it's necessary to throttle so
> that execution ceases until the susbequent period.  While it is at this
> boundary that we detect (and signal for, via reshed_task) that a throttle is
> required, the actual operation is deferred until put_prev_entity().
> 
> At this point the cfs_rq is marked as throttled and not re-enqueued, this
> avoids potential interactions with throttled runqueues in the event that we
> are not immediately able to evict the running task.
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> Signed-off-by: Nikhil Rao <ncrao@google.com>
> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> ---
>  kernel/sched.c      |    2 
>  kernel/sched_fair.c |  117 +++++++++++++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 113 insertions(+), 6 deletions(-)
> 
> Index: tip/kernel/sched.c
> ===================================================================
> --- tip.orig/kernel/sched.c
> +++ tip/kernel/sched.c
> @@ -386,7 +386,7 @@ struct cfs_rq {
>  	unsigned long load_contribution;
>  #endif
>  #ifdef CONFIG_CFS_BANDWIDTH
> -	int quota_enabled;
> +	int quota_enabled, throttled;
>  	s64 quota_remaining;
>  #endif
>  #endif
> Index: tip/kernel/sched_fair.c
> ===================================================================
> --- tip.orig/kernel/sched_fair.c
> +++ tip/kernel/sched_fair.c
> @@ -321,9 +321,6 @@ find_matching_se(struct sched_entity **s
> 
>  #endif	/* CONFIG_FAIR_GROUP_SCHED */
> 
> -static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
> -		                 unsigned long delta_exec);
> -
>  /**************************************************************
>   * Scheduling class tree data structure manipulation methods:
>   */
> @@ -588,6 +585,9 @@ __update_curr(struct cfs_rq *cfs_rq, str
>  #endif
>  }
> 
> +static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
> +		unsigned long delta_exec);
> +
>  static void update_curr(struct cfs_rq *cfs_rq)
>  {
>  	struct sched_entity *curr = cfs_rq->curr;
> @@ -1221,6 +1221,9 @@ static struct sched_entity *pick_next_en
>  	return se;
>  }
> 
> +static void throttle_cfs_rq(struct cfs_rq *cfs_rq);
> +static inline int within_bandwidth(struct cfs_rq *cfs_rq);
> +
>  static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
>  {
>  	/*
> @@ -1230,6 +1233,9 @@ static void put_prev_entity(struct cfs_r
>  	if (prev->on_rq)
>  		update_curr(cfs_rq);
> 
> +	if (!within_bandwidth(cfs_rq))
> +		throttle_cfs_rq(cfs_rq);
> +
>  	check_spread(cfs_rq, prev);
>  	if (prev->on_rq) {
>  		update_stats_wait_start(cfs_rq, prev);
> @@ -1241,6 +1247,8 @@ static void put_prev_entity(struct cfs_r
>  	cfs_rq->curr = NULL;
>  }
> 
> +static void check_cfs_rq_quota(struct cfs_rq *cfs_rq);
> +
>  static void
>  entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>  {
> @@ -1249,6 +1257,9 @@ entity_tick(struct cfs_rq *cfs_rq, struc
>  	 */
>  	update_curr(cfs_rq);
> 
> +	/* check that entity's usage is still within quota (if enabled) */
> +	check_cfs_rq_quota(cfs_rq);
> +
>  	/*
>  	 * Update share accounting for long-running entities.
>  	 */
> @@ -1294,6 +1305,46 @@ static inline u64 sched_cfs_bandwidth_sl
>         return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
>  }
> 
> +static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
> +{
> +	return cfs_rq->throttled;
> +}
> +
> +static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
> +{
> +	struct task_group *tg;
> +	struct sched_entity *se;
> +
> +	if (cfs_rq_throttled(cfs_rq))
> +		return 1;
> +
> +	tg = cfs_rq->tg;
> +	se = tg->se[cpu_of(rq_of(cfs_rq))];
> +	if (!se)
> +		return 0;
> +
> +	for_each_sched_entity(se) {
> +		if (cfs_rq_throttled(cfs_rq_of(se)))
> +			return 1;
> +	}
> +
> +	return 0;
> +}
> +
> +static inline int within_bandwidth(struct cfs_rq *cfs_rq)
> +{
> +	return !cfs_rq->quota_enabled || cfs_rq->quota_remaining > 0;
> +}
> +
> +static void check_cfs_rq_quota(struct cfs_rq *cfs_rq)
> +{
> +	if (within_bandwidth(cfs_rq))
> +		return;
> +
> +
> +	resched_task(rq_of(cfs_rq)->curr);
> +}
> +
>  static void request_cfs_rq_quota(struct cfs_rq *cfs_rq)
>  {
>  	struct task_group *tg = cfs_rq->tg;
> @@ -1330,6 +1381,29 @@ static void account_cfs_rq_quota(struct 
>  	request_cfs_rq_quota(cfs_rq);
>  }
> 
> +static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
> +{
> +	struct sched_entity *se;
> +
> +	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
> +
> +	/* account load preceding throttle */
> +	update_cfs_load(cfs_rq, 0);
> +
> +	for_each_sched_entity(se) {
> +		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
> +		/* throttled entity or throttle-on-deactivate */
> +		if (!se->on_rq)
> +			break;
> +
> +		dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
> +		if (qcfs_rq->load.weight)
> +			break;
> +	}
> +
> +	cfs_rq->throttled = 1;
> +}

Since throttling is done from put_prev_entity(), iiuc, you will be
doing 'put' for current entities which are not on the tree.  Can you
avoid the dequeue_entity() call here, which I think will anyway bail
out of the actual dequeue (the se != cfs_rq->curr check in dequeue_entity)?

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 15/15] sched: add documentation for bandwidth control
  2011-03-23  3:03 ` [patch 15/15] sched: add documentation for bandwidth control Paul Turner
@ 2011-03-24  6:38   ` Bharata B Rao
  0 siblings, 0 replies; 63+ messages in thread
From: Bharata B Rao @ 2011-03-24  6:38 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

On Tue, Mar 22, 2011 at 08:03:41PM -0700, Paul Turner wrote:
> +
> +System wide settings
> +--------------------
> +The amount of runtime obtained from the global pool every time a CPU wants the
> +group quota locally is controlled by a sysctl parameter
> +sched_cfs_bandwidth_slice_us. The current default is 10ms. This can be changed

The default value mentioned here needs to be changed to 5ms now.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 04/15] sched: throttle cfs_rq entities which exceed their local quota
  2011-03-24  6:36   ` Bharata B Rao
@ 2011-03-24  7:40     ` Paul Turner
  0 siblings, 0 replies; 63+ messages in thread
From: Paul Turner @ 2011-03-24  7:40 UTC (permalink / raw)
  To: bharata
  Cc: linux-kernel, Peter Zijlstra, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Wed, Mar 23, 2011 at 11:36 PM, Bharata B Rao
<bharata@linux.vnet.ibm.com> wrote:
> On Tue, Mar 22, 2011 at 08:03:30PM -0700, Paul Turner wrote:
>> In account_cfs_rq_quota() (via update_curr()) we track consumption versus a
>> cfs_rq's locally assigned quota and whether there is global quota available
>> to provide a refill when it runs out.
>>
>> In the case that there is no quota remaining it's necessary to throttle so
>> that execution ceases until the subsequent period.  While it is at this
>> boundary that we detect (and signal for, via resched_task) that a throttle is
>> required, the actual operation is deferred until put_prev_entity().
>>
>> At this point the cfs_rq is marked as throttled and not re-enqueued; this
>> avoids potential interactions with throttled runqueues in the event that we
>> are not immediately able to evict the running task.
>>
>> Signed-off-by: Paul Turner <pjt@google.com>
>> Signed-off-by: Nikhil Rao <ncrao@google.com>
>> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
>> ---
>>  kernel/sched.c      |    2
>>  kernel/sched_fair.c |  117 +++++++++++++++++++++++++++++++++++++++++++++++++---
>>  2 files changed, 113 insertions(+), 6 deletions(-)
>>
>> Index: tip/kernel/sched.c
>> ===================================================================
>> --- tip.orig/kernel/sched.c
>> +++ tip/kernel/sched.c
>> @@ -386,7 +386,7 @@ struct cfs_rq {
>>       unsigned long load_contribution;
>>  #endif
>>  #ifdef CONFIG_CFS_BANDWIDTH
>> -     int quota_enabled;
>> +     int quota_enabled, throttled;
>>       s64 quota_remaining;
>>  #endif
>>  #endif
>> Index: tip/kernel/sched_fair.c
>> ===================================================================
>> --- tip.orig/kernel/sched_fair.c
>> +++ tip/kernel/sched_fair.c
>> @@ -321,9 +321,6 @@ find_matching_se(struct sched_entity **s
>>
>>  #endif       /* CONFIG_FAIR_GROUP_SCHED */
>>
>> -static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
>> -                              unsigned long delta_exec);
>> -
>>  /**************************************************************
>>   * Scheduling class tree data structure manipulation methods:
>>   */
>> @@ -588,6 +585,9 @@ __update_curr(struct cfs_rq *cfs_rq, str
>>  #endif
>>  }
>>
>> +static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
>> +             unsigned long delta_exec);
>> +
>>  static void update_curr(struct cfs_rq *cfs_rq)
>>  {
>>       struct sched_entity *curr = cfs_rq->curr;
>> @@ -1221,6 +1221,9 @@ static struct sched_entity *pick_next_en
>>       return se;
>>  }
>>
>> +static void throttle_cfs_rq(struct cfs_rq *cfs_rq);
>> +static inline int within_bandwidth(struct cfs_rq *cfs_rq);
>> +
>>  static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
>>  {
>>       /*
>> @@ -1230,6 +1233,9 @@ static void put_prev_entity(struct cfs_r
>>       if (prev->on_rq)
>>               update_curr(cfs_rq);
>>
>> +     if (!within_bandwidth(cfs_rq))
>> +             throttle_cfs_rq(cfs_rq);
>> +
>>       check_spread(cfs_rq, prev);
>>       if (prev->on_rq) {
>>               update_stats_wait_start(cfs_rq, prev);
>> @@ -1241,6 +1247,8 @@ static void put_prev_entity(struct cfs_r
>>       cfs_rq->curr = NULL;
>>  }
>>
>> +static void check_cfs_rq_quota(struct cfs_rq *cfs_rq);
>> +
>>  static void
>>  entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>>  {
>> @@ -1249,6 +1257,9 @@ entity_tick(struct cfs_rq *cfs_rq, struc
>>        */
>>       update_curr(cfs_rq);
>>
>> +     /* check that entity's usage is still within quota (if enabled) */
>> +     check_cfs_rq_quota(cfs_rq);
>> +
>>       /*
>>        * Update share accounting for long-running entities.
>>        */
>> @@ -1294,6 +1305,46 @@ static inline u64 sched_cfs_bandwidth_sl
>>         return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
>>  }
>>
>> +static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
>> +{
>> +     return cfs_rq->throttled;
>> +}
>> +
>> +static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
>> +{
>> +     struct task_group *tg;
>> +     struct sched_entity *se;
>> +
>> +     if (cfs_rq_throttled(cfs_rq))
>> +             return 1;
>> +
>> +     tg = cfs_rq->tg;
>> +     se = tg->se[cpu_of(rq_of(cfs_rq))];
>> +     if (!se)
>> +             return 0;
>> +
>> +     for_each_sched_entity(se) {
>> +             if (cfs_rq_throttled(cfs_rq_of(se)))
>> +                     return 1;
>> +     }
>> +
>> +     return 0;
>> +}
>> +
>> +static inline int within_bandwidth(struct cfs_rq *cfs_rq)
>> +{
>> +     return !cfs_rq->quota_enabled || cfs_rq->quota_remaining > 0;
>> +}
>> +
>> +static void check_cfs_rq_quota(struct cfs_rq *cfs_rq)
>> +{
>> +     if (within_bandwidth(cfs_rq))
>> +             return;
>> +
>> +
>> +     resched_task(rq_of(cfs_rq)->curr);
>> +}
>> +
>>  static void request_cfs_rq_quota(struct cfs_rq *cfs_rq)
>>  {
>>       struct task_group *tg = cfs_rq->tg;
>> @@ -1330,6 +1381,29 @@ static void account_cfs_rq_quota(struct
>>       request_cfs_rq_quota(cfs_rq);
>>  }
>>
>> +static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
>> +{
>> +     struct sched_entity *se;
>> +
>> +     se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
>> +
>> +     /* account load preceding throttle */
>> +     update_cfs_load(cfs_rq, 0);
>> +
>> +     for_each_sched_entity(se) {
>> +             struct cfs_rq *qcfs_rq = cfs_rq_of(se);
>> +             /* throttled entity or throttle-on-deactivate */
>> +             if (!se->on_rq)
>> +                     break;
>> +
>> +             dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
>> +             if (qcfs_rq->load.weight)
>> +                     break;
>> +     }
>> +
>> +     cfs_rq->throttled = 1;
>> +}
>
> Since throttling is done from put_prev_entity(), iiuc, you will be
> doing 'put' for current entities which are not on the tree.  Can you
> avoid the dequeue_entity() call here, which I think will anyway bail
> out of the actual dequeue (the se != cfs_rq->curr check in dequeue_entity)?
>

No -- cfs_rq->curr is still wholly enqueued apart from its residency in
the rb-tree; this includes factors such as the number of runnable
entities and its contribution to load.  The dequeue is necessary; a
throttle is analogous to the current task blocking, only at the group
entity level.

> Regards,
> Bharata.
>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 01/15] sched: introduce primitives to account for CFS bandwidth tracking
  2011-03-23  3:03 ` [patch 01/15] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
@ 2011-03-24 12:38   ` Kamalesh Babulal
  2011-04-05 13:28   ` Peter Zijlstra
  1 sibling, 0 replies; 63+ messages in thread
From: Kamalesh Babulal @ 2011-03-24 12:38 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

* Paul Turner <pjt@google.com> [2011-03-22 20:03:27]:

> In this patch we introduce the notion of CFS bandwidth, partitioned into 
> globally unassigned bandwidth, and locally claimed bandwidth.
> 
> - The global bandwidth is per task_group, it represents a pool of unclaimed
>   bandwidth that cfs_rqs can allocate from.  
> - The local bandwidth is tracked per-cfs_rq, this represents allotments from
>   the global pool bandwidth assigned to a specific cpu.
> 
> Bandwidth is managed via cgroupfs, adding two new interfaces to the cpu subsystem:
> - cpu.cfs_period_us : the bandwidth period in usecs
> - cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be allowed
>   to consume over period above.
> 
> A per-cfs_bandwidth timer is also introduced to handle future refresh at
> period expiration.  There's some minor refactoring here so that
> start_bandwidth_timer() functionality can be shared
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> Signed-off-by: Nikhil Rao <ncrao@google.com>
> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> ---
<snip>

defconfig build on powerpc, with patchset applied on 4bbba111d94 fails

kernel/sched.c: In function 'tg_cfs_bandwidth':
kernel/sched.c:408: error: dereferencing pointer to incomplete type
make[1]: *** [kernel/sched.o] Error 1

CONFIG_CGROUPS=y
CONFIG_CGROUP_SCHED=n
CONFIG_EXPERIMENTAL=y
CONFIG_FAIR_GROUP_SCHED=n
CONFIG_CFS_BANDWIDTH=n

 thanks,
 Kamalesh

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 00/15] CFS Bandwidth Control V5
  2011-03-23  3:03 [patch 00/15] CFS Bandwidth Control V5 Paul Turner
                   ` (14 preceding siblings ...)
  2011-03-23  3:03 ` [patch 15/15] sched: add documentation for bandwidth control Paul Turner
@ 2011-03-24 16:12 ` Bharata B Rao
  2011-03-31  7:57 ` Xiao Guangrong
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 63+ messages in thread
From: Bharata B Rao @ 2011-03-24 16:12 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

On Tue, Mar 22, 2011 at 08:03:26PM -0700, Paul Turner wrote:
> Hi all,
> 
> Please find attached the latest version of bandwidth control for the normal
> scheduling class.  This revision has undergone fairly extensive changes since
> the previous version based largely on the observation that many of the edge
> conditions requiring special casing around update_curr() were a result of
> introducing side-effects into that operation.  By introducing an interstitial
> state, where we recognize that the runqueue is over bandwidth, but not marking
> it throttled until we can actually remove it from the CPU we avoid the
> previous possible interactions with throttled entities which eliminates some
> head-scratching corner cases.

I am seeing hard lockups occasionally, not always reproducible. This particular
one occurred when I had 1 task in a bandwidth-constrained parent group and 10
tasks in its child group which has infinite bandwidth, on a 16 CPU system.

Here is the log...

WARNING: at kernel/watchdog.c:226 watchdog_overflow_callback+0x98/0xc0()
Hardware name: System x3650 M2 -[794796Q]-
Watchdog detected hard LOCKUP on cpu 0
Modules linked in: autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf xt_physdev ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 ext4 jbd2 dm_mirror dm_region_hash dm_log dm_mod kvm_intel kvm uinput matroxfb_base matroxfb_DAC1064 matroxfb_accel matroxfb_Ti3026 matroxfb_g450 g450_pll matroxfb_misc cdc_ether usbnet mii ses enclosure sg serio_raw pcspkr i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma dca i7core_edac edac_core bnx2 ext3 jbd mbcache sr_mod cdrom sd_mod crc_t10dif usb_storage pata_acpi ata_generic ata_piix megaraid_sas qla2xxx scsi_transport_fc scsi_tgt [last unloaded: microcode]
Pid: 0, comm: swapper Not tainted 2.6.38-tip #6
Call Trace:
 <NMI>  [<ffffffff8106558f>] warn_slowpath_common+0x7f/0xc0
 [<ffffffff81065686>] warn_slowpath_fmt+0x46/0x50
 [<ffffffff810d8158>] watchdog_overflow_callback+0x98/0xc0
 [<ffffffff8110fb39>] __perf_event_overflow+0x99/0x250
 [<ffffffff8110d2dd>] ? perf_event_update_userpage+0xbd/0x140
 [<ffffffff8110d220>] ? perf_event_update_userpage+0x0/0x140
 [<ffffffff81110234>] perf_event_overflow+0x14/0x20
 [<ffffffff8101eb66>] intel_pmu_handle_irq+0x306/0x560
 [<ffffffff8150e4c1>] ? hw_breakpoint_exceptions_notify+0x21/0x200
 [<ffffffff8150faf6>] ? kprobe_exceptions_notify+0x16/0x450
 [<ffffffff8150e6f0>] perf_event_nmi_handler+0x50/0xc0
 [<ffffffff81510aa4>] notifier_call_chain+0x94/0xd0
 [<ffffffff81510b4c>] __atomic_notifier_call_chain+0x6c/0xa0
 [<ffffffff81510ae0>] ? __atomic_notifier_call_chain+0x0/0xa0
 [<ffffffff81510b96>] atomic_notifier_call_chain+0x16/0x20
 [<ffffffff81510bce>] notify_die+0x2e/0x30
 [<ffffffff8150d89a>] do_nmi+0xda/0x2a0
 [<ffffffff8150d4e0>] nmi+0x20/0x39
 [<ffffffff8109f4a3>] ? register_lock_class+0xb3/0x550
 <<EOE>>  <IRQ>  [<ffffffff81013e73>] ? native_sched_clock+0x13/0x60
 [<ffffffff810131e9>] ? sched_clock+0x9/0x10
 [<ffffffff81090e0d>] ? sched_clock_cpu+0xcd/0x110
 [<ffffffff810a2348>] __lock_acquire+0x98/0x15c0
 [<ffffffff810a2628>] ? __lock_acquire+0x378/0x15c0
 [<ffffffff81013e73>] ? native_sched_clock+0x13/0x60
 [<ffffffff810131e9>] ? sched_clock+0x9/0x10
 [<ffffffff81049880>] ? tg_unthrottle_down+0x0/0x50
 [<ffffffff810a3928>] lock_acquire+0xb8/0x150
 [<ffffffff81059e9c>] ? distribute_cfs_bandwidth+0xfc/0x1d0
 [<ffffffff8150c146>] _raw_spin_lock+0x36/0x70
 [<ffffffff81059e9c>] ? distribute_cfs_bandwidth+0xfc/0x1d0
 [<ffffffff81059e9c>] distribute_cfs_bandwidth+0xfc/0x1d0
 [<ffffffff81059da0>] ? distribute_cfs_bandwidth+0x0/0x1d0
 [<ffffffff8105a0eb>] sched_cfs_period_timer+0x9b/0x100
 [<ffffffff8105a050>] ? sched_cfs_period_timer+0x0/0x100
 [<ffffffff8108e631>] __run_hrtimer+0x91/0x1f0
 [<ffffffff8108e9fa>] hrtimer_interrupt+0xda/0x250
 [<ffffffff8109a5d9>] tick_do_broadcast+0x49/0x90
 [<ffffffff8109a71c>] tick_handle_oneshot_broadcast+0xfc/0x140
 [<ffffffff8100ecae>] timer_interrupt+0x1e/0x30
 [<ffffffff810d8bcd>] handle_irq_event_percpu+0x5d/0x230
 [<ffffffff810d8e28>] handle_irq_event+0x58/0x80
 [<ffffffff810dbaae>] ? handle_edge_irq+0x1e/0xe0
 [<ffffffff810dbaff>] handle_edge_irq+0x6f/0xe0
 [<ffffffff8100e449>] handle_irq+0x49/0xa0
 [<ffffffff81516bed>] do_IRQ+0x5d/0xe0
 [<ffffffff8150ce53>] ret_from_intr+0x0/0x1a
 <EOI>  [<ffffffff8109dbbd>] ? trace_hardirqs_off+0xd/0x10
 [<ffffffff812dd074>] ? acpi_idle_enter_bm+0x242/0x27a
 [<ffffffff812dd06d>] ? acpi_idle_enter_bm+0x23b/0x27a
 [<ffffffff813ee532>] cpuidle_idle_call+0xc2/0x260
 [<ffffffff8100c07c>] cpu_idle+0xbc/0x110
 [<ffffffff814f0937>] rest_init+0xb7/0xc0
 [<ffffffff814f0880>] ? rest_init+0x0/0xc0
 [<ffffffff81dfffa2>] start_kernel+0x41c/0x427
 [<ffffffff81dff346>] x86_64_start_reservations+0x131/0x135
 [<ffffffff81dff44d>] x86_64_start_kernel+0x103/0x112

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 02/15] sched: validate CFS quota hierarchies
  2011-03-23  3:03 ` [patch 02/15] sched: validate CFS quota hierarchies Paul Turner
  2011-03-23 10:39   ` torbenh
  2011-03-24  6:31   ` Bharata B Rao
@ 2011-03-29  6:57   ` Hidetoshi Seto
  2011-04-04 23:10     ` Paul Turner
  2011-04-05 13:28   ` Peter Zijlstra
  3 siblings, 1 reply; 63+ messages in thread
From: Hidetoshi Seto @ 2011-03-29  6:57 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

(2011/03/23 12:03), Paul Turner wrote:
> @@ -9251,7 +9255,13 @@ static int tg_set_cfs_bandwidth(struct t
>  	if (period > max_cfs_quota_period)
>  		return -EINVAL;
>  
> -	mutex_lock(&mutex);
> +	mutex_lock(&cfs_constraints_mutex);
> +	if (sysctl_sched_cfs_bandwidth_consistent) {
> +		ret = __cfs_schedulable(tg, period, quota);

At this point:
 period => scale in ns unit
 quota  => scale in ns unit, or RUNTIME_INF

And both are unsigned. But...

> @@ -9339,6 +9350,108 @@ static int cpu_cfs_period_write_u64(stru
>  	return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
>  }
>  
> +
> +struct cfs_schedulable_data {
> +	struct task_group *tg;
> +	u64 period, quota;
> +};
> +
> +/*
> + * normalize group quota/period to be quota/max_period
> + * note: units are usecs
> + */
> +static u64 normalize_cfs_quota(struct task_group *tg,
> +		               struct cfs_schedulable_data *d)
> +{
> +	u64 quota, period;
> +	struct load_weight lw;
> +
> +	if (tg == d->tg) {
> +		period = d->period;
> +		quota = d->quota;
> +	} else {
> +		period = tg_get_cfs_period(tg);
> +		quota = tg_get_cfs_quota(tg);
> +	}

... at this point:
 period => scale in us unit
 quota  => scale in us unit, or -1
Moreover:
 d->period => (scale in ns unit) / NSEC_PER_USEC
 d->quota  => (scale in ns unit, or RUNTIME_INF) / NSEC_PER_USEC

Therefore, ...

> +
> +	if (quota == RUNTIME_INF)
> +		return RUNTIME_INF;

This check doesn't work properly.
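
Concretely -- a standalone sketch, assuming RUNTIME_INF is the usual
((u64)~0ULL): once __cfs_schedulable() has divided the quota by
NSEC_PER_USEC, the comparison against RUNTIME_INF can never match again:

#include <stdio.h>
#include <stdint.h>

#define RUNTIME_INF   ((uint64_t)~0ULL)
#define NSEC_PER_USEC 1000ULL

int main(void)
{
        /* what d->quota ends up holding for an "unlimited" group */
        uint64_t quota = RUNTIME_INF / NSEC_PER_USEC;

        /* prints 0: the early return in normalize_cfs_quota() is never taken */
        printf("%d\n", quota == RUNTIME_INF);
        return 0;
}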

I found this problem because I could not get a child group back to
being unconstrained:

[root@localhost group0]# cat cpu.cfs_*
500000
500000
[root@localhost group0]# cat sub0/cpu.cfs_*
500000
100000
[root@localhost group0]# cat sub1/cpu.cfs_*
500000
100000
[root@localhost group0]# echo -1 > sub1/cpu.cfs_quota_us
bash: echo: write error: Invalid argument

I confirmed that this write error is removed by the following
change.  I'm looking forward to seeing your V6 soon.

Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>


Thanks,
H.Seto

---
 kernel/sched.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 6d764b5..c8f9820 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9467,8 +9467,8 @@ static u64 normalize_cfs_quota(struct task_group *tg,
 		period = d->period;
 		quota = d->quota;
 	} else {
-		period = tg_get_cfs_period(tg);
-		quota = tg_get_cfs_quota(tg);
+		period = ktime_to_ns(tg_cfs_bandwidth(tg)->period);
+		quota = tg_cfs_bandwidth(tg)->quota;
 	}
 
 	if (quota == RUNTIME_INF)
@@ -9515,8 +9515,8 @@ static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
 	int ret;
 	struct cfs_schedulable_data data = {
 		.tg = tg,
-		.period = period / NSEC_PER_USEC,
-		.quota = quota / NSEC_PER_USEC,
+		.period = period,
+		.quota = quota,
 	};
 
 	if (!sysctl_sched_cfs_bandwidth_consistent)
-- 
1.7.4


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [patch 00/15] CFS Bandwidth Control V5
  2011-03-23  3:03 [patch 00/15] CFS Bandwidth Control V5 Paul Turner
                   ` (15 preceding siblings ...)
  2011-03-24 16:12 ` [patch 00/15] CFS Bandwidth Control V5 Bharata B Rao
@ 2011-03-31  7:57 ` Xiao Guangrong
  2011-04-04 23:10   ` Paul Turner
  2011-04-05 13:28 ` Peter Zijlstra
  2011-05-20  2:12 ` Test for CFS Bandwidth Control V6 Xiao Guangrong
  18 siblings, 1 reply; 63+ messages in thread
From: Xiao Guangrong @ 2011-03-31  7:57 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

[-- Attachment #1: Type: text/plain, Size: 954 bytes --]

On 03/23/2011 11:03 AM, Paul Turner wrote:
> Hi all,
> 
> Please find attached the latest version of bandwidth control for the normal
> scheduling class.  This revision has undergone fairly extensive changes since
> the previous version based largely on the observation that many of the edge
> conditions requiring special casing around update_curr() were a result of
> introducing side-effects into that operation.  By introducing an interstitial
> state, where we recognize that the runqueue is over bandwidth, but not marking
> it throttled until we can actually remove it from the CPU we avoid the
> previous possible interactions with throttled entities which eliminates some
> head-scratching corner cases.
> 

Hi Paul,

I have written some code to test this patchset.  While running the attached
test case, the test program gets blocked; it seems the child tasks cannot be
killed, and after one or two hours the watchdog expires and crashes the box.

[-- Attachment #2: bwc.tar.bz2 --]
[-- Type: application/x-bzip, Size: 2987 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 00/15] CFS Bandwidth Control V5
  2011-03-31  7:57 ` Xiao Guangrong
@ 2011-04-04 23:10   ` Paul Turner
  0 siblings, 0 replies; 63+ messages in thread
From: Paul Turner @ 2011-04-04 23:10 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

On Thu, Mar 31, 2011 at 12:57 AM, Xiao Guangrong
<xiaoguangrong@cn.fujitsu.com> wrote:
> On 03/23/2011 11:03 AM, Paul Turner wrote:
>> Hi all,
>>
>> Please find attached the latest version of bandwidth control for the normal
>> scheduling class.  This revision has undergone fairly extensive changes since
>> the previous version based largely on the observation that many of the edge
>> conditions requiring special casing around update_curr() were a result of
>> introducing side-effects into that operation.  By introducing an interstitial
>> state, where we recognize that the runqueue is over bandwidth, but not marking
>> it throttled until we can actually remove it from the CPU we avoid the
>> previous possible interactions with throttled entities which eliminates some
>> head-scratching corner cases.
>>
>
> Hi Paul,
>
> I have written some code to test this patchset.  While running the attached
> test case, the test program gets blocked; it seems the child tasks cannot be
> killed, and after one or two hours the watchdog expires and crashes the box.
>

Sorry, I was out last week.

I'm taking a look at this now, I believe the interaction lies
somewhere within the quota return paths.

Thanks for taking the time to test!

- Paul

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 02/15] sched: validate CFS quota hierarchies
  2011-03-29  6:57   ` Hidetoshi Seto
@ 2011-04-04 23:10     ` Paul Turner
  0 siblings, 0 replies; 63+ messages in thread
From: Paul Turner @ 2011-04-04 23:10 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

On Mon, Mar 28, 2011 at 11:57 PM, Hidetoshi Seto
<seto.hidetoshi@jp.fujitsu.com> wrote:
> (2011/03/23 12:03), Paul Turner wrote:
>> @@ -9251,7 +9255,13 @@ static int tg_set_cfs_bandwidth(struct t
>>       if (period > max_cfs_quota_period)
>>               return -EINVAL;
>>
>> -     mutex_lock(&mutex);
>> +     mutex_lock(&cfs_constraints_mutex);
>> +     if (sysctl_sched_cfs_bandwidth_consistent) {
>> +             ret = __cfs_schedulable(tg, period, quota);
>
> At this point:
>  period => scale in ns unit
>  quota  => scale in ns unit, or RUNTIME_INF
>
> And both are unsigned. But...

Ack.. I had accounted for this at one point but I obviously churned it out.

Good catch, thanks!


>
>> @@ -9339,6 +9350,108 @@ static int cpu_cfs_period_write_u64(stru
>>       return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
>>  }
>>
>> +
>> +struct cfs_schedulable_data {
>> +     struct task_group *tg;
>> +     u64 period, quota;
>> +};
>> +
>> +/*
>> + * normalize group quota/period to be quota/max_period
>> + * note: units are usecs
>> + */
>> +static u64 normalize_cfs_quota(struct task_group *tg,
>> +                            struct cfs_schedulable_data *d)
>> +{
>> +     u64 quota, period;
>> +     struct load_weight lw;
>> +
>> +     if (tg == d->tg) {
>> +             period = d->period;
>> +             quota = d->quota;
>> +     } else {
>> +             period = tg_get_cfs_period(tg);
>> +             quota = tg_get_cfs_quota(tg);
>> +     }
>
> ... at this point:
>  period => scale in us unit
>  quota  => scale in us unit, or -1
> Moreover:
>  d->period => (scale in ns unit) / NSEC_PER_USEC
>  d->quota  => (scale in ns unit, or RUNTIME_INF) / NSEC_PER_USEC
>
> Therefore, ...
>
>> +
>> +     if (quota == RUNTIME_INF)
>> +             return RUNTIME_INF;
>
> This check doesn't work properly.

Right.  Fixed, sorry for the delayed response -- was out last week.

>
> I found this problem because I could not get a child group back to
> being unconstrained:
>
> [root@localhost group0]# cat cpu.cfs_*
> 500000
> 500000
> [root@localhost group0]# cat sub0/cpu.cfs_*
> 500000
> 100000
> [root@localhost group0]# cat sub1/cpu.cfs_*
> 500000
> 100000
> [root@localhost group0]# echo -1 > sub1/cpu.cfs_quota_us
> bash: echo: write error: Invalid argument
>
> I confirmed that this write error is removed by the following
> change.  I'm looking forward to seeing your V6 soon.
>
> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
>
>
> Thanks,
> H.Seto
>
> ---
>  kernel/sched.c |    8 ++++----
>  1 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 6d764b5..c8f9820 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -9467,8 +9467,8 @@ static u64 normalize_cfs_quota(struct task_group *tg,
>                period = d->period;
>                quota = d->quota;
>        } else {
> -               period = tg_get_cfs_period(tg);
> -               quota = tg_get_cfs_quota(tg);
> +               period = ktime_to_ns(tg_cfs_bandwidth(tg)->period);
> +               quota = tg_cfs_bandwidth(tg)->quota;
>        }
>
>        if (quota == RUNTIME_INF)
> @@ -9515,8 +9515,8 @@ static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
>        int ret;
>        struct cfs_schedulable_data data = {
>                .tg = tg,
> -               .period = period / NSEC_PER_USEC,
> -               .quota = quota / NSEC_PER_USEC,
> +               .period = period,
> +               .quota = quota,
>        };
>
>        if (!sysctl_sched_cfs_bandwidth_consistent)
> --
> 1.7.4
>
>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 00/15] CFS Bandwidth Control V5
  2011-03-23  3:03 [patch 00/15] CFS Bandwidth Control V5 Paul Turner
                   ` (16 preceding siblings ...)
  2011-03-31  7:57 ` Xiao Guangrong
@ 2011-04-05 13:28 ` Peter Zijlstra
  2011-05-20  2:12 ` Test for CFS Bandwidth Control V6 Xiao Guangrong
  18 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2011-04-05 13:28 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

On Tue, 2011-03-22 at 20:03 -0700, Paul Turner wrote:
> 
> In particular I'd like to thank Peter Zijlstra who provided extensive comments 
> and review for the last series. 

While I still loathe the feature the patches look much saner than last
time.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 14/15] sched: return unused quota on voluntary sleep
  2011-03-23  3:03 ` [patch 14/15] sched: return unused quota on voluntary sleep Paul Turner
@ 2011-04-05 13:28   ` Peter Zijlstra
  2011-04-06  2:25     ` Paul Turner
  0 siblings, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2011-04-05 13:28 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

On Tue, 2011-03-22 at 20:03 -0700, Paul Turner wrote:
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com> 

Shouldn't that be Suggested-by or somesuch? You wrote the patch and
you're sending the patch; afaict it never passed through Bharata, so
where does that SOB come from?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 13/15] sched: expire slack quota using generation counters
  2011-03-23  3:03 ` [patch 13/15] sched: expire slack quota using generation counters Paul Turner
@ 2011-04-05 13:28   ` Peter Zijlstra
  2011-04-06  7:22     ` Paul Turner
  0 siblings, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2011-04-05 13:28 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

On Tue, 2011-03-22 at 20:03 -0700, Paul Turner wrote:

Argh, this patch is terrible for the reason that it changes the whole
accounting just introduced, leaving me to re-open all the previous
patches to look up how the stuff worked before.

> @@ -436,8 +438,10 @@ void init_cfs_bandwidth(struct cfs_bandw
>  	raw_spin_lock_init(&cfs_b->lock);
>  	cfs_b->quota = cfs_b->runtime = quota;
>  	cfs_b->period = ns_to_ktime(period);
> +	cfs_b->quota_generation = 0;
>  	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
>  
> +
>  	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>  	cfs_b->period_timer.function = sched_cfs_period_timer;

We're in desperate need of more whitespace there? :-)

> @@ -9333,6 +9337,8 @@ static int tg_set_cfs_bandwidth(struct t
>  	raw_spin_lock_irq(&cfs_b->lock);
>  	cfs_b->period = ns_to_ktime(period);
>  	cfs_b->runtime = cfs_b->quota = quota;
> +
> +	cfs_bump_quota_generation(cfs_b);
>  	raw_spin_unlock_irq(&cfs_b->lock);
>  
>  	for_each_possible_cpu(i) {
> Index: tip/kernel/sched_fair.c
> ===================================================================
> --- tip.orig/kernel/sched_fair.c
> +++ tip/kernel/sched_fair.c
> @@ -1331,11 +1331,25 @@ static void check_cfs_rq_quota(struct cf
>  	resched_task(rq_of(cfs_rq)->curr);
>  }
>  
> +static void cfs_bump_quota_generation(struct cfs_bandwidth *cfs_b)
> +{
> +	cfs_b->quota_generation++;
> +	smp_mb();
> +}

Memory barriers come in pairs and with a comment, you fail on both
counts.
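
For illustration of the convention only (not a claim about what ordering
this particular path actually needs), the usual shape is a documented
pair, roughly:

        /* writer: publish the new quota before the generation bump */
        cfs_b->runtime = new_runtime;
        smp_wmb();      /* pairs with smp_rmb() on the reader side */
        cfs_b->quota_generation++;

        /* reader: seeing the new generation implies seeing the data
         * published before it */
        gen = cfs_b->quota_generation;
        smp_rmb();      /* pairs with smp_wmb() on the writer side */
        runtime = cfs_b->runtime;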

> +
> +static inline int cfs_rq_quota_current(struct cfs_rq *cfs_rq)
> +{
> +	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> +
> +	return cfs_rq->quota_generation == cfs_b->quota_generation;
> +}
> +
>  static void request_cfs_rq_quota(struct cfs_rq *cfs_rq)
>  {
>  	struct task_group *tg = cfs_rq->tg;
>  	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
>  	u64 amount = 0, min_amount;
> +	int generation;

Not initialized,

>  	min_amount = sched_cfs_bandwidth_slice() + (-cfs_rq->quota_remaining);
>  
> @@ -1347,10 +1361,18 @@ static void request_cfs_rq_quota(struct 
>  		} else {
>  			amount = min_amount;
>  		}
> +		generation = cfs_b->quota_generation;
>  		raw_spin_unlock(&cfs_b->lock);
>  	}

and since there's an if there, one can fail it, leaving generation
uninitialized,

>  
> +	/* a deficit should be carried forwards, surplus should be dropped */
> +
> +	if (generation != cfs_rq->quota_generation &&
> +	    cfs_rq->quota_remaining > 0)
> +		cfs_rq->quota_remaining = 0;
> +
>  	cfs_rq->quota_remaining += amount;
> +	cfs_rq->quota_generation = generation;
>  }

Resulting in uninitialized usage right there.
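
A minimal way to make that well-defined -- just a sketch, the actual
semantics are of course pjt's call -- is to give the local snapshot a
sane default:

        u64 amount = 0, min_amount;
        /* default to our own generation so the expiry check below is
         * well-defined even when we never take cfs_b->lock */
        int generation = cfs_rq->quota_generation;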


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 11/15] sched: hierarchical task accounting for SCHED_OTHER
  2011-03-23  3:03 ` [patch 11/15] sched: hierarchical task accounting for SCHED_OTHER Paul Turner
@ 2011-04-05 13:28   ` Peter Zijlstra
  0 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2011-04-05 13:28 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

On Tue, 2011-03-22 at 20:03 -0700, Paul Turner wrote:
> With task entities participating in throttled sub-trees it is possible for
> task activation/de-activation to not lead to root visible changes to
> rq->nr_running.  This in turn leads to incorrect idle and weight-per-task load
> balance decisions.
> 
> To allow correct accounting we move responsibility for updating rq->nr_running
> to the respective sched::classes.  In the fair-group case this update is
> hierarchical, tracking the number of active tasks rooted at each group entity.
> 
> This also allows us to fix a small buglet in pick_next_task() when group
> scheduling is enabled.
> 
> Note: technically this issue also exists with the existing sched_rt
> throttling; however due to the nearly complete provisioning of system
> resources for rt scheduling this is much less common by default.

Shouldn't this patch live at the start of the series?



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 09/15] sched: add exports tracking cfs bandwidth control statistics
  2011-03-23  3:03 ` [patch 09/15] sched: add exports tracking cfs bandwidth control statistics Paul Turner
@ 2011-04-05 13:28   ` Peter Zijlstra
  0 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2011-04-05 13:28 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Tue, 2011-03-22 at 20:03 -0700, Paul Turner wrote:
> @@ -1431,10 +1432,15 @@ static void unthrottle_cfs_rq(struct cfs
>         struct rq *rq = rq_of(cfs_rq);
>         struct sched_entity *se;
>         struct tg_unthrottle_down_data udd;
> +       struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
>  
>         se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
>  
>         update_rq_clock(rq);
> +       raw_spin_lock(&cfs_b->lock);
> +       cfs_b->throttled_time += (rq->clock - cfs_rq->throttled_timestamp);
> +       raw_spin_unlock(&cfs_b->lock);
> +       cfs_rq->throttled_timestamp = 0;
>  
>         /* don't include throttled window for load statistics */
>         udd.cpu = rq->cpu;

> @@ -1523,6 +1530,11 @@ static int do_sched_cfs_period_timer(str
>         raw_spin_lock(&cfs_b->lock);
>         cfs_b->runtime = runtime;
>         cfs_b->runtime_assigned = runtime_assigned;
> +
> +       /* update throttled stats */
> +       cfs_b->nr_periods++;

Aren't you under-counting the periods, shouldn't that be += overrun?

> +       if (throttled)
> +               cfs_b->nr_throttled++;

idem?

>         raw_spin_unlock(&cfs_b->lock);
>  
>         return idle;
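
Re the two '+= overrun' questions above, presumably something along these
lines (a sketch, assuming overrun is the hrtimer_forward() result passed
into this function):

        /* account every elapsed period since the last refresh */
        cfs_b->nr_periods += overrun;
        if (throttled)
                cfs_b->nr_throttled += overrun;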


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 10/15] sched: (fixlet) dont update shares twice on on_rq parent
  2011-03-23  3:03 ` [patch 10/15] sched: (fixlet) dont update shares twice on on_rq parent Paul Turner
@ 2011-04-05 13:28   ` Peter Zijlstra
  0 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2011-04-05 13:28 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

On Tue, 2011-03-22 at 20:03 -0700, Paul Turner wrote:
> In dequeue_task_fair() we bail on dequeue when we encounter a parenting entity
> with additional weight.  However, we perform a double shares update on this
> entity since we continue the shares update traversal from that point, despite
> dequeue_entity() having already updated its queuing cfs_rq.
> 
> Avoid this by starting from the parent when we resume. 

I'm thinking this fixlet is orthogonal to the rest of this patch-set and
should go in now? (Usually such things live at the head of a series)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 08/15] sched: migrate throttled tasks on HOTPLUG
  2011-03-23  3:03 ` [patch 08/15] sched: migrate throttled tasks on HOTPLUG Paul Turner
@ 2011-04-05 13:28   ` Peter Zijlstra
  2011-04-06  2:31     ` Paul Turner
  0 siblings, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2011-04-05 13:28 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

On Tue, 2011-03-22 at 20:03 -0700, Paul Turner wrote:
> Throttled tasks are invisible to cpu-offline since they are not eligible for
> selection by pick_next_task().  The regular 'escape' path for a thread that is
> blocked at offline is via ttwu->select_task_rq; however, this will not handle a
> throttled group since there are no individual thread wakeups on an unthrottle.
> 
> Resolve this by unthrottling offline cpus so that threads can be migrated.

Fair enough, the flip side is that they all can again increase their
debt by a whole tick, right?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 05/15] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
  2011-03-23  3:03 ` [patch 05/15] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh Paul Turner
@ 2011-04-05 13:28   ` Peter Zijlstra
  2011-04-05 13:33     ` Peter Zijlstra
  2011-04-05 13:28   ` Peter Zijlstra
  2011-04-05 13:28   ` Peter Zijlstra
  2 siblings, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2011-04-05 13:28 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Tue, 2011-03-22 at 20:03 -0700, Paul Turner wrote:
> +static u64 distribute_cfs_bandwidth(struct cfs_bandwidth *cfs_b, u64 runtime)
> +{
> +       int i;
> +       u64 quota, remaining = runtime;
> +       const struct cpumask *span;
> +
> +       rcu_read_lock();
> +       span = sched_bw_period_mask();
> +       for_each_cpu(i, span) {
> +               struct rq *rq = cpu_rq(i);
> +               struct cfs_rq *cfs_rq = cfs_bandwidth_tg(cfs_b)->cfs_rq[i];
> +
> +               raw_spin_lock(&rq->lock);
> +               if (within_bandwidth(cfs_rq))
> +                       goto next;
> +
> +               quota = -cfs_rq->quota_remaining;
> +               quota += sched_cfs_bandwidth_slice();
> +               quota = min(quota, remaining);
> +               remaining -= quota;
> +
> +               cfs_rq->quota_remaining += quota;
> +               if (cfs_rq_throttled(cfs_rq) && cfs_rq->quota_remaining > 0)
> +                       unthrottle_cfs_rq(cfs_rq);
> +
> +next:
> +               raw_spin_unlock(&rq->lock);
> +
> +               if (!remaining)
> +                       break;
> +       }
> +       rcu_read_unlock();
> +
> +       return remaining;
> +}
> +
>  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
>  {
> +       u64 runtime, runtime_assigned;
> +       int idle;
> +
> +       raw_spin_lock(&cfs_b->lock);
> +       runtime = cfs_b->quota;
> +       idle = cfs_b->runtime == cfs_b->runtime_assigned;
> +       raw_spin_unlock(&cfs_b->lock);
> +
> +       if (runtime == RUNTIME_INF)
> +               return 1;
> +
> +       runtime *= overrun;
> +       runtime_assigned = runtime;
> +
> +       runtime = distribute_cfs_bandwidth(cfs_b, runtime);
> +
> +       raw_spin_lock(&cfs_b->lock);
> +       cfs_b->runtime = runtime;
> +       cfs_b->runtime_assigned = runtime_assigned;
> +       raw_spin_unlock(&cfs_b->lock);
> +
> +       return idle;
>  } 

There's something fishy there: it looks like ->runtime can end up being
greater than ->quota in case of overrun > 1, and that shouldn't be
possible -- the refresh timer should never over-fill the bucket.

The whole ->runtime_assigned stuff had me confused for a while, but I
guess its the easiest way to determine if we indeed had runtime
consumption.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 07/15] sched: prevent interactions between throttled entities and load-balance
  2011-03-23  3:03 ` [patch 07/15] sched: prevent interactions between throttled entities and load-balance Paul Turner
@ 2011-04-05 13:28   ` Peter Zijlstra
  0 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2011-04-05 13:28 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

On Tue, 2011-03-22 at 20:03 -0700, Paul Turner wrote:
> O(n**2)

I'm thinking normal people write that as O(n^2) ?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 03/15] sched: accumulate per-cfs_rq cpu usage
  2011-03-23  3:03 ` [patch 03/15] sched: accumulate per-cfs_rq cpu usage Paul Turner
@ 2011-04-05 13:28   ` Peter Zijlstra
  2011-04-06 20:44     ` Paul Turner
  2011-04-05 13:28   ` Peter Zijlstra
  1 sibling, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2011-04-05 13:28 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Tue, 2011-03-22 at 20:03 -0700, Paul Turner wrote:
> +static void request_cfs_rq_quota(struct cfs_rq *cfs_rq)
> +{
> +       struct task_group *tg = cfs_rq->tg;
> +       struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> +       u64 amount = 0, min_amount;
> +
> +       min_amount = sched_cfs_bandwidth_slice() + (-cfs_rq->quota_remaining);
> +
> +       if (cfs_b->runtime > 0 || cfs_b->quota == RUNTIME_INF) {
> +               raw_spin_lock(&cfs_b->lock);
> +               if (cfs_b->quota != RUNTIME_INF) {
> +                       amount = min(cfs_b->runtime, min_amount);
> +                       cfs_b->runtime -= amount;
> +               } else {
> +                       amount = min_amount;
> +               }

So why would quota be RUNTIME_INF and quota_enabled be true? If it's due
to a race when fiddling with the cgroup filesystem setting things up,
that else branch wants a comment; if it's for another reason altogether,
there's also a comment missing somewhere ;-)
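
If it really is just the cgroupfs-update race suggested above, a comment
along these lines in that else branch would do (sketch only):

        } else {
                /*
                 * cfs_rq->quota_enabled is updated per-cpu under rq->lock
                 * and can briefly lag a tg_set_cfs_bandwidth() switch back
                 * to RUNTIME_INF; treat that window as unlimited and hand
                 * out a full slice.
                 */
                amount = min_amount;
        }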

> +               raw_spin_unlock(&cfs_b->lock);
> +       }
> +
> +       cfs_rq->quota_remaining += amount;
> +}
> +
> +static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
> +               unsigned long delta_exec)
> +{
> +       if (!cfs_rq->quota_enabled)
> +               return;
> +
> +       cfs_rq->quota_remaining -= delta_exec;
> +
> +       if (cfs_rq->quota_remaining > 0)
> +               return;
> +
> +       request_cfs_rq_quota(cfs_rq);
> +} 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 05/15] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
  2011-03-23  3:03 ` [patch 05/15] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh Paul Turner
  2011-04-05 13:28   ` Peter Zijlstra
@ 2011-04-05 13:28   ` Peter Zijlstra
  2011-04-05 13:28   ` Peter Zijlstra
  2 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2011-04-05 13:28 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Tue, 2011-03-22 at 20:03 -0700, Paul Turner wrote:
> +#ifdef CONFIG_SMP
> +static inline const struct cpumask *sched_bw_period_mask(void)
> +{
> +       return cpu_rq(smp_processor_id())->rd->span;
> +}
> +#else
> +static inline const struct cpumask *sched_bw_period_mask(void)
> +{
> +       return cpu_online_mask;
> +}
> +#endif 

insert rant about merging cpusets into the sched and memcg
controllers :-)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 04/15] sched: throttle cfs_rq entities which exceed their local quota
  2011-03-23  3:03 ` [patch 04/15] sched: throttle cfs_rq entities which exceed their local quota Paul Turner
  2011-03-23  5:09   ` Mike Galbraith
  2011-03-24  6:36   ` Bharata B Rao
@ 2011-04-05 13:28   ` Peter Zijlstra
  2011-04-05 23:15     ` Paul Turner
  2 siblings, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2011-04-05 13:28 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Tue, 2011-03-22 at 20:03 -0700, Paul Turner wrote:

> @@ -1249,6 +1257,9 @@ entity_tick(struct cfs_rq *cfs_rq, struc
>  	 */
>  	update_curr(cfs_rq);
>  
> +	/* check that entity's usage is still within quota (if enabled) */
> +	check_cfs_rq_quota(cfs_rq);
> +
>  	/*
>  	 * Update share accounting for long-running entities.
>  	 */

You already have a hook in update_curr() to account quota, why not also
use that to trigger the reschedule? request_cfs_rq_quota() already has
the information we failed to replenish the local quota.

Then when you've gotten rid of check_cfs_rq_quota() there isn't a second
user of within_bandwidth() and you can fold:

> @@ -1230,6 +1233,9 @@ static void put_prev_entity(struct cfs_r
>  	if (prev->on_rq)
>  		update_curr(cfs_rq);
>  
> +	if (!within_bandwidth(cfs_rq))
> +		throttle_cfs_rq(cfs_rq);
> +
>  	check_spread(cfs_rq, prev);
>  	if (prev->on_rq) {
>  		update_stats_wait_start(cfs_rq, prev);

Into a single hook. 
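
Concretely, the folded-up form could look something like this (sketch
only, not the eventual code):

static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
                unsigned long delta_exec)
{
        if (!cfs_rq->quota_enabled)
                return;

        cfs_rq->quota_remaining -= delta_exec;

        if (cfs_rq->quota_remaining > 0)
                return;

        request_cfs_rq_quota(cfs_rq);

        /* still over quota after the refill attempt: force a reschedule,
         * the actual throttle happens at put_prev_entity() */
        if (cfs_rq->quota_remaining <= 0)
                resched_task(rq_of(cfs_rq)->curr);
}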

> @@ -1447,10 +1544,15 @@ static void dequeue_task_fair(struct rq 
>         for_each_sched_entity(se) {
>                 cfs_rq = cfs_rq_of(se);
>                 dequeue_entity(cfs_rq, se, flags);
> -
> +               /* end evaluation on throttled cfs_rq */
> +               if (cfs_rq_throttled(cfs_rq)) {
> +                       se = NULL;
> +                       break;
> +               }
>                 /* Don't dequeue parent if it has other entities besides us */
>                 if (cfs_rq->load.weight)
>                         break;
> +               check_cfs_rq_quota(cfs_rq);
>                 flags |= DEQUEUE_SLEEP;
>         }

dequeue_entity() calls update_curr(), so again, by folding
check_cfs_rq_quota() into your update_curr() hook this becomes simpler.

> +static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
> +{
> +       struct task_group *tg;
> +       struct sched_entity *se;
> +
> +       if (cfs_rq_throttled(cfs_rq))
> +               return 1;
> +
> +       tg = cfs_rq->tg;
> +       se = tg->se[cpu_of(rq_of(cfs_rq))];
> +       if (!se)
> +               return 0;
> +
> +       for_each_sched_entity(se) {
> +               if (cfs_rq_throttled(cfs_rq_of(se)))
> +                       return 1;
> +       }
> +
> +       return 0;
> +}

You can actually call for_each_sched_entity() with se==NULL, saves a few lines.
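
i.e. something like this sketch, relying on for_each_sched_entity()
already tolerating a NULL starting se:

static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
{
        struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];

        if (cfs_rq_throttled(cfs_rq))
                return 1;

        /* se is NULL for the root task_group; the loop simply doesn't run */
        for_each_sched_entity(se) {
                if (cfs_rq_throttled(cfs_rq_of(se)))
                        return 1;
        }

        return 0;
}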

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 05/15] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
  2011-03-23  3:03 ` [patch 05/15] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh Paul Turner
  2011-04-05 13:28   ` Peter Zijlstra
  2011-04-05 13:28   ` Peter Zijlstra
@ 2011-04-05 13:28   ` Peter Zijlstra
  2 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2011-04-05 13:28 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Tue, 2011-03-22 at 20:03 -0700, Paul Turner wrote:
> The !CONFIG_RT_GROUP_SCHED && CONFIG_SMP case has been collapsed to use
> rd->span instead of cpu_online_mask since I think that was incorrect before
> [don't actually want to hit cpu's outside of your root_domain for RT bandwidth.]

That sounds about right, but definitely wants to be a separate patch.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 03/15] sched: accumulate per-cfs_rq cpu usage
  2011-03-23  3:03 ` [patch 03/15] sched: accumulate per-cfs_rq cpu usage Paul Turner
  2011-04-05 13:28   ` Peter Zijlstra
@ 2011-04-05 13:28   ` Peter Zijlstra
  2011-04-06 20:47     ` Paul Turner
  1 sibling, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2011-04-05 13:28 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Tue, 2011-03-22 at 20:03 -0700, Paul Turner wrote:
> +       min_amount = sched_cfs_bandwidth_slice() + (-cfs_rq->quota_remaining);

funny way of writing that ;-)
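
i.e. presumably just the plain form -- same arithmetic, sketch:

        /* a full slice plus whatever deficit we have already run up */
        min_amount = sched_cfs_bandwidth_slice() - cfs_rq->quota_remaining;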

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 02/15] sched: validate CFS quota hierarchies
  2011-03-23  3:03 ` [patch 02/15] sched: validate CFS quota hierarchies Paul Turner
                     ` (2 preceding siblings ...)
  2011-03-29  6:57   ` Hidetoshi Seto
@ 2011-04-05 13:28   ` Peter Zijlstra
  3 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2011-04-05 13:28 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

On Tue, 2011-03-22 at 20:03 -0700, Paul Turner wrote:

> +/*
> + * Whether a CFS bandwidth hierarchy is required to be consistent, that is:
> + *   sum(child_bandwidth) <= parent_bandwidth
> + */
> +unsigned int sysctl_sched_cfs_bandwidth_consistent = 1;

Why have this configurable?


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 01/15] sched: introduce primitives to account for CFS bandwidth tracking
  2011-03-23  3:03 ` [patch 01/15] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
  2011-03-24 12:38   ` Kamalesh Babulal
@ 2011-04-05 13:28   ` Peter Zijlstra
  1 sibling, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2011-04-05 13:28 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Tue, 2011-03-22 at 20:03 -0700, Paul Turner wrote:
> +const u64 max_cfs_quota_period = 5 * NSEC_PER_SEC; /* 5s */

I would prefer that max to be _much_ lower than that, 5 seconds is a
long time to be throttled while holding a lock the rest of the system
wants.



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 05/15] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
  2011-04-05 13:28   ` Peter Zijlstra
@ 2011-04-05 13:33     ` Peter Zijlstra
  0 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2011-04-05 13:33 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Tue, 2011-04-05 at 15:28 +0200, Peter Zijlstra wrote:
> There's something fishy there: it looks like ->runtime can end up being
> greater than ->quota in case of overrun > 1, and that shouldn't be
> possible -- the refresh timer should never over-fill the bucket.

General confusion reigns after reading all those patches, but I think I
might have been wrong there.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 04/15] sched: throttle cfs_rq entities which exceed their local quota
  2011-04-05 13:28   ` Peter Zijlstra
@ 2011-04-05 23:15     ` Paul Turner
  0 siblings, 0 replies; 63+ messages in thread
From: Paul Turner @ 2011-04-05 23:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Tue, Apr 5, 2011 at 6:28 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> On Tue, 2011-03-22 at 20:03 -0700, Paul Turner wrote:
>
> > @@ -1249,6 +1257,9 @@ entity_tick(struct cfs_rq *cfs_rq, struc
> >        */
> >       update_curr(cfs_rq);
> >
> > +     /* check that entity's usage is still within quota (if enabled) */
> > +     check_cfs_rq_quota(cfs_rq);
> > +
> >       /*
> >        * Update share accounting for long-running entities.
> >        */
>
> You already have a hook in update_curr() to account quota, why not also
> use that to trigger the reschedule? request_cfs_rq_quota() already has
> the information we failed to replenish the local quota.
>
> Then when you've gotten rid of check_cfs_rq_quota() there isn't a second
> user of within_bandwidth() and you can fold:
>

This is actually what it looked like originally, but I broke it apart to
avoid a spurious need_resched coming from the put path.

Looking again I realize we actually arbitrarily
clear_tsk_need_resched() on prev out of schedule() now so this isn't a
concern.

>
> > @@ -1230,6 +1233,9 @@ static void put_prev_entity(struct cfs_r
> >       if (prev->on_rq)
> >               update_curr(cfs_rq);
> >
> > +     if (!within_bandwidth(cfs_rq))
> > +             throttle_cfs_rq(cfs_rq);
> > +
> >       check_spread(cfs_rq, prev);
> >       if (prev->on_rq) {
> >               update_stats_wait_start(cfs_rq, prev);
>
> Into a single hook.
>
> > @@ -1447,10 +1544,15 @@ static void dequeue_task_fair(struct rq
> >         for_each_sched_entity(se) {
> >                 cfs_rq = cfs_rq_of(se);
> >                 dequeue_entity(cfs_rq, se, flags);
> > -
> > +               /* end evaluation on throttled cfs_rq */
> > +               if (cfs_rq_throttled(cfs_rq)) {
> > +                       se = NULL;
> > +                       break;
> > +               }
> >                 /* Don't dequeue parent if it has other entities besides us */
> >                 if (cfs_rq->load.weight)
> >                         break;
> > +               check_cfs_rq_quota(cfs_rq);
> >                 flags |= DEQUEUE_SLEEP;
> >         }
>
> dequeue_entity() calls update_curr(), so again, by folding
> check_cfs_rq_quota() into your update_curr() hook this becomes simpler.
>

Yes I preferred it as a single hook out of update_curr, will put it
back that way :)
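
Roughly, with the check folded back into the accounting hook, it could look
something like the following (a sketch only, reusing the helpers already in
the series; the actual repost may differ):

static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
		unsigned long delta_exec)
{
	if (!cfs_rq->quota_enabled)
		return;

	cfs_rq->quota_remaining -= delta_exec;

	if (cfs_rq->quota_remaining > 0)
		return;

	/* try to refill from the global pool */
	request_cfs_rq_quota(cfs_rq);

	/* still out of local quota: ask for a resched so put_prev can throttle */
	if (cfs_rq->quota_remaining <= 0)
		resched_task(rq_of(cfs_rq)->curr);
}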

> > +static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
> > +{
> > +       struct task_group *tg;
> > +       struct sched_entity *se;
> > +
> > +       if (cfs_rq_throttled(cfs_rq))
> > +               return 1;
> > +
> > +       tg = cfs_rq->tg;
> > +       se = tg->se[cpu_of(rq_of(cfs_rq))];
> > +       if (!se)
> > +               return 0;
> > +
> > +       for_each_sched_entity(se) {
> > +               if (cfs_rq_throttled(cfs_rq_of(se)))
> > +                       return 1;
> > +       }
> > +
> > +       return 0;
> > +}
>
> You can actually call for_each_sched_entity() with se==NULL, saves a few lines.

True enough, although this is subsequently subverted by throttle_count

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 14/15] sched: return unused quota on voluntary sleep
  2011-04-05 13:28   ` Peter Zijlstra
@ 2011-04-06  2:25     ` Paul Turner
  0 siblings, 0 replies; 63+ messages in thread
From: Paul Turner @ 2011-04-06  2:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

On Tue, Apr 5, 2011 at 6:28 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2011-03-22 at 20:03 -0700, Paul Turner wrote:
>>
>> Signed-off-by: Paul Turner <pjt@google.com>
>> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
>
> Shouldn't that be Suggested-by or somesuch? You wrote the patch, you
> send the patch, afaict it never passed through Bharata so where does
> that SOB come from?
>

This was previously iterated on off-list as the intent was always to
fix slack quota beyond the initial merge/review.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 08/15] sched: migrate throttled tasks on HOTPLUG
  2011-04-05 13:28   ` Peter Zijlstra
@ 2011-04-06  2:31     ` Paul Turner
  0 siblings, 0 replies; 63+ messages in thread
From: Paul Turner @ 2011-04-06  2:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

On Tue, Apr 5, 2011 at 6:28 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2011-03-22 at 20:03 -0700, Paul Turner wrote:
>> Throttled tasks are invisible to cpu-offline since they are not eligible for
>> selection by pick_next_task().  The regular 'escape' path for a thread that is
>> blocked at offline is via ttwu->select_task_rq, however this will not handle a
>> throttled group since there are no individual thread wakeups on an unthrottle.
>>
>> Resolve this by unthrottling offline cpus so that threads can be migrated.
>
> Fair enough, the flip side is that they all can again increase their
> debt by a whole tick, right?
>

In the case where they enqueue onto a new rq and we can't catch the
bandwidth condition until entity_tick(), I suppose it's possible to grab
up to an extra tick.

This is actually analogous to a task waking up when other cpus have
drained quota -- this case can be properly handled by an explicit
check/throttle in the enqueue_task_fair() path in the non-on-cpu case.
Will add.
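
Roughly along these lines (a sketch under the series' existing helpers --
within_bandwidth()/throttle_cfs_rq(); the helper name and exact placement
below are illustrative, not the posted change):

/*
 * Sketch: when an entity is enqueued onto a cfs_rq whose local quota is
 * already exhausted, throttle right away instead of waiting for the next
 * entity_tick().
 */
static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
{
	if (!cfs_rq->quota_enabled || cfs_rq_throttled(cfs_rq))
		return;

	if (!within_bandwidth(cfs_rq))
		throttle_cfs_rq(cfs_rq);
}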

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 13/15] sched: expire slack quota using generation counters
  2011-04-05 13:28   ` Peter Zijlstra
@ 2011-04-06  7:22     ` Paul Turner
  2011-04-06  8:15       ` Peter Zijlstra
  2011-04-06 11:26       ` Peter Zijlstra
  0 siblings, 2 replies; 63+ messages in thread
From: Paul Turner @ 2011-04-06  7:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

On Tue, Apr 5, 2011 at 6:28 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2011-03-22 at 20:03 -0700, Paul Turner wrote:
>
> Argh, this patch is terrible for the reason that it changes the whole
> accounting just introduced and me having to re-open all the previous
> patches to look up how the stuff worked before.

I didn't think it was too drastic -- the introduction of the
generation is more incremental.  However I agree it does cause
unnecessary churn within the accounting functions across the series.
Given that expiring quota is a requirement, this can be streamlined by
introducing some of these notions earlier in the series as opposed to
bootstrapping them at the end here -- will clean it up.

>
>> @@ -436,8 +438,10 @@ void init_cfs_bandwidth(struct cfs_bandw
>>       raw_spin_lock_init(&cfs_b->lock);
>>       cfs_b->quota = cfs_b->runtime = quota;
>>       cfs_b->period = ns_to_ktime(period);
>> +     cfs_b->quota_generation = 0;
>>       INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
>>
>> +
>>       hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>>       cfs_b->period_timer.function = sched_cfs_period_timer;
>
> We're in desperate need of more whitespace there? :-)

8< 8<

>
>> @@ -9333,6 +9337,8 @@ static int tg_set_cfs_bandwidth(struct t
>>       raw_spin_lock_irq(&cfs_b->lock);
>>       cfs_b->period = ns_to_ktime(period);
>>       cfs_b->runtime = cfs_b->quota = quota;
>> +
>> +     cfs_bump_quota_generation(cfs_b);
>>       raw_spin_unlock_irq(&cfs_b->lock);
>>
>>       for_each_possible_cpu(i) {
>> Index: tip/kernel/sched_fair.c
>> ===================================================================
>> --- tip.orig/kernel/sched_fair.c
>> +++ tip/kernel/sched_fair.c
>> @@ -1331,11 +1331,25 @@ static void check_cfs_rq_quota(struct cf
>>       resched_task(rq_of(cfs_rq)->curr);
>>  }
>>
>> +static void cfs_bump_quota_generation(struct cfs_bandwidth *cfs_b)
>> +{
>> +     cfs_b->quota_generation++;
>> +     smp_mb();
>> +}
>
> Memory barriers come in pairs and with a comment, you fail on both
> counts.
>

scratches head... erm right, I meant to pair this with a read barrier
on querying the generation but balked when I realized that still
yields an lfence on x86 since I didn't want to introduce that to the
update_curr() path.

While we can probably do away with the barrier completely (it's not
critical that we line them up perfectly with the new generation), I've
been thinking about this one and I think I have something a little
nicer that also reduces the shared cache hits.

We can take advantage of the fact that sched_clocks are already
synchronized within 2 jiffies and store the quota's expiration,
instead of a generation, when we refresh.

This effectively yields a fairly simple control flow (we can use
rq->clock since we're always paired with update_rq_clock operations):
a) our rq->clock < expiration always implies quota is valid

Obviously if our cpu clock is ahead of the one that issued the quota,
our quota is still valid since the real deadline is even further
behind.
Even if our cpu's clock is behind the max 1.99 jiffies the amount of
time that the stale quota can remain valid is basically already within
our potential margin of error since for a long running process we
check on each tick edge anyway.

 b) our rq->clock > expiration

Again there's two cases, if our cpu clock is behind (or equal) then
the deadline has indeed passed and the quota is expired.  This can be
confirmed by comparing the global deadline with our local one (the
global expiration will have advanced with quota refresh for this to be
true).

We can also catch that our cpu is potentially ahead -- by the fact
that our rq->clock > expiration but that the global expiration has not
yet advanced.  In this case we recognize that our quota is still valid
and extend our local expiration time by either the maximum margin of
error or some fraction thereof (say 1 jiffy), which is guaranteed to
push us back in case a) above.  Again this is within our existing
margin of error due to entity_tick() alignment.

This ends up looking a lot simpler and avoids much of the pressure on
the global variable since we need to compare against it in the case
where our clock passes expiration, once a quota period (as the
extension will put us in case a where we know we don't need to
consider it).

This ends up simpler than the generation muck and can be introduced
cleanly earlier in the series, avoiding the churn mentioned above.

Make sense?
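
(For concreteness, a rough sketch of the two cases above -- the quota_expires
fields are made-up names for illustration, not the final ones:)

static void expire_cfs_rq_quota(struct cfs_rq *cfs_rq)
{
	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
	struct rq *rq = rq_of(cfs_rq);

	/* case a): local clock has not reached the deadline, quota is valid */
	if (rq->clock < cfs_rq->quota_expires)
		return;

	if (cfs_rq->quota_expires != cfs_b->quota_expires) {
		/* case b1): the global deadline advanced, our quota expired */
		cfs_rq->quota_remaining = 0;
	} else {
		/*
		 * case b2): the global deadline has not advanced, so our clock
		 * is merely ahead of the cpu that issued the quota; extend the
		 * local deadline by the maximum skew (~1 jiffy) to land back
		 * in case a).
		 */
		cfs_rq->quota_expires += TICK_NSEC;
	}
}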

>> +
>> +static inline int cfs_rq_quota_current(struct cfs_rq *cfs_rq)
>> +{
>> +     struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
>> +
>> +     return cfs_rq->quota_generation == cfs_b->quota_generation;
>> +}
>> +
>>  static void request_cfs_rq_quota(struct cfs_rq *cfs_rq)
>>  {
>>       struct task_group *tg = cfs_rq->tg;
>>       struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
>>       u64 amount = 0, min_amount;
>> +     int generation;
>
> Not initialized,
>
>>       min_amount = sched_cfs_bandwidth_slice() + (-cfs_rq->quota_remaining);
>>
>> @@ -1347,10 +1361,18 @@ static void request_cfs_rq_quota(struct
>>               } else {
>>                       amount = min_amount;
>>               }
>> +             generation = cfs_b->quota_generation;
>>               raw_spin_unlock(&cfs_b->lock);
>>       }
>
> and since there's an if there, one can fail it, leaving generation
> uninitialized,
>
>>
>> +     /* a deficit should be carried forwards, surplus should be dropped */
>> +
>> +     if (generation != cfs_rq->quota_generation &&
>> +         cfs_rq->quota_remaining > 0)
>> +             cfs_rq->quota_remaining = 0;
>> +
>>       cfs_rq->quota_remaining += amount;
>> +     cfs_rq->quota_generation = generation;
>>  }
>
> Resulting in uninitialized usage right there.
>

Truth

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 13/15] sched: expire slack quota using generation counters
  2011-04-06  7:22     ` Paul Turner
@ 2011-04-06  8:15       ` Peter Zijlstra
  2011-04-06 11:26       ` Peter Zijlstra
  1 sibling, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2011-04-06  8:15 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

On Wed, 2011-04-06 at 00:22 -0700, Paul Turner wrote:
> > Argh, this patch is terrible for the reason that it changes the whole
> > accounting just introduced and me having to re-open all the previous
> > patches to look up how the stuff worked before.
> 
> I didn't think it was too drastic -- the introduction of the
> generation is more incremental.  However I agree it does cause
> unnecessary churn within the accounting functions across the series.
> Given that expiring quota is a requirement, this can be streamlined by
> introducing some of these notions earlier in the series as opposed to
> bootstrapping them at the end here -- will clean it up. 

Thanks, and yes, I imagine that when working with the patches for a few
days on end it's all not really much of a problem, but for me it's the
first look in many weeks and this is the 13th patch in, and my brain is
about to turn to mush ;-)

I'll try and read the massive comment on the memory barrier bits once
I've managed to properly wake up..

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 13/15] sched: expire slack quota using generation counters
  2011-04-06  7:22     ` Paul Turner
  2011-04-06  8:15       ` Peter Zijlstra
@ 2011-04-06 11:26       ` Peter Zijlstra
  1 sibling, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2011-04-06 11:26 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

On Wed, 2011-04-06 at 00:22 -0700, Paul Turner wrote:
> 
> scratches head... erm right, I meant to pair this with a read barrier
> on querying the generation but balked when I realized that still
> yields an lfence on x86 since I didn't want to introduce that to the
> update_curr() path.
> 
> While we can probably do away with the barrier completely (it's not
> critical that we line them up perfectly with the new generation), I've
> been thinking about this one and I think I have something a little
> nicer that also reduces the shared cache hits.
> 
> We can take advantage of the fact that sched_clocks are already
> synchronized within 2 jiffies and store the quota's expiration,
> instead of a generation, when we refresh.
> 
> This effectively yields a fairly simple control flow (we can use
> rq->clock since we're always paired with update_rq_clock operations):
> a) our rq->clock < expiration always implies quota is valid
> 
> Obviously if our cpu clock is ahead of the one that issued the quota,
> our quota is still valid since the real deadline is even further
> behind.
> Even if our cpu's clock is behind the max 1.99 jiffies the amount of
> time that the stale quota can remain valid is basically already within
> our potential margin of error since for a long running process we
> check on each tick edge anyway.
> 
>  b) our rq->clock > expiration
> 
> Again there's two cases, if our cpu clock is behind (or equal) then
> the deadline has indeed passed and the quota is expired.  This can be
> confirmed by comparing the global deadline with our local one (the
> global expiration will have advanced with quota refresh for this to be
> true).
> 
> We can also catch that our cpu is potentially ahead -- by the fact
> that our rq->clock > expiration but that the global expiration has not
> yet advanced.  In this case we recognize that our quota is still valid
> and extend our local expiration time by either the maximum margin of
> error or some fraction thereof (say 1 jiffy), which is guaranteed to
> push us back in case a) above.  Again this is within our existing
> margin of error due to entity_tick() alignment.
> 
> This ends up looking a lot simpler and avoids much of the pressure on
> the global variable since we need to compare against it in the case
> where our clock passes expiration, once a quota period (as the
> extension will put us in case a where we know we don't need to
> consider it).
> 
> This ends up simpler than the generation muck and can be introduced
> cleanly earlier in the series, avoiding the churn mentioned above.
> 
> Make sense? 

Yes, that ought to work.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 03/15] sched: accumulate per-cfs_rq cpu usage
  2011-04-05 13:28   ` Peter Zijlstra
@ 2011-04-06 20:44     ` Paul Turner
  0 siblings, 0 replies; 63+ messages in thread
From: Paul Turner @ 2011-04-06 20:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Tue, Apr 5, 2011 at 6:28 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2011-03-22 at 20:03 -0700, Paul Turner wrote:
>> +static void request_cfs_rq_quota(struct cfs_rq *cfs_rq)
>> +{
>> +       struct task_group *tg = cfs_rq->tg;
>> +       struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
>> +       u64 amount = 0, min_amount;
>> +
>> +       min_amount = sched_cfs_bandwidth_slice() + (-cfs_rq->quota_remaining);
>> +
>> +       if (cfs_b->runtime > 0 || cfs_b->quota == RUNTIME_INF) {
>> +               raw_spin_lock(&cfs_b->lock);
>> +               if (cfs_b->quota != RUNTIME_INF) {
>> +                       amount = min(cfs_b->runtime, min_amount);
>> +                       cfs_b->runtime -= amount;
>> +               } else {
>> +                       amount = min_amount;
>> +               }
>
> So why would quota be RUNTIME_INF and quota_enabled be true? If its due
> to a race when fiddling with the cgroup filesystem setting things up
> that else branch wants a comment, if its for another reason all together
> there's also a comment missing somewhere ;-)
>

It's the first -- it's possible for global quota to transition between
infinite/finite states, but since we own the lock we may not have
seen our cfs_rq->quota_enabled update yet.  In this case we don't want
to perturb the global state /out/ of RUNTIME_INF (by subtracting from
it).

While making this more readable the speculative check can also be
dropped since that only ever benefits us a single time when we're out
of quota (as the next operation in that case is to throttle).

Cleaning..
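
Roughly, the cleaned-up version might look like this (a sketch, not the
actual repost):

static void request_cfs_rq_quota(struct cfs_rq *cfs_rq)
{
	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
	u64 amount = 0, min_amount;

	/* quota_remaining is <= 0 here, so this yields a positive top-up */
	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->quota_remaining;

	raw_spin_lock(&cfs_b->lock);
	if (cfs_b->quota == RUNTIME_INF) {
		/*
		 * Global quota may have gone infinite after our (possibly
		 * stale) cfs_rq->quota_enabled check; hand out a full slice
		 * without perturbing the INF pool.
		 */
		amount = min_amount;
	} else {
		amount = min(cfs_b->runtime, min_amount);
		cfs_b->runtime -= amount;
	}
	raw_spin_unlock(&cfs_b->lock);

	cfs_rq->quota_remaining += amount;
}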


>> +               raw_spin_unlock(&cfs_b->lock);
>> +       }
>> +
>> +       cfs_rq->quota_remaining += amount;
>> +}
>> +
>> +static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
>> +               unsigned long delta_exec)
>> +{
>> +       if (!cfs_rq->quota_enabled)
>> +               return;
>> +
>> +       cfs_rq->quota_remaining -= delta_exec;
>> +
>> +       if (cfs_rq->quota_remaining > 0)
>> +               return;
>> +
>> +       request_cfs_rq_quota(cfs_rq);
>> +}
>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 03/15] sched: accumulate per-cfs_rq cpu usage
  2011-04-05 13:28   ` Peter Zijlstra
@ 2011-04-06 20:47     ` Paul Turner
  0 siblings, 0 replies; 63+ messages in thread
From: Paul Turner @ 2011-04-06 20:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Nikhil Rao

On Tue, Apr 5, 2011 at 6:28 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2011-03-22 at 20:03 -0700, Paul Turner wrote:
>> +       min_amount = sched_cfs_bandwidth_slice() + (-cfs_rq->quota_remaining);
>
> funny way of writing that ;-)
>

This was intended to call out that quota_remaining is <= 0 at this
point -- making it a positive sum.   A comment would do this just as
well though.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 02/15] sched: validate CFS quota hierarchies
  2011-03-24  6:31   ` Bharata B Rao
@ 2011-04-08 17:01     ` Peter Zijlstra
  0 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2011-04-08 17:01 UTC (permalink / raw)
  To: bharata
  Cc: Paul Turner, linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov

On Thu, 2011-03-24 at 12:01 +0530, Bharata B Rao wrote:
> On Tue, Mar 22, 2011 at 08:03:28PM -0700, Paul Turner wrote:
> > Add constraints validation for CFS bandwidth hierachies.
> > 
> > +static u64 normalize_cfs_quota(struct task_group *tg,
> > +		               struct cfs_schedulable_data *d)
> > +{
> > +	u64 quota, period;
> > +	struct load_weight lw;
> > +
> > +	if (tg == d->tg) {
> > +		period = d->period;
> > +		quota = d->quota;
> > +	} else {
> > +		period = tg_get_cfs_period(tg);
> > +		quota = tg_get_cfs_quota(tg);
> > +	}
> > +
> > +	if (quota == RUNTIME_INF)
> > +		return RUNTIME_INF;
> > +
> > +	lw.weight = period;
> > +	lw.inv_weight = 0;
> > +
> > +	return calc_delta_mine(quota, max_cfs_quota_period, &lw) - 1;
> 
> Time to rename calc_delta_mine to something more meaningful ?

Or not use it there at all:

 - I'm not sure why we have different periods per cgroup, given that we
don't have EDF like scheduling and there's a very limited set of useful
periods. Too small and overhead increases like mad, too large and we get
lots of priority inversion crap.

 - It's not a fast path by any means, so a straightforward division wouldn't
hurt anybody.
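
I.e. something along these lines (a sketch of the straightforward-division
variant; overflow of quota * max_cfs_quota_period would still need care in
practice):

static u64 normalize_cfs_quota(struct task_group *tg,
			       struct cfs_schedulable_data *d)
{
	u64 quota, period;

	if (tg == d->tg) {
		period = d->period;
		quota = d->quota;
	} else {
		period = tg_get_cfs_period(tg);
		quota = tg_get_cfs_quota(tg);
	}

	if (quota == RUNTIME_INF)
		return RUNTIME_INF;

	/* plain division instead of calc_delta_mine(); this is not a fast path */
	return div64_u64(quota * max_cfs_quota_period, period);
}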



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 12/15] sched: maintain throttled rqs as a list
  2011-03-23  3:03 ` [patch 12/15] sched: maintain throttled rqs as a list Paul Turner
@ 2011-04-22  2:50   ` Hidetoshi Seto
  2011-04-24 21:23     ` Paul Turner
  0 siblings, 1 reply; 63+ messages in thread
From: Hidetoshi Seto @ 2011-04-22  2:50 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

Hi Paul,

(2011/03/23 12:03), Paul Turner wrote:
> @@ -1434,7 +1435,11 @@ static void throttle_cfs_rq(struct cfs_r
>  	if (!se)
>  		rq->nr_running += task_delta;
>  
> +	raw_spin_lock(&cfs_b->lock);
>  	cfs_rq->throttled = 1;
> +	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
> +	raw_spin_unlock(&cfs_b->lock);
> +
>  	cfs_rq->throttled_timestamp = rq_of(cfs_rq)->clock;
>  }
>  

Though I'm looking forward to seeing your updates soon, just FYI,
I found that somehow throttle_cfs_rq() was called with an already
throttled cfs_rq.

So it breaks the throttled_cfs_rq list by repeating list_add_tail_rcu()
and results in lockups like those Bharata and Xiao already reported.

There should be a better fix, but the following first aid to put_prev_entity()
worked for me:

-       if (!within_bandwidth(cfs_rq))
+       if (!within_bandwidth(cfs_rq) && !cfs_rq->throttled)
                throttle_cfs_rq(cfs_rq);
        else
                return_cfs_rq_quota(cfs_rq);

I believe you can do better next time.


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 12/15] sched: maintain throttled rqs as a list
  2011-04-22  2:50   ` Hidetoshi Seto
@ 2011-04-24 21:23     ` Paul Turner
  0 siblings, 0 replies; 63+ messages in thread
From: Paul Turner @ 2011-04-24 21:23 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

On Thu, Apr 21, 2011 at 7:50 PM, Hidetoshi Seto
<seto.hidetoshi@jp.fujitsu.com> wrote:
> Hi Paul,
>
> (2011/03/23 12:03), Paul Turner wrote:
>> @@ -1434,7 +1435,11 @@ static void throttle_cfs_rq(struct cfs_r
>>       if (!se)
>>               rq->nr_running += task_delta;
>>
>> +     raw_spin_lock(&cfs_b->lock);
>>       cfs_rq->throttled = 1;
>> +     list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
>> +     raw_spin_unlock(&cfs_b->lock);
>> +
>>       cfs_rq->throttled_timestamp = rq_of(cfs_rq)->clock;
>>  }
>>
>
> Though I'm looking forward to seeing your updates soon, just FYI,
> I found that somehow throttle_cfs_rq() was called with an already
> throttled cfs_rq.
>
> So it breaks the throttled_cfs_rq list by repeating list_add_tail_rcu()
> and results in lockups like those Bharata and Xiao already reported.
>
> There should be a better fix, but the following first aid to put_prev_entity()
> worked for me:
>
> -       if (!within_bandwidth(cfs_rq))
> +       if (!within_bandwidth(cfs_rq) && !cfs_rq->throttled)
>                throttle_cfs_rq(cfs_rq);
>        else
>                return_cfs_rq_quota(cfs_rq);
>
> I believe you can do better next time.
>

Aha!

Good catch -- Thanks Hidetoshi!  I've been looking for that one.

I'm figuring out exactly how this came about and I will repost.

Thanks!

>
> Thanks,
> H.Seto
>
>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Test for CFS Bandwidth Control V6
  2011-03-23  3:03 [patch 00/15] CFS Bandwidth Control V5 Paul Turner
                   ` (17 preceding siblings ...)
  2011-04-05 13:28 ` Peter Zijlstra
@ 2011-05-20  2:12 ` Xiao Guangrong
  2011-05-24  0:53   ` Hidetoshi Seto
  18 siblings, 1 reply; 63+ messages in thread
From: Xiao Guangrong @ 2011-05-20  2:12 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov

Hi Paul,

I'm so sorry for sending this mail in a new thread, since I didn't
receive your V6 patchset from LKML.

It seems the patchset cannot be applied, since there is a conflict between
patch 3 and patch 5:

========Quote========
[patch 03/15] sched: introduce primitives to account for CFS bandwidth tracking
 
+#ifdef CONFIG_CFS_BANDWIDTH
+       int runtime_enabled;
+       s64 runtime_remaining;
+#endif
 #endif
 };

+#ifdef CONFIG_CFS_BANDWIDTH
+static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
+{
+       return &tg->cfs_bandwidth;
+}
+
+static inline u64 default_cfs_period(void);
+

[patch 05/15] sched: add a timer to handle CFS bandwidth refresh

@@ -394,12 +400,38 @@ static inline struct cfs_bandwidth *tg_c

 #ifdef CONFIG_CFS_BANDWIDTH
 static inline u64 default_cfs_period(void);
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
+
+static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
+{
+       struct cfs_bandwidth *cfs_b =
+               container_of(timer, struct cfs_bandwidth, period_timer);
+       ktime_t now;
+       int overrun;

========End quote========

I downloaded the patchset from the Internet; did I miss a newer version?

I have done some tests after fixing the conflict by hand; the test below can
cause a box crash:

========Quote cpu_hotlpug.sh ========
#!/bin/sh

ROOT_PATH="/mnt"

def_quota=30000
def_period=100000

pid=
creat_process()
{

	nice -n $1 cat /dev/zero > /dev/null &
	pid=$!

	if [ $2 -ne -1 ]; then
		taskset -pc $2 $pid &> /dev/null
	fi
}

HOTPLUG_PATH=$ROOT_PATH/cpu-hotplug

mount -t cgroup -o cpu none $ROOT_PATH

mkdir $HOTPLUG_PATH

echo $def_quota > $HOTPLUG_PATH/cpu.cfs_quota_us
echo $def_period > $HOTPLUG_PATH/cpu.cfs_period_us

# create 3 tasks for every cpu

for((i=0;i<3;i++))
{
	creat_process -6 1
	echo $pid > $HOTPLUG_PATH/tasks
}

for((i=0;i<3;i++))
{
	creat_process -6 2
	echo $pid > $HOTPLUG_PATH/tasks
}

for((i=0;i<3;i++))
{
	creat_process -6 3
	echo $pid > $HOTPLUG_PATH/tasks
}

echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/online

echo 0 > /sys/devices/system/cpu/cpu2/online
echo 1 > /sys/devices/system/cpu/cpu2/online

echo 0 > /sys/devices/system/cpu/cpu3/online
echo 1 > /sys/devices/system/cpu/cpu3/online

killall -9 cat

rmdir $HOTPLUG_PATH
umount $ROOT_PATH
======== End quote cpu_hotlpug.sh ========

Sorry to disturb you if the bug is known.

Thanks!




^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Test for CFS Bandwidth Control V6
  2011-05-20  2:12 ` Test for CFS Bandwidth Control V6 Xiao Guangrong
@ 2011-05-24  0:53   ` Hidetoshi Seto
  2011-05-24  7:56     ` Xiao Guangrong
  2011-06-08  2:54     ` Paul Turner
  0 siblings, 2 replies; 63+ messages in thread
From: Hidetoshi Seto @ 2011-05-24  0:53 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Paul Turner, linux-kernel, Peter Zijlstra, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar,
	Pavel Emelyanov

Hi Paul and Xiao,

Please check/test a fix at the foot of this mail.

(2011/05/20 11:12), Xiao Guangrong wrote:
> Hi Paul,
> 
> I'm so sorry for sending this mail in a new thread, since I didn't
> receive your V6 patchset from LKML.
>
> It seems the patchset cannot be applied, since there is a conflict between
> patch 3 and patch 5:
> 
> ========Quote========
(snip)
> ========End quote========

Maybe I've fixed it by hand, or git-am is so wonderful.

I believe Paul will do it right for next time.

> 
> I downloaded the patchset from the Internet; did I miss a newer version?
> 
> I have done some tests after fixing the conflict by hand; the test below can
> cause a box crash:
> 
> ========Quote cpu_hotlpug.sh ========
(snip)
> ======== End quote cpu_hotlpug.sh ========
> 
> Sorry to disturb you if the bug is known.
> 
> Thanks!

Thank you for reporting it, Xiao!

I confirmed that running your test causes a hang-up on my box.
And after some investigation, I found that this is an infinite loop
in migrate_task() due to a miscalculation of rq->nr_running: when a
task is queued to a throttled entity, nr_running is incremented both at
queuing and at unthrottling.

I made a fix for this bug and it seems to work well for me.
Could you try this patch and give us your feedback, Xiao?

Thanks,
H.Seto

---
 kernel/sched_fair.c |   28 +++++++++++++---------------
 1 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 3936393..544072f 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1537,7 +1537,7 @@ static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	walk_tg_tree_from(cfs_rq->tg, tg_unthrottle_down, tg_nop,
 			  (void *)&udd);
 
-	if (!cfs_rq->load.weight)
+	if (!cfs_rq->h_nr_running)
 		return;
 
 	task_delta = cfs_rq->h_nr_running;
@@ -1843,10 +1843,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq->h_nr_running++;
 
 		/* end evaluation on throttled cfs_rq */
-		if (cfs_rq_throttled(cfs_rq)) {
-			se = NULL;
-			break;
-		}
+		if (cfs_rq_throttled(cfs_rq))
+			goto done;
+
 		flags = ENQUEUE_WAKEUP;
 	}
 
@@ -1855,14 +1854,14 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq->h_nr_running++;
 
 		if (cfs_rq_throttled(cfs_rq))
-			break;
+			goto done;
 
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
-	if (!se)
-		inc_nr_running(rq);
+	inc_nr_running(rq);
+done:
 	hrtick_update(rq);
 }
 
@@ -1885,10 +1884,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq->h_nr_running--;
 
 		/* end evaluation on throttled cfs_rq */
-		if (cfs_rq_throttled(cfs_rq)) {
-			se = NULL;
-			break;
-		}
+		if (cfs_rq_throttled(cfs_rq))
+			goto done;
+
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
 			 /* Avoid double update below. */
@@ -1910,14 +1908,14 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq->h_nr_running--;
 
 		if (cfs_rq_throttled(cfs_rq))
-			break;
+			goto done;
 
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
-	if (!se)
-		dec_nr_running(rq);
+	dec_nr_running(rq);
+done:
 	hrtick_update(rq);
 }
 
-- 
1.7.4.4



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: Test for CFS Bandwidth Control V6
  2011-05-24  0:53   ` Hidetoshi Seto
@ 2011-05-24  7:56     ` Xiao Guangrong
  2011-06-08  2:54     ` Paul Turner
  1 sibling, 0 replies; 63+ messages in thread
From: Xiao Guangrong @ 2011-05-24  7:56 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: Paul Turner, linux-kernel, Peter Zijlstra, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar,
	Pavel Emelyanov, Hu Tao

On 05/24/2011 08:53 AM, Hidetoshi Seto wrote:

> I confirmed that running your test causes a hang-up on my box.
> And after some investigation, I found that this is an infinite loop
> in migrate_task() due to a miscalculation of rq->nr_running: when a
> task is queued to a throttled entity, nr_running is incremented both at
> queuing and at unthrottling.
>
> I made a fix for this bug and it seems to work well for me.
> Could you try this patch and give us your feedback, Xiao?
> 

Thanks for your nice fix, Seto! Everything goes well now!

Tested-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Test for CFS Bandwidth Control V6
  2011-05-24  0:53   ` Hidetoshi Seto
  2011-05-24  7:56     ` Xiao Guangrong
@ 2011-06-08  2:54     ` Paul Turner
  2011-06-08  5:55       ` Hidetoshi Seto
  1 sibling, 1 reply; 63+ messages in thread
From: Paul Turner @ 2011-06-08  2:54 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: Xiao Guangrong, LKML, Peter Zijlstra, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar,
	Pavel Emelyanov

[ Sorry for the delayed response, I was out on vacation for the second
half of May until last week -- I've now caught up on email and am
preparing the next posting ]

On Mon, May 23, 2011 at 5:53 PM, Hidetoshi Seto
<seto.hidetoshi@jp.fujitsu.com> wrote:
> Hi Paul and Xiao,
>
> Please check/test a fix at the foot of this mail.
>
> (2011/05/20 11:12), Xiao Guangrong wrote:
>> Hi Paul,
>>
>> I'm so sorry for sending this mail in a new thread, since I didn't
>> receive your V6 patchset from LKML.
>>
>> It seems the patchset cannot be applied, since there is a conflict between
>> patch 3 and patch 5:
>>
>> ========Quote========
> (snip)
>> ========End quote========
>
> Maybe I've fixed it by hand, or git-am is so wonderful.
>
> I believe Paul will do it right for next time.
>
>>
>> I downloaded the patchset from the Internet; did I miss a newer version?
>>
>> I have done some tests after fixing the conflict by hand; the test below can
>> cause a box crash:
>>
>> ========Quote cpu_hotlpug.sh ========
> (snip)
>> ======== End quote cpu_hotlpug.sh ========
>>
>> Sorry to disturb you if the bug is known.
>>
>> Thanks!
>
> Thank you for reporting it, Xiao!
>
> I confirmed that running your test causes a hang-up on my box.
> And after some investigation, I found that this is an infinite loop
> in migrate_task() due to a miscalculation of rq->nr_running: when a
> task is queued to a throttled entity, nr_running is incremented both at
> queuing and at unthrottling.
>
> I made a fix for this bug and it seems to work well for me.
> Could you try this patch and give us your feedback, Xiao?
>
> Thanks,
> H.Seto
>
> ---
>  kernel/sched_fair.c |   28 +++++++++++++---------------
>  1 files changed, 13 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 3936393..544072f 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1537,7 +1537,7 @@ static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>        walk_tg_tree_from(cfs_rq->tg, tg_unthrottle_down, tg_nop,
>                          (void *)&udd);
>
> -       if (!cfs_rq->load.weight)
> +       if (!cfs_rq->h_nr_running)
>                return;
>

Why change here?

>        task_delta = cfs_rq->h_nr_running;
> @@ -1843,10 +1843,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>                cfs_rq->h_nr_running++;
>
>                /* end evaluation on throttled cfs_rq */
> -               if (cfs_rq_throttled(cfs_rq)) {
> -                       se = NULL;

Hmm..  yeah this is a casualty of moving the h_nr_running computations
in-line as a part of the previous refactoring within the last
releases.  This optimization (setting se = NULL to skip the second
half) obviously won't work properly with detecting whether we made it
to the end of the tree.

> -                       break;
> -               }
> +               if (cfs_rq_throttled(cfs_rq))
> +                       goto done;
> +
>                flags = ENQUEUE_WAKEUP;
>        }
>
> @@ -1855,14 +1854,14 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>                cfs_rq->h_nr_running++;
>
>                if (cfs_rq_throttled(cfs_rq))
> -                       break;
> +                       goto done;
>
>                update_cfs_load(cfs_rq, 0);
>                update_cfs_shares(cfs_rq);
>        }
>
> -       if (!se)
> -               inc_nr_running(rq);
> +       inc_nr_running(rq);
> +done:
>        hrtick_update(rq);
>  }
>
> @@ -1885,10 +1884,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>                cfs_rq->h_nr_running--;
>
>                /* end evaluation on throttled cfs_rq */
> -               if (cfs_rq_throttled(cfs_rq)) {
> -                       se = NULL;
> -                       break;
> -               }
> +               if (cfs_rq_throttled(cfs_rq))
> +                       goto done;
> +
>                /* Don't dequeue parent if it has other entities besides us */
>                if (cfs_rq->load.weight) {
>                         /* Avoid double update below. */
> @@ -1910,14 +1908,14 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>                cfs_rq->h_nr_running--;
>
>                if (cfs_rq_throttled(cfs_rq))
> -                       break;
> +                       goto done;
>
>                update_cfs_load(cfs_rq, 0);
>                update_cfs_shares(cfs_rq);
>        }
>
> -       if (!se)
> -               dec_nr_running(rq);
> +       dec_nr_running(rq);
> +done:
>        hrtick_update(rq);
>  }
>
> --
> 1.7.4.4
>
>
>

How about instead something like the following?  We can actually take
advantage of the second loop always executing by deferring the
accounting update on a throttled entity.  This keeps the control flow
within dequeue_task_fair() linear.

What do you think of this (untested):

--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1744,13 +1744,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
                        break;
                cfs_rq = cfs_rq_of(se);
                enqueue_entity(cfs_rq, se, flags);
-               cfs_rq->h_nr_running++;

-               /* end evaluation on throttled cfs_rq */
-               if (cfs_rq_throttled(cfs_rq)) {
-                       se = NULL;
+               /* note: ordering with throttle check to perform h_nr_running accounting on throttled entity below */
+               if (cfs_rq_throttled(cfs_rq))
                        break;
-               }
+
+               cfs_rq->h_nr_running++;
                flags = ENQUEUE_WAKEUP;
        }

@@ -1786,13 +1785,12 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);
                dequeue_entity(cfs_rq, se, flags);
-               cfs_rq->h_nr_running--;

-               /* end evaluation on throttled cfs_rq */
-               if (cfs_rq_throttled(cfs_rq)) {
-                       se = NULL;
+               /* note: ordering with throttle check to perform h_nr_running accounting on throttled entity below */
+               if (cfs_rq_throttled(cfs_rq))
                        break;
-               }
+
+               cfs_rq->h_nr_running--;
                /* Don't dequeue parent if it has other entities besides

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Test for CFS Bandwidth Control V6
  2011-06-08  2:54     ` Paul Turner
@ 2011-06-08  5:55       ` Hidetoshi Seto
  0 siblings, 0 replies; 63+ messages in thread
From: Hidetoshi Seto @ 2011-06-08  5:55 UTC (permalink / raw)
  To: Paul Turner
  Cc: Xiao Guangrong, LKML, Peter Zijlstra, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar,
	Pavel Emelyanov

(2011/06/08 11:54), Paul Turner wrote:
> On Mon, May 23, 2011 at 5:53 PM, Hidetoshi Seto
> <seto.hidetoshi@jp.fujitsu.com> wrote:
>
>> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
>> index 3936393..544072f 100644
>> --- a/kernel/sched_fair.c
>> +++ b/kernel/sched_fair.c
>> @@ -1537,7 +1537,7 @@ static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>>        walk_tg_tree_from(cfs_rq->tg, tg_unthrottle_down, tg_nop,
>>                          (void *)&udd);
>>
>> -       if (!cfs_rq->load.weight)
>> +       if (!cfs_rq->h_nr_running)
>>                return;
>>
> 
> Why change here?

I was confused a bit - just curious whether by any chance there is a throttled
cfs_rq that has (load.weight, h_nr_running) = (0, >0).


>>        task_delta = cfs_rq->h_nr_running;
>> @@ -1843,10 +1843,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>                cfs_rq->h_nr_running++;
>>
>>                /* end evaluation on throttled cfs_rq */
>> -               if (cfs_rq_throttled(cfs_rq)) {
>> -                       se = NULL;
> 
> Hmm..  yeah this is a casualty of moving the h_nr_running computations
> in-line as a part of the previous refactoring within the last
> releases.  This optimization (setting se = NULL to skip the second
> half) obviously won't work properly with detecting whether we made it
> to the end of the tree.
> 
(snip)
> 
> How about instead something like the following?  We can actually take
> advantage of the second loop always executing by deferring the
> accounting update on a throttled entity.  This keeps the control flow
> within dequeue_task_fair() linear.
>
> What do you think of this (untested):
> 
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1744,13 +1744,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>                         break;
>                 cfs_rq = cfs_rq_of(se);
>                 enqueue_entity(cfs_rq, se, flags);
> -               cfs_rq->h_nr_running++;
> 
> -               /* end evaluation on throttled cfs_rq */
> -               if (cfs_rq_throttled(cfs_rq)) {
> -                       se = NULL;
> +               /* note: ordering with throttle check to perform h_nr_running accounting on throttled entity below */
> +               if (cfs_rq_throttled(cfs_rq))
>                         break;
> -               }
> +
> +               cfs_rq->h_nr_running++;
>                 flags = ENQUEUE_WAKEUP;
>         }
> 
> @@ -1786,13 +1785,12 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>         for_each_sched_entity(se) {
>                 cfs_rq = cfs_rq_of(se);
>                 dequeue_entity(cfs_rq, se, flags);
> -               cfs_rq->h_nr_running--;
> 
> -               /* end evaluation on throttled cfs_rq */
> -               if (cfs_rq_throttled(cfs_rq)) {
> -                       se = NULL;
> +               /* note: ordering with throttle check to perform h_nr_running accounting on throttled entity below */
> +               if (cfs_rq_throttled(cfs_rq))
>                         break;
> -               }
> +
> +               cfs_rq->h_nr_running--;
>                 /* Don't dequeue parent if it has other entities besides

Looks good if it abides by the nature of the scheduler code ;-)


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2011-06-08  5:55 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-03-23  3:03 [patch 00/15] CFS Bandwidth Control V5 Paul Turner
2011-03-23  3:03 ` [patch 01/15] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
2011-03-24 12:38   ` Kamalesh Babulal
2011-04-05 13:28   ` Peter Zijlstra
2011-03-23  3:03 ` [patch 02/15] sched: validate CFS quota hierarchies Paul Turner
2011-03-23 10:39   ` torbenh
2011-03-23 20:49     ` Paul Turner
2011-03-24  6:31   ` Bharata B Rao
2011-04-08 17:01     ` Peter Zijlstra
2011-03-29  6:57   ` Hidetoshi Seto
2011-04-04 23:10     ` Paul Turner
2011-04-05 13:28   ` Peter Zijlstra
2011-03-23  3:03 ` [patch 03/15] sched: accumulate per-cfs_rq cpu usage Paul Turner
2011-04-05 13:28   ` Peter Zijlstra
2011-04-06 20:44     ` Paul Turner
2011-04-05 13:28   ` Peter Zijlstra
2011-04-06 20:47     ` Paul Turner
2011-03-23  3:03 ` [patch 04/15] sched: throttle cfs_rq entities which exceed their local quota Paul Turner
2011-03-23  5:09   ` Mike Galbraith
2011-03-23 20:53     ` Paul Turner
2011-03-24  6:36   ` Bharata B Rao
2011-03-24  7:40     ` Paul Turner
2011-04-05 13:28   ` Peter Zijlstra
2011-04-05 23:15     ` Paul Turner
2011-03-23  3:03 ` [patch 05/15] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh Paul Turner
2011-04-05 13:28   ` Peter Zijlstra
2011-04-05 13:33     ` Peter Zijlstra
2011-04-05 13:28   ` Peter Zijlstra
2011-04-05 13:28   ` Peter Zijlstra
2011-03-23  3:03 ` [patch 06/15] sched: allow for positional tg_tree walks Paul Turner
2011-03-23  3:03 ` [patch 07/15] sched: prevent interactions between throttled entities and load-balance Paul Turner
2011-04-05 13:28   ` Peter Zijlstra
2011-03-23  3:03 ` [patch 08/15] sched: migrate throttled tasks on HOTPLUG Paul Turner
2011-04-05 13:28   ` Peter Zijlstra
2011-04-06  2:31     ` Paul Turner
2011-03-23  3:03 ` [patch 09/15] sched: add exports tracking cfs bandwidth control statistics Paul Turner
2011-04-05 13:28   ` Peter Zijlstra
2011-03-23  3:03 ` [patch 10/15] sched: (fixlet) dont update shares twice on on_rq parent Paul Turner
2011-04-05 13:28   ` Peter Zijlstra
2011-03-23  3:03 ` [patch 11/15] sched: hierarchical task accounting for SCHED_OTHER Paul Turner
2011-04-05 13:28   ` Peter Zijlstra
2011-03-23  3:03 ` [patch 12/15] sched: maintain throttled rqs as a list Paul Turner
2011-04-22  2:50   ` Hidetoshi Seto
2011-04-24 21:23     ` Paul Turner
2011-03-23  3:03 ` [patch 13/15] sched: expire slack quota using generation counters Paul Turner
2011-04-05 13:28   ` Peter Zijlstra
2011-04-06  7:22     ` Paul Turner
2011-04-06  8:15       ` Peter Zijlstra
2011-04-06 11:26       ` Peter Zijlstra
2011-03-23  3:03 ` [patch 14/15] sched: return unused quota on voluntary sleep Paul Turner
2011-04-05 13:28   ` Peter Zijlstra
2011-04-06  2:25     ` Paul Turner
2011-03-23  3:03 ` [patch 15/15] sched: add documentation for bandwidth control Paul Turner
2011-03-24  6:38   ` Bharata B Rao
2011-03-24 16:12 ` [patch 00/15] CFS Bandwidth Control V5 Bharata B Rao
2011-03-31  7:57 ` Xiao Guangrong
2011-04-04 23:10   ` Paul Turner
2011-04-05 13:28 ` Peter Zijlstra
2011-05-20  2:12 ` Test for CFS Bandwidth Control V6 Xiao Guangrong
2011-05-24  0:53   ` Hidetoshi Seto
2011-05-24  7:56     ` Xiao Guangrong
2011-06-08  2:54     ` Paul Turner
2011-06-08  5:55       ` Hidetoshi Seto

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).