All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC v2 PATCH 0/8] CFS Hard limits - v2
@ 2009-09-30 12:49 Bharata B Rao
  2009-09-30 12:50 ` [RFC v2 PATCH 1/8] sched: Rename sched_rt_period_mask() and use it in CFS also Bharata B Rao
                   ` (8 more replies)
  0 siblings, 9 replies; 39+ messages in thread
From: Bharata B Rao @ 2009-09-30 12:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Gautham R Shenoy, Srivatsa Vaddagiri, Ingo Molnar,
	Peter Zijlstra, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
	Chris Friesen, Paul Menage, Mike Waychison

Hi,

Here is the v2 post of hard limits feature for CFS group scheduler. This
RFC post mainly adds runtime borrowing feature and has a new locking scheme
to protect CFS runtime related fields.

It would be nice to have some comments on this set!

Changes
-------

RFC v2:
- Upgraded to 2.6.31.
- Added CFS runtime borrowing.
- New locking scheme
    The hard limit specific fields of cfs_rq (cfs_runtime, cfs_time and
    cfs_throttled) were being protected by rq->lock. This simple scheme will
    not work when runtime rebalancing is introduced where it will be required
    to look at these fields on other CPU's which requires us to acquire
    rq->lock of other CPUs. This will not be feasible from update_curr().
    Hence introduce a separate lock (rq->runtime_lock) to protect these
    fields of all cfs_rq under it.
- Handle the task wakeup in a throttled group correctly.
- Make CFS_HARD_LIMITS dependent on CGROUP_SCHED (Thanks to Andrea Righi)

RFC v1:
- First version of the patches with minimal features was posted at
  http://lkml.org/lkml/2009/8/25/128

RFC v0:
- The CFS hard limits proposal was first posted at
  http://lkml.org/lkml/2009/6/4/24

Testing and Benchmark numbers
-----------------------------
- This patchset has seen very minimal testing on 24way machine and is expected
  to have bugs. I need to test this under more test scenarios.
- I have run a few common benchmarks to see if my patches introduce any visible
  overhead. I am aware that the number of runs or the combinations I have
  used may not be ideal, but the intention in this early stage is to catch any
  serious regressions that the patches would have introduced.
- I plan to get numbers from more benchmarks in future releases. Any inputs
  on specific benchmarks to try would be helpful.

- hackbench (hackbench -pipe N)
  (hackbench was run as part of a group under root group)
  -----------------------------------------------------------------------
				Time
	-----------------------------------------------------------------
  N	CFS_HARD_LIMTS=n	CFS_HARD_LIMTS=y	CFS_HARD_LIMITS=y
				(infinite runtime)	(BW=450000/500000)
  -----------------------------------------------------------------------
  10	0.475			0.384			0.253
  20	0.610			0.670			0.692
  50	1.250			1.201			1.295
  100	1.981			2.174			1.583
  -----------------------------------------------------------------------
  - BW = Bandwidth = runtime/period
  - Infinite runtime means no hard limiting

- lmbench (lat_ctx -N 5 -s <size_in_kb> N)

  (i) size_in_kb = 1024
  -----------------------------------------------------------------------
				Context switch time (us)
	-----------------------------------------------------------------
  N	CFS_HARD_LIMTS=n	CFS_HARD_LIMTS=y	CFS_HARD_LIMITS=y
				(infinite runtime)	(BW=450000/500000)
  -----------------------------------------------------------------------
  10	315.87			330.19			317.04
  100	675.52			699.90			698.50
  500	775.01			772.86			772.30
  -----------------------------------------------------------------------

  (ii) size_in_kb = 2048
  -----------------------------------------------------------------------
				Context switch time (us)
	-----------------------------------------------------------------
  N	CFS_HARD_LIMTS=n	CFS_HARD_LIMTS=y	CFS_HARD_LIMITS=y
				(infinite runtime)	(BW=450000/500000)
  -----------------------------------------------------------------------
  10	1319.01			1332.16			1328.09
  100	1400.77			1372.67			1382.27
  500	1479.40			1524.57			1615.84
  -----------------------------------------------------------------------

- kernbench

Average Half load -j 12 Run (std deviation):
------------------------------------------------------------------------------
	CFS_HARD_LIMTS=n	CFS_HARD_LIMTS=y	CFS_HARD_LIMITS=y
				(infinite runtime)	(BW=450000/500000)
------------------------------------------------------------------------------
Elapsd	5.716 (0.278711)	6.06 (0.479322)		5.41 (0.360694)
User	20.464 (2.22087)	22.978 (3.43738)	18.486 (2.60754)
System	14.82 (1.52086)		16.68 (2.3438)		13.514 (1.77074)
% CPU	615.2 (41.1667)		651.6 (43.397)		588.4 (42.0214)
CtxSwt	2727.8 (243.19)		3030.6 (425.338)	2536 (302.498)
Sleeps	4981.4 (442.337)	5532.2 (847.27)		4554.6 (510.532)
------------------------------------------------------------------------------

Average Optimal load -j 96 Run (std deviation):
------------------------------------------------------------------------------
	CFS_HARD_LIMTS=n	CFS_HARD_LIMTS=y	CFS_HARD_LIMITS=y
				(infinite runtime)	(BW=450000/500000)
------------------------------------------------------------------------------
Elapsd	4.826 (0.276641)	4.776 (0.291599)	5.13 (0.50448)
User	21.278 (2.67999)	22.138 (3.2045)		21.988 (5.63116)
System	19.213 (5.38314)	19.796 (4.32574)	20.407 (8.53682)
% CPU	778.3 (184.522)		786.1 (154.295)		803.1 (244.865)
CtxSwt	2906.5 (387.799)	3052.1 (397.15)		3030.6 (765.418)
Sleeps	4576.6 (565.383)	4796 (990.278)		4576.9 (625.933)
------------------------------------------------------------------------------

Average Maximal load -j Run (std deviation):
------------------------------------------------------------------------------
	CFS_HARD_LIMTS=n	CFS_HARD_LIMTS=y	CFS_HARD_LIMITS=y
				(infinite runtime)	(BW=450000/500000)
------------------------------------------------------------------------------
Elapsd	5.13 (0.530236)		5.062 (0.0408656)	4.94 (0.229891)
User	22.7293 (4.37921)	22.9973 (2.86311)	22.5507 (4.78016)
System	21.966 (6.81872)	21.9713 (4.72952)	22.0287 (7.39655)
% CPU	860 (202.295)		859.8 (164.415)		864.467 (218.721)
CtxSwt	3154.27 (659.933)	3172.93 (370.439)	3127.2 (657.224)
Sleeps	4602.6 (662.155)	4676.67 (813.274)	4489.2 (542.859)
------------------------------------------------------------------------------

Features TODO
-------------
- CFS runtime borrowing still needs some work, especially need to handle
  runtime redistribution when a CPU goes offline.
- Bandwidth inheritance support (long term, not under consideration currently)
- This implementation doesn't work for user group scheduler. Since user group
  scheduler will eventually go away, I don't plan to work on this.

Implementation TODO
-------------------
- It is possible to share some of the bandwidth handling code with RT, but
  the intention of this post is to show the changes associated with hard limits.
  Hence the sharing/cleanup will be done down the line when this patchset
  itself becomes more accepatable.
- When a dequeued entity is enqueued back, I don't change its vruntime. The
  entity might get undue advantage due to its old (lower) vruntime. Need to
  address this.

Patches description
-------------------
This post has the following patches:

1/8 sched: Rename sched_rt_period_mask() and use it in CFS also
2/8 sched: Maintain aggregated tasks count in cfs_rq at each hierarchy level
3/8 sched: Bandwidth initialization for fair task groups
4/8 sched: Enforce hard limits by throttling
5/8 sched: Unthrottle the throttled tasks
6/8 sched: Add throttle time statistics to /proc/sched_debug
7/8 sched: CFS runtime borrowing
8/8 sched: Hard limits documentation

 Documentation/scheduler/sched-cfs-hard-limits.txt |   52 ++
 include/linux/sched.h                               |    9 
 init/Kconfig                                        |   13 
 kernel/sched.c                                      |  427 +++++++++++++++++++
 kernel/sched_debug.c                                |   21 
 kernel/sched_fair.c                                 |  432 +++++++++++++++++++-
 kernel/sched_rt.c                                   |   22 -
 7 files changed, 932 insertions(+), 44 deletions(-)

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [RFC v2 PATCH 1/8] sched: Rename sched_rt_period_mask() and use it in CFS also
  2009-09-30 12:49 [RFC v2 PATCH 0/8] CFS Hard limits - v2 Bharata B Rao
@ 2009-09-30 12:50 ` Bharata B Rao
  2009-09-30 12:51 ` [RFC v2 PATCH 2/8] sched: Maintain aggregated tasks count in cfs_rq at each hierarchy level Bharata B Rao
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 39+ messages in thread
From: Bharata B Rao @ 2009-09-30 12:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Gautham R Shenoy, Srivatsa Vaddagiri, Ingo Molnar,
	Peter Zijlstra, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
	Chris Friesen, Paul Menage, Mike Waychison

sched: Rename sched_rt_period_mask() and use it in CFS also.

From: Bharata B Rao <bharata@linux.vnet.ibm.com>

sched_rt_period_mask() is needed in CFS also. Rename it to a generic name
and move it to kernel/sched.c. No functionality change in this patch.

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 kernel/sched.c    |   23 +++++++++++++++++++++++
 kernel/sched_rt.c |   19 +------------------
 2 files changed, 24 insertions(+), 18 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 1b59e26..c802dcb 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1732,6 +1732,29 @@ static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)
 
 static void calc_load_account_active(struct rq *this_rq);
 
+
+#if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_FAIR_GROUP_SCHED)
+
+#ifdef CONFIG_SMP
+static inline const struct cpumask *sched_bw_period_mask(void)
+{
+	return cpu_rq(smp_processor_id())->rd->span;
+}
+#else /* !CONFIG_SMP */
+static inline const struct cpumask *sched_bw_period_mask(void)
+{
+	return cpu_online_mask;
+}
+#endif /* CONFIG_SMP */
+
+#else
+static inline const struct cpumask *sched_bw_period_mask(void)
+{
+	return cpu_online_mask;
+}
+
+#endif
+
 #include "sched_stats.h"
 #include "sched_idletask.c"
 #include "sched_fair.c"
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 3918e01..478fff9 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -222,18 +222,6 @@ static int rt_se_boosted(struct sched_rt_entity *rt_se)
 	return p->prio != p->normal_prio;
 }
 
-#ifdef CONFIG_SMP
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
-	return cpu_rq(smp_processor_id())->rd->span;
-}
-#else
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
-	return cpu_online_mask;
-}
-#endif
-
 static inline
 struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
 {
@@ -283,11 +271,6 @@ static inline int rt_rq_throttled(struct rt_rq *rt_rq)
 	return rt_rq->rt_throttled;
 }
 
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
-	return cpu_online_mask;
-}
-
 static inline
 struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
 {
@@ -505,7 +488,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
 	if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
 		return 1;
 
-	span = sched_rt_period_mask();
+	span = sched_bw_period_mask();
 	for_each_cpu(i, span) {
 		int enqueue = 0;
 		struct rt_rq *rt_rq = sched_rt_period_rt_rq(rt_b, i);

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC v2 PATCH 2/8] sched: Maintain aggregated tasks count in cfs_rq at each hierarchy level
  2009-09-30 12:49 [RFC v2 PATCH 0/8] CFS Hard limits - v2 Bharata B Rao
  2009-09-30 12:50 ` [RFC v2 PATCH 1/8] sched: Rename sched_rt_period_mask() and use it in CFS also Bharata B Rao
@ 2009-09-30 12:51 ` Bharata B Rao
  2009-10-13 14:27   ` Peter Zijlstra
  2009-09-30 12:52 ` [RFC v2 PATCH 3/8] sched: Bandwidth initialization for fair task groups Bharata B Rao
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 39+ messages in thread
From: Bharata B Rao @ 2009-09-30 12:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Gautham R Shenoy, Srivatsa Vaddagiri, Ingo Molnar,
	Peter Zijlstra, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
	Chris Friesen, Paul Menage, Mike Waychison

sched: Maintain aggregated tasks count in cfs_rq at each hierarchy level

From: Bharata B Rao <bharata@linux.vnet.ibm.com>

This patch adds a counter to cfs_rq (->nr_tasks_running) to record the
aggregated tasks count at each level in the task group hierarchy.
This is needed by later hard limit patches where it is required to
know how many tasks go off the rq when a throttled group entity
is dequeued.

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 kernel/sched.c       |    4 ++++
 kernel/sched_debug.c |    2 ++
 kernel/sched_fair.c  |   23 +++++++++++++++++++++++
 3 files changed, 29 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index c802dcb..c283d0f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -477,6 +477,10 @@ struct cfs_rq {
 	unsigned long rq_weight;
 #endif
 #endif
+	/*
+	 * Number of tasks at this heirarchy.
+	 */
+	unsigned long nr_tasks_running;
 };
 
 /* Real-Time classes' related field in a runqueue: */
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 70c7e0b..f4c30bc 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -214,6 +214,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 #ifdef CONFIG_SMP
 	SEQ_printf(m, "  .%-30s: %lu\n", "shares", cfs_rq->shares);
 #endif
+	SEQ_printf(m, "  .%-30s: %ld\n", "nr_tasks_running",
+			cfs_rq->nr_tasks_running);
 	print_cfs_group_stats(m, cpu, cfs_rq->tg);
 #endif
 }
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 652e8bd..eeeddb8 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -243,6 +243,27 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+static void add_cfs_rq_tasks_running(struct sched_entity *se,
+		unsigned long count)
+{
+	struct cfs_rq *cfs_rq;
+
+	for_each_sched_entity(se) {
+		cfs_rq = cfs_rq_of(se);
+		cfs_rq->nr_tasks_running += count;
+	}
+}
+
+static void sub_cfs_rq_tasks_running(struct sched_entity *se,
+		unsigned long count)
+{
+	struct cfs_rq *cfs_rq;
+
+	for_each_sched_entity(se) {
+		cfs_rq = cfs_rq_of(se);
+		cfs_rq->nr_tasks_running -= count;
+	}
+}
 
 /**************************************************************
  * Scheduling class tree data structure manipulation methods:
@@ -969,6 +990,7 @@ static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
 		wakeup = 1;
 	}
 
+	add_cfs_rq_tasks_running(&p->se, 1);
 	hrtick_update(rq);
 }
 
@@ -991,6 +1013,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
 		sleep = 1;
 	}
 
+	sub_cfs_rq_tasks_running(&p->se, 1);
 	hrtick_update(rq);
 }
 

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC v2 PATCH 3/8] sched: Bandwidth initialization for fair task groups
  2009-09-30 12:49 [RFC v2 PATCH 0/8] CFS Hard limits - v2 Bharata B Rao
  2009-09-30 12:50 ` [RFC v2 PATCH 1/8] sched: Rename sched_rt_period_mask() and use it in CFS also Bharata B Rao
  2009-09-30 12:51 ` [RFC v2 PATCH 2/8] sched: Maintain aggregated tasks count in cfs_rq at each hierarchy level Bharata B Rao
@ 2009-09-30 12:52 ` Bharata B Rao
  2009-10-13 14:27   ` Peter Zijlstra
  2009-09-30 12:52 ` [RFC v2 PATCH 4/8] sched: Enforce hard limits by throttling Bharata B Rao
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 39+ messages in thread
From: Bharata B Rao @ 2009-09-30 12:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Gautham R Shenoy, Srivatsa Vaddagiri, Ingo Molnar,
	Peter Zijlstra, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
	Chris Friesen, Paul Menage, Mike Waychison

sched: Bandwidth initialization for fair task groups.

From: Bharata B Rao <bharata@linux.vnet.ibm.com>

Introduce the notion of hard limiting for CFS groups by bringing in
the concept of runtime and period for them. Add cgroup files to control
runtime and period.

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 init/Kconfig   |   13 ++
 kernel/sched.c |  317 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 330 insertions(+), 0 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 3f7e609..e93282f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -492,6 +492,19 @@ config CGROUP_SCHED
 
 endchoice
 
+config CFS_HARD_LIMITS
+	bool "Hard Limits for CFS Group Scheduler"
+	depends on EXPERIMENTAL
+	depends on FAIR_GROUP_SCHED && CGROUP_SCHED
+	default n
+	help
+	  This option enables hard limiting of CPU time obtained by
+	  a fair task group. Use this if you want to throttle a group of tasks
+	  based on its CPU usage. For more details refer to
+	  Documentation/scheduler/sched-cfs-hard-limits.txt
+
+	  Say N if unsure.
+
 menuconfig CGROUPS
 	boolean "Control Group support"
 	help
diff --git a/kernel/sched.c b/kernel/sched.c
index c283d0f..0147f6f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -262,6 +262,15 @@ static DEFINE_MUTEX(sched_domains_mutex);
 
 #include <linux/cgroup.h>
 
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_CFS_HARD_LIMITS)
+struct cfs_bandwidth {
+	spinlock_t		cfs_runtime_lock;
+	ktime_t			cfs_period;
+	u64			cfs_runtime;
+	struct hrtimer		cfs_period_timer;
+};
+#endif
+
 struct cfs_rq;
 
 static LIST_HEAD(task_groups);
@@ -282,6 +291,11 @@ struct task_group {
 	/* runqueue "owned" by this group on each cpu */
 	struct cfs_rq **cfs_rq;
 	unsigned long shares;
+#ifdef CONFIG_CFS_HARD_LIMITS
+	struct cfs_bandwidth cfs_bandwidth;
+	/* If set, throttle when the group exceeds its bandwidth */
+	int hard_limit_enabled;
+#endif
 #endif
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -477,6 +491,16 @@ struct cfs_rq {
 	unsigned long rq_weight;
 #endif
 #endif
+#ifdef CONFIG_CFS_HARD_LIMITS
+	/* set when the group is throttled  on this cpu */
+	int cfs_throttled;
+
+	/* runtime currently consumed by the group on this rq */
+	u64 cfs_time;
+
+	/* runtime available to the group on this rq */
+	u64 cfs_runtime;
+#endif
 	/*
 	 * Number of tasks at this heirarchy.
 	 */
@@ -665,6 +689,11 @@ struct rq {
 	/* BKL stats */
 	unsigned int bkl_count;
 #endif
+	/*
+	 * Protects the cfs runtime related fields of all cfs_rqs under
+	 * this rq
+	 */
+	spinlock_t runtime_lock;
 };
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
@@ -1759,6 +1788,150 @@ static inline const struct cpumask *sched_bw_period_mask(void)
 
 #endif
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+#ifdef CONFIG_CFS_HARD_LIMITS
+
+/*
+ * Runtime allowed for a cfs group before it is hard limited.
+ * default: Infinite which means no hard limiting.
+ */
+u64 sched_cfs_runtime = RUNTIME_INF;
+
+/*
+ * period over which we hard limit the cfs group's bandwidth.
+ * default: 0.5s
+ */
+u64 sched_cfs_period = 500000;
+
+static inline u64 global_cfs_period(void)
+{
+	return sched_cfs_period * NSEC_PER_USEC;
+}
+
+static inline u64 global_cfs_runtime(void)
+{
+	return RUNTIME_INF;
+}
+
+static inline int cfs_bandwidth_enabled(struct task_group *tg)
+{
+	return tg->hard_limit_enabled;
+}
+
+static inline void rq_runtime_lock(struct rq *rq)
+{
+	spin_lock(&rq->runtime_lock);
+}
+
+static inline void rq_runtime_unlock(struct rq *rq)
+{
+	spin_unlock(&rq->runtime_lock);
+}
+
+/*
+ * Refresh the runtimes of the throttled groups.
+ * But nothing much to do now, will populate this in later patches.
+ */
+static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
+{
+	struct cfs_bandwidth *cfs_b =
+		container_of(timer, struct cfs_bandwidth, cfs_period_timer);
+
+	hrtimer_add_expires_ns(timer, ktime_to_ns(cfs_b->cfs_period));
+	return HRTIMER_RESTART;
+}
+
+/*
+ * TODO: Check if this kind of timer setup is sufficient for cfs or
+ * should we do what rt is doing.
+ */
+static void start_cfs_bandwidth(struct task_group *tg)
+{
+	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+
+	/*
+	 * Timer isn't setup for groups with infinite runtime or for groups
+	 * for which hard limiting isn't enabled.
+	 */
+	if (!cfs_bandwidth_enabled(tg) || (cfs_b->cfs_runtime == RUNTIME_INF))
+		return;
+
+	if (hrtimer_active(&cfs_b->cfs_period_timer))
+		return;
+
+	hrtimer_start_range_ns(&cfs_b->cfs_period_timer, cfs_b->cfs_period,
+			0, HRTIMER_MODE_REL);
+}
+
+static void init_cfs_bandwidth(struct task_group *tg)
+{
+	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+
+	cfs_b->cfs_period = ns_to_ktime(global_cfs_period());
+	cfs_b->cfs_runtime = global_cfs_runtime();
+
+	spin_lock_init(&cfs_b->cfs_runtime_lock);
+
+	hrtimer_init(&cfs_b->cfs_period_timer,
+			CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	cfs_b->cfs_period_timer.function = &sched_cfs_period_timer;
+}
+
+static inline void destroy_cfs_bandwidth(struct task_group *tg)
+{
+	hrtimer_cancel(&tg->cfs_bandwidth.cfs_period_timer);
+}
+
+static void init_cfs_hard_limits(struct cfs_rq *cfs_rq, struct task_group *tg)
+{
+	cfs_rq->cfs_time = 0;
+	cfs_rq->cfs_throttled = 0;
+	cfs_rq->cfs_runtime = tg->cfs_bandwidth.cfs_runtime;
+	tg->hard_limit_enabled = 0;
+}
+
+#else /* !CONFIG_CFS_HARD_LIMITS */
+
+static void init_cfs_bandwidth(struct task_group *tg)
+{
+	return;
+}
+
+static inline void destroy_cfs_bandwidth(struct task_group *tg)
+{
+	return;
+}
+
+static void init_cfs_hard_limits(struct cfs_rq *cfs_rq, struct task_group *tg)
+{
+	return;
+}
+
+static inline void rq_runtime_lock(struct rq *rq)
+{
+	return;
+}
+
+static inline void rq_runtime_unlock(struct rq *rq)
+{
+	return;
+}
+
+#endif /* CONFIG_CFS_HARD_LIMITS */
+#else /* !CONFIG_FAIR_GROUP_SCHED */
+
+static inline void rq_runtime_lock(struct rq *rq)
+{
+	return;
+}
+
+static inline void rq_runtime_unlock(struct rq *rq)
+{
+	return;
+}
+
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
 #include "sched_stats.h"
 #include "sched_idletask.c"
 #include "sched_fair.c"
@@ -9146,6 +9319,7 @@ static void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 	struct rq *rq = cpu_rq(cpu);
 	tg->cfs_rq[cpu] = cfs_rq;
 	init_cfs_rq(cfs_rq, rq);
+	init_cfs_hard_limits(cfs_rq, tg);
 	cfs_rq->tg = tg;
 	if (add)
 		list_add(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
@@ -9275,6 +9449,10 @@ void __init sched_init(void)
 #endif /* CONFIG_USER_SCHED */
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	init_cfs_bandwidth(&init_task_group);
+#endif
+
 #ifdef CONFIG_GROUP_SCHED
 	list_add(&init_task_group.list, &task_groups);
 	INIT_LIST_HEAD(&init_task_group.children);
@@ -9291,6 +9469,7 @@ void __init sched_init(void)
 
 		rq = cpu_rq(i);
 		spin_lock_init(&rq->lock);
+		spin_lock_init(&rq->runtime_lock);
 		rq->nr_running = 0;
 		rq->calc_load_active = 0;
 		rq->calc_load_update = jiffies + LOAD_FREQ;
@@ -9564,6 +9743,7 @@ static void free_fair_sched_group(struct task_group *tg)
 {
 	int i;
 
+	destroy_cfs_bandwidth(tg);
 	for_each_possible_cpu(i) {
 		if (tg->cfs_rq)
 			kfree(tg->cfs_rq[i]);
@@ -9590,6 +9770,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 	if (!tg->se)
 		goto err;
 
+	init_cfs_bandwidth(tg);
 	tg->shares = NICE_0_LOAD;
 
 	for_each_possible_cpu(i) {
@@ -10284,6 +10465,125 @@ static u64 cpu_shares_read_u64(struct cgroup *cgrp, struct cftype *cft)
 
 	return (u64) tg->shares;
 }
+
+#ifdef CONFIG_CFS_HARD_LIMITS
+
+static int tg_set_cfs_bandwidth(struct task_group *tg,
+		u64 cfs_period, u64 cfs_runtime)
+{
+	int i, err = 0;
+
+	spin_lock_irq(&tg->cfs_bandwidth.cfs_runtime_lock);
+	tg->cfs_bandwidth.cfs_period = ns_to_ktime(cfs_period);
+	tg->cfs_bandwidth.cfs_runtime = cfs_runtime;
+
+	for_each_possible_cpu(i) {
+		struct cfs_rq *cfs_rq = tg->cfs_rq[i];
+
+		rq_runtime_lock(rq_of(cfs_rq));
+		cfs_rq->cfs_runtime = cfs_runtime;
+		rq_runtime_unlock(rq_of(cfs_rq));
+	}
+
+	start_cfs_bandwidth(tg);
+	spin_unlock_irq(&tg->cfs_bandwidth.cfs_runtime_lock);
+	return err;
+}
+
+int tg_set_cfs_runtime(struct task_group *tg, long cfs_runtime_us)
+{
+	u64 cfs_runtime, cfs_period;
+
+	cfs_period = ktime_to_ns(tg->cfs_bandwidth.cfs_period);
+	cfs_runtime = (u64)cfs_runtime_us * NSEC_PER_USEC;
+	if (cfs_runtime_us < 0)
+		cfs_runtime = RUNTIME_INF;
+
+	return tg_set_cfs_bandwidth(tg, cfs_period, cfs_runtime);
+}
+
+long tg_get_cfs_runtime(struct task_group *tg)
+{
+	u64 cfs_runtime_us;
+
+	if (tg->cfs_bandwidth.cfs_runtime == RUNTIME_INF)
+		return -1;
+
+	cfs_runtime_us = tg->cfs_bandwidth.cfs_runtime;
+	do_div(cfs_runtime_us, NSEC_PER_USEC);
+	return cfs_runtime_us;
+}
+
+int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
+{
+	u64 cfs_runtime, cfs_period;
+
+	cfs_period = (u64)cfs_period_us * NSEC_PER_USEC;
+	cfs_runtime = tg->cfs_bandwidth.cfs_runtime;
+
+	if (cfs_period == 0)
+		return -EINVAL;
+
+	return tg_set_cfs_bandwidth(tg, cfs_period, cfs_runtime);
+}
+
+long tg_get_cfs_period(struct task_group *tg)
+{
+	u64 cfs_period_us;
+
+	cfs_period_us = ktime_to_ns(tg->cfs_bandwidth.cfs_period);
+	do_div(cfs_period_us, NSEC_PER_USEC);
+	return cfs_period_us;
+}
+
+int tg_set_hard_limit_enabled(struct task_group *tg, u64 val)
+{
+	spin_lock_irq(&tg->cfs_bandwidth.cfs_runtime_lock);
+	if (val > 0) {
+		tg->hard_limit_enabled = 1;
+		start_cfs_bandwidth(tg);
+	} else {
+		destroy_cfs_bandwidth(tg);
+		tg->hard_limit_enabled = 0;
+	}
+	spin_unlock_irq(&tg->cfs_bandwidth.cfs_runtime_lock);
+	return 0;
+}
+
+static s64 cpu_cfs_runtime_read_s64(struct cgroup *cgrp, struct cftype *cft)
+{
+	return tg_get_cfs_runtime(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_runtime_write_s64(struct cgroup *cgrp, struct cftype *cftype,
+				s64 cfs_runtime_us)
+{
+	return tg_set_cfs_runtime(cgroup_tg(cgrp), cfs_runtime_us);
+}
+
+static u64 cpu_cfs_period_read_u64(struct cgroup *cgrp, struct cftype *cft)
+{
+	return tg_get_cfs_period(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_period_write_u64(struct cgroup *cgrp, struct cftype *cftype,
+				u64 cfs_period_us)
+{
+	return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
+}
+
+static u64 cpu_cfs_hard_limit_read_u64(struct cgroup *cgrp, struct cftype *cft)
+{
+	return cfs_bandwidth_enabled(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_hard_limit_write_u64(struct cgroup *cgrp,
+		struct cftype *cftype, u64 val)
+{
+	return tg_set_hard_limit_enabled(cgroup_tg(cgrp), val);
+}
+
+#endif /* CONFIG_CFS_HARD_LIMITS */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -10317,6 +10617,23 @@ static struct cftype cpu_files[] = {
 		.read_u64 = cpu_shares_read_u64,
 		.write_u64 = cpu_shares_write_u64,
 	},
+#ifdef CONFIG_CFS_HARD_LIMITS
+	{
+		.name = "cfs_runtime_us",
+		.read_s64 = cpu_cfs_runtime_read_s64,
+		.write_s64 = cpu_cfs_runtime_write_s64,
+	},
+	{
+		.name = "cfs_period_us",
+		.read_u64 = cpu_cfs_period_read_u64,
+		.write_u64 = cpu_cfs_period_write_u64,
+	},
+	{
+		.name = "cfs_hard_limit",
+		.read_u64 = cpu_cfs_hard_limit_read_u64,
+		.write_u64 = cpu_cfs_hard_limit_write_u64,
+	},
+#endif /* CONFIG_CFS_HARD_LIMITS */
 #endif
 #ifdef CONFIG_RT_GROUP_SCHED
 	{

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC v2 PATCH 4/8] sched: Enforce hard limits by throttling
  2009-09-30 12:49 [RFC v2 PATCH 0/8] CFS Hard limits - v2 Bharata B Rao
                   ` (2 preceding siblings ...)
  2009-09-30 12:52 ` [RFC v2 PATCH 3/8] sched: Bandwidth initialization for fair task groups Bharata B Rao
@ 2009-09-30 12:52 ` Bharata B Rao
  2009-10-13 14:27   ` Peter Zijlstra
  2009-09-30 12:53 ` [RFC v2 PATCH 5/8] sched: Unthrottle the throttled tasks Bharata B Rao
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 39+ messages in thread
From: Bharata B Rao @ 2009-09-30 12:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Gautham R Shenoy, Srivatsa Vaddagiri, Ingo Molnar,
	Peter Zijlstra, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
	Chris Friesen, Paul Menage, Mike Waychison

sched: Enforce hard limits by throttling.

From: Bharata B Rao <bharata@linux.vnet.ibm.com>

Throttle the task-groups which exceed the runtime allocated to them.
Throttled group entities are removed from the run queue.

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 include/linux/sched.h |    3 -
 kernel/sched.c        |   72 ++++++++++++++---
 kernel/sched_debug.c  |    2 
 kernel/sched_fair.c   |  210 ++++++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched_rt.c     |    3 -
 5 files changed, 265 insertions(+), 25 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0f1ea4a..77ace43 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1024,7 +1024,7 @@ struct sched_domain;
 struct sched_class {
 	const struct sched_class *next;
 
-	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
+	int (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
 	void (*yield_task) (struct rq *rq);
 
@@ -1124,6 +1124,7 @@ struct sched_entity {
 	u64			nr_failed_migrations_affine;
 	u64			nr_failed_migrations_running;
 	u64			nr_failed_migrations_hot;
+	u64			nr_failed_migrations_throttled;
 	u64			nr_forced_migrations;
 	u64			nr_forced2_migrations;
 
diff --git a/kernel/sched.c b/kernel/sched.c
index 0147f6f..04c505f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1585,6 +1585,7 @@ update_group_shares_cpu(struct task_group *tg, int cpu,
 	}
 }
 
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
 /*
  * Re-compute the task group their per cpu shares over the given domain.
  * This needs to be done in a bottom-up fashion because the rq weight of a
@@ -1602,9 +1603,11 @@ static int tg_shares_up(struct task_group *tg, void *data)
 		 * If there are currently no tasks on the cpu pretend there
 		 * is one of average load so that when a new task gets to
 		 * run here it will not get delayed by group starvation.
+		 * Also if the group is throttled on this cpu, pretend that
+		 * it has no tasks.
 		 */
 		weight = tg->cfs_rq[i]->load.weight;
-		if (!weight)
+		if (!weight || cfs_rq_throttled(tg->cfs_rq[i]))
 			weight = NICE_0_LOAD;
 
 		tg->cfs_rq[i]->rq_weight = weight;
@@ -1628,6 +1631,7 @@ static int tg_shares_up(struct task_group *tg, void *data)
  * Compute the cpu's hierarchical load factor for each task group.
  * This needs to be done in a top-down fashion because the load of a child
  * group is a fraction of its parents load.
+ * A throttled group's h_load is set to 0.
  */
 static int tg_load_down(struct task_group *tg, void *data)
 {
@@ -1636,6 +1640,8 @@ static int tg_load_down(struct task_group *tg, void *data)
 
 	if (!tg->parent) {
 		load = cpu_rq(cpu)->load.weight;
+	} else if (cfs_rq_throttled(tg->cfs_rq[cpu])) {
+		load = 0;
 	} else {
 		load = tg->parent->cfs_rq[cpu]->h_load;
 		load *= tg->cfs_rq[cpu]->shares;
@@ -1813,6 +1819,8 @@ static inline u64 global_cfs_runtime(void)
 	return RUNTIME_INF;
 }
 
+int task_group_throttled(struct task_group *tg, int cpu);
+
 static inline int cfs_bandwidth_enabled(struct task_group *tg)
 {
 	return tg->hard_limit_enabled;
@@ -1930,6 +1938,16 @@ static inline void rq_runtime_unlock(struct rq *rq)
 	return;
 }
 
+int task_group_throttled(struct task_group *tg, int cpu)
+{
+	return 0;
+}
+
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+	return 0;
+}
+
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 #include "sched_stats.h"
@@ -1981,14 +1999,17 @@ static void update_avg(u64 *avg, u64 sample)
 	*avg += diff >> 3;
 }
 
-static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
+static int enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
 {
+	int ret;
+
 	if (wakeup)
 		p->se.start_runtime = p->se.sum_exec_runtime;
 
 	sched_info_queued(p);
-	p->sched_class->enqueue_task(rq, p, wakeup);
+	ret = p->sched_class->enqueue_task(rq, p, wakeup);
 	p->se.on_rq = 1;
+	return ret;
 }
 
 static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
@@ -2063,8 +2084,15 @@ static void activate_task(struct rq *rq, struct task_struct *p, int wakeup)
 	if (task_contributes_to_load(p))
 		rq->nr_uninterruptible--;
 
-	enqueue_task(rq, p, wakeup);
-	inc_nr_running(rq);
+	/*
+	 * Increment rq->nr_running only if enqueue_task() succeeds.
+	 * enqueue_task() can fail when the task being activated belongs
+	 * to a throttled group. In this case, the task gets enqueued to
+	 * throttled group and the group will be enqueued later when it
+	 * gets unthrottled. rq->nr_running gets incremented at that time.
+	 */
+	if (!enqueue_task(rq, p, wakeup))
+		inc_nr_running(rq);
 }
 
 /*
@@ -3401,6 +3429,7 @@ int can_migrate_task(struct task_struct *p, struct rq *rq, int this_cpu,
 	 * 1) running (obviously), or
 	 * 2) cannot be migrated to this CPU due to cpus_allowed, or
 	 * 3) are cache-hot on their current CPU.
+	 * 4) end up in throttled task groups on this CPU.
 	 */
 	if (!cpumask_test_cpu(this_cpu, &p->cpus_allowed)) {
 		schedstat_inc(p, se.nr_failed_migrations_affine);
@@ -3414,6 +3443,18 @@ int can_migrate_task(struct task_struct *p, struct rq *rq, int this_cpu,
 	}
 
 	/*
+	 * Don't migrate the task if it belongs to a
+	 * - throttled group on its current cpu
+	 * - throttled group on this_cpu
+	 * - group whose hierarchy is throttled on this_cpu
+	 */
+	if (cfs_rq_throttled(cfs_rq_of(&p->se)) ||
+		task_group_throttled(task_group(p), this_cpu)) {
+		schedstat_inc(p, se.nr_failed_migrations_throttled);
+		return 0;
+	}
+
+	/*
 	 * Aggressive migration if:
 	 * 1) task is cache cold, or
 	 * 2) too many balance attempts have failed.
@@ -6096,8 +6137,10 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 	oldprio = p->prio;
 	on_rq = p->se.on_rq;
 	running = task_current(rq, p);
-	if (on_rq)
+	if (on_rq) {
 		dequeue_task(rq, p, 0);
+		dec_nr_running(rq);
+	}
 	if (running)
 		p->sched_class->put_prev_task(rq, p);
 
@@ -6111,7 +6154,8 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 	if (running)
 		p->sched_class->set_curr_task(rq);
 	if (on_rq) {
-		enqueue_task(rq, p, 0);
+		if (!enqueue_task(rq, p, 0))
+			inc_nr_running(rq);
 
 		check_class_changed(rq, p, prev_class, oldprio, running);
 	}
@@ -6145,8 +6189,10 @@ void set_user_nice(struct task_struct *p, long nice)
 		goto out_unlock;
 	}
 	on_rq = p->se.on_rq;
-	if (on_rq)
+	if (on_rq) {
 		dequeue_task(rq, p, 0);
+		dec_nr_running(rq);
+	}
 
 	p->static_prio = NICE_TO_PRIO(nice);
 	set_load_weight(p);
@@ -6155,7 +6201,8 @@ void set_user_nice(struct task_struct *p, long nice)
 	delta = p->prio - old_prio;
 
 	if (on_rq) {
-		enqueue_task(rq, p, 0);
+		if (!enqueue_task(rq, p, 0))
+			inc_nr_running(rq);
 		/*
 		 * If the task increased its priority or is running and
 		 * lowered its priority, then reschedule its CPU:
@@ -10003,8 +10050,10 @@ void sched_move_task(struct task_struct *tsk)
 	running = task_current(rq, tsk);
 	on_rq = tsk->se.on_rq;
 
-	if (on_rq)
+	if (on_rq) {
 		dequeue_task(rq, tsk, 0);
+		dec_nr_running(rq);
+	}
 	if (unlikely(running))
 		tsk->sched_class->put_prev_task(rq, tsk);
 
@@ -10018,7 +10067,8 @@ void sched_move_task(struct task_struct *tsk)
 	if (unlikely(running))
 		tsk->sched_class->set_curr_task(rq);
 	if (on_rq)
-		enqueue_task(rq, tsk, 0);
+		if (!enqueue_task(rq, tsk, 0))
+			inc_nr_running(rq);
 
 	task_rq_unlock(rq, &flags);
 }
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index f4c30bc..8ce525f 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -417,6 +417,7 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 	P(se.nr_failed_migrations_affine);
 	P(se.nr_failed_migrations_running);
 	P(se.nr_failed_migrations_hot);
+	P(se.nr_failed_migrations_throttled);
 	P(se.nr_forced_migrations);
 	P(se.nr_forced2_migrations);
 	P(se.nr_wakeups);
@@ -491,6 +492,7 @@ void proc_sched_set_task(struct task_struct *p)
 	p->se.nr_failed_migrations_affine	= 0;
 	p->se.nr_failed_migrations_running	= 0;
 	p->se.nr_failed_migrations_hot		= 0;
+	p->se.nr_failed_migrations_throttled	= 0;
 	p->se.nr_forced_migrations		= 0;
 	p->se.nr_forced2_migrations		= 0;
 	p->se.nr_wakeups			= 0;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index eeeddb8..f98c1c8 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -186,6 +186,94 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 	}
 }
 
+#ifdef CONFIG_CFS_HARD_LIMITS
+
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->cfs_throttled;
+}
+
+/*
+ * Check if group entity exceeded its runtime. If so, mark the cfs_rq as
+ * throttled mark the current task for reschedling.
+ */
+static void sched_cfs_runtime_exceeded(struct sched_entity *se,
+	struct task_struct *tsk_curr, unsigned long delta_exec)
+{
+	struct cfs_rq *cfs_rq;
+
+	cfs_rq = group_cfs_rq(se);
+
+	if (!cfs_bandwidth_enabled(cfs_rq->tg))
+		return;
+
+	if (cfs_rq->cfs_runtime == RUNTIME_INF)
+		return;
+
+	cfs_rq->cfs_time += delta_exec;
+
+	if (cfs_rq_throttled(cfs_rq))
+		return;
+
+	if (cfs_rq->cfs_time > cfs_rq->cfs_runtime) {
+		cfs_rq->cfs_throttled = 1;
+		resched_task(tsk_curr);
+	}
+}
+
+/*
+ * Check if the entity is throttled.
+ */
+static int entity_throttled(struct sched_entity *se)
+{
+	struct cfs_rq *cfs_rq;
+
+	/* Only group entities can be throttled */
+	if (entity_is_task(se))
+		return 0;
+
+	cfs_rq = group_cfs_rq(se);
+	if (cfs_rq_throttled(cfs_rq))
+		return 1;
+	return 0;
+}
+
+int task_group_throttled(struct task_group *tg, int cpu)
+{
+	struct sched_entity *se = tg->se[cpu];
+
+	for_each_sched_entity(se) {
+		if (entity_throttled(se))
+			return 1;
+	}
+	return 0;
+}
+
+#else
+
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+	return 0;
+}
+
+int task_group_throttled(struct task_group *tg, int cpu)
+{
+	return 0;
+}
+
+static void sched_cfs_runtime_exceeded(struct sched_entity *se,
+	struct task_struct *tsk_curr, unsigned long delta_exec)
+{
+	return;
+}
+
+static int entity_throttled(struct sched_entity *se)
+{
+	return 0;
+}
+
+#endif /* CONFIG_CFS_HARD_LIMITS */
+
 #else	/* CONFIG_FAIR_GROUP_SCHED */
 
 static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
@@ -241,6 +329,17 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 {
 }
 
+static void sched_cfs_runtime_exceeded(struct sched_entity *se,
+	struct task_struct *tsk_curr, unsigned long delta_exec)
+{
+	return;
+}
+
+static int entity_throttled(struct sched_entity *se)
+{
+	return 0;
+}
+
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
 static void add_cfs_rq_tasks_running(struct sched_entity *se,
@@ -502,10 +601,12 @@ __update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
 	update_min_vruntime(cfs_rq);
 }
 
-static void update_curr(struct cfs_rq *cfs_rq)
+static void update_curr_common(struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *curr = cfs_rq->curr;
-	u64 now = rq_of(cfs_rq)->clock;
+	struct rq *rq = rq_of(cfs_rq);
+	struct task_struct *tsk_curr = rq->curr;
+	u64 now = rq->clock;
 	unsigned long delta_exec;
 
 	if (unlikely(!curr))
@@ -528,9 +629,23 @@ static void update_curr(struct cfs_rq *cfs_rq)
 
 		cpuacct_charge(curtask, delta_exec);
 		account_group_exec_runtime(curtask, delta_exec);
+	} else {
+		sched_cfs_runtime_exceeded(curr, tsk_curr, delta_exec);
 	}
 }
 
+static void update_curr(struct cfs_rq *cfs_rq)
+{
+	rq_runtime_lock(rq_of(cfs_rq));
+	update_curr_common(cfs_rq);
+	rq_runtime_unlock(rq_of(cfs_rq));
+}
+
+static inline void update_curr_locked(struct cfs_rq *cfs_rq)
+{
+	update_curr_common(cfs_rq);
+}
+
 static inline void
 update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
@@ -734,13 +849,9 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 	se->vruntime = vruntime;
 }
 
-static void
-enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int wakeup)
+static void enqueue_entity_common(struct cfs_rq *cfs_rq,
+		struct sched_entity *se, int wakeup)
 {
-	/*
-	 * Update run-time statistics of the 'current'.
-	 */
-	update_curr(cfs_rq);
 	account_entity_enqueue(cfs_rq, se);
 
 	if (wakeup) {
@@ -754,6 +865,26 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int wakeup)
 		__enqueue_entity(cfs_rq, se);
 }
 
+static void enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
+		int wakeup)
+{
+	/*
+	 * Update run-time statistics of the 'current'.
+	 */
+	update_curr(cfs_rq);
+	enqueue_entity_common(cfs_rq, se, wakeup);
+}
+
+static void enqueue_entity_locked(struct cfs_rq *cfs_rq,
+		struct sched_entity *se, int wakeup)
+{
+	/*
+	 * Update run-time statistics of the 'current'.
+	 */
+	update_curr_locked(cfs_rq);
+	enqueue_entity_common(cfs_rq, se, wakeup);
+}
+
 static void __clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	if (cfs_rq->last == se)
@@ -865,8 +996,40 @@ static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
 	return se;
 }
 
+/*
+ * Called from put_prev_entity()
+ * If a group entity (@se) is found to be throttled, it will not be put back
+ * on @cfs_rq, which is equivalent to dequeing it.
+ */
+static void dequeue_throttled_entity(struct cfs_rq *cfs_rq,
+		struct sched_entity *se)
+{
+	unsigned long nr_tasks = group_cfs_rq(se)->nr_tasks_running;
+
+	__clear_buddies(cfs_rq, se);
+	account_entity_dequeue(cfs_rq, se);
+	cfs_rq->curr = NULL;
+
+	if (!nr_tasks)
+		return;
+
+	/*
+	 * Decrement the number of tasks this entity has from
+	 * all of its parent entities.
+	 */
+	sub_cfs_rq_tasks_running(se, nr_tasks);
+
+	/*
+	 * Decrement the number of tasks this entity has from
+	 * this cpu's rq.
+	 */
+	rq_of(cfs_rq)->nr_running -= nr_tasks;
+}
+
 static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 {
+	struct cfs_rq *gcfs_rq = group_cfs_rq(prev);
+
 	/*
 	 * If still on the runqueue then deactivate_task()
 	 * was not called and update_curr() has to be done:
@@ -876,6 +1039,18 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 
 	check_spread(cfs_rq, prev);
 	if (prev->on_rq) {
+		/*
+		 * If the group entity is throttled or if it has no
+		 * no child entities, then don't enqueue it back.
+		 */
+		rq_runtime_lock(rq_of(cfs_rq));
+		if (entity_throttled(prev) ||
+			(gcfs_rq && !gcfs_rq->nr_running)) {
+			dequeue_throttled_entity(cfs_rq, prev);
+			rq_runtime_unlock(rq_of(cfs_rq));
+			return;
+		}
+		rq_runtime_unlock(rq_of(cfs_rq));
 		update_stats_wait_start(cfs_rq, prev);
 		/* Put 'current' back into the tree. */
 		__enqueue_entity(cfs_rq, prev);
@@ -976,22 +1151,32 @@ static inline void hrtick_update(struct rq *rq)
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
  * then put the task into the rbtree:
+ * Don't enqueue a throttled entity further into the hierarchy.
  */
-static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
+static int enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
+	int throttled = 0;
 
+	rq_runtime_lock(rq);
 	for_each_sched_entity(se) {
 		if (se->on_rq)
 			break;
+		if (entity_throttled(se)) {
+			throttled = 1;
+			break;
+		}
 		cfs_rq = cfs_rq_of(se);
-		enqueue_entity(cfs_rq, se, wakeup);
+		enqueue_entity_locked(cfs_rq, se, wakeup);
 		wakeup = 1;
 	}
 
 	add_cfs_rq_tasks_running(&p->se, 1);
+	rq_runtime_unlock(rq);
+
 	hrtick_update(rq);
+	return throttled;
 }
 
 /*
@@ -1541,6 +1726,7 @@ static struct task_struct *pick_next_task_fair(struct rq *rq)
 
 	do {
 		se = pick_next_entity(cfs_rq);
+
 		/*
 		 * If se was a buddy, clear it so that it will have to earn
 		 * the favour again.
@@ -1650,9 +1836,9 @@ load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
 		u64 rem_load, moved_load;
 
 		/*
-		 * empty group
+		 * empty group or a group with no h_load (throttled)
 		 */
-		if (!busiest_cfs_rq->task_weight)
+		if (!busiest_cfs_rq->task_weight || !busiest_h_load)
 			continue;
 
 		rem_load = (u64)rem_load_move * busiest_weight;
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 478fff9..477d3b7 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -846,7 +846,7 @@ static void dequeue_rt_entity(struct sched_rt_entity *rt_se)
 /*
  * Adding/removing a task to/from a priority array:
  */
-static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup)
+static int enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup)
 {
 	struct sched_rt_entity *rt_se = &p->rt;
 
@@ -859,6 +859,7 @@ static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup)
 		enqueue_pushable_task(rq, p);
 
 	inc_cpu_load(rq, p->se.load.weight);
+	return 0;
 }
 
 static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int sleep)

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC v2 PATCH 5/8] sched: Unthrottle the throttled tasks
  2009-09-30 12:49 [RFC v2 PATCH 0/8] CFS Hard limits - v2 Bharata B Rao
                   ` (3 preceding siblings ...)
  2009-09-30 12:52 ` [RFC v2 PATCH 4/8] sched: Enforce hard limits by throttling Bharata B Rao
@ 2009-09-30 12:53 ` Bharata B Rao
  2009-09-30 12:54 ` [RFC v2 PATCH 6/8] sched: Add throttle time statistics to /proc/sched_debug Bharata B Rao
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 39+ messages in thread
From: Bharata B Rao @ 2009-09-30 12:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Gautham R Shenoy, Srivatsa Vaddagiri, Ingo Molnar,
	Peter Zijlstra, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
	Chris Friesen, Paul Menage, Mike Waychison

sched: Unthrottle the throttled tasks.

From: Bharata B Rao <bharata@linux.vnet.ibm.com>

Refresh runtimes when group's bandwidth period expires. Unthrottle any
throttled groups at that time. Refreshing runtimes is driven through
a periodic timer.

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 kernel/sched.c      |   15 ++++++++-
 kernel/sched_fair.c |   81 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 94 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 04c505f..ec302ac 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1820,6 +1820,7 @@ static inline u64 global_cfs_runtime(void)
 }
 
 int task_group_throttled(struct task_group *tg, int cpu);
+void do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b);
 
 static inline int cfs_bandwidth_enabled(struct task_group *tg)
 {
@@ -1845,6 +1846,7 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 	struct cfs_bandwidth *cfs_b =
 		container_of(timer, struct cfs_bandwidth, cfs_period_timer);
 
+	do_sched_cfs_period_timer(cfs_b);
 	hrtimer_add_expires_ns(timer, ktime_to_ns(cfs_b->cfs_period));
 	return HRTIMER_RESTART;
 }
@@ -10588,15 +10590,24 @@ long tg_get_cfs_period(struct task_group *tg)
 
 int tg_set_hard_limit_enabled(struct task_group *tg, u64 val)
 {
-	spin_lock_irq(&tg->cfs_bandwidth.cfs_runtime_lock);
+	local_irq_disable();
+	spin_lock(&tg->cfs_bandwidth.cfs_runtime_lock);
 	if (val > 0) {
 		tg->hard_limit_enabled = 1;
 		start_cfs_bandwidth(tg);
+		spin_unlock(&tg->cfs_bandwidth.cfs_runtime_lock);
 	} else {
 		destroy_cfs_bandwidth(tg);
 		tg->hard_limit_enabled = 0;
+		spin_unlock(&tg->cfs_bandwidth.cfs_runtime_lock);
+		/*
+		 * Hard limiting is being disabled for this group.
+		 * Refresh runtimes and put the throttled entities
+		 * of the group back onto runqueue.
+		 */
+		do_sched_cfs_period_timer(&tg->cfs_bandwidth);
 	}
-	spin_unlock_irq(&tg->cfs_bandwidth.cfs_runtime_lock);
+	local_irq_enable();
 	return 0;
 }
 
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index f98c1c8..8c8b602 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -249,6 +249,80 @@ int task_group_throttled(struct task_group *tg, int cpu)
 	return 0;
 }
 
+static void enqueue_entity_locked(struct cfs_rq *cfs_rq,
+		struct sched_entity *se, int wakeup);
+static void add_cfs_rq_tasks_running(struct sched_entity *se,
+		unsigned long count);
+static void sub_cfs_rq_tasks_running(struct sched_entity *se,
+		unsigned long count);
+
+static void enqueue_throttled_entity(struct rq *rq, struct sched_entity *se)
+{
+	unsigned long nr_tasks = 0;
+	struct sched_entity *se_tmp = se;
+	int throttled = 0;
+
+	for_each_sched_entity(se) {
+		if (se->on_rq)
+			break;
+
+		if (entity_throttled(se)) {
+			throttled = 1;
+			break;
+		}
+
+		enqueue_entity_locked(cfs_rq_of(se), se, 0);
+		nr_tasks += group_cfs_rq(se)->nr_tasks_running;
+	}
+
+	if (!nr_tasks)
+		return;
+
+	/*
+	 * Add the number of tasks this entity has to
+	 * all of its parent entities.
+	 */
+	add_cfs_rq_tasks_running(se_tmp, nr_tasks);
+
+	/*
+	 * Add the number of tasks this entity has to
+	 * this cpu's rq only if the entity got enqueued all the
+	 * way up without any throttled entity in the hierarchy.
+	 */
+	if (!throttled)
+		rq->nr_running += nr_tasks;
+}
+
+/*
+ * Refresh runtimes of all cfs_rqs in this group, i,e.,
+ * refresh runtimes of the representative cfs_rq of this
+ * tg on all cpus. Enqueue any throttled entity back.
+ */
+void do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b)
+{
+	int i;
+	const struct cpumask *span = sched_bw_period_mask();
+	struct task_group *tg = container_of(cfs_b, struct task_group,
+					cfs_bandwidth);
+	unsigned long flags;
+
+	for_each_cpu(i, span) {
+		struct rq *rq = cpu_rq(i);
+		struct cfs_rq *cfs_rq = tg->cfs_rq[i];
+		struct sched_entity *se = tg->se[i];
+
+		spin_lock_irqsave(&rq->lock, flags);
+		rq_runtime_lock(rq);
+		cfs_rq->cfs_time = 0;
+		if (cfs_rq_throttled(cfs_rq)) {
+			cfs_rq->cfs_throttled = 0;
+			enqueue_throttled_entity(rq, se);
+		}
+		rq_runtime_unlock(rq);
+		spin_unlock_irqrestore(&rq->lock, flags);
+	}
+}
+
 #else
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
@@ -348,6 +422,13 @@ static void add_cfs_rq_tasks_running(struct sched_entity *se,
 	struct cfs_rq *cfs_rq;
 
 	for_each_sched_entity(se) {
+		/*
+		 * If any entity in the hierarchy is throttled, don't
+		 * propogate the tasks count up since this entity isn't
+		 * on rq yet.
+		 */
+		if (entity_throttled(se))
+			break;
 		cfs_rq = cfs_rq_of(se);
 		cfs_rq->nr_tasks_running += count;
 	}

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC v2 PATCH 6/8] sched: Add throttle time statistics to /proc/sched_debug
  2009-09-30 12:49 [RFC v2 PATCH 0/8] CFS Hard limits - v2 Bharata B Rao
                   ` (4 preceding siblings ...)
  2009-09-30 12:53 ` [RFC v2 PATCH 5/8] sched: Unthrottle the throttled tasks Bharata B Rao
@ 2009-09-30 12:54 ` Bharata B Rao
  2009-09-30 12:55 ` [RFC v2 PATCH 7/8] sched: Rebalance cfs runtimes Bharata B Rao
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 39+ messages in thread
From: Bharata B Rao @ 2009-09-30 12:54 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Gautham R Shenoy, Srivatsa Vaddagiri, Ingo Molnar,
	Peter Zijlstra, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
	Chris Friesen, Paul Menage, Mike Waychison

sched: Add throttle time statistics to /proc/sched_debug

From: Bharata B Rao <bharata@linux.vnet.ibm.com>

With hard limits, provide stats about throttle time, throttle count
and max throttle time for group sched entities in /proc/sched_debug
Throttle stats are collected only for group entities.

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 include/linux/sched.h |    6 ++++++
 kernel/sched_debug.c  |   17 ++++++++++++++++-
 kernel/sched_fair.c   |   20 ++++++++++++++++++++
 3 files changed, 42 insertions(+), 1 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 77ace43..8c83a39 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1137,6 +1137,12 @@ struct sched_entity {
 	u64			nr_wakeups_affine_attempts;
 	u64			nr_wakeups_passive;
 	u64			nr_wakeups_idle;
+#ifdef CONFIG_CFS_HARD_LIMITS
+	u64			throttle_start;
+	u64			throttle_max;
+	u64			throttle_count;
+	u64			throttle_sum;
+#endif
 #endif
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 8ce525f..b15712d 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -80,6 +80,11 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu,
 	PN(se->wait_max);
 	PN(se->wait_sum);
 	P(se->wait_count);
+#ifdef CONFIG_CFS_HARD_LIMITS
+	PN(se->throttle_max);
+	PN(se->throttle_sum);
+	P(se->throttle_count);
+#endif
 #endif
 	P(se->load.weight);
 #undef PN
@@ -216,6 +221,16 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 #endif
 	SEQ_printf(m, "  .%-30s: %ld\n", "nr_tasks_running",
 			cfs_rq->nr_tasks_running);
+#ifdef CONFIG_CFS_HARD_LIMITS
+	spin_lock_irqsave(&rq->lock, flags);
+	SEQ_printf(m, "  .%-30s: %d\n", "cfs_throttled",
+			cfs_rq->cfs_throttled);
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "cfs_time",
+			SPLIT_NS(cfs_rq->cfs_time));
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "cfs_runtime",
+			SPLIT_NS(cfs_rq->cfs_runtime));
+	spin_unlock_irqrestore(&rq->lock, flags);
+#endif
 	print_cfs_group_stats(m, cpu, cfs_rq->tg);
 #endif
 }
@@ -312,7 +327,7 @@ static int sched_debug_show(struct seq_file *m, void *v)
 	u64 now = ktime_to_ns(ktime_get());
 	int cpu;
 
-	SEQ_printf(m, "Sched Debug Version: v0.09, %s %.*s\n",
+	SEQ_printf(m, "Sched Debug Version: v0.10, %s %.*s\n",
 		init_utsname()->release,
 		(int)strcspn(init_utsname()->version, " "),
 		init_utsname()->version);
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 8c8b602..f4dec63 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -188,6 +188,23 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #ifdef CONFIG_CFS_HARD_LIMITS
 
+static inline void update_stats_throttle_start(struct cfs_rq *cfs_rq,
+			struct sched_entity *se)
+{
+	schedstat_set(se->throttle_start, rq_of(cfs_rq)->clock);
+}
+
+static inline void update_stats_throttle_end(struct cfs_rq *cfs_rq,
+			struct sched_entity *se)
+{
+	schedstat_set(se->throttle_max, max(se->throttle_max,
+			rq_of(cfs_rq)->clock - se->throttle_start));
+	schedstat_set(se->throttle_count, se->throttle_count + 1);
+	schedstat_set(se->throttle_sum, se->throttle_sum +
+			rq_of(cfs_rq)->clock - se->throttle_start);
+	schedstat_set(se->throttle_start, 0);
+}
+
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {
 	return cfs_rq->cfs_throttled;
@@ -217,6 +234,7 @@ static void sched_cfs_runtime_exceeded(struct sched_entity *se,
 
 	if (cfs_rq->cfs_time > cfs_rq->cfs_runtime) {
 		cfs_rq->cfs_throttled = 1;
+		update_stats_throttle_start(cfs_rq, se);
 		resched_task(tsk_curr);
 	}
 }
@@ -315,6 +333,8 @@ void do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b)
 		rq_runtime_lock(rq);
 		cfs_rq->cfs_time = 0;
 		if (cfs_rq_throttled(cfs_rq)) {
+			update_rq_clock(rq);
+			update_stats_throttle_end(cfs_rq, se);
 			cfs_rq->cfs_throttled = 0;
 			enqueue_throttled_entity(rq, se);
 		}

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC v2 PATCH 7/8] sched: Rebalance cfs runtimes
  2009-09-30 12:49 [RFC v2 PATCH 0/8] CFS Hard limits - v2 Bharata B Rao
                   ` (5 preceding siblings ...)
  2009-09-30 12:54 ` [RFC v2 PATCH 6/8] sched: Add throttle time statistics to /proc/sched_debug Bharata B Rao
@ 2009-09-30 12:55 ` Bharata B Rao
  2009-09-30 12:55 ` [RFC v2 PATCH 8/8] sched: Hard limits documentation Bharata B Rao
  2009-09-30 13:36 ` [RFC v2 PATCH 0/8] CFS Hard limits - v2 Pavel Emelyanov
  8 siblings, 0 replies; 39+ messages in thread
From: Bharata B Rao @ 2009-09-30 12:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Gautham R Shenoy, Srivatsa Vaddagiri, Ingo Molnar,
	Peter Zijlstra, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
	Chris Friesen, Paul Menage, Mike Waychison

sched: CFS runtime borrowing

From: Bharata B Rao <bharata@linux.vnet.ibm.com>

To start with, a group will get equal runtime on every cpu. If the group doesn't
have tasks on all cpus, it might get throttled on some cpus while it still has
runtime left on other cpus where it doesn't have any tasks to consume that
runtime. Hence there is a chance to borrow runtimes from such cpus/cfs_rqs to
cpus/cfs_rqs where it is required.
---
 kernel/sched_fair.c |   98 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 98 insertions(+), 0 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index f4dec63..8b43f4f 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -205,12 +205,107 @@ static inline void update_stats_throttle_end(struct cfs_rq *cfs_rq,
 	schedstat_set(se->throttle_start, 0);
 }
 
+static void double_rq_runtime_lock(struct rq *rq1, struct rq *rq2)
+	__acquires(rq1->runtime_lock)
+	__acquires(rq2->runtime_lock)
+{
+	BUG_ON(!irqs_disabled());
+	if (rq1 == rq2) {
+		spin_lock(&rq1->runtime_lock);
+		__acquire(rq2->runtime_lock);	/* Fake it out ;) */
+	} else {
+		if (rq1 < rq2) {
+			spin_lock(&rq1->runtime_lock);
+			spin_lock_nested(&rq2->runtime_lock,
+					SINGLE_DEPTH_NESTING);
+		} else {
+			spin_lock(&rq2->runtime_lock);
+			spin_lock_nested(&rq1->runtime_lock,
+					SINGLE_DEPTH_NESTING);
+		}
+	}
+	update_rq_clock(rq1);
+	update_rq_clock(rq2);
+}
+
+static void double_rq_runtime_unlock(struct rq *rq1, struct rq *rq2)
+	__releases(rq1->runtime_lock)
+	__releases(rq2->runtime_lock)
+{
+	spin_unlock(&rq1->runtime_lock);
+	if (rq1 != rq2)
+		spin_unlock(&rq2->runtime_lock);
+	else
+		__release(rq2->runtime_lock);
+}
+
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {
 	return cfs_rq->cfs_throttled;
 }
 
 /*
+ * Ran out of runtime, check if we can borrow some from others
+ * instead of getting throttled right away.
+ */
+static void do_cfs_balance_runtime(struct cfs_rq *cfs_rq)
+{
+	struct rq *rq = rq_of(cfs_rq);
+	struct cfs_bandwidth *cfs_b = &cfs_rq->tg->cfs_bandwidth;
+	const struct cpumask *span = sched_bw_period_mask();
+	int i, weight;
+	u64 cfs_period;
+	struct task_group *tg = container_of(cfs_b, struct task_group,
+				cfs_bandwidth);
+
+	weight = cpumask_weight(span);
+	spin_lock(&cfs_b->cfs_runtime_lock);
+	cfs_period = ktime_to_ns(cfs_b->cfs_period);
+
+	for_each_cpu(i, span) {
+		struct cfs_rq *borrow_cfs_rq = tg->cfs_rq[i];
+		struct rq *borrow_rq = rq_of(borrow_cfs_rq);
+		s64 diff;
+
+		if (borrow_cfs_rq == cfs_rq)
+			continue;
+
+		double_rq_runtime_lock(rq, borrow_rq);
+		if (borrow_cfs_rq->cfs_runtime == RUNTIME_INF) {
+			double_rq_runtime_unlock(rq, borrow_rq);
+			continue;
+		}
+
+		diff = borrow_cfs_rq->cfs_runtime - borrow_cfs_rq->cfs_time;
+		if (diff > 0) {
+			diff = div_u64((u64)diff, weight);
+			if (cfs_rq->cfs_runtime + diff > cfs_period)
+				diff = cfs_period - cfs_rq->cfs_runtime;
+			borrow_cfs_rq->cfs_runtime -= diff;
+			cfs_rq->cfs_runtime += diff;
+			if (cfs_rq->cfs_runtime == cfs_period) {
+				double_rq_runtime_unlock(rq, borrow_rq);
+				break;
+			}
+		}
+		double_rq_runtime_unlock(rq, borrow_rq);
+	}
+	spin_unlock(&cfs_b->cfs_runtime_lock);
+}
+
+/*
+ * Called with rq->runtime_lock held.
+ */
+static void cfs_balance_runtime(struct cfs_rq *cfs_rq)
+{
+	struct rq *rq = rq_of(cfs_rq);
+
+	rq_runtime_unlock(rq);
+	do_cfs_balance_runtime(cfs_rq);
+	rq_runtime_lock(rq);
+}
+
+/*
  * Check if group entity exceeded its runtime. If so, mark the cfs_rq as
  * throttled mark the current task for reschedling.
  */
@@ -232,6 +327,9 @@ static void sched_cfs_runtime_exceeded(struct sched_entity *se,
 	if (cfs_rq_throttled(cfs_rq))
 		return;
 
+	if (cfs_rq->cfs_time > cfs_rq->cfs_runtime)
+		cfs_balance_runtime(cfs_rq);
+
 	if (cfs_rq->cfs_time > cfs_rq->cfs_runtime) {
 		cfs_rq->cfs_throttled = 1;
 		update_stats_throttle_start(cfs_rq, se);

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC v2 PATCH 8/8] sched: Hard limits documentation
  2009-09-30 12:49 [RFC v2 PATCH 0/8] CFS Hard limits - v2 Bharata B Rao
                   ` (6 preceding siblings ...)
  2009-09-30 12:55 ` [RFC v2 PATCH 7/8] sched: Rebalance cfs runtimes Bharata B Rao
@ 2009-09-30 12:55 ` Bharata B Rao
  2009-09-30 13:36 ` [RFC v2 PATCH 0/8] CFS Hard limits - v2 Pavel Emelyanov
  8 siblings, 0 replies; 39+ messages in thread
From: Bharata B Rao @ 2009-09-30 12:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Gautham R Shenoy, Srivatsa Vaddagiri, Ingo Molnar,
	Peter Zijlstra, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
	Chris Friesen, Paul Menage, Mike Waychison

sched: Hard limits documentation

From: Bharata B Rao <bharata@linux.vnet.ibm.com>

Documentation for hard limits feature.

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 Documentation/scheduler/sched-cfs-hard-limits.txt |   52 +++++++++++++++++++++
 1 files changed, 52 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/scheduler/sched-cfs-hard-limits.txt

diff --git a/Documentation/scheduler/sched-cfs-hard-limits.txt b/Documentation/scheduler/sched-cfs-hard-limits.txt
new file mode 100644
index 0000000..bf3859f
--- /dev/null
+++ b/Documentation/scheduler/sched-cfs-hard-limits.txt
@@ -0,0 +1,52 @@
+CPU HARD LIMITS FOR CFS GROUPS
+==============================
+
+1. Overview
+2. Interface
+3. Examples
+
+1. Overview
+-----------
+
+CFS is a proportional share scheduler which tries to divide the CPU time
+proportionately between tasks or groups of tasks (task group/cgroup) depending
+on the priority/weight of the task or shares assigned to groups of tasks.
+In CFS, a task/task group can get more than its share of CPU if there are
+enough idle CPU cycles available in the system, due to the work conserving
+nature of the scheduler. However in certain scenarios (like pay-per-use),
+it is desirable not to provide extra time to a group even in the presence
+of idle CPU cycles. This is where hard limiting can be of use.
+
+Hard limits for task groups can be set by specifying how much CPU runtime a
+group can consume within a given period. If the group consumes more CPU time
+than the runtime in a given period, it gets throttled. None of the tasks of
+the throttled group gets to run until the runtime of the group gets refreshed
+at the beginning of the next period.
+
+2. Interface
+------------
+
+Hard limit feature adds 3 cgroup files for CFS group scheduler:
+
+cfs_runtime_us: Hard limit for the group in microseconds.
+
+cfs_period_us: Time period in microseconds within which hard limits is
+enforced.
+
+cfs_hard_limit: The control file to enable or disable hard limiting for the
+group.
+
+A group gets created with default values for runtime and period and with
+hard limit disabled. Each group can set its own values for runtime and period
+independent of other groups in the system.
+
+3. Examples
+-----------
+
+# mount -t cgroup -ocpu none /cgroups/
+# cd /cgroups
+# mkdir 1
+# cd 1/
+# echo 250000 > cfs_runtime_us /* set a 250ms runtime or limit */
+# echo 500000 > cfs_period_us /* set a 500ms period */
+# echo 1 > cfs_hard_limit /* enable hard limiting for group 1/ */

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 0/8] CFS Hard limits - v2
  2009-09-30 12:49 [RFC v2 PATCH 0/8] CFS Hard limits - v2 Bharata B Rao
                   ` (7 preceding siblings ...)
  2009-09-30 12:55 ` [RFC v2 PATCH 8/8] sched: Hard limits documentation Bharata B Rao
@ 2009-09-30 13:36 ` Pavel Emelyanov
  2009-09-30 14:25   ` Bharata B Rao
  2009-09-30 14:38   ` Balbir Singh
  8 siblings, 2 replies; 39+ messages in thread
From: Pavel Emelyanov @ 2009-09-30 13:36 UTC (permalink / raw)
  To: bharata
  Cc: linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Ingo Molnar, Peter Zijlstra, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen, Paul Menage, Mike Waychison

Bharata B Rao wrote:
> Hi,
> 
> Here is the v2 post of hard limits feature for CFS group scheduler. This
> RFC post mainly adds runtime borrowing feature and has a new locking scheme
> to protect CFS runtime related fields.
> 
> It would be nice to have some comments on this set!

I have a question I'd like to ask before diving into the code.
Consider I'm a user, that has a 4CPUs box 2GHz each and I'd like
to create a container with 2CPUs 1GHz each. Can I achieve this
after your patches?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 0/8] CFS Hard limits - v2
  2009-09-30 13:36 ` [RFC v2 PATCH 0/8] CFS Hard limits - v2 Pavel Emelyanov
@ 2009-09-30 14:25   ` Bharata B Rao
  2009-09-30 14:39     ` Srivatsa Vaddagiri
  2009-09-30 14:38   ` Balbir Singh
  1 sibling, 1 reply; 39+ messages in thread
From: Bharata B Rao @ 2009-09-30 14:25 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Ingo Molnar, Peter Zijlstra, Herbert Poetzl, Avi Kivity,
	Chris Friesen, Paul Menage, Mike Waychison

On Wed, Sep 30, 2009 at 05:36:29PM +0400, Pavel Emelyanov wrote:
> Bharata B Rao wrote:
> > Hi,
> > 
> > Here is the v2 post of hard limits feature for CFS group scheduler. This
> > RFC post mainly adds runtime borrowing feature and has a new locking scheme
> > to protect CFS runtime related fields.
> > 
> > It would be nice to have some comments on this set!
> 
> I have a question I'd like to ask before diving into the code.
> Consider I'm a user, that has a 4CPUs box 2GHz each and I'd like
> to create a container with 2CPUs 1GHz each. Can I achieve this
> after your patches?

I am not sure if I understand the GHz specification you mention here.
Are you saying that you want run a container with 2 CPUS with each of
them running at half their (frequency)capacity ?

This hard limits scheme is about time based rate limiting where you can
specify a runtime(=hard limit) and a period for the container and the
container will not be allowed to consume more than the specified CPU time
within a given period.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 0/8] CFS Hard limits - v2
  2009-09-30 13:36 ` [RFC v2 PATCH 0/8] CFS Hard limits - v2 Pavel Emelyanov
  2009-09-30 14:25   ` Bharata B Rao
@ 2009-09-30 14:38   ` Balbir Singh
  2009-09-30 15:10     ` Pavel Emelyanov
  1 sibling, 1 reply; 39+ messages in thread
From: Balbir Singh @ 2009-09-30 14:38 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: bharata, linux-kernel, Dhaval Giani, Vaidyanathan Srinivasan,
	Gautham R Shenoy, Srivatsa Vaddagiri, Ingo Molnar,
	Peter Zijlstra, Herbert Poetzl, Avi Kivity, Chris Friesen,
	Paul Menage, Mike Waychison

* Pavel Emelyanov <xemul@openvz.org> [2009-09-30 17:36:29]:

> Bharata B Rao wrote:
> > Hi,
> > 
> > Here is the v2 post of hard limits feature for CFS group scheduler. This
> > RFC post mainly adds runtime borrowing feature and has a new locking scheme
> > to protect CFS runtime related fields.
> > 
> > It would be nice to have some comments on this set!
> 
> I have a question I'd like to ask before diving into the code.
> Consider I'm a user, that has a 4CPUs box 2GHz each and I'd like
> to create a container with 2CPUs 1GHz each. Can I achieve this
> after your patches?

I don't think the GHz makes any sense, consider CPUs with frequency
scaling. If I can scale from 1.6GHz to say 2.6GHz or 2GHz to 4GHz,
what does it mean for hard limit control? Hard limits define control
over existing bandwidth, anything else would be superficial and hard
hard to get right for both developers and users.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 0/8] CFS Hard limits - v2
  2009-09-30 14:25   ` Bharata B Rao
@ 2009-09-30 14:39     ` Srivatsa Vaddagiri
  2009-09-30 15:09       ` Pavel Emelyanov
  2009-10-13 11:39       ` Pavel Emelyanov
  0 siblings, 2 replies; 39+ messages in thread
From: Srivatsa Vaddagiri @ 2009-09-30 14:39 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: Pavel Emelyanov, linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Ingo Molnar,
	Peter Zijlstra, Herbert Poetzl, Avi Kivity, Chris Friesen,
	Paul Menage, Mike Waychison

On Wed, Sep 30, 2009 at 07:55:37PM +0530, Bharata B Rao wrote:
> On Wed, Sep 30, 2009 at 05:36:29PM +0400, Pavel Emelyanov wrote:
> > Bharata B Rao wrote:
> > > Hi,
> > > 
> > > Here is the v2 post of hard limits feature for CFS group scheduler. This
> > > RFC post mainly adds runtime borrowing feature and has a new locking scheme
> > > to protect CFS runtime related fields.
> > > 
> > > It would be nice to have some comments on this set!
> > 
> > I have a question I'd like to ask before diving into the code.
> > Consider I'm a user, that has a 4CPUs box 2GHz each and I'd like
> > to create a container with 2CPUs 1GHz each. Can I achieve this
> > after your patches?
> 
> I am not sure if I understand the GHz specification you mention here.
> Are you saying that you want run a container with 2 CPUS with each of
> them running at half their (frequency)capacity ?
> 
> This hard limits scheme is about time based rate limiting where you can
> specify a runtime(=hard limit) and a period for the container and the
> container will not be allowed to consume more than the specified CPU time
> within a given period.

IMO Pavel's requirement can be met with a hard limit of 25%

	2 CPU of 1GHz = (1/2 x 4) (1/2 x 2) GHz CPUs
		      = 1/4 x 4 2GHz CPU
		      = 25% of (4 2GHz CPU)

IOW by hard-limiting a container thread to run just 0.5sec every sec on a 2GHz 
cpu, it is effectively making progress at the rate of 1GHz?

- vatsa

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 0/8] CFS Hard limits - v2
  2009-09-30 14:39     ` Srivatsa Vaddagiri
@ 2009-09-30 15:09       ` Pavel Emelyanov
  2009-10-13 11:39       ` Pavel Emelyanov
  1 sibling, 0 replies; 39+ messages in thread
From: Pavel Emelyanov @ 2009-09-30 15:09 UTC (permalink / raw)
  To: vatsa
  Cc: Bharata B Rao, linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Ingo Molnar,
	Peter Zijlstra, Herbert Poetzl, Avi Kivity, Chris Friesen,
	Paul Menage, Mike Waychison

Srivatsa Vaddagiri wrote:
> On Wed, Sep 30, 2009 at 07:55:37PM +0530, Bharata B Rao wrote:
>> On Wed, Sep 30, 2009 at 05:36:29PM +0400, Pavel Emelyanov wrote:
>>> Bharata B Rao wrote:
>>>> Hi,
>>>>
>>>> Here is the v2 post of hard limits feature for CFS group scheduler. This
>>>> RFC post mainly adds runtime borrowing feature and has a new locking scheme
>>>> to protect CFS runtime related fields.
>>>>
>>>> It would be nice to have some comments on this set!
>>> I have a question I'd like to ask before diving into the code.
>>> Consider I'm a user, that has a 4CPUs box 2GHz each and I'd like
>>> to create a container with 2CPUs 1GHz each. Can I achieve this
>>> after your patches?
>> I am not sure if I understand the GHz specification you mention here.
>> Are you saying that you want run a container with 2 CPUS with each of
>> them running at half their (frequency)capacity ?
>>
>> This hard limits scheme is about time based rate limiting where you can
>> specify a runtime(=hard limit) and a period for the container and the
>> container will not be allowed to consume more than the specified CPU time
>> within a given period.
> 
> IMO Pavel's requirement can be met with a hard limit of 25%
> 
> 	2 CPU of 1GHz = (1/2 x 4) (1/2 x 2) GHz CPUs
> 		      = 1/4 x 4 2GHz CPU
> 		      = 25% of (4 2GHz CPU)
> 
> IOW by hard-limiting a container thread to run just 0.5sec every sec on a 2GHz 
> cpu, it is effectively making progress at the rate of 1GHz?

4CPUS 25% each is not the same as 2CPUS 50% each.
OTOH making 2CPU container and setting 50% for both would work, but please
look at the problem from the end user point of view. He wants to set 50% of
the CPU power. Which setup is better 0.5/1, 1/2 or 0.25/0.5?

If you look at how tc works it proposes the user to select bandwidth in
human readable values like kbps or mbps.

> - vatsa
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 0/8] CFS Hard limits - v2
  2009-09-30 14:38   ` Balbir Singh
@ 2009-09-30 15:10     ` Pavel Emelyanov
  2009-09-30 15:30       ` Balbir Singh
  0 siblings, 1 reply; 39+ messages in thread
From: Pavel Emelyanov @ 2009-09-30 15:10 UTC (permalink / raw)
  To: balbir
  Cc: bharata, linux-kernel, Dhaval Giani, Vaidyanathan Srinivasan,
	Gautham R Shenoy, Srivatsa Vaddagiri, Ingo Molnar,
	Peter Zijlstra, Herbert Poetzl, Avi Kivity, Chris Friesen,
	Paul Menage, Mike Waychison

Balbir Singh wrote:
> * Pavel Emelyanov <xemul@openvz.org> [2009-09-30 17:36:29]:
> 
>> Bharata B Rao wrote:
>>> Hi,
>>>
>>> Here is the v2 post of hard limits feature for CFS group scheduler. This
>>> RFC post mainly adds runtime borrowing feature and has a new locking scheme
>>> to protect CFS runtime related fields.
>>>
>>> It would be nice to have some comments on this set!
>> I have a question I'd like to ask before diving into the code.
>> Consider I'm a user, that has a 4CPUs box 2GHz each and I'd like
>> to create a container with 2CPUs 1GHz each. Can I achieve this
>> after your patches?
> 
> I don't think the GHz makes any sense, consider CPUs with frequency
> scaling. If I can scale from 1.6GHz to say 2.6GHz or 2GHz to 4GHz,
> what does it mean for hard limit control? Hard limits define control
> over existing bandwidth, anything else would be superficial and hard
> hard to get right for both developers and users.

Two numbers for configuring limits make even less sense OTOH ;)
By assigning 2GHz for 4GHz CPU I obviously want half of its power ;)
Please, see my reply to vatsa@ in this thread.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 0/8] CFS Hard limits - v2
  2009-09-30 15:10     ` Pavel Emelyanov
@ 2009-09-30 15:30       ` Balbir Singh
  2009-09-30 22:30         ` Herbert Poetzl
  0 siblings, 1 reply; 39+ messages in thread
From: Balbir Singh @ 2009-09-30 15:30 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: bharata, linux-kernel, Dhaval Giani, Vaidyanathan Srinivasan,
	Gautham R Shenoy, Srivatsa Vaddagiri, Ingo Molnar,
	Peter Zijlstra, Herbert Poetzl, Avi Kivity, Chris Friesen,
	Paul Menage, Mike Waychison

* Pavel Emelyanov <xemul@openvz.org> [2009-09-30 19:10:27]:

> Balbir Singh wrote:
> > * Pavel Emelyanov <xemul@openvz.org> [2009-09-30 17:36:29]:
> > 
> >> Bharata B Rao wrote:
> >>> Hi,
> >>>
> >>> Here is the v2 post of hard limits feature for CFS group scheduler. This
> >>> RFC post mainly adds runtime borrowing feature and has a new locking scheme
> >>> to protect CFS runtime related fields.
> >>>
> >>> It would be nice to have some comments on this set!
> >> I have a question I'd like to ask before diving into the code.
> >> Consider I'm a user, that has a 4CPUs box 2GHz each and I'd like
> >> to create a container with 2CPUs 1GHz each. Can I achieve this
> >> after your patches?
> > 
> > I don't think the GHz makes any sense, consider CPUs with frequency
> > scaling. If I can scale from 1.6GHz to say 2.6GHz or 2GHz to 4GHz,
> > what does it mean for hard limit control? Hard limits define control
> > over existing bandwidth, anything else would be superficial and hard
> > hard to get right for both developers and users.
> 
> Two numbers for configuring limits make even less sense OTOH ;)
> By assigning 2GHz for 4GHz CPU I obviously want half of its power ;)
> Please, see my reply to vatsa@ in this thread.

But it makes life more difficult for the administrator to think in
terms of GHz -- no? Specifically with different heterogeneous systems.
I think it would be chaotic in a data center to configure GHz for
every partition. Not to say that it makes it even more confusing when
running on top of KVM. Lets say I create two vCPUs and I specifiy GHz
outside, do I expect to see it in /proc/cpuinfo?

I'd like to hear what others think about GHz as well.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 0/8] CFS Hard limits - v2
  2009-09-30 15:30       ` Balbir Singh
@ 2009-09-30 22:30         ` Herbert Poetzl
  2009-10-01  5:12           ` Bharata B Rao
  0 siblings, 1 reply; 39+ messages in thread
From: Herbert Poetzl @ 2009-09-30 22:30 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Pavel Emelyanov, bharata, linux-kernel, Dhaval Giani,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Ingo Molnar, Peter Zijlstra, Avi Kivity, Chris Friesen,
	Paul Menage, Mike Waychison

On Wed, Sep 30, 2009 at 09:00:53PM +0530, Balbir Singh wrote:
> * Pavel Emelyanov <xemul@openvz.org> [2009-09-30 19:10:27]:
> > Balbir Singh wrote:
> > > * Pavel Emelyanov <xemul@openvz.org> [2009-09-30 17:36:29]:
> > >> Bharata B Rao wrote:
> > >>> Hi,
> > >>>
> > >>> Here is the v2 post of hard limits feature for CFS group scheduler. This
> > >>> RFC post mainly adds runtime borrowing feature and has a new locking scheme
> > >>> to protect CFS runtime related fields.
> > >>>
> > >>> It would be nice to have some comments on this set!
> > >> I have a question I'd like to ask before diving into the code.
> > >> Consider I'm a user, that has a 4CPUs box 2GHz each and I'd like
> > >> to create a container with 2CPUs 1GHz each. Can I achieve this
> > >> after your patches?
> > > 
> > > I don't think the GHz makes any sense, consider CPUs with frequency
> > > scaling. If I can scale from 1.6GHz to say 2.6GHz or 2GHz to 4GHz,
> > > what does it mean for hard limit control? Hard limits define control
> > > over existing bandwidth, anything else would be superficial and hard
> > > hard to get right for both developers and users.
> > 
> > Two numbers for configuring limits make even less sense OTOH ;)
> > By assigning 2GHz for 4GHz CPU I obviously want half of its power ;)
> > Please, see my reply to vatsa@ in this thread.

> But it makes life more difficult for the administrator to think in
> terms of GHz -- no? Specifically with different heterogeneous systems.
> I think it would be chaotic in a data center to configure GHz for
> every partition. Not to say that it makes it even more confusing when
> running on top of KVM. Lets say I create two vCPUs and I specifiy GHz
> outside, do I expect to see it in /proc/cpuinfo?

> I'd like to hear what others think about GHz as well.

for me, using something like GHz or even BogoMips would
not make any sense whatsoever ... I would be able to
understand percentage, but this has some other implications
as it does not handle granularity ....

Linux-VServer uses interval and rate, which is very similar
to your setup, except that we use two pairs to achive
a differentiation for 'busy' and 'idle' cases, i.e. when
the cpu(s) would go idle, we switch from R1/I1 to R2/I2
for all guests, allowing to distribute the excess differently
than the 'active' part ... but only one set is needed for
hard limits ...

best,
Herbert

> -- 
> 	Balbir

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 0/8] CFS Hard limits - v2
  2009-09-30 22:30         ` Herbert Poetzl
@ 2009-10-01  5:12           ` Bharata B Rao
  0 siblings, 0 replies; 39+ messages in thread
From: Bharata B Rao @ 2009-10-01  5:12 UTC (permalink / raw)
  To: Balbir Singh, Pavel Emelyanov, linux-kernel, Dhaval Giani,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Ingo Molnar, Peter Zijlstra, Avi Kivity, Chris Friesen,
	Paul Menage, Mike Waychison

On Thu, Oct 01, 2009 at 12:30:39AM +0200, Herbert Poetzl wrote:
> 
> Linux-VServer uses interval and rate, which is very similar
> to your setup, except that we use two pairs to achive
> a differentiation for 'busy' and 'idle' cases, i.e. when
> the cpu(s) would go idle, we switch from R1/I1 to R2/I2
> for all guests, allowing to distribute the excess differently
> than the 'active' part ... but only one set is needed for
> hard limits ...

Do you have examples of how you choose R1, I1 and R2, I2 for guests
running typical workloads ?

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 0/8] CFS Hard limits - v2
  2009-09-30 14:39     ` Srivatsa Vaddagiri
  2009-09-30 15:09       ` Pavel Emelyanov
@ 2009-10-13 11:39       ` Pavel Emelyanov
  2009-10-13 12:03         ` Herbert Poetzl
  2009-10-13 14:49         ` Valdis.Kletnieks
  1 sibling, 2 replies; 39+ messages in thread
From: Pavel Emelyanov @ 2009-10-13 11:39 UTC (permalink / raw)
  To: vatsa, Bharata B Rao, Balbir Singh
  Cc: linux-kernel, Dhaval Giani, Vaidyanathan Srinivasan,
	Gautham R Shenoy, Ingo Molnar, Peter Zijlstra, Herbert Poetzl,
	Avi Kivity, Chris Friesen, Paul Menage, Mike Waychison

> IMO Pavel's requirement can be met with a hard limit of 25%
> 
> 	2 CPU of 1GHz = (1/2 x 4) (1/2 x 2) GHz CPUs
> 		      = 1/4 x 4 2GHz CPU
> 		      = 25% of (4 2GHz CPU)
> 
> IOW by hard-limiting a container thread to run just 0.5sec every sec on a 2GHz 
> cpu, it is effectively making progress at the rate of 1GHz?

So, any suggestions on this? I'm not against the patches themselves.
I'm just trying to tell, that setting a cpulimit with 2 numbers is
not a good way to go (at least a clean explanation of how to calculate
them should go with the patches).

I propose to first collect what *can* be done. I see the following
possibilities:

1) two times (as it is now) - I believe this is inconvenient.
2) the amount in percents (like 50%) - this is how it works in
   OpenVZ and customers are quite happy with it. It's better than
   two numbers, since you need to specify only one clean number.
3) virtual cpu power in M/GHz - I don't agree with Balbir that
   this is difficult for administrator. This is better than two
   numbers and better that the percentage, since the amount of
   cpu time got by a container will not change after migrating to
   a more powerful CPU.

Thoughts?

> - vatsa
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 0/8] CFS Hard limits - v2
  2009-10-13 11:39       ` Pavel Emelyanov
@ 2009-10-13 12:03         ` Herbert Poetzl
  2009-10-13 12:19           ` Pavel Emelyanov
  2009-10-13 14:49         ` Valdis.Kletnieks
  1 sibling, 1 reply; 39+ messages in thread
From: Herbert Poetzl @ 2009-10-13 12:03 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: vatsa, Bharata B Rao, Balbir Singh, linux-kernel, Dhaval Giani,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Ingo Molnar,
	Peter Zijlstra, Avi Kivity, Chris Friesen, Paul Menage,
	Mike Waychison

On Tue, Oct 13, 2009 at 03:39:17PM +0400, Pavel Emelyanov wrote:
> > IMO Pavel's requirement can be met with a hard limit of 25%
> > 
> > 	2 CPU of 1GHz = (1/2 x 4) (1/2 x 2) GHz CPUs
> > 		      = 1/4 x 4 2GHz CPU
> > 		      = 25% of (4 2GHz CPU)
> > 
> > IOW by hard-limiting a container thread to run just 0.5sec every
> > sec on a 2GHz cpu, it is effectively making progress at the rate of
> > 1GHz?

> So, any suggestions on this? I'm not against the patches themselves.
> I'm just trying to tell, that setting a cpulimit with 2 numbers is
> not a good way to go (at least a clean explanation of how to calculate
> them should go with the patches).

as I already stated, it seems perfectly fine for me (to have
two values, period and runtime), IMHO it is quite natural to 
understand and it allows (to some degree) to control the
scheduling behaviour by controlling the multiplicator (i.e.
use 1s/0.5s vs 100ms/50ms) ...

we already incorporated the patch (for testing) in our
current release, and it seems to work fine and do what we
need/want (see http://linux-vserver.org/util-vserver:Cgroups)

> I propose to first collect what *can* be done. I see the following
> possibilities:

> 1) two times (as it is now) - I believe this is inconvenient.
> 2) the amount in percents (like 50%) - this is how it works in
>    OpenVZ and customers are quite happy with it. It's better than
>    two numbers, since you need to specify only one clean number.

can be trivially mapped to the two values, by chosing a
fixed multiplicative base (let's say '1s' to simplify :) 

  with 50%, you get 1s/0.5s
  with 20%, you get 1s/0.2s
  with  5%, you get 1s/0.05s

well, you get the idea :)

> 3) virtual cpu power in M/GHz - I don't agree with Balbir that
>    this is difficult for administrator. This is better than two
>    numbers and better that the percentage, since the amount of
>    cpu time got by a container will not change after migrating to
>    a more powerful CPU.

I think this is completely artificial, and adds a lot
of silly cornercases, like e.g. cpu speed changes (think
SpeedStep and friends) and doesn't help the administration
in any way ... nevertheless, it should also be trivial to
map to the two values if you do the following:

  Host CPU = 4.5GHz
  Desired Guest CPU = 2.0Ghz

  2.0/4.5 = 0.44' -> 44% -> 1s/0.44s

> Thoughts?

so once again, 'we' (Linux-VServer) are perfectly happy
with the current status, the only difference to what we
used to have is that we calculated the period and runtime
in jiffies not micro seconds, and called them interval
and rate (which is as simple to map as the percentage
OpenVZ uses)

best,
Herbert

> > - vatsa
> > 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 0/8] CFS Hard limits - v2
  2009-10-13 12:03         ` Herbert Poetzl
@ 2009-10-13 12:19           ` Pavel Emelyanov
  2009-10-13 12:30             ` Dhaval Giani
                               ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Pavel Emelyanov @ 2009-10-13 12:19 UTC (permalink / raw)
  To: Herbert Poetzl, vatsa, Bharata B Rao, Balbir Singh, linux-kernel,
	Dhaval Giani, Vaidyanathan Srinivasan, Gautham R Shenoy,
	Ingo Molnar, Peter Zijlstra, Avi Kivity, Chris Friesen,
	Paul Menage, Mike Waychison

> as I already stated, it seems perfectly fine for me 

You're not the only one interested in it, sorry. Besides, I
got your point in "I'm find with it". Now get mine which is
about "I am not".

> can be trivially mapped to the two values, by chosing a
> fixed multiplicative base (let's say '1s' to simplify :) 
> 
>   with 50%, you get 1s/0.5s
>   with 20%, you get 1s/0.2s
>   with  5%, you get 1s/0.05s
> 
> well, you get the idea :)

No I don't.
Is 1s/0.5s worse or better than 2s/1s?
How should I make a choice?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 0/8] CFS Hard limits - v2
  2009-10-13 12:19           ` Pavel Emelyanov
@ 2009-10-13 12:30             ` Dhaval Giani
  2009-10-13 12:45               ` Pavel Emelyanov
  2009-10-13 14:56             ` Valdis.Kletnieks
  2009-10-13 22:02             ` Herbert Poetzl
  2 siblings, 1 reply; 39+ messages in thread
From: Dhaval Giani @ 2009-10-13 12:30 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Herbert Poetzl, vatsa, Bharata B Rao, Balbir Singh, linux-kernel,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Ingo Molnar,
	Peter Zijlstra, Avi Kivity, Chris Friesen, Paul Menage,
	Mike Waychison

On Tue, Oct 13, 2009 at 04:19:41PM +0400, Pavel Emelyanov wrote:
> > as I already stated, it seems perfectly fine for me 
> 
> You're not the only one interested in it, sorry. Besides, I
> got your point in "I'm find with it". Now get mine which is
> about "I am not".
> 
> > can be trivially mapped to the two values, by chosing a
> > fixed multiplicative base (let's say '1s' to simplify :) 
> > 
> >   with 50%, you get 1s/0.5s
> >   with 20%, you get 1s/0.2s
> >   with  5%, you get 1s/0.05s
> > 
> > well, you get the idea :)
> 
> No I don't.
> Is 1s/0.5s worse or better than 2s/1s?
> How should I make a choice?

I would say it depends on your requirement. How fast do you want to
respond back to the user? Wiht lower bandwidth, you would want to have
shorter periods so that the user would not get the impression that he
has to "wait" to get CPU time. But having a very short period is not a
good thing, since there are other considerations (such as the overhead of
hard limits).

thanks,
-- 
regards,
Dhaval

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 0/8] CFS Hard limits - v2
  2009-10-13 12:30             ` Dhaval Giani
@ 2009-10-13 12:45               ` Pavel Emelyanov
  2009-10-13 12:56                 ` Dhaval Giani
  2009-10-13 12:57                 ` Bharata B Rao
  0 siblings, 2 replies; 39+ messages in thread
From: Pavel Emelyanov @ 2009-10-13 12:45 UTC (permalink / raw)
  To: Dhaval Giani
  Cc: Herbert Poetzl, vatsa, Bharata B Rao, Balbir Singh, linux-kernel,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Ingo Molnar,
	Peter Zijlstra, Avi Kivity, Chris Friesen, Paul Menage,
	Mike Waychison

Dhaval Giani wrote:
> On Tue, Oct 13, 2009 at 04:19:41PM +0400, Pavel Emelyanov wrote:
>>> as I already stated, it seems perfectly fine for me 
>> You're not the only one interested in it, sorry. Besides, I
>> got your point in "I'm find with it". Now get mine which is
>> about "I am not".
>>
>>> can be trivially mapped to the two values, by chosing a
>>> fixed multiplicative base (let's say '1s' to simplify :) 
>>>
>>>   with 50%, you get 1s/0.5s
>>>   with 20%, you get 1s/0.2s
>>>   with  5%, you get 1s/0.05s
>>>
>>> well, you get the idea :)
>> No I don't.
>> Is 1s/0.5s worse or better than 2s/1s?
>> How should I make a choice?
> 
> I would say it depends on your requirement. How fast do you want to
> respond back to the user? Wiht lower bandwidth, you would want to have
> shorter periods so that the user would not get the impression that he
> has to "wait" to get CPU time. But having a very short period is not a
> good thing, since there are other considerations (such as the overhead of
> hard limits).

That's it - long period is bad for one reason, short period is bad for 
some other one and neither of them is clearly described unlike the 
limit itself.

In other words there are two numbers we're essentially playing with:
* the limit (int percents, Hz, whatever)
* and this abstract "badness"

Can't we give the user one of them for "must be configured" usage, put
the other one in some "good for most users" position and let the user
move it later on demand?

Yet again - weights in CFQ CPU-sched, ionoce in CFQ-iosched, bandwidth
in tc (traffic shaping), etc. are all clean for end-user. Plus there are
other fine tunes, that user should not configure by default, but which
change the default behavior. I propose to create simple and clean 
interface for limits as well. If you think that virtual cpu power is 
not good, ok. Let's ask user for a percentage and give him yet another
option to control this "badness" or "responsiveness".

> thanks,


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 0/8] CFS Hard limits - v2
  2009-10-13 12:45               ` Pavel Emelyanov
@ 2009-10-13 12:56                 ` Dhaval Giani
  2009-10-13 12:57                 ` Bharata B Rao
  1 sibling, 0 replies; 39+ messages in thread
From: Dhaval Giani @ 2009-10-13 12:56 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Herbert Poetzl, vatsa, Bharata B Rao, Balbir Singh, linux-kernel,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Ingo Molnar,
	Peter Zijlstra, Avi Kivity, Chris Friesen, Paul Menage,
	Mike Waychison

On Tue, Oct 13, 2009 at 04:45:15PM +0400, Pavel Emelyanov wrote:
> Dhaval Giani wrote:
> > On Tue, Oct 13, 2009 at 04:19:41PM +0400, Pavel Emelyanov wrote:
> >>> as I already stated, it seems perfectly fine for me 
> >> You're not the only one interested in it, sorry. Besides, I
> >> got your point in "I'm find with it". Now get mine which is
> >> about "I am not".
> >>
> >>> can be trivially mapped to the two values, by chosing a
> >>> fixed multiplicative base (let's say '1s' to simplify :) 
> >>>
> >>>   with 50%, you get 1s/0.5s
> >>>   with 20%, you get 1s/0.2s
> >>>   with  5%, you get 1s/0.05s
> >>>
> >>> well, you get the idea :)
> >> No I don't.
> >> Is 1s/0.5s worse or better than 2s/1s?
> >> How should I make a choice?
> > 
> > I would say it depends on your requirement. How fast do you want to
> > respond back to the user? Wiht lower bandwidth, you would want to have
> > shorter periods so that the user would not get the impression that he
> > has to "wait" to get CPU time. But having a very short period is not a
> > good thing, since there are other considerations (such as the overhead of
> > hard limits).
> 
> That's it - long period is bad for one reason, short period is bad for 
> some other one and neither of them is clearly described unlike the 
> limit itself.
> 
> In other words there are two numbers we're essentially playing with:
> * the limit (int percents, Hz, whatever)
> * and this abstract "badness"
> 
> Can't we give the user one of them for "must be configured" usage, put
> the other one in some "good for most users" position and let the user
> move it later on demand?
> 

Right, but is this not more of a policy decision as opposed to an
infrastructure one and better handled in userspace?

> Yet again - weights in CFQ CPU-sched, ionoce in CFQ-iosched, bandwidth
> in tc (traffic shaping), etc. are all clean for end-user. Plus there are
> other fine tunes, that user should not configure by default, but which
> change the default behavior. I propose to create simple and clean 
> interface for limits as well. If you think that virtual cpu power is 
> not good, ok. Let's ask user for a percentage and give him yet another
> option to control this "badness" or "responsiveness".
> 

How is virtual cpu power defined? GHz is not a clear definition. It
means different things (in terms of performance) for different
processors. Do you want to define it as getting x cycles of a CPU in
y seconds for the cgroup, or do you want to define it as a certain number
of operations in a second. If it is the former, I think we can map it to
the current interface in userspace itself.  Why don't we define this
metric, and then look to solve the problem? 

thanks,
-- 
regards,
Dhaval

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 0/8] CFS Hard limits - v2
  2009-10-13 12:45               ` Pavel Emelyanov
  2009-10-13 12:56                 ` Dhaval Giani
@ 2009-10-13 12:57                 ` Bharata B Rao
  2009-10-13 13:01                   ` Pavel Emelyanov
  1 sibling, 1 reply; 39+ messages in thread
From: Bharata B Rao @ 2009-10-13 12:57 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Dhaval Giani, Herbert Poetzl, vatsa, Balbir Singh, linux-kernel,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Ingo Molnar,
	Peter Zijlstra, Avi Kivity, Chris Friesen, Paul Menage,
	Mike Waychison

On Tue, Oct 13, 2009 at 04:45:15PM +0400, Pavel Emelyanov wrote:
> Dhaval Giani wrote:
> > On Tue, Oct 13, 2009 at 04:19:41PM +0400, Pavel Emelyanov wrote:
> >>> as I already stated, it seems perfectly fine for me 
> >> You're not the only one interested in it, sorry. Besides, I
> >> got your point in "I'm find with it". Now get mine which is
> >> about "I am not".
> >>
> >>> can be trivially mapped to the two values, by chosing a
> >>> fixed multiplicative base (let's say '1s' to simplify :) 
> >>>
> >>>   with 50%, you get 1s/0.5s
> >>>   with 20%, you get 1s/0.2s
> >>>   with  5%, you get 1s/0.05s
> >>>
> >>> well, you get the idea :)
> >> No I don't.
> >> Is 1s/0.5s worse or better than 2s/1s?
> >> How should I make a choice?
> > 
> > I would say it depends on your requirement. How fast do you want to
> > respond back to the user? Wiht lower bandwidth, you would want to have
> > shorter periods so that the user would not get the impression that he
> > has to "wait" to get CPU time. But having a very short period is not a
> > good thing, since there are other considerations (such as the overhead of
> > hard limits).
> 
> That's it - long period is bad for one reason, short period is bad for 
> some other one and neither of them is clearly described unlike the 
> limit itself.
> 
> In other words there are two numbers we're essentially playing with:
> * the limit (int percents, Hz, whatever)
> * and this abstract "badness"
> 
> Can't we give the user one of them for "must be configured" usage, put
> the other one in some "good for most users" position and let the user
> move it later on demand?

This is what we have right now. All the groups will start with a default
period (of .5s) and user has the option of changing it based on his
requirement. However we need to find out if the default period is really
"good for most users"

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 0/8] CFS Hard limits - v2
  2009-10-13 12:57                 ` Bharata B Rao
@ 2009-10-13 13:01                   ` Pavel Emelyanov
  0 siblings, 0 replies; 39+ messages in thread
From: Pavel Emelyanov @ 2009-10-13 13:01 UTC (permalink / raw)
  To: bharata
  Cc: Dhaval Giani, Herbert Poetzl, vatsa, Balbir Singh, linux-kernel,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Ingo Molnar,
	Peter Zijlstra, Avi Kivity, Chris Friesen, Paul Menage,
	Mike Waychison

> This is what we have right now. All the groups will start with a default
> period (of .5s) and user has the option of changing it based on his
> requirement. However we need to find out if the default period is really
> "good for most users"

Is it? If so, then I'm fine with it. As I understood from the beginning
is that I'll _have_ to specify both numbers.

> Regards,
> Bharata.
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 4/8] sched: Enforce hard limits by throttling
  2009-09-30 12:52 ` [RFC v2 PATCH 4/8] sched: Enforce hard limits by throttling Bharata B Rao
@ 2009-10-13 14:27   ` Peter Zijlstra
  2009-10-14  3:41     ` Bharata B Rao
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2009-10-13 14:27 UTC (permalink / raw)
  To: bharata
  Cc: linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
	Chris Friesen, Paul Menage, Mike Waychison

On Wed, 2009-09-30 at 18:22 +0530, Bharata B Rao wrote:

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 0f1ea4a..77ace43 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1024,7 +1024,7 @@ struct sched_domain;
>  struct sched_class {
>  	const struct sched_class *next;
>  
> -	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
> +	int (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
>  	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
>  	void (*yield_task) (struct rq *rq);
>  

I really hate this, it uglfies all the enqueue code in a horrid way
(which is most of this patch).

Why can't we simply enqueue the task on a throttled group just like rt?

> @@ -3414,6 +3443,18 @@ int can_migrate_task(struct task_struct *p, struct rq *rq, int this_cpu,
>  	}
>  
>  	/*
> +	 * Don't migrate the task if it belongs to a
> +	 * - throttled group on its current cpu
> +	 * - throttled group on this_cpu
> +	 * - group whose hierarchy is throttled on this_cpu
> +	 */
> +	if (cfs_rq_throttled(cfs_rq_of(&p->se)) ||
> +		task_group_throttled(task_group(p), this_cpu)) {
> +		schedstat_inc(p, se.nr_failed_migrations_throttled);
> +		return 0;
> +	}
> +
> +	/*
>  	 * Aggressive migration if:
>  	 * 1) task is cache cold, or
>  	 * 2) too many balance attempts have failed.

Simply don't iterate throttled groups?



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 3/8] sched: Bandwidth initialization for fair task groups
  2009-09-30 12:52 ` [RFC v2 PATCH 3/8] sched: Bandwidth initialization for fair task groups Bharata B Rao
@ 2009-10-13 14:27   ` Peter Zijlstra
  2009-10-14  3:49     ` Bharata B Rao
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2009-10-13 14:27 UTC (permalink / raw)
  To: bharata
  Cc: linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
	Chris Friesen, Paul Menage, Mike Waychison

On Wed, 2009-09-30 at 18:22 +0530, Bharata B Rao wrote:

> diff --git a/kernel/sched.c b/kernel/sched.c
> index c283d0f..0147f6f 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -262,6 +262,15 @@ static DEFINE_MUTEX(sched_domains_mutex);
>  
>  #include <linux/cgroup.h>
>  
> +#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_CFS_HARD_LIMITS)
> +struct cfs_bandwidth {
> +	spinlock_t		cfs_runtime_lock;
> +	ktime_t			cfs_period;
> +	u64			cfs_runtime;
> +	struct hrtimer		cfs_period_timer;
> +};
> +#endif

too much cfs here..

>  struct cfs_rq;
>  
>  static LIST_HEAD(task_groups);
> @@ -282,6 +291,11 @@ struct task_group {
>  	/* runqueue "owned" by this group on each cpu */
>  	struct cfs_rq **cfs_rq;
>  	unsigned long shares;
> +#ifdef CONFIG_CFS_HARD_LIMITS
> +	struct cfs_bandwidth cfs_bandwidth;
> +	/* If set, throttle when the group exceeds its bandwidth */
> +	int hard_limit_enabled;
> +#endif

What's wrong with doing something like cfs_bandwidth.cfs_runtime ==
RUNTIME_INF ?

>  #endif
>  
>  #ifdef CONFIG_RT_GROUP_SCHED
> @@ -477,6 +491,16 @@ struct cfs_rq {
>  	unsigned long rq_weight;
>  #endif
>  #endif
> +#ifdef CONFIG_CFS_HARD_LIMITS
> +	/* set when the group is throttled  on this cpu */
> +	int cfs_throttled;
> +
> +	/* runtime currently consumed by the group on this rq */
> +	u64 cfs_time;
> +
> +	/* runtime available to the group on this rq */
> +	u64 cfs_runtime;
> +#endif

too much cfs_ again.

>  	/*
>  	 * Number of tasks at this heirarchy.
>  	 */
> @@ -665,6 +689,11 @@ struct rq {
>  	/* BKL stats */
>  	unsigned int bkl_count;
>  #endif
> +	/*
> +	 * Protects the cfs runtime related fields of all cfs_rqs under
> +	 * this rq
> +	 */
> +	spinlock_t runtime_lock;
>  };
>  
>  static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);


> +static inline void rq_runtime_lock(struct rq *rq)
> +{
> +	spin_lock(&rq->runtime_lock);
> +}
> +
> +static inline void rq_runtime_unlock(struct rq *rq)
> +{
> +	spin_unlock(&rq->runtime_lock);
> +}

needless obfuscation.

> CONFIG_RT_GROUP_SCHED
> @@ -10317,6 +10617,23 @@ static struct cftype cpu_files[] = {
>  		.read_u64 = cpu_shares_read_u64,
>  		.write_u64 = cpu_shares_write_u64,
>  	},
> +#ifdef CONFIG_CFS_HARD_LIMITS
> +	{
> +		.name = "cfs_runtime_us",
> +		.read_s64 = cpu_cfs_runtime_read_s64,
> +		.write_s64 = cpu_cfs_runtime_write_s64,
> +	},
> +	{
> +		.name = "cfs_period_us",
> +		.read_u64 = cpu_cfs_period_read_u64,
> +		.write_u64 = cpu_cfs_period_write_u64,
> +	},
> +	{
> +		.name = "cfs_hard_limit",
> +		.read_u64 = cpu_cfs_hard_limit_read_u64,
> +		.write_u64 = cpu_cfs_hard_limit_write_u64,
> +	},
> +#endif /* CONFIG_CFS_HARD_LIMITS */
>  #endif
>  #ifdef CONFIG_RT_GROUP_SCHED
>  	{

I guess that cfs_hard_limit thing is superfluous as well.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 2/8] sched: Maintain aggregated tasks count in cfs_rq at each hierarchy level
  2009-09-30 12:51 ` [RFC v2 PATCH 2/8] sched: Maintain aggregated tasks count in cfs_rq at each hierarchy level Bharata B Rao
@ 2009-10-13 14:27   ` Peter Zijlstra
  2009-10-14  3:42     ` Bharata B Rao
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2009-10-13 14:27 UTC (permalink / raw)
  To: bharata
  Cc: linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
	Chris Friesen, Paul Menage, Mike Waychison

On Wed, 2009-09-30 at 18:21 +0530, Bharata B Rao wrote:

> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 652e8bd..eeeddb8 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -243,6 +243,27 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
>  
>  #endif	/* CONFIG_FAIR_GROUP_SCHED */
>  
> +static void add_cfs_rq_tasks_running(struct sched_entity *se,
> +		unsigned long count)
> +{
> +	struct cfs_rq *cfs_rq;
> +
> +	for_each_sched_entity(se) {
> +		cfs_rq = cfs_rq_of(se);
> +		cfs_rq->nr_tasks_running += count;
> +	}
> +}
> +
> +static void sub_cfs_rq_tasks_running(struct sched_entity *se,
> +		unsigned long count)
> +{
> +	struct cfs_rq *cfs_rq;
> +
> +	for_each_sched_entity(se) {
> +		cfs_rq = cfs_rq_of(se);
> +		cfs_rq->nr_tasks_running -= count;
> +	}
> +}
>  
>  /**************************************************************
>   * Scheduling class tree data structure manipulation methods:
> @@ -969,6 +990,7 @@ static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
>  		wakeup = 1;
>  	}
>  
> +	add_cfs_rq_tasks_running(&p->se, 1);
>  	hrtick_update(rq);
>  }
>  
> @@ -991,6 +1013,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
>  		sleep = 1;
>  	}
>  
> +	sub_cfs_rq_tasks_running(&p->se, 1);
>  	hrtick_update(rq);
>  }

This seems daft, why not add the incement to the for_each_sched_entity()
loop already present in both functions?




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 0/8] CFS Hard limits - v2
  2009-10-13 11:39       ` Pavel Emelyanov
  2009-10-13 12:03         ` Herbert Poetzl
@ 2009-10-13 14:49         ` Valdis.Kletnieks
  1 sibling, 0 replies; 39+ messages in thread
From: Valdis.Kletnieks @ 2009-10-13 14:49 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: vatsa, Bharata B Rao, Balbir Singh, linux-kernel, Dhaval Giani,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Ingo Molnar,
	Peter Zijlstra, Herbert Poetzl, Avi Kivity, Chris Friesen,
	Paul Menage, Mike Waychison

[-- Attachment #1: Type: text/plain, Size: 435 bytes --]

On Tue, 13 Oct 2009 15:39:17 +0400, Pavel Emelyanov said:

> 3) virtual cpu power in M/GHz - I don't agree with Balbir that
>    this is difficult for administrator. This is better than two
>    numbers and better that the percentage, since the amount of
>    cpu time got by a container will not change after migrating to
>    a more powerful CPU.

2Gz worth of throughput on an old Xeon is probably not the same as
half of a 4Gz i7.

[-- Attachment #2: Type: application/pgp-signature, Size: 227 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 0/8] CFS Hard limits - v2
  2009-10-13 12:19           ` Pavel Emelyanov
  2009-10-13 12:30             ` Dhaval Giani
@ 2009-10-13 14:56             ` Valdis.Kletnieks
  2009-10-13 22:02             ` Herbert Poetzl
  2 siblings, 0 replies; 39+ messages in thread
From: Valdis.Kletnieks @ 2009-10-13 14:56 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Herbert Poetzl, vatsa, Bharata B Rao, Balbir Singh, linux-kernel,
	Dhaval Giani, Vaidyanathan Srinivasan, Gautham R Shenoy,
	Ingo Molnar, Peter Zijlstra, Avi Kivity, Chris Friesen,
	Paul Menage, Mike Waychison

[-- Attachment #1: Type: text/plain, Size: 592 bytes --]

On Tue, 13 Oct 2009 16:19:41 +0400, Pavel Emelyanov said:

> >   with 50%, you get 1s/0.5s
> >   with 20%, you get 1s/0.2s
> >   with  5%, you get 1s/0.05s
> > 
> > well, you get the idea :)
> 
> No I don't.
> Is 1s/0.5s worse or better than 2s/1s?
> How should I make a choice?

It depends how sensitive you want to be to short bursts of CPU usage.

And of course, if we're going down that route, I'll point out that over in
the network world, 95% percentile billing is pretty popular.  How to do it
in this application, I have not a clue. ;)

http://en.wikipedia.org/wiki/Burstable_billing

[-- Attachment #2: Type: application/pgp-signature, Size: 227 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 0/8] CFS Hard limits - v2
  2009-10-13 12:19           ` Pavel Emelyanov
  2009-10-13 12:30             ` Dhaval Giani
  2009-10-13 14:56             ` Valdis.Kletnieks
@ 2009-10-13 22:02             ` Herbert Poetzl
  2 siblings, 0 replies; 39+ messages in thread
From: Herbert Poetzl @ 2009-10-13 22:02 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: vatsa, Bharata B Rao, Balbir Singh, linux-kernel, Dhaval Giani,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Ingo Molnar,
	Peter Zijlstra, Avi Kivity, Chris Friesen, Paul Menage,
	Mike Waychison

On Tue, Oct 13, 2009 at 04:19:41PM +0400, Pavel Emelyanov wrote:
> > as I already stated, it seems perfectly fine for me 
> 
> You're not the only one interested in it, sorry. Besides, I
> got your point in "I'm find with it". Now get mine which is
> about "I am not".

> > can be trivially mapped to the two values, by chosing a
> > fixed multiplicative base (let's say '1s' to simplify :) 

> >   with 50%, you get 1s/0.5s
> >   with 20%, you get 1s/0.2s
> >   with  5%, you get 1s/0.05s

> > well, you get the idea :)

> No I don't.
> Is 1s/0.5s worse or better than 2s/1s?
> How should I make a choice?

and because _you_ are unable to make a choice, you
want that choice to be removed alltogether?

don't you see that your suggestions are less powerful
than the one provided by the patch, mainly because
they are only a subset (depending on your choice) 

besides that, it _is_ trivial to map the percentage,
but it isn't trivial to 'guess' what the user wants
(i.e. the amount of interactivity)

best,
Herbert


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 4/8] sched: Enforce hard limits by throttling
  2009-10-13 14:27   ` Peter Zijlstra
@ 2009-10-14  3:41     ` Bharata B Rao
  2009-10-14  9:17       ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Bharata B Rao @ 2009-10-14  3:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
	Chris Friesen, Paul Menage, Mike Waychison

On Tue, Oct 13, 2009 at 04:27:00PM +0200, Peter Zijlstra wrote:
> On Wed, 2009-09-30 at 18:22 +0530, Bharata B Rao wrote:
> 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 0f1ea4a..77ace43 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1024,7 +1024,7 @@ struct sched_domain;
> >  struct sched_class {
> >  	const struct sched_class *next;
> >  
> > -	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
> > +	int (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
> >  	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
> >  	void (*yield_task) (struct rq *rq);
> >  
> 
> I really hate this, it uglfies all the enqueue code in a horrid way
> (which is most of this patch).
> 
> Why can't we simply enqueue the task on a throttled group just like rt?

We do enqueue a task to its group even if the group is throttled. However such
throttled groups are not enqueued further. In such scenarios, even though the
task enqueue to its parent group succeeded, it really didn't add any task to
the cpu runqueue (rq). So we need to identify this condition and don't
increment rq->running. That is why this return value is needed.

> 
> > @@ -3414,6 +3443,18 @@ int can_migrate_task(struct task_struct *p, struct rq *rq, int this_cpu,
> >  	}
> >  
> >  	/*
> > +	 * Don't migrate the task if it belongs to a
> > +	 * - throttled group on its current cpu
> > +	 * - throttled group on this_cpu
> > +	 * - group whose hierarchy is throttled on this_cpu
> > +	 */
> > +	if (cfs_rq_throttled(cfs_rq_of(&p->se)) ||
> > +		task_group_throttled(task_group(p), this_cpu)) {
> > +		schedstat_inc(p, se.nr_failed_migrations_throttled);
> > +		return 0;
> > +	}
> > +
> > +	/*
> >  	 * Aggressive migration if:
> >  	 * 1) task is cache cold, or
> >  	 * 2) too many balance attempts have failed.
> 
> Simply don't iterate throttled groups?

We already do that by setting the h_load of the throttled cfs_rq to 0 and
not considering such a cfs_rq for iteration in load_balance_fair(). So I guess
I can remove the first check in the above code (cfs_rq_throttled(cfs_rq_of(&p->se))).

However the other check is needed because, we don't want to pull a task
(whose group is not throttled) from busiest cpu to a target cpu where its
group or any group in its hierarchy is throttled. This is what the 2nd check
does (task_group_throttled(task_group(p), this_cpu))).

Thanks for looking at the patches!

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 2/8] sched: Maintain aggregated tasks count in cfs_rq at each hierarchy level
  2009-10-13 14:27   ` Peter Zijlstra
@ 2009-10-14  3:42     ` Bharata B Rao
  0 siblings, 0 replies; 39+ messages in thread
From: Bharata B Rao @ 2009-10-14  3:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
	Chris Friesen, Paul Menage, Mike Waychison

On Tue, Oct 13, 2009 at 04:27:01PM +0200, Peter Zijlstra wrote:
> On Wed, 2009-09-30 at 18:21 +0530, Bharata B Rao wrote:
> 
> > diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> > index 652e8bd..eeeddb8 100644
> > --- a/kernel/sched_fair.c
> > +++ b/kernel/sched_fair.c
> > @@ -243,6 +243,27 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
> >  
> >  #endif	/* CONFIG_FAIR_GROUP_SCHED */
> >  
> > +static void add_cfs_rq_tasks_running(struct sched_entity *se,
> > +		unsigned long count)
> > +{
> > +	struct cfs_rq *cfs_rq;
> > +
> > +	for_each_sched_entity(se) {
> > +		cfs_rq = cfs_rq_of(se);
> > +		cfs_rq->nr_tasks_running += count;
> > +	}
> > +}
> > +
> > +static void sub_cfs_rq_tasks_running(struct sched_entity *se,
> > +		unsigned long count)
> > +{
> > +	struct cfs_rq *cfs_rq;
> > +
> > +	for_each_sched_entity(se) {
> > +		cfs_rq = cfs_rq_of(se);
> > +		cfs_rq->nr_tasks_running -= count;
> > +	}
> > +}
> >  
> >  /**************************************************************
> >   * Scheduling class tree data structure manipulation methods:
> > @@ -969,6 +990,7 @@ static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
> >  		wakeup = 1;
> >  	}
> >  
> > +	add_cfs_rq_tasks_running(&p->se, 1);
> >  	hrtick_update(rq);
> >  }
> >  
> > @@ -991,6 +1013,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
> >  		sleep = 1;
> >  	}
> >  
> > +	sub_cfs_rq_tasks_running(&p->se, 1);
> >  	hrtick_update(rq);
> >  }
> 
> This seems daft, why not add the incement to the for_each_sched_entity()
> loop already present in both functions?

Right. There was a reason why it started out like this and I can't seem to
remember. Now looking at the current code, I don't see why I can't do what
you suggest. Will try this for the next post.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 3/8] sched: Bandwidth initialization for fair task groups
  2009-10-13 14:27   ` Peter Zijlstra
@ 2009-10-14  3:49     ` Bharata B Rao
  0 siblings, 0 replies; 39+ messages in thread
From: Bharata B Rao @ 2009-10-14  3:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
	Chris Friesen, Paul Menage, Mike Waychison

On Tue, Oct 13, 2009 at 04:27:01PM +0200, Peter Zijlstra wrote:
> On Wed, 2009-09-30 at 18:22 +0530, Bharata B Rao wrote:
> 
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index c283d0f..0147f6f 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -262,6 +262,15 @@ static DEFINE_MUTEX(sched_domains_mutex);
> >  
> >  #include <linux/cgroup.h>
> >  
> > +#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_CFS_HARD_LIMITS)
> > +struct cfs_bandwidth {
> > +	spinlock_t		cfs_runtime_lock;
> > +	ktime_t			cfs_period;
> > +	u64			cfs_runtime;
> > +	struct hrtimer		cfs_period_timer;
> > +};
> > +#endif
> 
> too much cfs here..

Right, this will eventually be merged with rt_bandwidth. Dhaval already has
patches for bandwidth code unification b/n cfs and rt. As I said, the initial
focus is to show how the hard limit code looks like.

> 
> >  struct cfs_rq;
> >  
> >  static LIST_HEAD(task_groups);
> > @@ -282,6 +291,11 @@ struct task_group {
> >  	/* runqueue "owned" by this group on each cpu */
> >  	struct cfs_rq **cfs_rq;
> >  	unsigned long shares;
> > +#ifdef CONFIG_CFS_HARD_LIMITS
> > +	struct cfs_bandwidth cfs_bandwidth;
> > +	/* If set, throttle when the group exceeds its bandwidth */
> > +	int hard_limit_enabled;
> > +#endif
> 
> What's wrong with doing something like cfs_bandwidth.cfs_runtime ==
> RUNTIME_INF ?

Can be done.

> 
> >  #endif
> >  
> >  #ifdef CONFIG_RT_GROUP_SCHED
> > @@ -477,6 +491,16 @@ struct cfs_rq {
> >  	unsigned long rq_weight;
> >  #endif
> >  #endif
> > +#ifdef CONFIG_CFS_HARD_LIMITS
> > +	/* set when the group is throttled  on this cpu */
> > +	int cfs_throttled;
> > +
> > +	/* runtime currently consumed by the group on this rq */
> > +	u64 cfs_time;
> > +
> > +	/* runtime available to the group on this rq */
> > +	u64 cfs_runtime;
> > +#endif
> 
> too much cfs_ again.

But this is needed. It is present in rt also.

> 
> >  	/*
> >  	 * Number of tasks at this heirarchy.
> >  	 */
> > @@ -665,6 +689,11 @@ struct rq {
> >  	/* BKL stats */
> >  	unsigned int bkl_count;
> >  #endif
> > +	/*
> > +	 * Protects the cfs runtime related fields of all cfs_rqs under
> > +	 * this rq
> > +	 */
> > +	spinlock_t runtime_lock;
> >  };
> >  
> >  static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
> 
> 
> > +static inline void rq_runtime_lock(struct rq *rq)
> > +{
> > +	spin_lock(&rq->runtime_lock);
> > +}
> > +
> > +static inline void rq_runtime_unlock(struct rq *rq)
> > +{
> > +	spin_unlock(&rq->runtime_lock);
> > +}
> 
> needless obfuscation.

This is needed to keep the code clean for !CFS_HARD_LIMITS case.

> 
> > CONFIG_RT_GROUP_SCHED
> > @@ -10317,6 +10617,23 @@ static struct cftype cpu_files[] = {
> >  		.read_u64 = cpu_shares_read_u64,
> >  		.write_u64 = cpu_shares_write_u64,
> >  	},
> > +#ifdef CONFIG_CFS_HARD_LIMITS
> > +	{
> > +		.name = "cfs_runtime_us",
> > +		.read_s64 = cpu_cfs_runtime_read_s64,
> > +		.write_s64 = cpu_cfs_runtime_write_s64,
> > +	},
> > +	{
> > +		.name = "cfs_period_us",
> > +		.read_u64 = cpu_cfs_period_read_u64,
> > +		.write_u64 = cpu_cfs_period_write_u64,
> > +	},
> > +	{
> > +		.name = "cfs_hard_limit",
> > +		.read_u64 = cpu_cfs_hard_limit_read_u64,
> > +		.write_u64 = cpu_cfs_hard_limit_write_u64,
> > +	},
> > +#endif /* CONFIG_CFS_HARD_LIMITS */
> >  #endif
> >  #ifdef CONFIG_RT_GROUP_SCHED
> >  	{
> 
> I guess that cfs_hard_limit thing is superfluous as well.

Ok, will try to remove this control and will treat the case when
runtime != RUNTIME_INF as hard limits enabled case.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 4/8] sched: Enforce hard limits by throttling
  2009-10-14  3:41     ` Bharata B Rao
@ 2009-10-14  9:17       ` Peter Zijlstra
  2009-10-14 11:50         ` Bharata B Rao
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2009-10-14  9:17 UTC (permalink / raw)
  To: bharata
  Cc: linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
	Chris Friesen, Paul Menage, Mike Waychison

On Wed, 2009-10-14 at 09:11 +0530, Bharata B Rao wrote:
> On Tue, Oct 13, 2009 at 04:27:00PM +0200, Peter Zijlstra wrote:
> > On Wed, 2009-09-30 at 18:22 +0530, Bharata B Rao wrote:
> > 
> > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > index 0f1ea4a..77ace43 100644
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -1024,7 +1024,7 @@ struct sched_domain;
> > >  struct sched_class {
> > >     const struct sched_class *next;
> > >  
> > > -   void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
> > > +   int (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
> > >     void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
> > >     void (*yield_task) (struct rq *rq);
> > >  
> > 
> > I really hate this, it uglfies all the enqueue code in a horrid way
> > (which is most of this patch).
> > 
> > Why can't we simply enqueue the task on a throttled group just like rt?
> 
> We do enqueue a task to its group even if the group is throttled. However such
> throttled groups are not enqueued further. In such scenarios, even though the
> task enqueue to its parent group succeeded, it really didn't add any task to
> the cpu runqueue (rq). So we need to identify this condition and don't
> increment rq->running. That is why this return value is needed.

I would still consider those tasks running, the fact that they don't get
to run is a different matter.

This added return value really utterly craps up the code and I'm not
going to take it.

What I'm not seeing is why all this code looks so very much different
from the rt bits.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 4/8] sched: Enforce hard limits by throttling
  2009-10-14  9:17       ` Peter Zijlstra
@ 2009-10-14 11:50         ` Bharata B Rao
  2009-10-14 13:18           ` Herbert Poetzl
  0 siblings, 1 reply; 39+ messages in thread
From: Bharata B Rao @ 2009-10-14 11:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
	Chris Friesen, Paul Menage, Mike Waychison

On Wed, Oct 14, 2009 at 11:17:44AM +0200, Peter Zijlstra wrote:
> On Wed, 2009-10-14 at 09:11 +0530, Bharata B Rao wrote:
> > On Tue, Oct 13, 2009 at 04:27:00PM +0200, Peter Zijlstra wrote:
> > > On Wed, 2009-09-30 at 18:22 +0530, Bharata B Rao wrote:
> > > 
> > > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > > index 0f1ea4a..77ace43 100644
> > > > --- a/include/linux/sched.h
> > > > +++ b/include/linux/sched.h
> > > > @@ -1024,7 +1024,7 @@ struct sched_domain;
> > > >  struct sched_class {
> > > >     const struct sched_class *next;
> > > >  
> > > > -   void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
> > > > +   int (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
> > > >     void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
> > > >     void (*yield_task) (struct rq *rq);
> > > >  
> > > 
> > > I really hate this, it uglfies all the enqueue code in a horrid way
> > > (which is most of this patch).
> > > 
> > > Why can't we simply enqueue the task on a throttled group just like rt?
> > 
> > We do enqueue a task to its group even if the group is throttled. However such
> > throttled groups are not enqueued further. In such scenarios, even though the
> > task enqueue to its parent group succeeded, it really didn't add any task to
> > the cpu runqueue (rq). So we need to identify this condition and don't
> > increment rq->running. That is why this return value is needed.
> 
> I would still consider those tasks running, the fact that they don't get
> to run is a different matter.

Ok, that's how rt also considers them I realize. I thought that we should
update rq->running when tasks go off the runqueue due to throttling. When a
task is throttled, it is no doubt present on its group's cfs_rq, but it
doesn't contribute to the CPU load as the throttled group entity isn't there
on any cfs_rq. rq->running is used to obtain a few load balancing metrics and
they might go wrong if rq->running isn't uptodate.

Do you still think we shouldn't update rq->running ? If so, I can get rid
of this return value change.

> 
> This added return value really utterly craps up the code and I'm not
> going to take it.

OK :) I will work towards making them more acceptable in future iterations.

> 
> What I'm not seeing is why all this code looks so very much different
> from the rt bits.

Throttling code here looks different than rt for the following reasons:

- As I mentioned earlier, I update rq->running during throttling which
is not done in rt afaics.
- There are special conditions to prevent movement of tasks in and out
of the throttled groups during load balancing and migration.
- rt dequeues the throttled entity by walking the entity hierachy from
update_curr_rt(). But I found it difficult to do the same in cfs because
update_curr() is called from many different places and from places where
we are actually walking the entity hiearchy. A second walk (in update_curr)
of the hiearchy while we are in the middle of a hierarchy walk didn't look
all that good. So I resorted to just marking the entity as throttled in
update_curr() and later doing the dequeing from put_prev_entity() ?
Isn't this acceptable ?

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 4/8] sched: Enforce hard limits by throttling
  2009-10-14 11:50         ` Bharata B Rao
@ 2009-10-14 13:18           ` Herbert Poetzl
  2009-10-15  3:30             ` Bharata B Rao
  0 siblings, 1 reply; 39+ messages in thread
From: Herbert Poetzl @ 2009-10-14 13:18 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: Peter Zijlstra, linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Ingo Molnar, Pavel Emelyanov, Avi Kivity, Chris Friesen,
	Paul Menage, Mike Waychison

On Wed, Oct 14, 2009 at 05:20:03PM +0530, Bharata B Rao wrote:
> On Wed, Oct 14, 2009 at 11:17:44AM +0200, Peter Zijlstra wrote:
> > On Wed, 2009-10-14 at 09:11 +0530, Bharata B Rao wrote:
> > > On Tue, Oct 13, 2009 at 04:27:00PM +0200, Peter Zijlstra wrote:
> > > > On Wed, 2009-09-30 at 18:22 +0530, Bharata B Rao wrote:
> > > > 
> > > > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > > > index 0f1ea4a..77ace43 100644
> > > > > --- a/include/linux/sched.h
> > > > > +++ b/include/linux/sched.h
> > > > > @@ -1024,7 +1024,7 @@ struct sched_domain;
> > > > >  struct sched_class {
> > > > >     const struct sched_class *next;
> > > > >  
> > > > > -   void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
> > > > > +   int (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
> > > > >     void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
> > > > >     void (*yield_task) (struct rq *rq);
> > > > >  
> > > > 
> > > > I really hate this, it uglfies all the enqueue code in a horrid way
> > > > (which is most of this patch).
> > > > 
> > > > Why can't we simply enqueue the task on a throttled group just like rt?
> > > 
> > > We do enqueue a task to its group even if the group is throttled. However such
> > > throttled groups are not enqueued further. In such scenarios, even though the
> > > task enqueue to its parent group succeeded, it really didn't add any task to
> > > the cpu runqueue (rq). So we need to identify this condition and don't
> > > increment rq->running. That is why this return value is needed.
> > 
> > I would still consider those tasks running, the fact that they don't get
> > to run is a different matter.

> Ok, that's how rt also considers them I realize. I thought that we
> should update rq->running when tasks go off the runqueue due to
> throttling. When a task is throttled, it is no doubt present on its
> group's cfs_rq, but it doesn't contribute to the CPU load as the
> throttled group entity isn't there on any cfs_rq. rq->running is used
> to obtain a few load balancing metrics and they might go wrong if
> rq->running isn't uptodate.

for all practical purposes throttled tasks _are_ running
(i.e. they would like to run, but the hardware/software
doesn't allow them to do more work) ...

> Do you still think we shouldn't update rq->running ? If so, I can get rid
> of this return value change.

Linux-VServer marked throttled tasks as 'H' (on hold)
but counted them as running, which seems to work fine
and reflect the expected behaviour ...

best,
Herbert

> > This added return value really utterly craps up the code and I'm not
> > going to take it.
> 
> OK :) I will work towards making them more acceptable in future iterations.
> 
> > 
> > What I'm not seeing is why all this code looks so very much different
> > from the rt bits.
> 
> Throttling code here looks different than rt for the following reasons:
> 
> - As I mentioned earlier, I update rq->running during throttling which
> is not done in rt afaics.
> - There are special conditions to prevent movement of tasks in and out
> of the throttled groups during load balancing and migration.
> - rt dequeues the throttled entity by walking the entity hierachy from
> update_curr_rt(). But I found it difficult to do the same in cfs because
> update_curr() is called from many different places and from places where
> we are actually walking the entity hiearchy. A second walk (in update_curr)
> of the hiearchy while we are in the middle of a hierarchy walk didn't look
> all that good. So I resorted to just marking the entity as throttled in
> update_curr() and later doing the dequeing from put_prev_entity() ?
> Isn't this acceptable ?
> 
> Regards,
> Bharata.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v2 PATCH 4/8] sched: Enforce hard limits by throttling
  2009-10-14 13:18           ` Herbert Poetzl
@ 2009-10-15  3:30             ` Bharata B Rao
  0 siblings, 0 replies; 39+ messages in thread
From: Bharata B Rao @ 2009-10-15  3:30 UTC (permalink / raw)
  To: Peter Zijlstra, linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Ingo Molnar, Pavel Emelyanov, Avi Kivity, Chris Friesen,
	Paul Menage, Mike Waychison

On Wed, Oct 14, 2009 at 03:18:54PM +0200, Herbert Poetzl wrote:
> On Wed, Oct 14, 2009 at 05:20:03PM +0530, Bharata B Rao wrote:
> > On Wed, Oct 14, 2009 at 11:17:44AM +0200, Peter Zijlstra wrote:
> > > On Wed, 2009-10-14 at 09:11 +0530, Bharata B Rao wrote:
> > > > On Tue, Oct 13, 2009 at 04:27:00PM +0200, Peter Zijlstra wrote:
> > > > > On Wed, 2009-09-30 at 18:22 +0530, Bharata B Rao wrote:
> > > > > 
> > > > > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > > > > index 0f1ea4a..77ace43 100644
> > > > > > --- a/include/linux/sched.h
> > > > > > +++ b/include/linux/sched.h
> > > > > > @@ -1024,7 +1024,7 @@ struct sched_domain;
> > > > > >  struct sched_class {
> > > > > >     const struct sched_class *next;
> > > > > >  
> > > > > > -   void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
> > > > > > +   int (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
> > > > > >     void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
> > > > > >     void (*yield_task) (struct rq *rq);
> > > > > >  
> > > > > 
> > > > > I really hate this, it uglfies all the enqueue code in a horrid way
> > > > > (which is most of this patch).
> > > > > 
> > > > > Why can't we simply enqueue the task on a throttled group just like rt?
> > > > 
> > > > We do enqueue a task to its group even if the group is throttled. However such
> > > > throttled groups are not enqueued further. In such scenarios, even though the
> > > > task enqueue to its parent group succeeded, it really didn't add any task to
> > > > the cpu runqueue (rq). So we need to identify this condition and don't
> > > > increment rq->running. That is why this return value is needed.
> > > 
> > > I would still consider those tasks running, the fact that they don't get
> > > to run is a different matter.
> 
> > Ok, that's how rt also considers them I realize. I thought that we
> > should update rq->running when tasks go off the runqueue due to
> > throttling. When a task is throttled, it is no doubt present on its
> > group's cfs_rq, but it doesn't contribute to the CPU load as the
> > throttled group entity isn't there on any cfs_rq. rq->running is used
> > to obtain a few load balancing metrics and they might go wrong if
> > rq->running isn't uptodate.
> 
> for all practical purposes throttled tasks _are_ running
> (i.e. they would like to run, but the hardware/software
> doesn't allow them to do more work) ...

Ok, I will take a re-look at this and see if I too can consider them
as running and don't touch rq->running which should simply some code.

> 
> > Do you still think we shouldn't update rq->running ? If so, I can get rid
> > of this return value change.
> 
> Linux-VServer marked throttled tasks as 'H' (on hold)
> but counted them as running, which seems to work fine
> and reflect the expected behaviour ...

So a new task state 'H' ?

Thanks for your comments.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2009-10-15  3:31 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-09-30 12:49 [RFC v2 PATCH 0/8] CFS Hard limits - v2 Bharata B Rao
2009-09-30 12:50 ` [RFC v2 PATCH 1/8] sched: Rename sched_rt_period_mask() and use it in CFS also Bharata B Rao
2009-09-30 12:51 ` [RFC v2 PATCH 2/8] sched: Maintain aggregated tasks count in cfs_rq at each hierarchy level Bharata B Rao
2009-10-13 14:27   ` Peter Zijlstra
2009-10-14  3:42     ` Bharata B Rao
2009-09-30 12:52 ` [RFC v2 PATCH 3/8] sched: Bandwidth initialization for fair task groups Bharata B Rao
2009-10-13 14:27   ` Peter Zijlstra
2009-10-14  3:49     ` Bharata B Rao
2009-09-30 12:52 ` [RFC v2 PATCH 4/8] sched: Enforce hard limits by throttling Bharata B Rao
2009-10-13 14:27   ` Peter Zijlstra
2009-10-14  3:41     ` Bharata B Rao
2009-10-14  9:17       ` Peter Zijlstra
2009-10-14 11:50         ` Bharata B Rao
2009-10-14 13:18           ` Herbert Poetzl
2009-10-15  3:30             ` Bharata B Rao
2009-09-30 12:53 ` [RFC v2 PATCH 5/8] sched: Unthrottle the throttled tasks Bharata B Rao
2009-09-30 12:54 ` [RFC v2 PATCH 6/8] sched: Add throttle time statistics to /proc/sched_debug Bharata B Rao
2009-09-30 12:55 ` [RFC v2 PATCH 7/8] sched: Rebalance cfs runtimes Bharata B Rao
2009-09-30 12:55 ` [RFC v2 PATCH 8/8] sched: Hard limits documentation Bharata B Rao
2009-09-30 13:36 ` [RFC v2 PATCH 0/8] CFS Hard limits - v2 Pavel Emelyanov
2009-09-30 14:25   ` Bharata B Rao
2009-09-30 14:39     ` Srivatsa Vaddagiri
2009-09-30 15:09       ` Pavel Emelyanov
2009-10-13 11:39       ` Pavel Emelyanov
2009-10-13 12:03         ` Herbert Poetzl
2009-10-13 12:19           ` Pavel Emelyanov
2009-10-13 12:30             ` Dhaval Giani
2009-10-13 12:45               ` Pavel Emelyanov
2009-10-13 12:56                 ` Dhaval Giani
2009-10-13 12:57                 ` Bharata B Rao
2009-10-13 13:01                   ` Pavel Emelyanov
2009-10-13 14:56             ` Valdis.Kletnieks
2009-10-13 22:02             ` Herbert Poetzl
2009-10-13 14:49         ` Valdis.Kletnieks
2009-09-30 14:38   ` Balbir Singh
2009-09-30 15:10     ` Pavel Emelyanov
2009-09-30 15:30       ` Balbir Singh
2009-09-30 22:30         ` Herbert Poetzl
2009-10-01  5:12           ` Bharata B Rao

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.