* [RFC 0/5] sched: Add CPU rate caps
@ 2006-05-26  4:20 Peter Williams
  2006-05-26  4:20 ` [RFC 1/5] sched: Fix priority inheritance before CPU rate soft caps Peter Williams
                   ` (8 more replies)
  0 siblings, 9 replies; 95+ messages in thread
From: Peter Williams @ 2006-05-26  4:20 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Con Kolivas, Peter Williams, Linux Kernel, Kingsley Cheung,
	Ingo Molnar, Rene Herman

These patches implement CPU usage rate limits for tasks.

Although the rlimit mechanism already has a CPU usage limit (RLIMIT_CPU)
it is a total usage limit and therefore (to my mind) not very useful.
These patches provide an alternative whereby the (recent) average CPU
usage rate of a task can be limited to a (per task) specified proportion
of a single CPU's capacity.  The limits are specified in parts per
thousand and come in two varieties -- hard and soft.  The difference
between the two is that the system tries to enforce hard caps regardless
of the other demand for CPU resources but allows soft caps to be
exceeded if there are spare CPU resources available.  By default, tasks
will have both caps set to 1000 (i.e. no limit) but newly forked tasks
will inherit any caps that have been imposed on their parent from the
parent.  The minimum soft cap allowed is 0 (which effectively puts the
task in the background) and the minimum hard cap allowed is 1.
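
As an illustration only (not part of the patches), the semantics described
above boil down to roughly the following userspace model; the struct and
function names here are invented for the example:

/*
 * Illustrative model of the cap semantics: caps are parts per thousand of
 * one CPU, both default to 1000 (uncapped), the soft cap may be 0 (which
 * backgrounds the task), the hard cap must be at least 1, and children
 * inherit their parent's caps at fork time.
 */
#include <errno.h>

struct cap_state {
	unsigned int soft;	/* 0..1000 */
	unsigned int hard;	/* 1..1000 */
};

static const struct cap_state cap_default = { 1000, 1000 };

static int cap_set_soft(struct cap_state *c, unsigned int ppt)
{
	if (ppt > 1000)
		return -EINVAL;	/* 0 is allowed: puts the task in the background */
	c->soft = ppt;
	return 0;
}

static int cap_set_hard(struct cap_state *c, unsigned int ppt)
{
	if (ppt < 1 || ppt > 1000)
		return -EINVAL;	/* 0 would starve the task outright */
	c->hard = ppt;
	return 0;
}

static struct cap_state cap_fork(const struct cap_state *parent)
{
	return *parent;		/* a child starts with its parent's caps */
}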

Care has been taken to minimize the overhead inflicted on tasks that
have no caps and my tests using kernbench indicate that it is hidden in
the noise.

Note:

The first patch in this series fixes some problems with priority
inheritance that are present in 2.6.17-rc4-mm3 but will be fixed in
the next -mm kernel.

Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>


-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce


* [RFC 1/5] sched: Fix priority inheritance before CPU rate soft caps
  2006-05-26  4:20 [RFC 0/5] sched: Add CPU rate caps Peter Williams
@ 2006-05-26  4:20 ` Peter Williams
  2006-05-26  4:20 ` [RFC 2/5] sched: Add " Peter Williams
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 95+ messages in thread
From: Peter Williams @ 2006-05-26  4:20 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Con Kolivas, Peter Williams, Linux Kernel, Kingsley Cheung,
	Ingo Molnar, Rene Herman

Problem:

The advent of priority inheritance (PI) in -mm kernels means that the
prio field of non-real-time tasks can no longer be guaranteed to be
greater than or equal to MAX_RT_PRIO.  This, in turn, means that the
rt_task() macro is no longer a reliable test for determining whether a
task's scheduling policy is one of the real time policies.

Redefining rt_task() is not a good solution because, in the majority of
places where it is used within sched.c, the current definition is what
is required.  However, this is not the case in the functions
set_load_weight() and set_user_nice() (and perhaps elsewhere in the
kernel).

Solution:

Define a new macro, has_rt_policy(), that returns true if the task given
as an argument has a policy of SCHED_RR or SCHED_FIFO and use this
inside set_load_weight() and set_user_nice().  The definition is made in
sched.h so that it is generally available should it be needed elsewhere
in the kernel.
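
For illustration, here is a small standalone (userspace) sketch of the
scenario being guarded against; the constants mirror the kernel's values
but the program itself is not part of the patch:

#include <stdio.h>

#define MAX_RT_PRIO	100
#define SCHED_NORMAL	0
#define SCHED_FIFO	1
#define SCHED_RR	2
#define SCHED_BATCH	3

struct task { int policy; int prio; };

#define rt_prio(prio)		((prio) < MAX_RT_PRIO)
#define rt_task(p)		rt_prio((p)->prio)
#define has_rt_policy(p)	((p)->policy != SCHED_NORMAL && \
				 (p)->policy != SCHED_BATCH)

int main(void)
{
	/* a SCHED_NORMAL task at a typical nice 0 priority */
	struct task p = { SCHED_NORMAL, 120 };

	printf("before PI boost: rt_task=%d has_rt_policy=%d\n",
	       rt_task(&p), has_rt_policy(&p));	/* 0 0 */

	p.prio = 50;	/* boosted by rt_mutex_setprio() to an RT priority */
	printf("after PI boost:  rt_task=%d has_rt_policy=%d\n",
	       rt_task(&p), has_rt_policy(&p));	/* 1 0 */
	return 0;
}

The prio-based test now misidentifies the task, which is why
set_load_weight() and set_user_nice() switch to the policy-based test.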

Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>
 include/linux/sched.h |    2 ++
 kernel/sched.c        |   15 +++++++--------
 2 files changed, 9 insertions(+), 8 deletions(-)

Index: MM-2.6.17-rc4-mm3/include/linux/sched.h
===================================================================
--- MM-2.6.17-rc4-mm3.orig/include/linux/sched.h	2006-05-26 10:39:59.000000000 +1000
+++ MM-2.6.17-rc4-mm3/include/linux/sched.h	2006-05-26 10:43:21.000000000 +1000
@@ -491,6 +491,8 @@ struct signal_struct {
 #define rt_prio(prio)		unlikely((prio) < MAX_RT_PRIO)
 #define rt_task(p)		rt_prio((p)->prio)
 #define batch_task(p)		(unlikely((p)->policy == SCHED_BATCH))
+#define has_rt_policy(p) \
+	unlikely((p)->policy != SCHED_NORMAL && (p)->policy != SCHED_BATCH)
 
 /*
  * Some day this will be a full-fledged user tracking system..
Index: MM-2.6.17-rc4-mm3/kernel/sched.c
===================================================================
--- MM-2.6.17-rc4-mm3.orig/kernel/sched.c	2006-05-26 10:39:59.000000000 +1000
+++ MM-2.6.17-rc4-mm3/kernel/sched.c	2006-05-26 10:44:51.000000000 +1000
@@ -786,7 +786,7 @@ static inline int expired_starving(runqu
 
 static void set_load_weight(task_t *p)
 {
-	if (rt_task(p)) {
+	if (has_rt_policy(p)) {
 #ifdef CONFIG_SMP
 		if (p == task_rq(p)->migration_thread)
 			/*
@@ -835,7 +835,7 @@ static inline int normal_prio(task_t *p)
 {
 	int prio;
 
-	if (p->policy != SCHED_NORMAL && p->policy != SCHED_BATCH)
+	if (has_rt_policy(p))
 		prio = MAX_RT_PRIO-1 - p->rt_priority;
 	else
 		prio = __normal_prio(p);
@@ -3831,7 +3831,7 @@ void set_user_nice(task_t *p, long nice)
 	unsigned long flags;
 	prio_array_t *array;
 	runqueue_t *rq;
-	int old_prio, new_prio, delta;
+	int old_prio, delta;
 
 	if (TASK_NICE(p) == nice || nice < -20 || nice > 19)
 		return;
@@ -3846,7 +3846,7 @@ void set_user_nice(task_t *p, long nice)
 	 * it wont have any effect on scheduling until the task is
 	 * not SCHED_NORMAL/SCHED_BATCH:
 	 */
-	if (rt_task(p)) {
+	if (has_rt_policy(p)) {
 		p->static_prio = NICE_TO_PRIO(nice);
 		goto out_unlock;
 	}
@@ -3856,12 +3856,11 @@ void set_user_nice(task_t *p, long nice)
 		dec_raw_weighted_load(rq, p);
 	}
 
-	old_prio = p->prio;
-	new_prio = NICE_TO_PRIO(nice);
-	delta = new_prio - old_prio;
 	p->static_prio = NICE_TO_PRIO(nice);
 	set_load_weight(p);
-	p->prio += delta;
+	old_prio = p->prio;
+	p->prio = effective_prio(p);
+	delta = p->prio - old_prio;
 
 	if (array) {
 		enqueue_task(p, array);

-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce


* [RFC 2/5] sched: Add CPU rate soft caps
  2006-05-26  4:20 [RFC 0/5] sched: Add CPU rate caps Peter Williams
  2006-05-26  4:20 ` [RFC 1/5] sched: Fix priority inheritance before CPU rate soft caps Peter Williams
@ 2006-05-26  4:20 ` Peter Williams
  2006-05-26 10:48   ` Con Kolivas
  2006-05-27  6:31   ` Balbir Singh
  2006-05-26  4:20 ` [RFC 3/5] sched: Add CPU rate hard caps Peter Williams
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 95+ messages in thread
From: Peter Williams @ 2006-05-26  4:20 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Con Kolivas, Peter Williams, Linux Kernel, Kingsley Cheung,
	Ingo Molnar, Rene Herman

This patch implements (soft) CPU rate caps per task as a proportion of a
single CPU's capacity expressed in parts per thousand.  The CPU usage
of capped tasks is determined by using Kalman filters to calculate the
(recent) average lengths of the task's scheduling cycle and the time
spent on the CPU each cycle and taking the ratio of the latter to the
former.  To minimize overhead associated with uncapped tasks these
statistics are not kept for them.
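
For illustration, the usage estimate and the cap test can be pulled out
into a standalone sketch.  The comparison and the decay shift mirror the
code in the patch; the sample numbers and the main() scaffolding are
invented for the example:

#include <stdio.h>

#define CAP_STATS_OFFSET 8	/* decay factor is (2^8 - 1) / 2^8 per cycle */

struct capped_task {
	unsigned long long avg_cpu_per_cycle;	/* decayed sum, nanoseconds */
	unsigned long long avg_cycle_length;	/* decayed sum, nanoseconds */
	unsigned int cpu_rate_cap;		/* parts per thousand */
};

/* exceeding the cap when the on-CPU/cycle ratio is above cap/1000 */
static int task_exceeding_cap(const struct capped_task *p)
{
	return (p->avg_cpu_per_cycle * 1000) >
	       (p->avg_cycle_length * p->cpu_rate_cap);
}

static void decay_cap_stats(struct capped_task *p)
{
	p->avg_cycle_length  *= (1 << CAP_STATS_OFFSET) - 1;
	p->avg_cycle_length >>= CAP_STATS_OFFSET;
	p->avg_cpu_per_cycle *= (1 << CAP_STATS_OFFSET) - 1;
	p->avg_cpu_per_cycle >>= CAP_STATS_OFFSET;
}

int main(void)
{
	/* a task capped at 10% that spends 5 ms on CPU then sleeps 15 ms */
	struct capped_task t = { 0, 0, 100 };

	t.avg_cpu_per_cycle += 5000000ULL;	/* on-CPU time: both sums   */
	t.avg_cycle_length  += 5000000ULL;	/*   (inc_cap_stats_both)   */
	t.avg_cycle_length  += 15000000ULL;	/* sleep: cycle length only */
						/*   (inc_cap_stats_cycle)  */

	/* 25% of the cycle on CPU against a 10% cap -> exceeding */
	printf("exceeding cap: %d\n", task_exceeding_cap(&t));

	decay_cap_stats(&t);	/* decays both sums equally: ratio unchanged */
	printf("after decay:   %d\n", task_exceeding_cap(&t));
	return 0;
}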

Notes:

1. To minimize the overhead incurred when testing to skip caps processing for
uncapped tasks a new flag PF_HAS_CAP has been added to flags.

2. The implementation adds two priority slots to the run queue priority
arrays.  This means that MAX_PRIO no longer represents the scheduling
priority of the idle process and can't be used to test whether priority
values are in the valid range.  To alleviate this problem a new function
sched_idle_prio() has been provided.  (A sketch of the resulting priority
layout follows these notes.)

3. Enforcement of caps is not as strict as it could be in order to
reduce the possibility of a task being starved of CPU while holding
an important system resource with resultant overall performance
degradation.  In effect, all runnable capped tasks will get some amount
of CPU access every active/expired swap cycle.  This will be most
apparent for small or zero soft caps.
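
To make note 2 concrete, this is the priority layout that results when
CONFIG_CPU_RATE_CAPS is set.  The IDLE/BGND/CAPPED defines are taken from
the patch; MAX_RT_PRIO == 100 and MAX_PRIO == MAX_RT_PRIO + 40 are the
usual values:

/* 0..99 are the real time priorities, 100..139 the SCHED_NORMAL ones */
#define MAX_RT_PRIO	100
#define MAX_PRIO	(MAX_RT_PRIO + 40)	/* 140 */

#define IDLE_PRIO	(MAX_PRIO + 2)		/* 142: the per-cpu idle task   */
#define BGND_PRIO	(IDLE_PRIO - 1)		/* 141: tasks with a 0 soft cap */
#define CAPPED_PRIO	(IDLE_PRIO - 2)		/* 140: tasks exceeding a cap   */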

Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>

 include/linux/sched.h  |   16 ++
 init/Kconfig           |    2 
 kernel/Kconfig.caps    |   13 +
 kernel/rtmutex-debug.c |    4 
 kernel/sched.c         |  362 ++++++++++++++++++++++++++++++++++++++++++++++---
 5 files changed, 375 insertions(+), 22 deletions(-)

Index: MM-2.6.17-rc4-mm3/include/linux/sched.h
===================================================================
--- MM-2.6.17-rc4-mm3.orig/include/linux/sched.h	2006-05-26 10:43:21.000000000 +1000
+++ MM-2.6.17-rc4-mm3/include/linux/sched.h	2006-05-26 10:46:35.000000000 +1000
@@ -494,6 +494,12 @@ struct signal_struct {
 #define has_rt_policy(p) \
 	unlikely((p)->policy != SCHED_NORMAL && (p)->policy != SCHED_BATCH)
 
+#ifdef CONFIG_CPU_RATE_CAPS
+int sched_idle_prio(void);
+#else
+#define sched_idle_prio()	MAX_PRIO
+#endif
+
 /*
  * Some day this will be a full-fledged user tracking system..
  */
@@ -787,6 +793,10 @@ struct task_struct {
 	unsigned long sleep_avg;
 	unsigned long long timestamp, last_ran;
 	unsigned long long sched_time; /* sched_clock time spent running */
+#ifdef CONFIG_CPU_RATE_CAPS
+	unsigned long long avg_cpu_per_cycle, avg_cycle_length;
+	unsigned int cpu_rate_cap;
+#endif
 	enum sleep_type sleep_type;
 
 	unsigned long policy;
@@ -981,6 +991,11 @@ struct task_struct {
 #endif
 };
 
+#ifdef CONFIG_CPU_RATE_CAPS
+unsigned int get_cpu_rate_cap(const struct task_struct *);
+int set_cpu_rate_cap(struct task_struct *, unsigned int);
+#endif
+
 static inline pid_t process_group(struct task_struct *tsk)
 {
 	return tsk->signal->pgrp;
@@ -1040,6 +1055,7 @@ static inline void put_task_struct(struc
 #define PF_SPREAD_SLAB	0x08000000	/* Spread some slab caches over cpuset */
 #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
 #define PF_MUTEX_TESTER	0x02000000	/* Thread belongs to the rt mutex tester */
+#define PF_HAS_CAP	0x20000000	/* Has a CPU rate cap */
 
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
Index: MM-2.6.17-rc4-mm3/init/Kconfig
===================================================================
--- MM-2.6.17-rc4-mm3.orig/init/Kconfig	2006-05-26 10:39:59.000000000 +1000
+++ MM-2.6.17-rc4-mm3/init/Kconfig	2006-05-26 10:45:26.000000000 +1000
@@ -286,6 +286,8 @@ config RELAY
 
 	  If unsure, say N.
 
+source "kernel/Kconfig.caps"
+
 source "usr/Kconfig"
 
 config UID16
Index: MM-2.6.17-rc4-mm3/kernel/Kconfig.caps
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ MM-2.6.17-rc4-mm3/kernel/Kconfig.caps	2006-05-26 10:45:26.000000000 +1000
@@ -0,0 +1,13 @@
+#
+# CPU Rate Caps Configuration
+#
+
+config CPU_RATE_CAPS
+	bool "Support (soft) CPU rate caps"
+	default n
+	---help---
+	  Say y here if you wish to be able to put a (soft) upper limit on
+	  the rate of CPU usage by individual tasks.  A task which has been
+	  allocated a soft CPU rate cap will be limited to that rate of CPU
+	  usage unless there is spare CPU resources available after the needs
+	  of uncapped tasks are met.
Index: MM-2.6.17-rc4-mm3/kernel/sched.c
===================================================================
--- MM-2.6.17-rc4-mm3.orig/kernel/sched.c	2006-05-26 10:44:51.000000000 +1000
+++ MM-2.6.17-rc4-mm3/kernel/sched.c	2006-05-26 11:00:02.000000000 +1000
@@ -57,6 +57,19 @@
 
 #include <asm/unistd.h>
 
+#ifdef CONFIG_CPU_RATE_CAPS
+#define IDLE_PRIO	(MAX_PRIO + 2)
+#else
+#define IDLE_PRIO	MAX_PRIO
+#endif
+#define BGND_PRIO	(IDLE_PRIO - 1)
+#define CAPPED_PRIO	(IDLE_PRIO - 2)
+
+int sched_idle_prio(void)
+{
+	return IDLE_PRIO;
+}
+
 /*
  * Convert user-nice values [ -20 ... 0 ... 19 ]
  * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
@@ -186,6 +199,149 @@ static inline unsigned int task_timeslic
 	return static_prio_timeslice(p->static_prio);
 }
 
+#ifdef CONFIG_CPU_RATE_CAPS
+#define CAP_STATS_OFFSET 8
+#define task_has_cap(p) unlikely((p)->flags & PF_HAS_CAP)
+/* this assumes that p is not a real time task */
+#define task_is_background(p) unlikely((p)->cpu_rate_cap == 0)
+#define task_being_capped(p) unlikely((p)->prio >= CAPPED_PRIO)
+#define cap_load_weight(p) (((p)->cpu_rate_cap * SCHED_LOAD_SCALE) / 1000)
+
+static void init_cpu_rate_caps(task_t *p)
+{
+	p->cpu_rate_cap = 1000;
+	p->flags &= ~PF_HAS_CAP;
+}
+
+static inline void set_cap_flag(task_t *p)
+{
+	if (p->cpu_rate_cap < 1000 && !has_rt_policy(p))
+		p->flags |= PF_HAS_CAP;
+	else
+		p->flags &= ~PF_HAS_CAP;
+}
+
+static inline int task_exceeding_cap(const task_t *p)
+{
+	return (p->avg_cpu_per_cycle * 1000) > (p->avg_cycle_length * p->cpu_rate_cap);
+}
+
+#ifdef CONFIG_SCHED_SMT
+static unsigned int smt_timeslice(task_t *p)
+{
+	if (task_has_cap(p) && task_being_capped(p))
+		return 0;
+
+	return task_timeslice(p);
+}
+
+static int task_priority_gt(const task_t *thisp, const task_t *thatp)
+{
+	if (task_has_cap(thisp) && (task_being_capped(thisp)))
+	    return 0;
+
+	if (task_has_cap(thatp) && (task_being_capped(thatp)))
+	    return 1;
+
+	return thisp->static_prio < thatp->static_prio;
+}
+#endif
+
+/*
+ * Update usage stats to "now" before making comparison
+ * Assume: task is actually on a CPU
+ */
+static int task_exceeding_cap_now(const task_t *p, unsigned long long now)
+{
+	unsigned long long delta, lhs, rhs;
+
+	delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
+	lhs = (p->avg_cpu_per_cycle + delta) * 1000;
+	rhs = (p->avg_cycle_length + delta) * p->cpu_rate_cap;
+
+	return lhs > rhs;
+}
+
+static inline void init_cap_stats(task_t *p)
+{
+	p->avg_cpu_per_cycle = 0;
+	p->avg_cycle_length = 0;
+}
+
+static inline void inc_cap_stats_cycle(task_t *p, unsigned long long now)
+{
+	unsigned long long delta;
+
+	delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
+	p->avg_cycle_length += delta;
+}
+
+static inline void inc_cap_stats_both(task_t *p, unsigned long long now)
+{
+	unsigned long long delta;
+
+	delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
+	p->avg_cycle_length += delta;
+	p->avg_cpu_per_cycle += delta;
+}
+
+static inline void decay_cap_stats(task_t *p)
+{
+	p->avg_cycle_length *= ((1 << CAP_STATS_OFFSET) - 1);
+	p->avg_cycle_length >>= CAP_STATS_OFFSET;
+	p->avg_cpu_per_cycle *= ((1 << CAP_STATS_OFFSET) - 1);
+	p->avg_cpu_per_cycle >>= CAP_STATS_OFFSET;
+}
+#else
+#define task_has_cap(p) 0
+#define task_is_background(p) 0
+#define task_being_capped(p) 0
+#define cap_load_weight(p) SCHED_LOAD_SCALE
+
+static inline void init_cpu_rate_caps(task_t *p)
+{
+}
+
+static inline void set_cap_flag(task_t *p)
+{
+}
+
+static inline int task_exceeding_cap(const task_t *p)
+{
+	return 0;
+}
+
+#ifdef CONFIG_SCHED_SMT
+#define smt_timeslice(p) task_timeslice(p)
+
+static inline int task_priority_gt(const task_t *thisp, const task_t *thatp)
+{
+	return thisp->static_prio < thatp->static_prio;
+}
+#endif
+
+static inline int task_exceeding_cap_now(const task_t *p, unsigned long long now)
+{
+	return 0;
+}
+
+static inline void init_cap_stats(task_t *p)
+{
+}
+
+static inline void inc_cap_stats_cycle(task_t *p, unsigned long long now)
+{
+}
+
+static inline void inc_cap_stats_both(task_t *p, unsigned long long now)
+{
+}
+
+static inline void decay_cap_stats(task_t *p)
+{
+}
+#endif
+
 #define task_hot(p, now, sd) ((long long) ((now) - (p)->last_ran)	\
 				< (long long) (sd)->cache_hot_time)
 
@@ -197,8 +353,8 @@ typedef struct runqueue runqueue_t;
 
 struct prio_array {
 	unsigned int nr_active;
-	DECLARE_BITMAP(bitmap, MAX_PRIO+1); /* include 1 bit for delimiter */
-	struct list_head queue[MAX_PRIO];
+	DECLARE_BITMAP(bitmap, IDLE_PRIO+1); /* include 1 bit for delimiter */
+	struct list_head queue[IDLE_PRIO];
 };
 
 /*
@@ -710,6 +866,10 @@ static inline int __normal_prio(task_t *
 {
 	int bonus, prio;
 
+	/* Ensure that background tasks stay at BGND_PRIO */
+	if (task_is_background(p))
+		return BGND_PRIO;
+
 	bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
 
 	prio = p->static_prio - bonus;
@@ -786,6 +946,8 @@ static inline int expired_starving(runqu
 
 static void set_load_weight(task_t *p)
 {
+	set_cap_flag(p);
+
 	if (has_rt_policy(p)) {
 #ifdef CONFIG_SMP
 		if (p == task_rq(p)->migration_thread)
@@ -798,8 +960,22 @@ static void set_load_weight(task_t *p)
 		else
 #endif
 			p->load_weight = RTPRIO_TO_LOAD_WEIGHT(p->rt_priority);
-	} else
+	} else {
 		p->load_weight = PRIO_TO_LOAD_WEIGHT(p->static_prio);
+
+		/*
+		 * Reduce the probability of a task escaping its CPU rate cap
+		 * due to load balancing leaving it on a lighly used CPU
+		 * This will be optimized away if rate caps aren't configured
+		 */
+		if (task_has_cap(p)) {
+			unsigned int clw; /* load weight based on cap */
+
+			clw = cap_load_weight(p);
+			if (clw < p->load_weight)
+				p->load_weight = clw;
+		}
+	}
 }
 
 static inline void inc_raw_weighted_load(runqueue_t *rq, const task_t *p)
@@ -869,7 +1045,8 @@ static void __activate_task(task_t *p, r
 {
 	prio_array_t *target = rq->active;
 
-	if (unlikely(batch_task(p) || (expired_starving(rq) && !rt_task(p))))
+	if (unlikely(batch_task(p) || (expired_starving(rq) && !rt_task(p)) ||
+			task_being_capped(p)))
 		target = rq->expired;
 	enqueue_task(p, target);
 	inc_nr_running(p, rq);
@@ -975,8 +1152,30 @@ static void activate_task(task_t *p, run
 #endif
 
 	if (!rt_task(p))
+		/*
+		 * We want to do the recalculation even if we're exceeding
+		 * a cap so that everything still works when we stop
+		 * exceeding our cap.
+		 */
 		p->prio = recalc_task_prio(p, now);
 
+	if (task_has_cap(p)) {
+		inc_cap_stats_cycle(p, now);
+		/* Background tasks are handled in effective_prio()
+		 * in order to ensure that they stay at BGND_PRIO
+		 * but we need to be careful that we don't override
+		 * it here
+		 */
+		if (task_exceeding_cap(p) && !task_is_background(p)) {
+			p->normal_prio = CAPPED_PRIO;
+			/*
+			 * Don't undo any priority ineheritance
+			 */
+			if (!rt_task(p))
+				p->prio = CAPPED_PRIO;
+		}
+	}
+
 	/*
 	 * This checks to make sure it's not an uninterruptible task
 	 * that is now waking up.
@@ -1566,6 +1765,7 @@ void fastcall sched_fork(task_t *p, int 
 #endif
 	set_task_cpu(p, cpu);
 
+	init_cap_stats(p);
 	/*
 	 * We mark the process as running here, but have not actually
 	 * inserted it onto the runqueue yet. This guarantees that
@@ -2040,7 +2240,7 @@ void pull_task(runqueue_t *src_rq, prio_
 	p->timestamp = (p->timestamp - src_rq->timestamp_last_tick)
 				+ this_rq->timestamp_last_tick;
 	/*
-	 * Note that idle threads have a prio of MAX_PRIO, for this test
+	 * Note that idle threads have a prio of IDLE_PRIO, for this test
 	 * to be always true for them.
 	 */
 	if (TASK_PREEMPTS_CURR(p, this_rq))
@@ -2140,8 +2340,8 @@ skip_bitmap:
 	if (!idx)
 		idx = sched_find_first_bit(array->bitmap);
 	else
-		idx = find_next_bit(array->bitmap, MAX_PRIO, idx);
-	if (idx >= MAX_PRIO) {
+		idx = find_next_bit(array->bitmap, IDLE_PRIO, idx);
+	if (idx >= IDLE_PRIO) {
 		if (array == busiest->expired && busiest->active->nr_active) {
 			array = busiest->active;
 			dst_array = this_rq->active;
@@ -2931,15 +3131,58 @@ void scheduler_tick(void)
 		}
 		goto out_unlock;
 	}
+	/* Only check for task exceeding cap if it's worthwhile */
+	if (task_has_cap(p)) {
+		/*
+		 * Do this even if there's only one task on the queue as
+		 * we want to set the priority low so that any waking tasks
+		 * can preempt.
+		 */
+		if (task_being_capped(p)) {
+			/*
+			 * Tasks whose cap is currently being enforced will be
+			 * at CAPPED_PRIO or BGND_PRIO priority and preemption
+			 * should be enough to keep them in check provided we
+			 * don't let them adversely effect tasks on the expired
+			 * array
+			 */
+			if (!task_is_background(p) && !task_exceeding_cap_now(p, now)) {
+				dequeue_task(p, rq->active);
+				p->prio = effective_prio(p);
+				enqueue_task(p, rq->active);
+			} else if (rq->expired->nr_active && rq->best_expired_prio < p->prio) {
+				dequeue_task(p, rq->active);
+				enqueue_task(p, rq->expired);
+				set_tsk_need_resched(p);
+				goto out_unlock;
+			}
+		} else if (task_exceeding_cap_now(p, now)) {
+			dequeue_task(p, rq->active);
+			p->prio = CAPPED_PRIO;
+			enqueue_task(p, rq->expired);
+			/*
+			 * think about making this conditional to reduce
+			 * context switch rate
+			 */
+			set_tsk_need_resched(p);
+			goto out_unlock;
+		}
+	}
 	if (!--p->time_slice) {
 		dequeue_task(p, rq->active);
 		set_tsk_need_resched(p);
-		p->prio = effective_prio(p);
+		if (!task_being_capped(p))
+			p->prio = effective_prio(p);
 		p->time_slice = task_timeslice(p);
 		p->first_time_slice = 0;
 
 		if (!rq->expired_timestamp)
 			rq->expired_timestamp = jiffies;
+		/*
+		 * No need to do anything special for capped tasks as here
+		 * TASK_INTERACTIVE() should fail when they're exceeding
+		 * their caps.
+		 */
 		if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
 			enqueue_task(p, rq->expired);
 			if (p->static_prio < rq->best_expired_prio)
@@ -3104,9 +3347,9 @@ static int dependent_sleeper(int this_cp
 				(sd->per_cpu_gain * DEF_TIMESLICE / 100))
 					ret = 1;
 		} else
-			if (smt_curr->static_prio < p->static_prio &&
+			if (task_priority_gt(smt_curr, p) &&
 				!TASK_PREEMPTS_CURR(p, smt_rq) &&
-				smt_slice(smt_curr, sd) > task_timeslice(p))
+				smt_slice(smt_curr, sd) > smt_timeslice(p))
 					ret = 1;
 
 check_smt_task:
@@ -3129,7 +3372,7 @@ check_smt_task:
 					resched_task(smt_curr);
 		} else {
 			if (TASK_PREEMPTS_CURR(p, smt_rq) &&
-				smt_slice(p, sd) > task_timeslice(smt_curr))
+				smt_slice(p, sd) > smt_timeslice(smt_curr))
 					resched_task(smt_curr);
 			else
 				wakeup_busy_runqueue(smt_rq);
@@ -3265,6 +3508,10 @@ need_resched_nonpreemptible:
 		}
 	}
 
+	/* do this now so that stats are correct for SMT code */
+	if (task_has_cap(prev))
+		inc_cap_stats_both(prev, now);
+
 	cpu = smp_processor_id();
 	if (unlikely(!rq->nr_running)) {
 go_idle:
@@ -3305,7 +3552,7 @@ go_idle:
 		rq->expired = array;
 		array = rq->active;
 		rq->expired_timestamp = 0;
-		rq->best_expired_prio = MAX_PRIO;
+		rq->best_expired_prio = IDLE_PRIO;
 	}
 
 	idx = sched_find_first_bit(array->bitmap);
@@ -3323,7 +3570,7 @@ go_idle:
 		array = next->array;
 		new_prio = recalc_task_prio(next, next->timestamp + delta);
 
-		if (unlikely(next->prio != new_prio)) {
+		if (unlikely(next->prio != new_prio && !task_being_capped(next))) {
 			dequeue_task(next, array);
 			next->prio = new_prio;
 			enqueue_task(next, array);
@@ -3347,6 +3594,10 @@ switch_tasks:
 
 	sched_info_switch(prev, next);
 	if (likely(prev != next)) {
+		if (task_has_cap(next)) {
+			decay_cap_stats(next);
+			inc_cap_stats_cycle(next, now);
+		}
 		next->timestamp = now;
 		rq->nr_switches++;
 		rq->curr = next;
@@ -3792,7 +4043,7 @@ void rt_mutex_setprio(task_t *p, int pri
 	runqueue_t *rq;
 	int oldprio;
 
-	BUG_ON(prio < 0 || prio > MAX_PRIO);
+	BUG_ON(prio < 0 || prio > IDLE_PRIO);
 
 	rq = task_rq_lock(p, &flags);
 
@@ -4220,6 +4471,76 @@ out_unlock:
 	return retval;
 }
 
+#ifdef CONFIG_CPU_RATE_CAPS
+unsigned int get_cpu_rate_cap(const struct task_struct *p)
+{
+	return p->cpu_rate_cap;
+}
+
+EXPORT_SYMBOL(get_cpu_rate_cap);
+
+/*
+ * Require: 0 <= new_cap <= 1000
+ */
+int set_cpu_rate_cap(struct task_struct *p, unsigned int new_cap)
+{
+	int is_allowed;
+	unsigned long flags;
+	struct runqueue *rq;
+	prio_array_t *array;
+	int delta;
+
+	if (new_cap > 1000)
+		return -EINVAL;
+	is_allowed = capable(CAP_SYS_NICE);
+	/*
+	 * We have to be careful, if called from /proc code,
+	 * the task might be in the middle of scheduling on another CPU.
+	 */
+	rq = task_rq_lock(p, &flags);
+	delta = new_cap - p->cpu_rate_cap;
+	if (!is_allowed) {
+		/*
+		 * Ordinary users can set/change caps on their own tasks
+		 * provided that the new setting is MORE constraining
+		 */
+		if (((current->euid != p->uid) && (current->uid != p->uid)) || (delta > 0)) {
+			task_rq_unlock(rq, &flags);
+			return -EPERM;
+		}
+	}
+	/*
+	 * The RT tasks don't have caps, but we still allow the caps to be
+	 * set - but as expected it wont have any effect on scheduling until
+	 * the task becomes SCHED_NORMAL/SCHED_BATCH:
+	 */
+	p->cpu_rate_cap = new_cap;
+
+	if (has_rt_policy(p))
+		goto out;
+
+	array = p->array;
+	if (array) {
+		dec_raw_weighted_load(rq, p);
+		dequeue_task(p, array);
+	}
+
+	set_load_weight(p);
+	p->prio = effective_prio(p);
+
+	if (array) {
+		enqueue_task(p, array);
+		inc_raw_weighted_load(rq, p);
+	}
+out:
+	task_rq_unlock(rq, &flags);
+
+	return 0;
+}
+
+EXPORT_SYMBOL(set_cpu_rate_cap);
+#endif
+
 long sched_setaffinity(pid_t pid, cpumask_t new_mask)
 {
 	task_t *p;
@@ -4733,7 +5054,7 @@ void __devinit init_idle(task_t *idle, i
 	idle->timestamp = sched_clock();
 	idle->sleep_avg = 0;
 	idle->array = NULL;
-	idle->prio = idle->normal_prio = MAX_PRIO;
+	idle->prio = idle->normal_prio = IDLE_PRIO;
 	idle->state = TASK_RUNNING;
 	idle->cpus_allowed = cpumask_of_cpu(cpu);
 	set_task_cpu(idle, cpu);
@@ -5074,7 +5395,7 @@ static void migrate_dead_tasks(unsigned 
 	struct runqueue *rq = cpu_rq(dead_cpu);
 
 	for (arr = 0; arr < 2; arr++) {
-		for (i = 0; i < MAX_PRIO; i++) {
+		for (i = 0; i < IDLE_PRIO; i++) {
 			struct list_head *list = &rq->arrays[arr].queue[i];
 			while (!list_empty(list))
 				migrate_dead(dead_cpu,
@@ -5244,7 +5565,7 @@ static int migration_call(struct notifie
 		/* Idle task back to normal (off runqueue, low prio) */
 		rq = task_rq_lock(rq->idle, &flags);
 		deactivate_task(rq->idle, rq);
-		rq->idle->static_prio = MAX_PRIO;
+		rq->idle->static_prio = IDLE_PRIO;
 		__setscheduler(rq->idle, SCHED_NORMAL, 0);
 		migrate_dead_tasks(cpu);
 		task_rq_unlock(rq, &flags);
@@ -6657,7 +6978,7 @@ void __init sched_init(void)
 		rq->nr_running = 0;
 		rq->active = rq->arrays;
 		rq->expired = rq->arrays + 1;
-		rq->best_expired_prio = MAX_PRIO;
+		rq->best_expired_prio = IDLE_PRIO;
 
 #ifdef CONFIG_SMP
 		rq->sd = NULL;
@@ -6673,15 +6994,16 @@ void __init sched_init(void)
 
 		for (j = 0; j < 2; j++) {
 			array = rq->arrays + j;
-			for (k = 0; k < MAX_PRIO; k++) {
+			for (k = 0; k < IDLE_PRIO; k++) {
 				INIT_LIST_HEAD(array->queue + k);
 				__clear_bit(k, array->bitmap);
 			}
 			// delimiter for bitsearch
-			__set_bit(MAX_PRIO, array->bitmap);
+			__set_bit(IDLE_PRIO, array->bitmap);
 		}
 	}
 
+	init_cpu_rate_caps(&init_task);
 	set_load_weight(&init_task);
 	/*
 	 * The boot idle thread does lazy MMU switching as well:
Index: MM-2.6.17-rc4-mm3/kernel/rtmutex-debug.c
===================================================================
--- MM-2.6.17-rc4-mm3.orig/kernel/rtmutex-debug.c	2006-05-26 10:39:59.000000000 +1000
+++ MM-2.6.17-rc4-mm3/kernel/rtmutex-debug.c	2006-05-26 10:45:26.000000000 +1000
@@ -479,8 +479,8 @@ void debug_rt_mutex_proxy_unlock(struct 
 void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
 {
 	memset(waiter, 0x11, sizeof(*waiter));
-	plist_node_init(&waiter->list_entry, MAX_PRIO);
-	plist_node_init(&waiter->pi_list_entry, MAX_PRIO);
+	plist_node_init(&waiter->list_entry, sched_idle_prio());
+	plist_node_init(&waiter->pi_list_entry, sched_idle_prio());
 }
 
 void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter)

-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce


* [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-26  4:20 [RFC 0/5] sched: Add CPU rate caps Peter Williams
  2006-05-26  4:20 ` [RFC 1/5] sched: Fix priority inheritance before CPU rate soft caps Peter Williams
  2006-05-26  4:20 ` [RFC 2/5] sched: Add " Peter Williams
@ 2006-05-26  4:20 ` Peter Williams
  2006-05-26  6:58   ` Kari Hurtta
                     ` (2 more replies)
  2006-05-26  4:21 ` [RFC 4/5] sched: Add procfs interface for CPU rate soft caps Peter Williams
                   ` (5 subsequent siblings)
  8 siblings, 3 replies; 95+ messages in thread
From: Peter Williams @ 2006-05-26  4:20 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Con Kolivas, Peter Williams, Linux Kernel, Kingsley Cheung,
	Ingo Molnar, Rene Herman

This patch implements hard CPU rate caps per task as a proportion of a
single CPU's capacity expressed in parts per thousand.
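
Hard caps are enforced with a "sinbin": when a task with a hard cap comes
off the CPU while exceeding it, it is deactivated and a per-task timer puts
it back on the runqueue after roughly long enough for its average usage to
drop back under the cap.  For illustration, the duration calculation can be
pulled out into a standalone sketch; it mirrors reqd_sinbin_ticks() in the
patch, with HZ and the sample numbers invented for the example:

#include <stdio.h>

#define CAP_STATS_OFFSET 8
#define HZ 250					/* illustrative tick rate */

static unsigned long reqd_sinbin_ticks(unsigned long long avg_cpu_per_cycle,
				       unsigned long long avg_cycle_length,
				       unsigned int hard_cap_ppt)
{
	unsigned long long res = avg_cpu_per_cycle * 1000;

	if (res > avg_cycle_length * hard_cap_ppt) {
		res /= hard_cap_ppt;		/* cycle length needed to meet the cap */
		res -= avg_cpu_per_cycle;	/* less the on-CPU part; the patch notes
						 * sleep time would also be subtracted
						 * if it were tracked */
		res >>= CAP_STATS_OFFSET;	/* undo the decayed-sum scaling */
		res /= 1000000000ULL / HZ;	/* nanoseconds -> ticks */
		return res ? res : 1;
	}
	return 0;				/* under the cap: no sinbin needed */
}

int main(void)
{
	/* decayed sums at full scale: 5 ms on CPU per 20 ms cycle, 10% hard cap */
	unsigned long long cpu = 5000000ULL << CAP_STATS_OFFSET;
	unsigned long long len = 20000000ULL << CAP_STATS_OFFSET;

	printf("sinbin for %lu ticks\n", reqd_sinbin_ticks(cpu, len, 100));
	return 0;
}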

Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>
 include/linux/sched.h |    8 ++
 kernel/Kconfig.caps   |   14 +++-
 kernel/sched.c        |  154 ++++++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 168 insertions(+), 8 deletions(-)

Index: MM-2.6.17-rc4-mm3/include/linux/sched.h
===================================================================
--- MM-2.6.17-rc4-mm3.orig/include/linux/sched.h	2006-05-26 10:46:35.000000000 +1000
+++ MM-2.6.17-rc4-mm3/include/linux/sched.h	2006-05-26 11:00:07.000000000 +1000
@@ -796,6 +796,10 @@ struct task_struct {
 #ifdef CONFIG_CPU_RATE_CAPS
 	unsigned long long avg_cpu_per_cycle, avg_cycle_length;
 	unsigned int cpu_rate_cap;
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+	unsigned int cpu_rate_hard_cap;
+	struct timer_list sinbin_timer;
+#endif
 #endif
 	enum sleep_type sleep_type;
 
@@ -994,6 +998,10 @@ struct task_struct {
 #ifdef CONFIG_CPU_RATE_CAPS
 unsigned int get_cpu_rate_cap(const struct task_struct *);
 int set_cpu_rate_cap(struct task_struct *, unsigned int);
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+unsigned int get_cpu_rate_hard_cap(const struct task_struct *);
+int set_cpu_rate_hard_cap(struct task_struct *, unsigned int);
+#endif
 #endif
 
 static inline pid_t process_group(struct task_struct *tsk)
Index: MM-2.6.17-rc4-mm3/kernel/Kconfig.caps
===================================================================
--- MM-2.6.17-rc4-mm3.orig/kernel/Kconfig.caps	2006-05-26 10:45:26.000000000 +1000
+++ MM-2.6.17-rc4-mm3/kernel/Kconfig.caps	2006-05-26 11:00:07.000000000 +1000
@@ -3,11 +3,21 @@
 #
 
 config CPU_RATE_CAPS
-	bool "Support (soft) CPU rate caps"
+	bool "Support CPU rate caps"
 	default n
 	---help---
-	  Say y here if you wish to be able to put a (soft) upper limit on
+	  Say y here if you wish to be able to put a soft upper limit on
 	  the rate of CPU usage by individual tasks.  A task which has been
 	  allocated a soft CPU rate cap will be limited to that rate of CPU
 	  usage unless there is spare CPU resources available after the needs
 	  of uncapped tasks are met.
+
+config CPU_RATE_HARD_CAPS
+	bool "Support CPU rate hard caps"
+	depends on CPU_RATE_CAPS
+	default n
+	---help---
+	  Say y here if you wish to be able to put a hard upper limit on
+	  the rate of CPU usage by individual tasks.  A task which has been
+	  allocated a hard CPU rate cap will be limited to that rate of CPU
+	  usage regardless of whether there is spare CPU resources available.
Index: MM-2.6.17-rc4-mm3/kernel/sched.c
===================================================================
--- MM-2.6.17-rc4-mm3.orig/kernel/sched.c	2006-05-26 11:00:02.000000000 +1000
+++ MM-2.6.17-rc4-mm3/kernel/sched.c	2006-05-26 13:50:11.000000000 +1000
@@ -201,21 +201,33 @@ static inline unsigned int task_timeslic
 
 #ifdef CONFIG_CPU_RATE_CAPS
 #define CAP_STATS_OFFSET 8
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+static void sinbin_release_fn(unsigned long arg);
+#define min_cpu_rate_cap(p) min((p)->cpu_rate_cap, (p)->cpu_rate_hard_cap)
+#else
+#define min_cpu_rate_cap(p) (p)->cpu_rate_cap
+#endif
 #define task_has_cap(p) unlikely((p)->flags & PF_HAS_CAP)
 /* this assumes that p is not a real time task */
 #define task_is_background(p) unlikely((p)->cpu_rate_cap == 0)
 #define task_being_capped(p) unlikely((p)->prio >= CAPPED_PRIO)
-#define cap_load_weight(p) (((p)->cpu_rate_cap * SCHED_LOAD_SCALE) / 1000)
+#define cap_load_weight(p) ((min_cpu_rate_cap(p) * SCHED_LOAD_SCALE) / 1000)
 
 static void init_cpu_rate_caps(task_t *p)
 {
 	p->cpu_rate_cap = 1000;
 	p->flags &= ~PF_HAS_CAP;
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+	p->cpu_rate_hard_cap = 1000;
+	init_timer(&p->sinbin_timer);
+	p->sinbin_timer.function = sinbin_release_fn;
+	p->sinbin_timer.data = (unsigned long) p;
+#endif
 }
 
 static inline void set_cap_flag(task_t *p)
 {
-	if (p->cpu_rate_cap < 1000 && !has_rt_policy(p))
+	if (min_cpu_rate_cap(p) < 1000 && !has_rt_policy(p))
 		p->flags |= PF_HAS_CAP;
 	else
 		p->flags &= ~PF_HAS_CAP;
@@ -223,7 +235,7 @@ static inline void set_cap_flag(task_t *
 
 static inline int task_exceeding_cap(const task_t *p)
 {
-	return (p->avg_cpu_per_cycle * 1000) > (p->avg_cycle_length * p->cpu_rate_cap);
+	return (p->avg_cpu_per_cycle * 1000) > (p->avg_cycle_length * min_cpu_rate_cap(p));
 }
 
 #ifdef CONFIG_SCHED_SMT
@@ -257,7 +269,7 @@ static int task_exceeding_cap_now(const 
 
 	delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
 	lhs = (p->avg_cpu_per_cycle + delta) * 1000;
-	rhs = (p->avg_cycle_length + delta) * p->cpu_rate_cap;
+	rhs = (p->avg_cycle_length + delta) * min_cpu_rate_cap(p);
 
 	return lhs > rhs;
 }
@@ -266,6 +278,10 @@ static inline void init_cap_stats(task_t
 {
 	p->avg_cpu_per_cycle = 0;
 	p->avg_cycle_length = 0;
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+	init_timer(&p->sinbin_timer);
+	p->sinbin_timer.data = (unsigned long) p;
+#endif
 }
 
 static inline void inc_cap_stats_cycle(task_t *p, unsigned long long now)
@@ -1213,6 +1229,64 @@ static void deactivate_task(struct task_
 	p->array = NULL;
 }
 
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+#define task_has_hard_cap(p) unlikely((p)->cpu_rate_hard_cap < 1000)
+
+/*
+ * Release a task from the sinbin
+ */
+static void sinbin_release_fn(unsigned long arg)
+{
+	unsigned long flags;
+	struct task_struct *p = (struct task_struct*)arg;
+	struct runqueue *rq = task_rq_lock(p, &flags);
+
+	p->prio = effective_prio(p);
+
+	__activate_task(p, rq);
+
+	task_rq_unlock(rq, &flags);
+}
+
+static unsigned long reqd_sinbin_ticks(const task_t *p)
+{
+	unsigned long long res;
+
+	res = p->avg_cpu_per_cycle * 1000;
+
+	if (res > p->avg_cycle_length * p->cpu_rate_hard_cap) {
+		(void)do_div(res, p->cpu_rate_hard_cap);
+		res -= p->avg_cpu_per_cycle;
+		/*
+		 * IF it was available we'd also subtract
+		 * the average sleep per cycle here
+		 */
+		res >>= CAP_STATS_OFFSET;
+		(void)do_div(res, (1000000000 / HZ));
+
+		return res ? : 1;
+	}
+
+	return 0;
+}
+
+static void sinbin_task(task_t *p, unsigned long durn)
+{
+	if (durn == 0)
+		return;
+	deactivate_task(p, task_rq(p));
+	p->sinbin_timer.expires = jiffies + durn;
+	add_timer(&p->sinbin_timer);
+}
+#else
+#define task_has_hard_cap(p) 0
+#define reqd_sinbin_ticks(p) 0
+
+static inline void sinbin_task(task_t *p, unsigned long durn)
+{
+}
+#endif
+
 /*
  * resched_task - mark a task 'to be rescheduled now'.
  *
@@ -3508,9 +3582,16 @@ need_resched_nonpreemptible:
 		}
 	}
 
-	/* do this now so that stats are correct for SMT code */
-	if (task_has_cap(prev))
+	if (task_has_cap(prev)) {
 		inc_cap_stats_both(prev, now);
+		if (task_has_hard_cap(prev) && !prev->state &&
+		    !rt_task(prev) && !signal_pending(prev)) {
+			unsigned long sinbin_ticks = reqd_sinbin_ticks(prev);
+
+			if (sinbin_ticks)
+				sinbin_task(prev, sinbin_ticks);
+		}
+	}
 
 	cpu = smp_processor_id();
 	if (unlikely(!rq->nr_running)) {
@@ -4539,6 +4620,67 @@ out:
 }
 
 EXPORT_SYMBOL(set_cpu_rate_cap);
+
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+unsigned int get_cpu_rate_hard_cap(const struct task_struct *p)
+{
+	return p->cpu_rate_hard_cap;
+}
+
+EXPORT_SYMBOL(get_cpu_rate_hard_cap);
+
+/*
+ * Require: 1 <= new_cap <= 1000
+ */
+int set_cpu_rate_hard_cap(struct task_struct *p, unsigned int new_cap)
+{
+	int is_allowed;
+	unsigned long flags;
+	struct runqueue *rq;
+	int delta;
+
+	if (new_cap > 1000 && new_cap > 0)
+		return -EINVAL;
+	is_allowed = capable(CAP_SYS_NICE);
+	/*
+	 * We have to be careful, if called from /proc code,
+	 * the task might be in the middle of scheduling on another CPU.
+	 */
+	rq = task_rq_lock(p, &flags);
+	delta = new_cap - p->cpu_rate_hard_cap;
+	if (!is_allowed) {
+		/*
+		 * Ordinary users can set/change caps on their own tasks
+		 * provided that the new setting is MORE constraining
+		 */
+		if (((current->euid != p->uid) && (current->uid != p->uid)) || (delta > 0)) {
+			task_rq_unlock(rq, &flags);
+			return -EPERM;
+		}
+	}
+	/*
+	 * The RT tasks don't have caps, but we still allow the caps to be
+	 * set - but as expected it wont have any effect on scheduling until
+	 * the task becomes SCHED_NORMAL/SCHED_BATCH:
+	 */
+	p->cpu_rate_hard_cap = new_cap;
+
+	if (has_rt_policy(p))
+		goto out;
+
+	if (p->array)
+		dec_raw_weighted_load(rq, p);
+	set_load_weight(p);
+	if (p->array)
+		inc_raw_weighted_load(rq, p);
+out:
+	task_rq_unlock(rq, &flags);
+
+	return 0;
+}
+
+EXPORT_SYMBOL(set_cpu_rate_hard_cap);
+#endif
 #endif
 
 long sched_setaffinity(pid_t pid, cpumask_t new_mask)

-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce


* [RFC 4/5] sched: Add procfs interface for CPU rate soft caps
  2006-05-26  4:20 [RFC 0/5] sched: Add CPU rate caps Peter Williams
                   ` (2 preceding siblings ...)
  2006-05-26  4:20 ` [RFC 3/5] sched: Add CPU rate hard caps Peter Williams
@ 2006-05-26  4:21 ` Peter Williams
  2006-05-26  4:21 ` [RFC 5/5] sched: Add procfs interface for CPU rate hard caps Peter Williams
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 95+ messages in thread
From: Peter Williams @ 2006-05-26  4:21 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Con Kolivas, Peter Williams, Linux Kernel, Kingsley Cheung,
	Ingo Molnar, Rene Herman

This patch implements a procfs interface for soft CPU rate caps.
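
For illustration, here is a minimal userspace sketch of how the new file
might be used.  The /proc/<pid>/task/<tid>/cpu_rate_cap path is an
assumption based on the tid_base_stuff hook below, and error handling is
trimmed to the bare minimum:

#include <stdio.h>

int main(int argc, char **argv)
{
	char path[128];
	unsigned int cap;
	FILE *f;

	if (argc < 2)
		return 1;
	/* for a single threaded process the tid equals the pid */
	snprintf(path, sizeof(path), "/proc/%s/task/%s/cpu_rate_cap",
		 argv[1], argv[1]);

	f = fopen(path, "r");
	if (!f || fscanf(f, "%u", &cap) != 1)
		return 1;
	fclose(f);
	printf("current soft cap: %u/1000\n", cap);

	f = fopen(path, "w");
	if (!f)
		return 1;
	fprintf(f, "250\n");	/* limit the task to 25% of one CPU */
	fclose(f);
	return 0;
}

Note that, as with set_cpu_rate_cap() in the earlier patch, writes by
ordinary users only succeed on their own tasks and only if they make the
cap more constraining; raising a cap needs CAP_SYS_NICE.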

Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>
 fs/proc/base.c |   59 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

Index: MM-2.6.17-rc4-mm3/fs/proc/base.c
===================================================================
--- MM-2.6.17-rc4-mm3.orig/fs/proc/base.c	2006-05-26 13:46:40.000000000 +1000
+++ MM-2.6.17-rc4-mm3/fs/proc/base.c	2006-05-26 13:50:57.000000000 +1000
@@ -167,6 +167,9 @@ enum pid_directory_inos {
 #ifdef CONFIG_CPUSETS
 	PROC_TID_CPUSET,
 #endif
+#ifdef CONFIG_CPU_RATE_CAPS
+	PROC_TID_CPU_RATE_CAP,
+#endif
 #ifdef CONFIG_SECURITY
 	PROC_TID_ATTR,
 	PROC_TID_ATTR_CURRENT,
@@ -280,6 +283,9 @@ static struct pid_entry tid_base_stuff[]
 #ifdef CONFIG_AUDITSYSCALL
 	E(PROC_TID_LOGINUID, "loginuid", S_IFREG|S_IWUSR|S_IRUGO),
 #endif
+#ifdef CONFIG_CPU_RATE_CAPS
+	E(PROC_TID_CPU_RATE_CAP,  "cpu_rate_cap",   S_IFREG|S_IRUGO|S_IWUSR),
+#endif
 	{0,0,NULL,0}
 };
 
@@ -1036,6 +1042,54 @@ static struct file_operations proc_secco
 };
 #endif /* CONFIG_SECCOMP */
 
+#ifdef CONFIG_CPU_RATE_CAPS
+static ssize_t cpu_rate_cap_read(struct file * file, char * buf,
+			size_t count, loff_t *ppos)
+{
+	struct task_struct *task = get_proc_task(file->f_dentry->d_inode);
+	char buffer[64];
+	size_t len;
+	unsigned int cppt = get_cpu_rate_cap(task);
+
+	if (*ppos)
+		return 0;
+	*ppos = len = sprintf(buffer, "%u\n", cppt);
+	if (copy_to_user(buf, buffer, len))
+		return -EFAULT;
+
+	return len;
+}
+
+static ssize_t cpu_rate_cap_write(struct file * file, const char * buf,
+			 size_t count, loff_t *ppos)
+{
+	struct task_struct *task = get_proc_task(file->f_dentry->d_inode);
+	char buffer[128] = "";
+	char *endptr = NULL;
+	unsigned long hcppt;
+	int res;
+
+
+	if ((count > 63) || *ppos)
+		return -EFBIG;
+	if (copy_from_user(buffer, buf, count))
+		return -EFAULT;
+	hcppt = simple_strtoul(buffer, &endptr, 0);
+	if ((endptr == buffer) || (hcppt == ULONG_MAX))
+		return -EINVAL;
+
+	if ((res = set_cpu_rate_cap(task, hcppt)) != 0)
+		return res;
+
+	return count;
+}
+
+struct file_operations proc_cpu_rate_cap_operations = {
+	read:		cpu_rate_cap_read,
+	write:		cpu_rate_cap_write,
+};
+#endif
+
 static void *proc_pid_follow_link(struct dentry *dentry, struct nameidata *nd)
 {
 	struct inode *inode = dentry->d_inode;
@@ -1796,6 +1850,11 @@ static struct dentry *proc_pident_lookup
 			inode->i_fop = &proc_loginuid_operations;
 			break;
 #endif
+#ifdef CONFIG_CPU_RATE_CAPS
+		case PROC_TID_CPU_RATE_CAP:
+			inode->i_fop = &proc_cpu_rate_cap_operations;
+			break;
+#endif
 		default:
 			printk("procfs: impossible type (%d)",p->type);
 			iput(inode);

-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce


* [RFC 5/5] sched: Add procfs interface for CPU rate hard caps
  2006-05-26  4:20 [RFC 0/5] sched: Add CPU rate caps Peter Williams
                   ` (3 preceding siblings ...)
  2006-05-26  4:21 ` [RFC 4/5] sched: Add procfs interface for CPU rate soft caps Peter Williams
@ 2006-05-26  4:21 ` Peter Williams
  2006-05-26  8:04 ` [RFC 0/5] sched: Add CPU rate caps Mike Galbraith
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 95+ messages in thread
From: Peter Williams @ 2006-05-26  4:21 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Con Kolivas, Peter Williams, Linux Kernel, Kingsley Cheung,
	Ingo Molnar, Rene Herman

This patch implements a procfs interface for hard CPU rate caps.

Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>
 fs/proc/base.c |   59 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

Index: MM-2.6.17-rc4-mm3/fs/proc/base.c
===================================================================
--- MM-2.6.17-rc4-mm3.orig/fs/proc/base.c	2006-05-26 13:50:57.000000000 +1000
+++ MM-2.6.17-rc4-mm3/fs/proc/base.c	2006-05-26 13:51:01.000000000 +1000
@@ -170,6 +170,9 @@ enum pid_directory_inos {
 #ifdef CONFIG_CPU_RATE_CAPS
 	PROC_TID_CPU_RATE_CAP,
 #endif
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+	PROC_TID_CPU_RATE_HARD_CAP,
+#endif
 #ifdef CONFIG_SECURITY
 	PROC_TID_ATTR,
 	PROC_TID_ATTR_CURRENT,
@@ -286,6 +289,9 @@ static struct pid_entry tid_base_stuff[]
 #ifdef CONFIG_CPU_RATE_CAPS
 	E(PROC_TID_CPU_RATE_CAP,  "cpu_rate_cap",   S_IFREG|S_IRUGO|S_IWUSR),
 #endif
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+	E(PROC_TID_CPU_RATE_HARD_CAP,  "cpu_rate_hard_cap",   S_IFREG|S_IRUGO|S_IWUSR),
+#endif
 	{0,0,NULL,0}
 };
 
@@ -1090,6 +1096,54 @@ struct file_operations proc_cpu_rate_cap
 };
 #endif
 
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+static ssize_t cpu_rate_hard_cap_read(struct file * file, char * buf,
+			size_t count, loff_t *ppos)
+{
+	struct task_struct *task = get_proc_task(file->f_dentry->d_inode);
+	char buffer[64];
+	size_t len;
+	unsigned int cppt = get_cpu_rate_hard_cap(task);
+
+	if (*ppos)
+		return 0;
+	*ppos = len = sprintf(buffer, "%u\n", cppt);
+	if (copy_to_user(buf, buffer, len))
+		return -EFAULT;
+
+	return len;
+}
+
+static ssize_t cpu_rate_hard_cap_write(struct file * file, const char * buf,
+			 size_t count, loff_t *ppos)
+{
+	struct task_struct *task = get_proc_task(file->f_dentry->d_inode);
+	char buffer[128] = "";
+	char *endptr = NULL;
+	unsigned long hcppt;
+	int res;
+
+
+	if ((count > 63) || *ppos)
+		return -EFBIG;
+	if (copy_from_user(buffer, buf, count))
+		return -EFAULT;
+	hcppt = simple_strtoul(buffer, &endptr, 0);
+	if ((endptr == buffer) || (hcppt == ULONG_MAX))
+		return -EINVAL;
+
+	if ((res = set_cpu_rate_hard_cap(task, hcppt)) != 0)
+		return res;
+
+	return count;
+}
+
+struct file_operations proc_cpu_rate_hard_cap_operations = {
+	read:		cpu_rate_hard_cap_read,
+	write:		cpu_rate_hard_cap_write,
+};
+#endif
+
 static void *proc_pid_follow_link(struct dentry *dentry, struct nameidata *nd)
 {
 	struct inode *inode = dentry->d_inode;
@@ -1855,6 +1909,11 @@ static struct dentry *proc_pident_lookup
 			inode->i_fop = &proc_cpu_rate_cap_operations;
 			break;
 #endif
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+		case PROC_TID_CPU_RATE_HARD_CAP:
+			inode->i_fop = &proc_cpu_rate_hard_cap_operations;
+			break;
+#endif
 		default:
 			printk("procfs: impossible type (%d)",p->type);
 			iput(inode);

-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce


* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-26  4:20 ` [RFC 3/5] sched: Add CPU rate hard caps Peter Williams
@ 2006-05-26  6:58   ` Kari Hurtta
  2006-05-27  1:00     ` Peter Williams
  2006-05-26 11:00   ` Con Kolivas
  2006-05-27  6:48   ` Balbir Singh
  2 siblings, 1 reply; 95+ messages in thread
From: Kari Hurtta @ 2006-05-26  6:58 UTC (permalink / raw)
  To: linux-kernel

Peter Williams <pwil3058@bigpond.net.au> writes in gmane.linux.kernel:

> This patch implements hard CPU rate caps per task as a proportion of a
> single CPU's capacity expressed in parts per thousand.

> + * Require: 1 <= new_cap <= 1000
> + */
> +int set_cpu_rate_hard_cap(struct task_struct *p, unsigned int new_cap)
> +{
> +	int is_allowed;
> +	unsigned long flags;
> +	struct runqueue *rq;
> +	int delta;
> +
> +	if (new_cap > 1000 && new_cap > 0)
> +		return -EINVAL;

That condition looks wrong.
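
Presumably, given the "Require: 1 <= new_cap <= 1000" comment just above
it, the intended check is something like this (just a guess at the intent):

	if (new_cap > 1000 || new_cap < 1)
		return -EINVAL;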





* Re: [RFC 0/5] sched: Add CPU rate caps
  2006-05-26  4:20 [RFC 0/5] sched: Add CPU rate caps Peter Williams
                   ` (4 preceding siblings ...)
  2006-05-26  4:21 ` [RFC 5/5] sched: Add procfs interface for CPU rate hard caps Peter Williams
@ 2006-05-26  8:04 ` Mike Galbraith
  2006-05-26 16:11   ` Björn Steinbrink
  2006-05-27  0:16   ` Peter Williams
  2006-05-26 10:41 ` Con Kolivas
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 95+ messages in thread
From: Mike Galbraith @ 2006-05-26  8:04 UTC (permalink / raw)
  To: Peter Williams
  Cc: Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

On Fri, 2006-05-26 at 14:20 +1000, Peter Williams wrote:
> These patches implement CPU usage rate limits for tasks.
> 
> Although the rlimit mechanism already has a CPU usage limit (RLIMIT_CPU)
> it is a total usage limit and therefore (to my mind) not very useful.
> These patches provide an alternative whereby the (recent) average CPU
> usage rate of a task can be limited to a (per task) specified proportion
> of a single CPU's capacity.

The killer problem I see with this approach is that it doesn't address
the divide and conquer problem.  If a task is capped, and forks off
workers, each worker inherits the total cap, effectively extending same.

IMHO, per task resource management is too severely limited in its
usefulness, because jobs are what need managing, and they're seldom
single threaded.  In order to use per task limits to manage any given
job, you have to both know the number of components, and manually
distribute resources to each component of the job.  If a job has a
dynamic number of components, it becomes impossible to manage.

	-Mike



* Re: [RFC 0/5] sched: Add CPU rate caps
  2006-05-26  4:20 [RFC 0/5] sched: Add CPU rate caps Peter Williams
                   ` (5 preceding siblings ...)
  2006-05-26  8:04 ` [RFC 0/5] sched: Add CPU rate caps Mike Galbraith
@ 2006-05-26 10:41 ` Con Kolivas
  2006-05-27  1:28   ` Peter Williams
  2006-05-26 11:09 ` Con Kolivas
  2006-05-26 11:29 ` Balbir Singh
  8 siblings, 1 reply; 95+ messages in thread
From: Con Kolivas @ 2006-05-26 10:41 UTC (permalink / raw)
  To: Peter Williams
  Cc: Mike Galbraith, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

On Friday 26 May 2006 14:20, Peter Williams wrote:
> These patches implement CPU usage rate limits for tasks.

Nice :)

> Although the rlimit mechanism already has a CPU usage limit (RLIMIT_CPU)
> it is a total usage limit and therefore (to my mind) not very useful.
> These patches provide an alternative whereby the (recent) average CPU
> usage rate of a task can be limited to a (per task) specified proportion
> of a single CPU's capacity.  The limits are specified in parts per
> thousand and come in two varieties -- hard and soft.

Why 1000? I doubt that degree of accuracy is possible in cpu accounting, or
even required. To me it would seem to make more sense for it to just be a
percentage.

-- 
-ck


* Re: [RFC 2/5] sched: Add CPU rate soft caps
  2006-05-26  4:20 ` [RFC 2/5] sched: Add " Peter Williams
@ 2006-05-26 10:48   ` Con Kolivas
  2006-05-26 11:15     ` Mike Galbraith
  2006-05-26 13:55     ` Peter Williams
  2006-05-27  6:31   ` Balbir Singh
  1 sibling, 2 replies; 95+ messages in thread
From: Con Kolivas @ 2006-05-26 10:48 UTC (permalink / raw)
  To: Peter Williams
  Cc: Mike Galbraith, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

On Friday 26 May 2006 14:20, Peter Williams wrote:
> This patch implements (soft) CPU rate caps per task as a proportion of a
> single CPU's capacity expressed in parts per thousand.  The CPU usage
> of capped tasks is determined by using Kalman filters to calculate the
> (recent) average lengths of the task's scheduling cycle and the time
> spent on the CPU each cycle and taking the ratio of the latter to the
> former.  To minimize overhead associated with uncapped tasks these
> statistics are not kept for them.
>
> Notes:
>
> 1. To minimize the overhead incurred when testing to skip caps processing
> for uncapped tasks a new flag PF_HAS_CAP has been added to flags.

[ot]I'm sorry to see an Australian adopt American spelling [/ot]

> 3. Enforcement of caps is not as strict as it could be in order to
> reduce the possibility of a task being starved of CPU while holding
> an important system resource with resultant overall performance
> degradation.  In effect, all runnable capped tasks will get some amount
> of CPU access every active/expired swap cycle.  This will be most
> apparent for small or zero soft caps.

The array swap happens very frequently if there is nothing but heavily cpu 
bound tasks, which is not an infrequent workload. I doubt the zero caps are 
very effective in that environment.

-- 
-ck


* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-26  4:20 ` [RFC 3/5] sched: Add CPU rate hard caps Peter Williams
  2006-05-26  6:58   ` Kari Hurtta
@ 2006-05-26 11:00   ` Con Kolivas
  2006-05-26 13:59     ` Peter Williams
  2006-05-27  6:48   ` Balbir Singh
  2 siblings, 1 reply; 95+ messages in thread
From: Con Kolivas @ 2006-05-26 11:00 UTC (permalink / raw)
  To: Peter Williams
  Cc: Mike Galbraith, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

On Friday 26 May 2006 14:20, Peter Williams wrote:
> This patch implements hard CPU rate caps per task as a proportion of a
> single CPU's capacity expressed in parts per thousand.

A hard cap of 1/1000 could lead to interesting starvation scenarios where a 
mutex or semaphore was held by a task that hardly ever got cpu. The same
goes, to a lesser extent, for a 0 soft cap.

Here is how I handle idleprio tasks in current -ck:

http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/track_mutexes-1.patch
tags tasks that are holding a mutex

http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/sched-idleprio-1.7.patch
is the idleprio policy for staircase.

What it does is run idleprio tasks as normal tasks while they hold a mutex or
are waking up after calling down() (i.e. holding a semaphore). These two in
combination have shown resistance to any priority inversion problems in
widespread testing. An attempt was made to track semaphores held via
down_interruptible(), but unfortunately the lack of strict rules about who
could release the semaphore made accounting for that scenario impossible.
In practice, though, there were no test cases that showed it to be an issue,
and the recent en-masse conversion of semaphores to mutexes in the kernel
means it has pretty much covered most possibilities.

-- 
-ck


* Re: [RFC 0/5] sched: Add CPU rate caps
  2006-05-26  4:20 [RFC 0/5] sched: Add CPU rate caps Peter Williams
                   ` (6 preceding siblings ...)
  2006-05-26 10:41 ` Con Kolivas
@ 2006-05-26 11:09 ` Con Kolivas
  2006-05-26 14:00   ` Peter Williams
  2006-05-26 11:29 ` Balbir Singh
  8 siblings, 1 reply; 95+ messages in thread
From: Con Kolivas @ 2006-05-26 11:09 UTC (permalink / raw)
  To: Peter Williams
  Cc: Mike Galbraith, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

On Friday 26 May 2006 14:20, Peter Williams wrote:
> These patches implement CPU usage rate limits for tasks.
>
> Although the rlimit mechanism already has a CPU usage limit (RLIMIT_CPU)
> it is a total usage limit and therefore (to my mind) not very useful.
> These patches provide an alternative whereby the (recent) average CPU
> usage rate of a task can be limited to a (per task) specified proportion
> of a single CPU's capacity.  The limits are specified in parts per
> thousand and come in two varieties -- hard and soft.  The difference
> between the two is that the system tries to enforce hard caps regardless
> of the other demand for CPU resources but allows soft caps to be
> exceeded if there are spare CPU resources available.  By default, tasks
> will have both caps set to 1000 (i.e. no limit) but newly forked tasks
> will inherit any caps that have been imposed on their parent from the
> parent.  The minimum soft cap allowed is 0 (which effectively puts the
> task in the background) and the minimum hard cap allowed is 1.
>
> Care has been taken to minimize the overhead inflicted on tasks that
> have no caps and my tests using kernbench indicate that it is hidden in
> the noise.

The presence of tasks with caps will break smp balancing and smp nice. I
suspect you could provide a reasonable workaround for soft caps by scaling
their priority bias effect on the raw weighted load used by smp nice by the
percentage cpu of the cap. Hard caps provide more "interesting" challenges
though. I can't think of a valid biasing for them off hand, but at least
initially using the same logic as for soft caps should help.

-- 
-ck


* Re: [RFC 2/5] sched: Add CPU rate soft caps
  2006-05-26 10:48   ` Con Kolivas
@ 2006-05-26 11:15     ` Mike Galbraith
  2006-05-26 11:17       ` Con Kolivas
  2006-05-26 13:55     ` Peter Williams
  1 sibling, 1 reply; 95+ messages in thread
From: Mike Galbraith @ 2006-05-26 11:15 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Peter Williams, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

On Fri, 2006-05-26 at 20:48 +1000, Con Kolivas wrote:
> On Friday 26 May 2006 14:20, Peter Williams wrote:
> > 3. Enforcement of caps is not as strict as it could be in order to
> > reduce the possibility of a task being starved of CPU while holding
> > an important system resource with resultant overall performance
> > degradation.  In effect, all runnable capped tasks will get some amount
> > of CPU access every active/expired swap cycle.  This will be most
> > apparent for small or zero soft caps.
> 
> The array swap happens very frequently if there is nothing but heavily cpu 
> bound tasks, which is not an infrequent workload. I doubt the zero caps are 
> very effective in that environment.

Hmm.  I think that came out kinda back-assward.  You meant "the array
swap happens very frequently _unless_..."  No?

But anyway, I can't think of any reason to hold back an uncontested
resource.

	-Mike



* Re: [RFC 2/5] sched: Add CPU rate soft caps
  2006-05-26 11:15     ` Mike Galbraith
@ 2006-05-26 11:17       ` Con Kolivas
  2006-05-26 11:30         ` Mike Galbraith
  0 siblings, 1 reply; 95+ messages in thread
From: Con Kolivas @ 2006-05-26 11:17 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Williams, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

On Friday 26 May 2006 21:15, Mike Galbraith wrote:
> On Fri, 2006-05-26 at 20:48 +1000, Con Kolivas wrote:
> > On Friday 26 May 2006 14:20, Peter Williams wrote:
> > > 3. Enforcement of caps is not as strict as it could be in order to
> > > reduce the possibility of a task being starved of CPU while holding
> > > an important system resource with resultant overall performance
> > > degradation.  In effect, all runnable capped tasks will get some amount
> > > of CPU access every active/expired swap cycle.  This will be most
> > > apparent for small or zero soft caps.
> >
> > The array swap happens very frequently if there is nothing but heavily
> > cpu bound tasks, which is not an infrequent workload. I doubt the zero
> > caps are very effective in that environment.
>
> Hmm.  I think that came out kinda back-assward.  You meant "the array
> swap happens very frequently _unless_..."  No?

No I didn't. If all you are doing is compiling code then the array swap will 
happen often as they will always use up their full timeslice and expire. 
Therefore an array swap will follow shortly afterwards.

> But anyway, I can't think of any reason to hold back an uncontested
> resource.

If you are compiling applications it's a contested resource.

-- 
-ck

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 0/5] sched: Add CPU rate caps
  2006-05-26  4:20 [RFC 0/5] sched: Add CPU rate caps Peter Williams
                   ` (7 preceding siblings ...)
  2006-05-26 11:09 ` Con Kolivas
@ 2006-05-26 11:29 ` Balbir Singh
  2006-05-27  1:40   ` Peter Williams
  8 siblings, 1 reply; 95+ messages in thread
From: Balbir Singh @ 2006-05-26 11:29 UTC (permalink / raw)
  To: Peter Williams
  Cc: Mike Galbraith, Con Kolivas, Linux Kernel, Kingsley Cheung,
	Ingo Molnar, Rene Herman

On Fri, May 26, 2006 at 02:20:21PM +1000, Peter Williams wrote:
> These patches implement CPU usage rate limits for tasks.
> 
> Although the rlimit mechanism already has a CPU usage limit (RLIMIT_CPU)
> it is a total usage limit and therefore (to my mind) not very useful.
> These patches provide an alternative whereby the (recent) average CPU
> usage rate of a task can be limited to a (per task) specified proportion
> of a single CPU's capacity.  The limits are specified in parts per
> thousand and come in two varieties -- hard and soft.  The difference
> between the two is that the system tries to enforce hard caps regardless
> of the other demand for CPU resources but allows soft caps to be
> exceeded if there are spare CPU resources available.  By default, tasks
> will have both caps set to 1000 (i.e. no limit) but newly forked tasks
> will inherit any caps that have been imposed on their parent from the
> parent.  The minimum soft cap allowed is 0 (which effectively puts the
> task in the background) and the minimum hard cap allowed is 1.
> 
> Care has been taken to minimize the overhead inflicted on tasks that
> have no caps and my tests using kernbench indicate that it is hidden in
> the noise.
> 
> Note:
> 
> The first patch in this series fixes some problems with priority
> inheritance that are present in 2.6.17-rc4-mm3 but will be fixed in
> the next -mm kernel.
> 

1000 sounds like a coarse number. A good unit for users setting
these limits would be a percentage, or better yet, let the user decide on the
parts. For example, the user could divide the available CPU's capacity
into 2000 parts and ask for 200 parts, or divide it into 100 parts and ask
for 10 parts. The default capacity could be 100 or 1000 parts. Maybe the part
setting could be a system tunable.

I would also prefer making the cap a specified portion of the capacity
of all CPUs. This would make the behaviour more predictable.

Consider a task "T" which has 10 percent of a single CPU's capacity as its
hard limit. If it migrates to another CPU, would the new CPU also make 10% of
its capacity available to "T"?

What is the interval over which the 10% is tracked? Does a task that crosses
its hard limit get killed? If not, when does a task which has exceeded its
hard limit get a new lease of another 10% to use?

I guess I should move on to reading the code for this feature now :-)

	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 2/5] sched: Add CPU rate soft caps
  2006-05-26 11:17       ` Con Kolivas
@ 2006-05-26 11:30         ` Mike Galbraith
  0 siblings, 0 replies; 95+ messages in thread
From: Mike Galbraith @ 2006-05-26 11:30 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Peter Williams, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

On Fri, 2006-05-26 at 21:17 +1000, Con Kolivas wrote:
> On Friday 26 May 2006 21:15, Mike Galbraith wrote:
> > On Fri, 2006-05-26 at 20:48 +1000, Con Kolivas wrote:
> > > On Friday 26 May 2006 14:20, Peter Williams wrote:
> > > > 3. Enforcement of caps is not as strict as it could be in order to
> > > > reduce the possibility of a task being starved of CPU while holding
> > > > an important system resource with resultant overall performance
> > > > degradation.  In effect, all runnable capped tasks will get some amount
> > > > of CPU access every active/expired swap cycle.  This will be most
> > > > apparent for small or zero soft caps.
> > >
> > > The array swap happens very frequently if there are nothing but heavily
> > > cpu bound tasks, which is not an infrequent workload. I doubt the zero
> > > caps are very effective in that environment.
> >
> > Hmm.  I think that came out kinda back-assward.  You meant "the array
> > swap happens very frequently _unless_..."  No?
> 
> No I didn't. If all you are doing is compiling code then the array swap will 
> happen often as they will always use up their full timeslice and expire. 
> Therefore an array swap will follow shortly afterwards.

Afterward being possibly ages.  Frequent array switch happens when you
have mostly sleepy processes, not cpu bound.  But whatever.

> > But anyway, I can't think of any reason to hold back an uncontested
> > resource.
> 
> If you are compiling applications it's a contested resource.

These zero capped tasks are at the bottom of the heap.  They won't be
selected if there's any other runnable task, so it's not contested.

	-Mike


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 2/5] sched: Add CPU rate soft caps
  2006-05-26 10:48   ` Con Kolivas
  2006-05-26 11:15     ` Mike Galbraith
@ 2006-05-26 13:55     ` Peter Williams
  1 sibling, 0 replies; 95+ messages in thread
From: Peter Williams @ 2006-05-26 13:55 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Mike Galbraith, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

Con Kolivas wrote:
> On Friday 26 May 2006 14:20, Peter Williams wrote:
>> This patch implements (soft) CPU rate caps per task as a proportion of a
>> single CPU's capacity expressed in parts per thousand.  The CPU usage
>> of capped tasks is determined by using Kalman filters to calculate the
>> (recent) average lengths of the task's scheduling cycle and the time
>> spent on the CPU each cycle and taking the ratio of the latter to the
>> former.  To minimize overhead associated with uncapped tasks these
>> statistics are not kept for them.
>>
>> Notes:
>>
>> 1. To minimize the overhead incurred when testing to skip caps processing
>> for uncapped tasks a new flag PF_HAS_CAP has been added to flags.
> 
> [ot]I'm sorry to see an Australian adopt American spelling [/ot]

I think you'll find the Oxford English Dictionary (which was the 
reference when I went to school in the middle of last century) uses the 
z and offers the s version as an option.

> 
>> 3. Enforcement of caps is not as strict as it could be in order to
>> reduce the possibility of a task being starved of CPU while holding
>> an important system resource with resultant overall performance
>> degradation.  In effect, all runnable capped tasks will get some amount
>> of CPU access every active/expired swap cycle.  This will be most
>> apparent for small or zero soft caps.
> 
> The array swap happens very frequently if there are nothing but heavily cpu 
> bound tasks, which is not an infrequent workload. I doubt the zero caps are 
> very effective in that environment.

Yes, and it depends on HZ as well (i.e. it works better with higher HZ). 
With HZ=250 and a zero capped hard spinning task competing with 
another hard spinning task on a single CPU system it struggles to keep 
it below 4%.  I've tested hard caps down to 0.5% in the same test 
and it copes.  So a longer term solution, such as something similar to 
the rt_mutex priority inheritance, is needed before stricter soft 
capping can be enforced.  I don't think that being more strict would be 
hard as it would just involve some extra checking when determining idx 
in schedule().

BTW in my SPA schedulers this can be controlled by varying the promotion 
rate.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-26 11:00   ` Con Kolivas
@ 2006-05-26 13:59     ` Peter Williams
  2006-05-26 14:12       ` Con Kolivas
  2006-05-26 14:23       ` Mike Galbraith
  0 siblings, 2 replies; 95+ messages in thread
From: Peter Williams @ 2006-05-26 13:59 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Mike Galbraith, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

Con Kolivas wrote:
> On Friday 26 May 2006 14:20, Peter Williams wrote:
>> This patch implements hard CPU rate caps per task as a proportion of a
>> single CPU's capacity expressed in parts per thousand.
> 
> A hard cap of 1/1000 could lead to interesting starvation scenarios where a 
> mutex or semaphore was held by a task that hardly ever got cpu. Same goes to 
> a lesser extent to a 0 soft cap. 
> 
> Here is how I handle idleprio tasks in current -ck:
> 
> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/track_mutexes-1.patch
> tags tasks that are holding a mutex
> 
> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/sched-idleprio-1.7.patch
> is the idleprio policy for staircase.
> 
> What it does is runs idleprio tasks as normal tasks when they hold a mutex or 
> are waking up after calling down() (ie holding a semaphore).

I wasn't aware that you could detect those conditions.  They could be 
very useful.

> These two in 
> combination have shown resistance to any priority inversion problems in 
> widespread testing. An attempt was made to track semaphores held via a 
> down_interruptible() but unfortunately the lack of strict rules about who 
> could release the semaphore meant accounting was impossible of this scenario. 
> In practice, though there were no test cases that showed it to be an issue, 
> and the recent conversion en-masse of semaphores to mutexes in the kernel 
> means it has pretty much covered most possibilities.
> 

Thanks,
Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 0/5] sched: Add CPU rate caps
  2006-05-26 11:09 ` Con Kolivas
@ 2006-05-26 14:00   ` Peter Williams
  0 siblings, 0 replies; 95+ messages in thread
From: Peter Williams @ 2006-05-26 14:00 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Mike Galbraith, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

Con Kolivas wrote:
> On Friday 26 May 2006 14:20, Peter Williams wrote:
>> These patches implement CPU usage rate limits for tasks.
>>
>> Although the rlimit mechanism already has a CPU usage limit (RLIMIT_CPU)
>> it is a total usage limit and therefore (to my mind) not very useful.
>> These patches provide an alternative whereby the (recent) average CPU
>> usage rate of a task can be limited to a (per task) specified proportion
>> of a single CPU's capacity.  The limits are specified in parts per
>> thousand and come in two varieties -- hard and soft.  The difference
>> between the two is that the system tries to enforce hard caps regardless
>> of the other demand for CPU resources but allows soft caps to be
>> exceeded if there are spare CPU resources available.  By default, tasks
>> will have both caps set to 1000 (i.e. no limit) but newly forked tasks
>> will inherit any caps that have been imposed on their parent from the
>> parent.  The minimum soft cap allowed is 0 (which effectively puts the
>> task in the background) and the minimum hard cap allowed is 1.
>>
>> Care has been taken to minimize the overhead inflicted on tasks that
>> have no caps and my tests using kernbench indicate that it is hidden in
>> the noise.
> 
> The presence of tasks with caps will break smp balancing and smp nice.  I 
> suspect you could provide a reasonable workaround for soft caps by scaling 
> their priority bias effect in the raw weighted load used by smp nice 
> according to the percentage of a cpu allowed by the cap.  Hard caps provide 
> more "interesting" challenges though.  I can't think of a valid biasing for 
> them offhand, but at least initially using the same logic as soft caps 
> should help.
> 

I thought that I already did that?  Check the changes to set_load_weight().

-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-26 13:59     ` Peter Williams
@ 2006-05-26 14:12       ` Con Kolivas
  2006-05-26 14:23       ` Mike Galbraith
  1 sibling, 0 replies; 95+ messages in thread
From: Con Kolivas @ 2006-05-26 14:12 UTC (permalink / raw)
  To: Peter Williams
  Cc: Mike Galbraith, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

On Friday 26 May 2006 23:59, Peter Williams wrote:
> Con Kolivas wrote:
> > On Friday 26 May 2006 14:20, Peter Williams wrote:
> >> This patch implements hard CPU rate caps per task as a proportion of a
> >> single CPU's capacity expressed in parts per thousand.
> >
> > A hard cap of 1/1000 could lead to interesting starvation scenarios where
> > a mutex or semaphore was held by a task that hardly ever got cpu. Same
> > goes to a lesser extent to a 0 soft cap.
> >
> > Here is how I handle idleprio tasks in current -ck:
> >
> > http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/
> >patches/track_mutexes-1.patch tags tasks that are holding a mutex
> >
> > http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/
> >patches/sched-idleprio-1.7.patch is the idleprio policy for staircase.
> >
> > What it does is runs idleprio tasks as normal tasks when they hold a
> > mutex or are waking up after calling down() (ie holding a semaphore).
>
> I wasn't aware that you could detect those conditions.  They could be
> very useful.

Ingo's mutex infrastructure made it possible to accurately track mutexes held, 
and basically anything entering uninterruptible sleep has called down(). 
Mainline, as you know, already flags the latter for interactivity purposes.
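
A rough sketch of the tracking idea (not the actual track_mutexes patch; the
mutexes_held field and helper names are hypothetical):

/* hypothetical per-task counter, bumped by the mutex acquire/release paths */
static inline void task_grabs_mutex(struct task_struct *p)
{
        p->mutexes_held++;
}

static inline void task_drops_mutex(struct task_struct *p)
{
        p->mutexes_held--;
}

/*
 * Scheduler side: treat an idleprio (or heavily capped) task as a normal
 * task while it holds a lock, so it can't starve other lock waiters.
 */
static inline int safe_to_background(const struct task_struct *p)
{
        return p->mutexes_held == 0;
}

The down()/uninterruptible sleep case mentioned above would need a similar
flag set on the wakeup path.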

-- 
-ck

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-26 13:59     ` Peter Williams
  2006-05-26 14:12       ` Con Kolivas
@ 2006-05-26 14:23       ` Mike Galbraith
  2006-05-27  0:16         ` Peter Williams
  1 sibling, 1 reply; 95+ messages in thread
From: Mike Galbraith @ 2006-05-26 14:23 UTC (permalink / raw)
  To: Peter Williams
  Cc: Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

On Fri, 2006-05-26 at 23:59 +1000, Peter Williams wrote:
> Con Kolivas wrote:
> > On Friday 26 May 2006 14:20, Peter Williams wrote:
> >> This patch implements hard CPU rate caps per task as a proportion of a
> >> single CPU's capacity expressed in parts per thousand.
> > 
> > A hard cap of 1/1000 could lead to interesting starvation scenarios where a 
> > mutex or semaphore was held by a task that hardly ever got cpu. Same goes to 
> > a lesser extent to a 0 soft cap. 
> > 
> > Here is how I handle idleprio tasks in current -ck:
> > 
> > http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/track_mutexes-1.patch
> > tags tasks that are holding a mutex
> > 
> > http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/sched-idleprio-1.7.patch
> > is the idleprio policy for staircase.
> > 
> > What it does is runs idleprio tasks as normal tasks when they hold a mutex or 
> > are waking up after calling down() (ie holding a semaphore).
> 
> I wasn't aware that you could detect those conditions.  They could be 
> very useful.

Isn't this exactly what the PI code is there to handle?  Is something
more than PI needed?

	-Mike


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 0/5] sched: Add CPU rate caps
  2006-05-26  8:04 ` [RFC 0/5] sched: Add CPU rate caps Mike Galbraith
@ 2006-05-26 16:11   ` Björn Steinbrink
  2006-05-28 22:46     ` Sam Vilain
  2006-05-27  0:16   ` Peter Williams
  1 sibling, 1 reply; 95+ messages in thread
From: Björn Steinbrink @ 2006-05-26 16:11 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Williams, Con Kolivas, Linux Kernel, Kingsley Cheung,
	Ingo Molnar, Rene Herman, Herbert Poetzl, Sam Vilain

On 2006.05.26 10:04:20 +0200, Mike Galbraith wrote:
> On Fri, 2006-05-26 at 14:20 +1000, Peter Williams wrote:
> > These patches implement CPU usage rate limits for tasks.
> > 
> > Although the rlimit mechanism already has a CPU usage limit (RLIMIT_CPU)
> > it is a total usage limit and therefore (to my mind) not very useful.
> > These patches provide an alternative whereby the (recent) average CPU
> > usage rate of a task can be limited to a (per task) specified proportion
> > of a single CPU's capacity.
> 
> The killer problem I see with this approach is that it doesn't address
> the divide and conquer problem.  If a task is capped, and forks off
> workers, each worker inherits the total cap, effectively extending same.
> 
> IMHO, per task resource management is too severely limited in it's
> usefulness, because jobs are what need managing, and they're seldom
> single threaded.  In order to use per task limits to manage any given
> job, you have to both know the number of components, and manually
> distribute resources to each component of the job.  If a job has a
> dynamic number of components, it becomes impossible to manage.

Linux-VServer uses a token bucket scheduler (TBS) to limit cpu resources
for processes in a "context". All processes in a context share one token
bucket, which has a set of parameters to tune scheduling behaviour.
As the token bucket is shared by a group of processes, and inherited by
child processes/threads, management is quite easy. And the parameters
can be tuned to allow different scheduling behaviours, like allowing a
process group to burst, i.e. use as much cpu time as is available, after
being idle for some time, but being limited to X % cpu time on average.
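
For readers unfamiliar with the idea, a minimal self-contained sketch of such
a token bucket (illustrative only, not the Linux-VServer implementation or its
parameter names):

struct cpu_token_bucket {
        long tokens;            /* current tokens, in scheduler ticks */
        long tokens_max;        /* bucket size; allows bursting after idle */
        long fill_rate;         /* tokens added every 'interval' ticks */
        long interval;
        unsigned long last_fill;        /* tick count at the last refill */
};

/* charge one tick of CPU to the group; returns 0 when its share is used up */
static int tb_charge_tick(struct cpu_token_bucket *tb, unsigned long now)
{
        unsigned long periods = (now - tb->last_fill) / tb->interval;

        if (periods) {
                tb->tokens += periods * tb->fill_rate;
                if (tb->tokens > tb->tokens_max)
                        tb->tokens = tb->tokens_max;
                tb->last_fill += periods * tb->interval;
        }

        if (tb->tokens <= 0)
                return 0;
        tb->tokens--;
        return 1;
}

On average the group is then limited to fill_rate/interval of a CPU, but after
idling it can burst for up to tokens_max ticks.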

I'm CC'ing Herbert and Sam on this as they can explain the whole thing a
lot better and I'm not familiar with implementation details.

Regards
Björn

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 0/5] sched: Add CPU rate caps
  2006-05-26  8:04 ` [RFC 0/5] sched: Add CPU rate caps Mike Galbraith
  2006-05-26 16:11   ` Björn Steinbrink
@ 2006-05-27  0:16   ` Peter Williams
  1 sibling, 0 replies; 95+ messages in thread
From: Peter Williams @ 2006-05-27  0:16 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

Mike Galbraith wrote:
> On Fri, 2006-05-26 at 14:20 +1000, Peter Williams wrote:
>> These patches implement CPU usage rate limits for tasks.
>>
>> Although the rlimit mechanism already has a CPU usage limit (RLIMIT_CPU)
>> it is a total usage limit and therefore (to my mind) not very useful.
>> These patches provide an alternative whereby the (recent) average CPU
>> usage rate of a task can be limited to a (per task) specified proportion
>> of a single CPU's capacity.
> 
> The killer problem I see with this approach is that it doesn't address
> the divide and conquer problem.  If a task is capped, and forks off
> workers, each worker inherits the total cap, effectively extending same.
> 
> IMHO, per task resource management is too severely limited in it's
> usefulness, because jobs are what need managing, and they're seldom
> single threaded.  In order to use per task limits to manage any given
> job, you have to both know the number of components, and manually
> distribute resources to each component of the job.  If a job has a
> dynamic number of components, it becomes impossible to manage.

Doing caps at the process level inside the scheduler is doable but would 
involve an extra level of complexity, including locking at the process 
level to calculate process usage rates.  The calculation of usage rates 
would also be more complex than just doing it for tasks, and the fact 
that there are no separate structures for processes and threads further 
complicates the code compared to what would be needed on systems that do 
have them (e.g. Solaris).

I'm not sure that this extra complexity is warranted when it is possible 
to implement caps at the process level from outside the scheduler using 
the task level caps provided by this patch.  However, to allow the costs 
to be properly evaluated, I'll put some effort into a process level 
capping mechanism over the next few weeks.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-26 14:23       ` Mike Galbraith
@ 2006-05-27  0:16         ` Peter Williams
  2006-05-27  9:28           ` Mike Galbraith
  0 siblings, 1 reply; 95+ messages in thread
From: Peter Williams @ 2006-05-27  0:16 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

Mike Galbraith wrote:
> On Fri, 2006-05-26 at 23:59 +1000, Peter Williams wrote:
>> Con Kolivas wrote:
>>> On Friday 26 May 2006 14:20, Peter Williams wrote:
>>>> This patch implements hard CPU rate caps per task as a proportion of a
>>>> single CPU's capacity expressed in parts per thousand.
>>> A hard cap of 1/1000 could lead to interesting starvation scenarios where a 
>>> mutex or semaphore was held by a task that hardly ever got cpu. Same goes to 
>>> a lesser extent to a 0 soft cap. 
>>>
>>> Here is how I handle idleprio tasks in current -ck:
>>>
>>> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/track_mutexes-1.patch
>>> tags tasks that are holding a mutex
>>>
>>> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/sched-idleprio-1.7.patch
>>> is the idleprio policy for staircase.
>>>
>>> What it does is runs idleprio tasks as normal tasks when they hold a mutex or 
>>> are waking up after calling down() (ie holding a semaphore).
>> I wasn't aware that you could detect those conditions.  They could be 
>> very useful.
> 
> Isn't this exactly what the PI code is there to handle?  Is something
> more than PI needed?
> 

AFAIK (but I may be wrong) PI is only used by RT tasks and would need to 
be extended.  It could be argued that extending PI so that it can be 
used by non RT tasks is a worthwhile endeavour in its own right.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-26  6:58   ` Kari Hurtta
@ 2006-05-27  1:00     ` Peter Williams
  0 siblings, 0 replies; 95+ messages in thread
From: Peter Williams @ 2006-05-27  1:00 UTC (permalink / raw)
  To: Kari Hurtta; +Cc: linux-kernel

Kari Hurtta wrote:
> Peter Williams <pwil3058@bigpond.net.au> writes in gmane.linux.kernel:
> 
>> This patch implements hard CPU rate caps per task as a proportion of a
>> single CPU's capacity expressed in parts per thousand.
> 
>> + * Require: 1 <= new_cap <= 1000
>> + */
>> +int set_cpu_rate_hard_cap(struct task_struct *p, unsigned int new_cap)
>> +{
>> +	int is_allowed;
>> +	unsigned long flags;
>> +	struct runqueue *rq;
>> +	int delta;
>> +
>> +	if (new_cap > 1000 && new_cap > 0)
>> +		return -EINVAL;
> 
> That condition looks wrong.

It certainly does.
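
For reference, a check matching the stated requirement (1 <= new_cap <= 1000)
would be something like:

        if (new_cap > 1000 || new_cap < 1)
                return -EINVAL;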

Thanks
Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 0/5] sched: Add CPU rate caps
  2006-05-26 10:41 ` Con Kolivas
@ 2006-05-27  1:28   ` Peter Williams
  2006-05-27  1:42     ` Con Kolivas
  0 siblings, 1 reply; 95+ messages in thread
From: Peter Williams @ 2006-05-27  1:28 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Mike Galbraith, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

Con Kolivas wrote:
> On Friday 26 May 2006 14:20, Peter Williams wrote:
>> These patches implement CPU usage rate limits for tasks.
> 
> Nice :)

Thanks.

> 
>> Although the rlimit mechanism already has a CPU usage limit (RLIMIT_CPU)
>> it is a total usage limit and therefore (to my mind) not very useful.
>> These patches provide an alternative whereby the (recent) average CPU
>> usage rate of a task can be limited to a (per task) specified proportion
>> of a single CPU's capacity.  The limits are specified in parts per
>> thousand and come in two varieties -- hard and soft.
> 
> Why 1000?

Probably a hangover from a version where the units were a proportion of the 
whole machine.  Percentage doesn't work very well if there is more than 
one CPU in that case (especially if there are more than 100 CPUs :-)). 
But it's also useful to have the extra range if you're trying to cap 
processes (or users) from outside the scheduler using these primitives.

> I doubt that degree of accuracy is possible in cpu accounting or even 
> required. To me it would seem to make more sense to just be 
> a percentage.
> 

It's not meant to imply accuracy :-).  The main issue is avoiding 
overflow when doing the multiplications during the comparisons.
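
To make that concrete, the cap test has roughly this shape (a sketch mirroring
task_exceeding_cap() in the patch): the usage fraction is never computed
directly; instead the two 64-bit averages are cross-multiplied, so there is no
division on the fast path and the parts-per-thousand scale keeps the products
inside 64 bits for realistic (decayed) cycle lengths:

static inline int usage_exceeds_cap(u64 avg_cpu_per_cycle,
                                    u64 avg_cycle_length,
                                    unsigned int cap)   /* parts per thousand */
{
        /* avg_cpu_per_cycle / avg_cycle_length > cap / 1000, rearranged */
        return avg_cpu_per_cycle * 1000 > avg_cycle_length * cap;
}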

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 0/5] sched: Add CPU rate caps
  2006-05-26 11:29 ` Balbir Singh
@ 2006-05-27  1:40   ` Peter Williams
  0 siblings, 0 replies; 95+ messages in thread
From: Peter Williams @ 2006-05-27  1:40 UTC (permalink / raw)
  To: balbir
  Cc: Mike Galbraith, Con Kolivas, Linux Kernel, Kingsley Cheung,
	Ingo Molnar, Rene Herman

Balbir Singh wrote:
> On Fri, May 26, 2006 at 02:20:21PM +1000, Peter Williams wrote:
>> These patches implement CPU usage rate limits for tasks.
>>
>> Although the rlimit mechanism already has a CPU usage limit (RLIMIT_CPU)
>> it is a total usage limit and therefore (to my mind) not very useful.
>> These patches provide an alternative whereby the (recent) average CPU
>> usage rate of a task can be limited to a (per task) specified proportion
>> of a single CPU's capacity.  The limits are specified in parts per
>> thousand and come in two varieties -- hard and soft.  The difference
>> between the two is that the system tries to enforce hard caps regardless
>> of the other demand for CPU resources but allows soft caps to be
>> exceeded if there are spare CPU resources available.  By default, tasks
>> will have both caps set to 1000 (i.e. no limit) but newly forked tasks
>> will inherit any caps that have been imposed on their parent from the
>> parent.  The minimum soft cap allowed is 0 (which effectively puts the
>> task in the background) and the minimum hard cap allowed is 1.
>>
>> Care has been taken to minimize the overhead inflicted on tasks that
>> have no caps and my tests using kernbench indicate that it is hidden in
>> the noise.
>>
>> Note:
>>
>> The first patch in this series fixes some problems with priority
>> inheritance that are present in 2.6.17-rc4-mm3 but will be fixed in
>> the next -mm kernel.
>>
> 
> 1000 sounds like a coarse number. A good unit for users setting
> these limits would be a percentage, or better yet, let the user decide on the
> parts. For example, the user could divide the available CPU's capacity
> into 2000 parts and ask for 200 parts, or divide it into 100 parts and ask
> for 10 parts. The default capacity could be 100 or 1000 parts. Maybe the part
> setting could be a system tunable.
> 
> I would also prefer making the cap a specified portion of the capacity
> of all CPUs. This would make the behaviour more predictable.

The meaning of a cap would change every time you took a CPU off or on line. 
This makes the behaviour less predictable, not more predictable (at 
least in my opinion).  You also have the possibility of a cap being 
larger than the capacity of a single CPU, which doesn't make sense when 
capping at the task level.

However, if you still prefer that interface, it could be implemented 
as a wrapper around these facilities out in user space or inside a 
resource management component.

> 
> Consider a task "T" which has 10 percent of a single CPU's capacity as hard
> limit. If it migrated to another CPU, would the new CPU also make 10% of its
> capacity available "T".
> 
> What is the interval over which the 10% is tracked? Does the task that crosses
> its hard limit get killed? If not, When does a task which has exceeded its
> hard-limit get a new lease of another 10% to use?
> 
> I guess I should move on to reading the code for this feature now :-)

I look forward to your comments.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 0/5] sched: Add CPU rate caps
  2006-05-27  1:28   ` Peter Williams
@ 2006-05-27  1:42     ` Con Kolivas
  0 siblings, 0 replies; 95+ messages in thread
From: Con Kolivas @ 2006-05-27  1:42 UTC (permalink / raw)
  To: Peter Williams
  Cc: Mike Galbraith, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

On Saturday 27 May 2006 11:28, Peter Williams wrote:
> Con Kolivas wrote:
> > On Friday 26 May 2006 14:20, Peter Williams wrote:
> >> Although the rlimit mechanism already has a CPU usage limit (RLIMIT_CPU)
> >> it is a total usage limit and therefore (to my mind) not very useful.
> >> These patches provide an alternative whereby the (recent) average CPU
> >> usage rate of a task can be limited to a (per task) specified proportion
> >> of a single CPU's capacity.  The limits are specified in parts per
> >> thousand and come in two varieties -- hard and soft.
> >
> > Why 1000?
>
> Probably a hangover from a version where the units were a proportion of the
> whole machine.  Percentage doesn't work very well if there is more than
> one CPU in that case (especially if there are more than 100 CPUs :-)).
> But it's also useful to have the extra range if you're trying to cap
> processes (or users) from outside the scheduler using these primitives.
>
> > I doubt that degree of accuracy is possible in cpu accounting or even
> > required. To me it would seem to make more sense to just
> > be a percentage.
>
> It's not meant to imply accuracy :-).  The main issue is avoiding
> overflow when doing the multiplications during the comparisons.

Well, you could always expose a smaller, more meaningful value than what is 
stored internally. However, you've already implied that there are requirements 
in userspace for more granularity in the proportioning than a percentage can 
give.

-- 
-ck

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 2/5] sched: Add CPU rate soft caps
  2006-05-26  4:20 ` [RFC 2/5] sched: Add " Peter Williams
  2006-05-26 10:48   ` Con Kolivas
@ 2006-05-27  6:31   ` Balbir Singh
  2006-05-27  7:03     ` Peter Williams
  1 sibling, 1 reply; 95+ messages in thread
From: Balbir Singh @ 2006-05-27  6:31 UTC (permalink / raw)
  To: Peter Williams
  Cc: Mike Galbraith, Con Kolivas, Linux Kernel, Kingsley Cheung,
	Ingo Molnar, Rene Herman

On 5/26/06, Peter Williams <pwil3058@bigpond.net.au> wrote:
<snip>
>
> Notes:
>
> 1. To minimize the overhead incurred when testing to skip caps processing for
> uncapped tasks a new flag PF_HAS_CAP has been added to flags.
>
> 2. The implementation involves the addition of two priority slots to the
> run queue priority arrays and this means that MAX_PRIO no longer
> represents the scheduling priority of the idle process and can't be used to
> test whether priority values are in the valid range.  To alleviate this
> problem a new function sched_idle_prio() has been provided.

I am a little confused by this. Why link tasks that have exhausted their
cpu bandwidth (their caps) to a priority slot? Is this a hack to continue
using the prio_array? Why not move such tasks to the expired array?

<snip>
>  /*
>   * Some day this will be a full-fledged user tracking system..
>   */
> @@ -787,6 +793,10 @@ struct task_struct {
>         unsigned long sleep_avg;
>         unsigned long long timestamp, last_ran;
>         unsigned long long sched_time; /* sched_clock time spent running */
> +#ifdef CONFIG_CPU_RATE_CAPS
> +       unsigned long long avg_cpu_per_cycle, avg_cycle_length;
> +       unsigned int cpu_rate_cap;
> +#endif

How is a cycle defined? What are the units of a cycle? Could we please
document the units for the declarations above?

>         enum sleep_type sleep_type;
>
>         unsigned long policy;
> @@ -981,6 +991,11 @@ struct task_struct {
>  #endif
>  };
>
> +#ifdef CONFIG_CPU_RATE_CAPS
> +unsigned int get_cpu_rate_cap(const struct task_struct *);
> +int set_cpu_rate_cap(struct task_struct *, unsigned int);
> +#endif
> +
>  static inline pid_t process_group(struct task_struct *tsk)
>  {
>         return tsk->signal->pgrp;
> @@ -1040,6 +1055,7 @@ static inline void put_task_struct(struc
>  #define PF_SPREAD_SLAB 0x08000000      /* Spread some slab caches over cpuset */
>  #define PF_MEMPOLICY   0x10000000      /* Non-default NUMA mempolicy */
>  #define PF_MUTEX_TESTER        0x02000000      /* Thread belongs to the rt mutex tester */
> +#define PF_HAS_CAP     0x20000000      /* Has a CPU rate cap */
>
>  /*
>   * Only the _current_ task can read/write to tsk->flags, but other
> Index: MM-2.6.17-rc4-mm3/init/Kconfig
> ===================================================================
> --- MM-2.6.17-rc4-mm3.orig/init/Kconfig 2006-05-26 10:39:59.000000000 +1000
> +++ MM-2.6.17-rc4-mm3/init/Kconfig      2006-05-26 10:45:26.000000000 +1000
> @@ -286,6 +286,8 @@ config RELAY
>
>           If unsure, say N.
>
> +source "kernel/Kconfig.caps"
> +
>  source "usr/Kconfig"
>
>  config UID16
> Index: MM-2.6.17-rc4-mm3/kernel/Kconfig.caps
> ===================================================================
> --- /dev/null   1970-01-01 00:00:00.000000000 +0000
> +++ MM-2.6.17-rc4-mm3/kernel/Kconfig.caps       2006-05-26 10:45:26.000000000 +1000
> @@ -0,0 +1,13 @@
> +#
> +# CPU Rate Caps Configuration
> +#
> +
> +config CPU_RATE_CAPS
> +       bool "Support (soft) CPU rate caps"
> +       default n
> +       ---help---
> +         Say y here if you wish to be able to put a (soft) upper limit on
> +         the rate of CPU usage by individual tasks.  A task which has been
> +         allocated a soft CPU rate cap will be limited to that rate of CPU
> +         usage unless there are spare CPU resources available after the needs
> +         of uncapped tasks are met.
> Index: MM-2.6.17-rc4-mm3/kernel/sched.c
> ===================================================================
> --- MM-2.6.17-rc4-mm3.orig/kernel/sched.c       2006-05-26 10:44:51.000000000 +1000
> +++ MM-2.6.17-rc4-mm3/kernel/sched.c    2006-05-26 11:00:02.000000000 +1000
> @@ -57,6 +57,19 @@
>
>  #include <asm/unistd.h>
>
> +#ifdef CONFIG_CPU_RATE_CAPS
> +#define IDLE_PRIO      (MAX_PRIO + 2)
> +#else
> +#define IDLE_PRIO      MAX_PRIO
> +#endif
> +#define BGND_PRIO      (IDLE_PRIO - 1)
> +#define CAPPED_PRIO    (IDLE_PRIO - 2)
> +
> +int sched_idle_prio(void)
> +{
> +       return IDLE_PRIO;
> +}
> +
>  /*
>   * Convert user-nice values [ -20 ... 0 ... 19 ]
>   * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
> @@ -186,6 +199,149 @@ static inline unsigned int task_timeslic
>         return static_prio_timeslice(p->static_prio);
>  }
>
> +#ifdef CONFIG_CPU_RATE_CAPS
> +#define CAP_STATS_OFFSET 8
> +#define task_has_cap(p) unlikely((p)->flags & PF_HAS_CAP)
> +/* this assumes that p is not a real time task */
> +#define task_is_background(p) unlikely((p)->cpu_rate_cap == 0)
> +#define task_being_capped(p) unlikely((p)->prio >= CAPPED_PRIO)
> +#define cap_load_weight(p) (((p)->cpu_rate_cap * SCHED_LOAD_SCALE) / 1000)

Could we please use a const or #define'd name instead of 1000. How
about TOTAL_CAP_IN_PARTS? It would make the code easier to read and
maintain.
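
i.e. something along these lines, using the suggested name:

#define TOTAL_CAP_IN_PARTS 1000 /* full cap: parts per thousand of one CPU */

#define cap_load_weight(p) \
        (((p)->cpu_rate_cap * SCHED_LOAD_SCALE) / TOTAL_CAP_IN_PARTS)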

> +
> +static void init_cpu_rate_caps(task_t *p)
> +{
> +       p->cpu_rate_cap = 1000;
> +       p->flags &= ~PF_HAS_CAP;
> +}
> +
> +static inline void set_cap_flag(task_t *p)
> +{
> +       if (p->cpu_rate_cap < 1000 && !has_rt_policy(p))
> +               p->flags |= PF_HAS_CAP;
> +       else
> +               p->flags &= ~PF_HAS_CAP;
> +}

Why don't you re-use RLIMIT_INFINITY?

> +
> +static inline int task_exceeding_cap(const task_t *p)
> +{
> +       return (p->avg_cpu_per_cycle * 1000) > (p->avg_cycle_length * p->cpu_rate_cap);
> +}
> +
> +#ifdef CONFIG_SCHED_SMT
> +static unsigned int smt_timeslice(task_t *p)
> +{
> +       if (task_has_cap(p) && task_being_capped(p))
> +               return 0;
> +
> +       return task_timeslice(p);
> +}
> +
> +static int task_priority_gt(const task_t *thisp, const task_t *thatp)
> +{
> +       if (task_has_cap(thisp) && (task_being_capped(thisp)))
> +           return 0;
> +
> +       if (task_has_cap(thatp) && (task_being_capped(thatp)))
> +           return 1;
> +
> +       return thisp->static_prio < thatp->static_prio;
> +}

This function needs some comments, at least with respect to what thisp
and thatp are.

> +#endif
> +
> +/*
> + * Update usage stats to "now" before making comparison
> + * Assume: task is actually on a CPU
> + */
> +static int task_exceeding_cap_now(const task_t *p, unsigned long long now)
> +{
> +       unsigned long long delta, lhs, rhs;
> +
> +       delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
> +       lhs = (p->avg_cpu_per_cycle + delta) * 1000;
> +       rhs = (p->avg_cycle_length + delta) * p->cpu_rate_cap;
> +
> +       return lhs > rhs;
> +}
> +
> +static inline void init_cap_stats(task_t *p)
> +{
> +       p->avg_cpu_per_cycle = 0;
> +       p->avg_cycle_length = 0;
> +}
> +
> +static inline void inc_cap_stats_cycle(task_t *p, unsigned long long now)
> +{
> +       unsigned long long delta;
> +
> +       delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
> +       p->avg_cycle_length += delta;
> +}
> +
> +static inline void inc_cap_stats_both(task_t *p, unsigned long long now)
> +{
> +       unsigned long long delta;
> +
> +       delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
> +       p->avg_cycle_length += delta;
> +       p->avg_cpu_per_cycle += delta;
> +}
> +
> +static inline void decay_cap_stats(task_t *p)
> +{
> +       p->avg_cycle_length *= ((1 << CAP_STATS_OFFSET) - 1);
> +       p->avg_cycle_length >>= CAP_STATS_OFFSET;
> +       p->avg_cpu_per_cycle *= ((1 << CAP_STATS_OFFSET) - 1);
> +       p->avg_cpu_per_cycle >>= CAP_STATS_OFFSET;
> +}
> +#else
> +#define task_has_cap(p) 0
> +#define task_is_background(p) 0
> +#define task_being_capped(p) 0
> +#define cap_load_weight(p) SCHED_LOAD_SCALE
> +
> +static inline void init_cpu_rate_caps(task_t *p)
> +{
> +}
> +
> +static inline void set_cap_flag(task_t *p)
> +{
> +}
> +
> +static inline int task_exceeding_cap(const task_t *p)
> +{
> +       return 0;
> +}
> +
> +#ifdef CONFIG_SCHED_SMT
> +#define smt_timeslice(p) task_timeslice(p)
> +
> +static inline int task_priority_gt(const task_t *thisp, const task_t *thatp)
> +{
> +       return thisp->static_prio < thatp->static_prio;
> +}
> +#endif
> +
> +static inline int task_exceeding_cap_now(const task_t *p, unsigned long long now)
> +{
> +       return 0;
> +}
> +
> +static inline void init_cap_stats(task_t *p)
> +{
> +}
> +
> +static inline void inc_cap_stats_cycle(task_t *p, unsigned long long now)
> +{
> +}
> +
> +static inline void inc_cap_stats_both(task_t *p, unsigned long long now)
> +{
> +}
> +
> +static inline void decay_cap_stats(task_t *p)
> +{
> +}
> +#endif
> +
>  #define task_hot(p, now, sd) ((long long) ((now) - (p)->last_ran)      \
>                                 < (long long) (sd)->cache_hot_time)
>
> @@ -197,8 +353,8 @@ typedef struct runqueue runqueue_t;
>
>  struct prio_array {
>         unsigned int nr_active;
> -       DECLARE_BITMAP(bitmap, MAX_PRIO+1); /* include 1 bit for delimiter */
> -       struct list_head queue[MAX_PRIO];
> +       DECLARE_BITMAP(bitmap, IDLE_PRIO+1); /* include 1 bit for delimiter */
> +       struct list_head queue[IDLE_PRIO];
>  };
>
>  /*
> @@ -710,6 +866,10 @@ static inline int __normal_prio(task_t *
>  {
>         int bonus, prio;
>
> +       /* Ensure that background tasks stay at BGND_PRIO */
> +       if (task_is_background(p))
> +               return BGND_PRIO;
> +
>         bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
>
>         prio = p->static_prio - bonus;
> @@ -786,6 +946,8 @@ static inline int expired_starving(runqu
>
>  static void set_load_weight(task_t *p)
>  {
> +       set_cap_flag(p);
> +
>         if (has_rt_policy(p)) {
>  #ifdef CONFIG_SMP
>                 if (p == task_rq(p)->migration_thread)
> @@ -798,8 +960,22 @@ static void set_load_weight(task_t *p)
>                 else
>  #endif
>                         p->load_weight = RTPRIO_TO_LOAD_WEIGHT(p->rt_priority);
> -       } else
> +       } else {
>                 p->load_weight = PRIO_TO_LOAD_WEIGHT(p->static_prio);
> +
> +               /*
> +                * Reduce the probability of a task escaping its CPU rate cap
> +                * due to load balancing leaving it on a lightly used CPU
> +                * This will be optimized away if rate caps aren't configured
> +                */
> +               if (task_has_cap(p)) {
> +                       unsigned int clw; /* load weight based on cap */
> +
> +                       clw = cap_load_weight(p);
> +                       if (clw < p->load_weight)
> +                               p->load_weight = clw;
> +               }

You could use
p->load_weight = min(cap_load_weight(p), p->load_weight);


> +       }
>  }
>
>  static inline void inc_raw_weighted_load(runqueue_t *rq, const task_t *p)
> @@ -869,7 +1045,8 @@ static void __activate_task(task_t *p, r
>  {
>         prio_array_t *target = rq->active;
>
> -       if (unlikely(batch_task(p) || (expired_starving(rq) && !rt_task(p))))
> +       if (unlikely(batch_task(p) || (expired_starving(rq) && !rt_task(p)) ||
> +                       task_being_capped(p)))
>                 target = rq->expired;
>         enqueue_task(p, target);
>         inc_nr_running(p, rq);
> @@ -975,8 +1152,30 @@ static void activate_task(task_t *p, run
>  #endif
>
>         if (!rt_task(p))
> +               /*
> +                * We want to do the recalculation even if we're exceeding
> +                * a cap so that everything still works when we stop
> +                * exceeding our cap.
> +                */
>                 p->prio = recalc_task_prio(p, now);
>
> +       if (task_has_cap(p)) {
> +               inc_cap_stats_cycle(p, now);
> +               /* Background tasks are handled in effective_prio()
> +                * in order to ensure that they stay at BGND_PRIO
> +                * but we need to be careful that we don't override
> +                * it here
> +                */
> +               if (task_exceeding_cap(p) && !task_is_background(p)) {
> +                       p->normal_prio = CAPPED_PRIO;
> +                       /*
> +                        * Don't undo any priority ineheritance
> +                        */
> +                       if (!rt_task(p))
> +                               p->prio = CAPPED_PRIO;
> +               }
> +       }

Within all tasks at CAPPED_PRIO, is the priority of the task used for scheduling?

<snip>

Cheers,
Balbir
Linux Technology Center
IBM ISoftware Labs

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-26  4:20 ` [RFC 3/5] sched: Add CPU rate hard caps Peter Williams
  2006-05-26  6:58   ` Kari Hurtta
  2006-05-26 11:00   ` Con Kolivas
@ 2006-05-27  6:48   ` Balbir Singh
  2006-05-27  8:44     ` Peter Williams
  2 siblings, 1 reply; 95+ messages in thread
From: Balbir Singh @ 2006-05-27  6:48 UTC (permalink / raw)
  To: Peter Williams
  Cc: Mike Galbraith, Con Kolivas, Linux Kernel, Kingsley Cheung,
	Ingo Molnar, Rene Herman

On 5/26/06, Peter Williams <pwil3058@bigpond.net.au> wrote:
> This patch implements hard CPU rate caps per task as a proportion of a
> single CPU's capacity expressed in parts per thousand.
>
> Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>
>  include/linux/sched.h |    8 ++
>  kernel/Kconfig.caps   |   14 +++-
>  kernel/sched.c        |  154 ++++++++++++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 168 insertions(+), 8 deletions(-)
>
> Index: MM-2.6.17-rc4-mm3/include/linux/sched.h
> ===================================================================
> --- MM-2.6.17-rc4-mm3.orig/include/linux/sched.h        2006-05-26 10:46:35.000000000 +1000
> +++ MM-2.6.17-rc4-mm3/include/linux/sched.h     2006-05-26 11:00:07.000000000 +1000
> @@ -796,6 +796,10 @@ struct task_struct {
>  #ifdef CONFIG_CPU_RATE_CAPS
>         unsigned long long avg_cpu_per_cycle, avg_cycle_length;
>         unsigned int cpu_rate_cap;
> +#ifdef CONFIG_CPU_RATE_HARD_CAPS
> +       unsigned int cpu_rate_hard_cap;
> +       struct timer_list sinbin_timer;

Using a timer for releasing tasks from their sinbin sounds like a bit
of an overhead, given that there could be tens of thousands of tasks.
Is it possible to use the scheduler_tick() function to take a look at all
deactivated tasks (as efficiently as possible) and activate them when
it's time to activate them, or just to fold the functionality in by defining
a time quantum after which everyone is woken up? This time quantum
could be the same as the time over which limits are honoured.
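
A rough sketch of the first alternative (illustrative only; the sinbin_list
and sinbin_release fields are hypothetical additions to the runqueue and
task_struct, and reusing run_list as the list linkage is a simplification):
keep sinbinned tasks on a per-runqueue list sorted by release time and let
scheduler_tick() reactivate the ones that are due.

static void release_due_sinbinned(struct runqueue *rq, unsigned long now)
{
        while (!list_empty(&rq->sinbin_list)) {
                task_t *p = list_entry(rq->sinbin_list.next,
                                       task_t, run_list);

                /* list is sorted by release time: stop at the first not due */
                if (time_before(now, p->sinbin_release))
                        break;
                list_del_init(&p->run_list);
                p->prio = effective_prio(p);
                __activate_task(p, rq);
        }
}

If called from scheduler_tick() under the runqueue lock, this would avoid both
the per-task timer and the separate task_rq_lock() in sinbin_release_fn().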

> +#endif
>  #endif
>         enum sleep_type sleep_type;
>
> @@ -994,6 +998,10 @@ struct task_struct {
>  #ifdef CONFIG_CPU_RATE_CAPS
>  unsigned int get_cpu_rate_cap(const struct task_struct *);
>  int set_cpu_rate_cap(struct task_struct *, unsigned int);
> +#ifdef CONFIG_CPU_RATE_HARD_CAPS
> +unsigned int get_cpu_rate_hard_cap(const struct task_struct *);
> +int set_cpu_rate_hard_cap(struct task_struct *, unsigned int);
> +#endif
>  #endif
>
>  static inline pid_t process_group(struct task_struct *tsk)
> Index: MM-2.6.17-rc4-mm3/kernel/Kconfig.caps
> ===================================================================
> --- MM-2.6.17-rc4-mm3.orig/kernel/Kconfig.caps  2006-05-26 10:45:26.000000000 +1000
> +++ MM-2.6.17-rc4-mm3/kernel/Kconfig.caps       2006-05-26 11:00:07.000000000 +1000
> @@ -3,11 +3,21 @@
>  #
>
>  config CPU_RATE_CAPS
> -       bool "Support (soft) CPU rate caps"
> +       bool "Support CPU rate caps"
>         default n
>         ---help---
> -         Say y here if you wish to be able to put a (soft) upper limit on
> +         Say y here if you wish to be able to put a soft upper limit on
>           the rate of CPU usage by individual tasks.  A task which has been
>           allocated a soft CPU rate cap will be limited to that rate of CPU
>           usage unless there is spare CPU resources available after the needs
>           of uncapped tasks are met.
> +
> +config CPU_RATE_HARD_CAPS
> +       bool "Support CPU rate hard caps"
> +       depends on CPU_RATE_CAPS
> +       default n
> +       ---help---
> +         Say y here if you wish to be able to put a hard upper limit on
> +         the rate of CPU usage by individual tasks.  A task which has been
> +         allocated a hard CPU rate cap will be limited to that rate of CPU
> +         usage regardless of whether there are spare CPU resources available.
> Index: MM-2.6.17-rc4-mm3/kernel/sched.c
> ===================================================================
> --- MM-2.6.17-rc4-mm3.orig/kernel/sched.c       2006-05-26 11:00:02.000000000 +1000
> +++ MM-2.6.17-rc4-mm3/kernel/sched.c    2006-05-26 13:50:11.000000000 +1000
> @@ -201,21 +201,33 @@ static inline unsigned int task_timeslic
>
>  #ifdef CONFIG_CPU_RATE_CAPS
>  #define CAP_STATS_OFFSET 8
> +#ifdef CONFIG_CPU_RATE_HARD_CAPS
> +static void sinbin_release_fn(unsigned long arg);
> +#define min_cpu_rate_cap(p) min((p)->cpu_rate_cap, (p)->cpu_rate_hard_cap)
> +#else
> +#define min_cpu_rate_cap(p) (p)->cpu_rate_cap
> +#endif
>  #define task_has_cap(p) unlikely((p)->flags & PF_HAS_CAP)
>  /* this assumes that p is not a real time task */
>  #define task_is_background(p) unlikely((p)->cpu_rate_cap == 0)
>  #define task_being_capped(p) unlikely((p)->prio >= CAPPED_PRIO)
> -#define cap_load_weight(p) (((p)->cpu_rate_cap * SCHED_LOAD_SCALE) / 1000)
> +#define cap_load_weight(p) ((min_cpu_rate_cap(p) * SCHED_LOAD_SCALE) / 1000)
>
>  static void init_cpu_rate_caps(task_t *p)
>  {
>         p->cpu_rate_cap = 1000;
>         p->flags &= ~PF_HAS_CAP;
> +#ifdef CONFIG_CPU_RATE_HARD_CAPS
> +       p->cpu_rate_hard_cap = 1000;
> +       init_timer(&p->sinbin_timer);
> +       p->sinbin_timer.function = sinbin_release_fn;
> +       p->sinbin_timer.data = (unsigned long) p;
> +#endif
>  }
>
>  static inline void set_cap_flag(task_t *p)
>  {
> -       if (p->cpu_rate_cap < 1000 && !has_rt_policy(p))
> +       if (min_cpu_rate_cap(p) < 1000 && !has_rt_policy(p))
>                 p->flags |= PF_HAS_CAP;
>         else
>                 p->flags &= ~PF_HAS_CAP;
> @@ -223,7 +235,7 @@ static inline void set_cap_flag(task_t *
>
>  static inline int task_exceeding_cap(const task_t *p)
>  {
> -       return (p->avg_cpu_per_cycle * 1000) > (p->avg_cycle_length * p->cpu_rate_cap);
> +       return (p->avg_cpu_per_cycle * 1000) > (p->avg_cycle_length * min_cpu_rate_cap(p));
>  }
>
>  #ifdef CONFIG_SCHED_SMT
> @@ -257,7 +269,7 @@ static int task_exceeding_cap_now(const
>
>         delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
>         lhs = (p->avg_cpu_per_cycle + delta) * 1000;
> -       rhs = (p->avg_cycle_length + delta) * p->cpu_rate_cap;
> +       rhs = (p->avg_cycle_length + delta) * min_cpu_rate_cap(p);
>
>         return lhs > rhs;
>  }
> @@ -266,6 +278,10 @@ static inline void init_cap_stats(task_t
>  {
>         p->avg_cpu_per_cycle = 0;
>         p->avg_cycle_length = 0;
> +#ifdef CONFIG_CPU_RATE_HARD_CAPS
> +       init_timer(&p->sinbin_timer);
> +       p->sinbin_timer.data = (unsigned long) p;
> +#endif
>  }
>
>  static inline void inc_cap_stats_cycle(task_t *p, unsigned long long now)
> @@ -1213,6 +1229,64 @@ static void deactivate_task(struct task_
>         p->array = NULL;
>  }
>
> +#ifdef CONFIG_CPU_RATE_HARD_CAPS
> +#define task_has_hard_cap(p) unlikely((p)->cpu_rate_hard_cap < 1000)
> +
> +/*
> + * Release a task from the sinbin
> + */
> +static void sinbin_release_fn(unsigned long arg)
> +{
> +       unsigned long flags;
> +       struct task_struct *p = (struct task_struct*)arg;
> +       struct runqueue *rq = task_rq_lock(p, &flags);
> +
> +       p->prio = effective_prio(p);
> +
> +       __activate_task(p, rq);
> +
> +       task_rq_unlock(rq, &flags);
> +}
> +
> +static unsigned long reqd_sinbin_ticks(const task_t *p)
> +{
> +       unsigned long long res;
> +
> +       res = p->avg_cpu_per_cycle * 1000;
> +
> +       if (res > p->avg_cycle_length * p->cpu_rate_hard_cap) {
> +               (void)do_div(res, p->cpu_rate_hard_cap);
> +               res -= p->avg_cpu_per_cycle;
> +               /*
> +                * IF it was available we'd also subtract
> +                * the average sleep per cycle here
> +                */
> +               res >>= CAP_STATS_OFFSET;
> +               (void)do_div(res, (1000000000 / HZ));

Please use NSEC_PER_SEC if that is what 10^9 stands for in the above
calculation.

> +
> +               return res ? : 1;
> +       }
> +
> +       return 0;
> +}
> +
> +static void sinbin_task(task_t *p, unsigned long durn)
> +{
> +       if (durn == 0)
> +               return;
> +       deactivate_task(p, task_rq(p));
> +       p->sinbin_timer.expires = jiffies + durn;
> +       add_timer(&p->sinbin_timer);
> +}
> +#else
> +#define task_has_hard_cap(p) 0
> +#define reqd_sinbin_ticks(p) 0
> +
> +static inline void sinbin_task(task_t *p, unsigned long durn)
> +{
> +}
> +#endif
> +
>  /*
>   * resched_task - mark a task 'to be rescheduled now'.
>   *
> @@ -3508,9 +3582,16 @@ need_resched_nonpreemptible:
>                 }
>         }
>
> -       /* do this now so that stats are correct for SMT code */
> -       if (task_has_cap(prev))
> +       if (task_has_cap(prev)) {
>                 inc_cap_stats_both(prev, now);
> +               if (task_has_hard_cap(prev) && !prev->state &&
> +                   !rt_task(prev) && !signal_pending(prev)) {
> +                       unsigned long sinbin_ticks = reqd_sinbin_ticks(prev);
> +
> +                       if (sinbin_ticks)
> +                               sinbin_task(prev, sinbin_ticks);
> +               }
> +       }
>
>         cpu = smp_processor_id();
>         if (unlikely(!rq->nr_running)) {
> @@ -4539,6 +4620,67 @@ out:
>  }
>
<snip>

Balbir
Linux Technology Center
IBM Software Labs

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 2/5] sched: Add CPU rate soft caps
  2006-05-27  6:31   ` Balbir Singh
@ 2006-05-27  7:03     ` Peter Williams
  2006-05-28  0:11       ` Peter Williams
  0 siblings, 1 reply; 95+ messages in thread
From: Peter Williams @ 2006-05-27  7:03 UTC (permalink / raw)
  To: balbir
  Cc: Mike Galbraith, Con Kolivas, Linux Kernel, Kingsley Cheung,
	Ingo Molnar, Rene Herman

Balbir Singh wrote:
> On 5/26/06, Peter Williams <pwil3058@bigpond.net.au> wrote:
> <snip>
>>
>> Notes:
>>
>> 1. To minimize the overhead incurred when testing to skip caps 
>> processing for
>> uncapped tasks a new flag PF_HAS_CAP has been added to flags.
>>
>> 2. The implementation involves the addition of two priority slots to the
>> run queue priority arrays and this means that MAX_PRIO no longer
>> represents the scheduling priority of the idle process and can't be 
>> used to
>> test whether priority values are in the valid range.  To alleviate this
>> problem a new function sched_idle_prio() has been provided.
> 
> I am a little confused by this. Why link tasks that have exhausted their
> cpu bandwidth (their caps) to a priority slot? Is this a hack to continue
> using the prio_array? Why not move such tasks to the expired array?

Because it won't work: after the array switch the capped tasks may get to 
run before tasks that aren't exceeding their cap (or don't have a cap).

> 
> <snip>
>>  /*
>>   * Some day this will be a full-fledged user tracking system..
>>   */
>> @@ -787,6 +793,10 @@ struct task_struct {
>>         unsigned long sleep_avg;
>>         unsigned long long timestamp, last_ran;
>>         unsigned long long sched_time; /* sched_clock time spent 
>> running */
>> +#ifdef CONFIG_CPU_RATE_CAPS
>> +       unsigned long long avg_cpu_per_cycle, avg_cycle_length;
>> +       unsigned int cpu_rate_cap;
>> +#endif
> 
> How is a cycle defined?

 From one "on CPU" to the next.

> What are the units of a cycle?

Well since sched_clock() is used they are obviously nanoseconds 
multiplied by 2 to the power of CAP_STATS_OFFSET.  But it's fairly 
irrelevant as we're only interested in their ratio and that's dimensionless.

> Could we please
> document the units for the declarations above?

No.

> 
>>         enum sleep_type sleep_type;
>>
>>         unsigned long policy;
>> @@ -981,6 +991,11 @@ struct task_struct {
>>  #endif
>>  };
>>
>> +#ifdef CONFIG_CPU_RATE_CAPS
>> +unsigned int get_cpu_rate_cap(const struct task_struct *);
>> +int set_cpu_rate_cap(struct task_struct *, unsigned int);
>> +#endif
>> +
>>  static inline pid_t process_group(struct task_struct *tsk)
>>  {
>>         return tsk->signal->pgrp;
>> @@ -1040,6 +1055,7 @@ static inline void put_task_struct(struc
>>  #define PF_SPREAD_SLAB 0x08000000      /* Spread some slab caches 
>> over cpuset */
>>  #define PF_MEMPOLICY   0x10000000      /* Non-default NUMA mempolicy */
>>  #define PF_MUTEX_TESTER        0x02000000      /* Thread belongs to 
>> the rt mutex tester */
>> +#define PF_HAS_CAP     0x20000000      /* Has a CPU rate cap */
>>
>>  /*
>>   * Only the _current_ task can read/write to tsk->flags, but other
>> Index: MM-2.6.17-rc4-mm3/init/Kconfig
>> ===================================================================
>> --- MM-2.6.17-rc4-mm3.orig/init/Kconfig 2006-05-26 10:39:59.000000000 
>> +1000
>> +++ MM-2.6.17-rc4-mm3/init/Kconfig      2006-05-26 10:45:26.000000000 
>> +1000
>> @@ -286,6 +286,8 @@ config RELAY
>>
>>           If unsure, say N.
>>
>> +source "kernel/Kconfig.caps"
>> +
>>  source "usr/Kconfig"
>>
>>  config UID16
>> Index: MM-2.6.17-rc4-mm3/kernel/Kconfig.caps
>> ===================================================================
>> --- /dev/null   1970-01-01 00:00:00.000000000 +0000
>> +++ MM-2.6.17-rc4-mm3/kernel/Kconfig.caps       2006-05-26 
>> 10:45:26.000000000 +1000
>> @@ -0,0 +1,13 @@
>> +#
>> +# CPU Rate Caps Configuration
>> +#
>> +
>> +config CPU_RATE_CAPS
>> +       bool "Support (soft) CPU rate caps"
>> +       default n
>> +       ---help---
>> +         Say y here if you wish to be able to put a (soft) upper 
>> limit on
>> +         the rate of CPU usage by individual tasks.  A task which has 
>> been
>> +         allocated a soft CPU rate cap will be limited to that rate 
>> of CPU
>> +         usage unless there is spare CPU resources available after 
>> the needs
>> +         of uncapped tasks are met.
>> Index: MM-2.6.17-rc4-mm3/kernel/sched.c
>> ===================================================================
>> --- MM-2.6.17-rc4-mm3.orig/kernel/sched.c       2006-05-26 
>> 10:44:51.000000000 +1000
>> +++ MM-2.6.17-rc4-mm3/kernel/sched.c    2006-05-26 11:00:02.000000000 
>> +1000
>> @@ -57,6 +57,19 @@
>>
>>  #include <asm/unistd.h>
>>
>> +#ifdef CONFIG_CPU_RATE_CAPS
>> +#define IDLE_PRIO      (MAX_PRIO + 2)
>> +#else
>> +#define IDLE_PRIO      MAX_PRIO
>> +#endif
>> +#define BGND_PRIO      (IDLE_PRIO - 1)
>> +#define CAPPED_PRIO    (IDLE_PRIO - 2)
>> +
>> +int sched_idle_prio(void)
>> +{
>> +       return IDLE_PRIO;
>> +}
>> +
>>  /*
>>   * Convert user-nice values [ -20 ... 0 ... 19 ]
>>   * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
>> @@ -186,6 +199,149 @@ static inline unsigned int task_timeslic
>>         return static_prio_timeslice(p->static_prio);
>>  }
>>
>> +#ifdef CONFIG_CPU_RATE_CAPS
>> +#define CAP_STATS_OFFSET 8
>> +#define task_has_cap(p) unlikely((p)->flags & PF_HAS_CAP)
>> +/* this assumes that p is not a real time task */
>> +#define task_is_background(p) unlikely((p)->cpu_rate_cap == 0)
>> +#define task_being_capped(p) unlikely((p)->prio >= CAPPED_PRIO)
>> +#define cap_load_weight(p) (((p)->cpu_rate_cap * SCHED_LOAD_SCALE) / 
>> 1000)
> 
> Could we please use a const or #define'd name instead of 1000. How
> about TOTAL_CAP_IN_PARTS? It would make the code easier to read and
> maintain.
> 
>> +
>> +static void init_cpu_rate_caps(task_t *p)
>> +{
>> +       p->cpu_rate_cap = 1000;
>> +       p->flags &= ~PF_HAS_CAP;
>> +}
>> +
>> +static inline void set_cap_flag(task_t *p)
>> +{
>> +       if (p->cpu_rate_cap < 1000 && !has_rt_policy(p))
>> +               p->flags |= PF_HAS_CAP;
>> +       else
>> +               p->flags &= ~PF_HAS_CAP;
>> +}
> 
> Why don't you re-use RLIMIT_INFINITY?

I presume that it means something else.

> 
>> +
>> +static inline int task_exceeding_cap(const task_t *p)
>> +{
>> +       return (p->avg_cpu_per_cycle * 1000) > (p->avg_cycle_length * 
>> p->cpu_rate_cap);
>> +}
>> +
>> +#ifdef CONFIG_SCHED_SMT
>> +static unsigned int smt_timeslice(task_t *p)
>> +{
>> +       if (task_has_cap(p) && task_being_capped(p))
>> +               return 0;
>> +
>> +       return task_timeslice(p);
>> +}
>> +
>> +static int task_priority_gt(const task_t *thisp, const task_t *thatp)
>> +{
>> +       if (task_has_cap(thisp) && (task_being_capped(thisp)))
>> +           return 0;
>> +
>> +       if (task_has_cap(thatp) && (task_being_capped(thatp)))
>> +           return 1;
>> +
>> +       return thisp->static_prio < thatp->static_prio;
>> +}
> 
> This function needs some comments. At least with respect to what is
> thisp and thatp

gt means "greater than" (as any Fortran programmer knows :-)) and is 
sufficient documentation.

> 
>> +#endif
>> +
>> +/*
>> + * Update usage stats to "now" before making comparison
>> + * Assume: task is actually on a CPU
>> + */
>> +static int task_exceeding_cap_now(const task_t *p, unsigned long long 
>> now)
>> +{
>> +       unsigned long long delta, lhs, rhs;
>> +
>> +       delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
>> +       lhs = (p->avg_cpu_per_cycle + delta) * 1000;
>> +       rhs = (p->avg_cycle_length + delta) * p->cpu_rate_cap;
>> +
>> +       return lhs > rhs;
>> +}
>> +
>> +static inline void init_cap_stats(task_t *p)
>> +{
>> +       p->avg_cpu_per_cycle = 0;
>> +       p->avg_cycle_length = 0;
>> +}
>> +
>> +static inline void inc_cap_stats_cycle(task_t *p, unsigned long long 
>> now)
>> +{
>> +       unsigned long long delta;
>> +
>> +       delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
>> +       p->avg_cycle_length += delta;
>> +}
>> +
>> +static inline void inc_cap_stats_both(task_t *p, unsigned long long now)
>> +{
>> +       unsigned long long delta;
>> +
>> +       delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
>> +       p->avg_cycle_length += delta;
>> +       p->avg_cpu_per_cycle += delta;
>> +}
>> +
>> +static inline void decay_cap_stats(task_t *p)
>> +{
>> +       p->avg_cycle_length *= ((1 << CAP_STATS_OFFSET) - 1);
>> +       p->avg_cycle_length >>= CAP_STATS_OFFSET;
>> +       p->avg_cpu_per_cycle *= ((1 << CAP_STATS_OFFSET) - 1);
>> +       p->avg_cpu_per_cycle >>= CAP_STATS_OFFSET;
>> +}
>> +#else
>> +#define task_has_cap(p) 0
>> +#define task_is_background(p) 0
>> +#define task_being_capped(p) 0
>> +#define cap_load_weight(p) SCHED_LOAD_SCALE
>> +
>> +static inline void init_cpu_rate_caps(task_t *p)
>> +{
>> +}
>> +
>> +static inline void set_cap_flag(task_t *p)
>> +{
>> +}
>> +
>> +static inline int task_exceeding_cap(const task_t *p)
>> +{
>> +       return 0;
>> +}
>> +
>> +#ifdef CONFIG_SCHED_SMT
>> +#define smt_timeslice(p) task_timeslice(p)
>> +
>> +static inline int task_priority_gt(const task_t *thisp, const task_t 
>> *thatp)
>> +{
>> +       return thisp->static_prio < thatp->static_prio;
>> +}
>> +#endif
>> +
>> +static inline int task_exceeding_cap_now(const task_t *p, unsigned 
>> long long now)
>> +{
>> +       return 0;
>> +}
>> +
>> +static inline void init_cap_stats(task_t *p)
>> +{
>> +}
>> +
>> +static inline void inc_cap_stats_cycle(task_t *p, unsigned long long 
>> now)
>> +{
>> +}
>> +
>> +static inline void inc_cap_stats_both(task_t *p, unsigned long long now)
>> +{
>> +}
>> +
>> +static inline void decay_cap_stats(task_t *p)
>> +{
>> +}
>> +#endif
>> +
>>  #define task_hot(p, now, sd) ((long long) ((now) - (p)->last_ran)      \
>>                                 < (long long) (sd)->cache_hot_time)
>>
>> @@ -197,8 +353,8 @@ typedef struct runqueue runqueue_t;
>>
>>  struct prio_array {
>>         unsigned int nr_active;
>> -       DECLARE_BITMAP(bitmap, MAX_PRIO+1); /* include 1 bit for 
>> delimiter */
>> -       struct list_head queue[MAX_PRIO];
>> +       DECLARE_BITMAP(bitmap, IDLE_PRIO+1); /* include 1 bit for 
>> delimiter */
>> +       struct list_head queue[IDLE_PRIO];
>>  };
>>
>>  /*
>> @@ -710,6 +866,10 @@ static inline int __normal_prio(task_t *
>>  {
>>         int bonus, prio;
>>
>> +       /* Ensure that background tasks stay at BGND_PRIO */
>> +       if (task_is_background(p))
>> +               return BGND_PRIO;
>> +
>>         bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
>>
>>         prio = p->static_prio - bonus;
>> @@ -786,6 +946,8 @@ static inline int expired_starving(runqu
>>
>>  static void set_load_weight(task_t *p)
>>  {
>> +       set_cap_flag(p);
>> +
>>         if (has_rt_policy(p)) {
>>  #ifdef CONFIG_SMP
>>                 if (p == task_rq(p)->migration_thread)
>> @@ -798,8 +960,22 @@ static void set_load_weight(task_t *p)
>>                 else
>>  #endif
>>                         p->load_weight = 
>> RTPRIO_TO_LOAD_WEIGHT(p->rt_priority);
>> -       } else
>> +       } else {
>>                 p->load_weight = PRIO_TO_LOAD_WEIGHT(p->static_prio);
>> +
>> +               /*
>> +                * Reduce the probability of a task escaping its CPU 
>> rate cap
>> +                * due to load balancing leaving it on a lightly used CPU
>> +                * This will be optimized away if rate caps aren't 
>> configured
>> +                */
>> +               if (task_has_cap(p)) {
>> +                       unsigned int clw; /* load weight based on cap */
>> +
>> +                       clw = cap_load_weight(p);
>> +                       if (clw < p->load_weight)
>> +                               p->load_weight = clw;
>> +               }
> 
> You could  use
> p->load_weight = min(cap_load_weight(p), p->load_weight);

Yes.

> 
> 
>> +       }
>>  }
>>
>>  static inline void inc_raw_weighted_load(runqueue_t *rq, const task_t 
>> *p)
>> @@ -869,7 +1045,8 @@ static void __activate_task(task_t *p, r
>>  {
>>         prio_array_t *target = rq->active;
>>
>> -       if (unlikely(batch_task(p) || (expired_starving(rq) && 
>> !rt_task(p))))
>> +       if (unlikely(batch_task(p) || (expired_starving(rq) && 
>> !rt_task(p)) ||
>> +                       task_being_capped(p)))
>>                 target = rq->expired;
>>         enqueue_task(p, target);
>>         inc_nr_running(p, rq);
>> @@ -975,8 +1152,30 @@ static void activate_task(task_t *p, run
>>  #endif
>>
>>         if (!rt_task(p))
>> +               /*
>> +                * We want to do the recalculation even if we're 
>> exceeding
>> +                * a cap so that everything still works when we stop
>> +                * exceeding our cap.
>> +                */
>>                 p->prio = recalc_task_prio(p, now);
>>
>> +       if (task_has_cap(p)) {
>> +               inc_cap_stats_cycle(p, now);
>> +               /* Background tasks are handled in effective_prio()
>> +                * in order to ensure that they stay at BGND_PRIO
>> +                * but we need to be careful that we don't override
>> +                * it here
>> +                */
>> +               if (task_exceeding_cap(p) && !task_is_background(p)) {
>> +                       p->normal_prio = CAPPED_PRIO;
>> +                       /*
>> +                        * Don't undo any priority ineheritance
>> +                        * Don't undo any priority inheritance
>> +                       if (!rt_task(p))
>> +                               p->prio = CAPPED_PRIO;
>> +               }
>> +       }
> 
> Within all tasks at CAPPED_PRIO, is priority of the task used for 
> scheduling?

Unless a task is exceeding its cap it gets scheduled as per usual, and
even if it is exceeding its cap its time slice allocation is still done
as per usual, so "nice" and interactivity will still have some effect
for competing "over cap" tasks.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-27  6:48   ` Balbir Singh
@ 2006-05-27  8:44     ` Peter Williams
  2006-05-31 13:10       ` Kirill Korotaev
  0 siblings, 1 reply; 95+ messages in thread
From: Peter Williams @ 2006-05-27  8:44 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Mike Galbraith, Con Kolivas, Linux Kernel, Kingsley Cheung,
	Ingo Molnar, Rene Herman

Balbir Singh wrote:
> 
> Using a timer for releasing tasks from their sinbin sounds like a  bit
> of an overhead. Given that there could be 10s of thousands of tasks.

The more runnable tasks there are the less likely it is that any of them 
is exceeding its hard cap due to normal competition for the CPUs.  So I 
think that it's unlikely that there will ever be a very large number of 
tasks in the sinbin at the same time.

> Is it possible to use the scheduler_tick() function to take a look at all
> deactivated tasks (as efficiently as possible) and activate them when
> it's time to activate them, or just fold the functionality in by defining a
> time quantum after which everyone is woken up? This time quantum
> could be the same as the time over which limits are honoured.
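
(For concreteness, the tick-driven alternative being suggested above
might look roughly like the sketch below.  This is purely illustrative,
not patch code: the list structure, the release_tick field and the
reactivate callback are all hypothetical.)

/* One entry per task currently held in the sinbin. */
struct sinbinned_task {
        struct sinbinned_task *next;
        unsigned long release_tick;     /* tick at which to reactivate */
};

/* Called once per scheduler tick instead of arming one timer per
 * sinbinned task: walk the list and release anything whose time is up. */
static void sinbin_scan(struct sinbinned_task **head, unsigned long now,
                        void (*reactivate)(struct sinbinned_task *))
{
        struct sinbinned_task **pp = head;

        while (*pp) {
                struct sinbinned_task *t = *pp;

                if (now >= t->release_tick) {
                        *pp = t->next;  /* unlink from the sinbin */
                        reactivate(t);  /* put it back on a runqueue */
                } else {
                        pp = &t->next;
                }
        }
}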

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-27  0:16         ` Peter Williams
@ 2006-05-27  9:28           ` Mike Galbraith
  2006-05-28  2:09             ` Peter Williams
  0 siblings, 1 reply; 95+ messages in thread
From: Mike Galbraith @ 2006-05-27  9:28 UTC (permalink / raw)
  To: Peter Williams
  Cc: Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

On Sat, 2006-05-27 at 10:16 +1000, Peter Williams wrote:
> Mike Galbraith wrote:
> > On Fri, 2006-05-26 at 23:59 +1000, Peter Williams wrote:
> >> Con Kolivas wrote:
> >>> On Friday 26 May 2006 14:20, Peter Williams wrote:
> >>>> This patch implements hard CPU rate caps per task as a proportion of a
> >>>> single CPU's capacity expressed in parts per thousand.
> >>> A hard cap of 1/1000 could lead to interesting starvation scenarios where a 
> >>> mutex or semaphore was held by a task that hardly ever got cpu. Same goes to 
> >>> a lesser extent to a 0 soft cap. 
> >>>
> >>> Here is how I handle idleprio tasks in current -ck:
> >>>
> >>> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/track_mutexes-1.patch
> >>> tags tasks that are holding a mutex
> >>>
> >>> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/sched-idleprio-1.7.patch
> >>> is the idleprio policy for staircase.
> >>>
> >>> What it does is runs idleprio tasks as normal tasks when they hold a mutex or 
> >>> are waking up after calling down() (ie holding a semaphore).
> >> I wasn't aware that you could detect those conditions.  They could be 
> >> very useful.
> > 
> > Isn't this exactly what the PI code is there to handle?  Is something
> > more than PI needed?
> > 
> 
> AFAIK (but I may be wrong) PI is only used by RT tasks and would need to 
> be extended.  It could be argued that extending PI so that it can be 
> used by non RT tasks is a worthwhile endeavour in its own right.

Hm.  Looking around a bit, it appears to me that we're one itty bitty
redefine away from PI being global.  No idea if/when that will happen
though.

	-Mike


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 2/5] sched: Add CPU rate soft caps
  2006-05-27  7:03     ` Peter Williams
@ 2006-05-28  0:11       ` Peter Williams
  2006-05-28  7:38         ` Balbir Singh
  0 siblings, 1 reply; 95+ messages in thread
From: Peter Williams @ 2006-05-28  0:11 UTC (permalink / raw)
  To: balbir
  Cc: Mike Galbraith, Con Kolivas, Linux Kernel, Kingsley Cheung,
	Ingo Molnar, Rene Herman

Peter Williams wrote:
> Balbir Singh wrote:
>> On 5/26/06, Peter Williams <pwil3058@bigpond.net.au> wrote:
>> <snip>
>>>
>>> Notes:
>>>
>>> 1. To minimize the overhead incurred when testing to skip caps 
>>> processing for
>>> uncapped tasks a new flag PF_HAS_CAP has been added to flags.
>>>
>>> 2. The implementation involves the addition of two priority slots to the
>>> run queue priority arrays and this means that MAX_PRIO no longer
>>> represents the scheduling priority of the idle process and can't be 
>>> used to
>>> test whether priority values are in the valid range.  To alleviate this
>>> problem a new function sched_idle_prio() has been provided.
>>
>> I am a little confused by this. Why link the bandwidth-expired tasks of
>> a cpu (its caps) to a priority slot? Is this a hack to continue using
>> the prio_array? Why not move such tasks to the expired array?
> 
> Because it won't work: after the array switch they may get to run
> before tasks that aren't exceeding their cap (or don't have a cap).

Another important reason for using these slots is that it allows waking 
tasks to preempt tasks that have exceeded their cap.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-27  9:28           ` Mike Galbraith
@ 2006-05-28  2:09             ` Peter Williams
  0 siblings, 0 replies; 95+ messages in thread
From: Peter Williams @ 2006-05-28  2:09 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

Mike Galbraith wrote:
> On Sat, 2006-05-27 at 10:16 +1000, Peter Williams wrote:
>> Mike Galbraith wrote:
>>> On Fri, 2006-05-26 at 23:59 +1000, Peter Williams wrote:
>>>> Con Kolivas wrote:
>>>>> On Friday 26 May 2006 14:20, Peter Williams wrote:
>>>>>> This patch implements hard CPU rate caps per task as a proportion of a
>>>>>> single CPU's capacity expressed in parts per thousand.
>>>>> A hard cap of 1/1000 could lead to interesting starvation scenarios where a 
>>>>> mutex or semaphore was held by a task that hardly ever got cpu. Same goes to 
>>>>> a lesser extent to a 0 soft cap. 
>>>>>
>>>>> Here is how I handle idleprio tasks in current -ck:
>>>>>
>>>>> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/track_mutexes-1.patch
>>>>> tags tasks that are holding a mutex
>>>>>
>>>>> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/sched-idleprio-1.7.patch
>>>>> is the idleprio policy for staircase.
>>>>>
>>>>> What it does is runs idleprio tasks as normal tasks when they hold a mutex or 
>>>>> are waking up after calling down() (ie holding a semaphore).
>>>> I wasn't aware that you could detect those conditions.  They could be 
>>>> very useful.
>>> Isn't this exactly what the PI code is there to handle?  Is something
>>> more than PI needed?
>>>
>> AFAIK (but I may be wrong) PI is only used by RT tasks and would need to 
>> be extended.  It could be argued that extending PI so that it can be 
>> used by non RT tasks is a worthwhile endeavour in its own right.
> 
> Hm.  Looking around a bit, it appears to me that we're one itty bitty
> redefine away from PI being global.  No idea if/when that will happen
> though.

It needs slightly more than that.  It's currently relying on the way 
tasks with prio less than MAX_RT_PRIO are treated to prevent the 
priority of tasks who are inheriting a priority from having that 
priority reset to their normal priority at various places in sched.c. 
So something would need to be done in that regard but it shouldn't be 
too difficult.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 2/5] sched: Add CPU rate soft caps
  2006-05-28  0:11       ` Peter Williams
@ 2006-05-28  7:38         ` Balbir Singh
  2006-05-28 13:35           ` Peter Williams
  0 siblings, 1 reply; 95+ messages in thread
From: Balbir Singh @ 2006-05-28  7:38 UTC (permalink / raw)
  To: Peter Williams
  Cc: Mike Galbraith, Con Kolivas, Linux Kernel, Kingsley Cheung,
	Ingo Molnar, Rene Herman

On 5/28/06, Peter Williams <pwil3058@bigpond.net.au> wrote:
> Peter Williams wrote:
> > Balbir Singh wrote:
> >> On 5/26/06, Peter Williams <pwil3058@bigpond.net.au> wrote:
> >> <snip>
> >>>
> >>> Notes:
> >>>
> >>> 1. To minimize the overhead incurred when testing to skip caps
> >>> processing for
> >>> uncapped tasks a new flag PF_HAS_CAP has been added to flags.
> >>>
> >>> 2. The implementation involves the addition of two priority slots to the
> >>> run queue priority arrays and this means that MAX_PRIO no longer
> >>> represents the scheduling priority of the idle process and can't be
> >>> used to
> >>> test whether priority values are in the valid range.  To alleviate this
> >>> problem a new function sched_idle_prio() has been provided.
> >>
> >> I am a little confused by this. Why link the bandwidth-expired tasks of
> >> a cpu (its caps) to a priority slot? Is this a hack to continue using
> >> the prio_array? Why not move such tasks to the expired array?
> >
> > Because it won't work: after the array switch they may get to run
> > before tasks that aren't exceeding their cap (or don't have a cap).
>

That behaviour would be fair. Let the priority govern who gets to run
first (irrespective of their cap) and then use the cap to limit their
timeslice (execution time).

> Another important reason for using these slots is that it allows waking
> tasks to preempt tasks that have exceeded their cap.
>

But among all tasks that exceed their cap there is no priority-based
scheduling. As far as preemption is concerned, if they are moved to
the expired array, capped tasks will only run if an array switch takes
place. If it does, then they get their fair share of the cap again
until they exceed their cap.

Balbir

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 2/5] sched: Add CPU rate soft caps
  2006-05-28  7:38         ` Balbir Singh
@ 2006-05-28 13:35           ` Peter Williams
  2006-05-28 14:42             ` Balbir Singh
  0 siblings, 1 reply; 95+ messages in thread
From: Peter Williams @ 2006-05-28 13:35 UTC (permalink / raw)
  To: balbir
  Cc: Mike Galbraith, Con Kolivas, Linux Kernel, Kingsley Cheung,
	Ingo Molnar, Rene Herman

Balbir Singh wrote:
> On 5/28/06, Peter Williams <pwil3058@bigpond.net.au> wrote:
>> Peter Williams wrote:
>> > Balbir Singh wrote:
>> >> On 5/26/06, Peter Williams <pwil3058@bigpond.net.au> wrote:
>> >> <snip>
>> >>>
>> >>> Notes:
>> >>>
>> >>> 1. To minimize the overhead incurred when testing to skip caps
>> >>> processing for
>> >>> uncapped tasks a new flag PF_HAS_CAP has been added to flags.
>> >>>
>> >>> 2. The implementation involves the addition of two priority slots 
>> to the
>> >>> run queue priority arrays and this means that MAX_PRIO no longer
>> >>> represents the scheduling priority of the idle process and can't be
>> >>> used to
>> >>> test whether priority values are in the valid range.  To alleviate 
>> this
>> >>> problem a new function sched_idle_prio() has been provided.
>> >>
>> >> I am a little confused by this. Why link the bandwidth-expired tasks of
>> >> a cpu (its caps) to a priority slot? Is this a hack to continue using
>> >> the prio_array? Why not move such tasks to the expired array?
>> >
>> > Because it won't work: after the array switch they may get to run
>> > before tasks that aren't exceeding their cap (or don't have a cap).
>>
> 
> That behaviour would be fair.

Caps aren't about being fair.  In fact, giving a task a cap is an 
explicit instruction to the scheduler that the task should be treated 
unfairly in some circumstances (namely when it's exceeding that cap).

Similarly, the interactive bonus mechanism is not about fairness either. 
  It's about giving tasks that are thought to be interactive an unfair 
advantage so that the user experiences good responsiveness.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 2/5] sched: Add CPU rate soft caps
  2006-05-28 13:35           ` Peter Williams
@ 2006-05-28 14:42             ` Balbir Singh
  2006-05-28 23:27               ` Peter Williams
  0 siblings, 1 reply; 95+ messages in thread
From: Balbir Singh @ 2006-05-28 14:42 UTC (permalink / raw)
  To: Peter Williams
  Cc: Mike Galbraith, Con Kolivas, Linux Kernel, Kingsley Cheung,
	Ingo Molnar, Rene Herman

On 5/28/06, Peter Williams <pwil3058@bigpond.net.au> wrote:
<snip>

> >
> > That behaviour would be fair.
>
> Caps aren't about being fair.  In fact, giving a task a cap is an
> explicit instruction to the scheduler that the task should be treated
> unfairly in some circumstances (namely when it's exceeding that cap).
>
> Similarly, the interactive bonus mechanism is not about fairness either.
>   It's about giving tasks that are thought to be interactive an unfair
> advantage so that the user experiences good responsiveness.
>

I understand that; I was talking about fairness between capped tasks
and what might be considered fair or intuitive between capped tasks and
regular tasks. Of course, the last point is debatable ;)

Balbir

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 0/5] sched: Add CPU rate caps
  2006-05-26 16:11   ` Björn Steinbrink
@ 2006-05-28 22:46     ` Sam Vilain
  2006-05-28 23:30       ` Peter Williams
  0 siblings, 1 reply; 95+ messages in thread
From: Sam Vilain @ 2006-05-28 22:46 UTC (permalink / raw)
  To: Björn Steinbrink
  Cc: Mike Galbraith, Peter Williams, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Ingo Molnar, Rene Herman, Herbert Poetzl,
	Kirill Korotaev

Björn Steinbrink wrote:

>>The killer problem I see with this approach is that it doesn't address
>>the divide and conquer problem.  If a task is capped, and forks off
>>workers, each worker inherits the total cap, effectively extending same.
>>    
>>

Yes, although the current thinking is that you need to set a special
clone() flag (which may be restricted via capabilities such as
CAP_SYS_RESOURCE) to set a new CPU scheduling namespace, so the workers
will inherit the same scheduling ns and therefore be accounted against
the one resource.

Sorry if I'm replying out of context, I'll catch up on this thread
shortly.  Thanks for including me.

>>IMHO, per task resource management is too severely limited in its
>>usefulness, because jobs are what need managing, and they're seldom
>>single threaded.  In order to use per task limits to manage any given
>>job, you have to both know the number of components, and manually
>>distribute resources to each component of the job.  If a job has a
>>dynamic number of components, it becomes impossible to manage.
>>    
>>
>
>Linux-VServer uses a token bucket scheduler (TBS) to limit cpu resources
>for processes in a "context". All processes in a context share one token
>bucket, which has a set of parameters to tune scheduling behaviour.
>As the token bucket is shared by a group of processes, and inherited by
>child processes/threads, management is quite easy. And the parameters
>can be tuned to allow different scheduling behaviours, like allowing a
>process group to burst, ie. use as much cpu time as is available, after
>being idle for some time, but being limited to X % cpu time on average.
>  
>

This is correct.  Basically I read the LARTC.org documentation (which
explains Linux network schedulers etc.) and the description of the Token
Bucket Scheduler inspired me to write the same thing for CPU resources.
It was originally developed for the 2.4 Alan Cox series kernels.  The
primary design guarantee of the scheduler is a low total performance
impact (maximum CPU utilisation), with prioritisation and fairness a
secondary concern.  It was built with the idea that people wanting different sorts
of scheduling policies could at least get a set of userland controls to
implement their approach - to the limit of the effectiveness of task
priorities.
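
(For readers unfamiliar with the idea, a minimal, self-contained sketch
of a token bucket applied to CPU time follows.  It is illustrative only,
not the actual Linux-VServer implementation, and all names and fields
are made up.)

/* One bucket shared by every task in a context. */
struct cpu_bucket {
        long tokens;            /* current fill                         */
        long max_tokens;        /* bucket depth: bounds how far it can
                                 * fill up, and hence how long a burst
                                 * can last                             */
        long fill_rate;         /* tokens added per refill interval     */
};

/* Called at each refill interval, whether or not the context ran. */
static void bucket_refill(struct cpu_bucket *b)
{
        b->tokens += b->fill_rate;
        if (b->tokens > b->max_tokens)
                b->tokens = b->max_tokens;
}

/* Charge one tick of CPU used by any task in the context; the return
 * value says whether the context is still within its allowance. */
static int bucket_charge(struct cpu_bucket *b)
{
        b->tokens -= 1;
        return b->tokens > 0;
}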

I most recently described this at http://lkml.org/lkml/2006/3/29/59, a
lot of that thread is probably worth catching up on.

It would be nice if we could somehow re-use the scheduling algorithms in
use in the network space here, if it doesn't impact on performance.

For instance, the "CBQ" network scheduler is the same approach as used
in OpenVZ's CPU scheduler, and the classful Token Bucket Filter is the
approach used in VServer.  The "Sched_prio" and "Sched_hard" distinction
in vserver could probably be compared to "Ingress Policing", where
available CPU that could run a process instead sits idle - similar to
the network world where data that has arrived is dropped to try to
convince the application to throttle its network activity.

As in the network space (http://lkml.org/lkml/2006/5/19/216) in this
space we have a continual scale of possible implementations, marked by a
highly efficient solution akin to "binding" at one end, and a
virtualisation at the other.  Each delivers guarantees most applicable to
certain users or workloads.

Sam.

>I'm CC'ing Herbert and Sam on this as they can explain the whole thing a
>lot better and I'm not familiar with implementation details.
>
>Regards
>Björn
>  
>


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 2/5] sched: Add CPU rate soft caps
  2006-05-28 14:42             ` Balbir Singh
@ 2006-05-28 23:27               ` Peter Williams
  2006-05-31 13:17                 ` Kirill Korotaev
  0 siblings, 1 reply; 95+ messages in thread
From: Peter Williams @ 2006-05-28 23:27 UTC (permalink / raw)
  To: balbir
  Cc: Mike Galbraith, Con Kolivas, Linux Kernel, Kingsley Cheung,
	Ingo Molnar, Rene Herman

Balbir Singh wrote:
> On 5/28/06, Peter Williams <pwil3058@bigpond.net.au> wrote:
> <snip>
> 
>> >
>> > That behaviour would be fair.
>>
>> Caps aren't about being fair.  In fact, giving a task a cap is an
>> explicit instruction to the scheduler that the task should be treated
>> unfairly in some circumstances (namely when it's exceeding that cap).
>>
>> Similarly, the interactive bonus mechanism is not about fairness either.
>>   It's about giving tasks that are thought to be interactive an unfair
>> advantage so that the user experiences good responsiveness.
>>
> 
> I understand that, I was talking about fairness between capped tasks
> and what might be considered fair or intutive between capped tasks and
> regular tasks. Of course, the last point is debatable ;)

Well, the primary fairness mechanism in the scheduler is the time slice 
allocation and the capping code doesn't fiddle with those so there 
should be a reasonable degree of fairness (taking into account "nice") 
between capped tasks.  To improve that would require allocating several 
new priority slots for use by tasks exceeding their caps and fiddling 
with those.  I don't think that it's worth the bother.

When capped tasks aren't exceeding their cap they are treated just like 
any other task and will get the same amount of fairness.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 0/5] sched: Add CPU rate caps
  2006-05-28 22:46     ` Sam Vilain
@ 2006-05-28 23:30       ` Peter Williams
  2006-05-29  3:09         ` Sam Vilain
  0 siblings, 1 reply; 95+ messages in thread
From: Peter Williams @ 2006-05-28 23:30 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Björn Steinbrink, Mike Galbraith, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Ingo Molnar, Rene Herman, Herbert Poetzl,
	Kirill Korotaev

Sam Vilain wrote:
> Björn Steinbrink wrote:
> 
>>> The killer problem I see with this approach is that it doesn't address
>>> the divide and conquer problem.  If a task is capped, and forks off
>>> workers, each worker inherits the total cap, effectively extending same.
>>>    
>>>
> 
> Yes, although the current thinking is that you need to set a special
> clone() flag (which may be restricted via capabilities such as
> CAP_SYS_RESOURCE) to set a new CPU scheduling namespace, so the workers
> will inherit the same scheduling ns and therefore be accounted against
> the one resource.
> 
> Sorry if I'm replying out of context, I'll catch up on this thread
> shortly.  Thanks for including me.
> 
> >>> IMHO, per task resource management is too severely limited in its
>>> usefulness, because jobs are what need managing, and they're seldom
>>> single threaded.  In order to use per task limits to manage any given
>>> job, you have to both know the number of components, and manually
>>> distribute resources to each component of the job.  If a job has a
>>> dynamic number of components, it becomes impossible to manage.
>>>    
>>>
>> Linux-VServer uses a token bucket scheduler (TBS) to limit cpu resources
>> for processes in a "context". All processes in a context share one token
>> bucket, which has a set of parameters to tune scheduling behaviour.
>> As the token bucket is shared by a group of processes, and inherited by
>> child processes/threads, management is quite easy. And the parameters
>> can be tuned to allow different scheduling behaviours, like allowing a
>> process group to burst, ie. use as much cpu time as is available, after
>> being idle for some time, but being limited to X % cpu time on average.
>>  
>>
> 
> This is correct.  Basically I read the LARTC.org documentation (which
> explains Linux network schedulers etc.) and the description of the Token
> Bucket Scheduler inspired me to write the same thing for CPU resources.
> It was originally developed for the 2.4 Alan Cox series kernels.  The
> primary design guarantee of the scheduler is a low total performance
> impact (maximum CPU utilisation), with prioritisation and fairness a
> secondary concern.  It was built with the idea that people wanting different sorts
> of scheduling policies could at least get a set of userland controls to
> implement their approach - to the limit of the effectiveness of task
> priorities.
> 
> I most recently described this at http://lkml.org/lkml/2006/3/29/59, a
> lot of that thread is probably worth catching up on.
> 
> It would be nice if we could somehow re-use the scheduling algorithms in
> use in the network space here, if it doesn't impact on performance.
> 
> For instance, the "CBQ" network scheduler is the same approach as used
> in OpenVZ's CPU scheduler, and the classful Token Bucket Filter is the
> approach used in VServer.  The "Sched_prio" and "Sched_hard" distinction
> in vserver could probably be compared to "Ingress Policing", where
> available CPU that could run a process instead sits idle - similar to
> the network world where data that has arrived is dropped to try to
> convince the application to throttle its network activity.
> 
> As in the network space (http://lkml.org/lkml/2006/5/19/216) in this
> space we have a continual scale of possible implementations, marked by a
> highly efficient solution akin to "binding" at one end, and a
> virtualisation at the other.  Each delivers guarantees most applicable to
> certain users or workloads.
> 
> Sam.
> 
>> I'm CC'ing Herbert and Sam on this as they can explain the whole thing a
>> lot better and I'm not familiar with implementation details.

Have you considered adding an implementation of these schedulers to 
PlugSched?

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 0/5] sched: Add CPU rate caps
  2006-05-28 23:30       ` Peter Williams
@ 2006-05-29  3:09         ` Sam Vilain
  2006-05-29  3:41           ` Peter Williams
  0 siblings, 1 reply; 95+ messages in thread
From: Sam Vilain @ 2006-05-29  3:09 UTC (permalink / raw)
  To: Peter Williams
  Cc: Björn Steinbrink, Mike Galbraith, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Ingo Molnar, Rene Herman, Herbert Poetzl,
	Kirill Korotaev

Peter Williams wrote:

>>This is correct.  Basically I read the LARTC.org (which explains Linux
>>network schedulers etc) and the description of the Token Bucket
>>Scheduler inspired me to write the same thing for CPU resources.  It was
>>originally developed for the 2.4 Alan Cox series kernels.  The primary
>>[...]
>>I most recently described this at http://lkml.org/lkml/2006/3/29/59, a
>>lot of that thread is probably worth catching up on.
>>[...]
>>    
>>
>Have you considered adding an implementation of these schedulers to 
>PlugSched?
>  
>

No, I haven't; I'd be happy to do so, given appropriate pointers to a
codebase I can produce commits for.  Is there a public git tree for the
patches, or a series of split out patches?  I see only combined patches
on the SourceForge site.

Sam.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 0/5] sched: Add CPU rate caps
  2006-05-29  3:09         ` Sam Vilain
@ 2006-05-29  3:41           ` Peter Williams
  2006-05-29 21:16             ` Sam Vilain
  0 siblings, 1 reply; 95+ messages in thread
From: Peter Williams @ 2006-05-29  3:41 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Björn Steinbrink, Mike Galbraith, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Ingo Molnar, Rene Herman, Herbert Poetzl,
	Kirill Korotaev

Sam Vilain wrote:
> Peter Williams wrote:
> 
>>> This is correct.  Basically I read the LARTC.org (which explains Linux
>>> network schedulers etc) and the description of the Token Bucket
>>> Scheduler inspired me to write the same thing for CPU resources.  It was
>>> originally developed for the 2.4 Alan Cox series kernels.  The primary
>>> [...]
>>> I most recently described this at http://lkml.org/lkml/2006/3/29/59, a
>>> lot of that thread is probably worth catching up on.
>>> [...]
>>>    
>>>
>> Have you considered adding an implementation of these schedulers to 
>> PlugSched?
>>  
>>
> 
> No, I haven't; I'd be happy to do so, given appropriate pointers to a
> codebase I can produce commits for.  Is there a public git tree for the
> patches, or a series of split out patches?

Yes, but not yet publicly available.  I use quilt to keep the patch 
series up to date and do the change as a relatively large series (30 or 
so) to make it easier for me to cope with changes in the kernel.  When I 
do the next release I'll make a tar ball of the patch series available.

Of course, if you're eager to start right away I could make the
2.6.17-rc4-mm1 one available?

>  I see only combined patches
> on the SourceForge site.

Yes, I'm trying to be not too greedy in my disk space use :-)

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 0/5] sched: Add CPU rate caps
  2006-05-29  3:41           ` Peter Williams
@ 2006-05-29 21:16             ` Sam Vilain
  2006-05-29 23:12               ` Peter Williams
  0 siblings, 1 reply; 95+ messages in thread
From: Sam Vilain @ 2006-05-29 21:16 UTC (permalink / raw)
  To: Peter Williams
  Cc: Björn Steinbrink, Mike Galbraith, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Ingo Molnar, Rene Herman, Herbert Poetzl,
	Kirill Korotaev, Eric W. Biederman

Peter Williams wrote:

>Yes, but not yet publicly available.  I use quilt to keep the patch 
>series up to date and do the change as a relatively large series (30 or 
>so) to make it easier for me to cope with changes in the kernel.  When I 
>do the next release I'll make a tar ball of the patch series available.
>
>Of course, if you're eager to start right away I could make the
>2.6.17-rc4-mm1 one available?
>  
>

Well a piecewise patchset does make it a lot easier to see what's going
on, especially if it's got descriptions of each patch along the way. 
I'd certainly be interested in having a look through the split out patch
to see how namespaces and this advanced scheduling system might
interoperate.

Sam.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 0/5] sched: Add CPU rate caps
  2006-05-29 21:16             ` Sam Vilain
@ 2006-05-29 23:12               ` Peter Williams
  2006-05-30  2:07                 ` Sam Vilain
  0 siblings, 1 reply; 95+ messages in thread
From: Peter Williams @ 2006-05-29 23:12 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Björn Steinbrink, Mike Galbraith, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Ingo Molnar, Rene Herman, Herbert Poetzl,
	Kirill Korotaev, Eric W. Biederman

Sam Vilain wrote:
> Peter Williams wrote:
> 
>> Yes, but not yet publicly available.  I use quilt to keep the patch 
>> series up to date and do the change as a relatively large series (30 or 
>> so) to make it easier for me to cope with changes in the kernel.  When I 
>> do the next release I'll make a tar ball of the patch series available.
>>
>> Of course, if you're eager to start right away I could make the
>> 2.6.17-rc4-mm1 one available?
>>  
>>
> 
> Well a piecewise patchset does make it a lot easier to see what's going
> on, especially if it's got descriptions of each patch along the way. 

It's a bit light on descriptions at the moment :-(  as I keep putting 
that in the "do later" bin.

> I'd certainly be interested in having a look through the split out patch
> to see how namespaces and this advanced scheduling system might
> interoperate.

OK.  I've tried very hard to make the scheduling code orthogonal to 
everything else and it essentially separates out the scheduling within a 
CPU from other issues e.g. load balancing.  This separation is 
sufficiently good for me to have merged PlugSched with an earlier 
version of CKRM's CPU management module in a way that made each of 
PlugSched's schedulers available within CKRM's infrastructure.  (CKRM 
have radically rewritten their CPU code since then and I haven't 
bothered to keep up.)

The key point that I'm trying to make is that I would expect PlugSched 
and namespaces to coexist without any problems.  How it integrates with 
the "advanced" scheduling system would depend on how that system alters 
things such as load balancing and/or whether it goes for scheduling 
outcomes at a higher level than the task.

I'm assuming that you're happy to wait for the next release?  That will 
improve the likelihood of descriptions in the patches :-).

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 0/5] sched: Add CPU rate caps
  2006-05-29 23:12               ` Peter Williams
@ 2006-05-30  2:07                 ` Sam Vilain
  2006-05-30  2:45                   ` Peter Williams
  0 siblings, 1 reply; 95+ messages in thread
From: Sam Vilain @ 2006-05-30  2:07 UTC (permalink / raw)
  To: Peter Williams
  Cc: Björn Steinbrink, Mike Galbraith, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Ingo Molnar, Rene Herman, Herbert Poetzl,
	Kirill Korotaev, Eric W. Biederman

Peter Williams wrote:

>>I'd certainly be interested in having a look through the split out patch
>>to see how namespaces and this advanced scheduling system might
>>interoperate.
>>    
>>
>
>OK.  I've tried very hard to make the scheduling code orthogonal to 
>everything else and it essentially separates out the scheduling within a 
>CPU from other issues e.g. load balancing.  This separation is 
>sufficiently good for me to have merged PlugSched with an earlier 
>version of CKRM's CPU management module in a way that made each of 
>PlugSched's schedulers available within CKRM's infrastructure.  (CKRM 
>have radically rewritten their CPU code since then and I haven't 
>bothered to keep up.)
>
>The key point that I'm trying to make is that I would expect PlugSched 
>and namespaces to coexist without any problems.  How it integrates with 
>the "advanced" scheduling system would depend on how that system alters 
>things such as load balancing and/or whether it goes for scheduling 
>outcomes at a higher level than the task.
>  
>

Coexisting is the baseline and I don't think they'll 'interfere' with
each other, per se, but we specifically want to make it easy for
userland to set up and configure scheduling policies to apply to groups
of processes.

For instance, the vserver scheduling extension I wrote changes
scheduling policy so that the interactivity bonus is reduced to -5 ..
-5, and -5 .. +15 is given as a bonus or penalty for an entire vserver
that is currently below or above its allocated CPU quotient.  In this
case the scheduling algorithm hasn't changed, just more feedback is
given into the effective priorities of the processes being scheduled. 
ie, there are now two "inputs" (task and vserver) to the existing scheduler.
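
(A rough sketch of that two-input arrangement, purely for illustration;
the helper names and ranges below are assumptions, not the actual
VServer patch.  100..139 is the usual non-RT priority range.)

/* Effective priority from two inputs: the task's own interactivity
 * bonus and a per-vserver bonus/penalty that depends on whether the
 * whole vserver is under or over its allocated CPU quotient. */
static int effective_prio_sketch(int static_prio,
                                 int task_bonus,       /* e.g. -5 .. +5  */
                                 int vserver_penalty)  /* e.g. -5 .. +15 */
{
        int prio = static_prio - task_bonus + vserver_penalty;

        if (prio < 100)
                prio = 100;             /* clamp to the non-RT range */
        if (prio > 139)
                prio = 139;
        return prio;
}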

I guess the big question is - is there a corresponding concept in
PlugSched?  for instance, is there a reference in the task_struct to the
current scheduling domain, or is it more CKRM-style with classification
modules?

If there is a reference in the task_struct to some set of scheduling
counters, maybe we could squint and say that looks like a namespace, and
throw it into the nsproxy.

>I'm assuming that you're happy to wait for the next release?  That will 
>improve the likelihood of descriptions in the patches :-).
>  
>

Let's keep it the technical dialogue going for now, then.

Now, forgive me if I'm preaching to the vicar here, but have you tried
using Stacked Git for the patch development?  I find the way that you
build patch descriptions as you go along, can easily wind the commit
stack to work on individual patches and import other people's work to be
very simple and powerful.

  http://www.procode.org/stgit/

I just mention this because you're not the first person I've talked to
who uses Quilt and has expressed some difficulty in producing incremental
patchset snapshots.  Not having used Quilt myself, I'm unsure whether
this is a deficiency or just "the way it is" once a patch set gets big.

Sam.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 0/5] sched: Add CPU rate caps
  2006-05-30  2:07                 ` Sam Vilain
@ 2006-05-30  2:45                   ` Peter Williams
  2006-05-30 22:05                     ` Sam Vilain
  0 siblings, 1 reply; 95+ messages in thread
From: Peter Williams @ 2006-05-30  2:45 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Björn Steinbrink, Mike Galbraith, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Ingo Molnar, Rene Herman, Herbert Poetzl,
	Kirill Korotaev, Eric W. Biederman

Sam Vilain wrote:
> Peter Williams wrote:
> 
>>> I'd certainly be interested in having a look through the split out patch
>>> to see how namespaces and this advanced scheduling system might
>>> interoperate.
>>>    
>>>
>> OK.  I've tried very hard to make the scheduling code orthogonal to 
>> everything else and it essentially separates out the scheduling within a 
>> CPU from other issues e.g. load balancing.  This separation is 
>> sufficiently good for me to have merged PlugSched with an earlier 
>> version of CKRM's CPU management module in a way that made each of 
>> PlugSched's schedulers available within CKRM's infrastructure.  (CKRM 
>> have radically rewritten their CPU code since then and I haven't 
>> bothered to keep up.)
>>
>> The key point that I'm trying to make is that I would expect PlugSched 
>> and namespaces to coexist without any problems.  How it integrates with 
>> the "advanced" scheduling system would depend on how that system alters 
>> things such as load balancing and/or whether it goes for scheduling 
>> outcomes at a higher level than the task.
>>  
>>
> 
> Coexisting is the base line and I don't think they'll 'interfere' with
> each other, per se, but we specifically want to make it easy for
> userland to set up and configure scheduling policies to apply to groups
> of processes.

They shouldn't interfere as which scheduler to use is a boot time 
selection and only one scheduler is in force.  It's mainly a coding 
matter and in particular whether the "scheduler driver" interface would 
need to be modified or whether your scheduler can be implemented using 
the current interface.

> 
> For instance, the vserver scheduling extension I wrote changes
> scheduling policy so that the interactivity bonus is reduced to -5 ..
> -5, and -5 .. +15 is given as a bonus or penalty for an entire vserver
> that is currently below or above its allocated CPU quotient.  In this
> case the scheduling algorithm hasn't changed, just more feedback is
> given into the effective priorities of the processes being scheduled. 
> ie, there are now two "inputs" (task and vserver) to the existing scheduler.
> 
> I guess the big question is - is there a corresponding concept in
> PlugSched?  for instance, is there a reference in the task_struct to the
> current scheduling domain, or is it more CKRM-style with classification
> modules?

It uses the standard run queue structure with per scheduler 
modifications (via a union) to handle the different ways that the 
schedulers manage priority arrays (so yes).  As I said it restricts 
itself to scheduling matters within each run queue and leaves the wider 
aspects to the normal code.

At first guess, it sounds like adding your scheduler could be as simple 
as taking a copy of ingosched.c (which is the implementation of the 
standard scheduler within PlugSched) and then making your modifications. 
  You could probably even share the same run queue components but 
there's nothing to stop you adding new ones.

Each scheduler can also have its own per task data via a union in the 
task struct.
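
Something like the following shape, roughly speaking (the field and
scheduler names here are invented for illustration, not PlugSched's
actual identifiers):

/* Each scheduler sees only its own member of the union; the members
 * overlay the same storage in every task. */
struct example_task {
        /* ... fields common to all schedulers ... */
        union {
                struct {                /* e.g. the standard scheduler */
                        unsigned long sleep_avg;
                        unsigned int time_slice;
                } ingo;
                struct {                /* e.g. a token-bucket scheduler */
                        unsigned long long tokens;
                } tbs;
        } sdu;                          /* per-scheduler data */
};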

> 
> If there is a reference in the task_struct to some set of scheduling
> counters, maybe we could squint and say that looks like a namespace, and
> throw it into the nsproxy.

Depends on the scheduler.

> 
>> I'm assuming that you're happy to wait for the next release?  That will 
>> improve the likelihood of descriptions in the patches :-).
>>  
>>
> 
> Let's keep it the technical dialogue going for now, then.

OK.  I'm waiting for the next -mm kernel before I make the next release.

> 
> Now, forgive me if I'm preaching to the vicar here, but have you tried
> using Stacked Git for the patch development?

No, I actually use the gquilt GUI wrapper for quilt 
<http://freshmeat.net/projects/gquilt/> and, although I've modified it 
to use a generic interface to the underlying patch management system 
(a.k.a. back end), I haven't yet modified it to use Stacked GIT as a 
back end.  I have thought about it and it was the primary motivation for 
adding the generic interface but I ran out of enthusiasm.

>  I find the way that you
> build patch descriptions as you go along, can easily wind the commit
> stack to work on individual patches and import other people's work to be
> very simple and powerful.
> 
>   http://www.procode.org/stgit/
> 
> I just mention this because you're not the first person I've talked to
> using Quilt to express some difficulty in producing incremental patchset
> snapshots.  Not having used Quilt myself I'm unsure whether this is a
> deficiency or just "the way it is" once a patch set gets big.

Making quilt easier to use is why I wrote gquilt :-)

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 0/5] sched: Add CPU rate caps
  2006-05-30  2:45                   ` Peter Williams
@ 2006-05-30 22:05                     ` Sam Vilain
  2006-05-30 23:22                       ` Peter Williams
                                         ` (2 more replies)
  0 siblings, 3 replies; 95+ messages in thread
From: Sam Vilain @ 2006-05-30 22:05 UTC (permalink / raw)
  To: Peter Williams
  Cc: Björn Steinbrink, Mike Galbraith, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Ingo Molnar, Rene Herman, Herbert Poetzl,
	Kirill Korotaev, Eric W. Biederman

[-- Attachment #1: Type: text/plain, Size: 2358 bytes --]

Peter Williams wrote:

>They shouldn't interfere as which scheduler to use is a boot time 
>selection and only one scheduler is in force.  It's mainly a coding 
>matter and in particular whether the "scheduler driver" interface would 
>need to be modified or whether your scheduler can be implemented using 
>the current interface.
>  
>

Yes, that's the key issue I think - the interface now has more inputs.

>>I guess the big question is - is there a corresponding concept in
>>> PlugSched?  for instance, is there a reference in the task_struct to the
>>> current scheduling domain, or is it more CKRM-style with classification
>>> modules?
>>    
>>
> It uses the standard run queue structure with per scheduler
> modifications (via a union) to handle the different ways that the
> schedulers manage priority arrays (so yes). As I said it restricts
> itself to scheduling matters within each run queue and leaves the
> wider aspects to the normal code.


Ok, so there is no existing "classification" abstraction?  The
classification is tied to the scheduler implementation?

>At first guess, it sounds like adding your scheduler could be as simple 
>as taking a copy of ingosched.c (which is the implementation of the 
>standard scheduler within PlugSched) and then making your modifications. 
>  You could probably even share the same run queue components but 
>there's nothing to stop you adding new ones.
>
>Each scheduler can also have its own per task data via a union in the 
>task struct.
>  
>

Ok, sounds like that problem is solved - just the classification one
remaining.

>OK.  I'm waiting for the next -mm kernel before I make the next release.
>  
>

Looking forward to it.

>>Now, forgive me if I'm preaching to the vicar here, but have you tried
>>using Stacked Git for the patch development?
>>    
>>
>
>No, I actually use the gquilt GUI wrapper for quilt 
><http://freshmeat.net/projects/gquilt/> and, although I've modified it 
>to use a generic interface to the underlying patch management system 
>(a.k.a. back end), I haven't yet modified it to use Stacked GIT as a 
>back end.  I have thought about it and it was the primary motivation for 
>adding the generic interface but I ran out of enthusiasm.
>  
>

Hmm, guess the vicar disclaimer was a good one to make.

Well maybe you'll find the attached file motivating, then.

Sam.

[-- Attachment #2: gquilt_stgit.py --]
[-- Type: text/x-python, Size: 10671 bytes --]

# -*- python -*-

### Copyright (C) 2006 Sam Vilain <sam.vilain@catalyst.net.nz>

### based on gquilt_quilt.py, which is:
### Copyright (C) 2005 Peter Williams <pwil3058@bigpond.net.au>

### This program is free software; you can redistribute it and/or modify
### it under the terms of the GNU General Public License as published by
### the Free Software Foundation; version 2 of the License only.

### This program is distributed in the hope that it will be useful,
### but WITHOUT ANY WARRANTY; without even the implied warranty of
### MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
### GNU General Public License for more details.

### You should have received a copy of the GNU General Public License
### along with this program; if not, write to the Free Software
### Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA

# This file provides access to "stgit" functionality required by "gquilt"

import gquilt_utils, os.path, os, gquilt_tool, gquilt_pfuns, re

class stgit_commands(gquilt_utils.cmd_list):
    def __init__(self):
        gquilt_utils.cmd_list.__init__(self)
    def _get_commands(self):
        res, sop, eop = gquilt_utils.run_cmd("stg --help")
        if sop == "":
            return None
        lines = sop.splitlines()
        index = 0
        found = 0
        cmds = []
        # scan for the "commands:" header, then collect the first word of
        # each subsequent line
        while index < len(lines):
            if found:
                lcmds = lines[index].split()
                if lcmds:
                    cmds.append(lcmds[0])
            else:
                if re.search("commands:", lines[index]):
                    found = 1
            index += 1

        cmds.sort()
        return cmds

# run a command and log the result to the provided console
def _exec_console_cmd(console, cmd, error=gquilt_tool.ERROR):
    res, so, se = gquilt_utils.run_cmd(cmd)
    if console is not None:
        console.log_cmd(cmd + "\n", so, se)
    if res != 0:
        return (error, so, se)
    else:
        return (gquilt_tool.OK, so, se)

def _patch_file_name(patch):
    # FIXME - cache this
    res, so, se = gquilt_utils.run_cmd("stg branch")
    branch = so.strip()
    return ".git/patches/" + branch + "/patches/" + patch + "/description"

# Now implement the tool interface for quilt
class interface(gquilt_tool.interface):
    def __init__(self):
        gquilt_tool.interface.__init__(self, "stgit")

    def is_patch_applied(self, patch):
        res, so, se = gquilt_utils.run_cmd("stg applied")
        # "stg applied" prints one applied patch name per line
        for line in so.splitlines():
            if line.strip() == patch:
                return 1
        return 0
    
    def top_patch(self):
        res, so, se = gquilt_utils.run_cmd("stg top")
        if res == 0 or (se.strip() == "" and so.strip() == ""):
            return so.strip()
        elif res == 512 and so.strip() == "":
            return ""
        else:
            raise "stgit_specific_error", se

    def last_patch_in_series(self):
        res, op, err = gquilt_utils.run_cmd("stg series -s")
        if res != 0:
            raise "stgit_specific_error", err
        # each series line is "<status char> <patch name>"
        lastline = op.splitlines()[-1]
        return re.match(". (.*)", lastline).group(1)

    def display_file_diff_in_viewer(self, viewer, file, patch=None):
        # XXX - not a good enough abstraction for stg.
        return

    def get_patch_description(self, patch):
        pfn = _patch_file_name(patch)
        if os.path.exists(pfn):
            res, lines = gquilt_pfuns.get_patch_descr_lines(pfn)
            if res:
                return (gquilt_tool.OK, "\n".join(lines) + "\n", "")
            else:
                return (gquilt_tool.ERROR, "", "Error reading patch description\n")
        else:
            return (gquilt_tool.OK, "", "")

    def get_patch_status(self, patch):
        # XXX - not sure what the "needs refresh" stuff was about
        if self.is_patch_applied(patch):
            return (gquilt_tool.OK, gquilt_tool.APPLIED, "")
        else:
            return (gquilt_tool.OK, gquilt_tool.NOT_APPLIED, "")

    def get_series(self):
        res, op, err = gquilt_utils.run_cmd("stg branch")
        patch_dir = op.strip()
        res, op, err = gquilt_utils.run_cmd("stg series")
        if res != 0:
            return (gquilt_tool.ERROR, (None, None), err)
        series = []
        for line in op.splitlines():
            if line[0] == "-":
                series.append((line[2:], gquilt_tool.NOT_APPLIED))
            else:
                pn = line[2:]
                #res, status, err = self.get_patch_status(pn)
                #series.append((pn, status))
                series.append((pn, gquilt_tool.APPLIED))
        return (gquilt_tool.OK, (patch_dir, series), "")

    def get_diff(self, filelist=[], patch=None):
        if patch is None:
            cmd = "stg diff -r /bottom"
        else:
            cmd = "stg diff -r " + patch + "/bottom:" + patch
        res, so, se = gquilt_utils.run_cmd(" ".join([cmd, " ".join(filelist)]))
        if res != 0:
            return (gquilt_tool.ERROR, so, se)
        return (res, so, se)

    def get_combined_diff(self, start_patch=None, end_patch=None):
        cmd = "stg diff -r "
        if start_patch is None:
            cmd += "base"
        else:
            cmd += start_patch+"/bottom"
        if end_patch is not None:
            cmd += ":" + end_patch
        res, so, se = gquilt_utils.run_cmd(cmd)
        if res != 0:
            return (gquilt_tool.ERROR, so, se)
        return (res, so, se)

    def get_patch_files(self, patch=None, withstatus=True):
        if self.top_patch() == "":
            return (gquilt_tool.OK, "", "")
        cmd = "stg files"
        if not withstatus:
            cmd += " --bare"
        if patch is not None:
            cmd += " " + patch
        res, so, se = gquilt_utils.run_cmd(cmd)
        if res != 0:
            return (gquilt_tool.ERROR, so, se)
        if withstatus:
            filelist = []
            for line in so.splitlines():
                if line[0] == "A":
                    filelist.append((line[2:], gquilt_tool.ADDED))
                elif line[0] == "D":
                    filelist.append((line[2:], gquilt_tool.DELETED))
                else:
                    filelist.append((line[2:], gquilt_tool.EXTANT))
        else:
            filelist = so.splitlines()
        return (res, filelist, se)

    def do_set_patch_description(self, console, patch, description):

        if patch != self.top_patch():
            return ( gquilt_tool.ERROR, "", "Can only edit top patch description")

        # single-quote the description for the shell (embedded single
        # quotes become the '"'"' sequence)
        cmd = "stg refresh -m '" + description.replace("'", "'\"'\"'") + "'"
        res, so, se = gquilt_utils.run_cmd(cmd)
        if res == 0:
            res = gquilt_tool.OK
            se = ""
        else:
            res = gquilt_tool.ERROR
            se = "Error setting patch description\n"

        if console is not None:
            console.log_cmd('set description for "' + patch + '"', "\n", se)

        return (res, "", se)

    def do_rename_patch(self, console, patch, newname):

        cmd = "stg rename " + patch + " " + newname
        res, so, se = gquilt_utils.run_cmd(cmd)

        if res != 0:
            return (gquilt_tool.ERROR, so, se)
        else:
            return (gquilt_tool.OK, so, se)

    def do_pop_patch(self, console, force=False):
        cmd = "stg pop"
        res, so, se = _exec_console_cmd(console, cmd)
        if res != 0:
            return (gquilt_tool.ERROR, so, se)
        else:
            return (gquilt_tool.OK, so, se)

    def do_push_patch(self, console, force=False):
        cmd = "stg push"
        res, so, se = _exec_console_cmd(console, cmd)
        # FIXME - stg push likes to run merge tools rather than
        # "requiring refresh" on a push that needs a merge.
        if res != 0:
            return (gquilt_tool.ERROR, so, se)
        else:
            return (gquilt_tool.OK, so, se)

    def do_pop_to_patch(self, console, patch=None):
        cmd = "stg pop "
        if patch is not None:
            if patch == "":
                cmd += "-a"
            else:
                cmd += "-t " + patch
        res, so, se = _exec_console_cmd(console, cmd)
        if res != 0:
            return (gquilt_tool.ERROR, so, se)
        else:
            return (gquilt_tool.OK, so, se)

    def do_push_to_patch(self, console, patch=None):
        cmd = "stg push "
        if patch is not None:
            if patch == "":
                cmd += "-a"
            else:
                cmd += "-t " + patch
        res, so, se = _exec_console_cmd(console, cmd)
        if res != 0:
            return (gquilt_tool.ERROR, so, se)
        else:
            return (gquilt_tool.OK, so, se)

    def do_refresh_patch(self, console, patch=None, force=False):
        if patch is not None:
            if self.top_patch() != patch:
                return( gquilt_tool.ERROR, "", "Only the top patch can be refreshed")

        cmd = "stg refresh"
        res, so, se = _exec_console_cmd(console, cmd)
        if res != 0:
            return (gquilt_tool.ERROR, so, se)
        else:
            return (gquilt_tool.OK, so, se)

    def do_create_new_patch(self, console, name):
        return _exec_console_cmd(console, "stg new -m 'no description' " + name)

    def do_import_patch(self, console, filename):
        return _exec_console_cmd(console, "stg import " + filename)

    # XXX - support this in the UI
    def do_import_email(self, console, filename):
        return _exec_console_cmd(console, "stg import -m " + filename)

    def do_merge_patch(self, console, filename):
        # XXX - see stg fold -t for three-way merge option
        return _exec_console_cmd(console, "stg fold " + filename)

    def do_delete_patch(self, console, patch):
        return _exec_console_cmd(console, "stg delete " + patch)

    def do_add_files_to_patch(self, console, filelist, patch=None):
        if patch is not None:
            if self.top_patch() != patch:
                return( gquilt_tool.ERROR, "", "Only the top patch can have files added to it")
        cmd = "stg add"

        return _exec_console_cmd(console, " ".join([cmd, " ".join(filelist)]))

    def do_remove_files_from_patch(self, console, filelist, patch=None):
        if patch is not None:
            if self.top_patch() != patch:
                return( gquilt_tool.ERROR, "", "Only the top patch can have files removed from it")

        cmd = "stg rm"
        return _exec_console_cmd(console, " ".join([cmd, " ".join(filelist)]))

    def do_exec_tool_cmd(self, console, cmd):
        return _exec_console_cmd(console, " ".join(["stg", cmd]))
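
# A minimal smoke test for this back end: a sketch only, assuming the
# gquilt modules above are importable and that the current directory is
# an StGIT-initialised tree.
if __name__ == "__main__":
    stg = interface()
    res, (branch, series), err = stg.get_series()
    if res != gquilt_tool.OK:
        print "error:", err
    else:
        print "branch:", branch
        for name, status in series:
            print " ", status, name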

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 0/5] sched: Add CPU rate caps
  2006-05-30 22:05                     ` Sam Vilain
@ 2006-05-30 23:22                       ` Peter Williams
  2006-05-30 23:25                       ` Peter Williams
  2006-06-05 23:56                       ` Peter Williams
  2 siblings, 0 replies; 95+ messages in thread
From: Peter Williams @ 2006-05-30 23:22 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Björn Steinbrink, Mike Galbraith, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Ingo Molnar, Rene Herman, Herbert Poetzl,
	Kirill Korotaev, Eric W. Biederman

Sam Vilain wrote:
> Peter Williams wrote:
> 
>> They shouldn't interfere as which scheduler to use is a boot time 
>> selection and only one scheduler is in force.  It's mainly a coding 
>> matter and in particular whether the "scheduler driver" interface would 
>> need to be modified or whether your scheduler can be implemented using 
>> the current interface.
>>  
>>
> 
> Yes, that's the key issue I think - the interface now has more inputs.
> 
>>> I guess the big question is - is there a corresponding concept in
>>>> PlugSched?  for instance, is there a reference in the task_struct to the
>>>> current scheduling domain, or is it more CKRM-style with classification
>>>> modules?
>>>    
>>>
>> It uses the standard run queue structure with per scheduler
>> modifications (via a union) to handle the different ways that the
>> schedulers manage priority arrays (so yes). As I said it restricts
>> itself to scheduling matters within each run queue and leaves the
>> wider aspects to the normal code.
> 
> 
> Ok, so there is no existing "classification" abstraction?  The
> classification is tied to the scheduler implementation?

Yes.

> 
>> At first guess, it sounds like adding your scheduler could be as simple 
>> as taking a copy of ingosched.c (which is the implementation of the 
>> standard scheduler within PlugSched) and then making your modifications. 
>>  You could probably even share the same run queue components but 
>> there's nothing to stop you adding new ones.
>>
>> Each scheduler can also have its own per task data via a union in the 
>> task struct.
>>  
>>
> 
> Ok, sounds like that problem is solved - just the classification one
> remaining.
> 
>> OK.  I'm waiting for the next -mm kernel before I make the next release.
>>  
>>
> 
> Looking forward to it.

Andrew released 2.6.17-rc5-mm1 yesterday so I should have a new version 
in a day or two.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 0/5] sched: Add CPU rate caps
  2006-05-30 22:05                     ` Sam Vilain
  2006-05-30 23:22                       ` Peter Williams
@ 2006-05-30 23:25                       ` Peter Williams
  2006-06-05 23:56                       ` Peter Williams
  2 siblings, 0 replies; 95+ messages in thread
From: Peter Williams @ 2006-05-30 23:25 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Björn Steinbrink, Mike Galbraith, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Ingo Molnar, Rene Herman, Herbert Poetzl,
	Kirill Korotaev, Eric W. Biederman

Sam Vilain wrote:
> Peter Williams wrote:
> 
>> No, I actually use the gquilt GUI wrapper for quilt 
>> <http://freshmeat.net/projects/gquilt/> and, although I've modified it 
>> to use a generic interface to the underlying patch management system 
>> (a.k.a. back end), I haven't yet modified it to use Stacked GIT as a 
>> back end.  I have thought about it and it was the primary motivation for 
>> adding the generic interface but I ran out of enthusiasm.
>>  
>>
> 
> Hmm, guess the vicar disclaimer was a good one to make.
> 
> Well maybe you'll find the attached file motivating, then.

Maybe.

Thanks
Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-27  8:44     ` Peter Williams
@ 2006-05-31 13:10       ` Kirill Korotaev
  2006-05-31 15:59         ` Balbir Singh
  2006-05-31 23:28         ` Peter Williams
  0 siblings, 2 replies; 95+ messages in thread
From: Kirill Korotaev @ 2006-05-31 13:10 UTC (permalink / raw)
  To: Peter Williams
  Cc: Balbir Singh, Mike Galbraith, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Ingo Molnar, Rene Herman

>> Using a timer for releasing tasks from their sinbin sounds like a  bit
>> of an overhead. Given that there could be 10s of thousands of tasks.
> 
> 
> The more runnable tasks there are the less likely it is that any of them 
> is exceeding its hard cap due to normal competition for the CPUs.  So I 
> think that it's unlikely that there will ever be a very large number of 
> tasks in the sinbin at the same time.
for containers this can be untrue... :( actually even for 1000 tasks (I 
suppose this is the maximum in your case) it can slowdown significantly 
as well.

>> Is it possible to use the scheduler_tick() function to take a look at all
>> deactivated tasks (as efficiently as possible) and activate them when
>> it's time to activate them or just fold the functionality by defining a
>> time quantum after which everyone is woken up. This time quantum
>> could be the same as the time over which limits are honoured.
agree with it.

Kirill


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 2/5] sched: Add CPU rate soft caps
  2006-05-28 23:27               ` Peter Williams
@ 2006-05-31 13:17                 ` Kirill Korotaev
  2006-05-31 23:39                   ` Peter Williams
  0 siblings, 1 reply; 95+ messages in thread
From: Kirill Korotaev @ 2006-05-31 13:17 UTC (permalink / raw)
  To: Peter Williams
  Cc: balbir, Mike Galbraith, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Ingo Molnar, Rene Herman

>> I understand that, I was talking about fairness between capped tasks
>> and what might be considered fair or intutive between capped tasks and
>> regular tasks. Of course, the last point is debatable ;)
> 
> 
> Well, the primary fairness mechanism in the scheduler is the time slice 
> allocation and the capping code doesn't fiddle with those so there 
> should be a reasonable degree of fairness (taking into account "nice") 
> between capped tasks.  To improve that would require allocating several 
> new priority slots for use by tasks exceeding their caps and fiddling 
> with those.  I don't think that it's worth the bother.
I suppose it should be handled still. a subjective feeling :)

BTW, do you have any test results for your patch?
It would be interesting to see how precise these limitations are and 
whether or not we should bother for the above...

Kirill


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-31 13:10       ` Kirill Korotaev
@ 2006-05-31 15:59         ` Balbir Singh
  2006-05-31 18:09           ` Mike Galbraith
                             ` (2 more replies)
  2006-05-31 23:28         ` Peter Williams
  1 sibling, 3 replies; 95+ messages in thread
From: Balbir Singh @ 2006-05-31 15:59 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: Peter Williams, Balbir Singh, Mike Galbraith, Con Kolivas,
	Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

Kirill Korotaev wrote:
>>> Using a timer for releasing tasks from their sinbin sounds like a  bit
>>> of an overhead. Given that there could be 10s of thousands of tasks.
>>
>>
>>
>> The more runnable tasks there are the less likely it is that any of 
>> them is exceeding its hard cap due to normal competition for the 
>> CPUs.  So I think that it's unlikely that there will ever be a very 
>> large number of tasks in the sinbin at the same time.
> 
> for containers this can be untrue... :( actually even for 1000 tasks (I 
> suppose this is the maximum in your case) it can slowdown significantly 
> as well.

Do you have any documented requirements for container resource management?
Is there a minimum list of features and nice to have features for containers
as far as resource management is concerned?


> 
>>> Is it possible to use the scheduler_tick() function to take a look at all
>>> deactivated tasks (as efficiently as possible) and activate them when
>>> it's time to activate them or just fold the functionality by defining a
>>> time quantum after which everyone is woken up. This time quantum
>>> could be the same as the time over which limits are honoured.
> 
> agree with it.

Thinking a bit more along these lines, it would probably break O(1). But I guess a good
algorithm can amortize the cost.

> 
> Kirill
> 
-- 

	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-31 15:59         ` Balbir Singh
@ 2006-05-31 18:09           ` Mike Galbraith
  2006-06-01  7:41           ` Kirill Korotaev
  2006-06-01 23:43           ` Peter Williams
  2 siblings, 0 replies; 95+ messages in thread
From: Mike Galbraith @ 2006-05-31 18:09 UTC (permalink / raw)
  To: balbir
  Cc: Kirill Korotaev, Peter Williams, Balbir Singh, Con Kolivas,
	Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

On Wed, 2006-05-31 at 21:29 +0530, Balbir Singh wrote:

> Do you have any documented requirements for container resource management?

(??  where would that come from?)

Containers, I can imagine ~working (albeit I don't see why num_tasks
dilution problem shouldn't apply to num_containers... it's the same
thing, stale info)


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-31 13:10       ` Kirill Korotaev
  2006-05-31 15:59         ` Balbir Singh
@ 2006-05-31 23:28         ` Peter Williams
  2006-06-01  7:44           ` Kirill Korotaev
  1 sibling, 1 reply; 95+ messages in thread
From: Peter Williams @ 2006-05-31 23:28 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: Balbir Singh, Mike Galbraith, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Ingo Molnar, Rene Herman

Kirill Korotaev wrote:
>>> Using a timer for releasing tasks from their sinbin sounds like a  bit
>>> of an overhead. Given that there could be 10s of thousands of tasks.
>>
>>
>> The more runnable tasks there are the less likely it is that any of 
>> them is exceeding its hard cap due to normal competition for the 
>> CPUs.  So I think that it's unlikely that there will ever be a very 
>> large number of tasks in the sinbin at the same time.
> for containers this can be untrue...

Why will this be untrue for containers?

> :( actually even for 1000 tasks (I 
> suppose this is the maximum in your case) it can slowdown significantly 
> as well.
> 
>>> Is it possible to use the scheduler_tick() function to take a look at all
>>> deactivated tasks (as efficiently as possible) and activate them when
>>> it's time to activate them or just fold the functionality by defining a
>>> time quantum after which everyone is woken up. This time quantum
>>> could be the same as the time over which limits are honoured.
> agree with it.

If there are a lot of RUNNABLE (i.e. on a run queue) tasks then normal 
competition will mean that their CPU usage rates are small and therefore 
unlikely to be greater than their cap.  The sinbin is only used for 
tasks that are EXCEEDING their cap.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 2/5] sched: Add CPU rate soft caps
  2006-05-31 13:17                 ` Kirill Korotaev
@ 2006-05-31 23:39                   ` Peter Williams
  2006-06-01  8:09                     ` Kirill Korotaev
  0 siblings, 1 reply; 95+ messages in thread
From: Peter Williams @ 2006-05-31 23:39 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: balbir, Mike Galbraith, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Ingo Molnar, Rene Herman

Kirill Korotaev wrote:
>>> I understand that, I was talking about fairness between capped tasks
>>> and what might be considered fair or intuitive between capped tasks and
>>> regular tasks. Of course, the last point is debatable ;)
>>
>>
>> Well, the primary fairness mechanism in the scheduler is the time 
>> slice allocation and the capping code doesn't fiddle with those so 
>> there should be a reasonable degree of fairness (taking into account 
>> "nice") between capped tasks.  To improve that would require 
>> allocating several new priority slots for use by tasks exceeding their 
>> caps and fiddling with those.  I don't think that it's worth the bother.

I think more needs to be said about the fairness issue.

1. If a task is getting its cap or more then it's getting its fair share 
as specified by that cap.  Yes?

2. If a task is getting less CPU usage than its cap then it will be 
being scheduled just as if it had no cap and will be getting its fair 
share just as much as any task is.

So there is no fairness problem.

> I suppose it should be handled still. a subjective feeling :)
> 
> BTW, do you have any test results for your patch?
> It would be interesting to see how precise these limitations are and 
> whether or not we should bother for the above...

I tend to test by observing the results of setting caps on running tasks 
and this doesn't generate something that can be e-mailed.

Observations indicate that hard caps are enforced to less than 1% and 
ditto for soft caps except for small soft caps where the fact (stated in 
the patches) that enforcement is not fully strict in order to prevent 
priority inversion or starvation means that the cap is generally 
exceeded.  I'm currently making modifications (based on suggestions by 
Con Kolivas) that implement an alternative method for avoiding priority 
inversion and starvation and allow the enforcement to be more strict.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-31 15:59         ` Balbir Singh
  2006-05-31 18:09           ` Mike Galbraith
@ 2006-06-01  7:41           ` Kirill Korotaev
  2006-06-01  8:34             ` Balbir Singh
  2006-06-01 23:43           ` Peter Williams
  2 siblings, 1 reply; 95+ messages in thread
From: Kirill Korotaev @ 2006-06-01  7:41 UTC (permalink / raw)
  To: balbir
  Cc: Peter Williams, Balbir Singh, Mike Galbraith, Con Kolivas,
	Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman,
	Sam Vilain, Andrew Morton, Eric W. Biederman

>>> The more runnable tasks there are the less likely it is that any of 
>>> them is exceeding its hard cap due to normal competition for the 
>>> CPUs.  So I think that it's unlikely that there will ever be a very 
>>> large number of tasks in the sinbin at the same time.
>>
>>
>> for containers this can be untrue... :( actually even for 1000 tasks 
>> (I suppose this is the maximum in your case) it can slowdown 
>> significantly as well.
> 
> 
> Do you have any documented requirements for container resource management?
> Is there a minimum list of features and nice to have features for 
> containers
> as far as resource management is concerned?
Sure! You can check OpenVZ project (http://openvz.org) for example of 
required resource management. BTW, I must agree with other people here 
who noticed that per-process resource management is really useless and 
hard to use :(

Briefly about required resource management:
1) CPU:
- fairness (i.e. prioritization of containers). For this we use SFQ like 
fair cpu scheduler with virtual cpus (runqueues). Linux-vserver uses 
token bucket algorithm. I can provide more details on this if you are 
interested.
- cpu limits (soft, hard). OpenVZ provides only hard cpu limits. For 
this we account the time in cycles. And after some credit is used do 
delay of container execution. We use cycles as our experiments show that 
statistical algorithms work poorly on some patterns :(
- cpu guarantees. I'm not sure any of solutions provide this yet.

2) disk:
- overall disk quota for container
- per-user/group quotas inside container

in OpenVZ we wrote a 2level disk quota which works on disk subtrees. 
vserver imho uses 1 partition per container approach.

- disk I/O bandwidth:
we started to use CFQv2, but it is quite poor in this regard. First, it 
doesn't prioritize writes and async disk operations :( And even for 
sync reads we found some problems we work on now...

3) memory and other resources.
- memory
- files
- signals and so on and so on.
For example, in OpenVZ we have user resource beancounters (original 
author is Alan Cox), which account the following set of parameters:
kernel memory (vmas, page tables, different structures etc.), dcache 
pinned size, different user pages (locked, physical, private, shared), 
number of files, sockets, ptys, signals, network buffers, netfilter 
rules etc.

4. network bandwidth
traffic shaping is already ok here.

>>>> Is it possible to use the scheduler_tick() function to take a look at all
>>>> deactivated tasks (as efficiently as possible) and activate them when
>>>> it's time to activate them or just fold the functionality by defining a
>>>> time quantum after which everyone is woken up. This time quantum
>>>> could be the same as the time over which limits are honoured.
>>
>>
>> agree with it.
> 
> 
> Thinking a bit more along these lines, it would probably break O(1). But 
> I guess a good
> algorithm can amortize the cost.
this is the price to pay. but it happens quite rarelly as was noticed 
already...

Kirill


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-31 23:28         ` Peter Williams
@ 2006-06-01  7:44           ` Kirill Korotaev
  2006-06-01 23:21             ` Peter Williams
  0 siblings, 1 reply; 95+ messages in thread
From: Kirill Korotaev @ 2006-06-01  7:44 UTC (permalink / raw)
  To: Peter Williams
  Cc: Balbir Singh, Mike Galbraith, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Ingo Molnar, Rene Herman

>>>> Using a timer for releasing tasks from their sinbin sounds like a  bit
>>>> of an overhead. Given that there could be 10s of thousands of tasks.
>>>
>>>
>>>
>>> The more runnable tasks there are the less likely it is that any of 
>>> them is exceeding its hard cap due to normal competition for the 
>>> CPUs.  So I think that it's unlikely that there will ever be a very 
>>> large number of tasks in the sinbin at the same time.
>>
>> for containers this can be untrue...
> 
> 
> Why will this be untrue for containers?
if one container having 100 running tasks inside exceeded its credit, 
it should be delayed. i.e. 100 tasks should be placed in sinbin if I 
understand your algo correctly. the second container having 100 tasks as 
well will do the same.

Kirill

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 2/5] sched: Add CPU rate soft caps
  2006-05-31 23:39                   ` Peter Williams
@ 2006-06-01  8:09                     ` Kirill Korotaev
  2006-06-01 23:38                       ` Peter Williams
  0 siblings, 1 reply; 95+ messages in thread
From: Kirill Korotaev @ 2006-06-01  8:09 UTC (permalink / raw)
  To: Peter Williams
  Cc: Kirill Korotaev, balbir, Mike Galbraith, Con Kolivas,
	Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

> I think more needs to be said about the fairness issue.
> 
> 1. If a task is getting its cap or more then it's getting its fair share 
> as specified by that cap.  Yes?
> 
> 2. If a task is getting less CPU usage than its cap then it will be 
> being scheduled just as if it had no cap and will be getting its fair 
> share just as much as any task is.
> 
> So there is no fairness problem.
the problem is that O(1) cpu scheduler doesn't keep the history of 
execution and consumed time which is required for fairness. So I'm 
pretty sure, that fairness will decrease when one of the tasks is being 
capped/uncapped constantly.

Can you check the behavior of 2 tasks, having different priorities with 
the range of possible cpu limits implied on one of them.

> I tend to test by observing the results of setting caps on running tasks 
> and this doesn't generate something that can be e-mailed.
plot?

> Observations indicate that hard caps are enforced to less than 1% and 
> ditto for soft caps except for small soft caps where the fact (stated in 
> the patches) that enforcement is not fully strict in order to prevent 
> priority inversion or starvation means that the cap is generally 
> exceeded.  I'm currently making modifications (based on suggestions by 
> Con Kolivas) that implement an alternative method for avoiding priority 
> inversion and starvation and allow the enforcement to be more strict.
running tasks are also not very good for such testing. it is too simple 
load. It would be nice if you could test the results with wide range of 
limits on Java Volano benchmark (loopback mode).

Thanks,
Kirill

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-01  7:41           ` Kirill Korotaev
@ 2006-06-01  8:34             ` Balbir Singh
  2006-06-01 18:43               ` [ckrm-tech] " Chandra Seetharaman
  2006-06-01 23:47               ` Sam Vilain
  0 siblings, 2 replies; 95+ messages in thread
From: Balbir Singh @ 2006-06-01  8:34 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: Peter Williams, Balbir Singh, Mike Galbraith, Con Kolivas,
	Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman,
	Sam Vilain, Andrew Morton, Eric W. Biederman, Srivatsa,
	ckrm-tech

Hi, Kirill,

Kirill Korotaev wrote:
>> Do you have any documented requirements for container resource 
>> management?
>> Is there a minimum list of features and nice to have features for 
>> containers
>> as far as resource management is concerned?
> 
> Sure! You can check OpenVZ project (http://openvz.org) for example of 
> required resource management. BTW, I must agree with other people here 
> who noticed that per-process resource management is really useless and 
> hard to use :(

I'll take a look at the references. I agree with you that it will be useful
to have resource management for a group of tasks.

> 
> Briefly about required resource management:
> 1) CPU:
> - fairness (i.e. prioritization of containers). For this we use SFQ like 
> fair cpu scheduler with virtual cpus (runqueues). Linux-vserver uses 
> token bucket algorithm. I can provide more details on this if you are 
> interested.

Yes, any information or pointers to them will be very useful.

> - cpu limits (soft, hard). OpenVZ provides only hard cpu limits. For 
> this we account the time in cycles. And after some credit is used do 
> delay of container execution. We use cycles as our experiments show that 
> statistical algorithms work poorly on some patterns :(
> - cpu guarantees. I'm not sure any of solutions provide this yet.

ckrm has a solution to provide cpu guarantees. 

I think as far as CPU resource management is concerned (limits or guarantees),
there are common problems to be solved, for example

1. Tracking when a limit or a guarantee is not met
2. Taking a decision to cap the group
3. Selecting the next task to execute (keeping O(1) in mind)

For the existing resource controller in OpenVZ I would be
interested in the information on the kinds of patterns it does not
perform well on and the patterns it performs well on.

> 
> 2) disk:
> - overall disk quota for container
> - per-user/group quotas inside container
> 
> in OpenVZ we wrote a 2level disk quota which works on disk subtrees. 
> vserver imho uses 1 partition per container approach.
> 
> - disk I/O bandwidth:
> we started to use CFQv2, but it is quite poor in this regard. First, it 
> doesn't prioritize writes and async disk operations :( And even for 
> sync reads we found some problems we work on now...
> 
> 3) memory and other resources.
> - memory
> - files
> - signals and so on and so on.
> For example, in OpenVZ we have user resource beancounters (original 
> author is Alan Cox), which account the following set of parameters:
> kernel memory (vmas, page tables, different structures etc.), dcache 
> pinned size, different user pages (locked, physical, private, shared), 
> number of files, sockets, ptys, signals, network buffers, netfilter 
> rules etc.
> 
> 4. network bandwidth
> traffic shaping is already ok here.

Traffic shaping is just for outgoing traffic, right? How about incoming
traffic (through the accept call)?

> 

These are a great set of requirements. Thanks for putting them together.


>> Thinking a bit more along these lines, it would probably break O(1). 
>> But I guess a good
>> algorithm can amortize the cost.
> 
> this is the price to pay. but it happens quite rarelly as was noticed 
> already...
> 

Yes, agreed.

> Kirill
> 


-- 

	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

PS: I am also cc'ing ckrm-tech and srivatsa

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [ckrm-tech] [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-01  8:34             ` Balbir Singh
@ 2006-06-01 18:43               ` Chandra Seetharaman
  2006-06-01 23:26                 ` Peter Williams
                                   ` (3 more replies)
  2006-06-01 23:47               ` Sam Vilain
  1 sibling, 4 replies; 95+ messages in thread
From: Chandra Seetharaman @ 2006-06-01 18:43 UTC (permalink / raw)
  To: balbir, dev
  Cc: Andrew Morton, Srivatsa, Sam Vilain, ckrm-tech, Balbir Singh,
	Mike Galbraith, Peter Williams, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Eric W. Biederman, Ingo Molnar, Rene Herman

On Thu, 2006-06-01 at 14:04 +0530, Balbir Singh wrote:
> Hi, Kirill,
> 
> Kirill Korotaev wrote:
> >> Do you have any documented requirements for container resource 
> >> management?
> >> Is there a minimum list of features and nice to have features for 
> >> containers
> >> as far as resource management is concerned?
> > 
> > Sure! You can check OpenVZ project (http://openvz.org) for example of 
> > required resource management. BTW, I must agree with other people here 
> > who noticed that per-process resource management is really useless and 
> > hard to use :(
> 

I totally agree.
> I'll take a look at the references. I agree with you that it will be useful
> to have resource management for a group of tasks.
> 
> > 
> > Briefly about required resource management:
> > 1) CPU:
> > - fairness (i.e. prioritization of containers). For this we use SFQ like 
> > fair cpu scheduler with virtual cpus (runqueues). Linux-vserver uses 
> > token bucket algorithm. I can provide more details on this if you are 
> > interested.
> 
> Yes, any information or pointers to them will be very useful.
> 
> > - cpu limits (soft, hard). OpenVZ provides only hard cpu limits. For 
> > this we account the time in cycles. And after some credit is used do 
> > delay of container execution. We use cycles as our experiments show that 
> > statistical algorithms work poorly on some patterns :(
> > - cpu guarantees. I'm not sure any of solutions provide this yet.
> 
> ckrm has a solution to provide cpu guarantees. 
> 
> I think as far as CPU resource management is concerned (limits or guarantees),
> there are common problems to be solved, for example
> 
> 1. Tracking when a limit or a guarantee is not met
> 2. Taking a decision to cap the group
> 3. Selecting the next task to execute (keeping O(1) in mind)
> 
> For the existing resource controller in OpenVZ I would be
> interested in the information on the kinds of patterns it does not
> perform well on and the patterns it performs well on.
> 
> > 
> > 2) disk:
> > - overall disk quota for container
> > - per-user/group quotas inside container
> > 
> > in OpenVZ we wrote a 2level disk quota which works on disk subtrees. 
> > vserver imho uses 1 partition per container approach.
> > 
> > - disk I/O bandwidth:
> > we started to use CFQv2, but it is quite poor in this regard. First, it 
> > doesn't prioritize writes and async disk operations :( And even for 
> > sync reads we found some problems we work on now...

CKRM (on e-series) had an implementation based on a modified CFQ
scheduler. Shailabh is currently working on porting that controller to
f-series.

> > 
> > 3) memory and other resources.
> > - memory
> > - files
> > - signals and so on and so on.
> > For example, in OpenVZ we have user resource beancounters (original 
> > author is Alan Cox), which account the following set of parameters:
> > kernel memory (vmas, page tables, different structures etc.), dcache 

I started looking at UBC. They provide only max limits, not min
guarantees, right?
 
> > pinned size, different user pages (locked, physical, private, shared), 
> > number of files, sockets, ptys, signals, network buffers, netfilter 
> > rules etc.
> > 
<snip>
> 
-- 

----------------------------------------------------------------------
    Chandra Seetharaman               | Be careful what you choose....
              - sekharan@us.ibm.com   |      .......you may get it.
----------------------------------------------------------------------



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-01  7:44           ` Kirill Korotaev
@ 2006-06-01 23:21             ` Peter Williams
  0 siblings, 0 replies; 95+ messages in thread
From: Peter Williams @ 2006-06-01 23:21 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: Balbir Singh, Mike Galbraith, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Ingo Molnar, Rene Herman

Kirill Korotaev wrote:
>>>>> Using a timer for releasing tasks from their sinbin sounds like a  bit
>>>>> of an overhead. Given that there could be 10s of thousands of tasks.
>>>>
>>>>
>>>>
>>>> The more runnable tasks there are the less likely it is that any of 
>>>> them is exceeding its hard cap due to normal competition for the 
>>>> CPUs.  So I think that it's unlikely that there will ever be a very 
>>>> large number of tasks in the sinbin at the same time.
>>>
>>> for containers this can be untrue...
>>
>>
>> Why will this be untrue for containers?
> if one container having 100 running tasks inside exceeded its credit, 
> it should be delayed. i.e. 100 tasks should be placed in sinbin if I 
> understand your algo correctly. the second container having 100 tasks as 
> well will do the same.

1. Caps are set on a per task basis not on a group basis.
2. Sinbinning is the last resort and only used for hard caps.  The soft 
capping mechanism is also applied to hard capped tasks and natural 
competition also tends to reduce usage rates.

In general, sinbinning will only kick in on lightly loaded systems where 
there is no competition for CPU resources.

Further, there is a natural ceiling of 999 per CPU on the number of tasks 
that will ever be in the sinbin at the same time.  To achieve this 
maximum some very unusual circumstances have to prevail:

1. these 999 tasks must be the only runnable tasks on the system,
2. they all must have a cap of 1/1000, and
3. the distribution of CPU among them must be perfectly fair so that 
they all have the expected average usage rate of 1/999.

If you add one more task to this mix the average usage would be 1/1000 
and if they all had that none would be exceeding their cap and there 
would be no sinbinning at all.  Of course, in reality, half would be 
slightly above the average and half slightly below and about 500 would 
be sinbinned.  But this reality check also applies to the 999 and 
somewhat less than 999 would actually be sinbinned.

As the number of runnable tasks increases beyond 1000 then the number 
that have a usage rate greater than their cap will decrease and quickly 
reach zero.

So the conclusion is that the maximum number of sinbinned tasks per CPU 
is given by:

min(1000 / min_cpu_rate_cap - 1, nr_running)

As you can see, if a minimum cpu cap of 1 causes problems we can just 
increase that minimum.
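
Putting some numbers on that bound (max_sinbinned below is just a
throw-away illustration, not something in the patches):

    # upper bound on the number of simultaneously sinbinned tasks per CPU
    # for a given minimum cap (parts per thousand) and runnable task count
    def max_sinbinned(min_cpu_rate_cap, nr_running):
        return min(1000 / min_cpu_rate_cap - 1, nr_running)

    # minimum cap   1/1000 -> at most 999 sinbinned tasks per CPU
    # minimum cap  10/1000 -> at most  99
    # minimum cap 100/1000 -> at most   9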

And once again I ask what's so special about containers that changes this?

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [ckrm-tech] [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-01 18:43               ` [ckrm-tech] " Chandra Seetharaman
@ 2006-06-01 23:26                 ` Peter Williams
  2006-06-02  2:02                   ` Chandra Seetharaman
  2006-06-02  0:36                 ` Con Kolivas
                                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 95+ messages in thread
From: Peter Williams @ 2006-06-01 23:26 UTC (permalink / raw)
  To: sekharan
  Cc: balbir, dev, Andrew Morton, Srivatsa, Sam Vilain, ckrm-tech,
	Balbir Singh, Mike Galbraith, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Eric W. Biederman, Ingo Molnar, Rene Herman

Chandra Seetharaman wrote:
> On Thu, 2006-06-01 at 14:04 +0530, Balbir Singh wrote:
>> Hi, Kirill,
>>
>> Kirill Korotaev wrote:
>>>> Do you have any documented requirements for container resource 
>>>> management?
>>>> Is there a minimum list of features and nice to have features for 
>>>> containers
>>>> as far as resource management is concerned?
>>> Sure! You can check OpenVZ project (http://openvz.org) for example of 
>>> required resource management. BTW, I must agree with other people here 
>>> who noticed that per-process resource management is really useless and 
>>> hard to use :(
> 
> I totally agree.
>> I'll take a look at the references. I agree with you that it will be useful
>> to have resource management for a group of tasks.

But you don't need something as complex as CKRM either.  This capping 
functionality coupled with (the lamented) PAGG patches (should have been 
called TAGG for "task aggregation" instead of PAGG for "process 
aggregation") would allow you to implement a kernel module that could 
apply caps to arbitrary groups of tasks.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 2/5] sched: Add CPU rate soft caps
  2006-06-01  8:09                     ` Kirill Korotaev
@ 2006-06-01 23:38                       ` Peter Williams
  2006-06-02  1:35                         ` Peter Williams
  0 siblings, 1 reply; 95+ messages in thread
From: Peter Williams @ 2006-06-01 23:38 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: Kirill Korotaev, balbir, Mike Galbraith, Con Kolivas,
	Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

Kirill Korotaev wrote:
>> I think more needs to be said about the fairness issue.
>>
>> 1. If a task is getting its cap or more then it's getting its fair 
>> share as specified by that cap.  Yes?
>>
>> 2. If a task is getting less CPU usage than its cap then it will be 
>> being scheduled just as if it had no cap and will be getting its fair 
>> share just as much as any task is.
>>
>> So there is no fairness problem.
> the problem is that O(1) cpu scheduler doesn't keep the history of 
> execution and consumed time which is required for fairness. So I'm 
> pretty sure, that fairness will decrease when one of the tasks is being 
> capped/uncapped constantly.

Why would you want to keep capping and uncapping a task?

> 
> Can you check the behavior of 2 tasks, having different priorities with 
> the range of possible cpu limits implied on one of them.

It works OK.

> 
>> I tend to test by observing the results of setting caps on running 
>> tasks and this doesn't generate something that can be e-mailed.
> plot?

Plot what?  I'll see if I can come up with some tests that have 
plottable results.  Unless you already have some that I could use?
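
Probably something as simple as sampling the task's CPU usage rate from
/proc at a fixed interval and feeding the output to gnuplot would do.
Roughly (just a sketch, reading the usual utime+stime fields of
/proc/<pid>/stat):

    import os, sys, time

    def cpu_ticks(pid):
        # utime + stime in clock ticks, counting fields from after the
        # ')' that closes the comm field
        stat = open("/proc/%d/stat" % pid).read()
        fields = stat[stat.rfind(")") + 2:].split()
        return int(fields[11]) + int(fields[12])

    def sample(pid, period=1.0):
        hz = os.sysconf("SC_CLK_TCK")
        last = cpu_ticks(pid)
        elapsed = 0.0
        while True:
            time.sleep(period)
            now = cpu_ticks(pid)
            elapsed += period
            # one "time usage-rate" pair per line
            print "%.1f %.3f" % (elapsed, (now - last) / (hz * period))
            last = now

    if __name__ == "__main__":
        sample(int(sys.argv[1]))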

> 
>> Observations indicate that hard caps are enforced to less than 1% and 
>> ditto for soft caps except for small soft caps where the fact (stated 
>> in the patches) that enforcement is not fully strict in order to 
>> prevent priority inversion or starvation means that the cap is 
>> generally exceeded.  I'm currently making modifications (based on 
>> suggestions by Con Kolivas) that implement an alternative method for 
>> avoiding priority inversion and starvation and allow the enforcement 
>> to be more strict.
> running tasks are also not very good for such testing. it is too simple 
> load. It would be nice if you could test the results with wide range of 
> limits on Java Volano benchmark (loopback mode).

I'm interested in three things:

1. that the capping works pretty well,
2. that if the capping code is present in the kernel but no tasks are 
actually capped then the extra overhead is minimal, and
3. that if capping is used then the overhead involved is minimal.

I do informal checks for 1), use kernbench to test 2) (know noticeable 
overhead has been observed) and haven't been able to think of a way to 
test 3) yet as applying caps small enough that they'd actually be 
enforced to something like kernbench would clearly cause it to take 
longer :-(.

Feel free to run any other tests that you think are necessary.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-31 15:59         ` Balbir Singh
  2006-05-31 18:09           ` Mike Galbraith
  2006-06-01  7:41           ` Kirill Korotaev
@ 2006-06-01 23:43           ` Peter Williams
  2 siblings, 0 replies; 95+ messages in thread
From: Peter Williams @ 2006-06-01 23:43 UTC (permalink / raw)
  To: balbir
  Cc: Kirill Korotaev, Balbir Singh, Mike Galbraith, Con Kolivas,
	Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

Balbir Singh wrote:
> Kirill Korotaev wrote:
>>>> Using a timer for releasing tasks from their sinbin sounds like a  bit
>>>> of an overhead. Given that there could be 10s of thousands of tasks.
>>>
>>>
>>>
>>> The more runnable tasks there are the less likely it is that any of 
>>> them is exceeding its hard cap due to normal competition for the 
>>> CPUs.  So I think that it's unlikely that there will ever be a very 
>>> large number of tasks in the sinbin at the same time.
>>
>> for containers this can be untrue... :( actually even for 1000 tasks 
>> (I suppose this is the maximum in your case) it can slowdown 
>> significantly as well.
> 
> Do you have any documented requirements for container resource management?
> Is there a minimum list of features and nice to have features for 
> containers
> as far as resource management is concerned?
> 
> 
>>
>>>> Is it possible to use the scheduler_tick() function to take a look at all
>>>> deactivated tasks (as efficiently as possible) and activate them when
>>>> it's time to activate them or just fold the functionality by defining a
>>>> time quantum after which everyone is woken up. This time quantum
>>>> could be the same as the time over which limits are honoured.
>>
>> agree with it.
> 
> Thinking a bit more along these lines, it would probably break O(1). But 
> I guess a good
> algorithm can amortize the cost.

It's also unlikely to be less overhead than using timers.  In fact, my 
gut feeling is that you'd actually be doing something very similar to 
how timers work only cruder.  I.e. reinventing the wheel.

-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-01  8:34             ` Balbir Singh
  2006-06-01 18:43               ` [ckrm-tech] " Chandra Seetharaman
@ 2006-06-01 23:47               ` Sam Vilain
  1 sibling, 0 replies; 95+ messages in thread
From: Sam Vilain @ 2006-06-01 23:47 UTC (permalink / raw)
  To: balbir
  Cc: Kirill Korotaev, Peter Williams, Balbir Singh, Mike Galbraith,
	Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar,
	Rene Herman, Andrew Morton, Eric W. Biederman, Srivatsa,
	ckrm-tech

Balbir Singh wrote:

>>1) CPU:
>>- fairness (i.e. prioritization of containers). For this we use SFQ like 
>>fair cpu scheduler with virtual cpus (runqueues). Linux-vserver uses 
>>token bucket algorithm. I can provide more details on this if you are 
>>interested.
>>    
>>
>Yes, any information or pointers to them will be very useful.
>  
>
A general description of the token bucket scheduler is on the Vserver
wiki at http://linux-vserver.org/Linux-VServer-Paper-06

I also just described it on a nearby thread -
http://lkml.org/lkml/2006/5/28/122
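
In (very) rough pseudo-Python, ignoring all the details of the real
implementation, the idea boils down to:

    # toy token bucket CPU limiter, an illustration only and not the
    # Linux-VServer code.  fill_rate/interval is the long term share;
    # bucket_size bounds how much unused share can be saved up.
    class TokenBucket:
        def __init__(self, fill_rate, interval, bucket_size):
            self.fill_rate = fill_rate
            self.interval = interval
            self.bucket_size = bucket_size
            self.tokens = bucket_size

        def earn(self, ticks):
            # tokens trickle in as scheduler ticks pass
            self.tokens = min(self.bucket_size,
                              self.tokens + ticks * self.fill_rate / float(self.interval))

        def may_run(self):
            # running for a tick costs a token; when the bucket is empty
            # the context is held until it has earned some back
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False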

Sam.



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: sched: Add CPU rate hard caps
  2006-06-01 18:43               ` [ckrm-tech] " Chandra Seetharaman
  2006-06-01 23:26                 ` Peter Williams
@ 2006-06-02  0:36                 ` Con Kolivas
  2006-06-02  2:03                   ` [ckrm-tech] " Chandra Seetharaman
  2006-06-02  5:55                 ` [ckrm-tech] [RFC 3/5] " Peter Williams
  2006-06-02  7:34                 ` Kirill Korotaev
  3 siblings, 1 reply; 95+ messages in thread
From: Con Kolivas @ 2006-06-02  0:36 UTC (permalink / raw)
  To: sekharan
  Cc: balbir, dev, Andrew Morton, Srivatsa, Sam Vilain, ckrm-tech,
	Balbir Singh, Mike Galbraith, Peter Williams, Linux Kernel,
	Kingsley Cheung, Eric W. Biederman, Ingo Molnar, Rene Herman,
	Jens Axboe

On Friday 02 June 2006 04:43, Chandra Seetharaman wrote:
> On Thu, 2006-06-01 at 14:04 +0530, Balbir Singh wrote:
> > > - disk I/O bandwidth:
> > > we started to use CFQv2, but it is quite poor in this regard. First, it
> > > doesn't prioritize writes and async disk operations :( And even for
> > > sync reads we found some problems we work on now...
>
> CKRM (on e-series) had an implementation based on a modified CFQ
> scheduler. Shailabh is currently working on porting that controller to
> f-series.

I hope that the changes you have to improve CFQ were done in a way that is 
suitable for mainline and you're planning to try and merge them there.

-- 
-ck

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 2/5] sched: Add CPU rate soft caps
  2006-06-01 23:38                       ` Peter Williams
@ 2006-06-02  1:35                         ` Peter Williams
  0 siblings, 0 replies; 95+ messages in thread
From: Peter Williams @ 2006-06-02  1:35 UTC (permalink / raw)
  To: Linux Kernel
  Cc: Kirill Korotaev, Kirill Korotaev, balbir, Mike Galbraith,
	Con Kolivas, Kingsley Cheung, Ingo Molnar, Rene Herman

Peter Williams wrote:
> Kirill Korotaev wrote:
>>> I think more needs to be said about the fairness issue.
>>>
>>> 1. If a task is getting its cap or more then it's getting its fair 
>>> share as specified by that cap.  Yes?
>>>
>>> 2. If a task is getting less CPU usage than its cap then it will be 
>>> being scheduled just as if it had no cap and will be getting its fair 
>>> share just as much as any task is.
>>>
>>> So there is no fairness problem.
>> the problem is that O(1) cpu scheduler doesn't keep the history of 
>> execution and consumed time which is required for fairness. So I'm 
>> pretty sure, that fairness will decrease when one of the tasks is 
>> being capped/uncapped constantly.
> 
> Why would you want to keep capping and uncapping a task?
> 
>>
>> Can you check the behavior of 2 tasks, having different priorities with 
>> the range of possible cpu limits implied on one of them.
> 
> It works OK.
> 
>>
>>> I tend to test by observing the results of setting caps on running 
>>> tasks and this doesn't generate something that can be e-mailed.
>> plot?
> 
> Plot what?  I'll see if I can come up with some tests that have 
> plottable results.  Unless you already have some that I could use?
> 
>>
>>> Observations indicate that hard caps are enforced to less than 1% and 
>>> ditto for soft caps except for small soft caps where the fact (stated 
>>> in the patches) that enforcement is not fully strict in order to 
>>> prevent priority inversion or starvation means that the cap is 
>>> generally exceeded.  I'm currently making modifications (based on 
>>> suggestions by Con Kolivas) that implement an alternative method for 
>>> avoiding priority inversion and starvation and allow the enforcement 
>>> to be more strict.
>> running tasks are also not very good for such testing. it is too 
>> simple load. It would be nice if you could test the results with wide 
>> range of limits on Java Volano benchmark (loopback mode).
> 
> I'm interested in three things:
> 
> 1. that the capping works pretty well,
> 2. that if the capping code is present in the kernel but no tasks are 
> actually capped then the extra overhead is minimal, and
> 3. that if capping is used then the overhead involved is minimal.
> 
> I do informal checks for 1), use kernbench to test 2) (know noticeable 

I'm having a bad day word-selection-wise: that "know" should be "no".

> overhead has been observed) and haven't been able to think of a way to 
> test 3) yet as applying caps small enough that they'd actually be 
> enforced to something like kernbench would clearly cause it to take 
> longer :-(.
> 
> Feel free to run any other tests that you think are necessary.
> 
> Peter


-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [ckrm-tech] [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-01 23:26                 ` Peter Williams
@ 2006-06-02  2:02                   ` Chandra Seetharaman
  2006-06-02  3:21                     ` Peter Williams
  0 siblings, 1 reply; 95+ messages in thread
From: Chandra Seetharaman @ 2006-06-02  2:02 UTC (permalink / raw)
  To: Peter Williams
  Cc: balbir, dev, Andrew Morton, Srivatsa, Sam Vilain, ckrm-tech,
	Balbir Singh, Mike Galbraith, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Eric W. Biederman, Ingo Molnar, Rene Herman

On Fri, 2006-06-02 at 09:26 +1000, Peter Williams wrote:
> Chandra Seetharaman wrote:
> > On Thu, 2006-06-01 at 14:04 +0530, Balbir Singh wrote:
> >> Hi, Kirill,
> >>
> >> Kirill Korotaev wrote:
> >>>> Do you have any documented requirements for container resource 
> >>>> management?
> >>>> Is there a minimum list of features and nice to have features for 
> >>>> containers
> >>>> as far as resource management is concerned?
> >>> Sure! You can check OpenVZ project (http://openvz.org) for example of 
> >>> required resource management. BTW, I must agree with other people here 
> >>> who noticed that per-process resource management is really useless and 
> >>> hard to use :(
> > 
> > I totally agree.
> >> I'll take a look at the references. I agree with you that it will be useful
> >> to have resource management for a group of tasks.
> 
> But you don't need something as complex as CKRM either.  This capping

All CKRM^W Resource Groups does is to group unrelated/related tasks to a
group and apply resource limits. 

>  
> functionality coupled with (the lamented) PAGG patches (should have been 
> called TAGG for "task aggregation" instead of PAGG for "process 
> aggregation") would allow you to implement a kernel module that could 
> apply caps to arbitrary groups of tasks.

I do not follow how PAGG + this cap feature can be used to put a cap on
related/unrelated tasks. Can you provide a little more explanation,
please.

Also, I do not think it can provide guarantees to that group of tasks.
Can it?

> 
> Peter
-- 

----------------------------------------------------------------------
    Chandra Seetharaman               | Be careful what you choose....
              - sekharan@us.ibm.com   |      .......you may get it.
----------------------------------------------------------------------



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [ckrm-tech] sched: Add CPU rate hard caps
  2006-06-02  0:36                 ` Con Kolivas
@ 2006-06-02  2:03                   ` Chandra Seetharaman
  0 siblings, 0 replies; 95+ messages in thread
From: Chandra Seetharaman @ 2006-06-02  2:03 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Andrew Morton, dev, Jens Axboe, Srivatsa, ckrm-tech, balbir,
	Balbir Singh, Mike Galbraith, Sam Vilain, Peter Williams,
	Kingsley Cheung, Eric W. Biederman, Ingo Molnar, Rene Herman,
	Linux Kernel

On Fri, 2006-06-02 at 10:36 +1000, Con Kolivas wrote:
> On Friday 02 June 2006 04:43, Chandra Seetharaman wrote:
> > On Thu, 2006-06-01 at 14:04 +0530, Balbir Singh wrote:
> > > > - disk I/O bandwidth:
> > > > we started to use CFQv2, but it is quite poor in this regard. First, it
> > > > doesn't prioritizes writes and async disk operations :( And even for
> > > > sync reads we found some problems we work on now...
> >
> > CKRM (on e-series) had an implementation based on a modified CFQ
> > scheduler. Shailabh is currently working on porting that controller to
> > f-series.
> 
> I hope that the changes you have to improve CFQ were done in a way that is 
> suitable for mainline and you're planning to try and merge them there.

That is our #1 objective :)
> 
-- 

----------------------------------------------------------------------
    Chandra Seetharaman               | Be careful what you choose....
              - sekharan@us.ibm.com   |      .......you may get it.
----------------------------------------------------------------------



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [ckrm-tech] [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-02  2:02                   ` Chandra Seetharaman
@ 2006-06-02  3:21                     ` Peter Williams
  2006-06-02  8:32                       ` Balbir Singh
  2006-06-02 19:06                       ` Chandra Seetharaman
  0 siblings, 2 replies; 95+ messages in thread
From: Peter Williams @ 2006-06-02  3:21 UTC (permalink / raw)
  To: sekharan
  Cc: Peter Williams, Andrew Morton, dev, Srivatsa, ckrm-tech, balbir,
	Balbir Singh, Mike Galbraith, Sam Vilain, Con Kolivas,
	Linux Kernel, Kingsley Cheung, Eric W. Biederman, Ingo Molnar,
	Rene Herman

Chandra Seetharaman wrote:
> On Fri, 2006-06-02 at 09:26 +1000, Peter Williams wrote:
>> Chandra Seetharaman wrote:
>>> On Thu, 2006-06-01 at 14:04 +0530, Balbir Singh wrote:
>>>> Hi, Kirill,
>>>>
>>>> Kirill Korotaev wrote:
>>>>>> Do you have any documented requirements for container resource 
>>>>>> management?
>>>>>> Is there a minimum list of features and nice to have features for 
>>>>>> containers
>>>>>> as far as resource management is concerned?
>>>>> Sure! You can check OpenVZ project (http://openvz.org) for example of 
>>>>> required resource management. BTW, I must agree with other people here 
>>>>> who noticed that per-process resource management is really useless and 
>>>>> hard to use :(
>>> I totally agree.
>>>> I'll take a look at the references. I agree with you that it will be useful
>>>> to have resource management for a group of tasks.
>> But you don't need something as complex as CKRM either.  This capping
> 
> All CKRM^W Resource Groups does is to group unrelated/related tasks to a
> group and apply resource limits. 
> 
>>  
>> functionality coupled with (the lamented) PAGG patches (should have been 
>> called TAGG for "task aggregation" instead of PAGG for "process 
>> aggregation") would allow you to implement a kernel module that could 
>> apply caps to arbitrary groups of tasks.
> 
> I do not follow how PAGG + this cap feature can be used to put cap of
> related/unrelated tasks. Can you provide little more explanation,
> please.

I would have thought it was fairly obvious.  PAGG supplies the task 
aggregation mechanism, these patches provide per task caps and all 
that's needed is the code that marries the two.

> 
> Also, i do not think it can provide guarantees to that group of tasks.
> can it ?

It could do that by manipulating nice which is already available in the 
kernel.

I.e. these patches plus improved statistics (which are coming, I hope) 
together with the existing policy controls provide all that is necessary 
to do comprehensive CPU resource control.  If there is an efficient way 
to get the statistics out to user space (also coming, I hope) this 
control could be exercised from user space.
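
As a purely illustrative sketch (the group membership and the usage
sampling are hand waved here, and only the existing setpriority()
interface is used), the user space side of a "guarantee" could be a
periodic adjustment along these lines:

#include <errno.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/types.h>

/* Nudge each task's nice value so the group drifts towards its
 * guaranteed share of the CPU. */
static void adjust_group_nice(const pid_t *tasks, int ntasks,
                              double observed_share, double guaranteed_share)
{
        int i;

        for (i = 0; i < ntasks; i++) {
                int cur, want;

                errno = 0;
                cur = getpriority(PRIO_PROCESS, tasks[i]);
                if (cur == -1 && errno != 0)
                        continue;       /* task has probably exited */

                want = cur;
                if (observed_share < guaranteed_share && cur > -20)
                        want = cur - 1; /* starved: raise priority */
                else if (observed_share > guaranteed_share && cur < 19)
                        want = cur + 1; /* over its share: back off */

                if (want != cur &&
                    setpriority(PRIO_PROCESS, tasks[i], want) != 0)
                        perror("setpriority");
        }
}

The same sort of loop could adjust the caps rather than nice when the
aim is a limit rather than a guarantee.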

Peter
-- 
Dr Peter Williams, Chief Scientist         <peterw@aurema.com>
Aurema Pty Limited
Level 2, 130 Elizabeth St, Sydney, NSW 2000, Australia
Tel:+61 2 9698 2322  Fax:+61 2 9699 9174 http://www.aurema.com

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [ckrm-tech] [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-01 18:43               ` [ckrm-tech] " Chandra Seetharaman
  2006-06-01 23:26                 ` Peter Williams
  2006-06-02  0:36                 ` Con Kolivas
@ 2006-06-02  5:55                 ` Peter Williams
  2006-06-02  7:47                   ` Kirill Korotaev
  2006-06-02  8:46                   ` Mike Galbraith
  2006-06-02  7:34                 ` Kirill Korotaev
  3 siblings, 2 replies; 95+ messages in thread
From: Peter Williams @ 2006-06-02  5:55 UTC (permalink / raw)
  To: sekharan
  Cc: balbir, dev, Andrew Morton, Srivatsa, Sam Vilain, ckrm-tech,
	Balbir Singh, Mike Galbraith, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Eric W. Biederman, Ingo Molnar, Rene Herman

[-- Attachment #1: Type: text/plain, Size: 1553 bytes --]

Chandra Seetharaman wrote:
> On Thu, 2006-06-01 at 14:04 +0530, Balbir Singh wrote:
>> Hi, Kirill,
>>
>> Kirill Korotaev wrote:
>>>> Do you have any documented requirements for container resource 
>>>> management?
>>>> Is there a minimum list of features and nice to have features for 
>>>> containers
>>>> as far as resource management is concerned?
>>> Sure! You can check OpenVZ project (http://openvz.org) for example of 
>>> required resource management. BTW, I must agree with other people here 
>>> who noticed that per-process resource management is really useless and 
>>> hard to use :(
> 
> I totally agree.

"nice" seems to be doing quite nicely :-)

To me this capping functionality is a similar functionality to that 
provided by "nice" and all that's needed to make it useful is a command 
(similar to "nice") that runs tasks with caps applied.  To that end I've 
written a small script (attached) that does this.  As this is something 
that a user might like to combine with "nice" the command has an option 
for setting "nice" as well as caps.

Usage:
         withcap [options] command [arguments ...]
         withcap -h
Options:
         [-c <CPU rate soft cap>]
         [-C <CPU rate hard cap>]
         [-n <nice value>]

         -c Set CPU usage rate soft cap
         -C Set CPU usage rate hard cap
         -n Set nice value
         -h Display this help
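
For anyone who'd rather see it spelled out than read shell, the wrapper
boils down to something like the sketch below.  Note that the /proc file
names used here to set the caps are just placeholders for this
illustration; see the actual patches for the real interface.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/resource.h>

/* Write a single number to an ASSUMED per-task /proc file. */
static void set_cap(const char *name, long value)
{
        char path[128];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/%s", (int)getpid(), name);
        f = fopen(path, "w");
        if (!f) {
                perror(path);
                return;
        }
        fprintf(f, "%ld\n", value);
        fclose(f);
}

int main(int argc, char *argv[])
{
        int opt;

        while ((opt = getopt(argc, argv, "c:C:n:h")) != -1) {
                switch (opt) {
                case 'c':       /* soft cap */
                        set_cap("cpu_rate_cap", atol(optarg));
                        break;
                case 'C':       /* hard cap */
                        set_cap("cpu_rate_hard_cap", atol(optarg));
                        break;
                case 'n':       /* nice value */
                        if (setpriority(PRIO_PROCESS, 0, atoi(optarg)))
                                perror("setpriority");
                        break;
                default:
                        fprintf(stderr, "Usage: withcap [-c soft] [-C hard]"
                                        " [-n nice] command [args ...]\n");
                        return opt == 'h' ? 0 : 1;
                }
        }
        if (optind >= argc)
                return 1;

        /* The settings made above stay with this task across the exec(). */
        execvp(argv[optind], &argv[optind]);
        perror("execvp");
        return 1;
}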

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

[-- Attachment #2: withcap.sh --]
[-- Type: application/x-shellscript, Size: 2463 bytes --]

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [ckrm-tech] [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-01 18:43               ` [ckrm-tech] " Chandra Seetharaman
                                   ` (2 preceding siblings ...)
  2006-06-02  5:55                 ` [ckrm-tech] [RFC 3/5] " Peter Williams
@ 2006-06-02  7:34                 ` Kirill Korotaev
  2006-06-02 21:23                   ` Shailabh Nagar
  3 siblings, 1 reply; 95+ messages in thread
From: Kirill Korotaev @ 2006-06-02  7:34 UTC (permalink / raw)
  To: sekharan
  Cc: balbir, dev, Andrew Morton, Srivatsa, ckrm-tech, Linux Kernel,
	Balbir Singh, Mike Galbraith, Sam Vilain, Con Kolivas,
	Peter Williams, Kingsley Cheung, Eric W. Biederman, Rene Herman

>>>- disk I/O bandwidth:
>>>we started to use CFQv2, but it is quite poor in this regard. First, it 
>>>doesn't prioritizes writes and async disk operations :( And even for 
>>>sync reads we found some problems we work on now...

> CKRM (on e-series) had an implementation based on a modified CFQ
> scheduler. Shailabh is currently working on porting that controller to
> f-series.
Can you explain what was changed by CKRM there? Did you make it 
control ASYNC reads/writes? I don't think so...
Do you have any plots of how concurrent bandwidth depends on the 
weights? Because our measurements show that CFQ is not ideal and 
behaves poorly when prios 0,5,6,7 are used :/ Only 1,2,3,4 are really 
linearly scalable...

>>>3) memory and other resources.
>>>- memory
>>>- files
>>>- signals and so on and so on.
>>>For example, in OpenVZ we have user resource beancounters (original 
>>>author is Alan Cox), which account the following set of parameters:
>>>kernel memory (vmas, page tables, different structures etc.), dcache 
> i started looking at UBC. They provide only max limits, not min
> guarantees, right ?
They also provide vmpages guarantees and guarantees against the OOM 
killer (vmguarpages and oomguarpages), i.e. if a container consumes 
less than X pages it won't be killed by the OOM killer unless there is 
no other container to select. I.e. we have a 2-level OOM.

Kirill

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [ckrm-tech] [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-02  5:55                 ` [ckrm-tech] [RFC 3/5] " Peter Williams
@ 2006-06-02  7:47                   ` Kirill Korotaev
  2006-06-02 13:34                     ` Peter Williams
  2006-06-05 22:11                     ` Sam Vilain
  2006-06-02  8:46                   ` Mike Galbraith
  1 sibling, 2 replies; 95+ messages in thread
From: Kirill Korotaev @ 2006-06-02  7:47 UTC (permalink / raw)
  To: Peter Williams
  Cc: sekharan, Andrew Morton, Srivatsa, ckrm-tech, balbir,
	Balbir Singh, Mike Galbraith, Sam Vilain, Con Kolivas,
	Linux Kernel, Kingsley Cheung, Eric W. Biederman, Ingo Molnar,
	Rene Herman

>>>> Sure! You can check OpenVZ project (http://openvz.org) for example 
>>>> of required resource management. BTW, I must agree with other people 
>>>> here who noticed that per-process resource management is really 
>>>> useless and hard to use :(
>>
>>
>> I totally agree.
> 
> 
> "nice" seems to be doing quite nicely :-)
I'm sorry, but nice never looked "nice" to me.
Have you ever tried to "nice" an apache server which spawns 500 
processes/threads on a loaded machine?
With nice you _can't_ impose limits or priority on the whole "apache".
The more apaches you have, the more useless their priorities and nices are...

> To me this capping functionality is a similar functionality to that 
> provided by "nice" and all that's needed to make it useful is a command 
> (similar to "nice") that runs tasks with caps applied.  To that end I've 
> written a small script (attached) that does this.  As this is something 
> that a user might like to combine with "nice" the command has an option 
> for setting "nice" as well as caps.
> 
> Usage:
>         withcap [options] command [arguments ...]
>         withcap -h
> Options:
>         [-c <CPU rate soft cap>]
>         [-C <CPU rate hard cap>]
>         [-n <nice value>]
> 
>         -c Set CPU usage rate soft cap
>         -C Set CPU usage rate hard cap
>         -n Set nice value
>         -h Display this help

The same goes for this: you can't limit a _user_, only his processes.
Today I have 1 task and a 20% limit is OK; tomorrow I have 10 tasks and 
this 20% limit changes nothing in the system.

Kirill

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [ckrm-tech] [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-02  3:21                     ` Peter Williams
@ 2006-06-02  8:32                       ` Balbir Singh
  2006-06-02 13:30                         ` Peter Williams
  2006-06-02 19:06                       ` Chandra Seetharaman
  1 sibling, 1 reply; 95+ messages in thread
From: Balbir Singh @ 2006-06-02  8:32 UTC (permalink / raw)
  To: Peter Williams
  Cc: sekharan, Andrew Morton, dev, Srivatsa, ckrm-tech, Balbir Singh,
	Mike Galbraith, Peter Williams, Con Kolivas, Sam Vilain,
	Kingsley Cheung, Eric W. Biederman, Ingo Molnar, Rene Herman,
	Linux Kernel

Peter Williams wrote:
<snip>

>>>
>>>But you don't need something as complex as CKRM either.  This capping
>>
>>All CKRM^W Resource Groups does is to group unrelated/related tasks to a
>>group and apply resource limits. 
>>
>>
>>> 
>>>functionality coupled with (the lamented) PAGG patches (should have been 
>>>called TAGG for "task aggregation" instead of PAGG for "process 
>>>aggregation") would allow you to implement a kernel module that could 
>>>apply caps to arbitrary groups of tasks.
>>
>>I do not follow how PAGG + this cap feature can be used to put cap of
>>related/unrelated tasks. Can you provide little more explanation,
>>please.
> 
> 
> I would have thought it was fairly obvious.  PAGG supplies the task 
> aggregation mechanism, these patches provide per task caps and all 
> that's needed is the code that marries the two.
> 

The problem is that with per-task caps, if I have a resource group A
and I want to limit it to 10%, I need to limit each task in resource
group A to 10% (which makes resource groups not so useful). Is my
understanding correct? Is there a way to distribute the group limit
across tasks in the resource group?

> 
>>Also, i do not think it can provide guarantees to that group of tasks.
>>can it ?
> 
> 
> It could do that by manipulating nice which is already available in the 
> kernel.
> 
> I.e. these patches plus improved statistics (which are coming, I hope) 
> together with the existing policy controls provide all that is necessary 
> to do comprehensive CPU resource control.  If there is an efficient way 
> to get the statistics out to user space (also coming, I hope) this 
> control could be exercised from user space.

Could you please provide me with a link to the new improved statistics.
What do the new statistics contain - any heads up on them?

> 
> Peter


-- 

	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [ckrm-tech] [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-02  5:55                 ` [ckrm-tech] [RFC 3/5] " Peter Williams
  2006-06-02  7:47                   ` Kirill Korotaev
@ 2006-06-02  8:46                   ` Mike Galbraith
  2006-06-02 13:18                     ` Peter Williams
  1 sibling, 1 reply; 95+ messages in thread
From: Mike Galbraith @ 2006-06-02  8:46 UTC (permalink / raw)
  To: Peter Williams
  Cc: sekharan, balbir, dev, Andrew Morton, Srivatsa, Sam Vilain,
	ckrm-tech, Balbir Singh, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Eric W. Biederman, Ingo Molnar, Rene Herman

On Fri, 2006-06-02 at 15:55 +1000, Peter Williams wrote:
> Chandra Seetharaman wrote:
> > On Thu, 2006-06-01 at 14:04 +0530, Balbir Singh wrote:
> >> Hi, Kirill,
> >>
> >> Kirill Korotaev wrote:
> >>>> Do you have any documented requirements for container resource 
> >>>> management?
> >>>> Is there a minimum list of features and nice to have features for 
> >>>> containers
> >>>> as far as resource management is concerned?
> >>> Sure! You can check OpenVZ project (http://openvz.org) for example of 
> >>> required resource management. BTW, I must agree with other people here 
> >>> who noticed that per-process resource management is really useless and 
> >>> hard to use :(
> > 
> > I totally agree.
> 
> "nice" seems to be doing quite nicely :-)
> 
> To me this capping functionality is a similar functionality to that 
> provided by "nice" and all that's needed to make it useful is a command 
> (similar to "nice") that runs tasks with caps applied.

Similar in that they are both inherited.  Very dissimilar in that the
effect of nice is not altered by fork whereas the effect of a cap is.

Consider make.  A cap on make itself isn't meaningful, and _any_ per
task cap you put on it with the intent of managing the aggregate, is
defeated by the argument -j.  Per task caps require omniscience to be
effective in managing processes.  That's a pretty severe limitation.

	-Mike


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [ckrm-tech] [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-02  8:46                   ` Mike Galbraith
@ 2006-06-02 13:18                     ` Peter Williams
  2006-06-02 14:47                       ` Mike Galbraith
  0 siblings, 1 reply; 95+ messages in thread
From: Peter Williams @ 2006-06-02 13:18 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: sekharan, balbir, dev, Andrew Morton, Srivatsa, Sam Vilain,
	ckrm-tech, Balbir Singh, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Eric W. Biederman, Ingo Molnar, Rene Herman

Mike Galbraith wrote:
> On Fri, 2006-06-02 at 15:55 +1000, Peter Williams wrote:
>> Chandra Seetharaman wrote:
>>> On Thu, 2006-06-01 at 14:04 +0530, Balbir Singh wrote:
>>>> Hi, Kirill,
>>>>
>>>> Kirill Korotaev wrote:
>>>>>> Do you have any documented requirements for container resource 
>>>>>> management?
>>>>>> Is there a minimum list of features and nice to have features for 
>>>>>> containers
>>>>>> as far as resource management is concerned?
>>>>> Sure! You can check OpenVZ project (http://openvz.org) for example of 
>>>>> required resource management. BTW, I must agree with other people here 
>>>>> who noticed that per-process resource management is really useless and 
>>>>> hard to use :(
>>> I totally agree.
>> "nice" seems to be doing quite nicely :-)
>>
>> To me this capping functionality is a similar functionality to that 
>> provided by "nice" and all that's needed to make it useful is a command 
>> (similar to "nice") that runs tasks with caps applied.
> 
> Similar in that they are both inherited.  Very dissimilar in that the
> effect of nice is not altered by fork whereas the effect of a cap is.
> 
> Consider make.  A cap on make itself isn't meaningful, and _any_ per
> task cap you put on it with the intent of managing the aggregate, is
> defeated by the argument -j.  Per task caps require omniscience to be
> effective in managing processes.  That's a pretty severe limitation.

These caps aren't trying to control aggregates but with suitable 
software they can be used to control aggregates.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [ckrm-tech] [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-02  8:32                       ` Balbir Singh
@ 2006-06-02 13:30                         ` Peter Williams
  2006-06-02 18:58                           ` Balbir Singh
  0 siblings, 1 reply; 95+ messages in thread
From: Peter Williams @ 2006-06-02 13:30 UTC (permalink / raw)
  To: balbir
  Cc: Peter Williams, sekharan, Andrew Morton, dev, Srivatsa,
	ckrm-tech, Balbir Singh, Mike Galbraith, Con Kolivas, Sam Vilain,
	Kingsley Cheung, Eric W. Biederman, Ingo Molnar, Rene Herman,
	Linux Kernel

Balbir Singh wrote:
> Peter Williams wrote:
> <snip>
> 
>>>>
>>>> But you don't need something as complex as CKRM either.  This capping
>>>
>>> All CKRM^W Resource Groups does is to group unrelated/related tasks to a
>>> group and apply resource limits.
>>>
>>>>
>>>> functionality coupled with (the lamented) PAGG patches (should have 
>>>> been called TAGG for "task aggregation" instead of PAGG for "process 
>>>> aggregation") would allow you to implement a kernel module that 
>>>> could apply caps to arbitrary groups of tasks.
>>>
>>> I do not follow how PAGG + this cap feature can be used to put cap of
>>> related/unrelated tasks. Can you provide little more explanation,
>>> please.
>>
>>
>> I would have thought it was fairly obvious.  PAGG supplies the task 
>> aggregation mechanism, these patches provide per task caps and all 
>> that's needed is the code that marries the two.
>>
> 
> The problem is that with per-task caps, if I have a resource group A
> and I want to limit it to 10%, I need to limit each task in resource
> group A to 10% (which makes resource groups not so useful). Is my
> understanding correct?

Well the general idea is correct but your maths is wrong.  You'd have to 
give each of them a cap somewhere between 10% and 10% divided by the 
number of tasks in group A.  Exactly where in that range would vary 
depending on the CPU demand of each task and would need to be adjusted 
dynamically (unless they were very boring tasks whose demands were 
constant over time).

> Is there a way to distribute the group limit
> across tasks in the resource group?

Not as part of this patch but it could be done from outside the 
scheduler either in the kernel or in user space.

> 
>>
>>> Also, i do not think it can provide guarantees to that group of tasks.
>>> can it ?
>>
>>
>> It could do that by manipulating nice which is already available in 
>> the kernel.
>>
>> I.e. these patches plus improved statistics (which are coming, I hope) 
>> together with the existing policy controls provide all that is 
>> necessary to do comprehensive CPU resource control.  If there is an 
>> efficient way to get the statistics out to user space (also coming, I 
>> hope) this control could be exercised from user space.
> 
> Could you please provide me with a link to the new improved statistics.

No.  Read LKML and you'll know as much as I do.

> What do the new statistics contain - any heads up on them?

There're several contenders (including some from IBM) that periodically 
post patches to LKML.  That's where I'm aware of them from.  As I say, 
I'm hoping that they get together and come up with something generally 
useful (as opposed to just meeting each contenders needs). I may be 
being overly optimistic but you never know.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [ckrm-tech] [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-02  7:47                   ` Kirill Korotaev
@ 2006-06-02 13:34                     ` Peter Williams
  2006-06-05 22:11                     ` Sam Vilain
  1 sibling, 0 replies; 95+ messages in thread
From: Peter Williams @ 2006-06-02 13:34 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: sekharan, Andrew Morton, Srivatsa, ckrm-tech, balbir,
	Balbir Singh, Mike Galbraith, Sam Vilain, Con Kolivas,
	Linux Kernel, Kingsley Cheung, Eric W. Biederman, Ingo Molnar,
	Rene Herman

Kirill Korotaev wrote:
>>>>> Sure! You can check OpenVZ project (http://openvz.org) for example 
>>>>> of required resource management. BTW, I must agree with other 
>>>>> people here who noticed that per-process resource management is 
>>>>> really useless and hard to use :(
>>>
>>>
>>> I totally agree.
>>
>>
>> "nice" seems to be doing quite nicely :-)
> I'm sorry, but nice never looked "nice" to me.
> Have you ever tried to "nice" apache server which spawns 500 
> processes/threads on a loaded machine?
> With nice you _can't_ impose limits or priority on the whole "apache".
> The more apaches you have the more useless their priorites and nices are...

Nevertheless "nice" is still useful.  I'd bet that just about every 
Linux system has at least one task with non normal nice at any time.

I think that these caps can be similarly useful.

They can also be used as the basic mechanism to implement the kind of 
thing you want from OUTSIDE of the scheduler.

> 
>> To me this capping functionality is a similar functionality to that 
>> provided by "nice" and all that's needed to make it useful is a 
>> command (similar to "nice") that runs tasks with caps applied.  To 
>> that end I've written a small script (attached) that does this.  As 
>> this is something that a user might like to combine with "nice" the 
>> command has an option for setting "nice" as well as caps.
>>
>> Usage:
>>         withcap [options] command [arguments ...]
>>         withcap -h
>> Options:
>>         [-c <CPU rate soft cap>]
>>         [-C <CPU rate hard cap>]
>>         [-n <nice value>]
>>
>>         -c Set CPU usage rate soft cap
>>         -C Set CPU usage rate hard cap
>>         -n Set nice value
>>         -h Display this help
> 
> the same for this. you can't limit a _user_, only his processes.
> Today I have 1 task and 20% limit is ok, tomorrow I have 10 tasks and 
> this 20% limits changes nothing in the system.

This still doesn't make it useless.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [ckrm-tech] [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-02 13:18                     ` Peter Williams
@ 2006-06-02 14:47                       ` Mike Galbraith
  2006-06-03  0:08                         ` Peter Williams
  2006-06-06 11:26                         ` Srivatsa Vaddagiri
  0 siblings, 2 replies; 95+ messages in thread
From: Mike Galbraith @ 2006-06-02 14:47 UTC (permalink / raw)
  To: Peter Williams
  Cc: sekharan, balbir, dev, Andrew Morton, Srivatsa, Sam Vilain,
	ckrm-tech, Balbir Singh, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Eric W. Biederman, Ingo Molnar, Rene Herman

On Fri, 2006-06-02 at 23:18 +1000, Peter Williams wrote:
> Mike Galbraith wrote:
> > On Fri, 2006-06-02 at 15:55 +1000, Peter Williams wrote:
> >> Chandra Seetharaman wrote:
> >>> On Thu, 2006-06-01 at 14:04 +0530, Balbir Singh wrote:
> >>>> Hi, Kirill,
> >>>>
> >>>> Kirill Korotaev wrote:
> >>>>>> Do you have any documented requirements for container resource 
> >>>>>> management?
> >>>>>> Is there a minimum list of features and nice to have features for 
> >>>>>> containers
> >>>>>> as far as resource management is concerned?
> >>>>> Sure! You can check OpenVZ project (http://openvz.org) for example of 
> >>>>> required resource management. BTW, I must agree with other people here 
> >>>>> who noticed that per-process resource management is really useless and 
> >>>>> hard to use :(
> >>> I totally agree.
> >> "nice" seems to be doing quite nicely :-)
> >>
> >> To me this capping functionality is a similar functionality to that 
> >> provided by "nice" and all that's needed to make it useful is a command 
> >> (similar to "nice") that runs tasks with caps applied.
> > 
> > Similar in that they are both inherited.  Very dissimilar in that the
> > effect of nice is not altered by fork whereas the effect of a cap is.
> > 
> > Consider make.  A cap on make itself isn't meaningful, and _any_ per
> > task cap you put on it with the intent of managing the aggregate, is
> > defeated by the argument -j.  Per task caps require omniscience to be
> > effective in managing processes.  That's a pretty severe limitation.
> 
> These caps aren't trying to control aggregates but with suitable 
> software they can be used to control aggregates.

How?  How would you deal with the make example with per task caps.

	-Mike


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [ckrm-tech] [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-02 13:30                         ` Peter Williams
@ 2006-06-02 18:58                           ` Balbir Singh
  2006-06-02 23:49                             ` Peter Williams
  0 siblings, 1 reply; 95+ messages in thread
From: Balbir Singh @ 2006-06-02 18:58 UTC (permalink / raw)
  To: Peter Williams
  Cc: Andrew Morton, dev, Srivatsa, sekharan, ckrm-tech, Balbir Singh,
	Mike Galbraith, Sam Vilain, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Eric W. Biederman, Ingo Molnar, Peter Williams,
	Rene Herman

Peter Williams wrote:
> Balbir Singh wrote:
> 
>>Peter Williams wrote:
>><snip>
>>
>>>>>But you don't need something as complex as CKRM either.  This capping
>>>>
>>>>All CKRM^W Resource Groups does is to group unrelated/related tasks to a
>>>>group and apply resource limits.
>>>>
>>>>
>>>>>functionality coupled with (the lamented) PAGG patches (should have 
>>>>>been called TAGG for "task aggregation" instead of PAGG for "process 
>>>>>aggregation") would allow you to implement a kernel module that 
>>>>>could apply caps to arbitrary groups of tasks.
>>>>
>>>>I do not follow how PAGG + this cap feature can be used to put cap of
>>>>related/unrelated tasks. Can you provide little more explanation,
>>>>please.
>>>
>>>
>>>I would have thought it was fairly obvious.  PAGG supplies the task 
>>>aggregation mechanism, these patches provide per task caps and all 
>>>that's needed is the code that marries the two.
>>>
>>
>>The problem is that with per-task caps, if I have a resource group A
>>and I want to limit it to 10%, I need to limit each task in resource
>>group A to 10% (which makes resource groups not so useful). Is my
>>understanding correct?
> 
> 
> Well the general idea is correct but your maths is wrong.  You'd have to 
> give each of them a cap somewhere between 10% and 10% divided by the 
> number of tasks in group A.  Exactly where in that range would vary 
> depending on the CPU demand of each task and would need to be adjusted 
> dynamically (unless they were very boring tasks whose demands were 
> constant over time).
>


Hmm.. I thought my math was reasonable (but there is always so much to learn)
From your formula, if I have 1 task in group A, I need to provide it with
a cap of b/w 10 to 11%. For two tasks, I need to give them b/w 10 to 10.5%.
If I have a hundred, it needs to be b/w 10% and 10.01%
 
> 
>>Is there a way to distribute the group limit
>>across tasks in the resource group?
> 
> 
> Not as part of this patch but it could be done from outside the 
> scheduler either in the kernel or in user space.
> 
> 
>>>>Also, i do not think it can provide guarantees to that group of tasks.
>>>>can it ?
>>>
>>>
>>>It could do that by manipulating nice which is already available in 
>>>the kernel.
>>>
>>>I.e. these patches plus improved statistics (which are coming, I hope) 
>>>together with the existing policy controls provide all that is 
>>>necessary to do comprehensive CPU resource control.  If there is an 
>>>efficient way to get the statistics out to user space (also coming, I 
>>>hope) this control could be exercised from user space.
>>
>>Could you please provide me with a link to the new improved statistics.
> 
> 
> No.  Read LKML and you'll know as much as I do.
> 
> 
>>What do the new statistics contain - any heads up on them?
> 
> 
> There're several contenders (including some from IBM) that periodically 
> post patches to LKML.  That's where I'm aware of them from.  As I say, 
> I'm hoping that they get together and come up with something generally 
> useful (as opposed to just meeting each contenders needs). I may be 
> being overly optimistic but you never know.

Yes, thats the whole point of the discussion and everybody is free to
participate.


> 
> Peter


-- 

	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [ckrm-tech] [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-02  3:21                     ` Peter Williams
  2006-06-02  8:32                       ` Balbir Singh
@ 2006-06-02 19:06                       ` Chandra Seetharaman
  2006-06-03  0:04                         ` Peter Williams
  1 sibling, 1 reply; 95+ messages in thread
From: Chandra Seetharaman @ 2006-06-02 19:06 UTC (permalink / raw)
  To: Peter Williams
  Cc: Peter Williams, Andrew Morton, dev, Srivatsa, ckrm-tech, balbir,
	Balbir Singh, Mike Galbraith, Sam Vilain, Con Kolivas,
	Linux Kernel, Kingsley Cheung, Eric W. Biederman, Ingo Molnar,
	Rene Herman

On Fri, 2006-06-02 at 13:21 +1000, Peter Williams wrote:
> Chandra Seetharaman wrote:
> > On Fri, 2006-06-02 at 09:26 +1000, Peter Williams wrote:
> >> Chandra Seetharaman wrote:
> >>> On Thu, 2006-06-01 at 14:04 +0530, Balbir Singh wrote:
> >>>> Hi, Kirill,
> >>>>
> >>>> Kirill Korotaev wrote:
> >>>>>> Do you have any documented requirements for container resource 
> >>>>>> management?
> >>>>>> Is there a minimum list of features and nice to have features for 
> >>>>>> containers
> >>>>>> as far as resource management is concerned?
> >>>>> Sure! You can check OpenVZ project (http://openvz.org) for example of 
> >>>>> required resource management. BTW, I must agree with other people here 
> >>>>> who noticed that per-process resource management is really useless and 
> >>>>> hard to use :(
> >>> I totally agree.
> >>>> I'll take a look at the references. I agree with you that it will be useful
> >>>> to have resource management for a group of tasks.
> >> But you don't need something as complex as CKRM either.  This capping
> > 
> > All CKRM^W Resource Groups does is to group unrelated/related tasks to a
> > group and apply resource limits. 
> > 
> >>  
> >> functionality coupled with (the lamented) PAGG patches (should have been 
> >> called TAGG for "task aggregation" instead of PAGG for "process 
> >> aggregation") would allow you to implement a kernel module that could 
> >> apply caps to arbitrary groups of tasks.
> > 
> > I do not follow how PAGG + this cap feature can be used to put cap of
> > related/unrelated tasks. Can you provide little more explanation,
> > please.
> 
> I would have thought it was fairly obvious.  PAGG supplies the task 
> aggregation mechanism, these patches provide per task caps and all 
> that's needed is the code that marries the two.

It may be obvious from your usage point of view. It wasn't for what I was
thinking of as resource management.

I thought there is some way the user can associate some amount of
resources (limits and guarantees) to a PAGG group and move the
corresponding tasks to that PAGG and that is all needed from user
space. 

In other words, I thought there was some clever way to manage resources at
the PAGG level (without needing to tinker with the per task caps), which
wasn't obvious to me; it is now clear that is not the case, and one
still has to keep tweaking the "per task" caps to get the result they
want.

From your explanation, complex stuff needs to happen in user space to
manage resources for a group of tasks.

Knobs that are available to the user are
 - per task nice values
 - per task cap limits and
 - per task statistics, if and when they become available.

A user-level application has to constantly monitor the stats of _all_ the
tasks and constantly keep changing the knobs if it wants to keep the
"group of tasks" within their guarantees and limits. As others have
pointed out already, this may still _not_ yield what one wants if you
have tasks with disparate needs for a resource.

I certainly do not see it as the result of a simple marriage between
PAGG and "per task caps".
> 
> > 
> > Also, i do not think it can provide guarantees to that group of tasks.
> > can it ?
> 
> It could do that by manipulating nice which is already available in the 
> kernel.
> 
> I.e. these patches plus improved statistics (which are coming, I hope) 
> together with the existing policy controls provide all that is necessary 
> to do comprehensive CPU resource control.  If there is an efficient way 
> to get the statistics out to user space (also coming, I hope) this 
> control could be exercised from user space.
> 
> Peter
-- 

----------------------------------------------------------------------
    Chandra Seetharaman               | Be careful what you choose....
              - sekharan@us.ibm.com   |      .......you may get it.
----------------------------------------------------------------------



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [ckrm-tech] [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-02  7:34                 ` Kirill Korotaev
@ 2006-06-02 21:23                   ` Shailabh Nagar
  0 siblings, 0 replies; 95+ messages in thread
From: Shailabh Nagar @ 2006-06-02 21:23 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: sekharan, Andrew Morton, dev, Srivatsa, ckrm-tech, balbir,
	Balbir Singh, Mike Galbraith, Linux Kernel, Con Kolivas,
	Sam Vilain, Kingsley Cheung, Eric W. Biederman, Rene Herman,
	Peter Williams

Kirill Korotaev wrote:

>>>>- disk I/O bandwidth:
>>>>we started to use CFQv2, but it is quite poor in this regard. First, it 
>>>>doesn't prioritizes writes and async disk operations :( And even for 
>>>>sync reads we found some problems we work on now...
>>>>        
>>>>
>
>  
>
>>CKRM (on e-series) had an implementation based on a modified CFQ
>>scheduler. Shailabh is currently working on porting that controller to
>>f-series.
>>    
>>
>can you explain what was changed by CKRM there? Did you made it to 
>control ASYNC read/writes? I don't think so...
>  
>
In e-series, CFQ was modified to
- maintain request queues per ckrm-class (now resource group) rather 
than per-tgid
- explicitly maintain I/O bandwidth of each request queue (in terms of 
I/O issued by the I/O scheduler)
- select the "next request queue to service" based on its I/O 
bandwidth...if a queue exceeds its allocation (as calculated
from the CKRM guarantee values), the queue gets skipped.

So this did not use the CFQ priority scheme as such and only implemented 
the "limit" part.

The current plan is to exploit the CFQ prio levels and rely on CFQ doing 
a good enough job in maintaining an adequate
bandwidth differential between those prio levels.
Again, each queue would maintain a count of its consumed bandwidth as 
well as its target bandwidth. While picking the next request
from the queue, if it's observed that the queue is above its "guarantee", 
its priority will get reduced (it'll still supply a request) while
a queue that is below its share will get bumped up....  Control will be 
much more gradual, but the basic idea is to leverage CFQ's priority
handling rather than supplant it (since we get anticipation in the form 
of time-slicing for free).
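
In pseudo-C the selection rule amounts to something like the following
(the structure and function names are invented for illustration and are
not the actual controller code):

struct class_queue {
        int base_prio;          /* configured priority, 0..7 */
        unsigned long consumed; /* I/O issued recently, e.g. in sectors */
        unsigned long target;   /* share derived from the guarantee */
};

static int effective_prio(const struct class_queue *cq)
{
        int prio = cq->base_prio;

        if (cq->consumed > cq->target)
                prio++;         /* over its share: serviced less often */
        else if (cq->consumed < cq->target)
                prio--;         /* under its share: bumped up */

        if (prio < 0)
                prio = 0;       /* clamp to the valid priority range */
        if (prio > 7)
                prio = 7;
        return prio;
}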

One concern is whether the time-slicing of CFQ plays well with queues 
that aren't organized by tgid...I'm still looking into that.

>Do you have any plots on what is concurrent bandwidth is depending on 
>weights? Because, our measurements show that CFQ is not ideal and 
>behaves poorly when prio 0,5,6,7 are used :/ Only 1,2,3,4 are really 
>linear-scalable...
>  
>
Interesting. What's the time-scale over which you expect I/O bandwidth 
rates to get enforced?

Perhaps the iosched discussion should use  a different thread....

--Shailabh



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [ckrm-tech] [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-02 18:58                           ` Balbir Singh
@ 2006-06-02 23:49                             ` Peter Williams
  2006-06-03  4:59                               ` Balbir Singh
  0 siblings, 1 reply; 95+ messages in thread
From: Peter Williams @ 2006-06-02 23:49 UTC (permalink / raw)
  To: balbir
  Cc: Peter Williams, Andrew Morton, dev, Srivatsa, sekharan,
	ckrm-tech, Balbir Singh, Mike Galbraith, Sam Vilain, Con Kolivas,
	Linux Kernel, Kingsley Cheung, Eric W. Biederman, Ingo Molnar,
	Rene Herman

Balbir Singh wrote:
> Peter Williams wrote:
>> Balbir Singh wrote:
>>
>>> Peter Williams wrote:
>>> <snip>
>>>
>>>>>> But you don't need something as complex as CKRM either.  This capping
>>>>>
>>>>> All CKRM^W Resource Groups does is to group unrelated/related tasks 
>>>>> to a
>>>>> group and apply resource limits.
>>>>>
>>>>>
>>>>>> functionality coupled with (the lamented) PAGG patches (should 
>>>>>> have been called TAGG for "task aggregation" instead of PAGG for 
>>>>>> "process aggregation") would allow you to implement a kernel 
>>>>>> module that could apply caps to arbitrary groups of tasks.
>>>>>
>>>>> I do not follow how PAGG + this cap feature can be used to put cap of
>>>>> related/unrelated tasks. Can you provide little more explanation,
>>>>> please.
>>>>
>>>>
>>>> I would have thought it was fairly obvious.  PAGG supplies the task 
>>>> aggregation mechanism, these patches provide per task caps and all 
>>>> that's needed is the code that marries the two.
>>>>
>>>
>>> The problem is that with per-task caps, if I have a resource group A
>>> and I want to limit it to 10%, I need to limit each task in resource
>>> group A to 10% (which makes resource groups not so useful). Is my
>>> understanding correct?
>>
>>
>> Well the general idea is correct but your maths is wrong.  You'd have 
>> to give each of them a cap somewhere between 10% and 10% divided by 
>> the number of tasks in group A.  Exactly where in that range would 
>> vary depending on the CPU demand of each task and would need to be 
>> adjusted dynamically (unless they were very boring tasks whose demands 
>> were constant over time).
>>
> 
> 
> Hmm.. I thought my math was reasonable (but there is always so much to 
> learn)
>  From your formula, if I have 1 task in group A, I need to provide it with
> a cap of b/w 10 to 11%. For two tasks, I need to give them b/w 10 to 10.5%.
> If I have a hundred, it needs to be b/w 10% and 10.01%

Now your arithmetic is failing you.  According to my formula:

1. With one task in group A you give it 10% which is what you get when 
you divide 10% by one.

2. With two tasks in group A you give them each somewhere between 5% 
(which is 10% divided by 2) and 10%.  If they are equally busy you give 
them each 5% and if they are not equally busy you give them larger caps.

Another, probably a better but more expensive, formula is to divide the 
10% between them in proportion to their demand.  Being careful not to 
give any of them a zero cap, of course.  I.e. in the two task 10% case 
they each get 5% if they are equally busy but if one is twice as busy as 
the other it gets a 6.6% cap and the other gets 3.3% (approximately).
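
Just to make the arithmetic concrete, here's the proportional version in
code (the 1% floor is an arbitrary choice for the example):

#include <stdio.h>

static double task_cap(double group_cap, double demand, double total_demand)
{
        double cap = group_cap * demand / total_demand;

        return cap < 1.0 ? 1.0 : cap;   /* never hand out a zero cap */
}

int main(void)
{
        /* Two tasks sharing a 10% group cap, one twice as busy as the
         * other; prints approximately 6.7% and 3.3%. */
        double demand[2] = { 2.0, 1.0 };
        double total = demand[0] + demand[1];
        int i;

        for (i = 0; i < 2; i++)
                printf("task %d cap: %.1f%%\n", i,
                       task_cap(10.0, demand[i], total));
        return 0;
}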

Peter
-- 
Dr Peter Williams, Chief Scientist         <peterw@aurema.com>
Aurema Pty Limited
Level 2, 130 Elizabeth St, Sydney, NSW 2000, Australia
Tel:+61 2 9698 2322  Fax:+61 2 9699 9174 http://www.aurema.com

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [ckrm-tech] [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-02 19:06                       ` Chandra Seetharaman
@ 2006-06-03  0:04                         ` Peter Williams
  0 siblings, 0 replies; 95+ messages in thread
From: Peter Williams @ 2006-06-03  0:04 UTC (permalink / raw)
  To: sekharan
  Cc: Andrew Morton, dev, Srivatsa, ckrm-tech, balbir, Balbir Singh,
	Mike Galbraith, Peter Williams, Con Kolivas, Sam Vilain,
	Kingsley Cheung, Eric W. Biederman, Ingo Molnar, Rene Herman,
	Linux Kernel

Chandra Seetharaman wrote:
> On Fri, 2006-06-02 at 13:21 +1000, Peter Williams wrote:
>> Chandra Seetharaman wrote:
>>> On Fri, 2006-06-02 at 09:26 +1000, Peter Williams wrote:
>>>> Chandra Seetharaman wrote:
>>>>> On Thu, 2006-06-01 at 14:04 +0530, Balbir Singh wrote:
>>>>>> Hi, Kirill,
>>>>>>
>>>>>> Kirill Korotaev wrote:
>>>>>>>> Do you have any documented requirements for container resource 
>>>>>>>> management?
>>>>>>>> Is there a minimum list of features and nice to have features for 
>>>>>>>> containers
>>>>>>>> as far as resource management is concerned?
>>>>>>> Sure! You can check OpenVZ project (http://openvz.org) for example of 
>>>>>>> required resource management. BTW, I must agree with other people here 
>>>>>>> who noticed that per-process resource management is really useless and 
>>>>>>> hard to use :(
>>>>> I totally agree.
>>>>>> I'll take a look at the references. I agree with you that it will be useful
>>>>>> to have resource management for a group of tasks.
>>>> But you don't need something as complex as CKRM either.  This capping
>>> All CKRM^W Resource Groups does is to group unrelated/related tasks to a
>>> group and apply resource limits. 
>>>
>>>>  
>>>> functionality coupled with (the lamented) PAGG patches (should have been 
>>>> called TAGG for "task aggregation" instead of PAGG for "process 
>>>> aggregation") would allow you to implement a kernel module that could 
>>>> apply caps to arbitrary groups of tasks.
>>> I do not follow how PAGG + this cap feature can be used to put cap of
>>> related/unrelated tasks. Can you provide little more explanation,
>>> please.
>> I would have thought it was fairly obvious.  PAGG supplies the task 
>> aggregation mechanism, these patches provide per task caps and all 
>> that's needed is the code that marries the two.
> 
> May be obvious from your usage point of view. It wasn't for what i was
> thinking as resource management.

I was thinking of it from the resource management POV.

> 
> I thought there is some way the user can associate some amount of
> resources (limits and guarantees) to a PAGG group and move the
> corresponding tasks to that PAGG and that is all needed from user
> space. 

No.  PAGG just provides the infrastructure for grouping tasks together 
and adding any extra per task data that you may require.  Of course, you 
can provide your own per group data as well.  It's then up to the author 
of the module utilizing PAGG to implement whatever group specific 
functionality that they want.

> 
> In other words i thought there is some clever way to manage resources at
> the PAGG level (without needing to tinker with the per task caps), which
> wasn't obvious for me, and it is now clear that is not the case, and one
> still have to keep tweaking the "per task" caps to get the result they
> want. 
> 
>>From your explanation, complex stuff need to happen in the user space to
> manage resource for a group of tasks.
> 
> Knobs that are available to the user are
>  - per task nice values
>  - per task cap limits and
>  - per task statistics, if and when they become available.
> 
> user level application has to constantly monitor the stats of _all_ the
> tasks and constantly keep changing the knobs if they want to keep the
> "group of tasks" within their guarantees and limits. As others pointed
> already, this may still _not_ yield what one wants, if you have tasks
> with disparate need for a resource.

I think that it can especially now that load balancing takes "nice" into 
account.  Without the smpnice patches it would have been difficult due 
to the effects of "soft" affinity tending to undermine "nice"'s 
functionality.

> 
> I certainly do not see it as the result of a simple marriage between
> PAGG and "per task caps".

It's simple, but there's a non-trivial amount of work required.

I know from previous discussion with CKRM folks that you find the idea 
of providing a number of small independent capabilities that enable more 
complex capabilities to be built on top of them difficult to come to 
terms with.  But that doesn't mean it won't work.  It has the advantage 
of being less intrusive and causing less angst to those users who don't 
want sophisticated resource control.

Peter
-- 
Dr Peter Williams, Chief Scientist         <peterw@aurema.com>
Aurema Pty Limited
Level 2, 130 Elizabeth St, Sydney, NSW 2000, Australia
Tel:+61 2 9698 2322  Fax:+61 2 9699 9174 http://www.aurema.com

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [ckrm-tech] [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-02 14:47                       ` Mike Galbraith
@ 2006-06-03  0:08                         ` Peter Williams
  2006-06-03  6:02                           ` Mike Galbraith
  2006-06-06 11:26                         ` Srivatsa Vaddagiri
  1 sibling, 1 reply; 95+ messages in thread
From: Peter Williams @ 2006-06-03  0:08 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: sekharan, balbir, dev, Andrew Morton, Srivatsa, Sam Vilain,
	ckrm-tech, Balbir Singh, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Eric W. Biederman, Ingo Molnar, Rene Herman

Mike Galbraith wrote:
> On Fri, 2006-06-02 at 23:18 +1000, Peter Williams wrote:
>> Mike Galbraith wrote:
>>> On Fri, 2006-06-02 at 15:55 +1000, Peter Williams wrote:
>>>> Chandra Seetharaman wrote:
>>>>> On Thu, 2006-06-01 at 14:04 +0530, Balbir Singh wrote:
>>>>>> Hi, Kirill,
>>>>>>
>>>>>> Kirill Korotaev wrote:
>>>>>>>> Do you have any documented requirements for container resource 
>>>>>>>> management?
>>>>>>>> Is there a minimum list of features and nice to have features for 
>>>>>>>> containers
>>>>>>>> as far as resource management is concerned?
>>>>>>> Sure! You can check OpenVZ project (http://openvz.org) for example of 
>>>>>>> required resource management. BTW, I must agree with other people here 
>>>>>>> who noticed that per-process resource management is really useless and 
>>>>>>> hard to use :(
>>>>> I totally agree.
>>>> "nice" seems to be doing quite nicely :-)
>>>>
>>>> To me this capping functionality is a similar functionality to that 
>>>> provided by "nice" and all that's needed to make it useful is a command 
>>>> (similar to "nice") that runs tasks with caps applied.
>>> Similar in that they are both inherited.  Very dissimilar in that the
>>> effect of nice is not altered by fork whereas the effect of a cap is.
>>>
>>> Consider make.  A cap on make itself isn't meaningful, and _any_ per
>>> task cap you put on it with the intent of managing the aggregate, is
>>> defeated by the argument -j.  Per task caps require omniscience to be
>>> effective in managing processes.  That's a pretty severe limitation.
>> These caps aren't trying to control aggregates but with suitable 
>> software they can be used to control aggregates.
> 
> How?  How would you deal with the make example with per task caps.

I'd build a resource management tool that uses task statistics, nice and 
caps to manage CPU resource allocation.  This could be a plug in kernel 
module or a user space daemon.  It doesn't need to be in the scheduler.
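
A very rough skeleton of the user space daemon variant might look like
the following.  The membership and usage-sampling functions are
placeholders, as is the /proc file name used to set a task's cap; the
point is only the shape of the control loop:

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

#define MAX_TASKS 256

/* Placeholder: a real tool would enumerate the group's current members
 * (e.g. make and all of the children its -j option spawns). */
static int group_members(pid_t *pids, int max)
{
        if (max < 1)
                return 0;
        pids[0] = getpid();     /* manage ourselves so the loop runs */
        return 1;
}

/* Placeholder: a real tool would take this from /proc/<pid>/stat or
 * from whatever task statistics interface eventually lands. */
static unsigned long recent_usage(pid_t pid)
{
        (void)pid;
        return 1;
}

/* Write to an ASSUMED per-task cap file; the real name may differ. */
static void set_task_cap(pid_t pid, unsigned long cap)
{
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/cpu_rate_cap", (int)pid);
        f = fopen(path, "w");
        if (f) {
                fprintf(f, "%lu\n", cap);
                fclose(f);
        }
}

int main(void)
{
        /* The group's total share, in whatever units the cap uses. */
        const unsigned long group_cap = 100;
        pid_t pids[MAX_TASKS];

        for (;;) {
                unsigned long usage[MAX_TASKS], total = 0;
                int i, n = group_members(pids, MAX_TASKS);

                for (i = 0; i < n; i++) {
                        usage[i] = recent_usage(pids[i]);
                        total += usage[i];
                }
                for (i = 0; i < n; i++) {
                        unsigned long cap = total ?
                                group_cap * usage[i] / total :
                                group_cap / n;
                        set_task_cap(pids[i], cap ? cap : 1);
                }
                sleep(1);
        }
}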

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [ckrm-tech] [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-02 23:49                             ` Peter Williams
@ 2006-06-03  4:59                               ` Balbir Singh
  0 siblings, 0 replies; 95+ messages in thread
From: Balbir Singh @ 2006-06-03  4:59 UTC (permalink / raw)
  To: Peter Williams
  Cc: Andrew Morton, dev, Srivatsa, sekharan, ckrm-tech, Linux Kernel,
	Balbir Singh, Mike Galbraith, Peter Williams, Con Kolivas,
	Sam Vilain, Kingsley Cheung, Eric W. Biederman, Ingo Molnar,
	Rene Herman

Peter Williams wrote:
>>>>
>>>>The problem is that with per-task caps, if I have a resource group A
>>>>and I want to limit it to 10%, I need to limit each task in resource
>>>>group A to 10% (which makes resource groups not so useful). Is my
>>>>understanding correct?
>>>
>>>
>>>Well the general idea is correct but your maths is wrong.  You'd have 
>>>to give each of them a cap somewhere between 10% and 10% divided by 
>>>the number of tasks in group A.  Exactly where in that range would 
>>>vary depending on the CPU demand of each task and would need to be 
>>>adjusted dynamically (unless they were very boring tasks whose demands 
>>>were constant over time).
>>>
>>
>>
>>Hmm.. I thought my math was reasonable (but there is always so much to 
>>learn)
>> From your formula, if I have 1 task in group A, I need to provide it with
>>a cap of b/w 10 to 11%. For two tasks, I need to give them b/w 10 to 10.5%.
>>If I have a hundred, it needs to be b/w 10% and 10.01%
> 
> 
> Now your arithmetic is failing you.  According to my formula:
> 
> 1. With one task in group A you give it 10% which is what you get when 
> you divide 10% by one.
> 
> 2. With two tasks in group A you give them each somewhere between 5% 
> (which is 10% divided by 2) and 10%.  If they are equally busy you give 
> them each 5% and if they are not equally busy you give them you give 
> them larger caps.

Yes, I understand. I misinterpreted what you said earlier. I see you
clearly meant the range [cap_of_the_group/number_of_tasks, cap_of_the_group]

> 
> Another, probably a better but more expensive, formula is to divide the 
> 10% between them in proportion to their demand.  Being careful not to 
> give any of them a zero cap, of course.  I.e. in the two task 10% case 
> they each get 5% if they are equally busy but if one is twice as busy as 
> the other it gets a 6.6% cap and the other gets 3.3% (approximately).
> 

Yes, that makes a lot of sense

> Peter

Thanks for clarifying.

-- 

	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [ckrm-tech] [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-03  0:08                         ` Peter Williams
@ 2006-06-03  6:02                           ` Mike Galbraith
  2006-06-03 11:03                             ` Peter Williams
  0 siblings, 1 reply; 95+ messages in thread
From: Mike Galbraith @ 2006-06-03  6:02 UTC (permalink / raw)
  To: Peter Williams
  Cc: sekharan, balbir, dev, Andrew Morton, Srivatsa, Sam Vilain,
	ckrm-tech, Balbir Singh, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Eric W. Biederman, Ingo Molnar, Rene Herman

On Sat, 2006-06-03 at 10:08 +1000, Peter Williams wrote:
> Mike Galbraith wrote:
> > How?  How would you deal with the make example with per task caps.
> 
> I'd build a resource management tool that uses task statistics, nice and 
> caps to manage CPU resource allocation.  This could be a plug in kernel 
> module or a user space daemon.  It doesn't need to be in the scheduler.

Ok, you _can_ gather statistics, and modify caps/nice on the fly... for
long running tasks.  How long does a task have to exist before you have
statistics for it so you can manage it?

Also, if you're going to need a separate resource manager to allocate,
monitor and modify in realtime, why not go whole hog, and allocate and
monitor instances of uml.  It'd be a heck of a lot easier. 

	-Mike


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [ckrm-tech] [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-03  6:02                           ` Mike Galbraith
@ 2006-06-03 11:03                             ` Peter Williams
  0 siblings, 0 replies; 95+ messages in thread
From: Peter Williams @ 2006-06-03 11:03 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: sekharan, balbir, dev, Andrew Morton, Srivatsa, Sam Vilain,
	ckrm-tech, Balbir Singh, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Eric W. Biederman, Ingo Molnar, Rene Herman

Mike Galbraith wrote:
> On Sat, 2006-06-03 at 10:08 +1000, Peter Williams wrote:
>> Mike Galbraith wrote:
>>> How?  How would you deal with the make example with per task caps.
>> I'd build a resource management tool that uses task statistics, nice and 
>> caps to manage CPU resource allocation.  This could be a plug in kernel 
>> module or a user space daemon.  It doesn't need to be in the scheduler.
> 
> Ok, you _can_ gather statistics, and modify caps/nice on the fly... for
> long running tasks.  How long does a task have to exist before you have
> statistics for it so you can manage it?

If the stats package is up to scratch it will provide stats for tasks 
that have exited so you will be able to charge their resource usage to 
the higher level entity and still manage that entity's usage properly 
via its other tasks.

> 
> Also, if you're going to need a separate resource manager to allocate,
> monitor and modify in realtime, why not go whole hog, and allocate and
> monitor instances of uml.  It'd be a heck of a lot easier. 

Or Xen.  Or Vmware.  That would be one solution and brings other 
functionality that may be desirable.  Of course, you can also do 
resource control within those instances as well :-).

"There's more than one way to skin a cat" as the old saying goes.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [ckrm-tech] [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-02  7:47                   ` Kirill Korotaev
  2006-06-02 13:34                     ` Peter Williams
@ 2006-06-05 22:11                     ` Sam Vilain
  2006-06-06  8:24                       ` Kirill Korotaev
  1 sibling, 1 reply; 95+ messages in thread
From: Sam Vilain @ 2006-06-05 22:11 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: Peter Williams, sekharan, Andrew Morton, Srivatsa, ckrm-tech,
	balbir, Balbir Singh, Mike Galbraith, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Eric W. Biederman, Ingo Molnar, Rene Herman

Kirill Korotaev wrote:

>>"nice" seems to be doing quite nicely :-)
>>    
>>
>I'm sorry, but nice never looked "nice" to me.
>Have you ever tried to "nice" apache server which spawns 500 
>processes/threads on a loaded machine?
>With nice you _can't_ impose limits or priority on the whole "apache".
>The more apaches you have the more useless their priorites and nices are...
>  
>

Yes but interactive admin processes will still get a large bonus
relative to the apache processes so you can still log in and kill the
apache storm off even with very large loads.

Sam.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [RFC 0/5] sched: Add CPU rate caps
  2006-05-30 22:05                     ` Sam Vilain
  2006-05-30 23:22                       ` Peter Williams
  2006-05-30 23:25                       ` Peter Williams
@ 2006-06-05 23:56                       ` Peter Williams
  2 siblings, 0 replies; 95+ messages in thread
From: Peter Williams @ 2006-06-05 23:56 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Björn Steinbrink, Mike Galbraith, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Ingo Molnar, Rene Herman, Herbert Poetzl,
	Kirill Korotaev, Eric W. Biederman

Sam Vilain wrote:
> Peter Williams wrote:
> 
>> They shouldn't interfere as which scheduler to use is a boot time 
>> selection and only one scheduler is in force.  It's mainly a coding 
>> matter and in particular whether the "scheduler driver" interface would 
>> need to be modified or whether your scheduler can be implemented using 
>> the current interface.
>>  
>>
> 
> Yes, that's the key issue I think - the interface now has more inputs.
> 
>>> I guess the big question is - is there a corresponding concept in
>>> PlugSched?  for instance, is there a reference in the task_struct to the
>>> current scheduling domain, or is it more CKRM-style with classification
>>> modules?
>>>
>>>
>> It uses the standard run queue structure with per scheduler
>> modifications (via a union) to handle the different ways that the
>> schedulers manage priority arrays (so yes). As I said it restricts
>> itself to scheduling matters within each run queue and leaves the
>> wider aspects to the normal code.
> 
> 
> Ok, so there is no existing "classification" abstraction?  The
> classification is tied to the scheduler implementation?
> 
>> At first guess, it sounds like adding your scheduler could be as simple 
>> as taking a copy of ingosched.c (which is the implementation of the 
>> standard scheduler within PlugSched) and then making your modifications. 
>>  You could probably even share the same run queue components but 
>> there's nothing to stop you adding new ones.
>>
>> Each scheduler can also have its own per task data via a union in the 
>> task struct.
>>  
>>
> 
> Ok, sounds like that problem is solved - just the classification one
> remaining.
> 
>> OK.  I'm waiting for the next -mm kernel before I make the next release.
>>  
>>
> 
> Looking forward to it.

A gzipped tar file containing the patch series (against 2.6.17-rc5-mm3) 
in a form suitable for use with quilt is now available at:

<http://prdownloads.sourceforge.net/cpuse/plugsched-6.3.2-for-2.6.17-rc5-mm3.series.tar.gz?download>

It's still a bit light in the description area but I figured that it's 
better than nothing.  Hopefully, the patch names give some idea of their 
purpose.
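
As an aside, to illustrate the per-scheduler data point from the quoted 
discussion above: the sketch below shows the general shape of keeping 
per-scheduler task data in a union.  The struct and field names are 
invented for this example and are not PlugSched's actual identifiers.

/*
 * Illustration only: each plugged-in scheduler keeps its own per-task
 * fields in a union, so adding a scheduler doesn't bloat the task
 * structure for the others.  All names here are made up.
 */
#include <stdio.h>

struct ingo_task_data {			/* the standard scheduler's fields */
	unsigned int time_slice;
	unsigned int sleep_avg;
};

struct caps_task_data {			/* a caps-aware scheduler's fields */
	unsigned int cpu_rate_cap;	/* parts per thousand */
	unsigned long long avg_usage;
};

struct demo_task {
	int pid;
	union {				/* only the booted scheduler's member is used */
		struct ingo_task_data ingo;
		struct caps_task_data caps;
	} sched_data;
};

int main(void)
{
	struct demo_task t = { .pid = 1234 };

	t.sched_data.caps.cpu_rate_cap = 250;	/* 25% of one CPU */
	printf("task %d: cap %u/1000\n", t.pid, t.sched_data.caps.cpu_rate_cap);
	return 0;
}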

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [ckrm-tech] [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-05 22:11                     ` Sam Vilain
@ 2006-06-06  8:24                       ` Kirill Korotaev
  2006-06-06  9:13                         ` Con Kolivas
  0 siblings, 1 reply; 95+ messages in thread
From: Kirill Korotaev @ 2006-06-06  8:24 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Kirill Korotaev, Peter Williams, sekharan, Andrew Morton,
	Srivatsa, ckrm-tech, balbir, Balbir Singh, Mike Galbraith,
	Con Kolivas, Linux Kernel, Kingsley Cheung, Eric W. Biederman,
	Ingo Molnar, Rene Herman

>>I'm sorry, but nice never looked "nice" to me.
>>Have you ever tried to "nice" an apache server which spawns 500 
>>processes/threads on a loaded machine?
>>With nice you _can't_ impose limits or priority on the whole "apache".
>>The more apaches you have, the more useless their priorities and nices are...
>> 
>>
> 
> 
> Yes but interactive admin processes will still get a large bonus
> relative to the apache processes so you can still log in and kill the
> apache storm off even with very large loads.

And how do you plan to manage it: log in every time apache is working 
too hard and kill processes? The manageability of such solutions sucks...

Kirill


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [ckrm-tech] [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-06  8:24                       ` Kirill Korotaev
@ 2006-06-06  9:13                         ` Con Kolivas
  2006-06-06  9:28                           ` Kirill Korotaev
  0 siblings, 1 reply; 95+ messages in thread
From: Con Kolivas @ 2006-06-06  9:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kirill Korotaev, Sam Vilain, Kirill Korotaev, Peter Williams,
	sekharan, Andrew Morton, Srivatsa, ckrm-tech, balbir,
	Balbir Singh, Mike Galbraith, Kingsley Cheung, Eric W. Biederman,
	Ingo Molnar, Rene Herman

On Tuesday 06 June 2006 18:24, Kirill Korotaev wrote:
> >>I'm sorry, but nice never looked "nice" to me.
> >>Have you ever tried to "nice" an apache server which spawns 500
> >>processes/threads on a loaded machine?
> >>With nice you _can't_ impose limits or priority on the whole "apache".
> >>The more apaches you have, the more useless their priorities and nices
> >> are...
> >
> > Yes but interactive admin processes will still get a large bonus
> > relative to the apache processes so you can still log in and kill the
> > apache storm off even with very large loads.
>
> And how do you plan to manage it: log in every time apache is working
> too hard and kill processes? The manageability of such solutions sucks...

What a strange discussion. I simply impose limits on processes and connections 
on my grossly underpowered server.

/me shrugs

-- 
-ck

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [ckrm-tech] [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-06  9:13                         ` Con Kolivas
@ 2006-06-06  9:28                           ` Kirill Korotaev
  0 siblings, 0 replies; 95+ messages in thread
From: Kirill Korotaev @ 2006-06-06  9:28 UTC (permalink / raw)
  To: Con Kolivas
  Cc: linux-kernel, Sam Vilain, Kirill Korotaev, Peter Williams,
	sekharan, Andrew Morton, Srivatsa, ckrm-tech, balbir,
	Balbir Singh, Mike Galbraith, Kingsley Cheung, Eric W. Biederman,
	Ingo Molnar, Rene Herman

>>>Yes but interactive admin processes will still get a large bonus
>>>relative to the apache processes so you can still log in and kill the
>>>apache storm off even with very large loads.
>>
>>And how do you plan to manage it: log in every time apache is working
>>too hard and kill processes? The manageability of such solutions sucks...
> 
> What a strange discussion. I simply impose limits on processes and connections 
> on my grossly underpowered server.
This works when you are the administrator of a single Linux machine. Now 
imagine you are running Virtual Environments (VEs), each with its own root 
and users. You can't, and don't want to, control what and how people are 
running. Sure, you limit the number of processes, but usually this won't 
be fewer than 50-100 processes per VE, so a single VE can lead to 50 
tasks in a running state and the total number of tasks in the system can 
be as high as 10,000. People can run setiathome or any other sh$t which 
consumes CPU, but the result is always the same - a huge number of running 
tasks leads to overall slowdown. So this is the case where you want to 
limit _users_ or VEs, not _single_ tasks. I don't think you will succeed 
in managing 10,000 tasks when 100 active users change the load on a 
daily basis.

Hope that makes it clearer.

Thanks,
Kirill

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [ckrm-tech] [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-02 14:47                       ` Mike Galbraith
  2006-06-03  0:08                         ` Peter Williams
@ 2006-06-06 11:26                         ` Srivatsa Vaddagiri
  1 sibling, 0 replies; 95+ messages in thread
From: Srivatsa Vaddagiri @ 2006-06-06 11:26 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Williams, sekharan, balbir, dev, Andrew Morton, Sam Vilain,
	ckrm-tech, Balbir Singh, Con Kolivas, Linux Kernel,
	Kingsley Cheung, Eric W. Biederman, Ingo Molnar, Rene Herman

On Fri, Jun 02, 2006 at 04:47:19PM +0200, Mike Galbraith wrote:
> > > Consider make.  A cap on make itself isn't meaningful, and _any_ per
> > > task cap you put on it with the intent of managing the aggregate, is
> > > defeated by the argument -j.  Per task caps require omniscience to be
> > > effective in managing processes.  That's a pretty severe limitation.
> > 
> > These caps aren't trying to control aggregates but with suitable 
> > software they can be used to control aggregates.
> 
> How?  How would you deal with the make example with per-task caps?

If we add some grouping mechanism for tasks (CKRM or PAGG), then this 
could be handled easily by adjusting the per-task limit based on the
number of tasks in the group. For example, when make is started, it
could be the only task in the group, with a per-task limit (and group_limit)
of 50%. As it forks more tasks, the per-task limit of every task in the
group is adjusted (lazily, perhaps at the next scheduler tick) to
group_limit/num_of_tasks_in_group.

This would still require a resource control daemon to adjust the
per-task limit of tasks within a group (if some task is underutilizing
its bandwidth, for example).
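
To make the arithmetic concrete, here is a minimal sketch of the lazy 
adjustment (the names task_group, group_limit and cpu_rate_cap are made 
up for this example; they are not from the caps patches or from 
CKRM/PAGG):

/*
 * Sketch only: give every task in a group an equal share of the
 * group's CPU cap, recomputed whenever the group's membership changes
 * (or lazily at the next scheduler tick).
 */
#include <stdio.h>

struct cap_task {
	unsigned int cpu_rate_cap;	/* per-task cap, parts per thousand */
};

struct task_group {
	unsigned int group_limit;	/* whole-group cap, parts per thousand */
	unsigned int nr_tasks;
	struct cap_task *tasks;
};

static void recompute_group_caps(struct task_group *grp)
{
	unsigned int i, share;

	if (grp->nr_tasks == 0)
		return;
	share = grp->group_limit / grp->nr_tasks;
	for (i = 0; i < grp->nr_tasks; i++)
		grp->tasks[i].cpu_rate_cap = share;
}

int main(void)
{
	struct cap_task tasks[4] = { { 0 } };
	struct task_group make_group = { 500, 1, tasks };	/* make alone: 50% */

	recompute_group_caps(&make_group);
	printf("1 task : %u/1000\n", tasks[0].cpu_rate_cap);

	make_group.nr_tasks = 4;		/* make -j forks three more jobs */
	recompute_group_caps(&make_group);
	printf("4 tasks: %u/1000 each\n", tasks[0].cpu_rate_cap);
	return 0;
}

With a 50% group limit this prints 500/1000 for the lone make task and 
125/1000 each once three more jobs have been forked.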


-- 
Regards,
vatsa

^ permalink raw reply	[flat|nested] 95+ messages in thread

end of thread, other threads:[~2006-06-06 11:27 UTC | newest]

Thread overview: 95+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-05-26  4:20 [RFC 0/5] sched: Add CPU rate caps Peter Williams
2006-05-26  4:20 ` [RFC 1/5] sched: Fix priority inheritence before CPU rate soft caps Peter Williams
2006-05-26  4:20 ` [RFC 2/5] sched: Add " Peter Williams
2006-05-26 10:48   ` Con Kolivas
2006-05-26 11:15     ` Mike Galbraith
2006-05-26 11:17       ` Con Kolivas
2006-05-26 11:30         ` Mike Galbraith
2006-05-26 13:55     ` Peter Williams
2006-05-27  6:31   ` Balbir Singh
2006-05-27  7:03     ` Peter Williams
2006-05-28  0:11       ` Peter Williams
2006-05-28  7:38         ` Balbir Singh
2006-05-28 13:35           ` Peter Williams
2006-05-28 14:42             ` Balbir Singh
2006-05-28 23:27               ` Peter Williams
2006-05-31 13:17                 ` Kirill Korotaev
2006-05-31 23:39                   ` Peter Williams
2006-06-01  8:09                     ` Kirill Korotaev
2006-06-01 23:38                       ` Peter Williams
2006-06-02  1:35                         ` Peter Williams
2006-05-26  4:20 ` [RFC 3/5] sched: Add CPU rate hard caps Peter Williams
2006-05-26  6:58   ` Kari Hurtta
2006-05-27  1:00     ` Peter Williams
2006-05-26 11:00   ` Con Kolivas
2006-05-26 13:59     ` Peter Williams
2006-05-26 14:12       ` Con Kolivas
2006-05-26 14:23       ` Mike Galbraith
2006-05-27  0:16         ` Peter Williams
2006-05-27  9:28           ` Mike Galbraith
2006-05-28  2:09             ` Peter Williams
2006-05-27  6:48   ` Balbir Singh
2006-05-27  8:44     ` Peter Williams
2006-05-31 13:10       ` Kirill Korotaev
2006-05-31 15:59         ` Balbir Singh
2006-05-31 18:09           ` Mike Galbraith
2006-06-01  7:41           ` Kirill Korotaev
2006-06-01  8:34             ` Balbir Singh
2006-06-01 18:43               ` [ckrm-tech] " Chandra Seetharaman
2006-06-01 23:26                 ` Peter Williams
2006-06-02  2:02                   ` Chandra Seetharaman
2006-06-02  3:21                     ` Peter Williams
2006-06-02  8:32                       ` Balbir Singh
2006-06-02 13:30                         ` Peter Williams
2006-06-02 18:58                           ` Balbir Singh
2006-06-02 23:49                             ` Peter Williams
2006-06-03  4:59                               ` Balbir Singh
2006-06-02 19:06                       ` Chandra Seetharaman
2006-06-03  0:04                         ` Peter Williams
2006-06-02  0:36                 ` Con Kolivas
2006-06-02  2:03                   ` [ckrm-tech] " Chandra Seetharaman
2006-06-02  5:55                 ` [ckrm-tech] [RFC 3/5] " Peter Williams
2006-06-02  7:47                   ` Kirill Korotaev
2006-06-02 13:34                     ` Peter Williams
2006-06-05 22:11                     ` Sam Vilain
2006-06-06  8:24                       ` Kirill Korotaev
2006-06-06  9:13                         ` Con Kolivas
2006-06-06  9:28                           ` Kirill Korotaev
2006-06-02  8:46                   ` Mike Galbraith
2006-06-02 13:18                     ` Peter Williams
2006-06-02 14:47                       ` Mike Galbraith
2006-06-03  0:08                         ` Peter Williams
2006-06-03  6:02                           ` Mike Galbraith
2006-06-03 11:03                             ` Peter Williams
2006-06-06 11:26                         ` Srivatsa Vaddagiri
2006-06-02  7:34                 ` Kirill Korotaev
2006-06-02 21:23                   ` Shailabh Nagar
2006-06-01 23:47               ` Sam Vilain
2006-06-01 23:43           ` Peter Williams
2006-05-31 23:28         ` Peter Williams
2006-06-01  7:44           ` Kirill Korotaev
2006-06-01 23:21             ` Peter Williams
2006-05-26  4:21 ` [RFC 4/5] sched: Add procfs interface for CPU rate soft caps Peter Williams
2006-05-26  4:21 ` [RFC 5/5] sched: Add procfs interface for CPU rate hard caps Peter Williams
2006-05-26  8:04 ` [RFC 0/5] sched: Add CPU rate caps Mike Galbraith
2006-05-26 16:11   ` Björn Steinbrink
2006-05-28 22:46     ` Sam Vilain
2006-05-28 23:30       ` Peter Williams
2006-05-29  3:09         ` Sam Vilain
2006-05-29  3:41           ` Peter Williams
2006-05-29 21:16             ` Sam Vilain
2006-05-29 23:12               ` Peter Williams
2006-05-30  2:07                 ` Sam Vilain
2006-05-30  2:45                   ` Peter Williams
2006-05-30 22:05                     ` Sam Vilain
2006-05-30 23:22                       ` Peter Williams
2006-05-30 23:25                       ` Peter Williams
2006-06-05 23:56                       ` Peter Williams
2006-05-27  0:16   ` Peter Williams
2006-05-26 10:41 ` Con Kolivas
2006-05-27  1:28   ` Peter Williams
2006-05-27  1:42     ` Con Kolivas
2006-05-26 11:09 ` Con Kolivas
2006-05-26 14:00   ` Peter Williams
2006-05-26 11:29 ` Balbir Singh
2006-05-27  1:40   ` Peter Williams

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).