* [PATCH 0/4] sched: Add CPU rate caps
@ 2006-06-22  1:52 Peter Williams
  2006-06-22  1:52 ` [PATCH 1/4] sched: Add CPU rate soft caps Peter Williams
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Peter Williams @ 2006-06-22  1:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Con Kolivas, Sam Vilain, Balbir Singh, Srivatsa, Peter Williams,
	Kirill Korotaev, Linux Kernel, Mike Galbraith, Kingsley Cheung,
	CKRM, Ingo Molnar, Rene Herman

These patches implement CPU usage rate limits for tasks.

Although the rlimit mechanism already has a CPU usage limit (RLIMIT_CPU),
it is a total usage limit and therefore (to my mind) not very useful.
These patches provide an alternative whereby the (recent) average CPU
usage rate of a task can be limited to a (per task) specified proportion
of a single CPU's capacity.  The limits are specified in parts per
thousand and come in two varieties -- hard and soft.  The difference
between the two is that the system tries to enforce hard caps regardless
of the other demand for CPU resources but allows soft caps to be
exceeded if there are spare CPU resources available.  By default, tasks
have both caps set to 1000 (i.e. no limit), but newly forked tasks
inherit any caps that have been imposed on their parent.  The minimum
soft cap allowed is 0 (which effectively puts the task in the
background) and the minimum hard cap allowed is 1.
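
As a concrete sketch of the semantics (using the in-kernel interfaces
that these patches add, with error handling omitted and both config
options assumed to be enabled):

	/* Caps are expressed in parts per thousand of one CPU. */
	set_cpu_rate_cap(p, 250);	/* soft: >25% only when there is
					 * spare CPU capacity */
	set_cpu_rate_hard_cap(p, 500);	/* hard: never more than 50% */
	set_cpu_rate_cap(p, 0);		/* soft cap 0: background task */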

Care has been taken to minimize the overhead inflicted on tasks that
have no caps, and my tests using kernbench indicate that it is hidden
in the noise.

Note:

1. Caps are not enforced for SCHED_FIFO and SCHED_RR tasks.

2. This version incorporates improvements and bug fixes as a result of
feedback from an earlier post.  Special thanks to Con Kolivas whose
suggestions with respect to improved methods for avoiding starvation
and priority inversion have enabled cap enforcement to be stricter.

3. To reduce overhead for uncapped tasks, CPU usage rate statistics
are only kept for tasks that have a hard or soft cap of less than 1000
parts per thousand.  However, these statistics in conjunction with the
weighted load values for tasks and run queues can be used to determine
whether a task is getting a "fair" share of CPU, i.e. the ratio of the
task's load weight to the run queue's total load weight should be the
same as the ratio of its average on-CPU time per cycle to the average
length of its cycle (see the sketch after these notes).  This could be
of use to the normal scheduling code.

4. This patch is against 2.6.17-mm1.

5. Overhead Measurements.  To measure the overhead introduced by these
patches, kernbench was used on a dual 500MHz Centrino SMP system.
Runs were done for a kernel without these patches applied, one with
the patches applied but no caps in use, and one with the patches
applied and running kernbench with a soft cap of zero (which would be
inherited by all its children).

Average Optimal -j 8 Load Run:

                  Vanilla          Patch Applied    Soft Cap 0%

Elapsed Time      1056.1   (1.92)  1048.2   (0.62)  1064.1   (1.59)
User Time         1908.1   (1.09)  1895.2   (1.30)  1926.6   (1.39)
System Time        181.7   (0.60)   177.5   (0.74)   173.8   (1.07)
   Total          2089.8           2072.7           2100.4
Percent CPU        197.6   (0.55)   197.0   (0)      197.0   (0)
Context Switches 49253.6 (136.31) 48881.4  (92.03) 92490.8 (163.71)
Sleeps           28038.8 (228.11) 28136.0 (250.65) 25769.4 (280.40)

As can be seen, there is no significant overhead penalty for having
these patches applied and leaving them unused (in fact, based on
these numbers, there's actually a small improvement).  Similarly,
the overhead for running kernbench as a background (soft cap of
zero) job is not significant.  As expected, the context switches for
the background run are roughly double (because ANY OTHER task
running on the machine is able to preempt the kernbench tasks)
but have not seriously affected the CPU usage statistics.  The
similarity of the "Percent CPU" numbers indicates that load balancing
hasn't been adversely affected.

6. Code size measurements:

Vanilla kernel:

   text    data     bss     dec     hex filename
  30305    4881     704   35890    8c32 kernel/sched.o
    907       0       0     907     38b kernel/mutex.o
  10183    3424       0   13607    3527 fs/proc/base.o

Patches applied:

   text    data     bss     dec     hex filename
  33027    4913     704   38644    96f4 kernel/sched.o
    979       0       0     979     3d3 kernel/mutex.o
  10962    3712       0   14674    3952 fs/proc/base.o

These figures indicate that the size cost of the patch proper is about
2.7 kilobytes and that the procfs interface costs about another 0.8
kilobytes.  This is a reduction of 0.7 kilobytes on the previous
version and is due to a rationalization of the interface code.

7. A patch to use task watchers to implement per process (as opposed to
per task) caps is under development to demonstrate the suitability of
these mechanisms for implementing higher level capping from outside the
scheduling code.
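
As a sketch of the fairness test mentioned in note 3 (the run queue
field and the helper name here are illustrative, not part of these
patches):

	/*
	 * A task is getting at least its "fair" share of CPU when its
	 * share of the run queue's total load weight is matched by its
	 * share of CPU time, i.e. when
	 *
	 *	avg_cpu_per_cycle / avg_cycle_length >=
	 *		load_weight / total_load_weight
	 *
	 * Cross multiplying avoids the divisions:
	 */
	static inline int task_getting_fair_share(const task_t *p,
						  const runqueue_t *rq)
	{
		return p->avg_cpu_per_cycle * rq->raw_weighted_load >=
		       p->avg_cycle_length * p->load_weight;
	}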

Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>


-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce


* [PATCH 1/4] sched: Add CPU rate soft caps
  2006-06-22  1:52 [PATCH 0/4] sched: Add CPU rate caps Peter Williams
@ 2006-06-22  1:52 ` Peter Williams
  2006-06-22  1:52 ` [PATCH 2/4] sched: Add CPU rate hard caps Peter Williams
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Peter Williams @ 2006-06-22  1:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Con Kolivas, Sam Vilain, Peter Williams, Srivatsa, Balbir Singh,
	Kirill Korotaev, Linux Kernel, Mike Galbraith, Kingsley Cheung,
	CKRM, Ingo Molnar, Rene Herman

This patch implements (soft) CPU rate caps per task as a proportion of a
single CPU's capacity expressed in parts per thousand.  The CPU usage
of capped tasks is determined by using Kalman filters to calculate the
(recent) average lengths of the task's scheduling cycle and the time
spent on the CPU each cycle and taking the ratio of the latter to the
former.  To minimize overhead associated with uncapped tasks these
statistics are not kept for them.
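
In outline (simplified from inc_cap_stats_both(), decay_cap_stats() and
task_exceeding_cap() in the patch below), the bookkeeping amounts to:

	/*
	 * At the end of each scheduling cycle the accumulated times are
	 * decayed by 255/256 (CAP_STATS_OFFSET == 8), yielding recency
	 * weighted averages in sched_clock() units:
	 */
	p->avg_cycle_length  = (p->avg_cycle_length  * 255) >> 8;
	p->avg_cpu_per_cycle = (p->avg_cpu_per_cycle * 255) >> 8;

	/*
	 * The cap is exceeded when cpu/cycle > cap/1000, which is
	 * tested without division as:
	 */
	exceeding = p->avg_cpu_per_cycle * CPU_CAP_ONE >
		    p->avg_cycle_length * p->cpu_rate_cap;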

Notes:

1. To minimize the overhead of the test that skips caps processing for
uncapped tasks, a new flag PF_HAS_CAP has been added to the task's
flags.

2. The implementation adds two priority slots to the run queue priority
arrays, which means that MAX_PRIO no longer represents the scheduling
priority of the idle process and can't be used to test whether priority
values are in the valid range.  To alleviate this problem a new function
sched_idle_prio() has been provided (see the layout sketch after these
notes).

3. Thanks to suggestions from Con Kolivas with respect to alternative
methods to reduce the possibility of a task being starved of CPU while
holding an important system resource, enforcement of caps is now
quite strict.  However, there will still be occasions where caps may be
exceeded due to this mechanism vetoing enforcement.

4. Modified the load weight for capped tasks to be a minimum of 1 so
that load balancing still works when only background tasks are running.
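
For reference, the two extra priority slots mentioned in note 2 give
the following layout (derived from the #defines in the patch):

	/*
	 *	0 .. MAX_RT_PRIO-1		real time tasks
	 *	MAX_RT_PRIO .. MAX_PRIO-1	normal tasks
	 *	CAPPED_PRIO (== MAX_PRIO)	tasks exceeding a cap
	 *	BGND_PRIO (== MAX_PRIO + 1)	background tasks (cap 0)
	 *	IDLE_PRIO (== MAX_PRIO + 2)	idle thread and delimiter
	 */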

Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>

 include/linux/sched.h  |   22 ++
 init/Kconfig           |    2 
 kernel/Kconfig.caps    |   13 +
 kernel/mutex.c         |   33 +++
 kernel/rtmutex-debug.c |    4 
 kernel/sched.c         |  415 +++++++++++++++++++++++++++++++++++++++++++++----
 6 files changed, 456 insertions(+), 33 deletions(-)

Index: MM-2.6.17-mm1/include/linux/sched.h
===================================================================
--- MM-2.6.17-mm1.orig/include/linux/sched.h	2006-06-22 09:42:05.000000000 +1000
+++ MM-2.6.17-mm1/include/linux/sched.h	2006-06-22 10:21:08.000000000 +1000
@@ -496,6 +496,12 @@ struct signal_struct {
 #define has_rt_policy(p) \
 	unlikely((p)->policy != SCHED_NORMAL && (p)->policy != SCHED_BATCH)
 
+#ifdef CONFIG_CPU_RATE_CAPS
+int sched_idle_prio(void);
+#else
+#define sched_idle_prio()	MAX_PRIO
+#endif
+
 /*
  * Some day this will be a full-fledged user tracking system..
  */
@@ -794,6 +800,11 @@ struct task_struct {
 	unsigned long sleep_avg;
 	unsigned long long timestamp, last_ran;
 	unsigned long long sched_time; /* sched_clock time spent running */
+#ifdef CONFIG_CPU_RATE_CAPS
+	unsigned long long avg_cpu_per_cycle, avg_cycle_length;
+	unsigned int cpu_rate_cap;
+	unsigned int mutexes_held;
+#endif
 	enum sleep_type sleep_type;
 
 	unsigned long policy;
@@ -1009,6 +1020,15 @@ struct task_struct {
 #endif
 };
 
+#ifdef CONFIG_CPU_RATE_CAPS
+static inline unsigned int get_cpu_rate_cap(const struct task_struct *p)
+{
+	return p->cpu_rate_cap;
+}
+
+int set_cpu_rate_cap(struct task_struct *, unsigned int);
+#endif
+
 static inline pid_t process_group(struct task_struct *tsk)
 {
 	return tsk->signal->pgrp;
@@ -1065,6 +1085,8 @@ static inline void put_task_struct(struc
 #define PF_SWAPWRITE	0x00800000	/* Allowed to write to swap */
 #define PF_SPREAD_PAGE	0x01000000	/* Spread page cache over cpuset */
 #define PF_SPREAD_SLAB	0x02000000	/* Spread some slab caches over cpuset */
+#define PF_HAS_CAP	0x04000000	/* Has a CPU rate cap */
+#define PF_UIWAKE	0x08000000	/* Just woken from uninterruptible sleep */
 #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
 #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
 
Index: MM-2.6.17-mm1/init/Kconfig
===================================================================
--- MM-2.6.17-mm1.orig/init/Kconfig	2006-06-22 09:27:21.000000000 +1000
+++ MM-2.6.17-mm1/init/Kconfig	2006-06-22 10:17:47.000000000 +1000
@@ -297,6 +297,8 @@ config RELAY
 
 	  If unsure, say N.
 
+source "kernel/Kconfig.caps"
+
 source "usr/Kconfig"
 
 config UID16
Index: MM-2.6.17-mm1/kernel/Kconfig.caps
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ MM-2.6.17-mm1/kernel/Kconfig.caps	2006-06-22 10:17:47.000000000 +1000
@@ -0,0 +1,13 @@
+#
+# CPU Rate Caps Configuration
+#
+
+config CPU_RATE_CAPS
+	bool "Support (soft) CPU rate caps"
+	default y
+	---help---
+	  Say y here if you wish to be able to put a (soft) upper limit on
+	  the rate of CPU usage by individual tasks.  A task which has been
+	  allocated a soft CPU rate cap will be limited to that rate of CPU
+	  usage unless there are spare CPU resources available after the needs
+	  of uncapped tasks are met.
Index: MM-2.6.17-mm1/kernel/sched.c
===================================================================
--- MM-2.6.17-mm1.orig/kernel/sched.c	2006-06-22 09:27:21.000000000 +1000
+++ MM-2.6.17-mm1/kernel/sched.c	2006-06-22 10:26:24.000000000 +1000
@@ -58,6 +58,19 @@
 
 #include <asm/unistd.h>
 
+#ifdef CONFIG_CPU_RATE_CAPS
+#define IDLE_PRIO	(MAX_PRIO + 2)
+
+int sched_idle_prio(void)
+{
+	return IDLE_PRIO;
+}
+#else
+#define IDLE_PRIO	MAX_PRIO
+#endif
+#define BGND_PRIO	(IDLE_PRIO - 1)
+#define CAPPED_PRIO	(IDLE_PRIO - 2)
+
 /*
  * Convert user-nice values [ -20 ... 0 ... 19 ]
  * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
@@ -187,6 +200,167 @@ static inline unsigned int task_timeslic
 	return static_prio_timeslice(p->static_prio);
 }
 
+#ifdef CONFIG_CPU_RATE_CAPS
+#define CPU_CAP_ONE 1000
+#define CAP_STATS_OFFSET 8
+#define task_has_cap(p) unlikely((p)->flags & PF_HAS_CAP)
+/* this assumes that p is not a real time task */
+#define task_is_background(p) unlikely((p)->cpu_rate_cap == 0)
+#define task_being_capped(p) unlikely((p)->prio >= CAPPED_PRIO)
+#define cap_load_weight(p) \
+	(max((int)(((p)->cpu_rate_cap * SCHED_LOAD_SCALE) / CPU_CAP_ONE), 1))
+#define safe_to_enforce_cap(p) \
+	(!((p)->mutexes_held || (p)->flags & (PF_FREEZE | PF_UIWAKE)))
+
+static void init_cpu_rate_caps(task_t *p)
+{
+	p->cpu_rate_cap = CPU_CAP_ONE;
+	p->flags &= ~PF_HAS_CAP;
+	p->mutexes_held = 0;
+}
+
+static inline void set_cap_flag(task_t *p)
+{
+	if (p->cpu_rate_cap < CPU_CAP_ONE && !has_rt_policy(p))
+		p->flags |= PF_HAS_CAP;
+	else
+		p->flags &= ~PF_HAS_CAP;
+}
+
+static inline int task_exceeding_cap(const task_t *p)
+{
+	return (p->avg_cpu_per_cycle * CPU_CAP_ONE) > (p->avg_cycle_length * p->cpu_rate_cap);
+}
+
+#ifdef CONFIG_SCHED_SMT
+/*
+ * task time slice for SMT dependent idle purposes
+ */
+static unsigned int smt_timeslice(task_t *p)
+{
+	if (task_being_capped(p))
+		return (p->cpu_rate_cap * DEF_TIMESLICE) / CPU_CAP_ONE;
+
+	return task_timeslice(p);
+}
+
+/*
+ * Is thisp a higher priority task than thatp for SMT dependent idle
+ * purposes?
+ */
+static int task_priority_gt(const task_t *thisp, const task_t *thatp)
+{
+	if (task_being_capped(thisp))
+	    return thisp->prio < thatp->prio;
+
+	if (task_being_capped(thatp))
+	    return 1;
+
+	return thisp->static_prio < thatp->static_prio;
+}
+#endif
+
+/*
+ * Update usage stats to "now" before making comparison
+ * Assume: task is actually on a CPU
+ */
+static int task_exceeding_cap_now(const task_t *p, unsigned long long now,
+				  int oncpu)
+{
+	unsigned long long delta, rhs;
+	unsigned long long cpc = p->avg_cpu_per_cycle;
+
+	delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
+	rhs = (p->avg_cycle_length + delta) * p->cpu_rate_cap;
+	if (oncpu)
+		cpc += delta;
+
+	return cpc * CPU_CAP_ONE > rhs;
+}
+
+static inline void init_cap_stats(task_t *p)
+{
+	p->avg_cpu_per_cycle = 0;
+	p->avg_cycle_length = 0;
+	p->mutexes_held = 0;
+}
+
+static inline void inc_cap_stats_cycle(task_t *p, unsigned long long now)
+{
+	unsigned long long delta;
+
+	delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
+	p->avg_cycle_length += delta;
+}
+
+static inline void inc_cap_stats_both(task_t *p, unsigned long long now)
+{
+	unsigned long long delta;
+
+	delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
+	p->avg_cycle_length += delta;
+	p->avg_cpu_per_cycle += delta;
+}
+
+static inline void decay_cap_stats(task_t *p)
+{
+	p->avg_cycle_length *= ((1 << CAP_STATS_OFFSET) - 1);
+	p->avg_cycle_length >>= CAP_STATS_OFFSET;
+	p->avg_cpu_per_cycle *= ((1 << CAP_STATS_OFFSET) - 1);
+	p->avg_cpu_per_cycle >>= CAP_STATS_OFFSET;
+}
+#else
+#define task_has_cap(p) 0
+#define task_is_background(p) 0
+#define task_being_capped(p) 0
+#define cap_load_weight(p) ((int)SCHED_LOAD_SCALE)
+#define safe_to_enforce_cap(p) 0
+
+static inline void init_cpu_rate_caps(task_t *p)
+{
+}
+
+static inline void set_cap_flag(task_t *p)
+{
+}
+
+static inline int task_exceeding_cap(const task_t *p)
+{
+	return 0;
+}
+
+#ifdef CONFIG_SCHED_SMT
+#define smt_timeslice(p) task_timeslice(p)
+
+static inline int task_priority_gt(const task_t *thisp, const task_t *thatp)
+{
+	return thisp->static_prio < thatp->static_prio;
+}
+#endif
+
+static inline int task_exceeding_cap_now(const task_t *p,
+					 unsigned long long now, int oncpu)
+{
+	return 0;
+}
+
+static inline void init_cap_stats(task_t *p)
+{
+}
+
+static inline void inc_cap_stats_cycle(task_t *p, unsigned long long now)
+{
+}
+
+static inline void inc_cap_stats_both(task_t *p, unsigned long long now)
+{
+}
+
+static inline void decay_cap_stats(task_t *p)
+{
+}
+#endif
+
 #define task_hot(p, now, sd) ((long long) ((now) - (p)->last_ran)	\
 				< (long long) (sd)->cache_hot_time)
 
@@ -198,8 +372,8 @@ typedef struct runqueue runqueue_t;
 
 struct prio_array {
 	unsigned int nr_active;
-	DECLARE_BITMAP(bitmap, MAX_PRIO+1); /* include 1 bit for delimiter */
-	struct list_head queue[MAX_PRIO];
+	DECLARE_BITMAP(bitmap, IDLE_PRIO+1); /* include 1 bit for delimiter */
+	struct list_head queue[IDLE_PRIO];
 };
 
 /*
@@ -717,6 +891,10 @@ static inline int __normal_prio(task_t *
 {
 	int bonus, prio;
 
+	/* Ensure that background tasks stay at BGND_PRIO */
+	if (task_is_background(p) && safe_to_enforce_cap(p))
+		return BGND_PRIO;
+
 	bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
 
 	prio = p->static_prio - bonus;
@@ -751,6 +929,8 @@ static inline int __normal_prio(task_t *
 
 static void set_load_weight(task_t *p)
 {
+	set_cap_flag(p);
+
 	if (has_rt_policy(p)) {
 #ifdef CONFIG_SMP
 		if (p == task_rq(p)->migration_thread)
@@ -763,8 +943,17 @@ static void set_load_weight(task_t *p)
 		else
 #endif
 			p->load_weight = RTPRIO_TO_LOAD_WEIGHT(p->rt_priority);
-	} else
+	} else {
 		p->load_weight = PRIO_TO_LOAD_WEIGHT(p->static_prio);
+
+		/*
+		 * Reduce the probability of a task escaping its CPU rate cap
+		 * due to load balancing leaving it on a lightly used CPU
+		 * This will be optimized away if rate caps aren't configured
+		 */
+		if (task_has_cap(p))
+			p->load_weight = min(cap_load_weight(p), p->load_weight);
+	}
 }
 
 static inline void inc_raw_weighted_load(runqueue_t *rq, const task_t *p)
@@ -834,7 +1023,7 @@ static void __activate_task(task_t *p, r
 {
 	prio_array_t *target = rq->active;
 
-	if (batch_task(p))
+	if (batch_task(p) || task_being_capped(p))
 		target = rq->expired;
 	enqueue_task(p, target);
 	inc_nr_running(p, rq);
@@ -940,8 +1129,32 @@ static void activate_task(task_t *p, run
 #endif
 
 	if (!rt_task(p))
+		/*
+		 * We want to do the recalculation even if we're exceeding
+		 * a cap so that everything still works when we stop
+		 * exceeding our cap.
+		 */
 		p->prio = recalc_task_prio(p, now);
 
+	if (task_has_cap(p)) {
+		inc_cap_stats_cycle(p, now);
+		/* Background tasks are handled in effective_prio()
+		 * in order to ensure that they stay at BGND_PRIO
+		 * but we need to be careful that we don't override
+		 * it here
+		 */
+		if (task_exceeding_cap(p) && !task_is_background(p) &&
+		    safe_to_enforce_cap(p)) {
+			p->normal_prio = CAPPED_PRIO;
+			/*
+			 * Don't undo any priority inheritance
+			 */
+			if (!rt_task(p))
+				p->prio = p->normal_prio;
+		}
+	}
+	p->flags &= ~PF_UIWAKE;
+
 	/*
 	 * This checks to make sure it's not an uninterruptible task
 	 * that is now waking up.
@@ -1482,6 +1695,7 @@ out_activate:
 		 * sleep_avg beyond just interactive state.
 		 */
 		p->sleep_type = SLEEP_NONINTERACTIVE;
+		p->flags |= PF_UIWAKE;
 	} else
 
 	/*
@@ -1542,6 +1756,7 @@ void fastcall sched_fork(task_t *p, int 
 #endif
 	set_task_cpu(p, cpu);
 
+	init_cap_stats(p);
 	/*
 	 * We mark the process as running here, but have not actually
 	 * inserted it onto the runqueue yet. This guarantees that
@@ -2014,7 +2229,7 @@ void pull_task(runqueue_t *src_rq, prio_
 	p->timestamp = (p->timestamp - src_rq->timestamp_last_tick)
 				+ this_rq->timestamp_last_tick;
 	/*
-	 * Note that idle threads have a prio of MAX_PRIO, for this test
+	 * Note that idle threads have a prio of IDLE_PRIO, for this test
 	 * to be always true for them.
 	 */
 	if (TASK_PREEMPTS_CURR(p, this_rq))
@@ -2114,8 +2329,8 @@ skip_bitmap:
 	if (!idx)
 		idx = sched_find_first_bit(array->bitmap);
 	else
-		idx = find_next_bit(array->bitmap, MAX_PRIO, idx);
-	if (idx >= MAX_PRIO) {
+		idx = find_next_bit(array->bitmap, IDLE_PRIO, idx);
+	if (idx >= IDLE_PRIO) {
 		if (array == busiest->expired && busiest->active->nr_active) {
 			array = busiest->active;
 			dst_array = this_rq->active;
@@ -2997,15 +3212,67 @@ void scheduler_tick(void)
 		}
 		goto out_unlock;
 	}
+	/* Only check for task exceeding cap if it's worthwhile */
+	if (task_has_cap(p)) {
+		/*
+		 * Do this even if there's only one task on the queue as
+		 * we want to set the priority low so that any waking tasks
+		 * can preempt.
+		 */
+		if (task_being_capped(p)) {
+			/*
+			 * Tasks whose cap is currently being enforced will be
+			 * at CAPPED_PRIO or BGND_PRIO priority and preemption
+			 * should be enough to keep them in check provided we
+			 * don't let them adversely affect tasks on the expired
+			 * array.
+			 */
+			if (!safe_to_enforce_cap(p) ||
+			    (!task_exceeding_cap_now(p, now, 1) &&
+			     !task_is_background(p))) {
+				dequeue_task(p, rq->active);
+				p->prio = effective_prio(p);
+				enqueue_task(p, rq->active);
+			} else if (rq->expired->nr_active &&
+				   rq->best_expired_prio < p->prio) {
+				dequeue_task(p, rq->active);
+				enqueue_task(p, rq->expired);
+				set_tsk_need_resched(p);
+				goto out_unlock;
+			}
+		} else if (task_exceeding_cap_now(p, now, 1) &&
+			   safe_to_enforce_cap(p)) {
+			dequeue_task(p, rq->active);
+			if (task_is_background(p))
+				p->normal_prio = BGND_PRIO;
+			else
+				p->normal_prio = CAPPED_PRIO;
+			/* this should be safe for PI purposes */
+			p->prio = p->normal_prio;
+			enqueue_task(p, rq->expired);
+			/*
+			 * think about making this conditional to reduce
+			 * context switch rate
+			 */
+			set_tsk_need_resched(p);
+			goto out_unlock;
+		}
+	}
 	if (!--p->time_slice) {
 		dequeue_task(p, rq->active);
 		set_tsk_need_resched(p);
-		p->prio = effective_prio(p);
+		if (!task_being_capped(p))
+			p->prio = effective_prio(p);
 		p->time_slice = task_timeslice(p);
 		p->first_time_slice = 0;
 
 		if (!rq->expired_timestamp)
 			rq->expired_timestamp = jiffies;
+		/*
+		 * No need to do anything special for capped tasks as here
+		 * TASK_INTERACTIVE() should fail when they're exceeding
+		 * their caps.
+		 */
 		if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
 			enqueue_task(p, rq->expired);
 			if (p->static_prio < rq->best_expired_prio)
@@ -3151,9 +3418,9 @@ static int dependent_sleeper(int this_cp
 				(sd->per_cpu_gain * DEF_TIMESLICE / 100))
 					ret = 1;
 		} else {
-			if (smt_curr->static_prio < p->static_prio &&
+			if (task_priority_gt(smt_curr, p) &&
 				!TASK_PREEMPTS_CURR(p, smt_rq) &&
-				smt_slice(smt_curr, sd) > task_timeslice(p))
+				smt_slice(smt_curr, sd) > smt_timeslice(p))
 					ret = 1;
 		}
 unlock:
@@ -3217,6 +3484,22 @@ static inline int interactive_sleep(enum
 }
 
 /*
+ * Switch the active and expired arrays.
+ */
+static prio_array_t *switch_arrays(runqueue_t *rq, int best_active_prio)
+{
+	prio_array_t *array = rq->active;
+
+	schedstat_inc(rq, sched_switch);
+	rq->active = rq->expired;
+	rq->expired = array;
+	rq->expired_timestamp = 0;
+	rq->best_expired_prio = best_active_prio;
+
+	return rq->active;
+}
+
+/*
  * schedule() is the main scheduler function.
  */
 asmlinkage void __sched schedule(void)
@@ -3292,6 +3575,12 @@ need_resched_nonpreemptible:
 		}
 	}
 
+	/* do this now so that stats are correct for SMT code */
+	if (task_has_cap(prev)) {
+		inc_cap_stats_both(prev, now);
+		decay_cap_stats(prev);
+	}
+
 	cpu = smp_processor_id();
 	if (unlikely(!rq->nr_running)) {
 		idle_balance(cpu, rq);
@@ -3304,21 +3593,23 @@ need_resched_nonpreemptible:
 	}
 
 	array = rq->active;
-	if (unlikely(!array->nr_active)) {
-		/*
-		 * Switch the active and expired arrays.
-		 */
-		schedstat_inc(rq, sched_switch);
-		rq->active = rq->expired;
-		rq->expired = array;
-		array = rq->active;
-		rq->expired_timestamp = 0;
-		rq->best_expired_prio = MAX_PRIO;
-	}
+	if (unlikely(!array->nr_active))
+		array = switch_arrays(rq, IDLE_PRIO);
 
 	idx = sched_find_first_bit(array->bitmap);
+get_next:
 	queue = array->queue + idx;
 	next = list_entry(queue->next, task_t, run_list);
+	/* very strict soft capping */
+	if (unlikely(task_being_capped(next) && rq->expired->nr_active)) {
+		int tmp = sched_find_first_bit(rq->expired->bitmap);
+
+		if (likely(tmp < idx) && task_exceeding_cap_now(next, now, 0)) {
+			array = switch_arrays(rq, idx);
+			idx = tmp;
+			goto get_next;
+		}
+	}
 
 	if (!rt_task(next) && interactive_sleep(next->sleep_type)) {
 		unsigned long long delta = now - next->timestamp;
@@ -3331,7 +3622,7 @@ need_resched_nonpreemptible:
 		array = next->array;
 		new_prio = recalc_task_prio(next, next->timestamp + delta);
 
-		if (unlikely(next->prio != new_prio)) {
+		if (unlikely(next->prio != new_prio && !task_being_capped(next))) {
 			dequeue_task(next, array);
 			next->prio = new_prio;
 			enqueue_task(next, array);
@@ -3357,6 +3648,8 @@ switch_tasks:
 
 	sched_info_switch(prev, next);
 	if (likely(prev != next)) {
+		if (task_has_cap(next))
+			inc_cap_stats_cycle(next, now);
 		next->timestamp = now;
 		rq->nr_switches++;
 		rq->curr = next;
@@ -3810,7 +4103,7 @@ void rt_mutex_setprio(task_t *p, int pri
 	runqueue_t *rq;
 	int oldprio;
 
-	BUG_ON(prio < 0 || prio > MAX_PRIO);
+	BUG_ON(prio < 0 || prio > IDLE_PRIO);
 
 	rq = task_rq_lock(p, &flags);
 
@@ -4237,6 +4530,69 @@ out_unlock:
 	return retval;
 }
 
+#ifdef CONFIG_CPU_RATE_CAPS
+/*
+ * Require: 0 <= new_cap <= CPU_CAP_ONE
+ */
+int set_cpu_rate_cap(struct task_struct *p, unsigned int new_cap)
+{
+	int is_allowed;
+	unsigned long flags;
+	struct runqueue *rq;
+	prio_array_t *array;
+	int delta;
+
+	if (new_cap > CPU_CAP_ONE)
+		return -EINVAL;
+	is_allowed = capable(CAP_SYS_NICE);
+	/*
+	 * We have to be careful, if called from /proc code,
+	 * the task might be in the middle of scheduling on another CPU.
+	 */
+	rq = task_rq_lock(p, &flags);
+	delta = new_cap - p->cpu_rate_cap;
+	if (!is_allowed) {
+		/*
+		 * Ordinary users can set/change caps on their own tasks
+		 * provided that the new setting is MORE constraining
+		 */
+		if (((current->euid != p->uid) && (current->uid != p->uid)) || (delta > 0)) {
+			task_rq_unlock(rq, &flags);
+			return -EPERM;
+		}
+	}
+	/*
+	 * The RT tasks don't have caps, but we still allow the caps to be
+	 * set - but as expected it won't have any effect on scheduling until
+	 * the task becomes SCHED_NORMAL/SCHED_BATCH:
+	 */
+	p->cpu_rate_cap = new_cap;
+
+	if (has_rt_policy(p))
+		goto out;
+
+	array = p->array;
+	if (array) {
+		dec_raw_weighted_load(rq, p);
+		dequeue_task(p, array);
+	}
+
+	set_load_weight(p);
+	p->prio = effective_prio(p);
+
+	if (array) {
+		enqueue_task(p, array);
+		inc_raw_weighted_load(rq, p);
+	}
+out:
+	task_rq_unlock(rq, &flags);
+
+	return 0;
+}
+
+EXPORT_SYMBOL(set_cpu_rate_cap);
+#endif
+
 long sched_setaffinity(pid_t pid, cpumask_t new_mask)
 {
 	task_t *p;
@@ -4761,7 +5117,7 @@ void __devinit init_idle(task_t *idle, i
 	idle->timestamp = sched_clock();
 	idle->sleep_avg = 0;
 	idle->array = NULL;
-	idle->prio = idle->normal_prio = MAX_PRIO;
+	idle->prio = idle->normal_prio = IDLE_PRIO;
 	idle->state = TASK_RUNNING;
 	idle->cpus_allowed = cpumask_of_cpu(cpu);
 	set_task_cpu(idle, cpu);
@@ -5115,7 +5471,7 @@ static void migrate_dead_tasks(unsigned 
 	struct runqueue *rq = cpu_rq(dead_cpu);
 
 	for (arr = 0; arr < 2; arr++) {
-		for (i = 0; i < MAX_PRIO; i++) {
+		for (i = 0; i < IDLE_PRIO; i++) {
 			struct list_head *list = &rq->arrays[arr].queue[i];
 			while (!list_empty(list))
 				migrate_dead(dead_cpu,
@@ -5287,7 +5643,7 @@ static int migration_call(struct notifie
 		/* Idle task back to normal (off runqueue, low prio) */
 		rq = task_rq_lock(rq->idle, &flags);
 		deactivate_task(rq->idle, rq);
-		rq->idle->static_prio = MAX_PRIO;
+		rq->idle->static_prio = IDLE_PRIO;
 		__setscheduler(rq->idle, SCHED_NORMAL, 0);
 		migrate_dead_tasks(cpu);
 		task_rq_unlock(rq, &flags);
@@ -6810,7 +7166,7 @@ void __init sched_init(void)
 		rq->nr_running = 0;
 		rq->active = rq->arrays;
 		rq->expired = rq->arrays + 1;
-		rq->best_expired_prio = MAX_PRIO;
+		rq->best_expired_prio = IDLE_PRIO;
 
 #ifdef CONFIG_SMP
 		rq->sd = NULL;
@@ -6825,15 +7181,16 @@ void __init sched_init(void)
 
 		for (j = 0; j < 2; j++) {
 			array = rq->arrays + j;
-			for (k = 0; k < MAX_PRIO; k++) {
+			for (k = 0; k < IDLE_PRIO; k++) {
 				INIT_LIST_HEAD(array->queue + k);
 				__clear_bit(k, array->bitmap);
 			}
 			// delimiter for bitsearch
-			__set_bit(MAX_PRIO, array->bitmap);
+			__set_bit(IDLE_PRIO, array->bitmap);
 		}
 	}
 
+	init_cpu_rate_caps(&init_task);
 	set_load_weight(&init_task);
 	/*
 	 * The boot idle thread does lazy MMU switching as well:
Index: MM-2.6.17-mm1/kernel/rtmutex-debug.c
===================================================================
--- MM-2.6.17-mm1.orig/kernel/rtmutex-debug.c	2006-06-22 09:27:21.000000000 +1000
+++ MM-2.6.17-mm1/kernel/rtmutex-debug.c	2006-06-22 10:17:47.000000000 +1000
@@ -210,8 +210,8 @@ void debug_rt_mutex_proxy_unlock(struct 
 void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
 {
 	memset(waiter, 0x11, sizeof(*waiter));
-	plist_node_init(&waiter->list_entry, MAX_PRIO);
-	plist_node_init(&waiter->pi_list_entry, MAX_PRIO);
+	plist_node_init(&waiter->list_entry, sched_idle_prio());
+	plist_node_init(&waiter->pi_list_entry, sched_idle_prio());
 }
 
 void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter)
Index: MM-2.6.17-mm1/kernel/mutex.c
===================================================================
--- MM-2.6.17-mm1.orig/kernel/mutex.c	2006-06-22 09:27:21.000000000 +1000
+++ MM-2.6.17-mm1/kernel/mutex.c	2006-06-22 10:17:47.000000000 +1000
@@ -51,6 +51,20 @@ __mutex_init(struct mutex *lock, const c
 
 EXPORT_SYMBOL(__mutex_init);
 
+static inline void inc_mutex_count(void)
+{
+#ifdef CONFIG_CPU_RATE_CAPS
+	current->mutexes_held++;
+#endif
+}
+
+static inline void dec_mutex_count(void)
+{
+#ifdef CONFIG_CPU_RATE_CAPS
+	current->mutexes_held--;
+#endif
+}
+
 /*
  * We split the mutex lock/unlock logic into separate fastpath and
  * slowpath functions, to reduce the register pressure on the fastpath.
@@ -89,6 +103,7 @@ void inline fastcall __sched mutex_lock(
 	 * 'unlocked' into 'locked' state.
 	 */
 	__mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath);
+	inc_mutex_count();
 }
 
 EXPORT_SYMBOL(mutex_lock);
@@ -114,6 +129,7 @@ void fastcall __sched mutex_unlock(struc
 	 * into 'unlocked' state:
 	 */
 	__mutex_fastpath_unlock(&lock->count, __mutex_unlock_slowpath);
+	dec_mutex_count();
 }
 
 EXPORT_SYMBOL(mutex_unlock);
@@ -128,6 +144,7 @@ void fastcall __sched mutex_unlock_non_n
 	 * into 'unlocked' state:
 	 */
 	__mutex_fastpath_unlock(&lock->count, __mutex_unlock_non_nested_slowpath);
+	dec_mutex_count();
 }
 
 EXPORT_SYMBOL(mutex_unlock_non_nested);
@@ -300,9 +317,16 @@ __mutex_lock_interruptible_slowpath(atom
  */
 int fastcall __sched mutex_lock_interruptible(struct mutex *lock)
 {
+	int ret;
+
 	might_sleep();
-	return __mutex_fastpath_lock_retval
+	ret = __mutex_fastpath_lock_retval
 			(&lock->count, __mutex_lock_interruptible_slowpath);
+
+	if (likely(!ret))
+		inc_mutex_count();
+
+	return ret;
 }
 
 EXPORT_SYMBOL(mutex_lock_interruptible);
@@ -357,8 +381,13 @@ static inline int __mutex_trylock_slowpa
  */
 int fastcall __sched mutex_trylock(struct mutex *lock)
 {
-	return __mutex_fastpath_trylock(&lock->count,
+	int ret = __mutex_fastpath_trylock(&lock->count,
 					__mutex_trylock_slowpath);
+
+	if (likely(ret))
+		inc_mutex_count();
+
+	return ret;
 }
 
 EXPORT_SYMBOL(mutex_trylock);

-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce


* [PATCH 2/4] sched: Add CPU rate hard caps
  2006-06-22  1:52 [PATCH 0/4] sched: Add CPU rate caps Peter Williams
  2006-06-22  1:52 ` [PATCH 1/4] sched: Add CPU rate soft caps Peter Williams
@ 2006-06-22  1:52 ` Peter Williams
  2006-06-22  1:53 ` [PATCH 3/4] sched: Add procfs interface for CPU rate soft caps Peter Williams
  2006-06-22  1:53 ` [PATCH 4/4] sched: Add procfs interface for CPU rate hard caps Peter Williams
  3 siblings, 0 replies; 13+ messages in thread
From: Peter Williams @ 2006-06-22  1:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Con Kolivas, Sam Vilain, Balbir Singh, Srivatsa, Peter Williams,
	Kirill Korotaev, Linux Kernel, Mike Galbraith, Kingsley Cheung,
	CKRM, Ingo Molnar, Rene Herman

This patch implements hard CPU rate caps per task as a proportion of a
single CPU's capacity expressed in parts per thousand.

Notes:

1. Simplified the calculation of sinbin durations to eliminate the need
for a 64-bit divide (a sketch of the unoptimized form appears below).
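
For orientation, the sinbin duration aims for the sleep time T that
brings the average usage back under the hard cap.  A straightforward
but unoptimized version (a hypothetical helper, assuming the caller has
already checked that the cap is being exceeded) might look like:

	/*
	 * Find T such that cpc / (cyc + T) <= cap / 1000, i.e.
	 *
	 *	T >= (cpc * 1000 - cyc * cap) / cap	nanoseconds
	 *
	 * then convert to timer ticks.  reqd_sinbin_ticks() in the
	 * patch below reworks this to avoid a full 64 bit division.
	 */
	static unsigned long naive_sinbin_ticks(u64 cpc, u64 cyc,
						unsigned int cap)
	{
		u64 over = cpc * CPU_CAP_ONE - cyc * cap;

		(void)do_div(over, cap);		/* -> nanoseconds */
		(void)do_div(over, 1000000000 / HZ);	/* -> ticks */

		return over ? : 1;
	}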

Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>
 include/linux/sched.h |   22 ++++++++-
 kernel/Kconfig.caps   |   14 +++++
 kernel/sched.c        |  117 +++++++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 140 insertions(+), 13 deletions(-)

Index: MM-2.6.17-mm1/include/linux/sched.h
===================================================================
--- MM-2.6.17-mm1.orig/include/linux/sched.h	2006-06-22 10:21:08.000000000 +1000
+++ MM-2.6.17-mm1/include/linux/sched.h	2006-06-22 10:35:51.000000000 +1000
@@ -804,6 +804,10 @@ struct task_struct {
 	unsigned long long avg_cpu_per_cycle, avg_cycle_length;
 	unsigned int cpu_rate_cap;
 	unsigned int mutexes_held;
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+	unsigned int cpu_rate_hard_cap;
+	struct timer_list sinbin_timer;
+#endif
 #endif
 	enum sleep_type sleep_type;
 
@@ -1021,12 +1025,28 @@ struct task_struct {
 };
 
 #ifdef CONFIG_CPU_RATE_CAPS
+int set_cpu_rate_cap_low(struct task_struct *, unsigned int, int);
+
 static inline unsigned int get_cpu_rate_cap(const struct task_struct *p)
 {
 	return p->cpu_rate_cap;
 }
 
-int set_cpu_rate_cap(struct task_struct *, unsigned int);
+static inline int set_cpu_rate_cap(struct task_struct *p, unsigned int newcap)
+{
+	return set_cpu_rate_cap_low(p, newcap, 0);
+}
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+static inline unsigned int get_cpu_rate_hard_cap(const struct task_struct *p)
+{
+	return p->cpu_rate_hard_cap;
+}
+
+static inline int set_cpu_rate_hard_cap(struct task_struct *p, unsigned int newcap)
+{
+	return set_cpu_rate_cap_low(p, newcap, 1);
+}
+#endif
 #endif
 
 static inline pid_t process_group(struct task_struct *tsk)
Index: MM-2.6.17-mm1/kernel/Kconfig.caps
===================================================================
--- MM-2.6.17-mm1.orig/kernel/Kconfig.caps	2006-06-22 10:17:47.000000000 +1000
+++ MM-2.6.17-mm1/kernel/Kconfig.caps	2006-06-22 10:29:46.000000000 +1000
@@ -3,11 +3,21 @@
 #
 
 config CPU_RATE_CAPS
-	bool "Support (soft) CPU rate caps"
+	bool "Support CPU rate caps"
 	default y
 	---help---
-	  Say y here if you wish to be able to put a (soft) upper limit on
+	  Say y here if you wish to be able to put a soft upper limit on
 	  the rate of CPU usage by individual tasks.  A task which has been
 	  allocated a soft CPU rate cap will be limited to that rate of CPU
 	  usage unless there is spare CPU resources available after the needs
 	  of uncapped tasks are met.
+
+config CPU_RATE_HARD_CAPS
+	bool "Support CPU rate hard caps"
+	depends on CPU_RATE_CAPS
+	default n
+	---help---
+	  Say y here if you wish to be able to put a hard upper limit on
+	  the rate of CPU usage by individual tasks.  A task which has been
+	  allocated a hard CPU rate cap will be limited to that rate of CPU
+	  usage regardless of whether there are spare CPU resources available.
Index: MM-2.6.17-mm1/kernel/sched.c
===================================================================
--- MM-2.6.17-mm1.orig/kernel/sched.c	2006-06-22 10:26:24.000000000 +1000
+++ MM-2.6.17-mm1/kernel/sched.c	2006-06-22 10:32:15.000000000 +1000
@@ -203,25 +203,39 @@ static inline unsigned int task_timeslic
 #ifdef CONFIG_CPU_RATE_CAPS
 #define CPU_CAP_ONE 1000
 #define CAP_STATS_OFFSET 8
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+static void sinbin_release_fn(unsigned long arg);
+#define min_cpu_rate_cap(p) min((p)->cpu_rate_cap, (p)->cpu_rate_hard_cap)
+#else
+#define min_cpu_rate_cap(p) (p)->cpu_rate_cap
+#endif
 #define task_has_cap(p) unlikely((p)->flags & PF_HAS_CAP)
 /* this assumes that p is not a real time task */
 #define task_is_background(p) unlikely((p)->cpu_rate_cap == 0)
 #define task_being_capped(p) unlikely((p)->prio >= CAPPED_PRIO)
 #define cap_load_weight(p) \
-	(max((int)(((p)->cpu_rate_cap * SCHED_LOAD_SCALE) / CPU_CAP_ONE), 1))
+	(max((int)((min_cpu_rate_cap(p) * SCHED_LOAD_SCALE) / CPU_CAP_ONE), 1))
 #define safe_to_enforce_cap(p) \
-	(!((p)->mutexes_held || (p)->flags & (PF_FREEZE | PF_UIWAKE)))
+	(!((p)->mutexes_held || \
+	   (p)->flags & (PF_FREEZE | PF_UIWAKE | PF_EXITING)))
+#define safe_to_sinbin(p) (safe_to_enforce_cap(p) && !signal_pending(p))
 
 static void init_cpu_rate_caps(task_t *p)
 {
 	p->cpu_rate_cap = CPU_CAP_ONE;
 	p->flags &= ~PF_HAS_CAP;
 	p->mutexes_held = 0;
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+	p->cpu_rate_hard_cap = CPU_CAP_ONE;
+	init_timer(&p->sinbin_timer);
+	p->sinbin_timer.function = sinbin_release_fn;
+	p->sinbin_timer.data = (unsigned long) p;
+#endif
 }
 
 static inline void set_cap_flag(task_t *p)
 {
-	if (p->cpu_rate_cap < CPU_CAP_ONE && !has_rt_policy(p))
+	if (min_cpu_rate_cap(p) < CPU_CAP_ONE && !has_rt_policy(p))
 		p->flags |= PF_HAS_CAP;
 	else
 		p->flags &= ~PF_HAS_CAP;
@@ -229,7 +243,7 @@ static inline void set_cap_flag(task_t *
 
 static inline int task_exceeding_cap(const task_t *p)
 {
-	return (p->avg_cpu_per_cycle * CPU_CAP_ONE) > (p->avg_cycle_length * p->cpu_rate_cap);
+	return (p->avg_cpu_per_cycle * CPU_CAP_ONE) > (p->avg_cycle_length * min_cpu_rate_cap(p));
 }
 
 #ifdef CONFIG_SCHED_SMT
@@ -239,7 +253,7 @@ static inline int task_exceeding_cap(con
 static unsigned int smt_timeslice(task_t *p)
 {
 	if (task_being_capped(p))
-		return (p->cpu_rate_cap * DEF_TIMESLICE) / CPU_CAP_ONE;
+		return (min_cpu_rate_cap(p) * DEF_TIMESLICE) / CPU_CAP_ONE;
 
 	return task_timeslice(p);
 }
@@ -271,7 +285,7 @@ static int task_exceeding_cap_now(const 
 	unsigned long long cpc = p->avg_cpu_per_cycle;
 
 	delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
-	rhs = (p->avg_cycle_length + delta) * p->cpu_rate_cap;
+	rhs = (p->avg_cycle_length + delta) * min_cpu_rate_cap(p);
 	if (oncpu)
 		cpc += delta;
 
@@ -283,6 +297,10 @@ static inline void init_cap_stats(task_t
 	p->avg_cpu_per_cycle = 0;
 	p->avg_cycle_length = 0;
 	p->mutexes_held = 0;
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+	init_timer(&p->sinbin_timer);
+	p->sinbin_timer.data = (unsigned long) p;
+#endif
 }
 
 static inline void inc_cap_stats_cycle(task_t *p, unsigned long long now)
@@ -315,6 +333,7 @@ static inline void decay_cap_stats(task_
 #define task_being_capped(p) 0
 #define cap_load_weight(p) ((int)SCHED_LOAD_SCALE)
 #define safe_to_enforce_cap(p) 0
+#define safe_to_sinbin(p) 0
 
 static inline void init_cpu_rate_caps(task_t *p)
 {
@@ -1192,6 +1211,63 @@ static void deactivate_task(struct task_
 	p->array = NULL;
 }
 
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+#define task_has_hard_cap(p) unlikely((p)->cpu_rate_hard_cap < CPU_CAP_ONE)
+
+/*
+ * Release a task from the sinbin
+ */
+static void sinbin_release_fn(unsigned long arg)
+{
+	unsigned long flags;
+	struct task_struct *p = (struct task_struct*)arg;
+	struct runqueue *rq = task_rq_lock(p, &flags);
+
+	p->prio = effective_prio(p);
+
+	__activate_task(p, rq);
+
+	task_rq_unlock(rq, &flags);
+}
+
+static unsigned long reqd_sinbin_ticks(const task_t *p)
+{
+	unsigned long long lhs = p->avg_cpu_per_cycle * CPU_CAP_ONE;
+	unsigned long long rhs = p->avg_cycle_length * p->cpu_rate_hard_cap;
+
+	if (lhs > rhs) {
+		lhs -= p->avg_cpu_per_cycle;
+		lhs >>= CAP_STATS_OFFSET;
+		/* have to do two divisions because there's no guarantee
+		 * that p->cpu_rate_hard_cap * (1000000000 / HZ) would
+		 * not overflow a 32 bit unsigned integer
+		 */
+		(void)do_div(lhs, p->cpu_rate_hard_cap);
+		(void)do_div(lhs, (1000000000 / HZ));
+
+		return lhs ? : 1;
+	}
+
+	return 0;
+}
+
+static void sinbin_task(task_t *p, unsigned long durn)
+{
+	if (durn == 0)
+		return;
+	deactivate_task(p, task_rq(p));
+	p->sinbin_timer.expires = jiffies + durn;
+	add_timer(&p->sinbin_timer);
+}
+#else
+#define task_has_hard_cap(p) 0
+#define reqd_sinbin_ticks(p) 0
+
+static inline void sinbin_task(task_t *p, unsigned long durn)
+{
+}
+#endif
+
 /*
  * resched_task - mark a task 'to be rescheduled now'.
  *
@@ -3579,6 +3655,13 @@ need_resched_nonpreemptible:
 	if (task_has_cap(prev)) {
 		inc_cap_stats_both(prev, now);
 		decay_cap_stats(prev);
+		if (task_has_hard_cap(prev) && !prev->state &&
+		    !rt_task(prev) && safe_to_sinbin(prev)) {
+			unsigned long sinbin_ticks = reqd_sinbin_ticks(prev);
+
+			if (sinbin_ticks)
+				sinbin_task(prev, sinbin_ticks);
+		}
 	}
 
 	cpu = smp_processor_id();
@@ -4532,9 +4615,10 @@ out_unlock:
 
 #ifdef CONFIG_CPU_RATE_CAPS
 /*
- * Require: 0 <= new_cap <= CPU_CAP_ONE
+ * Require: 0 <= new_cap <= CPU_CAP_ONE for hard == 0
+ *          1 <= new_cap <= CPU_CAP_ONE otherwise
  */
-int set_cpu_rate_cap(struct task_struct *p, unsigned int new_cap)
+int set_cpu_rate_cap_low(struct task_struct *p, unsigned int new_cap, int hard)
 {
 	int is_allowed;
 	unsigned long flags;
@@ -4544,13 +4628,21 @@ int set_cpu_rate_cap(struct task_struct 
 
 	if (new_cap > CPU_CAP_ONE)
 		return -EINVAL;
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+	if (hard && new_cap < 1)
+		return -EINVAL;
+#endif
 	is_allowed = capable(CAP_SYS_NICE);
 	/*
 	 * We have to be careful, if called from /proc code,
 	 * the task might be in the middle of scheduling on another CPU.
 	 */
 	rq = task_rq_lock(p, &flags);
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+	delta = new_cap - (hard ? p->cpu_rate_hard_cap : p->cpu_rate_cap);
+#else
 	delta = new_cap - p->cpu_rate_cap;
+#endif
 	if (!is_allowed) {
 		/*
 		 * Ordinary users can set/change caps on their own tasks
@@ -4566,7 +4658,12 @@ int set_cpu_rate_cap(struct task_struct 
	 * set - but as expected it won't have any effect on scheduling until
 	 * the task becomes SCHED_NORMAL/SCHED_BATCH:
 	 */
-	p->cpu_rate_cap = new_cap;
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+	if (hard)
+		p->cpu_rate_hard_cap = new_cap;
+	else
+#endif
+		p->cpu_rate_cap = new_cap;
 
 	if (has_rt_policy(p))
 		goto out;
@@ -4590,7 +4687,7 @@ out:
 	return 0;
 }
 
-EXPORT_SYMBOL(set_cpu_rate_cap);
+EXPORT_SYMBOL(set_cpu_rate_cap_low);
 #endif
 
 long sched_setaffinity(pid_t pid, cpumask_t new_mask)

-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce


* [PATCH 3/4] sched: Add procfs interface for CPU rate soft caps
  2006-06-22  1:52 [PATCH 0/4] sched: Add CPU rate caps Peter Williams
  2006-06-22  1:52 ` [PATCH 1/4] sched: Add CPU rate soft caps Peter Williams
  2006-06-22  1:52 ` [PATCH 2/4] sched: Add CPU rate hard caps Peter Williams
@ 2006-06-22  1:53 ` Peter Williams
  2006-06-22  1:53 ` [PATCH 4/4] sched: Add procfs interface for CPU rate hard caps Peter Williams
  3 siblings, 0 replies; 13+ messages in thread
From: Peter Williams @ 2006-06-22  1:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Con Kolivas, Sam Vilain, Peter Williams, Srivatsa, Balbir Singh,
	Kirill Korotaev, Linux Kernel, Mike Galbraith, Kingsley Cheung,
	CKRM, Ingo Molnar, Rene Herman

This patch implements a procfs interface for soft CPU rate caps via
files of the form /proc/<tgid>/task/<pid>/cpu_rate_cap.
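
A minimal user space usage sketch (a hypothetical helper; it assumes
the task is a thread group leader so that <tgid> == <pid>):

	#include <stdio.h>
	#include <sys/types.h>

	/* Write a soft cap in parts per thousand; 0 = background. */
	static int set_soft_cap(pid_t pid, unsigned int cap)
	{
		char path[64];
		FILE *f;

		snprintf(path, sizeof(path),
			 "/proc/%d/task/%d/cpu_rate_cap",
			 (int)pid, (int)pid);
		f = fopen(path, "w");
		if (!f)
			return -1;
		fprintf(f, "%u\n", cap);
		return fclose(f);
	}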

Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>
 fs/proc/base.c      |   59 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/Kconfig.caps |    5 ++++
 2 files changed, 64 insertions(+)

Index: MM-2.6.17-mm1/fs/proc/base.c
===================================================================
--- MM-2.6.17-mm1.orig/fs/proc/base.c	2006-06-22 09:27:19.000000000 +1000
+++ MM-2.6.17-mm1/fs/proc/base.c	2006-06-22 10:40:18.000000000 +1000
@@ -168,6 +168,9 @@ enum pid_directory_inos {
 #ifdef CONFIG_CPUSETS
 	PROC_TID_CPUSET,
 #endif
+#ifdef CONFIG_CPU_RATE_CAPS
+	PROC_TID_CPU_RATE_CAP,
+#endif
 #ifdef CONFIG_SECURITY
 	PROC_TID_ATTR,
 	PROC_TID_ATTR_CURRENT,
@@ -282,6 +285,9 @@ static struct pid_entry tid_base_stuff[]
 #ifdef CONFIG_AUDITSYSCALL
 	E(PROC_TID_LOGINUID, "loginuid", S_IFREG|S_IWUSR|S_IRUGO),
 #endif
+#ifdef CONFIG_CPU_RATE_CAPS
+	E(PROC_TID_CPU_RATE_CAP,  "cpu_rate_cap",   S_IFREG|S_IRUGO|S_IWUSR),
+#endif
 	{0,0,NULL,0}
 };
 
@@ -1041,6 +1047,54 @@ static struct file_operations proc_secco
 };
 #endif /* CONFIG_SECCOMP */
 
+#ifdef CONFIG_CPU_RATE_CAPS
+static ssize_t cpu_rate_cap_read(struct file * file, char * buf,
+			size_t count, loff_t *ppos)
+{
+	struct task_struct *task = get_proc_task(file->f_dentry->d_inode);
+	char buffer[64];
+	size_t len;
+	unsigned int cppt = get_cpu_rate_cap(task);
+
+	if (*ppos)
+		return 0;
+	*ppos = len = sprintf(buffer, "%u\n", cppt);
+	if (copy_to_user(buf, buffer, len))
+		return -EFAULT;
+
+	return len;
+}
+
+static ssize_t cpu_rate_cap_write(struct file * file, const char * buf,
+			 size_t count, loff_t *ppos)
+{
+	struct task_struct *task = get_proc_task(file->f_dentry->d_inode);
+	char buffer[128] = "";
+	char *endptr = NULL;
+	unsigned long hcppt;
+	int res;
+
+
+	if ((count > 63) || *ppos)
+		return -EFBIG;
+	if (copy_from_user(buffer, buf, count))
+		return -EFAULT;
+	hcppt = simple_strtoul(buffer, &endptr, 0);
+	if ((endptr == buffer) || (hcppt == ULONG_MAX))
+		return -EINVAL;
+
+	if ((res = set_cpu_rate_cap(task, hcppt)) != 0)
+		return res;
+
+	return count;
+}
+
+struct file_operations proc_cpu_rate_cap_operations = {
+	read:		cpu_rate_cap_read,
+	write:		cpu_rate_cap_write,
+};
+#endif
+
 static void *proc_pid_follow_link(struct dentry *dentry, struct nameidata *nd)
 {
 	struct inode *inode = dentry->d_inode;
@@ -1803,6 +1857,11 @@ static struct dentry *proc_pident_lookup
 			inode->i_fop = &proc_loginuid_operations;
 			break;
 #endif
+#ifdef CONFIG_CPU_RATE_CAPS
+		case PROC_TID_CPU_RATE_CAP:
+			inode->i_fop = &proc_cpu_rate_cap_operations;
+			break;
+#endif
 		default:
 			printk("procfs: impossible type (%d)",p->type);
 			iput(inode);
Index: MM-2.6.17-mm1/kernel/Kconfig.caps
===================================================================
--- MM-2.6.17-mm1.orig/kernel/Kconfig.caps	2006-06-22 10:29:46.000000000 +1000
+++ MM-2.6.17-mm1/kernel/Kconfig.caps	2006-06-22 10:48:54.000000000 +1000
@@ -11,6 +11,11 @@ config CPU_RATE_CAPS
 	  allocated a soft CPU rate cap will be limited to that rate of CPU
	  usage unless there are spare CPU resources available after the needs
 	  of uncapped tasks are met.
+	  Task soft caps can be set/got via the procfs file system using
+	  files of the form /proc/<tgid>/task/<pid>/cpu_rate_cap in parts
+	  per thousand.  Minimum soft cap is 0 and effectively places the
+	  task in the background.  Maximum soft cap is 1000 and means
+	  unlimited.
 
 config CPU_RATE_HARD_CAPS
 	bool "Support CPU rate hard caps"

-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce


* [PATCH 4/4] sched: Add procfs interface for CPU rate hard caps
  2006-06-22  1:52 [PATCH 0/4] sched: Add CPU rate caps Peter Williams
                   ` (2 preceding siblings ...)
  2006-06-22  1:53 ` [PATCH 3/4] sched: Add procfs interface for CPU rate soft caps Peter Williams
@ 2006-06-22  1:53 ` Peter Williams
  3 siblings, 0 replies; 13+ messages in thread
From: Peter Williams @ 2006-06-22  1:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Con Kolivas, Sam Vilain, Balbir Singh, Srivatsa, Peter Williams,
	Kirill Korotaev, Linux Kernel, Mike Galbraith, Kingsley Cheung,
	CKRM, Ingo Molnar, Rene Herman

This patch implements a procfs interface for hard CPU rate caps via
files of the form /proc/<tgid>/task/<pid>/cpu_rate_hard_cap.

Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>
 fs/proc/base.c      |   59 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/Kconfig.caps |    4 +++
 2 files changed, 63 insertions(+)

Index: MM-2.6.17-mm1/fs/proc/base.c
===================================================================
--- MM-2.6.17-mm1.orig/fs/proc/base.c	2006-06-22 10:40:18.000000000 +1000
+++ MM-2.6.17-mm1/fs/proc/base.c	2006-06-22 10:49:55.000000000 +1000
@@ -171,6 +171,9 @@ enum pid_directory_inos {
 #ifdef CONFIG_CPU_RATE_CAPS
 	PROC_TID_CPU_RATE_CAP,
 #endif
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+	PROC_TID_CPU_RATE_HARD_CAP,
+#endif
 #ifdef CONFIG_SECURITY
 	PROC_TID_ATTR,
 	PROC_TID_ATTR_CURRENT,
@@ -288,6 +291,9 @@ static struct pid_entry tid_base_stuff[]
 #ifdef CONFIG_CPU_RATE_CAPS
 	E(PROC_TID_CPU_RATE_CAP,  "cpu_rate_cap",   S_IFREG|S_IRUGO|S_IWUSR),
 #endif
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+	E(PROC_TID_CPU_RATE_HARD_CAP,  "cpu_rate_hard_cap",   S_IFREG|S_IRUGO|S_IWUSR),
+#endif
 	{0,0,NULL,0}
 };
 
@@ -1095,6 +1101,54 @@ struct file_operations proc_cpu_rate_cap
 };
 #endif
 
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+static ssize_t cpu_rate_hard_cap_read(struct file * file, char * buf,
+			size_t count, loff_t *ppos)
+{
+	struct task_struct *task = get_proc_task(file->f_dentry->d_inode);
+	char buffer[64];
+	size_t len;
+	unsigned int cppt = get_cpu_rate_hard_cap(task);
+
+	if (*ppos)
+		return 0;
+	*ppos = len = sprintf(buffer, "%u\n", cppt);
+	if (copy_to_user(buf, buffer, len))
+		return -EFAULT;
+
+	return len;
+}
+
+static ssize_t cpu_rate_hard_cap_write(struct file * file, const char * buf,
+			 size_t count, loff_t *ppos)
+{
+	struct task_struct *task = get_proc_task(file->f_dentry->d_inode);
+	char buffer[128] = "";
+	char *endptr = NULL;
+	unsigned long hcppt;
+	int res;
+
+
+	if ((count > 63) || *ppos)
+		return -EFBIG;
+	if (copy_from_user(buffer, buf, count))
+		return -EFAULT;
+	hcppt = simple_strtoul(buffer, &endptr, 0);
+	if ((endptr == buffer) || (hcppt == ULONG_MAX))
+		return -EINVAL;
+
+	if ((res = set_cpu_rate_hard_cap(task, hcppt)) != 0)
+		return res;
+
+	return count;
+}
+
+struct file_operations proc_cpu_rate_hard_cap_operations = {
+	read:		cpu_rate_hard_cap_read,
+	write:		cpu_rate_hard_cap_write,
+};
+#endif
+
 static void *proc_pid_follow_link(struct dentry *dentry, struct nameidata *nd)
 {
 	struct inode *inode = dentry->d_inode;
@@ -1862,6 +1916,11 @@ static struct dentry *proc_pident_lookup
 			inode->i_fop = &proc_cpu_rate_cap_operations;
 			break;
 #endif
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+		case PROC_TID_CPU_RATE_HARD_CAP:
+			inode->i_fop = &proc_cpu_rate_hard_cap_operations;
+			break;
+#endif
 		default:
 			printk("procfs: impossible type (%d)",p->type);
 			iput(inode);
Index: MM-2.6.17-mm1/kernel/Kconfig.caps
===================================================================
--- MM-2.6.17-mm1.orig/kernel/Kconfig.caps	2006-06-22 10:48:54.000000000 +1000
+++ MM-2.6.17-mm1/kernel/Kconfig.caps	2006-06-22 10:52:21.000000000 +1000
@@ -26,3 +26,7 @@ config CPU_RATE_HARD_CAPS
 	  the rate of CPU usage by individual tasks.  A task which has been
 	  allocated a hard CPU rate cap will be limited to that rate of CPU
	  usage regardless of whether there are spare CPU resources available.
+	  Task hard caps can be set/got via the procfs file system using
+	  files of the form /proc/<tgid>/task/<pid>/cpu_rate_hard_cap in parts
+	  per thousand.  Minimum hard cap is 1 and the maximum hard cap is
+	  1000 (which means unlimited).

-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce


* Re: [PATCH 0/4] sched: Add CPU rate caps
  2006-06-18  9:50 ` Andrew Morton
  2006-06-18 10:25   ` Peter Williams
@ 2006-06-19 23:59   ` Peter Williams
  1 sibling, 0 replies; 13+ messages in thread
From: Peter Williams @ 2006-06-19 23:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, sam, bsingharora, vatsa, dev, linux-kernel, efault,
	kingsley, ckrm-tech, mingo, rene.herman

Andrew Morton wrote:
> 
> If the task can exceed its cap without impacting any other tasks (ie: there
> is spare idle capacity), what happens?  I trust that spare capacity gets
> used?

As I said in another reply, the answer to this is yes for soft caps, and
how good a job it does was demonstrated by the kernbench results that I
included.  Repeated here:

Average Optimal -j 8 Load Run:

                   Vanilla          Patch Applied    Soft Cap 0%

Elapsed Time      1056.1   (1.92)  1048.2   (0.62)  1064.1   (1.59)
User Time         1908.1   (1.09)  1895.2   (1.30)  1926.6   (1.39)
System Time        181.7   (0.60)   177.5   (0.74)   173.8   (1.07)
    Total          2089.8           2072.7           2100.4
Percent CPU        197.6   (0.55)   197.0   (0)      197.0   (0)
Context Switches 49253.6 (136.31) 48881.4  (92.03) 92490.8 (163.71)
Sleeps           28038.8 (228.11) 28136.0 (250.65) 25769.4 (280.40)

Note that the (slight) increase in the elapsed time when using a soft 
cap of zero can be directly attributed to the increase in CPU usage due 
to the cap overhead (an approximate increase of 16 seconds for elapsed 
time with an approximate increase of 28 seconds (for two CPUs) in CPU 
time consumed when comparing the "patch applied" and "soft cap 0%" numbers).

I think this illustrates that (for soft caps) spare capacity is not wasted?

 >  (Is this termed "work conserving"?)

I don't know but it sounds apt.

Peter
PS For ordinary users, I think that the ability to run jobs in the 
background by using a soft cap of zero is the most useful thing that 
this patch provides.
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [PATCH 0/4] sched: Add CPU rate caps
  2006-06-18 10:25   ` Peter Williams
  2006-06-18 11:42     ` Srivatsa Vaddagiri
  2006-06-19  0:13     ` Balbir Singh
@ 2006-06-19  5:04     ` Peter Williams
  2 siblings, 0 replies; 13+ messages in thread
From: Peter Williams @ 2006-06-19  5:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, sam, bsingharora, vatsa, dev, linux-kernel, efault,
	kingsley, ckrm-tech, mingo, rene.herman

[-- Attachment #1: Type: text/plain, Size: 2463 bytes --]

Peter Williams wrote:
> Andrew Morton wrote:
>> On Sun, 18 Jun 2006 18:26:38 +1000
>> Peter Williams <pwil3058@bigpond.net.au> wrote:
>>> 5. Code size measurements:
>>>
>>> Vanilla kernel:
>>>
>>>    text    data     bss     dec     hex filename
>>>   33800    4689     296   38785    9781 sched.o
>>>    2554      79       0    2633     a49 mutex.o
>>>   12076    2632       0   14708    3974 base.o
>>>
>>> Patches applied:
>>>
>>>    text    data     bss     dec     hex filename
>>>   36870    4721     296   41887    a39f sched.o
>>>    2630      79       0    2709     a95 mutex.o
>>>   13011    2920       0   15931    3e3b base.o
>>>
>>> Indicating that the size cost of the patch proper is about
>>> 3 kilobytes and the procfs costs about another 1.2 kilobytes.
>>>
>>
>> hm.  That seems rather a lot.  I guess it's not a simple thing to do.
> 
> I suspect that a large part of that is the functions that set the caps 
> (i.e. the equivalents of set_user_nice()) one for soft and one for hard 
> caps.  The actual capping mechanisms are quite simple.

This contribution may not be as large as I thought.  Realizing that the
function to set soft caps and the function to set hard caps duplicated a
lot of code, I've modified them to share code rather than duplicate it.
I've also made the functions for getting caps inline so that they should
just be optimized away to a field reference.  But the reduction in code
size is only about 285 bytes.

Before:
    text    data     bss     dec     hex filename
   32820    4913     672   38405    9605 kernel/sched.o

After:
    text    data     bss     dec     hex filename
   32535    4913     672   38120    94e8 kernel/sched.o

Noting that these values were less than those in the original mail, I've
also checked the size for when the patches are excluded.

    text    data     bss     dec     hex filename
   30061    4881     672   35614    8b1e kernel/sched.o

The numbers in the e-mail were from a different computer with gcc-4.0.2
installed while these are generated with gcc-4.1.1 (also that build
would not have had CONFIG_SCHED_SMT or CONFIG_SCHED_MC set).  But the
end result is that the cost is still of the order of 2.5 kilobytes.

A patch to make these changes is attached.

Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


[-- Attachment #2: simplify-get-set-caps-code --]
[-- Type: text/plain, Size: 4906 bytes --]

---
 include/linux/sched.h |   24 +++++++++++--
 kernel/sched.c        |   90 ++++++++++----------------------------------------
 2 files changed, 38 insertions(+), 76 deletions(-)

Index: MM-2.6.17-rc6-mm2/include/linux/sched.h
===================================================================
--- MM-2.6.17-rc6-mm2.orig/include/linux/sched.h	2006-06-19 13:57:18.000000000 +1000
+++ MM-2.6.17-rc6-mm2/include/linux/sched.h	2006-06-19 14:28:25.000000000 +1000
@@ -1022,11 +1022,27 @@ struct task_struct {
 };
 
 #ifdef CONFIG_CPU_RATE_CAPS
-unsigned int get_cpu_rate_cap(const struct task_struct *);
-int set_cpu_rate_cap(struct task_struct *, unsigned int);
+int set_cpu_rate_cap_low(struct task_struct *, unsigned int, int);
+
+static inline unsigned int get_cpu_rate_cap(const struct task_struct *p)
+{
+	return p->cpu_rate_cap;
+}
+
+static inline int set_cpu_rate_cap(struct task_struct *p, unsigned int newcap)
+{
+	return set_cpu_rate_cap_low(p, newcap, 0);
+}
 #ifdef CONFIG_CPU_RATE_HARD_CAPS
-unsigned int get_cpu_rate_hard_cap(const struct task_struct *);
-int set_cpu_rate_hard_cap(struct task_struct *, unsigned int);
+static inline unsigned int get_cpu_rate_hard_cap(const struct task_struct *p)
+{
+	return p->cpu_rate_hard_cap;
+}
+
+static inline int set_cpu_rate_hard_cap(struct task_struct *p, unsigned int newcap)
+{
+	return set_cpu_rate_cap_low(p, newcap, 1);
+}
 #endif
 #endif
 
Index: MM-2.6.17-rc6-mm2/kernel/sched.c
===================================================================
--- MM-2.6.17-rc6-mm2.orig/kernel/sched.c	2006-06-19 13:57:18.000000000 +1000
+++ MM-2.6.17-rc6-mm2/kernel/sched.c	2006-06-19 14:28:25.000000000 +1000
@@ -4614,17 +4614,11 @@ out_unlock:
 }
 
 #ifdef CONFIG_CPU_RATE_CAPS
-unsigned int get_cpu_rate_cap(const struct task_struct *p)
-{
-	return p->cpu_rate_cap;
-}
-
-EXPORT_SYMBOL(get_cpu_rate_cap);
-
 /*
- * Require: 0 <= new_cap <= CPU_CAP_ONE
+ * Require: 0 <= new_cap <= CPU_CAP_ONE for hard == 0
+ *          1 <= new_cap <= CPU_CAP_ONE otherwise
  */
-int set_cpu_rate_cap(struct task_struct *p, unsigned int new_cap)
+int set_cpu_rate_cap_low(struct task_struct *p, unsigned int new_cap, int hard)
 {
 	int is_allowed;
 	unsigned long flags;
@@ -4634,13 +4628,21 @@ int set_cpu_rate_cap(struct task_struct 
 
 	if (new_cap > CPU_CAP_ONE)
 		return -EINVAL;
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+	if (hard && new_cap < 1)
+		return -EINVAL;
+#endif
 	is_allowed = capable(CAP_SYS_NICE);
 	/*
 	 * We have to be careful, if called from /proc code,
 	 * the task might be in the middle of scheduling on another CPU.
 	 */
 	rq = task_rq_lock(p, &flags);
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+	delta = new_cap - (hard ? p->cpu_rate_hard_cap : p->cpu_rate_cap);
+#else
 	delta = new_cap - p->cpu_rate_cap;
+#endif
 	if (!is_allowed) {
 		/*
 		 * Ordinary users can set/change caps on their own tasks
@@ -4656,7 +4658,12 @@ int set_cpu_rate_cap(struct task_struct 
 	 * set - but as expected it wont have any effect on scheduling until
 	 * the task becomes SCHED_NORMAL/SCHED_BATCH:
 	 */
-	p->cpu_rate_cap = new_cap;
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+	if (hard)
+		p->cpu_rate_hard_cap = new_cap;
+	else
+#endif
+		p->cpu_rate_cap = new_cap;
 
 	if (has_rt_policy(p))
 		goto out;
@@ -4680,68 +4687,7 @@ out:
 	return 0;
 }
 
-EXPORT_SYMBOL(set_cpu_rate_cap);
-
-#ifdef CONFIG_CPU_RATE_HARD_CAPS
-unsigned int get_cpu_rate_hard_cap(const struct task_struct *p)
-{
-	return p->cpu_rate_hard_cap;
-}
-
-EXPORT_SYMBOL(get_cpu_rate_hard_cap);
-
-/*
- * Require: 1 <= new_cap <= CPU_CAP_ONE
- */
-int set_cpu_rate_hard_cap(struct task_struct *p, unsigned int new_cap)
-{
-	int is_allowed;
-	unsigned long flags;
-	struct runqueue *rq;
-	int delta;
-
-	if (new_cap > CPU_CAP_ONE || new_cap < 1)
-		return -EINVAL;
-	is_allowed = capable(CAP_SYS_NICE);
-	/*
-	 * We have to be careful, if called from /proc code,
-	 * the task might be in the middle of scheduling on another CPU.
-	 */
-	rq = task_rq_lock(p, &flags);
-	delta = new_cap - p->cpu_rate_hard_cap;
-	if (!is_allowed) {
-		/*
-		 * Ordinary users can set/change caps on their own tasks
-		 * provided that the new setting is MORE constraining
-		 */
-		if (((current->euid != p->uid) && (current->uid != p->uid)) || (delta > 0)) {
-			task_rq_unlock(rq, &flags);
-			return -EPERM;
-		}
-	}
-	/*
-	 * The RT tasks don't have caps, but we still allow the caps to be
-	 * set - but as expected it wont have any effect on scheduling until
-	 * the task becomes SCHED_NORMAL/SCHED_BATCH:
-	 */
-	p->cpu_rate_hard_cap = new_cap;
-
-	if (has_rt_policy(p))
-		goto out;
-
-	if (p->array)
-		dec_raw_weighted_load(rq, p);
-	set_load_weight(p);
-	if (p->array)
-		inc_raw_weighted_load(rq, p);
-out:
-	task_rq_unlock(rq, &flags);
-
-	return 0;
-}
-
-EXPORT_SYMBOL(set_cpu_rate_hard_cap);
-#endif
+EXPORT_SYMBOL(set_cpu_rate_cap_low);
 #endif
 
 long sched_setaffinity(pid_t pid, cpumask_t new_mask)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/4] sched: Add CPU rate caps
  2006-06-18 10:25   ` Peter Williams
  2006-06-18 11:42     ` Srivatsa Vaddagiri
@ 2006-06-19  0:13     ` Balbir Singh
  2006-06-19  5:04     ` Peter Williams
  2 siblings, 0 replies; 13+ messages in thread
From: Balbir Singh @ 2006-06-19  0:13 UTC (permalink / raw)
  To: Peter Williams
  Cc: Andrew Morton, kernel, sam, bsingharora, vatsa, dev,
	linux-kernel, efault, kingsley, ckrm-tech, mingo, rene.herman

Peter Williams wrote:
> Andrew Morton wrote:
> 
>> On Sun, 18 Jun 2006 18:26:38 +1000
>> Peter Williams <pwil3058@bigpond.net.au> wrote:
>>
>> People are going to want to extend this to capping a *group* of tasks, 
>> with
>> some yet-to-be-determined means of tying those tasks together.  How well
>> suited is this code to that extension?
> 
> 
> Quite good.  It can be used from outside the scheduler to impose caps on 
> arbitrary groups of tasks.  Were the PAGG interface available I could 
> knock up a module to demonstrate this.  When/if the "task watchers" 
> patch is included I will try and implement a higher level mechanism 
> using that.  The general technique is to get an estimate of the 
> "effective number" of tasks in the group (similar to load) and give each 
> task in the group a cap which is the group's cap divided by the 
> effective number of tasks (or the group cap whichever is smaller -- i.e. 
> the effective number of tasks could be less than one).


There is one possible issue with this approach.  Let's assume that we
desire a cap of 10% for a set of two tasks.  As discussed earlier, each
task would get a limit of 5% if they are equally busy.

Let's call the group G1 and the tasks T1 and T2.

Suppose we have another group, G2, with tasks T3, T4 and T5 and a soft
cap of 90%.  Then each of T3, T4 and T5 would get a soft cap of 30%
(assuming that they are equally busy).  Now, if T5 stops using its limit
for a while -- say its CPU utilization drops to 10% -- how do we divide
the spare 20% between T1, T2, T3 and T4?

In a group scenario, the remaining 20% should be shared between T3 and T4.
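
To make the arithmetic concrete, here is a standalone userspace sketch of
the naive equal split (the helper is hypothetical; values are in parts
per thousand as in the patches, so 10% is 100):

#include <stdio.h>

/* naive equal split of a group cap among its tasks (1/1000 units) */
static unsigned int per_task_cap(unsigned int group_cap, unsigned int ntasks)
{
	unsigned int share = group_cap / ntasks;

	/* no single task may get more than the whole group's cap */
	return share < group_cap ? share : group_cap;
}

int main(void)
{
	printf("G1 (cap 100, 2 tasks): %u each\n", per_task_cap(100, 2));
	printf("G2 (cap 900, 3 tasks): %u each\n", per_task_cap(900, 3));
	/*
	 * If T5 then only uses 100 (10%), the spare 200 (20%) should
	 * arguably be re-split within G2 only, giving T3 and T4 400 each.
	 */
	return 0;
}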

Also, mathematically, a group is a superset of a task.  It is hard to
implement things for a task and make them work for groups, but if we had
something for groups, we could easily adapt it to tasks by making each
group consist of a single task.


 
> Doing it inside the scheduler is also doable but would have some locking 
> issues.  The run queue lock could no longer be used to protect the data 
> as there's no guarantee that all the tasks in the group are associated 
> with the same queue.
> 
>>
>> If the task can exceed its cap without impacting any other tasks (ie: 
>> there
>> is spare idle capacity), what happens?
> 
> 
> That's the difference between soft and hard caps.  If it's a soft cap 
> then the task is allowed to exceed it if there's spare capacity.  If 
> it's a hard cap it's not.
> 

By how much is the task allowed to exceed its cap if there is spare
capacity?  Will allocating the spare capacity require the caps to be
reset?

>>  I trust that spare capacity gets
>> used?  (Is this termed "work conserving"?)
> 
> 
> Soft caps, yes.  Hard caps, no.
> 

-- 
	Cheers,
	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/4] sched: Add CPU rate caps
  2006-06-18 11:42     ` Srivatsa Vaddagiri
@ 2006-06-18 12:19       ` Peter Williams
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Williams @ 2006-06-18 12:19 UTC (permalink / raw)
  To: vatsa
  Cc: Andrew Morton, kernel, sam, bsingharora, dev, linux-kernel,
	efault, kingsley, ckrm-tech, mingo, rene.herman

Srivatsa Vaddagiri wrote:
> On Sun, Jun 18, 2006 at 08:25:02PM +1000, Peter Williams wrote:
>>> People are going to want to extend this to capping a *group* of tasks, with
>>> some yet-to-be-determined means of tying those tasks together.  How well
>>> suited is this code to that extension?
>> Quite good.  It can be used from outside the scheduler to impose caps on 
>> arbitrary groups of tasks.  Were the PAGG interface available I could 
>> knock up a module to demonstrate this.  When/if the "task watchers" 
> 
> For demonstration purposes, maybe you could use tsk->uid to group tasks
> and see how capping at the group level works out?  Basically, cap the
> per-user CPU usage.

I was thinking of using the group of tasks that make up a process's
threads as a first example, as that will be more useful to ordinary
users as well as demonstrating the technique.
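
A minimal sketch of that first example, assuming the set_cpu_rate_cap()
exported by these patches and the 2.6.17-era next_thread() iterator
(cap_thread_group() and its naive equal-split policy are hypothetical):

#include <linux/sched.h>

/* spread a per-process cap equally over the process's threads */
static void cap_thread_group(struct task_struct *leader, unsigned int cap)
{
	struct task_struct *t = leader;
	unsigned int nthreads = 0;

	read_lock(&tasklist_lock);
	do {
		nthreads++;
	} while ((t = next_thread(t)) != leader);

	do {
		/* return value (permission/range checks) ignored for brevity */
		set_cpu_rate_cap(t, cap / nthreads);
	} while ((t = next_thread(t)) != leader);
	read_unlock(&tasklist_lock);
}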

> 
>> patch is included I will try and implement a higher level mechanism 
>> using that.  The general technique is to get an estimate of the 
>> "effective number" of tasks in the group (similar to load) and give each 
>> task in the group a cap which is the group's cap divided by the 
>> effective number of tasks (or the group cap whichever is smaller -- i.e. 
>> the effective number of tasks could be less than one).
> 
> For "effective number" do you include only runnable tasks?

It would be roughly equivalent to a smoothed estimate of the number of
runnable tasks in the group.

> As and when
> this number changes, you would need to go and change the per-task cap of
> all tasks in the group. Any idea what the cost of that operation would
> be? Or do you intend to do it lazily to amortize some of that cost?

Yes.  It would be some form of periodic sampling mechanism.  It could
be done from user space or from a module.

The costs can also be reduced by realizing that it is only necessary to
do anything at all if the group's tasks could together exceed the
group's cap, i.e. if the number of tasks is greater than the cap
expressed in whole CPUs.  E.g. if a group has a cap of 3000 (three full
CPUs) but only two tasks, there is no way the group can exceed its cap,
so that group can be ignored.
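
A sketch of that early-out (the group bookkeeping is hypothetical;
CPU_CAP_ONE == 1000 is from the patches):

/* hypothetical per-group bookkeeping, for illustration only */
struct cap_group {
	unsigned int cap;	/* parts per thousand */
	unsigned int nr_tasks;
};

/* one pass of a periodic sampler (user space or module) */
static void maybe_redistribute(struct cap_group *g)
{
	/*
	 * Each task can use at most CPU_CAP_ONE (1000), so nr_tasks
	 * tasks can never use more than nr_tasks * 1000 between them.
	 * If the group's cap is at least that, skip the group outright
	 * (e.g. a cap of 3000 with two tasks can never be exceeded).
	 */
	if (g->nr_tasks * 1000 <= g->cap)
		return;

	/* ...otherwise re-divide g->cap among the group's tasks... */
}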

> 
>> Doing it inside the scheduler is also doable but would have some locking 
>> issues.  The run queue lock could no longer be used to protect the data 
>> as there's no guarantee that all the tasks in the group are associated 
>> with the same queue.
> 


-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/4] sched: Add CPU rate caps
  2006-06-18 10:25   ` Peter Williams
@ 2006-06-18 11:42     ` Srivatsa Vaddagiri
  2006-06-18 12:19       ` Peter Williams
  2006-06-19  0:13     ` Balbir Singh
  2006-06-19  5:04     ` Peter Williams
  2 siblings, 1 reply; 13+ messages in thread
From: Srivatsa Vaddagiri @ 2006-06-18 11:42 UTC (permalink / raw)
  To: Peter Williams
  Cc: Andrew Morton, kernel, sam, bsingharora, dev, linux-kernel,
	efault, kingsley, ckrm-tech, mingo, rene.herman

On Sun, Jun 18, 2006 at 08:25:02PM +1000, Peter Williams wrote:
> >People are going to want to extend this to capping a *group* of tasks, with
> >some yet-to-be-determined means of tying those tasks together.  How well
> >suited is this code to that extension?
> 
> Quite good.  It can be used from outside the scheduler to impose caps on 
> arbitrary groups of tasks.  Were the PAGG interface available I could 
> knock up a module to demonstrate this.  When/if the "task watchers" 

For demonstration purposes, maybe you could use tsk->uid to group tasks
and see how capping at the group level works out?  Basically, cap the
per-user CPU usage.

> patch is included I will try and implement a higher level mechanism 
> using that.  The general technique is to get an estimate of the 
> "effective number" of tasks in the group (similar to load) and give each 
> task in the group a cap which is the group's cap divided by the 
> effective number of tasks (or the group cap whichever is smaller -- i.e. 
> the effective number of tasks could be less than one).

For "effective number" do you include only runnable tasks? As and when
this number changes, you would need to go and change the per-task cap of
all tasks in the group. Any idea what the cost of that operation would
be? Or do you intend to do it lazily to amortize some of that cost?

> Doing it inside the scheduler is also doable but would have some locking 
> issues.  The run queue lock could no longer be used to protect the data 
> as there's no guarantee that all the tasks in the group are associated 
> with the same queue.

-- 
Regards,
vatsa

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/4] sched: Add CPU rate caps
  2006-06-18  9:50 ` Andrew Morton
@ 2006-06-18 10:25   ` Peter Williams
  2006-06-18 11:42     ` Srivatsa Vaddagiri
                       ` (2 more replies)
  2006-06-19 23:59   ` Peter Williams
  1 sibling, 3 replies; 13+ messages in thread
From: Peter Williams @ 2006-06-18 10:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, sam, bsingharora, vatsa, dev, linux-kernel, efault,
	kingsley, ckrm-tech, mingo, rene.herman

Andrew Morton wrote:
> On Sun, 18 Jun 2006 18:26:38 +1000
> Peter Williams <pwil3058@bigpond.net.au> wrote:
> 
>> These patches implement CPU usage rate limits for tasks.
> 
> Via /proc/pid/cpu_rate_cap.  Important detail, that.

Via /proc/tgid/task/pid/cpu_rate_cap actually.  I.e. it's at the task
level, not the process level.

Also, the /proc interface is just one possible interface and the one I 
implemented because it was easy and useful for testing.  Alternate 
interfaces would be easy to provide (or add).

There are exported functions to set/get caps and all the necessary 
checking is done there.  The /proc stuff is a very thin wrapper around that.

Other options for interfaces would be RLIMIT and/or a syscall.
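
For instance, an alternate interface could sit directly on the exported
setter; a hedged sketch (cap_pid() is hypothetical, 2.6.17-era pid
lookup assumed, error handling abbreviated):

#include <linux/errno.h>
#include <linux/sched.h>

static int cap_pid(pid_t pid, unsigned int cap)
{
	struct task_struct *p;
	int ret = -ESRCH;

	read_lock(&tasklist_lock);
	p = find_task_by_pid(pid);
	if (p)
		/* set_cpu_rate_cap() does the permission and range checks */
		ret = set_cpu_rate_cap(p, cap);
	read_unlock(&tasklist_lock);

	return ret;
}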

> 
> People are going to want to extend this to capping a *group* of tasks, with
> some yet-to-be-determined means of tying those tasks together.  How well
> suited is this code to that extension?

Quite good.  It can be used from outside the scheduler to impose caps on 
arbitrary groups of tasks.  Were the PAGG interface available I could 
knock up a module to demonstrate this.  When/if the "task watchers" 
patch is included I will try and implement a higher level mechanism 
using that.  The general technique is to get an estimate of the 
"effective number" of tasks in the group (similar to load) and give each 
task in the group a cap which is the group's cap divided by the 
effective number of tasks (or the group cap whichever is smaller -- i.e. 
the effective number of tasks could be less than one).
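
In outline, the division step might look like this (every name here is
hypothetical; the effective task count is kept in fixed point, scaled
by 1000, so that values below one task are representable):

static unsigned int group_task_cap(unsigned int group_cap,
				   unsigned int eff_tasks_milli)
{
	unsigned int share;

	if (eff_tasks_milli == 0)
		return group_cap;

	share = (group_cap * 1000) / eff_tasks_milli;
	/* ...or the group's cap, whichever is smaller */
	return share < group_cap ? share : group_cap;
}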

Doing it inside the scheduler is also doable but would have some locking 
issues.  The run queue lock could no longer be used to protect the data 
as there's no guarantee that all the tasks in the group are associated 
with the same queue.

> 
> If the task can exceed its cap without impacting any other tasks (ie: there
> is spare idle capacity), what happens?

That's the difference between soft and hard caps.  If it's a soft cap 
then the task is allowed to exceed it if there's spare capacity.  If 
it's a hard cap it's not.

>  I trust that spare capacity gets
> used?  (Is this termed "work conserving"?)

Soft caps, yes.  Hard caps, no.

> 
>> 5. Code size measurements:
>>
>> Vanilla kernel:
>>
>>    text    data     bss     dec     hex filename
>>   33800    4689     296   38785    9781 sched.o
>>    2554      79       0    2633     a49 mutex.o
>>   12076    2632       0   14708    3974 base.o
>>
>> Patches applied:
>>
>>    text    data     bss     dec     hex filename
>>   36870    4721     296   41887    a39f sched.o
>>    2630      79       0    2709     a95 mutex.o
>>   13011    2920       0   15931    3e3b base.o
>>
>> Indicating that the size cost of the patch proper is about
>> 3 kilobytes and the procfs costs about another 1.2 kilobytes.
>>
> 
> hm.  That seems rather a lot.  I guess it's not a simple thing to do.

I suspect that a large part of that is the functions that set the caps
(i.e. the equivalents of set_user_nice()), one for soft and one for hard
caps.  The actual capping mechanisms are quite simple.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/4] sched: Add CPU rate caps
  2006-06-18  8:26 [PATCH 0/4] sched: Add CPU rate caps Peter Williams
@ 2006-06-18  9:50 ` Andrew Morton
  2006-06-18 10:25   ` Peter Williams
  2006-06-19 23:59   ` Peter Williams
  0 siblings, 2 replies; 13+ messages in thread
From: Andrew Morton @ 2006-06-18  9:50 UTC (permalink / raw)
  To: Peter Williams
  Cc: kernel, sam, bsingharora, vatsa, pwil3058, dev, linux-kernel,
	efault, kingsley, ckrm-tech, mingo, rene.herman

On Sun, 18 Jun 2006 18:26:38 +1000
Peter Williams <pwil3058@bigpond.net.au> wrote:

> These patches implement CPU usage rate limits for tasks.

Via /proc/pid/cpu_rate_cap.  Important detail, that.

People are going to want to extend this to capping a *group* of tasks, with
some yet-to-be-determined means of tying those tasks together.  How well
suited is this code to that extension?

If the task can exceed its cap without impacting any other tasks (ie: there
is spare idle capacity), what happens?  I trust that spare capacity gets
used?  (Is this termed "work conserving"?)

> 5. Code size measurements:
> 
> Vanilla kernel:
> 
>    text    data     bss     dec     hex filename
>   33800    4689     296   38785    9781 sched.o
>    2554      79       0    2633     a49 mutex.o
>   12076    2632       0   14708    3974 base.o
> 
> Patches applied:
> 
>    text    data     bss     dec     hex filename
>   36870    4721     296   41887    a39f sched.o
>    2630      79       0    2709     a95 mutex.o
>   13011    2920       0   15931    3e3b base.o
> 
> Indicating that the size cost of the patch proper is about
> 3 kilobytes and the procfs costs about another 1.2 kilobytes.
> 

hm.  That seems rather a lot.  I guess it's not a simple thing to do.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 0/4] sched: Add CPU rate caps
@ 2006-06-18  8:26 Peter Williams
  2006-06-18  9:50 ` Andrew Morton
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Williams @ 2006-06-18  8:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Con Kolivas, Sam Vilain, Balbir Singh, Srivatsa, Peter Williams,
	Kirill Korotaev, Linux Kernel, Mike Galbraith, Kingsley Cheung,
	CKRM, Ingo Molnar, Rene Herman

These patches implement CPU usage rate limits for tasks.

Although the rlimit mechanism already has a CPU usage limit (RLIMIT_CPU)
it is a total usage limit and therefore (to my mind) not very useful.
These patches provide an alternative whereby the (recent) average CPU
usage rate of a task can be limited to a (per task) specified proportion
of a single CPU's capacity.  The limits are specified in parts per
thousand and come in two varieties -- hard and soft.  The difference
between the two is that the system tries to enforce hard caps regardless
of the other demand for CPU resources but allows soft caps to be
exceeded if there are spare CPU resources available.  By default, tasks
will have both caps set to 1000 (i.e. no limit) but newly forked tasks
will inherit any caps that have been imposed on their parent from the
parent.  The minimum soft cap allowed is 0 (which effectively puts the
task in the background) and the minimum hard cap allowed is 1.
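
For example, capping a single thread to a quarter of one CPU via the
procfs interface added by patches 3 and 4 might look like this from
user space (a sketch; the plain-decimal write format is an assumption):

#include <stdio.h>
#include <sys/types.h>

/* soft-cap one thread to 250/1000 (25%) of a single CPU's capacity */
static int write_soft_cap(pid_t tgid, pid_t tid, unsigned int cap)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/proc/%d/task/%d/cpu_rate_cap", (int)tgid, (int)tid);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%u\n", cap);
	return fclose(f);
}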

Care has been taken to minimize the overhead inflicted on tasks that
have no caps and my tests using kernbench indicate that it is hidden in
the noise.

Note:

1. Caps are not enforced for SCHED_FIFO and SCHED_RR tasks.

2. This version incorporates improvements and bug fixes as a result of
feedback from an earlier post.  Special thanks to Con Kolivas whose
suggestions with respect to improved methods for avoiding starvation
and priority inversion have enabled cap enforcement to be stricter.

3. This patch is against 2.6.17-rc6-mm2.

4. Overhead Measurements.  To measure the overhead introduced by these
patches, kernbench was used on a dual 500MHz Centrino SMP system.
Runs were done for a kernel without these
patches applied, one with the patches applied but no caps being used
and one with the patches applied and running kernbench with a soft cap
of zero (which would be inherited by all its children).

Average Optimal -j 8 Load Run:

                  Vanilla          Patch Applied    Soft Cap 0%

Elapsed Time      1056.1   (1.92)  1048.2   (0.62)  1064.1   (1.59)
User Time         1908.1   (1.09)  1895.2   (1.30)  1926.6   (1.39)
System Time        181.7   (0.60)   177.5   (0.74)   173.8   (1.07)
   Total          2089.8           2072.7           2100.4
Percent CPU        197.6   (0.55)   197.0   (0)      197.0   (0)
Context Switches 49253.6 (136.31) 48881.4  (92.03) 92490.8 (163.71)
Sleeps           28038.8 (228.11) 28136.0 (250.65) 25769.4 (280.40)

As can be seen, there is no significant overhead penalty for having
these patches applied and leaving them unused (in fact, based on
these numbers, there's actually a small improvement).  Similarly,
the overhead of running kernbench as a background (soft cap of
zero) job is not significant.  As expected, the context switches for
the background run are doubled (because ANY OTHER task running on
the machine would be able to preempt the kernbench tasks) but this
has not seriously affected the CPU usage statistics.  The similarity
of the "Percent CPU" numbers indicates that load balancing hasn't
been adversely affected.

5. Code size measurements:

Vanilla kernel:

   text    data     bss     dec     hex filename
  33800    4689     296   38785    9781 sched.o
   2554      79       0    2633     a49 mutex.o
  12076    2632       0   14708    3974 base.o

Patches applied:

   text    data     bss     dec     hex filename
  36870    4721     296   41887    a39f sched.o
   2630      79       0    2709     a95 mutex.o
  13011    2920       0   15931    3e3b base.o

Indicating that the size cost of the patch proper is about
3 kilobytes and the procfs costs about another 1.2 kilobytes.

Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>


-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2006-06-22  1:53 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-06-22  1:52 [PATCH 0/4] sched: Add CPU rate caps Peter Williams
2006-06-22  1:52 ` [PATCH 1/4] sched: Add CPU rate soft caps Peter Williams
2006-06-22  1:52 ` [PATCH 2/4] sched: Add CPU rate hard caps Peter Williams
2006-06-22  1:53 ` [PATCH 3/4] sched: Add procfs interface for CPU rate soft caps Peter Williams
2006-06-22  1:53 ` [PATCH 4/4] sched: Add procfs interface for CPU rate hard caps Peter Williams
  -- strict thread matches above, loose matches on Subject: below --
2006-06-18  8:26 [PATCH 0/4] sched: Add CPU rate caps Peter Williams
2006-06-18  9:50 ` Andrew Morton
2006-06-18 10:25   ` Peter Williams
2006-06-18 11:42     ` Srivatsa Vaddagiri
2006-06-18 12:19       ` Peter Williams
2006-06-19  0:13     ` Balbir Singh
2006-06-19  5:04     ` Peter Williams
2006-06-19 23:59   ` Peter Williams
