* [PATCH v3 00/17] New RT Task Balancing -v3
@ 2007-11-17  6:21 Steven Rostedt
  2007-11-17  6:21 ` [PATCH v3 01/17] Add rt_nr_running accounting Steven Rostedt
                   ` (17 more replies)
  0 siblings, 18 replies; 28+ messages in thread
From: Steven Rostedt @ 2007-11-17  6:21 UTC (permalink / raw)
  To: LKML; +Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter


[
  Changes since V2:

    Updated to git tree 8c0863403f109a43d7000b4646da4818220d501f

    This version also contains patches from Gregory Haskins.

    Actually brought back the global RT overload bitmask.
    The reason is that it should seldom be written to: the RT
    overload bitmask is only set when more than one RT task is
    scheduled to run on the same runqueue, and that should not
    happen often. The other methods of trying to get rid of
    the global cpumask had issues of their own.

    Also, to avoid overloading a runqueue, a check is made
    before a woken RT task is actually scheduled onto a runqueue
    to see whether it would be better to place it onto another
    runqueue when the target runqueue is already running an
    RT task. Even if the currently running RT task is of lower
    priority than the one waking up, we still try to move the
    woken RT task to another runqueue if possible.

    The reason is that RT tasks behave differently from normal
    tasks. Preempting a normal task to run an RT task, to keep
    the RT task's cache hot, is fine, because the preempted
    non-RT task will likely wait on that same runqueue to run
    again, unless the migration thread comes along and pulls it off.

    RT tasks are different. If one is preempted, it makes an
    active effort to continue to run. So when a high priority
    task preempts a lower priority RT task, that lower priority
    RT task will quickly try to run on another runqueue,
    replacing its nice hot cache (and TLB) with a completely
    cold one, all in the hope that the new high priority RT
    task will keep its cache hot.

    Remember that this high priority RT task was just woken up.
    It has likely been sleeping for several milliseconds and will
    end up with a cold cache anyway. RT tasks run until they
    voluntarily stop or are preempted by a higher priority task,
    which means it is unlikely that the woken RT task will wake
    up to a hot cache. So pushing off a lower priority RT task
    just kills its cache for no good reason.
]

Currently in mainline the balancing of multiple RT threads is quite broken.
That is to say, a high priority thread that is scheduled on a CPU already
running an even higher priority thread may wait unnecessarily, even though
it could easily run on another CPU that is running a lower priority thread.

Balancing (or migrating) tasks in general is an art. Lots of considerations
must be taken into account: cache lines, NUMA and more. This is fine for
general processes, which expect high throughput and whose migration can
be done in batch.  But when it comes to RT tasks, we really need to
push them off to a CPU that they can run on as soon as possible. Even
if it means a bit of cache line flushing.

Right now an RT task can wait several milliseconds, and perhaps even longer,
before it gets scheduled to run. The migration thread is simply not fast
enough to take care of RT tasks.

To demonstrate this, I wrote a simple test.
 
  http://rostedt.homelinux.com/rt/rt-migrate-test.c

  (gcc -o rt-migrate-test rt-migrate-test.c -lpthread)

This test takes a parameter specifying the number of threads to create.
If you add the '-c' (check) option, it will terminate as soon as one of
the iterations fails. When using -c, pass in one more thread than the
number of CPUs.

For example, on a 4 way box, I used

  rt-migrate-test -c 5

The test creates the number of threads specified (in this case 5). Each
thread is set up as an RT FIFO task, starting at a specified prio (default
2), with each subsequent thread one priority higher. So in this example
the 5 threads created are at priorities 2, 3, 4, 5, and 6.

The parent thread sets its priority to one higher than the highest of
the children (7 in this example). It uses pthread_barrier_wait to
synchronize the threads, then takes a time stamp and starts all the
threads. Each thread, when woken up, takes a time stamp and compares it
to the parent's to see how long it took to be awoken. It then runs in a
busy loop for an interval (20ms by default). The busy loop ends when the
thread reaches the interval delta from the start time stamp, so if it is
preempted it may not actually run for the full interval. This is expected
behavior of the test.
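
For reference, the core of the test is structured roughly like this. This
is a simplified, single-iteration sketch, not the actual rt-migrate-test.c;
the thread count, priorities, helper names and the 20ms interval are hard
coded here purely for illustration, and error checking is omitted
(SCHED_FIFO needs root). Build it like the real test, with -lpthread (and
-lrt on older glibc for clock_gettime):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>

#define NR_THREADS	5
#define START_PRIO	2
#define INTERVAL_US	20000		/* 20ms busy loop, as in the real test */

static pthread_barrier_t barrier;
static struct timespec start_ts;

static long long usecs_since(const struct timespec *ts)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);
	return (now.tv_sec - ts->tv_sec) * 1000000LL +
	       (now.tv_nsec - ts->tv_nsec) / 1000;
}

static void set_fifo_prio(int prio)
{
	struct sched_param param = { .sched_priority = prio };

	pthread_setschedparam(pthread_self(), SCHED_FIFO, &param);
}

static void *thread_func(void *arg)
{
	long id = (long)arg;
	long long wakeup_delay;
	unsigned long loops = 0;

	set_fifo_prio(START_PRIO + id);		/* threads at prio 2, 3, 4, 5, 6 */

	pthread_barrier_wait(&barrier);		/* parent stamps start_ts first */
	wakeup_delay = usecs_since(&start_ts);	/* how long until we got a CPU */

	/* Busy loop until the interval has elapsed since the parent's stamp.
	 * If we were preempted we do fewer iterations -- that is expected. */
	while (usecs_since(&start_ts) < INTERVAL_US)
		loops++;

	printf("thread %ld (prio %ld): wakeup %lld us, loops %lu\n",
	       id, START_PRIO + id, wakeup_delay, loops);
	return NULL;
}

int main(void)
{
	pthread_t threads[NR_THREADS];
	long i;

	set_fifo_prio(START_PRIO + NR_THREADS);	/* parent is highest prio (7) */
	pthread_barrier_init(&barrier, NULL, NR_THREADS + 1);

	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&threads[i], NULL, thread_func, (void *)i);

	clock_gettime(CLOCK_MONOTONIC, &start_ts);
	pthread_barrier_wait(&barrier);		/* release all the children */

	for (i = 0; i < NR_THREADS; i++)
		pthread_join(threads[i], NULL);
	return 0;
}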

The numbers recorded are: the delta between the thread's wakeup time stamp
and the parent's time stamp, the number of iterations of the busy loop, and
the delta between a time stamp taken at the end of the loop and the parent's
time stamp.

Sometimes a lower priority task might wake up before a higher priority one,
but this is OK, as long as the higher priority process gets the CPU when
it is awoken.

At the end of the test, the iteration data is printed to stdout. If a
higher priority task had to wait for a lower one to finish running, then
this is considered a failure. Here's an example of the output from
a run against git commit 4fa4d23fa20de67df919030c1216295664866ad7.

   1:       36      33   20041      39      33
 len:    20036   20033   40041   20039   20033
 loops: 167789  167693  227167  167829  167814

On iteration 1 (iterations start at 0) the third task started 20ms after the
parent woke it up. We can see here that the first two tasks ran to completion
before the higher priority task was even able to start. That is a
20ms latency for the higher priority task!!!

So people who think that upping the priority of their audio threads will
get rid of most latencies may be in for a surprise, since some kernel
threads (like the migration thread itself) can cause this latency.

To solve this issue, I've changed the RT task balancing from a passive
method (migration thread) to an active method.  This new method is
to actively push or pull RT tasks when they are woken up or scheduled.

On wake up of an RT task, if there is already an RT task of higher priority
running on its runqueue, we initiate the push_rt_tasks algorithm. This
algorithm looks at the highest priority non-running RT task and tries to
find a CPU it can run on. It only migrates the RT task if it finds a CPU
(the lowest priority one within the task's affinity) where the RT task can
run and can preempt the currently running task on that CPU. We continue
pushing RT tasks until we can't push any more.
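
In other words, each push boils down to a "find the lowest priority CPU in
the task's affinity and check that we would preempt it" decision. Below is
a standalone toy model of just that decision; the names, the per-CPU array
and the bitmask affinity are made up for illustration, while the real code
in patch 03 works on struct rq and cpumask_t and takes the runqueue locks:

/* Toy model of the push decision -- names and data structures are
 * illustrative only, not the kernel's. */
#include <stdio.h>

#define NR_CPUS		4
#define MAX_RT_PRIO	100	/* lower number = higher RT priority */

/* highest (numerically lowest) prio of any RT task queued on each CPU;
 * MAX_RT_PRIO means that CPU has no RT task at all */
static int highest_prio[NR_CPUS] = { 10, 40, MAX_RT_PRIO, 25 };

/*
 * Find the best CPU to push a task of priority 'prio' with affinity
 * 'cpus_allowed' (one bit per CPU) to: prefer a CPU with no RT task,
 * otherwise the CPU whose highest queued RT prio is the lowest, and
 * only if the pushed task would preempt it.  Returns -1 if no CPU
 * qualifies, which is when the push algorithm stops.
 */
static int find_lowest_cpu(int prio, unsigned long cpus_allowed, int this_cpu)
{
	int cpu, lowest = -1;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpu == this_cpu || !(cpus_allowed & (1UL << cpu)))
			continue;

		/* a CPU running no RT task is always good enough */
		if (highest_prio[cpu] >= MAX_RT_PRIO)
			return cpu;

		/* otherwise remember the lowest priority CPU that the
		 * pushed task would still preempt */
		if (highest_prio[cpu] > prio &&
		    (lowest < 0 || highest_prio[cpu] > highest_prio[lowest]))
			lowest = cpu;
	}
	return lowest;
}

int main(void)
{
	/* a prio-30 task is queued behind the prio-10 task on CPU 0
	 * and is allowed to run on any of the four CPUs */
	printf("push prio-30 task from CPU 0 to CPU %d\n",
	       find_lowest_cpu(30, 0xf, 0));
	return 0;
}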

If an RT task fails to be migrated, we stop pushing. This works because we
are always looking at the highest priority non-running RT task on the run
queue, and if it can't migrate, then most likely the lower priority RT tasks
cannot either.

There is one case that is not covered by this patch set: when the highest
priority non-running RT task has a CPU affinity such that it cannot preempt
any of the tasks currently running on the CPUs in its affinity mask, while
a lower priority task has a wider affinity and could run on one of those
other CPUs. In this case the lower priority task will not be pushed to those
CPUs (although those CPUs may still pull it). Currently this patch set
ignores this scenario.

Another place where we push RT tasks is in finish_task_switch. This is done
because a running RT task cannot be migrated while it is running, so if an
RT task is preempted by a higher priority RT task, the moment the switch
completes is when we can migrate the task that was just preempted.

We also actively pull RT tasks. Whenever a runqueue is about to lower its
priority (schedule a lower priority task), a check is done to see if that
runqueue can pull RT tasks over to run instead. All overloaded runqueues
(runqueues with more than one RT task queued on them) are searched for an
RT task that can run on the pulling runqueue (its affinity matches) and is
of higher priority than the task the runqueue is about to schedule. The
pull algorithm pulls all RT tasks that match this case.
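
A standalone toy model of the pull decision might look like the following.
Again, the data structures here are made up for illustration; the real
pull_rt_task() in patch 05 walks the actual overloaded runqueues, takes
their locks and re-checks the priorities after locking:

/* Toy model of the pull decision -- names and data structures are
 * illustrative only, not the kernel's. */
#include <stdio.h>

#define NR_CPUS		4
#define MAX_RT_PRIO	100	/* lower number = higher RT priority */

/* per overloaded CPU: the highest queued RT task that is not the one
 * currently running there, and its affinity (one bit per CPU) */
struct candidate {
	int prio;			/* MAX_RT_PRIO if none */
	unsigned long cpus_allowed;
};

static unsigned long rt_overload_mask = 0x6;	/* CPUs 1 and 2 overloaded */
static struct candidate cand[NR_CPUS] = {
	{ MAX_RT_PRIO, 0x0 },
	{ 20, 0xf },		/* prio-20 task waiting on CPU 1, runs anywhere */
	{ 50, 0x4 },		/* prio-50 task waiting on CPU 2, pinned there */
	{ MAX_RT_PRIO, 0x0 },
};

/* this_cpu is about to schedule a task of priority next_prio; pull
 * every queued RT task that can run here and beats that priority */
static void pull_rt_tasks_for(int this_cpu, int next_prio)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpu == this_cpu || !(rt_overload_mask & (1UL << cpu)))
			continue;

		if ((cand[cpu].cpus_allowed & (1UL << this_cpu)) &&
		    cand[cpu].prio < next_prio) {
			printf("CPU %d pulls prio-%d task from CPU %d\n",
			       this_cpu, cand[cpu].prio, cpu);
			/* later candidates must beat the task we just
			 * pulled, not the one we originally had */
			next_prio = cand[cpu].prio;
		}
	}
}

int main(void)
{
	/* CPU 0 is dropping down to a prio-60 RT task */
	pull_rt_tasks_for(0, 60);
	return 0;
}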

With this patch set, I ran the rt-migrate-test overnight in a while loop
with the -c option (which terminates upon failure) and it passed over 6500
runs (each doing 50 wakeups).

Here's an example of the output from the patched kernel. This is just to
explain it a bit more.

   1:    20060      61      55      56      61
 len:    40060   20061   20055   20056   20061
 loops: 227255  146050  168104  145965  168144

   2:       40      46      31      35      42
 len:    20040   20046   20031   20035   20042
 loops:     28  167769  167781  167668  167804

The first iteration shown (really the second, since we start at zero) is a
typical run. The lowest prio task didn't start executing until the other 4
tasks finished and gave up the CPU.

The second iteration seems at first like a failure, but it is actually fine.
The lowest priority task just happened to get scheduled onto a CPU before the
higher priority tasks were woken up. As you can see from this example, the
higher priority tasks were still able to get scheduled right away, and in
doing so preempted the lower priority task. This can be seen from the number
of loops the lower priority task was able to complete: only 28. The busy loop
terminates when the time stamp reaches the time interval (20ms here) from the
start time stamp. Since the lower priority task was able to sneak in and
start, its time stamp was close to the start. So after it got preempted and
rescheduled, it was already past the run time interval and simply ended the
loop.

Finally, the CFS RT balancing had to be removed in order for this to work.
Testing showed that the CFS RT balancing would actually pull RT tasks away
from the runqueues the new code had already assigned them to, and again
cause latencies.  With this new approach, the CFS RT balancing is not
needed, and I suggest that these patches replace it.

Also, let me stress that I made a great effort to keep the overhead of this
for the non-RT case as low as possible (practically none). Most of these
algorithms only take place when more than one RT task is queued on the
same runqueue.

Special thanks go to Gregory Haskins for his advice and the back and forth
on IRC with ideas. Although I didn't use his actual patches (his were
against -rt), they did help me clean up some of my code.  Thanks also go to
Ingo Molnar, from whose RT balance code in the -rt patch I took some ideas.


Comments welcomed!

-- Steve


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3 01/17] Add rt_nr_running accounting
  2007-11-17  6:21 [PATCH v3 00/17] New RT Task Balancing -v3 Steven Rostedt
@ 2007-11-17  6:21 ` Steven Rostedt
  2007-11-17  6:21 ` [PATCH v3 02/17] track highest prio queued on runqueue Steven Rostedt
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Steven Rostedt @ 2007-11-17  6:21 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
	Steven Rostedt

[-- Attachment #1: count-rt-queued.patch --]
[-- Type: text/plain, Size: 1887 bytes --]

This patch adds accounting to keep track of the number of RT tasks running
on a runqueue. This information will be used in later patches.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>

---
 kernel/sched.c    |    1 +
 kernel/sched_rt.c |   17 +++++++++++++++++
 2 files changed, 18 insertions(+)

Index: linux-compile.git/kernel/sched.c
===================================================================
--- linux-compile.git.orig/kernel/sched.c	2007-11-16 11:16:38.000000000 -0500
+++ linux-compile.git/kernel/sched.c	2007-11-16 21:48:48.000000000 -0500
@@ -266,6 +266,7 @@ struct rt_rq {
 	struct rt_prio_array active;
 	int rt_load_balance_idx;
 	struct list_head *rt_load_balance_head, *rt_load_balance_curr;
+	unsigned long rt_nr_running;
 };
 
 /*
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c	2007-11-16 11:15:56.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c	2007-11-16 21:48:48.000000000 -0500
@@ -25,12 +25,27 @@ static void update_curr_rt(struct rq *rq
 	curr->se.exec_start = rq->clock;
 }
 
+static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
+{
+	WARN_ON(!rt_task(p));
+	rq->rt.rt_nr_running++;
+}
+
+static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
+{
+	WARN_ON(!rt_task(p));
+	WARN_ON(!rq->rt.rt_nr_running);
+	rq->rt.rt_nr_running--;
+}
+
 static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup)
 {
 	struct rt_prio_array *array = &rq->rt.active;
 
 	list_add_tail(&p->run_list, array->queue + p->prio);
 	__set_bit(p->prio, array->bitmap);
+
+	inc_rt_tasks(p, rq);
 }
 
 /*
@@ -45,6 +60,8 @@ static void dequeue_task_rt(struct rq *r
 	list_del(&p->run_list);
 	if (list_empty(array->queue + p->prio))
 		__clear_bit(p->prio, array->bitmap);
+
+	dec_rt_tasks(p, rq);
 }
 
 /*

-- 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3 02/17] track highest prio queued on runqueue
  2007-11-17  6:21 [PATCH v3 00/17] New RT Task Balancing -v3 Steven Rostedt
  2007-11-17  6:21 ` [PATCH v3 01/17] Add rt_nr_running accounting Steven Rostedt
@ 2007-11-17  6:21 ` Steven Rostedt
  2007-11-17  6:21 ` [PATCH v3 03/17] push RT tasks Steven Rostedt
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Steven Rostedt @ 2007-11-17  6:21 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
	Steven Rostedt

[-- Attachment #1: add-rq-highest-prio.patch --]
[-- Type: text/plain, Size: 2375 bytes --]

This patch adds accounting to each runqueue to keep track of the
highest prio task queued on the run queue. We only care about
RT tasks, so if the run queue does not contain any active RT tasks
its priority will be considered MAX_RT_PRIO.

This information will be used for later patches.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>

---
 kernel/sched.c    |    3 +++
 kernel/sched_rt.c |   18 ++++++++++++++++++
 2 files changed, 21 insertions(+)

Index: linux-compile.git/kernel/sched.c
===================================================================
--- linux-compile.git.orig/kernel/sched.c	2007-11-16 21:55:20.000000000 -0500
+++ linux-compile.git/kernel/sched.c	2007-11-16 21:55:34.000000000 -0500
@@ -267,6 +267,8 @@ struct rt_rq {
 	int rt_load_balance_idx;
 	struct list_head *rt_load_balance_head, *rt_load_balance_curr;
 	unsigned long rt_nr_running;
+	/* highest queued rt task prio */
+	int highest_prio;
 };
 
 /*
@@ -6776,6 +6778,7 @@ void __init sched_init(void)
 		rq->cpu = i;
 		rq->migration_thread = NULL;
 		INIT_LIST_HEAD(&rq->migration_queue);
+		rq->rt.highest_prio = MAX_RT_PRIO;
 #endif
 		atomic_set(&rq->nr_iowait, 0);
 
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c	2007-11-16 21:55:20.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c	2007-11-16 21:55:34.000000000 -0500
@@ -29,6 +29,10 @@ static inline void inc_rt_tasks(struct t
 {
 	WARN_ON(!rt_task(p));
 	rq->rt.rt_nr_running++;
+#ifdef CONFIG_SMP
+	if (p->prio < rq->rt.highest_prio)
+		rq->rt.highest_prio = p->prio;
+#endif /* CONFIG_SMP */
 }
 
 static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
@@ -36,6 +40,20 @@ static inline void dec_rt_tasks(struct t
 	WARN_ON(!rt_task(p));
 	WARN_ON(!rq->rt.rt_nr_running);
 	rq->rt.rt_nr_running--;
+#ifdef CONFIG_SMP
+	if (rq->rt.rt_nr_running) {
+		struct rt_prio_array *array;
+
+		WARN_ON(p->prio < rq->rt.highest_prio);
+		if (p->prio == rq->rt.highest_prio) {
+			/* recalculate */
+			array = &rq->rt.active;
+			rq->rt.highest_prio =
+				sched_find_first_bit(array->bitmap);
+		} /* otherwise leave rq->highest prio alone */
+	} else
+		rq->rt.highest_prio = MAX_RT_PRIO;
+#endif /* CONFIG_SMP */
 }
 
 static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup)

-- 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3 03/17] push RT tasks
  2007-11-17  6:21 [PATCH v3 00/17] New RT Task Balancing -v3 Steven Rostedt
  2007-11-17  6:21 ` [PATCH v3 01/17] Add rt_nr_running accounting Steven Rostedt
  2007-11-17  6:21 ` [PATCH v3 02/17] track highest prio queued on runqueue Steven Rostedt
@ 2007-11-17  6:21 ` Steven Rostedt
  2007-11-17  6:21 ` [PATCH v3 04/17] RT overloaded runqueues accounting Steven Rostedt
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Steven Rostedt @ 2007-11-17  6:21 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
	Steven Rostedt

[-- Attachment #1: rt-balance-push-tasks.patch --]
[-- Type: text/plain, Size: 9361 bytes --]

This patch adds an algorithm to push extra RT tasks off a run queue to
other CPU runqueues.

When more than one RT task is queued on a run queue, this algorithm takes
an assertive approach to push the RT tasks that are not running onto other
run queues that have lower priority.  It works by looking at the highest
priority RT task that is not running and examining the runqueues of the
CPUs in that task's affinity mask. We find the runqueue with the lowest
prio within the picked task's CPU affinity, and if that runqueue is lower
in prio than the picked task, we push the task onto that CPU's runqueue.

We continue pushing RT tasks off the current runqueue until we can't push any
more.  The algorithm stops when the next highest RT task can't preempt any
currently running task on the other CPUs.

TODO: The algorithm may stop while there are still RT tasks that could be
 migrated. Specifically, if the highest non-running RT task's CPU affinity
 is restricted to CPUs that are running higher priority tasks, there may
 be a lower priority task queued whose affinity includes a CPU that is
 running a still lower priority task, to which it could be migrated.  This
 patch set does not address this issue.

Note: checkpatch reveals two lines over 80 characters. I'm not sure
 that breaking them up will help visually, so I left them as is.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>

---
 kernel/sched.c    |    8 +
 kernel/sched_rt.c |  223 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 229 insertions(+), 2 deletions(-)

Index: linux-compile.git/kernel/sched.c
===================================================================
--- linux-compile.git.orig/kernel/sched.c	2007-11-16 21:49:27.000000000 -0500
+++ linux-compile.git/kernel/sched.c	2007-11-16 21:49:59.000000000 -0500
@@ -1877,6 +1877,8 @@ static void finish_task_switch(struct rq
 	prev_state = prev->state;
 	finish_arch_switch(prev);
 	finish_lock_switch(rq, prev);
+	schedule_tail_balance_rt(rq);
+
 	fire_sched_in_preempt_notifiers(current);
 	if (mm)
 		mmdrop(mm);
@@ -2110,11 +2112,13 @@ static void double_rq_unlock(struct rq *
 /*
  * double_lock_balance - lock the busiest runqueue, this_rq is locked already.
  */
-static void double_lock_balance(struct rq *this_rq, struct rq *busiest)
+static int double_lock_balance(struct rq *this_rq, struct rq *busiest)
 	__releases(this_rq->lock)
 	__acquires(busiest->lock)
 	__acquires(this_rq->lock)
 {
+	int ret = 0;
+
 	if (unlikely(!irqs_disabled())) {
 		/* printk() doesn't work good under rq->lock */
 		spin_unlock(&this_rq->lock);
@@ -2125,9 +2129,11 @@ static void double_lock_balance(struct r
 			spin_unlock(&this_rq->lock);
 			spin_lock(&busiest->lock);
 			spin_lock(&this_rq->lock);
+			ret = 1;
 		} else
 			spin_lock(&busiest->lock);
 	}
+	return ret;
 }
 
 /*
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c	2007-11-16 21:49:27.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c	2007-11-16 21:49:59.000000000 -0500
@@ -134,6 +134,225 @@ static void put_prev_task_rt(struct rq *
 }
 
 #ifdef CONFIG_SMP
+/* Only try algorithms three times */
+#define RT_MAX_TRIES 3
+
+static int double_lock_balance(struct rq *this_rq, struct rq *busiest);
+static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep);
+
+/* Return the second highest RT task, NULL otherwise */
+static struct task_struct *pick_next_highest_task_rt(struct rq *rq)
+{
+	struct rt_prio_array *array = &rq->rt.active;
+	struct task_struct *next;
+	struct list_head *queue;
+	int idx;
+
+	assert_spin_locked(&rq->lock);
+
+	if (likely(rq->rt.rt_nr_running < 2))
+		return NULL;
+
+	idx = sched_find_first_bit(array->bitmap);
+	if (unlikely(idx >= MAX_RT_PRIO)) {
+		WARN_ON(1); /* rt_nr_running is bad */
+		return NULL;
+	}
+
+	queue = array->queue + idx;
+	next = list_entry(queue->next, struct task_struct, run_list);
+	if (unlikely(next != rq->curr))
+		return next;
+
+	if (queue->next->next != queue) {
+		/* same prio task */
+		next = list_entry(queue->next->next, struct task_struct, run_list);
+		return next;
+	}
+
+	/* slower, but more flexible */
+	idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1);
+	if (unlikely(idx >= MAX_RT_PRIO)) {
+		WARN_ON(1); /* rt_nr_running was 2 and above! */
+		return NULL;
+	}
+
+	queue = array->queue + idx;
+	next = list_entry(queue->next, struct task_struct, run_list);
+
+	return next;
+}
+
+/* Will lock the rq it finds */
+static struct rq *find_lock_lowest_rq(struct task_struct *task,
+				      struct rq *this_rq)
+{
+	struct rq *lowest_rq = NULL;
+	cpumask_t cpu_mask;
+	int cpu;
+	int tries;
+
+	cpus_and(cpu_mask, cpu_online_map, task->cpus_allowed);
+
+	for (tries = 0; tries < RT_MAX_TRIES; tries++) {
+		/*
+		 * Scan each rq for the lowest prio.
+		 */
+		for_each_cpu_mask(cpu, cpu_mask) {
+			struct rq *rq = &per_cpu(runqueues, cpu);
+
+			if (cpu == this_rq->cpu)
+				continue;
+
+			/* We look for lowest RT prio or non-rt CPU */
+			if (rq->rt.highest_prio >= MAX_RT_PRIO) {
+				lowest_rq = rq;
+				break;
+			}
+
+			/* no locking for now */
+			if (rq->rt.highest_prio > task->prio &&
+			    (!lowest_rq || rq->rt.highest_prio > lowest_rq->rt.highest_prio)) {
+				lowest_rq = rq;
+			}
+		}
+
+		if (!lowest_rq)
+			break;
+
+		/* if the prio of this runqueue changed, try again */
+		if (double_lock_balance(this_rq, lowest_rq)) {
+			/*
+			 * We had to unlock the run queue. In
+			 * the mean time, task could have
+			 * migrated already or had its affinity changed.
+			 * Also make sure that it wasn't scheduled on its rq.
+			 */
+			if (unlikely(task_rq(task) != this_rq ||
+				     !cpu_isset(lowest_rq->cpu, task->cpus_allowed) ||
+				     task_running(this_rq, task) ||
+				     !task->se.on_rq)) {
+				spin_unlock(&lowest_rq->lock);
+				lowest_rq = NULL;
+				break;
+			}
+		}
+
+		/* If this rq is still suitable use it. */
+		if (lowest_rq->rt.highest_prio > task->prio)
+			break;
+
+		/* try again */
+		spin_unlock(&lowest_rq->lock);
+		lowest_rq = NULL;
+	}
+
+	return lowest_rq;
+}
+
+/*
+ * If the current CPU has more than one RT task, see if the non
+ * running task can migrate over to a CPU that is running a task
+ * of lesser priority.
+ */
+static int push_rt_task(struct rq *this_rq)
+{
+	struct task_struct *next_task;
+	struct rq *lowest_rq;
+	int ret = 0;
+	int paranoid = RT_MAX_TRIES;
+
+	assert_spin_locked(&this_rq->lock);
+
+	next_task = pick_next_highest_task_rt(this_rq);
+	if (!next_task)
+		return 0;
+
+ retry:
+	if (unlikely(next_task == this_rq->curr))
+		return 0;
+
+	/*
+	 * It's possible that the next_task slipped in of
+	 * higher priority than current. If that's the case
+	 * just reschedule current.
+	 */
+	if (unlikely(next_task->prio < this_rq->curr->prio)) {
+		resched_task(this_rq->curr);
+		return 0;
+	}
+
+	/* We might release this_rq lock */
+	get_task_struct(next_task);
+
+	/* find_lock_lowest_rq locks the rq if found */
+	lowest_rq = find_lock_lowest_rq(next_task, this_rq);
+	if (!lowest_rq) {
+		struct task_struct *task;
+		/*
+		 * find lock_lowest_rq releases this_rq->lock
+		 * so it is possible that next_task has changed.
+		 * If it has, then try again.
+		 */
+		task = pick_next_highest_task_rt(this_rq);
+		if (unlikely(task != next_task) && task && paranoid--) {
+			put_task_struct(next_task);
+			next_task = task;
+			goto retry;
+		}
+		goto out;
+	}
+
+	assert_spin_locked(&lowest_rq->lock);
+
+	deactivate_task(this_rq, next_task, 0);
+	set_task_cpu(next_task, lowest_rq->cpu);
+	activate_task(lowest_rq, next_task, 0);
+
+	resched_task(lowest_rq->curr);
+
+	spin_unlock(&lowest_rq->lock);
+
+	ret = 1;
+out:
+	put_task_struct(next_task);
+
+	return ret;
+}
+
+/*
+ * TODO: Currently we just use the second highest prio task on
+ *       the queue, and stop when it can't migrate (or there's
+ *       no more RT tasks).  There may be a case where a lower
+ *       priority RT task has a different affinity than the
+ *       higher RT task. In this case the lower RT task could
+ *       possibly be able to migrate where as the higher priority
+ *       RT task could not.  We currently ignore this issue.
+ *       Enhancements are welcome!
+ */
+static void push_rt_tasks(struct rq *rq)
+{
+	/* push_rt_task will return true if it moved an RT */
+	while (push_rt_task(rq))
+		;
+}
+
+static void schedule_tail_balance_rt(struct rq *rq)
+{
+	/*
+	 * If we have more than one rt_task queued, then
+	 * see if we can push the other rt_tasks off to other CPUS.
+	 * Note we may release the rq lock, and since
+	 * the lock was owned by prev, we need to release it
+	 * first via finish_lock_switch and then reaquire it here.
+	 */
+	if (unlikely(rq->rt.rt_nr_running > 1)) {
+		spin_lock_irq(&rq->lock);
+		push_rt_tasks(rq);
+		spin_unlock_irq(&rq->lock);
+	}
+}
+
 /*
  * Load-balancing iterator. Note: while the runqueue stays locked
  * during the whole iteration, the current task might be
@@ -238,7 +457,9 @@ move_one_task_rt(struct rq *this_rq, int
 	return iter_move_one_task(this_rq, this_cpu, busiest, sd, idle,
 				  &rt_rq_iterator);
 }
-#endif
+#else /* CONFIG_SMP */
+# define schedule_tail_balance_rt(rq)	do { } while (0)
+#endif /* CONFIG_SMP */
 
 static void task_tick_rt(struct rq *rq, struct task_struct *p)
 {

-- 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3 04/17] RT overloaded runqueues accounting
  2007-11-17  6:21 [PATCH v3 00/17] New RT Task Balancing -v3 Steven Rostedt
                   ` (2 preceding siblings ...)
  2007-11-17  6:21 ` [PATCH v3 03/17] push RT tasks Steven Rostedt
@ 2007-11-17  6:21 ` Steven Rostedt
  2007-11-17  6:21 ` [PATCH v3 05/17] pull RT tasks Steven Rostedt
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Steven Rostedt @ 2007-11-17  6:21 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
	Steven Rostedt

[-- Attachment #1: rt-overload.patch --]
[-- Type: text/plain, Size: 2059 bytes --]

This patch adds an RT overload accounting system. When a runqueue has
more than one RT task queued, it is marked as overloaded. That is, it is
a candidate to have RT tasks pulled from it.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 kernel/sched_rt.c |   36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c	2007-11-16 21:56:29.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c	2007-11-16 22:12:07.000000000 -0500
@@ -3,6 +3,38 @@
  * policies)
  */
 
+#ifdef CONFIG_SMP
+static cpumask_t rt_overload_mask;
+static atomic_t rto_count;
+static inline int rt_overloaded(void)
+{
+	return atomic_read(&rto_count);
+}
+static inline cpumask_t *rt_overload(void)
+{
+	return &rt_overload_mask;
+}
+static inline void rt_set_overload(struct rq *rq)
+{
+	cpu_set(rq->cpu, rt_overload_mask);
+	/*
+	 * Make sure the mask is visible before we set
+	 * the overload count. That is checked to determine
+	 * if we should look at the mask. It would be a shame
+	 * if we looked at the mask, but the mask was not
+	 * updated yet.
+	 */
+	wmb();
+	atomic_inc(&rto_count);
+}
+static inline void rt_clear_overload(struct rq *rq)
+{
+	/* the order here really doesn't matter */
+	atomic_dec(&rto_count);
+	cpu_clear(rq->cpu, rt_overload_mask);
+}
+#endif /* CONFIG_SMP */
+
 /*
  * Update the current task's runtime statistics. Skip current tasks that
  * are not in our scheduling class.
@@ -32,6 +64,8 @@ static inline void inc_rt_tasks(struct t
 #ifdef CONFIG_SMP
 	if (p->prio < rq->rt.highest_prio)
 		rq->rt.highest_prio = p->prio;
+	if (rq->rt.rt_nr_running > 1)
+		rt_set_overload(rq);
 #endif /* CONFIG_SMP */
 }
 
@@ -53,6 +87,8 @@ static inline void dec_rt_tasks(struct t
 		} /* otherwise leave rq->highest prio alone */
 	} else
 		rq->rt.highest_prio = MAX_RT_PRIO;
+	if (rq->rt.rt_nr_running < 2)
+		rt_clear_overload(rq);
 #endif /* CONFIG_SMP */
 }
 

-- 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3 05/17] pull RT tasks
  2007-11-17  6:21 [PATCH v3 00/17] New RT Task Balancing -v3 Steven Rostedt
                   ` (3 preceding siblings ...)
  2007-11-17  6:21 ` [PATCH v3 04/17] RT overloaded runqueues accounting Steven Rostedt
@ 2007-11-17  6:21 ` Steven Rostedt
  2007-11-17  6:21 ` [PATCH v3 06/17] wake up balance RT Steven Rostedt
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Steven Rostedt @ 2007-11-17  6:21 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
	Steven Rostedt

[-- Attachment #1: rt-balance-pull-tasks.patch --]
[-- Type: text/plain, Size: 8059 bytes --]

This patch adds the algorithm to pull tasks from RT overloaded runqueues.

When an RT pull is initiated, all overloaded runqueues are examined for
an RT task that is higher in prio than the highest prio task queued on the
target runqueue. If such a task is found on another runqueue, it is pulled
to the target runqueue.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>

---
 kernel/sched.c    |    2 
 kernel/sched_rt.c |  196 ++++++++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 187 insertions(+), 11 deletions(-)

Index: linux-compile.git/kernel/sched.c
===================================================================
--- linux-compile.git.orig/kernel/sched.c	2007-11-16 22:11:37.000000000 -0500
+++ linux-compile.git/kernel/sched.c	2007-11-16 22:12:13.000000000 -0500
@@ -3646,6 +3646,8 @@ need_resched_nonpreemptible:
 		switch_count = &prev->nvcsw;
 	}
 
+	schedule_balance_rt(rq, prev);
+
 	if (unlikely(!rq->nr_running))
 		idle_balance(cpu, rq);
 
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c	2007-11-16 22:12:07.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c	2007-11-16 22:12:13.000000000 -0500
@@ -176,8 +176,17 @@ static void put_prev_task_rt(struct rq *
 static int double_lock_balance(struct rq *this_rq, struct rq *busiest);
 static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep);
 
+static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
+{
+	if (!task_running(rq, p) &&
+	    (cpu < 0 || cpu_isset(cpu, p->cpus_allowed)))
+		return 1;
+	return 0;
+}
+
 /* Return the second highest RT task, NULL otherwise */
-static struct task_struct *pick_next_highest_task_rt(struct rq *rq)
+static struct task_struct *pick_next_highest_task_rt(struct rq *rq,
+						     int cpu)
 {
 	struct rt_prio_array *array = &rq->rt.active;
 	struct task_struct *next;
@@ -196,26 +205,36 @@ static struct task_struct *pick_next_hig
 	}
 
 	queue = array->queue + idx;
+	BUG_ON(list_empty(queue));
+
 	next = list_entry(queue->next, struct task_struct, run_list);
-	if (unlikely(next != rq->curr))
-		return next;
+	if (unlikely(pick_rt_task(rq, next, cpu)))
+		goto out;
 
 	if (queue->next->next != queue) {
 		/* same prio task */
 		next = list_entry(queue->next->next, struct task_struct, run_list);
-		return next;
+		if (pick_rt_task(rq, next, cpu))
+			goto out;
 	}
 
+ retry:
 	/* slower, but more flexible */
 	idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1);
-	if (unlikely(idx >= MAX_RT_PRIO)) {
-		WARN_ON(1); /* rt_nr_running was 2 and above! */
+	if (unlikely(idx >= MAX_RT_PRIO))
 		return NULL;
-	}
 
 	queue = array->queue + idx;
-	next = list_entry(queue->next, struct task_struct, run_list);
+	BUG_ON(list_empty(queue));
 
+	list_for_each_entry(next, queue, run_list) {
+		if (pick_rt_task(rq, next, cpu))
+			goto out;
+	}
+
+	goto retry;
+
+ out:
 	return next;
 }
 
@@ -300,13 +319,15 @@ static int push_rt_task(struct rq *this_
 
 	assert_spin_locked(&this_rq->lock);
 
-	next_task = pick_next_highest_task_rt(this_rq);
+	next_task = pick_next_highest_task_rt(this_rq, -1);
 	if (!next_task)
 		return 0;
 
  retry:
-	if (unlikely(next_task == this_rq->curr))
+	if (unlikely(next_task == this_rq->curr)) {
+		WARN_ON(1);
 		return 0;
+	}
 
 	/*
 	 * It's possible that the next_task slipped in of
@@ -330,7 +351,7 @@ static int push_rt_task(struct rq *this_
 		 * so it is possible that next_task has changed.
 		 * If it has, then try again.
 		 */
-		task = pick_next_highest_task_rt(this_rq);
+		task = pick_next_highest_task_rt(this_rq, -1);
 		if (unlikely(task != next_task) && task && paranoid--) {
 			put_task_struct(next_task);
 			next_task = task;
@@ -373,6 +394,158 @@ static void push_rt_tasks(struct rq *rq)
 		;
 }
 
+static int pull_rt_task(struct rq *this_rq)
+{
+	struct task_struct *next;
+	struct task_struct *p;
+	struct rq *src_rq;
+	cpumask_t *rto_cpumask;
+	int this_cpu = this_rq->cpu;
+	int cpu;
+	int ret = 0;
+
+	assert_spin_locked(&this_rq->lock);
+
+	/*
+	 * If cpusets are used, and we have overlapping
+	 * run queue cpusets, then this algorithm may not catch all.
+	 * This is just the price you pay on trying to keep
+	 * dirtying caches down on large SMP machines.
+	 */
+	if (likely(!rt_overloaded()))
+		return 0;
+
+	next = pick_next_task_rt(this_rq);
+
+	rto_cpumask = rt_overload();
+
+	for_each_cpu_mask(cpu, *rto_cpumask) {
+		if (this_cpu == cpu)
+			continue;
+
+		src_rq = cpu_rq(cpu);
+		if (unlikely(src_rq->rt.rt_nr_running <= 1)) {
+			/*
+			 * It is possible that overlapping cpusets
+			 * will miss clearing a non overloaded runqueue.
+			 * Clear it now.
+			 */
+			if (double_lock_balance(this_rq, src_rq)) {
+				/* unlocked our runqueue lock */
+				struct task_struct *old_next = next;
+				next = pick_next_task_rt(this_rq);
+				if (next != old_next)
+					ret = 1;
+			}
+			if (likely(src_rq->rt.rt_nr_running <= 1))
+				/*
+				 * Small chance that this_rq->curr changed
+				 * but it's really harmless here.
+				 */
+				rt_clear_overload(this_rq);
+			else
+				/*
+				 * Heh, the src_rq is now overloaded, since
+				 * we already have the src_rq lock, go straight
+				 * to pulling tasks from it.
+				 */
+				goto try_pulling;
+			spin_unlock(&src_rq->lock);
+			continue;
+		}
+
+		/*
+		 * We can potentially drop this_rq's lock in
+		 * double_lock_balance, and another CPU could
+		 * steal our next task - hence we must cause
+		 * the caller to recalculate the next task
+		 * in that case:
+		 */
+		if (double_lock_balance(this_rq, src_rq)) {
+			struct task_struct *old_next = next;
+			next = pick_next_task_rt(this_rq);
+			if (next != old_next)
+				ret = 1;
+		}
+
+		/*
+		 * Are there still pullable RT tasks?
+		 */
+		if (src_rq->rt.rt_nr_running <= 1) {
+			spin_unlock(&src_rq->lock);
+			continue;
+		}
+
+ try_pulling:
+		p = pick_next_highest_task_rt(src_rq, this_cpu);
+
+		/*
+		 * Do we have an RT task that preempts
+		 * the to-be-scheduled task?
+		 */
+		if (p && (!next || (p->prio < next->prio))) {
+			WARN_ON(p == src_rq->curr);
+			WARN_ON(!p->se.on_rq);
+
+			/*
+			 * There's a chance that p is higher in priority
+			 * than what's currently running on its cpu.
+			 * This is just that p is wakeing up and hasn't
+			 * had a chance to schedule. We only pull
+			 * p if it is lower in priority than the
+			 * current task on the run queue or
+			 * this_rq next task is lower in prio than
+			 * the current task on that rq.
+			 */
+			if (p->prio < src_rq->curr->prio ||
+			    (next && next->prio < src_rq->curr->prio))
+				goto bail;
+
+			ret = 1;
+
+			deactivate_task(src_rq, p, 0);
+			set_task_cpu(p, this_cpu);
+			activate_task(this_rq, p, 0);
+			/*
+			 * We continue with the search, just in
+			 * case there's an even higher prio task
+			 * in another runqueue. (low likelyhood
+			 * but possible)
+			 */
+
+			/*
+			 * Update next so that we won't pick a task
+			 * on another cpu with a priority lower (or equal)
+			 * than the one we just picked.
+			 */
+			next = p;
+
+		}
+ bail:
+		spin_unlock(&src_rq->lock);
+	}
+
+	return ret;
+}
+
+static void schedule_balance_rt(struct rq *rq,
+				struct task_struct *prev)
+{
+	struct rt_prio_array *array;
+	int next_prio;
+
+	/* Try to pull RT tasks here if we lower this rq's prio */
+	if (unlikely(rt_task(prev))) {
+		next_prio = MAX_RT_PRIO;
+		if (rq->rt.rt_nr_running) {
+			array = &rq->rt.active;
+			next_prio = sched_find_first_bit(array->bitmap);
+		}
+		if (next_prio > prev->prio)
+			pull_rt_task(rq);
+	}
+}
+
 static void schedule_tail_balance_rt(struct rq *rq)
 {
 	/*
@@ -495,6 +668,7 @@ move_one_task_rt(struct rq *this_rq, int
 }
 #else /* CONFIG_SMP */
 # define schedule_tail_balance_rt(rq)	do { } while (0)
+# define schedule_balance_rt(rq, prev)	do { } while (0)
 #endif /* CONFIG_SMP */
 
 static void task_tick_rt(struct rq *rq, struct task_struct *p)

-- 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3 06/17] wake up balance RT
  2007-11-17  6:21 [PATCH v3 00/17] New RT Task Balancing -v3 Steven Rostedt
                   ` (4 preceding siblings ...)
  2007-11-17  6:21 ` [PATCH v3 05/17] pull RT tasks Steven Rostedt
@ 2007-11-17  6:21 ` Steven Rostedt
  2007-11-17  6:21 ` [PATCH v3 07/17] disable CFS RT load balancing Steven Rostedt
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Steven Rostedt @ 2007-11-17  6:21 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
	Steven Rostedt

[-- Attachment #1: rt-balance-wakeup.patch --]
[-- Type: text/plain, Size: 2100 bytes --]

This patch adds pushing of RT tasks off a runqueue that becomes overloaded
when tasks (most likely RT tasks) are woken up onto it.

TODO: We don't cover the case of waking of new RT tasks (yet).

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 kernel/sched.c    |    3 +++
 kernel/sched_rt.c |   10 ++++++++++
 2 files changed, 13 insertions(+)

Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c	2007-11-16 22:12:13.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c	2007-11-16 22:12:17.000000000 -0500
@@ -562,6 +562,15 @@ static void schedule_tail_balance_rt(str
 	}
 }
 
+
+static void wakeup_balance_rt(struct rq *rq, struct task_struct *p)
+{
+	if (unlikely(rt_task(p)) &&
+	    !task_running(rq, p) &&
+	    (p->prio >= rq->curr->prio))
+		push_rt_tasks(rq);
+}
+
 /*
  * Load-balancing iterator. Note: while the runqueue stays locked
  * during the whole iteration, the current task might be
@@ -669,6 +678,7 @@ move_one_task_rt(struct rq *this_rq, int
 #else /* CONFIG_SMP */
 # define schedule_tail_balance_rt(rq)	do { } while (0)
 # define schedule_balance_rt(rq, prev)	do { } while (0)
+# define wakeup_balance_rt(rq, p)	do { } while (0)
 #endif /* CONFIG_SMP */
 
 static void task_tick_rt(struct rq *rq, struct task_struct *p)
Index: linux-compile.git/kernel/sched.c
===================================================================
--- linux-compile.git.orig/kernel/sched.c	2007-11-16 22:12:13.000000000 -0500
+++ linux-compile.git/kernel/sched.c	2007-11-16 22:12:17.000000000 -0500
@@ -22,6 +22,8 @@
  *              by Peter Williams
  *  2007-05-06  Interactivity improvements to CFS by Mike Galbraith
  *  2007-07-01  Group scheduling enhancements by Srivatsa Vaddagiri
+ *  2007-10-22  RT overload balancing by Steven Rostedt
+ *                 (with thanks to Gregory Haskins)
  */
 
 #include <linux/mm.h>
@@ -1635,6 +1637,7 @@ out_activate:
 
 out_running:
 	p->state = TASK_RUNNING;
+	wakeup_balance_rt(rq, p);
 out:
 	task_rq_unlock(rq, &flags);
 

-- 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3 07/17] disable CFS RT load balancing.
  2007-11-17  6:21 [PATCH v3 00/17] New RT Task Balancing -v3 Steven Rostedt
                   ` (5 preceding siblings ...)
  2007-11-17  6:21 ` [PATCH v3 06/17] wake up balance RT Steven Rostedt
@ 2007-11-17  6:21 ` Steven Rostedt
  2007-11-17  6:21 ` [PATCH v3 08/17] Cache cpus_allowed weight for optimizing migration Steven Rostedt
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Steven Rostedt @ 2007-11-17  6:21 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
	Steven Rostedt

[-- Attachment #1: disable-CFS-rt-balance.patch --]
[-- Type: text/plain, Size: 3649 bytes --]

Since we now take an active approach to load balancing, we don't need to
balance RT tasks via CFS. In fact, this code was found to pull RT tasks
away from the CPUs the active balancing had just placed them on, resulting
in large latencies.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 kernel/sched_rt.c |   95 ++----------------------------------------------------
 1 file changed, 4 insertions(+), 91 deletions(-)

Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c	2007-11-16 22:12:17.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c	2007-11-16 22:12:19.000000000 -0500
@@ -571,109 +571,22 @@ static void wakeup_balance_rt(struct rq 
 		push_rt_tasks(rq);
 }
 
-/*
- * Load-balancing iterator. Note: while the runqueue stays locked
- * during the whole iteration, the current task might be
- * dequeued so the iterator has to be dequeue-safe. Here we
- * achieve that by always pre-iterating before returning
- * the current task:
- */
-static struct task_struct *load_balance_start_rt(void *arg)
-{
-	struct rq *rq = arg;
-	struct rt_prio_array *array = &rq->rt.active;
-	struct list_head *head, *curr;
-	struct task_struct *p;
-	int idx;
-
-	idx = sched_find_first_bit(array->bitmap);
-	if (idx >= MAX_RT_PRIO)
-		return NULL;
-
-	head = array->queue + idx;
-	curr = head->prev;
-
-	p = list_entry(curr, struct task_struct, run_list);
-
-	curr = curr->prev;
-
-	rq->rt.rt_load_balance_idx = idx;
-	rq->rt.rt_load_balance_head = head;
-	rq->rt.rt_load_balance_curr = curr;
-
-	return p;
-}
-
-static struct task_struct *load_balance_next_rt(void *arg)
-{
-	struct rq *rq = arg;
-	struct rt_prio_array *array = &rq->rt.active;
-	struct list_head *head, *curr;
-	struct task_struct *p;
-	int idx;
-
-	idx = rq->rt.rt_load_balance_idx;
-	head = rq->rt.rt_load_balance_head;
-	curr = rq->rt.rt_load_balance_curr;
-
-	/*
-	 * If we arrived back to the head again then
-	 * iterate to the next queue (if any):
-	 */
-	if (unlikely(head == curr)) {
-		int next_idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1);
-
-		if (next_idx >= MAX_RT_PRIO)
-			return NULL;
-
-		idx = next_idx;
-		head = array->queue + idx;
-		curr = head->prev;
-
-		rq->rt.rt_load_balance_idx = idx;
-		rq->rt.rt_load_balance_head = head;
-	}
-
-	p = list_entry(curr, struct task_struct, run_list);
-
-	curr = curr->prev;
-
-	rq->rt.rt_load_balance_curr = curr;
-
-	return p;
-}
-
 static unsigned long
 load_balance_rt(struct rq *this_rq, int this_cpu, struct rq *busiest,
 		unsigned long max_load_move,
 		struct sched_domain *sd, enum cpu_idle_type idle,
 		int *all_pinned, int *this_best_prio)
 {
-	struct rq_iterator rt_rq_iterator;
-
-	rt_rq_iterator.start = load_balance_start_rt;
-	rt_rq_iterator.next = load_balance_next_rt;
-	/* pass 'busiest' rq argument into
-	 * load_balance_[start|next]_rt iterators
-	 */
-	rt_rq_iterator.arg = busiest;
-
-	return balance_tasks(this_rq, this_cpu, busiest, max_load_move, sd,
-			     idle, all_pinned, this_best_prio, &rt_rq_iterator);
+	/* don't touch RT tasks */
+	return 0;
 }
 
 static int
 move_one_task_rt(struct rq *this_rq, int this_cpu, struct rq *busiest,
 		 struct sched_domain *sd, enum cpu_idle_type idle)
 {
-	struct rq_iterator rt_rq_iterator;
-
-	rt_rq_iterator.start = load_balance_start_rt;
-	rt_rq_iterator.next = load_balance_next_rt;
-	rt_rq_iterator.arg = busiest;
-
-	return iter_move_one_task(this_rq, this_cpu, busiest, sd, idle,
-				  &rt_rq_iterator);
+	/* don't touch RT tasks */
+	return 0;
 }
 #else /* CONFIG_SMP */
 # define schedule_tail_balance_rt(rq)	do { } while (0)

-- 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3 08/17] Cache cpus_allowed weight for optimizing migration
  2007-11-17  6:21 [PATCH v3 00/17] New RT Task Balancing -v3 Steven Rostedt
                   ` (6 preceding siblings ...)
  2007-11-17  6:21 ` [PATCH v3 07/17] disable CFS RT load balancing Steven Rostedt
@ 2007-11-17  6:21 ` Steven Rostedt
  2007-11-17  6:21 ` [PATCH v3 09/17] Consistency cleanup for this_rq usage Steven Rostedt
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Steven Rostedt @ 2007-11-17  6:21 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
	Steven Rostedt

[-- Attachment #1: rt-balance-cpu-weight.patch --]
[-- Type: text/plain, Size: 7960 bytes --]

From: Gregory Haskins <ghaskins@novell.com>

Some RT tasks (particularly kthreads) are bound to one specific CPU.
It is fairly common for two or more bound tasks to get queued up at the
same time.  Consider, for instance, softirq_timer and softirq_sched.  A
timer goes off in an ISR which schedules softirq_thread to run at RT50.
Then the timer handler determines that it's time to smp-rebalance the
system so it schedules softirq_sched to run.  So we are in a situation
where we have two RT50 tasks queued, and the system will go into
rt-overload condition to request other CPUs for help.

This causes two problems in the current code:

1) If a high-priority bound task and a low-priority unbound task queue
   up behind the running task, we will fail to ever relocate the unbound
   task because we terminate the search on the first unmovable task.

2) We spend precious cycles in the fast-path futilely trying to pull
   overloaded tasks over.  It is therefore optimal to strive to avoid the
   overhead altogether if we can cheaply detect the condition before
   overload even occurs.

This patch tries to achieve this optimization by utilizing the Hamming
weight of the task->cpus_allowed mask.  A weight of 1 indicates that
the task cannot be migrated.  We then utilize this information to
skip non-migratable tasks and to eliminate unnecessary rebalance attempts.

We introduce a per-rq variable to count the number of migratable tasks
that are currently running.  We only go into overload if we have more
than one rt task, AND at least one of them is migratable.

In addition, we introduce a per-task variable to cache the cpus_allowed
weight, since the Hamming calculation is probably relatively expensive.
We only update the cached value when the mask is updated, which should be
relatively infrequent, especially compared to the scheduling frequency
of the fast path.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---

 include/linux/init_task.h |    1 
 include/linux/sched.h     |    2 +
 kernel/fork.c             |    1 
 kernel/sched.c            |    9 +++++++-
 kernel/sched_rt.c         |   50 +++++++++++++++++++++++++++++++++++++++++-----
 5 files changed, 57 insertions(+), 6 deletions(-)

Index: linux-compile.git/include/linux/sched.h
===================================================================
--- linux-compile.git.orig/include/linux/sched.h	2007-11-16 22:11:22.000000000 -0500
+++ linux-compile.git/include/linux/sched.h	2007-11-17 00:15:42.000000000 -0500
@@ -843,6 +843,7 @@ struct sched_class {
 	void (*set_curr_task) (struct rq *rq);
 	void (*task_tick) (struct rq *rq, struct task_struct *p);
 	void (*task_new) (struct rq *rq, struct task_struct *p);
+	void (*set_cpus_allowed)(struct task_struct *p, cpumask_t newmask);
 };
 
 struct load_weight {
@@ -952,6 +953,7 @@ struct task_struct {
 
 	unsigned int policy;
 	cpumask_t cpus_allowed;
+	int nr_cpus_allowed;
 	unsigned int time_slice;
 
 #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
Index: linux-compile.git/kernel/fork.c
===================================================================
--- linux-compile.git.orig/kernel/fork.c	2007-11-16 22:11:22.000000000 -0500
+++ linux-compile.git/kernel/fork.c	2007-11-16 22:12:34.000000000 -0500
@@ -1237,6 +1237,7 @@ static struct task_struct *copy_process(
 	 * parent's CPU). This avoids alot of nasty races.
 	 */
 	p->cpus_allowed = current->cpus_allowed;
+	p->nr_cpus_allowed = current->nr_cpus_allowed;
 	if (unlikely(!cpu_isset(task_cpu(p), p->cpus_allowed) ||
 			!cpu_online(task_cpu(p))))
 		set_task_cpu(p, smp_processor_id());
Index: linux-compile.git/kernel/sched.c
===================================================================
--- linux-compile.git.orig/kernel/sched.c	2007-11-16 22:12:17.000000000 -0500
+++ linux-compile.git/kernel/sched.c	2007-11-17 00:15:42.000000000 -0500
@@ -269,6 +269,7 @@ struct rt_rq {
 	int rt_load_balance_idx;
 	struct list_head *rt_load_balance_head, *rt_load_balance_curr;
 	unsigned long rt_nr_running;
+	unsigned long rt_nr_migratory;
 	/* highest queued rt task prio */
 	int highest_prio;
 };
@@ -5070,7 +5071,13 @@ int set_cpus_allowed(struct task_struct 
 		goto out;
 	}
 
-	p->cpus_allowed = new_mask;
+	if (p->sched_class->set_cpus_allowed)
+		p->sched_class->set_cpus_allowed(p, new_mask);
+	else {
+		p->cpus_allowed    = new_mask;
+		p->nr_cpus_allowed = cpus_weight(new_mask);
+	}
+
 	/* Can the task run on the task's current CPU? If so, we're done */
 	if (cpu_isset(task_cpu(p), new_mask))
 		goto out;
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c	2007-11-16 22:12:19.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c	2007-11-17 00:15:42.000000000 -0500
@@ -33,6 +33,14 @@ static inline void rt_clear_overload(str
 	atomic_dec(&rto_count);
 	cpu_clear(rq->cpu, rt_overload_mask);
 }
+
+static void update_rt_migration(struct rq *rq)
+{
+	if (rq->rt.rt_nr_migratory && (rq->rt.rt_nr_running > 1))
+		rt_set_overload(rq);
+	else
+		rt_clear_overload(rq);
+}
 #endif /* CONFIG_SMP */
 
 /*
@@ -64,8 +72,10 @@ static inline void inc_rt_tasks(struct t
 #ifdef CONFIG_SMP
 	if (p->prio < rq->rt.highest_prio)
 		rq->rt.highest_prio = p->prio;
-	if (rq->rt.rt_nr_running > 1)
-		rt_set_overload(rq);
+	if (p->nr_cpus_allowed > 1)
+		rq->rt.rt_nr_migratory++;
+
+	update_rt_migration(rq);
 #endif /* CONFIG_SMP */
 }
 
@@ -87,8 +97,10 @@ static inline void dec_rt_tasks(struct t
 		} /* otherwise leave rq->highest prio alone */
 	} else
 		rq->rt.highest_prio = MAX_RT_PRIO;
-	if (rq->rt.rt_nr_running < 2)
-		rt_clear_overload(rq);
+	if (p->nr_cpus_allowed > 1)
+		rq->rt.rt_nr_migratory--;
+
+	update_rt_migration(rq);
 #endif /* CONFIG_SMP */
 }
 
@@ -179,7 +191,8 @@ static void deactivate_task(struct rq *r
 static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
 {
 	if (!task_running(rq, p) &&
-	    (cpu < 0 || cpu_isset(cpu, p->cpus_allowed)))
+	    (cpu < 0 || cpu_isset(cpu, p->cpus_allowed)) &&
+	    (p->nr_cpus_allowed > 1))
 		return 1;
 	return 0;
 }
@@ -588,6 +601,32 @@ move_one_task_rt(struct rq *this_rq, int
 	/* don't touch RT tasks */
 	return 0;
 }
+static void set_cpus_allowed_rt(struct task_struct *p, cpumask_t new_mask)
+{
+	int weight = cpus_weight(new_mask);
+
+	BUG_ON(!rt_task(p));
+
+	/*
+	 * Update the migration status of the RQ if we have an RT task
+	 * which is running AND changing its weight value.
+	 */
+	if (p->se.on_rq && (weight != p->nr_cpus_allowed)) {
+		struct rq *rq = task_rq(p);
+
+		if ((p->nr_cpus_allowed <= 1) && (weight > 1))
+			rq->rt.rt_nr_migratory++;
+		else if((p->nr_cpus_allowed > 1) && (weight <= 1)) {
+			BUG_ON(!rq->rt.rt_nr_migratory);
+			rq->rt.rt_nr_migratory--;
+		}
+
+		update_rt_migration(rq);
+	}
+
+	p->cpus_allowed    = new_mask;
+	p->nr_cpus_allowed = weight;
+}
 #else /* CONFIG_SMP */
 # define schedule_tail_balance_rt(rq)	do { } while (0)
 # define schedule_balance_rt(rq, prev)	do { } while (0)
@@ -639,6 +678,7 @@ const struct sched_class rt_sched_class 
 #ifdef CONFIG_SMP
 	.load_balance		= load_balance_rt,
 	.move_one_task		= move_one_task_rt,
+	.set_cpus_allowed       = set_cpus_allowed_rt,
 #endif
 
 	.set_curr_task          = set_curr_task_rt,
Index: linux-compile.git/include/linux/init_task.h
===================================================================
--- linux-compile.git.orig/include/linux/init_task.h	2007-11-16 22:11:22.000000000 -0500
+++ linux-compile.git/include/linux/init_task.h	2007-11-16 22:12:34.000000000 -0500
@@ -130,6 +130,7 @@ extern struct group_info init_groups;
 	.normal_prio	= MAX_PRIO-20,					\
 	.policy		= SCHED_NORMAL,					\
 	.cpus_allowed	= CPU_MASK_ALL,					\
+	.nr_cpus_allowed = NR_CPUS,					\
 	.mm		= NULL,						\
 	.active_mm	= &init_mm,					\
 	.run_list	= LIST_HEAD_INIT(tsk.run_list),			\

-- 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3 09/17] Consistency cleanup for this_rq usage
  2007-11-17  6:21 [PATCH v3 00/17] New RT Task Balancing -v3 Steven Rostedt
                   ` (7 preceding siblings ...)
  2007-11-17  6:21 ` [PATCH v3 08/17] Cache cpus_allowed weight for optimizing migration Steven Rostedt
@ 2007-11-17  6:21 ` Steven Rostedt
  2007-11-17  6:21 ` [PATCH v3 10/17] Remove some CFS specific code from the wakeup path of RT tasks Steven Rostedt
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Steven Rostedt @ 2007-11-17  6:21 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
	Steven Rostedt

[-- Attachment #1: 0001-this_rq-consistency-cleanup.patch --]
[-- Type: text/plain, Size: 4525 bytes --]

From: Gregory Haskins <ghaskins@novell.com>

"this_rq" is normally used to denote the RQ on the current cpu
(i.e. "cpu_rq(this_cpu)").  So clean up the usage of this_rq to be
more consistent with the rest of the code.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---

 kernel/sched_rt.c |   44 ++++++++++++++++++++++----------------------
 1 file changed, 22 insertions(+), 22 deletions(-)

Index: linus.git/kernel/sched_rt.c
===================================================================
--- linus.git.orig/kernel/sched_rt.c
+++ linus.git/kernel/sched_rt.c
@@ -253,7 +253,7 @@ static struct task_struct *pick_next_hig
 
 /* Will lock the rq it finds */
 static struct rq *find_lock_lowest_rq(struct task_struct *task,
-				      struct rq *this_rq)
+				      struct rq *rq)
 {
 	struct rq *lowest_rq = NULL;
 	cpumask_t cpu_mask;
@@ -267,21 +267,21 @@ static struct rq *find_lock_lowest_rq(st
 		 * Scan each rq for the lowest prio.
 		 */
 		for_each_cpu_mask(cpu, cpu_mask) {
-			struct rq *rq = &per_cpu(runqueues, cpu);
+			struct rq *curr_rq = &per_cpu(runqueues, cpu);
 
-			if (cpu == this_rq->cpu)
+			if (cpu == rq->cpu)
 				continue;
 
 			/* We look for lowest RT prio or non-rt CPU */
-			if (rq->rt.highest_prio >= MAX_RT_PRIO) {
-				lowest_rq = rq;
+			if (curr_rq->rt.highest_prio >= MAX_RT_PRIO) {
+				lowest_rq = curr_rq;
 				break;
 			}
 
 			/* no locking for now */
-			if (rq->rt.highest_prio > task->prio &&
-			    (!lowest_rq || rq->rt.highest_prio > lowest_rq->rt.highest_prio)) {
-				lowest_rq = rq;
+			if (curr_rq->rt.highest_prio > task->prio &&
+			    (!lowest_rq || curr_rq->rt.highest_prio > lowest_rq->rt.highest_prio)) {
+				lowest_rq = curr_rq;
 			}
 		}
 
@@ -289,16 +289,16 @@ static struct rq *find_lock_lowest_rq(st
 			break;
 
 		/* if the prio of this runqueue changed, try again */
-		if (double_lock_balance(this_rq, lowest_rq)) {
+		if (double_lock_balance(rq, lowest_rq)) {
 			/*
 			 * We had to unlock the run queue. In
 			 * the mean time, task could have
 			 * migrated already or had its affinity changed.
 			 * Also make sure that it wasn't scheduled on its rq.
 			 */
-			if (unlikely(task_rq(task) != this_rq ||
+			if (unlikely(task_rq(task) != rq ||
 				     !cpu_isset(lowest_rq->cpu, task->cpus_allowed) ||
-				     task_running(this_rq, task) ||
+				     task_running(rq, task) ||
 				     !task->se.on_rq)) {
 				spin_unlock(&lowest_rq->lock);
 				lowest_rq = NULL;
@@ -323,21 +323,21 @@ static struct rq *find_lock_lowest_rq(st
  * running task can migrate over to a CPU that is running a task
  * of lesser priority.
  */
-static int push_rt_task(struct rq *this_rq)
+static int push_rt_task(struct rq *rq)
 {
 	struct task_struct *next_task;
 	struct rq *lowest_rq;
 	int ret = 0;
 	int paranoid = RT_MAX_TRIES;
 
-	assert_spin_locked(&this_rq->lock);
+	assert_spin_locked(&rq->lock);
 
-	next_task = pick_next_highest_task_rt(this_rq, -1);
+	next_task = pick_next_highest_task_rt(rq, -1);
 	if (!next_task)
 		return 0;
 
  retry:
-	if (unlikely(next_task == this_rq->curr)) {
+	if (unlikely(next_task == rq->curr)) {
 		WARN_ON(1);
 		return 0;
 	}
@@ -347,24 +347,24 @@ static int push_rt_task(struct rq *this_
 	 * higher priority than current. If that's the case
 	 * just reschedule current.
 	 */
-	if (unlikely(next_task->prio < this_rq->curr->prio)) {
-		resched_task(this_rq->curr);
+	if (unlikely(next_task->prio < rq->curr->prio)) {
+		resched_task(rq->curr);
 		return 0;
 	}
 
-	/* We might release this_rq lock */
+	/* We might release rq lock */
 	get_task_struct(next_task);
 
 	/* find_lock_lowest_rq locks the rq if found */
-	lowest_rq = find_lock_lowest_rq(next_task, this_rq);
+	lowest_rq = find_lock_lowest_rq(next_task, rq);
 	if (!lowest_rq) {
 		struct task_struct *task;
 		/*
-		 * find lock_lowest_rq releases this_rq->lock
+		 * find lock_lowest_rq releases rq->lock
 		 * so it is possible that next_task has changed.
 		 * If it has, then try again.
 		 */
-		task = pick_next_highest_task_rt(this_rq, -1);
+		task = pick_next_highest_task_rt(rq, -1);
 		if (unlikely(task != next_task) && task && paranoid--) {
 			put_task_struct(next_task);
 			next_task = task;
@@ -375,7 +375,7 @@ static int push_rt_task(struct rq *this_
 
 	assert_spin_locked(&lowest_rq->lock);
 
-	deactivate_task(this_rq, next_task, 0);
+	deactivate_task(rq, next_task, 0);
 	set_task_cpu(next_task, lowest_rq->cpu);
 	activate_task(lowest_rq, next_task, 0);
 

-- 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3 10/17] Remove some CFS specific code from the wakeup path of RT tasks
  2007-11-17  6:21 [PATCH v3 00/17] New RT Task Balancing -v3 Steven Rostedt
                   ` (8 preceding siblings ...)
  2007-11-17  6:21 ` [PATCH v3 09/17] Consistency cleanup for this_rq usage Steven Rostedt
@ 2007-11-17  6:21 ` Steven Rostedt
  2007-11-17 17:35   ` Gregory Haskins
  2007-11-17  6:21 ` [PATCH v3 11/17] RT: Break out the search function Steven Rostedt
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 28+ messages in thread
From: Steven Rostedt @ 2007-11-17  6:21 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
	Steven Rostedt

[-- Attachment #1: 0002-remove-CFS-from-wakeup.patch --]
[-- Type: text/plain, Size: 12293 bytes --]

From: Gregory Haskins <ghaskins@novell.com>

The current wake-up code path tries to determine if it can optimize the
wake-up to "this_cpu" by computing load calculations.  The problem is that
these calculations are only relevant to CFS tasks where load is king.  For RT
tasks, priority is king.  So the load calculation is completely wasted
bandwidth.

Therefore, we create a new sched_class interface to help with
pre-wakeup routing decisions, and move the load calculation into the
CFS class where it is actually relevant.

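As a rough sketch (simplified kernel-style C; route_wakeup() is an
illustrative stand-in for the call site in try_to_wake_up(), not a
function this patch adds), the interface boils down to:

/*
 * Sketch of the new per-class hook: each scheduling class decides where
 * a waking task of that class should run, and the generic wakeup code
 * just dispatches to it instead of doing CFS load math itself.
 */
struct sched_class {
	/* ... existing methods (enqueue_task, dequeue_task, ...) ... */
	int (*select_task_rq)(struct task_struct *p, int sync);
};

/* Illustrative call site: roughly what try_to_wake_up() reduces to */
static int route_wakeup(struct task_struct *p, int sync)
{
	return p->sched_class->select_task_rq(p, sync);
}

CFS keeps its affine/passive-balance heuristics inside its own
select_task_rq_fair(), while the RT and idle classes can answer with a
CPU and no load arithmetic at all.
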
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---

 include/linux/sched.h   |    1 
 kernel/sched.c          |  144 +++---------------------------------------------
 kernel/sched_fair.c     |  132 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_idletask.c |    9 +++
 kernel/sched_rt.c       |    8 ++
 5 files changed, 159 insertions(+), 135 deletions(-)

Index: linux-compile.git/include/linux/sched.h
===================================================================
--- linux-compile.git.orig/include/linux/sched.h	2007-11-16 22:12:34.000000000 -0500
+++ linux-compile.git/include/linux/sched.h	2007-11-16 22:23:39.000000000 -0500
@@ -823,6 +823,7 @@ struct sched_class {
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
 	void (*yield_task) (struct rq *rq);
+	int  (*select_task_rq)(struct task_struct *p, int sync);
 
 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);
 
Index: linux-compile.git/kernel/sched.c
===================================================================
--- linux-compile.git.orig/kernel/sched.c	2007-11-16 22:12:34.000000000 -0500
+++ linux-compile.git/kernel/sched.c	2007-11-16 22:23:39.000000000 -0500
@@ -860,6 +860,10 @@ iter_move_one_task(struct rq *this_rq, i
 		   struct rq_iterator *iterator);
 #endif
 
+static unsigned long source_load(int cpu, int type);
+static unsigned long target_load(int cpu, int type);
+static unsigned long cpu_avg_load_per_task(int cpu);
+
 #include "sched_stats.h"
 #include "sched_idletask.c"
 #include "sched_fair.c"
@@ -1270,7 +1274,7 @@ static unsigned long target_load(int cpu
 /*
  * Return the average load per task on the cpu's run queue
  */
-static inline unsigned long cpu_avg_load_per_task(int cpu)
+static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long total = weighted_cpuload(cpu);
@@ -1427,58 +1431,6 @@ static int sched_balance_self(int cpu, i
 
 #endif /* CONFIG_SMP */
 
-/*
- * wake_idle() will wake a task on an idle cpu if task->cpu is
- * not idle and an idle cpu is available.  The span of cpus to
- * search starts with cpus closest then further out as needed,
- * so we always favor a closer, idle cpu.
- *
- * Returns the CPU we should wake onto.
- */
-#if defined(ARCH_HAS_SCHED_WAKE_IDLE)
-static int wake_idle(int cpu, struct task_struct *p)
-{
-	cpumask_t tmp;
-	struct sched_domain *sd;
-	int i;
-
-	/*
-	 * If it is idle, then it is the best cpu to run this task.
-	 *
-	 * This cpu is also the best, if it has more than one task already.
-	 * Siblings must be also busy(in most cases) as they didn't already
-	 * pickup the extra load from this cpu and hence we need not check
-	 * sibling runqueue info. This will avoid the checks and cache miss
-	 * penalities associated with that.
-	 */
-	if (idle_cpu(cpu) || cpu_rq(cpu)->nr_running > 1)
-		return cpu;
-
-	for_each_domain(cpu, sd) {
-		if (sd->flags & SD_WAKE_IDLE) {
-			cpus_and(tmp, sd->span, p->cpus_allowed);
-			for_each_cpu_mask(i, tmp) {
-				if (idle_cpu(i)) {
-					if (i != task_cpu(p)) {
-						schedstat_inc(p,
-							se.nr_wakeups_idle);
-					}
-					return i;
-				}
-			}
-		} else {
-			break;
-		}
-	}
-	return cpu;
-}
-#else
-static inline int wake_idle(int cpu, struct task_struct *p)
-{
-	return cpu;
-}
-#endif
-
 /***
  * try_to_wake_up - wake up a thread
  * @p: the to-be-woken-up thread
@@ -1500,8 +1452,6 @@ static int try_to_wake_up(struct task_st
 	long old_state;
 	struct rq *rq;
 #ifdef CONFIG_SMP
-	struct sched_domain *sd, *this_sd = NULL;
-	unsigned long load, this_load;
 	int new_cpu;
 #endif
 
@@ -1521,90 +1471,14 @@ static int try_to_wake_up(struct task_st
 	if (unlikely(task_running(rq, p)))
 		goto out_activate;
 
-	new_cpu = cpu;
-
 	schedstat_inc(rq, ttwu_count);
-	if (cpu == this_cpu) {
+	if (cpu == this_cpu)
 		schedstat_inc(rq, ttwu_local);
-		goto out_set_cpu;
-	}
-
-	for_each_domain(this_cpu, sd) {
-		if (cpu_isset(cpu, sd->span)) {
-			schedstat_inc(sd, ttwu_wake_remote);
-			this_sd = sd;
-			break;
-		}
-	}
-
-	if (unlikely(!cpu_isset(this_cpu, p->cpus_allowed)))
-		goto out_set_cpu;
-
-	/*
-	 * Check for affine wakeup and passive balancing possibilities.
-	 */
-	if (this_sd) {
-		int idx = this_sd->wake_idx;
-		unsigned int imbalance;
-
-		imbalance = 100 + (this_sd->imbalance_pct - 100) / 2;
-
-		load = source_load(cpu, idx);
-		this_load = target_load(this_cpu, idx);
-
-		new_cpu = this_cpu; /* Wake to this CPU if we can */
-
-		if (this_sd->flags & SD_WAKE_AFFINE) {
-			unsigned long tl = this_load;
-			unsigned long tl_per_task;
-
-			/*
-			 * Attract cache-cold tasks on sync wakeups:
-			 */
-			if (sync && !task_hot(p, rq->clock, this_sd))
-				goto out_set_cpu;
-
-			schedstat_inc(p, se.nr_wakeups_affine_attempts);
-			tl_per_task = cpu_avg_load_per_task(this_cpu);
-
-			/*
-			 * If sync wakeup then subtract the (maximum possible)
-			 * effect of the currently running task from the load
-			 * of the current CPU:
-			 */
-			if (sync)
-				tl -= current->se.load.weight;
-
-			if ((tl <= load &&
-				tl + target_load(cpu, idx) <= tl_per_task) ||
-			       100*(tl + p->se.load.weight) <= imbalance*load) {
-				/*
-				 * This domain has SD_WAKE_AFFINE and
-				 * p is cache cold in this domain, and
-				 * there is no bad imbalance.
-				 */
-				schedstat_inc(this_sd, ttwu_move_affine);
-				schedstat_inc(p, se.nr_wakeups_affine);
-				goto out_set_cpu;
-			}
-		}
+	else
+		schedstat_inc(rq->sd, ttwu_wake_remote);
 
-		/*
-		 * Start passive balancing when half the imbalance_pct
-		 * limit is reached.
-		 */
-		if (this_sd->flags & SD_WAKE_BALANCE) {
-			if (imbalance*this_load <= 100*load) {
-				schedstat_inc(this_sd, ttwu_move_balance);
-				schedstat_inc(p, se.nr_wakeups_passive);
-				goto out_set_cpu;
-			}
-		}
-	}
+	new_cpu = p->sched_class->select_task_rq(p, sync);
 
-	new_cpu = cpu; /* Could not wake to this_cpu. Wake to cpu instead */
-out_set_cpu:
-	new_cpu = wake_idle(new_cpu, p);
 	if (new_cpu != cpu) {
 		set_task_cpu(p, new_cpu);
 		task_rq_unlock(rq, &flags);
Index: linux-compile.git/kernel/sched_fair.c
===================================================================
--- linux-compile.git.orig/kernel/sched_fair.c	2007-11-16 11:16:38.000000000 -0500
+++ linux-compile.git/kernel/sched_fair.c	2007-11-16 22:23:39.000000000 -0500
@@ -564,6 +564,137 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 }
 
 /*
+ * wake_idle() will wake a task on an idle cpu if task->cpu is
+ * not idle and an idle cpu is available.  The span of cpus to
+ * search starts with cpus closest then further out as needed,
+ * so we always favor a closer, idle cpu.
+ *
+ * Returns the CPU we should wake onto.
+ */
+#if defined(ARCH_HAS_SCHED_WAKE_IDLE)
+static int wake_idle(int cpu, struct task_struct *p)
+{
+	cpumask_t tmp;
+	struct sched_domain *sd;
+	int i;
+
+	/*
+	 * If it is idle, then it is the best cpu to run this task.
+	 *
+	 * This cpu is also the best, if it has more than one task already.
+	 * Siblings must be also busy(in most cases) as they didn't already
+	 * pickup the extra load from this cpu and hence we need not check
+	 * sibling runqueue info. This will avoid the checks and cache miss
+	 * penalities associated with that.
+	 */
+	if (idle_cpu(cpu) || cpu_rq(cpu)->nr_running > 1)
+		return cpu;
+
+	for_each_domain(cpu, sd) {
+		if (sd->flags & SD_WAKE_IDLE) {
+			cpus_and(tmp, sd->span, p->cpus_allowed);
+			for_each_cpu_mask(i, tmp) {
+				if (idle_cpu(i))
+					return i;
+			}
+		} else {
+			break;
+		}
+	}
+	return cpu;
+}
+#else
+static inline int wake_idle(int cpu, struct task_struct *p)
+{
+	return cpu;
+}
+#endif
+
+#ifdef CONFIG_SMP
+static int select_task_rq_fair(struct task_struct *p, int sync)
+{
+	int cpu, this_cpu;
+	struct rq *rq;
+	struct sched_domain *sd, *this_sd = NULL;
+	int new_cpu;
+
+	cpu      = task_cpu(p);
+	rq       = task_rq(p);
+	this_cpu = smp_processor_id();
+	new_cpu  = cpu;
+
+	for_each_domain(this_cpu, sd) {
+		if (cpu_isset(cpu, sd->span)) {
+			this_sd = sd;
+			break;
+		}
+	}
+
+	if (unlikely(!cpu_isset(this_cpu, p->cpus_allowed)))
+		goto out_set_cpu;
+
+	/*
+	 * Check for affine wakeup and passive balancing possibilities.
+	 */
+	if (this_sd) {
+		int idx = this_sd->wake_idx;
+		unsigned int imbalance;
+		unsigned long load, this_load;
+
+		imbalance = 100 + (this_sd->imbalance_pct - 100) / 2;
+
+		load = source_load(cpu, idx);
+		this_load = target_load(this_cpu, idx);
+
+		new_cpu = this_cpu; /* Wake to this CPU if we can */
+
+		if (this_sd->flags & SD_WAKE_AFFINE) {
+			unsigned long tl = this_load;
+			unsigned long tl_per_task;
+
+			tl_per_task = cpu_avg_load_per_task(this_cpu);
+
+			/*
+			 * If sync wakeup then subtract the (maximum possible)
+			 * effect of the currently running task from the load
+			 * of the current CPU:
+			 */
+			if (sync)
+				tl -= current->se.load.weight;
+
+			if ((tl <= load &&
+				tl + target_load(cpu, idx) <= tl_per_task) ||
+			       100*(tl + p->se.load.weight) <= imbalance*load) {
+				/*
+				 * This domain has SD_WAKE_AFFINE and
+				 * p is cache cold in this domain, and
+				 * there is no bad imbalance.
+				 */
+				schedstat_inc(this_sd, ttwu_move_affine);
+				goto out_set_cpu;
+			}
+		}
+
+		/*
+		 * Start passive balancing when half the imbalance_pct
+		 * limit is reached.
+		 */
+		if (this_sd->flags & SD_WAKE_BALANCE) {
+			if (imbalance*this_load <= 100*load) {
+				schedstat_inc(this_sd, ttwu_move_balance);
+				goto out_set_cpu;
+			}
+		}
+	}
+
+	new_cpu = cpu; /* Could not wake to this_cpu. Wake to cpu instead */
+out_set_cpu:
+	return wake_idle(new_cpu, p);
+}
+#endif /* CONFIG_SMP */
+
+
+/*
  * Preempt the current task with a newly woken task if needed:
  */
 static void
@@ -1111,6 +1242,7 @@ static const struct sched_class fair_sch
 #ifdef CONFIG_SMP
 	.load_balance		= load_balance_fair,
 	.move_one_task		= move_one_task_fair,
+	.select_task_rq		= select_task_rq_fair,
 #endif
 
 	.set_curr_task          = set_curr_task_fair,
Index: linux-compile.git/kernel/sched_idletask.c
===================================================================
--- linux-compile.git.orig/kernel/sched_idletask.c	2007-11-16 11:15:54.000000000 -0500
+++ linux-compile.git/kernel/sched_idletask.c	2007-11-16 22:23:39.000000000 -0500
@@ -5,6 +5,12 @@
  *  handled in sched_fair.c)
  */
 
+#ifdef CONFIG_SMP
+static int select_task_rq_idle(struct task_struct *p, int sync)
+{
+	return task_cpu(p); /* IDLE tasks are never migrated */
+}
+#endif /* CONFIG_SMP */
 /*
  * Idle tasks are unconditionally rescheduled:
  */
@@ -72,6 +78,9 @@ const struct sched_class idle_sched_clas
 
 	/* dequeue is not valid, we print a debug message there: */
 	.dequeue_task		= dequeue_task_idle,
+#ifdef CONFIG_SMP
+	.select_task_rq		= select_task_rq_idle,
+#endif /* CONFIG_SMP */
 
 	.check_preempt_curr	= check_preempt_curr_idle,
 
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c	2007-11-16 22:23:35.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c	2007-11-16 22:23:39.000000000 -0500
@@ -41,6 +41,11 @@ static void update_rt_migration(struct r
 	else
 		rt_clear_overload(rq);
 }
+
+static int select_task_rq_rt(struct task_struct *p, int sync)
+{
+	return task_cpu(p);
+}
 #endif /* CONFIG_SMP */
 
 /*
@@ -669,6 +674,9 @@ const struct sched_class rt_sched_class 
 	.enqueue_task		= enqueue_task_rt,
 	.dequeue_task		= dequeue_task_rt,
 	.yield_task		= yield_task_rt,
+#ifdef CONFIG_SMP
+	.select_task_rq		= select_task_rq_rt,
+#endif /* CONFIG_SMP */
 
 	.check_preempt_curr	= check_preempt_curr_rt,
 

-- 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3 11/17] RT: Break out the search function
  2007-11-17  6:21 [PATCH v3 00/17] New RT Task Balancing -v3 Steven Rostedt
                   ` (9 preceding siblings ...)
  2007-11-17  6:21 ` [PATCH v3 10/17] Remove some CFS specific code from the wakeup path of RT tasks Steven Rostedt
@ 2007-11-17  6:21 ` Steven Rostedt
  2007-11-17  6:21 ` [PATCH v3 12/17] Allow current_cpu to be included in search Steven Rostedt
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Steven Rostedt @ 2007-11-17  6:21 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
	Steven Rostedt

[-- Attachment #1: 0003-rt-balance-break-out-search.patch --]
[-- Type: text/plain, Size: 2676 bytes --]

From: Gregory Haskins <ghaskins@novell.com>

Isolate the search logic into a function so that it can be used later
in places other than find_lock_lowest_rq().

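For example (a sketch only; route_rt_wakeup() is an illustrative name,
and the actual reuse lands later in this series in the RT wakeup path),
a lock-free caller can now ask the same question without going through
find_lock_lowest_rq():

/* Sketch: reuse find_lowest_rq() from outside find_lock_lowest_rq(). */
static int route_rt_wakeup(struct task_struct *p)
{
	int cpu = find_lowest_rq(p);	/* -1 means no suitable CPU found */

	return (cpu == -1) ? task_cpu(p) : cpu;
}
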
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---

 kernel/sched_rt.c |   62 ++++++++++++++++++++++++++++++++----------------------
 1 file changed, 37 insertions(+), 25 deletions(-)

Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c	2007-11-16 22:23:39.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c	2007-11-16 22:23:43.000000000 -0500
@@ -256,43 +256,55 @@ static struct task_struct *pick_next_hig
 	return next;
 }
 
-/* Will lock the rq it finds */
-static struct rq *find_lock_lowest_rq(struct task_struct *task,
-				      struct rq *rq)
+static int find_lowest_rq(struct task_struct *task)
 {
-	struct rq *lowest_rq = NULL;
-	cpumask_t cpu_mask;
 	int cpu;
-	int tries;
+	cpumask_t cpu_mask;
+	struct rq *lowest_rq = NULL;
 
 	cpus_and(cpu_mask, cpu_online_map, task->cpus_allowed);
 
-	for (tries = 0; tries < RT_MAX_TRIES; tries++) {
-		/*
-		 * Scan each rq for the lowest prio.
-		 */
-		for_each_cpu_mask(cpu, cpu_mask) {
-			struct rq *curr_rq = &per_cpu(runqueues, cpu);
+	/*
+	 * Scan each rq for the lowest prio.
+	 */
+	for_each_cpu_mask(cpu, cpu_mask) {
+		struct rq *rq = cpu_rq(cpu);
 
-			if (cpu == rq->cpu)
-				continue;
+		if (cpu == rq->cpu)
+			continue;
 
-			/* We look for lowest RT prio or non-rt CPU */
-			if (curr_rq->rt.highest_prio >= MAX_RT_PRIO) {
-				lowest_rq = curr_rq;
-				break;
-			}
+		/* We look for lowest RT prio or non-rt CPU */
+		if (rq->rt.highest_prio >= MAX_RT_PRIO) {
+			lowest_rq = rq;
+			break;
+		}
 
-			/* no locking for now */
-			if (curr_rq->rt.highest_prio > task->prio &&
-			    (!lowest_rq || curr_rq->rt.highest_prio > lowest_rq->rt.highest_prio)) {
-				lowest_rq = curr_rq;
-			}
+		/* no locking for now */
+		if (rq->rt.highest_prio > task->prio &&
+		    (!lowest_rq || rq->rt.highest_prio > lowest_rq->rt.highest_prio)) {
+			lowest_rq = rq;
 		}
+	}
+
+	return lowest_rq ? lowest_rq->cpu : -1;
+}
+
+/* Will lock the rq it finds */
+static struct rq *find_lock_lowest_rq(struct task_struct *task,
+				      struct rq *rq)
+{
+	struct rq *lowest_rq = NULL;
+	int cpu;
+	int tries;
+
+	for (tries = 0; tries < RT_MAX_TRIES; tries++) {
+		cpu = find_lowest_rq(task);
 
-		if (!lowest_rq)
+		if (cpu == -1)
 			break;
 
+		lowest_rq = cpu_rq(cpu);
+
 		/* if the prio of this runqueue changed, try again */
 		if (double_lock_balance(rq, lowest_rq)) {
 			/*

-- 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3 12/17] Allow current_cpu to be included in search
  2007-11-17  6:21 [PATCH v3 00/17] New RT Task Balancing -v3 Steven Rostedt
                   ` (10 preceding siblings ...)
  2007-11-17  6:21 ` [PATCH v3 11/17] RT: Break out the search function Steven Rostedt
@ 2007-11-17  6:21 ` Steven Rostedt
  2007-11-17  6:21 ` [PATCH v3 13/17] RT: Pre-route RT tasks on wakeup Steven Rostedt
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Steven Rostedt @ 2007-11-17  6:21 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
	Steven Rostedt

[-- Attachment #1: 0004-rt-balance-allow-current-cpu-in-search.patch --]
[-- Type: text/plain, Size: 1234 bytes --]

From: Gregory Haskins <ghaskins@novell.com>

It doesn't hurt if we allow the current CPU to be included in the
search.  We will simply skip it later if the current CPU turns out
to be the lowest.

We will use this later in the series.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---

 kernel/sched_rt.c |    5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c	2007-11-16 22:23:43.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c	2007-11-16 22:23:45.000000000 -0500
@@ -270,9 +270,6 @@ static int find_lowest_rq(struct task_st
 	for_each_cpu_mask(cpu, cpu_mask) {
 		struct rq *rq = cpu_rq(cpu);
 
-		if (cpu == rq->cpu)
-			continue;
-
 		/* We look for lowest RT prio or non-rt CPU */
 		if (rq->rt.highest_prio >= MAX_RT_PRIO) {
 			lowest_rq = rq;
@@ -300,7 +297,7 @@ static struct rq *find_lock_lowest_rq(st
 	for (tries = 0; tries < RT_MAX_TRIES; tries++) {
 		cpu = find_lowest_rq(task);
 
-		if (cpu == -1)
+		if ((cpu == -1) || (cpu == rq->cpu))
 			break;
 
 		lowest_rq = cpu_rq(cpu);

-- 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3 13/17] RT: Pre-route RT tasks on wakeup
  2007-11-17  6:21 [PATCH v3 00/17] New RT Task Balancing -v3 Steven Rostedt
                   ` (11 preceding siblings ...)
  2007-11-17  6:21 ` [PATCH v3 12/17] Allow current_cpu to be included in search Steven Rostedt
@ 2007-11-17  6:21 ` Steven Rostedt
  2007-11-17  6:21 ` [PATCH v3 14/17] Optimize our cpu selection based on topology Steven Rostedt
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Steven Rostedt @ 2007-11-17  6:21 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
	Steven Rostedt

[-- Attachment #1: 0005-pre-route-rt-tasks-on-wakeup.patch --]
[-- Type: text/plain, Size: 2350 bytes --]

From: Gregory Haskins <ghaskins@novell.com>

In the original patch series that Steven Rostedt and I worked on together,
we each took a different approach to the low-priority wakeup path.  I used a
"pre-routing" approach (push the task away to a less important RQ before
activating it), while Steve used a "post-routing" approach.  The advantage of
my approach is that you avoid the overhead of a wasted activate/deactivate
cycle and the peripherally related burdens.  The advantage of Steve's method
is that it neatly solves an issue that prevented a "pull" optimization from
being deployed.

In the end, we ended up deploying Steve's idea.  But it later dawned on me
that we could get the best of both worlds by deploying both ideas together,
albeit slightly modified.

The idea is simple: use a "light-weight" lookup for pre-routing, since we
only need to approximate a good home for the task, and retain the
post-routing push logic to clean up any "priority mistargeting" that the
lightweight lookup introduces.  Most of the time the pre-routing should
work and yield lower overhead.  In the cases where it doesn't, the
post-router cleans up.

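Putting the two stages side by side (a condensed sketch; pre_route_rt()
and post_route_rt() are illustrative names, while find_lowest_rq() and
push_rt_tasks() are the helpers this series actually uses):

/*
 * Stage 1: lightweight pre-routing at wakeup.  Approximate a good home
 * before the task is ever activated; no remote runqueue locks are taken.
 */
static int pre_route_rt(struct task_struct *p, struct rq *rq)
{
	if (p->prio >= rq->rt.highest_prio && p->nr_cpus_allowed > 1) {
		int cpu = find_lowest_rq(p);

		if (cpu != -1)
			return cpu;	/* wake it up over there */
	}
	return task_cpu(p);		/* ride on the affined RQ */
}

/*
 * Stage 2: post-routing cleanup.  After a schedule, if the runqueue
 * still ended up with more than one queued RT task, push the extras away.
 */
static void post_route_rt(struct rq *rq)
{
	if (rq->rt.rt_nr_running > 1)
		push_rt_tasks(rq);
}
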
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---

 kernel/sched_rt.c |   19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c	2007-11-16 22:23:45.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c	2007-11-16 22:23:48.000000000 -0500
@@ -42,8 +42,27 @@ static void update_rt_migration(struct r
 		rt_clear_overload(rq);
 }
 
+static int find_lowest_rq(struct task_struct *task);
+
 static int select_task_rq_rt(struct task_struct *p, int sync)
 {
+	struct rq *rq = task_rq(p);
+
+	/*
+	 * If the task will not preempt the RQ, try to find a better RQ
+	 * before we even activate the task
+	 */
+	if ((p->prio >= rq->rt.highest_prio)
+	    && (p->nr_cpus_allowed > 1)) {
+		int cpu = find_lowest_rq(p);
+
+		return (cpu == -1) ? task_cpu(p) : cpu;
+	}
+
+	/*
+	 * Otherwise, just let it ride on the affined RQ and the
+	 * post-schedule router will push the preempted task away
+	 */
 	return task_cpu(p);
 }
 #endif /* CONFIG_SMP */

-- 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3 14/17] Optimize our cpu selection based on topology
  2007-11-17  6:21 [PATCH v3 00/17] New RT Task Balancing -v3 Steven Rostedt
                   ` (12 preceding siblings ...)
  2007-11-17  6:21 ` [PATCH v3 13/17] RT: Pre-route RT tasks on wakeup Steven Rostedt
@ 2007-11-17  6:21 ` Steven Rostedt
  2007-11-17  6:21 ` [PATCH v3 15/17] RT: Optimize rebalancing Steven Rostedt
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Steven Rostedt @ 2007-11-17  6:21 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
	Steven Rostedt

[-- Attachment #1: 0006-optimize-cpu-section-topology.patch --]
[-- Type: text/plain, Size: 4868 bytes --]

From: Gregory Haskins <ghaskins@novell.com>

The current code base assumes a relatively flat CPU/core topology and will
route RT tasks to any CPU fairly equally.  In the real world, there are
various topologies and affinities that govern where a task is best suited to
run with the smallest amount of overhead.  NUMA and multi-core CPUs are
prime examples of topologies that can impact cache performance.

Fortunately, Linux is already structured to represent these topologies via
the sched_domains interface.  So we change our RT router to consult a
combination of topology and affinity policy to best place tasks during
migration.

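In effect the router now picks a CPU in this order (a condensed sketch
of the logic added below; rt_route_by_topology() is an illustrative
wrapper around the real helpers, and corner cases are omitted):

/*
 * Sketch of the selection order in find_lowest_rq():
 *   1. the CPU the task last ran on, if it is in the lowest-priority
 *      mask (most likely still cache-hot there)
 *   2. walking the task's sched_domains from closest to widest, the
 *      first domain-local candidate -- preferring this_cpu, which is
 *      cheapest to preempt
 *   3. failing any topology match, any candidate at all from the mask
 */
static int rt_route_by_topology(struct task_struct *p, cpumask_t *lowest_mask)
{
	struct sched_domain *sd;
	int cpu = task_cpu(p);
	int this_cpu = smp_processor_id();

	if (cpu_isset(cpu, *lowest_mask))
		return cpu;

	for_each_domain(cpu, sd) {
		cpumask_t domain_mask;
		int best;

		if (!(sd->flags & SD_WAKE_AFFINE))
			continue;

		cpus_and(domain_mask, sd->span, *lowest_mask);
		best = pick_optimal_cpu(this_cpu, &domain_mask);
		if (best != -1)
			return best;
	}

	return pick_optimal_cpu(this_cpu, lowest_mask);
}
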
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---

 kernel/sched.c    |    1 
 kernel/sched_rt.c |   99 +++++++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 88 insertions(+), 12 deletions(-)

Index: linux-compile.git/kernel/sched.c
===================================================================
--- linux-compile.git.orig/kernel/sched.c	2007-11-16 22:23:39.000000000 -0500
+++ linux-compile.git/kernel/sched.c	2007-11-16 22:23:50.000000000 -0500
@@ -24,6 +24,7 @@
  *  2007-07-01  Group scheduling enhancements by Srivatsa Vaddagiri
  *  2007-10-22  RT overload balancing by Steven Rostedt
  *                 (with thanks to Gregory Haskins)
+ *  2007-11-05  RT migration/wakeup tuning by Gregory Haskins
  */
 
 #include <linux/mm.h>
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c	2007-11-16 22:23:48.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c	2007-11-16 22:23:50.000000000 -0500
@@ -275,34 +275,109 @@ static struct task_struct *pick_next_hig
 	return next;
 }
 
-static int find_lowest_rq(struct task_struct *task)
+static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
 {
-	int cpu;
-	cpumask_t cpu_mask;
-	struct rq *lowest_rq = NULL;
+	int       cpu;
+	cpumask_t valid_mask;
+	int       lowest_prio = -1;
+	int       ret         = 0;
 
-	cpus_and(cpu_mask, cpu_online_map, task->cpus_allowed);
+	cpus_clear(*lowest_mask);
+	cpus_and(valid_mask, cpu_online_map, task->cpus_allowed);
 
 	/*
 	 * Scan each rq for the lowest prio.
 	 */
-	for_each_cpu_mask(cpu, cpu_mask) {
+	for_each_cpu_mask(cpu, valid_mask) {
 		struct rq *rq = cpu_rq(cpu);
 
 		/* We look for lowest RT prio or non-rt CPU */
 		if (rq->rt.highest_prio >= MAX_RT_PRIO) {
-			lowest_rq = rq;
-			break;
+			if (ret)
+				cpus_clear(*lowest_mask);
+			cpu_set(rq->cpu, *lowest_mask);
+			return 1;
 		}
 
 		/* no locking for now */
-		if (rq->rt.highest_prio > task->prio &&
-		    (!lowest_rq || rq->rt.highest_prio > lowest_rq->rt.highest_prio)) {
-			lowest_rq = rq;
+		if ((rq->rt.highest_prio > task->prio)
+		    && (rq->rt.highest_prio >= lowest_prio)) {
+			if (rq->rt.highest_prio > lowest_prio) {
+				/* new low - clear old data */
+				lowest_prio = rq->rt.highest_prio;
+				cpus_clear(*lowest_mask);
+			}
+			cpu_set(rq->cpu, *lowest_mask);
+			ret = 1;
+		}
+	}
+
+	return ret;
+}
+
+static inline int pick_optimal_cpu(int this_cpu, cpumask_t *mask)
+{
+	int first;
+
+	/* "this_cpu" is cheaper to preempt than a remote processor */
+	if ((this_cpu != -1) && cpu_isset(this_cpu, *mask))
+		return this_cpu;
+
+	first = first_cpu(*mask);
+	if (first != NR_CPUS)
+		return first;
+
+	return -1;
+}
+
+static int find_lowest_rq(struct task_struct *task)
+{
+	struct sched_domain *sd;
+	cpumask_t lowest_mask;
+	int this_cpu = smp_processor_id();
+	int cpu      = task_cpu(task);
+
+	if (!find_lowest_cpus(task, &lowest_mask))
+		return -1;
+
+	/*
+	 * At this point we have built a mask of cpus representing the
+	 * lowest priority tasks in the system.  Now we want to elect
+	 * the best one based on our affinity and topology.
+	 *
+	 * We prioritize the last cpu that the task executed on since
+	 * it is most likely cache-hot in that location.
+	 */
+	if (cpu_isset(cpu, lowest_mask))
+		return cpu;
+
+	/*
+	 * Otherwise, we consult the sched_domains span maps to figure
+	 * out which cpu is logically closest to our hot cache data.
+	 */
+	if (this_cpu == cpu)
+		this_cpu = -1; /* Skip this_cpu opt if the same */
+
+	for_each_domain(cpu, sd) {
+		if (sd->flags & SD_WAKE_AFFINE) {
+			cpumask_t domain_mask;
+			int       best_cpu;
+
+			cpus_and(domain_mask, sd->span, lowest_mask);
+
+			best_cpu = pick_optimal_cpu(this_cpu,
+						    &domain_mask);
+			if (best_cpu != -1)
+				return best_cpu;
 		}
 	}
 
-	return lowest_rq ? lowest_rq->cpu : -1;
+	/*
+	 * And finally, if there were no matches within the domains
+	 * just give the caller *something* to work with from the compatible
+	 * locations.
+	 */
+	return pick_optimal_cpu(this_cpu, &lowest_mask);
 }
 
 /* Will lock the rq it finds */

-- 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3 15/17] RT: Optimize rebalancing
  2007-11-17  6:21 [PATCH v3 00/17] New RT Task Balancing -v3 Steven Rostedt
                   ` (13 preceding siblings ...)
  2007-11-17  6:21 ` [PATCH v3 14/17] Optimize our cpu selection based on topology Steven Rostedt
@ 2007-11-17  6:21 ` Steven Rostedt
  2007-11-17  6:21 ` [PATCH v3 16/17] Fix schedstat handling Steven Rostedt
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Steven Rostedt @ 2007-11-17  6:21 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
	Steven Rostedt

[-- Attachment #1: 0007-optimize-rt-overload-balancing.patch --]
[-- Type: text/plain, Size: 2681 bytes --]

From: Gregory Haskins <ghaskins@novell.com>

We have logic to detect whether the system has migratable tasks, but we are
not using it when deciding whether to push tasks away.  So we add support
for considering this new information.

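The net effect (a minimal sketch; update_rt_migration() paraphrases the
rt_nr_migratory accounting added earlier in the series, and push_rt_task()
shows only the new early-out) is that a runqueue only advertises overload,
and the push path only does work, when it has both an excess RT task and
at least one that is allowed to migrate:

static void update_rt_migration(struct rq *rq)
{
	/* overloaded: more than one RT task queued, at least one migratable */
	if (rq->rt.rt_nr_migratory && rq->rt.rt_nr_running > 1)
		rt_set_overload(rq);	/* sets rq->rt.overloaded = 1 */
	else
		rt_clear_overload(rq);	/* clears rq->rt.overloaded */
}

static int push_rt_task(struct rq *rq)
{
	if (!rq->rt.overloaded)
		return 0;		/* nothing that can or should be pushed */

	/* ... existing search-and-push logic, returns 1 on success ... */
	return 1;
}
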
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---

 kernel/sched.c    |    2 ++
 kernel/sched_rt.c |   10 ++++++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

Index: linux-compile.git/kernel/sched.c
===================================================================
--- linux-compile.git.orig/kernel/sched.c	2007-11-16 22:23:50.000000000 -0500
+++ linux-compile.git/kernel/sched.c	2007-11-16 22:24:07.000000000 -0500
@@ -273,6 +273,7 @@ struct rt_rq {
 	unsigned long rt_nr_migratory;
 	/* highest queued rt task prio */
 	int highest_prio;
+	int overloaded;
 };
 
 /*
@@ -6672,6 +6673,7 @@ void __init sched_init(void)
 		rq->migration_thread = NULL;
 		INIT_LIST_HEAD(&rq->migration_queue);
 		rq->rt.highest_prio = MAX_RT_PRIO;
+		rq->rt.overloaded = 0;
 #endif
 		atomic_set(&rq->nr_iowait, 0);
 
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c	2007-11-16 22:23:50.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c	2007-11-16 22:25:30.000000000 -0500
@@ -16,6 +16,7 @@ static inline cpumask_t *rt_overload(voi
 }
 static inline void rt_set_overload(struct rq *rq)
 {
+	rq->rt.overloaded = 1;
 	cpu_set(rq->cpu, rt_overload_mask);
 	/*
 	 * Make sure the mask is visible before we set
@@ -30,6 +31,7 @@ static inline void rt_set_overload(struc
 static inline void rt_clear_overload(struct rq *rq)
 {
 	/* the order here really doesn't matter */
+	rq->rt.overloaded = 0;
 	atomic_dec(&rto_count);
 	cpu_clear(rq->cpu, rt_overload_mask);
 }
@@ -440,6 +442,9 @@ static int push_rt_task(struct rq *rq)
 
 	assert_spin_locked(&rq->lock);
 
+	if (!rq->rt.overloaded)
+		return 0;
+
 	next_task = pick_next_highest_task_rt(rq, -1);
 	if (!next_task)
 		return 0;
@@ -676,7 +681,7 @@ static void schedule_tail_balance_rt(str
 	 * the lock was owned by prev, we need to release it
 	 * first via finish_lock_switch and then reaquire it here.
 	 */
-	if (unlikely(rq->rt.rt_nr_running > 1)) {
+	if (unlikely(rq->rt.overloaded)) {
 		spin_lock_irq(&rq->lock);
 		push_rt_tasks(rq);
 		spin_unlock_irq(&rq->lock);
@@ -688,7 +693,8 @@ static void wakeup_balance_rt(struct rq 
 {
 	if (unlikely(rt_task(p)) &&
 	    !task_running(rq, p) &&
-	    (p->prio >= rq->curr->prio))
+	    (p->prio >= rq->rt.highest_prio) &&
+	    rq->rt.overloaded)
 		push_rt_tasks(rq);
 }
 

-- 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3 16/17] Fix schedstat handling
  2007-11-17  6:21 [PATCH v3 00/17] New RT Task Balancing -v3 Steven Rostedt
                   ` (14 preceding siblings ...)
  2007-11-17  6:21 ` [PATCH v3 15/17] RT: Optimize rebalancing Steven Rostedt
@ 2007-11-17  6:21 ` Steven Rostedt
  2007-11-17 17:40   ` Gregory Haskins
  2007-11-17  6:21 ` [PATCH v3 17/17] --- kernel/sched_rt.c | 20 ++++++++++++++++---- 1 file changed, 16 insertions(+), 4 deletions(-) Steven Rostedt
  2007-11-17  8:14 ` [PATCH v3 00/17] New RT Task Balancing -v3 Jon Masters
  17 siblings, 1 reply; 28+ messages in thread
From: Steven Rostedt @ 2007-11-17  6:21 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter,
	Steven Rostedt

[-- Attachment #1: rt-balance-wakup-fix-schedstat.patch --]
[-- Type: text/plain, Size: 4736 bytes --]

Gregory Haskins' RT balancing changes broke the sched-domain schedstat
handling in the wakeup path.  This is a fix to allow it to keep working.

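The shape of the fix (a simplified sketch; wake_domain() is an
illustrative helper, whereas the patch below open-codes this loop under
CONFIG_SCHEDSTATS in try_to_wake_up()) is to locate the wake domain once,
for the ttwu_wake_remote schedstat, and hand it down to the class hook so
CFS does not have to walk the domains a second time:

/* Find the sched_domain of this_cpu that spans the task's old CPU. */
static struct sched_domain *wake_domain(int this_cpu, int cpu)
{
	struct sched_domain *sd;

	for_each_domain(this_cpu, sd)
		if (cpu_isset(cpu, sd->span))
			return sd;
	return NULL;
}

	/* Illustrative use from try_to_wake_up(): */
	this_sd = wake_domain(this_cpu, cpu);
	if (this_sd)
		schedstat_inc(this_sd, ttwu_wake_remote);
	new_cpu = p->sched_class->select_task_rq(p, this_sd, sync);
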
Signed-off-by: Steven Rostedt <srostedt@redhat.com>

---
 include/linux/sched.h   |    3 ++-
 kernel/sched.c          |   17 ++++++++++++++---
 kernel/sched_fair.c     |   19 ++++++++++++++-----
 kernel/sched_idletask.c |    3 ++-
 kernel/sched_rt.c       |    3 ++-
 5 files changed, 34 insertions(+), 11 deletions(-)

Index: linux-compile.git/kernel/sched.c
===================================================================
--- linux-compile.git.orig/kernel/sched.c	2007-11-17 00:15:57.000000000 -0500
+++ linux-compile.git/kernel/sched.c	2007-11-17 00:15:57.000000000 -0500
@@ -1453,6 +1453,7 @@ static int try_to_wake_up(struct task_st
 	unsigned long flags;
 	long old_state;
 	struct rq *rq;
+	struct sched_domain *this_sd = NULL;
 #ifdef CONFIG_SMP
 	int new_cpu;
 #endif
@@ -1476,10 +1477,20 @@ static int try_to_wake_up(struct task_st
 	schedstat_inc(rq, ttwu_count);
 	if (cpu == this_cpu)
 		schedstat_inc(rq, ttwu_local);
-	else
-		schedstat_inc(rq->sd, ttwu_wake_remote);
+	else {
+#ifdef CONFIG_SCHEDSTATS
+		struct sched_domain *sd;
+		for_each_domain(this_cpu, sd) {
+			if (cpu_isset(cpu, sd->span)) {
+				schedstat_inc(sd, ttwu_wake_remote);
+				this_sd = sd;
+				break;
+			}
+		}
+#endif /* CONFIG_SCHEDSTATS */
+	}
 
-	new_cpu = p->sched_class->select_task_rq(p, sync);
+	new_cpu = p->sched_class->select_task_rq(p, this_sd, sync);
 
 	if (new_cpu != cpu) {
 		set_task_cpu(p, new_cpu);
Index: linux-compile.git/include/linux/sched.h
===================================================================
--- linux-compile.git.orig/include/linux/sched.h	2007-11-17 00:15:57.000000000 -0500
+++ linux-compile.git/include/linux/sched.h	2007-11-17 00:15:57.000000000 -0500
@@ -823,7 +823,8 @@ struct sched_class {
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
 	void (*yield_task) (struct rq *rq);
-	int  (*select_task_rq)(struct task_struct *p, int sync);
+	int  (*select_task_rq)(struct task_struct *p,
+			       struct sched_domain *sd, int sync);
 
 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);
 
Index: linux-compile.git/kernel/sched_fair.c
===================================================================
--- linux-compile.git.orig/kernel/sched_fair.c	2007-11-17 00:15:57.000000000 -0500
+++ linux-compile.git/kernel/sched_fair.c	2007-11-17 00:43:44.000000000 -0500
@@ -611,11 +611,12 @@ static inline int wake_idle(int cpu, str
 #endif
 
 #ifdef CONFIG_SMP
-static int select_task_rq_fair(struct task_struct *p, int sync)
+static int select_task_rq_fair(struct task_struct *p,
+			       struct sched_domain *this_sd, int sync)
 {
 	int cpu, this_cpu;
 	struct rq *rq;
-	struct sched_domain *sd, *this_sd = NULL;
+	struct sched_domain *sd;
 	int new_cpu;
 
 	cpu      = task_cpu(p);
@@ -623,15 +624,23 @@ static int select_task_rq_fair(struct ta
 	this_cpu = smp_processor_id();
 	new_cpu  = cpu;
 
+	if (cpu == this_cpu || unlikely(!cpu_isset(this_cpu, p->cpus_allowed)))
+		goto out_set_cpu;
+
+#ifndef CONFIG_SCHEDSTATS
+	/*
+	 * If SCHEDSTATS is configured, then this_sd would
+	 * have already been determined.
+	 */
 	for_each_domain(this_cpu, sd) {
 		if (cpu_isset(cpu, sd->span)) {
 			this_sd = sd;
 			break;
 		}
 	}
-
-	if (unlikely(!cpu_isset(this_cpu, p->cpus_allowed)))
-		goto out_set_cpu;
+#else
+	(void)sd; /* unused */
+#endif /* CONFIG_SCHEDSTATS */
 
 	/*
 	 * Check for affine wakeup and passive balancing possibilities.
Index: linux-compile.git/kernel/sched_idletask.c
===================================================================
--- linux-compile.git.orig/kernel/sched_idletask.c	2007-11-17 00:15:57.000000000 -0500
+++ linux-compile.git/kernel/sched_idletask.c	2007-11-17 00:15:57.000000000 -0500
@@ -6,7 +6,8 @@
  */
 
 #ifdef CONFIG_SMP
-static int select_task_rq_idle(struct task_struct *p, int sync)
+static int select_task_rq_idle(struct task_struct *p,
+			       struct sched_domain *sd, int sync)
 {
 	return task_cpu(p); /* IDLE tasks are never migrated */
 }
Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c	2007-11-17 00:15:57.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c	2007-11-17 00:44:31.000000000 -0500
@@ -46,7 +46,8 @@ static void update_rt_migration(struct r
 
 static int find_lowest_rq(struct task_struct *task);
 
-static int select_task_rq_rt(struct task_struct *p, int sync)
+static int select_task_rq_rt(struct task_struct *p,
+			     struct sched_domain *sd, int sync)
 {
 	struct rq *rq = task_rq(p);
 

-- 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3 17/17] --- kernel/sched_rt.c | 20 ++++++++++++++++---- 1 file changed, 16 insertions(+), 4 deletions(-)
  2007-11-17  6:21 [PATCH v3 00/17] New RT Task Balancing -v3 Steven Rostedt
                   ` (15 preceding siblings ...)
  2007-11-17  6:21 ` [PATCH v3 16/17] Fix schedstat handling Steven Rostedt
@ 2007-11-17  6:21 ` Steven Rostedt
  2007-11-17  6:33   ` [PATCH v3 17/17] (Avoid overload) Steven Rostedt
  2007-11-17  8:14 ` [PATCH v3 00/17] New RT Task Balancing -v3 Jon Masters
  17 siblings, 1 reply; 28+ messages in thread
From: Steven Rostedt @ 2007-11-17  6:21 UTC (permalink / raw)
  To: LKML; +Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter

[-- Attachment #1: rt-balance-avoid-overloading.patch --]
[-- Type: text/plain, Size: 1390 bytes --]

Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c	2007-11-17 00:18:27.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c	2007-11-17 00:20:31.000000000 -0500
@@ -52,11 +52,23 @@ static int select_task_rq_rt(struct task
 	struct rq *rq = task_rq(p);
 
 	/*
-	 * If the task will not preempt the RQ, try to find a better RQ
-	 * before we even activate the task
+	 * If the current task is an RT task, then
+	 * try to see if we can wake this RT task up on another
+	 * runqueue. Otherwise simply start this RT task
+	 * on its current runqueue.
+	 *
+	 * We want to avoid overloading runqueues. Even if
+	 * the RT task is of higher priority than the current RT task.
+	 * RT tasks behave differently than other tasks. If
+	 * one gets preempted, we try to push it off to another queue.
+	 * So trying to keep a preempting RT task on the same
+	 * cache hot CPU will force the running RT task to
+	 * a cold CPU. So we waste all the cache for the lower
+	 * RT task in hopes of saving some of a RT task
+	 * that is just being woken and probably will have
+	 * cold cache anyway.
 	 */
-	if ((p->prio >= rq->rt.highest_prio)
-	    && (p->nr_cpus_allowed > 1)) {
+	if (unlikely(rt_task(rq->curr))) {
 		int cpu = find_lowest_rq(p);
 
 		return (cpu == -1) ? task_cpu(p) : cpu;

-- 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 17/17] (Avoid overload)
  2007-11-17  6:21 ` [PATCH v3 17/17] --- kernel/sched_rt.c | 20 ++++++++++++++++---- 1 file changed, 16 insertions(+), 4 deletions(-) Steven Rostedt
@ 2007-11-17  6:33   ` Steven Rostedt
  2007-11-17 17:42     ` Gregory Haskins
  2007-11-17 17:46     ` Gregory Haskins
  0 siblings, 2 replies; 28+ messages in thread
From: Steven Rostedt @ 2007-11-17  6:33 UTC (permalink / raw)
  To: LKML; +Cc: Ingo Molnar, Gregory Haskins, Peter Zijlstra, Christoph Lameter


Sorry!  I forgot to put in a prologue for this patch.

Here it is.

====

This patch changes the runqueue search for a waking RT task so that
it tries to pick another runqueue whenever the currently running task
is an RT task.

The reason is that RT tasks behave differently from normal
tasks. Preempting a normal task to run an RT task to keep
its cache hot is fine, because the preempted non-RT task
may wait on that same runqueue to run again, unless the
migration thread comes along and pulls it off.

RT tasks behave differently. If one is preempted, it makes
an active effort to continue to run. So by having a high
priority task preempt a lower priority RT task, that lower
RT task will quickly try to run on another runqueue.
This will cause the lower RT task to replace its nice
hot cache (and TLB) with a completely cold one, all in
the hope that the new high priority RT task will keep
its cache hot.

Remember that this high priority RT task was just woken up.
It has likely been sleeping for several milliseconds,
and will end up with a cold cache anyway. RT tasks run until
they voluntarily stop, or are preempted by a higher priority
task. This means that it is unlikely that the woken RT task
will have a hot cache to wake up to. So pushing off a lower
RT task is just killing its cache for no good reason.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>

---
 kernel/sched_rt.c |   20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

Index: linux-compile.git/kernel/sched_rt.c
===================================================================
--- linux-compile.git.orig/kernel/sched_rt.c	2007-11-17 00:18:27.000000000 -0500
+++ linux-compile.git/kernel/sched_rt.c	2007-11-17 00:20:31.000000000 -0500
@@ -52,11 +52,23 @@ static int select_task_rq_rt(struct task
 	struct rq *rq = task_rq(p);
 
 	/*
-	 * If the task will not preempt the RQ, try to find a better RQ
-	 * before we even activate the task
+	 * If the current task is an RT task, then
+	 * try to see if we can wake this RT task up on another
+	 * runqueue. Otherwise simply start this RT task
+	 * on its current runqueue.
+	 *
+	 * We want to avoid overloading runqueues. Even if
+	 * the RT task is of higher priority than the current RT task.
+	 * RT tasks behave differently than other tasks. If
+	 * one gets preempted, we try to push it off to another queue.
+	 * So trying to keep a preempting RT task on the same
+	 * cache hot CPU will force the running RT task to
+	 * a cold CPU. So we waste all the cache for the lower
+	 * RT task in hopes of saving some of a RT task
+	 * that is just being woken and probably will have
+	 * cold cache anyway.
 	 */
-	if ((p->prio >= rq->rt.highest_prio)
-	    && (p->nr_cpus_allowed > 1)) {
+	if (unlikely(rt_task(rq->curr))) {
 		int cpu = find_lowest_rq(p);
 
 		return (cpu == -1) ? task_cpu(p) : cpu;

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 00/17] New RT Task Balancing -v3
  2007-11-17  6:21 [PATCH v3 00/17] New RT Task Balancing -v3 Steven Rostedt
                   ` (16 preceding siblings ...)
  2007-11-17  6:21 ` [PATCH v3 17/17] --- kernel/sched_rt.c | 20 ++++++++++++++++---- 1 file changed, 16 insertions(+), 4 deletions(-) Steven Rostedt
@ 2007-11-17  8:14 ` Jon Masters
  17 siblings, 0 replies; 28+ messages in thread
From: Jon Masters @ 2007-11-17  8:14 UTC (permalink / raw)
  To: linux-kernel

On Sat, 2007-11-17 at 01:21 -0500, Steven Rostedt wrote:

> Comments welcomed!

Looks very cool. I was a little apprehensive about the active
synchronous migration/moving runqueues but you seem to have more than
proved your point :-)

Jon.




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 10/17] Remove some CFS specific code from the wakeup path of RT tasks
  2007-11-17  6:21 ` [PATCH v3 10/17] Remove some CFS specific code from the wakeup path of RT tasks Steven Rostedt
@ 2007-11-17 17:35   ` Gregory Haskins
  2007-11-17 18:51     ` Steven Rostedt
  0 siblings, 1 reply; 28+ messages in thread
From: Gregory Haskins @ 2007-11-17 17:35 UTC (permalink / raw)
  To: Steven Rostedt, LKML
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Christoph Lameter

[-- Attachment #1: Type: text/plain, Size: 3289 bytes --]

>>> On Sat, Nov 17, 2007 at  1:21 AM, in message
<20071117062404.672976284@goodmis.org>, Steven Rostedt <rostedt@goodmis.org>
wrote: 

> -/*
> - * wake_idle() will wake a task on an idle cpu if task->cpu is
> - * not idle and an idle cpu is available.  The span of cpus to
> - * search starts with cpus closest then further out as needed,
> - * so we always favor a closer, idle cpu.
> - *
> - * Returns the CPU we should wake onto.
> - */
> -#if defined(ARCH_HAS_SCHED_WAKE_IDLE)
> -static int wake_idle(int cpu, struct task_struct *p)
> -{
> -	cpumask_t tmp;
> -	struct sched_domain *sd;
> -	int i;
> -
> -	/*
> -	 * If it is idle, then it is the best cpu to run this task.
> -	 *
> -	 * This cpu is also the best, if it has more than one task already.
> -	 * Siblings must be also busy(in most cases) as they didn't already
> -	 * pickup the extra load from this cpu and hence we need not check
> -	 * sibling runqueue info. This will avoid the checks and cache miss
> -	 * penalities associated with that.
> -	 */
> -	if (idle_cpu(cpu) || cpu_rq(cpu)->nr_running > 1)
> -		return cpu;
> -
> -	for_each_domain(cpu, sd) {
> -		if (sd->flags & SD_WAKE_IDLE) {
> -			cpus_and(tmp, sd->span, p->cpus_allowed);
> -			for_each_cpu_mask(i, tmp) {
> -				if (idle_cpu(i)) {
> -					if (i != task_cpu(p)) {
> -						schedstat_inc(p,
> -							se.nr_wakeups_idle);
                                                                                          ^^^^^^^^^^^^^^^^^^^^^

[...]


> --- linux-compile.git.orig/kernel/sched_fair.c	2007-11-16 11:16:38.000000000 -0500
> +++ linux-compile.git/kernel/sched_fair.c	2007-11-16 22:23:39.000000000 -0500
> @@ -564,6 +564,137 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
>  }
>  
>  /*
> + * wake_idle() will wake a task on an idle cpu if task->cpu is
> + * not idle and an idle cpu is available.  The span of cpus to
> + * search starts with cpus closest then further out as needed,
> + * so we always favor a closer, idle cpu.
> + *
> + * Returns the CPU we should wake onto.
> + */
> +#if defined(ARCH_HAS_SCHED_WAKE_IDLE)
> +static int wake_idle(int cpu, struct task_struct *p)
> +{
> +	cpumask_t tmp;
> +	struct sched_domain *sd;
> +	int i;
> +
> +	/*
> +	 * If it is idle, then it is the best cpu to run this task.
> +	 *
> +	 * This cpu is also the best, if it has more than one task already.
> +	 * Siblings must be also busy(in most cases) as they didn't already
> +	 * pickup the extra load from this cpu and hence we need not check
> +	 * sibling runqueue info. This will avoid the checks and cache miss
> +	 * penalities associated with that.
> +	 */
> +	if (idle_cpu(cpu) || cpu_rq(cpu)->nr_running > 1)
> +		return cpu;
> +
> +	for_each_domain(cpu, sd) {
> +		if (sd->flags & SD_WAKE_IDLE) {
> +			cpus_and(tmp, sd->span, p->cpus_allowed);
> +			for_each_cpu_mask(i, tmp) {
> +				if (idle_cpu(i))
> +					return i;
                                                                        ^^^^^^^^^^^^^^^^

Looks like some stuff that was added in 24 was inadvertently lost in the move when you merged the patches up from 23.1-rt11.  The attached patch is updated to move the new logic as well.

Regards,
-Greg
                                    


[-- Attachment #2: sched-de-cfsize-rt-path.patch --]
[-- Type: text/plain, Size: 13566 bytes --]

RT: Remove some CFS specific code from the wakeup path of RT tasks

From: Gregory Haskins <ghaskins@novell.com>

The current wake-up code path tries to determine if it can optimize the
wake-up to "this_cpu" by computing load calculations.  The problem is that
these calculations are only relevant to CFS tasks where load is king.  For RT
tasks, priority is king.  So the load calculation is completely wasted
bandwidth.

Therefore, we create a new sched_class interface to help with
pre-wakeup routing decisions, and move the load calculation into the
CFS class where it is actually relevant.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/sched.h   |    1 
 kernel/sched.c          |  167 ++++++++---------------------------------------
 kernel/sched_fair.c     |  148 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_idletask.c |    9 +++
 kernel/sched_rt.c       |   10 +++
 5 files changed, 195 insertions(+), 140 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e9e74de..253517b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -823,6 +823,7 @@ struct sched_class {
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
 	void (*yield_task) (struct rq *rq);
+	int  (*select_task_rq)(struct task_struct *p, int sync);
 
 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);
 
diff --git a/kernel/sched.c b/kernel/sched.c
index f089f41..bb92ec4 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -861,6 +861,13 @@ iter_move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest,
 		   struct rq_iterator *iterator);
 #endif
 
+#ifdef CONFIG_SMP
+static unsigned long source_load(int cpu, int type);
+static unsigned long target_load(int cpu, int type);
+static unsigned long cpu_avg_load_per_task(int cpu);
+static int task_hot(struct task_struct *p, u64 now, struct sched_domain *sd);
+#endif /* CONFIG_SMP */
+
 #include "sched_stats.h"
 #include "sched_idletask.c"
 #include "sched_fair.c"
@@ -1040,7 +1047,7 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
 /*
  * Is this task likely cache-hot:
  */
-static inline int
+static int
 task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 {
 	s64 delta;
@@ -1265,7 +1272,7 @@ static unsigned long target_load(int cpu, int type)
 /*
  * Return the average load per task on the cpu's run queue
  */
-static inline unsigned long cpu_avg_load_per_task(int cpu)
+static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long total = weighted_cpuload(cpu);
@@ -1422,58 +1429,6 @@ static int sched_balance_self(int cpu, int flag)
 
 #endif /* CONFIG_SMP */
 
-/*
- * wake_idle() will wake a task on an idle cpu if task->cpu is
- * not idle and an idle cpu is available.  The span of cpus to
- * search starts with cpus closest then further out as needed,
- * so we always favor a closer, idle cpu.
- *
- * Returns the CPU we should wake onto.
- */
-#if defined(ARCH_HAS_SCHED_WAKE_IDLE)
-static int wake_idle(int cpu, struct task_struct *p)
-{
-	cpumask_t tmp;
-	struct sched_domain *sd;
-	int i;
-
-	/*
-	 * If it is idle, then it is the best cpu to run this task.
-	 *
-	 * This cpu is also the best, if it has more than one task already.
-	 * Siblings must be also busy(in most cases) as they didn't already
-	 * pickup the extra load from this cpu and hence we need not check
-	 * sibling runqueue info. This will avoid the checks and cache miss
-	 * penalities associated with that.
-	 */
-	if (idle_cpu(cpu) || cpu_rq(cpu)->nr_running > 1)
-		return cpu;
-
-	for_each_domain(cpu, sd) {
-		if (sd->flags & SD_WAKE_IDLE) {
-			cpus_and(tmp, sd->span, p->cpus_allowed);
-			for_each_cpu_mask(i, tmp) {
-				if (idle_cpu(i)) {
-					if (i != task_cpu(p)) {
-						schedstat_inc(p,
-							se.nr_wakeups_idle);
-					}
-					return i;
-				}
-			}
-		} else {
-			break;
-		}
-	}
-	return cpu;
-}
-#else
-static inline int wake_idle(int cpu, struct task_struct *p)
-{
-	return cpu;
-}
-#endif
-
 /***
  * try_to_wake_up - wake up a thread
  * @p: the to-be-woken-up thread
@@ -1495,8 +1450,6 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state, int sync)
 	long old_state;
 	struct rq *rq;
 #ifdef CONFIG_SMP
-	struct sched_domain *sd, *this_sd = NULL;
-	unsigned long load, this_load;
 	int new_cpu;
 #endif
 
@@ -1516,90 +1469,7 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state, int sync)
 	if (unlikely(task_running(rq, p)))
 		goto out_activate;
 
-	new_cpu = cpu;
-
-	schedstat_inc(rq, ttwu_count);
-	if (cpu == this_cpu) {
-		schedstat_inc(rq, ttwu_local);
-		goto out_set_cpu;
-	}
-
-	for_each_domain(this_cpu, sd) {
-		if (cpu_isset(cpu, sd->span)) {
-			schedstat_inc(sd, ttwu_wake_remote);
-			this_sd = sd;
-			break;
-		}
-	}
-
-	if (unlikely(!cpu_isset(this_cpu, p->cpus_allowed)))
-		goto out_set_cpu;
-
-	/*
-	 * Check for affine wakeup and passive balancing possibilities.
-	 */
-	if (this_sd) {
-		int idx = this_sd->wake_idx;
-		unsigned int imbalance;
-
-		imbalance = 100 + (this_sd->imbalance_pct - 100) / 2;
-
-		load = source_load(cpu, idx);
-		this_load = target_load(this_cpu, idx);
-
-		new_cpu = this_cpu; /* Wake to this CPU if we can */
-
-		if (this_sd->flags & SD_WAKE_AFFINE) {
-			unsigned long tl = this_load;
-			unsigned long tl_per_task;
-
-			/*
-			 * Attract cache-cold tasks on sync wakeups:
-			 */
-			if (sync && !task_hot(p, rq->clock, this_sd))
-				goto out_set_cpu;
-
-			schedstat_inc(p, se.nr_wakeups_affine_attempts);
-			tl_per_task = cpu_avg_load_per_task(this_cpu);
-
-			/*
-			 * If sync wakeup then subtract the (maximum possible)
-			 * effect of the currently running task from the load
-			 * of the current CPU:
-			 */
-			if (sync)
-				tl -= current->se.load.weight;
-
-			if ((tl <= load &&
-				tl + target_load(cpu, idx) <= tl_per_task) ||
-			       100*(tl + p->se.load.weight) <= imbalance*load) {
-				/*
-				 * This domain has SD_WAKE_AFFINE and
-				 * p is cache cold in this domain, and
-				 * there is no bad imbalance.
-				 */
-				schedstat_inc(this_sd, ttwu_move_affine);
-				schedstat_inc(p, se.nr_wakeups_affine);
-				goto out_set_cpu;
-			}
-		}
-
-		/*
-		 * Start passive balancing when half the imbalance_pct
-		 * limit is reached.
-		 */
-		if (this_sd->flags & SD_WAKE_BALANCE) {
-			if (imbalance*this_load <= 100*load) {
-				schedstat_inc(this_sd, ttwu_move_balance);
-				schedstat_inc(p, se.nr_wakeups_passive);
-				goto out_set_cpu;
-			}
-		}
-	}
-
-	new_cpu = cpu; /* Could not wake to this_cpu. Wake to cpu instead */
-out_set_cpu:
-	new_cpu = wake_idle(new_cpu, p);
+	new_cpu = p->sched_class->select_task_rq(p, sync);
 	if (new_cpu != cpu) {
 		set_task_cpu(p, new_cpu);
 		task_rq_unlock(rq, &flags);
@@ -1615,6 +1485,23 @@ out_set_cpu:
 		cpu = task_cpu(p);
 	}
 
+#ifdef CONFIG_SCHEDSTATS
+	schedstat_inc(rq, ttwu_count);
+	if (cpu == this_cpu)
+		schedstat_inc(rq, ttwu_local);
+	else {
+		struct sched_domain *sd;
+		for_each_domain(this_cpu, sd) {
+			if (cpu_isset(cpu, sd->span)) {
+				schedstat_inc(sd, ttwu_wake_remote);
+				break;
+			}
+		}
+	}
+
+#endif
+
+
 out_activate:
 #endif /* CONFIG_SMP */
 	schedstat_inc(p, se.nr_wakeups);
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index d3c0307..9e30c96 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -830,6 +830,151 @@ static void yield_task_fair(struct rq *rq)
 }
 
 /*
+ * wake_idle() will wake a task on an idle cpu if task->cpu is
+ * not idle and an idle cpu is available.  The span of cpus to
+ * search starts with cpus closest then further out as needed,
+ * so we always favor a closer, idle cpu.
+ *
+ * Returns the CPU we should wake onto.
+ */
+#if defined(ARCH_HAS_SCHED_WAKE_IDLE)
+static int wake_idle(int cpu, struct task_struct *p)
+{
+	cpumask_t tmp;
+	struct sched_domain *sd;
+	int i;
+
+	/*
+	 * If it is idle, then it is the best cpu to run this task.
+	 *
+	 * This cpu is also the best, if it has more than one task already.
+	 * Siblings must be also busy(in most cases) as they didn't already
+	 * pickup the extra load from this cpu and hence we need not check
+	 * sibling runqueue info. This will avoid the checks and cache miss
+	 * penalities associated with that.
+	 */
+	if (idle_cpu(cpu) || cpu_rq(cpu)->nr_running > 1)
+		return cpu;
+
+	for_each_domain(cpu, sd) {
+		if (sd->flags & SD_WAKE_IDLE) {
+			cpus_and(tmp, sd->span, p->cpus_allowed);
+			for_each_cpu_mask(i, tmp) {
+				if (idle_cpu(i)) {
+					if (i != task_cpu(p)) {
+						schedstat_inc(p,
+						       se.nr_wakeups_idle);
+					}
+					return i;
+				}
+			}
+		} else {
+			break;
+		}
+	}
+	return cpu;
+}
+#else
+static inline int wake_idle(int cpu, struct task_struct *p)
+{
+	return cpu;
+}
+#endif
+
+#ifdef CONFIG_SMP
+static int select_task_rq_fair(struct task_struct *p, int sync)
+{
+	int cpu, this_cpu;
+	struct rq *rq;
+	struct sched_domain *sd, *this_sd = NULL;
+	int new_cpu;
+
+	cpu      = task_cpu(p);
+	rq       = task_rq(p);
+	this_cpu = smp_processor_id();
+	new_cpu  = cpu;
+
+	for_each_domain(this_cpu, sd) {
+		if (cpu_isset(cpu, sd->span)) {
+			this_sd = sd;
+			break;
+		}
+	}
+
+	if (unlikely(!cpu_isset(this_cpu, p->cpus_allowed)))
+		goto out_set_cpu;
+
+	/*
+	 * Check for affine wakeup and passive balancing possibilities.
+	 */
+	if (this_sd) {
+		int idx = this_sd->wake_idx;
+		unsigned int imbalance;
+		unsigned long load, this_load;
+
+		imbalance = 100 + (this_sd->imbalance_pct - 100) / 2;
+
+		load = source_load(cpu, idx);
+		this_load = target_load(this_cpu, idx);
+
+		new_cpu = this_cpu; /* Wake to this CPU if we can */
+
+		if (this_sd->flags & SD_WAKE_AFFINE) {
+			unsigned long tl = this_load;
+			unsigned long tl_per_task;
+
+			/*
+			 * Attract cache-cold tasks on sync wakeups:
+			 */
+			if (sync && !task_hot(p, rq->clock, this_sd))
+				goto out_set_cpu;
+
+			schedstat_inc(p, se.nr_wakeups_affine_attempts);
+			tl_per_task = cpu_avg_load_per_task(this_cpu);
+
+			/*
+			 * If sync wakeup then subtract the (maximum possible)
+			 * effect of the currently running task from the load
+			 * of the current CPU:
+			 */
+			if (sync)
+				tl -= current->se.load.weight;
+
+			if ((tl <= load &&
+				tl + target_load(cpu, idx) <= tl_per_task) ||
+			       100*(tl + p->se.load.weight) <= imbalance*load) {
+				/*
+				 * This domain has SD_WAKE_AFFINE and
+				 * p is cache cold in this domain, and
+				 * there is no bad imbalance.
+				 */
+				schedstat_inc(this_sd, ttwu_move_affine);
+				schedstat_inc(p, se.nr_wakeups_affine);
+				goto out_set_cpu;
+			}
+		}
+
+		/*
+		 * Start passive balancing when half the imbalance_pct
+		 * limit is reached.
+		 */
+		if (this_sd->flags & SD_WAKE_BALANCE) {
+			if (imbalance*this_load <= 100*load) {
+				schedstat_inc(this_sd, ttwu_move_balance);
+				schedstat_inc(p, se.nr_wakeups_passive);
+				goto out_set_cpu;
+			}
+		}
+	}
+
+	new_cpu = cpu; /* Could not wake to this_cpu. Wake to cpu instead */
+out_set_cpu:
+	return wake_idle(new_cpu, p);
+}
+#endif /* CONFIG_SMP */
+
+
+/*
  * Preempt the current task with a newly woken task if needed:
  */
 static void check_preempt_wakeup(struct rq *rq, struct task_struct *p)
@@ -1102,6 +1247,9 @@ static const struct sched_class fair_sched_class = {
 	.enqueue_task		= enqueue_task_fair,
 	.dequeue_task		= dequeue_task_fair,
 	.yield_task		= yield_task_fair,
+#ifdef CONFIG_SMP
+	.select_task_rq		= select_task_rq_fair,
+#endif /* CONFIG_SMP */
 
 	.check_preempt_curr	= check_preempt_wakeup,
 
diff --git a/kernel/sched_idletask.c b/kernel/sched_idletask.c
index bf9c25c..ca53748 100644
--- a/kernel/sched_idletask.c
+++ b/kernel/sched_idletask.c
@@ -5,6 +5,12 @@
  *  handled in sched_fair.c)
  */
 
+#ifdef CONFIG_SMP
+static int select_task_rq_idle(struct task_struct *p, int sync)
+{
+	return task_cpu(p); /* IDLE tasks as never migrated */
+}
+#endif /* CONFIG_SMP */
 /*
  * Idle tasks are unconditionally rescheduled:
  */
@@ -72,6 +78,9 @@ const struct sched_class idle_sched_class = {
 
 	/* dequeue is not valid, we print a debug message there: */
 	.dequeue_task		= dequeue_task_idle,
+#ifdef CONFIG_SMP
+	.select_task_rq		= select_task_rq_idle,
+#endif /* CONFIG_SMP */
 
 	.check_preempt_curr	= check_preempt_curr_idle,
 
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 4a469c5..0e408dc 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -147,6 +147,13 @@ yield_task_rt(struct rq *rq)
 	requeue_task_rt(rq, rq->curr);
 }
 
+#ifdef CONFIG_SMP
+static int select_task_rq_rt(struct task_struct *p, int sync)
+{
+	return task_cpu(p);
+}
+#endif /* CONFIG_SMP */
+
 /*
  * Preempt the current task with a newly woken task if needed:
  */
@@ -669,6 +676,9 @@ const struct sched_class rt_sched_class = {
 	.enqueue_task		= enqueue_task_rt,
 	.dequeue_task		= dequeue_task_rt,
 	.yield_task		= yield_task_rt,
+#ifdef CONFIG_SMP
+	.select_task_rq		= select_task_rq_rt,
+#endif /* CONFIG_SMP */
 
 	.check_preempt_curr	= check_preempt_curr_rt,
 

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 16/17] Fix schedstat handling
  2007-11-17  6:21 ` [PATCH v3 16/17] Fix schedstat handling Steven Rostedt
@ 2007-11-17 17:40   ` Gregory Haskins
  2007-11-17 18:52     ` Steven Rostedt
  0 siblings, 1 reply; 28+ messages in thread
From: Gregory Haskins @ 2007-11-17 17:40 UTC (permalink / raw)
  To: Steven Rostedt, LKML
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Christoph Lameter

>>> On Sat, Nov 17, 2007 at  1:21 AM, in message
<20071117062405.634159013@goodmis.org>, Steven Rostedt <rostedt@goodmis.org>
wrote: 
> Gregory Haskins RT balancing broke sched domains.

Doh! (though you mean s/domains/stats ;)

> This is a fix to allow it to still work.
> 
> Signed-off-by: Steven Rostedt <srostedt@redhat.com>
> 
> ---
>  include/linux/sched.h   |    3 ++-
>  kernel/sched.c          |   17 ++++++++++++++---
>  kernel/sched_fair.c     |   19 ++++++++++++++-----
>  kernel/sched_idletask.c |    3 ++-
>  kernel/sched_rt.c       |    3 ++-
>  5 files changed, 34 insertions(+), 11 deletions(-)
> 
> Index: linux-compile.git/kernel/sched.c
> ===================================================================
> --- linux-compile.git.orig/kernel/sched.c	2007-11-17 00:15:57.000000000 -0500
> +++ linux-compile.git/kernel/sched.c	2007-11-17 00:15:57.000000000 -0500
> @@ -1453,6 +1453,7 @@ static int try_to_wake_up(struct task_st
>  	unsigned long flags;
>  	long old_state;
>  	struct rq *rq;
> +	struct sched_domain *this_sd = NULL;
>  #ifdef CONFIG_SMP
>  	int new_cpu;
>  #endif
> @@ -1476,10 +1477,20 @@ static int try_to_wake_up(struct task_st
>  	schedstat_inc(rq, ttwu_count);
>  	if (cpu == this_cpu)
>  		schedstat_inc(rq, ttwu_local);
> -	else
> -		schedstat_inc(rq->sd, ttwu_wake_remote);
> +	else {
> +#ifdef CONFIG_SCHEDSTATS
> +		struct sched_domain *sd;
> +		for_each_domain(this_cpu, sd) {
> +			if (cpu_isset(cpu, sd->span)) {
> +				schedstat_inc(sd, ttwu_wake_remote);
> +				this_sd = sd;
> +				break;
> +			}
> +		}
> +#endif /* CONFIG_SCHEDSTATES */
> +	}
>  
> -	new_cpu = p->sched_class->select_task_rq(p, sync);
> +	new_cpu = p->sched_class->select_task_rq(p, this_sd, sync);

I like this optimization, but I am thinking that the location of the stat update is now no longer relevant.  It should potentially go *after* the select_task_rq() so that we pick the sched_domain of the actual wake target, not the historical affinity.  If that is accurate, I'm sure you can finagle this optimization to work in that scenario too, but it will take a little re-work.
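
(As a sketch only, not taken from any posted patch, and keeping the plain (p, sync) signature rather than the (p, this_sd, sync) variant in the hunk above, the reordered accounting might look roughly like this, reusing the names from the surrounding code:)

	new_cpu = p->sched_class->select_task_rq(p, sync);

#ifdef CONFIG_SCHEDSTATS
	schedstat_inc(rq, ttwu_count);
	if (new_cpu == this_cpu)
		schedstat_inc(rq, ttwu_local);
	else {
		struct sched_domain *sd;

		/* charge the remote wakeup to the domain spanning the chosen CPU */
		for_each_domain(this_cpu, sd) {
			if (cpu_isset(new_cpu, sd->span)) {
				schedstat_inc(sd, ttwu_wake_remote);
				break;
			}
		}
	}
#endif

(Whether ttwu_local should likewise be judged against new_cpu, and how to keep handing this_sd to select_task_rq_fair() without walking the domains twice, is presumably part of the re-work being alluded to.)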


>  
>  	if (new_cpu != cpu) {
>  		set_task_cpu(p, new_cpu);
> Index: linux-compile.git/include/linux/sched.h
> ===================================================================
> --- linux-compile.git.orig/include/linux/sched.h	2007-11-17 00:15:57.000000000 -0500
> +++ linux-compile.git/include/linux/sched.h	2007-11-17 00:15:57.000000000 -0500
> @@ -823,7 +823,8 @@ struct sched_class {
>  	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
>  	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
>  	void (*yield_task) (struct rq *rq);
> -	int  (*select_task_rq)(struct task_struct *p, int sync);
> +	int  (*select_task_rq)(struct task_struct *p,
> +			       struct sched_domain *sd, int sync);
>  
>  	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);
>  
> Index: linux-compile.git/kernel/sched_fair.c
> ===================================================================
> --- linux-compile.git.orig/kernel/sched_fair.c	2007-11-17 00:15:57.000000000 -0500
> +++ linux-compile.git/kernel/sched_fair.c	2007-11-17 00:43:44.000000000 -0500
> @@ -611,11 +611,12 @@ static inline int wake_idle(int cpu, str
>  #endif
>  
>  #ifdef CONFIG_SMP
> -static int select_task_rq_fair(struct task_struct *p, int sync)
> +static int select_task_rq_fair(struct task_struct *p,
> +			       struct sched_domain *this_sd, int sync)
>  {
>  	int cpu, this_cpu;
>  	struct rq *rq;
> -	struct sched_domain *sd, *this_sd = NULL;
> +	struct sched_domain *sd;
>  	int new_cpu;
>  
>  	cpu      = task_cpu(p);
> @@ -623,15 +624,23 @@ static int select_task_rq_fair(struct ta
>  	this_cpu = smp_processor_id();
>  	new_cpu  = cpu;
>  
> +	if (cpu == this_cpu || unlikely(!cpu_isset(this_cpu, p->cpus_allowed)))
> +		goto out_set_cpu;
> +
> +#ifndef CONFIG_SCHEDSTATS
> +	/*
> +	 * If SCHEDSTATS is configured, then this_sd would
> +	 * have already been determined.
> +	 */
>  	for_each_domain(this_cpu, sd) {
>  		if (cpu_isset(cpu, sd->span)) {
>  			this_sd = sd;
>  			break;
>  		}
>  	}
> -
> -	if (unlikely(!cpu_isset(this_cpu, p->cpus_allowed)))
> -		goto out_set_cpu;
> +#else
> +	(void)sd; /* unused */
> +#endif /* CONFIG_SCHEDSTATS */
>  
>  	/*
>  	 * Check for affine wakeup and passive balancing possibilities.
> Index: linux-compile.git/kernel/sched_idletask.c
> ===================================================================
> --- linux-compile.git.orig/kernel/sched_idletask.c	2007-11-17 00:15:57.000000000 
> -0500
> +++ linux-compile.git/kernel/sched_idletask.c	2007-11-17 00:15:57.000000000 -0500
> @@ -6,7 +6,8 @@
>   */
>  
>  #ifdef CONFIG_SMP
> -static int select_task_rq_idle(struct task_struct *p, int sync)
> +static int select_task_rq_idle(struct task_struct *p,
> +			       struct sched_domain *sd, int sync)
>  {
>  	return task_cpu(p); /* IDLE tasks as never migrated */
>  }
> Index: linux-compile.git/kernel/sched_rt.c
> ===================================================================
> --- linux-compile.git.orig/kernel/sched_rt.c	2007-11-17 00:15:57.000000000 -0500
> +++ linux-compile.git/kernel/sched_rt.c	2007-11-17 00:44:31.000000000 -0500
> @@ -46,7 +46,8 @@ static void update_rt_migration(struct r
>  
>  static int find_lowest_rq(struct task_struct *task);
>  
> -static int select_task_rq_rt(struct task_struct *p, int sync)
> +static int select_task_rq_rt(struct task_struct *p,
> +			     struct sched_domain *sd, int sync)
>  {
>  	struct rq *rq = task_rq(p);
>  
> 
> -- 



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 17/17] (Avoid overload)
  2007-11-17  6:33   ` [PATCH v3 17/17] (Avoid overload) Steven Rostedt
@ 2007-11-17 17:42     ` Gregory Haskins
  2007-11-17 18:55       ` Steven Rostedt
  2007-11-17 17:46     ` Gregory Haskins
  1 sibling, 1 reply; 28+ messages in thread
From: Gregory Haskins @ 2007-11-17 17:42 UTC (permalink / raw)
  To: Steven Rostedt, LKML; +Cc: Peter Zijlstra, Ingo Molnar, Christoph Lameter

>>> On Sat, Nov 17, 2007 at  1:33 AM, in message
<20071117063318.GA31442@goodmis.org>, Steven Rostedt <rostedt@goodmis.org>
wrote: 

> Sorry!  I forgot to put in a prologue for this patch.
> 
> Here it is.
> 
> ====
> 
> This patch changes the searching for a run queue by a waking RT task
> to try to pick another runqueue if the currently running task
> is an RT task.
> 
> The reason is that RT tasks behave differently than normal
> tasks. Preempting a normal task to run a RT task to keep
> its cache hot is fine, because the preempted non-RT task
> may wait on that same runqueue to run again unless the
> migration thread comes along and pulls it off.
> 
> RT tasks behave differently. If one is preempted, it makes
> an active effort to continue to run. So by having a high
> priority task preempt a lower priority RT task, that lower
> RT task will then quickly try to run on another runqueue.
> This will cause that lower RT task to replace its nice
> hot cache (and TLB) with a completely cold one. This is
> for the hope that the new high priority RT task will keep
>  its cache hot.
> 
> Remember that this high priority RT task was just woken up.
> So it may likely have been sleeping for several milliseconds,
> and will end up with a cold cache anyway. RT tasks run till
> they voluntarily stop, or are preempted by a higher priority
> task. This means that it is unlikely that the woken RT task
> will have a hot cache to wake up to. So pushing off a lower
> RT task is just killing its cache for no good reason.

You make some excellent points here.  Out of curiosity, have you tried a comparison to see if it helps?

-Greg



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 17/17] (Avoid overload)
  2007-11-17  6:33   ` [PATCH v3 17/17] (Avoid overload) Steven Rostedt
  2007-11-17 17:42     ` Gregory Haskins
@ 2007-11-17 17:46     ` Gregory Haskins
  2007-11-19 16:34       ` [PATCH] RT: restore the migratable conditional Gregory Haskins
  1 sibling, 1 reply; 28+ messages in thread
From: Gregory Haskins @ 2007-11-17 17:46 UTC (permalink / raw)
  To: Steven Rostedt, LKML; +Cc: Peter Zijlstra, Ingo Molnar, Christoph Lameter

>>> On Sat, Nov 17, 2007 at  1:33 AM, in message
<20071117063318.GA31442@goodmis.org>, Steven Rostedt <rostedt@goodmis.org>
wrote: 
 
> -	if ((p->prio >= rq->rt.highest_prio)
> -	    && (p->nr_cpus_allowed > 1)) {
> +	if (unlikely(rt_task(rq->curr))) {
>  		int cpu = find_lowest_rq(p);
>  
>  		return (cpu == -1) ? task_cpu(p) : cpu;

We probably should leave the "p->nr_cpus_allowed > 1" check in the conditional, since it doesn't make sense to search for another runqueue if the task can't go anywhere.
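
(For reference, a sketch of the combined check; this is essentially what the follow-up patch at the end of this thread, "RT: restore the migratable conditional", ends up doing. The rest of the function body is abridged and reconstructed here:)

static int select_task_rq_rt(struct task_struct *p, int sync)
{
	struct rq *rq = task_rq(p);

	/*
	 * Only go hunting for another runqueue when the currently
	 * running task is an RT task *and* the waking task is actually
	 * allowed to run somewhere else.
	 */
	if (unlikely(rt_task(rq->curr)) && p->nr_cpus_allowed > 1) {
		int cpu = find_lowest_rq(p);

		return (cpu == -1) ? task_cpu(p) : cpu;
	}

	return task_cpu(p);
}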

-Greg


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 10/17] Remove some CFS specific code from the wakeup path of RT tasks
  2007-11-17 17:35   ` Gregory Haskins
@ 2007-11-17 18:51     ` Steven Rostedt
  0 siblings, 0 replies; 28+ messages in thread
From: Steven Rostedt @ 2007-11-17 18:51 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Steven Rostedt, Christoph Lameter


On Sat, 17 Nov 2007, Gregory Haskins wrote:

> >>> On Sat, Nov 17, 2007 at  1:21 AM, in message
> <20071117062404.672976284@goodmis.org>, Steven Rostedt <rostedt@goodmis.org>
> wrote:
>
> > +	 */
> > +	if (idle_cpu(cpu) || cpu_rq(cpu)->nr_running > 1)
> > +		return cpu;
> > +
> > +	for_each_domain(cpu, sd) {
> > +		if (sd->flags & SD_WAKE_IDLE) {
> > +			cpus_and(tmp, sd->span, p->cpus_allowed);
> > +			for_each_cpu_mask(i, tmp) {
> > +				if (idle_cpu(i))
> > +					return i;
>                                                                         ^^^^^^^^^^^^^^^^
>
> Looks like some stuff that was added in 24 was inadvertently lost in the move when you merged the patches up from 23.1-rt11.  The attached patch is updated to move the new logic as well.
>

Doh!  Good catch. Will rework on Monday.

-- Steve
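
(For readers following the diffs: the hunk quoted above returns the first idle cpu unconditionally, while the wake_idle() posted earlier in this thread also bumps the per-task nr_wakeups_idle schedstat, which is presumably the 2.6.24 addition being referred to. A sketch of the inner loop with that accounting kept:)

	for_each_domain(cpu, sd) {
		if (sd->flags & SD_WAKE_IDLE) {
			cpus_and(tmp, sd->span, p->cpus_allowed);
			for_each_cpu_mask(i, tmp) {
				if (idle_cpu(i)) {
					/* only count wakeups that actually move to an idle cpu */
					if (i != task_cpu(p))
						schedstat_inc(p, se.nr_wakeups_idle);
					return i;
				}
			}
		} else
			break;
	}
	return cpu;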


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 16/17] Fix schedstat handling
  2007-11-17 17:40   ` Gregory Haskins
@ 2007-11-17 18:52     ` Steven Rostedt
  0 siblings, 0 replies; 28+ messages in thread
From: Steven Rostedt @ 2007-11-17 18:52 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Steven Rostedt, Christoph Lameter


On Sat, 17 Nov 2007, Gregory Haskins wrote:

> >>> On Sat, Nov 17, 2007 at  1:21 AM, in message
> <20071117062405.634159013@goodmis.org>, Steven Rostedt <rostedt@goodmis.org>
> wrote:
> > Gregory Haskins RT balancing broke sched domains.
>
> Doh! (though you mean s/domains/stats ;)

Heh, indeed.

>
> > This is a fix to allow it to still work.
> >
> > Signed-off-by: Steven Rostedt <srostedt@redhat.com>
> >
> > ---
> >  include/linux/sched.h   |    3 ++-
> >  kernel/sched.c          |   17 ++++++++++++++---
> >  kernel/sched_fair.c     |   19 ++++++++++++++-----
> >  kernel/sched_idletask.c |    3 ++-
> >  kernel/sched_rt.c       |    3 ++-
> >  5 files changed, 34 insertions(+), 11 deletions(-)
> >
> > Index: linux-compile.git/kernel/sched.c
> > ===================================================================
> > --- linux-compile.git.orig/kernel/sched.c	2007-11-17 00:15:57.000000000 -0500
> > +++ linux-compile.git/kernel/sched.c	2007-11-17 00:15:57.000000000 -0500
> > @@ -1453,6 +1453,7 @@ static int try_to_wake_up(struct task_st
> >  	unsigned long flags;
> >  	long old_state;
> >  	struct rq *rq;
> > +	struct sched_domain *this_sd = NULL;
> >  #ifdef CONFIG_SMP
> >  	int new_cpu;
> >  #endif
> > @@ -1476,10 +1477,20 @@ static int try_to_wake_up(struct task_st
> >  	schedstat_inc(rq, ttwu_count);
> >  	if (cpu == this_cpu)
> >  		schedstat_inc(rq, ttwu_local);
> > -	else
> > -		schedstat_inc(rq->sd, ttwu_wake_remote);
> > +	else {
> > +#ifdef CONFIG_SCHEDSTATS
> > +		struct sched_domain *sd;
> > +		for_each_domain(this_cpu, sd) {
> > +			if (cpu_isset(cpu, sd->span)) {
> > +				schedstat_inc(sd, ttwu_wake_remote);
> > +				this_sd = sd;
> > +				break;
> > +			}
> > +		}
> > +#endif /* CONFIG_SCHEDSTATES */
> > +	}
> >
> > -	new_cpu = p->sched_class->select_task_rq(p, sync);
> > +	new_cpu = p->sched_class->select_task_rq(p, this_sd, sync);
>
> I like this optimization, but I am thinking that the location of the stat update is now no longer relevant.  It should potentially go *after* the select_task_rq() so that we pick the sched_domain of the actual wake target, not the historical affinity.  If that is accurate, I'm sure you can finagle this optimization to work in that scenario too, but it will take a little re-work.

Yeah, I can take a deep look. This was written late at night, so I need to
spend more "awake" hours on it.

-- Steve


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 17/17] (Avoid overload)
  2007-11-17 17:42     ` Gregory Haskins
@ 2007-11-17 18:55       ` Steven Rostedt
  0 siblings, 0 replies; 28+ messages in thread
From: Steven Rostedt @ 2007-11-17 18:55 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: LKML, Peter Zijlstra, Ingo Molnar, Christoph Lameter


On Sat, 17 Nov 2007, Gregory Haskins wrote:

> >>> On Sat, Nov 17, 2007 at  1:33 AM, in message
> > This patch changes the searching for a run queue by a waking RT task
> > to try to pick another runqueue if the currently running task
> > is an RT task.
> >
> > The reason is that RT tasks behave differently than normal
> > tasks. Preempting a normal task to run a RT task to keep
> > its cache hot is fine, because the preempted non-RT task
> > may wait on that same runqueue to run again unless the
> > migration thread comes along and pulls it off.
> >
> > RT tasks behave differently. If one is preempted, it makes
> > an active effort to continue to run. So by having a high
> > priority task preempt a lower priority RT task, that lower
> > RT task will then quickly try to run on another runqueue.
> > This will cause that lower RT task to replace its nice
> > hot cache (and TLB) with a completely cold one. This is
> > for the hope that the new high priority RT task will keep
> >  its cache hot.
> >
> > Remember that this high priority RT task was just woken up.
> > So it may likely have been sleeping for several milliseconds,
> > and will end up with a cold cache anyway. RT tasks run till
> > they voluntarily stop, or are preempted by a higher priority
> > task. This means that it is unlikely that the woken RT task
> > will have a hot cache to wake up to. So pushing off a lower
> > RT task is just killing its cache for no good reason.
>
> You make some excellent points here.  Out of curiosity, have you tried a comparison to see if it helps?

hehe, I was waiting for the "where's the numbers". Right now I don't have
them. Mainly because I don't have boxes with more than 4 CPUs on them. And
4 really doesn't make much of a difference.

If others out there would like to test, I'll write up a couple of versions
of this code and that way we can really get numbers for them.

Or perhaps someone would like to send me a 16way box. Although my wife
would kill me. ;-)

Hmm, I'm sure I can get to our lab and run some tests too.

-- Steve


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH] RT: restore the migratable conditional
  2007-11-17 17:46     ` Gregory Haskins
@ 2007-11-19 16:34       ` Gregory Haskins
  0 siblings, 0 replies; 28+ messages in thread
From: Gregory Haskins @ 2007-11-19 16:34 UTC (permalink / raw)
  To: Steven Rostedt, LKML <linux-kernel
  Cc: Peter Zijlstra, Christoph Lameter, Ingo Molnar, Gregory Haskins

(This applies to the end of Steven's "v3" series
http://lkml.org/lkml/2007/11/17/10)

----------

We don't need to bother searching if the task cannot be migrated

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched_rt.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 9c640c8..0026f11 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -173,7 +173,8 @@ static int select_task_rq_rt(struct task_struct *p, int sync)
 	 * that is just being woken and probably will have
 	 * cold cache anyway.
 	 */
-	if (unlikely(rt_task(rq->curr))) {
+	if (unlikely(rt_task(rq->curr)) &&
+	    (p->nr_cpus_allowed > 1)) {
 		int cpu = find_lowest_rq(p);
 
 		return (cpu == -1) ? task_cpu(p) : cpu;


^ permalink raw reply related	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2007-11-19 16:53 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-11-17  6:21 [PATCH v3 00/17] New RT Task Balancing -v3 Steven Rostedt
2007-11-17  6:21 ` [PATCH v3 01/17] Add rt_nr_running accounting Steven Rostedt
2007-11-17  6:21 ` [PATCH v3 02/17] track highest prio queued on runqueue Steven Rostedt
2007-11-17  6:21 ` [PATCH v3 03/17] push RT tasks Steven Rostedt
2007-11-17  6:21 ` [PATCH v3 04/17] RT overloaded runqueues accounting Steven Rostedt
2007-11-17  6:21 ` [PATCH v3 05/17] pull RT tasks Steven Rostedt
2007-11-17  6:21 ` [PATCH v3 06/17] wake up balance RT Steven Rostedt
2007-11-17  6:21 ` [PATCH v3 07/17] disable CFS RT load balancing Steven Rostedt
2007-11-17  6:21 ` [PATCH v3 08/17] Cache cpus_allowed weight for optimizing migration Steven Rostedt
2007-11-17  6:21 ` [PATCH v3 09/17] Consistency cleanup for this_rq usage Steven Rostedt
2007-11-17  6:21 ` [PATCH v3 10/17] Remove some CFS specific code from the wakeup path of RT tasks Steven Rostedt
2007-11-17 17:35   ` Gregory Haskins
2007-11-17 18:51     ` Steven Rostedt
2007-11-17  6:21 ` [PATCH v3 11/17] RT: Break out the search function Steven Rostedt
2007-11-17  6:21 ` [PATCH v3 12/17] Allow current_cpu to be included in search Steven Rostedt
2007-11-17  6:21 ` [PATCH v3 13/17] RT: Pre-route RT tasks on wakeup Steven Rostedt
2007-11-17  6:21 ` [PATCH v3 14/17] Optimize our cpu selection based on topology Steven Rostedt
2007-11-17  6:21 ` [PATCH v3 15/17] RT: Optimize rebalancing Steven Rostedt
2007-11-17  6:21 ` [PATCH v3 16/17] Fix schedstat handling Steven Rostedt
2007-11-17 17:40   ` Gregory Haskins
2007-11-17 18:52     ` Steven Rostedt
2007-11-17  6:21 ` [PATCH v3 17/17] --- kernel/sched_rt.c | 20 ++++++++++++++++---- 1 file changed, 16 insertions(+), 4 deletions(-) Steven Rostedt
2007-11-17  6:33   ` [PATCH v3 17/17] (Avoid overload) Steven Rostedt
2007-11-17 17:42     ` Gregory Haskins
2007-11-17 18:55       ` Steven Rostedt
2007-11-17 17:46     ` Gregory Haskins
2007-11-19 16:34       ` [PATCH] RT: restore the migratable conditional Gregory Haskins
2007-11-17  8:14 ` [PATCH v3 00/17] New RT Task Balancing -v3 Jon Masters
