* [PATCH 0/4] [GIT PULL] sched: Migrate higher priority tasks less
@ 2010-09-21  2:40 Steven Rostedt
  2010-09-21  2:40 ` [PATCH 1/4] sched: Try not to migrate higher priority RT tasks Steven Rostedt
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Steven Rostedt @ 2010-09-21  2:40 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Gregory Haskins


Ingo,

Please pull the latest tip/sched/rt tree, which can be found at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git tip/sched/rt


Steven Rostedt (4):
      sched: Try not to migrate higher priority RT tasks
      sched: Give CPU bound RT tasks preference
      tracing/sched: Add sched_rt_push and sched_rt_pull tracepoints
      tracing/sched: Add sched_rt_setprio tracepoint

----
 include/trace/events/sched.h |   61 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched.c               |    1 +
 kernel/sched_rt.c            |   28 +++++++++++-------
 3 files changed, 79 insertions(+), 11 deletions(-)


* [PATCH 1/4] sched: Try not to migrate higher priority RT tasks
  2010-09-21  2:40 [PATCH 0/4] [GIT PULL] sched: Migrate higher priority tasks less Steven Rostedt
@ 2010-09-21  2:40 ` Steven Rostedt
  2010-09-21 14:14   ` [tip:sched/core] " tip-bot for Steven Rostedt
  2010-09-21  2:40 ` [PATCH 2/4] sched: Give CPU bound RT tasks preference Steven Rostedt
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Steven Rostedt @ 2010-09-21  2:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Gregory Haskins

[-- Attachment #1: 0001-sched-Try-not-to-migrate-higher-priority-RT-tasks.patch --]
[-- Type: text/plain, Size: 9031 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

When first working on the RT scheduler design, we concentrated on
keeping all CPUs running RT tasks instead of having multiple RT
tasks on a single CPU waiting for the migration thread to move them.
Instead we take a more proactive stance and push or pull RT tasks
from one CPU to another on wakeup or scheduling.

When an RT task wakes up on a CPU that is running another RT task,
instead of preempting it and killing the cache of the running RT task,
we look to see if we can migrate the RT task that is waking up,
even if the RT task waking up is of higher priority.

This may sound a bit odd, but RT tasks should be limited in migration
by the user anyway. But in practice, people do not do this, which
causes high prio RT tasks to bounce around the CPUs. This becomes
even worse when we have priority inheritance, because a high prio task
can block on a lower prio task and boost its priority. When
the lower prio task wakes up the high prio task, if it happens to
be on the same CPU it will migrate off of it.

But in reality, the above does not happen much either: when the lower
prio task, which has already been boosted, wakes up on the same CPU as
the higher prio task, it would itself be migrated off of that CPU
first. Either way, we do not want these migrations.

To examine the scheduling, I created a test program and examined it
under kernelshark. The test program created CPU * 2 threads, where
each thread had a different priority. The program takes several options;
the one used for this change log selects whether or not the threads use
priority inheritance mutexes.
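
(The test source itself is not included in this message; as a rough
sketch, creating nr_tasks threads with increasing SCHED_FIFO priority
might look like the following, where start_task is the thread function
shown below and the base priority of 1 is an assumption.)

#include <pthread.h>
#include <sched.h>

extern void *start_task(void *id);

static void create_tasks(long nr_tasks)
{
	pthread_attr_t attr;
	struct sched_param param;
	pthread_t thread;
	long i;

	for (i = 0; i < nr_tasks; i++) {
		pthread_attr_init(&attr);
		/* do not inherit the (non-RT) creator's policy */
		pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
		pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
		param.sched_priority = 1 + i;	/* higher id => higher prio */
		pthread_attr_setschedparam(&attr, &param);
		pthread_create(&thread, &attr, start_task, (void *)i);
		pthread_attr_destroy(&attr);
	}
}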

All threads did the following loop:

static void grab_lock(long id, int iter, int l)
{
	ftrace_write("thread %ld iter %d, taking lock %d\n",
		     id, iter, l);
	pthread_mutex_lock(&locks[l]);
	ftrace_write("thread %ld iter %d, took lock %d\n",
		     id, iter, l);
	busy_loop(nr_tasks - id);
	ftrace_write("thread %ld iter %d, unlock lock %d\n",
		     id, iter, l);
	pthread_mutex_unlock(&locks[l]);
}

void *start_task(void *id)
{
	[...]
	while (!done) {
		for (l = 0; l < nr_locks; l++) {
			grab_lock(id, i, l);
			ftrace_write("thread %ld iter %d sleeping\n",
				     id, i);
			ms_sleep(id);
		}
		i++;
	}
	[...]
}

The busy_loop(ms) keeps the CPU spinning for ms milliseconds.
The ms_sleep(ms) sleeps for ms milliseconds.
The ftrace_write() writes to the ftrace buffer to help analyze via ftrace.
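
(None of these helpers are shown in this message; a minimal sketch,
assuming debugfs is mounted at /sys/kernel/debug, could be:)

#include <stdarg.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

/* Spin the CPU for roughly ms milliseconds */
static void busy_loop(long ms)
{
	struct timespec start, now;

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while ((now.tv_sec - start.tv_sec) * 1000 +
		 (now.tv_nsec - start.tv_nsec) / 1000000 < ms);
}

/* Sleep for ms milliseconds */
static void ms_sleep(long ms)
{
	struct timespec ts = {
		.tv_sec  = ms / 1000,
		.tv_nsec = (ms % 1000) * 1000000,
	};

	nanosleep(&ts, NULL);
}

/* Write a formatted string into the ftrace buffer via trace_marker */
static void ftrace_write(const char *fmt, ...)
{
	static int fd = -1;
	char buf[256];
	va_list ap;
	int n;

	if (fd < 0)
		fd = open("/sys/kernel/debug/tracing/trace_marker", O_WRONLY);
	va_start(ap, fmt);
	n = vsnprintf(buf, sizeof(buf), fmt, ap);
	va_end(ap);
	if (fd >= 0)
		write(fd, buf, n);
}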

The higher the id, the higher the prio and the shorter the busy loop,
but the longer the sleep between iterations. This is usually the case
with RT tasks: the lower priority tasks usually run longer than higher
priority tasks.

At the end of the test, it records the number of loops each thread took,
as well as the number of voluntary preemptions, non-voluntary preemptions,
and number of migrations each thread took, taking the information from
/proc/$$/sched and /proc/$$/status.
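
(As an illustration only: a counter can be scraped from /proc/<pid>/sched
with something like the sketch below. The exact field names, e.g.
"nr_voluntary_switches" or "se.nr_migrations", vary between kernel
versions, so treat them as assumptions.)

#include <stdio.h>
#include <string.h>

static long read_sched_stat(int pid, const char *key)
{
	char path[64], line[256], *colon;
	long val = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/sched", pid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (strncmp(line, key, strlen(key)))
			continue;
		colon = strchr(line, ':');
		if (colon)
			sscanf(colon + 1, "%ld", &val);
		break;
	}
	fclose(f);
	return val;
}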

Running this on a 4 CPU processor, the results without changes to the
kernel looked like this:

Task        vol    nonvol   migrated     iterations
----        ---    ------   --------     ----------
  0:         53      3220       1470             98
  1:        562       773        724             98
  2:        752       933       1375             98
  3:        749        39        697             98
  4:        758         5        515             98
  5:        764         2        679             99
  6:        761         2        535             99
  7:        757         3        346             99

total:     5156       4977      6341            787

Each thread, regardless of priority, migrated a few hundred times.
The higher priority tasks fared a little better, but still took
quite an impact.

By letting higher priority tasks bump the lower prio task from the CPU,
things changed a bit:

Task        vol    nonvol   migrated     iterations
----        ---    ------   --------     ----------
  0:         37      2835       1937             98
  1:        666      1821       1865             98
  2:        654      1003       1385             98
  3:        664       635        973             99
  4:        698       197        352             99
  5:        703       101        159             99
  6:        708         1         75             99
  7:        713         1          2             99

total:     4843       6594      6748            789

The total # of migrations did not change (several runs showed the
difference to be all within the noise), but we now see a dramatic
improvement for the higher priority tasks. (kernelshark showed that
the watchdog timer bumped the highest priority task, giving it the
2 count; this was consistent across every run.)

Notice that the # of iterations did not change either.

The above was with priority inheritance mutexes. That is, when the
higher priority task blocked on a lower priority task, the lower
priority task would inherit the higher priority task's priority
(which shows why task 6 was bumped so many times). When not using
priority inheritance mutexes, the current kernel shows this:

Task        vol    nonvol   migrated     iterations
----        ---    ------   --------     ----------
  0:         56      3101       1892             95
  1:        594       713        937             95
  2:        625       188        618             95
  3:        628         4        491             96
  4:        640         7        468             96
  5:        631         2        501             96
  6:        641         1        466             96
  7:        643         2        497             96

total:     4458       4018      5870            765

Not much changed with or without priority inheritance mutexes. But
if we let the high priority task bump lower priority tasks on wakeup
we see:

Task        vol    nonvol   migrated     iterations
----        ---    ------   --------     ----------
  0:        115      3439       2782             98
  1:        633      1354       1583             99
  2:        652       919       1218             99
  3:        645       713        934             99
  4:        690         3          3             99
  5:        694         1          4             99
  6:        720         3          4             99
  7:        747         0          1            100

This shows an even bigger change. The big difference between task 3 and
task 4 comes from having only 4 CPUs on the machine, which causes the 4
highest prio tasks to always have preference.

Although I did not measure cache misses (and I'm sure there would be
little to measure, since the test was not data intensive), I can imagine
large improvements for higher priority tasks when they share CPUs with
lower priority tasks. Thus, I'm satisfied with making this change,
agreeing with what Gregory Haskins argued a few years ago when we first
had this discussion.

One final note: all tasks in the above tests were RT tasks. Any RT
task will always preempt a non-RT task running on the CPU the RT task
wants to run on.

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/sched_rt.c |   22 ++++++++++++----------
 1 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index d10c80e..6a02b38 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -960,18 +960,18 @@ select_task_rq_rt(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
 	 * runqueue. Otherwise simply start this RT task
 	 * on its current runqueue.
 	 *
-	 * We want to avoid overloading runqueues. Even if
-	 * the RT task is of higher priority than the current RT task.
-	 * RT tasks behave differently than other tasks. If
-	 * one gets preempted, we try to push it off to another queue.
-	 * So trying to keep a preempting RT task on the same
-	 * cache hot CPU will force the running RT task to
-	 * a cold CPU. So we waste all the cache for the lower
-	 * RT task in hopes of saving some of a RT task
-	 * that is just being woken and probably will have
-	 * cold cache anyway.
+	 * We want to avoid overloading runqueues. If the woken
+	 * task is a higher priority, then it will stay on this CPU
+	 * and the lower prio task should be moved to another CPU.
+	 * Even though this will probably make the lower prio task
+	 * lose its cache, we do not want to bounce a higher task
+	 * around just because it gave up its CPU, perhaps for a
+	 * lock?
+	 *
+	 * For equal prio tasks, we just let the scheduler sort it out.
 	 */
 	if (unlikely(rt_task(rq->curr)) &&
+	    rq->curr->prio < p->prio &&
 	    (p->rt.nr_cpus_allowed > 1)) {
 		int cpu = find_lowest_rq(p);
 
@@ -1491,6 +1491,8 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
 	if (!task_running(rq, p) &&
 	    !test_tsk_need_resched(rq->curr) &&
 	    has_pushable_tasks(rq) &&
+	    rt_task(rq->curr) &&
+	    rq->curr->prio < p->prio &&
 	    p->rt.nr_cpus_allowed > 1)
 		push_rt_tasks(rq);
 }
-- 
1.7.1




* [PATCH 2/4] sched: Give CPU bound RT tasks preference
  2010-09-21  2:40 [PATCH 0/4] [GIT PULL] sched: Migrate higher priority tasks less Steven Rostedt
  2010-09-21  2:40 ` [PATCH 1/4] sched: Try not to migrate higher priority RT tasks Steven Rostedt
@ 2010-09-21  2:40 ` Steven Rostedt
  2010-09-21 14:14   ` [tip:sched/core] " tip-bot for Steven Rostedt
  2010-09-21  2:40 ` [PATCH 3/4] tracing/sched: Add sched_rt_push and sched_rt_pull tracepoints Steven Rostedt
  2010-09-21  2:40 ` [PATCH 4/4] tracing/sched: Add sched_rt_setprio tracepoint Steven Rostedt
  3 siblings, 1 reply; 14+ messages in thread
From: Steven Rostedt @ 2010-09-21  2:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Gregory Haskins

[-- Attachment #1: 0002-sched-Give-CPU-bound-RT-tasks-preference.patch --]
[-- Type: text/plain, Size: 1528 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

If a high priority RT task is waking up on a CPU that is running a
lower priority task bound to that CPU, see if we can move the high
priority task to another CPU first. Note that if all other CPUs are
running tasks of higher priority than the CPU-bound current task, it
will be preempted regardless.
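
(Restating the combined condition from this patch as a standalone
predicate, purely as a sketch and not a function that exists in the
kernel; remember that lower ->prio values mean higher priority.)

static int try_to_find_another_cpu(struct task_struct *curr,
				   struct task_struct *p)
{
	return rt_task(curr) &&
	       /* curr cannot move away, or outranks the waking task */
	       (curr->rt.nr_cpus_allowed < 2 || curr->prio < p->prio) &&
	       /* ...and the waking task itself is allowed to move */
	       p->rt.nr_cpus_allowed > 1;
}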

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/sched_rt.c |    8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 6a02b38..baef30f 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -971,7 +971,8 @@ select_task_rq_rt(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
 	 * For equal prio tasks, we just let the scheduler sort it out.
 	 */
 	if (unlikely(rt_task(rq->curr)) &&
-	    rq->curr->prio < p->prio &&
+	    (rq->curr->rt.nr_cpus_allowed < 2 ||
+	     rq->curr->prio < p->prio) &&
 	    (p->rt.nr_cpus_allowed > 1)) {
 		int cpu = find_lowest_rq(p);
 
@@ -1491,9 +1492,10 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
 	if (!task_running(rq, p) &&
 	    !test_tsk_need_resched(rq->curr) &&
 	    has_pushable_tasks(rq) &&
+	    p->rt.nr_cpus_allowed > 1 &&
 	    rt_task(rq->curr) &&
-	    rq->curr->prio < p->prio &&
-	    p->rt.nr_cpus_allowed > 1)
+	    (rq->curr->rt.nr_cpus_allowed < 2 ||
+	     rq->curr->prio < p->prio))
 		push_rt_tasks(rq);
 }
 
-- 
1.7.1




* [PATCH 3/4] tracing/sched: Add sched_rt_push and sched_rt_pull tracepoints
  2010-09-21  2:40 [PATCH 0/4] [GIT PULL] sched: Migrate higher priority tasks less Steven Rostedt
  2010-09-21  2:40 ` [PATCH 1/4] sched: Try not to migrate higher priority RT tasks Steven Rostedt
  2010-09-21  2:40 ` [PATCH 2/4] sched: Give CPU bound RT tasks preference Steven Rostedt
@ 2010-09-21  2:40 ` Steven Rostedt
  2010-09-21  8:46   ` Peter Zijlstra
  2010-09-21  2:40 ` [PATCH 4/4] tracing/sched: Add sched_rt_setprio tracepoint Steven Rostedt
  3 siblings, 1 reply; 14+ messages in thread
From: Steven Rostedt @ 2010-09-21  2:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Gregory Haskins

[-- Attachment #1: 0003-tracing-sched-Add-sched_rt_push-and-sched_rt_pull-tr.patch --]
[-- Type: text/plain, Size: 2494 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

Seeing how tasks are affected by the push/pull algorithm of the RT
scheduler helps in understanding the migration of RT tasks.

Adding these two tracepoints, which show when an RT task is pushed
from one CPU to another or pulled from one, helps with analyzing the
way the scheduler works.
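
(Once in place, a test harness could enable the new events like any
other sched tracepoint. A sketch in C, assuming debugfs is mounted at
/sys/kernel/debug:)

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

static void enable_sched_event(const char *name)
{
	char path[128];
	int fd;

	snprintf(path, sizeof(path),
		 "/sys/kernel/debug/tracing/events/sched/%s/enable", name);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return;
	write(fd, "1", 1);
	close(fd);
}

/* e.g. enable_sched_event("sched_rt_push");
 *      enable_sched_event("sched_rt_pull"); */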

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/trace/events/sched.h |   36 ++++++++++++++++++++++++++++++++++++
 kernel/sched_rt.c            |    2 ++
 2 files changed, 38 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 9208c92..0e0c108 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -362,6 +362,42 @@ TRACE_EVENT(sched_stat_runtime,
 			(unsigned long long)__entry->vruntime)
 );
 
+DECLARE_EVENT_CLASS(sched_rt_push_pull,
+
+	TP_PROTO(struct task_struct *tsk, int cpu),
+
+	TP_ARGS(tsk, cpu),
+
+	TP_STRUCT__entry(
+		__array( char,	comm,	TASK_COMM_LEN	)
+		__field( pid_t,	pid			)
+		__field( int,	prio			)
+		__field( int,   cpu			)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+		__entry->pid		= tsk->pid;
+		__entry->prio		= tsk->prio;
+		__entry->cpu		= cpu;
+	),
+
+	TP_printk("comm=%s pid=%d prio=%d dest_cpu=%d",
+			__entry->comm, __entry->pid,
+			__entry->prio, __entry->cpu)
+);
+
+DEFINE_EVENT(sched_rt_push_pull, sched_rt_push,
+	TP_PROTO(struct task_struct *tsk, int cpu),
+	TP_ARGS(tsk, cpu));
+
+DEFINE_EVENT_PRINT(sched_rt_push_pull, sched_rt_pull,
+	TP_PROTO(struct task_struct *tsk, int cpu),
+	TP_ARGS(tsk, cpu),
+	TP_printk("comm=%s pid=%d prio=%d src_cpu=%d",
+			__entry->comm, __entry->pid,
+			__entry->prio, __entry->cpu));
+
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index baef30f..3aed9b0 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1372,6 +1372,7 @@ static int push_rt_task(struct rq *rq)
 	}
 
 	deactivate_task(rq, next_task, 0);
+	trace_sched_rt_push(next_task, lowest_rq->cpu);
 	set_task_cpu(next_task, lowest_rq->cpu);
 	activate_task(lowest_rq, next_task, 0);
 
@@ -1455,6 +1456,7 @@ static int pull_rt_task(struct rq *this_rq)
 			ret = 1;
 
 			deactivate_task(src_rq, p, 0);
+			trace_sched_rt_pull(p, src_rq->cpu);
 			set_task_cpu(p, this_cpu);
 			activate_task(this_rq, p, 0);
 			/*
-- 
1.7.1




* [PATCH 4/4] tracing/sched: Add sched_rt_setprio tracepoint
  2010-09-21  2:40 [PATCH 0/4] [GIT PULL] sched: Migrate higher priority tasks less Steven Rostedt
                   ` (2 preceding siblings ...)
  2010-09-21  2:40 ` [PATCH 3/4] tracing/sched: Add sched_rt_push and sched_rt_pull tracepoints Steven Rostedt
@ 2010-09-21  2:40 ` Steven Rostedt
  2010-09-21 11:14   ` Peter Zijlstra
  3 siblings, 1 reply; 14+ messages in thread
From: Steven Rostedt @ 2010-09-21  2:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Gregory Haskins

[-- Attachment #1: 0004-tracing-sched-Add-sched_rt_setprio-tracepoint.patch --]
[-- Type: text/plain, Size: 1742 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

Add a tracepoint that shows the priority of a task being boosted
via priority inheritance.
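
(For context, the usual way userspace triggers this boost path is a
PI futex, e.g. a pthread mutex using the PTHREAD_PRIO_INHERIT
protocol. A sketch:)

#include <pthread.h>

static pthread_mutex_t lock;

/* When a higher prio task blocks on this mutex, the kernel boosts
 * the owner via rt_mutex_setprio(), which is what the new
 * tracepoint records. */
static int init_pi_mutex(void)
{
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
	return pthread_mutex_init(&lock, &attr);
}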

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/trace/events/sched.h |   25 +++++++++++++++++++++++++
 kernel/sched.c               |    1 +
 2 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 0e0c108..8feb641 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -398,6 +398,31 @@ DEFINE_EVENT_PRINT(sched_rt_push_pull, sched_rt_pull,
 			__entry->comm, __entry->pid,
 			__entry->prio, __entry->cpu));
 
+TRACE_EVENT(sched_rt_setprio,
+
+	TP_PROTO(struct task_struct *tsk, int newprio),
+
+	TP_ARGS(tsk, newprio),
+
+	TP_STRUCT__entry(
+		__array( char,	comm,	TASK_COMM_LEN	)
+		__field( pid_t,	pid			)
+		__field( int,	oldprio			)
+		__field( int,	newprio			)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+		__entry->pid		= tsk->pid;
+		__entry->oldprio	= tsk->prio;
+		__entry->newprio	= newprio;
+	),
+
+	TP_printk("comm=%s pid=%d oldprio=%d newprio=%d",
+			__entry->comm, __entry->pid,
+			__entry->oldprio, __entry->newprio)
+);
+
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched.c b/kernel/sched.c
index ed09d4f..4d4f35e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4358,6 +4358,7 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 
 	rq = task_rq_lock(p, &flags);
 
+	trace_sched_rt_setprio(p, prio);
 	oldprio = p->prio;
 	prev_class = p->sched_class;
 	on_rq = p->se.on_rq;
-- 
1.7.1




* Re: [PATCH 3/4] tracing/sched: Add sched_rt_push and sched_rt_pull tracepoints
  2010-09-21  2:40 ` [PATCH 3/4] tracing/sched: Add sched_rt_push and sched_rt_pull tracepoints Steven Rostedt
@ 2010-09-21  8:46   ` Peter Zijlstra
  2010-09-21 12:33     ` Steven Rostedt
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2010-09-21  8:46 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: linux-kernel, Ingo Molnar, Andrew Morton, Gregory Haskins

On Mon, 2010-09-20 at 22:40 -0400, Steven Rostedt wrote:
> From: Steven Rostedt <srostedt@redhat.com>
> 
> Seeing how tasks are affected by the push/pull algorithm of the RT
> scheduler helps in understanding the migration of RT tasks.
> 
> Adding these two tracepoints, which show when an RT task is pushed
> from one CPU to another or pulled from one, helps with analyzing the
> way the scheduler works.

Why doesn't the migration tracepoint cover this? It shows you the task,
the prio, and the old and new cpu. If the migration is logged from the
old cpu, it's a push; if it's logged from the new one, it's a pull, no?


* Re: [PATCH 4/4] tracing/sched: Add sched_rt_setprio tracepoint
  2010-09-21  2:40 ` [PATCH 4/4] tracing/sched: Add sched_rt_setprio tracepoint Steven Rostedt
@ 2010-09-21 11:14   ` Peter Zijlstra
  2010-09-21 12:35     ` Steven Rostedt
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2010-09-21 11:14 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: linux-kernel, Ingo Molnar, Andrew Morton, Gregory Haskins

On Mon, 2010-09-20 at 22:40 -0400, Steven Rostedt wrote:
> plain text document attachment
> (0004-tracing-sched-Add-sched_rt_setprio-tracepoint.patch)
> From: Steven Rostedt <srostedt@redhat.com>
> 
> Add a tracepoint that shows the priority of a task being boosted
> via priority inheritance.

I'm not too sure about having this tracepoint.. Ideally I'd like to
rewrite the whole PI crap to not actually need anything like this.

Also, sched_pi_setprio() would be a better name I think.




* Re: [PATCH 3/4] tracing/sched: Add sched_rt_push and sched_rt_pull tracepoints
  2010-09-21  8:46   ` Peter Zijlstra
@ 2010-09-21 12:33     ` Steven Rostedt
  2010-09-21 13:36       ` Steven Rostedt
  0 siblings, 1 reply; 14+ messages in thread
From: Steven Rostedt @ 2010-09-21 12:33 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Andrew Morton, Gregory Haskins

On Tue, 2010-09-21 at 10:46 +0200, Peter Zijlstra wrote:
> On Mon, 2010-09-20 at 22:40 -0400, Steven Rostedt wrote:
> > From: Steven Rostedt <srostedt@redhat.com>
> > 
> > Seeing how tasks are affected by the push/pull algorithm of the RT
> > scheduler helps in understanding the migration of RT tasks.
> > 
> > Adding these two tracepoints, which show when an RT task is pushed
> > from one CPU to another or pulled from one, helps with analyzing the
> > way the scheduler works.
> 
> Why doesn't the migration tracepoint cover this? It shows you the task,
> the prio, and the old and new cpu. If the migration is logged from the
> old cpu, it's a push; if it's logged from the new one, it's a pull, no?

For some reason it did not help me enough in my analysis. I'll go back
and see why.

I originally had these tracepoints at the front of the series, and
realized they may be a little controversial, thus I moved them to the
end.

-- Steve




* Re: [PATCH 4/4] tracing/sched: Add sched_rt_setprio tracepoint
  2010-09-21 11:14   ` Peter Zijlstra
@ 2010-09-21 12:35     ` Steven Rostedt
  0 siblings, 0 replies; 14+ messages in thread
From: Steven Rostedt @ 2010-09-21 12:35 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Andrew Morton, Gregory Haskins

On Tue, 2010-09-21 at 13:14 +0200, Peter Zijlstra wrote:
> On Mon, 2010-09-20 at 22:40 -0400, Steven Rostedt wrote:
> > plain text document attachment
> > (0004-tracing-sched-Add-sched_rt_setprio-tracepoint.patch)
> > From: Steven Rostedt <srostedt@redhat.com>
> > 
> > Add a tracepoint that shows the priority of a task being boosted
> > via priority inheritance.
> 
> I'm not too sure about having this tracepoint.. Ideally I'd like to
> rewrite the whole PI crap to not actually need anything like this.

Yeah, I know we want to change all that. But until we have new code, we
could have this tracepoint to even help with modifying it.

Tracepoints show internal functioning of the kernel and are always
subject to change. Until we have a TRACE_EVENT_ABI(), these can come and
go with each version. I think this tracepoint is valuable.

> 
> Also, sched_pi_setprio() would be a better name I think.
> 

I'll update it for the above, thanks!

-- Steve




* Re: [PATCH 3/4] tracing/sched: Add sched_rt_push and sched_rt_pull tracepoints
  2010-09-21 12:33     ` Steven Rostedt
@ 2010-09-21 13:36       ` Steven Rostedt
  2010-09-21 19:26         ` Frank Rowand
  0 siblings, 1 reply; 14+ messages in thread
From: Steven Rostedt @ 2010-09-21 13:36 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Andrew Morton, Gregory Haskins

On Tue, 2010-09-21 at 08:33 -0400, Steven Rostedt wrote:

> > Why doesn't the migration tracepoint cover this? It shows you the task,
> > the prio, and the old and new cpu. If the migration is logged from the
> > old cpu, it's a push; if it's logged from the new one, it's a pull, no?
> 
> For some reason it did not help me enough in my analysis. I'll go back
> and see why.
> 
> I originally had these tracepoints at the front of the series, and
> realized they may be a little controversial, thus I moved them to the
> end.

OK, I went back and looked, and I think you are right. I can determine
this from the migration tracepoint. I think I was just being lazy and
wanted to have the direction obvious to me :-)  I can easily do that by
adding a plugin.

Anyway, I'll remove this patch and rename the sched_rt_setprio
tracepoint to sched_pi_setprio. Would that be OK with you?

Thanks,

-- Steve




* [tip:sched/core] sched: Try not to migrate higher priority RT tasks
  2010-09-21  2:40 ` [PATCH 1/4] sched: Try not to migrate higher priority RT tasks Steven Rostedt
@ 2010-09-21 14:14   ` tip-bot for Steven Rostedt
  0 siblings, 0 replies; 14+ messages in thread
From: tip-bot for Steven Rostedt @ 2010-09-21 14:14 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, rostedt, srostedt, tglx,
	ghaskins

Commit-ID:  43fa5460fe60dea5c610490a1d263415419c60f6
Gitweb:     http://git.kernel.org/tip/43fa5460fe60dea5c610490a1d263415419c60f6
Author:     Steven Rostedt <srostedt@redhat.com>
AuthorDate: Mon, 20 Sep 2010 22:40:03 -0400
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Tue, 21 Sep 2010 13:57:12 +0200

sched: Try not to migrate higher priority RT tasks

When first working on the RT scheduler design, we concentrated on
keeping all CPUs running RT tasks instead of having multiple RT
tasks on a single CPU waiting for the migration thread to move
them. Instead we take a more proactive stance and push or pull RT
tasks from one CPU to another on wakeup or scheduling.

When an RT task wakes up on a CPU that is running another RT task,
instead of preempting it and killing the cache of the running RT
task, we look to see if we can migrate the RT task that is waking
up, even if the RT task waking up is of higher priority.

This may sound a bit odd, but RT tasks should be limited in
migration by the user anyway. But in practice, people do not do
this, which causes high prio RT tasks to bounce around the CPUs.
This becomes even worse when we have priority inheritance, because
a high prio task can block on a lower prio task and boost its
priority. When the lower prio task wakes up the high prio task, if
it happens to be on the same CPU it will migrate off of it.

But in reality, the above does not happen much either: when the
lower prio task, which has already been boosted, wakes up on the
same CPU as the higher prio task, it would itself be migrated off
of that CPU first. Either way, we do not want these migrations.

To examine the scheduling, I created a test program and examined it
under kernelshark. The test program created CPU * 2 threads, where
each thread had a different priority. The program takes several
options; the one used for this change log selects whether or not
the threads use priority inheritance mutexes.

All threads did the following loop:

static void grab_lock(long id, int iter, int l)
{
	ftrace_write("thread %ld iter %d, taking lock %d\n",
		     id, iter, l);
	pthread_mutex_lock(&locks[l]);
	ftrace_write("thread %ld iter %d, took lock %d\n",
		     id, iter, l);
	busy_loop(nr_tasks - id);
	ftrace_write("thread %ld iter %d, unlock lock %d\n",
		     id, iter, l);
	pthread_mutex_unlock(&locks[l]);
}

void *start_task(void *id)
{
	[...]
	while (!done) {
		for (l = 0; l < nr_locks; l++) {
			grab_lock(id, i, l);
			ftrace_write("thread %ld iter %d sleeping\n",
				     id, i);
			ms_sleep(id);
		}
		i++;
	}
	[...]
}

The busy_loop(ms) keeps the CPU spinning for ms milliseconds. The
ms_sleep(ms) sleeps for ms milliseconds. The ftrace_write() writes
to the ftrace buffer to help analyze via ftrace.

The higher the id, the higher the prio and the shorter the busy
loop, but the longer the sleep between iterations. This is usually
the case with RT tasks: the lower priority tasks usually run longer
than higher priority tasks.

At the end of the test, it records the number of loops each thread
took, as well as the number of voluntary preemptions, non-voluntary
preemptions, and number of migrations each thread took, taking the
information from /proc/$$/sched and /proc/$$/status.

Running this on a 4 CPU processor, the results without changes to
the kernel looked like this:

Task        vol    nonvol   migrated     iterations
----        ---    ------   --------     ----------
  0:         53      3220       1470             98
  1:        562       773        724             98
  2:        752       933       1375             98
  3:        749        39        697             98
  4:        758         5        515             98
  5:        764         2        679             99
  6:        761         2        535             99
  7:        757         3        346             99

total:     5156       4977      6341            787

Each thread, regardless of priority, migrated a few hundred times.
The higher priority tasks fared a little better, but still took
quite an impact.

By letting higher priority tasks bump the lower prio task from the
CPU, things changed a bit:

Task        vol    nonvol   migrated     iterations
----        ---    ------   --------     ----------
  0:         37      2835       1937             98
  1:        666      1821       1865             98
  2:        654      1003       1385             98
  3:        664       635        973             99
  4:        698       197        352             99
  5:        703       101        159             99
  6:        708         1         75             99
  7:        713         1          2             99

total:     4843       6594      6748            789

The total # of migrations did not change (several runs showed the
difference to be all within the noise), but we now see a dramatic
improvement for the higher priority tasks. (kernelshark showed that
the watchdog timer bumped the highest priority task, giving it the
2 count; this was consistent across every run.)

Notice that the # of iterations did not change either.

The above was with priority inheritance mutexes. That is, when the
higher priority task blocked on a lower priority task, the lower
priority task would inherit the higher priority task's priority
(which shows why task 6 was bumped so many times). When not using
priority inheritance mutexes, the current kernel shows this:

Task        vol    nonvol   migrated     iterations
----        ---    ------   --------     ----------
  0:         56      3101       1892             95
  1:        594       713        937             95
  2:        625       188        618             95
  3:        628         4        491             96
  4:        640         7        468             96
  5:        631         2        501             96
  6:        641         1        466             96
  7:        643         2        497             96

total:     4458       4018      5870            765

Not much changed with or without priority inheritance mutexes. But
if we let the high priority task bump lower priority tasks on
wakeup we see:

Task        vol    nonvol   migrated     iterations
----        ---    ------   --------     ----------
  0:        115      3439       2782             98
  1:        633      1354       1583             99
  2:        652       919       1218             99
  3:        645       713        934             99
  4:        690         3          3             99
  5:        694         1          4             99
  6:        720         3          4             99
  7:        747         0          1            100

This shows an even bigger change. The big difference between task 3
and task 4 comes from having only 4 CPUs on the machine, which
causes the 4 highest prio tasks to always have preference.

Although I did not measure cache misses (and I'm sure there would
be little to measure, since the test was not data intensive), I can
imagine large improvements for higher priority tasks when they
share CPUs with lower priority tasks. Thus, I'm satisfied with
making this change, agreeing with what Gregory Haskins argued a few
years ago when we first had this discussion.

One final note: all tasks in the above tests were RT tasks. Any RT
task will always preempt a non-RT task running on the CPU the RT
task wants to run on.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Gregory Haskins <ghaskins@novell.com>
LKML-Reference: <20100921024138.605460343@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched_rt.c |   22 ++++++++++++----------
 1 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index d10c80e..6a02b38 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -960,18 +960,18 @@ select_task_rq_rt(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
 	 * runqueue. Otherwise simply start this RT task
 	 * on its current runqueue.
 	 *
-	 * We want to avoid overloading runqueues. Even if
-	 * the RT task is of higher priority than the current RT task.
-	 * RT tasks behave differently than other tasks. If
-	 * one gets preempted, we try to push it off to another queue.
-	 * So trying to keep a preempting RT task on the same
-	 * cache hot CPU will force the running RT task to
-	 * a cold CPU. So we waste all the cache for the lower
-	 * RT task in hopes of saving some of a RT task
-	 * that is just being woken and probably will have
-	 * cold cache anyway.
+	 * We want to avoid overloading runqueues. If the woken
+	 * task is a higher priority, then it will stay on this CPU
+	 * and the lower prio task should be moved to another CPU.
+	 * Even though this will probably make the lower prio task
+	 * lose its cache, we do not want to bounce a higher task
+	 * around just because it gave up its CPU, perhaps for a
+	 * lock?
+	 *
+	 * For equal prio tasks, we just let the scheduler sort it out.
 	 */
 	if (unlikely(rt_task(rq->curr)) &&
+	    rq->curr->prio < p->prio &&
 	    (p->rt.nr_cpus_allowed > 1)) {
 		int cpu = find_lowest_rq(p);
 
@@ -1491,6 +1491,8 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
 	if (!task_running(rq, p) &&
 	    !test_tsk_need_resched(rq->curr) &&
 	    has_pushable_tasks(rq) &&
+	    rt_task(rq->curr) &&
+	    rq->curr->prio < p->prio &&
 	    p->rt.nr_cpus_allowed > 1)
 		push_rt_tasks(rq);
 }


* [tip:sched/core] sched: Give CPU bound RT tasks preference
  2010-09-21  2:40 ` [PATCH 2/4] sched: Give CPU bound RT tasks preference Steven Rostedt
@ 2010-09-21 14:14   ` tip-bot for Steven Rostedt
  0 siblings, 0 replies; 14+ messages in thread
From: tip-bot for Steven Rostedt @ 2010-09-21 14:14 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, rostedt, srostedt, tglx,
	ghaskins

Commit-ID:  b3bc211cfe7d5fe94b310480d78e00bea96fbf2a
Gitweb:     http://git.kernel.org/tip/b3bc211cfe7d5fe94b310480d78e00bea96fbf2a
Author:     Steven Rostedt <srostedt@redhat.com>
AuthorDate: Mon, 20 Sep 2010 22:40:04 -0400
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Tue, 21 Sep 2010 13:57:12 +0200

sched: Give CPU bound RT tasks preference

If a high priority RT task is waking up on a CPU that is running a
lower priority task bound to that CPU, see if we can move the high
priority task to another CPU first. Note that if all other CPUs are
running tasks of higher priority than the CPU-bound current task,
it will be preempted regardless.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Gregory Haskins <ghaskins@novell.com>
LKML-Reference: <20100921024138.888922071@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched_rt.c |    8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 6a02b38..baef30f 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -971,7 +971,8 @@ select_task_rq_rt(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
 	 * For equal prio tasks, we just let the scheduler sort it out.
 	 */
 	if (unlikely(rt_task(rq->curr)) &&
-	    rq->curr->prio < p->prio &&
+	    (rq->curr->rt.nr_cpus_allowed < 2 ||
+	     rq->curr->prio < p->prio) &&
 	    (p->rt.nr_cpus_allowed > 1)) {
 		int cpu = find_lowest_rq(p);
 
@@ -1491,9 +1492,10 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
 	if (!task_running(rq, p) &&
 	    !test_tsk_need_resched(rq->curr) &&
 	    has_pushable_tasks(rq) &&
+	    p->rt.nr_cpus_allowed > 1 &&
 	    rt_task(rq->curr) &&
-	    rq->curr->prio < p->prio &&
-	    p->rt.nr_cpus_allowed > 1)
+	    (rq->curr->rt.nr_cpus_allowed < 2 ||
+	     rq->curr->prio < p->prio))
 		push_rt_tasks(rq);
 }
 


* Re: [PATCH 3/4] tracing/sched: Add sched_rt_push and sched_rt_pull tracepoints
  2010-09-21 13:36       ` Steven Rostedt
@ 2010-09-21 19:26         ` Frank Rowand
  2010-09-21 19:55           ` Frank Rowand
  0 siblings, 1 reply; 14+ messages in thread
From: Frank Rowand @ 2010-09-21 19:26 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Andrew Morton,
	Gregory Haskins

On 09/21/10 06:36, Steven Rostedt wrote:
> On Tue, 2010-09-21 at 08:33 -0400, Steven Rostedt wrote:
> 
>>> Why doesn't the migration tracepoint cover this? It shows you the task,
>>> the prio, and the old and new cpu. If the migration is logged from the
>>> old cpu, it's a push; if it's logged from the new one, it's a pull, no?

Slight nuance...

If the migration is logged from a cpu other than the new cpu, it is a push.


>>
>> For some reason it did not help me enough in my analysis. I'll go back
>> and see why.
>>
>> I originally had these tracepoints at the front of the series, and
>> realized they may be a little controversial, thus I moved them to the
>> end.
> 
> OK, I went back and looked, and I think you are right. I can determine
> this from the migration tracepoint. I think I was just being lazy and
> wanted to have the direction obvious to me :-)  I can easily do that by
> adding a plugin.
> 
> Anyway, I'll remove this patch and rename the sched_rt_setprio
> tracepoint to sched_pi_setprio. Would that be OK with you?
> 
> Thanks,
> 
> -- Steve

Regards,

Frank




* Re: [PATCH 3/4] tracing/sched: Add sched_rt_push and sched_rt_pull tracepoints
  2010-09-21 19:26         ` Frank Rowand
@ 2010-09-21 19:55           ` Frank Rowand
  0 siblings, 0 replies; 14+ messages in thread
From: Frank Rowand @ 2010-09-21 19:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Rowand, Frank, Peter Zijlstra, linux-kernel, Ingo Molnar,
	Andrew Morton, Gregory Haskins

On 09/21/10 12:26, Frank Rowand wrote:
> On 09/21/10 06:36, Steven Rostedt wrote:
>> On Tue, 2010-09-21 at 08:33 -0400, Steven Rostedt wrote:
>>
>>>> Why doesn't the migration tracepoint cover this? It shows you the task,
>>>> the prio, and the old and new cpu. If the migration is logged from the
>>>> old cpu, it's a push; if it's logged from the new one, it's a pull, no?
> 
> Slight nuance...
> 
> If the migration is logged from a cpu other than the new cpu, it is a push.

And since the logged-from cpu can differ from the old cpu, the
logged-from cpu is additional useful debugging or analysis information.
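
In other words, a trace post-processor could classify migrations
roughly like this (just a sketch of the rule described above):

/* Classify a migration event by where it was logged from:
 * logged on the destination cpu => pull, anywhere else => push. */
enum migrate_kind { MIGRATE_PULL, MIGRATE_PUSH };

static enum migrate_kind classify(int logged_cpu, int dest_cpu)
{
	return logged_cpu == dest_cpu ? MIGRATE_PULL : MIGRATE_PUSH;
}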

> 
>>>
>>> For some reason it did not help me enough in my analysis. I'll go back
>>> and see why.
>>>
>>> I originally had these tracepoints at the front of the series, and
>>> realized they may be a little controversial, thus I moved them to the
>>> end.
>>
>> OK, I went back and looked, and I think you are right. I can determine
>> this from the migration tracepoint. I think I was just being lazy and
>> wanted to have the direction obvious to me :-)  I can easily do that by
>> adding a plugin.
>>
>> Anyway, I'll remove this patch and rename the sched_rt_setprio
>> tracepoint to sched_pi_setprio. Would that be OK with you?
>>
>> Thanks,
>>
>> -- Steve

Regards,

Frank



end of thread

Thread overview: 14+ messages
2010-09-21  2:40 [PATCH 0/4] [GIT PULL] sched: Migrate higher priority tasks less Steven Rostedt
2010-09-21  2:40 ` [PATCH 1/4] sched: Try not to migrate higher priority RT tasks Steven Rostedt
2010-09-21 14:14   ` [tip:sched/core] " tip-bot for Steven Rostedt
2010-09-21  2:40 ` [PATCH 2/4] sched: Give CPU bound RT tasks preference Steven Rostedt
2010-09-21 14:14   ` [tip:sched/core] " tip-bot for Steven Rostedt
2010-09-21  2:40 ` [PATCH 3/4] tracing/sched: Add sched_rt_push and sched_rt_pull tracepoints Steven Rostedt
2010-09-21  8:46   ` Peter Zijlstra
2010-09-21 12:33     ` Steven Rostedt
2010-09-21 13:36       ` Steven Rostedt
2010-09-21 19:26         ` Frank Rowand
2010-09-21 19:55           ` Frank Rowand
2010-09-21  2:40 ` [PATCH 4/4] tracing/sched: Add sched_rt_setprio tracepoint Steven Rostedt
2010-09-21 11:14   ` Peter Zijlstra
2010-09-21 12:35     ` Steven Rostedt
