* [PATCH 0/5] RT: scheduler fixes and rt_overload enhancements
@ 2007-10-09 14:25 Gregory Haskins
  2007-10-09 14:25 ` [PATCH 1/5] RT - fix for scheduling issue Gregory Haskins
                   ` (6 more replies)
  0 siblings, 7 replies; 18+ messages in thread
From: Gregory Haskins @ 2007-10-09 14:25 UTC (permalink / raw)
  To: mingo, linux-rt-users, rostedt
  Cc: kravetz, linux-kernel, ghaskins, pmorreale, sdietrich

Hi All,

The first two patches are from Mike and Steven on LKML, which the rest of my
series is dependent on.  Patch #4 is a resend from earlier.

Series Summary:

1) Send IPI on overload regardless of whether prev is an RT task
2) Set the NEEDS_RESCHED flag on reception of RESCHED_IPI
3) Fix a mistargeted IPI on overload
4) Track which CPUs are in overload for efficiency
5) Track which CPUs are eligible for rebalancing for efficiency

These have been built and boot-tested on a 4-core Intel system.

Regards,
-Greg

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/5] RT - fix for scheduling issue
  2007-10-09 14:25 [PATCH 0/5] RT: scheduler fixes and rt_overload enhancements Gregory Haskins
@ 2007-10-09 14:25 ` Gregory Haskins
  2007-10-09 14:25 ` [PATCH 2/5] RT - fix reschedule IPI Gregory Haskins
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Gregory Haskins @ 2007-10-09 14:25 UTC (permalink / raw)
  To: mingo, linux-rt-users, rostedt
  Cc: kravetz, linux-kernel, ghaskins, pmorreale, sdietrich

From: Mike Kravetz <kravetz@us.ibm.com>

RESCHED_IPIs can be missed if more than one RT task is awoken simultaneously

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 93fd6de..3e75c62 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2207,7 +2207,7 @@ static inline void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	 * If we pushed an RT task off the runqueue,
 	 * then kick other CPUs, they might run it:
 	 */
-	if (unlikely(rt_task(current) && prev->se.on_rq && rt_task(prev))) {
+	if (unlikely(rt_task(current) && rq->rt_nr_running > 1)) {
 		schedstat_inc(rq, rto_schedule);
 		smp_send_reschedule_allbutself_cpumask(current->cpus_allowed);
 	}


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/5] RT - fix reschedule IPI
  2007-10-09 14:25 [PATCH 0/5] RT: scheduler fixes and rt_overload enhancements Gregory Haskins
  2007-10-09 14:25 ` [PATCH 1/5] RT - fix for scheduling issue Gregory Haskins
@ 2007-10-09 14:25 ` Gregory Haskins
  2007-10-09 14:25 ` [PATCH 3/5] RT - fix mistargeted RESCHED_IPI Gregory Haskins
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Gregory Haskins @ 2007-10-09 14:25 UTC (permalink / raw)
  To: mingo, linux-rt-users, rostedt
  Cc: kravetz, linux-kernel, ghaskins, pmorreale, sdietrich

From: Mike Kravetz <kravetz@us.ibm.com>

x86_64 based RESCHED_IPIs fail to set the reschedule flag

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 arch/x86_64/kernel/smp.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86_64/kernel/smp.c b/arch/x86_64/kernel/smp.c
index a5bf746..3ce6cad 100644
--- a/arch/x86_64/kernel/smp.c
+++ b/arch/x86_64/kernel/smp.c
@@ -505,13 +505,13 @@ void smp_send_stop(void)
 }
 
 /*
- * Reschedule call back. Nothing to do,
- * all the work is done automatically when
- * we return from the interrupt.
+ * Reschedule call back. Trigger a reschedule pass so that
+ * RT-overload balancing can pass tasks around.
  */
 asmlinkage void smp_reschedule_interrupt(void)
 {
 	ack_APIC_irq();
+	set_tsk_need_resched(current);
 }
 
 asmlinkage void smp_call_function_interrupt(void)


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 3/5] RT - fix mistargeted RESCHED_IPI
  2007-10-09 14:25 [PATCH 0/5] RT: scheduler fixes and rt_overload enhancements Gregory Haskins
  2007-10-09 14:25 ` [PATCH 1/5] RT - fix for scheduling issue Gregory Haskins
  2007-10-09 14:25 ` [PATCH 2/5] RT - fix reschedule IPI Gregory Haskins
@ 2007-10-09 14:25 ` Gregory Haskins
  2007-10-09 14:26 ` [PATCH 4/5] RT: Add a per-cpu rt_overload indication Gregory Haskins
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Gregory Haskins @ 2007-10-09 14:25 UTC (permalink / raw)
  To: mingo, linux-rt-users, rostedt
  Cc: kravetz, linux-kernel, ghaskins, pmorreale, sdietrich

Any number of tasks could be queued behind the current task, so direct the
balance IPI at all CPUs (other than current)

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Mike Kravetz <kravetz@us.ibm.com>
CC: Peter W. Morreale <pmorreale@novell.com>
---

 kernel/sched.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 3e75c62..551629b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2209,7 +2209,7 @@ static inline void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	 */
 	if (unlikely(rt_task(current) && rq->rt_nr_running > 1)) {
 		schedstat_inc(rq, rto_schedule);
-		smp_send_reschedule_allbutself_cpumask(current->cpus_allowed);
+		smp_send_reschedule_allbutself();
 	}
 #endif
 	prev_state = prev->state;


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 4/5] RT: Add a per-cpu rt_overload indication
  2007-10-09 14:25 [PATCH 0/5] RT: scheduler fixes and rt_overload enhancements Gregory Haskins
                   ` (2 preceding siblings ...)
  2007-10-09 14:25 ` [PATCH 3/5] RT - fix mistargeted RESCHED_IPI Gregory Haskins
@ 2007-10-09 14:26 ` Gregory Haskins
  2007-10-09 14:26 ` [PATCH 5/5] RT - Track which CPUs should get IPI'd on rt-overload Gregory Haskins
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Gregory Haskins @ 2007-10-09 14:26 UTC (permalink / raw)
  To: mingo, linux-rt-users, rostedt
  Cc: kravetz, linux-kernel, ghaskins, pmorreale, sdietrich

The system currently evaluates all online CPUs whenever one or more enters
an rt_overload condition.  This suffers from scalability limitations as
the # of online CPUs increases.  So we introduce a cpumask to track
exactly which CPUs need RT balancing.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Peter W. Morreale <pmorreale@novell.com>
---

 kernel/sched.c |   12 +++++++++---
 1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 551629b..a28ca9d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -631,6 +631,7 @@ static inline struct rq *this_rq_lock(void)
 
 #if defined(CONFIG_PREEMPT_RT) && defined(CONFIG_SMP)
 static __cacheline_aligned_in_smp atomic_t rt_overload;
+static cpumask_t rto_cpus;
 #endif
 
 static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
@@ -639,8 +640,11 @@ static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
 	if (rt_task(p)) {
 		rq->rt_nr_running++;
 # ifdef CONFIG_SMP
-		if (rq->rt_nr_running == 2)
+		if (rq->rt_nr_running == 2) {
+			cpu_set(rq->cpu, rto_cpus);
+			smp_wmb();
 			atomic_inc(&rt_overload);
+		}
 # endif
 	}
 #endif
@@ -653,8 +657,10 @@ static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
 		WARN_ON(!rq->rt_nr_running);
 		rq->rt_nr_running--;
 # ifdef CONFIG_SMP
-		if (rq->rt_nr_running == 1)
+		if (rq->rt_nr_running == 1) {
 			atomic_dec(&rt_overload);
+			cpu_clear(rq->cpu, rto_cpus);
+		}
 # endif
 	}
 #endif
@@ -1503,7 +1509,7 @@ static void balance_rt_tasks(struct rq *this_rq, int this_cpu)
 	 */
 	next = pick_next_task(this_rq, this_rq->curr);
 
-	for_each_online_cpu(cpu) {
+	for_each_cpu_mask(cpu, rto_cpus) {
 		if (cpu == this_cpu)
 			continue;
 		src_rq = cpu_rq(cpu);


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 5/5] RT - Track which CPUs should get IPI'd on rt-overload
  2007-10-09 14:25 [PATCH 0/5] RT: scheduler fixes and rt_overload enhancements Gregory Haskins
                   ` (3 preceding siblings ...)
  2007-10-09 14:26 ` [PATCH 4/5] RT: Add a per-cpu rt_overload indication Gregory Haskins
@ 2007-10-09 14:26 ` Gregory Haskins
  2007-10-09 15:00 ` [PATCH 0/5] RT: scheduler fixes and rt_overload enhancements Peter Zijlstra
  2007-10-09 15:00 ` Steven Rostedt
  6 siblings, 0 replies; 18+ messages in thread
From: Gregory Haskins @ 2007-10-09 14:26 UTC (permalink / raw)
  To: mingo, linux-rt-users, rostedt
  Cc: kravetz, linux-kernel, ghaskins, pmorreale, sdietrich

The code currently fires IPIs out blindly whenever an overload occurs.
However, there are well-defined events that govern when an rt-overload
exists (e.g. an RT task added to an RQ, or an RT task preempted).  Therefore,
we attempt to efficiently track which CPUs are eligible for rebalancing, and
only IPI the affected CPUs.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Peter W. Morreale <pmorreale@novell.com>
---

 kernel/sched.c |   15 +++++++++++++--
 1 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index a28ca9d..6ca5f4f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -359,6 +359,8 @@ struct rq {
 	unsigned long rto_pulled;
 #endif
 	struct lock_class_key rq_lock_key;
+
+	cpumask_t rto_resched; /* Which of our peers needs rescheduling */
 };
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
@@ -645,6 +647,9 @@ static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
 			smp_wmb();
 			atomic_inc(&rt_overload);
 		}
+
+		cpus_or(rq->rto_resched, rq->rto_resched, p->cpus_allowed);
+		cpu_clear(rq->cpu, rq->rto_resched);
 # endif
 	}
 #endif
@@ -2213,9 +2218,15 @@ static inline void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	 * If we pushed an RT task off the runqueue,
 	 * then kick other CPUs, they might run it:
 	 */
-	if (unlikely(rt_task(current) && rq->rt_nr_running > 1)) {
+	if (unlikely(rt_task(current) && prev->se.on_rq && rt_task(prev))) {
+		cpus_or(rq->rto_resched, rq->rto_resched, prev->cpus_allowed);
+		cpu_clear(rq->cpu, rq->rto_resched);
+	}
+
+	if (unlikely(rq->rt_nr_running > 1 && !cpus_empty(rq->rto_resched))) {
 		schedstat_inc(rq, rto_schedule);
-		smp_send_reschedule_allbutself();
+		smp_send_reschedule_allbutself_cpumask(rq->rto_resched);
+		cpus_clear(rq->rto_resched);
 	}
 #endif
 	prev_state = prev->state;


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/5] RT: scheduler fixes and rt_overload enhancements
  2007-10-09 14:25 [PATCH 0/5] RT: scheduler fixes and rt_overload enhancements Gregory Haskins
                   ` (4 preceding siblings ...)
  2007-10-09 14:26 ` [PATCH 5/5] RT - Track which CPUs should get IPI'd on rt-overload Gregory Haskins
@ 2007-10-09 15:00 ` Peter Zijlstra
  2007-10-09 15:00 ` Steven Rostedt
  6 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2007-10-09 15:00 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: mingo, linux-rt-users, rostedt, kravetz, linux-kernel, pmorreale,
	sdietrich

On Tue, 2007-10-09 at 10:25 -0400, Gregory Haskins wrote:
> Hi All,
> 
> The first two patches are from Mike and Steven on LKML, which the rest of my
> series is dependent on.  Patch #4 is a resend from earlier.
> 
> Series Summary:
> 
> 1) Send IPI on overload regardless of whether prev is an RT task
> 2) Set the NEEDS_RESCHED flag on reception of RESCHED_IPI
> 3) Fix a mistargeted IPI on overload
> 4) Track which CPUS are in overload for efficiency
> 5) Track which CPUs are eligible for rebalancing for efficiency
> 
> These have been built and boot-tested on a 4-core Intel system.

Ok, I'm not liking these.

I really hate setting TIF_NEED_RESCHED from the IPI handler. Also, I
don't see how doing a resched pulls tasks to begin with.

How about keeping a per rq variable that indicates the highest priority
of runnable tasks. And on forced preemption look for a target rq to send
your last highest task to.

There is no need to broadcast rebalance, that will only serialise on the
local rq lock again. So pick a target rq, and stick with that.

Also, I think you meant to use cpus_and() with the rto and allowed
masks.
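
For illustration, a minimal sketch of the kind of targeted selection meant
here (the ->highest_prio field and the find_lowest_rq() name are
hypothetical, and it assumes kernel/sched.c context where struct rq and
cpu_rq() are visible; the lockless peek at the remote rq is only a hint):

/* hypothetical per-rq field, updated when tasks are queued/dequeued:
 *	int highest_prio;	-- best (lowest-numbered) prio on this rq
 */

static struct rq *find_lowest_rq(struct task_struct *p)
{
	struct rq *lowest = NULL;
	int cpu;

	for_each_cpu_mask(cpu, p->cpus_allowed) {
		struct rq *rq = cpu_rq(cpu);

		/* remember: a larger ->prio number means a lower priority */
		if (rq->highest_prio > p->prio &&
		    (!lowest || rq->highest_prio > lowest->highest_prio))
			lowest = rq;
	}

	return lowest;	/* a single target rq, no broadcast IPI */
}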


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/5] RT: scheduler fixes and rt_overload enhancements
  2007-10-09 14:25 [PATCH 0/5] RT: scheduler fixes and rt_overload enhancements Gregory Haskins
                   ` (5 preceding siblings ...)
  2007-10-09 15:00 ` [PATCH 0/5] RT: scheduler fixes and rt_overload enhancements Peter Zijlstra
@ 2007-10-09 15:00 ` Steven Rostedt
  2007-10-09 15:33   ` Gregory Haskins
  6 siblings, 1 reply; 18+ messages in thread
From: Steven Rostedt @ 2007-10-09 15:00 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Ingo Molnar, linux-rt-users, kravetz, LKML, pmorreale, sdietrich,
	Peter Zijlstra



--
On Tue, 9 Oct 2007, Gregory Haskins wrote:
> Hi All,

Hi Gregory,

>
> The first two patches are from Mike and Steven on LKML, which the rest of my
> series is dependent on.  Patch #4 is a resend from earlier.
>
> Series Summary:
>
> 1) Send IPI on overload regardless of whether prev is an RT task

OK.

> 2) Set the NEEDS_RESCHED flag on reception of RESCHED_IPI

Peter Zijlstra and I have been discussing this IPI Resched change a bit.
It seems to be overkill for what is needed. That is, the
send_reschedule is used elsewhere where we do not want to actually do a
schedule.

I'm thinking about trying out a method where each rq records the priority of
the task that is currently running on it. In the case where we get an rt
overload (like in finish_task_switch) we do a scan of all CPUs (not taking
any locks) and find the CPU with the lowest priority. If that CPU has a lower
priority than a task waiting to run on the current CPU, then we grab the
lock for that rq, check to see if the priority is still lower, and then
push the rt task over to that CPU.

If after taking the rq lock a schedule has taken place and a higher RT
task is running, then we try again, up to two more times. If that
happens those two more times, we punt and don't do anything else (a
paranoid attempt to avoid retrying over and over on a system with a high
RT context-switch rate).


> 3) Fix a mistargeted IPI on overload
> 4) Track which CPUS are in overload for efficiency
> 5) Track which CPUs are eligible for rebalancing for efficiency

The above three may be obsoleted by this new algorithm.

-- Steve


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/5] RT: scheduler fixes and rt_overload enhancements
  2007-10-09 15:00 ` Steven Rostedt
@ 2007-10-09 15:33   ` Gregory Haskins
  2007-10-09 15:39     ` Peter Zijlstra
  2007-10-09 17:59     ` [RFC PATCH RT] push waiting rt tasks to cpus with lower prios Steven Rostedt
  0 siblings, 2 replies; 18+ messages in thread
From: Gregory Haskins @ 2007-10-09 15:33 UTC (permalink / raw)
  To: Steven Rostedt, Peter Zijlstra
  Cc: Ingo Molnar, linux-rt-users, kravetz, LKML, pmorreale, sdietrich,
	Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 2721 bytes --]

On Tue, 2007-10-09 at 11:00 -0400, Steven Rostedt wrote:
> 

Hi Steve, Peter,


> --
> On Tue, 9 Oct 2007, Gregory Haskins wrote:
> > Hi All,
> 
> Hi Gregory,
> 
> >
> > The first two patches are from Mike and Steven on LKML, which the rest of my
> > series is dependent on.  Patch #4 is a resend from earlier.
> >
> > Series Summary:
> >
> > 1) Send IPI on overload regardless of whether prev is an RT task
> 
> OK.
> 
> > 2) Set the NEEDS_RESCHED flag on reception of RESCHED_IPI
> 
> Peter Zijlstra and I have been discussing this IPI Resched change a bit.
> It seems that it is too much overkill for what is needed. That is, the
> send_reschedule is used elsewhere where we do not want to actually do a
> schedule.

That is a good point.  We definitely need a good "kick+resched" kind of
mechanism here, but perhaps it should be RTO specific instead of in the
primary data path.  I guess an rq-lock + set(NEEDS_RESCHED) + IPI works
too.
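
For concreteness, a minimal sketch of that rq-lock + set(NEEDS_RESCHED) +
IPI idea (the rto_kick_cpu() name is hypothetical, and it glosses over the
idle/polling handling that resched_task() already does):

static void rto_kick_cpu(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	unsigned long flags;

	spin_lock_irqsave(&rq->lock, flags);
	set_tsk_need_resched(rq->curr);
	spin_unlock_irqrestore(&rq->lock, flags);

	smp_send_reschedule(cpu);	/* IPI only to force the resched check */
}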

On the flip side: perhaps sending a reschedule-IPI that doesn't actually
reschedule is simply being misused, and the misuse should be cleaned up
instead?

> 
> I'm thinking about trying out a method that each rq has the priority of
> the current task that is running. On case where we get an rt overload
> (like in the finish_task_switch) we do a scan of all CPUS (not taking any
> locks) and find the CPU which the lowest priority. If that CPU has a lower
> prioirty than a waiting task to run on the current CPU then we grab the
> lock for that rq, check to see if the priority is still lower, and then
> push the rt task over to that CPU.

Great minds think alike ;)  See attached for a patch I have been working
on in this area.  It currently addresses the "wake_up" path.  It would
also need to address the "preempted" path if we were to eliminate RTO
outright.

I wasn't going to share it quite yet, since it's still a work in
progress.  But the timing seems right now, given the discussion.

> 
> If after taking the rq lock a schedule had taken place and a higher RT
> task is running, then we would try again, two more times. If this
> phenomenon happens two more times, we punt and wouldn't do anything else
> (paranoid attempt to fall into trying over and over on a high context
> switch RT system).

My patch doesn't address this yet, but I have been thinking
about it for the last day or so.  I was wondering if perhaps RCU
would be appropriate instead of the rwlock I am using.

> 
> 
> > 3) Fix a mistargeted IPI on overload
> > 4) Track which CPUS are in overload for efficiency
> > 5) Track which CPUs are eligible for rebalancing for efficiency
> 
> The above three may be obsoleted by this new algorithm.

On the same page with you, here.

Regards,
-Greg 




[-- Attachment #2: cpu_priority.patch --]
[-- Type: text/x-patch, Size: 11740 bytes --]

SCHED: CPU priority management

From: Gregory Haskins <ghaskins@novell.com>

This code tracks the priority of each CPU so that global migration
  decisions are easy to calculate.  Each CPU can be in a state as follows:

                 (INVALID), IDLE, NORMAL, RT1, ... RT99

  going from the lowest priority to the highest.  CPUs in the INVALID state
  are not eligible for routing.  The system maintains this state with
  a 2 dimensional bitmap (the first for priority class, the second for cpus
  in that class).  Therefore a typical application without affinity
  restrictions can find a suitable CPU with O(1) complexity (e.g. two bit
  searches).  For tasks with affinity restrictions, the algorithm has a
  worst case complexity of O(min(102, NR_CPUS)), though the scenario that
  yields the worst case search is fairly contrived.

  Because this type of data structure is going to be cache/lock hot,
  certain design considerations were made to mitigate this overhead, such
  as:  rwlocks, per_cpu data to avoid cacheline contention, avoiding locks
  in the update code when possible, etc.

  This logic can really be seen as a superset of the wake_idle()
  functionality (in fact, it replaces wake_idle() when enabled).  The
  original logic performed a similar function, but was limited to only two
  priority classifications: IDLE, and !IDLE.
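
  As a concrete illustration of the "two bit searches" above, the fast path
  for a task with no affinity restrictions boils down to roughly the
  following (ignoring the lock, the affinity mask and the preference for
  the default CPU; cp, pri_active and pri_to_cpu are the fields defined in
  kernel/cpupri.c below):

	idx = find_first_bit(cp->pri_active, CPUPRI_NR_PRIORITIES);
	cpu = first_cpu(cp->pri_to_cpu[idx]);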

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/cpupri.h |   26 ++++++
 kernel/Kconfig.preempt |   11 +++
 kernel/Makefile        |    1 
 kernel/cpupri.c        |  198 ++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched.c         |   39 +++++++++
 5 files changed, 274 insertions(+), 1 deletions(-)

diff --git a/include/linux/cpupri.h b/include/linux/cpupri.h
new file mode 100644
index 0000000..c5749db
--- /dev/null
+++ b/include/linux/cpupri.h
@@ -0,0 +1,26 @@
+#ifndef _LINUX_CPUPRI_H
+#define _LINUX_CPUPRI_H
+
+#include <linux/sched.h>
+
+#define CPUPRI_NR_PRIORITIES 2+MAX_RT_PRIO
+
+#define CPUPRI_INVALID -2
+#define CPUPRI_IDLE    -1
+#define CPUPRI_NORMAL   0
+/* values 1-99 are RT priorities */
+
+#ifdef CONFIG_CPU_PRIORITIES
+int cpupri_find_best(int cpu, int pri, struct task_struct *p);
+void cpupri_set(int pri);
+void cpupri_init(void);
+#else
+inline int cpupri_find_best(int cpu, struct task_struct *p)
+{
+	return cpu;
+}
+#define cpupri_set(pri) do { } while(0)
+#define cpupri_init() do { } while(0)
+#endif /* CONFIG_CPU_PRIORITIES */
+
+#endif /* _LINUX_CPUPRI_H */
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 2316f28..5397e59 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -197,3 +197,14 @@ config RCU_TRACE
 	  Say Y here if you want to enable RCU tracing
 	  Say N if you are unsure.
 
+config CPU_PRIORITIES
+       bool "Enable CPU priority management"
+       default n
+       help
+         This option allows the scheduler to efficiently track the absolute
+	 priority of the current task on each CPU.  This helps it to make
+	 global decisions for real-time tasks before an overload conflict
+	 actually occurs.
+
+	 Say Y here if you want to enable priority management
+	 Say N if you are unsure.
\ No newline at end of file
diff --git a/kernel/Makefile b/kernel/Makefile
index e4e2acf..63aaaf5 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -66,6 +66,7 @@ obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
+obj-$(CONFIG_CPU_PRIORITIES) += cpupri.o
 
 ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff --git a/kernel/cpupri.c b/kernel/cpupri.c
new file mode 100644
index 0000000..c6a2e3e
--- /dev/null
+++ b/kernel/cpupri.c
@@ -0,0 +1,198 @@
+/*
+ *  kernel/cpupri.c
+ *
+ *  CPU priority management
+ *
+ *  Copyright (C) 2007 Novell
+ *
+ *  Author: Gregory Haskins <ghaskins@novell.com>
+ *
+ *  This code tracks the priority of each CPU so that global migration
+ *  decisions are easy to calculate.  Each CPU can be in a state as follows:
+ *
+ *                 (INVALID), IDLE, NORMAL, RT1, ... RT99
+ *
+ *  going from the lowest priority to the highest.  CPUs in the INVALID state
+ *  are not eligible for routing.  The system maintains this state with
+ *  a 2 dimensional bitmap (the first for priority class, the second for cpus
+ *  in that class).  Therefore a typical application without affinity
+ *  restrictions can find a suitable CPU with O(1) complexity (e.g. two bit
+ *  searches).  For tasks with affinity restrictions, the algorithm has a
+ *  worst case complexity of O(min(102, NR_CPUS)), though the scenario that
+ *  yields the worst case search is fairly contrived.
+ *
+ *  Because this type of data structure is going to be cache/lock hot,
+ *  certain design considerations were made to mitigate this overhead, such
+ *  as:  rwlocks, per_cpu data to avoid cacheline contention, avoiding locks
+ *  in the update code when possible, etc.
+ *
+ *  This logic can really be seen as a superset of the wake_idle()
+ *  functionality (in fact, it replaces wake_idle() when enabled).  The
+ *  original logic performed a similar function, but was limited to only two
+ *  priority classifications: IDLE, and !IDLE.
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; version 2
+ *  of the License.
+ */
+
+#include <linux/cpupri.h>
+#include <asm/idle.h>
+
+struct cpu_priority {
+	raw_rwlock_t lock;
+	cpumask_t    pri_to_cpu[CPUPRI_NR_PRIORITIES];
+	long         pri_active[CPUPRI_NR_PRIORITIES/BITS_PER_LONG];
+};
+
+static DEFINE_PER_CPU(int, cpu_to_pri);
+
+static __cacheline_aligned_in_smp struct cpu_priority cpu_priority;
+
+#define for_each_cpupri_active(array, idx)                   \
+  for( idx = find_first_bit(array, CPUPRI_NR_PRIORITIES);    \
+       idx < CPUPRI_NR_PRIORITIES;                           \
+       idx = find_next_bit(array, CPUPRI_NR_PRIORITIES, idx+1))
+
+/**
+ * cpupri_find_best - find the best (lowest-pri) CPU in the system
+ * @cpu: The recommended/default CPU
+ * @task_pri: The priority of the task being scheduled (IDLE-RT99)
+ * @p: The task being scheduled
+ *
+ * Note: This function returns the recommended CPU as calculated during the
+ * current invocation.  By the time the call returns, the CPUs may have in
+ * fact changed priorities any number of times.  While not ideal, it is not
+ * an issue of correctness since the normal rebalancer logic will correct
+ * any discrepancies created by racing against the uncertainty of the current
+ * priority configuration.
+ *
+ * Returns: (int)cpu - The recommended cpu to accept the task
+ */
+int cpupri_find_best(int cpu, int task_pri, struct task_struct *p)
+{
+	int                  idx      = 0;
+	struct cpu_priority *cp       = &cpu_priority;
+	unsigned long        flags;
+
+	read_lock_irqsave(&cp->lock, flags);
+
+	for_each_cpupri_active(cp->pri_active, idx) {
+		cpumask_t mask;
+		int       lowest_pri = idx-1;
+
+		if (lowest_pri > task_pri)
+			break;
+
+		cpus_and(mask, p->cpus_allowed, cp->pri_to_cpu[idx]);
+
+		/*
+		 * If the default cpu is available for this task to run on,
+		 * it wins automatically
+		 */
+		if (cpu_isset(cpu, mask))
+			break;
+
+		if (!cpus_empty(mask)) {
+			/*
+			 * Else we should pick one of the remaining elements
+			 */
+			cpu = first_cpu(mask);
+			break;
+		}
+	}
+
+	read_unlock_irqrestore(&cp->lock, flags);
+
+	return cpu;
+}
+
+/**
+ * cpupri_set - update the cpu priority setting
+ * @pri: The priority (INVALID-RT99) to assign to this CPU
+ *
+ * Returns: (void)
+ */
+void cpupri_set(int pri)
+{
+	struct cpu_priority *cp   = &cpu_priority;
+	int                  cpu  = raw_smp_processor_id();
+	int                 *cpri = &per_cpu(cpu_to_pri, cpu);
+
+	/*
+	 * It's safe to check the CPU priority outside the lock because
+	 * it can only be modified from the processor in question
+	 */
+	if (*cpri != pri) {
+		int           oldpri = *cpri;
+		unsigned long flags;
+		
+		write_lock_irqsave(&cp->lock, flags);
+
+		/*
+		 * If the cpu was currently mapped to a different value, we
+		 * first need to unmap the old value
+		 */
+		if (likely(oldpri != CPUPRI_INVALID)) {
+			int        idx  = oldpri+1;
+			cpumask_t *mask = &cp->pri_to_cpu[idx];
+
+			cpu_clear(cpu, *mask);
+			if (cpus_empty(*mask))
+				__clear_bit(idx, cp->pri_active);
+		}
+
+		if (likely(pri != CPUPRI_INVALID)) {
+			int        idx  = pri+1;
+			cpumask_t *mask = &cp->pri_to_cpu[idx];
+
+			cpu_set(cpu, *mask);
+			__set_bit(idx, cp->pri_active);
+		}
+
+		write_unlock_irqrestore(&cp->lock, flags);
+
+		*cpri = pri;
+	}
+}
+
+static int cpupri_idle(struct notifier_block *b, unsigned long event, void *v)
+{
+	if (event == IDLE_START)
+		cpupri_set(CPUPRI_IDLE);
+
+	return 0;
+}
+
+static struct notifier_block cpupri_idle_notifier = {
+	.notifier_call = cpupri_idle
+};
+
+/**
+ * cpupri_init - initialize the cpupri subsystem
+ *
+ * This must be called during the scheduler initialization before the 
+ * other methods may be used.
+ *
+ * Returns: (void)
+ */
+void cpupri_init(void)
+{
+	struct cpu_priority *cp = &cpu_priority;
+	int i;
+
+	printk("CPU Priority Management, Copyright(c) 2007, Novell\n");
+
+	memset(cp, 0, sizeof(*cp));
+
+	rwlock_init(&cp->lock);
+
+	for_each_possible_cpu(i) {
+		per_cpu(cpu_to_pri, i) = CPUPRI_INVALID;
+	}
+
+	idle_notifier_register(&cpupri_idle_notifier);
+}
+
+
diff --git a/kernel/sched.c b/kernel/sched.c
index 6ca5f4f..0f815ad 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -24,6 +24,7 @@
  *              by Peter Williams
  *  2007-05-06  Interactivity improvements to CFS by Mike Galbraith
  *  2007-07-01  Group scheduling enhancements by Srivatsa Vaddagiri
+ *  2007-10-07  Cpu priorities by Greg Haskins
  */
 
 #include <linux/mm.h>
@@ -64,6 +65,7 @@
 #include <linux/delayacct.h>
 #include <linux/reciprocal_div.h>
 #include <linux/unistd.h>
+#include <linux/cpupri.h>
 
 #include <asm/tlb.h>
 
@@ -1716,6 +1718,38 @@ static inline int wake_idle(int cpu, struct task_struct *p)
 }
 #endif
 
+#ifdef CONFIG_CPU_PRIORITIES
+static int cpupri_task_priority(struct rq *rq, struct task_struct *p)
+{
+	int pri;
+
+	if (rt_task(p))
+		pri = p->rt_priority;
+	else if (p == rq->idle)
+		pri = CPUPRI_IDLE;
+	else
+		pri = CPUPRI_NORMAL;
+
+	return pri;
+}
+
+static void cpupri_set_task(struct rq *rq, struct task_struct *p)
+{
+	int pri = cpupri_task_priority(rq, p);
+	cpupri_set(pri);
+}
+
+static int wake_lowest(int cpu, struct task_struct *p)
+{
+	int pri = cpupri_task_priority(cpu_rq(cpu), p);
+
+	return cpupri_find_best(cpu, pri, p);
+}
+#else
+#define cpupri_set_task(rq, task) do { } while (0)
+#define wake_lowest(cpu, task)  wake_idle(cpu, task)
+#endif
+
 /***
  * try_to_wake_up - wake up a thread
  * @p: the to-be-woken-up thread
@@ -1840,7 +1874,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int sync, int mutex)
 
 	new_cpu = cpu; /* Could not wake to this_cpu. Wake to cpu instead */
 out_set_cpu:
-	new_cpu = wake_idle(new_cpu, p);
+	new_cpu = wake_lowest(new_cpu, p);
 	if (new_cpu != cpu) {
 		set_task_cpu(p, new_cpu);
 		task_rq_unlock(rq, &flags);
@@ -2177,6 +2211,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
 	fire_sched_out_preempt_notifiers(prev, next);
 	prepare_lock_switch(rq, next);
 	prepare_arch_switch(next);
+	cpupri_set_task(rq, next);
 }
 
 /**
@@ -7214,6 +7249,8 @@ void __init sched_init(void)
 	int highest_cpu = 0;
 	int i, j;
 
+	cpupri_init();
+
 	/*
 	 * Link up the scheduling class hierarchy:
 	 */

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/5] RT: scheduler fixes and rt_overload enhancements
  2007-10-09 15:33   ` Gregory Haskins
@ 2007-10-09 15:39     ` Peter Zijlstra
  2007-10-09 17:59     ` [RFC PATCH RT] push waiting rt tasks to cpus with lower prios Steven Rostedt
  1 sibling, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2007-10-09 15:39 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Steven Rostedt, Ingo Molnar, linux-rt-users, kravetz, LKML,
	pmorreale, sdietrich

On Tue, 2007-10-09 at 11:33 -0400, Gregory Haskins wrote:

> On the flip side:  Perhaps sending a reschedule-ipi that doesn't
> reschedule is simply misused, and the misuse should be cleaned up
> instead? 

It basically forces a pass through TIF_WORK_MASK, and TIF_NEED_RESCHED is
one of the most frequently used of those bits. Using it for any other bit
in that mask is IMHO not abuse.
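
Conceptually (and ignoring the actual entry.S assembly), the interrupt
return path amounts to something like:

	if (test_thread_flag(TIF_NEED_RESCHED))
		schedule();

so setting TIF_NEED_RESCHED from the IPI handler just reuses a check the
interrupt-return path already makes.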




^ permalink raw reply	[flat|nested] 18+ messages in thread

* [RFC PATCH RT] push waiting rt tasks to cpus with lower prios.
  2007-10-09 15:33   ` Gregory Haskins
  2007-10-09 15:39     ` Peter Zijlstra
@ 2007-10-09 17:59     ` Steven Rostedt
  2007-10-09 18:14       ` Steven Rostedt
                         ` (3 more replies)
  1 sibling, 4 replies; 18+ messages in thread
From: Steven Rostedt @ 2007-10-09 17:59 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Peter Zijlstra, Ingo Molnar, linux-rt-users, kravetz, LKML,
	pmorreale, sdietrich

This has been compile tested (and no more ;-)


The idea here is to handle the situation where we have just scheduled in an
RT task and we either pushed a lesser RT task away, or more than one RT
task was queued on this CPU before the schedule occurred.

What this patch does is an O(n) search of CPUs for the
CPU with the lowest prio task running.  When that CPU is found, the next
highest RT task is pushed to that CPU.

Some notes:

1) no lock is taken while looking for the lowest priority CPU. When one
is found, only that CPU's lock is taken and after that a check is made
to see if it is still a candidate to push the RT task over. If not, we
try the search again, for a max of 3 tries.

2) I only do this for the second highest RT task on the CPU queue. This
can be easily changed to do it for all RT tasks until no more can be
pushed off to other CPUs.

This is a simple approach right now, and is only being posted for
comments.  I'm sure more can be done to make this more efficient or just
simply better.

-- Steve

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Index: linux-2.6.23-rc9-rt2/kernel/sched.c
===================================================================
--- linux-2.6.23-rc9-rt2.orig/kernel/sched.c
+++ linux-2.6.23-rc9-rt2/kernel/sched.c
@@ -304,6 +304,7 @@ struct rq {
 #ifdef CONFIG_PREEMPT_RT
 	unsigned long rt_nr_running;
 	unsigned long rt_nr_uninterruptible;
+	int curr_prio;
 #endif
 
 	unsigned long switch_timestamp;
@@ -1485,6 +1486,87 @@ next_in_queue:
 static int double_lock_balance(struct rq *this_rq, struct rq *busiest);
 
 /*
+ * If the current CPU has more than one RT task, see if the non
+ * running task can migrate over to a CPU that is running a task
+ * of lesser priority.
+ */
+static int push_rt_task(struct rq *this_rq)
+{
+	struct task_struct *next_task;
+	struct rq *lowest_rq = NULL;
+	int tries;
+	int cpu;
+	int dst_cpu = -1;
+	int ret = 0;
+
+	BUG_ON(!spin_is_locked(&this_rq->lock));
+
+	next_task = rt_next_highest_task(this_rq);
+	if (!next_task)
+		return 0;
+
+	/* We might release this_rq lock */
+	get_task_struct(next_task);
+
+	/* Only try this algorithm three times */
+	for (tries = 0; tries < 3; tries++) {
+		/*
+		 * Scan each rq for the lowest prio.
+		 */
+		for_each_cpu_mask(cpu, next_task->cpus_allowed) {
+			struct rq *rq = &per_cpu(runqueues, cpu);
+
+			if (cpu == smp_processor_id())
+				continue;
+
+			/* no locking for now */
+			if (rq->curr_prio > next_task->prio &&
+			    (!lowest_rq || rq->curr_prio < lowest_rq->curr_prio)) {
+				dst_cpu = cpu;
+				lowest_rq = rq;
+			}
+		}
+
+		if (!lowest_rq)
+			break;
+
+		if (double_lock_balance(this_rq, lowest_rq)) {
+			/*
+			 * We had to unlock the run queue. In
+			 * the mean time, next_task could have
+			 * migrated already or had its affinity changed.
+			 */
+			if (unlikely(task_rq(next_task) != this_rq ||
+				     !cpu_isset(dst_cpu, next_task->cpus_allowed))) {
+				spin_unlock(&lowest_rq->lock);
+				break;
+			}
+		}
+
+		/* if the prio of this runqueue changed, try again */
+		if (lowest_rq->curr_prio <= next_task->prio) {
+			spin_unlock(&lowest_rq->lock);
+			continue;
+		}
+
+		deactivate_task(this_rq, next_task, 0);
+		set_task_cpu(next_task, dst_cpu);
+		activate_task(lowest_rq, next_task, 0);
+
+		set_tsk_need_resched(lowest_rq->curr);
+
+		spin_unlock(&lowest_rq->lock);
+		ret = 1;
+
+		break;
+	}
+
+	put_task_struct(next_task);
+
+	return ret;
+}
+
+/*
  * Pull RT tasks from other CPUs in the RT-overload
  * case. Interrupts are disabled, local rq is locked.
  */
@@ -2207,7 +2289,8 @@ static inline void finish_task_switch(st
 	 * If we pushed an RT task off the runqueue,
 	 * then kick other CPUs, they might run it:
 	 */
-	if (unlikely(rt_task(current) && rq->rt_nr_running > 1)) {
+	rq->curr_prio = current->prio;
+	if (unlikely(rt_task(current) && push_rt_task(rq))) {
 		schedstat_inc(rq, rto_schedule);
 		smp_send_reschedule_allbutself_cpumask(current->cpus_allowed);
 	}
Index: linux-2.6.23-rc9-rt2/kernel/sched_rt.c
===================================================================
--- linux-2.6.23-rc9-rt2.orig/kernel/sched_rt.c
+++ linux-2.6.23-rc9-rt2/kernel/sched_rt.c
@@ -96,6 +96,48 @@ static struct task_struct *pick_next_tas
 	return next;
 }
 
+#ifdef CONFIG_PREEMPT_RT
+static struct task_struct *rt_next_highest_task(struct rq *rq)
+{
+	struct rt_prio_array *array = &rq->rt.active;
+	struct task_struct *next;
+	struct list_head *queue;
+	int idx;
+
+	if (likely (rq->rt_nr_running < 2))
+		return NULL;
+
+	idx = sched_find_first_bit(array->bitmap);
+	if (idx >= MAX_RT_PRIO) {
+		WARN_ON(1); /* rt_nr_running is bad */
+		return NULL;
+	}
+
+	queue = array->queue + idx;
+	if (queue->next->next != queue) {
+		/* same prio task */
+		next = list_entry(queue->next->next, struct task_struct, run_list);
+		goto out;
+	}
+
+	/* slower, but more flexible */
+	idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1);
+	if (idx >= MAX_RT_PRIO) {
+		WARN_ON(1); /* rt_nr_running was 2 and above! */
+		return NULL;
+	}
+
+	queue = array->queue + idx;
+	next = list_entry(queue->next, struct task_struct, run_list);
+
+ out:
+	return next;
+	
+}
+#else  /* CONFIG_PREEMPT_RT */
+
+#endif /* CONFIG_PREEMPT_RT */
+
 static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
 {
 	update_curr_rt(rq);



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH RT] push waiting rt tasks to cpus with lower prios.
  2007-10-09 17:59     ` [RFC PATCH RT] push waiting rt tasks to cpus with lower prios Steven Rostedt
@ 2007-10-09 18:14       ` Steven Rostedt
  2007-10-09 18:16       ` Peter Zijlstra
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 18+ messages in thread
From: Steven Rostedt @ 2007-10-09 18:14 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Peter Zijlstra, Ingo Molnar, linux-rt-users, kravetz, LKML,
	pmorreale, sdietrich


--
On Tue, 9 Oct 2007, Steven Rostedt wrote:

> This has been complied tested (and no more ;-)
>
>
> The idea here is when we find a situation that we just scheduled in an
> RT task and we either pushed a lesser RT task away or more than one RT
> task was scheduled on this CPU before scheduling occurred.
>
> The answer that this patch does is to do a O(n) search of CPUs for the
> CPU with the lowest prio task running. When that CPU is found the next
> highest RT task is pushed to that CPU.

I don't want that O(n) to scare anyone. It really is O(1), but with a
constant K = NR_CPUS. I was just saying that if you grow NR_CPUS, the search
grows too.

-- Steve


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH RT] push waiting rt tasks to cpus with lower prios.
  2007-10-09 17:59     ` [RFC PATCH RT] push waiting rt tasks to cpus with lower prios Steven Rostedt
  2007-10-09 18:14       ` Steven Rostedt
@ 2007-10-09 18:16       ` Peter Zijlstra
  2007-10-09 18:45         ` Steven Rostedt
  2007-10-09 20:39       ` mike kravetz
  2007-10-10  2:12       ` Girish kathalagiri
  3 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2007-10-09 18:16 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Gregory Haskins, Ingo Molnar, linux-rt-users, kravetz, LKML,
	pmorreale, sdietrich


On Tue, 2007-10-09 at 13:59 -0400, Steven Rostedt wrote:
> This has been complied tested (and no more ;-)
> 
> 
> The idea here is when we find a situation that we just scheduled in an
> RT task and we either pushed a lesser RT task away or more than one RT
> task was scheduled on this CPU before scheduling occurred.
> 
> The answer that this patch does is to do a O(n) search of CPUs for the
> CPU with the lowest prio task running. When that CPU is found the next
> highest RT task is pushed to that CPU.
> 
> Some notes:
> 
> 1) no lock is taken while looking for the lowest priority CPU. When one
> is found, only that CPU's lock is taken and after that a check is made
> to see if it is still a candidate to push the RT task over. If not, we
> try the search again, for a max of 3 tries.
> 
> 2) I only do this for the second highest RT task on the CPU queue. This
> can be easily changed to do it for all RT tasks until no more can be
> pushed off to other CPUs.
> 
> This is a simple approach right now, and is only being posted for
> comments.  I'm sure more can be done to make this more efficient or just
> simply better.
> 
> -- Steve

Do we really want this PREEMPT_RT only?

> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> 
> Index: linux-2.6.23-rc9-rt2/kernel/sched.c
> ===================================================================
> --- linux-2.6.23-rc9-rt2.orig/kernel/sched.c
> +++ linux-2.6.23-rc9-rt2/kernel/sched.c
> @@ -304,6 +304,7 @@ struct rq {
>  #ifdef CONFIG_PREEMPT_RT
>  	unsigned long rt_nr_running;
>  	unsigned long rt_nr_uninterruptible;
> +	int curr_prio;
>  #endif
>  
>  	unsigned long switch_timestamp;
> @@ -1485,6 +1486,87 @@ next_in_queue:
>  static int double_lock_balance(struct rq *this_rq, struct rq *busiest);
>  
>  /*
> + * If the current CPU has more than one RT task, see if the non
> + * running task can migrate over to a CPU that is running a task
> + * of lesser priority.
> + */
> +static int push_rt_task(struct rq *this_rq)
> +{
> +	struct task_struct *next_task;
> +	struct rq *lowest_rq = NULL;
> +	int tries;
> +	int cpu;
> +	int dst_cpu = -1;
> +	int ret = 0;
> +
> +	BUG_ON(!spin_is_locked(&this_rq->lock));

	assert_spin_locked(&this_rq->lock);

> +
> +	next_task = rt_next_highest_task(this_rq);
> +	if (!next_task)
> +		return 0;
> +
> +	/* We might release this_rq lock */
> +	get_task_struct(next_task);

Can the rest of the code suffer this? (the caller that is)

> +	/* Only try this algorithm three times */
> +	for (tries = 0; tries < 3; tries++) {

magic numbers.. maybe a magic #define with a descriptive name?

> +		/*
> +		 * Scan each rq for the lowest prio.
> +		 */
> +		for_each_cpu_mask(cpu, next_task->cpus_allowed) {
> +			struct rq *rq = &per_cpu(runqueues, cpu);
> +
> +			if (cpu == smp_processor_id())
> +				continue;
> +
> +			/* no locking for now */
> +			if (rq->curr_prio > next_task->prio &&
> +			    (!lowest_rq || rq->curr_prio < lowest_rq->curr_prio)) {
> +				dst_cpu = cpu;
> +				lowest_rq = rq;
> +			}
> +		}
> +
> +		if (!lowest_rq)
> +			break;
> +
> +		if (double_lock_balance(this_rq, lowest_rq)) {
> +			/*
> +			 * We had to unlock the run queue. In
> +			 * the mean time, next_task could have
> +			 * migrated already or had its affinity changed.
> +			 */
> +			if (unlikely(task_rq(next_task) != this_rq ||
> +				     !cpu_isset(dst_cpu, next_task->cpus_allowed))) {
> +				spin_unlock(&lowest_rq->lock);
> +				break;
> +			}
> +		}
> +
> +		/* if the prio of this runqueue changed, try again */
> +		if (lowest_rq->curr_prio <= next_task->prio) {
> +			spin_unlock(&lowest_rq->lock);
> +			continue;
> +		}
> +
> +		deactivate_task(this_rq, next_task, 0);
> +		set_task_cpu(next_task, dst_cpu);
> +		activate_task(lowest_rq, next_task, 0);
> +
> +		set_tsk_need_resched(lowest_rq->curr);

Use resched_task(), that will notify the remote cpu too.
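
(Roughly what resched_task() in kernel/sched.c does, simplified; the real
one also bails out if the flag is already set and checks whether the remote
CPU is polling before sending the IPI:)

static void resched_task(struct task_struct *p)
{
	int cpu = task_cpu(p);

	assert_spin_locked(&task_rq(p)->lock);
	set_tsk_need_resched(p);

	if (cpu != smp_processor_id())
		smp_send_reschedule(cpu);	/* the "notify the remote cpu" part */
}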

> +
> +		spin_unlock(&lowest_rq->lock);
> +		ret = 1;
> +
> +		break;
> +	}
> +
> +	put_task_struct(next_task);
> +
> +	return ret;
> +}
> +
> +/*
>   * Pull RT tasks from other CPUs in the RT-overload
>   * case. Interrupts are disabled, local rq is locked.
>   */
> @@ -2207,7 +2289,8 @@ static inline void finish_task_switch(st
>  	 * If we pushed an RT task off the runqueue,
>  	 * then kick other CPUs, they might run it:
>  	 */
> -	if (unlikely(rt_task(current) && rq->rt_nr_running > 1)) {
> +	rq->curr_prio = current->prio;
> +	if (unlikely(rt_task(current) && push_rt_task(rq))) {
>  		schedstat_inc(rq, rto_schedule);
>  		smp_send_reschedule_allbutself_cpumask(current->cpus_allowed);

Which will allow you to remove this thing.

>  	}
> Index: linux-2.6.23-rc9-rt2/kernel/sched_rt.c
> ===================================================================
> --- linux-2.6.23-rc9-rt2.orig/kernel/sched_rt.c
> +++ linux-2.6.23-rc9-rt2/kernel/sched_rt.c
> @@ -96,6 +96,48 @@ static struct task_struct *pick_next_tas
>  	return next;
>  }
>  
> +#ifdef CONFIG_PREEMPT_RT
> +static struct task_struct *rt_next_highest_task(struct rq *rq)
> +{
> +	struct rt_prio_array *array = &rq->rt.active;
> +	struct task_struct *next;
> +	struct list_head *queue;
> +	int idx;
> +
> +	if (likely (rq->rt_nr_running < 2))
> +		return NULL;
> +
> +	idx = sched_find_first_bit(array->bitmap);
> +	if (idx >= MAX_RT_PRIO) {
> +		WARN_ON(1); /* rt_nr__running is bad */
> +		return NULL;
> +	}
> +
> +	queue = array->queue + idx;
> +	if (queue->next->next != queue) {
> +		/* same prio task */
> +		next = list_entry(queue->next->next, struct task_struct, run_list);
> +		goto out;
> +	}
> +
> +	/* slower, but more flexible */
> +	idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1);
> +	if (idx >= MAX_RT_PRIO) {
> +		WARN_ON(1); /* rt_nr_running was 2 and above! */
> +		return NULL;
> +	}
> +
> +	queue = array->queue + idx;
> +	next = list_entry(queue->next, struct task_struct, run_list);
> +
> + out:
> +	return next;
> +	
> +}
> +#else  /* CONFIG_PREEMPT_RT */
> +
> +#endif /* CONFIG_PREEMPT_RT */
> +
>  static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
>  {
>  	update_curr_rt(rq);
> 
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH RT] push waiting rt tasks to cpus with lower prios.
  2007-10-09 18:16       ` Peter Zijlstra
@ 2007-10-09 18:45         ` Steven Rostedt
  0 siblings, 0 replies; 18+ messages in thread
From: Steven Rostedt @ 2007-10-09 18:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gregory Haskins, Ingo Molnar, linux-rt-users, kravetz, LKML,
	pmorreale, sdietrich


--

On Tue, 9 Oct 2007, Peter Zijlstra wrote:

>
> Do we really want this PREEMPT_RT only?

Yes, it will give us better benchmarks ;-)

>
> > Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> >
> > Index: linux-2.6.23-rc9-rt2/kernel/sched.c
> > ===================================================================
> > --- linux-2.6.23-rc9-rt2.orig/kernel/sched.c
> > +++ linux-2.6.23-rc9-rt2/kernel/sched.c
> > @@ -304,6 +304,7 @@ struct rq {
> >  #ifdef CONFIG_PREEMPT_RT
> >  	unsigned long rt_nr_running;
> >  	unsigned long rt_nr_uninterruptible;
> > +	int curr_prio;
> >  #endif
> >
> >  	unsigned long switch_timestamp;
> > @@ -1485,6 +1486,87 @@ next_in_queue:
> >  static int double_lock_balance(struct rq *this_rq, struct rq *busiest);
> >
> >  /*
> > + * If the current CPU has more than one RT task, see if the non
> > + * running task can migrate over to a CPU that is running a task
> > + * of lesser priority.
> > + */
> > +static int push_rt_task(struct rq *this_rq)
> > +{
> > +	struct task_struct *next_task;
> > +	struct rq *lowest_rq = NULL;
> > +	int tries;
> > +	int cpu;
> > +	int dst_cpu = -1;
> > +	int ret = 0;
> > +
> > +	BUG_ON(!spin_is_locked(&this_rq->lock));
>
> 	assert_spin_locked(&this_rq->lock);

Damn! I know that. Thanks, will fix.

>
> > +
> > +	next_task = rt_next_highest_task(this_rq);
> > +	if (!next_task)
> > +		return 0;
> > +
> > +	/* We might release this_rq lock */
> > +	get_task_struct(next_task);
>
> Can the rest of the code suffer this? (the caller that is)

I need to add a comment at the top to state that this function can
do this.  Now is it OK with the current caller? I need to look more
closely. I might need to change where this is actually called.

As stated, this hasn't been tested. But you are right, this needs to be
looked closely at.

>
> > +	/* Only try this algorithm three times */
> > +	for (tries = 0; tries < 3; tries++) {
>
> magic numbers.. maybe a magic #define with a descriptive name?

Hehe, that's one of the clean ups that need to be done ;-)

>
> > +		/*
> > +		 * Scan each rq for the lowest prio.
> > +		 */
> > +		for_each_cpu_mask(cpu, next_task->cpus_allowed) {
> > +			struct rq *rq = &per_cpu(runqueues, cpu);
> > +
> > +			if (cpu == smp_processor_id())
> > +				continue;
> > +
> > +			/* no locking for now */
> > +			if (rq->curr_prio > next_task->prio &&
> > +			    (!lowest_rq || rq->curr_prio < lowest_rq->curr_prio)) {
> > +				dst_cpu = cpu;
> > +				lowest_rq = rq;
> > +			}
> > +		}
> > +
> > +		if (!lowest_rq)
> > +			break;
> > +
> > +		if (double_lock_balance(this_rq, lowest_rq)) {
> > +			/*
> > +			 * We had to unlock the run queue. In
> > +			 * the mean time, next_task could have
> > +			 * migrated already or had its affinity changed.
> > +			 */
> > +			if (unlikely(task_rq(next_task) != this_rq ||
> > +				     !cpu_isset(dst_cpu, next_task->cpus_allowed))) {
> > +				spin_unlock(&lowest_rq->lock);
> > +				break;
> > +			}
> > +		}
> > +
> > +		/* if the prio of this runqueue changed, try again */
> > +		if (lowest_rq->curr_prio <= next_task->prio) {
> > +			spin_unlock(&lowest_rq->lock);
> > +			continue;
> > +		}
> > +
> > +		deactivate_task(this_rq, next_task, 0);
> > +		set_task_cpu(next_task, dst_cpu);
> > +		activate_task(lowest_rq, next_task, 0);
> > +
> > +		set_tsk_need_resched(lowest_rq->curr);
>
> Use resched_task(), that will notify the remote cpu too.

OK, will do.

>
> > +
> > +		spin_unlock(&lowest_rq->lock);
> > +		ret = 1;
> > +
> > +		break;
> > +	}
> > +
> > +	put_task_struct(next_task);
> > +
> > +	return ret;
> > +}
> > +
> > +/*
> >   * Pull RT tasks from other CPUs in the RT-overload
> >   * case. Interrupts are disabled, local rq is locked.
> >   */
> > @@ -2207,7 +2289,8 @@ static inline void finish_task_switch(st
> >  	 * If we pushed an RT task off the runqueue,
> >  	 * then kick other CPUs, they might run it:
> >  	 */
> > -	if (unlikely(rt_task(current) && rq->rt_nr_running > 1)) {
> > +	rq->curr_prio = current->prio;
> > +	if (unlikely(rt_task(current) && push_rt_task(rq))) {
> >  		schedstat_inc(rq, rto_schedule);
> >  		smp_send_reschedule_allbutself_cpumask(current->cpus_allowed);
>
> Which will allow you to remove this thing.

OK, will do.  Note, that this is where we need to see if it is ok to
release the runqueue lock.


>
> >  	}
> > Index: linux-2.6.23-rc9-rt2/kernel/sched_rt.c
> > ===================================================================
> > --- linux-2.6.23-rc9-rt2.orig/kernel/sched_rt.c
> > +++ linux-2.6.23-rc9-rt2/kernel/sched_rt.c
> > @@ -96,6 +96,48 @@ static struct task_struct *pick_next_tas
> >  	return next;
> >  }
> >
> > +#ifdef CONFIG_PREEMPT_RT
> > +static struct task_struct *rt_next_highest_task(struct rq *rq)
> > +{
> > +	struct rt_prio_array *array = &rq->rt.active;
> > +	struct task_struct *next;
> > +	struct list_head *queue;
> > +	int idx;
> > +
> > +	if (likely (rq->rt_nr_running < 2))
> > +		return NULL;
> > +
> > +	idx = sched_find_first_bit(array->bitmap);
> > +	if (idx >= MAX_RT_PRIO) {
> > +		WARN_ON(1); /* rt_nr__running is bad */
> > +		return NULL;
> > +	}
> > +
> > +	queue = array->queue + idx;
> > +	if (queue->next->next != queue) {
> > +		/* same prio task */
> > +		next = list_entry(queue->next->next, struct task_struct, run_list);
> > +		goto out;
> > +	}
> > +
> > +	/* slower, but more flexible */
> > +	idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1);
> > +	if (idx >= MAX_RT_PRIO) {
> > +		WARN_ON(1); /* rt_nr_running was 2 and above! */
> > +		return NULL;
> > +	}
> > +
> > +	queue = array->queue + idx;
> > +	next = list_entry(queue->next, struct task_struct, run_list);
> > +
> > + out:
> > +	return next;
> > +
> > +}
> > +#else  /* CONFIG_PREEMPT_RT */
> > +
> > +#endif /* CONFIG_PREEMPT_RT */
> > +
> >  static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
> >  {
> >  	update_curr_rt(rq);
> >

Thanks for taking the time to look it over.

-- Steve


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH RT] push waiting rt tasks to cpus with lower prios.
  2007-10-09 17:59     ` [RFC PATCH RT] push waiting rt tasks to cpus with lower prios Steven Rostedt
  2007-10-09 18:14       ` Steven Rostedt
  2007-10-09 18:16       ` Peter Zijlstra
@ 2007-10-09 20:39       ` mike kravetz
  2007-10-09 20:50         ` Steven Rostedt
  2007-10-10  2:12       ` Girish kathalagiri
  3 siblings, 1 reply; 18+ messages in thread
From: mike kravetz @ 2007-10-09 20:39 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Gregory Haskins, Peter Zijlstra, Ingo Molnar, linux-rt-users,
	LKML, pmorreale, sdietrich

On Tue, Oct 09, 2007 at 01:59:37PM -0400, Steven Rostedt wrote:
> This has been complied tested (and no more ;-)
> 
> The idea here is when we find a situation that we just scheduled in an
> RT task and we either pushed a lesser RT task away or more than one RT
> task was scheduled on this CPU before scheduling occurred.
> 
> The answer that this patch does is to do a O(n) search of CPUs for the
> CPU with the lowest prio task running. When that CPU is found the next
> highest RT task is pushed to that CPU.
> 
> Some notes:
> 
> 1) no lock is taken while looking for the lowest priority CPU. When one
> is found, only that CPU's lock is taken and after that a check is made
> to see if it is still a candidate to push the RT task over. If not, we
> try the search again, for a max of 3 tries.

I did something like this a while ago for another scheduling project.
A couple 'possible' optimizations to think about are:
1) Only scan the remote runqueues once and keep a local copy of the
   remote priorities for subsequent 'scans'.  Accessing the remote
   runqueues (CPU-specific cache lines) can be expensive.
2) When verifying priorities, just perform spin_trylock() on the remote
   runqueue.  If you can immediately get it great.  If not, it implies
   someone else is messing with the runqueue and there is a good chance
   the data you pre-fetched (curr->Priority) is invalid.  In this case
   it might be faster to just 'move on' to the next candidate runqueue/CPU.
   i.e. The next highest priority that the new task can preempt.

Of course, these 'optimizations' would change the algorithm.  Trying to
make any decision based on data that is changing is always a crap shoot. :)
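
A rough sketch of how the two ideas could combine (try_push_task() and the
cached_prio[] snapshot from idea 1 are hypothetical names; it assumes the
caller already holds this_rq->lock, as push_rt_task() does, and since
spin_trylock() never blocks there is no deadlock worry, at the cost of
sometimes skipping a valid candidate):

static int try_push_task(struct rq *this_rq, struct task_struct *p,
			 int *cached_prio)	/* one-shot snapshot, idea 1 */
{
	int cpu;

	for_each_cpu_mask(cpu, p->cpus_allowed) {
		struct rq *rq = cpu_rq(cpu);

		if (cpu == this_rq->cpu || cached_prio[cpu] <= p->prio)
			continue;		/* cannot preempt what runs there */

		if (!spin_trylock(&rq->lock))
			continue;		/* contended: move on, idea 2 */

		if (rq->curr->prio > p->prio) {	/* snapshot still valid? */
			deactivate_task(this_rq, p, 0);
			set_task_cpu(p, cpu);
			activate_task(rq, p, 0);
			resched_task(rq->curr);
			spin_unlock(&rq->lock);
			return 1;
		}
		spin_unlock(&rq->lock);
	}

	return 0;
}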
-- 
Mike

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH RT] push waiting rt tasks to cpus with lower prios.
  2007-10-09 20:39       ` mike kravetz
@ 2007-10-09 20:50         ` Steven Rostedt
  2007-10-09 21:17           ` mike kravetz
  0 siblings, 1 reply; 18+ messages in thread
From: Steven Rostedt @ 2007-10-09 20:50 UTC (permalink / raw)
  To: mike kravetz
  Cc: Gregory Haskins, Peter Zijlstra, Ingo Molnar, linux-rt-users,
	LKML, pmorreale, sdietrich


--
On Tue, 9 Oct 2007, mike kravetz wrote:

>
> I did something like this a while ago for another scheduling project.
> A couple 'possible' optimizations to think about are:
> 1) Only scan the remote runqueues once and keep a local copy of the
>    remote priorities for subsequent 'scans'.  Accessing the remote
>    runqueus (CPU specific cache lines) can be expensive.

You mean to keep the copy for the next two tries?

> 2) When verifying priorities, just perform spin_trylock() on the remote
>    runqueue.  If you can immediately get it great.  If not, it implies
>    someone else is messing with the runqueue and there is a good chance
>    the data you pre-fetched (curr->Priority) is invalid.  In this case
>    it might be faster to just 'move on' to the next candidate runqueue/CPU.
>    i.e. The next highest priority that the new task can preempt.

I was a bit scared of grabbing the lock anyway, because that's another
cache hit (write side). So only grabbing the lock when needed would save
us from dirtying the runqueue lock for each CPU.

>
> Of course, these 'optimizations' would change the algorithm.  Trying to
> make any decision based on data that is changing is always a crap shoot. :)

Yes indeed. The aim for now is to solve the latencies that you've been
seeing. But really, there are still holes (small ones) that can cause a
latency if a schedule happens "just right". Hopefully the final result of
this work will close them too.

-- Steve


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH RT] push waiting rt tasks to cpus with lower prios.
  2007-10-09 20:50         ` Steven Rostedt
@ 2007-10-09 21:17           ` mike kravetz
  0 siblings, 0 replies; 18+ messages in thread
From: mike kravetz @ 2007-10-09 21:17 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Gregory Haskins, Peter Zijlstra, Ingo Molnar, linux-rt-users,
	LKML, pmorreale, sdietrich

On Tue, Oct 09, 2007 at 04:50:47PM -0400, Steven Rostedt wrote:
> > I did something like this a while ago for another scheduling project.
> > A couple 'possible' optimizations to think about are:
> > 1) Only scan the remote runqueues once and keep a local copy of the
> >    remote priorities for subsequent 'scans'.  Accessing the remote
> >    runqueus (CPU specific cache lines) can be expensive.
> 
> You mean to keep the copy for the next two tries?

Yes.  But with #2 below, your next try is the runqueue/CPU that is the
next best candidate (after the trylock fails).  The 'hope' is that there
is more than one candidate CPU to push the task to.  Of course, you
always want to try and find the 'best' candidate.  My thoughts were that
if you could find ANY cpu to take the task that would be better than
sending the IPI everywhere.  With multiple runqueues/locks there is no
way you can be guaranteed of making the 'best' placement.  So, a good
placement may be enough.

> > 2) When verifying priorities, just perform spin_trylock() on the remote
> >    runqueue.  If you can immediately get it great.  If not, it implies
> >    someone else is messing with the runqueue and there is a good chance
> >    the data you pre-fetched (curr->Priority) is invalid.  In this case
> >    it might be faster to just 'move on' to the next candidate runqueue/CPU.
> >    i.e. The next highest priority that the new task can preempt.

-- 
Mike

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH RT] push waiting rt tasks to cpus with lower prios.
  2007-10-09 17:59     ` [RFC PATCH RT] push waiting rt tasks to cpus with lower prios Steven Rostedt
                         ` (2 preceding siblings ...)
  2007-10-09 20:39       ` mike kravetz
@ 2007-10-10  2:12       ` Girish kathalagiri
  3 siblings, 0 replies; 18+ messages in thread
From: Girish kathalagiri @ 2007-10-10  2:12 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Gregory Haskins, Peter Zijlstra, Ingo Molnar, linux-rt-users,
	kravetz, LKML, pmorreale, sdietrich

On 10/9/07, Steven Rostedt <rostedt@goodmis.org> wrote:
> This has been compile-tested (and no more ;-)
>
>
> The idea here is to handle the situation where we have just scheduled in
> an RT task and we either pushed a lesser RT task away or more than one RT
> task was scheduled on this CPU before scheduling occurred.
>
> What this patch does is an O(n) search of CPUs for the
> CPU with the lowest-prio task running. When that CPU is found, the next
> highest RT task is pushed to that CPU.

This can be extended: search for a CPU that is running a task of the
lowest priority, or of the same priority as the highest RT task that we
are trying to push.

If any CPU is found running a lower-priority task (the lowest among the
CPUs), push the task to that CPU as above.

Otherwise, if no CPU with a lower-priority task was found, find a CPU
that runs a task of the same priority. There are two cases:

Case 1: if the currently running task on this CPU has higher priority
than the running task (i.e. the active task's priority), then the RT
task can be pushed to that CPU, where it competes with the
similar-priority task in round-robin fashion.

Case 2: if the priority of the running task and of the task being
pushed are the same (from the same queue, queue->next->next...), then
the balancing has to be done on the number of tasks running on these
CPUs, making them run an equal (or nearly equal, considering the
ping-pong effect) number of tasks.
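
(A rough, untested sketch of that selection order, assuming the
rq->curr_prio field added by the patch below and the existing
rt_nr_running count; the helper name is made up.)

/*
 * Sketch only: prefer a CPU whose current task has strictly lower
 * priority; failing that, fall back to an equal-priority CPU and use
 * rt_nr_running to even out the queue lengths.
 */
static int pick_target_cpu(struct task_struct *p)
{
	int cpu, lower = -1, equal = -1;

	for_each_cpu_mask(cpu, p->cpus_allowed) {
		struct rq *rq = cpu_rq(cpu);

		if (cpu == smp_processor_id())
			continue;

		if (rq->curr_prio > p->prio) {
			/* lower-priority task running: best kind of target */
			if (lower < 0 ||
			    rq->curr_prio > cpu_rq(lower)->curr_prio)
				lower = cpu;
		} else if (rq->curr_prio == p->prio) {
			/* same priority: only useful if its queue is shorter */
			if (rq->rt_nr_running < this_rq()->rt_nr_running &&
			    (equal < 0 ||
			     rq->rt_nr_running < cpu_rq(equal)->rt_nr_running))
				equal = cpu;
		}
	}

	return lower >= 0 ? lower : equal;	/* -1 if nothing suitable */
}

The caller would still need to take the chosen runqueue's lock (or
trylock it) and re-verify before actually moving the task, as the patch
below does.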

>
> Some notes:
>
> 1) no lock is taken while looking for the lowest priority CPU. When one
> is found, only that CPU's lock is taken and after that a check is made
> to see if it is still a candidate to push the RT task over. If not, we
> try the search again, for a max of 3 tries.
>
> 2) I only do this for the second highest RT task on the CPU queue. This
> can be easily changed to do it for all RT tasks until no more can be
> pushed off to other CPUs.
>
> This is a simple approach right now, and is only being posted for
> comments.  I'm sure more can be done to make this more efficient or just
> simply better.
>
> -- Steve
>
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
>
> Index: linux-2.6.23-rc9-rt2/kernel/sched.c
> ===================================================================
> --- linux-2.6.23-rc9-rt2.orig/kernel/sched.c
> +++ linux-2.6.23-rc9-rt2/kernel/sched.c
> @@ -304,6 +304,7 @@ struct rq {
>  #ifdef CONFIG_PREEMPT_RT
>         unsigned long rt_nr_running;
>         unsigned long rt_nr_uninterruptible;
> +       int curr_prio;
>  #endif
>
>         unsigned long switch_timestamp;
> @@ -1485,6 +1486,87 @@ next_in_queue:
>  static int double_lock_balance(struct rq *this_rq, struct rq *busiest);
>
>  /*
> + * If the current CPU has more than one RT task, see if the non
> + * running task can migrate over to a CPU that is running a task
> + * of lesser priority.
> + */
> +static int push_rt_task(struct rq *this_rq)
> +{
> +       struct task_struct *next_task;
> +       struct rq *lowest_rq = NULL;
> +       int tries;
> +       int cpu;
> +       int dst_cpu = -1;
> +       int ret = 0;
> +
> +       BUG_ON(!spin_is_locked(&this_rq->lock));
> +
> +       next_task = rt_next_highest_task(this_rq);
> +       if (!next_task)
> +               return 0;
> +
> +       /* We might release this_rq lock */
> +       get_task_struct(next_task);
> +
> +       /* Only try this algorithm three times */
> +       for (tries = 0; tries < 3; tries++) {
> +               /*
> +                * Scan each rq for the lowest prio.
> +                */
> +               for_each_cpu_mask(cpu, next_task->cpus_allowed) {
> +                       struct rq *rq = &per_cpu(runqueues, cpu);
> +
> +                       if (cpu == smp_processor_id())
> +                               continue;
> +
> +                       /* no locking for now */
> +                       if (rq->curr_prio > next_task->prio &&
> +                           (!lowest_rq || rq->curr_prio < lowest_rq->curr_prio)) {
> +                               dst_cpu = cpu;
> +                               lowest_rq = rq;
> +                       }
> +               }
> +
> +               if (!lowest_rq)
> +                       break;
> +
> +               if (double_lock_balance(this_rq, lowest_rq)) {
> +                       /*
> +                        * We had to unlock the run queue. In
> +                        * the mean time, next_task could have
> +                        * migrated already or had its affinity changed.
> +                        */
> +                       if (unlikely(task_rq(next_task) != this_rq ||
> +                                    !cpu_isset(dst_cpu, next_task->cpus_allowed))) {
> +                               spin_unlock(&lowest_rq->lock);
> +                               break;
> +                       }
> +               }
> +
> +               /* if the prio of this runqueue changed, try again */
> +               if (lowest_rq->curr_prio <= next_task->prio) {
> +                       spin_unlock(&lowest_rq->lock);
> +                       continue;
> +               }
> +
> +               deactivate_task(this_rq, next_task, 0);
> +               set_task_cpu(next_task, dst_cpu);
> +               activate_task(lowest_rq, next_task, 0);
> +
> +               set_tsk_need_resched(lowest_rq->curr);
> +
> +               spin_unlock(&lowest_rq->lock);
> +               ret = 1;
> +
> +               break;
> +       }
> +
> +       put_task_struct(next_task);
> +
> +       return ret;
> +}
> +
> +/*
>   * Pull RT tasks from other CPUs in the RT-overload
>   * case. Interrupts are disabled, local rq is locked.
>   */
> @@ -2207,7 +2289,8 @@ static inline void finish_task_switch(st
>          * If we pushed an RT task off the runqueue,
>          * then kick other CPUs, they might run it:
>          */
> -       if (unlikely(rt_task(current) && rq->rt_nr_running > 1)) {
> +       rq->curr_prio = current->prio;
> +       if (unlikely(rt_task(current) && push_rt_task(rq))) {
>                 schedstat_inc(rq, rto_schedule);
>                 smp_send_reschedule_allbutself_cpumask(current->cpus_allowed);
>         }
> Index: linux-2.6.23-rc9-rt2/kernel/sched_rt.c
> ===================================================================
> --- linux-2.6.23-rc9-rt2.orig/kernel/sched_rt.c
> +++ linux-2.6.23-rc9-rt2/kernel/sched_rt.c
> @@ -96,6 +96,48 @@ static struct task_struct *pick_next_tas
>         return next;
>  }
>
> +#ifdef CONFIG_PREEMPT_RT
> +static struct task_struct *rt_next_highest_task(struct rq *rq)
> +{
> +       struct rt_prio_array *array = &rq->rt.active;
> +       struct task_struct *next;
> +       struct list_head *queue;
> +       int idx;
> +
> +       if (likely (rq->rt_nr_running < 2))
> +               return NULL;
> +
> +       idx = sched_find_first_bit(array->bitmap);
> +       if (idx >= MAX_RT_PRIO) {
> +               WARN_ON(1); /* rt_nr_running is bad */
> +               return NULL;
> +       }
> +
> +       queue = array->queue + idx;
> +       if (queue->next->next != queue) {
> +               /* same prio task */
> +               next = list_entry(queue->next->next, struct task_struct, run_list);
> +               goto out;
> +       }
> +
> +       /* slower, but more flexible */
> +       idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1);
> +       if (idx >= MAX_RT_PRIO) {
> +               WARN_ON(1); /* rt_nr_running was 2 and above! */
> +               return NULL;
> +       }
> +
> +       queue = array->queue + idx;
> +       next = list_entry(queue->next, struct task_struct, run_list);
> +
> + out:
> +       return next;
> +
> +}
> +#else  /* CONFIG_PREEMPT_RT */
> +
> +#endif /* CONFIG_PREEMPT_RT */
> +
>  static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
>  {
>         update_curr_rt(rq);
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


-- 
Thanks
   Giri

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2007-10-10  2:12 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-10-09 14:25 [PATCH 0/5] RT: scheduler fixes and rt_overload enhancements Gregory Haskins
2007-10-09 14:25 ` [PATCH 1/5] RT - fix for scheduling issue Gregory Haskins
2007-10-09 14:25 ` [PATCH 2/5] RT - fix reschedule IPI Gregory Haskins
2007-10-09 14:25 ` [PATCH 3/5] RT - fix mistargeted RESCHED_IPI Gregory Haskins
2007-10-09 14:26 ` [PATCH 4/5] RT: Add a per-cpu rt_overload indication Gregory Haskins
2007-10-09 14:26 ` [PATCH 5/5] RT - Track which CPUs should get IPI'd on rt-overload Gregory Haskins
2007-10-09 15:00 ` [PATCH 0/5] RT: scheduler fixes and rt_overload enhancements Peter Zijlstra
2007-10-09 15:00 ` Steven Rostedt
2007-10-09 15:33   ` Gregory Haskins
2007-10-09 15:39     ` Peter Zijlstra
2007-10-09 17:59     ` [RFC PATCH RT] push waiting rt tasks to cpus with lower prios Steven Rostedt
2007-10-09 18:14       ` Steven Rostedt
2007-10-09 18:16       ` Peter Zijlstra
2007-10-09 18:45         ` Steven Rostedt
2007-10-09 20:39       ` mike kravetz
2007-10-09 20:50         ` Steven Rostedt
2007-10-09 21:17           ` mike kravetz
2007-10-10  2:12       ` Girish kathalagiri
