* Bug in scheduler when using rt_mutex
@ 2011-01-17 14:42 Onkalo Samu
  2011-01-17 15:00 ` Peter Zijlstra
  2011-01-17 16:00 ` Peter Zijlstra
  0 siblings, 2 replies; 42+ messages in thread
From: Onkalo Samu @ 2011-01-17 14:42 UTC (permalink / raw)
  To: mingo, peterz; +Cc: Onkalo Samu.P, linux-kernel


Hi

I believe that there are some problems in the scheduling when
the following happens:
- Normal priority process locks rt_mutex and sleeps while keeping it
locked.
- RT priority process blocks on the rt_mutex while normal priority
process is sleeping

This sequence can occur with I2C access when both normal priority
thread and irq-thread access the same I2C bus. I2C core
contains rt_mutex and I2C drivers can sleep with wait_for_completion.
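
For illustration, the pattern looks roughly like this (a simplified
sketch, not the actual i2c-core code; the bus_lock/master_xfer names are
from the 2.6.37 sources as I recall them):

	rt_mutex_lock(&adap->bus_lock);		/* i2c-core serialization */
	/*
	 * The bus driver typically starts the transfer in hardware and
	 * then blocks with wait_for_completion() until the transfer-done
	 * interrupt fires, i.e. it sleeps while the rt_mutex is held.
	 */
	ret = adap->algo->master_xfer(adap, msgs, num);
	rt_mutex_unlock(&adap->bus_lock);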


I have seen the following failure happen (also with 2.6.37):

A user process accesses some device handle or sysfs entry which finally
makes an I2C access. The I2C core contains rt_mutex protection against
parallel access. Sometimes, after the rt_mutex is unlocked, the user
process is not running for a long time (several minutes). This can occur
when there are only a small number of user processes running. In my test
case there was only cat /dev/zero > /dev/null running in the background
and another process accessing a sysfs entry.

Example:

cat /dev/zero > /dev/null &
while [ 1 ] ; do
cat /sys/devices/platform/lis3lv02d/selftest
done

The selftest causes I2C accesses from both the user process and the
irq-thread.

Based on my debugging, the following sequence occurs (single-CPU
system):

1) There is some user process running in the background (like
cat /dev/zero..)
2) User process reads a sysfs entry which causes an I2C access
3) User process locks the rt_mutex in the I2C core
4) User process sleeps while it keeps the rt_mutex locked
(wait_for_completion in the I2C transfer function)
5) irq-thread is kicked to run
6) irq-thread tries to take the rt_mutex which is already locked by the
user process
7) sleeping user process is promoted to irq-thread priority (RT class)
8) user process is woken up by completion mechanism and it finishes its
job
9) user process unlocks rt_mutex and is changed back to old priority and
scheduling class
10) irq-thread continues as expected

The user process gets stuck at phase 9. The scheduler may skip that
process for a long time.

Based on my analysis, the vruntime calculation fails for the user
process. At phase 9, the vruntime for that sched_entity is much bigger
than for other processes, which leads to it not being scheduled for a
long time.

The problem is that at phase 7 the user process is sleeping, so the
rt_mutex priority change is applied to a sleeping task. se.vruntime is
not modified, and when the user process continues running, se.vruntime
contains about twice the cfs_rq.min_vruntime value.
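
To illustrate with made-up numbers (units arbitrary, just a sketch of
the arithmetic):

	cfs_rq.min_vruntime ~ 1000
	- task sleeps holding the rt_mutex:
	  dequeue_entity(DEQUEUE_SLEEP) -> vruntime stays absolute, ~1000
	- task is boosted to RT while sleeping (!on_rq):
	  no dequeue/enqueue through the fair class, vruntime untouched
	- task wakes as an RT task, runs and unlocks the rt_mutex
	- rt_mutex_setprio() moves it back to the fair class and enqueues
	  it with neither ENQUEUE_WAKEUP nor ENQUEUE_WAKING set:
	  vruntime += min_vruntime -> ~2000, far to the right of every
	  other task on the queue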

Success case:
- user process locks rt_mutex
- irq-thread causes user process to be promoted to RT level while the
user process is in the running and "on_rq == 1" state
-> dequeue_task is called which modifies se.vruntime
dequeue_entity function:

	if (!(flags & DEQUEUE_SLEEP))
		se->vruntime -= cfs_rq->min_vruntime;

When the process is moved back from rt to normal priority, enqueue_task
updates vruntime again to the correct value:
enqueue_entity:
	if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING))
		se->vruntime += cfs_rq->min_vruntime;


Failure case:
- user process locks rt_mutex
- and goes to sleep (wait_for_completion etc.)
- user process is dequeued to sleep state
-> vruntime is not updated in dequeue_entity

- irq-thread blocks on the rt_mutex and the user process is promoted to
RT priority
- User process wakes up and continues until it releases the rt_mutex
-> User process is moved from the rt queue to the cfs queue. WAKEUP /
WAKING flags are not set, so vruntime is updated to an incorrect value.

I have a simple dummy driver which demonstrates the case. It is tested
on a single-CPU embedded system with 2.6.37.
I also have a correction proposal, but it is quite possible that there
is a better way to do this, and I may be missing some case entirely.
The scheduler is quite a complex thing. I'll send patches for the test
case and for the proposal.

Br, Samu Onkalo

* Re: Bug in scheduler when using rt_mutex
  2011-01-17 14:42 Bug in scheduler when using rt_mutex Onkalo Samu
@ 2011-01-17 15:00 ` Peter Zijlstra
  2011-01-17 15:15   ` samu.p.onkalo
  2011-01-17 16:00 ` Peter Zijlstra
  1 sibling, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2011-01-17 15:00 UTC (permalink / raw)
  To: samu.p.onkalo; +Cc: mingo, linux-kernel, tglx

On Mon, 2011-01-17 at 16:42 +0200, Onkalo Samu wrote:
> Hi
> 
> I believe that there are some problems in the scheduling when
> the following happens:
> - Normal priority process locks rt_mutex and sleeps while keeping it
> locked.

There's your fail, don't do that!

> - RT priority process blocks on the rt_mutex while normal priority
> process is sleeping
> 
> This sequence can occur with I2C access when both normal priority
> thread and irq-thread access the same I2C bus. I2C core
> contains rt_mutex and I2C drivers can sleep with wait_for_completion.

Why does I2C core use rt_mutex, that's utterly broken.

> Based on my debugging, the following sequence occurs (single-CPU
> system):
> 
> 1) There is some user process running in the background (like
> cat /dev/zero..)
> 2) User process reads a sysfs entry which causes an I2C access
> 3) User process locks the rt_mutex in the I2C core
> 4) User process sleeps while it keeps the rt_mutex locked
> (wait_for_completion in the I2C transfer function)

That's where things go wrong, there's absolutely nothing you can do to
fix the system once you block while holding a mutex.



* RE: Bug in scheduler when using rt_mutex
  2011-01-17 15:00 ` Peter Zijlstra
@ 2011-01-17 15:15   ` samu.p.onkalo
  2011-01-17 15:28     ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: samu.p.onkalo @ 2011-01-17 15:15 UTC (permalink / raw)
  To: peterz; +Cc: mingo, linux-kernel, tglx


>-----Original Message-----
>From: ext Peter Zijlstra [mailto:peterz@infradead.org]
>Sent: 17 January, 2011 17:00
>To: Onkalo Samu.P (Nokia-MS/Tampere)
>Cc: mingo@elte.hu; linux-kernel@vger.kernel.org; tglx
>Subject: Re: Bug in scheduler when using rt_mutex
>
>On Mon, 2011-01-17 at 16:42 +0200, Onkalo Samu wrote:
>> Hi
>>
>> I believe that there are some problems in the scheduling when
>> the following happens:
>> - Normal priority process locks rt_mutex and sleeps while keeping it
>> locked.
>
>There's your fail, don't do that!

So that is forbidden:

rt_mutex_lock();
wait_for_completion(); <--- shared HW finishes its job
rt_mutex_unlock();



>
>> - RT priority process blocks on the rt_mutex while normal priority
>> process is sleeping
>>
>> This sequence can occur with I2C access when both normal priority
>> thread and irq-thread access the same I2C bus. I2C core
>> contains rt_mutex and I2C drivers can sleep with wait_for_completion.
>
>Why does I2C core use rt_mutex, that's utterly broken.

To let a low-priority task finish an ongoing I2C access in time under
heavy load, I think.

>
>> Based on my debugging, the following sequence occurs (single-CPU
>> system):
>>
>> 1) There is some user process running in the background (like
>> cat /dev/zero..)
>> 2) User process reads a sysfs entry which causes an I2C access
>> 3) User process locks the rt_mutex in the I2C core
>> 4) User process sleeps while it keeps the rt_mutex locked
>> (wait_for_completion in the I2C transfer function)
>
>That's where things go wrong, there's absolutely nothing you can do to
>fix the system once you block while holding a mutex.

Of course other processes are waiting until the (rt_)mutex is unlocked.
The problem is that after the rt_mutex_unlock is done, the task which
just released the lock may be in some non-running state for minutes.

 -Samu



* RE: Bug in scheduler when using rt_mutex
  2011-01-17 15:15   ` samu.p.onkalo
@ 2011-01-17 15:28     ` Peter Zijlstra
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Zijlstra @ 2011-01-17 15:28 UTC (permalink / raw)
  To: samu.p.onkalo; +Cc: mingo, linux-kernel, tglx

On Mon, 2011-01-17 at 15:15 +0000, samu.p.onkalo@nokia.com wrote:
> 
> >-----Original Message-----
> >From: ext Peter Zijlstra [mailto:peterz@infradead.org]
> >Sent: 17 January, 2011 17:00
> >To: Onkalo Samu.P (Nokia-MS/Tampere)
> >Cc: mingo@elte.hu; linux-kernel@vger.kernel.org; tglx
> >Subject: Re: Bug in scheduler when using rt_mutex
> >
> >On Mon, 2011-01-17 at 16:42 +0200, Onkalo Samu wrote:
> >> Hi
> >>
> >> I believe that there are some problems in the scheduling when
> >> the following happens:
> >> - Normal priority process locks rt_mutex and sleeps while keeping it
> >> locked.
> >
> >There's your fail, don't do that!
> 
> So that is forbidden:
> 
> rt_mutex_lock();
> wait_for_completion(); <--- shared HW finishes its job
> rt_mutex_unlock();

Well, it's pointless, it's non-deterministic, so you totally void the
usage of rt_mutex.

> >Why does I2C core use rt_mutex, that's utterly broken.
> 
> To let a low-priority task finish an ongoing I2C access in time under
> heavy load, I think.

FYI, I'm queueing a revert for that patch. Random driver junk should not
_ever_ use that.

> >> Based on my debugging, the following sequence occurs (single-CPU
> >> system):
> >>
> >> 1) There is some user process running in the background (like
> >> cat /dev/zero..)
> >> 2) User process reads a sysfs entry which causes an I2C access
> >> 3) User process locks the rt_mutex in the I2C core
> >> 4) User process sleeps while it keeps the rt_mutex locked
> >> (wait_for_completion in the I2C transfer function)
> >
> >That's where things go wrong, there's absolutely nothing you can do to
> >fix the system once you block while holding a mutex.
> 
> Of course other processes are waiting until the (rt_)mutex is unlocked.
> The problem is that after the rt_mutex_unlock is done, the task which
> just released the lock may be in some non-running state for minutes.

Yeah, saw that, writing a patch for that, there's more than one problem
there.



* Re: Bug in scheduler when using rt_mutex
  2011-01-17 14:42 Bug in scheduler when using rt_mutex Onkalo Samu
  2011-01-17 15:00 ` Peter Zijlstra
@ 2011-01-17 16:00 ` Peter Zijlstra
  2011-01-18  8:23   ` Onkalo Samu
  1 sibling, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2011-01-17 16:00 UTC (permalink / raw)
  To: samu.p.onkalo; +Cc: mingo, linux-kernel, tglx

On Mon, 2011-01-17 at 16:42 +0200, Onkalo Samu wrote:
> 
> Failure case:
> - user process locks rt_mutex
> - and goes to sleep (wait_for_completion etc.)
> - user process is dequeued to sleep state
> -> vruntime is not updated in dequeue_entity
> 

Does the below (completely untested) patch help?

---
 kernel/sched.c      |    6 +++++-
 kernel/sched_fair.c |   11 +++++++++++
 2 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index a0eb094..be09581 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8108,8 +8108,10 @@ EXPORT_SYMBOL(__might_sleep);
 #ifdef CONFIG_MAGIC_SYSRQ
 static void normalize_task(struct rq *rq, struct task_struct *p)
 {
+	struct sched_class *prev_class = p->sched_class;
+	int old_prio = p->prio;
 	int on_rq;
-
+       
 	on_rq = p->se.on_rq;
 	if (on_rq)
 		deactivate_task(rq, p, 0);
@@ -8118,6 +8120,8 @@ static void normalize_task(struct rq *rq, struct task_struct *p)
 		activate_task(rq, p, 0);
 		resched_task(rq->curr);
 	}
+
+	check_class_changed(rq, p, prev_class, old_prio, task_current(rq, p));
 }
 
 void normalize_rt_tasks(void)
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c62ebae..0a27b00 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -4072,6 +4072,17 @@ static void prio_changed_fair(struct rq *rq, struct task_struct *p,
 static void switched_to_fair(struct rq *rq, struct task_struct *p,
 			     int running)
 {
+	struct sched_entity *se = &p->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+	if (se->on_rq && cfs_rq->curr != se)
+		__dequeue_entity(cfs_rq, se);
+
+	place_entity(cfs_rq, se, 0);
+
+	if (se->on_rq && cfs_rq->curr != se)
+		__enqueue_entity(cfs_rq, se);
+
 	/*
 	 * We were most likely switched from sched_rt, so
 	 * kick off the schedule if running, otherwise just see



* Re: Bug in scheduler when using rt_mutex
  2011-01-17 16:00 ` Peter Zijlstra
@ 2011-01-18  8:23   ` Onkalo Samu
  2011-01-18  8:59     ` Yong Zhang
  0 siblings, 1 reply; 42+ messages in thread
From: Onkalo Samu @ 2011-01-18  8:23 UTC (permalink / raw)
  To: ext Peter Zijlstra; +Cc: mingo, linux-kernel, tglx, Onkalo Samu.P

On Mon, 2011-01-17 at 17:00 +0100, ext Peter Zijlstra wrote:
> On Mon, 2011-01-17 at 16:42 +0200, Onkalo Samu wrote:
> > 
> > Failure case:
> > - user process locks rt_mutex
> > - and goes to sleep (wait_for_completion etc.)
> > - user process is dequeued to sleep state
> > -> vruntime is not updated in dequeue_entity
> > 
> 
> Does the below (completely untested) patch help?

Unfortunately no. The user process is still stuck.
It is stuck for about a time equal to min_vruntime.

Background of the rt_mutex in i2c-core:

http://www.mail-archive.com/linux-i2c@vger.kernel.org/msg01631.html

In phones the touch-screen controller can be connected to I2C, which makes it very timing-sensitive.

The patch below shows what goes wrong, and it at least corrects the
case as far as I can see.

The idea is to reset vruntime to min_vruntime when the task goes back
from the rt queue to the cfs queue (the patch sets se->vruntime to 0
just before enqueue_entity() adds cfs_rq->min_vruntime back, so the net
result is min_vruntime).

-Samu

---
 include/linux/sched.h |    1 +
 kernel/sched.c        |   12 +++++++++---
 kernel/sched_fair.c   |    8 ++++++++
 3 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d747f94..1c7f873 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1049,6 +1049,7 @@ struct sched_domain;
 #define ENQUEUE_WAKEUP		1
 #define ENQUEUE_WAKING		2
 #define ENQUEUE_HEAD		4
+#define ENQUEUE_FROM_RTMUTEX    8
 
 #define DEQUEUE_SLEEP		1
 
diff --git a/kernel/sched.c b/kernel/sched.c
index ea3e5ef..4724e25 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4544,6 +4544,7 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 {
 	unsigned long flags;
 	int oldprio, on_rq, running;
+	int enqueue_flags = 0;
 	struct rq *rq;
 	const struct sched_class *prev_class;
 
@@ -4561,17 +4562,22 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 	if (running)
 		p->sched_class->put_prev_task(rq, p);
 
-	if (rt_prio(prio))
+	if (rt_prio(prio)) {
 		p->sched_class = &rt_sched_class;
-	else
+	} else {
 		p->sched_class = &fair_sched_class;
+		if (p->sched_class != prev_class)
+			enqueue_flags = ENQUEUE_FROM_RTMUTEX;
+	}
 
 	p->prio = prio;
 
 	if (running)
 		p->sched_class->set_curr_task(rq);
 	if (on_rq) {
-		enqueue_task(rq, p, oldprio < prio ? ENQUEUE_HEAD : 0);
+
+		enqueue_task(rq, p, (oldprio < prio ? ENQUEUE_HEAD : 0) |
+			enqueue_flags);
 
 		check_class_changed(rq, p, prev_class, oldprio, running);
 	}
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c62ebae..ff670b4 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -941,6 +941,14 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
+
+	/*
+	 * vruntime may be incorrect if coming from rt to non rt priority.
+	 * Reset vruntime so that it has some valid value. WAKE* flags are not
+	 * set in this case.
+	 */
+	if (unlikely(flags & ENQUEUE_FROM_RTMUTEX))
+		se->vruntime = 0;
 	/*
 	 * Update the normalized vruntime before updating min_vruntime
 	 * through callig update_curr().
-- 


> 
> ---
>  kernel/sched.c      |    6 +++++-
>  kernel/sched_fair.c |   11 +++++++++++
>  2 files changed, 16 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index a0eb094..be09581 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -8108,8 +8108,10 @@ EXPORT_SYMBOL(__might_sleep);
>  #ifdef CONFIG_MAGIC_SYSRQ
>  static void normalize_task(struct rq *rq, struct task_struct *p)
>  {
> +	struct sched_class *prev_class = p->sched_class;
> +	int old_prio = p->prio;
>  	int on_rq;
> -
> +       
>  	on_rq = p->se.on_rq;
>  	if (on_rq)
>  		deactivate_task(rq, p, 0);
> @@ -8118,6 +8120,8 @@ static void normalize_task(struct rq *rq, struct task_struct *p)
>  		activate_task(rq, p, 0);
>  		resched_task(rq->curr);
>  	}
> +
> +	check_class_changed(rq, p, prev_class, old_prio, task_current(rq, p));
>  }
>  
>  void normalize_rt_tasks(void)
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index c62ebae..0a27b00 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -4072,6 +4072,17 @@ static void prio_changed_fair(struct rq *rq, struct task_struct *p,
>  static void switched_to_fair(struct rq *rq, struct task_struct *p,
>  			     int running)
>  {
> +	struct sched_entity *se = &p->se;
> +	struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +
> +	if (se->on_rq && cfs_rq->curr != se)
> +		__dequeue_entity(cfs_rq, se);
> +
> +	place_entity(cfs_rq, se, 0);
> +
> +	if (se->on_rq && cfs_rq->curr != se)
> +		__enqueue_entity(cfs_rq, se);
> +
>  	/*
>  	 * We were most likely switched from sched_rt, so
>  	 * kick off the schedule if running, otherwise just see
> 




* Re: Bug in scheduler when using rt_mutex
  2011-01-18  8:23   ` Onkalo Samu
@ 2011-01-18  8:59     ` Yong Zhang
  2011-01-18 13:35       ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Yong Zhang @ 2011-01-18  8:59 UTC (permalink / raw)
  To: samu.p.onkalo; +Cc: ext Peter Zijlstra, mingo, linux-kernel, tglx

On Tue, Jan 18, 2011 at 4:23 PM, Onkalo Samu <samu.p.onkalo@nokia.com> wrote:
> On Mon, 2011-01-17 at 17:00 +0100, ext Peter Zijlstra wrote:
>> On Mon, 2011-01-17 at 16:42 +0200, Onkalo Samu wrote:
>> >
>> > Failure case:
>> > - user process locks rt_mutex
>> > - and goes to sleep (wait_for_completion etc.)
>> > - user process is dequeued to sleep state
>> > -> vruntime is not updated in dequeue_entity
>> >
>>
>> Does the below (completely untested) patch help?
>
> Unfortunately no.

It couldn't work because place_entity() will not place it
backwards.


-- 
Only stand for myself


* Re: Bug in scheduler when using rt_mutex
  2011-01-18  8:59     ` Yong Zhang
@ 2011-01-18 13:35       ` Peter Zijlstra
  2011-01-18 14:25         ` Onkalo Samu
  2011-01-19  2:38         ` Yong Zhang
  0 siblings, 2 replies; 42+ messages in thread
From: Peter Zijlstra @ 2011-01-18 13:35 UTC (permalink / raw)
  To: Yong Zhang; +Cc: samu.p.onkalo, mingo, linux-kernel, tglx

On Tue, 2011-01-18 at 16:59 +0800, Yong Zhang wrote:
> On Tue, Jan 18, 2011 at 4:23 PM, Onkalo Samu <samu.p.onkalo@nokia.com> wrote:
> > On Mon, 2011-01-17 at 17:00 +0100, ext Peter Zijlstra wrote:
> >> On Mon, 2011-01-17 at 16:42 +0200, Onkalo Samu wrote:
> >> >
> >> > Failure case:
> >> > - user process locks rt_mutex
> >> > - and goes to sleep (wait_for_completion etc.)
> >> > - user process is dequeued to sleep state
> >> > -> vruntime is not updated in dequeue_entity
> >> >
> >>
> >> Does the below (completely untested) patch help?
> >
> > Unfortunately no.
> 
> It couldn't work because place_entity() will not place it
> backwards.

Ah indeed, I was somehow assuming it was way left, but that is not at
all true, something like the below then..

---
Subject: sched: Fix switch_to_fair()
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Mon Jan 17 17:03:27 CET 2011

When a task is placed back into fair_sched_class, we must update its
placement, since we don't know how long it's been gone, hence its
vruntime is stale and cannot be trusted.

There is also a case where it was moved from fair_sched_class when it
was in a blocked state and moved back while it is running; this causes
an imbalance between DEQUEUE_SLEEP/ENQUEUE_WAKEUP for the fair class
and leaves vruntime way out there (due to the min_vruntime
adjustment).

Also update sysrq-n to call the ->switched_{to,from} methods.

Reported-by: Onkalo Samu <samu.p.onkalo@nokia.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c      |    4 ++++
 kernel/sched_fair.c |   16 ++++++++++++++++
 2 files changed, 20 insertions(+)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -8106,6 +8106,8 @@ EXPORT_SYMBOL(__might_sleep);
 #ifdef CONFIG_MAGIC_SYSRQ
 static void normalize_task(struct rq *rq, struct task_struct *p)
 {
+	struct sched_class *prev_class = p->sched_class;
+	int old_prio = p->prio;
 	int on_rq;
 
 	on_rq = p->se.on_rq;
@@ -8116,6 +8118,8 @@ static void normalize_task(struct rq *rq
 		activate_task(rq, p, 0);
 		resched_task(rq->curr);
 	}
+
+	check_class_changed(rq, p, prev_class, old_prio, task_current(rq, p));
 }
 
 void normalize_rt_tasks(void)
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -4075,6 +4075,22 @@ static void prio_changed_fair(struct rq
 static void switched_to_fair(struct rq *rq, struct task_struct *p,
 			     int running)
 {
+	struct sched_entity *se = &p->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+	if (se->on_rq && cfs_rq->curr != se)
+		__dequeue_entity(cfs_rq, se);
+
+	/*
+	 * se->vruntime can be completely out there, there is no telling
+	 * how long this task was !fair and on what CPU if any it became
+	 * !fair. Therefore, reset it to a known, reasonable value.
+	 */
+	se->vruntime = cfs_rq->min_vruntime;
+
+	if (se->on_rq && cfs_rq->curr != se)
+		__enqueue_entity(cfs_rq, se);
+
 	/*
 	 * We were most likely switched from sched_rt, so
 	 * kick off the schedule if running, otherwise just see



* Re: Bug in scheduler when using rt_mutex
  2011-01-18 13:35       ` Peter Zijlstra
@ 2011-01-18 14:25         ` Onkalo Samu
  2011-01-19  2:38         ` Yong Zhang
  1 sibling, 0 replies; 42+ messages in thread
From: Onkalo Samu @ 2011-01-18 14:25 UTC (permalink / raw)
  To: ext Peter Zijlstra; +Cc: Yong Zhang, mingo, linux-kernel, tglx

On Tue, 2011-01-18 at 14:35 +0100, ext Peter Zijlstra wrote:
> On Tue, 2011-01-18 at 16:59 +0800, Yong Zhang wrote:
> > On Tue, Jan 18, 2011 at 4:23 PM, Onkalo Samu <samu.p.onkalo@nokia.com> wrote:
> > > On Mon, 2011-01-17 at 17:00 +0100, ext Peter Zijlstra wrote:
> > >> On Mon, 2011-01-17 at 16:42 +0200, Onkalo Samu wrote:
> > >> >
> > >> > Failure case:
> > >> > - user process locks rt_mutex
> > >> > - and goes to sleep (wait_for_completion etc.)
> > >> > - user process is dequeued to sleep state
> > >> > -> vruntime is not updated in dequeue_entity
> > >> >
> > >>
> > >> Does the below (completely untested) patch help?
> > >
> > > Unfortunately no.
> > 
> > It couldn't work because place_entity() will not place it
> > backwards.
> 
> Ah indeed, I was somehow assuming it was way left, but that is not at
> all true, something like the below then..
> 

First trials show that my test case no longer gets stuck. I'll continue
testing. Thanks.

How about the i2c-core case: do you see it so that the rt_mutex must be
taken away? It reduces latencies for the I2C accesses in the
irq-threads.

-Samu

> ---
> Subject: sched: Fix switch_to_fair()
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Mon Jan 17 17:03:27 CET 2011
> 
> When a task is placed back into fair_sched_class, we must update its
> placement, since we don't know how long it's been gone, hence its
> vruntime is stale and cannot be trusted.
> 
> There is also a case where it was moved from fair_sched_class when it
> was in a blocked state and moved back while it is running; this causes
> an imbalance between DEQUEUE_SLEEP/ENQUEUE_WAKEUP for the fair class
> and leaves vruntime way out there (due to the min_vruntime
> adjustment).
> 
> Also update sysrq-n to call the ->switched_{to,from} methods.
> 
> Reported-by: Onkalo Samu <samu.p.onkalo@nokia.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  kernel/sched.c      |    4 ++++
>  kernel/sched_fair.c |   16 ++++++++++++++++
>  2 files changed, 20 insertions(+)
> 
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -8106,6 +8106,8 @@ EXPORT_SYMBOL(__might_sleep);
>  #ifdef CONFIG_MAGIC_SYSRQ
>  static void normalize_task(struct rq *rq, struct task_struct *p)
>  {
> +	struct sched_class *prev_class = p->sched_class;
> +	int old_prio = p->prio;
>  	int on_rq;
>  
>  	on_rq = p->se.on_rq;
> @@ -8116,6 +8118,8 @@ static void normalize_task(struct rq *rq
>  		activate_task(rq, p, 0);
>  		resched_task(rq->curr);
>  	}
> +
> +	check_class_changed(rq, p, prev_class, old_prio, task_current(rq, p));
>  }
>  
>  void normalize_rt_tasks(void)
> Index: linux-2.6/kernel/sched_fair.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched_fair.c
> +++ linux-2.6/kernel/sched_fair.c
> @@ -4075,6 +4075,22 @@ static void prio_changed_fair(struct rq
>  static void switched_to_fair(struct rq *rq, struct task_struct *p,
>  			     int running)
>  {
> +	struct sched_entity *se = &p->se;
> +	struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +
> +	if (se->on_rq && cfs_rq->curr != se)
> +		__dequeue_entity(cfs_rq, se);
> +
> +	/*
> +	 * se->vruntime can be completely out there, there is no telling
> +	 * how long this task was !fair and on what CPU if any it became
> +	 * !fair. Therefore, reset it to a known, reasonable value.
> +	 */
> +	se->vruntime = cfs_rq->min_vruntime;
> +
> +	if (se->on_rq && cfs_rq->curr != se)
> +		__enqueue_entity(cfs_rq, se);
> +
>  	/*
>  	 * We were most likely switched from sched_rt, so
>  	 * kick off the schedule if running, otherwise just see
> 




* Re: Bug in scheduler when using rt_mutex
  2011-01-18 13:35       ` Peter Zijlstra
  2011-01-18 14:25         ` Onkalo Samu
@ 2011-01-19  2:38         ` Yong Zhang
  2011-01-19  3:43           ` Mike Galbraith
  2011-01-19  9:44           ` Peter Zijlstra
  1 sibling, 2 replies; 42+ messages in thread
From: Yong Zhang @ 2011-01-19  2:38 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: samu.p.onkalo, mingo, linux-kernel, tglx

On Tue, Jan 18, 2011 at 9:35 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> Subject: sched: Fix switch_to_fair()
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Mon Jan 17 17:03:27 CET 2011
>
> When a task is placed back into fair_sched_class, we must update its
> placement, since we don't know how long it's been gone, hence its
> vruntime is stale and cannot be trusted.
>
> There is also a case where it was moved from fair_sched_class when it
> was in a blocked state and moved back while it is running; this causes
> an imbalance between DEQUEUE_SLEEP/ENQUEUE_WAKEUP for the fair class
> and leaves vruntime way out there (due to the min_vruntime
> adjustment).
>
> Also update sysrq-n to call the ->switched_{to,from} methods.
>
> Reported-by: Onkalo Samu <samu.p.onkalo@nokia.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  kernel/sched.c      |    4 ++++
>  kernel/sched_fair.c |   16 ++++++++++++++++
>  2 files changed, 20 insertions(+)
>
> Index: linux-2.6/kernel/sched_fair.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched_fair.c
> +++ linux-2.6/kernel/sched_fair.c
> @@ -4075,6 +4075,22 @@ static void prio_changed_fair(struct rq
>  static void switched_to_fair(struct rq *rq, struct task_struct *p,
>                             int running)
>  {
> +       struct sched_entity *se = &p->se;
> +       struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +
> +       if (se->on_rq && cfs_rq->curr != se)

(cfs_rq->curr != se) equals to (!running), no?

> +               __dequeue_entity(cfs_rq, se);
> +
> +       /*
> +        * se->vruntime can be completely out there, there is no telling
> +        * how long this task was !fair and on what CPU if any it became
> +        * !fair. Therefore, reset it to a known, reasonable value.
> +        */
> +       se->vruntime = cfs_rq->min_vruntime;

But this is not fair for a !SLEEP task.
You know se->vruntime -= cfs_rq->min_vruntime for a !SLEEP task;
then after it goes through sched_fair-->sched_rt-->sched_fair by some
means, the current cfs_rq->min_vruntime is added back.

But here the se is put before where it should be. Is this what we want?
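
A worked example with made-up numbers: a running task boosted to RT is
dequeued with flags=0, so vruntime 1005 is normalized to a relative 5
against min_vruntime 1000; when it later comes back to CFS and is
enqueued against min_vruntime 1200, it becomes 1205, preserving its
offset. With the reset it becomes 1200 instead, i.e. earlier than it
should be.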

Thanks,
Yong

-- 
Only stand for myself


* Re: Bug in scheduler when using rt_mutex
  2011-01-19  2:38         ` Yong Zhang
@ 2011-01-19  3:43           ` Mike Galbraith
  2011-01-19  4:35             ` Yong Zhang
  2011-01-19  9:44           ` Peter Zijlstra
  1 sibling, 1 reply; 42+ messages in thread
From: Mike Galbraith @ 2011-01-19  3:43 UTC (permalink / raw)
  To: Yong Zhang; +Cc: Peter Zijlstra, samu.p.onkalo, mingo, linux-kernel, tglx

On Wed, 2011-01-19 at 10:38 +0800, Yong Zhang wrote:

> > Index: linux-2.6/kernel/sched_fair.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/sched_fair.c
> > +++ linux-2.6/kernel/sched_fair.c
> > @@ -4075,6 +4075,22 @@ static void prio_changed_fair(struct rq
> >  static void switched_to_fair(struct rq *rq, struct task_struct *p,
> >                             int running)
> >  {
> > +       struct sched_entity *se = &p->se;
> > +       struct cfs_rq *cfs_rq = cfs_rq_of(se);
> > +
> > +       if (se->on_rq && cfs_rq->curr != se)
> 
> (cfs_rq->curr != se) equals to (!running), no?

No, running is task_of(se) == rq->curr.  Another class or fair group
task may be rq_of(cfs_rq)->curr

> > +               __dequeue_entity(cfs_rq, se);
> > +
> > +       /*
> > +        * se->vruntime can be completely out there, there is no telling
> > +        * how long this task was !fair and on what CPU if any it became
> > +        * !fair. Therefore, reset it to a known, reasonable value.
> > +        */
> > +       se->vruntime = cfs_rq->min_vruntime;
> 
> But this is not fair for a !SLEEP task.
> You know se->vruntime -= cfs_rq->min_vruntime for a !SLEEP task;
> then after it goes through sched_fair-->sched_rt-->sched_fair by some
> means, the current cfs_rq->min_vruntime is added back.

It drops lag for all, positive or negative.

> But here the se is put before where it should be. Is this what we want?

It may move forward or backward.  If transitions can happen at high
frequency it could be a problem; otherwise it's a corner-case blip.

An alternative is to leave lag alone and normalize sleepers, but that's
(I did that) considerably more intrusive.

	-Mike



* Re: Bug in scheduler when using rt_mutex
  2011-01-19  3:43           ` Mike Galbraith
@ 2011-01-19  4:35             ` Yong Zhang
  2011-01-19  5:40               ` Mike Galbraith
  0 siblings, 1 reply; 42+ messages in thread
From: Yong Zhang @ 2011-01-19  4:35 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, samu.p.onkalo, mingo, linux-kernel, tglx

On Wed, Jan 19, 2011 at 11:43 AM, Mike Galbraith <efault@gmx.de> wrote:
>> (cfs_rq->curr != se) equals to (!running), no?
>
> No, running is task_of(se) == rq->curr.  Another class or fair group
> task may be rq_of(cfs_rq)->curr

If task_of(se) != rq->curr, as you said it may be in another class or
fair group; then for its fair group, cfs_rq->curr == NULL, so
cfs_rq->curr != se is always true.

Thanks,
Yong

-- 
Only stand for myself


* Re: Bug in scheduler when using rt_mutex
  2011-01-19  4:35             ` Yong Zhang
@ 2011-01-19  5:40               ` Mike Galbraith
  2011-01-19  6:09                 ` Yong Zhang
  0 siblings, 1 reply; 42+ messages in thread
From: Mike Galbraith @ 2011-01-19  5:40 UTC (permalink / raw)
  To: Yong Zhang; +Cc: Peter Zijlstra, samu.p.onkalo, mingo, linux-kernel, tglx

On Wed, 2011-01-19 at 12:35 +0800, Yong Zhang wrote:

> cfs_rq->curr != se is always true.

If that were always true, we'd illegally enqueue a running task.

        if (running)
                p->sched_class->set_curr_task(rq);
                ==> set_curr_task_fair(rq);
                ==>      struct sched_entity *se = &rq->curr->se;
                ==>      set_next_entity(cfs_rq_of(se), se);
                ==>          cfs_rq->curr = se;
                (prevents running entity from being enqueued below)
        if (on_rq) {
                enqueue_task(rq, p, oldprio < prio ? ENQUEUE_HEAD : 0);

                check_class_changed(rq, p, prev_class, oldprio, running);
        }

	-Mike



* Re: Bug in scheduler when using rt_mutex
  2011-01-19  5:40               ` Mike Galbraith
@ 2011-01-19  6:09                 ` Yong Zhang
  2011-01-19  6:37                   ` Mike Galbraith
  0 siblings, 1 reply; 42+ messages in thread
From: Yong Zhang @ 2011-01-19  6:09 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, samu.p.onkalo, mingo, linux-kernel, tglx

On Wed, Jan 19, 2011 at 1:40 PM, Mike Galbraith <efault@gmx.de> wrote:
> On Wed, 2011-01-19 at 12:35 +0800, Yong Zhang wrote:
>
>> cfs_rq->curr != se is always true.
>
> If that were always true, we'd illegally enqueue a running task.

I'm sorry that I didn't express myself correctly.

The conclusion of (cfs_rq->curr != se is always true) is not
self-contained. IOW, it's based on one condition which is
(task_of(se) != rq->curr). So what I want to say is:
task_of(se) != rq->curr    ==>   cfs_rq_of(se)->curr != se
So,
         !running                   ==>   cfs_rq_of(se)->curr != se

Is this more clear?

Thanks,
Yong

-- 
Only stand for myself


* Re: Bug in scheduler when using rt_mutex
  2011-01-19  6:09                 ` Yong Zhang
@ 2011-01-19  6:37                   ` Mike Galbraith
  2011-01-19  7:19                     ` Ingo Molnar
  0 siblings, 1 reply; 42+ messages in thread
From: Mike Galbraith @ 2011-01-19  6:37 UTC (permalink / raw)
  To: Yong Zhang; +Cc: Peter Zijlstra, samu.p.onkalo, mingo, linux-kernel, tglx

On Wed, 2011-01-19 at 14:09 +0800, Yong Zhang wrote:
> On Wed, Jan 19, 2011 at 1:40 PM, Mike Galbraith <efault@gmx.de> wrote:
> > On Wed, 2011-01-19 at 12:35 +0800, Yong Zhang wrote:
> >
> >> cfs_rq->curr != se is always true.
> >
> > If that were always true, we'd illegally enqueue a running task.
> 
> I'm sorry that I didn't express myself correctly.

Human communication methods are all buggy as hell :)

> The conclusion of (cfs_rq->curr != se is always true) is not
> self-contained. IOW, it's based on one condition which is
> (task_of(se) != rq->curr). So what I want to say is:
> task_of(se) != rq->curr    ==>   cfs_rq_of(se)->curr != se
> So,
>          !running                   ==>   cfs_rq_of(se)->curr != se
> 
> Is this more clear?

Yeah.

	-Mike



* Re: Bug in scheduler when using rt_mutex
  2011-01-19  6:37                   ` Mike Galbraith
@ 2011-01-19  7:19                     ` Ingo Molnar
  2011-01-19  7:41                       ` Mike Galbraith
  0 siblings, 1 reply; 42+ messages in thread
From: Ingo Molnar @ 2011-01-19  7:19 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Yong Zhang, Peter Zijlstra, samu.p.onkalo, linux-kernel, tglx


* Mike Galbraith <efault@gmx.de> wrote:

> On Wed, 2011-01-19 at 14:09 +0800, Yong Zhang wrote:
> > On Wed, Jan 19, 2011 at 1:40 PM, Mike Galbraith <efault@gmx.de> wrote:
> > > On Wed, 2011-01-19 at 12:35 +0800, Yong Zhang wrote:
> > >
> > >> cfs_rq->curr != se is always true.
> > >
> > > If that were always true, we'd illegally enqueue a running task.
> > 
> > I'm sorry that I didn't express myself correctly.
> 
> Human communication methods are all buggy as hell :)

Not to mention that they are slow, inefficient and ambiguous.

But what did you expect? The original authors of the code are long gone and 
maintenance is done by newcomers who are patching the code bit by bit. What
you get from such a development model is pretty predictable: ~1 billion years
old spaghetti DNA that no-one truly understands.

	Ingo


* Re: Bug in scheduler when using rt_mutex
  2011-01-19  7:19                     ` Ingo Molnar
@ 2011-01-19  7:41                       ` Mike Galbraith
  0 siblings, 0 replies; 42+ messages in thread
From: Mike Galbraith @ 2011-01-19  7:41 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Yong Zhang, Peter Zijlstra, samu.p.onkalo, linux-kernel, tglx

On Wed, 2011-01-19 at 08:19 +0100, Ingo Molnar wrote:
> * Mike Galbraith <efault@gmx.de> wrote:
> 
> > On Wed, 2011-01-19 at 14:09 +0800, Yong Zhang wrote:
> > > On Wed, Jan 19, 2011 at 1:40 PM, Mike Galbraith <efault@gmx.de> wrote:
> > > > On Wed, 2011-01-19 at 12:35 +0800, Yong Zhang wrote:
> > > >
> > > >> cfs_rq->curr != se is always true.
> > > >
> > > > If that were always true, we'd illegally enqueue a running task.
> > > 
> > > I'm sorry that I didn't express myself correctly.
> > 
> > Human communication methods are all buggy as hell :)
> 
> Not to mention that they are slow, inefficient and ambiguous.
> 
> But what did you expect? The original authors of the code are long gone and 
> maintenance is done by newcomers who are patching the code bit by bit. What
> you get from such a development model is pretty predictable: ~1 billion years
> old spaghetti DNA that no-one truly understands.

Gotta give the original authors credit though, their self-modifying code
comes equipped with a fully automated hardware break-point debugger.

	-Mike



* Re: Bug in scheduler when using rt_mutex
  2011-01-19  2:38         ` Yong Zhang
  2011-01-19  3:43           ` Mike Galbraith
@ 2011-01-19  9:44           ` Peter Zijlstra
  2011-01-19 10:38             ` Peter Zijlstra
  2011-01-20  3:10             ` Yong Zhang
  1 sibling, 2 replies; 42+ messages in thread
From: Peter Zijlstra @ 2011-01-19  9:44 UTC (permalink / raw)
  To: Yong Zhang; +Cc: samu.p.onkalo, mingo, linux-kernel, tglx

On Wed, 2011-01-19 at 10:38 +0800, Yong Zhang wrote:
> > Index: linux-2.6/kernel/sched_fair.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/sched_fair.c
> > +++ linux-2.6/kernel/sched_fair.c
> > @@ -4075,6 +4075,22 @@ static void prio_changed_fair(struct rq
> >  static void switched_to_fair(struct rq *rq, struct task_struct *p,
> >                             int running)
> >  {
> > +       struct sched_entity *se = &p->se;
> > +       struct cfs_rq *cfs_rq = cfs_rq_of(se);
> > +
> > +       if (se->on_rq && cfs_rq->curr != se)
> 
> (cfs_rq->curr != se) equals to (!running), no?

more or less, the idea is that we only call __{dequeue,enqueue}_entity()
when the task is actually in the tree and current is not.

> > +               __dequeue_entity(cfs_rq, se);
> > +
> > +       /*
> > +        * se->vruntime can be completely out there, there is no telling
> > +        * how long this task was !fair and on what CPU if any it became
> > +        * !fair. Therefore, reset it to a known, reasonable value.
> > +        */
> > +       se->vruntime = cfs_rq->min_vruntime;
> 
> But this is not fair for a !SLEEP task.
> You know se->vruntime -= cfs_rq->min_vruntime for a !SLEEP task;
> then after it goes through sched_fair-->sched_rt-->sched_fair by some
> means, the current cfs_rq->min_vruntime is added back.
> 
> But here the se is put before where it should be. Is this what we want?

well, it's more or less screwy anyway, since we don't know for how long
the task was !fair and what cpu it came from etc..

But I guess you're right, we should at least pretend the whole
min_vruntime thing is the 0-lag point (it's not) and preserve 'lag' like
we do for migrations... Something like the below.. except I've got a
massive headache and I'm not at all sure I got the switched_from_fair()
bit right.
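
(For reference, the migration case is normalized in task_waking_fair();
roughly, from memory of this era's code:

	static void task_waking_fair(struct rq *rq, struct task_struct *p)
	{
		struct sched_entity *se = &p->se;
		struct cfs_rq *cfs_rq = cfs_rq_of(se);

		se->vruntime -= cfs_rq->min_vruntime;
	}

and the wakeup enqueue then adds the new runqueue's min_vruntime back.)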

---
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -8108,6 +8108,8 @@ EXPORT_SYMBOL(__might_sleep);
 #ifdef CONFIG_MAGIC_SYSRQ
 static void normalize_task(struct rq *rq, struct task_struct *p)
 {
+	struct sched_class *prev_class = p->sched_class;
+	int old_prio = p->prio;
 	int on_rq;
 
 	on_rq = p->se.on_rq;
@@ -8118,6 +8120,8 @@ static void normalize_task(struct rq *rq
 		activate_task(rq, p, 0);
 		resched_task(rq->curr);
 	}
+
+	check_class_changed(rq, p, prev_class, old_prio, task_current(rq, p));
 }
 
 void normalize_rt_tasks(void)
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -4066,12 +4066,33 @@ static void prio_changed_fair(struct rq
 		check_preempt_curr(rq, p, 0);
 }
 
+static void
+switched_from_fair(struct rq *rq, struct task_struct *p, int running)
+{
+	struct sched_entity *se = &p->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+	if (!se->on_rq && p->state != TASK_RUNNING)
+		se->vruntime -= cfs_rq->min_vruntime;
+}
+
 /*
  * We switched to the sched_fair class.
  */
-static void switched_to_fair(struct rq *rq, struct task_struct *p,
-			     int running)
+static void
+switched_to_fair(struct rq *rq, struct task_struct *p, int running)
 {
+	struct sched_entity *se = &p->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+	if (se->on_rq && cfs_rq->curr != se)
+		__dequeue_entity(cfs_rq, se);
+
+	se->vruntime += cfs_rq->min_vruntime;
+
+	if (se->on_rq && cfs_rq->curr != se)
+		__enqueue_entity(cfs_rq, se);
+
 	/*
 	 * We were most likely switched from sched_rt, so
 	 * kick off the schedule if running, otherwise just see
@@ -4163,6 +4184,7 @@ static const struct sched_class fair_sch
 	.task_fork		= task_fork_fair,
 
 	.prio_changed		= prio_changed_fair,
+	.switched_from		= switched_from_fair,
 	.switched_to		= switched_to_fair,
 
 	.get_rr_interval	= get_rr_interval_fair,



* Re: Bug in scheduler when using rt_mutex
  2011-01-19  9:44           ` Peter Zijlstra
@ 2011-01-19 10:38             ` Peter Zijlstra
  2011-01-19 11:30               ` Peter Zijlstra
  2011-01-20  3:10             ` Yong Zhang
  1 sibling, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2011-01-19 10:38 UTC (permalink / raw)
  To: Yong Zhang; +Cc: samu.p.onkalo, mingo, linux-kernel, tglx

On Wed, 2011-01-19 at 10:44 +0100, Peter Zijlstra wrote:
> +static void
> +switched_from_fair(struct rq *rq, struct task_struct *p, int running)
> +{
> +       struct sched_entity *se = &p->se;
> +       struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +
> +       if (!se->on_rq && p->state != TASK_RUNNING)
> +               se->vruntime -= cfs_rq->min_vruntime;
> +}
> +
>  /*
>   * We switched to the sched_fair class.
>   */
> -static void switched_to_fair(struct rq *rq, struct task_struct *p,
> -                            int running)
> +static void
> +switched_to_fair(struct rq *rq, struct task_struct *p, int running)
>  {
> +       struct sched_entity *se = &p->se;
> +       struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +
> +       if (se->on_rq && cfs_rq->curr != se)
> +               __dequeue_entity(cfs_rq, se);
> +
> +       se->vruntime += cfs_rq->min_vruntime;
> +
> +       if (se->on_rq && cfs_rq->curr != se)
> +               __enqueue_entity(cfs_rq, se);
> +
>         /*
>          * We were most likely switched from sched_rt, so
>          * kick off the schedule if running, otherwise just see 

Hrm, I think the to bit is not needed with the from thing in place, the
enqueue from _setprio will already have added min_vruntime


* Re: Bug in scheduler when using rt_mutex
  2011-01-19 10:38             ` Peter Zijlstra
@ 2011-01-19 11:30               ` Peter Zijlstra
  2011-01-19 12:58                 ` Onkalo Samu
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2011-01-19 11:30 UTC (permalink / raw)
  To: Yong Zhang; +Cc: samu.p.onkalo, mingo, linux-kernel, tglx

On Wed, 2011-01-19 at 11:38 +0100, Peter Zijlstra wrote:

> Hrm, I think the to bit is not needed with the from thing in place, the
> enqueue from _setprio will already have added min_vruntime


Would lead to something like this:

---
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -8108,6 +8108,8 @@ EXPORT_SYMBOL(__might_sleep);
 #ifdef CONFIG_MAGIC_SYSRQ
 static void normalize_task(struct rq *rq, struct task_struct *p)
 {
+	struct sched_class *prev_class = p->sched_class;
+	int old_prio = p->prio;
 	int on_rq;
 
 	on_rq = p->se.on_rq;
@@ -8118,6 +8120,8 @@ static void normalize_task(struct rq *rq
 		activate_task(rq, p, 0);
 		resched_task(rq->curr);
 	}
+
+	check_class_changed(rq, p, prev_class, old_prio, task_current(rq, p));
 }
 
 void normalize_rt_tasks(void)
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -4066,11 +4066,30 @@ static void prio_changed_fair(struct rq
 		check_preempt_curr(rq, p, 0);
 }
 
+static void
+switched_from_fair(struct rq *rq, struct task_struct *p, int running)
+{
+	struct sched_entity *se = &p->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+	/*
+	 * Ensure the task's vruntime is normalized, so that when its
+	 * switched back to the fair class the enqueue_entity(.flags=0) will
+	 * do the right thing.
+	 *
+	 * If it was on_rq, then the dequeue_entity(.flags=0) will already
+	 * have normalized the vruntime, if it was !on_rq, then only when
+	 * the task is sleeping will it still have non-normalized vruntime.
+	 */
+	if (!se->on_rq && p->state != TASK_RUNNING)
+		se->vruntime -= cfs_rq->min_vruntime;
+}
+
 /*
  * We switched to the sched_fair class.
  */
-static void switched_to_fair(struct rq *rq, struct task_struct *p,
-			     int running)
+static void
+switched_to_fair(struct rq *rq, struct task_struct *p, int running)
 {
 	/*
 	 * We were most likely switched from sched_rt, so
@@ -4163,6 +4182,7 @@ static const struct sched_class fair_sch
 	.task_fork		= task_fork_fair,
 
 	.prio_changed		= prio_changed_fair,
+	.switched_from		= switched_from_fair,
 	.switched_to		= switched_to_fair,
 
 	.get_rr_interval	= get_rr_interval_fair,



* Re: Bug in scheduler when using rt_mutex
  2011-01-19 11:30               ` Peter Zijlstra
@ 2011-01-19 12:58                 ` Onkalo Samu
  2011-01-19 13:13                   ` Onkalo Samu
  0 siblings, 1 reply; 42+ messages in thread
From: Onkalo Samu @ 2011-01-19 12:58 UTC (permalink / raw)
  To: ext Peter Zijlstra; +Cc: Yong Zhang, mingo, linux-kernel, tglx, Onkalo Samu.P

On Wed, 2011-01-19 at 12:30 +0100, ext Peter Zijlstra wrote:
> On Wed, 2011-01-19 at 11:38 +0100, Peter Zijlstra wrote:
> 
> > Hrm, I think the to bit is not needed with the from thing in place, the
> > enqueue from _setprio will already have added min_vruntime
> 
> 
> Would lead to something like this:
> 

Doesn't work in my case.

> ---
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -8108,6 +8108,8 @@ EXPORT_SYMBOL(__might_sleep);
>  #ifdef CONFIG_MAGIC_SYSRQ
>  static void normalize_task(struct rq *rq, struct task_struct *p)
>  {
> +	struct sched_class *prev_class = p->sched_class;
> +	int old_prio = p->prio;
>  	int on_rq;
>  
>  	on_rq = p->se.on_rq;
> @@ -8118,6 +8120,8 @@ static void normalize_task(struct rq *rq
>  		activate_task(rq, p, 0);
>  		resched_task(rq->curr);
>  	}
> +
> +	check_class_changed(rq, p, prev_class, old_prio, task_current(rq, p));
>  }
>  
>  void normalize_rt_tasks(void)
> Index: linux-2.6/kernel/sched_fair.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched_fair.c
> +++ linux-2.6/kernel/sched_fair.c
> @@ -4066,11 +4066,30 @@ static void prio_changed_fair(struct rq
>  		check_preempt_curr(rq, p, 0);
>  }
>  
> +static void
> +switched_from_fair(struct rq *rq, struct task_struct *p, int running)
> +{
> +	struct sched_entity *se = &p->se;
> +	struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +
> +	/*
> +	 * Ensure the task's vruntime is normalized, so that when its
> +	 * switched back to the fair class the enqueue_entity(.flags=0) will
> +	 * do the right thing.
> +	 *
> +	 * If it was on_rq, then the dequeue_entity(.flags=0) will already
> +	 * have normalized the vruntime, if it was !on_rq, then only when
> +	 * the task is sleeping will it still have non-normalized vruntime.
> +	 */
> +	if (!se->on_rq && p->state != TASK_RUNNING)
> +		se->vruntime -= cfs_rq->min_vruntime;
> +}
> +
>  /*
>   * We switched to the sched_fair class.
>   */
> -static void switched_to_fair(struct rq *rq, struct task_struct *p,
> -			     int running)
> +static void
> +switched_to_fair(struct rq *rq, struct task_struct *p, int running)
>  {
>  	/*
>  	 * We were most likely switched from sched_rt, so
> @@ -4163,6 +4182,7 @@ static const struct sched_class fair_sch
>  	.task_fork		= task_fork_fair,
>  
>  	.prio_changed		= prio_changed_fair,
> +	.switched_from		= switched_from_fair,
>  	.switched_to		= switched_to_fair,
>  
>  	.get_rr_interval	= get_rr_interval_fair,
> 




* Re: Bug in scheduler when using rt_mutex
  2011-01-19 12:58                 ` Onkalo Samu
@ 2011-01-19 13:13                   ` Onkalo Samu
  2011-01-19 13:30                     ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Onkalo Samu @ 2011-01-19 13:13 UTC (permalink / raw)
  To: ext Peter Zijlstra; +Cc: Yong Zhang, mingo, linux-kernel, tglx

On Wed, 2011-01-19 at 14:58 +0200, Onkalo Samu wrote:
> On Wed, 2011-01-19 at 12:30 +0100, ext Peter Zijlstra wrote:
> > On Wed, 2011-01-19 at 11:38 +0100, Peter Zijlstra wrote:
> > 
> > > Hrm, I think the to bit is not needed with the from thing in place, the
> > > enqueue from _setprio will already have added min_vruntime
> > 
> > 
> > Would lead to something like this:
> > 
> 
> Doesn't work in my case.

When the task is sleeping, rt_mutex_setprio
doesn't call check_class_changed, since the task is not
on the queue at that moment, I think.
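
For reference, the relevant tail of rt_mutex_setprio() (as in 2.6.37,
sketched from memory) only calls check_class_changed() from inside the
on_rq branch:

	if (running)
		p->sched_class->set_curr_task(rq);
	if (on_rq) {
		enqueue_task(rq, p, oldprio < prio ? ENQUEUE_HEAD : 0);

		check_class_changed(rq, p, prev_class, oldprio, running);
	}

so a task boosted (or deboosted) while !on_rq never sees the
switched_from()/switched_to() callbacks.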

> 
> > ---
> > Index: linux-2.6/kernel/sched.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/sched.c
> > +++ linux-2.6/kernel/sched.c
> > @@ -8108,6 +8108,8 @@ EXPORT_SYMBOL(__might_sleep);
> >  #ifdef CONFIG_MAGIC_SYSRQ
> >  static void normalize_task(struct rq *rq, struct task_struct *p)
> >  {
> > +	struct sched_class *prev_class = p->sched_class;
> > +	int old_prio = p->prio;
> >  	int on_rq;
> >  
> >  	on_rq = p->se.on_rq;
> > @@ -8118,6 +8120,8 @@ static void normalize_task(struct rq *rq
> >  		activate_task(rq, p, 0);
> >  		resched_task(rq->curr);
> >  	}
> > +
> > +	check_class_changed(rq, p, prev_class, old_prio, task_current(rq, p));
> >  }
> >  
> >  void normalize_rt_tasks(void)
> > Index: linux-2.6/kernel/sched_fair.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/sched_fair.c
> > +++ linux-2.6/kernel/sched_fair.c
> > @@ -4066,11 +4066,30 @@ static void prio_changed_fair(struct rq
> >  		check_preempt_curr(rq, p, 0);
> >  }
> >  
> > +static void
> > +switched_from_fair(struct rq *rq, struct task_struct *p, int running)
> > +{
> > +	struct sched_entity *se = &p->se;
> > +	struct cfs_rq *cfs_rq = cfs_rq_of(se);
> > +
> > +	/*
> > +	 * Ensure the task's vruntime is normalized, so that when its
> > +	 * switched back to the fair class the enqueue_entity(.flags=0) will
> > +	 * do the right thing.
> > +	 *
> > +	 * If it was on_rq, then the dequeue_entity(.flags=0) will already
> > +	 * have normalized the vruntime, if it was !on_rq, then only when
> > +	 * the task is sleeping will it still have non-normalized vruntime.
> > +	 */
> > +	if (!se->on_rq && p->state != TASK_RUNNING)
> > +		se->vruntime -= cfs_rq->min_vruntime;
> > +}
> > +
> >  /*
> >   * We switched to the sched_fair class.
> >   */
> > -static void switched_to_fair(struct rq *rq, struct task_struct *p,
> > -			     int running)
> > +static void
> > +switched_to_fair(struct rq *rq, struct task_struct *p, int running)
> >  {
> >  	/*
> >  	 * We were most likely switched from sched_rt, so
> > @@ -4163,6 +4182,7 @@ static const struct sched_class fair_sch
> >  	.task_fork		= task_fork_fair,
> >  
> >  	.prio_changed		= prio_changed_fair,
> > +	.switched_from		= switched_from_fair,
> >  	.switched_to		= switched_to_fair,
> >  
> >  	.get_rr_interval	= get_rr_interval_fair,
> > 
> 
> 




* Re: Bug in scheduler when using rt_mutex
  2011-01-19 13:13                   ` Onkalo Samu
@ 2011-01-19 13:30                     ` Peter Zijlstra
  2011-01-20  4:18                       ` Yong Zhang
  2011-01-20  7:07                       ` Onkalo Samu
  0 siblings, 2 replies; 42+ messages in thread
From: Peter Zijlstra @ 2011-01-19 13:30 UTC (permalink / raw)
  To: samu.p.onkalo; +Cc: Yong Zhang, mingo, linux-kernel, tglx, Steven Rostedt

On Wed, 2011-01-19 at 15:13 +0200, Onkalo Samu wrote:
> 
> When the task is sleeping, rt_mutex_setprio
> doesn't call check_class_changed, since the task is not
> on the queue at that moment, I think.

Ah, indeed, more changes then..

(I think we can also remove the prio_changed_idle thing, since I think
we killed that hotplug usage)

---
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2025,14 +2025,14 @@ inline int task_curr(const struct task_s
 
 static inline void check_class_changed(struct rq *rq, struct task_struct *p,
 				       const struct sched_class *prev_class,
-				       int oldprio, int running)
+				       int oldprio)
 {
 	if (prev_class != p->sched_class) {
 		if (prev_class->switched_from)
-			prev_class->switched_from(rq, p, running);
-		p->sched_class->switched_to(rq, p, running);
-	} else
-		p->sched_class->prio_changed(rq, p, oldprio, running);
+			prev_class->switched_from(rq, p);
+		p->sched_class->switched_to(rq, p);
+	} else if (oldprio != p->prio)
+		p->sched_class->prio_changed(rq, p, oldprio);
 }
 
 static void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
@@ -4570,11 +4570,10 @@ void rt_mutex_setprio(struct task_struct
 
 	if (running)
 		p->sched_class->set_curr_task(rq);
-	if (on_rq) {
+	if (on_rq)
 		enqueue_task(rq, p, oldprio < prio ? ENQUEUE_HEAD : 0);
 
-		check_class_changed(rq, p, prev_class, oldprio, running);
-	}
+	check_class_changed(rq, p, prev_class, oldprio);
 	task_rq_unlock(rq, &flags);
 }
 
@@ -4902,11 +4901,10 @@ static int __sched_setscheduler(struct t
 
 	if (running)
 		p->sched_class->set_curr_task(rq);
-	if (on_rq) {
+	if (on_rq)
 		activate_task(rq, p, 0);
 
-		check_class_changed(rq, p, prev_class, oldprio, running);
-	}
+	check_class_changed(rq, p, prev_class, oldprio);
 	__task_rq_unlock(rq);
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 
@@ -8109,6 +8107,8 @@ EXPORT_SYMBOL(__might_sleep);
 #ifdef CONFIG_MAGIC_SYSRQ
 static void normalize_task(struct rq *rq, struct task_struct *p)
 {
+	struct sched_class *prev_class = p->sched_class;
+	int old_prio = p->prio;
 	int on_rq;
 
 	on_rq = p->se.on_rq;
@@ -8119,6 +8119,8 @@ static void normalize_task(struct rq *rq
 		activate_task(rq, p, 0);
 		resched_task(rq->curr);
 	}
+
+	check_class_changed(rq, p, prev_class, old_prio);
 }
 
 void normalize_rt_tasks(void)
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -4054,33 +4054,56 @@ static void task_fork_fair(struct task_s
  * Priority of the task has changed. Check to see if we preempt
  * the current task.
  */
-static void prio_changed_fair(struct rq *rq, struct task_struct *p,
-			      int oldprio, int running)
+static void
+prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)
 {
+	if (!p->se.on_rq)
+		return;
+
 	/*
 	 * Reschedule if we are currently running on this runqueue and
 	 * our priority decreased, or if we are not currently running on
 	 * this runqueue and our priority is higher than the current's
 	 */
-	if (running) {
+	if (rq->curr == p) {
 		if (p->prio > oldprio)
 			resched_task(rq->curr);
 	} else
 		check_preempt_curr(rq, p, 0);
 }
 
+static void switched_from_fair(struct rq *rq, struct task_struct *p)
+{
+	struct sched_entity *se = &p->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+	/*
+	 * Ensure the task's vruntime is normalized, so that when its
+	 * switched back to the fair class the enqueue_entity(.flags=0) will
+	 * do the right thing.
+	 *
+	 * If it was on_rq, then the dequeue_entity(.flags=0) will already
+	 * have normalized the vruntime, if it was !on_rq, then only when
+	 * the task is sleeping will it still have non-normalized vruntime.
+	 */
+	if (!se->on_rq && p->state != TASK_RUNNING)
+		se->vruntime -= cfs_rq->min_vruntime;
+}
+
 /*
  * We switched to the sched_fair class.
  */
-static void switched_to_fair(struct rq *rq, struct task_struct *p,
-			     int running)
+static void switched_to_fair(struct rq *rq, struct task_struct *p)
 {
+	if (!p->se.on_rq)
+		return;
+
 	/*
 	 * We were most likely switched from sched_rt, so
 	 * kick off the schedule if running, otherwise just see
 	 * if we can still preempt the current task.
 	 */
-	if (running)
+	if (rq->curr == p)
 		resched_task(rq->curr);
 	else
 		check_preempt_curr(rq, p, 0);
@@ -4166,6 +4189,7 @@ static const struct sched_class fair_sch
 	.task_fork		= task_fork_fair,
 
 	.prio_changed		= prio_changed_fair,
+	.switched_from		= switched_from_fair,
 	.switched_to		= switched_to_fair,
 
 	.get_rr_interval	= get_rr_interval_fair,
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1084,12 +1084,10 @@ struct sched_class {
 	void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
 	void (*task_fork) (struct task_struct *p);
 
-	void (*switched_from) (struct rq *this_rq, struct task_struct *task,
-			       int running);
-	void (*switched_to) (struct rq *this_rq, struct task_struct *task,
-			     int running);
+	void (*switched_from) (struct rq *this_rq, struct task_struct *task);
+	void (*switched_to) (struct rq *this_rq, struct task_struct *task);
 	void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
-			     int oldprio, int running);
+			     int oldprio);
 
 	unsigned int (*get_rr_interval) (struct rq *rq,
 					 struct task_struct *task);
Index: linux-2.6/kernel/sched_idletask.c
===================================================================
--- linux-2.6.orig/kernel/sched_idletask.c
+++ linux-2.6/kernel/sched_idletask.c
@@ -52,8 +52,7 @@ static void set_curr_task_idle(struct rq
 {
 }
 
-static void switched_to_idle(struct rq *rq, struct task_struct *p,
-			     int running)
+static void switched_to_idle(struct rq *rq, struct task_struct *p)
 {
 	/* Can this actually happen?? */
 	if (running)
@@ -62,8 +61,8 @@ static void switched_to_idle(struct rq *
 		check_preempt_curr(rq, p, 0);
 }
 
-static void prio_changed_idle(struct rq *rq, struct task_struct *p,
-			      int oldprio, int running)
+static void
+prio_changed_idle(struct rq *rq, struct task_struct *p, int oldprio)
 {
 	/* This can happen for hot plug CPUS */
 
Index: linux-2.6/kernel/sched_rt.c
===================================================================
--- linux-2.6.orig/kernel/sched_rt.c
+++ linux-2.6/kernel/sched_rt.c
@@ -1595,8 +1595,7 @@ static void rq_offline_rt(struct rq *rq)
  * When switch from the rt queue, we bring ourselves to a position
  * that we might want to pull RT tasks from other runqueues.
  */
-static void switched_from_rt(struct rq *rq, struct task_struct *p,
-			   int running)
+static void switched_from_rt(struct rq *rq, struct task_struct *p)
 {
 	/*
 	 * If there are other RT tasks then we will reschedule
@@ -1605,7 +1604,7 @@ static void switched_from_rt(struct rq *
 	 * we may need to handle the pulling of RT tasks
 	 * now.
 	 */
-	if (!rq->rt.rt_nr_running)
+	if (p->se.on_rq && !rq->rt.rt_nr_running)
 		pull_rt_task(rq);
 }
 
@@ -1624,8 +1623,7 @@ static inline void init_sched_rt_class(v
  * with RT tasks. In this case we try to push them off to
  * other runqueues.
  */
-static void switched_to_rt(struct rq *rq, struct task_struct *p,
-			   int running)
+static void switched_to_rt(struct rq *rq, struct task_struct *p)
 {
 	int check_resched = 1;
 
@@ -1636,7 +1634,7 @@ static void switched_to_rt(struct rq *rq
 	 * If that current running task is also an RT task
 	 * then see if we can move to another run queue.
 	 */
-	if (!running) {
+	if (p->se.on_rq && rq->curr != p) {
 #ifdef CONFIG_SMP
 		if (rq->rt.overloaded && push_rt_task(rq) &&
 		    /* Don't resched if we changed runqueues */
@@ -1652,10 +1650,13 @@ static void switched_to_rt(struct rq *rq
  * Priority of the task has changed. This may cause
  * us to initiate a push or pull.
  */
-static void prio_changed_rt(struct rq *rq, struct task_struct *p,
-			    int oldprio, int running)
+static void
+prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
 {
-	if (running) {
+	if (!p->se.on_rq)
+		return;
+
+	if (rq->curr == p) {
 #ifdef CONFIG_SMP
 		/*
 		 * If our priority decreases while running, we
Index: linux-2.6/kernel/sched_stoptask.c
===================================================================
--- linux-2.6.orig/kernel/sched_stoptask.c
+++ linux-2.6/kernel/sched_stoptask.c
@@ -59,14 +59,13 @@ static void set_curr_task_stop(struct rq
 {
 }
 
-static void switched_to_stop(struct rq *rq, struct task_struct *p,
-			     int running)
+static void switched_to_stop(struct rq *rq, struct task_struct *p)
 {
 	BUG(); /* its impossible to change to this class */
 }
 
-static void prio_changed_stop(struct rq *rq, struct task_struct *p,
-			      int oldprio, int running)
+static void
+prio_changed_stop(struct rq *rq, struct task_struct *p, int oldprio)
 {
 	BUG(); /* how!?, what priority? */
 }
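
To make the vruntime handling in switched_from_fair() concrete, here is
a toy user-space model (illustrative numbers only, not kernel code) of
the offset-preserving arithmetic that the normalization and the later
enqueue_entity(.flags=0) rely on:

#include <stdio.h>

int main(void)
{
	/* illustrative values, in nanoseconds */
	unsigned long long min_vruntime = 10000000;	/* cfs_rq->min_vruntime */
	unsigned long long vruntime	= 10000500;	/* task's absolute vruntime */

	/* switched_from_fair() on a sleeping, !on_rq task:
	 * store the position relative to the fair clock */
	vruntime -= min_vruntime;			/* offset: 500 */

	/* while the task runs boosted in sched_rt, the
	 * fair clock keeps advancing without it */
	min_vruntime += 2500000;

	/* enqueue_entity(.flags=0) when the task returns to fair:
	 * re-add the (new) base, so the old offset is preserved */
	vruntime += min_vruntime;

	printf("offset after re-entry: %llu\n", vruntime - min_vruntime);
	return 0;
}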



* Re: Bug in scheduler when using rt_mutex
  2011-01-19  9:44           ` Peter Zijlstra
  2011-01-19 10:38             ` Peter Zijlstra
@ 2011-01-20  3:10             ` Yong Zhang
  1 sibling, 0 replies; 42+ messages in thread
From: Yong Zhang @ 2011-01-20  3:10 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: samu.p.onkalo, mingo, linux-kernel, tglx

On Wed, Jan 19, 2011 at 5:44 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> more or less, the idea is that we only call __{dequeue,enqueue}_entity()
> when the task is actually in the tree and current is not.

Sure ;)


* Re: Bug in scheduler when using rt_mutex
  2011-01-19 13:30                     ` Peter Zijlstra
@ 2011-01-20  4:18                       ` Yong Zhang
  2011-01-20  4:27                         ` Yong Zhang
  2011-01-20  4:59                         ` Mike Galbraith
  2011-01-20  7:07                       ` Onkalo Samu
  1 sibling, 2 replies; 42+ messages in thread
From: Yong Zhang @ 2011-01-20  4:18 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: samu.p.onkalo, mingo, linux-kernel, tglx, Steven Rostedt

On Wed, Jan 19, 2011 at 9:30 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> +static void switched_from_fair(struct rq *rq, struct task_struct *p)
> +{
> +       struct sched_entity *se = &p->se;
> +       struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +
> +       /*
> +        * Ensure the task's vruntime is normalized, so that when its
> +        * switched back to the fair class the enqueue_entity(.flags=0) will
> +        * do the right thing.
> +        *
> +        * If it was on_rq, then the dequeue_entity(.flags=0) will already
> +        * have normalized the vruntime, if it was !on_rq, then only when
> +        * the task is sleeping will it still have non-normalized vruntime.
> +        */
> +       if (!se->on_rq && p->state != TASK_RUNNING)
> +               se->vruntime -= cfs_rq->min_vruntime;

Here it's possible that se->vruntime is less than cfs_rq->min_vruntime.

Thanks,
Yong

-- 
Only stand for myself


* Re: Bug in scheduler when using rt_mutex
  2011-01-20  4:18                       ` Yong Zhang
@ 2011-01-20  4:27                         ` Yong Zhang
  2011-01-20  5:32                           ` Yong Zhang
  2011-01-20  4:59                         ` Mike Galbraith
  1 sibling, 1 reply; 42+ messages in thread
From: Yong Zhang @ 2011-01-20  4:27 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: samu.p.onkalo, mingo, linux-kernel, tglx, Steven Rostedt

On Thu, Jan 20, 2011 at 12:18 PM, Yong Zhang <yong.zhang0@gmail.com> wrote:
> On Wed, Jan 19, 2011 at 9:30 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> +static void switched_from_fair(struct rq *rq, struct task_struct *p)
>> +{
>> +       struct sched_entity *se = &p->se;
>> +       struct cfs_rq *cfs_rq = cfs_rq_of(se);
>> +
>> +       /*
>> +        * Ensure the task's vruntime is normalized, so that when its
>> +        * switched back to the fair class the enqueue_entity(.flags=0) will
>> +        * do the right thing.
>> +        *
>> +        * If it was on_rq, then the dequeue_entity(.flags=0) will already
>> +        * have normalized the vruntime, if it was !on_rq, then only when
>> +        * the task is sleeping will it still have non-normalized vruntime.
>> +        */
>> +       if (!se->on_rq && p->state != TASK_RUNNING)
>> +               se->vruntime -= cfs_rq->min_vruntime;
>
> Here it's possible that se->vruntime is less than cfs_rq->min_vruntime.

Maybe we can:
                       place_entity(cfs_rq, se, 0);
                       se->vruntime -= cfs_rq->min_vruntime;


* Re: Bug in scheduler when using rt_mutex
  2011-01-20  4:18                       ` Yong Zhang
  2011-01-20  4:27                         ` Yong Zhang
@ 2011-01-20  4:59                         ` Mike Galbraith
  2011-01-20  5:30                           ` Yong Zhang
  1 sibling, 1 reply; 42+ messages in thread
From: Mike Galbraith @ 2011-01-20  4:59 UTC (permalink / raw)
  To: Yong Zhang
  Cc: Peter Zijlstra, samu.p.onkalo, mingo, linux-kernel, tglx, Steven Rostedt

On Thu, 2011-01-20 at 12:18 +0800, Yong Zhang wrote:
> On Wed, Jan 19, 2011 at 9:30 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > +static void switched_from_fair(struct rq *rq, struct task_struct *p)
> > +{
> > +       struct sched_entity *se = &p->se;
> > +       struct cfs_rq *cfs_rq = cfs_rq_of(se);
> > +
> > +       /*
> > +        * Ensure the task's vruntime is normalized, so that when its
> > +        * switched back to the fair class the enqueue_entity(.flags=0) will
> > +        * do the right thing.
> > +        *
> > +        * If it was on_rq, then the dequeue_entity(.flags=0) will already
> > +        * have normalized the vruntime, if it was !on_rq, then only when
> > +        * the task is sleeping will it still have non-normalized vruntime.
> > +        */
> > +       if (!se->on_rq && p->state != TASK_RUNNING)
> > +               se->vruntime -= cfs_rq->min_vruntime;
> 
> Here it's possible that se->vruntime is less than cfs_rq->min_vruntime.

The idea of normalizing is to preserve the offset, so the task can be
enqueued at this same offset, regardless of movement.

For sleepers, we can't usually do this until enqueue time, lest their
sleep time go unaccounted.  Here, we have to normalize, so the task's
sleep stops from the vantage point of fair_class upon class exit, and
either resumes where it left off upon re-entry, or the task is enqueued
at its last offset as usual for runnable tasks.

	-Mike
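
For reference, the clipping Mike describes lives in place_entity(); a
simplified sketch of it as it looks in this era's kernel/sched_fair.c
(START_DEBIT handling for new tasks omitted):

static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
	u64 vruntime = cfs_rq->min_vruntime;

	if (!initial) {
		/* grant a waking sleeper a bounded credit below min_vruntime */
		unsigned long thresh = sysctl_sched_latency;

		if (sched_feat(GENTLE_FAIR_SLEEPERS))
			thresh >>= 1;

		vruntime -= thresh;
	}

	/* but never move the entity backwards to gain extra credit */
	se->vruntime = max_vruntime(se->vruntime, vruntime);
}

The max_vruntime() clip is what bounds a sleeper's bonus no matter how
long it slept.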



* Re: Bug in scheduler when using rt_mutex
  2011-01-20  4:59                         ` Mike Galbraith
@ 2011-01-20  5:30                           ` Yong Zhang
  2011-01-20  6:12                             ` Mike Galbraith
  0 siblings, 1 reply; 42+ messages in thread
From: Yong Zhang @ 2011-01-20  5:30 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, samu.p.onkalo, mingo, linux-kernel, tglx, Steven Rostedt

On Thu, Jan 20, 2011 at 12:59 PM, Mike Galbraith <efault@gmx.de> wrote:
> or the task is enqueued
> at its last offset as usual for runnable tasks.

But shouldn't we treat the task as a WAKEUP, even though the task has
been woken in another sched_class?
IOW, I wonder if we should play with place_entity() at some point.

Thanks,
Yong

-- 
Only stand for myself


* Re: Bug in scheduler when using rt_mutex
  2011-01-20  4:27                         ` Yong Zhang
@ 2011-01-20  5:32                           ` Yong Zhang
  0 siblings, 0 replies; 42+ messages in thread
From: Yong Zhang @ 2011-01-20  5:32 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: samu.p.onkalo, mingo, linux-kernel, tglx, Steven Rostedt

On Thu, Jan 20, 2011 at 12:27 PM, Yong Zhang <yong.zhang0@gmail.com> wrote:
> On Thu, Jan 20, 2011 at 12:18 PM, Yong Zhang <yong.zhang0@gmail.com> wrote:
>> On Wed, Jan 19, 2011 at 9:30 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> +static void switched_from_fair(struct rq *rq, struct task_struct *p)
>>> +{
>>> +       struct sched_entity *se = &p->se;
>>> +       struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>> +
>>> +       /*
>>> +        * Ensure the task's vruntime is normalized, so that when its
>>> +        * switched back to the fair class the enqueue_entity(.flags=0) will
>>> +        * do the right thing.
>>> +        *
>>> +        * If it was on_rq, then the dequeue_entity(.flags=0) will already
>>> +        * have normalized the vruntime, if it was !on_rq, then only when
>>> +        * the task is sleeping will it still have non-normalized vruntime.
>>> +        */
>>> +       if (!se->on_rq && p->state != TASK_RUNNING)
>>> +               se->vruntime -= cfs_rq->min_vruntime;
>>
>> Here it's possible that se->vruntime is less than cfs_rq->min_vruntime.
>
> Maybe we can:
>                       place_entity(cfs_rq, se, 0);
>                       se->vruntime -= cfs_rq->min_vruntime;

Hmmm, this doesn't make sense since we don't know how long the task
will sleep after we do this.


* Re: Bug in scheduler when using rt_mutex
  2011-01-20  5:30                           ` Yong Zhang
@ 2011-01-20  6:12                             ` Mike Galbraith
  2011-01-20  7:06                               ` Yong Zhang
  0 siblings, 1 reply; 42+ messages in thread
From: Mike Galbraith @ 2011-01-20  6:12 UTC (permalink / raw)
  To: Yong Zhang
  Cc: Peter Zijlstra, samu.p.onkalo, mingo, linux-kernel, tglx, Steven Rostedt

On Thu, 2011-01-20 at 13:30 +0800, Yong Zhang wrote:
> On Thu, Jan 20, 2011 at 12:59 PM, Mike Galbraith <efault@gmx.de> wrote:
> > or the task is enqueued
> > at its last offset as usual for runnable tasks.
> 
> But shouldn't we treat the task as a WAKEUP, even though the task has
> been woken in another sched_class?

We don't need to play any games with it afaiks, just normalize it, and
the rest is taken care of automatically.
 
> IOW, I wonder if we should play with place_entity() at some point.

If the task returns as a sleeper, place_entity() will be called when it
is awakened, so its sleep credit will be clipped as usual.  So vruntime
can be much less than min_vruntime at class exit time, and it doesn't
matter; clipping on wakeup after re-entry takes care of it, if that's
what you were thinking about.

	-Mike



* Re: Bug in scheduler when using rt_mutex
  2011-01-20  6:12                             ` Mike Galbraith
@ 2011-01-20  7:06                               ` Yong Zhang
  2011-01-20  8:37                                 ` Mike Galbraith
  0 siblings, 1 reply; 42+ messages in thread
From: Yong Zhang @ 2011-01-20  7:06 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, samu.p.onkalo, mingo, linux-kernel, tglx, Steven Rostedt

On Thu, Jan 20, 2011 at 2:12 PM, Mike Galbraith <efault@gmx.de> wrote:
> If the task returns as a sleeper, place_entity() will be called when it
> is awakened, so its sleep credit will be clipped as usual.  So vruntime
> can be much less than min_vruntime at class exit time, and it doesn't
> matter; clipping on wakeup after re-entry takes care of it, if that's
> what you were thinking about.

For a sleeping task which stays in sched_fair before it's woken:
try_to_wake_up()
  ttwu_activate()
    activate_task()
      enqueue_task_fair()
        enqueue_entity()
          place_entity()        <== clip vruntime

For a sleeping task which is promoted to sched_rt while it sleeps:
rt_mutex_setprio()
  check_class_changed()
    switched_from_fair()     <== vruntime -= min_vruntime
      try_to_wake_up()
        ...run then stay on rq
        rt_mutex_setprio()
          enqueue_task_fair()     <== vruntime += min_vruntime

The difference is that in the second case, place_entity() is not
called, but wrt sched_fair the task is a WAKEUP task.
Then we place this task in sched_fair to the left of where it should be.

Thanks,
Yong

-- 
Only stand for myself
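
In toy numbers (purely illustrative), the two paths compare like this:

/*
 * Suppose cfs_rq->min_vruntime = 10,000,000 ns and the task last ran
 * long ago, so se->vruntime = 4,000,000 ns when it blocks.
 *
 * Path 1 (plain wakeup): place_entity() clips the sleeper to roughly
 *   min_vruntime - sysctl_sched_latency/2, so its credit is bounded
 *   to a few milliseconds no matter how long it slept.
 *
 * Path 2 (boosted while asleep):
 *   switched_from_fair():  se->vruntime -= min_vruntime;   => -6,000,000
 *   ... task wakes, runs as RT, min_vruntime keeps advancing ...
 *   enqueue_task_fair():   se->vruntime += min_vruntime';  => min' - 6,000,000
 *
 * Nothing ever clipped the saved offset, so the full (unbounded) sleep
 * credit survives and the task is enqueued far left in the rbtree.
 */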


* Re: Bug in scheduler when using rt_mutex
  2011-01-19 13:30                     ` Peter Zijlstra
  2011-01-20  4:18                       ` Yong Zhang
@ 2011-01-20  7:07                       ` Onkalo Samu
  2011-01-21  6:25                         ` Onkalo Samu
  1 sibling, 1 reply; 42+ messages in thread
From: Onkalo Samu @ 2011-01-20  7:07 UTC (permalink / raw)
  To: ext Peter Zijlstra; +Cc: Yong Zhang, mingo, linux-kernel, tglx, Steven Rostedt

On Wed, 2011-01-19 at 14:30 +0100, ext Peter Zijlstra wrote:
> On Wed, 2011-01-19 at 15:13 +0200, Onkalo Samu wrote:
> > 
> > When the task is sleeping rt_mutex_setprio
> > doesn't call check_class_changed since the task is not 
> > in queue at that moment I think. 
> 
> Ah, indeed, more changes then..
> 
> (I think we can also remove the prio_changed_idle thing, since I think
> we killed that hotplug usage)

This doesn't compile, because the kernel/sched_idletask.c functions
still use the "running" parameter, which was removed.





* Re: Bug in scheduler when using rt_mutex
  2011-01-20  7:06                               ` Yong Zhang
@ 2011-01-20  8:37                                 ` Mike Galbraith
  2011-01-20  9:07                                   ` Yong Zhang
  0 siblings, 1 reply; 42+ messages in thread
From: Mike Galbraith @ 2011-01-20  8:37 UTC (permalink / raw)
  To: Yong Zhang
  Cc: Peter Zijlstra, samu.p.onkalo, mingo, linux-kernel, tglx, Steven Rostedt

On Thu, 2011-01-20 at 15:06 +0800, Yong Zhang wrote:
> On Thu, Jan 20, 2011 at 2:12 PM, Mike Galbraith <efault@gmx.de> wrote:
> > If the task returns as a sleeper, place_entity() will be called when it
> > is awakened, so its sleep credit will be clipped as usual.  So vruntime
> > can be much less than min_vruntime at class exit time, and it doesn't
> > matter; clipping on wakeup after re-entry takes care of it, if that's
> > what you were thinking about.
> 
> For a sleeping task which stays in sched_fair before it's woken:
> try_to_wake_up()
>   ttwu_activate()
>     activate_task()
>       enqueue_task_fair()
>         enqueue_entity()
>           place_entity()        <== clip vruntime
>
> For a sleeping task which is promoted to sched_rt while it sleeps:
> rt_mutex_setprio()
>   check_class_changed()
>     switched_from_fair()     <== vruntime -= min_vruntime
>       try_to_wake_up()
>         ...run then stay on rq
>         rt_mutex_setprio()
>           enqueue_task_fair()     <== vruntime += min_vruntime
>
> The difference is that in the second case, place_entity() is not
> called, but wrt sched_fair the task is a WAKEUP task.
> Then we place this task in sched_fair to the left of where it should be.

D'oh.  You're right, he needs to be clipped before he leaves.

	-Mike



* Re: Bug in scheduler when using rt_mutex
  2011-01-20  8:37                                 ` Mike Galbraith
@ 2011-01-20  9:07                                   ` Yong Zhang
  2011-01-20 10:07                                     ` Mike Galbraith
  0 siblings, 1 reply; 42+ messages in thread
From: Yong Zhang @ 2011-01-20  9:07 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, samu.p.onkalo, mingo, linux-kernel, tglx, Steven Rostedt

On Thu, Jan 20, 2011 at 4:37 PM, Mike Galbraith <efault@gmx.de> wrote:
> On Thu, 2011-01-20 at 15:06 +0800, Yong Zhang wrote:
>> On Thu, Jan 20, 2011 at 2:12 PM, Mike Galbraith <efault@gmx.de> wrote:
>> > If the task returns as a sleeper, place_entity() will be called when it
>> > is awakened, so its sleep credit will be clipped as usual.  So vruntime
>> > can be much less than min_vruntime at class exit time, and it doesn't
>> > matter; clipping on wakeup after re-entry takes care of it, if that's
>> > what you were thinking about.
>>
>> For a sleeping task which stays in sched_fair before it's woken:
>> try_to_wake_up()
>>   ttwu_activate()
>>     activate_task()
>>       enqueue_task_fair()
>>         enqueue_entity()
>>           place_entity()        <== clip vruntime
>>
>> For a sleeping task which is promoted to sched_rt while it sleeps:
>> rt_mutex_setprio()
>>   check_class_changed()
>>     switched_from_fair()     <== vruntime -= min_vruntime
>>       try_to_wake_up()
>>         ...run then stay on rq
>>         rt_mutex_setprio()
>>           enqueue_task_fair()     <== vruntime += min_vruntime
>>
>> The difference is that in the second case, place_entity() is not
>> called, but wrt sched_fair the task is a WAKEUP task.
>> Then we place this task in sched_fair to the left of where it should be.
>
> D'oh.  You're right, he needs to be clipped before he leaves.

Exactly, we should clip it when it comes back, because it could still
sleep for some time after it leaves ;)

Thanks,
Yong

-- 
Only stand for myself


* Re: Bug in scheduler when using rt_mutex
  2011-01-20  9:07                                   ` Yong Zhang
@ 2011-01-20 10:07                                     ` Mike Galbraith
  2011-01-21 11:08                                       ` Peter Zijlstra
  2011-01-21 13:15                                       ` Yong Zhang
  0 siblings, 2 replies; 42+ messages in thread
From: Mike Galbraith @ 2011-01-20 10:07 UTC (permalink / raw)
  To: Yong Zhang
  Cc: Peter Zijlstra, samu.p.onkalo, mingo, linux-kernel, tglx, Steven Rostedt

On Thu, 2011-01-20 at 17:07 +0800, Yong Zhang wrote:
> On Thu, Jan 20, 2011 at 4:37 PM, Mike Galbraith <efault@gmx.de> wrote:
> > On Thu, 2011-01-20 at 15:06 +0800, Yong Zhang wrote:
> >> On Thu, Jan 20, 2011 at 2:12 PM, Mike Galbraith <efault@gmx.de> wrote:
> >> > If the task returns as a sleeper, place_entity() will be called when it
> >> > is awakened, so its sleep credit will be clipped as usual.  So vruntime
> >> > can be much less than min_vruntime at class exit time, and it doesn't
> >> > matter; clipping on wakeup after re-entry takes care of it, if that's
> >> > what you were thinking about.
> >>
> >> For a sleeping task which stays in sched_fair before it's woken:
> >> try_to_wake_up()
> >>   ttwu_activate()
> >>     activate_task()
> >>       enqueue_task_fair()
> >>         enqueue_entity()
> >>           place_entity()        <== clip vruntime
> >>
> >> For a sleeping task which is promoted to sched_rt while it sleeps:
> >> rt_mutex_setprio()
> >>   check_class_changed()
> >>     switched_from_fair()     <== vruntime -= min_vruntime
> >>       try_to_wake_up()
> >>         ...run then stay on rq
> >>         rt_mutex_setprio()
> >>           enqueue_task_fair()     <== vruntime += min_vruntime
> >>
> >> The difference is that in the second case, place_entity() is not
> >> called, but wrt sched_fair the task is a WAKEUP task.
> >> Then we place this task in sched_fair to the left of where it should be.
> >
> > D'oh.  You're right, he needs to be clipped before he leaves.
> 
> Exactly, we should clip it when it comes back, because it could still
> sleep for some time after it leaves ;)

That's ok, we don't and aren't supposed to care what happens while he's
gone.  But we do have to make sure that vruntime is sane either when he
leaves, or when he comes back.  Seems to me the easiest is to clip when he
leaves to cover him having slept a long time before leaving, then coming
back on us as a runner.  If he comes back as a sleeper, he'll be clipped
again anyway, so all is well.

sched_fork() should probably zero child's vruntime too, so non-fair
children can't enter fair_class with some bogus lag they never had.

	-Mike
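
Spelling out the sched_fork() point (our gloss, not posted code):

/*
 * copy_process() starts the child from a duplicate of the parent's
 * task_struct, so an RT child inherits the parent's absolute
 * se.vruntime.  An RT child never passes through task_fork_fair(),
 * so the stale value survives; if the child later switches to the
 * fair class, enqueue_entity() would add min_vruntime on top of it,
 * giving the task lag it never actually had.
 */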



* Re: Bug in scheduler when using rt_mutex
  2011-01-20  7:07                       ` Onkalo Samu
@ 2011-01-21  6:25                         ` Onkalo Samu
  0 siblings, 0 replies; 42+ messages in thread
From: Onkalo Samu @ 2011-01-21  6:25 UTC (permalink / raw)
  To: ext Peter Zijlstra; +Cc: Yong Zhang, mingo, linux-kernel, tglx, Steven Rostedt

On Thu, 2011-01-20 at 09:07 +0200, Onkalo Samu wrote:
> On Wed, 2011-01-19 at 14:30 +0100, ext Peter Zijlstra wrote:
> > On Wed, 2011-01-19 at 15:13 +0200, Onkalo Samu wrote:
> > > 
> > > When the task is sleeping rt_mutex_setprio
> > > doesn't call check_class_changed since the task is not 
> > > in queue at that moment I think. 
> > 
> > Ah, indeed, more changes then..
> > 
> > (I think we can also remove the prio_changed_idle thing, since I think
> > we killed that hotplug usage)
> 
> This doesn't compile, because the kernel/sched_idletask.c functions
> still use the "running" parameter, which was removed.
> 

I made similar corrections to sched_idletask.c as you did for other
functions. Seems to work for us.

-Samu
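
Samu's diff isn't posted here; a minimal correction that mirrors the
pattern Peter used for the fair and rt classes would presumably look
like this (a guess at the fix, not the tested code):

static void switched_to_idle(struct rq *rq, struct task_struct *p)
{
	/* Can this actually happen?? */
	if (!p->se.on_rq)
		return;

	if (rq->curr == p)
		resched_task(rq->curr);
	else
		check_preempt_curr(rq, p, 0);
}

static void
prio_changed_idle(struct rq *rq, struct task_struct *p, int oldprio)
{
	/* This can happen for hot plug CPUS */

	if (!p->se.on_rq)
		return;

	if (rq->curr == p) {
		if (p->prio > oldprio)
			resched_task(rq->curr);
	} else
		check_preempt_curr(rq, p, 0);
}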





* Re: Bug in scheduler when using rt_mutex
  2011-01-20 10:07                                     ` Mike Galbraith
@ 2011-01-21 11:08                                       ` Peter Zijlstra
  2011-01-21 12:24                                         ` Yong Zhang
  2011-01-21 13:15                                       ` Yong Zhang
  1 sibling, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2011-01-21 11:08 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Yong Zhang, samu.p.onkalo, mingo, linux-kernel, tglx, Steven Rostedt

On Thu, 2011-01-20 at 11:07 +0100, Mike Galbraith wrote:
> On Thu, 2011-01-20 at 17:07 +0800, Yong Zhang wrote:
> > On Thu, Jan 20, 2011 at 4:37 PM, Mike Galbraith <efault@gmx.de> wrote:
> > > On Thu, 2011-01-20 at 15:06 +0800, Yong Zhang wrote:
> > >> On Thu, Jan 20, 2011 at 2:12 PM, Mike Galbraith <efault@gmx.de> wrote:
> > >> > If the task returns as a sleeper, place_entity() will be called when it
> > >> > is awakened, so its sleep credit will be clipped as usual.  So vruntime
> > >> > can be much less than min_vruntime at class exit time, and it doesn't
> > >> > matter; clipping on wakeup after re-entry takes care of it, if that's
> > >> > what you were thinking about.
> > >>
> > >> For a sleeping task which stays in sched_fair before it's woken:
> > >> try_to_wake_up()
> > >>   ttwu_activate()
> > >>     activate_task()
> > >>       enqueue_task_fair()
> > >>         enqueue_entity()
> > >>           place_entity()        <== clip vruntime
> > >>
> > >> For a sleeping task which is promoted to sched_rt while it sleeps:
> > >> rt_mutex_setprio()
> > >>   check_class_changed()
> > >>     switched_from_fair()     <== vruntime -= min_vruntime
> > >>       try_to_wake_up()
> > >>         ...run then stay on rq
> > >>         rt_mutex_setprio()
> > >>           enqueue_task_fair()     <== vruntime += min_vruntime
> > >>
> > >> The difference is that in the second case, place_entity() is not
> > >> called, but wrt sched_fair the task is a WAKEUP task.
> > >> Then we place this task in sched_fair to the left of where it should be.
> > >
> > > D'oh.  You're right, he needs to be clipped before he leaves.
> > 
> > Exactly, we should clip it when it comes back, because it could still
> > sleep for some time after it leaves ;)
> 
> That's ok, we don't and aren't supposed to care what happens while he's
> gone.  But we do have to make sure that vruntime is sane either when he
> > leaves, or when he comes back.  Seems to me the easiest is to clip when he
> leaves to cover him having slept a long time before leaving, then coming
> back on us as a runner.  If he comes back as a sleeper, he'll be clipped
> again anyway, so all is well.
>
> sched_fork() should probably zero child's vruntime too, so non-fair
> children can't enter fair_class with some bogus lag they never had.

Something like so?

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2624,6 +2624,8 @@ void sched_fork(struct task_struct *p, i
 
 	if (!rt_prio(p->prio))
 		p->sched_class = &fair_sched_class;
+	else
+		p->se.vruntime = 0;
 
 	if (p->sched_class->task_fork)
 		p->sched_class->task_fork(p);
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -4086,8 +4086,14 @@ static void switched_from_fair(struct rq
 	 * have normalized the vruntime, if it was !on_rq, then only when
 	 * the task is sleeping will it still have non-normalized vruntime.
 	 */
-	if (!se->on_rq && p->state != TASK_RUNNING)
+	if (!se->on_rq && p->state != TASK_RUNNING) {
+		/*
+		 * Fix up our vruntime so that the current sleep doesn't
+		 * cause 'unlimited' sleep bonus.
+		 */
+		place_entity(cfs_rq, se, 0);
 		se->vruntime -= cfs_rq->min_vruntime;
+	}
 }
 
 /*
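
With that fix, the earlier toy numbers (again purely illustrative) come
out bounded:

/*
 * min_vruntime = 10,000,000 ns, se->vruntime = 4,000,000 ns,
 * sysctl_sched_latency = 6 ms with GENTLE_FAIR_SLEEPERS:
 *
 *   place_entity(cfs_rq, se, 0);
 *     => se->vruntime = max(4,000,000, 10,000,000 - 3,000,000)
 *     => 7,000,000
 *   se->vruntime -= cfs_rq->min_vruntime;
 *     => -3,000,000
 *
 * The stored offset is now limited to the usual sleeper credit, and if
 * the task comes back as a sleeper it is simply clipped again on wakeup.
 */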



* Re: Bug in scheduler when using rt_mutex
  2011-01-21 11:08                                       ` Peter Zijlstra
@ 2011-01-21 12:24                                         ` Yong Zhang
  2011-01-21 13:40                                           ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Yong Zhang @ 2011-01-21 12:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, samu.p.onkalo, mingo, linux-kernel, tglx, Steven Rostedt

On Fri, Jan 21, 2011 at 12:08:56PM +0100, Peter Zijlstra wrote:
> > That's ok, we don't and aren't supposed to care what happens while he's
> > gone.  But we do have to make sure that vruntime is sane either when he
> > leaves, or when he comes back.  Seems to me the easiest is clip when he
> > leaves to cover him having slept a long time before leaving, then coming
> > back on us as a runner.  If he comes back as a sleeper, he'll be clipped
> > again anyway, so all is well.
> >
> > sched_fork() should probably zero child's vruntime too, so non-fair
> > children can't enter fair_class with some bogus lag they never had.
> 
> Something like so?
> 
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -2624,6 +2624,8 @@ void sched_fork(struct task_struct *p, i
>  
>  	if (!rt_prio(p->prio))
>  		p->sched_class = &fair_sched_class;
> +	else
> +		p->se.vruntime = 0;

This can be moved to __sched_fork()

>  
>  	if (p->sched_class->task_fork)
>  		p->sched_class->task_fork(p);
> Index: linux-2.6/kernel/sched_fair.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched_fair.c
> +++ linux-2.6/kernel/sched_fair.c
> @@ -4086,8 +4086,14 @@ static void switched_from_fair(struct rq
>  	 * have normalized the vruntime, if it was !on_rq, then only when
>  	 * the task is sleeping will it still have non-normalized vruntime.
>  	 */
> -	if (!se->on_rq && p->state != TASK_RUNNING)
> +	if (!se->on_rq && p->state != TASK_RUNNING) {
> +		/*
> +		 * Fix up our vruntime so that the current sleep doesn't
> +		 * cause 'unlimited' sleep bonus.
> +		 */
> +		place_entity(cfs_rq, se, 0);
>  		se->vruntime -= cfs_rq->min_vruntime;

Now I will say yes.
Though it's the same as my suggestion, which I rejected myself :)

Thanks,
Yong


* Re: Bug in scheduler when using rt_mutex
  2011-01-20 10:07                                     ` Mike Galbraith
  2011-01-21 11:08                                       ` Peter Zijlstra
@ 2011-01-21 13:15                                       ` Yong Zhang
  1 sibling, 0 replies; 42+ messages in thread
From: Yong Zhang @ 2011-01-21 13:15 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, samu.p.onkalo, mingo, linux-kernel, tglx, Steven Rostedt

On Thu, Jan 20, 2011 at 11:07:27AM +0100, Mike Galbraith wrote:
> On Thu, 2011-01-20 at 17:07 +0800, Yong Zhang wrote:
> > On Thu, Jan 20, 2011 at 4:37 PM, Mike Galbraith <efault@gmx.de> wrote:
> > > On Thu, 2011-01-20 at 15:06 +0800, Yong Zhang wrote:
> > >> On Thu, Jan 20, 2011 at 2:12 PM, Mike Galbraith <efault@gmx.de> wrote:
> > >> > If the task returns as a sleeper, place_entity() will be called when it
> > >> > is awakened, so its sleep credit will be clipped as usual.  So vruntime
> > >> > can be much less than min_vruntime at class exit time, and it doesn't
> > >> > matter; clipping on wakeup after re-entry takes care of it, if that's
> > >> > what you were thinking about.
> > >>
> > >> For a sleeping task which stays in sched_fair before it's woken:
> > >> try_to_wake_up()
> > >>   ttwu_activate()
> > >>     activate_task()
> > >>       enqueue_task_fair()
> > >>         enqueue_entity()
> > >>           place_entity()        <== clip vruntime
> > >>
> > >> For a sleeping task which is promoted to sched_rt while it sleeps:
> > >> rt_mutex_setprio()
> > >>   check_class_changed()
> > >>     switched_from_fair()     <== vruntime -= min_vruntime
> > >>       try_to_wake_up()
> > >>         ...run then stay on rq
> > >>         rt_mutex_setprio()
> > >>           enqueue_task_fair()     <== vruntime += min_vruntime
> > >>
> > >> The difference is that in the second case, place_entity() is not
> > >> called, but wrt sched_fair the task is a WAKEUP task.
> > >> Then we place this task in sched_fair to the left of where it should be.
> > >
> > > D'oh.  You're right, he needs to be clipped before he leaves.
> > 
> > Exactly, we should clip it when it comes back, because it could still
> > sleep for some time after it leaves ;)
> 
> That's ok, we don't and aren't supposed to care what happens while he's
> gone.  But we do have to make sure that vruntime is sane either when he
> leaves, or when he comes back.  Seems to me the easiest is to clip when he
> leaves to cover him having slept a long time before leaving, then coming
> back on us as a runner.  If he comes back as a sleeper, he'll be clipped
> again anyway, so all is well.

OK, fair enough.

> 
> sched_fork() should probably zero child's vruntime too, so non-fair
> children can't enter fair_class with some bogus lag they never had.

Yeah,
so I think Peter's new patch takes care of all the issues now.

Thanks,
Yong


* Re: Bug in scheduler when using rt_mutex
  2011-01-21 12:24                                         ` Yong Zhang
@ 2011-01-21 13:40                                           ` Peter Zijlstra
  2011-01-21 15:03                                             ` Yong Zhang
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2011-01-21 13:40 UTC (permalink / raw)
  To: Yong Zhang
  Cc: Mike Galbraith, samu.p.onkalo, mingo, linux-kernel, tglx, Steven Rostedt

On Fri, 2011-01-21 at 20:24 +0800, Yong Zhang wrote:
> > Index: linux-2.6/kernel/sched.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/sched.c
> > +++ linux-2.6/kernel/sched.c
> > @@ -2624,6 +2624,8 @@ void sched_fork(struct task_struct *p, i
> >  
> >       if (!rt_prio(p->prio))
> >               p->sched_class = &fair_sched_class;
> > +     else
> > +             p->se.vruntime = 0;
> 
> This can be moved to __sched_fork() 

But we cannot do it unconditionally, we want to inherit the parent's
->vruntime when we're a fair task clone.


* Re: Bug in scheduler when using rt_mutex
  2011-01-21 13:40                                           ` Peter Zijlstra
@ 2011-01-21 15:03                                             ` Yong Zhang
  2011-01-21 15:10                                               ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Yong Zhang @ 2011-01-21 15:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, samu.p.onkalo, mingo, linux-kernel, tglx, Steven Rostedt

On Fri, Jan 21, 2011 at 02:40:58PM +0100, Peter Zijlstra wrote:
> On Fri, 2011-01-21 at 20:24 +0800, Yong Zhang wrote:
> > > Index: linux-2.6/kernel/sched.c
> > > ===================================================================
> > > --- linux-2.6.orig/kernel/sched.c
> > > +++ linux-2.6/kernel/sched.c
> > > @@ -2624,6 +2624,8 @@ void sched_fork(struct task_struct *p, i
> > >  
> > >       if (!rt_prio(p->prio))
> > >               p->sched_class = &fair_sched_class;
> > > +     else
> > > +             p->se.vruntime = 0;
> > 
> > This can be moved to __sched_fork() 
> 
> But we cannot do it unconditionally, we want to inherit the parent's
> ->vruntime when we're a fair task clone.

Doesn't task_fork_fair() (which is called after __sched_fork())
handle the inheritance?
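
For reference, the relevant bit of task_fork_fair() in 2.6.37 (trimmed,
and quoted from memory, so take the exact lines with a grain of salt):

	static void task_fork_fair(struct task_struct *p)
	{
		struct cfs_rq *cfs_rq = task_cfs_rq(current);
		struct sched_entity *se = &p->se, *curr = cfs_rq->curr;
		...
		update_curr(cfs_rq);

		if (curr)
			se->vruntime = curr->vruntime;	/* overwrites the inherited value */
		place_entity(cfs_rq, se, 1);
		...
		se->vruntime -= cfs_rq->min_vruntime;	/* keep it relative until enqueue */
	}

So whatever the child copied from the parent never survives into its
first enqueue anyway.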

Am I missing something?

Thanks,
Yong


* Re: Bug in scheduler when using rt_mutex
  2011-01-21 15:03                                             ` Yong Zhang
@ 2011-01-21 15:10                                               ` Peter Zijlstra
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Zijlstra @ 2011-01-21 15:10 UTC (permalink / raw)
  To: Yong Zhang
  Cc: Mike Galbraith, samu.p.onkalo, mingo, linux-kernel, tglx, Steven Rostedt

On Fri, 2011-01-21 at 23:03 +0800, Yong Zhang wrote:
> On Fri, Jan 21, 2011 at 02:40:58PM +0100, Peter Zijlstra wrote:
> > On Fri, 2011-01-21 at 20:24 +0800, Yong Zhang wrote:
> > > > Index: linux-2.6/kernel/sched.c
> > > > ===================================================================
> > > > --- linux-2.6.orig/kernel/sched.c
> > > > +++ linux-2.6/kernel/sched.c
> > > > @@ -2624,6 +2624,8 @@ void sched_fork(struct task_struct *p, i
> > > >  
> > > >       if (!rt_prio(p->prio))
> > > >               p->sched_class = &fair_sched_class;
> > > > +     else
> > > > +             p->se.vruntime = 0;
> > > 
> > > This can be moved to __sched_fork() 
> > 
> > But we cannot do it unconditionally, we want to inherit the parent's
> > ->vruntime when we're a fair task clone.
> 
> Doesn't task_fork_fair()(which is called after __sched_fork())
> take the inherit job?
> 
> Am I missing something?

Ah, indeed, we do the place_entity(.initial=1) thing there
unconditionally. For some reason I thought we relied on the inherited
vruntime.

