linux-kernel.vger.kernel.org archive mirror
* Re: RT and Cascade interrupts
@ 2005-05-27 16:47 Oleg Nesterov
  2005-05-27 23:37 ` john cooper
  0 siblings, 1 reply; 34+ messages in thread
From: Oleg Nesterov @ 2005-05-27 16:47 UTC (permalink / raw)
  To: john cooper; +Cc: linux-kernel, Ingo Molnar

john cooper wrote:
>
>  rpc_delete_timer(struct rpc_task *task)
>  {
> -	if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) {
> +	if (task->tk_timer.base) {
>  		del_singleshot_timer_sync(&task->tk_timer);
>  		dprintk("RPC: %4d deleting timer\n", task->tk_pid);
>  	}

I know nothing about rpc, but this looks completely wrong to me.

Please don't use the ->base field directly, use timer_pending(). This
field is never NULL in the -mm tree.

Next, timer_pending() == 0 does not mean that del_timer_sync() is
not needed, it may be running on the other CPU.

Looking at the code for the first time, I can guess that RPC_TASK_HAS_TIMER
was invented to indicate when it is safe not to wait for timer
completion. This bit is cleared after the ->tk_timeout_fn call in
the timer's handler. After that, rpc_run_timer will not touch *task.

Oleg.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: RT and Cascade interrupts
  2005-05-27 16:47 RT and Cascade interrupts Oleg Nesterov
@ 2005-05-27 23:37 ` john cooper
  2005-05-28  8:52   ` Oleg Nesterov
  0 siblings, 1 reply; 34+ messages in thread
From: john cooper @ 2005-05-27 23:37 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: linux-kernel, Ingo Molnar, john cooper

[-- Attachment #1: Type: text/plain, Size: 3984 bytes --]

Oleg Nesterov wrote:
> john cooper wrote:
> 
>> rpc_delete_timer(struct rpc_task *task)
>> {
>>-	if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) {
>>+	if (task->tk_timer.base) {
>> 		del_singleshot_timer_sync(&task->tk_timer);
>> 		dprintk("RPC: %4d deleting timer\n", task->tk_pid);
>> 	}
> 
> 
> I know nothing about rpc, but this looks completely wrong to me.
> 
> Please don't use the ->base directly, use timer_pending().

A nit here but fair enough.

> This field is never NULL in the -mm tree.

Which is fine provided timer_pending() still supplies the needed
information.  Otherwise another means will be needed to
determine whether the struct timer is logically "off" the
base to which it was previously queued.  I added such state
information to the timer structure and associated code while
hunting this problem, but removed it from the patch because,
in the tree to which the patch applies, the state of ->base
already provided that information.

> Next, timer_pending() == 0 does not mean that del_timer_sync() is
> not needed, it may be running on the other CPU.

Whether timer_pending() is SMP safe depends on the caller's
context relative to the overall usage of the timer structure.

For example, we aren't holding base->lock in rpc_delete_timer()
as would normally be expected.  The reason this is safe is that
the transition to "timer off base" ("timer.base <- NULL")
follows this sequence in __run_timers().  So the worst we
can do is race with another CPU trying to expire this
timer, and we will delete an already inactive timer, which is
innocuous here.

The opposite transition, to "timer on base"
("timer.base <- base"), could only happen after we free
the struct rpc_task in which the struct timer is embedded.
We haven't done so yet, so no race exists here.

> Looking at the code for the first time, I can guess that RPC_TASK_HAS_TIMER
> was invented to indicate when it is safe not to wait for timer
> completion. This bit is cleared after the ->tk_timeout_fn call in
> the timer's handler. After that rpc_run_timer will not touch *task.

The problem is that in the now-preemptible context in which
rpc_run_timer() executes, the call to callback(task) can be
preempted, allowing another task to restart the timer (via a
call to __rpc_add_timer()).  So the assumption implicit in a
nonpreemptive rpc_run_timer(), namely that the timer is known
to be inactive (as that's how we arrived here), is no longer
reliable.

The failure occurs when the timer is requeued during the
preemption but RPC_TASK_HAS_TIMER is afterwards cleared
in rpc_run_timer().  This results in the call to
rpc_delete_timer() from rpc_release_task() never occurring,
and the rpc_task struct is then released for reuse with the
embedded timer struct still linked in the timer cascade list.
The next reuse of this rpc_task struct queues the embedded
timer struct in the cascade list a second time, causing
the corruption.

The fix removes the attempt to replicate timer queue
status in the RPC code by removing the usage of the
pre-computed RPC_TASK_HAS_TIMER.  This information exists
only for the benefit of rpc_delete_timer(), which is
called during rpc_task teardown.  So the test is deferred
until just before the rpc_task is freed for reuse.

The potential race within rpc_release_task() of the timer
being requeued between the call to rpc_delete_timer()
and the call to rpc_free() is closed by setting
tk_timeout = 0 in the rpc_task before the rpc_delete_timer()
call.  This may be unnecessary, but I couldn't conclude that
no race exists, so the patch errs on the side of caution.
An updated patch is attached, relative to 40-04.

This is a rather complex failure scenario which unwinds
over time and requires considerable instrumentation to
isolate.  I think the addition of configurable debug
state information to the timer structure along with
assertion checking in the primitives which manipulate
them would be quite useful for catching similar problems.

-john


-- 
john.cooper@timesys.com

[-- Attachment #2: RPC.patch2 --]
[-- Type: text/plain, Size: 1734 bytes --]

--- ./include/linux/sunrpc/sched.h.ORG	2005-05-24 10:29:24.000000000 -0400
+++ ./include/linux/sunrpc/sched.h	2005-05-24 10:47:56.000000000 -0400
@@ -142,7 +142,6 @@ typedef void			(*rpc_action)(struct rpc_
 #define RPC_TASK_RUNNING	0
 #define RPC_TASK_QUEUED		1
 #define RPC_TASK_WAKEUP		2
-#define RPC_TASK_HAS_TIMER	3
 
 #define RPC_IS_RUNNING(t)	(test_bit(RPC_TASK_RUNNING, &(t)->tk_runstate))
 #define rpc_set_running(t)	(set_bit(RPC_TASK_RUNNING, &(t)->tk_runstate))
=================================================================
--- ./net/sunrpc/sched.c.ORG	2005-05-24 10:29:52.000000000 -0400
+++ ./net/sunrpc/sched.c	2005-05-27 18:27:41.000000000 -0400
@@ -103,9 +103,6 @@ static void rpc_run_timer(struct rpc_tas
 		dprintk("RPC: %4d running timer\n", task->tk_pid);
 		callback(task);
 	}
-	smp_mb__before_clear_bit();
-	clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate);
-	smp_mb__after_clear_bit();
 }
 
 /*
@@ -124,7 +121,6 @@ __rpc_add_timer(struct rpc_task *task, r
 		task->tk_timeout_fn = timer;
 	else
 		task->tk_timeout_fn = __rpc_default_timer;
-	set_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate);
 	mod_timer(&task->tk_timer, jiffies + task->tk_timeout);
 }
 
@@ -135,7 +131,7 @@ __rpc_add_timer(struct rpc_task *task, r
 static inline void
 rpc_delete_timer(struct rpc_task *task)
 {
-	if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) {
+	if (timer_pending(&task->tk_timer)) {
 		del_singleshot_timer_sync(&task->tk_timer);
 		dprintk("RPC: %4d deleting timer\n", task->tk_pid);
 	}
@@ -849,6 +845,7 @@ void rpc_release_task(struct rpc_task *t
 	task->tk_active = 0;
 
 	/* Synchronously delete any running timer */
+	task->tk_timeout = 0;
 	rpc_delete_timer(task);
 
 	/* Release resources */


* Re: RT and Cascade interrupts
  2005-05-27 23:37 ` john cooper
@ 2005-05-28  8:52   ` Oleg Nesterov
  2005-05-28 14:02     ` john cooper
  2005-05-28 17:48     ` john cooper
  0 siblings, 2 replies; 34+ messages in thread
From: Oleg Nesterov @ 2005-05-28  8:52 UTC (permalink / raw)
  To: john cooper; +Cc: linux-kernel, Ingo Molnar, Olaf Kirch

john cooper wrote:
>
> Oleg Nesterov wrote:
> > john cooper wrote:
> >
> >> rpc_delete_timer(struct rpc_task *task)
> >> {
> >>-	if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) {
> >>+	if (task->tk_timer.base) {
> >> 		del_singleshot_timer_sync(&task->tk_timer);
> >> 		dprintk("RPC: %4d deleting timer\n", task->tk_pid);
> >> 	}
> >
> >
> > I know nothing about rpc, but this looks completely wrong to me.
>
> > Next, timer_pending() == 0 does not mean that del_timer_sync() is
> > not needed, it may be running on the other CPU.
>
> Whether timer_pending() is SMP safe depends on the caller's
> context relative to the overall usage of the timer structure.
>
> For example we aren't holding base->lock in rpc_delete_timer()
> as would normally be expected.  The reason this is safe is
> the transition to "timer off base" and "timer.base <- NULL"
> follows this sequence in __run_timers().  So the worst we
> can do is to be racing with another cpu trying to expire this
> timer and we will delete an already inactive timer which is
> innocuous here.

CPU_0						CPU_1

rpc_release_task(task)
						__run_timers:
							timer->base = NULL;
	rpc_delete_timer()
		if (timer->base)
			// not taken
			del_timer_sync();

	mempool_free(task);
							timer->function(task):
								// task already freed/reused
								__rpc_default_timer(task);


This is totally pointless to do:

	if (timer_pending())
		del_singleshot_timer_sync();

If you need to ensure that timer's handler is not running on any
cpu then timer_pending() can't help. If you don't need this, you
should use plain del_timer().

> > Looking at the code for the first time, I can guess that RPC_TASK_HAS_TIMER
> > was invented to indicate when it is safe not to wait for timer
> > completion. This bit is cleared after the ->tk_timeout_fn call in
> > the timer's handler. After that rpc_run_timer will not touch *task.
>
> The problem is that in the now-preemptible context in which
> rpc_run_timer() executes, the call to callback(task) can be
> preempted, allowing another task to restart the timer (via a
> call to __rpc_add_timer()).  So the assumption implicit in a
> nonpreemptive rpc_run_timer(), namely that the timer is known
> to be inactive (as that's how we arrived here), is no longer
> reliable.

I don't understand why this race is rt specific. I think
that PREEMPT_SOFTIRQS just enlarges the window. I believe
the right fix is just to call del_singleshot_timer_sync()
unconditionally and kill the RPC_TASK_HAS_TIMER bit.

I have added Olaf Kirch to the CC: list.

> The fix removes the attempt to replicate timer queue
> status from the RPC code by removing the usage of the
> pre-calculated RPC_TASK_HAS_TIMER.

No, RPC_TASK_HAS_TIMER doesn't replicate timer queue status.
I believe it was intended as an optimization, to avoid a costly
del_timer_sync() call.

Oleg.


* Re: RT and Cascade interrupts
  2005-05-28  8:52   ` Oleg Nesterov
@ 2005-05-28 14:02     ` john cooper
  2005-05-28 16:34       ` Oleg Nesterov
  2005-05-28 17:48     ` john cooper
  1 sibling, 1 reply; 34+ messages in thread
From: john cooper @ 2005-05-28 14:02 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: linux-kernel, Ingo Molnar, Olaf Kirch, john cooper

Oleg Nesterov wrote:

[race scenario deleted]

This was not the race scenario of concern.

> This is totally pointless to do:
> 
> 	if (timer_pending())
> 		del_singleshot_timer_sync();

However, this is the action the RPC code was originally
attempting to take, based upon stale/incorrect data.
Again, the failure mode was the RPC_TASK_HAS_TIMER
bit not being set while the embedded timer actually was
still present in the timer cascade structure (e.g.,
timer->base != 0, or timer_pending()).  This caused the
rpc_task to be recycled with the embedded timer struct
still linked in the timer cascade.

> If you need to ensure that timer's handler is not running on any
> cpu then timer_pending() can't help. If you don't need this, you
> should use plain del_timer().

That's not the goal of the timer_pending() usage here.
Rather we're at a point in rpc_release_task where we
want to tear down an rpc_task.  The call to timer_pending()
determines if the embedded timer is still linked in the
timer cascade structure.

The embedded timer may be running on another CPU.  We
may even check timer->base during this race when it is
non-NULL and wind up deleting a timer which the path
running on another CPU has already deleted.  That is
allowable.  We simply want to idle the timer.

> I don't understand why this race is rt specific. I think
> that PREEMPT_SOFTIRQS just enlarges the window.

That is certainly possible.  However, this code has been
around for some time and the problem apparently hasn't
surfaced elsewhere.  I suspect it depends on the
combination of the timer cascade structures existing
per-CPU and the execution context of callback(task)
invoked in rpc_run_timer() now being preemptible (i.e.,
preemption on the same CPU is required for this
problem to surface).

> I believe
> the right fix is just to call del_singleshot_timer_sync()
> unconditionally and kill RPC_TASK_HAS_TIMER bit.

Ignoring RPC_TASK_HAS_TIMER and always calling
del_singleshot_timer_sync() will work but seems
excessive.

> I have added Olaf Kirch to the CC: list.

Welcome.

>>The fix removes the attempt to replicate timer queue
>>status from the RPC code by removing the usage of the
>>pre-calculated RPC_TASK_HAS_TIMER.
> 
> 
> No, RPC_TASK_HAS_TIMER doesn't replicate timer queue status.
> I believe, it was intended as optimization, to avoid costly
> del_timer_sync() call.

If so, its chosen name seems a misnomer.  In any case
the embedded timer in the rpc_task structure must be
quiesced before the rpc_task is made available for
reuse.  The fix under discussion here does so.  Note the
goal was not to re-architect the RPC code but rather to
address this very specific bug, which surfaced during
stress testing.

If anyone with more ownership of the RPC code would like
to comment, any insight would be most welcome.

-john


-- 
john.cooper@timesys.com


* Re: RT and Cascade interrupts
  2005-05-28 14:02     ` john cooper
@ 2005-05-28 16:34       ` Oleg Nesterov
  0 siblings, 0 replies; 34+ messages in thread
From: Oleg Nesterov @ 2005-05-28 16:34 UTC (permalink / raw)
  To: john cooper; +Cc: linux-kernel, Ingo Molnar, Olaf Kirch, Trond Myklebust

john cooper wrote:
>
> Oleg Nesterov wrote:
>
> > If you need to ensure that timer's handler is not running on any
> > cpu then timer_pending() can't help. If you don't need this, you
> > should use plain del_timer().
>
> That's not the goal of the timer_pending() usage here.
> Rather we're at a point in rpc_release_task where we
> want to tear down an rpc_task.  The call to timer_pending()
> determines if the embedded timer is still linked in the
> timer cascade structure.

Yes, I see what you are trying to fix. However, your fix
opens an even worse bug.

> If anyone with more ownership of the RPC code would like
> to comment, any insight would be most welcome.

Yes. Trond, could you please look at this thread:
http://marc.theaimsgroup.com/?t=111590936700001 and put an
end to our discussion?

Oleg.


* Re: RT and Cascade interrupts
  2005-05-28  8:52   ` Oleg Nesterov
  2005-05-28 14:02     ` john cooper
@ 2005-05-28 17:48     ` john cooper
  2005-05-28 20:35       ` Trond Myklebust
  2005-05-28 22:17       ` Trond Myklebust
  1 sibling, 2 replies; 34+ messages in thread
From: john cooper @ 2005-05-28 17:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, Ingo Molnar, Olaf Kirch, trond.myklebust, john cooper

Oleg Nesterov wrote:
> 
> CPU_0						CPU_1
> 
> rpc_release_task(task)
> 						__run_timers:
> 							timer->base = NULL;
> 	rpc_delete_timer()
> 		if (timer->base)
> 			// not taken
> 			del_timer_sync();
> 
> 	mempool_free(task);
> 							timer->function(task):
> 								// task already freed/reused
> 								__rpc_default_timer(task);

Ah, ok, I was misreading the above.  Now I see your point.
The race, if hit, loses the synchronization provided by
del_timer_sync() in my case.  Thanks for being persistent.

The other possibility to fix the original problem within
the scope of the RPC code was to replace the bit state of
RPC_TASK_HAS_TIMER with a count so we don't obscure the
fact the timer was requeued during the preemption window.
This happens as rpc_run_timer() does an unconditional
clear_bit(RPC_TASK_HAS_TIMER,..) before returning.

Another would be to check timer->base/timer_pending()
in rpc_run_timer() and to omit doing a clear_bit()
if the timer is found to have been requeued.

The former has simpler locking requirements but requires
an rpc_task data structure modification.  The latter does
not and is a bit more intuitive but needs additional
locking.  Both appear to provide a solution though I
haven't yet attempted either.

Anyone with more ownership of this code care to comment
on a preferred fix?

-john


-- 
john.cooper@timesys.com


* Re: RT and Cascade interrupts
  2005-05-28 17:48     ` john cooper
@ 2005-05-28 20:35       ` Trond Myklebust
  2005-05-29  3:12         ` john cooper
  2005-05-29 11:31         ` Oleg Nesterov
  2005-05-28 22:17       ` Trond Myklebust
  1 sibling, 2 replies; 34+ messages in thread
From: Trond Myklebust @ 2005-05-28 20:35 UTC (permalink / raw)
  To: john cooper; +Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch

On Sat, 28.05.2005 at 13:48 (-0400), john cooper wrote:

> The other possibility to fix the original problem within
> the scope of the RPC code was to replace the bit state of
> RPC_TASK_HAS_TIMER with a count so we don't obscure the
> fact the timer was requeued during the preemption window.
> This happens as rpc_run_timer() does an unconditional
> clear_bit(RPC_TASK_HAS_TIMER,..) before returning.
> 
> Another would be to check timer->base/timer_pending()
> in rpc_run_timer() and to omit doing a clear_bit()
> if the timer is found to have been requeued.

Could you please explain why you think such a scenario is possible? The
timer functions themselves should never be causing a re-queue, and every
iteration through the loop in __rpc_execute() should cause any pending
timer to be killed, as should rpc_release_task().

That's why we can use del_singleshot_timer_sync() in the first place:
because there is no recursion, and no re-queueing that can cause races.
I don't see how either preemption or RT will change that (and if they
do, then _that_ is the real bug that needs fixing).

Cheers,
  Trond



* Re: RT and Cascade interrupts
  2005-05-28 17:48     ` john cooper
  2005-05-28 20:35       ` Trond Myklebust
@ 2005-05-28 22:17       ` Trond Myklebust
  1 sibling, 0 replies; 34+ messages in thread
From: Trond Myklebust @ 2005-05-28 22:17 UTC (permalink / raw)
  To: john cooper; +Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch

On Sat, 28.05.2005 at 13:48 (-0400), john cooper wrote:
> The other possibility to fix the original problem within
> the scope of the RPC code was to replace the bit state of
> RPC_TASK_HAS_TIMER with a count so we don't obscure the
> fact the timer was requeued during the preemption window.
> This happens as rpc_run_timer() does an unconditional
> clear_bit(RPC_TASK_HAS_TIMER,..) before returning.

Just to clarify a bit more here. RPC_TASK_HAS_TIMER is definitely _not_
redundantly replicating timer state, as I saw you asserting previously.

There are two distinct cases where timer->base may be NULL. In the first
case, the timer is inactive and was never queued so it is entirely
pointless to be calling del_timer() & friends. In the second case, the
timer was active, but has expired and was picked up by __run_timers().
del_singleshot_timer_sync() fails to distinguish these two cases, and
will therefore end up redundantly calling del_timer_sync() every time a
task loops through __rpc_execute() without setting a timer. Your patch
to remove RPC_TASK_HAS_TIMER will re-introduce that redundant behaviour
but will re-introduce precisely the race that del_timer_sync() is
designed to fix.

So, I repeat: I want to see a case description that clearly demonstrates
a race scenario in the existing code before I'm willing to consider any
patches at all.

Cheers,
  Trond



* Re: RT and Cascade interrupts
  2005-05-28 20:35       ` Trond Myklebust
@ 2005-05-29  3:12         ` john cooper
  2005-05-29  7:40           ` Trond Myklebust
  2005-05-29 11:31         ` Oleg Nesterov
  1 sibling, 1 reply; 34+ messages in thread
From: john cooper @ 2005-05-29  3:12 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch, john cooper

Trond Myklebust wrote:
> On Sat, 28.05.2005 at 13:48 (-0400), john cooper wrote:
> Could you please explain why you think such a scenario is possible? The
> timer functions themselves should never be causing a re-queue, and every
> iteration through the loop in __rpc_execute() should cause any pending
> timer to be killed, as should rpc_release_task().

I'm trying to pinpoint the cause of the RPC timer cascade
structure corruption.  Here is the data I've ascertained thus far:

1.  the failure only has been seen for the timer struct embedded
     in an rpc_task.  The callback function is always rpc_run_timer().

2.  the failure mode is an rpc_task's embedded timer struct being
     queued a second time in the timer cascade without having
     been dequeued from an existing cascade vector.  The failure
     is not immediate but causes a corrupt entry which is later
     encountered in timer cascade code.

3.  the problem occurs if rpc_run_timer() executes in preemptible
     context.  The problem was not encountered when preemption was
     explicitly disabled within rpc_run_timer().

4.  the problem appears related to RPC_TASK_HAS_TIMER not being set
     in rpc_release_task().  Specifically the problem arises when
     !RPC_TASK_HAS_TIMER but timer->base is non-zero.  Modifying
     rpc_delete_timer() to effect the del_singleshot_timer_sync()
     call when (RPC_TASK_HAS_TIMER || timer->base) prevents the
     failure.

5.  (a detail of #2 above) Instrumenting the number of logical
     add/mod/del operations on the timer struct shows the corruption
     occurs at a point where the tally of operations satisfies
     (mod == add == del + 1), i.e. the timer has been added/modified
     several times but one count more than the number of delete
     operations (so it is active in the cascade).  The next operation
     on this timer struct is an add/mod operation, with the state of
     the timer having been reinitialized in rpc_init_task():init_timer()
     and no intervening delete having been performed.

If it wasn't clear in my prior mail, please disregard the earlier
claim of RPC_TASK_HAS_TIMER replicating the state of timer->base.
From Oleg's earlier mail I see that isn't the case, as additional
RPC-specific state is attached to this flag.  The patch should
likewise be disregarded.

> That's why we can use del_singleshot_timer_sync() in the first place:
> because there is no recursion, and no re-queueing that can cause races.
> I don't see how either preemption or RT will change that (and if they
> do, then _that_ is the real bug that needs fixing).

During the time I have been hunting this bug I've lost
count of the number of times I've alternatively suspected
either the kernel timer code or net/sunrpc/sched.c usage
of the same.  I still feel it is somehow related to the RPC
code but will need to refine the instrumentation to extract
further information from the failure scenario.

A few questions I'd like to pose:

1.  Can you correlate anything about rpc_run_timer() running in
     preemptible context that could explain the above behavior?

2.  Do you agree that (!RPC_TASK_HAS_TIMER && timer->base)
     is an inconsistent state at the time of
     rpc_release_task():rpc_delete_timer() ?


-john


-- 
john.cooper@timesys.com


* Re: RT and Cascade interrupts
  2005-05-29  3:12         ` john cooper
@ 2005-05-29  7:40           ` Trond Myklebust
  2005-05-30 21:32             ` john cooper
  0 siblings, 1 reply; 34+ messages in thread
From: Trond Myklebust @ 2005-05-29  7:40 UTC (permalink / raw)
  To: john cooper; +Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch

[-- Attachment #1: Type: text/plain, Size: 1465 bytes --]

On Sat, 28.05.2005 at 23:12 (-0400), john cooper wrote:

> 1.  Can you correlate anything of rpc_run_timer() running in
>      preemptive context which could explain the above behavior?
>
> 2.  Do you agree that (!RPC_TASK_HAS_TIMER && timer->base)
>      is an inconsistent state at the time of
>      rpc_release_task():rpc_delete_timer() ?

I have yet to see an example of that happening.

The current code should be working according to the following rules:

        timers are only set if the rpc_task is being put on a wait queue
        and tk_timeout is non-zero.

        RPC_TASK_HAS_TIMER is always set immediately before the timer is
        activated.

        It is cleared either by the handler _after_ it has run (and
        after timer->base has been cleared by __rpc_run_timer()), or by
        __rpc_execute()/rpc_release_task() immediately prior to the call
        to del_singleshot_timer_sync().

        task->tk_callback should never be putting a task to sleep, so
        there is no timer being set after the call to rpc_delete_timer()
        and prior to the call to task->tk_action.
        
The above rules should suffice to guarantee strict ordering w.r.t.
starting a timer, clearing the same timer, and restarting a new timer,
so the code should not need to be reentrant. I've appended a patch that
should check for strict compliance of the above rules. Could you try it
out and see if it triggers any Oopses?

Cheers,
  Trond


[-- Attachment #2: linux-2.6.12-test_timer.dif --]
[-- Type: text/plain, Size: 1557 bytes --]

 sched.c |    8 ++++++++
 1 files changed, 8 insertions(+)

Index: linux-2.6.12-rc4/net/sunrpc/sched.c
===================================================================
--- linux-2.6.12-rc4.orig/net/sunrpc/sched.c
+++ linux-2.6.12-rc4/net/sunrpc/sched.c
@@ -135,6 +135,8 @@ __rpc_add_timer(struct rpc_task *task, r
 static void
 rpc_delete_timer(struct rpc_task *task)
 {
+	BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) == 0 &&
+			timer_pending(&task->tk_timer));
 	if (RPC_IS_QUEUED(task))
 		return;
 	if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) {
@@ -337,6 +339,8 @@ static void __rpc_sleep_on(struct rpc_wa
 void rpc_sleep_on(struct rpc_wait_queue *q, struct rpc_task *task,
 				rpc_action action, rpc_action timer)
 {
+	BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) != 0 ||
+			timer_pending(&task->tk_timer));
 	/*
 	 * Protect the queue operations.
 	 */
@@ -594,6 +598,8 @@ static int __rpc_execute(struct rpc_task
 			unlock_kernel();
 		}
 
+		BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) != 0 ||
+			timer_pending(&task->tk_timer));
 		/*
 		 * Perform the next FSM step.
 		 * tk_action may be NULL when the task has been killed
@@ -925,6 +931,8 @@ fail:
 
 void rpc_run_child(struct rpc_task *task, struct rpc_task *child, rpc_action func)
 {
+	BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) != 0 ||
+			timer_pending(&task->tk_timer));
 	spin_lock_bh(&childq.lock);
 	/* N.B. Is it possible for the child to have already finished? */
 	__rpc_sleep_on(&childq, task, func, NULL);


* Re: RT and Cascade interrupts
  2005-05-28 20:35       ` Trond Myklebust
  2005-05-29  3:12         ` john cooper
@ 2005-05-29 11:31         ` Oleg Nesterov
  2005-05-29 13:58           ` Trond Myklebust
  1 sibling, 1 reply; 34+ messages in thread
From: Oleg Nesterov @ 2005-05-29 11:31 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: john cooper, linux-kernel, Ingo Molnar, Olaf Kirch

__rpc_execute() calls ->tk_exit and goes to the 'restarted' label.
What if ->tk_exit calls rpc_sleep_on()?  If that is possible, we
have a race.

CPU_0 (main loop in rpc_execute restarted)		CPU_1

rpc_delete_timer:
	if (RPC_IS_QUEUED())	// YES
		return;
							rpc_run_timer:
								rpc_wake_up_task:
									clear_bit(RPC_TASK_QUEUED)
								preemption, or long interrupt

if (!RPC_IS_QUEUED())		// YES
	task->tk_action()
		__rpc_add_timer:
			set_bit(RPC_TASK_HAS_TIMER)
			mod_timer();
								clear_bit(RPC_TASK_HAS_TIMER)

Now we have a pending timer without the RPC_TASK_HAS_TIMER bit set.

Does this patch make any sense?

Oleg.

--- 2.6.12-rc5/net/sunrpc/sched.c~	Sun May 29 15:09:25 2005
+++ 2.6.12-rc5/net/sunrpc/sched.c	Sun May 29 15:11:24 2005
@@ -135,8 +135,6 @@ __rpc_add_timer(struct rpc_task *task, r
 static void
 rpc_delete_timer(struct rpc_task *task)
 {
-	if (RPC_IS_QUEUED(task))
-		return;
 	if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) {
 		del_singleshot_timer_sync(&task->tk_timer);
 		dprintk("RPC: %4d deleting timer\n", task->tk_pid);
@@ -566,7 +564,6 @@ static int __rpc_execute(struct rpc_task
 
 	BUG_ON(RPC_IS_QUEUED(task));
 
- restarted:
 	while (1) {
 		/*
 		 * Garbage collection of pending timers...
@@ -607,6 +604,7 @@ static int __rpc_execute(struct rpc_task
 			unlock_kernel();
 		}
 
+ restarted:
 		/*
 		 * Lockless check for whether task is sleeping or not.
 		 */


* Re: RT and Cascade interrupts
  2005-05-29 11:31         ` Oleg Nesterov
@ 2005-05-29 13:58           ` Trond Myklebust
  2005-05-30 14:50             ` Ingo Molnar
  0 siblings, 1 reply; 34+ messages in thread
From: Trond Myklebust @ 2005-05-29 13:58 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: john cooper, linux-kernel, Ingo Molnar, Olaf Kirch

On Sun, 29.05.2005 at 15:31 (+0400), Oleg Nesterov wrote:
> __rpc_execute() calls ->tk_exit and goes to 'restarted' label.
> What if ->tk_exit calls rpc_sleep_on() ? If it is possible, we
> have a race.
> 
> CPU_0 (main loop in rpc_execute restarted)		CPU_1
> 
> rpc_delete_timer:
> 	if (RPC_IS_QUEUED())	// YES
> 		return;
> 							rpc_run_timer:
> 								rpc_wake_up_task:
> 									clear_bit(RPC_TASK_QUEUED)
> 								preemption, or long interrupt
> 
> if (!RPC_IS_QUEUED())		// YES
> 	task->tk_action()
> 		__rpc_add_timer:
> 			set_bit(RPC_TASK_HAS_TIMER)
> 			mod_timer();
> 								clear_bit(RPC_TASK_HAS_TIMER)
> 
> Now we have pending timer without RPC_TASK_HAS_TIMER bit set.

It is possible, but it should be _extremely_ rare. The only cases where
tk_exit() sets a timer are the cases where we call rpc_delay(). In the
existing cases in the kernel, that would be setting timers of 5 seconds
(for the case of JUKEBOX errors) or longer.

These tk_exit()+restart events are very rare in themselves (JUKEBOX
errors only occur if the server needs to do something with a long
latency - such as retrieving a file from an HSM unit), but for that to
coincide with a preemption or interrupt of > 5 seconds inside
__rpc_execute... I'd hate to see those RPC performance figures.

> Does this patch make any sense?

Yes. I agree the scenario is theoretically possible (so I can queue that
patch up for you). I am not convinced it is a plausible explanation for
what John claims to be seeing, though.

Cheers,
  Trond



* Re: RT and Cascade interrupts
  2005-05-29 13:58           ` Trond Myklebust
@ 2005-05-30 14:50             ` Ingo Molnar
  0 siblings, 0 replies; 34+ messages in thread
From: Ingo Molnar @ 2005-05-30 14:50 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Oleg Nesterov, john cooper, linux-kernel, Olaf Kirch


* Trond Myklebust <trond.myklebust@fys.uio.no> wrote:

> > Does this patch make any sense?
> 
> Yes. I agree the scenario is theoretically possible (so I can queue 
> that patch up for you). I am not convinced it is a plausible 
> explanation for what John claims to be seeing, though.

i've added this patch (and your debug asserts, except for the 
rpc_delete_timer() one) to the -RT tree and i've removed the earlier 
hack - perhaps John can re-run the test and see whether it still occurs 
under -rc5-RT-V0.7.47-13 or later?

most races are much more likely to occur under PREEMPT_RT than under 
other preemption models, but maybe there's something else going on as 
well.

	Ingo

--- linux/net/sunrpc/sched.c.orig
+++ linux/net/sunrpc/sched.c
@@ -135,8 +135,6 @@ __rpc_add_timer(struct rpc_task *task, r
 static void
 rpc_delete_timer(struct rpc_task *task)
 {
-	if (RPC_IS_QUEUED(task))
-		return;
 	if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) {
 		del_singleshot_timer_sync(&task->tk_timer);
 		dprintk("RPC: %4d deleting timer\n", task->tk_pid);
@@ -337,6 +335,8 @@ static void __rpc_sleep_on(struct rpc_wa
 void rpc_sleep_on(struct rpc_wait_queue *q, struct rpc_task *task,
 				rpc_action action, rpc_action timer)
 {
+	BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) != 0 ||
+			timer_pending(&task->tk_timer));
 	/*
 	 * Protect the queue operations.
 	 */
@@ -566,7 +566,6 @@ static int __rpc_execute(struct rpc_task
 
 	BUG_ON(RPC_IS_QUEUED(task));
 
- restarted:
 	while (1) {
 		/*
 		 * Garbage collection of pending timers...
@@ -594,6 +593,8 @@ static int __rpc_execute(struct rpc_task
 			unlock_kernel();
 		}
 
+		BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) != 0 ||
+			timer_pending(&task->tk_timer));
 		/*
 		 * Perform the next FSM step.
 		 * tk_action may be NULL when the task has been killed
@@ -607,6 +608,7 @@ static int __rpc_execute(struct rpc_task
 			unlock_kernel();
 		}
 
+ restarted:
 		/*
 		 * Lockless check for whether task is sleeping or not.
 		 */
@@ -925,6 +927,8 @@ fail:
 
 void rpc_run_child(struct rpc_task *task, struct rpc_task *child, rpc_action func)
 {
+	BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) != 0 ||
+			timer_pending(&task->tk_timer));
 	spin_lock_bh(&childq.lock);
 	/* N.B. Is it possible for the child to have already finished? */
 	__rpc_sleep_on(&childq, task, func, NULL);


* Re: RT and Cascade interrupts
  2005-05-29  7:40           ` Trond Myklebust
@ 2005-05-30 21:32             ` john cooper
  2005-05-31 23:09               ` john cooper
  2005-06-01 14:22               ` Oleg Nesterov
  0 siblings, 2 replies; 34+ messages in thread
From: john cooper @ 2005-05-30 21:32 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch, john cooper

Trond Myklebust wrote:

> I've appended a patch that
> should check for strict compliance of the above rules. Could you try it
> out and see if it triggers any Oopses?

Yes, the assert in rpc_delete_timer() occurs just before
the cascade list corruption.  This is consistent with
what I have seen, i.e. the timer in a released rpc_task
is still active.

BTW, the patch from Oleg is relative to 2.6.12 and didn't
appear to apply to the 2.6.11-derived base with which I'm
working (the RPC_IS_QUEUED() test at the head of rpc_delete_timer()
does not exist).  In any case the relocation of restarted: in
__rpc_execute() did not influence the failure.  I'd like to
move to a 2.6.12-based -RT patch however I'm dealing with
"code in the pipe" and unfortunately don't have that option.

Sorry I'm only just responding now.  We're in the middle of a
long holiday weekend.  I will have more time come tomorrow
to analyze this further.

-john


> ------------------------------------------------------------------------
> 
>  sched.c |    8 ++++++++
>  1 files changed, 8 insertions(+)
> 
> Index: linux-2.6.12-rc4/net/sunrpc/sched.c
> ===================================================================
> --- linux-2.6.12-rc4.orig/net/sunrpc/sched.c
> +++ linux-2.6.12-rc4/net/sunrpc/sched.c
> @@ -135,6 +135,8 @@ __rpc_add_timer(struct rpc_task *task, r
>  static void
>  rpc_delete_timer(struct rpc_task *task)
>  {
> +	BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) == 0 &&
> +			timer_pending(&task->tk_timer));
>  	if (RPC_IS_QUEUED(task))
>  		return;
>  	if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) {
> @@ -337,6 +339,8 @@ static void __rpc_sleep_on(struct rpc_wa
>  void rpc_sleep_on(struct rpc_wait_queue *q, struct rpc_task *task,
>  				rpc_action action, rpc_action timer)
>  {
> +	BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) != 0 ||
> +			timer_pending(&task->tk_timer));
>  	/*
>  	 * Protect the queue operations.
>  	 */
> @@ -594,6 +598,8 @@ static int __rpc_execute(struct rpc_task
>  			unlock_kernel();
>  		}
>  
> +		BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) != 0 ||
> +			timer_pending(&task->tk_timer));
>  		/*
>  		 * Perform the next FSM step.
>  		 * tk_action may be NULL when the task has been killed
> @@ -925,6 +931,8 @@ fail:
>  
>  void rpc_run_child(struct rpc_task *task, struct rpc_task *child, rpc_action func)
>  {
> +	BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) != 0 ||
> +			timer_pending(&task->tk_timer));
>  	spin_lock_bh(&childq.lock);
>  	/* N.B. Is it possible for the child to have already finished? */
>  	__rpc_sleep_on(&childq, task, func, NULL);


-- 
john.cooper@timesys.com


* Re: RT and Cascade interrupts
  2005-05-30 21:32             ` john cooper
@ 2005-05-31 23:09               ` john cooper
  2005-06-01 14:22               ` Oleg Nesterov
  1 sibling, 0 replies; 34+ messages in thread
From: john cooper @ 2005-05-31 23:09 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: john cooper, Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch

john cooper wrote:
> Trond Myklebust wrote:
> 
>> I've appended a patch that
>> should check for strict compliance of the above rules. Could you try it
>> out and see if it triggers any Oopses?
> 
> 
> Yes, the assert in rpc_delete_timer() occurs just before
> the cascade list corruption.  This is consistent with
> what I have seen.  ie: the timer in a released rpc_task
> is still active.

I've captured more data in the instrumentation and found
the rpc_task's timer is being requeued by an application
task which is preempting ksoftirqd when it wakes up in
xprt_transmit().  This is what I had originally suspected
but likely didn't communicate effectively.

The scenario unfolds as:

                                          [high priority app task]
                                              :
                                          call_transmit()
                                             xprt_transmit()
                                             /* blocks in xprt_transmit() */
ksoftirqd()
     __run_timers()
         list_del("rpc_task_X.timer") /* logically off cascade */
         rpc_run_timer(data)
             task->tk_timeout_fn(task)

             /* ksoftirqd preempted */

                                              :
                                              ---------------------------------------------------------
                                              /* Don't race with disconnect */
                                              if (!xprt_connected(xprt))
                                                  task->tk_status = -ENOTCONN;
                                              else if (!req->rq_received)
                                                  rpc_sleep_on(&xprt->pending, task, NULL, xprt_timer);
                                              ---------------------------------------------------------
                                                      __rpc_sleep_on()
                                                          __mod_timer("rpc_task_X.timer")  /* requeued in cascade */
                                              /* blocks */

         /* rpc_run_timer resumes from preempt */
         clear_bit(RPC_TASK_HAS_TIMER, "rpc_task_X.tk_runstate");

         /* rpc_task_X.timer is now enqueued in cascade without
            RPC_TASK_HAS_TIMER set and will not be dequeued
            in rpc_release_task()/rpc_delete_timer() */


The usage of "rpc_task_X.timer" indicates the same KVA
observed for the timer struct at the associated points
in the instrumented code.

The above was gathered by logging usage of the
kernel/timer.c primitives.  Thus I don't have more
detailed state of the rpc_task in RPC context.
However I did verify which of the three calls to
rpc_sleep_on() in xprt_transmit() was being invoked
(as above).

So the root cause appears to be the rpc_task's timer
being requeued in xprt_transmit() when rpc_run_timer
is preempted.  From looking at the code I'm unsure
if modifying xprt_transmit()/out_receive is appropriate
to synchronize with rpc_release_task().  It seems more
natural to allow rpc_sleep_on() to occur and to have
rpc_release_task() detect the pending timer and remove it
before proceeding.  I'm still in the process of trying to
digest the logic here, but I thought there was enough
information to be of use.  Suggestions,
warnings welcome.

-john


-- 
john.cooper@timesys.com


* Re: RT and Cascade interrupts
  2005-05-30 21:32             ` john cooper
  2005-05-31 23:09               ` john cooper
@ 2005-06-01 14:22               ` Oleg Nesterov
  2005-06-01 18:05                 ` john cooper
  1 sibling, 1 reply; 34+ messages in thread
From: Oleg Nesterov @ 2005-06-01 14:22 UTC (permalink / raw)
  To: john cooper; +Cc: Trond Myklebust, linux-kernel, Ingo Molnar, Olaf Kirch

del_timer_sync() is known to be racy. Now I think it has
(theoretically) problems with singleshot timers too.

Suppose we have an rpc timer which is running on CPU_0, and it is
preempted *after* it has cleared the RPC_TASK_HAS_TIMER bit.

Then __rpc_execute's main loop on CPU_1 calls rpc_delete_timer(),
it returns without doing del_timer_sync().

Now it calls ->tk_action()->rpc_sleep_on()->__rpc_add_timer(),
leaving the timer pending on CPU_1.

Preemption comes, and we are rescheduled on another (not CPU_1)
processor (this step is not strictly necessary).

Next loop iteration, __rpc_execute calls rpc_delete_timer() again,
now it calls del_timer_sync().

del_timer_sync:

	// __run_timers starts on CPU_1, picks rpc timer, sets
	// timer->base = 0

	ret += del_timer(timer); // return 0, timer is not pending.

	for_each_online_cpu() {
		if (timer running on that cpu) {
			// finds the timer still running on CPU_0,
			// waits for __run_timers on CPU_0 change
			//	->running_timer,
			// the timer on CPU_0 completes.
			break;
		}
	}

	if (timer_pending())	// NO
		goto del_again;

	// The timer still running on CPU_1
	return;

All of this is very unlikely of course, but it would be nice to verify
that kernel/timer.c is not the source of the problem.

John, if it is easy to reproduce the problem, could you please retest
with this patch?

Oleg.

--- 2.6.12-rc5/net/sunrpc/sched.c~	Wed Jun  1 17:49:57 2005
+++ 2.6.12-rc5/net/sunrpc/sched.c	Wed Jun  1 18:00:31 2005
@@ -137,8 +137,12 @@ rpc_delete_timer(struct rpc_task *task)
 {
 	if (RPC_IS_QUEUED(task))
 		return;
-	if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) {
-		del_singleshot_timer_sync(&task->tk_timer);
+	if (test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) {
+		if (del_singleshot_timer_sync(&task->tk_timer)) {
+			BUG_ON(!test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate));
+			clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate);
+		} else
+			BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate));
 		dprintk("RPC: %4d deleting timer\n", task->tk_pid);
 	}
 }


* Re: RT and Cascade interrupts
  2005-06-01 14:22               ` Oleg Nesterov
@ 2005-06-01 18:05                 ` john cooper
  2005-06-01 18:31                   ` Trond Myklebust
  0 siblings, 1 reply; 34+ messages in thread
From: john cooper @ 2005-06-01 18:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Trond Myklebust, linux-kernel, Ingo Molnar, Olaf Kirch, john cooper

Oleg Nesterov wrote:
> This all is very unlikely of course, but it would be nice to verify
> that kernel/timer.c is not the source of the problem.

The problem I am seeing is on a single CPU system.

> John, if it is easy to reproduce the problem, could you please retest
> with this patch?

I've done so and the second assert is generated
when running the test.  So here we have a case
of RPC_TASK_HAS_TIMER set but the associated
timer->base == NULL.  It would seem this could
easily be the scenario of executing in:

__run_timers()
     timer->base = NULL;
         rpc_run_timer()
             task->tk_timeout_fn(task)
             /* ksoftirqd preempted */
                                              :
                                          /* RPC client */
                                          rpc_execute()
                                             rpc_delete_timer()
                                                 del_timer() returns 0
                                                 BUG_ON(test_bit(RPC_TASK_HAS_TIMER,
                                                   &task->tk_runstate));
                                              :
          /* rpc_run_timer() resumes */
          clear_bit(RPC_TASK_HAS_TIMER,
            &task->tk_runstate);

I don't see how this would imply a kernel/timer.c
problem.  It also appears this wouldn't be causing
the timer cascade corruption I've seen as the
end result is deleting an already dequeued timer
which is safe here.

BTW, if preemption is explicitly disabled in rpc_run_timer()
the BUG_ON assert is not generated nor (as reported
earlier) does the cascade corruption occur.  I'm still
investigating.

-john



> --- 2.6.12-rc5/net/sunrpc/sched.c~	Wed Jun  1 17:49:57 2005
> +++ 2.6.12-rc5/net/sunrpc/sched.c	Wed Jun  1 18:00:31 2005
> @@ -137,8 +137,12 @@ rpc_delete_timer(struct rpc_task *task)
>  {
>  	if (RPC_IS_QUEUED(task))
>  		return;
> -	if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) {
> -		del_singleshot_timer_sync(&task->tk_timer);
> +	if (test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) {
> +		if (del_singleshot_timer_sync(&task->tk_timer)) {
> +			BUG_ON(!test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate));
> +			clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate);
> +		} else
> +			BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate));
>  		dprintk("RPC: %4d deleting timer\n", task->tk_pid);
>  	}
>  }
> 


-- 
john.cooper@timesys.com


* Re: RT and Cascade interrupts
  2005-06-01 18:05                 ` john cooper
@ 2005-06-01 18:31                   ` Trond Myklebust
  2005-06-01 19:20                     ` john cooper
  0 siblings, 1 reply; 34+ messages in thread
From: Trond Myklebust @ 2005-06-01 18:31 UTC (permalink / raw)
  To: john cooper; +Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch

On Wednesday 01.06.2005 at 14:05 (-0400), john cooper wrote:
> I've done so and the second assert is generated
> when running the test.  So here we have a case
> of RPC_TASK_HAS_TIMER set but the associated
> timer->base == NULL.  It would seem this could
> easily be the scenario of executing in:
> 
> __run_timers()
>      timer->base = NULL;
>          rpc_run_timer()
>              task->tk_timeout_fn(task)
>              /* ksoftirqd preempted */
>                                               :
>                                           /* RPC client */
>                                           rpc_execute()
>                                              rpc_delete_timer()
>                                                  del_timer() returns 0
>                                                  BUG_ON(test_bit(RPC_TASK_HAS_TIMER,
>                                                    &task->tk_runstate));
>                                               :
>           /* rpc_run_timer() resumes */
>           clear_bit(RPC_TASK_HAS_TIMER,
>             &task->tk_runstate);
> 
> I don't see how this would imply a kernel/timer.c
> problem.  It also appears this wouldn't be causing
> the timer cascade corruption I've seen as the
> end result is deleting an already dequeued timer
> which is safe here.

If timer->base==NULL, then del_timer() returns 0 (as you correctly state
above). What you appear to be missing is that this will trigger a call
to del_timer_sync() inside del_singleshot_timer_sync().

In other words, your scenario above should, if the timer code is
working according to spec, instead lead to:

__run_timers()
     timer->base = NULL;
         rpc_run_timer()
             task->tk_timeout_fn(task)
             /* ksoftirqd preempted */
                                              :
                                       /* RPC client */
                                       rpc_execute()
                                          rpc_delete_timer()
					     del_singleshot_timer_sync()
                                                 del_timer() returns 0
						 del_timer_sync()
						/* wait until timer has
						   completed */
          /*rpc_run_timer() resumes */
          clear_bit(RPC_TASK_HAS_TIMER,
            &task->tk_runstate);
         ....
     set_running_timer(some_other_value);

				    BUG_ON(test_bit(RPC_TASK_HAS_TIMER,
                                                   &task->tk_runstate));
                                              :

Cheers,
  Trond



* Re: RT and Cascade interrupts
  2005-06-01 18:31                   ` Trond Myklebust
@ 2005-06-01 19:20                     ` john cooper
  2005-06-01 19:46                       ` Trond Myklebust
  2005-06-01 20:21                       ` Trond Myklebust
  0 siblings, 2 replies; 34+ messages in thread
From: john cooper @ 2005-06-01 19:20 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch, john cooper

Trond Myklebust wrote:
On Wednesday 01.06.2005 at 14:05 (-0400), john cooper wrote:
>>I don't see how this would imply a kernel/timer.c
>>problem.  It also appears this wouldn't be causing
>>the timer cascade corruption I've seen as the
>>end result is deleting an already dequeued timer
>>which is safe here.
> 
> 
> If timer->base==NULL, then del_timer() returns 0 (as you correctly state
> above). What you appear to be missing is that this will trigger a call
> to del_timer_sync() inside del_singleshot_timer_sync().

You might have missed in my earlier mail that
this is not an MP kernel, i.e. !CONFIG_SMP.
The synchronous timer delete primitives don't
exist in this configuration:

include/linux/timer.h:

#ifdef CONFIG_SMP
   extern int del_timer_sync(struct timer_list *timer);
   extern int del_singleshot_timer_sync(struct timer_list *timer);
#else
# define del_timer_sync(t) del_timer(t)
# define del_singleshot_timer_sync(t) del_timer(t)
#endif

BTW, I don't know if you happened on the mail I sent
yesterday.  It details rpc_run_timer() waking up an
application task blocked in call_transmit().  The
app task preempts ksoftirqd and eventually does a
__rpc_sleep_on()/__mod_timer() which requeues the
timer in the cascade.  When ksoftirqd/rpc_run_timer()
resumes RPC_TASK_HAS_TIMER is unconditionally cleared
however the timer is queued in the cascade.  This
appears to be at least one cause of the timer cascade
corruption I've seen.

-john


-- 
john.cooper@timesys.com


* Re: RT and Cascade interrupts
  2005-06-01 19:20                     ` john cooper
@ 2005-06-01 19:46                       ` Trond Myklebust
  2005-06-01 20:21                       ` Trond Myklebust
  1 sibling, 0 replies; 34+ messages in thread
From: Trond Myklebust @ 2005-06-01 19:46 UTC (permalink / raw)
  To: john cooper; +Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch

On Wednesday 01.06.2005 at 15:20 (-0400), john cooper wrote:

> You might have missed in my earlier mail that
> this is not an MP kernel, i.e. !CONFIG_SMP.
> The synchronous timer delete primitives don't
> exist in this configuration:
> 
> include/linux/timer.h:
> 
> #ifdef CONFIG_SMP
>    extern int del_timer_sync(struct timer_list *timer);
>    extern int del_singleshot_timer_sync(struct timer_list *timer);
> #else
> # define del_timer_sync(t) del_timer(t)
> # define del_singleshot_timer_sync(t) del_timer(t)
> #endif


For the RT patched stuff that should read

#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_SOFTIRQS)
  extern int del_timer_sync(struct timer_list *timer);
  extern int del_singleshot_timer_sync(struct timer_list *timer);
#else
# define del_timer_sync(t) del_timer(t)
# define del_singleshot_timer_sync(t) del_timer(t)
#endif


> BTW, I don't know if you happened on the mail I sent
> yesterday.  It details rpc_run_timer() waking up an
> application task blocked in call_transmit().  The
> app task preempts ksoftirqd and eventually does a
> __rpc_sleep_on()/__mod_timer() which requeues the
> timer in the cascade.  When ksoftirqd/rpc_run_timer()
> resumes RPC_TASK_HAS_TIMER is unconditionally cleared
> however the timer is queued in the cascade.  This
> appears to be at least one cause of the timer cascade
> corruption I've seen.

I saw it. Once again, I don't see how that can happen. __rpc_execute()
should be calling rpc_delete_timer() before it calls task->tk_action.

There should be no instances of RPC entering call_transmit() or any
other tk_action callback with a pending timer.

Cheers,
  Trond



* Re: RT and Cascade interrupts
  2005-06-01 19:20                     ` john cooper
  2005-06-01 19:46                       ` Trond Myklebust
@ 2005-06-01 20:21                       ` Trond Myklebust
  2005-06-01 20:59                         ` john cooper
  1 sibling, 1 reply; 34+ messages in thread
From: Trond Myklebust @ 2005-06-01 20:21 UTC (permalink / raw)
  To: john cooper; +Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch

On Wednesday 01.06.2005 at 15:20 (-0400), john cooper wrote:
> You might have missed in my earlier mail that
> this is not an MP kernel, i.e. !CONFIG_SMP.
> The synchronous timer delete primitives don't
> exist in this configuration:

This probably explains your trouble. It makes no sense to allow
__run_timers() to be preemptible without having the synchronous timer
delete primitives. Synchronisation is impossible without them.

Which version of the RT patches are you using? The one I'm looking at
(2.6.12-rc5-rt-V0.7.47-15) certainly defines both del_timer_sync() and
del_singleshot_timer_sync() to be the same as the SMP versions if you
are running an RT kernel with preemptible softirqs.

Cheers,
  Trond



* Re: RT and Cascade interrupts
  2005-06-01 20:21                       ` Trond Myklebust
@ 2005-06-01 20:59                         ` john cooper
  2005-06-01 22:51                           ` Trond Myklebust
  0 siblings, 1 reply; 34+ messages in thread
From: john cooper @ 2005-06-01 20:59 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch, john cooper

Trond Myklebust wrote:
On Wednesday 01.06.2005 at 15:20 (-0400), john cooper wrote:
> 
>>You might have missed in my earlier mail that
>>this is not an MP kernel, i.e. !CONFIG_SMP.
>>The synchronous timer delete primitives don't
>>exist in this configuration:
> 
> 
> This probably explains your trouble. It makes no sense to allow
> __run_timer to be preemptible without having the synchronous timer
> delete primitives. Synchronisation is impossible without them.

The addition of CONFIG_PREEMPT_SOFTIRQS in this context
came into place in more recent RT patch versions than
the one I'm dealing with.  I've just pulled it in but this
doesn't appear to alter the nature of the failure.
I'm still catching an inconsistency at the very head of
rpc_delete_timer() in the case of:

     BUG_ON(!test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) &&
         timer_pending(&task->tk_timer));

> Which version of the RT patches are you using? The one I'm looking at
> (2.6.12-rc5-rt-V0.7.47-15) certainly defines both del_timer_sync() and
> del_singleshot_timer_sync() to be the same as the SMP versions if you
> are running an RT kernel with preemptible softirqs.

Yes, later versions of the patch do.  The version at hand,
40-04, is based on 2.6.11.  We intend to sync up with a
more recent version of the RT patch pending resolution
of this issue.

I have two potential work-arounds I'm trying to validate,
though I have a bit more tree-shaking to do before this
exercise is complete.

-john


-- 
john.cooper@timesys.com


* Re: RT and Cascade interrupts
  2005-06-01 20:59                         ` john cooper
@ 2005-06-01 22:51                           ` Trond Myklebust
  2005-06-01 23:09                             ` Trond Myklebust
  2005-06-02  3:31                             ` john cooper
  0 siblings, 2 replies; 34+ messages in thread
From: Trond Myklebust @ 2005-06-01 22:51 UTC (permalink / raw)
  To: john cooper; +Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch


On Wednesday 01.06.2005 at 16:59 (-0400), john cooper wrote:
> Yes later versions of the patch do.  The version at hand
> 40-04 is based on 2.6.11.  We intend to sync-up with a
> more recent version of the RT patch pending resolution
> of this issue.

Well it is pointless to concentrate on an obviously buggy patch. Could
you please sync up to rc5-rt-V0.7.47-15 at least: that looks like it
might be working (or at least be close to working).

Could you then apply the following debugging patch? It should warn you
in case something happens to corrupt base->running_timer (something
which would screw up del_timer_sync()). I'm not sure that can happen,
but it might be worth checking.

Cheers,
  Trond

[-- Attachment #2: linux-2.6.12-rt-warn.dif --]
[-- Type: text/plain, Size: 1617 bytes --]

 timer.c |   14 +++++++++++++-
 1 files changed, 13 insertions(+), 1 deletion(-)

Index: linux-2.6.12-rc5-rt-V0.7.47-15/kernel/timer.c
===================================================================
--- linux-2.6.12-rc5-rt-V0.7.47-15.orig/kernel/timer.c
+++ linux-2.6.12-rc5-rt-V0.7.47-15/kernel/timer.c
@@ -436,7 +436,7 @@ static int cascade(tvec_base_t *base, tv
 
 static inline void __run_timers(tvec_base_t *base)
 {
-	struct timer_list *timer;
+	struct timer_list *timer = NULL;
 
 	spin_lock_irq(&base->lock);
 	while (time_after_eq(jiffies, base->timer_jiffies)) {
@@ -444,6 +444,10 @@ static inline void __run_timers(tvec_bas
 		struct list_head *head = &work_list;
  		int index = base->timer_jiffies & TVR_MASK;
 
+		if (unlikely(base->running_timer != timer)) {
+			printk("BUG in line %d: __run_timers() is non-reentrant!\n", __LINE__);
+			return;
+		}
 		if (softirq_need_resched()) {
 			/* running_timer might be stale: */
 			set_running_timer(base, NULL);
@@ -459,6 +463,10 @@ static inline void __run_timers(tvec_bas
 			 * valid. Any new jiffy will be taken care of in
 			 * subsequent loops:
 			 */
+			if (unlikely(base->running_timer != timer)) {
+				printk("BUG in line %d: __run_timers() is non-reentrant!\n", __LINE__);
+				return;
+			}
 		}
 		/*
 		 * Cascade timers:
@@ -498,6 +506,10 @@ repeat:
 			goto repeat;
 		}
 	}
+	if (unlikely(base->running_timer != timer)) {
+		printk("BUG in line %d: __run_timers() is non-reentrant!\n", __LINE__);
+		return;
+	}
 	set_running_timer(base, NULL);
 	spin_unlock_irq(&base->lock);
 //	if (waitqueue_active(&base->wait_running_timer))


* Re: RT and Cascade interrupts
  2005-06-01 22:51                           ` Trond Myklebust
@ 2005-06-01 23:09                             ` Trond Myklebust
  2005-06-02  3:31                             ` john cooper
  1 sibling, 0 replies; 34+ messages in thread
From: Trond Myklebust @ 2005-06-01 23:09 UTC (permalink / raw)
  To: john cooper; +Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch


On Wednesday 01.06.2005 at 18:51 (-0400), Trond Myklebust wrote:
> Could you then apply the following debugging patch? It should warn you
> in case something happens to corrupt base->running_timer (something
> which would screw up del_timer_sync()). I'm not sure that can happen,
> but it might be worth checking.

Oops. You don't want to return without releasing the spinlock. This
patch should be better...

Cheers,
  Trond

[-- Attachment #2: linux-2.6.12-rt-warn.dif --]
[-- Type: text/plain, Size: 1673 bytes --]

 timer.c |   15 ++++++++++++++-
 1 files changed, 14 insertions(+), 1 deletion(-)

Index: linux-2.6.12-rc5-rt-V0.7.47-15/kernel/timer.c
===================================================================
--- linux-2.6.12-rc5-rt-V0.7.47-15.orig/kernel/timer.c
+++ linux-2.6.12-rc5-rt-V0.7.47-15/kernel/timer.c
@@ -436,7 +436,7 @@ static int cascade(tvec_base_t *base, tv
 
 static inline void __run_timers(tvec_base_t *base)
 {
-	struct timer_list *timer;
+	struct timer_list *timer = NULL;
 
 	spin_lock_irq(&base->lock);
 	while (time_after_eq(jiffies, base->timer_jiffies)) {
@@ -444,6 +444,10 @@ static inline void __run_timers(tvec_bas
 		struct list_head *head = &work_list;
  		int index = base->timer_jiffies & TVR_MASK;
 
+		if (unlikely(base->running_timer != timer)) {
+			printk("BUG in line %d: __run_timers() is non-reentrant!\n", __LINE__);
+			goto out;
+		}
 		if (softirq_need_resched()) {
 			/* running_timer might be stale: */
 			set_running_timer(base, NULL);
@@ -459,6 +463,10 @@ static inline void __run_timers(tvec_bas
 			 * valid. Any new jiffy will be taken care of in
 			 * subsequent loops:
 			 */
+			if (unlikely(base->running_timer != timer)) {
+				printk("BUG in line %d: __run_timers() is non-reentrant!\n", __LINE__);
+				goto out;
+			}
 		}
 		/*
 		 * Cascade timers:
@@ -498,7 +506,12 @@ repeat:
 			goto repeat;
 		}
 	}
+	if (unlikely(base->running_timer != timer)) {
+		printk("BUG in line %d: __run_timers() is non-reentrant!\n", __LINE__);
+		goto out;
+	}
 	set_running_timer(base, NULL);
+out:
 	spin_unlock_irq(&base->lock);
 //	if (waitqueue_active(&base->wait_running_timer))
 		wake_up(&base->wait_for_running_timer);


* Re: RT and Cascade interrupts
  2005-06-01 22:51                           ` Trond Myklebust
  2005-06-01 23:09                             ` Trond Myklebust
@ 2005-06-02  3:31                             ` john cooper
  2005-06-02  4:26                               ` Trond Myklebust
  1 sibling, 1 reply; 34+ messages in thread
From: john cooper @ 2005-06-02  3:31 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch, john cooper

Trond Myklebust wrote:
On Wednesday 01.06.2005 at 16:59 (-0400), john cooper wrote:
> 
>>Yes later versions of the patch do.  The version at hand
>>40-04 is based on 2.6.11.  We intend to sync-up with a
>>more recent version of the RT patch pending resolution
>>of this issue.
> 
> 
> Well it is pointless to concentrate on an obviously buggy patch. Could
> you please sync up to rc5-rt-V0.7.47-15 at least: that looks like it
> might be working (or at least be close to working).

I fully share your frustration of wanting to "use the
latest patch -- dammit".  However there are other practical
constraints coming into play.  This tree has accumulated a
substantial amount of fixes for scheduler violation assertions
along with associated testing and has fared well thus far.
The bug under discussion here is the last major operational
problem found in the associated testing process.  Arriving
at this point also required development of target specific
driver/board code so a resync to a later version is not a
trivial operation.  However it would be justifiable in the
case of encountering an impasse with the current tree.

> Could you then apply the following debugging patch? It should warn you
> in case something happens to corrupt base->running_timer (something
> which would screw up del_timer_sync()). I'm not sure that can happen,
> but it might be worth checking.

Yes, thanks.  Though the event trace does not suggest a
reentrance in __run_timer() but rather a preemption of it
during the call to rpc_run_timer() by a high priority
application task in the midst of an RPC.  The preempting
task requeues the timer in the cascade at the tail of
xprt_transmit().  rpc_run_timer() upon resuming execution
unconditionally clears the RPC_TASK_HAS_TIMER flag.  This
creates the inconsistent state.

No explicit deletion attempt of the timer (synchronous or
otherwise) is coming into play in the failure scenario as
witnessed by the event trace.  Rather it is the implicit
dequeue of the timer from the cascade in __run_timer() and
attempt to track ownership of it in rpc_run_timer() via
RPC_TASK_HAS_TIMER which is undermined in the case of
preemption.

 From earlier mail:

 > There should be no instances of RPC entering call_transmit() or any
 > other tk_action callback with a pending timer.

My description wasn't clear.  The timeout isn't pending
before call_transmit().  Rather the RPC appears to be
blocked elsewhere and upon wakeup via __run_timer()/xprt_timer()
preempts ksoftirqd and does the __rpc_sleep_on()/__mod_timer()
at the very tail of xprt_transmit().

I have work-arounds which detect the preemption of
rpc_run_timer() and correct the timer state.  But I
don't suggest they are general solutions or a
permanent fix.

In any case I will ascertain whether or not the problem here
still exists as soon as possible upon resync with a current
tree.

-john


-- 
john.cooper@timesys.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: RT and Cascade interrupts
  2005-06-02  3:31                             ` john cooper
@ 2005-06-02  4:26                               ` Trond Myklebust
  2005-06-09 23:17                                 ` George Anzinger
  0 siblings, 1 reply; 34+ messages in thread
From: Trond Myklebust @ 2005-06-02  4:26 UTC (permalink / raw)
  To: john cooper; +Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch

on den 01.06.2005 Klokka 23:31 (-0400) skreiv john cooper:
> I fully share your frustration of wanting to "use the
> latest patch -- dammit".  However there are other practical
> constraints coming into play.  This tree has accumulated a
> substantial amount of fixes for scheduler violation assertions
> along with associated testing and has fared well thus far.
> The bug under discussion here is the last major operational
> problem found in the associated testing process.  Arriving
> at this point also required development of target specific
> driver/board code so a resync to a later version is not a
> trivial operation.  However it would be justifiable in the
> case of encountering an impasse with the current tree.

My point is that you are considering timer bugs due to synchronization
problems in code which is obviously not designed to accommodate
synchronization. Once that fact is established, one moves on and
considers the code which does support synchronization.

> > Could you then apply the following debugging patch? It should warn you
> > in case something happens to corrupt base->running_timer (something
> > which would screw up del_timer_sync()). I'm not sure that can happen,
> > but it might be worth checking.
> 
> Yes, thanks.  Though the event trace does not suggest a
> reentrance in __run_timer() but rather a preemption of it
> during the call to rpc_run_timer() by a high priority
> application task in the midst of an RPC.  The preempting
> task requeues the timer in the cascade at the tail of
> xprt_transmit().  rpc_run_timer() upon resuming execution
> unconditionally clears the RPC_TASK_HAS_TIMER flag.  This
> creates the inconsistent state.

There are NO cases where that is supposed to be allowed to occur. This
case is precisely what del_timer_sync() is supposed to treat.

> No explicit deletion attempt of the timer (synchronous or
> otherwise) is coming into play in the failure scenario as
> witnessed by the event trace.  Rather it is the implicit
> dequeue of the timer from the cascade in __run_timer() and
> attempt to track ownership of it in rpc_run_timer() via
> RPC_TASK_HAS_TIMER which is undermined in the case of
> preemption.

No!!! The responsibility for tracking timers that have been dequeued and
that are currently running inside __run_timer() lies fairly and squarely
with del_timer_sync().
There is NOTHING within the RT patches that implies that the existing
callers of del_timer_sync() should be burdened with having to do
additional tracking of pending timers. To do so would be a major change
of the existing API, and would require a lot of justification.

IOW: nobody but you is claiming that the RPC code is trying to deal with
this case by tracking RPC_TASK_HAS_TIMER. That is not its purpose, nor
should it be in the RT case.

>  From earlier mail:
> 
>  > There should be no instances of RPC entering call_transmit() or any
>  > other tk_action callback with a pending timer.
> 
> My description wasn't clear.  The timeout isn't pending
> before call_transmit().  Rather the RPC appears to be
> blocked elsewhere and upon wakeup via __run_timer()/xprt_timer()
> preempts ksoftirqd and does the __rpc_sleep_on()/__mod_timer()
> at the very tail of xprt_transmit().

No!!! How is this supposed to happen? There is only one thread that is
allowed to call rpc_sleep_on(), and that is the exact same thread that
is calling __rpc_execute(). It may call rpc_sleep_on() only from inside
a task->tk_action() call, and therefore only _after_ it has called
rpc_delete_timer(). There is supposed to be strict ordering here!

Trond


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: RT and Cascade interrupts
  2005-06-02  4:26                               ` Trond Myklebust
@ 2005-06-09 23:17                                 ` George Anzinger
  2005-06-09 23:52                                   ` john cooper
  0 siblings, 1 reply; 34+ messages in thread
From: George Anzinger @ 2005-06-09 23:17 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: john cooper, Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch

Excuse me for interrupting this thread, but have you seen:

http://marc.theaimsgroup.com/?l=linux-kernel&m=111717961227508&w=2

I think this will fix your problem.

George
-- 

Trond Myklebust wrote:
> on den 01.06.2005 Klokka 23:31 (-0400) skreiv john cooper:
> 
>>I fully share your frustration of wanting to "use the
>>latest patch -- dammit".  However there are other practical
>>constraints coming into play.  This tree has accumulated a
>>substantial amount of fixes for scheduler violation assertions
>>along with associated testing and has fared well thus far.
>>The bug under discussion here is the last major operational
>>problem found in the associated testing process.  Arriving
>>at this point also required development of target specific
>>driver/board code so a resync to a later version is not a
>>trivial operation.  However it would be justifiable in the
>>case of encountering an impasse with the current tree.
> 
> 
> My point is that you are considering timer bugs due to synchronization
> problems in code which is obviously not designed to accommodate
> synchronization. Once that fact is established, one moves on and
> considers the code which does support synchronization.
> 
> 
>>>Could you then apply the following debugging patch? It should warn you
>>>in case something happens to corrupt base->running_timer (something
>>>which would screw up del_timer_sync()). I'm not sure that can happen,
>>>but it might be worth checking.
>>
>>Yes, thanks.  Though the event trace does not suggest a
>>reentrance in __run_timer() but rather a preemption of it
>>during the call to rpc_run_timer() by a high priority
>>application task in the midst of an RPC.  The preempting
>>task requeues the timer in the cascade at the tail of
>>xprt_transmit().  rpc_run_timer() upon resuming execution
>>unconditionally clears the RPC_TASK_HAS_TIMER flag.  This
>>creates the inconsistent state.
> 
> 
> There are NO cases where that is supposed to be allowed to occur. This
> case is precisely what del_timer_sync() is supposed to treat.
> 
> 
>>No explicit deletion attempt of the timer (synchronous or
>>otherwise) is coming into play in the failure scenario as
>>witnessed by the event trace.  Rather it is the implicit
>>dequeue of the timer from the cascade in __run_timer() and
>>attempt to track ownership of it in rpc_run_timer() via
>>RPC_TASK_HAS_TIMER which is undermined in the case of
>>preemption.
> 
> 
> No!!! The responsibility for tracking timers that have been dequeued and
> that are currently running inside __run_timer() lies fairly and squarely
> with del_timer_sync().
> There is NOTHING within the RT patches that implies that the existing
> callers of del_timer_sync() should be burdened with having to do
> additional tracking of pending timers. To do so would be a major change
> of the existing API, and would require a lot of justification.
> 
> IOW: nobody but you is claiming that the RPC code is trying to deal with
> this case by tracking RPC_TASK_HAS_TIMER. That is not its purpose, nor
> should it be in the RT case.
> 
> 
>> From earlier mail:
>>
>> > There should be no instances of RPC entering call_transmit() or any
>> > other tk_action callback with a pending timer.
>>
>>My description wasn't clear.  The timeout isn't pending
>>before call_transmit().  Rather the RPC appears to be
>>blocked elsewhere and upon wakeup via __run_timer()/xprt_timer()
>>preempts ksoftirqd and does the __rpc_sleep_on()/__mod_timer()
>>at the very tail of xprt_transmit().
> 
> 
> No!!! How is this supposed to happen? There is only one thread that is
> allowed to call rpc_sleep_on(), and that is the exact same thread that
> is calling __rpc_execute(). It may call rpc_sleep_on() only from inside
> a task->tk_action() call, and therefore only _after_ it has called
> rpc_delete_timer(). There is supposed to be strict ordering here!
> 
> Trond
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

-- 
George Anzinger   george@mvista.com
HRT (High-res-timers):  http://sourceforge.net/projects/high-res-timers/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: RT and Cascade interrupts
  2005-06-09 23:17                                 ` George Anzinger
@ 2005-06-09 23:52                                   ` john cooper
  0 siblings, 0 replies; 34+ messages in thread
From: john cooper @ 2005-06-09 23:52 UTC (permalink / raw)
  To: george
  Cc: Trond Myklebust, Oleg Nesterov, linux-kernel, Ingo Molnar,
	Olaf Kirch, john cooper

George Anzinger wrote:
> Excuse me for interrupting this thread, but have you seen:
> 
> http://marc.theaimsgroup.com/?l=linux-kernel&m=111717961227508&w=2
> 
> I think this will fix your problem.

That was touched on earlier in the thread.  It did not
fix the problem I was chasing in 40-04.

I'll revisit this issue once I've moved to a more current
version of the patch should it still exist.

-john


-- 
john.cooper@timesys.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: RT and Cascade interrupts
  2005-05-27  7:25       ` Ingo Molnar
@ 2005-05-27 13:53         ` john cooper
  0 siblings, 0 replies; 34+ messages in thread
From: john cooper @ 2005-05-27 13:53 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Daniel Walker, linux-kernel, john cooper

Ingo Molnar wrote:
> * john cooper <john.cooper@timesys.com> wrote:
>>The RPC code is attempting to replicate state of
>>timer ownership for a given rpc_task via RPC_TASK_HAS_TIMER
>>in rpc_task.tk_runstate.  Besides not working
>>correctly in the case of preemptable context it is
>>a replication of state of a timer pending in the
>>cascade structure (ie: timer->base).  The fix
>>changes the RPC code to use timer->base when
>>deciding whether an outstanding timer registration
>>exists during rpc_task tear down.
>>
>>Note: this failure occurred in the 40-04 version of
>>the patch though it applies to more current versions.
>>It was seen when executing stress tests on a number
>>of PPC targets running on an NFS mounted root though
>>was not observed on a x86 target under similar
>>conditions.
> 
> 
> should this fix go upstream too?

Yes.  The RPC code is attempting to replicate existing
and easily accessible state information in a timer
structure.  The simplistic means by which it does so
fails if ksoftirqd/rpc_run_timer() runs in preemptible
context.

-john


-- 
john.cooper@timesys.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: RT and Cascade interrupts
  2005-05-24 16:32     ` john cooper
@ 2005-05-27  7:25       ` Ingo Molnar
  2005-05-27 13:53         ` john cooper
  0 siblings, 1 reply; 34+ messages in thread
From: Ingo Molnar @ 2005-05-27  7:25 UTC (permalink / raw)
  To: john cooper; +Cc: Daniel Walker, linux-kernel


* john cooper <john.cooper@timesys.com> wrote:

> john cooper wrote:
> >I'm seeing the BUG assert in kernel/timers.c:cascade()
> >kick in (tmp->base is somehow 0) during a test which
> >creates a few tasks of priority higher than ksoftirqd.
> >This race doesn't happen if ksoftirqd's priority is
> >elevated (eg: chrt -f -p 75 2) so the -RT patch might
> >be opening up a window here.
> 
> There is a window in rpc_run_timer() which allows
> it to lose track of timer ownership when ksoftirqd
> (and thus itself) are preempted.  This doesn't
> immediately cause a problem but does corrupt
> the timer cascade list when the timer struct is
> recycled/requeued.  This shows up some time later
> as the list is processed.  The failure mode is cascade()
> attempting to percolate a timer with poisoned
> next/prev *s and a NULL base causing the assertion
> BUG(tmp->base != base) to kick in.
> 
> The RPC code is attempting to replicate state of
> timer ownership for a given rpc_task via RPC_TASK_HAS_TIMER
> in rpc_task.tk_runstate.  Besides not working
> correctly in the case of preemptable context it is
> a replication of state of a timer pending in the
> cascade structure (ie: timer->base).  The fix
> changes the RPC code to use timer->base when
> deciding whether an outstanding timer registration
> exists during rpc_task tear down.
> 
> Note: this failure occurred in the 40-04 version of
> the patch though it applies to more current versions.
> It was seen when executing stress tests on a number
> of PPC targets running on an NFS mounted root though
> was not observed on a x86 target under similar
> conditions.

should this fix go upstream too?

	Ingo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: RT and Cascade interrupts
  2005-05-13 13:12   ` john cooper
@ 2005-05-24 16:32     ` john cooper
  2005-05-27  7:25       ` Ingo Molnar
  0 siblings, 1 reply; 34+ messages in thread
From: john cooper @ 2005-05-24 16:32 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: john cooper, Daniel Walker, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1550 bytes --]

john cooper wrote:
> I'm seeing the BUG assert in kernel/timers.c:cascade()
> kick in (tmp->base is somehow 0) during a test which
> creates a few tasks of priority higher than ksoftirqd.
> This race doesn't happen if ksoftirqd's priority is
> elevated (eg: chrt -f -p 75 2) so the -RT patch might
> be opening up a window here.

There is a window in rpc_run_timer() which allows
it to lose track of timer ownership when ksoftirqd
(and thus itself) are preempted.  This doesn't
immediately cause a problem but does corrupt
the timer cascade list when the timer struct is
recycled/requeued.  This shows up some time later
as the list is processed.  The failure mode is cascade()
attempting to percolate a timer with poisoned
next/prev *s and a NULL base causing the assertion
BUG(tmp->base != base) to kick in.

The RPC code is attempting to replicate state of
timer ownership for a given rpc_task via RPC_TASK_HAS_TIMER
in rpc_task.tk_runstate.  Besides not working
correctly in the case of preemptable context it is
a replication of state of a timer pending in the
cascade structure (ie: timer->base).  The fix
changes the RPC code to use timer->base when
deciding whether an outstanding timer registration
exists during rpc_task tear down.

Note: this failure occurred in the 40-04 version of
the patch though it applies to more current versions.
It was seen when executing stress tests on a number
of PPC targets running on an NFS mounted root though
was not observed on a x86 target under similar
conditions.

-john


-- 
john.cooper@timesys.com

[-- Attachment #2: RPC.patch --]
[-- Type: text/plain, Size: 1630 bytes --]

./include/linux/sunrpc/sched.h
./net/sunrpc/sched.c
=================================================================
--- ./include/linux/sunrpc/sched.h.ORG	2005-05-24 10:29:24.000000000 -0400
+++ ./include/linux/sunrpc/sched.h	2005-05-24 10:47:56.000000000 -0400
@@ -142,7 +142,6 @@ typedef void			(*rpc_action)(struct rpc_
 #define RPC_TASK_RUNNING	0
 #define RPC_TASK_QUEUED		1
 #define RPC_TASK_WAKEUP		2
-#define RPC_TASK_HAS_TIMER	3
 
 #define RPC_IS_RUNNING(t)	(test_bit(RPC_TASK_RUNNING, &(t)->tk_runstate))
 #define rpc_set_running(t)	(set_bit(RPC_TASK_RUNNING, &(t)->tk_runstate))
=================================================================
--- ./net/sunrpc/sched.c.ORG	2005-05-24 10:29:52.000000000 -0400
+++ ./net/sunrpc/sched.c	2005-05-24 11:02:44.000000000 -0400
@@ -103,9 +103,6 @@ static void rpc_run_timer(struct rpc_tas
 		dprintk("RPC: %4d running timer\n", task->tk_pid);
 		callback(task);
 	}
-	smp_mb__before_clear_bit();
-	clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate);
-	smp_mb__after_clear_bit();
 }
 
 /*
@@ -124,7 +121,6 @@ __rpc_add_timer(struct rpc_task *task, r
 		task->tk_timeout_fn = timer;
 	else
 		task->tk_timeout_fn = __rpc_default_timer;
-	set_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate);
 	mod_timer(&task->tk_timer, jiffies + task->tk_timeout);
 }
 
@@ -135,7 +131,7 @@ __rpc_add_timer(struct rpc_task *task, r
 static inline void
 rpc_delete_timer(struct rpc_task *task)
 {
-	if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) {
+	if (task->tk_timer.base) {
 		del_singleshot_timer_sync(&task->tk_timer);
 		dprintk("RPC: %4d deleting timer\n", task->tk_pid);
 	}

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: RT and Cascade interrupts
  2005-05-13  7:44 ` Ingo Molnar
@ 2005-05-13 13:12   ` john cooper
  2005-05-24 16:32     ` john cooper
  0 siblings, 1 reply; 34+ messages in thread
From: john cooper @ 2005-05-13 13:12 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Daniel Walker, linux-kernel, john cooper

Ingo Molnar wrote:
> * Daniel Walker <dwalker@mvista.com> wrote:
> 
> 
>>	It seems like the cascade interrupts could run in threads, but 
>>i386 doesn't, and I know ARM crashed with cascades in threads. You may 
>>have a bit of a slow down, but it seems possible. Does anyone have 
>>some reasoning for why we aren't running the cascades in threads?
> 
> 
> are the x86 cascade interrupts real ones in fact? Normally they should 
> never trigger directly. (except on ARM which has a completely different 
> notion of cascade irq)

I just caught this thread in the corner of my eye.

I'm seeing the BUG assert in kernel/timers.c:cascade()
kick in (tmp->base is somehow 0) during a test which
creates a few tasks of priority higher than ksoftirqd.
This race doesn't happen if ksoftirqd's priority is
elevated (eg: chrt -f -p 75 2) so the -RT patch might
be opening up a window here.

I'm in the process of investigating this and see a few
potential suspects but was wondering if anyone else has
seen this behavior?

-john

-- 
john.cooper@timesys.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: RT and Cascade interrupts
  2005-05-12 14:43 Daniel Walker
@ 2005-05-13  7:44 ` Ingo Molnar
  2005-05-13 13:12   ` john cooper
  0 siblings, 1 reply; 34+ messages in thread
From: Ingo Molnar @ 2005-05-13  7:44 UTC (permalink / raw)
  To: Daniel Walker; +Cc: linux-kernel


* Daniel Walker <dwalker@mvista.com> wrote:

> 	It seems like the cascade interrupts could run in threads, but 
> i386 doesn't, and I know ARM crashed with cascades in threads. You may 
> have a bit of a slow down, but it seems possible. Does anyone have 
> some reasoning for why we aren't running the cascades in threads?

are the x86 cascade interrupts real ones in fact? Normally they should 
never trigger directly. (except on ARM which has a completely different 
notion of cascade irq)

	Ingo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RT and Cascade interrupts
@ 2005-05-12 14:43 Daniel Walker
  2005-05-13  7:44 ` Ingo Molnar
  0 siblings, 1 reply; 34+ messages in thread
From: Daniel Walker @ 2005-05-12 14:43 UTC (permalink / raw)
  To: mingo; +Cc: linux-kernel


	It seems like the cascade interrupts could run in threads, but 
i386 doesn't, and I know ARM crashed with cascades in threads. You may have a 
bit of a slow down, but it seems possible. Does anyone have some reasoning 
for why we aren't running the cascades in threads?

Daniel



^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2005-06-09 23:55 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-05-27 16:47 RT and Cascade interrupts Oleg Nesterov
2005-05-27 23:37 ` john cooper
2005-05-28  8:52   ` Oleg Nesterov
2005-05-28 14:02     ` john cooper
2005-05-28 16:34       ` Oleg Nesterov
2005-05-28 17:48     ` john cooper
2005-05-28 20:35       ` Trond Myklebust
2005-05-29  3:12         ` john cooper
2005-05-29  7:40           ` Trond Myklebust
2005-05-30 21:32             ` john cooper
2005-05-31 23:09               ` john cooper
2005-06-01 14:22               ` Oleg Nesterov
2005-06-01 18:05                 ` john cooper
2005-06-01 18:31                   ` Trond Myklebust
2005-06-01 19:20                     ` john cooper
2005-06-01 19:46                       ` Trond Myklebust
2005-06-01 20:21                       ` Trond Myklebust
2005-06-01 20:59                         ` john cooper
2005-06-01 22:51                           ` Trond Myklebust
2005-06-01 23:09                             ` Trond Myklebust
2005-06-02  3:31                             ` john cooper
2005-06-02  4:26                               ` Trond Myklebust
2005-06-09 23:17                                 ` George Anzinger
2005-06-09 23:52                                   ` john cooper
2005-05-29 11:31         ` Oleg Nesterov
2005-05-29 13:58           ` Trond Myklebust
2005-05-30 14:50             ` Ingo Molnar
2005-05-28 22:17       ` Trond Myklebust
  -- strict thread matches above, loose matches on Subject: below --
2005-05-12 14:43 Daniel Walker
2005-05-13  7:44 ` Ingo Molnar
2005-05-13 13:12   ` john cooper
2005-05-24 16:32     ` john cooper
2005-05-27  7:25       ` Ingo Molnar
2005-05-27 13:53         ` john cooper
