linux-kernel.vger.kernel.org archive mirror
* In many cases softlockup can not be reported after disabling IRQ for long time
@ 2012-01-31  7:28 TAO HU
  2012-01-31 15:47 ` Don Zickus
  0 siblings, 1 reply; 10+ messages in thread
From: TAO HU @ 2012-01-31  7:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar

Resend with a new subject

On Wed, Jan 25, 2012 at 4:24 PM, TAO HU <tghk48@motorola.com> wrote:
> Hi, All
>
> While testing kernel 3.0.8 with the test code below, no softlockup is
> reported in 60%~70% of runs.
> NOTE: the softlockup timeout is set to 10 seconds (i.e.
> watchdog_thresh=5) in my test.
> ... ...
> preempt_disable();
> local_irq_disable();
> for (i = 0; i < 20; i++)
>       mdelay(1000);
> local_irq_enable();
> preempt_enable();
> ... ...
>
> However, if I remove local_irq_disable()/local_irq_enable() it will
> report softlockup with no problem.
> I believe this is because, after local_irq_enable(),
> touch_softlockup_watchdog() is called before the softlockup timer runs.
>
> touch_softlockup_watchdog() basically resets the lockup detection
> process which implies that the 20-second lockup will be ignored.
> I noticed that touch_softlockup_watchdog() is called in dozens of
> places in kernel.
>
> Is that a design limitation or a bug? Any way to improve the situation?
>
> kernel/debug/debug_core.c:453:  touch_softlockup_watchdog_sync();
> kernel/power/hibernate.c:443:   touch_softlockup_watchdog();
> kernel/panic.c:153:             touch_softlockup_watchdog();
> kernel/time/timekeeping.c:684:  touch_softlockup_watchdog();
> kernel/time/tick-sched.c:149:   touch_softlockup_watchdog();
> kernel/time/tick-sched.c:543:   touch_softlockup_watchdog();
> kernel/time/tick-sched.c:596:           touch_softlockup_watchdog();
> kernel/time/tick-sched.c:756:                   touch_softlockup_watchdog();
> kernel/sched_clock.c:277:       touch_softlockup_watchdog();
>
>
>
> --
> Best Regards
> Hu Tao



-- 
Best Regards
Hu Tao


* Re: In many cases softlockup can not be reported after disabling IRQ for long time
  2012-01-31  7:28 In many cases softlockup can not be reported after disabling IRQ for long time TAO HU
@ 2012-01-31 15:47 ` Don Zickus
  2012-02-01  2:18   ` TAO HU
  0 siblings, 1 reply; 10+ messages in thread
From: Don Zickus @ 2012-01-31 15:47 UTC (permalink / raw)
  To: TAO HU; +Cc: linux-kernel, Ingo Molnar

On Tue, Jan 31, 2012 at 03:28:09PM +0800, TAO HU wrote:
> Resend with a new subject
> 
> On Wed, Jan 25, 2012 at 4:24 PM, TAO HU <tghk48@motorola.com> wrote:
> > Hi, All
> >
> > While testing kernel 3.0.8 with the test code below, no softlockup is
> > reported in 60%~70% of runs.
> > NOTE: the softlockup timeout is set to 10 seconds (i.e.
> > watchdog_thresh=5) in my test.
> > ... ...
> > preempt_disable();
> > local_irq_disable();
> > for (i = 0; i < 20; i++)
> >       mdelay(1000);
> > local_irq_enable();
> > preempt_enable();
> > ... ...
> >
> > However, if I remove local_irq_disable()/local_irq_enable() it will
> > report softlockup with no problem.
> > I believe this is because, after local_irq_enable(),
> > touch_softlockup_watchdog() is called before the softlockup timer runs.

Hi Hu,

Honestly, you should be getting hardlockup warnings if you are disabling
interrupts.  Do you see anything in the console output?

Cheers,
Don


* Re: In many cases softlockup can not be reported after disabling IRQ for long time
  2012-01-31 15:47 ` Don Zickus
@ 2012-02-01  2:18   ` TAO HU
  2012-02-01 10:51     ` Cong Wang
  2012-02-01 14:58     ` Don Zickus
  0 siblings, 2 replies; 10+ messages in thread
From: TAO HU @ 2012-02-01  2:18 UTC (permalink / raw)
  To: Don Zickus; +Cc: linux-kernel, Ingo Molnar, linux-arm-kernel, linux-omap

Hi, Don

Thanks for your feedback!

Unfortunately, the hardlockup depends on NMI which is not available on
ARM (Cortex-A9) per my understanding.
Our system uses OMAP4430. Any more suggestions?


-- 
Best Regards
Hu Tao


* Re: In many cases softlockup can not be reported after disabling IRQ for long time
  2012-02-01  2:18   ` TAO HU
@ 2012-02-01 10:51     ` Cong Wang
  2012-02-01 14:58     ` Don Zickus
  1 sibling, 0 replies; 10+ messages in thread
From: Cong Wang @ 2012-02-01 10:51 UTC (permalink / raw)
  To: TAO HU
  Cc: Don Zickus, linux-kernel, Ingo Molnar, linux-arm-kernel, linux-omap

(Please don't top-reply.)

On 02/01/2012 10:18 AM, TAO HU wrote:
> Hi, Don
>
> Thanks for your feedback!
>
> Unfortunately, the hardlockup depends on NMI which is not available on
> ARM (Cortex-A9) per my understanding.
> Our system uses OMAP4430. Any more suggestions?

When there is no NMI, touch_nmi_watchdog() actually touches softlockup 
watchdog:

#if defined(ARCH_HAS_NMI_WATCHDOG) || defined(CONFIG_HARDLOCKUP_DETECTOR)
#include <asm/nmi.h>
extern void touch_nmi_watchdog(void);
#else
static inline void touch_nmi_watchdog(void)
{
         touch_softlockup_watchdog();
}
#endif

so you need to check whether other places are calling touch_nmi_watchdog(),
especially on ARM.


* Re: In many cases softlockup can not be reported after disabling IRQ for long time
  2012-02-01  2:18   ` TAO HU
  2012-02-01 10:51     ` Cong Wang
@ 2012-02-01 14:58     ` Don Zickus
  2012-02-02  8:17       ` TAO HU
  1 sibling, 1 reply; 10+ messages in thread
From: Don Zickus @ 2012-02-01 14:58 UTC (permalink / raw)
  To: TAO HU; +Cc: linux-kernel, Ingo Molnar, linux-arm-kernel, linux-omap

On Wed, Feb 01, 2012 at 10:18:09AM +0800, TAO HU wrote:
> Hi, Don
> 
> Thanks for your feedback!
> 
> Unfortunately, the hardlockup depends on NMI which is not available on
> ARM (Cortex-A9) per my understanding.
> Our system uses OMAP4430. Any more suggestions?

Ah.  I wrongly assumed this is x86. Sorry about that.

Ok, so this is what is going on.  The softlockup check is just a high
priority thread that periodically runs.  If preemption is disabled that
thread can't run (or any threads for that matter) and a softlockup
condition will exist.  However, in order to determine that, a periodic
hrtimer has to come along and do the actual check.

If that check fails, then the warning is printed out.  However that
accuracy is based on the resolution of that hrtimer which I set to about
1/5 the watchdog threshold or 1 second in this case.

Unfortunately, if you disable the irqs, then that timer can't fire and now
we don't have a way to trigger the softlockup check until interrupts are
re-enabled.

On x86, we have a backup plan for disabled interrupts and that is the
hardlockup check, which relies on NMIs (something that still fires even when
interrupts are disabled).

If on ARM you don't have NMIs, then it will be difficult to check for
softlockups when interrupts are disabled.  Though I do recall sparc doing
something clever like using IRQ0 as a special purpose IRQ to emulate an
NMI (IOW, software purposely avoided masking IRQ0).  So when an interrupt
came in on that irq, it was never blocked and always ran based on the irq
nesting rules.

I don't know ARM well enough to give any solution for your problem, but my
reason above is why it isn't working the way you intended.

Cheers,
Don
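
A minimal userspace sketch of the mechanism described above, combined with
the hypothesis from the original report that a touch_softlockup_watchdog()
call can run before the first timer check once interrupts are back on. It is
a simplified model, not the code in the kernel: every name in it (wd_model,
wd_thread_ran, wd_touch, wd_timer_check, wd_after_irq_off) is hypothetical,
time is counted in whole seconds, and a "touch" is assumed to simply tell the
next check to start over.

```c
#include <stdint.h>

/* Simplified model of a softlockup detector; not kernel code. */
struct wd_model {
        uint64_t touch_ts;  /* when the watchdog thread last ran; 0 = touched */
        uint64_t thresh;    /* softlockup threshold, in seconds */
};

/* The watchdog thread got CPU time: remember when. */
void wd_thread_ran(struct wd_model *wd, uint64_t now)
{
        wd->touch_ts = now;
}

/* Model of a touch: ask the next timer check to start over. */
void wd_touch(struct wd_model *wd)
{
        wd->touch_ts = 0;
}

/*
 * Periodic timer check. Returns the stall length in seconds when a
 * lockup would be reported, or 0 when nothing is reported.
 */
uint64_t wd_timer_check(struct wd_model *wd, uint64_t now)
{
        if (wd->touch_ts == 0) {
                /* Touched since the last check: restart the measurement. */
                wd->touch_ts = now;
                return 0;
        }
        if (now > wd->touch_ts + wd->thresh)
                return now - wd->touch_ts;
        return 0;
}

/*
 * Scenario: the thread last ran at t=100, then interrupts stay off for
 * `stall` seconds, so no timer check happens in between. If
 * `touched_first` is nonzero, a touch happens before the first check
 * after interrupts come back.
 */
uint64_t wd_after_irq_off(uint64_t stall, int touched_first)
{
        struct wd_model wd = { .touch_ts = 0, .thresh = 10 };
        const uint64_t start = 100;

        wd_thread_ran(&wd, start);
        if (touched_first)
                wd_touch(&wd);
        return wd_timer_check(&wd, start + stall);
}
```

With the 10-second threshold used here, wd_after_irq_off(20, 0) returns 20
(the stall is seen by the first check), while wd_after_irq_off(20, 1) returns
0 (the touch hides it). Under these assumptions, whether the warning appears
depends only on which of the two events comes first.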


* Re: In many cases softlockup can not be reported after disabling IRQ for long time
  2012-02-01 14:58     ` Don Zickus
@ 2012-02-02  8:17       ` TAO HU
  2012-02-02  8:43         ` Russell King - ARM Linux
  2012-02-02 15:58         ` Don Zickus
  0 siblings, 2 replies; 10+ messages in thread
From: TAO HU @ 2012-02-02  8:17 UTC (permalink / raw)
  To: Don Zickus; +Cc: linux-kernel, Ingo Molnar, linux-arm-kernel, linux-omap

Hi, Don

My concern is not actually that the softlockup cannot be reported
while IRQs are disabled.
What bothers me is that even AFTER the IRQs are re-enabled, no warning
is given in many cases.

In theory, disabling IRQs for a long time (10s in my case) also implies
that the high-priority thread (watchdog) is blocked
as well.
So ideally the softlockup driver would give a warning right
after the IRQs are re-enabled.
It does so occasionally but not consistently.



-- 
Best Regards
Hu Tao


* Re: In many cases softlockup can not be reported after disabling IRQ for long time
  2012-02-02  8:17       ` TAO HU
@ 2012-02-02  8:43         ` Russell King - ARM Linux
       [not found]           ` <CAOwKts--CDpmiMunfYKrYsnWovmQhAC7Vp0P-9MeNVy6vx-Wvw@mail.gmail.com>
  2012-02-02 15:58         ` Don Zickus
  1 sibling, 1 reply; 10+ messages in thread
From: Russell King - ARM Linux @ 2012-02-02  8:43 UTC (permalink / raw)
  To: TAO HU
  Cc: Don Zickus, Ingo Molnar, linux-omap, linux-kernel, linux-arm-kernel

.sdrawkcab esra s'ti ,tsop pot t'noD

(Don't top post, it's arse backwards.)

On Thu, Feb 02, 2012 at 04:17:02PM +0800, TAO HU wrote:
> My concern is not actually that the softlockup could not be reported
> while the IRQ is disabled.
> What bothering me is that even AFTER re-enable the IRQ, it will not
> give warning in many cases.

That's already been explained.

Softlockups are detected by time passing.  Time can't properly advance
with interrupts disabled, as the backing counter (assuming you're using
the clocksource and clockevent stuff) could wrap.  If it wraps, the
system's idea of how much time has passed will be incorrect.

So, if interrupts are disabled for a long period, the system loses track
of time, and therefore can't know how long the system has been blocked for.
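
The wrap problem described above can be illustrated with a small userspace
sketch. It assumes software that only sees a free-running counter of a given
width and has no way to count wraps while interrupts are off. The function
name observed_elapsed and the 24-bit, 1 MHz counter used in the note below
are hypothetical, chosen so that a wrap falls inside the 20-second test; they
do not describe any particular hardware.

```c
#include <stdint.h>

/*
 * Elapsed ticks as seen through a free-running counter that is `bits`
 * wide, when the true elapsed time is `true_ticks`. Anything beyond a
 * full wrap of the counter is silently lost.
 */
uint64_t observed_elapsed(uint64_t true_ticks, unsigned bits)
{
        if (bits >= 64)
                return true_ticks;
        return true_ticks & ((UINT64_C(1) << bits) - 1);
}
```

A 24-bit counter at 1 MHz wraps every 16.777216 seconds. A 20-second stall is
20000000 ticks, and observed_elapsed(20000000, 24) returns 3222784, i.e.
about 3.2 seconds, which would stay below a 10-second threshold.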


* Re: In many cases softlockup can not be reported after disabling IRQ for long time
  2012-02-02  8:17       ` TAO HU
  2012-02-02  8:43         ` Russell King - ARM Linux
@ 2012-02-02 15:58         ` Don Zickus
  2012-02-02 16:22           ` Russell King - ARM Linux
  1 sibling, 1 reply; 10+ messages in thread
From: Don Zickus @ 2012-02-02 15:58 UTC (permalink / raw)
  To: TAO HU; +Cc: linux-kernel, Ingo Molnar, linux-arm-kernel, linux-omap

On Thu, Feb 02, 2012 at 04:17:02PM +0800, TAO HU wrote:
> Hi, Don
> 
> My concern is not actually that the softlockup could not be reported
> while the IRQ is disabled.
> What bothering me is that even AFTER re-enable the IRQ, it will not
> give warning in many cases.
> 
> In theory, disabling IRQ for long time (10s in my case) also implies
> the high priority thread (watchdog) is blocked
> as well.
> So the ideal case is that softlockup driver could give warning right
> after the IRQ is re-enabled.
> It does so occasionally but fails to be consistent.

The only thing I can think of is that the clock/jiffies isn't updated
until after the hrtimer is run.  I'm not sure if there is any guarantee
for ordering once interrupts are enabled.

But that is just a guess.

I guess in theory, I would expect that when interrupts are enabled, the
system would immediately jump into an IRQ context, update the
clock/jiffies, then run all the other irq handlers like hrtimers, which
would see the new time and do the right thing.  After everything is done,
the system would return to your test code and re-enable preemption
allowing the softlockup thread to run again.

I could be very wrong though. :-)

Cheers,
Don


* Re: In many cases softlockup can not be reported after disabling IRQ for long time
  2012-02-02 15:58         ` Don Zickus
@ 2012-02-02 16:22           ` Russell King - ARM Linux
  0 siblings, 0 replies; 10+ messages in thread
From: Russell King - ARM Linux @ 2012-02-02 16:22 UTC (permalink / raw)
  To: Don Zickus
  Cc: TAO HU, Ingo Molnar, linux-omap, linux-kernel, linux-arm-kernel

On Thu, Feb 02, 2012 at 10:58:41AM -0500, Don Zickus wrote:
> The only thing I can think of is that the clock/jiffies isn't updated
> until after the hrtimer is run.  I'm not sure if there is any guarantee
> for ordering once interrupts are enabled.
> 
> But that is just a guess.
> 
> I guess in theory, I would expect that when interrupts are enabled, the
> system would immediately jump into an IRQ context, update the
> clock/jiffies, then run all the other irq handlers like hrtimers, which
> would see the new time and do the right thing.  After everything is done,
> the system would return to your test code and re-enable preemption
> allowing the softlockup thread to run again.
> 
> I could be very wrong though. :-)

The first thing to confirm is whether disabling interrupts for 10s
results in the system losing proper track of time.  If it does, then
you've immediately found the problem.

So, what you need to do is to use /usr/bin/time to execute a userspace
command which causes your thread to simulate a soft-lockup.  If you
arrange for your soft-lockup to last for (eg) exactly 10 seconds, and
/usr/bin/time reports less than 10 seconds have passed, you've found
why the system can't report it.
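
The same measurement can be sketched in C for setups where /usr/bin/time is
not at hand. This is only an illustration under stated assumptions: the names
placeholder_trigger, timed_run and time_was_lost are hypothetical, and the
trigger body is deliberately empty because the thread does not say how the
in-kernel test is started from userspace. The sketch reads the system's wall
clock with the C11 timespec_get() call.

```c
#include <time.h>

/*
 * Hypothetical placeholder: replace the body with whatever starts the
 * in-kernel stall test on your system and returns when it is over.
 */
void placeholder_trigger(void *arg)
{
        (void)arg;
}

/* Run `trigger` and return how many seconds the system clock advanced. */
double timed_run(void (*trigger)(void *), void *arg)
{
        struct timespec t0, t1;

        timespec_get(&t0, TIME_UTC);
        trigger(arg);
        timespec_get(&t1, TIME_UTC);

        return (double)(t1.tv_sec - t0.tv_sec) +
               (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}

/*
 * Nonzero when the clock advanced by less than the stall is known to
 * have lasted, allowing `tolerance_s` seconds of measurement slack.
 */
int time_was_lost(double measured_s, double expected_s, double tolerance_s)
{
        return measured_s < expected_s - tolerance_s;
}
```

If the stall is arranged to last 10 seconds and
time_was_lost(timed_run(placeholder_trigger, NULL), 10.0, 0.5) is nonzero
once the placeholder has been filled in, the system lost track of time while
interrupts were off.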


* Re: In many cases softlockup can not be reported after disabling IRQ for long time
       [not found]           ` <CAOwKts--CDpmiMunfYKrYsnWovmQhAC7Vp0P-9MeNVy6vx-Wvw@mail.gmail.com>
@ 2012-02-04 12:22             ` Russell King - ARM Linux
  0 siblings, 0 replies; 10+ messages in thread
From: Russell King - ARM Linux @ 2012-02-04 12:22 UTC (permalink / raw)
  To: TAO HU
  Cc: Don Zickus, Ingo Molnar, linux-omap, linux-kernel, linux-arm-kernel

On Thu, Feb 02, 2012 at 10:05:22PM +0800, TAO HU wrote:
> I don't know it's already been discussed.
> Appreciate if you could point out existing discussion thread.
> 
> I agree it is impossible to detect "timeout" when using jiffies which
> relies on timer.
> 
> For timestamps, softlockup (watchdog) uses cpu_clock(), which eventually calls
> sched_clock().
> And sched_clock() is implemented to read out the value of a 32K
> timer/counter on OMAP4430.
> That means the timestamp will be still updated while the IRQ is disabled.

Yes, and it'll take 131072 seconds to wrap.

> So when IRQ is re-enabled, softlockup code will be able to read a "fresh"
> timestamp which can be used to
> detect the timeout.
> 
> 
> static unsigned long get_timestamp(int this_cpu)
> {
>         return cpu_clock(this_cpu) >> 30LL; /* 2^30 ~= 10^9 */
> }
> 
> unsigned long long __attribute__((weak)) sched_clock(void)
> {
>         return (unsigned long long)(jiffies - INITIAL_JIFFIES)
>                 * (NSEC_PER_SEC / HZ);
> }
> 
> #ifndef CONFIG_OMAP_MPU_TIMER
> unsigned long long notrace sched_clock(void)
> {
>         return _omap_32k_sched_clock();
> }
> #else
> unsigned long long notrace omap_32k_sched_clock(void)
> {
>         return _omap_32k_sched_clock();
> }
> #endif

I guess someone needs to do some tracing to see what's going on, and
get a feel for the order in which things happen.  (Or add some printks.)

Is there a ready-prepared bit of code I can try?
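
The figures in this exchange can be checked with a short userspace sketch.
It assumes the counter is 32 bits wide and runs at 32768 Hz, which is
consistent with the 131072-second wrap time given above, and it reuses the
">> 30" conversion from the quoted get_timestamp(). The function names
counter_wrap_seconds and ns_to_watchdog_units are hypothetical.

```c
#include <stdint.h>

/*
 * Seconds until a free-running counter of `bits` width wraps when it
 * ticks at `hz`. Assumes 0 < bits < 64 and hz > 0.
 */
uint64_t counter_wrap_seconds(unsigned bits, uint64_t hz)
{
        return (UINT64_C(1) << bits) / hz;
}

/*
 * Same conversion as the quoted get_timestamp(): nanoseconds shifted
 * right by 30, giving units of 2^30 ns (about 1.074 seconds each).
 */
uint64_t ns_to_watchdog_units(uint64_t ns)
{
        return ns >> 30;
}
```

counter_wrap_seconds(32, 32768) returns 131072, so a wrap is not a concern
for a 20-second test under these assumptions. Because each timestamp unit is
slightly longer than a second, 10 seconds of nanoseconds converts to 9 units
and 20 seconds converts to 18.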
