* [Xenomai-help] Tests with 2.5rc1
@ 2009-04-20 18:05 Martin Shepherd
  2009-04-20 18:28 ` Gilles Chanteperdrix
  2009-04-20 18:34 ` [Xenomai-help] Tests with 2.5rc1 Gilles Chanteperdrix
  0 siblings, 2 replies; 8+ messages in thread
From: Martin Shepherd @ 2009-04-20 18:05 UTC (permalink / raw)
  To: xenomai

Over the weekend I experimented with Xenomai 2.5-rc1. Unfortunately
the freeze problems that I have been experiencing continued with this
update. I have tried a lot of things to home in on the problem.

1. First I tried swapping memory sticks again, but the symptoms
    didn't change.

2. Then I tried enabling the local-APIC and NMI watchdog at compile
    time with a threshold of 200us, and set lapic=1 and nmi_watchdog=2
    on the kernel invocation line in grub. The following messages from
    dmesg showed that this was being picked up:

    [    0.000000] Local APIC disabled by BIOS -- reenabling.
    [    0.000000] Found and enabled local APIC!
    [    0.000000] mapped APIC to ffffb000 (fee00000)
    ...
    [    0.029054] Using local APIC timer interrupts.
    [    0.029057] calibrating APIC timer ...
    [    0.032000] ... lapic delta = 1250114
    [    0.032000] ... PM timer delta = 357974
    [    0.032000] ... PM timer result ok
    [    0.032000] ..... delta 1250114
    [    0.032000] ..... mult: 53688631
    [    0.032000] ..... calibration result: 800072
    [    0.032000] ..... CPU clock speed is 850.0310 MHz.
    [    0.032000] ..... host bus clock speed is 200.0072 MHz.
    ...
    [    0.140956] Xenomai: NMI watchdog started (threshold=200 us).

    However this didn't seem to do anything whenever the system hung,
    and I later unfroze it by moving the mouse, even though the system
    clock lost seconds of time.  It froze many times during the
    switchbench part of xeno-test, without any complaints from Xenomai
    or the kernel. I couldn't get it to freeze this time during the
    latency tests.

    Is there anything else that I need to do to get the watchdog to
    do something?

3. Next I tried enabling the IO APIC in the kernel configuration, and
    including nmi_watchdog=1, to use the IO APIC timer for the NMI
    watchdog, but dmesg included the following error:

     [    0.708944] IO APIC resources could be not be allocated.

    and when I tried eliciting hangs, again nothing happened.

4. I enabled SMI workarounds, thinking that perhaps hammering the
    system with dd might be causing overheating, and corresponding
    SMI interrupts. However the hangs continued.

5. Finally I tried disabling everything that didn't look important in
    the kernel configuration (and reduced the compilation time from 4 hours
    to 2:20 in the process :-). However after rebooting into the slimmed
    down kernel, the hangs when running "dd if=/dev/zero of=/dev/null"
    continued unchanged.
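
One way to confirm that the boot parameters from step 2 actually reached
the kernel is to read them back from /proc/cmdline. The Python sketch
below is only an illustration: parse_cmdline is a hypothetical helper,
and the sample string (including root=/dev/sda1) is made up rather than
copied from the machine discussed here.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dictionary.

    Parameters without '=' (for example 'ro') map to None.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


if __name__ == "__main__":
    # Illustrative command line; on a live system one would instead use
    # open("/proc/cmdline").read()
    sample = "root=/dev/sda1 ro lapic=1 nmi_watchdog=2"
    params = parse_cmdline(sample)
    for name in ("lapic", "nmi_watchdog"):
        print(name, "->", params.get(name, "<not present>"))
```

A parameter that is missing from the output was never passed to the
kernel, which would point at the grub entry rather than at the watchdog.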

I believe that today I will finally receive the new computers that I
ordered. So hopefully these problems won't turn up again on them.
Regardless, I would have liked to have figured out what the problem
was, just in case it is something serious that just happens less often
on newer computers.

Martin



* Re: [Xenomai-help] Tests with 2.5rc1
  2009-04-20 18:05 [Xenomai-help] Tests with 2.5rc1 Martin Shepherd
@ 2009-04-20 18:28 ` Gilles Chanteperdrix
  2009-04-20 19:16   ` Martin Shepherd
  2009-04-20 22:13   ` [Xenomai-help] Success! (was Re: Tests with 2.5rc1) Martin Shepherd
  2009-04-20 18:34 ` [Xenomai-help] Tests with 2.5rc1 Gilles Chanteperdrix
  1 sibling, 2 replies; 8+ messages in thread
From: Gilles Chanteperdrix @ 2009-04-20 18:28 UTC (permalink / raw)
  To: Martin Shepherd; +Cc: xenomai

Martin Shepherd wrote:
> Over the weekend I experimented with Xenomai 2.5-rc1. Unfortunately
> the freeze problems that I have been experiencing continued with this
> update. I have tried a lot of things to home in on the problem.
> 
> 1. First I tried swapping memory sticks again, but the symptoms
>     didn't change.
> 
> 2. Then I tried enabling the local-APIC and NMI watchdog at compile
>     time with a threshold of 200us, and set lapic=1 and nmi_watchdog=2
>     on the kernel invocation line in grub. The following messages from
>     dmesg showed that this was being picked up:
> 
>     [    0.000000] Local APIC disabled by BIOS -- reenabling.
>     [    0.000000] Found and enabled local APIC!
>     [    0.000000] mapped APIC to ffffb000 (fee00000)
>     ...
>     [    0.029054] Using local APIC timer interrupts.
>     [    0.029057] calibrating APIC timer ...
>     [    0.032000] ... lapic delta = 1250114
>     [    0.032000] ... PM timer delta = 357974
>     [    0.032000] ... PM timer result ok
>     [    0.032000] ..... delta 1250114
>     [    0.032000] ..... mult: 53688631
>     [    0.032000] ..... calibration result: 800072
>     [    0.032000] ..... CPU clock speed is 850.0310 MHz.
>     [    0.032000] ..... host bus clock speed is 200.0072 MHz.
>     ...
>     [    0.140956] Xenomai: NMI watchdog started (threshold=200 us).
> 
>     However this didn't seem to do anything whenever the system hung,
>     and I later unfroze it by moving the mouse, even though the system
>     clock lost seconds of time.  It froze many times during the
>     switchbench part of xeno-test, without any complaints from Xenomai
>     or the kernel. I couldn't get it to freeze this time during the
>     latency tests.
> 
>     Is there anything else that I need to do to get the watchdog to
>     do something?

The watchdog is probably working. The fact that you get the problem with
the interrupt pipeline enabled but Xenomai disabled proves that the culprit
is not Xenomai's timer interception. So, what probably happens is that
only Linux's timer stops working when you experience "hang ups". You can
check that by watching the IRQ counts in /proc/xenomai/irq while latency
is running.

In any case, you should run without NO_HZ and should try to disable high
res timers to see if it helps further.

> I believe that today I will finally receive the new computers that I
> ordered. So hopefully these problems won't turn up again on them.
> Regardless, I would have liked to have figured out what the problem
> was, just in case it is something serious that just happens less often
> on newer computers.

Well, trying to solve an issue which happens on a machine with faulty
RAM is not really interesting. So, unless you are able to test the same
machine with working RAM, I would suggest that you stop spending time on
this issue.

-- 
					    Gilles.



* Re: [Xenomai-help] Tests with 2.5rc1
  2009-04-20 18:05 [Xenomai-help] Tests with 2.5rc1 Martin Shepherd
  2009-04-20 18:28 ` Gilles Chanteperdrix
@ 2009-04-20 18:34 ` Gilles Chanteperdrix
  2009-04-20 18:46   ` Martin Shepherd
  1 sibling, 1 reply; 8+ messages in thread
From: Gilles Chanteperdrix @ 2009-04-20 18:34 UTC (permalink / raw)
  To: Martin Shepherd; +Cc: xenomai

Martin Shepherd wrote:
> Over the weekend I experimented with Xenomai 2.5-rc1. Unfortunately
> the freeze problems that I have been experiencing continued with this
> update. I have tried a lot of things to home in on the problem.
> 
> 1. First I tried swapping memory sticks again, but the symptoms
>     didn't change.
> 
> 2. Then I tried enabling the local-APIC and NMI watchdog at compile
>     time with a threshold of 200us, and set lapic=1 and nmi_watchdog=2
>     on the kernel invocation line in grub. The following messages from
>     dmesg showed that this was being picked up:
> 
>     [    0.000000] Local APIC disabled by BIOS -- reenabling.
>     [    0.000000] Found and enabled local APIC!
>     [    0.000000] mapped APIC to ffffb000 (fee00000)
>     ...
>     [    0.029054] Using local APIC timer interrupts.
>     [    0.029057] calibrating APIC timer ...
>     [    0.032000] ... lapic delta = 1250114
>     [    0.032000] ... PM timer delta = 357974
>     [    0.032000] ... PM timer result ok
>     [    0.032000] ..... delta 1250114
>     [    0.032000] ..... mult: 53688631
>     [    0.032000] ..... calibration result: 800072
>     [    0.032000] ..... CPU clock speed is 850.0310 MHz.
>     [    0.032000] ..... host bus clock speed is 200.0072 MHz.
>     ...
>     [    0.140956] Xenomai: NMI watchdog started (threshold=200 us).
> 
>     However this didn't seem to do anything whenever the system hung,
>     and I later unfroze it by moving the mouse, even though the system
>     clock lost seconds of time.  It froze many times during the
>     switchbench part of xeno-test, without any complaints from Xenomai

The freeze during switchbench is normal. There should be a message
telling you so and telling you not to try and interrupt switchbench.

-- 
					    Gilles.



* Re: [Xenomai-help] Tests with 2.5rc1
  2009-04-20 18:34 ` [Xenomai-help] Tests with 2.5rc1 Gilles Chanteperdrix
@ 2009-04-20 18:46   ` Martin Shepherd
  0 siblings, 0 replies; 8+ messages in thread
From: Martin Shepherd @ 2009-04-20 18:46 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai


On Mon, 20 Apr 2009, Gilles Chanteperdrix wrote:
>>     clock lost seconds of time.  It froze many times during the
>>     switchbench part of xeno-test, without any complaints from Xenomai
>
> The freeze during switchbench is normal. There should be a message
> telling you so and telling you not to try and interrupt switchbench.

These weren't the normal switchbench freezes, but rather indefinite
freezes that stopped the clock, made pings from other machines fail,
and that never ended until I went to the offending computer and either
moved its mouse, hit any keyboard key, or pressed the power-button
momentarily. I sometimes left it a few hours in that frozen state
while I worked on other things on another computer.

Martin



* Re: [Xenomai-help] Tests with 2.5rc1
  2009-04-20 18:28 ` Gilles Chanteperdrix
@ 2009-04-20 19:16   ` Martin Shepherd
  2009-04-20 22:13   ` [Xenomai-help] Success! (was Re: Tests with 2.5rc1) Martin Shepherd
  1 sibling, 0 replies; 8+ messages in thread
From: Martin Shepherd @ 2009-04-20 19:16 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai

Hi Gilles,

Thanks for the reply.

On Mon, 20 Apr 2009, Gilles Chanteperdrix wrote:
>>     Is there anything else that I need to do to get the watchdog to
>>     do something?
>
> The watchdog is probably working. The fact that you get the problem with
> the interrupt pipeline enabled but Xenomai disabled proves that the culprit
> is not Xenomai's timer interception. So, what probably happens is that
> only Linux's timer stops working when you experience "hang ups". You can
> check that by watching the IRQ counts in /proc/xenomai/irq while latency
> is running.

The IRQ count in /proc/xenomai/irq also freezes, and since Linux is
also using the local-APIC timer interrupts for its clock, both the
number shown in /proc/xenomai/irq and the number shown in
/proc/interrupts for LOC behave the same.
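
The comparison described above can be sketched in Python as follows.
This is a hedged illustration, not the procedure actually used: the
helper names loc_total and timer_advanced are hypothetical, and the
sample snapshots are invented rather than captured from the affected
machine. The sketch sums the per-CPU counts on the LOC line of two
/proc/interrupts snapshots and reports whether the count advanced.

```python
def loc_total(interrupts_text):
    """Sum the per-CPU counts on the LOC line of /proc/interrupts text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "LOC:":
            total = 0
            for field in fields[1:]:
                if not field.isdigit():
                    break  # the counts end where the description starts
                total += int(field)
            return total
    return None  # no LOC line, e.g. local APIC timer not in use


def timer_advanced(before_text, after_text):
    """Return True if the LOC count grew between two snapshots."""
    before = loc_total(before_text)
    after = loc_total(after_text)
    if before is None or after is None:
        raise ValueError("no LOC line found in one of the snapshots")
    return after > before


if __name__ == "__main__":
    # Invented snapshots; on a live system, read /proc/interrupts twice,
    # a second or so apart, and pass both strings in.
    first = "           CPU0\nLOC:      50000   Local timer interrupts\n"
    second = first.replace("50000", "50250")
    print(timer_advanced(first, second))  # True: the timer is ticking
    print(timer_advanced(first, first))   # False: the count is frozen
```

A False result while the machine is otherwise busy would match the
frozen LOC count reported here.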

> In any case, you should run without NO_HZ and should try to disable high
> res timers to see if it helps further.

I've been running without NO_HZ, but I haven't tried with the high-res
timers disabled. If the new machines don't get here today, I'll try
turning off hi-res timers. This may seem like a waste of time, but my
boss's opinion was that even if I didn't resolve the issue, trying to
track it down would be good experience.

> Well, trying to solve an issue which happens on a machine with faulty
> RAM is not really interesting. So, unless you are able to test the same
> machine with working RAM, I would suggest that you stop spending time on
> this issue.

Understood. I actually tried this with 9 different sticks of RAM,
which were all that I could get hold of. Unfortunately all of them
showed one or two errors when tested for a few hours with memtest86,
even when tested again on 3 different computers. Frustrating!

Thanks for the help,

Martin



* [Xenomai-help] Success! (was Re: Tests with 2.5rc1)
  2009-04-20 18:28 ` Gilles Chanteperdrix
  2009-04-20 19:16   ` Martin Shepherd
@ 2009-04-20 22:13   ` Martin Shepherd
  2009-04-22 17:18     ` Gilles Chanteperdrix
  1 sibling, 1 reply; 8+ messages in thread
From: Martin Shepherd @ 2009-04-20 22:13 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai


On Mon, 20 Apr 2009, Gilles Chanteperdrix wrote:
>...and should try to disable high
> res timers to see if it helps further.

Excellent suggestion. This has completely fixed the problem. To be
precise, starting with the config of the last problematic kernel, I
toggled off the high-res timer option in menuconfig, recompiled the
kernel, and booted into the new kernel.  I have now booted twice into
the resulting kernel, and have been unable to provoke a single hang in
either case, whereas before this change I could reliably
provoke a hang within a few seconds of running "dd if=/dev/zero
of=/dev/null". I have also now run xeno-test twice, all the way
through, for the first time.
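
For anyone repeating this experiment, the state of the option can be
checked in the kernel's .config file, where the high-res timer option
appears as CONFIG_HIGH_RES_TIMERS. The Python sketch below is a minimal
illustration under assumptions: config_state is a hypothetical helper,
and the sample configuration text is invented, not the configuration
of the kernels discussed in this thread.

```python
def config_state(config_text, symbol):
    """Return the value of CONFIG_<symbol> in .config text.

    Returns 'y' or 'm' (or another literal value) when the symbol is
    set, and 'n' when it is marked "is not set" or absent altogether.
    """
    name = "CONFIG_" + symbol
    for line in config_text.splitlines():
        line = line.strip()
        if line == "# %s is not set" % name:
            return "n"
        if line.startswith(name + "="):
            return line.split("=", 1)[1]
    return "n"  # symbols missing from .config are treated as disabled


if __name__ == "__main__":
    # Invented sample; a real check would read the .config file from
    # the kernel build directory instead.
    sample = (
        "# CONFIG_NO_HZ is not set\n"
        "# CONFIG_HIGH_RES_TIMERS is not set\n"
        "CONFIG_X86_PM_TIMER=y\n"
    )
    for symbol in ("NO_HZ", "HIGH_RES_TIMERS", "X86_PM_TIMER"):
        print(symbol, "=", config_state(sample, symbol))
```

Comparing this output for the hanging and the working kernel makes it
easy to confirm that only the intended option differs between builds.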

Do you have any idea why having the high-resolution timers option
enabled was causing the system to hang?

Thank you for the help,

Martin



* Re: [Xenomai-help] Success! (was Re: Tests with 2.5rc1)
  2009-04-20 22:13   ` [Xenomai-help] Success! (was Re: Tests with 2.5rc1) Martin Shepherd
@ 2009-04-22 17:18     ` Gilles Chanteperdrix
  2009-04-23  1:43       ` Martin Shepherd
  0 siblings, 1 reply; 8+ messages in thread
From: Gilles Chanteperdrix @ 2009-04-22 17:18 UTC (permalink / raw)
  To: Martin Shepherd; +Cc: xenomai

Martin Shepherd wrote:
> On Mon, 20 Apr 2009, Gilles Chanteperdrix wrote:
>> ...and should try to disable high
>> res timers to see if it helps further.
> 
> Excellent suggestion. This has completely fixed the problem. To be
> precise, starting with the config of the last problematic kernel, I
> toggled off the high-res timer option in menuconfig, recompiled the
> kernel, and booted into the new kernel.  I have now booted twice into
> the resulting kernel, and have been unable to provoke a single hang in
> either case, whereas before this change I could reliably
> provoke a hang within a few seconds of running "dd if=/dev/zero
> of=/dev/null". I have also now run xeno-test twice, all the way
> through, for the first time.
> 
> Do you have any idea why having the high-resolution timers option
> enabled was causing the system to hang?

Not at all. The issue you have seems to be related to the timer
subsystem: the timer not being reprogrammed, or something like that.
We must tell everyone on this list that high-res timers work for the
current hardware we run our tests on, so that people do not conclude
that hi-res timers do not work with Xenomai.

So, I see from previous posts that you have PM-timers enabled. Could you
try re-enabling high-res timers and disabling PM-timers?

-- 
                                                 Gilles.



* Re: [Xenomai-help] Success! (was Re: Tests with 2.5rc1)
  2009-04-22 17:18     ` Gilles Chanteperdrix
@ 2009-04-23  1:43       ` Martin Shepherd
  0 siblings, 0 replies; 8+ messages in thread
From: Martin Shepherd @ 2009-04-23  1:43 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai


On Wed, 22 Apr 2009, Gilles Chanteperdrix wrote:
>...
> We must tell everyone on this list that high-res timers work for the
> current hardware we run our tests on, so that people do not conclude
> that hi-res timers do not work with Xenomai.
>
> So, I see from previous posts that you have PM-timers enabled. Could you
> try re-enabling high-res timers and disabling PM-timers?

Yes, enabling HIGH_RES_TIMERS and disabling X86_PM_TIMER also fixes
the problem. In detail, after I re-enabled HIGH_RES_TIMERS, and
enabled EMBEDDED so that I could change the X86_PM_TIMER option, I
recompiled the kernel 4 times, toggling just the state of the
X86_PM_TIMER option. The results were as follows:

In the two cases where X86_PM_TIMER was enabled, the normal dd test
reliably hung the computer after about half a minute, leaving the
computer completely unresponsive to the keyboard and to pings from
other machines. Leading up to these hangs, top reported that dd was
getting very little CPU time, even though nothing else was running. I
limited dd to copying 5.1GB at a time, by setting its count argument
to 10000000, and was thus able to verify that dd eventually completed,
and that at that point, the system returned to normal, except that the
clock resumed from where it had been when the hang started, and was
thus left running slow. In other words, dd ran normally, but the clock
that would otherwise have allowed the scheduler to switch to other
tasks stalled until dd finished.
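
The 5.1GB figure is consistent with dd's default block size of 512
bytes, assuming no bs= argument was given (the message does not say).
The short Python sketch below just spells out that arithmetic and
builds the matching bounded command line; dd_bytes and dd_command are
hypothetical helper names used for illustration.

```python
DD_DEFAULT_BLOCK_SIZE = 512  # bytes; dd's default when bs= is not given


def dd_bytes(count, block_size=DD_DEFAULT_BLOCK_SIZE):
    """Total number of bytes that dd copies for a given count."""
    return count * block_size


def dd_command(count):
    """Command line for the bounded version of the dd test."""
    return "dd if=/dev/zero of=/dev/null count=%d" % count


if __name__ == "__main__":
    count = 10000000
    print(dd_command(count))
    print(dd_bytes(count))        # 5120000000 bytes
    print(dd_bytes(count) / 1e9)  # 5.12, i.e. about 5.1 GB
```

Bounding the copy this way is what makes the test self-terminating, so
the state of the clock can be inspected once dd has finished.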

In the two cases where X86_PM_TIMER was disabled, the system worked
fine. The dd process reliably got at least 95% of the CPU time when
nothing else was running, and never hung the computer.

Martin

