* Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
       [not found]   ` <F1ACE4FC-3E71-4498-B683-81F5C40CB6E3@mah.priv.at>
@ 2013-01-17  7:59     ` Bas Laarhoven
  2013-01-17  8:53       ` Gilles Chanteperdrix
  0 siblings, 1 reply; 17+ messages in thread
From: Bas Laarhoven @ 2013-01-17  7:59 UTC (permalink / raw)
  To: EMC developers; +Cc: xenomai

On 16-1-2013 20:36, Michael Haberler wrote:
> Am 16.01.2013 um 17:45 schrieb Bas Laarhoven:
>
>> On 16-1-2013 15:15, Michael Haberler wrote:
>>> ARM work:
>>>
>>> Several people have been able to get the Beaglebone ubuntu/xenomai setup working as outlined here: http://wiki.linuxcnc.org/cgi-bin/wiki.pl?BeagleboneDevsetup
>>> I have updated the kernel and rootfs image a few days ago so the kernel includes ext2/3/4 support compiled in, which should take care of two failure reports I got.
>>>
>>> Again that xenomai kernel is based on 3.2.21; it works very stable for me but there have been several reports of 'sudden stops'. The BB is a bit sensitive to power fluctuations but it might be more than that. As for that kernel, it works, but it is based on a branch which will see no further development. It supports most of the stuff needed to development; there might be some patches coming from more active BB users than me.
>> Hi Michael,
>>
>> Are you saying you don't have seen these 'sudden stops' yourself?
> No, never, after swapping to stronger power supplies; I have two of these boards running over NFS all the time. I dont have Linuxcnc running on them though, I'll do that and see if that changes the picture. Maybe keeping the torture test running helps trigger it.

Beginner's error! :-P The power supply is indeed critical, but the 
stepdown converter on my BeBoPr is dimensioned for at least 2A and 
hasn't failed me yet.

I think that running LinuxCNC is required to trigger the lockup. After 
a dozen runs, it looks like I can reproduce the lockup with 100% 
certainty within one hour.
Using the JTAG interface to attach a debugger to the Bone, I've found 
that once stalled, the kernel is still running. It looks like it won't 
schedule properly and almost all time is spent in the cpu_idle thread.

The kernel with extra diagnostics produces these messages:

[ 3480.386342] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 3480.395913] INFO: task axis:799 blocked for more than 120 seconds.
[ 3480.406643] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 3600.408670] INFO: task hal_manualtoolc:788 blocked for more than 120 
seconds.

On one run I was able to re-issue a command from the command history 
before that console froze too.
Since the x86 version doesn't seem to have any of these problems, it 
might be ARM specific.
Any suggestions on how to proceed? Are other people working on the ARM 
version?

I'm also sending this message to the Xenomai mailing list as that might 
be a better place to continue this thread.

-- Bas

>
> NB there is an ipipe trace option, but that doesnt help if you cant talk to the damn thing.
>   
>> My system has frozen within one hour every time.
>> I'm aware of the power supply issues, but my configuration has _never_ experienced this problem over at least half a year of (heavy) use.
> just to clarifiy: you get the lockups only with the Xenomai kernel, I assume ? your other option is some Angström kernel or what exactly (isn't the list of options bewildering ;-?)
>
>> So I dare say that isn't the problem, at least not with my lock-ups I'm seeing.
>>
>> Currently I'm debugging the kernel to see what's going on. It looks like the kernel is idling, but the system is completely frozen (blocked, not scheduling?).
>> I've built a kernel with symbols a lot of extra debug options and am waiting for it to stop again right now. It's been running axis with the demo for almost an hour, the best result up to now...
>>
>> Do you have an opinion on what would be the best kernel version for (future) development? Is Xenomai up with the current kernels? Are the DT kernels usable on the bone or do we have to wait another couple of months for that?
> again it's a question of matching a Xenomai patch version with a stable base version, and have the itimer support in it - that's what reduces the range of options
>
> there are several base versions one could try; the integration towards mainline is now targeted at 3.8 and it seems the stock kernel has much of what is needed including PRUSS. It's also possible that the current Xenomai work for a 3.5.x base results in a match, I need to look into it. I was suggested to 'forward port the ipipe patch myself' but I chickened out on that one.
>
> summary: I'm pretty sure there is; I am not aware of tangible results.
>
> I will push the two patches I got from Stephan Kappertz and Sheng Chao Wong, I dont think they are online.
>
> - Michael
>
>
>> -- Bas
>>
>> Yes! Frozen Bone after 56 minutes uptime : ) Time to start debugging again!
>>
>>> Charles has done some great work for a high-speed stepgen on the Beaglebone, and a few folks have reproduced that, but I leave the fanfare to Charles here;)
>>>
>>> I have done no further work on the Raspberry, I do not consider that platform particularly useful to base work on.
>>>
>>> RTAI note:
>>>
>>> I was pointed to this thread recently, which is interesting to read for several reasons:
>>> https://mail.rtai.org/pipermail/rtai/2012-December/thread.html  "Git repository for RTAI"
>>>
>>> It does mention a Ubuntu 12.04 RTAI kernel (Shahbaz Youssefi shabbyx at gmail.com Tue Dec 18 11:09:41 CET 2012) - it might be worth following that up, maybe this is an option to get the current builds out of the 10.04 end-of-support-life situation. I would appreciate if somebody more RTAI-aware than me would pick that up.
>>>
>>> It also touches on the issue how the source repository and collaboration model touches upon a project's success, and that's an interesting read. It looks like the nature of open source communities changes due to for instance the github model, making it easier for the casual contributor, which is a sore spot with the linuxcnc proejct. Something to think about.
>>>
>>> - Michael
>>>
>>>
>>>
>>> _______________________________________________
>>> Emc-developers mailing list
>>> Emc-developers@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/emc-developers
>
> _______________________________________________
> Emc-developers mailing list
> Emc-developers@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/emc-developers




* Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
  2013-01-17  7:59     ` [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM Bas Laarhoven
@ 2013-01-17  8:53       ` Gilles Chanteperdrix
  2013-01-17 11:34         ` Michael Haberler
  2013-01-17 13:30         ` Bas Laarhoven
  0 siblings, 2 replies; 17+ messages in thread
From: Gilles Chanteperdrix @ 2013-01-17  8:53 UTC (permalink / raw)
  To: Bas Laarhoven; +Cc: EMC developers, xenomai

On 01/17/2013 08:59 AM, Bas Laarhoven wrote:

> On 16-1-2013 20:36, Michael Haberler wrote:
>> Am 16.01.2013 um 17:45 schrieb Bas Laarhoven:
>>
>>> On 16-1-2013 15:15, Michael Haberler wrote:
>>>> ARM work:
>>>>
>>>> Several people have been able to get the Beaglebone ubuntu/xenomai setup working as outlined here: http://wiki.linuxcnc.org/cgi-bin/wiki.pl?BeagleboneDevsetup
>>>> I have updated the kernel and rootfs image a few days ago so the kernel includes ext2/3/4 support compiled in, which should take care of two failure reports I got.
>>>>
>>>> Again that xenomai kernel is based on 3.2.21; it works very stable for me but there have been several reports of 'sudden stops'. The BB is a bit sensitive to power fluctuations but it might be more than that. As for that kernel, it works, but it is based on a branch which will see no further development. It supports most of the stuff needed to development; there might be some patches coming from more active BB users than me.
>>> Hi Michael,
>>>
>>> Are you saying you don't have seen these 'sudden stops' yourself?
>> No, never, after swapping to stronger power supplies; I have two of these boards running over NFS all the time. I dont have Linuxcnc running on them though, I'll do that and see if that changes the picture. Maybe keeping the torture test running helps trigger it.
> 
> Beginners error! :-P The power supply is indeed critical, but the 
> stepdown converter on my BeBoPr is dimensioned for at least 2A and 
> hasn't failed me yet.
> 
> I think that running linuxcnc is mandatory for the lockup. After a dozen 
> runs, it looks like I can reproduce the lockup with 100% certainty 
> within one hour.
> Using the JTAG interface to attach a debugger to the Bone, I've found 
> that once stalled the kernel is still running. It looks like it won't 
> schedule properly and almost all time is spent in the cpu_idle thread.


This is typical of a tsc emulation or timer issue. On a system with
nothing else running, please let the "tsc -w" command run. It will take
some time to complete (the wrap time of the hardware timer used for tsc
emulation). If it runs correctly, then you need to check whether the
timer is still running when the bug happens (cat /proc/xenomai/irq
should keep increasing while, for instance, the latency test is
running). If the timer is stopped, it may have been programmed with too
short a delay. To avoid that, you can try:
- increasing the ipipe_timer min_delay_ticks member (by default, it uses
a value corresponding to the min_delta_ns member in the clockevent
structure);
- checking, after programming the timer (in the set_next_event method),
whether the timer counter is already 0, in which case you can return a
negative value, usually -ETIME.
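
A minimal sketch of what these two suggestions could look like in a
clockevent driver that is also registered as an I-pipe timer. This is
illustrative only, not the actual am335x dmtimer code: the my_timer_*
helpers are hypothetical, and the exact layout of struct ipipe_timer
depends on the I-pipe patch version; min_delta_ns and min_delay_ticks
are the members referred to above.

#include <linux/clockchips.h>
#include <linux/errno.h>
#include <linux/ipipe_tickdev.h>	/* struct ipipe_timer (I-pipe patched kernels) */

/* Hypothetical low-level helpers; the real register accesses are omitted. */
static void my_timer_program(unsigned long delay_ticks);
static unsigned long my_timer_read_counter(void);

static int my_timer_set_next_event(unsigned long delay,
				   struct clock_event_device *evt)
{
	my_timer_program(delay);

	/*
	 * Second suggestion: if the delay already expired while we were
	 * programming the hardware, report it so the caller reprograms
	 * instead of waiting for an interrupt that will never come.
	 * The "expired" read-back value is hardware specific (0 here,
	 * 0xffffffff on the timer discussed later in this thread).
	 */
	if (my_timer_read_counter() == 0)
		return -ETIME;

	return 0;
}

static struct clock_event_device my_clockevent = {
	.name		= "my-timer",
	.features	= CLOCK_EVT_FEAT_ONESHOT,
	.rating		= 300,
	.set_next_event	= my_timer_set_next_event,
	.min_delta_ns	= 1000,	/* min_delay_ticks is derived from this by default */
};

static struct ipipe_timer my_ipipe_timer = {
	/*
	 * First suggestion: raise this above the value derived from
	 * min_delta_ns if the timer keeps being programmed too close
	 * to its expiry.
	 */
	.min_delay_ticks = 64,
};

How the ipipe_timer gets attached to the clockevent, and the actual
hardware programming, are left out since they depend on the I-pipe
patch version and on the BeagleBone's dmtimer.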


-- 
                                                                Gilles.



* Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
  2013-01-17  8:53       ` Gilles Chanteperdrix
@ 2013-01-17 11:34         ` Michael Haberler
  2013-01-17 12:07           ` Gilles Chanteperdrix
  2013-01-17 13:30         ` Bas Laarhoven
  1 sibling, 1 reply; 17+ messages in thread
From: Michael Haberler @ 2013-01-17 11:34 UTC (permalink / raw)
  To: xenomai

Gilles,

Am 17.01.2013 um 09:53 schrieb Gilles Chanteperdrix:

> On 01/17/2013 08:59 AM, Bas Laarhoven wrote:
> 
>> On 16-1-2013 20:36, Michael Haberler wrote:
>>>> 
>>>> 
>>>> Are you saying you don't have seen these 'sudden stops' yourself?

...
> This is typical of a tsc emulation or timer issue. On a system without
> anything running, please let the "tsc -w" command run. It will take some
> time to run (the wrap time of the hardware timer used for tsc
> emulation), if it runs correctly, then you need to check whether the
> timer is still running when the bug happens (cat /proc/xenomai/irq
> should continue increasing when for instance the latency test is
> running). If the timer is stopped, it may have been programmed for a too
> short delay, to avoid that, you can try:
> - increasing the ipipe_timer min_delay_ticks member (by default, it uses
> a value corresponding to the min_delta_ns member in the clockevent
> structure);
> - checking after programming the timer (in the set_next_event method) if
> the timer counter is already 0, in which case you can return a negative
> value, usually -ETIME.

Thanks for that most valuable hint. 

The bughunt safari is on, debuggers loaded and JTAGs armed ;)

- Michael

> 
> 
> -- 
>                                                               Gilles.
> 
> _______________________________________________
> Xenomai mailing list
> Xenomai@xenomai.org
> http://www.xenomai.org/mailman/listinfo/xenomai




* Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
  2013-01-17 11:34         ` Michael Haberler
@ 2013-01-17 12:07           ` Gilles Chanteperdrix
  0 siblings, 0 replies; 17+ messages in thread
From: Gilles Chanteperdrix @ 2013-01-17 12:07 UTC (permalink / raw)
  To: Michael Haberler; +Cc: xenomai

On 01/17/2013 12:34 PM, Michael Haberler wrote:

> Gilles,
> 
> Am 17.01.2013 um 09:53 schrieb Gilles Chanteperdrix:
> 
>> On 01/17/2013 08:59 AM, Bas Laarhoven wrote:
>>
>>> On 16-1-2013 20:36, Michael Haberler wrote:
>>>>>
>>>>>
>>>>> Are you saying you don't have seen these 'sudden stops' yourself?
> 
> ...
>> This is typical of a tsc emulation or timer issue. On a system without
>> anything running, please let the "tsc -w" command run. It will take some
>> time to run (the wrap time of the hardware timer used for tsc
>> emulation), if it runs correctly, then you need to check whether the
>> timer is still running when the bug happens (cat /proc/xenomai/irq
>> should continue increasing when for instance the latency test is
>> running). If the timer is stopped, it may have been programmed for a too
>> short delay, to avoid that, you can try:
>> - increasing the ipipe_timer min_delay_ticks member (by default, it uses
>> a value corresponding to the min_delta_ns member in the clockevent
>> structure);
>> - checking after programming the timer (in the set_next_event method) if
>> the timer counter is already 0, in which case you can return a negative
>> value, usually -ETIME.


Actually, the hardware counter will read 0xffffffff when the timer has
reached the programmed delay.

-- 
                                                                Gilles.



* Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
  2013-01-17  8:53       ` Gilles Chanteperdrix
  2013-01-17 11:34         ` Michael Haberler
@ 2013-01-17 13:30         ` Bas Laarhoven
  2013-01-19 13:29           ` Gilles Chanteperdrix
  1 sibling, 1 reply; 17+ messages in thread
From: Bas Laarhoven @ 2013-01-17 13:30 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai

On 17-1-2013 9:53, Gilles Chanteperdrix wrote:
> On 01/17/2013 08:59 AM, Bas Laarhoven wrote:
>
>> On 16-1-2013 20:36, Michael Haberler wrote:
>>> Am 16.01.2013 um 17:45 schrieb Bas Laarhoven:
>>>
>>>> On 16-1-2013 15:15, Michael Haberler wrote:
>>>>> ARM work:
>>>>>
>>>>> Several people have been able to get the Beaglebone ubuntu/xenomai setup working as outlined here: http://wiki.linuxcnc.org/cgi-bin/wiki.pl?BeagleboneDevsetup
>>>>> I have updated the kernel and rootfs image a few days ago so the kernel includes ext2/3/4 support compiled in, which should take care of two failure reports I got.
>>>>>
>>>>> Again that xenomai kernel is based on 3.2.21; it works very stable for me but there have been several reports of 'sudden stops'. The BB is a bit sensitive to power fluctuations but it might be more than that. As for that kernel, it works, but it is based on a branch which will see no further development. It supports most of the stuff needed to development; there might be some patches coming from more active BB users than me.
>>>> Hi Michael,
>>>>
>>>> Are you saying you don't have seen these 'sudden stops' yourself?
>>> No, never, after swapping to stronger power supplies; I have two of these boards running over NFS all the time. I dont have Linuxcnc running on them though, I'll do that and see if that changes the picture. Maybe keeping the torture test running helps trigger it.
>> Beginners error! :-P The power supply is indeed critical, but the
>> stepdown converter on my BeBoPr is dimensioned for at least 2A and
>> hasn't failed me yet.
>>
>> I think that running linuxcnc is mandatory for the lockup. After a dozen
>> runs, it looks like I can reproduce the lockup with 100% certainty
>> within one hour.
>> Using the JTAG interface to attach a debugger to the Bone, I've found
>> that once stalled the kernel is still running. It looks like it won't
>> schedule properly and almost all time is spent in the cpu_idle thread.
>
> This is typical of a tsc emulation or timer issue. On a system without
> anything running, please let the "tsc -w" command run. It will take some
> time to run (the wrap time of the hardware timer used for tsc
> emulation), if it runs correctly, then you need to check whether the
> timer is still running when the bug happens (cat /proc/xenomai/irq
> should continue increasing when for instance the latency test is
> running). If the timer is stopped, it may have been programmed for a too
> short delay, to avoid that, you can try:
> - increasing the ipipe_timer min_delay_ticks member (by default, it uses
> a value corresponding to the min_delta_ns member in the clockevent
> structure);
> - checking after programming the timer (in the set_next_event method) if
> the timer counter is already 0, in which case you can return a negative
> value, usually -ETIME.
>

Hi Gilles,

Thanks for the swift reply.

As far as I can see, tsc -w runs without an error:

ARM: counter wrap time: 179 seconds
Checking tsc for 6 minute(s)
min: 5, max: 12, avg: 5.04168
...
min: 5, max: 6, avg: 5.03771
min: 5, max: 28, avg: 5.03989 -> 0.209995 us

real    6m0.284s

I've also run the other regression tests and all were successful.

The problem is that once the bug happens I won't be able to issue the 
cat command anymore.
I've fixed my debug setup so I no longer have to use System.map to 
translate the debugger addresses manually :-/
Now I'm waiting for another lockup to see what's happening.

-- Bas





* Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
  2013-01-17 13:30         ` Bas Laarhoven
@ 2013-01-19 13:29           ` Gilles Chanteperdrix
  2013-01-19 14:09             ` Michael Haberler
  0 siblings, 1 reply; 17+ messages in thread
From: Gilles Chanteperdrix @ 2013-01-19 13:29 UTC (permalink / raw)
  To: Bas Laarhoven; +Cc: xenomai

On 01/17/2013 02:30 PM, Bas Laarhoven wrote:

> On 17-1-2013 9:53, Gilles Chanteperdrix wrote:
>> On 01/17/2013 08:59 AM, Bas Laarhoven wrote:
>>
>>> On 16-1-2013 20:36, Michael Haberler wrote:
>>>> Am 16.01.2013 um 17:45 schrieb Bas Laarhoven:
>>>>
>>>>> On 16-1-2013 15:15, Michael Haberler wrote:
>>>>>> ARM work:
>>>>>>
>>>>>> Several people have been able to get the Beaglebone ubuntu/xenomai setup working as outlined here: http://wiki.linuxcnc.org/cgi-bin/wiki.pl?BeagleboneDevsetup
>>>>>> I have updated the kernel and rootfs image a few days ago so the kernel includes ext2/3/4 support compiled in, which should take care of two failure reports I got.
>>>>>>
>>>>>> Again that xenomai kernel is based on 3.2.21; it works very stable for me but there have been several reports of 'sudden stops'. The BB is a bit sensitive to power fluctuations but it might be more than that. As for that kernel, it works, but it is based on a branch which will see no further development. It supports most of the stuff needed to development; there might be some patches coming from more active BB users than me.
>>>>> Hi Michael,
>>>>>
>>>>> Are you saying you don't have seen these 'sudden stops' yourself?
>>>> No, never, after swapping to stronger power supplies; I have two of these boards running over NFS all the time. I dont have Linuxcnc running on them though, I'll do that and see if that changes the picture. Maybe keeping the torture test running helps trigger it.
>>> Beginners error! :-P The power supply is indeed critical, but the
>>> stepdown converter on my BeBoPr is dimensioned for at least 2A and
>>> hasn't failed me yet.
>>>
>>> I think that running linuxcnc is mandatory for the lockup. After a dozen
>>> runs, it looks like I can reproduce the lockup with 100% certainty
>>> within one hour.
>>> Using the JTAG interface to attach a debugger to the Bone, I've found
>>> that once stalled the kernel is still running. It looks like it won't
>>> schedule properly and almost all time is spent in the cpu_idle thread.
>>
>> This is typical of a tsc emulation or timer issue. On a system without
>> anything running, please let the "tsc -w" command run. It will take some
>> time to run (the wrap time of the hardware timer used for tsc
>> emulation), if it runs correctly, then you need to check whether the
>> timer is still running when the bug happens (cat /proc/xenomai/irq
>> should continue increasing when for instance the latency test is
>> running). If the timer is stopped, it may have been programmed for a too
>> short delay, to avoid that, you can try:
>> - increasing the ipipe_timer min_delay_ticks member (by default, it uses
>> a value corresponding to the min_delta_ns member in the clockevent
>> structure);
>> - checking after programming the timer (in the set_next_event method) if
>> the timer counter is already 0, in which case you can return a negative
>> value, usually -ETIME.
>>
> 
> Hi Gilles,
> 
> Thanks for the swift reply.
> 
> As far as I can see, tsc -w runs without an error:
> 
> ARM: counter wrap time: 179 seconds
> Checking tsc for 6 minute(s)
> min: 5, max: 12, avg: 5.04168
> ...
> min: 5, max: 6, avg: 5.03771
> min: 5, max: 28, avg: 5.03989 -> 0.209995 us
> 
> real    6m0.284s
> 
> I've also done the other regression tests and all were successful.
> 
> Problem is that once the bug happens I won't be able to issue the cat 
> command.
> I've fixed my debug setup so I don't have to use the System.map to 
> manually translate the debugger addresses : /
> Now I'm waiting for another lockup to see what's happening.


You may want to have a look at the xeno-regression-test script to put
your system under stress (and likely trigger the lockup faster).

-- 
                                                                Gilles.



* Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
  2013-01-19 13:29           ` Gilles Chanteperdrix
@ 2013-01-19 14:09             ` Michael Haberler
  2013-01-19 14:10               ` Gilles Chanteperdrix
  2013-01-19 14:32               ` Gilles Chanteperdrix
  0 siblings, 2 replies; 17+ messages in thread
From: Michael Haberler @ 2013-01-19 14:09 UTC (permalink / raw)
  To: xenomai


Am 19.01.2013 um 14:29 schrieb Gilles Chanteperdrix:

> On 01/17/2013 02:30 PM, Bas Laarhoven wrote:
> 
>> On 17-1-2013 9:53, Gilles Chanteperdrix wrote:
>>> On 01/17/2013 08:59 AM, Bas Laarhoven wrote:
>>> 
>>>> On 16-1-2013 20:36, Michael Haberler wrote:
>>>>> Am 16.01.2013 um 17:45 schrieb Bas Laarhoven:
>>>>> 
>>>>>> On 16-1-2013 15:15, Michael Haberler wrote:
>>>>>>> ARM work:
>>>>>>> 
>>>>>>> Several people have been able to get the Beaglebone ubuntu/xenomai setup working as outlined here: http://wiki.linuxcnc.org/cgi-bin/wiki.pl?BeagleboneDevsetup
>>>>>>> I have updated the kernel and rootfs image a few days ago so the kernel includes ext2/3/4 support compiled in, which should take care of two failure reports I got.
>>>>>>> 
>>>>>>> Again that xenomai kernel is based on 3.2.21; it works very stable for me but there have been several reports of 'sudden stops'. The BB is a bit sensitive to power fluctuations but it might be more than that. As for that kernel, it works, but it is based on a branch which will see no further development. It supports most of the stuff needed to development; there might be some patches coming from more active BB users than me.
>>>>>> Hi Michael,
>>>>>> 
>>>>>> Are you saying you don't have seen these 'sudden stops' yourself?
>>>>> No, never, after swapping to stronger power supplies; I have two of these boards running over NFS all the time. I dont have Linuxcnc running on them though, I'll do that and see if that changes the picture. Maybe keeping the torture test running helps trigger it.
>>>> Beginners error! :-P The power supply is indeed critical, but the
>>>> stepdown converter on my BeBoPr is dimensioned for at least 2A and
>>>> hasn't failed me yet.
>>>> 
>>>> I think that running linuxcnc is mandatory for the lockup. After a dozen
>>>> runs, it looks like I can reproduce the lockup with 100% certainty
>>>> within one hour.
>>>> Using the JTAG interface to attach a debugger to the Bone, I've found
>>>> that once stalled the kernel is still running. It looks like it won't
>>>> schedule properly and almost all time is spent in the cpu_idle thread.
>>> 
>>> This is typical of a tsc emulation or timer issue. On a system without
>>> anything running, please let the "tsc -w" command run. It will take some
>>> time to run (the wrap time of the hardware timer used for tsc
>>> emulation), if it runs correctly, then you need to check whether the
>>> timer is still running when the bug happens (cat /proc/xenomai/irq
>>> should continue increasing when for instance the latency test is
>>> running). If the timer is stopped, it may have been programmed for a too
>>> short delay, to avoid that, you can try:
>>> - increasing the ipipe_timer min_delay_ticks member (by default, it uses
>>> a value corresponding to the min_delta_ns member in the clockevent
>>> structure);
>>> - checking after programming the timer (in the set_next_event method) if
>>> the timer counter is already 0, in which case you can return a negative
>>> value, usually -ETIME.
>>> 
>> 
>> Hi Gilles,
>> 
>> Thanks for the swift reply.
>> 
>> As far as I can see, tsc -w runs without an error:
>> 
>> ARM: counter wrap time: 179 seconds
>> Checking tsc for 6 minute(s)
>> min: 5, max: 12, avg: 5.04168
>> ...
>> min: 5, max: 6, avg: 5.03771
>> min: 5, max: 28, avg: 5.03989 -> 0.209995 us
>> 
>> real    6m0.284s
>> 
>> I've also done the other regression tests and all were successful.
>> 
>> Problem is that once the bug happens I won't be able to issue the cat 
>> command.
>> I've fixed my debug setup so I don't have to use the System.map to 
>> manually translate the debugger addresses : /
>> Now I'm waiting for another lockup to see what's happening.
> 
> 
> You may want to have a look at the xeno-regression-test script to put
> your system under pressure (and likely generate the lockup faster).

Running tsc -w and xeno-regression-test in parallel, I get errors like these (not on every run; no lockup so far):

++ /usr/xenomai/bin/mutex-torture-native
simple_wait
recursive_wait
timed_mutex
mode_switch
pi_wait
lock_stealing
NOTE: lock_stealing mutex_trylock: not supported
deny_stealing
simple_condwait
recursive_condwait
auto_switchback
FAILURE: current prio (0) != expected prio (2)

dmesg 
[501963.390598] Xenomai: native: cleaning up mutex "" (ret=0).
[502170.164984] usb 1-1: reset high-speed USB device number 2 using musb-hdrc

on another run, I got a segfault while running sigdebug:
++ /usr/xenomai/bin/regression/native/sigdebug
mayday page starting at 0x400eb000 [/dev/rtheap]
mayday code: 0c 00 9f e5 0c 70 9f e5 00 00 00 ef 00 00 a0 e3 00 00 80 e5 2b 02 00 0a 42 00 0f 00 db d7 ee b8
mlockall
syscall
signal
relaxed mutex owner
page fault
watchdog
./xeno-regression-test: line 53:  4210 Segmentation fault      /usr/xenomai/bin/regression/native/sigdebug

root@bb1:/usr/xenomai/bin# dmesg 
[502442.312996] Xenomai: watchdog triggered -- signaling runaway thread 'rt_task'
[502443.054186] Xenomai: native: cleaning up mutex "prio_invert" (ret=0).
[502443.055730] Xenomai: native: cleaning up sem "send_signal" (ret=0).
[502518.134977] usb 1-1: reset high-speed USB device number 2 using musb-hdrc


Unsure what to make of it - any suggestions? The USB reset looks suspicious.

- Michael




* Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
  2013-01-19 14:09             ` Michael Haberler
@ 2013-01-19 14:10               ` Gilles Chanteperdrix
  2013-01-19 14:14                 ` Michael Haberler
  2013-01-19 14:32               ` Gilles Chanteperdrix
  1 sibling, 1 reply; 17+ messages in thread
From: Gilles Chanteperdrix @ 2013-01-19 14:10 UTC (permalink / raw)
  To: Michael Haberler; +Cc: xenomai

On 01/19/2013 03:09 PM, Michael Haberler wrote:

> 
> Am 19.01.2013 um 14:29 schrieb Gilles Chanteperdrix:
> 
>> On 01/17/2013 02:30 PM, Bas Laarhoven wrote:
>>
>>> On 17-1-2013 9:53, Gilles Chanteperdrix wrote:
>>>> On 01/17/2013 08:59 AM, Bas Laarhoven wrote:
>>>>
>>>>> On 16-1-2013 20:36, Michael Haberler wrote:
>>>>>> Am 16.01.2013 um 17:45 schrieb Bas Laarhoven:
>>>>>>
>>>>>>> On 16-1-2013 15:15, Michael Haberler wrote:
>>>>>>>> ARM work:
>>>>>>>>
>>>>>>>> Several people have been able to get the Beaglebone ubuntu/xenomai setup working as outlined here: http://wiki.linuxcnc.org/cgi-bin/wiki.pl?BeagleboneDevsetup
>>>>>>>> I have updated the kernel and rootfs image a few days ago so the kernel includes ext2/3/4 support compiled in, which should take care of two failure reports I got.
>>>>>>>>
>>>>>>>> Again that xenomai kernel is based on 3.2.21; it works very stable for me but there have been several reports of 'sudden stops'. The BB is a bit sensitive to power fluctuations but it might be more than that. As for that kernel, it works, but it is based on a branch which will see no further development. It supports most of the stuff needed to development; there might be some patches coming from more active BB users than me.
>>>>>>> Hi Michael,
>>>>>>>
>>>>>>> Are you saying you don't have seen these 'sudden stops' yourself?
>>>>>> No, never, after swapping to stronger power supplies; I have two of these boards running over NFS all the time. I dont have Linuxcnc running on them though, I'll do that and see if that changes the picture. Maybe keeping the torture test running helps trigger it.
>>>>> Beginners error! :-P The power supply is indeed critical, but the
>>>>> stepdown converter on my BeBoPr is dimensioned for at least 2A and
>>>>> hasn't failed me yet.
>>>>>
>>>>> I think that running linuxcnc is mandatory for the lockup. After a dozen
>>>>> runs, it looks like I can reproduce the lockup with 100% certainty
>>>>> within one hour.
>>>>> Using the JTAG interface to attach a debugger to the Bone, I've found
>>>>> that once stalled the kernel is still running. It looks like it won't
>>>>> schedule properly and almost all time is spent in the cpu_idle thread.
>>>>
>>>> This is typical of a tsc emulation or timer issue. On a system without
>>>> anything running, please let the "tsc -w" command run. It will take some
>>>> time to run (the wrap time of the hardware timer used for tsc
>>>> emulation), if it runs correctly, then you need to check whether the
>>>> timer is still running when the bug happens (cat /proc/xenomai/irq
>>>> should continue increasing when for instance the latency test is
>>>> running). If the timer is stopped, it may have been programmed for a too
>>>> short delay, to avoid that, you can try:
>>>> - increasing the ipipe_timer min_delay_ticks member (by default, it uses
>>>> a value corresponding to the min_delta_ns member in the clockevent
>>>> structure);
>>>> - checking after programming the timer (in the set_next_event method) if
>>>> the timer counter is already 0, in which case you can return a negative
>>>> value, usually -ETIME.
>>>>
>>>
>>> Hi Gilles,
>>>
>>> Thanks for the swift reply.
>>>
>>> As far as I can see, tsc -w runs without an error:
>>>
>>> ARM: counter wrap time: 179 seconds
>>> Checking tsc for 6 minute(s)
>>> min: 5, max: 12, avg: 5.04168
>>> ...
>>> min: 5, max: 6, avg: 5.03771
>>> min: 5, max: 28, avg: 5.03989 -> 0.209995 us
>>>
>>> real    6m0.284s
>>>
>>> I've also done the other regression tests and all were successful.
>>>
>>> Problem is that once the bug happens I won't be able to issue the cat 
>>> command.
>>> I've fixed my debug setup so I don't have to use the System.map to 
>>> manually translate the debugger addresses : /
>>> Now I'm waiting for another lockup to see what's happening.
>>
>>
>> You may want to have a look at the xeno-regression-test script to put
>> your system under pressure (and likely generate the lockup faster).
> 
> running tsc -w and xeno-regression-test in parallel I get errors like so (not on every run; no lockup so far):
> 
> ++ /usr/xenomai/bin/mutex-torture-native
> simple_wait
> recursive_wait
> timed_mutex
> mode_switch
> pi_wait
> lock_stealing
> NOTE: lock_stealing mutex_trylock: not supported
> deny_stealing
> simple_condwait
> recursive_condwait
> auto_switchback
> FAILURE: current prio (0) != expected prio (2)
> 
> dmesg 
> [501963.390598] Xenomai: native: cleaning up mutex "" (ret=0).
> [502170.164984] usb 1-1: reset high-speed USB device number 2 using musb-hdrc
> 
> on another run, I got a segfault while running sigdebug:
> ++ /usr/xenomai/bin/regression/native/sigdebug
> mayday page starting at 0x400eb000 [/dev/rtheap]
> mayday code: 0c 00 9f e5 0c 70 9f e5 00 00 00 ef 00 00 a0 e3 00 00 80 e5 2b 02 00 0a 42 00 0f 00 db d7 ee b8
> mlockall
> syscall
> signal
> relaxed mutex owner
> page fault
> watchdog
> ./xeno-regression-test: line 53:  4210 Segmentation fault      /usr/xenomai/bin/regression/native/sigdebug
> 
> root@bb1:/usr/xenomai/bin# dmesg 
> [502442.312996] Xenomai: watchdog triggered -- signaling runaway thread 'rt_task'
> [502443.054186] Xenomai: native: cleaning up mutex "prio_invert" (ret=0).
> [502443.055730] Xenomai: native: cleaning up sem "send_signal" (ret=0).
> [502518.134977] usb 1-1: reset high-speed USB device number 2 using musb-hdrc
> 
> 
> unsure what to make of it - any suggestions? the usb reset looks suspicious


What version of Xenomai are you using? These look like old issues.

-- 
                                                                Gilles.



* Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
  2013-01-19 14:10               ` Gilles Chanteperdrix
@ 2013-01-19 14:14                 ` Michael Haberler
  2013-01-19 14:19                   ` Gilles Chanteperdrix
  0 siblings, 1 reply; 17+ messages in thread
From: Michael Haberler @ 2013-01-19 14:14 UTC (permalink / raw)
  To: xenomai


Am 19.01.2013 um 15:10 schrieb Gilles Chanteperdrix:

> On 01/19/2013 03:09 PM, Michael Haberler wrote:
> 
>> 
>> Am 19.01.2013 um 14:29 schrieb Gilles Chanteperdrix:
>> 
>>> On 01/17/2013 02:30 PM, Bas Laarhoven wrote:
>>> 
>>>> On 17-1-2013 9:53, Gilles Chanteperdrix wrote:
>>>>> On 01/17/2013 08:59 AM, Bas Laarhoven wrote:
>>>>> 
>>>>>> On 16-1-2013 20:36, Michael Haberler wrote:
>>>>>>> Am 16.01.2013 um 17:45 schrieb Bas Laarhoven:
>>>>>>> 
>>>>>>>> On 16-1-2013 15:15, Michael Haberler wrote:
>>>>>>>>> ARM work:
>>>>>>>>> 
>>>>>>>>> Several people have been able to get the Beaglebone ubuntu/xenomai setup working as outlined here: http://wiki.linuxcnc.org/cgi-bin/wiki.pl?BeagleboneDevsetup
>>>>>>>>> I have updated the kernel and rootfs image a few days ago so the kernel includes ext2/3/4 support compiled in, which should take care of two failure reports I got.
>>>>>>>>> 
>>>>>>>>> Again that xenomai kernel is based on 3.2.21; it works very stable for me but there have been several reports of 'sudden stops'. The BB is a bit sensitive to power fluctuations but it might be more than that. As for that kernel, it works, but it is based on a branch which will see no further development. It supports most of the stuff needed to development; there might be some patches coming from more active BB users than me.
>>>>>>>> Hi Michael,
>>>>>>>> 
>>>>>>>> Are you saying you don't have seen these 'sudden stops' yourself?
>>>>>>> No, never, after swapping to stronger power supplies; I have two of these boards running over NFS all the time. I dont have Linuxcnc running on them though, I'll do that and see if that changes the picture. Maybe keeping the torture test running helps trigger it.
>>>>>> Beginners error! :-P The power supply is indeed critical, but the
>>>>>> stepdown converter on my BeBoPr is dimensioned for at least 2A and
>>>>>> hasn't failed me yet.
>>>>>> 
>>>>>> I think that running linuxcnc is mandatory for the lockup. After a dozen
>>>>>> runs, it looks like I can reproduce the lockup with 100% certainty
>>>>>> within one hour.
>>>>>> Using the JTAG interface to attach a debugger to the Bone, I've found
>>>>>> that once stalled the kernel is still running. It looks like it won't
>>>>>> schedule properly and almost all time is spent in the cpu_idle thread.
>>>>> 
>>>>> This is typical of a tsc emulation or timer issue. On a system without
>>>>> anything running, please let the "tsc -w" command run. It will take some
>>>>> time to run (the wrap time of the hardware timer used for tsc
>>>>> emulation), if it runs correctly, then you need to check whether the
>>>>> timer is still running when the bug happens (cat /proc/xenomai/irq
>>>>> should continue increasing when for instance the latency test is
>>>>> running). If the timer is stopped, it may have been programmed for a too
>>>>> short delay, to avoid that, you can try:
>>>>> - increasing the ipipe_timer min_delay_ticks member (by default, it uses
>>>>> a value corresponding to the min_delta_ns member in the clockevent
>>>>> structure);
>>>>> - checking after programming the timer (in the set_next_event method) if
>>>>> the timer counter is already 0, in which case you can return a negative
>>>>> value, usually -ETIME.
>>>>> 
>>>> 
>>>> Hi Gilles,
>>>> 
>>>> Thanks for the swift reply.
>>>> 
>>>> As far as I can see, tsc -w runs without an error:
>>>> 
>>>> ARM: counter wrap time: 179 seconds
>>>> Checking tsc for 6 minute(s)
>>>> min: 5, max: 12, avg: 5.04168
>>>> ...
>>>> min: 5, max: 6, avg: 5.03771
>>>> min: 5, max: 28, avg: 5.03989 -> 0.209995 us
>>>> 
>>>> real    6m0.284s
>>>> 
>>>> I've also done the other regression tests and all were successful.
>>>> 
>>>> Problem is that once the bug happens I won't be able to issue the cat 
>>>> command.
>>>> I've fixed my debug setup so I don't have to use the System.map to 
>>>> manually translate the debugger addresses : /
>>>> Now I'm waiting for another lockup to see what's happening.
>>> 
>>> 
>>> You may want to have a look at the xeno-regression-test script to put
>>> your system under pressure (and likely generate the lockup faster).
>> 
>> running tsc -w and xeno-regression-test in parallel I get errors like so (not on every run; no lockup so far):
>> 
>> ++ /usr/xenomai/bin/mutex-torture-native
>> simple_wait
>> recursive_wait
>> timed_mutex
>> mode_switch
>> pi_wait
>> lock_stealing
>> NOTE: lock_stealing mutex_trylock: not supported
>> deny_stealing
>> simple_condwait
>> recursive_condwait
>> auto_switchback
>> FAILURE: current prio (0) != expected prio (2)
>> 
>> dmesg 
>> [501963.390598] Xenomai: native: cleaning up mutex "" (ret=0).
>> [502170.164984] usb 1-1: reset high-speed USB device number 2 using musb-hdrc
>> 
>> on another run, I got a segfault while running sigdebug:
>> ++ /usr/xenomai/bin/regression/native/sigdebug
>> mayday page starting at 0x400eb000 [/dev/rtheap]
>> mayday code: 0c 00 9f e5 0c 70 9f e5 00 00 00 ef 00 00 a0 e3 00 00 80 e5 2b 02 00 0a 42 00 0f 00 db d7 ee b8
>> mlockall
>> syscall
>> signal
>> relaxed mutex owner
>> page fault
>> watchdog
>> ./xeno-regression-test: line 53:  4210 Segmentation fault      /usr/xenomai/bin/regression/native/sigdebug
>> 
>> root@bb1:/usr/xenomai/bin# dmesg 
>> [502442.312996] Xenomai: watchdog triggered -- signaling runaway thread 'rt_task'
>> [502443.054186] Xenomai: native: cleaning up mutex "prio_invert" (ret=0).
>> [502443.055730] Xenomai: native: cleaning up sem "send_signal" (ret=0).
>> [502518.134977] usb 1-1: reset high-speed USB device number 2 using musb-hdrc
>> 
>> 
>> unsure what to make of it - any suggestions? the usb reset looks suspicious
> 
> 
> What version of xenomai are you using? These look like old issues?


That was Xenomai 2.6.1, as per the release tag in the git repo; the rest is as outlined here: http://www.xenomai.org/pipermail/xenomai/2013-January/027164.html

I just caught one more error:

-m

== Testing FPU check routines...
d0: 1 != 2
d1: 1 != 2
d2: 1 != 2
d3: 1 != 2
d4: 1 != 2
d5: 1 != 2
d6: 1 != 2
d7: 1 != 2
d8: 1 != 2
d9: 1 != 2
d10: 1 != 2
d11: 1 != 2
d12: 1 != 2
d13: 1 != 2
d14: 1 != 2
d15: 1 != 2
== FPU check routines: OK.
switchtest: Unable to open switchtest device.
(modprobe xeno_switchtest ?)
switchtest: Unable to open switchtest device.
(modprobe xeno_switchtest ?)
== Threads:== Threads:./xeno-regression-test failed: child 11343 exited with status 1
root@bb1:/usr/xenomai/bin# dmesg 
[502442.312996] Xenomai: watchdog triggered -- signaling runaway thread 'rt_task'
[502443.054186] Xenomai: native: cleaning up mutex "prio_invert" (ret=0).
[502443.055730] Xenomai: native: cleaning up sem "send_signal" (ret=0).
[502518.134977] usb 1-1: reset high-speed USB device number 2 using musb-hdrc
[502561.165050] usb 1-1: reset high-speed USB device number 2 using musb-hdrc
[502658.312984] Xenomai: watchdog triggered -- signaling runaway thread 'rt_task'
[502721.135190] usb 1-1: reset high-speed USB device number 2 using musb-hdrc
[502737.613198] Xenomai: Posix: closing message queue descriptor 3.
[502738.607343] switchtest: page allocation failure: order:4, mode:0xd0
[502738.607369] Backtrace: 
[502738.607436] [<c0010ea0>] (dump_backtrace+0x0/0x110) from [<c041e9a0>] (dump_stack+0x18/0x1c)
[502738.607453]  r6:00000000 r5:000000d0 r4:00000001 r3:00000000
[502738.607507] [<c041e988>] (dump_stack+0x0/0x1c) from [<c00cd668>] (warn_alloc_failed+0xf4/0x114)
[502738.607536] [<c00cd574>] (warn_alloc_failed+0x0/0x114) from [<c00cfbe0>] (__alloc_pages_nodemask+0x65c/0x6dc)
[502738.607554]  r3:00000006 r2:00000000
[502738.607570]  r7:00000004 r6:00000001 r5:cc14a000 r4:000000d0
[502738.607610] [<c00cf584>] (__alloc_pages_nodemask+0x0/0x6dc) from [<c041ffb0>] (cache_alloc_refill+0x2d0/0x5cc)
[502738.607645] [<c041fce0>] (cache_alloc_refill+0x0/0x5cc) from [<c00f7574>] (__kmalloc+0xb4/0x114)
[502738.607683] [<c00f74c0>] (__kmalloc+0x0/0x114) from [<c033a53c>] (rtswitch_ioctl_nrt+0xc0/0x3f4)
[502738.607756]  r7:c16c9800 r6:cf587ca0 r5:00000017 r4:cf587c80
[502738.607800] [<c033a47c>] (rtswitch_ioctl_nrt+0x0/0x3f4) from [<c00c4ea0>] (__rt_dev_ioctl+0x6c/0x1c4)
[502738.607817]  r7:c16c9800 r6:40040630 r5:00000017 r4:cf587c80
[502738.607856] [<c00c4e34>] (__rt_dev_ioctl+0x0/0x1c4) from [<c00c73fc>] (sys_rtdm_ioctl+0x28/0x2c)
[502738.607872]  r3:00000017 r2:40040630
[502738.607889]  r7:00000020 r6:cc14bfb0 r5:d088aa08 r4:00000050
[502738.607938] [<c00c73d4>] (sys_rtdm_ioctl+0x0/0x2c) from [<c009ffa8>] (losyscall_event+0xc0/0x224)
[502738.607973] [<c009fee8>] (losyscall_event+0x0/0x224) from [<c0083bb0>] (ipipe_syscall_hook+0x34/0x3c)
[502738.607998] [<c0083b7c>] (ipipe_syscall_hook+0x0/0x3c) from [<c0082738>] (__ipipe_notify_syscall+0x74/0xec)
[502738.608028] [<c00826c4>] (__ipipe_notify_syscall+0x0/0xec) from [<c001385c>] (__ipipe_syscall_root+0x7c/0x104)
[502738.608055] [<c00137e0>] (__ipipe_syscall_root+0x0/0x104) from [<c000da04>] (vector_swi+0x44/0x90)
[502738.608071]  r7:000f0042 r6:40114000 r5:00000380 r4:00030000
[502738.608099] Mem-info:
[502738.608110] Normal per-cpu:
[502738.608122] CPU    0: hi:   90, btch:  15 usd:   0
[502738.608149] active_anon:398 inactive_anon:473 isolated_anon:0
[502738.608157]  active_file:9642 inactive_file:14449 isolated_file:0
[502738.608164]  unevictable:4882 dirty:2 writeback:63 unstable:0
[502738.608171]  free:2028 slab_reclaimable:25399 slab_unreclaimable:4175
[502738.608179]  mapped:1370 shmem:4 pagetables:244 bounce:0
[502738.608222] Normal free:8112kB min:2036kB low:2544kB high:3052kB active_anon:1592kB inactive_anon:1892kB active_file:38568kB inactive_file:57796kB unevictable:19528kB isolated(anon):0kB isolated(file):0kB present:260096kB mlocked:19528kB dirty:8kB writeback:252kB mapped:5480kB shmem:16kB slab_reclaimable:101596kB slab_unreclaimable:16700kB kernel_stack:840kB pagetables:976kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[502738.608267] lowmem_reserve[]: 0 0
[502738.608286] Normal: 1502*4kB 195*8kB 34*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB = 8112kB
[502738.608354] 24648 total pagecache pages
[502738.608366] 162 pages in swap cache
[502738.608378] Swap cache stats: add 3925, delete 3763, find 35641/35775
[502738.608391] Free swap  = 3898276kB
[502738.608401] Total swap = 3909628kB
[502738.710325] switchtest: page allocation failure: order:4, mode:0xd0
[502738.710350] Backtrace: 
[502738.710416] [<c0010ea0>] (dump_backtrace+0x0/0x110) from [<c041e9a0>] (dump_stack+0x18/0x1c)
[502738.710434]  r6:00000000 r5:000000d0 r4:00000001 r3:00000000
[502738.710487] [<c041e988>] (dump_stack+0x0/0x1c) from [<c00cd668>] (warn_alloc_failed+0xf4/0x114)
[502738.710516] [<c00cd574>] (warn_alloc_failed+0x0/0x114) from [<c00cfbe0>] (__alloc_pages_nodemask+0x65c/0x6dc)
[502738.710533]  r3:00000006 r2:00000000
[502738.710550]  r7:00000004 r6:00000001 r5:cc1fe000 r4:000000d0
[502738.710589] [<c00cf584>] (__alloc_pages_nodemask+0x0/0x6dc) from [<c041ffb0>] (cache_alloc_refill+0x2d0/0x5cc)
[502738.710624] [<c041fce0>] (cache_alloc_refill+0x0/0x5cc) from [<c00f7574>] (__kmalloc+0xb4/0x114)
[502738.710662] [<c00f74c0>] (__kmalloc+0x0/0x114) from [<c033a53c>] (rtswitch_ioctl_nrt+0xc0/0x3f4)
[502738.710678]  r7:cf35d800 r6:cf2ff4e0 r5:00000017 r4:cf2ff4c0
[502738.710777] [<c033a47c>] (rtswitch_ioctl_nrt+0x0/0x3f4) from [<c00c4ea0>] (__rt_dev_ioctl+0x6c/0x1c4)
[502738.710795]  r7:cf35d800 r6:40040630 r5:00000017 r4:cf2ff4c0
[502738.710834] [<c00c4e34>] (__rt_dev_ioctl+0x0/0x1c4) from [<c00c73fc>] (sys_rtdm_ioctl+0x28/0x2c)
[502738.710850]  r3:00000017 r2:40040630
[502738.710867]  r7:00000020 r6:cc1fffb0 r5:d0889c08 r4:00000050
[502738.710915] [<c00c73d4>] (sys_rtdm_ioctl+0x0/0x2c) from [<c009ffa8>] (losyscall_event+0xc0/0x224)
[502738.710949] [<c009fee8>] (losyscall_event+0x0/0x224) from [<c0083bb0>] (ipipe_syscall_hook+0x34/0x3c)
[502738.710975] [<c0083b7c>] (ipipe_syscall_hook+0x0/0x3c) from [<c0082738>] (__ipipe_notify_syscall+0x74/0xec)
[502738.711004] [<c00826c4>] (__ipipe_notify_syscall+0x0/0xec) from [<c001385c>] (__ipipe_syscall_root+0x7c/0x104)
[502738.711031] [<c00137e0>] (__ipipe_syscall_root+0x0/0x104) from [<c000da04>] (vector_swi+0x44/0x90)
[502738.711047]  r7:000f0042 r6:4008e000 r5:00000381 r4:00030000
[502738.711075] Mem-info:
[502738.711085] Normal per-cpu:
[502738.711097] CPU    0: hi:   90, btch:  15 usd:   0
[502738.711124] active_anon:408 inactive_anon:477 isolated_anon:0
[502738.711132]  active_file:9541 inactive_file:14408 isolated_file:0
[502738.711139]  unevictable:4883 dirty:3 writeback:63 unstable:0
[502738.711147]  free:2321 slab_reclaimable:25246 slab_unreclaimable:4168
[502738.711155]  mapped:1383 shmem:4 pagetables:243 bounce:0
[502738.711197] Normal free:9284kB min:2036kB low:2544kB high:3052kB active_anon:1632kB inactive_anon:1908kB active_file:38164kB inactive_file:57632kB unevictable:19532kB isolated(anon):0kB isolated(file):0kB present:260096kB mlocked:19532kB dirty:12kB writeback:252kB mapped:5532kB shmem:16kB slab_reclaimable:100984kB slab_unreclaimable:16672kB kernel_stack:832kB pagetables:972kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[502738.711243] lowmem_reserve[]: 0 0
[502738.711261] Normal: 1725*4kB 214*8kB 40*16kB 1*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB = 9284kB
[502738.711328] 24511 total pagecache pages
[502738.711339] 167 pages in swap cache
[502738.711351] Swap cache stats: add 3933, delete 3766, find 35641/35776
[502738.711364] Free swap  = 3898276kB
[502738.711374] Total swap = 3909628kB
[502738.738776] 65536 pages of RAM
[502738.738802] 2390 free pages
[502738.738812] 2463 reserved pages
[502738.738822] 29511 slab pages
[502738.738831] 9664 pages shared
[502738.738840] 159 pages swap cached
[502738.846993] 65536 pages of RAM
[502738.847018] 2605 free pages
[502738.847028] 2463 reserved pages
[502738.847038] 29379 slab pages
[502738.847047] 9815 pages shared
[502738.847056] 170 pages swap cached
[502739.570312] Xenomai: native: cleaning up sem "dispsem-11349" (ret=0).
[502752.205048] usb 1-1: reset high-speed USB device number 2 using musb-hdrc



> 
> -- 
>                                                                Gilles.




* Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
  2013-01-19 14:14                 ` Michael Haberler
@ 2013-01-19 14:19                   ` Gilles Chanteperdrix
  0 siblings, 0 replies; 17+ messages in thread
From: Gilles Chanteperdrix @ 2013-01-19 14:19 UTC (permalink / raw)
  To: Michael Haberler; +Cc: xenomai

On 01/19/2013 03:14 PM, Michael Haberler wrote:

> that was xenomai 2.6.1 as per release tag in the git repo; the rest as outlined here: http://www.xenomai.org/pipermail/xenomai/2013-January/027164.html


Please upgrade to Xenomai master. You are hitting bugs which have already
been fixed since 2.6.1.

> [502738.607343] switchtest: page allocation failure: order:4, mode:0xd0


That is an allocation failure. I am afraid you can run
xeno-regression-test only once after a system boot (it is supposed to
run for several hours anyway).


-- 
                                                                Gilles.



* Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
  2013-01-19 14:09             ` Michael Haberler
  2013-01-19 14:10               ` Gilles Chanteperdrix
@ 2013-01-19 14:32               ` Gilles Chanteperdrix
  2013-01-21 11:43                 ` Michael Haberler
  1 sibling, 1 reply; 17+ messages in thread
From: Gilles Chanteperdrix @ 2013-01-19 14:32 UTC (permalink / raw)
  To: Michael Haberler; +Cc: xenomai

On 01/19/2013 03:09 PM, Michael Haberler wrote:

> 
> Am 19.01.2013 um 14:29 schrieb Gilles Chanteperdrix:
> 
>> On 01/17/2013 02:30 PM, Bas Laarhoven wrote:
>>
>>> On 17-1-2013 9:53, Gilles Chanteperdrix wrote:
>>>> On 01/17/2013 08:59 AM, Bas Laarhoven wrote:
>>>>
>>>>> On 16-1-2013 20:36, Michael Haberler wrote:
>>>>>> Am 16.01.2013 um 17:45 schrieb Bas Laarhoven:
>>>>>>
>>>>>>> On 16-1-2013 15:15, Michael Haberler wrote:
>>>>>>>> ARM work:
>>>>>>>>
>>>>>>>> Several people have been able to get the Beaglebone ubuntu/xenomai setup working as outlined here: http://wiki.linuxcnc.org/cgi-bin/wiki.pl?BeagleboneDevsetup
>>>>>>>> I have updated the kernel and rootfs image a few days ago so the kernel includes ext2/3/4 support compiled in, which should take care of two failure reports I got.
>>>>>>>>
>>>>>>>> Again that xenomai kernel is based on 3.2.21; it works very stable for me but there have been several reports of 'sudden stops'. The BB is a bit sensitive to power fluctuations but it might be more than that. As for that kernel, it works, but it is based on a branch which will see no further development. It supports most of the stuff needed to development; there might be some patches coming from more active BB users than me.
>>>>>>> Hi Michael,
>>>>>>>
>>>>>>> Are you saying you don't have seen these 'sudden stops' yourself?
>>>>>> No, never, after swapping to stronger power supplies; I have two of these boards running over NFS all the time. I dont have Linuxcnc running on them though, I'll do that and see if that changes the picture. Maybe keeping the torture test running helps trigger it.
>>>>> Beginners error! :-P The power supply is indeed critical, but the
>>>>> stepdown converter on my BeBoPr is dimensioned for at least 2A and
>>>>> hasn't failed me yet.
>>>>>
>>>>> I think that running linuxcnc is mandatory for the lockup. After a dozen
>>>>> runs, it looks like I can reproduce the lockup with 100% certainty
>>>>> within one hour.
>>>>> Using the JTAG interface to attach a debugger to the Bone, I've found
>>>>> that once stalled the kernel is still running. It looks like it won't
>>>>> schedule properly and almost all time is spent in the cpu_idle thread.
>>>>
>>>> This is typical of a tsc emulation or timer issue. On a system without
>>>> anything running, please let the "tsc -w" command run. It will take some
>>>> time to run (the wrap time of the hardware timer used for tsc
>>>> emulation), if it runs correctly, then you need to check whether the
>>>> timer is still running when the bug happens (cat /proc/xenomai/irq
>>>> should continue increasing when for instance the latency test is
>>>> running). If the timer is stopped, it may have been programmed for a too
>>>> short delay, to avoid that, you can try:
>>>> - increasing the ipipe_timer min_delay_ticks member (by default, it uses
>>>> a value corresponding to the min_delta_ns member in the clockevent
>>>> structure);
>>>> - checking after programming the timer (in the set_next_event method) if
>>>> the timer counter is already 0, in which case you can return a negative
>>>> value, usually -ETIME.
>>>>
>>>
>>> Hi Gilles,
>>>
>>> Thanks for the swift reply.
>>>
>>> As far as I can see, tsc -w runs without an error:
>>>
>>> ARM: counter wrap time: 179 seconds
>>> Checking tsc for 6 minute(s)
>>> min: 5, max: 12, avg: 5.04168
>>> ...
>>> min: 5, max: 6, avg: 5.03771
>>> min: 5, max: 28, avg: 5.03989 -> 0.209995 us
>>>
>>> real    6m0.284s
>>>
>>> I've also done the other regression tests and all were successful.
>>>
>>> Problem is that once the bug happens I won't be able to issue the cat 
>>> command.
>>> I've fixed my debug setup so I don't have to use the System.map to 
>>> manually translate the debugger addresses : /
>>> Now I'm waiting for another lockup to see what's happening.
>>
>>
>> You may want to have a look at the xeno-regression-test script to put
>> your system under pressure (and likely generate the lockup faster).
> 
> running tsc -w and xeno-regression-test in parallel I get errors like so (not on every run; no lockup so far):


At this point we know that you do not have any issue with tsc emulation,
so running tsc -w in parallel is useless. The point of running
xeno-regression-test is to reach the "switchtest + switchtest -s +
latency + ltp" stage, where the system is put under stress and is more
likely to trigger a timer issue if there is one. So, if the tests before
that do not pass, simply comment them out in xeno-regression-test
(xeno-regression-test is a shell script).

Also note that if you are running a thumb2 user space, or running the
kernel with CONFIG_THUMB2_KERNEL (which the segfault in sigdebug
suggests), on a processor with a Cortex-A8 core, you need to enable
CONFIG_ARM_ERRATA_430973, otherwise you will get random faults due to
the corresponding processor erratum.

-- 
                                                                Gilles.



* Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
  2013-01-19 14:32               ` Gilles Chanteperdrix
@ 2013-01-21 11:43                 ` Michael Haberler
  2013-01-21 11:56                   ` Gilles Chanteperdrix
  0 siblings, 1 reply; 17+ messages in thread
From: Michael Haberler @ 2013-01-21 11:43 UTC (permalink / raw)
  To: xenomai

The suspicion has now turned to the DHCP lease settings and RTC time warp issues - the Beaglebone doesn't have an RTC, so it starts up at 1-1-1970.

The first DHCP lease still has 1970 timestamps, but eventually the system time is set with ntpdate, and it could be that this causes confusion.

The thing I find hard to believe: loss of IP connectivity is conceivable, but why a kernel hang?

Question: does an RTC time warp have any possible bearing on Xenomai operations?

- Michael


* Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
  2013-01-21 11:43                 ` Michael Haberler
@ 2013-01-21 11:56                   ` Gilles Chanteperdrix
  2013-01-21 13:32                     ` Michael Haberler
  0 siblings, 1 reply; 17+ messages in thread
From: Gilles Chanteperdrix @ 2013-01-21 11:56 UTC (permalink / raw)
  To: Michael Haberler; +Cc: xenomai

On 01/21/2013 12:43 PM, Michael Haberler wrote:

> the suspicion now turned to the DHCP lease setting and RTC time warp
> issues - the Beaglebone doesnt have an RTC so it starts up at
> 1-1-1970
> 
> the first DHCP lease still has 1970 timestamps, but eventually the
> RTC is set with ntpdate and it could be this causes confusion
> 
> the thing which is hard to believe for me: loss of IP connectivity -
> conceivable; kernel hang - why?
> 
> question: does a RTC time warp have any possible bearing on Xenomai
> operations?


No, it should not. Xenomai uses its own clock, which is set only once
at boot, so it is unaffected by Linux wallclock time changes... or it
should be.
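
A quick, non-authoritative way to convince yourself on the board is to
keep the Xenomai latency test running while stepping the Linux wallclock
underneath it; the periodic output should keep flowing unperturbed (the
date below is just a placeholder):

  # shell 1: Xenomai latency test with a 1 ms period
  latency -p 1000
  # shell 2: step the wallclock while the test is running
  date -s "2013-01-21 13:00:00"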

-- 
                                                                Gilles.



* Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
  2013-01-21 11:56                   ` Gilles Chanteperdrix
@ 2013-01-21 13:32                     ` Michael Haberler
  2013-01-21 19:10                       ` Gilles Chanteperdrix
  0 siblings, 1 reply; 17+ messages in thread
From: Michael Haberler @ 2013-01-21 13:32 UTC (permalink / raw)
  To: xenomai


Am 21.01.2013 um 12:56 schrieb Gilles Chanteperdrix:

> On 01/21/2013 12:43 PM, Michael Haberler wrote:
> 
>> the suspicion now turned to the DHCP lease setting and RTC time warp
>> issues - the Beaglebone doesnt have an RTC so it starts up at
>> 1-1-1970
>> 
>> the first DHCP lease still has 1970 timestamps, but eventually the
>> RTC is set with ntpdate and it could be this causes confusion
>> 
>> the thing which is hard to believe for me: loss of IP connectivity -
>> conceivable; kernel hang - why?
>> 
>> question: does a RTC time warp have any possible bearing on Xenomai
>> operations?
> 
> 
> No, it should not, Xenomai uses its own clock, which is set only once
> upon boot, so, is unaffected by Linux wallclock time changes... or
> should be.


It might not be Xenomai after all. Uhum.

The bughunt safari tribe has decided to focus on class-'duh' problems and has resolved to shut up until red hands are spotted.

--

Btw, the upgrade to the ipipe patch in master made all xeno-regression-test problems go away - thanks!

-Michael





* Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
  2013-01-21 13:32                     ` Michael Haberler
@ 2013-01-21 19:10                       ` Gilles Chanteperdrix
  2013-01-21 21:20                         ` Michael Haberler
  0 siblings, 1 reply; 17+ messages in thread
From: Gilles Chanteperdrix @ 2013-01-21 19:10 UTC (permalink / raw)
  To: Michael Haberler; +Cc: xenomai

On 01/21/2013 02:32 PM, Michael Haberler wrote:

> 
> Am 21.01.2013 um 12:56 schrieb Gilles Chanteperdrix:
> 
>> On 01/21/2013 12:43 PM, Michael Haberler wrote:
>> 
>>> the suspicion now turned to the DHCP lease setting and RTC time
>>> warp issues - the Beaglebone doesnt have an RTC so it starts up
>>> at 1-1-1970
>>> 
>>> the first DHCP lease still has 1970 timestamps, but eventually
>>> the RTC is set with ntpdate and it could be this causes
>>> confusion
>>> 
>>> the thing which is hard to believe for me: loss of IP
>>> connectivity - conceivable; kernel hang - why?
>>> 
>>> question: does a RTC time warp have any possible bearing on
>>> Xenomai operations?
>> 
>> 
>> No, it should not, Xenomai uses its own clock, which is set only
>> once upon boot, so, is unaffected by Linux wallclock time
>> changes... or should be.
> 
> 
> it might not be Xenomai after all. Uhum.
> 
> the bughunt safari tribe has decided to focus on class 'duh' problems
> and resolves to shut up until red hands are spotted.


I would still put the check in the timer "set_next_event" callback, just
in case...
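
For reference, the Linux clockevent device that callback belongs to,
together with its current min_delta_ns/max_delta_ns, can be inspected at
run time (the exact output format varies between kernel versions):

  grep -A 12 "Clock Event Device" /proc/timer_list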


-- 
                                                                Gilles.



* Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
  2013-01-21 19:10                       ` Gilles Chanteperdrix
@ 2013-01-21 21:20                         ` Michael Haberler
  2013-01-22 12:06                           ` [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM - SUMMARY Bas Laarhoven
  0 siblings, 1 reply; 17+ messages in thread
From: Michael Haberler @ 2013-01-21 21:20 UTC (permalink / raw)
  To: xenomai


Am 21.01.2013 um 20:10 schrieb Gilles Chanteperdrix:

> On 01/21/2013 02:32 PM, Michael Haberler wrote:
> 
>> 
>> Am 21.01.2013 um 12:56 schrieb Gilles Chanteperdrix:


>>>> question: does a RTC time warp have any possible bearing on
>>>> Xenomai operations?
>>> 
>>> 
>>> No, it should not, Xenomai uses its own clock, which is set only
>>> once upon boot, so, is unaffected by Linux wallclock time
>>> changes... or should be.
>> 
>> 
>> it might not be Xenomai after all. Uhum.
>> 
>> the bughunt safari tribe has decided to focus on class 'duh' problems
>> and resolves to shut up until red hands are spotted.
> 
> 
> I would still put the check in the timer "set_next_event" callback, just
> in case...

I assume Bas will give the postmortem shortly - he nailed the issue: the RTC time warp at boot leads to the DHCP lease being lost mid-flight and the NFS root freezing, which makes it look like a kernel hang.

relieved,

- Michael








* Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM - SUMMARY
  2013-01-21 21:20                         ` Michael Haberler
@ 2013-01-22 12:06                           ` Bas Laarhoven
  0 siblings, 0 replies; 17+ messages in thread
From: Bas Laarhoven @ 2013-01-22 12:06 UTC (permalink / raw)
  To: Michael Haberler; +Cc: EMC developers, xenomai

On 21-1-2013 22:20, Michael Haberler wrote:
> Am 21.01.2013 um 20:10 schrieb Gilles Chanteperdrix:
>
>> On 01/21/2013 02:32 PM, Michael Haberler wrote:
>>
>>> Am 21.01.2013 um 12:56 schrieb Gilles Chanteperdrix:
>
>>>>> question: does a RTC time warp have any possible bearing on
>>>>> Xenomai operations?
>>>>
>>>> No, it should not, Xenomai uses its own clock, which is set only
>>>> once upon boot, so, is unaffected by Linux wallclock time
>>>> changes... or should be.
>>>
>>> it might not be Xenomai after all. Uhum.
>>>
>>> the bughunt safari tribe has decided to focus on class 'duh' problems
>>> and resolves to shut up until red hands are spotted.
>>
>> I would still put the check in the timer "set_next_event" callback, just
>> in case...
> I assume Bas will give the postmortem shortly - he nailed the issue; the RTC boot timewarp makes for a lost DHCP lease midflight and NFS freezing, making it look like a kernel hang.
>
> relieved,
>
> - Michael

Michael said it all; there's not much for me to add. I'll summarize the 
case for the record ; )

Lesson learned: Change only one variable at a time and don't assume 
anything!

I had been using an NFS-mounted filesystem with the Beaglebone for over a 
year without problems and had got used to its reliability (as I was 
used to in a corporate environment in the past).
Because the Xenomai software was built against libraries (eabihf) not 
compatible with my (eabi) system, I switched to the Ubuntu image Michael 
built, and everything seemed to work fine - except that the (xenomai) 
kernel froze after around 50-60 minutes of uptime. With the JTAG 
debugger I could see the kernel still running, but all applications 
(both text and X via SSH, and the console via the serial/USB connection) 
seemed frozen, and there was no output indicating what was going on. Of 
course the xenomai kernel was the first suspect, but that proved to be a 
mistake. With hindsight, knowing the cause of the freeze now, I wonder 
why I never got an NFS connection time-out message on the console, but 
for some reason or other it isn't generated in this case.

The underlying problem is that the Beaglebone has no battery-backed 
real-time clock. This only becomes a serious problem (a freeze) with the 
combination of (1) an NFS root filesystem mounted over the network, 
(2) an initial kernel time lying far in the past, and (3) a DHCP lease 
time shorter than some multiple (in this case 2x) of the required system 
uptime.

Ubuntu (and maybe Debian too) systems are obviously not designed to 
start with a completely wrong real-time clock value, and dhclient (like 
many other programs) is not designed to handle the large time step that 
occurs once the clock is set properly at some point during the boot 
process.
Note that if the filesystem is on local storage (e.g. flash or a hard 
disk), there will only be a short disruption of the network connection, 
and it is likely that the problem won't be noticed at all.

A final solution hasn't been found yet: I would prefer a workaround that 
does not require changing dhclient or some other standard program. I 
think it would suffice to acquire a new lease right after the time step 
has been made. This has to be done without giving up the previous lease 
(which has expired because of the time step), because releasing it would 
cause the system to freeze again. Suggestions on how to do this are 
welcome - I can't spend much more time on this issue this week.
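
One possible direction - completely untested, the interface name and NTP 
server below are placeholders, and whether a second dhclient invocation 
coexists cleanly with the one started at boot is exactly the open 
question:

  # run right after the clock has been stepped
  ntpdate -b pool.ntp.org
  # re-request a lease; deliberately no -r (release) here, so the
  # address stays configured while the new REQUEST is in flight
  dhclient -1 eth0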

-- Bas





end of thread, other threads:[~2013-01-22 12:06 UTC | newest]

Thread overview: 17+ messages
     [not found] <4D4F8D1B-022E-47F8-A579-EBF2A3427C5D@mah.priv.at>
     [not found] ` <50F6D940.3040406@xs4all.nl>
     [not found]   ` <F1ACE4FC-3E71-4498-B683-81F5C40CB6E3@mah.priv.at>
2013-01-17  7:59     ` [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM Bas Laarhoven
2013-01-17  8:53       ` Gilles Chanteperdrix
2013-01-17 11:34         ` Michael Haberler
2013-01-17 12:07           ` Gilles Chanteperdrix
2013-01-17 13:30         ` Bas Laarhoven
2013-01-19 13:29           ` Gilles Chanteperdrix
2013-01-19 14:09             ` Michael Haberler
2013-01-19 14:10               ` Gilles Chanteperdrix
2013-01-19 14:14                 ` Michael Haberler
2013-01-19 14:19                   ` Gilles Chanteperdrix
2013-01-19 14:32               ` Gilles Chanteperdrix
2013-01-21 11:43                 ` Michael Haberler
2013-01-21 11:56                   ` Gilles Chanteperdrix
2013-01-21 13:32                     ` Michael Haberler
2013-01-21 19:10                       ` Gilles Chanteperdrix
2013-01-21 21:20                         ` Michael Haberler
2013-01-22 12:06                           ` [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM - SUMMARY Bas Laarhoven
