* Re: [Xenomai] [xenomai-forge] psos: crash while stressing event timers
@ 2013-06-04 20:26 Tom Philips
  2013-06-05  8:02 ` Philippe Gerum
  0 siblings, 1 reply; 17+ messages in thread
From: Tom Philips @ 2013-06-04 20:26 UTC (permalink / raw)
  To: xenomai

We have made a core dump and the crash occurs in pvfree().
See call sequence explanation below.

The actual problem, however, seems to be in the libc library.

I'll elaborate.
The crash occurs when using the timer function tm_evafter().

This is the call sequence of tm_evafter(), in pseudocode, annotated a
bit:

tm_evafter()              called from our app
  start_evtimer()
    timerobj_init()       calls timer_create() POSIX function
    timerobj_start()      calls timer_settime() POSIX function
    if (error)            we get an error from timerobj_start()
      timerobj_destroy()  destroys the POSIX timer
      pvlist_remove()
      pvfree()            ==> crashes (but not always)

So in our tests, timer_settime() sometimes returns an error code,
while the timer does seem to be started.
I.e. we get a negative return code from timer_settime(),
errno is set to 22 (EINVAL), but the timer is started anyhow.
All of this was checked in the debugger.

This does not seem like correct behaviour of the timer_settime() system call.

Obviously, you will run into problems eventually.
I.e. the above code will clean up the timer and pv objects and
the code that is called at timer elapse does the same thing.
So you will get a double free of the pv structures.
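
In C, that error path boils down to something like this (a simplified
sketch on my side; the descriptor and field names are approximate, not the
literal lib/psos/tm.c code):

+++
/* Simplified sketch only: 'tm' stands for the pSOS timer descriptor,
   field names are approximate. */
ret = timerobj_start(&tm->tmobj, post_event_once, &it);
if (ret) {
	timerobj_destroy(&tm->tmobj);   /* delete the POSIX timer */
	pvlist_remove(&tm->link);       /* unlink the descriptor */
	pvfree(tm);                     /* free it -- but if the timer was
	                                   really armed, the expiry handler
	                                   frees the same descriptor again,
	                                   hence the double free */
	return ret;
}
+++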

This only occurs under heavy load.
Not all false timer_settime() errors result in a crash.
I guess it all depends on the order in which the destruction takes place.

If I ignore the error returned by timerobj_start(), all works fine.

I've tried this on different architectures (ARM, MIPS, x86),
different LIBC versions (2.10, 2.11.3, 2.12, 2.15),
different kernel versions (2.6.29, 2.6.32, 3.2, 3.4.24).
They all exhibit the same problem.

We might need to contact the LIBC guys...

--
Tom


* Re: [Xenomai] [xenomai-forge] psos: crash while stressing event timers
  2013-06-04 20:26 [Xenomai] [xenomai-forge] psos: crash while stressing event timers Tom Philips
@ 2013-06-05  8:02 ` Philippe Gerum
  2013-06-05  8:28   ` Ronny Meeus
  0 siblings, 1 reply; 17+ messages in thread
From: Philippe Gerum @ 2013-06-05  8:02 UTC (permalink / raw)
  To: Tom Philips; +Cc: xenomai

On 06/04/2013 10:26 PM, Tom Philips wrote:
> We have made a core dump and the crash occurs in pvfree().
> See call sequence explanation below.
> 
> The actual problem, however, seems to be in the libc library.
> 
> I'll elaborate.
> The crash occurs when using the timer function tm_evafter().
> 
> This is the call sequence of tm_evafter(), in pseudocode, annotated a
> bit:
> 
> tm_evafter()              called from our app
>    start_evtimer()
>      timerobj_init()       calls timer_create() POSIX function
>      timerobj_start()      calls timer_settime() POSIX function
>      if (error)            we get an error from timerobj_start()
>        timerobj_destroy()  destroys the POSIX timer
>        pvlist_remove()
>        pvfree()            ==> crashes (but not always)
> 
> So in our tests, timer_settime() sometimes returns an error code,
> while the timer does seem to be started.
> I.e. we get a negative return code from timer_settime(),
> errno is set to 22 (EINVAL), but the timer is started anyhow.
> All of this was checked in the debugger.
> 
> This does not seem like correct behaviour of the timer_settime() system call.
> 
> Obviously, you will run into problems eventually.
> I.e. the above code will clean up the timer and pv objects and
> the code that is called at timer elapse does the same thing.
> So you will get a double free of the pv structures.
> 
> This only occurs under heavy load.
> Not all false timer_settime() errors result in a crash.
> I guess it all depends on the order in which the destruction takes place.
> 
> If I ignore the error returned by timerobj_start(), all works fine.
> 
> I've tried this on different architectures (ARM, MIPS, x86),
> different LIBC versions (2.10, 2.11.3, 2.12, 2.15),
> different kernel versions (2.6.29, 2.6.32, 3.2, 3.4.24).
> They all exhibit the same problem.
> 
> We might need to contact the LIBC guys...
> 

Thanks for the detailed analysis, it makes sense for sure. However,
since timer_settime() should resolve as a plain kernel syscall with
these glibc/kernel combos, I would first suspect Xenomai rather
than the regular kernel. A few questions more:

- does the patch below cause assertions to be raised when the bug happens?
(you will need to mention --enable-assert or --enable-debug when configuring).

diff --git a/lib/copperplate/timerobj.c b/lib/copperplate/timerobj.c
index a367cb1..b917a47 100644
--- a/lib/copperplate/timerobj.c
+++ b/lib/copperplate/timerobj.c
@@ -306,8 +306,12 @@ int timerobj_start(struct timerobj *tmobj,
 	write_unlock(&svlock);
 	timerobj_unlock(tmobj);
 
-	if (__RT(timer_settime(tmobj->timer, TIMER_ABSTIME, it, NULL)))
+	if (__RT(timer_settime(tmobj->timer, TIMER_ABSTIME, it, NULL))) {
+		assert(timer_getoverrun(tmobj->timer) >= 0);
+		assert(it->it_value.tv_sec >= 0 && it->it_value.tv_nsec < 1000000000);
+		assert(it->it_interval.tv_sec >= 0 && it->it_interval.tv_nsec < 1000000000);
 		return __bt(-errno);
+	}
 
 	return 0;
 }

- does valgrind detect anything bad when running your test case (e.g. over x86)?

- do you have a reasonably simple test case illustrating the bug,
you could send me?

TIA,

-- 
Philippe.



* Re: [Xenomai] [xenomai-forge] psos: crash while stressing event timers
  2013-06-05  8:02 ` Philippe Gerum
@ 2013-06-05  8:28   ` Ronny Meeus
  2013-06-05  8:36     ` Philippe Gerum
  0 siblings, 1 reply; 17+ messages in thread
From: Ronny Meeus @ 2013-06-05  8:28 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai

On Wed, Jun 5, 2013 at 10:02 AM, Philippe Gerum <rpm@xenomai.org> wrote:

> On 06/04/2013 10:26 PM, Tom Philips wrote:
> > [...]
>
> Thanks for the detailed analysis, it makes sense for sure. However,
> since timer_settime() should resolve as a plain kernel syscall with
> these glibc/kernel combos, I would first suspect Xenomai rather
> than the regular kernel. A few questions more:
>
> - does the patch below cause assertions to be raised when the bug happens?
> (you will need to mention --enable-assert or --enable-debug when
> configuring).
>
> [...]
>
> - does valgrind detect anything bad when running your test case (e.g. over
> x86)?
>
> - do you have a reasonably simple test case illustrating the bug,
> you could send me?
>

Philippe,

the testcode is attached to the initial mail.
The issue can be reproduced in all kinds of environments.

Ronny


>
> TIA,
>
> --
> Philippe.
>


* Re: [Xenomai] [xenomai-forge] psos: crash while stressing event timers
  2013-06-05  8:28   ` Ronny Meeus
@ 2013-06-05  8:36     ` Philippe Gerum
  2013-06-05  9:38       ` Philippe Gerum
  2013-06-05  9:51       ` Tom Philips
  0 siblings, 2 replies; 17+ messages in thread
From: Philippe Gerum @ 2013-06-05  8:36 UTC (permalink / raw)
  To: Ronny Meeus; +Cc: xenomai

On 06/05/2013 10:28 AM, Ronny Meeus wrote:

> the testcode is attached to the initial mail.
> The issue can be reproduced in all kinds of environments.
>

Ok, I overlooked the link. I'll have a look and let you know, thanks.

-- 
Philippe.



* Re: [Xenomai] [xenomai-forge] psos: crash while stressing event timers
  2013-06-05  8:36     ` Philippe Gerum
@ 2013-06-05  9:38       ` Philippe Gerum
  2013-06-05  9:51       ` Tom Philips
  1 sibling, 0 replies; 17+ messages in thread
From: Philippe Gerum @ 2013-06-05  9:38 UTC (permalink / raw)
  To: Ronny Meeus; +Cc: xenomai

On 06/05/2013 10:36 AM, Philippe Gerum wrote:
> On 06/05/2013 10:28 AM, Ronny Meeus wrote:
>
>> the testcode is attached to the initial mail.
>> The issue can be reproduced in all kinds of environments.
>>
>
> Ok, I overlooked the link. I'll have a look and let you know, thanks.
>

I can reproduce this bug as well, at least a bug with the same symptoms.

Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `./settime --cpu-affinity=0,1,2 --silent -s -t 1 -b 30 -o 1'.
Program terminated with signal 11, Segmentation fault.
#0  malloc_consolidate (av=av@entry=0x7f6f7c000020) at malloc.c:4143
4143		    unlink(av, p, bck, fwd);
(gdb) bt
#0  malloc_consolidate (av=av@entry=0x7f6f7c000020) at malloc.c:4143
#1  0x0000003d7147c988 in _int_free (av=0x7f6f7c000020, p=0x7f6f7c000990,
     have_lock=0) at malloc.c:4043
#2  0x00007f6fe6d14f3a in pvfree (ptr=0x7f6f7c0009a0)
     at /home/rpm/git/xenomai-forge/include/copperplate/heapobj.h:156
#3  0x00007f6fe6d15062 in delete_timer (tm=0x7f6f7c0009a0)
     at /home/rpm/git/xenomai-forge/lib/psos/tm.c:67
#4  0x00007f6fe6d150d5 in post_event_once (tmobj=0x7f6f7c0009b8)
     at /home/rpm/git/xenomai-forge/lib/psos/tm.c:80
#5  0x00007f6fe6adbd99 in timerobj_server (arg=0x0)
     at /home/rpm/git/xenomai-forge/lib/copperplate/timerobj.c:205
#6  0x0000003d71c07d15 in start_thread (arg=0x7f6fe67a3700)
     at pthread_create.c:308
#7  0x0000003d714f248d in clone ()
     at ../sysdeps/unix/sysv/linux/x86_64/clone.S:114
(gdb) q

However, I have to allow multiple CPUs for running the test; pinning it
with --cpu-affinity to a single processor seems to paper over the issue.
Ok, digging into it.

-- 
Philippe.



* Re: [Xenomai] [xenomai-forge] psos: crash while stressing event timers
  2013-06-05  8:36     ` Philippe Gerum
  2013-06-05  9:38       ` Philippe Gerum
@ 2013-06-05  9:51       ` Tom Philips
  2013-06-05  9:59         ` Philippe Gerum
  1 sibling, 1 reply; 17+ messages in thread
From: Tom Philips @ 2013-06-05  9:51 UTC (permalink / raw)
  To: xenomai

Philippe,

I tried the asserts you suggested and here's the output:

+++
xenomai: timerobj.c:308: timerobj_start: Assertion
`timer_getoverrun(tmobj->timer) >= 0' failed.
Aborted (core dumped)
+++


So this means the call to timer_getoverrun() fails.
Then I did some further investigation.
I changed the code in timerobj_start() to this:


+++
  if (__RT(timer_settime(tmobj->timer, TIMER_ABSTIME, it, NULL))) {
    int error = errno;
    errno = 0;
    if (timer_getoverrun(tmobj->timer) < 0) {
      __bt(-errno);           => @ line 311
    }
/*  assert(timer_getoverrun(tmobj->timer) >= 0); */
    assert(it->it_value.tv_sec >= 0 && it->it_value.tv_nsec < 1000000000);
    assert(it->it_interval.tv_sec >= 0 && it->it_interval.tv_nsec < 1000000000);
    return __bt(-error);      => @ line 316
  }
+++


This gives the following (strange) result:
(The 'tm_evafter(x,y) returned' traces are from my test app)


+++
------------------------------------------------------------------------------
[ ERROR BACKTRACE: thread 0.0 ]

   #0  EINVAL in timerobj_start(), timerobj.c:316
=> #1  EINVAL in timerobj_start(), timerobj.c:311
------------------------------------------------------------------------------
tm_evafter(0,0) returned 75 (errno 22): tmid=3066037456
------------------------------------------------------------------------------
[ ERROR BACKTRACE: thread 4.0 ]

=> #0  EINVAL in timerobj_start(), timerobj.c:316
------------------------------------------------------------------------------
tm_evafter(4,0) returned 75 (errno 0): tmid=67192088
------------------------------------------------------------------------------
[ ERROR BACKTRACE: thread 4.0 ]

   #0  EINVAL in timerobj_start(), timerobj.c:316
=> #1  EINVAL in timerobj_start(), timerobj.c:311
------------------------------------------------------------------------------
tm_evafter(4,0) returned 75 (errno 22): tmid=67191880
+++


So timer_getoverrun() fails and sets errno to EINVAL (22)
This can only mean that the given timerid is (or has become) invalid.

But timer_getoverrun() does not always fail when timer_settime() fails.
My rough estimate is that 1 out of 20 times timer_getoverrun()
does not fail when timer_settime() fails.
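
For reference, a stale timer id reproduces exactly this errno pattern with
plain POSIX timers; the standalone snippet below has nothing Xenomai-specific
in it (strictly speaking, using a deleted timer id is undefined, but on
Linux/glibc both calls report EINVAL here; link with -lrt on older glibc):

+++
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

int main(void)
{
	timer_t id;
	struct itimerspec it = { .it_value = { .tv_sec = 1 } };

	if (timer_create(CLOCK_MONOTONIC, NULL, &id))
		return 1;
	timer_delete(id);	/* 'id' is now stale */

	if (timer_settime(id, 0, &it, NULL) == -1)
		printf("timer_settime: %s (errno %d)\n", strerror(errno), errno);
	if (timer_getoverrun(id) == -1)
		printf("timer_getoverrun: %s (errno %d)\n", strerror(errno), errno);

	return 0;
}
+++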

Might it be that there is a race condition between the above code
and the code executed at timer elapse (i.e. where the timer is destroyed)?

---
Tom


* Re: [Xenomai] [xenomai-forge] psos: crash while stressing event timers
  2013-06-05  9:51       ` Tom Philips
@ 2013-06-05  9:59         ` Philippe Gerum
  2013-06-05 10:11           ` Ronny Meeus
  0 siblings, 1 reply; 17+ messages in thread
From: Philippe Gerum @ 2013-06-05  9:59 UTC (permalink / raw)
  To: Tom Philips; +Cc: xenomai

On 06/05/2013 11:51 AM, Tom Philips wrote:
> Philippe,
>
> I tried the asserts you suggested and here's the output:
>
> +++
> xenomai: timerobj.c:308: timerobj_start: Assertion
> `timer_getoverrun(tmobj->timer) >= 0' failed.
> Aborted (core dumped)
> +++
>
>
> So this means the call to timer_getoverrun() fails.
> Then I did some further investigation.
> I changed the code in timerobj_start() to this:
>
>
> +++
>    if (__RT(timer_settime(tmobj->timer, TIMER_ABSTIME, it, NULL))) {
>      int error = errno;
>      errno = 0;
>      if (timer_getoverrun(tmobj->timer) < 0) {
>        __bt(-errno);           => @ line 311
>      }
> /*  assert(timer_getoverrun(tmobj->timer) >= 0); */
>      assert(it->it_value.tv_sec >= 0 && it->it_value.tv_nsec < 1000000000);
>      assert(it->it_interval.tv_sec >= 0 && it->it_interval.tv_nsec < 1000000000);
>      return __bt(-error);      => @ line 316
>    }
> +++
>
>
> This gives the following (strange) result:
> (The 'tm_evafter(x,y) returned' traces are from my test app)
>
>
> +++
> ------------------------------------------------------------------------------
> [ ERROR BACKTRACE: thread 0.0 ]
>
>     #0  EINVAL in timerobj_start(), timerobj.c:316
> => #1  EINVAL in timerobj_start(), timerobj.c:311
> ------------------------------------------------------------------------------
> tm_evafter(0,0) returned 75 (errno 22): tmid=3066037456
> ------------------------------------------------------------------------------
> [ ERROR BACKTRACE: thread 4.0 ]
>
> => #0  EINVAL in timerobj_start(), timerobj.c:316
> ------------------------------------------------------------------------------
> tm_evafter(4,0) returned 75 (errno 0): tmid=67192088
> ------------------------------------------------------------------------------
> [ ERROR BACKTRACE: thread 4.0 ]
>
>     #0  EINVAL in timerobj_start(), timerobj.c:316
> => #1  EINVAL in timerobj_start(), timerobj.c:311
> ------------------------------------------------------------------------------
> tm_evafter(4,0) returned 75 (errno 22): tmid=67191880
> +++
>
>
> So timer_getoverrun() fails and sets errno to EINVAL (22)
> This can only mean that the given timerid is (or has become) invalid.
>
> But timer_getoverrun() does not always fail when timer_settime() fails.
> My rough estimate is that 1 out of 20 times timer_getoverrun()
> does not fail when timer_settime() fails.
>
> Might it be that there is a race condition between the above code
> and the code executed at timer elapse (i.e. where the timer is destroyed)?


Yes, I think there is some race in the Xenomai code, and running SMP 
makes it more likely. I just got the backtrace below:

#0  __timer_settime_new (timerid=0x0, flags=1, value=0x7f78b2102d60,
     ovalue=0x0) at ../nptl/sysdeps/unix/sysv/linux/timer_settime.c:58
#1  0x00007f78b229c2e0 in timerobj_start (tmobj=0x7f78440008f8,
     handler=0x7f78b24ca9c1 <post_event_once>, it=0x7f78b2102d60)
     at /home/rpm/git/xenomai-forge/lib/copperplate/timerobj.c:309
#2  0x00007f78b24cabf6 in start_evtimer (events=2, it=0x7f78b2102d60,
     tmid_r=0x7f78b2102db0) at /home/rpm/git/xenomai-forge/lib/psos/tm.c:133
#3  0x00007f78b24cac86 in tm_evafter (ticks=1, events=2,
     tmid_r=0x7f78b2102db0) at /home/rpm/git/xenomai-forge/lib/psos/tm.c:153
#4  0x0000000000400c64 in thread_timer (bidx=24, tidx=0) at settime.c:73
#5  0x000000000040101c in thread_receive (bidx=24, tidx=0) at settime.c:117
#6  0x00000000004013a6 in thread_entry (a1=24, a2=0, a3=0, a4=0)
     at settime.c:149
#7  0x00007f78b24c9835 in task_trampoline (arg=0x67efc80)
     at /home/rpm/git/xenomai-forge/lib/psos/task.c:200
#8  0x0000003d71c07d15 in start_thread (arg=0x7f78b2103700)
     at pthread_create.c:308
#9  0x0000003d714f248d in clone ()
     at ../sysdeps/unix/sysv/linux/x86_64/clone.S:114

The timerid is definitely broken in some cases.

-- 
Philippe.



* Re: [Xenomai] [xenomai-forge] psos: crash while stressing event timers
  2013-06-05  9:59         ` Philippe Gerum
@ 2013-06-05 10:11           ` Ronny Meeus
  2013-06-05 10:25             ` Philippe Gerum
  0 siblings, 1 reply; 17+ messages in thread
From: Ronny Meeus @ 2013-06-05 10:11 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai

On Wed, Jun 5, 2013 at 11:59 AM, Philippe Gerum <rpm@xenomai.org> wrote:

> On 06/05/2013 11:51 AM, Tom Philips wrote:
>> [...]
>
> Yes, I think there is some race in the Xenomai code, and running SMP makes
> it more likely. I just got the backtrace below:
>
> [...]
>
> The timerid is definitely broken in some cases.

Hello

If I understand the code of the timer server correctly, it processes all
timers whose expiry lies in the past.
So the Linux timer will wake up the timer-server thread, which then simply
processes all timers for which the timeout is in the past.

Now suppose a timer is enqueued and, right after that, the task that
enqueued it is scheduled out and only runs again after the timer has
expired. If another timer expires in the meantime, the timer of the
scheduled-out task will also have been processed and cleaned up by then.

In this way you get a race condition that will typically only show up
under heavy load and with short timeouts.
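
As a toy model of what I mean (only my reading of the code, not the actual
copperplate implementation; all names below are made up):

#include <time.h>

/* Toy model of the timer server: one-shot handlers may free their timer. */
struct toy_timer {
	struct timespec expiry;
	void (*handler)(struct toy_timer *t);
	struct toy_timer *next;
};

static struct toy_timer *queue;	/* pending timers, sorted by expiry */

static void serve_expired(void)
{
	struct timespec now;
	struct toy_timer *t;

	clock_gettime(CLOCK_MONOTONIC, &now);
	/* Fire every timer whose expiry already lies in the past. If the
	 * arming task was preempted right after queuing a short one-shot
	 * timer, that timer can be handled -- and freed -- here before the
	 * arming task even returns from the arming call. */
	while ((t = queue) &&
	       (t->expiry.tv_sec < now.tv_sec ||
	        (t->expiry.tv_sec == now.tv_sec &&
	         t->expiry.tv_nsec <= now.tv_nsec))) {
		queue = t->next;
		t->handler(t);
	}
}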

Ronny


* Re: [Xenomai] [xenomai-forge] psos: crash while stressing event timers
  2013-06-05 10:11           ` Ronny Meeus
@ 2013-06-05 10:25             ` Philippe Gerum
  2013-06-05 20:50               ` Philippe Gerum
  0 siblings, 1 reply; 17+ messages in thread
From: Philippe Gerum @ 2013-06-05 10:25 UTC (permalink / raw)
  To: Ronny Meeus; +Cc: xenomai

On 06/05/2013 12:11 PM, Ronny Meeus wrote:
>
> [...]
>
> Hello
>
> If I understand the code of the timer server correctly, it processes all
> timers whose expiry lies in the past.
> So the Linux timer will wake up the timer-server thread, which then simply
> processes all timers for which the timeout is in the past.
>
> Now suppose a timer is enqueued and, right after that, the task that
> enqueued it is scheduled out and only runs again after the timer has
> expired. If another timer expires in the meantime, the timer of the
> scheduled-out task will also have been processed and cleaned up by then.
>
> In this way you get a race condition that will typically only show up
> under heavy load and with short timeouts.

I'm not sure I understand the scenario you describe. Low-resolution
timers in userland are all processed by a single carrier thread,
independently of the actual timer owner. When a timer elapses, it is
removed from the outstanding queue by that carrier thread, with no risk
of duplicate handling and/or cleanup.

However, I suspect a race between that carrier thread and a thread
re-arming a just-elapsed timer, causing the former to clean up this
timer unexpectedly. The code supposed to prevent this looks fragile.
But maybe this is the same scenario you just described, in other words.

-- 
Philippe.



* Re: [Xenomai] [xenomai-forge] psos: crash while stressing event timers
  2013-06-05 10:25             ` Philippe Gerum
@ 2013-06-05 20:50               ` Philippe Gerum
  2013-06-07 10:39                 ` Ronny Meeus
  0 siblings, 1 reply; 17+ messages in thread
From: Philippe Gerum @ 2013-06-05 20:50 UTC (permalink / raw)
  To: Ronny Meeus; +Cc: xenomai

On 06/05/2013 12:25 PM, Philippe Gerum wrote:
> [...]
>

Please pull the last 6 commits from the "next" branch, referring to the
pSOS emulator and copperplate libraries. These fixes make the test case
you sent me stable, including on an 8-way machine running 50 batches.
Feedback welcome.

Thanks for the heads-up on this issue, and for the test case as well.

-- 
Philippe.



* Re: [Xenomai] [xenomai-forge] psos: crash while stressing event timers
  2013-06-05 20:50               ` Philippe Gerum
@ 2013-06-07 10:39                 ` Ronny Meeus
  2013-06-11 10:10                   ` Ronny Meeus
  0 siblings, 1 reply; 17+ messages in thread
From: Ronny Meeus @ 2013-06-07 10:39 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai

> Please pull the last 6 commits from the "next" branch, referring to the
> pSOS emulator and copperplate libraries. These fixes make the test case
> you sent me stable, including on an 8-way machine running 50 batches.
> Feedback welcome.
>
> Thanks for the heads-up on this issue, and for the test case as well.
>
Philippe,

I did some testing with your patches applied. The system looks very stable.
Thanks for solving the issue.

I used the following test:
ulimit -s 128 ; taskset 2 ./event_stress  -t 3 -b 80 -o 1

It has run now for more than 1 hour without issues.
I will also start a test during the weekend.

Best regards,
Ronny


* Re: [Xenomai] [xenomai-forge] psos: crash while stressing event timers
  2013-06-07 10:39                 ` Ronny Meeus
@ 2013-06-11 10:10                   ` Ronny Meeus
  2013-06-11 10:21                     ` Philippe Gerum
  0 siblings, 1 reply; 17+ messages in thread
From: Ronny Meeus @ 2013-06-11 10:10 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai

>
>
> I used the following test:
> ulimit -s 128 ; taskset 2 ./event_stress  -t 3 -b 80 -o 1
>
> It has run now for more than 1 hour without issues.
> I will also start a test during the weekend.
>

During the weekend I executed the following test on 7 cores:
ulimit -s 128 ; taskset 0xfe ./event_stress  -t 3 -b 80 -o 1 &
(this command was executed 7 times).

No issues were observed, so I think the code changes are OK.

---
Ronny


* Re: [Xenomai] [xenomai-forge] psos: crash while stressing event timers
  2013-06-11 10:10                   ` Ronny Meeus
@ 2013-06-11 10:21                     ` Philippe Gerum
  0 siblings, 0 replies; 17+ messages in thread
From: Philippe Gerum @ 2013-06-11 10:21 UTC (permalink / raw)
  To: Ronny Meeus; +Cc: xenomai

On 06/11/2013 12:10 PM, Ronny Meeus wrote:
>
>     I used following test:
>     ulimit -s 128 ; taskset 2 ./event_stress  -t 3 -b 80 -o 1
>
>     It has run now for more than 1 hour without issues.
>     I will also start a test during the weekend.
>
>
> During the weekend I executed following test on 7 cores:
> ulimit -s 128 ; taskset 0xfe ./event_stress  -t 3 -b 80 -o 1 &
> (this command was executed 7 times).
>
> No issues were observed so I think the code changes are OK.
>

Ok, thanks. On its way to master then.


-- 
Philippe.



* Re: [Xenomai] [xenomai-forge] psos: crash while stressing event timers
  2013-06-04 12:57   ` Ronny Meeus
@ 2013-06-04 14:04     ` Philippe Gerum
  0 siblings, 0 replies; 17+ messages in thread
From: Philippe Gerum @ 2013-06-04 14:04 UTC (permalink / raw)
  To: Ronny Meeus; +Cc: xenomai

On 06/04/2013 02:57 PM, Ronny Meeus wrote:
> On Tue, Jun 4, 2013 at 2:41 PM, Philippe Gerum <rpm@xenomai.org
> <mailto:rpm@xenomai.org>> wrote:
>
>     On 06/04/2013 12:56 PM, Ronny Meeus wrote:
>
>         Hello
>
>         we are currently running with recent version of xenomai-forge.
>         The issue we see is a crash while running the attached
>         application code
>         (pSOS interface).
>
>
>     How recent? What is your current head commit?
>
>
> 0fb3e28aa5efca4d9a9930db8f437f61eafdf9bc
>
>
>
>         Other useful information is that we typically see the issue when
>         we start
>         to reach a high cpuload.
>         On boards with a stronger processor, the number of batches can
>         be much
>         higher compared to boards with a low end processor.
>
>         The problem is observed in various processor environments
>         (mips/arm/ppc)
>         and with different versions of the C library.
>
>
>     Do you have these patches in?
>
>     http://git.xenomai.org/?p=xenomai-forge.git;a=commit;h=ba4fbc35cb6cc0a49aa062e1df1f8c24e09533e4
>
>     http://git.xenomai.org/?p=xenomai-forge.git;a=commit;h=a397d4f5f87f5824730776f4cb36e08d3efa2191
>
>
> Both changesets are included.
>

You may want to get a post-mortem backtrace of the application fault, 
reading the core dump image against your executable with gdb.

-- 
Philippe.



* Re: [Xenomai] [xenomai-forge] psos: crash while stressing event timers
  2013-06-04 12:41 ` Philippe Gerum
@ 2013-06-04 12:57   ` Ronny Meeus
  2013-06-04 14:04     ` Philippe Gerum
  0 siblings, 1 reply; 17+ messages in thread
From: Ronny Meeus @ 2013-06-04 12:57 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai

On Tue, Jun 4, 2013 at 2:41 PM, Philippe Gerum <rpm@xenomai.org> wrote:

> On 06/04/2013 12:56 PM, Ronny Meeus wrote:
>
>> Hello
>>
>> we are currently running with recent version of xenomai-forge.
>> The issue we see is a crash while running the attached application code
>> (pSOS interface).
>>
>>
> How recent? What is your current head commit?


0fb3e28aa5efca4d9a9930db8f437f61eafdf9bc


>
>
>> Other useful information is that we typically see the issue when we start
>> to reach a high cpuload.
>> On boards with a stronger processor, the number of batches can be much
>> higher compared to boards with a low end processor.
>>
>> The problem is observed in various processor environments (mips/arm/ppc)
>> and with different versions of the C library.
>>
>>
> Do you have these patches in?
>
> http://git.xenomai.org/?p=xenomai-forge.git;a=commit;h=ba4fbc35cb6cc0a49aa062e1df1f8c24e09533e4
>
> http://git.xenomai.org/?p=xenomai-forge.git;a=commit;h=a397d4f5f87f5824730776f4cb36e08d3efa2191


Both changesets are included.


>
>
>> Best regards,
>> Ronny
>> [...]
>
> --
> Philippe.
>


* Re: [Xenomai] [xenomai-forge] psos: crash while stressing event timers
  2013-06-04 10:56 Ronny Meeus
@ 2013-06-04 12:41 ` Philippe Gerum
  2013-06-04 12:57   ` Ronny Meeus
  0 siblings, 1 reply; 17+ messages in thread
From: Philippe Gerum @ 2013-06-04 12:41 UTC (permalink / raw)
  To: Ronny Meeus; +Cc: xenomai

On 06/04/2013 12:56 PM, Ronny Meeus wrote:
> Hello
>
> We are currently running with a recent version of xenomai-forge.
> The issue we see is a crash while running the attached application code
> (pSOS interface).
>

How recent? What is your current head commit?

> Other useful information is that we typically see the issue when we start
> to reach a high cpuload.
> On boards with a stronger processor, the number of batches can be much
> higher compared to boards with a low end processor.
>
> The problem is observed in various processor environments (mips/arm/ppc)
> and with different versions of the C library.
>

Do you have these patches in?

http://git.xenomai.org/?p=xenomai-forge.git;a=commit;h=ba4fbc35cb6cc0a49aa062e1df1f8c24e09533e4

http://git.xenomai.org/?p=xenomai-forge.git;a=commit;h=a397d4f5f87f5824730776f4cb36e08d3efa2191

> Best regards,
> Ronny
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: event_stress.c
> Type: text/x-csrc
> Size: 7481 bytes
> Desc: not available
> URL: <http://www.xenomai.org/pipermail/xenomai/attachments/20130604/1800cdf1/attachment.c>


-- 
Philippe.



* [Xenomai] [xenomai-forge] psos: crash while stressing event timers
@ 2013-06-04 10:56 Ronny Meeus
  2013-06-04 12:41 ` Philippe Gerum
  0 siblings, 1 reply; 17+ messages in thread
From: Ronny Meeus @ 2013-06-04 10:56 UTC (permalink / raw)
  To: xenomai

Hello

We are currently running with a recent version of xenomai-forge.
The issue we see is a crash while running the attached application code
(pSOS interface).

Basically, the test creates a number of chains of tasks (called batches).
After setting up a batch, the first task starts a timer; when the timer
expires, an event is sent to the next task in the chain.
This process continues forever.
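
In outline, each task in a batch does something like the sketch below; the
real code is in the attached event_stress.c, and the event bits, flags and
helper name here are invented for illustration (exact calls may differ):

#include <psos/psos.h>	/* pSOS API of the xenomai-forge emulator (header path assumed) */

#define EV_TIMER 0x1	/* posted to a task by its own event timer (invented value) */
#define EV_NEXT  0x2	/* posted by the previous task in the chain (invented value) */

static void chain_step(unsigned long next_tid)
{
	unsigned long tmid, events;

	/* Wait until the previous task in the chain kicks us. */
	ev_receive(EV_NEXT, EV_WAIT | EV_ANY, 0, &events);

	/* Arm a one-shot event timer on ourselves; this is the call that
	   sporadically fails with errno 22 under load. */
	tm_evafter(1, EV_TIMER, &tmid);
	ev_receive(EV_TIMER, EV_WAIT | EV_ANY, 0, &events);

	/* Hand control to the next task in the chain. */
	ev_send(next_tid, EV_NEXT);
}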

If only one chain is created, we do not see issues.
The number of threads in a chain is less relevant; typically there is
no big impact when the number of tasks increases.

When we increase the number of batches, we start to see crashes.
For example, in the test below we create 20 batches with 1 thread in
each batch.
The -o parameter specifies the timeout used by the timer of each task
before control is handed to the next task in the chain.

ulimit -s 128 ;taskset 2 ./tests -t 1 -b 20 -o 1
Xenomai test: threads 1, batches 20, timeout 1 ms, stats 0
thread_entry(0,0)
thread_entry(1,0)
thread_entry(2,0)
thread_entry(3,0)
thread_entry(4,0)
thread_entry(5,0)
thread_entry(6,0)
thread_entry(7,0)
thread_entry(8,0)
thread_entry(9,0)
thread_entry(10,0)
thread_entry(11,0)
thread_entry(12,0)
thread_entry(13,0)
thread_entry(14,0)
thread_entry(15,0)
thread_entry(16,0)
thread_entry(17,0)
thread_entry(18,0)
thread_entry(19,0)
tm_evafter(4,0) returned 75 (errno 22): tmid=4920880
tm_evafter(9,0) returned 75 (errno 22): tmid=4919912
Segmentation fault

After some investigation it looks like it has something to do with the
timer handling in Xenomai.
The tm_evafter() error indicates that setting up the underlying POSIX
timer has failed.
Shortly after this, a segmentation fault is typically seen.

If we run the same test application with an old version of xenomai-forge
(from 1 year ago), the issue is not observed.

Other useful information is that we typically see the issue when we start
to reach a high CPU load.
On boards with a stronger processor, the number of batches can be much
higher compared to boards with a low-end processor.

The problem is observed in various processor environments (mips/arm/ppc)
and with different versions of the C library.

Best regards,
Ronny
-------------- next part --------------
A non-text attachment was scrubbed...
Name: event_stress.c
Type: text/x-csrc
Size: 7481 bytes
Desc: not available
URL: <http://www.xenomai.org/pipermail/xenomai/attachments/20130604/1800cdf1/attachment.c>


Thread overview: 17+ messages
2013-06-04 20:26 [Xenomai] [xenomai-forge] psos: crash while stressing event timers Tom Philips
2013-06-05  8:02 ` Philippe Gerum
2013-06-05  8:28   ` Ronny Meeus
2013-06-05  8:36     ` Philippe Gerum
2013-06-05  9:38       ` Philippe Gerum
2013-06-05  9:51       ` Tom Philips
2013-06-05  9:59         ` Philippe Gerum
2013-06-05 10:11           ` Ronny Meeus
2013-06-05 10:25             ` Philippe Gerum
2013-06-05 20:50               ` Philippe Gerum
2013-06-07 10:39                 ` Ronny Meeus
2013-06-11 10:10                   ` Ronny Meeus
2013-06-11 10:21                     ` Philippe Gerum
  -- strict thread matches above, loose matches on Subject: below --
2013-06-04 10:56 Ronny Meeus
2013-06-04 12:41 ` Philippe Gerum
2013-06-04 12:57   ` Ronny Meeus
2013-06-04 14:04     ` Philippe Gerum
