All of lore.kernel.org
 help / color / mirror / Atom feed
* [Xenomai] Xenomai-forge: Deadlock in finalize_thread()
@ 2014-07-03  8:11 Kim De Mey
  2014-07-03 15:14 ` Philippe Gerum
  0 siblings, 1 reply; 25+ messages in thread
From: Kim De Mey @ 2014-07-03  8:11 UTC (permalink / raw)
  To: Xenomai@xenomai.org

Hello,

I've encountered a deadlock in the finalize_thread() call in threadobj.c

I can easy reproduce the problem with a simple test case where I have
a psos task which in a loop creates, starts and deletes another psos
task.
The created tasks have a priority lower or equal to the priority of
the task that creates it.

When running the test case, some of the tasks don't get deleted
properly (the majority does), they are still visible when doing "ps"
command.
When attaching gdb I notice that these tasks are stuck on
__pthread_mutex_lock() called from threadobj_lock() >
finalize_thread().

See gdb debug information below:
(gdb) info thread
  Id   Target Id         Frame
  9    Thread 18694      0x00edf280 in __pthread_mutex_lock
(mutex=<optimized out>) at pthread_mutex_lock.c:293
  8    Thread 18572      0x00edf280 in __pthread_mutex_lock
(mutex=<optimized out>) at pthread_mutex_lock.c:293
  7    Thread 18355      0x00edf280 in __pthread_mutex_lock
(mutex=<optimized out>) at pthread_mutex_lock.c:293
  6    Thread 18201      0x00edf280 in __pthread_mutex_lock
(mutex=<optimized out>) at pthread_mutex_lock.c:293
  5    Thread 18110      0x00edf280 in __pthread_mutex_lock
(mutex=<optimized out>) at pthread_mutex_lock.c:293
  4    Thread 18037      0x00edf280 in __pthread_mutex_lock
(mutex=<optimized out>) at pthread_mutex_lock.c:293
* 3    Thread 17943      0x00edf280 in __pthread_mutex_lock
(mutex=<optimized out>) at pthread_mutex_lock.c:293
  2    Thread 17734      clock_nanosleep (clock_id=<optimized out>,
flags=<optimized out>, req=<optimized out>, rem=<optimized out>)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:51
  1    Thread 17733      clock_nanosleep (clock_id=<optimized out>,
flags=<optimized out>, req=<optimized out>, rem=<optimized out>)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:51
(gdb) bt
#0  0x00edf280 in __pthread_mutex_lock (mutex=<optimized out>) at
pthread_mutex_lock.c:293
#1  0x00ec0304 in threadobj_lock () from
/repo/kdemey/buildroot-cgroups/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
#2  0x00ec0404 in finalize_thread () from
/repo/kdemey/buildroot-cgroups/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
#3  0x00edc10c in __nptl_deallocate_tsd () at pthread_create.c:154
#4  0x00edd838 in start_thread (arg=<optimized out>) at pthread_create.c:304
#5  0x00ff7f4c in __thread_start () from
output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
Backtrace stopped: frame did not save the PC

What I think that goes wrong is that the lock which is taken in
threadobj_notify_entry() is not released before threadobj_start()
continues at wait_on_barrier(thobj, __THREAD_S_ACTIVE). As there is a
t_delete() done right after t_start() returns in my test case, this
could mean that the thread gets in finalize_thread() after the
pthread_cancel() and blocks there on the threadobj_lock() as the
threadobj_unlock() from threadobj_notify_entry() was possibly not yet
called.
Does this scenario sound plausible?

As a quick test I removed the lock & unlock in the
threadobj_notify_entry() and the deadlock on __pthread_mutex_lock() no
longer occurs. So this could mean that it is indeed this lock causing
the deadlock.
However when I do this change another deadlock occurs. This time on
destroy_thread() > uninit_thread() > pthread_cond_destroy() >
__lll_lock_wait()
I think pthread_cond_destroy() blocks if there is still a thread
blocked on the condition variable. I am unsure about this though. And
also when looking I don't see what could still be blocking on it. So I
am a bit stuck here.

Here is some gdb debugging info in the case that I removed the lock in
threadobj_notify_entry():
(gdb) info thread
  Id   Target Id         Frame
  5    Thread 16596      __lll_lock_wait (futex=<optimized out>,
private=<optimized out>) at
../nptl/sysdeps/unix/sysv/linux/lowlevellock.c:46
  4    Thread 16494      __lll_lock_wait (futex=<optimized out>,
private=<optimized out>) at
../nptl/sysdeps/unix/sysv/linux/lowlevellock.c:46
* 3    Thread 15944      __lll_lock_wait (futex=<optimized out>,
private=<optimized out>) at
../nptl/sysdeps/unix/sysv/linux/lowlevellock.c:46
  2    Thread 15798      clock_nanosleep (clock_id=<optimized out>,
flags=<optimized out>, req=<optimized out>, rem=<optimized out>)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:51
  1    Thread 15797      clock_nanosleep (clock_id=<optimized out>,
flags=<optimized out>, req=<optimized out>, rem=<optimized out>)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:51
(gdb) bt
#0  __lll_lock_wait (futex=<optimized out>, private=<optimized out>)
at ../nptl/sysdeps/unix/sysv/linux/lowlevellock.c:46
#1  0x00d11298 in __pthread_cond_destroy (cond=0xcc21d0) at
pthread_cond_destroy.c:33
#2  0x00cef12c in uninit_thread () from
/repo/kdemey/buildroot-cgroups/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
#3  0x00cef494 in finalize_thread () from
/repo/kdemey/buildroot-cgroups/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
#4  0x00d0b10c in __nptl_deallocate_tsd () at pthread_create.c:154
#5  0x00d0c838 in start_thread (arg=<optimized out>) at pthread_create.c:304
#6  0x00e26f4c in __thread_start () from
output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
Backtrace stopped: frame did not save the PC

Any insight on these issues?


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Xenomai] Xenomai-forge: Deadlock in finalize_thread()
  2014-07-03  8:11 [Xenomai] Xenomai-forge: Deadlock in finalize_thread() Kim De Mey
@ 2014-07-03 15:14 ` Philippe Gerum
  2014-07-07  7:58   ` Kim De Mey
  0 siblings, 1 reply; 25+ messages in thread
From: Philippe Gerum @ 2014-07-03 15:14 UTC (permalink / raw)
  To: Kim De Mey, Xenomai@xenomai.org

On 07/03/2014 10:11 AM, Kim De Mey wrote:

> What I think that goes wrong is that the lock which is taken in
> threadobj_notify_entry() is not released before threadobj_start()
> continues at wait_on_barrier(thobj, __THREAD_S_ACTIVE).

Until threadobj_notify_entry() releases that lock, there is no way 
wait_on_barrier() can exit, since it competes for the same lock. 
pthread_cond_wait() will grab this lock before returning.

As there is a
> t_delete() done right after t_start() returns in my test case, this
> could mean that the thread gets in finalize_thread() after the
> pthread_cancel() and blocks there on the threadobj_lock() as the
> threadobj_unlock() from threadobj_notify_entry() was possibly not yet
> called.

called? do you mean exited instead?

Until wait_on_barrier() unwinds for the child task, theadobj_start() 
cannot complete for its parent, so the latter cannot delete the child it 
has just spawned, until the latter has dropped the lock from 
threadobj_notify_entry(). So I'm unsure the explanation stands - unless 
I missed your point entirely.

This said, there must be something fishy as the backtrace clearly shows 
a child thread hanging in the finalizer, waiting for access to its own tcb.

If you have a simple standalone test case illustrating this bug, please 
send it along, this would save me some precious time trying to reproduce 
the issue accurately. Otherwise I'll write one.

PS: mentioning the exact Xenomai 3 version you are running would always 
help. Passing --version on the command line of any application built 
over the copperplate-based APIs returns this information.

-- 
Philippe.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Xenomai] Xenomai-forge: Deadlock in finalize_thread()
  2014-07-03 15:14 ` Philippe Gerum
@ 2014-07-07  7:58   ` Kim De Mey
  2014-07-08 16:48     ` Philippe Gerum
  0 siblings, 1 reply; 25+ messages in thread
From: Kim De Mey @ 2014-07-07  7:58 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: Xenomai@xenomai.org

2014-07-03 17:14 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
> On 07/03/2014 10:11 AM, Kim De Mey wrote:
>
>> What I think that goes wrong is that the lock which is taken in
>> threadobj_notify_entry() is not released before threadobj_start()
>> continues at wait_on_barrier(thobj, __THREAD_S_ACTIVE).
>
>
> Until threadobj_notify_entry() releases that lock, there is no way
> wait_on_barrier() can exit, since it competes for the same lock.
> pthread_cond_wait() will grab this lock before returning.
>

Yes, you are right, the wait_on_barrier() can not exit until the
unlock. I misunderstood this part.

>
> As there is a
>>
>> t_delete() done right after t_start() returns in my test case, this
>> could mean that the thread gets in finalize_thread() after the
>> pthread_cancel() and blocks there on the threadobj_lock() as the
>> threadobj_unlock() from threadobj_notify_entry() was possibly not yet
>> called.
>
>
> called? do you mean exited instead?
>
> Until wait_on_barrier() unwinds for the child task, theadobj_start() cannot
> complete for its parent, so the latter cannot delete the child it has just
> spawned, until the latter has dropped the lock from
> threadobj_notify_entry(). So I'm unsure the explanation stands - unless I
> missed your point entirely.

No, you got my point, the explanation does not stand indeed.

>
> This said, there must be something fishy as the backtrace clearly shows a
> child thread hanging in the finalizer, waiting for access to its own tcb.
>
> If you have a simple standalone test case illustrating this bug, please send
> it along, this would save me some precious time trying to reproduce the
> issue accurately. Otherwise I'll write one.

I have a very simple test case:

static void worker(u_long a,u_long b,u_long c,u_long d)
{
  while(1)
    tm_wkafter(100);
}

static void create_delete(u_long a,u_long b,u_long c,u_long d)
{
  u_long tid, args[4] = {0,0,0,0};
  int j;

  for(j =0; j < 1000; j++)
  {
    if(t_create("TEST",50,0,0,0,&tid))
      printf("t_create failed!\n");
    if(t_start(tid,0, worker, args))
      printf("t_start failed!\n");
    if(t_delete(tid))
      printf("t_delete failed!\n");
  }

  while (1) tm_wkafter(1000);
}

int main(int argc, char * const argv[])
{
  u_long tid,args[4] = {0,0,0,0};
  copperplate_init(&argc,&argv);

  t_create("CRDE",50,0,0,0,&tid);
  t_start(tid,0,create_delete, args);

  while (1) tm_wkafter(1000);
  return 0;
}

In my case, running 1000 loops has about 4 to 10 "TEST" threads that hang.

>
> PS: mentioning the exact Xenomai 3 version you are running would always
> help. Passing --version on the command line of any application built over
> the copperplate-based APIs returns this information.
>

I was testing with a pull of the repository of 2014-06-06 (rev.
9137bdfe0b0881d04750c8e92e065657e7a9538e) . I'll pull the latest
revision and see if the issue still occurs.

> --
> Philippe.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Xenomai] Xenomai-forge: Deadlock in finalize_thread()
  2014-07-07  7:58   ` Kim De Mey
@ 2014-07-08 16:48     ` Philippe Gerum
  2014-07-09  8:09       ` Kim De Mey
  0 siblings, 1 reply; 25+ messages in thread
From: Philippe Gerum @ 2014-07-08 16:48 UTC (permalink / raw)
  To: Kim De Mey; +Cc: Xenomai@xenomai.org

On 07/07/2014 09:58 AM, Kim De Mey wrote:
> 2014-07-03 17:14 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>> On 07/03/2014 10:11 AM, Kim De Mey wrote:
>>
>>> What I think that goes wrong is that the lock which is taken in
>>> threadobj_notify_entry() is not released before threadobj_start()
>>> continues at wait_on_barrier(thobj, __THREAD_S_ACTIVE).
>>
>>
>> Until threadobj_notify_entry() releases that lock, there is no way
>> wait_on_barrier() can exit, since it competes for the same lock.
>> pthread_cond_wait() will grab this lock before returning.
>>
>
> Yes, you are right, the wait_on_barrier() can not exit until the
> unlock. I misunderstood this part.
>
>>
>> As there is a
>>>
>>> t_delete() done right after t_start() returns in my test case, this
>>> could mean that the thread gets in finalize_thread() after the
>>> pthread_cancel() and blocks there on the threadobj_lock() as the
>>> threadobj_unlock() from threadobj_notify_entry() was possibly not yet
>>> called.
>>
>>
>> called? do you mean exited instead?
>>
>> Until wait_on_barrier() unwinds for the child task, theadobj_start() cannot
>> complete for its parent, so the latter cannot delete the child it has just
>> spawned, until the latter has dropped the lock from
>> threadobj_notify_entry(). So I'm unsure the explanation stands - unless I
>> missed your point entirely.
>
> No, you got my point, the explanation does not stand indeed.
>
>>
>> This said, there must be something fishy as the backtrace clearly shows a
>> child thread hanging in the finalizer, waiting for access to its own tcb.
>>
>> If you have a simple standalone test case illustrating this bug, please send
>> it along, this would save me some precious time trying to reproduce the
>> issue accurately. Otherwise I'll write one.
>
> I have a very simple test case:
>
> static void worker(u_long a,u_long b,u_long c,u_long d)
> {
>    while(1)
>      tm_wkafter(100);
> }
>
> static void create_delete(u_long a,u_long b,u_long c,u_long d)
> {
>    u_long tid, args[4] = {0,0,0,0};
>    int j;
>
>    for(j =0; j < 1000; j++)
>    {
>      if(t_create("TEST",50,0,0,0,&tid))
>        printf("t_create failed!\n");
>      if(t_start(tid,0, worker, args))
>        printf("t_start failed!\n");
>      if(t_delete(tid))
>        printf("t_delete failed!\n");
>    }
>
>    while (1) tm_wkafter(1000);
> }
>
> int main(int argc, char * const argv[])
> {
>    u_long tid,args[4] = {0,0,0,0};
>    copperplate_init(&argc,&argv);
>
>    t_create("CRDE",50,0,0,0,&tid);
>    t_start(tid,0,create_delete, args);
>
>    while (1) tm_wkafter(1000);
>    return 0;
> }
>
> In my case, running 1000 loops has about 4 to 10 "TEST" threads that hang.
>

I can't reproduce the issue with this test case, it's likely 
timing-dependent anyway. Please post the output of <test-program> 
--dump-config, so that I can match your build options.

TIA,

-- 
Philippe.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Xenomai] Xenomai-forge: Deadlock in finalize_thread()
  2014-07-08 16:48     ` Philippe Gerum
@ 2014-07-09  8:09       ` Kim De Mey
  2014-07-09  9:25         ` Philippe Gerum
  0 siblings, 1 reply; 25+ messages in thread
From: Kim De Mey @ 2014-07-09  8:09 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: Xenomai@xenomai.org

2014-07-08 18:48 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
> On 07/07/2014 09:58 AM, Kim De Mey wrote:
>>
>> 2014-07-03 17:14 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>
>>> On 07/03/2014 10:11 AM, Kim De Mey wrote:
>>>
>>>> What I think that goes wrong is that the lock which is taken in
>>>> threadobj_notify_entry() is not released before threadobj_start()
>>>> continues at wait_on_barrier(thobj, __THREAD_S_ACTIVE).
>>>
>>>
>>>
>>> Until threadobj_notify_entry() releases that lock, there is no way
>>> wait_on_barrier() can exit, since it competes for the same lock.
>>> pthread_cond_wait() will grab this lock before returning.
>>>
>>
>> Yes, you are right, the wait_on_barrier() can not exit until the
>> unlock. I misunderstood this part.
>>
>>>
>>> As there is a
>>>>
>>>>
>>>> t_delete() done right after t_start() returns in my test case, this
>>>> could mean that the thread gets in finalize_thread() after the
>>>> pthread_cancel() and blocks there on the threadobj_lock() as the
>>>> threadobj_unlock() from threadobj_notify_entry() was possibly not yet
>>>> called.
>>>
>>>
>>>
>>> called? do you mean exited instead?
>>>
>>> Until wait_on_barrier() unwinds for the child task, theadobj_start()
>>> cannot
>>> complete for its parent, so the latter cannot delete the child it has
>>> just
>>> spawned, until the latter has dropped the lock from
>>> threadobj_notify_entry(). So I'm unsure the explanation stands - unless I
>>> missed your point entirely.
>>
>>
>> No, you got my point, the explanation does not stand indeed.
>>
>>>
>>> This said, there must be something fishy as the backtrace clearly shows a
>>> child thread hanging in the finalizer, waiting for access to its own tcb.
>>>
>>> If you have a simple standalone test case illustrating this bug, please
>>> send
>>> it along, this would save me some precious time trying to reproduce the
>>> issue accurately. Otherwise I'll write one.
>>
>>
>> I have a very simple test case:
>>
>> static void worker(u_long a,u_long b,u_long c,u_long d)
>> {
>>    while(1)
>>      tm_wkafter(100);
>> }
>>
>> static void create_delete(u_long a,u_long b,u_long c,u_long d)
>> {
>>    u_long tid, args[4] = {0,0,0,0};
>>    int j;
>>
>>    for(j =0; j < 1000; j++)
>>    {
>>      if(t_create("TEST",50,0,0,0,&tid))
>>        printf("t_create failed!\n");
>>      if(t_start(tid,0, worker, args))
>>        printf("t_start failed!\n");
>>      if(t_delete(tid))
>>        printf("t_delete failed!\n");
>>    }
>>
>>    while (1) tm_wkafter(1000);
>> }
>>
>> int main(int argc, char * const argv[])
>> {
>>    u_long tid,args[4] = {0,0,0,0};
>>    copperplate_init(&argc,&argv);
>>
>>    t_create("CRDE",50,0,0,0,&tid);
>>    t_start(tid,0,create_delete, args);
>>
>>    while (1) tm_wkafter(1000);
>>    return 0;
>> }
>>
>> In my case, running 1000 loops has about 4 to 10 "TEST" threads that hang.
>>
>
> I can't reproduce the issue with this test case, it's likely
> timing-dependent anyway. Please post the output of <test-program>
> --dump-config, so that I can match your build options.
>
> TIA,
>

--dump-config output:

CONFIG_MMU=1
CONFIG_XENO_ASYNC_CANCEL=1
CONFIG_XENO_BUILD_STRING="x86_64-unknown-linux-gnu"
CONFIG_XENO_COMPILER="gcc version 4.3.3 (Cavium Inc. Version: 2_3_0 build 116) "
CONFIG_XENO_DEFAULT_PERIOD=100000
CONFIG_XENO_FORTIFY=1
CONFIG_XENO_HOST_STRING="mips64-buildroot-linux-gnu"
CONFIG_XENO_MERCURY=1
CONFIG_XENO_PREFIX="/usr"
CONFIG_XENO_REVISION_LEVEL=5
CONFIG_XENO_TLSF=1
CONFIG_XENO_TLS_MODEL="initial-exec"
CONFIG_XENO_UAPI_LEVEL=7
CONFIG_XENO_VERSION_MAJOR=2
CONFIG_XENO_VERSION_MINOR=99
CONFIG_XENO_X86_VSYSCALL=1
---
CONFIG_SMP is OFF
CONFIG_XENO_ARM_SA1100 is OFF
CONFIG_XENO_ARM_TSC_TYPE is OFF
CONFIG_XENO_COBALT is OFF
CONFIG_XENO_CPU_XSC3 is OFF
CONFIG_XENO_DEBUG is OFF
CONFIG_XENO_DEBUG_FULL is OFF
CONFIG_XENO_LIBS_DLOPEN is OFF
CONFIG_XENO_LORES_CLOCK_DISABLED is OFF
CONFIG_XENO_PSHARED is OFF
CONFIG_XENO_RAW_CLOCK_ENABLED is OFF
CONFIG_XENO_REGISTRY is OFF
CONFIG_XENO_VALGRIND_API is OFF

I have updated the Xenomai version to the latest version (next branch)
and still have the deadlock.

I also did a test with a more recent toolkit (and linux kernel,
3.10.16) and the problem is still there however it doesn't occur at
the same rate. I only see it after 10000 to 100000 loops.

Newer version --dump-config output:
...
CONFIG_XENO_COMPILER="gcc version 4.7.0 (Cavium Inc. Version:
SDK_3_1_0 build 27) "
...


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Xenomai] Xenomai-forge: Deadlock in finalize_thread()
  2014-07-09  8:09       ` Kim De Mey
@ 2014-07-09  9:25         ` Philippe Gerum
  2014-07-09 10:23           ` Kim De Mey
  0 siblings, 1 reply; 25+ messages in thread
From: Philippe Gerum @ 2014-07-09  9:25 UTC (permalink / raw)
  To: Kim De Mey; +Cc: Xenomai@xenomai.org

On 07/09/2014 10:09 AM, Kim De Mey wrote:

> Newer version --dump-config output:
> ...
> CONFIG_XENO_COMPILER="gcc version 4.7.0 (Cavium Inc. Version:
> SDK_3_1_0 build 27) "
> ...
>

Thanks. I can't reproduce this issue yet, so posting the current gdb 
backtraces for all the threads still shown by "info threads" after a 
deadlock would help.

TIA,

-- 
Philippe.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Xenomai] Xenomai-forge: Deadlock in finalize_thread()
  2014-07-09  9:25         ` Philippe Gerum
@ 2014-07-09 10:23           ` Kim De Mey
  2014-07-09 14:16             ` Philippe Gerum
  0 siblings, 1 reply; 25+ messages in thread
From: Kim De Mey @ 2014-07-09 10:23 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: Xenomai@xenomai.org

2014-07-09 11:25 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
> On 07/09/2014 10:09 AM, Kim De Mey wrote:
>
>> Newer version --dump-config output:
>> ...
>> CONFIG_XENO_COMPILER="gcc version 4.7.0 (Cavium Inc. Version:
>> SDK_3_1_0 build 27) "
>> ...
>>
>
> Thanks. I can't reproduce this issue yet, so posting the current gdb
> backtraces for all the threads still shown by "info threads" after a
> deadlock would help.
>
> TIA,
>

Below backtraces of all the threads. It is a case with two "worker"
tasks in deadlock. The "main" and "create_delete" tasks continued to
their "tm_wkafter" loop.


(gdb) thread a a bt

Thread 4 (Thread 9406):
#0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b03300) at
pthread_mutex_lock.c:321
#1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b03300) at
pthread_mutex_lock.c:55
#2  0x77cf91e0 in threadobj_lock () from
/repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
#3  0x77cf92e4 in finalize_thread () from
/repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
#4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
#5  0x77cc5940 in start_thread (arg=0x75dff490) at pthread_create.c:315
#6  0x77bf1bbc in __thread_start () from
output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
Backtrace stopped: frame did not save the PC

Thread 3 (Thread 31145):
#0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b030b8) at
pthread_mutex_lock.c:321
#1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b030b8) at
pthread_mutex_lock.c:55
#2  0x77cf91e0 in threadobj_lock () from
/repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
#3  0x77cf92e4 in finalize_thread () from
/repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
#4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
#5  0x77cc5940 in start_thread (arg=0x766ff490) at pthread_create.c:315
#6  0x77bf1bbc in __thread_start () from
output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
Backtrace stopped: frame did not save the PC

Thread 2 (Thread 6970):
#0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
req=<optimized out>, rem=<optimized out>)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
#1  0x77cfa0bc in threadobj_sleep () from
/repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
#2  0x77d2af08 in tm_wkafter () from
/repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
#3  0x10000b64 in create_delete (a=0, b=0, c=0, d=0) at delete_child_hangs.c:31
#4  0x77d296fc in task_trampoline () from
/repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
#5  0x77cf7d84 in thread_trampoline () from
/repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
#6  0x77cc592c in start_thread (arg=0x77a00490) at pthread_create.c:310
#7  0x77bf1bbc in __thread_start () from
output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
Backtrace stopped: frame did not save the PC

Thread 1 (Thread 6969):
#0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
req=<optimized out>, rem=<optimized out>)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
#1  0x77cfa0bc in threadobj_sleep () from
/repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
#2  0x77d2af08 in tm_wkafter () from
/repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
#3  0x10000c00 in main (argc=1, argv=0x10011030) at delete_child_hangs.c:43


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Xenomai] Xenomai-forge: Deadlock in finalize_thread()
  2014-07-09 10:23           ` Kim De Mey
@ 2014-07-09 14:16             ` Philippe Gerum
  2014-07-09 15:05               ` Kim De Mey
  0 siblings, 1 reply; 25+ messages in thread
From: Philippe Gerum @ 2014-07-09 14:16 UTC (permalink / raw)
  To: Kim De Mey; +Cc: Xenomai@xenomai.org

On 07/09/2014 12:23 PM, Kim De Mey wrote:
> 2014-07-09 11:25 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>> On 07/09/2014 10:09 AM, Kim De Mey wrote:
>>
>>> Newer version --dump-config output:
>>> ...
>>> CONFIG_XENO_COMPILER="gcc version 4.7.0 (Cavium Inc. Version:
>>> SDK_3_1_0 build 27) "
>>> ...
>>>
>>
>> Thanks. I can't reproduce this issue yet, so posting the current gdb
>> backtraces for all the threads still shown by "info threads" after a
>> deadlock would help.
>>
>> TIA,
>>
>
> Below backtraces of all the threads. It is a case with two "worker"
> tasks in deadlock. The "main" and "create_delete" tasks continued to
> their "tm_wkafter" loop.
>
>
> (gdb) thread a a bt
>
> Thread 4 (Thread 9406):
> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b03300) at
> pthread_mutex_lock.c:321
> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b03300) at
> pthread_mutex_lock.c:55
> #2  0x77cf91e0 in threadobj_lock () from
> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
> #3  0x77cf92e4 in finalize_thread () from
> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
> #5  0x77cc5940 in start_thread (arg=0x75dff490) at pthread_create.c:315
> #6  0x77bf1bbc in __thread_start () from
> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
> Backtrace stopped: frame did not save the PC
>
> Thread 3 (Thread 31145):
> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b030b8) at
> pthread_mutex_lock.c:321
> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b030b8) at
> pthread_mutex_lock.c:55
> #2  0x77cf91e0 in threadobj_lock () from
> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
> #3  0x77cf92e4 in finalize_thread () from
> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
> #5  0x77cc5940 in start_thread (arg=0x766ff490) at pthread_create.c:315
> #6  0x77bf1bbc in __thread_start () from
> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
> Backtrace stopped: frame did not save the PC
>
> Thread 2 (Thread 6970):
> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
> req=<optimized out>, rem=<optimized out>)
>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
> #1  0x77cfa0bc in threadobj_sleep () from
> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
> #2  0x77d2af08 in tm_wkafter () from
> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
> #3  0x10000b64 in create_delete (a=0, b=0, c=0, d=0) at delete_child_hangs.c:31
> #4  0x77d296fc in task_trampoline () from
> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
> #5  0x77cf7d84 in thread_trampoline () from
> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
> #6  0x77cc592c in start_thread (arg=0x77a00490) at pthread_create.c:310
> #7  0x77bf1bbc in __thread_start () from
> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
> Backtrace stopped: frame did not save the PC
>
> Thread 1 (Thread 6969):
> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
> req=<optimized out>, rem=<optimized out>)
>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
> #1  0x77cfa0bc in threadobj_sleep () from
> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
> #2  0x77d2af08 in tm_wkafter () from
> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
> #3  0x10000c00 in main (argc=1, argv=0x10011030) at delete_child_hangs.c:43
>

Ok, unless we've entered the twilight zone, the only reason I could see 
this happening would be that the thread prologue somehow gets hit by an 
async cancellation signal while holding its own lock. If so, then the 
patch below would cause an assertion to trigger in such circumstance. 
You will have to turn on --enable-assert in your configuration, leaving 
debugging entirely off not to significantly affect the current timings.

diff --git a/include/boilerplate/lock.h b/include/boilerplate/lock.h
index 6f0218c..b704987 100644
--- a/include/boilerplate/lock.h
+++ b/include/boilerplate/lock.h
@@ -206,7 +206,7 @@ int __check_cancel_type(const char *locktype);
  #define read_unlock_safe(__lock, __state)	\
  	__do_unlock_safe(__lock, __state)

-#ifdef CONFIG_XENO_DEBUG
+#ifndef NDEBUG
  #define mutex_type_attribute PTHREAD_MUTEX_ERRORCHECK
  #else
  #define mutex_type_attribute PTHREAD_MUTEX_NORMAL
diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
index 05bb6cb..6547460 100644
--- a/lib/copperplate/threadobj.c
+++ b/lib/copperplate/threadobj.c
@@ -1310,6 +1310,7 @@ int threadobj_cancel(struct threadobj *thobj)
  static void finalize_thread(void *p) /* thobj->lock free */
  {
  	struct threadobj *thobj = p;
+	int ret;

  	if (thobj == NULL || thobj == THREADOBJ_IRQCONTEXT)
  		return;
@@ -1343,7 +1344,9 @@ static void finalize_thread(void *p) /* 
thobj->lock free */
  	 * waiting for us to start, pending on
  	 * wait_on_barrier(). Instead, hand it over to this thread.
  	 */
-	threadobj_lock(thobj);
+	ret = threadobj_lock(thobj);
+	assert(ret == 0);
+	(void)ret;
  	if ((thobj->status & __THREAD_S_SAFE) == 0) {
  		threadobj_unlock(thobj);
  		destroy_thread(thobj);

-- 
Philippe.


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [Xenomai] Xenomai-forge: Deadlock in finalize_thread()
  2014-07-09 14:16             ` Philippe Gerum
@ 2014-07-09 15:05               ` Kim De Mey
  2014-07-23  8:31                 ` Kim De Mey
  0 siblings, 1 reply; 25+ messages in thread
From: Kim De Mey @ 2014-07-09 15:05 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: Xenomai@xenomai.org

2014-07-09 16:16 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
> On 07/09/2014 12:23 PM, Kim De Mey wrote:
>>
>> 2014-07-09 11:25 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>
>>> On 07/09/2014 10:09 AM, Kim De Mey wrote:
>>>
>>>> Newer version --dump-config output:
>>>> ...
>>>> CONFIG_XENO_COMPILER="gcc version 4.7.0 (Cavium Inc. Version:
>>>> SDK_3_1_0 build 27) "
>>>> ...
>>>>
>>>
>>> Thanks. I can't reproduce this issue yet, so posting the current gdb
>>> backtraces for all the threads still shown by "info threads" after a
>>> deadlock would help.
>>>
>>> TIA,
>>>
>>
>> Below backtraces of all the threads. It is a case with two "worker"
>> tasks in deadlock. The "main" and "create_delete" tasks continued to
>> their "tm_wkafter" loop.
>>
>>
>> (gdb) thread a a bt
>>
>> Thread 4 (Thread 9406):
>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b03300) at
>> pthread_mutex_lock.c:321
>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b03300) at
>> pthread_mutex_lock.c:55
>> #2  0x77cf91e0 in threadobj_lock () from
>>
>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>> #3  0x77cf92e4 in finalize_thread () from
>>
>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>> #5  0x77cc5940 in start_thread (arg=0x75dff490) at pthread_create.c:315
>> #6  0x77bf1bbc in __thread_start () from
>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>> Backtrace stopped: frame did not save the PC
>>
>> Thread 3 (Thread 31145):
>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b030b8) at
>> pthread_mutex_lock.c:321
>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b030b8) at
>> pthread_mutex_lock.c:55
>> #2  0x77cf91e0 in threadobj_lock () from
>>
>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>> #3  0x77cf92e4 in finalize_thread () from
>>
>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>> #5  0x77cc5940 in start_thread (arg=0x766ff490) at pthread_create.c:315
>> #6  0x77bf1bbc in __thread_start () from
>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>> Backtrace stopped: frame did not save the PC
>>
>> Thread 2 (Thread 6970):
>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>> req=<optimized out>, rem=<optimized out>)
>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>> #1  0x77cfa0bc in threadobj_sleep () from
>>
>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>> #2  0x77d2af08 in tm_wkafter () from
>>
>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>> #3  0x10000b64 in create_delete (a=0, b=0, c=0, d=0) at
>> delete_child_hangs.c:31
>> #4  0x77d296fc in task_trampoline () from
>>
>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>> #5  0x77cf7d84 in thread_trampoline () from
>>
>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>> #6  0x77cc592c in start_thread (arg=0x77a00490) at pthread_create.c:310
>> #7  0x77bf1bbc in __thread_start () from
>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>> Backtrace stopped: frame did not save the PC
>>
>> Thread 1 (Thread 6969):
>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>> req=<optimized out>, rem=<optimized out>)
>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>> #1  0x77cfa0bc in threadobj_sleep () from
>>
>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>> #2  0x77d2af08 in tm_wkafter () from
>>
>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>> #3  0x10000c00 in main (argc=1, argv=0x10011030) at
>> delete_child_hangs.c:43
>>
>
> Ok, unless we've entered the twilight zone, the only reason I could see this
> happening would be that the thread prologue somehow gets hit by an async
> cancellation signal while holding its own lock. If so, then the patch below
> would cause an assertion to trigger in such circumstance. You will have to
> turn on --enable-assert in your configuration, leaving debugging entirely
> off not to significantly affect the current timings.
>
> diff --git a/include/boilerplate/lock.h b/include/boilerplate/lock.h
> index 6f0218c..b704987 100644
> --- a/include/boilerplate/lock.h
> +++ b/include/boilerplate/lock.h
> @@ -206,7 +206,7 @@ int __check_cancel_type(const char *locktype);
>  #define read_unlock_safe(__lock, __state)      \
>         __do_unlock_safe(__lock, __state)
>
> -#ifdef CONFIG_XENO_DEBUG
> +#ifndef NDEBUG
>  #define mutex_type_attribute PTHREAD_MUTEX_ERRORCHECK
>  #else
>  #define mutex_type_attribute PTHREAD_MUTEX_NORMAL
> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
> index 05bb6cb..6547460 100644
> --- a/lib/copperplate/threadobj.c
> +++ b/lib/copperplate/threadobj.c
> @@ -1310,6 +1310,7 @@ int threadobj_cancel(struct threadobj *thobj)
>  static void finalize_thread(void *p) /* thobj->lock free */
>  {
>         struct threadobj *thobj = p;
> +       int ret;
>
>         if (thobj == NULL || thobj == THREADOBJ_IRQCONTEXT)
>                 return;
> @@ -1343,7 +1344,9 @@ static void finalize_thread(void *p) /* thobj->lock
> free */
>          * waiting for us to start, pending on
>          * wait_on_barrier(). Instead, hand it over to this thread.
>          */
> -       threadobj_lock(thobj);
> +       ret = threadobj_lock(thobj);
> +       assert(ret == 0);
> +       (void)ret;
>         if ((thobj->status & __THREAD_S_SAFE) == 0) {
>                 threadobj_unlock(thobj);
>                 destroy_thread(thobj);
>

And...the assert kicks in!

threadobj.c:1348: finalize_thread: Assertion `ret == 0' failed.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Xenomai] Xenomai-forge: Deadlock in finalize_thread()
  2014-07-09 15:05               ` Kim De Mey
@ 2014-07-23  8:31                 ` Kim De Mey
  2014-07-23 20:45                   ` Philippe Gerum
  2014-07-28 11:09                   ` Philippe Gerum
  0 siblings, 2 replies; 25+ messages in thread
From: Kim De Mey @ 2014-07-23  8:31 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: Xenomai@xenomai.org

2014-07-09 17:05 GMT+02:00 Kim De Mey <kim.demey@gmail.com>:
>
> 2014-07-09 16:16 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
> > On 07/09/2014 12:23 PM, Kim De Mey wrote:
> >>
> >> 2014-07-09 11:25 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
> >>>
> >>> On 07/09/2014 10:09 AM, Kim De Mey wrote:
> >>>
> >>>> Newer version --dump-config output:
> >>>> ...
> >>>> CONFIG_XENO_COMPILER="gcc version 4.7.0 (Cavium Inc. Version:
> >>>> SDK_3_1_0 build 27) "
> >>>> ...
> >>>>
> >>>
> >>> Thanks. I can't reproduce this issue yet, so posting the current gdb
> >>> backtraces for all the threads still shown by "info threads" after a
> >>> deadlock would help.
> >>>
> >>> TIA,
> >>>
> >>
> >> Below backtraces of all the threads. It is a case with two "worker"
> >> tasks in deadlock. The "main" and "create_delete" tasks continued to
> >> their "tm_wkafter" loop.
> >>
> >>
> >> (gdb) thread a a bt
> >>
> >> Thread 4 (Thread 9406):
> >> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b03300) at
> >> pthread_mutex_lock.c:321
> >> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b03300) at
> >> pthread_mutex_lock.c:55
> >> #2  0x77cf91e0 in threadobj_lock () from
> >>
> >> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
> >> #3  0x77cf92e4 in finalize_thread () from
> >>
> >> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
> >> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
> >> #5  0x77cc5940 in start_thread (arg=0x75dff490) at pthread_create.c:315
> >> #6  0x77bf1bbc in __thread_start () from
> >> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
> >> Backtrace stopped: frame did not save the PC
> >>
> >> Thread 3 (Thread 31145):
> >> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b030b8) at
> >> pthread_mutex_lock.c:321
> >> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b030b8) at
> >> pthread_mutex_lock.c:55
> >> #2  0x77cf91e0 in threadobj_lock () from
> >>
> >> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
> >> #3  0x77cf92e4 in finalize_thread () from
> >>
> >> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
> >> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
> >> #5  0x77cc5940 in start_thread (arg=0x766ff490) at pthread_create.c:315
> >> #6  0x77bf1bbc in __thread_start () from
> >> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
> >> Backtrace stopped: frame did not save the PC
> >>
> >> Thread 2 (Thread 6970):
> >> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
> >> req=<optimized out>, rem=<optimized out>)
> >>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
> >> #1  0x77cfa0bc in threadobj_sleep () from
> >>
> >> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
> >> #2  0x77d2af08 in tm_wkafter () from
> >>
> >> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
> >> #3  0x10000b64 in create_delete (a=0, b=0, c=0, d=0) at
> >> delete_child_hangs.c:31
> >> #4  0x77d296fc in task_trampoline () from
> >>
> >> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
> >> #5  0x77cf7d84 in thread_trampoline () from
> >>
> >> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
> >> #6  0x77cc592c in start_thread (arg=0x77a00490) at pthread_create.c:310
> >> #7  0x77bf1bbc in __thread_start () from
> >> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
> >> Backtrace stopped: frame did not save the PC
> >>
> >> Thread 1 (Thread 6969):
> >> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
> >> req=<optimized out>, rem=<optimized out>)
> >>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
> >> #1  0x77cfa0bc in threadobj_sleep () from
> >>
> >> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
> >> #2  0x77d2af08 in tm_wkafter () from
> >>
> >> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
> >> #3  0x10000c00 in main (argc=1, argv=0x10011030) at
> >> delete_child_hangs.c:43
> >>
> >
> > Ok, unless we've entered the twilight zone, the only reason I could see this
> > happening would be that the thread prologue somehow gets hit by an async
> > cancellation signal while holding its own lock. If so, then the patch below
> > would cause an assertion to trigger in such circumstance. You will have to
> > turn on --enable-assert in your configuration, leaving debugging entirely
> > off not to significantly affect the current timings.
> >
> > diff --git a/include/boilerplate/lock.h b/include/boilerplate/lock.h
> > index 6f0218c..b704987 100644
> > --- a/include/boilerplate/lock.h
> > +++ b/include/boilerplate/lock.h
> > @@ -206,7 +206,7 @@ int __check_cancel_type(const char *locktype);
> >  #define read_unlock_safe(__lock, __state)      \
> >         __do_unlock_safe(__lock, __state)
> >
> > -#ifdef CONFIG_XENO_DEBUG
> > +#ifndef NDEBUG
> >  #define mutex_type_attribute PTHREAD_MUTEX_ERRORCHECK
> >  #else
> >  #define mutex_type_attribute PTHREAD_MUTEX_NORMAL
> > diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
> > index 05bb6cb..6547460 100644
> > --- a/lib/copperplate/threadobj.c
> > +++ b/lib/copperplate/threadobj.c
> > @@ -1310,6 +1310,7 @@ int threadobj_cancel(struct threadobj *thobj)
> >  static void finalize_thread(void *p) /* thobj->lock free */
> >  {
> >         struct threadobj *thobj = p;
> > +       int ret;
> >
> >         if (thobj == NULL || thobj == THREADOBJ_IRQCONTEXT)
> >                 return;
> > @@ -1343,7 +1344,9 @@ static void finalize_thread(void *p) /* thobj->lock
> > free */
> >          * waiting for us to start, pending on
> >          * wait_on_barrier(). Instead, hand it over to this thread.
> >          */
> > -       threadobj_lock(thobj);
> > +       ret = threadobj_lock(thobj);
> > +       assert(ret == 0);
> > +       (void)ret;
> >         if ((thobj->status & __THREAD_S_SAFE) == 0) {
> >                 threadobj_unlock(thobj);
> >                 destroy_thread(thobj);
> >
>
> And...the assert kicks in!
>
> threadobj.c:1348: finalize_thread: Assertion `ret == 0' failed.


Philippe,

Could you explain this case of the async cancellation signal? I don't
fully understand how this happens or why it causes the issue.

I also tested once with CONFIG_XENO_ASYNC_CANCEL OFF and then the
issue doesn't occur anymore.

Thanks!

Kim


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Xenomai] Xenomai-forge: Deadlock in finalize_thread()
  2014-07-23  8:31                 ` Kim De Mey
@ 2014-07-23 20:45                   ` Philippe Gerum
  2014-07-28 11:09                   ` Philippe Gerum
  1 sibling, 0 replies; 25+ messages in thread
From: Philippe Gerum @ 2014-07-23 20:45 UTC (permalink / raw)
  To: Kim De Mey; +Cc: Xenomai@xenomai.org

On 07/23/2014 10:31 AM, Kim De Mey wrote:
> 2014-07-09 17:05 GMT+02:00 Kim De Mey <kim.demey@gmail.com>:
> Philippe,
> 
> Could you explain this case of the async cancellation signal? I don't
> fully understand how this happens or why it causes the issue.
> 
> I also tested once with CONFIG_XENO_ASYNC_CANCEL OFF and then the
> issue doesn't occur anymore.
> 

There must be something fragile with respect to disabling the
cancellation signal when running the thread prologue, which causes the
finalizer to run while the thread holds its own lock, as it handles
SIGCANCEL from the wrong place. This requires an async notification to
take place, hence your observation after switching off ASYNC_CANCEL. I
did not track it down entirely yet.

I'm slightly buried alive in the preparation of the initial Xenomai 3
-rc release, and/but because of that, I will definitely address this
issue in the coming weeks, hopefully in two weeks from now.

> Thanks!
> 
> Kim
> 


-- 
Philippe.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Xenomai] Xenomai-forge: Deadlock in finalize_thread()
  2014-07-23  8:31                 ` Kim De Mey
  2014-07-23 20:45                   ` Philippe Gerum
@ 2014-07-28 11:09                   ` Philippe Gerum
  2014-07-28 15:18                     ` Kim De Mey
  1 sibling, 1 reply; 25+ messages in thread
From: Philippe Gerum @ 2014-07-28 11:09 UTC (permalink / raw)
  To: Kim De Mey; +Cc: Xenomai@xenomai.org

On 07/23/2014 10:31 AM, Kim De Mey wrote:
> 2014-07-09 17:05 GMT+02:00 Kim De Mey <kim.demey@gmail.com>:
>>
>> 2014-07-09 16:16 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>> On 07/09/2014 12:23 PM, Kim De Mey wrote:
>>>>
>>>> 2014-07-09 11:25 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>>
>>>>> On 07/09/2014 10:09 AM, Kim De Mey wrote:
>>>>>
>>>>>> Newer version --dump-config output:
>>>>>> ...
>>>>>> CONFIG_XENO_COMPILER="gcc version 4.7.0 (Cavium Inc. Version:
>>>>>> SDK_3_1_0 build 27) "
>>>>>> ...
>>>>>>
>>>>>
>>>>> Thanks. I can't reproduce this issue yet, so posting the current gdb
>>>>> backtraces for all the threads still shown by "info threads" after a
>>>>> deadlock would help.
>>>>>
>>>>> TIA,
>>>>>
>>>>
>>>> Below backtraces of all the threads. It is a case with two "worker"
>>>> tasks in deadlock. The "main" and "create_delete" tasks continued to
>>>> their "tm_wkafter" loop.
>>>>
>>>>
>>>> (gdb) thread a a bt
>>>>
>>>> Thread 4 (Thread 9406):
>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b03300) at
>>>> pthread_mutex_lock.c:321
>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b03300) at
>>>> pthread_mutex_lock.c:55
>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>
>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>
>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>> #5  0x77cc5940 in start_thread (arg=0x75dff490) at pthread_create.c:315
>>>> #6  0x77bf1bbc in __thread_start () from
>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>> Backtrace stopped: frame did not save the PC
>>>>
>>>> Thread 3 (Thread 31145):
>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b030b8) at
>>>> pthread_mutex_lock.c:321
>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b030b8) at
>>>> pthread_mutex_lock.c:55
>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>
>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>
>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>> #5  0x77cc5940 in start_thread (arg=0x766ff490) at pthread_create.c:315
>>>> #6  0x77bf1bbc in __thread_start () from
>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>> Backtrace stopped: frame did not save the PC
>>>>
>>>> Thread 2 (Thread 6970):
>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>> req=<optimized out>, rem=<optimized out>)
>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>
>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>
>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>> #3  0x10000b64 in create_delete (a=0, b=0, c=0, d=0) at
>>>> delete_child_hangs.c:31
>>>> #4  0x77d296fc in task_trampoline () from
>>>>
>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>> #5  0x77cf7d84 in thread_trampoline () from
>>>>
>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>> #6  0x77cc592c in start_thread (arg=0x77a00490) at pthread_create.c:310
>>>> #7  0x77bf1bbc in __thread_start () from
>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>> Backtrace stopped: frame did not save the PC
>>>>
>>>> Thread 1 (Thread 6969):
>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>> req=<optimized out>, rem=<optimized out>)
>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>
>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>
>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>> #3  0x10000c00 in main (argc=1, argv=0x10011030) at
>>>> delete_child_hangs.c:43
>>>>
>>>
>>> Ok, unless we've entered the twilight zone, the only reason I could see this
>>> happening would be that the thread prologue somehow gets hit by an async
>>> cancellation signal while holding its own lock. If so, then the patch below
>>> would cause an assertion to trigger in such circumstance. You will have to
>>> turn on --enable-assert in your configuration, leaving debugging entirely
>>> off not to significantly affect the current timings.
>>>
>>> diff --git a/include/boilerplate/lock.h b/include/boilerplate/lock.h
>>> index 6f0218c..b704987 100644
>>> --- a/include/boilerplate/lock.h
>>> +++ b/include/boilerplate/lock.h
>>> @@ -206,7 +206,7 @@ int __check_cancel_type(const char *locktype);
>>>  #define read_unlock_safe(__lock, __state)      \
>>>         __do_unlock_safe(__lock, __state)
>>>
>>> -#ifdef CONFIG_XENO_DEBUG
>>> +#ifndef NDEBUG
>>>  #define mutex_type_attribute PTHREAD_MUTEX_ERRORCHECK
>>>  #else
>>>  #define mutex_type_attribute PTHREAD_MUTEX_NORMAL
>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>> index 05bb6cb..6547460 100644
>>> --- a/lib/copperplate/threadobj.c
>>> +++ b/lib/copperplate/threadobj.c
>>> @@ -1310,6 +1310,7 @@ int threadobj_cancel(struct threadobj *thobj)
>>>  static void finalize_thread(void *p) /* thobj->lock free */
>>>  {
>>>         struct threadobj *thobj = p;
>>> +       int ret;
>>>
>>>         if (thobj == NULL || thobj == THREADOBJ_IRQCONTEXT)
>>>                 return;
>>> @@ -1343,7 +1344,9 @@ static void finalize_thread(void *p) /* thobj->lock
>>> free */
>>>          * waiting for us to start, pending on
>>>          * wait_on_barrier(). Instead, hand it over to this thread.
>>>          */
>>> -       threadobj_lock(thobj);
>>> +       ret = threadobj_lock(thobj);
>>> +       assert(ret == 0);
>>> +       (void)ret;
>>>         if ((thobj->status & __THREAD_S_SAFE) == 0) {
>>>                 threadobj_unlock(thobj);
>>>                 destroy_thread(thobj);
>>>
>>
>> And...the assert kicks in!
>>
>> threadobj.c:1348: finalize_thread: Assertion `ret == 0' failed.
> 
> 
> Philippe,
> 
> Could you explain this case of the async cancellation signal? I don't
> fully understand how this happens or why it causes the issue.
> 
> I also tested once with CONFIG_XENO_ASYNC_CANCEL OFF and then the
> issue doesn't occur anymore.
> 
>

I have not been able to reproduce this race yet, but still the following patch would make sense:
 
diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
index 05bb6cb..48aa032 100644
--- a/lib/copperplate/threadobj.c
+++ b/lib/copperplate/threadobj.c
@@ -1045,9 +1045,11 @@ static int wait_on_barrier(struct threadobj *thobj, int mask)
 		if (status & mask)
 			break;
 		oldstate = thobj->cancel_state;
+		push_cleanup_lock(&thobj->lock);
 		__threadobj_tag_unlocked(thobj);
 		__RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
 		__threadobj_tag_locked(thobj);
+		pop_cleanup_lock(&thobj->lock);
 		thobj->cancel_state = oldstate;
 	}
 
@@ -1243,9 +1245,11 @@ static void cancel_sync(struct threadobj *thobj) /* thobj->lock held */
 	 */
 	while (thobj->status & __THREAD_S_WARMUP) {
 		oldstate = thobj->cancel_state;
+		push_cleanup_lock(&thobj->lock);
 		__threadobj_tag_unlocked(thobj);
 		__RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
 		__threadobj_tag_locked(thobj);
+		pop_cleanup_lock(&thobj->lock);
 		thobj->cancel_state = oldstate;
 	}
 
-- 
Philippe.


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [Xenomai] Xenomai-forge: Deadlock in finalize_thread()
  2014-07-28 11:09                   ` Philippe Gerum
@ 2014-07-28 15:18                     ` Kim De Mey
  2014-07-28 21:52                       ` Philippe Gerum
  0 siblings, 1 reply; 25+ messages in thread
From: Kim De Mey @ 2014-07-28 15:18 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: Xenomai@xenomai.org

2014-07-28 13:09 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
> On 07/23/2014 10:31 AM, Kim De Mey wrote:
>> 2014-07-09 17:05 GMT+02:00 Kim De Mey <kim.demey@gmail.com>:
>>>
>>> 2014-07-09 16:16 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>> On 07/09/2014 12:23 PM, Kim De Mey wrote:
>>>>>
>>>>> 2014-07-09 11:25 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>>>
>>>>>> On 07/09/2014 10:09 AM, Kim De Mey wrote:
>>>>>>
>>>>>>> Newer version --dump-config output:
>>>>>>> ...
>>>>>>> CONFIG_XENO_COMPILER="gcc version 4.7.0 (Cavium Inc. Version:
>>>>>>> SDK_3_1_0 build 27) "
>>>>>>> ...
>>>>>>>
>>>>>>
>>>>>> Thanks. I can't reproduce this issue yet, so posting the current gdb
>>>>>> backtraces for all the threads still shown by "info threads" after a
>>>>>> deadlock would help.
>>>>>>
>>>>>> TIA,
>>>>>>
>>>>>
>>>>> Below backtraces of all the threads. It is a case with two "worker"
>>>>> tasks in deadlock. The "main" and "create_delete" tasks continued to
>>>>> their "tm_wkafter" loop.
>>>>>
>>>>>
>>>>> (gdb) thread a a bt
>>>>>
>>>>> Thread 4 (Thread 9406):
>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b03300) at
>>>>> pthread_mutex_lock.c:321
>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b03300) at
>>>>> pthread_mutex_lock.c:55
>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>
>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>
>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>> #5  0x77cc5940 in start_thread (arg=0x75dff490) at pthread_create.c:315
>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>> Backtrace stopped: frame did not save the PC
>>>>>
>>>>> Thread 3 (Thread 31145):
>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b030b8) at
>>>>> pthread_mutex_lock.c:321
>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b030b8) at
>>>>> pthread_mutex_lock.c:55
>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>
>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>
>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>> #5  0x77cc5940 in start_thread (arg=0x766ff490) at pthread_create.c:315
>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>> Backtrace stopped: frame did not save the PC
>>>>>
>>>>> Thread 2 (Thread 6970):
>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>
>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>
>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>> #3  0x10000b64 in create_delete (a=0, b=0, c=0, d=0) at
>>>>> delete_child_hangs.c:31
>>>>> #4  0x77d296fc in task_trampoline () from
>>>>>
>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>> #5  0x77cf7d84 in thread_trampoline () from
>>>>>
>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>> #6  0x77cc592c in start_thread (arg=0x77a00490) at pthread_create.c:310
>>>>> #7  0x77bf1bbc in __thread_start () from
>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>> Backtrace stopped: frame did not save the PC
>>>>>
>>>>> Thread 1 (Thread 6969):
>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>
>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>
>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>> #3  0x10000c00 in main (argc=1, argv=0x10011030) at
>>>>> delete_child_hangs.c:43
>>>>>
>>>>
>>>> Ok, unless we've entered the twilight zone, the only reason I could see this
>>>> happening would be that the thread prologue somehow gets hit by an async
>>>> cancellation signal while holding its own lock. If so, then the patch below
>>>> would cause an assertion to trigger in such circumstance. You will have to
>>>> turn on --enable-assert in your configuration, leaving debugging entirely
>>>> off not to significantly affect the current timings.
>>>>
>>>> diff --git a/include/boilerplate/lock.h b/include/boilerplate/lock.h
>>>> index 6f0218c..b704987 100644
>>>> --- a/include/boilerplate/lock.h
>>>> +++ b/include/boilerplate/lock.h
>>>> @@ -206,7 +206,7 @@ int __check_cancel_type(const char *locktype);
>>>>  #define read_unlock_safe(__lock, __state)      \
>>>>         __do_unlock_safe(__lock, __state)
>>>>
>>>> -#ifdef CONFIG_XENO_DEBUG
>>>> +#ifndef NDEBUG
>>>>  #define mutex_type_attribute PTHREAD_MUTEX_ERRORCHECK
>>>>  #else
>>>>  #define mutex_type_attribute PTHREAD_MUTEX_NORMAL
>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>> index 05bb6cb..6547460 100644
>>>> --- a/lib/copperplate/threadobj.c
>>>> +++ b/lib/copperplate/threadobj.c
>>>> @@ -1310,6 +1310,7 @@ int threadobj_cancel(struct threadobj *thobj)
>>>>  static void finalize_thread(void *p) /* thobj->lock free */
>>>>  {
>>>>         struct threadobj *thobj = p;
>>>> +       int ret;
>>>>
>>>>         if (thobj == NULL || thobj == THREADOBJ_IRQCONTEXT)
>>>>                 return;
>>>> @@ -1343,7 +1344,9 @@ static void finalize_thread(void *p) /* thobj->lock
>>>> free */
>>>>          * waiting for us to start, pending on
>>>>          * wait_on_barrier(). Instead, hand it over to this thread.
>>>>          */
>>>> -       threadobj_lock(thobj);
>>>> +       ret = threadobj_lock(thobj);
>>>> +       assert(ret == 0);
>>>> +       (void)ret;
>>>>         if ((thobj->status & __THREAD_S_SAFE) == 0) {
>>>>                 threadobj_unlock(thobj);
>>>>                 destroy_thread(thobj);
>>>>
>>>
>>> And...the assert kicks in!
>>>
>>> threadobj.c:1348: finalize_thread: Assertion `ret == 0' failed.
>>
>>
>> Philippe,
>>
>> Could you explain this case of the async cancellation signal? I don't
>> fully understand how this happens or why it causes the issue.
>>
>> I also tested once with CONFIG_XENO_ASYNC_CANCEL OFF and then the
>> issue doesn't occur anymore.
>>
>>
>
> I have not been able to reproduce this race yet, but still the following patch would make sense:
>
> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
> index 05bb6cb..48aa032 100644
> --- a/lib/copperplate/threadobj.c
> +++ b/lib/copperplate/threadobj.c
> @@ -1045,9 +1045,11 @@ static int wait_on_barrier(struct threadobj *thobj, int mask)
>                 if (status & mask)
>                         break;
>                 oldstate = thobj->cancel_state;
> +               push_cleanup_lock(&thobj->lock);
>                 __threadobj_tag_unlocked(thobj);
>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>                 __threadobj_tag_locked(thobj);
> +               pop_cleanup_lock(&thobj->lock);
>                 thobj->cancel_state = oldstate;
>         }
>
> @@ -1243,9 +1245,11 @@ static void cancel_sync(struct threadobj *thobj) /* thobj->lock held */
>          */
>         while (thobj->status & __THREAD_S_WARMUP) {
>                 oldstate = thobj->cancel_state;
> +               push_cleanup_lock(&thobj->lock);
>                 __threadobj_tag_unlocked(thobj);
>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>                 __threadobj_tag_locked(thobj);
> +               pop_cleanup_lock(&thobj->lock);
>                 thobj->cancel_state = oldstate;
>         }
>

I have tried this patch but it does not seem to make any difference.
The threads are still deadlocked at the same location.

I've only tested this on my setup with the old toolchain/kernel so
far. I'll check also with the newer version.


Regards,
Kim


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Xenomai] Xenomai-forge: Deadlock in finalize_thread()
  2014-07-28 15:18                     ` Kim De Mey
@ 2014-07-28 21:52                       ` Philippe Gerum
  2014-07-29  7:31                         ` Kim De Mey
  0 siblings, 1 reply; 25+ messages in thread
From: Philippe Gerum @ 2014-07-28 21:52 UTC (permalink / raw)
  To: Kim De Mey; +Cc: Xenomai@xenomai.org

On 07/28/2014 05:18 PM, Kim De Mey wrote:
> 2014-07-28 13:09 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>> On 07/23/2014 10:31 AM, Kim De Mey wrote:
>>> 2014-07-09 17:05 GMT+02:00 Kim De Mey <kim.demey@gmail.com>:
>>>>
>>>> 2014-07-09 16:16 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>> On 07/09/2014 12:23 PM, Kim De Mey wrote:
>>>>>>
>>>>>> 2014-07-09 11:25 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>>>>
>>>>>>> On 07/09/2014 10:09 AM, Kim De Mey wrote:
>>>>>>>
>>>>>>>> Newer version --dump-config output:
>>>>>>>> ...
>>>>>>>> CONFIG_XENO_COMPILER="gcc version 4.7.0 (Cavium Inc. Version:
>>>>>>>> SDK_3_1_0 build 27) "
>>>>>>>> ...
>>>>>>>>
>>>>>>>
>>>>>>> Thanks. I can't reproduce this issue yet, so posting the current gdb
>>>>>>> backtraces for all the threads still shown by "info threads" after a
>>>>>>> deadlock would help.
>>>>>>>
>>>>>>> TIA,
>>>>>>>
>>>>>>
>>>>>> Below backtraces of all the threads. It is a case with two "worker"
>>>>>> tasks in deadlock. The "main" and "create_delete" tasks continued to
>>>>>> their "tm_wkafter" loop.
>>>>>>
>>>>>>
>>>>>> (gdb) thread a a bt
>>>>>>
>>>>>> Thread 4 (Thread 9406):
>>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b03300) at
>>>>>> pthread_mutex_lock.c:321
>>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b03300) at
>>>>>> pthread_mutex_lock.c:55
>>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>>
>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>>
>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>>> #5  0x77cc5940 in start_thread (arg=0x75dff490) at pthread_create.c:315
>>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>
>>>>>> Thread 3 (Thread 31145):
>>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b030b8) at
>>>>>> pthread_mutex_lock.c:321
>>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b030b8) at
>>>>>> pthread_mutex_lock.c:55
>>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>>
>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>>
>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>>> #5  0x77cc5940 in start_thread (arg=0x766ff490) at pthread_create.c:315
>>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>
>>>>>> Thread 2 (Thread 6970):
>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>>
>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>>
>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>> #3  0x10000b64 in create_delete (a=0, b=0, c=0, d=0) at
>>>>>> delete_child_hangs.c:31
>>>>>> #4  0x77d296fc in task_trampoline () from
>>>>>>
>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>> #5  0x77cf7d84 in thread_trampoline () from
>>>>>>
>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>> #6  0x77cc592c in start_thread (arg=0x77a00490) at pthread_create.c:310
>>>>>> #7  0x77bf1bbc in __thread_start () from
>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>
>>>>>> Thread 1 (Thread 6969):
>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>>
>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>>
>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>> #3  0x10000c00 in main (argc=1, argv=0x10011030) at
>>>>>> delete_child_hangs.c:43
>>>>>>
>>>>>
>>>>> Ok, unless we've entered the twilight zone, the only reason I could see this
>>>>> happening would be that the thread prologue somehow gets hit by an async
>>>>> cancellation signal while holding its own lock. If so, then the patch below
>>>>> would cause an assertion to trigger in such circumstance. You will have to
>>>>> turn on --enable-assert in your configuration, leaving debugging entirely
>>>>> off not to significantly affect the current timings.
>>>>>
>>>>> diff --git a/include/boilerplate/lock.h b/include/boilerplate/lock.h
>>>>> index 6f0218c..b704987 100644
>>>>> --- a/include/boilerplate/lock.h
>>>>> +++ b/include/boilerplate/lock.h
>>>>> @@ -206,7 +206,7 @@ int __check_cancel_type(const char *locktype);
>>>>>  #define read_unlock_safe(__lock, __state)      \
>>>>>         __do_unlock_safe(__lock, __state)
>>>>>
>>>>> -#ifdef CONFIG_XENO_DEBUG
>>>>> +#ifndef NDEBUG
>>>>>  #define mutex_type_attribute PTHREAD_MUTEX_ERRORCHECK
>>>>>  #else
>>>>>  #define mutex_type_attribute PTHREAD_MUTEX_NORMAL
>>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>>> index 05bb6cb..6547460 100644
>>>>> --- a/lib/copperplate/threadobj.c
>>>>> +++ b/lib/copperplate/threadobj.c
>>>>> @@ -1310,6 +1310,7 @@ int threadobj_cancel(struct threadobj *thobj)
>>>>>  static void finalize_thread(void *p) /* thobj->lock free */
>>>>>  {
>>>>>         struct threadobj *thobj = p;
>>>>> +       int ret;
>>>>>
>>>>>         if (thobj == NULL || thobj == THREADOBJ_IRQCONTEXT)
>>>>>                 return;
>>>>> @@ -1343,7 +1344,9 @@ static void finalize_thread(void *p) /* thobj->lock
>>>>> free */
>>>>>          * waiting for us to start, pending on
>>>>>          * wait_on_barrier(). Instead, hand it over to this thread.
>>>>>          */
>>>>> -       threadobj_lock(thobj);
>>>>> +       ret = threadobj_lock(thobj);
>>>>> +       assert(ret == 0);
>>>>> +       (void)ret;
>>>>>         if ((thobj->status & __THREAD_S_SAFE) == 0) {
>>>>>                 threadobj_unlock(thobj);
>>>>>                 destroy_thread(thobj);
>>>>>
>>>>
>>>> And...the assert kicks in!
>>>>
>>>> threadobj.c:1348: finalize_thread: Assertion `ret == 0' failed.
>>>
>>>
>>> Philippe,
>>>
>>> Could you explain this case of the async cancellation signal? I don't
>>> fully understand how this happens or why it causes the issue.
>>>
>>> I also tested once with CONFIG_XENO_ASYNC_CANCEL OFF and then the
>>> issue doesn't occur anymore.
>>>
>>>
>>
>> I have not been able to reproduce this race yet, but still the following patch would make sense:
>>
>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>> index 05bb6cb..48aa032 100644
>> --- a/lib/copperplate/threadobj.c
>> +++ b/lib/copperplate/threadobj.c
>> @@ -1045,9 +1045,11 @@ static int wait_on_barrier(struct threadobj *thobj, int mask)
>>                 if (status & mask)
>>                         break;
>>                 oldstate = thobj->cancel_state;
>> +               push_cleanup_lock(&thobj->lock);
>>                 __threadobj_tag_unlocked(thobj);
>>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>>                 __threadobj_tag_locked(thobj);
>> +               pop_cleanup_lock(&thobj->lock);
>>                 thobj->cancel_state = oldstate;
>>         }
>>
>> @@ -1243,9 +1245,11 @@ static void cancel_sync(struct threadobj *thobj) /* thobj->lock held */
>>          */
>>         while (thobj->status & __THREAD_S_WARMUP) {
>>                 oldstate = thobj->cancel_state;
>> +               push_cleanup_lock(&thobj->lock);
>>                 __threadobj_tag_unlocked(thobj);
>>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>>                 __threadobj_tag_locked(thobj);
>> +               pop_cleanup_lock(&thobj->lock);
>>                 thobj->cancel_state = oldstate;
>>         }
>>
> 
> I have tried this patch but it does not seem to make any difference.
> The threads are still deadlocked at the same location.
> 
> I've only tested this on my setup with the old toolchain/kernel so
> far. I'll check also with the newer version.
> 

Ok, thanks for testing. I would not bother testing with another
toolchain, I don't buy the toolchain bug in this case. This looks like a
common race due to the async nature of the cancellation event which is
released while the caller holds its own lock without proper cleanup
actions. I'll dig deeper.

-- 
Philippe.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Xenomai] Xenomai-forge: Deadlock in finalize_thread()
  2014-07-28 21:52                       ` Philippe Gerum
@ 2014-07-29  7:31                         ` Kim De Mey
  2014-08-04 17:06                           ` Philippe Gerum
  0 siblings, 1 reply; 25+ messages in thread
From: Kim De Mey @ 2014-07-29  7:31 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: Xenomai@xenomai.org

2014-07-28 23:52 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
> On 07/28/2014 05:18 PM, Kim De Mey wrote:
>> 2014-07-28 13:09 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>> On 07/23/2014 10:31 AM, Kim De Mey wrote:
>>>> 2014-07-09 17:05 GMT+02:00 Kim De Mey <kim.demey@gmail.com>:
>>>>>
>>>>> 2014-07-09 16:16 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>>> On 07/09/2014 12:23 PM, Kim De Mey wrote:
>>>>>>>
>>>>>>> 2014-07-09 11:25 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>>>>>
>>>>>>>> On 07/09/2014 10:09 AM, Kim De Mey wrote:
>>>>>>>>
>>>>>>>>> Newer version --dump-config output:
>>>>>>>>> ...
>>>>>>>>> CONFIG_XENO_COMPILER="gcc version 4.7.0 (Cavium Inc. Version:
>>>>>>>>> SDK_3_1_0 build 27) "
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks. I can't reproduce this issue yet, so posting the current gdb
>>>>>>>> backtraces for all the threads still shown by "info threads" after a
>>>>>>>> deadlock would help.
>>>>>>>>
>>>>>>>> TIA,
>>>>>>>>
>>>>>>>
>>>>>>> Below backtraces of all the threads. It is a case with two "worker"
>>>>>>> tasks in deadlock. The "main" and "create_delete" tasks continued to
>>>>>>> their "tm_wkafter" loop.
>>>>>>>
>>>>>>>
>>>>>>> (gdb) thread a a bt
>>>>>>>
>>>>>>> Thread 4 (Thread 9406):
>>>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b03300) at
>>>>>>> pthread_mutex_lock.c:321
>>>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b03300) at
>>>>>>> pthread_mutex_lock.c:55
>>>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>>>
>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>>>
>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>>>> #5  0x77cc5940 in start_thread (arg=0x75dff490) at pthread_create.c:315
>>>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>
>>>>>>> Thread 3 (Thread 31145):
>>>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b030b8) at
>>>>>>> pthread_mutex_lock.c:321
>>>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b030b8) at
>>>>>>> pthread_mutex_lock.c:55
>>>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>>>
>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>>>
>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>>>> #5  0x77cc5940 in start_thread (arg=0x766ff490) at pthread_create.c:315
>>>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>
>>>>>>> Thread 2 (Thread 6970):
>>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>>>
>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>>>
>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>> #3  0x10000b64 in create_delete (a=0, b=0, c=0, d=0) at
>>>>>>> delete_child_hangs.c:31
>>>>>>> #4  0x77d296fc in task_trampoline () from
>>>>>>>
>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>> #5  0x77cf7d84 in thread_trampoline () from
>>>>>>>
>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>> #6  0x77cc592c in start_thread (arg=0x77a00490) at pthread_create.c:310
>>>>>>> #7  0x77bf1bbc in __thread_start () from
>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>
>>>>>>> Thread 1 (Thread 6969):
>>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>>>
>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>>>
>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>> #3  0x10000c00 in main (argc=1, argv=0x10011030) at
>>>>>>> delete_child_hangs.c:43
>>>>>>>
>>>>>>
>>>>>> Ok, unless we've entered the twilight zone, the only reason I could see this
>>>>>> happening would be that the thread prologue somehow gets hit by an async
>>>>>> cancellation signal while holding its own lock. If so, then the patch below
>>>>>> would cause an assertion to trigger in such circumstance. You will have to
>>>>>> turn on --enable-assert in your configuration, leaving debugging entirely
>>>>>> off not to significantly affect the current timings.
>>>>>>
>>>>>> diff --git a/include/boilerplate/lock.h b/include/boilerplate/lock.h
>>>>>> index 6f0218c..b704987 100644
>>>>>> --- a/include/boilerplate/lock.h
>>>>>> +++ b/include/boilerplate/lock.h
>>>>>> @@ -206,7 +206,7 @@ int __check_cancel_type(const char *locktype);
>>>>>>  #define read_unlock_safe(__lock, __state)      \
>>>>>>         __do_unlock_safe(__lock, __state)
>>>>>>
>>>>>> -#ifdef CONFIG_XENO_DEBUG
>>>>>> +#ifndef NDEBUG
>>>>>>  #define mutex_type_attribute PTHREAD_MUTEX_ERRORCHECK
>>>>>>  #else
>>>>>>  #define mutex_type_attribute PTHREAD_MUTEX_NORMAL
>>>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>>>> index 05bb6cb..6547460 100644
>>>>>> --- a/lib/copperplate/threadobj.c
>>>>>> +++ b/lib/copperplate/threadobj.c
>>>>>> @@ -1310,6 +1310,7 @@ int threadobj_cancel(struct threadobj *thobj)
>>>>>>  static void finalize_thread(void *p) /* thobj->lock free */
>>>>>>  {
>>>>>>         struct threadobj *thobj = p;
>>>>>> +       int ret;
>>>>>>
>>>>>>         if (thobj == NULL || thobj == THREADOBJ_IRQCONTEXT)
>>>>>>                 return;
>>>>>> @@ -1343,7 +1344,9 @@ static void finalize_thread(void *p) /* thobj->lock
>>>>>> free */
>>>>>>          * waiting for us to start, pending on
>>>>>>          * wait_on_barrier(). Instead, hand it over to this thread.
>>>>>>          */
>>>>>> -       threadobj_lock(thobj);
>>>>>> +       ret = threadobj_lock(thobj);
>>>>>> +       assert(ret == 0);
>>>>>> +       (void)ret;
>>>>>>         if ((thobj->status & __THREAD_S_SAFE) == 0) {
>>>>>>                 threadobj_unlock(thobj);
>>>>>>                 destroy_thread(thobj);
>>>>>>
>>>>>
>>>>> And...the assert kicks in!
>>>>>
>>>>> threadobj.c:1348: finalize_thread: Assertion `ret == 0' failed.
>>>>
>>>>
>>>> Philippe,
>>>>
>>>> Could you explain this case of the async cancellation signal? I don't
>>>> fully understand how this happens or why it causes the issue.
>>>>
>>>> I also tested once with CONFIG_XENO_ASYNC_CANCEL OFF and then the
>>>> issue doesn't occur anymore.
>>>>
>>>>
>>>
>>> I have not been able to reproduce this race yet, but still the following patch would make sense:
>>>
>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>> index 05bb6cb..48aa032 100644
>>> --- a/lib/copperplate/threadobj.c
>>> +++ b/lib/copperplate/threadobj.c
>>> @@ -1045,9 +1045,11 @@ static int wait_on_barrier(struct threadobj *thobj, int mask)
>>>                 if (status & mask)
>>>                         break;
>>>                 oldstate = thobj->cancel_state;
>>> +               push_cleanup_lock(&thobj->lock);
>>>                 __threadobj_tag_unlocked(thobj);
>>>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>>>                 __threadobj_tag_locked(thobj);
>>> +               pop_cleanup_lock(&thobj->lock);
>>>                 thobj->cancel_state = oldstate;
>>>         }
>>>
>>> @@ -1243,9 +1245,11 @@ static void cancel_sync(struct threadobj *thobj) /* thobj->lock held */
>>>          */
>>>         while (thobj->status & __THREAD_S_WARMUP) {
>>>                 oldstate = thobj->cancel_state;
>>> +               push_cleanup_lock(&thobj->lock);
>>>                 __threadobj_tag_unlocked(thobj);
>>>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>>>                 __threadobj_tag_locked(thobj);
>>> +               pop_cleanup_lock(&thobj->lock);
>>>                 thobj->cancel_state = oldstate;
>>>         }
>>>
>>
>> I have tried this patch but it does not seem to make any difference.
>> The threads are still deadlocked at the same location.
>>
>> I've only tested this on my setup with the old toolchain/kernel so
>> far. I'll check also with the newer version.
>>
>
> Ok, thanks for testing. I would not bother testing with another
> toolchain, I don't buy the toolchain bug in this case. This looks like a
> common race due to the async nature of the cancellation event which is
> released while the caller holds its own lock without proper cleanup
> actions. I'll dig deeper.
>

Okay, in case there is anything you want me to try out, don't hesitate
to tell me.

Thanks a lot already!


Kim


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Xenomai] Xenomai-forge: Deadlock in finalize_thread()
  2014-07-29  7:31                         ` Kim De Mey
@ 2014-08-04 17:06                           ` Philippe Gerum
  2014-08-05  7:37                             ` Kim De Mey
  0 siblings, 1 reply; 25+ messages in thread
From: Philippe Gerum @ 2014-08-04 17:06 UTC (permalink / raw)
  To: Kim De Mey; +Cc: Xenomai@xenomai.org

On 07/29/2014 09:31 AM, Kim De Mey wrote:
> 2014-07-28 23:52 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>> On 07/28/2014 05:18 PM, Kim De Mey wrote:
>>> 2014-07-28 13:09 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>> On 07/23/2014 10:31 AM, Kim De Mey wrote:
>>>>> 2014-07-09 17:05 GMT+02:00 Kim De Mey <kim.demey@gmail.com>:
>>>>>>
>>>>>> 2014-07-09 16:16 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>>>> On 07/09/2014 12:23 PM, Kim De Mey wrote:
>>>>>>>>
>>>>>>>> 2014-07-09 11:25 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>>>>>>
>>>>>>>>> On 07/09/2014 10:09 AM, Kim De Mey wrote:
>>>>>>>>>
>>>>>>>>>> Newer version --dump-config output:
>>>>>>>>>> ...
>>>>>>>>>> CONFIG_XENO_COMPILER="gcc version 4.7.0 (Cavium Inc. Version:
>>>>>>>>>> SDK_3_1_0 build 27) "
>>>>>>>>>> ...
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks. I can't reproduce this issue yet, so posting the current gdb
>>>>>>>>> backtraces for all the threads still shown by "info threads" after a
>>>>>>>>> deadlock would help.
>>>>>>>>>
>>>>>>>>> TIA,
>>>>>>>>>
>>>>>>>>
>>>>>>>> Below backtraces of all the threads. It is a case with two "worker"
>>>>>>>> tasks in deadlock. The "main" and "create_delete" tasks continued to
>>>>>>>> their "tm_wkafter" loop.
>>>>>>>>
>>>>>>>>
>>>>>>>> (gdb) thread a a bt
>>>>>>>>
>>>>>>>> Thread 4 (Thread 9406):
>>>>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b03300) at
>>>>>>>> pthread_mutex_lock.c:321
>>>>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b03300) at
>>>>>>>> pthread_mutex_lock.c:55
>>>>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>>>>
>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>>>>
>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>>>>> #5  0x77cc5940 in start_thread (arg=0x75dff490) at pthread_create.c:315
>>>>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>
>>>>>>>> Thread 3 (Thread 31145):
>>>>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b030b8) at
>>>>>>>> pthread_mutex_lock.c:321
>>>>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b030b8) at
>>>>>>>> pthread_mutex_lock.c:55
>>>>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>>>>
>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>>>>
>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>>>>> #5  0x77cc5940 in start_thread (arg=0x766ff490) at pthread_create.c:315
>>>>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>
>>>>>>>> Thread 2 (Thread 6970):
>>>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>>>>
>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>>>>
>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>> #3  0x10000b64 in create_delete (a=0, b=0, c=0, d=0) at
>>>>>>>> delete_child_hangs.c:31
>>>>>>>> #4  0x77d296fc in task_trampoline () from
>>>>>>>>
>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>> #5  0x77cf7d84 in thread_trampoline () from
>>>>>>>>
>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>> #6  0x77cc592c in start_thread (arg=0x77a00490) at pthread_create.c:310
>>>>>>>> #7  0x77bf1bbc in __thread_start () from
>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>
>>>>>>>> Thread 1 (Thread 6969):
>>>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>>>>
>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>>>>
>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>> #3  0x10000c00 in main (argc=1, argv=0x10011030) at
>>>>>>>> delete_child_hangs.c:43
>>>>>>>>
>>>>>>>
>>>>>>> Ok, unless we've entered the twilight zone, the only reason I could see this
>>>>>>> happening would be that the thread prologue somehow gets hit by an async
>>>>>>> cancellation signal while holding its own lock. If so, then the patch below
>>>>>>> would cause an assertion to trigger in such circumstance. You will have to
>>>>>>> turn on --enable-assert in your configuration, leaving debugging entirely
>>>>>>> off not to significantly affect the current timings.
>>>>>>>
>>>>>>> diff --git a/include/boilerplate/lock.h b/include/boilerplate/lock.h
>>>>>>> index 6f0218c..b704987 100644
>>>>>>> --- a/include/boilerplate/lock.h
>>>>>>> +++ b/include/boilerplate/lock.h
>>>>>>> @@ -206,7 +206,7 @@ int __check_cancel_type(const char *locktype);
>>>>>>>  #define read_unlock_safe(__lock, __state)      \
>>>>>>>         __do_unlock_safe(__lock, __state)
>>>>>>>
>>>>>>> -#ifdef CONFIG_XENO_DEBUG
>>>>>>> +#ifndef NDEBUG
>>>>>>>  #define mutex_type_attribute PTHREAD_MUTEX_ERRORCHECK
>>>>>>>  #else
>>>>>>>  #define mutex_type_attribute PTHREAD_MUTEX_NORMAL
>>>>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>>>>> index 05bb6cb..6547460 100644
>>>>>>> --- a/lib/copperplate/threadobj.c
>>>>>>> +++ b/lib/copperplate/threadobj.c
>>>>>>> @@ -1310,6 +1310,7 @@ int threadobj_cancel(struct threadobj *thobj)
>>>>>>>  static void finalize_thread(void *p) /* thobj->lock free */
>>>>>>>  {
>>>>>>>         struct threadobj *thobj = p;
>>>>>>> +       int ret;
>>>>>>>
>>>>>>>         if (thobj == NULL || thobj == THREADOBJ_IRQCONTEXT)
>>>>>>>                 return;
>>>>>>> @@ -1343,7 +1344,9 @@ static void finalize_thread(void *p) /* thobj->lock
>>>>>>> free */
>>>>>>>          * waiting for us to start, pending on
>>>>>>>          * wait_on_barrier(). Instead, hand it over to this thread.
>>>>>>>          */
>>>>>>> -       threadobj_lock(thobj);
>>>>>>> +       ret = threadobj_lock(thobj);
>>>>>>> +       assert(ret == 0);
>>>>>>> +       (void)ret;
>>>>>>>         if ((thobj->status & __THREAD_S_SAFE) == 0) {
>>>>>>>                 threadobj_unlock(thobj);
>>>>>>>                 destroy_thread(thobj);
>>>>>>>
>>>>>>
>>>>>> And...the assert kicks in!
>>>>>>
>>>>>> threadobj.c:1348: finalize_thread: Assertion `ret == 0' failed.
>>>>>
>>>>>
>>>>> Philippe,
>>>>>
>>>>> Could you explain this case of the async cancellation signal? I don't
>>>>> fully understand how this happens or why it causes the issue.
>>>>>
>>>>> I also tested once with CONFIG_XENO_ASYNC_CANCEL OFF and then the
>>>>> issue doesn't occur anymore.
>>>>>
>>>>>
>>>>
>>>> I have not been able to reproduce this race yet, but still the following patch would make sense:
>>>>
>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>> index 05bb6cb..48aa032 100644
>>>> --- a/lib/copperplate/threadobj.c
>>>> +++ b/lib/copperplate/threadobj.c
>>>> @@ -1045,9 +1045,11 @@ static int wait_on_barrier(struct threadobj *thobj, int mask)
>>>>                 if (status & mask)
>>>>                         break;
>>>>                 oldstate = thobj->cancel_state;
>>>> +               push_cleanup_lock(&thobj->lock);
>>>>                 __threadobj_tag_unlocked(thobj);
>>>>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>>>>                 __threadobj_tag_locked(thobj);
>>>> +               pop_cleanup_lock(&thobj->lock);
>>>>                 thobj->cancel_state = oldstate;
>>>>         }
>>>>
>>>> @@ -1243,9 +1245,11 @@ static void cancel_sync(struct threadobj *thobj) /* thobj->lock held */
>>>>          */
>>>>         while (thobj->status & __THREAD_S_WARMUP) {
>>>>                 oldstate = thobj->cancel_state;
>>>> +               push_cleanup_lock(&thobj->lock);
>>>>                 __threadobj_tag_unlocked(thobj);
>>>>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>>>>                 __threadobj_tag_locked(thobj);
>>>> +               pop_cleanup_lock(&thobj->lock);
>>>>                 thobj->cancel_state = oldstate;
>>>>         }
>>>>
>>>
>>> I have tried this patch but it does not seem to make any difference.
>>> The threads are still deadlocked at the same location.
>>>
>>> I've only tested this on my setup with the old toolchain/kernel so
>>> far. I'll check also with the newer version.
>>>
>>
>> Ok, thanks for testing. I would not bother testing with another
>> toolchain, I don't buy the toolchain bug in this case. This looks like a
>> common race due to the async nature of the cancellation event which is
>> released while the caller holds its own lock without proper cleanup
>> actions. I'll dig deeper.
>>
> 
> Okay, in case there is anything you want me to try out, don't hesitate
> to tell me.
> 
> Thanks a lot already!
> 
> 

I could reproduce the issue eventually on your exact test code, by
dropping the cpu affinity from the test options (--cpu-affinity) on a
multi-core system. The patch below should paper over the current
manifestation of this bug, but the root issue which bites here lives way
below this API layer, and is clearly SMP-related. Are you running a
multi-core board?

diff --git a/lib/psos/task.c b/lib/psos/task.c
index 89ba917..1c37cc7 100644
--- a/lib/psos/task.c
+++ b/lib/psos/task.c
@@ -202,9 +202,10 @@ static void *task_trampoline(void *arg)

 	threadobj_unlock(&task->thobj);

+	threadobj_notify_entry();
+
 	CANCEL_RESTORE(svc);

-	threadobj_notify_entry();
 	args->entry(args->arg0, args->arg1, args->arg2, args->arg3);
 	threadobj_lock(&task->thobj);
 	threadobj_set_magic(&task->thobj, ~task_magic);

-- 
Philippe.


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [Xenomai] Xenomai-forge: Deadlock in finalize_thread()
  2014-08-04 17:06                           ` Philippe Gerum
@ 2014-08-05  7:37                             ` Kim De Mey
  2014-08-05 10:35                               ` Philippe Gerum
  2014-08-05 10:45                               ` Philippe Gerum
  0 siblings, 2 replies; 25+ messages in thread
From: Kim De Mey @ 2014-08-05  7:37 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: Xenomai@xenomai.org

2014-08-04 19:06 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
> On 07/29/2014 09:31 AM, Kim De Mey wrote:
>> 2014-07-28 23:52 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>> On 07/28/2014 05:18 PM, Kim De Mey wrote:
>>>> 2014-07-28 13:09 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>> On 07/23/2014 10:31 AM, Kim De Mey wrote:
>>>>>> 2014-07-09 17:05 GMT+02:00 Kim De Mey <kim.demey@gmail.com>:
>>>>>>>
>>>>>>> 2014-07-09 16:16 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>>>>> On 07/09/2014 12:23 PM, Kim De Mey wrote:
>>>>>>>>>
>>>>>>>>> 2014-07-09 11:25 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>>>>>>>
>>>>>>>>>> On 07/09/2014 10:09 AM, Kim De Mey wrote:
>>>>>>>>>>
>>>>>>>>>>> Newer version --dump-config output:
>>>>>>>>>>> ...
>>>>>>>>>>> CONFIG_XENO_COMPILER="gcc version 4.7.0 (Cavium Inc. Version:
>>>>>>>>>>> SDK_3_1_0 build 27) "
>>>>>>>>>>> ...
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks. I can't reproduce this issue yet, so posting the current gdb
>>>>>>>>>> backtraces for all the threads still shown by "info threads" after a
>>>>>>>>>> deadlock would help.
>>>>>>>>>>
>>>>>>>>>> TIA,
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Below backtraces of all the threads. It is a case with two "worker"
>>>>>>>>> tasks in deadlock. The "main" and "create_delete" tasks continued to
>>>>>>>>> their "tm_wkafter" loop.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> (gdb) thread a a bt
>>>>>>>>>
>>>>>>>>> Thread 4 (Thread 9406):
>>>>>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b03300) at
>>>>>>>>> pthread_mutex_lock.c:321
>>>>>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b03300) at
>>>>>>>>> pthread_mutex_lock.c:55
>>>>>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>>>>>
>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>>>>>
>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>>>>>> #5  0x77cc5940 in start_thread (arg=0x75dff490) at pthread_create.c:315
>>>>>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>
>>>>>>>>> Thread 3 (Thread 31145):
>>>>>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b030b8) at
>>>>>>>>> pthread_mutex_lock.c:321
>>>>>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b030b8) at
>>>>>>>>> pthread_mutex_lock.c:55
>>>>>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>>>>>
>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>>>>>
>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>>>>>> #5  0x77cc5940 in start_thread (arg=0x766ff490) at pthread_create.c:315
>>>>>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>
>>>>>>>>> Thread 2 (Thread 6970):
>>>>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>>>>>
>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>>>>>
>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>> #3  0x10000b64 in create_delete (a=0, b=0, c=0, d=0) at
>>>>>>>>> delete_child_hangs.c:31
>>>>>>>>> #4  0x77d296fc in task_trampoline () from
>>>>>>>>>
>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>> #5  0x77cf7d84 in thread_trampoline () from
>>>>>>>>>
>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>> #6  0x77cc592c in start_thread (arg=0x77a00490) at pthread_create.c:310
>>>>>>>>> #7  0x77bf1bbc in __thread_start () from
>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>
>>>>>>>>> Thread 1 (Thread 6969):
>>>>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>>>>>
>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>>>>>
>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>> #3  0x10000c00 in main (argc=1, argv=0x10011030) at
>>>>>>>>> delete_child_hangs.c:43
>>>>>>>>>
>>>>>>>>
>>>>>>>> Ok, unless we've entered the twilight zone, the only reason I could see this
>>>>>>>> happening would be that the thread prologue somehow gets hit by an async
>>>>>>>> cancellation signal while holding its own lock. If so, then the patch below
>>>>>>>> would cause an assertion to trigger in such circumstance. You will have to
>>>>>>>> turn on --enable-assert in your configuration, leaving debugging entirely
>>>>>>>> off not to significantly affect the current timings.
>>>>>>>>
>>>>>>>> diff --git a/include/boilerplate/lock.h b/include/boilerplate/lock.h
>>>>>>>> index 6f0218c..b704987 100644
>>>>>>>> --- a/include/boilerplate/lock.h
>>>>>>>> +++ b/include/boilerplate/lock.h
>>>>>>>> @@ -206,7 +206,7 @@ int __check_cancel_type(const char *locktype);
>>>>>>>>  #define read_unlock_safe(__lock, __state)      \
>>>>>>>>         __do_unlock_safe(__lock, __state)
>>>>>>>>
>>>>>>>> -#ifdef CONFIG_XENO_DEBUG
>>>>>>>> +#ifndef NDEBUG
>>>>>>>>  #define mutex_type_attribute PTHREAD_MUTEX_ERRORCHECK
>>>>>>>>  #else
>>>>>>>>  #define mutex_type_attribute PTHREAD_MUTEX_NORMAL
>>>>>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>>>>>> index 05bb6cb..6547460 100644
>>>>>>>> --- a/lib/copperplate/threadobj.c
>>>>>>>> +++ b/lib/copperplate/threadobj.c
>>>>>>>> @@ -1310,6 +1310,7 @@ int threadobj_cancel(struct threadobj *thobj)
>>>>>>>>  static void finalize_thread(void *p) /* thobj->lock free */
>>>>>>>>  {
>>>>>>>>         struct threadobj *thobj = p;
>>>>>>>> +       int ret;
>>>>>>>>
>>>>>>>>         if (thobj == NULL || thobj == THREADOBJ_IRQCONTEXT)
>>>>>>>>                 return;
>>>>>>>> @@ -1343,7 +1344,9 @@ static void finalize_thread(void *p) /* thobj->lock
>>>>>>>> free */
>>>>>>>>          * waiting for us to start, pending on
>>>>>>>>          * wait_on_barrier(). Instead, hand it over to this thread.
>>>>>>>>          */
>>>>>>>> -       threadobj_lock(thobj);
>>>>>>>> +       ret = threadobj_lock(thobj);
>>>>>>>> +       assert(ret == 0);
>>>>>>>> +       (void)ret;
>>>>>>>>         if ((thobj->status & __THREAD_S_SAFE) == 0) {
>>>>>>>>                 threadobj_unlock(thobj);
>>>>>>>>                 destroy_thread(thobj);
>>>>>>>>
>>>>>>>
>>>>>>> And...the assert kicks in!
>>>>>>>
>>>>>>> threadobj.c:1348: finalize_thread: Assertion `ret == 0' failed.
>>>>>>
>>>>>>
>>>>>> Philippe,
>>>>>>
>>>>>> Could you explain this case of the async cancellation signal? I don't
>>>>>> fully understand how this happens or why it causes the issue.
>>>>>>
>>>>>> I also tested once with CONFIG_XENO_ASYNC_CANCEL OFF and then the
>>>>>> issue doesn't occur anymore.
>>>>>>
>>>>>>
>>>>>
>>>>> I have not been able to reproduce this race yet, but still the following patch would make sense:
>>>>>
>>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>>> index 05bb6cb..48aa032 100644
>>>>> --- a/lib/copperplate/threadobj.c
>>>>> +++ b/lib/copperplate/threadobj.c
>>>>> @@ -1045,9 +1045,11 @@ static int wait_on_barrier(struct threadobj *thobj, int mask)
>>>>>                 if (status & mask)
>>>>>                         break;
>>>>>                 oldstate = thobj->cancel_state;
>>>>> +               push_cleanup_lock(&thobj->lock);
>>>>>                 __threadobj_tag_unlocked(thobj);
>>>>>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>>>>>                 __threadobj_tag_locked(thobj);
>>>>> +               pop_cleanup_lock(&thobj->lock);
>>>>>                 thobj->cancel_state = oldstate;
>>>>>         }
>>>>>
>>>>> @@ -1243,9 +1245,11 @@ static void cancel_sync(struct threadobj *thobj) /* thobj->lock held */
>>>>>          */
>>>>>         while (thobj->status & __THREAD_S_WARMUP) {
>>>>>                 oldstate = thobj->cancel_state;
>>>>> +               push_cleanup_lock(&thobj->lock);
>>>>>                 __threadobj_tag_unlocked(thobj);
>>>>>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>>>>>                 __threadobj_tag_locked(thobj);
>>>>> +               pop_cleanup_lock(&thobj->lock);
>>>>>                 thobj->cancel_state = oldstate;
>>>>>         }
>>>>>
>>>>
>>>> I have tried this patch but it does not seem to make any difference.
>>>> The threads are still deadlocked at the same location.
>>>>
>>>> I've only tested this on my setup with the old toolchain/kernel so
>>>> far. I'll check also with the newer version.
>>>>
>>>
>>> Ok, thanks for testing. I would not bother testing with another
>>> toolchain, I don't buy the toolchain bug in this case. This looks like a
>>> common race due to the async nature of the cancellation event which is
>>> released while the caller holds its own lock without proper cleanup
>>> actions. I'll dig deeper.
>>>
>>
>> Okay, in case there is anything you want me to try out, don't hesitate
>> to tell me.
>>
>> Thanks a lot already!
>>
>>
>
> I could reproduce the issue eventually on your exact test code, by
> dropping the cpu affinity from the test options (--cpu-affinity) on a
> multi-core system. The patch below should paper over the current
> manifestation of this bug, but the root issue which bites here lives way
> below this API layer, and is clearly SMP-related. Are you running a
> multi-core board?

I have tried the test with your patch and the issue no longer occurs.
Yes, I am indeed running a multi-core board (cpu: Cavium Octeon II - 8
cores).
The issue does not occur if I run the board with only 1 core activated.

>
> diff --git a/lib/psos/task.c b/lib/psos/task.c
> index 89ba917..1c37cc7 100644
> --- a/lib/psos/task.c
> +++ b/lib/psos/task.c
> @@ -202,9 +202,10 @@ static void *task_trampoline(void *arg)
>
>         threadobj_unlock(&task->thobj);
>
> +       threadobj_notify_entry();
> +
>         CANCEL_RESTORE(svc);
>
> -       threadobj_notify_entry();
>         args->entry(args->arg0, args->arg1, args->arg2, args->arg3);
>         threadobj_lock(&task->thobj);
>         threadobj_set_magic(&task->thobj, ~task_magic);
>
> --
> Philippe.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Xenomai] Xenomai-forge: Deadlock in finalize_thread()
  2014-08-05  7:37                             ` Kim De Mey
@ 2014-08-05 10:35                               ` Philippe Gerum
  2014-08-05 10:38                                 ` Philippe Gerum
  2014-08-05 10:45                               ` Philippe Gerum
  1 sibling, 1 reply; 25+ messages in thread
From: Philippe Gerum @ 2014-08-05 10:35 UTC (permalink / raw)
  To: Kim De Mey; +Cc: Xenomai@xenomai.org

On 08/05/2014 09:37 AM, Kim De Mey wrote:
> 2014-08-04 19:06 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>> On 07/29/2014 09:31 AM, Kim De Mey wrote:
>>> 2014-07-28 23:52 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>> On 07/28/2014 05:18 PM, Kim De Mey wrote:
>>>>> 2014-07-28 13:09 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>>> On 07/23/2014 10:31 AM, Kim De Mey wrote:
>>>>>>> 2014-07-09 17:05 GMT+02:00 Kim De Mey <kim.demey@gmail.com>:
>>>>>>>>
>>>>>>>> 2014-07-09 16:16 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>>>>>> On 07/09/2014 12:23 PM, Kim De Mey wrote:
>>>>>>>>>>
>>>>>>>>>> 2014-07-09 11:25 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>>>>>>>>
>>>>>>>>>>> On 07/09/2014 10:09 AM, Kim De Mey wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Newer version --dump-config output:
>>>>>>>>>>>> ...
>>>>>>>>>>>> CONFIG_XENO_COMPILER="gcc version 4.7.0 (Cavium Inc. Version:
>>>>>>>>>>>> SDK_3_1_0 build 27) "
>>>>>>>>>>>> ...
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks. I can't reproduce this issue yet, so posting the current gdb
>>>>>>>>>>> backtraces for all the threads still shown by "info threads" after a
>>>>>>>>>>> deadlock would help.
>>>>>>>>>>>
>>>>>>>>>>> TIA,
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Below backtraces of all the threads. It is a case with two "worker"
>>>>>>>>>> tasks in deadlock. The "main" and "create_delete" tasks continued to
>>>>>>>>>> their "tm_wkafter" loop.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> (gdb) thread a a bt
>>>>>>>>>>
>>>>>>>>>> Thread 4 (Thread 9406):
>>>>>>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b03300) at
>>>>>>>>>> pthread_mutex_lock.c:321
>>>>>>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b03300) at
>>>>>>>>>> pthread_mutex_lock.c:55
>>>>>>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>>>>>>> #5  0x77cc5940 in start_thread (arg=0x75dff490) at pthread_create.c:315
>>>>>>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>>
>>>>>>>>>> Thread 3 (Thread 31145):
>>>>>>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b030b8) at
>>>>>>>>>> pthread_mutex_lock.c:321
>>>>>>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b030b8) at
>>>>>>>>>> pthread_mutex_lock.c:55
>>>>>>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>>>>>>> #5  0x77cc5940 in start_thread (arg=0x766ff490) at pthread_create.c:315
>>>>>>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>>
>>>>>>>>>> Thread 2 (Thread 6970):
>>>>>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>>>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>>> #3  0x10000b64 in create_delete (a=0, b=0, c=0, d=0) at
>>>>>>>>>> delete_child_hangs.c:31
>>>>>>>>>> #4  0x77d296fc in task_trampoline () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>>> #5  0x77cf7d84 in thread_trampoline () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>> #6  0x77cc592c in start_thread (arg=0x77a00490) at pthread_create.c:310
>>>>>>>>>> #7  0x77bf1bbc in __thread_start () from
>>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>>
>>>>>>>>>> Thread 1 (Thread 6969):
>>>>>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>>>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>>> #3  0x10000c00 in main (argc=1, argv=0x10011030) at
>>>>>>>>>> delete_child_hangs.c:43
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Ok, unless we've entered the twilight zone, the only reason I could see this
>>>>>>>>> happening would be that the thread prologue somehow gets hit by an async
>>>>>>>>> cancellation signal while holding its own lock. If so, then the patch below
>>>>>>>>> would cause an assertion to trigger in such circumstance. You will have to
>>>>>>>>> turn on --enable-assert in your configuration, leaving debugging entirely
>>>>>>>>> off not to significantly affect the current timings.
>>>>>>>>>
>>>>>>>>> diff --git a/include/boilerplate/lock.h b/include/boilerplate/lock.h
>>>>>>>>> index 6f0218c..b704987 100644
>>>>>>>>> --- a/include/boilerplate/lock.h
>>>>>>>>> +++ b/include/boilerplate/lock.h
>>>>>>>>> @@ -206,7 +206,7 @@ int __check_cancel_type(const char *locktype);
>>>>>>>>>  #define read_unlock_safe(__lock, __state)      \
>>>>>>>>>         __do_unlock_safe(__lock, __state)
>>>>>>>>>
>>>>>>>>> -#ifdef CONFIG_XENO_DEBUG
>>>>>>>>> +#ifndef NDEBUG
>>>>>>>>>  #define mutex_type_attribute PTHREAD_MUTEX_ERRORCHECK
>>>>>>>>>  #else
>>>>>>>>>  #define mutex_type_attribute PTHREAD_MUTEX_NORMAL
>>>>>>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>>>>>>> index 05bb6cb..6547460 100644
>>>>>>>>> --- a/lib/copperplate/threadobj.c
>>>>>>>>> +++ b/lib/copperplate/threadobj.c
>>>>>>>>> @@ -1310,6 +1310,7 @@ int threadobj_cancel(struct threadobj *thobj)
>>>>>>>>>  static void finalize_thread(void *p) /* thobj->lock free */
>>>>>>>>>  {
>>>>>>>>>         struct threadobj *thobj = p;
>>>>>>>>> +       int ret;
>>>>>>>>>
>>>>>>>>>         if (thobj == NULL || thobj == THREADOBJ_IRQCONTEXT)
>>>>>>>>>                 return;
>>>>>>>>> @@ -1343,7 +1344,9 @@ static void finalize_thread(void *p) /* thobj->lock
>>>>>>>>> free */
>>>>>>>>>          * waiting for us to start, pending on
>>>>>>>>>          * wait_on_barrier(). Instead, hand it over to this thread.
>>>>>>>>>          */
>>>>>>>>> -       threadobj_lock(thobj);
>>>>>>>>> +       ret = threadobj_lock(thobj);
>>>>>>>>> +       assert(ret == 0);
>>>>>>>>> +       (void)ret;
>>>>>>>>>         if ((thobj->status & __THREAD_S_SAFE) == 0) {
>>>>>>>>>                 threadobj_unlock(thobj);
>>>>>>>>>                 destroy_thread(thobj);
>>>>>>>>>
>>>>>>>>
>>>>>>>> And...the assert kicks in!
>>>>>>>>
>>>>>>>> threadobj.c:1348: finalize_thread: Assertion `ret == 0' failed.
>>>>>>>
>>>>>>>
>>>>>>> Philippe,
>>>>>>>
>>>>>>> Could you explain this case of the async cancellation signal? I don't
>>>>>>> fully understand how this happens or why it causes the issue.
>>>>>>>
>>>>>>> I also tested once with CONFIG_XENO_ASYNC_CANCEL OFF and then the
>>>>>>> issue doesn't occur anymore.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> I have not been able to reproduce this race yet, but still the following patch would make sense:
>>>>>>
>>>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>>>> index 05bb6cb..48aa032 100644
>>>>>> --- a/lib/copperplate/threadobj.c
>>>>>> +++ b/lib/copperplate/threadobj.c
>>>>>> @@ -1045,9 +1045,11 @@ static int wait_on_barrier(struct threadobj *thobj, int mask)
>>>>>>                 if (status & mask)
>>>>>>                         break;
>>>>>>                 oldstate = thobj->cancel_state;
>>>>>> +               push_cleanup_lock(&thobj->lock);
>>>>>>                 __threadobj_tag_unlocked(thobj);
>>>>>>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>>>>>>                 __threadobj_tag_locked(thobj);
>>>>>> +               pop_cleanup_lock(&thobj->lock);
>>>>>>                 thobj->cancel_state = oldstate;
>>>>>>         }
>>>>>>
>>>>>> @@ -1243,9 +1245,11 @@ static void cancel_sync(struct threadobj *thobj) /* thobj->lock held */
>>>>>>          */
>>>>>>         while (thobj->status & __THREAD_S_WARMUP) {
>>>>>>                 oldstate = thobj->cancel_state;
>>>>>> +               push_cleanup_lock(&thobj->lock);
>>>>>>                 __threadobj_tag_unlocked(thobj);
>>>>>>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>>>>>>                 __threadobj_tag_locked(thobj);
>>>>>> +               pop_cleanup_lock(&thobj->lock);
>>>>>>                 thobj->cancel_state = oldstate;
>>>>>>         }
>>>>>>
>>>>>
>>>>> I have tried this patch but it does not seem to make any difference.
>>>>> The threads are still deadlocked at the same location.
>>>>>
>>>>> I've only tested this on my setup with the old toolchain/kernel so
>>>>> far. I'll check also with the newer version.
>>>>>
>>>>
>>>> Ok, thanks for testing. I would not bother testing with another
>>>> toolchain, I don't buy the toolchain bug in this case. This looks like a
>>>> common race due to the async nature of the cancellation event which is
>>>> released while the caller holds its own lock without proper cleanup
>>>> actions. I'll dig deeper.
>>>>
>>>
>>> Okay, in case there is anything you want me to try out, don't hesitate
>>> to tell me.
>>>
>>> Thanks a lot already!
>>>
>>>
>>
>> I could reproduce the issue eventually on your exact test code, by
>> dropping the cpu affinity from the test options (--cpu-affinity) on a
>> multi-core system. The patch below should paper over the current
>> manifestation of this bug, but the root issue which bites here lives way
>> below this API layer, and is clearly SMP-related. Are you running a
>> multi-core board?
> 
> I have tried the test with your patch and the issue no longer occurs.
> Yes, I am indeed running a multi-core board (cpu: Cavium Octeon II - 8
> cores).
> The issue does not occur if I run the board with only 1 core activated.
> 

It looks like something may be wrong in the way glibc handles the
cancellability state of threads in SMP mode. Unless I'm completely
off-base reading the glibc code, this issue opens a nasty race with code
running SMP with async cancellation enabled, solely relying on
pthread_setcancelstate() to turn it off temporarily. Granted, async
cancellation is inherently unsafe, but still, something looks wrong.

Anyway, please pull
http://git.xenomai.org/xenomai-forge.git/commit/?h=next&id=7c4162315b8b970f9b82d53ddec8c910e15ff81a,
and let's see how this goes.

Thanks,

-- 
Philippe.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Xenomai] Xenomai-forge: Deadlock in finalize_thread()
  2014-08-05 10:35                               ` Philippe Gerum
@ 2014-08-05 10:38                                 ` Philippe Gerum
  2014-08-05 14:16                                   ` Kim De Mey
  0 siblings, 1 reply; 25+ messages in thread
From: Philippe Gerum @ 2014-08-05 10:38 UTC (permalink / raw)
  To: Kim De Mey; +Cc: Xenomai@xenomai.org

On 08/05/2014 12:35 PM, Philippe Gerum wrote:
> On 08/05/2014 09:37 AM, Kim De Mey wrote:
>> 2014-08-04 19:06 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>> On 07/29/2014 09:31 AM, Kim De Mey wrote:
>>>> 2014-07-28 23:52 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>> On 07/28/2014 05:18 PM, Kim De Mey wrote:
>>>>>> 2014-07-28 13:09 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>>>> On 07/23/2014 10:31 AM, Kim De Mey wrote:
>>>>>>>> 2014-07-09 17:05 GMT+02:00 Kim De Mey <kim.demey@gmail.com>:
>>>>>>>>>
>>>>>>>>> 2014-07-09 16:16 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>>>>>>> On 07/09/2014 12:23 PM, Kim De Mey wrote:
>>>>>>>>>>>
>>>>>>>>>>> 2014-07-09 11:25 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>>>>>>>>>
>>>>>>>>>>>> On 07/09/2014 10:09 AM, Kim De Mey wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Newer version --dump-config output:
>>>>>>>>>>>>> ...
>>>>>>>>>>>>> CONFIG_XENO_COMPILER="gcc version 4.7.0 (Cavium Inc. Version:
>>>>>>>>>>>>> SDK_3_1_0 build 27) "
>>>>>>>>>>>>> ...
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks. I can't reproduce this issue yet, so posting the current gdb
>>>>>>>>>>>> backtraces for all the threads still shown by "info threads" after a
>>>>>>>>>>>> deadlock would help.
>>>>>>>>>>>>
>>>>>>>>>>>> TIA,
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Below backtraces of all the threads. It is a case with two "worker"
>>>>>>>>>>> tasks in deadlock. The "main" and "create_delete" tasks continued to
>>>>>>>>>>> their "tm_wkafter" loop.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> (gdb) thread a a bt
>>>>>>>>>>>
>>>>>>>>>>> Thread 4 (Thread 9406):
>>>>>>>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b03300) at
>>>>>>>>>>> pthread_mutex_lock.c:321
>>>>>>>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b03300) at
>>>>>>>>>>> pthread_mutex_lock.c:55
>>>>>>>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>>>>>>>
>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>>>>>>>
>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>>>>>>>> #5  0x77cc5940 in start_thread (arg=0x75dff490) at pthread_create.c:315
>>>>>>>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>>>
>>>>>>>>>>> Thread 3 (Thread 31145):
>>>>>>>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b030b8) at
>>>>>>>>>>> pthread_mutex_lock.c:321
>>>>>>>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b030b8) at
>>>>>>>>>>> pthread_mutex_lock.c:55
>>>>>>>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>>>>>>>
>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>>>>>>>
>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>>>>>>>> #5  0x77cc5940 in start_thread (arg=0x766ff490) at pthread_create.c:315
>>>>>>>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>>>
>>>>>>>>>>> Thread 2 (Thread 6970):
>>>>>>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>>>>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>>>>>>>
>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>>>>>>>
>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>>>> #3  0x10000b64 in create_delete (a=0, b=0, c=0, d=0) at
>>>>>>>>>>> delete_child_hangs.c:31
>>>>>>>>>>> #4  0x77d296fc in task_trampoline () from
>>>>>>>>>>>
>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>>>> #5  0x77cf7d84 in thread_trampoline () from
>>>>>>>>>>>
>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>> #6  0x77cc592c in start_thread (arg=0x77a00490) at pthread_create.c:310
>>>>>>>>>>> #7  0x77bf1bbc in __thread_start () from
>>>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>>>
>>>>>>>>>>> Thread 1 (Thread 6969):
>>>>>>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>>>>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>>>>>>>
>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>>>>>>>
>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>>>> #3  0x10000c00 in main (argc=1, argv=0x10011030) at
>>>>>>>>>>> delete_child_hangs.c:43
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Ok, unless we've entered the twilight zone, the only reason I could see this
>>>>>>>>>> happening would be that the thread prologue somehow gets hit by an async
>>>>>>>>>> cancellation signal while holding its own lock. If so, then the patch below
>>>>>>>>>> would cause an assertion to trigger in such circumstance. You will have to
>>>>>>>>>> turn on --enable-assert in your configuration, leaving debugging entirely
>>>>>>>>>> off not to significantly affect the current timings.
>>>>>>>>>>
>>>>>>>>>> diff --git a/include/boilerplate/lock.h b/include/boilerplate/lock.h
>>>>>>>>>> index 6f0218c..b704987 100644
>>>>>>>>>> --- a/include/boilerplate/lock.h
>>>>>>>>>> +++ b/include/boilerplate/lock.h
>>>>>>>>>> @@ -206,7 +206,7 @@ int __check_cancel_type(const char *locktype);
>>>>>>>>>>  #define read_unlock_safe(__lock, __state)      \
>>>>>>>>>>         __do_unlock_safe(__lock, __state)
>>>>>>>>>>
>>>>>>>>>> -#ifdef CONFIG_XENO_DEBUG
>>>>>>>>>> +#ifndef NDEBUG
>>>>>>>>>>  #define mutex_type_attribute PTHREAD_MUTEX_ERRORCHECK
>>>>>>>>>>  #else
>>>>>>>>>>  #define mutex_type_attribute PTHREAD_MUTEX_NORMAL
>>>>>>>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>>>>>>>> index 05bb6cb..6547460 100644
>>>>>>>>>> --- a/lib/copperplate/threadobj.c
>>>>>>>>>> +++ b/lib/copperplate/threadobj.c
>>>>>>>>>> @@ -1310,6 +1310,7 @@ int threadobj_cancel(struct threadobj *thobj)
>>>>>>>>>>  static void finalize_thread(void *p) /* thobj->lock free */
>>>>>>>>>>  {
>>>>>>>>>>         struct threadobj *thobj = p;
>>>>>>>>>> +       int ret;
>>>>>>>>>>
>>>>>>>>>>         if (thobj == NULL || thobj == THREADOBJ_IRQCONTEXT)
>>>>>>>>>>                 return;
>>>>>>>>>> @@ -1343,7 +1344,9 @@ static void finalize_thread(void *p) /* thobj->lock
>>>>>>>>>> free */
>>>>>>>>>>          * waiting for us to start, pending on
>>>>>>>>>>          * wait_on_barrier(). Instead, hand it over to this thread.
>>>>>>>>>>          */
>>>>>>>>>> -       threadobj_lock(thobj);
>>>>>>>>>> +       ret = threadobj_lock(thobj);
>>>>>>>>>> +       assert(ret == 0);
>>>>>>>>>> +       (void)ret;
>>>>>>>>>>         if ((thobj->status & __THREAD_S_SAFE) == 0) {
>>>>>>>>>>                 threadobj_unlock(thobj);
>>>>>>>>>>                 destroy_thread(thobj);
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> And...the assert kicks in!
>>>>>>>>>
>>>>>>>>> threadobj.c:1348: finalize_thread: Assertion `ret == 0' failed.
>>>>>>>>
>>>>>>>>
>>>>>>>> Philippe,
>>>>>>>>
>>>>>>>> Could you explain this case of the async cancellation signal? I don't
>>>>>>>> fully understand how this happens or why it causes the issue.
>>>>>>>>
>>>>>>>> I also tested once with CONFIG_XENO_ASYNC_CANCEL OFF and then the
>>>>>>>> issue doesn't occur anymore.
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> I have not been able to reproduce this race yet, but still the following patch would make sense:
>>>>>>>
>>>>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>>>>> index 05bb6cb..48aa032 100644
>>>>>>> --- a/lib/copperplate/threadobj.c
>>>>>>> +++ b/lib/copperplate/threadobj.c
>>>>>>> @@ -1045,9 +1045,11 @@ static int wait_on_barrier(struct threadobj *thobj, int mask)
>>>>>>>                 if (status & mask)
>>>>>>>                         break;
>>>>>>>                 oldstate = thobj->cancel_state;
>>>>>>> +               push_cleanup_lock(&thobj->lock);
>>>>>>>                 __threadobj_tag_unlocked(thobj);
>>>>>>>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>>>>>>>                 __threadobj_tag_locked(thobj);
>>>>>>> +               pop_cleanup_lock(&thobj->lock);
>>>>>>>                 thobj->cancel_state = oldstate;
>>>>>>>         }
>>>>>>>
>>>>>>> @@ -1243,9 +1245,11 @@ static void cancel_sync(struct threadobj *thobj) /* thobj->lock held */
>>>>>>>          */
>>>>>>>         while (thobj->status & __THREAD_S_WARMUP) {
>>>>>>>                 oldstate = thobj->cancel_state;
>>>>>>> +               push_cleanup_lock(&thobj->lock);
>>>>>>>                 __threadobj_tag_unlocked(thobj);
>>>>>>>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>>>>>>>                 __threadobj_tag_locked(thobj);
>>>>>>> +               pop_cleanup_lock(&thobj->lock);
>>>>>>>                 thobj->cancel_state = oldstate;
>>>>>>>         }
>>>>>>>
>>>>>>
>>>>>> I have tried this patch but it does not seem to make any difference.
>>>>>> The threads are still deadlocked at the same location.
>>>>>>
>>>>>> I've only tested this on my setup with the old toolchain/kernel so
>>>>>> far. I'll check also with the newer version.
>>>>>>
>>>>>
>>>>> Ok, thanks for testing. I would not bother testing with another
>>>>> toolchain, I don't buy the toolchain bug in this case. This looks like a
>>>>> common race due to the async nature of the cancellation event which is
>>>>> released while the caller holds its own lock without proper cleanup
>>>>> actions. I'll dig deeper.
>>>>>
>>>>
>>>> Okay, in case there is anything you want me to try out, don't hesitate
>>>> to tell me.
>>>>
>>>> Thanks a lot already!
>>>>
>>>>
>>>
>>> I could reproduce the issue eventually on your exact test code, by
>>> dropping the cpu affinity from the test options (--cpu-affinity) on a
>>> multi-core system. The patch below should paper over the current
>>> manifestation of this bug, but the root issue which bites here lives way
>>> below this API layer, and is clearly SMP-related. Are you running a
>>> multi-core board?
>>
>> I have tried the test with your patch and the issue no longer occurs.
>> Yes, I am indeed running a multi-core board (cpu: Cavium Octeon II - 8
>> cores).
>> The issue does not occur if I run the board with only 1 core activated.
>>
> 
> It looks like something may be wrong in the way glibc handles the
> cancellability state of threads in SMP mode. Unless I'm completely
> off-base reading the glibc code, this issue opens a nasty race with code
> running SMP with async cancellation enabled, solely relying on
> pthread_setcancelstate() to turn it off temporarily. Granted, async
> cancellation is inherently unsafe, but still, something looks wrong.
> 
> Anyway, please pull
> http://git.xenomai.org/xenomai-forge.git/commit/?h=next&id=7c4162315b8b970f9b82d53ddec8c910e15ff81a,
> and let's see how this goes.
> 
> Thanks,
> 

This one fixes the commit log,

http://git.xenomai.org/xenomai-forge.git/commit/?h=next&id=75cc5dac959c06bac3a5730ae279ae7a0a4ad271

-- 
Philippe.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Xenomai] Xenomai-forge: Deadlock in finalize_thread()
  2014-08-05  7:37                             ` Kim De Mey
  2014-08-05 10:35                               ` Philippe Gerum
@ 2014-08-05 10:45                               ` Philippe Gerum
  2014-08-05 12:39                                 ` Thomas De Schampheleire
  2014-08-05 14:18                                 ` Kim De Mey
  1 sibling, 2 replies; 25+ messages in thread
From: Philippe Gerum @ 2014-08-05 10:45 UTC (permalink / raw)
  To: Kim De Mey; +Cc: Xenomai@xenomai.org

On 08/05/2014 09:37 AM, Kim De Mey wrote:
> 2014-08-04 19:06 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>> On 07/29/2014 09:31 AM, Kim De Mey wrote:
>>> 2014-07-28 23:52 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>> On 07/28/2014 05:18 PM, Kim De Mey wrote:
>>>>> 2014-07-28 13:09 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>>> On 07/23/2014 10:31 AM, Kim De Mey wrote:
>>>>>>> 2014-07-09 17:05 GMT+02:00 Kim De Mey <kim.demey@gmail.com>:
>>>>>>>>
>>>>>>>> 2014-07-09 16:16 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>>>>>> On 07/09/2014 12:23 PM, Kim De Mey wrote:
>>>>>>>>>>
>>>>>>>>>> 2014-07-09 11:25 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>>>>>>>>
>>>>>>>>>>> On 07/09/2014 10:09 AM, Kim De Mey wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Newer version --dump-config output:
>>>>>>>>>>>> ...
>>>>>>>>>>>> CONFIG_XENO_COMPILER="gcc version 4.7.0 (Cavium Inc. Version:
>>>>>>>>>>>> SDK_3_1_0 build 27) "
>>>>>>>>>>>> ...
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks. I can't reproduce this issue yet, so posting the current gdb
>>>>>>>>>>> backtraces for all the threads still shown by "info threads" after a
>>>>>>>>>>> deadlock would help.
>>>>>>>>>>>
>>>>>>>>>>> TIA,
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Below backtraces of all the threads. It is a case with two "worker"
>>>>>>>>>> tasks in deadlock. The "main" and "create_delete" tasks continued to
>>>>>>>>>> their "tm_wkafter" loop.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> (gdb) thread a a bt
>>>>>>>>>>
>>>>>>>>>> Thread 4 (Thread 9406):
>>>>>>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b03300) at
>>>>>>>>>> pthread_mutex_lock.c:321
>>>>>>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b03300) at
>>>>>>>>>> pthread_mutex_lock.c:55
>>>>>>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>>>>>>> #5  0x77cc5940 in start_thread (arg=0x75dff490) at pthread_create.c:315
>>>>>>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>>
>>>>>>>>>> Thread 3 (Thread 31145):
>>>>>>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b030b8) at
>>>>>>>>>> pthread_mutex_lock.c:321
>>>>>>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b030b8) at
>>>>>>>>>> pthread_mutex_lock.c:55
>>>>>>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>>>>>>> #5  0x77cc5940 in start_thread (arg=0x766ff490) at pthread_create.c:315
>>>>>>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>>
>>>>>>>>>> Thread 2 (Thread 6970):
>>>>>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>>>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>>> #3  0x10000b64 in create_delete (a=0, b=0, c=0, d=0) at
>>>>>>>>>> delete_child_hangs.c:31
>>>>>>>>>> #4  0x77d296fc in task_trampoline () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>>> #5  0x77cf7d84 in thread_trampoline () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>> #6  0x77cc592c in start_thread (arg=0x77a00490) at pthread_create.c:310
>>>>>>>>>> #7  0x77bf1bbc in __thread_start () from
>>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>>
>>>>>>>>>> Thread 1 (Thread 6969):
>>>>>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>>>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>>> #3  0x10000c00 in main (argc=1, argv=0x10011030) at
>>>>>>>>>> delete_child_hangs.c:43
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Ok, unless we've entered the twilight zone, the only reason I could see this
>>>>>>>>> happening would be that the thread prologue somehow gets hit by an async
>>>>>>>>> cancellation signal while holding its own lock. If so, then the patch below
>>>>>>>>> would cause an assertion to trigger in such circumstance. You will have to
>>>>>>>>> turn on --enable-assert in your configuration, leaving debugging entirely
>>>>>>>>> off not to significantly affect the current timings.
>>>>>>>>>
>>>>>>>>> diff --git a/include/boilerplate/lock.h b/include/boilerplate/lock.h
>>>>>>>>> index 6f0218c..b704987 100644
>>>>>>>>> --- a/include/boilerplate/lock.h
>>>>>>>>> +++ b/include/boilerplate/lock.h
>>>>>>>>> @@ -206,7 +206,7 @@ int __check_cancel_type(const char *locktype);
>>>>>>>>>  #define read_unlock_safe(__lock, __state)      \
>>>>>>>>>         __do_unlock_safe(__lock, __state)
>>>>>>>>>
>>>>>>>>> -#ifdef CONFIG_XENO_DEBUG
>>>>>>>>> +#ifndef NDEBUG
>>>>>>>>>  #define mutex_type_attribute PTHREAD_MUTEX_ERRORCHECK
>>>>>>>>>  #else
>>>>>>>>>  #define mutex_type_attribute PTHREAD_MUTEX_NORMAL
>>>>>>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>>>>>>> index 05bb6cb..6547460 100644
>>>>>>>>> --- a/lib/copperplate/threadobj.c
>>>>>>>>> +++ b/lib/copperplate/threadobj.c
>>>>>>>>> @@ -1310,6 +1310,7 @@ int threadobj_cancel(struct threadobj *thobj)
>>>>>>>>>  static void finalize_thread(void *p) /* thobj->lock free */
>>>>>>>>>  {
>>>>>>>>>         struct threadobj *thobj = p;
>>>>>>>>> +       int ret;
>>>>>>>>>
>>>>>>>>>         if (thobj == NULL || thobj == THREADOBJ_IRQCONTEXT)
>>>>>>>>>                 return;
>>>>>>>>> @@ -1343,7 +1344,9 @@ static void finalize_thread(void *p) /* thobj->lock
>>>>>>>>> free */
>>>>>>>>>          * waiting for us to start, pending on
>>>>>>>>>          * wait_on_barrier(). Instead, hand it over to this thread.
>>>>>>>>>          */
>>>>>>>>> -       threadobj_lock(thobj);
>>>>>>>>> +       ret = threadobj_lock(thobj);
>>>>>>>>> +       assert(ret == 0);
>>>>>>>>> +       (void)ret;
>>>>>>>>>         if ((thobj->status & __THREAD_S_SAFE) == 0) {
>>>>>>>>>                 threadobj_unlock(thobj);
>>>>>>>>>                 destroy_thread(thobj);
>>>>>>>>>
>>>>>>>>
>>>>>>>> And...the assert kicks in!
>>>>>>>>
>>>>>>>> threadobj.c:1348: finalize_thread: Assertion `ret == 0' failed.
>>>>>>>
>>>>>>>
>>>>>>> Philippe,
>>>>>>>
>>>>>>> Could you explain this case of the async cancellation signal? I don't
>>>>>>> fully understand how this happens or why it causes the issue.
>>>>>>>
>>>>>>> I also tested once with CONFIG_XENO_ASYNC_CANCEL OFF and then the
>>>>>>> issue doesn't occur anymore.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> I have not been able to reproduce this race yet, but still the following patch would make sense:
>>>>>>
>>>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>>>> index 05bb6cb..48aa032 100644
>>>>>> --- a/lib/copperplate/threadobj.c
>>>>>> +++ b/lib/copperplate/threadobj.c
>>>>>> @@ -1045,9 +1045,11 @@ static int wait_on_barrier(struct threadobj *thobj, int mask)
>>>>>>                 if (status & mask)
>>>>>>                         break;
>>>>>>                 oldstate = thobj->cancel_state;
>>>>>> +               push_cleanup_lock(&thobj->lock);
>>>>>>                 __threadobj_tag_unlocked(thobj);
>>>>>>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>>>>>>                 __threadobj_tag_locked(thobj);
>>>>>> +               pop_cleanup_lock(&thobj->lock);
>>>>>>                 thobj->cancel_state = oldstate;
>>>>>>         }
>>>>>>
>>>>>> @@ -1243,9 +1245,11 @@ static void cancel_sync(struct threadobj *thobj) /* thobj->lock held */
>>>>>>          */
>>>>>>         while (thobj->status & __THREAD_S_WARMUP) {
>>>>>>                 oldstate = thobj->cancel_state;
>>>>>> +               push_cleanup_lock(&thobj->lock);
>>>>>>                 __threadobj_tag_unlocked(thobj);
>>>>>>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>>>>>>                 __threadobj_tag_locked(thobj);
>>>>>> +               pop_cleanup_lock(&thobj->lock);
>>>>>>                 thobj->cancel_state = oldstate;
>>>>>>         }
>>>>>>
>>>>>
>>>>> I have tried this patch but it does not seem to make any difference.
>>>>> The threads are still deadlocked at the same location.
>>>>>
>>>>> I've only tested this on my setup with the old toolchain/kernel so
>>>>> far. I'll check also with the newer version.
>>>>>
>>>>
>>>> Ok, thanks for testing. I would not bother testing with another
>>>> toolchain, I don't buy the toolchain bug in this case. This looks like a
>>>> common race due to the async nature of the cancellation event which is
>>>> released while the caller holds its own lock without proper cleanup
>>>> actions. I'll dig deeper.
>>>>
>>>
>>> Okay, in case there is anything you want me to try out, don't hesitate
>>> to tell me.
>>>
>>> Thanks a lot already!
>>>
>>>
>>
>> I could reproduce the issue eventually on your exact test code, by
>> dropping the cpu affinity from the test options (--cpu-affinity) on a
>> multi-core system. The patch below should paper over the current
>> manifestation of this bug, but the root issue which bites here lives way
>> below this API layer, and is clearly SMP-related. Are you running a
>> multi-core board?
> 
> I have tried the test with your patch and the issue no longer occurs.
> Yes, I am indeed running a multi-core board (cpu: Cavium Octeon II - 8
> cores).

Then make sure to pass --enable-smp to the build configuration line. The
Xenomai 3 libraries issue memory barriers internally in some lockless
operations which are required for proper ordering, but only when this
switch was given.

-- 
Philippe.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Xenomai] Xenomai-forge: Deadlock in finalize_thread()
  2014-08-05 10:45                               ` Philippe Gerum
@ 2014-08-05 12:39                                 ` Thomas De Schampheleire
  2014-08-05 13:22                                   ` Philippe Gerum
  2014-08-05 14:18                                 ` Kim De Mey
  1 sibling, 1 reply; 25+ messages in thread
From: Thomas De Schampheleire @ 2014-08-05 12:39 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: Xenomai@xenomai.org

Philippe,

On Tue, Aug 5, 2014 at 12:45 PM, Philippe Gerum <rpm@xenomai.org> wrote:
>>
>> I have tried the test with your patch and the issue no longer occurs.
>> Yes, I am indeed running a multi-core board (cpu: Cavium Octeon II - 8
>> cores).
>
> Then make sure to pass --enable-smp to the build configuration line. The
> Xenomai 3 libraries issue memory barriers internally in some lockless
> operations which are required for proper ordering, but only when this
> switch was given.
>


I assume that if you pass --enable-smp at the configure stage, but
then run the resulting library on a single-core (or under a 'taskset'
cpu affinity of a single core), things will still work fine, is that
correct?

Thanks,
Thomas


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Xenomai] Xenomai-forge: Deadlock in finalize_thread()
  2014-08-05 12:39                                 ` Thomas De Schampheleire
@ 2014-08-05 13:22                                   ` Philippe Gerum
  0 siblings, 0 replies; 25+ messages in thread
From: Philippe Gerum @ 2014-08-05 13:22 UTC (permalink / raw)
  To: Thomas De Schampheleire; +Cc: Xenomai@xenomai.org

On 08/05/2014 02:39 PM, Thomas De Schampheleire wrote:
> Philippe,
> 
> On Tue, Aug 5, 2014 at 12:45 PM, Philippe Gerum <rpm@xenomai.org> wrote:
>>>
>>> I have tried the test with your patch and the issue no longer occurs.
>>> Yes, I am indeed running a multi-core board (cpu: Cavium Octeon II - 8
>>> cores).
>>
>> Then make sure to pass --enable-smp to the build configuration line. The
>> Xenomai 3 libraries issue memory barriers internally in some lockless
>> operations which are required for proper ordering, but only when this
>> switch was given.
>>
> 
> 
> I assume that if you pass --enable-smp at the configure stage, but
> then run the resulting library on a single-core (or under a 'taskset'
> cpu affinity of a single core), things will still work fine, is that
> correct?
> 

Yes.

-- 
Philippe.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Xenomai] Xenomai-forge: Deadlock in finalize_thread()
  2014-08-05 10:38                                 ` Philippe Gerum
@ 2014-08-05 14:16                                   ` Kim De Mey
  2014-08-05 14:33                                     ` Philippe Gerum
  0 siblings, 1 reply; 25+ messages in thread
From: Kim De Mey @ 2014-08-05 14:16 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: Xenomai@xenomai.org

2014-08-05 12:38 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
> On 08/05/2014 12:35 PM, Philippe Gerum wrote:
>> On 08/05/2014 09:37 AM, Kim De Mey wrote:
>>> 2014-08-04 19:06 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>> On 07/29/2014 09:31 AM, Kim De Mey wrote:
>>>>> 2014-07-28 23:52 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>>> On 07/28/2014 05:18 PM, Kim De Mey wrote:
>>>>>>> 2014-07-28 13:09 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>>>>> On 07/23/2014 10:31 AM, Kim De Mey wrote:
>>>>>>>>> 2014-07-09 17:05 GMT+02:00 Kim De Mey <kim.demey@gmail.com>:
>>>>>>>>>>
>>>>>>>>>> 2014-07-09 16:16 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>>>>>>>> On 07/09/2014 12:23 PM, Kim De Mey wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> 2014-07-09 11:25 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 07/09/2014 10:09 AM, Kim De Mey wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Newer version --dump-config output:
>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>> CONFIG_XENO_COMPILER="gcc version 4.7.0 (Cavium Inc. Version:
>>>>>>>>>>>>>> SDK_3_1_0 build 27) "
>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks. I can't reproduce this issue yet, so posting the current gdb
>>>>>>>>>>>>> backtraces for all the threads still shown by "info threads" after a
>>>>>>>>>>>>> deadlock would help.
>>>>>>>>>>>>>
>>>>>>>>>>>>> TIA,
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Below backtraces of all the threads. It is a case with two "worker"
>>>>>>>>>>>> tasks in deadlock. The "main" and "create_delete" tasks continued to
>>>>>>>>>>>> their "tm_wkafter" loop.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> (gdb) thread a a bt
>>>>>>>>>>>>
>>>>>>>>>>>> Thread 4 (Thread 9406):
>>>>>>>>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b03300) at
>>>>>>>>>>>> pthread_mutex_lock.c:321
>>>>>>>>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b03300) at
>>>>>>>>>>>> pthread_mutex_lock.c:55
>>>>>>>>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>>>>>>>>
>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>>>>>>>>
>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>>>>>>>>> #5  0x77cc5940 in start_thread (arg=0x75dff490) at pthread_create.c:315
>>>>>>>>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>>>>
>>>>>>>>>>>> Thread 3 (Thread 31145):
>>>>>>>>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b030b8) at
>>>>>>>>>>>> pthread_mutex_lock.c:321
>>>>>>>>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b030b8) at
>>>>>>>>>>>> pthread_mutex_lock.c:55
>>>>>>>>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>>>>>>>>
>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>>>>>>>>
>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>>>>>>>>> #5  0x77cc5940 in start_thread (arg=0x766ff490) at pthread_create.c:315
>>>>>>>>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>>>>
>>>>>>>>>>>> Thread 2 (Thread 6970):
>>>>>>>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>>>>>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>>>>>>>>
>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>>>>>>>>
>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>>>>> #3  0x10000b64 in create_delete (a=0, b=0, c=0, d=0) at
>>>>>>>>>>>> delete_child_hangs.c:31
>>>>>>>>>>>> #4  0x77d296fc in task_trampoline () from
>>>>>>>>>>>>
>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>>>>> #5  0x77cf7d84 in thread_trampoline () from
>>>>>>>>>>>>
>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>>> #6  0x77cc592c in start_thread (arg=0x77a00490) at pthread_create.c:310
>>>>>>>>>>>> #7  0x77bf1bbc in __thread_start () from
>>>>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>>>>
>>>>>>>>>>>> Thread 1 (Thread 6969):
>>>>>>>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>>>>>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>>>>>>>>
>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>>>>>>>>
>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>>>>> #3  0x10000c00 in main (argc=1, argv=0x10011030) at
>>>>>>>>>>>> delete_child_hangs.c:43
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Ok, unless we've entered the twilight zone, the only reason I could see this
>>>>>>>>>>> happening would be that the thread prologue somehow gets hit by an async
>>>>>>>>>>> cancellation signal while holding its own lock. If so, then the patch below
>>>>>>>>>>> would cause an assertion to trigger in such circumstance. You will have to
>>>>>>>>>>> turn on --enable-assert in your configuration, leaving debugging entirely
>>>>>>>>>>> off not to significantly affect the current timings.
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/include/boilerplate/lock.h b/include/boilerplate/lock.h
>>>>>>>>>>> index 6f0218c..b704987 100644
>>>>>>>>>>> --- a/include/boilerplate/lock.h
>>>>>>>>>>> +++ b/include/boilerplate/lock.h
>>>>>>>>>>> @@ -206,7 +206,7 @@ int __check_cancel_type(const char *locktype);
>>>>>>>>>>>  #define read_unlock_safe(__lock, __state)      \
>>>>>>>>>>>         __do_unlock_safe(__lock, __state)
>>>>>>>>>>>
>>>>>>>>>>> -#ifdef CONFIG_XENO_DEBUG
>>>>>>>>>>> +#ifndef NDEBUG
>>>>>>>>>>>  #define mutex_type_attribute PTHREAD_MUTEX_ERRORCHECK
>>>>>>>>>>>  #else
>>>>>>>>>>>  #define mutex_type_attribute PTHREAD_MUTEX_NORMAL
>>>>>>>>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>>>>>>>>> index 05bb6cb..6547460 100644
>>>>>>>>>>> --- a/lib/copperplate/threadobj.c
>>>>>>>>>>> +++ b/lib/copperplate/threadobj.c
>>>>>>>>>>> @@ -1310,6 +1310,7 @@ int threadobj_cancel(struct threadobj *thobj)
>>>>>>>>>>>  static void finalize_thread(void *p) /* thobj->lock free */
>>>>>>>>>>>  {
>>>>>>>>>>>         struct threadobj *thobj = p;
>>>>>>>>>>> +       int ret;
>>>>>>>>>>>
>>>>>>>>>>>         if (thobj == NULL || thobj == THREADOBJ_IRQCONTEXT)
>>>>>>>>>>>                 return;
>>>>>>>>>>> @@ -1343,7 +1344,9 @@ static void finalize_thread(void *p) /* thobj->lock
>>>>>>>>>>> free */
>>>>>>>>>>>          * waiting for us to start, pending on
>>>>>>>>>>>          * wait_on_barrier(). Instead, hand it over to this thread.
>>>>>>>>>>>          */
>>>>>>>>>>> -       threadobj_lock(thobj);
>>>>>>>>>>> +       ret = threadobj_lock(thobj);
>>>>>>>>>>> +       assert(ret == 0);
>>>>>>>>>>> +       (void)ret;
>>>>>>>>>>>         if ((thobj->status & __THREAD_S_SAFE) == 0) {
>>>>>>>>>>>                 threadobj_unlock(thobj);
>>>>>>>>>>>                 destroy_thread(thobj);
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> And...the assert kicks in!
>>>>>>>>>>
>>>>>>>>>> threadobj.c:1348: finalize_thread: Assertion `ret == 0' failed.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Philippe,
>>>>>>>>>
>>>>>>>>> Could you explain this case of the async cancellation signal? I don't
>>>>>>>>> fully understand how this happens or why it causes the issue.
>>>>>>>>>
>>>>>>>>> I also tested once with CONFIG_XENO_ASYNC_CANCEL OFF and then the
>>>>>>>>> issue doesn't occur anymore.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> I have not been able to reproduce this race yet, but still the following patch would make sense:
>>>>>>>>
>>>>>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>>>>>> index 05bb6cb..48aa032 100644
>>>>>>>> --- a/lib/copperplate/threadobj.c
>>>>>>>> +++ b/lib/copperplate/threadobj.c
>>>>>>>> @@ -1045,9 +1045,11 @@ static int wait_on_barrier(struct threadobj *thobj, int mask)
>>>>>>>>                 if (status & mask)
>>>>>>>>                         break;
>>>>>>>>                 oldstate = thobj->cancel_state;
>>>>>>>> +               push_cleanup_lock(&thobj->lock);
>>>>>>>>                 __threadobj_tag_unlocked(thobj);
>>>>>>>>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>>>>>>>>                 __threadobj_tag_locked(thobj);
>>>>>>>> +               pop_cleanup_lock(&thobj->lock);
>>>>>>>>                 thobj->cancel_state = oldstate;
>>>>>>>>         }
>>>>>>>>
>>>>>>>> @@ -1243,9 +1245,11 @@ static void cancel_sync(struct threadobj *thobj) /* thobj->lock held */
>>>>>>>>          */
>>>>>>>>         while (thobj->status & __THREAD_S_WARMUP) {
>>>>>>>>                 oldstate = thobj->cancel_state;
>>>>>>>> +               push_cleanup_lock(&thobj->lock);
>>>>>>>>                 __threadobj_tag_unlocked(thobj);
>>>>>>>>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>>>>>>>>                 __threadobj_tag_locked(thobj);
>>>>>>>> +               pop_cleanup_lock(&thobj->lock);
>>>>>>>>                 thobj->cancel_state = oldstate;
>>>>>>>>         }
>>>>>>>>
>>>>>>>
>>>>>>> I have tried this patch but it does not seem to make any difference.
>>>>>>> The threads are still deadlocked at the same location.
>>>>>>>
>>>>>>> I've only tested this on my setup with the old toolchain/kernel so
>>>>>>> far. I'll check also with the newer version.
>>>>>>>
>>>>>>
>>>>>> Ok, thanks for testing. I would not bother testing with another
>>>>>> toolchain, I don't buy the toolchain bug in this case. This looks like a
>>>>>> common race due to the async nature of the cancellation event which is
>>>>>> released while the caller holds its own lock without proper cleanup
>>>>>> actions. I'll dig deeper.
>>>>>>
>>>>>
>>>>> Okay, in case there is anything you want me to try out, don't hesitate
>>>>> to tell me.
>>>>>
>>>>> Thanks a lot already!
>>>>>
>>>>>
>>>>
>>>> I could reproduce the issue eventually on your exact test code, by
>>>> dropping the cpu affinity from the test options (--cpu-affinity) on a
>>>> multi-core system. The patch below should paper over the current
>>>> manifestation of this bug, but the root issue which bites here lives way
>>>> below this API layer, and is clearly SMP-related. Are you running a
>>>> multi-core board?
>>>
>>> I have tried the test with your patch and the issue no longer occurs.
>>> Yes, I am indeed running a multi-core board (cpu: Cavium Octeon II - 8
>>> cores).
>>> The issue does not occur if I run the board with only 1 core activated.
>>>
>>
>> It looks like something may be wrong in the way glibc handles the
>> cancellability state of threads in SMP mode. Unless I'm completely
>> off-base reading the glibc code, this issue opens a nasty race with code
>> running SMP with async cancellation enabled, solely relying on
>> pthread_setcancelstate() to turn it off temporarily. Granted, async
>> cancellation is inherently unsafe, but still, something looks wrong.
>>
>> Anyway, please pull
>> http://git.xenomai.org/xenomai-forge.git/commit/?h=next&id=7c4162315b8b970f9b82d53ddec8c910e15ff81a,
>> and let's see how this goes.
>>
>> Thanks,
>>
>
> This one fixes the commit log,
>
> http://git.xenomai.org/xenomai-forge.git/commit/?h=next&id=75cc5dac959c06bac3a5730ae279ae7a0a4ad271
>

Pulled to this revision and the test with this revision is good. I
will continue to work with this revision now.

Thanks for the work-around. Is the bug known yet at glibc?


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Xenomai] Xenomai-forge: Deadlock in finalize_thread()
  2014-08-05 10:45                               ` Philippe Gerum
  2014-08-05 12:39                                 ` Thomas De Schampheleire
@ 2014-08-05 14:18                                 ` Kim De Mey
  1 sibling, 0 replies; 25+ messages in thread
From: Kim De Mey @ 2014-08-05 14:18 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: Xenomai@xenomai.org

2014-08-05 12:45 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
> On 08/05/2014 09:37 AM, Kim De Mey wrote:
>> 2014-08-04 19:06 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>> On 07/29/2014 09:31 AM, Kim De Mey wrote:
>>>> 2014-07-28 23:52 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>> On 07/28/2014 05:18 PM, Kim De Mey wrote:
>>>>>> 2014-07-28 13:09 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>>>> On 07/23/2014 10:31 AM, Kim De Mey wrote:
>>>>>>>> 2014-07-09 17:05 GMT+02:00 Kim De Mey <kim.demey@gmail.com>:
>>>>>>>>>
>>>>>>>>> 2014-07-09 16:16 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>>>>>>> On 07/09/2014 12:23 PM, Kim De Mey wrote:
>>>>>>>>>>>
>>>>>>>>>>> 2014-07-09 11:25 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>>>>>>>>>
>>>>>>>>>>>> On 07/09/2014 10:09 AM, Kim De Mey wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Newer version --dump-config output:
>>>>>>>>>>>>> ...
>>>>>>>>>>>>> CONFIG_XENO_COMPILER="gcc version 4.7.0 (Cavium Inc. Version:
>>>>>>>>>>>>> SDK_3_1_0 build 27) "
>>>>>>>>>>>>> ...
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks. I can't reproduce this issue yet, so posting the current gdb
>>>>>>>>>>>> backtraces for all the threads still shown by "info threads" after a
>>>>>>>>>>>> deadlock would help.
>>>>>>>>>>>>
>>>>>>>>>>>> TIA,
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Below backtraces of all the threads. It is a case with two "worker"
>>>>>>>>>>> tasks in deadlock. The "main" and "create_delete" tasks continued to
>>>>>>>>>>> their "tm_wkafter" loop.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> (gdb) thread a a bt
>>>>>>>>>>>
>>>>>>>>>>> Thread 4 (Thread 9406):
>>>>>>>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b03300) at
>>>>>>>>>>> pthread_mutex_lock.c:321
>>>>>>>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b03300) at
>>>>>>>>>>> pthread_mutex_lock.c:55
>>>>>>>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>>>>>>>
>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>>>>>>>
>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>>>>>>>> #5  0x77cc5940 in start_thread (arg=0x75dff490) at pthread_create.c:315
>>>>>>>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>>>
>>>>>>>>>>> Thread 3 (Thread 31145):
>>>>>>>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b030b8) at
>>>>>>>>>>> pthread_mutex_lock.c:321
>>>>>>>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b030b8) at
>>>>>>>>>>> pthread_mutex_lock.c:55
>>>>>>>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>>>>>>>
>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>>>>>>>
>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>>>>>>>> #5  0x77cc5940 in start_thread (arg=0x766ff490) at pthread_create.c:315
>>>>>>>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>>>
>>>>>>>>>>> Thread 2 (Thread 6970):
>>>>>>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>>>>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>>>>>>>
>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>>>>>>>
>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>>>> #3  0x10000b64 in create_delete (a=0, b=0, c=0, d=0) at
>>>>>>>>>>> delete_child_hangs.c:31
>>>>>>>>>>> #4  0x77d296fc in task_trampoline () from
>>>>>>>>>>>
>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>>>> #5  0x77cf7d84 in thread_trampoline () from
>>>>>>>>>>>
>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>> #6  0x77cc592c in start_thread (arg=0x77a00490) at pthread_create.c:310
>>>>>>>>>>> #7  0x77bf1bbc in __thread_start () from
>>>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>>>
>>>>>>>>>>> Thread 1 (Thread 6969):
>>>>>>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>>>>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>>>>>>>
>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>>>>>>>
>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>>>> #3  0x10000c00 in main (argc=1, argv=0x10011030) at
>>>>>>>>>>> delete_child_hangs.c:43
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Ok, unless we've entered the twilight zone, the only reason I could see this
>>>>>>>>>> happening would be that the thread prologue somehow gets hit by an async
>>>>>>>>>> cancellation signal while holding its own lock. If so, then the patch below
>>>>>>>>>> would cause an assertion to trigger in such circumstance. You will have to
>>>>>>>>>> turn on --enable-assert in your configuration, leaving debugging entirely
>>>>>>>>>> off not to significantly affect the current timings.
>>>>>>>>>>
>>>>>>>>>> diff --git a/include/boilerplate/lock.h b/include/boilerplate/lock.h
>>>>>>>>>> index 6f0218c..b704987 100644
>>>>>>>>>> --- a/include/boilerplate/lock.h
>>>>>>>>>> +++ b/include/boilerplate/lock.h
>>>>>>>>>> @@ -206,7 +206,7 @@ int __check_cancel_type(const char *locktype);
>>>>>>>>>>  #define read_unlock_safe(__lock, __state)      \
>>>>>>>>>>         __do_unlock_safe(__lock, __state)
>>>>>>>>>>
>>>>>>>>>> -#ifdef CONFIG_XENO_DEBUG
>>>>>>>>>> +#ifndef NDEBUG
>>>>>>>>>>  #define mutex_type_attribute PTHREAD_MUTEX_ERRORCHECK
>>>>>>>>>>  #else
>>>>>>>>>>  #define mutex_type_attribute PTHREAD_MUTEX_NORMAL
>>>>>>>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>>>>>>>> index 05bb6cb..6547460 100644
>>>>>>>>>> --- a/lib/copperplate/threadobj.c
>>>>>>>>>> +++ b/lib/copperplate/threadobj.c
>>>>>>>>>> @@ -1310,6 +1310,7 @@ int threadobj_cancel(struct threadobj *thobj)
>>>>>>>>>>  static void finalize_thread(void *p) /* thobj->lock free */
>>>>>>>>>>  {
>>>>>>>>>>         struct threadobj *thobj = p;
>>>>>>>>>> +       int ret;
>>>>>>>>>>
>>>>>>>>>>         if (thobj == NULL || thobj == THREADOBJ_IRQCONTEXT)
>>>>>>>>>>                 return;
>>>>>>>>>> @@ -1343,7 +1344,9 @@ static void finalize_thread(void *p) /* thobj->lock
>>>>>>>>>> free */
>>>>>>>>>>          * waiting for us to start, pending on
>>>>>>>>>>          * wait_on_barrier(). Instead, hand it over to this thread.
>>>>>>>>>>          */
>>>>>>>>>> -       threadobj_lock(thobj);
>>>>>>>>>> +       ret = threadobj_lock(thobj);
>>>>>>>>>> +       assert(ret == 0);
>>>>>>>>>> +       (void)ret;
>>>>>>>>>>         if ((thobj->status & __THREAD_S_SAFE) == 0) {
>>>>>>>>>>                 threadobj_unlock(thobj);
>>>>>>>>>>                 destroy_thread(thobj);
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> And...the assert kicks in!
>>>>>>>>>
>>>>>>>>> threadobj.c:1348: finalize_thread: Assertion `ret == 0' failed.
>>>>>>>>
>>>>>>>>
>>>>>>>> Philippe,
>>>>>>>>
>>>>>>>> Could you explain this case of the async cancellation signal? I don't
>>>>>>>> fully understand how this happens or why it causes the issue.
>>>>>>>>
>>>>>>>> I also tested once with CONFIG_XENO_ASYNC_CANCEL OFF and then the
>>>>>>>> issue doesn't occur anymore.
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> I have not been able to reproduce this race yet, but still the following patch would make sense:
>>>>>>>
>>>>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>>>>> index 05bb6cb..48aa032 100644
>>>>>>> --- a/lib/copperplate/threadobj.c
>>>>>>> +++ b/lib/copperplate/threadobj.c
>>>>>>> @@ -1045,9 +1045,11 @@ static int wait_on_barrier(struct threadobj *thobj, int mask)
>>>>>>>                 if (status & mask)
>>>>>>>                         break;
>>>>>>>                 oldstate = thobj->cancel_state;
>>>>>>> +               push_cleanup_lock(&thobj->lock);
>>>>>>>                 __threadobj_tag_unlocked(thobj);
>>>>>>>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>>>>>>>                 __threadobj_tag_locked(thobj);
>>>>>>> +               pop_cleanup_lock(&thobj->lock);
>>>>>>>                 thobj->cancel_state = oldstate;
>>>>>>>         }
>>>>>>>
>>>>>>> @@ -1243,9 +1245,11 @@ static void cancel_sync(struct threadobj *thobj) /* thobj->lock held */
>>>>>>>          */
>>>>>>>         while (thobj->status & __THREAD_S_WARMUP) {
>>>>>>>                 oldstate = thobj->cancel_state;
>>>>>>> +               push_cleanup_lock(&thobj->lock);
>>>>>>>                 __threadobj_tag_unlocked(thobj);
>>>>>>>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>>>>>>>                 __threadobj_tag_locked(thobj);
>>>>>>> +               pop_cleanup_lock(&thobj->lock);
>>>>>>>                 thobj->cancel_state = oldstate;
>>>>>>>         }
>>>>>>>
>>>>>>
>>>>>> I have tried this patch but it does not seem to make any difference.
>>>>>> The threads are still deadlocked at the same location.
>>>>>>
>>>>>> I've only tested this on my setup with the old toolchain/kernel so
>>>>>> far. I'll check also with the newer version.
>>>>>>
>>>>>
>>>>> Ok, thanks for testing. I would not bother testing with another
>>>>> toolchain, I don't buy the toolchain bug in this case. This looks like a
>>>>> common race due to the async nature of the cancellation event which is
>>>>> released while the caller holds its own lock without proper cleanup
>>>>> actions. I'll dig deeper.
>>>>>
>>>>
>>>> Okay, in case there is anything you want me to try out, don't hesitate
>>>> to tell me.
>>>>
>>>> Thanks a lot already!
>>>>
>>>>
>>>
>>> I could reproduce the issue eventually on your exact test code, by
>>> dropping the cpu affinity from the test options (--cpu-affinity) on a
>>> multi-core system. The patch below should paper over the current
>>> manifestation of this bug, but the root issue which bites here lives way
>>> below this API layer, and is clearly SMP-related. Are you running a
>>> multi-core board?
>>
>> I have tried the test with your patch and the issue no longer occurs.
>> Yes, I am indeed running a multi-core board (cpu: Cavium Octeon II - 8
>> cores).
>
> Then make sure to pass --enable-smp to the build configuration line. The
> Xenomai 3 libraries issue memory barriers internally in some lockless
> operations which are required for proper ordering, but only when this
> switch was given.
>

Okay, I didn't know this. I will add this switch.

> --
> Philippe.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Xenomai] Xenomai-forge: Deadlock in finalize_thread()
  2014-08-05 14:16                                   ` Kim De Mey
@ 2014-08-05 14:33                                     ` Philippe Gerum
  0 siblings, 0 replies; 25+ messages in thread
From: Philippe Gerum @ 2014-08-05 14:33 UTC (permalink / raw)
  To: Kim De Mey; +Cc: Xenomai@xenomai.org

On 08/05/2014 04:16 PM, Kim De Mey wrote:
> 2014-08-05 12:38 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>> On 08/05/2014 12:35 PM, Philippe Gerum wrote:
>>> On 08/05/2014 09:37 AM, Kim De Mey wrote:
>>>> 2014-08-04 19:06 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>> On 07/29/2014 09:31 AM, Kim De Mey wrote:
>>>>>> 2014-07-28 23:52 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>>>> On 07/28/2014 05:18 PM, Kim De Mey wrote:
>>>>>>>> 2014-07-28 13:09 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>>>>>> On 07/23/2014 10:31 AM, Kim De Mey wrote:
>>>>>>>>>> 2014-07-09 17:05 GMT+02:00 Kim De Mey <kim.demey@gmail.com>:
>>>>>>>>>>>
>>>>>>>>>>> 2014-07-09 16:16 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>>>>>>>>> On 07/09/2014 12:23 PM, Kim De Mey wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2014-07-09 11:25 GMT+02:00 Philippe Gerum <rpm@xenomai.org>:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 07/09/2014 10:09 AM, Kim De Mey wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Newer version --dump-config output:
>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>> CONFIG_XENO_COMPILER="gcc version 4.7.0 (Cavium Inc. Version:
>>>>>>>>>>>>>>> SDK_3_1_0 build 27) "
>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks. I can't reproduce this issue yet, so posting the current gdb
>>>>>>>>>>>>>> backtraces for all the threads still shown by "info threads" after a
>>>>>>>>>>>>>> deadlock would help.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> TIA,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Below backtraces of all the threads. It is a case with two "worker"
>>>>>>>>>>>>> tasks in deadlock. The "main" and "create_delete" tasks continued to
>>>>>>>>>>>>> their "tm_wkafter" loop.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> (gdb) thread a a bt
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thread 4 (Thread 9406):
>>>>>>>>>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b03300) at
>>>>>>>>>>>>> pthread_mutex_lock.c:321
>>>>>>>>>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b03300) at
>>>>>>>>>>>>> pthread_mutex_lock.c:55
>>>>>>>>>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>>>>>>>>>
>>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>>>>>>>>>
>>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>>>>>>>>>> #5  0x77cc5940 in start_thread (arg=0x75dff490) at pthread_create.c:315
>>>>>>>>>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thread 3 (Thread 31145):
>>>>>>>>>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b030b8) at
>>>>>>>>>>>>> pthread_mutex_lock.c:321
>>>>>>>>>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b030b8) at
>>>>>>>>>>>>> pthread_mutex_lock.c:55
>>>>>>>>>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>>>>>>>>>
>>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>>>>>>>>>
>>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>>>>>>>>>> #5  0x77cc5940 in start_thread (arg=0x766ff490) at pthread_create.c:315
>>>>>>>>>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thread 2 (Thread 6970):
>>>>>>>>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>>>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>>>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>>>>>>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>>>>>>>>>
>>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>>>>>>>>>
>>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>>>>>> #3  0x10000b64 in create_delete (a=0, b=0, c=0, d=0) at
>>>>>>>>>>>>> delete_child_hangs.c:31
>>>>>>>>>>>>> #4  0x77d296fc in task_trampoline () from
>>>>>>>>>>>>>
>>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>>>>>> #5  0x77cf7d84 in thread_trampoline () from
>>>>>>>>>>>>>
>>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>>>> #6  0x77cc592c in start_thread (arg=0x77a00490) at pthread_create.c:310
>>>>>>>>>>>>> #7  0x77bf1bbc in __thread_start () from
>>>>>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thread 1 (Thread 6969):
>>>>>>>>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>>>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>>>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>>>>>>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>>>>>>>>>
>>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>>>>>>>>>
>>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>>>>>> #3  0x10000c00 in main (argc=1, argv=0x10011030) at
>>>>>>>>>>>>> delete_child_hangs.c:43
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Ok, unless we've entered the twilight zone, the only reason I could see this
>>>>>>>>>>>> happening would be that the thread prologue somehow gets hit by an async
>>>>>>>>>>>> cancellation signal while holding its own lock. If so, then the patch below
>>>>>>>>>>>> would cause an assertion to trigger in such circumstance. You will have to
>>>>>>>>>>>> turn on --enable-assert in your configuration, leaving debugging entirely
>>>>>>>>>>>> off not to significantly affect the current timings.
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/include/boilerplate/lock.h b/include/boilerplate/lock.h
>>>>>>>>>>>> index 6f0218c..b704987 100644
>>>>>>>>>>>> --- a/include/boilerplate/lock.h
>>>>>>>>>>>> +++ b/include/boilerplate/lock.h
>>>>>>>>>>>> @@ -206,7 +206,7 @@ int __check_cancel_type(const char *locktype);
>>>>>>>>>>>>  #define read_unlock_safe(__lock, __state)      \
>>>>>>>>>>>>         __do_unlock_safe(__lock, __state)
>>>>>>>>>>>>
>>>>>>>>>>>> -#ifdef CONFIG_XENO_DEBUG
>>>>>>>>>>>> +#ifndef NDEBUG
>>>>>>>>>>>>  #define mutex_type_attribute PTHREAD_MUTEX_ERRORCHECK
>>>>>>>>>>>>  #else
>>>>>>>>>>>>  #define mutex_type_attribute PTHREAD_MUTEX_NORMAL
>>>>>>>>>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>>>>>>>>>> index 05bb6cb..6547460 100644
>>>>>>>>>>>> --- a/lib/copperplate/threadobj.c
>>>>>>>>>>>> +++ b/lib/copperplate/threadobj.c
>>>>>>>>>>>> @@ -1310,6 +1310,7 @@ int threadobj_cancel(struct threadobj *thobj)
>>>>>>>>>>>>  static void finalize_thread(void *p) /* thobj->lock free */
>>>>>>>>>>>>  {
>>>>>>>>>>>>         struct threadobj *thobj = p;
>>>>>>>>>>>> +       int ret;
>>>>>>>>>>>>
>>>>>>>>>>>>         if (thobj == NULL || thobj == THREADOBJ_IRQCONTEXT)
>>>>>>>>>>>>                 return;
>>>>>>>>>>>> @@ -1343,7 +1344,9 @@ static void finalize_thread(void *p) /* thobj->lock
>>>>>>>>>>>> free */
>>>>>>>>>>>>          * waiting for us to start, pending on
>>>>>>>>>>>>          * wait_on_barrier(). Instead, hand it over to this thread.
>>>>>>>>>>>>          */
>>>>>>>>>>>> -       threadobj_lock(thobj);
>>>>>>>>>>>> +       ret = threadobj_lock(thobj);
>>>>>>>>>>>> +       assert(ret == 0);
>>>>>>>>>>>> +       (void)ret;
>>>>>>>>>>>>         if ((thobj->status & __THREAD_S_SAFE) == 0) {
>>>>>>>>>>>>                 threadobj_unlock(thobj);
>>>>>>>>>>>>                 destroy_thread(thobj);
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> And...the assert kicks in!
>>>>>>>>>>>
>>>>>>>>>>> threadobj.c:1348: finalize_thread: Assertion `ret == 0' failed.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Philippe,
>>>>>>>>>>
>>>>>>>>>> Could you explain this case of the async cancellation signal? I don't
>>>>>>>>>> fully understand how this happens or why it causes the issue.
>>>>>>>>>>
>>>>>>>>>> I also tested once with CONFIG_XENO_ASYNC_CANCEL OFF and then the
>>>>>>>>>> issue doesn't occur anymore.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I have not been able to reproduce this race yet, but still the following patch would make sense:
>>>>>>>>>
>>>>>>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>>>>>>> index 05bb6cb..48aa032 100644
>>>>>>>>> --- a/lib/copperplate/threadobj.c
>>>>>>>>> +++ b/lib/copperplate/threadobj.c
>>>>>>>>> @@ -1045,9 +1045,11 @@ static int wait_on_barrier(struct threadobj *thobj, int mask)
>>>>>>>>>                 if (status & mask)
>>>>>>>>>                         break;
>>>>>>>>>                 oldstate = thobj->cancel_state;
>>>>>>>>> +               push_cleanup_lock(&thobj->lock);
>>>>>>>>>                 __threadobj_tag_unlocked(thobj);
>>>>>>>>>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>>>>>>>>>                 __threadobj_tag_locked(thobj);
>>>>>>>>> +               pop_cleanup_lock(&thobj->lock);
>>>>>>>>>                 thobj->cancel_state = oldstate;
>>>>>>>>>         }
>>>>>>>>>
>>>>>>>>> @@ -1243,9 +1245,11 @@ static void cancel_sync(struct threadobj *thobj) /* thobj->lock held */
>>>>>>>>>          */
>>>>>>>>>         while (thobj->status & __THREAD_S_WARMUP) {
>>>>>>>>>                 oldstate = thobj->cancel_state;
>>>>>>>>> +               push_cleanup_lock(&thobj->lock);
>>>>>>>>>                 __threadobj_tag_unlocked(thobj);
>>>>>>>>>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>>>>>>>>>                 __threadobj_tag_locked(thobj);
>>>>>>>>> +               pop_cleanup_lock(&thobj->lock);
>>>>>>>>>                 thobj->cancel_state = oldstate;
>>>>>>>>>         }
>>>>>>>>>
>>>>>>>>
>>>>>>>> I have tried this patch but it does not seem to make any difference.
>>>>>>>> The threads are still deadlocked at the same location.
>>>>>>>>
>>>>>>>> I've only tested this on my setup with the old toolchain/kernel so
>>>>>>>> far. I'll check also with the newer version.
>>>>>>>>
>>>>>>>
>>>>>>> Ok, thanks for testing. I would not bother testing with another
>>>>>>> toolchain, I don't buy the toolchain bug in this case. This looks like a
>>>>>>> common race due to the async nature of the cancellation event which is
>>>>>>> released while the caller holds its own lock without proper cleanup
>>>>>>> actions. I'll dig deeper.
>>>>>>>
>>>>>>
>>>>>> Okay, in case there is anything you want me to try out, don't hesitate
>>>>>> to tell me.
>>>>>>
>>>>>> Thanks a lot already!
>>>>>>
>>>>>>
>>>>>
>>>>> I could reproduce the issue eventually on your exact test code, by
>>>>> dropping the cpu affinity from the test options (--cpu-affinity) on a
>>>>> multi-core system. The patch below should paper over the current
>>>>> manifestation of this bug, but the root issue which bites here lives way
>>>>> below this API layer, and is clearly SMP-related. Are you running a
>>>>> multi-core board?
>>>>
>>>> I have tried the test with your patch and the issue no longer occurs.
>>>> Yes, I am indeed running a multi-core board (cpu: Cavium Octeon II - 8
>>>> cores).
>>>> The issue does not occur if I run the board with only 1 core activated.
>>>>
>>>
>>> It looks like something may be wrong in the way glibc handles the
>>> cancellability state of threads in SMP mode. Unless I'm completely
>>> off-base reading the glibc code, this issue opens a nasty race with code
>>> running SMP with async cancellation enabled, solely relying on
>>> pthread_setcancelstate() to turn it off temporarily. Granted, async
>>> cancellation is inherently unsafe, but still, something looks wrong.
>>>
>>> Anyway, please pull
>>> http://git.xenomai.org/xenomai-forge.git/commit/?h=next&id=7c4162315b8b970f9b82d53ddec8c910e15ff81a,
>>> and let's see how this goes.
>>>
>>> Thanks,
>>>
>>
>> This one fixes the commit log,
>>
>> http://git.xenomai.org/xenomai-forge.git/commit/?h=next&id=75cc5dac959c06bac3a5730ae279ae7a0a4ad271
>>
> 
> Pulled to this revision and the test with this revision is good. I
> will continue to work with this revision now.
> 
> Thanks for the work-around. Is the bug known yet at glibc?
> 

I did not find any mention of such issue, only an old nptl bug with
PTHREAD_CANCEL_DISABLE which surfaced in 2004, and obviously fixed many
moons ago. I have to create a self-contained and obvious enough test to
report it, which does not seem trivial.

-- 
Philippe.


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2014-08-05 14:33 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-03  8:11 [Xenomai] Xenomai-forge: Deadlock in finalize_thread() Kim De Mey
2014-07-03 15:14 ` Philippe Gerum
2014-07-07  7:58   ` Kim De Mey
2014-07-08 16:48     ` Philippe Gerum
2014-07-09  8:09       ` Kim De Mey
2014-07-09  9:25         ` Philippe Gerum
2014-07-09 10:23           ` Kim De Mey
2014-07-09 14:16             ` Philippe Gerum
2014-07-09 15:05               ` Kim De Mey
2014-07-23  8:31                 ` Kim De Mey
2014-07-23 20:45                   ` Philippe Gerum
2014-07-28 11:09                   ` Philippe Gerum
2014-07-28 15:18                     ` Kim De Mey
2014-07-28 21:52                       ` Philippe Gerum
2014-07-29  7:31                         ` Kim De Mey
2014-08-04 17:06                           ` Philippe Gerum
2014-08-05  7:37                             ` Kim De Mey
2014-08-05 10:35                               ` Philippe Gerum
2014-08-05 10:38                                 ` Philippe Gerum
2014-08-05 14:16                                   ` Kim De Mey
2014-08-05 14:33                                     ` Philippe Gerum
2014-08-05 10:45                               ` Philippe Gerum
2014-08-05 12:39                                 ` Thomas De Schampheleire
2014-08-05 13:22                                   ` Philippe Gerum
2014-08-05 14:18                                 ` Kim De Mey

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.