* [Xenomai] segfault in printer_loop()
@ 2017-11-01  6:29 C Smith
  2017-11-01  7:35 ` Jan Kiszka
  0 siblings, 1 reply; 8+ messages in thread
From: C Smith @ 2017-11-01  6:29 UTC (permalink / raw)
  To: xenomai

I finally caught all the variables in a corefile in gdb:
(gdb) bt
#0  0xb76d70db in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib/libpthread.so.0
#1  0xb76f97f4 in printer_loop (arg=0x0) at rt_print.c:685
#2  0xb76d3adf in start_thread () from /lib/libpthread.so.0
#3  0xb746444e in clone () from /lib/libc.so.6
(gdb) print printer_wakeup
$1 = {__data = {__lock = 0, __futex = 0, __total_seq = 0, __wakeup_seq = 0,
__woken_seq = 0, __mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0},
__size = '\000' <repeats 47 times>, __align = 0}
(gdb) print buffer_lock
$2 = {__data = {__lock = 0, __count = 0, __owner = 0, __kind = 0, __nusers
= 0, {__spins = 0, __list = {__next = 0x0}}}, __size = '\000' <repeats 23
times>, __align = 0}
(gdb) print buffers
$3 = 4
(gdb) print arg
No symbol "arg" in current context.
(gdb) print mask
No symbol "mask" in current context.
(gdb) print unlock
$4 = {void (void *)} 0xb76f96f9 <unlock>
(gdb) info threads
  Id   Target Id         Frame
  2    Thread 0xb73686c0 (LWP 20462) 0xffffe424 in ?? ()
* 1    Thread 0xb76f5b40 (LWP 20464) 0xb76d70db in
pthread_cond_wait@@GLIBC_2.3.2
() from /lib/libpthread.so.0
(gdb) print &printer_wakeup
$5 = (pthread_cond_t *) 0xb76fda20
(gdb) print &buffer_lock
$6 = (pthread_mutex_t *) 0xb76fd9fc

I see now that all the variables used in this function are static (in the
data segment, not on the stack), and thus their addresses are valid rather
than null pointers - so what could cause a segfault?

Thanks, -C Smith

^ permalink raw reply	[flat|nested] 8+ messages in thread
* [Xenomai] segfault in printer_loop()
@ 2017-11-13  6:39 C Smith
  2017-11-13  7:41 ` Jan Kiszka
  0 siblings, 1 reply; 8+ messages in thread
From: C Smith @ 2017-11-13  6:39 UTC (permalink / raw)
  To: xenomai

Hi Jan,

I have found a workaround for the problem. Instead of the startup segfault
happening 10% of the time, I have now started my RT app 90 times with a
single RT thread, and 80 times with its original three RT threads - with no
segfaults.

Per your question: I don't think the problem is that __rt_print_init() is
getting called twice. The normal order of execution is like this:

. printer_loop() gets called first when a xenomai RT app starts up

. pthread_mutex_lock() sets the buffer_lock struct so __lock and __owner
are nonzero:
(gdb) p buffer_lock
$4 = {__data = {__lock = 1, __count = 0, __owner = 18681, __kind = 0,
__nusers = 1, {__spins = 0, __list = {__next = 0x0}}}, __size =
"\001\000\000\000\000\000\000\000\371H\000\000\000\000\000\000\001\000\000\000\000\000\000",
__align = 1}

. then pthread_cond_wait() calls __rt_print_init()

. inside __rt_print_init(), printer_wakeup has a valid __mutex:
(gdb) print printer_wakeup
$5 = {__data = {__lock = 0, __futex = 1, __total_seq = 1, __wakeup_seq = 0,
__woken_seq = 0, __mutex = 0xb7fd4a1c, __nwaiters = 2, __broadcast_seq =
0}, __size = "\000\000\000\000\001\000\000\000\001", '\000' <repeats 23
times>, "\034J\375\267\002\000\000\000\000\000\000\000\000\000\000",
__align = 4294967296}

. Then continuing, we get to the first line of main() OK with no segfault.

You had advised to watch for corruption of the vars pthread_cond_wait()
uses.
In contrast to the above, when the segfault occurs, the vars buffer_lock
and printer_wakeup, which get passed into pthread_cond_wait(), contain all
zeros:

(gdb) print buffer_lock
$6 = {__data = {__lock = 0, __count = 0, __owner = 0, __kind = 0, __nusers
= 0, {__spins = 0, __list = {__next = 0x0}}}, __size = '\000' <repeats 23
times>, __align = 0}
(gdb) print printer_wakeup
$7 = {__data = {__lock = 0, __futex = 0, __total_seq = 0, __wakeup_seq = 0,
__woken_seq = 0, __mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0},
__size = '\000' <repeats 47 times>, __align = 0}

There is one pointer inside the pthread_cond_t structure:
printer_wakeup.__data.__mutex
So perhaps pthread_cond_wait() dereferences this null __mutex pointer? The
segfault always happens on an access of address 0xC.

This segfault first appeared when I compiled my app for SMP, and it goes
away if I use kernel arg maxcpus=1. Perhaps some SMP race condition is
occasionally preventing the data structures (buffer_lock,printer_wakeup)
from being ready for pthread_cond_wait()?

As a protection against this I have patched the rt_print.c printer_loop()
code, skipping the call to pthread_cond_wait() if those two structures
(buffer_lock, printer_wakeup) are not ready. There is no reason to wait on
a condition variable whose mutex is neither locked nor initialized, right?

This is the patch:

--- rt_print_A.c    2014-09-24 13:57:49.000000000 -0700
+++ rt_print_B.c    2017-11-11 23:24:34.309832301 -0800
@@ -680,9 +680,10 @@
     while (1) {
         pthread_cleanup_push(unlock, &buffer_lock);
         pthread_mutex_lock(&buffer_lock);
-
-        while (buffers == 0)
-            pthread_cond_wait(&printer_wakeup, &buffer_lock);
+
+        if ((buffer_lock.__data.__lock != 0) && (printer_wakeup.__data.__mutex != 0))
+            while (buffers == 0)
+                pthread_cond_wait(&printer_wakeup, &buffer_lock);

         print_buffers();

Can you verify that this patch is safe?

thanks,
-C Smith

* Re: [Xenomai] segfault in printer_loop()
@ 2017-11-01  0:12 C Smith
  0 siblings, 0 replies; 8+ messages in thread
From: C Smith @ 2017-11-01  0:12 UTC (permalink / raw)
  To: xenomai

One update for clarification:

The gcc command line is identical between the old xeno 2.6.2 machine which
produces a stable app, and this new xeno 2.6.4 machine which produces
segfaults in my app. You'll notice some redundancy in the gcc command line
below, as I am using more than one skin and calling xeno-config more than
once in the Makefile (but that has never been a problem in years of using
2.6.2):

gcc -g3 -I/usr/xenomai/include -D_GNU_SOURCE -D_REENTRANT -D__XENO__
-I/usr/include/libxml2 -I/usr/local/rtnet/include -I"SOEM/" -I"SOEM/osal"
-I"SOEM/oshw/linux" -I"SOEM/soem"    -Xlinker -rpath -Xlinker /usr/xenomai/lib
app.c ../include/dia_dev_app.h ../include/crc_table.h ../include/dacdefs.h
../include/ov_version.h ../include/adcdefs.h ../include/app_version.h
../include/app.h ../include/canodefs.h ../include/preproc_app.h
../include/app_mem_manager_data.h ../include/comm_dta_app.h
../include/comproto.h
../modules/rtdinsync.h quad.o dac.o adc.o
SOEM/lib/linux/liboshw.a SOEM/lib/linux/libosal.a SOEM/lib/linux/libsoem.a
-L../lib -lapp -lnative -L/usr/xenomai/lib -lxenomai -lpthread -lrt -lrtdm
-L/usr/xenomai/lib -lxenomai -lpthread -lrt
-Wl,@/usr/xenomai/lib/posix.wrappers
-L/usr/xenomai/lib -lpthread_rt -lxenomai -lpthread -lrt  -lxml2 -lz -lm
-L"SOEM/lib/linux" -Wl,--start-group -loshw -losal -lsoem
-Wl,--end-group -lm -o app

Thanks, -C. Smith

Original post:

My xenomai application is segfaulting at startup, 1 in 10 times I run it.
When I catch it in a debugger or get a core file it says the segfault was
not in my code but in the xenomai sources:

rt_print.c line 685:
pthread_cond_wait(&printer_wakeup, &buffer_lock);

(gdb) bt
#0  0xb77120db in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib/libpthread.so.0
#1  0xb77347de in printer_loop (arg=0x0) at rt_print.c:685
#2  0xb770eadf in start_thread () from /lib/libpthread.so.0
#3  0xb749f44e in clone () from /lib/libc.so.6
(gdb) info threads
 Id  Target Id         Frame
 2   Thread 0xb73a36c0 (LWP 7235) 0xffffe424 in ?? ()
*1   Thread 0xb7730b40 (LWP 7238) 0xb77120db in pthread_cond_wait@@GLIBC_2.3.2
() from /lib/libpthread.so.0

Note that there is no printing whatsoever in my code. This is a mature
application which has been running successfully on xenomai 2.6.2 for a few
years - but now I am running it on xenomai 2.6.4 on kernel 3.14.17.
Another difference is that I am now using a faster motherboard. I have a
suspicion that there is a race condition which is causing uninitialized
thread variables. I believe this is during the creation of a thread where
xenomai prints the new thread info to stdout.

Could &printer_wakeup or &buffer_lock be invalid?
I was unable to evaluate them in the debugger; I think their values are
gone from the stack/heap by the time I get to them.

There are no differences in rt_print.c between xenomai 2.6.4 and 2.6.5.

Can you provide a way to modify the code of printer_loop() to detect and
work around the problem?

* [Xenomai] segfault in printer_loop()
@ 2017-10-27  2:08 C Smith
  0 siblings, 0 replies; 8+ messages in thread
From: C Smith @ 2017-10-27  2:08 UTC (permalink / raw)
  To: xenomai

My xenomai application is segfaulting at startup, 1 in 10 times I run it.
When I catch it in a debugger or get a core file it says the segfault was
not in my code but in the xenomai sources:

rt_print.c line 685:
pthread_cond_wait(&printer_wakeup, &buffer_lock);

(gdb) bt
#0  0xb77120db in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib/libpthread.so.0
#1  0xb77347de in printer_loop (arg=0x0) at rt_print.c:685
#2  0xb770eadf in start_thread () from /lib/libpthread.so.0
#3  0xb749f44e in clone () from /lib/libc.so.6
(gdb) info threads
 Id  Target Id         Frame
 2   Thread 0xb73a36c0 (LWP 7235) 0xffffe424 in ?? ()
*1   Thread 0xb7730b40 (LWP 7238) 0xb77120db in pthread_cond_wait@@GLIBC_2.3.2
() from /lib/libpthread.so.0

Note that there is no printing whatsoever in my code. This is a mature
application which has been running successfully on xenomai 2.6.2 for a few
years - but now I am running it on xenomai 2.6.4 on kernel 3.14.17.
Another difference is that I am now using a faster motherboard. I have a
suspicion that there is a race condition which is causing uninitialized
thread variables. I believe this is during the creation of a thread where
xenomai prints the new thread info to stdout.

Could &printer_wakeup or &buffer_lock be invalid?
I was unable to evaluate them in the debugger; I think their values are
gone from the stack/heap by the time I get to them.

There are no differences in rt_print.c between xenomai 2.6.4 and 2.6.5.

Can you provide a way to modify the code of printer_loop() to detect and
work around the problem?


end of thread, other threads:[~2017-11-13  7:41 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-01  6:29 [Xenomai] segfault in printer_loop() C Smith
2017-11-01  7:35 ` Jan Kiszka
     [not found]   ` <CA+K1mPF+SOhOeVpYktjNCcD7u403CtUXkM1Hcz_SS-6wwG50xg@mail.gmail.com>
2017-11-10  7:02     ` Jan Kiszka
     [not found]       ` <CA+K1mPGSTNyk0JZPQs4sSyuu+Xbi=cChCwFU1uokM9gNAe6n2Q@mail.gmail.com>
2017-11-10 10:07         ` Jan Kiszka
  -- strict thread matches above, loose matches on Subject: below --
2017-11-13  6:39 C Smith
2017-11-13  7:41 ` Jan Kiszka
2017-11-01  0:12 C Smith
2017-10-27  2:08 C Smith
