From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Helge Deller"
Subject: Re: futex wait failure
Date: Mon, 04 Jan 2010 17:27:32 +0100
Message-ID: <20100104162732.10090@gmx.net>
References: <20100101034858.80D264EA9@hiauly1.hia.nrc.ca>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Cc: linux-parisc@vger.kernel.org, dave.anglin@nrc-cnrc.gc.ca, carlos@systemhalted.org
To: "John David Anglin"
Return-path:
In-Reply-To: <20100101034858.80D264EA9@hiauly1.hia.nrc.ca>
List-Id: linux-parisc.vger.kernel.org

> > I tested the patch and the testcase in
> > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=561203
> > still segfaults.
>
> I think the expect/tcl bug and the bug 561203 are related.  Looking
> at the minifail core dump, I see:
>
> Core was generated by `./minifail'.
> Program terminated with signal 11, Segmentation fault.
> #0  0x00000000 in ?? ()
>
> So, how did we get to 0?  $rp is 0, so we might have executed a
> return to this location.  $r31 contains 0x4157cc4f.
>
> (gdb) disass 0x4157cc3c 0x4157cc5c
> Dump of assembler code from 0x4157cc3c to 0x4157cc5c:
> 0x4157cc3c <_IO_puts+332>:      copy rp,r25
> 0x4157cc40 <_IO_puts+336>:      copy r6,r24
> 0x4157cc44 <_IO_puts+340>:      be,l b0(sr2,r0),sr0,r31
> 0x4157cc48 <_IO_puts+344>:      ldi 0,r20
> 0x4157cc4c <_IO_puts+348>:      ldi -b,r24
> 0x4157cc50 <_IO_puts+352>:      cmpb,=,n r24,r21,0x4157cc38 <_IO_puts+328>
> 0x4157cc54 <_IO_puts+356>:      nop
> 0x4157cc58 <_IO_puts+360>:      ldi -2d,r25

I think I have an idea of what could have happened, and why it crashes
in the child process most of the time (but not always)...

In ports/sysdeps/unix/sysv/linux/hppa/bits/atomic.h we have:

#define atomic_compare_and_exchange_val_acq(mem, newval, oldval)  \
  ({                                                              \
     volatile int lws_errno;                                      \
     volatile int lws_ret;                                        \
     asm volatile(                                                \
...some assembly...
"stw %%r28, %0 \n\t" = \ "sub %%r0, %%r21, %%r21 \n\t" = \ "stw %%r21, %1 \n\t" = \ : "=3Dm" (lws_ret), "=3Dm" (lws_errno) = \ : "r" (mem), "r" (oldval), "r" (newval) = \ : _LWS_CLOBBER =20 this means, that lws_errno and lws_ret are located on the stack. With gdb I see this expanded to: 0x40705494 : stw ret0,-1b8(sp) 0x40705498 : sub r0,r21,r21 0x4070549c : stw r21,-1b4(sp) So, lws_ret/lws_errno are at -1b8/-1b4(sp). And this LWS code is called from=20 =2E./nptl/sysdeps/pthread/createthread.c: static int create_thread (struct pthread *pd, const struct pthread_attr= *attr, STACK_VARIABLES_PARMS) =2E.. int res =3D do_clone (pd, attr, clone_flags, start_thread, STACK_VARIABLES_ARGS, 1); if (res =3D=3D 0) { =2E..(line 216): /* Enqueue the descriptor. */ do pd->nextevent =3D __nptl_last_event; while (atomic_compare_and_exchange_bool_acq(&__nptl_last_= event, pd, pd->nextevent) !=3D 0); And here is what could have happened: a) do_clone() creates the child process. b) the child process gets a new stack c) the child calls atomic_compare_and_exchange_bool_acq() and thus the = LWS code above. d) the LWS code writes to the stack location at -1b8(sp), which is out = of bounds for the child process (the child stack got only ~ 0x40 bytes = initial room) e) Thus the child either crashes, overwrites memory of the parent or do= es other things wrong. Additionally: Due to the LWS assembly code and because we don't have many registers f= ree while using LWS, gcc used %rp as a temporary register which may hav= e fooled us in our thinking? 
0x40705458 :  ldi 0,rp
0x4070545c :  ldi fb,r3
0x40705460 :  ldw -70(sp),ret0
0x40705464 :  ldw 214(ret0),ret1
0x40705468 :  copy r5,r26
0x4070546c :  copy ret1,r25
0x40705470 :  copy rp,r24
0x40705474 :  be,l b0(sr2,r0),sr0,r31
0x40705478 :  ldi 0,r20
0x4070547c :  ldi -b,r24
0x40705480 :  cmpb,=,n r24,r21,0x40705468
0x40705484 :  nop
0x40705488 :  ldi -2d,r25
0x4070548c :  cmpb,=,n r25,r21,0x40705468
0x40705490 :  nop
0x40705494 :  stw ret0,-1b8(sp)
0x40705498 :  sub r0,r21,r21
0x4070549c :  stw r21,-1b4(sp)
0x407054a0 :  ldw -1b4(sp),ret0

If my assumptions are correct, then we could either

a) use the gcc atomic builtins instead of our own atomic code in libc6.
   E.g., add to ports/sysdeps/unix/sysv/linux/hppa/bits/atomic.h:

...
#if __GNUC_PREREQ (4, 1)
# define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
  __sync_val_compare_and_swap (mem, oldval, newval)
# define atomic_compare_and_exchange_bool_acq(mem, newval, oldval) \
  (! __sync_bool_compare_and_swap (mem, oldval, newval))
#elif __ASSUME_LWS_CAS
....

b) change the assembly in atomic_compare_and_exchange_val_acq() so that
   it does not put its local variables (lws_errno and lws_ret) on the
   stack.

I'm currently testing option a).

Helge

(PS: I used a webmailer, so the indenting might be strange...)
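
To make option a) a bit more concrete, here is a minimal, standalone sketch of the enqueue loop from create_thread() written on top of the gcc __sync builtins rather than the LWS assembly. The names node, last_event, cas_bool_acq and enqueue are hypothetical stand-ins (for struct pthread, __nptl_last_event and the glibc macro); this is not the glibc code and has not been tried on hppa. It only illustrates the calling convention, in particular that the bool variant evaluates to 0 when the swap happened.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pthread; only the link field matters. */
struct node {
    struct node *nextevent;
};

/* Hypothetical stand-in for __nptl_last_event (head of the list). */
static struct node *last_event = NULL;

/* Same convention as the macro proposed in option a):
   evaluates to 0 if *mem was equal to oldval and has been set to newval,
   and to non-zero if the swap did not happen. */
#define cas_bool_acq(mem, newval, oldval) \
    (!__sync_bool_compare_and_swap((mem), (oldval), (newval)))

/* Push pd onto the head of the list, retrying until the CAS succeeds.
   The builtin returns its result as a value, so no result variable
   has to be declared as a memory operand of an asm statement. */
static void enqueue(struct node *pd)
{
    assert(pd != NULL);
    do
        pd->nextevent = last_event;
    while (cas_bool_acq(&last_event, pd, pd->nextevent) != 0);
}
```

Calling enqueue(&a) and then enqueue(&b) on two zero-initialised nodes should leave last_event pointing at b, with b.nextevent pointing at a.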