* debugging librbd async
@ 2013-08-16  5:00 James Harper
  2013-08-16  5:08 ` Sage Weil
  0 siblings, 1 reply; 11+ messages in thread
From: James Harper @ 2013-08-16  5:00 UTC (permalink / raw)
  To: ceph-devel; +Cc: Sylvain Munaut (s.munaut@whatever-company.com)

I'm testing out the tapdisk rbd driver that Sylvain wrote, under Xen, and have been having all sorts of problems because the tapdisk process is segfaulting. To make matters worse, any attempt to use gdb on the resulting core just tells me it can't find the threads ('generic error'). Google suggests that linking the main executable (tapdisk) with libpthread gets around this error, but that doesn't help.

With strategic printfs I have confirmed that in most cases the crash happens after a call to rbd_aio_read or rbd_aio_write and before the callback is called. Given the async nature of tapdisk it's impossible to be sure, but I'm confident that the crash is not happening in any of the tapdisk code. It's possible that there is an off-by-one error in a buffer somewhere, with the corruption showing up later, but there really isn't a lot of code there, and I've been over it very closely and it appears quite sound.

I have also tested for multiple completions of the same request, and for corrupt pointers being passed into the completion routine, and nothing shows up there either.
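For anyone not familiar with it, the async pattern in question looks roughly like this (a minimal sketch against the public librbd C API, with the sort of double-completion guard I mean; the request structure and names are illustrative, not the actual tapdisk code):

#include <stdio.h>
#include <stdint.h>
#include <assert.h>
#include <sys/types.h>
#include <rbd/librbd.h>

struct request {                  /* illustrative stand-in for a tapdisk request */
    rbd_completion_t comp;
    int completed;                /* set once; guards against a second completion */
    char *buf;
    uint64_t off;
    size_t len;
};

static void write_cb(rbd_completion_t comp, void *arg)
{
    struct request *req = arg;
    ssize_t r = rbd_aio_get_return_value(comp);

    assert(!req->completed);      /* would catch a double complete */
    req->completed = 1;
    if (r < 0)
        fprintf(stderr, "aio write failed: %zd\n", r);
    rbd_aio_release(comp);
}

static int submit_write(rbd_image_t image, struct request *req)
{
    int r = rbd_aio_create_completion(req, write_cb, &req->comp);
    if (r < 0)
        return r;
    /* the window under discussion: after this returns, before write_cb fires */
    return rbd_aio_write(image, req->off, req->len, req->buf, req->comp);
}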

In most cases there is nothing precipitating the crash, aside from a tendency to crash more often when the cluster is disturbed (e.g. a mon node is rebooted). I have one VM which will be unbootable for long periods of time, with the crash happening during boot, typically when postgres starts. This can be reproduced for hours and is useful for debugging, but then suddenly the problem goes away spontaneously and I can no longer reproduce it even after hundreds of reboots.

I'm using Debian, and the problem exists with both the latest cuttlefish and dumpling debs.

So... does librbd have any internal self-checking options I can enable? If I'm going to start injecting printfs around the place, can anyone suggest which code paths are most likely to be causing the above?

Thanks

James



* Re: debugging librbd async
  2013-08-16  5:00 debugging librbd async James Harper
@ 2013-08-16  5:08 ` Sage Weil
  2013-08-16  5:38   ` James Harper
  0 siblings, 1 reply; 11+ messages in thread
From: Sage Weil @ 2013-08-16  5:08 UTC (permalink / raw)
  To: James Harper; +Cc: ceph-devel, Sylvain Munaut (s.munaut@whatever-company.com)

On Fri, 16 Aug 2013, James Harper wrote:
> I'm testing out the tapdisk rbd that Sylvain wrote under Xen, and have been having all sorts of problems as the tapdisk process is segfaulting. To make matters worse, any attempt to use gdb on the resulting core just tells me it can't find the threads ('generic error'). Google tells me that I can get around this error by linking the main exe (tapdisk) with libpthread, but that doesn't help.
> 
> With strategic printf's I have confirmed that in most cases the crash happens after a call to rbd_aio_read or rbd_aio_write and before the callback is called. Given the async nature of tapdisk it's impossible to be sure but I'm confident that the crash is not happening in any of the tapdisk code. It's possible that there is an off-by-one error in a buffer somewhere with the corruption showing up later but there really isn't a lot of code there and I've been over it very closely and it appears quite sound.
> 
> I have also tested for multiple complete's for the same request, and corrupt pointers being passed into the completion routine, and nothing shows up there either.
> 
> In most cases there is nothing pre-empting the crash, aside from a tendency to seemingly crash more often when the cluster is disturbed (eg a mon node is rebooted). I have one VM which will be unbootable for long periods of time with the crash happening during boot, typically when postgres starts. This can be reproduced for hours and is useful for debugging, but then suddenly the problem goes away spontaneously and I can no longer reproduce it even after hundreds of reboots.
> 
> I'm using Debian and the problem exists with both the latest cuttlefish and dumpling deb's.
> 
> So... does librbd have any internal self-checking options I can enable? If I'm going to start injecting printf's around the place, can anyone suggest what code paths are most likely to be causing the above?

This is usually about the time when we try running things under
valgrind.  Is that an option with tapdisk?

Of course, the old standby is to just crank up the logging detail and try 
to narrow down where the crash happens.  Have you tried that yet?

There is a probable issue with aio_flush and caching enabled that Mike 
Dawson is trying to reproduce.  Are you running with caching on or off?

sage


* RE: debugging librbd async
  2013-08-16  5:08 ` Sage Weil
@ 2013-08-16  5:38   ` James Harper
  2013-08-16  5:49     ` Sage Weil
  0 siblings, 1 reply; 11+ messages in thread
From: James Harper @ 2013-08-16  5:38 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, Sylvain Munaut (s.munaut@whatever-company.com)

> 
> On Fri, 16 Aug 2013, James Harper wrote:
> > I'm testing out the tapdisk rbd that Sylvain wrote under Xen, and have
> > been having all sorts of problems as the tapdisk process is segfaulting. To
> > make matters worse, any attempt to use gdb on the resulting core just tells
> > me it can't find the threads ('generic error'). Google tells me that I can get
> > around this error by linking the main exe (tapdisk) with libpthread, but that
> > doesn't help.
> >
> > With strategic printf's I have confirmed that in most cases the crash
> > happens after a call to rbd_aio_read or rbd_aio_write and before the
> > callback is called. Given the async nature of tapdisk it's impossible to be sure
> > but I'm confident that the crash is not happening in any of the tapdisk code.
> > It's possible that there is an off-by-one error in a buffer somewhere with the
> > corruption showing up later but there really isn't a lot of code there and I've
> > been over it very closely and it appears quite sound.
> >
> > I have also tested for multiple complete's for the same request, and
> > corrupt pointers being passed into the completion routine, and nothing
> > shows up there either.
> >
> > In most cases there is nothing pre-empting the crash, aside from a
> > tendency to seemingly crash more often when the cluster is disturbed (eg a
> > mon node is rebooted). I have one VM which will be unbootable for long
> > periods of time with the crash happening during boot, typically when
> > postgres starts. This can be reproduced for hours and is useful for debugging,
> > but then suddenly the problem goes away spontaneously and I can no longer
> > reproduce it even after hundreds of reboots.
> >
> > I'm using Debian and the problem exists with both the latest cuttlefish and
> > dumpling deb's.
> >
> > So... does librbd have any internal self-checking options I can enable? If I'm
> > going to start injecting printf's around the place, can anyone suggest what
> > code paths are most likely to be causing the above?
> 
> This is usually about the time when we trying running things under
> valgrind.  Is that an option with tapdisk?

Never used it before. I guess I can find out pretty easily; I'll try that next.

> Of course, the old standby is to just crank up the logging detail and try
> to narrow down where the crash happens.  Have you tried that yet?

I haven't touched the rbd code. Is increased logging a compile-time option or a config option?

> 
> There is a probable issue with aio_flush and caching enabled that Mike
> Dawson is trying to reproduce.  Are you running with caching on or off?

I have not enabled caching, and I believe it's disabled by default.
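
For reference, I believe the switch in question is "rbd cache" in the [client] section of ceph.conf, which (just to show where it would go, with what I understand to be the default) would look like:

 [client]
  rbd cache = false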

Thanks

James


* RE: debugging librbd async
  2013-08-16  5:38   ` James Harper
@ 2013-08-16  5:49     ` Sage Weil
  2013-08-16  6:13       ` James Harper
  0 siblings, 1 reply; 11+ messages in thread
From: Sage Weil @ 2013-08-16  5:49 UTC (permalink / raw)
  To: James Harper; +Cc: ceph-devel, Sylvain Munaut (s.munaut@whatever-company.com)

On Fri, 16 Aug 2013, James Harper wrote:
> > 
> > On Fri, 16 Aug 2013, James Harper wrote:
> > > I'm testing out the tapdisk rbd that Sylvain wrote under Xen, and have
> > > been having all sorts of problems as the tapdisk process is segfaulting. To
> > > make matters worse, any attempt to use gdb on the resulting core just tells
> > > me it can't find the threads ('generic error'). Google tells me that I can get
> > > around this error by linking the main exe (tapdisk) with libpthread, but that
> > > doesn't help.
> > >
> > > With strategic printf's I have confirmed that in most cases the crash
> > > happens after a call to rbd_aio_read or rbd_aio_write and before the
> > > callback is called. Given the async nature of tapdisk it's impossible to be sure
> > > but I'm confident that the crash is not happening in any of the tapdisk code.
> > > It's possible that there is an off-by-one error in a buffer somewhere with the
> > > corruption showing up later but there really isn't a lot of code there and I've
> > > been over it very closely and it appears quite sound.
> > >
> > > I have also tested for multiple complete's for the same request, and
> > > corrupt pointers being passed into the completion routine, and nothing
> > > shows up there either.
> > >
> > > In most cases there is nothing pre-empting the crash, aside from a
> > > tendency to seemingly crash more often when the cluster is disturbed (eg a
> > > mon node is rebooted). I have one VM which will be unbootable for long
> > > periods of time with the crash happening during boot, typically when
> > > postgres starts. This can be reproduced for hours and is useful for debugging,
> > > but then suddenly the problem goes away spontaneously and I can no longer
> > > reproduce it even after hundreds of reboots.
> > >
> > > I'm using Debian and the problem exists with both the latest cuttlefish and
> > > dumpling deb's.
> > >
> > > So... does librbd have any internal self-checking options I can enable? If I'm
> > > going to start injecting printf's around the place, can anyone suggest what
> > > code paths are most likely to be causing the above?
> > 
> > This is usually about the time when we trying running things under
> > valgrind.  Is that an option with tapdisk?
> 
> Never used it before. I guess I can find out pretty easy, I'll try that next.
> 
> > Of course, the old standby is to just crank up the logging detail and try
> > to narrow down where the crash happens.  Have you tried that yet?
> 
> I haven't touched the rbd code. Is increased logging a compile-time 
> option or a config option?

That is probably the first thing you should try, then.  In the [client]
section of ceph.conf on the node where tapdisk is running, add something like

 [client]
  debug rbd = 20
  debug rados = 20
  debug ms = 1
  log file = /var/log/ceph/client.$name.$pid.log

and make sure the log directory is writeable.

> > There is a probable issue with aio_flush and caching enabled that Mike
> > Dawson is trying to reproduce.  Are you running with caching on or off?
> 
> I have not enabled caching, and I believe it's disabled by default.

There is a fix for an aio hang that just hit the cuttlefish branch today 
that could conceivably be the issue.  It causes a hang on qemu but maybe 
tapdisk is more sensitive?  I'd make sure you're running with that in any 
case to rule it out.

sage


* RE: debugging librbd async
  2013-08-16  5:49     ` Sage Weil
@ 2013-08-16  6:13       ` James Harper
  2013-08-16 15:57         ` Sage Weil
  0 siblings, 1 reply; 11+ messages in thread
From: James Harper @ 2013-08-16  6:13 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, Sylvain Munaut (s.munaut@whatever-company.com)

> > > Of course, the old standby is to just crank up the logging detail and try
> > > to narrow down where the crash happens.  Have you tried that yet?
> >
> > I haven't touched the rbd code. Is increased logging a compile-time
> > option or a config option?
> 
> That is probably the first you should try then.  In the [client] section
> of ceph.conf on the node where tapdisk is running add something like
> 
>  [client]
>   debug rbd = 20
>   debug rados = 20
>   debug ms = 1
>   log file = /var/log/ceph/client.$name.$pid.log
> 
> and make sure the log directory is writeable.
> 

Excellent. How noisy are those levels likely to be?

Is it the consumer of librbd that reads those values? I mean, all I need to do is restart the tapdisk process and the logging should happen, right?

> > > There is a probable issue with aio_flush and caching enabled that Mike
> > > Dawson is trying to reproduce.  Are you running with caching on or off?
> >
> > I have not enabled caching, and I believe it's disabled by default.
> 
> There is a fix for an aio hang that just hit the cuttlefish branch today
> that could conceivably be the issue.  It causes a hang on qemu but maybe
> tapdisk is more sensitive?  I'd make sure you're running with that in any
> case to rule it out.
> 

I switched to dumpling in the last few days to see if the problem existed there. Is the fix you mention in dumpling? I'm not yet running mission-critical production code on ceph, just a secondary Windows domain controller, a secondary spam filter, and a few other machines that don't affect production if they crash.

I'm also testing valgrind at the moment, just basic memcheck, but suddenly everything is quite stable even though it's under reasonable load right now. Stupid heisenbugs.

Thanks

James





* RE: debugging librbd async
  2013-08-16  6:13       ` James Harper
@ 2013-08-16 15:57         ` Sage Weil
  2013-08-17  0:07           ` James Harper
  2013-08-17 14:10           ` James Harper
  0 siblings, 2 replies; 11+ messages in thread
From: Sage Weil @ 2013-08-16 15:57 UTC (permalink / raw)
  To: James Harper; +Cc: ceph-devel, Sylvain Munaut (s.munaut@whatever-company.com)

On Fri, 16 Aug 2013, James Harper wrote:
> > > > Of course, the old standby is to just crank up the logging detail and try
> > > > to narrow down where the crash happens.  Have you tried that yet?
> > >
> > > I haven't touched the rbd code. Is increased logging a compile-time
> > > option or a config option?
> > 
> > That is probably the first you should try then.  In the [client] section
> > of ceph.conf on the node where tapdisk is running add something like
> > 
> >  [client]
> >   debug rbd = 20
> >   debug rados = 20
> >   debug ms = 1
> >   log file = /var/log/ceph/client.$name.$pid.log
> > 
> > and make sure the log directory is writeable.
> > 
> 
> Excellent. How noisy are those levels likely to be?
> 
> Is it the consumer of librbd that reads those values? I mean all I need 
> to do is restart tapdisk process and the logging should happen right?

That should do it, yeah.
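
(For what it's worth, the config gets picked up when the librbd consumer initialises librados and reads ceph.conf, so a restarted tapdisk will see the new [client] settings.  A typical open path, sketched from the public API rather than from the actual tapdisk driver, looks roughly like this:)

#include <rados/librados.h>
#include <rbd/librbd.h>

/* Sketch of a typical librbd consumer's open path -- not the tapdisk
   driver itself.  rados_conf_read_file() with a NULL path is where
   ceph.conf (and therefore the [client] debug/log settings) is read. */
static int open_image(const char *pool, const char *name,
                      rados_t *cluster, rados_ioctx_t *ioctx,
                      rbd_image_t *image)
{
    int r;

    r = rados_create(cluster, NULL);            /* client.admin by default */
    if (r < 0)
        return r;
    r = rados_conf_read_file(*cluster, NULL);   /* NULL = default search path */
    if (r < 0)
        return r;
    r = rados_connect(*cluster);
    if (r < 0)
        return r;
    r = rados_ioctx_create(*cluster, pool, ioctx);
    if (r < 0)
        return r;
    return rbd_open(*ioctx, name, image, NULL); /* NULL snap = head */
}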

> > > > There is a probable issue with aio_flush and caching enabled that Mike
> > > > Dawson is trying to reproduce.  Are you running with caching on or off?
> > >
> > > I have not enabled caching, and I believe it's disabled by default.
> > 
> > There is a fix for an aio hang that just hit the cuttlefish branch today
> > that could conceivably be the issue.  It causes a hang on qemu but maybe
> > tapdisk is more sensitive?  I'd make sure you're running with that in any
> > case to rule it out.
> > 
> 
> I switched to dumpling in the last few days to see if the problem existed there. Is the fix you mention in dumpling? I'm not yet running mission critical production code on ceph, just a secondary windows domain controller, secondary spam filter, and a few other machines that don't affect production if they crash.

The fix is in the dumpling branch, but not in v0.67.  It will be in 
v0.67.1.
 
> I'm also testing valgrind at the moment, just basic memtest, but suddenly everything is quite stable even though it's under reasonable load right now. Stupid heisenbugs.

Valgrind makes things go very slow (~10x?), which can have a huge effect 
on timing. Sometimes that reveals new races, other times it hides others.  
:/

sage


* RE: debugging librbd async
  2013-08-16 15:57         ` Sage Weil
@ 2013-08-17  0:07           ` James Harper
  2013-08-17 14:10           ` James Harper
  1 sibling, 0 replies; 11+ messages in thread
From: James Harper @ 2013-08-17  0:07 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, Sylvain Munaut (s.munaut@whatever-company.com)

> > I'm also testing valgrind at the moment, just basic memtest, but suddenly
> > everything is quite stable even though it's under reasonable load right now.
> > Stupid heisenbugs.
> 
> Valgrind makes things go very slow (~10x?), which can have a huge effect
> on timing. Sometimes that reveals new races, other times it hides others.
> :/
> 

That's what I figured. It's run overnight with no hiccups so far under valgrind (helgrind - but no crashes under memcheck either). The helgrind process on one tapdisk instance is working pretty hard, but performance is tolerable. I'll let it run and poke at it occasionally to see if I can reproduce an actual crash.

So far helgrind has identified a bunch of possible races like:

==12478== Possible data race during write of size 4 at 0x913F250 by thread #1
==12478== Locks held: 2, at addresses 0x9119B80 0x913EF88
==12478==    at 0x5821D57: Pipe::Pipe(SimpleMessenger*, int, Connection*) (in /usr/lib/librados.so.2.0.0)
==12478==    by 0x5838FC7: SimpleMessenger::connect_rank(entity_addr_t const&, int, Connection*, Message*) (in /usr/lib/librados.so.2.0.0)
==12478==    by 0x5839414: SimpleMessenger::get_connection(entity_inst_t const&) (in /usr/lib/librados.so.2.0.0)
==12478==    by 0x56D062F: Objecter::get_session(int) (in /usr/lib/librados.so.2.0.0)
==12478==    by 0x56D0FAA: Objecter::recalc_op_target(Objecter::Op*) (in /usr/lib/librados.so.2.0.0)
==12478==    by 0x56D6140: Objecter::_op_submit(Objecter::Op*) (in /usr/lib/librados.so.2.0.0)
==12478==    by 0x56C4640: librados::IoCtxImpl::operate_read(object_t const&, ObjectOperation*, ceph::buffer::list*) (in /usr/lib/librados.so.2.0.0)
==12478==    by 0x56C53D1: librados::IoCtxImpl::stat(object_t const&, unsigned long*, long*) (in /usr/lib/librados.so.2.0.0)
==12478==    by 0x569B83E: librados::IoCtx::stat(std::string const&, unsigned long*, long*) (in /usr/lib/librados.so.2.0.0)
==12478==    by 0x63E45BF: librbd::detect_format(librados::IoCtx&, std::string const&, bool*, unsigned long*) (in /usr/lib/librbd.so.1.0.0)
==12478==    by 0x63DCF6E: librbd::ImageCtx::init() (in /usr/lib/librbd.so.1.0.0)
==12478==    by 0x63F45C7: librbd::open_image(librbd::ImageCtx*) (in /usr/lib/librbd.so.1.0.0)
==12478==
==12478== This conflicts with a previous read of size 4 by thread #9
==12478== Locks held: none
==12478==    at 0x581E6A1: Pipe::tcp_read_wait() (in /usr/lib/librados.so.2.0.0)
==12478==    by 0x581E96F: Pipe::tcp_read(char*, int) (in /usr/lib/librados.so.2.0.0)
==12478==    by 0x581F965: Pipe::read_message(Message**) (in /usr/lib/librados.so.2.0.0)
==12478==    by 0x583138B: Pipe::reader() (in /usr/lib/librados.so.2.0.0)
==12478==    by 0x5834D5C: Pipe::Reader::entry() (in /usr/lib/librados.so.2.0.0)
==12478==    by 0x4C2B5AD: mythread_wrapper (hg_intercepts.c:219)
==12478==    by 0x6650B4F: start_thread (pthread_create.c:304)
==12478==    by 0x6940A7C: clone (clone.S:112)

But I don't think they are correct. There are also some races that appear to relate to memcpy.

Also, I believe tapdisk serialises all callbacks, and I don't think helgrind is fully aware of that, as it identifies some cases that just can't happen.
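
If I wanted to teach it about that serialisation, I gather Helgrind has annotation macros for exactly this; something like the following (purely illustrative - these are not the actual tapdisk structures) would declare the happens-before edge and should quieten those reports:

#include <valgrind/helgrind.h>

struct cb_req {                      /* hypothetical request handed to the dispatcher */
    void (*cb)(void *arg);
    void *arg;
};

/* Producer side: everything written to req before this point is ordered
   before the matching HAPPENS_AFTER in the dispatch thread. */
static void hand_off(struct cb_req *req)
{
    ANNOTATE_HAPPENS_BEFORE(req);
    /* ... queue req for the single callback/dispatch thread ... */
}

/* Consumer side: the dispatch thread runs the callbacks one at a time. */
static void dispatch(struct cb_req *req)
{
    ANNOTATE_HAPPENS_AFTER(req);
    req->cb(req->arg);
}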

The last thing logged is:

==12478== More than 10000000 total errors detected.  I'm not reporting any more.
==12478== Final error counts will be inaccurate.  Go fix your program!
==12478== Rerun with --error-limit=no to disable this cutoff.  Note
==12478== that errors may occur in your program without prior warning from
==12478== Valgrind, because errors are no longer being displayed.

So I think maybe it's stopped being useful in that instance. I can post the full output if you want; it's only about 40kB.

James


* RE: debugging librbd async
  2013-08-16 15:57         ` Sage Weil
  2013-08-17  0:07           ` James Harper
@ 2013-08-17 14:10           ` James Harper
  2013-08-28  4:49             ` James Harper
  1 sibling, 1 reply; 11+ messages in thread
From: James Harper @ 2013-08-17 14:10 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, Sylvain Munaut (s.munaut@whatever-company.com)

I can now reliably reproduce this with fio (see the config at the end of this email), but so far never under valgrind.

James

[global]
directory=/tmp/fio
size=128M
ioengine=libaio

[randwrite1]
rw=randwrite
iodepth=32

[randread1]
rw=randread
iodepth=32

[randwrite2]
rw=randwrite
iodepth=32

[randread2]
rw=randread
iodepth=32


* RE: debugging librbd async
  2013-08-17 14:10           ` James Harper
@ 2013-08-28  4:49             ` James Harper
  2013-08-28 15:45               ` Sage Weil
  2013-08-28 21:40               ` James Harper
  0 siblings, 2 replies; 11+ messages in thread
From: James Harper @ 2013-08-28  4:49 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, Sylvain Munaut (s.munaut@whatever-company.com)

Still having crashes with the rbd module for blktap. The problem is that I can't get consistent debugging info out of librbd: when it writes to a file the logging is buffered, so the tail is always missing, and when it logs to syslog I thought I was getting everything but now I'm not so sure.

What is the best way to ensure that each log message gets flushed to the logfile? I'm looking now but a hint would get the job done faster!

Thanks

James


* RE: debugging librbd async
  2013-08-28  4:49             ` James Harper
@ 2013-08-28 15:45               ` Sage Weil
  2013-08-28 21:40               ` James Harper
  1 sibling, 0 replies; 11+ messages in thread
From: Sage Weil @ 2013-08-28 15:45 UTC (permalink / raw)
  To: James Harper; +Cc: ceph-devel, Sylvain Munaut (s.munaut@whatever-company.com)

On Wed, 28 Aug 2013, James Harper wrote:
> Still having crashes with the rbd module for blktap. I think I can't get 
> consistent debugging info out of librbd. When it writes to a file the 
> logging is buffered so the tail is always missing. When it logs to 
> syslog I thought I was getting everything but now I'm not so sure.

You can disable buffering with

 log max recent = 0

That should catch everything!
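
In context that sits in the same [client] section as the debug settings from earlier in the thread, i.e. something like

 [client]
  debug rbd = 20
  debug rados = 20
  debug ms = 1
  log file = /var/log/ceph/client.$name.$pid.log
  log max recent = 0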

> What is the best way to ensure that each log message gets flushed to the 
> logfile? I'm looking now but a hint would get the job done faster!

sage


> 
> Thanks
> 
> James


* RE: debugging librbd async
  2013-08-28  4:49             ` James Harper
  2013-08-28 15:45               ` Sage Weil
@ 2013-08-28 21:40               ` James Harper
  1 sibling, 0 replies; 11+ messages in thread
From: James Harper @ 2013-08-28 21:40 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, Sylvain Munaut (s.munaut@whatever-company.com)

I have set the logfile to be opened with SYNC, and that seems to be giving me more consistent output.

I see the crash is mostly happening around queue_aio_write. Most of the time the last thing I see is this entry: "librados: queue_aio_write 0x7f0928004390 completion 0x1ea65d0 write_seq 147". I've never seen it happen in a read operation, and it's always when there are lots of writes queued.

From adding further debug statements, I can see that the exact point of the crash is very soon after that, but not at a consistent point.

I'm thinking that the crash is actually happening somewhere in the callback chain, although definitely before the librbd callbacks are invoked, as debug prints at the rados end of the callbacks show nothing.

Where is the 'top' of the callback chain on a write? I can see that librados calls librbd (which then calls the callback in tapdisk), but what calls librados?

Thanks

James


