From: Philipp Hahn
Subject: Re: xenstored crashes with SIGSEGV
Date: Fri, 12 Dec 2014 18:20:58 +0100
Message-ID: <548B23FA.6070108@univention.de>
References: <546461A2.2070908@univention.de> <1415869951.31613.26.camel@citrix.com> <548B1472.5080302@univention.de> <1418401932.16425.34.camel@citrix.com> <548B1BA8.3090504@univention.de> <1418403387.16425.38.camel@citrix.com>
In-Reply-To: <1418403387.16425.38.camel@citrix.com>
To: Ian Campbell
Cc: Xen-devel@lists.xen.org
List-Id: xen-devel@lists.xenproject.org

Hello Ian,

On 12.12.2014 17:56, Ian Campbell wrote:
> On Fri, 2014-12-12 at 17:45 +0100, Philipp Hahn wrote:
>> On 12.12.2014 17:32, Ian Campbell wrote:
>>> On Fri, 2014-12-12 at 17:14 +0100, Philipp Hahn wrote:
>>>> We did enable tracing and now have the xenstored-trace.log of one crash:
>>>> It contains 1.6 billion lines and is 83 GiB.
>>>> It just shows xenstored to crash on TRANSACTION_START.
>>>>
>>>> Is there some tool to feed that trace back into a newly launched xenstored?
>>>
>>> Not that I know of I'm afraid.
>>
>> Okay, then I have to continue with my own tool.
>
> If you do end up developing a tool to replay a xenstore trace then I
> think that'd be something great to have in tree!

I just need to figure out how to talk to xenstored on the wire: for some
strange reason xenstored is closing the connection to the UNIX socket on
the first write inside a transaction. Or switch to
/usr/share/pyshared/xen/xend/xenstore/xstransact.py...
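For reference, the sequence I'm trying to send is roughly the sketch below.
It is only my understanding of the wire format from
xen/include/public/io/xs_wire.h (a header of four little-endian uint32s:
type, req_id, tx_id, len, followed by len payload bytes, with
XS_TRANSACTION_START = 6 and XS_WRITE = 11); the socket path
/var/run/xenstored/socket and the node /local/test are just placeholders
for this example, not taken from the trace:

    #!/usr/bin/env python
    # Rough sketch: start a transaction over the xenstored UNIX socket and
    # issue one write inside it - the point where the connection dies here.
    import socket
    import struct

    XS_TRANSACTION_START = 6   # message types per xs_wire.h (assumption)
    XS_WRITE = 11

    def recv_exact(sock, n):
        """Read exactly n bytes or raise if xenstored closes the connection."""
        data = b""
        while len(data) < n:
            chunk = sock.recv(n - len(data))
            if not chunk:
                raise IOError("xenstored closed the connection")
            data += chunk
        return data

    def xs_request(sock, msg_type, tx_id, payload):
        """Send one request and return (type, tx_id, payload) of the reply."""
        sock.sendall(struct.pack("<IIII", msg_type, 0, tx_id, len(payload)) + payload)
        rtype, _req_id, rtx_id, length = struct.unpack("<IIII", recv_exact(sock, 16))
        return rtype, rtx_id, recv_exact(sock, length)

    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect("/var/run/xenstored/socket")  # default socket path (assumption)

    # TRANSACTION_START: an empty NUL-terminated payload, as the C client
    # sends it; the reply carries the transaction id as a decimal string.
    _, _, body = xs_request(sock, XS_TRANSACTION_START, 0, b"\x00")
    tx = int(body.rstrip(b"\x00").decode())

    # First write inside the transaction: payload is <path>\0<value>, the
    # transaction id goes in the header.
    print(xs_request(sock, XS_WRITE, tx, b"/local/test\x00hello"))

If the connection really dies on that XS_WRITE, replaying the same steps
through xstransact.py (which, IIRC, goes through the xen.lowlevel.xs
bindings rather than hand-rolled framing) should at least tell me whether
the problem is my framing or the daemon.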
>>> Do you get a core dump when this happens? You might need to fiddle with
>>> ulimits (some distros disable by default). IIRC there is also some /proc
>>> knob which controls where core dumps go on the filesystem.
>>
>> Not for that specific trace: We first enabled generating core files, but
>> only then discovered that this is not enough.
>
> How wasn't it enough? You mean you couldn't use gdb to extract a
> backtrace from the core file? Or was something else wrong?

The 1st and 2nd traces look like this; the ptr in frame #2 looks very bogus:

(gdb) bt full
#0  talloc_chunk_from_ptr (ptr=0xff00000000) at talloc.c:116
        tc =
#1  0x0000000000407edf in talloc_free (ptr=0xff00000000) at talloc.c:551
        tc =
#2  0x000000000040a348 in tdb_open_ex (name=0x1941fb0 "/var/lib/xenstored/tdb.0x1935bb0", hash_size=,
    tdb_flags=0, open_flags=, mode=, log_fn=0x4093b0 , hash_fn=) at tdb.c:1958
        tdb = 0x1921270
        st = {st_dev = 17, st_ino = 816913342, st_nlink = 1, st_mode = 33184, st_uid = 0, st_gid = 0,
          __pad0 = 0, st_rdev = 0, st_size = 303104, st_blksize = 4096, st_blocks = 592,
          st_atim = {tv_sec = 1415748063, tv_nsec = 87562634}, st_mtim = {tv_sec = 1415748063, tv_nsec = 87562634},
          st_ctim = {tv_sec = 1415748063, tv_nsec = 87562634}, __unused = {0, 0, 0}}
        rev =
        locked = 4232112
        vp =
#3  0x000000000040a684 in tdb_open (name=0xff00000000, hash_size=0, tdb_flags=4254928, open_flags=-1,
    mode=3119127560) at tdb.c:1773
No locals.
#4  0x000000000040a70b in tdb_copy (tdb=0x192e540, outfile=0x1941fb0 "/var/lib/xenstored/tdb.0x1935bb0") at tdb.c:2124
        fd =
        saved_errno =
        copy = 0x0
#5  0x0000000000406c2d in do_transaction_start (conn=0x1939550, in=) at xenstored_transaction.c:164
        trans = 0x1935bb0
        exists =
        id_str = "\300L\222\001\000\000\000\000\330!@\000\000\000\000\000P\225\223\001"
#6  0x00000000004045ca in process_message (conn=0x1939550) at xenstored_core.c:1214
        trans =
#7  consider_message (conn=0x1939550) at xenstored_core.c:1261
No locals.
#8  handle_input (conn=0x1939550) at xenstored_core.c:1308
        bytes =
        in =
#9  0x0000000000405170 in main (argc=, argv=) at xenstored_core.c:1964

A 3rd trace is somewhere completely different:

(gdb) bt
#0  0x00007fcbf066088d in _IO_vfprintf_internal (s=0x7fff46ac3010, format=, ap=0x7fff46ac3170) at vfprintf.c:1617
#1  0x00007fcbf0682732 in _IO_vsnprintf (string=0x7fff46ac318f "", maxlen=, format=0x40d4a4 "%.*s",
    args=0x7fff46ac3170) at vsnprintf.c:120
#2  0x000000000040855b in talloc_vasprintf (t=0x17aaf20, fmt=0x40d4a4 "%.*s", ap=0x7fff46ac31d0) at talloc.c:1104
#3  0x0000000000408666 in talloc_asprintf (t=0x1f, fmt=0xffffe938) at talloc.c:1129
#4  0x0000000000403a38 in ask_parents (conn=0x177a1f0, name=0x17aaf20 "/local/domain/0/backend/vif/1/0/accel",
    perm=XS_PERM_READ) at xenstored_core.c:492
#5  errno_from_parents (conn=0x177a1f0, name=0x17aaf20 "/local/domain/0/backend/vif/1/0/accel",
    perm=XS_PERM_READ) at xenstored_core.c:516
#6  get_node (conn=0x177a1f0, name=0x17aaf20 "/local/domain/0/backend/vif/1/0/accel",
    perm=XS_PERM_READ) at xenstored_core.c:543
#7  0x000000000040481d in do_read (conn=0x177a1f0) at xenstored_core.c:744
#8  process_message (conn=0x177a1f0) at xenstored_core.c:1178
#9  consider_message (conn=0x177a1f0) at xenstored_core.c:1261
#10 handle_input (conn=0x177a1f0) at xenstored_core.c:1308
#11 0x0000000000405170 in main (argc=, argv=) at xenstored_core.c:1964

>> It might be interesting to see what happens if you preserve the db and
>> reboot arranging for the new xenstored to start with the old file. If
>> the corruption is part of the file then maybe it can be induced to crash
>> again more quickly.
>
> Thanks for the pointer, will try.

Didn't crash immediately. Now running
/usr/share/pyshared/xen/xend/xenstore/tests/stress_xs.py for the weekend.

Thanks again.
Philipp