* xenstored crashes with SIGSEGV
From: Philipp Hahn @ 2014-11-13  7:45 UTC
  To: Xen-devel

Hello,

For some time we have been observing several hosts on which xenstored
crashes. We have seen the following crash twice so far:

> #0  talloc_chunk_from_ptr (ptr=0xff0000000000) at talloc.c:116
> 116             if ((tc->flags & ~0xF) != TALLOC_MAGIC) { 
> warning: not using untrusted file
> "/root/xen-4.1-4.1.3/xen-4.1.3/tools/xenstore/.gdbinit"
> (gdb) bt
> #0  talloc_chunk_from_ptr (ptr=0xff0000000000) at talloc.c:116
> #1  0x0000000000407edf in talloc_free (ptr=0xff0000000000) at talloc.c:551
> #2  0x000000000040a348 in tdb_open_ex (name=0x167d620
> "/var/lib/xenstored/tdb.0x16a48b0", 
>     hash_size=<value optimized out>, tdb_flags=0, open_flags=<value optimized
> out>, mode=<value optimized out>, 
>     log_fn=0x4093b0 <null_log_fn>, hash_fn=<value optimized out>) at tdb.c:1958
> #3  0x000000000040a684 in tdb_open (name=0xff0000000000 <Address 0xff0000000000
> out of bounds>, hash_size=0, 
>     tdb_flags=4254928, open_flags=-1, mode=3974450184) at tdb.c:1773
> #4  0x000000000040a70b in tdb_copy (tdb=0x16c9040, outfile=0x167d620
> "/var/lib/xenstored/tdb.0x16a48b0")
>     at tdb.c:2124
> #5  0x0000000000406c2d in do_transaction_start (conn=0x167e310, in=<value
> optimized out>)
>     at xenstored_transaction.c:164
> #6  0x00000000004045ca in process_message (conn=0x167e310) at
> xenstored_core.c:1214
> #7  consider_message (conn=0x167e310) at xenstored_core.c:1261
> #8  handle_input (conn=0x167e310) at xenstored_core.c:1308
> #9  0x0000000000405170 in main (argc=<value optimized out>, argv=<value
> optimized out>) at xenstored_core.c:1964

> (gdb) frame 2
> #2  0x000000000040a348 in tdb_open_ex (name=0x167d620 "/var/lib/xenstored/tdb.0x16a48b0", 
>     hash_size=<value optimized out>, tdb_flags=0, open_flags=<value optimized out>, mode=<value optimized out>, 
>     log_fn=0x4093b0 <null_log_fn>, hash_fn=<value optimized out>) at tdb.c:1958
> 1958            SAFE_FREE(tdb->locked);
> (gdb) print tdb->locked
> $3 = (struct tdb_lock_type *) 0xff0000000000
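
For reference, talloc_chunk_from_ptr() is essentially the following (a
paraphrase of our talloc.c), so the SIGSEGV is the read of tc->flags
through the bogus ptr:

        /* paraphrased from talloc.c; the chunk header sits immediately
         * before the pointer handed out to the user */
        static struct talloc_chunk *talloc_chunk_from_ptr(const void *ptr)
        {
                struct talloc_chunk *tc =
                        (struct talloc_chunk *)((const char *)ptr - TC_HDR_SIZE);

                /* talloc.c:116: this read of tc->flags is what faults */
                if ((tc->flags & ~0xF) != TALLOC_MAGIC)
                        TALLOC_ABORT("Bad talloc magic value - unknown value");
                if (tc->flags & TALLOC_FLAG_FREE)
                        TALLOC_ABORT("Bad talloc magic value - double free");

                return tc;
        }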

Another one was in vsprintf() - see
<https://forge.univention.org/bugzilla/show_bug.cgi?id=35104#c3> for the
full back traces.

To me this looks like some memory corruption by some unknown code
writing into some random memory space, which happens to be the tdb here.

As far as I know xenstored can't be restarted, as - for example - qemu-dm
and blktap2 processes have open file handles to the xenstored unix
socket for IPC, which would need re-opening. As such the host must be
rebooted to fix this situation, as the VMs can no longer be managed and
thus cannot be migrated.

The host is still running xen-4.1.3 (I know that this is quite old), but
I had a look at the changes between that version and master for
tools/xenstore/ myself and didn't see any obvious change which could fix
that.

1. Has someone observed a similar crash?

2. We've now also enabled "xenstored -T /log --verbose" to log the
messages in the hope of finding the triggering transaction, but until then
is there something more we can do to track down the problem?

3. The crash happens rarely and the hosts run fine most of the time. The
crash mostly happens around midnight and seems to be guest-triggered, as
the logs on the host don't show any activity like starting new or
destroying running VMs. So far the problem has only shown up on hosts
running Linux VMs; other hosts running Windows VMs have never shown this
crash.

Thank you for your support.

Philipp
-- 
Philipp Hahn
Open Source Software Engineer

Univention GmbH
be open.
Mary-Somerville-Str. 1
D-28359 Bremen
Tel.: +49 421 22232-0
Fax: +49 421 22232-99
hahn@univention.de

http://www.univention.de/
Managing Director: Peter H. Ganten
HRB 20755, Bremen District Court
Tax no.: 71-597-02876



* Re: xenstored crashes with SIGSEGV
From: Ian Campbell @ 2014-11-13  9:12 UTC
  To: Philipp Hahn; +Cc: Xen-devel

On Thu, 2014-11-13 at 08:45 +0100, Philipp Hahn wrote:
> To me this looks like some memory corruption by some unknown code
> writing into some random memory space, which happens to be the tdb here.

I wonder if running xenstored under valgrind would be useful. I think
you'd want to stop xenstored from starting during normal boot and then
launch it with:
        valgrind /usr/local/sbin/xenstored -N
-N is to stay in the foreground; you might want to do this in a screen
session or something. Alternatively you could investigate the --log-*
options in the valgrind manpage, together with the various
--trace-children* options, in order to follow the process across its
daemonization.
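
Untested, but if you do let it daemonize then something along these lines
(flag names as per the valgrind manpage; the log path is only an example)
ought to follow it across the fork and keep the output:

        valgrind --trace-children=yes \
                 --log-file=/var/log/xenstored-valgrind.%p \
                 /usr/local/sbin/xenstored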

I'm not sure what the impact on the system would be with this, but I
think it is probably ok unless you have massive xs load.

You'll need a version of valgrind with Xen support in it; anything from
the last year or so should do, I think.

Other than that we don't really have anyone who is an expert in that
aspect of the C xenstore/tdb who we can lean on for pointers (no pun
intended) etc, so in the absence of some sort of ability to trigger on
demand I'm not sure what else to suggest.

> 1. Has someone observed a similar crash?

I think you are the only one I've seen reporting this.

> 2. We've now also enabled "xenstored -T /log --verbose" to log the
> messages in the hope of finding the triggering transaction, but until then
> is there something more we can do to track down the problem?
> 
> 3. The crash happens rarely and the hosts run fine most of the time. The
> crash mostly happens around midnight and seems to be guest-triggered, as
> the logs on the host don't show any activity like starting new or
> destroying running VMs. So far the problem has only shown up on hosts
> running Linux VMs; other hosts running Windows VMs have never shown this
> crash.

If it is really mostly happening around midnight then it might be worth
digging into the host and guest configs for cronjobs and the like, e.g.
log rotation or similar stuff which might be tweaking things somehow.

Does this happen on multiple hosts, or just the one?

Do you rm the xenstore db on boot? It might have a persistent
corruption, aiui most folks using C xenstored are doing so or even
placing it on a tmpfs for performance reasons.
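
For the tmpfs variant an fstab entry along these lines is the sort of
thing I mean (the size is just a guess):

        tmpfs /var/lib/xenstored tmpfs size=16m,mode=0755 0 0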

If you are running 4.1.x then I think oxenstored isn't an option, but it
might be something to consider when you upgrade.

Ian.


* Re: xenstored crashes with SIGSEGV
From: Philipp Hahn @ 2014-12-12 16:14 UTC
  To: Ian Campbell; +Cc: Xen-devel

Hello,

On 13.11.2014 10:12, Ian Campbell wrote:
> On Thu, 2014-11-13 at 08:45 +0100, Philipp Hahn wrote:
>> To me this looks like some memory corruption by some unknown code
>> writing into some random memory space, which happens to be the tdb here.
> 
> I wonder if running xenstored under valgrind would be useful. I think
> you'd want to stop xenstored from starting during normal boot and then
> launch it with:
>         valgrind /usr/local/sbin/xenstored -N
> -N is to stay in the foreground; you might want to do this in a screen
> session or something. Alternatively you could investigate the --log-*
> options in the valgrind manpage, together with the various
> --trace-children* options, in order to follow the process across its
> daemonization.

We did enable tracing and now have the xenstored-trace.log of one crash:
It contains 1.6 billion lines and is 83 GiB.
It just shows xenstored crashing on TRANSACTION_START.

Is there some tool to feed that trace back into a newly launched xenstored?

My hope would be that xenstored crashes again, because then we could use
all those other tools like valgrind more easily.

>> 3. The crash happens rarely and the hosts run fine most of the time. The
>> crash mostly happens around midnight and seems to be guest-triggered, as
>> the logs on the host don't show any activity like starting new or
>> destroying running VMs. So far the problem has only shown up on hosts
>> running Linux VMs; other hosts running Windows VMs have never shown this
>> crash.

Now we also observed a crash on a host running Windows VMs.

> If it is really mostly happening around midnight then it might be worth
> digging into the host and guest configs for cronjobs and the like, e.g.
> log rotation stuff like that which might be tweaking things somehow.
> 
> Does this happen on multiple hosts, or just the one?

Multiple hosts in two different data centers.

> Do you rm the xenstore db on boot? It might have a persistent
> corruption, aiui most folks using C xenstored are doing so or even
> placing it on a tmpfs for performance reasons.

We're using a tmpfs for /var/lib/xenstored/, as we had some severe
performance problems with something updating
/local/domain/0/backend/console/*/0/uuid too often, which put xenstored
in a permanent D state.

> If you are running 4.1.x then I think oxenstored isn't an option, but it
> might be something to consider when you upgrade.

Thank you for the hint, I'll have another look at the Ocaml version.

Thank you again.
Philipp Hahn


* Re: xenstored crashes with SIGSEGV
From: Ian Campbell @ 2014-12-12 16:32 UTC
  To: Philipp Hahn; +Cc: Xen-devel

On Fri, 2014-12-12 at 17:14 +0100, Philipp Hahn wrote:
> Hello,
> 
> On 13.11.2014 10:12, Ian Campbell wrote:
> > On Thu, 2014-11-13 at 08:45 +0100, Philipp Hahn wrote:
> >> To me this looks like some memory corruption by some unknown code
> >> writing into some random memory space, which happens to be the tdb here.
> > 
> > I wonder if running xenstored under valgrind would be useful. I think
> > you'd want to stop xenstored from starting during normal boot and then
> > launch it with:
> >         valgrind /usr/local/sbin/xenstored -N
> > -N is to stay in the foreground; you might want to do this in a screen
> > session or something. Alternatively you could investigate the --log-*
> > options in the valgrind manpage, together with the various
> > --trace-children* options, in order to follow the process across its
> > daemonization.
> 
> We did enable tracing and now have the xenstored-trace.log of one crash:
> It contains 1.6 billion lines and is 83 GiB.
> It just shows xenstored crashing on TRANSACTION_START.
> 
> Is there some tool to feed that trace back into a newly launched xenstored?

Not that I know of I'm afraid.

Do you get a core dump when this happens? You might need to fiddle with
ulimits (some distros disable by default). IIRC there is also some /proc
knob which controls where core dumps go on the filesystem.
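
From memory (see core(5) for the details) it is something like:

        ulimit -c unlimited
        echo '/var/crash/core.%e.%p' > /proc/sys/kernel/core_pattern

where %e expands to the executable name and %p to the pid.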

> My hope would be that xenstored crashes again, because then we could use
> all those other tools like valgrind more easily.

That would be handy. My fear would be that this bug is likely to be a
race condition of some sort, and the granularity/accuracy of the
playback would possibly need to be quite high to trigger the issue.
 
> > Do you rm the xenstore db on boot? It might have a persistent
> > corruption, aiui most folks using C xenstored are doing so or even
> > placing it on a tmpfs for performance reasons.
> 
> We're using a tmpfs for /var/lib/xenstored/, as we had some severe
> performance problems with something updating
> /local/domain/0/backend/console/*/0/uuid too often, which put xenstored
> in a permanent D state.

But this is just a process crashing and not the whole host so you still
have the db file at the point of the crash?

It might be interesting to see what happens if you preserve the db and
reboot arranging for the new xenstored to start with the old file. If
the corruption is part of the file then maybe it can be induced to crash
again more quickly.

Ian.


* Re: xenstored crashes with SIGSEGV
From: Philipp Hahn @ 2014-12-12 16:45 UTC
  To: Ian Campbell; +Cc: Xen-devel

Hello Ian,

On 12.12.2014 17:32, Ian Campbell wrote:
> On Fri, 2014-12-12 at 17:14 +0100, Philipp Hahn wrote:
>> We did enable tracing and now have the xenstored-trace.log of one crash:
>> It contains 1.6 billion lines and is 83 GiB.
>> It just shows xenstored crashing on TRANSACTION_START.
>>
>> Is there some tool to feed that trace back into a newly launched xenstored?
> 
> Not that I know of I'm afraid.

Okay, then I have to continue with my own tool.

> Do you get a core dump when this happens? You might need to fiddle with
> ulimits (some distros disable by default). IIRC there is also some /proc
> knob which controls where core dumps go on the filesystem.

Not for that specific trace: We first enabled generating core files, but
only then discovered that this is not enough. Then we enabled
--trace-file, but on that host something reset the core-file setting.
We have hopefully fixed all hosts, so on the next crash we should get
both a core file and the trace.

>> My hope would be that xenstored crashes again, because then we could use
>> all those other tools like valgrind more easily.
> 
> That would be handy. My fear would be that this bug is likely to be a
> race condition of some sort, and the granularity/accuracy of the
> playback would possibly need to be quite high to trigger the issue.

cxenstored looks single threaded to me, or am I wrong?

>>> Do you rm the xenstore db on boot? It might have a persistent
>>> corruption, aiui most folks using C xenstored are doing so or even
>>> placing it on a tmpfs for performance reasons.
>>
>> We're using a tmpfs for /var/lib/xenstored/, as we had some severe
>> performance problems with something updating
>> /local/domain/0/backend/console/*/0/uuid too often, which put xenstored
>> in a permanent D state.
> 
> But this is just a process crashing and not the whole host so you still
> have the db file at the point of the crash?

Yes: Running xs_tdb_dump or tdb_dump on it didn't show anything
obviously wrong.

> It might be interesting to see what happens if you preserve the db and
> reboot arranging for the new xenstored to start with the old file. If
> the corruption is part of the file then maybe it can be induced to crash
> again more quickly.

Thanks for the pointer, will try.

Thank you again for your fast reply.
Philipp Hahn


* Re: xenstored crashes with SIGSEGV
From: Ian Campbell @ 2014-12-12 16:56 UTC
  To: Philipp Hahn; +Cc: Xen-devel

On Fri, 2014-12-12 at 17:45 +0100, Philipp Hahn wrote:
> Hello Ian,
> 
> On 12.12.2014 17:32, Ian Campbell wrote:
> > On Fri, 2014-12-12 at 17:14 +0100, Philipp Hahn wrote:
> >> We did enable tracing and now have the xenstored-trace.log of one crash:
> >> It contains 1.6 billion lines and is 83 GiB.
> >> It just shows xenstored crashing on TRANSACTION_START.
> >>
> >> Is there some tool to feed that trace back into a newly launched xenstored?
> > 
> > Not that I know of I'm afraid.
> 
> Okay, then I have to continue with my own tool.

If you do end up developing a tool to replay a xenstore trace then I
think that'd be something great to have in tree!

> > Do you get a core dump when this happens? You might need to fiddle with
> > ulimits (some distros disable by default). IIRC there is also some /proc
> > knob which controls where core dumps go on the filesystem.
> 
> Not for that specific trace: We first enabled generating core files, but
> only then discovered that this is not enough.

How wasn't it enough? You mean you couldn't use gdb to extract a
backtrace from the core file? Or was something else wrong?

>  Then we enabled
> --trace-file, but on that host something reset the core-file setting.
> We have hopefully fixed all hosts, so on the next crash we should get
> both a core file and the trace.

Great.

> >> My hope would be that xenstored crashes again, because then we could use
> >> all those other tools like valgrind more easily.
> > 
> > That would be handy. My fear would be that this bug is likely to be a
> > race condition of some sort, and the granularity/accuracy of the
> > playback would possibly need to be quite high to trigger the issue.
> 
> cxenstored looks single threaded to me, or am I wrong?

Nope, you are right, my mistake.

Ian.


* Re: xenstored crashes with SIGSEGV
From: Philipp Hahn @ 2014-12-12 17:20 UTC
  To: Ian Campbell; +Cc: Xen-devel

Hello Ian,

On 12.12.2014 17:56, Ian Campbell wrote:
> On Fri, 2014-12-12 at 17:45 +0100, Philipp Hahn wrote:
>> On 12.12.2014 17:32, Ian Campbell wrote:
>>> On Fri, 2014-12-12 at 17:14 +0100, Philipp Hahn wrote:
>>>> We did enable tracing and now have the xenstored-trace.log of one crash:
>>>> It contains 1.6 billion lines and is 83 GiB.
>>>> It just shows xenstored crashing on TRANSACTION_START.
>>>>
>>>> Is there some tool to feed that trace back into a newly launched xenstored?
>>>
>>> Not that I know of I'm afraid.
>>
>> Okay, then I have to continue with my own tool.
> 
> If you do end up developing a tool to replay a xenstore trace then I
> think that'd be something great to have in tree!

I just need to figure out how to talk to xenstored on the wire: for some
strange reason xenstored is closing the connection to the UNIX socket on
the first write inside a transaction.
Or switch to /usr/share/pyshared/xen/xend/xenstore/xstransact.py...
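
(For reference, each request on the wire is just a struct xsd_sockmsg
header followed by the payload; my guess, and it is only a guess, is that
I am failing to put the id returned by TRANSACTION_START into tx_id on
the subsequent write:)

        /* from xen/include/public/io/xs_wire.h */
        struct xsd_sockmsg {
            uint32_t type;   /* XS_TRANSACTION_START, XS_WRITE, ... */
            uint32_t req_id; /* echoed back in xenstored's reply */
            uint32_t tx_id;  /* 0, or the id XS_TRANSACTION_START returned */
            uint32_t len;    /* length of the payload that follows */
        };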

>>> Do you get a core dump when this happens? You might need to fiddle with
>>> ulimits (some distros disable by default). IIRC there is also some /proc
>>> knob which controls where core dumps go on the filesystem.
>>
>> Not for that specific trace: We first enabled generating core files, but
>> only then discovered that this is not enough.
> 
> How wasn't it enough? You mean you couldn't use gdb to extract a
> backtrace from the core file? Or was something else wrong?

The 1st and 2nd trace look like this: ptr in frame #2 looks very bogus.

(gdb) bt full
#0  talloc_chunk_from_ptr (ptr=0xff00000000) at talloc.c:116
        tc = <value optimized out>
#1  0x0000000000407edf in talloc_free (ptr=0xff00000000) at talloc.c:551
        tc = <value optimized out>
#2  0x000000000040a348 in tdb_open_ex (name=0x1941fb0
"/var/lib/xenstored/tdb.0x1935bb0",
    hash_size=<value optimized out>, tdb_flags=0, open_flags=<value
optimized out>, mode=<value optimized out>,
    log_fn=0x4093b0 <null_log_fn>, hash_fn=<value optimized out>) at
tdb.c:1958
        tdb = 0x1921270
        st = {st_dev = 17, st_ino = 816913342, st_nlink = 1, st_mode =
33184, st_uid = 0, st_gid = 0, __pad0 = 0,
          st_rdev = 0, st_size = 303104, st_blksize = 4096, st_blocks =
592, st_atim = {tv_sec = 1415748063,
            tv_nsec = 87562634}, st_mtim = {tv_sec = 1415748063, tv_nsec
= 87562634}, st_ctim = {
            tv_sec = 1415748063, tv_nsec = 87562634}, __unused = {0, 0, 0}}
        rev = <value optimized out>
        locked = 4232112
        vp = <value optimized out>
#3  0x000000000040a684 in tdb_open (name=0xff00000000 <Address
0xff00000000 out of bounds>, hash_size=0,
    tdb_flags=4254928, open_flags=-1, mode=3119127560) at tdb.c:1773
No locals.
#4  0x000000000040a70b in tdb_copy (tdb=0x192e540, outfile=0x1941fb0
"/var/lib/xenstored/tdb.0x1935bb0")
    at tdb.c:2124
        fd = <value optimized out>
        saved_errno = <value optimized out>
        copy = 0x0
#5  0x0000000000406c2d in do_transaction_start (conn=0x1939550,
in=<value optimized out>)
    at xenstored_transaction.c:164
        trans = 0x1935bb0
        exists = <value optimized out>
        id_str =
"\300L\222\001\000\000\000\000\330!@\000\000\000\000\000P\225\223\001"
#6  0x00000000004045ca in process_message (conn=0x1939550) at
xenstored_core.c:1214
        trans = <value optimized out>
#7  consider_message (conn=0x1939550) at xenstored_core.c:1261
No locals.
#8  handle_input (conn=0x1939550) at xenstored_core.c:1308
        bytes = <value optimized out>
        in = <value optimized out>
#9  0x0000000000405170 in main (argc=<value optimized out>, argv=<value
optimized out>) at xenstored_core.c:1964

A 3rd trace is somewhere completely different:
(gdb) bt
#0  0x00007fcbf066088d in _IO_vfprintf_internal (s=0x7fff46ac3010,
format=<value optimized out>, ap=0x7fff46ac3170)
    at vfprintf.c:1617
#1  0x00007fcbf0682732 in _IO_vsnprintf (string=0x7fff46ac318f "",
maxlen=<value optimized out>,
    format=0x40d4a4 "%.*s", args=0x7fff46ac3170) at vsnprintf.c:120
#2  0x000000000040855b in talloc_vasprintf (t=0x17aaf20, fmt=0x40d4a4
"%.*s", ap=0x7fff46ac31d0) at talloc.c:1104
#3  0x0000000000408666 in talloc_asprintf (t=0x1f, fmt=0xffffe938
<Address 0xffffe938 out of bounds>)
    at talloc.c:1129
#4  0x0000000000403a38 in ask_parents (conn=0x177a1f0, name=0x17aaf20
"/local/domain/0/backend/vif/1/0/accel",
    perm=XS_PERM_READ) at xenstored_core.c:492
#5  errno_from_parents (conn=0x177a1f0, name=0x17aaf20
"/local/domain/0/backend/vif/1/0/accel", perm=XS_PERM_READ)
    at xenstored_core.c:516
#6  get_node (conn=0x177a1f0, name=0x17aaf20
"/local/domain/0/backend/vif/1/0/accel", perm=XS_PERM_READ)
    at xenstored_core.c:543
#7  0x000000000040481d in do_read (conn=0x177a1f0) at xenstored_core.c:744
#8  process_message (conn=0x177a1f0) at xenstored_core.c:1178
#9  consider_message (conn=0x177a1f0) at xenstored_core.c:1261
#10 handle_input (conn=0x177a1f0) at xenstored_core.c:1308
#11 0x0000000000405170 in main (argc=<value optimized out>, argv=<value
optimized out>) at xenstored_core.c:1964


>> It might be interesting to see what happens if you preserve the db and
>> reboot arranging for the new xenstored to start with the old file. If
>> the corruption is part of the file then maybe it can be induced to crash
>> again more quickly.
> 
> Thanks for the pointer, will try.

Didn't crash immediately.
Now running /usr/share/pyshared/xen/xend/xenstore/tests/stress_xs.py for
the weekend.

Thanks again.
Philipp


* Re: xenstored crashes with SIGSEGV
From: Ian Campbell @ 2014-12-12 17:58 UTC
  To: Philipp Hahn, Ian Jackson; +Cc: Xen-devel

(adding Ian J who knows a bit more about C xenstored than me...)

On Fri, 2014-12-12 at 18:20 +0100, Philipp Hahn wrote:
> Hello Ian,
> 
> On 12.12.2014 17:56, Ian Campbell wrote:
> > On Fri, 2014-12-12 at 17:45 +0100, Philipp Hahn wrote:
> >> On 12.12.2014 17:32, Ian Campbell wrote:
> >>> On Fri, 2014-12-12 at 17:14 +0100, Philipp Hahn wrote:
> >>>> We did enable tracing and now have the xenstored-trace.log of one crash:
> >>>> It contains 1.6 billion lines and is 83 GiB.
> > >> It just shows xenstored crashing on TRANSACTION_START.
> >>>>
> >>>> Is there some tool to feed that trace back into a newly launched xenstored?
> >>>
> >>> Not that I know of I'm afraid.
> >>
> >> Okay, then I have to continue with my own tool.
> > 
> > If you do end up developing a tool to replay a xenstore trace then I
> > think that'd be something great to have in tree!
> 
> I just need to figure out how to talk to xenstored on the wire: for some
> strange reason xenstored is closing the connection to the UNIX socket on
> the first write inside a transaction.
> Or switch to /usr/share/pyshared/xen/xend/xenstore/xstransact.py...
> 
> >>> Do you get a core dump when this happens? You might need to fiddle with
> >>> ulimits (some distros disable by default). IIRC there is also some /proc
> >>> knob which controls where core dumps go on the filesystem.
> >>
> >> Not for that specific trace: We first enabled generating core files, but
> >> only then discovered that this is not enough.
> > 
> > How wasn't it enough? You mean you couldn't use gdb to extract a
> > backtrace from the core file? Or was something else wrong?
> 
> The 1st and 2nd trace look like this: ptr in frame #2 looks very bogus.
> 
> (gdb) bt full
> #0  talloc_chunk_from_ptr (ptr=0xff00000000) at talloc.c:116
>         tc = <value optimized out>
> #1  0x0000000000407edf in talloc_free (ptr=0xff00000000) at talloc.c:551
>         tc = <value optimized out>
> #2  0x000000000040a348 in tdb_open_ex (name=0x1941fb0
> "/var/lib/xenstored/tdb.0x1935bb0",

This is interesting actually. There are only a small number of calls to
talloc_free in tdb_open_ex (all wrapped in "SAFE_FREE"), and they are all
on the fail: error exit path. So I think the reason the crash is
rare is that you have to hit some other failure first.

About half of the "goto fail" statements are preceded by a TDB_LOG
statement. But given the presence of log_fn=<null_log_fn> in the trace,
that doesn't seem likely to be helpful right now.

It might be worth splurging some debug of your own before each of those
failure points and/or wiring up the tdb log function to xenstored's
logging.
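
For the latter, something like this (untested, and assuming the usual
tdb_log_func signature from tdb.h; trace() is xenstored's existing trace
helper) could be passed in instead of null_log_fn:

        /* untested sketch: forward tdb's log messages to xenstored's
         * trace log; needs <stdarg.h>, and vasprintf() needs _GNU_SOURCE */
        static void tdb_trace_logger(TDB_CONTEXT *tdb, int level,
                                     const char *fmt, ...)
        {
                va_list ap;
                char *msg = NULL;

                va_start(ap, fmt);
                if (vasprintf(&msg, fmt, ap) >= 0)
                        trace("tdb(%d): %s", level, msg);
                va_end(ap);
                free(msg);
        }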

The calls to SAFE_FREE are
        SAFE_FREE(tdb->map_ptr);
        SAFE_FREE(tdb->name);
        SAFE_FREE(tdb->locked);
        SAFE_FREE(tdb);

I think those should all have been allocated by the time we get to fail
though, so I am not sure where 0xff00000000 in the trace comes from.

I've timed out for tonight; I will try to have another look next week.

Ian.

>     hash_size=<value optimized out>, tdb_flags=0, open_flags=<value
> optimized out>, mode=<value optimized out>,
>     log_fn=0x4093b0 <null_log_fn>, hash_fn=<value optimized out>) at
> tdb.c:1958
>         tdb = 0x1921270
>         st = {st_dev = 17, st_ino = 816913342, st_nlink = 1, st_mode =
> 33184, st_uid = 0, st_gid = 0, __pad0 = 0,
>           st_rdev = 0, st_size = 303104, st_blksize = 4096, st_blocks =
> 592, st_atim = {tv_sec = 1415748063,
>             tv_nsec = 87562634}, st_mtim = {tv_sec = 1415748063, tv_nsec
> = 87562634}, st_ctim = {
>             tv_sec = 1415748063, tv_nsec = 87562634}, __unused = {0, 0, 0}}
>         rev = <value optimized out>
>         locked = 4232112
>         vp = <value optimized out>
> #3  0x000000000040a684 in tdb_open (name=0xff00000000 <Address
> 0xff00000000 out of bounds>, hash_size=0,
>     tdb_flags=4254928, open_flags=-1, mode=3119127560) at tdb.c:1773
> No locals.
> #4  0x000000000040a70b in tdb_copy (tdb=0x192e540, outfile=0x1941fb0
> "/var/lib/xenstored/tdb.0x1935bb0")
>     at tdb.c:2124
>         fd = <value optimized out>
>         saved_errno = <value optimized out>
>         copy = 0x0
> #5  0x0000000000406c2d in do_transaction_start (conn=0x1939550,
> in=<value optimized out>)
>     at xenstored_transaction.c:164
>         trans = 0x1935bb0
>         exists = <value optimized out>
>         id_str =
> "\300L\222\001\000\000\000\000\330!@\000\000\000\000\000P\225\223\001"
> #6  0x00000000004045ca in process_message (conn=0x1939550) at
> xenstored_core.c:1214
>         trans = <value optimized out>
> #7  consider_message (conn=0x1939550) at xenstored_core.c:1261
> No locals.
> #8  handle_input (conn=0x1939550) at xenstored_core.c:1308
>         bytes = <value optimized out>
>         in = <value optimized out>
> #9  0x0000000000405170 in main (argc=<value optimized out>, argv=<value
> optimized out>) at xenstored_core.c:1964
> 
> A 3rd trace is somewhere completely different:
> (gdb) bt
> #0  0x00007fcbf066088d in _IO_vfprintf_internal (s=0x7fff46ac3010,
> format=<value optimized out>, ap=0x7fff46ac3170)
>     at vfprintf.c:1617
> #1  0x00007fcbf0682732 in _IO_vsnprintf (string=0x7fff46ac318f "",
> maxlen=<value optimized out>,
>     format=0x40d4a4 "%.*s", args=0x7fff46ac3170) at vsnprintf.c:120
> #2  0x000000000040855b in talloc_vasprintf (t=0x17aaf20, fmt=0x40d4a4
> "%.*s", ap=0x7fff46ac31d0) at talloc.c:1104
> #3  0x0000000000408666 in talloc_asprintf (t=0x1f, fmt=0xffffe938
> <Address 0xffffe938 out of bounds>)
>     at talloc.c:1129
> #4  0x0000000000403a38 in ask_parents (conn=0x177a1f0, name=0x17aaf20
> "/local/domain/0/backend/vif/1/0/accel",
>     perm=XS_PERM_READ) at xenstored_core.c:492
> #5  errno_from_parents (conn=0x177a1f0, name=0x17aaf20
> "/local/domain/0/backend/vif/1/0/accel", perm=XS_PERM_READ)
>     at xenstored_core.c:516
> #6  get_node (conn=0x177a1f0, name=0x17aaf20
> "/local/domain/0/backend/vif/1/0/accel", perm=XS_PERM_READ)
>     at xenstored_core.c:543
> #7  0x000000000040481d in do_read (conn=0x177a1f0) at xenstored_core.c:744
> #8  process_message (conn=0x177a1f0) at xenstored_core.c:1178
> #9  consider_message (conn=0x177a1f0) at xenstored_core.c:1261
> #10 handle_input (conn=0x177a1f0) at xenstored_core.c:1308
> #11 0x0000000000405170 in main (argc=<value optimized out>, argv=<value
> optimized out>) at xenstored_core.c:1964
> 
> 
> >> It might be interesting to see what happens if you preserve the db and
> >> reboot arranging for the new xenstored to start with the old file. If
> >> the corruption is part of the file then maybe it can be induced to crash
> >> again more quickly.
> > 
> > Thanks for the pointer, will try.
> 
> Didn't crash immediately.
> Now running /usr/share/pyshared/xen/xend/xenstore/tests/stress_xs.py for
> the weekend.
> 
> Thanks again.
> Philipp
> 


* Re: xenstored crashes with SIGSEGV
From: Ian Campbell @ 2014-12-15 13:17 UTC
  To: Philipp Hahn; +Cc: Ian Jackson, Xen-devel

On Fri, 2014-12-12 at 17:58 +0000, Ian Campbell wrote:
> (adding Ian J who knows a bit more about C xenstored than me...)
> 
>  On Fri, 2014-12-12 at 18:20 +0100, Philipp Hahn wrote:
> > Hello Ian,
> > 
> > On 12.12.2014 17:56, Ian Campbell wrote:
> > > On Fri, 2014-12-12 at 17:45 +0100, Philipp Hahn wrote:
> > >> On 12.12.2014 17:32, Ian Campbell wrote:
> > >>> On Fri, 2014-12-12 at 17:14 +0100, Philipp Hahn wrote:
> > >>>> We did enable tracing and now have the xenstored-trace.log of one crash:
> > >>>> It contains 1.6 billion lines and is 83 GiB.
> > >>>> It just shows xenstored crashing on TRANSACTION_START.
> > >>>>
> > >>>> Is there some tool to feed that trace back into a newly launched xenstored?
> > >>>
> > >>> Not that I know of I'm afraid.
> > >>
> > >> Okay, then I have to continue with my own tool.
> > > 
> > > If you do end up developing a tool to replay a xenstore trace then I
> > > think that'd be something great to have in tree!
> > 
> > I just need to figure out how to talk to xenstored on the wire: for some
> > strange reason xenstored is closing the connection to the UNIX socket on
> > the first write inside a transaction.
> > Or switch to /usr/share/pyshared/xen/xend/xenstore/xstransact.py...
> > 
> > >>> Do you get a core dump when this happens? You might need to fiddle with
> > >>> ulimits (some distros disable by default). IIRC there is also some /proc
> > >>> knob which controls where core dumps go on the filesystem.
> > >>
> > >> Not for that specific trace: We first enabled generating core files, but
> > >> only then discovered that this is not enough.
> > > 
> > > How wasn't it enough? You mean you couldn't use gdb to extract a
> > > backtrace from the core file? Or was something else wrong?
> > 
> > The 1st and 2nd trace look like this: ptr in frame #2 looks very bogus.
> > 
> > (gdb) bt full
> > #0  talloc_chunk_from_ptr (ptr=0xff00000000) at talloc.c:116
> >         tc = <value optimized out>
> > #1  0x0000000000407edf in talloc_free (ptr=0xff00000000) at talloc.c:551
> >         tc = <value optimized out>
> > #2  0x000000000040a348 in tdb_open_ex (name=0x1941fb0
> > "/var/lib/xenstored/tdb.0x1935bb0",
> 
> I've timed out for tonight; I will try to have another look next week.

I've had another dig and have instrumented all of the error paths from
this function, and I can't see any way for an invalid pointer to be
produced, let alone freed. I've been running under valgrind, which should
have caught any uninitialised-memory type errors.

> >     hash_size=<value optimized out>, tdb_flags=0, open_flags=<value
> > optimized out>, mode=<value optimized out>,
> >     log_fn=0x4093b0 <null_log_fn>, hash_fn=<value optimized out>) at
> > tdb.c:1958

Please can you confirm what is at line 1958 of your copy of tdb.c. I
think it will be tdb->locked, but I'd like to be sure.

You are running a 64-bit dom0, correct? I've only just noticed that
0xff00000000 is >32bits. My testing so far was 32-bit, I don't think it
should matter wrt use of uninitialised data etc.

I can't help feeling that 0xff00000000 must be some sort of magic
sentinel value to someone. I can't figure out what though.

Have you observed the xenstored processes growing especially large
before this happens? I'm wondering if there might be a leak somewhere
which after a time is resulting in an allocation failure.

I'm about to send out a patch which plumbs tdb's logging into
xenstored's logging, in the hopes that next time you see this it might
say something as it dies.

Ian.


* Re: xenstored crashes with SIGSEGV
From: Philipp Hahn @ 2014-12-15 14:19 UTC
  To: Ian Campbell; +Cc: Ian Jackson, Xen-devel

Hello Ian,

On 15.12.2014 14:17, Ian Campbell wrote:
> On Fri, 2014-12-12 at 17:58 +0000, Ian Campbell wrote:
>>  On Fri, 2014-12-12 at 18:20 +0100, Philipp Hahn wrote:
>>> On 12.12.2014 17:56, Ian Campbell wrote:
>>>> On Fri, 2014-12-12 at 17:45 +0100, Philipp Hahn wrote:
>>>>> On 12.12.2014 17:32, Ian Campbell wrote:
>>>>>> On Fri, 2014-12-12 at 17:14 +0100, Philipp Hahn wrote:
...
>>> The 1st and 2nd trace look like this: ptr in frame #2 looks very bogus.
>>>
>>> (gdb) bt full
>>> #0  talloc_chunk_from_ptr (ptr=0xff00000000) at talloc.c:116
>>>         tc = <value optimized out>
>>> #1  0x0000000000407edf in talloc_free (ptr=0xff00000000) at talloc.c:551
>>>         tc = <value optimized out>
>>> #2  0x000000000040a348 in tdb_open_ex (name=0x1941fb0
>>> "/var/lib/xenstored/tdb.0x1935bb0",

I just noticed something strange:

> #3  0x000000000040a684 in tdb_open (name=0xff00000000 <Address
> 0xff00000000 out of bounds>, hash_size=0,
>     tdb_flags=4254928, open_flags=-1, mode=3119127560) at tdb.c:1773
> #4  0x000000000040a70b in tdb_copy (tdb=0x192e540, outfile=0x1941fb0
> "/var/lib/xenstored/tdb.0x1935bb0")

Why does gdb-7.0.1 print "name=0xff00000000" here for frame 3, but for
frames 2 and 4 the pointers are correct again?
Verifying the values with an explicit "print" shows them as correct.

>> I've timed out for tonight; I will try to have another look next week.
> 
> I've had another dig and have instrumented all of the error paths from
> this function, and I can't see any way for an invalid pointer to be
> produced, let alone freed. I've been running under valgrind, which should
> have caught any uninitialised-memory type errors.

Thank you for testing that.

>>>     hash_size=<value optimized out>, tdb_flags=0, open_flags=<value
>>> optimized out>, mode=<value optimized out>,
>>>     log_fn=0x4093b0 <null_log_fn>, hash_fn=<value optimized out>) at
>>> tdb.c:1958
> 
> Please can you confirm what is at line 1958 of your copy of tdb.c. I
> think it will be tdb->locked, but I'd like to be sure.

Yes, that's the line:
# sed -ne 1958p tdb.c
        SAFE_FREE(tdb->locked);

> You are running a 64-bit dom0, correct?

yes: x86_64

> I've only just noticed that
> 0xff00000000 is >32bits. My testing so far was 32-bit, I don't think it
> should matter wrt use of uninitialised data etc.
> 
> I can't help feeling that 0xff00000000 must be some sort of magic
> sentinel value to someone. I can't figure out what though.

0xff is too much for bit-flip errors, and also two crashes on different
machines in the same location very much rule out any HW error for me.

My 2nd idea was that someone decremented 0 one time too many, but then that
would have to be an 8-bit value; reading the code I didn't see anything
like that.

> Have you observed the xenstored processes growing especially large
> before this happens? I'm wondering if there might be a leak somewhere
> which after a time is resulting in an allocation failure.

I have no monitoring of the memory usage for the crashed systems, but
the core files look reasonably sane.
Looking at the test system running
/usr/share/pyshared/xen/xend/xenstore/tests/stress_xs.py, the memory
usage has stayed constant since last Friday.

> I'm about to send out a patch which plumbs tdb's logging into
> xenstored's logging, in the hopes that next time you see this it might
> say something as it dies.

Thank you for the patch: I'll try to incorporate it and will continue
trying to reproduce the crash.


One more thing we noticed: /var/lib/xenstored/ contained the tdb file
and two bit-identical copies after the crash, so I would read that as two
transactions being in progress at the time of the crash. Might be that
this is important.
But /usr/share/pyshared/xen/xend/xenstore/tests/stress_xs.py seems to
create more transactions in parallel, and my test system has so far
survived this since Friday.

Sincerely
Philipp Hahn


* Re: xenstored crashes with SIGSEGV
From: Ian Campbell @ 2014-12-15 14:50 UTC
  To: Philipp Hahn; +Cc: Ian Jackson, Xen-devel

On Mon, 2014-12-15 at 15:19 +0100, Philipp Hahn wrote:
> Hello Ian,
> 
> On 15.12.2014 14:17, Ian Campbell wrote:
> > On Fri, 2014-12-12 at 17:58 +0000, Ian Campbell wrote:
> >>  On Fri, 2014-12-12 at 18:20 +0100, Philipp Hahn wrote:
> >>> On 12.12.2014 17:56, Ian Campbell wrote:
> >>>> On Fri, 2014-12-12 at 17:45 +0100, Philipp Hahn wrote:
> >>>>> On 12.12.2014 17:32, Ian Campbell wrote:
> >>>>>> On Fri, 2014-12-12 at 17:14 +0100, Philipp Hahn wrote:
> ...
> >>> The 1st and 2nd trace look like this: ptr in frame #2 looks very bogus.
> >>>
> >>> (gdb) bt full
> >>> #0  talloc_chunk_from_ptr (ptr=0xff00000000) at talloc.c:116
> >>>         tc = <value optimized out>
> >>> #1  0x0000000000407edf in talloc_free (ptr=0xff00000000) at talloc.c:551
> >>>         tc = <value optimized out>
> >>> #2  0x000000000040a348 in tdb_open_ex (name=0x1941fb0
> >>> "/var/lib/xenstored/tdb.0x1935bb0",
> 
> I just noticed something strange:
> 
> > #3  0x000000000040a684 in tdb_open (name=0xff00000000 <Address
> > 0xff00000000 out of bounds>, hash_size=0,
> >     tdb_flags=4254928, open_flags=-1, mode=3119127560) at tdb.c:1773
> > #4  0x000000000040a70b in tdb_copy (tdb=0x192e540, outfile=0x1941fb0
> > "/var/lib/xenstored/tdb.0x1935bb0")
> 
> Why does gdb-7.0.1 print "name=0xff00000000" here for frame 3, but for
> frames 2 and 4 the pointers are correct again?
> Verifying the values with an explicit "print" shows them as correct.

I had just noticed that and was wondering about that same thing. I'm
starting to worry that 0xff00000000 might just be a gdb thing, similar
to <value optimized out>, but infinitely more misleading.

I've also noticed in
https://forge.univention.org/bugzilla/show_bug.cgi?id=35104 that the
constant can be either 0xff000000, 0xff00000000 or 0xff0000000000 (6, 8
or 10 zeroes).

> >>>     hash_size=<value optimized out>, tdb_flags=0, open_flags=<value
> >>> optimized out>, mode=<value optimized out>,
> >>>     log_fn=0x4093b0 <null_log_fn>, hash_fn=<value optimized out>) at
> >>> tdb.c:1958
> > 
> > Please can you confirm what is at line 1958 of your copy of tdb.c. I
> > think it will be tdb->locked, but I'd like to be sure.
> 
> Yes, that's the line:
> # sed -ne 1958p tdb.c
>         SAFE_FREE(tdb->locked);

Good, thanks.

> > You are running a 64-bit dom0, correct?
> 
> yes: x86_64

Thanks for confirming. I'm resurrecting the 64-bit root partition on my
test box (which it turns out was still Debian Squeeze!)

> 
> > I've only just noticed that
> > 0xff00000000 is >32bits. My testing so far was 32-bit, I don't think it
> > should matter wrt use of uninitialised data etc.
> > 
> > I can't help feeling that 0xff00000000 must be some sort of magic
> > sentinel value to someone. I can't figure out what though.
> 
> 0xff is too much for bit-flip errors, and also two crashes on different
> machines in the same location very much rule out any HW error for me.
>
> My 2nd idea was that someone decremented 0 one time too many, but then that
> would have to be an 8-bit value; reading the code I didn't see anything
> like that.

I was wondering if it was an overflow or sign-extension thing, but it
doesn't seem likely, not enough high bits set for one thing.

> One more thing we noticed: /var/lib/xenstored/ contained the tdb file
> and two bit-identical copies after the crash, so I would read that as two
> transactions being in progress at the time of the crash. Might be that
> this is important.

It's certainly worth noting, thanks.

Ian.


* Re: xenstored crashes with SIGSEGV
From: Ian Campbell @ 2014-12-15 17:45 UTC
  To: Philipp Hahn; +Cc: Ian Jackson, Xen-devel

On Mon, 2014-12-15 at 14:50 +0000, Ian Campbell wrote:
> On Mon, 2014-12-15 at 15:19 +0100, Philipp Hahn wrote:
> > I just noticed something strange:
> > 
> > > #3  0x000000000040a684 in tdb_open (name=0xff00000000 <Address
> > > 0xff00000000 out of bounds>, hash_size=0,
> > >     tdb_flags=4254928, open_flags=-1, mode=3119127560) at tdb.c:1773
> > > #4  0x000000000040a70b in tdb_copy (tdb=0x192e540, outfile=0x1941fb0
> > > "/var/lib/xenstored/tdb.0x1935bb0")
> > 
> > Why does gdb-7.0.1 print "name=0xff00000000" here for frame 3, but for
> > frames 2 and 4 the pointers are correct again?
> > Verifying the values with an explicit "print" shows them as correct.
> 
> I had just noticed that and was wondering about that same thing. I'm
> starting to worry that 0xff00000000 might just be a gdb thing, similar
> to <value optimized out>, but infinitely more misleading.

I'm reasonably convinced now that this is just a weird artefact of
running gdb on an optimised binary, probably a shortcoming in the debug
info leading to gdb getting confused.

Unfortunately this also calls into doubt the parameter to talloc_free;
perhaps in that context 0xff00000000 is a similar artefact.

Please can you print the entire contents of tdb in the second frame
("print *tdb" ought to do it). I'm curious whether it is all sane or
not.

Please can you also print "info regs" at the point of the segv (in frame
0) as well as "disas" at that point.

Can you also "p $_siginfo._sifields._sigfault.si_addr" (in frame 0).
This ought to be the actual faulting address, which ought to give a hint
on how much we can trust the parameters in the stack trace.

Since I'm asking for the world I may as well ask you to dump the raw
stack too; "x/64x $sp" ought to be a good starting point.

I notice in your bugzilla (for a different occurrence, I think):
> [2090451.721705] univention-conf[2512]: segfault at ff00000000 ip 000000000045e238 sp 00007ffff68dfa30 error 6 in python2.6[400000+21e000]

Which appears to have faulted accessing 0xff00000000 too. It looks like
this process is a python thing; it's nothing to do with xenstored I
assume? It seems rather coincidental that it should be accessing the
same sort of address and be faulting.

Ian.


* Re: xenstored crashes with SIGSEGV
From: Philipp Hahn @ 2014-12-15 22:29 UTC
  To: Ian Campbell; +Cc: Ian Jackson, Xen-devel

Hello Ian,

On 15.12.2014 18:45, Ian Campbell wrote:
> On Mon, 2014-12-15 at 14:50 +0000, Ian Campbell wrote:
>> On Mon, 2014-12-15 at 15:19 +0100, Philipp Hahn wrote:
>>> I just noticed something strange:
>>>
>>>> #3  0x000000000040a684 in tdb_open (name=0xff00000000 <Address
>>>> 0xff00000000 out of bounds>, hash_size=0,
>>>>     tdb_flags=4254928, open_flags=-1, mode=3119127560) at tdb.c:1773
...
> I'm reasonably convinced now that this is just a weird artefact of
> running gdb on an optimised binary, probably a shortcoming in the debug
> info leading to gdb getting confused.
> 
> Unfortunately this also calls into doubt the parameter to talloc_free;
> perhaps in that context 0xff00000000 is a similar artefact.
> 
> Please can you print the entire contents of tdb in the second frame
> ("print *tdb" ought to do it). I'm curious whether it is all sane or
> not.

(gdb) print *tdb
$1 = {name = 0x0, map_ptr = 0x0, fd = 47, map_size = 65280, read_only =
16711680,
  locked = 0xff0000000000, ecode = 16711680, header = {
    magic_food =
"\000\000\000\000\000\000\000\000\000\377\000\000\000\000\377\000\000\000\000\000\000\000\000\000\000\377\000\000\000\000\377",
version = 0, hash_size = 0,
    rwlocks = 65280, reserved = {16711680, 0, 0, 65280, 16711680, 0, 0,
65280,
      16711680, 0, 0, 65280, 16711680, 0, 0, 65280, 16711680, 0, 0,
65280, 16711680,
      0, 0, 65280, 16711680, 0, 0, 65280, 16711680, 0, 0}}, flags = 0,
travlocks = {
    next = 0xff0000, off = 0, hash = 65280}, next = 0xff0000,
  device = 280375465082880, inode = 16711680, log_fn = 0x4093b0
<null_log_fn>,
  hash_fn = 0x4092f0 <default_tdb_hash>, open_flags = 2}

> Please can you also print "info regs" at the point of the segv (in frame
> 0) as well as "disas" at that point.

(gdb) info registers
rax            0x0      0
rbx            0x16bff70        23854960
rcx            0xffffffffffffffff       -1
rdx            0x40ecd0 4254928
rsi            0x0      0
rdi            0xff0000000000   280375465082880
rbp            0x7fcaed6c96a8   0x7fcaed6c96a8
rsp            0x7fff9dc86330   0x7fff9dc86330
r8             0x7fcaece54c08   140509534571528
r9             0xff00000000000000       -72057594037927936
r10            0x7fcaed08c14c   140509536895308
r11            0x246    582
r12            0xd      13
r13            0xff0000000000   280375465082880
r14            0x4093b0 4232112
r15            0x167d620        23582240
rip            0x4075c4 0x4075c4 <talloc_chunk_from_ptr+4>
eflags         0x10206  [ PF IF RF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0
fctrl          0x0      0
fstat          0x0      0
ftag           0x0      0
fiseg          0x0      0
fioff          0x0      0
foseg          0x0      0
fooff          0x0      0
fop            0x0      0
mxcsr          0x0      [ ]

(gdb) disassemble
Dump of assembler code for function talloc_chunk_from_ptr:
0x00000000004075c0 <talloc_chunk_from_ptr+0>:   sub    $0x8,%rsp
0x00000000004075c4 <talloc_chunk_from_ptr+4>:   mov    -0x8(%rdi),%edx
0x00000000004075c7 <talloc_chunk_from_ptr+7>:   lea    -0x50(%rdi),%rax
0x00000000004075cb <talloc_chunk_from_ptr+11>:  mov    %edx,%ecx
0x00000000004075cd <talloc_chunk_from_ptr+13>:  and
$0xfffffffffffffff0,%ecx
0x00000000004075d0 <talloc_chunk_from_ptr+16>:  cmp    $0xe814ec70,%ecx
0x00000000004075d6 <talloc_chunk_from_ptr+22>:  jne    0x4075e2
<talloc_chunk_from_ptr+34>
0x00000000004075d8 <talloc_chunk_from_ptr+24>:  and    $0x1,%edx
0x00000000004075db <talloc_chunk_from_ptr+27>:  jne    0x4075e2
<talloc_chunk_from_ptr+34>
0x00000000004075dd <talloc_chunk_from_ptr+29>:  add    $0x8,%rsp
0x00000000004075e1 <talloc_chunk_from_ptr+33>:  retq
0x00000000004075e2 <talloc_chunk_from_ptr+34>:  nopw   0x0(%rax,%rax,1)
0x00000000004075e8 <talloc_chunk_from_ptr+40>:  callq  0x401b98 <abort@plt>

> Can you also "p $_siginfo._sifields._sigfault.si_addr" (in frame 0).
> This ought to be the actual faulting address, which ought to give a hint
> on how much we can trust the parameters in the stack trace.

Hmm, my gdb refused to access $_siginfo:
(gdb) show convenience
$_siginfo = Unable to read siginfo

> Since I'm asking for the world I may as well ask you to dump the raw
> stack too "x/64x $sp" ought to be a good starting point.

(gdb) x/64x $sp
0x7fff9dc86330: 0xed6c96a8      0x00007fca      0x00407edf      0x00000000
0x7fff9dc86340: 0x00000000      0x00000000      0x016bff70      0x00000000
0x7fff9dc86350: 0xed6c96a8      0x00007fca      0x0000000d      0x00000000
0x7fff9dc86360: 0x00000000      0x00000000      0x004093b0      0x00000000
0x7fff9dc86370: 0x0167d620      0x00000000      0x0040a348      0x00000000
0x7fff9dc86380: 0x00000000      0x00000000      0x00000000      0x00000000
0x7fff9dc86390: 0x00000000      0x00000000      0x00000000      0x00000000
0x7fff9dc863a0: 0x00000011      0x00000000      0x411d4816      0x00000000
0x7fff9dc863b0: 0x00000001      0x00000000      0x000081a0      0x00000000
0x7fff9dc863c0: 0x00000000      0x00000000      0x00000000      0x00000000
0x7fff9dc863d0: 0x00096000      0x00000000      0x00001000      0x00000000
0x7fff9dc863e0: 0x000004b0      0x00000000      0x5438ba01      0x00000000
0x7fff9dc863f0: 0x07fd332e      0x00000000      0x5438ba01      0x00000000
0x7fff9dc86400: 0x07fd332e      0x00000000      0x5438ba01      0x00000000
0x7fff9dc86410: 0x07fd332e      0x00000000      0x00000000      0x00000000
0x7fff9dc86420: 0x00000000      0x00000000      0x00000000      0x00000000

> I notice in your bugzilla (for a different occurrence, I think):
>> [2090451.721705] univention-conf[2512]: segfault at ff00000000 ip 000000000045e238 sp 00007ffff68dfa30 error 6 in python2.6[400000+21e000]
> 
> Which appears to have faulted accessing 0xff00000000 too. It looks like
> this process is a python thing; it's nothing to do with xenstored I
> assume?

Yes, that's one univention-config, which is completely independent of
xen(stored).

> It seems rather coincidental that it should be accessing the 
> same sort of address and be faulting.

Yes, good catch. I'll have another look at those core dumps.

> Ian.

Thank you for your help.
Philipp Hahn


* Re: xenstored crashes with SIGSEGV
From: Ian Campbell @ 2014-12-16  9:51 UTC
  To: Philipp Hahn; +Cc: Ian Jackson, Xen-devel

On Mon, 2014-12-15 at 23:29 +0100, Philipp Hahn wrote:
> Hello Ian,
> 
> On 15.12.2014 18:45, Ian Campbell wrote:
> > On Mon, 2014-12-15 at 14:50 +0000, Ian Campbell wrote:
> >> On Mon, 2014-12-15 at 15:19 +0100, Philipp Hahn wrote:
> >>> I just noticed something strange:
> >>>
> >>>> #3  0x000000000040a684 in tdb_open (name=0xff00000000 <Address
> >>>> 0xff00000000 out of bounds>, hash_size=0,
> >>>>     tdb_flags=4254928, open_flags=-1, mode=3119127560) at tdb.c:1773
> ...
> > I'm reasonably convinced now that this is just a weird artefact of
> > running gdb on an optimised binary, probably a shortcoming in the debug
> > info leading to gdb getting confused.
> > 
> > Unfortunately this also calls into doubt the parameter to talloc_free;
> > perhaps in that context 0xff00000000 is a similar artefact.
> > 
> > Please can you print the entire contents of tdb in the second frame
> > ("print *tdb" ought to do it). I'm curious whether it is all sane or
> > not.
> 
> (gdb) print *tdb
> $1 = {name = 0x0, map_ptr = 0x0, fd = 47, map_size = 65280, read_only =
> 16711680,
>   locked = 0xff0000000000,

So it really does seem to be 0xff0000000000 in memory.

> flags = 0,
> travlocks = {
>     next = 0xff0000, off = 0, hash = 65280}, next = 0xff0000,
>   device = 280375465082880, inode = 16711680, log_fn = 0x4093b0
> <null_log_fn>,
>   hash_fn = 0x4092f0 <default_tdb_hash>, open_flags = 2}

And here we can see tdb->{flags,open_flags} == 0 and 2, contrary to what
the stack trace says we were called with, which was nonsense. Since 0
and 2 are sensible and correspond to what the caller passes I think the
stack trace is just confused.

> (gdb) info registers
> rax            0x0      0
> rbx            0x16bff70        23854960
> rcx            0xffffffffffffffff       -1
> rdx            0x40ecd0 4254928
> rsi            0x0      0
> rdi            0xff0000000000   280375465082880

And here it is in the registers.

> rbp            0x7fcaed6c96a8   0x7fcaed6c96a8
> rsp            0x7fff9dc86330   0x7fff9dc86330
> r8             0x7fcaece54c08   140509534571528
> r9             0xff00000000000000       -72057594037927936
> r10            0x7fcaed08c14c   140509536895308
> r11            0x246    582
> r12            0xd      13
> r13            0xff0000000000   280375465082880

And again.

> r14            0x4093b0 4232112
> r15            0x167d620        23582240
> rip            0x4075c4 0x4075c4 <talloc_chunk_from_ptr+4>

This must be the faulting address.

> eflags         0x10206  [ PF IF RF ]
> cs             0x33     51
> ss             0x2b     43
> ds             0x0      0
> es             0x0      0
> fs             0x0      0
> gs             0x0      0
> fctrl          0x0      0
> fstat          0x0      0
> ftag           0x0      0
> fiseg          0x0      0
> fioff          0x0      0
> foseg          0x0      0
> fooff          0x0      0
> fop            0x0      0
> mxcsr          0x0      [ ]
> 
> (gdb) disassemble
> Dump of assembler code for function talloc_chunk_from_ptr:
> 0x00000000004075c0 <talloc_chunk_from_ptr+0>:   sub    $0x8,%rsp
> 0x00000000004075c4 <talloc_chunk_from_ptr+4>:   mov    -0x8(%rdi),%edx

This is the line corresponding to %rip above, which is doing a read via
%rdi, which is 0xff0000000000.

It's reading tc->flags. It's been optimised, tc = pp - SIZE, so it is
loading *(pp-SIZE+offsetof(flags)), which is pp-8 (flags is the last
field in the struct).

So rdi contains pp which == the ptr given as an argument to the
function, so ptr was bogus.

So it seems we really do have tdb->locked containing 0xff0000000000.

This is only allocated in one place which is:
	tdb->locked = talloc_zero_array(tdb, struct tdb_lock_type,
					tdb->header.hash_size+1);
midway through tdb_open_ex. It might be worth inserting a check+log for
this returning  0xff, 0xff00, 0xff0000 ... 0xff0000000000 etc.
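
Along these lines, perhaps (untested; TDB_LOG is tdb.c's own logging
macro, so it only says anything once a real log_fn is wired up):

        tdb->locked = talloc_zero_array(tdb, struct tdb_lock_type,
                                        tdb->header.hash_size+1);
        {
                /* untested: flag pointers which are an 0xff byte followed
                 * only by zero bytes, the pattern seen in the crashes */
                unsigned long v = (unsigned long)tdb->locked;
                while (v && !(v & 0xff))
                        v >>= 8;
                if (v == 0xff)
                        TDB_LOG((tdb, 0, "bogus tdb->locked %p\n",
                                 tdb->locked));
        }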

> 0x00000000004075c7 <talloc_chunk_from_ptr+7>:   lea    -0x50(%rdi),%rax

This is actually calculating tc, ready for return upon success.

> 0x00000000004075cb <talloc_chunk_from_ptr+11>:  mov    %edx,%ecx
> 0x00000000004075cd <talloc_chunk_from_ptr+13>:  and    $0xfffffffffffffff0,%ecx
> 0x00000000004075d0 <talloc_chunk_from_ptr+16>:  cmp    $0xe814ec70,%ecx
> 0x00000000004075d6 <talloc_chunk_from_ptr+22>:  jne    0x4075e2 <talloc_chunk_from_ptr+34>

(tc->flags & ~0xF) != TALLOC_MAGIC

> 0x00000000004075d8 <talloc_chunk_from_ptr+24>:  and    $0x1,%edx
> 0x00000000004075db <talloc_chunk_from_ptr+27>:  jne    0x4075e2 <talloc_chunk_from_ptr+34>

tc->flags & TALLOC_FLAG_FREE

> 0x00000000004075dd <talloc_chunk_from_ptr+29>:  add    $0x8,%rsp
> 0x00000000004075e1 <talloc_chunk_from_ptr+33>:  retq

Success, return.

> 0x00000000004075e2 <talloc_chunk_from_ptr+34>:  nopw   0x0(%rax,%rax,1)
> 0x00000000004075e8 <talloc_chunk_from_ptr+40>:  callq  0x401b98 <abort@plt>

The two TALLOC_ABORTS both end up here if the checks above fail.
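Putting that together, the function in C is roughly the following (a
paraphrase of the talloc.c of that era from memory, not a verbatim
quote):

	static struct talloc_chunk *talloc_chunk_from_ptr(const void *ptr)
	{
		const char *pp = (const char *)ptr;
		/* the header sits TC_HDR_SIZE == 0x50 bytes below the
		 * pointer handed out to the user */
		struct talloc_chunk *tc =
			(struct talloc_chunk *)(pp - TC_HDR_SIZE);

		if ((tc->flags & ~0xF) != TALLOC_MAGIC) {
			TALLOC_ABORT("Bad talloc magic value - unknown value");
		}
		if (tc->flags & TALLOC_FLAG_FREE) {
			TALLOC_ABORT("Bad talloc magic value - double free");
		}
		return tc;
	}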

> > Can you also "p $_siginfo._sifields._sigfault.si_addr" (in frame 0).
> > This ought to be the actual faulting address, which ought to give a hint
> > on how much we can trust the parameters in the stack trace.
> 
> Hmm, my gdb refused to access $_siginfo:
> (gdb) show convenience
> $_siginfo = Unable to read siginfo

That's ok, I think I've convinced myself above what the crash is.

Ian.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: xenstored crashes with SIGSEGV
  2014-12-15 22:29                       ` Philipp Hahn
  2014-12-16  9:51                         ` Ian Campbell
@ 2014-12-16 10:25                         ` Ian Campbell
  2014-12-16 10:45                         ` Ian Campbell
  2 siblings, 0 replies; 36+ messages in thread
From: Ian Campbell @ 2014-12-16 10:25 UTC (permalink / raw)
  To: Philipp Hahn; +Cc: Ian Jackson, Xen-devel

On Mon, 2014-12-15 at 23:29 +0100, Philipp Hahn wrote:
> (gdb) print *tdb
> $1 = {name = 0x0, map_ptr = 0x0, fd = 47, map_size = 65280, read_only =
> 16711680,
>   locked = 0xff0000000000, ecode = 16711680, header = {
>     magic_food =
> "\000\000\000\000\000\000\000\000\000\377\000\000\000\000\377\000\000\000\000\000\000\000\000\000\000\377\000\000\000\000\377",
> version = 0, hash_size = 0,

tdb->fd has been initialised, but version and hash_size have not yet
been. This means we must have failed somewhere between the open() and
the call to tdb_new_database() (the second call of it; the first one
happens only for TDB_INTERNAL, which is not the case here).

There are three interesting actions in that space.

The first is tdb_brlock, which could have gone wrong.

The second is ftruncate(). This is not a candidate because tdb->flags
doesn't have TDB_CLEAR_IF_FIRST (the actual test is on tdb_flags, which
has changed by the time of the stack trace, but it is stored in
tdb->flags where we can see it; tdb_flags isn't changed before the
check, so barring compiler problems I think we can rule that out).

The third is the read of the header itself. The fact that
tdb->header.magic_food is neither all zeroes nor the requisite magic
string "TDB file\n" is suspicious. Instead it is mostly zero with the
odd 0xff in it. An interesting pattern of 0xff..00..00, which may be a
coincidence, or not.
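For reference, the shape of that window in tdb_open_ex() is roughly
this (paraphrased from memory, so treat the details as an assumption):

	tdb->fd = open(name, open_flags, mode);	/* done, fd is valid */

	/* 1: the lock, which could have failed */
	if (tdb_brlock(tdb, GLOBAL_LOCK, F_WRLCK, F_SETLKW, 0) == -1)
		goto fail;

	/* 2: ruled out, since TDB_CLEAR_IF_FIRST is not set */
	if (tdb_flags & TDB_CLEAR_IF_FIRST)
		ftruncate(tdb->fd, 0);

	/* 3: the suspect, given the garbage magic_food we see */
	if (read(tdb->fd, &tdb->header, sizeof(tdb->header))
	        != sizeof(tdb->header)
	    || strcmp(tdb->header.magic_food, TDB_MAGIC_FOOD) != 0) {
		/* not a valid database - tdb_new_database() et al */
	}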

Ian.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: xenstored crashes with SIGSEGV
  2014-12-15 22:29                       ` Philipp Hahn
  2014-12-16  9:51                         ` Ian Campbell
  2014-12-16 10:25                         ` Ian Campbell
@ 2014-12-16 10:45                         ` Ian Campbell
  2014-12-16 11:06                           ` Ian Campbell
  2014-12-16 12:04                           ` Philipp Hahn
  2 siblings, 2 replies; 36+ messages in thread
From: Ian Campbell @ 2014-12-16 10:45 UTC (permalink / raw)
  To: Philipp Hahn; +Cc: Ian Jackson, Xen-devel

On Mon, 2014-12-15 at 23:29 +0100, Philipp Hahn wrote:
> > I notice in your bugzilla (for a different occurrence, I think):
> >> [2090451.721705] univention-conf[2512]: segfault at ff00000000 ip 000000000045e238 sp 00007ffff68dfa30 error 6 in python2.6[400000+21e000]
> > 
> > Which appears to have faulted access 0xff000000000 too. It looks like
> > this process is a python thing, it's nothing to do with xenstored I
> > assume?
> 
> Yes, that's one univention-config, which is completely independent of
> xen(stored).
> 
> > It seems rather coincidental that it should be accessing the 
> > same sort of address and be faulting.
> 
> Yes, good catch. I'll have another look at those core dumps.

With this in mind, please can you confirm what model of machines you've
seen this on, and in particular whether they are all the same class of
machine or whether they are significantly different.

The reason being that randomly placed 0xff values in a field of 0x00
could possibly indicate hardware (e.g. a GPU) DMAing over the wrong
memory pages.

Ian.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: xenstored crashes with SIGSEGV
  2014-12-16 10:45                         ` Ian Campbell
@ 2014-12-16 11:06                           ` Ian Campbell
  2014-12-16 11:30                             ` Frediano Ziglio
  2014-12-16 12:04                           ` Philipp Hahn
  1 sibling, 1 reply; 36+ messages in thread
From: Ian Campbell @ 2014-12-16 11:06 UTC (permalink / raw)
  To: Philipp Hahn; +Cc: Ian Jackson, Xen-devel

On Tue, 2014-12-16 at 10:45 +0000, Ian Campbell wrote:
> On Mon, 2014-12-15 at 23:29 +0100, Philipp Hahn wrote:
> > > I notice in your bugzilla (for a different occurrence, I think):
> > >> [2090451.721705] univention-conf[2512]: segfault at ff00000000 ip 000000000045e238 sp 00007ffff68dfa30 error 6 in python2.6[400000+21e000]
> > > 
> > > Which appears to have faulted access 0xff000000000 too. It looks like
> > > this process is a python thing, it's nothing to do with xenstored I
> > > assume?
> > 
> > Yes, that's one univention-config, which is completely independent of
> > xen(stored).
> > 
> > > It seems rather coincidental that it should be accessing the 
> > > same sort of address and be faulting.
> > 
> > Yes, good catch. I'll have another look at those core dumps.
> 
> With this in mind, please can you confirm what model of machines you've
> seen this on, and in particular whether they are all the same class of
> machine or whether they are significantly different.
> 
> The reason being that randomly placed 0xff values in a field of 0x00
> could possibly indicate hardware (e.g. a GPU) DMAing over the wrong
> memory pages.

Thanks for giving me access to the core files. This is very suspicious:
(gdb) frame 2
#2  0x000000000040a348 in tdb_open_ex (name=0x1941fb0 "/var/lib/xenstored/tdb.0x1935bb0", hash_size=<value optimized out>, tdb_flags=0, open_flags=<value optimized out>, mode=<value optimized out>, 
    log_fn=0x4093b0 <null_log_fn>, hash_fn=<value optimized out>) at tdb.c:1958
1958		SAFE_FREE(tdb->locked);

(gdb) x/96x tdb
0x1921270:	0x00000000	0x00000000	0x00000000	0x00000000
0x1921280:	0x0000001f	0x000000ff	0x0000ff00	0x000000ff
0x1921290:	0x00000000	0x000000ff	0x0000ff00	0x000000ff
0x19212a0:	0x00000000	0x000000ff	0x0000ff00	0x000000ff
0x19212b0:	0x00000000	0x000000ff	0x0000ff00	0x000000ff
0x19212c0:	0x00000000	0x000000ff	0x0000ff00	0x000000ff
0x19212d0:	0x00000000	0x000000ff	0x0000ff00	0x000000ff
0x19212e0:	0x00000000	0x000000ff	0x0000ff00	0x000000ff
0x19212f0:	0x00000000	0x000000ff	0x0000ff00	0x000000ff
0x1921300:	0x00000000	0x000000ff	0x0000ff00	0x000000ff
0x1921310:	0x00000000	0x000000ff	0x0000ff00	0x000000ff
0x1921320:	0x00000000	0x000000ff	0x0000ff00	0x000000ff
0x1921330:	0x00000000	0x000000ff	0x0000ff00	0x000000ff
0x1921340:	0x00000000	0x00000000	0x0000ff00	0x000000ff
0x1921350:	0x00000000	0x000000ff	0x0000ff00	0x000000ff
0x1921360:	0x00000000	0x000000ff	0x0000ff00	0x000000ff
0x1921370:	0x004093b0	0x00000000	0x004092f0	0x00000000
0x1921380:	0x00000002	0x00000000	0x00000091	0x00000000
0x1921390:	0x0193de70	0x00000000	0x01963600	0x00000000
0x19213a0:	0x00000000	0x00000000	0x0193fbb0	0x00000000
0x19213b0:	0x00000000	0x00000000	0x00000000	0x00000000
0x19213c0:	0x00405870	0x00000000	0x0040e3e0	0x00000000
0x19213d0:	0x00000038	0x00000000	0xe814ec70	0x6f2f6567
0x19213e0:	0x01963650	0x00000000	0x0193dec0	0x00000000

Something has clearly done a number on the RAM of this process.
0x1921270 through 0x192136f is 256 bytes...

Since it appears to be happening to other processes too I would hazard
that this is not a xenstored issue.

Ian.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: xenstored crashes with SIGSEGV
  2014-12-16 11:06                           ` Ian Campbell
@ 2014-12-16 11:30                             ` Frediano Ziglio
  2014-12-16 12:23                               ` Ian Campbell
  0 siblings, 1 reply; 36+ messages in thread
From: Frediano Ziglio @ 2014-12-16 11:30 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Xen-devel, Ian Jackson, Philipp Hahn

2014-12-16 11:06 GMT+00:00 Ian Campbell <Ian.Campbell@citrix.com>:
> On Tue, 2014-12-16 at 10:45 +0000, Ian Campbell wrote:
>> On Mon, 2014-12-15 at 23:29 +0100, Philipp Hahn wrote:
>> > > I notice in your bugzilla (for a different occurrence, I think):
>> > >> [2090451.721705] univention-conf[2512]: segfault at ff00000000 ip 000000000045e238 sp 00007ffff68dfa30 error 6 in python2.6[400000+21e000]
>> > >
>> > > Which appears to have faulted access 0xff000000000 too. It looks like
>> > > this process is a python thing, it's nothing to do with xenstored I
>> > > assume?
>> >
>> > Yes, that's one univention-config, which is completely independent of
>> > xen(stored).
>> >
>> > > It seems rather coincidental that it should be accessing the
>> > > same sort of address and be faulting.
>> >
>> > Yes, good catch. I'll have another look at those core dumps.
>>
>> With this in mind, please can you confirm what model of machines you've
>> seen this on, and in particular whether they are all the same class of
>> machine or whether they are significantly different.
>>
>> The reason being that randomly placed 0xff values in a field of 0x00
>> could possibly indicate hardware (e.g. a GPU) DMAing over the wrong
>> memory pages.
>
> Thanks for giving me access to the core files. This is very suspicious:
> (gdb) frame 2
> #2  0x000000000040a348 in tdb_open_ex (name=0x1941fb0 "/var/lib/xenstored/tdb.0x1935bb0", hash_size=<value optimized out>, tdb_flags=0, open_flags=<value optimized out>, mode=<value optimized out>,
>     log_fn=0x4093b0 <null_log_fn>, hash_fn=<value optimized out>) at tdb.c:1958
> 1958            SAFE_FREE(tdb->locked);
>
> (gdb) x/96x tdb
> 0x1921270:      0x00000000      0x00000000      0x00000000      0x00000000
> 0x1921280:      0x0000001f      0x000000ff      0x0000ff00      0x000000ff
> 0x1921290:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x19212a0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x19212b0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x19212c0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x19212d0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x19212e0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x19212f0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x1921300:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x1921310:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x1921320:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x1921330:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x1921340:      0x00000000      0x00000000      0x0000ff00      0x000000ff
> 0x1921350:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x1921360:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x1921370:      0x004093b0      0x00000000      0x004092f0      0x00000000
> 0x1921380:      0x00000002      0x00000000      0x00000091      0x00000000
> 0x1921390:      0x0193de70      0x00000000      0x01963600      0x00000000
> 0x19213a0:      0x00000000      0x00000000      0x0193fbb0      0x00000000
> 0x19213b0:      0x00000000      0x00000000      0x00000000      0x00000000
> 0x19213c0:      0x00405870      0x00000000      0x0040e3e0      0x00000000
> 0x19213d0:      0x00000038      0x00000000      0xe814ec70      0x6f2f6567
> 0x19213e0:      0x01963650      0x00000000      0x0193dec0      0x00000000
>
> Something has clearly done a number on the ram of this process.
> 0x1921270 through 0x192136f is 256 bytes...
>
> Since it appears to be happening to other processes too I would hazard
> that this is not a xenstored issue.
>
> Ian.
>

Good catch Ian!

Strange corruption. Probably not related to xenstored, as you
suggested. I would be curious to see what's before the tdb pointer and
where the corruption starts. I also don't understand where the
"fd = 47" in a previous mail came from: 0x1f is 31, not 47 (which is
0x2f).

I would not be surprised by a strange bug in libc or the kernel.

Frediano

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: xenstored crashes with SIGSEGV
  2014-12-16 10:45                         ` Ian Campbell
  2014-12-16 11:06                           ` Ian Campbell
@ 2014-12-16 12:04                           ` Philipp Hahn
  1 sibling, 0 replies; 36+ messages in thread
From: Philipp Hahn @ 2014-12-16 12:04 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Ian Jackson, Frediano Ziglio, Xen-devel

Hello,

On 16.12.2014 11:45, Ian Campbell wrote:
> On Mon, 2014-12-15 at 23:29 +0100, Philipp Hahn wrote:
>>> I notice in your bugzilla (for a different occurrence, I think):
>>>> [2090451.721705] univention-conf[2512]: segfault at ff00000000 ip 000000000045e238 sp 00007ffff68dfa30 error 6 in python2.6[400000+21e000]
>>>
>>> Which appears to have faulted access 0xff000000000 too. It looks like
>>> this process is a python thing, it's nothing to do with xenstored I
>>> assume?
>>
>> Yes, that's one univention-config, which is completely independent of
>> xen(stored).
>>
>>> It seems rather coincidental that it should be accessing the 
>>> same sort of address and be faulting.
>>
>> Yes, good catch. I'll have another look at those core dumps.
> 
> With this in mind, please can you confirm what model of machines you've
> seen this on, and in particular whether they are all the same class of
> machine or whether they are significantly different.

They are all from the same vendor, but I have to check the individual
models and firmware versions, which might take some time.

> The reason being that randomly placed 0xff values in a field of 0x00
> could possibly indicate hardware (e.g. a GPU) DMAing over the wrong
> memory pages.

Good catch: that would explain why it only happens for us and no one
else has seen that strange bug before.

Thank you again.
Philipp Hahn

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: xenstored crashes with SIGSEGV
  2014-12-16 11:30                             ` Frediano Ziglio
@ 2014-12-16 12:23                               ` Ian Campbell
  2014-12-16 16:13                                 ` Frediano Ziglio
  0 siblings, 1 reply; 36+ messages in thread
From: Ian Campbell @ 2014-12-16 12:23 UTC (permalink / raw)
  To: Frediano Ziglio; +Cc: Xen-devel, Ian Jackson, Philipp Hahn

On Tue, 2014-12-16 at 11:30 +0000, Frediano Ziglio wrote:
> 2014-12-16 11:06 GMT+00:00 Ian Campbell <Ian.Campbell@citrix.com>:
> > On Tue, 2014-12-16 at 10:45 +0000, Ian Campbell wrote:
> >> On Mon, 2014-12-15 at 23:29 +0100, Philipp Hahn wrote:
> >> > > I notice in your bugzilla (for a different occurrence, I think):
> >> > >> [2090451.721705] univention-conf[2512]: segfault at ff00000000 ip 000000000045e238 sp 00007ffff68dfa30 error 6 in python2.6[400000+21e000]
> >> > >
> >> > > Which appears to have faulted access 0xff000000000 too. It looks like
> >> > > this process is a python thing, it's nothing to do with xenstored I
> >> > > assume?
> >> >
> >> > Yes, that's one univention-config, which is completely independent of
> >> > xen(stored).
> >> >
> >> > > It seems rather coincidental that it should be accessing the
> >> > > same sort of address and be faulting.
> >> >
> >> > Yes, good catch. I'll have another look at those core dumps.
> >>
> >> With this in mind, please can you confirm what model of machines you've
> >> seen this on, and in particular whether they are all the same class of
> >> machine or whether they are significantly different.
> >>
> >> The reason being that randomly placed 0xff values in a field of 0x00
> >> could possibly indicate hardware (e.g. a GPU) DMAing over the wrong
> >> memory pages.
> >
> > Thanks for giving me access to the core files. This is very suspicious:
> > (gdb) frame 2
> > #2  0x000000000040a348 in tdb_open_ex (name=0x1941fb0 "/var/lib/xenstored/tdb.0x1935bb0", hash_size=<value optimized out>, tdb_flags=0, open_flags=<value optimized out>, mode=<value optimized out>,
> >     log_fn=0x4093b0 <null_log_fn>, hash_fn=<value optimized out>) at tdb.c:1958
> > 1958            SAFE_FREE(tdb->locked);
> >
> > (gdb) x/96x tdb
> > 0x1921270:      0x00000000      0x00000000      0x00000000      0x00000000
> > 0x1921280:      0x0000001f      0x000000ff      0x0000ff00      0x000000ff
> > 0x1921290:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> > 0x19212a0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> > 0x19212b0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> > 0x19212c0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> > 0x19212d0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> > 0x19212e0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> > 0x19212f0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> > 0x1921300:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> > 0x1921310:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> > 0x1921320:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> > 0x1921330:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> > 0x1921340:      0x00000000      0x00000000      0x0000ff00      0x000000ff
> > 0x1921350:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> > 0x1921360:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> > 0x1921370:      0x004093b0      0x00000000      0x004092f0      0x00000000
> > 0x1921380:      0x00000002      0x00000000      0x00000091      0x00000000
> > 0x1921390:      0x0193de70      0x00000000      0x01963600      0x00000000
> > 0x19213a0:      0x00000000      0x00000000      0x0193fbb0      0x00000000
> > 0x19213b0:      0x00000000      0x00000000      0x00000000      0x00000000
> > 0x19213c0:      0x00405870      0x00000000      0x0040e3e0      0x00000000
> > 0x19213d0:      0x00000038      0x00000000      0xe814ec70      0x6f2f6567
> > 0x19213e0:      0x01963650      0x00000000      0x0193dec0      0x00000000
> >
> > Something has clearly done a number on the ram of this process.
> > 0x1921270 through 0x192136f is 256 bytes...
> >
> > Since it appears to be happening to other processes too I would hazard
> > that this is not a xenstored issue.
> >
> > Ian.
> >
> 
> Good catch Ian!
> 
> Strange corruption. Probably not related to xenstored as you
> suggested. I would be curious to see what's before the tdb pointer and
> where does the corruption starts.

(gdb) print tdb
$2 = (TDB_CONTEXT *) 0x1921270
(gdb) x/64x 0x1921200
0x1921200:	0x01921174	0x00000000	0x00000000	0x00000000
0x1921210:	0x01921174	0x00000000	0x00000171	0x00000000
0x1921220:	0x00000000	0x00000000	0x00000000	0x00000000
0x1921230:	0x01941f60	0x00000000	0x00000000	0x00000000
0x1921240:	0x00000000	0x00000000	0x00000000	0x6f630065
0x1921250:	0x00000000	0x00000000	0x0040e8a7	0x00000000
0x1921260:	0x00000118	0x00000000	0xe814ec70	0x00000000
0x1921270:	0x00000000	0x00000000	0x00000000	0x00000000
0x1921280:	0x0000001f	0x000000ff	0x0000ff00	0x000000ff
0x1921290:	0x00000000	0x000000ff	0x0000ff00	0x000000ff
0x19212a0:	0x00000000	0x000000ff	0x0000ff00	0x000000ff
0x19212b0:	0x00000000	0x000000ff	0x0000ff00	0x000000ff
0x19212c0:	0x00000000	0x000000ff	0x0000ff00	0x000000ff
0x19212d0:	0x00000000	0x000000ff	0x0000ff00	0x000000ff
0x19212e0:	0x00000000	0x000000ff	0x0000ff00	0x000000ff
0x19212f0:	0x00000000	0x000000ff	0x0000ff00	0x000000ff

So it appears to start at 0x1921270 or maybe ...6c.

>  I also don't understand where the
> "fd = 47" came from a previous mail. 0x1f is 31, not 47 (which is
> 0x2f).

I must have been using a different coredump from the original report
(there are several).

In the one which corresponds to the above:

(gdb) print *tdb
$3 = {name = 0x0, map_ptr = 0x0, fd = 31, map_size = 255, 
  read_only = 65280, locked = 0xff00000000, ecode = 65280, header = {
    magic_food = "\377\000\000\000\000\000\000\000\377\000\000\000\000\377\000\000\377\000\000\000\000\000\000\000\377\000\000\000\000\377\000", version = 255, hash_size = 0, rwlocks = 255, reserved = {65280, 
      255, 0, 255, 65280, 255, 0, 255, 65280, 255, 0, 255, 65280, 
      255, 0, 255, 65280, 255, 0, 255, 65280, 255, 0, 255, 65280, 
      255, 0, 255, 65280, 255, 0}}, flags = 0, travlocks = {
    next = 0xff0000ff00, off = 0, hash = 255}, next = 0xff0000ff00, 
  device = 1095216660480, inode = 1095216725760, 
  log_fn = 0x4093b0 <null_log_fn>, 
  hash_fn = 0x4092f0 <default_tdb_hash>, open_flags = 2}
(gdb) print/x *tdb
$4 = {name = 0x0, map_ptr = 0x0, fd = 0x1f, map_size = 0xff, 
  read_only = 0xff00, locked = 0xff00000000, ecode = 0xff00, 
  header = {magic_food = {0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 
      0xff, 0x0, 0x0, 0x0, 0x0, 0xff, 0x0, 0x0, 0xff, 0x0, 0x0, 0x0, 
      0x0, 0x0, 0x0, 0x0, 0xff, 0x0, 0x0, 0x0, 0x0, 0xff, 0x0, 0x0}, 
    version = 0xff, hash_size = 0x0, rwlocks = 0xff, reserved = {
      0xff00, 0xff, 0x0, 0xff, 0xff00, 0xff, 0x0, 0xff, 0xff00, 
      0xff, 0x0, 0xff, 0xff00, 0xff, 0x0, 0xff, 0xff00, 0xff, 0x0, 
      0xff, 0xff00, 0xff, 0x0, 0xff, 0xff00, 0xff, 0x0, 0xff, 
      0xff00, 0xff, 0x0}}, flags = 0x0, travlocks = {
    next = 0xff0000ff00, off = 0x0, hash = 0xff}, 
  next = 0xff0000ff00, device = 0xff00000000, inode = 0xff0000ff00, 
  log_fn = 0x4093b0, hash_fn = 0x4092f0, open_flags = 0x2}

which is consistent.

> I would not be surprised about a strange bug in libc or the kernel.

Or even Xen itself, or the h/w.

Ian.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: xenstored crashes with SIGSEGV
  2014-12-16 12:23                               ` Ian Campbell
@ 2014-12-16 16:13                                 ` Frediano Ziglio
  2014-12-16 16:23                                   ` Ian Campbell
  2014-12-18 10:17                                   ` Ian Campbell
  0 siblings, 2 replies; 36+ messages in thread
From: Frediano Ziglio @ 2014-12-16 16:13 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Xen-devel, Ian Jackson, Philipp Hahn

2014-12-16 12:23 GMT+00:00 Ian Campbell <Ian.Campbell@citrix.com>:
> On Tue, 2014-12-16 at 11:30 +0000, Frediano Ziglio wrote:
>> 2014-12-16 11:06 GMT+00:00 Ian Campbell <Ian.Campbell@citrix.com>:
>> > On Tue, 2014-12-16 at 10:45 +0000, Ian Campbell wrote:
>> >> On Mon, 2014-12-15 at 23:29 +0100, Philipp Hahn wrote:
>> >> > > I notice in your bugzilla (for a different occurrence, I think):
>> >> > >> [2090451.721705] univention-conf[2512]: segfault at ff00000000 ip 000000000045e238 sp 00007ffff68dfa30 error 6 in python2.6[400000+21e000]
>> >> > >
>> >> > > Which appears to have faulted access 0xff000000000 too. It looks like
>> >> > > this process is a python thing, it's nothing to do with xenstored I
>> >> > > assume?
>> >> >
>> >> > Yes, that's one univention-config, which is completely independent of
>> >> > xen(stored).
>> >> >
>> >> > > It seems rather coincidental that it should be accessing the
>> >> > > same sort of address and be faulting.
>> >> >
>> >> > Yes, good catch. I'll have another look at those core dumps.
>> >>
>> >> With this in mind, please can you confirm what model of machines you've
>> >> seen this on, and in particular whether they are all the same class of
>> >> machine or whether they are significantly different.
>> >>
>> >> The reason being that randomly placed 0xff values in a field of 0x00
>> >> could possibly indicate hardware (e.g. a GPU) DMAing over the wrong
>> >> memory pages.
>> >
>> > Thanks for giving me access to the core files. This is very suspicious:
>> > (gdb) frame 2
>> > #2  0x000000000040a348 in tdb_open_ex (name=0x1941fb0 "/var/lib/xenstored/tdb.0x1935bb0", hash_size=<value optimized out>, tdb_flags=0, open_flags=<value optimized out>, mode=<value optimized out>,
>> >     log_fn=0x4093b0 <null_log_fn>, hash_fn=<value optimized out>) at tdb.c:1958
>> > 1958            SAFE_FREE(tdb->locked);
>> >
>> > (gdb) x/96x tdb
>> > 0x1921270:      0x00000000      0x00000000      0x00000000      0x00000000
>> > 0x1921280:      0x0000001f      0x000000ff      0x0000ff00      0x000000ff
>> > 0x1921290:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x19212a0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x19212b0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x19212c0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x19212d0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x19212e0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x19212f0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x1921300:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x1921310:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x1921320:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x1921330:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x1921340:      0x00000000      0x00000000      0x0000ff00      0x000000ff
>> > 0x1921350:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x1921360:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x1921370:      0x004093b0      0x00000000      0x004092f0      0x00000000
>> > 0x1921380:      0x00000002      0x00000000      0x00000091      0x00000000
>> > 0x1921390:      0x0193de70      0x00000000      0x01963600      0x00000000
>> > 0x19213a0:      0x00000000      0x00000000      0x0193fbb0      0x00000000
>> > 0x19213b0:      0x00000000      0x00000000      0x00000000      0x00000000
>> > 0x19213c0:      0x00405870      0x00000000      0x0040e3e0      0x00000000
>> > 0x19213d0:      0x00000038      0x00000000      0xe814ec70      0x6f2f6567
>> > 0x19213e0:      0x01963650      0x00000000      0x0193dec0      0x00000000
>> >
>> > Something has clearly done a number on the ram of this process.
>> > 0x1921270 through 0x192136f is 256 bytes...
>> >
>> > Since it appears to be happening to other processes too I would hazard
>> > that this is not a xenstored issue.
>> >
>> > Ian.
>> >
>>
>> Good catch Ian!
>>
>> Strange corruption. Probably not related to xenstored as you
>> suggested. I would be curious to see what's before the tdb pointer and
>> where does the corruption starts.
>
> (gdb) print tdb
> $2 = (TDB_CONTEXT *) 0x1921270
> (gdb) x/64x 0x1921200
> 0x1921200:      0x01921174      0x00000000      0x00000000      0x00000000
> 0x1921210:      0x01921174      0x00000000      0x00000171      0x00000000
> 0x1921220:      0x00000000      0x00000000      0x00000000      0x00000000

0x0 next (u64)
0x0 prev (u64)

> 0x1921230:      0x01941f60      0x00000000      0x00000000      0x00000000

0x01941f60 parent (u64), make sense is not NULL
0x0 child (u64)

> 0x1921240:      0x00000000      0x00000000      0x00000000      0x6f630065

0x0 refs (u64)
0x0 null_refs (u32)
0x6f630065 pad, garbage (u32)

> 0x1921250:      0x00000000      0x00000000      0x0040e8a7      0x00000000

0x0 destructor (u64)
0x0040e8a7 name (u64)

> 0x1921260:      0x00000118      0x00000000      0xe814ec70      0x00000000

0x118, size (u64)
0xe814ec70 magic (u32)
0x0 pad (u32)

Well... all the talloc header seems fine to me.
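For anyone following along, that decodes against a header layout like
this (the field names are my reconstruction from the dump above, so
take them as an approximation):

	/* 0x50 bytes, matching the "lea -0x50(%rdi)" seen earlier */
	struct talloc_chunk {
		struct talloc_chunk *next, *prev;	/* 0x00, 0x08 */
		struct talloc_chunk *parent, *child;	/* 0x10, 0x18 */
		void *refs;				/* 0x20 */
		unsigned int null_refs;			/* 0x28, +4 pad */
		int (*destructor)(void *);		/* 0x30 */
		const char *name;			/* 0x38 */
		size_t size;				/* 0x40 */
		unsigned int flags;			/* 0x48: TALLOC_MAGIC, +4 pad */
	};
	/* the user data (here the TDB_CONTEXT) starts at +0x50 */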


> 0x1921270:      0x00000000      0x00000000      0x00000000      0x00000000
> 0x1921280:      0x0000001f      0x000000ff      0x0000ff00      0x000000ff
> 0x1921290:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x19212a0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x19212b0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x19212c0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x19212d0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x19212e0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x19212f0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>
> So it appears to start at 0x1921270 or maybe ...6c.
>

It looks like there is a pattern like

 0x00000000      0x000000ff      0x0000ff00      0x000000ff

where the only exceptions are fields set after talloc_zero (fd, flags,
the function pointers). It is as if the memset inside talloc_zero had
filled with this pattern instead of zeroes. Note that a 16-byte pattern
matches the width of an SSE register. Some bug in the save/restore of
the SSE registers? Some bug in SSE emulation?

What does "info all-registers" gdb command say about SSE registers?

Do we have a bug in Xen that affects SSE instructions (possibly already
fixed since Philipp's version)?

>>  I also don't understand where the
>> "fd = 47" came from a previous mail. 0x1f is 31, not 47 (which is
>> 0x2f).
>
> I must have been using a different coredump to the origianl report
> (there are several).
>
> In the one which corresponds to the above:
>
> (gdb) print *tdb
> $3 = {name = 0x0, map_ptr = 0x0, fd = 31, map_size = 255,
>   read_only = 65280, locked = 0xff00000000, ecode = 65280, header = {
>     magic_food = "\377\000\000\000\000\000\000\000\377\000\000\000\000\377\000\000\377\000\000\000\000\000\000\000\377\000\000\000\000\377\000", version = 255, hash_size = 0, rwlocks = 255, reserved = {65280,
>       255, 0, 255, 65280, 255, 0, 255, 65280, 255, 0, 255, 65280,
>       255, 0, 255, 65280, 255, 0, 255, 65280, 255, 0, 255, 65280,
>       255, 0, 255, 65280, 255, 0}}, flags = 0, travlocks = {
>     next = 0xff0000ff00, off = 0, hash = 255}, next = 0xff0000ff00,
>   device = 1095216660480, inode = 1095216725760,
>   log_fn = 0x4093b0 <null_log_fn>,
>   hash_fn = 0x4092f0 <default_tdb_hash>, open_flags = 2}
> (gdb) print/x *tdb
> $4 = {name = 0x0, map_ptr = 0x0, fd = 0x1f, map_size = 0xff,
>   read_only = 0xff00, locked = 0xff00000000, ecode = 0xff00,
>   header = {magic_food = {0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
>       0xff, 0x0, 0x0, 0x0, 0x0, 0xff, 0x0, 0x0, 0xff, 0x0, 0x0, 0x0,
>       0x0, 0x0, 0x0, 0x0, 0xff, 0x0, 0x0, 0x0, 0x0, 0xff, 0x0, 0x0},
>     version = 0xff, hash_size = 0x0, rwlocks = 0xff, reserved = {
>       0xff00, 0xff, 0x0, 0xff, 0xff00, 0xff, 0x0, 0xff, 0xff00,
>       0xff, 0x0, 0xff, 0xff00, 0xff, 0x0, 0xff, 0xff00, 0xff, 0x0,
>       0xff, 0xff00, 0xff, 0x0, 0xff, 0xff00, 0xff, 0x0, 0xff,
>       0xff00, 0xff, 0x0}}, flags = 0x0, travlocks = {
>     next = 0xff0000ff00, off = 0x0, hash = 0xff},
>   next = 0xff0000ff00, device = 0xff00000000, inode = 0xff0000ff00,
>   log_fn = 0x4093b0, hash_fn = 0x4092f0, open_flags = 0x2}
>
> which is consistent.
>
>> I would not be surprised about a strange bug in libc or the kernel.
>
> Or even Xen itself, or the h/w.
>
> Ian,
>

Frediano

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: xenstored crashes with SIGSEGV
  2014-12-16 16:13                                 ` Frediano Ziglio
@ 2014-12-16 16:23                                   ` Ian Campbell
  2014-12-16 16:44                                     ` Frediano Ziglio
  2014-12-18 10:17                                   ` Ian Campbell
  1 sibling, 1 reply; 36+ messages in thread
From: Ian Campbell @ 2014-12-16 16:23 UTC (permalink / raw)
  To: Frediano Ziglio; +Cc: Philipp Hahn, Ian Jackson, Xen-devel

On Tue, 2014-12-16 at 16:13 +0000, Frediano Ziglio wrote:
> What does "info all-registers" gdb command say about SSE registers?

All zeroes. No ffs anywhere.

> Do we have a bug in Xen that affect SSE instructions (possibly already
> fixed after Philipp version) ?

Possibly. When this was thought to be xenstored (which doesn't change
all that much) debugging 4.1 seemed plausible, but since it could be
anywhere else I think we either need a plausible reproduction, or a
repro on a newer hypervisor (or possibly kernel) I'm afraid.

Ian.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: xenstored crashes with SIGSEGV
  2014-12-16 16:23                                   ` Ian Campbell
@ 2014-12-16 16:44                                     ` Frediano Ziglio
  2014-12-17  9:14                                       ` Frediano Ziglio
  0 siblings, 1 reply; 36+ messages in thread
From: Frediano Ziglio @ 2014-12-16 16:44 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Philipp Hahn, Ian Jackson, Xen-devel

2014-12-16 16:23 GMT+00:00 Ian Campbell <Ian.Campbell@citrix.com>:
> On Tue, 2014-12-16 at 16:13 +0000, Frediano Ziglio wrote:
>> What does "info all-registers" gdb command say about SSE registers?
>
> All zeroes. No ffs anywhere.
>

Could it be that the core does not dump these registers for some
reason? On my machine I get some FFs even just before main is reached.

>> Do we have a bug in Xen that affect SSE instructions (possibly already
>> fixed after Philipp version) ?
>
> Possibly. When this was thought to be xenstored (which doesn't change
> all that much) debugging 4.1 seemed plausible, but since it could be
> anywhere else I think we either need a plausible reproduction, or a
> repro on a newer hypervisor (or possibly kernel) I'm afraid.
>
> Ian.
>

I found these

1) https://www.kernel.org/pub/linux/kernel/v3.0/ChangeLog-3.2.8
2) https://sourceware.org/bugzilla/show_bug.cgi?id=16064

The first seems to indicate a problem with kernel 3.2, the second with
glibc 2.18.

First we can check (I'll try when I reach home) whether memset in glibc
(or the version called from talloc_zero) can use SSE. The dmesg output
and /proc/cpuinfo content could help too, but I think SSE is quite
common nowadays.

For reproduction, a program doing memset(0) continuously while another
fills the SSE registers with garbage could help... at least if they
execute on the same CPU (so it could help to limit Xen to one CPU).
Doing some FPU operation which could lead to an exception might help
too.
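A very rough sketch of such a reproducer (assuming glibc's memset uses
SSE stores for this size; run several instances pinned to one CPU):

	#include <emmintrin.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	int main(void)
	{
		static char buf[4096];
		size_t i;

		for (;;) {
			/* keep a recognisable all-ones pattern live in
			 * an SSE register */
			volatile __m128i junk = _mm_set1_epi32(-1);
			(void)junk;

			memset(buf, 0, sizeof(buf));
			for (i = 0; i < sizeof(buf); i++)
				if (buf[i]) {
					fprintf(stderr,
						"bad memset at %zu: %02x\n",
						i, (unsigned char)buf[i]);
					abort();
				}
		}
	}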

Frediano

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: xenstored crashes with SIGSEGV
  2014-12-16 16:44                                     ` Frediano Ziglio
@ 2014-12-17  9:14                                       ` Frediano Ziglio
  2014-12-17 12:43                                         ` core dump files do not include all CPU registers? Philipp Hahn
  2014-12-18 10:20                                         ` xenstored crashes with SIGSEGV Philipp Hahn
  0 siblings, 2 replies; 36+ messages in thread
From: Frediano Ziglio @ 2014-12-17  9:14 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Philipp Hahn, Ian Jackson, Xen-devel

2014-12-16 16:44 GMT+00:00 Frediano Ziglio <freddy77@gmail.com>:
> 2014-12-16 16:23 GMT+00:00 Ian Campbell <Ian.Campbell@citrix.com>:
>> On Tue, 2014-12-16 at 16:13 +0000, Frediano Ziglio wrote:
>>> What does "info all-registers" gdb command say about SSE registers?
>>
>> All zeroes. No ffs anywhere.
>>
>
> Could be that core does not dump these registers for some reasons? On
> my machine I got some FFs even just before the main is reached.
>
>>> Do we have a bug in Xen that affect SSE instructions (possibly already
>>> fixed after Philipp version) ?
>>
>> Possibly. When this was thought to be xenstored (which doesn't change
>> all that much) debugging 4.1 seemed plausible, but since it could be
>> anywhere else I think we either need a plausible reproduction, or a
>> repro on a newer hypervisor (or possibly kernel) I'm afraid.
>>
>> Ian.
>>
>
> I found these
>
> 1) https://www.kernel.org/pub/linux/kernel/v3.0/ChangeLog-3.2.8
> 2) https://sourceware.org/bugzilla/show_bug.cgi?id=16064
>
> 1 seems to indicate a problem with kernel 3.2. Second with glibc 2.18.
>
> First we (I'll try when I reach home) can check if memset in glibc (or
> the version called from talloc_zero) can use SSE. A possible dmesg
> output and /proc/cpuinfo content could help too but I think SSE are
> now quite common.
>

I have access to some core dumps. glibc's memset is using SSE,
specifically the xmm0 register.

Unfortunately it seems that the core dumps contain only the standard
registers, so all the others appear zeroed. If you try with a newer gdb
version it shows that those registers are not available.

> For the reproduction could be that a program doing some memset(0)
> continuously while another fill SSE register with garbage could
> help... at least if they execute on the same CPU (so could be limiting
> Xen to one CPU). Also doing some FPU operation which could lead to
> exception could help too.
>

Frediano

^ permalink raw reply	[flat|nested] 36+ messages in thread

* core dump files do not include all CPU registers?
  2014-12-17  9:14                                       ` Frediano Ziglio
@ 2014-12-17 12:43                                         ` Philipp Hahn
  2014-12-18 10:20                                         ` xenstored crashes with SIGSEGV Philipp Hahn
  1 sibling, 0 replies; 36+ messages in thread
From: Philipp Hahn @ 2014-12-17 12:43 UTC (permalink / raw)
  To: linux-kernel, Al Viro
  Cc: Frediano Ziglio, Ian Campbell, Xen-devel, Ian Jackson

Hello Linux folk,

while investigating some strange process crashes (SIGSEGV) we noticed
that the core files generated by the Linux kernel (3.10-amd64) do not
include all CPU registers, namely the SSE2-related (FPREGSET?)
registers:

# eu-readelf --notes core
...
  CORE                 336  PRSTATUS
    info.si_signo: 11, info.si_code: 0, info.si_errno: 0, cursig: 11
    sigpend: <>
    sighold: <>
    pid: 6918, ppid: 18764, pgrp: 6918, sid: 18764
    utime: 1.116000, stime: 0.004000, cutime: 0.000000, cstime: 0.000000
    orig_rax: -1, fpvalid: 0
    r15:                       0  r14:                       0
    r13:         140734211556208  r12:                 4195248
    rbp:      0x0000000000000000  rbx:                       0
    r11:         140504440818576  r10:                       0
    r9:          140504444297440  r8:          140504444220160
    rax:                   80000  rcx:                       0
    rdx:                  100000  rsi:         140734211556216
    rdi:                       1  rip:      0x00000000004004e7
    rflags:   0x0000000000010246  rsp:      0x00007fff3cb00298
    fs.base:   0x00007fc9bd9df700  gs.base:   0x0000000000000000
    cs: 0x0033  ss: 0x002b  ds: 0x0000  es: 0x0000  fs: 0x0000  gs: 0x0000
...

In contrast to that "gdb generate-core-file" contains an additional note:
# eu-readelf --notes core.7335
...
  CORE                 512  FPREGSET
    xmm0:  0x00000000000000000000000000000000
    xmm1:  0x2f2f2f2f2f2f2f2f2f2f2f2f2f2f2f2f
    xmm2:  0x00000000000000000000000000000000
    xmm3:  0x00000000000000000000ff0000000000
    xmm4:  0x00000000000000000000000000000000
    xmm5:  0x00000000000000000000000000000000
    xmm6:  0x00000000000000000000000000000000
    xmm7:  0x00000000000000000000000000000000
    xmm8:  0x00000000000000000000000000000000
    xmm9:  0x00000000000000000000000000000000
    xmm10: 0x00000000000000000000000000000000
    xmm11: 0x00000000000000000000000000000000
    xmm12: 0x00000000000000000000000000000000
    xmm13: 0x00000000000000000000000000000000
    xmm14: 0x00000000000000000000000000000000
    xmm15: 0x00000000000000000000000000000000
    st0: 0x00000000000000000000  st1: 0x00000000000000000000
    st2: 0x00000000000000000000  st3: 0x00000000000000000000
    st4: 0x00000000000000000000  st5: 0x00000000000000000000
    st6: 0x00000000000000000000  st7: 0x00000000000000000000
    mxcsr:   0x0000000000001f80
    fcw: 0x037f  fsw: 0x0000
...

Is there some way to include all CPU registers in the core dump? The
current information is not enough to diagnose the cause of some strange
crashes, which might be related to the use of SSE2 with Xen.

Currently we're using this:
	kernel.core_pattern = /var/tmp/core/core-%e-%p-%t
	kernel.core_uses_pid = 1
	fs.suid_dumpable = 2

Would a "pipe" core-handler be able to access the still existing process
and create a more complete dump?
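Something like this is what I have in mind (untested; whether ptrace
attach works while the target is still dumping is exactly my question):

	#include <stdio.h>
	#include <stdlib.h>

	/* invoked via: kernel.core_pattern = |/usr/local/sbin/core-catch %p */
	int main(int argc, char **argv)
	{
		char buf[65536], cmd[128];

		if (argc < 2)
			return 1;
		/* the crashing process still exists while this handler
		 * runs, so try a second, fuller dump with gdb's gcore */
		snprintf(cmd, sizeof(cmd),
			 "gcore -o /var/tmp/core/gcore %s", argv[1]);
		if (system(cmd) != 0)
			fprintf(stderr, "gcore failed for pid %s\n", argv[1]);
		/* drain the kernel-provided dump so the kernel can finish */
		while (fread(buf, 1, sizeof(buf), stdin) > 0)
			;
		return 0;
	}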


On 17.12.2014 10:14, Frediano Ziglio wrote:
> 2014-12-16 16:44 GMT+00:00 Frediano Ziglio <freddy77@gmail.com>:
>> 2014-12-16 16:23 GMT+00:00 Ian Campbell <Ian.Campbell@citrix.com>:
...
>> First we (I'll try when I reach home) can check if memset in glibc (or
>> the version called from talloc_zero) can use SSE. A possible dmesg
>> output and /proc/cpuinfo content could help too but I think SSE are
>> now quite common.
> 
> I have access to some core dumps. glibc memset is using SSE,
> specifically xmm0 register.
> 
> Unfortunately is seems that core dumps contains only standard
> registers, so all register appears zeroed. If you try with a newer gdb
> version is shows that registers are not available.

Thank you in advance.
Philipp

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: xenstored crashes with SIGSEGV
  2014-12-16 16:13                                 ` Frediano Ziglio
  2014-12-16 16:23                                   ` Ian Campbell
@ 2014-12-18 10:17                                   ` Ian Campbell
  2014-12-18 10:25                                     ` David Vrabel
                                                       ` (2 more replies)
  1 sibling, 3 replies; 36+ messages in thread
From: Ian Campbell @ 2014-12-18 10:17 UTC (permalink / raw)
  To: Frediano Ziglio
  Cc: George Dunlap, Philipp Hahn, Ian Jackson, Xen-devel,
	David Vrabel, Jan Beulich

On Tue, 2014-12-16 at 16:13 +0000, Frediano Ziglio wrote:
> Do we have a bug in Xen that affect SSE instructions (possibly already
> fixed after Philipp version) ?

I've had a niggling feeling of Deja Vu over this which I'd been putting
down to an old Xen on ARM bug in the area of FPU register switching.

But it seems at some point (possibly even still) there was a similar
issue with pvops kernels on x86, see:
        http://bugs.xenproject.org/xen/bug/40

Philipp, what kernel are you guys using?

CCing Jan and the x86 kernel guys (and George since he registered the
bug). I'm not seeing anything in the kernel logs which looks like a fix
(there's some PVH related cr0 frobbing, but I don't think that's it).

I also can't quite shake the feeling that there was another much older
issue relating to FPU context switch on x86, but I think that was truly
ancient history (2.6.18 era stuff)

Ian.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: xenstored crashes with SIGSEGV
  2014-12-17  9:14                                       ` Frediano Ziglio
  2014-12-17 12:43                                         ` core dump files do not include all CPU registers? Philipp Hahn
@ 2014-12-18 10:20                                         ` Philipp Hahn
  1 sibling, 0 replies; 36+ messages in thread
From: Philipp Hahn @ 2014-12-18 10:20 UTC (permalink / raw)
  To: Frediano Ziglio, Ian Campbell; +Cc: Ian Jackson, Xen-devel

Hello,

On 17.12.2014 10:14, Frediano Ziglio wrote:
> 2014-12-16 16:44 GMT+00:00 Frediano Ziglio <freddy77@gmail.com>:
>> 2014-12-16 16:23 GMT+00:00 Ian Campbell <Ian.Campbell@citrix.com>:
...
>> First we (I'll try when I reach home) can check if memset in glibc (or
>> the version called from talloc_zero) can use SSE. A possible dmesg
>> output and /proc/cpuinfo content could help too but I think SSE are
>> now quite common.
> 
> I have access to some core dumps. glibc memset is using SSE,
> specifically xmm0 register.
> 
> Unfortunately is seems that core dumps contains only standard
> registers, so all register appears zeroed. If you try with a newer gdb
> version is shows that registers are not available.

I had another look myself and I'm confused now:

Using "info float" or "info vector" with gdb-7.0.1 shows the FP and MMX
registers to be all zero.
A newer gdb-7.2 shows the registers as "unavailable".

"eu-readelf --notes core" doesn't show a NT_FPREGSET note, so to me it
looks like at least the FP-registers were not dumped.
But is that also used for the MMX registers? If my memory is right, the
FP and MMX registers are "shared" in the CPU, but that might be old
knowledge.

I wrote a small SSE-using program, which dumps core. If I run that
locally and do a "readelf --notes core", I get:
  CORE          0x00000200      NT_FPREGSET (floating point registers)

If I do the same in dom0, I don't get that note and gdb doesn't show the
register content.
SSE seems to be available in the dom0, as the program would crash with
SIGILL otherwise:
# grep ^flags /proc/cpuinfo
flags           : fpu de tsc msr pae mce cx8 apic sep mca cmov pat
clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good
nopl nonstop_tsc pni est ssse3 cx16 sse4_1 sse4_2 popcnt hypervisor
lahf_lm ida dtherm
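For reference, the test program was essentially of this shape (a
reconstruction, not the exact source I used):

	#include <emmintrin.h>
	#include <stdlib.h>

	int main(void)
	{
		/* touch an SSE register so the FPU state is live... */
		volatile __m128d v = _mm_set1_pd(1.0);
		(void)v;
		abort();	/* ...then dump core via SIGABRT */
	}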

Looks like that got fixed with the newer 3.10.61 kernel, so I'll urge
our admins to update to a later kernel (again), so we'll get more
useful core dumps for future crashes.

I'm still investigating the core files of the other programs, but it
takes some time. I don't know if I will be able to finish that in time,
as the Christmas holiday season starts tomorrow and I will be
unavailable for nearly two weeks.

So happy Christmas to everybody and thanks again for your help.

Philipp

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: xenstored crashes with SIGSEGV
  2014-12-18 10:17                                   ` Ian Campbell
@ 2014-12-18 10:25                                     ` David Vrabel
  2014-12-19 14:30                                       ` Konrad Rzeszutek Wilk
  2014-12-18 10:49                                     ` Jan Beulich
  2014-12-19 12:36                                     ` Philipp Hahn
  2 siblings, 1 reply; 36+ messages in thread
From: David Vrabel @ 2014-12-18 10:25 UTC (permalink / raw)
  To: Ian Campbell, Frediano Ziglio
  Cc: George Dunlap, Philipp Hahn, Ian Jackson, Xen-devel, Jan Beulich

On 18/12/14 10:17, Ian Campbell wrote:
> On Tue, 2014-12-16 at 16:13 +0000, Frediano Ziglio wrote:
>> Do we have a bug in Xen that affect SSE instructions (possibly already
>> fixed after Philipp version) ?
> 
> I've had a niggling feeling of Deja Vu over this which I'd been putting
> down to an old Xen on ARM bug in the area of FPU register switching.
> 
> But it seems at some point (possibly even still) there was a similar
> issue with pvops kernels on x86, see:
>         http://bugs.xenproject.org/xen/bug/40
> 
> Philipp, what kernel are you guys using?
> 
> CCing Jan and the x86 kernel guys (and George since he registered the
> bug). I'm not seeing anything in the kernel logs which looks like a fix
> (there's some PVH related cr0 frobbing, but I don't think that's it).
> 
> I also can't quite shake the feeling that there was another much older
> issue relating to FPU context switch on x86, but I think that was truly
> ancient history (2.6.18 era stuff)

http://marc.info/?l=linux-kernel&m=139132566024357

David

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: xenstored crashes with SIGSEGV
  2014-12-18 10:17                                   ` Ian Campbell
  2014-12-18 10:25                                     ` David Vrabel
@ 2014-12-18 10:49                                     ` Jan Beulich
  2014-12-18 10:51                                       ` Ian Campbell
  2014-12-19 12:36                                     ` Philipp Hahn
  2 siblings, 1 reply; 36+ messages in thread
From: Jan Beulich @ 2014-12-18 10:49 UTC (permalink / raw)
  To: Ian Campbell
  Cc: George Dunlap, Philipp Hahn, Ian Jackson, Xen-devel, DavidVrabel,
	Frediano Ziglio

>>> On 18.12.14 at 11:17, <Ian.Campbell@citrix.com> wrote:
> On Tue, 2014-12-16 at 16:13 +0000, Frediano Ziglio wrote:
>> Do we have a bug in Xen that affect SSE instructions (possibly already
>> fixed after Philipp version) ?
> 
> I've had a niggling feeling of Deja Vu over this which I'd been putting
> down to an old Xen on ARM bug in the area of FPU register switching.
> 
> But it seems at some point (possibly even still) there was a similar
> issue with pvops kernels on x86, see:
>         http://bugs.xenproject.org/xen/bug/40 
> 
> Philipp, what kernel are you guys using?
> 
> CCing Jan and the x86 kernel guys (and George since he registered the
> bug). I'm not seeing anything in the kernel logs which looks like a fix
> (there's some PVH related cr0 frobbing, but I don't think that's it).

I just went through the thread again and didn't find where kernel/
hypervisor logs were posted. You mentioning PVH made me want to
take a look - said FPU related bug would be exposed only by PV
kernels.

Jan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: xenstored crashes with SIGSEGV
  2014-12-18 10:49                                     ` Jan Beulich
@ 2014-12-18 10:51                                       ` Ian Campbell
  0 siblings, 0 replies; 36+ messages in thread
From: Ian Campbell @ 2014-12-18 10:51 UTC (permalink / raw)
  To: Jan Beulich
  Cc: George Dunlap, Philipp Hahn, Ian Jackson, Xen-devel, DavidVrabel,
	Frediano Ziglio

On Thu, 2014-12-18 at 10:49 +0000, Jan Beulich wrote:
> >>> On 18.12.14 at 11:17, <Ian.Campbell@citrix.com> wrote:
> > On Tue, 2014-12-16 at 16:13 +0000, Frediano Ziglio wrote:
> >> Do we have a bug in Xen that affect SSE instructions (possibly already
> >> fixed after Philipp version) ?
> > 
> > I've had a niggling feeling of Deja Vu over this which I'd been putting
> > down to an old Xen on ARM bug in the area of FPU register switching.
> > 
> > But it seems at some point (possibly even still) there was a similar
> > issue with pvops kernels on x86, see:
> >         http://bugs.xenproject.org/xen/bug/40 
> > 
> > Philipp, what kernel are you guys using?
> > 
> > CCing Jan and the x86 kernel guys (and George since he registered the
> > bug). I'm not seeing anything in the kernel logs which looks like a fix
> > (there's some PVH related cr0 frobbing, but I don't think that's it).
> 
> I just went through the thread again and didn't find where kernel/
> hypervisor logs were posted.

I don't think they were yet -- until recently it seemed like a xenstored
bug. Philipp, can you post them now?

Also the patch linked to by David seems like a good thing to try if you
are indeed running a kernel which is susceptible to this issue.

> You mentioning PVH made me want to take a look - said FPU related bug
> would be exposed only by PV kernels.

It's (almost certainly) not a PVH issue; I was saying that the patch
which touched PVH probably isn't relevant to this particular issue.

Ian.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: xenstored crashes with SIGSEGV
  2014-12-18 10:17                                   ` Ian Campbell
  2014-12-18 10:25                                     ` David Vrabel
  2014-12-18 10:49                                     ` Jan Beulich
@ 2014-12-19 12:36                                     ` Philipp Hahn
  2015-01-06  7:19                                       ` Philipp Hahn
  2 siblings, 1 reply; 36+ messages in thread
From: Philipp Hahn @ 2014-12-19 12:36 UTC (permalink / raw)
  To: Ian Campbell, Frediano Ziglio
  Cc: George Dunlap, Ian Jackson, David Vrabel, Jan Beulich, Xen-devel

Hello Ian,

On 18.12.2014 11:17, Ian Campbell wrote:
> On Tue, 2014-12-16 at 16:13 +0000, Frediano Ziglio wrote:
>> Do we have a bug in Xen that affect SSE instructions (possibly already
>> fixed after Philipp version) ?
> 
> I've had a niggling feeling of Deja Vu over this which I'd been putting
> down to an old Xen on ARM bug in the area of FPU register switching.
> 
> But it seems at some point (possibly even still) there was a similar
> issue with pvops kernels on x86, see:
>         http://bugs.xenproject.org/xen/bug/40

That definitely looks interesting.

> Philipp, what kernel are you guys using?

The crash "2014-12-06 01:26:21 xenstored[4337]" happened on linux-3.10.46.

That kernel is missing v3.10.50-13-gd1cc001:
> commit d1cc001905146d58c17ac8452eb96f226767819d
> Author: Silesh C V <svellattu@mvista.com>
> Date:   Wed Jul 23 13:59:59 2014 -0700
>
>     coredump: fix the setting of PF_DUMPCORE
>     commit aed8adb7688d5744cb484226820163af31d2499a upstream.
which explains why the xmm* registers are not included in the core file.

> I also can't quite shake the feeling that there was another much older
> issue relating to FPU context switch on x86, but I think that was truly
> ancient history (2.6.18 era stuff)

Some of those hosts might still use 3.2; most use 3.10.x, but
definitely no 2.6 kernels.

The Xen hypervisor is 4.1.3.

If you need anything more, just ask. It might take me some time to
answer as I'm on vacation for the next 2 weeks.

Thanks again for your help.
Philipp

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: xenstored crashes with SIGSEGV
  2014-12-18 10:25                                     ` David Vrabel
@ 2014-12-19 14:30                                       ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 36+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-12-19 14:30 UTC (permalink / raw)
  To: David Vrabel
  Cc: Ian Campbell, Philipp Hahn, George Dunlap, Ian Jackson,
	Xen-devel, Frediano Ziglio, Jan Beulich

On Thu, Dec 18, 2014 at 10:25:15AM +0000, David Vrabel wrote:
> On 18/12/14 10:17, Ian Campbell wrote:
> > On Tue, 2014-12-16 at 16:13 +0000, Frediano Ziglio wrote:
> >> Do we have a bug in Xen that affect SSE instructions (possibly already
> >> fixed after Philipp version) ?
> > 
> > I've had a niggling feeling of Deja Vu over this which I'd been putting
> > down to an old Xen on ARM bug in the area of FPU register switching.
> > 
> > But it seems at some point (possibly even still) there was a similar
> > issue with pvops kernels on x86, see:
> >         http://bugs.xenproject.org/xen/bug/40
> > 
> > Philipp, what kernel are you guys using?
> > 
> > CCing Jan and the x86 kernel guys (and George since he registered the
> > bug). I'm not seeing anything in the kernel logs which looks like a fix
> > (there's some PVH related cr0 frobbing, but I don't think that's it).
> > 
> > I also can't quite shake the feeling that there was another much older
> > issue relating to FPU context switch on x86, but I think that was truly
> > ancient history (2.6.18 era stuff)
> 
> http://marc.info/?l=linux-kernel&m=139132566024357

More up-to-date: http://lkml.iu.edu/hypermail/linux/kernel/1409.0/01057.html

And boy did that mess up Oracle DB! There had been a P0 bug affecting
databases, and this patch solved it.

> 
> David

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: xenstored crashes with SIGSEGV
  2014-12-19 12:36                                     ` Philipp Hahn
@ 2015-01-06  7:19                                       ` Philipp Hahn
  2015-03-12 12:08                                         ` Philipp Hahn
  0 siblings, 1 reply; 36+ messages in thread
From: Philipp Hahn @ 2015-01-06  7:19 UTC (permalink / raw)
  To: Ian Campbell, Frediano Ziglio
  Cc: George Dunlap, Ian Jackson, David Vrabel, Jan Beulich, Xen-devel

Hello,

happy new year to everyone.

On 19.12.2014 13:36, Philipp Hahn wrote:
> On 18.12.2014 11:17, Ian Campbell wrote:
>> On Tue, 2014-12-16 at 16:13 +0000, Frediano Ziglio wrote:
>>> Do we have a bug in Xen that affect SSE instructions (possibly already
>>> fixed after Philipp version) ?
>>
>> I've had a niggling feeling of Deja Vu over this which I'd been putting
>> down to an old Xen on ARM bug in the area of FPU register switching.
>>
>> But it seems at some point (possibly even still) there was a similar
>> issue with pvops kernels on x86, see:
>>         http://bugs.xenproject.org/xen/bug/40
> 
> That definitely looks interesting.
> 
>> Philipp, what kernel are you guys using?
> 
> The crash "2014-12-06 01:26:21 xenstored[4337]" happened on linux-3.10.46.

I looked through the changes of v3.10.46..v3.10.63 and found the
following patches:
| fb5b6e7 x86, fpu: shift drop_init_fpu() from save_xstate_sig() to
handle_signal()
| b888e3d x86, fpu: __restore_xstate_sig()->math_state_restore() needs
preempt_disable()

They look interesting enough that they may have fixed the bug, which
could explain the strange bit pattern caused by not restoring the FPU
state correctly. Because of that, and because of the missing

>> commit d1cc001905146d58c17ac8452eb96f226767819d
>> Author: Silesh C V <svellattu@mvista.com>
>> Date:   Wed Jul 23 13:59:59 2014 -0700
>>
>>     coredump: fix the setting of PF_DUMPCORE
>>     commit aed8adb7688d5744cb484226820163af31d2499a upstream.

we're now working on upgrading the dom0 kernel, which should give us
usable core dumps again and may also fix the underlying problem. If that
bug ever happens again I'll keep you informed.
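
If anyone wants to poke at this independently in the meantime, a crude
user-space probe could watch for exactly this kind of corruption. This
is a sketch only, not something we have actually run: it assumes GCC
inline assembly on x86-64 and must be built with -O0 so that neither
the compiler nor a libc call inside the loop touches the caller-saved
xmm8.

/* build: gcc -O0 -o sse-probe sse-probe.c
 * Park a known pattern in xmm8 and check that it survives a steady
 * stream of signals, since the two commits above touch the signal
 * save/restore path. */
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static volatile sig_atomic_t ticks;

static void on_alarm(int sig)
{
        (void)sig;
        ticks++;
        alarm(1);               /* re-arm: one SIGALRM per second */
}

int main(void)
{
        static const uint8_t pattern[16] = {
                0xde, 0xad, 0xbe, 0xef, 0xde, 0xad, 0xbe, 0xef,
                0xde, 0xad, 0xbe, 0xef, 0xde, 0xad, 0xbe, 0xef,
        };
        uint8_t readback[16];
        int i;

        signal(SIGALRM, on_alarm);
        alarm(1);

        /* load the pattern once; no function calls below this point
         * until corruption is detected or the run ends */
        __asm__ volatile("movdqu %0, %%xmm8" : : "m"(pattern) : "xmm8");

        while (ticks < 600) {   /* run for roughly ten minutes */
                __asm__ volatile("movdqu %%xmm8, %0" : "=m"(readback));
                for (i = 0; i < 16; i++)
                        if (readback[i] != pattern[i]) {
                                fprintf(stderr,
                                        "xmm8 corrupted after %d signals\n",
                                        (int)ticks);
                                abort();
                        }
        }
        puts("no corruption observed");
        return 0;
}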

Thanks so far to everybody for the excellent support.

Sincerely
Philipp Hahn

* Re: xenstored crashes with SIGSEGV
  2015-01-06  7:19                                       ` Philipp Hahn
@ 2015-03-12 12:08                                         ` Philipp Hahn
  2015-03-12 18:17                                           ` Oleg Nesterov
  0 siblings, 1 reply; 36+ messages in thread
From: Philipp Hahn @ 2015-03-12 12:08 UTC (permalink / raw)
  To: Ian Campbell, Frediano Ziglio, Oleg Nesterov
  Cc: George Dunlap, Ian Jackson, David Vrabel, Jan Beulich, Xen-devel

Hello,

On 06.01.2015 08:19, Philipp Hahn wrote:
> On 19.12.2014 13:36, Philipp Hahn wrote:
>> On 18.12.2014 11:17, Ian Campbell wrote:
>>> On Tue, 2014-12-16 at 16:13 +0000, Frediano Ziglio wrote:
>>>> Do we have a bug in Xen that affects SSE instructions (possibly already
>>>> fixed after Philipp's version)?
>>>
>>> I've had a niggling feeling of Deja Vu over this which I'd been putting
>>> down to an old Xen on ARM bug in the area of FPU register switching.
>>>
>>> But it seems at some point (possibly even still) there was a similar
>>> issue with pvops kernels on x86, see:
>>>         http://bugs.xenproject.org/xen/bug/40
...
>>> Philipp, what kernel are you guys using?
>>
>> The crash "2014-12-06 01:26:21 xenstored[4337]" happened on linux-3.10.46.
> 
> I looked through the changes of v3.10.46..v3.10.63 and found the
> following patches:
> | fb5b6e7 x86, fpu: shift drop_init_fpu() from save_xstate_sig() to handle_signal()
> | b888e3d x86, fpu: __restore_xstate_sig()->math_state_restore() needs preempt_disable()
> 
> They look interesting enough that they may have fixed the bug, which
> would explain the strange bit pattern caused by not restoring the FPU
> state correctly.
...
> we're now working on upgrading the dom0 kernel, which should give us
> usable core dumps again and may also fix the underlying problem. If that
> bug ever happens again I'll keep you informed.

We're now running 3.10.62 and the situation seems to have improved, but
yesterday and today we got two crashes on different hosts, this time
both again in vsnprintf():

> [304534.173707] xenstored[3731]: segfault at 2 ip 00007f6da00805ad sp 00007fff544a2b80 error 4 in libc-2.11.3.so[7f6da003b000+158000]

> (gdb) where
> #0  0x00007f6da00805ad in _IO_vfprintf_internal (s=0x7fff544a3230, format=<value optimized out>, ap=0x7fff544a3790) at vfprintf.c:1617
> #1  0x00007f6da00a2452 in _IO_vsnprintf (string=0x7fff544a3390 "%%p 4249828122762082015 03:11:04 9JT\377\177", maxlen=<value optimized out>, format=0x40da48 "%s %p %04d%02d%02d %02d:%02d:%02d %s (", args=0x7fff544a3790) at vsnprintf.c:120
> #2  0x00000000004029ad in trace (fmt=0x40da48 "%s %p %04d%02d%02d %02d:%02d:%02d %s (") at xenstored_core.c:140
> #3  0x0000000000402c67 in trace_io (conn=0xbb51f0, data=0xbf1fe0, out=0) at xenstored_core.c:174
> #4  0x00000000004041cd in handle_input (conn=0xbb51f0) at xenstored_core.c:1307
> #5  0x0000000000405170 in main (argc=<value optimized out>, argv=<value optimized out>) at xenstored_core.c:1964

The SSE registers again contain the 00..ff pattern, but accessing
%es:(%rdi) = 0x0:0x2 looks very broken: the repnz scas in the
disassembly below is glibc's inlined strlen() over a %s argument (note
%rcx == -1 and %al == 0), so the string pointer handed down from
trace() was the bogus 0x2.

> (gdb) info all-registers 
> rax            0x0      0
> rbx            0x40da48 4250184
> rcx            0xffffffffffffffff       -1
> rdx            0x7fff544a3890   140734607538320
> rsi            0x40da69 4250217
> rdi            0x2      2
> rbp            0x7fff544a3790   0x7fff544a3790
> rsp            0x7fff544a3390   0x7fff544a3390
> r8             0x1      1
> r9             0x2      2
> r10            0x2      2
> r11            0x10     16
> r12            0x0      0
> r13            0x7fff544a3950   140734607538512
> r14            0x7fff544a39d0   140734607538640
> r15            0xc      12
> rip            0x4029ad 0x4029ad <trace+221>
> eflags         0x10286  [ PF SF IF RF ]
> cs             0xe033   57395
> ss             0xe02b   57387
> ds             0x0      0
> es             0x0      0
> fs             0x0      0
> gs             0x0      0
> st0            0        (raw 0x00000000000000000000)
> st1            0        (raw 0x00000000000000000000)
> st2            0        (raw 0x00000000000000000000)
> st3            0        (raw 0x00000000000000000000)
> st4            0        (raw 0x00000000000000000000)
> st5            0        (raw 0x00000000000000000000)
> st6            0        (raw 0x00000000000000000000)
> st7            0        (raw 0x00000000000000000000)
> fctrl          0x37f    895
> fstat          0x0      0
> ftag           0xffff   65535
> fiseg          0x0      0
> fioff          0x0      0
> foseg          0x0      0
> fooff          0x0      0
> fop            0x0      0
> xmm0           {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0xff, 0x0, 0x0, 0x0, 0x0, 0xff, 0x0, 0x0, 0x0, 0x0, 0xff, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int16 = {0xff, 0x0, 0xff00, 0x0, 0x0, 0xff, 0x0, 0x0}, v4_int32 = {0xff, 0xff00, 0xff0000, 0x0}, v2_int64 = {0xff00000000ff, 0xff0000}, uint128 = 0x0000000000ff00000000ff00000000ff}
> xmm1           {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x25 <repeats 16 times>}, v8_int16 = {0x2525, 0x2525, 0x2525, 0x2525, 0x2525, 0x2525, 0x2525, 0x2525}, v4_int32 = {0x25252525, 0x25252525, 0x25252525, 0x25252525}, v2_int64 = {0x2525252525252525, 0x2525252525252525}, uint128 = 0x25252525252525252525252525252525}
> xmm2           {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, uint128 = 0x00000000000000000000000000000000}
> xmm3           {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x8000000000000000}, v16_int8 = {0x0 <repeats 14 times>, 0xff, 0xff}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xffff}, v4_int32 = {0x0, 0x0, 0x0, 0xffff0000}, v2_int64 = {0x0, 0xffff000000000000}, uint128 = 0xffff0000000000000000000000000000}
> xmm4           {v4_float = {0xd34e4f00, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x8000000000000000}, v16_int8 = {0x4f, 0x4e, 0x53, 0x4f, 0x4c, 0x45, 0x3d, 0x2f, 0x64, 0x65, 0x76, 0x2f, 0x63, 0x6f, 0x6e, 0x73}, v8_int16 = {0x4e4f, 0x4f53, 0x454c, 0x2f3d, 0x6564, 0x2f76, 0x6f63, 0x736e}, v4_int32 = {0x4f534e4f, 0x2f3d454c, 0x2f766564, 0x736e6f63}, v2_int64 = {0x2f3d454c4f534e4f, 0x736e6f632f766564}, uint128 = 0x736e6f632f7665642f3d454c4f534e4f}
> xmm5           {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, uint128 = 0x00000000000000000000000000000000}
> xmm6           {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, uint128 = 0x00000000000000000000000000000000}
> xmm7           {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, uint128 = 0x00000000000000000000000000000000}
> xmm8           {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, uint128 = 0x00000000000000000000000000000000}
> xmm9           {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, uint128 = 0x00000000000000000000000000000000}
> xmm10          {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, uint128 = 0x00000000000000000000000000000000}
> xmm11          {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, uint128 = 0x00000000000000000000000000000000}
> xmm12          {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, uint128 = 0x00000000000000000000000000000000}
> xmm13          {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, uint128 = 0x00000000000000000000000000000000}
> xmm14          {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, uint128 = 0x00000000000000000000000000000000}
> xmm15          {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, uint128 = 0x00000000000000000000000000000000}
> mxcsr          0x1f80   [ IM DM ZM OM UM PM ]

> (gdb) x/20i $pc
> 0x7f6da00805ad <_IO_vfprintf_internal+15357>:   repnz scas %es:(%rdi),%al
> 0x7f6da00805af <_IO_vfprintf_internal+15359>:   xor    %r10d,%r10d
> 0x7f6da00805b2 <_IO_vfprintf_internal+15362>:   not    %rcx
> 0x7f6da00805b5 <_IO_vfprintf_internal+15365>:   lea    -0x1(%rcx),%r8
> 0x7f6da00805b9 <_IO_vfprintf_internal+15369>:   mov    %r8d,%ecx
> 0x7f6da00805bc <_IO_vfprintf_internal+15372>:   jmpq   0x7f6da007e00c <_IO_vfprintf_internal+5724>
> 0x7f6da00805c1 <_IO_vfprintf_internal+15377>:   mov    $0x6,%ecx
> 0x7f6da00805c6 <_IO_vfprintf_internal+15382>:   xor    %r10d,%r10d
> 0x7f6da00805c9 <_IO_vfprintf_internal+15385>:   mov    $0x6,%r8d
> 0x7f6da00805cf <_IO_vfprintf_internal+15391>:   lea    0xdff57(%rip),%r9        # 0x7f6da016052d <null>
> 0x7f6da00805d6 <_IO_vfprintf_internal+15398>:   jmpq   0x7f6da007d546 <_IO_vfprintf_internal+2966>
> 0x7f6da00805db <_IO_vfprintf_internal+15403>:   mov    0x8(%r13),%rax
> 0x7f6da00805df <_IO_vfprintf_internal+15407>:   lea    0x8(%rax),%rdx
> 0x7f6da00805e3 <_IO_vfprintf_internal+15411>:   mov    %rdx,0x8(%r13)
> 0x7f6da00805e7 <_IO_vfprintf_internal+15415>:   jmpq   0x7f6da007eac2 <_IO_vfprintf_internal+8466>
> 0x7f6da00805ec <_IO_vfprintf_internal+15420>:   mov    0x8(%r13),%rax
> 0x7f6da00805f0 <_IO_vfprintf_internal+15424>:   lea    0x8(%rax),%rdx
> 0x7f6da00805f4 <_IO_vfprintf_internal+15428>:   mov    %rdx,0x8(%r13)
> 0x7f6da00805f8 <_IO_vfprintf_internal+15432>:   jmpq   0x7f6da007f91e <_IO_vfprintf_internal+12142>
> 0x7f6da00805fd <_IO_vfprintf_internal+15437>:   mov    0x8(%r13),%rax

> (gdb) x/64x $sp
> 0x7fff544a2b80: 0x544a3260      0x00007fff      0x00000001      0x00000000
> 0x7fff544a2b90: 0x0040da6a      0x00000000      0x0040da6a      0x00000000
> 0x7fff544a2ba0: 0x544a3260      0x00007fff      0xa007cb39      0x00007f6d
> 0x7fff544a2bb0: 0x00000025      0x00000000      0x00000000      0x00000000
> 0x7fff544a2bc0: 0x544a3110      0x00007fff      0x0040d500      0x00000000
> 0x7fff544a2bd0: 0x0040da48      0x00000000      0x00000000      0x00000000
> 0x7fff544a2be0: 0x00000027      0x00000000      0x544a317c      0x00007fff
> 0x7fff544a2bf0: 0x544a31b8      0x00007fff      0x544a3198      0x00007fff
> 0x7fff544a2c00: 0x00000000      0x00000000      0x00000000      0x00000000
> 0x7fff544a2c10: 0x544a2d00      0x00007fff      0x544a31ac      0x00007fff
> 0x7fff544a2c20: 0x544a31e8      0x00007fff      0x544a31c8      0x00000000
> 0x7fff544a2c30: 0x544a3170      0x00007fff      0xffffffff      0xffffffff
> 0x7fff544a2c40: 0x544a2d30      0x00007fff      0x544a30e8      0x00007fff
> 0x7fff544a2c50: 0x0040da70      0x00000000      0x00000000      0x00000000
> 0x7fff544a2c60: 0x00000000      0xffffe938      0xffffff20      0xffffffff
> 0x7fff544a2c70: 0x544a3238      0x00007fff      0x544a3118      0x00007fff

To me it looks like there is still some register or memory corruption
happening in the kernel or the Xen hypervisor.
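
For what it's worth, the faulting mode itself is easy to reproduce in
isolation; here is a minimal sketch, nothing xenstored-specific:

/* glibc's vfprintf() runs an inlined strlen (the repnz scas above)
 * over every %s argument, so a wild string pointer faults inside
 * libc rather than in the caller. The 0x2 mirrors $rdi in the
 * register dump. */
#include <stdio.h>

int main(void)
{
        char buf[64];
        const char *bogus = (const char *)0x2;

        snprintf(buf, sizeof(buf), "%s", bogus); /* SIGSEGV expected */
        return 0;
}

So the open question is not why vsnprintf() faults, but what scribbled
over the string argument before trace() passed it down.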

@Oleg:
Have you seen any other corruption or is one of your patches likely to
fix something like the issue mentioned above:
> $ git l1 --grep fpu v3.10.. -- arch/x86
> c7b228a Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> dc56c0f x86, fpu: Shift "fpu_counter = 0" from copy_thread() to arch_dup_task_struct()
> 5e23fee x86, fpu: copy_process: Sanitize fpu->last_cpu initialization
> f185350 x86, fpu: copy_process: Avoid fpu_alloc/copy if !used_math()
> 31d9633 x86, fpu: Change __thread_fpu_begin() to use use_eager_fpu()

Philipp

* Re: xenstored crashes with SIGSEGV
  2015-03-12 12:08                                         ` Philipp Hahn
@ 2015-03-12 18:17                                           ` Oleg Nesterov
  2015-03-12 21:57                                             ` Philipp Hahn
  0 siblings, 1 reply; 36+ messages in thread
From: Oleg Nesterov @ 2015-03-12 18:17 UTC (permalink / raw)
  To: Philipp Hahn
  Cc: Ian Campbell, George Dunlap, Ian Jackson, Xen-devel,
	David Vrabel, Jan Beulich, Frediano Ziglio

On 03/12, Philipp Hahn wrote:
>
> Have you seen any other corruption

No,

> or is one of your patches likely to
> fix something like the issue mentioned above:

I am not sure I even understand the problem above ;) I mean, after a
quick look I do not see how this connects to FPU. $rdi == 2 looks
obviously wrong.

> > $ git l1 --grep fpu v3.10.. -- arch/x86
> > c7b228a Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> > dc56c0f x86, fpu: Shift "fpu_counter = 0" from copy_thread() to arch_dup_task_struct()
> > 5e23fee x86, fpu: copy_process: Sanitize fpu->last_cpu initialization
> > f185350 x86, fpu: copy_process: Avoid fpu_alloc/copy if !used_math()
> > 31d9633 x86, fpu: Change __thread_fpu_begin() to use use_eager_fpu()

These are only cleanups... I do not think this series can fix anything.

Oleg.

* Re: xenstored crashes with SIGSEGV
  2015-03-12 18:17                                           ` Oleg Nesterov
@ 2015-03-12 21:57                                             ` Philipp Hahn
  0 siblings, 0 replies; 36+ messages in thread
From: Philipp Hahn @ 2015-03-12 21:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ian Campbell, George Dunlap, Ian Jackson, Xen-devel,
	David Vrabel, Jan Beulich, Frediano Ziglio

Hello,

On 12.03.2015 19:17, Oleg Nesterov wrote:
> On 03/12, Philipp Hahn wrote:
>>
>> Have you seen any other corruption
> 
> No,
> 
>> or is one of your patches likely to
>> fix something like the issue mentioned above:
> 
> I am not sure I even understand the problem above ;) I mean, after a
> quick look I do not see how this connects to FPU. $rdi == 2 looks
> obviously wrong.

In December we found some strange crashes of a Xen daemon, but other
processes crashed as well. One strange pattern Ian found was a
0x..00.ff bit pattern, which seems to have come from SSE register
corruption.
That is why we upgraded to 3.10.62, which contains some fixes for saving
the FPU state. If my memory is correct, the MMX registers alias the x87
FPU registers and both are saved and restored together with the SSE
registers, so that seemed a good candidate.
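
For illustration, this is roughly the layout of the 512-byte
FXSAVE/FXRSTOR area (the FXSAVE64 form, as described in the Intel SDM)
that the kernel saves and restores; the struct and field names are my
own sketch, not the kernel's definitions:

#include <stdint.h>

/* x87/MMX and SSE state travel as one block, so a botched save or
 * restore of this area can garble both at once. */
struct fxsave_area {
        uint16_t fcw, fsw;           /* x87 control and status words */
        uint8_t  ftw, rsvd0;         /* abridged tag word            */
        uint16_t fop;                /* last x87 opcode              */
        uint64_t fip, fdp;           /* last instr./operand pointers */
        uint32_t mxcsr, mxcsr_mask;  /* SSE control/status           */
        uint8_t  st_mm[8][16];       /* st0-st7; mm0-mm7 alias these */
        uint8_t  xmm[16][16];        /* xmm0-xmm15 in 64-bit mode    */
        uint8_t  rsvd1[96];
};

_Static_assert(sizeof(struct fxsave_area) == 512,
               "the FXSAVE area is architecturally 512 bytes");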

You might want to take a look at
<http://lists.xenproject.org/archives/html/xen-devel/2014-12/msg01583.html>,
where you find the mail thread from December.

>>> $ git l1 --grep fpu v3.10.. -- arch/x86
>>> c7b228a Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
>>> dc56c0f x86, fpu: Shift "fpu_counter = 0" from copy_thread() to arch_dup_task_struct()
>>> 5e23fee x86, fpu: copy_process: Sanitize fpu->last_cpu initialization
>>> f185350 x86, fpu: copy_process: Avoid fpu_alloc/copy if !used_math()
>>> 31d9633 x86, fpu: Change __thread_fpu_begin() to use use_eager_fpu()
> 
> These are only cleanups... I do not think this series can fix anything.

That was my guess too after reading your description, but still thanks
for your help.

Philipp


Thread overview: 36+ messages
2014-11-13  7:45 xenstored crashes with SIGSEGV Philipp Hahn
2014-11-13  9:12 ` Ian Campbell
2014-12-12 16:14   ` Philipp Hahn
2014-12-12 16:32     ` Ian Campbell
2014-12-12 16:45       ` Philipp Hahn
2014-12-12 16:56         ` Ian Campbell
2014-12-12 17:20           ` Philipp Hahn
2014-12-12 17:58             ` Ian Campbell
2014-12-15 13:17               ` Ian Campbell
2014-12-15 14:19                 ` Philipp Hahn
2014-12-15 14:50                   ` Ian Campbell
2014-12-15 17:45                     ` Ian Campbell
2014-12-15 22:29                       ` Philipp Hahn
2014-12-16  9:51                         ` Ian Campbell
2014-12-16 10:25                         ` Ian Campbell
2014-12-16 10:45                         ` Ian Campbell
2014-12-16 11:06                           ` Ian Campbell
2014-12-16 11:30                             ` Frediano Ziglio
2014-12-16 12:23                               ` Ian Campbell
2014-12-16 16:13                                 ` Frediano Ziglio
2014-12-16 16:23                                   ` Ian Campbell
2014-12-16 16:44                                     ` Frediano Ziglio
2014-12-17  9:14                                       ` Frediano Ziglio
2014-12-17 12:43                                         ` core dump files do not include all CPU registers? Philipp Hahn
2014-12-18 10:20                                         ` xenstored crashes with SIGSEGV Philipp Hahn
2014-12-18 10:17                                   ` Ian Campbell
2014-12-18 10:25                                     ` David Vrabel
2014-12-19 14:30                                       ` Konrad Rzeszutek Wilk
2014-12-18 10:49                                     ` Jan Beulich
2014-12-18 10:51                                       ` Ian Campbell
2014-12-19 12:36                                     ` Philipp Hahn
2015-01-06  7:19                                       ` Philipp Hahn
2015-03-12 12:08                                         ` Philipp Hahn
2015-03-12 18:17                                           ` Oleg Nesterov
2015-03-12 21:57                                             ` Philipp Hahn
2014-12-16 12:04                           ` Philipp Hahn
