From: "Durrant, Paul" <pdurrant@amazon.com>
To: Julien Grall <julien@xen.org>, Ian Jackson <ian.jackson@citrix.com>
Cc: "Jürgen Groß" <jgross@suse.com>,
	"xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>,
	"Stefano Stabellini" <sstabellini@kernel.org>,
	"osstest service owner" <osstest-admin@xenproject.org>,
	"Anthony Perard" <anthony.perard@citrix.com>
Subject: Re: [Xen-devel] [xen-4.13-testing test] 144736: regressions - FAIL
Date: Fri, 13 Dec 2019 15:55:04 +0000	[thread overview]
Message-ID: <a65ae7dca64f4f718f116b9174893730@EX13D32EUC003.ant.amazon.com> (raw)
In-Reply-To: <7a0ef296-eb50-fbda-63e2-8d890fad5111@xen.org>

> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of
> Julien Grall
> Sent: 13 December 2019 15:37
> To: Ian Jackson <ian.jackson@citrix.com>
> Cc: Jürgen Groß <jgross@suse.com>; xen-devel@lists.xenproject.org; Stefano Stabellini <sstabellini@kernel.org>; osstest service owner <osstest-admin@xenproject.org>; Anthony Perard <anthony.perard@citrix.com>
> Subject: Re: [Xen-devel] [xen-4.13-testing test] 144736: regressions - FAIL
> 
> +Anthony
> 
> On 13/12/2019 11:40, Ian Jackson wrote:
> > Julien Grall writes ("Re: [Xen-devel] [xen-4.13-testing test] 144736: regressions - FAIL"):
> >> AMD Seattle boards (laxton*) are known to fail to boot from time to
> >> time because of a PCI training issue. We have a workaround for it
> >> (involving a longer power cycle) but it is not 100% reliable.
> >
> > This wasn't a power cycle.  It was a software-initiated reboot.  It
> > does appear to hang in the firmware somewhere.  Do we expect the PCI
> > training issue to occur in this case?
> 
> The PCI training happens on every reset (including software resets), so
> I may have confused the workaround for firmware corruption with the one
> for PCI training. We definitely have a workaround for the former.
> 
> For the latter, I can't remember whether we used new firmware or just
> hoped it would not happen often.
> 
> I think we had a thread on infra@ about the workaround sometime last
> year. Sadly it was sent to my Arm e-mail address and I didn't archive
> it before leaving :(. Could you have a look and see if you can find the
> thread?
> 
> >
> >>>>    test-armhf-armhf-xl-vhd      18 leak-check/check         fail REGR. vs. 144673
> >>>
> >>> That one is strange. A qemu process seems to have died producing
> >>> a core file, but I couldn't find any log containing any other
> >>> indication of a crashed program.
> >>
> >> I haven't found anything interesting in the log. @Ian could you set up
> >> a repro for this?
> >
> > There is some heisenbug where qemu crashes with very low probability.
> > (I forget whether only on arm or on x86 too).  This has been around
> > for a little while.  I doubt this particular failure will be
> > reproducible.
> 
> I can't remember such a bug being reported on Arm before. Anyway, I managed
> to get the stack trace from gdb:
> 
> Core was generated by `/usr/local/lib/xen/bin/qemu-system-i386 -xen-domid 1 -chardev socket,id=libxl-c'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x006342be in xen_block_handle_requests (dataplane=0x108e600) at /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/hw/block/dataplane/xen-block.c:531
> 531	/home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/hw/block/dataplane/xen-block.c: No such file or directory.
> [Current thread is 1 (LWP 1987)]
> (gdb) bt
> #0  0x006342be in xen_block_handle_requests (dataplane=0x108e600) at /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/hw/block/dataplane/xen-block.c:531
> #1  0x0063447c in xen_block_dataplane_event (opaque=0x108e600) at /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/hw/block/dataplane/xen-block.c:626
> #2  0x008d005c in xen_device_poll (opaque=0x107a3b0) at /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/hw/xen/xen-bus.c:1077
> #3  0x00a4175c in run_poll_handlers_once (ctx=0x1079708, timeout=0xb1ba17f8) at /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/util/aio-posix.c:520
> #4  0x00a41826 in run_poll_handlers (ctx=0x1079708, max_ns=8000, timeout=0xb1ba17f8) at /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/util/aio-posix.c:562
> #5  0x00a41956 in try_poll_mode (ctx=0x1079708, timeout=0xb1ba17f8) at /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/util/aio-posix.c:597
> #6  0x00a41a2c in aio_poll (ctx=0x1079708, blocking=true) at /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/util/aio-posix.c:639
> #7  0x0071dc16 in iothread_run (opaque=0x107d328) at /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/iothread.c:75
> #8  0x00a44c80 in qemu_thread_start (args=0x1079538) at /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/util/qemu-thread-posix.c:502
> #9  0xb67ae5d8 in ?? ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> 
> This feels like a race condition between the init/free code and the
> handler. Anthony, does it ring any bells?
> 

From that backtrace it looks like an iothread managed to run after the sring was NULLed. This should not be able to happen, as the dataplane should have been moved back onto QEMU's main thread context before the ring is unmapped; see the sketch below.
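
For illustration, here is a minimal standalone sketch of that ordering constraint, using plain pthreads rather than QEMU's AioContext/IOThread machinery (all of the names below are invented for the sketch; they are not QEMU's actual API):

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdlib.h>

  struct dataplane {
      int *sring;             /* stands in for the mapped shared ring */
      atomic_bool detached;   /* set once the poll handler is unhooked */
  };

  /* Stand-in for xen_block_handle_requests(): the IOThread-side poll
   * handler, which touches the shared ring on every iteration. */
  static void *iothread_poll(void *opaque)
  {
      struct dataplane *d = opaque;

      while (!atomic_load(&d->detached)) {
          /* If the teardown path unmaps/NULLs sring before 'detached'
           * is set, this dereference is the SIGSEGV in frame #0. */
          (void)*(volatile int *)d->sring;
      }
      return NULL;
  }

  int main(void)
  {
      struct dataplane d = { .sring = calloc(1, sizeof(*d.sring)) };
      pthread_t t;

      pthread_create(&t, NULL, iothread_poll, &d);

      /* Teardown in the order described above: detach the handler and
       * quiesce the IOThread first, then tear down the ring.  Swapping
       * these two steps reproduces the crash. */
      atomic_store(&d.detached, true); /* ~ move back to main context */
      pthread_join(t, NULL);           /* ~ quiesce the IOThread */
      free(d.sring);                   /* ~ unmap the shared ring */
      d.sring = NULL;

      return 0;
  }

The invariant is the same in QEMU proper: whatever plays the role of "detach and quiesce" must complete before the ring is torn down, otherwise the poll handler can race with the free path exactly as Julien suggests.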

  Paul

> Cheers,
> 
> --
> Julien Grall
> 

Thread overview: 16+ messages
2019-12-12 22:35 [Xen-devel] [xen-4.13-testing test] 144736: regressions - FAIL osstest service owner
2019-12-13  8:31 ` Jürgen Groß
2019-12-13 11:14   ` Julien Grall
2019-12-13 11:24     ` Jürgen Groß
2019-12-13 11:28       ` Julien Grall
2019-12-13 11:40     ` Ian Jackson
2019-12-13 15:36       ` Julien Grall
2019-12-13 15:55         ` Durrant, Paul [this message]
2019-12-14  0:34           ` xen-block: race condition when stopping the device (WAS: Re: [Xen-devel] [xen-4.13-testing test] 144736: regressions - FAIL) Julien Grall
2019-12-16  9:34             ` Durrant, Paul
2019-12-16  9:50               ` Durrant, Paul
2019-12-16 10:24                 ` Durrant, Paul
