From: Ladi Prosek <lprosek@redhat.com>
To: "Fernando Casas Schössow" <casasfernando@hotmail.com>
Cc: "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>
Subject: Re: [Qemu-devel] Guest unresponsive after Virtqueue size exceeded error
Date: Tue, 20 Jun 2017 09:52:57 +0200
Message-ID: <CABdb7372Pv3hR4S8C0TCV19u9XAKn=y0VmreGuZavhOjqRrypQ@mail.gmail.com>
In-Reply-To: <HE1PR1001MB1371F7874B487B0907C42A7EB7C50@HE1PR1001MB1371.EURPRD10.PROD.OUTLOOK.COM>

On Tue, Jun 20, 2017 at 8:30 AM, Fernando Casas Schössow
<casasfernando@hotmail.com> wrote:
> Hi Ladi,
>
> In this case both guests are CentOS 7.3 running the same kernel
> 3.10.0-514.21.1.
> Also the guest that fails most frequently is running Docker with 4 or 5
> containers.
>
> Another thing I would like to mention is that the host is running on
> Alpine's default grsec patched kernel. I have the option to install also a
> vanilla kernel. Would it make sense to switch to the vanilla kernel on the
> host and see if that helps?

The host kernel is less likely to be responsible for this, in my
opinion. I'd hold off on that for now.

> And last but not least KSM is enabled on the host. Should I disable it?

Could be worth a try.
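
For reference, KSM on Linux is controlled through the sysfs file /sys/kernel/mm/ksm/run (0 stops ksmd but keeps already merged pages, 1 runs it, 2 stops it and unmerges all pages). A minimal sketch of reading and changing that setting from C follows; the helper names ksm_read_run and ksm_write_run are made up for illustration, and writing to the real file requires root on the host.

```c
#include <assert.h>
#include <stdio.h>

/* Standard KSM control file on Linux hosts. */
#define KSM_RUN_PATH "/sys/kernel/mm/ksm/run"

/* Hypothetical helper: returns the integer stored in the control file,
 * or -1 if it cannot be opened or parsed. */
static int ksm_read_run(const char *path)
{
    assert(path != NULL);
    FILE *f = fopen(path, "r");
    if (!f) {
        return -1;
    }
    int value = -1;
    if (fscanf(f, "%d", &value) != 1) {
        value = -1;
    }
    fclose(f);
    return value;
}

/* Hypothetical helper: writes the requested mode (0, 1 or 2).
 * Returns 0 on success, -1 on failure. */
static int ksm_write_run(const char *path, int value)
{
    assert(path != NULL);
    FILE *f = fopen(path, "w");
    if (!f) {
        return -1;
    }
    int ok = fprintf(f, "%d\n", value) > 0;
    if (fclose(f) != 0) {
        ok = 0;
    }
    return ok ? 0 : -1;
}
```

Calling ksm_write_run(KSM_RUN_PATH, 2) as root would stop KSM and unmerge the shared pages, which is probably the more useful setting for ruling KSM out, since 0 leaves existing merged pages in place.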

> Following your advice I will run memtest on the host and report back. Just
> as a side comment, the host is running on ECC memory.

I see.

Would it be possible for you, once a guest is in the broken state, to
make it available for debugging, for example by attaching gdb to the
QEMU process and letting me poke around in it remotely? Thanks!

> Thanks for all your help.
>
> Fer.
>
> On Tue, Jun 20, 2017 at 7:59, Ladi Prosek <lprosek@redhat.com> wrote:
>
> Hi Fernando,
>
> On Tue, Jun 20, 2017 at 12:10 AM, Fernando Casas Schössow
> <casasfernando@hotmail.com> wrote:
>
> Hi Ladi,
>
> Today two guests failed again at different times of day. One of them was
> the one I switched from virtio_blk to virtio_scsi, so this change didn't
> solve the problem. Now in this guest I also disabled virtio_balloon,
> continuing with the elimination process.
>
> Also, this time I found a different error message in the guest console.
> In the guest already switched to virtio_scsi:
>
>   virtio_scsi virtio2: request:id 44 is not a head!
>
> Followed by the usual "task blocked for more than 120 seconds." error.
>
> On the guest still running on virtio_blk the error was similar:
>
>   virtio_blk virtio2: req.0:id 42 is not a head!
>   blk_update_request: I/O error, dev vda, sector 645657736
>   Buffer I/O error on dev dm-1, logical block 7413821, lost async page write
>
> Followed by the usual "task blocked for more than 120 seconds." error.
>
> Honestly, this is starting to look more and more like memory corruption.
> With two different virtio devices and two different guest operating
> systems affected, it would have to be a bug in the common virtio code, and
> we would have seen it somewhere else already. Would it be possible to run
> a thorough memtest on the host just in case?
>
> Do you think that the blk_update_request and the buffer I/O error may be a
> consequence of the previous "is not a head!" error or should I be worried
> about a storage level issue here?
>
> Now I will wait to see if disabling virtio_balloon helps or not and report
> back.
>
> Thanks.
>
> Fer
>
> On Fri, Jun 16, 2017 at 12:25, Ladi Prosek <lprosek@redhat.com> wrote:
>
> On Fri, Jun 16, 2017 at 12:11 PM, Fernando Casas Schössow
> <casasfernando@hotmail.com> wrote:
>
> Hi Ladi,
>
> Thanks a lot for looking into this and replying. I will do my best to
> rebuild and deploy Alpine's qemu packages with this patch included but I'm
> not sure it's feasible yet. In any case, would it be possible to have this
> patch included in the next qemu release?
>
> Yes, I have already added this to my todo list. The current error message
> is helpful but knowing which device was involved will be much more helpful.
>
> Regarding the environment, I'm not doing migrations and only managed save
> is done in case the host needs to be rebooted or shut down. The QEMU
> process has been running the VM since the host was started and this failure
> is occurring randomly without any previous managed save done. As part of
> troubleshooting, on one of the guests I switched from virtio_blk to
> virtio_scsi for the guest disks but I will need more time to see if that
> helped. If I have this problem again I will follow your advice and remove
> virtio_balloon.
>
> Thanks, please keep us posted.
>
> Another question: is there any way to monitor the virtqueue size either
> from the guest itself or from the host? Any file in sysfs or proc? This may
> help to understand in which conditions this is happening and to react
> faster to mitigate the problem.
>
> The problem is not in the virtqueue size but in one piece of internal state
> ("inuse") which is meant to track the number of buffers "checked out" by
> QEMU. It's being compared to virtqueue size merely as a sanity check. I'm
> afraid that there's no way to expose this variable without rebuilding QEMU.
> The best you could do is attach gdb to the QEMU process and use some clever
> data access breakpoints to catch suspicious writes to the variable.
> Although it's likely that it just creeps up slowly and you won't see
> anything interesting. It's probably beyond reasonable at this point anyway.
>
> I would continue with the elimination process (virtio_scsi instead of
> virtio_blk, no balloon, etc.) and then maybe once we know which device it
> is, we can add some instrumentation to the code.
>
> Thanks again for your help with this!
>
> Fer
>
> On Fri, Jun 16, 2017 at 8:58, Ladi Prosek <lprosek@redhat.com> wrote:
>
> Hi,
>
> Would you be able to enhance the error message and rebuild QEMU?
>
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -856,7 +856,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
>      max = vq->vring.num;
>
>      if (vq->inuse >= vq->vring.num) {
> -        virtio_error(vdev, "Virtqueue size exceeded");
> +        virtio_error(vdev, "Virtqueue %u device %s size exceeded", vq->queue_index, vdev->name);
>          goto done;
>      }
>
> This would at least confirm the theory that it's caused by virtio-blk-pci.
> If rebuilding is not feasible I would start by removing other virtio
> devices -- particularly balloon, which has had quite a few virtio related
> bugs fixed recently.
>
> Does your environment involve VM migrations or saving/resuming, or does the
> crashing QEMU process always run the VM from its boot?
>
> Thanks!
>
>
>
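
To make the "inuse" sanity check from the quoted discussion concrete, here is a simplified toy model of the bookkeeping. It is a sketch only and not QEMU's actual code: the names ToyVirtQueue, toy_pop and toy_push are hypothetical, and the real virtqueue_pop() does much more than count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy model only: hypothetical names, not QEMU's real types. */
typedef struct {
    unsigned int num;    /* virtqueue size (number of ring entries) */
    unsigned int inuse;  /* buffers currently "checked out" by the device */
} ToyVirtQueue;

/* Check a buffer out. Fails when more buffers appear to be in use than the
 * ring can hold, which is the condition behind "Virtqueue size exceeded". */
static bool toy_pop(ToyVirtQueue *vq)
{
    assert(vq != NULL);
    if (vq->inuse >= vq->num) {
        fprintf(stderr, "Virtqueue size exceeded (inuse=%u, num=%u)\n",
                vq->inuse, vq->num);
        return false;
    }
    vq->inuse++;
    return true;
}

/* Hand a completed buffer back, releasing one slot. */
static void toy_push(ToyVirtQueue *vq)
{
    assert(vq != NULL && vq->inuse > 0);
    vq->inuse--;
}
```

In this model the error fires either when buffers are popped without ever being pushed back, so the counter creeps up, or when the counter itself is overwritten with a bogus value, which is why the check says little about the actual ring size.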
