qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Lukas Straub <lukasstraub2@web.de>, thuth@redhat.com
Cc: Kevin Wolf <kwolf@redhat.com>, Alberto Garcia <berto@igalia.com>,
	Jason Wang <jasowang@redhat.com>,
	qemu-devel <qemu-devel@nongnu.org>, Max Reitz <mreitz@redhat.com>,
	"Zhang, Chen" <chen.zhang@intel.com>
Subject: Re: [PATCH 0/4] colo: Introduce resource agent and high-level test
Date: Wed, 18 Dec 2019 19:46:25 +0000	[thread overview]
Message-ID: <20191218194625.GV3707@work-vm> (raw)
In-Reply-To: <20191218102711.19e321ac@luklap>

* Lukas Straub (lukasstraub2@web.de) wrote:
> On Wed, 27 Nov 2019 22:11:34 +0100
> Lukas Straub <lukasstraub2@web.de> wrote:
> 
> > On Fri, 22 Nov 2019 09:46:46 +0000
> > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> >
> > > * Lukas Straub (lukasstraub2@web.de) wrote:
> > > > Hello Everyone,
> > > > These patches introduce a resource agent for use with the Pacemaker CRM and a
> > > > high-level test utilizing it for testing qemu COLO.
> > > >
> > > > The resource agent manages qemu COLO including continuous replication.
> > > >
> > > > Currently the second test case (where the peer qemu is frozen) fails on primary
> > > > failover, because qemu hangs while removing the replication related block nodes.
> > > > Note that this also happens in real world test when cutting power to the peer
> > > > host, so this needs to be fixed.
> > >
> > > Do you understand why that happens? Is this it's trying to finish a
> > > read/write to the dead partner?
> > >
> > > Dave
> >
> > I haven't looked into it too closely yet, but it's often hanging in bdrv_flush()
> > while removing the replication blockdev and of course thats probably because the
> > nbd client waits for a reply. So I tried with the workaround below, which will
> > actively kill the TCP connection and with it the test passes, though I haven't
> > tested it in real world yet.
> >
> 
> In the real cluster, sometimes qemu even hangs while connecting to qmp (after remote
> poweroff). But I currently don't have the time to look into it.

That doesn't surprise me too much; QMP is mostly handled in the main
thread, as are a lot of other things; hanging in COLO has been my
assumption for a while because of that.  However, there's a way to fix
it.

A while ago, Peter Xu added a feature called 'out of band' to QMP; you
can open a QMP connection, set the OOB feature, and then commands that
are marked as OOB are executed off the main thread on  that connection.

At the moment we've just got the one real OOB command, 'migrate-recover'
which is used for recovering postcopy from a similar failure to the COLO
case.

To fix this you'd have to convert colo-lost-heartbeat to be an OOB
command; note it's not that trivial, because you have to make sure the
code that's run as part of the OOB command doesn't take any locks that
could block on something in the main thread; so it can set flags, start
new threads, perhaps call shutdown() on a socket; but it takes some
thinking about.


> Still a failing test is better than no test. Could we mark this test as known-bad and
> fix this issue later? How should I mark it as known-bad? By tag? Or warn in the log?

Not sure of that; cc'ing Maybe thuth knows?

Dave

> Regards,
> Lukas Straub
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



      reply	other threads:[~2019-12-18 19:47 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-21 17:49 [PATCH 0/4] colo: Introduce resource agent and high-level test Lukas Straub
2019-11-21 17:49 ` [PATCH 1/4] block/quorum.c: stable children names Lukas Straub
2019-11-21 18:04   ` Eric Blake
2019-11-21 18:34     ` Lukas Straub
2019-11-26 14:21       ` Alberto Garcia
2019-11-27 21:20         ` Lukas Straub
2019-11-21 17:49 ` [PATCH 2/4] colo: Introduce resource agent Lukas Straub
2019-11-21 17:49 ` [PATCH 3/4] colo: Introduce high-level test Lukas Straub
2019-11-21 17:49 ` [PATCH 4/4] MAINTAINERS: Add myself as maintainer for COLO resource agent Lukas Straub
2019-11-22  9:46 ` [PATCH 0/4] colo: Introduce resource agent and high-level test Dr. David Alan Gilbert
2019-11-27 21:11   ` Lukas Straub
2019-12-18  9:27     ` Lukas Straub
2019-12-18 19:46       ` Dr. David Alan Gilbert [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191218194625.GV3707@work-vm \
    --to=dgilbert@redhat.com \
    --cc=berto@igalia.com \
    --cc=chen.zhang@intel.com \
    --cc=jasowang@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=lukasstraub2@web.de \
    --cc=mreitz@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=thuth@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).