From: "Michael S. Tsirkin" <mst@redhat.com>
To: Laine Stump <laine@redhat.com>
Cc: pkrempa@redhat.com, berrange@redhat.com, ehabkost@redhat.com,
	aadam@redhat.com, qemu-devel@nongnu.org,
	Jens Freimann <jfreimann@redhat.com>,
	ailan@redhat.com
Subject: Re: [Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices
Date: Tue, 11 Jun 2019 11:51:27 -0400
Message-ID: <20190611115009-mutt-send-email-mst@kernel.org>
In-Reply-To: <646d0bf1-2fbb-1adb-d5d3-3ef3944376b5@redhat.com>

On Tue, Jun 11, 2019 at 11:42:54AM -0400, Laine Stump wrote:
> On 5/17/19 8:58 AM, Jens Freimann wrote:
> > This is another attempt at implementing the host side of the
> > net_failover concept
> > (https://www.kernel.org/doc/html/latest/networking/net_failover.html)
> > 
> > Changes since last RFC:
> > - work around the circular dependency of command-line options. Just add
> >    failover=on to the virtio-net standby options and reference it from the
> >    primary (vfio-pci) device with standby=<id>
> > - add patch 3/4 to allow migration of a vfio-pci device when it is part of
> >    a failover pair, while still disallowing it for all other devices
> > - add patch 4/4 to allow unplug of devices during migration, making an
> >    exception for failover primary devices. I'd like feedback on how to
> >    solve this more elegantly. I added a boolean to DeviceState and have it
> >    default to false for all devices except primary devices.
> > - not tested yet with surprise removal
> > - I don't expect this to go in as it is; it still needs more testing, but
> >    I'd like to get feedback on the above-mentioned changes.
> > 
> > The general idea is that we have a pair of devices, a vfio-pci and an
> > emulated device. Before migration the vfio device is unplugged and data
> > flows to the emulated device; on the target side another vfio-pci device
> > is plugged in to take over the data path. In the guest the net_failover
> > module will pair net devices with the same MAC address.
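
(Side note, as a quick sanity check in the guest: assuming the 3-netdev model
described in the net_failover documentation linked above, the pairing should
show up as lower_* links on the failover master device, for example

    ls -d /sys/class/net/*/lower_*

will list the standby and primary slaves. Interface names are guest specific.)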
> > 
> > * In the first patch the infrastructure for hiding the device is added
> >    for the qbus and qdev APIs.
> > 
> > * In the second patch virtio-net uses the API to defer adding the vfio
> >    device until the VIRTIO_NET_F_STANDBY feature is acked.
> > 
> > Previous discussion:
> >    RFC v1 https://patchwork.ozlabs.org/cover/989098/
> >    RFC v2 https://www.mail-archive.com/qemu-devel@nongnu.org/msg606906.html
> > 
> > To summarize concerns/feedback from previous discussion:
> > 1. The guest OS can reject or, worse, _delay_ unplug by any amount of time.
> >    Migration might get stuck for an unpredictable time with no clear reason.
> >    This approach combines two tricky things, hotplug/unplug and migration.
> >    -> We can surprise-remove the PCI device and in QEMU we can do all
> >       necessary rollbacks transparently to management software. Will it be
> >       easy? Probably not.
> > 2. PCI devices are a precious resource. Rather than hiding it in QEMU,
> >    the primary device should never be added to QEMU if it won't be used
> >    by the guest.
> >    -> We only hotplug the device once the standby feature bit has been
> >       negotiated. We save the device cmdline options until we need them
> >       for qdev_device_add().
> >       Hiding a device can be a useful concept to model. For example, a
> >       PCI device in a powered-off slot could be marked as hidden until
> >       the slot is powered on (mst).
> > 3. Management layer software should handle this. OpenStack already has
> >    components/code to handle unplugging/replugging VFIO devices and
> >    metadata to provide to the guest for detecting which devices should
> >    be paired.
> >    -> An approach that includes all software from firmware up to
> >       higher-level management software hasn't been tried in recent years.
> >       This is an attempt to keep it simple and contained in QEMU as much
> >       as possible.
> > 4. Hotplugging a device and then making it part of a failover setup is
> >    not possible.
> >    -> Addressed by extending the qdev hotplug functions to check for the
> >       hidden attribute, so that e.g. device_add can be used to plug a
> >       device.
> > 
> > 
> > I have tested this with an mlx5 NIC and was able to migrate the VM with
> > the above-mentioned workarounds for the open problems.
> > 
> > Command line example:
> > 
> > qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
> >          -machine q35,kernel-irqchip=split -cpu host   \
> >          -k fr   \
> >          -serial stdio   \
> >          -net none \
> >          -qmp unix:/tmp/qmp.socket,server,nowait \
> >          -monitor telnet:127.0.0.1:5555,server,nowait \
> >          -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
> >          -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
> >          -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
> >          -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
> >          -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,failover=on \
> >          /root/rhel-guest-image-8.0-1781.x86_64.qcow2
> > 
> > Then the primary device can be hotplugged via
> >   (qemu) device_add vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1
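
(For reference, the QMP equivalent of that HMP command should look roughly
like the following; untested, with the property values simply copied from the
example above:

    { "execute": "device_add",
      "arguments": { "driver": "vfio-pci", "host": "5e:00.2",
                     "id": "hostdev0", "bus": "root1", "standby": "net1" } }

which is presumably closer to what libvirt would issue.)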
> 
> 
> I guess this is the commandline on the migration destination, and as far as
> I understand from this example, on the destination we (meaning libvirt or a
> higher-level management application) must *not* include the assigned device
> on the qemu commandline, but must instead hotplug the device later after the
> guest CPUs have been restarted on the destination.
> 
> So if I'm understanding correctly, the idea is that on the migration source,
> the device may have been hotplugged, or may have been included when qemu was
> originally started. Then qemu automatically handles the unplug of the device
> on the source, but it seems qemu does nothing on the destination, leaving
> that up to libvirt or a higher layer to implement.

Good point. I don't see why it would not work just as well
with the device present straight away.

Did I miss something?

I think Jens was just testing local-machine migration,
and of course you can only assign a device to one VM at a time.
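
Untested sketch of what I mean, reusing the options from Jens' example: the
primary is simply listed on the destination command line instead of being
hotplugged later, and (if I read patch 2 right) stays hidden until
VIRTIO_NET_F_STANDBY is acked.

    qemu-system-x86_64 ... \
             -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,failover=on \
             -device vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1 \
             ...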

> Then in order for this to work, libvirt (or OpenStack or oVirt or whoever)
> needs to understand that the device in the libvirt config (it will still be
> in the libvirt config, since from libvirt's POV it hasn't been unplugged):
> 
> 1) shouldn't be included in the qemu commandline on the destination,
> 
> 2) will almost surely need to be replaced with a different device on the
> destination (since it's almost certain that the destination won't have an
> available device at the same PCI address)
> 
> 3) will probably need to be unbound from the VF net driver (does this need
> to happen before migration is finished? If we want to lower the probability
> of a failure after we're already committed to the migration, then I think we
> must, but libvirt isn't set up for that in any way).
> 
> 4) will need to be hotplugged after the migration has finished *and* after
> the guest CPUs have been restarted on the destination.
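
Regarding (3), for what it's worth, the manual re-bind on the destination
host usually comes down to a couple of sysfs writes; a rough sketch, with a
purely hypothetical destination VF address:

    BDF=0000:5e:00.2   # hypothetical destination VF
    # detach the VF from its current (netdev) driver
    echo "$BDF" > /sys/bus/pci/devices/$BDF/driver/unbind
    # make vfio-pci pick it up on the next probe
    echo vfio-pci > /sys/bus/pci/devices/$BDF/driver_override
    echo "$BDF" > /sys/bus/pci/drivers_probe

Who drives this, and when, is of course exactly the open question.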
> 
> 
> While it will be possible to ensure that there is a destination device, and
> to replace the old device with new in the config (and maybe, either with
> some major reworking of device assignment code, or offloading the
> responsibility to the management application(s), possible to re-bind the
> device to the vfio-pci driver), prior to marking the migration as
> "successful" (thus committing to running it on the destination), we can't
> say as much for actually assigning the device. So if the assignment fails,
> then what happens?
> 
> 
> So a few issues I see that will need to be solved by [someone] (apparently
> either libvirt or management):
> 
> a) there isn't anything in libvirt's XML grammar that allows us to signify a
> device that is "present in the config but shouldn't be included in the
> commandline"
> 
> b) someone will need to replace the device from the source with an
> equivalent device on the destination in the libvirt XML. There are other
> cases of management modifying the XML during migration (I think), but this
> does point out that putting the "auto-unplug" code into qemu isn't turning
> this into a trivial operation.
> 
> c) there is nothing in libvirt's migration logic that can cause a device to
> be re-bound to vfio-pci prior to completion of a migration. Unless this is
> added to libvirt (or the re-bind operation is passed off to the management
> application), we will need to live with the possibility that hotplugging the
> device will fail due to a failed re-bind *after* we've committed to the
> migration.
> 
> d) once the guest CPUs are restarted on the destination, [someone] (libvirt
> or management) needs to hotplug the new device on the destination. (I'm
> guessing that a hotplug can only be done while the guest CPUs are running;
> correct me if this is wrong!)
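
Regarding (d), as an illustration only (the destination host BDF below is
hypothetical, and 'info migrate' is just one way to check for completion),
the destination-side sequence would be roughly:

    # on the destination monitor, once migration has completed
    # and the guest CPUs are running again (BDF is hypothetical)
    (qemu) info migrate
    (qemu) device_add vfio-pci,host=af:00.2,id=hostdev0,bus=root1,standby=net1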
> 
> This sounds like a lot of complexity for something that was supposed to be
> handled completely/transparently by qemu :-P.


Thread overview: 77+ messages
2019-05-17 12:58 [Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices Jens Freimann
2019-05-17 12:58 ` [Qemu-devel] [PATCH 1/4] migration: allow unplug during migration for failover devices Jens Freimann
2019-05-21  9:33   ` Dr. David Alan Gilbert
2019-05-21  9:47     ` Daniel P. Berrangé
2019-05-23  8:01     ` Jens Freimann
2019-05-23 15:37       ` Dr. David Alan Gilbert
2019-05-17 12:58 ` [Qemu-devel] [PATCH 2/4] qdev/qbus: Add hidden device support Jens Freimann
2019-05-21 11:33   ` Michael S. Tsirkin
2019-05-17 12:58 ` [Qemu-devel] [PATCH 3/4] net/virtio: add failover support Jens Freimann
2019-05-21  9:45   ` Dr. David Alan Gilbert
2019-05-30 14:56     ` Jens Freimann
2019-05-30 17:46       ` Michael S. Tsirkin
2019-05-30 18:00         ` Dr. David Alan Gilbert
2019-05-30 18:09           ` Michael S. Tsirkin
2019-05-30 18:22             ` Eduardo Habkost
2019-05-30 23:06               ` Michael S. Tsirkin
2019-05-31 17:01                 ` Eduardo Habkost
2019-05-31 18:04                   ` Michael S. Tsirkin
2019-05-31 18:42                     ` Eduardo Habkost
2019-05-31 18:45                     ` Dr. David Alan Gilbert
2019-05-31 20:29                       ` Alex Williamson
2019-05-31 21:05                         ` Michael S. Tsirkin
2019-05-31 21:59                           ` Eduardo Habkost
2019-06-03  8:59                         ` Dr. David Alan Gilbert
2019-05-31 20:43                       ` Michael S. Tsirkin
2019-05-31 21:03                         ` Eduardo Habkost
2019-06-03  8:06                         ` Dr. David Alan Gilbert
2019-05-30 19:08             ` Dr. David Alan Gilbert
2019-05-30 19:21               ` Michael S. Tsirkin
2019-05-31  8:23                 ` Dr. David Alan Gilbert
2019-06-05 15:23             ` Daniel P. Berrangé
2019-05-30 18:17           ` Eduardo Habkost
2019-05-30 19:09       ` Dr. David Alan Gilbert
2019-05-31 21:47       ` Eduardo Habkost
2019-06-03  8:24         ` Jens Freimann
2019-06-03  9:26           ` Jens Freimann
2019-06-03 18:10           ` Laine Stump
2019-06-03 18:46             ` Alex Williamson
2019-06-05 15:20               ` Daniel P. Berrangé
2019-06-06 15:00               ` Roman Kagan
2019-06-03 19:36           ` Eduardo Habkost
2019-06-04 13:43             ` Jens Freimann
2019-06-04 14:09               ` Eduardo Habkost
2019-06-04 17:06               ` Michael S. Tsirkin
2019-06-04 19:00                 ` Dr. David Alan Gilbert
2019-06-07 14:14                   ` Jens Freimann
2019-06-07 14:32                     ` Michael S. Tsirkin
2019-06-07 17:51                     ` Dr. David Alan Gilbert
2019-06-05 14:36               ` Daniel P. Berrangé
2019-06-05 16:04               ` Laine Stump
2019-06-05 16:19                 ` Daniel P. Berrangé
2019-05-17 12:58 ` [Qemu-devel] [PATCH 4/4] vfio/pci: unplug failover primary device before migration Jens Freimann
2019-05-20 22:56 ` [Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices Alex Williamson
2019-05-21  7:21   ` Jens Freimann
2019-05-21 11:37     ` Michael S. Tsirkin
2019-05-21 18:49       ` Jens Freimann
2019-05-29  0:14         ` si-wei liu
2019-05-29  2:54           ` Michael S. Tsirkin
2019-06-03 18:06             ` Laine Stump
2019-06-03 18:12               ` Michael S. Tsirkin
2019-06-03 18:18                 ` Laine Stump
2019-06-06 21:49                   ` Michael S. Tsirkin
2019-05-29  2:40         ` Michael S. Tsirkin
2019-05-29  7:48           ` Jens Freimann
2019-05-30 18:12             ` Michael S. Tsirkin
2019-05-31 15:12               ` Jens Freimann
2019-05-21 14:18     ` Alex Williamson
2019-05-21  8:37 ` Daniel P. Berrangé
2019-05-21 10:10 ` Michael S. Tsirkin
2019-05-21 19:17   ` Jens Freimann
2019-05-21 21:43     ` Michael S. Tsirkin
2019-06-11 15:42 ` Laine Stump
2019-06-11 15:51   ` Michael S. Tsirkin [this message]
2019-06-11 16:12     ` Laine Stump
2019-06-12  9:11   ` Daniel P. Berrangé
2019-06-12 11:59     ` Jens Freimann
2019-06-12 15:54       ` Laine Stump
