* About restoring the state in vhost-vdpa device
@ 2022-05-11 19:43 Eugenio Perez Martin
  2022-05-12  4:00   ` Jason Wang
  2022-05-13 15:08   ` Parav Pandit
  0 siblings, 2 replies; 20+ messages in thread
From: Eugenio Perez Martin @ 2022-05-11 19:43 UTC (permalink / raw)
  To: virtualization, qemu-level, Jason Wang, Cindy Lu, Parav Pandit,
	Gautam Dawar, virtio-networking, Eli Cohen, Laurent Vivier,
	Stefano Garzarella

This is a proposal to restore the state of the vhost-vdpa device at
the destination after a live migration. It uses as many available
features from both the device and qemu as possible, so we keep the
communication simple and speed up the merging process.

# Initializing a vhost-vdpa device.

Without the context of live migration, the steps to initialize the
device from vhost-vdpa at qemu startup are:
1) [vhost] Open the vdpa device, using a plain open().
2) [vhost+virtio] Get the device features. These are expected not to
change in the device's lifetime, so we can save them. Qemu issues a
VHOST_GET_FEATURES ioctl and vdpa forwards it to the backend driver
using the get_device_features() callback.
3) [vhost+virtio] Get its max_queue_pairs if _F_MQ and _F_CTRL_VQ are
offered. This is obtained using VHOST_VDPA_GET_CONFIG, and the request
is forwarded to the device using get_config. QEMU expects the device
not to change it in its lifetime.
4) [vhost] Set the vdpa status (_S_ACKNOWLEDGE, _S_DRIVER). Still no
FEATURES_OK or DRIVER_OK. The ioctl is VHOST_VDPA_SET_STATUS, and the
vdpa backend driver callback is set_status. (A sketch of these calls
follows below.)

These are the steps used to initialize the device in qemu terminology,
taking away some redundancies to make it simpler.
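
For illustration only, here is a minimal C sketch of how those four
steps might look when driven directly against the vhost-vdpa char
device. The device path, the helper name and the omitted error
handling and le16 conversion are assumptions of the sketch, not
qemu's actual code:

#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>
#include <linux/virtio_config.h>
#include <linux/virtio_net.h>

static int vdpa_device_init(const char *path, uint64_t *features,
                            uint16_t *max_queue_pairs)
{
    /* 1) open the vdpa char device, e.g. /dev/vhost-vdpa-0 */
    int fd = open(path, O_RDWR);

    /* 2) device features, forwarded to get_device_features() */
    ioctl(fd, VHOST_GET_FEATURES, features);

    /* 3) read max_virtqueue_pairs from the config space (get_config) */
    struct vhost_vdpa_config *cfg =
        calloc(1, sizeof(*cfg) + sizeof(uint16_t));
    cfg->off = offsetof(struct virtio_net_config, max_virtqueue_pairs);
    cfg->len = sizeof(uint16_t);
    ioctl(fd, VHOST_VDPA_GET_CONFIG, cfg);
    memcpy(max_queue_pairs, cfg->buf, sizeof(uint16_t));
    free(cfg);

    /* 4) _S_ACKNOWLEDGE | _S_DRIVER, but no FEATURES_OK/DRIVER_OK yet */
    uint8_t status = VIRTIO_CONFIG_S_ACKNOWLEDGE | VIRTIO_CONFIG_S_DRIVER;
    ioctl(fd, VHOST_VDPA_SET_STATUS, &status);

    return fd;
}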

Now the guest driver sets FEATURES_OK and DRIVER_OK, qemu detects it,
and so it *starts* the device.

# Starting a vhost-vdpa device

At virtio_net_vhost_status we have two important variables:
int cvq = _F_CTRL_VQ ? 1 : 0;
int queue_pairs = _F_CTRL_VQ && _F_MQ ? (max_queue_pairs of step 3) : 0;

Now comes the identification of the cvq index. Qemu *knows* that the
device will expose it as the last queue (index max_queue_pairs*2) if
_F_MQ has been acknowledged by the guest's driver, or at index 2 if
not. It cannot depend on any data sent to the device via cvq, because
we could not get the command status on such a change.
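
As a condensed restatement of that rule (the helper name is made up
for this sketch; it is not a qemu function):

#include <stdbool.h>

static unsigned cvq_index(bool mq_acked, unsigned max_queue_pairs)
{
    /* last queue if _F_MQ was acknowledged, otherwise fixed at 2 */
    return mq_acked ? max_queue_pairs * 2 : 2;
}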

Now we start the vhost device. The workflow is currently:

5) [virtio+vhost] The first step is to send the acknowledgement of the
Virtio features and vhost/vdpa backend features to the device, so it
knows how to configure itself. This is done using the same calls as
step 4 with these feature bits added.
6) [virtio] Set the size, base, addr, kick and call fd for each queue
(SET_VRING_ADDR, SET_VRING_NUM, ...; forwarded to the backend with
set_vq_address, set_vq_state, ...).
7) [vdpa] Set up the host notifiers and *send SET_VRING_ENABLE = 1*
for each queue. This is done using the VHOST_VDPA_SET_VRING_ENABLE
ioctl, and forwarded to the vdpa backend using the set_vq_ready
callback. (A per-queue sketch follows this list.)
8) [virtio + vdpa] Send memory translations & set DRIVER_OK.
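
A rough per-queue sketch of steps 6 and 7 with the vhost/vhost-vdpa
ioctls named above. Ring addresses are assumed to be already
translated, and the host notifier mapping plus error handling are
left out:

#include <sys/ioctl.h>
#include <linux/vhost.h>

static void vdpa_setup_vq(int fd, unsigned idx, unsigned num,
                          struct vhost_vring_addr *addr,
                          int kick_fd, int call_fd)
{
    /* step 6: size, base (last avail index) and ring addresses */
    struct vhost_vring_state state = { .index = idx, .num = num };
    ioctl(fd, VHOST_SET_VRING_NUM, &state);
    state.num = 0;                      /* base is 0 on a fresh start */
    ioctl(fd, VHOST_SET_VRING_BASE, &state);
    addr->index = idx;
    ioctl(fd, VHOST_SET_VRING_ADDR, addr);

    /* step 6 (cont.): kick and call eventfds */
    struct vhost_vring_file file = { .index = idx, .fd = kick_fd };
    ioctl(fd, VHOST_SET_VRING_KICK, &file);
    file.fd = call_fd;
    ioctl(fd, VHOST_SET_VRING_CALL, &file);

    /* step 7: mark the ring ready, forwarded to set_vq_ready() */
    struct vhost_vring_state enable = { .index = idx, .num = 1 };
    ioctl(fd, VHOST_VDPA_SET_VRING_ENABLE, &enable);
}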

If we follow the current workflow, the device is now allowed to start
receiving only on vq pair 0, since we still have not set the number of
queue pairs through the multiqueue command. This could cause the guest
to receive packets in unexpected queues, breaking RSS.

# Proposal

Our proposal diverges in step 7: instead of enabling *all* the
virtqueues, only enable the CVQ. After that, send the DRIVER_OK and
queue all the control commands needed to restore the device state (MQ,
RSS, ...). Once all of them have been acknowledged (the "device", or
the emulated cvq in the host vdpa backend driver, has used all the cvq
buffers), enable (SET_VRING_ENABLE, set_vq_ready) all the other
queues, as sketched below.
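
A condensed sketch of that ordering. send_cvq_restore_cmds() is a
hypothetical placeholder for queuing the MQ/RSS/... commands on the
control virtqueue and waiting for their buffers to be used; the memory
translations of step 8 are omitted and the final status bits are set
in one go for brevity:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>
#include <linux/virtio_config.h>

/* Hypothetical: pushes the state-restore commands through the CVQ and
 * waits until the device has used all of their buffers. */
void send_cvq_restore_cmds(int fd);

static void vdpa_restore_and_start(int fd, unsigned max_queue_pairs,
                                   unsigned cvq_idx)
{
    /* modified step 7: enable only the control virtqueue */
    struct vhost_vring_state en = { .index = cvq_idx, .num = 1 };
    ioctl(fd, VHOST_VDPA_SET_VRING_ENABLE, &en);

    /* step 8: DRIVER_OK so the device can process the CVQ */
    uint8_t status = VIRTIO_CONFIG_S_ACKNOWLEDGE | VIRTIO_CONFIG_S_DRIVER |
                     VIRTIO_CONFIG_S_FEATURES_OK | VIRTIO_CONFIG_S_DRIVER_OK;
    ioctl(fd, VHOST_VDPA_SET_STATUS, &status);

    /* restore MQ, RSS, filters, ... and wait for acknowledgement */
    send_cvq_restore_cmds(fd);

    /* finally enable all the data queues */
    for (unsigned i = 0; i < max_queue_pairs * 2; i++) {
        en.index = i;
        ioctl(fd, VHOST_VDPA_SET_VRING_ENABLE, &en);
    }
}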

Everything needed for this is already implemented in the kernel as far
as I can see; only a small modification is needed in qemu. This
achieves the restoring of the device state without creating a
maintenance burden.

A lot of optimizations can be applied on top without the need to add
anything to the migration protocol or the vDPA uAPI, like pre-warming
the vdpa queues or adding more capabilities to the emulated CVQ.

Other optimizations, like applying the state out of band, can also be
added so they run in parallel with the migration, but that requires a
bigger change in the qemu migration protocol, which in my opinion
would make us lose focus on achieving at least basic device migration.

Thoughts?



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: About restoring the state in vhost-vdpa device
  2022-05-11 19:43 About restoring the state in vhost-vdpa device Eugenio Perez Martin
@ 2022-05-12  4:00   ` Jason Wang
  2022-05-13 15:08   ` Parav Pandit
  1 sibling, 0 replies; 20+ messages in thread
From: Jason Wang @ 2022-05-12  4:00 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Laurent Vivier, Cindy Lu, virtio-networking, qemu-level,
	Gautam Dawar, virtualization, Eli Cohen

On Thu, May 12, 2022 at 3:44 AM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> This is a proposal to restore the state of the vhost-vdpa device at
> the destination after a live migration. It uses as many available
> features both from the device and from qemu as possible so we keep the
> communication simple and speed up the merging process.

When we finalize the design, we can formalize it in kernel Documentation/

>
> # Initializing a vhost-vdpa device.
>
> Without the context of live migration, the steps to initialize the
> device from vhost-vdpa at qemu starting are:
> 1) [vhost] Open the vdpa device, Using simply open()
> 2) [vhost+virtio] Get device features. These are expected not to
> change in the device's lifetime, so we can save them. Qemu issues a
> VHOST_GET_FEATURES ioctl and vdpa forwards to the backend driver using
> get_device_features() callback.

For "virtio" do you mean it's an action that is defined in the spec?

> 3) [vhost+virtio] Get its max_queue_pairs if _F_MQ and _F_CTRL_VQ.
> These are obtained using VHOST_VDPA_GET_CONFIG, and that request is
> forwarded to the device using get_config. QEMU expects the device to
> not change it in its lifetime.
> 4) [vhost] Vdpa set status (_S_ACKNOLEDGE, _S_DRIVER). Still no
> FEATURES_OK or DRIVER_OK. The ioctl is VHOST_VDPA_SET_STATUS, and the
> vdpa backend driver callback is set_status.
>
> These are the steps used to initialize the device in qemu terminology,
> taking away some redundancies to make it simpler.
>
> Now the driver sends the FEATURES_OK and the DRIVER_OK, and qemu
> detects it, so it *starts* the device.
>
> # Starting a vhost-vdpa device
>
> At virtio_net_vhost_status we have two important variables here:
> int cvq = _F_CTRL_VQ ? 1 : 0;
> int queue_pairs = _F_CTRL_VQ && _F_MQ ? (max_queue_pairs of step 3) : 0;
>
> Now identification of the cvq index. Qemu *know* that the device will
> expose it at the last queue (max_queue_pairs*2) if _F_MQ has been
> acknowledged by the guest's driver or 2 if not. It cannot depend on
> any data sent to the device via cvq, because we couldn't get its
> command status on a change.
>
> Now we start the vhost device. The workflow is currently:
>
> 5) [virtio+vhost] The first step is to send the acknowledgement of the
> Virtio features and vhost/vdpa backend features to the device, so it
> knows how to configure itself. This is done using the same calls as
> step 4 with these feature bits added.
> 6) [virtio] Set the size, base, addr, kick and call fd for each queue
> (SET_VRING_ADDR, SET_VRING_NUM, ...; and forwarded with
> set_vq_address, set_vq_state, ...)
> 7) [vdpa] Send host notifiers and *send SET_VRING_ENABLE = 1* for each
> queue. This is done using ioctl VHOST_VDPA_SET_VRING_ENABLE, and
> forwarded to the vdpa backend using set_vq_ready callback.
> 8) [virtio + vdpa] Send memory translations & set DRIVER_OK.
>
> If we follow the current workflow, the device is allowed now to start
> receiving only on vq pair 0, since we've still not set the multi queue
> pair. This could cause the guest to receive packets in unexpected
> queues, breaking RSS.
>
> # Proposal
>
> Our proposal diverge in step 7: Instead of enabling *all* the
> virtqueues, only enable the CVQ. After that, send the DRIVER_OK and
> queue all the control commands to restore the device status (MQ, RSS,
> ...). Once all of them have been acknowledged ("device", or emulated
> cvq in host vdpa backend driver, has used all cvq buffers, enable
> (SET_VRING_ENABLE, set_vq_ready) all other queues.
>
> Everything needed for this is already implemented in the kernel as far
> as I see, there is only a small modification in qemu needed. Thus
> achieving the restoring of the device state without creating
> maintenance burden.

Yes, one of the major motivations is to try to reuse the existing APIs
as much as possible as a start. It doesn't mean we can't invent a new
API; having a dedicated save/restore uAPI looks fine. But that looks
more like work that needs to be finalized in the virtio spec,
otherwise we may end up with code that is hard to maintain.

Thanks

>
> A lot of optimizations can be applied on top without the need to add
> stuff to the migration protocol or vDPA uAPI, like the pre-warming of
> the vdpa queues or adding more capabilities to the emulated CVQ.
>
> Other optimizations like applying the state out of band can also be
> added so they can run in parallel with the migration, but that
> requires a bigger change in qemu migration protocol making us lose
> focus on achieving at least the basic device migration in my opinion.
>
> Thoughts?
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: About restoring the state in vhost-vdpa device
  2022-05-11 19:43 About restoring the state in vhost-vdpa device Eugenio Perez Martin
@ 2022-05-13 15:08   ` Parav Pandit
  2022-05-13 15:08   ` Parav Pandit
  1 sibling, 0 replies; 20+ messages in thread
From: Parav Pandit via Virtualization @ 2022-05-13 15:08 UTC (permalink / raw)
  To: Eugenio Perez Martin, virtualization, qemu-level, Jason Wang,
	Cindy Lu, Gautam Dawar, virtio-networking, Eli Cohen,
	Laurent Vivier, Stefano Garzarella


> From: Eugenio Perez Martin <eperezma@redhat.com>
> Sent: Wednesday, May 11, 2022 3:44 PM
> 
> This is a proposal to restore the state of the vhost-vdpa device at the
> destination after a live migration. It uses as many available features both
> from the device and from qemu as possible so we keep the communication
> simple and speed up the merging process.
> 
> # Initializing a vhost-vdpa device.
> 
> Without the context of live migration, the steps to initialize the device from
> vhost-vdpa at qemu starting are:
> 1) [vhost] Open the vdpa device, Using simply open()
> 2) [vhost+virtio] Get device features. These are expected not to change in
> the device's lifetime, so we can save them. Qemu issues a
> VHOST_GET_FEATURES ioctl and vdpa forwards to the backend driver using
> get_device_features() callback.
> 3) [vhost+virtio] Get its max_queue_pairs if _F_MQ and _F_CTRL_VQ.
This should soon be replaced with a more generic num_vq interface, as
max_queue_pairs doesn't work beyond net.
There is no need to carry an ancient interface style into the newly
built vdpa stack.

> These are obtained using VHOST_VDPA_GET_CONFIG, and that request is
> forwarded to the device using get_config. QEMU expects the device to not
> change it in its lifetime.
> 4) [vhost] Vdpa set status (_S_ACKNOLEDGE, _S_DRIVER). Still no
> FEATURES_OK or DRIVER_OK. The ioctl is VHOST_VDPA_SET_STATUS, and
> the vdpa backend driver callback is set_status.
> 
> These are the steps used to initialize the device in qemu terminology, taking
> away some redundancies to make it simpler.
> 
> Now the driver sends the FEATURES_OK and the DRIVER_OK, and qemu
> detects it, so it *starts* the device.
> 
> # Starting a vhost-vdpa device
> 
> At virtio_net_vhost_status we have two important variables here:
> int cvq = _F_CTRL_VQ ? 1 : 0;
> int queue_pairs = _F_CTRL_VQ && _F_MQ ? (max_queue_pairs of step 3) :
> 0;
> 
> Now identification of the cvq index. Qemu *know* that the device will
> expose it at the last queue (max_queue_pairs*2) if _F_MQ has been
> acknowledged by the guest's driver or 2 if not. It cannot depend on any data
> sent to the device via cvq, because we couldn't get its command status on a
> change.
> 
> Now we start the vhost device. The workflow is currently:
> 
> 5) [virtio+vhost] The first step is to send the acknowledgement of the Virtio
> features and vhost/vdpa backend features to the device, so it knows how to
> configure itself. This is done using the same calls as step 4 with these feature
> bits added.
> 6) [virtio] Set the size, base, addr, kick and call fd for each queue
> (SET_VRING_ADDR, SET_VRING_NUM, ...; and forwarded with
> set_vq_address, set_vq_state, ...)
> 7) [vdpa] Send host notifiers and *send SET_VRING_ENABLE = 1* for each
> queue. This is done using ioctl VHOST_VDPA_SET_VRING_ENABLE, and
> forwarded to the vdpa backend using set_vq_ready callback.
> 8) [virtio + vdpa] Send memory translations & set DRIVER_OK.
> 
So the MQ/all-VQs setup should be done before step 8.

> If we follow the current workflow, the device is allowed now to start
> receiving only on vq pair 0, since we've still not set the multi queue pair. This
> could cause the guest to receive packets in unexpected queues, breaking
> RSS.
> 
> # Proposal
> 
> Our proposal diverge in step 7: Instead of enabling *all* the virtqueues, only
> enable the CVQ. 
Just to double check, VQ 0 and 1 of the net are also not enabled, correct?

> After that, send the DRIVER_OK and queue all the control
> commands to restore the device status (MQ, RSS, ...). Once all of them have
> been acknowledged ("device", or emulated cvq in host vdpa backend driver,
> has used all cvq buffers, enable (SET_VRING_ENABLE, set_vq_ready) all
> other queues.
> 
What is special about doing DRIVER_OK and then enqueuing the control commands?
Why can't the other configuration be applied before DRIVER_OK?

In other words,
step 7 already sets up the necessary VQ-related fields.

Before doing DRIVER_OK, what is needed is to set up any other device fields and features.
For net this includes rss, vlan and mac filters.
So, a new vdpa ioctl() should be able to set these values.
This is the ioctl() between user and kernel.
After this ioctl(), DRIVER_OK should be done, resuming the device.

The device has a full view of the config now.

This node-local device setup change should not require a migration protocol change.

This scheme will also work for non-net virtio devices.
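
(Purely as an illustration of that idea, not an existing uAPI: every
name and the ioctl number below are hypothetical.)

/* HYPOTHETICAL sketch only; no such uAPI exists today. It illustrates
 * the kind of "set device state before DRIVER_OK" ioctl described
 * above. */
#include <linux/types.h>
#include <linux/vhost.h>        /* for the VHOST_VIRTIO ioctl space */

struct vhost_vdpa_dev_state {           /* hypothetical */
	__u32 type;                     /* config space, VQ state, net filters, ... */
	__u32 len;                      /* length of the payload that follows */
	__u8  buf[];                    /* device-type-defined payload */
};

/* hypothetical ioctl, number chosen arbitrarily for this sketch */
#define VHOST_VDPA_SET_DEV_STATE _IOW(VHOST_VIRTIO, 0x7f, \
				      struct vhost_vdpa_dev_state)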

> Everything needed for this is already implemented in the kernel as far as I
> see, there is only a small modification in qemu needed. Thus achieving the
> restoring of the device state without creating maintenance burden.
> 
> A lot of optimizations can be applied on top without the need to add stuff to
> the migration protocol or vDPA uAPI, like the pre-warming of the vdpa
> queues or adding more capabilities to the emulated CVQ.
The above ioctl() will enable the vdpa subsystem to apply these settings one or more times in a pre-warming stage before DRIVER_OK.

> 
> Other optimizations like applying the state out of band can also be added so
> they can run in parallel with the migration, but that requires a bigger change
> in qemu migration protocol making us lose focus on achieving at least the
> basic device migration in my opinion.
> 
Let's strive to apply this in-band as much as possible. Applying it out of band opens up issues unrelated to migration (authentication and more).


^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: About restoring the state in vhost-vdpa device
  2022-05-13 15:08   ` Parav Pandit
  (?)
@ 2022-05-13 17:48   ` Gautam Dawar
  2022-05-13 18:25       ` Parav Pandit
  -1 siblings, 1 reply; 20+ messages in thread
From: Gautam Dawar @ 2022-05-13 17:48 UTC (permalink / raw)
  To: Parav Pandit, Eugenio Perez Martin, virtualization, qemu-level,
	Jason Wang, Cindy Lu, virtio-networking, Eli Cohen,
	Laurent Vivier, Stefano Garzarella

-----Original Message-----
From: Parav Pandit <parav@nvidia.com> 
Sent: Friday, May 13, 2022 8:39 PM
To: Eugenio Perez Martin <eperezma@redhat.com>; virtualization <virtualization@lists.linux-foundation.org>; qemu-level <qemu-devel@nongnu.org>; Jason Wang <jasowang@redhat.com>; Cindy Lu <lulu@redhat.com>; Gautam Dawar <gdawar@xilinx.com>; virtio-networking@redhat.com; Eli Cohen <elic@nvidia.com>; Laurent Vivier <lvivier@redhat.com>; Stefano Garzarella <sgarzare@redhat.com>
Subject: RE: About restoring the state in vhost-vdpa device


> From: Eugenio Perez Martin <eperezma@redhat.com>
> Sent: Wednesday, May 11, 2022 3:44 PM
> 
> This is a proposal to restore the state of the vhost-vdpa device at 
> the destination after a live migration. It uses as many available 
> features both from the device and from qemu as possible so we keep the 
> communication simple and speed up the merging process.
> 
> # Initializing a vhost-vdpa device.
> 
> Without the context of live migration, the steps to initialize the 
> device from vhost-vdpa at qemu starting are:
> 1) [vhost] Open the vdpa device, Using simply open()
> 2) [vhost+virtio] Get device features. These are expected not to 
> change in the device's lifetime, so we can save them. Qemu issues a 
> VHOST_GET_FEATURES ioctl and vdpa forwards to the backend driver using
> get_device_features() callback.
> 3) [vhost+virtio] Get its max_queue_pairs if _F_MQ and _F_CTRL_VQ.
This should be soon replaced with more generic num_vq interface as max_queue_pairs don’t, work beyond net.
There is no need to continue some ancient interface way for newly built vdpa stack.

> These are obtained using VHOST_VDPA_GET_CONFIG, and that request is 
> forwarded to the device using get_config. QEMU expects the device to 
> not change it in its lifetime.
> 4) [vhost] Vdpa set status (_S_ACKNOLEDGE, _S_DRIVER). Still no 
> FEATURES_OK or DRIVER_OK. The ioctl is VHOST_VDPA_SET_STATUS, and the 
> vdpa backend driver callback is set_status.
> 
> These are the steps used to initialize the device in qemu terminology, 
> taking away some redundancies to make it simpler.
> 
> Now the driver sends the FEATURES_OK and the DRIVER_OK, and qemu 
> detects it, so it *starts* the device.
> 
> # Starting a vhost-vdpa device
> 
> At virtio_net_vhost_status we have two important variables here:
> int cvq = _F_CTRL_VQ ? 1 : 0;
> int queue_pairs = _F_CTRL_VQ && _F_MQ ? (max_queue_pairs of step 3) :
> 0;
> 
> Now identification of the cvq index. Qemu *know* that the device will 
> expose it at the last queue (max_queue_pairs*2) if _F_MQ has been 
> acknowledged by the guest's driver or 2 if not. It cannot depend on 
> any data sent to the device via cvq, because we couldn't get its 
> command status on a change.
> 
> Now we start the vhost device. The workflow is currently:
> 
> 5) [virtio+vhost] The first step is to send the acknowledgement of the 
> Virtio features and vhost/vdpa backend features to the device, so it 
> knows how to configure itself. This is done using the same calls as 
> step 4 with these feature bits added.
> 6) [virtio] Set the size, base, addr, kick and call fd for each queue 
> (SET_VRING_ADDR, SET_VRING_NUM, ...; and forwarded with 
> set_vq_address, set_vq_state, ...)
> 7) [vdpa] Send host notifiers and *send SET_VRING_ENABLE = 1* for each 
> queue. This is done using ioctl VHOST_VDPA_SET_VRING_ENABLE, and 
> forwarded to the vdpa backend using set_vq_ready callback.
> 8) [virtio + vdpa] Send memory translations & set DRIVER_OK.
> 
So MQ all VQs setup should be set before step_8.

> If we follow the current workflow, the device is allowed now to start 
> receiving only on vq pair 0, since we've still not set the multi queue 
> pair. This could cause the guest to receive packets in unexpected 
> queues, breaking RSS.
> 
> # Proposal
> 
> Our proposal diverge in step 7: Instead of enabling *all* the 
> virtqueues, only enable the CVQ.
Just to double check, VQ 0 and 1 of the net are also not enabled, correct?
[GD>>] Yes, that's my understanding as well.

> After that, send the DRIVER_OK and queue all the control commands to 
> restore the device status (MQ, RSS, ...). Once all of them have been 
> acknowledged ("device", or emulated cvq in host vdpa backend driver, 
> has used all cvq buffers, enable (SET_VRING_ENABLE, set_vq_ready) all 
> other queues.
> 
What is special about doing DRIVER_OK and enqueuing the control commands?
Why other configuration cannot be applied before DRIVER_OK?
[GD>>] For the device to be live (and any queue to be able to pass traffic), DRIVER_OK is a must. So, any configuration can be passed over the CVQ only after it is started (vring configuration + DRIVER_OK). For an emulated queue, if the order is reversed, I think the enqueued commands will remain buffered and the device should be able to service them when it goes live.

In other words,
Step_7 already setups up the necessary VQ related fields.

Before doing driver ok, what is needed is to setup any other device fields and features.
For net this includes rss, vlan, mac filters.
So, a new vdpa ioctl() should be able to set these values.
This is the ioctl() between user and kernel.
Post this ioctl(), DRIVER_OK should be done resuming the device.

Device has full view of config now.

This node local device setup change should not require migration protocol change.

This scheme will also work for non_net virtio devices too.

> Everything needed for this is already implemented in the kernel as far 
> as I see, there is only a small modification in qemu needed. Thus 
> achieving the restoring of the device state without creating maintenance burden.
> 
> A lot of optimizations can be applied on top without the need to add 
> stuff to the migration protocol or vDPA uAPI, like the pre-warming of 
> the vdpa queues or adding more capabilities to the emulated CVQ.
Above ioctl() will enable vdpa subsystem to apply this setting one mor more times in pre-warming up stage before DRIVER_OK.

> 
> Other optimizations like applying the state out of band can also be 
> added so they can run in parallel with the migration, but that 
> requires a bigger change in qemu migration protocol making us lose 
> focus on achieving at least the basic device migration in my opinion.
> 
Let's strive to apply this in-band as much as possible. Applying out of band opens issues unrelated to migration (authentication and more).


^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: About restoring the state in vhost-vdpa device
  2022-05-13 17:48   ` Gautam Dawar
@ 2022-05-13 18:25       ` Parav Pandit
  0 siblings, 0 replies; 20+ messages in thread
From: Parav Pandit via Virtualization @ 2022-05-13 18:25 UTC (permalink / raw)
  To: Gautam Dawar, Eugenio Perez Martin, virtualization, qemu-level,
	Jason Wang, Cindy Lu, virtio-networking, Eli Cohen,
	Laurent Vivier, Stefano Garzarella

Hi Gautam,

Please fix your email client to use the right response format.
Otherwise it will be confusing for the rest of us to follow the conversation.

More below.

> From: Gautam Dawar <gdawar@xilinx.com>
> Sent: Friday, May 13, 2022 1:48 PM

> > Our proposal diverge in step 7: Instead of enabling *all* the
> > virtqueues, only enable the CVQ.
> Just to double check, VQ 0 and 1 of the net are also not enabled, correct?
> [GD>>] Yes, that's my understanding as well.
> 
> > After that, send the DRIVER_OK and queue all the control commands to
> > restore the device status (MQ, RSS, ...). Once all of them have been
> > acknowledged ("device", or emulated cvq in host vdpa backend driver,
> > has used all cvq buffers, enable (SET_VRING_ENABLE, set_vq_ready) all
> > other queues.
> >
> What is special about doing DRIVER_OK and enqueuing the control
> commands?
> Why other configuration cannot be applied before DRIVER_OK?
> [GD>>] For the device to be live (and any queue to be able to pass traffic),
> DRIVER_OK is a must. 
This applies only to a vdpa device implemented over a virtio device.
For such a use case/implementation a proper virtio spec extension is needed anyway for it to be efficient.
And when that happens, this scheme will still work.

Other vdpa devices don't have to live with this limitation at the moment.

> So, any configuration can be passed over the CVQ only
> after it is started (vring configuration + DRIVER_OK). For an emulated queue,
> if the order is reversed, I think the enqueued commands will remain
> buffered and device should be able to service them when it goes live.
I likely didn't understand what you describe above.

The VQ avail index etc. is programmed before doing DRIVER_OK anyway.

The sequence at the destination, from user to kernel, is very straightforward:
1. Set config space fields (such as virtio_net_config/virtio_blk_config).
2. Set other device attributes (max vqs, current num vqs)
3. Set net specific config (RSS, vlan, mac and more filters)
4. Set VQ fields (enable, msix, addr, avail idx)
5. Set DRIVER_OK, and the device resumes from where it left off

Steps #1 to #4 can be done multiple times in the pre-warm device bring-up case in the future.
For now, they can be done once to get things started.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: About restoring the state in vhost-vdpa device
  2022-05-13 18:25       ` Parav Pandit
  (?)
@ 2022-05-16  8:50       ` Eugenio Perez Martin
  2022-05-16 20:29           ` Parav Pandit
  -1 siblings, 1 reply; 20+ messages in thread
From: Eugenio Perez Martin @ 2022-05-16  8:50 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Gautam Dawar, virtualization, qemu-level, Jason Wang, Cindy Lu,
	virtio-networking, Eli Cohen, Laurent Vivier, Stefano Garzarella

On Fri, May 13, 2022 at 8:25 PM Parav Pandit <parav@nvidia.com> wrote:
>
> Hi Gautam,
>
> Please fix your email client to have right response format.
> Otherwise, it will be confusing for the rest and us to follow the conversation.
>
> More below.
>
> > From: Gautam Dawar <gdawar@xilinx.com>
> > Sent: Friday, May 13, 2022 1:48 PM
>
> > > Our proposal diverge in step 7: Instead of enabling *all* the
> > > virtqueues, only enable the CVQ.
> > Just to double check, VQ 0 and 1 of the net are also not enabled, correct?
> > [GD>>] Yes, that's my understanding as well.
> >

That's correct. We can say that for a queue to be enabled three things
must happen:
* DRIVER_OK (still not sent)
* VHOST_VDPA_SET_VRING_ENABLE sent for that queue (only sent for CVQ)
* If the queue is not in the first data queue pair and is not the cvq:
send VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET with a queue pair count that
includes it. (See the sketch below.)
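
Condensed into a hypothetical predicate, only to restate the rule
(this is not qemu code):

#include <stdbool.h>

static bool net_vq_usable(bool driver_ok, bool vring_enabled,
                          unsigned vq_idx, unsigned cvq_idx,
                          unsigned cur_queue_pairs)
{
    if (!driver_ok || !vring_enabled)
        return false;
    if (vq_idx == cvq_idx || vq_idx < 2)  /* CVQ or first data pair */
        return true;
    /* otherwise VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET must cover its pair */
    return vq_idx < cur_queue_pairs * 2;
}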

> > > After that, send the DRIVER_OK and queue all the control commands to
> > > restore the device status (MQ, RSS, ...). Once all of them have been
> > > acknowledged ("device", or emulated cvq in host vdpa backend driver,
> > > has used all cvq buffers, enable (SET_VRING_ENABLE, set_vq_ready) all
> > > other queues.
> > >
> > What is special about doing DRIVER_OK and enqueuing the control
> > commands?
> > Why other configuration cannot be applied before DRIVER_OK?

There is nothing special beyond "they have a method to be set that
way, so reusing it avoids having to maintain many ways to set the same
things, simplifying implementations".

I'm not saying "it has been like that forever so we cannot change it":
I'm very open to the change but priority-wise we should first achieve
a working LM with packed, in_order, or even indirect.

> > [GD>>] For the device to be live (and any queue to be able to pass traffic),
> > DRIVER_OK is a must.
> This applies only to the vdpa device implemented over virtio device.
> For such use case/implementation anyway a proper virtio spec extension is needed for it be efficient.
> And what that happens this scheme will still work.
>

Although it's a longer route, I'd very much prefer an in-band virtio
way to perform it rather than a linux/vdpa specific. It's one of the
reasons I prefer the CVQ behavior over a vdpa specific ioctl.

> Other vdpa devices doesn’t have to live with this limitation at this moment.
>
> > So, any configuration can be passed over the CVQ only
> > after it is started (vring configuration + DRIVER_OK). For an emulated queue,
> > if the order is reversed, I think the enqueued commands will remain
> > buffered and device should be able to service them when it goes live.
> I likely didn’t understand what you describe above.
>
> Vq avail index etc is programmed before doing DRIVER_OK anyway.
>
> Sequence is very straight forward at destination from user to kernel.
> 1. Set config space fields (such as virtio_net_config/virtio_blk_config).
> 2. Set other device attributes (max vqs, current num vqs)
> 3. Set net specific config (RSS, vlan, mac and more filters)
> 4. Set VQ fields (enable, msix, addr, avail indx)
> 5. Set DRIVER_OK, device resumes from where it left off
>
> Steps #1 to #4 can be done multiple times in pre-warm device up case in future.

That requires creating a way to set all the parameters enumerated
there on vdpa devices. Each time a new feature is added to virtio-net
that modifies the guest-visible frontend device we would need to
update it. And all the layers of the stack need to maintain more state.

From the guest's point of view, enabling all the queues with
VHOST_VDPA_SET_VRING_ENABLE and not sending DRIVER_OK is the same as
sending DRIVER_OK and not enabling any data queue with
VHOST_VDPA_SET_VRING_ENABLE. We can do all the pre-warming phase that
way too, avoiding adding more maintenance burden to vdpa.

> For now, they can be done once to get things started.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: About restoring the state in vhost-vdpa device
  2022-05-16  8:50       ` Eugenio Perez Martin
@ 2022-05-16 20:29           ` Parav Pandit
  0 siblings, 0 replies; 20+ messages in thread
From: Parav Pandit via Virtualization @ 2022-05-16 20:29 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Laurent Vivier, Cindy Lu, virtio-networking, qemu-level,
	Gautam Dawar, virtualization, Eli Cohen



> From: Eugenio Perez Martin <eperezma@redhat.com>
> Sent: Monday, May 16, 2022 4:51 AM
> 
> On Fri, May 13, 2022 at 8:25 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> > Hi Gautam,
> >
> > Please fix your email client to have right response format.
> > Otherwise, it will be confusing for the rest and us to follow the
> conversation.
> >
> > More below.
> >
> > > From: Gautam Dawar <gdawar@xilinx.com>
> > > Sent: Friday, May 13, 2022 1:48 PM
> >
> > > > Our proposal diverge in step 7: Instead of enabling *all* the
> > > > virtqueues, only enable the CVQ.
> > > Just to double check, VQ 0 and 1 of the net are also not enabled, correct?
> > > [GD>>] Yes, that's my understanding as well.
> > >
> 
> That's correct. We can say that for a queue to be enabled three things must
> happen:
> * DRIVER_OK (Still not send)
> * VHOST_VDPA_SET_VRING_ENABLE sent for that queue (Only sent for
> CVQ)
> * If queue is not in first data queue pair and not cvq: send
> VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET with a queue pair that include it.
> 
These if conditions, especially the last one, are not good, as they require device-type knowledge, which in most cases is not needed.
Especially for new code.

> > > > After that, send the DRIVER_OK and queue all the control commands
> > > > to restore the device status (MQ, RSS, ...). Once all of them have
> > > > been acknowledged ("device", or emulated cvq in host vdpa backend
> > > > driver, has used all cvq buffers, enable (SET_VRING_ENABLE,
> > > > set_vq_ready) all other queues.
> > > >
> > > What is special about doing DRIVER_OK and enqueuing the control
> > > commands?
> > > Why other configuration cannot be applied before DRIVER_OK?
> 
> There is nothing special beyond "they have a method to be set that way, so
> reusing it avoids having to maintain many ways to set the same things,
> simplifying implementations".
> 
> I'm not saying "it has been like that forever so we cannot change it":
> I'm very open to the change but priority-wise we should first achieve a
> working LM with packed, in_order, or even indirect.
> 
> > > [GD>>] For the device to be live (and any queue to be able to pass
> > > traffic), DRIVER_OK is a must.
> > This applies only to the vdpa device implemented over virtio device.
> > For such use case/implementation anyway a proper virtio spec extension is
> needed for it be efficient.
> > And what that happens this scheme will still work.
> >
> 
> Although it's a longer route, I'd very much prefer an in-band virtio way to
> perform it rather than a linux/vdpa specific. It's one of the reasons I prefer
> the CVQ behavior over a vdpa specific ioctl.
> 
What is the in-band method to set last_avail_idx?
In-band virtio method doesn't exist.

> > Other vdpa devices doesn’t have to live with this limitation at this moment.
> >
> > > So, any configuration can be passed over the CVQ only after it is
> > > started (vring configuration + DRIVER_OK). For an emulated queue, if
> > > the order is reversed, I think the enqueued commands will remain
> > > buffered and device should be able to service them when it goes live.
> > I likely didn’t understand what you describe above.
> >
> > Vq avail index etc is programmed before doing DRIVER_OK anyway.
> >
> > Sequence is very straight forward at destination from user to kernel.
> > 1. Set config space fields (such as virtio_net_config/virtio_blk_config).
> > 2. Set other device attributes (max vqs, current num vqs) 3. Set net
> > specific config (RSS, vlan, mac and more filters) 4. Set VQ fields
> > (enable, msix, addr, avail indx) 5. Set DRIVER_OK, device resumes from
> > where it left off
> >
> > Steps #1 to #4 can be done multiple times in pre-warm device up case in
> future.
> 
> That requires creating a way to set all the parameters enumerated there to
> vdpa devices. Each time a new feature is added to virtio-net that modifies
> the guest-visible fronted device we would need to update it. 
For any guest-visible feature exposed by the vdpa device on the source side, a migration agent needs to verify it and make the destination device capable of supporting it anyway. Without that, a device may be migrated, but it won't perform at the same level as the source.

> And all the
> layers of the stack need to maintain more state.
Mostly not. The complete virtio device state arriving from the source vdpa device can be given to the destination vdpa device without anyone else looking at it in the middle, if this format is known/well defined.

> 
> From the guest point of view, to enable all the queues with
> VHOST_VDPA_SET_VRING_ENABLE and don't send DRIVER_OK is the same
> as send DRIVER_OK and not to enable any data queue with
> VHOST_VDPA_SET_VRING_ENABLE. 

Doing SET_VRING_ENABLE after DRIVER_OK has two basic things broken.
1. The supplied RSS config and VQ config are not honored for tens to hundreds of milliseconds.
This depends purely on how/when these ioctls are made.
Due to this behavior, a packet supposed to arrive in VQ X arrives in VQ Y instead.

2. Enabling each VQ one at a time requires a constant steering update for that VQ,
while this information is something already known. Trying to reuse the existing callback results in this inefficiency.
So it is better to start with more reusable APIs that fit the LM flow.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: About restoring the state in vhost-vdpa device
  2022-05-16 20:29           ` Parav Pandit
@ 2022-05-17  3:05             ` Jason Wang
  -1 siblings, 0 replies; 20+ messages in thread
From: Jason Wang @ 2022-05-17  3:05 UTC (permalink / raw)
  To: Parav Pandit, Eugenio Perez Martin
  Cc: Laurent Vivier, Cindy Lu, virtio-networking, qemu-level,
	Gautam Dawar, virtualization, Eli Cohen


On 2022/5/17 04:29, Parav Pandit wrote:
>
>> From: Eugenio Perez Martin <eperezma@redhat.com>
>> Sent: Monday, May 16, 2022 4:51 AM
>>
>> On Fri, May 13, 2022 at 8:25 PM Parav Pandit <parav@nvidia.com> wrote:
>>> Hi Gautam,
>>>
>>> Please fix your email client to have right response format.
>>> Otherwise, it will be confusing for the rest and us to follow the
>> conversation.
>>> More below.
>>>
>>>> From: Gautam Dawar <gdawar@xilinx.com>
>>>> Sent: Friday, May 13, 2022 1:48 PM
>>>>> Our proposal diverge in step 7: Instead of enabling *all* the
>>>>> virtqueues, only enable the CVQ.
>>>> Just to double check, VQ 0 and 1 of the net are also not enabled, correct?
>>>> [GD>>] Yes, that's my understanding as well.
>>>>
>> That's correct. We can say that for a queue to be enabled three things must
>> happen:
>> * DRIVER_OK (Still not send)
>> * VHOST_VDPA_SET_VRING_ENABLE sent for that queue (Only sent for
>> CVQ)
>> * If queue is not in first data queue pair and not cvq: send
>> VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET with a queue pair that include it.
>>
> These if conditions, specially the last one is not good as it requires device type knowledge, which in most cases not needed.
> Specially for the new code.
>
>>>>> After that, send the DRIVER_OK and queue all the control commands
>>>>> to restore the device status (MQ, RSS, ...). Once all of them have
>>>>> been acknowledged ("device", or emulated cvq in host vdpa backend
>>>>> driver, has used all cvq buffers, enable (SET_VRING_ENABLE,
>>>>> set_vq_ready) all other queues.
>>>>>
>>>> What is special about doing DRIVER_OK and enqueuing the control
>>>> commands?
>>>> Why other configuration cannot be applied before DRIVER_OK?
>> There is nothing special beyond "they have a method to be set that way, so
>> reusing it avoids having to maintain many ways to set the same things,
>> simplifying implementations".
>>
>> I'm not saying "it has been like that forever so we cannot change it":
>> I'm very open to the change but priority-wise we should first achieve a
>> working LM with packed, in_order, or even indirect.
>>
>>>> [GD>>] For the device to be live (and any queue to be able to pass
>>>> traffic), DRIVER_OK is a must.
>>> This applies only to the vdpa device implemented over virtio device.
>>> For such use case/implementation anyway a proper virtio spec extension is
>> needed for it be efficient.
>>> And what that happens this scheme will still work.
>>>
>> Although it's a longer route, I'd very much prefer an in-band virtio way to
>> perform it rather than a linux/vdpa specific. It's one of the reasons I prefer
>> the CVQ behavior over a vdpa specific ioctl.
>>
> What is the in-band method to set last_avail_idx?
> In-band virtio method doesn't exist.


Right, but it's part of the vhost API, which has been there for more than 10
years. This should be supported by all the vDPA vendors.


>
>>> Other vdpa devices doesn’t have to live with this limitation at this moment.
>>>
>>>> So, any configuration can be passed over the CVQ only after it is
>>>> started (vring configuration + DRIVER_OK). For an emulated queue, if
>>>> the order is reversed, I think the enqueued commands will remain
>>>> buffered and device should be able to service them when it goes live.
>>> I likely didn’t understand what you describe above.
>>>
>>> Vq avail index etc is programmed before doing DRIVER_OK anyway.
>>>
>>> Sequence is very straight forward at destination from user to kernel.
>>> 1. Set config space fields (such as virtio_net_config/virtio_blk_config).
> > > 2. Set other device attributes (max vqs, current num vqs)
> > > 3. Set net specific config (RSS, vlan, mac and more filters)
> > > 4. Set VQ fields (enable, msix, addr, avail indx)
> > > 5. Set DRIVER_OK, device resumes from where it left off
> > >
> > > Steps #1 to #4 can be done multiple times in pre-warm device up case in future.
>>
>> That requires creating a way to set all the parameters enumerated there to
>> vdpa devices. Each time a new feature is added to virtio-net that modifies
>> the guest-visible fronted device we would need to update it.
> Any guest visible feature exposed by the vdpa device on the source side, a migration agent needs to verify/and make destination device capable to support it anyway. Without it a device may be migrated, but it won't perform at same level as source.
>
>> And all the
>> layers of the stack need to maintain more state.
> Mostly not. A complete virtio device state arrived from source vdpa device can be given to destination vdpa device without anyone else looking in the middle. If this format is known/well defined.


That's fine, and it seems the virtio spec is a better place for this;
then we won't duplicate efforts?


>
>>  From the guest point of view, to enable all the queues with
>> VHOST_VDPA_SET_VRING_ENABLE and don't send DRIVER_OK is the same
>> as send DRIVER_OK and not to enable any data queue with
>> VHOST_VDPA_SET_VRING_ENABLE.
> Enabling SET_VRING_ENABLE after DRIVER_OK has two basic things broken.


It looks to me that the spec:

1) for PCI, doesn't forbid the driver to set queue_enable to 1 after
DRIVER_OK;
2) for MMIO, even allows the driver to disable a queue after DRIVER_OK.


> 1. supplied RSS config and VQ config is not honored for several tens of hundreds of milliseconds
> It will be purely dependent on how/when this ioctl are made.
> Due to this behavior packet supposed to arrive in X VQ, arrives in Y VQ.


I don't get why we end up with this situation.

1) enable cvq
2) set driver_ok
3) set RSS
4) enable TX/RX

vs

1) set RSS
2) enable cvq
3) enable TX/RX
4) set driver_ok

Is the latter faster?


>
> 2. Each VQ enablement one at a time, requires constant steering update for the VQ
> While this information is something already known. Trying to reuse brings a callback result in this in-efficiency.
> So better to start with more reusable APIs that fits the LM flow.


I agree, but the method proposed in the mail seems to be the only way
that can work with all the major vDPA vendors.

E.g. the new API requires that the device has the ability to receive
device state other than via the control virtqueue, which might not be
supported by the hardware. (The device might expect a trap-and-emulate
model rather than save-and-restore.)

From qemu's point of view, it might need to support both models.

If the device can't do save and restore:

1.1) enable cvq
1.2) set driver_ok
1.3) set device state (MQ, RSS) via control vq
1.4) enable TX/RX

If the device can do save and restore:

2.1) set device state (new API for setting MQ,RSS)
2.2) enable cvq
2.3) enable TX/RX
2.4) set driver_ok

We can start from 1, since it works for all devices, and then add
support for 2?
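
As a rough sketch of flow 1 from the backend side (vhost_vdpa_cvq_add()
only stands in for however the shadow control virtqueue injects a
command and waits for its ack, it is not an existing qemu function;
error handling omitted):

#include <stddef.h>
#include <stdint.h>
#include <endian.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>
#include <linux/virtio_config.h>
#include <linux/virtio_net.h>

/* Stand-in for the shadow CVQ: queue a command and wait for the ack. */
extern void vhost_vdpa_cvq_add(uint8_t ctrl_class, uint8_t cmd,
                               const void *data, size_t len);

static void restore_via_cvq(int vdpa_fd, __u8 status, unsigned int data_vqs,
                            unsigned int cvq_index, uint16_t curr_queue_pairs)
{
    struct vhost_vring_state enable_cvq = { .index = cvq_index, .num = 1 };
    struct virtio_net_ctrl_mq mq = {
        .virtqueue_pairs = htole16(curr_queue_pairs),
    };

    /* 1.1: enable only the control virtqueue */
    ioctl(vdpa_fd, VHOST_VDPA_SET_VRING_ENABLE, &enable_cvq);

    /* 1.2: let the device go live; only the CVQ can move buffers yet */
    status |= VIRTIO_CONFIG_S_DRIVER_OK;
    ioctl(vdpa_fd, VHOST_VDPA_SET_STATUS, &status);

    /* 1.3: replay device state (MQ shown; RSS, MAC, ... use the same path) */
    vhost_vdpa_cvq_add(VIRTIO_NET_CTRL_MQ, VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET,
                       &mq, sizeof(mq));

    /* 1.4: only now enable the data virtqueues */
    for (unsigned int i = 0; i < data_vqs; i++) {
        struct vhost_vring_state enable = { .index = i, .num = 1 };

        ioctl(vdpa_fd, VHOST_VDPA_SET_VRING_ENABLE, &enable);
    }
}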

Thanks



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: About restoring the state in vhost-vdpa device
  2022-05-16 20:29           ` Parav Pandit
@ 2022-05-17  8:12           ` Eugenio Perez Martin
  2022-05-18 12:51               ` Parav Pandit
  -1 siblings, 1 reply; 20+ messages in thread
From: Eugenio Perez Martin @ 2022-05-17  8:12 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Gautam Dawar, virtualization, qemu-level, Jason Wang, Cindy Lu,
	virtio-networking, Eli Cohen, Laurent Vivier, Stefano Garzarella

On Mon, May 16, 2022 at 10:29 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Eugenio Perez Martin <eperezma@redhat.com>
> > Sent: Monday, May 16, 2022 4:51 AM
> >
> > On Fri, May 13, 2022 at 8:25 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > > Hi Gautam,
> > >
> > > Please fix your email client to have right response format.
> > > Otherwise, it will be confusing for the rest and us to follow the
> > conversation.
> > >
> > > More below.
> > >
> > > > From: Gautam Dawar <gdawar@xilinx.com>
> > > > Sent: Friday, May 13, 2022 1:48 PM
> > >
> > > > > Our proposal diverge in step 7: Instead of enabling *all* the
> > > > > virtqueues, only enable the CVQ.
> > > > Just to double check, VQ 0 and 1 of the net are also not enabled, correct?
> > > > [GD>>] Yes, that's my understanding as well.
> > > >
> >
> > That's correct. We can say that for a queue to be enabled three things must
> > happen:
> > * DRIVER_OK (Still not send)
> > * VHOST_VDPA_SET_VRING_ENABLE sent for that queue (Only sent for
> > CVQ)
> > * If queue is not in first data queue pair and not cvq: send
> > VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET with a queue pair that include it.
> >
> These if conditions, specially the last one is not good as it requires device type knowledge, which in most cases not needed.
> Specially for the new code.
>

DRIVER_OK and SET_VRING_ENABLE are the current way to do so, and all
kinds of vdpa devices are running this way. For the last conditional I
meant only -net; -console has VIRTIO_CONSOLE_F_MULTIPORT, and so on.

But the point was that this is already in the standard and integrated
in the devices: this is not part of the proposal, but of the introduction
explaining how the devices work at this moment. We can try to optimize
this flow for sure, but it's a different discussion.

> > > > > After that, send the DRIVER_OK and queue all the control commands
> > > > > to restore the device status (MQ, RSS, ...). Once all of them have
> > > > > been acknowledged ("device", or emulated cvq in host vdpa backend
> > > > > driver, has used all cvq buffers, enable (SET_VRING_ENABLE,
> > > > > set_vq_ready) all other queues.
> > > > >
> > > > What is special about doing DRIVER_OK and enqueuing the control
> > > > commands?
> > > > Why other configuration cannot be applied before DRIVER_OK?
> >
> > There is nothing special beyond "they have a method to be set that way, so
> > reusing it avoids having to maintain many ways to set the same things,
> > simplifying implementations".
> >
> > I'm not saying "it has been like that forever so we cannot change it":
> > I'm very open to the change but priority-wise we should first achieve a
> > working LM with packed, in_order, or even indirect.
> >
> > > > [GD>>] For the device to be live (and any queue to be able to pass
> > > > traffic), DRIVER_OK is a must.
> > > This applies only to the vdpa device implemented over virtio device.
> > > For such use case/implementation anyway a proper virtio spec extension is
> > needed for it be efficient.
> > > And what that happens this scheme will still work.
> > >
> >
> > Although it's a longer route, I'd very much prefer an in-band virtio way to
> > perform it rather than a linux/vdpa specific. It's one of the reasons I prefer
> > the CVQ behavior over a vdpa specific ioctl.
> >
> What is the in-band method to set last_avail_idx?
> In-band virtio method doesn't exist.
>

There isn't; I was acknowledging your point about "a proper virtio
spec extension is needed for it to be efficient".

The intended method is VHOST_SET_VRING_BASE, supported by vhost device
types, as Jason pointed out in his mail. It has been supported for a
long time. In my opinion we should make it an in-band virtio operation
too, since being vhost-only is a limit for some devices.
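
For reference, restoring the avail index at the destination boils down
to this (vdpa_fd and the saved value are placeholders, error handling
omitted):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

static int restore_vq_base(int vdpa_fd, unsigned int vq_index,
                           uint16_t last_avail_idx)
{
    struct vhost_vring_state base = {
        .index = vq_index,
        /* value obtained with VHOST_GET_VRING_BASE on the source */
        .num = last_avail_idx,
    };

    return ioctl(vdpa_fd, VHOST_SET_VRING_BASE, &base);
}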

> > > Other vdpa devices doesn’t have to live with this limitation at this moment.
> > >
> > > > So, any configuration can be passed over the CVQ only after it is
> > > > started (vring configuration + DRIVER_OK). For an emulated queue, if
> > > > the order is reversed, I think the enqueued commands will remain
> > > > buffered and device should be able to service them when it goes live.
> > > I likely didn’t understand what you describe above.
> > >
> > > Vq avail index etc is programmed before doing DRIVER_OK anyway.
> > >
> > > Sequence is very straight forward at destination from user to kernel.
> > > 1. Set config space fields (such as virtio_net_config/virtio_blk_config).
> > > 2. Set other device attributes (max vqs, current num vqs)
> > > 3. Set net specific config (RSS, vlan, mac and more filters)
> > > 4. Set VQ fields (enable, msix, addr, avail indx)
> > > 5. Set DRIVER_OK, device resumes from where it left off
> > >
> > > Steps #1 to #4 can be done multiple times in pre-warm device up case in future.
> >
> > That requires creating a way to set all the parameters enumerated there to
> > vdpa devices. Each time a new feature is added to virtio-net that modifies
> > the guest-visible fronted device we would need to update it.
> Any guest visible feature exposed by the vdpa device on the source side, a migration agent needs to verify/and make destination device capable to support it anyway. Without it a device may be migrated, but it won't perform at same level as source.
>

We can discuss how to reach that point, but it's not the state at this
moment. And I doubt we can reach it before the next kernel merge
window.

> > And all the
> > layers of the stack need to maintain more state.
> Mostly not. A complete virtio device state arrived from source vdpa device can be given to destination vdpa device without anyone else looking in the middle. If this format is known/well defined.
>

I'm not sure I follow this. If you mean that qemu defines a format for
migration data, that is not 100% true: it changes between versions,
and you cannot migrate between two versions of qemu that are too far
apart without going through an intermediate version, if I'm not
wrong. Migration bugs are solved by changing that format, since qemu
does not need interoperability with other VMMs at the moment.

The boundary between qemu and devices needs more restrictions than
that for interoperability.

I agree we could define a format for the virtio device state, but I
think the right place is the virtio standard, not the vDPA layer. If
we do it at the vDPA layer, we are "repeating the mistake" of
VHOST_SET_VRING_BASE: we need to maintain two ways to perform the same
action.

> >
> > From the guest point of view, to enable all the queues with
> > VHOST_VDPA_SET_VRING_ENABLE and don't send DRIVER_OK is the same
> > as send DRIVER_OK and not to enable any data queue with
> > VHOST_VDPA_SET_VRING_ENABLE.
>
> Enabling SET_VRING_ENABLE after DRIVER_OK has two basic things broken.
> 1. supplied RSS config and VQ config is not honored for several tens of hundreds of milliseconds
> It will be purely dependent on how/when this ioctl are made.

We can optimize this without adding burden to the API.

> Due to this behavior packet supposed to arrive in X VQ, arrives in Y VQ.
>

I'm not sure why that happens.

By the standard:
After the driver transmitted a packet of a flow on transmitqX, the
device SHOULD cause incoming packets for that flow to be steered to
receiveqX. For uni-directional protocols, or where no packets have
been transmitted yet, the device MAY steer a packet to a random queue
out of the specified receiveq1…receiveqn.

The device MUST NOT queue packets on receive queues greater than
virtqueue_pairs once it has placed the VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET
command in a used buffer.

It doesn't say the requests must be migrated from one queue to another
beyond the interpretation of the SHOULD. Maybe we need to restrict the
standard more to reduce the differences between the devices?
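
For reference, the command we are discussing is tiny; what travels on
the CVQ is essentially the following (types from linux/virtio_net.h;
grouping it into one struct is only for illustration, the ack really is
a separate device-writable buffer):

#include <linux/virtio_net.h>

/* Conceptual layout of a VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET command. */
struct mq_cmd {
    struct virtio_net_ctrl_hdr hdr; /* class = VIRTIO_NET_CTRL_MQ,
                                       cmd = VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET */
    struct virtio_net_ctrl_mq mq;   /* virtqueue_pairs, little-endian,
                                       1..max_virtqueue_pairs */
    __u8 ack;                       /* written back by the device:
                                       VIRTIO_NET_OK or VIRTIO_NET_ERR */
};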

> 2. Each VQ enablement one at a time, requires constant steering update for the VQ
> While this information is something already known. Trying to reuse brings a callback result in this in-efficiency.
> So better to start with more reusable APIs that fits the LM flow.

We can change to that model later. Since the model proposed by us does
not add any burden, we can discard it down the road if something
better arises. The proposed behavior should already work for all
devices: it comes for free regarding kernel / vdpa code.

I think that doing it at the vhost/vDPA level is going to cause the same
problem as VHOST_SET_VRING_BASE: we will need to maintain two ways of
performing the same action, and the code will need to synchronize them.
I'm not *against* adding it by itself, I'm just considering it an
optimization that needs to be balanced against what already enables
the device to perform state restoring.

Thanks!



^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: About restoring the state in vhost-vdpa device
  2022-05-17  3:05             ` Jason Wang
@ 2022-05-18 12:43               ` Parav Pandit
  -1 siblings, 0 replies; 20+ messages in thread
From: Parav Pandit via Virtualization @ 2022-05-18 12:43 UTC (permalink / raw)
  To: Jason Wang, Eugenio Perez Martin
  Cc: Laurent Vivier, Cindy Lu, virtio-networking, qemu-level,
	Gautam Dawar, virtualization, Eli Cohen


> From: Jason Wang <jasowang@redhat.com>
> Sent: Monday, May 16, 2022 11:05 PM
> >> Although it's a longer route, I'd very much prefer an in-band virtio
> >> way to perform it rather than a linux/vdpa specific. It's one of the
> >> reasons I prefer the CVQ behavior over a vdpa specific ioctl.
> >>
> > What is the in-band method to set last_avail_idx?
> > In-band virtio method doesn't exist.
> 
> 
> Right, but it's part of the vhost API which was there for more than 10 years.
> This should be supported by all the vDPA vendors.
Sure. My point to Eugenio was that vdpa doesn't have to be limited by the virtio spec.
Plumbing exists to make vdpa work without the virtio spec.
And hence, an additional ioctl can be OK.

> >> layers of the stack need to maintain more state.
> > Mostly not. A complete virtio device state arrived from source vdpa device
> can be given to destination vdpa device without anyone else looking in the
> middle. If this format is known/well defined.
> 
> 
> That's fine, and it seems the virtio spec is a better place for this,
> then we won't duplicate efforts?
> 
Yes. For the vDPA kernel, setting parameters doesn't need a virtio spec update.
It is similar to setting the avail index.

> 
> >
> >>  From the guest point of view, to enable all the queues with
> >> VHOST_VDPA_SET_VRING_ENABLE and don't send DRIVER_OK is the
> same
> >> as send DRIVER_OK and not to enable any data queue with
> >> VHOST_VDPA_SET_VRING_ENABLE.
> > Enabling SET_VRING_ENABLE after DRIVER_OK has two basic things
> broken.
> 
> 
> It looks to me the spec:
> 
> 1) For PCI it doesn't forbid the driver to set queue_enable to 1 after
> DRIVER_OK.
The device init sequence sort of hints that vq setup should be done before driver_ok, in the snippet below.

"Perform device-specific setup, including discovery of virtqueues for the device, optional per-bus setup,
reading and possibly writing the device’s virtio configuration space, and population of virtqueues."

Even if we assume for a moment that a queue can be enabled after driver_ok, traffic ends up going to the incorrect queue,
because the queue it was supposed to go to is not enabled and its RSS is not set up.

So in the restore flow it is desirable to set the needed config before doing driver_ok.

> 2) For MMIO, it even allows the driver to disable a queue after DRIVER_OK
> 
> 
> > 1. supplied RSS config and VQ config is not honored for several tens of
> hundreds of milliseconds
> > It will be purely dependent on how/when this ioctl are made.
> > Due to this behavior packet supposed to arrive in X VQ, arrives in Y VQ.
> 
> 
> I don't get why we end up with this situation.
> 
> 1) enable cvq
> 2) set driver_ok
> 3) set RSS
> 4) enable TX/RX
> 
> vs
> 
> 1) set RSS
> 2) enable cvq
> 3) enable TX/RX
> 4) set driver_ok
> 
> Is the latter faster?
> 
Yes, because the latter sequence has the ability to set up the steering config once,
as opposed to the first sequence, which needs to incrementally update the RSS setting on every new queue addition in step #4.

> 
> >
> > 2. Each VQ enablement one at a time, requires constant steering update
> for the VQ
> > While this information is something already known. Trying to reuse brings a
> callback result in this in-efficiency.
> > So better to start with more reusable APIs that fits the LM flow.
> 
> 
> I agree, but the method proposed in the mail seems to be the only way
> that can work with the all the major vDPA vendors.
> 
> E.g the new API requires the device has the ability to receive device
> state other than the control virtqueue which might not be supported the
> hardware. (The device might expects a trap and emulate model rather than
> save and restore).
> 
How a given vendor returns the values is up to the vendor-specific vdpa driver, just like avail_index, which does not come through the CVQ.

>  From qemu point of view, it might need to support both models.
> 
> If the device can't do save and restore:
> 
> 1.1) enable cvq
> 1.2) set driver_ok
> 1.3) set device state (MQ, RSS) via control vq
> 1.4) enable TX/RX
> 
> If the device can do save and restore:
> 
> 2.1) set device state (new API for setting MQ,RSS)
> 2.2) enable cvq
> 2.3) enable TX?RX
> 2.4) set driver_ok
> 
> We can start from 1 since it works for all device and then adding
> support for 2?
> 

How about:
3.1) create a cvq for the supported device
The cvq is not exposed to user space; it stays in the kernel, created by the vdpa driver.

3.2) set device state (MQ, RSS) via a user->kernel ioctl(); a rough sketch of its shape follows below
The vdpa driver internally decides whether to use the cvq or something else (like the avail index).

3.3) enable tx/rx
3.4) set driver_ok
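
Only to show the shape of 3.2, something like the following hypothetical
uAPI (neither VHOST_VDPA_SET_DEV_STATE nor struct vhost_vdpa_dev_state
exists today; this is not a proposal for the exact encoding):

#include <linux/types.h>

/* HYPOTHETICAL: not in linux/vhost.h, shown only to illustrate the idea. */
struct vhost_vdpa_dev_state {
    __u32 len;   /* size of the opaque device state blob */
    __u64 addr;  /* userspace pointer to the blob saved from the source */
};

/*
 * #define VHOST_VDPA_SET_DEV_STATE \
 *         _IOW(VHOST_VIRTIO, 0x80, struct vhost_vdpa_dev_state)
 *
 * The vendor vdpa driver unpacks the blob and applies MQ/RSS/filters
 * itself (via its own CVQ, firmware commands, ...) before driver_ok.
 */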

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: About restoring the state in vhost-vdpa device
  2022-05-17  8:12           ` Eugenio Perez Martin
@ 2022-05-18 12:51               ` Parav Pandit
  0 siblings, 0 replies; 20+ messages in thread
From: Parav Pandit via Virtualization @ 2022-05-18 12:51 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Laurent Vivier, Cindy Lu, virtio-networking, qemu-level,
	Gautam Dawar, virtualization, Eli Cohen


> From: Eugenio Perez Martin <eperezma@redhat.com>
> Sent: Tuesday, May 17, 2022 4:12 AM
 
> > 2. Each VQ enablement one at a time, requires constant steering update
> > for the VQ While this information is something already known. Trying to
> reuse brings a callback result in this in-efficiency.
> > So better to start with more reusable APIs that fits the LM flow.
> 
> We can change to that model later. Since the model proposed by us does not
> add any burden, we can discard it down the road if something better arises.
> The proposed behavior should already work for all
> devices: It comes for free regarding kernel / vdpa code.
It is not for free.
It comes with higher LM downtime.
And that makes it unusable as the number of queues scales.

> 
> I think that doing at vhost/vDPA level is going to cause the same problem as
> VRING_SET_BASE: We will need to maintain two ways of performing the
> same, and the code will need to synchronize them. I'm not *against* adding
> it by itself, I'm just considering it an optimization that needs to be balanced
> against what already enables the device to perform state restoring.

We only need to change the sequencing of how we restore, and abstract out how to restore in the vdpa layer.
Whether that is the CVQ or something else is a choice internal to the vdpa vendor driver.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: About restoring the state in vhost-vdpa device
  2022-05-18 12:43               ` Parav Pandit
@ 2022-05-23  3:12                 ` Jason Wang
  -1 siblings, 0 replies; 20+ messages in thread
From: Jason Wang @ 2022-05-23  3:12 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Laurent Vivier, Cindy Lu, piotr.uminski, virtio-networking,
	qemu-level, Gautam Dawar, virtualization, Eugenio Perez Martin,
	Eli Cohen, Zhu Lingshan

On Wed, May 18, 2022 at 8:44 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Monday, May 16, 2022 11:05 PM
> > >> Although it's a longer route, I'd very much prefer an in-band virtio
> > >> way to perform it rather than a linux/vdpa specific. It's one of the
> > >> reasons I prefer the CVQ behavior over a vdpa specific ioctl.
> > >>
> > > What is the in-band method to set last_avail_idx?
> > > In-band virtio method doesn't exist.
> >
> >
> > Right, but it's part of the vhost API which was there for more than 10 years.
> > This should be supported by all the vDPA vendors.
> Sure. My point to Eugenio was that vdpa doesn’t have to limited by virtio spec.

Yes, but the APIs only consist of the ones that are already
supported by either virtio or vhost.

> Plumbing exists to make vdpa work without virtio spec.
> And hence, additional ioctl can be ok.
>
> > >> layers of the stack need to maintain more state.
> > > Mostly not. A complete virtio device state arrived from source vdpa device
> > can be given to destination vdpa device without anyone else looking in the
> > middle. If this format is known/well defined.
> >
> >
> > That's fine, and it seems the virtio spec is a better place for this,
> > then we won't duplicate efforts?
> >
> Yes. for VDPA kernel, setting parameters doesn’t need virtio spec update.

We've already had a spec patch for this.

> It is similar to avail index setting.

Yes, but we don't want to diverge too much from the virtio spec
unless we have a very strong point. The reason is that it would be
challenging to offer forward compatibility with the future spec support
for device state restoring. That's why we tend to reuse the existing
APIs so far.

>
> >
> > >
> > >>  From the guest point of view, to enable all the queues with
> > >> VHOST_VDPA_SET_VRING_ENABLE and don't send DRIVER_OK is the
> > same
> > >> as send DRIVER_OK and not to enable any data queue with
> > >> VHOST_VDPA_SET_VRING_ENABLE.
> > > Enabling SET_VRING_ENABLE after DRIVER_OK has two basic things
> > broken.
> >
> >
> > It looks to me the spec:
> >
> > 1) For PCI it doesn't forbid the driver to set queue_enable to 1 after
> > DRIVER_OK.
> Device init sequence sort of hints that vq setup should be done before driver_ok in below snippet.
>
> "Perform device-specific setup, including discovery of virtqueues for the device, optional per-bus setup,
> reading and possibly writing the device’s virtio configuration space, and population of virtqueues."
>
> For a moment even if we assume, that queue can be enabled after driver_ok, it ends up going to incorrect queue.

For RSS, the device can choose to drop the packet if the destination
queue is not enabled; we can clarify this in the spec. Actually,
there's a patch that has already clarified that the packet should be
dropped if the queue is reset:

https://lists.oasis-open.org/archives/virtio-dev/202204/msg00063.html

We need to do something similar for queue_enable in this case. Then we
are all fine.

And the overhead of the incremental ioctls is not something that
can't be addressed; we can introduce e.g. a new command to disable the
datapath by setting num_queue_pairs to 0.

> Because the queue where it supposed to go, it not enabled and its rss is not setup.

So queue_enable and num_queue_pairs work at different levels.
Queue_enable works at the general virtio level, but num_queue_pairs
works for networking only.

From the spec, it allows the following setup:

1) enable the 1st queue pair
2) set driver_ok
3) set 4 queue pairs

The device is expected to deal with this setup anyhow.

>
> So on restore flow it is desired to set needed config before doing driver_ok.
>
> > 2) For MMIO, it even allows the driver to disable a queue after DRIVER_OK
> >
> >
> > > 1. supplied RSS config and VQ config is not honored for several tens of
> > hundreds of milliseconds
> > > It will be purely dependent on how/when this ioctl are made.
> > > Due to this behavior packet supposed to arrive in X VQ, arrives in Y VQ.
> >
> >
> > I don't get why we end up with this situation.
> >
> > 1) enable cvq
> > 2) set driver_ok
> > 3) set RSS
> > 4) enable TX/RX
> >
> > vs
> >
> > 1) set RSS
> > 2) enable cvq
> > 3) enable TX/RX
> > 4) set driver_ok
> >
> > Is the latter faster?
> >
> Yes, because later sequence has the ability to setup steering config once.
> As opposed to that first sequence needs to incrementally update the rss setting on every new queue addition on step #4.

So, as I understand it, the device RX flow is something like:

Incoming packet -> steering[1] -> queue_enable[2] -> start DMA

The steering[1] is expected to be configured once, when setting either MQ or RSS.
The queue_enable[2] check is an independent check after steering.

We should only start the DMA after both [1] and [2]. So the only
incremental thing is queue_enable[2].
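
In pseudo-C, only to make that ordering explicit (steer(), vq[] and the
helpers are illustrative, not any vendor's implementation):

/* Minimal stubs so the sketch is self-contained. */
struct pkt;
struct vq { int enabled; };
extern struct vq vq[];
extern int steer(struct pkt *p);                  /* [1] RSS/MQ lookup */
extern void drop(struct pkt *p);
extern void dma_to_guest(struct pkt *p, struct vq *q);

static void device_rx(struct pkt *p)
{
    int q = steer(p);            /* steering table programmed once (MQ/RSS) */

    if (!vq[q].enabled)          /* independent per-queue queue_enable check */
        drop(p);
    else
        dma_to_guest(p, &vq[q]); /* DMA starts only after both checks */
}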

I would try Eugenio's proposal and see how it works; anyhow, it's the
cheapest way so far (without introducing new ioctls etc.). My
understanding is that it should work (probably with some overhead on
the queue_enable step) for all vendors so far. If it doesn't, we can add
new ioctls.

>
> >
> > >
> > > 2. Each VQ enablement one at a time, requires constant steering update
> > for the VQ
> > > While this information is something already known. Trying to reuse brings a
> > callback result in this in-efficiency.
> > > So better to start with more reusable APIs that fits the LM flow.
> >
> >
> > I agree, but the method proposed in the mail seems to be the only way
> > that can work with the all the major vDPA vendors.
> >
> > E.g the new API requires the device has the ability to receive device
> > state other than the control virtqueue which might not be supported the
> > hardware. (The device might expects a trap and emulate model rather than
> > save and restore).
> >
> How a given vendor to return the values is in the vendor specific vdpa driver, just like avail_index which is not coming through the CVQ.

The problem is:

1) at the virtqueue level, we know the index (and the inflight buffers)
are the only state
2) at the device level, we know there is a lot of other state;
inventing new ioctls might require very careful design, which may take
time

>
> >  From qemu point of view, it might need to support both models.
> >
> > If the device can't do save and restore:
> >
> > 1.1) enable cvq
> > 1.2) set driver_ok
> > 1.3) set device state (MQ, RSS) via control vq
> > 1.4) enable TX/RX
> >
> > If the device can do save and restore:
> >
> > 2.1) set device state (new API for setting MQ,RSS)
> > 2.2) enable cvq
> > 2.3) enable TX?RX
> > 2.4) set driver_ok
> >
> > We can start from 1 since it works for all device and then adding
> > support for 2?
> >
>
> How about:
> 3.1) create cvq for the supported device
> Cvq not exposed to user space, stays in the kernel. Vdpa driver created it.

If it's a hardware CVQ, we may still need to set DRIVER_OK before
setting the states.

>
> 3.2) set device state (MQ, RSS) comes via user->kernel ioctl()
> Vdpa driver internally decides whether to use cvq or something else (like avail index).
>

I think this is similar to method 2. The challenge is that the ioctl()
is not as flexible as a queue, e.g. we want a device-agnostic API (and
if we had one, it could be defined in the virtio spec). At the API
level, it doesn't differ too much whether it's an ioctl or a queue. If
we decide to go this way, we can simply insist that the CVQ can
work before DRIVER_OK (gated by a backend feature anyhow).

3.1) enable cvq
3.2) set device state (RSS,MQ) via CVQ
3.3) enable tx/rx
3.4) set driver_ok

Thanks


> 3.3) enable tx/rx
> 3.4) set driver_ok


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2022-05-23  3:14 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-11 19:43 About restoring the state in vhost-vdpa device Eugenio Perez Martin
2022-05-12  4:00 ` Jason Wang
2022-05-12  4:00   ` Jason Wang
2022-05-13 15:08 ` Parav Pandit via Virtualization
2022-05-13 15:08   ` Parav Pandit
2022-05-13 17:48   ` Gautam Dawar
2022-05-13 18:25     ` Parav Pandit via Virtualization
2022-05-13 18:25       ` Parav Pandit
2022-05-16  8:50       ` Eugenio Perez Martin
2022-05-16 20:29         ` Parav Pandit via Virtualization
2022-05-16 20:29           ` Parav Pandit
2022-05-17  3:05           ` Jason Wang
2022-05-17  3:05             ` Jason Wang
2022-05-18 12:43             ` Parav Pandit via Virtualization
2022-05-18 12:43               ` Parav Pandit
2022-05-23  3:12               ` Jason Wang
2022-05-23  3:12                 ` Jason Wang
2022-05-17  8:12           ` Eugenio Perez Martin
2022-05-18 12:51             ` Parav Pandit via Virtualization
2022-05-18 12:51               ` Parav Pandit
