From: "Tian, Kevin" <kevin.tian@intel.com>
To: Brett Creeley <brett.creeley@amd.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"alex.williamson@redhat.com" <alex.williamson@redhat.com>,
	"jgg@nvidia.com" <jgg@nvidia.com>,
	"yishaih@nvidia.com" <yishaih@nvidia.com>,
	"shameerali.kolothum.thodi@huawei.com"
	<shameerali.kolothum.thodi@huawei.com>
Cc: "shannon.nelson@amd.com" <shannon.nelson@amd.com>
Subject: RE: [PATCH v10 vfio 4/7] vfio/pds: Add VFIO live migration support
Date: Fri, 16 Jun 2023 08:06:21 +0000
Message-ID: <BN9PR11MB5276511543775B852AD1C5A88C58A@BN9PR11MB5276.namprd11.prod.outlook.com>
In-Reply-To: <20230602220318.15323-5-brett.creeley@amd.com>

> From: Brett Creeley <brett.creeley@amd.com>
> Sent: Saturday, June 3, 2023 6:03 AM
> 
> Add live migration support via the VFIO subsystem. The migration
> implementation aligns with the definition from uapi/vfio.h and uses
> the pds_core PF's adminq for device configuration.
> 
> The ability to suspend, resume, and transfer VF device state data is
> included along with the required admin queue command structures and
> implementations.
> 
> PDS_LM_CMD_SUSPEND and PDS_LM_CMD_SUSPEND_STATUS are added to support
> the VF device suspend operation.
> 
> PDS_LM_CMD_RESUME is added to support the VF device resume operation.
> 
> PDS_LM_CMD_STATUS is added to determine the exact size of the VF
> device state data.
> 
> PDS_LM_CMD_SAVE is added to get the VF device state data.
> 
> PDS_LM_CMD_RESTORE is added to restore the VF device with the
> previously saved data from PDS_LM_CMD_SAVE.
> 
> PDS_LM_CMD_HOST_VF_STATUS is added to notify the device when
> a migration is in/not-in progress from the host's perspective.

Is 'the device' here referring to the PF or the VF?

And how would the device use this information?

> +
> +static int pds_vfio_client_adminq_cmd(struct pds_vfio_pci_device *pds_vfio,
> +				      union pds_core_adminq_cmd *req,
> +				      size_t req_len,
> +				      union pds_core_adminq_comp *resp,
> +				      u64 flags)
> +{
> +	union pds_core_adminq_cmd cmd = {};
> +	size_t cp_len;
> +	int err;
> +
> +	/* Wrap the client request */
> +	cmd.client_request.opcode = PDS_AQ_CMD_CLIENT_CMD;
> +	cmd.client_request.client_id = cpu_to_le16(pds_vfio->client_id);
> +	cp_len = min_t(size_t, req_len, sizeof(cmd.client_request.client_cmd));

'req_len' is kind of redundant. It looks like all the callers just pass the full size of the request union anyway.
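
A rough sketch of what I mean (untested, illustrative only): drop the parameter,

static int pds_vfio_client_adminq_cmd(struct pds_vfio_pci_device *pds_vfio,
				      union pds_core_adminq_cmd *req,
				      union pds_core_adminq_comp *resp,
				      u64 flags);

compute the copy length internally from the request union,

	cp_len = min_t(size_t, sizeof(*req),
		       sizeof(cmd.client_request.client_cmd));

and the call sites then shrink to e.g.

	err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, &comp,
					 PDS_AQ_FLAG_FASTPOLL);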

> +static int
> +pds_vfio_suspend_wait_device_cmd(struct pds_vfio_pci_device *pds_vfio)
> +{
> +	union pds_core_adminq_cmd cmd = {
> +		.lm_suspend_status = {
> +			.opcode = PDS_LM_CMD_SUSPEND_STATUS,
> +			.vf_id = cpu_to_le16(pds_vfio->vf_id),
> +		},
> +	};
> +	struct device *dev = pds_vfio_to_dev(pds_vfio);
> +	union pds_core_adminq_comp comp = {};
> +	unsigned long time_limit;
> +	unsigned long time_start;
> +	unsigned long time_done;
> +	int err;
> +
> +	time_start = jiffies;
> +	time_limit = time_start + HZ * SUSPEND_TIMEOUT_S;
> +	do {
> +		err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd),
> +						 &comp, PDS_AQ_FLAG_FASTPOLL);
> +		if (err != -EAGAIN)
> +			break;
> +
> +		msleep(SUSPEND_CHECK_INTERVAL_MS);
> +	} while (time_before(jiffies, time_limit));

pds_vfio_client_adminq_cmd() has exactly the same mechanism, with a 5s
timeout and 1ms poll interval, when FASTPOLL is set.

Probably you can introduce another flag to indicate retry on
-EAGAIN and then handle it fully in pds_vfio_client_adminq_cmd()?
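
Roughly something like this (completely untested; the flag name and the
__pds_vfio_client_adminq_cmd() helper holding the current function body
are made up here):

#define PDS_VFIO_AQ_RETRY_EAGAIN	BIT_ULL(63)	/* hypothetical; must not clash with pds_core adminq flags */

static int pds_vfio_client_adminq_cmd(struct pds_vfio_pci_device *pds_vfio,
				      union pds_core_adminq_cmd *req,
				      size_t req_len,
				      union pds_core_adminq_comp *resp,
				      u64 flags)
{
	unsigned long time_limit = jiffies + HZ * SUSPEND_TIMEOUT_S;
	int err;

	do {
		/* current body of the function, wrapped as a helper */
		err = __pds_vfio_client_adminq_cmd(pds_vfio, req, req_len,
						   resp, flags);
		if (err != -EAGAIN || !(flags & PDS_VFIO_AQ_RETRY_EAGAIN))
			break;

		msleep(SUSPEND_CHECK_INTERVAL_MS);
	} while (time_before(jiffies, time_limit));

	return err;
}

Then pds_vfio_suspend_wait_device_cmd() goes away and the suspend path
becomes a single call with PDS_AQ_FLAG_FASTPOLL | PDS_VFIO_AQ_RETRY_EAGAIN.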

> +int pds_vfio_suspend_device_cmd(struct pds_vfio_pci_device *pds_vfio)
> +{
> +	union pds_core_adminq_cmd cmd = {
> +		.lm_suspend = {
> +			.opcode = PDS_LM_CMD_SUSPEND,
> +			.vf_id = cpu_to_le16(pds_vfio->vf_id),
> +		},
> +	};
> +	struct device *dev = pds_vfio_to_dev(pds_vfio);
> +	union pds_core_adminq_comp comp = {};
> +	int err;
> +
> +	dev_dbg(dev, "vf%u: Suspend device\n", pds_vfio->vf_id);
> +
> +	err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp,
> +					 PDS_AQ_FLAG_FASTPOLL);
> +	if (err) {
> +		dev_err(dev, "vf%u: Suspend failed: %pe\n", pds_vfio->vf_id,
> +			ERR_PTR(err));
> +		return err;
> +	}
> +
> +	return pds_vfio_suspend_wait_device_cmd(pds_vfio);
> +}

The logic in this function is very confusing.

PDS_LM_CMD_SUSPEND has a completion record:

+struct pds_lm_suspend_comp {
+	u8     status;
+	u8     rsvd;
+	__le16 comp_index;
+	union {
+		__le64 state_size;
+		u8     rsvd2[11];
+	} __packed;
+	u8     color;

Presumably this function can look at the completion record to know whether
the suspend request succeeded.

Why do you require another wait_device step to query the suspend status?
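
i.e. I'd naively expect something like the below to be sufficient (sketch
only; it assumes the suspend status in the completion is final once the
adminq call returns, and simply overlays struct pds_lm_suspend_comp on the
generic completion per the quoted layout):

	struct pds_lm_suspend_comp *suspend_comp =
		(struct pds_lm_suspend_comp *)&comp;
	int err;

	err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp,
					 PDS_AQ_FLAG_FASTPOLL);
	if (err)
		return err;

	/* a non-zero pds_core_status_code means the suspend request failed */
	if (suspend_comp->status)
		return -EIO;

	return 0;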

And I have another question: is it correct to hard-code the 5s timeout in
the kernel without any input from the VMM? Note that the guest has already
been stopped at this point, so the 5s timeout will very likely break any
reasonable SLA which CSPs try hard to meet.

Ideally the VMM has an estimate of how long a VM can be paused, based on
the SLA, the to-be-migrated state size, the available network bandwidth,
etc., and that hint should be passed to the kernel so that any state
transition which may violate that expectation can fail quickly, breaking
the migration process and putting the VM back into the running state.

Jason/Shameer, is there a similar concern in the mlx/hisilicon drivers?

> +
> +int pds_vfio_resume_device_cmd(struct pds_vfio_pci_device *pds_vfio)
> +{
> +	union pds_core_adminq_cmd cmd = {
> +		.lm_resume = {
> +			.opcode = PDS_LM_CMD_RESUME,
> +			.vf_id = cpu_to_le16(pds_vfio->vf_id),
> +		},
> +	};
> +	struct device *dev = pds_vfio_to_dev(pds_vfio);
> +	union pds_core_adminq_comp comp = {};
> +
> +	dev_dbg(dev, "vf%u: Resume device\n", pds_vfio->vf_id);
> +
> +	return pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp,
> +					  0);

'resume' also happens in the blackout phase, while the guest is not running.

So presumably FASTPOLL should be set here too; otherwise the max 256ms
poll interval (PDSC_ADMINQ_MAX_POLL_INTERVAL) is really inefficient.
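
i.e. presumably just:

	return pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp,
					  PDS_AQ_FLAG_FASTPOLL);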

> +
> +	if (cur == VFIO_DEVICE_STATE_RUNNING && next == VFIO_DEVICE_STATE_RUNNING_P2P) {
> +		pds_vfio_send_host_vf_lm_status_cmd(pds_vfio,
> +						    PDS_LM_STA_IN_PROGRESS);
> +		err = pds_vfio_suspend_device_cmd(pds_vfio);
> +		if (err)
> +			return ERR_PTR(err);
> +
> +		return NULL;
> +	}
> +
> +	if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && next == VFIO_DEVICE_STATE_RUNNING) {
> +		err = pds_vfio_resume_device_cmd(pds_vfio);
> +		if (err)
> +			return ERR_PTR(err);
> +
> +		pds_vfio_send_host_vf_lm_status_cmd(pds_vfio, PDS_LM_STA_NONE);
> +		return NULL;
> +	}
> +
> +	if (cur == VFIO_DEVICE_STATE_STOP && next == VFIO_DEVICE_STATE_RUNNING_P2P)
> +		return NULL;
> +
> +	if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && next == VFIO_DEVICE_STATE_STOP)
> +		return NULL;

I'm not sure whether P2P is actually supported here. By definition
P2P means the device is stopped but still responds to p2p requests
from other devices. If you look at the mlx example, it uses different
cmds for RUNNING->RUNNING_P2P and RUNNING_P2P->STOP.

But in your case it seems you simply move what is required for STOP
into P2P. Probably you can just remove the P2P support, like
hisilicon does.
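
For illustration only, a rough sketch of the no-P2P alternative (I'm
assuming the migration_flags field is reachable as below; please
double-check the exact field names in your driver):

	pds_vfio->vfio_coredev.vdev.migration_flags = VFIO_MIGRATION_STOP_COPY;

and the corresponding arcs in the state handler become:

	if (cur == VFIO_DEVICE_STATE_RUNNING && next == VFIO_DEVICE_STATE_STOP) {
		pds_vfio_send_host_vf_lm_status_cmd(pds_vfio,
						    PDS_LM_STA_IN_PROGRESS);
		err = pds_vfio_suspend_device_cmd(pds_vfio);
		if (err)
			return ERR_PTR(err);

		return NULL;
	}

	if (cur == VFIO_DEVICE_STATE_STOP && next == VFIO_DEVICE_STATE_RUNNING) {
		err = pds_vfio_resume_device_cmd(pds_vfio);
		if (err)
			return ERR_PTR(err);

		pds_vfio_send_host_vf_lm_status_cmd(pds_vfio, PDS_LM_STA_NONE);
		return NULL;
	}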

> +
> +/**
> + * struct pds_lm_comp - generic command completion
> + * @status:	Status of the command (enum pds_core_status_code)
> + * @rsvd:	Structure padding to 16 Bytes
> + */
> +struct pds_lm_comp {
> +	u8 status;
> +	u8 rsvd[15];
> +};

Not used. It looks like most of the comp structures are defined without
any user, except struct pds_lm_status_comp.

