* [PATCH v10 vfio 0/7] pds_vfio driver
@ 2023-06-02 22:03 Brett Creeley
From: Brett Creeley @ 2023-06-02 22:03 UTC
  To: kvm, netdev, alex.williamson, jgg, yishaih,
	shameerali.kolothum.thodi, kevin.tian
  Cc: brett.creeley, shannon.nelson

This is a patchset for a new vendor-specific VFIO driver
(pds_vfio) for use with the AMD/Pensando Distributed Services Card
(DSC). This driver makes use of the pds_core driver.

This driver will use the pds_core device's adminq as the VFIO
control path to the DSC. In order to make adminq calls, the VFIO
instance makes use of functions exported by the pds_core driver.

In order to receive events from pds_core, the pds_vfio driver
registers to a private notifier. This is needed for various events
that come from the device.

The ASCII diagram below shows roughly where a pds_vfio instance sits.
The driver works with the VFIO subsystem to provide VFIO and live
migration support for the VF device.

                               .------.  .-----------------------.
                               | QEMU |--|  VM  .-------------.  |
                               '......'  |      |   Eth VF    |  |
                                  |      |      .-------------.  |
                                  |      |      |  SR-IOV VF  |  |
                                  |      |      '-------------'  |
                                  |      '------------||---------'
                               .--------------.       ||
                               |/dev/<vfio_fd>|       ||
                               '--------------'       ||
Host Userspace                         |              ||
===================================================   ||
Host Kernel                            |              ||
                                  .--------.          ||
                                  |vfio-pci|          ||
                                  '--------'          ||
       .------------------.           ||              ||
       |   | exported API |<----+     ||              ||
       |   '--------------|     |     ||              ||
       |                  |    .-------------.        ||
       |     pds_core     |--->|   pds_vfio  |        ||
       '------------------' |  '-------------'        ||
               ||           |         ||              ||
             09:00.0     notifier    09:00.1          ||
== PCI ===============================================||=====
               ||                     ||              ||
          .----------.          .----------.          ||
    ,-----|    PF    |----------|    VF    |-------------------,
    |     '----------'          '----------'  |       VF       |
    |                     DSC                 |  data/control  |
    |                                         |      path      |
    -----------------------------------------------------------


The pds_vfio driver is targeted to reside in drivers/vfio/pci/pds.
It makes use of and introduces new files in the common include/linux/pds
include directory.

Changes:

v10:
- Various fixes/suggestions by Jason Gunthorpe
	- Simplify pds_vfio_get_lm_file() based on fpga_mgr_buf_load()
	- Clean-ups/fixes based on clang-format
	- Remove any double goto labels
	- Name goto labels based on what needs to be cleaned/freed
	  instead of a "call from" scheme
	- Fix any goto unwind ordering issues
	- Make sure to call dma_map_single() after data is written to
	  memory in pds_vfio_dma_map_lm_file()
	- Don't use bitmap_zalloc() for the dirty bitmaps
- Use vzalloc() for dirty bitmaps and refactor how the bitmaps are DMA'd
  to and from the device in pds_vfio_dirty_seq_ack()
- Remove unnecessary goto in pds_vfio_dirty_disable()

v9:
https://lore.kernel.org/netdev/20230422010642.60720-1-brett.creeley@amd.com/
- Various fixes/suggestions by Alex Williamson
	- Fix how ID is generated in client registration
	- Add helper functions to get the VF's struct device and struct
	  pci_dev pointers instead of caching the struct pci_dev
	- Remove redundant pds_vfio_lm_state() function and remove any
	  places this was being called
	- Fix multi-line comments to follow standard convention
	- Remove confusing comments in
	  pds_vfio_step_device_state_locked() since the driver's
	  migration states align with the VFIO documentation
	- Validate pdsc returned from pdsc_get_pf_struct()
- Various fixes/suggestions by Jason Gunthorpe
	- Use struct pdsc instead of void *
	- Use {} instead of {0} for structure initialization
	- Use unions on the stack instead of casting to the union when
	  sending AQ commands, which required including pds_lm.h in
	  pds_adminq.h
	- Replace use of dma_alloc_coherent() when creating the sgl DMA
	  entries for the LM file
	- Remove cached struct device *coredev and instead use
	  pci_physfn() to get the pds_core's struct device pointer
	- Drop the recovery work item and call pds_vfio_recovery()
	  directly from the notifier callback
	- Remove unnecessary #define for "pds_vfio_lm" and just use the
	  string inline to the anon_inode_getfile() argument
- Fix LM file reference counting
- Move initialization of some struct members to when the struct is being
  initialized for AQ commands
- Make use of GFP_KERNEL_ACCOUNT where it makes sense
- Replace PDS_VFIO_DRV_NAME with KBUILD_MODNAME
- Update to latest pds_core exported functions
- Remove duplicated prototypes for
  pds_vfio_dma_logging_[start|stop|report] from lm.h
- Hold pds_vfio->state_mutex while starting, stopping, and reporting
  dirty page tracking in pds_vfio_dma_logging_[start|stop|report]
- Remove duplicate PDS_DEV_TYPE_LM_STR define from pds_lm.h that's
  already included in pds_common.h
- Replace use of dma_alloc_coherent() when creating the sgl DMA
  entries for the dirty bitmaps

v8:
https://lore.kernel.org/netdev/20230404190141.57762-1-brett.creeley@amd.com/
- provide default iommufd callbacks for bind_iommufd, unbind_iommufd, and
  attach_ioas for the VFIO device as suggested by Shameerali Kolothum
  Thodi

v7:
https://lore.kernel.org/netdev/20230331003612.17569-1-brett.creeley@amd.com/
- Disable and clean up dirty page tracking when the VFIO device is closed
- Various improvements suggested by Simon Horman:
	- Fix RCT in vfio_combine_iova_ranges()
	- Simplify function exit paths by removing unnecessary goto
	  labels
	- Clean up pds_vfio_print_guest_region_info() by adding a goto
	  label for freeing memory, which allowed for reduced
	  indentation on a for loop
	- Where possible use C99 style for loops

v6:
https://lore.kernel.org/netdev/20230327200553.13951-1-brett.creeley@amd.com/
- As suggested by Alex Williamson, use pci_domain_nr() macro to make sure
  the pds_vfio client's devname is unique
- Remove unnecessary forward declaration and include
- Fix copyright comment to use correct company name
- Remove "." from struct documentation for consistency

v5:
https://lore.kernel.org/netdev/20230322203442.56169-1-brett.creeley@amd.com/
- Fix SPDX comments in .h files
- Remove adminqcq argument from pdsc_post_adminq() uses
- Unregister client on vfio_pci_core_register_device() failure
- Other minor checkpatch issues

v4:
https://lore.kernel.org/netdev/20230308052450.13421-1-brett.creeley@amd.com/
- Update cover letter ASCII diagram to reflect new driver architecture
- Remove auxiliary driver implementation
- Use pds_core's exported functions to communicate with the device
- Implement and register notifier for events from the device/pds_core
- Use module_pci_driver() macro since auxiliary driver configuration is
  no longer needed in __init/__exit

v3:
https://lore.kernel.org/netdev/20230219083908.40013-1-brett.creeley@amd.com/
- Update copyright year to 2023 and use "Advanced Micro Devices, Inc."
  for the company name
- Clarify the fact that AMD/Pensando's VFIO solution is device type
  agnostic, which aligns with other current VFIO solutions
- Add line in drivers/vfio/pci/Makefile to build pds_vfio
- Move documentation to amd sub-directory
- Remove some dead code due to the pds_core implementation of
  listening to BIND/UNBIND events
- Move a dev_dbg() to a previous patch in the series
- Add implementation for vfio_migration_ops.migration_get_data_size to
  return the maximum possible device state size

RFC to v2:
https://lore.kernel.org/all/20221214232136.64220-1-brett.creeley@amd.com/
- Implement state transitions for VFIO_MIGRATION_P2P flag
- Improve auxiliary driver probe by returning EPROBE_DEFER
  when the PCI driver is not set up correctly
- Add pointer to docs in
  Documentation/networking/device_drivers/ethernet/index.rst

RFC:
https://lore.kernel.org/all/20221207010705.35128-1-brett.creeley@amd.com/


Brett Creeley (7):
  vfio: Commonize combine_ranges for use in other VFIO drivers
  vfio/pds: Initial support for pds_vfio VFIO driver
  vfio/pds: register with the pds_core PF
  vfio/pds: Add VFIO live migration support
  vfio/pds: Add support for dirty page tracking
  vfio/pds: Add support for firmware recovery
  vfio/pds: Add Kconfig and documentation

 .../device_drivers/ethernet/amd/pds_vfio.rst  |  79 +++
 .../device_drivers/ethernet/index.rst         |   1 +
 MAINTAINERS                                   |   7 +
 drivers/vfio/pci/Kconfig                      |   2 +
 drivers/vfio/pci/Makefile                     |   2 +
 drivers/vfio/pci/mlx5/cmd.c                   |  48 +-
 drivers/vfio/pci/pds/Kconfig                  |  20 +
 drivers/vfio/pci/pds/Makefile                 |  11 +
 drivers/vfio/pci/pds/cmds.c                   | 487 +++++++++++++++
 drivers/vfio/pci/pds/cmds.h                   |  25 +
 drivers/vfio/pci/pds/dirty.c                  | 577 ++++++++++++++++++
 drivers/vfio/pci/pds/dirty.h                  |  38 ++
 drivers/vfio/pci/pds/lm.c                     | 421 +++++++++++++
 drivers/vfio/pci/pds/lm.h                     |  41 ++
 drivers/vfio/pci/pds/pci_drv.c                | 206 +++++++
 drivers/vfio/pci/pds/pci_drv.h                |   9 +
 drivers/vfio/pci/pds/vfio_dev.c               | 234 +++++++
 drivers/vfio/pci/pds/vfio_dev.h               |  45 ++
 drivers/vfio/vfio_main.c                      |  47 ++
 include/linux/pds/pds_adminq.h                | 395 ++++++++++++
 include/linux/pds/pds_common.h                |   2 +
 include/linux/vfio.h                          |   3 +
 22 files changed, 2653 insertions(+), 47 deletions(-)
 create mode 100644 Documentation/networking/device_drivers/ethernet/amd/pds_vfio.rst
 create mode 100644 drivers/vfio/pci/pds/Kconfig
 create mode 100644 drivers/vfio/pci/pds/Makefile
 create mode 100644 drivers/vfio/pci/pds/cmds.c
 create mode 100644 drivers/vfio/pci/pds/cmds.h
 create mode 100644 drivers/vfio/pci/pds/dirty.c
 create mode 100644 drivers/vfio/pci/pds/dirty.h
 create mode 100644 drivers/vfio/pci/pds/lm.c
 create mode 100644 drivers/vfio/pci/pds/lm.h
 create mode 100644 drivers/vfio/pci/pds/pci_drv.c
 create mode 100644 drivers/vfio/pci/pds/pci_drv.h
 create mode 100644 drivers/vfio/pci/pds/vfio_dev.c
 create mode 100644 drivers/vfio/pci/pds/vfio_dev.h

-- 
2.17.1



* [PATCH v10 vfio 1/7] vfio: Commonize combine_ranges for use in other VFIO drivers
From: Brett Creeley @ 2023-06-02 22:03 UTC
  To: kvm, netdev, alex.williamson, jgg, yishaih,
	shameerali.kolothum.thodi, kevin.tian
  Cc: brett.creeley, shannon.nelson

Currently only Mellanox uses the combine_ranges function. The
new pds_vfio driver also needs this function. So, move it to
a common location for other vendor drivers to use.

Also, Simon Horman noticed that RCT (reverse Christmas tree) ordering
was not followed for vfio_combine_iova_ranges(), so fix that.
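
For readers unfamiliar with the helper, its merging strategy can be
sketched in userspace C roughly as below. This is only an illustration
under stated assumptions: it walks a sorted array instead of the
kernel's interval tree, and struct iova_range and
combine_ranges_sketch() are hypothetical names that do not exist in
the kernel.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct interval_tree_node. */
struct iova_range {
	unsigned long start;
	unsigned long last;	/* inclusive end, as in the interval tree */
};

/*
 * Merge neighbours in a sorted, non-overlapping array until at most
 * req ranges remain. Returns the new number of ranges.
 */
static size_t combine_ranges_sketch(struct iova_range *r, size_t cur,
				    size_t req)
{
	assert(req >= 1);

	while (cur > req) {
		unsigned long min_gap = ULONG_MAX;
		size_t merge_at = 0;	/* index of the range to extend */

		/* Find the adjacent pair separated by the smallest gap. */
		for (size_t i = 1; i < cur; i++) {
			unsigned long gap = r[i].start - r[i - 1].last;

			if (gap < min_gap) {
				min_gap = gap;
				merge_at = i - 1;
			}
		}

		/* Extend the left range over the right one, then compact. */
		r[merge_at].last = r[merge_at + 1].last;
		for (size_t i = merge_at + 1; i + 1 < cur; i++)
			r[i] = r[i + 1];
		cur--;
	}

	return cur;
}
```

The kernel function additionally has a shortcut for req_nodes == 1
that collapses everything into the first node in a single pass; the
sketch reaches the same result through the general loop. Merging the
closest neighbours first keeps the amount of untracked-but-covered
address space added by each merge as small as possible.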

Cc: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
---
 drivers/vfio/pci/mlx5/cmd.c | 48 +------------------------------------
 drivers/vfio/vfio_main.c    | 47 ++++++++++++++++++++++++++++++++++++
 include/linux/vfio.h        |  3 +++
 3 files changed, 51 insertions(+), 47 deletions(-)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index deed156e6165..7f6c51992a15 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -732,52 +732,6 @@ void mlx5fv_cmd_clean_migf_resources(struct mlx5_vf_migration_file *migf)
 	mlx5vf_cmd_dealloc_pd(migf);
 }
 
-static void combine_ranges(struct rb_root_cached *root, u32 cur_nodes,
-			   u32 req_nodes)
-{
-	struct interval_tree_node *prev, *curr, *comb_start, *comb_end;
-	unsigned long min_gap;
-	unsigned long curr_gap;
-
-	/* Special shortcut when a single range is required */
-	if (req_nodes == 1) {
-		unsigned long last;
-
-		curr = comb_start = interval_tree_iter_first(root, 0, ULONG_MAX);
-		while (curr) {
-			last = curr->last;
-			prev = curr;
-			curr = interval_tree_iter_next(curr, 0, ULONG_MAX);
-			if (prev != comb_start)
-				interval_tree_remove(prev, root);
-		}
-		comb_start->last = last;
-		return;
-	}
-
-	/* Combine ranges which have the smallest gap */
-	while (cur_nodes > req_nodes) {
-		prev = NULL;
-		min_gap = ULONG_MAX;
-		curr = interval_tree_iter_first(root, 0, ULONG_MAX);
-		while (curr) {
-			if (prev) {
-				curr_gap = curr->start - prev->last;
-				if (curr_gap < min_gap) {
-					min_gap = curr_gap;
-					comb_start = prev;
-					comb_end = curr;
-				}
-			}
-			prev = curr;
-			curr = interval_tree_iter_next(curr, 0, ULONG_MAX);
-		}
-		comb_start->last = comb_end->last;
-		interval_tree_remove(comb_end, root);
-		cur_nodes--;
-	}
-}
-
 static int mlx5vf_create_tracker(struct mlx5_core_dev *mdev,
 				 struct mlx5vf_pci_core_device *mvdev,
 				 struct rb_root_cached *ranges, u32 nnodes)
@@ -800,7 +754,7 @@ static int mlx5vf_create_tracker(struct mlx5_core_dev *mdev,
 	int i;
 
 	if (num_ranges > max_num_range) {
-		combine_ranges(ranges, nnodes, max_num_range);
+		vfio_combine_iova_ranges(ranges, nnodes, max_num_range);
 		num_ranges = max_num_range;
 	}
 
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index f0ca33b2e1df..3bde62f7e08b 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -865,6 +865,53 @@ static int vfio_ioctl_device_feature_migration(struct vfio_device *device,
 	return 0;
 }
 
+void vfio_combine_iova_ranges(struct rb_root_cached *root, u32 cur_nodes,
+			      u32 req_nodes)
+{
+	struct interval_tree_node *prev, *curr, *comb_start, *comb_end;
+	unsigned long min_gap, curr_gap;
+
+	/* Special shortcut when a single range is required */
+	if (req_nodes == 1) {
+		unsigned long last;
+
+		comb_start = interval_tree_iter_first(root, 0, ULONG_MAX);
+		curr = comb_start;
+		while (curr) {
+			last = curr->last;
+			prev = curr;
+			curr = interval_tree_iter_next(curr, 0, ULONG_MAX);
+			if (prev != comb_start)
+				interval_tree_remove(prev, root);
+		}
+		comb_start->last = last;
+		return;
+	}
+
+	/* Combine ranges which have the smallest gap */
+	while (cur_nodes > req_nodes) {
+		prev = NULL;
+		min_gap = ULONG_MAX;
+		curr = interval_tree_iter_first(root, 0, ULONG_MAX);
+		while (curr) {
+			if (prev) {
+				curr_gap = curr->start - prev->last;
+				if (curr_gap < min_gap) {
+					min_gap = curr_gap;
+					comb_start = prev;
+					comb_end = curr;
+				}
+			}
+			prev = curr;
+			curr = interval_tree_iter_next(curr, 0, ULONG_MAX);
+		}
+		comb_start->last = comb_end->last;
+		interval_tree_remove(comb_end, root);
+		cur_nodes--;
+	}
+}
+EXPORT_SYMBOL_GPL(vfio_combine_iova_ranges);
+
 /* Ranges should fit into a single kernel page */
 #define LOG_MAX_RANGES \
 	(PAGE_SIZE / sizeof(struct vfio_device_feature_dma_logging_range))
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 2c137ea94a3e..f49933b63ac3 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -245,6 +245,9 @@ int vfio_mig_get_next_state(struct vfio_device *device,
 			    enum vfio_device_mig_state new_fsm,
 			    enum vfio_device_mig_state *next_fsm);
 
+void vfio_combine_iova_ranges(struct rb_root_cached *root, u32 cur_nodes,
+			      u32 req_nodes);
+
 /*
  * External user API
  */
-- 
2.17.1



* [PATCH v10 vfio 2/7] vfio/pds: Initial support for pds_vfio VFIO driver
From: Brett Creeley @ 2023-06-02 22:03 UTC
  To: kvm, netdev, alex.williamson, jgg, yishaih,
	shameerali.kolothum.thodi, kevin.tian
  Cc: brett.creeley, shannon.nelson

This is the initial framework for the new pds_vfio device driver. This
does the very basics of registering the PDS PCI device and configuring
it as a VFIO PCI device.

With this change, the VF device can be bound to the pds_vfio driver on
the host and presented to the VM as the VF's device type.
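
The driver embeds the VFIO PCI core device inside its own device
structure and recovers the outer structure with container_of(). A
minimal userspace sketch of that pattern, assuming simplified stand-in
types (every sketch_* name is hypothetical and not part of the patch):

```c
#include <assert.h>
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of() macro. */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified stand-ins for struct vfio_device / vfio_pci_core_device. */
struct sketch_vdev {
	int dummy;
};

struct sketch_coredev {
	int irq_type;
	struct sketch_vdev vdev;
};

/* Mirrors the layout idea of struct pds_vfio_pci_device. */
struct sketch_pds_vfio {
	struct sketch_coredev vfio_coredev;
	int vf_id;
};

/* Walk back from the innermost member to the driver's own structure. */
static struct sketch_pds_vfio *sketch_from_vdev(struct sketch_vdev *vdev)
{
	assert(vdev);
	return sketch_container_of(vdev, struct sketch_pds_vfio,
				   vfio_coredev.vdev);
}
```

Because the core device is a member rather than a pointer, one
allocation covers both structures, and the VFIO callbacks that only
receive a struct vfio_device can still reach the driver's private
fields without a lookup.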

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
---
 drivers/vfio/pci/Makefile       |  2 +
 drivers/vfio/pci/pds/Makefile   |  8 ++++
 drivers/vfio/pci/pds/pci_drv.c  | 69 +++++++++++++++++++++++++++++++
 drivers/vfio/pci/pds/vfio_dev.c | 72 +++++++++++++++++++++++++++++++++
 drivers/vfio/pci/pds/vfio_dev.h | 20 +++++++++
 5 files changed, 171 insertions(+)
 create mode 100644 drivers/vfio/pci/pds/Makefile
 create mode 100644 drivers/vfio/pci/pds/pci_drv.c
 create mode 100644 drivers/vfio/pci/pds/vfio_dev.c
 create mode 100644 drivers/vfio/pci/pds/vfio_dev.h

diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 24c524224da5..45167be462d8 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -11,3 +11,5 @@ obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
 obj-$(CONFIG_MLX5_VFIO_PCI)           += mlx5/
 
 obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
+
+obj-$(CONFIG_PDS_VFIO_PCI) += pds/
diff --git a/drivers/vfio/pci/pds/Makefile b/drivers/vfio/pci/pds/Makefile
new file mode 100644
index 000000000000..e1a55ae0f079
--- /dev/null
+++ b/drivers/vfio/pci/pds/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2023 Advanced Micro Devices, Inc.
+
+obj-$(CONFIG_PDS_VFIO_PCI) += pds_vfio.o
+
+pds_vfio-y := \
+	pci_drv.o	\
+	vfio_dev.o
diff --git a/drivers/vfio/pci/pds/pci_drv.c b/drivers/vfio/pci/pds/pci_drv.c
new file mode 100644
index 000000000000..0e84249069d4
--- /dev/null
+++ b/drivers/vfio/pci/pds/pci_drv.c
@@ -0,0 +1,69 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2023 Advanced Micro Devices, Inc. */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/types.h>
+#include <linux/vfio.h>
+
+#include <linux/pds/pds_core_if.h>
+
+#include "vfio_dev.h"
+
+#define PDS_VFIO_DRV_DESCRIPTION	"AMD/Pensando VFIO Device Driver"
+#define PCI_VENDOR_ID_PENSANDO		0x1dd8
+
+static int pds_vfio_pci_probe(struct pci_dev *pdev,
+			      const struct pci_device_id *id)
+{
+	struct pds_vfio_pci_device *pds_vfio;
+	int err;
+
+	pds_vfio = vfio_alloc_device(pds_vfio_pci_device, vfio_coredev.vdev,
+				     &pdev->dev, pds_vfio_ops_info());
+	if (IS_ERR(pds_vfio))
+		return PTR_ERR(pds_vfio);
+
+	dev_set_drvdata(&pdev->dev, &pds_vfio->vfio_coredev);
+
+	err = vfio_pci_core_register_device(&pds_vfio->vfio_coredev);
+	if (err)
+		goto out_put_vdev;
+
+	return 0;
+
+out_put_vdev:
+	vfio_put_device(&pds_vfio->vfio_coredev.vdev);
+	return err;
+}
+
+static void pds_vfio_pci_remove(struct pci_dev *pdev)
+{
+	struct pds_vfio_pci_device *pds_vfio = pds_vfio_pci_drvdata(pdev);
+
+	vfio_pci_core_unregister_device(&pds_vfio->vfio_coredev);
+	vfio_put_device(&pds_vfio->vfio_coredev.vdev);
+}
+
+static const struct pci_device_id
+pds_vfio_pci_table[] = {
+	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_PENSANDO, 0x1003) }, /* Ethernet VF */
+	{ 0, }
+};
+MODULE_DEVICE_TABLE(pci, pds_vfio_pci_table);
+
+static struct pci_driver pds_vfio_pci_driver = {
+	.name = KBUILD_MODNAME,
+	.id_table = pds_vfio_pci_table,
+	.probe = pds_vfio_pci_probe,
+	.remove = pds_vfio_pci_remove,
+	.driver_managed_dma = true,
+};
+
+module_pci_driver(pds_vfio_pci_driver);
+
+MODULE_DESCRIPTION(PDS_VFIO_DRV_DESCRIPTION);
+MODULE_AUTHOR("Advanced Micro Devices, Inc.");
+MODULE_LICENSE("GPL");
diff --git a/drivers/vfio/pci/pds/vfio_dev.c b/drivers/vfio/pci/pds/vfio_dev.c
new file mode 100644
index 000000000000..4038dac90a97
--- /dev/null
+++ b/drivers/vfio/pci/pds/vfio_dev.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2023 Advanced Micro Devices, Inc. */
+
+#include <linux/vfio.h>
+#include <linux/vfio_pci_core.h>
+
+#include "vfio_dev.h"
+
+struct pds_vfio_pci_device *pds_vfio_pci_drvdata(struct pci_dev *pdev)
+{
+	struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
+
+	return container_of(core_device, struct pds_vfio_pci_device,
+			    vfio_coredev);
+}
+
+static int pds_vfio_init_device(struct vfio_device *vdev)
+{
+	struct pds_vfio_pci_device *pds_vfio =
+		container_of(vdev, struct pds_vfio_pci_device,
+			     vfio_coredev.vdev);
+	struct pci_dev *pdev = to_pci_dev(vdev->dev);
+	int err;
+
+	err = vfio_pci_core_init_dev(vdev);
+	if (err)
+		return err;
+
+	pds_vfio->vf_id = pci_iov_vf_id(pdev);
+	pds_vfio->pci_id = PCI_DEVID(pdev->bus->number, pdev->devfn);
+
+	return 0;
+}
+
+static int pds_vfio_open_device(struct vfio_device *vdev)
+{
+	struct pds_vfio_pci_device *pds_vfio =
+		container_of(vdev, struct pds_vfio_pci_device,
+			     vfio_coredev.vdev);
+	int err;
+
+	err = vfio_pci_core_enable(&pds_vfio->vfio_coredev);
+	if (err)
+		return err;
+
+	vfio_pci_core_finish_enable(&pds_vfio->vfio_coredev);
+
+	return 0;
+}
+
+static const struct vfio_device_ops pds_vfio_ops = {
+	.name = "pds-vfio",
+	.init = pds_vfio_init_device,
+	.release = vfio_pci_core_release_dev,
+	.open_device = pds_vfio_open_device,
+	.close_device = vfio_pci_core_close_device,
+	.ioctl = vfio_pci_core_ioctl,
+	.device_feature = vfio_pci_core_ioctl_feature,
+	.read = vfio_pci_core_read,
+	.write = vfio_pci_core_write,
+	.mmap = vfio_pci_core_mmap,
+	.request = vfio_pci_core_request,
+	.match = vfio_pci_core_match,
+	.bind_iommufd = vfio_iommufd_physical_bind,
+	.unbind_iommufd = vfio_iommufd_physical_unbind,
+	.attach_ioas = vfio_iommufd_physical_attach_ioas,
+};
+
+const struct vfio_device_ops *pds_vfio_ops_info(void)
+{
+	return &pds_vfio_ops;
+}
diff --git a/drivers/vfio/pci/pds/vfio_dev.h b/drivers/vfio/pci/pds/vfio_dev.h
new file mode 100644
index 000000000000..66cfcab5b5bf
--- /dev/null
+++ b/drivers/vfio/pci/pds/vfio_dev.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2023 Advanced Micro Devices, Inc. */
+
+#ifndef _VFIO_DEV_H_
+#define _VFIO_DEV_H_
+
+#include <linux/pci.h>
+#include <linux/vfio_pci_core.h>
+
+struct pds_vfio_pci_device {
+	struct vfio_pci_core_device vfio_coredev;
+
+	int vf_id;
+	int pci_id;
+};
+
+const struct vfio_device_ops *pds_vfio_ops_info(void);
+struct pds_vfio_pci_device *pds_vfio_pci_drvdata(struct pci_dev *pdev);
+
+#endif /* _VFIO_DEV_H_ */
-- 
2.17.1



* [PATCH v10 vfio 3/7] vfio/pds: register with the pds_core PF
From: Brett Creeley @ 2023-06-02 22:03 UTC
  To: kvm, netdev, alex.williamson, jgg, yishaih,
	shameerali.kolothum.thodi, kevin.tian
  Cc: brett.creeley, shannon.nelson

The pds_core driver will supply adminq services, so find the PF
and register with the DSC services.

Use the following command to enable a VF:
echo 1 > /sys/bus/pci/drivers/pds_core/$PF_BDF/sriov_numvfs
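
The client name passed to pds_client_register() is built from the PCI
domain number and the VF's bus/devfn ID, so that VFs with the same
bus/devfn in different domains still get distinct names. A userspace
sketch of that formatting follows. The string "pds_core.LM" and the
32-byte buffer length are assumptions made for the example (the driver
takes both from the pds headers), and the sketch_* names are
hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed values for this sketch only. */
#define SKETCH_LM_DEV_NAME	"pds_core.LM"
#define SKETCH_DEVNAME_LEN	32

/* Same packing as the kernel's PCI_DEVID(): bus in 15:8, devfn in 7:0. */
static unsigned int sketch_pci_devid(unsigned int bus, unsigned int devfn)
{
	return ((bus & 0xff) << 8) | (devfn & 0xff);
}

/* Format "<name>.<domain>-<devid>", like the snprintf() in cmds.c. */
static int sketch_client_devname(char *buf, size_t len, int domain,
				 unsigned int bus, unsigned int devfn)
{
	assert(buf && len > 0);
	return snprintf(buf, len, "%s.%d-%u", SKETCH_LM_DEV_NAME, domain,
			sketch_pci_devid(bus, devfn));
}
```

With these assumptions, a VF at 09:00.1 in domain 0 is named
"pds_core.LM.0-2305" (0x0901 printed as an unsigned decimal), while the
same bus/devfn in domain 1 yields "pds_core.LM.1-2305".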

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
---
 drivers/vfio/pci/pds/Makefile   |  1 +
 drivers/vfio/pci/pds/cmds.c     | 43 +++++++++++++++++++++++++++++++++
 drivers/vfio/pci/pds/cmds.h     | 10 ++++++++
 drivers/vfio/pci/pds/pci_drv.c  | 19 +++++++++++++++
 drivers/vfio/pci/pds/pci_drv.h  |  9 +++++++
 drivers/vfio/pci/pds/vfio_dev.c | 11 +++++++++
 drivers/vfio/pci/pds/vfio_dev.h |  6 +++++
 include/linux/pds/pds_common.h  |  2 ++
 8 files changed, 101 insertions(+)
 create mode 100644 drivers/vfio/pci/pds/cmds.c
 create mode 100644 drivers/vfio/pci/pds/cmds.h
 create mode 100644 drivers/vfio/pci/pds/pci_drv.h

diff --git a/drivers/vfio/pci/pds/Makefile b/drivers/vfio/pci/pds/Makefile
index e1a55ae0f079..87581111fa17 100644
--- a/drivers/vfio/pci/pds/Makefile
+++ b/drivers/vfio/pci/pds/Makefile
@@ -4,5 +4,6 @@
 obj-$(CONFIG_PDS_VFIO_PCI) += pds_vfio.o
 
 pds_vfio-y := \
+	cmds.o		\
 	pci_drv.o	\
 	vfio_dev.o
diff --git a/drivers/vfio/pci/pds/cmds.c b/drivers/vfio/pci/pds/cmds.c
new file mode 100644
index 000000000000..ae01f5df2f5c
--- /dev/null
+++ b/drivers/vfio/pci/pds/cmds.c
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2023 Advanced Micro Devices, Inc. */
+
+#include <linux/io.h>
+#include <linux/types.h>
+
+#include <linux/pds/pds_common.h>
+#include <linux/pds/pds_core_if.h>
+#include <linux/pds/pds_adminq.h>
+
+#include "vfio_dev.h"
+#include "cmds.h"
+
+int pds_vfio_register_client_cmd(struct pds_vfio_pci_device *pds_vfio)
+{
+	struct pci_dev *pdev = pds_vfio_to_pci_dev(pds_vfio);
+	char devname[PDS_DEVNAME_LEN];
+	int ci;
+
+	snprintf(devname, sizeof(devname), "%s.%d-%u", PDS_LM_DEV_NAME,
+		 pci_domain_nr(pdev->bus), pds_vfio->pci_id);
+
+	ci = pds_client_register(pci_physfn(pdev), devname);
+	if (ci <= 0)
+		return ci;
+
+	pds_vfio->client_id = ci;
+
+	return 0;
+}
+
+void pds_vfio_unregister_client_cmd(struct pds_vfio_pci_device *pds_vfio)
+{
+	struct pci_dev *pdev = pds_vfio_to_pci_dev(pds_vfio);
+	int err;
+
+	err = pds_client_unregister(pci_physfn(pdev), pds_vfio->client_id);
+	if (err)
+		dev_err(&pdev->dev, "unregister from DSC failed: %pe\n",
+			ERR_PTR(err));
+
+	pds_vfio->client_id = 0;
+}
diff --git a/drivers/vfio/pci/pds/cmds.h b/drivers/vfio/pci/pds/cmds.h
new file mode 100644
index 000000000000..4c592afccf89
--- /dev/null
+++ b/drivers/vfio/pci/pds/cmds.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2023 Advanced Micro Devices, Inc. */
+
+#ifndef _CMDS_H_
+#define _CMDS_H_
+
+int pds_vfio_register_client_cmd(struct pds_vfio_pci_device *pds_vfio);
+void pds_vfio_unregister_client_cmd(struct pds_vfio_pci_device *pds_vfio);
+
+#endif /* _CMDS_H_ */
diff --git a/drivers/vfio/pci/pds/pci_drv.c b/drivers/vfio/pci/pds/pci_drv.c
index 0e84249069d4..a49420aa9736 100644
--- a/drivers/vfio/pci/pds/pci_drv.c
+++ b/drivers/vfio/pci/pds/pci_drv.c
@@ -8,9 +8,13 @@
 #include <linux/types.h>
 #include <linux/vfio.h>
 
+#include <linux/pds/pds_common.h>
 #include <linux/pds/pds_core_if.h>
+#include <linux/pds/pds_adminq.h>
 
 #include "vfio_dev.h"
+#include "pci_drv.h"
+#include "cmds.h"
 
 #define PDS_VFIO_DRV_DESCRIPTION	"AMD/Pensando VFIO Device Driver"
 #define PCI_VENDOR_ID_PENSANDO		0x1dd8
@@ -27,13 +31,27 @@ static int pds_vfio_pci_probe(struct pci_dev *pdev,
 		return PTR_ERR(pds_vfio);
 
 	dev_set_drvdata(&pdev->dev, &pds_vfio->vfio_coredev);
+	pds_vfio->pdsc = pdsc_get_pf_struct(pdev);
+	if (IS_ERR_OR_NULL(pds_vfio->pdsc)) {
+		err = PTR_ERR(pds_vfio->pdsc) ?: -ENODEV;
+		goto out_put_vdev;
+	}
 
 	err = vfio_pci_core_register_device(&pds_vfio->vfio_coredev);
 	if (err)
 		goto out_put_vdev;
 
+	err = pds_vfio_register_client_cmd(pds_vfio);
+	if (err) {
+		dev_err(&pdev->dev, "failed to register as client: %pe\n",
+			ERR_PTR(err));
+		goto out_unregister_coredev;
+	}
+
 	return 0;
 
+out_unregister_coredev:
+	vfio_pci_core_unregister_device(&pds_vfio->vfio_coredev);
 out_put_vdev:
 	vfio_put_device(&pds_vfio->vfio_coredev.vdev);
 	return err;
@@ -43,6 +61,7 @@ static void pds_vfio_pci_remove(struct pci_dev *pdev)
 {
 	struct pds_vfio_pci_device *pds_vfio = pds_vfio_pci_drvdata(pdev);
 
+	pds_vfio_unregister_client_cmd(pds_vfio);
 	vfio_pci_core_unregister_device(&pds_vfio->vfio_coredev);
 	vfio_put_device(&pds_vfio->vfio_coredev.vdev);
 }
diff --git a/drivers/vfio/pci/pds/pci_drv.h b/drivers/vfio/pci/pds/pci_drv.h
new file mode 100644
index 000000000000..e79bed12ed14
--- /dev/null
+++ b/drivers/vfio/pci/pds/pci_drv.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2023 Advanced Micro Devices, Inc. */
+
+#ifndef _PCI_DRV_H
+#define _PCI_DRV_H
+
+#include <linux/pci.h>
+
+#endif /* _PCI_DRV_H */
diff --git a/drivers/vfio/pci/pds/vfio_dev.c b/drivers/vfio/pci/pds/vfio_dev.c
index 4038dac90a97..39771265b78f 100644
--- a/drivers/vfio/pci/pds/vfio_dev.c
+++ b/drivers/vfio/pci/pds/vfio_dev.c
@@ -6,6 +6,11 @@
 
 #include "vfio_dev.h"
 
+struct pci_dev *pds_vfio_to_pci_dev(struct pds_vfio_pci_device *pds_vfio)
+{
+	return pds_vfio->vfio_coredev.pdev;
+}
+
 struct pds_vfio_pci_device *pds_vfio_pci_drvdata(struct pci_dev *pdev)
 {
 	struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
@@ -29,6 +34,12 @@ static int pds_vfio_init_device(struct vfio_device *vdev)
 	pds_vfio->vf_id = pci_iov_vf_id(pdev);
 	pds_vfio->pci_id = PCI_DEVID(pdev->bus->number, pdev->devfn);
 
+	dev_dbg(&pdev->dev,
+		"%s: PF %#04x VF %#04x (%d) vf_id %d domain %d pds_vfio %p\n",
+		__func__, pci_dev_id(pdev->physfn), pds_vfio->pci_id,
+		pds_vfio->pci_id, pds_vfio->vf_id, pci_domain_nr(pdev->bus),
+		pds_vfio);
+
 	return 0;
 }
 
diff --git a/drivers/vfio/pci/pds/vfio_dev.h b/drivers/vfio/pci/pds/vfio_dev.h
index 66cfcab5b5bf..92e8ff241ca8 100644
--- a/drivers/vfio/pci/pds/vfio_dev.h
+++ b/drivers/vfio/pci/pds/vfio_dev.h
@@ -7,14 +7,20 @@
 #include <linux/pci.h>
 #include <linux/vfio_pci_core.h>
 
+struct pdsc;
+
 struct pds_vfio_pci_device {
 	struct vfio_pci_core_device vfio_coredev;
+	struct pdsc *pdsc;
 
 	int vf_id;
 	int pci_id;
+	u16 client_id;
 };
 
 const struct vfio_device_ops *pds_vfio_ops_info(void);
 struct pds_vfio_pci_device *pds_vfio_pci_drvdata(struct pci_dev *pdev);
 
+struct pci_dev *pds_vfio_to_pci_dev(struct pds_vfio_pci_device *pds_vfio);
+
 #endif /* _VFIO_DEV_H_ */
diff --git a/include/linux/pds/pds_common.h b/include/linux/pds/pds_common.h
index 060331486d50..721453bdf975 100644
--- a/include/linux/pds/pds_common.h
+++ b/include/linux/pds/pds_common.h
@@ -39,6 +39,8 @@ enum pds_core_vif_types {
 #define PDS_DEV_TYPE_RDMA_STR	"RDMA"
 #define PDS_DEV_TYPE_LM_STR	"LM"
 
+#define PDS_LM_DEV_NAME		PDS_CORE_DRV_NAME "." PDS_DEV_TYPE_LM_STR
+
 #define PDS_CORE_IFNAMSIZ		16
 
 /**
-- 
2.17.1



* [PATCH v10 vfio 4/7] vfio/pds: Add VFIO live migration support
  2023-06-02 22:03 [PATCH v10 vfio 0/7] pds_vfio driver Brett Creeley
                   ` (2 preceding siblings ...)
  2023-06-02 22:03 ` [PATCH v10 vfio 3/7] vfio/pds: register with the pds_core PF Brett Creeley
@ 2023-06-02 22:03 ` Brett Creeley
  2023-06-15 21:07   ` Shameerali Kolothum Thodi
  2023-06-16  8:06   ` Tian, Kevin
  2023-06-02 22:03 ` [PATCH v10 vfio 5/7] vfio/pds: Add support for dirty page tracking Brett Creeley
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 40+ messages in thread
From: Brett Creeley @ 2023-06-02 22:03 UTC (permalink / raw)
  To: kvm, netdev, alex.williamson, jgg, yishaih,
	shameerali.kolothum.thodi, kevin.tian
  Cc: brett.creeley, shannon.nelson

Add live migration support via the VFIO subsystem. The migration
implementation aligns with the definition from uapi/vfio.h and uses
the pds_core PF's adminq for device configuration.

The ability to suspend, resume, and transfer VF device state data is
included along with the required admin queue command structures and
implementations.

PDS_LM_CMD_SUSPEND and PDS_LM_CMD_SUSPEND_STATUS are added to support
the VF device suspend operation.
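
The suspend flow in this patch is a two-step handshake: issue SUSPEND,
then poll SUSPEND_STATUS while the device answers -EAGAIN. A simplified
userspace sketch of that polling pattern follows. It assumes a poll
budget in place of the jiffies-based time limit, and fake_dev,
fake_suspend_status() and wait_for_suspend() are hypothetical names
used only for illustration, not driver symbols.

```c
#include <errno.h>

/* Hypothetical stand-in for a device whose SUSPEND_STATUS query
 * reports -EAGAIN until the suspend has completed.
 */
struct fake_dev {
	int polls_until_done;	/* remaining polls that return -EAGAIN */
};

static int fake_suspend_status(struct fake_dev *d)
{
	if (d->polls_until_done > 0) {
		d->polls_until_done--;
		return -EAGAIN;
	}
	return 0;
}

/* Poll until the status is anything other than -EAGAIN or the poll
 * budget is exhausted. max_polls models the driver's time limit.
 */
static int wait_for_suspend(struct fake_dev *d, int max_polls)
{
	int err = -EAGAIN;
	int i;

	for (i = 0; i < max_polls; i++) {
		err = fake_suspend_status(d);
		if (err != -EAGAIN)
			break;
		/* the driver sleeps SUSPEND_CHECK_INTERVAL_MS here */
	}

	return err == -EAGAIN ? -ETIMEDOUT : err;
}
```

Unlike this sketch, the driver decides on a timeout by comparing jiffies
against the limit after the loop exits.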

PDS_LM_CMD_RESUME is added to support the VF device resume operation.

PDS_LM_CMD_STATUS is added to determine the exact size of the VF
device state data.
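
The size reported by STATUS drives the allocation of the save file: the
code in lm.c rounds it up to whole pages before building the
scatterlist. A minimal userspace sketch of that arithmetic, assuming a
4 KiB page size (SKETCH_PAGE_SIZE and pages_for_state() are illustrative
names, not part of the driver):

```c
#include <stdint.h>

/* Assumed page size for illustration; the kernel uses PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096ULL

/* Round a device state size up to a whole number of pages, mirroring
 * the DIV_ROUND_UP_ULL(size, PAGE_SIZE) step in the save path.
 */
static uint64_t pages_for_state(uint64_t size)
{
	return (size + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}
```

With this assumption, PDS_LM_DEVICE_STATE_LENGTH (65536) corresponds to
16 pages.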

PDS_LM_CMD_SAVE is added to get the VF device state data.

PDS_LM_CMD_RESTORE is added to restore the VF device with the
previously saved data from PDS_LM_CMD_SAVE.

PDS_LM_CMD_HOST_VF_STATUS is added to notify the device whether a
migration is in progress from the host's perspective.
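
For orientation, the commands above map onto VFIO migration state arcs
roughly as in the table below. This is a userspace sketch derived from
pds_vfio_step_device_state_locked() in this patch; the enum, the arcs[]
table and arc_action() are hypothetical names for illustration only and
do not exist in the driver.

```c
#include <stddef.h>

/* Illustrative mirror of the VFIO migration states used by this patch. */
enum mig_state {
	ST_STOP,
	ST_RUNNING,
	ST_STOP_COPY,
	ST_RESUMING,
	ST_RUNNING_P2P,
};

struct arc {
	enum mig_state cur;
	enum mig_state next;
	const char *action;	/* adminq commands issued on this arc */
};

static const struct arc arcs[] = {
	{ ST_RUNNING,     ST_RUNNING_P2P, "HOST_VF_STATUS(in progress), SUSPEND, SUSPEND_STATUS" },
	{ ST_RUNNING_P2P, ST_RUNNING,     "RESUME, HOST_VF_STATUS(none)" },
	{ ST_STOP,        ST_STOP_COPY,   "STATUS, SAVE" },
	{ ST_STOP_COPY,   ST_STOP,        "HOST_VF_STATUS(none)" },
	{ ST_STOP,        ST_RESUMING,    "none (restore file is allocated)" },
	{ ST_RESUMING,    ST_STOP,        "RESTORE" },
	{ ST_STOP,        ST_RUNNING_P2P, "none" },
	{ ST_RUNNING_P2P, ST_STOP,        "none" },
};

/* Return the action string for a supported arc, or NULL when the
 * driver would reject the transition with -EINVAL.
 */
static const char *arc_action(enum mig_state cur, enum mig_state next)
{
	size_t i;

	for (i = 0; i < sizeof(arcs) / sizeof(arcs[0]); i++) {
		if (arcs[i].cur == cur && arcs[i].next == next)
			return arcs[i].action;
	}

	return NULL;
}
```

Multi-step requests such as RUNNING to STOP_COPY are not listed because
the driver relies on vfio_mig_get_next_state() to break them into the
single arcs shown here.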

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
---
 drivers/vfio/pci/pds/Makefile   |   1 +
 drivers/vfio/pci/pds/cmds.c     | 319 ++++++++++++++++++++++++
 drivers/vfio/pci/pds/cmds.h     |   8 +-
 drivers/vfio/pci/pds/lm.c       | 421 ++++++++++++++++++++++++++++++++
 drivers/vfio/pci/pds/lm.h       |  41 ++++
 drivers/vfio/pci/pds/pci_drv.c  |  13 +
 drivers/vfio/pci/pds/vfio_dev.c | 120 ++++++++-
 drivers/vfio/pci/pds/vfio_dev.h |  11 +
 include/linux/pds/pds_adminq.h  | 217 ++++++++++++++++
 9 files changed, 1149 insertions(+), 2 deletions(-)
 create mode 100644 drivers/vfio/pci/pds/lm.c
 create mode 100644 drivers/vfio/pci/pds/lm.h

diff --git a/drivers/vfio/pci/pds/Makefile b/drivers/vfio/pci/pds/Makefile
index 87581111fa17..dbaf613d3794 100644
--- a/drivers/vfio/pci/pds/Makefile
+++ b/drivers/vfio/pci/pds/Makefile
@@ -5,5 +5,6 @@ obj-$(CONFIG_PDS_VFIO_PCI) += pds_vfio.o
 
 pds_vfio-y := \
 	cmds.o		\
+	lm.o		\
 	pci_drv.o	\
 	vfio_dev.o
diff --git a/drivers/vfio/pci/pds/cmds.c b/drivers/vfio/pci/pds/cmds.c
index ae01f5df2f5c..256f458feb58 100644
--- a/drivers/vfio/pci/pds/cmds.c
+++ b/drivers/vfio/pci/pds/cmds.c
@@ -3,6 +3,7 @@
 
 #include <linux/io.h>
 #include <linux/types.h>
+#include <linux/delay.h>
 
 #include <linux/pds/pds_common.h>
 #include <linux/pds/pds_core_if.h>
@@ -11,6 +12,34 @@
 #include "vfio_dev.h"
 #include "cmds.h"
 
+#define SUSPEND_TIMEOUT_S		5
+#define SUSPEND_CHECK_INTERVAL_MS	1
+
+static int pds_vfio_client_adminq_cmd(struct pds_vfio_pci_device *pds_vfio,
+				      union pds_core_adminq_cmd *req,
+				      size_t req_len,
+				      union pds_core_adminq_comp *resp,
+				      u64 flags)
+{
+	union pds_core_adminq_cmd cmd = {};
+	size_t cp_len;
+	int err;
+
+	/* Wrap the client request */
+	cmd.client_request.opcode = PDS_AQ_CMD_CLIENT_CMD;
+	cmd.client_request.client_id = cpu_to_le16(pds_vfio->client_id);
+	cp_len = min_t(size_t, req_len, sizeof(cmd.client_request.client_cmd));
+	memcpy(cmd.client_request.client_cmd, req, cp_len);
+
+	err = pdsc_adminq_post(pds_vfio->pdsc, &cmd, resp,
+			       !!(flags & PDS_AQ_FLAG_FASTPOLL));
+	if (err && err != -EAGAIN)
+		dev_info(pds_vfio_to_dev(pds_vfio),
+			 "client admin cmd failed: %pe\n", ERR_PTR(err));
+
+	return err;
+}
+
 int pds_vfio_register_client_cmd(struct pds_vfio_pci_device *pds_vfio)
 {
 	struct pci_dev *pdev = pds_vfio_to_pci_dev(pds_vfio);
@@ -41,3 +70,293 @@ void pds_vfio_unregister_client_cmd(struct pds_vfio_pci_device *pds_vfio)
 
 	pds_vfio->client_id = 0;
 }
+
+static int
+pds_vfio_suspend_wait_device_cmd(struct pds_vfio_pci_device *pds_vfio)
+{
+	union pds_core_adminq_cmd cmd = {
+		.lm_suspend_status = {
+			.opcode = PDS_LM_CMD_SUSPEND_STATUS,
+			.vf_id = cpu_to_le16(pds_vfio->vf_id),
+		},
+	};
+	struct device *dev = pds_vfio_to_dev(pds_vfio);
+	union pds_core_adminq_comp comp = {};
+	unsigned long time_limit;
+	unsigned long time_start;
+	unsigned long time_done;
+	int err;
+
+	time_start = jiffies;
+	time_limit = time_start + HZ * SUSPEND_TIMEOUT_S;
+	do {
+		err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd),
+						 &comp, PDS_AQ_FLAG_FASTPOLL);
+		if (err != -EAGAIN)
+			break;
+
+		msleep(SUSPEND_CHECK_INTERVAL_MS);
+	} while (time_before(jiffies, time_limit));
+
+	time_done = jiffies;
+	dev_dbg(dev, "%s: vf%u: Suspend comp received in %u msecs\n", __func__,
+		pds_vfio->vf_id, jiffies_to_msecs(time_done - time_start));
+
+	/* Check the results */
+	if (time_after_eq(time_done, time_limit)) {
+		dev_err(dev, "%s: vf%u: Suspend comp timeout\n", __func__,
+			pds_vfio->vf_id);
+		err = -ETIMEDOUT;
+	}
+
+	return err;
+}
+
+int pds_vfio_suspend_device_cmd(struct pds_vfio_pci_device *pds_vfio)
+{
+	union pds_core_adminq_cmd cmd = {
+		.lm_suspend = {
+			.opcode = PDS_LM_CMD_SUSPEND,
+			.vf_id = cpu_to_le16(pds_vfio->vf_id),
+		},
+	};
+	struct device *dev = pds_vfio_to_dev(pds_vfio);
+	union pds_core_adminq_comp comp = {};
+	int err;
+
+	dev_dbg(dev, "vf%u: Suspend device\n", pds_vfio->vf_id);
+
+	err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp,
+					 PDS_AQ_FLAG_FASTPOLL);
+	if (err) {
+		dev_err(dev, "vf%u: Suspend failed: %pe\n", pds_vfio->vf_id,
+			ERR_PTR(err));
+		return err;
+	}
+
+	return pds_vfio_suspend_wait_device_cmd(pds_vfio);
+}
+
+int pds_vfio_resume_device_cmd(struct pds_vfio_pci_device *pds_vfio)
+{
+	union pds_core_adminq_cmd cmd = {
+		.lm_resume = {
+			.opcode = PDS_LM_CMD_RESUME,
+			.vf_id = cpu_to_le16(pds_vfio->vf_id),
+		},
+	};
+	struct device *dev = pds_vfio_to_dev(pds_vfio);
+	union pds_core_adminq_comp comp = {};
+
+	dev_dbg(dev, "vf%u: Resume device\n", pds_vfio->vf_id);
+
+	return pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp,
+					  0);
+}
+
+int pds_vfio_get_lm_status_cmd(struct pds_vfio_pci_device *pds_vfio, u64 *size)
+{
+	union pds_core_adminq_cmd cmd = {
+		.lm_status = {
+			.opcode = PDS_LM_CMD_STATUS,
+			.vf_id = cpu_to_le16(pds_vfio->vf_id),
+		},
+	};
+	struct device *dev = pds_vfio_to_dev(pds_vfio);
+	union pds_core_adminq_comp comp = {};
+	int err;
+
+	dev_dbg(dev, "vf%u: Get migration status\n", pds_vfio->vf_id);
+
+	err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp, 0);
+	if (err)
+		return err;
+
+	*size = le64_to_cpu(comp.lm_status.size);
+	return 0;
+}
+
+static int pds_vfio_dma_map_lm_file(struct device *dev,
+				    enum dma_data_direction dir,
+				    struct pds_vfio_lm_file *lm_file)
+{
+	struct pds_lm_sg_elem *sgl, *sge;
+	struct scatterlist *sg;
+	dma_addr_t sgl_addr;
+	size_t sgl_size;
+	int err;
+	int i;
+
+	if (!lm_file)
+		return -EINVAL;
+
+	/* dma map file pages */
+	err = dma_map_sgtable(dev, &lm_file->sg_table, dir, 0);
+	if (err)
+		return err;
+
+	lm_file->num_sge = lm_file->sg_table.nents;
+
+	/* alloc sgl */
+	sgl_size = lm_file->num_sge * sizeof(struct pds_lm_sg_elem);
+	sgl = kzalloc(sgl_size, GFP_KERNEL);
+	if (!sgl) {
+		err = -ENOMEM;
+		goto out_unmap_sgtable;
+	}
+
+	/* fill sgl */
+	sge = sgl;
+	for_each_sgtable_dma_sg(&lm_file->sg_table, sg, i) {
+		sge->addr = cpu_to_le64(sg_dma_address(sg));
+		sge->len = cpu_to_le32(sg_dma_len(sg));
+		dev_dbg(dev, "addr = %llx, len = %u\n", sge->addr, sge->len);
+		sge++;
+	}
+
+	sgl_addr = dma_map_single(dev, sgl, sgl_size, DMA_TO_DEVICE);
+	if (dma_mapping_error(dev, sgl_addr)) {
+		err = -EIO;
+		goto out_free_sgl;
+	}
+
+	lm_file->sgl = sgl;
+	lm_file->sgl_addr = sgl_addr;
+
+	return 0;
+
+out_free_sgl:
+	kfree(sgl);
+out_unmap_sgtable:
+	lm_file->num_sge = 0;
+	dma_unmap_sgtable(dev, &lm_file->sg_table, dir, 0);
+	return err;
+}
+
+static void pds_vfio_dma_unmap_lm_file(struct device *dev,
+				       enum dma_data_direction dir,
+				       struct pds_vfio_lm_file *lm_file)
+{
+	if (!lm_file)
+		return;
+
+	/* free sgl */
+	if (lm_file->sgl) {
+		dma_unmap_single(dev, lm_file->sgl_addr,
+				 lm_file->num_sge * sizeof(*lm_file->sgl),
+				 DMA_TO_DEVICE);
+		kfree(lm_file->sgl);
+		lm_file->sgl = NULL;
+		lm_file->sgl_addr = DMA_MAPPING_ERROR;
+		lm_file->num_sge = 0;
+	}
+
+	/* dma unmap file pages */
+	dma_unmap_sgtable(dev, &lm_file->sg_table, dir, 0);
+}
+
+int pds_vfio_get_lm_state_cmd(struct pds_vfio_pci_device *pds_vfio)
+{
+	union pds_core_adminq_cmd cmd = {
+		.lm_save = {
+			.opcode = PDS_LM_CMD_SAVE,
+			.vf_id = cpu_to_le16(pds_vfio->vf_id),
+		},
+	};
+	struct pci_dev *pdev = pds_vfio_to_pci_dev(pds_vfio);
+	struct device *pdsc_dev = &pci_physfn(pdev)->dev;
+	union pds_core_adminq_comp comp = {};
+	struct pds_vfio_lm_file *lm_file;
+	int err;
+
+	dev_dbg(&pdev->dev, "vf%u: Get migration state\n", pds_vfio->vf_id);
+
+	lm_file = pds_vfio->save_file;
+
+	err = pds_vfio_dma_map_lm_file(pdsc_dev, DMA_FROM_DEVICE, lm_file);
+	if (err) {
+		dev_err(&pdev->dev, "failed to map save migration file: %pe\n",
+			ERR_PTR(err));
+		return err;
+	}
+
+	cmd.lm_save.sgl_addr = cpu_to_le64(lm_file->sgl_addr);
+	cmd.lm_save.num_sge = cpu_to_le32(lm_file->num_sge);
+
+	err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp, 0);
+	if (err)
+		dev_err(&pdev->dev, "failed to get migration state: %pe\n",
+			ERR_PTR(err));
+
+	pds_vfio_dma_unmap_lm_file(pdsc_dev, DMA_FROM_DEVICE, lm_file);
+
+	return err;
+}
+
+int pds_vfio_set_lm_state_cmd(struct pds_vfio_pci_device *pds_vfio)
+{
+	union pds_core_adminq_cmd cmd = {
+		.lm_restore = {
+			.opcode = PDS_LM_CMD_RESTORE,
+			.vf_id = cpu_to_le16(pds_vfio->vf_id),
+		},
+	};
+	struct pci_dev *pdev = pds_vfio_to_pci_dev(pds_vfio);
+	struct device *pdsc_dev = &pci_physfn(pdev)->dev;
+	union pds_core_adminq_comp comp = {};
+	struct pds_vfio_lm_file *lm_file;
+	int err;
+
+	dev_dbg(&pdev->dev, "vf%u: Set migration state\n", pds_vfio->vf_id);
+
+	lm_file = pds_vfio->restore_file;
+
+	err = pds_vfio_dma_map_lm_file(pdsc_dev, DMA_TO_DEVICE, lm_file);
+	if (err) {
+		dev_err(&pdev->dev,
+			"failed to map restore migration file: %pe\n",
+			ERR_PTR(err));
+		return err;
+	}
+
+	cmd.lm_restore.sgl_addr = cpu_to_le64(lm_file->sgl_addr);
+	cmd.lm_restore.num_sge = cpu_to_le32(lm_file->num_sge);
+
+	err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp, 0);
+	if (err)
+		dev_err(&pdev->dev, "failed to set migration state: %pe\n",
+			ERR_PTR(err));
+
+	pds_vfio_dma_unmap_lm_file(pdsc_dev, DMA_TO_DEVICE, lm_file);
+
+	return err;
+}
+
+void pds_vfio_send_host_vf_lm_status_cmd(struct pds_vfio_pci_device *pds_vfio,
+					 enum pds_lm_host_vf_status vf_status)
+{
+	union pds_core_adminq_cmd cmd = {
+		.lm_host_vf_status = {
+			.opcode = PDS_LM_CMD_HOST_VF_STATUS,
+			.vf_id = cpu_to_le16(pds_vfio->vf_id),
+			.status = vf_status,
+		},
+	};
+	struct device *dev = pds_vfio_to_dev(pds_vfio);
+	union pds_core_adminq_comp comp = {};
+	int err;
+
+	dev_dbg(dev, "vf%u: Set host VF LM status: %u\n", pds_vfio->vf_id,
+		vf_status);
+	if (vf_status != PDS_LM_STA_IN_PROGRESS &&
+	    vf_status != PDS_LM_STA_NONE) {
+		dev_warn(dev, "Invalid host VF migration status, %d\n",
+			 vf_status);
+		return;
+	}
+
+	err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp, 0);
+	if (err)
+		dev_warn(dev, "failed to send host VF migration status: %pe\n",
+			 ERR_PTR(err));
+}
diff --git a/drivers/vfio/pci/pds/cmds.h b/drivers/vfio/pci/pds/cmds.h
index 4c592afccf89..3d8a5508c733 100644
--- a/drivers/vfio/pci/pds/cmds.h
+++ b/drivers/vfio/pci/pds/cmds.h
@@ -6,5 +6,11 @@
 
 int pds_vfio_register_client_cmd(struct pds_vfio_pci_device *pds_vfio);
 void pds_vfio_unregister_client_cmd(struct pds_vfio_pci_device *pds_vfio);
-
+int pds_vfio_suspend_device_cmd(struct pds_vfio_pci_device *pds_vfio);
+int pds_vfio_resume_device_cmd(struct pds_vfio_pci_device *pds_vfio);
+int pds_vfio_get_lm_status_cmd(struct pds_vfio_pci_device *pds_vfio, u64 *size);
+int pds_vfio_get_lm_state_cmd(struct pds_vfio_pci_device *pds_vfio);
+int pds_vfio_set_lm_state_cmd(struct pds_vfio_pci_device *pds_vfio);
+void pds_vfio_send_host_vf_lm_status_cmd(struct pds_vfio_pci_device *pds_vfio,
+					 enum pds_lm_host_vf_status vf_status);
 #endif /* _CMDS_H_ */
diff --git a/drivers/vfio/pci/pds/lm.c b/drivers/vfio/pci/pds/lm.c
new file mode 100644
index 000000000000..c507f39a2339
--- /dev/null
+++ b/drivers/vfio/pci/pds/lm.c
@@ -0,0 +1,421 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2023 Advanced Micro Devices, Inc. */
+
+#include <linux/anon_inodes.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/highmem.h>
+#include <linux/vfio.h>
+#include <linux/vfio_pci_core.h>
+
+#include "vfio_dev.h"
+#include "cmds.h"
+
+static struct pds_vfio_lm_file *
+pds_vfio_get_lm_file(const struct file_operations *fops, int flags, u64 size)
+{
+	struct pds_vfio_lm_file *lm_file = NULL;
+	unsigned long long npages;
+	struct page **pages;
+	void *page_mem;
+	const void *p;
+
+	if (!size)
+		return NULL;
+
+	/* Alloc file structure */
+	lm_file = kzalloc(sizeof(*lm_file), GFP_KERNEL);
+	if (!lm_file)
+		return NULL;
+
+	/* Create file */
+	lm_file->filep =
+		anon_inode_getfile("pds_vfio_lm", fops, lm_file, flags);
+	if (IS_ERR(lm_file->filep))
+		goto out_free_file;
+
+	stream_open(lm_file->filep->f_inode, lm_file->filep);
+	mutex_init(&lm_file->lock);
+
+	/* prevent file from being released before we are done with it */
+	get_file(lm_file->filep);
+
+	/* Allocate memory for file pages */
+	npages = DIV_ROUND_UP_ULL(size, PAGE_SIZE);
+	pages = kmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
+	if (!pages)
+		goto out_put_file;
+
+	page_mem = kvzalloc(ALIGN(size, PAGE_SIZE), GFP_KERNEL);
+	if (!page_mem)
+		goto out_free_pages_array;
+
+	p = page_mem - offset_in_page(page_mem);
+	for (unsigned long long i = 0; i < npages; i++) {
+		if (is_vmalloc_addr(p))
+			pages[i] = vmalloc_to_page(p);
+		else
+			pages[i] = kmap_to_page((void *)p);
+		if (!pages[i])
+			goto out_free_page_mem;
+
+		p += PAGE_SIZE;
+	}
+
+	/* Create scatterlist of file pages to use for DMA mapping later */
+	if (sg_alloc_table_from_pages(&lm_file->sg_table, pages, npages, 0,
+				      size, GFP_KERNEL))
+		goto out_free_page_mem;
+
+	lm_file->size = size;
+	lm_file->pages = pages;
+	lm_file->npages = npages;
+	lm_file->page_mem = page_mem;
+	lm_file->alloc_size = npages * PAGE_SIZE;
+
+	return lm_file;
+
+out_free_page_mem:
+	kvfree(page_mem);
+out_free_pages_array:
+	kfree(pages);
+out_put_file:
+	fput(lm_file->filep);
+	mutex_destroy(&lm_file->lock);
+out_free_file:
+	kfree(lm_file);
+
+	return NULL;
+}
+
+static void pds_vfio_put_lm_file(struct pds_vfio_lm_file *lm_file)
+{
+	mutex_lock(&lm_file->lock);
+
+	lm_file->size = 0;
+	lm_file->alloc_size = 0;
+
+	/* Free scatter list of file pages */
+	sg_free_table(&lm_file->sg_table);
+
+	kvfree(lm_file->page_mem);
+	lm_file->page_mem = NULL;
+	kfree(lm_file->pages);
+	lm_file->pages = NULL;
+
+	mutex_unlock(&lm_file->lock);
+
+	/* allow file to be released since we are done with it */
+	fput(lm_file->filep);
+}
+
+void pds_vfio_put_save_file(struct pds_vfio_pci_device *pds_vfio)
+{
+	if (!pds_vfio->save_file)
+		return;
+
+	pds_vfio_put_lm_file(pds_vfio->save_file);
+	pds_vfio->save_file = NULL;
+}
+
+void pds_vfio_put_restore_file(struct pds_vfio_pci_device *pds_vfio)
+{
+	if (!pds_vfio->restore_file)
+		return;
+
+	pds_vfio_put_lm_file(pds_vfio->restore_file);
+	pds_vfio->restore_file = NULL;
+}
+
+static struct page *pds_vfio_get_file_page(struct pds_vfio_lm_file *lm_file,
+					   unsigned long offset)
+{
+	unsigned long cur_offset = 0;
+	struct scatterlist *sg;
+	unsigned int i;
+
+	/* All accesses are sequential */
+	if (offset < lm_file->last_offset || !lm_file->last_offset_sg) {
+		lm_file->last_offset = 0;
+		lm_file->last_offset_sg = lm_file->sg_table.sgl;
+		lm_file->sg_last_entry = 0;
+	}
+
+	cur_offset = lm_file->last_offset;
+
+	for_each_sg(lm_file->last_offset_sg, sg,
+		    lm_file->sg_table.orig_nents - lm_file->sg_last_entry, i) {
+		if (offset < sg->length + cur_offset) {
+			lm_file->last_offset_sg = sg;
+			lm_file->sg_last_entry += i;
+			lm_file->last_offset = cur_offset;
+			return nth_page(sg_page(sg),
+					(offset - cur_offset) / PAGE_SIZE);
+		}
+		cur_offset += sg->length;
+	}
+
+	return NULL;
+}
+
+static int pds_vfio_release_file(struct inode *inode, struct file *filp)
+{
+	struct pds_vfio_lm_file *lm_file = filp->private_data;
+
+	mutex_lock(&lm_file->lock);
+	lm_file->filep->f_pos = 0;
+	lm_file->size = 0;
+	mutex_unlock(&lm_file->lock);
+	mutex_destroy(&lm_file->lock);
+	kfree(lm_file);
+
+	return 0;
+}
+
+static ssize_t pds_vfio_save_read(struct file *filp, char __user *buf,
+				  size_t len, loff_t *pos)
+{
+	struct pds_vfio_lm_file *lm_file = filp->private_data;
+	ssize_t done = 0;
+
+	if (pos)
+		return -ESPIPE;
+	pos = &filp->f_pos;
+
+	mutex_lock(&lm_file->lock);
+	if (*pos > lm_file->size) {
+		done = -EINVAL;
+		goto out_unlock;
+	}
+
+	len = min_t(size_t, lm_file->size - *pos, len);
+	while (len) {
+		size_t page_offset;
+		struct page *page;
+		size_t page_len;
+		u8 *from_buff;
+		int err;
+
+		page_offset = (*pos) % PAGE_SIZE;
+		page = pds_vfio_get_file_page(lm_file, *pos - page_offset);
+		if (!page) {
+			if (done == 0)
+				done = -EINVAL;
+			goto out_unlock;
+		}
+
+		page_len = min_t(size_t, len, PAGE_SIZE - page_offset);
+		from_buff = kmap_local_page(page);
+		err = copy_to_user(buf, from_buff + page_offset, page_len);
+		kunmap_local(from_buff);
+		if (err) {
+			done = -EFAULT;
+			goto out_unlock;
+		}
+		*pos += page_len;
+		len -= page_len;
+		done += page_len;
+		buf += page_len;
+	}
+
+out_unlock:
+	mutex_unlock(&lm_file->lock);
+	return done;
+}
+
+static const struct file_operations pds_vfio_save_fops = {
+	.owner = THIS_MODULE,
+	.read = pds_vfio_save_read,
+	.release = pds_vfio_release_file,
+	.llseek = no_llseek,
+};
+
+static int pds_vfio_get_save_file(struct pds_vfio_pci_device *pds_vfio)
+{
+	struct device *dev = &pds_vfio->vfio_coredev.pdev->dev;
+	struct pds_vfio_lm_file *lm_file;
+	int err;
+	u64 size;
+
+	/* Get live migration state size in this state */
+	err = pds_vfio_get_lm_status_cmd(pds_vfio, &size);
+	if (err) {
+		dev_err(dev, "failed to get save status: %pe\n", ERR_PTR(err));
+		return err;
+	}
+
+	dev_dbg(dev, "save status, size = %llu\n", size);
+
+	if (!size) {
+		dev_err(dev, "invalid state size\n");
+		return -EIO;
+	}
+
+	lm_file = pds_vfio_get_lm_file(&pds_vfio_save_fops, O_RDONLY, size);
+	if (!lm_file) {
+		dev_err(dev, "failed to create save file\n");
+		return -ENOENT;
+	}
+
+	dev_dbg(dev, "size = %llu, alloc_size = %llu, npages = %llu\n",
+		lm_file->size, lm_file->alloc_size, lm_file->npages);
+
+	pds_vfio->save_file = lm_file;
+
+	return 0;
+}
+
+static ssize_t pds_vfio_restore_write(struct file *filp, const char __user *buf,
+				      size_t len, loff_t *pos)
+{
+	struct pds_vfio_lm_file *lm_file = filp->private_data;
+	loff_t requested_length;
+	ssize_t done = 0;
+
+	if (pos)
+		return -ESPIPE;
+
+	pos = &filp->f_pos;
+
+	if (*pos < 0 ||
+	    check_add_overflow((loff_t)len, *pos, &requested_length))
+		return -EINVAL;
+
+	mutex_lock(&lm_file->lock);
+
+	while (len) {
+		size_t page_offset;
+		struct page *page;
+		size_t page_len;
+		u8 *to_buff;
+		int err;
+
+		page_offset = (*pos) % PAGE_SIZE;
+		page = pds_vfio_get_file_page(lm_file, *pos - page_offset);
+		if (!page) {
+			if (done == 0)
+				done = -EINVAL;
+			goto out_unlock;
+		}
+
+		page_len = min_t(size_t, len, PAGE_SIZE - page_offset);
+		to_buff = kmap_local_page(page);
+		err = copy_from_user(to_buff + page_offset, buf, page_len);
+		kunmap_local(to_buff);
+		if (err) {
+			done = -EFAULT;
+			goto out_unlock;
+		}
+		*pos += page_len;
+		len -= page_len;
+		done += page_len;
+		buf += page_len;
+		lm_file->size += page_len;
+	}
+out_unlock:
+	mutex_unlock(&lm_file->lock);
+	return done;
+}
+
+static const struct file_operations pds_vfio_restore_fops = {
+	.owner = THIS_MODULE,
+	.write = pds_vfio_restore_write,
+	.release = pds_vfio_release_file,
+	.llseek = no_llseek,
+};
+
+static int pds_vfio_get_restore_file(struct pds_vfio_pci_device *pds_vfio)
+{
+	struct device *dev = &pds_vfio->vfio_coredev.pdev->dev;
+	struct pds_vfio_lm_file *lm_file;
+	u64 size;
+
+	size = sizeof(union pds_lm_dev_state);
+	dev_dbg(dev, "restore status, size = %llu\n", size);
+
+	if (!size) {
+		dev_err(dev, "invalid state size\n");
+		return -EIO;
+	}
+
+	lm_file = pds_vfio_get_lm_file(&pds_vfio_restore_fops, O_WRONLY, size);
+	if (!lm_file) {
+		dev_err(dev, "failed to create restore file\n");
+		return -ENOENT;
+	}
+	pds_vfio->restore_file = lm_file;
+
+	return 0;
+}
+
+struct file *
+pds_vfio_step_device_state_locked(struct pds_vfio_pci_device *pds_vfio,
+				  enum vfio_device_mig_state next)
+{
+	enum vfio_device_mig_state cur = pds_vfio->state;
+	int err;
+
+	if (cur == VFIO_DEVICE_STATE_STOP && next == VFIO_DEVICE_STATE_STOP_COPY) {
+		err = pds_vfio_get_save_file(pds_vfio);
+		if (err)
+			return ERR_PTR(err);
+
+		err = pds_vfio_get_lm_state_cmd(pds_vfio);
+		if (err) {
+			pds_vfio_put_save_file(pds_vfio);
+			return ERR_PTR(err);
+		}
+
+		return pds_vfio->save_file->filep;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP_COPY && next == VFIO_DEVICE_STATE_STOP) {
+		pds_vfio_put_save_file(pds_vfio);
+		pds_vfio_send_host_vf_lm_status_cmd(pds_vfio, PDS_LM_STA_NONE);
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP && next == VFIO_DEVICE_STATE_RESUMING) {
+		err = pds_vfio_get_restore_file(pds_vfio);
+		if (err)
+			return ERR_PTR(err);
+
+		return pds_vfio->restore_file->filep;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_RESUMING && next == VFIO_DEVICE_STATE_STOP) {
+		err = pds_vfio_set_lm_state_cmd(pds_vfio);
+		if (err)
+			return ERR_PTR(err);
+
+		pds_vfio_put_restore_file(pds_vfio);
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_RUNNING && next == VFIO_DEVICE_STATE_RUNNING_P2P) {
+		pds_vfio_send_host_vf_lm_status_cmd(pds_vfio,
+						    PDS_LM_STA_IN_PROGRESS);
+		err = pds_vfio_suspend_device_cmd(pds_vfio);
+		if (err)
+			return ERR_PTR(err);
+
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && next == VFIO_DEVICE_STATE_RUNNING) {
+		err = pds_vfio_resume_device_cmd(pds_vfio);
+		if (err)
+			return ERR_PTR(err);
+
+		pds_vfio_send_host_vf_lm_status_cmd(pds_vfio, PDS_LM_STA_NONE);
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP && next == VFIO_DEVICE_STATE_RUNNING_P2P)
+		return NULL;
+
+	if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && next == VFIO_DEVICE_STATE_STOP)
+		return NULL;
+
+	return ERR_PTR(-EINVAL);
+}
diff --git a/drivers/vfio/pci/pds/lm.h b/drivers/vfio/pci/pds/lm.h
new file mode 100644
index 000000000000..13be893198b7
--- /dev/null
+++ b/drivers/vfio/pci/pds/lm.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2023 Advanced Micro Devices, Inc. */
+
+#ifndef _LM_H_
+#define _LM_H_
+
+#include <linux/fs.h>
+#include <linux/mutex.h>
+#include <linux/scatterlist.h>
+#include <linux/types.h>
+
+#include <linux/pds/pds_common.h>
+#include <linux/pds/pds_adminq.h>
+
+struct pds_vfio_lm_file {
+	struct file *filep;
+	struct mutex lock;	/* protect live migration data file */
+	u64 size;		/* Size with valid data */
+	u64 alloc_size;		/* Total allocated size. Always >= size */
+	void *page_mem;		/* memory allocated for pages */
+	struct page **pages;	/* Backing pages for file */
+	unsigned long long npages;
+	struct sg_table sg_table;	/* SG table for backing pages */
+	struct pds_lm_sg_elem *sgl;	/* DMA mapping */
+	dma_addr_t sgl_addr;
+	u16 num_sge;
+	struct scatterlist *last_offset_sg;	/* Iterator */
+	unsigned int sg_last_entry;
+	unsigned long last_offset;
+};
+
+struct pds_vfio_pci_device;
+
+struct file *
+pds_vfio_step_device_state_locked(struct pds_vfio_pci_device *pds_vfio,
+				  enum vfio_device_mig_state next);
+
+void pds_vfio_put_save_file(struct pds_vfio_pci_device *pds_vfio);
+void pds_vfio_put_restore_file(struct pds_vfio_pci_device *pds_vfio);
+
+#endif /* _LM_H_ */
diff --git a/drivers/vfio/pci/pds/pci_drv.c b/drivers/vfio/pci/pds/pci_drv.c
index a49420aa9736..ffd47fa8ede3 100644
--- a/drivers/vfio/pci/pds/pci_drv.c
+++ b/drivers/vfio/pci/pds/pci_drv.c
@@ -73,11 +73,24 @@ pds_vfio_pci_table[] = {
 };
 MODULE_DEVICE_TABLE(pci, pds_vfio_pci_table);
 
+static void pds_vfio_pci_aer_reset_done(struct pci_dev *pdev)
+{
+	struct pds_vfio_pci_device *pds_vfio = pds_vfio_pci_drvdata(pdev);
+
+	pds_vfio_reset(pds_vfio);
+}
+
+static const struct pci_error_handlers pds_vfio_pci_err_handlers = {
+	.reset_done = pds_vfio_pci_aer_reset_done,
+	.error_detected = vfio_pci_core_aer_err_detected,
+};
+
 static struct pci_driver pds_vfio_pci_driver = {
 	.name = KBUILD_MODNAME,
 	.id_table = pds_vfio_pci_table,
 	.probe = pds_vfio_pci_probe,
 	.remove = pds_vfio_pci_remove,
+	.err_handler = &pds_vfio_pci_err_handlers,
 	.driver_managed_dma = true,
 };
 
diff --git a/drivers/vfio/pci/pds/vfio_dev.c b/drivers/vfio/pci/pds/vfio_dev.c
index 39771265b78f..2435d8255366 100644
--- a/drivers/vfio/pci/pds/vfio_dev.c
+++ b/drivers/vfio/pci/pds/vfio_dev.c
@@ -4,6 +4,7 @@
 #include <linux/vfio.h>
 #include <linux/vfio_pci_core.h>
 
+#include "lm.h"
 #include "vfio_dev.h"
 
 struct pci_dev *pds_vfio_to_pci_dev(struct pds_vfio_pci_device *pds_vfio)
@@ -11,6 +12,11 @@ struct pci_dev *pds_vfio_to_pci_dev(struct pds_vfio_pci_device *pds_vfio)
 	return pds_vfio->vfio_coredev.pdev;
 }
 
+struct device *pds_vfio_to_dev(struct pds_vfio_pci_device *pds_vfio)
+{
+	return &pds_vfio_to_pci_dev(pds_vfio)->dev;
+}
+
 struct pds_vfio_pci_device *pds_vfio_pci_drvdata(struct pci_dev *pdev)
 {
 	struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
@@ -19,6 +25,98 @@ struct pds_vfio_pci_device *pds_vfio_pci_drvdata(struct pci_dev *pdev)
 			    vfio_coredev);
 }
 
+static void pds_vfio_state_mutex_unlock(struct pds_vfio_pci_device *pds_vfio)
+{
+again:
+	spin_lock(&pds_vfio->reset_lock);
+	if (pds_vfio->deferred_reset) {
+		pds_vfio->deferred_reset = false;
+		if (pds_vfio->state == VFIO_DEVICE_STATE_ERROR) {
+			pds_vfio->state = VFIO_DEVICE_STATE_RUNNING;
+			pds_vfio_put_restore_file(pds_vfio);
+			pds_vfio_put_save_file(pds_vfio);
+		}
+		spin_unlock(&pds_vfio->reset_lock);
+		goto again;
+	}
+	mutex_unlock(&pds_vfio->state_mutex);
+	spin_unlock(&pds_vfio->reset_lock);
+}
+
+void pds_vfio_reset(struct pds_vfio_pci_device *pds_vfio)
+{
+	spin_lock(&pds_vfio->reset_lock);
+	pds_vfio->deferred_reset = true;
+	if (!mutex_trylock(&pds_vfio->state_mutex)) {
+		spin_unlock(&pds_vfio->reset_lock);
+		return;
+	}
+	spin_unlock(&pds_vfio->reset_lock);
+	pds_vfio_state_mutex_unlock(pds_vfio);
+}
+
+static struct file *
+pds_vfio_set_device_state(struct vfio_device *vdev,
+			  enum vfio_device_mig_state new_state)
+{
+	struct pds_vfio_pci_device *pds_vfio =
+		container_of(vdev, struct pds_vfio_pci_device,
+			     vfio_coredev.vdev);
+	struct file *res = NULL;
+
+	mutex_lock(&pds_vfio->state_mutex);
+	while (new_state != pds_vfio->state) {
+		enum vfio_device_mig_state next_state;
+
+		int err = vfio_mig_get_next_state(vdev, pds_vfio->state,
+						  new_state, &next_state);
+		if (err) {
+			res = ERR_PTR(err);
+			break;
+		}
+
+		res = pds_vfio_step_device_state_locked(pds_vfio, next_state);
+		if (IS_ERR(res))
+			break;
+
+		pds_vfio->state = next_state;
+
+		if (WARN_ON(res && new_state != pds_vfio->state)) {
+			res = ERR_PTR(-EINVAL);
+			break;
+		}
+	}
+	pds_vfio_state_mutex_unlock(pds_vfio);
+
+	return res;
+}
+
+static int pds_vfio_get_device_state(struct vfio_device *vdev,
+				     enum vfio_device_mig_state *current_state)
+{
+	struct pds_vfio_pci_device *pds_vfio =
+		container_of(vdev, struct pds_vfio_pci_device,
+			     vfio_coredev.vdev);
+
+	mutex_lock(&pds_vfio->state_mutex);
+	*current_state = pds_vfio->state;
+	pds_vfio_state_mutex_unlock(pds_vfio);
+	return 0;
+}
+
+static int pds_vfio_get_device_state_size(struct vfio_device *vdev,
+					  unsigned long *stop_copy_length)
+{
+	*stop_copy_length = PDS_LM_DEVICE_STATE_LENGTH;
+	return 0;
+}
+
+static const struct vfio_migration_ops pds_vfio_lm_ops = {
+	.migration_set_state = pds_vfio_set_device_state,
+	.migration_get_state = pds_vfio_get_device_state,
+	.migration_get_data_size = pds_vfio_get_device_state_size
+};
+
 static int pds_vfio_init_device(struct vfio_device *vdev)
 {
 	struct pds_vfio_pci_device *pds_vfio =
@@ -34,6 +132,9 @@ static int pds_vfio_init_device(struct vfio_device *vdev)
 	pds_vfio->vf_id = pci_iov_vf_id(pdev);
 	pds_vfio->pci_id = PCI_DEVID(pdev->bus->number, pdev->devfn);
 
+	vdev->migration_flags = VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P;
+	vdev->mig_ops = &pds_vfio_lm_ops;
+
 	dev_dbg(&pdev->dev,
 		"%s: PF %#04x VF %#04x (%d) vf_id %d domain %d pds_vfio %p\n",
 		__func__, pci_dev_id(pdev->physfn), pds_vfio->pci_id,
@@ -54,17 +155,34 @@ static int pds_vfio_open_device(struct vfio_device *vdev)
 	if (err)
 		return err;
 
+	mutex_init(&pds_vfio->state_mutex);
+	pds_vfio->state = VFIO_DEVICE_STATE_RUNNING;
+
 	vfio_pci_core_finish_enable(&pds_vfio->vfio_coredev);
 
 	return 0;
 }
 
+static void pds_vfio_close_device(struct vfio_device *vdev)
+{
+	struct pds_vfio_pci_device *pds_vfio =
+		container_of(vdev, struct pds_vfio_pci_device,
+			     vfio_coredev.vdev);
+
+	mutex_lock(&pds_vfio->state_mutex);
+	pds_vfio_put_restore_file(pds_vfio);
+	pds_vfio_put_save_file(pds_vfio);
+	mutex_unlock(&pds_vfio->state_mutex);
+	mutex_destroy(&pds_vfio->state_mutex);
+	vfio_pci_core_close_device(vdev);
+}
+
 static const struct vfio_device_ops pds_vfio_ops = {
 	.name = "pds-vfio",
 	.init = pds_vfio_init_device,
 	.release = vfio_pci_core_release_dev,
 	.open_device = pds_vfio_open_device,
-	.close_device = vfio_pci_core_close_device,
+	.close_device = pds_vfio_close_device,
 	.ioctl = vfio_pci_core_ioctl,
 	.device_feature = vfio_pci_core_ioctl_feature,
 	.read = vfio_pci_core_read,
diff --git a/drivers/vfio/pci/pds/vfio_dev.h b/drivers/vfio/pci/pds/vfio_dev.h
index 92e8ff241ca8..df6208a7140b 100644
--- a/drivers/vfio/pci/pds/vfio_dev.h
+++ b/drivers/vfio/pci/pds/vfio_dev.h
@@ -7,12 +7,21 @@
 #include <linux/pci.h>
 #include <linux/vfio_pci_core.h>
 
+#include "lm.h"
+
 struct pdsc;
 
 struct pds_vfio_pci_device {
 	struct vfio_pci_core_device vfio_coredev;
 	struct pdsc *pdsc;
 
+	struct pds_vfio_lm_file *save_file;
+	struct pds_vfio_lm_file *restore_file;
+	struct mutex state_mutex; /* protect migration state */
+	enum vfio_device_mig_state state;
+	spinlock_t reset_lock; /* protect reset_done flow */
+	u8 deferred_reset;
+
 	int vf_id;
 	int pci_id;
 	u16 client_id;
@@ -20,7 +29,9 @@ struct pds_vfio_pci_device {
 
 const struct vfio_device_ops *pds_vfio_ops_info(void);
 struct pds_vfio_pci_device *pds_vfio_pci_drvdata(struct pci_dev *pdev);
+void pds_vfio_reset(struct pds_vfio_pci_device *pds_vfio);
 
 struct pci_dev *pds_vfio_to_pci_dev(struct pds_vfio_pci_device *pds_vfio);
+struct device *pds_vfio_to_dev(struct pds_vfio_pci_device *pds_vfio);
 
 #endif /* _VFIO_DEV_H_ */
diff --git a/include/linux/pds/pds_adminq.h b/include/linux/pds/pds_adminq.h
index 98a60ce87b92..db6de081f15f 100644
--- a/include/linux/pds/pds_adminq.h
+++ b/include/linux/pds/pds_adminq.h
@@ -584,6 +584,213 @@ struct pds_core_q_init_comp {
 	u8     color;
 };
 
+#define PDS_LM_DEVICE_STATE_LENGTH		65536
+#define PDS_LM_CHECK_DEVICE_STATE_LENGTH(X) \
+			PDS_CORE_SIZE_CHECK(union, PDS_LM_DEVICE_STATE_LENGTH, X)
+
+/*
+ * enum pds_lm_cmd_opcode - Live Migration Device commands
+ */
+enum pds_lm_cmd_opcode {
+	PDS_LM_CMD_HOST_VF_STATUS  = 1,
+
+	/* Device state commands */
+	PDS_LM_CMD_STATUS          = 16,
+	PDS_LM_CMD_SUSPEND         = 18,
+	PDS_LM_CMD_SUSPEND_STATUS  = 19,
+	PDS_LM_CMD_RESUME          = 20,
+	PDS_LM_CMD_SAVE            = 21,
+	PDS_LM_CMD_RESTORE         = 22,
+};
+
+/**
+ * struct pds_lm_cmd - generic command
+ * @opcode:	Opcode
+ * @rsvd:	Word boundary padding
+ * @vf_id:	VF id
+ * @rsvd2:	Structure padding to 60 Bytes
+ */
+struct pds_lm_cmd {
+	u8     opcode;
+	u8     rsvd;
+	__le16 vf_id;
+	u8     rsvd2[56];
+};
+
+/**
+ * struct pds_lm_comp - generic command completion
+ * @status:	Status of the command (enum pds_core_status_code)
+ * @rsvd:	Structure padding to 16 Bytes
+ */
+struct pds_lm_comp {
+	u8 status;
+	u8 rsvd[15];
+};
+
+/**
+ * struct pds_lm_status_cmd - STATUS command
+ * @opcode:	Opcode PDS_LM_CMD_STATUS
+ * @rsvd:	Word boundary padding
+ * @vf_id:	VF id
+ */
+struct pds_lm_status_cmd {
+	u8     opcode;
+	u8     rsvd;
+	__le16 vf_id;
+};
+
+/**
+ * struct pds_lm_status_comp - STATUS command completion
+ * @status:		Status of the command (enum pds_core_status_code)
+ * @rsvd:		Word boundary padding
+ * @comp_index:		Index in the desc ring for which this is the completion
+ * @size:		Size of the device state
+ * @rsvd2:		Word boundary padding
+ * @color:		Color bit
+ */
+struct pds_lm_status_comp {
+	u8     status;
+	u8     rsvd;
+	__le16 comp_index;
+	union {
+		__le64 size;
+		u8     rsvd2[11];
+	} __packed;
+	u8     color;
+};
+
+/**
+ * struct pds_lm_suspend_cmd - SUSPEND command
+ * @opcode:	Opcode PDS_LM_CMD_SUSPEND
+ * @rsvd:	Word boundary padding
+ * @vf_id:	VF id
+ */
+struct pds_lm_suspend_cmd {
+	u8     opcode;
+	u8     rsvd;
+	__le16 vf_id;
+};
+
+/**
+ * struct pds_lm_suspend_comp - SUSPEND command completion
+ * @status:		Status of the command (enum pds_core_status_code)
+ * @rsvd:		Word boundary padding
+ * @comp_index:		Index in the desc ring for which this is the completion
+ * @state_size:		Size of the device state computed post suspend
+ * @rsvd2:		Word boundary padding
+ * @color:		Color bit
+ */
+struct pds_lm_suspend_comp {
+	u8     status;
+	u8     rsvd;
+	__le16 comp_index;
+	union {
+		__le64 state_size;
+		u8     rsvd2[11];
+	} __packed;
+	u8     color;
+};
+
+/**
+ * struct pds_lm_suspend_status_cmd - SUSPEND status command
+ * @opcode:	Opcode PDS_LM_CMD_SUSPEND_STATUS
+ * @rsvd:	Word boundary padding
+ * @vf_id:	VF id
+ */
+struct pds_lm_suspend_status_cmd {
+	u8 opcode;
+	u8 rsvd;
+	__le16 vf_id;
+};
+
+/**
+ * struct pds_lm_resume_cmd - RESUME command
+ * @opcode:	Opcode PDS_LM_CMD_RESUME
+ * @rsvd:	Word boundary padding
+ * @vf_id:	VF id
+ */
+struct pds_lm_resume_cmd {
+	u8     opcode;
+	u8     rsvd;
+	__le16 vf_id;
+};
+
+/**
+ * struct pds_lm_sg_elem - Transmit scatter-gather (SG) descriptor element
+ * @addr:	DMA address of SG element data buffer
+ * @len:	Length of SG element data buffer, in bytes
+ * @rsvd:	Word boundary padding
+ */
+struct pds_lm_sg_elem {
+	__le64 addr;
+	__le32 len;
+	__le16 rsvd[2];
+};
+
+/**
+ * struct pds_lm_save_cmd - SAVE command
+ * @opcode:	Opcode PDS_LM_CMD_SAVE
+ * @rsvd:	Word boundary padding
+ * @vf_id:	VF id
+ * @rsvd2:	Word boundary padding
+ * @sgl_addr:	IOVA address of the SGL to dma the device state
+ * @num_sge:	Total number of SG elements
+ */
+struct pds_lm_save_cmd {
+	u8     opcode;
+	u8     rsvd;
+	__le16 vf_id;
+	u8     rsvd2[4];
+	__le64 sgl_addr;
+	__le32 num_sge;
+} __packed;
+
+/**
+ * struct pds_lm_restore_cmd - RESTORE command
+ * @opcode:	Opcode PDS_LM_CMD_RESTORE
+ * @rsvd:	Word boundary padding
+ * @vf_id:	VF id
+ * @rsvd2:	Word boundary padding
+ * @sgl_addr:	IOVA address of the SGL to dma the device state
+ * @num_sge:	Total number of SG elements
+ */
+struct pds_lm_restore_cmd {
+	u8     opcode;
+	u8     rsvd;
+	__le16 vf_id;
+	u8     rsvd2[4];
+	__le64 sgl_addr;
+	__le32 num_sge;
+} __packed;
+
+/**
+ * union pds_lm_dev_state - device state information
+ * @words:	Device state words
+ */
+union pds_lm_dev_state {
+	__le32 words[PDS_LM_DEVICE_STATE_LENGTH / sizeof(__le32)];
+};
+
+enum pds_lm_host_vf_status {
+	PDS_LM_STA_NONE = 0,
+	PDS_LM_STA_IN_PROGRESS,
+	PDS_LM_STA_MAX,
+};
+
+/**
+ * struct pds_lm_host_vf_status_cmd - HOST_VF_STATUS command
+ * @opcode:	Opcode PDS_LM_CMD_HOST_VF_STATUS
+ * @rsvd:	Word boundary padding
+ * @vf_id:	VF id
+ * @status:	Current LM status of host VF driver (enum pds_lm_host_vf_status)
+ */
+struct pds_lm_host_vf_status_cmd {
+	u8     opcode;
+	u8     rsvd;
+	__le16 vf_id;
+	u8     status;
+};
+
 union pds_core_adminq_cmd {
 	u8     opcode;
 	u8     bytes[64];
@@ -600,6 +807,14 @@ union pds_core_adminq_cmd {
 
 	struct pds_core_q_identify_cmd    q_ident;
 	struct pds_core_q_init_cmd        q_init;
+
+	struct pds_lm_suspend_cmd		lm_suspend;
+	struct pds_lm_suspend_status_cmd	lm_suspend_status;
+	struct pds_lm_resume_cmd		lm_resume;
+	struct pds_lm_status_cmd		lm_status;
+	struct pds_lm_save_cmd			lm_save;
+	struct pds_lm_restore_cmd		lm_restore;
+	struct pds_lm_host_vf_status_cmd	lm_host_vf_status;
 };
 
 union pds_core_adminq_comp {
@@ -621,6 +836,8 @@ union pds_core_adminq_comp {
 
 	struct pds_core_q_identify_comp   q_ident;
 	struct pds_core_q_init_comp       q_init;
+
+	struct pds_lm_status_comp		lm_status;
 };
 
 #ifndef __CHECKER__
-- 
2.17.1


* [PATCH v10 vfio 5/7] vfio/pds: Add support for dirty page tracking
  2023-06-02 22:03 [PATCH v10 vfio 0/7] pds_vfio driver Brett Creeley
                   ` (3 preceding siblings ...)
  2023-06-02 22:03 ` [PATCH v10 vfio 4/7] vfio/pds: Add VFIO live migration support Brett Creeley
@ 2023-06-02 22:03 ` Brett Creeley
  2023-06-02 22:03 ` [PATCH v10 vfio 6/7] vfio/pds: Add support for firmware recovery Brett Creeley
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 40+ messages in thread
From: Brett Creeley @ 2023-06-02 22:03 UTC (permalink / raw)
  To: kvm, netdev, alex.williamson, jgg, yishaih,
	shameerali.kolothum.thodi, kevin.tian
  Cc: brett.creeley, shannon.nelson

In order to support dirty page tracking, the driver has to implement
the VFIO subsystem's vfio_log_ops. This includes log_start, log_stop,
and log_read_and_clear.

All of the tracker resources are allocated and dirty tracking on the
device is started during log_start. The resources are cleaned up and
dirty tracking on the device is stopped during log_stop. The dirty
pages are determined and reported during log_read_and_clear.

These callbacks are implemented with admin queue (adminq) commands.
All of the adminq command structures and their implementations
are included as part of this patch.

PDS_LM_CMD_DIRTY_STATUS is added to query the current status of
dirty tracking on the device. The response reports the number of
regions currently being tracked (non-zero means tracking is enabled)
and the maximum number of regions the device supports.

PDS_LM_CMD_DIRTY_ENABLE is added to enable dirty tracking on the
specified number of regions and their IOVA ranges.
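
As an illustration of what a region description has to carry, here is a
minimal userspace sketch, assuming one inclusive IOVA range and a
power-of-two tracking page size. The names region_desc and
describe_region are hypothetical and are not part of this patch; the
little-endian conversion the adminq structures need is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side view of one tracked region (illustration only). */
struct region_desc {
	uint64_t dma_base;	/* first IOVA covered by the region */
	uint32_t page_count;	/* number of tracked pages */
	uint8_t page_size_log2;	/* log2 of the tracking page size */
};

/*
 * Describe the inclusive IOVA range [start, last].
 * page_size must be a non-zero power of two.
 */
static struct region_desc describe_region(uint64_t start, uint64_t last,
					  uint64_t page_size)
{
	struct region_desc d = { .dma_base = start };
	uint64_t size = last - start + 1;

	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	/* Round up so that a partial trailing page is still tracked. */
	d.page_count = (uint32_t)((size + page_size - 1) / page_size);

	/* Smallest shift whose power of two reaches page_size. */
	while ((UINT64_C(1) << d.page_size_log2) < page_size)
		d.page_size_log2++;

	return d;
}
```

The round-up matters because a range that ends partway into a page must
still have that last page tracked.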

PDS_LM_CMD_DIRTY_DISABLE is added to disable dirty tracking for all
regions on the device.

PDS_LM_CMD_DIRTY_READ_SEQ and PDS_LM_CMD_DIRTY_WRITE_ACK are added to
support reading and acknowledging the currently dirtied pages.
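
The seq/ack comparison can be sketched in plain C as follows. This is a
simplified userspace illustration, assuming a page is dirty when its bit
differs between the seq and ack bitmaps; dirty_iovas_from_word and its
parameters are hypothetical names, not functions from this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): compare one 64-bit word of
 * the seq and ack bitmaps and record the IOVA of every page whose bit
 * differs. At most out_len addresses are stored; the return value is the
 * total number of dirty pages found in the word.
 */
static size_t dirty_iovas_from_word(uint64_t seq, uint64_t ack,
				    uint64_t first_bit, uint64_t region_start,
				    uint64_t page_size, uint64_t *out,
				    size_t out_len)
{
	uint64_t diff = seq ^ ack;	/* set bit: seq != ack, page is dirty */
	size_t n = 0;

	assert(out != NULL);

	for (unsigned int bit = 0; bit < 64; bit++) {
		if (!(diff & (UINT64_C(1) << bit)))
			continue;
		if (n < out_len)
			out[n] = region_start + (first_bit + bit) * page_size;
		n++;
	}

	return n;
}
```

After a word is processed, the caller would copy seq into ack so that
the next write-ack reports exactly the bits that were just consumed.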

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
---
 drivers/vfio/pci/pds/Makefile   |   1 +
 drivers/vfio/pci/pds/cmds.c     | 125 +++++++
 drivers/vfio/pci/pds/cmds.h     |   9 +
 drivers/vfio/pci/pds/dirty.c    | 577 ++++++++++++++++++++++++++++++++
 drivers/vfio/pci/pds/dirty.h    |  38 +++
 drivers/vfio/pci/pds/lm.c       |   2 +-
 drivers/vfio/pci/pds/vfio_dev.c |  11 +-
 drivers/vfio/pci/pds/vfio_dev.h |   4 +
 include/linux/pds/pds_adminq.h  | 178 ++++++++++
 9 files changed, 943 insertions(+), 2 deletions(-)
 create mode 100644 drivers/vfio/pci/pds/dirty.c
 create mode 100644 drivers/vfio/pci/pds/dirty.h

diff --git a/drivers/vfio/pci/pds/Makefile b/drivers/vfio/pci/pds/Makefile
index dbaf613d3794..805176f7be9f 100644
--- a/drivers/vfio/pci/pds/Makefile
+++ b/drivers/vfio/pci/pds/Makefile
@@ -5,6 +5,7 @@ obj-$(CONFIG_PDS_VFIO_PCI) += pds_vfio.o
 
 pds_vfio-y := \
 	cmds.o		\
+	dirty.o		\
 	lm.o		\
 	pci_drv.o	\
 	vfio_dev.o
diff --git a/drivers/vfio/pci/pds/cmds.c b/drivers/vfio/pci/pds/cmds.c
index 256f458feb58..a2cc6d5011f6 100644
--- a/drivers/vfio/pci/pds/cmds.c
+++ b/drivers/vfio/pci/pds/cmds.c
@@ -360,3 +360,128 @@ void pds_vfio_send_host_vf_lm_status_cmd(struct pds_vfio_pci_device *pds_vfio,
 		dev_warn(dev, "failed to send host VF migration status: %pe\n",
 			 ERR_PTR(err));
 }
+
+int pds_vfio_dirty_status_cmd(struct pds_vfio_pci_device *pds_vfio,
+			      u64 regions_dma, u8 *max_regions, u8 *num_regions)
+{
+	union pds_core_adminq_cmd cmd = {
+		.lm_dirty_status = {
+			.opcode = PDS_LM_CMD_DIRTY_STATUS,
+			.vf_id = cpu_to_le16(pds_vfio->vf_id),
+		},
+	};
+	struct device *dev = pds_vfio_to_dev(pds_vfio);
+	union pds_core_adminq_comp comp = {};
+	int err;
+
+	dev_dbg(dev, "vf%u: Dirty status\n", pds_vfio->vf_id);
+
+	cmd.lm_dirty_status.regions_dma = cpu_to_le64(regions_dma);
+	cmd.lm_dirty_status.max_regions = *max_regions;
+
+	err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp, 0);
+	if (err) {
+		dev_err(dev, "failed to get dirty status: %pe\n", ERR_PTR(err));
+		return err;
+	}
+
+	/* only support seq_ack approach for now */
+	if (!(le32_to_cpu(comp.lm_dirty_status.bmp_type_mask) &
+	      BIT(PDS_LM_DIRTY_BMP_TYPE_SEQ_ACK))) {
+		dev_err(dev, "Dirty bitmap tracking SEQ_ACK not supported\n");
+		return -EOPNOTSUPP;
+	}
+
+	*num_regions = comp.lm_dirty_status.num_regions;
+	*max_regions = comp.lm_dirty_status.max_regions;
+
+	dev_dbg(dev,
+		"Page Tracking Status command successful, max_regions: %d, num_regions: %d, bmp_type: %s\n",
+		*max_regions, *num_regions, "PDS_LM_DIRTY_BMP_TYPE_SEQ_ACK");
+
+	return 0;
+}
+
+int pds_vfio_dirty_enable_cmd(struct pds_vfio_pci_device *pds_vfio,
+			      u64 regions_dma, u8 num_regions)
+{
+	union pds_core_adminq_cmd cmd = {
+		.lm_dirty_enable = {
+			.opcode = PDS_LM_CMD_DIRTY_ENABLE,
+			.vf_id = cpu_to_le16(pds_vfio->vf_id),
+			.regions_dma = cpu_to_le64(regions_dma),
+			.bmp_type = PDS_LM_DIRTY_BMP_TYPE_SEQ_ACK,
+			.num_regions = num_regions,
+		},
+	};
+	struct device *dev = pds_vfio_to_dev(pds_vfio);
+	union pds_core_adminq_comp comp = {};
+	int err;
+
+	err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp, 0);
+	if (err) {
+		dev_err(dev, "failed dirty tracking enable: %pe\n",
+			ERR_PTR(err));
+		return err;
+	}
+
+	return 0;
+}
+
+int pds_vfio_dirty_disable_cmd(struct pds_vfio_pci_device *pds_vfio)
+{
+	union pds_core_adminq_cmd cmd = {
+		.lm_dirty_disable = {
+			.opcode = PDS_LM_CMD_DIRTY_DISABLE,
+			.vf_id = cpu_to_le16(pds_vfio->vf_id),
+		},
+	};
+	struct device *dev = pds_vfio_to_dev(pds_vfio);
+	union pds_core_adminq_comp comp = {};
+	int err;
+
+	err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp, 0);
+	if (err || comp.lm_dirty_status.num_regions != 0) {
+		/* in case num_regions is still non-zero after disable */
+		err = err ? err : -EIO;
+		dev_err(dev,
+			"failed dirty tracking disable: %pe, num_regions %d\n",
+			ERR_PTR(err), comp.lm_dirty_status.num_regions);
+		return err;
+	}
+
+	return 0;
+}
+
+int pds_vfio_dirty_seq_ack_cmd(struct pds_vfio_pci_device *pds_vfio,
+			       u64 sgl_dma, u16 num_sge, u32 offset,
+			       u32 total_len, bool read_seq)
+{
+	const char *cmd_type_str = read_seq ? "read_seq" : "write_ack";
+	union pds_core_adminq_cmd cmd = {
+		.lm_dirty_seq_ack = {
+			.vf_id = cpu_to_le16(pds_vfio->vf_id),
+			.len_bytes = cpu_to_le32(total_len),
+			.off_bytes = cpu_to_le32(offset),
+			.sgl_addr = cpu_to_le64(sgl_dma),
+			.num_sge = cpu_to_le16(num_sge),
+		},
+	};
+	struct device *dev = pds_vfio_to_dev(pds_vfio);
+	union pds_core_adminq_comp comp = {};
+	int err;
+
+	if (read_seq)
+		cmd.lm_dirty_seq_ack.opcode = PDS_LM_CMD_DIRTY_READ_SEQ;
+	else
+		cmd.lm_dirty_seq_ack.opcode = PDS_LM_CMD_DIRTY_WRITE_ACK;
+
+	err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp, 0);
+	if (err) {
+		dev_err(dev, "failed cmd Page Tracking %s: %pe\n", cmd_type_str,
+			ERR_PTR(err));
+		return err;
+	}
+
+	return 0;
+}
diff --git a/drivers/vfio/pci/pds/cmds.h b/drivers/vfio/pci/pds/cmds.h
index 3d8a5508c733..fc1f4ae611eb 100644
--- a/drivers/vfio/pci/pds/cmds.h
+++ b/drivers/vfio/pci/pds/cmds.h
@@ -13,4 +13,13 @@ int pds_vfio_get_lm_state_cmd(struct pds_vfio_pci_device *pds_vfio);
 int pds_vfio_set_lm_state_cmd(struct pds_vfio_pci_device *pds_vfio);
 void pds_vfio_send_host_vf_lm_status_cmd(struct pds_vfio_pci_device *pds_vfio,
 					 enum pds_lm_host_vf_status vf_status);
+int pds_vfio_dirty_status_cmd(struct pds_vfio_pci_device *pds_vfio,
+			      u64 regions_dma, u8 *max_regions,
+			      u8 *num_regions);
+int pds_vfio_dirty_enable_cmd(struct pds_vfio_pci_device *pds_vfio,
+			      u64 regions_dma, u8 num_regions);
+int pds_vfio_dirty_disable_cmd(struct pds_vfio_pci_device *pds_vfio);
+int pds_vfio_dirty_seq_ack_cmd(struct pds_vfio_pci_device *pds_vfio,
+			       u64 sgl_dma, u16 num_sge, u32 offset,
+			       u32 total_len, bool read_seq);
 #endif /* _CMDS_H_ */
diff --git a/drivers/vfio/pci/pds/dirty.c b/drivers/vfio/pci/pds/dirty.c
new file mode 100644
index 000000000000..321d06d378ca
--- /dev/null
+++ b/drivers/vfio/pci/pds/dirty.c
@@ -0,0 +1,577 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2023 Advanced Micro Devices, Inc. */
+
+#include <linux/interval_tree.h>
+#include <linux/vfio.h>
+
+#include <linux/pds/pds_common.h>
+#include <linux/pds/pds_core_if.h>
+#include <linux/pds/pds_adminq.h>
+
+#include "vfio_dev.h"
+#include "cmds.h"
+#include "dirty.h"
+
+#define READ_SEQ true
+#define WRITE_ACK false
+
+bool pds_vfio_dirty_is_enabled(struct pds_vfio_pci_device *pds_vfio)
+{
+	return pds_vfio->dirty.is_enabled;
+}
+
+void pds_vfio_dirty_set_enabled(struct pds_vfio_pci_device *pds_vfio)
+{
+	pds_vfio->dirty.is_enabled = true;
+}
+
+void pds_vfio_dirty_set_disabled(struct pds_vfio_pci_device *pds_vfio)
+{
+	pds_vfio->dirty.is_enabled = false;
+}
+
+static void
+pds_vfio_print_guest_region_info(struct pds_vfio_pci_device *pds_vfio,
+				 u8 max_regions)
+{
+	int len = max_regions * sizeof(struct pds_lm_dirty_region_info);
+	struct pci_dev *pdev = pds_vfio->vfio_coredev.pdev;
+	struct device *pdsc_dev = &pci_physfn(pdev)->dev;
+	struct pds_lm_dirty_region_info *region_info;
+	dma_addr_t regions_dma;
+	u8 num_regions;
+	int err;
+
+	region_info = kcalloc(max_regions,
+			      sizeof(struct pds_lm_dirty_region_info),
+			      GFP_KERNEL);
+	if (!region_info)
+		return;
+
+	regions_dma =
+		dma_map_single(pdsc_dev, region_info, len, DMA_FROM_DEVICE);
+	if (dma_mapping_error(pdsc_dev, regions_dma))
+		goto out_free_region_info;
+
+	err = pds_vfio_dirty_status_cmd(pds_vfio, regions_dma, &max_regions,
+					&num_regions);
+	dma_unmap_single(pdsc_dev, regions_dma, len, DMA_FROM_DEVICE);
+	if (err)
+		goto out_free_region_info;
+
+	for (unsigned int i = 0; i < num_regions; i++)
+		dev_dbg(&pdev->dev,
+			"region_info[%d]: dma_base 0x%llx page_count %u page_size_log2 %u\n",
+			i, le64_to_cpu(region_info[i].dma_base),
+			le32_to_cpu(region_info[i].page_count),
+			region_info[i].page_size_log2);
+
+out_free_region_info:
+	kfree(region_info);
+}
+
+static int pds_vfio_dirty_alloc_bitmaps(struct pds_vfio_dirty *dirty,
+					unsigned long bytes)
+{
+	unsigned long *host_seq_bmp, *host_ack_bmp;
+
+	host_seq_bmp = vzalloc(bytes);
+	if (!host_seq_bmp)
+		return -ENOMEM;
+
+	host_ack_bmp = vzalloc(bytes);
+	if (!host_ack_bmp) {
+		vfree(host_seq_bmp);
+		return -ENOMEM;
+	}
+
+	dirty->host_seq.bmp = host_seq_bmp;
+	dirty->host_ack.bmp = host_ack_bmp;
+
+	return 0;
+}
+
+static void pds_vfio_dirty_free_bitmaps(struct pds_vfio_dirty *dirty)
+{
+	if (dirty->host_seq.bmp)
+		vfree(dirty->host_seq.bmp);
+	if (dirty->host_ack.bmp)
+		vfree(dirty->host_ack.bmp);
+
+	dirty->host_seq.bmp = NULL;
+	dirty->host_ack.bmp = NULL;
+}
+
+static void __pds_vfio_dirty_free_sgl(struct pds_vfio_pci_device *pds_vfio,
+				      struct pds_vfio_bmp_info *bmp_info)
+{
+	struct pci_dev *pdev = pds_vfio->vfio_coredev.pdev;
+	struct device *pdsc_dev = &pci_physfn(pdev)->dev;
+
+	dma_unmap_single(pdsc_dev, bmp_info->sgl_addr,
+			 bmp_info->num_sge * sizeof(struct pds_lm_sg_elem),
+			 DMA_BIDIRECTIONAL);
+	kfree(bmp_info->sgl);
+
+	bmp_info->num_sge = 0;
+	bmp_info->sgl = NULL;
+	bmp_info->sgl_addr = 0;
+}
+
+static void pds_vfio_dirty_free_sgl(struct pds_vfio_pci_device *pds_vfio)
+{
+	if (pds_vfio->dirty.host_seq.sgl)
+		__pds_vfio_dirty_free_sgl(pds_vfio, &pds_vfio->dirty.host_seq);
+	if (pds_vfio->dirty.host_ack.sgl)
+		__pds_vfio_dirty_free_sgl(pds_vfio, &pds_vfio->dirty.host_ack);
+}
+
+static int __pds_vfio_dirty_alloc_sgl(struct pds_vfio_pci_device *pds_vfio,
+				      struct pds_vfio_bmp_info *bmp_info,
+				      u32 page_count)
+{
+	struct pci_dev *pdev = pds_vfio->vfio_coredev.pdev;
+	struct device *pdsc_dev = &pci_physfn(pdev)->dev;
+	struct pds_lm_sg_elem *sgl;
+	dma_addr_t sgl_addr;
+	size_t sgl_size;
+	u32 max_sge;
+
+	max_sge = DIV_ROUND_UP(page_count, PAGE_SIZE * 8);
+	sgl_size = max_sge * sizeof(struct pds_lm_sg_elem);
+
+	sgl = kzalloc(sgl_size, GFP_KERNEL);
+	if (!sgl)
+		return -ENOMEM;
+
+	sgl_addr = dma_map_single(pdsc_dev, sgl, sgl_size, DMA_BIDIRECTIONAL);
+	if (dma_mapping_error(pdsc_dev, sgl_addr)) {
+		kfree(sgl);
+		return -EIO;
+	}
+
+	bmp_info->sgl = sgl;
+	bmp_info->num_sge = max_sge;
+	bmp_info->sgl_addr = sgl_addr;
+
+	return 0;
+}
+
+static int pds_vfio_dirty_alloc_sgl(struct pds_vfio_pci_device *pds_vfio,
+				    u32 page_count)
+{
+	struct pds_vfio_dirty *dirty = &pds_vfio->dirty;
+	int err;
+
+	err = __pds_vfio_dirty_alloc_sgl(pds_vfio, &dirty->host_seq,
+					 page_count);
+	if (err)
+		return err;
+
+	err = __pds_vfio_dirty_alloc_sgl(pds_vfio, &dirty->host_ack,
+					 page_count);
+	if (err) {
+		__pds_vfio_dirty_free_sgl(pds_vfio, &dirty->host_seq);
+		return err;
+	}
+
+	return 0;
+}
+
+static int pds_vfio_dirty_enable(struct pds_vfio_pci_device *pds_vfio,
+				 struct rb_root_cached *ranges, u32 nnodes,
+				 u64 *page_size)
+{
+	struct pci_dev *pdev = pds_vfio->vfio_coredev.pdev;
+	struct device *pdsc_dev = &pci_physfn(pdev)->dev;
+	struct pds_vfio_dirty *dirty = &pds_vfio->dirty;
+	u64 region_start, region_size, region_page_size;
+	struct pds_lm_dirty_region_info *region_info;
+	struct interval_tree_node *node = NULL;
+	u8 max_regions = 0, num_regions;
+	dma_addr_t regions_dma = 0;
+	u32 num_ranges = nnodes;
+	u32 page_count;
+	u16 len;
+	int err;
+
+	dev_dbg(&pdev->dev, "vf%u: Start dirty page tracking\n",
+		pds_vfio->vf_id);
+
+	if (pds_vfio_dirty_is_enabled(pds_vfio))
+		return -EINVAL;
+
+	pds_vfio_dirty_set_enabled(pds_vfio);
+
+	/* find if dirty tracking is disabled, i.e. num_regions == 0 */
+	err = pds_vfio_dirty_status_cmd(pds_vfio, 0, &max_regions,
+					&num_regions);
+	if (err < 0) {
+		dev_err(&pdev->dev, "Failed to get dirty status, err %pe\n",
+			ERR_PTR(err));
+		goto out_set_disabled;
+	} else if (num_regions) {
+		dev_err(&pdev->dev,
+			"Dirty tracking already enabled for %d regions\n",
+			num_regions);
+		err = -EEXIST;
+		goto out_set_disabled;
+	} else if (!max_regions) {
+		dev_err(&pdev->dev,
+			"Device doesn't support dirty tracking, max_regions %d\n",
+			max_regions);
+		err = -EOPNOTSUPP;
+		goto out_set_disabled;
+	}
+
+	/*
+	 * Only support 1 region for now. If there are any large gaps in the
+	 * VM's address regions, then this would be a waste of memory as we are
+	 * generating 2 bitmaps (ack/seq) from the min address to the max
+	 * address of the VM's address regions. In the future, if we support
+	 * more than one region in the device/driver we can split the bitmaps
+	 * on the largest address region gaps. We can do this split up to the
+	 * max_regions times returned from the dirty_status command.
+	 */
+	max_regions = 1;
+	if (num_ranges > max_regions) {
+		vfio_combine_iova_ranges(ranges, nnodes, max_regions);
+		num_ranges = max_regions;
+	}
+
+	node = interval_tree_iter_first(ranges, 0, ULONG_MAX);
+	if (!node) {
+		err = -EINVAL;
+		goto out_set_disabled;
+	}
+
+	region_size = node->last - node->start + 1;
+	region_start = node->start;
+	region_page_size = *page_size;
+
+	len = sizeof(*region_info);
+	region_info = kzalloc(len, GFP_KERNEL);
+	if (!region_info) {
+		err = -ENOMEM;
+		goto out_set_disabled;
+	}
+
+	page_count = DIV_ROUND_UP(region_size, region_page_size);
+
+	region_info->dma_base = cpu_to_le64(region_start);
+	region_info->page_count = cpu_to_le32(page_count);
+	region_info->page_size_log2 = ilog2(region_page_size);
+
+	regions_dma = dma_map_single(pdsc_dev, (void *)region_info, len,
+				     DMA_BIDIRECTIONAL);
+	if (dma_mapping_error(pdsc_dev, regions_dma)) {
+		err = -ENOMEM;
+		goto out_free_region_info;
+	}
+
+	err = pds_vfio_dirty_enable_cmd(pds_vfio, regions_dma, max_regions);
+	dma_unmap_single(pdsc_dev, regions_dma, len, DMA_BIDIRECTIONAL);
+	if (err)
+		goto out_free_region_info;
+
+	/*
+	 * page_count might be adjusted by the device,
+	 * so re-read it before region_info is freed
+	 */
+	page_count = le32_to_cpu(region_info->page_count);
+
+	dev_dbg(&pdev->dev,
+		"region_info: regions_dma 0x%llx dma_base 0x%llx page_count %u page_size_log2 %u\n",
+		regions_dma, region_start, page_count,
+		(u8)ilog2(region_page_size));
+
+	err = pds_vfio_dirty_alloc_bitmaps(dirty, page_count / BITS_PER_BYTE);
+	if (err) {
+		dev_err(&pdev->dev, "Failed to alloc dirty bitmaps: %pe\n",
+			ERR_PTR(err));
+		goto out_free_region_info;
+	}
+
+	err = pds_vfio_dirty_alloc_sgl(pds_vfio, page_count);
+	if (err) {
+		dev_err(&pdev->dev, "Failed to alloc dirty sg lists: %pe\n",
+			ERR_PTR(err));
+		goto out_free_bitmaps;
+	}
+
+	dirty->region_start = region_start;
+	dirty->region_size = region_size;
+	dirty->region_page_size = region_page_size;
+
+	pds_vfio_print_guest_region_info(pds_vfio, max_regions);
+
+	kfree(region_info);
+
+	return 0;
+
+out_free_bitmaps:
+	pds_vfio_dirty_free_bitmaps(dirty);
+out_free_region_info:
+	kfree(region_info);
+out_set_disabled:
+	pds_vfio_dirty_set_disabled(pds_vfio);
+	return err;
+}
+
+int pds_vfio_dirty_disable(struct pds_vfio_pci_device *pds_vfio)
+{
+	int err = 0;
+
+	if (pds_vfio_dirty_is_enabled(pds_vfio)) {
+		pds_vfio_dirty_set_disabled(pds_vfio);
+		err = pds_vfio_dirty_disable_cmd(pds_vfio);
+		pds_vfio_dirty_free_sgl(pds_vfio);
+		pds_vfio_dirty_free_bitmaps(&pds_vfio->dirty);
+	}
+
+	pds_vfio_send_host_vf_lm_status_cmd(pds_vfio, PDS_LM_STA_NONE);
+	return err;
+}
+
+static int pds_vfio_dirty_seq_ack(struct pds_vfio_pci_device *pds_vfio,
+				  struct pds_vfio_bmp_info *bmp_info,
+				  u32 offset, u32 bmp_bytes, bool read_seq)
+{
+	const char *bmp_type_str = read_seq ? "read_seq" : "write_ack";
+	u8 dma_dir = read_seq ? DMA_FROM_DEVICE : DMA_TO_DEVICE;
+	struct pci_dev *pdev = pds_vfio->vfio_coredev.pdev;
+	struct device *pdsc_dev = &pci_physfn(pdev)->dev;
+	unsigned long long npages;
+	struct sg_table sg_table;
+	struct scatterlist *sg;
+	struct page **pages;
+	u32 page_offset;
+	const void *bmp;
+	size_t size;
+	u16 num_sge;
+	int err;
+	int i;
+
+	bmp = (void *)((u64)bmp_info->bmp + offset);
+	page_offset = offset_in_page(bmp);
+	bmp -= page_offset;
+
+	/*
+	 * Start and end of bitmap section to seq/ack might not be page
+	 * aligned, so use the page_offset to account for that so there
+	 * will be enough pages to represent the bmp_bytes
+	 */
+	npages = DIV_ROUND_UP_ULL(bmp_bytes + page_offset, PAGE_SIZE);
+	pages = kmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
+	if (!pages)
+		return -ENOMEM;
+
+	for (unsigned long long i = 0; i < npages; i++) {
+		struct page *page = vmalloc_to_page(bmp);
+		if (!page) {
+			err = -EFAULT;
+			goto out_free_pages;
+		}
+
+		pages[i] = page;
+		bmp += PAGE_SIZE;
+	}
+
+	err = sg_alloc_table_from_pages(&sg_table, pages, npages, page_offset,
+					bmp_bytes, GFP_KERNEL);
+	if (err)
+		goto out_free_pages;
+
+	err = dma_map_sgtable(pdsc_dev, &sg_table, dma_dir, 0);
+	if (err)
+		goto out_free_sg_table;
+
+	for_each_sgtable_dma_sg(&sg_table, sg, i) {
+		struct pds_lm_sg_elem *sg_elem = &bmp_info->sgl[i];
+
+		sg_elem->addr = cpu_to_le64(sg_dma_address(sg));
+		sg_elem->len = cpu_to_le32(sg_dma_len(sg));
+	}
+
+	num_sge = sg_table.nents;
+	size = num_sge * sizeof(struct pds_lm_sg_elem);
+	dma_sync_single_for_device(pdsc_dev, bmp_info->sgl_addr, size, dma_dir);
+	err = pds_vfio_dirty_seq_ack_cmd(pds_vfio, bmp_info->sgl_addr, num_sge,
+					 offset, bmp_bytes, read_seq);
+	if (err)
+		dev_err(&pdev->dev,
+			"Dirty bitmap %s failed offset %u bmp_bytes %u num_sge %u DMA 0x%llx: %pe\n",
+			bmp_type_str, offset, bmp_bytes,
+			num_sge, bmp_info->sgl_addr, ERR_PTR(err));
+	dma_sync_single_for_cpu(pdsc_dev, bmp_info->sgl_addr, size, dma_dir);
+
+	dma_unmap_sgtable(pdsc_dev, &sg_table, dma_dir, 0);
+out_free_sg_table:
+	sg_free_table(&sg_table);
+out_free_pages:
+	kfree(pages);
+
+	return err;
+}
+
+static int pds_vfio_dirty_write_ack(struct pds_vfio_pci_device *pds_vfio,
+				    u32 offset, u32 len)
+{
+	return pds_vfio_dirty_seq_ack(pds_vfio, &pds_vfio->dirty.host_ack,
+				      offset, len, WRITE_ACK);
+}
+
+static int pds_vfio_dirty_read_seq(struct pds_vfio_pci_device *pds_vfio,
+				   u32 offset, u32 len)
+{
+	return pds_vfio_dirty_seq_ack(pds_vfio, &pds_vfio->dirty.host_seq,
+				      offset, len, READ_SEQ);
+}
+
+static int pds_vfio_dirty_process_bitmaps(struct pds_vfio_pci_device *pds_vfio,
+					  struct iova_bitmap *dirty_bitmap,
+					  u32 bmp_offset, u32 len_bytes)
+{
+	u64 page_size = pds_vfio->dirty.region_page_size;
+	u64 region_start = pds_vfio->dirty.region_start;
+	u32 bmp_offset_bit;
+	__le64 *seq, *ack;
+	int dword_count;
+
+	dword_count = len_bytes / sizeof(u64);
+	seq = (__le64 *)((u64)pds_vfio->dirty.host_seq.bmp + bmp_offset);
+	ack = (__le64 *)((u64)pds_vfio->dirty.host_ack.bmp + bmp_offset);
+	bmp_offset_bit = bmp_offset * 8;
+
+	for (int i = 0; i < dword_count; i++) {
+		u64 xor = le64_to_cpu(seq[i]) ^ le64_to_cpu(ack[i]);
+
+		/* prepare for next write_ack call */
+		ack[i] = seq[i];
+
+		for (u8 bit_i = 0; bit_i < BITS_PER_TYPE(u64); ++bit_i) {
+			if (xor & BIT(bit_i)) {
+				u64 abs_bit_i = bmp_offset_bit +
+						i * BITS_PER_TYPE(u64) + bit_i;
+				u64 addr = abs_bit_i * page_size + region_start;
+
+				iova_bitmap_set(dirty_bitmap, addr, page_size);
+			}
+		}
+	}
+
+	return 0;
+}
+
+static int pds_vfio_dirty_sync(struct pds_vfio_pci_device *pds_vfio,
+			       struct iova_bitmap *dirty_bitmap,
+			       unsigned long iova, unsigned long length)
+{
+	struct device *dev = &pds_vfio->vfio_coredev.pdev->dev;
+	struct pds_vfio_dirty *dirty = &pds_vfio->dirty;
+	u64 bmp_offset, bmp_bytes;
+	u64 bitmap_size, pages;
+	int err;
+
+	dev_dbg(dev, "vf%u: Get dirty page bitmap\n", pds_vfio->vf_id);
+
+	if (!pds_vfio_dirty_is_enabled(pds_vfio)) {
+		dev_err(dev, "vf%u: Sync failed, dirty tracking is disabled\n",
+			pds_vfio->vf_id);
+		return -EINVAL;
+	}
+
+	pages = DIV_ROUND_UP(length, pds_vfio->dirty.region_page_size);
+	bitmap_size =
+		round_up(pages, sizeof(u64) * BITS_PER_BYTE) / BITS_PER_BYTE;
+
+	dev_dbg(dev,
+		"vf%u: iova 0x%lx length %lu page_size %llu pages %llu bitmap_size %llu\n",
+		pds_vfio->vf_id, iova, length, pds_vfio->dirty.region_page_size,
+		pages, bitmap_size);
+
+	if (!length || ((dirty->region_start + iova + length) >
+			(dirty->region_start + dirty->region_size))) {
+		dev_err(dev, "Invalid iova 0x%lx and/or length 0x%lx to sync\n",
+			iova, length);
+		return -EINVAL;
+	}
+
+	/* bitmap is modified in 64 bit chunks */
+	bmp_bytes = ALIGN(DIV_ROUND_UP(length / dirty->region_page_size,
+				       sizeof(u64)),
+			  sizeof(u64));
+	if (bmp_bytes != bitmap_size) {
+		dev_err(dev,
+			"Calculated bitmap bytes %llu not equal to bitmap size %llu\n",
+			bmp_bytes, bitmap_size);
+		return -EINVAL;
+	}
+
+	bmp_offset = DIV_ROUND_UP(iova / dirty->region_page_size, sizeof(u64));
+
+	dev_dbg(dev,
+		"Syncing dirty bitmap, iova 0x%lx length 0x%lx, bmp_offset %llu bmp_bytes %llu\n",
+		iova, length, bmp_offset, bmp_bytes);
+
+	err = pds_vfio_dirty_read_seq(pds_vfio, bmp_offset, bmp_bytes);
+	if (err)
+		return err;
+
+	err = pds_vfio_dirty_process_bitmaps(pds_vfio, dirty_bitmap, bmp_offset,
+					     bmp_bytes);
+	if (err)
+		return err;
+
+	err = pds_vfio_dirty_write_ack(pds_vfio, bmp_offset, bmp_bytes);
+	if (err)
+		return err;
+
+	return 0;
+}
+
+int pds_vfio_dma_logging_report(struct vfio_device *vdev, unsigned long iova,
+				unsigned long length, struct iova_bitmap *dirty)
+{
+	struct pds_vfio_pci_device *pds_vfio =
+		container_of(vdev, struct pds_vfio_pci_device,
+			     vfio_coredev.vdev);
+	int err;
+
+	mutex_lock(&pds_vfio->state_mutex);
+	err = pds_vfio_dirty_sync(pds_vfio, dirty, iova, length);
+	pds_vfio_state_mutex_unlock(pds_vfio);
+
+	return err;
+}
+
+int pds_vfio_dma_logging_start(struct vfio_device *vdev,
+			       struct rb_root_cached *ranges, u32 nnodes,
+			       u64 *page_size)
+{
+	struct pds_vfio_pci_device *pds_vfio =
+		container_of(vdev, struct pds_vfio_pci_device,
+			     vfio_coredev.vdev);
+	int err;
+
+	mutex_lock(&pds_vfio->state_mutex);
+	pds_vfio_send_host_vf_lm_status_cmd(pds_vfio, PDS_LM_STA_IN_PROGRESS);
+	err = pds_vfio_dirty_enable(pds_vfio, ranges, nnodes, page_size);
+	pds_vfio_state_mutex_unlock(pds_vfio);
+
+	return err;
+}
+
+int pds_vfio_dma_logging_stop(struct vfio_device *vdev)
+{
+	struct pds_vfio_pci_device *pds_vfio =
+		container_of(vdev, struct pds_vfio_pci_device,
+			     vfio_coredev.vdev);
+	int err;
+
+	mutex_lock(&pds_vfio->state_mutex);
+	err = pds_vfio_dirty_disable(pds_vfio);
+	pds_vfio_state_mutex_unlock(pds_vfio);
+
+	return err;
+}
diff --git a/drivers/vfio/pci/pds/dirty.h b/drivers/vfio/pci/pds/dirty.h
new file mode 100644
index 000000000000..d9fdce08e113
--- /dev/null
+++ b/drivers/vfio/pci/pds/dirty.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2023 Advanced Micro Devices, Inc. */
+
+#ifndef _DIRTY_H_
+#define _DIRTY_H_
+
+struct pds_vfio_bmp_info {
+	unsigned long *bmp;
+	u32 bmp_bytes;
+	struct pds_lm_sg_elem *sgl;
+	dma_addr_t sgl_addr;
+	u16 num_sge;
+};
+
+struct pds_vfio_dirty {
+	struct pds_vfio_bmp_info host_seq;
+	struct pds_vfio_bmp_info host_ack;
+	u64 region_size;
+	u64 region_start;
+	u64 region_page_size;
+	bool is_enabled;
+};
+
+struct pds_vfio_pci_device;
+
+bool pds_vfio_dirty_is_enabled(struct pds_vfio_pci_device *pds_vfio);
+void pds_vfio_dirty_set_enabled(struct pds_vfio_pci_device *pds_vfio);
+void pds_vfio_dirty_set_disabled(struct pds_vfio_pci_device *pds_vfio);
+int pds_vfio_dirty_disable(struct pds_vfio_pci_device *pds_vfio);
+
+int pds_vfio_dma_logging_report(struct vfio_device *vdev, unsigned long iova,
+				unsigned long length,
+				struct iova_bitmap *dirty);
+int pds_vfio_dma_logging_start(struct vfio_device *vdev,
+			       struct rb_root_cached *ranges, u32 nnodes,
+			       u64 *page_size);
+int pds_vfio_dma_logging_stop(struct vfio_device *vdev);
+#endif /* _DIRTY_H_ */
diff --git a/drivers/vfio/pci/pds/lm.c b/drivers/vfio/pci/pds/lm.c
index c507f39a2339..9116527408da 100644
--- a/drivers/vfio/pci/pds/lm.c
+++ b/drivers/vfio/pci/pds/lm.c
@@ -371,7 +371,7 @@ pds_vfio_step_device_state_locked(struct pds_vfio_pci_device *pds_vfio,
 
 	if (cur == VFIO_DEVICE_STATE_STOP_COPY && next == VFIO_DEVICE_STATE_STOP) {
 		pds_vfio_put_save_file(pds_vfio);
-		pds_vfio_send_host_vf_lm_status_cmd(pds_vfio, PDS_LM_STA_NONE);
+		pds_vfio_dirty_disable(pds_vfio);
 		return NULL;
 	}
 
diff --git a/drivers/vfio/pci/pds/vfio_dev.c b/drivers/vfio/pci/pds/vfio_dev.c
index 2435d8255366..c58b0c1fc811 100644
--- a/drivers/vfio/pci/pds/vfio_dev.c
+++ b/drivers/vfio/pci/pds/vfio_dev.c
@@ -5,6 +5,7 @@
 #include <linux/vfio_pci_core.h>
 
 #include "lm.h"
+#include "dirty.h"
 #include "vfio_dev.h"
 
 struct pci_dev *pds_vfio_to_pci_dev(struct pds_vfio_pci_device *pds_vfio)
@@ -25,7 +26,7 @@ struct pds_vfio_pci_device *pds_vfio_pci_drvdata(struct pci_dev *pdev)
 			    vfio_coredev);
 }
 
-static void pds_vfio_state_mutex_unlock(struct pds_vfio_pci_device *pds_vfio)
+void pds_vfio_state_mutex_unlock(struct pds_vfio_pci_device *pds_vfio)
 {
 again:
 	spin_lock(&pds_vfio->reset_lock);
@@ -117,6 +118,12 @@ static const struct vfio_migration_ops pds_vfio_lm_ops = {
 	.migration_get_data_size = pds_vfio_get_device_state_size
 };
 
+static const struct vfio_log_ops pds_vfio_log_ops = {
+	.log_start = pds_vfio_dma_logging_start,
+	.log_stop = pds_vfio_dma_logging_stop,
+	.log_read_and_clear = pds_vfio_dma_logging_report,
+};
+
 static int pds_vfio_init_device(struct vfio_device *vdev)
 {
 	struct pds_vfio_pci_device *pds_vfio =
@@ -134,6 +141,7 @@ static int pds_vfio_init_device(struct vfio_device *vdev)
 
 	vdev->migration_flags = VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P;
 	vdev->mig_ops = &pds_vfio_lm_ops;
+	vdev->log_ops = &pds_vfio_log_ops;
 
 	dev_dbg(&pdev->dev,
 		"%s: PF %#04x VF %#04x (%d) vf_id %d domain %d pds_vfio %p\n",
@@ -172,6 +180,7 @@ static void pds_vfio_close_device(struct vfio_device *vdev)
 	mutex_lock(&pds_vfio->state_mutex);
 	pds_vfio_put_restore_file(pds_vfio);
 	pds_vfio_put_save_file(pds_vfio);
+	pds_vfio_dirty_disable(pds_vfio);
 	mutex_unlock(&pds_vfio->state_mutex);
 	mutex_destroy(&pds_vfio->state_mutex);
 	vfio_pci_core_close_device(vdev);
diff --git a/drivers/vfio/pci/pds/vfio_dev.h b/drivers/vfio/pci/pds/vfio_dev.h
index df6208a7140b..1e28c072ce08 100644
--- a/drivers/vfio/pci/pds/vfio_dev.h
+++ b/drivers/vfio/pci/pds/vfio_dev.h
@@ -7,6 +7,7 @@
 #include <linux/pci.h>
 #include <linux/vfio_pci_core.h>
 
+#include "dirty.h"
 #include "lm.h"
 
 struct pdsc;
@@ -17,6 +18,7 @@ struct pds_vfio_pci_device {
 
 	struct pds_vfio_lm_file *save_file;
 	struct pds_vfio_lm_file *restore_file;
+	struct pds_vfio_dirty dirty;
 	struct mutex state_mutex; /* protect migration state */
 	enum vfio_device_mig_state state;
 	spinlock_t reset_lock; /* protect reset_done flow */
@@ -27,6 +29,8 @@ struct pds_vfio_pci_device {
 	u16 client_id;
 };
 
+void pds_vfio_state_mutex_unlock(struct pds_vfio_pci_device *pds_vfio);
+
 const struct vfio_device_ops *pds_vfio_ops_info(void);
 struct pds_vfio_pci_device *pds_vfio_pci_drvdata(struct pci_dev *pdev);
 void pds_vfio_reset(struct pds_vfio_pci_device *pds_vfio);
diff --git a/include/linux/pds/pds_adminq.h b/include/linux/pds/pds_adminq.h
index db6de081f15f..2e51f66b85a0 100644
--- a/include/linux/pds/pds_adminq.h
+++ b/include/linux/pds/pds_adminq.h
@@ -601,6 +601,13 @@ enum pds_lm_cmd_opcode {
 	PDS_LM_CMD_RESUME          = 20,
 	PDS_LM_CMD_SAVE            = 21,
 	PDS_LM_CMD_RESTORE         = 22,
+
+	/* Dirty page tracking commands */
+	PDS_LM_CMD_DIRTY_STATUS    = 32,
+	PDS_LM_CMD_DIRTY_ENABLE    = 33,
+	PDS_LM_CMD_DIRTY_DISABLE   = 34,
+	PDS_LM_CMD_DIRTY_READ_SEQ  = 35,
+	PDS_LM_CMD_DIRTY_WRITE_ACK = 36,
 };
 
 /**
@@ -777,6 +784,172 @@ enum pds_lm_host_vf_status {
 	PDS_LM_STA_MAX,
 };
 
+/**
+ * struct pds_lm_dirty_region_info - Memory region info for STATUS and ENABLE
+ * @dma_base:		Base address of the DMA-contiguous memory region
+ * @page_count:		Number of pages in the memory region
+ * @page_size_log2:	Log2 page size in the memory region
+ * @rsvd:		Word boundary padding
+ */
+struct pds_lm_dirty_region_info {
+	__le64 dma_base;
+	__le32 page_count;
+	u8     page_size_log2;
+	u8     rsvd[3];
+};
+
+/**
+ * struct pds_lm_dirty_status_cmd - DIRTY_STATUS command
+ * @opcode:		Opcode PDS_LM_CMD_DIRTY_STATUS
+ * @rsvd:		Word boundary padding
+ * @vf_id:		VF id
+ * @max_regions:	Capacity of the region info buffer
+ * @rsvd2:		Word boundary padding
+ * @regions_dma:	DMA address of the region info buffer
+ *
+ * The minimum of max_regions (from the command) and num_regions (from the
+ * completion) of struct pds_lm_dirty_region_info will be written to
+ * regions_dma.
+ *
+ * The max_regions may be zero, in which case regions_dma is ignored.  In that
+ * case, the completion will only report the maximum number of regions
+ * supported by the device, and the number of regions currently enabled.
+ */
+struct pds_lm_dirty_status_cmd {
+	u8     opcode;
+	u8     rsvd;
+	__le16 vf_id;
+	u8     max_regions;
+	u8     rsvd2[3];
+	__le64 regions_dma;
+} __packed;
+
+/**
+ * enum pds_lm_dirty_bmp_type - Type of dirty page bitmap
+ * @PDS_LM_DIRTY_BMP_TYPE_NONE: No bitmap / disabled
+ * @PDS_LM_DIRTY_BMP_TYPE_SEQ_ACK: Seq/Ack bitmap representation
+ */
+enum pds_lm_dirty_bmp_type {
+	PDS_LM_DIRTY_BMP_TYPE_NONE     = 0,
+	PDS_LM_DIRTY_BMP_TYPE_SEQ_ACK  = 1,
+};
+
+/**
+ * struct pds_lm_dirty_status_comp - STATUS command completion
+ * @status:		Status of the command (enum pds_core_status_code)
+ * @rsvd:		Word boundary padding
+ * @comp_index:		Index in the desc ring for which this is the completion
+ * @max_regions:	Maximum number of regions supported by the device
+ * @num_regions:	Number of regions currently enabled
+ * @bmp_type:		Type of dirty bitmap representation
+ * @rsvd2:		Word boundary padding
+ * @bmp_type_mask:	Mask of supported bitmap types, bit index per type
+ * @rsvd3:		Word boundary padding
+ * @color:		Color bit
+ *
+ * This completion descriptor is used for STATUS, ENABLE, and DISABLE.
+ */
+struct pds_lm_dirty_status_comp {
+	u8     status;
+	u8     rsvd;
+	__le16 comp_index;
+	u8     max_regions;
+	u8     num_regions;
+	u8     bmp_type;
+	u8     rsvd2;
+	__le32 bmp_type_mask;
+	u8     rsvd3[3];
+	u8     color;
+};
+
+/**
+ * struct pds_lm_dirty_enable_cmd - DIRTY_ENABLE command
+ * @opcode:		Opcode PDS_LM_CMD_DIRTY_ENABLE
+ * @rsvd:		Word boundary padding
+ * @vf_id:		VF id
+ * @bmp_type:		Type of dirty bitmap representation
+ * @num_regions:	Number of entries in the region info buffer
+ * @rsvd2:		Word boundary padding
+ * @regions_dma:	DMA address of the region info buffer
+ *
+ * The num_regions must be nonzero, and less than or equal to the maximum
+ * number of regions supported by the device.
+ *
+ * The memory regions should not overlap.
+ *
+ * The information should be initialized by the driver.  The device may modify
+ * the information on successful completion, such as by size-aligning the
+ * number of pages in a region.
+ *
+ * The modified number of pages will be greater than or equal to the page count
+ * given in the enable command, and at least as coarsely aligned as the given
+ * value.  For example, the count might be aligned to a multiple of 64, but
+ * if the value is already a multiple of 128 or higher, it will not change.
+ * If the driver requires its own minimum alignment of the number of pages, the
+ * driver should account for that already in the region info of this command.
+ *
+ * This command uses struct pds_lm_dirty_status_comp for its completion.
+ */
+struct pds_lm_dirty_enable_cmd {
+	u8     opcode;
+	u8     rsvd;
+	__le16 vf_id;
+	u8     bmp_type;
+	u8     num_regions;
+	u8     rsvd2[2];
+	__le64 regions_dma;
+} __packed;
+
+/**
+ * struct pds_lm_dirty_disable_cmd - DIRTY_DISABLE command
+ * @opcode:	Opcode PDS_LM_CMD_DIRTY_DISABLE
+ * @rsvd:	Word boundary padding
+ * @vf_id:	VF id
+ *
+ * Dirty page tracking will be disabled.  This may be called in any state, as
+ * long as dirty page tracking is supported by the device, to ensure that dirty
+ * page tracking is disabled.
+ *
+ * This command uses struct pds_lm_dirty_status_comp for its completion.  On
+ * success, num_regions will be zero.
+ */
+struct pds_lm_dirty_disable_cmd {
+	u8     opcode;
+	u8     rsvd;
+	__le16 vf_id;
+};
+
+/**
+ * struct pds_lm_dirty_seq_ack_cmd - DIRTY_READ_SEQ or _WRITE_ACK command
+ * @opcode:	Opcode PDS_LM_CMD_DIRTY_[READ_SEQ|WRITE_ACK]
+ * @rsvd:	Word boundary padding
+ * @vf_id:	VF id
+ * @off_bytes:	Byte offset in the bitmap
+ * @len_bytes:	Number of bytes to transfer
+ * @num_sge:	Number of DMA scatter gather elements
+ * @rsvd2:	Word boundary padding
+ * @sgl_addr:	DMA address of scatter gather list
+ *
+ * Read bytes from the SEQ bitmap, or write bytes into the ACK bitmap.
+ *
+ * This command treats the entire bitmap as a byte buffer.  It does not
+ * distinguish between guest memory regions.  The driver should refer to the
+ * number of pages in each region, according to PDS_LM_CMD_DIRTY_STATUS, to
+ * determine the region boundaries in the bitmap.  Each region will be
+ * represented by exactly the number of bits as the page count for that region,
+ * immediately following the last bit of the previous region.
+ */
+struct pds_lm_dirty_seq_ack_cmd {
+	u8     opcode;
+	u8     rsvd;
+	__le16 vf_id;
+	__le32 off_bytes;
+	__le32 len_bytes;
+	__le16 num_sge;
+	u8     rsvd2[2];
+	__le64 sgl_addr;
+} __packed;
+
 /**
  * struct pds_lm_host_vf_status_cmd - HOST_VF_STATUS command
  * @opcode:	Opcode PDS_LM_CMD_HOST_VF_STATUS
@@ -815,6 +988,10 @@ union pds_core_adminq_cmd {
 	struct pds_lm_save_cmd			lm_save;
 	struct pds_lm_restore_cmd		lm_restore;
 	struct pds_lm_host_vf_status_cmd	lm_host_vf_status;
+	struct pds_lm_dirty_status_cmd		lm_dirty_status;
+	struct pds_lm_dirty_enable_cmd		lm_dirty_enable;
+	struct pds_lm_dirty_disable_cmd		lm_dirty_disable;
+	struct pds_lm_dirty_seq_ack_cmd		lm_dirty_seq_ack;
 };
 
 union pds_core_adminq_comp {
@@ -838,6 +1015,7 @@ union pds_core_adminq_comp {
 	struct pds_core_q_init_comp       q_init;
 
 	struct pds_lm_status_comp		lm_status;
+	struct pds_lm_dirty_status_comp		lm_dirty_status;
 };
 
 #ifndef __CHECKER__
-- 
2.17.1
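
The bitmap layout described in the pds_lm_dirty_seq_ack_cmd kernel-doc in
this patch (each region occupies exactly page_count bits, immediately
after the previous region) and the page-count alignment example from the
pds_lm_dirty_enable_cmd kernel-doc can be sketched in plain userspace C.
This is only an illustration under stated assumptions, not code from the
driver: struct region_pages and all three helpers are hypothetical names,
the alignment of 64 is just the value used as an example in the comment,
and the bit order within a byte (least significant bit first) is an
assumption that the patch does not specify.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side record of one region (not a driver struct). */
struct region_pages {
	uint32_t page_count;	/* pages in the region, as reported by DIRTY_STATUS */
};

/*
 * Round count up to a multiple of align. align must be a power of two.
 * A count that is already a multiple of align is returned unchanged.
 */
static uint64_t align_up_pow2(uint64_t count, uint64_t align)
{
	return (count + align - 1) & ~(align - 1);
}

/*
 * Return the index of the bit that tracks page `page` of region `region`
 * when the whole bitmap is viewed as one bit string, or -1 if the region
 * or page is out of range. Regions are packed back to back, so the first
 * bit of a region is the sum of the page counts of all earlier regions.
 */
static int64_t dirty_bit_index(const struct region_pages *regions,
			       size_t num_regions, size_t region,
			       uint32_t page)
{
	uint64_t first_bit = 0;
	size_t i;

	if (region >= num_regions || page >= regions[region].page_count)
		return -1;

	for (i = 0; i < region; i++)
		first_bit += regions[i].page_count;

	return (int64_t)(first_bit + page);
}

/*
 * Test one bit in a bitmap held as a byte buffer.
 * ASSUMPTION: bit 0 of the bitmap is the least significant bit of byte 0.
 */
static int dirty_bit_is_set(const uint8_t *bmp, uint64_t bit)
{
	return (bmp[bit / 8] >> (bit % 8)) & 1;
}
```

With two regions of 128 and 64 pages, page 0 of the second region maps to
bit 128, which lives in byte 16 of the buffer. That byte position is what
a caller would use to derive off_bytes and len_bytes for a partial
READ_SEQ or WRITE_ACK transfer.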


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v10 vfio 6/7] vfio/pds: Add support for firmware recovery
  2023-06-02 22:03 [PATCH v10 vfio 0/7] pds_vfio driver Brett Creeley
                   ` (4 preceding siblings ...)
  2023-06-02 22:03 ` [PATCH v10 vfio 5/7] vfio/pds: Add support for dirty page tracking Brett Creeley
@ 2023-06-02 22:03 ` Brett Creeley
  2023-06-16  8:24   ` Tian, Kevin
  2023-06-02 22:03 ` [PATCH v10 vfio 7/7] vfio/pds: Add Kconfig and documentation Brett Creeley
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Brett Creeley @ 2023-06-02 22:03 UTC (permalink / raw)
  To: kvm, netdev, alex.williamson, jgg, yishaih,
	shameerali.kolothum.thodi, kevin.tian
  Cc: brett.creeley, shannon.nelson

The device firmware can crash, due to a configuration or other issue,
and then recover. If a live migration is in progress when the firmware
crashes, that live migration will fail. However, the VF PCI device
should still be functional after crash recovery, and subsequent
migrations should go through as expected.

When the pds_core device notices that the firmware has crashed, it
sends an event to all of its client drivers. When the pds_vfio driver
receives this event while a migration is in progress, it requests a
deferred reset on the next migration state transition. That state
transition will report failure, as will any subsequent state
transition requests from the VMM/VFIO. Based on uapi/vfio.h, the only
way out of VFIO_DEVICE_STATE_ERROR is by issuing VFIO_DEVICE_RESET.
Once this reset is done, the migration state will be reset to
VFIO_DEVICE_STATE_RUNNING and migration can be performed.

If the event is received while no migration is in progress (i.e.
the VM is in normal operating mode), then no actions are taken
and the migration state remains VFIO_DEVICE_STATE_RUNNING.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
---
 drivers/vfio/pci/pds/pci_drv.c  | 105 ++++++++++++++++++++++++++++++++
 drivers/vfio/pci/pds/vfio_dev.c |  28 ++++++++-
 drivers/vfio/pci/pds/vfio_dev.h |   4 ++
 3 files changed, 135 insertions(+), 2 deletions(-)
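
The deferred-reset flow described in the commit message can be modelled
as a small, single-threaded userspace state machine. The sketch below is
only an illustration of the logic in this patch. It is not driver code:
the MODEL_* enum, struct model_dev and every function name are
hypothetical stand-ins, locking is omitted, and user_set_state() jumps
straight to the requested state instead of stepping through
vfio_mig_get_next_state() as the real driver does.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the VFIO migration states used here. */
enum model_state {
	MODEL_ERROR,
	MODEL_STOP,
	MODEL_RUNNING,
	MODEL_STOP_COPY,
};

struct model_dev {
	enum model_state state;
	bool dirty_enabled;
	bool deferred_reset;
	enum model_state deferred_reset_state;
};

/* Models the deferred_reset branch of pds_vfio_state_mutex_unlock(). */
static void apply_deferred_reset(struct model_dev *d)
{
	if (!d->deferred_reset)
		return;

	d->deferred_reset = false;
	if (d->state == MODEL_ERROR)
		d->state = d->deferred_reset_state;
	else if (d->deferred_reset_state == MODEL_ERROR)
		d->state = MODEL_ERROR;
	d->deferred_reset_state = MODEL_RUNNING;
}

/* Models pds_vfio_recovery(): only arm a reset if migration is active. */
static void on_fw_recovery(struct model_dev *d)
{
	bool busy = d->state != MODEL_RUNNING && d->state != MODEL_ERROR;
	bool tracking = d->state == MODEL_RUNNING && d->dirty_enabled;

	if (busy || tracking) {
		d->deferred_reset = true;
		d->deferred_reset_state = MODEL_ERROR;
	}
}

/* Models pds_vfio_reset(), i.e. the VFIO_DEVICE_RESET path. */
static void on_user_reset(struct model_dev *d)
{
	d->deferred_reset = true;
	d->deferred_reset_state = MODEL_RUNNING;
	apply_deferred_reset(d);
}

/*
 * Models pds_vfio_set_device_state(). Returns 0 on success and -1
 * (standing in for -EIO) if the device ends up in the error state.
 */
static int user_set_state(struct model_dev *d, enum model_state new_state)
{
	if (d->state != MODEL_ERROR)
		d->state = new_state;
	apply_deferred_reset(d);
	return d->state == MODEL_ERROR ? -1 : 0;
}
```

In this model, a recovery event that arrives in MODEL_STOP_COPY makes the
next user_set_state() call fail and leaves the device in MODEL_ERROR.
Further transitions keep failing until on_user_reset() returns the device
to MODEL_RUNNING. A recovery event that arrives in MODEL_RUNNING with
dirty tracking disabled arms nothing.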

diff --git a/drivers/vfio/pci/pds/pci_drv.c b/drivers/vfio/pci/pds/pci_drv.c
index ffd47fa8ede3..ca79135c9af2 100644
--- a/drivers/vfio/pci/pds/pci_drv.c
+++ b/drivers/vfio/pci/pds/pci_drv.c
@@ -19,6 +19,104 @@
 #define PDS_VFIO_DRV_DESCRIPTION	"AMD/Pensando VFIO Device Driver"
 #define PCI_VENDOR_ID_PENSANDO		0x1dd8
 
+static void pds_vfio_recovery(struct pds_vfio_pci_device *pds_vfio)
+{
+	bool deferred_reset_needed = false;
+
+	/*
+	 * Documentation states that the kernel migration driver must not
+	 * generate asynchronous device state transitions outside of
+	 * manipulation by the user or the VFIO_DEVICE_RESET ioctl.
+	 *
+	 * Since recovery is an asynchronous event received from the device,
+	 * initiate a deferred reset. Only issue the deferred reset if a
+	 * migration is in progress, which will cause the next step of the
+	 * migration to fail. Also, if the device is in a state that will
+	 * be set to VFIO_DEVICE_STATE_RUNNING on the next action (i.e. VM is
+	 * shutdown and device is in VFIO_DEVICE_STATE_STOP) as that will clear
+	 * the VFIO_DEVICE_STATE_ERROR when the VM starts back up.
+	 */
+	mutex_lock(&pds_vfio->state_mutex);
+	if ((pds_vfio->state != VFIO_DEVICE_STATE_RUNNING &&
+	     pds_vfio->state != VFIO_DEVICE_STATE_ERROR) ||
+	    (pds_vfio->state == VFIO_DEVICE_STATE_RUNNING &&
+	     pds_vfio_dirty_is_enabled(pds_vfio)))
+		deferred_reset_needed = true;
+	mutex_unlock(&pds_vfio->state_mutex);
+
+	/*
+	 * On the next user initiated state transition, the device will
+	 * transition to the VFIO_DEVICE_STATE_ERROR. At this point it's the user's
+	 * responsibility to reset the device.
+	 *
+	 * If a VFIO_DEVICE_RESET is requested post recovery and before the next
+	 * state transition, then the deferred reset state will be set to
+	 * VFIO_DEVICE_STATE_RUNNING.
+	 */
+	if (deferred_reset_needed)
+		pds_vfio_deferred_reset(pds_vfio, VFIO_DEVICE_STATE_ERROR);
+}
+
+static int pds_vfio_pci_notify_handler(struct notifier_block *nb,
+				       unsigned long ecode, void *data)
+{
+	struct pds_vfio_pci_device *pds_vfio =
+		container_of(nb, struct pds_vfio_pci_device, nb);
+	struct device *dev = pds_vfio_to_dev(pds_vfio);
+	union pds_core_notifyq_comp *event = data;
+
+	dev_dbg(dev, "%s: event code %lu\n", __func__, ecode);
+
+	/*
+	 * We don't need to do anything for RESET state==0 as there is no notify
+	 * or feedback mechanism available, and it is possible that we won't
+	 * even see a state==0 event.
+	 *
+	 * Any requests from VFIO while state==0 will fail, which will return
+	 * error and may cause migration to fail.
+	 */
+	if (ecode == PDS_EVENT_RESET) {
+		dev_info(dev, "%s: PDS_EVENT_RESET event received, state==%d\n",
+			 __func__, event->reset.state);
+		if (event->reset.state == 1)
+			pds_vfio_recovery(pds_vfio);
+	}
+
+	return 0;
+}
+
+static int
+pds_vfio_pci_register_event_handler(struct pds_vfio_pci_device *pds_vfio)
+{
+	struct device *dev = pds_vfio_to_dev(pds_vfio);
+	struct notifier_block *nb = &pds_vfio->nb;
+	int err;
+
+	if (!nb->notifier_call) {
+		nb->notifier_call = pds_vfio_pci_notify_handler;
+		err = pdsc_register_notify(nb);
+		if (err) {
+			nb->notifier_call = NULL;
+			dev_err(dev,
+				"failed to register pds event handler: %pe\n",
+				ERR_PTR(err));
+			return -EINVAL;
+		}
+		dev_dbg(dev, "pds event handler registered\n");
+	}
+
+	return 0;
+}
+
+static void
+pds_vfio_pci_unregister_event_handler(struct pds_vfio_pci_device *pds_vfio)
+{
+	if (pds_vfio->nb.notifier_call) {
+		pdsc_unregister_notify(&pds_vfio->nb);
+		pds_vfio->nb.notifier_call = NULL;
+	}
+}
+
 static int pds_vfio_pci_probe(struct pci_dev *pdev,
 			      const struct pci_device_id *id)
 {
@@ -48,8 +146,14 @@ static int pds_vfio_pci_probe(struct pci_dev *pdev,
 		goto out_unregister_coredev;
 	}
 
+	err = pds_vfio_pci_register_event_handler(pds_vfio);
+	if (err)
+		goto out_unregister_client;
+
 	return 0;
 
+out_unregister_client:
+	pds_vfio_unregister_client_cmd(pds_vfio);
 out_unregister_coredev:
 	vfio_pci_core_unregister_device(&pds_vfio->vfio_coredev);
 out_put_vdev:
@@ -61,6 +165,7 @@ static void pds_vfio_pci_remove(struct pci_dev *pdev)
 {
 	struct pds_vfio_pci_device *pds_vfio = pds_vfio_pci_drvdata(pdev);
 
+	pds_vfio_pci_unregister_event_handler(pds_vfio);
 	pds_vfio_unregister_client_cmd(pds_vfio);
 	vfio_pci_core_unregister_device(&pds_vfio->vfio_coredev);
 	vfio_put_device(&pds_vfio->vfio_coredev.vdev);
diff --git a/drivers/vfio/pci/pds/vfio_dev.c b/drivers/vfio/pci/pds/vfio_dev.c
index c58b0c1fc811..b380571f2146 100644
--- a/drivers/vfio/pci/pds/vfio_dev.c
+++ b/drivers/vfio/pci/pds/vfio_dev.c
@@ -33,10 +33,13 @@ void pds_vfio_state_mutex_unlock(struct pds_vfio_pci_device *pds_vfio)
 	if (pds_vfio->deferred_reset) {
 		pds_vfio->deferred_reset = false;
 		if (pds_vfio->state == VFIO_DEVICE_STATE_ERROR) {
-			pds_vfio->state = VFIO_DEVICE_STATE_RUNNING;
+			pds_vfio->state = pds_vfio->deferred_reset_state;
 			pds_vfio_put_restore_file(pds_vfio);
 			pds_vfio_put_save_file(pds_vfio);
+		} else if (pds_vfio->deferred_reset_state == VFIO_DEVICE_STATE_ERROR) {
+			pds_vfio->state = VFIO_DEVICE_STATE_ERROR;
 		}
+		pds_vfio->deferred_reset_state = VFIO_DEVICE_STATE_RUNNING;
 		spin_unlock(&pds_vfio->reset_lock);
 		goto again;
 	}
@@ -48,6 +51,7 @@ void pds_vfio_reset(struct pds_vfio_pci_device *pds_vfio)
 {
 	spin_lock(&pds_vfio->reset_lock);
 	pds_vfio->deferred_reset = true;
+	pds_vfio->deferred_reset_state = VFIO_DEVICE_STATE_RUNNING;
 	if (!mutex_trylock(&pds_vfio->state_mutex)) {
 		spin_unlock(&pds_vfio->reset_lock);
 		return;
@@ -56,6 +60,15 @@ void pds_vfio_reset(struct pds_vfio_pci_device *pds_vfio)
 	pds_vfio_state_mutex_unlock(pds_vfio);
 }
 
+void pds_vfio_deferred_reset(struct pds_vfio_pci_device *pds_vfio,
+			     enum vfio_device_mig_state reset_state)
+{
+	spin_lock(&pds_vfio->reset_lock);
+	pds_vfio->deferred_reset = true;
+	pds_vfio->deferred_reset_state = reset_state;
+	spin_unlock(&pds_vfio->reset_lock);
+}
+
 static struct file *
 pds_vfio_set_device_state(struct vfio_device *vdev,
 			  enum vfio_device_mig_state new_state)
@@ -66,7 +79,14 @@ pds_vfio_set_device_state(struct vfio_device *vdev,
 	struct file *res = NULL;
 
 	mutex_lock(&pds_vfio->state_mutex);
-	while (new_state != pds_vfio->state) {
+	/*
+	 * only way to transition out of VFIO_DEVICE_STATE_ERROR is via
+	 * VFIO_DEVICE_RESET, so prevent the state machine from running since
+	 * vfio_mig_get_next_state() will throw a WARN_ON() when transitioning
+	 * from VFIO_DEVICE_STATE_ERROR to any other state
+	 */
+	while (pds_vfio->state != VFIO_DEVICE_STATE_ERROR &&
+	       new_state != pds_vfio->state) {
 		enum vfio_device_mig_state next_state;
 
 		int err = vfio_mig_get_next_state(vdev, pds_vfio->state,
@@ -88,6 +108,9 @@ pds_vfio_set_device_state(struct vfio_device *vdev,
 		}
 	}
 	pds_vfio_state_mutex_unlock(pds_vfio);
+	/* still waiting on a deferred_reset */
+	if (pds_vfio->state == VFIO_DEVICE_STATE_ERROR)
+		res = ERR_PTR(-EIO);
 
 	return res;
 }
@@ -165,6 +188,7 @@ static int pds_vfio_open_device(struct vfio_device *vdev)
 
 	mutex_init(&pds_vfio->state_mutex);
 	pds_vfio->state = VFIO_DEVICE_STATE_RUNNING;
+	pds_vfio->deferred_reset_state = VFIO_DEVICE_STATE_RUNNING;
 
 	vfio_pci_core_finish_enable(&pds_vfio->vfio_coredev);
 
diff --git a/drivers/vfio/pci/pds/vfio_dev.h b/drivers/vfio/pci/pds/vfio_dev.h
index 1e28c072ce08..197ad8e2c223 100644
--- a/drivers/vfio/pci/pds/vfio_dev.h
+++ b/drivers/vfio/pci/pds/vfio_dev.h
@@ -23,6 +23,8 @@ struct pds_vfio_pci_device {
 	enum vfio_device_mig_state state;
 	spinlock_t reset_lock; /* protect reset_done flow */
 	u8 deferred_reset;
+	enum vfio_device_mig_state deferred_reset_state;
+	struct notifier_block nb;
 
 	int vf_id;
 	int pci_id;
@@ -34,6 +36,8 @@ void pds_vfio_state_mutex_unlock(struct pds_vfio_pci_device *pds_vfio);
 const struct vfio_device_ops *pds_vfio_ops_info(void);
 struct pds_vfio_pci_device *pds_vfio_pci_drvdata(struct pci_dev *pdev);
 void pds_vfio_reset(struct pds_vfio_pci_device *pds_vfio);
+void pds_vfio_deferred_reset(struct pds_vfio_pci_device *pds_vfio,
+			     enum vfio_device_mig_state reset_state);
 
 struct pci_dev *pds_vfio_to_pci_dev(struct pds_vfio_pci_device *pds_vfio);
 struct device *pds_vfio_to_dev(struct pds_vfio_pci_device *pds_vfio);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v10 vfio 7/7] vfio/pds: Add Kconfig and documentation
  2023-06-02 22:03 [PATCH v10 vfio 0/7] pds_vfio driver Brett Creeley
                   ` (5 preceding siblings ...)
  2023-06-02 22:03 ` [PATCH v10 vfio 6/7] vfio/pds: Add support for firmware recovery Brett Creeley
@ 2023-06-02 22:03 ` Brett Creeley
  2023-06-16  8:25   ` Tian, Kevin
  2023-06-14 20:20 ` [PATCH v10 vfio 0/7] pds_vfio driver Alex Williamson
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Brett Creeley @ 2023-06-02 22:03 UTC (permalink / raw)
  To: kvm, netdev, alex.williamson, jgg, yishaih,
	shameerali.kolothum.thodi, kevin.tian
  Cc: brett.creeley, shannon.nelson

Add Kconfig entries and pds_vfio.rst. Also, add an entry in the
MAINTAINERS file for this new driver.

It's not clear where documentation for vendor specific VFIO
drivers should live, so just reuse the current AMD
Ethernet location.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
---
 .../device_drivers/ethernet/amd/pds_vfio.rst  | 79 +++++++++++++++++++
 .../device_drivers/ethernet/index.rst         |  1 +
 MAINTAINERS                                   |  7 ++
 drivers/vfio/pci/Kconfig                      |  2 +
 drivers/vfio/pci/pds/Kconfig                  | 20 +++++
 5 files changed, 109 insertions(+)
 create mode 100644 Documentation/networking/device_drivers/ethernet/amd/pds_vfio.rst
 create mode 100644 drivers/vfio/pci/pds/Kconfig

diff --git a/Documentation/networking/device_drivers/ethernet/amd/pds_vfio.rst b/Documentation/networking/device_drivers/ethernet/amd/pds_vfio.rst
new file mode 100644
index 000000000000..7bddde0c7c9d
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/amd/pds_vfio.rst
@@ -0,0 +1,79 @@
+.. SPDX-License-Identifier: GPL-2.0+
+.. note: can be edited and viewed with /usr/bin/formiko-vim
+
+==========================================================
+PCI VFIO driver for the AMD/Pensando(R) DSC adapter family
+==========================================================
+
+AMD/Pensando Linux VFIO PCI Device Driver
+Copyright(c) 2023 Advanced Micro Devices, Inc.
+
+Overview
+========
+
+The ``pds_vfio`` module is a PCI driver that supports Live Migration
+capable Virtual Function (VF) devices in the DSC hardware.
+
+Using the device
+================
+
+The pds_vfio device is enabled via multiple configuration steps and
+depends on the ``pds_core`` driver to create and enable SR-IOV Virtual
+Function devices.
+
+Shown below are the steps to bind the driver to a VF and also to the
+associated auxiliary device created by the ``pds_core`` driver. This
+example assumes the pds_core and pds_vfio modules are already
+loaded.
+
+.. code-block:: bash
+  :name: example-setup-script
+
+  #!/bin/bash
+
+  PF_BUS="0000:60"
+  PF_BDF="0000:60:00.0"
+  VF_BDF="0000:60:00.1"
+
+  # Prevent non-vfio VF driver from probing the VF device
+  echo 0 > /sys/class/pci_bus/$PF_BUS/device/$PF_BDF/sriov_drivers_autoprobe
+
+  # Create single VF for Live Migration via VFIO
+  echo 1 > /sys/bus/pci/drivers/pds_core/$PF_BDF/sriov_numvfs
+
+  # Allow the VF to be bound to the pds_vfio driver
+  echo "pds_vfio" > /sys/class/pci_bus/$PF_BUS/device/$VF_BDF/driver_override
+
+  # Bind the VF to the pds_vfio driver
+  echo "$VF_BDF" > /sys/bus/pci/drivers/pds_vfio/bind
+
+After performing the steps above, a file in /dev/vfio/<iommu_group>
+should have been created.
+
+
+Enabling the driver
+===================
+
+The driver is enabled via the standard kernel configuration system,
+using the make command::
+
+  make oldconfig/menuconfig/etc.
+
+The driver is located in the menu structure at:
+
+  -> Device Drivers
+    -> VFIO Non-Privileged userspace driver framework
+      -> VFIO support for PDS PCI devices
+
+Support
+=======
+
+For general Linux networking support, please use the netdev mailing
+list, which is monitored by Pensando personnel::
+
+  netdev@vger.kernel.org
+
+For more specific support needs, please use the Pensando driver support
+email::
+
+  drivers@pensando.io
diff --git a/Documentation/networking/device_drivers/ethernet/index.rst b/Documentation/networking/device_drivers/ethernet/index.rst
index 417ca514a4d0..0dd88e6f4e7c 100644
--- a/Documentation/networking/device_drivers/ethernet/index.rst
+++ b/Documentation/networking/device_drivers/ethernet/index.rst
@@ -15,6 +15,7 @@ Contents:
    amazon/ena
    altera/altera_tse
    amd/pds_core
+   amd/pds_vfio
    aquantia/atlantic
    chelsio/cxgb
    cirrus/cs89x0
diff --git a/MAINTAINERS b/MAINTAINERS
index c904dba1733b..cb3e4d40ca76 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -22140,6 +22140,13 @@ S:	Maintained
 P:	Documentation/driver-api/vfio-pci-device-specific-driver-acceptance.rst
 F:	drivers/vfio/pci/*/
 
+VFIO PDS PCI DRIVER
+M:	Brett Creeley <brett.creeley@amd.com>
+L:	kvm@vger.kernel.org
+S:	Maintained
+F:	Documentation/networking/device_drivers/ethernet/amd/pds_vfio.rst
+F:	drivers/vfio/pci/pds/
+
 VFIO PLATFORM DRIVER
 M:	Eric Auger <eric.auger@redhat.com>
 L:	kvm@vger.kernel.org
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index f9d0c908e738..2c3831dd60ef 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -59,4 +59,6 @@ source "drivers/vfio/pci/mlx5/Kconfig"
 
 source "drivers/vfio/pci/hisilicon/Kconfig"
 
+source "drivers/vfio/pci/pds/Kconfig"
+
 endif
diff --git a/drivers/vfio/pci/pds/Kconfig b/drivers/vfio/pci/pds/Kconfig
new file mode 100644
index 000000000000..149d4986bf43
--- /dev/null
+++ b/drivers/vfio/pci/pds/Kconfig
@@ -0,0 +1,20 @@
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2023 Advanced Micro Devices, Inc.
+
+config PDS_VFIO_PCI
+	tristate "VFIO support for PDS PCI devices"
+	depends on PDS_CORE
+	depends on VFIO_PCI_CORE
+	help
+	  This provides generic PCI support for PDS devices using the VFIO
+	  framework.
+
+	  More specific information on this driver can be
+	  found in
+	  <file:Documentation/networking/device_drivers/ethernet/amd/pds_vfio.rst>.
+
+	  To compile this driver as a module, choose M here. The module
+	  will be called pds_vfio.
+
+	  If you don't know what to do here, say N.
+
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v10 vfio 0/7] pds_vfio driver
  2023-06-02 22:03 [PATCH v10 vfio 0/7] pds_vfio driver Brett Creeley
                   ` (6 preceding siblings ...)
  2023-06-02 22:03 ` [PATCH v10 vfio 7/7] vfio/pds: Add Kconfig and documentation Brett Creeley
@ 2023-06-14 20:20 ` Alex Williamson
  2023-06-16  6:47 ` Tian, Kevin
  2023-06-17  4:49 ` Brett Creeley
  9 siblings, 0 replies; 40+ messages in thread
From: Alex Williamson @ 2023-06-14 20:20 UTC (permalink / raw)
  To: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian
  Cc: Brett Creeley, kvm, netdev, shannon.nelson

[sorry, not-sorry for the top post]

Thanks Jason and Shameer for prior review comments, I hope you both can
find time to check v10 as well.

Others that previously stepped up to be reviewers for new vfio-pci
variant drivers, please jump in.  Thanks,

Alex

On Fri, 2 Jun 2023 15:03:11 -0700
Brett Creeley <brett.creeley@amd.com> wrote:

> This is a patchset for a new vendor specific VFIO driver
> (pds_vfio) for use with the AMD/Pensando Distributed Services Card
> (DSC). This driver makes use of the pds_core driver.
> 
> This driver will use the pds_core device's adminq as the VFIO
> control path to the DSC. In order to make adminq calls, the VFIO
> instance makes use of functions exported by the pds_core driver.
> 
> In order to receive events from pds_core, the pds_vfio driver
> registers to a private notifier. This is needed for various events
> that come from the device.
> 
> An ASCII diagram of a VFIO instance looks something like this and can
> be used with the VFIO subsystem to provide the VF device VFIO and live
> migration support.
> 
>                                .------.  .-----------------------.
>                                | QEMU |--|  VM  .-------------.  |
>                                '......'  |      |   Eth VF    |  |
>                                   |      |      .-------------.  |
>                                   |      |      |  SR-IOV VF  |  |
>                                   |      |      '-------------'  |
>                                   |      '------------||---------'
>                                .--------------.       ||
>                                |/dev/<vfio_fd>|       ||
>                                '--------------'       ||
> Host Userspace                         |              ||
> ===================================================   ||
> Host Kernel                            |              ||
>                                   .--------.          ||
>                                   |vfio-pci|          ||
>                                   '--------'          ||
>        .------------------.           ||              ||
>        |   | exported API |<----+     ||              ||
>        |   '--------------|     |     ||              ||
>        |                  |    .-------------.        ||
>        |     pds_core     |--->|   pds_vfio  |        ||
>        '------------------' |  '-------------'        ||
>                ||           |         ||              ||
>              09:00.0     notifier    09:00.1          ||
> == PCI ===============================================||=====
>                ||                     ||              ||
>           .----------.          .----------.          ||
>     ,-----|    PF    |----------|    VF    |-------------------,
>     |     '----------'          '----------'  |       VF       |
>     |                     DSC                 |  data/control  |
>     |                                         |      path      |
>     -----------------------------------------------------------
> 
> 
> The pds_vfio driver is targeted to reside in drivers/vfio/pci/pds.
> It makes use of and introduces new files in the common include/linux/pds
> include directory.
> 
> Changes:
> 
> v10:
> - Various fixes/suggestions by Jason Gunthorpe
> 	- Simplify pds_vfio_get_lm_file() based on fpga_mgr_buf_load()
> 	- Clean-ups/fixes based on clang-format
> 	- Remove any double goto labels
> 	- Name goto labels based on what needs to be cleaned/freed
> 	  instead of a "call from" scheme
> 	- Fix any goto unwind ordering issues
> 	- Make sure to call dma_map_single() after data is written to
> 	  memory in pds_vfio_dma_map_lm_file()
> 	- Don't use bitmap_zalloc() for the dirty bitmaps
> - Use vzalloc() for dirty bitmaps and refactor how the bitmaps are DMA'd
>   to and from the device in pds_vfio_dirty_seq_ack()
> - Remove unnecessary goto in pds_vfio_dirty_disable()
> 
> v9:
> https://lore.kernel.org/netdev/20230422010642.60720-1-brett.creeley@amd.com/
> - Various fixes/suggestions by Alex Williamson
> 	- Fix how ID is generated in client registration
> 	- Add helper functions to get the VF's struct device and struct
> 	  pci_dev pointers instead of caching the struct pci dev
> 	- Remove redundant pds_vfio_lm_state() function and remove any
> 	  places this was being called
> 	- Fix multi-line comments to follow standard convention
> 	- Remove confusing comments in
> 	  pds_vfio_step_device_state_locked() since the driver's
> 	  migration states align with the VFIO documentation
> 	- Validate pdsc returned from pdsc_get_pf_struct()
> - Various fixes/suggestions by Jason Gunthorpe
> 	- Use struct pdsc instead of void *
> 	- Use {} instead of {0} for structure initialization
> 	- Use unions on the stack instead of casting to the union when
> 	  sending AQ commands, which required including pds_lm.h in
> 	  pds_adminq.h
> 	- Replace use of dma_alloc_coherent() when creating the sgl DMA
> 	  entries for the LM file
> 	- Remove cached struct device *coredev and instead use
> 	  pci_physfn() to get the pds_core's struct device pointer
> 	- Drop the recovery work item and call pds_vfio_recovery()
> 	  directly from the notifier callback
> 	- Remove unnecessary #define for "pds_vfio_lm" and just use the
> 	  string inline to the anon_inode_getfile() argument
> - Fix LM file reference counting
> - Move initialization of some struct members to when the struct is being
>   initialized for AQ commands
> - Make use of GFP_KERNEL_ACCOUNT where it makes sense
> - Replace PDS_VFIO_DRV_NAME with KBUILD_MODNAME
> - Update to latest pds_core exported functions
> - Remove duplicated prototypes for
>   pds_vfio_dma_logging_[start|stop|report] from lm.h
> - Hold pds_vfio->state_mutex while starting, stopping, and reporting
>   dirty page tracking in pds_vfio_dma_logging_[start|stop|report]
> - Remove duplicate PDS_DEV_TYPE_LM_STR define from pds_lm.h that's
>   already included in pds_common.h
> - Replace use of dma_alloc_coherent() when creating the sgl DMA
>   entries for the dirty bitmaps
> 
> v8:
> https://lore.kernel.org/netdev/20230404190141.57762-1-brett.creeley@amd.com/
> - provide default iommufd callbacks for bind_iommufd, unbind_iommufd, and
>   attach_ioas for the VFIO device as suggested by Shameerali Kolothum
>   Thodi
> 
> v7:
> https://lore.kernel.org/netdev/20230331003612.17569-1-brett.creeley@amd.com/
> - Disable and clean up dirty page tracking when the VFIO device is closed
> - Various improvements suggested by Simon Horman:
> 	- Fix RCT in vfio_combine_iova_ranges()
> 	- Simplify function exit paths by removing unnecessary goto
> 	  labels
> 	- Clean up pds_vfio_print_guest_region_info() by adding a goto
> 	  label for freeing memory, which allowed for reduced
> 	  indentation on a for loop
> 	- Where possible use C99 style for loops
> 
> v6:
> https://lore.kernel.org/netdev/20230327200553.13951-1-brett.creeley@amd.com/
> - As suggested by Alex Williamson, use pci_domain_nr() macro to make sure
>   the pds_vfio client's devname is unique
> - Remove unnecessary forward declaration and include
> - Fix copyright comment to use correct company name
> - Remove "." from struct documentation for consistency
> 
> v5:
> https://lore.kernel.org/netdev/20230322203442.56169-1-brett.creeley@amd.com/
> - Fix SPDX comments in .h files
> - Remove adminqcq argument from pdsc_post_adminq() uses
> - Unregister client on vfio_pci_core_register_device() failure
> - Other minor checkpatch issues
> 
> v4:
> https://lore.kernel.org/netdev/20230308052450.13421-1-brett.creeley@amd.com/
> - Update cover letter ASCII diagram to reflect new driver architecture
> - Remove auxiliary driver implementation
> - Use pds_core's exported functions to communicate with the device
> - Implement and register notifier for events from the device/pds_core
> - Use module_pci_driver() macro since auxiliary driver configuration is
>   no longer needed in __init/__exit
> 
> v3:
> https://lore.kernel.org/netdev/20230219083908.40013-1-brett.creeley@amd.com/
> - Update copyright year to 2023 and use "Advanced Micro Devices, Inc."
>   for the company name
> - Clarify the fact that AMD/Pensando's VFIO solution is device type
>   agnostic, which aligns with other current VFIO solutions
> - Add line in drivers/vfio/pci/Makefile to build pds_vfio
> - Move documentation to amd sub-directory
> - Remove some dead code due to the pds_core implementation of
>   listening to BIND/UNBIND events
> - Move a dev_dbg() to a previous patch in the series
> - Add implementation for vfio_migration_ops.migration_get_data_size to
>   return the maximum possible device state size
> 
> RFC to v2:
> https://lore.kernel.org/all/20221214232136.64220-1-brett.creeley@amd.com/
> - Implement state transitions for VFIO_MIGRATION_P2P flag
> - Improve auxiliary driver probe by returning EPROBE_DEFER
>   when the PCI driver is not set up correctly
> - Add pointer to docs in
>   Documentation/networking/device_drivers/ethernet/index.rst
> 
> RFC:
> https://lore.kernel.org/all/20221207010705.35128-1-brett.creeley@amd.com/
> 
> 
> Brett Creeley (7):
>   vfio: Commonize combine_ranges for use in other VFIO drivers
>   vfio/pds: Initial support for pds_vfio VFIO driver
>   vfio/pds: register with the pds_core PF
>   vfio/pds: Add VFIO live migration support
>   vfio/pds: Add support for dirty page tracking
>   vfio/pds: Add support for firmware recovery
>   vfio/pds: Add Kconfig and documentation
> 
>  .../device_drivers/ethernet/amd/pds_vfio.rst  |  79 +++
>  .../device_drivers/ethernet/index.rst         |   1 +
>  MAINTAINERS                                   |   7 +
>  drivers/vfio/pci/Kconfig                      |   2 +
>  drivers/vfio/pci/Makefile                     |   2 +
>  drivers/vfio/pci/mlx5/cmd.c                   |  48 +-
>  drivers/vfio/pci/pds/Kconfig                  |  20 +
>  drivers/vfio/pci/pds/Makefile                 |  11 +
>  drivers/vfio/pci/pds/cmds.c                   | 487 +++++++++++++++
>  drivers/vfio/pci/pds/cmds.h                   |  25 +
>  drivers/vfio/pci/pds/dirty.c                  | 577 ++++++++++++++++++
>  drivers/vfio/pci/pds/dirty.h                  |  38 ++
>  drivers/vfio/pci/pds/lm.c                     | 421 +++++++++++++
>  drivers/vfio/pci/pds/lm.h                     |  41 ++
>  drivers/vfio/pci/pds/pci_drv.c                | 206 +++++++
>  drivers/vfio/pci/pds/pci_drv.h                |   9 +
>  drivers/vfio/pci/pds/vfio_dev.c               | 234 +++++++
>  drivers/vfio/pci/pds/vfio_dev.h               |  45 ++
>  drivers/vfio/vfio_main.c                      |  47 ++
>  include/linux/pds/pds_adminq.h                | 395 ++++++++++++
>  include/linux/pds/pds_common.h                |   2 +
>  include/linux/vfio.h                          |   3 +
>  22 files changed, 2653 insertions(+), 47 deletions(-)
>  create mode 100644 Documentation/networking/device_drivers/ethernet/amd/pds_vfio.rst
>  create mode 100644 drivers/vfio/pci/pds/Kconfig
>  create mode 100644 drivers/vfio/pci/pds/Makefile
>  create mode 100644 drivers/vfio/pci/pds/cmds.c
>  create mode 100644 drivers/vfio/pci/pds/cmds.h
>  create mode 100644 drivers/vfio/pci/pds/dirty.c
>  create mode 100644 drivers/vfio/pci/pds/dirty.h
>  create mode 100644 drivers/vfio/pci/pds/lm.c
>  create mode 100644 drivers/vfio/pci/pds/lm.h
>  create mode 100644 drivers/vfio/pci/pds/pci_drv.c
>  create mode 100644 drivers/vfio/pci/pds/pci_drv.h
>  create mode 100644 drivers/vfio/pci/pds/vfio_dev.c
>  create mode 100644 drivers/vfio/pci/pds/vfio_dev.h
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v10 vfio 2/7] vfio/pds: Initial support for pds_vfio VFIO driver
  2023-06-02 22:03 ` [PATCH v10 vfio 2/7] vfio/pds: Initial support for pds_vfio VFIO driver Brett Creeley
@ 2023-06-14 21:31   ` Alex Williamson
  2023-06-14 21:41     ` Brett Creeley
  2023-06-16  6:56   ` Tian, Kevin
  1 sibling, 1 reply; 40+ messages in thread
From: Alex Williamson @ 2023-06-14 21:31 UTC (permalink / raw)
  To: Brett Creeley
  Cc: kvm, netdev, jgg, yishaih, shameerali.kolothum.thodi, kevin.tian,
	shannon.nelson

On Fri, 2 Jun 2023 15:03:13 -0700
Brett Creeley <brett.creeley@amd.com> wrote:

> This is the initial framework for the new pds_vfio device driver. This
> does the very basics of registering the PDS PCI device and configuring
> it as a VFIO PCI device.
> 
> With this change, the VF device can be bound to the pds_vfio driver on
> the host and presented to the VM as the VF's device type.
> 
> Signed-off-by: Brett Creeley <brett.creeley@amd.com>
> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
> ---
>  drivers/vfio/pci/Makefile       |  2 +
>  drivers/vfio/pci/pds/Makefile   |  8 ++++
>  drivers/vfio/pci/pds/pci_drv.c  | 69 +++++++++++++++++++++++++++++++
>  drivers/vfio/pci/pds/vfio_dev.c | 72 +++++++++++++++++++++++++++++++++
>  drivers/vfio/pci/pds/vfio_dev.h | 20 +++++++++
>  5 files changed, 171 insertions(+)
>  create mode 100644 drivers/vfio/pci/pds/Makefile
>  create mode 100644 drivers/vfio/pci/pds/pci_drv.c
>  create mode 100644 drivers/vfio/pci/pds/vfio_dev.c
>  create mode 100644 drivers/vfio/pci/pds/vfio_dev.h
> 
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 24c524224da5..45167be462d8 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -11,3 +11,5 @@ obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
>  obj-$(CONFIG_MLX5_VFIO_PCI)           += mlx5/
>  
>  obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
> +
> +obj-$(CONFIG_PDS_VFIO_PCI) += pds/
> diff --git a/drivers/vfio/pci/pds/Makefile b/drivers/vfio/pci/pds/Makefile
> new file mode 100644
> index 000000000000..e1a55ae0f079
> --- /dev/null
> +++ b/drivers/vfio/pci/pds/Makefile
> @@ -0,0 +1,8 @@
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2023 Advanced Micro Devices, Inc.
> +
> +obj-$(CONFIG_PDS_VFIO_PCI) += pds_vfio.o

Given the existing drivers:

obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o
obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisi-acc-vfio-pci.o

Does it make sense to name this one pds-vfio-pci?

> +
> +pds_vfio-y := \
> +	pci_drv.o	\
> +	vfio_dev.o
> diff --git a/drivers/vfio/pci/pds/pci_drv.c b/drivers/vfio/pci/pds/pci_drv.c
> new file mode 100644
> index 000000000000..0e84249069d4
> --- /dev/null
> +++ b/drivers/vfio/pci/pds/pci_drv.c
> @@ -0,0 +1,69 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2023 Advanced Micro Devices, Inc. */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/module.h>
> +#include <linux/pci.h>
> +#include <linux/types.h>
> +#include <linux/vfio.h>
> +
> +#include <linux/pds/pds_core_if.h>
> +
> +#include "vfio_dev.h"
> +
> +#define PDS_VFIO_DRV_DESCRIPTION	"AMD/Pensando VFIO Device Driver"
> +#define PCI_VENDOR_ID_PENSANDO		0x1dd8

Isn't this a duplicate from the above include:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/pds/pds_core_if.h#n7

I also find it defined in ionic.h, which means that it now satisfies
pci_ids.h requirement that the identifier is shared between multiple
drivers.  A trivial follow-up after this series might combine them
there.

> +
> +static int pds_vfio_pci_probe(struct pci_dev *pdev,
> +			      const struct pci_device_id *id)
> +{
> +	struct pds_vfio_pci_device *pds_vfio;
> +	int err;
> +
> +	pds_vfio = vfio_alloc_device(pds_vfio_pci_device, vfio_coredev.vdev,
> +				     &pdev->dev, pds_vfio_ops_info());
> +	if (IS_ERR(pds_vfio))
> +		return PTR_ERR(pds_vfio);
> +
> +	dev_set_drvdata(&pdev->dev, &pds_vfio->vfio_coredev);
> +
> +	err = vfio_pci_core_register_device(&pds_vfio->vfio_coredev);
> +	if (err)
> +		goto out_put_vdev;
> +
> +	return 0;
> +
> +out_put_vdev:
> +	vfio_put_device(&pds_vfio->vfio_coredev.vdev);
> +	return err;
> +}
> +
> +static void pds_vfio_pci_remove(struct pci_dev *pdev)
> +{
> +	struct pds_vfio_pci_device *pds_vfio = pds_vfio_pci_drvdata(pdev);
> +
> +	vfio_pci_core_unregister_device(&pds_vfio->vfio_coredev);
> +	vfio_put_device(&pds_vfio->vfio_coredev.vdev);
> +}
> +
> +static const struct pci_device_id
> +pds_vfio_pci_table[] = {
> +	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_PENSANDO, 0x1003) }, /* Ethernet VF */
> +	{ 0, }
> +};
> +MODULE_DEVICE_TABLE(pci, pds_vfio_pci_table);
> +
> +static struct pci_driver pds_vfio_pci_driver = {
> +	.name = KBUILD_MODNAME,
> +	.id_table = pds_vfio_pci_table,
> +	.probe = pds_vfio_pci_probe,
> +	.remove = pds_vfio_pci_remove,
> +	.driver_managed_dma = true,
> +};
> +
> +module_pci_driver(pds_vfio_pci_driver);
> +
> +MODULE_DESCRIPTION(PDS_VFIO_DRV_DESCRIPTION);
> +MODULE_AUTHOR("Advanced Micro Devices, Inc.");
> +MODULE_LICENSE("GPL");
> diff --git a/drivers/vfio/pci/pds/vfio_dev.c b/drivers/vfio/pci/pds/vfio_dev.c
> new file mode 100644
> index 000000000000..4038dac90a97
> --- /dev/null
> +++ b/drivers/vfio/pci/pds/vfio_dev.c
> @@ -0,0 +1,72 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2023 Advanced Micro Devices, Inc. */
> +
> +#include <linux/vfio.h>
> +#include <linux/vfio_pci_core.h>
> +
> +#include "vfio_dev.h"
> +
> +struct pds_vfio_pci_device *pds_vfio_pci_drvdata(struct pci_dev *pdev)
> +{
> +	struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
> +
> +	return container_of(core_device, struct pds_vfio_pci_device,
> +			    vfio_coredev);
> +}
> +
> +static int pds_vfio_init_device(struct vfio_device *vdev)
> +{
> +	struct pds_vfio_pci_device *pds_vfio =
> +		container_of(vdev, struct pds_vfio_pci_device,
> +			     vfio_coredev.vdev);
> +	struct pci_dev *pdev = to_pci_dev(vdev->dev);
> +	int err;
> +
> +	err = vfio_pci_core_init_dev(vdev);
> +	if (err)
> +		return err;
> +
> +	pds_vfio->vf_id = pci_iov_vf_id(pdev);
> +	pds_vfio->pci_id = PCI_DEVID(pdev->bus->number, pdev->devfn);

We only ever end up using pci_id for a debug print here that could use
a local variable and a slow path client registration that has access to
pdev to do a lookup on demand.  Why do we bother caching it on the
pds_vfio_pci_device?  Thanks,

Alex
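
For reference, a minimal userspace sketch of the on-demand lookup
suggested above. The demo_* types are hypothetical stand-ins for
struct pci_bus and struct pci_dev, the macro is assumed to mirror the
bit layout of the kernel's PCI_DEVID(), and the helper plays the role
that pci_dev_id() plays in the kernel:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to match the kernel layout: bus in bits 15:8, devfn in bits 7:0. */
#define PCI_DEVID(bus, devfn) ((((uint16_t)(bus)) << 8) | (devfn))

/* Hypothetical stand-ins for struct pci_bus / struct pci_dev. */
struct demo_pci_bus {
	uint8_t number;
};

struct demo_pci_dev {
	struct demo_pci_bus *bus;
	unsigned int devfn;
};

/* Derive the ID from pdev each time instead of caching it in driver state. */
static uint16_t demo_pci_dev_id(const struct demo_pci_dev *pdev)
{
	assert(pdev != NULL && pdev->bus != NULL);

	return (uint16_t)PCI_DEVID(pdev->bus->number, pdev->devfn);
}
```

With a helper like this reachable from pdev, neither the debug print
nor the client registration path would need a cached copy in
pds_vfio_pci_device.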

> +
> +	return 0;
> +}
> +
> +static int pds_vfio_open_device(struct vfio_device *vdev)
> +{
> +	struct pds_vfio_pci_device *pds_vfio =
> +		container_of(vdev, struct pds_vfio_pci_device,
> +			     vfio_coredev.vdev);
> +	int err;
> +
> +	err = vfio_pci_core_enable(&pds_vfio->vfio_coredev);
> +	if (err)
> +		return err;
> +
> +	vfio_pci_core_finish_enable(&pds_vfio->vfio_coredev);
> +
> +	return 0;
> +}
> +
> +static const struct vfio_device_ops pds_vfio_ops = {
> +	.name = "pds-vfio",
> +	.init = pds_vfio_init_device,
> +	.release = vfio_pci_core_release_dev,
> +	.open_device = pds_vfio_open_device,
> +	.close_device = vfio_pci_core_close_device,
> +	.ioctl = vfio_pci_core_ioctl,
> +	.device_feature = vfio_pci_core_ioctl_feature,
> +	.read = vfio_pci_core_read,
> +	.write = vfio_pci_core_write,
> +	.mmap = vfio_pci_core_mmap,
> +	.request = vfio_pci_core_request,
> +	.match = vfio_pci_core_match,
> +	.bind_iommufd = vfio_iommufd_physical_bind,
> +	.unbind_iommufd = vfio_iommufd_physical_unbind,
> +	.attach_ioas = vfio_iommufd_physical_attach_ioas,
> +};
> +
> +const struct vfio_device_ops *pds_vfio_ops_info(void)
> +{
> +	return &pds_vfio_ops;
> +}
> diff --git a/drivers/vfio/pci/pds/vfio_dev.h b/drivers/vfio/pci/pds/vfio_dev.h
> new file mode 100644
> index 000000000000..66cfcab5b5bf
> --- /dev/null
> +++ b/drivers/vfio/pci/pds/vfio_dev.h
> @@ -0,0 +1,20 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright(c) 2023 Advanced Micro Devices, Inc. */
> +
> +#ifndef _VFIO_DEV_H_
> +#define _VFIO_DEV_H_
> +
> +#include <linux/pci.h>
> +#include <linux/vfio_pci_core.h>
> +
> +struct pds_vfio_pci_device {
> +	struct vfio_pci_core_device vfio_coredev;
> +
> +	int vf_id;
> +	int pci_id;
> +};
> +
> +const struct vfio_device_ops *pds_vfio_ops_info(void);
> +struct pds_vfio_pci_device *pds_vfio_pci_drvdata(struct pci_dev *pdev);
> +
> +#endif /* _VFIO_DEV_H_ */


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v10 vfio 2/7] vfio/pds: Initial support for pds_vfio VFIO driver
  2023-06-14 21:31   ` Alex Williamson
@ 2023-06-14 21:41     ` Brett Creeley
  0 siblings, 0 replies; 40+ messages in thread
From: Brett Creeley @ 2023-06-14 21:41 UTC (permalink / raw)
  To: Alex Williamson, Brett Creeley
  Cc: kvm, netdev, jgg, yishaih, shameerali.kolothum.thodi, kevin.tian,
	shannon.nelson

On 6/14/2023 2:31 PM, Alex Williamson wrote:
> On Fri, 2 Jun 2023 15:03:13 -0700
> Brett Creeley <brett.creeley@amd.com> wrote:
> 
>> This is the initial framework for the new pds_vfio device driver. This
>> does the very basics of registering the PDS PCI device and configuring
>> it as a VFIO PCI device.
>>
>> With this change, the VF device can be bound to the pds_vfio driver on
>> the host and presented to the VM as the VF's device type.
>>
>> Signed-off-by: Brett Creeley <brett.creeley@amd.com>
>> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
>> ---
>>   drivers/vfio/pci/Makefile       |  2 +
>>   drivers/vfio/pci/pds/Makefile   |  8 ++++
>>   drivers/vfio/pci/pds/pci_drv.c  | 69 +++++++++++++++++++++++++++++++
>>   drivers/vfio/pci/pds/vfio_dev.c | 72 +++++++++++++++++++++++++++++++++
>>   drivers/vfio/pci/pds/vfio_dev.h | 20 +++++++++
>>   5 files changed, 171 insertions(+)
>>   create mode 100644 drivers/vfio/pci/pds/Makefile
>>   create mode 100644 drivers/vfio/pci/pds/pci_drv.c
>>   create mode 100644 drivers/vfio/pci/pds/vfio_dev.c
>>   create mode 100644 drivers/vfio/pci/pds/vfio_dev.h
>>
>> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
>> index 24c524224da5..45167be462d8 100644
>> --- a/drivers/vfio/pci/Makefile
>> +++ b/drivers/vfio/pci/Makefile
>> @@ -11,3 +11,5 @@ obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
>>   obj-$(CONFIG_MLX5_VFIO_PCI)           += mlx5/
>>
>>   obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
>> +
>> +obj-$(CONFIG_PDS_VFIO_PCI) += pds/
>> diff --git a/drivers/vfio/pci/pds/Makefile b/drivers/vfio/pci/pds/Makefile
>> new file mode 100644
>> index 000000000000..e1a55ae0f079
>> --- /dev/null
>> +++ b/drivers/vfio/pci/pds/Makefile
>> @@ -0,0 +1,8 @@
>> +# SPDX-License-Identifier: GPL-2.0
>> +# Copyright (c) 2023 Advanced Micro Devices, Inc.
>> +
>> +obj-$(CONFIG_PDS_VFIO_PCI) += pds_vfio.o
> 
> Given the existing drivers:
> 
> obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o
> obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisi-acc-vfio-pci.o
> 
> Does it make sense to name this one pds-vfio-pci?

Yeah I think it does make more sense to align. Thanks.

> 
>> +
>> +pds_vfio-y := \
>> +     pci_drv.o       \
>> +     vfio_dev.o
>> diff --git a/drivers/vfio/pci/pds/pci_drv.c b/drivers/vfio/pci/pds/pci_drv.c
>> new file mode 100644
>> index 000000000000..0e84249069d4
>> --- /dev/null
>> +++ b/drivers/vfio/pci/pds/pci_drv.c
>> @@ -0,0 +1,69 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright(c) 2023 Advanced Micro Devices, Inc. */
>> +
>> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>> +
>> +#include <linux/module.h>
>> +#include <linux/pci.h>
>> +#include <linux/types.h>
>> +#include <linux/vfio.h>
>> +
>> +#include <linux/pds/pds_core_if.h>
>> +
>> +#include "vfio_dev.h"
>> +
>> +#define PDS_VFIO_DRV_DESCRIPTION     "AMD/Pensando VFIO Device Driver"
>> +#define PCI_VENDOR_ID_PENSANDO               0x1dd8
> 
> Isn't this a duplicate from the above include:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/pds/pds_core_if.h#n7
> 
> I also find it defined in ionic.h, which means that it now satisfies
> pci_ids.h requirement that the identifier is shared between multiple
> drivers.  A trivial follow-up after this series might combine them
> there.

Good suggestion. Once this series is merged we will submit the follow-up
patch. Thanks.

> 
>> +
>> +static int pds_vfio_pci_probe(struct pci_dev *pdev,
>> +                           const struct pci_device_id *id)
>> +{
>> +     struct pds_vfio_pci_device *pds_vfio;
>> +     int err;
>> +
>> +     pds_vfio = vfio_alloc_device(pds_vfio_pci_device, vfio_coredev.vdev,
>> +                                  &pdev->dev, pds_vfio_ops_info());
>> +     if (IS_ERR(pds_vfio))
>> +             return PTR_ERR(pds_vfio);
>> +
>> +     dev_set_drvdata(&pdev->dev, &pds_vfio->vfio_coredev);
>> +
>> +     err = vfio_pci_core_register_device(&pds_vfio->vfio_coredev);
>> +     if (err)
>> +             goto out_put_vdev;
>> +
>> +     return 0;
>> +
>> +out_put_vdev:
>> +     vfio_put_device(&pds_vfio->vfio_coredev.vdev);
>> +     return err;
>> +}
>> +
>> +static void pds_vfio_pci_remove(struct pci_dev *pdev)
>> +{
>> +     struct pds_vfio_pci_device *pds_vfio = pds_vfio_pci_drvdata(pdev);
>> +
>> +     vfio_pci_core_unregister_device(&pds_vfio->vfio_coredev);
>> +     vfio_put_device(&pds_vfio->vfio_coredev.vdev);
>> +}
>> +
>> +static const struct pci_device_id
>> +pds_vfio_pci_table[] = {
>> +     { PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_PENSANDO, 0x1003) }, /* Ethernet VF */
>> +     { 0, }
>> +};
>> +MODULE_DEVICE_TABLE(pci, pds_vfio_pci_table);
>> +
>> +static struct pci_driver pds_vfio_pci_driver = {
>> +     .name = KBUILD_MODNAME,
>> +     .id_table = pds_vfio_pci_table,
>> +     .probe = pds_vfio_pci_probe,
>> +     .remove = pds_vfio_pci_remove,
>> +     .driver_managed_dma = true,
>> +};
>> +
>> +module_pci_driver(pds_vfio_pci_driver);
>> +
>> +MODULE_DESCRIPTION(PDS_VFIO_DRV_DESCRIPTION);
>> +MODULE_AUTHOR("Advanced Micro Devices, Inc.");
>> +MODULE_LICENSE("GPL");
>> diff --git a/drivers/vfio/pci/pds/vfio_dev.c b/drivers/vfio/pci/pds/vfio_dev.c
>> new file mode 100644
>> index 000000000000..4038dac90a97
>> --- /dev/null
>> +++ b/drivers/vfio/pci/pds/vfio_dev.c
>> @@ -0,0 +1,72 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright(c) 2023 Advanced Micro Devices, Inc. */
>> +
>> +#include <linux/vfio.h>
>> +#include <linux/vfio_pci_core.h>
>> +
>> +#include "vfio_dev.h"
>> +
>> +struct pds_vfio_pci_device *pds_vfio_pci_drvdata(struct pci_dev *pdev)
>> +{
>> +     struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
>> +
>> +     return container_of(core_device, struct pds_vfio_pci_device,
>> +                         vfio_coredev);
>> +}
>> +
>> +static int pds_vfio_init_device(struct vfio_device *vdev)
>> +{
>> +     struct pds_vfio_pci_device *pds_vfio =
>> +             container_of(vdev, struct pds_vfio_pci_device,
>> +                          vfio_coredev.vdev);
>> +     struct pci_dev *pdev = to_pci_dev(vdev->dev);
>> +     int err;
>> +
>> +     err = vfio_pci_core_init_dev(vdev);
>> +     if (err)
>> +             return err;
>> +
>> +     pds_vfio->vf_id = pci_iov_vf_id(pdev);
>> +     pds_vfio->pci_id = PCI_DEVID(pdev->bus->number, pdev->devfn);
> 
> We only ever end up using pci_id for a debug print here that could use
> a local variable and a slow path client registration that has access to
> pdev to do a lookup on demand.  Why do we bother caching it on the
> pds_vfio_pci_device?  Thanks,
> 
> Alex

No good reason, so another good suggestion. I will fix this as well. 
Thanks again for the feedback.

> 
>> +
>> +     return 0;
>> +}
>> +
>> +static int pds_vfio_open_device(struct vfio_device *vdev)
>> +{
>> +     struct pds_vfio_pci_device *pds_vfio =
>> +             container_of(vdev, struct pds_vfio_pci_device,
>> +                          vfio_coredev.vdev);
>> +     int err;
>> +
>> +     err = vfio_pci_core_enable(&pds_vfio->vfio_coredev);
>> +     if (err)
>> +             return err;
>> +
>> +     vfio_pci_core_finish_enable(&pds_vfio->vfio_coredev);
>> +
>> +     return 0;
>> +}
>> +
>> +static const struct vfio_device_ops pds_vfio_ops = {
>> +     .name = "pds-vfio",
>> +     .init = pds_vfio_init_device,
>> +     .release = vfio_pci_core_release_dev,
>> +     .open_device = pds_vfio_open_device,
>> +     .close_device = vfio_pci_core_close_device,
>> +     .ioctl = vfio_pci_core_ioctl,
>> +     .device_feature = vfio_pci_core_ioctl_feature,
>> +     .read = vfio_pci_core_read,
>> +     .write = vfio_pci_core_write,
>> +     .mmap = vfio_pci_core_mmap,
>> +     .request = vfio_pci_core_request,
>> +     .match = vfio_pci_core_match,
>> +     .bind_iommufd = vfio_iommufd_physical_bind,
>> +     .unbind_iommufd = vfio_iommufd_physical_unbind,
>> +     .attach_ioas = vfio_iommufd_physical_attach_ioas,
>> +};
>> +
>> +const struct vfio_device_ops *pds_vfio_ops_info(void)
>> +{
>> +     return &pds_vfio_ops;
>> +}
>> diff --git a/drivers/vfio/pci/pds/vfio_dev.h b/drivers/vfio/pci/pds/vfio_dev.h
>> new file mode 100644
>> index 000000000000..66cfcab5b5bf
>> --- /dev/null
>> +++ b/drivers/vfio/pci/pds/vfio_dev.h
>> @@ -0,0 +1,20 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/* Copyright(c) 2023 Advanced Micro Devices, Inc. */
>> +
>> +#ifndef _VFIO_DEV_H_
>> +#define _VFIO_DEV_H_
>> +
>> +#include <linux/pci.h>
>> +#include <linux/vfio_pci_core.h>
>> +
>> +struct pds_vfio_pci_device {
>> +     struct vfio_pci_core_device vfio_coredev;
>> +
>> +     int vf_id;
>> +     int pci_id;
>> +};
>> +
>> +const struct vfio_device_ops *pds_vfio_ops_info(void);
>> +struct pds_vfio_pci_device *pds_vfio_pci_drvdata(struct pci_dev *pdev);
>> +
>> +#endif /* _VFIO_DEV_H_ */
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH v10 vfio 3/7] vfio/pds: register with the pds_core PF
  2023-06-02 22:03 ` [PATCH v10 vfio 3/7] vfio/pds: register with the pds_core PF Brett Creeley
@ 2023-06-15 21:05   ` Shameerali Kolothum Thodi
  2023-06-15 21:30     ` Brett Creeley
  2023-06-16  7:04   ` Tian, Kevin
  1 sibling, 1 reply; 40+ messages in thread
From: Shameerali Kolothum Thodi @ 2023-06-15 21:05 UTC (permalink / raw)
  To: Brett Creeley, kvm, netdev, alex.williamson, jgg, yishaih, kevin.tian
  Cc: shannon.nelson



> -----Original Message-----
> From: Brett Creeley [mailto:brett.creeley@amd.com]
> Sent: 02 June 2023 23:03
> To: kvm@vger.kernel.org; netdev@vger.kernel.org;
> alex.williamson@redhat.com; jgg@nvidia.com; yishaih@nvidia.com;
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> kevin.tian@intel.com
> Cc: brett.creeley@amd.com; shannon.nelson@amd.com
> Subject: [PATCH v10 vfio 3/7] vfio/pds: register with the pds_core PF
> 
> The pds_core driver will supply adminq services, so find the PF
> and register with the DSC services.
> 
> Use the following commands to enable a VF:
> echo 1 > /sys/bus/pci/drivers/pds_core/$PF_BDF/sriov_numvfs
> 
> Signed-off-by: Brett Creeley <brett.creeley@amd.com>
> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
> ---
>  drivers/vfio/pci/pds/Makefile   |  1 +
>  drivers/vfio/pci/pds/cmds.c     | 43 +++++++++++++++++++++++++++++++++
>  drivers/vfio/pci/pds/cmds.h     | 10 ++++++++
>  drivers/vfio/pci/pds/pci_drv.c  | 19 +++++++++++++++
>  drivers/vfio/pci/pds/pci_drv.h  |  9 +++++++
>  drivers/vfio/pci/pds/vfio_dev.c | 11 +++++++++
>  drivers/vfio/pci/pds/vfio_dev.h |  6 +++++
>  include/linux/pds/pds_common.h  |  2 ++
>  8 files changed, 101 insertions(+)
>  create mode 100644 drivers/vfio/pci/pds/cmds.c
>  create mode 100644 drivers/vfio/pci/pds/cmds.h
>  create mode 100644 drivers/vfio/pci/pds/pci_drv.h
> 
> diff --git a/drivers/vfio/pci/pds/Makefile b/drivers/vfio/pci/pds/Makefile
> index e1a55ae0f079..87581111fa17 100644
> --- a/drivers/vfio/pci/pds/Makefile
> +++ b/drivers/vfio/pci/pds/Makefile
> @@ -4,5 +4,6 @@
>  obj-$(CONFIG_PDS_VFIO_PCI) += pds_vfio.o
> 
>  pds_vfio-y := \
> +	cmds.o		\
>  	pci_drv.o	\
>  	vfio_dev.o
> diff --git a/drivers/vfio/pci/pds/cmds.c b/drivers/vfio/pci/pds/cmds.c
> new file mode 100644
> index 000000000000..ae01f5df2f5c
> --- /dev/null
> +++ b/drivers/vfio/pci/pds/cmds.c
> @@ -0,0 +1,43 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2023 Advanced Micro Devices, Inc. */
> +
> +#include <linux/io.h>
> +#include <linux/types.h>
> +
> +#include <linux/pds/pds_common.h>
> +#include <linux/pds/pds_core_if.h>
> +#include <linux/pds/pds_adminq.h>
> +
> +#include "vfio_dev.h"
> +#include "cmds.h"
> +
> +int pds_vfio_register_client_cmd(struct pds_vfio_pci_device *pds_vfio)
> +{
> +	struct pci_dev *pdev = pds_vfio_to_pci_dev(pds_vfio);
> +	char devname[PDS_DEVNAME_LEN];
> +	int ci;
> +
> +	snprintf(devname, sizeof(devname), "%s.%d-%u", PDS_LM_DEV_NAME,
> +		 pci_domain_nr(pdev->bus), pds_vfio->pci_id);
> +
> +	ci = pds_client_register(pci_physfn(pdev), devname);
> +	if (ci <= 0)
> +		return ci;

I guess 0 is not a valid ID, yet when ci is 0 we return 0 here, and the
caller of pds_vfio_register_client_cmd() below treats a 0 return as
success.

Note: the comment in drivers..../auxbus.c also says the function returns 0
on success.

Please check.

Thanks,
Shameer
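
One possible way to make the return handling unambiguous, as a
userspace sketch only. It assumes the registration call returns a
positive client ID on success and a negative errno on failure, and
that 0 is never a valid ID; that contract is exactly what needs
checking. The demo_* names are hypothetical, and ci stands in for the
value returned by pds_client_register():

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for pds_vfio_pci_device; only the field used here. */
struct demo_pds_vfio {
	uint16_t client_id;
};

/*
 * Assumed contract: ci > 0 is a valid client ID, ci < 0 is a negative
 * errno, and ci == 0 is never a usable ID.
 */
static int demo_register_client(struct demo_pds_vfio *pds_vfio, int ci)
{
	assert(pds_vfio != NULL);

	if (ci < 0)
		return ci;	/* propagate the error unchanged */
	if (ci == 0)
		return -EIO;	/* never report success without a valid ID */

	pds_vfio->client_id = (uint16_t)ci;

	return 0;
}
```

The point is only that a 0 from the registration call should never
reach the caller as success; the choice of -EIO for that case is
arbitrary.
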
> +
> +	pds_vfio->client_id = ci;
> +
> +	return 0;
> +}
> +
> +void pds_vfio_unregister_client_cmd(struct pds_vfio_pci_device *pds_vfio)
> +{
> +	struct pci_dev *pdev = pds_vfio_to_pci_dev(pds_vfio);
> +	int err;
> +
> +	err = pds_client_unregister(pci_physfn(pdev), pds_vfio->client_id);
> +	if (err)
> +		dev_err(&pdev->dev, "unregister from DSC failed: %pe\n",
> +			ERR_PTR(err));
> +
> +	pds_vfio->client_id = 0;
> +}
> diff --git a/drivers/vfio/pci/pds/cmds.h b/drivers/vfio/pci/pds/cmds.h
> new file mode 100644
> index 000000000000..4c592afccf89
> --- /dev/null
> +++ b/drivers/vfio/pci/pds/cmds.h
> @@ -0,0 +1,10 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright(c) 2023 Advanced Micro Devices, Inc. */
> +
> +#ifndef _CMDS_H_
> +#define _CMDS_H_
> +
> +int pds_vfio_register_client_cmd(struct pds_vfio_pci_device *pds_vfio);
> +void pds_vfio_unregister_client_cmd(struct pds_vfio_pci_device *pds_vfio);
> +
> +#endif /* _CMDS_H_ */
> diff --git a/drivers/vfio/pci/pds/pci_drv.c b/drivers/vfio/pci/pds/pci_drv.c
> index 0e84249069d4..a49420aa9736 100644
> --- a/drivers/vfio/pci/pds/pci_drv.c
> +++ b/drivers/vfio/pci/pds/pci_drv.c
> @@ -8,9 +8,13 @@
>  #include <linux/types.h>
>  #include <linux/vfio.h>
> 
> +#include <linux/pds/pds_common.h>
>  #include <linux/pds/pds_core_if.h>
> +#include <linux/pds/pds_adminq.h>
> 
>  #include "vfio_dev.h"
> +#include "pci_drv.h"
> +#include "cmds.h"
> 
>  #define PDS_VFIO_DRV_DESCRIPTION	"AMD/Pensando VFIO Device Driver"
>  #define PCI_VENDOR_ID_PENSANDO		0x1dd8
> @@ -27,13 +31,27 @@ static int pds_vfio_pci_probe(struct pci_dev *pdev,
>  		return PTR_ERR(pds_vfio);
> 
>  	dev_set_drvdata(&pdev->dev, &pds_vfio->vfio_coredev);
> +	pds_vfio->pdsc = pdsc_get_pf_struct(pdev);
> +	if (IS_ERR_OR_NULL(pds_vfio->pdsc)) {
> +		err = PTR_ERR(pds_vfio->pdsc) ?: -ENODEV;
> +		goto out_put_vdev;
> +	}
> 
>  	err = vfio_pci_core_register_device(&pds_vfio->vfio_coredev);
>  	if (err)
>  		goto out_put_vdev;
> 
> +	err = pds_vfio_register_client_cmd(pds_vfio);
> +	if (err) {
> +		dev_err(&pdev->dev, "failed to register as client: %pe\n",
> +			ERR_PTR(err));
> +		goto out_unregister_coredev;
> +	}
> +
>  	return 0;
> 
> +out_unregister_coredev:
> +	vfio_pci_core_unregister_device(&pds_vfio->vfio_coredev);
>  out_put_vdev:
>  	vfio_put_device(&pds_vfio->vfio_coredev.vdev);
>  	return err;
> @@ -43,6 +61,7 @@ static void pds_vfio_pci_remove(struct pci_dev *pdev)
>  {
>  	struct pds_vfio_pci_device *pds_vfio = pds_vfio_pci_drvdata(pdev);
> 
> +	pds_vfio_unregister_client_cmd(pds_vfio);
>  	vfio_pci_core_unregister_device(&pds_vfio->vfio_coredev);
>  	vfio_put_device(&pds_vfio->vfio_coredev.vdev);
>  }
> diff --git a/drivers/vfio/pci/pds/pci_drv.h b/drivers/vfio/pci/pds/pci_drv.h
> new file mode 100644
> index 000000000000..e79bed12ed14
> --- /dev/null
> +++ b/drivers/vfio/pci/pds/pci_drv.h
> @@ -0,0 +1,9 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright(c) 2023 Advanced Micro Devices, Inc. */
> +
> +#ifndef _PCI_DRV_H
> +#define _PCI_DRV_H
> +
> +#include <linux/pci.h>
> +
> +#endif /* _PCI_DRV_H */
> diff --git a/drivers/vfio/pci/pds/vfio_dev.c b/drivers/vfio/pci/pds/vfio_dev.c
> index 4038dac90a97..39771265b78f 100644
> --- a/drivers/vfio/pci/pds/vfio_dev.c
> +++ b/drivers/vfio/pci/pds/vfio_dev.c
> @@ -6,6 +6,11 @@
> 
>  #include "vfio_dev.h"
> 
> +struct pci_dev *pds_vfio_to_pci_dev(struct pds_vfio_pci_device *pds_vfio)
> +{
> +	return pds_vfio->vfio_coredev.pdev;
> +}
> +
>  struct pds_vfio_pci_device *pds_vfio_pci_drvdata(struct pci_dev *pdev)
>  {
>  	struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
> @@ -29,6 +34,12 @@ static int pds_vfio_init_device(struct vfio_device *vdev)
>  	pds_vfio->vf_id = pci_iov_vf_id(pdev);
>  	pds_vfio->pci_id = PCI_DEVID(pdev->bus->number, pdev->devfn);
> 
> +	dev_dbg(&pdev->dev,
> +		"%s: PF %#04x VF %#04x (%d) vf_id %d domain %d pds_vfio %p\n",
> +		__func__, pci_dev_id(pdev->physfn), pds_vfio->pci_id,
> +		pds_vfio->pci_id, pds_vfio->vf_id, pci_domain_nr(pdev->bus),
> +		pds_vfio);
> +
>  	return 0;
>  }
> 
> diff --git a/drivers/vfio/pci/pds/vfio_dev.h b/drivers/vfio/pci/pds/vfio_dev.h
> index 66cfcab5b5bf..92e8ff241ca8 100644
> --- a/drivers/vfio/pci/pds/vfio_dev.h
> +++ b/drivers/vfio/pci/pds/vfio_dev.h
> @@ -7,14 +7,20 @@
>  #include <linux/pci.h>
>  #include <linux/vfio_pci_core.h>
> 
> +struct pdsc;
> +
>  struct pds_vfio_pci_device {
>  	struct vfio_pci_core_device vfio_coredev;
> +	struct pdsc *pdsc;
> 
>  	int vf_id;
>  	int pci_id;
> +	u16 client_id;
>  };
> 
>  const struct vfio_device_ops *pds_vfio_ops_info(void);
>  struct pds_vfio_pci_device *pds_vfio_pci_drvdata(struct pci_dev *pdev);
> 
> +struct pci_dev *pds_vfio_to_pci_dev(struct pds_vfio_pci_device *pds_vfio);
> +
>  #endif /* _VFIO_DEV_H_ */
> diff --git a/include/linux/pds/pds_common.h b/include/linux/pds/pds_common.h
> index 060331486d50..721453bdf975 100644
> --- a/include/linux/pds/pds_common.h
> +++ b/include/linux/pds/pds_common.h
> @@ -39,6 +39,8 @@ enum pds_core_vif_types {
>  #define PDS_DEV_TYPE_RDMA_STR	"RDMA"
>  #define PDS_DEV_TYPE_LM_STR	"LM"
> 
> +#define PDS_LM_DEV_NAME		PDS_CORE_DRV_NAME "." PDS_DEV_TYPE_LM_STR
> +
>  #define PDS_CORE_IFNAMSIZ		16
> 
>  /**
> --
> 2.17.1


* RE: [PATCH v10 vfio 4/7] vfio/pds: Add VFIO live migration support
  2023-06-02 22:03 ` [PATCH v10 vfio 4/7] vfio/pds: Add VFIO live migration support Brett Creeley
@ 2023-06-15 21:07   ` Shameerali Kolothum Thodi
  2023-06-15 21:36     ` Brett Creeley
  2023-06-16  8:06   ` Tian, Kevin
  1 sibling, 1 reply; 40+ messages in thread
From: Shameerali Kolothum Thodi @ 2023-06-15 21:07 UTC (permalink / raw)
  To: Brett Creeley, kvm, netdev, alex.williamson, jgg, yishaih, kevin.tian
  Cc: shannon.nelson



> -----Original Message-----
> From: Brett Creeley [mailto:brett.creeley@amd.com]
> Sent: 02 June 2023 23:03
> To: kvm@vger.kernel.org; netdev@vger.kernel.org;
> alex.williamson@redhat.com; jgg@nvidia.com; yishaih@nvidia.com;
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> kevin.tian@intel.com
> Cc: brett.creeley@amd.com; shannon.nelson@amd.com
> Subject: [PATCH v10 vfio 4/7] vfio/pds: Add VFIO live migration support
> 
> Add live migration support via the VFIO subsystem. The migration
> implementation aligns with the definition from uapi/vfio.h and uses
> the pds_core PF's adminq for device configuration.
> 
> The ability to suspend, resume, and transfer VF device state data is
> included along with the required admin queue command structures and
> implementations.
> 
> PDS_LM_CMD_SUSPEND and PDS_LM_CMD_SUSPEND_STATUS are added to support
> the VF device suspend operation.
> 
> PDS_LM_CMD_RESUME is added to support the VF device resume operation.
> 
> PDS_LM_CMD_STATUS is added to determine the exact size of the VF
> device state data.
> 
> PDS_LM_CMD_SAVE is added to get the VF device state data.
> 
> PDS_LM_CMD_RESTORE is added to restore the VF device with the
> previously saved data from PDS_LM_CMD_SAVE.
> 
> PDS_LM_CMD_HOST_VF_STATUS is added to notify the device when
> a migration is in/not-in progress from the host's perspective.
> 
> Signed-off-by: Brett Creeley <brett.creeley@amd.com>
> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
> ---
>  drivers/vfio/pci/pds/Makefile   |   1 +
>  drivers/vfio/pci/pds/cmds.c     | 319 ++++++++++++++++++++++++
>  drivers/vfio/pci/pds/cmds.h     |   8 +-
>  drivers/vfio/pci/pds/lm.c       | 421 ++++++++++++++++++++++++++++++++
>  drivers/vfio/pci/pds/lm.h       |  41 ++++
>  drivers/vfio/pci/pds/pci_drv.c  |  13 +
>  drivers/vfio/pci/pds/vfio_dev.c | 120 ++++++++-
>  drivers/vfio/pci/pds/vfio_dev.h |  11 +
>  include/linux/pds/pds_adminq.h  | 217 ++++++++++++++++
>  9 files changed, 1149 insertions(+), 2 deletions(-)
>  create mode 100644 drivers/vfio/pci/pds/lm.c
>  create mode 100644 drivers/vfio/pci/pds/lm.h
> 
> diff --git a/drivers/vfio/pci/pds/Makefile b/drivers/vfio/pci/pds/Makefile
> index 87581111fa17..dbaf613d3794 100644
> --- a/drivers/vfio/pci/pds/Makefile
> +++ b/drivers/vfio/pci/pds/Makefile
> @@ -5,5 +5,6 @@ obj-$(CONFIG_PDS_VFIO_PCI) += pds_vfio.o
> 
>  pds_vfio-y := \
>  	cmds.o		\
> +	lm.o		\
>  	pci_drv.o	\
>  	vfio_dev.o
> diff --git a/drivers/vfio/pci/pds/cmds.c b/drivers/vfio/pci/pds/cmds.c
> index ae01f5df2f5c..256f458feb58 100644
> --- a/drivers/vfio/pci/pds/cmds.c
> +++ b/drivers/vfio/pci/pds/cmds.c
> @@ -3,6 +3,7 @@
> 
>  #include <linux/io.h>
>  #include <linux/types.h>
> +#include <linux/delay.h>
> 
>  #include <linux/pds/pds_common.h>
>  #include <linux/pds/pds_core_if.h>
> @@ -11,6 +12,34 @@
>  #include "vfio_dev.h"
>  #include "cmds.h"
> 
> +#define SUSPEND_TIMEOUT_S		5
> +#define SUSPEND_CHECK_INTERVAL_MS	1
> +
> +static int pds_vfio_client_adminq_cmd(struct pds_vfio_pci_device *pds_vfio,
> +				      union pds_core_adminq_cmd *req,
> +				      size_t req_len,
> +				      union pds_core_adminq_comp *resp,
> +				      u64 flags)

Why u64 for flags? Do we expect more flags to follow? The core interface
used below only takes a bool (fast_poll).

Thanks,
Shameer
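
For illustration only, a minimal user-space sketch of the two signatures
being compared above. All ex_* names and the flag value are hypothetical
stand-ins, not the pds_core API; the point is just that a bitmask argument
collapses to a single bool at the core call, so both variants behave the
same until a second flag exists.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bit, standing in for PDS_AQ_FLAG_FASTPOLL. */
#define EX_FLAG_FASTPOLL (UINT64_C(1) << 0)

/* Stand-in for the core post routine: it only understands one bool.
 * Returns 1 when fast polling was requested, 0 otherwise. */
static int ex_core_post(bool fast_poll)
{
	return fast_poll ? 1 : 0;
}

/* Variant A: bitmask parameter, as in the quoted patch.
 * Any bit other than EX_FLAG_FASTPOLL is silently ignored. */
static int ex_client_cmd_flags(uint64_t flags)
{
	return ex_core_post(!!(flags & EX_FLAG_FASTPOLL));
}

/* Variant B: pass the bool straight through to the core routine. */
static int ex_client_cmd_bool(bool fast_poll)
{
	return ex_core_post(fast_poll);
}
```

Calling ex_client_cmd_flags(EX_FLAG_FASTPOLL) and ex_client_cmd_bool(true)
reaches the core routine with the same argument; the bitmask form only pays
off if further flags are planned.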

> +{
> +	union pds_core_adminq_cmd cmd = {};
> +	size_t cp_len;
> +	int err;
> +
> +	/* Wrap the client request */
> +	cmd.client_request.opcode = PDS_AQ_CMD_CLIENT_CMD;
> +	cmd.client_request.client_id = cpu_to_le16(pds_vfio->client_id);
> +	cp_len = min_t(size_t, req_len, sizeof(cmd.client_request.client_cmd));
> +	memcpy(cmd.client_request.client_cmd, req, cp_len);
> +
> +	err = pdsc_adminq_post(pds_vfio->pdsc, &cmd, resp,
> +			       !!(flags & PDS_AQ_FLAG_FASTPOLL));
> +	if (err && err != -EAGAIN)
> +		dev_info(pds_vfio_to_dev(pds_vfio),
> +			 "client admin cmd failed: %pe\n", ERR_PTR(err));
> +
> +	return err;
> +}
> +
>  int pds_vfio_register_client_cmd(struct pds_vfio_pci_device *pds_vfio)
>  {
>  	struct pci_dev *pdev = pds_vfio_to_pci_dev(pds_vfio);
> @@ -41,3 +70,293 @@ void pds_vfio_unregister_client_cmd(struct pds_vfio_pci_device *pds_vfio)
> 
>  	pds_vfio->client_id = 0;
>  }
> +
> +static int
> +pds_vfio_suspend_wait_device_cmd(struct pds_vfio_pci_device *pds_vfio)
> +{
> +	union pds_core_adminq_cmd cmd = {
> +		.lm_suspend_status = {
> +			.opcode = PDS_LM_CMD_SUSPEND_STATUS,
> +			.vf_id = cpu_to_le16(pds_vfio->vf_id),
> +		},
> +	};
> +	struct device *dev = pds_vfio_to_dev(pds_vfio);
> +	union pds_core_adminq_comp comp = {};
> +	unsigned long time_limit;
> +	unsigned long time_start;
> +	unsigned long time_done;
> +	int err;
> +
> +	time_start = jiffies;
> +	time_limit = time_start + HZ * SUSPEND_TIMEOUT_S;
> +	do {
> +		err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd),
> +						 &comp, PDS_AQ_FLAG_FASTPOLL);
> +		if (err != -EAGAIN)
> +			break;
> +
> +		msleep(SUSPEND_CHECK_INTERVAL_MS);
> +	} while (time_before(jiffies, time_limit));
> +
> +	time_done = jiffies;
> +	dev_dbg(dev, "%s: vf%u: Suspend comp received in %d msecs\n", __func__,
> +		pds_vfio->vf_id, jiffies_to_msecs(time_done - time_start));
> +
> +	/* Check the results */
> +	if (time_after_eq(time_done, time_limit)) {
> +		dev_err(dev, "%s: vf%u: Suspend comp timeout\n", __func__,
> +			pds_vfio->vf_id);
> +		err = -ETIMEDOUT;
> +	}
> +
> +	return err;
> +}
> +
> +int pds_vfio_suspend_device_cmd(struct pds_vfio_pci_device *pds_vfio)
> +{
> +	union pds_core_adminq_cmd cmd = {
> +		.lm_suspend = {
> +			.opcode = PDS_LM_CMD_SUSPEND,
> +			.vf_id = cpu_to_le16(pds_vfio->vf_id),
> +		},
> +	};
> +	struct device *dev = pds_vfio_to_dev(pds_vfio);
> +	union pds_core_adminq_comp comp = {};
> +	int err;
> +
> +	dev_dbg(dev, "vf%u: Suspend device\n", pds_vfio->vf_id);
> +
> +	err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp,
> +					 PDS_AQ_FLAG_FASTPOLL);
> +	if (err) {
> +		dev_err(dev, "vf%u: Suspend failed: %pe\n", pds_vfio->vf_id,
> +			ERR_PTR(err));
> +		return err;
> +	}
> +
> +	return pds_vfio_suspend_wait_device_cmd(pds_vfio);
> +}
> +
> +int pds_vfio_resume_device_cmd(struct pds_vfio_pci_device *pds_vfio)
> +{
> +	union pds_core_adminq_cmd cmd = {
> +		.lm_resume = {
> +			.opcode = PDS_LM_CMD_RESUME,
> +			.vf_id = cpu_to_le16(pds_vfio->vf_id),
> +		},
> +	};
> +	struct device *dev = pds_vfio_to_dev(pds_vfio);
> +	union pds_core_adminq_comp comp = {};
> +
> +	dev_dbg(dev, "vf%u: Resume device\n", pds_vfio->vf_id);
> +
> +	return pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp,
> +					  0);
> +}
> +
> +int pds_vfio_get_lm_status_cmd(struct pds_vfio_pci_device *pds_vfio, u64 *size)
> +{
> +	union pds_core_adminq_cmd cmd = {
> +		.lm_status = {
> +			.opcode = PDS_LM_CMD_STATUS,
> +			.vf_id = cpu_to_le16(pds_vfio->vf_id),
> +		},
> +	};
> +	struct device *dev = pds_vfio_to_dev(pds_vfio);
> +	union pds_core_adminq_comp comp = {};
> +	int err;
> +
> +	dev_dbg(dev, "vf%u: Get migration status\n", pds_vfio->vf_id);
> +
> +	err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp, 0);
> +	if (err)
> +		return err;
> +
> +	*size = le64_to_cpu(comp.lm_status.size);
> +	return 0;
> +}
> +
> +static int pds_vfio_dma_map_lm_file(struct device *dev,
> +				    enum dma_data_direction dir,
> +				    struct pds_vfio_lm_file *lm_file)
> +{
> +	struct pds_lm_sg_elem *sgl, *sge;
> +	struct scatterlist *sg;
> +	dma_addr_t sgl_addr;
> +	size_t sgl_size;
> +	int err;
> +	int i;
> +
> +	if (!lm_file)
> +		return -EINVAL;
> +
> +	/* dma map file pages */
> +	err = dma_map_sgtable(dev, &lm_file->sg_table, dir, 0);
> +	if (err)
> +		return err;
> +
> +	lm_file->num_sge = lm_file->sg_table.nents;
> +
> +	/* alloc sgl */
> +	sgl_size = lm_file->num_sge * sizeof(struct pds_lm_sg_elem);
> +	sgl = kzalloc(sgl_size, GFP_KERNEL);
> +	if (!sgl) {
> +		err = -ENOMEM;
> +		goto out_unmap_sgtable;
> +	}
> +
> +	/* fill sgl */
> +	sge = sgl;
> +	for_each_sgtable_dma_sg(&lm_file->sg_table, sg, i) {
> +		sge->addr = cpu_to_le64(sg_dma_address(sg));
> +		sge->len = cpu_to_le32(sg_dma_len(sg));
> +		dev_dbg(dev, "addr = %llx, len = %u\n", sge->addr, sge->len);
> +		sge++;
> +	}
> +
> +	sgl_addr = dma_map_single(dev, sgl, sgl_size, DMA_TO_DEVICE);
> +	if (dma_mapping_error(dev, sgl_addr)) {
> +		err = -EIO;
> +		goto out_free_sgl;
> +	}
> +
> +	lm_file->sgl = sgl;
> +	lm_file->sgl_addr = sgl_addr;
> +
> +	return 0;
> +
> +out_free_sgl:
> +	kfree(sgl);
> +out_unmap_sgtable:
> +	lm_file->num_sge = 0;
> +	dma_unmap_sgtable(dev, &lm_file->sg_table, dir, 0);
> +	return err;
> +}
> +
> +static void pds_vfio_dma_unmap_lm_file(struct device *dev,
> +				       enum dma_data_direction dir,
> +				       struct pds_vfio_lm_file *lm_file)
> +{
> +	if (!lm_file)
> +		return;
> +
> +	/* free sgl */
> +	if (lm_file->sgl) {
> +		dma_unmap_single(dev, lm_file->sgl_addr,
> +				 lm_file->num_sge * sizeof(*lm_file->sgl),
> +				 DMA_TO_DEVICE);
> +		kfree(lm_file->sgl);
> +		lm_file->sgl = NULL;
> +		lm_file->sgl_addr = DMA_MAPPING_ERROR;
> +		lm_file->num_sge = 0;
> +	}
> +
> +	/* dma unmap file pages */
> +	dma_unmap_sgtable(dev, &lm_file->sg_table, dir, 0);
> +}
> +
> +int pds_vfio_get_lm_state_cmd(struct pds_vfio_pci_device *pds_vfio)
> +{
> +	union pds_core_adminq_cmd cmd = {
> +		.lm_save = {
> +			.opcode = PDS_LM_CMD_SAVE,
> +			.vf_id = cpu_to_le16(pds_vfio->vf_id),
> +		},
> +	};
> +	struct pci_dev *pdev = pds_vfio_to_pci_dev(pds_vfio);
> +	struct device *pdsc_dev = &pci_physfn(pdev)->dev;
> +	union pds_core_adminq_comp comp = {};
> +	struct pds_vfio_lm_file *lm_file;
> +	int err;
> +
> +	dev_dbg(&pdev->dev, "vf%u: Get migration state\n", pds_vfio->vf_id);
> +
> +	lm_file = pds_vfio->save_file;
> +
> +	err = pds_vfio_dma_map_lm_file(pdsc_dev, DMA_FROM_DEVICE, lm_file);
> +	if (err) {
> +		dev_err(&pdev->dev, "failed to map save migration file: %pe\n",
> +			ERR_PTR(err));
> +		return err;
> +	}
> +
> +	cmd.lm_save.sgl_addr = cpu_to_le64(lm_file->sgl_addr);
> +	cmd.lm_save.num_sge = cpu_to_le32(lm_file->num_sge);
> +
> +	err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp, 0);
> +	if (err)
> +		dev_err(&pdev->dev, "failed to get migration state: %pe\n",
> +			ERR_PTR(err));
> +
> +	pds_vfio_dma_unmap_lm_file(pdsc_dev, DMA_FROM_DEVICE, lm_file);
> +
> +	return err;
> +}
> +
> +int pds_vfio_set_lm_state_cmd(struct pds_vfio_pci_device *pds_vfio)
> +{
> +	union pds_core_adminq_cmd cmd = {
> +		.lm_restore = {
> +			.opcode = PDS_LM_CMD_RESTORE,
> +			.vf_id = cpu_to_le16(pds_vfio->vf_id),
> +		},
> +	};
> +	struct pci_dev *pdev = pds_vfio_to_pci_dev(pds_vfio);
> +	struct device *pdsc_dev = &pci_physfn(pdev)->dev;
> +	union pds_core_adminq_comp comp = {};
> +	struct pds_vfio_lm_file *lm_file;
> +	int err;
> +
> +	dev_dbg(&pdev->dev, "vf%u: Set migration state\n", pds_vfio->vf_id);
> +
> +	lm_file = pds_vfio->restore_file;
> +
> +	err = pds_vfio_dma_map_lm_file(pdsc_dev, DMA_TO_DEVICE, lm_file);
> +	if (err) {
> +		dev_err(&pdev->dev,
> +			"failed to map restore migration file: %pe\n",
> +			ERR_PTR(err));
> +		return err;
> +	}
> +
> +	cmd.lm_restore.sgl_addr = cpu_to_le64(lm_file->sgl_addr);
> +	cmd.lm_restore.num_sge = cpu_to_le32(lm_file->num_sge);
> +
> +	err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp, 0);
> +	if (err)
> +		dev_err(&pdev->dev, "failed to set migration state: %pe\n",
> +			ERR_PTR(err));
> +
> +	pds_vfio_dma_unmap_lm_file(pdsc_dev, DMA_TO_DEVICE, lm_file);
> +
> +	return err;
> +}
> +
> +void pds_vfio_send_host_vf_lm_status_cmd(struct pds_vfio_pci_device *pds_vfio,
> +					 enum pds_lm_host_vf_status vf_status)
> +{
> +	union pds_core_adminq_cmd cmd = {
> +		.lm_host_vf_status = {
> +			.opcode = PDS_LM_CMD_HOST_VF_STATUS,
> +			.vf_id = cpu_to_le16(pds_vfio->vf_id),
> +			.status = vf_status,
> +		},
> +	};
> +	struct device *dev = pds_vfio_to_dev(pds_vfio);
> +	union pds_core_adminq_comp comp = {};
> +	int err;
> +
> +	dev_dbg(dev, "vf%u: Set host VF LM status: %u", pds_vfio->vf_id,
> +		vf_status);
> +	if (vf_status != PDS_LM_STA_IN_PROGRESS &&
> +	    vf_status != PDS_LM_STA_NONE) {
> +		dev_warn(dev, "Invalid host VF migration status, %d\n",
> +			 vf_status);
> +		return;
> +	}
> +
> +	err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp, 0);
> +	if (err)
> +		dev_warn(dev, "failed to send host VF migration status: %pe\n",
> +			 ERR_PTR(err));
> +}
> diff --git a/drivers/vfio/pci/pds/cmds.h b/drivers/vfio/pci/pds/cmds.h
> index 4c592afccf89..3d8a5508c733 100644
> --- a/drivers/vfio/pci/pds/cmds.h
> +++ b/drivers/vfio/pci/pds/cmds.h
> @@ -6,5 +6,11 @@
> 
>  int pds_vfio_register_client_cmd(struct pds_vfio_pci_device *pds_vfio);
>  void pds_vfio_unregister_client_cmd(struct pds_vfio_pci_device *pds_vfio);
> -
> +int pds_vfio_suspend_device_cmd(struct pds_vfio_pci_device *pds_vfio);
> +int pds_vfio_resume_device_cmd(struct pds_vfio_pci_device *pds_vfio);
> +int pds_vfio_get_lm_status_cmd(struct pds_vfio_pci_device *pds_vfio, u64 *size);
> +int pds_vfio_get_lm_state_cmd(struct pds_vfio_pci_device *pds_vfio);
> +int pds_vfio_set_lm_state_cmd(struct pds_vfio_pci_device *pds_vfio);
> +void pds_vfio_send_host_vf_lm_status_cmd(struct pds_vfio_pci_device *pds_vfio,
> +					 enum pds_lm_host_vf_status vf_status);
>  #endif /* _CMDS_H_ */
> diff --git a/drivers/vfio/pci/pds/lm.c b/drivers/vfio/pci/pds/lm.c
> new file mode 100644
> index 000000000000..c507f39a2339
> --- /dev/null
> +++ b/drivers/vfio/pci/pds/lm.c
> @@ -0,0 +1,421 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2023 Advanced Micro Devices, Inc. */
> +
> +#include <linux/anon_inodes.h>
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/highmem.h>
> +#include <linux/vfio.h>
> +#include <linux/vfio_pci_core.h>
> +
> +#include "vfio_dev.h"
> +#include "cmds.h"
> +
> +static struct pds_vfio_lm_file *
> +pds_vfio_get_lm_file(const struct file_operations *fops, int flags, u64 size)
> +{
> +	struct pds_vfio_lm_file *lm_file = NULL;
> +	unsigned long long npages;
> +	struct page **pages;
> +	void *page_mem;
> +	const void *p;
> +
> +	if (!size)
> +		return NULL;
> +
> +	/* Alloc file structure */
> +	lm_file = kzalloc(sizeof(*lm_file), GFP_KERNEL);
> +	if (!lm_file)
> +		return NULL;
> +
> +	/* Create file */
> +	lm_file->filep =
> +		anon_inode_getfile("pds_vfio_lm", fops, lm_file, flags);
> +	if (!lm_file->filep)
> +		goto out_free_file;
> +
> +	stream_open(lm_file->filep->f_inode, lm_file->filep);
> +	mutex_init(&lm_file->lock);
> +
> +	/* prevent file from being released before we are done with it */
> +	get_file(lm_file->filep);
> +
> +	/* Allocate memory for file pages */
> +	npages = DIV_ROUND_UP_ULL(size, PAGE_SIZE);
> +	pages = kmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
> +	if (!pages)
> +		goto out_put_file;
> +
> +	page_mem = kvzalloc(ALIGN(size, PAGE_SIZE), GFP_KERNEL);
> +	if (!page_mem)
> +		goto out_free_pages_array;
> +
> +	p = page_mem - offset_in_page(page_mem);
> +	for (unsigned long long i = 0; i < npages; i++) {
> +		if (is_vmalloc_addr(p))
> +			pages[i] = vmalloc_to_page(p);
> +		else
> +			pages[i] = kmap_to_page((void *)p);
> +		if (!pages[i])
> +			goto out_free_page_mem;
> +
> +		p += PAGE_SIZE;
> +	}
> +
> +	/* Create scatterlist of file pages to use for DMA mapping later */
> +	if (sg_alloc_table_from_pages(&lm_file->sg_table, pages, npages, 0,
> +				      size, GFP_KERNEL))
> +		goto out_free_page_mem;
> +
> +	lm_file->size = size;
> +	lm_file->pages = pages;
> +	lm_file->npages = npages;
> +	lm_file->page_mem = page_mem;
> +	lm_file->alloc_size = npages * PAGE_SIZE;
> +
> +	return lm_file;
> +
> +out_free_page_mem:
> +	kvfree(page_mem);
> +out_free_pages_array:
> +	kfree(pages);
> +out_put_file:
> +	fput(lm_file->filep);
> +	mutex_destroy(&lm_file->lock);
> +out_free_file:
> +	kfree(lm_file);
> +
> +	return NULL;
> +}
> +
> +static void pds_vfio_put_lm_file(struct pds_vfio_lm_file *lm_file)
> +{
> +	mutex_lock(&lm_file->lock);
> +
> +	lm_file->size = 0;
> +	lm_file->alloc_size = 0;
> +
> +	/* Free scatter list of file pages */
> +	sg_free_table(&lm_file->sg_table);
> +
> +	kvfree(lm_file->page_mem);
> +	lm_file->page_mem = NULL;
> +	kfree(lm_file->pages);
> +	lm_file->pages = NULL;
> +
> +	mutex_unlock(&lm_file->lock);
> +
> +	/* allow file to be released since we are done with it */
> +	fput(lm_file->filep);
> +}
> +
> +void pds_vfio_put_save_file(struct pds_vfio_pci_device *pds_vfio)
> +{
> +	if (!pds_vfio->save_file)
> +		return;
> +
> +	pds_vfio_put_lm_file(pds_vfio->save_file);
> +	pds_vfio->save_file = NULL;
> +}
> +
> +void pds_vfio_put_restore_file(struct pds_vfio_pci_device *pds_vfio)
> +{
> +	if (!pds_vfio->restore_file)
> +		return;
> +
> +	pds_vfio_put_lm_file(pds_vfio->restore_file);
> +	pds_vfio->restore_file = NULL;
> +}
> +
> +static struct page *pds_vfio_get_file_page(struct pds_vfio_lm_file *lm_file,
> +					   unsigned long offset)
> +{
> +	unsigned long cur_offset = 0;
> +	struct scatterlist *sg;
> +	unsigned int i;
> +
> +	/* All accesses are sequential */
> +	if (offset < lm_file->last_offset || !lm_file->last_offset_sg) {
> +		lm_file->last_offset = 0;
> +		lm_file->last_offset_sg = lm_file->sg_table.sgl;
> +		lm_file->sg_last_entry = 0;
> +	}
> +
> +	cur_offset = lm_file->last_offset;
> +
> +	for_each_sg(lm_file->last_offset_sg, sg,
> +		    lm_file->sg_table.orig_nents - lm_file->sg_last_entry, i) {
> +		if (offset < sg->length + cur_offset) {
> +			lm_file->last_offset_sg = sg;
> +			lm_file->sg_last_entry += i;
> +			lm_file->last_offset = cur_offset;
> +			return nth_page(sg_page(sg),
> +					(offset - cur_offset) / PAGE_SIZE);
> +		}
> +		cur_offset += sg->length;
> +	}
> +
> +	return NULL;
> +}
> +
> +static int pds_vfio_release_file(struct inode *inode, struct file *filp)
> +{
> +	struct pds_vfio_lm_file *lm_file = filp->private_data;
> +
> +	mutex_lock(&lm_file->lock);
> +	lm_file->filep->f_pos = 0;
> +	lm_file->size = 0;
> +	mutex_unlock(&lm_file->lock);
> +	mutex_destroy(&lm_file->lock);
> +	kfree(lm_file);
> +
> +	return 0;
> +}
> +
> +static ssize_t pds_vfio_save_read(struct file *filp, char __user *buf,
> +				  size_t len, loff_t *pos)
> +{
> +	struct pds_vfio_lm_file *lm_file = filp->private_data;
> +	ssize_t done = 0;
> +
> +	if (pos)
> +		return -ESPIPE;
> +	pos = &filp->f_pos;
> +
> +	mutex_lock(&lm_file->lock);
> +	if (*pos > lm_file->size) {
> +		done = -EINVAL;
> +		goto out_unlock;
> +	}
> +
> +	len = min_t(size_t, lm_file->size - *pos, len);
> +	while (len) {
> +		size_t page_offset;
> +		struct page *page;
> +		size_t page_len;
> +		u8 *from_buff;
> +		int err;
> +
> +		page_offset = (*pos) % PAGE_SIZE;
> +		page = pds_vfio_get_file_page(lm_file, *pos - page_offset);
> +		if (!page) {
> +			if (done == 0)
> +				done = -EINVAL;
> +			goto out_unlock;
> +		}
> +
> +		page_len = min_t(size_t, len, PAGE_SIZE - page_offset);
> +		from_buff = kmap_local_page(page);
> +		err = copy_to_user(buf, from_buff + page_offset, page_len);
> +		kunmap_local(from_buff);
> +		if (err) {
> +			done = -EFAULT;
> +			goto out_unlock;
> +		}
> +		*pos += page_len;
> +		len -= page_len;
> +		done += page_len;
> +		buf += page_len;
> +	}
> +
> +out_unlock:
> +	mutex_unlock(&lm_file->lock);
> +	return done;
> +}
> +
> +static const struct file_operations pds_vfio_save_fops = {
> +	.owner = THIS_MODULE,
> +	.read = pds_vfio_save_read,
> +	.release = pds_vfio_release_file,
> +	.llseek = no_llseek,
> +};
> +
> +static int pds_vfio_get_save_file(struct pds_vfio_pci_device *pds_vfio)
> +{
> +	struct device *dev = &pds_vfio->vfio_coredev.pdev->dev;
> +	struct pds_vfio_lm_file *lm_file;
> +	int err;
> +	u64 size;
> +
> +	/* Get live migration state size in this state */
> +	err = pds_vfio_get_lm_status_cmd(pds_vfio, &size);
> +	if (err) {
> +		dev_err(dev, "failed to get save status: %pe\n", ERR_PTR(err));
> +		return err;
> +	}
> +
> +	dev_dbg(dev, "save status, size = %lld\n", size);
> +
> +	if (!size) {
> +		dev_err(dev, "invalid state size\n");
> +		return -EIO;
> +	}
> +
> +	lm_file = pds_vfio_get_lm_file(&pds_vfio_save_fops, O_RDONLY, size);
> +	if (!lm_file) {
> +		dev_err(dev, "failed to create save file\n");
> +		return -ENOENT;
> +	}
> +
> +	dev_dbg(dev, "size = %lld, alloc_size = %lld, npages = %lld\n",
> +		lm_file->size, lm_file->alloc_size, lm_file->npages);
> +
> +	pds_vfio->save_file = lm_file;
> +
> +	return 0;
> +}
> +
> +static ssize_t pds_vfio_restore_write(struct file *filp, const char __user *buf,
> +				      size_t len, loff_t *pos)
> +{
> +	struct pds_vfio_lm_file *lm_file = filp->private_data;
> +	loff_t requested_length;
> +	ssize_t done = 0;
> +
> +	if (pos)
> +		return -ESPIPE;
> +
> +	pos = &filp->f_pos;
> +
> +	if (*pos < 0 ||
> +	    check_add_overflow((loff_t)len, *pos, &requested_length))
> +		return -EINVAL;
> +
> +	mutex_lock(&lm_file->lock);
> +
> +	while (len) {
> +		size_t page_offset;
> +		struct page *page;
> +		size_t page_len;
> +		u8 *to_buff;
> +		int err;
> +
> +		page_offset = (*pos) % PAGE_SIZE;
> +		page = pds_vfio_get_file_page(lm_file, *pos - page_offset);
> +		if (!page) {
> +			if (done == 0)
> +				done = -EINVAL;
> +			goto out_unlock;
> +		}
> +
> +		page_len = min_t(size_t, len, PAGE_SIZE - page_offset);
> +		to_buff = kmap_local_page(page);
> +		err = copy_from_user(to_buff + page_offset, buf, page_len);
> +		kunmap_local(to_buff);
> +		if (err) {
> +			done = -EFAULT;
> +			goto out_unlock;
> +		}
> +		*pos += page_len;
> +		len -= page_len;
> +		done += page_len;
> +		buf += page_len;
> +		lm_file->size += page_len;
> +	}
> +out_unlock:
> +	mutex_unlock(&lm_file->lock);
> +	return done;
> +}
> +
> +static const struct file_operations pds_vfio_restore_fops = {
> +	.owner = THIS_MODULE,
> +	.write = pds_vfio_restore_write,
> +	.release = pds_vfio_release_file,
> +	.llseek = no_llseek,
> +};
> +
> +static int pds_vfio_get_restore_file(struct pds_vfio_pci_device *pds_vfio)
> +{
> +	struct device *dev = &pds_vfio->vfio_coredev.pdev->dev;
> +	struct pds_vfio_lm_file *lm_file;
> +	u64 size;
> +
> +	size = sizeof(union pds_lm_dev_state);
> +	dev_dbg(dev, "restore status, size = %lld\n", size);
> +
> +	if (!size) {
> +		dev_err(dev, "invalid state size");
> +		return -EIO;
> +	}
> +
> +	lm_file = pds_vfio_get_lm_file(&pds_vfio_restore_fops, O_WRONLY, size);
> +	if (!lm_file) {
> +		dev_err(dev, "failed to create restore file");
> +		return -ENOENT;
> +	}
> +	pds_vfio->restore_file = lm_file;
> +
> +	return 0;
> +}
> +
> +struct file *
> +pds_vfio_step_device_state_locked(struct pds_vfio_pci_device *pds_vfio,
> +				  enum vfio_device_mig_state next)
> +{
> +	enum vfio_device_mig_state cur = pds_vfio->state;
> +	int err;
> +
> +	if (cur == VFIO_DEVICE_STATE_STOP && next == VFIO_DEVICE_STATE_STOP_COPY) {
> +		err = pds_vfio_get_save_file(pds_vfio);
> +		if (err)
> +			return ERR_PTR(err);
> +
> +		err = pds_vfio_get_lm_state_cmd(pds_vfio);
> +		if (err) {
> +			pds_vfio_put_save_file(pds_vfio);
> +			return ERR_PTR(err);
> +		}
> +
> +		return pds_vfio->save_file->filep;
> +	}
> +
> +	if (cur == VFIO_DEVICE_STATE_STOP_COPY && next == VFIO_DEVICE_STATE_STOP) {
> +		pds_vfio_put_save_file(pds_vfio);
> +		pds_vfio_send_host_vf_lm_status_cmd(pds_vfio, PDS_LM_STA_NONE);
> +		return NULL;
> +	}
> +
> +	if (cur == VFIO_DEVICE_STATE_STOP && next == VFIO_DEVICE_STATE_RESUMING) {
> +		err = pds_vfio_get_restore_file(pds_vfio);
> +		if (err)
> +			return ERR_PTR(err);
> +
> +		return pds_vfio->restore_file->filep;
> +	}
> +
> +	if (cur == VFIO_DEVICE_STATE_RESUMING && next == VFIO_DEVICE_STATE_STOP) {
> +		err = pds_vfio_set_lm_state_cmd(pds_vfio);
> +		if (err)
> +			return ERR_PTR(err);
> +
> +		pds_vfio_put_restore_file(pds_vfio);
> +		return NULL;
> +	}
> +
> +	if (cur == VFIO_DEVICE_STATE_RUNNING && next == VFIO_DEVICE_STATE_RUNNING_P2P) {
> +		pds_vfio_send_host_vf_lm_status_cmd(pds_vfio,
> +						    PDS_LM_STA_IN_PROGRESS);
> +		err = pds_vfio_suspend_device_cmd(pds_vfio);
> +		if (err)
> +			return ERR_PTR(err);
> +
> +		return NULL;
> +	}
> +
> +	if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && next == VFIO_DEVICE_STATE_RUNNING) {
> +		err = pds_vfio_resume_device_cmd(pds_vfio);
> +		if (err)
> +			return ERR_PTR(err);
> +
> +		pds_vfio_send_host_vf_lm_status_cmd(pds_vfio, PDS_LM_STA_NONE);
> +		return NULL;
> +	}
> +
> +	if (cur == VFIO_DEVICE_STATE_STOP && next == VFIO_DEVICE_STATE_RUNNING_P2P)
> +		return NULL;
> +
> +	if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && next == VFIO_DEVICE_STATE_STOP)
> +		return NULL;
> +
> +	return ERR_PTR(-EINVAL);
> +}
> diff --git a/drivers/vfio/pci/pds/lm.h b/drivers/vfio/pci/pds/lm.h
> new file mode 100644
> index 000000000000..13be893198b7
> --- /dev/null
> +++ b/drivers/vfio/pci/pds/lm.h
> @@ -0,0 +1,41 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright(c) 2023 Advanced Micro Devices, Inc. */
> +
> +#ifndef _LM_H_
> +#define _LM_H_
> +
> +#include <linux/fs.h>
> +#include <linux/mutex.h>
> +#include <linux/scatterlist.h>
> +#include <linux/types.h>
> +
> +#include <linux/pds/pds_common.h>
> +#include <linux/pds/pds_adminq.h>
> +
> +struct pds_vfio_lm_file {
> +	struct file *filep;
> +	struct mutex lock;	/* protect live migration data file */
> +	u64 size;		/* Size with valid data */
> +	u64 alloc_size;		/* Total allocated size. Always >= len */
> +	void *page_mem;		/* memory allocated for pages */
> +	struct page **pages;	/* Backing pages for file */
> +	unsigned long long npages;
> +	struct sg_table sg_table;	/* SG table for backing pages */
> +	struct pds_lm_sg_elem *sgl;	/* DMA mapping */
> +	dma_addr_t sgl_addr;
> +	u16 num_sge;
> +	struct scatterlist *last_offset_sg;	/* Iterator */
> +	unsigned int sg_last_entry;
> +	unsigned long last_offset;
> +};
> +
> +struct pds_vfio_pci_device;
> +
> +struct file *
> +pds_vfio_step_device_state_locked(struct pds_vfio_pci_device *pds_vfio,
> +				  enum vfio_device_mig_state next);
> +
> +void pds_vfio_put_save_file(struct pds_vfio_pci_device *pds_vfio);
> +void pds_vfio_put_restore_file(struct pds_vfio_pci_device *pds_vfio);
> +
> +#endif /* _LM_H_ */
> diff --git a/drivers/vfio/pci/pds/pci_drv.c b/drivers/vfio/pci/pds/pci_drv.c
> index a49420aa9736..ffd47fa8ede3 100644
> --- a/drivers/vfio/pci/pds/pci_drv.c
> +++ b/drivers/vfio/pci/pds/pci_drv.c
> @@ -73,11 +73,24 @@ pds_vfio_pci_table[] = {
>  };
>  MODULE_DEVICE_TABLE(pci, pds_vfio_pci_table);
> 
> +static void pds_vfio_pci_aer_reset_done(struct pci_dev *pdev)
> +{
> +	struct pds_vfio_pci_device *pds_vfio = pds_vfio_pci_drvdata(pdev);
> +
> +	pds_vfio_reset(pds_vfio);
> +}
> +
> +static const struct pci_error_handlers pds_vfio_pci_err_handlers = {
> +	.reset_done = pds_vfio_pci_aer_reset_done,
> +	.error_detected = vfio_pci_core_aer_err_detected,
> +};
> +
>  static struct pci_driver pds_vfio_pci_driver = {
>  	.name = KBUILD_MODNAME,
>  	.id_table = pds_vfio_pci_table,
>  	.probe = pds_vfio_pci_probe,
>  	.remove = pds_vfio_pci_remove,
> +	.err_handler = &pds_vfio_pci_err_handlers,
>  	.driver_managed_dma = true,
>  };
> 
> diff --git a/drivers/vfio/pci/pds/vfio_dev.c b/drivers/vfio/pci/pds/vfio_dev.c
> index 39771265b78f..2435d8255366 100644
> --- a/drivers/vfio/pci/pds/vfio_dev.c
> +++ b/drivers/vfio/pci/pds/vfio_dev.c
> @@ -4,6 +4,7 @@
>  #include <linux/vfio.h>
>  #include <linux/vfio_pci_core.h>
> 
> +#include "lm.h"
>  #include "vfio_dev.h"
> 
>  struct pci_dev *pds_vfio_to_pci_dev(struct pds_vfio_pci_device *pds_vfio)
> @@ -11,6 +12,11 @@ struct pci_dev *pds_vfio_to_pci_dev(struct pds_vfio_pci_device *pds_vfio)
>  	return pds_vfio->vfio_coredev.pdev;
>  }
> 
> +struct device *pds_vfio_to_dev(struct pds_vfio_pci_device *pds_vfio)
> +{
> +	return &pds_vfio_to_pci_dev(pds_vfio)->dev;
> +}
> +
>  struct pds_vfio_pci_device *pds_vfio_pci_drvdata(struct pci_dev *pdev)
>  {
>  	struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
> @@ -19,6 +25,98 @@ struct pds_vfio_pci_device *pds_vfio_pci_drvdata(struct pci_dev *pdev)
>  			    vfio_coredev);
>  }
> 
> +static void pds_vfio_state_mutex_unlock(struct pds_vfio_pci_device *pds_vfio)
> +{
> +again:
> +	spin_lock(&pds_vfio->reset_lock);
> +	if (pds_vfio->deferred_reset) {
> +		pds_vfio->deferred_reset = false;
> +		if (pds_vfio->state == VFIO_DEVICE_STATE_ERROR) {
> +			pds_vfio->state = VFIO_DEVICE_STATE_RUNNING;
> +			pds_vfio_put_restore_file(pds_vfio);
> +			pds_vfio_put_save_file(pds_vfio);
> +		}
> +		spin_unlock(&pds_vfio->reset_lock);
> +		goto again;
> +	}
> +	mutex_unlock(&pds_vfio->state_mutex);
> +	spin_unlock(&pds_vfio->reset_lock);
> +}
> +
> +void pds_vfio_reset(struct pds_vfio_pci_device *pds_vfio)
> +{
> +	spin_lock(&pds_vfio->reset_lock);
> +	pds_vfio->deferred_reset = true;
> +	if (!mutex_trylock(&pds_vfio->state_mutex)) {
> +		spin_unlock(&pds_vfio->reset_lock);
> +		return;
> +	}
> +	spin_unlock(&pds_vfio->reset_lock);
> +	pds_vfio_state_mutex_unlock(pds_vfio);
> +}
> +
> +static struct file *
> +pds_vfio_set_device_state(struct vfio_device *vdev,
> +			  enum vfio_device_mig_state new_state)
> +{
> +	struct pds_vfio_pci_device *pds_vfio =
> +		container_of(vdev, struct pds_vfio_pci_device,
> +			     vfio_coredev.vdev);
> +	struct file *res = NULL;
> +
> +	mutex_lock(&pds_vfio->state_mutex);
> +	while (new_state != pds_vfio->state) {
> +		enum vfio_device_mig_state next_state;
> +
> +		int err = vfio_mig_get_next_state(vdev, pds_vfio->state,
> +						  new_state, &next_state);
> +		if (err) {
> +			res = ERR_PTR(err);
> +			break;
> +		}
> +
> +		res = pds_vfio_step_device_state_locked(pds_vfio, next_state);
> +		if (IS_ERR(res))
> +			break;
> +
> +		pds_vfio->state = next_state;
> +
> +		if (WARN_ON(res && new_state != pds_vfio->state)) {
> +			res = ERR_PTR(-EINVAL);
> +			break;
> +		}
> +	}
> +	pds_vfio_state_mutex_unlock(pds_vfio);
> +
> +	return res;
> +}
> +
> +static int pds_vfio_get_device_state(struct vfio_device *vdev,
> +				     enum vfio_device_mig_state *current_state)
> +{
> +	struct pds_vfio_pci_device *pds_vfio =
> +		container_of(vdev, struct pds_vfio_pci_device,
> +			     vfio_coredev.vdev);
> +
> +	mutex_lock(&pds_vfio->state_mutex);
> +	*current_state = pds_vfio->state;
> +	pds_vfio_state_mutex_unlock(pds_vfio);
> +	return 0;
> +}
> +
> +static int pds_vfio_get_device_state_size(struct vfio_device *vdev,
> +					  unsigned long *stop_copy_length)
> +{
> +	*stop_copy_length = PDS_LM_DEVICE_STATE_LENGTH;
> +	return 0;
> +}
> +
> +static const struct vfio_migration_ops pds_vfio_lm_ops = {
> +	.migration_set_state = pds_vfio_set_device_state,
> +	.migration_get_state = pds_vfio_get_device_state,
> +	.migration_get_data_size = pds_vfio_get_device_state_size
> +};
> +
>  static int pds_vfio_init_device(struct vfio_device *vdev)
>  {
>  	struct pds_vfio_pci_device *pds_vfio =
> @@ -34,6 +132,9 @@ static int pds_vfio_init_device(struct vfio_device
> *vdev)
>  	pds_vfio->vf_id = pci_iov_vf_id(pdev);
>  	pds_vfio->pci_id = PCI_DEVID(pdev->bus->number, pdev->devfn);
> 
> +	vdev->migration_flags = VFIO_MIGRATION_STOP_COPY |
> VFIO_MIGRATION_P2P;
> +	vdev->mig_ops = &pds_vfio_lm_ops;
> +
>  	dev_dbg(&pdev->dev,
>  		"%s: PF %#04x VF %#04x (%d) vf_id %d domain %d
> pds_vfio %p\n",
>  		__func__, pci_dev_id(pdev->physfn), pds_vfio->pci_id,
> @@ -54,17 +155,34 @@ static int pds_vfio_open_device(struct vfio_device
> *vdev)
>  	if (err)
>  		return err;
> 
> +	mutex_init(&pds_vfio->state_mutex);
> +	pds_vfio->state = VFIO_DEVICE_STATE_RUNNING;
> +
>  	vfio_pci_core_finish_enable(&pds_vfio->vfio_coredev);
> 
>  	return 0;
>  }
> 
> +static void pds_vfio_close_device(struct vfio_device *vdev)
> +{
> +	struct pds_vfio_pci_device *pds_vfio =
> +		container_of(vdev, struct pds_vfio_pci_device,
> +			     vfio_coredev.vdev);
> +
> +	mutex_lock(&pds_vfio->state_mutex);
> +	pds_vfio_put_restore_file(pds_vfio);
> +	pds_vfio_put_save_file(pds_vfio);
> +	mutex_unlock(&pds_vfio->state_mutex);
> +	mutex_destroy(&pds_vfio->state_mutex);
> +	vfio_pci_core_close_device(vdev);
> +}
> +
>  static const struct vfio_device_ops pds_vfio_ops = {
>  	.name = "pds-vfio",
>  	.init = pds_vfio_init_device,
>  	.release = vfio_pci_core_release_dev,
>  	.open_device = pds_vfio_open_device,
> -	.close_device = vfio_pci_core_close_device,
> +	.close_device = pds_vfio_close_device,
>  	.ioctl = vfio_pci_core_ioctl,
>  	.device_feature = vfio_pci_core_ioctl_feature,
>  	.read = vfio_pci_core_read,
> diff --git a/drivers/vfio/pci/pds/vfio_dev.h b/drivers/vfio/pci/pds/vfio_dev.h
> index 92e8ff241ca8..df6208a7140b 100644
> --- a/drivers/vfio/pci/pds/vfio_dev.h
> +++ b/drivers/vfio/pci/pds/vfio_dev.h
> @@ -7,12 +7,21 @@
>  #include <linux/pci.h>
>  #include <linux/vfio_pci_core.h>
> 
> +#include "lm.h"
> +
>  struct pdsc;
> 
>  struct pds_vfio_pci_device {
>  	struct vfio_pci_core_device vfio_coredev;
>  	struct pdsc *pdsc;
> 
> +	struct pds_vfio_lm_file *save_file;
> +	struct pds_vfio_lm_file *restore_file;
> +	struct mutex state_mutex; /* protect migration state */
> +	enum vfio_device_mig_state state;
> +	spinlock_t reset_lock; /* protect reset_done flow */
> +	u8 deferred_reset;
> +
>  	int vf_id;
>  	int pci_id;
>  	u16 client_id;
> @@ -20,7 +29,9 @@ struct pds_vfio_pci_device {
> 
>  const struct vfio_device_ops *pds_vfio_ops_info(void);
>  struct pds_vfio_pci_device *pds_vfio_pci_drvdata(struct pci_dev *pdev);
> +void pds_vfio_reset(struct pds_vfio_pci_device *pds_vfio);
> 
>  struct pci_dev *pds_vfio_to_pci_dev(struct pds_vfio_pci_device *pds_vfio);
> +struct device *pds_vfio_to_dev(struct pds_vfio_pci_device *pds_vfio);
> 
>  #endif /* _VFIO_DEV_H_ */
> diff --git a/include/linux/pds/pds_adminq.h
> b/include/linux/pds/pds_adminq.h
> index 98a60ce87b92..db6de081f15f 100644
> --- a/include/linux/pds/pds_adminq.h
> +++ b/include/linux/pds/pds_adminq.h
> @@ -584,6 +584,213 @@ struct pds_core_q_init_comp {
>  	u8     color;
>  };
> 
> +#define PDS_LM_DEVICE_STATE_LENGTH		65536
> +#define PDS_LM_CHECK_DEVICE_STATE_LENGTH(X) \
> +			PDS_CORE_SIZE_CHECK(union,
> PDS_LM_DEVICE_STATE_LENGTH, X)
> +
> +/*
> + * enum pds_lm_cmd_opcode - Live Migration Device commands
> + */
> +enum pds_lm_cmd_opcode {
> +	PDS_LM_CMD_HOST_VF_STATUS  = 1,
> +
> +	/* Device state commands */
> +	PDS_LM_CMD_STATUS          = 16,
> +	PDS_LM_CMD_SUSPEND         = 18,
> +	PDS_LM_CMD_SUSPEND_STATUS  = 19,
> +	PDS_LM_CMD_RESUME          = 20,
> +	PDS_LM_CMD_SAVE            = 21,
> +	PDS_LM_CMD_RESTORE         = 22,
> +};
> +
> +/**
> + * struct pds_lm_cmd - generic command
> + * @opcode:	Opcode
> + * @rsvd:	Word boundary padding
> + * @vf_id:	VF id
> + * @rsvd2:	Structure padding to 60 Bytes
> + */
> +struct pds_lm_cmd {
> +	u8     opcode;
> +	u8     rsvd;
> +	__le16 vf_id;
> +	u8     rsvd2[56];
> +};
> +
> +/**
> + * struct pds_lm_comp - generic command completion
> + * @status:	Status of the command (enum pds_core_status_code)
> + * @rsvd:	Structure padding to 16 Bytes
> + */
> +struct pds_lm_comp {
> +	u8 status;
> +	u8 rsvd[15];
> +};
> +
> +/**
> + * struct pds_lm_status_cmd - STATUS command
> + * @opcode:	Opcode
> + * @rsvd:	Word boundary padding
> + * @vf_id:	VF id
> + */
> +struct pds_lm_status_cmd {
> +	u8     opcode;
> +	u8     rsvd;
> +	__le16 vf_id;
> +};
> +
> +/**
> + * struct pds_lm_status_comp - STATUS command completion
> + * @status:		Status of the command (enum pds_core_status_code)
> + * @rsvd:		Word boundary padding
> + * @comp_index:		Index in the desc ring for which this is the
> completion
> + * @size:		Size of the device state
> + * @rsvd2:		Word boundary padding
> + * @color:		Color bit
> + */
> +struct pds_lm_status_comp {
> +	u8     status;
> +	u8     rsvd;
> +	__le16 comp_index;
> +	union {
> +		__le64 size;
> +		u8     rsvd2[11];
> +	} __packed;
> +	u8     color;
> +};
> +
> +/**
> + * struct pds_lm_suspend_cmd - SUSPEND command
> + * @opcode:	Opcode PDS_LM_CMD_SUSPEND
> + * @rsvd:	Word boundary padding
> + * @vf_id:	VF id
> + */
> +struct pds_lm_suspend_cmd {
> +	u8     opcode;
> +	u8     rsvd;
> +	__le16 vf_id;
> +};
> +
> +/**
> + * struct pds_lm_suspend_comp - SUSPEND command completion
> + * @status:		Status of the command (enum pds_core_status_code)
> + * @rsvd:		Word boundary padding
> + * @comp_index:		Index in the desc ring for which this is the
> completion
> + * @state_size:		Size of the device state computed post suspend
> + * @rsvd2:		Word boundary padding
> + * @color:		Color bit
> + */
> +struct pds_lm_suspend_comp {
> +	u8     status;
> +	u8     rsvd;
> +	__le16 comp_index;
> +	union {
> +		__le64 state_size;
> +		u8     rsvd2[11];
> +	} __packed;
> +	u8     color;
> +};
> +
> +/**
> + * struct pds_lm_suspend_status_cmd - SUSPEND status command
> + * @opcode:	Opcode PDS_AQ_CMD_LM_SUSPEND_STATUS
> + * @rsvd:	Word boundary padding
> + * @vf_id:	VF id
> + */
> +struct pds_lm_suspend_status_cmd {
> +	u8 opcode;
> +	u8 rsvd;
> +	__le16 vf_id;
> +};
> +
> +/**
> + * struct pds_lm_resume_cmd - RESUME command
> + * @opcode:	Opcode PDS_LM_CMD_RESUME
> + * @rsvd:	Word boundary padding
> + * @vf_id:	VF id
> + */
> +struct pds_lm_resume_cmd {
> +	u8     opcode;
> +	u8     rsvd;
> +	__le16 vf_id;
> +};
> +
> +/**
> + * struct pds_lm_sg_elem - Transmit scatter-gather (SG) descriptor element
> + * @addr:	DMA address of SG element data buffer
> + * @len:	Length of SG element data buffer, in bytes
> + * @rsvd:	Word boundary padding
> + */
> +struct pds_lm_sg_elem {
> +	__le64 addr;
> +	__le32 len;
> +	__le16 rsvd[2];
> +};
> +
> +/**
> + * struct pds_lm_save_cmd - SAVE command
> + * @opcode:	Opcode PDS_LM_CMD_SAVE
> + * @rsvd:	Word boundary padding
> + * @vf_id:	VF id
> + * @rsvd2:	Word boundary padding
> + * @sgl_addr:	IOVA address of the SGL to dma the device state
> + * @num_sge:	Total number of SG elements
> + */
> +struct pds_lm_save_cmd {
> +	u8     opcode;
> +	u8     rsvd;
> +	__le16 vf_id;
> +	u8     rsvd2[4];
> +	__le64 sgl_addr;
> +	__le32 num_sge;
> +} __packed;
> +
> +/**
> + * struct pds_lm_restore_cmd - RESTORE command
> + * @opcode:	Opcode PDS_LM_CMD_RESTORE
> + * @rsvd:	Word boundary padding
> + * @vf_id:	VF id
> + * @rsvd2:	Word boundary padding
> + * @sgl_addr:	IOVA address of the SGL to dma the device state
> + * @num_sge:	Total number of SG elements
> + */
> +struct pds_lm_restore_cmd {
> +	u8     opcode;
> +	u8     rsvd;
> +	__le16 vf_id;
> +	u8     rsvd2[4];
> +	__le64 sgl_addr;
> +	__le32 num_sge;
> +} __packed;
> +
> +/**
> + * union pds_lm_dev_state - device state information
> + * @words:	Device state words
> + */
> +union pds_lm_dev_state {
> +	__le32 words[PDS_LM_DEVICE_STATE_LENGTH / sizeof(__le32)];
> +};
> +
> +enum pds_lm_host_vf_status {
> +	PDS_LM_STA_NONE = 0,
> +	PDS_LM_STA_IN_PROGRESS,
> +	PDS_LM_STA_MAX,
> +};
> +
> +/**
> + * struct pds_lm_host_vf_status_cmd - HOST_VF_STATUS command
> + * @opcode:	Opcode PDS_LM_CMD_HOST_VF_STATUS
> + * @rsvd:	Word boundary padding
> + * @vf_id:	VF id
> + * @status:	Current LM status of host VF driver (enum
> pds_lm_host_status)
> + */
> +struct pds_lm_host_vf_status_cmd {
> +	u8     opcode;
> +	u8     rsvd;
> +	__le16 vf_id;
> +	u8     status;
> +};
> +
>  union pds_core_adminq_cmd {
>  	u8     opcode;
>  	u8     bytes[64];
> @@ -600,6 +807,14 @@ union pds_core_adminq_cmd {
> 
>  	struct pds_core_q_identify_cmd    q_ident;
>  	struct pds_core_q_init_cmd        q_init;
> +
> +	struct pds_lm_suspend_cmd		lm_suspend;
> +	struct pds_lm_suspend_status_cmd	lm_suspend_status;
> +	struct pds_lm_resume_cmd		lm_resume;
> +	struct pds_lm_status_cmd		lm_status;
> +	struct pds_lm_save_cmd			lm_save;
> +	struct pds_lm_restore_cmd		lm_restore;
> +	struct pds_lm_host_vf_status_cmd	lm_host_vf_status;
>  };
> 
>  union pds_core_adminq_comp {
> @@ -621,6 +836,8 @@ union pds_core_adminq_comp {
> 
>  	struct pds_core_q_identify_comp   q_ident;
>  	struct pds_core_q_init_comp       q_init;
> +
> +	struct pds_lm_status_comp		lm_status;
>  };
> 
>  #ifndef __CHECKER__
> --
> 2.17.1


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v10 vfio 3/7] vfio/pds: register with the pds_core PF
  2023-06-15 21:05   ` Shameerali Kolothum Thodi
@ 2023-06-15 21:30     ` Brett Creeley
  0 siblings, 0 replies; 40+ messages in thread
From: Brett Creeley @ 2023-06-15 21:30 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi, Brett Creeley, kvm, netdev,
	alex.williamson, jgg, yishaih, kevin.tian
  Cc: shannon.nelson

On 6/15/2023 2:05 PM, Shameerali Kolothum Thodi wrote:
>> -----Original Message-----
>> From: Brett Creeley [mailto:brett.creeley@amd.com]
>> Sent: 02 June 2023 23:03
>> To: kvm@vger.kernel.org; netdev@vger.kernel.org;
>> alex.williamson@redhat.com; jgg@nvidia.com; yishaih@nvidia.com;
>> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
>> kevin.tian@intel.com
>> Cc: brett.creeley@amd.com; shannon.nelson@amd.com
>> Subject: [PATCH v10 vfio 3/7] vfio/pds: register with the pds_core PF
>>
>> The pds_core driver will supply adminq services, so find the PF
>> and register with the DSC services.
>>
>> Use the following commands to enable a VF:
>> echo 1 > /sys/bus/pci/drivers/pds_core/$PF_BDF/sriov_numvfs
>>
>> Signed-off-by: Brett Creeley <brett.creeley@amd.com>
>> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
>> ---
>>   drivers/vfio/pci/pds/Makefile   |  1 +
>>   drivers/vfio/pci/pds/cmds.c     | 43
>> +++++++++++++++++++++++++++++++++
>>   drivers/vfio/pci/pds/cmds.h     | 10 ++++++++
>>   drivers/vfio/pci/pds/pci_drv.c  | 19 +++++++++++++++
>>   drivers/vfio/pci/pds/pci_drv.h  |  9 +++++++
>>   drivers/vfio/pci/pds/vfio_dev.c | 11 +++++++++
>>   drivers/vfio/pci/pds/vfio_dev.h |  6 +++++
>>   include/linux/pds/pds_common.h  |  2 ++
>>   8 files changed, 101 insertions(+)
>>   create mode 100644 drivers/vfio/pci/pds/cmds.c
>>   create mode 100644 drivers/vfio/pci/pds/cmds.h
>>   create mode 100644 drivers/vfio/pci/pds/pci_drv.h
>>
>> diff --git a/drivers/vfio/pci/pds/Makefile b/drivers/vfio/pci/pds/Makefile
>> index e1a55ae0f079..87581111fa17 100644
>> --- a/drivers/vfio/pci/pds/Makefile
>> +++ b/drivers/vfio/pci/pds/Makefile
>> @@ -4,5 +4,6 @@
>>   obj-$(CONFIG_PDS_VFIO_PCI) += pds_vfio.o
>>
>>   pds_vfio-y := \
>> +     cmds.o          \
>>        pci_drv.o       \
>>        vfio_dev.o
>> diff --git a/drivers/vfio/pci/pds/cmds.c b/drivers/vfio/pci/pds/cmds.c
>> new file mode 100644
>> index 000000000000..ae01f5df2f5c
>> --- /dev/null
>> +++ b/drivers/vfio/pci/pds/cmds.c
>> @@ -0,0 +1,43 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright(c) 2023 Advanced Micro Devices, Inc. */
>> +
>> +#include <linux/io.h>
>> +#include <linux/types.h>
>> +
>> +#include <linux/pds/pds_common.h>
>> +#include <linux/pds/pds_core_if.h>
>> +#include <linux/pds/pds_adminq.h>
>> +
>> +#include "vfio_dev.h"
>> +#include "cmds.h"
>> +
>> +int pds_vfio_register_client_cmd(struct pds_vfio_pci_device *pds_vfio)
>> +{
>> +     struct pci_dev *pdev = pds_vfio_to_pci_dev(pds_vfio);
>> +     char devname[PDS_DEVNAME_LEN];
>> +     int ci;
>> +
>> +     snprintf(devname, sizeof(devname), "%s.%d-%u", PDS_LM_DEV_NAME,
>> +              pci_domain_nr(pdev->bus), pds_vfio->pci_id);
>> +
>> +     ci = pds_client_register(pci_physfn(pdev), devname);
>> +     if (ci <= 0)
>> +             return ci;
> 
> So 0 is not a valid id I guess but we return 0 here. But below where
> pds_vfio_register_client_cmd() is called, 0 return is treated as success.
> 
> Note: Also in drivers..../auxbus.c the comment says the function returns 0
> on success!.
> 
> Please check.
> 
> Thanks,
> Shameer

Hey Shameer,

Thanks for catching these issues. It looks like there are a couple of
things that need to be fixed.

[1] pds_vfio_register_client_cmd() needs to always return negative on 
error, which includes ci == 0. I don't think we would ever hit this case 
because drivers..../auxbus.c returns -EIO when ci == 0, but best to fix 
it in case that ever changes.

[2] Documentation for pds_client_register in drivers..../auxbus.c needs 
to be updated to say something like the following:

Return: Client ID on success, or negative for error

I will fix [1] in the next rev of this series. For [2] we will submit a
separate follow-on patch to clean up the wording.
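
As a rough illustration of [1], here is a minimal userspace sketch of the
intended return convention. It is not the actual fix: fake_client_register()
is a hypothetical stand-in for pds_client_register(), and returning -EIO for
ci == 0 is an assumption chosen to match what auxbus.c is described as doing
above.

```c
#include <errno.h>

/*
 * Hypothetical stand-in for pds_client_register(): a positive value is a
 * client ID, a negative value is an errno. Zero is not expected, but the
 * wrapper below handles it defensively.
 */
static int fake_client_register(int simulated_result)
{
	return simulated_result;
}

/*
 * Sketch of the corrected wrapper: returns 0 on success and a negative
 * errno on every failure, including an unexpected client ID of 0.
 */
static int register_client(int simulated_result, unsigned short *client_id)
{
	int ci = fake_client_register(simulated_result);

	if (ci < 0)
		return ci;
	if (ci == 0)
		return -EIO;	/* assumption: 0 is never a valid client ID */

	*client_id = (unsigned short)ci;
	return 0;
}
```

With this shape the caller's "if (err)" check in probe no longer treats
ci == 0 as success.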

Thanks for the review,

Brett
>> +
>> +     pds_vfio->client_id = ci;
>> +
>> +     return 0;
>> +}
>> +
>> +void pds_vfio_unregister_client_cmd(struct pds_vfio_pci_device *pds_vfio)
>> +{
>> +     struct pci_dev *pdev = pds_vfio_to_pci_dev(pds_vfio);
>> +     int err;
>> +
>> +     err = pds_client_unregister(pci_physfn(pdev), pds_vfio->client_id);
>> +     if (err)
>> +             dev_err(&pdev->dev, "unregister from DSC failed: %pe\n",
>> +                     ERR_PTR(err));
>> +
>> +     pds_vfio->client_id = 0;
>> +}
>> diff --git a/drivers/vfio/pci/pds/cmds.h b/drivers/vfio/pci/pds/cmds.h
>> new file mode 100644
>> index 000000000000..4c592afccf89
>> --- /dev/null
>> +++ b/drivers/vfio/pci/pds/cmds.h
>> @@ -0,0 +1,10 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/* Copyright(c) 2023 Advanced Micro Devices, Inc. */
>> +
>> +#ifndef _CMDS_H_
>> +#define _CMDS_H_
>> +
>> +int pds_vfio_register_client_cmd(struct pds_vfio_pci_device *pds_vfio);
>> +void pds_vfio_unregister_client_cmd(struct pds_vfio_pci_device *pds_vfio);
>> +
>> +#endif /* _CMDS_H_ */
>> diff --git a/drivers/vfio/pci/pds/pci_drv.c b/drivers/vfio/pci/pds/pci_drv.c
>> index 0e84249069d4..a49420aa9736 100644
>> --- a/drivers/vfio/pci/pds/pci_drv.c
>> +++ b/drivers/vfio/pci/pds/pci_drv.c
>> @@ -8,9 +8,13 @@
>>   #include <linux/types.h>
>>   #include <linux/vfio.h>
>>
>> +#include <linux/pds/pds_common.h>
>>   #include <linux/pds/pds_core_if.h>
>> +#include <linux/pds/pds_adminq.h>
>>
>>   #include "vfio_dev.h"
>> +#include "pci_drv.h"
>> +#include "cmds.h"
>>
>>   #define PDS_VFIO_DRV_DESCRIPTION     "AMD/Pensando VFIO Device
>> Driver"
>>   #define PCI_VENDOR_ID_PENSANDO               0x1dd8
>> @@ -27,13 +31,27 @@ static int pds_vfio_pci_probe(struct pci_dev *pdev,
>>                return PTR_ERR(pds_vfio);
>>
>>        dev_set_drvdata(&pdev->dev, &pds_vfio->vfio_coredev);
>> +     pds_vfio->pdsc = pdsc_get_pf_struct(pdev);
>> +     if (IS_ERR_OR_NULL(pds_vfio->pdsc)) {
>> +             err = PTR_ERR(pds_vfio->pdsc) ?: -ENODEV;
>> +             goto out_put_vdev;
>> +     }
>>
>>        err = vfio_pci_core_register_device(&pds_vfio->vfio_coredev);
>>        if (err)
>>                goto out_put_vdev;
>>
>> +     err = pds_vfio_register_client_cmd(pds_vfio);
>> +     if (err) {
>> +             dev_err(&pdev->dev, "failed to register as client: %pe\n",
>> +                     ERR_PTR(err));
>> +             goto out_unregister_coredev;
>> +     }
>> +
>>        return 0;
>>
>> +out_unregister_coredev:
>> +     vfio_pci_core_unregister_device(&pds_vfio->vfio_coredev);
>>   out_put_vdev:
>>        vfio_put_device(&pds_vfio->vfio_coredev.vdev);
>>        return err;
>> @@ -43,6 +61,7 @@ static void pds_vfio_pci_remove(struct pci_dev *pdev)
>>   {
>>        struct pds_vfio_pci_device *pds_vfio = pds_vfio_pci_drvdata(pdev);
>>
>> +     pds_vfio_unregister_client_cmd(pds_vfio);
>>        vfio_pci_core_unregister_device(&pds_vfio->vfio_coredev);
>>        vfio_put_device(&pds_vfio->vfio_coredev.vdev);
>>   }
>> diff --git a/drivers/vfio/pci/pds/pci_drv.h b/drivers/vfio/pci/pds/pci_drv.h
>> new file mode 100644
>> index 000000000000..e79bed12ed14
>> --- /dev/null
>> +++ b/drivers/vfio/pci/pds/pci_drv.h
>> @@ -0,0 +1,9 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/* Copyright(c) 2023 Advanced Micro Devices, Inc. */
>> +
>> +#ifndef _PCI_DRV_H
>> +#define _PCI_DRV_H
>> +
>> +#include <linux/pci.h>
>> +
>> +#endif /* _PCI_DRV_H */
>> diff --git a/drivers/vfio/pci/pds/vfio_dev.c b/drivers/vfio/pci/pds/vfio_dev.c
>> index 4038dac90a97..39771265b78f 100644
>> --- a/drivers/vfio/pci/pds/vfio_dev.c
>> +++ b/drivers/vfio/pci/pds/vfio_dev.c
>> @@ -6,6 +6,11 @@
>>
>>   #include "vfio_dev.h"
>>
>> +struct pci_dev *pds_vfio_to_pci_dev(struct pds_vfio_pci_device *pds_vfio)
>> +{
>> +     return pds_vfio->vfio_coredev.pdev;
>> +}
>> +
>>   struct pds_vfio_pci_device *pds_vfio_pci_drvdata(struct pci_dev *pdev)
>>   {
>>        struct vfio_pci_core_device *core_device =
>> dev_get_drvdata(&pdev->dev);
>> @@ -29,6 +34,12 @@ static int pds_vfio_init_device(struct vfio_device
>> *vdev)
>>        pds_vfio->vf_id = pci_iov_vf_id(pdev);
>>        pds_vfio->pci_id = PCI_DEVID(pdev->bus->number, pdev->devfn);
>>
>> +     dev_dbg(&pdev->dev,
>> +             "%s: PF %#04x VF %#04x (%d) vf_id %d domain %d
>> pds_vfio %p\n",
>> +             __func__, pci_dev_id(pdev->physfn), pds_vfio->pci_id,
>> +             pds_vfio->pci_id, pds_vfio->vf_id, pci_domain_nr(pdev->bus),
>> +             pds_vfio);
>> +
>>        return 0;
>>   }
>>
>> diff --git a/drivers/vfio/pci/pds/vfio_dev.h b/drivers/vfio/pci/pds/vfio_dev.h
>> index 66cfcab5b5bf..92e8ff241ca8 100644
>> --- a/drivers/vfio/pci/pds/vfio_dev.h
>> +++ b/drivers/vfio/pci/pds/vfio_dev.h
>> @@ -7,14 +7,20 @@
>>   #include <linux/pci.h>
>>   #include <linux/vfio_pci_core.h>
>>
>> +struct pdsc;
>> +
>>   struct pds_vfio_pci_device {
>>        struct vfio_pci_core_device vfio_coredev;
>> +     struct pdsc *pdsc;
>>
>>        int vf_id;
>>        int pci_id;
>> +     u16 client_id;
>>   };
>>
>>   const struct vfio_device_ops *pds_vfio_ops_info(void);
>>   struct pds_vfio_pci_device *pds_vfio_pci_drvdata(struct pci_dev *pdev);
>>
>> +struct pci_dev *pds_vfio_to_pci_dev(struct pds_vfio_pci_device *pds_vfio);
>> +
>>   #endif /* _VFIO_DEV_H_ */
>> diff --git a/include/linux/pds/pds_common.h
>> b/include/linux/pds/pds_common.h
>> index 060331486d50..721453bdf975 100644
>> --- a/include/linux/pds/pds_common.h
>> +++ b/include/linux/pds/pds_common.h
>> @@ -39,6 +39,8 @@ enum pds_core_vif_types {
>>   #define PDS_DEV_TYPE_RDMA_STR        "RDMA"
>>   #define PDS_DEV_TYPE_LM_STR  "LM"
>>
>> +#define PDS_LM_DEV_NAME              PDS_CORE_DRV_NAME "."
>> PDS_DEV_TYPE_LM_STR
>> +
>>   #define PDS_CORE_IFNAMSIZ            16
>>
>>   /**
>> --
>> 2.17.1
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v10 vfio 4/7] vfio/pds: Add VFIO live migration support
  2023-06-15 21:07   ` Shameerali Kolothum Thodi
@ 2023-06-15 21:36     ` Brett Creeley
  0 siblings, 0 replies; 40+ messages in thread
From: Brett Creeley @ 2023-06-15 21:36 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi, Brett Creeley, kvm, netdev,
	alex.williamson, jgg, yishaih, kevin.tian
  Cc: shannon.nelson

On 6/15/2023 2:07 PM, Shameerali Kolothum Thodi wrote:
>> -----Original Message-----
>> From: Brett Creeley [mailto:brett.creeley@amd.com]
>> Sent: 02 June 2023 23:03
>> To: kvm@vger.kernel.org; netdev@vger.kernel.org;
>> alex.williamson@redhat.com; jgg@nvidia.com; yishaih@nvidia.com;
>> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
>> kevin.tian@intel.com
>> Cc: brett.creeley@amd.com; shannon.nelson@amd.com
>> Subject: [PATCH v10 vfio 4/7] vfio/pds: Add VFIO live migration support
>>
>> Add live migration support via the VFIO subsystem. The migration
>> implementation aligns with the definition from uapi/vfio.h and uses
>> the pds_core PF's adminq for device configuration.
>>
>> The ability to suspend, resume, and transfer VF device state data is
>> included along with the required admin queue command structures and
>> implementations.
>>
>> PDS_LM_CMD_SUSPEND and PDS_LM_CMD_SUSPEND_STATUS are added to
>> support
>> the VF device suspend operation.
>>
>> PDS_LM_CMD_RESUME is added to support the VF device resume operation.
>>
>> PDS_LM_CMD_STATUS is added to determine the exact size of the VF
>> device state data.
>>
>> PDS_LM_CMD_SAVE is added to get the VF device state data.
>>
>> PDS_LM_CMD_RESTORE is added to restore the VF device with the
>> previously saved data from PDS_LM_CMD_SAVE.
>>
>> PDS_LM_CMD_HOST_VF_STATUS is added to notify the device when
>> a migration is in/not-in progress from the host's perspective.
>>
>> Signed-off-by: Brett Creeley <brett.creeley@amd.com>
>> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
>> ---
>>   drivers/vfio/pci/pds/Makefile   |   1 +
>>   drivers/vfio/pci/pds/cmds.c     | 319 ++++++++++++++++++++++++
>>   drivers/vfio/pci/pds/cmds.h     |   8 +-
>>   drivers/vfio/pci/pds/lm.c       | 421
>> ++++++++++++++++++++++++++++++++
>>   drivers/vfio/pci/pds/lm.h       |  41 ++++
>>   drivers/vfio/pci/pds/pci_drv.c  |  13 +
>>   drivers/vfio/pci/pds/vfio_dev.c | 120 ++++++++-
>>   drivers/vfio/pci/pds/vfio_dev.h |  11 +
>>   include/linux/pds/pds_adminq.h  | 217 ++++++++++++++++
>>   9 files changed, 1149 insertions(+), 2 deletions(-)
>>   create mode 100644 drivers/vfio/pci/pds/lm.c
>>   create mode 100644 drivers/vfio/pci/pds/lm.h
>>
>> diff --git a/drivers/vfio/pci/pds/Makefile b/drivers/vfio/pci/pds/Makefile
>> index 87581111fa17..dbaf613d3794 100644
>> --- a/drivers/vfio/pci/pds/Makefile
>> +++ b/drivers/vfio/pci/pds/Makefile
>> @@ -5,5 +5,6 @@ obj-$(CONFIG_PDS_VFIO_PCI) += pds_vfio.o
>>
>>   pds_vfio-y := \
>>        cmds.o          \
>> +     lm.o            \
>>        pci_drv.o       \
>>        vfio_dev.o
>> diff --git a/drivers/vfio/pci/pds/cmds.c b/drivers/vfio/pci/pds/cmds.c
>> index ae01f5df2f5c..256f458feb58 100644
>> --- a/drivers/vfio/pci/pds/cmds.c
>> +++ b/drivers/vfio/pci/pds/cmds.c
>> @@ -3,6 +3,7 @@
>>
>>   #include <linux/io.h>
>>   #include <linux/types.h>
>> +#include <linux/delay.h>
>>
>>   #include <linux/pds/pds_common.h>
>>   #include <linux/pds/pds_core_if.h>
>> @@ -11,6 +12,34 @@
>>   #include "vfio_dev.h"
>>   #include "cmds.h"
>>
>> +#define SUSPEND_TIMEOUT_S            5
>> +#define SUSPEND_CHECK_INTERVAL_MS    1
>> +
>> +static int pds_vfio_client_adminq_cmd(struct pds_vfio_pci_device
>> *pds_vfio,
>> +                                   union pds_core_adminq_cmd *req,
>> +                                   size_t req_len,
>> +                                   union pds_core_adminq_comp *resp,
>> +                                   u64 flags)
> 
> Why u64? Do we expect more flags to follow? The core interface below
> only takes a bool(fast_poll) though.
> 
> Thanks,
> Shameer

Shameer,

Another good catch. This was left over from the original set of patches,
and using flags is definitely unnecessary in
pds_vfio_client_adminq_cmd(). If we ever need more flags I can update
it then. I will change this to a bool in the next revision.
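
To make the intended shape concrete, here is a minimal userspace sketch of
the wrapper taking a bool instead of u64 flags. It is only an illustration
under stated assumptions: struct demo_cmd, demo_adminq_post() and
demo_client_adminq_cmd() are hypothetical stand-ins for the adminq command
union, pdsc_adminq_post() and pds_vfio_client_adminq_cmd(), and the real
next revision may differ.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the 64-byte adminq command union. */
struct demo_cmd {
	unsigned char bytes[64];
};

/* Records the last fast_poll value seen, so the behaviour can be checked. */
static bool last_fast_poll;

/* Hypothetical stand-in for the core post function, which takes a bool. */
static int demo_adminq_post(const struct demo_cmd *cmd, bool fast_poll)
{
	(void)cmd;
	last_fast_poll = fast_poll;
	return 0;
}

/*
 * Sketch of the wrapper: fast_poll is passed straight through, so no
 * flag masking or "!!" conversion is needed.
 */
static int demo_client_adminq_cmd(const void *req, size_t req_len,
				  bool fast_poll)
{
	struct demo_cmd cmd = {0};
	size_t cp_len = req_len < sizeof(cmd.bytes) ? req_len : sizeof(cmd.bytes);

	/* Copy at most one command's worth of the client request. */
	memcpy(cmd.bytes, req, cp_len);

	return demo_adminq_post(&cmd, fast_poll);
}
```

Callers that pass PDS_AQ_FLAG_FASTPOLL today would pass true, and callers
that pass 0 would pass false.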

Thanks for the review,

Brett

>> +{
>> +     union pds_core_adminq_cmd cmd = {};
>> +     size_t cp_len;
>> +     int err;
>> +
>> +     /* Wrap the client request */
>> +     cmd.client_request.opcode = PDS_AQ_CMD_CLIENT_CMD;
>> +     cmd.client_request.client_id = cpu_to_le16(pds_vfio->client_id);
>> +     cp_len = min_t(size_t, req_len, sizeof(cmd.client_request.client_cmd));
>> +     memcpy(cmd.client_request.client_cmd, req, cp_len);
>> +
>> +     err = pdsc_adminq_post(pds_vfio->pdsc, &cmd, resp,
>> +                            !!(flags & PDS_AQ_FLAG_FASTPOLL));
>> +     if (err && err != -EAGAIN)
>> +             dev_info(pds_vfio_to_dev(pds_vfio),
>> +                      "client admin cmd failed: %pe\n", ERR_PTR(err));
>> +
>> +     return err;
>> +}
>> +
>>   int pds_vfio_register_client_cmd(struct pds_vfio_pci_device *pds_vfio)
>>   {
>>        struct pci_dev *pdev = pds_vfio_to_pci_dev(pds_vfio);
>> @@ -41,3 +70,293 @@ void pds_vfio_unregister_client_cmd(struct
>> pds_vfio_pci_device *pds_vfio)
>>
>>        pds_vfio->client_id = 0;
>>   }
>> +
>> +static int
>> +pds_vfio_suspend_wait_device_cmd(struct pds_vfio_pci_device *pds_vfio)
>> +{
>> +     union pds_core_adminq_cmd cmd = {
>> +             .lm_suspend_status = {
>> +                     .opcode = PDS_LM_CMD_SUSPEND_STATUS,
>> +                     .vf_id = cpu_to_le16(pds_vfio->vf_id),
>> +             },
>> +     };
>> +     struct device *dev = pds_vfio_to_dev(pds_vfio);
>> +     union pds_core_adminq_comp comp = {};
>> +     unsigned long time_limit;
>> +     unsigned long time_start;
>> +     unsigned long time_done;
>> +     int err;
>> +
>> +     time_start = jiffies;
>> +     time_limit = time_start + HZ * SUSPEND_TIMEOUT_S;
>> +     do {
>> +             err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd),
>> +                                              &comp, PDS_AQ_FLAG_FASTPOLL);
>> +             if (err != -EAGAIN)
>> +                     break;
>> +
>> +             msleep(SUSPEND_CHECK_INTERVAL_MS);
>> +     } while (time_before(jiffies, time_limit));
>> +
>> +     time_done = jiffies;
>> +     dev_dbg(dev, "%s: vf%u: Suspend comp received in %d msecs\n",
>> __func__,
>> +             pds_vfio->vf_id, jiffies_to_msecs(time_done - time_start));
>> +
>> +     /* Check the results */
>> +     if (time_after_eq(time_done, time_limit)) {
>> +             dev_err(dev, "%s: vf%u: Suspend comp timeout\n", __func__,
>> +                     pds_vfio->vf_id);
>> +             err = -ETIMEDOUT;
>> +     }
>> +
>> +     return err;
>> +}
>> +
>> +int pds_vfio_suspend_device_cmd(struct pds_vfio_pci_device *pds_vfio)
>> +{
>> +     union pds_core_adminq_cmd cmd = {
>> +             .lm_suspend = {
>> +                     .opcode = PDS_LM_CMD_SUSPEND,
>> +                     .vf_id = cpu_to_le16(pds_vfio->vf_id),
>> +             },
>> +     };
>> +     struct device *dev = pds_vfio_to_dev(pds_vfio);
>> +     union pds_core_adminq_comp comp = {};
>> +     int err;
>> +
>> +     dev_dbg(dev, "vf%u: Suspend device\n", pds_vfio->vf_id);
>> +
>> +     err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd),
>> &comp,
>> +                                      PDS_AQ_FLAG_FASTPOLL);
>> +     if (err) {
>> +             dev_err(dev, "vf%u: Suspend failed: %pe\n", pds_vfio->vf_id,
>> +                     ERR_PTR(err));
>> +             return err;
>> +     }
>> +
>> +     return pds_vfio_suspend_wait_device_cmd(pds_vfio);
>> +}
>> +
>> +int pds_vfio_resume_device_cmd(struct pds_vfio_pci_device *pds_vfio)
>> +{
>> +     union pds_core_adminq_cmd cmd = {
>> +             .lm_resume = {
>> +                     .opcode = PDS_LM_CMD_RESUME,
>> +                     .vf_id = cpu_to_le16(pds_vfio->vf_id),
>> +             },
>> +     };
>> +     struct device *dev = pds_vfio_to_dev(pds_vfio);
>> +     union pds_core_adminq_comp comp = {};
>> +
>> +     dev_dbg(dev, "vf%u: Resume device\n", pds_vfio->vf_id);
>> +
>> +     return pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd),
>> &comp,
>> +                                       0);
>> +}
>> +
>> +int pds_vfio_get_lm_status_cmd(struct pds_vfio_pci_device *pds_vfio, u64
>> *size)
>> +{
>> +     union pds_core_adminq_cmd cmd = {
>> +             .lm_status = {
>> +                     .opcode = PDS_LM_CMD_STATUS,
>> +                     .vf_id = cpu_to_le16(pds_vfio->vf_id),
>> +             },
>> +     };
>> +     struct device *dev = pds_vfio_to_dev(pds_vfio);
>> +     union pds_core_adminq_comp comp = {};
>> +     int err;
>> +
>> +     dev_dbg(dev, "vf%u: Get migration status\n", pds_vfio->vf_id);
>> +
>> +     err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp,
>> 0);
>> +     if (err)
>> +             return err;
>> +
>> +     *size = le64_to_cpu(comp.lm_status.size);
>> +     return 0;
>> +}
>> +
>> +static int pds_vfio_dma_map_lm_file(struct device *dev,
>> +                                 enum dma_data_direction dir,
>> +                                 struct pds_vfio_lm_file *lm_file)
>> +{
>> +     struct pds_lm_sg_elem *sgl, *sge;
>> +     struct scatterlist *sg;
>> +     dma_addr_t sgl_addr;
>> +     size_t sgl_size;
>> +     int err;
>> +     int i;
>> +
>> +     if (!lm_file)
>> +             return -EINVAL;
>> +
>> +     /* dma map file pages */
>> +     err = dma_map_sgtable(dev, &lm_file->sg_table, dir, 0);
>> +     if (err)
>> +             return err;
>> +
>> +     lm_file->num_sge = lm_file->sg_table.nents;
>> +
>> +     /* alloc sgl */
>> +     sgl_size = lm_file->num_sge * sizeof(struct pds_lm_sg_elem);
>> +     sgl = kzalloc(sgl_size, GFP_KERNEL);
>> +     if (!sgl) {
>> +             err = -ENOMEM;
>> +             goto out_unmap_sgtable;
>> +     }
>> +
>> +     /* fill sgl */
>> +     sge = sgl;
>> +     for_each_sgtable_dma_sg(&lm_file->sg_table, sg, i) {
>> +             sge->addr = cpu_to_le64(sg_dma_address(sg));
>> +             sge->len = cpu_to_le32(sg_dma_len(sg));
>> +             dev_dbg(dev, "addr = %llx, len = %u\n", sge->addr, sge->len);
>> +             sge++;
>> +     }
>> +
>> +     sgl_addr = dma_map_single(dev, sgl, sgl_size, DMA_TO_DEVICE);
>> +     if (dma_mapping_error(dev, sgl_addr)) {
>> +             err = -EIO;
>> +             goto out_free_sgl;
>> +     }
>> +
>> +     lm_file->sgl = sgl;
>> +     lm_file->sgl_addr = sgl_addr;
>> +
>> +     return 0;
>> +
>> +out_free_sgl:
>> +     kfree(sgl);
>> +out_unmap_sgtable:
>> +     lm_file->num_sge = 0;
>> +     dma_unmap_sgtable(dev, &lm_file->sg_table, dir, 0);
>> +     return err;
>> +}
>> +
>> +static void pds_vfio_dma_unmap_lm_file(struct device *dev,
>> +                                    enum dma_data_direction dir,
>> +                                    struct pds_vfio_lm_file *lm_file)
>> +{
>> +     if (!lm_file)
>> +             return;
>> +
>> +     /* free sgl */
>> +     if (lm_file->sgl) {
>> +             dma_unmap_single(dev, lm_file->sgl_addr,
>> +                              lm_file->num_sge * sizeof(*lm_file->sgl),
>> +                              DMA_TO_DEVICE);
>> +             kfree(lm_file->sgl);
>> +             lm_file->sgl = NULL;
>> +             lm_file->sgl_addr = DMA_MAPPING_ERROR;
>> +             lm_file->num_sge = 0;
>> +     }
>> +
>> +     /* dma unmap file pages */
>> +     dma_unmap_sgtable(dev, &lm_file->sg_table, dir, 0);
>> +}
>> +
>> +int pds_vfio_get_lm_state_cmd(struct pds_vfio_pci_device *pds_vfio)
>> +{
>> +     union pds_core_adminq_cmd cmd = {
>> +             .lm_save = {
>> +                     .opcode = PDS_LM_CMD_SAVE,
>> +                     .vf_id = cpu_to_le16(pds_vfio->vf_id),
>> +             },
>> +     };
>> +     struct pci_dev *pdev = pds_vfio_to_pci_dev(pds_vfio);
>> +     struct device *pdsc_dev = &pci_physfn(pdev)->dev;
>> +     union pds_core_adminq_comp comp = {};
>> +     struct pds_vfio_lm_file *lm_file;
>> +     int err;
>> +
>> +     dev_dbg(&pdev->dev, "vf%u: Get migration state\n", pds_vfio->vf_id);
>> +
>> +     lm_file = pds_vfio->save_file;
>> +
>> +     err = pds_vfio_dma_map_lm_file(pdsc_dev, DMA_FROM_DEVICE, lm_file);
>> +     if (err) {
>> +             dev_err(&pdev->dev, "failed to map save migration file: %pe\n",
>> +                     ERR_PTR(err));
>> +             return err;
>> +     }
>> +
>> +     cmd.lm_save.sgl_addr = cpu_to_le64(lm_file->sgl_addr);
>> +     cmd.lm_save.num_sge = cpu_to_le32(lm_file->num_sge);
>> +
>> +     err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp, 0);
>> +     if (err)
>> +             dev_err(&pdev->dev, "failed to get migration state: %pe\n",
>> +                     ERR_PTR(err));
>> +
>> +     pds_vfio_dma_unmap_lm_file(pdsc_dev, DMA_FROM_DEVICE, lm_file);
>> +
>> +     return err;
>> +}
>> +
>> +int pds_vfio_set_lm_state_cmd(struct pds_vfio_pci_device *pds_vfio)
>> +{
>> +     union pds_core_adminq_cmd cmd = {
>> +             .lm_restore = {
>> +                     .opcode = PDS_LM_CMD_RESTORE,
>> +                     .vf_id = cpu_to_le16(pds_vfio->vf_id),
>> +             },
>> +     };
>> +     struct pci_dev *pdev = pds_vfio_to_pci_dev(pds_vfio);
>> +     struct device *pdsc_dev = &pci_physfn(pdev)->dev;
>> +     union pds_core_adminq_comp comp = {};
>> +     struct pds_vfio_lm_file *lm_file;
>> +     int err;
>> +
>> +     dev_dbg(&pdev->dev, "vf%u: Set migration state\n", pds_vfio->vf_id);
>> +
>> +     lm_file = pds_vfio->restore_file;
>> +
>> +     err = pds_vfio_dma_map_lm_file(pdsc_dev, DMA_TO_DEVICE, lm_file);
>> +     if (err) {
>> +             dev_err(&pdev->dev,
>> +                     "failed to map restore migration file: %pe\n",
>> +                     ERR_PTR(err));
>> +             return err;
>> +     }
>> +
>> +     cmd.lm_restore.sgl_addr = cpu_to_le64(lm_file->sgl_addr);
>> +     cmd.lm_restore.num_sge = cpu_to_le32(lm_file->num_sge);
>> +
>> +     err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp, 0);
>> +     if (err)
>> +             dev_err(&pdev->dev, "failed to set migration state: %pe\n",
>> +                     ERR_PTR(err));
>> +
>> +     pds_vfio_dma_unmap_lm_file(pdsc_dev, DMA_TO_DEVICE, lm_file);
>> +
>> +     return err;
>> +}
>> +
>> +void pds_vfio_send_host_vf_lm_status_cmd(struct pds_vfio_pci_device *pds_vfio,
>> +                                      enum pds_lm_host_vf_status vf_status)
>> +{
>> +     union pds_core_adminq_cmd cmd = {
>> +             .lm_host_vf_status = {
>> +                     .opcode = PDS_LM_CMD_HOST_VF_STATUS,
>> +                     .vf_id = cpu_to_le16(pds_vfio->vf_id),
>> +                     .status = vf_status,
>> +             },
>> +     };
>> +     struct device *dev = pds_vfio_to_dev(pds_vfio);
>> +     union pds_core_adminq_comp comp = {};
>> +     int err;
>> +
>> +     dev_dbg(dev, "vf%u: Set host VF LM status: %u\n", pds_vfio->vf_id,
>> +             vf_status);
>> +     if (vf_status != PDS_LM_STA_IN_PROGRESS &&
>> +         vf_status != PDS_LM_STA_NONE) {
>> +             dev_warn(dev, "Invalid host VF migration status, %d\n",
>> +                      vf_status);
>> +             return;
>> +     }
>> +
>> +     err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp, 0);
>> +     if (err)
>> +             dev_warn(dev, "failed to send host VF migration status: %pe\n",
>> +                      ERR_PTR(err));
>> +}
>> diff --git a/drivers/vfio/pci/pds/cmds.h b/drivers/vfio/pci/pds/cmds.h
>> index 4c592afccf89..3d8a5508c733 100644
>> --- a/drivers/vfio/pci/pds/cmds.h
>> +++ b/drivers/vfio/pci/pds/cmds.h
>> @@ -6,5 +6,11 @@
>>
>>   int pds_vfio_register_client_cmd(struct pds_vfio_pci_device *pds_vfio);
>>   void pds_vfio_unregister_client_cmd(struct pds_vfio_pci_device *pds_vfio);
>> -
>> +int pds_vfio_suspend_device_cmd(struct pds_vfio_pci_device *pds_vfio);
>> +int pds_vfio_resume_device_cmd(struct pds_vfio_pci_device *pds_vfio);
>> +int pds_vfio_get_lm_status_cmd(struct pds_vfio_pci_device *pds_vfio, u64 *size);
>> +int pds_vfio_get_lm_state_cmd(struct pds_vfio_pci_device *pds_vfio);
>> +int pds_vfio_set_lm_state_cmd(struct pds_vfio_pci_device *pds_vfio);
>> +void pds_vfio_send_host_vf_lm_status_cmd(struct pds_vfio_pci_device *pds_vfio,
>> +                                      enum pds_lm_host_vf_status vf_status);
>>   #endif /* _CMDS_H_ */
>> diff --git a/drivers/vfio/pci/pds/lm.c b/drivers/vfio/pci/pds/lm.c
>> new file mode 100644
>> index 000000000000..c507f39a2339
>> --- /dev/null
>> +++ b/drivers/vfio/pci/pds/lm.c
>> @@ -0,0 +1,421 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright(c) 2023 Advanced Micro Devices, Inc. */
>> +
>> +#include <linux/anon_inodes.h>
>> +#include <linux/file.h>
>> +#include <linux/fs.h>
>> +#include <linux/highmem.h>
>> +#include <linux/vfio.h>
>> +#include <linux/vfio_pci_core.h>
>> +
>> +#include "vfio_dev.h"
>> +#include "cmds.h"
>> +
>> +static struct pds_vfio_lm_file *
>> +pds_vfio_get_lm_file(const struct file_operations *fops, int flags, u64 size)
>> +{
>> +     struct pds_vfio_lm_file *lm_file = NULL;
>> +     unsigned long long npages;
>> +     struct page **pages;
>> +     void *page_mem;
>> +     const void *p;
>> +
>> +     if (!size)
>> +             return NULL;
>> +
>> +     /* Alloc file structure */
>> +     lm_file = kzalloc(sizeof(*lm_file), GFP_KERNEL);
>> +     if (!lm_file)
>> +             return NULL;
>> +
>> +     /* Create file */
>> +     lm_file->filep =
>> +             anon_inode_getfile("pds_vfio_lm", fops, lm_file, flags);
>> +     if (IS_ERR(lm_file->filep))
>> +             goto out_free_file;
>> +
>> +     stream_open(lm_file->filep->f_inode, lm_file->filep);
>> +     mutex_init(&lm_file->lock);
>> +
>> +     /* prevent file from being released before we are done with it */
>> +     get_file(lm_file->filep);
>> +
>> +     /* Allocate memory for file pages */
>> +     npages = DIV_ROUND_UP_ULL(size, PAGE_SIZE);
>> +     pages = kmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
>> +     if (!pages)
>> +             goto out_put_file;
>> +
>> +     page_mem = kvzalloc(ALIGN(size, PAGE_SIZE), GFP_KERNEL);
>> +     if (!page_mem)
>> +             goto out_free_pages_array;
>> +
>> +     p = page_mem - offset_in_page(page_mem);
>> +     for (unsigned long long i = 0; i < npages; i++) {
>> +             if (is_vmalloc_addr(p))
>> +                     pages[i] = vmalloc_to_page(p);
>> +             else
>> +                     pages[i] = kmap_to_page((void *)p);
>> +             if (!pages[i])
>> +                     goto out_free_page_mem;
>> +
>> +             p += PAGE_SIZE;
>> +     }
>> +
>> +     /* Create scatterlist of file pages to use for DMA mapping later */
>> +     if (sg_alloc_table_from_pages(&lm_file->sg_table, pages, npages, 0,
>> +                                   size, GFP_KERNEL))
>> +             goto out_free_page_mem;
>> +
>> +     lm_file->size = size;
>> +     lm_file->pages = pages;
>> +     lm_file->npages = npages;
>> +     lm_file->page_mem = page_mem;
>> +     lm_file->alloc_size = npages * PAGE_SIZE;
>> +
>> +     return lm_file;
>> +
>> +out_free_page_mem:
>> +     kvfree(page_mem);
>> +out_free_pages_array:
>> +     kfree(pages);
>> +out_put_file:
>> +     fput(lm_file->filep);
>> +     mutex_destroy(&lm_file->lock);
>> +out_free_file:
>> +     kfree(lm_file);
>> +
>> +     return NULL;
>> +}
>> +
>> +static void pds_vfio_put_lm_file(struct pds_vfio_lm_file *lm_file)
>> +{
>> +     mutex_lock(&lm_file->lock);
>> +
>> +     lm_file->size = 0;
>> +     lm_file->alloc_size = 0;
>> +
>> +     /* Free scatter list of file pages */
>> +     sg_free_table(&lm_file->sg_table);
>> +
>> +     kvfree(lm_file->page_mem);
>> +     lm_file->page_mem = NULL;
>> +     kfree(lm_file->pages);
>> +     lm_file->pages = NULL;
>> +
>> +     mutex_unlock(&lm_file->lock);
>> +
>> +     /* allow file to be released since we are done with it */
>> +     fput(lm_file->filep);
>> +}
>> +
>> +void pds_vfio_put_save_file(struct pds_vfio_pci_device *pds_vfio)
>> +{
>> +     if (!pds_vfio->save_file)
>> +             return;
>> +
>> +     pds_vfio_put_lm_file(pds_vfio->save_file);
>> +     pds_vfio->save_file = NULL;
>> +}
>> +
>> +void pds_vfio_put_restore_file(struct pds_vfio_pci_device *pds_vfio)
>> +{
>> +     if (!pds_vfio->restore_file)
>> +             return;
>> +
>> +     pds_vfio_put_lm_file(pds_vfio->restore_file);
>> +     pds_vfio->restore_file = NULL;
>> +}
>> +
>> +static struct page *pds_vfio_get_file_page(struct pds_vfio_lm_file *lm_file,
>> +                                        unsigned long offset)
>> +{
>> +     unsigned long cur_offset = 0;
>> +     struct scatterlist *sg;
>> +     unsigned int i;
>> +
>> +     /* All accesses are sequential */
>> +     if (offset < lm_file->last_offset || !lm_file->last_offset_sg) {
>> +             lm_file->last_offset = 0;
>> +             lm_file->last_offset_sg = lm_file->sg_table.sgl;
>> +             lm_file->sg_last_entry = 0;
>> +     }
>> +
>> +     cur_offset = lm_file->last_offset;
>> +
>> +     for_each_sg(lm_file->last_offset_sg, sg,
>> +                 lm_file->sg_table.orig_nents - lm_file->sg_last_entry, i) {
>> +             if (offset < sg->length + cur_offset) {
>> +                     lm_file->last_offset_sg = sg;
>> +                     lm_file->sg_last_entry += i;
>> +                     lm_file->last_offset = cur_offset;
>> +                     return nth_page(sg_page(sg),
>> +                                     (offset - cur_offset) / PAGE_SIZE);
>> +             }
>> +             cur_offset += sg->length;
>> +     }
>> +
>> +     return NULL;
>> +}
>> +
>> +static int pds_vfio_release_file(struct inode *inode, struct file *filp)
>> +{
>> +     struct pds_vfio_lm_file *lm_file = filp->private_data;
>> +
>> +     mutex_lock(&lm_file->lock);
>> +     lm_file->filep->f_pos = 0;
>> +     lm_file->size = 0;
>> +     mutex_unlock(&lm_file->lock);
>> +     mutex_destroy(&lm_file->lock);
>> +     kfree(lm_file);
>> +
>> +     return 0;
>> +}
>> +
>> +static ssize_t pds_vfio_save_read(struct file *filp, char __user *buf,
>> +                               size_t len, loff_t *pos)
>> +{
>> +     struct pds_vfio_lm_file *lm_file = filp->private_data;
>> +     ssize_t done = 0;
>> +
>> +     if (pos)
>> +             return -ESPIPE;
>> +     pos = &filp->f_pos;
>> +
>> +     mutex_lock(&lm_file->lock);
>> +     if (*pos > lm_file->size) {
>> +             done = -EINVAL;
>> +             goto out_unlock;
>> +     }
>> +
>> +     len = min_t(size_t, lm_file->size - *pos, len);
>> +     while (len) {
>> +             size_t page_offset;
>> +             struct page *page;
>> +             size_t page_len;
>> +             u8 *from_buff;
>> +             int err;
>> +
>> +             page_offset = (*pos) % PAGE_SIZE;
>> +             page = pds_vfio_get_file_page(lm_file, *pos - page_offset);
>> +             if (!page) {
>> +                     if (done == 0)
>> +                             done = -EINVAL;
>> +                     goto out_unlock;
>> +             }
>> +
>> +             page_len = min_t(size_t, len, PAGE_SIZE - page_offset);
>> +             from_buff = kmap_local_page(page);
>> +             err = copy_to_user(buf, from_buff + page_offset, page_len);
>> +             kunmap_local(from_buff);
>> +             if (err) {
>> +                     done = -EFAULT;
>> +                     goto out_unlock;
>> +             }
>> +             *pos += page_len;
>> +             len -= page_len;
>> +             done += page_len;
>> +             buf += page_len;
>> +     }
>> +
>> +out_unlock:
>> +     mutex_unlock(&lm_file->lock);
>> +     return done;
>> +}
>> +
>> +static const struct file_operations pds_vfio_save_fops = {
>> +     .owner = THIS_MODULE,
>> +     .read = pds_vfio_save_read,
>> +     .release = pds_vfio_release_file,
>> +     .llseek = no_llseek,
>> +};
>> +
>> +static int pds_vfio_get_save_file(struct pds_vfio_pci_device *pds_vfio)
>> +{
>> +     struct device *dev = &pds_vfio->vfio_coredev.pdev->dev;
>> +     struct pds_vfio_lm_file *lm_file;
>> +     int err;
>> +     u64 size;
>> +
>> +     /* Get live migration state size in this state */
>> +     err = pds_vfio_get_lm_status_cmd(pds_vfio, &size);
>> +     if (err) {
>> +             dev_err(dev, "failed to get save status: %pe\n", ERR_PTR(err));
>> +             return err;
>> +     }
>> +
>> +     dev_dbg(dev, "save status, size = %lld\n", size);
>> +
>> +     if (!size) {
>> +             dev_err(dev, "invalid state size\n");
>> +             return -EIO;
>> +     }
>> +
>> +     lm_file = pds_vfio_get_lm_file(&pds_vfio_save_fops, O_RDONLY, size);
>> +     if (!lm_file) {
>> +             dev_err(dev, "failed to create save file\n");
>> +             return -ENOENT;
>> +     }
>> +
>> +     dev_dbg(dev, "size = %lld, alloc_size = %lld, npages = %lld\n",
>> +             lm_file->size, lm_file->alloc_size, lm_file->npages);
>> +
>> +     pds_vfio->save_file = lm_file;
>> +
>> +     return 0;
>> +}
>> +
>> +static ssize_t pds_vfio_restore_write(struct file *filp, const char __user *buf,
>> +                                   size_t len, loff_t *pos)
>> +{
>> +     struct pds_vfio_lm_file *lm_file = filp->private_data;
>> +     loff_t requested_length;
>> +     ssize_t done = 0;
>> +
>> +     if (pos)
>> +             return -ESPIPE;
>> +
>> +     pos = &filp->f_pos;
>> +
>> +     if (*pos < 0 ||
>> +         check_add_overflow((loff_t)len, *pos, &requested_length))
>> +             return -EINVAL;
>> +
>> +     mutex_lock(&lm_file->lock);
>> +
>> +     while (len) {
>> +             size_t page_offset;
>> +             struct page *page;
>> +             size_t page_len;
>> +             u8 *to_buff;
>> +             int err;
>> +
>> +             page_offset = (*pos) % PAGE_SIZE;
>> +             page = pds_vfio_get_file_page(lm_file, *pos - page_offset);
>> +             if (!page) {
>> +                     if (done == 0)
>> +                             done = -EINVAL;
>> +                     goto out_unlock;
>> +             }
>> +
>> +             page_len = min_t(size_t, len, PAGE_SIZE - page_offset);
>> +             to_buff = kmap_local_page(page);
>> +             err = copy_from_user(to_buff + page_offset, buf, page_len);
>> +             kunmap_local(to_buff);
>> +             if (err) {
>> +                     done = -EFAULT;
>> +                     goto out_unlock;
>> +             }
>> +             *pos += page_len;
>> +             len -= page_len;
>> +             done += page_len;
>> +             buf += page_len;
>> +             lm_file->size += page_len;
>> +     }
>> +out_unlock:
>> +     mutex_unlock(&lm_file->lock);
>> +     return done;
>> +}
>> +
>> +static const struct file_operations pds_vfio_restore_fops = {
>> +     .owner = THIS_MODULE,
>> +     .write = pds_vfio_restore_write,
>> +     .release = pds_vfio_release_file,
>> +     .llseek = no_llseek,
>> +};
>> +
>> +static int pds_vfio_get_restore_file(struct pds_vfio_pci_device *pds_vfio)
>> +{
>> +     struct device *dev = &pds_vfio->vfio_coredev.pdev->dev;
>> +     struct pds_vfio_lm_file *lm_file;
>> +     u64 size;
>> +
>> +     size = sizeof(union pds_lm_dev_state);
>> +     dev_dbg(dev, "restore status, size = %lld\n", size);
>> +
>> +     if (!size) {
>> +             dev_err(dev, "invalid state size\n");
>> +             return -EIO;
>> +     }
>> +
>> +     lm_file = pds_vfio_get_lm_file(&pds_vfio_restore_fops, O_WRONLY, size);
>> +     if (!lm_file) {
>> +             dev_err(dev, "failed to create restore file\n");
>> +             return -ENOENT;
>> +     }
>> +     pds_vfio->restore_file = lm_file;
>> +
>> +     return 0;
>> +}
>> +
>> +struct file *
>> +pds_vfio_step_device_state_locked(struct pds_vfio_pci_device *pds_vfio,
>> +                               enum vfio_device_mig_state next)
>> +{
>> +     enum vfio_device_mig_state cur = pds_vfio->state;
>> +     int err;
>> +
>> +     if (cur == VFIO_DEVICE_STATE_STOP && next == VFIO_DEVICE_STATE_STOP_COPY) {
>> +             err = pds_vfio_get_save_file(pds_vfio);
>> +             if (err)
>> +                     return ERR_PTR(err);
>> +
>> +             err = pds_vfio_get_lm_state_cmd(pds_vfio);
>> +             if (err) {
>> +                     pds_vfio_put_save_file(pds_vfio);
>> +                     return ERR_PTR(err);
>> +             }
>> +
>> +             return pds_vfio->save_file->filep;
>> +     }
>> +
>> +     if (cur == VFIO_DEVICE_STATE_STOP_COPY && next == VFIO_DEVICE_STATE_STOP) {
>> +             pds_vfio_put_save_file(pds_vfio);
>> +             pds_vfio_send_host_vf_lm_status_cmd(pds_vfio, PDS_LM_STA_NONE);
>> +             return NULL;
>> +     }
>> +
>> +     if (cur == VFIO_DEVICE_STATE_STOP && next == VFIO_DEVICE_STATE_RESUMING) {
>> +             err = pds_vfio_get_restore_file(pds_vfio);
>> +             if (err)
>> +                     return ERR_PTR(err);
>> +
>> +             return pds_vfio->restore_file->filep;
>> +     }
>> +
>> +     if (cur == VFIO_DEVICE_STATE_RESUMING && next == VFIO_DEVICE_STATE_STOP) {
>> +             err = pds_vfio_set_lm_state_cmd(pds_vfio);
>> +             if (err)
>> +                     return ERR_PTR(err);
>> +
>> +             pds_vfio_put_restore_file(pds_vfio);
>> +             return NULL;
>> +     }
>> +
>> +     if (cur == VFIO_DEVICE_STATE_RUNNING && next == VFIO_DEVICE_STATE_RUNNING_P2P) {
>> +             pds_vfio_send_host_vf_lm_status_cmd(pds_vfio,
>> +                                                 PDS_LM_STA_IN_PROGRESS);
>> +             err = pds_vfio_suspend_device_cmd(pds_vfio);
>> +             if (err)
>> +                     return ERR_PTR(err);
>> +
>> +             return NULL;
>> +     }
>> +
>> +     if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && next == VFIO_DEVICE_STATE_RUNNING) {
>> +             err = pds_vfio_resume_device_cmd(pds_vfio);
>> +             if (err)
>> +                     return ERR_PTR(err);
>> +
>> +             pds_vfio_send_host_vf_lm_status_cmd(pds_vfio, PDS_LM_STA_NONE);
>> +             return NULL;
>> +     }
>> +
>> +     if (cur == VFIO_DEVICE_STATE_STOP && next == VFIO_DEVICE_STATE_RUNNING_P2P)
>> +             return NULL;
>> +
>> +     if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && next == VFIO_DEVICE_STATE_STOP)
>> +             return NULL;
>> +
>> +     return ERR_PTR(-EINVAL);
>> +}
>> diff --git a/drivers/vfio/pci/pds/lm.h b/drivers/vfio/pci/pds/lm.h
>> new file mode 100644
>> index 000000000000..13be893198b7
>> --- /dev/null
>> +++ b/drivers/vfio/pci/pds/lm.h
>> @@ -0,0 +1,41 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/* Copyright(c) 2023 Advanced Micro Devices, Inc. */
>> +
>> +#ifndef _LM_H_
>> +#define _LM_H_
>> +
>> +#include <linux/fs.h>
>> +#include <linux/mutex.h>
>> +#include <linux/scatterlist.h>
>> +#include <linux/types.h>
>> +
>> +#include <linux/pds/pds_common.h>
>> +#include <linux/pds/pds_adminq.h>
>> +
>> +struct pds_vfio_lm_file {
>> +     struct file *filep;
>> +     struct mutex lock;      /* protect live migration data file */
>> +     u64 size;               /* Size with valid data */
>> +     u64 alloc_size;         /* Total allocated size. Always >= size */
>> +     void *page_mem;         /* memory allocated for pages */
>> +     struct page **pages;    /* Backing pages for file */
>> +     unsigned long long npages;
>> +     struct sg_table sg_table;       /* SG table for backing pages */
>> +     struct pds_lm_sg_elem *sgl;     /* DMA mapping */
>> +     dma_addr_t sgl_addr;
>> +     u16 num_sge;
>> +     struct scatterlist *last_offset_sg;     /* Iterator */
>> +     unsigned int sg_last_entry;
>> +     unsigned long last_offset;
>> +};
>> +
>> +struct pds_vfio_pci_device;
>> +
>> +struct file *
>> +pds_vfio_step_device_state_locked(struct pds_vfio_pci_device *pds_vfio,
>> +                               enum vfio_device_mig_state next);
>> +
>> +void pds_vfio_put_save_file(struct pds_vfio_pci_device *pds_vfio);
>> +void pds_vfio_put_restore_file(struct pds_vfio_pci_device *pds_vfio);
>> +
>> +#endif /* _LM_H_ */
>> diff --git a/drivers/vfio/pci/pds/pci_drv.c b/drivers/vfio/pci/pds/pci_drv.c
>> index a49420aa9736..ffd47fa8ede3 100644
>> --- a/drivers/vfio/pci/pds/pci_drv.c
>> +++ b/drivers/vfio/pci/pds/pci_drv.c
>> @@ -73,11 +73,24 @@ pds_vfio_pci_table[] = {
>>   };
>>   MODULE_DEVICE_TABLE(pci, pds_vfio_pci_table);
>>
>> +static void pds_vfio_pci_aer_reset_done(struct pci_dev *pdev)
>> +{
>> +     struct pds_vfio_pci_device *pds_vfio = pds_vfio_pci_drvdata(pdev);
>> +
>> +     pds_vfio_reset(pds_vfio);
>> +}
>> +
>> +static const struct pci_error_handlers pds_vfio_pci_err_handlers = {
>> +     .reset_done = pds_vfio_pci_aer_reset_done,
>> +     .error_detected = vfio_pci_core_aer_err_detected,
>> +};
>> +
>>   static struct pci_driver pds_vfio_pci_driver = {
>>        .name = KBUILD_MODNAME,
>>        .id_table = pds_vfio_pci_table,
>>        .probe = pds_vfio_pci_probe,
>>        .remove = pds_vfio_pci_remove,
>> +     .err_handler = &pds_vfio_pci_err_handlers,
>>        .driver_managed_dma = true,
>>   };
>>
>> diff --git a/drivers/vfio/pci/pds/vfio_dev.c b/drivers/vfio/pci/pds/vfio_dev.c
>> index 39771265b78f..2435d8255366 100644
>> --- a/drivers/vfio/pci/pds/vfio_dev.c
>> +++ b/drivers/vfio/pci/pds/vfio_dev.c
>> @@ -4,6 +4,7 @@
>>   #include <linux/vfio.h>
>>   #include <linux/vfio_pci_core.h>
>>
>> +#include "lm.h"
>>   #include "vfio_dev.h"
>>
>>   struct pci_dev *pds_vfio_to_pci_dev(struct pds_vfio_pci_device *pds_vfio)
>> @@ -11,6 +12,11 @@ struct pci_dev *pds_vfio_to_pci_dev(struct pds_vfio_pci_device *pds_vfio)
>>        return pds_vfio->vfio_coredev.pdev;
>>   }
>>
>> +struct device *pds_vfio_to_dev(struct pds_vfio_pci_device *pds_vfio)
>> +{
>> +     return &pds_vfio_to_pci_dev(pds_vfio)->dev;
>> +}
>> +
>>   struct pds_vfio_pci_device *pds_vfio_pci_drvdata(struct pci_dev *pdev)
>>   {
>>        struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
>> @@ -19,6 +25,98 @@ struct pds_vfio_pci_device *pds_vfio_pci_drvdata(struct pci_dev *pdev)
>>                            vfio_coredev);
>>   }
>>
>> +static void pds_vfio_state_mutex_unlock(struct pds_vfio_pci_device *pds_vfio)
>> +{
>> +again:
>> +     spin_lock(&pds_vfio->reset_lock);
>> +     if (pds_vfio->deferred_reset) {
>> +             pds_vfio->deferred_reset = false;
>> +             if (pds_vfio->state == VFIO_DEVICE_STATE_ERROR) {
>> +                     pds_vfio->state = VFIO_DEVICE_STATE_RUNNING;
>> +                     pds_vfio_put_restore_file(pds_vfio);
>> +                     pds_vfio_put_save_file(pds_vfio);
>> +             }
>> +             spin_unlock(&pds_vfio->reset_lock);
>> +             goto again;
>> +     }
>> +     mutex_unlock(&pds_vfio->state_mutex);
>> +     spin_unlock(&pds_vfio->reset_lock);
>> +}
>> +
>> +void pds_vfio_reset(struct pds_vfio_pci_device *pds_vfio)
>> +{
>> +     spin_lock(&pds_vfio->reset_lock);
>> +     pds_vfio->deferred_reset = true;
>> +     if (!mutex_trylock(&pds_vfio->state_mutex)) {
>> +             spin_unlock(&pds_vfio->reset_lock);
>> +             return;
>> +     }
>> +     spin_unlock(&pds_vfio->reset_lock);
>> +     pds_vfio_state_mutex_unlock(pds_vfio);
>> +}
>> +
>> +static struct file *
>> +pds_vfio_set_device_state(struct vfio_device *vdev,
>> +                       enum vfio_device_mig_state new_state)
>> +{
>> +     struct pds_vfio_pci_device *pds_vfio =
>> +             container_of(vdev, struct pds_vfio_pci_device,
>> +                          vfio_coredev.vdev);
>> +     struct file *res = NULL;
>> +
>> +     mutex_lock(&pds_vfio->state_mutex);
>> +     while (new_state != pds_vfio->state) {
>> +             enum vfio_device_mig_state next_state;
>> +
>> +             int err = vfio_mig_get_next_state(vdev, pds_vfio->state,
>> +                                               new_state, &next_state);
>> +             if (err) {
>> +                     res = ERR_PTR(err);
>> +                     break;
>> +             }
>> +
>> +             res = pds_vfio_step_device_state_locked(pds_vfio, next_state);
>> +             if (IS_ERR(res))
>> +                     break;
>> +
>> +             pds_vfio->state = next_state;
>> +
>> +             if (WARN_ON(res && new_state != pds_vfio->state)) {
>> +                     res = ERR_PTR(-EINVAL);
>> +                     break;
>> +             }
>> +     }
>> +     pds_vfio_state_mutex_unlock(pds_vfio);
>> +
>> +     return res;
>> +}
>> +
>> +static int pds_vfio_get_device_state(struct vfio_device *vdev,
>> +                                  enum vfio_device_mig_state *current_state)
>> +{
>> +     struct pds_vfio_pci_device *pds_vfio =
>> +             container_of(vdev, struct pds_vfio_pci_device,
>> +                          vfio_coredev.vdev);
>> +
>> +     mutex_lock(&pds_vfio->state_mutex);
>> +     *current_state = pds_vfio->state;
>> +     pds_vfio_state_mutex_unlock(pds_vfio);
>> +     return 0;
>> +}
>> +
>> +static int pds_vfio_get_device_state_size(struct vfio_device *vdev,
>> +                                       unsigned long *stop_copy_length)
>> +{
>> +     *stop_copy_length = PDS_LM_DEVICE_STATE_LENGTH;
>> +     return 0;
>> +}
>> +
>> +static const struct vfio_migration_ops pds_vfio_lm_ops = {
>> +     .migration_set_state = pds_vfio_set_device_state,
>> +     .migration_get_state = pds_vfio_get_device_state,
>> +     .migration_get_data_size = pds_vfio_get_device_state_size
>> +};
>> +
>>   static int pds_vfio_init_device(struct vfio_device *vdev)
>>   {
>>        struct pds_vfio_pci_device *pds_vfio =
>> @@ -34,6 +132,9 @@ static int pds_vfio_init_device(struct vfio_device *vdev)
>>        pds_vfio->vf_id = pci_iov_vf_id(pdev);
>>        pds_vfio->pci_id = PCI_DEVID(pdev->bus->number, pdev->devfn);
>>
>> +     vdev->migration_flags = VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P;
>> +     vdev->mig_ops = &pds_vfio_lm_ops;
>> +
>>        dev_dbg(&pdev->dev,
>>                "%s: PF %#04x VF %#04x (%d) vf_id %d domain %d pds_vfio %p\n",
>>                __func__, pci_dev_id(pdev->physfn), pds_vfio->pci_id,
>> @@ -54,17 +155,34 @@ static int pds_vfio_open_device(struct vfio_device *vdev)
>>        if (err)
>>                return err;
>>
>> +     mutex_init(&pds_vfio->state_mutex);
>> +     pds_vfio->state = VFIO_DEVICE_STATE_RUNNING;
>> +
>>        vfio_pci_core_finish_enable(&pds_vfio->vfio_coredev);
>>
>>        return 0;
>>   }
>>
>> +static void pds_vfio_close_device(struct vfio_device *vdev)
>> +{
>> +     struct pds_vfio_pci_device *pds_vfio =
>> +             container_of(vdev, struct pds_vfio_pci_device,
>> +                          vfio_coredev.vdev);
>> +
>> +     mutex_lock(&pds_vfio->state_mutex);
>> +     pds_vfio_put_restore_file(pds_vfio);
>> +     pds_vfio_put_save_file(pds_vfio);
>> +     mutex_unlock(&pds_vfio->state_mutex);
>> +     mutex_destroy(&pds_vfio->state_mutex);
>> +     vfio_pci_core_close_device(vdev);
>> +}
>> +
>>   static const struct vfio_device_ops pds_vfio_ops = {
>>        .name = "pds-vfio",
>>        .init = pds_vfio_init_device,
>>        .release = vfio_pci_core_release_dev,
>>        .open_device = pds_vfio_open_device,
>> -     .close_device = vfio_pci_core_close_device,
>> +     .close_device = pds_vfio_close_device,
>>        .ioctl = vfio_pci_core_ioctl,
>>        .device_feature = vfio_pci_core_ioctl_feature,
>>        .read = vfio_pci_core_read,
>> diff --git a/drivers/vfio/pci/pds/vfio_dev.h b/drivers/vfio/pci/pds/vfio_dev.h
>> index 92e8ff241ca8..df6208a7140b 100644
>> --- a/drivers/vfio/pci/pds/vfio_dev.h
>> +++ b/drivers/vfio/pci/pds/vfio_dev.h
>> @@ -7,12 +7,21 @@
>>   #include <linux/pci.h>
>>   #include <linux/vfio_pci_core.h>
>>
>> +#include "lm.h"
>> +
>>   struct pdsc;
>>
>>   struct pds_vfio_pci_device {
>>        struct vfio_pci_core_device vfio_coredev;
>>        struct pdsc *pdsc;
>>
>> +     struct pds_vfio_lm_file *save_file;
>> +     struct pds_vfio_lm_file *restore_file;
>> +     struct mutex state_mutex; /* protect migration state */
>> +     enum vfio_device_mig_state state;
>> +     spinlock_t reset_lock; /* protect reset_done flow */
>> +     u8 deferred_reset;
>> +
>>        int vf_id;
>>        int pci_id;
>>        u16 client_id;
>> @@ -20,7 +29,9 @@ struct pds_vfio_pci_device {
>>
>>   const struct vfio_device_ops *pds_vfio_ops_info(void);
>>   struct pds_vfio_pci_device *pds_vfio_pci_drvdata(struct pci_dev *pdev);
>> +void pds_vfio_reset(struct pds_vfio_pci_device *pds_vfio);
>>
>>   struct pci_dev *pds_vfio_to_pci_dev(struct pds_vfio_pci_device *pds_vfio);
>> +struct device *pds_vfio_to_dev(struct pds_vfio_pci_device *pds_vfio);
>>
>>   #endif /* _VFIO_DEV_H_ */
>> diff --git a/include/linux/pds/pds_adminq.h b/include/linux/pds/pds_adminq.h
>> index 98a60ce87b92..db6de081f15f 100644
>> --- a/include/linux/pds/pds_adminq.h
>> +++ b/include/linux/pds/pds_adminq.h
>> @@ -584,6 +584,213 @@ struct pds_core_q_init_comp {
>>        u8     color;
>>   };
>>
>> +#define PDS_LM_DEVICE_STATE_LENGTH           65536
>> +#define PDS_LM_CHECK_DEVICE_STATE_LENGTH(X) \
>> +                     PDS_CORE_SIZE_CHECK(union, PDS_LM_DEVICE_STATE_LENGTH, X)
>> +
>> +/*
>> + * enum pds_lm_cmd_opcode - Live Migration Device commands
>> + */
>> +enum pds_lm_cmd_opcode {
>> +     PDS_LM_CMD_HOST_VF_STATUS  = 1,
>> +
>> +     /* Device state commands */
>> +     PDS_LM_CMD_STATUS          = 16,
>> +     PDS_LM_CMD_SUSPEND         = 18,
>> +     PDS_LM_CMD_SUSPEND_STATUS  = 19,
>> +     PDS_LM_CMD_RESUME          = 20,
>> +     PDS_LM_CMD_SAVE            = 21,
>> +     PDS_LM_CMD_RESTORE         = 22,
>> +};
>> +
>> +/**
>> + * struct pds_lm_cmd - generic command
>> + * @opcode:  Opcode
>> + * @rsvd:    Word boundary padding
>> + * @vf_id:   VF id
>> + * @rsvd2:   Structure padding to 60 Bytes
>> + */
>> +struct pds_lm_cmd {
>> +     u8     opcode;
>> +     u8     rsvd;
>> +     __le16 vf_id;
>> +     u8     rsvd2[56];
>> +};
>> +
>> +/**
>> + * struct pds_lm_comp - generic command completion
>> + * @status:  Status of the command (enum pds_core_status_code)
>> + * @rsvd:    Structure padding to 16 Bytes
>> + */
>> +struct pds_lm_comp {
>> +     u8 status;
>> +     u8 rsvd[15];
>> +};
>> +
>> +/**
>> + * struct pds_lm_status_cmd - STATUS command
>> + * @opcode:  Opcode
>> + * @rsvd:    Word boundary padding
>> + * @vf_id:   VF id
>> + */
>> +struct pds_lm_status_cmd {
>> +     u8     opcode;
>> +     u8     rsvd;
>> +     __le16 vf_id;
>> +};
>> +
>> +/**
>> + * struct pds_lm_status_comp - STATUS command completion
>> + * @status:          Status of the command (enum pds_core_status_code)
>> + * @rsvd:            Word boundary padding
>> + * @comp_index:              Index in the desc ring for which this is the completion
>> + * @size:            Size of the device state
>> + * @rsvd2:           Word boundary padding
>> + * @color:           Color bit
>> + */
>> +struct pds_lm_status_comp {
>> +     u8     status;
>> +     u8     rsvd;
>> +     __le16 comp_index;
>> +     union {
>> +             __le64 size;
>> +             u8     rsvd2[11];
>> +     } __packed;
>> +     u8     color;
>> +};
>> +
>> +/**
>> + * struct pds_lm_suspend_cmd - SUSPEND command
>> + * @opcode:  Opcode PDS_LM_CMD_SUSPEND
>> + * @rsvd:    Word boundary padding
>> + * @vf_id:   VF id
>> + */
>> +struct pds_lm_suspend_cmd {
>> +     u8     opcode;
>> +     u8     rsvd;
>> +     __le16 vf_id;
>> +};
>> +
>> +/**
>> + * struct pds_lm_suspend_comp - SUSPEND command completion
>> + * @status:          Status of the command (enum pds_core_status_code)
>> + * @rsvd:            Word boundary padding
>> + * @comp_index:              Index in the desc ring for which this is the completion
>> + * @state_size:              Size of the device state computed post suspend
>> + * @rsvd2:           Word boundary padding
>> + * @color:           Color bit
>> + */
>> +struct pds_lm_suspend_comp {
>> +     u8     status;
>> +     u8     rsvd;
>> +     __le16 comp_index;
>> +     union {
>> +             __le64 state_size;
>> +             u8     rsvd2[11];
>> +     } __packed;
>> +     u8     color;
>> +};
>> +
>> +/**
>> + * struct pds_lm_suspend_status_cmd - SUSPEND status command
>> + * @opcode:  Opcode PDS_AQ_CMD_LM_SUSPEND_STATUS
>> + * @rsvd:    Word boundary padding
>> + * @vf_id:   VF id
>> + */
>> +struct pds_lm_suspend_status_cmd {
>> +     u8 opcode;
>> +     u8 rsvd;
>> +     __le16 vf_id;
>> +};
>> +
>> +/**
>> + * struct pds_lm_resume_cmd - RESUME command
>> + * @opcode:  Opcode PDS_LM_CMD_RESUME
>> + * @rsvd:    Word boundary padding
>> + * @vf_id:   VF id
>> + */
>> +struct pds_lm_resume_cmd {
>> +     u8     opcode;
>> +     u8     rsvd;
>> +     __le16 vf_id;
>> +};
>> +
>> +/**
>> + * struct pds_lm_sg_elem - Transmit scatter-gather (SG) descriptor element
>> + * @addr:    DMA address of SG element data buffer
>> + * @len:     Length of SG element data buffer, in bytes
>> + * @rsvd:    Word boundary padding
>> + */
>> +struct pds_lm_sg_elem {
>> +     __le64 addr;
>> +     __le32 len;
>> +     __le16 rsvd[2];
>> +};
>> +
>> +/**
>> + * struct pds_lm_save_cmd - SAVE command
>> + * @opcode:  Opcode PDS_LM_CMD_SAVE
>> + * @rsvd:    Word boundary padding
>> + * @vf_id:   VF id
>> + * @rsvd2:   Word boundary padding
>> + * @sgl_addr:        IOVA address of the SGL to dma the device state
>> + * @num_sge: Total number of SG elements
>> + */
>> +struct pds_lm_save_cmd {
>> +     u8     opcode;
>> +     u8     rsvd;
>> +     __le16 vf_id;
>> +     u8     rsvd2[4];
>> +     __le64 sgl_addr;
>> +     __le32 num_sge;
>> +} __packed;
>> +
>> +/**
>> + * struct pds_lm_restore_cmd - RESTORE command
>> + * @opcode:  Opcode PDS_LM_CMD_RESTORE
>> + * @rsvd:    Word boundary padding
>> + * @vf_id:   VF id
>> + * @rsvd2:   Word boundary padding
>> + * @sgl_addr:        IOVA address of the SGL to dma the device state
>> + * @num_sge: Total number of SG elements
>> + */
>> +struct pds_lm_restore_cmd {
>> +     u8     opcode;
>> +     u8     rsvd;
>> +     __le16 vf_id;
>> +     u8     rsvd2[4];
>> +     __le64 sgl_addr;
>> +     __le32 num_sge;
>> +} __packed;
>> +
>> +/**
>> + * union pds_lm_dev_state - device state information
>> + * @words:   Device state words
>> + */
>> +union pds_lm_dev_state {
>> +     __le32 words[PDS_LM_DEVICE_STATE_LENGTH / sizeof(__le32)];
>> +};
>> +
>> +enum pds_lm_host_vf_status {
>> +     PDS_LM_STA_NONE = 0,
>> +     PDS_LM_STA_IN_PROGRESS,
>> +     PDS_LM_STA_MAX,
>> +};
>> +
>> +/**
>> + * struct pds_lm_host_vf_status_cmd - HOST_VF_STATUS command
>> + * @opcode:  Opcode PDS_LM_CMD_HOST_VF_STATUS
>> + * @rsvd:    Word boundary padding
>> + * @vf_id:   VF id
>> + * @status:  Current LM status of host VF driver (enum pds_lm_host_status)
>> + */
>> +struct pds_lm_host_vf_status_cmd {
>> +     u8     opcode;
>> +     u8     rsvd;
>> +     __le16 vf_id;
>> +     u8     status;
>> +};
>> +
>>   union pds_core_adminq_cmd {
>>        u8     opcode;
>>        u8     bytes[64];
>> @@ -600,6 +807,14 @@ union pds_core_adminq_cmd {
>>
>>        struct pds_core_q_identify_cmd    q_ident;
>>        struct pds_core_q_init_cmd        q_init;
>> +
>> +     struct pds_lm_suspend_cmd               lm_suspend;
>> +     struct pds_lm_suspend_status_cmd        lm_suspend_status;
>> +     struct pds_lm_resume_cmd                lm_resume;
>> +     struct pds_lm_status_cmd                lm_status;
>> +     struct pds_lm_save_cmd                  lm_save;
>> +     struct pds_lm_restore_cmd               lm_restore;
>> +     struct pds_lm_host_vf_status_cmd        lm_host_vf_status;
>>   };
>>
>>   union pds_core_adminq_comp {
>> @@ -621,6 +836,8 @@ union pds_core_adminq_comp {
>>
>>        struct pds_core_q_identify_comp   q_ident;
>>        struct pds_core_q_init_comp       q_init;
>> +
>> +     struct pds_lm_status_comp               lm_status;
>>   };
>>
>>   #ifndef __CHECKER__
>> --
>> 2.17.1
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH v10 vfio 0/7] pds_vfio driver
  2023-06-02 22:03 [PATCH v10 vfio 0/7] pds_vfio driver Brett Creeley
                   ` (7 preceding siblings ...)
  2023-06-14 20:20 ` [PATCH v10 vfio 0/7] pds_vfio driver Alex Williamson
@ 2023-06-16  6:47 ` Tian, Kevin
  2023-06-16 20:06   ` Brett Creeley
  2023-06-17  4:49 ` Brett Creeley
  9 siblings, 1 reply; 40+ messages in thread
From: Tian, Kevin @ 2023-06-16  6:47 UTC (permalink / raw)
  To: Brett Creeley, kvm, netdev, alex.williamson, jgg, yishaih,
	shameerali.kolothum.thodi
  Cc: shannon.nelson

> From: Brett Creeley <brett.creeley@amd.com>
> Sent: Saturday, June 3, 2023 6:03 AM
> 
> This is a patchset for a new vendor specific VFIO driver
> (pds_vfio) for use with the AMD/Pensando Distributed Services Card
> (DSC). This driver makes use of the pds_core driver.
> 
> This driver will use the pds_core device's adminq as the VFIO
> control path to the DSC. In order to make adminq calls, the VFIO
> instance makes use of functions exported by the pds_core driver.
> 
> In order to receive events from pds_core, the pds_vfio driver
> registers to a private notifier. This is needed for various events
> that come from the device.
> 
> An ASCII diagram of a VFIO instance looks something like this and can
> be used with the VFIO subsystem to provide the VF device VFIO and live
> migration support.
> 
>                                .------.  .-----------------------.
>                                | QEMU |--|  VM  .-------------.  |
>                                '......'  |      |   Eth VF    |  |
>                                   |      |      .-------------.  |
>                                   |      |      |  SR-IOV VF  |  |
>                                   |      |      '-------------'  |
>                                   |      '------------||---------'
>                                .--------------.       ||
>                                |/dev/<vfio_fd>|       ||
>                                '--------------'       ||
> Host Userspace                         |              ||
> ===================================================   ||
> Host Kernel                            |              ||
>                                   .--------.          ||
>                                   |vfio-pci|          ||
>                                   '--------'          ||
>        .------------------.           ||              ||
>        |   | exported API |<----+     ||              ||
>        |   '--------------|     |     ||              ||
>        |                  |    .-------------.        ||
>        |     pds_core     |--->|   pds_vfio  |        ||
>        '------------------' |  '-------------'        ||
>                ||           |         ||              ||
>              09:00.0     notifier    09:00.1          ||
> == PCI ===============================================||=====
>                ||                     ||              ||
>           .----------.          .----------.          ||
>     ,-----|    PF    |----------|    VF    |-------------------,
>     |     '----------'          '----------'  |       VF       |
>     |                     DSC                 |  data/control  |
>     |                                         |      path      |
>     -----------------------------------------------------------
> 

why is "VF data/control path" drawn out of the VF box?


^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH v10 vfio 1/7] vfio: Commonize combine_ranges for use in other VFIO drivers
  2023-06-02 22:03 ` [PATCH v10 vfio 1/7] vfio: Commonize combine_ranges for use in other VFIO drivers Brett Creeley
@ 2023-06-16  6:52   ` Tian, Kevin
  2023-06-16 18:37     ` Brett Creeley
  0 siblings, 1 reply; 40+ messages in thread
From: Tian, Kevin @ 2023-06-16  6:52 UTC (permalink / raw)
  To: Brett Creeley, kvm, netdev, alex.williamson, jgg, yishaih,
	shameerali.kolothum.thodi
  Cc: shannon.nelson

> From: Brett Creeley <brett.creeley@amd.com>
> Sent: Saturday, June 3, 2023 6:03 AM
> 
> +void vfio_combine_iova_ranges(struct rb_root_cached *root, u32 cur_nodes,
> +			      u32 req_nodes)
> +{
> +	struct interval_tree_node *prev, *curr, *comb_start, *comb_end;
> +	unsigned long min_gap, curr_gap;
> +
> +	/* Special shortcut when a single range is required */
> +	if (req_nodes == 1) {
> +		unsigned long last;
> +
> +		comb_start = interval_tree_iter_first(root, 0, ULONG_MAX);
> +		curr = comb_start;
> +		while (curr) {
> +			last = curr->last;
> +			prev = curr;
> +			curr = interval_tree_iter_next(curr, 0, ULONG_MAX);
> +			if (prev != comb_start)
> +				interval_tree_remove(prev, root);
> +		}
> +		comb_start->last = last;
> +		return;
> +	}
> +
> +	/* Combine ranges which have the smallest gap */
> +	while (cur_nodes > req_nodes) {
> +		prev = NULL;
> +		min_gap = ULONG_MAX;
> +		curr = interval_tree_iter_first(root, 0, ULONG_MAX);
> +		while (curr) {
> +			if (prev) {
> +				curr_gap = curr->start - prev->last;
> +				if (curr_gap < min_gap) {
> +					min_gap = curr_gap;
> +					comb_start = prev;
> +					comb_end = curr;
> +				}
> +			}
> +			prev = curr;
> +			curr = interval_tree_iter_next(curr, 0, ULONG_MAX);
> +		}
> +		comb_start->last = comb_end->last;
> +		interval_tree_remove(comb_end, root);
> +		cur_nodes--;
> +	}
> +}
> +EXPORT_SYMBOL_GPL(vfio_combine_iova_ranges);
> +

As this is a public function, please follow the kernel convention and add a
comment explaining what it actually does.

Btw, while you renamed it with the 'vfio' and 'iova' keywords, the actual logic
has nothing to do with either of them. Would it make more sense to move it
to a more generic library?
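
To illustrate what such a comment could describe, here is a rough userspace sketch of the smallest-gap merge, assuming a sorted array of non-overlapping ranges in place of the interval tree. The names struct range and combine_ranges() are hypothetical and exist only for this example; this is not the kernel implementation.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for an interval tree node: inclusive [start, last]. */
struct range {
	unsigned long start;
	unsigned long last;
};

/**
 * combine_ranges - merge sorted ranges until at most req_nodes remain
 * @r:         array of non-overlapping ranges sorted by start
 * @cur_nodes: number of valid entries in @r
 * @req_nodes: maximum number of entries wanted (must be >= 1)
 *
 * Repeatedly merges the two neighbours separated by the smallest gap.
 * Returns the number of entries left in @r.
 */
static size_t combine_ranges(struct range *r, size_t cur_nodes,
			     size_t req_nodes)
{
	assert(req_nodes >= 1);

	while (cur_nodes > req_nodes) {
		unsigned long min_gap = ULONG_MAX;
		size_t best = 1;
		size_t i;

		/* Find the neighbouring pair (i - 1, i) with the smallest gap. */
		for (i = 1; i < cur_nodes; i++) {
			unsigned long gap = r[i].start - r[i - 1].last;

			if (gap < min_gap) {
				min_gap = gap;
				best = i;
			}
		}

		/* Extend the left range over the gap and drop the right one. */
		r[best - 1].last = r[best].last;
		for (i = best; i + 1 < cur_nodes; i++)
			r[i] = r[i + 1];
		cur_nodes--;
	}

	return cur_nodes;
}
```

The merged ranges cover the gaps as well, so the result trades precision for a smaller number of ranges.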

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH v10 vfio 2/7] vfio/pds: Initial support for pds_vfio VFIO driver
  2023-06-02 22:03 ` [PATCH v10 vfio 2/7] vfio/pds: Initial support for pds_vfio VFIO driver Brett Creeley
  2023-06-14 21:31   ` Alex Williamson
@ 2023-06-16  6:56   ` Tian, Kevin
  2023-06-16 18:42     ` Brett Creeley
  1 sibling, 1 reply; 40+ messages in thread
From: Tian, Kevin @ 2023-06-16  6:56 UTC (permalink / raw)
  To: Brett Creeley, kvm, netdev, alex.williamson, jgg, yishaih,
	shameerali.kolothum.thodi
  Cc: shannon.nelson

> From: Brett Creeley <brett.creeley@amd.com>
> Sent: Saturday, June 3, 2023 6:03 AM
> 
> This is the initial framework for the new pds_vfio device driver. This
> does the very basics of registering the PDS PCI device and configuring
> it as a VFIO PCI device.
> 
> With this change, the VF device can be bound to the pds_vfio driver on
> the host and presented to the VM as the VF's device type.

While this should be generic to multiple PDS device types, this patch only
supports the Ethernet VF. Worth a clarification here.

> +static const struct pci_device_id
> +pds_vfio_pci_table[] = {

no need to break line.

> +
> +MODULE_DESCRIPTION(PDS_VFIO_DRV_DESCRIPTION);
> +MODULE_AUTHOR("Advanced Micro Devices, Inc.");

author usually describes the personal name plus mail address.

> +
> +	err = vfio_pci_core_init_dev(vdev);
> +	if (err)
> +		return err;
> +
> +	pds_vfio->vf_id = pci_iov_vf_id(pdev);

pci_iov_vf_id() could fail.
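
A minimal sketch of the kind of check meant here, written as a userspace mock. The fake_* types and functions are hypothetical stand-ins; the only assumption carried over from the kernel is that pci_iov_vf_id() returns a negative errno when the device is not a VF.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins; the real types live in the kernel. */
struct fake_pci_dev {
	int is_virtfn;
	int vf_index;
};

struct fake_pds_vfio {
	int vf_id;
};

/* Mock following the pci_iov_vf_id() contract: negative errno on failure. */
static int fake_pci_iov_vf_id(const struct fake_pci_dev *pdev)
{
	if (!pdev->is_virtfn)
		return -EINVAL;
	return pdev->vf_index;
}

/* Store the VF id only after the lookup is known to have succeeded. */
static int fake_init_dev(struct fake_pds_vfio *pds_vfio,
			 const struct fake_pci_dev *pdev)
{
	int vf_id;

	assert(pds_vfio && pdev);

	vf_id = fake_pci_iov_vf_id(pdev);
	if (vf_id < 0)
		return vf_id;	/* propagate the error instead of storing it */

	pds_vfio->vf_id = vf_id;
	return 0;
}
```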


^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH v10 vfio 3/7] vfio/pds: register with the pds_core PF
  2023-06-02 22:03 ` [PATCH v10 vfio 3/7] vfio/pds: register with the pds_core PF Brett Creeley
  2023-06-15 21:05   ` Shameerali Kolothum Thodi
@ 2023-06-16  7:04   ` Tian, Kevin
  2023-06-16 19:01     ` Brett Creeley
  1 sibling, 1 reply; 40+ messages in thread
From: Tian, Kevin @ 2023-06-16  7:04 UTC (permalink / raw)
  To: Brett Creeley, kvm, netdev, alex.williamson, jgg, yishaih,
	shameerali.kolothum.thodi
  Cc: shannon.nelson

> From: Brett Creeley <brett.creeley@amd.com>
> Sent: Saturday, June 3, 2023 6:03 AM
> 
> +
> +int pds_vfio_register_client_cmd(struct pds_vfio_pci_device *pds_vfio)
> +{
> +	struct pci_dev *pdev = pds_vfio_to_pci_dev(pds_vfio);
> +	char devname[PDS_DEVNAME_LEN];
> +	int ci;
> +
> +	snprintf(devname, sizeof(devname), "%s.%d-%u", PDS_LM_DEV_NAME,
> +		 pci_domain_nr(pdev->bus), pds_vfio->pci_id);
> +
> +	ci = pds_client_register(pci_physfn(pdev), devname);
> +	if (ci <= 0)
> +		return ci;

'ci' cannot be 0 since pds_client_register() already converts 0 into
-EIO.

Btw, the description of pds_client_register() is wrong. It says it returns
0 on success; it should say it returns a positive client_id on success.

> 
> +struct pci_dev *pds_vfio_to_pci_dev(struct pds_vfio_pci_device *pds_vfio)
> +{
> +	return pds_vfio->vfio_coredev.pdev;
> +}

Does this wrapper actually save the length?

> 
> +	dev_dbg(&pdev->dev,
> +		"%s: PF %#04x VF %#04x (%d) vf_id %d domain %d pds_vfio %p\n",
> +		__func__, pci_dev_id(pdev->physfn), pds_vfio->pci_id,
> +		pds_vfio->pci_id, pds_vfio->vf_id, pci_domain_nr(pdev->bus),
> +		pds_vfio);

Why print pds_vfio->pci_id twice?

> 
> +#define PDS_LM_DEV_NAME		PDS_CORE_DRV_NAME "." PDS_DEV_TYPE_LM_STR
> +

should this name include a 'vfio' string?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH v10 vfio 4/7] vfio/pds: Add VFIO live migration support
  2023-06-02 22:03 ` [PATCH v10 vfio 4/7] vfio/pds: Add VFIO live migration support Brett Creeley
  2023-06-15 21:07   ` Shameerali Kolothum Thodi
@ 2023-06-16  8:06   ` Tian, Kevin
  2023-06-17  4:45     ` Brett Creeley
  2023-06-19 12:46     ` Jason Gunthorpe
  1 sibling, 2 replies; 40+ messages in thread
From: Tian, Kevin @ 2023-06-16  8:06 UTC (permalink / raw)
  To: Brett Creeley, kvm, netdev, alex.williamson, jgg, yishaih,
	shameerali.kolothum.thodi
  Cc: shannon.nelson

> From: Brett Creeley <brett.creeley@amd.com>
> Sent: Saturday, June 3, 2023 6:03 AM
> 
> Add live migration support via the VFIO subsystem. The migration
> implementation aligns with the definition from uapi/vfio.h and uses
> the pds_core PF's adminq for device configuration.
> 
> The ability to suspend, resume, and transfer VF device state data is
> included along with the required admin queue command structures and
> implementations.
> 
> PDS_LM_CMD_SUSPEND and PDS_LM_CMD_SUSPEND_STATUS are added to support
> the VF device suspend operation.
> 
> PDS_LM_CMD_RESUME is added to support the VF device resume operation.
> 
> PDS_LM_CMD_STATUS is added to determine the exact size of the VF
> device state data.
> 
> PDS_LM_CMD_SAVE is added to get the VF device state data.
> 
> PDS_LM_CMD_RESTORE is added to restore the VF device with the
> previously saved data from PDS_LM_CMD_SAVE.
> 
> PDS_LM_CMD_HOST_VF_STATUS is added to notify the device when
> a migration is in/not-in progress from the host's perspective.

Does 'the device' here refer to the PF or the VF?

And how would the device use this information?

> +
> +static int pds_vfio_client_adminq_cmd(struct pds_vfio_pci_device *pds_vfio,
> +				      union pds_core_adminq_cmd *req,
> +				      size_t req_len,
> +				      union pds_core_adminq_comp *resp,
> +				      u64 flags)
> +{
> +	union pds_core_adminq_cmd cmd = {};
> +	size_t cp_len;
> +	int err;
> +
> +	/* Wrap the client request */
> +	cmd.client_request.opcode = PDS_AQ_CMD_CLIENT_CMD;
> +	cmd.client_request.client_id = cpu_to_le16(pds_vfio->client_id);
> +	cp_len = min_t(size_t, req_len, sizeof(cmd.client_request.client_cmd));

'req_len' is kind of redundant. It looks like all the callers use sizeof(req).

> +static int
> +pds_vfio_suspend_wait_device_cmd(struct pds_vfio_pci_device *pds_vfio)
> +{
> +	union pds_core_adminq_cmd cmd = {
> +		.lm_suspend_status = {
> +			.opcode = PDS_LM_CMD_SUSPEND_STATUS,
> +			.vf_id = cpu_to_le16(pds_vfio->vf_id),
> +		},
> +	};
> +	struct device *dev = pds_vfio_to_dev(pds_vfio);
> +	union pds_core_adminq_comp comp = {};
> +	unsigned long time_limit;
> +	unsigned long time_start;
> +	unsigned long time_done;
> +	int err;
> +
> +	time_start = jiffies;
> +	time_limit = time_start + HZ * SUSPEND_TIMEOUT_S;
> +	do {
> +		err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd),
> +						 &comp, PDS_AQ_FLAG_FASTPOLL);
> +		if (err != -EAGAIN)
> +			break;
> +
> +		msleep(SUSPEND_CHECK_INTERVAL_MS);
> +	} while (time_before(jiffies, time_limit));

pds_vfio_client_adminq_cmd() has exactly the same mechanism, with a 5s
timeout and a 1ms poll interval, when FASTPOLL is set.

Perhaps you could introduce another flag to indicate retry on
-EAGAIN and then handle it fully in pds_vfio_client_adminq_cmd()?
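
A rough userspace sketch of what folding the retry into the command helper could look like. The flag AQ_FLAG_RETRY_EAGAIN, the aq_post_fn callback and aq_cmd() are hypothetical names for this example only, and the loop is bounded by an attempt count where a kernel version would use jiffies and a sleep between attempts.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag; it has no counterpart in the posted patch. */
#define AQ_FLAG_RETRY_EAGAIN	(1u << 1)

/* Hypothetical callback standing in for posting one adminq command. */
typedef int (*aq_post_fn)(void *ctx);

/*
 * Post a command. When AQ_FLAG_RETRY_EAGAIN is set, keep retrying while
 * the device answers -EAGAIN, up to max_tries attempts in total.
 */
static int aq_cmd(aq_post_fn post, void *ctx, unsigned int flags,
		  unsigned int max_tries)
{
	unsigned int i;

	assert(max_tries > 0);

	for (i = 0; i < max_tries; i++) {
		int err = post(ctx);

		if (err != -EAGAIN || !(flags & AQ_FLAG_RETRY_EAGAIN))
			return err;
	}

	/* Still busy after the last attempt. */
	return -ETIMEDOUT;
}

/* Example device: reports -EAGAIN until the counter in ctx reaches zero. */
static int busy_then_ok(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return -EAGAIN;
	}
	return 0;
}
```

With this shape, callers that do not set the flag see -EAGAIN unchanged, so existing behaviour would be kept for them.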

> +int pds_vfio_suspend_device_cmd(struct pds_vfio_pci_device *pds_vfio)
> +{
> +	union pds_core_adminq_cmd cmd = {
> +		.lm_suspend = {
> +			.opcode = PDS_LM_CMD_SUSPEND,
> +			.vf_id = cpu_to_le16(pds_vfio->vf_id),
> +		},
> +	};
> +	struct device *dev = pds_vfio_to_dev(pds_vfio);
> +	union pds_core_adminq_comp comp = {};
> +	int err;
> +
> +	dev_dbg(dev, "vf%u: Suspend device\n", pds_vfio->vf_id);
> +
> +	err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp,
> +					 PDS_AQ_FLAG_FASTPOLL);
> +	if (err) {
> +		dev_err(dev, "vf%u: Suspend failed: %pe\n", pds_vfio->vf_id,
> +			ERR_PTR(err));
> +		return err;
> +	}
> +
> +	return pds_vfio_suspend_wait_device_cmd(pds_vfio);
> +}

The logic in this function is very confusing.

PDS_LM_CMD_SUSPEND has a completion record:

+struct pds_lm_suspend_comp {
+	u8     status;
+	u8     rsvd;
+	__le16 comp_index;
+	union {
+		__le64 state_size;
+		u8     rsvd2[11];
+	} __packed;
+	u8     color;

Presumably this function can look at the completion record to know whether
the suspend request succeeds.

Why do you require another wait_device step to query the suspend status?
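
For reference, a userspace mirror of the completion record quoted above, as a sketch only. It assumes GCC/Clang attribute syntax, uses plain fixed-width types in place of __le16/__le64, and treats a zero status as success; the helper lm_suspend_comp_ok() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of struct pds_lm_suspend_comp from the patch. */
struct lm_suspend_comp {
	uint8_t  status;
	uint8_t  rsvd;
	uint16_t comp_index;		/* __le16 in the kernel header */
	union {
		uint64_t state_size;	/* __le64 in the kernel header */
		uint8_t  rsvd2[11];
	} __attribute__((packed));
	uint8_t  color;
};

/* The packed union keeps the record at 16 bytes with color in the last one. */
static_assert(sizeof(struct lm_suspend_comp) == 16, "comp must be 16 bytes");
static_assert(offsetof(struct lm_suspend_comp, color) == 15, "color is last");

/* Hypothetical helper: a zero status is treated as success in this sketch. */
static int lm_suspend_comp_ok(const struct lm_suspend_comp *comp)
{
	return comp->status == 0;
}
```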

I have another question. Is it correct to hard-code the 5s timeout in
the kernel w/o any input from the VMM? Note the guest has been stopped
at this point, so the 5s timeout will very likely kill any reasonable SLA that
CSPs try hard to meet.

Ideally the VMM has an estimation how long a VM can be paused based on
SLA, to-be-migrated state size, available network bandwidth, etc. and that
hint should be passed to the kernel so any state transition which may violate
that expectation can fail quickly to break the migration process and put the
VM back to the running state.

Jason/Shameer, is there a similar concern in the mlx/hisilicon drivers?

> +
> +int pds_vfio_resume_device_cmd(struct pds_vfio_pci_device *pds_vfio)
> +{
> +	union pds_core_adminq_cmd cmd = {
> +		.lm_resume = {
> +			.opcode = PDS_LM_CMD_RESUME,
> +			.vf_id = cpu_to_le16(pds_vfio->vf_id),
> +		},
> +	};
> +	struct device *dev = pds_vfio_to_dev(pds_vfio);
> +	union pds_core_adminq_comp comp = {};
> +
> +	dev_dbg(dev, "vf%u: Resume device\n", pds_vfio->vf_id);
> +
> +	return pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp,
> +					  0);

'resume' is also in the blackout phase when the guest is not running.

So presumably FASTPOLL should be set; otherwise the max 256ms
poll interval (PDSC_ADMINQ_MAX_POLL_INTERVAL) is really inefficient.

> +
> +	if (cur == VFIO_DEVICE_STATE_RUNNING && next == VFIO_DEVICE_STATE_RUNNING_P2P) {
> +		pds_vfio_send_host_vf_lm_status_cmd(pds_vfio,
> +						    PDS_LM_STA_IN_PROGRESS);
> +		err = pds_vfio_suspend_device_cmd(pds_vfio);
> +		if (err)
> +			return ERR_PTR(err);
> +
> +		return NULL;
> +	}
> +
> +	if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && next == VFIO_DEVICE_STATE_RUNNING) {
> +		err = pds_vfio_resume_device_cmd(pds_vfio);
> +		if (err)
> +			return ERR_PTR(err);
> +
> +		pds_vfio_send_host_vf_lm_status_cmd(pds_vfio, PDS_LM_STA_NONE);
> +		return NULL;
> +	}
> +
> +	if (cur == VFIO_DEVICE_STATE_STOP && next == VFIO_DEVICE_STATE_RUNNING_P2P)
> +		return NULL;
> +
> +	if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && next == VFIO_DEVICE_STATE_STOP)
> +		return NULL;

I'm not sure whether P2P is actually supported here. By definition,
P2P means the device is stopped but still responds to p2p requests
from other devices. If you look at the mlx example, it uses different
cmds for RUNNING->RUNNING_P2P and RUNNING_P2P->STOP.

But in your case it seems you simply move what is required in STOP
into P2P. Perhaps you can just remove the support of P2P, like
hisilicon does.

> +
> +/**
> + * struct pds_lm_comp - generic command completion
> + * @status:	Status of the command (enum pds_core_status_code)
> + * @rsvd:	Structure padding to 16 Bytes
> + */
> +struct pds_lm_comp {
> +	u8 status;
> +	u8 rsvd[15];
> +};

Not used. It looks like most comp structures are defined w/o a user,
except struct pds_lm_status_comp.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH v10 vfio 6/7] vfio/pds: Add support for firmware recovery
  2023-06-02 22:03 ` [PATCH v10 vfio 6/7] vfio/pds: Add support for firmware recovery Brett Creeley
@ 2023-06-16  8:24   ` Tian, Kevin
  2023-06-17  0:47     ` Brett Creeley
  0 siblings, 1 reply; 40+ messages in thread
From: Tian, Kevin @ 2023-06-16  8:24 UTC (permalink / raw)
  To: Brett Creeley, kvm, netdev, alex.williamson, jgg, yishaih,
	shameerali.kolothum.thodi
  Cc: shannon.nelson

> From: Brett Creeley <brett.creeley@amd.com>
> Sent: Saturday, June 3, 2023 6:03 AM
> 
> +static void pds_vfio_recovery(struct pds_vfio_pci_device *pds_vfio)
> +{
> +	bool deferred_reset_needed = false;
> +
> +	/*
> +	 * Documentation states that the kernel migration driver must not
> +	 * generate asynchronous device state transitions outside of
> +	 * manipulation by the user or the VFIO_DEVICE_RESET ioctl.
> +	 *
> +	 * Since recovery is an asynchronous event received from the device,
> +	 * initiate a deferred reset. Only issue the deferred reset if a
> +	 * migration is in progress, which will cause the next step of the
> +	 * migration to fail. Also, if the device is in a state that will
> +	 * be set to VFIO_DEVICE_STATE_RUNNING on the next action (i.e. VM is
> +	 * shutdown and device is in VFIO_DEVICE_STATE_STOP) as that will clear
> +	 * the VFIO_DEVICE_STATE_ERROR when the VM starts back up.

the last sentence after "Also, ..." is incomplete?

> +	 */
> +	mutex_lock(&pds_vfio->state_mutex);
> +	if ((pds_vfio->state != VFIO_DEVICE_STATE_RUNNING &&
> +	     pds_vfio->state != VFIO_DEVICE_STATE_ERROR) ||
> +	    (pds_vfio->state == VFIO_DEVICE_STATE_RUNNING &&
> +	     pds_vfio_dirty_is_enabled(pds_vfio)))
> +		deferred_reset_needed = true;

any unwind to be done in the dirty tracking path? When firmware crashes
presumably the cmd to retrieve dirty pages is also blocked...

> +	mutex_unlock(&pds_vfio->state_mutex);
> +
> +	/*
> +	 * On the next user initiated state transition, the device will
> +	 * transition to the VFIO_DEVICE_STATE_ERROR. At this point it's the user's
> +	 * responsibility to reset the device.
> +	 *
> +	 * If a VFIO_DEVICE_RESET is requested post recovery and before the next
> +	 * state transition, then the deferred reset state will be set to
> +	 * VFIO_DEVICE_STATE_RUNNING.
> +	 */
> +	if (deferred_reset_needed)
> +		pds_vfio_deferred_reset(pds_vfio, VFIO_DEVICE_STATE_ERROR);

Open-code this, as this is the only caller.

> +}
> +
> +static int pds_vfio_pci_notify_handler(struct notifier_block *nb,
> +				       unsigned long ecode, void *data)
> +{
> +	struct pds_vfio_pci_device *pds_vfio =
> +		container_of(nb, struct pds_vfio_pci_device, nb);
> +	struct device *dev = pds_vfio_to_dev(pds_vfio);
> +	union pds_core_notifyq_comp *event = data;
> +
> +	dev_dbg(dev, "%s: event code %lu\n", __func__, ecode);
> +
> +	/*
> +	 * We don't need to do anything for RESET state==0 as there is no notify
> +	 * or feedback mechanism available, and it is possible that we won't
> +	 * even see a state==0 event.
> +	 *
> +	 * Any requests from VFIO while state==0 will fail, which will return
> +	 * error and may cause migration to fail.
> +	 */
> +	if (ecode == PDS_EVENT_RESET) {
> +		dev_info(dev, "%s: PDS_EVENT_RESET event received, state==%d\n",
> +			 __func__, event->reset.state);
> +		if (event->reset.state == 1)
> +			pds_vfio_recovery(pds_vfio);
> +	}

Please explain what state==0 is, and why state==1 is handled while
state==2 is not.

> @@ -33,10 +33,13 @@ void pds_vfio_state_mutex_unlock(struct pds_vfio_pci_device *pds_vfio)
>  	if (pds_vfio->deferred_reset) {
>  		pds_vfio->deferred_reset = false;
>  		if (pds_vfio->state == VFIO_DEVICE_STATE_ERROR) {
> -			pds_vfio->state = VFIO_DEVICE_STATE_RUNNING;
> +			pds_vfio->state = pds_vfio->deferred_reset_state;
>  			pds_vfio_put_restore_file(pds_vfio);
>  			pds_vfio_put_save_file(pds_vfio);
> +		} else if (pds_vfio->deferred_reset_state == VFIO_DEVICE_STATE_ERROR) {
> +			pds_vfio->state = VFIO_DEVICE_STATE_ERROR;
>  		}
> +		pds_vfio->deferred_reset_state = VFIO_DEVICE_STATE_RUNNING;

this is not required. 'deferred_reset_state' should be set only when
deferred_reset is true. Currently only in the notify path and reset path.

So the last assignment is pointless.

It's simpler to be:

	if (pds_vfio->deferred_reset) {
		pds_vfio->deferred_reset = false;
		if (pds_vfio->state == VFIO_DEVICE_STATE_ERROR) {
			pds_vfio_put_restore_file(pds_vfio);
			pds_vfio_put_save_file(pds_vfio);
		}
		pds_vfio->state = pds_vfio->deferred_reset_state;
		...
	}
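
The simplified flow above can be modelled in userspace as follows. This is only a sketch: struct dev_model, its reduced state enum and handle_deferred_reset() are hypothetical, and the two put_*_file() calls are represented by a counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Reduced set of states, enough to model the unlock path. */
enum dev_state { STATE_ERROR, STATE_STOP, STATE_RUNNING };

/* Hypothetical model of the fields touched in the unlock path. */
struct dev_model {
	enum dev_state state;
	enum dev_state deferred_reset_state;
	bool deferred_reset;
	int files_released;	/* counts put_restore_file/put_save_file pairs */
};

/* Models the simplified handling suggested above. */
static void handle_deferred_reset(struct dev_model *d)
{
	assert(d);

	if (!d->deferred_reset)
		return;

	d->deferred_reset = false;
	/* Release the migration files only when leaving the error state. */
	if (d->state == STATE_ERROR)
		d->files_released++;
	d->state = d->deferred_reset_state;
}
```

Note that in this form the state is always taken from deferred_reset_state, so every path that sets deferred_reset has to set deferred_reset_state as well.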


^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH v10 vfio 7/7] vfio/pds: Add Kconfig and documentation
  2023-06-02 22:03 ` [PATCH v10 vfio 7/7] vfio/pds: Add Kconfig and documentation Brett Creeley
@ 2023-06-16  8:25   ` Tian, Kevin
  2023-06-16 20:05     ` Brett Creeley
  0 siblings, 1 reply; 40+ messages in thread
From: Tian, Kevin @ 2023-06-16  8:25 UTC (permalink / raw)
  To: Brett Creeley, kvm, netdev, alex.williamson, jgg, yishaih,
	shameerali.kolothum.thodi
  Cc: shannon.nelson

> From: Brett Creeley <brett.creeley@amd.com>
> Sent: Saturday, June 3, 2023 6:03 AM
> +
> +  # Prevent non-vfio VF driver from probing the VF device
> +  echo 0 > /sys/class/pci_bus/$PF_BUS/device/$PF_BDF/sriov_drivers_autoprobe
> +
> +  # Create single VF for Live Migration via VFIO
> +  echo 1 > /sys/bus/pci/drivers/pds_core/$PF_BDF/sriov_numvfs

s/via VFIO/via pds_core/

> +
> +config PDS_VFIO_PCI
> +	tristate "VFIO support for PDS PCI devices"
> +	depends on PDS_CORE
> +	depends on VFIO_PCI_CORE

this should be rebased on Alex's Kconfig change.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v10 vfio 1/7] vfio: Commonize combine_ranges for use in other VFIO drivers
  2023-06-16  6:52   ` Tian, Kevin
@ 2023-06-16 18:37     ` Brett Creeley
  0 siblings, 0 replies; 40+ messages in thread
From: Brett Creeley @ 2023-06-16 18:37 UTC (permalink / raw)
  To: Tian, Kevin, Brett Creeley, kvm, netdev, alex.williamson, jgg,
	yishaih, shameerali.kolothum.thodi
  Cc: shannon.nelson

On 6/15/2023 11:52 PM, Tian, Kevin wrote:
>> From: Brett Creeley <brett.creeley@amd.com>
>> Sent: Saturday, June 3, 2023 6:03 AM
>>
>> +void vfio_combine_iova_ranges(struct rb_root_cached *root, u32 cur_nodes,
>> +                           u32 req_nodes)
>> +{
>> +     struct interval_tree_node *prev, *curr, *comb_start, *comb_end;
>> +     unsigned long min_gap, curr_gap;
>> +
>> +     /* Special shortcut when a single range is required */
>> +     if (req_nodes == 1) {
>> +             unsigned long last;
>> +
>> +             comb_start = interval_tree_iter_first(root, 0, ULONG_MAX);
>> +             curr = comb_start;
>> +             while (curr) {
>> +                     last = curr->last;
>> +                     prev = curr;
>> +                     curr = interval_tree_iter_next(curr, 0, ULONG_MAX);
>> +                     if (prev != comb_start)
>> +                             interval_tree_remove(prev, root);
>> +             }
>> +             comb_start->last = last;
>> +             return;
>> +     }
>> +
>> +     /* Combine ranges which have the smallest gap */
>> +     while (cur_nodes > req_nodes) {
>> +             prev = NULL;
>> +             min_gap = ULONG_MAX;
>> +             curr = interval_tree_iter_first(root, 0, ULONG_MAX);
>> +             while (curr) {
>> +                     if (prev) {
>> +                             curr_gap = curr->start - prev->last;
>> +                             if (curr_gap < min_gap) {
>> +                                     min_gap = curr_gap;
>> +                                     comb_start = prev;
>> +                                     comb_end = curr;
>> +                             }
>> +                     }
>> +                     prev = curr;
>> +                     curr = interval_tree_iter_next(curr, 0, ULONG_MAX);
>> +             }
>> +             comb_start->last = comb_end->last;
>> +             interval_tree_remove(comb_end, root);
>> +             cur_nodes--;
>> +     }
>> +}
>> +EXPORT_SYMBOL_GPL(vfio_combine_iova_ranges);
>> +
> 
> Being a public function please follow the kernel convention with comment
> explaining what this function actually does.

I've seen many cases where there's no documentation for public functions,
and I don't think any documentation is needed for this function as the
name is self-explanatory. VFIO drivers can use this to combine iova
ranges, which is why I named it vfio_combine_iova_ranges().

> 
> btw while you rename it with 'vfio' and 'iova' keywords, the actual logic
> has nothing to do with either of them. Does it make more sense to move it
> to a more generic library?

I think it *could* go into a more generic library, but at this point in 
time I think it belongs here. As mentioned in the previous comment the 
function name describes its exact purpose. If/when it ever gets more 
users it can be moved and renamed.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v10 vfio 2/7] vfio/pds: Initial support for pds_vfio VFIO driver
  2023-06-16  6:56   ` Tian, Kevin
@ 2023-06-16 18:42     ` Brett Creeley
  0 siblings, 0 replies; 40+ messages in thread
From: Brett Creeley @ 2023-06-16 18:42 UTC (permalink / raw)
  To: Tian, Kevin, Brett Creeley, kvm, netdev, alex.williamson, jgg,
	yishaih, shameerali.kolothum.thodi
  Cc: shannon.nelson

On 6/15/2023 11:56 PM, Tian, Kevin wrote:
>> From: Brett Creeley <brett.creeley@amd.com>
>> Sent: Saturday, June 3, 2023 6:03 AM
>>
>> This is the initial framework for the new pds_vfio device driver. This
>> does the very basics of registering the PDS PCI device and configuring
>> it as a VFIO PCI device.
>>
>> With this change, the VF device can be bound to the pds_vfio driver on
>> the host and presented to the VM as the VF's device type.
> 
> while this should be generic to multiple PDS device types this patch only
> supports the ethernet VF. worth a clarification here.
> 
>> +static const struct pci_device_id
>> +pds_vfio_pci_table[] = {
> 
> no need to break line.

Must have missed this one. Thanks.

> 
>> +
>> +MODULE_DESCRIPTION(PDS_VFIO_DRV_DESCRIPTION);
>> +MODULE_AUTHOR("Advanced Micro Devices, Inc.");
> 
> author usually describes the personal name plus mail address.

Will fix. Thanks.

> 
>> +
>> +     err = vfio_pci_core_init_dev(vdev);
>> +     if (err)
>> +             return err;
>> +
>> +     pds_vfio->vf_id = pci_iov_vf_id(pdev);
> 
> pci_iov_vf_id() could fail.

Good catch. I will check for failure in the next revision. Thanks.

> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v10 vfio 3/7] vfio/pds: register with the pds_core PF
  2023-06-16  7:04   ` Tian, Kevin
@ 2023-06-16 19:01     ` Brett Creeley
  2023-06-20  2:11       ` Tian, Kevin
  0 siblings, 1 reply; 40+ messages in thread
From: Brett Creeley @ 2023-06-16 19:01 UTC (permalink / raw)
  To: Tian, Kevin, Brett Creeley, kvm, netdev, alex.williamson, jgg,
	yishaih, shameerali.kolothum.thodi
  Cc: shannon.nelson

On 6/16/2023 12:04 AM, Tian, Kevin wrote:
> 
> 
>> From: Brett Creeley <brett.creeley@amd.com>
>> Sent: Saturday, June 3, 2023 6:03 AM
>>
>> +
>> +int pds_vfio_register_client_cmd(struct pds_vfio_pci_device *pds_vfio)
>> +{
>> +     struct pci_dev *pdev = pds_vfio_to_pci_dev(pds_vfio);
>> +     char devname[PDS_DEVNAME_LEN];
>> +     int ci;
>> +
>> +     snprintf(devname, sizeof(devname), "%s.%d-%u", PDS_LM_DEV_NAME,
>> +              pci_domain_nr(pdev->bus), pds_vfio->pci_id);
>> +
>> +     ci = pds_client_register(pci_physfn(pdev), devname);
>> +     if (ci <= 0)
>> +             return ci;
> 
> 'ci' cannot be 0 since pds_client_register() already converts 0 into
> -EIO.

Yeah, Shameer already mentioned this and I have already fixed this issue 
for the next revision. Thanks.

> 
> btw the description of pds_client_register() is wrong. It said return
> 0 on success. should be positive client_id on success.

Yeah, this was also mentioned by Shameer. I will submit a follow on 
patch that updates the documentation in pds_client_register(). Thanks.
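
A minimal userspace sketch of the return convention discussed above, under
the assumption stated in the review: the register call yields a positive
client id on success and a negative errno on failure, with 0 already mapped
to -EIO. fake_client_register() and fake_register_client_cmd() are
hypothetical models, not the pds_core API.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the driver's per-device structure. */
struct fake_pds_vfio {
    int client_id;
};

/*
 * Hypothetical model of the register call. fw_id is what the firmware
 * handed back; a zero id is treated as a failure and mapped to -EIO, so
 * the caller only ever sees a negative errno or a positive client id.
 */
static int fake_client_register(int fw_id)
{
    if (fw_id < 0)
        return fw_id;
    if (fw_id == 0)
        return -EIO;
    return fw_id;
}

/* Caller side: since 0 cannot come back, "ci < 0" is the whole error check. */
static int fake_register_client_cmd(struct fake_pds_vfio *pds_vfio, int fw_id)
{
    int ci;

    assert(pds_vfio);

    ci = fake_client_register(fw_id);
    if (ci < 0)
        return ci;

    pds_vfio->client_id = ci;
    return 0;
}
```

With "ci <= 0" the caller would return 0 (success) for a case that can no
longer occur, which is why the narrower check reads more honestly.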

> 
>>
>> +struct pci_dev *pds_vfio_to_pci_dev(struct pds_vfio_pci_device *pds_vfio)
>> +{
>> +     return pds_vfio->vfio_coredev.pdev;
>> +}
> 
> Does this wrapper actually save the length?

It wasn't so much about length but encapsulating the multiple 
de-references and multiple uses into a function call.

> 
>>
>> +     dev_dbg(&pdev->dev,
>> +             "%s: PF %#04x VF %#04x (%d) vf_id %d domain %d pds_vfio %p\n",
>> +             __func__, pci_dev_id(pdev->physfn), pds_vfio->pci_id,
>> +             pds_vfio->pci_id, pds_vfio->vf_id, pci_domain_nr(pdev->bus),
>> +             pds_vfio);
> 
> why printing pds_vfio->pci_id twice?

Will fix. Thanks.

> 
>>
>> +#define PDS_LM_DEV_NAME              PDS_CORE_DRV_NAME "." PDS_DEV_TYPE_LM_STR
>> +
> 
> should this name include a 'vfio' string?

This aligns with what our DSC/firmware expects, so no, it's not needed.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v10 vfio 7/7] vfio/pds: Add Kconfig and documentation
  2023-06-16  8:25   ` Tian, Kevin
@ 2023-06-16 20:05     ` Brett Creeley
  0 siblings, 0 replies; 40+ messages in thread
From: Brett Creeley @ 2023-06-16 20:05 UTC (permalink / raw)
  To: Tian, Kevin, Brett Creeley, kvm, netdev, alex.williamson, jgg,
	yishaih, shameerali.kolothum.thodi
  Cc: shannon.nelson

On 6/16/2023 1:25 AM, Tian, Kevin wrote:
> 
> 
>> From: Brett Creeley <brett.creeley@amd.com>
>> Sent: Saturday, June 3, 2023 6:03 AM
>> +
>> +  # Prevent non-vfio VF driver from probing the VF device
>> +  echo 0 > /sys/class/pci_bus/$PF_BUS/device/$PF_BDF/sriov_drivers_autoprobe
>> +
>> +  # Create single VF for Live Migration via VFIO
>> +  echo 1 > /sys/bus/pci/drivers/pds_core/$PF_BDF/sriov_numvfs
> 
> s/via VFIO/via pds_core/
> 
>> +
>> +config PDS_VFIO_PCI
>> +     tristate "VFIO support for PDS PCI devices"
>> +     depends on PDS_CORE
>> +     depends on VFIO_PCI_CORE
> 
> this should be rebased on Alex's Kconfig change.

I will fix these issues, thanks for the review.

Brett

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v10 vfio 0/7] pds_vfio driver
  2023-06-16  6:47 ` Tian, Kevin
@ 2023-06-16 20:06   ` Brett Creeley
  0 siblings, 0 replies; 40+ messages in thread
From: Brett Creeley @ 2023-06-16 20:06 UTC (permalink / raw)
  To: Tian, Kevin, Brett Creeley, kvm, netdev, alex.williamson, jgg,
	yishaih, shameerali.kolothum.thodi
  Cc: shannon.nelson

On 6/15/2023 11:47 PM, Tian, Kevin wrote:
> 
> 
>> From: Brett Creeley <brett.creeley@amd.com>
>> Sent: Saturday, June 3, 2023 6:03 AM
>>
>> This is a patchset for a new vendor specific VFIO driver
>> (pds_vfio) for use with the AMD/Pensando Distributed Services Card
>> (DSC). This driver makes use of the pds_core driver.
>>
>> This driver will use the pds_core device's adminq as the VFIO
>> control path to the DSC. In order to make adminq calls, the VFIO
>> instance makes use of functions exported by the pds_core driver.
>>
>> In order to receive events from pds_core, the pds_vfio driver
>> registers to a private notifier. This is needed for various events
>> that come from the device.
>>
>> An ASCII diagram of a VFIO instance looks something like this and can
>> be used with the VFIO subsystem to provide the VF device VFIO and live
>> migration support.
>>
>>                                 .------.  .-----------------------.
>>                                 | QEMU |--|  VM  .-------------.  |
>>                                 '......'  |      |   Eth VF    |  |
>>                                    |      |      .-------------.  |
>>                                    |      |      |  SR-IOV VF  |  |
>>                                    |      |      '-------------'  |
>>                                    |      '------------||---------'
>>                                 .--------------.       ||
>>                                 |/dev/<vfio_fd>|       ||
>>                                 '--------------'       ||
>> Host Userspace                         |              ||
>> ===================================================   ||
>> Host Kernel                            |              ||
>>                                    .--------.          ||
>>                                    |vfio-pci|          ||
>>                                    '--------'          ||
>>         .------------------.           ||              ||
>>         |   | exported API |<----+     ||              ||
>>         |   '--------------|     |     ||              ||
>>         |                  |    .-------------.        ||
>>         |     pds_core     |--->|   pds_vfio  |        ||
>>         '------------------' |  '-------------'        ||
>>                 ||           |         ||              ||
>>               09:00.0     notifier    09:00.1          ||
>> == PCI ===============================================||=====
>>                 ||                     ||              ||
>>            .----------.          .----------.          ||
>>      ,-----|    PF    |----------|    VF    |-------------------,
>>      |     '----------'          '----------'  |       VF       |
>>      |                     DSC                 |  data/control  |
>>      |                                         |      path      |
>>      -----------------------------------------------------------
>>
> 
> why is "VF data/control path" drawn out of the VF box?

Just a mistake in the drawing. I can fix it. Thanks.
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v10 vfio 6/7] vfio/pds: Add support for firmware recovery
  2023-06-16  8:24   ` Tian, Kevin
@ 2023-06-17  0:47     ` Brett Creeley
  0 siblings, 0 replies; 40+ messages in thread
From: Brett Creeley @ 2023-06-17  0:47 UTC (permalink / raw)
  To: Tian, Kevin, Brett Creeley, kvm, netdev, alex.williamson, jgg,
	yishaih, shameerali.kolothum.thodi
  Cc: shannon.nelson

On 6/16/2023 1:24 AM, Tian, Kevin wrote:
> 
> 
>> From: Brett Creeley <brett.creeley@amd.com>
>> Sent: Saturday, June 3, 2023 6:03 AM
>>
>> +static void pds_vfio_recovery(struct pds_vfio_pci_device *pds_vfio)
>> +{
>> +     bool deferred_reset_needed = false;
>> +
>> +     /*
>> +      * Documentation states that the kernel migration driver must not
>> +      * generate asynchronous device state transitions outside of
>> +      * manipulation by the user or the VFIO_DEVICE_RESET ioctl.
>> +      *
>> +      * Since recovery is an asynchronous event received from the device,
>> +      * initiate a deferred reset. Only issue the deferred reset if a
>> +      * migration is in progress, which will cause the next step of the
>> +      * migration to fail. Also, if the device is in a state that will
>> +      * be set to VFIO_DEVICE_STATE_RUNNING on the next action (i.e. VM is
>> +      * shutdown and device is in VFIO_DEVICE_STATE_STOP) as that will clear
>> +      * the VFIO_DEVICE_STATE_ERROR when the VM starts back up.
> 
> the last sentence after "Also, ..." is incomplete?

Yeah, not sure what happened there. Will fix. Thanks.

> 
>> +      */
>> +     mutex_lock(&pds_vfio->state_mutex);
>> +     if ((pds_vfio->state != VFIO_DEVICE_STATE_RUNNING &&
>> +          pds_vfio->state != VFIO_DEVICE_STATE_ERROR) ||
>> +         (pds_vfio->state == VFIO_DEVICE_STATE_RUNNING &&
>> +          pds_vfio_dirty_is_enabled(pds_vfio)))
>> +             deferred_reset_needed = true;
> 
> any unwind to be done in the dirty tracking path? When firmware crashes
> presumably the cmd to retrieve dirty pages is also blocked...

Hmm. I'll double check this. Thanks.

> 
>> +     mutex_unlock(&pds_vfio->state_mutex);
>> +
>> +     /*
>> +      * On the next user initiated state transition, the device will
>> +      * transition to the VFIO_DEVICE_STATE_ERROR. At this point it's the user's
>> +      * responsibility to reset the device.
>> +      *
>> +      * If a VFIO_DEVICE_RESET is requested post recovery and before the next
>> +      * state transition, then the deferred reset state will be set to
>> +      * VFIO_DEVICE_STATE_RUNNING.
>> +      */
>> +     if (deferred_reset_needed)
>> +             pds_vfio_deferred_reset(pds_vfio, VFIO_DEVICE_STATE_ERROR);
> 
> open-code as here is the only caller.
> 
>> +}
>> +
>> +static int pds_vfio_pci_notify_handler(struct notifier_block *nb,
>> +                                    unsigned long ecode, void *data)
>> +{
>> +     struct pds_vfio_pci_device *pds_vfio =
>> +             container_of(nb, struct pds_vfio_pci_device, nb);
>> +     struct device *dev = pds_vfio_to_dev(pds_vfio);
>> +     union pds_core_notifyq_comp *event = data;
>> +
>> +     dev_dbg(dev, "%s: event code %lu\n", __func__, ecode);
>> +
>> +     /*
>> +      * We don't need to do anything for RESET state==0 as there is no notify
>> +      * or feedback mechanism available, and it is possible that we won't
>> +      * even see a state==0 event.
>> +      *
>> +      * Any requests from VFIO while state==0 will fail, which will return
>> +      * error and may cause migration to fail.
>> +      */
>> +     if (ecode == PDS_EVENT_RESET) {
>> +             dev_info(dev, "%s: PDS_EVENT_RESET event received, state==%d\n",
>> +                      __func__, event->reset.state);
>> +             if (event->reset.state == 1)
>> +                     pds_vfio_recovery(pds_vfio);
>> +     }
> 
> Please explain what state==0 is, and why state==1 is handled while
> state==2 is not.

Sure, will clarify. Thanks.

> 
>> @@ -33,10 +33,13 @@ void pds_vfio_state_mutex_unlock(struct pds_vfio_pci_device *pds_vfio)
>>        if (pds_vfio->deferred_reset) {
>>                pds_vfio->deferred_reset = false;
>>                if (pds_vfio->state == VFIO_DEVICE_STATE_ERROR) {
>> -                     pds_vfio->state = VFIO_DEVICE_STATE_RUNNING;
>> +                     pds_vfio->state = pds_vfio->deferred_reset_state;
>>                        pds_vfio_put_restore_file(pds_vfio);
>>                        pds_vfio_put_save_file(pds_vfio);
>> +             } else if (pds_vfio->deferred_reset_state == VFIO_DEVICE_STATE_ERROR) {
>> +                     pds_vfio->state = VFIO_DEVICE_STATE_ERROR;
>>                }
>> +             pds_vfio->deferred_reset_state = VFIO_DEVICE_STATE_RUNNING;
> 
> this is not required. 'deferred_reset_state' should be set only when
> deferred_reset is true. Currently only in the notify path and reset path.
> 
> So the last assignment is pointless.
> 
> It's simpler to be:
> 
>          if (pds_vfio->deferred_reset) {
>                  pds_vfio->deferred_reset = false;
>                  if (pds_vfio->state == VFIO_DEVICE_STATE_ERROR) {
>                          pds_vfio_put_restore_file(pds_vfio);
>                          pds_vfio_put_save_file(pds_vfio);
>                  }
>                  pds_vfio->state = pds_vfio->deferred_reset_state;
>                  ...
>          }

I think that makes sense. I will take another look and fix/improve this 
on the next version.
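
A minimal userspace model of the simplified unlock-path handling suggested
above, only to make the resulting transitions easy to check. The enum,
struct and function names are hypothetical stand-ins for the driver's types,
and the part elided as "..." in the suggestion is deliberately not modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of device states; not the VFIO uAPI enum. */
enum fake_state {
    FAKE_STATE_ERROR,
    FAKE_STATE_STOP,
    FAKE_STATE_RUNNING,
};

/* Hypothetical stand-in for the driver's per-device structure. */
struct fake_pds_vfio {
    enum fake_state state;
    enum fake_state deferred_reset_state;
    bool deferred_reset;
    int files_released; /* counts calls that would drop save/restore files */
};

/* Stands in for releasing the restore and save files. */
static void fake_put_files(struct fake_pds_vfio *p)
{
    p->files_released++;
}

/*
 * Simplified form: release the files only when leaving the error state,
 * then always adopt whatever state the deferred reset asked for.
 */
static void fake_handle_deferred_reset(struct fake_pds_vfio *p)
{
    assert(p);

    if (!p->deferred_reset)
        return;

    p->deferred_reset = false;
    if (p->state == FAKE_STATE_ERROR)
        fake_put_files(p);
    p->state = p->deferred_reset_state;
}
```

In this model deferred_reset_state is only read while deferred_reset is set,
which is the property that makes resetting it afterwards unnecessary.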
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v10 vfio 4/7] vfio/pds: Add VFIO live migration support
  2023-06-16  8:06   ` Tian, Kevin
@ 2023-06-17  4:45     ` Brett Creeley
  2023-06-20  2:19       ` Tian, Kevin
  2023-06-19 12:46     ` Jason Gunthorpe
  1 sibling, 1 reply; 40+ messages in thread
From: Brett Creeley @ 2023-06-17  4:45 UTC (permalink / raw)
  To: Tian, Kevin, Brett Creeley, kvm, netdev, alex.williamson, jgg,
	yishaih, shameerali.kolothum.thodi
  Cc: shannon.nelson

On 6/16/2023 1:06 AM, Tian, Kevin wrote:
> 
> 
>> From: Brett Creeley <brett.creeley@amd.com>
>> Sent: Saturday, June 3, 2023 6:03 AM
>>
>> Add live migration support via the VFIO subsystem. The migration
>> implementation aligns with the definition from uapi/vfio.h and uses
>> the pds_core PF's adminq for device configuration.
>>
>> The ability to suspend, resume, and transfer VF device state data is
>> included along with the required admin queue command structures and
>> implementations.
>>
>> PDS_LM_CMD_SUSPEND and PDS_LM_CMD_SUSPEND_STATUS are added to support
>> the VF device suspend operation.
>>
>> PDS_LM_CMD_RESUME is added to support the VF device resume operation.
>>
>> PDS_LM_CMD_STATUS is added to determine the exact size of the VF
>> device state data.
>>
>> PDS_LM_CMD_SAVE is added to get the VF device state data.
>>
>> PDS_LM_CMD_RESTORE is added to restore the VF device with the
>> previously saved data from PDS_LM_CMD_SAVE.
>>
>> PDS_LM_CMD_HOST_VF_STATUS is added to notify the device when
>> a migration is in/not-in progress from the host's perspective.
> 
> Here is 'the device' referring to the PF or VF?

Device is referring to the DSC/firmware not the function. I will clarify 
the wording here. Thanks.

> 
> and how would the device use this information?
> 
>> +
>> +static int pds_vfio_client_adminq_cmd(struct pds_vfio_pci_device *pds_vfio,
>> +                                   union pds_core_adminq_cmd *req,
>> +                                   size_t req_len,
>> +                                   union pds_core_adminq_comp *resp,
>> +                                   u64 flags)
>> +{
>> +     union pds_core_adminq_cmd cmd = {};
>> +     size_t cp_len;
>> +     int err;
>> +
>> +     /* Wrap the client request */
>> +     cmd.client_request.opcode = PDS_AQ_CMD_CLIENT_CMD;
>> +     cmd.client_request.client_id = cpu_to_le16(pds_vfio->client_id);
>> +     cp_len = min_t(size_t, req_len, sizeof(cmd.client_request.client_cmd));
> 
> 'req_len' is kind of redundant. Looks all the callers use sizeof(req).

It does a memcpy based on the min size between req_len and the size of 
the request.

> 
>> +static int
>> +pds_vfio_suspend_wait_device_cmd(struct pds_vfio_pci_device *pds_vfio)
>> +{
>> +     union pds_core_adminq_cmd cmd = {
>> +             .lm_suspend_status = {
>> +                     .opcode = PDS_LM_CMD_SUSPEND_STATUS,
>> +                     .vf_id = cpu_to_le16(pds_vfio->vf_id),
>> +             },
>> +     };
>> +     struct device *dev = pds_vfio_to_dev(pds_vfio);
>> +     union pds_core_adminq_comp comp = {};
>> +     unsigned long time_limit;
>> +     unsigned long time_start;
>> +     unsigned long time_done;
>> +     int err;
>> +
>> +     time_start = jiffies;
>> +     time_limit = time_start + HZ * SUSPEND_TIMEOUT_S;
>> +     do {
>> +             err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd),
>> +                                              &comp, PDS_AQ_FLAG_FASTPOLL);
>> +             if (err != -EAGAIN)
>> +                     break;
>> +
>> +             msleep(SUSPEND_CHECK_INTERVAL_MS);
>> +     } while (time_before(jiffies, time_limit));
> 
> pds_vfio_client_adminq_cmd() has the exactly same mechanism
> with 5s timeout and 1ms poll interval when FASTPOLL is set.
> 
> probably you can introduce another flag to indicate retry on
> -EAGAIN and then handle it fully in pds_vfio_client_adminq_cmd()?

That's the entire purpose of this command. It uses 
pds_vfio_client_adminq_cmd() to poll the SUSPEND_STATUS. IMHO adding the 
polling mechanism in pds_vfio_client_adminq_cmd() and using it depending 
on a flag is just adding to the complexity and not offering any benefit. 
I plan to keep this function as is to separate the functionality. Thanks.

> 
>> +int pds_vfio_suspend_device_cmd(struct pds_vfio_pci_device *pds_vfio)
>> +{
>> +     union pds_core_adminq_cmd cmd = {
>> +             .lm_suspend = {
>> +                     .opcode = PDS_LM_CMD_SUSPEND,
>> +                     .vf_id = cpu_to_le16(pds_vfio->vf_id),
>> +             },
>> +     };
>> +     struct device *dev = pds_vfio_to_dev(pds_vfio);
>> +     union pds_core_adminq_comp comp = {};
>> +     int err;
>> +
>> +     dev_dbg(dev, "vf%u: Suspend device\n", pds_vfio->vf_id);
>> +
>> +     err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp,
>> +                                      PDS_AQ_FLAG_FASTPOLL);
>> +     if (err) {
>> +             dev_err(dev, "vf%u: Suspend failed: %pe\n", pds_vfio->vf_id,
>> +                     ERR_PTR(err));
>> +             return err;
>> +     }
>> +
>> +     return pds_vfio_suspend_wait_device_cmd(pds_vfio);
>> +}
> 
> The logic in this function is very confusing.
> 
> PDS_LM_CMD_SUSPEND has a completion record:
> 
> +struct pds_lm_suspend_comp {
> +       u8     status;
> +       u8     rsvd;
> +       __le16 comp_index;
> +       union {
> +               __le64 state_size;
> +               u8     rsvd2[11];
> +       } __packed;
> +       u8     color;
> 
> Presumably this function can look at the completion record to know whether
> the suspend request succeeds.
> 
> Why do you require another wait_device step to query the suspend status?

The driver sends the initial suspend request to tell the DSC/firmware to 
suspend the VF's data/control path. The DSC/firmware will ack/nack the 
suspend request in the completion.

Then the driver polls the DSC/firmware to find when the VF's 
data/control path has been fully suspended. When the DSC/firmware isn't 
done suspending yet it will return -EAGAIN. Otherwise it will return 
success/failure.

I will add some comments clarifying these details.
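
A minimal userspace sketch of the two-step flow described above: one request
that the firmware acks or nacks, followed by status polls that return -EAGAIN
until the suspend has finished. struct fake_fw and the fake_* functions are
hypothetical models, max_polls stands in for timeout divided by check
interval, and returning -ETIMEDOUT when the budget runs out is an assumption
of the sketch, not something taken from the patch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical firmware model. */
struct fake_fw {
    int suspend_ack;    /* 0 = request accepted, negative errno = nack */
    int busy_polls;     /* status polls that still return -EAGAIN */
    int final_status;   /* result once the suspend has finished */
};

/* Step 1: the completion of the suspend request is only an ack/nack. */
static int fake_suspend_request(struct fake_fw *fw)
{
    return fw->suspend_ack;
}

/* Step 2: status query, -EAGAIN while the data/control path is draining. */
static int fake_suspend_status(struct fake_fw *fw)
{
    if (fw->busy_polls > 0) {
        fw->busy_polls--;
        return -EAGAIN;
    }
    return fw->final_status;
}

/* Request the suspend, then poll until done, failed, or out of budget. */
static int fake_suspend_device(struct fake_fw *fw, int max_polls)
{
    int err;
    int i;

    assert(fw && max_polls > 0);

    err = fake_suspend_request(fw);
    if (err)
        return err;

    for (i = 0; i < max_polls; i++) {
        err = fake_suspend_status(fw);
        if (err != -EAGAIN)
            return err;
        /* A real driver would sleep for the check interval here. */
    }

    return -ETIMEDOUT;
}
```

Counting polls instead of reading a clock keeps the sketch deterministic; the
driver loop in the patch bounds the same wait with jiffies and msleep().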

> 
> and I have another question. Is it correct to hard-code the 5s timeout in
> the kernel w/o any input from the VMM? Note the guest has been stopped
> at this point then very likely the 5s timeout will kill any reasonable SLA which
> CSPs try to reach hard.

This gives the device a max of 5 seconds to suspend the VF's 
data/control path.

> 
> Ideally the VMM has an estimation how long a VM can be paused based on
> SLA, to-be-migrated state size, available network bandwidth, etc. and that
> hint should be passed to the kernel so any state transition which may violate
> that expectation can fail quickly to break the migration process and put the
> VM back to the running state.

For QEMU there is a parameter that can specify the downtime-limit that's 
used as you mentioned, but this does not include how long it takes the 
device to STOP (i.e. suspend the data/control path).

> 
> Jason/Shameer, is there similar concern in mlx/hisilicon drivers?
> 
>> +
>> +int pds_vfio_resume_device_cmd(struct pds_vfio_pci_device *pds_vfio)
>> +{
>> +     union pds_core_adminq_cmd cmd = {
>> +             .lm_resume = {
>> +                     .opcode = PDS_LM_CMD_RESUME,
>> +                     .vf_id = cpu_to_le16(pds_vfio->vf_id),
>> +             },
>> +     };
>> +     struct device *dev = pds_vfio_to_dev(pds_vfio);
>> +     union pds_core_adminq_comp comp = {};
>> +
>> +     dev_dbg(dev, "vf%u: Resume device\n", pds_vfio->vf_id);
>> +
>> +     return pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp,
>> +                                       0);
> 
> 'resume' is also in the blackout phase when the guest is not running.
> 
> So presumably FAST_POLL should be set otherwise the max 256ms
> poll interval (PDSC_ADMINQ_MAX_POLL_INTERVAL) is really inefficient.

Yeah this is a good catch. I think setting fast_poll = true would be a 
good idea here. Thanks.

> 
>> +
>> +     if (cur == VFIO_DEVICE_STATE_RUNNING && next == VFIO_DEVICE_STATE_RUNNING_P2P) {
>> +             pds_vfio_send_host_vf_lm_status_cmd(pds_vfio,
>> +                                                 PDS_LM_STA_IN_PROGRESS);
>> +             err = pds_vfio_suspend_device_cmd(pds_vfio);
>> +             if (err)
>> +                     return ERR_PTR(err);
>> +
>> +             return NULL;
>> +     }
>> +
>> +     if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && next == VFIO_DEVICE_STATE_RUNNING) {
>> +             err = pds_vfio_resume_device_cmd(pds_vfio);
>> +             if (err)
>> +                     return ERR_PTR(err);
>> +
>> +             pds_vfio_send_host_vf_lm_status_cmd(pds_vfio, PDS_LM_STA_NONE);
>> +             return NULL;
>> +     }
>> +
>> +     if (cur == VFIO_DEVICE_STATE_STOP && next == VFIO_DEVICE_STATE_RUNNING_P2P)
>> +             return NULL;
>> +
>> +     if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && next == VFIO_DEVICE_STATE_STOP)
>> +             return NULL;
> 
> I'm not sure whether P2P is actually supported here. By definition
> P2P means the device is stopped but still responds to p2p request
> from other devices. If you look at mlx example it uses different
> cmds between RUNNING->RUNNING_P2P and RUNNING_P2P->STOP.
> 
> But in your case seems you simply move what is required in STOP
> into P2P. Probably you can just remove the support of P2P like
> hisilicon does.

In a previous review it was mentioned that P2P is more or less supported 
and this is how we are able to support it. Ideally we would not set the 
P2P feature and just implement the standard STOP/RUNNING states.

> 
>> +
>> +/**
>> + * struct pds_lm_comp - generic command completion
>> + * @status:  Status of the command (enum pds_core_status_code)
>> + * @rsvd:    Structure padding to 16 Bytes
>> + */
>> +struct pds_lm_comp {
>> +     u8 status;
>> +     u8 rsvd[15];
>> +};
> 
> not used. Looks most comp structures are defined w/o an user
> except struct pds_lm_status_comp.

I will look into this. Thanks.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v10 vfio 0/7] pds_vfio driver
  2023-06-02 22:03 [PATCH v10 vfio 0/7] pds_vfio driver Brett Creeley
                   ` (8 preceding siblings ...)
  2023-06-16  6:47 ` Tian, Kevin
@ 2023-06-17  4:49 ` Brett Creeley
  9 siblings, 0 replies; 40+ messages in thread
From: Brett Creeley @ 2023-06-17  4:49 UTC (permalink / raw)
  To: Brett Creeley, kvm, netdev, alex.williamson, jgg, yishaih,
	shameerali.kolothum.thodi, kevin.tian
  Cc: shannon.nelson

On 6/2/2023 3:03 PM, Brett Creeley wrote:
> This is a patchset for a new vendor specific VFIO driver
> (pds_vfio) for use with the AMD/Pensando Distributed Services Card
> (DSC). This driver makes use of the pds_core driver.
> 
> This driver will use the pds_core device's adminq as the VFIO
> control path to the DSC. In order to make adminq calls, the VFIO
> instance makes use of functions exported by the pds_core driver.
> 
> In order to receive events from pds_core, the pds_vfio driver
> registers to a private notifier. This is needed for various events
> that come from the device.
> 
> An ASCII diagram of a VFIO instance looks something like this and can
> be used with the VFIO subsystem to provide the VF device VFIO and live
> migration support.
> 
>                                 .------.  .-----------------------.
>                                 | QEMU |--|  VM  .-------------.  |
>                                 '......'  |      |   Eth VF    |  |
>                                    |      |      .-------------.  |
>                                    |      |      |  SR-IOV VF  |  |
>                                    |      |      '-------------'  |
>                                    |      '------------||---------'
>                                 .--------------.       ||
>                                 |/dev/<vfio_fd>|       ||
>                                 '--------------'       ||
> Host Userspace                         |              ||
> ===================================================   ||
> Host Kernel                            |              ||
>                                    .--------.          ||
>                                    |vfio-pci|          ||
>                                    '--------'          ||
>         .------------------.           ||              ||
>         |   | exported API |<----+     ||              ||
>         |   '--------------|     |     ||              ||
>         |                  |    .-------------.        ||
>         |     pds_core     |--->|   pds_vfio  |        ||
>         '------------------' |  '-------------'        ||
>                 ||           |         ||              ||
>               09:00.0     notifier    09:00.1          ||
> == PCI ===============================================||=====
>                 ||                     ||              ||
>            .----------.          .----------.          ||
>      ,-----|    PF    |----------|    VF    |-------------------,
>      |     '----------'          '----------'  |       VF       |
>      |                     DSC                 |  data/control  |
>      |                                         |      path      |
>      -----------------------------------------------------------
> 
> 
> The pds_vfio driver is targeted to reside in drivers/vfio/pci/pds.
> It makes use of and introduces new files in the common include/linux/pds
> include directory.
> 
> Changes:
>  > v10:

Just as a quick note, I don't plan to push v11 next week as I am on 
vacation. I appreciate all the reviews and hope to have updates when I 
get back on the week of 6/26.

Thanks,

Brett

[...]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v10 vfio 4/7] vfio/pds: Add VFIO live migration support
  2023-06-16  8:06   ` Tian, Kevin
  2023-06-17  4:45     ` Brett Creeley
@ 2023-06-19 12:46     ` Jason Gunthorpe
  2023-06-20  2:02       ` Tian, Kevin
  1 sibling, 1 reply; 40+ messages in thread
From: Jason Gunthorpe @ 2023-06-19 12:46 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Brett Creeley, kvm, netdev, alex.williamson, yishaih,
	shameerali.kolothum.thodi, shannon.nelson

On Fri, Jun 16, 2023 at 08:06:21AM +0000, Tian, Kevin wrote:

> Ideally the VMM has an estimation how long a VM can be paused based on
> SLA, to-be-migrated state size, available network bandwidth, etc. and that
> hint should be passed to the kernel so any state transition which may violate
> that expectation can fail quickly to break the migration process and put the
> VM back to the running state.
> 
> Jason/Shameer, is there similar concern in mlx/hisilicon drivers? 

It is handled through the vfio_device_feature_mig_data_size mechanism..

> > +	if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && next == VFIO_DEVICE_STATE_STOP)
> > +		return NULL;
> 
> I'm not sure whether P2P is actually supported here. By definition
> P2P means the device is stopped but still responds to p2p request
> from other devices. If you look at mlx example it uses different
> cmds between RUNNING->RUNNING_P2P and RUNNING_P2P->STOP.
> 
> But in your case seems you simply move what is required in STOP
> into P2P. Probably you can just remove the support of P2P like
> hisilicon does.

We want new devices to get their architecture right, they need to
support P2P. Didn't we talk about this already and Brett was going to
fix it?

Jason

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH v10 vfio 4/7] vfio/pds: Add VFIO live migration support
  2023-06-19 12:46     ` Jason Gunthorpe
@ 2023-06-20  2:02       ` Tian, Kevin
  2023-06-20 12:31         ` Jason Gunthorpe
  0 siblings, 1 reply; 40+ messages in thread
From: Tian, Kevin @ 2023-06-20  2:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Brett Creeley, kvm, netdev, alex.williamson, yishaih,
	shameerali.kolothum.thodi, shannon.nelson

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, June 19, 2023 8:47 PM
> 
> On Fri, Jun 16, 2023 at 08:06:21AM +0000, Tian, Kevin wrote:
> 
> > Ideally the VMM has an estimation how long a VM can be paused based on
> > SLA, to-be-migrated state size, available network bandwidth, etc. and that
> > hint should be passed to the kernel so any state transition which may violate
> > that expectation can fail quickly to break the migration process and put the
> > VM back to the running state.
> >
> > Jason/Shameer, is there similar concern in mlx/hisilicon drivers?
> 
> It is handled through the vfio_device_feature_mig_data_size mechanism..

that is only for estimation of copied data.

IMHO the stop time when the VM is paused includes both the time of
stopping the device and the time of migrating the VM state.

For a software-emulated device the time of stopping the device is negligible.

But for an assigned device, the worst-case hard-coded 5s timeout in this
patch will certainly kill whatever reasonable 'VM dead time' SLA (usually
in milliseconds) that CSPs try to meet purely based on the size of copied
data.

Wouldn't a user-specified stop-device timeout be required to at least allow
breaking migration early according to the desired SLA?

> 
> > > +	if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && next == VFIO_DEVICE_STATE_STOP)
> > > +		return NULL;
> >
> > I'm not sure whether P2P is actually supported here. By definition
> > P2P means the device is stopped but still responds to p2p request
> > from other devices. If you look at mlx example it uses different
> > cmds between RUNNING->RUNNING_P2P and RUNNING_P2P->STOP.
> >
> > But in your case seems you simply move what is required in STOP
> > into P2P. Probably you can just remove the support of P2P like
> > hisilicon does.
> 
> We want new devices to get their architecture right, they need to
> support P2P. Didn't we talk about this already and Brett was going to
> fix it?
> 

Looks like it's not fixed, since RUNNING_P2P->STOP is a nop in this patch.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH v10 vfio 3/7] vfio/pds: register with the pds_core PF
  2023-06-16 19:01     ` Brett Creeley
@ 2023-06-20  2:11       ` Tian, Kevin
  0 siblings, 0 replies; 40+ messages in thread
From: Tian, Kevin @ 2023-06-20  2:11 UTC (permalink / raw)
  To: Brett Creeley, Brett Creeley, kvm, netdev, alex.williamson, jgg,
	yishaih, shameerali.kolothum.thodi
  Cc: shannon.nelson

> From: Brett Creeley <bcreeley@amd.com>
> Sent: Saturday, June 17, 2023 3:02 AM
> >
> >>
> >> +#define PDS_LM_DEV_NAME              PDS_CORE_DRV_NAME "." PDS_DEV_TYPE_LM_STR
> >> +
> >
> > should this name include a 'vfio' string?
> 
> This aligns with what our DSC/firmware expects, so no it's not needed.

/**
 * pds_client_register - Link the client to the firmware
 * @pf_pdev:    ptr to the PF driver struct
 * @devname:    name that includes service into, e.g. pds_core.vDPA

The comment mentions vDPA, which left me unsure whether the name should
include a keyword identifying the client.

e.g. here the name is "pds_core.LM". If both VFIO and vDPA want to support
live migration with pds_core, will there be a conflict, or is it fine for
multiple drivers to register to the same service?

* RE: [PATCH v10 vfio 4/7] vfio/pds: Add VFIO live migration support
  2023-06-17  4:45     ` Brett Creeley
@ 2023-06-20  2:19       ` Tian, Kevin
  0 siblings, 0 replies; 40+ messages in thread
From: Tian, Kevin @ 2023-06-20  2:19 UTC (permalink / raw)
  To: Brett Creeley, Brett Creeley, kvm, netdev, alex.williamson, jgg,
	yishaih, shameerali.kolothum.thodi
  Cc: shannon.nelson

> From: Brett Creeley <bcreeley@amd.com>
> Sent: Saturday, June 17, 2023 12:45 PM
> 
> On 6/16/2023 1:06 AM, Tian, Kevin wrote:
> >
> >> From: Brett Creeley <brett.creeley@amd.com>
> >> Sent: Saturday, June 3, 2023 6:03 AM
> >>
> >> +
> >> +static int pds_vfio_client_adminq_cmd(struct pds_vfio_pci_device
> *pds_vfio,
> >> +                                   union pds_core_adminq_cmd *req,
> >> +                                   size_t req_len,
> >> +                                   union pds_core_adminq_comp *resp,
> >> +                                   u64 flags)
> >> +{
> >> +     union pds_core_adminq_cmd cmd = {};
> >> +     size_t cp_len;
> >> +     int err;
> >> +
> >> +     /* Wrap the client request */
> >> +     cmd.client_request.opcode = PDS_AQ_CMD_CLIENT_CMD;
> >> +     cmd.client_request.client_id = cpu_to_le16(pds_vfio->client_id);
> >> +     cp_len = min_t(size_t, req_len,
> >> sizeof(cmd.client_request.client_cmd));
> >
> > 'req_len' is kind of redundant. Looks like all the callers use sizeof(req).
> 
> It does a memcpy based on the min size between req_len and the size of
> the request.

If all the callers just pass in sizeof(union) as 'req_len', then it's pointless
to do min_t and you can just use sizeof(cmd.client_request.client_cmd) here
which is always smaller than or equal to the sizeof(union).
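
To illustrate the point above, here is a minimal standalone sketch (not code from the patch). The union layout, the sizes, and every demo_* name are hypothetical stand-ins for the pds_core types. It only shows that when every caller passes sizeof(union) as the length, clamping with a min gives the same copy length as using the size of the embedded client_cmd field directly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the adminq command union; sizes are illustrative. */
#define DEMO_CMD_SIZE 64
#define DEMO_HDR_SIZE 4

union demo_adminq_cmd {
	unsigned char raw[DEMO_CMD_SIZE];
	struct {
		unsigned char hdr[DEMO_HDR_SIZE];
		unsigned char client_cmd[DEMO_CMD_SIZE - DEMO_HDR_SIZE];
	} client_request;
};

/* Variant A: the caller supplies req_len and the wrapper clamps it. */
static size_t demo_wrap_with_len(union demo_adminq_cmd *cmd,
				 const union demo_adminq_cmd *req,
				 size_t req_len)
{
	size_t max_len = sizeof(cmd->client_request.client_cmd);
	size_t cp_len = req_len < max_len ? req_len : max_len;

	assert(cmd && req);
	memcpy(cmd->client_request.client_cmd, req, cp_len);
	return cp_len;
}

/* Variant B: no req_len; always copy exactly what fits in client_cmd. */
static size_t demo_wrap_fixed(union demo_adminq_cmd *cmd,
			      const union demo_adminq_cmd *req)
{
	size_t cp_len = sizeof(cmd->client_request.client_cmd);

	assert(cmd && req);
	memcpy(cmd->client_request.client_cmd, req, cp_len);
	return cp_len;
}
```

With req_len == sizeof(union demo_adminq_cmd), both variants copy the same bytes. The extra parameter only matters if some caller passes a request shorter than client_cmd.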

> >> +
> >> +     err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd),
> >> &comp,
> >> +                                      PDS_AQ_FLAG_FASTPOLL);
> >> +     if (err) {
> >> +             dev_err(dev, "vf%u: Suspend failed: %pe\n", pds_vfio->vf_id,
> >> +                     ERR_PTR(err));
> >> +             return err;
> >> +     }
> >> +
> >> +     return pds_vfio_suspend_wait_device_cmd(pds_vfio);
> >> +}
> >
> > The logic in this function is very confusing.
> >
> > PDS_LM_CMD_SUSPEND has a completion record:
> >
> > +struct pds_lm_suspend_comp {
> > +       u8     status;
> > +       u8     rsvd;
> > +       __le16 comp_index;
> > +       union {
> > +               __le64 state_size;
> > +               u8     rsvd2[11];
> > +       } __packed;
> > +       u8     color;
> >
> > Presumably this function can look at the completion record to know
> whether
> > the suspend request succeeds.
> >
> > Why do you require another wait_device step to query the suspend status?
> 
> The driver sends the initial suspend request to tell the DSC/firmware to
> suspend the VF's data/control path. The DSC/firmware will ack/nack the
> suspend request in the completion.
> 
> Then the driver polls the DSC/firmware to find when the VF's
> data/control path has been fully suspended. When the DSC/firmware isn't
> done suspending yet it will return -EAGAIN. Otherwise it will return
> success/failure.
> 
> I will add some comments clarifying these details.

Yes, more comments are welcome.
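
For readers following along, the two-step flow described above (send the suspend request, then poll until the device stops returning -EAGAIN) could be sketched roughly as below. This is a simplified userspace model, not the driver's code: struct demo_dev, its callbacks, and the poll limit are all hypothetical.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical device model: callbacks stand in for adminq commands. */
struct demo_dev {
	int (*send_suspend)(struct demo_dev *dev); /* 0 = acked, <0 = nacked */
	int (*poll_status)(struct demo_dev *dev);  /* -EAGAIN while suspending */
	int busy_polls;                            /* used by the fake below */
};

/* Step 1: request the suspend. Step 2: poll until it has finished. */
static int demo_suspend(struct demo_dev *dev, unsigned int max_polls)
{
	unsigned int i;
	int err;

	err = dev->send_suspend(dev);
	if (err)
		return err; /* request was rejected outright */

	for (i = 0; i < max_polls; i++) {
		err = dev->poll_status(dev);
		if (err != -EAGAIN)
			return err; /* final success or failure */
		/* a real driver would sleep between polls here */
	}
	return -ETIMEDOUT; /* device never finished suspending */
}

/* Fake firmware: acks immediately, reports busy for busy_polls polls. */
static int demo_fake_send(struct demo_dev *dev)
{
	(void)dev;
	return 0;
}

static int demo_fake_poll(struct demo_dev *dev)
{
	if (dev->busy_polls > 0) {
		dev->busy_polls--;
		return -EAGAIN;
	}
	return 0;
}
```

The completion of the first command only says whether the request was accepted. The polling loop is what tells the caller that the suspend has actually finished.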

It's also misleading to have a 'state_size' field in suspend_comp. Conceptually,
the firmware cannot calculate it accurately before the VF is fully suspended.

* Re: [PATCH v10 vfio 4/7] vfio/pds: Add VFIO live migration support
  2023-06-20  2:02       ` Tian, Kevin
@ 2023-06-20 12:31         ` Jason Gunthorpe
  2023-06-21  6:49           ` Tian, Kevin
  0 siblings, 1 reply; 40+ messages in thread
From: Jason Gunthorpe @ 2023-06-20 12:31 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Brett Creeley, kvm, netdev, alex.williamson, yishaih,
	shameerali.kolothum.thodi, shannon.nelson

On Tue, Jun 20, 2023 at 02:02:44AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Monday, June 19, 2023 8:47 PM
> > 
> > On Fri, Jun 16, 2023 at 08:06:21AM +0000, Tian, Kevin wrote:
> > 
> > > Ideally the VMM has an estimation how long a VM can be paused based on
> > > SLA, to-be-migrated state size, available network bandwidth, etc. and that
> > > hint should be passed to the kernel so any state transition which may
> > violate
> > > that expectation can fail quickly to break the migration process and put the
> > > VM back to the running state.
> > >
> > > Jason/Shameer, is there similar concern in mlx/hisilicon drivers?
> > 
> > It is handled through the vfio_device_feature_mig_data_size mechanism..
> 
> that is only for estimation of copied data.
> 
> IMHO the stop time when the VM is paused includes both the time of
> stopping the device and the time of migrating the VM state.
> 
> For a software-emulated device the time of stopping the device is negligible.
> 
> But certainly for assigned device the worst-case hard-coded 5s timeout as
> done in this patch will kill whatever reasonable 'VM dead time' SLA (usually
> in milliseconds) which CSPs try to meet purely based on the size of copied
> data.

There is not a lot that can be done here, the stop time cannot be
predicted in advance on these devices - the system relies on the
device having a reasonable time window.

> Wouldn't a user-specified stop-device timeout be required to at least allow
> breaking migration early according to the desired SLA?

Not really, the device is going to still execute the stop regardless
of the timeout, and when it does the VM will be broken.

With a FW approach like this it is pretty stuck, we need the FW to
remain in sync as the highest priority.

> > We want new devices to get their architecture right, they need to
> > support P2P. Didn't we talk about this already and Brett was going to
> > fix it?
> 
> Looks like it's not fixed, since RUNNING_P2P->STOP is a nop in this patch.

That could be OK; it needs a comment explaining why it is OK.

Jason

* RE: [PATCH v10 vfio 4/7] vfio/pds: Add VFIO live migration support
  2023-06-20 12:31         ` Jason Gunthorpe
@ 2023-06-21  6:49           ` Tian, Kevin
  2023-06-21 13:27             ` Jason Gunthorpe
  0 siblings, 1 reply; 40+ messages in thread
From: Tian, Kevin @ 2023-06-21  6:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Brett Creeley, kvm, netdev, alex.williamson, yishaih,
	shameerali.kolothum.thodi, shannon.nelson

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, June 20, 2023 8:31 PM
> 
> On Tue, Jun 20, 2023 at 02:02:44AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Monday, June 19, 2023 8:47 PM
> > >
> > > On Fri, Jun 16, 2023 at 08:06:21AM +0000, Tian, Kevin wrote:
> > >
> > > > Ideally the VMM has an estimation how long a VM can be paused based
> on
> > > > SLA, to-be-migrated state size, available network bandwidth, etc. and
> that
> > > > hint should be passed to the kernel so any state transition which may
> > > violate
> > > > that expectation can fail quickly to break the migration process and put
> the
> > > > VM back to the running state.
> > > >
> > > > Jason/Shameer, is there similar concern in mlx/hisilicon drivers?
> > >
> > > It is handled through the vfio_device_feature_mig_data_size mechanism..
> >
> > that is only for estimation of copied data.
> >
> > IMHO the stop time when the VM is paused includes both the time of
> > stopping the device and the time of migrating the VM state.
> >
> > For a software-emulated device the time of stopping the device is negligible.
> >
> > But certainly for assigned device the worst-case hard-coded 5s timeout as
> > done in this patch will kill whatever reasonable 'VM dead time' SLA (usually
> > in milliseconds) which CSPs try to meet purely based on the size of copied
> > data.
> 
> There is not a lot that can be done here, the stop time cannot be
> predicted in advance on these devices - the system relies on the
> device having a reasonable time window.

What are the criteria for 'reasonable'? How do CSPs judge that such a
device can guarantee a *reliable* reasonable window so live migration
can be enabled in the production environment?

I'm afraid that we are hiding a non-deterministic factor in current protocol.

Looking at the mlx5 case, which has an even larger timeout:

	 [MLX5_TO_CMD_MS] = 60000,

> 
> > Wouldn't a user-specified stop-device timeout be required to at least allow
> > breaking migration early according to the desired SLA?
> 
> Not really, the device is going to still execute the stop regardless
> of the timeout, and when it does the VM will be broken.
> 
> With a FW approach like this it is pretty stuck, we need the FW to
> remain in sync as the highest priority.

This makes some sense.

But still I don't think it's a good situation where the user has ZERO
knowledge about the non-negligible time in the stopping path...

> 
> > > We want new devices to get their architecture right, they need to
> > > support P2P. Didn't we talk about this already and Brett was going to
> > > fix it?
> >
> > Looks like it's not fixed, since RUNNING_P2P->STOP is a nop in this patch.
> 
> That could be OK, it needs a comment explaining why it is OK
> 

Yes, a comment is welcome. Having RUNNING_P2P->STOP as a nop
kind of suggests that the device has been fully stopped in RUNNING_P2P
to meet the definition of the STOP state. But then it violates the
definition of RUNNING_P2P.
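
To make the concern concrete, here is a toy model (not the VFIO uAPI and not the patch) of a device in which the two arcs do distinct work. The enum, the two flags, and demo_step() are hypothetical. They only encode the definition quoted earlier in the thread: in RUNNING_P2P the device is stopped but still responds to P2P requests from other devices.

```c
#include <assert.h>

/* Hypothetical states mirroring the names discussed in the thread. */
enum demo_state { DEMO_RUNNING, DEMO_RUNNING_P2P, DEMO_STOP };

/* Hypothetical device flags used to model the two stopping phases. */
struct demo_dev {
	int initiates_dma;   /* device may start new transactions */
	int responds_to_p2p; /* device still answers peer requests */
};

/* Returns 0 for a handled arc, -1 for an arc this toy does not model. */
static int demo_step(struct demo_dev *dev, enum demo_state cur,
		     enum demo_state next)
{
	if (cur == DEMO_RUNNING && next == DEMO_RUNNING_P2P) {
		/* Quiesce the device, but keep answering peers. */
		dev->initiates_dma = 0;
		return 0;
	}
	if (cur == DEMO_RUNNING_P2P && next == DEMO_STOP) {
		/*
		 * In this model the arc does real work. If it were a nop,
		 * the device would already have to be fully stopped in
		 * RUNNING_P2P, which is the contradiction raised above.
		 */
		dev->responds_to_p2p = 0;
		return 0;
	}
	return -1;
}
```

Whether a nop on the second arc is acceptable for a particular device depends on what its firmware does in the first arc, which is why an explanatory comment in the driver was requested.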

* Re: [PATCH v10 vfio 4/7] vfio/pds: Add VFIO live migration support
  2023-06-21  6:49           ` Tian, Kevin
@ 2023-06-21 13:27             ` Jason Gunthorpe
  2023-06-26  7:31               ` Tian, Kevin
  0 siblings, 1 reply; 40+ messages in thread
From: Jason Gunthorpe @ 2023-06-21 13:27 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Brett Creeley, kvm, netdev, alex.williamson, yishaih,
	shameerali.kolothum.thodi, shannon.nelson

On Wed, Jun 21, 2023 at 06:49:12AM +0000, Tian, Kevin wrote:

> What are the criteria for 'reasonable'? How do CSPs judge that such a
> device can guarantee a *reliable* reasonable window so live migration
> can be enabled in the production environment?

The CSP needs to work with the device vendor to understand how it fits
into their system, I don't see how we can externalize this kind of
detail in a general way.
 
> I'm afraid that we are hiding a non-deterministic factor in current protocol.

Yes

> But still I don't think it's a good situation where the user has ZERO
> knowledge about the non-negligible time in the stopping path...

In any sane device design this will be a small period of time. These
timeouts should be to protect against a device that has gone wild.

Jason

* RE: [PATCH v10 vfio 4/7] vfio/pds: Add VFIO live migration support
  2023-06-21 13:27             ` Jason Gunthorpe
@ 2023-06-26  7:31               ` Tian, Kevin
  2023-06-26 18:13                 ` Jason Gunthorpe
  0 siblings, 1 reply; 40+ messages in thread
From: Tian, Kevin @ 2023-06-26  7:31 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Brett Creeley, kvm, netdev, alex.williamson, yishaih,
	shameerali.kolothum.thodi, shannon.nelson

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, June 21, 2023 9:27 PM
> 
> On Wed, Jun 21, 2023 at 06:49:12AM +0000, Tian, Kevin wrote:
> 
> > What are the criteria for 'reasonable'? How do CSPs judge that such a
> > device can guarantee a *reliable* reasonable window so live migration
> > can be enabled in the production environment?
> 
> The CSP needs to work with the device vendor to understand how it fits
> into their system, I don't see how we can externalize this kind of
> detail in a general way.
> 
> > I'm afraid that we are hiding a non-deterministic factor in current protocol.
> 
> Yes
> 
> > But still I don't think it's a good situation where the user has ZERO
> > knowledge about the non-negligible time in the stopping path...
> 
> In any sane device design this will be a small period of time. These
> timeouts should be to protect against a device that has gone wild.
> 

Any example of how 'small' it will be (e.g. <1ms)?

Should we define a *reasonable* threshold in the VFIO community, against
which any new variant driver should provide information to be judged?

If the worst-case stop time (assuming the device doesn't go wild) may
exceed the threshold then it's time to consider whether a new interface
is required to communicate such constraint to userspace.

The reason why I keep discussing it is that IMHO achieving negligible
stop time is a very challenging task for many accelerators. e.g. IDXD
can be stopped only after completing all the pending requests. While
it allows software to configure the max pending work size (and a
reasonable setting could meet both migration SLA and performance
SLA) the worst-case draining latency could be in the tens of milliseconds, which
cannot be ignored by the VMM.
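
As a back-of-the-envelope illustration of why the device stop time matters to the VMM, the sketch below adds it to the state-copy time and compares the sum with a downtime budget. The function names and every number used with them are made up for illustration; they are not taken from any driver or measurement.

```c
#include <assert.h>
#include <stdint.h>

/* Time to copy the device state, in microseconds (hypothetical helper). */
static uint64_t demo_copy_time_us(uint64_t state_bytes, uint64_t bytes_per_sec)
{
	assert(bytes_per_sec > 0);
	return (state_bytes * 1000000ULL) / bytes_per_sec;
}

/*
 * VM downtime is modeled as device stop time plus state copy time.
 * Returns 1 if that sum fits within the budget, 0 otherwise.
 */
static int demo_fits_budget(uint64_t stop_us, uint64_t state_bytes,
			    uint64_t bytes_per_sec, uint64_t budget_us)
{
	uint64_t total_us = stop_us + demo_copy_time_us(state_bytes, bytes_per_sec);

	return total_us <= budget_us;
}
```

For example, with 100 MB of state over a 10 GB/s link (10 ms of copy time) and a 50 ms budget, a 20 ms drain still fits, while a stop that runs to a 5 s timeout does not. A VMM that sizes its budget from the copied data alone cannot see that second term.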

Or do you think it's still better left to CSP working with the device vendor
even in this case, given the worst-case latency could be affected by
many factors hence not something which a kernel driver can accurately
estimate?

Thanks
Kevin


* Re: [PATCH v10 vfio 4/7] vfio/pds: Add VFIO live migration support
  2023-06-26  7:31               ` Tian, Kevin
@ 2023-06-26 18:13                 ` Jason Gunthorpe
  2023-06-27  6:03                   ` Tian, Kevin
  0 siblings, 1 reply; 40+ messages in thread
From: Jason Gunthorpe @ 2023-06-26 18:13 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Brett Creeley, kvm, netdev, alex.williamson, yishaih,
	shameerali.kolothum.thodi, shannon.nelson

On Mon, Jun 26, 2023 at 07:31:31AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, June 21, 2023 9:27 PM
> > 
> > On Wed, Jun 21, 2023 at 06:49:12AM +0000, Tian, Kevin wrote:
> > 
> > > What are the criteria for 'reasonable'? How do CSPs judge that such a
> > > device can guarantee a *reliable* reasonable window so live migration
> > > can be enabled in the production environment?
> > 
> > The CSP needs to work with the device vendor to understand how it fits
> > into their system, I don't see how we can externalize this kind of
> > detail in a general way.
> > 
> > > I'm afraid that we are hiding a non-deterministic factor in current protocol.
> > 
> > Yes
> > 
> > > But still I don't think it's a good situation where the user has ZERO
> > > knowledge about the non-negligible time in the stopping path...
> > 
> > In any sane device design this will be a small period of time. These
> > timeouts should be to protect against a device that has gone wild.
> > 
> 
> Any example how 'small' it will be (e.g. <1ms)?

Not personally..

> Should we define a *reasonable* threshold in VFIO community which
> any new variant driver should provide information to judge against?

Ah, I think we are just too new to get into such details. I think we
need some real world experience to see if this is really an issue.

> The reason why I keep discussing it is that IMHO achieving negligible
> stop time is a very challenging task for many accelerators. e.g. IDXD
> can be stopped only after completing all the pending requests. While
> it allows software to configure the max pending work size (and a
> reasonable setting could meet both migration SLA and performance
> SLA) the worst-case draining latency could be in the tens of milliseconds, which
> cannot be ignored by the VMM.

Well, what would you report here if you had the opportunity to report
something? Some big number? Then what?

> Or do you think it's still better left to CSP working with the device vendor
> even in this case, given the worst-case latency could be affected by
> many factors hence not something which a kernel driver can accurately
> estimate?

This is my fear, that it is so complicated that reducing it to any
sort of cross-vendor data is not feasible. At least I'd like to see
someone experiment with what information would be useful to qemu
before we add kernel ABI..

Jason

* RE: [PATCH v10 vfio 4/7] vfio/pds: Add VFIO live migration support
  2023-06-26 18:13                 ` Jason Gunthorpe
@ 2023-06-27  6:03                   ` Tian, Kevin
  0 siblings, 0 replies; 40+ messages in thread
From: Tian, Kevin @ 2023-06-27  6:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Brett Creeley, kvm, netdev, alex.williamson, yishaih,
	shameerali.kolothum.thodi, shannon.nelson

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, June 27, 2023 2:14 AM
> 
> On Mon, Jun 26, 2023 at 07:31:31AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, June 21, 2023 9:27 PM
> > >
> > > On Wed, Jun 21, 2023 at 06:49:12AM +0000, Tian, Kevin wrote:
> > >
> > > > What are the criteria for 'reasonable'? How do CSPs judge that such a
> > > > device can guarantee a *reliable* reasonable window so live migration
> > > > can be enabled in the production environment?
> > >
> > > The CSP needs to work with the device vendor to understand how it fits
> > > into their system, I don't see how we can externalize this kind of
> > > detail in a general way.
> > >
> > > > I'm afraid that we are hiding a non-deterministic factor in current
> protocol.
> > >
> > > Yes
> > >
> > > > But still I don't think it's a good situation where the user has ZERO
> > > > knowledge about the non-negligible time in the stopping path...
> > >
> > > In any sane device design this will be a small period of time. These
> > > timeouts should be to protect against a device that has gone wild.
> > >
> >
> > Any example how 'small' it will be (e.g. <1ms)?
> 
> Not personally..
> 
> > Should we define a *reasonable* threshold in VFIO community which
> > any new variant driver should provide information to judge against?
> 
> Ah, I think we are just too new to get into such details. I think we
> need some real world experience to see if this is really an issue.
> 
> > The reason why I keep discussing it is that IMHO achieving negligible
> > stop time is a very challenging task for many accelerators. e.g. IDXD
> > can be stopped only after completing all the pending requests. While
> > it allows software to configure the max pending work size (and a
> > reasonable setting could meet both migration SLA and performance
> > SLA) the worst-case draining latency could be in the tens of milliseconds, which
> > cannot be ignored by the VMM.
> 
> Well, what would you report here if you had the opportunity to report
> something? Some big number? Then what?
> 
> > Or do you think it's still better left to CSP working with the device vendor
> > even in this case, given the worst-case latency could be affected by
> > many factors hence not something which a kernel driver can accurately
> > estimate?
> 
> This is my fear, that it is so complicated that reducing it to any
> sort of cross-vendor data is not feasible. At least I'd like to see
> someone experiment with what information would be useful to qemu
> before we add kernel ABI..
> 

OK, makes sense.

end of thread, other threads:[~2023-06-27  6:03 UTC | newest]

Thread overview: 40+ messages
2023-06-02 22:03 [PATCH v10 vfio 0/7] pds_vfio driver Brett Creeley
2023-06-02 22:03 ` [PATCH v10 vfio 1/7] vfio: Commonize combine_ranges for use in other VFIO drivers Brett Creeley
2023-06-16  6:52   ` Tian, Kevin
2023-06-16 18:37     ` Brett Creeley
2023-06-02 22:03 ` [PATCH v10 vfio 2/7] vfio/pds: Initial support for pds_vfio VFIO driver Brett Creeley
2023-06-14 21:31   ` Alex Williamson
2023-06-14 21:41     ` Brett Creeley
2023-06-16  6:56   ` Tian, Kevin
2023-06-16 18:42     ` Brett Creeley
2023-06-02 22:03 ` [PATCH v10 vfio 3/7] vfio/pds: register with the pds_core PF Brett Creeley
2023-06-15 21:05   ` Shameerali Kolothum Thodi
2023-06-15 21:30     ` Brett Creeley
2023-06-16  7:04   ` Tian, Kevin
2023-06-16 19:01     ` Brett Creeley
2023-06-20  2:11       ` Tian, Kevin
2023-06-02 22:03 ` [PATCH v10 vfio 4/7] vfio/pds: Add VFIO live migration support Brett Creeley
2023-06-15 21:07   ` Shameerali Kolothum Thodi
2023-06-15 21:36     ` Brett Creeley
2023-06-16  8:06   ` Tian, Kevin
2023-06-17  4:45     ` Brett Creeley
2023-06-20  2:19       ` Tian, Kevin
2023-06-19 12:46     ` Jason Gunthorpe
2023-06-20  2:02       ` Tian, Kevin
2023-06-20 12:31         ` Jason Gunthorpe
2023-06-21  6:49           ` Tian, Kevin
2023-06-21 13:27             ` Jason Gunthorpe
2023-06-26  7:31               ` Tian, Kevin
2023-06-26 18:13                 ` Jason Gunthorpe
2023-06-27  6:03                   ` Tian, Kevin
2023-06-02 22:03 ` [PATCH v10 vfio 5/7] vfio/pds: Add support for dirty page tracking Brett Creeley
2023-06-02 22:03 ` [PATCH v10 vfio 6/7] vfio/pds: Add support for firmware recovery Brett Creeley
2023-06-16  8:24   ` Tian, Kevin
2023-06-17  0:47     ` Brett Creeley
2023-06-02 22:03 ` [PATCH v10 vfio 7/7] vfio/pds: Add Kconfig and documentation Brett Creeley
2023-06-16  8:25   ` Tian, Kevin
2023-06-16 20:05     ` Brett Creeley
2023-06-14 20:20 ` [PATCH v10 vfio 0/7] pds_vfio driver Alex Williamson
2023-06-16  6:47 ` Tian, Kevin
2023-06-16 20:06   ` Brett Creeley
2023-06-17  4:49 ` Brett Creeley
