linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4] vdpa: decouple reset of iotlb mapping from device reset
@ 2023-10-10  9:02 Si-Wei Liu
  2023-10-10  9:02 ` [PATCH 1/4] vdpa: introduce .reset_map operation callback Si-Wei Liu
                   ` (4 more replies)
  0 siblings, 5 replies; 37+ messages in thread
From: Si-Wei Liu @ 2023-10-10  9:02 UTC (permalink / raw)
  To: jasowang, mst, eperezma, xuanzhuo, dtatulea; +Cc: virtualization, linux-kernel

In order to reduce needlessly high setup and teardown cost
of iotlb mapping during live migration, it's crucial to
decouple the vhost-vdpa iotlb abstraction from the virtio
device life cycle, i.e. iotlb mappings should be left
intact across virtio device reset [1]. For it to work, the
on-chip IOMMU parent device could implement a separate
.reset_map() operation callback to restore 1:1 DMA mapping
without having to resort to the .reset() callback, the
latter of which is mainly used to reset virtio device state.
This new .reset_map() callback will be invoked only before
the vhost-vdpa driver is to be removed and detached from
the vdpa bus, such that other vdpa bus drivers, e.g. 
virtio-vdpa, can start with 1:1 DMA mapping when they
are attached. For the context, those on-chip IOMMU parent
devices, create the 1:1 DMA mapping at vdpa device creation,
and they would implicitly destroy the 1:1 mapping when
the first .set_map or .dma_map callback is invoked.

This patchset is based off of the descriptor group v3 series
from Dragos. [2]

[1] Reducing vdpa migration downtime because of memory pin / maps
https://www.mail-archive.com/qemu-devel@nongnu.org/msg953755.html
[2] [PATCH vhost v3 00/16] vdpa: Add support for vq descriptor mappings
https://lore.kernel.org/lkml/20231009112401.1060447-1-dtatulea@nvidia.com/

---
v1:
- rewrote commit messages to include more detailed description and background
- reword to vendor specific IOMMU implementation from on-chip IOMMU
- include parent device backend feautres to persistent iotlb precondition
- reimplement mlx5_vdpa patch on top of descriptor group series

RFC v3:
- fix missing return due to merge error in patch #4

RFC v2:
- rebased on top of the "[PATCH RFC v2 0/3] vdpa: dedicated descriptor table group" series:
  https://lore.kernel.org/virtualization/1694248959-13369-1-git-send-email-si-wei.liu@oracle.com/

---

Si-Wei Liu (4):
  vdpa: introduce .reset_map operation callback
  vhost-vdpa: reset vendor specific mapping to initial state in .release
  vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
  vdpa/mlx5: implement .reset_map driver op

 drivers/vdpa/mlx5/core/mlx5_vdpa.h |  1 +
 drivers/vdpa/mlx5/core/mr.c        | 17 +++++++++++++++++
 drivers/vdpa/mlx5/net/mlx5_vnet.c  | 18 +++++++++++++-----
 drivers/vhost/vdpa.c               | 31 +++++++++++++++++++++++++++++++
 include/linux/vdpa.h               | 10 ++++++++++
 include/uapi/linux/vhost_types.h   |  2 ++
 6 files changed, 74 insertions(+), 5 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 1/4] vdpa: introduce .reset_map operation callback
  2023-10-10  9:02 [PATCH 0/4] vdpa: decouple reset of iotlb mapping from device reset Si-Wei Liu
@ 2023-10-10  9:02 ` Si-Wei Liu
  2023-10-13  2:49   ` Jason Wang
  2023-10-10  9:02 ` [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release Si-Wei Liu
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 37+ messages in thread
From: Si-Wei Liu @ 2023-10-10  9:02 UTC (permalink / raw)
  To: jasowang, mst, eperezma, xuanzhuo, dtatulea; +Cc: virtualization, linux-kernel

Device specific IOMMU parent driver who wishes to see mapping to be
decoupled from virtio or vdpa device life cycle (device reset) can use
it to restore memory mapping in the device IOMMU to the initial or
default state. The reset of mapping is done per address space basis.

The reason why a separate .reset_map op is introduced is because this
allows a simple on-chip IOMMU model without exposing too much device
implementation details to the upper vdpa layer. The .dma_map/unmap or
.set_map driver API is meant to be used to manipulate the IOTLB mappings,
and has been abstracted in a way similar to how a real IOMMU device maps
or unmaps pages for certain memory ranges. However, apart from this there
also exists other mapping needs, in which case 1:1 passthrough mapping
has to be used by other users (read virtio-vdpa). To ease parent/vendor
driver implementation and to avoid abusing DMA ops in an unexpacted way,
these on-chip IOMMU devices can start with 1:1 passthrough mapping mode
initially at the he time of creation. Then the .reset_map op can be used
to switch iotlb back to this initial state without having to expose a
complex two-dimensional IOMMU device model.

Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
---
 include/linux/vdpa.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index d376309..26ae6ae 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -327,6 +327,15 @@ struct vdpa_map_file {
  *				@iova: iova to be unmapped
  *				@size: size of the area
  *				Returns integer: success (0) or error (< 0)
+ * @reset_map:			Reset device memory mapping to the default
+ *				state (optional)
+ *				Needed for devices that are using device
+ *				specific DMA translation and prefer mapping
+ *				to be decoupled from the virtio life cycle,
+ *				i.e. device .reset op does not reset mapping
+ *				@vdev: vdpa device
+ *				@asid: address space identifier
+ *				Returns integer: success (0) or error (< 0)
  * @get_vq_dma_dev:		Get the dma device for a specific
  *				virtqueue (optional)
  *				@vdev: vdpa device
@@ -405,6 +414,7 @@ struct vdpa_config_ops {
 		       u64 iova, u64 size, u64 pa, u32 perm, void *opaque);
 	int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid,
 			 u64 iova, u64 size);
+	int (*reset_map)(struct vdpa_device *vdev, unsigned int asid);
 	int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group,
 			      unsigned int asid);
 	struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
  2023-10-10  9:02 [PATCH 0/4] vdpa: decouple reset of iotlb mapping from device reset Si-Wei Liu
  2023-10-10  9:02 ` [PATCH 1/4] vdpa: introduce .reset_map operation callback Si-Wei Liu
@ 2023-10-10  9:02 ` Si-Wei Liu
  2023-10-11 11:21   ` Eugenio Perez Martin
  2023-10-13  3:01   ` Jason Wang
  2023-10-10  9:02 ` [PATCH 3/4] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit Si-Wei Liu
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 37+ messages in thread
From: Si-Wei Liu @ 2023-10-10  9:02 UTC (permalink / raw)
  To: jasowang, mst, eperezma, xuanzhuo, dtatulea; +Cc: virtualization, linux-kernel

Devices with on-chip IOMMU or vendor specific IOTLB implementation
may need to restore iotlb mapping to the initial or default state
using the .reset_map op, as it's desirable for some parent devices
to solely manipulate mappings by its own, independent of virtio device
state. For instance, device reset does not cause mapping go away on
such IOTLB model in need of persistent mapping. Before vhost-vdpa
is going away, give them a chance to reset iotlb back to the initial
state in vhost_vdpa_cleanup().

Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
---
 drivers/vhost/vdpa.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 851535f..a3f8160 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
 	return vhost_vdpa_alloc_as(v, asid);
 }
 
+static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid)
+{
+	struct vdpa_device *vdpa = v->vdpa;
+	const struct vdpa_config_ops *ops = vdpa->config;
+
+	if (ops->reset_map)
+		ops->reset_map(vdpa, asid);
+}
+
 static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
 {
 	struct vhost_vdpa_as *as = asid_to_as(v, asid);
@@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
 
 	hlist_del(&as->hash_link);
 	vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid);
+	/*
+	 * Devices with vendor specific IOMMU may need to restore
+	 * iotlb to the initial or default state which is not done
+	 * through device reset, as the IOTLB mapping manipulation
+	 * could be decoupled from the virtio device life cycle.
+	 */
+	vhost_vdpa_reset_map(v, asid);
 	kfree(as);
 
 	return 0;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 3/4] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
  2023-10-10  9:02 [PATCH 0/4] vdpa: decouple reset of iotlb mapping from device reset Si-Wei Liu
  2023-10-10  9:02 ` [PATCH 1/4] vdpa: introduce .reset_map operation callback Si-Wei Liu
  2023-10-10  9:02 ` [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release Si-Wei Liu
@ 2023-10-10  9:02 ` Si-Wei Liu
  2023-10-10  9:03 ` [PATCH 4/4] vdpa/mlx5: implement .reset_map driver op Si-Wei Liu
  2023-10-11 11:30 ` [PATCH 0/4] vdpa: decouple reset of iotlb mapping from device reset Eugenio Perez Martin
  4 siblings, 0 replies; 37+ messages in thread
From: Si-Wei Liu @ 2023-10-10  9:02 UTC (permalink / raw)
  To: jasowang, mst, eperezma, xuanzhuo, dtatulea; +Cc: virtualization, linux-kernel

Userspace needs this feature flag to distinguish if vhost-vdpa
iotlb in the kernel supports persistent IOTLB mapping across
device reset. Without it, userspace has no way to tell apart
if it's running on an older kernel, which could silently drop
all iotlb mapping across vDPA reset.

There are 3 cases that backend may claim this feature bit on:

- parent device that has to work with platform IOMMU
- parent device with on-chip IOMMU that has the expected
  .reset_map support in driver
- parent device with vendor specific IOMMU implementation
  that explicitly declares the specific backend feature

The reason why .reset_map is being one of the pre-condition for
persistent iotlb is because without it, vhost-vdpa can't switch
back iotlb to the initial state later on, especially for the
on-chip IOMMU case which starts with identity mapping at device
creation. virtio-vdpa requires on-chip IOMMU to perform 1:1
passthrough translation from PA to IOVA as-is to begin with, and
.reset_map is the only means to turn back iotlb to the identity
mapping mode after vhost-vdpa is gone.

Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
---
 drivers/vhost/vdpa.c             | 15 +++++++++++++++
 include/uapi/linux/vhost_types.h |  2 ++
 2 files changed, 17 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index a3f8160..c92794f 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -413,6 +413,15 @@ static bool vhost_vdpa_has_desc_group(const struct vhost_vdpa *v)
 	return ops->get_vq_desc_group;
 }
 
+static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v)
+{
+	struct vdpa_device *vdpa = v->vdpa;
+	const struct vdpa_config_ops *ops = vdpa->config;
+
+	return (!ops->set_map && !ops->dma_map) || ops->reset_map ||
+	       vhost_vdpa_get_backend_features(v) & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST);
+}
+
 static long vhost_vdpa_get_features(struct vhost_vdpa *v, u64 __user *featurep)
 {
 	struct vdpa_device *vdpa = v->vdpa;
@@ -725,6 +734,7 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
 			return -EFAULT;
 		if (features & ~(VHOST_VDPA_BACKEND_FEATURES |
 				 BIT_ULL(VHOST_BACKEND_F_DESC_ASID) |
+				 BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST) |
 				 BIT_ULL(VHOST_BACKEND_F_SUSPEND) |
 				 BIT_ULL(VHOST_BACKEND_F_RESUME) |
 				 BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK)))
@@ -741,6 +751,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
 		if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) &&
 		     !vhost_vdpa_has_desc_group(v))
 			return -EOPNOTSUPP;
+		if ((features & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST)) &&
+		     !vhost_vdpa_has_persistent_map(v))
+			return -EOPNOTSUPP;
 		vhost_set_backend_features(&v->vdev, features);
 		return 0;
 	}
@@ -796,6 +809,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
 			features |= BIT_ULL(VHOST_BACKEND_F_RESUME);
 		if (vhost_vdpa_has_desc_group(v))
 			features |= BIT_ULL(VHOST_BACKEND_F_DESC_ASID);
+		if (vhost_vdpa_has_persistent_map(v))
+			features |= BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST);
 		features |= vhost_vdpa_get_backend_features(v);
 		if (copy_to_user(featurep, &features, sizeof(features)))
 			r = -EFAULT;
diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
index 18ad6ae..d765690 100644
--- a/include/uapi/linux/vhost_types.h
+++ b/include/uapi/linux/vhost_types.h
@@ -190,5 +190,7 @@ struct vhost_vdpa_iova_range {
  * buffers may reside. Requires VHOST_BACKEND_F_IOTLB_ASID.
  */
 #define VHOST_BACKEND_F_DESC_ASID    0x7
+/* IOTLB don't flush memory mapping across device reset */
+#define VHOST_BACKEND_F_IOTLB_PERSIST  0x8
 
 #endif
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 4/4] vdpa/mlx5: implement .reset_map driver op
  2023-10-10  9:02 [PATCH 0/4] vdpa: decouple reset of iotlb mapping from device reset Si-Wei Liu
                   ` (2 preceding siblings ...)
  2023-10-10  9:02 ` [PATCH 3/4] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit Si-Wei Liu
@ 2023-10-10  9:03 ` Si-Wei Liu
  2023-10-13  3:04   ` Jason Wang
  2023-10-11 11:30 ` [PATCH 0/4] vdpa: decouple reset of iotlb mapping from device reset Eugenio Perez Martin
  4 siblings, 1 reply; 37+ messages in thread
From: Si-Wei Liu @ 2023-10-10  9:03 UTC (permalink / raw)
  To: jasowang, mst, eperezma, xuanzhuo, dtatulea; +Cc: virtualization, linux-kernel

Since commit 6f5312f80183 ("vdpa/mlx5: Add support for running with
virtio_vdpa"), mlx5_vdpa starts with preallocate 1:1 DMA MR at device
creation time. This 1:1 DMA MR will be implicitly destroyed while
the first .set_map call is invoked, in which case callers like
vhost-vdpa will start to set up custom mappings. When the .reset
callback is invoked, the custom mappings will be cleared and the 1:1
DMA MR will be re-created.

In order to reduce excessive memory mapping cost in live migration,
it is desirable to decouple the vhost-vdpa IOTLB abstraction from
the virtio device life cycle, i.e. mappings can be kept around intact
across virtio device reset. Leverage the .reset_map callback, which
is meant to destroy the regular MR on the given ASID and recreate the
initial DMA mapping. That way, the device .reset op can run free from
having to maintain and clean up memory mappings by itself.

The cvq mapping also needs to be cleared if is in the given ASID.

Co-developed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
---
 drivers/vdpa/mlx5/core/mlx5_vdpa.h |  1 +
 drivers/vdpa/mlx5/core/mr.c        | 17 +++++++++++++++++
 drivers/vdpa/mlx5/net/mlx5_vnet.c  | 18 +++++++++++++-----
 3 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
index db988ce..84547d9 100644
--- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h
+++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
@@ -127,6 +127,7 @@ int mlx5_vdpa_update_cvq_iotlb(struct mlx5_vdpa_dev *mvdev,
 				struct vhost_iotlb *iotlb,
 				unsigned int asid);
 int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev);
+int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid);
 
 #define mlx5_vdpa_warn(__dev, format, ...)                                                         \
 	dev_warn((__dev)->mdev->device, "%s:%d:(pid %d) warning: " format, __func__, __LINE__,     \
diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c
index 66530e28..2197c46 100644
--- a/drivers/vdpa/mlx5/core/mr.c
+++ b/drivers/vdpa/mlx5/core/mr.c
@@ -645,3 +645,20 @@ int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev)
 
 	return mlx5_vdpa_update_cvq_iotlb(mvdev, NULL, 0);
 }
+
+int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid)
+{
+	if (asid >= MLX5_VDPA_NUM_AS)
+		return -EINVAL;
+
+	mlx5_vdpa_destroy_mr(mvdev, mvdev->mr[asid]);
+
+	if (asid == 0 && MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) {
+		if (mlx5_vdpa_create_dma_mr(mvdev))
+			mlx5_vdpa_warn(mvdev, "create DMA MR failed\n");
+	} else {
+		mlx5_vdpa_update_cvq_iotlb(mvdev, NULL, asid);
+	}
+
+	return 0;
+}
diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 6abe023..928e71b 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -2838,7 +2838,6 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev)
 	unregister_link_notifier(ndev);
 	teardown_driver(ndev);
 	clear_vqs_ready(ndev);
-	mlx5_vdpa_destroy_mr_resources(&ndev->mvdev);
 	ndev->mvdev.status = 0;
 	ndev->mvdev.suspended = false;
 	ndev->cur_num_vqs = 0;
@@ -2849,10 +2848,6 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev)
 	init_group_to_asid_map(mvdev);
 	++mvdev->generation;
 
-	if (MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) {
-		if (mlx5_vdpa_create_dma_mr(mvdev))
-			mlx5_vdpa_warn(mvdev, "create MR failed\n");
-	}
 	up_write(&ndev->reslock);
 
 	return 0;
@@ -2932,6 +2927,18 @@ static int mlx5_vdpa_set_map(struct vdpa_device *vdev, unsigned int asid,
 	return err;
 }
 
+static int mlx5_vdpa_reset_map(struct vdpa_device *vdev, unsigned int asid)
+{
+	struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev);
+	struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev);
+	int err;
+
+	down_write(&ndev->reslock);
+	err = mlx5_vdpa_reset_mr(mvdev, asid);
+	up_write(&ndev->reslock);
+	return err;
+}
+
 static struct device *mlx5_get_vq_dma_dev(struct vdpa_device *vdev, u16 idx)
 {
 	struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev);
@@ -3199,6 +3206,7 @@ static int mlx5_set_group_asid(struct vdpa_device *vdev, u32 group,
 	.set_config = mlx5_vdpa_set_config,
 	.get_generation = mlx5_vdpa_get_generation,
 	.set_map = mlx5_vdpa_set_map,
+	.reset_map = mlx5_vdpa_reset_map,
 	.set_group_asid = mlx5_set_group_asid,
 	.get_vq_dma_dev = mlx5_get_vq_dma_dev,
 	.free = mlx5_vdpa_free,
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
  2023-10-10  9:02 ` [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release Si-Wei Liu
@ 2023-10-11 11:21   ` Eugenio Perez Martin
  2023-10-12  6:18     ` Si-Wei Liu
  2023-10-13  3:01   ` Jason Wang
  1 sibling, 1 reply; 37+ messages in thread
From: Eugenio Perez Martin @ 2023-10-11 11:21 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: jasowang, mst, xuanzhuo, dtatulea, virtualization, linux-kernel

On Tue, Oct 10, 2023 at 11:05 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
> Devices with on-chip IOMMU or vendor specific IOTLB implementation
> may need to restore iotlb mapping to the initial or default state
> using the .reset_map op, as it's desirable for some parent devices
> to solely manipulate mappings by its own, independent of virtio device
> state. For instance, device reset does not cause mapping go away on
> such IOTLB model in need of persistent mapping. Before vhost-vdpa
> is going away, give them a chance to reset iotlb back to the initial
> state in vhost_vdpa_cleanup().
>
> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
> ---
>  drivers/vhost/vdpa.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
>
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index 851535f..a3f8160 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
>         return vhost_vdpa_alloc_as(v, asid);
>  }
>
> +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid)
> +{
> +       struct vdpa_device *vdpa = v->vdpa;
> +       const struct vdpa_config_ops *ops = vdpa->config;
> +
> +       if (ops->reset_map)
> +               ops->reset_map(vdpa, asid);
> +}
> +
>  static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
>  {
>         struct vhost_vdpa_as *as = asid_to_as(v, asid);
> @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
>
>         hlist_del(&as->hash_link);
>         vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid);

Now I'm wondering, does this call to vhost_vdpa_iotlb_unmap sets a
different map (via .set_map) per element of the vhost_iotlb_itree? Not
a big deal since we're in the cleanup path, but it could be a nice
optimization on top as we're going to reset the map of the asid
anyway.

> +       /*
> +        * Devices with vendor specific IOMMU may need to restore
> +        * iotlb to the initial or default state which is not done
> +        * through device reset, as the IOTLB mapping manipulation
> +        * could be decoupled from the virtio device life cycle.
> +        */
> +       vhost_vdpa_reset_map(v, asid);
>         kfree(as);
>
>         return 0;
> --
> 1.8.3.1
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/4] vdpa: decouple reset of iotlb mapping from device reset
  2023-10-10  9:02 [PATCH 0/4] vdpa: decouple reset of iotlb mapping from device reset Si-Wei Liu
                   ` (3 preceding siblings ...)
  2023-10-10  9:03 ` [PATCH 4/4] vdpa/mlx5: implement .reset_map driver op Si-Wei Liu
@ 2023-10-11 11:30 ` Eugenio Perez Martin
  4 siblings, 0 replies; 37+ messages in thread
From: Eugenio Perez Martin @ 2023-10-11 11:30 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: jasowang, mst, xuanzhuo, dtatulea, virtualization, linux-kernel

On Tue, Oct 10, 2023 at 11:05 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
> In order to reduce needlessly high setup and teardown cost
> of iotlb mapping during live migration, it's crucial to
> decouple the vhost-vdpa iotlb abstraction from the virtio
> device life cycle, i.e. iotlb mappings should be left
> intact across virtio device reset [1]. For it to work, the
> on-chip IOMMU parent device could implement a separate
> .reset_map() operation callback to restore 1:1 DMA mapping
> without having to resort to the .reset() callback, the
> latter of which is mainly used to reset virtio device state.
> This new .reset_map() callback will be invoked only before
> the vhost-vdpa driver is to be removed and detached from
> the vdpa bus, such that other vdpa bus drivers, e.g.
> virtio-vdpa, can start with 1:1 DMA mapping when they
> are attached. For the context, those on-chip IOMMU parent
> devices, create the 1:1 DMA mapping at vdpa device creation,
> and they would implicitly destroy the 1:1 mapping when
> the first .set_map or .dma_map callback is invoked.
>
> This patchset is based off of the descriptor group v3 series
> from Dragos. [2]
>
> [1] Reducing vdpa migration downtime because of memory pin / maps
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg953755.html
> [2] [PATCH vhost v3 00/16] vdpa: Add support for vq descriptor mappings
> https://lore.kernel.org/lkml/20231009112401.1060447-1-dtatulea@nvidia.com/
>

Acked-by: Eugenio Pérez <eperezma@redhat.com>

Thanks!

> ---
> v1:
> - rewrote commit messages to include more detailed description and background
> - reword to vendor specific IOMMU implementation from on-chip IOMMU
> - include parent device backend feautres to persistent iotlb precondition
> - reimplement mlx5_vdpa patch on top of descriptor group series
>
> RFC v3:
> - fix missing return due to merge error in patch #4
>
> RFC v2:
> - rebased on top of the "[PATCH RFC v2 0/3] vdpa: dedicated descriptor table group" series:
>   https://lore.kernel.org/virtualization/1694248959-13369-1-git-send-email-si-wei.liu@oracle.com/
>
> ---
>
> Si-Wei Liu (4):
>   vdpa: introduce .reset_map operation callback
>   vhost-vdpa: reset vendor specific mapping to initial state in .release
>   vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
>   vdpa/mlx5: implement .reset_map driver op
>
>  drivers/vdpa/mlx5/core/mlx5_vdpa.h |  1 +
>  drivers/vdpa/mlx5/core/mr.c        | 17 +++++++++++++++++
>  drivers/vdpa/mlx5/net/mlx5_vnet.c  | 18 +++++++++++++-----
>  drivers/vhost/vdpa.c               | 31 +++++++++++++++++++++++++++++++
>  include/linux/vdpa.h               | 10 ++++++++++
>  include/uapi/linux/vhost_types.h   |  2 ++
>  6 files changed, 74 insertions(+), 5 deletions(-)
>
> --
> 1.8.3.1
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
  2023-10-11 11:21   ` Eugenio Perez Martin
@ 2023-10-12  6:18     ` Si-Wei Liu
  0 siblings, 0 replies; 37+ messages in thread
From: Si-Wei Liu @ 2023-10-12  6:18 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: jasowang, mst, xuanzhuo, dtatulea, virtualization, linux-kernel



On 10/11/2023 4:21 AM, Eugenio Perez Martin wrote:
> On Tue, Oct 10, 2023 at 11:05 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>> Devices with on-chip IOMMU or vendor specific IOTLB implementation
>> may need to restore iotlb mapping to the initial or default state
>> using the .reset_map op, as it's desirable for some parent devices
>> to solely manipulate mappings by its own, independent of virtio device
>> state. For instance, device reset does not cause mapping go away on
>> such IOTLB model in need of persistent mapping. Before vhost-vdpa
>> is going away, give them a chance to reset iotlb back to the initial
>> state in vhost_vdpa_cleanup().
>>
>> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
>> ---
>>   drivers/vhost/vdpa.c | 16 ++++++++++++++++
>>   1 file changed, 16 insertions(+)
>>
>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>> index 851535f..a3f8160 100644
>> --- a/drivers/vhost/vdpa.c
>> +++ b/drivers/vhost/vdpa.c
>> @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
>>          return vhost_vdpa_alloc_as(v, asid);
>>   }
>>
>> +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid)
>> +{
>> +       struct vdpa_device *vdpa = v->vdpa;
>> +       const struct vdpa_config_ops *ops = vdpa->config;
>> +
>> +       if (ops->reset_map)
>> +               ops->reset_map(vdpa, asid);
>> +}
>> +
>>   static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
>>   {
>>          struct vhost_vdpa_as *as = asid_to_as(v, asid);
>> @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
>>
>>          hlist_del(&as->hash_link);
>>          vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid);
> Now I'm wondering, does this call to vhost_vdpa_iotlb_unmap sets a
> different map (via .set_map) per element of the vhost_iotlb_itree?
Yes and no, effectively this vhost_vdpa_iotlb_unmap call will pass an 
empty iotlb with zero map entry down to the driver via .set_map, so for 
.set_map interface it's always a different map no matter what. As for 
this special case, the internal implementation of mlx5_vdpa .set_map may 
choose to either destroy MR and recreate a new one, or remove all 
mappings on the existing MR (currently it uses destroy+recreate for 
simplicity without have to special case). But .reset_map is different - 
the 1:1 DMA MR has to be recreated explicitly after destroying the 
regular MR, so you see this is driver/device implementation specifics.

>   Not
> a big deal since we're in the cleanup path, but it could be a nice
> optimization on top as we're going to reset the map of the asid
> anyway.
You mean wrap up what's done in vhost_vdpa_iotlb_unmap and 
vhost_vdpa_reset_map to a new call, say vhost_vdpa_iotlb_reset? Yes this 
is possible, but be noted that the vhost_vdpa_iotlb_unmap also takes 
charge of pinning accounting other than mapping, and it has to also 
maintain it's own vhost_iotlb copy in sync. There's no such much code 
that can be consolidated or generalized at this point, as 
vhost_vdpa_reset_map() is very specific to some device implementation, 
and I don't see common need to optimize this further up in the map/unmap 
hot path rather than this cleanup slow path, just as you alluded to.

Regards,
-Siwei
>
>> +       /*
>> +        * Devices with vendor specific IOMMU may need to restore
>> +        * iotlb to the initial or default state which is not done
>> +        * through device reset, as the IOTLB mapping manipulation
>> +        * could be decoupled from the virtio device life cycle.
>> +        */
>> +       vhost_vdpa_reset_map(v, asid);
>>          kfree(as);
>>
>>          return 0;
>> --
>> 1.8.3.1
>>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/4] vdpa: introduce .reset_map operation callback
  2023-10-10  9:02 ` [PATCH 1/4] vdpa: introduce .reset_map operation callback Si-Wei Liu
@ 2023-10-13  2:49   ` Jason Wang
  2023-10-13  7:36     ` Si-Wei Liu
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Wang @ 2023-10-13  2:49 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: mst, eperezma, xuanzhuo, dtatulea, virtualization, linux-kernel

On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
> Device specific IOMMU parent driver who wishes to see mapping to be
> decoupled from virtio or vdpa device life cycle (device reset) can use
> it to restore memory mapping in the device IOMMU to the initial or
> default state. The reset of mapping is done per address space basis.
>
> The reason why a separate .reset_map op is introduced is because this
> allows a simple on-chip IOMMU model without exposing too much device
> implementation details to the upper vdpa layer. The .dma_map/unmap or
> .set_map driver API is meant to be used to manipulate the IOTLB mappings,
> and has been abstracted in a way similar to how a real IOMMU device maps
> or unmaps pages for certain memory ranges. However, apart from this there
> also exists other mapping needs, in which case 1:1 passthrough mapping
> has to be used by other users (read virtio-vdpa). To ease parent/vendor
> driver implementation and to avoid abusing DMA ops in an unexpacted way,
> these on-chip IOMMU devices can start with 1:1 passthrough mapping mode
> initially at the he time of creation. Then the .reset_map op can be used
> to switch iotlb back to this initial state without having to expose a
> complex two-dimensional IOMMU device model.
>
> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
> ---
>  include/linux/vdpa.h | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
> index d376309..26ae6ae 100644
> --- a/include/linux/vdpa.h
> +++ b/include/linux/vdpa.h
> @@ -327,6 +327,15 @@ struct vdpa_map_file {
>   *                             @iova: iova to be unmapped
>   *                             @size: size of the area
>   *                             Returns integer: success (0) or error (< 0)
> + * @reset_map:                 Reset device memory mapping to the default
> + *                             state (optional)

I think we need to mention that this is a must for parents that use set_map()?

Other than this:

Acked-by: Jason Wang <jasowang@redhat.com>

Thanks

> + *                             Needed for devices that are using device
> + *                             specific DMA translation and prefer mapping
> + *                             to be decoupled from the virtio life cycle,
> + *                             i.e. device .reset op does not reset mapping
> + *                             @vdev: vdpa device
> + *                             @asid: address space identifier
> + *                             Returns integer: success (0) or error (< 0)
>   * @get_vq_dma_dev:            Get the dma device for a specific
>   *                             virtqueue (optional)
>   *                             @vdev: vdpa device
> @@ -405,6 +414,7 @@ struct vdpa_config_ops {
>                        u64 iova, u64 size, u64 pa, u32 perm, void *opaque);
>         int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid,
>                          u64 iova, u64 size);
> +       int (*reset_map)(struct vdpa_device *vdev, unsigned int asid);
>         int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group,
>                               unsigned int asid);
>         struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx);
> --
> 1.8.3.1
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
  2023-10-10  9:02 ` [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release Si-Wei Liu
  2023-10-11 11:21   ` Eugenio Perez Martin
@ 2023-10-13  3:01   ` Jason Wang
  2023-10-13  7:35     ` Si-Wei Liu
  1 sibling, 1 reply; 37+ messages in thread
From: Jason Wang @ 2023-10-13  3:01 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: mst, eperezma, xuanzhuo, dtatulea, virtualization, linux-kernel

On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
> Devices with on-chip IOMMU or vendor specific IOTLB implementation
> may need to restore iotlb mapping to the initial or default state
> using the .reset_map op, as it's desirable for some parent devices
> to solely manipulate mappings by its own, independent of virtio device
> state. For instance, device reset does not cause mapping go away on
> such IOTLB model in need of persistent mapping. Before vhost-vdpa
> is going away, give them a chance to reset iotlb back to the initial
> state in vhost_vdpa_cleanup().
>
> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
> ---
>  drivers/vhost/vdpa.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
>
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index 851535f..a3f8160 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
>         return vhost_vdpa_alloc_as(v, asid);
>  }
>
> +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid)
> +{
> +       struct vdpa_device *vdpa = v->vdpa;
> +       const struct vdpa_config_ops *ops = vdpa->config;
> +
> +       if (ops->reset_map)
> +               ops->reset_map(vdpa, asid);
> +}
> +
>  static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
>  {
>         struct vhost_vdpa_as *as = asid_to_as(v, asid);
> @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
>
>         hlist_del(&as->hash_link);
>         vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid);
> +       /*
> +        * Devices with vendor specific IOMMU may need to restore
> +        * iotlb to the initial or default state which is not done
> +        * through device reset, as the IOTLB mapping manipulation
> +        * could be decoupled from the virtio device life cycle.
> +        */

Should we do this according to whether IOTLB_PRESIST is set? Otherwise
we may break old userspace.

Thanks

> +       vhost_vdpa_reset_map(v, asid);
>         kfree(as);
>
>         return 0;
> --
> 1.8.3.1
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/4] vdpa/mlx5: implement .reset_map driver op
  2023-10-10  9:03 ` [PATCH 4/4] vdpa/mlx5: implement .reset_map driver op Si-Wei Liu
@ 2023-10-13  3:04   ` Jason Wang
  2023-10-13  7:55     ` Si-Wei Liu
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Wang @ 2023-10-13  3:04 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: mst, eperezma, xuanzhuo, dtatulea, virtualization, linux-kernel

On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
> Since commit 6f5312f80183 ("vdpa/mlx5: Add support for running with
> virtio_vdpa"), mlx5_vdpa starts with preallocate 1:1 DMA MR at device
> creation time. This 1:1 DMA MR will be implicitly destroyed while
> the first .set_map call is invoked, in which case callers like
> vhost-vdpa will start to set up custom mappings. When the .reset
> callback is invoked, the custom mappings will be cleared and the 1:1
> DMA MR will be re-created.
>
> In order to reduce excessive memory mapping cost in live migration,
> it is desirable to decouple the vhost-vdpa IOTLB abstraction from
> the virtio device life cycle, i.e. mappings can be kept around intact
> across virtio device reset. Leverage the .reset_map callback, which
> is meant to destroy the regular MR on the given ASID and recreate the
> initial DMA mapping. That way, the device .reset op can run free from
> having to maintain and clean up memory mappings by itself.
>
> The cvq mapping also needs to be cleared if is in the given ASID.
>
> Co-developed-by: Dragos Tatulea <dtatulea@nvidia.com>
> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>

I wonder if the simulator suffers from the exact same issue. If yes,
let's fix the simulator as well?

Thanks


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
  2023-10-13  3:01   ` Jason Wang
@ 2023-10-13  7:35     ` Si-Wei Liu
  2023-10-16  6:32       ` Jason Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Si-Wei Liu @ 2023-10-13  7:35 UTC (permalink / raw)
  To: Jason Wang
  Cc: mst, eperezma, xuanzhuo, dtatulea, virtualization, linux-kernel



On 10/12/2023 8:01 PM, Jason Wang wrote:
> On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>> Devices with on-chip IOMMU or vendor specific IOTLB implementation
>> may need to restore iotlb mapping to the initial or default state
>> using the .reset_map op, as it's desirable for some parent devices
>> to solely manipulate mappings by its own, independent of virtio device
>> state. For instance, device reset does not cause mapping go away on
>> such IOTLB model in need of persistent mapping. Before vhost-vdpa
>> is going away, give them a chance to reset iotlb back to the initial
>> state in vhost_vdpa_cleanup().
>>
>> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
>> ---
>>   drivers/vhost/vdpa.c | 16 ++++++++++++++++
>>   1 file changed, 16 insertions(+)
>>
>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>> index 851535f..a3f8160 100644
>> --- a/drivers/vhost/vdpa.c
>> +++ b/drivers/vhost/vdpa.c
>> @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
>>          return vhost_vdpa_alloc_as(v, asid);
>>   }
>>
>> +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid)
>> +{
>> +       struct vdpa_device *vdpa = v->vdpa;
>> +       const struct vdpa_config_ops *ops = vdpa->config;
>> +
>> +       if (ops->reset_map)
>> +               ops->reset_map(vdpa, asid);
>> +}
>> +
>>   static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
>>   {
>>          struct vhost_vdpa_as *as = asid_to_as(v, asid);
>> @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
>>
>>          hlist_del(&as->hash_link);
>>          vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid);
>> +       /*
>> +        * Devices with vendor specific IOMMU may need to restore
>> +        * iotlb to the initial or default state which is not done
>> +        * through device reset, as the IOTLB mapping manipulation
>> +        * could be decoupled from the virtio device life cycle.
>> +        */
> Should we do this according to whether IOTLB_PRESIST is set?
Well, in theory this seems like so but it's unnecessary code change 
actually, as that is the way how vDPA parent behind platform IOMMU works 
today, and userspace doesn't break as of today. :)

As explained in previous threads [1][2], when IOTLB_PERSIST is not set 
it doesn't necessarily mean the iotlb will definitely be destroyed 
across reset (think about the platform IOMMU case), so userspace today 
is already tolerating enough with either good or bad IOMMU. This code of 
not checking IOTLB_PERSIST being set is intentional, there's no point to 
emulate bad IOMMU behavior even for older userspace (with improper 
emulation to be done it would result in even worse performance). I think 
the purpose of the IOTLB_PERSIST flag is just to give userspace 100% 
certainty of persistent iotlb mapping not getting lost across vdpa reset.

Thanks,
-Siwei

[1] 
https://lore.kernel.org/virtualization/9f118fc9-4f6f-dd67-a291-be78152e47fd@oracle.com/
[2] 
https://lore.kernel.org/virtualization/3364adfd-1eb7-8bce-41f9-bfe5473f1f2e@oracle.com/
>   Otherwise
> we may break old userspace.
>
> Thanks
>
>> +       vhost_vdpa_reset_map(v, asid);
>>          kfree(as);
>>
>>          return 0;
>> --
>> 1.8.3.1
>>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/4] vdpa: introduce .reset_map operation callback
  2023-10-13  2:49   ` Jason Wang
@ 2023-10-13  7:36     ` Si-Wei Liu
  2023-10-16  5:30       ` Jason Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Si-Wei Liu @ 2023-10-13  7:36 UTC (permalink / raw)
  To: Jason Wang
  Cc: mst, eperezma, xuanzhuo, dtatulea, virtualization, linux-kernel



On 10/12/2023 7:49 PM, Jason Wang wrote:
> On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>> Device specific IOMMU parent driver who wishes to see mapping to be
>> decoupled from virtio or vdpa device life cycle (device reset) can use
>> it to restore memory mapping in the device IOMMU to the initial or
>> default state. The reset of mapping is done per address space basis.
>>
>> The reason why a separate .reset_map op is introduced is because this
>> allows a simple on-chip IOMMU model without exposing too much device
>> implementation details to the upper vdpa layer. The .dma_map/unmap or
>> .set_map driver API is meant to be used to manipulate the IOTLB mappings,
>> and has been abstracted in a way similar to how a real IOMMU device maps
>> or unmaps pages for certain memory ranges. However, apart from this there
>> also exists other mapping needs, in which case 1:1 passthrough mapping
>> has to be used by other users (read virtio-vdpa). To ease parent/vendor
>> driver implementation and to avoid abusing DMA ops in an unexpacted way,
>> these on-chip IOMMU devices can start with 1:1 passthrough mapping mode
>> initially at the he time of creation. Then the .reset_map op can be used
>> to switch iotlb back to this initial state without having to expose a
>> complex two-dimensional IOMMU device model.
>>
>> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
>> ---
>>   include/linux/vdpa.h | 10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
>> index d376309..26ae6ae 100644
>> --- a/include/linux/vdpa.h
>> +++ b/include/linux/vdpa.h
>> @@ -327,6 +327,15 @@ struct vdpa_map_file {
>>    *                             @iova: iova to be unmapped
>>    *                             @size: size of the area
>>    *                             Returns integer: success (0) or error (< 0)
>> + * @reset_map:                 Reset device memory mapping to the default
>> + *                             state (optional)
> I think we need to mention that this is a must for parents that use set_map()?
It's not a must IMO, some .set_map() user for e.g. VDUSE or vdpa-sim-blk 
can deliberately choose to implement .reset_map() depending on its own 
need. Those user_va software devices mostly don't care about mapping 
cost during reset, as they don't have to pin kernel memory in general. 
It's just whether or not they care about mapping being decoupled from 
device reset at all.

And the exact implementation requirement of this interface has been 
documented right below.

-Siwei
>
> Other than this:
>
> Acked-by: Jason Wang <jasowang@redhat.com>
>
> Thanks
>
>> + *                             Needed for devices that are using device
>> + *                             specific DMA translation and prefer mapping
>> + *                             to be decoupled from the virtio life cycle,
>> + *                             i.e. device .reset op does not reset mapping
>> + *                             @vdev: vdpa device
>> + *                             @asid: address space identifier
>> + *                             Returns integer: success (0) or error (< 0)
>>    * @get_vq_dma_dev:            Get the dma device for a specific
>>    *                             virtqueue (optional)
>>    *                             @vdev: vdpa device
>> @@ -405,6 +414,7 @@ struct vdpa_config_ops {
>>                         u64 iova, u64 size, u64 pa, u32 perm, void *opaque);
>>          int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid,
>>                           u64 iova, u64 size);
>> +       int (*reset_map)(struct vdpa_device *vdev, unsigned int asid);
>>          int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group,
>>                                unsigned int asid);
>>          struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx);
>> --
>> 1.8.3.1
>>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/4] vdpa/mlx5: implement .reset_map driver op
  2023-10-13  3:04   ` Jason Wang
@ 2023-10-13  7:55     ` Si-Wei Liu
  0 siblings, 0 replies; 37+ messages in thread
From: Si-Wei Liu @ 2023-10-13  7:55 UTC (permalink / raw)
  To: Jason Wang, Stefano Garzarella
  Cc: mst, eperezma, xuanzhuo, dtatulea, virtualization, linux-kernel



On 10/12/2023 8:04 PM, Jason Wang wrote:
> On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>> Since commit 6f5312f80183 ("vdpa/mlx5: Add support for running with
>> virtio_vdpa"), mlx5_vdpa starts with preallocate 1:1 DMA MR at device
>> creation time. This 1:1 DMA MR will be implicitly destroyed while
>> the first .set_map call is invoked, in which case callers like
>> vhost-vdpa will start to set up custom mappings. When the .reset
>> callback is invoked, the custom mappings will be cleared and the 1:1
>> DMA MR will be re-created.
>>
>> In order to reduce excessive memory mapping cost in live migration,
>> it is desirable to decouple the vhost-vdpa IOTLB abstraction from
>> the virtio device life cycle, i.e. mappings can be kept around intact
>> across virtio device reset. Leverage the .reset_map callback, which
>> is meant to destroy the regular MR on the given ASID and recreate the
>> initial DMA mapping. That way, the device .reset op can run free from
>> having to maintain and clean up memory mappings by itself.
>>
>> The cvq mapping also needs to be cleared if is in the given ASID.
>>
>> Co-developed-by: Dragos Tatulea <dtatulea@nvidia.com>
>> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
>> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
> I wonder if the simulator suffers from the exact same issue.
For vdpa-sim !use_va (map using PA and with pinning) case, yes. But I'm 
not sure the situation of the vdpa-sim(-blk) use_va case, e.g. I haven't 
checked if there's dependency on today's reset behavior (coupled), and 
if QEMU vhost-vdpa backend driver is the only userspace consumer. Maybe 
Stefano knows?

I can give it a try on simulator fix but don't count me on the 
vdpa-sim(-blk) use_va part.

Regards,
-Siwei



>   If yes,
> let's fix the simulator as well?
>
> Thanks
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/4] vdpa: introduce .reset_map operation callback
  2023-10-13  7:36     ` Si-Wei Liu
@ 2023-10-16  5:30       ` Jason Wang
  0 siblings, 0 replies; 37+ messages in thread
From: Jason Wang @ 2023-10-16  5:30 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: mst, eperezma, xuanzhuo, dtatulea, virtualization, linux-kernel

On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 10/12/2023 7:49 PM, Jason Wang wrote:
> > On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >> Device specific IOMMU parent driver who wishes to see mapping to be
> >> decoupled from virtio or vdpa device life cycle (device reset) can use
> >> it to restore memory mapping in the device IOMMU to the initial or
> >> default state. The reset of mapping is done per address space basis.
> >>
> >> The reason why a separate .reset_map op is introduced is because this
> >> allows a simple on-chip IOMMU model without exposing too much device
> >> implementation details to the upper vdpa layer. The .dma_map/unmap or
> >> .set_map driver API is meant to be used to manipulate the IOTLB mappings,
> >> and has been abstracted in a way similar to how a real IOMMU device maps
> >> or unmaps pages for certain memory ranges. However, apart from this there
> >> also exists other mapping needs, in which case 1:1 passthrough mapping
> >> has to be used by other users (read virtio-vdpa). To ease parent/vendor
> >> driver implementation and to avoid abusing DMA ops in an unexpacted way,
> >> these on-chip IOMMU devices can start with 1:1 passthrough mapping mode
> >> initially at the he time of creation. Then the .reset_map op can be used
> >> to switch iotlb back to this initial state without having to expose a
> >> complex two-dimensional IOMMU device model.
> >>
> >> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
> >> ---
> >>   include/linux/vdpa.h | 10 ++++++++++
> >>   1 file changed, 10 insertions(+)
> >>
> >> diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
> >> index d376309..26ae6ae 100644
> >> --- a/include/linux/vdpa.h
> >> +++ b/include/linux/vdpa.h
> >> @@ -327,6 +327,15 @@ struct vdpa_map_file {
> >>    *                             @iova: iova to be unmapped
> >>    *                             @size: size of the area
> >>    *                             Returns integer: success (0) or error (< 0)
> >> + * @reset_map:                 Reset device memory mapping to the default
> >> + *                             state (optional)
> > I think we need to mention that this is a must for parents that use set_map()?
> It's not a must IMO, some .set_map() user for e.g. VDUSE or vdpa-sim-blk
> can deliberately choose to implement .reset_map() depending on its own
> need. Those user_va software devices mostly don't care about mapping
> cost during reset, as they don't have to pin kernel memory in general.
> It's just whether or not they care about mapping being decoupled from
> device reset at all.

Ok, let's document this in the changelog at least.

Thanks

>
> And the exact implementation requirement of this interface has been
> documented right below.
>
> -Siwei
> >
> > Other than this:
> >
> > Acked-by: Jason Wang <jasowang@redhat.com>
> >
> > Thanks
> >
> >> + *                             Needed for devices that are using device
> >> + *                             specific DMA translation and prefer mapping
> >> + *                             to be decoupled from the virtio life cycle,
> >> + *                             i.e. device .reset op does not reset mapping
> >> + *                             @vdev: vdpa device
> >> + *                             @asid: address space identifier
> >> + *                             Returns integer: success (0) or error (< 0)
> >>    * @get_vq_dma_dev:            Get the dma device for a specific
> >>    *                             virtqueue (optional)
> >>    *                             @vdev: vdpa device
> >> @@ -405,6 +414,7 @@ struct vdpa_config_ops {
> >>                         u64 iova, u64 size, u64 pa, u32 perm, void *opaque);
> >>          int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid,
> >>                           u64 iova, u64 size);
> >> +       int (*reset_map)(struct vdpa_device *vdev, unsigned int asid);
> >>          int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group,
> >>                                unsigned int asid);
> >>          struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx);
> >> --
> >> 1.8.3.1
> >>
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
  2023-10-13  7:35     ` Si-Wei Liu
@ 2023-10-16  6:32       ` Jason Wang
  2023-10-16 11:28         ` Eugenio Perez Martin
  2023-10-16 20:10         ` Si-Wei Liu
  0 siblings, 2 replies; 37+ messages in thread
From: Jason Wang @ 2023-10-16  6:32 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: mst, eperezma, xuanzhuo, dtatulea, virtualization, linux-kernel

On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 10/12/2023 8:01 PM, Jason Wang wrote:
> > On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >> Devices with on-chip IOMMU or vendor specific IOTLB implementation
> >> may need to restore iotlb mapping to the initial or default state
> >> using the .reset_map op, as it's desirable for some parent devices
> >> to solely manipulate mappings by its own, independent of virtio device
> >> state. For instance, device reset does not cause mapping go away on
> >> such IOTLB model in need of persistent mapping. Before vhost-vdpa
> >> is going away, give them a chance to reset iotlb back to the initial
> >> state in vhost_vdpa_cleanup().
> >>
> >> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
> >> ---
> >>   drivers/vhost/vdpa.c | 16 ++++++++++++++++
> >>   1 file changed, 16 insertions(+)
> >>
> >> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> >> index 851535f..a3f8160 100644
> >> --- a/drivers/vhost/vdpa.c
> >> +++ b/drivers/vhost/vdpa.c
> >> @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
> >>          return vhost_vdpa_alloc_as(v, asid);
> >>   }
> >>
> >> +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid)
> >> +{
> >> +       struct vdpa_device *vdpa = v->vdpa;
> >> +       const struct vdpa_config_ops *ops = vdpa->config;
> >> +
> >> +       if (ops->reset_map)
> >> +               ops->reset_map(vdpa, asid);
> >> +}
> >> +
> >>   static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
> >>   {
> >>          struct vhost_vdpa_as *as = asid_to_as(v, asid);
> >> @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
> >>
> >>          hlist_del(&as->hash_link);
> >>          vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid);
> >> +       /*
> >> +        * Devices with vendor specific IOMMU may need to restore
> >> +        * iotlb to the initial or default state which is not done
> >> +        * through device reset, as the IOTLB mapping manipulation
> >> +        * could be decoupled from the virtio device life cycle.
> >> +        */
> > Should we do this according to whether IOTLB_PRESIST is set?
> Well, in theory this seems like so but it's unnecessary code change
> actually, as that is the way how vDPA parent behind platform IOMMU works
> today, and userspace doesn't break as of today. :)

Well, this is one question I've ever asked before. You have explained
that one of the reason that we don't break userspace is that they may
couple IOTLB reset with vDPA reset as well. One example is the Qemu.

>
> As explained in previous threads [1][2], when IOTLB_PERSIST is not set
> it doesn't necessarily mean the iotlb will definitely be destroyed
> across reset (think about the platform IOMMU case), so userspace today
> is already tolerating enough with either good or bad IOMMU. This code of
> not checking IOTLB_PERSIST being set is intentional, there's no point to
> emulate bad IOMMU behavior even for older userspace (with improper
> emulation to be done it would result in even worse performance).

For two reasons:

1) backend features need acked by userspace this is by design
2) keep the odd behaviour seems to be more safe as we can't audit
every userspace program

Thanks

> I think
> the purpose of the IOTLB_PERSIST flag is just to give userspace 100%
> certainty of persistent iotlb mapping not getting lost across vdpa reset.
>
> Thanks,
> -Siwei
>
> [1]
> https://lore.kernel.org/virtualization/9f118fc9-4f6f-dd67-a291-be78152e47fd@oracle.com/
> [2]
> https://lore.kernel.org/virtualization/3364adfd-1eb7-8bce-41f9-bfe5473f1f2e@oracle.com/
> >   Otherwise
> > we may break old userspace.
> >
> > Thanks
> >
> >> +       vhost_vdpa_reset_map(v, asid);
> >>          kfree(as);
> >>
> >>          return 0;
> >> --
> >> 1.8.3.1
> >>
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
  2023-10-16  6:32       ` Jason Wang
@ 2023-10-16 11:28         ` Eugenio Perez Martin
  2023-10-16 20:30           ` Si-Wei Liu
  2023-10-16 20:10         ` Si-Wei Liu
  1 sibling, 1 reply; 37+ messages in thread
From: Eugenio Perez Martin @ 2023-10-16 11:28 UTC (permalink / raw)
  To: Jason Wang
  Cc: Si-Wei Liu, mst, xuanzhuo, dtatulea, virtualization, linux-kernel

On Mon, Oct 16, 2023 at 8:33 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >
> >
> >
> > On 10/12/2023 8:01 PM, Jason Wang wrote:
> > > On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> > >> Devices with on-chip IOMMU or vendor specific IOTLB implementation
> > >> may need to restore iotlb mapping to the initial or default state
> > >> using the .reset_map op, as it's desirable for some parent devices
> > >> to solely manipulate mappings by its own, independent of virtio device
> > >> state. For instance, device reset does not cause mapping go away on
> > >> such IOTLB model in need of persistent mapping. Before vhost-vdpa
> > >> is going away, give them a chance to reset iotlb back to the initial
> > >> state in vhost_vdpa_cleanup().
> > >>
> > >> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
> > >> ---
> > >>   drivers/vhost/vdpa.c | 16 ++++++++++++++++
> > >>   1 file changed, 16 insertions(+)
> > >>
> > >> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > >> index 851535f..a3f8160 100644
> > >> --- a/drivers/vhost/vdpa.c
> > >> +++ b/drivers/vhost/vdpa.c
> > >> @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
> > >>          return vhost_vdpa_alloc_as(v, asid);
> > >>   }
> > >>
> > >> +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid)
> > >> +{
> > >> +       struct vdpa_device *vdpa = v->vdpa;
> > >> +       const struct vdpa_config_ops *ops = vdpa->config;
> > >> +
> > >> +       if (ops->reset_map)
> > >> +               ops->reset_map(vdpa, asid);
> > >> +}
> > >> +
> > >>   static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
> > >>   {
> > >>          struct vhost_vdpa_as *as = asid_to_as(v, asid);
> > >> @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
> > >>
> > >>          hlist_del(&as->hash_link);
> > >>          vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid);
> > >> +       /*
> > >> +        * Devices with vendor specific IOMMU may need to restore
> > >> +        * iotlb to the initial or default state which is not done
> > >> +        * through device reset, as the IOTLB mapping manipulation
> > >> +        * could be decoupled from the virtio device life cycle.
> > >> +        */
> > > Should we do this according to whether IOTLB_PRESIST is set?
> > Well, in theory this seems like so but it's unnecessary code change
> > actually, as that is the way how vDPA parent behind platform IOMMU works
> > today, and userspace doesn't break as of today. :)
>
> Well, this is one question I've ever asked before. You have explained
> that one of the reason that we don't break userspace is that they may
> couple IOTLB reset with vDPA reset as well. One example is the Qemu.
>
> >
> > As explained in previous threads [1][2], when IOTLB_PERSIST is not set
> > it doesn't necessarily mean the iotlb will definitely be destroyed
> > across reset (think about the platform IOMMU case), so userspace today
> > is already tolerating enough with either good or bad IOMMU. This code of
> > not checking IOTLB_PERSIST being set is intentional, there's no point to
> > emulate bad IOMMU behavior even for older userspace (with improper
> > emulation to be done it would result in even worse performance).
>
> For two reasons:
>
> 1) backend features need acked by userspace this is by design
> 2) keep the odd behaviour seems to be more safe as we can't audit
> every userspace program
>

The old behavior (without flag ack) cannot be trusted already, as:
* Devices using platform IOMMU (in other words, not implementing
neither .set_map nor .dma_map) does not unmap memory at virtio reset.
* Devices that implement .set_map or .dma_map (vdpa_sim, mlx5) do
reset IOTLB, but in their parent ops (vdpasim_do_reset, prune_iotlb
called from mlx5_vdpa_reset). With vdpa_sim patch removing the reset,
now all backends work the same as far as I know., which was (and is)
the way devices using the platform IOMMU works.

The difference in behavior did not matter as QEMU unmaps all the
memory unregistering the memory listener at vhost_vdpa_dev_start(...,
started = false), but the backend acknowledging this feature flag
allows QEMU to make sure it is safe to skip this unmap & map in the
case of vhost stop & start cycle.

In that sense, this feature flag is actually a signal for userspace to
know that the bug has been solved. Not offering it indicates that
userspace cannot trust the kernel will retain the maps.

Si-Wei or Dragos, please correct me if I've missed something. Feel
free to use the text in case you find more clear in doc or patch log.

Thanks!

> Thanks
>
> > I think
> > the purpose of the IOTLB_PERSIST flag is just to give userspace 100%
> > certainty of persistent iotlb mapping not getting lost across vdpa reset.
> >
> > Thanks,
> > -Siwei
> >
> > [1]
> > https://lore.kernel.org/virtualization/9f118fc9-4f6f-dd67-a291-be78152e47fd@oracle.com/
> > [2]
> > https://lore.kernel.org/virtualization/3364adfd-1eb7-8bce-41f9-bfe5473f1f2e@oracle.com/
> > >   Otherwise
> > > we may break old userspace.
> > >
> > > Thanks
> > >
> > >> +       vhost_vdpa_reset_map(v, asid);
> > >>          kfree(as);
> > >>
> > >>          return 0;
> > >> --
> > >> 1.8.3.1
> > >>
> >
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
  2023-10-16  6:32       ` Jason Wang
  2023-10-16 11:28         ` Eugenio Perez Martin
@ 2023-10-16 20:10         ` Si-Wei Liu
  1 sibling, 0 replies; 37+ messages in thread
From: Si-Wei Liu @ 2023-10-16 20:10 UTC (permalink / raw)
  To: Jason Wang
  Cc: mst, eperezma, xuanzhuo, dtatulea, virtualization, linux-kernel



On 10/15/2023 11:32 PM, Jason Wang wrote:
> On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 10/12/2023 8:01 PM, Jason Wang wrote:
>>> On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>> Devices with on-chip IOMMU or vendor specific IOTLB implementation
>>>> may need to restore iotlb mapping to the initial or default state
>>>> using the .reset_map op, as it's desirable for some parent devices
>>>> to solely manipulate mappings by its own, independent of virtio device
>>>> state. For instance, device reset does not cause mapping go away on
>>>> such IOTLB model in need of persistent mapping. Before vhost-vdpa
>>>> is going away, give them a chance to reset iotlb back to the initial
>>>> state in vhost_vdpa_cleanup().
>>>>
>>>> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
>>>> ---
>>>>    drivers/vhost/vdpa.c | 16 ++++++++++++++++
>>>>    1 file changed, 16 insertions(+)
>>>>
>>>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>>>> index 851535f..a3f8160 100644
>>>> --- a/drivers/vhost/vdpa.c
>>>> +++ b/drivers/vhost/vdpa.c
>>>> @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
>>>>           return vhost_vdpa_alloc_as(v, asid);
>>>>    }
>>>>
>>>> +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid)
>>>> +{
>>>> +       struct vdpa_device *vdpa = v->vdpa;
>>>> +       const struct vdpa_config_ops *ops = vdpa->config;
>>>> +
>>>> +       if (ops->reset_map)
>>>> +               ops->reset_map(vdpa, asid);
>>>> +}
>>>> +
>>>>    static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
>>>>    {
>>>>           struct vhost_vdpa_as *as = asid_to_as(v, asid);
>>>> @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
>>>>
>>>>           hlist_del(&as->hash_link);
>>>>           vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid);
>>>> +       /*
>>>> +        * Devices with vendor specific IOMMU may need to restore
>>>> +        * iotlb to the initial or default state which is not done
>>>> +        * through device reset, as the IOTLB mapping manipulation
>>>> +        * could be decoupled from the virtio device life cycle.
>>>> +        */
>>> Should we do this according to whether IOTLB_PRESIST is set?
>> Well, in theory this seems like so but it's unnecessary code change
>> actually, as that is the way how vDPA parent behind platform IOMMU works
>> today, and userspace doesn't break as of today. :)
> Well, this is one question I've ever asked before. You have explained
> that one of the reason that we don't break userspace is that they may
> couple IOTLB reset with vDPA reset as well. One example is the Qemu.
Nope, it was the opposite. Maybe it was not clear enough, let me try 
once more - userspace CANNOT decouple IOTLB reset from vDPA reset today. 
This is because of bug/discrepancy in mlx5_vdap and vdpa_sim already 
breaking userspace's expectation, rendering the brokenness/inconsistency 
on vhost-vdpa mapping interface from behaving what it promised and 
should have done. Only with the IOTLB_PERSIST flag seen userspace can 
trust vhost-vdpa kernel interface *reliably* to decouple IOTLB reset 
from vDPA reset. Without seeing this flag, no matter how the code in 
QEMU was written, today's older userspace was never like to assume the 
mappings will *definitely* be cleared by vDPA reset. If any userspace 
implementation wants to get consistent behavior for all vDPA parent 
devices, it still has to *explicitly* clear all existing mappings by its 
own by sending bunch of unmap (iotlb invalidate) requests to vhost-vdpa 
kernel before resetting the vDPA backend.

In brief, userspace is already broken by kernel implementation today, 
and new userspace needs some device flag to know for sure if kernel bug 
has already been fixed; older userspace doesn't care about preserving 
the broken kernel behavior at all, regardless whether or not it wants to 
decouple IOTLB from vDPA reset.

>
>> As explained in previous threads [1][2], when IOTLB_PERSIST is not set
>> it doesn't necessarily mean the iotlb will definitely be destroyed
>> across reset (think about the platform IOMMU case), so userspace today
>> is already tolerating enough with either good or bad IOMMU. This code of
>> not checking IOTLB_PERSIST being set is intentional, there's no point to
>> emulate bad IOMMU behavior even for older userspace (with improper
>> emulation to be done it would result in even worse performance).
> For two reasons:
>
> 1) backend features need acked by userspace this is by design
There's no breakage on this part. Backend feature IOTLB_PERSIST won't be 
set if userspace doesn't ack.
> 2) keep the odd behaviour seems to be more safe as we can't audit
> every userspace program
Definitely don't have to audit every userspace program, but I cannot 
think of a case where a sane userspace program can be broken. Can you 
elaborate one or two potential userspace usage that may break because of 
this? As said, platform IOMMU already did it this way.

Regards,
-Siwei
>
> Thanks
>
>> I think
>> the purpose of the IOTLB_PERSIST flag is just to give userspace 100%
>> certainty of persistent iotlb mapping not getting lost across vdpa reset.
>>
>> Thanks,
>> -Siwei
>>
>> [1]
>> https://lore.kernel.org/virtualization/9f118fc9-4f6f-dd67-a291-be78152e47fd@oracle.com/
>> [2]
>> https://lore.kernel.org/virtualization/3364adfd-1eb7-8bce-41f9-bfe5473f1f2e@oracle.com/
>>>    Otherwise
>>> we may break old userspace.
>>>
>>> Thanks
>>>
>>>> +       vhost_vdpa_reset_map(v, asid);
>>>>           kfree(as);
>>>>
>>>>           return 0;
>>>> --
>>>> 1.8.3.1
>>>>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
  2023-10-16 11:28         ` Eugenio Perez Martin
@ 2023-10-16 20:30           ` Si-Wei Liu
  2023-10-17  2:35             ` Jason Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Si-Wei Liu @ 2023-10-16 20:30 UTC (permalink / raw)
  To: Eugenio Perez Martin, Jason Wang
  Cc: mst, xuanzhuo, dtatulea, virtualization, linux-kernel



On 10/16/2023 4:28 AM, Eugenio Perez Martin wrote:
> On Mon, Oct 16, 2023 at 8:33 AM Jason Wang <jasowang@redhat.com> wrote:
>> On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>
>>>
>>> On 10/12/2023 8:01 PM, Jason Wang wrote:
>>>> On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>> Devices with on-chip IOMMU or vendor specific IOTLB implementation
>>>>> may need to restore iotlb mapping to the initial or default state
>>>>> using the .reset_map op, as it's desirable for some parent devices
>>>>> to solely manipulate mappings by its own, independent of virtio device
>>>>> state. For instance, device reset does not cause mapping go away on
>>>>> such IOTLB model in need of persistent mapping. Before vhost-vdpa
>>>>> is going away, give them a chance to reset iotlb back to the initial
>>>>> state in vhost_vdpa_cleanup().
>>>>>
>>>>> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
>>>>> ---
>>>>>    drivers/vhost/vdpa.c | 16 ++++++++++++++++
>>>>>    1 file changed, 16 insertions(+)
>>>>>
>>>>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>>>>> index 851535f..a3f8160 100644
>>>>> --- a/drivers/vhost/vdpa.c
>>>>> +++ b/drivers/vhost/vdpa.c
>>>>> @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
>>>>>           return vhost_vdpa_alloc_as(v, asid);
>>>>>    }
>>>>>
>>>>> +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid)
>>>>> +{
>>>>> +       struct vdpa_device *vdpa = v->vdpa;
>>>>> +       const struct vdpa_config_ops *ops = vdpa->config;
>>>>> +
>>>>> +       if (ops->reset_map)
>>>>> +               ops->reset_map(vdpa, asid);
>>>>> +}
>>>>> +
>>>>>    static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
>>>>>    {
>>>>>           struct vhost_vdpa_as *as = asid_to_as(v, asid);
>>>>> @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
>>>>>
>>>>>           hlist_del(&as->hash_link);
>>>>>           vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid);
>>>>> +       /*
>>>>> +        * Devices with vendor specific IOMMU may need to restore
>>>>> +        * iotlb to the initial or default state which is not done
>>>>> +        * through device reset, as the IOTLB mapping manipulation
>>>>> +        * could be decoupled from the virtio device life cycle.
>>>>> +        */
>>>> Should we do this according to whether IOTLB_PRESIST is set?
>>> Well, in theory this seems like so but it's unnecessary code change
>>> actually, as that is the way how vDPA parent behind platform IOMMU works
>>> today, and userspace doesn't break as of today. :)
>> Well, this is one question I've ever asked before. You have explained
>> that one of the reason that we don't break userspace is that they may
>> couple IOTLB reset with vDPA reset as well. One example is the Qemu.
>>
>>> As explained in previous threads [1][2], when IOTLB_PERSIST is not set
>>> it doesn't necessarily mean the iotlb will definitely be destroyed
>>> across reset (think about the platform IOMMU case), so userspace today
>>> is already tolerating enough with either good or bad IOMMU. This code of
>>> not checking IOTLB_PERSIST being set is intentional, there's no point to
>>> emulate bad IOMMU behavior even for older userspace (with improper
>>> emulation to be done it would result in even worse performance).
>> For two reasons:
>>
>> 1) backend features need acked by userspace this is by design
>> 2) keep the odd behaviour seems to be more safe as we can't audit
>> every userspace program
>>
> The old behavior (without flag ack) cannot be trusted already, as:
> * Devices using platform IOMMU (in other words, not implementing
> neither .set_map nor .dma_map) does not unmap memory at virtio reset.
> * Devices that implement .set_map or .dma_map (vdpa_sim, mlx5) do
> reset IOTLB, but in their parent ops (vdpasim_do_reset, prune_iotlb
> called from mlx5_vdpa_reset). With vdpa_sim patch removing the reset,
> now all backends work the same as far as I know., which was (and is)
> the way devices using the platform IOMMU works.
>
> The difference in behavior did not matter as QEMU unmaps all the
> memory unregistering the memory listener at vhost_vdpa_dev_start(...,
> started = false),
Exactly. It's not just QEMU, but any (older) userspace manipulates 
mappings through the vhost-vdpa iotlb interface has to unmap all 
mappings to workaround the vdpa parent driver bug. If they don't do 
explicit unmap, it would cause state inconsistency between vhost-vdpa 
and parent driver, then old mappings can't be restored, and new mapping 
can be added to iotlb after vDPA reset. There's no point to preserve 
this broken and inconsistent behavior between vhost-vdpa and parent 
driver, as userspace doesn't care at all!

> but the backend acknowledging this feature flag
> allows QEMU to make sure it is safe to skip this unmap & map in the
> case of vhost stop & start cycle.
>
> In that sense, this feature flag is actually a signal for userspace to
> know that the bug has been solved.
Right, I couldn't say it better than you do, thanks! The feature flag is 
more of an unusual means to indicating kernel bug having been fixed, 
rather than introduce a new feature or new kernel behavior ending up in 
change of userspace's expectation.

> Not offering it indicates that
> userspace cannot trust the kernel will retain the maps.
>
> Si-Wei or Dragos, please correct me if I've missed something. Feel
> free to use the text in case you find more clear in doc or patch log.
Sure, will do, thank you! Will post v2 adding these to the log.

Thanks,
-Siwei



>
> Thanks!
>
>> Thanks
>>
>>> I think
>>> the purpose of the IOTLB_PERSIST flag is just to give userspace 100%
>>> certainty of persistent iotlb mapping not getting lost across vdpa reset.
>>>
>>> Thanks,
>>> -Siwei
>>>
>>> [1]
>>> https://lore.kernel.org/virtualization/9f118fc9-4f6f-dd67-a291-be78152e47fd@oracle.com/
>>> [2]
>>> https://lore.kernel.org/virtualization/3364adfd-1eb7-8bce-41f9-bfe5473f1f2e@oracle.com/
>>>>    Otherwise
>>>> we may break old userspace.
>>>>
>>>> Thanks
>>>>
>>>>> +       vhost_vdpa_reset_map(v, asid);
>>>>>           kfree(as);
>>>>>
>>>>>           return 0;
>>>>> --
>>>>> 1.8.3.1
>>>>>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
  2023-10-16 20:30           ` Si-Wei Liu
@ 2023-10-17  2:35             ` Jason Wang
  2023-10-17 13:58               ` Eugenio Perez Martin
  2023-10-18  4:35               ` Si-Wei Liu
  0 siblings, 2 replies; 37+ messages in thread
From: Jason Wang @ 2023-10-17  2:35 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Eugenio Perez Martin, mst, xuanzhuo, dtatulea, virtualization,
	linux-kernel

On Tue, Oct 17, 2023 at 4:30 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 10/16/2023 4:28 AM, Eugenio Perez Martin wrote:
> > On Mon, Oct 16, 2023 at 8:33 AM Jason Wang <jasowang@redhat.com> wrote:
> >> On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>
> >>>
> >>> On 10/12/2023 8:01 PM, Jason Wang wrote:
> >>>> On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>> Devices with on-chip IOMMU or vendor specific IOTLB implementation
> >>>>> may need to restore iotlb mapping to the initial or default state
> >>>>> using the .reset_map op, as it's desirable for some parent devices
> >>>>> to solely manipulate mappings by its own, independent of virtio device
> >>>>> state. For instance, device reset does not cause mapping go away on
> >>>>> such IOTLB model in need of persistent mapping. Before vhost-vdpa
> >>>>> is going away, give them a chance to reset iotlb back to the initial
> >>>>> state in vhost_vdpa_cleanup().
> >>>>>
> >>>>> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
> >>>>> ---
> >>>>>    drivers/vhost/vdpa.c | 16 ++++++++++++++++
> >>>>>    1 file changed, 16 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> >>>>> index 851535f..a3f8160 100644
> >>>>> --- a/drivers/vhost/vdpa.c
> >>>>> +++ b/drivers/vhost/vdpa.c
> >>>>> @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
> >>>>>           return vhost_vdpa_alloc_as(v, asid);
> >>>>>    }
> >>>>>
> >>>>> +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid)
> >>>>> +{
> >>>>> +       struct vdpa_device *vdpa = v->vdpa;
> >>>>> +       const struct vdpa_config_ops *ops = vdpa->config;
> >>>>> +
> >>>>> +       if (ops->reset_map)
> >>>>> +               ops->reset_map(vdpa, asid);
> >>>>> +}
> >>>>> +
> >>>>>    static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
> >>>>>    {
> >>>>>           struct vhost_vdpa_as *as = asid_to_as(v, asid);
> >>>>> @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
> >>>>>
> >>>>>           hlist_del(&as->hash_link);
> >>>>>           vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid);
> >>>>> +       /*
> >>>>> +        * Devices with vendor specific IOMMU may need to restore
> >>>>> +        * iotlb to the initial or default state which is not done
> >>>>> +        * through device reset, as the IOTLB mapping manipulation
> >>>>> +        * could be decoupled from the virtio device life cycle.
> >>>>> +        */
> >>>> Should we do this according to whether IOTLB_PRESIST is set?
> >>> Well, in theory this seems like so but it's unnecessary code change
> >>> actually, as that is the way how vDPA parent behind platform IOMMU works
> >>> today, and userspace doesn't break as of today. :)
> >> Well, this is one question I've ever asked before. You have explained
> >> that one of the reason that we don't break userspace is that they may
> >> couple IOTLB reset with vDPA reset as well. One example is the Qemu.
> >>
> >>> As explained in previous threads [1][2], when IOTLB_PERSIST is not set
> >>> it doesn't necessarily mean the iotlb will definitely be destroyed
> >>> across reset (think about the platform IOMMU case), so userspace today
> >>> is already tolerating enough with either good or bad IOMMU.

I'm confused, how to define tolerating here? For example, if it has
tolerance, why bother?

> >>This code of
> >>> not checking IOTLB_PERSIST being set is intentional, there's no point to
> >>> emulate bad IOMMU behavior even for older userspace (with improper
> >>> emulation to be done it would result in even worse performance).

I can easily imagine a case:

The old Qemu that works only with a setup like mlx5_vdpa. If we do
this without a negotiation, IOTLB will not be clear but the Qemu will
try to re-program the IOTLB after reset. Which will break?

1) stick the exact old behaviour with just one line of check
2) audit all the possible cases to avoid a one line of code

1) seems much easier than 2)

> >> For two reasons:
> >>
> >> 1) backend features need acked by userspace this is by design
> >> 2) keep the odd behaviour seems to be more safe as we can't audit
> >> every userspace program
> >>
> > The old behavior (without flag ack) cannot be trusted already, as:

Possibly but the point is to unbreak userspace no matter how weird the
behaviour we've ever had.

> > * Devices using platform IOMMU (in other words, not implementing
> > neither .set_map nor .dma_map) does not unmap memory at virtio reset.
> > * Devices that implement .set_map or .dma_map (vdpa_sim, mlx5) do
> > reset IOTLB, but in their parent ops (vdpasim_do_reset, prune_iotlb
> > called from mlx5_vdpa_reset). With vdpa_sim patch removing the reset,
> > now all backends work the same as far as I know., which was (and is)
> > the way devices using the platform IOMMU works.
> >
> > The difference in behavior did not matter as QEMU unmaps all the
> > memory unregistering the memory listener at vhost_vdpa_dev_start(...,
> > started = false),
> Exactly. It's not just QEMU, but any (older) userspace manipulates
> mappings through the vhost-vdpa iotlb interface has to unmap all
> mappings to workaround the vdpa parent driver bug.

Just to clarify, from userspace, it's the (odd) behaviour of the current uAPI.

> If they don't do
> explicit unmap, it would cause state inconsistency between vhost-vdpa
> and parent driver, then old mappings can't be restored, and new mapping
> can be added to iotlb after vDPA reset. There's no point to preserve
> this broken and inconsistent behavior between vhost-vdpa and parent
> driver, as userspace doesn't care at all!

It's a userspace notice change so we can't fix it silently:

https://lkml.org/lkml/2012/12/23/75

Another example which is related to vhost-vDPA:

https://lore.kernel.org/netdev/20230927140544.205088-1-eric.auger@redhat.com/T/

Thanks

>
> > but the backend acknowledging this feature flag
> > allows QEMU to make sure it is safe to skip this unmap & map in the
> > case of vhost stop & start cycle.
> >
> > In that sense, this feature flag is actually a signal for userspace to
> > know that the bug has been solved.
> Right, I couldn't say it better than you do, thanks! The feature flag is
> more of an unusual means to indicating kernel bug having been fixed,
> rather than introduce a new feature or new kernel behavior ending up in
> change of userspace's expectation.
>
> > Not offering it indicates that
> > userspace cannot trust the kernel will retain the maps.
> >
> > Si-Wei or Dragos, please correct me if I've missed something. Feel
> > free to use the text in case you find more clear in doc or patch log.
> Sure, will do, thank you! Will post v2 adding these to the log.
>
> Thanks,
> -Siwei
>
>
>
> >
> > Thanks!
> >
> >> Thanks
> >>
> >>> I think
> >>> the purpose of the IOTLB_PERSIST flag is just to give userspace 100%
> >>> certainty of persistent iotlb mapping not getting lost across vdpa reset.
> >>>
> >>> Thanks,
> >>> -Siwei
> >>>
> >>> [1]
> >>> https://lore.kernel.org/virtualization/9f118fc9-4f6f-dd67-a291-be78152e47fd@oracle.com/
> >>> [2]
> >>> https://lore.kernel.org/virtualization/3364adfd-1eb7-8bce-41f9-bfe5473f1f2e@oracle.com/
> >>>>    Otherwise
> >>>> we may break old userspace.
> >>>>
> >>>> Thanks
> >>>>
> >>>>> +       vhost_vdpa_reset_map(v, asid);
> >>>>>           kfree(as);
> >>>>>
> >>>>>           return 0;
> >>>>> --
> >>>>> 1.8.3.1
> >>>>>
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
  2023-10-17  2:35             ` Jason Wang
@ 2023-10-17 13:58               ` Eugenio Perez Martin
  2023-10-18  4:35               ` Si-Wei Liu
  1 sibling, 0 replies; 37+ messages in thread
From: Eugenio Perez Martin @ 2023-10-17 13:58 UTC (permalink / raw)
  To: Jason Wang
  Cc: Si-Wei Liu, mst, xuanzhuo, dtatulea, virtualization, linux-kernel

On Tue, Oct 17, 2023 at 4:35 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Tue, Oct 17, 2023 at 4:30 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >
> >
> >
> > On 10/16/2023 4:28 AM, Eugenio Perez Martin wrote:
> > > On Mon, Oct 16, 2023 at 8:33 AM Jason Wang <jasowang@redhat.com> wrote:
> > >> On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> > >>>
> > >>>
> > >>> On 10/12/2023 8:01 PM, Jason Wang wrote:
> > >>>> On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> > >>>>> Devices with on-chip IOMMU or vendor specific IOTLB implementation
> > >>>>> may need to restore iotlb mapping to the initial or default state
> > >>>>> using the .reset_map op, as it's desirable for some parent devices
> > >>>>> to solely manipulate mappings by its own, independent of virtio device
> > >>>>> state. For instance, device reset does not cause mapping go away on
> > >>>>> such IOTLB model in need of persistent mapping. Before vhost-vdpa
> > >>>>> is going away, give them a chance to reset iotlb back to the initial
> > >>>>> state in vhost_vdpa_cleanup().
> > >>>>>
> > >>>>> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
> > >>>>> ---
> > >>>>>    drivers/vhost/vdpa.c | 16 ++++++++++++++++
> > >>>>>    1 file changed, 16 insertions(+)
> > >>>>>
> > >>>>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > >>>>> index 851535f..a3f8160 100644
> > >>>>> --- a/drivers/vhost/vdpa.c
> > >>>>> +++ b/drivers/vhost/vdpa.c
> > >>>>> @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
> > >>>>>           return vhost_vdpa_alloc_as(v, asid);
> > >>>>>    }
> > >>>>>
> > >>>>> +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid)
> > >>>>> +{
> > >>>>> +       struct vdpa_device *vdpa = v->vdpa;
> > >>>>> +       const struct vdpa_config_ops *ops = vdpa->config;
> > >>>>> +
> > >>>>> +       if (ops->reset_map)
> > >>>>> +               ops->reset_map(vdpa, asid);
> > >>>>> +}
> > >>>>> +
> > >>>>>    static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
> > >>>>>    {
> > >>>>>           struct vhost_vdpa_as *as = asid_to_as(v, asid);
> > >>>>> @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
> > >>>>>
> > >>>>>           hlist_del(&as->hash_link);
> > >>>>>           vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid);
> > >>>>> +       /*
> > >>>>> +        * Devices with vendor specific IOMMU may need to restore
> > >>>>> +        * iotlb to the initial or default state which is not done
> > >>>>> +        * through device reset, as the IOTLB mapping manipulation
> > >>>>> +        * could be decoupled from the virtio device life cycle.
> > >>>>> +        */
> > >>>> Should we do this according to whether IOTLB_PRESIST is set?
> > >>> Well, in theory this seems like so but it's unnecessary code change
> > >>> actually, as that is the way how vDPA parent behind platform IOMMU works
> > >>> today, and userspace doesn't break as of today. :)
> > >> Well, this is one question I've ever asked before. You have explained
> > >> that one of the reason that we don't break userspace is that they may
> > >> couple IOTLB reset with vDPA reset as well. One example is the Qemu.
> > >>
> > >>> As explained in previous threads [1][2], when IOTLB_PERSIST is not set
> > >>> it doesn't necessarily mean the iotlb will definitely be destroyed
> > >>> across reset (think about the platform IOMMU case), so userspace today
> > >>> is already tolerating enough with either good or bad IOMMU.
>
> I'm confused, how to define tolerating here? For example, if it has
> tolerance, why bother?
>
> > >>This code of
> > >>> not checking IOTLB_PERSIST being set is intentional, there's no point to
> > >>> emulate bad IOMMU behavior even for older userspace (with improper
> > >>> emulation to be done it would result in even worse performance).
>
> I can easily imagine a case:
>
> The old Qemu that works only with a setup like mlx5_vdpa.

I think it is a fair point, but QEMU in particular already unmapped
everything before set_status(0). Other userspace apps that have
trusted vdpa_sim and/or mlx5 behavior could fail, yes.

> If we do
> this without a negotiation, IOTLB will not be clear but the Qemu will
> try to re-program the IOTLB after reset. Which will break?
>
> 1) stick the exact old behaviour with just one line of check
> 2) audit all the possible cases to avoid a one line of code
>
> 1) seems much easier than 2)
>
> > >> For two reasons:
> > >>
> > >> 1) backend features need acked by userspace this is by design
> > >> 2) keep the odd behaviour seems to be more safe as we can't audit
> > >> every userspace program
> > >>
> > > The old behavior (without flag ack) cannot be trusted already, as:
>
> Possibly but the point is to unbreak userspace no matter how weird the
> behaviour we've ever had.
>
> > > * Devices using platform IOMMU (in other words, not implementing
> > > neither .set_map nor .dma_map) does not unmap memory at virtio reset.
> > > * Devices that implement .set_map or .dma_map (vdpa_sim, mlx5) do
> > > reset IOTLB, but in their parent ops (vdpasim_do_reset, prune_iotlb
> > > called from mlx5_vdpa_reset). With vdpa_sim patch removing the reset,
> > > now all backends work the same as far as I know., which was (and is)
> > > the way devices using the platform IOMMU works.
> > >
> > > The difference in behavior did not matter as QEMU unmaps all the
> > > memory unregistering the memory listener at vhost_vdpa_dev_start(...,
> > > started = false),
> > Exactly. It's not just QEMU, but any (older) userspace manipulates
> > mappings through the vhost-vdpa iotlb interface has to unmap all
> > mappings to workaround the vdpa parent driver bug.
>
> Just to clarify, from userspace, it's the (odd) behaviour of the current uAPI.
>
> > If they don't do
> > explicit unmap, it would cause state inconsistency between vhost-vdpa
> > and parent driver, then old mappings can't be restored, and new mapping
> > can be added to iotlb after vDPA reset. There's no point to preserve
> > this broken and inconsistent behavior between vhost-vdpa and parent
> > driver, as userspace doesn't care at all!
>
> It's a userspace notice change so we can't fix it silently:
>
> https://lkml.org/lkml/2012/12/23/75
>
> Another example which is related to vhost-vDPA:
>
> https://lore.kernel.org/netdev/20230927140544.205088-1-eric.auger@redhat.com/T/
>

So let's say it's just a matter of checking if IOTLB_PERSIST has been
acked and then call vhost_vdpa_reset_map in set_status(0) as long as
the backend uses .set_map or .dma_map. Both mlx5 and vdpa_sim will
have old behavior, but future parent drivers that use (.set_map ||
.dma_map) will also reset map with old qemu.

I think it is acceptable. Am I missing something?

Thanks!

> Thanks
>
> >
> > > but the backend acknowledging this feature flag
> > > allows QEMU to make sure it is safe to skip this unmap & map in the
> > > case of vhost stop & start cycle.
> > >
> > > In that sense, this feature flag is actually a signal for userspace to
> > > know that the bug has been solved.
> > Right, I couldn't say it better than you do, thanks! The feature flag is
> > more of an unusual means to indicating kernel bug having been fixed,
> > rather than introduce a new feature or new kernel behavior ending up in
> > change of userspace's expectation.
> >
> > > Not offering it indicates that
> > > userspace cannot trust the kernel will retain the maps.
> > >
> > > Si-Wei or Dragos, please correct me if I've missed something. Feel
> > > free to use the text in case you find more clear in doc or patch log.
> > Sure, will do, thank you! Will post v2 adding these to the log.
> >
> > Thanks,
> > -Siwei
> >
> >
> >
> > >
> > > Thanks!
> > >
> > >> Thanks
> > >>
> > >>> I think
> > >>> the purpose of the IOTLB_PERSIST flag is just to give userspace 100%
> > >>> certainty of persistent iotlb mapping not getting lost across vdpa reset.
> > >>>
> > >>> Thanks,
> > >>> -Siwei
> > >>>
> > >>> [1]
> > >>> https://lore.kernel.org/virtualization/9f118fc9-4f6f-dd67-a291-be78152e47fd@oracle.com/
> > >>> [2]
> > >>> https://lore.kernel.org/virtualization/3364adfd-1eb7-8bce-41f9-bfe5473f1f2e@oracle.com/
> > >>>>    Otherwise
> > >>>> we may break old userspace.
> > >>>>
> > >>>> Thanks
> > >>>>
> > >>>>> +       vhost_vdpa_reset_map(v, asid);
> > >>>>>           kfree(as);
> > >>>>>
> > >>>>>           return 0;
> > >>>>> --
> > >>>>> 1.8.3.1
> > >>>>>
> >
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
  2023-10-17  2:35             ` Jason Wang
  2023-10-17 13:58               ` Eugenio Perez Martin
@ 2023-10-18  4:35               ` Si-Wei Liu
  2023-10-18  5:27                 ` Jason Wang
  1 sibling, 1 reply; 37+ messages in thread
From: Si-Wei Liu @ 2023-10-18  4:35 UTC (permalink / raw)
  To: Jason Wang
  Cc: Eugenio Perez Martin, mst, xuanzhuo, dtatulea, virtualization,
	linux-kernel



On 10/16/2023 7:35 PM, Jason Wang wrote:
> On Tue, Oct 17, 2023 at 4:30 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 10/16/2023 4:28 AM, Eugenio Perez Martin wrote:
>>> On Mon, Oct 16, 2023 at 8:33 AM Jason Wang <jasowang@redhat.com> wrote:
>>>> On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>
>>>>> On 10/12/2023 8:01 PM, Jason Wang wrote:
>>>>>> On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>> Devices with on-chip IOMMU or vendor specific IOTLB implementation
>>>>>>> may need to restore iotlb mapping to the initial or default state
>>>>>>> using the .reset_map op, as it's desirable for some parent devices
>>>>>>> to solely manipulate mappings by its own, independent of virtio device
>>>>>>> state. For instance, device reset does not cause mapping go away on
>>>>>>> such IOTLB model in need of persistent mapping. Before vhost-vdpa
>>>>>>> is going away, give them a chance to reset iotlb back to the initial
>>>>>>> state in vhost_vdpa_cleanup().
>>>>>>>
>>>>>>> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
>>>>>>> ---
>>>>>>>     drivers/vhost/vdpa.c | 16 ++++++++++++++++
>>>>>>>     1 file changed, 16 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>>>>>>> index 851535f..a3f8160 100644
>>>>>>> --- a/drivers/vhost/vdpa.c
>>>>>>> +++ b/drivers/vhost/vdpa.c
>>>>>>> @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
>>>>>>>            return vhost_vdpa_alloc_as(v, asid);
>>>>>>>     }
>>>>>>>
>>>>>>> +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid)
>>>>>>> +{
>>>>>>> +       struct vdpa_device *vdpa = v->vdpa;
>>>>>>> +       const struct vdpa_config_ops *ops = vdpa->config;
>>>>>>> +
>>>>>>> +       if (ops->reset_map)
>>>>>>> +               ops->reset_map(vdpa, asid);
>>>>>>> +}
>>>>>>> +
>>>>>>>     static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
>>>>>>>     {
>>>>>>>            struct vhost_vdpa_as *as = asid_to_as(v, asid);
>>>>>>> @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
>>>>>>>
>>>>>>>            hlist_del(&as->hash_link);
>>>>>>>            vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid);
>>>>>>> +       /*
>>>>>>> +        * Devices with vendor specific IOMMU may need to restore
>>>>>>> +        * iotlb to the initial or default state which is not done
>>>>>>> +        * through device reset, as the IOTLB mapping manipulation
>>>>>>> +        * could be decoupled from the virtio device life cycle.
>>>>>>> +        */
>>>>>> Should we do this according to whether IOTLB_PRESIST is set?
>>>>> Well, in theory this seems like so but it's unnecessary code change
>>>>> actually, as that is the way how vDPA parent behind platform IOMMU works
>>>>> today, and userspace doesn't break as of today. :)
>>>> Well, this is one question I've ever asked before. You have explained
>>>> that one of the reason that we don't break userspace is that they may
>>>> couple IOTLB reset with vDPA reset as well. One example is the Qemu.
>>>>
>>>>> As explained in previous threads [1][2], when IOTLB_PERSIST is not set
>>>>> it doesn't necessarily mean the iotlb will definitely be destroyed
>>>>> across reset (think about the platform IOMMU case), so userspace today
>>>>> is already tolerating enough with either good or bad IOMMU.
> I'm confused, how to define tolerating here?

Tolerating defined as QEMU has to proactively unmap before reset just to 
workaround the driver bug (on-chip maps out of sync), unconditionally 
for platform or on-chip. While we all know it doesn't have to do so for 
platform IOMMU, though userspace has no means to distinguish. That said, 
userspace is sacrificing reset time performance on platform IOMMU setup 
just for working around buggy implementation in the other setup.

> For example, if it has tolerance, why bother?
I'm not sure I get the question. But I think userspace is compromising 
because of buggy implementation in a few drivers doesn't mean we should 
uniformly enforce such behavior for all set_map/dma_map implementations.

>
>>>> This code of
>>>>> not checking IOTLB_PERSIST being set is intentional, there's no point to
>>>>> emulate bad IOMMU behavior even for older userspace (with improper
>>>>> emulation to be done it would result in even worse performance).
> I can easily imagine a case:
>
> The old Qemu that works only with a setup like mlx5_vdpa.
Noted, seems to me there's no such case of a userspace implementation 
that only works with mlx5_vdpa or its friends, but doesn't work with the 
others e.g. platform IOMMU, or well behaving on-chip IOMMU 
implementations. The Unmap+remap trick around vdpa reset works totally 
fine for platform IOMMU, except with sub-optimal performance. Other than 
this trick, I cannot easily think of other means or iotlb message 
sequence for userspace to recover the bogus state and make iotlb back to 
work again after reset. Are we talking about hypnosis that has no real 
basis to exist in the real world?

>   If we do
> this without a negotiation, IOTLB will not be clear but the Qemu will
> try to re-program the IOTLB after reset. Which will break?
>
> 1) stick the exact old behaviour with just one line of check
It's not just one line of check here, the old behavior emulation has to 
be done as Eugenio illustrated in the other email. In addition, the 
emulation has to limit to those buggy drivers as I don't feel this 
emulation should apply uniformly to all future set_map/dma_map 
implementations.
> 2) audit all the possible cases to avoid a one line of code
>
> 1) seems much easier than 2)
You see it's more than just one line of code, and I'm uncertain if the 
additional complexity is warranted or necessary, particularly if added 
this piece of compatibility code will linger for quite a long time. 
Instead of adding hypothetical code change for no specific good reason 
and no real use case, I'd like to add the code when we find out a 
specific use case that may get impacted or already being affected, then 
we will have good understanding how to code up the fix and emulate 
properly for compatibility, while not affecting other good implementations.

Thanks,
-Siwe/i/

>
>>>> For two reasons:
>>>>
>>>> 1) backend features need acked by userspace this is by design
>>>> 2) keep the odd behaviour seems to be more safe as we can't audit
>>>> every userspace program
>>>>
>>> The old behavior (without flag ack) cannot be trusted already, as:
> Possibly but the point is to unbreak userspace no matter how weird the
> behaviour we've ever had.
>
>>> * Devices using platform IOMMU (in other words, not implementing
>>> neither .set_map nor .dma_map) does not unmap memory at virtio reset.
>>> * Devices that implement .set_map or .dma_map (vdpa_sim, mlx5) do
>>> reset IOTLB, but in their parent ops (vdpasim_do_reset, prune_iotlb
>>> called from mlx5_vdpa_reset). With vdpa_sim patch removing the reset,
>>> now all backends work the same as far as I know., which was (and is)
>>> the way devices using the platform IOMMU works.
>>>
>>> The difference in behavior did not matter as QEMU unmaps all the
>>> memory unregistering the memory listener at vhost_vdpa_dev_start(...,
>>> started = false),
>> Exactly. It's not just QEMU, but any (older) userspace manipulates
>> mappings through the vhost-vdpa iotlb interface has to unmap all
>> mappings to workaround the vdpa parent driver bug.
> Just to clarify, from userspace, it's the (odd) behaviour of the current uAPI.
>
>> If they don't do
>> explicit unmap, it would cause state inconsistency between vhost-vdpa
>> and parent driver, then old mappings can't be restored, and new mapping
>> can be added to iotlb after vDPA reset. There's no point to preserve
>> this broken and inconsistent behavior between vhost-vdpa and parent
>> driver, as userspace doesn't care at all!
> It's a userspace notice change so we can't fix it silently:
>
> https://lkml.org/lkml/2012/12/23/75
>
> Another example which is related to vhost-vDPA:
>
> https://lore.kernel.org/netdev/20230927140544.205088-1-eric.auger@redhat.com/T/
>
> Thanks
>
>>> but the backend acknowledging this feature flag
>>> allows QEMU to make sure it is safe to skip this unmap & map in the
>>> case of vhost stop & start cycle.
>>>
>>> In that sense, this feature flag is actually a signal for userspace to
>>> know that the bug has been solved.
>> Right, I couldn't say it better than you do, thanks! The feature flag is
>> more of an unusual means to indicating kernel bug having been fixed,
>> rather than introduce a new feature or new kernel behavior ending up in
>> change of userspace's expectation.
>>
>>> Not offering it indicates that
>>> userspace cannot trust the kernel will retain the maps.
>>>
>>> Si-Wei or Dragos, please correct me if I've missed something. Feel
>>> free to use the text in case you find more clear in doc or patch log.
>> Sure, will do, thank you! Will post v2 adding these to the log.
>>
>> Thanks,
>> -Siwei
>>
>>
>>
>>> Thanks!
>>>
>>>> Thanks
>>>>
>>>>> I think
>>>>> the purpose of the IOTLB_PERSIST flag is just to give userspace 100%
>>>>> certainty of persistent iotlb mapping not getting lost across vdpa reset.
>>>>>
>>>>> Thanks,
>>>>> -Siwei
>>>>>
>>>>> [1]
>>>>> https://lore.kernel.org/virtualization/9f118fc9-4f6f-dd67-a291-be78152e47fd@oracle.com/
>>>>> [2]
>>>>> https://lore.kernel.org/virtualization/3364adfd-1eb7-8bce-41f9-bfe5473f1f2e@oracle.com/
>>>>>>     Otherwise
>>>>>> we may break old userspace.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>> +       vhost_vdpa_reset_map(v, asid);
>>>>>>>            kfree(as);
>>>>>>>
>>>>>>>            return 0;
>>>>>>> --
>>>>>>> 1.8.3.1
>>>>>>>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
  2023-10-18  4:35               ` Si-Wei Liu
@ 2023-10-18  5:27                 ` Jason Wang
  2023-10-18  7:00                   ` Jason Wang
                                     ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Jason Wang @ 2023-10-18  5:27 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Eugenio Perez Martin, mst, xuanzhuo, dtatulea, virtualization,
	linux-kernel

On Wed, Oct 18, 2023 at 12:36 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 10/16/2023 7:35 PM, Jason Wang wrote:
> > On Tue, Oct 17, 2023 at 4:30 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >> On 10/16/2023 4:28 AM, Eugenio Perez Martin wrote:
> >>> On Mon, Oct 16, 2023 at 8:33 AM Jason Wang <jasowang@redhat.com> wrote:
> >>>> On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>
> >>>>> On 10/12/2023 8:01 PM, Jason Wang wrote:
> >>>>>> On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>>> Devices with on-chip IOMMU or vendor specific IOTLB implementation
> >>>>>>> may need to restore iotlb mapping to the initial or default state
> >>>>>>> using the .reset_map op, as it's desirable for some parent devices
> >>>>>>> to solely manipulate mappings by its own, independent of virtio device
> >>>>>>> state. For instance, device reset does not cause mapping go away on
> >>>>>>> such IOTLB model in need of persistent mapping. Before vhost-vdpa
> >>>>>>> is going away, give them a chance to reset iotlb back to the initial
> >>>>>>> state in vhost_vdpa_cleanup().
> >>>>>>>
> >>>>>>> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
> >>>>>>> ---
> >>>>>>>     drivers/vhost/vdpa.c | 16 ++++++++++++++++
> >>>>>>>     1 file changed, 16 insertions(+)
> >>>>>>>
> >>>>>>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> >>>>>>> index 851535f..a3f8160 100644
> >>>>>>> --- a/drivers/vhost/vdpa.c
> >>>>>>> +++ b/drivers/vhost/vdpa.c
> >>>>>>> @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
> >>>>>>>            return vhost_vdpa_alloc_as(v, asid);
> >>>>>>>     }
> >>>>>>>
> >>>>>>> +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid)
> >>>>>>> +{
> >>>>>>> +       struct vdpa_device *vdpa = v->vdpa;
> >>>>>>> +       const struct vdpa_config_ops *ops = vdpa->config;
> >>>>>>> +
> >>>>>>> +       if (ops->reset_map)
> >>>>>>> +               ops->reset_map(vdpa, asid);
> >>>>>>> +}
> >>>>>>> +
> >>>>>>>     static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
> >>>>>>>     {
> >>>>>>>            struct vhost_vdpa_as *as = asid_to_as(v, asid);
> >>>>>>> @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
> >>>>>>>
> >>>>>>>            hlist_del(&as->hash_link);
> >>>>>>>            vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid);
> >>>>>>> +       /*
> >>>>>>> +        * Devices with vendor specific IOMMU may need to restore
> >>>>>>> +        * iotlb to the initial or default state which is not done
> >>>>>>> +        * through device reset, as the IOTLB mapping manipulation
> >>>>>>> +        * could be decoupled from the virtio device life cycle.
> >>>>>>> +        */
> >>>>>> Should we do this according to whether IOTLB_PRESIST is set?
> >>>>> Well, in theory this seems like so but it's unnecessary code change
> >>>>> actually, as that is the way how vDPA parent behind platform IOMMU works
> >>>>> today, and userspace doesn't break as of today. :)
> >>>> Well, this is one question I've ever asked before. You have explained
> >>>> that one of the reason that we don't break userspace is that they may
> >>>> couple IOTLB reset with vDPA reset as well. One example is the Qemu.
> >>>>
> >>>>> As explained in previous threads [1][2], when IOTLB_PERSIST is not set
> >>>>> it doesn't necessarily mean the iotlb will definitely be destroyed
> >>>>> across reset (think about the platform IOMMU case), so userspace today
> >>>>> is already tolerating enough with either good or bad IOMMU.
> > I'm confused, how to define tolerating here?
>
> Tolerating defined as QEMU has to proactively unmap before reset just to
> workaround the driver bug (on-chip maps out of sync), unconditionally
> for platform or on-chip. While we all know it doesn't have to do so for
> platform IOMMU, though userspace has no means to distinguish. That said,
> userspace is sacrificing reset time performance on platform IOMMU setup
> just for working around buggy implementation in the other setup.

Ok, so what you actually mean is that userspace can tolerate the "bug"
with the performance penalty.


>
> > For example, if it has tolerance, why bother?
> I'm not sure I get the question. But I think userspace is compromising
> because of buggy implementation in a few drivers doesn't mean we should
> uniformly enforce such behavior for all set_map/dma_map implementations.

This is not my point. I meant, we can fix we need a negotiation in
order to let some "buggy" old user space to survive from the changes.

>
> >
> >>>> This code of
> >>>>> not checking IOTLB_PERSIST being set is intentional, there's no point to
> >>>>> emulate bad IOMMU behavior even for older userspace (with improper
> >>>>> emulation to be done it would result in even worse performance).
> > I can easily imagine a case:
> >
> > The old Qemu that works only with a setup like mlx5_vdpa.
> Noted, seems to me there's no such case of a userspace implementation
> that only works with mlx5_vdpa or its friends, but doesn't work with the
> others e.g. platform IOMMU, or well behaving on-chip IOMMU
> implementations.

It's not hard to think of a case where:

1) the environment has mlx5_vdpa only
2) kernel doc can't have endless details, so when developing
application, the author notice IOTLB is cleared during reset

> The Unmap+remap trick around vdpa reset works totally
> fine for platform IOMMU, except with sub-optimal performance. Other than
> this trick, I cannot easily think of other means or iotlb message
> sequence for userspace to recover the bogus state and make iotlb back to
> work again after reset.

Yes for sure, but we can't audit every user space, no?

> Are we talking about hypnosis that has no real
> basis to exist in the real world?

Instead of trying to answer these hard questions, I would go another
way. That is, stick to the old behaviour when IOTLB_PRESISIT is not
set by the backend. This is much easier.

>
> >   If we do
> > this without a negotiation, IOTLB will not be clear but the Qemu will
> > try to re-program the IOTLB after reset. Which will break?
> >
> > 1) stick the exact old behaviour with just one line of check
> It's not just one line of check here, the old behavior emulation has to
> be done as Eugenio illustrated in the other email.

For vhost-vDPA it's just

if (IOTLB_PERSIST is acked by userspace)
    reset_map()

For parent, it's somehow similar:

during .reset()

if (IOTLB_PERSIST is not acked by userspace)
        reset_vendor_mappings()

Anything I missed here?

> In addition, the
> emulation has to limit to those buggy drivers as I don't feel this
> emulation should apply uniformly to all future set_map/dma_map
> implementations.

Unfortunately, it's a must to stick to ABI. I agree it's a mess but we
don't have a better choice. Or we can fail the probe if userspace
doesn't ack this feature.

> > 2) audit all the possible cases to avoid a one line of code
> >
> > 1) seems much easier than 2)
> You see it's more than just one line of code, and I'm uncertain if the
> additional complexity is warranted or necessary, particularly if added
> this piece of compatibility code will linger for quite a long time.

This is a must as long as it can be noticed by userspace. Doing
something conservative makes more sense to me.

> Instead of adding hypothetical code change for no specific good reason
> and no real use case,

It's not adding something new or new behaviours, it's just making the
IOTLB reset conditional based on vDPA reset.

> I'd like to add the code when we find out a
> specific use case that may get impacted or already being affected,

It doesn't conflict with what you proposed here. Old behaviours have
their users, no?

> then
> we will have good understanding how to code up the fix and emulate
> properly for compatibility, while not affecting other good implementations.

The issue is, even if we can't find a userspace now. It doesn't mean
we can't have one in the future. Then it might be too late or too
tricky to fix them. We had a lot of lessons in the past.

Thanks

>
> Thanks,
> -Siwe/i/
>
> >
> >>>> For two reasons:
> >>>>
> >>>> 1) backend features need acked by userspace this is by design
> >>>> 2) keep the odd behaviour seems to be more safe as we can't audit
> >>>> every userspace program
> >>>>
> >>> The old behavior (without flag ack) cannot be trusted already, as:
> > Possibly but the point is to unbreak userspace no matter how weird the
> > behaviour we've ever had.
> >
> >>> * Devices using platform IOMMU (in other words, not implementing
> >>> neither .set_map nor .dma_map) does not unmap memory at virtio reset.
> >>> * Devices that implement .set_map or .dma_map (vdpa_sim, mlx5) do
> >>> reset IOTLB, but in their parent ops (vdpasim_do_reset, prune_iotlb
> >>> called from mlx5_vdpa_reset). With vdpa_sim patch removing the reset,
> >>> now all backends work the same as far as I know., which was (and is)
> >>> the way devices using the platform IOMMU works.
> >>>
> >>> The difference in behavior did not matter as QEMU unmaps all the
> >>> memory unregistering the memory listener at vhost_vdpa_dev_start(...,
> >>> started = false),
> >> Exactly. It's not just QEMU, but any (older) userspace manipulates
> >> mappings through the vhost-vdpa iotlb interface has to unmap all
> >> mappings to workaround the vdpa parent driver bug.
> > Just to clarify, from userspace, it's the (odd) behaviour of the current uAPI.
> >
> >> If they don't do
> >> explicit unmap, it would cause state inconsistency between vhost-vdpa
> >> and parent driver, then old mappings can't be restored, and new mapping
> >> can be added to iotlb after vDPA reset. There's no point to preserve
> >> this broken and inconsistent behavior between vhost-vdpa and parent
> >> driver, as userspace doesn't care at all!
> > It's a userspace notice change so we can't fix it silently:
> >
> > https://lkml.org/lkml/2012/12/23/75
> >
> > Another example which is related to vhost-vDPA:
> >
> > https://lore.kernel.org/netdev/20230927140544.205088-1-eric.auger@redhat.com/T/
> >
> > Thanks
> >
> >>> but the backend acknowledging this feature flag
> >>> allows QEMU to make sure it is safe to skip this unmap & map in the
> >>> case of vhost stop & start cycle.
> >>>
> >>> In that sense, this feature flag is actually a signal for userspace to
> >>> know that the bug has been solved.
> >> Right, I couldn't say it better than you do, thanks! The feature flag is
> >> more of an unusual means to indicating kernel bug having been fixed,
> >> rather than introduce a new feature or new kernel behavior ending up in
> >> change of userspace's expectation.
> >>
> >>> Not offering it indicates that
> >>> userspace cannot trust the kernel will retain the maps.
> >>>
> >>> Si-Wei or Dragos, please correct me if I've missed something. Feel
> >>> free to use the text in case you find more clear in doc or patch log.
> >> Sure, will do, thank you! Will post v2 adding these to the log.
> >>
> >> Thanks,
> >> -Siwei
> >>
> >>
> >>
> >>> Thanks!
> >>>
> >>>> Thanks
> >>>>
> >>>>> I think
> >>>>> the purpose of the IOTLB_PERSIST flag is just to give userspace 100%
> >>>>> certainty of persistent iotlb mapping not getting lost across vdpa reset.
> >>>>>
> >>>>> Thanks,
> >>>>> -Siwei
> >>>>>
> >>>>> [1]
> >>>>> https://lore.kernel.org/virtualization/9f118fc9-4f6f-dd67-a291-be78152e47fd@oracle.com/
> >>>>> [2]
> >>>>> https://lore.kernel.org/virtualization/3364adfd-1eb7-8bce-41f9-bfe5473f1f2e@oracle.com/
> >>>>>>     Otherwise
> >>>>>> we may break old userspace.
> >>>>>>
> >>>>>> Thanks
> >>>>>>
> >>>>>>> +       vhost_vdpa_reset_map(v, asid);
> >>>>>>>            kfree(as);
> >>>>>>>
> >>>>>>>            return 0;
> >>>>>>> --
> >>>>>>> 1.8.3.1
> >>>>>>>
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
  2023-10-18  5:27                 ` Jason Wang
@ 2023-10-18  7:00                   ` Jason Wang
  2023-10-18  8:49                     ` Si-Wei Liu
  2023-10-18  8:44                   ` Si-Wei Liu
  2023-10-19 22:57                   ` Si-Wei Liu
  2 siblings, 1 reply; 37+ messages in thread
From: Jason Wang @ 2023-10-18  7:00 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Eugenio Perez Martin, mst, xuanzhuo, dtatulea, virtualization,
	linux-kernel

On Wed, Oct 18, 2023 at 1:27 PM Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Oct 18, 2023 at 12:36 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >
> >
> >
> > On 10/16/2023 7:35 PM, Jason Wang wrote:
> > > On Tue, Oct 17, 2023 at 4:30 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> > >>
> > >>
> > >> On 10/16/2023 4:28 AM, Eugenio Perez Martin wrote:
> > >>> On Mon, Oct 16, 2023 at 8:33 AM Jason Wang <jasowang@redhat.com> wrote:
> > >>>> On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> > >>>>>
> > >>>>> On 10/12/2023 8:01 PM, Jason Wang wrote:
> > >>>>>> On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> > >>>>>>> Devices with on-chip IOMMU or vendor specific IOTLB implementation
> > >>>>>>> may need to restore iotlb mapping to the initial or default state
> > >>>>>>> using the .reset_map op, as it's desirable for some parent devices
> > >>>>>>> to solely manipulate mappings by its own, independent of virtio device
> > >>>>>>> state. For instance, device reset does not cause mapping go away on
> > >>>>>>> such IOTLB model in need of persistent mapping. Before vhost-vdpa
> > >>>>>>> is going away, give them a chance to reset iotlb back to the initial
> > >>>>>>> state in vhost_vdpa_cleanup().
> > >>>>>>>
> > >>>>>>> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
> > >>>>>>> ---
> > >>>>>>>     drivers/vhost/vdpa.c | 16 ++++++++++++++++
> > >>>>>>>     1 file changed, 16 insertions(+)
> > >>>>>>>
> > >>>>>>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > >>>>>>> index 851535f..a3f8160 100644
> > >>>>>>> --- a/drivers/vhost/vdpa.c
> > >>>>>>> +++ b/drivers/vhost/vdpa.c
> > >>>>>>> @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
> > >>>>>>>            return vhost_vdpa_alloc_as(v, asid);
> > >>>>>>>     }
> > >>>>>>>
> > >>>>>>> +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid)
> > >>>>>>> +{
> > >>>>>>> +       struct vdpa_device *vdpa = v->vdpa;
> > >>>>>>> +       const struct vdpa_config_ops *ops = vdpa->config;
> > >>>>>>> +
> > >>>>>>> +       if (ops->reset_map)
> > >>>>>>> +               ops->reset_map(vdpa, asid);
> > >>>>>>> +}
> > >>>>>>> +
> > >>>>>>>     static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
> > >>>>>>>     {
> > >>>>>>>            struct vhost_vdpa_as *as = asid_to_as(v, asid);
> > >>>>>>> @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
> > >>>>>>>
> > >>>>>>>            hlist_del(&as->hash_link);
> > >>>>>>>            vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid);
> > >>>>>>> +       /*
> > >>>>>>> +        * Devices with vendor specific IOMMU may need to restore
> > >>>>>>> +        * iotlb to the initial or default state which is not done
> > >>>>>>> +        * through device reset, as the IOTLB mapping manipulation
> > >>>>>>> +        * could be decoupled from the virtio device life cycle.
> > >>>>>>> +        */
> > >>>>>> Should we do this according to whether IOTLB_PRESIST is set?
> > >>>>> Well, in theory this seems like so but it's unnecessary code change
> > >>>>> actually, as that is the way how vDPA parent behind platform IOMMU works
> > >>>>> today, and userspace doesn't break as of today. :)
> > >>>> Well, this is one question I've ever asked before. You have explained
> > >>>> that one of the reason that we don't break userspace is that they may
> > >>>> couple IOTLB reset with vDPA reset as well. One example is the Qemu.
> > >>>>
> > >>>>> As explained in previous threads [1][2], when IOTLB_PERSIST is not set
> > >>>>> it doesn't necessarily mean the iotlb will definitely be destroyed
> > >>>>> across reset (think about the platform IOMMU case), so userspace today
> > >>>>> is already tolerating enough with either good or bad IOMMU.
> > > I'm confused, how to define tolerating here?
> >
> > Tolerating defined as QEMU has to proactively unmap before reset just to
> > workaround the driver bug (on-chip maps out of sync), unconditionally
> > for platform or on-chip. While we all know it doesn't have to do so for
> > platform IOMMU, though userspace has no means to distinguish. That said,
> > userspace is sacrificing reset time performance on platform IOMMU setup
> > just for working around buggy implementation in the other setup.
>
> Ok, so what you actually mean is that userspace can tolerate the "bug"
> with the performance penalty.
>
>
> >
> > > For example, if it has tolerance, why bother?
> > I'm not sure I get the question. But I think userspace is compromising
> > because of buggy implementation in a few drivers doesn't mean we should
> > uniformly enforce such behavior for all set_map/dma_map implementations.
>
> This is not my point. I meant, we can fix we need a negotiation in
> order to let some "buggy" old user space to survive from the changes.
>
> >
> > >
> > >>>> This code of
> > >>>>> not checking IOTLB_PERSIST being set is intentional, there's no point to
> > >>>>> emulate bad IOMMU behavior even for older userspace (with improper
> > >>>>> emulation to be done it would result in even worse performance).
> > > I can easily imagine a case:
> > >
> > > The old Qemu that works only with a setup like mlx5_vdpa.
> > Noted, seems to me there's no such case of a userspace implementation
> > that only works with mlx5_vdpa or its friends, but doesn't work with the
> > others e.g. platform IOMMU, or well behaving on-chip IOMMU
> > implementations.
>
> It's not hard to think of a case where:
>
> 1) the environment has mlx5_vdpa only
> 2) kernel doc can't have endless details, so when developing
> application, the author notice IOTLB is cleared during reset
>
> > The Unmap+remap trick around vdpa reset works totally
> > fine for platform IOMMU, except with sub-optimal performance. Other than
> > this trick, I cannot easily think of other means or iotlb message
> > sequence for userspace to recover the bogus state and make iotlb back to
> > work again after reset.
>
> Yes for sure, but we can't audit every user space, no?
>
> > Are we talking about hypnosis that has no real
> > basis to exist in the real world?
>
> Instead of trying to answer these hard questions, I would go another
> way. That is, stick to the old behaviour when IOTLB_PRESISIT is not
> set by the backend. This is much easier.
>
> >
> > >   If we do
> > > this without a negotiation, IOTLB will not be clear but the Qemu will
> > > try to re-program the IOTLB after reset. Which will break?
> > >
> > > 1) stick the exact old behaviour with just one line of check
> > It's not just one line of check here, the old behavior emulation has to
> > be done as Eugenio illustrated in the other email.
>
> For vhost-vDPA it's just
>
> if (IOTLB_PERSIST is acked by userspace)
>     reset_map()
>
> For parent, it's somehow similar:
>
> during .reset()
>
> if (IOTLB_PERSIST is not acked by userspace)
>         reset_vendor_mappings()
>
> Anything I missed here?
>
> > In addition, the
> > emulation has to limit to those buggy drivers as I don't feel this
> > emulation should apply uniformly to all future set_map/dma_map
> > implementations.
>
> Unfortunately, it's a must to stick to ABI. I agree it's a mess but we
> don't have a better choice. Or we can fail the probe if userspace
> doesn't ack this feature.

Antoher idea we can just do the following in vhost_vdpa reset?

config->reset()
if (IOTLB_PERSIST is not set) {
    config->reset_map()
}

Then we don't have the burden to maintain them in the parent?

Thanks

>
> > > 2) audit all the possible cases to avoid a one line of code
> > >
> > > 1) seems much easier than 2)
> > You see it's more than just one line of code, and I'm uncertain if the
> > additional complexity is warranted or necessary, particularly if added
> > this piece of compatibility code will linger for quite a long time.
>
> This is a must as long as it can be noticed by userspace. Doing
> something conservative makes more sense to me.
>
> > Instead of adding hypothetical code change for no specific good reason
> > and no real use case,
>
> It's not adding something new or new behaviours, it's just making the
> IOTLB reset conditional based on vDPA reset.
>
> > I'd like to add the code when we find out a
> > specific use case that may get impacted or already being affected,
>
> It doesn't conflict with what you proposed here. Old behaviours have
> their users, no?
>
> > then
> > we will have good understanding how to code up the fix and emulate
> > properly for compatibility, while not affecting other good implementations.
>
> The issue is, even if we can't find a userspace now. It doesn't mean
> we can't have one in the future. Then it might be too late or too
> tricky to fix them. We had a lot of lessons in the past.
>
> Thanks
>
> >
> > Thanks,
> > -Siwe/i/
> >
> > >
> > >>>> For two reasons:
> > >>>>
> > >>>> 1) backend features need acked by userspace this is by design
> > >>>> 2) keep the odd behaviour seems to be more safe as we can't audit
> > >>>> every userspace program
> > >>>>
> > >>> The old behavior (without flag ack) cannot be trusted already, as:
> > > Possibly but the point is to unbreak userspace no matter how weird the
> > > behaviour we've ever had.
> > >
> > >>> * Devices using platform IOMMU (in other words, not implementing
> > >>> neither .set_map nor .dma_map) does not unmap memory at virtio reset.
> > >>> * Devices that implement .set_map or .dma_map (vdpa_sim, mlx5) do
> > >>> reset IOTLB, but in their parent ops (vdpasim_do_reset, prune_iotlb
> > >>> called from mlx5_vdpa_reset). With vdpa_sim patch removing the reset,
> > >>> now all backends work the same as far as I know., which was (and is)
> > >>> the way devices using the platform IOMMU works.
> > >>>
> > >>> The difference in behavior did not matter as QEMU unmaps all the
> > >>> memory unregistering the memory listener at vhost_vdpa_dev_start(...,
> > >>> started = false),
> > >> Exactly. It's not just QEMU, but any (older) userspace manipulates
> > >> mappings through the vhost-vdpa iotlb interface has to unmap all
> > >> mappings to workaround the vdpa parent driver bug.
> > > Just to clarify, from userspace, it's the (odd) behaviour of the current uAPI.
> > >
> > >> If they don't do
> > >> explicit unmap, it would cause state inconsistency between vhost-vdpa
> > >> and parent driver, then old mappings can't be restored, and new mapping
> > >> can be added to iotlb after vDPA reset. There's no point to preserve
> > >> this broken and inconsistent behavior between vhost-vdpa and parent
> > >> driver, as userspace doesn't care at all!
> > > It's a userspace notice change so we can't fix it silently:
> > >
> > > https://lkml.org/lkml/2012/12/23/75
> > >
> > > Another example which is related to vhost-vDPA:
> > >
> > > https://lore.kernel.org/netdev/20230927140544.205088-1-eric.auger@redhat.com/T/
> > >
> > > Thanks
> > >
> > >>> but the backend acknowledging this feature flag
> > >>> allows QEMU to make sure it is safe to skip this unmap & map in the
> > >>> case of vhost stop & start cycle.
> > >>>
> > >>> In that sense, this feature flag is actually a signal for userspace to
> > >>> know that the bug has been solved.
> > >> Right, I couldn't say it better than you do, thanks! The feature flag is
> > >> more of an unusual means to indicating kernel bug having been fixed,
> > >> rather than introduce a new feature or new kernel behavior ending up in
> > >> change of userspace's expectation.
> > >>
> > >>> Not offering it indicates that
> > >>> userspace cannot trust the kernel will retain the maps.
> > >>>
> > >>> Si-Wei or Dragos, please correct me if I've missed something. Feel
> > >>> free to use the text in case you find more clear in doc or patch log.
> > >> Sure, will do, thank you! Will post v2 adding these to the log.
> > >>
> > >> Thanks,
> > >> -Siwei
> > >>
> > >>
> > >>
> > >>> Thanks!
> > >>>
> > >>>> Thanks
> > >>>>
> > >>>>> I think
> > >>>>> the purpose of the IOTLB_PERSIST flag is just to give userspace 100%
> > >>>>> certainty of persistent iotlb mapping not getting lost across vdpa reset.
> > >>>>>
> > >>>>> Thanks,
> > >>>>> -Siwei
> > >>>>>
> > >>>>> [1]
> > >>>>> https://lore.kernel.org/virtualization/9f118fc9-4f6f-dd67-a291-be78152e47fd@oracle.com/
> > >>>>> [2]
> > >>>>> https://lore.kernel.org/virtualization/3364adfd-1eb7-8bce-41f9-bfe5473f1f2e@oracle.com/
> > >>>>>>     Otherwise
> > >>>>>> we may break old userspace.
> > >>>>>>
> > >>>>>> Thanks
> > >>>>>>
> > >>>>>>> +       vhost_vdpa_reset_map(v, asid);
> > >>>>>>>            kfree(as);
> > >>>>>>>
> > >>>>>>>            return 0;
> > >>>>>>> --
> > >>>>>>> 1.8.3.1
> > >>>>>>>
> >


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
  2023-10-18  5:27                 ` Jason Wang
  2023-10-18  7:00                   ` Jason Wang
@ 2023-10-18  8:44                   ` Si-Wei Liu
  2023-10-18 11:14                     ` Eugenio Perez Martin
  2023-10-19 22:57                   ` Si-Wei Liu
  2 siblings, 1 reply; 37+ messages in thread
From: Si-Wei Liu @ 2023-10-18  8:44 UTC (permalink / raw)
  To: Jason Wang
  Cc: Eugenio Perez Martin, mst, xuanzhuo, dtatulea, virtualization,
	linux-kernel



On 10/17/2023 10:27 PM, Jason Wang wrote:
> On Wed, Oct 18, 2023 at 12:36 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 10/16/2023 7:35 PM, Jason Wang wrote:
>>> On Tue, Oct 17, 2023 at 4:30 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>
>>>> On 10/16/2023 4:28 AM, Eugenio Perez Martin wrote:
>>>>> On Mon, Oct 16, 2023 at 8:33 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>>> On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>> On 10/12/2023 8:01 PM, Jason Wang wrote:
>>>>>>>> On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>>>> Devices with on-chip IOMMU or vendor specific IOTLB implementation
>>>>>>>>> may need to restore iotlb mapping to the initial or default state
>>>>>>>>> using the .reset_map op, as it's desirable for some parent devices
>>>>>>>>> to solely manipulate mappings by its own, independent of virtio device
>>>>>>>>> state. For instance, device reset does not cause mapping go away on
>>>>>>>>> such IOTLB model in need of persistent mapping. Before vhost-vdpa
>>>>>>>>> is going away, give them a chance to reset iotlb back to the initial
>>>>>>>>> state in vhost_vdpa_cleanup().
>>>>>>>>>
>>>>>>>>> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
>>>>>>>>> ---
>>>>>>>>>      drivers/vhost/vdpa.c | 16 ++++++++++++++++
>>>>>>>>>      1 file changed, 16 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>>>>>>>>> index 851535f..a3f8160 100644
>>>>>>>>> --- a/drivers/vhost/vdpa.c
>>>>>>>>> +++ b/drivers/vhost/vdpa.c
>>>>>>>>> @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
>>>>>>>>>             return vhost_vdpa_alloc_as(v, asid);
>>>>>>>>>      }
>>>>>>>>>
>>>>>>>>> +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid)
>>>>>>>>> +{
>>>>>>>>> +       struct vdpa_device *vdpa = v->vdpa;
>>>>>>>>> +       const struct vdpa_config_ops *ops = vdpa->config;
>>>>>>>>> +
>>>>>>>>> +       if (ops->reset_map)
>>>>>>>>> +               ops->reset_map(vdpa, asid);
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>>      static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
>>>>>>>>>      {
>>>>>>>>>             struct vhost_vdpa_as *as = asid_to_as(v, asid);
>>>>>>>>> @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
>>>>>>>>>
>>>>>>>>>             hlist_del(&as->hash_link);
>>>>>>>>>             vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid);
>>>>>>>>> +       /*
>>>>>>>>> +        * Devices with vendor specific IOMMU may need to restore
>>>>>>>>> +        * iotlb to the initial or default state which is not done
>>>>>>>>> +        * through device reset, as the IOTLB mapping manipulation
>>>>>>>>> +        * could be decoupled from the virtio device life cycle.
>>>>>>>>> +        */
>>>>>>>> Should we do this according to whether IOTLB_PRESIST is set?
>>>>>>> Well, in theory this seems like so but it's unnecessary code change
>>>>>>> actually, as that is the way how vDPA parent behind platform IOMMU works
>>>>>>> today, and userspace doesn't break as of today. :)
>>>>>> Well, this is one question I've ever asked before. You have explained
>>>>>> that one of the reason that we don't break userspace is that they may
>>>>>> couple IOTLB reset with vDPA reset as well. One example is the Qemu.
>>>>>>
>>>>>>> As explained in previous threads [1][2], when IOTLB_PERSIST is not set
>>>>>>> it doesn't necessarily mean the iotlb will definitely be destroyed
>>>>>>> across reset (think about the platform IOMMU case), so userspace today
>>>>>>> is already tolerating enough with either good or bad IOMMU.
>>> I'm confused, how to define tolerating here?
>> Tolerating defined as QEMU has to proactively unmap before reset just to
>> workaround the driver bug (on-chip maps out of sync), unconditionally
>> for platform or on-chip. While we all know it doesn't have to do so for
>> platform IOMMU, though userspace has no means to distinguish. That said,
>> userspace is sacrificing reset time performance on platform IOMMU setup
>> just for working around buggy implementation in the other setup.
> Ok, so what you actually mean is that userspace can tolerate the "bug"
> with the performance penalty.
Right.
>
>
>>> For example, if it has tolerance, why bother?
>> I'm not sure I get the question. But I think userspace is compromising
>> because of buggy implementation in a few drivers doesn't mean we should
>> uniformly enforce such behavior for all set_map/dma_map implementations.
> This is not my point. I meant, we can fix we need a negotiation in
> order to let some "buggy" old user space to survive from the changes.
Userspace is no buggy today, how to define "buggy"? Userspace with 
tolerance could survive just fine no matter if this negotiation or buggy 
driver behavior emulation is around or not. If any userspace doesn't 
tolerate, it can work still fine on good on-chip IOMMU or platform 
IOMMU, no matter if the negotiation is around or not.
>
>>>>>> This code of
>>>>>>> not checking IOTLB_PERSIST being set is intentional, there's no point to
>>>>>>> emulate bad IOMMU behavior even for older userspace (with improper
>>>>>>> emulation to be done it would result in even worse performance).
>>> I can easily imagine a case:
>>>
>>> The old Qemu that works only with a setup like mlx5_vdpa.
>> Noted, seems to me there's no such case of a userspace implementation
>> that only works with mlx5_vdpa or its friends, but doesn't work with the
>> others e.g. platform IOMMU, or well behaving on-chip IOMMU
>> implementations.
> It's not hard to think of a case where:
>
> 1) the environment has mlx5_vdpa only
> 2) kernel doc can't have endless details, so when developing
> application, the author notice IOTLB is cleared during reset
I get it, but my question was that, even if the author had noticed IOTLB 
is cleared during reset, does he care or not to make IOTLB back working 
again? My point is that, if this old setup is supposed to "work" on 
mlx5_vdpa, then the developer must come up with sort of "quirk" to 
recover the IOTLB to make it back to working state again after the 
reset. It will be more justified to come up with the proper fix for 
compatibility/emulation only until we know what should be expected to 
work and through which possible means to making it back to work, rather 
than blindly emulate the buggy behavior solely based on a few driver's 
own implementation. I'm pretty sure there are multiple ways to implement 
the buggy reset behavior in the driver, does it mean we have to emulate 
various corrupted mapping states in the individual on-chip iommu itself? 
How is it able to help the developer user if we are able to replicate 
the same corrupted mapping state in the on-chip iommu after reset, any 
real-life user only cares about mapping being corrupted in the same way, 
rather than cares more about the quirk sequence or work around to get 
iotlb maps out of the broken state?

Only if the userspace is like a test facility to expect some test case 
to fail on mlx5_vdpa after reset -- I assume that is not real-life user 
at all.
>
>> The Unmap+remap trick around vdpa reset works totally
>> fine for platform IOMMU, except with sub-optimal performance. Other than
>> this trick, I cannot easily think of other means or iotlb message
>> sequence for userspace to recover the bogus state and make iotlb back to
>> work again after reset.
> Yes for sure, but we can't audit every user space, no?
We don't have to, as userspace here has no bug at all. The bug exists in 
the driver not in userspace. Real life userspace app only cares about 
making things work not asserting something must be broken.
>> Are we talking about hypnosis that has no real
>> basis to exist in the real world?
> Instead of trying to answer these hard questions, I would go another
> way. That is, stick to the old behaviour when IOTLB_PRESISIT is not
> set by the backend. This is much easier.
Please be noted the old (broken) behavior can vary between different 
parent driver implementations. It's driver's specific own problem, if 
there are N ways to for driver to implement buggy .reset, do we have to 
emulate N flavors of different vdpa reset behavior?

>
>>>    If we do
>>> this without a negotiation, IOTLB will not be clear but the Qemu will
>>> try to re-program the IOTLB after reset. Which will break?
>>>
>>> 1) stick the exact old behaviour with just one line of check
>> It's not just one line of check here, the old behavior emulation has to
>> be done as Eugenio illustrated in the other email.
> For vhost-vDPA it's just
>
> if (IOTLB_PERSIST is acked by userspace)
>      reset_map()
>
> For parent, it's somehow similar:
>
> during .reset()
>
> if (IOTLB_PERSIST is not acked by userspace)
>          reset_vendor_mappings()
>
> Anything I missed here?
First, the ideal fix would be to leave this reset_vendor_mappings() 
emulation code on the individual driver itself, which already has the 
broken behavior. But today there's no backend feature negotiation 
between vhost-vdpa and the parent driver. Do we want to send down the 
acked_backend_features to parent drivers?

Second, IOTLB_PERSIST is needed but not sufficient. Due to lack of 
backend feature negotiation in parent driver, if vhost-vdpa has to 
provide the old-behaviour emulation for compatibility on driver's 
behalf, it needs to be done per-driver basis. There could be good 
on-chip or vendor IOMMU implementation which doesn't clear the IOTLB in 
.reset, and vendor specific IOMMU doesn't have to provide .reset_map, we 
should allow these good driver implementations rather than 
unconditionally stick to some specific problematic behavior for every 
other good driver. Then we need a set of device flags (backend_features 
bit again?) to indicate the specific driver needs upper layer's help on 
old-behaviour emulation.

Last but not least, I'm not sure how to properly emulate 
reset_vendor_mappings() from vhost-vdpa layer. If a vendor driver has no 
.reset_map op implemented, or if .reset_map has a slightly different 
implementation than what it used to reset the iotlb in the .reset op, 
then this either becomes effectively dead code if no one ends up using, 
or the vhost-vdpa emulation is helpless and limited in scope, unable to 
cover all the cases.


>
>> In addition, the
>> emulation has to limit to those buggy drivers as I don't feel this
>> emulation should apply uniformly to all future set_map/dma_map
>> implementations.
> Unfortunately, it's a must to stick to ABI.
How come this brokenness in mlx5_vdpa becomes ABI in any sort for future 
on-chip IOMMU drivers? They might not even exist yet. Even if it's  
concerning ABI it's limited to mlx5_vdpa and the existing drivers, right?

>   I agree it's a mess but we don't have a better choice.
Well, it's your call, I can implement as you wish but the unwarranted 
code has to be maintained forever. Particularly without knowing if 
there's really such a use case in real life, and no one in future might 
dare to remove the code without knowing what it can be used for.

> Or we can fail the probe if userspace
> doesn't ack this feature.
Fail probing is even worse choice that is introducing intrusive breakage 
to the userspace.
>
>>> 2) audit all the possible cases to avoid a one line of code
>>>
>>> 1) seems much easier than 2)
>> You see it's more than just one line of code, and I'm uncertain if the
>> additional complexity is warranted or necessary, particularly if added
>> this piece of compatibility code will linger for quite a long time.
> This is a must as long as it can be noticed by userspace. Doing
> something conservative makes more sense to me.
>
>> Instead of adding hypothetical code change for no specific good reason
>> and no real use case,
> It's not adding something new or new behaviours, it's just making the
> IOTLB reset conditional based on vDPA reset.
>
>> I'd like to add the code when we find out a
>> specific use case that may get impacted or already being affected,
> It doesn't conflict with what you proposed here. Old behaviours have
> their users, no?
We don't know the use case how to make thing work instead of make thing 
break, that is the problem. We have no way to test if old-behaviour 
preserving code really works as expected. If there's no such user in 
practice, it ends up with dead code no one dares to remove.
>
>> then
>> we will have good understanding how to code up the fix and emulate
>> properly for compatibility, while not affecting other good implementations.
> The issue is, even if we can't find a userspace now. It doesn't mean
> we can't have one in the future. Then it might be too late or too
> tricky to fix them. We had a lot of lessons in the past.
I am not sure the same situation "too late to fix" or "too tricky to 
fix" applies here. Usually this means there's some well established 
pattern for e.g. API, ABI or long standing de-factor behavior that can't 
be broken or adjust if trying to fix something up. But here we're 
guarded by a flag (IOTLB_PERSIST) and without it the behavior is totally 
ruled by implementation.

Regards,
-Siwei

>
> Thanks
>
>> Thanks,
>> -Siwe/i/
>>
>>>>>> For two reasons:
>>>>>>
>>>>>> 1) backend features need acked by userspace this is by design
>>>>>> 2) keep the odd behaviour seems to be more safe as we can't audit
>>>>>> every userspace program
>>>>>>
>>>>> The old behavior (without flag ack) cannot be trusted already, as:
>>> Possibly but the point is to unbreak userspace no matter how weird the
>>> behaviour we've ever had.
>>>
>>>>> * Devices using platform IOMMU (in other words, not implementing
>>>>> neither .set_map nor .dma_map) does not unmap memory at virtio reset.
>>>>> * Devices that implement .set_map or .dma_map (vdpa_sim, mlx5) do
>>>>> reset IOTLB, but in their parent ops (vdpasim_do_reset, prune_iotlb
>>>>> called from mlx5_vdpa_reset). With vdpa_sim patch removing the reset,
>>>>> now all backends work the same as far as I know., which was (and is)
>>>>> the way devices using the platform IOMMU works.
>>>>>
>>>>> The difference in behavior did not matter as QEMU unmaps all the
>>>>> memory unregistering the memory listener at vhost_vdpa_dev_start(...,
>>>>> started = false),
>>>> Exactly. It's not just QEMU, but any (older) userspace manipulates
>>>> mappings through the vhost-vdpa iotlb interface has to unmap all
>>>> mappings to workaround the vdpa parent driver bug.
>>> Just to clarify, from userspace, it's the (odd) behaviour of the current uAPI.
>>>
>>>> If they don't do
>>>> explicit unmap, it would cause state inconsistency between vhost-vdpa
>>>> and parent driver, then old mappings can't be restored, and new mapping
>>>> can be added to iotlb after vDPA reset. There's no point to preserve
>>>> this broken and inconsistent behavior between vhost-vdpa and parent
>>>> driver, as userspace doesn't care at all!
>>> It's a userspace notice change so we can't fix it silently:
>>>
>>> https://lkml.org/lkml/2012/12/23/75
>>>
>>> Another example which is related to vhost-vDPA:
>>>
>>> https://lore.kernel.org/netdev/20230927140544.205088-1-eric.auger@redhat.com/T/
>>>
>>> Thanks
>>>
>>>>> but the backend acknowledging this feature flag
>>>>> allows QEMU to make sure it is safe to skip this unmap & map in the
>>>>> case of vhost stop & start cycle.
>>>>>
>>>>> In that sense, this feature flag is actually a signal for userspace to
>>>>> know that the bug has been solved.
>>>> Right, I couldn't say it better than you do, thanks! The feature flag is
>>>> more of an unusual means to indicating kernel bug having been fixed,
>>>> rather than introduce a new feature or new kernel behavior ending up in
>>>> change of userspace's expectation.
>>>>
>>>>> Not offering it indicates that
>>>>> userspace cannot trust the kernel will retain the maps.
>>>>>
>>>>> Si-Wei or Dragos, please correct me if I've missed something. Feel
>>>>> free to use the text in case you find more clear in doc or patch log.
>>>> Sure, will do, thank you! Will post v2 adding these to the log.
>>>>
>>>> Thanks,
>>>> -Siwei
>>>>
>>>>
>>>>
>>>>> Thanks!
>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>> I think
>>>>>>> the purpose of the IOTLB_PERSIST flag is just to give userspace 100%
>>>>>>> certainty of persistent iotlb mapping not getting lost across vdpa reset.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -Siwei
>>>>>>>
>>>>>>> [1]
>>>>>>> https://lore.kernel.org/virtualization/9f118fc9-4f6f-dd67-a291-be78152e47fd@oracle.com/
>>>>>>> [2]
>>>>>>> https://lore.kernel.org/virtualization/3364adfd-1eb7-8bce-41f9-bfe5473f1f2e@oracle.com/
>>>>>>>>      Otherwise
>>>>>>>> we may break old userspace.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>>> +       vhost_vdpa_reset_map(v, asid);
>>>>>>>>>             kfree(as);
>>>>>>>>>
>>>>>>>>>             return 0;
>>>>>>>>> --
>>>>>>>>> 1.8.3.1
>>>>>>>>>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
  2023-10-18  7:00                   ` Jason Wang
@ 2023-10-18  8:49                     ` Si-Wei Liu
  2023-10-19  2:53                       ` Jason Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Si-Wei Liu @ 2023-10-18  8:49 UTC (permalink / raw)
  To: Jason Wang
  Cc: Eugenio Perez Martin, mst, xuanzhuo, dtatulea, virtualization,
	linux-kernel



On 10/18/2023 12:00 AM, Jason Wang wrote:
>> Unfortunately, it's a must to stick to ABI. I agree it's a mess but we
>> don't have a better choice. Or we can fail the probe if userspace
>> doesn't ack this feature.
> Antoher idea we can just do the following in vhost_vdpa reset?
>
> config->reset()
> if (IOTLB_PERSIST is not set) {
>      config->reset_map()
> }
>
> Then we don't have the burden to maintain them in the parent?
>
> Thanks
Please see my earlier response in the other email, thanks.

----------------%<----------------%<----------------

First, the ideal fix would be to leave this reset_vendor_mappings() 
emulation code on the individual driver itself, which already has the 
broken behavior. But today there's no backend feature negotiation 
between vhost-vdpa and the parent driver. Do we want to send down the 
acked_backend_features to parent drivers?

Second, IOTLB_PERSIST is needed but not sufficient. Due to lack of 
backend feature negotiation in parent driver, if vhost-vdpa has to 
provide the old-behaviour emulation for compatibility on driver's 
behalf, it needs to be done per-driver basis. There could be good 
on-chip or vendor IOMMU implementation which doesn't clear the IOTLB in 
.reset, and vendor specific IOMMU doesn't have to provide .reset_map, we 
should allow these good driver implementations rather than 
unconditionally stick to some specific problematic behavior for every 
other good driver. Then we need a set of device flags (backend_features 
bit again?) to indicate the specific driver needs upper layer's help on 
old-behaviour emulation.

Last but not least, I'm not sure how to properly emulate 
reset_vendor_mappings() from vhost-vdpa layer. If a vendor driver has no 
.reset_map op implemented, or if .reset_map has a slightly different 
implementation than what it used to reset the iotlb in the .reset op, 
then this either becomes effectively dead code if no one ends up using, 
or the vhost-vdpa emulation is helpless and limited in scope, unable to 
cover all the cases.

----------------%<----------------%<----------------

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
  2023-10-18  8:44                   ` Si-Wei Liu
@ 2023-10-18 11:14                     ` Eugenio Perez Martin
  2023-10-18 23:21                       ` Si-Wei Liu
  0 siblings, 1 reply; 37+ messages in thread
From: Eugenio Perez Martin @ 2023-10-18 11:14 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Jason Wang, mst, xuanzhuo, dtatulea, virtualization, linux-kernel

On Wed, Oct 18, 2023 at 10:44 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 10/17/2023 10:27 PM, Jason Wang wrote:
> > On Wed, Oct 18, 2023 at 12:36 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >> On 10/16/2023 7:35 PM, Jason Wang wrote:
> >>> On Tue, Oct 17, 2023 at 4:30 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>
> >>>> On 10/16/2023 4:28 AM, Eugenio Perez Martin wrote:
> >>>>> On Mon, Oct 16, 2023 at 8:33 AM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>> On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>>> On 10/12/2023 8:01 PM, Jason Wang wrote:
> >>>>>>>> On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>>>>> Devices with on-chip IOMMU or vendor specific IOTLB implementation
> >>>>>>>>> may need to restore iotlb mapping to the initial or default state
> >>>>>>>>> using the .reset_map op, as it's desirable for some parent devices
> >>>>>>>>> to solely manipulate mappings by its own, independent of virtio device
> >>>>>>>>> state. For instance, device reset does not cause mapping go away on
> >>>>>>>>> such IOTLB model in need of persistent mapping. Before vhost-vdpa
> >>>>>>>>> is going away, give them a chance to reset iotlb back to the initial
> >>>>>>>>> state in vhost_vdpa_cleanup().
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
> >>>>>>>>> ---
> >>>>>>>>>      drivers/vhost/vdpa.c | 16 ++++++++++++++++
> >>>>>>>>>      1 file changed, 16 insertions(+)
> >>>>>>>>>
> >>>>>>>>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> >>>>>>>>> index 851535f..a3f8160 100644
> >>>>>>>>> --- a/drivers/vhost/vdpa.c
> >>>>>>>>> +++ b/drivers/vhost/vdpa.c
> >>>>>>>>> @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
> >>>>>>>>>             return vhost_vdpa_alloc_as(v, asid);
> >>>>>>>>>      }
> >>>>>>>>>
> >>>>>>>>> +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid)
> >>>>>>>>> +{
> >>>>>>>>> +       struct vdpa_device *vdpa = v->vdpa;
> >>>>>>>>> +       const struct vdpa_config_ops *ops = vdpa->config;
> >>>>>>>>> +
> >>>>>>>>> +       if (ops->reset_map)
> >>>>>>>>> +               ops->reset_map(vdpa, asid);
> >>>>>>>>> +}
> >>>>>>>>> +
> >>>>>>>>>      static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
> >>>>>>>>>      {
> >>>>>>>>>             struct vhost_vdpa_as *as = asid_to_as(v, asid);
> >>>>>>>>> @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
> >>>>>>>>>
> >>>>>>>>>             hlist_del(&as->hash_link);
> >>>>>>>>>             vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid);
> >>>>>>>>> +       /*
> >>>>>>>>> +        * Devices with vendor specific IOMMU may need to restore
> >>>>>>>>> +        * iotlb to the initial or default state which is not done
> >>>>>>>>> +        * through device reset, as the IOTLB mapping manipulation
> >>>>>>>>> +        * could be decoupled from the virtio device life cycle.
> >>>>>>>>> +        */
> >>>>>>>> Should we do this according to whether IOTLB_PRESIST is set?
> >>>>>>> Well, in theory this seems like so but it's unnecessary code change
> >>>>>>> actually, as that is the way how vDPA parent behind platform IOMMU works
> >>>>>>> today, and userspace doesn't break as of today. :)
> >>>>>> Well, this is one question I've ever asked before. You have explained
> >>>>>> that one of the reason that we don't break userspace is that they may
> >>>>>> couple IOTLB reset with vDPA reset as well. One example is the Qemu.
> >>>>>>
> >>>>>>> As explained in previous threads [1][2], when IOTLB_PERSIST is not set
> >>>>>>> it doesn't necessarily mean the iotlb will definitely be destroyed
> >>>>>>> across reset (think about the platform IOMMU case), so userspace today
> >>>>>>> is already tolerating enough with either good or bad IOMMU.
> >>> I'm confused, how to define tolerating here?
> >> Tolerating defined as QEMU has to proactively unmap before reset just to
> >> workaround the driver bug (on-chip maps out of sync), unconditionally
> >> for platform or on-chip. While we all know it doesn't have to do so for
> >> platform IOMMU, though userspace has no means to distinguish. That said,
> >> userspace is sacrificing reset time performance on platform IOMMU setup
> >> just for working around buggy implementation in the other setup.
> > Ok, so what you actually mean is that userspace can tolerate the "bug"
> > with the performance penalty.
> Right.
> >
> >
> >>> For example, if it has tolerance, why bother?
> >> I'm not sure I get the question. But I think userspace is compromising
> >> because of buggy implementation in a few drivers doesn't mean we should
> >> uniformly enforce such behavior for all set_map/dma_map implementations.
> > This is not my point. I meant, we can fix we need a negotiation in
> > order to let some "buggy" old user space to survive from the changes.
> Userspace is no buggy today, how to define "buggy"? Userspace with
> tolerance could survive just fine no matter if this negotiation or buggy
> driver behavior emulation is around or not. If any userspace doesn't
> tolerate, it can work still fine on good on-chip IOMMU or platform
> IOMMU, no matter if the negotiation is around or not.
> >
> >>>>>> This code of
> >>>>>>> not checking IOTLB_PERSIST being set is intentional, there's no point to
> >>>>>>> emulate bad IOMMU behavior even for older userspace (with improper
> >>>>>>> emulation to be done it would result in even worse performance).
> >>> I can easily imagine a case:
> >>>
> >>> The old Qemu that works only with a setup like mlx5_vdpa.
> >> Noted, seems to me there's no such case of a userspace implementation
> >> that only works with mlx5_vdpa or its friends, but doesn't work with the
> >> others e.g. platform IOMMU, or well behaving on-chip IOMMU
> >> implementations.
> > It's not hard to think of a case where:
> >
> > 1) the environment has mlx5_vdpa only
> > 2) kernel doc can't have endless details, so when developing
> > application, the author notice IOTLB is cleared during reset
> I get it, but my question was that, even if the author had noticed IOTLB
> is cleared during reset, does he care or not to make IOTLB back working
> again? My point is that, if this old setup is supposed to "work" on
> mlx5_vdpa, then the developer must come up with sort of "quirk" to
> recover the IOTLB to make it back to working state again after the
> reset. It will be more justified to come up with the proper fix for
> compatibility/emulation only until we know what should be expected to
> work and through which possible means to making it back to work, rather
> than blindly emulate the buggy behavior solely based on a few driver's
> own implementation. I'm pretty sure there are multiple ways to implement
> the buggy reset behavior in the driver, does it mean we have to emulate
> various corrupted mapping states in the individual on-chip iommu itself?
> How is it able to help the developer user if we are able to replicate
> the same corrupted mapping state in the on-chip iommu after reset, any
> real-life user only cares about mapping being corrupted in the same way,
> rather than cares more about the quirk sequence or work around to get
> iotlb maps out of the broken state?
>
> Only if the userspace is like a test facility to expect some test case
> to fail on mlx5_vdpa after reset -- I assume that is not real-life user
> at all.
> >
> >> The Unmap+remap trick around vdpa reset works totally
> >> fine for platform IOMMU, except with sub-optimal performance. Other than
> >> this trick, I cannot easily think of other means or iotlb message
> >> sequence for userspace to recover the bogus state and make iotlb back to
> >> work again after reset.
> > Yes for sure, but we can't audit every user space, no?
> We don't have to, as userspace here has no bug at all. The bug exists in
> the driver not in userspace. Real life userspace app only cares about
> making things work not asserting something must be broken.
> >> Are we talking about hypnosis that has no real
> >> basis to exist in the real world?
> > Instead of trying to answer these hard questions, I would go another
> > way. That is, stick to the old behaviour when IOTLB_PRESISIT is not
> > set by the backend. This is much easier.
> Please be noted the old (broken) behavior can vary between different
> parent driver implementations. It's driver's specific own problem, if
> there are N ways to for driver to implement buggy .reset, do we have to
> emulate N flavors of different vdpa reset behavior?
>
> >
> >>>    If we do
> >>> this without a negotiation, IOTLB will not be clear but the Qemu will
> >>> try to re-program the IOTLB after reset. Which will break?
> >>>
> >>> 1) stick the exact old behaviour with just one line of check
> >> It's not just one line of check here, the old behavior emulation has to
> >> be done as Eugenio illustrated in the other email.
> > For vhost-vDPA it's just
> >
> > if (IOTLB_PERSIST is acked by userspace)
> >      reset_map()
> >
> > For parent, it's somehow similar:
> >
> > during .reset()
> >
> > if (IOTLB_PERSIST is not acked by userspace)
> >          reset_vendor_mappings()
> >
> > Anything I missed here?
> First, the ideal fix would be to leave this reset_vendor_mappings()
> emulation code on the individual driver itself, which already has the
> broken behavior. But today there's no backend feature negotiation
> between vhost-vdpa and the parent driver. Do we want to send down the
> acked_backend_features to parent drivers?
>

What if we add a module parameter to both mlx5 and vdpa_sim to keep
the old behavior? Let's call it clean_iotlb_on_reset for now.

In my opinion we can leave it off by default, so these userspace apps
can get back to the previous behavior. It would be ideal if we set a
deprecation date for it though.

This way new backends, whether they implement .set_map or not, will
have correct behavior.

Would that work?

Thanks!

> Second, IOTLB_PERSIST is needed but not sufficient. Due to lack of
> backend feature negotiation in parent driver, if vhost-vdpa has to
> provide the old-behaviour emulation for compatibility on driver's
> behalf, it needs to be done per-driver basis. There could be good
> on-chip or vendor IOMMU implementation which doesn't clear the IOTLB in
> .reset, and vendor specific IOMMU doesn't have to provide .reset_map, we
> should allow these good driver implementations rather than
> unconditionally stick to some specific problematic behavior for every
> other good driver. Then we need a set of device flags (backend_features
> bit again?) to indicate the specific driver needs upper layer's help on
> old-behaviour emulation.
>
> Last but not least, I'm not sure how to properly emulate
> reset_vendor_mappings() from vhost-vdpa layer. If a vendor driver has no
> .reset_map op implemented, or if .reset_map has a slightly different
> implementation than what it used to reset the iotlb in the .reset op,
> then this either becomes effectively dead code if no one ends up using,
> or the vhost-vdpa emulation is helpless and limited in scope, unable to
> cover all the cases.
>
>
> >
> >> In addition, the
> >> emulation has to limit to those buggy drivers as I don't feel this
> >> emulation should apply uniformly to all future set_map/dma_map
> >> implementations.
> > Unfortunately, it's a must to stick to ABI.
> How come this brokenness in mlx5_vdpa becomes ABI in any sort for future
> on-chip IOMMU drivers? They might not even exist yet. Even if it's
> concerning ABI it's limited to mlx5_vdpa and the existing drivers, right?
>
> >   I agree it's a mess but we don't have a better choice.
> Well, it's your call, I can implement as you wish but the unwarranted
> code has to be maintained forever. Particularly without knowing if
> there's really such a use case in real life, and no one in future might
> dare to remove the code without knowing what it can be used for.
>
> > Or we can fail the probe if userspace
> > doesn't ack this feature.
> Fail probing is even worse choice that is introducing intrusive breakage
> to the userspace.
> >
> >>> 2) audit all the possible cases to avoid a one line of code
> >>>
> >>> 1) seems much easier than 2)
> >> You see it's more than just one line of code, and I'm uncertain if the
> >> additional complexity is warranted or necessary, particularly if added
> >> this piece of compatibility code will linger for quite a long time.
> > This is a must as long as it can be noticed by userspace. Doing
> > something conservative makes more sense to me.
> >
> >> Instead of adding hypothetical code change for no specific good reason
> >> and no real use case,
> > It's not adding something new or new behaviours, it's just making the
> > IOTLB reset conditional based on vDPA reset.
> >
> >> I'd like to add the code when we find out a
> >> specific use case that may get impacted or already being affected,
> > It doesn't conflict with what you proposed here. Old behaviours have
> > their users, no?
> We don't know the use case how to make thing work instead of make thing
> break, that is the problem. We have no way to test if old-behaviour
> preserving code really works as expected. If there's no such user in
> practice, it ends up with dead code no one dares to remove.
> >
> >> then
> >> we will have good understanding how to code up the fix and emulate
> >> properly for compatibility, while not affecting other good implementations.
> > The issue is, even if we can't find a userspace now. It doesn't mean
> > we can't have one in the future. Then it might be too late or too
> > tricky to fix them. We had a lot of lessons in the past.
> I am not sure the same situation "too late to fix" or "too tricky to
> fix" applies here. Usually this means there's some well established
> pattern for e.g. API, ABI or long standing de-factor behavior that can't
> be broken or adjust if trying to fix something up. But here we're
> guarded by a flag (IOTLB_PERSIST) and without it the behavior is totally
> ruled by implementation.
>
> Regards,
> -Siwei
>
> >
> > Thanks
> >
> >> Thanks,
> >> -Siwe/i/
> >>
> >>>>>> For two reasons:
> >>>>>>
> >>>>>> 1) backend features need acked by userspace this is by design
> >>>>>> 2) keep the odd behaviour seems to be more safe as we can't audit
> >>>>>> every userspace program
> >>>>>>
> >>>>> The old behavior (without flag ack) cannot be trusted already, as:
> >>> Possibly but the point is to unbreak userspace no matter how weird the
> >>> behaviour we've ever had.
> >>>
> >>>>> * Devices using platform IOMMU (in other words, not implementing
> >>>>> neither .set_map nor .dma_map) does not unmap memory at virtio reset.
> >>>>> * Devices that implement .set_map or .dma_map (vdpa_sim, mlx5) do
> >>>>> reset IOTLB, but in their parent ops (vdpasim_do_reset, prune_iotlb
> >>>>> called from mlx5_vdpa_reset). With vdpa_sim patch removing the reset,
> >>>>> now all backends work the same as far as I know., which was (and is)
> >>>>> the way devices using the platform IOMMU works.
> >>>>>
> >>>>> The difference in behavior did not matter as QEMU unmaps all the
> >>>>> memory unregistering the memory listener at vhost_vdpa_dev_start(...,
> >>>>> started = false),
> >>>> Exactly. It's not just QEMU, but any (older) userspace manipulates
> >>>> mappings through the vhost-vdpa iotlb interface has to unmap all
> >>>> mappings to workaround the vdpa parent driver bug.
> >>> Just to clarify, from userspace, it's the (odd) behaviour of the current uAPI.
> >>>
> >>>> If they don't do
> >>>> explicit unmap, it would cause state inconsistency between vhost-vdpa
> >>>> and parent driver, then old mappings can't be restored, and new mapping
> >>>> can be added to iotlb after vDPA reset. There's no point to preserve
> >>>> this broken and inconsistent behavior between vhost-vdpa and parent
> >>>> driver, as userspace doesn't care at all!
> >>> It's a userspace notice change so we can't fix it silently:
> >>>
> >>> https://lkml.org/lkml/2012/12/23/75
> >>>
> >>> Another example which is related to vhost-vDPA:
> >>>
> >>> https://lore.kernel.org/netdev/20230927140544.205088-1-eric.auger@redhat.com/T/
> >>>
> >>> Thanks
> >>>
> >>>>> but the backend acknowledging this feature flag
> >>>>> allows QEMU to make sure it is safe to skip this unmap & map in the
> >>>>> case of vhost stop & start cycle.
> >>>>>
> >>>>> In that sense, this feature flag is actually a signal for userspace to
> >>>>> know that the bug has been solved.
> >>>> Right, I couldn't say it better than you do, thanks! The feature flag is
> >>>> more of an unusual means to indicating kernel bug having been fixed,
> >>>> rather than introduce a new feature or new kernel behavior ending up in
> >>>> change of userspace's expectation.
> >>>>
> >>>>> Not offering it indicates that
> >>>>> userspace cannot trust the kernel will retain the maps.
> >>>>>
> >>>>> Si-Wei or Dragos, please correct me if I've missed something. Feel
> >>>>> free to use the text in case you find more clear in doc or patch log.
> >>>> Sure, will do, thank you! Will post v2 adding these to the log.
> >>>>
> >>>> Thanks,
> >>>> -Siwei
> >>>>
> >>>>
> >>>>
> >>>>> Thanks!
> >>>>>
> >>>>>> Thanks
> >>>>>>
> >>>>>>> I think
> >>>>>>> the purpose of the IOTLB_PERSIST flag is just to give userspace 100%
> >>>>>>> certainty of persistent iotlb mapping not getting lost across vdpa reset.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> -Siwei
> >>>>>>>
> >>>>>>> [1]
> >>>>>>> https://lore.kernel.org/virtualization/9f118fc9-4f6f-dd67-a291-be78152e47fd@oracle.com/
> >>>>>>> [2]
> >>>>>>> https://lore.kernel.org/virtualization/3364adfd-1eb7-8bce-41f9-bfe5473f1f2e@oracle.com/
> >>>>>>>>      Otherwise
> >>>>>>>> we may break old userspace.
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>>
> >>>>>>>>> +       vhost_vdpa_reset_map(v, asid);
> >>>>>>>>>             kfree(as);
> >>>>>>>>>
> >>>>>>>>>             return 0;
> >>>>>>>>> --
> >>>>>>>>> 1.8.3.1
> >>>>>>>>>
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
  2023-10-18 11:14                     ` Eugenio Perez Martin
@ 2023-10-18 23:21                       ` Si-Wei Liu
  2023-10-19  2:48                         ` Jason Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Si-Wei Liu @ 2023-10-18 23:21 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Jason Wang, mst, xuanzhuo, dtatulea, virtualization, linux-kernel



On 10/18/2023 4:14 AM, Eugenio Perez Martin wrote:
> On Wed, Oct 18, 2023 at 10:44 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 10/17/2023 10:27 PM, Jason Wang wrote:
>>> On Wed, Oct 18, 2023 at 12:36 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>
>>>> On 10/16/2023 7:35 PM, Jason Wang wrote:
>>>>> On Tue, Oct 17, 2023 at 4:30 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>> On 10/16/2023 4:28 AM, Eugenio Perez Martin wrote:
>>>>>>> On Mon, Oct 16, 2023 at 8:33 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>> On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>>>> On 10/12/2023 8:01 PM, Jason Wang wrote:
>>>>>>>>>> On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>>>>>> Devices with on-chip IOMMU or vendor specific IOTLB implementation
>>>>>>>>>>> may need to restore iotlb mapping to the initial or default state
>>>>>>>>>>> using the .reset_map op, as it's desirable for some parent devices
>>>>>>>>>>> to solely manipulate mappings by its own, independent of virtio device
>>>>>>>>>>> state. For instance, device reset does not cause mapping go away on
>>>>>>>>>>> such IOTLB model in need of persistent mapping. Before vhost-vdpa
>>>>>>>>>>> is going away, give them a chance to reset iotlb back to the initial
>>>>>>>>>>> state in vhost_vdpa_cleanup().
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
>>>>>>>>>>> ---
>>>>>>>>>>>       drivers/vhost/vdpa.c | 16 ++++++++++++++++
>>>>>>>>>>>       1 file changed, 16 insertions(+)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>>>>>>>>>>> index 851535f..a3f8160 100644
>>>>>>>>>>> --- a/drivers/vhost/vdpa.c
>>>>>>>>>>> +++ b/drivers/vhost/vdpa.c
>>>>>>>>>>> @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
>>>>>>>>>>>              return vhost_vdpa_alloc_as(v, asid);
>>>>>>>>>>>       }
>>>>>>>>>>>
>>>>>>>>>>> +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid)
>>>>>>>>>>> +{
>>>>>>>>>>> +       struct vdpa_device *vdpa = v->vdpa;
>>>>>>>>>>> +       const struct vdpa_config_ops *ops = vdpa->config;
>>>>>>>>>>> +
>>>>>>>>>>> +       if (ops->reset_map)
>>>>>>>>>>> +               ops->reset_map(vdpa, asid);
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>>       static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
>>>>>>>>>>>       {
>>>>>>>>>>>              struct vhost_vdpa_as *as = asid_to_as(v, asid);
>>>>>>>>>>> @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
>>>>>>>>>>>
>>>>>>>>>>>              hlist_del(&as->hash_link);
>>>>>>>>>>>              vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid);
>>>>>>>>>>> +       /*
>>>>>>>>>>> +        * Devices with vendor specific IOMMU may need to restore
>>>>>>>>>>> +        * iotlb to the initial or default state which is not done
>>>>>>>>>>> +        * through device reset, as the IOTLB mapping manipulation
>>>>>>>>>>> +        * could be decoupled from the virtio device life cycle.
>>>>>>>>>>> +        */
>>>>>>>>>> Should we do this according to whether IOTLB_PRESIST is set?
>>>>>>>>> Well, in theory this seems like so but it's unnecessary code change
>>>>>>>>> actually, as that is the way how vDPA parent behind platform IOMMU works
>>>>>>>>> today, and userspace doesn't break as of today. :)
>>>>>>>> Well, this is one question I've ever asked before. You have explained
>>>>>>>> that one of the reason that we don't break userspace is that they may
>>>>>>>> couple IOTLB reset with vDPA reset as well. One example is the Qemu.
>>>>>>>>
>>>>>>>>> As explained in previous threads [1][2], when IOTLB_PERSIST is not set
>>>>>>>>> it doesn't necessarily mean the iotlb will definitely be destroyed
>>>>>>>>> across reset (think about the platform IOMMU case), so userspace today
>>>>>>>>> is already tolerating enough with either good or bad IOMMU.
>>>>> I'm confused, how to define tolerating here?
>>>> Tolerating defined as QEMU has to proactively unmap before reset just to
>>>> workaround the driver bug (on-chip maps out of sync), unconditionally
>>>> for platform or on-chip. While we all know it doesn't have to do so for
>>>> platform IOMMU, though userspace has no means to distinguish. That said,
>>>> userspace is sacrificing reset time performance on platform IOMMU setup
>>>> just for working around buggy implementation in the other setup.
>>> Ok, so what you actually mean is that userspace can tolerate the "bug"
>>> with the performance penalty.
>> Right.
>>>
>>>>> For example, if it has tolerance, why bother?
>>>> I'm not sure I get the question. But I think userspace is compromising
>>>> because of buggy implementation in a few drivers doesn't mean we should
>>>> uniformly enforce such behavior for all set_map/dma_map implementations.
>>> This is not my point. I meant, we can fix we need a negotiation in
>>> order to let some "buggy" old user space to survive from the changes.
>> Userspace is no buggy today, how to define "buggy"? Userspace with
>> tolerance could survive just fine no matter if this negotiation or buggy
>> driver behavior emulation is around or not. If any userspace doesn't
>> tolerate, it can work still fine on good on-chip IOMMU or platform
>> IOMMU, no matter if the negotiation is around or not.
>>>>>>>> This code of
>>>>>>>>> not checking IOTLB_PERSIST being set is intentional, there's no point to
>>>>>>>>> emulate bad IOMMU behavior even for older userspace (with improper
>>>>>>>>> emulation to be done it would result in even worse performance).
>>>>> I can easily imagine a case:
>>>>>
>>>>> The old Qemu that works only with a setup like mlx5_vdpa.
>>>> Noted, seems to me there's no such case of a userspace implementation
>>>> that only works with mlx5_vdpa or its friends, but doesn't work with the
>>>> others e.g. platform IOMMU, or well behaving on-chip IOMMU
>>>> implementations.
>>> It's not hard to think of a case where:
>>>
>>> 1) the environment has mlx5_vdpa only
>>> 2) kernel doc can't have endless details, so when developing
>>> application, the author notice IOTLB is cleared during reset
>> I get it, but my question was that, even if the author had noticed IOTLB
>> is cleared during reset, does he care or not to make IOTLB back working
>> again? My point is that, if this old setup is supposed to "work" on
>> mlx5_vdpa, then the developer must come up with sort of "quirk" to
>> recover the IOTLB to make it back to working state again after the
>> reset. It will be more justified to come up with the proper fix for
>> compatibility/emulation only until we know what should be expected to
>> work and through which possible means to making it back to work, rather
>> than blindly emulate the buggy behavior solely based on a few driver's
>> own implementation. I'm pretty sure there are multiple ways to implement
>> the buggy reset behavior in the driver, does it mean we have to emulate
>> various corrupted mapping states in the individual on-chip iommu itself?
>> How is it able to help the developer user if we are able to replicate
>> the same corrupted mapping state in the on-chip iommu after reset, any
>> real-life user only cares about mapping being corrupted in the same way,
>> rather than cares more about the quirk sequence or work around to get
>> iotlb maps out of the broken state?
>>
>> Only if the userspace is like a test facility to expect some test case
>> to fail on mlx5_vdpa after reset -- I assume that is not real-life user
>> at all.
>>>> The Unmap+remap trick around vdpa reset works totally
>>>> fine for platform IOMMU, except with sub-optimal performance. Other than
>>>> this trick, I cannot easily think of other means or iotlb message
>>>> sequence for userspace to recover the bogus state and make iotlb back to
>>>> work again after reset.
>>> Yes for sure, but we can't audit every user space, no?
>> We don't have to, as userspace here has no bug at all. The bug exists in
>> the driver not in userspace. Real life userspace app only cares about
>> making things work not asserting something must be broken.
>>>> Are we talking about hypnosis that has no real
>>>> basis to exist in the real world?
>>> Instead of trying to answer these hard questions, I would go another
>>> way. That is, stick to the old behaviour when IOTLB_PRESISIT is not
>>> set by the backend. This is much easier.
>> Please be noted the old (broken) behavior can vary between different
>> parent driver implementations. It's driver's specific own problem, if
>> there are N ways to for driver to implement buggy .reset, do we have to
>> emulate N flavors of different vdpa reset behavior?
>>
>>>>>     If we do
>>>>> this without a negotiation, IOTLB will not be clear but the Qemu will
>>>>> try to re-program the IOTLB after reset. Which will break?
>>>>>
>>>>> 1) stick the exact old behaviour with just one line of check
>>>> It's not just one line of check here, the old behavior emulation has to
>>>> be done as Eugenio illustrated in the other email.
>>> For vhost-vDPA it's just
>>>
>>> if (IOTLB_PERSIST is acked by userspace)
>>>       reset_map()
>>>
>>> For parent, it's somehow similar:
>>>
>>> during .reset()
>>>
>>> if (IOTLB_PERSIST is not acked by userspace)
>>>           reset_vendor_mappings()
>>>
>>> Anything I missed here?
>> First, the ideal fix would be to leave this reset_vendor_mappings()
>> emulation code on the individual driver itself, which already has the
>> broken behavior. But today there's no backend feature negotiation
>> between vhost-vdpa and the parent driver. Do we want to send down the
>> acked_backend_features to parent drivers?
>>
> What if we add a module parameter to both mlx5 and vdpa_sim to keep
> the old behavior? Let's call it clean_iotlb_on_reset for now.
>
> In my opinion we can leave it off by default, so these userspace apps
> can get back to the previous behavior. It would be ideal if we set a
> deprecation date for it though.
>
> This way new backends, whether they implement .set_map or not, will
> have correct behavior.
>
> Would that work?
Great idea, this definitely will work! With this module parameter, 
individual driver still keeps the possibility to revert to previous 
buggy behavior were to unbreak old userspace, code can be obsoleted 
independently per each driver's specific use case and need, and we don't 
necessarily overload vdpa core with too much unwarranted compatibility 
code. Thank you so much for the great suggestion, I will post a v3.

Thanks,
-Siwei

>
> Thanks!
>
>> Second, IOTLB_PERSIST is needed but not sufficient. Due to lack of
>> backend feature negotiation in parent driver, if vhost-vdpa has to
>> provide the old-behaviour emulation for compatibility on driver's
>> behalf, it needs to be done per-driver basis. There could be good
>> on-chip or vendor IOMMU implementation which doesn't clear the IOTLB in
>> .reset, and vendor specific IOMMU doesn't have to provide .reset_map, we
>> should allow these good driver implementations rather than
>> unconditionally stick to some specific problematic behavior for every
>> other good driver. Then we need a set of device flags (backend_features
>> bit again?) to indicate the specific driver needs upper layer's help on
>> old-behaviour emulation.
>>
>> Last but not least, I'm not sure how to properly emulate
>> reset_vendor_mappings() from vhost-vdpa layer. If a vendor driver has no
>> .reset_map op implemented, or if .reset_map has a slightly different
>> implementation than what it used to reset the iotlb in the .reset op,
>> then this either becomes effectively dead code if no one ends up using,
>> or the vhost-vdpa emulation is helpless and limited in scope, unable to
>> cover all the cases.
>>
>>
>>>> In addition, the
>>>> emulation has to limit to those buggy drivers as I don't feel this
>>>> emulation should apply uniformly to all future set_map/dma_map
>>>> implementations.
>>> Unfortunately, it's a must to stick to ABI.
>> How come this brokenness in mlx5_vdpa becomes ABI in any sort for future
>> on-chip IOMMU drivers? They might not even exist yet. Even if it's
>> concerning ABI it's limited to mlx5_vdpa and the existing drivers, right?
>>
>>>    I agree it's a mess but we don't have a better choice.
>> Well, it's your call, I can implement as you wish but the unwarranted
>> code has to be maintained forever. Particularly without knowing if
>> there's really such a use case in real life, and no one in future might
>> dare to remove the code without knowing what it can be used for.
>>
>>> Or we can fail the probe if userspace
>>> doesn't ack this feature.
>> Fail probing is even worse choice that is introducing intrusive breakage
>> to the userspace.
>>>>> 2) audit all the possible cases to avoid a one line of code
>>>>>
>>>>> 1) seems much easier than 2)
>>>> You see it's more than just one line of code, and I'm uncertain if the
>>>> additional complexity is warranted or necessary, particularly if added
>>>> this piece of compatibility code will linger for quite a long time.
>>> This is a must as long as it can be noticed by userspace. Doing
>>> something conservative makes more sense to me.
>>>
>>>> Instead of adding hypothetical code change for no specific good reason
>>>> and no real use case,
>>> It's not adding something new or new behaviours, it's just making the
>>> IOTLB reset conditional based on vDPA reset.
>>>
>>>> I'd like to add the code when we find out a
>>>> specific use case that may get impacted or already being affected,
>>> It doesn't conflict with what you proposed here. Old behaviours have
>>> their users, no?
>> We don't know the use case how to make thing work instead of make thing
>> break, that is the problem. We have no way to test if old-behaviour
>> preserving code really works as expected. If there's no such user in
>> practice, it ends up with dead code no one dares to remove.
>>>> then
>>>> we will have good understanding how to code up the fix and emulate
>>>> properly for compatibility, while not affecting other good implementations.
>>> The issue is, even if we can't find a userspace now. It doesn't mean
>>> we can't have one in the future. Then it might be too late or too
>>> tricky to fix them. We had a lot of lessons in the past.
>> I am not sure the same situation "too late to fix" or "too tricky to
>> fix" applies here. Usually this means there's some well established
>> pattern for e.g. API, ABI or long standing de-factor behavior that can't
>> be broken or adjust if trying to fix something up. But here we're
>> guarded by a flag (IOTLB_PERSIST) and without it the behavior is totally
>> ruled by implementation.
>>
>> Regards,
>> -Siwei
>>
>>> Thanks
>>>
>>>> Thanks,
>>>> -Siwe/i/
>>>>
>>>>>>>> For two reasons:
>>>>>>>>
>>>>>>>> 1) backend features need acked by userspace this is by design
>>>>>>>> 2) keep the odd behaviour seems to be more safe as we can't audit
>>>>>>>> every userspace program
>>>>>>>>
>>>>>>> The old behavior (without flag ack) cannot be trusted already, as:
>>>>> Possibly but the point is to unbreak userspace no matter how weird the
>>>>> behaviour we've ever had.
>>>>>
>>>>>>> * Devices using platform IOMMU (in other words, not implementing
>>>>>>> neither .set_map nor .dma_map) does not unmap memory at virtio reset.
>>>>>>> * Devices that implement .set_map or .dma_map (vdpa_sim, mlx5) do
>>>>>>> reset IOTLB, but in their parent ops (vdpasim_do_reset, prune_iotlb
>>>>>>> called from mlx5_vdpa_reset). With vdpa_sim patch removing the reset,
>>>>>>> now all backends work the same as far as I know., which was (and is)
>>>>>>> the way devices using the platform IOMMU works.
>>>>>>>
>>>>>>> The difference in behavior did not matter as QEMU unmaps all the
>>>>>>> memory unregistering the memory listener at vhost_vdpa_dev_start(...,
>>>>>>> started = false),
>>>>>> Exactly. It's not just QEMU, but any (older) userspace manipulates
>>>>>> mappings through the vhost-vdpa iotlb interface has to unmap all
>>>>>> mappings to workaround the vdpa parent driver bug.
>>>>> Just to clarify, from userspace, it's the (odd) behaviour of the current uAPI.
>>>>>
>>>>>> If they don't do
>>>>>> explicit unmap, it would cause state inconsistency between vhost-vdpa
>>>>>> and parent driver, then old mappings can't be restored, and new mapping
>>>>>> can be added to iotlb after vDPA reset. There's no point to preserve
>>>>>> this broken and inconsistent behavior between vhost-vdpa and parent
>>>>>> driver, as userspace doesn't care at all!
>>>>> It's a userspace notice change so we can't fix it silently:
>>>>>
>>>>> https://lkml.org/lkml/2012/12/23/75
>>>>>
>>>>> Another example which is related to vhost-vDPA:
>>>>>
>>>>> https://lore.kernel.org/netdev/20230927140544.205088-1-eric.auger@redhat.com/T/
>>>>>
>>>>> Thanks
>>>>>
>>>>>>> but the backend acknowledging this feature flag
>>>>>>> allows QEMU to make sure it is safe to skip this unmap & map in the
>>>>>>> case of vhost stop & start cycle.
>>>>>>>
>>>>>>> In that sense, this feature flag is actually a signal for userspace to
>>>>>>> know that the bug has been solved.
>>>>>> Right, I couldn't say it better than you do, thanks! The feature flag is
>>>>>> more of an unusual means to indicating kernel bug having been fixed,
>>>>>> rather than introduce a new feature or new kernel behavior ending up in
>>>>>> change of userspace's expectation.
>>>>>>
>>>>>>> Not offering it indicates that
>>>>>>> userspace cannot trust the kernel will retain the maps.
>>>>>>>
>>>>>>> Si-Wei or Dragos, please correct me if I've missed something. Feel
>>>>>>> free to use the text in case you find more clear in doc or patch log.
>>>>>> Sure, will do, thank you! Will post v2 adding these to the log.
>>>>>>
>>>>>> Thanks,
>>>>>> -Siwei
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>>> I think
>>>>>>>>> the purpose of the IOTLB_PERSIST flag is just to give userspace 100%
>>>>>>>>> certainty of persistent iotlb mapping not getting lost across vdpa reset.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> -Siwei
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://lore.kernel.org/virtualization/9f118fc9-4f6f-dd67-a291-be78152e47fd@oracle.com/
>>>>>>>>> [2]
>>>>>>>>> https://lore.kernel.org/virtualization/3364adfd-1eb7-8bce-41f9-bfe5473f1f2e@oracle.com/
>>>>>>>>>>       Otherwise
>>>>>>>>>> we may break old userspace.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>>> +       vhost_vdpa_reset_map(v, asid);
>>>>>>>>>>>              kfree(as);
>>>>>>>>>>>
>>>>>>>>>>>              return 0;
>>>>>>>>>>> --
>>>>>>>>>>> 1.8.3.1
>>>>>>>>>>>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
  2023-10-18 23:21                       ` Si-Wei Liu
@ 2023-10-19  2:48                         ` Jason Wang
  0 siblings, 0 replies; 37+ messages in thread
From: Jason Wang @ 2023-10-19  2:48 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Eugenio Perez Martin, mst, xuanzhuo, dtatulea, virtualization,
	linux-kernel

On Thu, Oct 19, 2023 at 7:21 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 10/18/2023 4:14 AM, Eugenio Perez Martin wrote:
> > On Wed, Oct 18, 2023 at 10:44 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >> On 10/17/2023 10:27 PM, Jason Wang wrote:
> >>> On Wed, Oct 18, 2023 at 12:36 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>
> >>>> On 10/16/2023 7:35 PM, Jason Wang wrote:
> >>>>> On Tue, Oct 17, 2023 at 4:30 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>> On 10/16/2023 4:28 AM, Eugenio Perez Martin wrote:
> >>>>>>> On Mon, Oct 16, 2023 at 8:33 AM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>> On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>>>>> On 10/12/2023 8:01 PM, Jason Wang wrote:
> >>>>>>>>>> On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>>>>>>> Devices with on-chip IOMMU or vendor specific IOTLB implementation
> >>>>>>>>>>> may need to restore iotlb mapping to the initial or default state
> >>>>>>>>>>> using the .reset_map op, as it's desirable for some parent devices
> >>>>>>>>>>> to solely manipulate mappings by its own, independent of virtio device
> >>>>>>>>>>> state. For instance, device reset does not cause mapping go away on
> >>>>>>>>>>> such IOTLB model in need of persistent mapping. Before vhost-vdpa
> >>>>>>>>>>> is going away, give them a chance to reset iotlb back to the initial
> >>>>>>>>>>> state in vhost_vdpa_cleanup().
> >>>>>>>>>>>
> >>>>>>>>>>> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
> >>>>>>>>>>> ---
> >>>>>>>>>>>       drivers/vhost/vdpa.c | 16 ++++++++++++++++
> >>>>>>>>>>>       1 file changed, 16 insertions(+)
> >>>>>>>>>>>
> >>>>>>>>>>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> >>>>>>>>>>> index 851535f..a3f8160 100644
> >>>>>>>>>>> --- a/drivers/vhost/vdpa.c
> >>>>>>>>>>> +++ b/drivers/vhost/vdpa.c
> >>>>>>>>>>> @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
> >>>>>>>>>>>              return vhost_vdpa_alloc_as(v, asid);
> >>>>>>>>>>>       }
> >>>>>>>>>>>
> >>>>>>>>>>> +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid)
> >>>>>>>>>>> +{
> >>>>>>>>>>> +       struct vdpa_device *vdpa = v->vdpa;
> >>>>>>>>>>> +       const struct vdpa_config_ops *ops = vdpa->config;
> >>>>>>>>>>> +
> >>>>>>>>>>> +       if (ops->reset_map)
> >>>>>>>>>>> +               ops->reset_map(vdpa, asid);
> >>>>>>>>>>> +}
> >>>>>>>>>>> +
> >>>>>>>>>>>       static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
> >>>>>>>>>>>       {
> >>>>>>>>>>>              struct vhost_vdpa_as *as = asid_to_as(v, asid);
> >>>>>>>>>>> @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
> >>>>>>>>>>>
> >>>>>>>>>>>              hlist_del(&as->hash_link);
> >>>>>>>>>>>              vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid);
> >>>>>>>>>>> +       /*
> >>>>>>>>>>> +        * Devices with vendor specific IOMMU may need to restore
> >>>>>>>>>>> +        * iotlb to the initial or default state which is not done
> >>>>>>>>>>> +        * through device reset, as the IOTLB mapping manipulation
> >>>>>>>>>>> +        * could be decoupled from the virtio device life cycle.
> >>>>>>>>>>> +        */
> >>>>>>>>>> Should we do this according to whether IOTLB_PRESIST is set?
> >>>>>>>>> Well, in theory this seems like so but it's unnecessary code change
> >>>>>>>>> actually, as that is the way how vDPA parent behind platform IOMMU works
> >>>>>>>>> today, and userspace doesn't break as of today. :)
> >>>>>>>> Well, this is one question I've ever asked before. You have explained
> >>>>>>>> that one of the reason that we don't break userspace is that they may
> >>>>>>>> couple IOTLB reset with vDPA reset as well. One example is the Qemu.
> >>>>>>>>
> >>>>>>>>> As explained in previous threads [1][2], when IOTLB_PERSIST is not set
> >>>>>>>>> it doesn't necessarily mean the iotlb will definitely be destroyed
> >>>>>>>>> across reset (think about the platform IOMMU case), so userspace today
> >>>>>>>>> is already tolerating enough with either good or bad IOMMU.
> >>>>> I'm confused, how to define tolerating here?
> >>>> Tolerating defined as QEMU has to proactively unmap before reset just to
> >>>> workaround the driver bug (on-chip maps out of sync), unconditionally
> >>>> for platform or on-chip. While we all know it doesn't have to do so for
> >>>> platform IOMMU, though userspace has no means to distinguish. That said,
> >>>> userspace is sacrificing reset time performance on platform IOMMU setup
> >>>> just for working around buggy implementation in the other setup.
> >>> Ok, so what you actually mean is that userspace can tolerate the "bug"
> >>> with the performance penalty.
> >> Right.
> >>>
> >>>>> For example, if it has tolerance, why bother?
> >>>> I'm not sure I get the question. But I think userspace is compromising
> >>>> because of buggy implementation in a few drivers doesn't mean we should
> >>>> uniformly enforce such behavior for all set_map/dma_map implementations.
> >>> This is not my point. I meant, we can fix we need a negotiation in
> >>> order to let some "buggy" old user space to survive from the changes.
> >> Userspace is no buggy today, how to define "buggy"? Userspace with
> >> tolerance could survive just fine no matter if this negotiation or buggy
> >> driver behavior emulation is around or not. If any userspace doesn't
> >> tolerate, it can work still fine on good on-chip IOMMU or platform
> >> IOMMU, no matter if the negotiation is around or not.
> >>>>>>>> This code of
> >>>>>>>>> not checking IOTLB_PERSIST being set is intentional, there's no point to
> >>>>>>>>> emulate bad IOMMU behavior even for older userspace (with improper
> >>>>>>>>> emulation to be done it would result in even worse performance).
> >>>>> I can easily imagine a case:
> >>>>>
> >>>>> The old Qemu that works only with a setup like mlx5_vdpa.
> >>>> Noted, seems to me there's no such case of a userspace implementation
> >>>> that only works with mlx5_vdpa or its friends, but doesn't work with the
> >>>> others e.g. platform IOMMU, or well behaving on-chip IOMMU
> >>>> implementations.
> >>> It's not hard to think of a case where:
> >>>
> >>> 1) the environment has mlx5_vdpa only
> >>> 2) kernel doc can't have endless details, so when developing
> >>> application, the author notice IOTLB is cleared during reset
> >> I get it, but my question was that, even if the author had noticed IOTLB
> >> is cleared during reset, does he care or not to make IOTLB back working
> >> again? My point is that, if this old setup is supposed to "work" on
> >> mlx5_vdpa, then the developer must come up with sort of "quirk" to
> >> recover the IOTLB to make it back to working state again after the
> >> reset. It will be more justified to come up with the proper fix for
> >> compatibility/emulation only until we know what should be expected to
> >> work and through which possible means to making it back to work, rather
> >> than blindly emulate the buggy behavior solely based on a few driver's
> >> own implementation. I'm pretty sure there are multiple ways to implement
> >> the buggy reset behavior in the driver, does it mean we have to emulate
> >> various corrupted mapping states in the individual on-chip iommu itself?
> >> How is it able to help the developer user if we are able to replicate
> >> the same corrupted mapping state in the on-chip iommu after reset, any
> >> real-life user only cares about mapping being corrupted in the same way,
> >> rather than cares more about the quirk sequence or work around to get
> >> iotlb maps out of the broken state?
> >>
> >> Only if the userspace is like a test facility to expect some test case
> >> to fail on mlx5_vdpa after reset -- I assume that is not real-life user
> >> at all.
> >>>> The Unmap+remap trick around vdpa reset works totally
> >>>> fine for platform IOMMU, except with sub-optimal performance. Other than
> >>>> this trick, I cannot easily think of other means or iotlb message
> >>>> sequence for userspace to recover the bogus state and make iotlb back to
> >>>> work again after reset.
> >>> Yes for sure, but we can't audit every user space, no?
> >> We don't have to, as userspace here has no bug at all. The bug exists in
> >> the driver not in userspace. Real life userspace app only cares about
> >> making things work not asserting something must be broken.
> >>>> Are we talking about hypnosis that has no real
> >>>> basis to exist in the real world?
> >>> Instead of trying to answer these hard questions, I would go another
> >>> way. That is, stick to the old behaviour when IOTLB_PRESISIT is not
> >>> set by the backend. This is much easier.
> >> Please be noted the old (broken) behavior can vary between different
> >> parent driver implementations. It's driver's specific own problem, if
> >> there are N ways to for driver to implement buggy .reset, do we have to
> >> emulate N flavors of different vdpa reset behavior?
> >>
> >>>>>     If we do
> >>>>> this without a negotiation, IOTLB will not be clear but the Qemu will
> >>>>> try to re-program the IOTLB after reset. Which will break?
> >>>>>
> >>>>> 1) stick the exact old behaviour with just one line of check
> >>>> It's not just one line of check here, the old behavior emulation has to
> >>>> be done as Eugenio illustrated in the other email.
> >>> For vhost-vDPA it's just
> >>>
> >>> if (IOTLB_PERSIST is acked by userspace)
> >>>       reset_map()
> >>>
> >>> For parent, it's somehow similar:
> >>>
> >>> during .reset()
> >>>
> >>> if (IOTLB_PERSIST is not acked by userspace)
> >>>           reset_vendor_mappings()
> >>>
> >>> Anything I missed here?
> >> First, the ideal fix would be to leave this reset_vendor_mappings()
> >> emulation code on the individual driver itself, which already has the
> >> broken behavior. But today there's no backend feature negotiation
> >> between vhost-vdpa and the parent driver. Do we want to send down the
> >> acked_backend_features to parent drivers?
> >>
> > What if we add a module parameter to both mlx5 and vdpa_sim to keep
> > the old behavior? Let's call it clean_iotlb_on_reset for now.
> >
> > In my opinion we can leave it off by default, so these userspace apps
> > can get back to the previous behavior. It would be ideal if we set a
> > deprecation date for it though.
> >
> > This way new backends, whether they implement .set_map or not, will
> > have correct behavior.
> >
> > Would that work?
> Great idea, this definitely will work! With this module parameter,
> individual driver still keeps the possibility to revert to previous
> buggy behavior were to unbreak old userspace, code can be obsoleted
> independently per each driver's specific use case and need, and we don't
> necessarily overload vdpa core with too much unwarranted compatibility
> code. Thank you so much for the great suggestion,

I disagree, module parameters have been proved to be very hard for
management. And what I don't understand here is, once a module
parameter can work, why not just replace it with the checking of
IOTLB_PRESIST?

Thanks


> I will post a v3.
>
> Thanks,
> -Siwei
>
> >
> > Thanks!
> >
> >> Second, IOTLB_PERSIST is needed but not sufficient. Due to lack of
> >> backend feature negotiation in parent driver, if vhost-vdpa has to
> >> provide the old-behaviour emulation for compatibility on driver's
> >> behalf, it needs to be done per-driver basis. There could be good
> >> on-chip or vendor IOMMU implementation which doesn't clear the IOTLB in
> >> .reset, and vendor specific IOMMU doesn't have to provide .reset_map, we
> >> should allow these good driver implementations rather than
> >> unconditionally stick to some specific problematic behavior for every
> >> other good driver. Then we need a set of device flags (backend_features
> >> bit again?) to indicate the specific driver needs upper layer's help on
> >> old-behaviour emulation.
> >>
> >> Last but not least, I'm not sure how to properly emulate
> >> reset_vendor_mappings() from vhost-vdpa layer. If a vendor driver has no
> >> .reset_map op implemented, or if .reset_map has a slightly different
> >> implementation than what it used to reset the iotlb in the .reset op,
> >> then this either becomes effectively dead code if no one ends up using,
> >> or the vhost-vdpa emulation is helpless and limited in scope, unable to
> >> cover all the cases.
> >>
> >>
> >>>> In addition, the
> >>>> emulation has to limit to those buggy drivers as I don't feel this
> >>>> emulation should apply uniformly to all future set_map/dma_map
> >>>> implementations.
> >>> Unfortunately, it's a must to stick to ABI.
> >> How come this brokenness in mlx5_vdpa becomes ABI in any sort for future
> >> on-chip IOMMU drivers? They might not even exist yet. Even if it's
> >> concerning ABI it's limited to mlx5_vdpa and the existing drivers, right?
> >>
> >>>    I agree it's a mess but we don't have a better choice.
> >> Well, it's your call, I can implement as you wish but the unwarranted
> >> code has to be maintained forever. Particularly without knowing if
> >> there's really such a use case in real life, and no one in future might
> >> dare to remove the code without knowing what it can be used for.
> >>
> >>> Or we can fail the probe if userspace
> >>> doesn't ack this feature.
> >> Fail probing is even worse choice that is introducing intrusive breakage
> >> to the userspace.
> >>>>> 2) audit all the possible cases to avoid a one line of code
> >>>>>
> >>>>> 1) seems much easier than 2)
> >>>> You see it's more than just one line of code, and I'm uncertain if the
> >>>> additional complexity is warranted or necessary, particularly if added
> >>>> this piece of compatibility code will linger for quite a long time.
> >>> This is a must as long as it can be noticed by userspace. Doing
> >>> something conservative makes more sense to me.
> >>>
> >>>> Instead of adding hypothetical code change for no specific good reason
> >>>> and no real use case,
> >>> It's not adding something new or new behaviours, it's just making the
> >>> IOTLB reset conditional based on vDPA reset.
> >>>
> >>>> I'd like to add the code when we find out a
> >>>> specific use case that may get impacted or already being affected,
> >>> It doesn't conflict with what you proposed here. Old behaviours have
> >>> their users, no?
> >> We don't know the use case how to make thing work instead of make thing
> >> break, that is the problem. We have no way to test if old-behaviour
> >> preserving code really works as expected. If there's no such user in
> >> practice, it ends up with dead code no one dares to remove.
> >>>> then
> >>>> we will have good understanding how to code up the fix and emulate
> >>>> properly for compatibility, while not affecting other good implementations.
> >>> The issue is, even if we can't find a userspace now. It doesn't mean
> >>> we can't have one in the future. Then it might be too late or too
> >>> tricky to fix them. We had a lot of lessons in the past.
> >> I am not sure the same situation "too late to fix" or "too tricky to
> >> fix" applies here. Usually this means there's some well established
> >> pattern for e.g. API, ABI or long standing de-factor behavior that can't
> >> be broken or adjust if trying to fix something up. But here we're
> >> guarded by a flag (IOTLB_PERSIST) and without it the behavior is totally
> >> ruled by implementation.
> >>
> >> Regards,
> >> -Siwei
> >>
> >>> Thanks
> >>>
> >>>> Thanks,
> >>>> -Siwe/i/
> >>>>
> >>>>>>>> For two reasons:
> >>>>>>>>
> >>>>>>>> 1) backend features need acked by userspace this is by design
> >>>>>>>> 2) keep the odd behaviour seems to be more safe as we can't audit
> >>>>>>>> every userspace program
> >>>>>>>>
> >>>>>>> The old behavior (without flag ack) cannot be trusted already, as:
> >>>>> Possibly but the point is to unbreak userspace no matter how weird the
> >>>>> behaviour we've ever had.
> >>>>>
> >>>>>>> * Devices using platform IOMMU (in other words, not implementing
> >>>>>>> neither .set_map nor .dma_map) does not unmap memory at virtio reset.
> >>>>>>> * Devices that implement .set_map or .dma_map (vdpa_sim, mlx5) do
> >>>>>>> reset IOTLB, but in their parent ops (vdpasim_do_reset, prune_iotlb
> >>>>>>> called from mlx5_vdpa_reset). With vdpa_sim patch removing the reset,
> >>>>>>> now all backends work the same as far as I know., which was (and is)
> >>>>>>> the way devices using the platform IOMMU works.
> >>>>>>>
> >>>>>>> The difference in behavior did not matter as QEMU unmaps all the
> >>>>>>> memory unregistering the memory listener at vhost_vdpa_dev_start(...,
> >>>>>>> started = false),
> >>>>>> Exactly. It's not just QEMU, but any (older) userspace manipulates
> >>>>>> mappings through the vhost-vdpa iotlb interface has to unmap all
> >>>>>> mappings to workaround the vdpa parent driver bug.
> >>>>> Just to clarify, from userspace, it's the (odd) behaviour of the current uAPI.
> >>>>>
> >>>>>> If they don't do
> >>>>>> explicit unmap, it would cause state inconsistency between vhost-vdpa
> >>>>>> and parent driver, then old mappings can't be restored, and new mapping
> >>>>>> can be added to iotlb after vDPA reset. There's no point to preserve
> >>>>>> this broken and inconsistent behavior between vhost-vdpa and parent
> >>>>>> driver, as userspace doesn't care at all!
> >>>>> It's a userspace notice change so we can't fix it silently:
> >>>>>
> >>>>> https://lkml.org/lkml/2012/12/23/75
> >>>>>
> >>>>> Another example which is related to vhost-vDPA:
> >>>>>
> >>>>> https://lore.kernel.org/netdev/20230927140544.205088-1-eric.auger@redhat.com/T/
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>>>> but the backend acknowledging this feature flag
> >>>>>>> allows QEMU to make sure it is safe to skip this unmap & map in the
> >>>>>>> case of vhost stop & start cycle.
> >>>>>>>
> >>>>>>> In that sense, this feature flag is actually a signal for userspace to
> >>>>>>> know that the bug has been solved.
> >>>>>> Right, I couldn't say it better than you do, thanks! The feature flag is
> >>>>>> more of an unusual means to indicating kernel bug having been fixed,
> >>>>>> rather than introduce a new feature or new kernel behavior ending up in
> >>>>>> change of userspace's expectation.
> >>>>>>
> >>>>>>> Not offering it indicates that
> >>>>>>> userspace cannot trust the kernel will retain the maps.
> >>>>>>>
> >>>>>>> Si-Wei or Dragos, please correct me if I've missed something. Feel
> >>>>>>> free to use the text in case you find more clear in doc or patch log.
> >>>>>> Sure, will do, thank you! Will post v2 adding these to the log.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> -Siwei
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>>> Thanks
> >>>>>>>>
> >>>>>>>>> I think
> >>>>>>>>> the purpose of the IOTLB_PERSIST flag is just to give userspace 100%
> >>>>>>>>> certainty of persistent iotlb mapping not getting lost across vdpa reset.
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> -Siwei
> >>>>>>>>>
> >>>>>>>>> [1]
> >>>>>>>>> https://lore.kernel.org/virtualization/9f118fc9-4f6f-dd67-a291-be78152e47fd@oracle.com/
> >>>>>>>>> [2]
> >>>>>>>>> https://lore.kernel.org/virtualization/3364adfd-1eb7-8bce-41f9-bfe5473f1f2e@oracle.com/
> >>>>>>>>>>       Otherwise
> >>>>>>>>>> we may break old userspace.
> >>>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>>
> >>>>>>>>>>> +       vhost_vdpa_reset_map(v, asid);
> >>>>>>>>>>>              kfree(as);
> >>>>>>>>>>>
> >>>>>>>>>>>              return 0;
> >>>>>>>>>>> --
> >>>>>>>>>>> 1.8.3.1
> >>>>>>>>>>>
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
  2023-10-18  8:49                     ` Si-Wei Liu
@ 2023-10-19  2:53                       ` Jason Wang
  2023-10-19  6:46                         ` Si-Wei Liu
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Wang @ 2023-10-19  2:53 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Eugenio Perez Martin, mst, xuanzhuo, dtatulea, virtualization,
	linux-kernel

On Wed, Oct 18, 2023 at 4:49 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 10/18/2023 12:00 AM, Jason Wang wrote:
> >> Unfortunately, it's a must to stick to ABI. I agree it's a mess but we
> >> don't have a better choice. Or we can fail the probe if userspace
> >> doesn't ack this feature.
> > Antoher idea we can just do the following in vhost_vdpa reset?
> >
> > config->reset()
> > if (IOTLB_PERSIST is not set) {
> >      config->reset_map()
> > }
> >
> > Then we don't have the burden to maintain them in the parent?
> >
> > Thanks
> Please see my earlier response in the other email, thanks.
>
> ----------------%<----------------%<----------------
>
> First, the ideal fix would be to leave this reset_vendor_mappings()
> emulation code on the individual driver itself, which already has the
> broken behavior.

So the point is, not about whether the existing behavior is "broken"
or not. It's about whether we could stick to the old behaviour without
too much cost. And I believe we could.

And just to clarify here, reset_vendor_mappings() = config->reset_map()

> But today there's no backend feature negotiation
> between vhost-vdpa and the parent driver. Do we want to send down the
> acked_backend_features to parent drivers?

There's no need to do that with the above code, or anything I missed here?

config->reset()
if (IOTLB_PERSIST is not set) {
      config->reset_map()
}

>
> Second, IOTLB_PERSIST is needed but not sufficient. Due to lack of
> backend feature negotiation in parent driver, if vhost-vdpa has to
> provide the old-behaviour emulation for compatibility on driver's
> behalf, it needs to be done per-driver basis. There could be good
> on-chip or vendor IOMMU implementation which doesn't clear the IOTLB in
> .reset, and vendor specific IOMMU doesn't have to provide .reset_map,

Then we just don't offer IOTLB_PRESIST, isn't this by design?

> we
> should allow these good driver implementations rather than
> unconditionally stick to some specific problematic behavior for every
> other good driver.

Then you can force reset_map() with set_map() that is what I suggest
in another thread, no?

> Then we need a set of device flags (backend_features
> bit again?) to indicate the specific driver needs upper layer's help on
> old-behaviour emulation.
>
> Last but not least, I'm not sure how to properly emulate
> reset_vendor_mappings() from vhost-vdpa layer. If a vendor driver has no
> .reset_map op implemented, or if .reset_map has a slightly different
> implementation than what it used to reset the iotlb in the .reset op,

See above, for reset_vendor_mappings() I meant config->reset_map() exactly.

Thanks

> then this either becomes effectively dead code if no one ends up using,
> or the vhost-vdpa emulation is helpless and limited in scope, unable to
> cover all the cases.
>
> ----------------%<----------------%<----------------
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
  2023-10-19  2:53                       ` Jason Wang
@ 2023-10-19  6:46                         ` Si-Wei Liu
  2023-10-19  8:27                           ` Jason Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Si-Wei Liu @ 2023-10-19  6:46 UTC (permalink / raw)
  To: Jason Wang
  Cc: Eugenio Perez Martin, mst, xuanzhuo, dtatulea, virtualization,
	linux-kernel



On 10/18/2023 7:53 PM, Jason Wang wrote:
> On Wed, Oct 18, 2023 at 4:49 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 10/18/2023 12:00 AM, Jason Wang wrote:
>>>> Unfortunately, it's a must to stick to ABI. I agree it's a mess but we
>>>> don't have a better choice. Or we can fail the probe if userspace
>>>> doesn't ack this feature.
>>> Antoher idea we can just do the following in vhost_vdpa reset?
>>>
>>> config->reset()
>>> if (IOTLB_PERSIST is not set) {
>>>       config->reset_map()
>>> }
>>>
>>> Then we don't have the burden to maintain them in the parent?
>>>
>>> Thanks
>> Please see my earlier response in the other email, thanks.
>>
>> ----------------%<----------------%<----------------
>>
>> First, the ideal fix would be to leave this reset_vendor_mappings()
>> emulation code on the individual driver itself, which already has the
>> broken behavior.
> So the point is, not about whether the existing behavior is "broken"
> or not.
Hold on, I thought earlier we all agreed upon that the existing behavior 
of vendor driver self-clearing maps during .reset violates the vhost 
iotlb abstraction and also breaks the .set_map/.dma_map API. This is 
100% buggy driver implementation itself that we should discourage or 
eliminate as much as possible (that's part of the goal for this series), 
but here you seem to go existentialism and suggests the very opposite 
that every .set_map/.dma_map driver implementation, regardless being the 
current or the new/upcoming, should unconditionally try to emulate the 
broken reset behavior for the sake of not breaking older userspace. Set 
aside the criteria and definition for how userspace can be broken, can 
we step back to the original question why we think it's broken, and what 
we can do to promote good driver implementation instead of discuss the 
implementation details? Reading the below response I found my major 
points are not heard even if written for quite a few times. It's not 
that I don't understand the importance of not breaking old userspace, I 
appreciate your questions and extra patience, however I do feel the 
"broken" part is very relevant to our discussion here.

If it's broken (in the sense of vhost IOTLB API) that you agree, I think 
we should at least allow good driver implementations; and when you think 
about the possibility of those valid good driver cases 
(.set_map/.dma_map implementations that do not clear maps in .reset),  
you might be able to see why it's coded the way as it is now.

>   It's about whether we could stick to the old behaviour without
> too much cost. And I believe we could.
>
> And just to clarify here, reset_vendor_mappings() = config->reset_map()
>
>> But today there's no backend feature negotiation
>> between vhost-vdpa and the parent driver. Do we want to send down the
>> acked_backend_features to parent drivers?
> There's no need to do that with the above code, or anything I missed here?
>
> config->reset()
> if (IOTLB_PERSIST is not set) {
>        config->reset_map()
> }
Implementation issue: this implies reset_map() has to be there for every 
.set_map implementations, but vendor driver implementation for custom 
IOMMU could well implement DMA ops by itself instead of .reset_map. This 
won't work for every set_map driver (think about the vduse case).

But this is not the the point I was making. I think if you agree this is 
purely buggy driver implementation of its own, we should try to isolate 
this buggy behavior to individual driver rather than overload vhost-vdpa 
or vdpa core's role to help implement the emulation of broken driver 
behavior. I don't get why .reset is special here, the abuse of .reset to 
manipulate mapping could also happen in other IOMMU unrelated driver 
entries like in .suspend, or in queue_reset. If someday userspace is 
found coded around similar buggy driver implementation in other driver 
ops, do we want to follow and duplicate the same emulation in vdpa core 
as the precedent is already set here around .reset?
The buggy driver can fail in a lot of other ways indefinitely during 
reset, if there's a buggy driver that's already broken the way as how it 
is and happens to survive with all userspace apps, we just don't care 
and let it be. There's no way we can enumerate all those buggy behaviors 
in .reset_map itself, it's overloading that driver API too much.
>> Second, IOTLB_PERSIST is needed but not sufficient. Due to lack of
>> backend feature negotiation in parent driver, if vhost-vdpa has to
>> provide the old-behaviour emulation for compatibility on driver's
>> behalf, it needs to be done per-driver basis. There could be good
>> on-chip or vendor IOMMU implementation which doesn't clear the IOTLB in
>> .reset, and vendor specific IOMMU doesn't have to provide .reset_map,
> Then we just don't offer IOTLB_PRESIST, isn't this by design?
Think about the vduse case, it can work with DMA ops directly so doesn't 
have to implement .reset_map, unless for some specific good reason. 
Because it's a conforming and valid/good driver implementation, we may 
still allow it to advertise IOTLB_PERSIST to userspace. Which belongs to 
the 3rd bullet below:

https://lore.kernel.org/virtualization/1696928580-7520-4-git-send-email-si-wei.liu@oracle.com/

There are 3 cases that backend may claim this feature bit on:

- parent device that has to work with platform IOMMU
- parent device with on-chip IOMMU that has the expected
   .reset_map support in driver
- parent device with vendor specific IOMMU implementation
   that explicitly declares the specific backend feature

>
>> we
>> should allow these good driver implementations rather than
>> unconditionally stick to some specific problematic behavior for every
>> other good driver.
> Then you can force reset_map() with set_map() that is what I suggest
> in another thread, no?
This is exactly what I was afraid of that broken behavior emulation may 
become a dangerous slippery slope - in principle we should encourage 
good driver implementation, as they can work totally fine with older 
userspace. Why do they have to bother emulating broken behavior just 
because some other driver's misbehaving? And what's the boundary for 
this hack, do drivers backed by platform IOMMU even have to emulate (if 
not why not, and is there substantial difference in between)? After 
getting through all of this, do you still believe everything is just as 
easy and simple as what thought to be?

Btw, I thought I was expecting but still haven't got the clear answers 
to what was the goal to do all this, we spent a lot of time trying to 
unbreak userspace, but looks to me as if we were trying every possible 
way to break userspace or try to approximate to the same brokenness 
mlx5_vdpa may have caused to the userspace. What we will get eventually 
from these lengthy discussions? On the other hand, if you think it from 
vhost-vdpa user perspective, you'll clearly see there's just a couple of 
ways to unbreak userspace from the internal broken map which is out of 
sync with vhost-vdpa iotlb after device reset. If this brokenness was 
something universally done from the vhost-vdpa layer itself, I'd feel 
it's more of a shared problem, but this is not the case I see it here. 
While the long standing mlx5_vdpa/vdpa_sim issue is 100% misuse of 
.reset op in a wrong way per IOMMU API definition. Why leaving this 
discrepancy to the individual driver is not even an option, I'm still 
not sure?


Thanks,
-Siwei

>
>> Then we need a set of device flags (backend_features
>> bit again?) to indicate the specific driver needs upper layer's help on
>> old-behaviour emulation.
>>
>> Last but not least, I'm not sure how to properly emulate
>> reset_vendor_mappings() from vhost-vdpa layer. If a vendor driver has no
>> .reset_map op implemented, or if .reset_map has a slightly different
>> implementation than what it used to reset the iotlb in the .reset op,
> See above, for reset_vendor_mappings() I meant config->reset_map() exactly.
>
> Thanks
>
>> then this either becomes effectively dead code if no one ends up using,
>> or the vhost-vdpa emulation is helpless and limited in scope, unable to
>> cover all the cases.
>>
>> ----------------%<----------------%<----------------
>>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
  2023-10-19  6:46                         ` Si-Wei Liu
@ 2023-10-19  8:27                           ` Jason Wang
  2023-10-19 14:39                             ` Eugenio Perez Martin
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Wang @ 2023-10-19  8:27 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Eugenio Perez Martin, mst, xuanzhuo, dtatulea, virtualization,
	linux-kernel

On Thu, Oct 19, 2023 at 2:47 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 10/18/2023 7:53 PM, Jason Wang wrote:
> > On Wed, Oct 18, 2023 at 4:49 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >> On 10/18/2023 12:00 AM, Jason Wang wrote:
> >>>> Unfortunately, it's a must to stick to ABI. I agree it's a mess but we
> >>>> don't have a better choice. Or we can fail the probe if userspace
> >>>> doesn't ack this feature.
> >>> Antoher idea we can just do the following in vhost_vdpa reset?
> >>>
> >>> config->reset()
> >>> if (IOTLB_PERSIST is not set) {
> >>>       config->reset_map()
> >>> }
> >>>
> >>> Then we don't have the burden to maintain them in the parent?
> >>>
> >>> Thanks
> >> Please see my earlier response in the other email, thanks.
> >>
> >> ----------------%<----------------%<----------------
> >>
> >> First, the ideal fix would be to leave this reset_vendor_mappings()
> >> emulation code on the individual driver itself, which already has the
> >> broken behavior.
> > So the point is, not about whether the existing behavior is "broken"
> > or not.
> Hold on, I thought earlier we all agreed upon that the existing behavior
> of vendor driver self-clearing maps during .reset violates the vhost
> iotlb abstraction and also breaks the .set_map/.dma_map API. This is
> 100% buggy driver implementation itself that we should discourage or
> eliminate as much as possible (that's part of the goal for this series),

I'm not saying it's not an issue, what I'm saying is, if the fix
breaks another userspace, it's a new bug in the kernel. See what Linus
said in [1]

"If a change results in user programs breaking, it's a bug in the kernel."

> but here you seem to go existentialism and suggests the very opposite
> that every .set_map/.dma_map driver implementation, regardless being the
> current or the new/upcoming, should unconditionally try to emulate the
> broken reset behavior for the sake of not breaking older userspace.

Such "emulation" is not done at the parent level. New parents just
need to implement reset_map() or not. everything could be done inside
vhost-vDPA as pseudo code that is shown above.

> Set
> aside the criteria and definition for how userspace can be broken, can
> we step back to the original question why we think it's broken, and what
> we can do to promote good driver implementation instead of discuss the
> implementation details?

I'm not sure I get the point of this question. I'm not saying we don't
need to fix, what I am saying is that such a fix must be done in a
negotiable way. And it's better if parents won't get any burden. It
can just decide to implement reset_map() or not.

> Reading the below response I found my major
> points are not heard even if written for quite a few times.

I try my best to not ignore any important things, but I can't promise
I will not miss any. I hope the above clarifies my points.

> It's not
> that I don't understand the importance of not breaking old userspace, I
> appreciate your questions and extra patience, however I do feel the
> "broken" part is very relevant to our discussion here.
> If it's broken (in the sense of vhost IOTLB API) that you agree, I think
> we should at least allow good driver implementations; and when you think
> about the possibility of those valid good driver cases
> (.set_map/.dma_map implementations that do not clear maps in .reset),
> you might be able to see why it's coded the way as it is now.
>
> >   It's about whether we could stick to the old behaviour without
> > too much cost. And I believe we could.
> >
> > And just to clarify here, reset_vendor_mappings() = config->reset_map()
> >
> >> But today there's no backend feature negotiation
> >> between vhost-vdpa and the parent driver. Do we want to send down the
> >> acked_backend_features to parent drivers?
> > There's no need to do that with the above code, or anything I missed here?
> >
> > config->reset()
> > if (IOTLB_PERSIST is not set) {
> >        config->reset_map()
> > }
> Implementation issue: this implies reset_map() has to be there for every
> .set_map implementations, but vendor driver implementation for custom
> IOMMU could well implement DMA ops by itself instead of .reset_map. This
> won't work for every set_map driver (think about the vduse case).

Well let me do it once again, reset_map() is not mandated:

config->reset()
if (IOTLB_PERSIST is not set) {
    if (config->reset_map)
          config->reset_map()
}

Did you see any issue with VDUSE in this case?

>
> But this is not the the point I was making. I think if you agree this is
> purely buggy driver implementation of its own, we should try to isolate
> this buggy behavior to individual driver rather than overload vhost-vdpa
> or vdpa core's role to help implement the emulation of broken driver
> behavior.

As I pointed out, if it is not noticeable in the userspace, that's
fine but it's not.

> I don't get why .reset is special here, the abuse of .reset to
> manipulate mapping could also happen in other IOMMU unrelated driver
> entries like in .suspend, or in queue_reset.

Who can abuse reset here? It is totally under the control of
vhost-vDPA and it's not visible to uAPI. And we can fully control the
behaviour of vhost-vDPA.

> If someday userspace is
> found coded around similar buggy driver implementation in other driver
> ops, do we want to follow and duplicate the same emulation in vdpa core
> as the precedent is already set here around .reset?

I think so, have you seen the links I give you? If you want to go
through the one from Linus thread[1], you can see the one that unbreak
virtio-IOMMU[2]:

1) Someday, we spot invalidate with size 0 is a bug
2) We fix this bug by not allowing this
3) But virtio-IOMMU userspace find that size 0 actually clean all the
IOTLB so it depends on the behaviour
4) So the virtio-IOMMU userspace find it can't work after 2)
5) Then we recover the behaviour before 2) via [2]

Another example is the IOTLB_MSG_V2, V1 suffers from in-stable ABI in
32bit archs, most of the userspace survives since it never runs on
32bit archs. The fix is to introduce a V2 but we will stick to V1 by
default if V2 is not acknowledged by the userspace.

I think the above 2 examples are sufficient for us to understand the
case. If not, I can help to clarify more since I'm involved in those 2
fixes.

> The buggy driver can fail in a lot of other ways indefinitely during
> reset, if there's a buggy driver that's already broken the way as how it
> is and happens to survive with all userspace apps, we just don't care
> and let it be.

Without IOTLB_PRESIST it doesn't break. With IOTLB_PERSIST and if the
reset_map() is done unconditionally, it can break. That's my point.

> There's no way we can enumerate all those buggy behaviors
> in .reset_map itself, it's overloading that driver API too much.

If it is not noticeable by userspace, we can do any fix at will. But
it is not, we don't have another choice. Especially considering the
cost is rather low.

> >> Second, IOTLB_PERSIST is needed but not sufficient. Due to lack of
> >> backend feature negotiation in parent driver, if vhost-vdpa has to
> >> provide the old-behaviour emulation for compatibility on driver's
> >> behalf, it needs to be done per-driver basis. There could be good
> >> on-chip or vendor IOMMU implementation which doesn't clear the IOTLB in
> >> .reset, and vendor specific IOMMU doesn't have to provide .reset_map,
> > Then we just don't offer IOTLB_PRESIST, isn't this by design?
> Think about the vduse case, it can work with DMA ops directly so doesn't
> have to implement .reset_map, unless for some specific good reason.
> Because it's a conforming and valid/good driver implementation, we may
> still allow it to advertise IOTLB_PERSIST to userspace.

I would like to know why this can't work in this case:

config->reset()
if (IOTLB_PERSIST is not set) {
    if (config->reset_map)
          config->reset_map()
}

> Which belongs to
> the 3rd bullet below:
>
> https://lore.kernel.org/virtualization/1696928580-7520-4-git-send-email-si-wei.liu@oracle.com/
>
> There are 3 cases that backend may claim this feature bit on:
>
> - parent device that has to work with platform IOMMU
> - parent device with on-chip IOMMU that has the expected
>    .reset_map support in driver
> - parent device with vendor specific IOMMU implementation
>    that explicitly declares the specific backend feature
>
> >
> >> we
> >> should allow these good driver implementations rather than
> >> unconditionally stick to some specific problematic behavior for every
> >> other good driver.
> > Then you can force reset_map() with set_map() that is what I suggest
> > in another thread, no?
> This is exactly what I was afraid of that broken behavior emulation may
> become a dangerous slippery slope - in principle we should encourage
> good driver implementation, as they can work totally fine with older
> userspace. Why do they have to bother emulating broken behavior just
> because some other driver's misbehaving?

Please read the link [1], Linus has explained it.

> And what's the boundary for
> this hack, do drivers backed by platform IOMMU even have to emulate (if
> not why not, and is there substantial difference in between)?

The boundary is whether the behaviour change could be noticed but
userspace. And I've shown you it's not a burden with the pseudo codes.
If not, please explain why.

> After
> getting through all of this, do you still believe everything is just as
> easy and simple as what thought to be?

The truth is that bugs exist everywhere. We can't promise there's no
bug when developing an uAPI or subsystem. For kernel code, the bug
that touches uAPI might be fixed in a way that doesn't break existing
userspace. If you look at how downstream to maintain kABI, you will be
supersized furtherly.

>
> Btw, I thought I was expecting but still haven't got the clear answers
> to what was the goal to do all this, we spent a lot of time trying to
> unbreak userspace,

The code is pretty simple. But yes, the time spent on justifying it
might take some time. That's the community. People need time to
understand each other's points.

> but looks to me as if we were trying every possible
> way to break userspace

How could my suggestions break a userspace?

> or try to approximate to the same brokenness
> mlx5_vdpa may have caused to the userspace. What we will get eventually
> from these lengthy discussions?

Siwei, I'd really suggest you read the link I gave you. You may get
the answer. What's more, It doesn't cost too much then we know for
sure there would not be any issue, why not choose the hard way?

> On the other hand, if you think it from
> vhost-vdpa user perspective, you'll clearly see there's just a couple of
> ways to unbreak userspace from the internal broken map which is out of
> sync with vhost-vdpa iotlb after device reset.

Patches are more than welcomed.

> If this brokenness was
> something universally done from the vhost-vdpa layer itself, I'd feel
> it's more of a shared problem, but this is not the case I see it here.
> While the long standing mlx5_vdpa/vdpa_sim issue is 100% misuse of
> .reset op in a wrong way per IOMMU API definition. Why leaving this
> discrepancy to the individual driver is not even an option, I'm still
> not sure?

Sorry? I start with a switch in the driver, and then I try to avoid
that. And it seems you don't want a burden on the driver as well.
Where did you see I say we can't do that in the driver? What I
disagree with is to use a module parameter.

Even if I fail, it doesn't mean we can't do that in the driver code.
If you read the link[1] you can see the offending commit is a change
in uvcvideo driver.

Thanks

>
>
> Thanks,
> -Siwei
>
> >
> >> Then we need a set of device flags (backend_features
> >> bit again?) to indicate the specific driver needs upper layer's help on
> >> old-behaviour emulation.
> >>
> >> Last but not least, I'm not sure how to properly emulate
> >> reset_vendor_mappings() from vhost-vdpa layer. If a vendor driver has no
> >> .reset_map op implemented, or if .reset_map has a slightly different
> >> implementation than what it used to reset the iotlb in the .reset op,
> > See above, for reset_vendor_mappings() I meant config->reset_map() exactly.
> >
> > Thanks
> >
> >> then this either becomes effectively dead code if no one ends up using,
> >> or the vhost-vdpa emulation is helpless and limited in scope, unable to
> >> cover all the cases.
> >>
> >> ----------------%<----------------%<----------------
> >>
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
  2023-10-19  8:27                           ` Jason Wang
@ 2023-10-19 14:39                             ` Eugenio Perez Martin
  2023-10-19 22:28                               ` Si-Wei Liu
  0 siblings, 1 reply; 37+ messages in thread
From: Eugenio Perez Martin @ 2023-10-19 14:39 UTC (permalink / raw)
  To: Jason Wang
  Cc: Si-Wei Liu, mst, xuanzhuo, dtatulea, virtualization, linux-kernel

On Thu, Oct 19, 2023 at 10:27 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Thu, Oct 19, 2023 at 2:47 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >
> >
> >
> > On 10/18/2023 7:53 PM, Jason Wang wrote:
> > > On Wed, Oct 18, 2023 at 4:49 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> > >>
> > >>
> > >> On 10/18/2023 12:00 AM, Jason Wang wrote:
> > >>>> Unfortunately, it's a must to stick to ABI. I agree it's a mess but we
> > >>>> don't have a better choice. Or we can fail the probe if userspace
> > >>>> doesn't ack this feature.
> > >>> Antoher idea we can just do the following in vhost_vdpa reset?
> > >>>
> > >>> config->reset()
> > >>> if (IOTLB_PERSIST is not set) {
> > >>>       config->reset_map()
> > >>> }
> > >>>
> > >>> Then we don't have the burden to maintain them in the parent?
> > >>>
> > >>> Thanks
> > >> Please see my earlier response in the other email, thanks.
> > >>
> > >> ----------------%<----------------%<----------------
> > >>
> > >> First, the ideal fix would be to leave this reset_vendor_mappings()
> > >> emulation code on the individual driver itself, which already has the
> > >> broken behavior.
> > > So the point is, not about whether the existing behavior is "broken"
> > > or not.
> > Hold on, I thought earlier we all agreed upon that the existing behavior
> > of vendor driver self-clearing maps during .reset violates the vhost
> > iotlb abstraction and also breaks the .set_map/.dma_map API. This is
> > 100% buggy driver implementation itself that we should discourage or
> > eliminate as much as possible (that's part of the goal for this series),
>
> I'm not saying it's not an issue, what I'm saying is, if the fix
> breaks another userspace, it's a new bug in the kernel. See what Linus
> said in [1]
>
> "If a change results in user programs breaking, it's a bug in the kernel."
>
> > but here you seem to go existentialism and suggests the very opposite
> > that every .set_map/.dma_map driver implementation, regardless being the
> > current or the new/upcoming, should unconditionally try to emulate the
> > broken reset behavior for the sake of not breaking older userspace.
>
> Such "emulation" is not done at the parent level. New parents just
> need to implement reset_map() or not. everything could be done inside
> vhost-vDPA as pseudo code that is shown above.
>
> > Set
> > aside the criteria and definition for how userspace can be broken, can
> > we step back to the original question why we think it's broken, and what
> > we can do to promote good driver implementation instead of discuss the
> > implementation details?
>
> I'm not sure I get the point of this question. I'm not saying we don't
> need to fix, what I am saying is that such a fix must be done in a
> negotiable way. And it's better if parents won't get any burden. It
> can just decide to implement reset_map() or not.
>
> > Reading the below response I found my major
> > points are not heard even if written for quite a few times.
>
> I try my best to not ignore any important things, but I can't promise
> I will not miss any. I hope the above clarifies my points.
>
> > It's not
> > that I don't understand the importance of not breaking old userspace, I
> > appreciate your questions and extra patience, however I do feel the
> > "broken" part is very relevant to our discussion here.
> > If it's broken (in the sense of vhost IOTLB API) that you agree, I think
> > we should at least allow good driver implementations; and when you think
> > about the possibility of those valid good driver cases
> > (.set_map/.dma_map implementations that do not clear maps in .reset),
> > you might be able to see why it's coded the way as it is now.
> >
> > >   It's about whether we could stick to the old behaviour without
> > > too much cost. And I believe we could.
> > >
> > > And just to clarify here, reset_vendor_mappings() = config->reset_map()
> > >
> > >> But today there's no backend feature negotiation
> > >> between vhost-vdpa and the parent driver. Do we want to send down the
> > >> acked_backend_features to parent drivers?
> > > There's no need to do that with the above code, or anything I missed here?
> > >
> > > config->reset()
> > > if (IOTLB_PERSIST is not set) {
> > >        config->reset_map()
> > > }
> > Implementation issue: this implies reset_map() has to be there for every
> > .set_map implementations, but vendor driver implementation for custom
> > IOMMU could well implement DMA ops by itself instead of .reset_map. This
> > won't work for every set_map driver (think about the vduse case).
>
> Well let me do it once again, reset_map() is not mandated:
>
> config->reset()
> if (IOTLB_PERSIST is not set) {
>     if (config->reset_map)
>           config->reset_map()

To avoid new parent drivers to have this behavior if they need to
implement reset_map,

What if we add a new callback like "config->buggy_virtio_reset_map",
different from regular reset_map callback at cleanup? Only mlx5 and
vdpa_sim need to implement it, with a big warning, and new parent
drivers can trust they'll never have the old bad behavior.

> }
>
> Did you see any issue with VDUSE in this case?
>
> >
> > But this is not the the point I was making. I think if you agree this is
> > purely buggy driver implementation of its own, we should try to isolate
> > this buggy behavior to individual driver rather than overload vhost-vdpa
> > or vdpa core's role to help implement the emulation of broken driver
> > behavior.
>
> As I pointed out, if it is not noticeable in the userspace, that's
> fine but it's not.
>
> > I don't get why .reset is special here, the abuse of .reset to
> > manipulate mapping could also happen in other IOMMU unrelated driver
> > entries like in .suspend, or in queue_reset.
>
> Who can abuse reset here? It is totally under the control of
> vhost-vDPA and it's not visible to uAPI. And we can fully control the
> behaviour of vhost-vDPA.
>
> > If someday userspace is
> > found coded around similar buggy driver implementation in other driver
> > ops, do we want to follow and duplicate the same emulation in vdpa core
> > as the precedent is already set here around .reset?
>
> I think so, have you seen the links I give you? If you want to go
> through the one from Linus thread[1], you can see the one that unbreak
> virtio-IOMMU[2]:
>
> 1) Someday, we spot invalidate with size 0 is a bug
> 2) We fix this bug by not allowing this
> 3) But virtio-IOMMU userspace find that size 0 actually clean all the
> IOTLB so it depends on the behaviour
> 4) So the virtio-IOMMU userspace find it can't work after 2)
> 5) Then we recover the behaviour before 2) via [2]
>
> Another example is the IOTLB_MSG_V2, V1 suffers from in-stable ABI in
> 32bit archs, most of the userspace survives since it never runs on
> 32bit archs. The fix is to introduce a V2 but we will stick to V1 by
> default if V2 is not acknowledged by the userspace.
>
> I think the above 2 examples are sufficient for us to understand the
> case. If not, I can help to clarify more since I'm involved in those 2
> fixes.
>
> > The buggy driver can fail in a lot of other ways indefinitely during
> > reset, if there's a buggy driver that's already broken the way as how it
> > is and happens to survive with all userspace apps, we just don't care
> > and let it be.
>
> Without IOTLB_PRESIST it doesn't break. With IOTLB_PERSIST and if the
> reset_map() is done unconditionally, it can break. That's my point.
>
> > There's no way we can enumerate all those buggy behaviors
> > in .reset_map itself, it's overloading that driver API too much.
>
> If it is not noticeable by userspace, we can do any fix at will. But
> it is not, we don't have another choice. Especially considering the
> cost is rather low.
>
> > >> Second, IOTLB_PERSIST is needed but not sufficient. Due to lack of
> > >> backend feature negotiation in parent driver, if vhost-vdpa has to
> > >> provide the old-behaviour emulation for compatibility on driver's
> > >> behalf, it needs to be done per-driver basis. There could be good
> > >> on-chip or vendor IOMMU implementation which doesn't clear the IOTLB in
> > >> .reset, and vendor specific IOMMU doesn't have to provide .reset_map,
> > > Then we just don't offer IOTLB_PRESIST, isn't this by design?
> > Think about the vduse case, it can work with DMA ops directly so doesn't
> > have to implement .reset_map, unless for some specific good reason.
> > Because it's a conforming and valid/good driver implementation, we may
> > still allow it to advertise IOTLB_PERSIST to userspace.
>
> I would like to know why this can't work in this case:
>
> config->reset()
> if (IOTLB_PERSIST is not set) {
>     if (config->reset_map)
>           config->reset_map()
> }
>
> > Which belongs to
> > the 3rd bullet below:
> >
> > https://lore.kernel.org/virtualization/1696928580-7520-4-git-send-email-si-wei.liu@oracle.com/
> >
> > There are 3 cases that backend may claim this feature bit on:
> >
> > - parent device that has to work with platform IOMMU
> > - parent device with on-chip IOMMU that has the expected
> >    .reset_map support in driver
> > - parent device with vendor specific IOMMU implementation
> >    that explicitly declares the specific backend feature
> >
> > >
> > >> we
> > >> should allow these good driver implementations rather than
> > >> unconditionally stick to some specific problematic behavior for every
> > >> other good driver.
> > > Then you can force reset_map() with set_map() that is what I suggest
> > > in another thread, no?
> > This is exactly what I was afraid of that broken behavior emulation may
> > become a dangerous slippery slope - in principle we should encourage
> > good driver implementation, as they can work totally fine with older
> > userspace. Why do they have to bother emulating broken behavior just
> > because some other driver's misbehaving?
>
> Please read the link [1], Linus has explained it.
>
> > And what's the boundary for
> > this hack, do drivers backed by platform IOMMU even have to emulate (if
> > not why not, and is there substantial difference in between)?
>
> The boundary is whether the behaviour change could be noticed but
> userspace. And I've shown you it's not a burden with the pseudo codes.
> If not, please explain why.
>
> > After
> > getting through all of this, do you still believe everything is just as
> > easy and simple as what thought to be?
>
> The truth is that bugs exist everywhere. We can't promise there's no
> bug when developing an uAPI or subsystem. For kernel code, the bug
> that touches uAPI might be fixed in a way that doesn't break existing
> userspace. If you look at how downstream to maintain kABI, you will be
> supersized furtherly.
>
> >
> > Btw, I thought I was expecting but still haven't got the clear answers
> > to what was the goal to do all this, we spent a lot of time trying to
> > unbreak userspace,
>
> The code is pretty simple. But yes, the time spent on justifying it
> might take some time. That's the community. People need time to
> understand each other's points.
>
> > but looks to me as if we were trying every possible
> > way to break userspace
>
> How could my suggestions break a userspace?
>
> > or try to approximate to the same brokenness
> > mlx5_vdpa may have caused to the userspace. What we will get eventually
> > from these lengthy discussions?
>
> Siwei, I'd really suggest you read the link I gave you. You may get
> the answer. What's more, It doesn't cost too much then we know for
> sure there would not be any issue, why not choose the hard way?
>
> > On the other hand, if you think it from
> > vhost-vdpa user perspective, you'll clearly see there's just a couple of
> > ways to unbreak userspace from the internal broken map which is out of
> > sync with vhost-vdpa iotlb after device reset.
>
> Patches are more than welcomed.
>
> > If this brokenness was
> > something universally done from the vhost-vdpa layer itself, I'd feel
> > it's more of a shared problem, but this is not the case I see it here.
> > While the long standing mlx5_vdpa/vdpa_sim issue is 100% misuse of
> > .reset op in a wrong way per IOMMU API definition. Why leaving this
> > discrepancy to the individual driver is not even an option, I'm still
> > not sure?
>
> Sorry? I start with a switch in the driver, and then I try to avoid
> that. And it seems you don't want a burden on the driver as well.
> Where did you see I say we can't do that in the driver? What I
> disagree with is to use a module parameter.
>
> Even if I fail, it doesn't mean we can't do that in the driver code.
> If you read the link[1] you can see the offending commit is a change
> in uvcvideo driver.
>
> Thanks
>
> >
> >
> > Thanks,
> > -Siwei
> >
> > >
> > >> Then we need a set of device flags (backend_features
> > >> bit again?) to indicate the specific driver needs upper layer's help on
> > >> old-behaviour emulation.
> > >>
> > >> Last but not least, I'm not sure how to properly emulate
> > >> reset_vendor_mappings() from vhost-vdpa layer. If a vendor driver has no
> > >> .reset_map op implemented, or if .reset_map has a slightly different
> > >> implementation than what it used to reset the iotlb in the .reset op,
> > > See above, for reset_vendor_mappings() I meant config->reset_map() exactly.
> > >
> > > Thanks
> > >
> > >> then this either becomes effectively dead code if no one ends up using,
> > >> or the vhost-vdpa emulation is helpless and limited in scope, unable to
> > >> cover all the cases.
> > >>
> > >> ----------------%<----------------%<----------------
> > >>
> >
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
  2023-10-19 14:39                             ` Eugenio Perez Martin
@ 2023-10-19 22:28                               ` Si-Wei Liu
  2023-10-20  4:11                                 ` Jason Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Si-Wei Liu @ 2023-10-19 22:28 UTC (permalink / raw)
  To: Eugenio Perez Martin, Jason Wang
  Cc: mst, xuanzhuo, dtatulea, virtualization, linux-kernel



On 10/19/2023 7:39 AM, Eugenio Perez Martin wrote:
> On Thu, Oct 19, 2023 at 10:27 AM Jason Wang <jasowang@redhat.com> wrote:
>> On Thu, Oct 19, 2023 at 2:47 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>
>>>
>>> On 10/18/2023 7:53 PM, Jason Wang wrote:
>>>> On Wed, Oct 18, 2023 at 4:49 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>
>>>>> On 10/18/2023 12:00 AM, Jason Wang wrote:
>>>>>>> Unfortunately, it's a must to stick to ABI. I agree it's a mess but we
>>>>>>> don't have a better choice. Or we can fail the probe if userspace
>>>>>>> doesn't ack this feature.
>>>>>> Antoher idea we can just do the following in vhost_vdpa reset?
>>>>>>
>>>>>> config->reset()
>>>>>> if (IOTLB_PERSIST is not set) {
>>>>>>        config->reset_map()
>>>>>> }
>>>>>>
>>>>>> Then we don't have the burden to maintain them in the parent?
>>>>>>
>>>>>> Thanks
>>>>> Please see my earlier response in the other email, thanks.
>>>>>
>>>>> ----------------%<----------------%<----------------
>>>>>
>>>>> First, the ideal fix would be to leave this reset_vendor_mappings()
>>>>> emulation code on the individual driver itself, which already has the
>>>>> broken behavior.
>>>> So the point is, not about whether the existing behavior is "broken"
>>>> or not.
>>> Hold on, I thought earlier we all agreed upon that the existing behavior
>>> of vendor driver self-clearing maps during .reset violates the vhost
>>> iotlb abstraction and also breaks the .set_map/.dma_map API. This is
>>> 100% buggy driver implementation itself that we should discourage or
>>> eliminate as much as possible (that's part of the goal for this series),
>> I'm not saying it's not an issue, what I'm saying is, if the fix
>> breaks another userspace, it's a new bug in the kernel. See what Linus
>> said in [1]
>>
>> "If a change results in user programs breaking, it's a bug in the kernel."
>>
>>> but here you seem to go existentialism and suggests the very opposite
>>> that every .set_map/.dma_map driver implementation, regardless being the
>>> current or the new/upcoming, should unconditionally try to emulate the
>>> broken reset behavior for the sake of not breaking older userspace.
>> Such "emulation" is not done at the parent level. New parents just
>> need to implement reset_map() or not. everything could be done inside
>> vhost-vDPA as pseudo code that is shown above.
>>
>>> Set
>>> aside the criteria and definition for how userspace can be broken, can
>>> we step back to the original question why we think it's broken, and what
>>> we can do to promote good driver implementation instead of discuss the
>>> implementation details?
>> I'm not sure I get the point of this question. I'm not saying we don't
>> need to fix, what I am saying is that such a fix must be done in a
>> negotiable way. And it's better if parents won't get any burden. It
>> can just decide to implement reset_map() or not.
>>
>>> Reading the below response I found my major
>>> points are not heard even if written for quite a few times.
>> I try my best to not ignore any important things, but I can't promise
>> I will not miss any. I hope the above clarifies my points.
>>
>>> It's not
>>> that I don't understand the importance of not breaking old userspace, I
>>> appreciate your questions and extra patience, however I do feel the
>>> "broken" part is very relevant to our discussion here.
>>> If it's broken (in the sense of vhost IOTLB API) that you agree, I think
>>> we should at least allow good driver implementations; and when you think
>>> about the possibility of those valid good driver cases
>>> (.set_map/.dma_map implementations that do not clear maps in .reset),
>>> you might be able to see why it's coded the way as it is now.
>>>
>>>>    It's about whether we could stick to the old behaviour without
>>>> too much cost. And I believe we could.
>>>>
>>>> And just to clarify here, reset_vendor_mappings() = config->reset_map()
>>>>
>>>>> But today there's no backend feature negotiation
>>>>> between vhost-vdpa and the parent driver. Do we want to send down the
>>>>> acked_backend_features to parent drivers?
>>>> There's no need to do that with the above code, or anything I missed here?
>>>>
>>>> config->reset()
>>>> if (IOTLB_PERSIST is not set) {
>>>>         config->reset_map()
>>>> }
>>> Implementation issue: this implies reset_map() has to be there for every
>>> .set_map implementations, but vendor driver implementation for custom
>>> IOMMU could well implement DMA ops by itself instead of .reset_map. This
>>> won't work for every set_map driver (think about the vduse case).
>> Well let me do it once again, reset_map() is not mandated:
>>
>> config->reset()
>> if (IOTLB_PERSIST is not set) {
>>      if (config->reset_map)
>>            config->reset_map()
> To avoid new parent drivers
I am afraid it's not just new parent drivers, but any well behaved 
driver today may well break userspace if go with this forced emulation 
code, if they have to implement reset_map for some reason (e.g. restored 
to 1:1 passthrough mapping or other default state in mapping). For new 
userspace and user driver we can guard against it using the 
IOTLB_PERSIST flag, but the above code would get a big chance to break 
setup with good driver and older userspace in practice.

And .reset_map implementation doesn't necessarily need to clear maps. 
For e.g. IOMMU API compliant driver that only needs simple DMA model for 
passthrough, all .reset_map has to do is toggle to 1:1 mapping mode to 
the default/initial state without taking care of maps, as 
vhost_vdpa_unmap(0, -1ULL) earlier should have done the map cleaning job 
already.


>   to have this behavior if they need to
> implement reset_map,
>
> What if we add a new callback like "config->buggy_virtio_reset_map",
> different from regular reset_map callback at cleanup?
Right, separating out the need for old behavior emulation from 
.reset_map is much cleaner, and this is what individual broken driver 
has to maintain without penalizing other good drivers. Good to see what 
I said earlier is heard.

> Only mlx5 and
> vdpa_sim need to implement it, with a big warning, and new parent
> drivers can trust they'll never have the old bad behavior.
Let's see what Jason will say about it and try to converge on this 
first, I think he seemed to imply that this is part of ABI that every 
driver has to make compromise for. I'd better get it ack'ed before 
proceeding to the rest.

Thanks,
-Siwei

>
>> }
>>
>> Did you see any issue with VDUSE in this case?
>>
>>> But this is not the the point I was making. I think if you agree this is
>>> purely buggy driver implementation of its own, we should try to isolate
>>> this buggy behavior to individual driver rather than overload vhost-vdpa
>>> or vdpa core's role to help implement the emulation of broken driver
>>> behavior.
>> As I pointed out, if it is not noticeable in the userspace, that's
>> fine but it's not.
>>
>>> I don't get why .reset is special here, the abuse of .reset to
>>> manipulate mapping could also happen in other IOMMU unrelated driver
>>> entries like in .suspend, or in queue_reset.
>> Who can abuse reset here? It is totally under the control of
>> vhost-vDPA and it's not visible to uAPI. And we can fully control the
>> behaviour of vhost-vDPA.
>>
>>> If someday userspace is
>>> found coded around similar buggy driver implementation in other driver
>>> ops, do we want to follow and duplicate the same emulation in vdpa core
>>> as the precedent is already set here around .reset?
>> I think so, have you seen the links I give you? If you want to go
>> through the one from Linus thread[1], you can see the one that unbreak
>> virtio-IOMMU[2]:
>>
>> 1) Someday, we spot invalidate with size 0 is a bug
>> 2) We fix this bug by not allowing this
>> 3) But virtio-IOMMU userspace find that size 0 actually clean all the
>> IOTLB so it depends on the behaviour
>> 4) So the virtio-IOMMU userspace find it can't work after 2)
>> 5) Then we recover the behaviour before 2) via [2]
>>
>> Another example is the IOTLB_MSG_V2, V1 suffers from in-stable ABI in
>> 32bit archs, most of the userspace survives since it never runs on
>> 32bit archs. The fix is to introduce a V2 but we will stick to V1 by
>> default if V2 is not acknowledged by the userspace.
>>
>> I think the above 2 examples are sufficient for us to understand the
>> case. If not, I can help to clarify more since I'm involved in those 2
>> fixes.
>>
>>> The buggy driver can fail in a lot of other ways indefinitely during
>>> reset, if there's a buggy driver that's already broken the way as how it
>>> is and happens to survive with all userspace apps, we just don't care
>>> and let it be.
>> Without IOTLB_PRESIST it doesn't break. With IOTLB_PERSIST and if the
>> reset_map() is done unconditionally, it can break. That's my point.
>>
>>> There's no way we can enumerate all those buggy behaviors
>>> in .reset_map itself, it's overloading that driver API too much.
>> If it is not noticeable by userspace, we can do any fix at will. But
>> it is not, we don't have another choice. Especially considering the
>> cost is rather low.
>>
>>>>> Second, IOTLB_PERSIST is needed but not sufficient. Due to lack of
>>>>> backend feature negotiation in parent driver, if vhost-vdpa has to
>>>>> provide the old-behaviour emulation for compatibility on driver's
>>>>> behalf, it needs to be done per-driver basis. There could be good
>>>>> on-chip or vendor IOMMU implementation which doesn't clear the IOTLB in
>>>>> .reset, and vendor specific IOMMU doesn't have to provide .reset_map,
>>>> Then we just don't offer IOTLB_PRESIST, isn't this by design?
>>> Think about the vduse case, it can work with DMA ops directly so doesn't
>>> have to implement .reset_map, unless for some specific good reason.
>>> Because it's a conforming and valid/good driver implementation, we may
>>> still allow it to advertise IOTLB_PERSIST to userspace.
>> I would like to know why this can't work in this case:
>>
>> config->reset()
>> if (IOTLB_PERSIST is not set) {
>>      if (config->reset_map)
>>            config->reset_map()
>> }
>>
>>> Which belongs to
>>> the 3rd bullet below:
>>>
>>> https://lore.kernel.org/virtualization/1696928580-7520-4-git-send-email-si-wei.liu@oracle.com/
>>>
>>> There are 3 cases that backend may claim this feature bit on:
>>>
>>> - parent device that has to work with platform IOMMU
>>> - parent device with on-chip IOMMU that has the expected
>>>     .reset_map support in driver
>>> - parent device with vendor specific IOMMU implementation
>>>     that explicitly declares the specific backend feature
>>>
>>>>> we
>>>>> should allow these good driver implementations rather than
>>>>> unconditionally stick to some specific problematic behavior for every
>>>>> other good driver.
>>>> Then you can force reset_map() with set_map() that is what I suggest
>>>> in another thread, no?
>>> This is exactly what I was afraid of that broken behavior emulation may
>>> become a dangerous slippery slope - in principle we should encourage
>>> good driver implementation, as they can work totally fine with older
>>> userspace. Why do they have to bother emulating broken behavior just
>>> because some other driver's misbehaving?
>> Please read the link [1], Linus has explained it.
>>
>>> And what's the boundary for
>>> this hack, do drivers backed by platform IOMMU even have to emulate (if
>>> not why not, and is there substantial difference in between)?
>> The boundary is whether the behaviour change could be noticed but
>> userspace. And I've shown you it's not a burden with the pseudo codes.
>> If not, please explain why.
>>
>>> After
>>> getting through all of this, do you still believe everything is just as
>>> easy and simple as what thought to be?
>> The truth is that bugs exist everywhere. We can't promise there's no
>> bug when developing an uAPI or subsystem. For kernel code, the bug
>> that touches uAPI might be fixed in a way that doesn't break existing
>> userspace. If you look at how downstream to maintain kABI, you will be
>> supersized furtherly.
>>
>>> Btw, I thought I was expecting but still haven't got the clear answers
>>> to what was the goal to do all this, we spent a lot of time trying to
>>> unbreak userspace,
>> The code is pretty simple. But yes, the time spent on justifying it
>> might take some time. That's the community. People need time to
>> understand each other's points.
>>
>>> but looks to me as if we were trying every possible
>>> way to break userspace
>> How could my suggestions break a userspace?
>>
>>> or try to approximate to the same brokenness
>>> mlx5_vdpa may have caused to the userspace. What we will get eventually
>>> from these lengthy discussions?
>> Siwei, I'd really suggest you read the link I gave you. You may get
>> the answer. What's more, It doesn't cost too much then we know for
>> sure there would not be any issue, why not choose the hard way?
>>
>>> On the other hand, if you think it from
>>> vhost-vdpa user perspective, you'll clearly see there's just a couple of
>>> ways to unbreak userspace from the internal broken map which is out of
>>> sync with vhost-vdpa iotlb after device reset.
>> Patches are more than welcomed.
>>
>>> If this brokenness was
>>> something universally done from the vhost-vdpa layer itself, I'd feel
>>> it's more of a shared problem, but this is not the case I see it here.
>>> While the long standing mlx5_vdpa/vdpa_sim issue is 100% misuse of
>>> .reset op in a wrong way per IOMMU API definition. Why leaving this
>>> discrepancy to the individual driver is not even an option, I'm still
>>> not sure?
>> Sorry? I start with a switch in the driver, and then I try to avoid
>> that. And it seems you don't want a burden on the driver as well.
>> Where did you see I say we can't do that in the driver? What I
>> disagree with is to use a module parameter.
>>
>> Even if I fail, it doesn't mean we can't do that in the driver code.
>> If you read the link[1] you can see the offending commit is a change
>> in uvcvideo driver.
>>
>> Thanks
>>
>>>
>>> Thanks,
>>> -Siwei
>>>
>>>>> Then we need a set of device flags (backend_features
>>>>> bit again?) to indicate the specific driver needs upper layer's help on
>>>>> old-behaviour emulation.
>>>>>
>>>>> Last but not least, I'm not sure how to properly emulate
>>>>> reset_vendor_mappings() from vhost-vdpa layer. If a vendor driver has no
>>>>> .reset_map op implemented, or if .reset_map has a slightly different
>>>>> implementation than what it used to reset the iotlb in the .reset op,
>>>> See above, for reset_vendor_mappings() I meant config->reset_map() exactly.
>>>>
>>>> Thanks
>>>>
>>>>> then this either becomes effectively dead code if no one ends up using,
>>>>> or the vhost-vdpa emulation is helpless and limited in scope, unable to
>>>>> cover all the cases.
>>>>>
>>>>> ----------------%<----------------%<----------------
>>>>>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
  2023-10-18  5:27                 ` Jason Wang
  2023-10-18  7:00                   ` Jason Wang
  2023-10-18  8:44                   ` Si-Wei Liu
@ 2023-10-19 22:57                   ` Si-Wei Liu
  2 siblings, 0 replies; 37+ messages in thread
From: Si-Wei Liu @ 2023-10-19 22:57 UTC (permalink / raw)
  To: Jason Wang
  Cc: Eugenio Perez Martin, mst, xuanzhuo, dtatulea, virtualization,
	linux-kernel



On 10/17/2023 10:27 PM, Jason Wang wrote:
>>>    If we do
>>> this without a negotiation, IOTLB will not be clear but the Qemu will
>>> try to re-program the IOTLB after reset. Which will break?
>>>
>>> 1) stick the exact old behaviour with just one line of check
>> It's not just one line of check here, the old behavior emulation has to
>> be done as Eugenio illustrated in the other email.
> For vhost-vDPA it's just
>
> if (IOTLB_PERSIST is acked by userspace)
>      reset_map()
... and this reset_map in vhost_vdpa_cleanup can't be negotiable 
depending on IOTLB_PERSIST. Consider the case where user switches to 
virtio-vdpa after an older userspace using vhost-vdpa finished running. 
Even with buggy_virtio_reset_map in place it's unwarranted the vendor 
IOMMU can get back to the default state, e.g. ending with 1:1 
passthrough mapping. If not doing this unconditionally it will get a big 
chance to break userspace.

-Siwei

>
> For parent, it's somehow similar:
>
> during .reset()
>
> if (IOTLB_PERSIST is not acked by userspace)
>          reset_vendor_mappings()
>
> Anything I missed here?


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
  2023-10-19 22:28                               ` Si-Wei Liu
@ 2023-10-20  4:11                                 ` Jason Wang
  2023-10-20  5:57                                   ` Si-Wei Liu
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Wang @ 2023-10-20  4:11 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Eugenio Perez Martin, mst, xuanzhuo, dtatulea, virtualization,
	linux-kernel

On Fri, Oct 20, 2023 at 6:28 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 10/19/2023 7:39 AM, Eugenio Perez Martin wrote:
> > On Thu, Oct 19, 2023 at 10:27 AM Jason Wang <jasowang@redhat.com> wrote:
> >> On Thu, Oct 19, 2023 at 2:47 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>
> >>>
> >>> On 10/18/2023 7:53 PM, Jason Wang wrote:
> >>>> On Wed, Oct 18, 2023 at 4:49 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>
> >>>>> On 10/18/2023 12:00 AM, Jason Wang wrote:
> >>>>>>> Unfortunately, it's a must to stick to ABI. I agree it's a mess but we
> >>>>>>> don't have a better choice. Or we can fail the probe if userspace
> >>>>>>> doesn't ack this feature.
> >>>>>> Antoher idea we can just do the following in vhost_vdpa reset?
> >>>>>>
> >>>>>> config->reset()
> >>>>>> if (IOTLB_PERSIST is not set) {
> >>>>>>        config->reset_map()
> >>>>>> }
> >>>>>>
> >>>>>> Then we don't have the burden to maintain them in the parent?
> >>>>>>
> >>>>>> Thanks
> >>>>> Please see my earlier response in the other email, thanks.
> >>>>>
> >>>>> ----------------%<----------------%<----------------
> >>>>>
> >>>>> First, the ideal fix would be to leave this reset_vendor_mappings()
> >>>>> emulation code on the individual driver itself, which already has the
> >>>>> broken behavior.
> >>>> So the point is, not about whether the existing behavior is "broken"
> >>>> or not.
> >>> Hold on, I thought earlier we all agreed upon that the existing behavior
> >>> of vendor driver self-clearing maps during .reset violates the vhost
> >>> iotlb abstraction and also breaks the .set_map/.dma_map API. This is
> >>> 100% buggy driver implementation itself that we should discourage or
> >>> eliminate as much as possible (that's part of the goal for this series),
> >> I'm not saying it's not an issue, what I'm saying is, if the fix
> >> breaks another userspace, it's a new bug in the kernel. See what Linus
> >> said in [1]
> >>
> >> "If a change results in user programs breaking, it's a bug in the kernel."
> >>
> >>> but here you seem to go existentialism and suggests the very opposite
> >>> that every .set_map/.dma_map driver implementation, regardless being the
> >>> current or the new/upcoming, should unconditionally try to emulate the
> >>> broken reset behavior for the sake of not breaking older userspace.
> >> Such "emulation" is not done at the parent level. New parents just
> >> need to implement reset_map() or not. everything could be done inside
> >> vhost-vDPA as pseudo code that is shown above.
> >>
> >>> Set
> >>> aside the criteria and definition for how userspace can be broken, can
> >>> we step back to the original question why we think it's broken, and what
> >>> we can do to promote good driver implementation instead of discuss the
> >>> implementation details?
> >> I'm not sure I get the point of this question. I'm not saying we don't
> >> need to fix, what I am saying is that such a fix must be done in a
> >> negotiable way. And it's better if parents won't get any burden. It
> >> can just decide to implement reset_map() or not.
> >>
> >>> Reading the below response I found my major
> >>> points are not heard even if written for quite a few times.
> >> I try my best to not ignore any important things, but I can't promise
> >> I will not miss any. I hope the above clarifies my points.
> >>
> >>> It's not
> >>> that I don't understand the importance of not breaking old userspace, I
> >>> appreciate your questions and extra patience, however I do feel the
> >>> "broken" part is very relevant to our discussion here.
> >>> If it's broken (in the sense of vhost IOTLB API) that you agree, I think
> >>> we should at least allow good driver implementations; and when you think
> >>> about the possibility of those valid good driver cases
> >>> (.set_map/.dma_map implementations that do not clear maps in .reset),
> >>> you might be able to see why it's coded the way as it is now.
> >>>
> >>>>    It's about whether we could stick to the old behaviour without
> >>>> too much cost. And I believe we could.
> >>>>
> >>>> And just to clarify here, reset_vendor_mappings() = config->reset_map()
> >>>>
> >>>>> But today there's no backend feature negotiation
> >>>>> between vhost-vdpa and the parent driver. Do we want to send down the
> >>>>> acked_backend_features to parent drivers?
> >>>> There's no need to do that with the above code, or anything I missed here?
> >>>>
> >>>> config->reset()
> >>>> if (IOTLB_PERSIST is not set) {
> >>>>         config->reset_map()
> >>>> }
> >>> Implementation issue: this implies reset_map() has to be there for every
> >>> .set_map implementations, but vendor driver implementation for custom
> >>> IOMMU could well implement DMA ops by itself instead of .reset_map. This
> >>> won't work for every set_map driver (think about the vduse case).
> >> Well let me do it once again, reset_map() is not mandated:
> >>
> >> config->reset()
> >> if (IOTLB_PERSIST is not set) {
> >>      if (config->reset_map)
> >>            config->reset_map()
> > To avoid new parent drivers
> I am afraid it's not just new parent drivers, but any well behaved
> driver today may well break userspace if go with this forced emulation
> code, if they have to implement reset_map for some reason (e.g. restored
> to 1:1 passthrough mapping or other default state in mapping). For new
> userspace and user driver we can guard against it using the
> IOTLB_PERSIST flag, but the above code would get a big chance to break
> setup with good driver and older userspace in practice.
>
> And .reset_map implementation doesn't necessarily need to clear maps.
> For e.g. IOMMU API compliant driver that only needs simple DMA model for
> passthrough, all .reset_map has to do is toggle to 1:1 mapping mode to
> the default/initial state without taking care of maps, as
> vhost_vdpa_unmap(0, -1ULL) earlier should have done the map cleaning job
> already.

Ok, finally, it takes me a while to understand the issue :)

Actually, there are two things:

1) stick the IOTLB mappings across the reset
2) reset the vendor specific mappings to whatever the parent think
it's comfort (like 1:1)

So I think my suggestion doesn't work.

>
>
> >   to have this behavior if they need to
> > implement reset_map,
> >
> > What if we add a new callback like "config->buggy_virtio_reset_map",
> > different from regular reset_map callback at cleanup?
> Right, separating out the need for old behavior emulation from
> .reset_map is much cleaner, and this is what individual broken driver
> has to maintain without penalizing other good drivers. Good to see what
> I said earlier is heard.
>
> > Only mlx5 and
> > vdpa_sim need to implement it, with a big warning, and new parent
> > drivers can trust they'll never have the old bad behavior.
> Let's see what Jason will say about it and try to converge on this
> first, I think he seemed to imply that this is part of ABI that every
> driver has to make compromise for. I'd better get it ack'ed before
> proceeding to the rest.

Thanks for your patience.

I think we have some choices:

(All of the below can work, but we need to choose the best)

1) module parameter: this turns out to be hard for the management as
it requires the subtle knowledge of a specific user space which turns
out to be hard
2) buggy_virtio_reset_map: seems like somehow a pollution of the
config ops, I think we can do this only if we have other choice
3) set_backend_features: I understand the concern that we should not
propagate the vhost level feature to parent, the reason is most of
them are irrelevant to the parent. I think the right way is to
introduce get_parent_features()/set_parent_features() then we can
choose to map some parent feature to vhost like
(ENALBE_AFTER_DRIVER_OK and IOTLB_PERSIST)
4) piggy backing whether we need to clean vendor specific IOTLB in
config->reset(bool clean_map)

Siwei, Eugenio, what's your opinion here?

Thanks

>
> Thanks,
> -Siwei
>
> >
> >> }
> >>
> >> Did you see any issue with VDUSE in this case?
> >>
> >>> But this is not the the point I was making. I think if you agree this is
> >>> purely buggy driver implementation of its own, we should try to isolate
> >>> this buggy behavior to individual driver rather than overload vhost-vdpa
> >>> or vdpa core's role to help implement the emulation of broken driver
> >>> behavior.
> >> As I pointed out, if it is not noticeable in the userspace, that's
> >> fine but it's not.
> >>
> >>> I don't get why .reset is special here, the abuse of .reset to
> >>> manipulate mapping could also happen in other IOMMU unrelated driver
> >>> entries like in .suspend, or in queue_reset.
> >> Who can abuse reset here? It is totally under the control of
> >> vhost-vDPA and it's not visible to uAPI. And we can fully control the
> >> behaviour of vhost-vDPA.
> >>
> >>> If someday userspace is
> >>> found coded around similar buggy driver implementation in other driver
> >>> ops, do we want to follow and duplicate the same emulation in vdpa core
> >>> as the precedent is already set here around .reset?
> >> I think so, have you seen the links I give you? If you want to go
> >> through the one from Linus thread[1], you can see the one that unbreak
> >> virtio-IOMMU[2]:
> >>
> >> 1) Someday, we spot invalidate with size 0 is a bug
> >> 2) We fix this bug by not allowing this
> >> 3) But virtio-IOMMU userspace find that size 0 actually clean all the
> >> IOTLB so it depends on the behaviour
> >> 4) So the virtio-IOMMU userspace find it can't work after 2)
> >> 5) Then we recover the behaviour before 2) via [2]
> >>
> >> Another example is the IOTLB_MSG_V2, V1 suffers from in-stable ABI in
> >> 32bit archs, most of the userspace survives since it never runs on
> >> 32bit archs. The fix is to introduce a V2 but we will stick to V1 by
> >> default if V2 is not acknowledged by the userspace.
> >>
> >> I think the above 2 examples are sufficient for us to understand the
> >> case. If not, I can help to clarify more since I'm involved in those 2
> >> fixes.
> >>
> >>> The buggy driver can fail in a lot of other ways indefinitely during
> >>> reset, if there's a buggy driver that's already broken the way as how it
> >>> is and happens to survive with all userspace apps, we just don't care
> >>> and let it be.
> >> Without IOTLB_PRESIST it doesn't break. With IOTLB_PERSIST and if the
> >> reset_map() is done unconditionally, it can break. That's my point.
> >>
> >>> There's no way we can enumerate all those buggy behaviors
> >>> in .reset_map itself, it's overloading that driver API too much.
> >> If it is not noticeable by userspace, we can do any fix at will. But
> >> it is not, we don't have another choice. Especially considering the
> >> cost is rather low.
> >>
> >>>>> Second, IOTLB_PERSIST is needed but not sufficient. Due to lack of
> >>>>> backend feature negotiation in parent driver, if vhost-vdpa has to
> >>>>> provide the old-behaviour emulation for compatibility on driver's
> >>>>> behalf, it needs to be done per-driver basis. There could be good
> >>>>> on-chip or vendor IOMMU implementation which doesn't clear the IOTLB in
> >>>>> .reset, and vendor specific IOMMU doesn't have to provide .reset_map,
> >>>> Then we just don't offer IOTLB_PRESIST, isn't this by design?
> >>> Think about the vduse case, it can work with DMA ops directly so doesn't
> >>> have to implement .reset_map, unless for some specific good reason.
> >>> Because it's a conforming and valid/good driver implementation, we may
> >>> still allow it to advertise IOTLB_PERSIST to userspace.
> >> I would like to know why this can't work in this case:
> >>
> >> config->reset()
> >> if (IOTLB_PERSIST is not set) {
> >>      if (config->reset_map)
> >>            config->reset_map()
> >> }
> >>
> >>> Which belongs to
> >>> the 3rd bullet below:
> >>>
> >>> https://lore.kernel.org/virtualization/1696928580-7520-4-git-send-email-si-wei.liu@oracle.com/
> >>>
> >>> There are 3 cases that backend may claim this feature bit on:
> >>>
> >>> - parent device that has to work with platform IOMMU
> >>> - parent device with on-chip IOMMU that has the expected
> >>>     .reset_map support in driver
> >>> - parent device with vendor specific IOMMU implementation
> >>>     that explicitly declares the specific backend feature
> >>>
> >>>>> we
> >>>>> should allow these good driver implementations rather than
> >>>>> unconditionally stick to some specific problematic behavior for every
> >>>>> other good driver.
> >>>> Then you can force reset_map() with set_map() that is what I suggest
> >>>> in another thread, no?
> >>> This is exactly what I was afraid of that broken behavior emulation may
> >>> become a dangerous slippery slope - in principle we should encourage
> >>> good driver implementation, as they can work totally fine with older
> >>> userspace. Why do they have to bother emulating broken behavior just
> >>> because some other driver's misbehaving?
> >> Please read the link [1], Linus has explained it.
> >>
> >>> And what's the boundary for
> >>> this hack, do drivers backed by platform IOMMU even have to emulate (if
> >>> not why not, and is there substantial difference in between)?
> >> The boundary is whether the behaviour change could be noticed but
> >> userspace. And I've shown you it's not a burden with the pseudo codes.
> >> If not, please explain why.
> >>
> >>> After
> >>> getting through all of this, do you still believe everything is just as
> >>> easy and simple as what thought to be?
> >> The truth is that bugs exist everywhere. We can't promise there's no
> >> bug when developing an uAPI or subsystem. For kernel code, the bug
> >> that touches uAPI might be fixed in a way that doesn't break existing
> >> userspace. If you look at how downstream to maintain kABI, you will be
> >> supersized furtherly.
> >>
> >>> Btw, I thought I was expecting but still haven't got the clear answers
> >>> to what was the goal to do all this, we spent a lot of time trying to
> >>> unbreak userspace,
> >> The code is pretty simple. But yes, the time spent on justifying it
> >> might take some time. That's the community. People need time to
> >> understand each other's points.
> >>
> >>> but looks to me as if we were trying every possible
> >>> way to break userspace
> >> How could my suggestions break a userspace?
> >>
> >>> or try to approximate to the same brokenness
> >>> mlx5_vdpa may have caused to the userspace. What we will get eventually
> >>> from these lengthy discussions?
> >> Siwei, I'd really suggest you read the link I gave you. You may get
> >> the answer. What's more, It doesn't cost too much then we know for
> >> sure there would not be any issue, why not choose the hard way?
> >>
> >>> On the other hand, if you think it from
> >>> vhost-vdpa user perspective, you'll clearly see there's just a couple of
> >>> ways to unbreak userspace from the internal broken map which is out of
> >>> sync with vhost-vdpa iotlb after device reset.
> >> Patches are more than welcomed.
> >>
> >>> If this brokenness was
> >>> something universally done from the vhost-vdpa layer itself, I'd feel
> >>> it's more of a shared problem, but this is not the case I see it here.
> >>> While the long standing mlx5_vdpa/vdpa_sim issue is 100% misuse of
> >>> .reset op in a wrong way per IOMMU API definition. Why leaving this
> >>> discrepancy to the individual driver is not even an option, I'm still
> >>> not sure?
> >> Sorry? I start with a switch in the driver, and then I try to avoid
> >> that. And it seems you don't want a burden on the driver as well.
> >> Where did you see I say we can't do that in the driver? What I
> >> disagree with is to use a module parameter.
> >>
> >> Even if I fail, it doesn't mean we can't do that in the driver code.
> >> If you read the link[1] you can see the offending commit is a change
> >> in uvcvideo driver.
> >>
> >> Thanks
> >>
> >>>
> >>> Thanks,
> >>> -Siwei
> >>>
> >>>>> Then we need a set of device flags (backend_features
> >>>>> bit again?) to indicate the specific driver needs upper layer's help on
> >>>>> old-behaviour emulation.
> >>>>>
> >>>>> Last but not least, I'm not sure how to properly emulate
> >>>>> reset_vendor_mappings() from vhost-vdpa layer. If a vendor driver has no
> >>>>> .reset_map op implemented, or if .reset_map has a slightly different
> >>>>> implementation than what it used to reset the iotlb in the .reset op,
> >>>> See above, for reset_vendor_mappings() I meant config->reset_map() exactly.
> >>>>
> >>>> Thanks
> >>>>
> >>>>> then this either becomes effectively dead code if no one ends up using,
> >>>>> or the vhost-vdpa emulation is helpless and limited in scope, unable to
> >>>>> cover all the cases.
> >>>>>
> >>>>> ----------------%<----------------%<----------------
> >>>>>
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
  2023-10-20  4:11                                 ` Jason Wang
@ 2023-10-20  5:57                                   ` Si-Wei Liu
  0 siblings, 0 replies; 37+ messages in thread
From: Si-Wei Liu @ 2023-10-20  5:57 UTC (permalink / raw)
  To: Jason Wang
  Cc: Eugenio Perez Martin, mst, xuanzhuo, dtatulea, virtualization,
	linux-kernel



On 10/19/2023 9:11 PM, Jason Wang wrote:
> On Fri, Oct 20, 2023 at 6:28 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 10/19/2023 7:39 AM, Eugenio Perez Martin wrote:
>>> On Thu, Oct 19, 2023 at 10:27 AM Jason Wang <jasowang@redhat.com> wrote:
>>>> On Thu, Oct 19, 2023 at 2:47 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>
>>>>> On 10/18/2023 7:53 PM, Jason Wang wrote:
>>>>>> On Wed, Oct 18, 2023 at 4:49 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>> On 10/18/2023 12:00 AM, Jason Wang wrote:
>>>>>>>>> Unfortunately, it's a must to stick to ABI. I agree it's a mess but we
>>>>>>>>> don't have a better choice. Or we can fail the probe if userspace
>>>>>>>>> doesn't ack this feature.
>>>>>>>> Antoher idea we can just do the following in vhost_vdpa reset?
>>>>>>>>
>>>>>>>> config->reset()
>>>>>>>> if (IOTLB_PERSIST is not set) {
>>>>>>>>         config->reset_map()
>>>>>>>> }
>>>>>>>>
>>>>>>>> Then we don't have the burden to maintain them in the parent?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>> Please see my earlier response in the other email, thanks.
>>>>>>>
>>>>>>> ----------------%<----------------%<----------------
>>>>>>>
>>>>>>> First, the ideal fix would be to leave this reset_vendor_mappings()
>>>>>>> emulation code on the individual driver itself, which already has the
>>>>>>> broken behavior.
>>>>>> So the point is, not about whether the existing behavior is "broken"
>>>>>> or not.
>>>>> Hold on, I thought earlier we all agreed upon that the existing behavior
>>>>> of vendor driver self-clearing maps during .reset violates the vhost
>>>>> iotlb abstraction and also breaks the .set_map/.dma_map API. This is
>>>>> 100% buggy driver implementation itself that we should discourage or
>>>>> eliminate as much as possible (that's part of the goal for this series),
>>>> I'm not saying it's not an issue, what I'm saying is, if the fix
>>>> breaks another userspace, it's a new bug in the kernel. See what Linus
>>>> said in [1]
>>>>
>>>> "If a change results in user programs breaking, it's a bug in the kernel."
>>>>
>>>>> but here you seem to go existentialism and suggests the very opposite
>>>>> that every .set_map/.dma_map driver implementation, regardless being the
>>>>> current or the new/upcoming, should unconditionally try to emulate the
>>>>> broken reset behavior for the sake of not breaking older userspace.
>>>> Such "emulation" is not done at the parent level. New parents just
>>>> need to implement reset_map() or not. everything could be done inside
>>>> vhost-vDPA as pseudo code that is shown above.
>>>>
>>>>> Set
>>>>> aside the criteria and definition for how userspace can be broken, can
>>>>> we step back to the original question why we think it's broken, and what
>>>>> we can do to promote good driver implementation instead of discuss the
>>>>> implementation details?
>>>> I'm not sure I get the point of this question. I'm not saying we don't
>>>> need to fix, what I am saying is that such a fix must be done in a
>>>> negotiable way. And it's better if parents won't get any burden. It
>>>> can just decide to implement reset_map() or not.
>>>>
>>>>> Reading the below response I found my major
>>>>> points are not heard even if written for quite a few times.
>>>> I try my best to not ignore any important things, but I can't promise
>>>> I will not miss any. I hope the above clarifies my points.
>>>>
>>>>> It's not
>>>>> that I don't understand the importance of not breaking old userspace, I
>>>>> appreciate your questions and extra patience, however I do feel the
>>>>> "broken" part is very relevant to our discussion here.
>>>>> If it's broken (in the sense of vhost IOTLB API) that you agree, I think
>>>>> we should at least allow good driver implementations; and when you think
>>>>> about the possibility of those valid good driver cases
>>>>> (.set_map/.dma_map implementations that do not clear maps in .reset),
>>>>> you might be able to see why it's coded the way as it is now.
>>>>>
>>>>>>     It's about whether we could stick to the old behaviour without
>>>>>> too much cost. And I believe we could.
>>>>>>
>>>>>> And just to clarify here, reset_vendor_mappings() = config->reset_map()
>>>>>>
>>>>>>> But today there's no backend feature negotiation
>>>>>>> between vhost-vdpa and the parent driver. Do we want to send down the
>>>>>>> acked_backend_features to parent drivers?
>>>>>> There's no need to do that with the above code, or anything I missed here?
>>>>>>
>>>>>> config->reset()
>>>>>> if (IOTLB_PERSIST is not set) {
>>>>>>          config->reset_map()
>>>>>> }
>>>>> Implementation issue: this implies reset_map() has to be there for every
>>>>> .set_map implementations, but vendor driver implementation for custom
>>>>> IOMMU could well implement DMA ops by itself instead of .reset_map. This
>>>>> won't work for every set_map driver (think about the vduse case).
>>>> Well let me do it once again, reset_map() is not mandated:
>>>>
>>>> config->reset()
>>>> if (IOTLB_PERSIST is not set) {
>>>>       if (config->reset_map)
>>>>             config->reset_map()
>>> To avoid new parent drivers
>> I am afraid it's not just new parent drivers, but any well behaved
>> driver today may well break userspace if go with this forced emulation
>> code, if they have to implement reset_map for some reason (e.g. restored
>> to 1:1 passthrough mapping or other default state in mapping). For new
>> userspace and user driver we can guard against it using the
>> IOTLB_PERSIST flag, but the above code would get a big chance to break
>> setup with good driver and older userspace in practice.
>>
>> And .reset_map implementation doesn't necessarily need to clear maps.
>> For e.g. IOMMU API compliant driver that only needs simple DMA model for
>> passthrough, all .reset_map has to do is toggle to 1:1 mapping mode to
>> the default/initial state without taking care of maps, as
>> vhost_vdpa_unmap(0, -1ULL) earlier should have done the map cleaning job
>> already.
> Ok, finally, it takes me a while to understand the issue :)
>
> Actually, there are two things:
>
> 1) stick the IOTLB mappings across the reset
> 2) reset the vendor specific mappings to whatever the parent think
> it's comfort (like 1:1)
Yep, maybe I need to document this expectation more clearly, but I found 
it a bit hard to describe what 2) is really about (tried to avoid the 
specifics, like 1:1, as that wording seems not so welcomed).
>
> So I think my suggestion doesn't work.
>
>>
>>>    to have this behavior if they need to
>>> implement reset_map,
>>>
>>> What if we add a new callback like "config->buggy_virtio_reset_map",
>>> different from regular reset_map callback at cleanup?
>> Right, separating out the need for old behavior emulation from
>> .reset_map is much cleaner, and this is what individual broken driver
>> has to maintain without penalizing other good drivers. Good to see what
>> I said earlier is heard.
>>
>>> Only mlx5 and
>>> vdpa_sim need to implement it, with a big warning, and new parent
>>> drivers can trust they'll never have the old bad behavior.
>> Let's see what Jason will say about it and try to converge on this
>> first, I think he seemed to imply that this is part of ABI that every
>> driver has to make compromise for. I'd better get it ack'ed before
>> proceeding to the rest.
> Thanks for your patience.
>
> I think we have some choices:
>
> (All of the below can work, but we need to choose the best)
>
> 1) module parameter: this turns out to be hard for the management as
> it requires the subtle knowledge of a specific user space which turns
> out to be hard
> 2) buggy_virtio_reset_map: seems like somehow a pollution of the
> config ops, I think we can do this only if we have other choice
> 3) set_backend_features: I understand the concern that we should not
> propagate the vhost level feature to parent, the reason is most of
> them are irrelevant to the parent. I think the right way is to
> introduce get_parent_features()/set_parent_features() then we can
> choose to map some parent feature to vhost like
> (ENALBE_AFTER_DRIVER_OK and IOTLB_PERSIST)
> 4) piggy backing whether we need to clean vendor specific IOTLB in
> config->reset(bool clean_map)
Both 3) and 4) should work with me. For 3) we already have the get_*() 
part, though I'm not sure if worth to bother introducing the set_*() 
API; today I think most parent drivers work directly with the driver 
config op, instead of backend feature flag . 4) is actually closer to 
what I had in mind (was thinking of a flag for future extension instead 
of bool). But the document has to make it very clear that the use of 
clean_map is limited to backward compatibility for old behavior, and new 
driver should not bother to implement (as it violates the 
.set_map/.dma_map IOMMU API).

Thanks,
-Siwei

>
> Siwei, Eugenio, what's your opinion here?
>
> Thanks
>
>> Thanks,
>> -Siwei
>>
>>>> }
>>>>
>>>> Did you see any issue with VDUSE in this case?
>>>>
>>>>> But this is not the the point I was making. I think if you agree this is
>>>>> purely buggy driver implementation of its own, we should try to isolate
>>>>> this buggy behavior to individual driver rather than overload vhost-vdpa
>>>>> or vdpa core's role to help implement the emulation of broken driver
>>>>> behavior.
>>>> As I pointed out, if it is not noticeable in the userspace, that's
>>>> fine but it's not.
>>>>
>>>>> I don't get why .reset is special here, the abuse of .reset to
>>>>> manipulate mapping could also happen in other IOMMU unrelated driver
>>>>> entries like in .suspend, or in queue_reset.
>>>> Who can abuse reset here? It is totally under the control of
>>>> vhost-vDPA and it's not visible to uAPI. And we can fully control the
>>>> behaviour of vhost-vDPA.
>>>>
>>>>> If someday userspace is
>>>>> found coded around similar buggy driver implementation in other driver
>>>>> ops, do we want to follow and duplicate the same emulation in vdpa core
>>>>> as the precedent is already set here around .reset?
>>>> I think so, have you seen the links I give you? If you want to go
>>>> through the one from Linus thread[1], you can see the one that unbreak
>>>> virtio-IOMMU[2]:
>>>>
>>>> 1) Someday, we spot invalidate with size 0 is a bug
>>>> 2) We fix this bug by not allowing this
>>>> 3) But virtio-IOMMU userspace find that size 0 actually clean all the
>>>> IOTLB so it depends on the behaviour
>>>> 4) So the virtio-IOMMU userspace find it can't work after 2)
>>>> 5) Then we recover the behaviour before 2) via [2]
>>>>
>>>> Another example is the IOTLB_MSG_V2, V1 suffers from in-stable ABI in
>>>> 32bit archs, most of the userspace survives since it never runs on
>>>> 32bit archs. The fix is to introduce a V2 but we will stick to V1 by
>>>> default if V2 is not acknowledged by the userspace.
>>>>
>>>> I think the above 2 examples are sufficient for us to understand the
>>>> case. If not, I can help to clarify more since I'm involved in those 2
>>>> fixes.
>>>>
>>>>> The buggy driver can fail in a lot of other ways indefinitely during
>>>>> reset, if there's a buggy driver that's already broken the way as how it
>>>>> is and happens to survive with all userspace apps, we just don't care
>>>>> and let it be.
>>>> Without IOTLB_PRESIST it doesn't break. With IOTLB_PERSIST and if the
>>>> reset_map() is done unconditionally, it can break. That's my point.
>>>>
>>>>> There's no way we can enumerate all those buggy behaviors
>>>>> in .reset_map itself, it's overloading that driver API too much.
>>>> If it is not noticeable by userspace, we can do any fix at will. But
>>>> it is not, we don't have another choice. Especially considering the
>>>> cost is rather low.
>>>>
>>>>>>> Second, IOTLB_PERSIST is needed but not sufficient. Due to lack of
>>>>>>> backend feature negotiation in parent driver, if vhost-vdpa has to
>>>>>>> provide the old-behaviour emulation for compatibility on driver's
>>>>>>> behalf, it needs to be done per-driver basis. There could be good
>>>>>>> on-chip or vendor IOMMU implementation which doesn't clear the IOTLB in
>>>>>>> .reset, and vendor specific IOMMU doesn't have to provide .reset_map,
>>>>>> Then we just don't offer IOTLB_PRESIST, isn't this by design?
>>>>> Think about the vduse case, it can work with DMA ops directly so doesn't
>>>>> have to implement .reset_map, unless for some specific good reason.
>>>>> Because it's a conforming and valid/good driver implementation, we may
>>>>> still allow it to advertise IOTLB_PERSIST to userspace.
>>>> I would like to know why this can't work in this case:
>>>>
>>>> config->reset()
>>>> if (IOTLB_PERSIST is not set) {
>>>>       if (config->reset_map)
>>>>             config->reset_map()
>>>> }
>>>>
>>>>> Which belongs to
>>>>> the 3rd bullet below:
>>>>>
>>>>> https://lore.kernel.org/virtualization/1696928580-7520-4-git-send-email-si-wei.liu@oracle.com/
>>>>>
>>>>> There are 3 cases that backend may claim this feature bit on:
>>>>>
>>>>> - parent device that has to work with platform IOMMU
>>>>> - parent device with on-chip IOMMU that has the expected
>>>>>      .reset_map support in driver
>>>>> - parent device with vendor specific IOMMU implementation
>>>>>      that explicitly declares the specific backend feature
>>>>>
>>>>>>> we
>>>>>>> should allow these good driver implementations rather than
>>>>>>> unconditionally stick to some specific problematic behavior for every
>>>>>>> other good driver.
>>>>>> Then you can force reset_map() with set_map() that is what I suggest
>>>>>> in another thread, no?
>>>>> This is exactly what I was afraid of that broken behavior emulation may
>>>>> become a dangerous slippery slope - in principle we should encourage
>>>>> good driver implementation, as they can work totally fine with older
>>>>> userspace. Why do they have to bother emulating broken behavior just
>>>>> because some other driver's misbehaving?
>>>> Please read the link [1], Linus has explained it.
>>>>
>>>>> And what's the boundary for
>>>>> this hack, do drivers backed by platform IOMMU even have to emulate (if
>>>>> not why not, and is there substantial difference in between)?
>>>> The boundary is whether the behaviour change could be noticed but
>>>> userspace. And I've shown you it's not a burden with the pseudo codes.
>>>> If not, please explain why.
>>>>
>>>>> After
>>>>> getting through all of this, do you still believe everything is just as
>>>>> easy and simple as what thought to be?
>>>> The truth is that bugs exist everywhere. We can't promise there's no
>>>> bug when developing an uAPI or subsystem. For kernel code, the bug
>>>> that touches uAPI might be fixed in a way that doesn't break existing
>>>> userspace. If you look at how downstream to maintain kABI, you will be
>>>> supersized furtherly.
>>>>
>>>>> Btw, I thought I was expecting but still haven't got the clear answers
>>>>> to what was the goal to do all this, we spent a lot of time trying to
>>>>> unbreak userspace,
>>>> The code is pretty simple. But yes, the time spent on justifying it
>>>> might take some time. That's the community. People need time to
>>>> understand each other's points.
>>>>
>>>>> but looks to me as if we were trying every possible
>>>>> way to break userspace
>>>> How could my suggestions break a userspace?
>>>>
>>>>> or try to approximate to the same brokenness
>>>>> mlx5_vdpa may have caused to the userspace. What we will get eventually
>>>>> from these lengthy discussions?
>>>> Siwei, I'd really suggest you read the link I gave you. You may get
>>>> the answer. What's more, It doesn't cost too much then we know for
>>>> sure there would not be any issue, why not choose the hard way?
>>>>
>>>>> On the other hand, if you think it from
>>>>> vhost-vdpa user perspective, you'll clearly see there's just a couple of
>>>>> ways to unbreak userspace from the internal broken map which is out of
>>>>> sync with vhost-vdpa iotlb after device reset.
>>>> Patches are more than welcomed.
>>>>
>>>>> If this brokenness was
>>>>> something universally done from the vhost-vdpa layer itself, I'd feel
>>>>> it's more of a shared problem, but this is not the case I see it here.
>>>>> While the long standing mlx5_vdpa/vdpa_sim issue is 100% misuse of
>>>>> .reset op in a wrong way per IOMMU API definition. Why leaving this
>>>>> discrepancy to the individual driver is not even an option, I'm still
>>>>> not sure?
>>>> Sorry? I start with a switch in the driver, and then I try to avoid
>>>> that. And it seems you don't want a burden on the driver as well.
>>>> Where did you see I say we can't do that in the driver? What I
>>>> disagree with is to use a module parameter.
>>>>
>>>> Even if I fail, it doesn't mean we can't do that in the driver code.
>>>> If you read the link[1] you can see the offending commit is a change
>>>> in uvcvideo driver.
>>>>
>>>> Thanks
>>>>
>>>>> Thanks,
>>>>> -Siwei
>>>>>
>>>>>>> Then we need a set of device flags (backend_features
>>>>>>> bit again?) to indicate the specific driver needs upper layer's help on
>>>>>>> old-behaviour emulation.
>>>>>>>
>>>>>>> Last but not least, I'm not sure how to properly emulate
>>>>>>> reset_vendor_mappings() from vhost-vdpa layer. If a vendor driver has no
>>>>>>> .reset_map op implemented, or if .reset_map has a slightly different
>>>>>>> implementation than what it used to reset the iotlb in the .reset op,
>>>>>> See above, for reset_vendor_mappings() I meant config->reset_map() exactly.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>> then this either becomes effectively dead code if no one ends up using,
>>>>>>> or the vhost-vdpa emulation is helpless and limited in scope, unable to
>>>>>>> cover all the cases.
>>>>>>>
>>>>>>> ----------------%<----------------%<----------------
>>>>>>>


^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2023-10-20  5:58 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-10  9:02 [PATCH 0/4] vdpa: decouple reset of iotlb mapping from device reset Si-Wei Liu
2023-10-10  9:02 ` [PATCH 1/4] vdpa: introduce .reset_map operation callback Si-Wei Liu
2023-10-13  2:49   ` Jason Wang
2023-10-13  7:36     ` Si-Wei Liu
2023-10-16  5:30       ` Jason Wang
2023-10-10  9:02 ` [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release Si-Wei Liu
2023-10-11 11:21   ` Eugenio Perez Martin
2023-10-12  6:18     ` Si-Wei Liu
2023-10-13  3:01   ` Jason Wang
2023-10-13  7:35     ` Si-Wei Liu
2023-10-16  6:32       ` Jason Wang
2023-10-16 11:28         ` Eugenio Perez Martin
2023-10-16 20:30           ` Si-Wei Liu
2023-10-17  2:35             ` Jason Wang
2023-10-17 13:58               ` Eugenio Perez Martin
2023-10-18  4:35               ` Si-Wei Liu
2023-10-18  5:27                 ` Jason Wang
2023-10-18  7:00                   ` Jason Wang
2023-10-18  8:49                     ` Si-Wei Liu
2023-10-19  2:53                       ` Jason Wang
2023-10-19  6:46                         ` Si-Wei Liu
2023-10-19  8:27                           ` Jason Wang
2023-10-19 14:39                             ` Eugenio Perez Martin
2023-10-19 22:28                               ` Si-Wei Liu
2023-10-20  4:11                                 ` Jason Wang
2023-10-20  5:57                                   ` Si-Wei Liu
2023-10-18  8:44                   ` Si-Wei Liu
2023-10-18 11:14                     ` Eugenio Perez Martin
2023-10-18 23:21                       ` Si-Wei Liu
2023-10-19  2:48                         ` Jason Wang
2023-10-19 22:57                   ` Si-Wei Liu
2023-10-16 20:10         ` Si-Wei Liu
2023-10-10  9:02 ` [PATCH 3/4] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit Si-Wei Liu
2023-10-10  9:03 ` [PATCH 4/4] vdpa/mlx5: implement .reset_map driver op Si-Wei Liu
2023-10-13  3:04   ` Jason Wang
2023-10-13  7:55     ` Si-Wei Liu
2023-10-11 11:30 ` [PATCH 0/4] vdpa: decouple reset of iotlb mapping from device reset Eugenio Perez Martin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).