linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 00/17] iommufd: Add nesting infrastructure
@ 2023-07-24 11:03 Yi Liu
  2023-07-24 11:03 ` [PATCH v3 01/17] iommu: Add new iommu op to create domains owned by userspace Yi Liu
                   ` (16 more replies)
  0 siblings, 17 replies; 66+ messages in thread
From: Yi Liu @ 2023-07-24 11:03 UTC (permalink / raw)
  To: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.l.liu, yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi,
	lulu, suravee.suthikulpanit, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan

Nested translation is a hardware feature supported by many modern IOMMUs.
It uses two stages (stage-1 and stage-2) of address translation to reach
the physical address. The stage-1 translation table is owned by userspace
(e.g. by a guest OS), while stage-2 is owned by the kernel. Changes to the
stage-1 translation table must be followed by an IOTLB invalidation.

Take Intel VT-d as an example: the stage-1 translation table is the I/O
page table. As the diagram below shows, the guest I/O page table pointer in
GPA (guest physical address) is passed to the host and used to perform the
stage-1 address translation. Accordingly, modifications to present mappings
in the guest I/O page table must be followed by an IOTLB invalidation.

    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest I/O page table      |
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush --+
    '-------------'                        |
    |             |                        V
    |             |           I/O page table pointer in GPA
    '-------------'
Guest
------| Shadow |---------------------------|--------
      v        v                           v
Host
    .-------------.  .------------------------.
    |   pIOMMU    |  |  FS for GIOVA->GPA     |
    |             |  '------------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------\.----------------------------------.
    |             |   | SS for GPA->HPA, unmanaged domain|
    |             |   '----------------------------------'
    '-------------'
Where:
 - FS = First stage page tables
 - SS = Second stage page tables
<Intel VT-d Nested translation>

In IOMMUFD, all translation tables are tracked by hw_pagetable (hwpt), and
each has an iommu_domain allocated from the iommu driver. So in this
series, hw_pagetable and iommu_domain mean the same thing unless otherwise
noted. IOMMUFD already supports allocating a hw_pagetable that is linked
with an IOAS. However, nesting requires IOMMUFD to allow allocating a
hw_pagetable with driver-specific parameters, plus an interface to sync the
stage-1 IOTLB, since the user owns the stage-1 translation table.

This series is based on the iommu hw info reporting series [1]. It first
introduces new iommu ops for allocating domains with user data and for
invalidating the stage-1 IOTLB, then extends the IOMMUFD internal
infrastructure to accept user_data and a parent hwpt and relay them to the
iommu core to allocate a user iommu_domain. After that, it extends the
IOMMU_HWPT_ALLOC ioctl to accept user data and a stage-2 hwpt ID. Along
with it, the IOMMU_HWPT_INVALIDATE ioctl is added to invalidate the stage-1
IOTLB, which is needed for user-managed hwpts. Selftests are added as well
to cover the new ioctls.

Complete code can be found in [2]; QEMU code can be found in [3].

Lastly, this is a team effort together with Nicolin Chen and Lu Baolu.
Thanks to them for the help. ^_^. Looking forward to your feedback.

[1] https://lore.kernel.org/linux-iommu/20230724105936.107042-1-yi.l.liu@intel.com/
[2] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[3] https://github.com/yiliu1765/qemu/tree/wip/iommufd_rfcv4_nesting

Change log:

v3:
 - Add new uAPI things in alphabetical order
 - Pass in "enum iommu_hwpt_type hwpt_type" to op->domain_alloc_user for
   sanity, replacing the previous op->domain_alloc_user_data_len solution
 - Return ERR_PTR from domain_alloc_user instead of NULL
 - Only add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation (Kevin)
 - Add IOMMU_RESV_IOVA_RANGES to report resv iova ranges to userspace hence
   userspace is able to exclude the ranges in the stage-1 HWPT (e.g. guest I/O
   page table). (Kevin)
 - Add selftest coverage for the new IOMMU_RESV_IOVA_RANGES ioctl
 - Minor changes per Kevin's inputs

v2: https://lore.kernel.org/linux-iommu/20230511143844.22693-1-yi.l.liu@intel.com/
 - Add union iommu_domain_user_data to include all user data structures to avoid
   passing void * in kernel APIs.
 - Add iommu op to return user data length for user domain allocation
 - Rename struct iommu_hwpt_alloc::data_type to be hwpt_type
 - Store the invalidation data length in iommu_domain_ops::cache_invalidate_user_data_len
 - Convert cache_invalidate_user op to be int instead of void
 - Remove @data_type in struct iommu_hwpt_invalidate
 - Remove out_hwpt_type_bitmap in struct iommu_hw_info hence drop patch 08 of v1

v1: https://lore.kernel.org/linux-iommu/20230309080910.607396-1-yi.l.liu@intel.com/

Thanks,
	Yi Liu

Lu Baolu (2):
  iommu: Add new iommu op to create domains owned by userspace
  iommu: Add nested domain support

Nicolin Chen (6):
  iommufd/hw_pagetable: Do not populate user-managed hw_pagetables
  iommufd: Only enforce IOMMU_RESV_SW_MSI when attaching user-managed
    HWPT
  iommufd/selftest: Add domain_alloc_user() support in iommu mock
  iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with user data
  iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
  iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl

Yi Liu (9):
  iommufd/hw_pagetable: Use domain_alloc_user op for domain allocation
  iommufd: Pass in hwpt_type/parent/user_data to
    iommufd_hw_pagetable_alloc()
  iommufd: Add IOMMU_RESV_IOVA_RANGES
  iommufd: IOMMU_HWPT_ALLOC allocation with user data
  iommufd: Add IOMMU_HWPT_INVALIDATE
  iommufd/selftest: Add a helper to get test device
  iommufd/selftest: Add IOMMU_TEST_OP_DEV_[ADD|DEL]_RESERVED to add/del
    reserved regions to selftest device
  iommufd/selftest: Add .get_resv_regions() for mock_dev
  iommufd/selftest: Add coverage for IOMMU_RESV_IOVA_RANGES

 drivers/iommu/iommufd/device.c                |   9 +-
 drivers/iommu/iommufd/hw_pagetable.c          | 181 +++++++++++-
 drivers/iommu/iommufd/io_pagetable.c          |   5 +-
 drivers/iommu/iommufd/iommufd_private.h       |  20 +-
 drivers/iommu/iommufd/iommufd_test.h          |  36 +++
 drivers/iommu/iommufd/main.c                  |  59 +++-
 drivers/iommu/iommufd/selftest.c              | 266 ++++++++++++++++--
 include/linux/iommu.h                         |  34 +++
 include/uapi/linux/iommufd.h                  |  96 ++++++-
 tools/testing/selftests/iommu/iommufd.c       | 224 ++++++++++++++-
 tools/testing/selftests/iommu/iommufd_utils.h |  70 +++++
 11 files changed, 958 insertions(+), 42 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v3 01/17] iommu: Add new iommu op to create domains owned by userspace
  2023-07-24 11:03 [PATCH v3 00/17] iommufd: Add nesting infrastructure Yi Liu
@ 2023-07-24 11:03 ` Yi Liu
  2023-07-28  9:37   ` Tian, Kevin
  2023-07-24 11:03 ` [PATCH v3 02/17] iommu: Add nested domain support Yi Liu
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 66+ messages in thread
From: Yi Liu @ 2023-07-24 11:03 UTC (permalink / raw)
  To: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.l.liu, yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi,
	lulu, suravee.suthikulpanit, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan

From: Lu Baolu <baolu.lu@linux.intel.com>

Introduce a new iommu_domain op to create domains owned by userspace,
e.g. through iommufd. These domains have a few different properties
compared to kernel-owned domains:

 - They may be UNMANAGED domains, but created with special parameters.
   For instance aperture size changes/number of levels, different
   IOPTE formats, or other things necessary to make a vIOMMU work

 - We have to track all the memory allocations with GFP_KERNEL_ACCOUNT
   to make the cgroup sandbox stronger

 - Device-specialty domains, such as NESTED domains can be created by
   iommufd.

The new op clearly says the domain is being created by IOMMUFD, that
the domain is intended for userspace use, and it provides a way to pass
a driver specific uAPI structure to customize the created domain to
exactly what the vIOMMU userspace driver requires.

iommu drivers that cannot support VFIO/IOMMUFD should not support this
op. This includes any driver that cannot provide a fully functional
UNMANAGED domain.

This op chooses to make the special parameters opaque to the core. This
suits the current usage model where accessing any of the IOMMU device
special parameters does require a userspace driver that matches the
kernel driver. If a need for common parameters, implemented similarly
by several drivers, arises then there is room in the design to grow a
generic parameter set as well.

This new op, for now, is only supposed to be used by iommufd, hence no
wrapper for it; iommufd will call the callback directly. As for domain
free, iommufd will use iommu_domain_free().

enum iommu_hwpt_type is defined to differentiate the user parameters used
by different usages. For allocations that don't require a user parameter,
IOMMU_HWPT_TYPE_DEFAULT is defined for backward compatibility. Other types
will be added by future iommu vendor driver extensions.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Co-developed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 include/linux/iommu.h        | 20 ++++++++++++++++++++
 include/uapi/linux/iommufd.h |  8 ++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 4199e13b34e6..ecbec2627b63 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -226,6 +226,15 @@ struct iommu_iotlb_gather {
 	bool			queued;
 };
 
+/*
+ * The user data to allocate a specific type user iommu domain
+ *
+ * This includes the corresponding driver data structures in
+ * include/uapi/linux/iommufd.h.
+ */
+union iommu_domain_user_data {
+};
+
 /**
  * struct iommu_ops - iommu ops and capabilities
  * @capable: check capability
@@ -235,6 +244,13 @@ struct iommu_iotlb_gather {
  *           after use. Return the data buffer if success, or ERR_PTR on
  *           failure.
  * @domain_alloc: allocate iommu domain
+ * @domain_alloc_user: allocate a user iommu domain corresponding to the input
+ *                     @hwpt_type that is defined as enum iommu_hwpt_type in the
+ *                     include/uapi/linux/iommufd.h. A returning domain will be
+ *                     set to an IOMMU_DOMAIN_NESTED type, upon valid @user_data
+ *                     and @parent that is a kernel-managed domain. Otherwise,
+ *                     it will be set to an IOMMU_DOMAIN_UNMANAGED type. Return
+ *                     ERR_PTR on allocation failure.
  * @probe_device: Add device to iommu driver handling
  * @release_device: Remove device from iommu driver handling
  * @probe_finalize: Do final setup work after the device is added to an IOMMU
@@ -275,6 +291,10 @@ struct iommu_ops {
 
 	/* Domain allocation and freeing by the iommu driver */
 	struct iommu_domain *(*domain_alloc)(unsigned iommu_domain_type);
+	struct iommu_domain *(*domain_alloc_user)(struct device *dev,
+						  enum iommu_hwpt_type hwpt_type,
+						  struct iommu_domain *parent,
+						  const union iommu_domain_user_data *user_data);
 
 	struct iommu_device *(*probe_device)(struct device *dev);
 	void (*release_device)(struct device *dev);
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 4295362e7b44..d38bc54fd5f2 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -347,6 +347,14 @@ struct iommu_vfio_ioas {
 };
 #define IOMMU_VFIO_IOAS _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VFIO_IOAS)
 
+/**
+ * enum iommu_hwpt_type - IOMMU HWPT Type
+ * @IOMMU_HWPT_TYPE_DEFAULT: default
+ */
+enum iommu_hwpt_type {
+	IOMMU_HWPT_TYPE_DEFAULT,
+};
+
 /**
  * struct iommu_hwpt_alloc - ioctl(IOMMU_HWPT_ALLOC)
  * @size: sizeof(struct iommu_hwpt_alloc)
-- 
2.34.1



* [PATCH v3 02/17] iommu: Add nested domain support
  2023-07-24 11:03 [PATCH v3 00/17] iommufd: Add nesting infrastructure Yi Liu
  2023-07-24 11:03 ` [PATCH v3 01/17] iommu: Add new iommu op to create domains owned by userspace Yi Liu
@ 2023-07-24 11:03 ` Yi Liu
  2023-07-28  9:38   ` Tian, Kevin
  2023-07-28 16:59   ` Jason Gunthorpe
  2023-07-24 11:03 ` [PATCH v3 03/17] iommufd/hw_pagetable: Use domain_alloc_user op for domain allocation Yi Liu
                   ` (14 subsequent siblings)
  16 siblings, 2 replies; 66+ messages in thread
From: Yi Liu @ 2023-07-24 11:03 UTC (permalink / raw)
  To: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.l.liu, yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi,
	lulu, suravee.suthikulpanit, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan

From: Lu Baolu <baolu.lu@linux.intel.com>

Introduce a new domain type for a user I/O page table, which is nested on
top of another address space represented by an UNMANAGED domain. The
mappings of a nested domain are managed by user space software, so it's
unnecessary to have map/unmap callbacks. But updates to the PTEs in the
nested domain's page table must be propagated to the caches on both the
IOMMU (IOTLB) and devices (DevTLB).

The nested domain is allocated by the domain_alloc_user op, and attached
to the device through the existing iommu_attach_device/group() interfaces.

A new domain op, named cache_invalidate_user, is added for userspace to
flush the hardware caches for a nested domain through iommufd. No wrapper
is added for it, as it's only supposed to be used by iommufd.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Co-developed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 include/linux/iommu.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index ecbec2627b63..b8f09330b64e 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -66,6 +66,9 @@ struct iommu_domain_geometry {
 
 #define __IOMMU_DOMAIN_SVA	(1U << 4)  /* Shared process address space */
 
+#define __IOMMU_DOMAIN_NESTED	(1U << 5)  /* User-managed address space nested
+					      on a stage-2 translation        */
+
 #define IOMMU_DOMAIN_ALLOC_FLAGS ~__IOMMU_DOMAIN_DMA_FQ
 /*
  * This are the possible domain-types
@@ -92,6 +95,7 @@ struct iommu_domain_geometry {
 				 __IOMMU_DOMAIN_DMA_API |	\
 				 __IOMMU_DOMAIN_DMA_FQ)
 #define IOMMU_DOMAIN_SVA	(__IOMMU_DOMAIN_SVA)
+#define IOMMU_DOMAIN_NESTED	(__IOMMU_DOMAIN_NESTED)
 
 struct iommu_domain {
 	unsigned type;
@@ -350,6 +354,10 @@ struct iommu_ops {
  * @iotlb_sync_map: Sync mappings created recently using @map to the hardware
  * @iotlb_sync: Flush all queued ranges from the hardware TLBs and empty flush
  *            queue
+ * @cache_invalidate_user: Flush hardware TLBs caching user space IO mappings
+ * @cache_invalidate_user_data_len: Defined length of input user data for the
+ *                                  cache_invalidate_user op, being sizeof the
+ *                                  structure in include/uapi/linux/iommufd.h
  * @iova_to_phys: translate iova to physical address
  * @enforce_cache_coherency: Prevent any kind of DMA from bypassing IOMMU_CACHE,
  *                           including no-snoop TLPs on PCIe or other platform
@@ -379,6 +387,9 @@ struct iommu_domain_ops {
 			       size_t size);
 	void (*iotlb_sync)(struct iommu_domain *domain,
 			   struct iommu_iotlb_gather *iotlb_gather);
+	int (*cache_invalidate_user)(struct iommu_domain *domain,
+				     void *user_data);
+	const size_t cache_invalidate_user_data_len;
 
 	phys_addr_t (*iova_to_phys)(struct iommu_domain *domain,
 				    dma_addr_t iova);
-- 
2.34.1



* [PATCH v3 03/17] iommufd/hw_pagetable: Use domain_alloc_user op for domain allocation
  2023-07-24 11:03 [PATCH v3 00/17] iommufd: Add nesting infrastructure Yi Liu
  2023-07-24 11:03 ` [PATCH v3 01/17] iommu: Add new iommu op to create domains owned by userspace Yi Liu
  2023-07-24 11:03 ` [PATCH v3 02/17] iommu: Add nested domain support Yi Liu
@ 2023-07-24 11:03 ` Yi Liu
  2023-07-28  9:39   ` Tian, Kevin
  2023-07-24 11:03 ` [PATCH v3 04/17] iommufd: Pass in hwpt_type/parent/user_data to iommufd_hw_pagetable_alloc() Yi Liu
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 66+ messages in thread
From: Yi Liu @ 2023-07-24 11:03 UTC (permalink / raw)
  To: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.l.liu, yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi,
	lulu, suravee.suthikulpanit, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan

This makes IOMMUFD use the domain_alloc_user op for iommu_domain creation,
as IOMMUFD needs to support iommu_domain allocation with parameters from
userspace for nesting support. If the iommu driver doesn't provide the
domain_alloc_user callback, it falls back to iommu_domain_alloc().

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Co-developed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/hw_pagetable.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index cf2c1504e20d..1371e96653b2 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -5,6 +5,7 @@
 #include <linux/iommu.h>
 #include <uapi/linux/iommufd.h>
 
+#include "../iommu-priv.h"
 #include "iommufd_private.h"
 
 void iommufd_hw_pagetable_destroy(struct iommufd_object *obj)
@@ -74,6 +75,7 @@ struct iommufd_hw_pagetable *
 iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 			   struct iommufd_device *idev, bool immediate_attach)
 {
+	const struct iommu_ops *ops = dev_iommu_ops(idev->dev);
 	struct iommufd_hw_pagetable *hwpt;
 	int rc;
 
@@ -88,10 +90,21 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 	refcount_inc(&ioas->obj.users);
 	hwpt->ioas = ioas;
 
-	hwpt->domain = iommu_domain_alloc(idev->dev->bus);
-	if (!hwpt->domain) {
-		rc = -ENOMEM;
-		goto out_abort;
+	if (ops->domain_alloc_user) {
+		hwpt->domain = ops->domain_alloc_user(idev->dev,
+						      IOMMU_HWPT_TYPE_DEFAULT,
+						      NULL, NULL);
+		if (IS_ERR(hwpt->domain)) {
+			rc = PTR_ERR(hwpt->domain);
+			hwpt->domain = NULL;
+			goto out_abort;
+		}
+	} else {
+		hwpt->domain = iommu_domain_alloc(idev->dev->bus);
+		if (!hwpt->domain) {
+			rc = -ENOMEM;
+			goto out_abort;
+		}
 	}
 
 	/*
-- 
2.34.1



* [PATCH v3 04/17] iommufd: Pass in hwpt_type/parent/user_data to iommufd_hw_pagetable_alloc()
  2023-07-24 11:03 [PATCH v3 00/17] iommufd: Add nesting infrastructure Yi Liu
                   ` (2 preceding siblings ...)
  2023-07-24 11:03 ` [PATCH v3 03/17] iommufd/hw_pagetable: Use domain_alloc_user op for domain allocation Yi Liu
@ 2023-07-24 11:03 ` Yi Liu
  2023-07-28  9:49   ` Tian, Kevin
  2023-07-24 11:03 ` [PATCH v3 05/17] iommufd/hw_pagetable: Do not populate user-managed hw_pagetables Yi Liu
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 66+ messages in thread
From: Yi Liu @ 2023-07-24 11:03 UTC (permalink / raw)
  To: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.l.liu, yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi,
	lulu, suravee.suthikulpanit, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan

Nested translation has stage-1 and stage-2 page tables. A stage-1 page
table is managed by user space, and it needs to work with a stage-2 page
table, which is a parent hwpt for the stage-1 hwpt.

The iommu core already supports accepting a parent iommu_domain and
user_data to allocate an iommu_domain. This makes iommufd_hw_pagetable_alloc()
accept the parent hwpt and user_data, and relay them to the iommu core, to
prepare for supporting hw_pagetable allocation with user data.

Also, add a parent pointer in struct iommufd_hw_pagetable for taking and
releasing its refcount.

Co-developed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/device.c          |  3 +-
 drivers/iommu/iommufd/hw_pagetable.c    | 40 +++++++++++++++++++++----
 drivers/iommu/iommufd/iommufd_private.h |  8 ++++-
 3 files changed, 44 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 0deb2a2ec01a..97e4e5f5aca0 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -542,7 +542,8 @@ iommufd_device_auto_get_domain(struct iommufd_device *idev,
 	}
 
 	hwpt = iommufd_hw_pagetable_alloc(idev->ictx, ioas, idev,
-					  immediate_attach);
+					  IOMMU_HWPT_TYPE_DEFAULT,
+					  NULL, NULL, immediate_attach);
 	if (IS_ERR(hwpt)) {
 		destroy_hwpt = ERR_CAST(hwpt);
 		goto out_unlock;
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 1371e96653b2..28122a49b529 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -24,6 +24,8 @@ void iommufd_hw_pagetable_destroy(struct iommufd_object *obj)
 	if (hwpt->domain)
 		iommu_domain_free(hwpt->domain);
 
+	if (hwpt->parent)
+		refcount_dec(&hwpt->parent->obj.users);
 	refcount_dec(&hwpt->ioas->obj.users);
 }
 
@@ -61,6 +63,9 @@ int iommufd_hw_pagetable_enforce_cc(struct iommufd_hw_pagetable *hwpt)
  * @ictx: iommufd context
  * @ioas: IOAS to associate the domain with
  * @idev: Device to get an iommu_domain for
+ * @hwpt_type: Requested type of hw_pagetable
+ * @parent: Optional parent HWPT to associate with
+ * @user_data: Optional user_data pointer
  * @immediate_attach: True if idev should be attached to the hwpt
  *
  * Allocate a new iommu_domain and return it as a hw_pagetable. The HWPT
@@ -73,14 +78,24 @@ int iommufd_hw_pagetable_enforce_cc(struct iommufd_hw_pagetable *hwpt)
  */
 struct iommufd_hw_pagetable *
 iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
-			   struct iommufd_device *idev, bool immediate_attach)
+			   struct iommufd_device *idev,
+			   enum iommu_hwpt_type hwpt_type,
+			   struct iommufd_hw_pagetable *parent,
+			   union iommu_domain_user_data *user_data,
+			   bool immediate_attach)
 {
 	const struct iommu_ops *ops = dev_iommu_ops(idev->dev);
+	struct iommu_domain *parent_domain = NULL;
 	struct iommufd_hw_pagetable *hwpt;
 	int rc;
 
 	lockdep_assert_held(&ioas->mutex);
 
+	if (parent && !user_data)
+		return ERR_PTR(-EINVAL);
+	if (user_data && !ops->domain_alloc_user)
+		return ERR_PTR(-EOPNOTSUPP);
+
 	hwpt = iommufd_object_alloc(ictx, hwpt, IOMMUFD_OBJ_HW_PAGETABLE);
 	if (IS_ERR(hwpt))
 		return hwpt;
@@ -89,11 +104,15 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 	/* Pairs with iommufd_hw_pagetable_destroy() */
 	refcount_inc(&ioas->obj.users);
 	hwpt->ioas = ioas;
+	if (parent) {
+		hwpt->parent = parent;
+		parent_domain = parent->domain;
+		refcount_inc(&parent->obj.users);
+	}
 
 	if (ops->domain_alloc_user) {
-		hwpt->domain = ops->domain_alloc_user(idev->dev,
-						      IOMMU_HWPT_TYPE_DEFAULT,
-						      NULL, NULL);
+		hwpt->domain = ops->domain_alloc_user(idev->dev, hwpt_type,
+						      parent_domain, user_data);
 		if (IS_ERR(hwpt->domain)) {
 			rc = PTR_ERR(hwpt->domain);
 			hwpt->domain = NULL;
@@ -107,6 +126,15 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 		}
 	}
 
+	/* It must be either NESTED or UNMANAGED, depending on parent_domain */
+	if (WARN_ON_ONCE((parent_domain &&
+			  hwpt->domain->type != IOMMU_DOMAIN_NESTED) ||
+			 (!parent_domain &&
+			  hwpt->domain->type != IOMMU_DOMAIN_UNMANAGED))) {
+		rc = -EINVAL;
+		goto out_abort;
+	}
+
 	/*
 	 * Set the coherency mode before we do iopt_table_add_domain() as some
 	 * iommus have a per-PTE bit that controls it and need to decide before
@@ -168,7 +196,9 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
 	}
 
 	mutex_lock(&ioas->mutex);
-	hwpt = iommufd_hw_pagetable_alloc(ucmd->ictx, ioas, idev, false);
+	hwpt = iommufd_hw_pagetable_alloc(ucmd->ictx, ioas, idev,
+					  IOMMU_HWPT_TYPE_DEFAULT,
+					  NULL, NULL, false);
 	if (IS_ERR(hwpt)) {
 		rc = PTR_ERR(hwpt);
 		goto out_unlock;
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index dba730129b8c..90dcf4041530 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -8,6 +8,7 @@
 #include <linux/xarray.h>
 #include <linux/refcount.h>
 #include <linux/uaccess.h>
+#include <linux/iommu.h>
 
 struct iommu_domain;
 struct iommu_group;
@@ -243,6 +244,7 @@ int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd);
  */
 struct iommufd_hw_pagetable {
 	struct iommufd_object obj;
+	struct iommufd_hw_pagetable *parent;
 	struct iommufd_ioas *ioas;
 	struct iommu_domain *domain;
 	bool auto_domain : 1;
@@ -254,7 +256,11 @@ struct iommufd_hw_pagetable {
 
 struct iommufd_hw_pagetable *
 iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
-			   struct iommufd_device *idev, bool immediate_attach);
+			   struct iommufd_device *idev,
+			   enum iommu_hwpt_type hwpt_type,
+			   struct iommufd_hw_pagetable *parent,
+			   union iommu_domain_user_data *user_data,
+			   bool immediate_attach);
 int iommufd_hw_pagetable_enforce_cc(struct iommufd_hw_pagetable *hwpt);
 int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
 				struct iommufd_device *idev);
-- 
2.34.1



* [PATCH v3 05/17] iommufd/hw_pagetable: Do not populate user-managed hw_pagetables
  2023-07-24 11:03 [PATCH v3 00/17] iommufd: Add nesting infrastructure Yi Liu
                   ` (3 preceding siblings ...)
  2023-07-24 11:03 ` [PATCH v3 04/17] iommufd: Pass in hwpt_type/parent/user_data to iommufd_hw_pagetable_alloc() Yi Liu
@ 2023-07-24 11:03 ` Yi Liu
  2023-07-28  9:51   ` Tian, Kevin
  2023-07-24 11:03 ` [PATCH v3 06/17] iommufd: Only enforce IOMMU_RESV_SW_MSI when attaching user-managed HWPT Yi Liu
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 66+ messages in thread
From: Yi Liu @ 2023-07-24 11:03 UTC (permalink / raw)
  To: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.l.liu, yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi,
	lulu, suravee.suthikulpanit, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan

From: Nicolin Chen <nicolinc@nvidia.com>

A user-managed hw_pagetable does not need to be populated, since it is
managed by a guest OS. Move the iopt_table_add_domain and list_add_tail
calls into a helper, which only links a hwpt that has no hwpt->parent,
since a parent must already be linked to the IOAS.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/hw_pagetable.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 28122a49b529..c7301cf0e85a 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -58,6 +58,24 @@ int iommufd_hw_pagetable_enforce_cc(struct iommufd_hw_pagetable *hwpt)
 	return 0;
 }
 
+static int iommufd_hw_pagetable_link_ioas(struct iommufd_hw_pagetable *hwpt)
+{
+	int rc;
+
+	/*
+	 * Only a parent hwpt needs to be linked to the IOAS. And a hwpt->parent
+	 * must be linked to the IOAS already, when it's being allocated.
+	 */
+	if (hwpt->parent)
+		return 0;
+
+	rc = iopt_table_add_domain(&hwpt->ioas->iopt, hwpt->domain);
+	if (rc)
+		return rc;
+	list_add_tail(&hwpt->hwpt_item, &hwpt->ioas->hwpt_list);
+	return 0;
+}
+
 /**
  * iommufd_hw_pagetable_alloc() - Get an iommu_domain for a device
  * @ictx: iommufd context
@@ -160,10 +178,9 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 			goto out_abort;
 	}
 
-	rc = iopt_table_add_domain(&hwpt->ioas->iopt, hwpt->domain);
+	rc = iommufd_hw_pagetable_link_ioas(hwpt);
 	if (rc)
 		goto out_detach;
-	list_add_tail(&hwpt->hwpt_item, &hwpt->ioas->hwpt_list);
 	return hwpt;
 
 out_detach:
-- 
2.34.1



* [PATCH v3 06/17] iommufd: Only enforce IOMMU_RESV_SW_MSI when attaching user-managed HWPT
  2023-07-24 11:03 [PATCH v3 00/17] iommufd: Add nesting infrastructure Yi Liu
                   ` (4 preceding siblings ...)
  2023-07-24 11:03 ` [PATCH v3 05/17] iommufd/hw_pagetable: Do not populate user-managed hw_pagetables Yi Liu
@ 2023-07-24 11:03 ` Yi Liu
  2023-07-28 10:02   ` Tian, Kevin
  2023-07-24 11:03 ` [PATCH v3 07/17] iommufd: Add IOMMU_RESV_IOVA_RANGES Yi Liu
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 66+ messages in thread
From: Yi Liu @ 2023-07-24 11:03 UTC (permalink / raw)
  To: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.l.liu, yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi,
	lulu, suravee.suthikulpanit, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan

From: Nicolin Chen <nicolinc@nvidia.com>

Reserved IOVA regions, except for the IOMMU_RESV_DIRECT_RELAXABLE type,
should be excluded from the device's DMA address space. For a single-stage
translation configuration, such reserved regions are excluded in the
attached HWPT, which is managed by the kernel. For a nested translation
configuration there are two stages of HWPTs, and the reserved regions
should be excluded in the stage-1 HWPT, which is managed by userspace. The
current code always excludes the reserved regions in the kernel-managed
stage-2 HWPT, which is incorrect. Instead, the reserved regions should be
reported to userspace and excluded in the stage-1 HWPT.

Besides the above, the IOMMU_RESV_SW_MSI type region needs to be excluded
in the stage-2 HWPT even in the nested translation configuration on ARM,
so the IOMMU_RESV_SW_MSI type region should always be excluded in the
kernel-managed HWPT.

This adds a boolean argument to enforce IOMMU_RESV_SW_MSI only when the
attached HWPT is a user-managed one, i.e. hwpt->parent is valid, making
"!!hwpt->parent" true, so that only the IOMMU_RESV_SW_MSI region is added
to the stage-2 HWPT's reserved iovas.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/device.c          | 6 ++++--
 drivers/iommu/iommufd/io_pagetable.c    | 5 ++++-
 drivers/iommu/iommufd/iommufd_private.h | 3 ++-
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 97e4e5f5aca0..c0917683097f 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -347,7 +347,8 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
 	}
 
 	rc = iopt_table_enforce_dev_resv_regions(&hwpt->ioas->iopt, idev->dev,
-						 &idev->igroup->sw_msi_start);
+						 &idev->igroup->sw_msi_start,
+						 !!hwpt->parent);
 	if (rc)
 		goto err_unlock;
 
@@ -445,7 +446,8 @@ iommufd_device_do_replace(struct iommufd_device *idev,
 	if (hwpt->ioas != old_hwpt->ioas) {
 		list_for_each_entry(cur, &igroup->device_list, group_item) {
 			rc = iopt_table_enforce_dev_resv_regions(
-				&hwpt->ioas->iopt, cur->dev, NULL);
+				&hwpt->ioas->iopt, cur->dev, NULL,
+				!!hwpt->parent);
 			if (rc)
 				goto err_unresv;
 		}
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 4d095115c2d0..e658261f44ed 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -1172,7 +1172,8 @@ void iopt_remove_access(struct io_pagetable *iopt,
 /* Narrow the valid_iova_itree to include reserved ranges from a device. */
 int iopt_table_enforce_dev_resv_regions(struct io_pagetable *iopt,
 					struct device *dev,
-					phys_addr_t *sw_msi_start)
+					phys_addr_t *sw_msi_start,
+					bool sw_msi_only)
 {
 	struct iommu_resv_region *resv;
 	LIST_HEAD(resv_regions);
@@ -1198,6 +1199,8 @@ int iopt_table_enforce_dev_resv_regions(struct io_pagetable *iopt,
 			num_sw_msi++;
 		}
 
+		if (sw_msi_only && resv->type != IOMMU_RESV_SW_MSI)
+			continue;
 		rc = iopt_reserve_iova(iopt, resv->start,
 				       resv->length - 1 + resv->start, dev);
 		if (rc)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 90dcf4041530..268ae0d5ae12 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -79,7 +79,8 @@ void iopt_table_remove_domain(struct io_pagetable *iopt,
 			      struct iommu_domain *domain);
 int iopt_table_enforce_dev_resv_regions(struct io_pagetable *iopt,
 					struct device *dev,
-					phys_addr_t *sw_msi_start);
+					phys_addr_t *sw_msi_start,
+					bool sw_msi_only);
 int iopt_set_allow_iova(struct io_pagetable *iopt,
 			struct rb_root_cached *allowed_iova);
 int iopt_reserve_iova(struct io_pagetable *iopt, unsigned long start,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH v3 07/17] iommufd: Add IOMMU_RESV_IOVA_RANGES
  2023-07-24 11:03 [PATCH v3 00/17] iommufd: Add nesting infrastructure Yi Liu
                   ` (5 preceding siblings ...)
  2023-07-24 11:03 ` [PATCH v3 06/17] iommufd: Only enforce IOMMU_RESV_SW_MSI when attaching user-managed HWPT Yi Liu
@ 2023-07-24 11:03 ` Yi Liu
  2023-07-28 10:07   ` Tian, Kevin
  2023-07-28 17:44   ` Jason Gunthorpe
  2023-07-24 11:03 ` [PATCH v3 08/17] iommufd: IOMMU_HWPT_ALLOC allocation with user data Yi Liu
                   ` (9 subsequent siblings)
  16 siblings, 2 replies; 66+ messages in thread
From: Yi Liu @ 2023-07-24 11:03 UTC (permalink / raw)
  To: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.l.liu, yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi,
	lulu, suravee.suthikulpanit, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan

This reports a device's reserved IOVA regions to userspace. This is needed
for nested translation, where userspace owns the stage-1 HWPT and needs to
exclude the reserved IOVA regions from the stage-1 HWPT, hence excluding
them from the device's DMA address space.

This can also be used to figure out the allowed IOVA ranges of an IOAS.
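
The num_iovas in/out contract follows the same pattern as
IOMMU_IOAS_IOVA_RANGES: the kernel fills as many entries as the supplied
array holds, always reports the total, and returns -EMSGSIZE on truncation.
A hedged userspace model of that retry protocol (query_resv() is an
illustrative stand-in for the ioctl; the reserved ranges are made up):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

struct range { uint64_t start, last; };

/* Made-up reserved regions standing in for a real device's list */
static const struct range device_resv[] = {
	{ 0xfee00000, 0xfeefffff },	/* e.g. an MSI window */
	{ 0xfd000000, 0xfdffffff },
};

/*
 * Model of the IOMMU_RESV_IOVA_RANGES contract: on input *num is the
 * array capacity; on output it is the total number of ranges. Entries
 * are filled as space permits; -EMSGSIZE signals the caller to retry
 * with a larger array.
 */
static int query_resv(struct range *out, uint32_t *num)
{
	uint32_t max = *num;
	uint32_t total = sizeof(device_resv) / sizeof(device_resv[0]);

	for (uint32_t i = 0; i < total && i < max; i++)
		out[i] = device_resv[i];
	*num = total;
	return total > max ? -EMSGSIZE : 0;
}
```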

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/main.c | 54 ++++++++++++++++++++++++++++++++++++
 include/uapi/linux/iommufd.h | 46 ++++++++++++++++++++++++++++++
 2 files changed, 100 insertions(+)

diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index bd3efc1d8509..510db114fc61 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -250,6 +250,57 @@ static int iommufd_get_hw_info(struct iommufd_ucmd *ucmd)
 	return rc;
 }
 
+static int iommufd_resv_iova_ranges(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_resv_iova_ranges *cmd = ucmd->cmd;
+	struct iommu_resv_iova_range __user *ranges;
+	struct iommu_resv_region *resv;
+	struct iommufd_device *idev;
+	LIST_HEAD(resv_regions);
+	u32 max_iovas;
+	int rc;
+
+	if (cmd->__reserved)
+		return -EOPNOTSUPP;
+
+	idev = iommufd_get_device(ucmd, cmd->dev_id);
+	if (IS_ERR(idev))
+		return PTR_ERR(idev);
+
+	max_iovas = cmd->num_iovas;
+	ranges = u64_to_user_ptr(cmd->resv_iovas);
+	cmd->num_iovas = 0;
+
+	iommu_get_resv_regions(idev->dev, &resv_regions);
+
+	list_for_each_entry(resv, &resv_regions, list) {
+		if (resv->type == IOMMU_RESV_DIRECT_RELAXABLE)
+			continue;
+		if (cmd->num_iovas < max_iovas) {
+			struct iommu_resv_iova_range elm = {
+				.start = resv->start,
+				.last = resv->length - 1 + resv->start,
+			};
+
+			if (copy_to_user(&ranges[cmd->num_iovas], &elm,
+					 sizeof(elm))) {
+				rc = -EFAULT;
+				goto out_put;
+			}
+		}
+		cmd->num_iovas++;
+	}
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+	if (rc)
+		goto out_put;
+	if (cmd->num_iovas > max_iovas)
+		rc = -EMSGSIZE;
+out_put:
+	iommu_put_resv_regions(idev->dev, &resv_regions);
+	iommufd_put_object(&idev->obj);
+	return rc;
+}
+
 static int iommufd_fops_open(struct inode *inode, struct file *filp)
 {
 	struct iommufd_ctx *ictx;
@@ -347,6 +398,7 @@ union ucmd_buffer {
 	struct iommu_ioas_map map;
 	struct iommu_ioas_unmap unmap;
 	struct iommu_option option;
+	struct iommu_resv_iova_ranges resv_ranges;
 	struct iommu_vfio_ioas vfio_ioas;
 #ifdef CONFIG_IOMMUFD_TEST
 	struct iommu_test_cmd test;
@@ -389,6 +441,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 		 length),
 	IOCTL_OP(IOMMU_OPTION, iommufd_option, struct iommu_option,
 		 val64),
+	IOCTL_OP(IOMMU_RESV_IOVA_RANGES, iommufd_resv_iova_ranges,
+		 struct iommu_resv_iova_ranges, resv_iovas),
 	IOCTL_OP(IOMMU_VFIO_IOAS, iommufd_vfio_ioas, struct iommu_vfio_ioas,
 		 __reserved),
 #ifdef CONFIG_IOMMUFD_TEST
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index d38bc54fd5f2..f2026cde2d64 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -47,6 +47,7 @@ enum {
 	IOMMUFD_CMD_VFIO_IOAS,
 	IOMMUFD_CMD_HWPT_ALLOC,
 	IOMMUFD_CMD_GET_HW_INFO,
+	IOMMUFD_CMD_RESV_IOVA_RANGES,
 };
 
 /**
@@ -422,4 +423,49 @@ struct iommu_hw_info {
 	__u32 __reserved;
 };
 #define IOMMU_GET_HW_INFO _IO(IOMMUFD_TYPE, IOMMUFD_CMD_GET_HW_INFO)
+
+/**
+ * struct iommu_resv_iova_range - ioctl(IOMMU_RESV_IOVA_RANGE)
+ * @start: First IOVA
+ * @last: Inclusive last IOVA
+ *
+ * An interval in IOVA space.
+ */
+struct iommu_resv_iova_range {
+	__aligned_u64 start;
+	__aligned_u64 last;
+};
+
+/**
+ * struct iommu_resv_iova_ranges - ioctl(IOMMU_RESV_IOVA_RANGES)
+ * @size: sizeof(struct iommu_resv_iova_ranges)
+ * @dev_id: device to read resv iova ranges for
+ * @num_iovas: Input/Output total number of resv ranges for the device
+ * @__reserved: Must be 0
+ * @resv_iovas: Pointer to the output array of struct iommu_resv_iova_range
+ *
+ * Query a device for ranges of reserved IOVAs. num_iovas will be set to the
+ * total number of iovas and the resv_iovas[] will be filled in as space
+ * permits.
+ *
+ * On input num_iovas is the length of the resv_iovas array. On output it is
+ * the total number of iovas filled in. The ioctl will return -EMSGSIZE and
+ * set num_iovas to the required value if num_iovas is too small. In this
+ * case the caller should allocate a larger output array and re-issue the
+ * ioctl.
+ *
+ * Under nested translation, userspace should query the reserved IOVAs for a
+ * given device, and report it to the stage-1 I/O page table owner to exclude
+ * the reserved IOVAs. The reserved IOVAs can also be used to figure out the
+ * allowed IOVA ranges for the IOAS that the device is attached to. For detail
+ * see ioctl IOMMU_IOAS_IOVA_RANGES.
+ */
+struct iommu_resv_iova_ranges {
+	__u32 size;
+	__u32 dev_id;
+	__u32 num_iovas;
+	__u32 __reserved;
+	__aligned_u64 resv_iovas;
+};
+#define IOMMU_RESV_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_RESV_IOVA_RANGES)
 #endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH v3 08/17] iommufd: IOMMU_HWPT_ALLOC allocation with user data
  2023-07-24 11:03 [PATCH v3 00/17] iommufd: Add nesting infrastructure Yi Liu
                   ` (6 preceding siblings ...)
  2023-07-24 11:03 ` [PATCH v3 07/17] iommufd: Add IOMMU_RESV_IOVA_RANGES Yi Liu
@ 2023-07-24 11:03 ` Yi Liu
  2023-07-28 17:55   ` Jason Gunthorpe
  2023-07-24 11:03 ` [PATCH v3 09/17] iommufd: Add IOMMU_HWPT_INVALIDATE Yi Liu
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 66+ messages in thread
From: Yi Liu @ 2023-07-24 11:03 UTC (permalink / raw)
  To: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.l.liu, yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi,
	lulu, suravee.suthikulpanit, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan

IOMMU_HWPT_ALLOC already supports iommu_domain allocation for userspace,
but it can only allocate hw_pagetables linked to an IOAS. There is a need
to support hw_pagetable allocation with user-specified parameters. For
example, in nested translation the user needs to allocate a hw_pagetable
for the stage-1 translation (e.g. a single I/O page table or a set of I/O
page tables) with user data, and also needs to provide a stage-2
hw_pagetable linked to the GPA IOAS.

This extends IOMMU_HWPT_ALLOC to accept user-specified parameters and a
hwpt ID in the @pt_id field. This can be used to allocate a user-managed
stage-1 hwpt, which requires a parent hwpt pointing to the stage-2
translation.
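
A hedged userspace sketch of the resulting two-call flow. The struct layout
is an abbreviated local copy of the uapi addition in this patch,
IOMMU_HWPT_TYPE_VTD_S1 and the vendor data are hypothetical placeholders,
and the actual ioctl(2) calls are elided:

```c
#include <assert.h>
#include <stdint.h>

/* Local, abbreviated copy of the extended uapi struct from this patch */
struct iommu_hwpt_alloc {
	uint32_t size, flags, dev_id, pt_id, out_hwpt_id, reserved;
	uint32_t hwpt_type, data_len;
	uint64_t data_uptr;
};

#define IOMMU_HWPT_TYPE_DEFAULT	0	/* kernel-managed, user data ignored */
#define IOMMU_HWPT_TYPE_VTD_S1	1	/* hypothetical vendor stage-1 type */

/* Step 1: kernel-managed stage-2 HWPT allocated from the GPA IOAS */
static struct iommu_hwpt_alloc stage2_req(uint32_t dev_id, uint32_t gpa_ioas_id)
{
	struct iommu_hwpt_alloc cmd = {
		.size = sizeof(cmd),
		.dev_id = dev_id,
		.pt_id = gpa_ioas_id,		/* an IOAS object ID */
		.hwpt_type = IOMMU_HWPT_TYPE_DEFAULT,
		/* data_len/data_uptr must be 0 for the default type */
	};
	return cmd;
}

/* Step 2: user-managed stage-1 HWPT nested on the stage-2 HWPT */
static struct iommu_hwpt_alloc stage1_req(uint32_t dev_id, uint32_t s2_hwpt_id,
					  const void *vendor_data, uint32_t len)
{
	struct iommu_hwpt_alloc cmd = {
		.size = sizeof(cmd),
		.dev_id = dev_id,
		.pt_id = s2_hwpt_id,		/* now a HWPT object, not an IOAS */
		.hwpt_type = IOMMU_HWPT_TYPE_VTD_S1,
		.data_len = len,		/* must be non-zero for non-default types */
		.data_uptr = (uint64_t)(uintptr_t)vendor_data,
	};
	return cmd;
}
```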

Co-developed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/hw_pagetable.c | 65 ++++++++++++++++++++++++----
 drivers/iommu/iommufd/main.c         |  2 +-
 include/uapi/linux/iommufd.h         | 20 ++++++++-
 3 files changed, 77 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index c7301cf0e85a..97e4114226de 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -193,29 +193,75 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 
 int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
 {
+	struct iommufd_hw_pagetable *hwpt, *parent = NULL;
+	union iommu_domain_user_data *data = NULL;
 	struct iommu_hwpt_alloc *cmd = ucmd->cmd;
-	struct iommufd_hw_pagetable *hwpt;
+	struct iommufd_object *pt_obj;
 	struct iommufd_device *idev;
 	struct iommufd_ioas *ioas;
-	int rc;
+	int rc = 0;
 
 	if (cmd->flags || cmd->__reserved)
 		return -EOPNOTSUPP;
+	if (!cmd->data_len && cmd->hwpt_type != IOMMU_HWPT_TYPE_DEFAULT)
+		return -EINVAL;
 
 	idev = iommufd_get_device(ucmd, cmd->dev_id);
 	if (IS_ERR(idev))
 		return PTR_ERR(idev);
 
-	ioas = iommufd_get_ioas(ucmd->ictx, cmd->pt_id);
-	if (IS_ERR(ioas)) {
-		rc = PTR_ERR(ioas);
+	pt_obj = iommufd_get_object(ucmd->ictx, cmd->pt_id, IOMMUFD_OBJ_ANY);
+	if (IS_ERR(pt_obj)) {
+		rc = -EINVAL;
 		goto out_put_idev;
 	}
 
+	switch (pt_obj->type) {
+	case IOMMUFD_OBJ_IOAS:
+		ioas = container_of(pt_obj, struct iommufd_ioas, obj);
+		break;
+	case IOMMUFD_OBJ_HW_PAGETABLE:
+		/* pt_id points HWPT only when hwpt_type is !IOMMU_HWPT_TYPE_DEFAULT */
+		if (cmd->hwpt_type == IOMMU_HWPT_TYPE_DEFAULT) {
+			rc = -EINVAL;
+			goto out_put_pt;
+		}
+
+		parent = container_of(pt_obj, struct iommufd_hw_pagetable, obj);
+		/*
+		 * Cannot allocate user-managed hwpt linking to auto_created
+		 * hwpt. If the parent hwpt is already a user-managed hwpt,
+		 * don't allocate another user-managed hwpt linking to it.
+		 */
+		if (parent->auto_domain || parent->parent) {
+			rc = -EINVAL;
+			goto out_put_pt;
+		}
+		ioas = parent->ioas;
+		break;
+	default:
+		rc = -EINVAL;
+		goto out_put_pt;
+	}
+
+	if (cmd->data_len) {
+		data = kzalloc(sizeof(*data), GFP_KERNEL);
+		if (!data) {
+			rc = -ENOMEM;
+			goto out_put_pt;
+		}
+
+		rc = copy_struct_from_user(data, sizeof(*data),
+					   u64_to_user_ptr(cmd->data_uptr),
+					   cmd->data_len);
+		if (rc)
+			goto out_free_data;
+	}
+
 	mutex_lock(&ioas->mutex);
 	hwpt = iommufd_hw_pagetable_alloc(ucmd->ictx, ioas, idev,
-					  IOMMU_HWPT_TYPE_DEFAULT,
-					  NULL, NULL, false);
+					  cmd->hwpt_type,
+					  parent, data, false);
 	if (IS_ERR(hwpt)) {
 		rc = PTR_ERR(hwpt);
 		goto out_unlock;
@@ -232,7 +278,10 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
 	iommufd_object_abort_and_destroy(ucmd->ictx, &hwpt->obj);
 out_unlock:
 	mutex_unlock(&ioas->mutex);
-	iommufd_put_object(&ioas->obj);
+out_free_data:
+	kfree(data);
+out_put_pt:
+	iommufd_put_object(pt_obj);
 out_put_idev:
 	iommufd_put_object(&idev->obj);
 	return rc;
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 510db114fc61..5f4420626421 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -426,7 +426,7 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 	IOCTL_OP(IOMMU_GET_HW_INFO, iommufd_get_hw_info, struct iommu_hw_info,
 		 __reserved),
 	IOCTL_OP(IOMMU_HWPT_ALLOC, iommufd_hwpt_alloc, struct iommu_hwpt_alloc,
-		 __reserved),
+		 data_uptr),
 	IOCTL_OP(IOMMU_IOAS_ALLOC, iommufd_ioas_alloc_ioctl,
 		 struct iommu_ioas_alloc, out_ioas_id),
 	IOCTL_OP(IOMMU_IOAS_ALLOW_IOVAS, iommufd_ioas_allow_iovas,
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index f2026cde2d64..73bf9af91e99 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -364,12 +364,27 @@ enum iommu_hwpt_type {
  * @pt_id: The IOAS to connect this HWPT to
  * @out_hwpt_id: The ID of the new HWPT
  * @__reserved: Must be 0
+ * @hwpt_type: One of enum iommu_hwpt_type
+ * @data_len: Length of the type specific data
+ * @data_uptr: User pointer to the type specific data
  *
  * Explicitly allocate a hardware page table object. This is the same object
  * type that is returned by iommufd_device_attach() and represents the
  * underlying iommu driver's iommu_domain kernel object.
  *
- * A HWPT will be created with the IOVA mappings from the given IOAS.
+ * A kernel-managed HWPT will be created with the mappings from the given
+ * IOAS via the @pt_id. The @hwpt_type for this allocation can be set to
+ * either IOMMU_HWPT_TYPE_DEFAULT or a pre-defined type corresponding to
+ * an I/O page table type supported by the underlying IOMMU hardware.
+ *
+ * A user-managed HWPT will be created from a given parent HWPT via the
+ * @pt_id, in which the parent HWPT must be allocated previously via the
+ * same ioctl from a given IOAS (@pt_id). In this case, the @hwpt_type
+ * must be set to a pre-defined type corresponding to an I/O page table
+ * type supported by the underlying IOMMU hardware.
+ *
+ * If the @hwpt_type is set to IOMMU_HWPT_TYPE_DEFAULT, both the @data_len
+ * and the @data_uptr will be ignored. Otherwise, both must be given.
  */
 struct iommu_hwpt_alloc {
 	__u32 size;
@@ -378,6 +393,9 @@ struct iommu_hwpt_alloc {
 	__u32 pt_id;
 	__u32 out_hwpt_id;
 	__u32 __reserved;
+	__u32 hwpt_type;
+	__u32 data_len;
+	__aligned_u64 data_uptr;
 };
 #define IOMMU_HWPT_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_ALLOC)
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH v3 09/17] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-07-24 11:03 [PATCH v3 00/17] iommufd: Add nesting infrastructure Yi Liu
                   ` (7 preceding siblings ...)
  2023-07-24 11:03 ` [PATCH v3 08/17] iommufd: IOMMU_HWPT_ALLOC allocation with user data Yi Liu
@ 2023-07-24 11:03 ` Yi Liu
  2023-07-28 18:02   ` Jason Gunthorpe
  2023-07-24 11:03 ` [PATCH v3 10/17] iommufd/selftest: Add a helper to get test device Yi Liu
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 66+ messages in thread
From: Yi Liu @ 2023-07-24 11:03 UTC (permalink / raw)
  To: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.l.liu, yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi,
	lulu, suravee.suthikulpanit, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan

In nested translation, the stage-1 page table is user-managed but consumed
directly by the IOMMU hardware, so an update to any present entry in the
stage-1 page table should be followed by an IOTLB invalidation.

This adds IOMMU_HWPT_INVALIDATE for stage-1 IOTLB invalidation.
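
The handler reuses the ucmd buffer and relies on copy_struct_from_user()
to reconcile the driver's expected size (cache_invalidate_user_data_len)
with the user-supplied data_len. A hedged userspace model of that
size-reconciliation behavior (copy_struct_model() is an illustrative
re-implementation, not the kernel helper):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Model of copy_struct_from_user() as used by the invalidation path:
 * ksize is the driver's cache_invalidate_user_data_len, usize is
 * cmd->data_len. A shorter user struct is zero-filled; a longer one is
 * accepted only if the unknown tail is all zeros (forward compat).
 */
static int copy_struct_model(void *dst, size_t ksize,
			     const void *src, size_t usize)
{
	size_t n = ksize < usize ? ksize : usize;

	memset(dst, 0, ksize);		/* zero-fill shorter user structs */
	memcpy(dst, src, n);
	if (usize > ksize) {		/* newer userspace, older kernel */
		const unsigned char *p = src;

		for (size_t i = ksize; i < usize; i++)
			if (p[i])
				return -E2BIG;	/* non-zero unknown tail */
	}
	return 0;
}
```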

Co-developed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/hw_pagetable.c    | 45 +++++++++++++++++++++++++
 drivers/iommu/iommufd/iommufd_private.h |  9 +++++
 drivers/iommu/iommufd/main.c            |  3 ++
 include/uapi/linux/iommufd.h            | 22 ++++++++++++
 4 files changed, 79 insertions(+)

diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 97e4114226de..9064e6d181b4 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -286,3 +286,48 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
 	iommufd_put_object(&idev->obj);
 	return rc;
 }
+
+int iommufd_hwpt_invalidate(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_hwpt_invalidate *cmd = ucmd->cmd;
+	struct iommufd_hw_pagetable *hwpt;
+	u32 user_data_len, klen;
+	u64 user_ptr;
+	int rc = 0;
+
+	if (!cmd->data_len || cmd->__reserved)
+		return -EOPNOTSUPP;
+
+	hwpt = iommufd_get_hwpt(ucmd, cmd->hwpt_id);
+	if (IS_ERR(hwpt))
+		return PTR_ERR(hwpt);
+
+	/* Do not allow any kernel-managed hw_pagetable */
+	if (!hwpt->parent) {
+		rc = -EINVAL;
+		goto out_put_hwpt;
+	}
+
+	klen = hwpt->domain->ops->cache_invalidate_user_data_len;
+	if (!hwpt->domain->ops->cache_invalidate_user || !klen) {
+		rc = -EOPNOTSUPP;
+		goto out_put_hwpt;
+	}
+
+	/*
+	 * Copy the needed fields before reusing the ucmd buffer, this
+	 * avoids memory allocation in this path.
+	 */
+	user_ptr = cmd->data_uptr;
+	user_data_len = cmd->data_len;
+
+	rc = copy_struct_from_user(cmd, klen,
+				   u64_to_user_ptr(user_ptr), user_data_len);
+	if (rc)
+		goto out_put_hwpt;
+
+	rc = hwpt->domain->ops->cache_invalidate_user(hwpt->domain, cmd);
+out_put_hwpt:
+	iommufd_put_object(&hwpt->obj);
+	return rc;
+}
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 268ae0d5ae12..047317fa4df0 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -270,6 +270,7 @@ iommufd_hw_pagetable_detach(struct iommufd_device *idev);
 void iommufd_hw_pagetable_destroy(struct iommufd_object *obj);
 void iommufd_hw_pagetable_abort(struct iommufd_object *obj);
 int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd);
+int iommufd_hwpt_invalidate(struct iommufd_ucmd *ucmd);
 
 static inline void iommufd_hw_pagetable_put(struct iommufd_ctx *ictx,
 					    struct iommufd_hw_pagetable *hwpt)
@@ -281,6 +282,14 @@ static inline void iommufd_hw_pagetable_put(struct iommufd_ctx *ictx,
 		refcount_dec(&hwpt->obj.users);
 }
 
+static inline struct iommufd_hw_pagetable *
+iommufd_get_hwpt(struct iommufd_ucmd *ucmd, u32 id)
+{
+	return container_of(iommufd_get_object(ucmd->ictx, id,
+					       IOMMUFD_OBJ_HW_PAGETABLE),
+			    struct iommufd_hw_pagetable, obj);
+}
+
 struct iommufd_group {
 	struct kref ref;
 	struct mutex lock;
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 5f4420626421..255e8a3c5b0e 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -391,6 +391,7 @@ union ucmd_buffer {
 	struct iommu_destroy destroy;
 	struct iommu_hw_info info;
 	struct iommu_hwpt_alloc hwpt;
+	struct iommu_hwpt_invalidate cache;
 	struct iommu_ioas_alloc alloc;
 	struct iommu_ioas_allow_iovas allow_iovas;
 	struct iommu_ioas_copy ioas_copy;
@@ -427,6 +428,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 		 __reserved),
 	IOCTL_OP(IOMMU_HWPT_ALLOC, iommufd_hwpt_alloc, struct iommu_hwpt_alloc,
 		 data_uptr),
+	IOCTL_OP(IOMMU_HWPT_INVALIDATE, iommufd_hwpt_invalidate,
+		 struct iommu_hwpt_invalidate, data_uptr),
 	IOCTL_OP(IOMMU_IOAS_ALLOC, iommufd_ioas_alloc_ioctl,
 		 struct iommu_ioas_alloc, out_ioas_id),
 	IOCTL_OP(IOMMU_IOAS_ALLOW_IOVAS, iommufd_ioas_allow_iovas,
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 73bf9af91e99..034da283cd3a 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -48,6 +48,7 @@ enum {
 	IOMMUFD_CMD_HWPT_ALLOC,
 	IOMMUFD_CMD_GET_HW_INFO,
 	IOMMUFD_CMD_RESV_IOVA_RANGES,
+	IOMMUFD_CMD_HWPT_INVALIDATE,
 };
 
 /**
@@ -486,4 +487,25 @@ struct iommu_resv_iova_ranges {
 	__aligned_u64 resv_iovas;
 };
 #define IOMMU_RESV_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_RESV_IOVA_RANGES)
+
+/**
+ * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE)
+ * @size: sizeof(struct iommu_hwpt_invalidate)
+ * @hwpt_id: HWPT ID of target hardware page table for the invalidation
+ * @data_len: Length of the type specific data
+ * @__reserved: Must be 0
+ * @data_uptr: User pointer to the type specific data
+ *
+ * Invalidate the iommu cache for user-managed page table. Modifications
+ * on user-managed page table should be followed with this operation to
+ * sync the IOTLB. The data in @data_uptr differs per the hwpt type.
+ */
+struct iommu_hwpt_invalidate {
+	__u32 size;
+	__u32 hwpt_id;
+	__u32 data_len;
+	__u32 __reserved;
+	__aligned_u64 data_uptr;
+};
+#define IOMMU_HWPT_INVALIDATE _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_INVALIDATE)
 #endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH v3 10/17] iommufd/selftest: Add a helper to get test device
  2023-07-24 11:03 [PATCH v3 00/17] iommufd: Add nesting infrastructure Yi Liu
                   ` (8 preceding siblings ...)
  2023-07-24 11:03 ` [PATCH v3 09/17] iommufd: Add IOMMU_HWPT_INVALIDATE Yi Liu
@ 2023-07-24 11:03 ` Yi Liu
  2023-07-24 11:04 ` [PATCH v3 11/17] iommufd/selftest: Add IOMMU_TEST_OP_DEV_[ADD|DEL]_RESERVED to add/del reserved regions to selftest device Yi Liu
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 66+ messages in thread
From: Yi Liu @ 2023-07-24 11:03 UTC (permalink / raw)
  To: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.l.liu, yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi,
	lulu, suravee.suthikulpanit, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan

There is a need to get the selftest device (sobj->type == TYPE_IDEV) in
multiple places, so add a helper to get it.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/selftest.c | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index af7459e211ad..72d0c37e0b5e 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -472,29 +472,39 @@ static int iommufd_test_mock_domain(struct iommufd_ucmd *ucmd,
 	return rc;
 }
 
-/* Replace the mock domain with a manually allocated hw_pagetable */
-static int iommufd_test_mock_domain_replace(struct iommufd_ucmd *ucmd,
-					    unsigned int device_id, u32 pt_id,
-					    struct iommu_test_cmd *cmd)
+static struct selftest_obj *iommufd_test_get_device(struct iommufd_ctx *ictx,
+						    u32 id)
 {
 	struct iommufd_object *dev_obj;
 	struct selftest_obj *sobj;
-	int rc;
 
 	/*
 	 * Prefer to use the OBJ_SELFTEST because the destroy_rwsem will ensure
 	 * it doesn't race with detach, which is not allowed.
 	 */
-	dev_obj =
-		iommufd_get_object(ucmd->ictx, device_id, IOMMUFD_OBJ_SELFTEST);
+	dev_obj = iommufd_get_object(ictx, id, IOMMUFD_OBJ_SELFTEST);
 	if (IS_ERR(dev_obj))
-		return PTR_ERR(dev_obj);
+		return (struct selftest_obj *)dev_obj;
 
 	sobj = container_of(dev_obj, struct selftest_obj, obj);
 	if (sobj->type != TYPE_IDEV) {
-		rc = -EINVAL;
-		goto out_dev_obj;
+		iommufd_put_object(dev_obj);
+		return ERR_PTR(-EINVAL);
 	}
+	return sobj;
+}
+
+/* Replace the mock domain with a manually allocated hw_pagetable */
+static int iommufd_test_mock_domain_replace(struct iommufd_ucmd *ucmd,
+					    unsigned int device_id, u32 pt_id,
+					    struct iommu_test_cmd *cmd)
+{
+	struct selftest_obj *sobj;
+	int rc;
+
+	sobj = iommufd_test_get_device(ucmd->ictx, device_id);
+	if (IS_ERR(sobj))
+		return PTR_ERR(sobj);
 
 	rc = iommufd_device_replace(sobj->idev.idev, &pt_id);
 	if (rc)
@@ -504,7 +514,7 @@ static int iommufd_test_mock_domain_replace(struct iommufd_ucmd *ucmd,
 	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
 
 out_dev_obj:
-	iommufd_put_object(dev_obj);
+	iommufd_put_object(&sobj->obj);
 	return rc;
 }
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH v3 11/17] iommufd/selftest: Add IOMMU_TEST_OP_DEV_[ADD|DEL]_RESERVED to add/del reserved regions to selftest device
  2023-07-24 11:03 [PATCH v3 00/17] iommufd: Add nesting infrastructure Yi Liu
                   ` (9 preceding siblings ...)
  2023-07-24 11:03 ` [PATCH v3 10/17] iommufd/selftest: Add a helper to get test device Yi Liu
@ 2023-07-24 11:04 ` Yi Liu
  2023-07-24 11:04 ` [PATCH v3 12/17] iommufd/selftest: Add .get_resv_regions() for mock_dev Yi Liu
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 66+ messages in thread
From: Yi Liu @ 2023-07-24 11:04 UTC (permalink / raw)
  To: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.l.liu, yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi,
	lulu, suravee.suthikulpanit, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan

This allows the test tool to specify reserved regions for the selftest
device, and hence to validate the IOMMU_RESV_IOVA_RANGES ioctl.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/iommufd_test.h |  6 ++
 drivers/iommu/iommufd/selftest.c     | 82 ++++++++++++++++++++++++++++
 2 files changed, 88 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h
index 3f3644375bf1..b7f4bee2ea1b 100644
--- a/drivers/iommu/iommufd/iommufd_test.h
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -19,6 +19,8 @@ enum {
 	IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT,
 	IOMMU_TEST_OP_MOCK_DOMAIN_REPLACE,
 	IOMMU_TEST_OP_ACCESS_REPLACE_IOAS,
+	IOMMU_TEST_OP_DEV_ADD_RESERVED,
+	IOMMU_TEST_OP_DEV_DEL_RESERVED,
 };
 
 enum {
@@ -50,6 +52,10 @@ struct iommu_test_cmd {
 			__aligned_u64 start;
 			__aligned_u64 length;
 		} add_reserved;
+		struct {
+			__aligned_u64 start;
+			__aligned_u64 length;
+		} add_dev_reserved;
 		struct {
 			__u32 out_stdev_id;
 			__u32 out_hwpt_id;
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index 72d0c37e0b5e..271c6c261eb4 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -93,6 +93,8 @@ enum selftest_obj_type {
 
 struct mock_dev {
 	struct device dev;
+	struct rw_semaphore reserved_rwsem;
+	struct rb_root_cached reserved_itree;
 };
 
 struct selftest_obj {
@@ -385,6 +387,10 @@ static struct mock_dev *mock_dev_create(void)
 	mutex_init(&mdev->dev.iommu->lock);
 	mdev->dev.iommu->iommu_dev = &mock_iommu_device;
 
+	/* Reserve range related fields */
+	init_rwsem(&mdev->reserved_rwsem);
+	mdev->reserved_itree = RB_ROOT_CACHED;
+
 	rc = device_add(&mdev->dev);
 	if (rc)
 		goto err_dev_iommu;
@@ -407,8 +413,20 @@ static struct mock_dev *mock_dev_create(void)
 	return ERR_PTR(rc);
 }
 
+static void mock_dev_remove_reserved_iovas(struct mock_dev *mdev)
+{
+	struct interval_tree_node *node;
+
+	while ((node = interval_tree_iter_first(&mdev->reserved_itree, 0,
+						ULONG_MAX))) {
+		interval_tree_remove(node, &mdev->reserved_itree);
+		kfree(node);
+	}
+}
+
 static void mock_dev_destroy(struct mock_dev *mdev)
 {
+	mock_dev_remove_reserved_iovas(mdev);
 	iommu_group_remove_device(&mdev->dev);
 	device_del(&mdev->dev);
 	kfree(mdev->dev.iommu);
@@ -536,6 +554,64 @@ static int iommufd_test_add_reserved(struct iommufd_ucmd *ucmd,
 	return rc;
 }
 
+static int mock_dev_add_reserved_iova(struct mock_dev *mdev,
+				      unsigned long start, unsigned long last)
+{
+	struct interval_tree_node *reserved;
+
+	if (interval_tree_iter_first(&mdev->reserved_itree, start, last))
+		return -EADDRINUSE;
+
+	reserved = kzalloc(sizeof(*reserved), GFP_KERNEL_ACCOUNT);
+	if (!reserved)
+		return -ENOMEM;
+	reserved->start = start;
+	reserved->last = last;
+	interval_tree_insert(reserved, &mdev->reserved_itree);
+	return 0;
+}
+
+/* Add a reserved IOVA region to a device */
+static int iommufd_test_dev_add_reserved(struct iommufd_ucmd *ucmd,
+					 unsigned int device_id,
+					 unsigned long start, size_t length)
+{
+	struct selftest_obj *sobj;
+	struct mock_dev *mdev;
+	int rc;
+
+	sobj = iommufd_test_get_device(ucmd->ictx, device_id);
+	if (IS_ERR(sobj))
+		return PTR_ERR(sobj);
+
+	mdev = sobj->idev.mock_dev;
+	down_write(&mdev->reserved_rwsem);
+	rc = mock_dev_add_reserved_iova(sobj->idev.mock_dev,
+					start, start + length - 1);
+	up_write(&mdev->reserved_rwsem);
+	iommufd_put_object(&sobj->obj);
+	return rc;
+}
+
+/* Remove all reserved IOVA regions of a device */
+static int iommufd_test_dev_del_reserved(struct iommufd_ucmd *ucmd,
+					 unsigned int device_id)
+{
+	struct selftest_obj *sobj;
+	struct mock_dev *mdev;
+
+	sobj = iommufd_test_get_device(ucmd->ictx, device_id);
+	if (IS_ERR(sobj))
+		return PTR_ERR(sobj);
+
+	mdev = sobj->idev.mock_dev;
+	down_write(&mdev->reserved_rwsem);
+	mock_dev_remove_reserved_iovas(mdev);
+	up_write(&mdev->reserved_rwsem);
+	iommufd_put_object(&sobj->obj);
+	return 0;
+}
+
 /* Check that every pfn under each iova matches the pfn under a user VA */
 static int iommufd_test_md_check_pa(struct iommufd_ucmd *ucmd,
 				    unsigned int mockpt_id, unsigned long iova,
@@ -1025,6 +1101,12 @@ int iommufd_test(struct iommufd_ucmd *ucmd)
 		return iommufd_test_add_reserved(ucmd, cmd->id,
 						 cmd->add_reserved.start,
 						 cmd->add_reserved.length);
+	case IOMMU_TEST_OP_DEV_ADD_RESERVED:
+		return iommufd_test_dev_add_reserved(ucmd, cmd->id,
+						     cmd->add_dev_reserved.start,
+						     cmd->add_dev_reserved.length);
+	case IOMMU_TEST_OP_DEV_DEL_RESERVED:
+		return iommufd_test_dev_del_reserved(ucmd, cmd->id);
 	case IOMMU_TEST_OP_MOCK_DOMAIN:
 		return iommufd_test_mock_domain(ucmd, cmd);
 	case IOMMU_TEST_OP_MOCK_DOMAIN_REPLACE:
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH v3 12/17] iommufd/selftest: Add .get_resv_regions() for mock_dev
  2023-07-24 11:03 [PATCH v3 00/17] iommufd: Add nesting infrastructure Yi Liu
                   ` (10 preceding siblings ...)
  2023-07-24 11:04 ` [PATCH v3 11/17] iommufd/selftest: Add IOMMU_TEST_OP_DEV_[ADD|DEL]_RESERVED to add/del reserved regions to selftest device Yi Liu
@ 2023-07-24 11:04 ` Yi Liu
  2023-07-24 11:04 ` [PATCH v3 13/17] iommufd/selftest: Add coverage for IOMMU_RESV_IOVA_RANGES Yi Liu
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 66+ messages in thread
From: Yi Liu @ 2023-07-24 11:04 UTC (permalink / raw)
  To: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.l.liu, yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi,
	lulu, suravee.suthikulpanit, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan

This makes mock_dev report its reserved regions (resv_regions) via the
.get_resv_regions() callback.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/selftest.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index 271c6c261eb4..e40780db88b5 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -292,6 +292,32 @@ static void mock_domain_set_plaform_dma_ops(struct device *dev)
 	 */
 }
 
+static void iommufd_test_get_resv_regions(struct device *dev,
+					  struct list_head *head)
+{
+	struct mock_dev *mdev = container_of(dev, struct mock_dev, dev);
+	struct interval_tree_node *reserved, *next;
+
+	down_read(&mdev->reserved_rwsem);
+	for (reserved = interval_tree_iter_first(&mdev->reserved_itree, 0, ULONG_MAX);
+	     reserved; reserved = next) {
+		struct iommu_resv_region *resv;
+
+		next = interval_tree_iter_next(reserved, 0, ULONG_MAX);
+		/* Only adds IOMMU_RESV_DIRECT so far */
+		resv = iommu_alloc_resv_region(reserved->start,
+					       reserved->last + 1 - reserved->start,
+					       IOMMU_READ | IOMMU_WRITE,
+					       IOMMU_RESV_DIRECT,
+					       GFP_ATOMIC);
+		if (!resv)
+			break;
+
+		list_add_tail(&resv->list, head);
+	}
+	up_read(&mdev->reserved_rwsem);
+}
+
 static const struct iommu_ops mock_ops = {
 	.owner = THIS_MODULE,
 	.pgsize_bitmap = MOCK_IO_PAGE_SIZE,
@@ -299,6 +325,7 @@ static const struct iommu_ops mock_ops = {
 	.hw_info = mock_domain_hw_info,
 	.domain_alloc = mock_domain_alloc,
 	.capable = mock_domain_capable,
+	.get_resv_regions = iommufd_test_get_resv_regions,
 	.set_platform_dma_ops = mock_domain_set_plaform_dma_ops,
 	.default_domain_ops =
 		&(struct iommu_domain_ops){
-- 
2.34.1



* [PATCH v3 13/17] iommufd/selftest: Add coverage for IOMMU_RESV_IOVA_RANGES
  2023-07-24 11:03 [PATCH v3 00/17] iommufd: Add nesting infrastructure Yi Liu
                   ` (11 preceding siblings ...)
  2023-07-24 11:04 ` [PATCH v3 12/17] iommufd/selftest: Add .get_resv_regions() for mock_dev Yi Liu
@ 2023-07-24 11:04 ` Yi Liu
  2023-07-24 11:04 ` [PATCH v3 14/17] iommufd/selftest: Add domain_alloc_user() support in iommu mock Yi Liu
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 66+ messages in thread
From: Yi Liu @ 2023-07-24 11:04 UTC (permalink / raw)
  To: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.l.liu, yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi,
	lulu, suravee.suthikulpanit, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan

This adds a selftest for the IOMMU_RESV_IOVA_RANGES ioctl.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 tools/testing/selftests/iommu/iommufd.c | 98 +++++++++++++++++++++++++
 1 file changed, 98 insertions(+)

diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index 6b075a68b928..bfe0f76f7fb0 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -1780,4 +1780,102 @@ TEST_F(vfio_compat_mock_domain, huge_map)
 	}
 }
 
+FIXTURE(iommufd_device_resv_iova)
+{
+	int fd;
+	uint32_t ioas_id;
+	uint32_t stdev_id;
+	uint32_t device_id;
+};
+
+FIXTURE_SETUP(iommufd_device_resv_iova)
+{
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+	test_ioctl_ioas_alloc(&self->ioas_id);
+
+	test_cmd_mock_domain(self->ioas_id, &self->stdev_id,
+			     NULL, &self->device_id);
+}
+
+FIXTURE_TEARDOWN(iommufd_device_resv_iova)
+{
+	teardown_iommufd(self->fd, _metadata);
+}
+
+TEST_F(iommufd_device_resv_iova, dev_resv_iova_ranges)
+{
+	struct iommu_test_cmd test_cmd = {
+		.size = sizeof(test_cmd),
+		.op = IOMMU_TEST_OP_DEV_ADD_RESERVED,
+		.id = self->stdev_id,
+		.add_dev_reserved = { .start = PAGE_SIZE, .length = PAGE_SIZE },
+	};
+	struct iommu_test_cmd test_cmd_del = {
+		.size = sizeof(test_cmd_del),
+		.op = IOMMU_TEST_OP_DEV_DEL_RESERVED,
+		.id = self->stdev_id,
+	};
+	struct iommu_resv_iova_range *ranges = buffer;
+	struct iommu_resv_iova_ranges ranges_cmd = {
+		.size = sizeof(ranges_cmd),
+		.dev_id = self->device_id,
+		.num_iovas = BUFFER_SIZE / sizeof(*ranges),
+		.resv_iovas = (uintptr_t)ranges,
+	};
+
+	/* Range can be read */
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_RESV_IOVA_RANGES, &ranges_cmd));
+	EXPECT_EQ(0, ranges_cmd.num_iovas);
+
+	/* 1 range */
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_DEV_ADD_RESERVED),
+			&test_cmd));
+	ranges_cmd.num_iovas = BUFFER_SIZE / sizeof(*ranges);
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_RESV_IOVA_RANGES, &ranges_cmd));
+	EXPECT_EQ(1, ranges_cmd.num_iovas);
+	EXPECT_EQ(PAGE_SIZE, ranges[0].start);
+	EXPECT_EQ(PAGE_SIZE * 2 - 1, ranges[0].last);
+
+	/* Buffer too small */
+	memset(ranges, 0, BUFFER_SIZE);
+	ranges_cmd.num_iovas = 0;
+	EXPECT_ERRNO(EMSGSIZE,
+		     ioctl(self->fd, IOMMU_RESV_IOVA_RANGES, &ranges_cmd));
+	EXPECT_EQ(1, ranges_cmd.num_iovas);
+	EXPECT_EQ(0, ranges[0].start);
+	EXPECT_EQ(0, ranges[0].last);
+
+	/* 2 ranges */
+	test_cmd.add_dev_reserved.start = PAGE_SIZE * 4;
+	test_cmd.add_dev_reserved.length = PAGE_SIZE;
+
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_DEV_ADD_RESERVED),
+			&test_cmd));
+	ranges_cmd.num_iovas = BUFFER_SIZE / sizeof(*ranges);
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_RESV_IOVA_RANGES, &ranges_cmd));
+	EXPECT_EQ(2, ranges_cmd.num_iovas);
+	EXPECT_EQ(PAGE_SIZE, ranges[0].start);
+	EXPECT_EQ(PAGE_SIZE * 2 - 1, ranges[0].last);
+	EXPECT_EQ(PAGE_SIZE * 4, ranges[1].start);
+	EXPECT_EQ(PAGE_SIZE * 5 - 1, ranges[1].last);
+
+	/* Buffer too small */
+	memset(ranges, 0, BUFFER_SIZE);
+	ranges_cmd.num_iovas = 1;
+	EXPECT_ERRNO(EMSGSIZE, ioctl(self->fd, IOMMU_RESV_IOVA_RANGES,
+				     &ranges_cmd));
+	EXPECT_EQ(2, ranges_cmd.num_iovas);
+	EXPECT_EQ(PAGE_SIZE, ranges[0].start);
+	EXPECT_EQ(PAGE_SIZE * 2 - 1, ranges[0].last);
+	EXPECT_EQ(0, ranges[1].start);
+	EXPECT_EQ(0, ranges[1].last);
+
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_DEV_DEL_RESERVED),
+			&test_cmd_del));
+}
+
 TEST_HARNESS_MAIN
-- 
2.34.1



* [PATCH v3 14/17] iommufd/selftest: Add domain_alloc_user() support in iommu mock
  2023-07-24 11:03 [PATCH v3 00/17] iommufd: Add nesting infrastructure Yi Liu
                   ` (12 preceding siblings ...)
  2023-07-24 11:04 ` [PATCH v3 13/17] iommufd/selftest: Add coverage for IOMMU_RESV_IOVA_RANGES Yi Liu
@ 2023-07-24 11:04 ` Yi Liu
  2023-07-24 11:04 ` [PATCH v3 15/17] iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with user data Yi Liu
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 66+ messages in thread
From: Yi Liu @ 2023-07-24 11:04 UTC (permalink / raw)
  To: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.l.liu, yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi,
	lulu, suravee.suthikulpanit, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan

From: Nicolin Chen <nicolinc@nvidia.com>

Add a mock_domain_alloc_user() function and an iommu_hwpt_selftest data
structure so that the user space selftest program can allocate domains
with user data.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/hw_pagetable.c |  1 +
 drivers/iommu/iommufd/iommufd_test.h | 16 ++++++
 drivers/iommu/iommufd/selftest.c     | 84 +++++++++++++++++++++++++---
 include/linux/iommu.h                |  3 +
 4 files changed, 95 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 9064e6d181b4..e30fce5d553b 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -7,6 +7,7 @@
 
 #include "../iommu-priv.h"
 #include "iommufd_private.h"
+#include "iommufd_test.h"
 
 void iommufd_hw_pagetable_destroy(struct iommufd_object *obj)
 {
diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h
index b7f4bee2ea1b..eeac8100844d 100644
--- a/drivers/iommu/iommufd/iommufd_test.h
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -115,4 +115,20 @@ struct iommu_test_hw_info {
 	__u32 test_reg;
 };
 
+/* Should not be equal to any defined value in enum iommu_hwpt_type */
+#define IOMMU_HWPT_TYPE_SELFTTEST		0xdead
+
+/**
+ * struct iommu_hwpt_selftest
+ *
+ * @flags: page table entry attributes
+ * @test_config: default iotlb setup (value IOMMU_TEST_IOTLB_DEFAULT)
+ */
+struct iommu_hwpt_selftest {
+#define IOMMU_TEST_FLAG_NESTED		(1ULL << 0)
+	__u64 flags;
+#define IOMMU_TEST_IOTLB_DEFAULT	0xbadbeef
+	__u64 test_config;
+};
+
 #endif
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index e40780db88b5..8364003efde3 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -84,7 +84,9 @@ void iommufd_test_syz_conv_iova_id(struct iommufd_ucmd *ucmd,
 
 struct mock_iommu_domain {
 	struct iommu_domain domain;
+	struct mock_iommu_domain *parent;
 	struct xarray pfns;
+	u32 iotlb;
 };
 
 enum selftest_obj_type {
@@ -144,26 +146,81 @@ static void *mock_domain_hw_info(struct device *dev, u32 *length)
 	return info;
 }
 
-static struct iommu_domain *mock_domain_alloc(unsigned int iommu_domain_type)
+static const struct iommu_ops mock_ops;
+static struct iommu_domain_ops domain_nested_ops;
+
+static struct iommu_domain *
+__mock_domain_alloc(unsigned int iommu_domain_type,
+		    struct mock_iommu_domain *mock_parent,
+		    const struct iommu_hwpt_selftest *user_cfg)
 {
 	struct mock_iommu_domain *mock;
 
 	if (iommu_domain_type == IOMMU_DOMAIN_BLOCKED)
 		return &mock_blocking_domain;
 
-	if (WARN_ON(iommu_domain_type != IOMMU_DOMAIN_UNMANAGED))
-		return NULL;
-
 	mock = kzalloc(sizeof(*mock), GFP_KERNEL);
 	if (!mock)
-		return NULL;
-	mock->domain.geometry.aperture_start = MOCK_APERTURE_START;
-	mock->domain.geometry.aperture_end = MOCK_APERTURE_LAST;
+		return ERR_PTR(-ENOMEM);
+	mock->parent = mock_parent;
+	mock->domain.type = iommu_domain_type;
 	mock->domain.pgsize_bitmap = MOCK_IO_PAGE_SIZE;
-	xa_init(&mock->pfns);
+	if (mock_parent) {
+		mock->iotlb = user_cfg->test_config;
+		mock->domain.ops = &domain_nested_ops;
+	} else {
+		mock->domain.geometry.aperture_start = MOCK_APERTURE_START;
+		mock->domain.geometry.aperture_end = MOCK_APERTURE_LAST;
+		mock->domain.ops = mock_ops.default_domain_ops;
+		xa_init(&mock->pfns);
+	}
 	return &mock->domain;
 }
 
+static struct iommu_domain *mock_domain_alloc(unsigned int iommu_domain_type)
+{
+	struct iommu_domain *domain;
+
+	if (WARN_ON(iommu_domain_type != IOMMU_DOMAIN_BLOCKED &&
+		    iommu_domain_type != IOMMU_DOMAIN_UNMANAGED))
+		return NULL;
+	domain = __mock_domain_alloc(IOMMU_DOMAIN_UNMANAGED, NULL, NULL);
+	if (IS_ERR(domain))
+		domain = NULL;
+	return domain;
+}
+
+static struct iommu_domain *mock_domain_alloc_user(struct device *dev,
+						   enum iommu_hwpt_type hwpt_type,
+						   struct iommu_domain *parent,
+						   const union iommu_domain_user_data *user_data)
+{
+	const struct iommu_hwpt_selftest *user_cfg = (struct iommu_hwpt_selftest *)user_data;
+	unsigned int iommu_domain_type = IOMMU_DOMAIN_UNMANAGED;
+	struct mock_iommu_domain *mock_parent = NULL;
+
+	/* Check union iommu_domain_user_data in include/linux/iommu.h */
+	static_assert(sizeof(struct iommu_hwpt_selftest) <= 16);
+
+	if (hwpt_type != IOMMU_HWPT_TYPE_DEFAULT &&
+	    hwpt_type != IOMMU_HWPT_TYPE_SELFTTEST)
+		return ERR_PTR(-EINVAL);
+
+	if (parent) {
+		if (hwpt_type != IOMMU_HWPT_TYPE_SELFTTEST)
+			return ERR_PTR(-EINVAL);
+		if (parent->ops != mock_ops.default_domain_ops)
+			return ERR_PTR(-EINVAL);
+		if (!user_cfg || !(user_cfg->flags & IOMMU_TEST_FLAG_NESTED))
+			return ERR_PTR(-EINVAL);
+		iommu_domain_type = IOMMU_DOMAIN_NESTED;
+		mock_parent = container_of(parent,
+					   struct mock_iommu_domain, domain);
+	}
+
+	return __mock_domain_alloc(iommu_domain_type, mock_parent, user_cfg);
+}
+
 static void mock_domain_free(struct iommu_domain *domain)
 {
 	struct mock_iommu_domain *mock =
@@ -324,6 +381,7 @@ static const struct iommu_ops mock_ops = {
 	.hw_info_type = IOMMU_HW_INFO_TYPE_SELFTEST,
 	.hw_info = mock_domain_hw_info,
 	.domain_alloc = mock_domain_alloc,
+	.domain_alloc_user = mock_domain_alloc_user,
 	.capable = mock_domain_capable,
 	.get_resv_regions = iommufd_test_get_resv_regions,
 	.set_platform_dma_ops = mock_domain_set_plaform_dma_ops,
@@ -337,6 +395,11 @@ static const struct iommu_ops mock_ops = {
 		},
 };
 
+static struct iommu_domain_ops domain_nested_ops = {
+	.free = mock_domain_free,
+	.attach_dev = mock_domain_nop_attach,
+};
+
 static struct iommu_device mock_iommu_device = {
 	.ops = &mock_ops,
 };
@@ -353,7 +416,10 @@ get_md_pagetable(struct iommufd_ucmd *ucmd, u32 mockpt_id,
 	if (IS_ERR(obj))
 		return ERR_CAST(obj);
 	hwpt = container_of(obj, struct iommufd_hw_pagetable, obj);
-	if (hwpt->domain->ops != mock_ops.default_domain_ops) {
+	if ((hwpt->domain->type == IOMMU_DOMAIN_UNMANAGED &&
+	     hwpt->domain->ops != mock_ops.default_domain_ops) ||
+	    (hwpt->domain->type == IOMMU_DOMAIN_NESTED &&
+	     hwpt->domain->ops != &domain_nested_ops)) {
 		iommufd_put_object(&hwpt->obj);
 		return ERR_PTR(-EINVAL);
 	}
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index b8f09330b64e..e4835230d5f0 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -237,6 +237,9 @@ struct iommu_iotlb_gather {
  * include/uapi/linux/iommufd.h.
  */
 union iommu_domain_user_data {
+#ifdef CONFIG_IOMMUFD_TEST
+	__u64 test[2];
+#endif
 };
 
 /**
-- 
2.34.1



* [PATCH v3 15/17] iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with user data
  2023-07-24 11:03 [PATCH v3 00/17] iommufd: Add nesting infrastructure Yi Liu
                   ` (13 preceding siblings ...)
  2023-07-24 11:04 ` [PATCH v3 14/17] iommufd/selftest: Add domain_alloc_user() support in iommu mock Yi Liu
@ 2023-07-24 11:04 ` Yi Liu
  2023-07-24 11:04 ` [PATCH v3 16/17] iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op Yi Liu
  2023-07-24 11:04 ` [PATCH v3 17/17] iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl Yi Liu
  16 siblings, 0 replies; 66+ messages in thread
From: Yi Liu @ 2023-07-24 11:04 UTC (permalink / raw)
  To: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.l.liu, yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi,
	lulu, suravee.suthikulpanit, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan

From: Nicolin Chen <nicolinc@nvidia.com>

The IOMMU_HWPT_ALLOC ioctl now supports passing user_data to allocate a
customized domain. Add coverage for both the regular domain case and the
nested domain case.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 tools/testing/selftests/iommu/iommufd.c       | 114 +++++++++++++++++-
 tools/testing/selftests/iommu/iommufd_utils.h |  36 ++++++
 2 files changed, 149 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index bfe0f76f7fb0..70116c2dfc94 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -114,6 +114,7 @@ TEST_F(iommufd, cmd_length)
 
 	TEST_LENGTH(iommu_destroy, IOMMU_DESTROY);
 	TEST_LENGTH(iommu_hw_info, IOMMU_GET_HW_INFO);
+	TEST_LENGTH(iommu_hwpt_alloc, IOMMU_HWPT_ALLOC);
 	TEST_LENGTH(iommu_ioas_alloc, IOMMU_IOAS_ALLOC);
 	TEST_LENGTH(iommu_ioas_iova_ranges, IOMMU_IOAS_IOVA_RANGES);
 	TEST_LENGTH(iommu_ioas_allow_iovas, IOMMU_IOAS_ALLOW_IOVAS);
@@ -194,6 +195,7 @@ FIXTURE_VARIANT(iommufd_ioas)
 {
 	unsigned int mock_domains;
 	unsigned int memory_limit;
+	bool new_hwpt;
 };
 
 FIXTURE_SETUP(iommufd_ioas)
@@ -233,6 +235,12 @@ FIXTURE_VARIANT_ADD(iommufd_ioas, mock_domain)
 	.mock_domains = 1,
 };
 
+FIXTURE_VARIANT_ADD(iommufd_ioas, mock_domain_hwpt)
+{
+	.mock_domains = 1,
+	.new_hwpt = true,
+};
+
 FIXTURE_VARIANT_ADD(iommufd_ioas, two_mock_domain)
 {
 	.mock_domains = 2,
@@ -260,6 +268,93 @@ TEST_F(iommufd_ioas, ioas_destroy)
 	}
 }
 
+TEST_F(iommufd_ioas, hwpt_alloc)
+{
+	uint32_t new_hwpt_id = 0;
+
+	if (self->stdev_id && self->device_id) {
+		test_cmd_hwpt_alloc(self->device_id, self->ioas_id, &new_hwpt_id);
+		test_cmd_mock_domain_replace(self->stdev_id, new_hwpt_id);
+		/* hw_pagetable cannot be freed if a device is attached to it */
+		EXPECT_ERRNO(EBUSY, _test_ioctl_destroy(self->fd, new_hwpt_id));
+
+		/* Detach from the new hw_pagetable and try again */
+		test_cmd_mock_domain_replace(self->stdev_id, self->ioas_id);
+		test_ioctl_destroy(new_hwpt_id);
+	} else {
+		test_err_cmd_hwpt_alloc(ENOENT, self->device_id,
+					self->ioas_id, &new_hwpt_id);
+		test_err_mock_domain_replace(ENOENT,
+					     self->stdev_id, new_hwpt_id);
+	}
+}
+
+TEST_F(iommufd_ioas, nested_hwpt_alloc)
+{
+	uint32_t nested_hwpt_id[2] = {};
+	uint32_t parent_hwpt_id = 0;
+	uint32_t test_hwpt_id = 0;
+
+	if (self->device_id) {
+		/* Negative tests */
+		test_err_cmd_hwpt_alloc(ENOENT, self->ioas_id, self->device_id,
+					&test_hwpt_id);
+		test_err_cmd_hwpt_alloc(EINVAL, self->device_id,
+					self->device_id, &test_hwpt_id);
+
+		/* Allocate two nested hwpts sharing one common parent hwpt */
+		test_cmd_hwpt_alloc(self->device_id, self->ioas_id,
+				    &parent_hwpt_id);
+
+		test_cmd_hwpt_alloc_nested(self->device_id, parent_hwpt_id,
+					   &nested_hwpt_id[0]);
+		test_cmd_hwpt_alloc_nested(self->device_id, parent_hwpt_id,
+					   &nested_hwpt_id[1]);
+
+		/* Negative test: a nested hwpt on top of a nested hwpt */
+		test_err_cmd_hwpt_alloc_nested(EINVAL, self->device_id,
+					       nested_hwpt_id[0],
+					       &test_hwpt_id);
+		/* Negative test: parent hwpt now cannot be freed */
+		EXPECT_ERRNO(EBUSY,
+			     _test_ioctl_destroy(self->fd, parent_hwpt_id));
+
+		/* Attach device to nested_hwpt_id[0] that then will be busy */
+		test_cmd_mock_domain_replace(self->stdev_id,
+					     nested_hwpt_id[0]);
+		EXPECT_ERRNO(EBUSY,
+			     _test_ioctl_destroy(self->fd, nested_hwpt_id[0]));
+
+		/* Switch from nested_hwpt_id[0] to nested_hwpt_id[1] */
+		test_cmd_mock_domain_replace(self->stdev_id,
+					     nested_hwpt_id[1]);
+		EXPECT_ERRNO(EBUSY,
+			     _test_ioctl_destroy(self->fd, nested_hwpt_id[1]));
+		test_ioctl_destroy(nested_hwpt_id[0]);
+
+		/* Detach from nested_hwpt_id[1] and destroy it */
+		test_cmd_mock_domain_replace(self->stdev_id, parent_hwpt_id);
+		test_ioctl_destroy(nested_hwpt_id[1]);
+
+		/* Detach from the parent hw_pagetable and destroy it */
+		test_cmd_mock_domain_replace(self->stdev_id, self->ioas_id);
+		test_ioctl_destroy(parent_hwpt_id);
+	} else {
+		test_err_cmd_hwpt_alloc(ENOENT, self->device_id, self->ioas_id,
+					&parent_hwpt_id);
+		test_err_cmd_hwpt_alloc_nested(ENOENT, self->device_id,
+					       parent_hwpt_id,
+					       &nested_hwpt_id[0]);
+		test_err_cmd_hwpt_alloc_nested(ENOENT, self->device_id,
+					       parent_hwpt_id,
+					       &nested_hwpt_id[1]);
+		test_err_mock_domain_replace(ENOENT, self->stdev_id,
+					     nested_hwpt_id[0]);
+		test_err_mock_domain_replace(ENOENT, self->stdev_id,
+					     nested_hwpt_id[1]);
+	}
+}
+
 TEST_F(iommufd_ioas, hwpt_attach)
 {
 	/* Create a device attached directly to a hwpt */
@@ -663,6 +758,8 @@ TEST_F(iommufd_ioas, access_pin)
 			       MOCK_FLAGS_ACCESS_CREATE_NEEDS_PIN_PAGES);
 
 	for (npages = 1; npages < BUFFER_SIZE / PAGE_SIZE; npages++) {
+		uint32_t new_hwpt_id = 0;
+		uint32_t mock_device_id;
 		uint32_t mock_stdev_id;
 		uint32_t mock_hwpt_id;
 
@@ -696,12 +793,27 @@ TEST_F(iommufd_ioas, access_pin)
 				   _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES),
 				   &access_cmd));
 		test_cmd_mock_domain(self->ioas_id, &mock_stdev_id,
-				     &mock_hwpt_id, NULL);
+				     &mock_hwpt_id, &mock_device_id);
 		check_map_cmd.id = mock_hwpt_id;
+		if (variant->new_hwpt) {
+			test_cmd_hwpt_alloc(mock_device_id, self->ioas_id,
+					    &new_hwpt_id);
+			test_cmd_mock_domain_replace(mock_stdev_id,
+						     new_hwpt_id);
+			check_map_cmd.id = new_hwpt_id;
+		} else {
+			check_map_cmd.id = mock_hwpt_id;
+		}
 		ASSERT_EQ(0, ioctl(self->fd,
 				   _IOMMU_TEST_CMD(IOMMU_TEST_OP_MD_CHECK_MAP),
 				   &check_map_cmd));
 
+		if (variant->new_hwpt) {
+			/* Detach from the new hwpt for its destroy() */
+			test_cmd_mock_domain_replace(mock_stdev_id,
+						     mock_hwpt_id);
+			test_ioctl_destroy(new_hwpt_id);
+		}
 		test_ioctl_destroy(mock_stdev_id);
 		test_cmd_destroy_access_pages(
 			access_cmd.id,
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index f13df84f6b42..365fa3da2016 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -118,6 +118,42 @@ static int _test_cmd_hwpt_alloc(int fd, __u32 device_id, __u32 pt_id,
 
 #define test_cmd_hwpt_alloc(device_id, pt_id, hwpt_id) \
 	ASSERT_EQ(0, _test_cmd_hwpt_alloc(self->fd, device_id, pt_id, hwpt_id))
+#define test_err_cmd_hwpt_alloc(_errno, device_id, pt_id, hwpt_id)     \
+	EXPECT_ERRNO(_errno, _test_cmd_hwpt_alloc(self->fd, device_id, \
+						  pt_id, hwpt_id))
+
+static int _test_cmd_hwpt_alloc_nested(int fd, __u32 device_id, __u32 parent_id,
+				       __u32 *hwpt_id)
+{
+	struct iommu_hwpt_selftest data = {
+		.flags = IOMMU_TEST_FLAG_NESTED,
+		.test_config = IOMMU_TEST_IOTLB_DEFAULT,
+	};
+	struct iommu_hwpt_alloc cmd = {
+		.size = sizeof(cmd),
+		.dev_id = device_id,
+		.pt_id = parent_id,
+		.hwpt_type = IOMMU_HWPT_TYPE_SELFTTEST,
+		.data_len = sizeof(data),
+		.data_uptr = (uint64_t)&data,
+	};
+	int ret;
+
+	ret = ioctl(fd, IOMMU_HWPT_ALLOC, &cmd);
+	if (ret)
+		return ret;
+	if (hwpt_id)
+		*hwpt_id = cmd.out_hwpt_id;
+	return 0;
+}
+
+#define test_cmd_hwpt_alloc_nested(device_id, parent_id, hwpt_id)     \
+	ASSERT_EQ(0, _test_cmd_hwpt_alloc_nested(self->fd, device_id, \
+						 parent_id, hwpt_id))
+#define test_err_cmd_hwpt_alloc_nested(_errno, device_id, parent_id, hwpt_id) \
+	EXPECT_ERRNO(_errno,                                                  \
+		     _test_cmd_hwpt_alloc_nested(self->fd, device_id,         \
+						 parent_id, hwpt_id))
 
 static int _test_cmd_access_replace_ioas(int fd, __u32 access_id,
 					 unsigned int ioas_id)
-- 
2.34.1



* [PATCH v3 16/17] iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
  2023-07-24 11:03 [PATCH v3 00/17] iommufd: Add nesting infrastructure Yi Liu
                   ` (14 preceding siblings ...)
  2023-07-24 11:04 ` [PATCH v3 15/17] iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with user data Yi Liu
@ 2023-07-24 11:04 ` Yi Liu
  2023-07-24 11:04 ` [PATCH v3 17/17] iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl Yi Liu
  16 siblings, 0 replies; 66+ messages in thread
From: Yi Liu @ 2023-07-24 11:04 UTC (permalink / raw)
  To: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.l.liu, yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi,
	lulu, suravee.suthikulpanit, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan

From: Nicolin Chen <nicolinc@nvidia.com>

This allows testing whether the IOTLB has been invalidated.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/iommufd_test.h          |  4 ++++
 drivers/iommu/iommufd/selftest.c              | 22 +++++++++++++++++++
 tools/testing/selftests/iommu/iommufd.c       |  4 ++++
 tools/testing/selftests/iommu/iommufd_utils.h | 14 ++++++++++++
 4 files changed, 44 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h
index eeac8100844d..93fcbb57b904 100644
--- a/drivers/iommu/iommufd/iommufd_test.h
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -21,6 +21,7 @@ enum {
 	IOMMU_TEST_OP_ACCESS_REPLACE_IOAS,
 	IOMMU_TEST_OP_DEV_ADD_RESERVED,
 	IOMMU_TEST_OP_DEV_DEL_RESERVED,
+	IOMMU_TEST_OP_MD_CHECK_IOTLB,
 };
 
 enum {
@@ -101,6 +102,9 @@ struct iommu_test_cmd {
 		struct {
 			__u32 ioas_id;
 		} access_replace_ioas;
+		struct {
+			__u32 iotlb;
+		} check_iotlb;
 	};
 	__u32 last;
 };
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index 8364003efde3..92c37716804e 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -795,6 +795,25 @@ static int iommufd_test_md_check_refs(struct iommufd_ucmd *ucmd,
 	return 0;
 }
 
+static int iommufd_test_md_check_iotlb(struct iommufd_ucmd *ucmd,
+				       u32 mockpt_id, u32 iotlb)
+{
+	struct iommufd_hw_pagetable *hwpt;
+	struct mock_iommu_domain *mock;
+	int rc = 0;
+
+	hwpt = get_md_pagetable(ucmd, mockpt_id, &mock);
+	if (IS_ERR(hwpt))
+		return PTR_ERR(hwpt);
+
+	mock = container_of(hwpt->domain, struct mock_iommu_domain, domain);
+
+	if (iotlb != mock->iotlb)
+		rc = -EINVAL;
+	iommufd_put_object(&hwpt->obj);
+	return rc;
+}
+
 struct selftest_access {
 	struct iommufd_access *access;
 	struct file *file;
@@ -1214,6 +1233,9 @@ int iommufd_test(struct iommufd_ucmd *ucmd)
 		return iommufd_test_md_check_refs(
 			ucmd, u64_to_user_ptr(cmd->check_refs.uptr),
 			cmd->check_refs.length, cmd->check_refs.refs);
+	case IOMMU_TEST_OP_MD_CHECK_IOTLB:
+		return iommufd_test_md_check_iotlb(ucmd, cmd->id,
+						   cmd->check_iotlb.iotlb);
 	case IOMMU_TEST_OP_CREATE_ACCESS:
 		return iommufd_test_create_access(ucmd, cmd->id,
 						  cmd->create_access.flags);
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index 70116c2dfc94..942860365cb6 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -310,6 +310,10 @@ TEST_F(iommufd_ioas, nested_hwpt_alloc)
 					   &nested_hwpt_id[0]);
 		test_cmd_hwpt_alloc_nested(self->device_id, parent_hwpt_id,
 					   &nested_hwpt_id[1]);
+		test_cmd_hwpt_check_iotlb(nested_hwpt_id[0],
+					  IOMMU_TEST_IOTLB_DEFAULT);
+		test_cmd_hwpt_check_iotlb(nested_hwpt_id[1],
+					  IOMMU_TEST_IOTLB_DEFAULT);
 
 		/* Negative test: a nested hwpt on top of a nested hwpt */
 		test_err_cmd_hwpt_alloc_nested(EINVAL, self->device_id,
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index 365fa3da2016..5e75dfac65e9 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -122,6 +122,20 @@ static int _test_cmd_hwpt_alloc(int fd, __u32 device_id, __u32 pt_id,
 	EXPECT_ERRNO(_errno, _test_cmd_hwpt_alloc(self->fd, device_id, \
 						  pt_id, hwpt_id))
 
+#define test_cmd_hwpt_check_iotlb(hwpt_id, expected)                           \
+	({                                                                     \
+		struct iommu_test_cmd test_cmd = {                             \
+			.size = sizeof(test_cmd),                              \
+			.op = IOMMU_TEST_OP_MD_CHECK_IOTLB,                    \
+			.id = hwpt_id,                                         \
+			.check_iotlb = { .iotlb = expected },                  \
+		};                                                             \
+		ASSERT_EQ(0,                                                   \
+			  ioctl(self->fd,                                      \
+				_IOMMU_TEST_CMD(IOMMU_TEST_OP_MD_CHECK_IOTLB), \
+				&test_cmd));                                   \
+	})
+
 static int _test_cmd_hwpt_alloc_nested(int fd, __u32 device_id, __u32 parent_id,
 				       __u32 *hwpt_id)
 {
-- 
2.34.1



* [PATCH v3 17/17] iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl
  2023-07-24 11:03 [PATCH v3 00/17] iommufd: Add nesting infrastructure Yi Liu
                   ` (15 preceding siblings ...)
  2023-07-24 11:04 ` [PATCH v3 16/17] iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op Yi Liu
@ 2023-07-24 11:04 ` Yi Liu
  16 siblings, 0 replies; 66+ messages in thread
From: Yi Liu @ 2023-07-24 11:04 UTC (permalink / raw)
  To: joro, alex.williamson, jgg, kevin.tian, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.l.liu, yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi,
	lulu, suravee.suthikulpanit, iommu, linux-kernel,
	linux-kselftest, zhenzhong.duan

From: Nicolin Chen <nicolinc@nvidia.com>

Add a mock_domain_cache_invalidate_user() op and a corresponding struct
iommu_hwpt_invalidate_selftest to allow testing IOMMU_HWPT_INVALIDATE
from user space, using the new IOMMU_TEST_OP_MD_CHECK_IOTLB.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/iommufd_test.h          | 10 ++++++++++
 drivers/iommu/iommufd/selftest.c              | 19 ++++++++++++++++++
 tools/testing/selftests/iommu/iommufd.c       |  8 ++++++++
 tools/testing/selftests/iommu/iommufd_utils.h | 20 +++++++++++++++++++
 4 files changed, 57 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h
index 93fcbb57b904..9768d0c9e347 100644
--- a/drivers/iommu/iommufd/iommufd_test.h
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -135,4 +135,14 @@ struct iommu_hwpt_selftest {
 	__u64 test_config;
 };
 
+/**
+ * struct iommu_hwpt_invalidate_selftest
+ *
+ * @flags: invalidate flags
+ */
+struct iommu_hwpt_invalidate_selftest {
+#define IOMMU_TEST_INVALIDATE_ALL	(1ULL << 0)
+	__u64 flags;
+};
+
 #endif
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index 92c37716804e..5f3e1f2a24e7 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -395,9 +395,28 @@ static const struct iommu_ops mock_ops = {
 		},
 };
 
+static int mock_domain_cache_invalidate_user(struct iommu_domain *domain,
+					     void *user_data)
+{
+	struct iommu_hwpt_invalidate_selftest *inv_info = user_data;
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+
+	if (domain->type != IOMMU_DOMAIN_NESTED || !mock->parent)
+		return -EINVAL;
+
+	if (inv_info->flags & IOMMU_TEST_INVALIDATE_ALL)
+		mock->iotlb = 0;
+
+	return 0;
+}
+
 static struct iommu_domain_ops domain_nested_ops = {
 	.free = mock_domain_free,
 	.attach_dev = mock_domain_nop_attach,
+	.cache_invalidate_user = mock_domain_cache_invalidate_user,
+	.cache_invalidate_user_data_len =
+		sizeof(struct iommu_hwpt_invalidate_selftest),
 };
 
 static struct iommu_device mock_iommu_device = {
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index 942860365cb6..bf0082d88a38 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -115,6 +115,7 @@ TEST_F(iommufd, cmd_length)
 	TEST_LENGTH(iommu_destroy, IOMMU_DESTROY);
 	TEST_LENGTH(iommu_hw_info, IOMMU_GET_HW_INFO);
 	TEST_LENGTH(iommu_hwpt_alloc, IOMMU_HWPT_ALLOC);
+	TEST_LENGTH(iommu_hwpt_invalidate, IOMMU_HWPT_INVALIDATE);
 	TEST_LENGTH(iommu_ioas_alloc, IOMMU_IOAS_ALLOC);
 	TEST_LENGTH(iommu_ioas_iova_ranges, IOMMU_IOAS_IOVA_RANGES);
 	TEST_LENGTH(iommu_ioas_allow_iovas, IOMMU_IOAS_ALLOW_IOVAS);
@@ -323,6 +324,13 @@ TEST_F(iommufd_ioas, nested_hwpt_alloc)
 		EXPECT_ERRNO(EBUSY,
 			     _test_ioctl_destroy(self->fd, parent_hwpt_id));
 
+		/* hwpt_invalidate only supports a user-managed hwpt (nested) */
+		test_err_cmd_hwpt_invalidate(EINVAL, parent_hwpt_id);
+		test_cmd_hwpt_invalidate(nested_hwpt_id[0]);
+		test_cmd_hwpt_check_iotlb(nested_hwpt_id[0], 0);
+		test_cmd_hwpt_invalidate(nested_hwpt_id[1]);
+		test_cmd_hwpt_check_iotlb(nested_hwpt_id[1], 0);
+
 		/* Attach device to nested_hwpt_id[0] that then will be busy */
 		test_cmd_mock_domain_replace(self->stdev_id,
 					     nested_hwpt_id[0]);
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index 5e75dfac65e9..9b3e5f36c4a3 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -169,6 +169,26 @@ static int _test_cmd_hwpt_alloc_nested(int fd, __u32 device_id, __u32 parent_id,
 		     _test_cmd_hwpt_alloc_nested(self->fd, device_id,         \
 						 parent_id, hwpt_id))
 
+static int _test_cmd_hwpt_invalidate(int fd, __u32 hwpt_id)
+{
+	struct iommu_hwpt_invalidate_selftest data = {
+		.flags = IOMMU_TEST_INVALIDATE_ALL,
+	};
+	struct iommu_hwpt_invalidate cmd = {
+		.size = sizeof(cmd),
+		.hwpt_id = hwpt_id,
+		.data_len = sizeof(data),
+		.data_uptr = (uint64_t)&data,
+	};
+
+	return ioctl(fd, IOMMU_HWPT_INVALIDATE, &cmd);
+}
+
+#define test_cmd_hwpt_invalidate(hwpt_id) \
+	ASSERT_EQ(0, _test_cmd_hwpt_invalidate(self->fd, hwpt_id))
+#define test_err_cmd_hwpt_invalidate(_errno, hwpt_id) \
+	EXPECT_ERRNO(_errno, _test_cmd_hwpt_invalidate(self->fd, hwpt_id))
+
 static int _test_cmd_access_replace_ioas(int fd, __u32 access_id,
 					 unsigned int ioas_id)
 {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* RE: [PATCH v3 01/17] iommu: Add new iommu op to create domains owned by userspace
  2023-07-24 11:03 ` [PATCH v3 01/17] iommu: Add new iommu op to create domains owned by userspace Yi Liu
@ 2023-07-28  9:37   ` Tian, Kevin
  2023-07-28 16:56     ` Jason Gunthorpe
  0 siblings, 1 reply; 66+ messages in thread
From: Tian, Kevin @ 2023-07-28  9:37 UTC (permalink / raw)
  To: Liu, Yi L, joro, alex.williamson, jgg, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng, Liu,
	Yi L, yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi,
	lulu, suravee.suthikulpanit, iommu, linux-kernel,
	linux-kselftest, Duan, Zhenzhong

> From: Yi Liu <yi.l.liu@intel.com>
> Sent: Monday, July 24, 2023 7:04 PM
>
> + * @domain_alloc_user: allocate a user iommu domain corresponding to
> the input
> + *                     @hwpt_type that is defined as enum iommu_hwpt_type in the
> + *                     include/uapi/linux/iommufd.h. A returning domain will be
> + *                     set to an IOMMU_DOMAIN_NESTED type, upon valid
> @user_data
> + *                     and @parent that is a kernel-managed domain. Otherwise,
> + *                     it will be set to an IOMMU_DOMAIN_UNMANAGED type.
> Return
> + *                     ERR_PTR on allocation failure.

"If @user_data is valid and @parent points to a kernel-managed domain,
the returning domain is set to IOMMU_DOMAIN_NESTED type. Otherwise
it is set to IOMMU_DOMAIN_UNMANAGED type."

Reviewed-by: Kevin Tian <kevin.tian@intel.com>


* RE: [PATCH v3 02/17] iommu: Add nested domain support
  2023-07-24 11:03 ` [PATCH v3 02/17] iommu: Add nested domain support Yi Liu
@ 2023-07-28  9:38   ` Tian, Kevin
  2023-07-28 16:59   ` Jason Gunthorpe
  1 sibling, 0 replies; 66+ messages in thread
From: Tian, Kevin @ 2023-07-28  9:38 UTC (permalink / raw)
  To: Liu, Yi L, joro, alex.williamson, jgg, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Monday, July 24, 2023 7:04 PM
> 
> From: Lu Baolu <baolu.lu@linux.intel.com>
> 
> Introduce a new domain type for a user I/O page table, which is nested
> on top of another address space represented by an UNMANAGED domain. The
> mappings of a nested domain are managed by user space software, therefore
> it's unnecessary to have map/unmap callbacks. But the updates of the PTEs
> in the nested domain page table must be propagated to the caches on both
> IOMMU (IOTLB) and devices (DevTLB).
> 
> The nested domain is allocated by the domain_alloc_user op, and attached
> to the device through the existing iommu_attach_device/group() interfaces.
> 
> A new domain op, named cache_invalidate_user, is added for userspace to
> flush the hardware caches for a nested domain through iommufd. No wrapper
> for it, as it's only supposed to be used by iommufd.
> 
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> Co-developed-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>


* RE: [PATCH v3 03/17] iommufd/hw_pagetable: Use domain_alloc_user op for domain allocation
  2023-07-24 11:03 ` [PATCH v3 03/17] iommufd/hw_pagetable: Use domain_alloc_user op for domain allocation Yi Liu
@ 2023-07-28  9:39   ` Tian, Kevin
  0 siblings, 0 replies; 66+ messages in thread
From: Tian, Kevin @ 2023-07-28  9:39 UTC (permalink / raw)
  To: Liu, Yi L, joro, alex.williamson, jgg, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Monday, July 24, 2023 7:04 PM
> 
> This makes IOMMUFD use iommu_domain_alloc_user() for iommu_domain
> creation, as IOMMUFD needs to support iommu_domain allocation with
> parameters from userspace in nesting support. If the iommu driver
> doesn't provide a domain_alloc_user callback, it falls back to using
> iommu_domain_alloc().
> 
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
> Co-developed-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>


* RE: [PATCH v3 04/17] iommufd: Pass in hwpt_type/parent/user_data to iommufd_hw_pagetable_alloc()
  2023-07-24 11:03 ` [PATCH v3 04/17] iommufd: Pass in hwpt_type/parent/user_data to iommufd_hw_pagetable_alloc() Yi Liu
@ 2023-07-28  9:49   ` Tian, Kevin
  0 siblings, 0 replies; 66+ messages in thread
From: Tian, Kevin @ 2023-07-28  9:49 UTC (permalink / raw)
  To: Liu, Yi L, joro, alex.williamson, jgg, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Monday, July 24, 2023 7:04 PM
> 
> Nested translation has stage-1 and stage-2 page tables. A stage-1 page
> table is managed by user space, and it needs to work with a stage-2 page
> table, which is a parent hwpt for the stage-1 hwpt.
> 
> The iommu core already supports accepting a parent iommu_domain and
> user_data to allocate an iommu_domain. This makes iommufd_hw_pagetable_alloc()
> accept the parent hwpt and user_data, and relay them to the iommu core,
> to prepare for supporting hw_pagetable allocation with user_data.
> 
> Also, add a parent pointer in struct iommufd_hw_pagetable for taking and
> releasing its refcount.
> 
> Co-developed-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>


* RE: [PATCH v3 05/17] iommufd/hw_pagetable: Do not populate user-managed hw_pagetables
  2023-07-24 11:03 ` [PATCH v3 05/17] iommufd/hw_pagetable: Do not populate user-managed hw_pagetables Yi Liu
@ 2023-07-28  9:51   ` Tian, Kevin
  0 siblings, 0 replies; 66+ messages in thread
From: Tian, Kevin @ 2023-07-28  9:51 UTC (permalink / raw)
  To: Liu, Yi L, joro, alex.williamson, jgg, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Monday, July 24, 2023 7:04 PM
> 
> From: Nicolin Chen <nicolinc@nvidia.com>
> 
> A user-managed hw_pagetable does not need to get populated, since it is
> managed by a guest OS. Move the iopt_table_add_domain and list_add_tail
> calls into a helper, where the hwpt pointer will be redirected to its
> hwpt->parent if it's available.
> 
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>


* RE: [PATCH v3 06/17] iommufd: Only enforce IOMMU_RESV_SW_MSI when attaching user-managed HWPT
  2023-07-24 11:03 ` [PATCH v3 06/17] iommufd: Only enforce IOMMU_RESV_SW_MSI when attaching user-managed HWPT Yi Liu
@ 2023-07-28 10:02   ` Tian, Kevin
  2023-07-28 17:06     ` Jason Gunthorpe
  0 siblings, 1 reply; 66+ messages in thread
From: Tian, Kevin @ 2023-07-28 10:02 UTC (permalink / raw)
  To: Liu, Yi L, joro, alex.williamson, jgg, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Monday, July 24, 2023 7:04 PM
>  	}
> 
>  	rc = iopt_table_enforce_dev_resv_regions(&hwpt->ioas->iopt, idev-
> >dev,
> -						 &idev->igroup-
> >sw_msi_start);
> +						 &idev->igroup-
> >sw_msi_start,
> +						 !!hwpt->parent);
>  	if (rc)
>  		goto err_unlock;

I prefer not setting the parent ioas to the hwpt in iommufd_hw_pagetable_alloc().

Then here the ioas can be retrieved from hwpt->parent, and it'd be pretty
clear that in the nested case the sw_msi reservation happens in the parent
instead of pretending the stage-1 hwpt has an ioas too.


* RE: [PATCH v3 07/17] iommufd: Add IOMMU_RESV_IOVA_RANGES
  2023-07-24 11:03 ` [PATCH v3 07/17] iommufd: Add IOMMU_RESV_IOVA_RANGES Yi Liu
@ 2023-07-28 10:07   ` Tian, Kevin
  2023-07-28 17:16     ` Jason Gunthorpe
  2023-07-28 17:44   ` Jason Gunthorpe
  1 sibling, 1 reply; 66+ messages in thread
From: Tian, Kevin @ 2023-07-28 10:07 UTC (permalink / raw)
  To: Liu, Yi L, joro, alex.williamson, jgg, robin.murphy, baolu.lu
  Cc: cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Monday, July 24, 2023 7:04 PM
> 
> This reports a device's reserved IOVA regions to userspace. This is needed
> in nested translation, as userspace owns the stage-1 HWPT and needs to
> exclude the reserved IOVA regions from the stage-1 HWPT, and hence from
> the device's DMA address space.
> 
> This can also be used to figure out allowed IOVAs of an IOAS.

We may need a special type to mark SW_MSI since it requires identity
mapping in stage-1 instead of being reserved.



* Re: [PATCH v3 01/17] iommu: Add new iommu op to create domains owned by userspace
  2023-07-28  9:37   ` Tian, Kevin
@ 2023-07-28 16:56     ` Jason Gunthorpe
  2023-07-31 12:44       ` Liu, Yi L
  2023-08-03  2:28       ` Nicolin Chen
  0 siblings, 2 replies; 66+ messages in thread
From: Jason Gunthorpe @ 2023-07-28 16:56 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, joro, alex.williamson, robin.murphy, baolu.lu, cohuck,
	eric.auger, nicolinc, kvm, mjrosato, chao.p.peng, yi.y.sun,
	peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

On Fri, Jul 28, 2023 at 09:37:21AM +0000, Tian, Kevin wrote:
> > From: Yi Liu <yi.l.liu@intel.com>
> > Sent: Monday, July 24, 2023 7:04 PM
> >
> > + * @domain_alloc_user: allocate a user iommu domain corresponding to
> > the input
> > + *                     @hwpt_type that is defined as enum iommu_hwpt_type in the
> > + *                     include/uapi/linux/iommufd.h. A returning domain will be
> > + *                     set to an IOMMU_DOMAIN_NESTED type, upon valid
> > @user_data
> > + *                     and @parent that is a kernel-managed domain. Otherwise,
> > + *                     it will be set to an IOMMU_DOMAIN_UNMANAGED type.
> > Return
> > + *                     ERR_PTR on allocation failure.
> 
> "If @user_data is valid and @parent points to a kernel-managed domain,
> the returning domain is set to IOMMU_DOMAIN_NESTED type. Otherwise
> it is set to IOMMU_DOMAIN_UNMANAGED type."

 "If @user_data is valid and @parent points to a kernel-managed domain,
 then the returned domain must be the IOMMU_DOMAIN_NESTED type. Otherwise
 the returned domain is IOMMU_DOMAIN_UNMANAGED."

Notice the detail that this API expects the driver to set the type and
fully initialize the domain, including the generic iommu_domain
struct, which is different from alloc_domain.

When we implement this in drivers we should tidy this so all the alloc
flows fully initialize the domain internally.

Jason


* Re: [PATCH v3 02/17] iommu: Add nested domain support
  2023-07-24 11:03 ` [PATCH v3 02/17] iommu: Add nested domain support Yi Liu
  2023-07-28  9:38   ` Tian, Kevin
@ 2023-07-28 16:59   ` Jason Gunthorpe
  2023-08-03  2:36     ` Nicolin Chen
  1 sibling, 1 reply; 66+ messages in thread
From: Jason Gunthorpe @ 2023-07-28 16:59 UTC (permalink / raw)
  To: Yi Liu
  Cc: joro, alex.williamson, kevin.tian, robin.murphy, baolu.lu,
	cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	zhenzhong.duan

On Mon, Jul 24, 2023 at 04:03:51AM -0700, Yi Liu wrote:

> @@ -350,6 +354,10 @@ struct iommu_ops {
>   * @iotlb_sync_map: Sync mappings created recently using @map to the hardware
>   * @iotlb_sync: Flush all queued ranges from the hardware TLBs and empty flush
>   *            queue
> + * @cache_invalidate_user: Flush hardware TLBs caching user space IO mappings
> + * @cache_invalidate_user_data_len: Defined length of input user data for the
> + *                                  cache_invalidate_user op, being sizeof the
> + *                                  structure in include/uapi/linux/iommufd.h
>   * @iova_to_phys: translate iova to physical address
>   * @enforce_cache_coherency: Prevent any kind of DMA from bypassing IOMMU_CACHE,
>   *                           including no-snoop TLPs on PCIe or other platform
> @@ -379,6 +387,9 @@ struct iommu_domain_ops {
>  			       size_t size);
>  	void (*iotlb_sync)(struct iommu_domain *domain,
>  			   struct iommu_iotlb_gather *iotlb_gather);
> +	int (*cache_invalidate_user)(struct iommu_domain *domain,
> +				     void *user_data);

If we are doing const unions, then this void * should also be a const
union.

Jason


* Re: [PATCH v3 06/17] iommufd: Only enforce IOMMU_RESV_SW_MSI when attaching user-managed HWPT
  2023-07-28 10:02   ` Tian, Kevin
@ 2023-07-28 17:06     ` Jason Gunthorpe
  0 siblings, 0 replies; 66+ messages in thread
From: Jason Gunthorpe @ 2023-07-28 17:06 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, joro, alex.williamson, robin.murphy, baolu.lu, cohuck,
	eric.auger, nicolinc, kvm, mjrosato, chao.p.peng, yi.y.sun,
	peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

On Fri, Jul 28, 2023 at 10:02:36AM +0000, Tian, Kevin wrote:
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Monday, July 24, 2023 7:04 PM
> >  	}
> > 
> >  	rc = iopt_table_enforce_dev_resv_regions(&hwpt->ioas->iopt, idev-
> > >dev,
> > -						 &idev->igroup-
> > >sw_msi_start);
> > +						 &idev->igroup-
> > >sw_msi_start,
> > +						 !!hwpt->parent);
> >  	if (rc)
> >  		goto err_unlock;
> 
> I prefer not setting the parent ioas to the hwpt in iommufd_hw_pagetable_alloc().

Yes, the prior patch didn't add it to the iopt, so it shouldn't have
an ioas set. The NESTED domains don't have an IOAS almost by
definition.

> then here ioas can be retrieved from hwpt->parent and then it'd be pretty
> clear that in nested case the sw_msi reservation happens in the parent
> instead of pretending the stage-1 hwpt has an ioas too.

Yeah, I'm confused by this patch as well.  Since there should be no
IOAS for the NESTED why are we messing with resv_regions?

Jason


* Re: [PATCH v3 07/17] iommufd: Add IOMMU_RESV_IOVA_RANGES
  2023-07-28 10:07   ` Tian, Kevin
@ 2023-07-28 17:16     ` Jason Gunthorpe
  2023-07-31  6:14       ` Tian, Kevin
  0 siblings, 1 reply; 66+ messages in thread
From: Jason Gunthorpe @ 2023-07-28 17:16 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, joro, alex.williamson, robin.murphy, baolu.lu, cohuck,
	eric.auger, nicolinc, kvm, mjrosato, chao.p.peng, yi.y.sun,
	peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

On Fri, Jul 28, 2023 at 10:07:58AM +0000, Tian, Kevin wrote:
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Monday, July 24, 2023 7:04 PM
> > 
> > This reports device's reserved IOVA regions to userspace. This is needed
> > in the nested translation as userspace owns stage-1 HWPT, and userspace
> > needs to exclude the reserved IOVA regions in the stage-1 HWPT hence
> > exclude
> > them in the device's DMA address space.
> > 
> > This can also be used to figure out allowed IOVAs of an IOAS.
> 
> We may need a special type to mark SW_MSI since it requires identity
> mapping in stage-1 instead of being reserved.

Only the kernel can do this, so there is no action for user space to
take beyond knowing that it is not mappable IOVA.

The merit for "SW_MSI" may be to inform the rest of the system about
the IOVA of the ITS page, but with the current situation that isn't
required since only the kernel needs that information.

I think the long term way forward is to somehow arrange for the SW_MSI
to not become mapped when creating the parent HWPT and instead cause
the ITS page to be mapped through some explicit IOCTL.

Jason


* Re: [PATCH v3 07/17] iommufd: Add IOMMU_RESV_IOVA_RANGES
  2023-07-24 11:03 ` [PATCH v3 07/17] iommufd: Add IOMMU_RESV_IOVA_RANGES Yi Liu
  2023-07-28 10:07   ` Tian, Kevin
@ 2023-07-28 17:44   ` Jason Gunthorpe
  2023-07-31  6:21     ` Tian, Kevin
  1 sibling, 1 reply; 66+ messages in thread
From: Jason Gunthorpe @ 2023-07-28 17:44 UTC (permalink / raw)
  To: Yi Liu
  Cc: joro, alex.williamson, kevin.tian, robin.murphy, baolu.lu,
	cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	zhenzhong.duan

On Mon, Jul 24, 2023 at 04:03:56AM -0700, Yi Liu wrote:

> +/**
> + * struct iommu_resv_iova_ranges - ioctl(IOMMU_RESV_IOVA_RANGES)
> + * @size: sizeof(struct iommu_resv_iova_ranges)
> + * @dev_id: device to read resv iova ranges for
> + * @num_iovas: Input/Output total number of resv ranges for the device
> + * @__reserved: Must be 0
> + * @resv_iovas: Pointer to the output array of struct iommu_resv_iova_range
> + *
> + * Query a device for ranges of reserved IOVAs. num_iovas will be set to the
> + * total number of iovas and the resv_iovas[] will be filled in as space
> + * permits.
> + *
> + * On input num_iovas is the length of the resv_iovas array. On output it is
> + * the total number of iovas filled in. The ioctl will return -EMSGSIZE and
> + * set num_iovas to the required value if num_iovas is too small. In this
> + * case the caller should allocate a larger output array and re-issue the
> + * ioctl.
> + *
> + * Under nested translation, userspace should query the reserved IOVAs for a
> + * given device, and report it to the stage-1 I/O page table owner to exclude
> + * the reserved IOVAs. The reserved IOVAs can also be used to figure out the
> + * allowed IOVA ranges for the IOAS that the device is attached to. For detail
> + * see ioctl IOMMU_IOAS_IOVA_RANGES.

I'm not sure I like this, the other APIs here work with the *allowed*
IOVAs, which is the inverse of this one which works with the
*disallowed* IOVAs.

It means we can't take the output of this API and feed it into
IOMMUFD_CMD_IOAS_ALLOW_IOVAS.. Though I suppose qemu isn't going to do
this anyhow.

On the other hand, it is kind of hard to intersect an allowed list of
multiple idevs into a single master list.

As it is, userspace will have to aggregate the list, sort it, merge
adjacent overlapping reserved ranges, then invert the list to get an
allowed list. This is not entirely simple..
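The aggregate/sort/merge/invert sequence described here is a standard interval exercise. A minimal userspace sketch (illustrative only — the `range` type, inclusive start/last convention, and function names are assumptions modeled on iommufd's IOVA-range style, not the qemu implementation):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct range { uint64_t start, last; };	/* inclusive bounds */

static int cmp_start(const void *a, const void *b)
{
	const struct range *x = a, *y = b;

	return x->start < y->start ? -1 : x->start > y->start;
}

/* Sort reserved ranges and merge overlapping/adjacent ones in place. */
static size_t merge_ranges(struct range *r, size_t n)
{
	size_t out = 0;

	if (n == 0)
		return 0;
	qsort(r, n, sizeof(*r), cmp_start);
	for (size_t i = 1; i < n; i++) {
		/* overlapping or adjacent: extend the current range */
		if (r[out].last == UINT64_MAX ||
		    r[i].start <= r[out].last + 1) {
			if (r[i].last > r[out].last)
				r[out].last = r[i].last;
		} else {
			r[++out] = r[i];
		}
	}
	return out + 1;
}

/* Invert merged reserved ranges into allowed ranges over [0, UINT64_MAX]. */
static size_t invert_to_allowed(const struct range *resv, size_t n,
				struct range *allowed)
{
	uint64_t cursor = 0;
	size_t out = 0;

	for (size_t i = 0; i < n; i++) {
		if (resv[i].start > cursor) {
			allowed[out].start = cursor;
			allowed[out].last = resv[i].start - 1;
			out++;
		}
		if (resv[i].last == UINT64_MAX)
			return out;
		cursor = resv[i].last + 1;
	}
	allowed[out].start = cursor;
	allowed[out].last = UINT64_MAX;
	return out + 1;
}
```

Merging two devices' reserved lists is just concatenating them before calling merge_ranges(); the inversion then yields the intersection of the per-device allowed lists.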

Did you already write an algorithm to do this in qemu someplace?

Anyhow, this should be split out from this series. It seems simple
enough to merge it now if someone can confirm what qemu needs.

Jason
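The size-probe protocol in the quoted documentation (probe once to learn num_iovas, allocate, re-issue) can be sketched from userspace as below. Everything here is illustrative: query_resv_ranges() is a mock standing in for the real ioctl, and the real IOMMU_RESV_IOVA_RANGES call would return -1 with errno set to EMSGSIZE rather than a negative return code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

struct iova_range { uint64_t start, last; };

/*
 * Mock of the kernel side: fill as many ranges as the caller's array
 * holds, report the total in *num, and fail with -EMSGSIZE when the
 * array is too small -- mirroring the documented ioctl behavior.
 */
static int query_resv_ranges(struct iova_range *out, uint32_t *num)
{
	static const struct iova_range resv[] = {
		{ 0xfee00000, 0xfeefffff },	/* e.g. an MSI window */
		{ 0x1000, 0x1fff },
	};
	uint32_t total = 2, i;

	for (i = 0; i < total && i < *num; i++)
		out[i] = resv[i];
	if (*num < total) {
		*num = total;
		return -EMSGSIZE;
	}
	*num = total;
	return 0;
}

/* Userspace pattern: probe with zero capacity, allocate, re-issue. */
static struct iova_range *get_all_resv_ranges(uint32_t *num)
{
	struct iova_range *ranges;
	int rc;

	*num = 0;
	rc = query_resv_ranges(NULL, num);	/* probe: learn the total */
	if (rc != -EMSGSIZE)
		return NULL;
	ranges = calloc(*num, sizeof(*ranges));
	if (!ranges)
		return NULL;
	rc = query_resv_ranges(ranges, num);	/* retry with enough room */
	if (rc) {
		free(ranges);
		return NULL;
	}
	return ranges;
}
```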


* Re: [PATCH v3 08/17] iommufd: IOMMU_HWPT_ALLOC allocation with user data
  2023-07-24 11:03 ` [PATCH v3 08/17] iommufd: IOMMU_HWPT_ALLOC allocation with user data Yi Liu
@ 2023-07-28 17:55   ` Jason Gunthorpe
  2023-07-28 19:10     ` Nicolin Chen
  2023-07-31  6:31     ` Tian, Kevin
  0 siblings, 2 replies; 66+ messages in thread
From: Jason Gunthorpe @ 2023-07-28 17:55 UTC (permalink / raw)
  To: Yi Liu
  Cc: joro, alex.williamson, kevin.tian, robin.murphy, baolu.lu,
	cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	zhenzhong.duan

On Mon, Jul 24, 2023 at 04:03:57AM -0700, Yi Liu wrote:

> +	switch (pt_obj->type) {
> +	case IOMMUFD_OBJ_IOAS:
> +		ioas = container_of(pt_obj, struct iommufd_ioas, obj);
> +		break;
> +	case IOMMUFD_OBJ_HW_PAGETABLE:
> +		/* pt_id points HWPT only when hwpt_type is !IOMMU_HWPT_TYPE_DEFAULT */
> +		if (cmd->hwpt_type == IOMMU_HWPT_TYPE_DEFAULT) {
> +			rc = -EINVAL;
> +			goto out_put_pt;
> +		}
> +
> +		parent = container_of(pt_obj, struct iommufd_hw_pagetable, obj);
> +		/*
> +		 * Cannot allocate user-managed hwpt linking to auto_created
> +		 * hwpt. If the parent hwpt is already a user-managed hwpt,
> +		 * don't allocate another user-managed hwpt linking to it.
> +		 */
> +		if (parent->auto_domain || parent->parent) {
> +			rc = -EINVAL;
> +			goto out_put_pt;
> +		}
> +		ioas = parent->ioas;

Why do we set ioas here? I would think it should be NULL.

I think it is looking like a mistake to try and re-use
iommufd_hw_pagetable_alloc() directly for the nested case. It should
not have an IOAS argument, it should not do enforce_cc, or call iopt_*
functions.

So most of the function is removed. Probably better to make a new
ioas-less function for the nesting case.

> diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
> index 510db114fc61..5f4420626421 100644
> --- a/drivers/iommu/iommufd/main.c
> +++ b/drivers/iommu/iommufd/main.c
> @@ -426,7 +426,7 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
>  	IOCTL_OP(IOMMU_GET_HW_INFO, iommufd_get_hw_info, struct iommu_hw_info,
>  		 __reserved),
>  	IOCTL_OP(IOMMU_HWPT_ALLOC, iommufd_hwpt_alloc, struct iommu_hwpt_alloc,
> -		 __reserved),
> +		 data_uptr),

Nono, these can never change once we put them in. It specifies the
hard minimum size that userspace must provide. If userspace gives less
than this then the ioctl always fails. Changing it breaks all the
existing software.

The core code ensures that the trailing part of the cmd struct is
zero'd; the extended implementation must cope with zero'd values, which
this does.
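The compat rule being pointed at here — the minimum size is frozen, and a grown kernel struct must treat a zero-filled tail as the old behavior — can be illustrated in plain userspace C. The struct names and fields below are hypothetical stand-ins, not the actual uAPI:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* v1 of a hypothetical ioctl struct: this size becomes the frozen minimum. */
struct cmd_v1 {
	uint32_t size;
	uint32_t id;
};

/* v2 appends fields; old userspace still passes sizeof(struct cmd_v1). */
struct cmd_v2 {
	uint32_t size;
	uint32_t id;
	uint32_t hwpt_type;	/* 0 must mean "default", the old behavior */
	uint32_t data_len;	/* 0 must mean "no extra data" */
};

/*
 * Mimic the core-code contract: copy what userspace provided and
 * zero-fill the remainder, so a shorter (older) struct reads back as
 * "all new fields are zero".
 */
static void copy_from_user_zero_fill(struct cmd_v2 *dst, const void *src,
				     size_t user_size)
{
	memset(dst, 0, sizeof(*dst));
	memcpy(dst, src, user_size < sizeof(*dst) ? user_size : sizeof(*dst));
}
```

This is why the minimum size in IOCTL_OP must stay at the v1 field: raising it would make old binaries, which pass only sizeof(struct cmd_v1), fail the size check.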

> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> index f2026cde2d64..73bf9af91e99 100644
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -364,12 +364,27 @@ enum iommu_hwpt_type {
>   * @pt_id: The IOAS to connect this HWPT to
>   * @out_hwpt_id: The ID of the new HWPT
>   * @__reserved: Must be 0
> + * @hwpt_type: One of enum iommu_hwpt_type
> + * @data_len: Length of the type specific data
> + * @data_uptr: User pointer to the type specific data
>   *
>   * Explicitly allocate a hardware page table object. This is the same object
>   * type that is returned by iommufd_device_attach() and represents the
>   * underlying iommu driver's iommu_domain kernel object.
>   *
> - * A HWPT will be created with the IOVA mappings from the given IOAS.
> + * A kernel-managed HWPT will be created with the mappings from the given
> + * IOAS via the @pt_id. The @hwpt_type for this allocation can be set to
> + * either IOMMU_HWPT_TYPE_DEFAULT or a pre-defined type corresponding to
> + * an I/O page table type supported by the underlying IOMMU hardware.


> + * A user-managed HWPT will be created from a given parent HWPT via the
> + * @pt_id, in which the parent HWPT must be allocated previously via the
> + * same ioctl from a given IOAS (@pt_id). In this case, the @hwpt_type
> + * must be set to a pre-defined type corresponding to an I/O page table
> + * type supported by the underlying IOMMU hardware.
> + *
> + * If the @hwpt_type is set to IOMMU_HWPT_TYPE_DEFAULT, both the @data_len
> + * and the @data_uptr will be ignored. Otherwise, both must be
> given.

 If the @hwpt_type is set to IOMMU_HWPT_TYPE_DEFAULT then @data_len
 must be zero.

Jason


* Re: [PATCH v3 09/17] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-07-24 11:03 ` [PATCH v3 09/17] iommufd: Add IOMMU_HWPT_INVALIDATE Yi Liu
@ 2023-07-28 18:02   ` Jason Gunthorpe
  2023-07-31 10:07     ` Liu, Yi L
  2023-08-03  2:07     ` Nicolin Chen
  0 siblings, 2 replies; 66+ messages in thread
From: Jason Gunthorpe @ 2023-07-28 18:02 UTC (permalink / raw)
  To: Yi Liu
  Cc: joro, alex.williamson, kevin.tian, robin.murphy, baolu.lu,
	cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	zhenzhong.duan

On Mon, Jul 24, 2023 at 04:03:58AM -0700, Yi Liu wrote:
> In nested translation, the stage-1 page table is user-managed and used by
> the IOMMU hardware, so an update of any present page table entry in the
> stage-1 page table should be followed by an IOTLB invalidation.
> 
> This adds IOMMU_HWPT_INVALIDATE for stage-1 IOTLB invalidation.
> 
> Co-developed-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  drivers/iommu/iommufd/hw_pagetable.c    | 45 +++++++++++++++++++++++++
>  drivers/iommu/iommufd/iommufd_private.h |  9 +++++
>  drivers/iommu/iommufd/main.c            |  3 ++
>  include/uapi/linux/iommufd.h            | 22 ++++++++++++
>  4 files changed, 79 insertions(+)
> 
> diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
> index 97e4114226de..9064e6d181b4 100644
> --- a/drivers/iommu/iommufd/hw_pagetable.c
> +++ b/drivers/iommu/iommufd/hw_pagetable.c
> @@ -286,3 +286,48 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
>  	iommufd_put_object(&idev->obj);
>  	return rc;
>  }
> +
> +int iommufd_hwpt_invalidate(struct iommufd_ucmd *ucmd)
> +{
> +	struct iommu_hwpt_invalidate *cmd = ucmd->cmd;
> +	struct iommufd_hw_pagetable *hwpt;
> +	u32 user_data_len, klen;
> +	u64 user_ptr;
> +	int rc = 0;
> +
> +	if (!cmd->data_len || cmd->__reserved)
> +		return -EOPNOTSUPP;
> +
> +	hwpt = iommufd_get_hwpt(ucmd, cmd->hwpt_id);
> +	if (IS_ERR(hwpt))
> +		return PTR_ERR(hwpt);
> +
> +	/* Do not allow any kernel-managed hw_pagetable */
> +	if (!hwpt->parent) {

I don't think this is needed because:

> +		rc = -EINVAL;
> +		goto out_put_hwpt;
> +	}
> +
> +	klen = hwpt->domain->ops->cache_invalidate_user_data_len;
> +	if (!hwpt->domain->ops->cache_invalidate_user || !klen) {
> +		rc = -EOPNOTSUPP;

We need to get to a place where the drivers are providing proper ops
for the domains, so this op should never exist for a paging domain.

And return EINVAL here instead.

> +		goto out_put_hwpt;
> +	}
> +
> +	/*
> +	 * Copy the needed fields before reusing the ucmd buffer, this
> +	 * avoids memory allocation in this path.
> +	 */
> +	user_ptr = cmd->data_uptr;
> +	user_data_len = cmd->data_len;

Uhh, who checks that klen < the temporary stack struct?

Jason


* Re: [PATCH v3 08/17] iommufd: IOMMU_HWPT_ALLOC allocation with user data
  2023-07-28 17:55   ` Jason Gunthorpe
@ 2023-07-28 19:10     ` Nicolin Chen
  2023-07-31  7:22       ` Liu, Yi L
  2023-07-31  6:31     ` Tian, Kevin
  1 sibling, 1 reply; 66+ messages in thread
From: Nicolin Chen @ 2023-07-28 19:10 UTC (permalink / raw)
  To: Yi Liu, Jason Gunthorpe
  Cc: joro, alex.williamson, kevin.tian, robin.murphy, baolu.lu,
	cohuck, eric.auger, kvm, mjrosato, chao.p.peng, yi.y.sun, peterx,
	jasowang, shameerali.kolothum.thodi, lulu, suravee.suthikulpanit,
	iommu, linux-kernel, linux-kselftest, zhenzhong.duan

On Fri, Jul 28, 2023 at 02:55:57PM -0300, Jason Gunthorpe wrote:
> On Mon, Jul 24, 2023 at 04:03:57AM -0700, Yi Liu wrote:
> 
> > +	switch (pt_obj->type) {
> > +	case IOMMUFD_OBJ_IOAS:
> > +		ioas = container_of(pt_obj, struct iommufd_ioas, obj);
> > +		break;
> > +	case IOMMUFD_OBJ_HW_PAGETABLE:
> > +		/* pt_id points HWPT only when hwpt_type is !IOMMU_HWPT_TYPE_DEFAULT */
> > +		if (cmd->hwpt_type == IOMMU_HWPT_TYPE_DEFAULT) {
> > +			rc = -EINVAL;
> > +			goto out_put_pt;
> > +		}
> > +
> > +		parent = container_of(pt_obj, struct iommufd_hw_pagetable, obj);
> > +		/*
> > +		 * Cannot allocate user-managed hwpt linking to auto_created
> > +		 * hwpt. If the parent hwpt is already a user-managed hwpt,
> > +		 * don't allocate another user-managed hwpt linking to it.
> > +		 */
> > +		if (parent->auto_domain || parent->parent) {
> > +			rc = -EINVAL;
> > +			goto out_put_pt;
> > +		}
> > +		ioas = parent->ioas;
> 
> Why do we set ioas here? I would think it should be NULL.
> 
> I think it is looking like a mistake to try and re-use
> iommufd_hw_pagetable_alloc() directly for the nested case. It should
> not have a IOAS argument, it should not do enforce_cc, or iopt_*
> functions
> 
> So must of the function is removed. Probably better to make a new
> ioas-less function for the nesting case.

OK.

@Yi, along with PATCH-6 (IOMMU_RESV_SW_MSI), I will rework
this -- mainly breaking up NESTED hwpt with IOAS.

Thanks
Nic


* RE: [PATCH v3 07/17] iommufd: Add IOMMU_RESV_IOVA_RANGES
  2023-07-28 17:16     ` Jason Gunthorpe
@ 2023-07-31  6:14       ` Tian, Kevin
  2023-07-31 13:12         ` Jason Gunthorpe
  0 siblings, 1 reply; 66+ messages in thread
From: Tian, Kevin @ 2023-07-31  6:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, joro, alex.williamson, robin.murphy, baolu.lu, cohuck,
	eric.auger, nicolinc, kvm, mjrosato, chao.p.peng, yi.y.sun,
	peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, July 29, 2023 1:17 AM
> 
> On Fri, Jul 28, 2023 at 10:07:58AM +0000, Tian, Kevin wrote:
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Monday, July 24, 2023 7:04 PM
> > >
> > > This reports device's reserved IOVA regions to userspace. This is needed
> > > in the nested translation as userspace owns stage-1 HWPT, and
> userspace
> > > needs to exclude the reserved IOVA regions in the stage-1 HWPT hence
> > > exclude
> > > them in the device's DMA address space.
> > >
> > > This can also be used to figure out allowed IOVAs of an IOAS.
> >
> > We may need a special type to mark SW_MSI since it requires identity
> > mapping in stage-1 instead of being reserved.
> 
> Only the kernel can do this, so there is no action for user space to
> take beyond knowing that it is not a mappable IOVA.
> 
> The merit for "SW_MSI" may be to inform the rest of the system about
> the IOVA of the ITS page, but with the current situation that isn't
> required since only the kernel needs that information.

IIUC the guest kernel needs to know the "SW_MSI" region and then set up a
1:1 mapping for it in S1.

So Qemu needs to know and pass this information to the guest?

> 
> I think the long term way forward is to somehow arrange for the SW_MSI
> to not become mapped when creating the parent HWPT and instead cause
> the ITS page to be mapped through some explicit IOCTL.
> 

Yes, this is a cleaner approach. Qemu selects the intermediate address of
the vITS page and maps it to the physical ITS page in S2. Then the guest
kernel just picks whatever "SW_MSI" address in S1 for the vITS, as it does
today on bare metal.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: [PATCH v3 07/17] iommufd: Add IOMMU_RESV_IOVA_RANGES
  2023-07-28 17:44   ` Jason Gunthorpe
@ 2023-07-31  6:21     ` Tian, Kevin
  2023-07-31  9:53       ` Liu, Yi L
  2023-07-31 13:23       ` Jason Gunthorpe
  0 siblings, 2 replies; 66+ messages in thread
From: Tian, Kevin @ 2023-07-31  6:21 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: joro, alex.williamson, robin.murphy, baolu.lu, cohuck,
	eric.auger, nicolinc, kvm, mjrosato, chao.p.peng, yi.y.sun,
	peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, July 29, 2023 1:45 AM
> 
> On Mon, Jul 24, 2023 at 04:03:56AM -0700, Yi Liu wrote:
> 
> > +/**
> > + * struct iommu_resv_iova_ranges - ioctl(IOMMU_RESV_IOVA_RANGES)
> > + * @size: sizeof(struct iommu_resv_iova_ranges)
> > + * @dev_id: device to read resv iova ranges for
> > + * @num_iovas: Input/Output total number of resv ranges for the device
> > + * @__reserved: Must be 0
> > + * @resv_iovas: Pointer to the output array of struct
> iommu_resv_iova_range
> > + *
> > + * Query a device for ranges of reserved IOVAs. num_iovas will be set to
> the
> > + * total number of iovas and the resv_iovas[] will be filled in as space
> > + * permits.
> > + *
> > + * On input num_iovas is the length of the resv_iovas array. On output it is
> > + * the total number of iovas filled in. The ioctl will return -EMSGSIZE and
> > + * set num_iovas to the required value if num_iovas is too small. In this
> > + * case the caller should allocate a larger output array and re-issue the
> > + * ioctl.
> > + *
> > + * Under nested translation, userspace should query the reserved IOVAs
> for a
> > + * given device, and report it to the stage-1 I/O page table owner to
> exclude
> > + * the reserved IOVAs. The reserved IOVAs can also be used to figure out
> the
> > + * allowed IOVA ranges for the IOAS that the device is attached to. For
> detail
> > + * see ioctl IOMMU_IOAS_IOVA_RANGES.
> 
> I'm not sure I like this, the other APIs here work with the *allowed*
> IOVAs, which is the inverse of this one which works with the
> *disallowed* IOVAs.
> 
> It means we can't take the output of this API and feed it into
> IOMMUFD_CMD_IOAS_ALLOW_IOVAS.. Though I suppose qemu isn't going
> to do
> this anyhow.
> 
> On the other hand, it is kind of hard to intersect an allowed list of
> multiple idevs into a single master list.
> 
> As it is, userspace will have to aggregate the list, sort it, merge
> adjacent overlapping reserved ranges then invert the list to get an
> allowed list. This is not entirely simple..
> 
> Did you already write an algorithm to do this in qemu someplace?

It's optional for Qemu to aggregate this for S2, given that
IOMMU_IOAS_IOVA_RANGES is still being used. If the only purpose of this
new cmd is to report per-device reserved ranges to the guest, then
aggregation is not required.

Arguably IOMMU_IOAS_IOVA_RANGES becomes redundant with this
new cmd. But it's already there and as you said it's actually more
convenient to be used if the user doesn't care about per-device
reserved ranges...
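
As a rough illustration of the aggregation work described above (sort,
merge adjacent/overlapping reserved ranges, then invert into an allowed
list), assuming simple inclusive [start, last] pairs rather than the
actual uAPI structs:

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative range layout, not the uAPI struct. */
struct iova_range {
	uint64_t start;
	uint64_t last;	/* inclusive */
};

static int cmp_range(const void *a, const void *b)
{
	const struct iova_range *ra = a, *rb = b;

	if (ra->start < rb->start)
		return -1;
	return ra->start > rb->start;
}

/* Sort and merge overlapping/adjacent reserved ranges in place. */
static size_t merge_ranges(struct iova_range *r, size_t n)
{
	size_t i, out = 0;

	qsort(r, n, sizeof(*r), cmp_range);
	for (i = 0; i < n; i++) {
		if (out && r[i].start <= r[out - 1].last + 1) {
			if (r[i].last > r[out - 1].last)
				r[out - 1].last = r[i].last;
		} else {
			r[out++] = r[i];
		}
	}
	return out;
}

/* Invert merged reserved ranges into allowed ranges over [0, UINT64_MAX]. */
static size_t invert_ranges(const struct iova_range *resv, size_t n,
			    struct iova_range *allowed)
{
	uint64_t cursor = 0;
	size_t i, out = 0;

	for (i = 0; i < n; i++) {
		if (resv[i].start > cursor) {
			allowed[out].start = cursor;
			allowed[out].last = resv[i].start - 1;
			out++;
		}
		cursor = resv[i].last + 1;
		if (!cursor)	/* reserved range ended at UINT64_MAX */
			return out;
	}
	allowed[out].start = cursor;
	allowed[out].last = UINT64_MAX;
	return out + 1;
}
```

Not terribly complicated, but it is per-device bookkeeping that every
userspace consumer would have to repeat.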

> 
> Anyhow, this should be split out from this series. It seems simple
> enough to merge it now if someone can confirm what qemu needs.
> 
> Jason

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: [PATCH v3 08/17] iommufd: IOMMU_HWPT_ALLOC allocation with user data
  2023-07-28 17:55   ` Jason Gunthorpe
  2023-07-28 19:10     ` Nicolin Chen
@ 2023-07-31  6:31     ` Tian, Kevin
  2023-07-31 13:16       ` Jason Gunthorpe
  1 sibling, 1 reply; 66+ messages in thread
From: Tian, Kevin @ 2023-07-31  6:31 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: joro, alex.williamson, robin.murphy, baolu.lu, cohuck,
	eric.auger, nicolinc, kvm, mjrosato, chao.p.peng, yi.y.sun,
	peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, July 29, 2023 1:56 AM
> 
> On Mon, Jul 24, 2023 at 04:03:57AM -0700, Yi Liu wrote:
> 
> > +	switch (pt_obj->type) {
> > +	case IOMMUFD_OBJ_IOAS:
> > +		ioas = container_of(pt_obj, struct iommufd_ioas, obj);
> > +		break;
> > +	case IOMMUFD_OBJ_HW_PAGETABLE:
> > +		/* pt_id points HWPT only when hwpt_type
> is !IOMMU_HWPT_TYPE_DEFAULT */
> > +		if (cmd->hwpt_type == IOMMU_HWPT_TYPE_DEFAULT) {
> > +			rc = -EINVAL;
> > +			goto out_put_pt;
> > +		}
> > +
> > +		parent = container_of(pt_obj, struct iommufd_hw_pagetable,
> obj);
> > +		/*
> > +		 * Cannot allocate user-managed hwpt linking to
> auto_created
> > +		 * hwpt. If the parent hwpt is already a user-managed hwpt,
> > +		 * don't allocate another user-managed hwpt linking to it.
> > +		 */
> > +		if (parent->auto_domain || parent->parent) {
> > +			rc = -EINVAL;
> > +			goto out_put_pt;
> > +		}
> > +		ioas = parent->ioas;
> 
> Why do we set ioas here? I would think it should be NULL.
> 
> I think it is looking like a mistake to try and re-use
> iommufd_hw_pagetable_alloc() directly for the nested case. It should
> not have a IOAS argument, it should not do enforce_cc, or iopt_*
> functions

enforce_cc is still required since it's about memory accesses post
page table walking (no matter the walked table is single stage or
nested).

> 
> So most of the function is removed. Probably better to make a new
> ioas-less function for the nesting case.
> 
> > diff --git a/drivers/iommu/iommufd/main.c
> b/drivers/iommu/iommufd/main.c
> > index 510db114fc61..5f4420626421 100644
> > --- a/drivers/iommu/iommufd/main.c
> > +++ b/drivers/iommu/iommufd/main.c
> > @@ -426,7 +426,7 @@ static const struct iommufd_ioctl_op
> iommufd_ioctl_ops[] = {
> >  	IOCTL_OP(IOMMU_GET_HW_INFO, iommufd_get_hw_info, struct
> iommu_hw_info,
> >  		 __reserved),
> >  	IOCTL_OP(IOMMU_HWPT_ALLOC, iommufd_hwpt_alloc, struct
> iommu_hwpt_alloc,
> > -		 __reserved),
> > +		 data_uptr),
> 
> Nono, these can never change once we put them it. It specifies the
> hard minimum size that userspace must provide. If userspace gives less
> than this then the ioctl always fails. Changing it breaks all the
> existing software.
> 
> The core code ensures that the trailing part of the cmd struct is
> zero'd the extended implementation must cope with Zero'd values, which
> this does.
> 

Ideally expanding a uAPI structure's size should come with new flag bits.

This one might be a bit special. The hwpt_alloc() series was just queued to
iommufd-next. If the nesting series can come together in the same cycle,
then changing it in the current way is probably fine since there is no
existing software. Otherwise we need to follow common practice. 😊

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: [PATCH v3 08/17] iommufd: IOMMU_HWPT_ALLOC allocation with user data
  2023-07-28 19:10     ` Nicolin Chen
@ 2023-07-31  7:22       ` Liu, Yi L
  0 siblings, 0 replies; 66+ messages in thread
From: Liu, Yi L @ 2023-07-31  7:22 UTC (permalink / raw)
  To: Nicolin Chen, Jason Gunthorpe
  Cc: joro, alex.williamson, Tian, Kevin, robin.murphy, baolu.lu,
	cohuck, eric.auger, kvm, mjrosato, chao.p.peng, yi.y.sun, peterx,
	jasowang, shameerali.kolothum.thodi, lulu, suravee.suthikulpanit,
	iommu, linux-kernel, linux-kselftest, Duan, Zhenzhong

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Saturday, July 29, 2023 3:10 AM
> 
> On Fri, Jul 28, 2023 at 02:55:57PM -0300, Jason Gunthorpe wrote:
> > On Mon, Jul 24, 2023 at 04:03:57AM -0700, Yi Liu wrote:
> >
> > > +	switch (pt_obj->type) {
> > > +	case IOMMUFD_OBJ_IOAS:
> > > +		ioas = container_of(pt_obj, struct iommufd_ioas, obj);
> > > +		break;
> > > +	case IOMMUFD_OBJ_HW_PAGETABLE:
> > > +		/* pt_id points HWPT only when hwpt_type
> is !IOMMU_HWPT_TYPE_DEFAULT */
> > > +		if (cmd->hwpt_type == IOMMU_HWPT_TYPE_DEFAULT) {
> > > +			rc = -EINVAL;
> > > +			goto out_put_pt;
> > > +		}
> > > +
> > > +		parent = container_of(pt_obj, struct iommufd_hw_pagetable, obj);
> > > +		/*
> > > +		 * Cannot allocate user-managed hwpt linking to auto_created
> > > +		 * hwpt. If the parent hwpt is already a user-managed hwpt,
> > > +		 * don't allocate another user-managed hwpt linking to it.
> > > +		 */
> > > +		if (parent->auto_domain || parent->parent) {
> > > +			rc = -EINVAL;
> > > +			goto out_put_pt;
> > > +		}
> > > +		ioas = parent->ioas;
> >
> > Why do we set ioas here? I would think it should be NULL.
> >
> > I think it is looking like a mistake to try and re-use
> > iommufd_hw_pagetable_alloc() directly for the nested case. It should
> > not have a IOAS argument, it should not do enforce_cc, or iopt_*
> > functions
> >
> > So most of the function is removed. Probably better to make a new
> > ioas-less function for the nesting case.
> 
> OK.
> 
> @Yi, along with the PATCH-6 (IOMMU_RESV_SW_MSI), I will rework
> on this -- mainly breaking up NESTED hwpt with IOAS.

Thanks. Then I'll address the hw_info comments first.

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: [PATCH v3 07/17] iommufd: Add IOMMU_RESV_IOVA_RANGES
  2023-07-31  6:21     ` Tian, Kevin
@ 2023-07-31  9:53       ` Liu, Yi L
  2023-07-31 13:23       ` Jason Gunthorpe
  1 sibling, 0 replies; 66+ messages in thread
From: Liu, Yi L @ 2023-07-31  9:53 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe
  Cc: joro, alex.williamson, robin.murphy, baolu.lu, cohuck,
	eric.auger, nicolinc, kvm, mjrosato, chao.p.peng, yi.y.sun,
	peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Monday, July 31, 2023 2:22 PM
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Saturday, July 29, 2023 1:45 AM
> >
> > On Mon, Jul 24, 2023 at 04:03:56AM -0700, Yi Liu wrote:
> >
> > > +/**
> > > + * struct iommu_resv_iova_ranges - ioctl(IOMMU_RESV_IOVA_RANGES)
> > > + * @size: sizeof(struct iommu_resv_iova_ranges)
> > > + * @dev_id: device to read resv iova ranges for
> > > + * @num_iovas: Input/Output total number of resv ranges for the device
> > > + * @__reserved: Must be 0
> > > + * @resv_iovas: Pointer to the output array of struct
> > iommu_resv_iova_range
> > > + *
> > > + * Query a device for ranges of reserved IOVAs. num_iovas will be set to
> > the
> > > + * total number of iovas and the resv_iovas[] will be filled in as space
> > > + * permits.
> > > + *
> > > + * On input num_iovas is the length of the resv_iovas array. On output it is
> > > + * the total number of iovas filled in. The ioctl will return -EMSGSIZE and
> > > + * set num_iovas to the required value if num_iovas is too small. In this
> > > + * case the caller should allocate a larger output array and re-issue the
> > > + * ioctl.
> > > + *
> > > + * Under nested translation, userspace should query the reserved IOVAs
> > for a
> > > + * given device, and report it to the stage-1 I/O page table owner to
> > exclude
> > > + * the reserved IOVAs. The reserved IOVAs can also be used to figure out
> > the
> > > + * allowed IOVA ranges for the IOAS that the device is attached to. For
> > detail
> > > + * see ioctl IOMMU_IOAS_IOVA_RANGES.
> >
> > I'm not sure I like this, the other APIs here work with the *allowed*
> > IOVAs, which is the inverse of this one which works with the
> > *disallowed* IOVAs.
> >
> > It means we can't take the output of this API and feed it into
> > IOMMUFD_CMD_IOAS_ALLOW_IOVAS.. Though I suppose qemu isn't going
> > to do
> > this anyhow.
> >
> > On the other hand, it is kind of hard to intersect an allowed list of
> > multiple idevs into a single master list.
> >
> > As it is, userspace will have to aggregate the list, sort it, merge
> > adjacent overlapping reserved ranges then invert the list to get an
> > allowed list. This is not entirely simple..
> >
> > Did you already write an algorithm to do this in qemu someplace?
> 
> Qemu is optional to aggregate it for S2 given IOMMU_IOAS_IOVA_RANGES
> is still being used. If the only purpose of using this new cmd is to report
> per-device reserved ranges to the guest then aggregation is not required.
> 
> Arguably IOMMU_IOAS_IOVA_RANGES becomes redundant with this
> new cmd. But it's already there and as you said it's actually more
> convenient to be used if the user doesn't care about per-device
> reserved ranges...

Yes, it's not simple, as userspace needs to do a lot of work to get the allowed
iovas if multiple devices are attached to an IOAS.
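
For reference, the "-EMSGSIZE then retry" flow described in the kdoc
quoted above would look roughly like this in userspace. query() stands in
for issuing the IOMMU_RESV_IOVA_RANGES ioctl; all names here are
illustrative, not real API:

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative range layout, not the uAPI struct. */
struct iova_range {
	uint64_t start;
	uint64_t last;
};

/* Stand-in for the ioctl: num is in/out, as the kdoc describes. */
typedef int (*query_fn)(void *ctx, struct iova_range *out, uint32_t *num);

static struct iova_range *query_resv_ranges(query_fn query, void *ctx,
					    uint32_t *num_out)
{
	struct iova_range *ranges = NULL;
	uint32_t num = 0;
	int rc;

	for (;;) {
		rc = query(ctx, ranges, &num);
		if (rc == 0)
			break;
		if (rc != -EMSGSIZE) {
			free(ranges);
			return NULL;
		}
		/* The kernel reported the required count in num; retry. */
		free(ranges);
		ranges = calloc(num, sizeof(*ranges));
		if (!ranges)
			return NULL;
	}
	*num_out = num;
	return ranges;
}
```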

> > Anyhow, this should be split out from this series. It seems simple
> > enough to merge it now if someone can confirm what qemu needs.

Ok, so the reason is that this new ioctl can be used to figure out the allowed
iovas. Right?

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: [PATCH v3 09/17] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-07-28 18:02   ` Jason Gunthorpe
@ 2023-07-31 10:07     ` Liu, Yi L
  2023-07-31 13:19       ` Jason Gunthorpe
  2023-08-03  2:07     ` Nicolin Chen
  1 sibling, 1 reply; 66+ messages in thread
From: Liu, Yi L @ 2023-07-31 10:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: joro, alex.williamson, Tian, Kevin, robin.murphy, baolu.lu,
	cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, July 29, 2023 2:03 AM
> 
> On Mon, Jul 24, 2023 at 04:03:58AM -0700, Yi Liu wrote:
> > In nested translation, the stage-1 page table is user-managed and used by
> > IOMMU hardware, so update of any present page table entry in the stage-1
> > page table should be followed with an IOTLB invalidation.
> >
> > This adds IOMMU_HWPT_INVALIDATE for stage-1 IOTLB invalidation.
> >
> > Co-developed-by: Nicolin Chen <nicolinc@nvidia.com>
> > Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > ---
> >  drivers/iommu/iommufd/hw_pagetable.c    | 45 +++++++++++++++++++++++++
> >  drivers/iommu/iommufd/iommufd_private.h |  9 +++++
> >  drivers/iommu/iommufd/main.c            |  3 ++
> >  include/uapi/linux/iommufd.h            | 22 ++++++++++++
> >  4 files changed, 79 insertions(+)
> >
> > diff --git a/drivers/iommu/iommufd/hw_pagetable.c
> b/drivers/iommu/iommufd/hw_pagetable.c
> > index 97e4114226de..9064e6d181b4 100644
> > --- a/drivers/iommu/iommufd/hw_pagetable.c
> > +++ b/drivers/iommu/iommufd/hw_pagetable.c
> > @@ -286,3 +286,48 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
> >  	iommufd_put_object(&idev->obj);
> >  	return rc;
> >  }
> > +
> > +int iommufd_hwpt_invalidate(struct iommufd_ucmd *ucmd)
> > +{
> > +	struct iommu_hwpt_invalidate *cmd = ucmd->cmd;
> > +	struct iommufd_hw_pagetable *hwpt;
> > +	u32 user_data_len, klen;
> > +	u64 user_ptr;
> > +	int rc = 0;
> > +
> > +	if (!cmd->data_len || cmd->__reserved)
> > +		return -EOPNOTSUPP;
> > +
> > +	hwpt = iommufd_get_hwpt(ucmd, cmd->hwpt_id);
> > +	if (IS_ERR(hwpt))
> > +		return PTR_ERR(hwpt);
> > +
> > +	/* Do not allow any kernel-managed hw_pagetable */
> > +	if (!hwpt->parent) {
> 
> I don't think this is needed because:
> 
> > +		rc = -EINVAL;
> > +		goto out_put_hwpt;
> > +	}
> > +
> > +	klen = hwpt->domain->ops->cache_invalidate_user_data_len;
> > +	if (!hwpt->domain->ops->cache_invalidate_user || !klen) {
> > +		rc = -EOPNOTSUPP;
> 
> We need to get to a place where the drivers are providing proper ops
> for the domains, so this op should never exist for a paging domain.

Yes.

> 
> And return EINVAL here instead.

Sure.

> 
> > +		goto out_put_hwpt;
> > +	}
> > +
> > +	/*
> > +	 * Copy the needed fields before reusing the ucmd buffer, this
> > +	 * avoids memory allocation in this path.
> > +	 */
> > +	user_ptr = cmd->data_uptr;
> > +	user_data_len = cmd->data_len;
> 
> Uhh, who checks that klen < the temporary stack struct?

Take vtd as an example. The invalidate structure is struct iommu_hwpt_vtd_s1_invalidate[1].
The klen is sizeof(struct iommu_hwpt_vtd_s1_invalidate)[2]. iommu_hwpt_vtd_s1_invalidate
is also placed in the temporary stack struct (actually it is a union)[1]. So the klen should
be <= the size of the temporary stack struct.

[1] https://lore.kernel.org/linux-iommu/20230724111335.107427-8-yi.l.liu@intel.com/
[2] https://lore.kernel.org/linux-iommu/20230724111335.107427-10-yi.l.liu@intel.com/

It's not so explicit though. Perhaps it's worth having a check like the below in this patch?

if (unlikely(klen > sizeof(union ucmd_buffer)))
	return -EINVAL;

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: [PATCH v3 01/17] iommu: Add new iommu op to create domains owned by userspace
  2023-07-28 16:56     ` Jason Gunthorpe
@ 2023-07-31 12:44       ` Liu, Yi L
  2023-07-31 13:19         ` Jason Gunthorpe
  2023-08-03  2:28       ` Nicolin Chen
  1 sibling, 1 reply; 66+ messages in thread
From: Liu, Yi L @ 2023-07-31 12:44 UTC (permalink / raw)
  To: Jason Gunthorpe, Tian, Kevin
  Cc: joro, alex.williamson, robin.murphy, baolu.lu, cohuck,
	eric.auger, nicolinc, kvm, mjrosato, chao.p.peng, yi.y.sun,
	peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, July 29, 2023 12:56 AM
> 
> On Fri, Jul 28, 2023 at 09:37:21AM +0000, Tian, Kevin wrote:
> > > From: Yi Liu <yi.l.liu@intel.com>
> > > Sent: Monday, July 24, 2023 7:04 PM
> > >
> > > + * @domain_alloc_user: allocate a user iommu domain corresponding to
> > > the input
> > > + *                     @hwpt_type that is defined as enum iommu_hwpt_type in the
> > > + *                     include/uapi/linux/iommufd.h. A returning domain will be
> > > + *                     set to an IOMMU_DOMAIN_NESTED type, upon valid
> > > @user_data
> > > + *                     and @parent that is a kernel-managed domain. Otherwise,
> > > + *                     it will be set to an IOMMU_DOMAIN_UNMANAGED type.
> > > Return
> > > + *                     ERR_PTR on allocation failure.
> >
> > "If @user_data is valid and @parent points to a kernel-managed domain,
> > the returning domain is set to IOMMU_DOMAIN_NESTED type. Otherwise
> > it is set to IOMMU_DOMAIN_UNMANAGED type."
> 
>  "If @user_data is valid and @parent points to a kernel-managed domain,
>  then the returned domain must be the IOMMU_DOMAIN_NESTED type. Otherwise
>  the returned domain is IOMMU_DOMAIN_UNMANAGED."
> 
> Notice the detail that this API expects the driver to set the type and
> fully initialize the domain, including the generic iommu_domain
> struct, which is different than alloc_domain.
> 
> When we implement this in drivers we should tidy this so all the alloc
> flows fully initialize the domain internally.

Yes. This should be documented in the kdoc. Is it?

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 07/17] iommufd: Add IOMMU_RESV_IOVA_RANGES
  2023-07-31  6:14       ` Tian, Kevin
@ 2023-07-31 13:12         ` Jason Gunthorpe
  0 siblings, 0 replies; 66+ messages in thread
From: Jason Gunthorpe @ 2023-07-31 13:12 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, joro, alex.williamson, robin.murphy, baolu.lu, cohuck,
	eric.auger, nicolinc, kvm, mjrosato, chao.p.peng, yi.y.sun,
	peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

On Mon, Jul 31, 2023 at 06:14:50AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Saturday, July 29, 2023 1:17 AM
> > 
> > On Fri, Jul 28, 2023 at 10:07:58AM +0000, Tian, Kevin wrote:
> > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > Sent: Monday, July 24, 2023 7:04 PM
> > > >
> > > > This reports device's reserved IOVA regions to userspace. This is needed
> > > > in the nested translation as userspace owns stage-1 HWPT, and
> > userspace
> > > > needs to exclude the reserved IOVA regions in the stage-1 HWPT hence
> > > > exclude
> > > > them in the device's DMA address space.
> > > >
> > > > This can also be used to figure out allowed IOVAs of an IOAS.
> > >
> > > We may need a special type to mark SW_MSI since it requires identity
> > > mapping in stage-1 instead of being reserved.
> > 
> > Only the kernel can do this, so there is no action for user space to
> > take beyond knowing that is is not mappable IOVA.
> > 
> > The merit for "SW_MSI" may be to inform the rest of the system about
> > the IOVA of the ITS page, but with the current situation that isn't
> > required since only the kernel needs that information.
> 
> IIUC guest kernel needs to know the "SW_MSI" region and then setup an
> 1:1 mapping for it in S1.

Yes, but qemu hardcodes this and for some reason people thought that
was a good idea back when.

> > I think the long term way forward is to somehow arrange for the SW_MSI
> > to not become mapped when creating the parent HWPT and instead cause
> > the ITS page to be mapped through some explicit IOCTL.
> 
> yes this is a cleaner approach. Qemu selects the intermediate address of
> vITS page and maps it to physical ITS page in S2. Then the guest kernel
> just pick whatever "SW_MSI" address in S1 to vITS as it does today on 
> bare metal.

Right, so I've been inclined to minimize the amount of special stuff
created for this way of doing the MSI and hope we can reach a better
way sooner than later

Jason

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 08/17] iommufd: IOMMU_HWPT_ALLOC allocation with user data
  2023-07-31  6:31     ` Tian, Kevin
@ 2023-07-31 13:16       ` Jason Gunthorpe
  2023-08-01  2:35         ` Tian, Kevin
  2023-08-02 23:42         ` Nicolin Chen
  0 siblings, 2 replies; 66+ messages in thread
From: Jason Gunthorpe @ 2023-07-31 13:16 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, joro, alex.williamson, robin.murphy, baolu.lu, cohuck,
	eric.auger, nicolinc, kvm, mjrosato, chao.p.peng, yi.y.sun,
	peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

On Mon, Jul 31, 2023 at 06:31:20AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Saturday, July 29, 2023 1:56 AM
> > 
> > On Mon, Jul 24, 2023 at 04:03:57AM -0700, Yi Liu wrote:
> > 
> > > +	switch (pt_obj->type) {
> > > +	case IOMMUFD_OBJ_IOAS:
> > > +		ioas = container_of(pt_obj, struct iommufd_ioas, obj);
> > > +		break;
> > > +	case IOMMUFD_OBJ_HW_PAGETABLE:
> > > +		/* pt_id points HWPT only when hwpt_type
> > is !IOMMU_HWPT_TYPE_DEFAULT */
> > > +		if (cmd->hwpt_type == IOMMU_HWPT_TYPE_DEFAULT) {
> > > +			rc = -EINVAL;
> > > +			goto out_put_pt;
> > > +		}
> > > +
> > > +		parent = container_of(pt_obj, struct iommufd_hw_pagetable,
> > obj);
> > > +		/*
> > > +		 * Cannot allocate user-managed hwpt linking to
> > auto_created
> > > +		 * hwpt. If the parent hwpt is already a user-managed hwpt,
> > > +		 * don't allocate another user-managed hwpt linking to it.
> > > +		 */
> > > +		if (parent->auto_domain || parent->parent) {
> > > +			rc = -EINVAL;
> > > +			goto out_put_pt;
> > > +		}
> > > +		ioas = parent->ioas;
> > 
> > Why do we set ioas here? I would think it should be NULL.
> > 
> > I think it is looking like a mistake to try and re-use
> > iommufd_hw_pagetable_alloc() directly for the nested case. It should
> > not have a IOAS argument, it should not do enforce_cc, or iopt_*
> > functions
> 
> enforce_cc is still required since it's about memory accesses post
> page table walking (no matter the walked table is single stage or
> nested).

enforce_cc only has meaning if the kernel owns the IOPTEs, and if we
are creating a nest it does not. The guest has to set any necessary
IOPTE bits.

enforce_cc will be done on the parent of the nest as normal.

> Ideally expanding uAPI structure size should come with new flag bits.

Flags or some kind of 'zero is the same behavior as a smaller struct'
scheme.

This patch is doing the zero option:

 	__u32 __reserved;
+	__u32 hwpt_type;
+	__u32 data_len;
+	__aligned_u64 data_uptr;
 };

hwpt_type == 0 means default type
data_len == 0 means no data
data_uptr is ignored (zero is safe)

So there is no need to change it
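
A toy model of that "zero is the same behavior" scheme, loosely patterned
after the kernel's copy_struct_from_user() contract. This is a simplified
userspace sketch operating on plain memory, not the kernel implementation:

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy a possibly shorter or longer userspace struct into the kernel's
 * view of it:
 *  - old userspace, new kernel: the missing tail reads as zero, so
 *    hwpt_type == 0 / data_len == 0 behave exactly like the old struct;
 *  - new userspace, old kernel: any non-zero byte in the unknown tail
 *    is rejected (the kernel would return -E2BIG; -1 here).
 */
static int copy_struct_zero_extend(void *dst, size_t ksize,
				   const void *src, size_t usize)
{
	size_t n = usize < ksize ? usize : ksize;
	size_t i;

	memset(dst, 0, ksize);
	memcpy(dst, src, n);

	for (i = ksize; i < usize; i++)
		if (((const unsigned char *)src)[i])
			return -1;
	return 0;
}
```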

Jason

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 09/17] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-07-31 10:07     ` Liu, Yi L
@ 2023-07-31 13:19       ` Jason Gunthorpe
  2023-08-03  2:16         ` Nicolin Chen
  2023-08-03  2:56         ` Liu, Yi L
  0 siblings, 2 replies; 66+ messages in thread
From: Jason Gunthorpe @ 2023-07-31 13:19 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: joro, alex.williamson, Tian, Kevin, robin.murphy, baolu.lu,
	cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

On Mon, Jul 31, 2023 at 10:07:32AM +0000, Liu, Yi L wrote:
> > > +		goto out_put_hwpt;
> > > +	}
> > > +
> > > +	/*
> > > +	 * Copy the needed fields before reusing the ucmd buffer, this
> > > +	 * avoids memory allocation in this path.
> > > +	 */
> > > +	user_ptr = cmd->data_uptr;
> > > +	user_data_len = cmd->data_len;
> > 
> > Uhh, who checks that klen < the temporary stack struct?
> 
> Take vtd as an example. The invalidate structure is struct iommu_hwpt_vtd_s1_invalidate[1].
> The klen is sizeof(struct iommu_hwpt_vtd_s1_invalidate)[2]. iommu_hwpt_vtd_s1_invalidate
> is also placed in the temporary stack struct (actually it is a union)[1]. So the klen should
> be <= temporary stack.

Ohh, I think I would add a few comments noting that the invalidate
structs need to be added to that union. Easy to miss.

> It's not so explicit though. Perhaps worth to have a check like below in this patch?
> 
> if (unlikely(klen > sizeof(union ucmd_buffer)))
> 	return -EINVAL;

Yes, stick this in the domain allocation path with a WARN_ON. A driver
that allocates a domain with an invalid size is broken.
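
A minimal sketch of such a check, with the kernel structures stubbed out
for illustration (warn_on() stands in for WARN_ON(); the union members and
type names are placeholders, not the actual definitions):

```c
#include <errno.h>
#include <stdio.h>

/* Userspace stand-in for the kernel's WARN_ON() in this sketch. */
static int warn_on(int cond, const char *msg)
{
	if (cond)
		fprintf(stderr, "WARNING: %s\n", msg);
	return cond;
}

/* Placeholder for the ucmd_buffer union holding all invalidate structs. */
union ucmd_buffer {
	char vtd_s1_invalidate[16];	/* e.g. the VT-d invalidate struct */
	char some_other_driver_data[32];
};

struct iommu_domain_ops_stub {
	unsigned int cache_invalidate_user_data_len;
};

/*
 * Called on the domain allocation path: a driver advertising an
 * invalidation struct larger than the ucmd_buffer union is buggy, so
 * warn and fail the allocation instead of overrunning the stack buffer
 * in the invalidate path later.
 */
static int check_invalidate_data_len(const struct iommu_domain_ops_stub *ops)
{
	if (warn_on(ops->cache_invalidate_user_data_len >
			    sizeof(union ucmd_buffer),
		    "cache_invalidate_user_data_len exceeds ucmd_buffer"))
		return -EINVAL;
	return 0;
}
```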

Jason

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 01/17] iommu: Add new iommu op to create domains owned by userspace
  2023-07-31 12:44       ` Liu, Yi L
@ 2023-07-31 13:19         ` Jason Gunthorpe
  0 siblings, 0 replies; 66+ messages in thread
From: Jason Gunthorpe @ 2023-07-31 13:19 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Tian, Kevin, joro, alex.williamson, robin.murphy, baolu.lu,
	cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

On Mon, Jul 31, 2023 at 12:44:25PM +0000, Liu, Yi L wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Saturday, July 29, 2023 12:56 AM
> > 
> > On Fri, Jul 28, 2023 at 09:37:21AM +0000, Tian, Kevin wrote:
> > > > From: Yi Liu <yi.l.liu@intel.com>
> > > > Sent: Monday, July 24, 2023 7:04 PM
> > > >
> > > > + * @domain_alloc_user: allocate a user iommu domain corresponding to
> > > > the input
> > > > + *                     @hwpt_type that is defined as enum iommu_hwpt_type in the
> > > > + *                     include/uapi/linux/iommufd.h. A returning domain will be
> > > > + *                     set to an IOMMU_DOMAIN_NESTED type, upon valid
> > > > @user_data
> > > > + *                     and @parent that is a kernel-managed domain. Otherwise,
> > > > + *                     it will be set to an IOMMU_DOMAIN_UNMANAGED type.
> > > > Return
> > > > + *                     ERR_PTR on allocation failure.
> > >
> > > "If @user_data is valid and @parent points to a kernel-managed domain,
> > > the returning domain is set to IOMMU_DOMAIN_NESTED type. Otherwise
> > > it is set to IOMMU_DOMAIN_UNMANAGED type."
> > 
> >  "If @user_data is valid and @parent points to a kernel-managed domain,
> >  then the returned domain must be the IOMMU_DOMAIN_NESTED type. Otherwise
> >  the returned domain is IOMMU_DOMAIN_UNMANAGED."
> > 
> > Notice the detail that this API expects the driver to set the type and
> > fully initialize the domain, including the generic iommu_domain
> > struct, which is different than alloc_domain.
> > 
> > When we implement this in drivers we should tidy this so all the alloc
> > flows fully initialize the domain internally.
> 
> Yes. this should be documented in the kdoc. Is it?

Yeah, maybe it should be mentioned

Jason

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 07/17] iommufd: Add IOMMU_RESV_IOVA_RANGES
  2023-07-31  6:21     ` Tian, Kevin
  2023-07-31  9:53       ` Liu, Yi L
@ 2023-07-31 13:23       ` Jason Gunthorpe
  2023-08-01  2:40         ` Tian, Kevin
  1 sibling, 1 reply; 66+ messages in thread
From: Jason Gunthorpe @ 2023-07-31 13:23 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, joro, alex.williamson, robin.murphy, baolu.lu, cohuck,
	eric.auger, nicolinc, kvm, mjrosato, chao.p.peng, yi.y.sun,
	peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

On Mon, Jul 31, 2023 at 06:21:44AM +0000, Tian, Kevin wrote:

> > As it is, userspace will have to aggregate the list, sort it, merge
> > adjacent overlapping reserved ranges then invert the list to get an
> > allowed list. This is not entirely simple..
> > 
> > Did you already write an algorithm to do this in qemu someplace?
> 
> Qemu is optional to aggregate it for S2 given IOMMU_IOAS_IOVA_RANGES
> is still being used. If the only purpose of using this new cmd is to report
> per-device reserved ranges to the guest then aggregation is not required.

I don't think it is entirely optional. If qemu doesn't track this,
then it will have failures when attaching the S2 to the device. It
needs to make sure it punches the right holes in the guest memory map
to be compatible with the VFIO HW.

I suppose in reality the reserved regions are fairly predictable and
probably always match the existing qemu memory map so you can ignore
this and still work.

Plus most qemu cases don't deal with hotplug so you can build up the
identity ioas with all the devices and then use IOMMU_IOAS_IOVA_RANGES
as you say and still work.

> Arguably IOMMU_IOAS_IOVA_RANGES becomes redundant with this
> new cmd. But it's already there and as you said it's actually more
> convenient to be used if the user doesn't care about per-device
> reserved ranges...

Yes and yes

So, I guess we should leave it like this?

Jason
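The aggregate/sort/merge/invert sequence discussed above can be sketched in plain C. The helper below is purely illustrative (not qemu code); it assumes inclusive [start, last] ranges and, for simplicity, that only the last reserved range may end at UINT64_MAX:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct range { uint64_t start, last; };	/* inclusive [start, last] */

static int cmp_start(const void *a, const void *b)
{
	const struct range *x = a, *y = b;

	return x->start < y->start ? -1 : x->start > y->start;
}

/*
 * Sort the reserved ranges, merge overlapping or adjacent ones, then
 * invert the result into an allowed list covering [0, UINT64_MAX].
 * Returns the number of allowed ranges written.
 */
static size_t invert_reserved(struct range *resv, size_t n,
			      struct range *allowed)
{
	size_t i, m = 0, out = 0;
	uint64_t cur = 0;

	qsort(resv, n, sizeof(*resv), cmp_start);
	for (i = 1; i < n; i++) {
		if (resv[i].start <= resv[m].last + 1)	/* overlap/adjacent */
			resv[m].last = resv[i].last > resv[m].last ?
				       resv[i].last : resv[m].last;
		else
			resv[++m] = resv[i];
	}
	if (n)
		n = m + 1;

	for (i = 0; i < n; i++) {
		if (resv[i].start > cur)
			allowed[out++] =
				(struct range){ cur, resv[i].start - 1 };
		cur = resv[i].last + 1;
		if (!cur)	/* reserved range reached UINT64_MAX */
			return out;
	}
	allowed[out++] = (struct range){ cur, UINT64_MAX };
	return out;
}
```

A VMM would feed the per-device reserved lists into `resv` before building its guest memory map.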

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: [PATCH v3 08/17] iommufd: IOMMU_HWPT_ALLOC allocation with user data
  2023-07-31 13:16       ` Jason Gunthorpe
@ 2023-08-01  2:35         ` Tian, Kevin
  2023-08-02 23:42         ` Nicolin Chen
  1 sibling, 0 replies; 66+ messages in thread
From: Tian, Kevin @ 2023-08-01  2:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, joro, alex.williamson, robin.murphy, baolu.lu, cohuck,
	eric.auger, nicolinc, kvm, mjrosato, chao.p.peng, yi.y.sun,
	peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, July 31, 2023 9:16 PM
> 
> On Mon, Jul 31, 2023 at 06:31:20AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Saturday, July 29, 2023 1:56 AM
> > >
> > > On Mon, Jul 24, 2023 at 04:03:57AM -0700, Yi Liu wrote:
> > >
> > > > +	switch (pt_obj->type) {
> > > > +	case IOMMUFD_OBJ_IOAS:
> > > > +		ioas = container_of(pt_obj, struct iommufd_ioas, obj);
> > > > +		break;
> > > > +	case IOMMUFD_OBJ_HW_PAGETABLE:
> > > > +		/* pt_id points HWPT only when hwpt_type
> > > is !IOMMU_HWPT_TYPE_DEFAULT */
> > > > +		if (cmd->hwpt_type == IOMMU_HWPT_TYPE_DEFAULT) {
> > > > +			rc = -EINVAL;
> > > > +			goto out_put_pt;
> > > > +		}
> > > > +
> > > > +		parent = container_of(pt_obj, struct iommufd_hw_pagetable,
> > > obj);
> > > > +		/*
> > > > +		 * Cannot allocate user-managed hwpt linking to
> > > auto_created
> > > > +		 * hwpt. If the parent hwpt is already a user-managed hwpt,
> > > > +		 * don't allocate another user-managed hwpt linking to it.
> > > > +		 */
> > > > +		if (parent->auto_domain || parent->parent) {
> > > > +			rc = -EINVAL;
> > > > +			goto out_put_pt;
> > > > +		}
> > > > +		ioas = parent->ioas;
> > >
> > > Why do we set ioas here? I would think it should be NULL.
> > >
> > > I think it is looking like a mistake to try and re-use
> > > iommufd_hw_pagetable_alloc() directly for the nested case. It should
> > > not have a IOAS argument, it should not do enforce_cc, or iopt_*
> > > functions
> >
> > enforce_cc is still required since it's about memory accesses post
> > page table walking (no matter the walked table is single stage or
> > nested).
> 
> enforce_cc only has meaning if the kernel owns the IOPTEs, and if we
> are creating a nest it does not. The guest has to set any necessary
> IOPTE bits.
> 
> enforce_cc will be done on the parent of the nest as normal.

Ah, yes. What I described is the final behavior but in concept it's
done for the parent.

> 
> > Ideally expanding uAPI structure size should come with new flag bits.
> 
> Flags or some kind of 'zero is the same behavior as a smaller struct'
> scheme.
> 
> This patch is doing the zero option:
> 
>  	__u32 __reserved;
> +	__u32 hwpt_type;
> +	__u32 data_len;
> +	__aligned_u64 data_uptr;
>  };
> 
> hwpt_type == 0 means default type
> data_len == 0 means no data
> data_uptr is ignored (zero is safe)
> 
> So there is no need to change it
> 

Makes sense
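The "zero option" works because the kernel zero-fills the tail when userspace passes a shorter struct (copy_struct_from_user() semantics). A small standalone model, with a mocked-up struct layout (only the field names mirror the patch), shows why an old caller gets the default behavior:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative layout, not the real uAPI struct. */
struct hwpt_alloc_v2 {
	uint32_t size;
	uint32_t __reserved;
	uint32_t hwpt_type;	/* 0 == default type */
	uint32_t data_len;	/* 0 == no data */
	uint64_t data_uptr;	/* ignored when data_len == 0 */
};

/*
 * Mimic copy_struct_from_user(): copy what userspace provided and zero
 * the tail, so an old (shorter) struct behaves exactly like the grown
 * struct with the appended fields set to zero.
 */
static void copy_cmd(struct hwpt_alloc_v2 *dst, const void *src,
		     size_t user_size)
{
	size_t n = user_size < sizeof(*dst) ? user_size : sizeof(*dst);

	memset(dst, 0, sizeof(*dst));
	memcpy(dst, src, n);
}
```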

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: [PATCH v3 07/17] iommufd: Add IOMMU_RESV_IOVA_RANGES
  2023-07-31 13:23       ` Jason Gunthorpe
@ 2023-08-01  2:40         ` Tian, Kevin
  2023-08-01 18:22           ` Jason Gunthorpe
  0 siblings, 1 reply; 66+ messages in thread
From: Tian, Kevin @ 2023-08-01  2:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, joro, alex.williamson, robin.murphy, baolu.lu, cohuck,
	eric.auger, nicolinc, kvm, mjrosato, chao.p.peng, yi.y.sun,
	peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, July 31, 2023 9:24 PM
> 
> On Mon, Jul 31, 2023 at 06:21:44AM +0000, Tian, Kevin wrote:
> 
> > > As it is, userspace will have to aggregate the list, sort it, merge
> > > adjacent overlapping reserved ranges then invert the list to get an
> > > allowed list. This is not entirely simple..
> > >
> > > Did you already write an algorithm to do this in qemu someplace?
> >
> > It is optional for Qemu to aggregate it for S2, given that
> > IOMMU_IOAS_IOVA_RANGES is still being used. If the only purpose of using
> > this new cmd is to report per-device reserved ranges to the guest then
> > aggregation is not required.
> 
> I don't think it is entirely optional. If qemu doesn't track this,
> then it will have failures when attaching the S2 to the device. It
> needs to make sure it punches the right holes in the guest memory map
> to be compatible with the VFIO HW.
> 
> I suppose in reality the reserved regions are fairly predictable and
> probably always match the existing qemu memory map so you can ignore
> this and still work.
> 
> Plus most qemu cases don't deal with hotplug so you can build up the
> identity ioas with all the devices and then use IOMMU_IOAS_IOVA_RANGES
> as you say and still work.
> 
> > Arguably IOMMU_IOAS_IOVA_RANGES becomes redundant with this
> > new cmd. But it's already there and as you said it's actually more
> > convenient to be used if the user doesn't care about per-device
> > reserved ranges...
> 
> Yes and yes
> 
> So, I guess we should leave it like this?
> 

Yes. Along with this discussion (including what you explained for sw_msi),
let's abandon this new cmd and leave things as they are today.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 07/17] iommufd: Add IOMMU_RESV_IOVA_RANGES
  2023-08-01  2:40         ` Tian, Kevin
@ 2023-08-01 18:22           ` Jason Gunthorpe
  2023-08-02  1:09             ` Tian, Kevin
  0 siblings, 1 reply; 66+ messages in thread
From: Jason Gunthorpe @ 2023-08-01 18:22 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, joro, alex.williamson, robin.murphy, baolu.lu, cohuck,
	eric.auger, nicolinc, kvm, mjrosato, chao.p.peng, yi.y.sun,
	peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

On Tue, Aug 01, 2023 at 02:40:44AM +0000, Tian, Kevin wrote:

> > So, I guess we should leave it like this?
> > 
> 
> Yes. Along with this discussion (including what you explained for sw_msi)
> let's abandon this new cmd and leave it as today.

You sure? This makes it basically impossible to write a "correct" vmm
that is aware of what the physical memory map must be early on.
Jason


^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: [PATCH v3 07/17] iommufd: Add IOMMU_RESV_IOVA_RANGES
  2023-08-01 18:22           ` Jason Gunthorpe
@ 2023-08-02  1:09             ` Tian, Kevin
  2023-08-02 12:22               ` Jason Gunthorpe
  2023-08-03  1:23               ` Nicolin Chen
  0 siblings, 2 replies; 66+ messages in thread
From: Tian, Kevin @ 2023-08-02  1:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, joro, alex.williamson, robin.murphy, baolu.lu, cohuck,
	eric.auger, nicolinc, kvm, mjrosato, chao.p.peng, yi.y.sun,
	peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, August 2, 2023 2:23 AM
> 
> On Tue, Aug 01, 2023 at 02:40:44AM +0000, Tian, Kevin wrote:
> 
> > > So, I guess we should leave it like this?
> > >
> >
> > Yes. Along with this discussion (including what you explained for sw_msi)
> > let's abandon this new cmd and leave it as today.
> 
> You sure? This makes it basically impossible to write a "correct" vmm
> that is aware of what the physical memory map must be early on
> 

emmm... I thought that's what you meant by "leave it like this", plus the
fact that the existing VMM's memory layout happens to match the reserved
regions. Nobody has complained about the lack of such an interface for
years, so we may postpone supporting it until it's really required.

btw even if we add this new cmd now, getting the Qemu support to
use the aggregated list when creating the guest memory map is not
a simple task given that currently vfio only passively acts on change
notifications in the guest memory layout. It requires a new mechanism
to enforce strict order (probe all vfio devices before creating the memory
layout) and then inject vfio reserved regions into the layout.

Preferably let's not make it a hard dependency for this series.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 07/17] iommufd: Add IOMMU_RESV_IOVA_RANGES
  2023-08-02  1:09             ` Tian, Kevin
@ 2023-08-02 12:22               ` Jason Gunthorpe
  2023-08-03  1:23               ` Nicolin Chen
  1 sibling, 0 replies; 66+ messages in thread
From: Jason Gunthorpe @ 2023-08-02 12:22 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, joro, alex.williamson, robin.murphy, baolu.lu, cohuck,
	eric.auger, nicolinc, kvm, mjrosato, chao.p.peng, yi.y.sun,
	peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

On Wed, Aug 02, 2023 at 01:09:28AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, August 2, 2023 2:23 AM
> > 
> > On Tue, Aug 01, 2023 at 02:40:44AM +0000, Tian, Kevin wrote:
> > 
> > > > So, I guess we should leave it like this?
> > > >
> > >
> > > Yes. Along with this discussion (including what you explained for sw_msi)
> > > let's abandon this new cmd and leave it as today.
> > 
> > You sure? This makes it basically impossible to write a "correct" vmm
> > that is aware of what the physical memory map must be early on
> > 
> 
> emmm... I thought that's what you meant by "leave it like this", plus the
> fact that the existing VMM's memory layout happens to match the reserved
> regions. Nobody has complained about the lack of such an interface for
> years, so we may postpone supporting it until it's really required.
> 
> btw even if we add this new cmd now, getting the Qemu support to
> use the aggregated list when creating the guest memory map is not
> a simple task given that currently vfio only passively acts on change
> notifications in the guest memory layout. It requires a new mechanism
> to enforce strict order (probe all vfio devices before creating the memory
> layout) and then inject vfio reserved regions into the layout.
> 
> Preferably let's not make it a hard dependency for this series.

I see, if qemu won't use it then let's not do it

Jason

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 08/17] iommufd: IOMMU_HWPT_ALLOC allocation with user data
  2023-07-31 13:16       ` Jason Gunthorpe
  2023-08-01  2:35         ` Tian, Kevin
@ 2023-08-02 23:42         ` Nicolin Chen
  2023-08-02 23:43           ` Jason Gunthorpe
  1 sibling, 1 reply; 66+ messages in thread
From: Nicolin Chen @ 2023-08-02 23:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Liu, Yi L, joro, alex.williamson, robin.murphy,
	baolu.lu, cohuck, eric.auger, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

On Mon, Jul 31, 2023 at 10:16:17AM -0300, Jason Gunthorpe wrote:
 
> > Ideally expanding uAPI structure size should come with new flag bits.
> 
> Flags or some kind of 'zero is the same behavior as a smaller struct'
> scheme.
> 
> This patch is doing the zero option:
> 
>  	__u32 __reserved;
> +	__u32 hwpt_type;
> +	__u32 data_len;
> +	__aligned_u64 data_uptr;
>  };
> 
> hwpt_type == 0 means default type
> data_len == 0 means no data
> data_uptr is ignored (zero is safe)
> 
> So there is no need to change it

TEST_LENGTH passing ".size = sizeof(struct _struct) - 1" expects a
-EINVAL error code from "if (ucmd.user_size < op->min_size)" check
in the iommufd_fops_ioctl(). This has been working when min_size is
exactly the size of the structure.

When the size of the structure becomes larger than min_size, i.e.
the passing size above is larger than min_size, it bypasses that
min_size sanity and goes down to an ioctl handler with a potential
risk. And actually, the size range can be [min_size, struct_size),
making it harder for us to sanitize with the existing code.

I wonder what's the generic way of sanitizing this case? And, it
seems that TEST_LENGTH needs some rework to test min_size only?

Thanks
Nic

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 08/17] iommufd: IOMMU_HWPT_ALLOC allocation with user data
  2023-08-02 23:42         ` Nicolin Chen
@ 2023-08-02 23:43           ` Jason Gunthorpe
  2023-08-03  0:53             ` Nicolin Chen
  0 siblings, 1 reply; 66+ messages in thread
From: Jason Gunthorpe @ 2023-08-02 23:43 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Tian, Kevin, Liu, Yi L, joro, alex.williamson, robin.murphy,
	baolu.lu, cohuck, eric.auger, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

On Wed, Aug 02, 2023 at 04:42:10PM -0700, Nicolin Chen wrote:
> On Mon, Jul 31, 2023 at 10:16:17AM -0300, Jason Gunthorpe wrote:
>  
> > > Ideally expanding uAPI structure size should come with new flag bits.
> > 
> > Flags or some kind of 'zero is the same behavior as a smaller struct'
> > scheme.
> > 
> > This patch is doing the zero option:
> > 
> >  	__u32 __reserved;
> > +	__u32 hwpt_type;
> > +	__u32 data_len;
> > +	__aligned_u64 data_uptr;
> >  };
> > 
> > hwpt_type == 0 means default type
> > data_len == 0 means no data
> > data_uptr is ignored (zero is safe)
> > 
> > So there is no need to change it
> 
> TEST_LENGTH passing ".size = sizeof(struct _struct) - 1" expects a
> -EINVAL error code from "if (ucmd.user_size < op->min_size)" check
> in the iommufd_fops_ioctl(). This has been working when min_size is
> exactly the size of the structure.
> 
> When the size of the structure becomes larger than min_size, i.e.
> the passing size above is larger than min_size, it bypasses that
> min_size sanity and goes down to an ioctl handler with a potential
> risk. And actually, the size range can be [min_size, struct_size),
> making it harder for us to sanitize with the existing code.
> 
> I wonder what's the generic way of sanitizing this case? And, it
> seems that TEST_LENGTH needs some rework to test min_size only?

Yes, it should technically test using offsetof and a matching set of
struct members.

Jason

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 08/17] iommufd: IOMMU_HWPT_ALLOC allocation with user data
  2023-08-02 23:43           ` Jason Gunthorpe
@ 2023-08-03  0:53             ` Nicolin Chen
  2023-08-03 16:47               ` Jason Gunthorpe
  0 siblings, 1 reply; 66+ messages in thread
From: Nicolin Chen @ 2023-08-03  0:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Liu, Yi L, joro, alex.williamson, robin.murphy,
	baolu.lu, cohuck, eric.auger, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

On Wed, Aug 02, 2023 at 08:43:12PM -0300, Jason Gunthorpe wrote:
> On Wed, Aug 02, 2023 at 04:42:10PM -0700, Nicolin Chen wrote:
> > On Mon, Jul 31, 2023 at 10:16:17AM -0300, Jason Gunthorpe wrote:
> >  
> > > > Ideally expanding uAPI structure size should come with new flag bits.
> > > 
> > > Flags or some kind of 'zero is the same behavior as a smaller struct'
> > > scheme.
> > > 
> > > This patch is doing the zero option:
> > > 
> > >  	__u32 __reserved;
> > > +	__u32 hwpt_type;
> > > +	__u32 data_len;
> > > +	__aligned_u64 data_uptr;
> > >  };
> > > 
> > > hwpt_type == 0 means default type
> > > data_len == 0 means no data
> > > data_uptr is ignored (zero is safe)
> > > 
> > > So there is no need to change it
> > 
> > TEST_LENGTH passing ".size = sizeof(struct _struct) - 1" expects a
> > -EINVAL error code from "if (ucmd.user_size < op->min_size)" check
> > in the iommufd_fops_ioctl(). This has been working when min_size is
> > exactly the size of the structure.
> > 
> > When the size of the structure becomes larger than min_size, i.e.
> > the passing size above is larger than min_size, it bypasses that
> > min_size sanity and goes down to an ioctl handler with a potential
> > risk. And actually, the size range can be [min_size, struct_size),
> > making it harder for us to sanitize with the existing code.
> > 
> > I wonder what's the generic way of sanitizing this case? And, it
> > seems that TEST_LENGTH needs some rework to test min_size only?
> 
> Yes, it should technically test using offsetof and a matching set of
> struct members.

OK. I copied 3 lines for offsetofend from the kernel and did this:
--------------------------------------------------------------------------
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index 6b075a68b928..a15a475c1243 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -86,12 +86,13 @@ TEST_F(iommufd, cmd_fail)

 TEST_F(iommufd, cmd_length)
 {
-#define TEST_LENGTH(_struct, _ioctl)                                     \
+#define TEST_LENGTH(_struct, _ioctl, _last)                              \
        {                                                                \
+               size_t min_size = offsetofend(struct _struct, _last);    \
                struct {                                                 \
                        struct _struct cmd;                              \
                        uint8_t extra;                                   \
-               } cmd = { .cmd = { .size = sizeof(struct _struct) - 1 }, \
+               } cmd = { .cmd = { .size = min_size - 1 },               \
                          .extra = UINT8_MAX };                          \
                int old_errno;                                           \
                int rc;                                                  \
--------------------------------------------------------------------------

Any misaligned size within the range [min_size, struct_size) still
doesn't have coverage though. Is this something we have to let slip
through, as a potential risk?

Thanks
Nic
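The min_size check and the gap described above can be modeled in a standalone C sketch; the struct layout and names below are illustrative mocks, not the real uAPI:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same helper the selftest copies from the kernel. */
#define offsetofend(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))

/*
 * Illustrative grown command struct: only the first two fields existed
 * in the original (min_size) layout; the rest were appended later.
 */
struct cmd_v2 {
	uint32_t size;
	uint32_t __reserved;
	uint32_t hwpt_type;
	uint32_t data_len;
	uint64_t data_uptr;
};

/*
 * The dispatch-time check under discussion: only sizes below min_size
 * can be rejected, since larger sizes may be legitimate newer structs.
 */
static int check_user_size(size_t user_size, size_t min_size)
{
	return user_size < min_size ? -1 /* stands in for -EINVAL */ : 0;
}
```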

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 07/17] iommufd: Add IOMMU_RESV_IOVA_RANGES
  2023-08-02  1:09             ` Tian, Kevin
  2023-08-02 12:22               ` Jason Gunthorpe
@ 2023-08-03  1:23               ` Nicolin Chen
  2023-08-03  1:25                 ` Tian, Kevin
  1 sibling, 1 reply; 66+ messages in thread
From: Nicolin Chen @ 2023-08-03  1:23 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, Liu, Yi L, joro, alex.williamson, robin.murphy,
	baolu.lu, cohuck, eric.auger, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

On Wed, Aug 02, 2023 at 01:09:28AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, August 2, 2023 2:23 AM
> >
> > On Tue, Aug 01, 2023 at 02:40:44AM +0000, Tian, Kevin wrote:
> >
> > > > So, I guess we should leave it like this?
> > > >
> > >
> > > Yes. Along with this discussion (including what you explained for sw_msi)
> > > let's abandon this new cmd and leave it as today.
> >
> > You sure? This makes it basically impossible to write a "correct" vmm
> > that is aware of what the physical memory map must be early on
> >
> 
> emmm... I thought it's what you meant by "leave it like this" and the
> fact that existing VMM's memory layout happens to match the reserved
> regions. Nobody complains lacking of such a interface for years then
> we may postpone supporting it until it's really required.
> 
> btw even if we add this new cmd now, getting the Qemu support to
> use the aggregated list when creating the guest memory map is not
> a simple task given currently vfio only passively acts on change
> notifications in the guest memory layout. It requires a new mechanism
> to enforce strict order (probe all vfio devices before creating the memory
> layout) and then injects vfio reserved regions into the layout.
> 
> Preferably let's not making it a hard dependency for this series.

Should we drop this and its selftest patch from this series?

Thanks
Nic

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: [PATCH v3 07/17] iommufd: Add IOMMU_RESV_IOVA_RANGES
  2023-08-03  1:23               ` Nicolin Chen
@ 2023-08-03  1:25                 ` Tian, Kevin
  2023-08-03  2:17                   ` Nicolin Chen
  0 siblings, 1 reply; 66+ messages in thread
From: Tian, Kevin @ 2023-08-03  1:25 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Jason Gunthorpe, Liu, Yi L, joro, alex.williamson, robin.murphy,
	baolu.lu, cohuck, eric.auger, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Thursday, August 3, 2023 9:23 AM
> 
> On Wed, Aug 02, 2023 at 01:09:28AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, August 2, 2023 2:23 AM
> > >
> > > On Tue, Aug 01, 2023 at 02:40:44AM +0000, Tian, Kevin wrote:
> > >
> > > > > So, I guess we should leave it like this?
> > > > >
> > > >
> > > > Yes. Along with this discussion (including what you explained for
> sw_msi)
> > > > let's abandon this new cmd and leave it as today.
> > >
> > > You sure? This makes it basically impossible to write a "correct" vmm
> > > that is aware of what the physical memory map must be early on
> > >
> >
> > emmm... I thought it's what you meant by "leave it like this" and the
> > fact that existing VMM's memory layout happens to match the reserved
> > regions. Nobody complains lacking of such a interface for years then
> > we may postpone supporting it until it's really required.
> >
> > btw even if we add this new cmd now, getting the Qemu support to
> > use the aggregated list when creating the guest memory map is not
> > a simple task given currently vfio only passively acts on change
> > notifications in the guest memory layout. It requires a new mechanism
> > to enforce strict order (probe all vfio devices before creating the memory
> > layout) and then injects vfio reserved regions into the layout.
> >
> > Preferably let's not making it a hard dependency for this series.
> 
> Should we drop this and its selftest patch from this series?
> 

Yes. let's drop it.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 09/17] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-07-28 18:02   ` Jason Gunthorpe
  2023-07-31 10:07     ` Liu, Yi L
@ 2023-08-03  2:07     ` Nicolin Chen
  1 sibling, 0 replies; 66+ messages in thread
From: Nicolin Chen @ 2023-08-03  2:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yi Liu, joro, alex.williamson, kevin.tian, robin.murphy,
	baolu.lu, cohuck, eric.auger, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	zhenzhong.duan

On Fri, Jul 28, 2023 at 03:02:43PM -0300, Jason Gunthorpe wrote:
> > +int iommufd_hwpt_invalidate(struct iommufd_ucmd *ucmd)
> > +{
> > +	struct iommu_hwpt_invalidate *cmd = ucmd->cmd;
> > +	struct iommufd_hw_pagetable *hwpt;
> > +	u32 user_data_len, klen;
> > +	u64 user_ptr;
> > +	int rc = 0;
> > +
> > +	if (!cmd->data_len || cmd->__reserved)
> > +		return -EOPNOTSUPP;
> > +
> > +	hwpt = iommufd_get_hwpt(ucmd, cmd->hwpt_id);
> > +	if (IS_ERR(hwpt))
> > +		return PTR_ERR(hwpt);
> > +
> > +	/* Do not allow any kernel-managed hw_pagetable */
> > +	if (!hwpt->parent) {
> 
> I don't think this is needed because:
> 
> > +		rc = -EINVAL;
> > +		goto out_put_hwpt;
> > +	}
> > +
> > +	klen = hwpt->domain->ops->cache_invalidate_user_data_len;
> > +	if (!hwpt->domain->ops->cache_invalidate_user || !klen) {
> > +		rc = -EOPNOTSUPP;
> 
> We need to get to a place where the drivers are providing proper ops
> for the domains, so this op should never exist for a paging domain.
> 
> And return EINVAL here instead.

Fixed those two above and added the following in alloc():

-------------------------------------------------------------------------------
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 08aa4debc58a..7ddeda22ac62 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -123,6 +123,12 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx,
                rc = -EINVAL;
                goto out_abort;
        }
+       /* A driver is buggy if it mixes user-managed ops into kernel-managed ops */
+       if (WARN_ON_ONCE(hwpt->domain->ops->cache_invalidate_user ||
+                        hwpt->domain->ops->cache_invalidate_user_data_len)) {
+               rc = -EINVAL;
+               goto out_abort;
+       }
 
        /*
         * Set the coherency mode before we do iopt_table_add_domain() as some
-------------------------------------------------------------------------------

Thanks
Nic

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 09/17] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-07-31 13:19       ` Jason Gunthorpe
@ 2023-08-03  2:16         ` Nicolin Chen
  2023-08-03  3:07           ` Liu, Yi L
  2023-08-03  2:56         ` Liu, Yi L
  1 sibling, 1 reply; 66+ messages in thread
From: Nicolin Chen @ 2023-08-03  2:16 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, joro, alex.williamson, Tian, Kevin, robin.murphy,
	baolu.lu, cohuck, eric.auger, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

On Mon, Jul 31, 2023 at 10:19:09AM -0300, Jason Gunthorpe wrote:
> On Mon, Jul 31, 2023 at 10:07:32AM +0000, Liu, Yi L wrote:
> > > > +		goto out_put_hwpt;
> > > > +	}
> > > > +
> > > > +	/*
> > > > +	 * Copy the needed fields before reusing the ucmd buffer, this
> > > > +	 * avoids memory allocation in this path.
> > > > +	 */
> > > > +	user_ptr = cmd->data_uptr;
> > > > +	user_data_len = cmd->data_len;
> > > 
> > > Uhh, who checks that klen < the temporary stack struct?
> > 
> > Take vtd as an example. The invalidate structure is struct iommu_hwpt_vtd_s1_invalidate[1].
> > The klen is sizeof(struct iommu_hwpt_vtd_s1_invalidate)[2]. iommu_hwpt_vtd_s1_invalidate
> > is also placed in the temporary stack struct (actually it is a union)[1]. So the klen should
> > be <= temporary stack.
> 
> Ohh, I think I would add a few comments noting that the invalidate
> structs need to be added to that union. Easy to miss.

Added here:

-        * Copy the needed fields before reusing the ucmd buffer, this
-        * avoids memory allocation in this path.
+        * Copy the needed fields before reusing the ucmd buffer, this avoids
+        * memory allocation in this path. Again, user invalidate data struct
+        * must be added to the union ucmd_buffer.

> > It's not so explicit though. Perhaps worth to have a check like below in this patch?
> > 
> > if (unlikely(klen > sizeof(union ucmd_buffer)))
> > 	return -EINVAL;
> 
> Yes, stick this in the domain allocate path with a WARN_ON. The driver
> is broken to allocate a domain with an invalid size.

And here too with a WARN_ON_ONCE.

+       /*
+        * Either the driver is broken by having an invalid size, or the user
+        * invalidate data struct used by the driver is missing in the union.
+        */
+       if (WARN_ON_ONCE(hwpt->domain->ops->cache_invalidate_user &&
+                        (!hwpt->domain->ops->cache_invalidate_user_data_len ||
+                         hwpt->domain->ops->cache_invalidate_user_data_len >
+                         sizeof(union ucmd_buffer)))) {
+               rc = -EINVAL;
+               goto out_abort;
+       }

Though I am making this cache_invalidate_user optional here, I
wonder if there actually could be a case where a user-managed
domain doesn't need a cache_invalidate_user op...

Thanks
Nicolin
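The intent of the allocation-time sanity check drafted above can be modeled outside the kernel; everything below (struct layout, names, return codes) is an illustrative sketch, not the actual iommufd code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mock of one driver's user invalidate data struct. */
struct vtd_s1_invalidate {
	uint64_t addr;
	uint64_t npages;
	uint32_t flags;
	uint32_t __reserved;
};

union ucmd_buffer {
	struct vtd_s1_invalidate vtd;
	/* every driver's user invalidate data struct must be listed here */
};

struct domain_ops {
	int (*cache_invalidate_user)(void *domain, void *data);
	size_t cache_invalidate_user_data_len;
};

static int dummy_invalidate(void *domain, void *data)
{
	(void)domain; (void)data;
	return 0;
}

/*
 * A driver that provides the op must declare a nonzero data_len that
 * fits in the union: a zero length makes the op unusable, and an
 * oversized one means the struct was never added to ucmd_buffer.
 */
static int validate_invalidate_ops(const struct domain_ops *ops)
{
	if (ops->cache_invalidate_user &&
	    (!ops->cache_invalidate_user_data_len ||
	     ops->cache_invalidate_user_data_len > sizeof(union ucmd_buffer)))
		return -1;	/* the kernel draft WARNs and fails the alloc */
	return 0;
}
```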

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 07/17] iommufd: Add IOMMU_RESV_IOVA_RANGES
  2023-08-03  1:25                 ` Tian, Kevin
@ 2023-08-03  2:17                   ` Nicolin Chen
  0 siblings, 0 replies; 66+ messages in thread
From: Nicolin Chen @ 2023-08-03  2:17 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, Liu, Yi L, joro, alex.williamson, robin.murphy,
	baolu.lu, cohuck, eric.auger, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

On Thu, Aug 03, 2023 at 01:25:37AM +0000, Tian, Kevin wrote:
> > > Preferably let's not making it a hard dependency for this series.
> >
> > Should we drop this and its selftest patch from this series?
> >
> 
> Yes. let's drop it.

Ack. Thanks!

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 01/17] iommu: Add new iommu op to create domains owned by userspace
  2023-07-28 16:56     ` Jason Gunthorpe
  2023-07-31 12:44       ` Liu, Yi L
@ 2023-08-03  2:28       ` Nicolin Chen
  1 sibling, 0 replies; 66+ messages in thread
From: Nicolin Chen @ 2023-08-03  2:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Liu, Yi L, joro, alex.williamson, robin.murphy,
	baolu.lu, cohuck, eric.auger, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

On Fri, Jul 28, 2023 at 01:56:05PM -0300, Jason Gunthorpe wrote:
> On Fri, Jul 28, 2023 at 09:37:21AM +0000, Tian, Kevin wrote:
> > > From: Yi Liu <yi.l.liu@intel.com>
> > > Sent: Monday, July 24, 2023 7:04 PM
> > >
> > > + * @domain_alloc_user: allocate a user iommu domain corresponding to
> > > the input
> > > + *                     @hwpt_type that is defined as enum iommu_hwpt_type in the
> > > + *                     include/uapi/linux/iommufd.h. A returning domain will be
> > > + *                     set to an IOMMU_DOMAIN_NESTED type, upon valid
> > > @user_data
> > > + *                     and @parent that is a kernel-managed domain. Otherwise,
> > > + *                     it will be set to an IOMMU_DOMAIN_UNMANAGED type.
> > > Return
> > > + *                     ERR_PTR on allocation failure.
> > 
> > "If @user_data is valid and @parent points to a kernel-managed domain,
> > the returning domain is set to IOMMU_DOMAIN_NESTED type. Otherwise
> > it is set to IOMMU_DOMAIN_UNMANAGED type."
> 
>  "If @user_data is valid and @parent points to a kernel-managed domain,
>  then the returned domain must be the IOMMU_DOMAIN_NESTED type. Otherwise
>  the returned domain is IOMMU_DOMAIN_UNMANAGED."
> 
> Notice the detail that this API expects the driver to set the type and
> fully initialize the domain, including the generic iommu_domain
> struct, which is different than alloc_domain.
> 
> When we implement this in drivers we should tidy this so all the alloc
> flows fully initialize the domain internally.

Changed to:

+ * @domain_alloc_user: allocate a user iommu domain corresponding to the input
+ *                     @hwpt_type that is defined as enum iommu_hwpt_type in
+ *                     include/uapi/linux/iommufd.h. Different from domain_alloc,
+ *                     it requires the iommu driver to fully initialize a new
+ *                     domain, including the generic iommu_domain struct. Upon
+ *                     success, if the @user_data is valid and the @parent points
+ *                     to a kernel-managed domain, the type of the new domain
+ *                     must be IOMMU_DOMAIN_NESTED; otherwise it must be
+ *                     IOMMU_DOMAIN_UNMANAGED. Upon failure, ERR_PTR must be
+ *                     returned.

Thanks
Nic

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 02/17] iommu: Add nested domain support
  2023-07-28 16:59   ` Jason Gunthorpe
@ 2023-08-03  2:36     ` Nicolin Chen
  2023-08-03  2:53       ` Liu, Yi L
  0 siblings, 1 reply; 66+ messages in thread
From: Nicolin Chen @ 2023-08-03  2:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yi Liu, joro, alex.williamson, kevin.tian, robin.murphy,
	baolu.lu, cohuck, eric.auger, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	zhenzhong.duan

On Fri, Jul 28, 2023 at 01:59:28PM -0300, Jason Gunthorpe wrote:
> On Mon, Jul 24, 2023 at 04:03:51AM -0700, Yi Liu wrote:
> 
> > @@ -350,6 +354,10 @@ struct iommu_ops {
> >   * @iotlb_sync_map: Sync mappings created recently using @map to the hardware
> >   * @iotlb_sync: Flush all queued ranges from the hardware TLBs and empty flush
> >   *            queue
> > + * @cache_invalidate_user: Flush hardware TLBs caching user space IO mappings
> > + * @cache_invalidate_user_data_len: Defined length of input user data for the
> > + *                                  cache_invalidate_user op, being sizeof the
> > + *                                  structure in include/uapi/linux/iommufd.h
> >   * @iova_to_phys: translate iova to physical address
> >   * @enforce_cache_coherency: Prevent any kind of DMA from bypassing IOMMU_CACHE,
> >   *                           including no-snoop TLPs on PCIe or other platform
> > @@ -379,6 +387,9 @@ struct iommu_domain_ops {
> >  			       size_t size);
> >  	void (*iotlb_sync)(struct iommu_domain *domain,
> >  			   struct iommu_iotlb_gather *iotlb_gather);
> > +	int (*cache_invalidate_user)(struct iommu_domain *domain,
> > +				     void *user_data);
> 
> If we are doing const unions, then this void * should also be a const
> union.

Unlike iommu_domain_user_data, which is a union on its own, all the
invalidate user data structures are added to union ucmd_buffer. It feels
a bit weird to cross-reference "union ucmd_buffer" and to pass the name
"ucmd_buffer" in this cache_invalidate_user.

Any suggestion?

Thanks
Nic


* RE: [PATCH v3 02/17] iommu: Add nested domain support
  2023-08-03  2:36     ` Nicolin Chen
@ 2023-08-03  2:53       ` Liu, Yi L
  2023-08-03  3:04         ` Nicolin Chen
  0 siblings, 1 reply; 66+ messages in thread
From: Liu, Yi L @ 2023-08-03  2:53 UTC (permalink / raw)
  To: Nicolin Chen, Jason Gunthorpe
  Cc: joro, alex.williamson, Tian, Kevin, robin.murphy, baolu.lu,
	cohuck, eric.auger, kvm, mjrosato, chao.p.peng, yi.y.sun, peterx,
	jasowang, shameerali.kolothum.thodi, lulu, suravee.suthikulpanit,
	iommu, linux-kernel, linux-kselftest, Duan, Zhenzhong

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Thursday, August 3, 2023 10:37 AM
> 
> On Fri, Jul 28, 2023 at 01:59:28PM -0300, Jason Gunthorpe wrote:
> > On Mon, Jul 24, 2023 at 04:03:51AM -0700, Yi Liu wrote:
> >
> > > @@ -350,6 +354,10 @@ struct iommu_ops {
> > >   * @iotlb_sync_map: Sync mappings created recently using @map to the hardware
> > >   * @iotlb_sync: Flush all queued ranges from the hardware TLBs and empty flush
> > >   *            queue
> > > + * @cache_invalidate_user: Flush hardware TLBs caching user space IO mappings
> > > + * @cache_invalidate_user_data_len: Defined length of input user data for the
> > > + *                                  cache_invalidate_user op, being sizeof the
> > > + *                                  structure in include/uapi/linux/iommufd.h
> > >   * @iova_to_phys: translate iova to physical address
> > >   * @enforce_cache_coherency: Prevent any kind of DMA from bypassing
> IOMMU_CACHE,
> > >   *                           including no-snoop TLPs on PCIe or other platform
> > > @@ -379,6 +387,9 @@ struct iommu_domain_ops {
> > >  			       size_t size);
> > >  	void (*iotlb_sync)(struct iommu_domain *domain,
> > >  			   struct iommu_iotlb_gather *iotlb_gather);
> > > +	int (*cache_invalidate_user)(struct iommu_domain *domain,
> > > +				     void *user_data);
> >
> > If we are doing const unions, then this void * should also be a const
> > union.
> 
> Unlike iommu_domain_user_data is a union on its own, all invalidate
> user data structures are added to union ucmd_buffer. It feels a bit
> weird to cross reference "union ucmd_buffer" and to pass the naming
> "ucmd_buffer" in this cache_invalidate_user.
> 
> Any suggestion?

I think we can have a union like iommu_user_cache_invalidate; every new
data structure should be put in this union, and this union is put in the
ucmd_buffer.

Regards,
Yi Liu


* RE: [PATCH v3 09/17] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-07-31 13:19       ` Jason Gunthorpe
  2023-08-03  2:16         ` Nicolin Chen
@ 2023-08-03  2:56         ` Liu, Yi L
  1 sibling, 0 replies; 66+ messages in thread
From: Liu, Yi L @ 2023-08-03  2:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: joro, alex.williamson, Tian, Kevin, robin.murphy, baolu.lu,
	cohuck, eric.auger, nicolinc, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, July 31, 2023 9:19 PM
> 
> On Mon, Jul 31, 2023 at 10:07:32AM +0000, Liu, Yi L wrote:
> > > > +		goto out_put_hwpt;
> > > > +	}
> > > > +
> > > > +	/*
> > > > +	 * Copy the needed fields before reusing the ucmd buffer, this
> > > > +	 * avoids memory allocation in this path.
> > > > +	 */
> > > > +	user_ptr = cmd->data_uptr;
> > > > +	user_data_len = cmd->data_len;
> > >
> > > Uhh, who checks that klen < the temporary stack struct?
> >
> > Take vtd as an example. The invalidate structure is struct
> iommu_hwpt_vtd_s1_invalidate[1].
> > The klen is sizeof(struct iommu_hwpt_vtd_s1_invalidate)[2].
> iommu_hwpt_vtd_s1_invalidate
> > is also placed in the temporary stack struct (actually it is a union)[1]. So the klen should
> > be <= temporary stack.
> 
> Ohh, I think I would add a few comments noting that the invalidate
> structs need to be added to that union. Easy to miss.

Sure. Actually, there is some description, quoted below, in patch [1]. Would
add some here as well.

"Cache invalidation path is performance path, so it's better to avoid
memory allocation in such path. To achieve it, this path reuses the
ucmd_buffer to copy user data. So the new data structures are added in
the ucmd_buffer union to avoid overflow."

[1] https://lore.kernel.org/linux-iommu/20230724111335.107427-8-yi.l.liu@intel.com/

> > It's not so explicit though. Perhaps worth to have a check like below in this patch?
> >
> > if (unlikely(klen > sizeof(union ucmd_buffer)))
> > 	return -EINVAL;
> 
> Yes, stick this in the domain allocate path with a WARN_ON. The driver
> is broken to allocate a domain with an invalid size.

OK. So if there is any mistake in the data structure, the allocation fails in the first place.

Regards,
Yi Liu


* Re: [PATCH v3 02/17] iommu: Add nested domain support
  2023-08-03  2:53       ` Liu, Yi L
@ 2023-08-03  3:04         ` Nicolin Chen
  0 siblings, 0 replies; 66+ messages in thread
From: Nicolin Chen @ 2023-08-03  3:04 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Jason Gunthorpe, joro, alex.williamson, Tian, Kevin,
	robin.murphy, baolu.lu, cohuck, eric.auger, kvm, mjrosato,
	chao.p.peng, yi.y.sun, peterx, jasowang,
	shameerali.kolothum.thodi, lulu, suravee.suthikulpanit, iommu,
	linux-kernel, linux-kselftest, Duan, Zhenzhong

On Thu, Aug 03, 2023 at 02:53:34AM +0000, Liu, Yi L wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Thursday, August 3, 2023 10:37 AM
> >
> > On Fri, Jul 28, 2023 at 01:59:28PM -0300, Jason Gunthorpe wrote:
> > > On Mon, Jul 24, 2023 at 04:03:51AM -0700, Yi Liu wrote:
> > >
> > > > @@ -350,6 +354,10 @@ struct iommu_ops {
> > > >   * @iotlb_sync_map: Sync mappings created recently using @map to the hardware
> > > >   * @iotlb_sync: Flush all queued ranges from the hardware TLBs and empty flush
> > > >   *            queue
> > > > + * @cache_invalidate_user: Flush hardware TLBs caching user space IO mappings
> > > > + * @cache_invalidate_user_data_len: Defined length of input user data for the
> > > > + *                                  cache_invalidate_user op, being sizeof the
> > > > + *                                  structure in include/uapi/linux/iommufd.h
> > > >   * @iova_to_phys: translate iova to physical address
> > > >   * @enforce_cache_coherency: Prevent any kind of DMA from bypassing
> > IOMMU_CACHE,
> > > >   *                           including no-snoop TLPs on PCIe or other platform
> > > > @@ -379,6 +387,9 @@ struct iommu_domain_ops {
> > > >                          size_t size);
> > > >   void (*iotlb_sync)(struct iommu_domain *domain,
> > > >                      struct iommu_iotlb_gather *iotlb_gather);
> > > > + int (*cache_invalidate_user)(struct iommu_domain *domain,
> > > > +                              void *user_data);
> > >
> > > If we are doing const unions, then this void * should also be a const
> > > union.
> >
> > Unlike iommu_domain_user_data is a union on its own, all invalidate
> > user data structures are added to union ucmd_buffer. It feels a bit
> > weird to cross reference "union ucmd_buffer" and to pass the naming
> > "ucmd_buffer" in this cache_invalidate_user.
> >
> > Any suggestion?
> 
> I think we can have a union like iommu_user_cache_invalidate, every new
> data structures should be put in this union, and this union is put in the
> ucmd_buffer.

Ah, that should do the job.

Thanks!


* RE: [PATCH v3 09/17] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-08-03  2:16         ` Nicolin Chen
@ 2023-08-03  3:07           ` Liu, Yi L
  2023-08-03  3:13             ` Nicolin Chen
  0 siblings, 1 reply; 66+ messages in thread
From: Liu, Yi L @ 2023-08-03  3:07 UTC (permalink / raw)
  To: Nicolin Chen, Jason Gunthorpe
  Cc: joro, alex.williamson, Tian, Kevin, robin.murphy, baolu.lu,
	cohuck, eric.auger, kvm, mjrosato, chao.p.peng, yi.y.sun, peterx,
	jasowang, shameerali.kolothum.thodi, lulu, suravee.suthikulpanit,
	iommu, linux-kernel, linux-kselftest, Duan, Zhenzhong

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Thursday, August 3, 2023 10:17 AM
> 
> On Mon, Jul 31, 2023 at 10:19:09AM -0300, Jason Gunthorpe wrote:
> > On Mon, Jul 31, 2023 at 10:07:32AM +0000, Liu, Yi L wrote:
> > > > > +		goto out_put_hwpt;
> > > > > +	}
> > > > > +
> > > > > +	/*
> > > > > +	 * Copy the needed fields before reusing the ucmd buffer, this
> > > > > +	 * avoids memory allocation in this path.
> > > > > +	 */
> > > > > +	user_ptr = cmd->data_uptr;
> > > > > +	user_data_len = cmd->data_len;
> > > >
> > > > Uhh, who checks that klen < the temporary stack struct?
> > >
> > > Take vtd as an example. The invalidate structure is struct
> iommu_hwpt_vtd_s1_invalidate[1].
> > > The klen is sizeof(struct iommu_hwpt_vtd_s1_invalidate)[2].
> iommu_hwpt_vtd_s1_invalidate
> > > is also placed in the temporary stack struct (actually it is a union)[1]. So the klen
> should
> > > be <= temporary stack.
> >
> > Ohh, I think I would add a few comments noting that the invalidate
> > structs need to be added to that union. Easy to miss.
> 
> Added here:
> 
> -        * Copy the needed fields before reusing the ucmd buffer, this
> -        * avoids memory allocation in this path.
> +        * Copy the needed fields before reusing the ucmd buffer, this avoids
> +        * memory allocation in this path. Again, user invalidate data struct
> +        * must be added to the union ucmd_buffer.
> 
> > > It's not so explicit though. Perhaps worth to have a check like below in this patch?
> > >
> > > if (unlikely(klen > sizeof(union ucmd_buffer)))
> > > 	return -EINVAL;
> >
> > Yes, stick this in the domain allocate path with a WARN_ON. The driver
> > is broken to allocate a domain with an invalid size.
> 
> And here too with a WARN_ON_ONCE.
> 
> +       /*
> +        * Either the driver is broken by having an invalid size, or the user
> +        * invalidate data struct used by the driver is missing in the union.
> +        */
> +       if (WARN_ON_ONCE(hwpt->domain->ops->cache_invalidate_user &&
> +                        (!hwpt->domain->ops->cache_invalidate_user_data_len ||
> +                         hwpt->domain->ops->cache_invalidate_user_data_len >
> +                         sizeof(union ucmd_buffer)))) {
> +               rc = -EINVAL;
> +               goto out_abort;
> +
> +       }
> 
> Though I am making this cache_invalidate_user optional here, I
> wonder if there actually could be a case that a user-managed
> domain doesn't need a cache_invalidate_user op...

If the user-managed domain is the stage-1 domain in nesting, then that seems
not possible, as cache invalidation is a must. But I think this logic is fine,
as not all the domains allocated by the user are user-managed. They may be
kernel-managed, like the s2 domains.

Regards,
Yi Liu


* Re: [PATCH v3 09/17] iommufd: Add IOMMU_HWPT_INVALIDATE
  2023-08-03  3:07           ` Liu, Yi L
@ 2023-08-03  3:13             ` Nicolin Chen
  0 siblings, 0 replies; 66+ messages in thread
From: Nicolin Chen @ 2023-08-03  3:13 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Jason Gunthorpe, joro, alex.williamson, Tian, Kevin,
	robin.murphy, baolu.lu, cohuck, eric.auger, kvm, mjrosato,
	chao.p.peng, yi.y.sun, peterx, jasowang,
	shameerali.kolothum.thodi, lulu, suravee.suthikulpanit, iommu,
	linux-kernel, linux-kselftest, Duan, Zhenzhong

On Thu, Aug 03, 2023 at 03:07:23AM +0000, Liu, Yi L wrote:
> > > > It's not so explicit though. Perhaps worth to have a check like below in this patch?
> > > >
> > > > if (unlikely(klen > sizeof(union ucmd_buffer)))
> > > >   return -EINVAL;
> > >
> > > Yes, stick this in the domain allocate path with a WARN_ON. The driver
> > > is broken to allocate a domain with an invalid size.
> >
> > And here too with a WARN_ON_ONCE.
> >
> > +       /*
> > +        * Either the driver is broken by having an invalid size, or the user
> > +        * invalidate data struct used by the driver is missing in the union.
> > +        */
> > +       if (WARN_ON_ONCE(hwpt->domain->ops->cache_invalidate_user &&
> > +                        (!hwpt->domain->ops->cache_invalidate_user_data_len ||
> > +                         hwpt->domain->ops->cache_invalidate_user_data_len >
> > +                         sizeof(union ucmd_buffer)))) {
> > +               rc = -EINVAL;
> > +               goto out_abort;
> > +
> > +       }
> >
> > Though I am making this cache_invalidate_user optional here, I
> > wonder if there actually could be a case that a user-managed
> > domain doesn't need a cache_invalidate_user op...
> 
> If user-managed domain is the stage-1 domain in nested, then seems not
> possible as cache invalidate is a must. But I think this logic is fine as not
> all the domains allocated by the user is user-managed. It may be kernel
> managed like the s2 domains.

Oh, I forgot to mention that this piece is in the user-managed
HWPT allocator, following Jason's suggestion to separate them.

So, perhaps should just mark it mandatory:

+	if (WARN_ON_ONCE(!hwpt->domain->ops->cache_invalidate_user ||
+			 !hwpt->domain->ops->cache_invalidate_user_data_len ||
+			 hwpt->domain->ops->cache_invalidate_user_data_len >
+			 sizeof(union iommu_user_cache_invalidate))) {
+		rc = -EINVAL;
+		goto out_abort;
+	}


* Re: [PATCH v3 08/17] iommufd: IOMMU_HWPT_ALLOC allocation with user data
  2023-08-03  0:53             ` Nicolin Chen
@ 2023-08-03 16:47               ` Jason Gunthorpe
  0 siblings, 0 replies; 66+ messages in thread
From: Jason Gunthorpe @ 2023-08-03 16:47 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Tian, Kevin, Liu, Yi L, joro, alex.williamson, robin.murphy,
	baolu.lu, cohuck, eric.auger, kvm, mjrosato, chao.p.peng,
	yi.y.sun, peterx, jasowang, shameerali.kolothum.thodi, lulu,
	suravee.suthikulpanit, iommu, linux-kernel, linux-kselftest,
	Duan, Zhenzhong

On Wed, Aug 02, 2023 at 05:53:40PM -0700, Nicolin Chen wrote:
> On Wed, Aug 02, 2023 at 08:43:12PM -0300, Jason Gunthorpe wrote:
> > On Wed, Aug 02, 2023 at 04:42:10PM -0700, Nicolin Chen wrote:
> > > On Mon, Jul 31, 2023 at 10:16:17AM -0300, Jason Gunthorpe wrote:
> > >  
> > > > > Ideally expanding uAPI structure size should come with new flag bits.
> > > > 
> > > > Flags or some kind of 'zero is the same behavior as a smaller struct'
> > > > scheme.
> > > > 
> > > > This patch is doing the zero option:
> > > > 
> > > >  	__u32 __reserved;
> > > > +	__u32 hwpt_type;
> > > > +	__u32 data_len;
> > > > +	__aligned_u64 data_uptr;
> > > >  };
> > > > 
> > > > hwpt_type == 0 means default type
> > > > data_len == 0 means no data
> > > > data_uptr is ignored (zero is safe)
> > > > 
> > > > So there is no need to change it
> > > 
> > > TEST_LENGTH passing ".size = sizeof(struct _struct) - 1" expects a
> > > -EINVAL error code from "if (ucmd.user_size < op->min_size)" check
> > > in the iommufd_fops_ioctl(). This has been working when min_size is
> > > exactly the size of the structure.
> > > 
> > > When the size of the structure becomes larger than min_size, i.e.
> > > the passing size above is larger than min_size, it bypasses that
> > > min_size sanity and goes down to an ioctl handler with a potential
> > > risk. And actually, the size range can be [min_size, struct_size),
> > > making it harder for us to sanitize with the existing code.
> > > 
> > > I wonder what's the generic way of sanitizing this case? And, it
> > > seems that TEST_LENGTH needs some rework to test min_size only?
> > 
> > Yes, it should technically test using offsetof and a matching set of
> > struct members.
> 
> OK. I copied 3 lines for offsetofend from the kernel and did this:
> --------------------------------------------------------------------------
> diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
> index 6b075a68b928..a15a475c1243 100644
> --- a/tools/testing/selftests/iommu/iommufd.c
> +++ b/tools/testing/selftests/iommu/iommufd.c
> @@ -86,12 +86,13 @@ TEST_F(iommufd, cmd_fail)
> 
>  TEST_F(iommufd, cmd_length)
>  {
> -#define TEST_LENGTH(_struct, _ioctl)                                     \
> +#define TEST_LENGTH(_struct, _ioctl, _last)                              \
>         {                                                                \
> +               size_t min_size = offsetofend(struct _struct, _last);    \
>                 struct {                                                 \
>                         struct _struct cmd;                              \
>                         uint8_t extra;                                   \
> -               } cmd = { .cmd = { .size = sizeof(struct _struct) - 1 }, \
> +               } cmd = { .cmd = { .size = min_size - 1 },               \
>                           .extra = UINT8_MAX };                          \
>                 int old_errno;                                           \
>                 int rc;                                                  \
> --------------------------------------------------------------------------
> 
> Any misaligned size within the range of [min_size, struct_size) still
> doesn't have coverage though. Is this something that we have to let
> fail, with a potential risk?

It looks about right. I didn't try to test all the permutations; it
could be done, but I'm not sure it has value.

Jason


end of thread, other threads:[~2023-08-03 16:48 UTC | newest]

Thread overview: 66+ messages
2023-07-24 11:03 [PATCH v3 00/17] iommufd: Add nesting infrastructure Yi Liu
2023-07-24 11:03 ` [PATCH v3 01/17] iommu: Add new iommu op to create domains owned by userspace Yi Liu
2023-07-28  9:37   ` Tian, Kevin
2023-07-28 16:56     ` Jason Gunthorpe
2023-07-31 12:44       ` Liu, Yi L
2023-07-31 13:19         ` Jason Gunthorpe
2023-08-03  2:28       ` Nicolin Chen
2023-07-24 11:03 ` [PATCH v3 02/17] iommu: Add nested domain support Yi Liu
2023-07-28  9:38   ` Tian, Kevin
2023-07-28 16:59   ` Jason Gunthorpe
2023-08-03  2:36     ` Nicolin Chen
2023-08-03  2:53       ` Liu, Yi L
2023-08-03  3:04         ` Nicolin Chen
2023-07-24 11:03 ` [PATCH v3 03/17] iommufd/hw_pagetable: Use domain_alloc_user op for domain allocation Yi Liu
2023-07-28  9:39   ` Tian, Kevin
2023-07-24 11:03 ` [PATCH v3 04/17] iommufd: Pass in hwpt_type/parent/user_data to iommufd_hw_pagetable_alloc() Yi Liu
2023-07-28  9:49   ` Tian, Kevin
2023-07-24 11:03 ` [PATCH v3 05/17] iommufd/hw_pagetable: Do not populate user-managed hw_pagetables Yi Liu
2023-07-28  9:51   ` Tian, Kevin
2023-07-24 11:03 ` [PATCH v3 06/17] iommufd: Only enforce IOMMU_RESV_SW_MSI when attaching user-managed HWPT Yi Liu
2023-07-28 10:02   ` Tian, Kevin
2023-07-28 17:06     ` Jason Gunthorpe
2023-07-24 11:03 ` [PATCH v3 07/17] iommufd: Add IOMMU_RESV_IOVA_RANGES Yi Liu
2023-07-28 10:07   ` Tian, Kevin
2023-07-28 17:16     ` Jason Gunthorpe
2023-07-31  6:14       ` Tian, Kevin
2023-07-31 13:12         ` Jason Gunthorpe
2023-07-28 17:44   ` Jason Gunthorpe
2023-07-31  6:21     ` Tian, Kevin
2023-07-31  9:53       ` Liu, Yi L
2023-07-31 13:23       ` Jason Gunthorpe
2023-08-01  2:40         ` Tian, Kevin
2023-08-01 18:22           ` Jason Gunthorpe
2023-08-02  1:09             ` Tian, Kevin
2023-08-02 12:22               ` Jason Gunthorpe
2023-08-03  1:23               ` Nicolin Chen
2023-08-03  1:25                 ` Tian, Kevin
2023-08-03  2:17                   ` Nicolin Chen
2023-07-24 11:03 ` [PATCH v3 08/17] iommufd: IOMMU_HWPT_ALLOC allocation with user data Yi Liu
2023-07-28 17:55   ` Jason Gunthorpe
2023-07-28 19:10     ` Nicolin Chen
2023-07-31  7:22       ` Liu, Yi L
2023-07-31  6:31     ` Tian, Kevin
2023-07-31 13:16       ` Jason Gunthorpe
2023-08-01  2:35         ` Tian, Kevin
2023-08-02 23:42         ` Nicolin Chen
2023-08-02 23:43           ` Jason Gunthorpe
2023-08-03  0:53             ` Nicolin Chen
2023-08-03 16:47               ` Jason Gunthorpe
2023-07-24 11:03 ` [PATCH v3 09/17] iommufd: Add IOMMU_HWPT_INVALIDATE Yi Liu
2023-07-28 18:02   ` Jason Gunthorpe
2023-07-31 10:07     ` Liu, Yi L
2023-07-31 13:19       ` Jason Gunthorpe
2023-08-03  2:16         ` Nicolin Chen
2023-08-03  3:07           ` Liu, Yi L
2023-08-03  3:13             ` Nicolin Chen
2023-08-03  2:56         ` Liu, Yi L
2023-08-03  2:07     ` Nicolin Chen
2023-07-24 11:03 ` [PATCH v3 10/17] iommufd/selftest: Add a helper to get test device Yi Liu
2023-07-24 11:04 ` [PATCH v3 11/17] iommufd/selftest: Add IOMMU_TEST_OP_DEV_[ADD|DEL]_RESERVED to add/del reserved regions to selftest device Yi Liu
2023-07-24 11:04 ` [PATCH v3 12/17] iommufd/selftest: Add .get_resv_regions() for mock_dev Yi Liu
2023-07-24 11:04 ` [PATCH v3 13/17] iommufd/selftest: Add coverage for IOMMU_RESV_IOVA_RANGES Yi Liu
2023-07-24 11:04 ` [PATCH v3 14/17] iommufd/selftest: Add domain_alloc_user() support in iommu mock Yi Liu
2023-07-24 11:04 ` [PATCH v3 15/17] iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with user data Yi Liu
2023-07-24 11:04 ` [PATCH v3 16/17] iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op Yi Liu
2023-07-24 11:04 ` [PATCH v3 17/17] iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl Yi Liu
